Slurm Basics

The job scheduler that manages cluster resources

Modified: 2026-01-28

Slurm (Simple Linux Utility for Resource Management) decides who gets to use which resources and when. Every job you run goes through Slurm.

How Slurm works

  1. You request resources (CPUs, memory, time)
  2. Slurm queues your job
  3. When resources are available, Slurm runs your job
  4. When done (or time runs out), resources are released
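
In practice, that lifecycle maps onto three commands: submit, watch the queue, and inspect afterwards. A sketch (the script name and job ID are placeholders):

```shell
# 1-2. Request resources and enter the queue
sbatch my_job.slurm        # prints: Submitted batch job 12345

# 3. Watch the job wait (PD) and then run (R)
squeue --me

# 4. After the job ends, its record moves to accounting
sacct -j 12345 --format=JobID,State,Elapsed
```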

What a job looks like

Here’s a minimal batch script that runs an R analysis on a compute node:

#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00

module load R/4.5.2
Rscript analysis.R

The #SBATCH lines tell Slurm what resources the job needs – 4 CPU cores, 8 GB of RAM, and up to 2 hours. Slurm finds a compute node with those resources available and runs the script there. You submit it with sbatch my_job.slurm and Slurm takes it from there. See Batch Jobs for the full guide.
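
For scripting, sbatch can print just the job ID, which is handy for chaining jobs. A sketch using the standard --parsable and --dependency options (postprocess.slurm is a hypothetical second script):

```shell
# Submit and capture only the job ID
jobid=$(sbatch --parsable my_job.slurm)

# Submit a follow-up job that starts only if the first finishes successfully
sbatch --dependency=afterok:"$jobid" postprocess.slurm
```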

You can also work interactively on a compute node by requesting resources on the command line:

salloc --cpus-per-task=4 --mem=8G --time=02:00:00

This gives you a shell on a compute node where you can run commands directly. See Interactive Jobs for details.
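
Inside that allocated shell you run commands as usual; a quick sanity check of what you were given might look like this (node name is illustrative):

```shell
salloc --cpus-per-task=4 --mem=8G --time=02:00:00
# ...now in a shell on a compute node...
echo $SLURM_CPUS_PER_TASK   # 4, matching the request
hostname                    # e.g. node03
exit                        # releases the allocation
```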

Partitions

A partition is a group of nodes. Different partitions may have different hardware or policies.

# See available partitions
sinfo
Partition           Nodes            Max time   Description
compute (default)   12 (node01-12)   20 days    General compute jobs
gpu                 1 (gnode01)      20 days    GPU-accelerated work

When submitting jobs, you can specify a partition with --partition=<name> (or -p). If you don’t specify, you get the default compute partition.
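
For example, to target the GPU partition shown above (GPU jobs typically also need a --gres request, covered below):

```shell
sbatch -p gpu --gres=gpu:1 my_job.slurm
```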

Quality of Service (QoS)

QoS defines time limits and priorities. Higher priority jobs start sooner when resources are contested.

QoS          Max time              Limits             Priority   Use case
interactive  1 day                 2 jobs, 192 CPUs   highest    Interactive sessions (auto-applied for salloc)
short        1 hour                                   high       Quick test runs
medium       1 day                                    medium     Standard jobs
long         7 days                                   low        Long-running analyses
extended     20 days               1 job              lowest     Exceptional cases
normal       (partition default)                      baseline   Fallback

Request a specific QoS with --qos=<name>:

sbatch --qos=long --time=5-00:00:00 my_job.slurm

Tip: Interactive QoS is automatic

When you use salloc for interactive sessions, the interactive QoS is applied automatically. You don’t need to specify it.

Requesting resources

Every job needs to specify what it requires:

Option           Short   Example        Meaning
--cpus-per-task  -c      -c 4           4 CPU cores
--mem                    --mem=8G       8 GB RAM
--time           -t      -t 02:00:00    2 hours
--partition      -p      -p gpu         Use GPU partition
--qos                    --qos=short    Use short QoS
--gres                   --gres=gpu:1   1 GPU (for GPU jobs)

Time format

Time can be specified as:

  - MM – minutes
  - HH:MM:SS – hours:minutes:seconds
  - D-HH:MM:SS – days-hours:minutes:seconds

--time=30          # 30 minutes
--time=02:00:00    # 2 hours
--time=1-00:00:00  # 1 day

Essential commands

Check your jobs

# Your running and pending jobs
squeue --me

# More detail
squeue --me --long

Output columns:

  - JOBID – Job identifier (use this to cancel jobs)
  - ST – State: R (running), PD (pending), CG (completing)
  - TIME – How long it’s been running
  - NODELIST – Which node(s) it’s running on
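
If the default columns are not enough, squeue output is configurable via the standard --format specifiers (%i job ID, %j name, %t state, %M elapsed time, %R reason/nodelist):

```shell
# Job ID, name, state, elapsed time, and reason/nodelist
squeue --me --format="%i %j %t %M %R"
```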

Check cluster status

# Node availability
sinfo

# Who's using what
squeue

Cancel a job

# Cancel by job ID
scancel 12345

# Cancel all your jobs
scancel --me

Job history

# Your recent jobs (including finished)
sacct --starttime=today

# More detail on a specific job
sacct -j 12345 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS

Common job states:

  - COMPLETED – Finished successfully
  - FAILED – Exited with error
  - TIMEOUT – Ran out of time
  - OUT_OF_MEMORY – Exceeded memory limit
  - CANCELLED – You (or admin) cancelled it

Resource usage of past jobs

To see how much memory/time a job actually used:

sacct -j 12345 --format=JobID,Elapsed,MaxRSS,MaxVMSize

This helps you tune future resource requests.
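
Many clusters also install seff, a contributed wrapper around sacct that summarizes job efficiency; if it is available on your cluster, it is the quickest way to spot over-requested resources (job ID is a placeholder):

```shell
seff 12345
# Summarizes CPU efficiency and memory utilization vs. what you requested
```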

Common issues

Job stuck in pending (PD)

Check why with:

squeue --me --long

The NODELIST(REASON) column tells you why:

  - Resources – Waiting for resources to free up
  - Priority – Other jobs have higher priority
  - QOSMaxJobsPerUserLimit – You’ve hit your job limit for this QoS
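
For a fuller explanation of a pending job, scontrol prints the scheduler’s complete view of it, including the Reason field (job ID is a placeholder):

```shell
scontrol show job 12345 | grep -E "JobState|Reason"
```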

Job failed with OUT_OF_MEMORY

Request more memory:

salloc --mem=16G ...  # instead of 8G

Job hit TIMEOUT

Either your job needs more time, or there’s an infinite loop. Request more time or investigate your code:

salloc --time=08:00:00 ...  # instead of 2 hours

Next steps