Slurm Basics
The job scheduler that manages cluster resources
Slurm (Simple Linux Utility for Resource Management) decides who gets to use which resources and when. Every job you run goes through Slurm.
How Slurm works
- You request resources (CPUs, memory, time)
- Slurm queues your job
- When resources are available, Slurm runs your job
- When done (or time runs out), resources are released
What a job looks like
Here’s a minimal batch script that runs an R analysis on a compute node:
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
module load R/4.5.2
Rscript analysis.R

The #SBATCH lines tell Slurm what resources the job needs – 4 CPU cores, 8 GB of RAM, and up to 2 hours. Slurm finds a compute node with those resources available and runs the script there. You submit it with sbatch my_job.slurm and Slurm takes it from there. See Batch Jobs for the full guide.
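A detail worth knowing: the #SBATCH directives are ordinary shell comments, so bash ignores them while Slurm parses them. That makes a quick directive count a cheap sanity check before submitting. The sketch below recreates the script above and counts its directives (the filename is just the example's):

```shell
# Recreate the example batch script; #SBATCH lines are shell comments,
# invisible to bash but read by Slurm at submission time
cat > my_job.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
module load R/4.5.2
Rscript analysis.R
EOF

# Count the directives as a pre-submission sanity check
directives=$(grep -c '^#SBATCH' my_job.slurm)
echo "$directives"   # → 4
```

Because the directives are comments, the same script also runs as plain bash on a login node (without any resource guarantees), which can help when debugging.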
You can also work interactively on a compute node by requesting resources on the command line:
salloc --cpus-per-task=4 --mem=8G --time=02:00:00

This gives you a shell on a compute node where you can run commands directly. See Interactive Jobs for details.
Partitions
A partition is a group of nodes. Different partitions may have different hardware or policies.
# See available partitions
sinfo

| Partition | Nodes | Max time | Description |
|---|---|---|---|
| compute (default) | 12 (node01-12) | 20 days | General compute jobs |
| gpu | 1 (gnode01) | 20 days | GPU-accelerated work |
When submitting jobs, you can specify a partition with --partition=<name> (or -p). If you don’t specify, you get the default compute partition.
Quality of Service (QoS)
QoS defines time limits and priorities. Higher priority jobs start sooner when resources are contested.
| QoS | Max time | Limits | Priority | Use case |
|---|---|---|---|---|
| interactive | 1 day | 2 jobs, 192 CPUs | highest | Interactive sessions (auto-applied for salloc) |
| short | 1 hour | – | high | Quick test runs |
| medium | 1 day | – | medium | Standard jobs |
| long | 7 days | – | low | Long-running analyses |
| extended | 20 days | 1 job | lowest | Exceptional cases |
| normal | (partition default) | – | baseline | Fallback |
Request a specific QoS with --qos=<name>:
sbatch --qos=long --time=5-00:00:00 my_job.slurm

When you use salloc for interactive sessions, the interactive QoS is applied automatically. You don’t need to specify it.
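Choosing a QoS from the table above can be scripted. A minimal sketch, with thresholds hard-coded from this cluster's table (pick_qos is a hypothetical helper, not a Slurm command):

```shell
# Map a requested walltime in hours to a QoS from the table above
# (thresholds: short ≤ 1 h, medium ≤ 1 day, long ≤ 7 days, else extended)
pick_qos() {
  hours=$1
  if   [ "$hours" -le 1 ];   then echo short
  elif [ "$hours" -le 24 ];  then echo medium
  elif [ "$hours" -le 168 ]; then echo long
  else                            echo extended
  fi
}

pick_qos 5     # prints: medium  (≤ 1 day)
pick_qos 120   # prints: long    (≤ 7 days)
```

This could then be used as sbatch --qos=$(pick_qos 120) ... — though in practice simply reading the table is usually enough.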
Requesting resources
Every job needs to specify what it requires:
| Option | Short | Example | Meaning |
|---|---|---|---|
| --cpus-per-task | -c | -c 4 | 4 CPU cores |
| --mem | | --mem=8G | 8 GB RAM |
| --time | -t | -t 02:00:00 | 2 hours |
| --partition | -p | -p gpu | Use GPU partition |
| --qos | | --qos=short | Use short QoS |
| --gres | | --gres=gpu:1 | 1 GPU (for GPU jobs) |
Time format
Time can be specified as:
- MM – minutes
- HH:MM:SS – hours:minutes:seconds
- D-HH:MM:SS – days-hours:minutes:seconds
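The conversion between these forms is mechanical. A sketch (to_timelimit is a hypothetical helper, not a Slurm command):

```shell
# Convert a plain minute count to Slurm's D-HH:MM:SS time format
to_timelimit() {
  mins=$1
  printf '%d-%02d:%02d:00\n' \
    $((mins / 1440)) \
    $((mins % 1440 / 60)) \
    $((mins % 60))
}

to_timelimit 30     # prints: 0-00:30:00
to_timelimit 1500   # prints: 1-01:00:00
```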
--time=30 # 30 minutes
--time=02:00:00 # 2 hours
--time=1-00:00:00 # 1 day

Essential commands
Check your jobs
# Your running and pending jobs
squeue --me
# More detail
squeue --me --long

Output columns:
- JOBID – Job identifier (use this to cancel jobs)
- ST – State: R (running), PD (pending), CG (completing)
- TIME – How long it’s been running
- NODELIST – Which node(s) it’s running on
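Because squeue emits plain columns, its output is easy to post-process. A sketch that counts running vs. pending jobs from the ST column (the sample output is illustrative; job IDs, names, and nodes will differ on a real cluster):

```shell
# Illustrative `squeue --me` output captured as a string
squeue_out=' JOBID PARTITION     NAME  USER ST   TIME NODES NODELIST(REASON)
 12345   compute analysis alice  R  10:32     1 node03
 12346   compute analysis alice PD   0:00     1 (Priority)'

# Column 5 is ST; skip the header row, count R and PD separately
running=$(echo "$squeue_out" | awk 'NR>1 && $5=="R"'  | wc -l)
pending=$(echo "$squeue_out" | awk 'NR>1 && $5=="PD"' | wc -l)
echo "running=$running pending=$pending"
```

On a live system you would pipe squeue --me straight into the awk filters instead of using a saved string.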
Check cluster status
# Node availability
sinfo
# Who's using what
squeue

Cancel a job
# Cancel by job ID
scancel 12345
# Cancel all your jobs
scancel --me

Job history
# Your recent jobs (including finished)
sacct --starttime=today
# More detail on a specific job
sacct -j 12345 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS

Common job states:
- COMPLETED – Finished successfully
- FAILED – Exited with error
- TIMEOUT – Ran out of time
- OUT_OF_MEMORY – Exceeded memory limit
- CANCELLED – You (or admin) cancelled it
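sacct output can be summarised the same way. A sketch that tallies jobs per final state (the sample output is illustrative, not from a real cluster):

```shell
# Illustrative `sacct` output captured as a string
sacct_out='JobID    JobName    State      ExitCode
12340    analysis   COMPLETED  0:0
12341    analysis   FAILED     1:0
12342    analysis   COMPLETED  0:0
12343    analysis   TIMEOUT    0:0'

# Column 3 is State; skip the header, count occurrences of each state
state_summary=$(echo "$sacct_out" \
  | awk 'NR>1 {count[$3]++} END {for (s in count) print s, count[s]}' \
  | sort)
echo "$state_summary"
```

A quick tally like this makes it obvious whether a batch of array-style runs mostly completed, failed, or timed out.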
Resource usage of past jobs
To see how much memory/time a job actually used:
sacct -j 12345 --format=JobID,Elapsed,MaxRSS,MaxVMSize

This helps you tune future resource requests.
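One way to turn MaxRSS into a concrete --mem value for the next run: add some headroom and round up to whole gigabytes. A sketch, assuming sacct reports MaxRSS with a K suffix (it can also use M or G, which this helper does not handle; suggest_mem and the 25% margin are my own choices, not a Slurm convention):

```shell
# Suggest a --mem request from a MaxRSS value like "3145728K":
# add ~25% headroom, then round up to a whole number of GB
suggest_mem() {
  maxrss_k=${1%K}                    # strip the trailing K suffix
  kb=$((maxrss_k + maxrss_k / 4))    # add 25% headroom
  echo "$(( (kb + 1048575) / 1048576 ))G"
}

suggest_mem 3145728K   # 3 GiB used -> prints: 4G
suggest_mem 524288K    # 0.5 GiB used -> prints: 1G
```

Requesting close to what the job actually used keeps your jobs easier to schedule and leaves resources free for others.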
Common issues
Job stuck in pending (PD)
Check why with:
squeue --me --long

The NODELIST(REASON) column tells you why:
- Resources – Waiting for resources to free up
- Priority – Other jobs have higher priority
- QOSMaxJobsPerUserLimit – You’ve hit your job limit for this QoS
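With several pending jobs it can help to list each job ID with its reason in one shot. A sketch against illustrative output (real job IDs and columns may differ slightly between Slurm versions):

```shell
# Illustrative `squeue --me --long`-style output captured as a string
squeue_out=' JOBID PARTITION NAME  USER ST TIME NODES NODELIST(REASON)
 12346   compute  test alice PD 0:00     1 (Priority)
 12347   compute  test alice PD 0:00     1 (QOSMaxJobsPerUserLimit)'

# For pending (PD) jobs, strip the parentheses and print "jobid reason"
reasons=$(echo "$squeue_out" \
  | awk 'NR>1 && $5=="PD" {gsub(/[()]/,"",$8); print $1, $8}')
echo "$reasons"
```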
Job failed with OUT_OF_MEMORY
Request more memory:
salloc --mem=16G ... # instead of 8G

Job hit TIMEOUT
Either your job needs more time, or there’s an infinite loop. Request more time or investigate your code:
salloc --time=08:00:00 ... # instead of 2 hours

Next steps
- Interactive Jobs – Working interactively on compute nodes
- Batch Jobs – Submitting scripts that run unattended