Slurm Basics
The job scheduler that manages cluster resources
Slurm (Simple Linux Utility for Resource Management) decides who gets to use which resources and when. Every job you run goes through Slurm.
How Slurm works
- You request resources (CPUs, memory, time)
- Slurm queues your job
- When resources are available, Slurm runs your job
- When done (or time runs out), resources are released
What a job looks like
Here’s a minimal batch script that runs an R analysis on a compute node:
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
module load R/4.5.2
Rscript analysis.R

The #SBATCH lines tell Slurm what resources the job needs – 4 CPU cores, 8 GB of RAM, and up to 2 hours. Slurm finds a compute node with those resources available and runs the script there. You submit it with sbatch my_job.slurm and Slurm takes it from there. See Batch Jobs for the full guide.
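A detail worth knowing: the #SBATCH directives are ordinary shell comments, so bash ignores them while Slurm parses them. That makes a quick directive count a cheap sanity check before submitting. The sketch below recreates the script above and counts its directives (the filename is just the example's):

```shell
# Recreate the example batch script; #SBATCH lines are shell comments,
# invisible to bash but read by Slurm at submission time
cat > my_job.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
module load R/4.5.2
Rscript analysis.R
EOF

# Count the directives as a pre-submission sanity check
directives=$(grep -c '^#SBATCH' my_job.slurm)
echo "$directives"   # → 4
```

Because the directives are comments, the same script also runs as plain bash on a login node (without any resource guarantees), which can help when debugging.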
You can also work interactively on a compute node by requesting resources on the command line:
salloc --cpus-per-task=4 --mem=8G --time=02:00:00

This gives you a shell on a compute node where you can run commands directly. See Interactive Jobs for details.
Partitions
A partition is a group of nodes. Different partitions may have different hardware or policies.
# See available partitions
sinfo

| Partition | Nodes | Max time | Description |
|---|---|---|---|
| compute (default) | 12 (node01-12) | 20 days | General compute jobs |
| gpu | 1 (gnode01) | 20 days | GPU-accelerated work |
When submitting jobs, you can specify a partition with --partition=<name> (or -p). If you don’t specify, you get the default compute partition.
Quality of Service (QoS)
QoS defines time limits and priorities. Higher priority jobs start sooner when resources are contested.
| QoS | Max time | Limits | Priority | Use case |
|---|---|---|---|---|
| interactive | 1 day | 2 jobs, 192 CPUs | highest | Interactive sessions (auto-applied for salloc) |
| short | 1 hour | – | high | Quick test runs |
| medium | 1 day | – | medium | Standard jobs |
| long | 7 days | – | low | Long-running analyses |
| extended | 20 days | 1 job | lowest | Exceptional cases |
| normal | (partition default) | – | baseline | Fallback |
Request a specific QoS with --qos=<name>:
sbatch --qos=long --time=5-00:00:00 my_job.slurm

When you use salloc for interactive sessions, the interactive QoS is applied automatically. You don’t need to specify it.
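Choosing a QoS from the table above can be scripted. A minimal sketch, with thresholds hard-coded from this cluster's table (pick_qos is a hypothetical helper, not a Slurm command):

```shell
# Map a requested walltime in hours to a QoS from the table above
# (thresholds: short ≤ 1 h, medium ≤ 1 day, long ≤ 7 days, else extended)
pick_qos() {
  hours=$1
  if   [ "$hours" -le 1 ];   then echo short
  elif [ "$hours" -le 24 ];  then echo medium
  elif [ "$hours" -le 168 ]; then echo long
  else                            echo extended
  fi
}

pick_qos 5     # prints: medium  (≤ 1 day)
pick_qos 120   # prints: long    (≤ 7 days)
```

This could then be used as sbatch --qos=$(pick_qos 120) ... — though in practice simply reading the table is usually enough.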
Requesting resources
Every job needs to specify what it requires:
| Option | Short | Example | Meaning |
|---|---|---|---|
| --cpus-per-task | -c | -c 4 | 4 CPU cores |
| --mem | | --mem=8G | 8 GB RAM |
| --time | -t | -t 02:00:00 | 2 hours |
| --partition | -p | -p gpu | Use GPU partition |
| --qos | | --qos=short | Use short QoS |
| --gres | | --gres=gpu:1 | 1 GPU (for GPU jobs) |
Time format
Time can be specified as:
- MM – minutes
- HH:MM:SS – hours:minutes:seconds
- D-HH:MM:SS – days-hours:minutes:seconds
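The conversion between these forms is mechanical. A sketch (to_timelimit is a hypothetical helper, not a Slurm command):

```shell
# Convert a plain minute count to Slurm's D-HH:MM:SS time format
to_timelimit() {
  mins=$1
  printf '%d-%02d:%02d:00\n' \
    $((mins / 1440)) \
    $((mins % 1440 / 60)) \
    $((mins % 60))
}

to_timelimit 30     # prints: 0-00:30:00
to_timelimit 1500   # prints: 1-01:00:00
```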
--time=30 # 30 minutes
--time=02:00:00 # 2 hours
--time=1-00:00:00 # 1 day

Essential commands
Check your jobs
# Your running and pending jobs
squeue --me
# More detail
squeue --me --long

Output columns:
- JOBID – Job identifier (use this to cancel jobs)
- ST – State: R (running), PD (pending), CG (completing)
- TIME – How long it’s been running
- NODELIST – Which node(s) it’s running on
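Because squeue emits plain columns, its output is easy to post-process. A sketch that counts running vs. pending jobs from the ST column (the sample output is illustrative; job IDs, names, and nodes will differ on a real cluster):

```shell
# Illustrative `squeue --me` output captured as a string
squeue_out=' JOBID PARTITION     NAME  USER ST   TIME NODES NODELIST(REASON)
 12345   compute analysis alice  R  10:32     1 node03
 12346   compute analysis alice PD   0:00     1 (Priority)'

# Column 5 is ST; skip the header row, count R and PD separately
running=$(echo "$squeue_out" | awk 'NR>1 && $5=="R"'  | wc -l)
pending=$(echo "$squeue_out" | awk 'NR>1 && $5=="PD"' | wc -l)
echo "running=$running pending=$pending"
```

On a live system you would pipe squeue --me straight into the awk filters instead of using a saved string.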
Check cluster status
# Node availability
sinfo
# Who's using what
squeue

Cancel a job
# Cancel by job ID
scancel 12345
# Cancel all your jobs
scancel --me

Job history
# Your recent jobs (including finished)
sacct --starttime=today
# More detail on a specific job
sacct -j 12345 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS

Common job states:
- COMPLETED – Finished successfully
- FAILED – Exited with error
- TIMEOUT – Ran out of time
- OUT_OF_MEMORY – Exceeded memory limit
- CANCELLED – You (or admin) cancelled it
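sacct output can be summarised the same way. A sketch that tallies jobs per final state (the sample output is illustrative, not from a real cluster):

```shell
# Illustrative `sacct` output captured as a string
sacct_out='JobID    JobName    State      ExitCode
12340    analysis   COMPLETED  0:0
12341    analysis   FAILED     1:0
12342    analysis   COMPLETED  0:0
12343    analysis   TIMEOUT    0:0'

# Column 3 is State; skip the header, count occurrences of each state
state_summary=$(echo "$sacct_out" \
  | awk 'NR>1 {count[$3]++} END {for (s in count) print s, count[s]}' \
  | sort)
echo "$state_summary"
```

A quick tally like this makes it obvious whether a batch of array-style runs mostly completed, failed, or timed out.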
Resource usage of past jobs
To see how much memory/time a job actually used:
sacct -j 12345 --format=JobID,Elapsed,MaxRSS,MaxVMSize

This helps you tune future resource requests.
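One way to turn MaxRSS into a concrete --mem value for the next run: add some headroom and round up to whole gigabytes. A sketch, assuming sacct reports MaxRSS with a K suffix (it can also use M or G, which this helper does not handle; suggest_mem and the 25% margin are my own choices, not a Slurm convention):

```shell
# Suggest a --mem request from a MaxRSS value like "3145728K":
# add ~25% headroom, then round up to a whole number of GB
suggest_mem() {
  maxrss_k=${1%K}                    # strip the trailing K suffix
  kb=$((maxrss_k + maxrss_k / 4))    # add 25% headroom
  echo "$(( (kb + 1048575) / 1048576 ))G"
}

suggest_mem 3145728K   # 3 GiB used -> prints: 4G
suggest_mem 524288K    # 0.5 GiB used -> prints: 1G
```

Requesting close to what the job actually used keeps your jobs easier to schedule and leaves resources free for others.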
Common issues
Job stuck in pending (PD)
Check why with:
squeue --me --long

The NODELIST(REASON) column tells you why:
- Resources – Waiting for resources to free up
- Priority – Other jobs have higher priority
- QOSMaxJobsPerUserLimit – You’ve hit your job limit for this QoS
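With several pending jobs it can help to list each job ID with its reason in one shot. A sketch against illustrative output (real job IDs and columns may differ slightly between Slurm versions):

```shell
# Illustrative `squeue --me --long`-style output captured as a string
squeue_out=' JOBID PARTITION NAME  USER ST TIME NODES NODELIST(REASON)
 12346   compute  test alice PD 0:00     1 (Priority)
 12347   compute  test alice PD 0:00     1 (QOSMaxJobsPerUserLimit)'

# For pending (PD) jobs, strip the parentheses and print "jobid reason"
reasons=$(echo "$squeue_out" \
  | awk 'NR>1 && $5=="PD" {gsub(/[()]/,"",$8); print $1, $8}')
echo "$reasons"
```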
Job failed with OUT_OF_MEMORY
Request more memory:
salloc --mem=16G ... # instead of 8G

Job hit TIMEOUT
Either your job needs more time, or there’s an infinite loop. Request more time or investigate your code:
salloc --time=08:00:00 ... # instead of 2 hours

Next steps
- Interactive Jobs – Working interactively on compute nodes
- Batch Jobs – Submitting scripts that run unattended