# GPU Jobs

Using the GPU node for accelerated computing.
The cluster has a dedicated GPU node for accelerated workloads like machine learning and certain scientific computations.
## Requesting GPU resources
You need both `--partition=gpu` and `--gres=gpu:N` to use the GPU node:

```bash
# Interactive session with 1 GPU
salloc --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=02:00:00
```

```bash
# Batch job
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
```

- `--partition=gpu` routes your job to the GPU node (the default partition is `compute`, which has no GPUs)
- `--gres=gpu:N` tells Slurm to allocate N GPUs and make them visible to your job
If you specify only `--partition=gpu` without `--gres`, your job runs on the GPU node but cannot see or use the GPUs; CUDA will report that no devices are available. If you specify only `--gres=gpu:N` without `--partition=gpu`, the job is queued on the default `compute` partition, which has no GPUs, so it will never run.
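Because a missing `--gres` fails silently (the job runs, just without GPUs), it can help to fail fast at the top of a script instead of quietly falling back to CPU. A minimal sketch, relying on Slurm exposing the allocated devices through the `CUDA_VISIBLE_DEVICES` environment variable (the helper name `allocated_gpus` is our own):

```python
import os

def allocated_gpus():
    """Return the GPU indices visible to this job.

    Slurm sets CUDA_VISIBLE_DEVICES for jobs that requested --gres=gpu:N;
    an unset or empty variable means no GPUs were allocated.
    """
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(i) for i in value.split(",") if i.strip()]
```

For example, calling `raise SystemExit("no GPUs allocated - did you forget --gres?")` when this returns an empty list turns a silent CPU run into an immediate, visible error.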
## GPU options
| Request | Meaning |
|---|---|
| `--partition=gpu` | Route job to the GPU node |
| `--gres=gpu:1` | Allocate 1 GPU |
| `--gres=gpu:2` | Allocate 2 GPUs |
## GPU node specs
The GPU node (`gnode01`) has different hardware than the compute nodes:
| Resource | Spec |
|---|---|
| GPUs | 2x NVIDIA H200 (141 GB HBM3e each) |
| CPU cores | 128 (2x 64, SMT disabled) |
| RAM | 3 TB |
| Local scratch | ~1.8 TB NVMe RAID-1 |
To use all CPU cores on the GPU node, request `--cpus-per-task=128`.
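Rather than hard-coding a thread count, code can size its thread pools from what Slurm actually granted. A sketch using the standard `SLURM_CPUS_PER_TASK` environment variable, which Slurm sets only when `--cpus-per-task` is given (the helper name `slurm_cpu_count` is our own):

```python
import os

def slurm_cpu_count(default: int = 1) -> int:
    """CPUs allocated to this task; falls back to `default` outside Slurm."""
    return int(os.environ.get("SLURM_CPUS_PER_TASK", default))
```

The result can be passed to, e.g., `torch.set_num_threads()` or exported as `OMP_NUM_THREADS` so a job never oversubscribes its allocation.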
## Checking GPU availability
```bash
# See GPU node status
sinfo -p gpu

# Check GPU utilization (on the GPU node)
nvidia-smi

# Check which GPUs are allocated to your job
echo $CUDA_VISIBLE_DEVICES
```

## PyTorch example
```python
import torch

# Check if GPU is available
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")

# Use GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor = torch.randn(1000, 1000, device=device)
```

**pytorch_job.slurm**
```bash
#!/bin/bash
#SBATCH --job-name=pytorch
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --output=logs/%x_%j.out

cd ~/my_project
uv run python train.py
```

## R with GPU (torch)
The R torch package provides GPU-accelerated tensor operations and deep learning. Since the GPU node has no internet access, you must install the CUDA-enabled version on the head node first.
### Installing torch with CUDA support
Installing torch with GPU support is a two-step process. Both steps must be run on the head node (it has internet access). The files install to your R library on /srv/home (shared via NFS), so the GPU node sees them automatically.
**Step 1: Install the R package from CRAN**

```r
install.packages("torch")
```

**Step 2: Download the CUDA-enabled libtorch backend**
By default, torch auto-downloads a CPU-only backend on first use. To get the CUDA version, set the `CUDA` environment variable and call `install_torch()`:

```r
Sys.setenv(CUDA = "12.8")
torch::install_torch()
```

This downloads ~3.6 GB of CUDA libraries (libtorch, cuDNN, etc.) into the torch package directory. See the torch installation guide for other CUDA versions.
A plain `install.packages("torch")` followed by `library(torch)` auto-downloads the CPU-only backend (because the head node has no GPU). You must explicitly run `install_torch()` with `CUDA` set to get GPU support.
### Using torch on the GPU node
```bash
# Get an interactive GPU session
salloc --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=02:00:00
```

```r
library(torch)

# Verify GPU is available
cuda_is_available()   # TRUE
cuda_device_count()   # 1 or 2

# Create tensor on GPU
x <- torch_randn(1000, 1000, device = "cuda")
```

If `cuda_is_available()` returns FALSE:
- Did you request a GPU? Make sure your `salloc` or `sbatch` includes both `--partition=gpu` and `--gres=gpu:1`. Without `--gres`, the GPU devices are hidden from your job.
- Is the CUDA version installed? You may have the CPU-only backend. Remove it with `remove.packages("torch")` and re-install with CUDA support as described above.
## Custom CUDA code
If you need to compile CUDA code (`.cu` files), load the CUDA toolkit first:

```bash
spack load cuda@12.4.0
nvcc -O2 -o my_program my_program.cu -lcublas
```

See Software > CUDA toolkit for all available versions and loading options.
## Tips
### Don’t hog the GPU node
GPU resources are limited. Request only what you need and release the node when done.
### CPU fallback
Design your code to work without a GPU when possible:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

This lets you develop on the head node (CPU) and run production jobs on GPU.
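One way to centralize the fallback is a small helper that imports torch lazily, so the same script also runs in environments where torch is not installed at all. A sketch (the name `pick_device` is our own):

```python
def pick_device(prefer: str = "cuda") -> str:
    """Return "cuda" if a usable GPU is present, else "cpu"."""
    try:
        import torch  # imported lazily so CPU-only hosts still work
        if prefer == "cuda" and torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"
```

Passing `prefer="cpu"` forces the CPU path, which is handy for debugging numerical differences between devices.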
### Monitor GPU memory
On the GPU node, use `nvidia-smi` for a snapshot of the current status:

```bash
nvidia-smi
```

Alternatively, `nvtop` gives a live overview with graphs:

```bash
nvtop
```

Out-of-memory errors on the GPU require reducing batch size or model size, not requesting more memory.
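A common pattern for recovering from a GPU out-of-memory error is to retry the failing step with a smaller batch. A framework-agnostic sketch — PyTorch's `torch.cuda.OutOfMemoryError` is a `RuntimeError` subclass, so catching `RuntimeError` covers it, at the cost of also catching unrelated runtime errors (the function names are illustrative):

```python
def run_with_backoff(step, batch_size: int, min_batch: int = 1):
    """Call step(batch_size), halving the batch on failure until it fits."""
    while batch_size >= min_batch:
        try:
            return step(batch_size)
        except RuntimeError:  # torch.cuda.OutOfMemoryError subclasses this
            batch_size //= 2
    raise RuntimeError("out of GPU memory even at the minimum batch size")
```

In real PyTorch code you would typically also call `torch.cuda.empty_cache()` after catching the error, before retrying with the smaller batch.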