# GPU Jobs

Using the GPU node for accelerated computing.
The cluster has a dedicated GPU node for accelerated workloads like machine learning and certain scientific computations.
## Requesting GPU resources
Use the `--gres` flag to request GPUs:

```bash
# Interactive session with 1 GPU
salloc --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=02:00:00
```

```bash
# Batch job
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
```

### GPU options
| Request | Meaning |
|---|---|
| `--gres=gpu:1` | 1 GPU |
| `--gres=gpu:2` | 2 GPUs |
| `--partition=gpu` | Use the GPU partition |
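As an illustration, these options can be combined in a single batch script. This is a sketch only: the job name, resource limits, and `my_gpu_program` binary are placeholders, not site defaults.

```shell
#!/bin/bash
# Hypothetical job requesting both GPUs on the node; adjust limits to your workload
#SBATCH --job-name=multi_gpu
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00

srun ./my_gpu_program   # placeholder executable
```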
## Checking GPU availability

```bash
# See GPU node status
sinfo -p gpu

# Check GPU utilization (on the GPU node)
nvidia-smi
```

## PyTorch example
```python
import torch

# Check if GPU is available
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")

# Use GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor = torch.randn(1000, 1000, device=device)
```

`pytorch_job.slurm`:
```bash
#!/bin/bash
#SBATCH --job-name=pytorch
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --output=logs/%x_%j.out

cd ~/my_project
uv run python train.py
```

## R with GPU (torch)
The R torch package can use GPUs:
```r
library(torch)

# Check CUDA availability
cuda_is_available()

# Create tensor on GPU
x <- torch_randn(1000, 1000, device = "cuda")
```

The torch R package downloads CUDA libraries on first load. Run `library(torch)` once on the head node before submitting GPU jobs.
## Custom CUDA code
If you need to compile CUDA code (`.cu` files), load the CUDA toolkit first:

```bash
spack load cuda@12.4.0
nvcc -O2 -o my_program my_program.cu -lcublas
```

See Software > CUDA toolkit for all available versions and loading options.
## Tips
### Don't hog the GPU node

GPU resources are limited. Request only what you need and release the node when you are done.
### CPU fallback

Design your code to work without a GPU when possible:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

This lets you develop on the head node (CPU) and run production jobs on the GPU node.
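The same fallback idea can be applied at the shell level, for example inside a submit script. A minimal sketch, assuming the presence of `nvidia-smi` on the PATH indicates a GPU node (the `DEVICE` variable name is just an example):

```shell
# Use "cuda" when nvidia-smi is present (GPU node), else fall back to "cpu"
if command -v nvidia-smi >/dev/null 2>&1; then
    DEVICE="cuda"
else
    DEVICE="cpu"
fi
echo "Using device: $DEVICE"
```

The chosen value could then be passed to a training script, e.g. as a `--device "$DEVICE"` argument, if the script accepts one.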
### Monitor GPU memory

```bash
# On the GPU node
watch nvidia-smi
```

Out-of-memory errors on the GPU require reducing batch size or model size; requesting more memory via `--mem` does not help, since that controls host RAM, not GPU memory.