# GPU Jobs

Using the GPU node for accelerated computing.
The cluster has a dedicated GPU node for accelerated workloads like machine learning and certain scientific computations.
## Requesting GPU resources
Use the `--gres` flag to request GPUs:

```bash
# Interactive session with 1 GPU
salloc --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=02:00:00
```

```bash
# Batch job
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
```

### GPU options
| Request | Meaning |
|---|---|
| `--gres=gpu:1` | 1 GPU |
| `--gres=gpu:2` | 2 GPUs |
| `--partition=gpu` | Use the GPU partition |
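As an illustration, these options can be combined in a single batch script. This is a sketch only: the job name, resource limits, and `my_gpu_program` binary are placeholders, not site defaults.

```shell
#!/bin/bash
# Hypothetical job requesting both GPUs on the node; adjust limits to your workload
#SBATCH --job-name=multi_gpu
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00

srun ./my_gpu_program   # placeholder executable
```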
## Checking GPU availability

```bash
# See GPU node status
sinfo -p gpu

# Check GPU utilization (on the GPU node)
nvidia-smi
```

## PyTorch example
```python
import torch

# Check if GPU is available
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")

# Use GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor = torch.randn(1000, 1000, device=device)
```

`pytorch_job.slurm`:
```bash
#!/bin/bash
#SBATCH --job-name=pytorch
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --output=logs/%x_%j.out

cd ~/my_project
uv run python train.py
```

## R with GPU (torch)
The R torch package can use GPUs:
```r
library(torch)

# Check CUDA availability
cuda_is_available()

# Create tensor on GPU
x <- torch_randn(1000, 1000, device = "cuda")
```

The torch R package downloads CUDA libraries on first load. Run `library(torch)` once on the head node before submitting GPU jobs.
## Custom CUDA code
If you need to compile CUDA code (`.cu` files), load the CUDA toolkit first:

```bash
spack load cuda@12.4.0
nvcc -O2 -o my_program my_program.cu -lcublas
```

See Software > CUDA toolkit for all available versions and loading options.
## Tips
### Don't hog the GPU node

GPU resources are limited. Request only what you need and release the node when you are done.
### CPU fallback

Design your code to work without a GPU when possible:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

This lets you develop on the head node (CPU) and run production jobs on the GPU node.
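The same fallback idea can be applied at the shell level, for example inside a submit script. A minimal sketch, assuming the presence of `nvidia-smi` on the PATH indicates a GPU node (the `DEVICE` variable name is just an example):

```shell
# Use "cuda" when nvidia-smi is present (GPU node), else fall back to "cpu"
if command -v nvidia-smi >/dev/null 2>&1; then
    DEVICE="cuda"
else
    DEVICE="cpu"
fi
echo "Using device: $DEVICE"
```

The chosen value could then be passed to a training script, e.g. as a `--device "$DEVICE"` argument, if the script accepts one.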
### Monitor GPU memory

```bash
# On the GPU node
watch nvidia-smi
```

Out-of-memory errors on the GPU require reducing batch size or model size; requesting more memory via `--mem` does not help, since that controls host RAM, not GPU memory.