GPU Jobs

Using the GPU node for accelerated computing

Modified

2026-03-12

The cluster has a dedicated GPU node for accelerated workloads like machine learning and certain scientific computations.

Requesting GPU resources

You need both --partition=gpu and --gres=gpu:N to use the GPU node:

# Interactive session with 1 GPU
salloc --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=02:00:00

# Batch job
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
Important

Always specify both --partition=gpu and --gres=gpu:N
  • --partition=gpu routes your job to the GPU node (the default partition is compute, which has no GPUs)
  • --gres=gpu:N tells Slurm to allocate N GPUs and make them visible to your job

If you only specify --partition=gpu without --gres, your job will run on the GPU node but cannot see or use the GPUs — CUDA will report no devices available. If you only specify --gres=gpu:N without --partition=gpu, the job will be queued on the default compute partition which has no GPUs, and it will never run.

GPU options

Request Meaning
--partition=gpu Route job to the GPU node
--gres=gpu:1 Allocate 1 GPU
--gres=gpu:2 Allocate 2 GPUs

GPU node specs

The GPU node (gnode01) has different hardware than the compute nodes:

Resource Spec
GPUs 2x NVIDIA H200 (141 GB HBM3e each)
CPU cores 128 (2x 64, SMT disabled)
RAM 3 TB
Local scratch ~1.8 TB NVMe RAID-1

To use all CPU cores on the GPU node, request --cpus-per-task=128.
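If you need the entire node, a whole-node interactive request might look like the sketch below; the --mem and --time values are illustrative placeholders, so adjust them to your workload:

```shell
# Whole GPU node: both GPUs, all 128 cores (memory/time values are illustrative)
salloc --partition=gpu --gres=gpu:2 --cpus-per-task=128 --mem=512G --time=04:00:00
```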

Checking GPU availability

# See GPU node status
sinfo -p gpu

# Check GPU utilization (on the GPU node)
nvidia-smi

# Check which GPUs are allocated to your job
echo $CUDA_VISIBLE_DEVICES
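A launcher script can inspect CUDA_VISIBLE_DEVICES itself to see how many GPUs Slurm exposed. The sketch below assumes the common case where Slurm sets the variable to a comma-separated list of device indices (it can also hold GPU UUIDs); the helper name visible_gpu_ids is ours, not a Slurm or CUDA API:

```python
import os

def visible_gpu_ids(env=None):
    """Return the GPU indices exposed to this job; an empty list means no GPUs."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [int(tok) for tok in raw.split(",") if tok.strip().isdigit()]

# With --gres=gpu:2, Slurm typically sets CUDA_VISIBLE_DEVICES=0,1
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1"}))  # [0, 1]
print(visible_gpu_ids({}))                               # [] -> no GPUs allocated
```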

PyTorch example

import torch

# Check if GPU is available
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")

# Use GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor = torch.randn(1000, 1000, device=device)

pytorch_job.slurm

#!/bin/bash
#SBATCH --job-name=pytorch
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --output=logs/%x_%j.out

cd ~/my_project
uv run python train.py
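To use both GPUs from one batch job, one possible sketch uses PyTorch's torchrun launcher. This assumes train.py is written for distributed training (e.g. DistributedDataParallel); the resource values are illustrative:

```shell
#!/bin/bash
#SBATCH --job-name=pytorch-2gpu
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=logs/%x_%j.out

cd ~/my_project
# torchrun starts one worker process per allocated GPU
uv run torchrun --standalone --nproc_per_node=2 train.py
```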

R with GPU (torch)

The R torch package provides GPU-accelerated tensor operations and deep learning. Since the GPU node has no internet access, you must install the CUDA-enabled version on the head node first.

Installing torch with CUDA support

Installing torch with GPU support is a two-step process. Both steps must be run on the head node (it has internet access). The files install to your R library on /srv/home (shared via NFS), so the GPU node sees them automatically.

Step 1: Install the R package from CRAN

install.packages("torch")

Step 2: Download the CUDA-enabled libtorch backend

By default, torch auto-downloads a CPU-only backend on first use. To get the CUDA version, set the CUDA environment variable and call install_torch():

Sys.setenv(CUDA = "12.8")
torch::install_torch()

This downloads ~3.6 GB of CUDA libraries (libtorch, cuDNN, etc.) into the torch package directory. See the torch installation guide for other CUDA versions.

Important

Do not skip step 2

A plain install.packages("torch") followed by library(torch) will auto-download the CPU-only backend (because the head node has no GPU). You must explicitly run install_torch() with CUDA set to get GPU support.

Using torch on the GPU node

# Get an interactive GPU session, then start R on the GPU node
salloc --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=02:00:00

library(torch)

# Verify GPU is available
cuda_is_available()     # TRUE
cuda_device_count()     # 1 or 2

# Create tensor on GPU
x <- torch_randn(1000, 1000, device = "cuda")
Note

If cuda_is_available() returns FALSE:

  1. Did you request a GPU? Make sure your salloc or sbatch includes both --partition=gpu and --gres=gpu:1. Without --gres, the GPU devices are hidden from your job.
  2. Is the CUDA version installed? You may have the CPU-only backend. Remove it with remove.packages("torch"), then repeat both installation steps above with the CUDA environment variable set before install_torch().

Custom CUDA code

If you need to compile CUDA code (.cu files), load the CUDA toolkit first:

spack load cuda@12.4.0
nvcc -O2 -o my_program my_program.cu -lcublas

See Software > CUDA toolkit for all available versions and loading options.

Tips

Don’t hog the GPU node

GPU resources are limited. Request only what you need and release the node when done.

CPU fallback

Design your code to work without GPU when possible:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

This lets you develop on the head node (CPU) and run production jobs on GPU.
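One way to package the fallback so the same script even runs where torch is not installed is a small helper like the sketch below; pick_device is our name for illustration, not a torch API:

```python
def pick_device():
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:  # torch not installed, e.g. a plain head-node session
        pass
    return "cpu"

device = pick_device()
print(f"Running on: {device}")
```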

Monitor GPU memory

On the GPU node, nvidia-smi prints a snapshot of current GPU utilization and memory use:

nvidia-smi

For a live overview with graphs, use nvtop:

nvtop
Important

Out-of-memory errors on the GPU refer to GPU memory, not system RAM: fix them by reducing batch size or model size, not by requesting more --mem.
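One way to act on an out-of-memory error in code, sketched below with a hypothetical run_with_backoff helper, is to catch it and retry with a smaller batch rather than resubmitting the job (PyTorch surfaces CUDA OOM as a RuntimeError whose message contains "out of memory"):

```python
def run_with_backoff(step, batch_size, min_batch=1):
    """Call step(batch_size), halving the batch on GPU out-of-memory errors."""
    while batch_size >= min_batch:
        try:
            return step(batch_size)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated error: re-raise
            batch_size //= 2
    raise RuntimeError("even the minimum batch size does not fit on the GPU")

# Toy stand-in for a training step that only fits 8 samples at a time
def step(batch_size):
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory")
    return batch_size

print(run_with_backoff(step, 64))  # backs off 64 -> 32 -> 16 -> 8, prints 8
```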