```mermaid
flowchart TD
A[You] -->|SSH| B(Head Node)
B --> C{Slurm}
C -->|Compute Jobs| D[Node 01]
C -->|Compute Jobs| E[Node 02]
C -->|Compute Jobs| F[Node 03]
```
# Understanding the Cluster

*What all those terms actually mean*
Now that you’ve run R on a compute node, let’s understand what’s actually happening behind the scenes.
## What is a cluster?
A cluster is a collection of computers (nodes) connected by a fast network that work together as one system. Instead of one powerful machine, you get many machines that can work in parallel. Unlike a laptop or shared workstation, you don’t run compute work directly on the nodes: a job manager (in our case, Slurm) distributes your work to nodes and ensures every job gets the resources (CPU, RAM, GPU) it requested.
## Nodes
A node is a single computer in the cluster. Our cluster has different types:
### Head node
- Where you land when you SSH in
- Has internet access
- Used for: logging in, installing packages, editing code, submitting jobs
- Shared by all users – don’t run heavy computations here
### Compute nodes
- Where the actual work happens
- No internet access (isolated for security and performance)
- Dedicated CPU and RAM for your jobs
- You access these through Slurm, not directly
### GPU node
- A compute node with GPUs attached
- For machine learning, certain simulations, or GPU-accelerated code
## CPUs, Cores, and Threads
These terms are often confused. Here’s the hierarchy:
```
Node
└── CPU (physical processor chip)
    └── Core (independent processing unit)
        └── Thread (virtual core, via hyperthreading)
```
**Core:** The actual processing unit that runs your code. When you request `--cpus-per-task=4`, you’re asking for 4 cores.

**Thread:** Most modern CPUs can run 2 threads per core (hyperthreading). This can help with I/O-bound tasks but doesn’t double your computing power.
**Practical advice:** For R and most data analysis, think in terms of cores. If you want to parallelize across 4 cores, request 4 CPUs; Slurm will place your job on a node that can provide them.
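Following that advice, a script can read its allotted core count from Slurm instead of hard-coding it. A minimal sketch for a bash shell: `SLURM_CPUS_PER_TASK` is set by Slurm inside a job when you use `--cpus-per-task`, and `nproc` is the fallback outside one.

```shell
#!/usr/bin/env bash
# Use the core count Slurm granted us; fall back to all local cores
# when running outside a job (e.g. while testing on a laptop).
cores="${SLURM_CPUS_PER_TASK:-$(nproc)}"
echo "Parallelizing across $cores cores"
```

An R script launched by such a job can read the same value with `Sys.getenv("SLURM_CPUS_PER_TASK")`.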
## Memory (RAM)
Each node has a fixed amount of RAM. When you request `--mem=8G`, you’re reserving 8 GB for your job.
Requesting more memory than you need wastes resources and may delay your job starting. Start conservative and increase if you get out-of-memory errors.
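To size the next request, you can check what a finished job actually used with Slurm’s accounting command `sacct`. A sketch: the job ID `12345` is a placeholder, and the guard keeps the snippet harmless on machines without Slurm.

```shell
# MaxRSS = peak memory the job actually used; ReqMem = what you asked for.
if command -v sacct >/dev/null 2>&1; then
  sacct -j 12345 --format=JobID,ReqMem,MaxRSS,Elapsed,State
else
  echo "sacct not found - run this on the cluster"
fi
```

If `MaxRSS` is far below `ReqMem`, lower `--mem` for the next run.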
## Jobs
A job is a unit of work you submit to the cluster. Slurm (the job scheduler) manages who gets what resources and when.
### Interactive jobs (`salloc`)
- You get a shell on a compute node
- Good for development, debugging, exploratory analysis
- You stay connected while the job runs
### Batch jobs (`sbatch`)
- You submit a script, then disconnect
- The script runs when resources are available
- Good for long-running or overnight computations
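Both modes take the same resource flags. Here is a sketch of a minimal batch script; the file name, resource values, and `analysis.R` are placeholders to adapt.

```shell
# Write a job script; the #SBATCH lines are directives Slurm reads
# before running the script body on a compute node.
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00

module load R/4.5.2
Rscript analysis.R
EOF
```

Submit it with `sbatch myjob.sh`; Slurm prints a job ID and runs the script once the requested resources are free.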
## Why can’t compute nodes access the internet?
- Security: Compute nodes process data. Isolating them from the internet reduces attack surface.
- Performance: Network traffic is reserved for cluster communication and storage.
- Reproducibility: Jobs shouldn’t depend on external resources that might change or disappear.
This is why you install packages on the head node. It has internet access, and your home directory is shared across all nodes via NFS (network file system).
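Because your home directory is the same NFS filesystem on every node, anything you set up on the head node (installed packages, scripts, data) is immediately visible inside your jobs. A quick sketch; the file name is arbitrary.

```shell
# On the head node: create a file in your NFS-shared home directory.
echo "hello from the head node" > "$HOME/nfs-demo.txt"
# Later, inside any compute job, the same path shows the same file:
cat "$HOME/nfs-demo.txt"
```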
## Resource sharing with Slurm
Since many users share the cluster, we need a way to fairly distribute resources. That’s Slurm’s job:
- You request resources (cores, memory, time)
- Slurm finds a slot for your job
- Your job runs in isolation from others
- Resources are released when your job ends
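Two commands are worth knowing while your job moves through that pipeline: `squeue` lists your jobs (and shows why pending ones are waiting), and `sinfo` shows partitions and node states. A sketch, guarded so it is harmless off-cluster:

```shell
if command -v squeue >/dev/null 2>&1; then
  squeue -u "$USER"   # your pending and running jobs
  sinfo               # partitions and node availability
else
  echo "Slurm tools not found - run this on the cluster"
fi
```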
The next section dives deeper into how Slurm works.
## Software and modules
On your laptop, you install software and it’s just there. On a cluster, multiple users may need different versions of the same software (e.g., R 4.4 vs R 4.5), and software needs to be built specifically for the cluster hardware. The solution is environment modules.
Modules let you load software into your session on demand:
```shell
module load R/4.5.2   # Makes R 4.5.2 available
module avail          # List all available software
module list           # See what you currently have loaded
module purge          # Unload everything
```

When you load a module, it sets up `PATH`, library paths, and other environment variables so the software works correctly. When you unload it, everything is cleaned up.
Some software is installed via Spack, an HPC package manager that builds software optimized for the cluster’s AMD Zen3 hardware. Spack-installed packages (like PLINK 2.0) become available after loading the Spack module:
```shell
module load spack/1.1.1          # Makes Spack packages visible
module load plink2/2.0.0-a.6.9
```

See Software for the full list of available software and how to request or install additional packages.
## Summary
| Concept | What it means |
|---|---|
| Node | One computer in the cluster |
| Head node | Login/management node (has internet) |
| Compute node | Where jobs run (no internet) |
| Core | One processing unit (what you request) |
| RAM | Memory (request what you need) |
| Job | A unit of work submitted to Slurm |
| Module | A loadable software package (use `module load`) |
| Spack | HPC package manager for optimized software |
| NFS | Network filesystem sharing your home dir |