Understanding the Cluster

What all those terms actually mean

Modified: 2026-03-17

Now that you’ve run R on a compute node, let’s understand what’s actually happening behind the scenes.

What is a cluster?

A cluster is a collection of computers (nodes) connected by a fast network, working together as one system. Instead of one powerful machine, you get many machines that can work in parallel. Unlike working on a laptop or shared workstation, you don’t use the nodes directly: a job scheduler (in our case, Slurm) distributes your compute work across the nodes and ensures that every job gets the resources (CPU, RAM, GPU) it requested.

flowchart TD
    A[You] -->|SSH| B(Head Node)
    B --> C{Slurm}
    C -->|Compute Jobs| D[Node 01]
    C -->|Compute Jobs| E[Node 02]
    C -->|Compute Jobs| F[Node 03]

Nodes

A node is a single computer in the cluster. Our cluster has different types:

Head node

  • Where you land when you SSH in
  • Has internet access
  • Used for: logging in, installing packages, editing code, submitting jobs
  • Shared by all users – don’t run heavy computations here

Compute nodes

  • Where the actual work happens
  • No internet access (isolated for security and performance)
  • Dedicated CPU and RAM for your jobs
  • You access these through Slurm, not directly — you cannot SSH to a compute node unless you have an active job running on it

GPU node

  • A compute node with GPUs attached
  • For machine learning, certain simulations, or GPU-accelerated code

CPUs, Cores, and Threads

These terms are often confused. Here’s the hierarchy:

Node
└── CPU (physical processor chip)
    └── Core (independent processing unit)
        └── Thread (a single sequence of instructions)

Core: The actual processing unit that runs your code. Each compute node has 96 cores.

Thread: A thread is a single sequence of instructions being executed — one “thing running”. When you start an R session, that’s one thread. When you run an R script without any explicit parallelization, that also uses one thread. Each core can run 2 threads simultaneously (called hyperthreading or SMT). This can help with some workloads but doesn’t double your computing power — think of it as two tasks sharing one core’s resources rather than having two independent cores.

How Slurm allocates: Slurm allocates whole cores, not individual threads. Each job gets at least one full core (= 2 threads), even if you only need 1 thread. This means a compute node can be divided up in many ways — for example, 96 single-threaded jobs (one per core, each getting 2 threads), 4 jobs with 48 threads each, or 1 job using all 192 threads (a full node).

Practical advice: Request as many CPUs as threads you actually need. If your code uses 4 threads, request --cpus-per-task=4 (or ncpus = 4 in batchtools). Slurm will round up to whole cores internally — you don’t need to worry about that. To use an entire compute node, request --cpus-per-task=192 (96 cores × 2 threads). On the GPU node (no hyperthreading), request --cpus-per-task=128 for all 128 cores.
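For concreteness, here is how those requests look as Slurm options. This is a sketch: job.sh is a placeholder script name, and the GPU partition name is an assumption you should check with sinfo.

# 4 threads of work: request 4 CPUs (Slurm rounds this up to 2 full cores internally)
sbatch --cpus-per-task=4 job.sh

# A full compute node: 96 cores x 2 threads
sbatch --cpus-per-task=192 job.sh

# All cores on the GPU node (no hyperthreading); the partition name here is a guess
sbatch --cpus-per-task=128 --partition=gpu job.sh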

Memory (RAM)

Each node has a fixed amount of RAM. When you request --mem=8G, you’re reserving 8 GB for your job.

Warning: Don’t over-request

Requesting more memory than you need wastes resources and may delay the start of your job. Start with a conservative request and increase it if you get out-of-memory errors.
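One way to right-size memory requests, assuming Slurm accounting is enabled on this cluster: after a job finishes, compare what it actually used (MaxRSS) with what you requested (ReqMem) using sacct. The job ID below is a placeholder.

sacct -j 123456 --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State

If MaxRSS comes out far below ReqMem, lower --mem the next time you submit.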

Jobs

A job is a unit of work you submit to the cluster. Slurm (the job scheduler) manages who gets what resources and when.

Interactive jobs (salloc)

  • You get a shell on a compute node
  • Good for development, debugging, exploratory analysis
  • You stay connected while the job runs
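As a sketch, an interactive session for light R work could be requested like this (the resource values are placeholders to adjust):

# Request 1 CPU, 4 GB RAM, and 2 hours on a compute node
salloc --cpus-per-task=1 --mem=4G --time=02:00:00
# Once the allocation is granted, you get a shell on the node
module load R/4.5.3
R
# When done, quit R and type exit to release the allocation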

Batch jobs (sbatch)

  • You submit a script, then disconnect
  • The script runs when resources are available
  • Good for long-running or overnight computations
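A minimal batch script, as a sketch (the script name, resource values, and R script are placeholders):

#!/bin/bash
#SBATCH --job-name=my-analysis        # name shown in the queue
#SBATCH --cpus-per-task=1             # one thread of work
#SBATCH --mem=8G                      # memory reserved for the job
#SBATCH --time=04:00:00               # wall-clock limit
#SBATCH --output=my-analysis_%j.log   # %j is replaced by the job ID

module load R/4.5.3
Rscript my_analysis.R

Submit it with sbatch my_job.sh and check its status with squeue -u $USER; the output log appears in the directory you submitted from.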

How parallelism works on a cluster

On your laptop, “running in parallel” usually means using multiple threads or cores within a single process. On a cluster, there are two fundamentally different approaches — and understanding the distinction matters.

Approach A: One job, many threads

You submit a single job that requests many cores and uses a parallel framework (like R’s future or Python’s multiprocessing) to split work across those cores within the job.

┌──────────────────────────────────────────────────────┐
│ Node 05                                              │
│                                                       │
│  Job 4281 (100 threads)                               │
│  ┌──────────┬──────────┬─────┬───────────┐            │
│  │ Thread 1 │ Thread 2 │ ... │ Thread 100│            │
│  └──────────┴──────────┴─────┴───────────┘            │
│  future_lapply(1:1000, my_function)                   │
│                                                       │
└──────────────────────────────────────────────────────┘

  → 1 job × 100 threads × 1 hour = 100 CPU-hours

This is useful when tasks need shared memory, shared data, or coordination between threads.
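From Slurm’s point of view, Approach A is a single job with a large --cpus-per-task. Here is a sketch of the submission side (the script name and numbers are illustrative; the parallel split itself happens inside your R code, e.g. via future):

#!/bin/bash
#SBATCH --job-name=one-big-job
#SBATCH --cpus-per-task=100      # all threads live inside this one job
#SBATCH --mem=64G
#SBATCH --time=01:00:00

module load R/4.5.3
# parallel_analysis.R is hypothetical; inside it, a framework such as future
# spreads the work over the CPUs Slurm allocated to this job
Rscript parallel_analysis.R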

Approach B: Many jobs, one thread each

You submit many independent jobs, each running one piece of work (one simulation scenario, one dataset, one model fit). Slurm distributes them across nodes automatically.

┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Node 05      │ │ Node 06      │ │ Node 07      │ │ Node 08      │
│              │ │              │ │              │ │              │
│ Job 1  (1 c) │ │ Job 26 (1 c) │ │ Job 51 (1 c) │ │ Job 76 (1 c) │
│ Job 2  (1 c) │ │ Job 27 (1 c) │ │ Job 52 (1 c) │ │ Job 77 (1 c) │
│ Job 3  (1 c) │ │ Job 28 (1 c) │ │ Job 53 (1 c) │ │ Job 78 (1 c) │
│ ...          │ │ ...          │ │ ...          │ │ ...          │
│ Job 25 (1 c) │ │ Job 50 (1 c) │ │ Job 75 (1 c) │ │ Job 100(1 c) │
│              │ │              │ │              │ │              │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘

  → 100 jobs × 1 core × 1 hour = 100 CPU-hours

This is the natural fit for simulation studies, parameter sweeps, and other “embarrassingly parallel” workloads where each run is self-contained. If one job fails, you resubmit just that one. You don’t need to think about threading or shared memory at all.
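One way to express Approach B directly in Slurm is a job array: a single submission that creates many independent single-core jobs. This is a sketch of the idea (the script names are placeholders); tools like batchtools, discussed below, generate the equivalent submissions for you.

#!/bin/bash
#SBATCH --job-name=sim-study
#SBATCH --cpus-per-task=1        # each array task is single-threaded
#SBATCH --mem=2G
#SBATCH --time=01:00:00
#SBATCH --array=1-100            # 100 independent jobs, numbered 1 to 100

module load R/4.5.3
# Each task gets its own index in SLURM_ARRAY_TASK_ID and uses it to pick
# its simulation scenario; run_one.R is a hypothetical worker script
Rscript run_one.R "$SLURM_ARRAY_TASK_ID"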

Which approach should I use?

|                  | Approach A (multi-threaded)                | Approach B (many jobs)                          |
|------------------|--------------------------------------------|-------------------------------------------------|
| Good for         | Shared data, coordination, large matrices  | Independent runs, simulations, parameter sweeps |
| R tools          | future, mirai, parallel                    | batchtools                                      |
| Python tools     | multiprocessing, joblib                    | shell scripts, workflow managers                |
| Failure handling | Whole job fails                            | Individual job fails, rest unaffected           |
| Scaling          | Limited to one node’s cores                | Scales across the entire cluster                |

For many research workflows — especially simulation studies — Approach B is the better fit. Tools like R’s batchtools package make it easy: you define a function and a set of parameters, and batchtools submits one Slurm job per combination.

Warning: Don’t accidentally combine both approaches

If you use a many-jobs tool (like batchtools) but also use multi-threading inside each job (like future with all available cores), you can massively oversubscribe the cluster. For example, 100 batchtools jobs each spawning 96 threads means 9,600 threads competing for 1,152 cores. Keep your many-jobs workflows single-threaded unless you have a specific reason to do otherwise; if you do use threads inside each job, adjust your resource requests accordingly.
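Many numerical libraries (OpenMP code, BLAS implementations) will otherwise start one thread per core they can see on the node. A defensive sketch, assuming your code uses such libraries, is to cap them to what Slurm allocated, for example near the top of your batch script:

# Limit implicit threading to the CPUs Slurm granted this job;
# SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested,
# so fall back to 1 thread otherwise
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}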

Why can’t compute nodes access the internet?

  1. Security: Compute nodes process data. Isolating them from the internet reduces attack surface.
  2. Performance: Network traffic is reserved for cluster communication and storage.
  3. Reproducibility: Jobs shouldn’t depend on external resources that might change or disappear.

This is why you install packages on the head node. It has internet access, and your home directory is shared across all nodes via NFS (network file system).

Shared storage

Your home directory (/srv/home/<username>) is:

  • Stored on the head node’s fast NVMe RAID
  • Mounted on all compute nodes via NFS
  • Where your R packages, scripts, and data live

When you install a package on the head node, it’s immediately available on compute nodes because they’re accessing the same filesystem.

Resource sharing with Slurm

Since many users share the cluster, we need a way to fairly distribute resources. That’s Slurm’s job:

  • You request resources (cores, memory, time)
  • Slurm finds a slot for your job
  • Your job runs in isolation from others
  • Resources are released when your job ends

The next section dives deeper into how Slurm works.

Software and modules

On your laptop, you install software and it’s just there. On a cluster, multiple users may need different versions of the same software (e.g., R 4.4 vs R 4.5), and software needs to be built specifically for the cluster hardware. The solution is environment modules.

Modules let you load software into your session on demand:

module load R/4.5.3    # Makes R 4.5.3 available
module avail           # List all available software
module list            # See what you currently have loaded
module purge           # Unload everything

When you load a module, it sets up PATH, library paths, and other environment variables so the software works correctly. When you unload it, everything is cleaned up.
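If you want to see exactly what a module would change, module show prints the environment modifications it makes (the exact output depends on the modules system in use):

module show R/4.5.3    # Display the PATH and library-path changes made by this module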

Some software is installed via Spack, an HPC package manager that builds optimized scientific software. Spack packages on this cluster target the AMD Zen 3 instruction set for broad compatibility across all nodes. Spack-installed packages (like PLINK 2.0) become available after loading the Spack module:

module load spack/1.1.1       # Makes Spack packages visible
module load plink2/2.0.0-a.6.9

See Software for the full list of available software and how to request or install additional packages.

Summary

| Concept      | What it means                                 |
|--------------|-----------------------------------------------|
| Node         | One computer in the cluster                   |
| Head node    | Login/management node (has internet)          |
| Compute node | Where jobs run (no internet)                  |
| Core         | One processing unit (what you request)        |
| RAM          | Memory (request what you need)                |
| Job          | A unit of work submitted to Slurm             |
| Module       | A loadable software package (use module load) |
| Spack        | HPC package manager for optimized software    |
| NFS          | Network filesystem sharing your home dir      |