```mermaid
flowchart TD
A[You] -->|SSH| B(Head Node)
B --> C{Slurm}
C -->|Compute Jobs| D[Node 01]
C -->|Compute Jobs| E[Node 02]
C -->|Compute Jobs| F[Node 03]
```
# Understanding the Cluster

*What all those terms actually mean*
Now that you’ve run R on a compute node, let’s understand what’s actually happening behind the scenes.
## What is a cluster?
A cluster is a collection of computers (nodes) connected by a fast network that work together as one system. Instead of one powerful machine, you get many machines that can work in parallel. Unlike a laptop or shared workstation, you don’t run compute work directly on the nodes: a job manager (in our case, Slurm) distributes your work to nodes and ensures every job gets the resources (CPU, RAM, GPU) it requested.
## Nodes
A node is a single computer in the cluster. Our cluster has different types:
### Head node
- Where you land when you SSH in
- Has internet access
- Used for: logging in, installing packages, editing code, submitting jobs
- Shared by all users – don’t run heavy computations here
### Compute nodes
- Where the actual work happens
- No internet access (isolated for security and performance)
- Dedicated CPU and RAM for your jobs
- You access these through Slurm, not directly
### GPU node
- A compute node with GPUs attached
- For machine learning, certain simulations, or GPU-accelerated code
## CPUs, Cores, and Threads
These terms are often confused. Here’s the hierarchy:
```
Node
└── CPU (physical processor chip)
    └── Core (independent processing unit)
        └── Thread (virtual core, via hyperthreading)
```
**Core:** The actual processing unit that runs your code. When you request `--cpus-per-task=4`, you’re asking for 4 cores.

**Thread:** Most modern CPUs can run 2 threads per core (hyperthreading). This can help with I/O-bound tasks but doesn’t double your computing power.
**Practical advice:** For R and most data analysis, think in terms of cores. If you want to parallelize across 4 cores, request 4 CPUs; Slurm will place your job on a node that can provide them.
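Following that advice, a script can read its allotted core count from Slurm instead of hard-coding it. A minimal sketch for a bash shell: `SLURM_CPUS_PER_TASK` is set by Slurm inside a job when you use `--cpus-per-task`, and `nproc` is the fallback outside one.

```shell
#!/usr/bin/env bash
# Use the core count Slurm granted us; fall back to all local cores
# when running outside a job (e.g. while testing on a laptop).
cores="${SLURM_CPUS_PER_TASK:-$(nproc)}"
echo "Parallelizing across $cores cores"
```

An R script launched by such a job can read the same value with `Sys.getenv("SLURM_CPUS_PER_TASK")`.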
## Memory (RAM)
Each node has a fixed amount of RAM. When you request `--mem=8G`, you’re reserving 8 GB for your job.
Requesting more memory than you need wastes resources and may delay your job starting. Start conservative and increase if you get out-of-memory errors.
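To size the next request, you can check what a finished job actually used with Slurm’s accounting command `sacct`. A sketch: the job ID `12345` is a placeholder, and the guard keeps the snippet harmless on machines without Slurm.

```shell
# MaxRSS = peak memory the job actually used; ReqMem = what you asked for.
if command -v sacct >/dev/null 2>&1; then
  sacct -j 12345 --format=JobID,ReqMem,MaxRSS,Elapsed,State
else
  echo "sacct not found - run this on the cluster"
fi
```

If `MaxRSS` is far below `ReqMem`, lower `--mem` for the next run.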
## Jobs
A job is a unit of work you submit to the cluster. Slurm (the job scheduler) manages who gets what resources and when.
### Interactive jobs (`salloc`)
- You get a shell on a compute node
- Good for development, debugging, exploratory analysis
- You stay connected while the job runs
### Batch jobs (`sbatch`)
- You submit a script, then disconnect
- The script runs when resources are available
- Good for long-running or overnight computations
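Both modes take the same resource flags. Here is a sketch of a minimal batch script; the file name, resource values, and `analysis.R` are placeholders to adapt.

```shell
# Write a job script; the #SBATCH lines are directives Slurm reads
# before running the script body on a compute node.
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00

module load R/4.5.2
Rscript analysis.R
EOF
```

Submit it with `sbatch myjob.sh`; Slurm prints a job ID and runs the script once the requested resources are free.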
## Why can’t compute nodes access the internet?
- Security: Compute nodes process data. Isolating them from the internet reduces attack surface.
- Performance: Network traffic is reserved for cluster communication and storage.
- Reproducibility: Jobs shouldn’t depend on external resources that might change or disappear.
This is why you install packages on the head node. It has internet access, and your home directory is shared across all nodes via NFS (network file system).
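Because your home directory is the same NFS filesystem on every node, anything you set up on the head node (installed packages, scripts, data) is immediately visible inside your jobs. A quick sketch; the file name is arbitrary.

```shell
# On the head node: create a file in your NFS-shared home directory.
echo "hello from the head node" > "$HOME/nfs-demo.txt"
# Later, inside any compute job, the same path shows the same file:
cat "$HOME/nfs-demo.txt"
```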
## Resource sharing with Slurm
Since many users share the cluster, we need a way to fairly distribute resources. That’s Slurm’s job:
- You request resources (cores, memory, time)
- Slurm finds a slot for your job
- Your job runs in isolation from others
- Resources are released when your job ends
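Two commands are worth knowing while your job moves through that pipeline: `squeue` lists your jobs (and shows why pending ones are waiting), and `sinfo` shows partitions and node states. A sketch, guarded so it is harmless off-cluster:

```shell
if command -v squeue >/dev/null 2>&1; then
  squeue -u "$USER"   # your pending and running jobs
  sinfo               # partitions and node availability
else
  echo "Slurm tools not found - run this on the cluster"
fi
```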
The next section dives deeper into how Slurm works.
## Software and modules
On your laptop, you install software and it’s just there. On a cluster, multiple users may need different versions of the same software (e.g., R 4.4 vs R 4.5), and software needs to be built specifically for the cluster hardware. The solution is environment modules.
Modules let you load software into your session on demand:
```shell
module load R/4.5.2   # Makes R 4.5.2 available
module avail          # List all available software
module list           # See what you currently have loaded
module purge          # Unload everything
```

When you load a module, it sets up `PATH`, library paths, and other environment variables so the software works correctly. When you unload it, everything is cleaned up.
Some software is installed via Spack, an HPC package manager that builds software optimized for the cluster’s AMD Zen3 hardware. Spack-installed packages (like PLINK 2.0) become available after loading the Spack module:
```shell
module load spack/1.1.1          # Makes Spack packages visible
module load plink2/2.0.0-a.6.9
```

See Software for the full list of available software and how to request or install additional packages.
## Summary
| Concept | What it means |
|---|---|
| Node | One computer in the cluster |
| Head node | Login/management node (has internet) |
| Compute node | Where jobs run (no internet) |
| Core | One processing unit (what you request) |
| RAM | Memory (request what you need) |
| Job | A unit of work submitted to Slurm |
| Module | A loadable software package (use `module load`) |
| Spack | HPC package manager for optimized software |
| NFS | Network filesystem sharing your home dir |