Reproducible Pipelines with targets
Automated, scalable workflows using {targets} and {crew}
The targets package brings pipeline-based workflow management to R. Instead of sourcing scripts in order and hoping nothing breaks, you define your analysis as a directed graph of targets (steps), and targets figures out what needs to run, skips what’s already up to date, and optionally runs steps in parallel.
Combined with crew and crew.cluster, targets can distribute work across Slurm compute nodes — making it a powerful framework for analyses that are too large for a single node or that you want to run reproducibly.
For straightforward simulation studies and benchmark experiments, {batchtools} is the recommended starting point — it’s simpler and many users are already familiar with it. targets shines when your workflow has dependencies between steps, when you want automatic caching (only re-run what changed), or when you need a mix of different resource types (CPU + GPU targets). If you’re not sure which to use, start with batchtools.
Why targets?
If your workflow looks like this:
# 01-clean.R
# 02-model.R
# 03-plot.R
# ... run them in order and pray

Then targets is for you. Key benefits:
- Skips work that’s already done: Change one model? Only that model and its downstream targets re-run — not the whole pipeline.
- Dependency tracking: targets knows which functions and data each step depends on. Change a function → all affected steps automatically re-run.
- Reproducibility: The pipeline definition is the documentation. Anyone can see exactly what runs, in what order, with what inputs.
- Parallel execution: Steps that don’t depend on each other run simultaneously — on one node or across many.
- Built-in data management: Results are cached automatically. No more juggling .rds files.
Getting started
Install packages
On the head node (which has internet access):
install.packages(c("targets", "tarchetypes", "crew", "crew.cluster"))

Project structure
A typical targets project looks like:
my-analysis/
├── _targets.R # Pipeline definition (the main file)
├── R/ # Your functions
│ ├── clean.R
│ ├── model.R
│ └── plot.R
├── data/ # Raw input data
└── _targets/ # Auto-generated cache (don't edit)
The key idea: put your logic in functions (in R/), and define the pipeline (what calls what, with what arguments) in _targets.R.
Defining a pipeline
_targets.R is a regular R script that returns a list of targets:
# _targets.R
library(targets)
# Source all functions in R/
tar_source()
list(
# Step 1: Load raw data
tar_target(raw_data, read_data("data/input.csv")),
# Step 2: Clean it
tar_target(clean_data, clean_and_filter(raw_data)),
# Step 3: Fit model
tar_target(model_fit, fit_model(clean_data)),
# Step 4: Create summary table
tar_target(summary_table, summarize_results(model_fit)),
# Step 5: Plot
tar_target(diagnostic_plot, make_diagnostic_plot(model_fit))
)

targets automatically detects that clean_data depends on raw_data, model_fit depends on clean_data, and so on. Steps 4 and 5 are independent of each other and can run in parallel.
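For reference, the functions the pipeline calls live in R/. A minimal sketch (the bodies here are hypothetical placeholders — substitute your own logic):

# R/clean.R
read_data <- function(path) {
  read.csv(path)
}

clean_and_filter <- function(data) {
  data[complete.cases(data), ]
}

# R/model.R (hypothetical model)
fit_model <- function(data) {
  lm(y ~ ., data = data)
}

summarize_results() and make_diagnostic_plot() follow the same pattern: plain functions that take a target's value and return a new one.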
Running the pipeline
library(targets)
# Visualize the pipeline (great for checking your setup)
tar_visnetwork()
# Run it
tar_make()
# Read a result
tar_read(model_fit)

After the first run, modify a function (say fit_model()) and run tar_make() again — only the affected targets re-run.
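Concretely, with the pipeline above, editing fit_model() invalidates model_fit and everything downstream of it:

tar_outdated()  # "model_fit" "summary_table" "diagnostic_plot"
tar_make()      # re-runs only those three; raw_data and clean_data are skipped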
Dynamic branching
One of the most useful features for HPC work is dynamic branching: running the same step over many inputs, like fitting a model for each outcome or running an analysis per dataset.
# _targets.R
library(targets)
tar_source()
list(
tar_target(datasets, list_dataset_paths("data/")),
# Map over datasets — creates one branch per file
tar_target(
results,
analyze_dataset(datasets),
pattern = map(datasets)
),
# Combine all branch results
tar_target(combined, bind_results(results))
)

Each branch is an independent unit of work that can run in parallel. If you have 50 datasets and 50 cores available, all 50 analyses run simultaneously.
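Individual branches can also be read back without loading the rest — useful when one of many datasets misbehaves:

tar_read(results, branches = 1)  # value of the first branch only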
Scaling with crew on the cluster
By default, tar_make() runs targets sequentially in the local R process. For real workloads, you want parallel execution — and on a cluster, you want that parallelism to span compute nodes.
Local parallelism with crew
For parallelism within a single node (e.g. in an interactive session or batch job), use a crew controller:
# _targets.R
library(targets)
library(crew)
tar_source()
tar_option_set(
controller = crew_controller_local(
name = "local",
workers = as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "4"))
)
)
list(
# ... your targets
)

Then run the pipeline in a Slurm job with enough cores:
module load R/4.5.3
salloc --cpus-per-task=8 --mem=32G --time=04:00:00
R -e 'targets::tar_make()'

Multi-node parallelism with crew.cluster
This is where things get powerful. crew.cluster provides a Slurm controller that automatically submits worker jobs to compute nodes. Your coordinator R session (which can run on the head node or in a lightweight Slurm job) dispatches work across the cluster.
# _targets.R
library(targets)
library(crew.cluster)
tar_source()
tar_option_set(
controller = crew_controller_slurm(
name = "slurm",
workers = 20,
seconds_idle = 120, # Workers shut down after 2 min idle
options_cluster = crew_options_slurm(
cpus_per_task = 2, # Each worker gets 2 cores
memory_gigabytes_per_cpu = 4,
time_minutes = 60,
log_output = "logs/worker_%j.log",
log_error = "logs/worker_%j.log"
)
)
)
list(
tar_target(raw_data, load_all_data()),
# Branching needs an upstream target to map over (hypothetical outcome names)
tar_target(outcome, c("outcome_a", "outcome_b")),
tar_target(
model_result,
fit_model(raw_data, outcome),
pattern = map(outcome)
),
tar_target(report, render_report(model_result))
)

Each crew worker is a Slurm job that inherits the environment of the coordinator process — including PATH, loaded modules, and other environment variables. This means you do not need script_lines to set up R. Just make sure to module load R before starting your coordinator R session, and the workers will automatically have R on their PATH.
When you run tar_make():
- crew.cluster submits up to 20 Slurm jobs as workers
- Each worker picks up targets as they become ready
- Idle workers shut down automatically after 2 minutes
- Results flow back to the coordinator
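While the pipeline runs, you can watch the worker jobs come and go from a shell:

squeue --me   # crew workers show up as ordinary Slurm jobs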
You can run the coordinator from the head node (the coordinator itself is lightweight — it just orchestrates):
# On the head node — the heavy work happens on compute nodes
module load R/4.5.3
R -e 'targets::tar_make()'

Or submit the coordinator itself as a Slurm job:
#!/bin/bash
#SBATCH --job-name=targets-coordinator
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=08:00:00
#SBATCH --qos=long
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
Rscript -e 'targets::tar_make()'

Submit with R loaded so the job inherits the environment:
module load R/4.5.3
sbatch coordinator.sh

Running the coordinator on the head node is fine because it is lightweight — it only sends and receives tasks. If your coordinator starts using significant CPU or memory, submit it as a Slurm job instead.
Controller groups
For heterogeneous workloads (e.g. some targets need GPUs, others don’t), use multiple controllers:
# _targets.R
library(targets)
library(crew.cluster)
tar_source()
tar_option_set(
controller = crew_controller_group(
crew_controller_slurm(
name = "cpu",
workers = 16,
seconds_idle = 120,
options_cluster = crew_options_slurm(
cpus_per_task = 4,
memory_gigabytes_per_cpu = 4,
time_minutes = 60
)
),
crew_controller_slurm(
name = "gpu",
workers = 1,
seconds_idle = 300,
options_cluster = crew_options_slurm(
cpus_per_task = 8,
memory_gigabytes_per_cpu = 8,
time_minutes = 120,
partition = "gpu",
script_lines = c(
"#SBATCH --gres=gpu:1"
)
)
)
)
)
list(
tar_target(data, load_data()),
tar_target(features, extract_features(data), resources = tar_resources(
crew = tar_resources_crew(controller = "cpu")
)),
tar_target(deep_model, train_deep_model(features), resources = tar_resources(
crew = tar_resources_crew(controller = "gpu")
))
)

partition and --gres
On this cluster, GPU jobs require both partition = "gpu" (to route to the GPU node) and #SBATCH --gres=gpu:N (to allocate GPUs). Without --gres, your job runs on the GPU node but CUDA reports no devices. See GPU Jobs for details.
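For a quick sanity check: Slurm exports CUDA_VISIBLE_DEVICES when GPUs are allocated via --gres, so a hypothetical helper called inside a GPU target can confirm the allocation worked:

# R/gpu.R — hypothetical: confirm the target actually sees a GPU
check_gpu <- function() {
  Sys.getenv("CUDA_VISIBLE_DEVICES", unset = "none — was --gres set?")
}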
Monitoring pipelines
Progress
# In another R session while the pipeline is running
tar_progress() # Table of target statuses
tar_watch() # Live dashboard in the browser
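When a crew controller is configured, tar_crew() additionally summarizes the workers themselves:

tar_crew() # Per-worker summary (launches, runtime, targets completed)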
Visualize what’s outdated

# See what needs to re-run
tar_visnetwork()
# Outdated targets only
tar_outdated()

Logs
With crew.cluster, worker logs go to the path specified in log_output (in crew_options_slurm()). Check them for errors:
cat logs/worker_12345.log

Parallelism within targets
Each crew worker runs one target at a time. If you configure cpus_per_task = 8, those 8 CPUs are available to the single target currently running on that worker. This means you can speed up individual targets (e.g. multithreaded model fits) without oversubscribing — as long as the target’s parallelism matches the worker’s allocation.
The cluster task prolog automatically sets OMP_NUM_THREADS to match SLURM_CPUS_PER_TASK, so BLAS-backed linear algebra picks up the right thread count on its own. For everything else, read SLURM_CPUS_PER_TASK in your functions. See Batch Jobs > Slurm environment variables for the full list of variables available inside jobs.
Example 1: Multithreaded model fitting (ranger, xgboost)
Packages like ranger and xgboost have explicit thread count arguments:
# R/model.R
fit_ranger <- function(data) {
n_threads <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "1"))
ranger::ranger(y ~ ., data = data, num.threads = n_threads)
}
fit_xgboost <- function(dtrain) {
n_threads <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "1"))
xgboost::xgb.train(
params = list(nthread = n_threads, objective = "reg:squarederror"),
data = dtrain, nrounds = 100
)
}

# _targets.R — workers get 8 CPUs each, ranger uses all 8
tar_option_set(
controller = crew_controller_slurm(
workers = 10,
options_cluster = crew_options_slurm(
cpus_per_task = 8,
memory_gigabytes_per_cpu = 2,
time_minutes = 60
)
)
)

Example 2: BLAS-heavy computation (matrix operations, glmnet)
For packages that rely on BLAS/LAPACK (matrix multiplication, linear algebra), you don’t need to do anything. The task prolog sets OMP_NUM_THREADS to the number of allocated CPUs, and FlexiBLAS/OpenBLAS automatically uses that many threads.
# This just works — BLAS uses all allocated CPUs
fit_glmnet <- function(x, y) {
glmnet::cv.glmnet(x, y, nfolds = 10)
}

Example 3: mclapply / future within a target
If your target uses R-level parallelism (e.g. mclapply or future), match the worker count to the allocation:
# R/simulation.R
run_simulation <- function(params) {
n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "1"))
results <- parallel::mclapply(params, function(p) {
# Each sub-task is single-threaded
simulate_one(p)
}, mc.cores = n_cores)
do.call(rbind, results)
}

Example 4: Single-threaded targets (many small tasks)
If each target is fast and single-threaded (e.g. processing one dataset), use many workers with 1 CPU each. The parallelism comes from running many targets concurrently, not from threading within each:
# _targets.R — many lightweight workers
tar_option_set(
controller = crew_controller_slurm(
workers = 50,
seconds_idle = 60,
options_cluster = crew_options_slurm(
cpus_per_task = 1,
memory_gigabytes_per_cpu = 4,
time_minutes = 30
)
)
)
list(
tar_target(file_paths, list.files("data/", full.names = TRUE)),
tar_target(result, process_file(file_paths), pattern = map(file_paths))
)

Choosing worker size
| Workload | cpus_per_task | workers | Strategy |
|---|---|---|---|
| Many small independent tasks | 1 | 20–100 | Parallelism across targets |
| Multithreaded model fits | 4–16 | 5–20 | Parallelism within each target |
| Whole-node computation | 96 | 1–12 | One worker per node |
| Mixed pipeline | Use controller groups | Varies | See Controller groups |
With workers = 40 and cpus_per_task = 24, you’re requesting 960 CPUs — nearly the full cluster (12 nodes × 96 cores). Be realistic about how many workers you actually need to keep busy. The seconds_idle parameter ensures unused workers release resources.
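A quick back-of-envelope check before launching (plain arithmetic):

workers <- 40
cpus_per_task <- 24
workers * cpus_per_task  # 960 CPUs — 10 full 96-core nodes out of 12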
SLURM_CPUS_PER_TASK, not hardcoded values
Always read SLURM_CPUS_PER_TASK instead of hardcoding thread counts. This way your functions work correctly regardless of the worker configuration — whether you run locally with 4 cores or on a worker with 24.
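A small helper keeps this consistent across all your functions (a sketch; slurm_cpus() is a hypothetical name):

# R/utils.R — read the Slurm CPU allocation, defaulting to 1 outside Slurm
slurm_cpus <- function(default = 1L) {
  as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = as.character(default)))
}

# Usage inside a target function:
# ranger::ranger(y ~ ., data = data, num.threads = slurm_cpus())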
Best practices for the cluster
Avoid packages in tar_option_set with crew.cluster
When using crew_controller_slurm, do not preload packages via tar_option_set(packages = ...). This can crash crew workers due to a memory corruption issue between certain compiled R packages and the nanonext/NNG library used by crew.
Instead, load packages inside your target functions using require():
# DON'T do this with crew.cluster:
tar_option_set(packages = c("ranger", "glmnet", "data.table"))
# DO this — load inside the function:
fit_model <- function(data) {
require(ranger)
require(data.table)
ranger::ranger(y ~ ., data = data)
}

This is safe because packages load after the worker’s network connection is fully established. Lightweight packages like data.table alone are usually fine, but heavy compiled packages (ranger, glmnet, mlr3) can trigger the crash.
Resource-appropriate workers
Match worker resources to what your targets actually need:
crew_controller_slurm(
name = "light",
workers = 20,
options_cluster = crew_options_slurm(
cpus_per_task = 1, # Single-threaded targets
memory_gigabytes_per_cpu = 2,
time_minutes = 30
)
)

Don’t request 8 cores per worker if each target is single-threaded — you’ll waste resources and may wait longer in the queue.
Mind the QoS time limits
Worker time_minutes must fit within the QoS time limit. Workers default to the normal QoS. If your workers need more than 1 day, set an appropriate QoS via script_lines:
options_cluster = crew_options_slurm(
time_minutes = 2880, # 2 days
script_lines = c(
"#SBATCH --qos=long" # long QoS allows up to 7 days
)
)

See Slurm Basics > QoS for the full table of limits.
Use seconds_idle wisely
seconds_idle controls how long a worker stays alive waiting for new work. Too short and workers churn (repeatedly submitting and waiting in queue). Too long and idle workers block resources. 60–300 seconds is usually a good range.
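For example, a middle-of-that-range setting (a sketch; tune to your typical queue wait and target runtime):

crew_controller_slurm(
  name = "slurm",
  workers = 20,
  seconds_idle = 120  # workers exit after 2 idle minutes, freeing their allocation
)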
Keep lightweight targets local
Not every target needs to be dispatched to a Slurm worker. For quick operations like reading a small file, combining results, or rendering a plot, use deployment = "main" to run them in the coordinator process — this avoids the overhead of submitting and waiting for a Slurm job:
tar_target(summary_plot, make_plot(results), deployment = "main")

Functions, not scripts
targets works best when your analysis logic lives in functions. Instead of:
# Bad: script-style target
tar_target(clean, {
data <- read.csv("data/raw.csv")
data <- data[data$age > 18, ]
data$bmi <- data$weight / (data$height/100)^2
data
})Write:
# Good: function-based
# R/clean.R
clean_data <- function(path) {
data <- read.csv(path)
data <- data[data$age > 18, ]
data$bmi <- data$weight / (data$height / 100)^2
data
}
# _targets.R
tar_target(clean, clean_data("data/raw.csv"))

This makes your analysis logic easier to test, debug, and reuse — and targets can track function changes for automatic invalidation.
Further reading
- targets manual — comprehensive book by the package author
- crew documentation — controller framework
- crew.cluster Slurm guide — Slurm-specific setup
- targets use cases — example pipelines