Reproducible Pipelines with targets
Automated, scalable workflows using {targets} and {crew}
The targets package brings pipeline-based workflow management to R. Instead of sourcing scripts in order and hoping nothing breaks, you define your analysis as a directed graph of targets (steps), and targets figures out what needs to run, skips what’s already up to date, and optionally runs steps in parallel.
Combined with crew and crew.cluster, targets can distribute work across Slurm compute nodes — making it a powerful framework for analyses that are too large for a single node or that you want to run reproducibly.
For straightforward simulation studies and benchmark experiments, {batchtools} is the recommended starting point — it’s simpler and many users are already familiar with it. targets shines when your workflow has dependencies between steps, when you want automatic caching (only re-run what changed), or when you need a mix of different resource types (CPU + GPU targets). If you’re not sure which to use, start with batchtools.
Why targets?
If your workflow looks like this:
# 01-clean.R
# 02-model.R
# 03-plot.R
# ... run them in order and pray

Then targets is for you. Key benefits:
- Skips work that’s already done: Change one model? Only that model and its downstream targets re-run — not the whole pipeline.
- Dependency tracking: targets knows which functions and data each step depends on. Change a function → all affected steps automatically re-run.
- Reproducibility: The pipeline definition is the documentation. Anyone can see exactly what runs, in what order, with what inputs.
- Parallel execution: Steps that don’t depend on each other run simultaneously — on one node or across many.
- Built-in data management: Results are cached automatically. No more juggling .rds files.
Getting started
Install packages
On the head node (which has internet access):
install.packages(c("targets", "tarchetypes", "crew", "crew.cluster"))

Project structure
A typical targets project looks like:
my-analysis/
├── _targets.R # Pipeline definition (the main file)
├── R/ # Your functions
│ ├── clean.R
│ ├── model.R
│ └── plot.R
├── data/ # Raw input data
└── _targets/ # Auto-generated cache (don't edit)
The key idea: put your logic in functions (in R/), and define the pipeline (what calls what, with what arguments) in _targets.R.
Defining a pipeline
_targets.R is a regular R script that returns a list of targets:
# _targets.R
library(targets)
# Source all functions in R/
tar_source()
list(
# Step 1: Load raw data
tar_target(raw_data, read_data("data/input.csv")),
# Step 2: Clean it
tar_target(clean_data, clean_and_filter(raw_data)),
# Step 3: Fit model
tar_target(model_fit, fit_model(clean_data)),
# Step 4: Create summary table
tar_target(summary_table, summarize_results(model_fit)),
# Step 5: Plot
tar_target(diagnostic_plot, make_diagnostic_plot(model_fit))
)

targets automatically detects that clean_data depends on raw_data, model_fit depends on clean_data, and so on. Steps 4 and 5 are independent of each other and can run in parallel.
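For reference, the functions the pipeline calls live in R/. A minimal sketch (the bodies here are hypothetical placeholders — substitute your own logic):

# R/clean.R
read_data <- function(path) {
  read.csv(path)
}

clean_and_filter <- function(data) {
  data[complete.cases(data), ]
}

# R/model.R (hypothetical model)
fit_model <- function(data) {
  lm(y ~ ., data = data)
}

summarize_results() and make_diagnostic_plot() follow the same pattern: plain functions that take a target's value and return a new one.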
Running the pipeline
library(targets)
# Visualize the pipeline (great for checking your setup)
tar_visnetwork()
# Run it
tar_make()
# Read a result
tar_read(model_fit)

After the first run, modify a function (say fit_model()) and run tar_make() again — only the affected targets re-run.
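Concretely, with the pipeline above, editing fit_model() invalidates model_fit and everything downstream of it:

tar_outdated()  # "model_fit" "summary_table" "diagnostic_plot"
tar_make()      # re-runs only those three; raw_data and clean_data are skipped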
Dynamic branching
One of the most useful features for HPC work is dynamic branching: running the same step over many inputs, like fitting a model for each outcome or running an analysis per dataset.
# _targets.R
library(targets)
tar_source()
list(
tar_target(datasets, list_dataset_paths("data/")),
# Map over datasets — creates one branch per file
tar_target(
results,
analyze_dataset(datasets),
pattern = map(datasets)
),
# Combine all branch results
tar_target(combined, bind_results(results))
)

Each branch is an independent unit of work that can run in parallel. If you have 50 datasets and 50 cores available, all 50 analyses run simultaneously.
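Individual branches can also be read back without loading the rest — useful when one of many datasets misbehaves:

tar_read(results, branches = 1)  # value of the first branch only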
Scaling with crew on the cluster
By default, tar_make() runs targets sequentially in the local R process. For real workloads, you want parallel execution — and on a cluster, you want that parallelism to span compute nodes.
Local parallelism with crew
For parallelism within a single node (e.g. in an interactive session or batch job), use a crew controller:
# _targets.R
library(targets)
library(crew)
tar_source()
tar_option_set(
controller = crew_controller_local(
name = "local",
workers = as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "4"))
)
)
list(
# ... your targets
)

Then run the pipeline in a Slurm job with enough cores:
module load R/4.5.3
salloc --cpus-per-task=8 --mem=32G --time=04:00:00
R -e 'targets::tar_make()'

Multi-node parallelism with crew.cluster
This is where things get powerful. crew.cluster provides a Slurm controller that automatically submits worker jobs to compute nodes. Your coordinator R session (which can run on the head node or in a lightweight Slurm job) dispatches work across the cluster.
# _targets.R
library(targets)
library(crew.cluster)
tar_source()
tar_option_set(
controller = crew_controller_slurm(
name = "slurm",
workers = 20,
seconds_idle = 120, # Workers shut down after 2 min idle
options_cluster = crew_options_slurm(
cpus_per_task = 2, # Each worker gets 2 cores
memory_gigabytes_per_cpu = 4,
time_minutes = 60,
log_output = "logs/worker_%j.log",
log_error = "logs/worker_%j.log"
)
)
)
list(
tar_target(raw_data, load_all_data()),
# Branching needs an upstream target to map over (hypothetical outcome names)
tar_target(outcome, c("outcome_a", "outcome_b")),
tar_target(
model_result,
fit_model(raw_data, outcome),
pattern = map(outcome)
),
tar_target(report, render_report(model_result))
)

Each crew worker is a Slurm job that inherits the environment of the coordinator process — including PATH, loaded modules, and other environment variables. This means you do not need script_lines to set up R. Just make sure to module load R before starting your coordinator R session, and the workers will automatically have R on their PATH.
When you run tar_make():
- crew.cluster submits up to 20 Slurm jobs as workers
- Each worker picks up targets as they become ready
- Idle workers shut down automatically after 2 minutes
- Results flow back to the coordinator
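While the pipeline runs, you can watch the worker jobs come and go from a shell:

squeue --me   # crew workers show up as ordinary Slurm jobs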
You can run the coordinator from the head node (the coordinator itself is lightweight — it just orchestrates):
# On the head node — the heavy work happens on compute nodes
module load R/4.5.3
R -e 'targets::tar_make()'

Or submit the coordinator itself as a Slurm job:
#!/bin/bash
#SBATCH --job-name=targets-coordinator
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=08:00:00
#SBATCH --qos=long
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
Rscript -e 'targets::tar_make()'

Submit with R loaded so the job inherits the environment:
module load R/4.5.3
sbatch coordinator.sh

Running the coordinator on the head node is fine because it is lightweight — it only sends and receives tasks. If your coordinator starts using significant CPU or memory, submit it as a Slurm job instead.
Controller groups
For heterogeneous workloads (e.g. some targets need GPUs, others don’t), use multiple controllers:
# _targets.R
library(targets)
library(crew.cluster)
tar_source()
tar_option_set(
controller = crew_controller_group(
crew_controller_slurm(
name = "cpu",
workers = 16,
seconds_idle = 120,
options_cluster = crew_options_slurm(
cpus_per_task = 4,
memory_gigabytes_per_cpu = 4,
time_minutes = 60
)
),
crew_controller_slurm(
name = "gpu",
workers = 1,
seconds_idle = 300,
options_cluster = crew_options_slurm(
cpus_per_task = 8,
memory_gigabytes_per_cpu = 8,
time_minutes = 120,
partition = "gpu",
script_lines = c(
"#SBATCH --gres=gpu:1"
)
)
)
)
)
list(
tar_target(data, load_data()),
tar_target(features, extract_features(data), resources = tar_resources(
crew = tar_resources_crew(controller = "cpu")
)),
tar_target(deep_model, train_deep_model(features), resources = tar_resources(
crew = tar_resources_crew(controller = "gpu")
))
)

partition and --gres
On this cluster, GPU jobs require both partition = "gpu" (to route to the GPU node) and #SBATCH --gres=gpu:N (to allocate GPUs). Without --gres, your job runs on the GPU node but CUDA reports no devices. See GPU Jobs for details.
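For a quick sanity check: Slurm exports CUDA_VISIBLE_DEVICES when GPUs are allocated via --gres, so a hypothetical helper called inside a GPU target can confirm the allocation worked:

# R/gpu.R — hypothetical: confirm the target actually sees a GPU
check_gpu <- function() {
  Sys.getenv("CUDA_VISIBLE_DEVICES", unset = "none — was --gres set?")
}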
Monitoring pipelines
Progress
# In another R session while the pipeline is running
tar_progress() # Table of target statuses
tar_watch() # Live dashboard in the browser
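When a crew controller is configured, tar_crew() additionally summarizes the workers themselves:

tar_crew() # Per-worker summary (launches, runtime, targets completed)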
Visualize what’s outdated

# See what needs to re-run
tar_visnetwork()
# Outdated targets only
tar_outdated()

Logs
With crew.cluster, worker logs go to the path specified in log_output (in crew_options_slurm()). Check them for errors:
cat logs/worker_12345.log

Parallelism within targets
Each crew worker runs one target at a time. If you configure cpus_per_task = 8, those 8 CPUs are available to the single target currently running on that worker. This means you can speed up individual targets (e.g. multithreaded model fits) without oversubscribing — as long as the target’s parallelism matches the worker’s allocation.
The cluster task prolog automatically sets OMP_NUM_THREADS to match SLURM_CPUS_PER_TASK, so BLAS-backed linear algebra picks up the right thread count on its own. For everything else, read SLURM_CPUS_PER_TASK in your functions. See Batch Jobs > Slurm environment variables for the full list of variables available inside jobs.
Example 1: Multithreaded model fitting (ranger, xgboost)
Packages like ranger and xgboost have explicit thread count arguments:
# R/model.R
fit_ranger <- function(data) {
n_threads <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "1"))
ranger::ranger(y ~ ., data = data, num.threads = n_threads)
}
fit_xgboost <- function(dtrain) {
n_threads <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "1"))
xgboost::xgb.train(
params = list(nthread = n_threads, objective = "reg:squarederror"),
data = dtrain, nrounds = 100
)
}

# _targets.R — workers get 8 CPUs each, ranger uses all 8
tar_option_set(
controller = crew_controller_slurm(
workers = 10,
options_cluster = crew_options_slurm(
cpus_per_task = 8,
memory_gigabytes_per_cpu = 2,
time_minutes = 60
)
)
)

Example 2: BLAS-heavy computation (matrix operations, glmnet)
For packages that rely on BLAS/LAPACK (matrix multiplication, linear algebra), you don’t need to do anything. The task prolog sets OMP_NUM_THREADS to the number of allocated CPUs, and FlexiBLAS/OpenBLAS automatically uses that many threads.
# This just works — BLAS uses all allocated CPUs
fit_glmnet <- function(x, y) {
glmnet::cv.glmnet(x, y, nfolds = 10)
}

Example 3: mclapply / future within a target
If your target uses R-level parallelism (e.g. mclapply or future), match the worker count to the allocation:
# R/simulation.R
run_simulation <- function(params) {
n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "1"))
results <- parallel::mclapply(params, function(p) {
# Each sub-task is single-threaded
simulate_one(p)
}, mc.cores = n_cores)
do.call(rbind, results)
}

Example 4: Single-threaded targets (many small tasks)
If each target is fast and single-threaded (e.g. processing one dataset), use many workers with 1 CPU each. The parallelism comes from running many targets concurrently, not from threading within each:
# _targets.R — many lightweight workers
tar_option_set(
controller = crew_controller_slurm(
workers = 50,
seconds_idle = 60,
options_cluster = crew_options_slurm(
cpus_per_task = 1,
memory_gigabytes_per_cpu = 4,
time_minutes = 30
)
)
)
list(
tar_target(file_paths, list.files("data/", full.names = TRUE)),
tar_target(result, process_file(file_paths), pattern = map(file_paths))
)

Choosing worker size
| Workload | cpus_per_task | workers | Strategy |
|---|---|---|---|
| Many small independent tasks | 1 | 20–100 | Parallelism across targets |
| Multithreaded model fits | 4–16 | 5–20 | Parallelism within each target |
| Whole-node computation | 96 | 1–12 | One worker per node |
| Mixed pipeline | Use controller groups | Varies | See Controller groups |
With workers = 40 and cpus_per_task = 24, you’re requesting 960 CPUs — nearly the full cluster (12 nodes × 96 cores). Be realistic about how many workers you actually need to keep busy. The seconds_idle parameter ensures unused workers release resources.
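A quick back-of-envelope check before launching (plain arithmetic):

workers <- 40
cpus_per_task <- 24
workers * cpus_per_task  # 960 CPUs — 10 full 96-core nodes out of 12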
SLURM_CPUS_PER_TASK, not hardcoded values
Always read SLURM_CPUS_PER_TASK instead of hardcoding thread counts. This way your functions work correctly regardless of the worker configuration — whether you run locally with 4 cores or on a worker with 24.
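A small helper keeps this consistent across all your functions (a sketch; slurm_cpus() is a hypothetical name):

# R/utils.R — read the Slurm CPU allocation, defaulting to 1 outside Slurm
slurm_cpus <- function(default = 1L) {
  as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = as.character(default)))
}

# Usage inside a target function:
# ranger::ranger(y ~ ., data = data, num.threads = slurm_cpus())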
Best practices for the cluster
Avoid packages in tar_option_set with crew.cluster
When using crew_controller_slurm, do not preload packages via tar_option_set(packages = ...). This can crash crew workers due to a memory corruption issue between certain compiled R packages and the nanonext/NNG library used by crew.
Instead, load packages inside your target functions using require():
# DON'T do this with crew.cluster:
tar_option_set(packages = c("ranger", "glmnet", "data.table"))
# DO this — load inside the function:
fit_model <- function(data) {
require(ranger)
require(data.table)
ranger::ranger(y ~ ., data = data)
}

This is safe because packages load after the worker’s network connection is fully established. Lightweight packages like data.table alone are usually fine, but heavy compiled packages (ranger, glmnet, mlr3) can trigger the crash.
Resource-appropriate workers
Match worker resources to what your targets actually need:
crew_controller_slurm(
name = "light",
workers = 20,
options_cluster = crew_options_slurm(
cpus_per_task = 1, # Single-threaded targets
memory_gigabytes_per_cpu = 2,
time_minutes = 30
)
)

Don’t request 8 cores per worker if each target is single-threaded — you’ll waste resources and may wait longer in the queue.
Mind the QoS time limits
Worker time_minutes must fit within the QoS time limit. Workers default to the normal QoS. If your workers need more than 1 day, set an appropriate QoS via script_lines:
options_cluster = crew_options_slurm(
time_minutes = 2880, # 2 days
script_lines = c(
"#SBATCH --qos=long" # long QoS allows up to 7 days
)
)

See Slurm Basics > QoS for the full table of limits.
Use seconds_idle wisely
seconds_idle controls how long a worker stays alive waiting for new work. Too short and workers churn (repeatedly submitting and waiting in queue). Too long and idle workers block resources. 60–300 seconds is usually a good range.
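For example, a middle-of-that-range setting (a sketch; tune to your typical queue wait and target runtime):

crew_controller_slurm(
  name = "slurm",
  workers = 20,
  seconds_idle = 120  # workers exit after 2 idle minutes, freeing their allocation
)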
Keep lightweight targets local
Not every target needs to be dispatched to a Slurm worker. For quick operations like reading a small file, combining results, or rendering a plot, use deployment = "main" to run them in the coordinator process — this avoids the overhead of submitting and waiting for a Slurm job:
tar_target(summary_plot, make_plot(results), deployment = "main")

Functions, not scripts
targets works best when your analysis logic lives in functions. Instead of:
# Bad: script-style target
tar_target(clean, {
data <- read.csv("data/raw.csv")
data <- data[data$age > 18, ]
data$bmi <- data$weight / (data$height/100)^2
data
})Write:
# Good: function-based
# R/clean.R
clean_data <- function(path) {
data <- read.csv(path)
data <- data[data$age > 18, ]
data$bmi <- data$weight / (data$height / 100)^2
data
}
# _targets.R
tar_target(clean, clean_data("data/raw.csv"))

This makes your analysis logic easier to test, debug, and reuse — and targets can track function changes for automatic invalidation.
Further reading
- targets manual — comprehensive book by the package author
- crew documentation — controller framework
- crew.cluster Slurm guide — Slurm-specific setup
- targets use cases — example pipelines