5  Performance

If your targets pipeline runs slowly or consumes too many resources, you can make adjustments to improve efficiency. In addition, targets has tools to monitor the runtime progress of a pipeline.

Summary
  • Choose fast persistent storage for the data store.
  • Choose efficient data storage formats for large targets.
  • For high-memory pipelines, consider memory = "transient" and garbage_collection = TRUE in tar_option_set() or tar_target().
  • For highly parallel pipelines, consider storage = "worker" and retrieval = "worker" in tar_option_set() or tar_target().
  • To avoid wasting computational resources, consider setting deployment = "main" in tar_target() for light targets that do not need to run on parallel workers.
  • For high-overhead pipelines with thousands of targets, consider grouping the same amount of work into a smaller number of targets.
  • targets has functions to monitor the progress of the pipeline.
  • Profiling with the proffer package can help discover bottlenecks.

5.1 Data location

The data store is a folder on a computer, usually at the root of your project, and targets makes innumerable quick modifications to it over the course of a pipeline. For best performance, the data store should live on fast, high-performance storage hardware on your local computer. Any slowdown due to disk issues or latency due to a slow network will severely impact the performance of your pipeline. Mounted network drives are particularly egregious. At the same time, the files in the data store are important and must remain available for subsequent runs of the pipeline, so tempdir() is not suitable.

5.2 Efficient storage formats

The default data storage format is RDS, which can be slow and bulky for large data. For large data pipelines, consider alternative formats to more efficiently store and manage your data. Set the storage format using tar_option_set() or tar_target():

tar_option_set(format = "qs")

Some formats such as "qs" work on all kinds of data, whereas others like "feather" work only on data frames. Most non-default formats store the data faster and in smaller files than the default "rds" format, but they require extra packages to be installed. For example, format = "qs" requires the qs package, and format = "feather" requires the arrow package.
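
For example, here is a minimal sketch of a _targets.R file that sets "qs" as the project-wide default and overrides it with "feather" for a single data frame target. The functions create_large_object() and get_data_frame() are hypothetical placeholders for your own code:

library(targets)
tar_option_set(format = "qs") # default storage format; requires the qs package
list(
  tar_target(big_object, create_large_object()), # stored with qs
  tar_target(big_table, get_data_frame(), format = "feather") # requires arrow
)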

For extremely large datasets that cannot fit into memory, consider format = "file" to treat the data as a file on disk. Downstream targets are free to load only the subsets of the data they need.
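
As a minimal sketch, the upstream target below saves the data to disk and returns the file path, and the downstream target reads only the columns it needs. write_big_file() and read_columns() are hypothetical helper functions; with format = "file", the command of the upstream target must return the path(s) of the file(s) it creates:

library(targets)
list(
  # write_big_file() is assumed to save the data and return "data/big.csv".
  tar_target(big_file, write_big_file("data/big.csv"), format = "file"),
  # read_columns() is assumed to load only the requested columns.
  tar_target(subset, read_columns(big_file, c("id", "outcome")))
)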

5.3 Memory

By default, the pipeline retains targets in memory while it is running. In large data workloads, this could consume too much computer memory and overwhelm the local R session. The solution is simple: in tar_option_set() or tar_target() in the _targets.R file, activate transient memory and garbage collection:

tar_option_set(memory = "transient", garbage_collection = TRUE)

These options tell tar_make() to remove data from the R environment as soon as possible and run garbage collection to make sure the data is really gone from memory (but still in storage). This memory cleanup phase happens once per target.

As with everything performance-related, there is a cost. With transient memory and garbage collection, the pipeline reads data from storage far more often. These data reads take additional time, and if you use cloud storage, they could incur additional monetary charges. In addition, garbage collection is usually a slow operation, and repeated garbage collections could slow down a pipeline with thousands of targets. Please think about the tradeoffs for your specific use case.
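
One way to balance this tradeoff is to override the memory settings for individual targets. Below is a minimal sketch that keeps memory transient by default but holds one frequently reused target in memory. build_lookup_table() is a hypothetical user-defined function:

library(targets)
tar_option_set(memory = "transient", garbage_collection = TRUE)
list(
  # Many downstream targets reuse this lookup table, so keep it in memory.
  tar_target(lookup, build_lookup_table(), memory = "persistent")
)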

And as mentioned previously, format = "file" treats a target as a file path, and the data in the file is not automatically loaded into memory. This may be useful for larger-than-memory files. Downstream targets are free to load only strategic subsets of the data file.

5.4 Parallel workers and data

By default, the main controlling R process stores and retrieves the data. So in large parallel data pipelines, tar_make_clustermq() and tar_make_future() may bottleneck at the data management phase. Solution: if all parallel workers have access to the local data store, you can make those workers store and retrieve the data instead of putting it all on the main controlling R process. In tar_target() or tar_option_set() in the _targets.R file, activate worker storage and retrieval:

tar_option_set(storage = "worker", retrieval = "worker")

If the workers do not have access to the local data store, you can still set storage = "worker" and retrieval = "worker" if you use cloud storage to store and retrieve your data.
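
These settings can also vary from target to target. The sketch below lets the heavy targets store and retrieve their own data on the workers while a lightweight target keeps the default main-process behavior. get_dataset(), fit_model(), and describe() are hypothetical user-defined functions:

library(targets)
tar_option_set(storage = "worker", retrieval = "worker")
list(
  tar_target(dataset, get_dataset()),
  tar_target(model, fit_model(dataset)),
  # A small text summary is cheap to handle on the main process.
  tar_target(description, describe(model), storage = "main", retrieval = "main")
)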

5.5 Local targets

In tar_make_clustermq(), the persistent workers launch as soon as a target needs them, and they keep running until no more targets need them. In addition, tar_make_future() submits a new job for every target that needs one. Both behaviors could waste computational resources. For targets that run quickly and cheaply, consider setting deployment = "main" in tar_target():

tar_target(dataset, get_dataset(), deployment = "main")
tar_target(summary, compute_summary_statistics(dataset), deployment = "main")

deployment = "main" says to run the target on the main controlling process instead of a parallel worker. If the target is upstream, then deployment = "main" avoids launching persistent workers too early. If the target is downstream, deployment = "main" allows persistent workers to safely shut down earlier. In the case of transient workers, deployment = "main" avoids the overhead and cost of submitting an unnecessary job or background process.

For targets that really do need parallel workers, make sure deployment = "worker" (the default).

tar_target(model, run_machine_learning_model(dataset), deployment = "worker")

The deployment argument of tar_option_set() controls the default deployment argument of subsequent calls to tar_target().
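
For example, here is a minimal sketch that makes "main" the default and opts a single expensive target back in to a parallel worker, using the same hypothetical functions as above:

library(targets)
tar_option_set(deployment = "main")
list(
  tar_target(dataset, get_dataset()),
  tar_target(summary, compute_summary_statistics(dataset)),
  # Only the expensive model runs on a parallel worker.
  tar_target(model, run_machine_learning_model(dataset), deployment = "worker")
)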

5.6 Many targets

A pipeline with too many targets will begin to slow down. You may notice a minor slowdown at about 1000 targets and a more significant one at around 5000 or 10000 targets. This happens because each target needs to check its data, decide whether it needs to rerun, load its upstream dependencies from memory if applicable, and store its data after running. The overhead of these actions adds up.

To reduce overhead, consider consolidating the work into a smaller number of targets. Each target is a cached operation, and not every step of the pipeline needs to be cached at a perfect level of granularity. See the sections on what a target should do and how much a target should do.

5.6.1 Batching

Simulation studies and other iterative stochastic pipelines may need to run thousands of independent random replications. For these pipelines, consider batching to reduce the number of targets while preserving the number of replications. In batching, each batch is a dynamic branch target that performs a subset of the replications. For 1000 replications, you might want 40 batches of 25 replications each, 10 batches with 100 replications each, or a different balance depending on the use case. Functions tarchetypes::tar_rep(), tarchetypes::tar_map_rep(), and stantargets::tar_stan_mcmc_rep_summary() are examples of target factories that set up the batching structure without needing to understand dynamic branching.
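
For example, here is a minimal sketch of a _targets.R file with tarchetypes::tar_rep(), where simulate_once() is a hypothetical user-defined function that runs one replication and returns a one-row data frame:

library(targets)
library(tarchetypes)
list(
  # 40 batches of 25 replications each = 1000 total simulations,
  # but only 40 targets of overhead.
  tar_rep(simulations, simulate_once(), batches = 40, reps = 25)
)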

5.7 Monitoring the pipeline

Even the most efficient targets pipelines can take time to complete because the user-defined tasks themselves are slow. There are convenient ways to monitor the progress of a running pipeline:

  1. tar_poll() continuously refreshes a text summary of runtime progress in the R console. Run it in a new R session at the project root directory. (Only supported in targets version 0.3.1.9000 and higher.)
  2. tar_visnetwork(), tar_progress_summary(), tar_progress_branches(), and tar_progress() show runtime information at a single moment in time.
  3. tar_watch() launches a Shiny app that automatically refreshes the graph every few seconds. Try it out in the example below.
# Define an example target script file with a slow pipeline.
library(targets)
tar_script({
  sleep_run <- function(...) {
    Sys.sleep(10)
  }
  list(
    tar_target(settings, sleep_run()),
    tar_target(data1, sleep_run(settings)),
    tar_target(data2, sleep_run(settings)),
    tar_target(data3, sleep_run(settings)),
    tar_target(model1, sleep_run(data1)),
    tar_target(model2, sleep_run(data2)),
    tar_target(model3, sleep_run(data3)),
    tar_target(figure1, sleep_run(model1)),
    tar_target(figure2, sleep_run(model2)),
    tar_target(figure3, sleep_run(model3)),
    tar_target(conclusions, sleep_run(c(figure1, figure2, figure3)))
  )
})

# Launch the app in a background process.
# You may need to refresh the browser if the app is slow to start.
# The graph automatically refreshes every 10 seconds.
tar_watch(seconds = 10, outdated = FALSE, targets_only = TRUE)

# Now run the pipeline and watch the graph change.
px <- tar_make()

tar_watch_ui() and tar_watch_server() make this functionality available to other apps through a Shiny module.

Unfortunately, none of these options can tell you whether any parallel workers or external processes are actually still alive. You can monitor local processes with a utility like top or htop, and traditional HPC schedulers like SLURM and SGE support their own polling utilities such as squeue and qstat. tar_process() and tar_pid() get the process ID of the main R process that last attempted to run the pipeline.
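
For example, here is a minimal sketch that checks whether the most recent tar_make() process is still alive, assuming the ps package is installed:

# Look up the process ID of the last tar_make() and poll it with ps.
pid <- targets::tar_pid()
handle <- ps::ps_handle(pid) # errors if the process no longer exists
ps::ps_is_running(handle)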

5.8 Profiling

The first sections of this chapter describe quick tips and tricks to improve the performance of a pipeline. If these workarounds fail, then before putting more effort into optimization, it is best to empirically confirm why the code is slow in the first place. Profiling measures where a running program spends its time, and it can detect computational bottlenecks. Follow these steps to profile a targets pipeline.

  1. Install the proffer R package and its dependencies.
  2. Run proffer::pprof(tar_make(callr_function = NULL)) on your project, as in the sketch after this list.
  3. When a web browser pops up with pprof, select the flame graph and screenshot it.
  4. Post the flame graph, along with any code and data you can share, to the targets package issue tracker. The maintainer will have a look and try to make the package faster for your use case if speedups are possible.
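
Here is a minimal sketch of step 2. callr_function = NULL runs the pipeline in the current R session so proffer can observe it:

library(targets)
library(proffer)
# Profile a full run of the pipeline and open the pprof flame graph.
px <- pprof(tar_make(callr_function = NULL))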