Chapter 5 Best practices

This chapter describes additional best practices for developing and maintaining targets-powered projects.

5.1 How to define good targets

Targets are high-level steps of the workflow that run the work you define in your functions. A target runs some R code and saves the returned R object to storage, usually a single file inside _targets/objects/.

5.1.1 What a target should do

Like a good function, a good target generally does one of three things:

  1. Create a dataset.
  2. Analyze a dataset with a model.
  3. Summarize an analysis or dataset.

If a function gets too long, you can split it into nested sub-functions that make your larger function easier to read and maintain.

5.1.2 Return value

The return value of a target should be an R object that can be saved properly with the storage format you choose using the format argument of tar_target() or tar_option_set(). For example, if the format is "rds" (default), then the target should return an R object that can be saved with saveRDS() and safely loaded properly into another session. Please avoid returning non-exportable objects such as connection objects, Rcpp pointers, xgboost matrices, and greta models. [^Special exceptions are granted to Keras and Torch models, which can be safely returned from targets if you specify format = "keras" or format = "torch".] As a workaround, you can write custom functions to create temporary instances of these non-exportable objects and clean them up after the task is done. The following sketch creates a target that returns a database table while managing a transient connection object.

# _targets.R

get_from_database <- function(table, ...) {
  con <- DBI::dbConnect(...)
  dbReadTable(con, table)

    get_from_database("my_table", ...), # ... has use-case-specific arguments.
    format = "feather" # Requires that the return value is a data frame.

5.2 Side effects

Like a good pure function, a good target should return a single value and not produce side effects. (The exception is output file targets which create files and return their paths.) Avoid modifying the global environment with calls to data() or source(). If you need to source scripts to define global objects, please do so at the top of _targets.R just like source("R/functions.R") from the walkthrough vignette.

5.2.1 How much a target should do

The targets package automatically skips targets that are already up to date, so it is best to define targets that maximize time savings. Good targets usually

  1. Are large enough to subtract a decent amount of runtime when skipped.
  2. Are small enough that some targets can be skipped even if others need to run.
  3. Invoke no side effects such as modifications to the global environment. (But targets with tar_target(format = "file") can save files.)
  4. Return a single value that is
    1. Easy to understand and introspect.
    2. Meaningful to the project.
    3. Easy to save as a file, e.g. with readRDS(). Please avoid non-exportable objects as target return values or global variables.

Regarding the last point above, it is possible to customize the storage format of the target. For details, enter ?tar_target in the console and scroll down to the description of the format argument.

5.3 Dependencies

Adept pipeline construction requires an understanding of dependency detection. To identify the targets and global objects that each target depends on, the targets package uses static code analysis with codetools, and you can emulate this process with tar_deps(). Let us look at the dependencies of the raw_data target.

tar_deps(function() {
  read_csv(raw_data_file, col_types = cols())
#> [1] "{"             "cols"          "raw_data_file" "read_csv"

The raw_data target depends on target raw_data_file because the command for raw_data mentions the symbol raw_data_file. Similarly, if we were to create a user-defined read_csv() function, the raw_data target would also depend on read_csv() and any other user-defined global functions and objects nested inside read_csv(). Changes to any of these objects would cause the raw_data target to rerun on the next tar_make().

Not all of the objects from tar_deps() actually register as dependencies. When it comes to detecting dependencies, targets only recognizes

  1. Other targets (such as raw_data_file).
  2. Functions and objects in the main environment. This environment is almost always the global environment of the R process that runs _targets.R, so these dependencies are usually going to be the custom functions and objects you write yourself.

This process excludes many objects from dependency detection. For example, both { and cols() are excluded because they are defined in the environments of packages (base and readr, respectively). Functions and objects from packages are ignored unless you supply a package environment to the envir argument of tar_option_set() when you call it in _targets.R, e.g. tar_option_set(envir = getNamespace("packageName")). You should only set envir if you write your own package to contain your whole data analysis project.

5.4 Loading and configuring R packages

For most pipelines, it is straightforward to load the R packages that your targets need in order to run.

  1. Call library() at the top of _targets.R to load each package the conventional way, or
  2. Name the required packages using the packages argument of tar_option_set().
  1. is often faster, especially for utilities like tar_visnetwork(), because it avoids loading packages unless absolutely necessary.

Some package management workflows are more complicated. If your use special configuration with conflicted, box, import, or similar utility, please do your configuration inside a project-level .Rprofile file instead of the _targets.R file. In addition, if you use distributed workers inside external containers (Docker, Singularity, AWS AMI, etc.) make sure each container has a copy of this same .Rprofile file where the R worker process spawns. This approach is ensures that all remote workers are configured the same way as the local main process.

5.5 Packages-based invalidation

When it comes time to decide which targets to rerun or skip, the default behavior is to ignore changes to external R packages. Usually, local package libraries do not need to change very often, and it is best to maintain a reproducible project library using renv.

However, sometimes you may wish to invalidate certain targets based on changes to the contents of certain packages. Example scenarios:

  1. You are the developer of a statistical methodology package that serves as the focus of the pipeline.
  2. You implement the workflow itself as an R package that contains your custom functions.

To track the contents of packages package1 and package2, you must

  1. Fully install these packages with install.packages() or equivalent. devtools::load_all() is insufficient because it does not make the packages available to parallel workers.
  2. Write the following in your _targets.R file:
# _targets.R
  packages = c("package1", "package2", ...), # `...` is for other packages.
  imports = c("package1", "package2")
# Write the rest of _targets.R below.
# ...

packages = c("package1", "package2", ...) tells targets to call library(package1), library(package2), etc. before running each target. imports = c("package1", "package2") tells targets to dive into the environments of package1 and package2 and reproducibly track all the objects it finds. For example, if you define a function f() in package1, then you should see a function node for f() in the graph produced by tar_visnetwork(targets_only = FALSE), and targets downstream of f() will invalidate if you install an update to package1 with a new version of f(). The next time you call tar_make(), those invalidated targets will automatically rerun.

5.6 Working with tools outside R

targets lives and operates entirely within the R interpreter, so working with outside tools is a matter of finding the right functionality in R itself. system2() and processx can invoke system commands outside R, and you can include them in your targets’ R commands to run shell scripts, Python scripts, etc. There are also specialized R packages to retrieve data from remote sources and invoke web APIs, including rnoaa, ots, and aws.s3.

5.7 Monitoring the pipeline

If you are using targets, then you probably have an intense computation like Bayesian data analysis or machine learning. These tasks take a long time to run, and it is a good idea to monitor them. Here are some options built directly into targets:

  1. tar_poll() continuously refreshes a text summary of runtime progress in the R console. Run it in a new R session at the project root directory. (Only supported in targets version and higher.)
  2. tar_visnetwork(), tar_progress_summary(), and tar_progress() show runtime information at a single moment in time.
  3. tar_watch() launches an Shiny app that automatically refreshes the graph every few seconds. Try it out in the example below.
# Define an example _targets.R file with a slow pipeline.
  sleep_run <- function(...) {
    tar_target(settings, sleep_run()),
    tar_target(data1, sleep_run(settings)),
    tar_target(data2, sleep_run(settings)),
    tar_target(data3, sleep_run(settings)),
    tar_target(model1, sleep_run(data1)),
    tar_target(model2, sleep_run(data2)),
    tar_target(model3, sleep_run(data3)),
    tar_target(figure1, sleep_run(model1)),
    tar_target(figure2, sleep_run(model2)),
    tar_target(figure3, sleep_run(model3)),
    tar_target(conclusions, sleep_run(c(figure1, figure2, figure3)))

# Launch the app in a background process.
# You may need to refresh the browser if the app is slow to start.
# The graph automatically refreshes every 10 seconds
tar_watch(seconds = 10, outdated = FALSE, targets_only = TRUE)

# Now run the pipeline and watch the graph change.
px <- tar_make()

tar_watch_ui() and tar_watch_server() make this functionality available to other apps through a Shiny module.

Unfortunately, none of these options can tell you if any parallel workers or external processes are still running. You can monitor local processes with a utility like top or htop, and traditional HPC scheduler like SLURM or SGE support their own polling utilities such as squeue and qstat. tar_process() and tar_pid() get the process ID of the main R process that last attempted to run the pipeline.

5.8 Performance

If your pipeline has several thousand targets, functions like tar_make(), tar_outdated(), and tar_visnetwork() may take longer to run. There is an inevitable per-target runtime cost because package needs to check the code and data of each target individually. If this overhead becomes too much, consider batching your work into a smaller group of heavier targets. Using your custom functions, you can make each target perform multiple iterations of a task that was previously given to targets one at a time. For details and an example, please see the discussion on batching at the bottom of the dynamic branching chapter.

Alternatively, if you see slowness in your project, you can contribute to the package with a profiling study. These contributions are great because they help improve the package. Here are the recommended steps.

  1. Install the proffer R package and its dependencies.
  2. Run proffer::pprof(tar_make(callr_function = NULL)) on your project.
  3. When a web browser pops up with pprof, select the flame graph and screenshot it.
  4. Post the flame graph, along with any code and data you can share, to the targets package issue tracker. The maintainer will have a look and try to make the package faster for your use case if speedups are possible.

5.9 Cleaning up

There are multiple functions to help you manually remove data or force targets to rerun.

  • tar_destroy() is by far the most commonly used cleaning function. It removes the _targets/ data store completely, deleting all the results from tar_make() except for external files. Use it if you intend to start the pipeline from scratch without any trace of a previous run.
  • tar_prune() deletes the data and metadata of all the targets no longer present in your current _targets.R file. This is useful if you recently worked through multiple changes to your project and are now trying to discard irrelevant data while keeping the results that still matter.
  • tar_delete() is more selective than tar_destroy() and tar_prune(). It removes the individual data files of a given set of targets from _targets/objects/ while leaving the metadata in _targets/meta/meta alone. If you have a small number of data-heavy targets you need to discard to conserve storage, this function can help.
  • tar_invalidate() is the opposite of tar_delete(): for the selected targets, it deletes the metadata in _targets/meta/meta but keeps the return values in _targets/objects/. After invalidation, you will still be able to locate the data files with tar_path() and manually salvage them in an emergency. However, tar_load() and tar_read() will not be able to read the data into R, and subesequent calls to tar_make() will attempt to rebuild those targets.
Copyright Eli Lilly and Company