6  Targets

A target is a high-level steps of the computational pipeline, and it a piece of work that you define with your custom functions. A target runs some R code and saves the returned R object to storage, usually a single file inside _targets/objects/.

6.1 Target names

A target is an abstraction. The targets package automatically manages data storage and retrieval under the hood, which means you do not need to reference a target’s data file directly (e.g. _targets/objects/your_target_name). Instead, your R code should refer to a target name as if it were a variable in an R session. In other words, from the point of view of the user, a target is an R object in memory. That means a target name must be a valid visible symbol name for an R variable. The name must not begin with a dot, and it must be a string that lets you assign a value, e.g. your_target_name <- TRUE. For stylistic considerations, please refer to the tidyverse style guide syntax chapter.

6.2 What a target should do

Like a good function, a good target generally does one of three things:

  1. Create a dataset.
  2. Analyze a dataset with a model.
  3. Summarize an analysis or dataset.

If a function gets too long, you can split it into nested sub-functions that make your larger function easier to read and maintain.

6.3 How much a target should do

The targets package automatically skips targets that are already up to date, so it is best to define targets that maximize time savings. Good targets usually

  1. Are large enough to subtract a decent amount of runtime when skipped.
  2. Are small enough that some targets can be skipped even if others need to run.
  3. Invoke no side effects such as modifications to the global environment. (But targets with tar_target(format = "file") can save files.)
  4. Return a single value that is
    1. Easy to understand and introspect.
    2. Meaningful to the project.
    3. Easy to save as a file, e.g. with readRDS(). Please avoid non-exportable objects as target return values or global variables.

Regarding the last point above, it is possible to customize the storage format of the target. For details, enter ?tar_target in the console and scroll down to the description of the format argument.

6.4 Working with tools outside R

Each target runs R code, so to invoke a tool outside R, consider system2() or processx to call the appropriate system commands. This technique allows you to run shell scripts, Python scripts, etc. from within R. External scripts should ideally be tracked as input files using tar_target(format = "file") as described in section on external input files. There are also specialized R packages to retrieve data from remote sources and invoke web APIs, including rnoaa, ots, and aws.s3, and you may wish to use custom cues to automatically invalidate a target when the upstream remote data changes.

6.5 Side effects

Like a good pure function, a good target should return a single value and not produce side effects. (The exception is output file targets which create files and return their paths.) Avoid modifying the global environment with calls to data() or source(). If you need to source scripts to define global objects, please do so at the top of your target script file (default: _targets.R) just like source("R/functions.R") from the walkthrough vignette.

6.6 Dependencies

Consider the following pipeline.

# _targets.R file
library(targets)

global_object <- 3

inner_function <- function(argument) {
  local_object <- 1
  argument + global_object + local_object + 2
}

outer_function <- function(object) {
  object + inner_function(object) + 1
}

list(
  tar_target(
    name = second_target,
    command = outer_function(first_target) + 2
  ),
  tar_target(
    name = first_target,
    command = 2
  )
)

In order to run properly, second_target needs up-to-date versions of first_target and outer_function(). In other words, first_target and outer_function() are dependencies of second_target. Likewise, inner_function() is a dependency of outer_function(), and global_object is a dependency of inner_function(). The targets package searches commands and functions for dependencies, noting global symbols like global_object and ignoring local symbols like argument and local_object. The tar_deps() function emulates behavior for you.1

tar_deps(outer_function(first_target) + 2)
#> [1] "+"              "first_target"   "outer_function"
tar_deps(
  function(argument) {
    local_object <- 1
    argument + global_object + local_object + 2
  }
)
#> [1] "{"             "+"             "<-"            "global_object"

After it discards dangling symbols like { and <-, targets translates the dependency information into a dependency graph that you can visualize with tar_visnetwork(). It is good practice to make sure this graph has the correct nodes connected with the correct edges.

# R console
tar_visnetwork()

The dependency graph is a directed acyclic graph (DAG) representation of the pipeline, where each node is a target or global object and each directed edge indicates where a downstream node depends on an upstream node. The DAG is not always a tree, but it never contains a cycle because no target is allowed to directly or indirectly depend on itself. The dependency graph should show a natural progression of work from left to right.2 targets uses static code analysis to build the graph, so the order of tar_target() calls in the target list does not matter. However, targets does not support self-referential loops or other cycles.

When you run the pipeline with tar_make()3, targets runs the correct targets in the correct order with the correct resources according to the graph. For example, by the time second_target starts running, targets makes sure:

  1. Dependency target first_target has already finished running.
  2. Dependencies first_target and outer_function() are up to date.
  3. Dependencies first_target and outer_function() are loaded into memory for second_target to use.
# R console
tar_make()
#> ▶ dispatched target first_target
#> ● completed target first_target [0.001 seconds]
#> ▶ dispatched target second_target
#> ● completed target second_target [0 seconds]
#> ▶ ended pipeline [0.067 seconds]

At this point, any of the following changes will cause the next tar_make() to rerun second_target.

  • Change the value of global_object.
  • Change the body or arguments of inner_function().
  • Change the body or arguments of outer_function().
  • Change the command or value of first_target.
  • Change the command of second_target.

6.7 Return value

The return value of a target should be an R object that can be saved to disk and hashed.

6.7.1 Saving

The object should be compatible with the storage format you choose using the format argument of tar_target() or tar_option_set(). For example, if the format is "rds" (default), then the target should return an R object that can be saved with saveRDS() and safely loaded properly into another session. Please avoid returning non-exportable objects such as connection objects, Rcpp pointers, xgboost matrices, and greta models4.

6.7.2 Hashing

Once a target is saved to disk, targets computes a digest hash to track changes to the data file(s). These hashes are used to decide whether each target is up to date or needs to rerun. In order for the hash to be useful, the data you return from a target must be an accurate reflection of the underlying content of the data. So please try to return the actual data instead of an object that wraps or points to the data. Otherwise, the package will make incorrect decisions regarding which targets can skip and which need to rerun.

6.7.3 Workaround

As a workaround, you can write custom functions to create temporary instances of these non-exportable/non-hashable objects and clean them up after the task is done. The following sketch creates a target that returns a database table while managing a transient connection object.

# _targets.R
library(targets)

get_from_database <- function(table, ...) {
  con <- DBI::dbConnect(...)
  on.exit(close(con))
  dbReadTable(con, table)
}

list(
  tar_target(
    table_from_database,
    get_from_database("my_table", ...), # ... has use-case-specific arguments.
    format = "feather" # Requires that the return value is a data frame.
  )
)

  1. tar_deps() uses the findGlobals() function from the codetools package, with some minor adjustments. See https://adv-r.hadley.nz/expressions.html?q=ast#ast-funs for more information on static code analysis.↩︎

  2. If you have hundreds of targets, then tar_visnetwork() may be slow. If that happens, consider temporarily commenting out some targets in _targets.R just for visualization purposes.↩︎

  3. or tar_make_clustermq() or tar_make_future()↩︎

  4. Special exceptions are granted to Keras and Torch models, which can be safely returned from targets if you specify format = "keras" or format = "torch".↩︎