# _targets.R file
library(targets)
library(tarchetypes)

global_object <- 3

inner_function <- function(argument) {
  local_object <- 1
  argument + global_object + local_object + 2
}

outer_function <- function(object) {
  object + inner_function(object) + 1
}

list(
  tar_target(
    name = second_target,
    command = outer_function(first_target) + 2
  ),
  tar_target(
    name = first_target,
    command = 2
  )
)
6 Targets
A target is a high-level step of the computational pipeline, and a piece of work that you define with your custom functions. A target runs some R code and saves the returned R object to storage, usually a single file inside _targets/objects/.
6.1 Target names
A target is an abstraction. The targets package automatically manages data storage and retrieval under the hood, which means you do not need to reference a target’s data file directly (e.g. _targets/objects/your_target_name). Instead, your R code should refer to a target name as if it were a variable in an R session. In other words, from the point of view of the user, a target is an R object in memory. That means a target name must be a valid visible symbol name for an R variable. The name must not begin with a dot, and it must be a string that lets you assign a value, e.g. your_target_name <- TRUE. For stylistic considerations, please refer to the tidyverse style guide syntax chapter.
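The naming rule above can be checked in plain R. The following is a minimal sketch, not the validation that targets itself performs: is_valid_target_name() is a hypothetical helper that combines base R's make.names() with a check for a leading dot.

```r
# Hypothetical helper: TRUE if `name` is a syntactically valid R symbol that
# does not begin with a dot. The checks targets itself performs may differ.
is_valid_target_name <- function(name) {
  is.character(name) &&
    length(name) == 1L &&
    !is.na(name) &&
    identical(make.names(name), name) && # unchanged only if already valid
    !startsWith(name, ".")
}

is_valid_target_name("your_target_name") # valid: can be assigned with `<-`
is_valid_target_name(".hidden_target")   # invalid: begins with a dot
is_valid_target_name("2nd_target")       # invalid: starts with a digit
is_valid_target_name("my target")        # invalid: contains a space
```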
6.2 What a target should do
Like a good function, a good target generally does one of three things:
- Create a dataset.
- Analyze a dataset with a model.
- Summarize an analysis or dataset.
If a function gets too long, you can split it into nested sub-functions that make your larger function easier to read and maintain.
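As a rough illustration of the three roles above, the sketch below pairs one small function with each kind of target. The function names get_data(), fit_model(), and summarize_model(), the target names, and the choice of the built-in airquality dataset are all assumptions made for this example.

```r
library(targets)

# Create a dataset: keep rows of the built-in airquality data with known Ozone.
get_data <- function() {
  airquality[!is.na(airquality$Ozone), ]
}

# Analyze a dataset with a model.
fit_model <- function(data) {
  lm(Ozone ~ Temp, data = data)
}

# Summarize an analysis: return the coefficients as a small data frame.
summarize_model <- function(model) {
  data.frame(term = names(coef(model)), estimate = unname(coef(model)))
}

# One target per step; each command is a single function call.
list(
  tar_target(data, get_data()),
  tar_target(model, fit_model(data)),
  tar_target(model_summary, summarize_model(model))
)
```

Keeping each command to one function call means a long step can later be split into sub-functions without changing the target list.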
6.3 How much a target should do
The targets package automatically skips targets that are already up to date, so it is best to define targets that maximize time savings. Good targets usually
- Are large enough to subtract a decent amount of runtime when skipped.
- Are small enough that some targets can be skipped even if others need to run.
- Invoke no side effects such as modifications to the global environment. (But targets with tar_target(format = "file") can save files.)
- Return a single value that is
  - Easy to understand and introspect.
  - Meaningful to the project.
  - Easy to save as a file, e.g. with saveRDS(). Please avoid non-exportable objects as target return values or global variables.
Regarding the last point above, it is possible to customize the storage format of the target. For details, enter ?tar_target in the console and scroll down to the description of the format argument.
6.4 Working with tools outside R
Each target runs R code, so to invoke a tool outside R, consider system2() or processx to call the appropriate system commands. This technique allows you to run shell scripts, Python scripts, etc. from within R. External scripts should ideally be tracked as input files using tar_target(format = "file") as described in the section on external input files. There are also specialized R packages to retrieve data from remote sources and invoke web APIs, including rnoaa, ots, and aws.s3. With these packages, you may wish to use custom cues to automatically invalidate a target when the upstream remote data changes.
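The pattern described above might look like the sketch below. run_script() is a hypothetical helper built on system2(), and the script path "scripts/analysis.py" and the interpreter "python3" are placeholders rather than files or tools this manual provides.

```r
library(targets)

# Hypothetical helper: run an external script and return its standard output
# as a character vector, stopping if the script exits with a non-zero status.
run_script <- function(interpreter, script) {
  output <- system2(interpreter, args = shQuote(script), stdout = TRUE)
  status <- attr(output, "status")
  if (!is.null(status) && status != 0) {
    stop("script exited with status ", status)
  }
  output
}

list(
  # Track the script as an input file so that edits to it invalidate
  # the downstream target.
  tar_target(script_file, "scripts/analysis.py", format = "file"),
  # Passing script_file as an argument makes the dependency explicit.
  tar_target(script_output, run_script("python3", script_file))
)
```

Passing the file target into the command, rather than hard-coding the path inside run_script(), is what lets the static code analysis connect the two targets.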
6.5 Side effects
Like a good pure function, a good target should return a single value and not produce side effects. (The exception is output file targets which create files and return their paths.) Avoid modifying the global environment with calls to data() or source(). If you need to source scripts to define global objects, please do so at the top of your target script file (default: _targets.R) just like source("R/functions.R") from the walkthrough vignette.
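To make the distinction concrete, here is a small base R sketch. All three function names are hypothetical. The last one shows the output-file exception: it writes a file and returns the path, which is the shape a command would need for a target declared with tar_target(format = "file").

```r
# Side effect: silently changes a global variable instead of returning a value.
counter <- 0
impure_increment <- function() {
  counter <<- counter + 1
  invisible(NULL)
}

# Pure: the result depends only on the argument and is returned explicitly.
pure_increment <- function(value) {
  value + 1
}

# Output-file exception: write the file, then return its path so that the
# file itself can be tracked.
write_report <- function(data, path) {
  write.csv(data, path, row.names = FALSE)
  path
}
```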
6.6 Dependencies
Consider the pipeline defined in the _targets.R file at the top of this document.
In order to run properly, second_target needs up-to-date versions of first_target and outer_function(). In other words, first_target and outer_function() are dependencies of second_target. Likewise, inner_function() is a dependency of outer_function(), and global_object is a dependency of inner_function(). The targets package searches commands and functions for dependencies, noting global symbols like global_object and ignoring local symbols like argument and local_object. The tar_deps() function emulates this behavior for you.[1]
tar_deps(outer_function(first_target) + 2)
#> [1] "+" "first_target" "outer_function"

tar_deps(
  function(argument) {
    local_object <- 1
    argument + global_object + local_object + 2
  }
)
#> [1] "{" "+" "<-" "global_object"
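The same distinction between global and local symbols can be seen without targets by calling findGlobals() from the codetools package directly. This is only an approximation, since tar_deps() applies its own adjustments on top of findGlobals().

```r
library(codetools)

global_object <- 3

inner_function <- function(argument) {
  local_object <- 1
  argument + global_object + local_object + 2
}

# merge = TRUE returns function names and variable names in one vector.
globals <- findGlobals(inner_function, merge = TRUE)
globals

"global_object" %in% globals # global symbol: detected
"local_object" %in% globals  # local variable: ignored
"argument" %in% globals      # function argument: ignored
```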
After it discards dangling symbols like { and <-, targets translates the dependency information into a dependency graph that you can visualize with tar_visnetwork(). It is good practice to make sure this graph has the correct nodes connected with the correct edges.
# R console
tar_visnetwork()
The dependency graph is a directed acyclic graph (DAG) representation of the pipeline, where each node is a target or global object and each directed edge indicates where a downstream node depends on an upstream node. The DAG is not always a tree, but it never contains a cycle because no target is allowed to directly or indirectly depend on itself. The dependency graph should show a natural progression of work from left to right.[2] targets uses static code analysis to build the graph, so the order of tar_target() calls in the target list does not matter. However, targets does not support self-referential loops or other cycles.
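The two properties above, that declaration order is irrelevant and that cycles are rejected, follow from ordering the graph by its edges. The sketch below is a generic topological ordering in base R, not the algorithm targets uses internally. The deps list is a hand-written copy of the dependencies in the example pipeline, and run_order() is a hypothetical helper.

```r
# Hand-written edge list: each name maps to the names it depends on.
# second_target is deliberately listed first, as in the example pipeline.
deps <- list(
  second_target = c("first_target", "outer_function"),
  first_target = character(0),
  outer_function = "inner_function",
  inner_function = "global_object",
  global_object = character(0)
)

# Repeatedly select the nodes whose dependencies are all finished.
# If no node is ready while some remain, the graph contains a cycle.
run_order <- function(deps) {
  done <- character(0)
  while (length(done) < length(deps)) {
    pending <- setdiff(names(deps), done)
    is_ready <- vapply(deps[pending], function(d) all(d %in% done), logical(1))
    ready <- pending[is_ready]
    if (length(ready) == 0) {
      stop("cycle detected among: ", paste(pending, collapse = ", "))
    }
    done <- c(done, ready)
  }
  done
}

execution_order <- run_order(deps)
execution_order
```

Whatever order the entries of deps are written in, second_target can only appear after first_target and outer_function().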
When you run the pipeline with tar_make()[3], targets runs the correct targets in the correct order with the correct resources according to the graph. For example, by the time second_target starts running, targets makes sure:
- Dependency target first_target has already finished running.
- Dependencies first_target and outer_function() are up to date.
- Dependencies first_target and outer_function() are loaded into memory for second_target to use.
# R console
tar_make()
#> ▶ dispatched target first_target
#> ● completed target first_target [0.001 seconds, 50 bytes]
#> ▶ dispatched target second_target
#> ● completed target second_target [0.001 seconds, 51 bytes]
#> ▶ ended pipeline [0.075 seconds]
At this point, any of the following changes will cause the next tar_make() to rerun second_target.
- Change the value of global_object.
- Change the body or arguments of inner_function().
- Change the body or arguments of outer_function().
- Change the command or value of first_target.
- Change the command of second_target.
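One way to observe the first rule in the list above is to build a pipeline, change global_object, and ask tar_outdated() which targets would rerun. The sketch below works in a temporary directory and uses a simplified version of the example pipeline. write_pipeline() is a hypothetical helper that rewrites _targets.R with a given value for global_object.

```r
library(targets)

# Hypothetical helper: write a small _targets.R whose global object has the
# given value. The functions are simplified versions of the example pipeline.
write_pipeline <- function(global_value) {
  writeLines(
    c(
      "library(targets)",
      sprintf("global_object <- %s", global_value),
      "inner_function <- function(argument) argument + global_object",
      "outer_function <- function(object) inner_function(object) + 1",
      "list(",
      "  tar_target(first_target, 2),",
      "  tar_target(second_target, outer_function(first_target) + 2)",
      ")"
    ),
    "_targets.R"
  )
}

# Work in a temporary directory so that no real project is touched.
pipeline_dir <- tempfile("pipeline_")
dir.create(pipeline_dir)
old_dir <- setwd(pipeline_dir)

write_pipeline(3)
tar_make(reporter = "silent") # build both targets
up_to_date <- tar_outdated(reporter = "silent")

write_pipeline(4) # change the value of global_object
outdated <- tar_outdated(reporter = "silent")

setwd(old_dir)

up_to_date # targets that would rerun before the change
outdated   # targets that would rerun after the change
```

Because first_target has the constant command 2 and no dependencies, only second_target should be reported after the change.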
6.7 Return value
The return value of a target should be an R object that can be saved to disk and hashed.
6.7.1 Saving
The object should be compatible with the storage format you choose using the format argument of tar_target() or tar_option_set(). For example, if the format is "rds" (default), then the target should return an R object that can be saved with saveRDS() and loaded safely into another session. Please avoid returning non-exportable objects such as connection objects, Rcpp pointers, xgboost matrices, and greta models.[4]
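A quick way to gain some confidence that a return value suits the default "rds" format is to save it, reload it, and compare. The helper below, survives_rds_round_trip(), is a hypothetical sketch. It is a necessary check rather than a sufficient one, because objects that hold external pointers or live connections can be written without an error and still be unusable after reloading in another session.

```r
# Hypothetical check: save with saveRDS(), reload with readRDS(), and compare
# the reloaded object with the original.
survives_rds_round_trip <- function(object) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path)) # remove the temporary file when the function exits
  saveRDS(object, path)
  identical(readRDS(path), object)
}

# Plain data such as a data frame round-trips unchanged.
survives_rds_round_trip(data.frame(x = 1:3, y = c("a", "b", "c")))
```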
6.7.2 Hashing
Once a target is saved to disk, targets computes a digest hash to track changes to the data file(s). These hashes are used to decide whether each target is up to date or needs to rerun. In order for the hash to be useful, the data you return from a target must be an accurate reflection of the underlying content of the data. So please try to return the actual data instead of an object that wraps or points to the data. Otherwise, the package will make incorrect decisions regarding which targets can skip and which need to rerun.
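The difference between returning data and returning a pointer to data can be demonstrated in base R. In the sketch below, hash_value() is a hypothetical stand-in that saves a value and takes the MD5 sum of the file with tools::md5sum(); the hashing that targets actually performs may differ. The "pointer" is simply a file path stored as a string.

```r
# Hypothetical stand-in for hashing a return value: save it uncompressed and
# take the MD5 sum of the resulting file.
hash_value <- function(x) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))
  saveRDS(x, path, compress = FALSE)
  unname(tools::md5sum(path))
}

data_file <- tempfile(fileext = ".csv")

write.csv(data.frame(x = 1:3), data_file, row.names = FALSE)
pointer_before <- hash_value(data_file)        # hash of the path string
data_before <- hash_value(read.csv(data_file)) # hash of the actual data

write.csv(data.frame(x = 4:6), data_file, row.names = FALSE)
pointer_after <- hash_value(data_file)
data_after <- hash_value(read.csv(data_file))

identical(pointer_before, pointer_after) # the path did not change
identical(data_before, data_after)       # the data did change
```

A target that returned only the path string would look up to date even though the file contents changed, which is the incorrect decision the paragraph above warns about.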
6.7.3 Workaround
As a workaround, you can write custom functions to create temporary instances of these non-exportable/non-hashable objects and clean them up after the task is done. The following sketch creates a target that returns a database table while managing a transient connection object.
# _targets.R
library(targets)
library(tarchetypes)

get_from_database <- function(table, ...) {
  con <- DBI::dbConnect(...)
  # Disconnect the transient connection when the function exits.
  on.exit(DBI::dbDisconnect(con))
  DBI::dbReadTable(con, table)
}

list(
  tar_target(
    table_from_database,
    get_from_database("my_table", ...), # ... has use-case-specific arguments.
    format = "feather" # Requires that the return value is a data frame.
  )
)
[1] tar_deps() uses the findGlobals() function from the codetools package, with some minor adjustments. See https://adv-r.hadley.nz/expressions.html?q=ast#ast-funs for more information on static code analysis.
[2] If you have hundreds of targets, then tar_visnetwork() may be slow. If that happens, consider temporarily commenting out some targets in _targets.R just for visualization purposes.
[3] or tar_make_clustermq() or tar_make_future()
[4] Special exceptions are granted to Keras and Torch models, which can be safely returned from targets if you specify format = "keras" or format = "torch".