Chapter 11 What about drake?


targets is the successor of drake, an older pipeline tool. As of 2021-01-21, drake is superseded, which means there are no plans for new features or discretionary enhancements, but basic maintenance and support will continue indefinitely. Existing projects that use drake can safely continue to use drake, and there is no need to retrofit targets. New projects should use targets because it is friendlier and more robust.

11.1 Why is drake superseded?

Nearly four years of community feedback have exposed major user-side limitations regarding data management, collaboration, dynamic branching, and parallel efficiency. Unfortunately, these limitations are permanent. Solutions in drake itself would make the package incompatible with existing projects that use it, and the internal architecture is too copious, elaborate, and mature for such extreme refactoring. That is why targets was created. The targets package borrows from past learnings, user suggestions, discussions, complaints, success stories, and feature requests, and it improves the user experience in ways that will never be possible in drake.

11.2 Transitioning to targets

If you know drake, then you already almost know targets. The programming style is similar, and most functions in targets have counterparts in drake.

Functions in drake | Counterparts in targets
use_drake(), drake_script() | tar_script()
drake_plan() | tar_manifest(), tarchetypes::tar_plan()
target() | tar_target(), tar_target_raw()
drake_config() | tar_option_set()
outdated(), r_outdated() | tar_outdated()
vis_drake_graph(), r_vis_drake_graph() | tar_visnetwork(), tar_glimpse()
drake_graph_info(), r_drake_graph_info() | tar_network()
make(), r_make() | tar_make(), tar_make_clustermq(), tar_make_future()
loadd() | tar_load()
readd() | tar_read()
diagnose(), build_times(), cached(), drake_cache_log() | tar_meta()
drake_progress(), drake_running(), drake_done(), drake_failed(), drake_cancelled() | tar_progress()
clean() | tar_deduplicate(), tar_delete(), tar_destroy(), tar_invalidate()
drake_gc() | tar_prune()
id_chr() | tar_name(), tar_path()
knitr_in() | tarchetypes::tar_render()
cancel(), cancel_if() | tar_cancel()
trigger() | tar_cue()
drake_example(), drake_examples(), load_mtcars_example(), clean_mtcars_example() | Unsupported. Example targets pipelines are in individual repositories linked from here.
drake_build() | Unsupported in targets to ensure coherence with dynamic branching.
drake_debug() | Read here to learn about interactive debugging in targets.
drake_history(), recoverable() | Unsupported in targets. Instead of trying to manage history and data recovery directly, targets maintains a much lighter/friendlier data store to make it easier to use external data versioning tools instead.
missed(), tracked(), deps_code(), deps_target(), deps_knitr(), deps_profile() | Unsupported in targets because dependency detection is far easier to understand than in drake.
drake_hpc_template_file(), drake_hpc_template_files() | Deemed out of scope for targets.
drake_cache(), new_cache(), find_cache() | Unsupported because targets is far more strict and paternalistic about data/file management.
rescue_cache(), which_clean(), cache_planned(), cache_unplanned() | Unsupported due to the simplified data management system and storage cleaning functions.
drake_get_session_info() | Deemed superfluous and a potential bottleneck. Discarded for targets.
read_drake_seed() | Superfluous because targets always uses the same global seed. tar_meta() shows all the target-level seeds.
show_source() | Deemed superfluous. Discarded in targets to conserve storage space in _targets/meta/meta.
drake_tempfile() | Superfluous in targets because there is no special disk.frame storage format. (Dynamic file targets are much better for managing disk.frames.)
file_store() | Superfluous in targets because all files are dynamic files and there is no longer a need to Base32-encode any file names.
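
As a rough illustration of the table above, the sketch below shows how a small drake workflow might translate to targets. The helper functions get_data() and fit_model() and the file name data.csv are hypothetical placeholders and are not part of either package.

```r
# Hypothetical drake version, kept as comments for comparison:
# plan <- drake_plan(
#   data = get_data(file_in("data.csv")),
#   model = fit_model(data)
# )
# make(plan)

# _targets.R (assumed to live in the project root)
library(targets)

# Hypothetical user-defined functions.
get_data <- function(file) read.csv(file)
fit_model <- function(data) lm(y ~ x, data = data)

list(
  tar_target(file, "data.csv", format = "file"), # track the input file
  tar_target(data, get_data(file)),
  tar_target(model, fit_model(data))
)
```

With this file in place, tar_make() would play the role of make(), and tar_read(model) the role of readd(model).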

Likewise, many make() arguments have equivalent arguments elsewhere.

Argument of drake::make() | Counterparts in targets
targets | names in tar_make() etc.
envir | envir in tar_option_set()
verbose | reporter in tar_make() etc.
parallelism | Choice of function: tar_make() vs tar_make_clustermq() vs tar_make_future()
jobs | workers in tar_make_clustermq() and tar_make_future()
packages | packages in tar_target() and tar_option_set()
lib_loc | library in tar_target() and tar_option_set()
trigger | cue in tar_target() and tar_option_set()
caching | storage and retrieval in tar_target() and tar_option_set()
keep_going | error in tar_target() and tar_option_set()
memory_strategy | memory in tar_target() and tar_option_set()
garbage_collection | garbage_collection in tar_target() and tar_option_set()
template | resources in tar_target() and tar_option_set()
curl_handles | handle element of resources argument of tar_target() and tar_option_set()
format | format in tar_target() and tar_option_set()
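
To make a few of these correspondences concrete, here is a minimal sketch of a _targets.R file that sets pipeline-wide defaults and overrides one of them for a single target. The target names and commands are illustrative only, and the accepted values of each option may differ between versions of targets.

```r
# _targets.R
library(targets)

# Pipeline-level defaults, roughly where make() arguments used to go.
tar_option_set(
  packages = c("stats", "utils"), # was make(packages = ...)
  error = "continue",             # was make(keep_going = TRUE)
  memory = "transient"            # was make(memory_strategy = ...)
)

list(
  tar_target(raw, head(mtcars)),
  # A target-level setting overrides the default from tar_option_set().
  tar_target(fit, lm(mpg ~ wt, data = raw), memory = "persistent")
)
```

A call such as tar_make(names = any_of("fit"), reporter = "silent") would then stand in for the targets and verbose arguments of make().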

In addition, many optional columns of drake plans are expressed differently in targets.

Optional column of drake plans | Feature in targets
format | format argument of tar_target() and tar_option_set()
dynamic | pattern argument of tar_target() and tar_option_set()
transform | static branching functions in tarchetypes such as tar_map() and tar_combine()
trigger | cue argument of tar_target() and tar_option_set()
hpc | deployment argument of tar_target() and tar_option_set()
resources | resources argument of tar_target() and tar_option_set()
caching | storage and retrieval arguments of tar_target() and tar_option_set()
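
The sketch below shows where some of these plan columns might end up in a targets pipeline. All target names and commands are made up for illustration, and the example assumes the tarchetypes package is installed.

```r
# _targets.R
library(targets)
library(tarchetypes)

list(
  tar_target(numbers, seq_len(3)),
  # "dynamic" column -> pattern argument
  tar_target(squares, numbers^2, pattern = map(numbers)),
  # "trigger" column -> cue argument; "hpc" column -> deployment argument
  tar_target(
    timestamp,
    Sys.time(),
    cue = tar_cue(mode = "always"),
    deployment = "main"
  ),
  # "transform" column -> static branching with tarchetypes::tar_map(),
  # which creates one copy of the target per value of offset
  tar_map(
    values = list(offset = c(1, 10)),
    tar_target(shifted, numbers + offset)
  )
)
```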

11.3 Advantages of targets over drake

11.3.1 Better guardrails by design

drake leaves ample room for user-side mistakes, and some of these mistakes require extra awareness or advanced knowledge of R to consistently avoid. The example behaviors below are too systemic to solve and still preserve back-compatibility.

  1. By default, make() looks for functions and global objects in the parent environment of the calling R session. Because the global environment is often old and stale in practical situations, targets can become incorrectly invalidated, and users need to remember to restart the session before calling make(). The issue is discussed here, and the discussion led to functions like r_make() which always create a fresh session to do the work. However, r_make() is not a complete replacement for make(), and beginner users still run into the original problems.
  2. Similar to the above, make() does not find the intended functions and global objects if it is called in a different environment. Edge cases like this one and this one continue to surprise users.
  3. drake is extremely flexible about the location of the .drake/ cache. When a user calls readd(), loadd(), make(), and similar functions, drake searches up through the parent directories until it finds a .drake/ folder. This flexibility seldom helps, and it creates uncertainty and inconsistency when it comes to initializing and accessing projects, especially if there are multiple projects with nested file systems.

The targets package solves all these issues by design. Functions tar_make(), tar_make_clustermq(), and tar_make_future() all create fresh new R sessions by default. They all require a _targets.R configuration file in the project root (working directory of the tar_make() call) so that the functions, global objects, and settings are all populated in the exact same way each session, leading to less frustration, greater consistency, and greater reproducibility. In addition, the _targets/ data store always lives in the project root.
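
The fresh-session behavior can be seen in a small experiment. The sketch below is only an illustration: the project directory name, the scale object, and the result target are all made up, and it assumes targets is installed and that tar_make() runs with its default settings.

```r
library(targets)

# Work in a throwaway project directory so nothing real is touched.
project <- file.path(tempdir(), "guardrails-demo")
dir.create(project, showWarnings = FALSE)
old <- setwd(project)

# A stale object in the interactive session, as often happens in practice.
scale <- 999

# Write _targets.R in the project root. The pipeline defines its own scale.
tar_script({
  scale <- 10
  list(tar_target(result, scale * 2))
}, ask = FALSE)

# Runs in a fresh R session that sources _targets.R.
tar_make(reporter = "silent")

# The stale session value of scale is ignored: result is 20, not 1998.
result <- tar_read(result)

setwd(old)
```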

11.3.2 Enhanced debugging support

targets has enhanced debugging support. With the workspaces argument to tar_option_set(), users can locally recreate the conditions under which a target runs. This includes packages, global functions and objects, and the random number generator seed. Similarly, tar_option_set(error = "workspace") automatically saves debugging workspaces for targets that encounter errors. The debug option lets users enter an interactive debugger for a given target while the pipeline is running. And unlike drake, all debugging features are fully compatible with dynamic branching.
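
A possible configuration using the options named above is sketched here. The dataset and analysis targets are hypothetical, and the exact option values follow the text of this section, so they may differ in other versions of the package.

```r
# _targets.R
library(targets)

tar_option_set(
  workspaces = "analysis", # always save a workspace for this target
  error = "workspace"      # also save one for any target that errors
)

list(
  tar_target(dataset, head(mtcars)),
  tar_target(analysis, lm(mpg ~ wt, data = dataset))
)
```

After tar_make(), calling tar_workspace(analysis) in an interactive session should restore the saved conditions for that target. For the interactive debugger, one would instead set tar_option_set(debug = "analysis") and run the pipeline in the current session with tar_make(callr_function = NULL).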

11.3.3 Improved tracking of package functions

By default, targets ignores changes to functions inside external packages. However, if a workflow centers on a custom package with methodology under development, users can make targets automatically watch the package’s functions for changes. Simply supply the names of the relevant packages to the imports argument of tar_option_set(). Unlike drake, targets can track multiple packages this way, and the internal mechanism is much safer.
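
For example, a pipeline built around a package under development might be configured as below. The package name mymethods and its function fit_model() are hypothetical.

```r
# _targets.R
library(targets)

tar_option_set(
  packages = "mymethods", # hypothetical package, loaded for each target
  imports = "mymethods"   # watch its functions for changes
)

list(
  # fit_model() is assumed to be a function exported by mymethods.
  tar_target(fit, fit_model(mtcars))
)
```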

11.3.4 Lighter, friendlier data management

drake’s cache is an intricate file system in a hidden .drake folder. It contains multiple files for each target, and those names are not informative. (See the files in the data/ folder in the diagram below.) Users often have trouble understanding how drake manages data, resolving problems when files are corrupted, placing the data under version control, collaborating with others on the same pipeline, and clearing out superfluous data when the cache grows large in storage.

.drake/
├── config/
├── data/
├───── 17bfcef645301416.rds
├───── 21935c86f12692e2.rds
├───── 37caf5df2892cfc4.rds
├───── ...
├── drake/
├───── history/
├───── return/
├───── tmp/
├── keys/ # A surprisingly large number of tiny text files live here.
├───── memoize/
├───── meta/
├───── objects/
├───── progress/
├───── recover/
├───── session/
└── scratch/ # This folder should be temporary, but it gets egregiously large.

The targets package takes a friendlier, more transparent, less mysterious approach to data management. Its data store is a visible _targets folder, and it contains far fewer files: a spreadsheet of metadata, a spreadsheet of target progress, and one informatively named data file for each target. It is much easier to understand the data management process, identify and diagnose problems, place projects under version control, and avoid consuming unnecessary storage resources. Sketch:

_targets/
├── meta/
├───── meta
├───── process
├───── progress
├── objects/
├───── target_name_1
├───── target_name_2
├───── target_name_3
├───── ...
└── scratch/ # Deleted when the pipeline finishes.
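
The layout above can be inspected directly from R. The sketch below builds a tiny two-target pipeline in a temporary directory and then looks at the data store. The directory name and the targets a and b are made up for illustration.

```r
library(targets)

# Throwaway project directory.
project <- file.path(tempdir(), "store-demo")
dir.create(project, showWarnings = FALSE)
old <- setwd(project)

tar_script(
  list(tar_target(a, 1 + 1), tar_target(b, a * 2)),
  ask = FALSE
)
tar_make(reporter = "silent")

# One informatively named file per target.
stored <- list.files(file.path("_targets", "objects"))

# The metadata and progress files are readable as data frames.
meta <- tar_meta()
progress <- tar_progress()

setwd(old)
```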

11.3.5 Cloud storage

Thanks to the simplified data store and simplified internals, targets can automatically upload data to the Amazon S3 bucket of your choice. Simply configure aws.s3, create a bucket, and select one of the AWS-powered storage formats. Then, targets will automatically upload the return values to the cloud.

# _targets.R
tar_option_set(resources = list(bucket = "my-bucket-name"))
list(
  tar_target(dataset, get_large_dataset(), format = "aws_fst_tbl"),
  tar_target(analysis, analyze_dataset(dataset), format = "aws_qs")
)

Data retrieval is still super easy.

tar_read(dataset)

11.3.6 Show status of functions and global objects

drake has several utilities that inform users which targets are up to date and which need to rerun. However, those utilities are limited by how drake manages functions and other global objects. Whenever drake inspects globals, it stores their values in its cache and loses track of their previous state from the last run of the pipeline. As a result, it has trouble informing users exactly why a given target is out of date. And because the system for tracking global objects is tightly coupled with the cache, this limitation is permanent.

In targets, the metadata management system only updates information on global objects when the pipeline actually runs. This makes it possible to understand which specific changes to your code could have invalidated your targets. In large projects with long runtimes, this feature contributes significantly to reproducibility and peace of mind.
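
As a small demonstration, the sketch below runs a pipeline, edits a global function, and checks which targets become outdated without rerunning anything. The function f() and the targets a and b are hypothetical.

```r
library(targets)

# Throwaway project directory.
project <- file.path(tempdir(), "status-demo")
dir.create(project, showWarnings = FALSE)
old <- setwd(project)

tar_script({
  f <- function(x) x + 1
  list(tar_target(a, f(1)), tar_target(b, a * 2))
}, ask = FALSE)
tar_make(reporter = "silent")

# Nothing has changed yet, so no targets are outdated.
before <- tar_outdated(reporter = "silent")

# Simulate a code edit: the body of f() changes.
tar_script({
  f <- function(x) x + 2
  list(tar_target(a, f(1)), tar_target(b, a * 2))
}, ask = FALSE)

# Checking status does not run the pipeline or overwrite the stored metadata.
after <- tar_outdated(reporter = "silent")

setwd(old)
```

Here, a is outdated because f() changed, and b is outdated because it depends on a. Functions such as tar_sitrep() and tar_visnetwork() can give a more detailed view of the same information.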

11.3.7 Dynamic branching with dplyr::group_by()

Dynamic branching was an architecturally difficult fit in drake, and it can only support one single (vctrs-based) method of slicing and aggregation for processing sub-targets. This limitation has frustrated members of the community, as discussed here and here.

targets, on the other hand, is more flexible regarding slicing and aggregation. When it branches over an object, it can iterate over vectors, lists, and even data frames grouped with dplyr::group_by(). To branch over chunks of a data frame, our data frame target needs to have a special tar_group column. We can create this column in our target’s return value with the tar_group() function.

library(dplyr)
library(targets)
library(tibble)
tibble(
  x = seq_len(6),
  id = rep(letters[seq_len(3)], each = 2)
) %>%
  group_by(id) %>%
  tar_group()
#> # A tibble: 6 x 3
#> # Groups:   id [3]
#>       x id    tar_group
#>   <int> <chr>     <int>
#> 1     1 a             1
#> 2     2 a             1
#> 3     3 b             2
#> 4     4 b             2
#> 5     5 c             3
#> 6     6 c             3

Our actual target has the command above and iteration = "group".

tar_target(
  data,
  tibble(
    x = seq_len(6),
    id = rep(letters[seq_len(3)], each = 2)
  ) %>%
    group_by(id) %>%
    tar_group(),
  iteration = "group"
)

Now, any target that maps over data is going to define one branch for each group in the data frame. The following target creates three branches when run in a pipeline: one returning 3, one returning 7, and one returning 11.

tar_target(
  sums,
  sum(data$x),
  pattern = map(data)
)
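
To check the branch values, the two targets above can be combined into one small pipeline and run in a temporary directory. This is only a sketch, and it assumes dplyr and tibble are installed alongside targets. The directory name is made up.

```r
library(targets)

# Throwaway project directory.
project <- file.path(tempdir(), "group-demo")
dir.create(project, showWarnings = FALSE)
old <- setwd(project)

tar_script({
  library(dplyr)
  library(tibble)
  list(
    tar_target(
      data,
      tibble(
        x = seq_len(6),
        id = rep(letters[seq_len(3)], each = 2)
      ) %>%
        group_by(id) %>%
        tar_group(),
      iteration = "group"
    ),
    # One branch per group of data.
    tar_target(sums, sum(data$x), pattern = map(data))
  )
}, ask = FALSE)
tar_make(reporter = "silent")

# The branches of sums are aggregated into one vector: 3, 7, 11.
sums <- unname(tar_read(sums))

setwd(old)
```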

11.3.8 Composable dynamic branching

Because the design of targets is fundamentally dynamic, users can create complicated dynamic branching patterns that are never going to be possible in drake. Below, target z creates six branches, one for each combination of w and tuple (x, y). The pattern cross(w, map(x, y)) is equivalent to tidyr::crossing(w, tidyr::nesting(x, y)).

# _targets.R
library(targets)
list(
  tar_target(w, seq_len(2)),
  tar_target(x, head(letters, 3)),
  tar_target(y, head(LETTERS, 3)),
  tar_target(
    z,
    data.frame(w = w, x = x, y = y),
    pattern = cross(w, map(x, y))
  )
)

Thanks to glep and djbirke on GitHub for the idea.
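
To see which combinations the pattern cross(w, map(x, y)) covers, the same grid can be enumerated outside of targets. The sketch below uses base R only. It illustrates the set of combinations, not how targets builds or orders its branches internally.

```r
w <- seq_len(2)
x <- head(letters, 3)
y <- head(LETTERS, 3)

# map(x, y): pair up elements of x and y position by position.
pairs <- data.frame(x = x, y = y)

# cross(w, ...): combine every value of w with every (x, y) pair.
index <- expand.grid(pair = seq_len(nrow(pairs)), w = w)
branches <- data.frame(
  w = index$w,
  x = pairs$x[index$pair],
  y = pairs$y[index$pair]
)

# Two values of w times three (x, y) pairs gives six combinations.
print(branches)
```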

11.3.9 Improved parallel efficiency

Dynamic branching in drake is staged. In other words, all the sub-targets of a dynamic target must complete before the pipeline moves on to downstream targets. The diagram below illustrates this behavior in a pipeline with a dynamic target B that maps over another dynamic target A. For thousands of dynamic sub-targets with highly variable runtimes, this behavior consumes unnecessary runtime and computing resources. And because drake’s architecture was designed at a fundamental level for static branching only, this limitation is permanent.

By contrast, the internal data structures in targets are dynamic by design, which allows for a dynamic branching model with more flexibility and parallel efficiency. Branches can always start as soon as their upstream dependencies complete, even if some of those upstream dependencies are branches. This behavior reduces runtime and reduces consumption of computing resources.
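
A pipeline shaped like the A and B example might look like the sketch below. The index target and the sleep times are made up to mimic variable runtimes, and the example assumes the clustermq package is installed and configured for local multiprocess use.

```r
# _targets.R
library(targets)

# Assumed local backend for tar_make_clustermq().
options(clustermq.scheduler = "multiprocess")

list(
  tar_target(index, seq_len(4)),
  # Dynamic target A: branch runtimes vary from 1 to 4 seconds.
  tar_target(
    A,
    {
      Sys.sleep(index)
      index
    },
    pattern = map(index)
  ),
  # Dynamic target B: each branch depends only on its matching branch of A.
  tar_target(B, A * 10, pattern = map(A))
)
```

Running tar_make_clustermq(workers = 2) on this pipeline should let the first branch of B start as soon as the first branch of A finishes, without waiting for the slower branches of A.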

11.3.10 Metaprogramming

In drake, pipelines are defined with the drake_plan() function. drake_plan() supports an elaborate domain specific language that defuses (quotes) user-supplied R expressions. This makes it convenient to assign commands to targets in the vast majority of cases, but it also obstructs custom metaprogramming by users (example here). Granted, it is possible to completely circumvent drake_plan() and create the whole data frame from scratch, but this is hardly ideal and seldom done in practice.

The targets package tries to make customization easier. Relative to drake, targets takes a decentralized approach to setting up pipelines, moving as much custom configuration as possible to the target level rather than the whole pipeline level. In addition, the tar_target_raw() function avoids non-standard evaluation while mirroring tar_target() in all other respects. All this makes it much easier to create custom metaprogrammed pipelines and target archetypes while avoiding an elaborate domain specific language for static branching, which was extremely difficult to understand and error prone in drake. The R Targetopia is an emerging ecosystem of workflow frameworks that take full advantage of this customization and democratize reproducible pipelines.
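
As a minimal sketch of this style of metaprogramming, the function below generates one target per dataset name using tar_target_raw(). The function make_summary_targets() is a hypothetical helper written for this example and is not part of targets.

```r
library(targets)

# Hypothetical target factory: one summary target per dataset name.
make_summary_targets <- function(datasets) {
  lapply(datasets, function(dataset) {
    tar_target_raw(
      # The name is an ordinary character string.
      name = paste0("summary_", dataset),
      # The command is an ordinary R expression built programmatically,
      # e.g. summary(cars) when dataset is "cars".
      command = substitute(
        summary(data),
        env = list(data = as.symbol(dataset))
      )
    )
  })
}

# Produces target objects named summary_cars and summary_iris.
summary_targets <- make_summary_targets(c("cars", "iris"))
```

A list like summary_targets could be returned at the end of _targets.R in the same way as targets written by hand with tar_target().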

Copyright Eli Lilly and Company