B Design

This chapter explains drake’s internal design and architecture. Goals:

Help developers and enthusiastic users contribute to the code base.
Invite high-level advice and discussion about potential improvements to the overall design.

B.1 Principles

B.1.1 Functions first

From the user’s point of view, drake is a style of programming in its own right, and that style is zealously and irrevocably function-oriented. It harmonizes with statistics and data science, where most methodology naturally takes the form of data transformations, and it embraces the natively function-oriented design of the R language. Functions are first-class citizens in drake, and they dominate the internal design at the highest levels.

B.1.2 Light use of traditional OOP

Most of a drake workflow happens inside the make() function. make() accepts a data frame of function calls (the drake plan), caches some targets, and then drops its internal state when it terminates. The state does not need to persist, and the user does not need to interact with it. This is a major reason why traditional object-oriented programming plays such a small, supporting role.

In drake, full OOP classes and objects are small, simple, and extremely specialized. For example, the decorated storr, priority queue, and logger reference classes are narrowly defined and fit for purpose. The S3 system appears far more often, usually as a mechanism of function overloading to streamline control flow, and also as a means of adding structure and validation to small target-specific objects optimized for performance.
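As a rough illustration of S3 used as function overloading, the sketch below dispatches on the class of a small object instead of branching with if/else. The names new_trigger() and should_build() are hypothetical and are not part of drake's API.

```r
# Hypothetical constructor: a plain list with an S3 class attribute.
new_trigger <- function(type, value = NULL) {
  structure(
    list(value = value),
    class = c(paste0("trigger_", type), "trigger")
  )
}

# Generic: the S3 system selects the method from the object's class.
should_build <- function(trigger, ...) {
  UseMethod("should_build")
}

should_build.trigger_always <- function(trigger, ...) {
  TRUE
}

should_build.trigger_condition <- function(trigger, ...) {
  isTRUE(trigger$value)
}

# Fallback for trigger types without a dedicated method.
should_build.trigger <- function(trigger, ...) {
  FALSE
}

should_build(new_trigger("always"))
should_build(new_trigger("condition", value = FALSE))
```

Adding a new kind of trigger in this style means adding one method, not editing a central conditional.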

In future development, tactical reference classes will arise as needed to encapsulate low-level patterns into natural abstractions. However, drake’s design places greater importance on maximizing runtime efficiency.

B.1.3 High-performance small objects

drake maintains several small list-like objects for each target, such as the local spec, the target data, triggers, and the code analysis results. drake workflows with thousands of targets have thousands of these objects, and as profiling studies have shown, we need these objects to perform as efficiently as possible. Instantiation and field access need to be fast, and the memory footprint needs to be low. For these reasons, we choose simple lists with S3 class attributes, which outclass S4 and reference classes when it comes to instantiation speed.
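The sketch below shows what a simple list with an S3 class attribute can look like next to a comparable reference class. The constructor new_target_spec() and the class TargetSpecRC are made up for illustration. The timing at the end is only a rough check, and the numbers depend on the machine and the R version.

```r
library(methods)

# Hypothetical per-target object: a plain list plus a class attribute.
new_target_spec <- function(target, deps = character(0)) {
  stopifnot(is.character(target), length(target) == 1L, is.character(deps))
  structure(list(target = target, deps = deps), class = "target_spec")
}

# A comparable reference class, shown only for comparison.
TargetSpecRC <- setRefClass(
  "TargetSpecRC",
  fields = list(target = "character", deps = "character")
)

spec <- new_target_spec("model", deps = c("data", "fit_model"))
spec$deps  # field access is ordinary list indexing

# Rough timing comparison of instantiation.
n <- 1000
time_s3 <- system.time(
  for (i in seq_len(n)) new_target_spec("x", "y")
)
time_rc <- system.time(
  for (i in seq_len(n)) TargetSpecRC$new(target = "x", deps = "y")
)
c(s3 = time_s3[["elapsed"]], rc = time_rc[["elapsed"]])
```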

B.1.4 Fast iteration along aggregated data

Each of the large data structures aggregates a single type of information across all targets to help drake run fast. Examples include the whole workflow specification (config$spec) and the in-memory target metadata cache (config$meta). These objects are hash-table-powered environments to make field access as fast as possible.
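A minimal sketch of such an aggregated structure, using only base R: one hashed environment holds a small list per target, keyed by target name. The field names inside each list are placeholders, not drake's actual fields.

```r
# A hashed environment keyed by target name, standing in for an
# aggregated structure such as config$spec or config$meta.
spec <- new.env(hash = TRUE, parent = emptyenv())

targets <- paste0("target_", seq_len(1000))
for (name in targets) {
  # Each element is a small list describing one target (made-up fields).
  assign(name, list(target = name, deps = character(0)), envir = spec)
}

# Lookup by name does not require scanning the other targets.
spec[["target_500"]]$target
exists("target_1001", envir = spec, inherits = FALSE)
length(ls(spec))
```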

B.1.5 Access to information across targets

drake aggressively analyzes dependency relationships among targets. Even while make() builds a single target, it needs to stay aware of the other targets, not only to build the dependency graph, but also for other tasks like dynamic branching. This is a major reason why the workflow specification, dependency graph, priority queue, and metadata are all stored in environments that most functions can reach.

B.2 Specific classes

This section describes drake’s primary internal data structures at a high level. It is not exhaustive, but it does cover most of the architecture.

B.2.1 Config

make(), outdated(), vis_drake_graph(), and related utilities keep track of a drake_config() object. A drake_config() object is a list of class "drake_config". Its purpose is to keep track of the state of a drake workflow and avoid long parameter lists in functions. Future development will focus on refactoring and formalizing drake_config() objects.

B.2.2 Settings

Static runtime parameters such as keep_going and log_build_times live in a list of class drake_settings, which is part of each drake_config object.

B.2.3 Plan

The drake plan is a simple data frame of class "drake_plan", and it is drake’s version of a Makefile. The manual has a whole chapter on plans.

B.2.4 Specification

A drake plan is an implicit representation of targets and their immediate dependencies. Before make() starts to build targets, drake makes all these local dependency structures explicit and machine-readable in a workflow specification. The overall specification (config$spec) is an R environment with the local specification of each individual target and each imported object/function. Each local specification is a list of class "drake_spec", and it contains the names of objects referenced from the command, the files declared with file_in() and friends, the dependencies of the condition and change triggers, etc.
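To make the idea concrete, here is a simplified sketch that turns a plan-like data frame into an environment of local specifications. It is not drake's implementation: the helper local_spec() and the class "example_spec" are hypothetical, and base R's all.vars() is only a crude stand-in for real static code analysis.

```r
# A plan-like data frame with commands stored as text (made-up example).
plan <- data.frame(
  target = c("raw", "clean", "fit"),
  command = c(
    "read.csv(file_in(\"data.csv\"))",
    "subset(raw, !is.na(y))",
    "lm(y ~ x, data = clean)"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: record the symbols a command refers to.
local_spec <- function(target, command) {
  expr <- parse(text = command)[[1]]
  structure(
    list(target = target, command = expr, deps = all.vars(expr)),
    class = "example_spec"
  )
}

# The overall specification: one local specification per target.
spec <- new.env(hash = TRUE, parent = emptyenv())
for (i in seq_len(nrow(plan))) {
  spec[[plan$target[i]]] <- local_spec(plan$target[i], plan$command[i])
}

# Keep only the dependencies that are themselves targets in the plan.
intersect(spec[["fit"]]$deps, plan$target)
```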

B.2.5 Graph

Whereas the specification tracks the local dependency structures, the graph (an igraph object) represents the global dependency structure of the whole workflow. It is less granular than the specification, and make() uses it to run the correct targets in the correct order.
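The following sketch shows how a build order can be read off an igraph object. It assumes the igraph package is installed and that edges point from a dependency to the target that uses it. The workflow itself is made up.

```r
# Edges point from a dependency to the target that uses it.
edges <- data.frame(
  from = c("raw", "clean", "clean"),
  to = c("clean", "fit", "plot"),
  stringsAsFactors = FALSE
)

if (requireNamespace("igraph", quietly = TRUE)) {
  graph <- igraph::graph_from_data_frame(edges, directed = TRUE)
  # A topological sort lists every target after its dependencies.
  build_order <- igraph::as_ids(igraph::topo_sort(graph, mode = "out"))
  print(build_order)
}
```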

B.2.6 Priority queue

In high-performance computing settings (e.g. parallelism = "clustermq" and parallelism = "future") drake creates a priority queue to schedule targets. For the sake of convenience, the underlying algorithms differ from those of a classical priority queue, but this does not seem to decrease performance in practice.
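One simple way to schedule targets with a queue, sketched below in base R, is to give each target a priority equal to its number of unbuilt dependencies and to pop any target whose priority is zero. This is an illustrative assumption about how such a queue could work, not a description of drake's actual algorithm.

```r
# Dependencies of each target (made-up workflow).
deps <- list(
  raw = character(0),
  clean = "raw",
  fit = "clean",
  plot = "clean"
)

# Priority = number of dependencies that are not built yet.
queue <- vapply(deps, length, integer(1))

# Return one target whose priority is zero, or NULL if none is ready.
pop_ready <- function(queue) {
  ready <- names(queue)[queue == 0L]
  if (length(ready) == 0L) NULL else ready[1]
}

build_order <- character(0)
while (length(queue) > 0L) {
  target <- pop_ready(queue)
  if (is.null(target)) {
    stop("no target is ready: circular dependencies?")
  }
  build_order <- c(build_order, target)
  queue <- queue[names(queue) != target]
  # Lower the priority of every remaining target that waited on this one.
  waiting <- vapply(
    names(queue),
    function(x) target %in% deps[[x]],
    logical(1)
  )
  queue[waiting] <- queue[waiting] - 1L
}
build_order
```

In a parallel setting, every target with priority zero could be sent to a worker at once; the loop above builds them one at a time for simplicity.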

B.2.7 Metadata

config$meta is an environment, and each element is a list of class "drake_meta". Whereas the workflow specification identifies the names of dependencies, the "drake_meta" contains hashes (and supporting information). drake uses the hashes to decide if the target is up to date. Metadata lists are stored in the "meta" namespace of the decorated storr.

config$meta_old is similar to config$meta and exists for performance purposes.
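The sketch below illustrates the general idea of comparing stored hashes with current ones. Everything here is an assumption made for illustration: hash_object() uses base R serialization plus an MD5 checksum in place of drake's own hashing, and the fields of the hypothetical "example_meta" list are invented.

```r
# Hash an R object with base R only: serialize it to a temporary file
# and take the MD5 checksum. A stand-in, not drake's hashing method.
hash_object <- function(x) {
  path <- tempfile()
  on.exit(unlink(path))
  writeBin(serialize(x, connection = NULL), path)
  unname(tools::md5sum(path))
}

# Hypothetical metadata record for one target.
new_meta <- function(command, deps) {
  structure(
    list(
      command_hash = hash_object(command),
      dependency_hash = hash_object(deps)
    ),
    class = "example_meta"
  )
}

# A target is up to date when no stored hash differs from the current one.
is_up_to_date <- function(meta_old, meta_new) {
  identical(unclass(meta_old), unclass(meta_new))
}

command <- quote(lm(y ~ x, data = clean))
meta_old <- new_meta(command, deps = list(clean = c(1, 2, 3)))
meta_same <- new_meta(command, deps = list(clean = c(1, 2, 3)))
meta_changed <- new_meta(command, deps = list(clean = c(4, 5, 6)))

is_up_to_date(meta_old, meta_same)
is_up_to_date(meta_old, meta_changed)
```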

B.2.8 Cache

B.2.8.1 API

drake’s cache API is a decorated storr, a reference class that wraps around a storr object. drake relies heavily on storr namespaces (e.g. for metadata and recovery keys). drake’s custom wrapper around the storr class (i.e. the “decorated” part) has extra methods that power history (a txtq) and specialized data formats, as well as hash tables that only the cache needs.

The new_cache() and drake_cache() functions create and reload drake caches, respectively, and they are equivalent to storr::storr_rds() plus drake:::decorate_storr().
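The decorator pattern described above can be sketched without storr at all. In the example below, new_store() is a toy key-value store with namespaces that stands in for a storr object, and DecoratedStore is a hypothetical reference class that forwards ordinary operations to it while adding a history feature of its own. None of these names come from drake or storr.

```r
library(methods)

# Toy key-value store with namespaces, standing in for a storr object.
new_store <- function() {
  data <- new.env(hash = TRUE, parent = emptyenv())
  full_key <- function(key, namespace) paste(namespace, key, sep = "::")
  list(
    set = function(key, value, namespace = "objects") {
      assign(full_key(key, namespace), value, envir = data)
      invisible(NULL)
    },
    get = function(key, namespace = "objects") {
      base::get(full_key(key, namespace), envir = data, inherits = FALSE)
    }
  )
}

# Hypothetical decorator: wraps the store and adds extra behavior.
DecoratedStore <- setRefClass(
  "DecoratedStore",
  fields = list(store = "list", history = "character"),
  methods = list(
    # Forward ordinary operations to the wrapped store.
    save_value = function(key, value, namespace = "objects") {
      store$set(key, value, namespace)
    },
    read_value = function(key, namespace = "objects") {
      store$get(key, namespace)
    },
    # Extra method that the wrapped store does not provide.
    record_build = function(key) {
      history <<- c(history, key)
      invisible(NULL)
    }
  )
)

cache <- DecoratedStore$new(store = new_store(), history = character(0))
cache$save_value("fit", 42)
cache$save_value("fit", "abc123", namespace = "meta")
cache$record_build("fit")
cache$read_value("fit", namespace = "meta")
cache$history
```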

B.2.8.2 Data

Usually, the persistent data values live in a hidden .drake/ folder. Most of the files come from storr_rds() methods. Other files include the history txtq and the values of targets with specialized data formats. The files are structured so they can be used with either storr::storr_rds() or drake::drake_cache().

Other storr backends like storr_environment() and storr_dbi() are also compatible with this approach. In these non-standard cases, .drake/ does not contain the files of the inner storr, but it still has files supporting history and specialized target formats.

B.2.9 Code analysis lists

drake performs static code analysis on functions and commands in order to resolve the dependency structure of a workflow. Lists of class drake_deps and drake_deps_ht store the results of static code analysis on a single code chunk. Each element of a drake_deps list is a character vector of static dependencies of a certain type (e.g. global variables or file_in() files). The elements of drake_deps_ht lists are hash tables (which increase performance when the static code analysis is running).
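A much simplified version of this kind of analysis can be written in a few lines of base R. The sketch below walks a single expression and sorts what it finds into two element types. The function analyze_code() and the class "example_deps" are hypothetical, and the real analysis handles many cases that this sketch ignores (local variables, function arguments, namespaced calls, and so on).

```r
# Walk an expression and collect two kinds of static dependencies:
# global symbols and the literal paths passed to file_in().
analyze_code <- function(expr) {
  results <- list(globals = character(0), file_in = character(0))
  walk <- function(x) {
    if (is.symbol(x)) {
      results$globals <<- union(results$globals, as.character(x))
    } else if (is.call(x)) {
      if (identical(x[[1]], quote(file_in))) {
        # Record the string arguments of file_in() as file dependencies.
        paths <- Filter(is.character, as.list(x)[-1])
        results$file_in <<- union(results$file_in, unlist(paths))
      } else {
        lapply(as.list(x), walk)
      }
    }
    invisible(NULL)
  }
  walk(expr)
  structure(results, class = "example_deps")
}

deps <- analyze_code(
  quote(summarize(read.csv(file_in("data.csv")), by = group))
)
deps$globals
deps$file_in
```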

B.2.10 Environments

drake has memory management strategies to make sure a target’s dependencies are loaded when make() runs its command. Internally, memory management works with a layered system of environments. This system helps make() protect the user’s calling environment and perform dynamic branching without the need for static code analysis or metaprogramming.

1. config$envir: the calling environment of make(), which contains the user’s functions and other imported objects. make() tries to leave this environment alone (and temporarily locks it when lock_envir is TRUE).
2. config$envir_targets: contains static targets. Its parent is config$envir.
3. config$envir_dynamic: contains entire aggregated dynamic targets when drake needs them. Its parent is config$envir_targets.
4. config$envir_subtargets: contains individual sub-targets. Its parent is config$envir_dynamic.

In addition, config$envir_loaded keeps track of which targets are loaded in (2), (3), and (4) above.

These environments form a known data clump, and future development will encapsulate them.
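The layering can be reproduced with plain R environments. The sketch below uses the same parent relationships as the list above, with invented contents, and shows that a command evaluated in the innermost layer can see objects from every layer while the outermost environment stays free of targets.

```r
# Stand-in for the user's calling environment.
envir <- new.env(parent = globalenv())
envir$clean_data <- function(x) x[!is.na(x)]

# Each layer's parent is the layer before it.
envir_targets <- new.env(parent = envir)
envir_dynamic <- new.env(parent = envir_targets)
envir_subtargets <- new.env(parent = envir_dynamic)

# Load a static target and a sub-target into their own layers.
envir_targets$raw <- c(1, NA, 3)
envir_subtargets$chunk <- 2

# A command evaluated in the innermost layer reaches everything above it.
result <- eval(
  quote(sum(clean_data(raw)) * chunk),
  envir = envir_subtargets
)
result

# The user's environment does not receive any targets.
exists("raw", envir = envir, inherits = FALSE)
```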

B.2.11 Hash tables

The drake_config() object and decorated storr keep track of multiple hash tables to cache data in memory and boost speed while iterating over large collections of targets. They are simply R environments with hash = TRUE, and drake has internal interface functions for working with them. Examples in drake_config() objects:

ht_is_dynamic: keeps track of names of dynamic targets. Makes is_dynamic() faster.
ht_is_subtarget: same as above, but for is_subtarget().
ht_dynamic_deps: names of dynamic dependencies of dynamic targets. Powers is_dynamic_dep().
ht_target_exists: tracks targets that already exist at the beginning of make().
ht_subtarget_parents: keeps track of the parent of each sub-target.

Examples in the decorated storr:

ht_encode_path and ht_decode_path: drake uses Base32 encoding to store references to static file paths. These hash tables avoid redundant encoding/decoding operations and increase performance for large collections of targets.
ht_encode_namespaced and ht_decode_namespaced: same for imported namespaced functions.
ht_hash: powers memo_hash(), which helps us avoid redundant calls to input_file_hash(), output_file_hash(), static_dependency_hash(), and dynamic_dependency_hash().
ht_keys: a small hash table that powers the set_progress method. This progress information is stored in the cache by default, and the user can retrieve it with drake_progress().
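The caching role of these hash tables can be sketched as follows. The helper functions are illustrative wrappers around hashed environments, and drake's internal interface functions may differ. The slow_encode() function is a placeholder for expensive work such as encoding or hashing.

```r
# Illustrative hash-table helpers built on hashed environments.
ht_new <- function() new.env(hash = TRUE, parent = emptyenv())
ht_set <- function(ht, key, value) assign(key, value, envir = ht)
ht_get <- function(ht, key) get(key, envir = ht, inherits = FALSE)
ht_exists <- function(ht, key) exists(key, envir = ht, inherits = FALSE)

# Compute fun(key) once per key and reuse the stored result afterwards.
memoize <- function(ht, key, fun) {
  if (!ht_exists(ht, key)) {
    ht_set(ht, key, fun(key))
  }
  ht_get(ht, key)
}

calls <- 0
slow_encode <- function(x) {
  calls <<- calls + 1
  toupper(x)  # placeholder for real work
}

ht <- ht_new()
memoize(ht, "data.csv", slow_encode)
memoize(ht, "data.csv", slow_encode)
calls  # the second request was served from the hash table
```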

B.2.12 Logger

The logger (config$logger) is a reference class that controls messages to the console and a custom log file (if applicable). Logging messages help users informally monitor the progress of make().
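A logger of this kind can be sketched as a small reference class. The class ExampleLogger, its fields, and its log_message() method are hypothetical and do not reflect the fields or methods of drake's own logger.

```r
library(methods)

ExampleLogger <- setRefClass(
  "ExampleLogger",
  fields = list(verbose = "logical", log_path = "character"),
  methods = list(
    log_message = function(...) {
      text <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), "|", ...)
      # Console output, if requested.
      if (isTRUE(verbose)) {
        message(text)
      }
      # Log file output, if a path was supplied.
      if (length(log_path) == 1L) {
        cat(text, "\n", sep = "", file = log_path, append = TRUE)
      }
      invisible(text)
    }
  )
)

log_file <- tempfile(fileext = ".log")
logger <- ExampleLogger$new(verbose = TRUE, log_path = log_file)
logger$log_message("target", "fit")
logger$log_message("target", "plot")
readLines(log_file)
```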