This chapter explains drake’s internal design and architecture. Goals:
- Help developers and enthusiastic users contribute to the code base.
- Invite high-level advice and discussion about potential improvements to the overall design.
B.1.1 Functions first
From the user’s point of view, drake is a style of programming in its own right, and that style is zealously and irrevocably function-oriented. It harmonizes with statistics and data science, where most methodology naturally takes the form of data transformations, and it embraces the natively function-oriented design of the R language. Functions are first-class citizens in drake, and they dominate the internal design at the highest levels.
B.1.2 Light use of traditional OOP
Most of a drake workflow happens inside the make() function. make() accepts a data frame of function calls (the drake plan), caches some targets, and then drops its internal state when it terminates. The state does not need to persist, and the user does not need to interact with it. This is a major reason why traditional object-oriented programming plays such a small, supporting role.
In drake, full OOP classes and objects are small, simple, and extremely specialized. For example, the decorated storr, priority queue, and logger reference classes are narrowly defined and fit for purpose. The S3 system appears far more often, usually as a mechanism of function overloading to streamline control flow, and also as a means of adding structure and validation to small target-specific objects optimized for performance.
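As a rough illustration of S3 as function overloading, the sketch below lets a class attribute select the code path instead of an if/else chain. The generic build_target() and the classes toy_static and toy_dynamic are hypothetical names invented for this example; they are not drake internals.

```r
# Hypothetical generic: the class of `target` selects the method,
# so no if/else chain is needed at the call site.
build_target <- function(target, ...) {
  UseMethod("build_target")
}

# Method for (hypothetical) static targets: evaluate the command once.
build_target.toy_static <- function(target, ...) {
  eval(target$command, envir = baseenv())
}

# Method for (hypothetical) dynamic targets: evaluate the command once
# per element of `pieces` and collect the results in a list.
build_target.toy_dynamic <- function(target, ...) {
  lapply(target$pieces, function(x) {
    eval(target$command, envir = list(x = x), enclos = baseenv())
  })
}

static <- structure(list(command = quote(1 + 1)), class = "toy_static")
dynamic <- structure(
  list(command = quote(x * 10), pieces = 1:3),
  class = "toy_dynamic"
)

static_value <- build_target(static)
dynamic_value <- build_target(dynamic)
```

The caller writes build_target(x) in both cases, and adding a new kind of target means adding a method rather than editing existing control flow.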
In future development, tactical reference classes will arise as needed to encapsulate low-level patterns into natural abstractions. However, drake’s design places greater importance on maximizing runtime efficiency.
B.1.3 High-performance small objects
drake maintains several small list-like objects for each target, such as the local spec, the target data, triggers, and the code analysis results. drake workflows with thousands of targets have thousands of these objects, and as profiling studies have shown, we need these objects to perform as efficiently as possible. Instantiation and field access need to be fast, and the memory footprint needs to be low. For these reasons, we choose simple lists with S3 class attributes, which outclass S4 and reference classes when it comes to instantiation speed.
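A minimal sketch of the "simple list with an S3 class attribute" pattern follows. The constructor new_toy_spec(), the validator validate_toy_spec(), and the field names are assumptions made for illustration and do not reproduce drake's actual objects.

```r
# Hypothetical constructor: a plain list plus a class attribute.
new_toy_spec <- function(target, deps = character(0), seed = NA_integer_) {
  out <- list(target = target, deps = deps, seed = seed)
  class(out) <- c("toy_spec", "list")
  out
}

# Hypothetical validator: cheap structural checks, kept separate from the
# constructor so hot loops can skip them.
validate_toy_spec <- function(x) {
  stopifnot(
    inherits(x, "toy_spec"),
    is.character(x$target),
    length(x$target) == 1L,
    is.character(x$deps)
  )
  invisible(x)
}

spec <- validate_toy_spec(new_toy_spec("model", deps = c("data", "fit_fn")))
spec_deps <- spec$deps  # field access is ordinary list access
```

Keeping construction separate from validation is one way to keep instantiation cheap when thousands of these objects are created.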
B.1.4 Fast iteration along aggregated data
Each of the large data structures aggregates a single type of information across all targets to help drake run fast. Examples include the whole workflow specification (config$spec) and the in-memory target metadata cache (config$meta). These objects are hash-table-powered environments to make field access as fast as possible.
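To make the idea of a hash-table-powered environment concrete, here is a small base R sketch that aggregates one list per target in a single environment. The object toy_spec and its fields are hypothetical stand-ins, not the real config$spec.

```r
# An environment with hash = TRUE acts as a mutable hash table keyed by name.
toy_spec <- new.env(hash = TRUE, parent = emptyenv())

targets <- paste0("target_", seq_len(1000))
for (name in targets) {
  # One small list per target; the fields are hypothetical.
  assign(name, list(target = name, deps = character(0)), envir = toy_spec)
}

# Look up a single target by name without scanning the whole collection.
one <- get("target_500", envir = toy_spec, inherits = FALSE)

# Iterate over every target stored in the aggregate.
n_deps <- vapply(
  ls(toy_spec),
  function(name) length(toy_spec[[name]]$deps),
  integer(1)
)
```

Because environments have reference semantics, functions that receive toy_spec can read or update a single entry without copying the other 999.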
B.1.5 Access to information across targets
drake aggressively analyzes dependency relationships among targets. Even while make() builds a single target, it needs to stay aware of the other targets, not only to build the dependency graph, but also for other tasks like dynamic branching. This is a major reason why the workflow specification, dependency graph, priority queue, and metadata are all stored in environments that most functions can reach.
B.2 Specific classes
This section describes drake’s primary internal data structures at a high level. It is not exhaustive, but it does cover most of the architecture.
vis_drake_graph(), and related utilities keep track of a drake_config() object. A drake_config() object is a list of class "drake_config". Its purpose is to keep track of the state of a drake workflow and avoid long parameter lists in functions. Future development will focus on refactoring and formalizing
Static runtime parameters such as log_build_times live in a list of class drake_settings, which is part of each
A drake plan is a simple data frame of class "drake_plan", and it is drake’s version of a Makefile. The manual has a whole chapter on plans.
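The sketch below mimics the shape of a plan in base R: one row per target, with the command kept as an unevaluated call. The class toy_plan and the function run_toy_plan() are hypothetical, and the loop assumes the rows are already listed in a valid build order, which a real workflow would determine from the dependency graph.

```r
# A base R stand-in for a plan: one row per target, with the command
# stored as an unevaluated call in a list column.
toy_plan <- data.frame(target = c("raw", "total"), stringsAsFactors = FALSE)
toy_plan$command <- list(quote(c(1, 2, 3)), quote(sum(raw)))
class(toy_plan) <- c("toy_plan", class(toy_plan))

# Toy build loop: run the commands in row order and keep the values in a
# temporary environment that is dropped when the function returns.
run_toy_plan <- function(plan) {
  envir <- new.env(parent = baseenv())
  for (i in seq_len(nrow(plan))) {
    value <- eval(plan$command[[i]], envir = envir)
    assign(plan$target[i], value, envir = envir)
  }
  mget(plan$target, envir = envir)
}

results <- run_toy_plan(toy_plan)
```

Like a Makefile rule, each row pairs a name with the recipe that produces it, and the second command can refer to the first target by name.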
A drake plan is an implicit representation of targets and their immediate dependencies. Before make() starts to build targets, drake makes all these local dependency structures explicit and machine-readable in a workflow specification. The overall specification (config$spec) is an R environment with the local specification of each individual target and each imported object/function. Each local specification is a list of class "drake_spec", and it contains the names of objects referenced from the command, the files declared with file_in() and friends, the dependencies of the change triggers, etc.
Whereas the specification tracks the local dependency structures, the graph (an igraph object) represents the global dependency structure of the whole workflow. It is less granular than the specification, and make() uses it to run the correct targets in the correct order.
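To show how a global ordering can be derived from local dependency lists, here is a base R sketch of a Kahn-style topological sort. It is a stand-in for illustration only: drake stores the graph as an igraph object, and the target names and the function build_order() are hypothetical.

```r
# Hypothetical local dependency structure: target -> immediate dependencies.
deps <- list(
  raw = character(0),
  clean = "raw",
  model = "clean",
  report = c("clean", "model")
)

# Kahn-style ordering: repeatedly emit the targets whose dependencies
# have all been emitted already.
build_order <- function(deps) {
  remaining <- deps
  done <- character(0)
  while (length(remaining) > 0) {
    is_ready <- vapply(remaining, function(d) all(d %in% done), logical(1))
    ready <- names(remaining)[is_ready]
    if (length(ready) == 0) {
      stop("circular dependency detected")
    }
    done <- c(done, ready)
    remaining <- remaining[setdiff(names(remaining), ready)]
  }
  done
}

target_order <- build_order(deps)
```

Targets that become ready in the same pass do not depend on each other, which is also what makes them candidates for parallel execution.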
B.2.6 Priority queue
In high-performance computing settings (e.g. parallelism = "clustermq" and parallelism = "future"), drake creates a priority queue to schedule targets. For the sake of convenience, the underlying algorithms differ from those of a classical priority queue, but this does not seem to decrease performance in practice.
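One simple way to schedule targets is to track, for each target, how many of its dependencies are still unbuilt, and to release targets whose count reaches zero. The reference class below is a sketch under that assumption; toy_queue and its methods are hypothetical and are not drake's queue implementation.

```r
toy_queue <- methods::setRefClass(
  "toy_queue",
  fields = list(ndeps = "numeric"),
  methods = list(
    # Return the name of one target with no unbuilt dependencies and
    # remove it from the queue, or return NULL if none is ready.
    pop_ready = function() {
      ready <- names(ndeps)[ndeps == 0]
      if (length(ready) == 0) {
        return(NULL)
      }
      ndeps <<- ndeps[names(ndeps) != ready[1]]
      ready[1]
    },
    # Record that one dependency of each listed target has finished.
    decrease = function(targets) {
      updated <- ndeps
      updated[targets] <- updated[targets] - 1
      ndeps <<- updated
      invisible(NULL)
    },
    # TRUE when no targets are left in the queue.
    empty = function() {
      length(ndeps) == 0
    }
  )
)

queue <- toy_queue$new(ndeps = c(raw = 0, clean = 1, model = 1))
first <- queue$pop_ready()
queue$decrease("clean")
second <- queue$pop_ready()
```

A linear scan for ready targets is slower in theory than a heap, but it keeps the class small and easy to reason about.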
config$meta is an environment, and each element is a list of class "drake_meta". Whereas the workflow specification identifies the names of dependencies, each "drake_meta" list contains hashes (and supporting information). drake uses the hashes to decide whether a target is up to date. Metadata lists are stored in the "meta" namespace of the decorated
config$meta_old is similar to config$meta and exists for performance purposes.
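The hash comparison described above can be sketched with base R alone. The helper toy_hash(), the class toy_meta, and the two hash fields are assumptions chosen for illustration; the real metadata lists hold more information and use a different hashing routine.

```r
# Hypothetical helper: hash any R object using only base tools.
toy_hash <- function(x) {
  path <- tempfile()
  on.exit(unlink(path))
  writeBin(serialize(x, connection = NULL), path)
  unname(tools::md5sum(path))
}

# Hypothetical metadata list with one hash per kind of input.
new_toy_meta <- function(command, deps) {
  structure(
    list(command = toy_hash(command), dependencies = toy_hash(deps)),
    class = "toy_meta"
  )
}

# A target counts as up to date when every hash matches the stored copy.
is_up_to_date <- function(meta, stored) {
  !is.null(stored) &&
    identical(meta$command, stored$command) &&
    identical(meta$dependencies, stored$dependencies)
}

old <- new_toy_meta(quote(sum(raw)), deps = list(raw = c(1, 2, 3)))
same <- new_toy_meta(quote(sum(raw)), deps = list(raw = c(1, 2, 3)))
changed <- new_toy_meta(quote(sum(raw)), deps = list(raw = c(1, 2, 4)))
```

Comparing short hashes rather than the underlying values keeps the check cheap even when the dependencies themselves are large.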
drake’s cache API is a decorated storr, a reference class that wraps around a
drake relies heavily on storr namespaces (e.g. for metadata and recovery keys).
drake’s custom wrapper around the storr class (i.e. the “decorated” part) has extra methods that power history (a txtq) and specialized data formats, as well as hash tables that only the cache needs.
drake_cache() functions create and reload drake caches, respectively, and they are equivalent to
Usually, the persistent data values live in a hidden .drake/ folder. Most of the files come from storr_rds() methods. Other files include the history txtq and the values of targets with specialized data formats. The files are structured so they can be used by either with
storr backends like storr_dbi() are also compatible with this approach. In these non-standard cases, .drake/ does not contain the files of the inner storr, but it still has files supporting history and specialized target formats.
B.2.9 Code analysis lists
drake performs static code analysis on functions and commands in order to resolve the dependency structure of a workflow. Lists of class drake_deps and drake_deps_ht store the results of static code analysis on a single code chunk. Each element of a drake_deps list is a character vector of static dependencies of a certain type (e.g. global variables or file_in() files). The elements of drake_deps_ht lists are hash tables (which increase performance when the static code analysis is running).
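A very reduced form of static code analysis can be written with base R's all.names() and all.vars(). The sketch below is an assumption-laden toy: analyze_command(), as_hash_tables(), and the class toy_deps are hypothetical, and the real analysis handles many more cases (files, namespaced calls, local variables, and so on).

```r
# Hypothetical analyzer: collect the symbols in a command without running
# it, and separate variable names from function names.
analyze_command <- function(command) {
  all_symbols <- unique(all.names(command))
  variables <- unique(all.vars(command))
  structure(
    list(
      globals = variables,
      functions = setdiff(all_symbols, variables)
    ),
    class = "toy_deps"
  )
}

# Hypothetical hash-table variant: one environment per dependency type,
# so membership checks during the analysis do not scan a vector.
as_hash_tables <- function(deps) {
  lapply(unclass(deps), function(symbols) {
    ht <- new.env(hash = TRUE, parent = emptyenv())
    for (symbol in symbols) {
      assign(symbol, TRUE, envir = ht)
    }
    ht
  })
}

deps <- analyze_command(quote(summarize(clean(raw_data), digits = n_digits)))
deps_ht <- as_hash_tables(deps)
has_raw_data <- exists("raw_data", envir = deps_ht$globals, inherits = FALSE)
```

The character-vector form is convenient to store and print, while the hash-table form suits the repeated lookups that happen while the code is being walked.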
drake has memory management strategies to make sure a target’s dependencies are loaded when make() runs its command. Internally, memory management works with a layered system of environments. This system helps make() protect the user’s calling environment and perform dynamic branching without the need for static code analysis or metaprogramming.
1. config$envir: the calling environment of make(), which contains the user’s functions and other imported objects. make() tries to leave this environment alone (and temporarily locks it when
2. config$envir_targets: contains static targets. Its parent is
3. config$dynamic: contains entire aggregated dynamic targets when drake needs them. Its parent is
4. config$envir_subtargets: contains individual sub-targets. Its parent is
5. config$envir_loaded: keeps track of which targets are loaded in (2), (3), and (4) above.
These environments form a known data clump, and future development will encapsulate them.
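The following sketch shows how layered environments let a command see user functions and loaded targets while keeping new bindings out of the user's environment. The variable names are stand-ins, and the parent chain used here (each layer's parent is the layer created just before it) is an assumption made for illustration.

```r
# Stand-in for the calling environment with user functions and imports.
envir <- new.env(parent = baseenv())
assign("double_it", function(x) 2 * x, envir = envir)

# Assumed chain for illustration: each layer's parent is the previous layer.
envir_targets <- new.env(parent = envir)
envir_dynamic <- new.env(parent = envir_targets)
envir_subtargets <- new.env(parent = envir_dynamic)

assign("static_target", 21, envir = envir_targets)
assign("sub_target", 5, envir = envir_subtargets)

# Lock the user-facing layer so that no new bindings can be added to it.
lockEnvironment(envir)

# A command evaluated in the innermost layer can see every outer layer.
value <- eval(quote(double_it(static_target) + sub_target), envir_subtargets)

# Adding a binding to the locked layer fails instead of polluting it.
attempt <- try(assign("stray", 1, envir = envir), silent = TRUE)
```

Because lookups walk the parent chain, targets can be loaded and unloaded in the inner layers without touching the objects the user defined.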
B.2.11 Hash tables
The drake_config() object and the decorated storr keep track of multiple hash tables to cache data in memory and boost speed while iterating over large collections of targets. They are simply R environments with hash = TRUE, and drake has internal interface functions for working with them. Examples in
- ht_is_dynamic: keeps track of names of dynamic targets. Makes
- ht_is_subtarget: same as above, but for
- ht_dynamic_deps: names of dynamic dependencies of dynamic targets. Powers
- ht_target_exists: tracks targets that already exist at the beginning of
- ht_subtarget_parents: keeps track of the parent of each sub-target.
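A small interface over hashed environments might look like the sketch below. The function names toy_ht_new(), toy_ht_set(), and toy_ht_exists() and the example keys are hypothetical; they only illustrate the idea of wrapping environments in helper functions and do not reproduce drake's internal interface.

```r
# Create a hash table, optionally pre-filled with a set of keys.
toy_ht_new <- function(keys = character(0)) {
  ht <- new.env(hash = TRUE, parent = emptyenv())
  toy_ht_set(ht, keys)
  ht
}

# Mark each key as present. Environments are modified in place.
toy_ht_set <- function(ht, keys) {
  for (key in keys) {
    assign(key, TRUE, envir = ht)
  }
  invisible(ht)
}

# Check membership without searching any parent environment.
toy_ht_exists <- function(ht, key) {
  exists(key, envir = ht, inherits = FALSE)
}

# Hypothetical usage: track which targets are dynamic.
is_dynamic <- toy_ht_new(c("slices", "fits"))
toy_ht_set(is_dynamic, "summaries")
fits_is_dynamic <- toy_ht_exists(is_dynamic, "fits")
raw_is_dynamic <- toy_ht_exists(is_dynamic, "raw")
```

Wrapping the environment calls in helpers keeps call sites short and makes it possible to change the storage strategy in one place.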
Examples in the decorated
- drake uses Base32 encoding to store references to static file paths. These hash tables avoid redundant encoding/decoding operations and increase performance for large collections of targets.
- ht_decode_namespaced: same for imported namespaced functions.
- memo_hash(), which helps us avoid redundant calls to
- ht_keys: a small hash table that powers the set_progress method. This progress information is stored in the cache by default, and the user can retrieve it with
The logger (config$logger) is a reference class that controls messages to the console and a custom log file (if applicable). Logging messages help users informally monitor the progress of