B Design
This chapter explains drake
’s internal design and architecture. Goals:
- Help developers and enthusiastic users contribute to the code base.
- Invite high-level advice and discussion about potential improvements to the overall design.
B.1 Principles
B.1.1 Functions first
From the user’s point of view, drake
is a style of programming in its own right, and that style is zealously and irrevocably function-oriented. It harmonizes with statistics and data science, where most methodology naturally takes the form of data transformations, and it embraces the natively function-oriented design of the R language. Functions are first-class citizens in drake
, and they dominate the internal design at the highest levels.
B.1.2 Light use of traditional OOP
Most of a drake
workflow happens inside the make()
function. make()
accepts a data frame of function calls (the drake
plan), caches some targets, and then drops its internal state when it terminates. The state does not need to persist, and the user does not need to interact with it. This is a major reason why traditional object-oriented programming plays such a small, supporting role.
In drake
, full OOP classes and objects are small, simple, and extremely specialized. For example, the decorated storr
, priority queue, and logger reference classes are narrowly defined and fit for purpose. The S3 system appears far more often, often as a mechanism of function overloading to streamline control flow, and also as a means of adding structure and validation to small target-specific objects optimized for performance.
In future development, tactical reference classes will arise as needed to encapsulate low-level patterns into natural abstractions. However, drake
’s design places greater importance on maximizing runtime efficiency.
B.1.3 High-performant small objects
drake
maintains several small list-like objects for each target, such as the local spec, the target data, triggers, and the code analysis results. drake
workflows with thousands of targets have thousands of these objects, and as profiling studies have shown, we need these objects to perform as efficiently as possible. Instantiation and field access need to be fast, and the memory footprint needs to be low. For these reasons, we choose simple lists with S3 class attributes, which outclass S4 and reference classes when it comes to instantiation speed.
B.1.4 Fast iteration along aggregated data
Each of the large data structures aggregates a single type of information across all targets to help drake
run fast. Examples include the whole workflow specification (config$spec
) and the in-memory target metadata cache (config$meta
). These objects are hash-table-powered environments to make field access as fast as possible.
B.1.5 Access to information across targets
drake
aggressively analyzes dependency relationships among targets. Even while make()
builds a single target, it needs to stay aware of the other targets, not only to build the dependency graph, but also for other tasks like dynamic branching. This is a major reason why the workflow specification, dependency graph, priority queue, and metadata are all stored in environments that most functions can reach.
B.2 Specific classes
This section describes drake
’s primary internal data structures at a high level. It is not exhaustive, but it does cover most of the architecture.
B.2.1 Config
make()
, outdated()
, vis_drake_graph()
, and related utilities keep track of a drake_config()
object. A drake_config()
object is a list of class "drake_config"
. Its purpose is to keep track of the state of a drake
workflow and avoid long parameter lists in functions. Future development will focus on refactoring and formalizing drake_config()
objects.
B.2.2 Settings
Static runtime parameters such as keep_going
and log_build_times
live in a list of class drake_settings
, which is part of each drake_config
object.
B.2.3 Plan
The drake
plan is a simple data frame of class "drake_plan"
, and it is drake
’s version of a Makefile. The manual has a whole chapter on plans.
B.2.4 Specification
A drake
plan is an implicit representation of targets and their immediate dependencies. Before make()
starts to build targets, drake
makes all these local dependency structures explicit and machine-readable in a workflow specification
. The overall specification (config$spec
) an R environment with the local specification of each individual target and each imported object/function. Each local specification is a list of class "drake_spec"
, and it contains the names of objects referenced from the command, the files declared with file_in()
and friends, the dependencies of the condition
and change
triggers, etc.
B.2.5 Graph
Whereas the specification tracks the local dependency structures, the graph (an igraph
object) represents the global dependency structure of the whole workflow. It is less granular than the specification, and make()
uses it to run the correct targets in the correct order.
B.2.6 Priority queue
In high-performance computing settings (e.g. parallelism = "clustermq"
and parallelism = "future"
) drake
creates a priority queue to schedule targets. For the sake of convenience, the underlying algorithms are different than that of a classical priority queue, but this does not seem to decrease performance in practice.
B.2.7 Metadata
config$meta
is an environment, and each element is a list of class "drake_meta"
. Whereas the workflow specification identifies the names of dependencies, the "drake_meta"
contains hashes (and supporting information). drake
uses the hashes decide if the target is up to date. Metadata lists are stored in the "meta"
namespace of the decorated storr
.
config$meta_old
is similar to config$meta
and exists for performance purposes.
B.2.8 Cache
B.2.8.1 API
drake
’s cache API is a decorated storr
, a reference class that wraps around a storr
object. drake
relies heavily on storr
namespaces (e.g. for metadata and recovery keys). drake
’s custom wrapper around the storr
class (i.e. the “decorated” part) has extra methods that power history (a txtq
) and specialized data formats, as well as hash tables that only the cache needs.
The new_cache()
and drake_cache()
functions create and reload drake
caches, respectively, and they are equivalent to storr::storr_rds()
plus drake:::decorate_storr()
.
B.2.8.2 Data
Usually, the persistent data values live in a hidden .drake/
folder. Most of the files come from storr_rds()
methods. Other files include the history txtq
and the values of targets with specialized data formats. The files are structured so they can be used by either with storr::storr_rds()
or drake::drake_cache()
.
Other storr
backends like storr_environment()
and storr_dbi()
are also compatible with this approach. In these non-standard cases, .drake/
does not contain the files of the inner storr
, but it still has files supporting history and specialized target formats.
B.2.9 Code analysis lists
drake
performs static code analysis on functions and commands in order to resolve the dependency structure of a workflow. Lists of class drake_deps
and drake_deps_ht
store the results of static code analysis on a single code chunk. Each element of a drake_deps
list is a character vector of static dependencies of a certain type (e.g. global variables or file_in()
files). The elements of drake_deps_ht
lists are hash tables (which increase performance when the static code analysis is running).
B.2.10 Environments
drake
has memory management strategies to make sure a target’s dependencies are loaded when make()
runs its command. Internally, memory management works with a layered system of environments. This system helps make()
protect the user’s calling environment and perform dynamic branching without the need for static code analysis or metaprogramming.
config$envir
: the calling environment ofmake()
, which contains the user’s functions and other imported objects.make()
tries to leave this environment alone (and temporarily locks it whenlock_envir
isTRUE
).config$envir_targets
: contains static targets. Its parent isconfig$envir
.config$dynamic
: contains entire aggregated dynamic targets whendrake
needs them. Its parent isconfig$envir_targets
.config$envir_subtargets
: contains individual sub-targets. Its parent isconfig$envir_dynamic
.
In addition, config$envir_loaded
keeps track of which targets are loaded in (2), (3), and (4) above.
These environments form a known data clump, and future development will encapsulate them.
B.2.11 Hash tables
The drake_config()
object and decorated storr
keep track of multiple hash tables to cache data in memory and boost speed while iterating over large collections of targets. They are simply R environments with hash = TRUE
, and drake
has internal interface functions for working with them. Examples in drake_config()
objects:
ht_is_dynamic
: keeps track of names of dynamic targets. Makesis_dynamic()
faster.ht_is_subtarget
: same as above, but foris_subtarget()
.ht_dynamic_deps
: names of dynamic dependencies of dynamic targets. Powersis_dynamic_dep()
.ht_target_exists
: tracks targets that already exist at the beginning ofmake()
.ht_subtarget_parents
: keeps track of the parent of each sub-target.
Examples in the decorated storr
:
ht_encode_path
andht_decode_path
:drake
uses Base32 encoding to store references to static file paths. These hash tables avoid redundant encoding/decoding operations and increases performance for large collections of targets.ht_encode_namespaced
andht_decode_namespaced
: same for imported namespaced functions.ht_hash
: powersmemo_hash()
, which helps us avoid redundant calls toinput_file_hash()
,output_file_hash()
,static_dependency_hash()
, anddynamic_dependency_hash()
.ht_keys
: a small hash table that powers theset_progress
method. This progress information is stored in the cache by default, and the user can retrieve it withdrake_progress()
.
B.2.12 Logger
The logger (config$logger
) is a reference class that controls messages to the console and a custom log file) if applicable). Logging messages help users informally monitor the progress of make()
.