Chapter 2 Data and metadata management
Like its predecessor, drake, the targets package:
- Abstracts files as R objects.
- Records the real-time progress of targets as they are running.
- Records special metadata in order to skip targets that are already up to date.
However, unlike drake, which outsources data and metadata management to an external package, targets has an entirely custom internal data system.
targets goes out of its way to reduce the number of files in storage, centralize the progress data and metadata, assign informative file names, and expose the file system to the user. This approach increases efficiency and portability, and it helps users understand and take control of their data.
2.1 File system
When a targets pipeline runs, it creates a folder called _targets to store all the files it needs. In targets version 0.3.1.9000 and above, users can set the data store path to something other than _targets/ with tar_config_set(). The data store has the following layout:
_targets/ # Configurable with tar_config_set().
├── meta/
├────── meta
├────── process
├────── progress
├── objects/
├────── target1
├────── target2
├────── branching_target_c7bcb4bd
├────── branching_target_285fb6a9
├────── branching_target_874ca381
└── scratch/ # Temporary files deleted at the end of tar_make().
The number of files equals the number of targets plus three (the files in _targets/meta/), which makes projects easier to upload and share among collaborators than a
.drake/ cache. (Files in
_targets/scratch/ do not count because they can all be safely deleted after
tar_make().) However, these files may still be too large and too numerous for code-specific version control systems like Git. For such projects, it may be more appropriate to share caches through external version-aware platforms such as Dropbox, Microsoft OneDrive, and Google Drive.
With the exception of dynamic files, the return value of each target lives in its own file inside
_targets/objects/. The file name is the name of the target, and there is no file extension. The metadata keeps track of the storage format that governs how to read and write the target’s data. The default format is RDS, so if target
x has no explicit format, then
readRDS("_targets/objects/x") will read the data. (However, we state this just for the sake of understanding. The recommended way to read data is
tar_read(), which takes the storage format into account.)
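As a minimal sketch of the layout described above, the following base R code builds a mock data store in a temporary directory and reads a value back with readRDS(). The store location, the target name x, and the data are illustrative assumptions, not output from a real pipeline. In a real project, tar_read() remains the recommended interface.

```r
# Mock data store in a temporary directory (illustrative only).
store <- file.path(tempfile(), "_targets")
dir.create(file.path(store, "objects"), recursive = TRUE)

# Pretend target `x` returned this data frame and was saved in the
# default RDS format. The file name is the target name, with no extension.
value <- data.frame(a = 1:3, b = c("u", "v", "w"))
saveRDS(value, file.path(store, "objects", "x"))

# Reading the file directly works only because the format is RDS.
restored <- readRDS(file.path(store, "objects", "x"))
```

For any other storage format, readRDS() would not be the right reader, which is why the metadata records the format.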
_targets/meta/process is a pipe-separated flat file recording high-level information about the external
callr process that orchestrates the targets. In that text file is the process ID, which can be used to check if
tar_make() is still running in certain situations. Notably, it helps Shiny developers make apps that allow the user to log out and then resume the session after logging back in.
_targets/meta/progress is a pipe-separated flat file with the name of each target and its current runtime progress (running, built, canceled, or errored). The information in this file helps users keep track of what the pipeline is doing at a given moment.
targets periodically appends rows to
_targets/meta/progress as the pipeline progresses, so duplicated names usually appear. For any target with duplicated rows in
_targets/meta/progress, only the last (bottommost) row is valid.
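The "last row wins" rule can be illustrated with a small base R sketch. The two-column layout (name and progress) and the rows below are assumptions made for the example; the real file may contain more columns.

```r
# Mock progress file (assumed columns: name and progress).
progress_file <- tempfile()
writeLines(
  c(
    "name|progress",
    "data|running",
    "model|running",
    "data|built",
    "model|errored"
  ),
  progress_file
)

# Read the pipe-separated flat file.
progress <- read.table(
  progress_file,
  sep = "|",
  header = TRUE,
  stringsAsFactors = FALSE
)

# Keep only the last row recorded for each target name.
latest <- progress[!duplicated(progress$name, fromLast = TRUE), ]
```

Here latest has one row per target, taken from the bottom of the file.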
In most situations, the progress file can be safely excluded from version control. Functions like
tar_graph() use progress information, but it is not essential to the reproducible end product.
targets uses special metadata to decide which targets are up to date and which need to run. The metadata file
_targets/meta/meta is a flat file with one row for every target and every global object relevant to the pipeline.
targets appends new rows to this file as the pipeline progresses. Unlike
drake, the metadata is centralized and compatible with
data.table, which makes it far faster to check which targets are up to date. In addition, the metadata system allows
targets to check not only for up-to-date targets, but also up-to-date global objects, which makes it easier for the user to understand why a target is outdated.
_targets/meta/meta has the following columns. Global objects use only the name, type, and data columns.
name: Name of the object or target.
type: Class name of the object or target.
data: Hash of the global object or the file containing the target’s return value.
command: Hash of the R command to run the target.
depend: Composite hash of all the target’s immediate upstream dependencies.
seed: Random number generator seed of the target. A target seed is unique and deterministically generated from its name.
path: The file path where the return value is stored. For dynamic files, this field could include multiple character strings.
time: Character, hash of the maximum of all the time stamps of the files in path.
size: Character, hash of the total file size of all the target’s files in path.
bytes: Numeric, total file size in bytes of all the target’s files in path.
format: Name of the storage format of the target. User-specified with tar_target() or tar_option_set().
iteration: Iteration mode of the target’s value, either "vector" or "list". User-specified with tar_target() or tar_option_set().
parent: Name of the parent pattern of the target if the target is a branch.
children: For patterns and branching stems, this field has the names of all the branches and buds. Can contain multiple character strings. Empty for branches and non-branching stems.
seconds: Runtime of the target in seconds.
warnings: Warning messages thrown when the target ran.
error: Error message thrown when the target ran.
These fields are pipe-separated in the flat file. Fields path and children can have multiple character strings, and these character strings are separated by asterisks in storage. (In memory, path and children are list columns.)
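The storage convention above (pipe-separated fields, asterisk-separated strings within a field) can be sketched in base R. The mock file below uses only a subset of the columns, and its rows are invented for illustration. The targets package itself reads this file with data.table, so this is not its actual reader.

```r
# Mock metadata file with a subset of the columns described above.
meta_file <- tempfile()
writeLines(
  c(
    "name|type|parent|children|seconds",
    "x|pattern||x_c7bcb4bd*x_285fb6a9|0",
    "x_c7bcb4bd|branch|x||1.5",
    "x_285fb6a9|branch|x||2.25"
  ),
  meta_file
)

# Read every field as character so empty fields stay empty strings.
meta <- read.table(
  meta_file,
  sep = "|",
  header = TRUE,
  colClasses = "character",
  quote = "",
  comment.char = ""
)
meta$seconds <- as.numeric(meta$seconds)

# Split the asterisk-separated field into a list column.
meta$children <- strsplit(meta$children, split = "*", fixed = TRUE)
```

After the split, the pattern row holds a character vector of branch names, and the branch rows hold empty character vectors.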
2.6 Skipping up-to-date targets
targets uses the metadata to decide if a target is up to date. The
should_run() method of the
builder class manages this. A target is outdated if one of the following conditions is met.
targets checks these rules in the order given below. There is a special
cue class that lets the user customize or suppress most of these rules.
- There is no metadata record of the target.
- The target errored last run.
- The target has a different class than it did before.
- The cue mode equals "always".
- The cue mode does not equal "never" and at least one of the remaining conditions below is met.
- The command metadata field (the hash of the R command) is different from last time.
- The depend metadata field (the hash of the immediate upstream dependency targets and global objects) is different from last time.
- The storage format (user-specified with tar_target() or tar_option_set()) is different from last time.
- The iteration method (user-specified with tar_target() or tar_option_set()) is different from last time.
- A target’s file (either the one in _targets/objects/ or a dynamic file) does not exist or changed since last time.
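The ordered checks above can be sketched as a plain function. This is a simplified illustration, not the should_run() method itself: is_outdated() and the list-based records are hypothetical, the mode values follow tar_cue() ("thorough", "always", "never"), the data field stands in for the check on the target's file, and the per-rule toggles of the cue class are not modeled.

```r
# Hypothetical helper. `old` is the stored metadata record (or NULL if none),
# and `new` describes the target as currently defined. Both are plain lists.
is_outdated <- function(old, new, cue_mode = "thorough") {
  if (is.null(old)) return(TRUE)                    # No metadata record.
  if (!is.null(old$error)) return(TRUE)             # Errored last run.
  if (!identical(old$type, new$type)) return(TRUE)  # Class changed.
  if (identical(cue_mode, "always")) return(TRUE)
  if (identical(cue_mode, "never")) return(FALSE)
  # Remaining rules, checked in order. "data" is an assumed stand-in
  # for "the file does not exist or changed since last time".
  for (field in c("command", "depend", "format", "iteration", "data")) {
    if (!identical(old[[field]], new[[field]])) return(TRUE)
  }
  FALSE
}

old <- list(
  type = "stem", command = "a1", depend = "b2",
  format = "rds", iteration = "vector", data = "c3"
)
new <- old
new$command <- "changed"

unchanged_result <- is_outdated(old, old)
changed_result <- is_outdated(old, new)
never_result <- is_outdated(old, new, cue_mode = "never")
```

Returning at the first matching rule mirrors the idea that the rules are evaluated in a fixed order.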
A target’s dependencies can include functions, and these functions are tracked for changes using a custom hashing procedure. When a function’s hash changes, the function is considered invalidated, and so are any downstream targets with the
depend cue turned on. The
targets package computes the hash of a function in the following way.
1. Deparse the function with targets:::safe_deparse(). This function computes a string representation of the function that removes comments and standardizes whitespace so that trivial changes to formatting do not cue targets to rerun.
2. Manually remove any literal pointers from the function string using targets:::mask_pointers(). Such pointers arise from inline compiled C/C++ functions.
3. Compute a hash of the preprocessed string above.
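The idea behind these steps can be approximated in base R. In this sketch, deparse() stands in for targets:::safe_deparse(), the pointer-masking step is omitted, and an MD5 file hash from tools::md5sum() stands in for whatever hash targets uses internally. The helper hash_function() is hypothetical.

```r
# Hypothetical helper: hash a function by its deparsed code.
hash_function <- function(fun) {
  # deparse() regenerates the code from the parsed function, so comments
  # and original whitespace do not appear in the resulting string.
  text <- paste(deparse(fun), collapse = "\n")
  path <- tempfile()
  on.exit(unlink(path))
  writeLines(text, path)
  unname(tools::md5sum(path))
}

f1 <- function(x) {
  # Add one.
  x + 1
}
f2 <- function(x) {x+1}    # Same code, different formatting.
f3 <- function(x) {x + 2}  # Different code.

same_hash <- identical(hash_function(f1), hash_function(f2))
different_hash <- !identical(hash_function(f1), hash_function(f3))
```

With this approach, reformatting a function leaves its hash unchanged, while editing its logic changes the hash.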
Those functions themselves have dependencies, and those dependencies are detected with
codetools::findGlobals(). Dependencies of functions may include other global functions or global objects. If a dependency of a function is invalidated, the function itself is invalidated, and so are any dependent targets with the
depend cue turned on.
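A short example shows what codetools::findGlobals() reports. The functions and the global object below are invented for illustration.

```r
# A global object and two global functions (illustrative names).
scale_factor <- 10
rescale <- function(x) x * scale_factor
summarize_data <- function(x) mean(rescale(x))

# Names of global symbols used by each function.
globals_summarize <- codetools::findGlobals(summarize_data)
globals_rescale <- codetools::findGlobals(rescale)
```

Here summarize_data() depends on rescale(), which in turn depends on scale_factor, so a change to scale_factor would propagate up this chain. The function argument x is local and is not reported.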
targets manages _targets/meta/progress and _targets/meta/meta with an internal database class, which has methods to read, write, and deduplicate entire datasets as well as row-append records for individual targets. To maximize performance, it uses data.table when working with entire databases and base::write() to append individual rows. The database class also supports an internal in-memory cache in order to avoid costly interactions with storage. The progress and meta classes each have a database object and methods specific to the use case. And for additional safety, the record class encapsulates and validates individual rows of metadata.
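The row-append pattern can be sketched with base::write(). The helper append_row() and the two-column layout are hypothetical; the sketch only illustrates why appending one line is cheap compared with rewriting the whole file.

```r
db_path <- tempfile()

# Write the header once.
base::write("name|progress", file = db_path)

# Hypothetical helper that appends a single record as one line.
append_row <- function(path, row) {
  line <- paste(unlist(row), collapse = "|")
  base::write(line, file = path, append = TRUE)
}

append_row(db_path, list(name = "data", progress = "running"))
append_row(db_path, list(name = "data", progress = "built"))

lines <- readLines(db_path)
```

Appending leaves earlier rows in place, which is why duplicated names accumulate and a separate deduplication step is needed when the whole file is read.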