Chapter 2 Data and metadata management

Like its predecessor, drake, the targets package

  1. Abstracts files as R objects.
  2. Records the real-time progress of targets as they are running.
  3. Records special metadata in order to skip targets that are already up to date.

Unlike drake, which outsources data and metadata management to an external package, targets has an entirely custom internal data system. targets goes out of its way to reduce the number of files in storage, centralize the progress data and metadata, assign informative file names, and expose the file system to the user. This approach increases efficiency and portability, and it helps users understand and take control of their data.

2.1 File system

When a targets pipeline runs, it creates a folder called _targets to store all the files it needs. In targets version and above, users can set the data store path to something other than _targets (see tar_config_set()).

_targets/ # Configurable with tar_config_set().
├── meta/
├────── meta
├────── process
├────── progress
├── objects/
├────── target1 
├────── target2
├────── branching_target_c7bcb4bd
├────── branching_target_285fb6a9
├────── branching_target_874ca381
└── scratch/ # Temporary files deleted at the end of tar_make().

The number of files equals the number of targets plus two, which makes projects easier to upload and share among collaborators than drake’s .drake/ cache. (Files in _targets/scratch/ do not count because they can all be safely deleted after tar_make().) However, these files may still be too large and too numerous for code-specific version control systems like Git. For such projects, it may be more appropriate to share caches through external version-aware platforms such as Dropbox, Microsoft OneDrive, and Google Docs.

2.2 Targets

With the exception of dynamic files, the return value of each target lives in its own file inside _targets/objects/. The file name is the name of the target, and there is no file extension. The metadata keeps track of the storage format that governs how to read and write the target’s data. The default format is RDS, so if target x has no explicit format, then readRDS("_targets/objects/x") will read the data. (However, we state this just for the sake of understanding. The recommended way to read data is tar_read(), which takes the storage format into account.)

2.3 Process

The file _targets/meta/process is a pipe-separated flat file recording high-level information about the external callr process that orchestrates the targets. In that text file is the process ID, which can be used to check if tar_make() is still running in certain situations. Notably, it helps Shiny developers make apps that allow the user to log out and then resume the session after logging back in.

2.4 Progress

The file _targets/meta/progress is a pipe-separated flat file with the name of each target and it’s current runtime progress (running, built, canceled, or errored). The information in this file helps users keep track of what the pipeline is doing at a given moment. targets periodically appends rows to _targets/meta/progress as the pipeline progresses, so duplicated names usually appear. For any target with duplicated rows in _targets/meta/progress, only the lowest row is valid.

In most situations, the progress file can be safely excluded from version control. Functions like tar_graph() use progress information, but it is not essential to the reproducible end product.

2.5 Metadata

targets uses special metadata to decide which targets are up to date and which need to run. The metadata file _targets/meta/meta is a flat file with one row for every target and every global object relevant to the pipeline. targets appends new rows to this file as the pipeline progresses. Unlike drake, the metadata is centralized and compatible with data.table, which makes it far faster to check which targets are up to date. In addition, the metadata system allows targets to check not only for up-to-date targets, but also up-to-date global objects, which makes it easier for the user to understand why a target is outdated.

_targets/meta/meta has the following columns. Global objects use only the name, type, and data fields.

  • name: Name of the object or target.
  • type: Class name of the object or target.
  • data: Hash of the global object or the file containing the target’s return value.
  • command: Hash of the R command to run the target.
  • depend: Composite hash of all the target’s immediate upstream dependencies.
  • seed: Random number generator seed of the target. A target seed is unique and deterministically generated from its name.
  • path: The file path where the return value is stored. For dynamic files, this field could include multiple character strings.
  • time: Character, hash of the maximum of all the time stamps of the files in path.
  • size: Character, hash of the total file size of all the target’s files in path.
  • bytes: Numeric, total file size in bytes of all the target’s files in path.
  • format: Name of the storage format of the target. User-specified with tar_target() or tar_option_set().
  • iteration: Iteration mode of the target’s value, either "vector" or "list". User-specified with tar_target() or tar_option_set().
  • parent: Name of the parent pattern of the target if the target is a branch.
  • children: For patterns and branching stems, this field has the names of all the branches and buds. Can contain multiple character strings. Empty for branches and non-branching stems.
  • seconds: Runtime of the target in seconds.
  • warnings: Warning messages thrown when the target ran.
  • error: Error message thrown when the target ran.

These fields are pipe-separated in the flat file. Fields path and children can have multiple character strings, and these character strings are separated by asterisks in storage. (In memory, path and children are list columns.)

2.6 Skipping up-to-date targets

targets uses the metadata to decide if a target is up to date. The should_run() method of the builder class manages this. A target is outdated if one of the following conditions is met. targets checks these rules in the order given below. There is a special cue class to allow the user to customize / suppress most of these rules.

  1. There is no metadata record of the target.
  2. The target errored last run.
  3. The target has a different class than it did before.
  4. The cue mode equals "always".
  5. The cue mode does not equal "never".
  6. The command metadata field (the hash of the R command) is different from last time.
  7. The depend metadata field (the hash of the immediate upstream dependency targets and global objects) is different from last time.
  8. The storage format (user-specified with tar_target() or tar_option_set()) is different from last time.
  9. The iteration method (user-specified with tar_target() or tar_option_set()) is different from last time.
  10. A target’s file (either the one in _targets/objects/ or a dynamic file) does not exist or changed since last time.

A target’s dependencies can include functions, and these functions are tracked for changes using a custom hashing procedure. When a function’s hash changes, the function is considered invalidated, and so are any downstream targets with the depend cue turned on. The targets package computes the hash of a function in the following way. 1. Deparse the function with targets:::safe_deparse(). This function computes a string representation of the function that removes comments and standardizes whitespace so that trivial changes to formatting do not cue targets to rerun. 1. Manually remove any literal pointers from the function string using targets:::mask_pointers(). Such pointers arise from inline compiled C/C++ functions. 1. Compute a hash on the preprocessed string above using targets:::digest_chr64().

Those functions themselves have dependencies, and those dependencies are detected with codetools::findGlobals(). Dependencies of functions may include other global functions or global objects. If a dependency of a function is invalidated, the function itself is invalidated, and so are any dependent targets with the depend cue turned on.

2.7 Databases

targets manages _targets/meta/progress and _targets/meta/meta with an internal database class, which has methods to read, write, and deduplicate entire datasets as well as row-append records for individual targets. To maximize performance, targets uses fread() and fwrite() from data.table when working with entire databases and base::write() to append individual rows. The database class also supports and internal in-memory cache in order to avoid costly interactions with storage.

Internal classes progress and meta each have a database object and methods specific to the use case. And for additional safety, the record class encapsulates and validates individual rows of metadata.

Copyright Eli Lilly and Company