10  Local data

Performance

See the performance chapter for options, settings, and other choices to make storage and memory more efficient for large data workflows.

During a pipeline, targets manages R objects in memory and writes them to files on disk. It also stores target-level metadata in a compact central text file.

10.1 Memory

targets uses random access memory (RAM) while the pipeline is running. Each target loads its upstream dependencies into memory and returns an R object in memory. After a target runs or loads, tar_make() either keeps the object in memory or discards it, depending on the settings in tar_target() and tar_option_set(). Set memory = "transient" to release the target whenever possible, and set garbage_collection = TRUE to run garbage collection with gc() just before the target runs. Alternatively, set memory = "persistent" to keep the object in memory and reduce costly interactions with the file system. The trade-off between memory and file I/O depends on your computing platform. See the performance chapter for more details.
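For example, here is a minimal sketch of these settings in a pipeline. expensive_summary() is a hypothetical placeholder for a memory-hungry computation, not a real function:

# _targets.R
library(targets)
tar_option_set(memory = "transient", garbage_collection = TRUE) # defaults for all targets
list(
  tar_target(
    name = summary,
    command = expensive_summary(), # hypothetical placeholder
    memory = "transient",          # release the object whenever possible
    garbage_collection = TRUE      # run gc() just before the target runs
  )
)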

10.2 Local data store

In addition to memory, the pipeline writes data to files on disk. tar_make() creates a special data folder called _targets/ at the root of your project.

fs::dir_tree("_targets")
_targets
├── meta
│   ├── crew
│   ├── meta
│   ├── process
│   └── progress
├── objects
│   ├── target1
│   ├── target2
│   ├── dynamic_branch_c7bcb4bd
│   ├── dynamic_branch_285fb6a9
│   └── dynamic_branch_874ca381
├── scratch # tar_make() deletes this folder after it finishes.
└── user # for gittargets user data

The two most important parts are:

  1. _targets/meta/meta: a text file with key target-level metadata.
  2. _targets/objects/: a folder with the output data of each target.
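By default, the data store lives at _targets/ in the project root, but tar_config_set() can point the pipeline somewhere else. A quick sketch:

# Write store: custom_store to _targets.yaml so tar_make(), tar_read(),
# and friends all use the custom folder instead of _targets/.
tar_config_set(store = "custom_store")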

Consider this pipeline:

library(targets)
library(tarchetypes)
list(
  tar_target(
    name = target1,
    command = 11 + 46,
    format = "rds",
    repository = "local"
  )
)

tar_make() does the following:

  1. Run the command of target1 and observe a return value of 57.
  2. Save the value 57 to _targets/objects/target1 using saveRDS().
  3. Append a line to _targets/meta/meta containing the hash, time stamp, file size, warnings, errors, and execution time of target1.
  4. Append a line to _targets/meta/progress to indicate that target1 finished.
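After tar_make() finishes, the stored value comes back into R with tar_read():

tar_read(target1)
#> [1] 57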

Remarks:

  • To read the value of target1 back into R, tar_read(target1) is much better than readRDS("_targets/objects/target1"): tar_read() knows the storage format and the location of the data, so it keeps working even if either one changes.
  • The format argument of tar_target() controls how tar_make() saves the return value. The default is "rds", and there are more efficient formats such as "qs" and "feather". Some of these formats require external packages. See https://docs.ropensci.org/targets/reference/tar_target.html#storage-formats for details.
  • For efficiency, tar_make() does not write to _targets/meta/meta or _targets/meta/progress every single time a target completes. Instead, it waits and gathers a backlog of text lines in memory, then writes whole batches of lines at a time. This behavior risks losing metadata in the event of a crash, but it reduces costly interactions with the file system. The seconds_meta argument controls how often tar_make() writes metadata. seconds_reporter does the same for messages printed to the R console.
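Rather than parsing _targets/meta/meta or _targets/meta/progress by hand, you can read them back as data frames with tar_meta() and tar_progress(). A sketch, selecting a few of the metadata fields mentioned above:

# Time stamp, file size, and execution time of target1:
tar_meta(names = any_of("target1"), fields = any_of(c("time", "bytes", "seconds")))
# Which targets completed, errored, or were skipped:
tar_progress()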

10.3 External files

Some pipelines work with custom external files outside _targets/. The user is still responsible for reading and writing these files. However, the pipeline can track them, detect changes, and decide whether to rerun or skip the targets that depend on them. Simply create a file target.

In a file target,

  1. tar_target() has format = "file".
  2. The command returns a character vector of file paths.

Consider this pipeline:

# _targets.R
library(targets)
library(tarchetypes)

create_output <- function(file) {
  data <- read.csv(file)
  output <- head(data)
  write.csv(output, "output.csv")
  "output.csv"
}

list(
  tar_target(name = input, command = "data.csv", format = "file"),
  tar_target(name = output, command = create_output(input), format = "file")
)

In the dependency graph, output depends on input because the command of output mentions the symbol input.

tar_visnetwork()

Before the pipeline first runs, data.csv exists, but output.csv does not. During tar_make(), the input target tracks data.csv, and the output target creates and tracks output.csv. If data.csv changes before the next tar_make(), then both input and output rerun. If something outside the pipeline changes output.csv, then output reruns.
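To preview this invalidation without running anything, tar_outdated() lists the targets that would rerun. A sketch of the scenario where data.csv changed:

tar_outdated()
#> [1] "input"  "output"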

Remarks:

  • A file target can have both input and output files.
  • A file target can include directory paths as well as individual file paths.
  • format = "file_fast" is just like format = "file", except that it uses file modification time stamps to check whether a file is up to date. If the time stamp of the file agrees with the time stamp in the metadata, the file is considered up to date. Otherwise, targets recomputes the hash of the file to make a final determination. See the sketch after this list.
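A sketch of the input target above with time stamps enabled:

tar_target(name = input, command = "data.csv", format = "file_fast")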

10.4 Clean up local files

There are multiple functions to remove or clean up local target storage. Some of them also delete cloud data if your pipeline uses an AWS or GCP bucket (see the next chapter).

  • tar_destroy() removes the _targets/ data store and any cloud data from the pipeline.
  • tar_prune() deletes the data and metadata of targets that are no longer part of the pipeline in _targets.R.
  • tar_delete() deletes specific data files from _targets/objects/ and the cloud. It does not modify metadata.
  • tar_invalidate() removes metadata from specific targets but keeps their data files in _targets/objects/.
  • tar_meta_delete() removes _targets/meta/ files and their copies on the cloud.
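A hedged sketch of typical usage, with target names from the example data store earlier in this chapter (tar_delete() and tar_invalidate() accept tidyselect expressions):

# Delete the stored values of the dynamic branches, but keep their metadata:
tar_delete(starts_with("dynamic_branch"))
# Forget target1 so it reruns next time, but keep its file in _targets/objects/:
tar_invalidate(any_of("target1"))
# Start over from scratch: remove the entire _targets/ data store.
tar_destroy()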