Chapter 9 Data: storage and memory

This chapter describes how the targets package stores data, manages memory, allows you to customize the data processing model.

9.1 Local data store

When a target finishes running during tar_make(), it returns an R object. Those return values, along with descriptive metadata, are saved to persistent storage so your pipeline stays up to date even after you exit R. By default, this persistent storage is a special _targets/ folder created in your working directory by tar_make(). The files in the local data store are organized as follows.

_targets/ # Can be customized with tar_config_set().
├── meta/
├────── meta
├────── process
├────── progress
├── objects/
├────── target1 
├────── target2
├────── branching_target_c7bcb4bd
├────── branching_target_285fb6a9
├────── branching_target_874ca381
├── scratch/ # tar_make() deletes this folder after it finishes.
└── user/ # gittargets users can put custom files here for data version control.

The two most important components are:

  1. _targets/meta/meta, a flat text file with descriptive metadata about each target, and
  2. _targets/objects/, a folder with one data file per target.

If your pipeline has a target defined by tar_target(name = x, command = 1 + 1, format = "rds", repository = "local"), during tar_make():

  • The target runs and returns a value of 2.
  • The return value 2 is saved as an RDS file to _targets/objects/x. You could read the return value back into R with readRDS("_targets/objects/x"), but tar_read(x) is far more convenient.
  • _targets/meta/meta gets a new row of metadata describing target x. You can read that metadata with tar_meta(x). Notably, tar_meta(x)$data contains the hash of file _targets/objects/x. This has helps the next tar_make() decide whether to rerun target x.

The format argument of tar_target() (and tar_option_set()) controls how tar_make() saves the return value. The default is "rds", which uses saveRDS(), and there are more efficient formats such as "qs" and "feather". Some of these formats require external packages. See https://docs.ropensci.org/targets/reference/tar_target.html#storage-formats for details.

9.2 External files

If your pipeline loads a preexisting data file or creates files outside the data store, it is good practice to watch them for changes. That way, tar_make() will automatically rerun the appropriate targets if these files change. To watch one of more files, create a target that

  1. Has format = "file" in tar_target(), and
  2. Returns a character vector of local files and/or directories.

The example sketch of a pipeline below follows this pattern.

# _targets.R
library(targets)
create_output <- function(file) {
  data <- read.csv(file)
  output <- head(data)
  write.csv(output, "output.csv")
  "output.csv"
}
list(
  tar_target(name = input, command = "data.csv", format = "file"),
  tar_target(name = output, command = create_output(input), format = "file")
)

We assume a file called data.csv exists prior to running the pipeline. When tar_make() runs the first time, target input runs and returns the value "data.csv". Because format is "file", no extra file is saved to _targets/meta/objects/. Instead, "data.csv" gets hashed, and the hash is stored in the metadata. Then, target output runs, creates the file "output.csv", and that file gets processed the same way.

Target output depends on target input because the command of target output mentions the symbol input. (Verify with tar_visnetwork().) That way, output does not run until input is finished, and output reruns if the hash of input changes. It is good practice to write target symbols instead of literal input paths to ensure the proper dependency relationships. In this case, if output were written with the literal input path as tar_target(name = output, command = create_output("data.csv"), format = "file"), then the dependency relationship would break, and output would not rerun if input changed.

The mechanism of format = "file" applies equally to input files and output files. In fact, a target can track both input and output files at the same time. This is part of how tar_render() works. As discussed in the R Markdown chapter, tar_render() takes an R Markdown source file as input, write a rendered report file as output, and returns a character vector with the paths to both files.

9.3 Memory

A typical target has dependencies upstream. In order to run properly, it needs the return values of those dependencies to exist in the random access memory (RAM). By default, tar_make() reads those dependency targets from the data store, and it keeps in memory those targets and any targets that run. For big data workflows where not all data can fit into RAM, it is wiser to set memory = "transient" and garbage_collection = TRUE in tar_target() (and tar_option_set()). That way, the target return value is removed from memory at the earliest opportunity. The next time the target value is needed, it is reread from storage again, and then removed from memory as soon as possible. Reading a big dataset from storage can take time, which may slow down some pipelines, but it may be worth the extra time to make sure memory usage stays within reasonable limits.

9.4 Cloud storage

Cloud data can lighten the burden of local storage, make the pipeline portable, and facilitate data version control. Using arguments repository and resources of tar_target() (and tar_option_set()), you can send the return value to the cloud instead of a local file in _targets/objects/. The repository argument identifies the cloud service of choice: "aws" for Amazon Web Service (AWS) Simple Storage Service (S3), and "gcp" for Google Cloud Platform (GCP) Google Cloud Storage (GCS). Each platform requires different steps to configure, but there usage in targets is almost exactly the same.

9.4.1 Cost

Cloud services cost money. The more resources you use, the more you owe. Resources not only include the data you store, but also the HTTP requests that tar_make() uses to check if a target exists and is up to date. So cost increases with the number of cloud targets and the frequency that you run them. Please proactively monitor usage in the AWS or GCP web console and rethink your strategy if usage is too high. For example, you might consider running the pipeline locally and then sycning the data store to a bucket only at infrequent strategic milestones.

9.4.2 AWS setup

  1. Sign up for a free tier account at https://aws.amazon.com/free.
  2. Follow these instructions to practice using Simple Storage Service (S3) through the web console at https://console.aws.amazon.com/s3/.
  3. Install the paws R package with install.packages("paws").
  4. Follow the credentials section of the paws README to connect paws to your AWS account. You will set special environment variables in your user-level .Renviron file. Example:
# Example .Renviron file
AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
AWS_REGION=us-east-1 # The paws package and thus targets >= 0.8.1.9000 use this.
AWS_DEFAULT_REGION=us-east-1 # For back compatibility with targets <= 0.8.1.
  1. Restart your R session and create an S3 buckets to store target data. You can do this either in the AWS S3 web console or the following code.
library(paws)
s3 <- s3()
s3$create_bucket(Bucket = "my-test-bucket-25edb4956460647d")

9.4.3 GCP setup

  1. Activate a Google Cloud Platform account at https://cloud.google.com/.
  2. Follow the instructions at https://code.markedmondson.me/googleCloudRunner/articles/setup-gcp.html to set up your GCP account to use locally with R. The video is friendly and helpful.
  3. In your .Renviron file, set the GCS_AUTH_FILE environment variable to the same value as GCE_AUTH_FILE from step (2).
  4. Create a Google Cloud Storage (GCS) bucket to store target data. You can do this either with the GCP GCS web dashboard or the following code.
googleCloudStorageR::gcs_create_bucket(
  bucket = "my-test-bucket-25edb4956460647d",
  projectId = Sys.getenv("GCE_DEFAULT_PROJECT_ID")
)

9.4.4 Usage

The following is an example pipeline that sends targets to an AWS S3 bucket. Usage in GCP is almost exactly the same.

# Example _targets.R file:
library(targets)
tar_option_set(
  resources = tar_resources(
    aws = tar_resources_aws(bucket = "my-test-bucket-25edb4956460647d")
  )
)
write_mean <- function(data) {
  tmp <- tempfile()
  writeLines(as.character(mean(data)), tmp)
  tmp
}
list(
  tar_target(
    data,
    rnorm(5),
    format = "qs", # Set format = "aws_qs" in targets <= 0.10.0.
    repository = "aws" # Set to "gcp" for Google Cloud Platform.
  ), 
  tar_target(
    mean_file,
    write_mean(data),
    format = "file", # Set format = "aws_file" in targets <= 0.10.0.
    repository = "aws" # Set to "gcp" for Google Cloud Platform.
  )
)

When you run the pipeline above with tar_make(), your local R session computes rnorm(5), saves it to a temporary qs file on disk, and then uploads it to a file called _targets/objects/data on your S3 bucket. Likewise for mean_file, but because the format is "file" and the repository is "aws", you are responsible for supplying the path to the file that gets uploaded to _targets/objects/mean_file.

format = "file" works differently for cloud storage than local storage. Here, it is assumed that the command of the target writes a single file, and then targets uploads this file to the cloud and deletes the local copy. At that point, the copy in the cloud is tracked for changes, and the local copy does not exist.

tar_make()
#> ● run target data
#> ● run target mean_file
#> ● end pipeline

And of course, your targets stay up to date if you make no changes.

tar_make()
#> ✓ skip target data
#> ✓ skip target mean_file
#> ✓ skip pipeline

If you log into https://s3.console.aws.amazon.com/s3, you should see objects _targets/objects/data and _targets/objects/mean_file in your bucket. To download this data locally, use tar_read() and tar_load() like before. These functions download the data from the bucket and load it into R.

tar_read(data)
#> [1] -0.74654607 -0.59593497 -1.57229983  0.40915323  0.02579023

The "file" format behaves differently on the cloud. tar_read() and tar_load() download the object to a local path (where the target saved it locally before it was uploaded) and return the path so you can process it yourself.16

tar_load(mean_file)
mean_file
#> [1] "_targets/scratch/mean_fileff086e70876d"
readLines(mean_file)
#> [1] "-0.495967480886693"

When you are done with these temporary files and the pipeline is no longer running, you can safely remove everything in _targets/scratch/.

unlink("_targets/scratch/", recursive = TRUE) # tar_destroy(destroy = "scratch")

9.4.5 Data version control

Amazon and Google support versioned buckets. If your bucket has versioning turned on, then every version of every target will be stored,17, and the target metadata will contain the version ID (verify with tar_meta(your_target, path)$path). That way, if you roll back _targets/meta/meta to a prior version, then tar_read(your_target) will read a prior target. And if you roll back the metadata and the code together, then your pipeline will journey back in time while stay up to date (old code synced with old data). Rolling back is possible if you use Git/GitHub and commit your R code files and _targets/meta/meta to the repository. An alternative cloudless versioning solution is gittargets, a package that snapshots the local data store and syncs with an existing code repository.

9.5 Cleaning up local internal data files

There are multiple functions to remove or clean up target storage. Most of these functions delete internal files or records from the data store and delete objects from cloud buckets. They do not delete local external files (i.e. tar_target(..., format = "file", repository = "local")) because some of those files could be local input data that exists prior to tar_make().

  • tar_destroy() is by far the most commonly used cleaning function. It removes the _targets/ folder (or optionally a subfolder in _targets/) and all the cloud targets mentioned in the metadata. Use it if you intend to start the pipeline from scratch without any trace of a previous run.
  • tar_prune() deletes the data and metadata of all the targets no longer present in your current target script file (default: _targets.R). This is useful if you recently worked through multiple changes to your project and are now trying to discard irrelevant data while keeping the results that still matter.
  • tar_delete() is more selective than tar_destroy() and tar_prune(). It removes the individual data files of a given set of targets from _targets/objects/ and cloud buckets while leaving the metadata in _targets/meta/meta alone. If you have a small number of data-heavy targets you need to discard to conserve storage, this function can help.
  • tar_invalidate() is the opposite of tar_delete(): for the selected targets, it deletes the metadata in _targets/meta/meta and does not delete the return values. After invalidation, you will still be able to locate the data files with tar_path() and manually salvage them in an emergency. However, tar_load() and tar_read() will not be able to read the data into R, and subsequent calls to tar_make() will attempt to rebuild those targets.

  1. Non-“file” AWS formats also download files, but they are temporary and immediately discarded after the data is read into memory.↩︎

  2. GCP has safety capabilities such as discarding all but the newest n versions.↩︎

Copyright Eli Lilly and Company