10 Data
This chapter describes how the targets package stores data, manages memory, and lets you customize the data processing model.
See the performance chapter for options, settings, and other choices to make storage and memory more efficient for large data workflows.
10.1 Local data store
When a target finishes running during tar_make(), it returns an R object. Those return values, along with descriptive metadata, are saved to persistent storage so your pipeline stays up to date even after you exit R. By default, this persistent storage is a special _targets/ folder created in your working directory by tar_make(). The files in the local data store are organized as follows.
_targets/ # Can be customized with tar_config_set().
├── meta/
├────── meta
├────── process
├────── progress
├── objects/
├────── target1
├────── target2
├────── branching_target_c7bcb4bd
├────── branching_target_285fb6a9
├────── branching_target_874ca381
├── scratch/ # tar_make() deletes this folder after it finishes.
└── user/ # gittargets users can put custom files here for data version control.
The two most important components are:
- _targets/meta/meta, a flat text file with descriptive metadata about each target, including warnings, errors, and runtime. You can read this metadata as a data frame with tar_meta().
- _targets/objects/, a folder with one data file per target.
If your pipeline has a target defined by tar_target(name = x, command = 1 + 1, format = "rds", repository = "local"), then during tar_make():
- The target runs and returns a value of 2.
- The return value 2 is saved as an RDS file to _targets/objects/x. You could read the return value back into R with readRDS("_targets/objects/x"), but tar_read(x) is far more convenient (see the sketch after this list).
- _targets/meta/meta gets a new row of metadata describing target x. You can read that metadata with tar_meta(x). Notably, tar_meta(x)$data contains the hash of the file _targets/objects/x. This hash helps the next tar_make() decide whether to rerun target x.
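For example, after tar_make() finishes, you could inspect target x like this (a minimal sketch; the readRDS() call is just the manual equivalent of tar_read()):
tar_read(x)                    # returns 2, read from _targets/objects/x
tar_meta(x)$data               # hash of the stored data file
readRDS("_targets/objects/x")  # manual equivalent of tar_read(x)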
The format argument of tar_target() (and tar_option_set()) controls how tar_make() saves the return value. The default is "rds", which uses saveRDS(), and there are more efficient formats such as "qs" and "feather". Some of these formats require external packages. See https://docs.ropensci.org/targets/reference/tar_target.html#storage-formats for details.
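For example, to switch a single target, or the whole pipeline, to the "qs" format (a sketch; it assumes the qs package is installed and reuses the x example from above):
tar_target(name = x, command = 1 + 1, format = "qs") # per-target format
tar_option_set(format = "qs") # or a pipeline-wide default near the top of _targets.R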
10.2 External files
If your pipeline loads a preexisting data file or creates files outside the data store, it is good practice to watch them for changes. That way, tar_make()
will automatically rerun the appropriate targets if these files change. To watch one of more files, create a target that
- Has
format = "file"
intar_target()
, and - Returns a character vector of local files and/or directories.
The example sketch of a pipeline below follows this pattern.
# _targets.R
library(targets)
create_output <- function(file) {
  data <- read.csv(file)
  output <- head(data)
  write.csv(output, "output.csv")
  "output.csv"
}
list(
  tar_target(name = input, command = "data.csv", format = "file"),
  tar_target(name = output, command = create_output(input), format = "file")
)
We assume a file called data.csv exists prior to running the pipeline. When tar_make() runs the first time, target input runs and returns the value "data.csv". Because format is "file", no extra file is saved to _targets/objects/. Instead, "data.csv" gets hashed, and the hash is stored in the metadata. Then, target output runs, creates the file "output.csv", and that file gets processed the same way.
Target output depends on target input because the command of target output mentions the symbol input. (Verify with tar_visnetwork().) That way, output does not run until input is finished, and output reruns if the hash of input changes. It is good practice to write target symbols instead of literal input paths to ensure the proper dependency relationships. In this case, if output were written with the literal input path as tar_target(name = output, command = create_output("data.csv"), format = "file"), then the dependency relationship would break, and output would not rerun if input changed.
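To double-check the dependency relationship, you can inspect the graph and the set of outdated targets (a quick sketch; both functions are part of targets):
tar_visnetwork() # the graph should show an edge from input to output
tar_outdated()   # lists the targets that would rerun on the next tar_make()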
The mechanism of format = "file" applies equally to input files and output files. In fact, a target can track both input and output files at the same time. This is part of how tar_render() works. As discussed in the R Markdown chapter, tar_render() takes an R Markdown source file as input, writes a rendered report file as output, and returns a character vector with the paths to both files.
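As a sketch, a report target along the following lines tracks both the R Markdown source and the rendered output. Here, tar_render() comes from the tarchetypes package, and report.Rmd is a hypothetical source file.
# _targets.R
library(targets)
library(tarchetypes)
list(
  tar_render(report, "report.Rmd") # watches report.Rmd and the rendered report file
)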
10.3 Memory
A typical target has dependencies upstream. In order to run properly, it needs the return values of those dependencies to exist in random access memory (RAM). By default, tar_make() reads those dependency targets from the data store and keeps them in memory, along with any targets that run. For big data workflows where not all the data can fit into RAM, it is wiser to set memory = "transient" and garbage_collection = TRUE in tar_target() (and tar_option_set()). That way, the target's return value is removed from memory at the earliest opportunity. The next time the value is needed, it is reread from storage and then removed from memory again as soon as possible. Rereading a big dataset from storage takes time, which may slow down some pipelines, but the extra time may be worth it to keep memory usage within reasonable limits. It is also worth considering format = "file" in tar_target() so the file is hashed but not loaded into memory, and downstream targets can read only small subsets of the data in the file. See the performance chapter for more details.
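For example, a memory-conscious pipeline could set these options globally in _targets.R (a sketch of the settings described above):
tar_option_set(
  memory = "transient",      # unload each target from memory as soon as possible
  garbage_collection = TRUE  # run garbage collection before each target builds
)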
10.4 Cloud storage
Cloud data can lighten the burden of local storage, make the pipeline portable, and facilitate data version control. Using the repository and resources arguments of tar_target() (and tar_option_set()), you can send the return value to the cloud instead of a local file in _targets/objects/. The repository argument identifies the cloud service of choice: "aws" for Amazon Web Services (AWS) Simple Storage Service (S3), and "gcp" for Google Cloud Platform (GCP) Google Cloud Storage (GCS). Each platform requires different steps to configure, but their usage in targets is almost exactly the same.
10.4.1 Cost
Cloud services cost money. The more resources you use, the more you owe. Resources include not only the data you store, but also the HTTP requests that tar_make() uses to check whether a target exists and is up to date. So cost increases with the number of cloud targets and the frequency with which you run them. Please proactively monitor usage in the AWS or GCP web console and rethink your strategy if usage is too high. For example, you might consider running the pipeline locally and then syncing the data store to a bucket only at infrequent strategic milestones.
10.4.2 AWS setup
- Sign up for a free tier account at https://aws.amazon.com/free.
- Follow these instructions to practice using Simple Storage Service (S3) through the web console at https://console.aws.amazon.com/s3/.
- Install the paws R package with install.packages("paws").
- Follow the credentials section of the paws README to connect paws to your AWS account. You will set special environment variables in your user-level .Renviron file. Example:
# Example .Renviron file
AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
AWS_REGION=us-east-1 # The paws package and thus targets >= 0.8.1.9000 use this.
AWS_DEFAULT_REGION=us-east-1 # For back compatibility with targets <= 0.8.1.
- Restart your R session and create an S3 bucket to store target data. You can do this either in the AWS S3 web console or with the following code.
library(paws)
s3 <- s3()
s3$create_bucket(Bucket = "my-test-bucket-25edb4956460647d")
10.4.3 GCP setup
- Activate a Google Cloud Platform account at https://cloud.google.com/.
- Follow the instructions at https://code.markedmondson.me/googleCloudRunner/articles/setup-gcp.html to set up your GCP account to use locally with R. The video is friendly and helpful.
- In your .Renviron file, set the GCS_AUTH_FILE environment variable to the same value as GCE_AUTH_FILE from step (2).
- Create a Google Cloud Storage (GCS) bucket to store target data. You can do this either with the GCP GCS web dashboard or the following code.
googleCloudStorageR::gcs_create_bucket(
  bucket = "my-test-bucket-25edb4956460647d",
  projectId = Sys.getenv("GCE_DEFAULT_PROJECT_ID")
)
- Verify that your Google Cloud account and R installation of googleCloudStorageR are working properly. targets uses the googleCloudStorageR package internally, and you can make sure it is working by testing a simple upload, as in the sketch below.
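A minimal upload test might look like the following sketch. It assumes the bucket created above and the GCS_AUTH_FILE credentials from your .Renviron file; the object name test.txt is arbitrary.
library(googleCloudStorageR)
tmp <- tempfile(fileext = ".txt")
writeLines("hello", tmp)
gcs_upload(tmp, bucket = "my-test-bucket-25edb4956460647d", name = "test.txt")
gcs_list_objects(bucket = "my-test-bucket-25edb4956460647d")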
10.4.4 Usage
The following is an example pipeline that sends targets to an AWS S3 bucket. Usage in GCP is almost exactly the same.
# Example _targets.R file:
library(targets)
tar_option_set(
  resources = tar_resources(
    aws = tar_resources_aws(bucket = "my-test-bucket-25edb4956460647d")
  )
)
write_mean <- function(data) {
  tmp <- tempfile()
  writeLines(as.character(mean(data)), tmp)
  tmp
}
list(
  tar_target(
    data,
    rnorm(5),
    format = "qs", # Set format = "aws_qs" in targets <= 0.10.0.
    repository = "aws" # Set to "gcp" for Google Cloud Platform.
  ),
  tar_target(
    mean_file,
    write_mean(data),
    format = "file", # Set format = "aws_file" in targets <= 0.10.0.
    repository = "aws" # Set to "gcp" for Google Cloud Platform.
  )
)
When you run the pipeline above with tar_make(), your local R session computes rnorm(5), saves it to a temporary qs file on disk, and then uploads it to a file called _targets/objects/data in your S3 bucket. Likewise for mean_file, but because the format is "file" and the repository is "aws", you are responsible for supplying the path to the file that gets uploaded to _targets/objects/mean_file.
format = "file"
works differently for cloud storage than local storage. Here, it is assumed that the command of the target writes a single file, and then targets
uploads this file to the cloud and deletes the local copy. At that point, the copy in the cloud is tracked for changes, and the local copy does not exist.
tar_make()
#> ● run target data
#> ● run target mean_file
#> ● end pipeline
And of course, your targets stay up to date if you make no changes.
tar_make()
#> ✓ skip target data
#> ✓ skip target mean_file
#> ✓ skip pipeline
If you log into https://s3.console.aws.amazon.com/s3, you should see objects _targets/objects/data and _targets/objects/mean_file in your bucket. To download this data locally, use tar_read() and tar_load() as before. These functions download the data from the bucket and load it into R.
tar_read(data)
#> [1] -0.74654607 -0.59593497 -1.57229983 0.40915323 0.02579023
The "file"
format behaves differently on the cloud. tar_read()
and tar_load()
download the object to a local path (where the target saved it locally before it was uploaded) and return the path so you can process it yourself.1
tar_load(mean_file)
mean_file
#> [1] "_targets/scratch/mean_fileff086e70876d"
readLines(mean_file)
#> [1] "-0.495967480886693"
When you are done with these temporary files and the pipeline is no longer running, you can safely remove everything in _targets/scratch/.
unlink("_targets/scratch/", recursive = TRUE) # tar_destroy(destroy = "scratch")
10.4.5 Data version control
Amazon and Google support versioned buckets. If your bucket has versioning turned on, then every version of every target will be stored, and the target metadata will contain the version ID (verify with tar_meta(your_target, path)$path). That way, if you roll back _targets/meta/meta to a prior version, then tar_read(your_target) will read a prior version of the target. And if you roll back the metadata and the code together, then your pipeline will journey back in time while staying up to date (old code synced with old data). Rolling back is possible if you use Git/GitHub and commit your R code files and _targets/meta/meta to the repository. An alternative cloudless versioning solution is gittargets, a package that snapshots the local data store and syncs it with an existing code repository.
10.5 Cleaning up local internal data files
There are multiple functions to remove or clean up target storage. Most of these functions delete internal files or records from the data store and delete objects from cloud buckets. They do not delete local external files (i.e. tar_target(..., format = "file", repository = "local")) because some of those files could be local input data that exists prior to tar_make(). A sketch of typical calls appears after the list below.
- tar_destroy() is by far the most commonly used cleaning function. It removes the _targets/ folder (or optionally a subfolder of _targets/) and all the cloud targets mentioned in the metadata. Use it if you intend to start the pipeline from scratch without any trace of a previous run.
- tar_prune() deletes the data and metadata of all the targets no longer present in your current target script file (default: _targets.R). This is useful if you recently worked through multiple changes to your project and are now trying to discard irrelevant data while keeping the results that still matter.
- tar_delete() is more selective than tar_destroy() and tar_prune(). It removes the individual data files of a given set of targets from _targets/objects/ and cloud buckets while leaving the metadata in _targets/meta/meta alone. If you have a small number of data-heavy targets you need to discard to conserve storage, this function can help.
- tar_invalidate() is the opposite of tar_delete(): for the selected targets, it deletes the metadata in _targets/meta/meta and does not delete the return values. After invalidation, you will still be able to locate the data files with tar_path() and manually salvage them in an emergency. However, tar_load() and tar_read() will not be able to read the data into R, and subsequent calls to tar_make() will attempt to rebuild those targets.
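A rough sketch of typical calls, with hypothetical target names:
tar_delete(c(large_data, model_fits)) # delete just these targets' data files
tar_invalidate(starts_with("fit_"))   # drop metadata; these targets rebuild on the next tar_make()
tar_prune()                           # discard data and metadata of targets no longer in _targets.R
tar_destroy()                         # remove the entire _targets/ data store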