11  Cloud storage

Cost

Amazon S3 and Google Cloud Storage are paid services. Amazon and Google charge not only for data storage, but also for operations that query or modify that data. Read https://aws.amazon.com/s3/pricing/ and https://cloud.google.com/storage/pricing for details.

Package version

This chapter requires targets version 1.3.0 or higher. Please visit the installation instructions.

targets can store data and metadata on the cloud, either with Amazon Web Service (AWS) Simple Storage Service (S3) or Google Cloud Platform (GCP) Google Cloud Storage (GCS).

11.1 Benefits

11.1.1 Store less data locally

  1. Use tar_option_set() and tar_target() to opt into cloud storage and configure options.
  2. tar_make() uploads target data to a cloud bucket instead of the local _targets/objects/ folder. Likewise for file targets.1
  3. Every seconds_meta seconds, tar_make() uploads the metadata while still keeping local copies in the _targets/meta/ folder.2

11.1.2 Inspect the results on a different computer

  1. tar_meta_download() downloads the latest metadata from the bucket to the local _targets/meta/ folder.3
  2. Helpers like tar_read() read the local metadata and access the target data in the bucket (see the sketch after this list).
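
For example, here is a rough sketch of that workflow on a second computer. It assumes the pipeline was already configured for cloud storage (as in the setup sections below) and uses a hypothetical target named data.

# From the project directory on the other computer:
library(targets)
tar_meta_download() # Fetch the latest metadata from the bucket.
tar_read(data)      # Read local metadata, then fetch the target from the bucket.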

11.1.3 Track history

  1. Turn on versioning in your bucket (see the sketch after this list).
  2. tar_make() records the versions of the target data in _targets/meta/meta.
  3. Commit _targets/meta/meta to the same version-controlled repository as your R code.
  4. Roll back to a prior commit to roll back the local metadata and give targets access to prior versions of the target data.
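
Here is a minimal sketch of step 1 for AWS. It assumes the paws.storage package, already-configured credentials, and a placeholder bucket name; GCS buckets have an analogous versioning setting.

# Turn on object versioning for an existing S3 bucket (name is a placeholder).
paws.storage::s3()$put_bucket_versioning(
  Bucket = "YOUR_BUCKET",
  VersioningConfiguration = list(Status = "Enabled")
)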

11.2 Setup

11.2.1 AWS setup

Skip these steps if you already have an AWS account and bucket.

  1. Sign up for a free tier account at https://aws.amazon.com/free.
  2. Read the Simple Storage Service (S3) instructions and practice in the web console.
  3. Install the paws.storage R package: install.packages("paws.storage").
  4. Follow the paws documentation to set your AWS security credentials.
  5. Create an S3 bucket, either in the web console or with paws.storage::s3()$create_bucket() (see the sketch after this list).
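
Below is a rough sketch of steps 4 and 5. The credential values and bucket name are placeholders, and buckets in regions other than us-east-1 may also need a location constraint in create_bucket().

# Placeholder credentials: substitute your own keys and region.
Sys.setenv(
  AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID",
  AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY",
  AWS_REGION = "us-east-1"
)

# Create the bucket (bucket names must be globally unique).
paws.storage::s3()$create_bucket(Bucket = "YOUR_BUCKET")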

11.2.2 GCP setup

Skip these steps if you already have a GCP account and bucket.

  1. Activate a Google Cloud Platform account at https://cloud.google.com.
  2. Install the googleCloudStorageR R package: install.packages("googleCloudStorageR").
  3. Follow the googleCloudStorageR setup instructions to authenticate into Google Cloud and enable required APIs.
  4. Create a Google Cloud Storage (GCS) bucket, either in the web console or with googleCloudStorageR::gcs_create_bucket() (see the sketch after this list).
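
Below is a rough sketch of steps 3 and 4, assuming authentication with a service account JSON key; the key path, bucket name, and project ID are placeholders.

library(googleCloudStorageR)

# Authenticate with a service account key file (path is a placeholder).
gcs_auth("service-account-key.json")

# Create the bucket (bucket names must be globally unique).
gcs_create_bucket(
  "YOUR_BUCKET",
  projectId = "YOUR_PROJECT_ID",
  location = "US"
)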

11.2.3 Pipeline setup

Use tar_option_set() to opt into cloud storage and declare options. For AWS:4

  1. repository = "aws"
  2. resources = tar_resources(aws = tar_resources_aws(bucket = "YOUR_BUCKET", prefix = "YOUR/PREFIX"))

Details:

  • The process is analogous for GCP (see the sketch after this list).
  • The prefix is just like tar_config_get("store"), but for the cloud. It controls where the data objects live in the bucket, and it should not conflict with other projects.
  • Arguments repository, resources, and cue of tar_target() override their counterparts in tar_option_set().
  • In tar_option_set(), repository controls the target data, and repository_meta controls the metadata. However, repository_meta just defaults to repository. To continuously upload the metadata, it usually suffices to set e.g. repository = "aws" in tar_option_set().
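
For reference, here is a sketch of the GCP analog of the AWS options above. The bucket and prefix values are placeholders.

tar_option_set(
  repository = "gcp",
  resources = tar_resources(
    gcp = tar_resources_gcp(
      bucket = "YOUR_BUCKET",
      prefix = "YOUR/PREFIX"
    )
  )
)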

11.3 Example

Consider a pipeline with two simple targets.

# Example _targets.R file:
library(targets)

tar_option_set(
  repository = "aws",
  resources = tar_resources(
    aws = tar_resources_aws(
      bucket = "my-test-bucket-25edb4956460647d",
      prefix = "my_project_name"
    )
  )
)

write_file <- function(data) {
  saveRDS(data, "file.rds")
  "file.rds"
}

list(
  tar_target(data, rnorm(5), format = "qs"), 
  tar_target(file, write_file(data), format = "file")
)

As usual, tar_make() runs the correct targets in the correct order. Both data files now live in bucket my-test-bucket-25edb4956460647d at S3 key paths which begin with prefix my_project_name. Neither _targets/objects/data nor file.rds exist locally because repository is "aws".

tar_make()
#> ▶ start target data
#> ● built target data [0 seconds]
#> ▶ start target file
#> ● built target file [0.002 seconds]
#> ▶ end pipeline [1.713 seconds]

At this point, if you switch to a different computer, download the latest metadata with tar_meta_download(). Then tar_make() will see that all your targets are already up to date.

tar_make()
#> ✔ skip target data
#> ✔ skip target file
#> ✔ skip pipeline [1.653 seconds]

tar_read() reads the local metadata and retrieves the target data from the bucket.

tar_read(data)
#> [1] -0.74654607 -0.59593497 -1.57229983  0.40915323  0.02579023

For a file target, tar_read() downloads the file to its original location and returns the path.

path <- tar_read(file)
path
#> [1] "file.rds"
readRDS(path)
#> [1] -0.74654607 -0.59593497 -1.57229983  0.40915323  0.02579023

  1. For cloud targets, format = "file_fast" has no purpose, and it automatically switches to format = "file".↩︎

  2. Metadata snapshots are synchronous, so a long target with deployment = "main" may block the main R process and delay uploads.↩︎

  3. Functions tar_meta_upload(), tar_meta_sync(), and tar_meta_delete() also manage cloud metadata.↩︎

  4. cue = tar_cue(file = FALSE) is no longer recommended for cloud storage. This unwise shortcut is no longer necessary, as of https://github.com/ropensci/targets/pull/1181 (targets version >= 1.3.2.9003).↩︎