11  Cloud storage

Cost

Amazon S3 and Google Cloud Storage are paid services. Amazon and Google charge not only for data storage, but also for operations that query or modify that data. Read https://aws.amazon.com/s3/pricing/ and https://cloud.google.com/storage/pricing for details.

Package version

This chapter requires targets version 1.3.0 or higher. Please visit the installation instructions.

targets can store data and metadata on the cloud, either with Amazon Web Service (AWS) Simple Storage Service (S3) or Google Cloud Platform (GCP) Google Cloud Storage (GCS).

11.1 Benefits

11.1.1 Store less data locally

  1. Use tar_option_set() and tar_target() to opt into cloud storage and configure options.
  2. tar_make() uploads target data to a cloud bucket instead of the local _targets/objects/ folder. Likewise for file targets.1
  3. Every seconds_meta seconds, tar_make() uploads the metadata while still keeping local copies in the _targets/meta/ folder.2

11.1.2 Inspect the results on a different computer

  1. tar_meta_download() downloads the latest metadata from the bucket to the local _targets/meta/ folder.3
  2. Helpers like tar_read() read the local metadata and access the target data in the bucket (see the sketch after this list).
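
For example, here is a rough sketch of that workflow on a second computer. It assumes the pipeline was already configured for cloud storage (as in the setup sections below) and uses a hypothetical target named data.

# From the project directory on the other computer:
library(targets)
tar_meta_download() # Fetch the latest metadata from the bucket.
tar_read(data)      # Read local metadata, then fetch the target from the bucket.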

11.1.3 Track history

  1. Turn on versioning in your bucket (see the sketch after this list).
  2. tar_make() records the versions of the target data in _targets/meta/meta.
  3. Commit _targets/meta/meta to the same version-controlled repository as your R code.
  4. Roll back to a prior commit to roll back the local metadata and give targets access to prior versions of the target data.
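
Here is a minimal sketch of step 1 for AWS. It assumes the paws.storage package, already-configured credentials, and a placeholder bucket name; GCS buckets have an analogous versioning setting.

# Turn on object versioning for an existing S3 bucket (name is a placeholder).
paws.storage::s3()$put_bucket_versioning(
  Bucket = "YOUR_BUCKET",
  VersioningConfiguration = list(Status = "Enabled")
)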

11.2 Setup

11.2.1 AWS setup

Skip these steps if you already have an AWS account and bucket.

  1. Sign up for a free tier account at https://aws.amazon.com/free.
  2. Read the Simple Storage Service (S3) instructions and practice in the web console.
  3. Install the paws.storage R package: install.packages("paws.storage").
  4. Follow the paws documentation to set your AWS security credentials.
  5. Create an S3 bucket, either in the web console or with paws.storage::s3()$create_bucket() (see the sketch after this list).
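
Below is a rough sketch of steps 4 and 5. The credential values and bucket name are placeholders, and buckets in regions other than us-east-1 may also need a location constraint in create_bucket().

# Placeholder credentials: substitute your own keys and region.
Sys.setenv(
  AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID",
  AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY",
  AWS_REGION = "us-east-1"
)

# Create the bucket (bucket names must be globally unique).
paws.storage::s3()$create_bucket(Bucket = "YOUR_BUCKET")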

11.2.2 GCP setup

Skip these steps if you already have a GCP account and bucket.

  1. Activate a Google Cloud Platform account at https://cloud.google.com.
  2. Install the googleCloudStorageR R package: install.packages("googleCloudStorageR").
  3. Follow the googleCloudStorageR setup instructions to authenticate into Google Cloud and enable required APIs.
  4. Create a Google Cloud Storage (GCS) bucket, either in the web console or with googleCloudStorageR::gcs_create_bucket() (see the sketch after this list).
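
Below is a rough sketch of steps 3 and 4, assuming authentication with a service account JSON key; the key path, bucket name, and project ID are placeholders.

library(googleCloudStorageR)

# Authenticate with a service account key file (path is a placeholder).
gcs_auth("service-account-key.json")

# Create the bucket (bucket names must be globally unique).
gcs_create_bucket(
  "YOUR_BUCKET",
  projectId = "YOUR_PROJECT_ID",
  location = "US"
)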

11.2.3 Pipeline setup

Use tar_option_set() to opt into cloud storage and declare options. For AWS:4

  1. repository = "aws"
  2. resources = tar_resources(aws = tar_resources_aws(bucket = "YOUR_BUCKET", prefix = "YOUR/PREFIX"))

Details:

  • The process is analogous for GCP (see the sketch after this list).
  • The prefix is just like tar_config_get("store"), but for the cloud. It controls where the data objects live in the bucket, and it should not conflict with other projects.
  • Arguments repository, resources, and cue of tar_target() override their counterparts in tar_option_set().
  • In tar_option_set(), repository controls the target data, and repository_meta controls the metadata. However, repository_meta just defaults to repository. To continuously upload the metadata, it usually suffices to set e.g. repository = "aws" in tar_option_set().
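
For reference, here is a sketch of the GCP analog of the AWS options above. The bucket and prefix values are placeholders.

tar_option_set(
  repository = "gcp",
  resources = tar_resources(
    gcp = tar_resources_gcp(
      bucket = "YOUR_BUCKET",
      prefix = "YOUR/PREFIX"
    )
  )
)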

11.3 Example

Consider a pipeline with two simple targets.

# Example _targets.R file:
library(targets)

tar_option_set(
  repository = "aws",
  resources = tar_resources(
    aws = tar_resources_aws(
      bucket = "my-test-bucket-25edb4956460647d",
      prefix = "my_project_name"
    )
  )
)

write_file <- function(data) {
  saveRDS(data, "file.rds")
  "file.rds"
}

list(
  tar_target(data, rnorm(5), format = "qs"), 
  tar_target(file, write_file(data), format = "file")
)

As usual, tar_make() runs the correct targets in the correct order. Both data files now live in bucket my-test-bucket-25edb4956460647d at S3 key paths which begin with prefix my_project_name. Neither _targets/objects/data nor file.rds exist locally because repository is "aws".

tar_make()
#> ▶ start target data
#> ● built target data [0 seconds]
#> ▶ start target file
#> ● built target file [0.002 seconds]
#> ▶ end pipeline [1.713 seconds]

At this point, if you switch to a different computer, download the latest metadata with tar_meta_download(). Then tar_make() will see that all your targets are already up to date.

tar_make()
#> ✔ skip target data
#> ✔ skip target file
#> ✔ skip pipeline [1.653 seconds]

tar_read() reads the local metadata and retrieves the target data from the bucket.

tar_read(data)
#> [1] -0.74654607 -0.59593497 -1.57229983  0.40915323  0.02579023

For a file target, tar_read() downloads the file to its original location and returns the path.

path <- tar_read(file)
path
#> [1] "file.rds"
readRDS(path)
#> [1] -0.74654607 -0.59593497 -1.57229983  0.40915323  0.02579023

  1. For cloud targets, format = "file_fast" has no purpose, and it automatically switches to format = "file".↩︎

  2. Metadata snapshots are synchronous, so a long target with deployment = "main" may block the main R process and delay uploads.↩︎

  3. Functions tar_meta_upload(), tar_meta_sync(), and tar_meta_delete() also manage cloud metadata.↩︎

  4. cue = tar_cue(file = FALSE) is no longer recommended for cloud storage. This unwise shortcut is no longer necessary, as of https://github.com/ropensci/targets/pull/1181 (targets version >= 1.3.2.9003).↩︎