Chapter 10 Cloud integration

targets has built-in cloud capabilities to help scale pipelines up and out. Cloud storage solutions are already available, and cloud computing solutions are in the works.

Before getting started, please familiarize yourself with the Amazon Web Services pricing model and its cost management and monitoring tools. Everything has a cost, from virtual instances to web API calls. Free tier accounts give you a modest monthly budget for some services for the first year, but it is easy to exceed the limits. Developers of reusable software should consider applying for promotional credit using this application form.

10.1 Compute

Right now, targets does not have built-in cloud-based distributed computing support. However, future development plans include seamless integration with AWS Batch. As a temporary workaround, it is possible to deploy a burstable SLURM cluster using AWS ParallelCluster and leverage targets' existing support for traditional schedulers.
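As a rough sketch of that workaround, the _targets.R file below assumes the clustermq package is installed on the cluster and that a SLURM template file named slurm_clustermq.tmpl exists. Both the template file name and the worker count are placeholders, not part of this chapter's example.

# Hypothetical _targets.R for a SLURM cluster built with AWS ParallelCluster.
library(targets)
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "slurm_clustermq.tmpl" # placeholder template file
)
list(
  tar_target(data, rnorm(5)),
  tar_target(result, mean(data))
)

With a configuration like this, tar_make_clustermq(workers = 2) would submit targets as jobs to the SLURM scheduler.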

10.2 Storage

targets supports cloud storage on a target-by-target basis using Amazon Simple Storage Service, or S3. After a target completes, the return value is uploaded to a user-defined S3 bucket. Follow these steps to get started.

10.2.1 Communicate with your collaborators

An S3 bucket can be a shared file space. If you have access to someone else’s bucket, running their pipeline could accidentally overwrite their cloud data. It is best if you and your colleagues decide in advance who will write to the bucket at any given time.

10.2.2 Get started with the Amazon S3 web console

If you do not already have an Amazon Web Services account, sign up for the free tier at https://aws.amazon.com/free. Then, follow these step-by-step instructions to practice using Amazon S3 through the web console at https://console.aws.amazon.com/s3/.

10.2.3 Configure your local machine

targets uses the aws.s3 package behind the scenes. It is not a strict dependency of targets, so you will need to install it yourself.

install.packages("aws.s3")

Next, aws.s3 needs an access key ID, a secret access key, and a default region. Follow these steps to generate the keys, and choose a region from this table of endpoints. Then, open the .Renviron file in your home directory with usethis::edit_r_environ() and store this information in special environment variables. Here is an example .Renviron file.

# Example .Renviron file
AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
AWS_DEFAULT_REGION=us-east-1

Restart your R session so the changes take effect. Your keys are sensitive personal information. You can print them in your private console to verify correctness, but otherwise please avoid saving them to any persistent documents other than .Renviron.

Sys.getenv("AWS_ACCESS_KEY_ID")
#> [1] "AKIAIOSFODNN7EXAMPLE"
Sys.getenv("AWS_SECRET_ACCESS_KEY")
#> [1] "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
Sys.getenv("AWS_DEFAULT_REGION")
#> [1] "us-east-1"

10.2.4 Create S3 buckets

Now, you are ready to create one or more S3 buckets for your targets pipeline. Each pipeline should have its own set of buckets. Create one through the web console or with aws.s3::put_bucket().

library(aws.s3)
put_bucket("my-test-bucket-25edb4956460647d")
#> [1] TRUE

Sign in to https://s3.console.aws.amazon.com/s3 to verify that the bucket exists.
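If you prefer to stay in R, aws.s3::bucket_exists() offers another way to check, assuming the aws.s3 package is still loaded.

# Check programmatically that the bucket was created.
bucket_exists("my-test-bucket-25edb4956460647d")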

10.2.5 Configure the pipeline

To connect your pipeline with S3,

  1. Supply your bucket name to resources in tar_option_set(). To use different buckets for different targets, set resources directly in tar_target().
  2. Supply AWS-powered storage formats to tar_option_set() and/or tar_target(). See the tar_target() help file for the full list of formats.

Your _targets.R file will look something like this.

# Example _targets.R
library(targets)
tar_option_set(resources = list(bucket = "my-test-bucket-25edb4956460647d"))
write_mean <- function(data) {
  tmp <- tempfile()
  writeLines(as.character(mean(data)), tmp)
  tmp
}
list(
  tar_target(data, rnorm(5), format = "aws_qs"),
  tar_target(mean_file, write_mean(data), format = "aws_file")
)
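If some targets need a different bucket than the one in tar_option_set() (step 1 above), a possible variation is to pass resources directly to tar_target(). The bucket name below is only a placeholder.

# Variation: override the default bucket for a single target.
tar_target(
  data,
  rnorm(5),
  format = "aws_qs",
  resources = list(bucket = "my-other-bucket-name") # placeholder bucket
)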

10.2.6 Run the pipeline

When you run the pipeline above with tar_make(), your local R session computes rnorm(5), saves it to a temporary qs file on disk, and then uploads it to a file called _targets/objects/data on your S3 bucket. Likewise for mean_file, but because the format is "aws_file", you are responsible for supplying the path to the file that gets uploaded to _targets/objects/mean_file.

tar_make()
#> ● run target data
#> ● run target mean_file
#> ● end pipeline
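If you like, you can confirm the uploads from R with aws.s3::object_exists() before opening the web console, assuming aws.s3 is still loaded and the bucket name matches the one above.

# Optional: confirm that the pipeline uploaded both objects.
object_exists("_targets/objects/data", bucket = "my-test-bucket-25edb4956460647d")
object_exists("_targets/objects/mean_file", bucket = "my-test-bucket-25edb4956460647d")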

And of course, your targets stay up to date if you make no changes.

tar_make()
#> ✓ skip target data
#> ✓ skip target mean_file
#> ✓ skip pipeline

10.2.7 Manage the data

Log into https://s3.console.aws.amazon.com/s3. You should see objects _targets/objects/data and _targets/objects/mean_file in your bucket. To download this data locally, use tar_read() and tar_load() like before. These functions download the data from the bucket and load it into R.

tar_read(data)
#> [1] -0.74654607 -0.59593497 -1.57229983  0.40915323  0.02579023

The "aws_file" format is different from the other AWS-powered formats. tar_read() and tar_load() download the object to a temporary file and return the path so you can process it yourself.7

tar_load(mean_file)
mean_file
#> [1] "_targets/scratch/mean_fileff086e70876d"
readLines(mean_file)
#> [1] "-0.495967480886693"

When you are done with these temporary files and the pipeline is no longer running, you can safely remove everything in _targets/scratch/.

unlink("_targets/scratch/", recursive = TRUE)

Lastly, if you want to erase the whole project or start over from scratch, consider removing the S3 bucket to avoid incurring storage fees. The easiest way to do this is through the S3 console. You can alternatively call aws.s3::delete_bucket(), but you have to make sure the bucket is empty first.

delete_bucket("my-test-bucket-25edb4956460647d")
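If delete_bucket() complains because the bucket still contains the pipeline's objects, a minimal sketch for emptying it from R first, using aws.s3::get_bucket() and aws.s3::delete_object(), could look like this.

# Sketch: remove every object, then delete the bucket itself.
bucket <- "my-test-bucket-25edb4956460647d"
for (object in get_bucket(bucket)) {
  delete_object(object[["Key"]], bucket = bucket)
}
delete_bucket(bucket)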

  1. Non-“file” AWS formats also download temporary files, but they are immediately discarded after they are read into memory.