Chapter 10 Cloud integration
targets has built-in cloud capabilities to help scale pipelines up and out. Cloud storage solutions are already available, and cloud computing solutions are in the works.
Before getting started, please familiarize yourself with the pricing model and cost management and monitoring tools of Amazon Web Services. Everything has a cost, from virtual instances to web API calls. Free tier accounts give you a modest monthly budget for some services for the first year, but it is easy to exceed the limits. Developers of reusable software should consider applying for promotional credit using this application form.
10.1 Compute
Right now, targets does not have built-in cloud-based distributed computing support. However, future development plans include seamless integration with AWS Batch. As a temporary workaround, it is possible to deploy a burstable SLURM cluster using AWS ParallelCluster and leverage targets' existing support for traditional schedulers.
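For example, once such a cluster is running, a pipeline can send targets to SLURM through the existing clustermq integration. The sketch below is only illustrative: the template file name, worker count, and the targets themselves are assumptions, not part of any ParallelCluster setup.
# Example _targets.R sketch for a SLURM cluster created with AWS ParallelCluster.
# The clustermq template file name below is an assumption.
library(targets)
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "slurm_clustermq.tmpl"
)
list(
  tar_target(data, rnorm(1e6)),
  tar_target(result, mean(data))
)
With this configuration, tar_make_clustermq(workers = 2) submits targets as jobs to the SLURM scheduler instead of running them in the local R session.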
10.2 Storage
targets supports cloud storage on a target-by-target basis using Amazon Simple Storage Service, or S3. After a target completes, the return value is uploaded to a user-defined S3 bucket. Follow these steps to get started.
10.2.1 Communicate with your collaborators
An S3 bucket can be a shared file space. If you have access to someone else’s bucket, running their pipeline could accidentally overwrite their cloud data. It is best if you and your colleagues decide in advance who will write to the bucket at any given time.
10.2.2 Get started with the Amazon S3 web console
If you do not already have an Amazon Web Services account, sign up for the free tier at https://aws.amazon.com/free. Then, follow these step-by-step instructions to practice using Amazon S3 through the web console at https://console.aws.amazon.com/s3/.
10.2.3 Configure your local machine
targets uses the aws.s3 package behind the scenes. It is not a strict dependency of targets, so you will need to install it yourself.
install.packages("aws.s3")
Next, aws.s3 needs an access key ID, a secret access key, and a default region. Follow these steps to generate the keys, and choose a region from this table of endpoints. Then, open the .Renviron file in your home directory with usethis::edit_r_environ() and store this information in special environment variables. Here is an example .Renviron file.
# Example .Renviron file
AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
AWS_DEFAULT_REGION=us-east-1
Restart your R session so the changes take effect. Your keys are sensitive personal information. You can print them in your private console to verify correctness, but otherwise please avoid saving them to any persistent documents other than .Renviron.
Sys.getenv("AWS_ACCESS_KEY_ID")
#> [1] "AKIAIOSFODNN7EXAMPLE"
Sys.getenv("AWS_SECRET_ACCESS_KEY")
#> [1] "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
Sys.getenv("AWS_DEFAULT_REGION")
#> [1] "us-east-1"
10.2.4 Create S3 buckets
Now, you are ready to create one or more S3 buckets for your targets pipeline. Each pipeline should have its own set of buckets. Create one through the web console or with aws.s3::put_bucket().
library(aws.s3)
put_bucket("my-test-bucket-25edb4956460647d")
#> [1] TRUE
Sign in to https://s3.console.aws.amazon.com/s3 to verify that the bucket exists.
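You can also verify the bucket from R. The sketch below reuses the same example bucket name.
# Returns TRUE if the bucket exists and your credentials can reach it.
bucket_exists("my-test-bucket-25edb4956460647d")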
10.2.5 Configure the pipeline
To connect your pipeline with S3,
- Supply your bucket name to resources in tar_option_set(). To use different buckets for different targets, set resources directly in tar_target(), as in the sketch after this list.
- Supply AWS-powered storage formats to tar_option_set() and/or tar_target(). See the tar_target() help file for the full list of formats.
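For instance, a single target could write to its own bucket while the rest of the pipeline uses the default from tar_option_set(). The target name, function, and bucket name below are hypothetical.
# Hypothetical per-target override: this target stores its result in a
# different bucket than the one configured in tar_option_set().
tar_target(
  large_data,
  simulate_large_data(), # placeholder for a user-defined function
  format = "aws_qs",
  resources = list(bucket = "my-other-bucket-0123456789abcdef")
)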
Your _targets.R file will look something like this.
# Example _targets.R
library(targets)
tar_option_set(resources = list(bucket = "my-test-bucket-25edb4956460647d"))
write_mean <- function(data) {
  tmp <- tempfile()
  writeLines(as.character(mean(data)), tmp)
  tmp
}
list(
  tar_target(data, rnorm(5), format = "aws_qs"),
  tar_target(mean_file, write_mean(data), format = "aws_file")
)
10.2.6 Run the pipeline
When you run the pipeline above with tar_make(), your local R session computes rnorm(5), saves it to a temporary qs file on disk, and then uploads it to a file called _targets/objects/data on your S3 bucket. Likewise for mean_file, but because the format is "aws_file", you are responsible for supplying the path to the file that gets uploaded to _targets/objects/mean_file.
tar_make()
#> ● run target data
#> ● run target mean_file
And of course, your targets stay up to date if you make no changes.
tar_make()
#> ✓ skip target data
#> ✓ skip target mean_file
#> ✓ Already up to date.
10.2.7 Manage the data
Log into https://s3.console.aws.amazon.com/s3. You should see objects _targets/objects/data and _targets/objects/mean_file in your bucket. To download this data locally, use tar_read() and tar_load() like before. These functions download the data from the bucket and load it into R.
tar_read(data)
#> [1] -0.74654607 -0.59593497 -1.57229983 0.40915323 0.02579023
The "aws_file"
format is different from the other AWS-powered formats. tar_read()
and tar_load()
download the object to a temporary file and return the path so you can process it yourself.6
tar_load(mean_file)
mean_file
#> [1] "_targets/scratch/mean_fileff086e70876d"
readLines(mean_file)
#> [1] "-0.495967480886693"
When you are done with these temporary files and the pipeline is no longer running, you can safely remove everything in _targets/scratch/.
unlink("_targets/scratch/", recursive = TRUE)
Lastly, if you want to erase the whole project or start over from scratch, consider removing the S3 bucket to avoid incurring storage fees. The easiest way to do this is through the S3 console. You can alternatively call aws.s3::delete_bucket(), but you have to make sure the bucket is empty first.
delete_bucket("my-test-bucket-25edb4956460647d")
6. Non-“file” AWS formats also download temporary files, but they are immediately discarded after they are read into memory.