tar_option_set(format = "qs")
14 Performance
This chapter explains simple options and settings to improve the efficiency of your targets
pipelines. It also explains how to monitor the progress of a pipeline currently running.
- Set seconds_meta_append, seconds_meta_upload, and seconds_reporter to be kind to the local file system and R console.
- Choose efficient data storage formats for large targets.
- Consider memory = "transient" and garbage_collection = TRUE for high-memory tasks.
- Consider cue = tar_cue(file = FALSE) for cloud storage.
- Set trust_object_timestamps = TRUE and format = "file_fast", and then leave the data alone during tar_make().
- Parallelize data management with storage = "worker" and retrieval = "worker".
- Consider deployment = "main" for quick targets that do not need parallel workers.
- If each task runs quickly, batch each group of tasks into a target to minimize overhead.
- targets has functions like tar_progress() and tar_watch() to monitor the progress of the pipeline.
- Profiling with the proffer package can help discover bottlenecks.
14.1 Metadata and progress data
By default, tar_make() writes to the R console and local metadata files up to hundreds of times per second. If you opt into cloud storage, it also uploads local metadata files to the cloud every few seconds. All this can slow down the pipeline and degrade the performance of shared file systems.
Please help targets
be kind to your file system, R console, and cloud API rate limit. The following arguments are available in tar_make()
and tar_config_set()
:
- seconds_meta_append: how often to write to the local metadata files. The default is 0 seconds, but we recommend about 15.
- seconds_meta_upload: how often to upload the local metadata files to the cloud. The default of 15 seconds should be okay.
- seconds_reporter: how often to print progress messages to the R console. The default is 0, but we recommend 0.5.
If seconds_meta_append is 15, then tar_make() waits at least 15 seconds before updating the local metadata files. It spends at least 15 seconds collecting a backlog of new metadata, and then it writes all that metadata in bulk. seconds_reporter does the same thing with R console progress messages. Be warned: a long-running local target may block the R session and make the actual delay much longer. So when a target completes, tar_make() may not notify you immediately, and the target may not be up to date in the metadata until long after it actually finishes. Please be patient and allow the pipeline to continue until the end.
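As a hedged illustration, the recommended values above could be applied either to a single run or to the project configuration. The argument names come from the text above, and the values are only the suggestions given there. You would normally pick one of the two options, not both.

```r
library(targets)

# Option 1: apply the settings to a single run of the pipeline.
tar_make(
  seconds_meta_append = 15,  # batch writes to the local metadata files
  seconds_meta_upload = 15,  # batch uploads of metadata to the cloud
  seconds_reporter = 0.5     # throttle progress messages in the console
)

# Option 2: store the settings in the project configuration
# so that later calls to tar_make() pick them up.
tar_config_set(
  seconds_meta_append = 15,
  seconds_meta_upload = 15,
  seconds_reporter = 0.5
)
```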
14.2 Data
The default data storage format is RDS, which can be slow and bulky for large data. For large data pipelines, consider alternative formats that store and manage your data more efficiently. Set the storage format with the format argument of tar_option_set() or tar_target(), for example tar_option_set(format = "qs").
Some formats such as "qs" work on all kinds of data, whereas others like "feather" work only on data frames. Most non-default formats store the data faster and in smaller files than the default "rds" format, but they require extra packages to be installed. For example, format = "qs" requires the qs package, and format = "feather" requires the arrow package.
For extremely large datasets that cannot fit into memory, consider format = "file"
to treat the data as a file on disk. Downstream targets are free to load only the subsets of the data they need.
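The following target script is a minimal sketch of how these formats might be combined in one pipeline. The functions get_raw_data(), fit_model(), write_big_csv(), and summarize_file() and the file name "big.csv" are hypothetical placeholders for user-defined code, not part of targets.

```r
# _targets.R (sketch; all helper functions below are hypothetical)
library(targets)

# Make "qs" the default format for every target in the pipeline.
tar_option_set(format = "qs")

list(
  # Data frame target stored with the "feather" format (needs arrow).
  tar_target(raw_data, get_raw_data(), format = "feather"),
  # Arbitrary R object: inherits the "qs" default set above.
  tar_target(model, fit_model(raw_data)),
  # File target: the command must return the path of the file it wrote.
  tar_target(big_file, write_big_csv(raw_data, "big.csv"), format = "file"),
  # The downstream target receives the path and can read only what it needs.
  tar_target(file_summary, summarize_file(big_file))
)
```

With format = "file", the value of the target is the file path itself, so the data never has to be loaded into memory in full unless a downstream function chooses to do so.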
14.3 Memory
By default, tar_make()
keeps all target data in memory while it is running. To free superfluous data and consume less memory, activate transient memory and garbage collection:
tar_option_set(memory = "transient", garbage_collection = TRUE)
tar_make(garbage_collection = TRUE)
memory = "transient"
tells targets
to remove data from the R environment as soon as it is no longer needed. However, the computer memory itself is not freed until garbage collection is run, and even then, R may not decrease the size of its heap. You can run garbage collection yourself with the gc()
function in R.
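The distinction between removing an object and reclaiming its memory can be seen in plain R, outside of any pipeline. The sketch below is only an illustration of gc() itself. The object name big and the size of the vector are arbitrary choices.

```r
# Allocate roughly 80 MB of doubles (10 million values at 8 bytes each).
big <- numeric(1e7)
size_bytes <- as.numeric(object.size(big))

# Column 2 of the gc() result reports memory in use in Mb;
# the "Vcells" row covers vector data such as `big`.
mb_before <- gc()["Vcells", 2]

# Removing the binding makes the vector unreachable,
# but does not by itself reclaim the memory.
rm(big)

# An explicit garbage collection lets R reclaim the unreferenced vector.
mb_after <- gc()["Vcells", 2]
```

Comparing mb_before and mb_after shows how much vector memory R considers in use before and after the collection. Whether the operating system sees a smaller process is a separate matter, as noted above.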
Transient memory and garbage collection have tradeoffs: the pipeline reads data from storage far more often, and these data reads take additional time. In addition, garbage collection is usually a slow operation, and repeated garbage collections could slow down a pipeline with thousands of targets.
The tar_target() storage formats "file" and "file_fast" are less convenient, but they give you more control over how R uses memory.
14.4 Cloud storage latency
targets supports optional cloud storage. To check if a cloud target is up to date, tar_make() polls the AWS or GCP API, which may cost time and money. For dramatically faster and cheaper workloads, set cue = tar_cue(file = FALSE) in tar_target() and/or tar_option_set() for cloud targets. Be warned though: if you manually delete or corrupt the data in the cloud bucket, targets will not notice.
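A target script along the following lines is one way this setting might look for an AWS-backed pipeline. The bucket name "my-bucket", the prefix "my-project", and the function download_data() are hypothetical, and the sketch assumes AWS credentials are already configured.

```r
# _targets.R (sketch; bucket, prefix, and download_data() are hypothetical)
library(targets)

tar_option_set(
  # Store target data in an AWS S3 bucket.
  repository = "aws",
  resources = tar_resources(
    aws = tar_resources_aws(bucket = "my-bucket", prefix = "my-project")
  ),
  # Do not inspect the stored data of each target when deciding
  # whether it is up to date. This avoids the API calls described above.
  cue = tar_cue(file = FALSE)
)

list(
  tar_target(dataset, download_data())
)
```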
14.5 Hashes
targets uses hashes to check each target. Computing these hashes is slow, and time stamps can speed up checks on local data files in _targets/objects/. tar_option_set(trust_object_timestamps = TRUE) (already the default) opts into fast time stamps, and tar_option_set(trust_object_timestamps = FALSE) opts out. For similarly fast processing of external file targets, set format = "file_fast" instead of format = "file".
If you use trust_object_timestamps = TRUE or format = "file_fast", do not manually edit those files while the pipeline is running. _targets/objects/ in particular should never be modified by hand. And if you are on a file system with low-precision time stamps (EXT3, FAT, XFS), wait at least 2 seconds after the pipeline finishes before you edit any of those files.
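As a rough sketch, both settings could appear together in a target script as shown here. The path "data/raw.csv" and the function read_raw() are hypothetical. The option name follows the text above, so check the name used by your installed version of targets.

```r
# _targets.R (sketch; "data/raw.csv" and read_raw() are hypothetical)
library(targets)

# Trust modification time stamps of files in _targets/objects/.
tar_option_set(trust_object_timestamps = TRUE)

list(
  # External input file: time stamps are checked first,
  # and the hash is recomputed only if they differ.
  tar_target(raw_file, "data/raw.csv", format = "file_fast"),
  tar_target(raw_data, read_raw(raw_file))
)
```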
A hash is a fixed-length fingerprint of an object or file. Except in rare cases, different files have different hashes, and two files with the same hash have the same contents. targets
uses hashes to check if files have changed, which helps decide whether to rerun or skip each target. Unfortunately, hashes are expensive to compute, so a large number of targets or a large data file could slow down your pipeline.
File modification timestamps offer a workaround. Operating systems keep track of when each file was last modified, and R functions file.mtime() and file.info() can look up these timestamps much faster than hashes can be computed. When you tell targets to use timestamps, the package compares the current timestamp to the old timestamp from when the pipeline last ran. If the timestamps agree, then targets assumes the file is up to date and does not bother to recompute the hash. Otherwise, if the timestamps disagree, then targets recomputes the hash to find out if the contents of the file have really changed. When used safely, this behavior speeds up tar_make(), tar_outdated(), tar_visnetwork(), etc. by avoiding superfluous hash computations when targets are up to date.
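The decision logic described above can be imitated in a few lines of base R. This is only an illustration under stated assumptions, not the internal implementation of targets: record_file() and file_changed() are hypothetical helpers, and MD5 via tools::md5sum() stands in for whatever hash the package actually uses.

```r
# Hypothetical helper: record the timestamp and hash of a file.
record_file <- function(path) {
  list(
    mtime = file.mtime(path),
    hash = unname(tools::md5sum(path))
  )
}

# Hypothetical helper: decide whether a file changed since `old` was recorded.
file_changed <- function(path, old) {
  # Fast path: matching timestamps are trusted, so no hash is computed.
  if (identical(file.mtime(path), old$mtime)) {
    return(FALSE)
  }
  # Slow path: timestamps disagree, so compare the actual contents.
  !identical(unname(tools::md5sum(path)), old$hash)
}

path <- tempfile()
writeLines("a", path)
old <- record_file(path)

# Nothing happened to the file: the timestamp check alone settles it.
unchanged <- file_changed(path, old)

# Same contents, different timestamp: the hash is recomputed and still matches.
Sys.setFileTime(path, old$mtime + 10)
touched <- file_changed(path, old)

# Different contents and different timestamp: the change is detected.
writeLines("b", path)
Sys.setFileTime(path, old$mtime + 20)
edited <- file_changed(path, old)
```

The sketch also shows the risk mentioned earlier: if a file is edited in a way that leaves the timestamp unchanged, the fast path returns FALSE and the edit goes unnoticed.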
14.6 Parallel workers and data
If you run tar_make()
with a crew
controller, then parallel processes will run your targets, but the main R process still manages all the data by default. To delegate data management to the parallel crew
workers, set the storage
and retrieval
settings in tar_target()
or tar_option_set()
:
tar_option_set(storage = "worker", retrieval = "worker")
But be sure those workers have access to the data. They must either find the local data, or the targets must use cloud storage.
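A fuller target script might look like the sketch below, assuming a local crew controller whose workers share the file system with the main process. The number of workers and the functions get_dataset() and fit_model() are hypothetical.

```r
# _targets.R (sketch; worker count and helper functions are hypothetical)
library(targets)

tar_option_set(
  # Run targets on two local worker processes.
  controller = crew::crew_controller_local(workers = 2),
  # Let each worker save and load the data of the targets it runs.
  storage = "worker",
  retrieval = "worker"
)

list(
  tar_target(dataset, get_dataset()),
  # Individual targets can override the pipeline-wide defaults.
  tar_target(model, fit_model(dataset), storage = "main", retrieval = "main")
)
```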
14.7 Local targets
In distributed computing with targets, not every target needs to run on a remote worker. For targets that run quickly and cheaply, consider setting deployment = "main" in tar_target() to run them on the main local process:
tar_target(dataset, get_dataset(), deployment = "main")
tar_target(summary, compute_summary_statistics(), deployment = "main")
14.8 Many targets
Each target incurs overhead, and it is not good practice to create thousands of targets which each run quickly. Instead, consider grouping the same amount of work into a smaller number of targets. See the sections on what a target should do and how much a target should do.
Simulation studies and other iterative stochastic pipelines may need to run thousands of independent random replications. For these pipelines, consider batching to reduce the number of targets while preserving the number of replications. In batching, each batch is a dynamic branch target that performs a subset of the replications. For 1000 replications, you might want 40 batches of 25 replications each, 10 batches of 100 replications each, or a different balance depending on the use case. Functions tarchetypes::tar_rep(), tarchetypes::tar_map_rep(), and stantargets::tar_stan_mcmc_rep_summary() are examples of target factories that set up the batching structure for you, so you do not need to understand dynamic branching.
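The arithmetic of batching can be sketched in base R without any pipeline at all. In the example below, run_rep() and run_batch() are hypothetical stand-ins for a simulation, and the split into 40 batches of 25 mirrors one of the balances mentioned above.

```r
# One hypothetical replication: estimate the mean of a simulated sample.
run_rep <- function(rep_id) {
  data.frame(rep = rep_id, estimate = mean(rnorm(100)))
}

# One batch runs a subset of the replications and stacks the results.
run_batch <- function(rep_ids) {
  do.call(rbind, lapply(rep_ids, run_rep))
}

# Assign 1000 replications to 40 batches of 25 replications each.
n_reps <- 1000
n_batches <- 40
batch_ids <- split(
  seq_len(n_reps),
  rep(seq_len(n_batches), each = n_reps / n_batches)
)

# In a pipeline, each element of batch_ids would correspond to one target,
# so the overhead is paid 40 times instead of 1000 times.
results <- do.call(rbind, lapply(batch_ids, run_batch))
```

In an actual pipeline, a call such as tarchetypes::tar_rep(sim, run_rep_body(), batches = 40, reps = 25), with run_rep_body() a hypothetical function that returns a data frame for one replication, would set up a comparable structure.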
14.9 Monitoring the pipeline
Even the most efficient targets
pipelines can take time to complete because the user-defined tasks themselves are slow. There are convenient ways to monitor the progress of a running pipeline:
- tar_poll() continuously refreshes a text summary of runtime progress in the R console. Run it in a new R session at the project root directory. (Only supported in targets version 0.3.1.9000 and higher.)
- tar_visnetwork(), tar_progress_summary(), tar_progress_branches(), and tar_progress() show runtime information at a single moment in time.
- tar_watch() launches a Shiny app that automatically refreshes the graph every few seconds.
tar_watch()
# Define an example target script file with a slow pipeline.
library(targets)
tar_script({
  sleep_run <- function(...) {
    Sys.sleep(10)
  }
  list(
    tar_target(settings, sleep_run()),
    tar_target(data1, sleep_run(settings)),
    tar_target(data2, sleep_run(settings)),
    tar_target(data3, sleep_run(settings)),
    tar_target(model1, sleep_run(data1)),
    tar_target(model2, sleep_run(data2)),
    tar_target(model3, sleep_run(data3)),
    tar_target(figure1, sleep_run(model1)),
    tar_target(figure2, sleep_run(model2)),
    tar_target(figure3, sleep_run(model3)),
    tar_target(conclusions, sleep_run(c(figure1, figure2, figure3)))
  )
})

# Launch the app in a background process.
# You may need to refresh the browser if the app is slow to start.
# The graph automatically refreshes every 10 seconds.
tar_watch(seconds = 10, outdated = FALSE, targets_only = TRUE)

# Now run the pipeline and watch the graph change.
px <- tar_make()
tar_watch_ui()
and tar_watch_server()
make this functionality available to other apps through a Shiny module.
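For console-based monitoring without Shiny, the snapshot and polling functions listed earlier in this section can be called from a second R session at the project root while the pipeline runs. The sketch below uses only default arguments, apart from a polling interval of 5 seconds chosen arbitrarily.

```r
library(targets)

# One-time snapshots of a running (or finished) pipeline.
tar_progress()           # one row per target with its current status
tar_progress_summary()   # counts of targets by status
tar_progress_branches()  # status counts for dynamic branches

# Continuously refreshed text summary in the console.
# Interrupt the session to stop polling.
tar_poll(interval = 5)
```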
14.10 Profiling
Profiling tools like proffer
figure out specific places where code runs slowly. It is important to identify these bottlenecks before you try to optimize. Steps:
- Install the proffer R package and its dependencies.
- Run proffer::pprof(tar_make(callr_function = NULL)) on your project.
- Examine the flame graph to figure out which R functions are taking the most time.