Chapter 14 Time: speed, time logging, prediction, and strategy

14.1 Why is drake so slow?

14.1.1 Help us find out!

If you encounter slowness, please report it to https://github.com/ropensci/drake/issues and we will do our best to speed up drake for your use case. Please include a reproducible example and tell us about your operating system and version of R. In addition, flame graphs from the proffer package really help us identify bottlenecks.

14.1.2 Too many targets?

make() and friends tend to slow down if you have a huge number of targets. There is unavoidable overhead from storing each individual target and checking whether it is up to date, so please read the advice on choosing good targets and consider dividing your work into a manageably small number of meaningful targets. Dynamic branching can also help in many cases, as in the sketch below.
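With dynamic branching, a single target definition can spawn many sub-targets at runtime, so the plan itself stays small. Here is a minimal sketch; get_datasets() and fit_model() are hypothetical placeholders for your own code.

library(drake)

plan <- drake_plan(
  datasets = get_datasets(), # Placeholder: returns a list of data frames.
  models = target(
    fit_model(datasets),     # Placeholder: fits one model per dataset.
    dynamic = map(datasets)  # Branch over the list at runtime.
  )
)

make(plan)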

14.1.3 Big data?

drake saves the return value of each target to on-disk storage. So in addition to dividing your work into a smaller number of targets, specialized storage formats can help speed things up, as in the sketch below. It is also worth reflecting on how much data you really need to store. And if the cache is too big, the storage chapter has advice for downsizing it.
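The optional format argument of target() selects one of these specialized formats. A sketch, assuming the fst and qs packages are installed; get_big_data() and summarize_data() are hypothetical placeholders for your own code.

library(drake)

plan <- drake_plan(
  big_data = target(
    get_big_data(),             # Placeholder for your own data-loading code.
    format = "fst"              # Fast on-disk format for data frames.
  ),
  summary = target(
    summarize_data(big_data),   # Placeholder for your own analysis code.
    format = "qs"               # Fast general-purpose format for R objects.
  )
)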

14.1.4 Aggressive shortcuts

If your plan still needs tens of thousands of targets, you can take aggressive shortcuts to make things run faster.

make(
  plan,
  verbose = 0L, # Console messages can pile up and slow things down.
  log_progress = FALSE, # drake_progress() will be useless.
  log_build_times = FALSE, # build_times() will be useless.
  recoverable = FALSE, # make(recover = TRUE) cannot be used later.
  history = FALSE, # drake_history() cannot be used later.
  session_info = FALSE, # drake_get_session_info() cannot be used later.
  lock_envir = FALSE # See https://docs.ropensci.org/drake/reference/make.html#self-invalidation.
)

14.2 Time logging

Thanks to Jasper Clarkberg, drake records how long it takes to build each target. For large projects that take hours or days to run, this feature becomes important for planning and execution.

library(drake)
load_mtcars_example() # from https://github.com/wlandau/drake-examples/tree/main/mtcars
make(my_plan)

build_times(digits = 8) # From the cache.

## `dplyr`-style `tidyselect` commands
build_times(starts_with("coef"), digits = 8)

14.3 Predict total runtime

drake uses these times to predict the runtime of the next make(). Right now, everything in the current example is up to date, so the next make() should take essentially no time at all (apart from preprocessing overhead).

predict_runtime(my_plan)

Suppose we change a dependency so that some targets are no longer up to date. Now the next make() should take longer, since those targets need to rebuild.

reg2 <- function(d){
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}

predict_runtime(my_plan)

And what if you plan to delete the cache and build all the targets from scratch?

predict_runtime(my_plan, from_scratch = TRUE)

14.4 Strategize your high-performance computing

Let’s say you are scaling up your workflow. You just put bigger data and heavier computation in your custom code, and the next time you run make(), your targets will take much longer to build. In fact, you estimate that every target except for your R Markdown report will take two hours to complete. Let’s write down these known times in seconds.

known_times <- rep(7200, nrow(my_plan))
names(known_times) <- my_plan$target
known_times["report"] <- 5
known_times

How many parallel jobs should you use in the next make()? The predict_runtime() function can help you decide: predict_runtime(jobs_predict = n) simulates n persistent parallel workers and reports the estimated total runtime of make(jobs = n). (See also predict_workers().)

time <- c()
for (jobs in 1:12){
  time[jobs] <- predict_runtime(
    my_plan,
    jobs_predict = jobs,
    from_scratch = TRUE,
    known_times = known_times
  )
}
library(ggplot2)
ggplot(data.frame(time = time / 3600, jobs = ordered(1:12), group = 1)) +
  geom_line(aes(x = jobs, y = time, group = group)) +
  scale_y_continuous(breaks = 0:10 * 4, limits = c(0, 29)) +
  theme_gray(16) +
  xlab("jobs argument of make()") +
  ylab("Predicted runtime of make() (hours)")

We see serious potential speed gains up to 4 jobs, but beyond that point, we would have to double the number of jobs just to shave off another 2 hours. Your choice of jobs for make() ultimately depends on the runtime you can tolerate and the computing resources at your disposal.

A final note on predicting runtime: the output of predict_runtime() and predict_workers() also depends on the optional workers column of your drake_plan(). If you micromanage which workers are allowed to build which targets, you may minimize reads from disk, but you could also slow down your workflow if you are not careful. A sketch of such a plan appears below. See the high-performance computing guide for more.
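As a rough sketch of what that micromanagement looks like, suppose we assign each target to a preferred worker through a workers column and then inspect the predicted schedule. The worker assignments below are illustrative only, and we assume predict_workers() accepts the same simulation arguments as predict_runtime() above.

plan <- my_plan
plan$workers <- rep(c(1, 2), length.out = nrow(plan)) # Illustrative assignments.

predict_workers(
  plan,
  jobs_predict = 2,
  from_scratch = TRUE,
  known_times = known_times
)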
