Chapter 7 drake projects

drake’s design philosophy is extremely R-focused. It embraces in-memory configuration, in-memory dependencies, interactivity, and flexibility. This scope leaves project setup and file management decisions mostly up to the user. This chapter tries to fill in the blanks and address practical hurdles when it comes to setting up projects.

7.1 External resources

  • Miles McBain’s excellent blog post explains the practical issues {drake} solves for most projects, how to set up a project as quickly and painlessly as possible, and how to overcome common obstacles.
  • Miles’ dflow package generates the file structure for a boilerplate drake project. It is a more thorough alternative to drake::use_drake().
  • drake is heavily function-oriented by design, and Miles’ fnmate package automatically generates boilerplate code and docstrings for functions you mention in drake plans.

7.2 Code files

The names and locations of the files are entirely up to you, but this pattern is particularly useful to start with.

make.R
R/
├── packages.R
├── functions.R
└── plan.R

Here, make.R is a master script that

  1. Loads your packages, functions, and other in-memory data.
  2. Creates the drake plan.
  3. Calls make().

Let’s consider the main example, which you can download with drake_example("main"). Here, our master script is called make.R:

source("R/packages.R")  # loads packages
source("R/functions.R") # defines the create_plot() function
source("R/plan.R")      # creates the drake plan
# options(clustermq.scheduler = "multicore") # optional parallel computing. Also needs parallelism = "clustermq"
make(
  plan, # defined in R/plan.R
  verbose = 2
)

We have an R folder containing our supporting files. packages.R typically includes all the packages you will use in the workflow.

# packages.R
library(drake)
library(dplyr)
library(ggplot2)
library(tidyr) # provides replace_na(), used in plan.R

Your functions.R typically has the supporting custom functions you write for the workflow. If there are many functions, you could split them up into multiple files.

# functions.R
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), binwidth = 10) +
    theme_gray(24)
}
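If you do split your functions across several files, one option is to source every script in a folder. The sketch below is only an illustration: source_dir() is a hypothetical helper, not part of drake, and the folder name R/functions is an assumption.

```r
# Hypothetical helper (not part of drake): source every .R file in a folder.
source_dir <- function(path) {
  files <- sort(list.files(path, pattern = "\\.R$", full.names = TRUE))
  for (file in files) {
    source(file) # by default, source() evaluates in the global environment
  }
  invisible(files)
}

# Possible use in make.R, in place of source("R/functions.R"):
# source_dir("R/functions")
```

Sourcing into the global environment matters here because make() looks for your functions in the calling environment by default.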

Finally, it is good practice to keep the drake plan in its own script, plan.R.

# plan.R
plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))),
  hist = create_plot(data),
  fit = lm(Ozone ~ Wind + Temp, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

To run the example project above,

  1. Start a clean new R session.
  2. Run the make.R script.

On Mac and Linux, you can do this by opening a terminal and entering R CMD BATCH --no-save make.R. On Windows, restart your R session and call source("make.R") in the R console.

Note: this part of drake does not inherently focus on your script files. There is nothing magical about the names make.R, packages.R, functions.R, or plan.R. Different projects may require different file structures.

drake has other functions to inspect your results and examine your workflow. Before invoking them interactively, it is best to start with a clean new R session.

# Restart R.
interactive()
#> [1] TRUE
source("R/packages.R")
source("R/functions.R")
source("R/plan.R")
vis_drake_graph(plan)

7.3 Safer interactivity

7.3.1 Motivation

A serious drake workflow should be consistent and reliable, ideally with the help of a master R script. Before it builds your targets, this script should begin in a fresh R session and load your packages and functions in a dependable manner. Batch mode makes sure all this goes according to plan.

If you use a single persistent interactive R session to repeatedly invoke make() while you develop the workflow, then over time, your session could grow stale and accidentally invalidate targets. For example, if you interactively tinker with a new version of create_plot(), targets hist and report will fall out of date without warning, and the next make() will build them again. Even worse, the outputs from hist and report will be wrong if they depend on a half-finished create_plot().
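The following minimal sketch shows how a stale session can invalidate a target. It uses a toy function create_value() as a stand-in for create_plot() and a throwaway cache in a temporary folder; both are assumptions made for illustration and are not part of the main example.

```r
library(drake)

# Throwaway cache so the sketch does not create a .drake/ folder in your project.
cache <- new_cache(tempfile())

create_value <- function() 1 # stand-in for create_plot()
plan <- drake_plan(x = create_value())

make(plan, cache = cache)     # builds target x
outdated(plan, cache = cache) # nothing should be out of date at this point

# Interactive tinkering in the same session changes the function...
create_value <- function() 2

# ...so drake now considers x out of date, even though no script file changed.
outdated(plan, cache = cache)
```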

The quickest workaround is to restart R and source() your setup scripts all over again. However, a better solution is to use r_make() and friends. r_make() runs make() in a new transient R session so that accidental changes to your interactive environment do not break your workflow.

7.3.2 Usage

To use r_make(), you need a configuration R script. Unless you supply a custom file path (e.g. r_make(source = "your_file.R") or options(drake_source = "your_file.R")), drake assumes this configuration script is called _drake.R. (So the file name really is magical in this case.) The suggested file structure becomes:

_drake.R
R/
├── packages.R
├── functions.R
└── plan.R

Like our previous make.R script, _drake.R runs all our pre-make() setup steps. But this time, rather than calling make(), it ends with a call to drake_config(). drake_config() is the initial preprocessing stage of make(), and it accepts all the same arguments as make().

Example _drake.R:

source("R/packages.R")
source("R/functions.R")
source("R/plan.R")
# options(clustermq.scheduler = "multicore") # optional parallel computing
drake_config(plan, verbose = 2)

Here is what happens when you call r_make().

  1. drake launches a new transient R session using callr::r(). The remaining steps all happen within this transient session.
  2. Run the configuration script (e.g. _drake.R) to
    1. Load the packages, functions, global options, drake plan, etc. into the session’s environment, and
    2. Run the call to drake_config() and store the results in a variable called config.
  3. Execute make_impl(config = config), an internal drake function.
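The steps above can be sketched roughly as follows. This is an illustration only, not drake's actual source code: r_make_sketch() is a hypothetical name, and the sketch assumes the configuration script ends with the call to drake_config(), so that source() returns the config object as its value.

```r
# Hypothetical helper that mimics the steps above (not drake's real implementation).
r_make_sketch <- function(script = "_drake.R") {
  # Step 1: launch a new transient R session.
  callr::r(
    func = function(script) {
      # Step 2: run the configuration script. source() returns the value of the
      # last expression in the script, assumed here to be drake_config().
      config <- source(script)$value
      # Step 3: hand the config to the internal function named in the text.
      drake:::make_impl(config = config)
    },
    args = list(script = script),
    show = TRUE # print the transient session's output in the console
  )
}

# Possible use from the project root:
# r_make_sketch("_drake.R")
```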

The purpose of drake_config() is to collect and sanitize all the parameters and settings that make() needs to do its job. In fact, if you do not set the config argument explicitly, then make() invokes drake_config() behind the scenes. make(plan, parallelism = "clustermq", jobs = 2, verbose = 6) is equivalent to

config <- drake_config(plan, parallelism = "clustermq", jobs = 2, verbose = 6)
make_impl(config = config)

There are many more r_*() functions besides r_make(), each of which launches a fresh session and runs an inner drake function on the config object from _drake.R.

Outer function call          Inner function call
r_make()                     make_impl(config = config)
r_drake_build(...)           drake_build_impl(config, ...)
r_outdated(...)              outdated_impl(config, ...)
r_missed(...)                missed_impl(config, ...)
r_vis_drake_graph(...)       vis_drake_graph_impl(config, ...)
r_sankey_drake_graph(...)    sankey_drake_graph_impl(config, ...)
r_drake_ggraph(...)          drake_ggraph_impl(config, ...)
r_drake_graph_info(...)      drake_graph_info_impl(config, ...)
r_predict_runtime(...)       predict_runtime_impl(config, ...)
r_predict_workers(...)       predict_workers_impl(config, ...)

clean()
r_outdated(r_args = list(show = FALSE))
#> [1] "data"     "fit"      "hist"     "raw_data" "report"

r_make()
#> Loading required package: dplyr
#> 
#> Attaching package: ‘dplyr’
#> 
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> 
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Loading required package: ggplot2
#> 
#> Attaching package: ‘tidyr’
#> 
#> The following objects are masked from ‘package:drake’:
#> 
#>     expand, gather
#> 
#> ▶ target raw_data
#> ▶ target data
#> ▶ target fit
#> ▶ target hist
#> ▶ target report
r_outdated(r_args = list(show = FALSE))
#> character(0)

r_vis_drake_graph(targets_only = TRUE, r_args = list(show = FALSE))

Remarks:

  • You can run r_make() in an interactive session, but the transient process it launches will not be interactive. Thus, any browser() statements in the commands in your drake plan will be ignored.
  • You can select and configure the underlying callr function using arguments r_fn and r_args, respectively.
  • For example code, you can download the updated main example (drake_example("main")) and experiment with files _drake.R and interactive.R.

7.4 Script file pitfalls

Despite the above discussion of R scripts, drake plans rely more on in-memory functions. You might be tempted to write a plan like the following, but then drake cannot tell that my_analysis depends on my_data.

bad_plan <- drake_plan(
  my_data = source(file_in("get_data.R")),
  my_analysis = source(file_in("analyze_data.R")),
  my_summaries = source(file_in("summarize_data.R"))
)

vis_drake_graph(bad_plan, targets_only = TRUE)

When it comes to plans, use functions instead.

source("my_functions.R") # defines get_data(), analyze_data(), etc.
good_plan <- drake_plan(
  my_data = get_data(file_in("data.csv")), # External files need to be in commands explicitly. # nolint
  my_analysis = analyze_data(my_data),
  my_summaries = summarize_results(my_data, my_analysis)
)

vis_drake_graph(good_plan, targets_only = TRUE)
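To see why the two plans behave differently, it can help to inspect a single command with deps_code(), which lists the dependencies drake detects in a piece of code. The sketch below is illustrative; the commands are quoted, so none of the functions or files need to exist.

```r
library(drake)

# Script-based command: the only thing to detect is the script file itself.
# The symbol my_data never appears, so there is no link to the my_data target.
deps_code(quote(source(file_in("analyze_data.R"))))

# Function-based command: my_data appears in the command, so drake can
# connect my_analysis to the my_data target.
deps_code(quote(analyze_data(my_data)))
```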

In drake >= 7.6.2.9000, code_to_function() leverages existing imperative scripts for use in a drake plan.

get_data <- code_to_function("get_data.R")
do_analysis <- code_to_function("analyze_data.R")
do_summary <- code_to_function("summarize_data.R")

good_plan <- drake_plan(
  my_data = get_data(),
  my_analysis = do_analysis(my_data),
  my_summaries = do_summary(my_data, my_analysis)
)

vis_drake_graph(good_plan, targets_only = TRUE)

7.5 Workflows as R packages

The R package structure is a great way to organize and quality-control a data analysis project. If you write a drake workflow as a package, you will need to

  1. Supply the namespace of your package to the envir argument of make() or drake_config() (e.g. make(envir = getNamespace("yourPackage"))) so drake can watch your package’s functions for changes and rebuild downstream targets accordingly.
  2. If you load the package with devtools::load_all(), set the prework argument of make() (e.g. make(prework = "devtools::load_all()")) and set the packages argument yourself so your package name is not included. (Everything in packages is loaded with library().)
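Putting both points together, a call might look like the sketch below. The wrapper make_package_workflow() is hypothetical, "yourPackage" is a placeholder name, and plan is assumed to be a drake plan you have already created; treat this as a starting point rather than a definitive recipe.

```r
# Hypothetical wrapper; "yourPackage" is a placeholder package name.
make_package_workflow <- function(plan, package = "yourPackage") {
  drake::make(
    plan,
    # Let drake watch the package's functions for changes.
    envir = getNamespace(package),
    # Reload the development version of the package before building targets.
    prework = "devtools::load_all()",
    # Everything in packages is loaded with library(), so leave the package out.
    packages = setdiff(rev(.packages()), package)
  )
}

# Possible use after devtools::load_all():
# make_package_workflow(plan, package = "yourPackage")
```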

For a minimal example, see Tiernan Martin’s drakepkg.

7.6 Other tools

drake enhances reproducibility, but not in all respects. Local library managers, containerization, and session management tools offer more robust solutions in their respective domains. Reproducibility encompasses a wide variety of tools and techniques all working together. Comprehensive overviews:

Copyright Eli Lilly and Company