Chapter 7 drake projects
drake’s design philosophy is extremely R-focused. It embraces in-memory configuration, in-memory dependencies, interactivity, and flexibility. This scope leaves project setup and file management decisions mostly up to the user. This chapter tries to fill in the blanks and address practical hurdles when it comes to setting up projects.
7.1 External resources
- Miles McBain’s excellent blog post explains the practical issues {drake} solves for most projects, how to set up a project as quickly and painlessly as possible, and how to overcome common obstacles.
- Miles’ dflow package generates the file structure for a boilerplate drake project. It is a more thorough alternative to drake::use_drake().
- drake is heavily function-oriented by design, and Miles’ fnmate package automatically generates boilerplate code and docstrings for functions you mention in drake plans.
7.2 Code files
The names and locations of the files are entirely up to you, but this pattern is particularly useful to start with.
make.R
R/
├── packages.R
├── functions.R
└── plan.R
Here, make.R is a main top-level script that
- Loads your packages, functions, and other in-memory data.
- Creates the drake plan.
- Calls make().
Let’s consider the main example, which you can download with drake_example("main"). Here, our main script is called make.R:
source("R/packages.R") # loads packages
source("R/functions.R") # defines the create_plot() function
source("R/plan.R") # creates the drake plan
# options(clustermq.scheduler = "multicore") # optional parallel computing. Also needs parallelism = "clustermq"
make(
  plan, # defined in R/plan.R
  verbose = 2
)
We have an R folder containing our supporting files. packages.R typically includes all the packages you will use in the workflow.
# packages.R
library(drake)
library(dplyr)
library(ggplot2)
library(tidyr) # provides replace_na(), which the plan in R/plan.R uses
Your functions.R typically has the supporting custom functions you write for the workflow. If there are many functions, you could split them up into multiple files.
# functions.R
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), binwidth = 10) +
    theme_gray(24)
}
Finally, it is good practice to create a plan.R file that defines the plan.
# plan.R
plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))),
  hist = create_plot(data),
  fit = lm(Ozone ~ Wind + Temp, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)
To run the example project above,
- Start a clean new R session.
- Run the make.R script.
On Mac and Linux, you can do this by opening a terminal and entering R CMD BATCH --no-save make.R. On Windows, restart your R session and call source("make.R") in the R console.
Note: this part of drake does not inherently focus on your script files. There is nothing magical about the names make.R, packages.R, functions.R, or plan.R. Different projects may require different file structures.
drake has other functions to inspect your results and examine your workflow. Before invoking them interactively, it is best to start with a clean new R session.
# Restart R.
interactive()
#> [1] TRUE
source("R/packages.R")
source("R/functions.R")
source("R/plan.R")
vis_drake_graph(plan)
7.3 Safer interactivity
7.3.1 Motivation
A serious drake workflow should be consistent and reliable, ideally with the help of a main top-level R script. Before it builds your targets, this script should begin in a fresh R session and load your packages and functions in a dependable manner. Batch mode makes sure all this goes according to plan.
If you use a single persistent interactive R session to repeatedly invoke make() while you develop the workflow, then over time, your session could grow stale and accidentally invalidate targets. For example, if you interactively tinker with a new version of create_plot(), targets hist and report will fall out of date without warning, and the next make() will build them again. Even worse, the outputs from hist and report will be wrong if they depend on a half-finished create_plot().
The quickest workaround is to restart R and source() your setup scripts all over again. However, a better solution is to use r_make() and friends. r_make() runs make() in a new transient R session so that accidental changes to your interactive environment do not break your workflow.
7.3.2 Usage
To use r_make(), you need a configuration R script. Unless you supply a custom file path (e.g. r_make(source = "your_file.R") or options(drake_source = "your_file.R")), drake assumes this configuration script is called _drake.R. (So the file name really is magical in this case.) The suggested file structure becomes:
_drake.R
R/
├── packages.R
├── functions.R
└── plan.R
Like our previous make.R script, _drake.R runs all our pre-make() setup steps. But this time, rather than calling make(), it ends with a call to drake_config(). drake_config() is the initial preprocessing stage of make(), and it accepts all the same arguments as make().
Example _drake.R:
source("R/packages.R")
source("R/functions.R")
source("R/plan.R")
# options(clustermq.scheduler = "multicore") # optional parallel computing
drake_config(plan, verbose = 2)
Here is what happens when you call r_make().
- drake launches a new transient R session using callr::r(). The remaining steps all happen within this transient session.
- Run the configuration script (e.g. _drake.R) to
  - Load the packages, functions, global options, drake plan, etc. into the session’s environment, and
  - Run the call to drake_config() and store the results in a variable called config.
- Execute make_impl(config = config), an internal drake function.
The purpose of drake_config() is to collect and sanitize all the parameters and settings that make() needs to do its job. In fact, if you do not set the config argument explicitly, then make() invokes drake_config() behind the scenes. make(plan, parallelism = "clustermq", jobs = 2, verbose = 6) is equivalent to
config <- drake_config(plan, parallelism = "clustermq", jobs = 2, verbose = 6)
make_impl(config = config)
There are many more r_*() functions besides r_make(), each of which launches a fresh session and runs an inner drake function on the config object from _drake.R.
Outer function call | Inner function call |
---|---|
r_make() | make_impl(config = config) |
r_drake_build(...) | drake_build_impl(config, ...) |
r_outdated(...) | outdated_impl(config, ...) |
r_missed(...) | missed_impl(config, ...) |
r_vis_drake_graph(...) | vis_drake_graph_impl(config, ...) |
r_sankey_drake_graph(...) | sankey_drake_graph_impl(config, ...) |
r_drake_ggraph(...) | drake_ggraph_impl(config, ...) |
r_drake_graph_info(...) | drake_graph_info_impl(config, ...) |
r_predict_runtime(...) | predict_runtime_impl(config, ...) |
r_predict_workers(...) | predict_workers_impl(config, ...) |
clean()
r_outdated(r_args = list(show = FALSE))
#> [1] "data" "fit" "hist" "raw_data" "report"
r_make()
#> Loading required package: dplyr
#>
#> Attaching package: ‘dplyr’
#>
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#>
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
#>
#> Loading required package: ggplot2
#>
#> Attaching package: ‘tidyr’
#>
#> The following objects are masked from ‘package:drake’:
#>
#> expand, gather
#>
#> ▶ target raw_data
#> ▶ target data
#> ▶ target fit
#> ▶ target hist
#> ▶ target report
r_outdated(r_args = list(show = FALSE))
#> character(0)
r_vis_drake_graph(targets_only = TRUE, r_args = list(show = FALSE))
Remarks:
- You can run r_make() in an interactive session, but the transient process it launches will not be interactive. Thus, any browser() statements in the commands in your drake plan will be ignored.
- You can select and configure the underlying callr function using arguments r_fn and r_args, respectively.
- For example code, you can download the updated main example (drake_example("main")) and experiment with files _drake.R and interactive.R.
7.4 Script file pitfalls
Despite the above discussion of R scripts, drake plans rely more on in-memory functions. You might be tempted to write a plan like the following, but then drake cannot tell that my_analysis depends on my_data.
bad_plan <- drake_plan(
  my_data = source(file_in("get_data.R")),
  my_analysis = source(file_in("analyze_data.R")),
  my_summaries = source(file_in("summarize_data.R"))
)
vis_drake_graph(bad_plan, targets_only = TRUE)
When it comes to plans, use functions instead.
source("my_functions.R") # defines get_data(), analyze_data(), etc.
good_plan <- drake_plan(
  my_data = get_data(file_in("data.csv")), # External files need to be in commands explicitly. # nolint
  my_analysis = analyze_data(my_data),
  my_summaries = summarize_results(my_data, my_analysis)
)
vis_drake_graph(good_plan, targets_only = TRUE)
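To see why the two plans behave differently, it helps to look at which symbols each command actually contains. The sketch below is a deliberately simplified illustration in base R, not drake's real code analysis: the helper mentioned_targets() is hypothetical, and it treats a target as a dependency only if its name appears as a symbol in the command.

```r
# Simplified illustration, not drake's actual analyzer: a command "mentions"
# a target if the target's name appears as a symbol in the command's code.
# mentioned_targets() is a hypothetical helper written for this sketch.
mentioned_targets <- function(command, targets) {
  intersect(all.names(command), targets)
}

targets <- c("my_data", "my_analysis", "my_summaries")

# Command from bad_plan: only a file name string is visible, no target symbols.
bad_command <- quote(source(file_in("analyze_data.R")))

# Command from good_plan: my_data appears as a function argument.
good_command <- quote(analyze_data(my_data))

bad_deps <- mentioned_targets(bad_command, targets)
good_deps <- mentioned_targets(good_command, targets)
```

In this sketch bad_deps is empty because the script's contents are hidden behind source(), whereas good_deps picks up my_data because the dependency is spelled out in the command itself.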
In drake >= 7.6.2.9000, code_to_function() turns existing imperative scripts into functions you can use in a drake plan.
get_data <- code_to_function("get_data.R")
do_analysis <- code_to_function("analyze_data.R")
do_summary <- code_to_function("summarize_data.R")
good_plan <- drake_plan(
  my_data = get_data(),
  my_analysis = do_analysis(my_data),
  my_summaries = do_summary(my_data, my_analysis)
)
vis_drake_graph(good_plan, targets_only = TRUE)
7.5 Workflows as R packages
The R package structure is a great way to organize and quality-control a data analysis project. If you write a drake workflow as a package, you will need to
- Supply the namespace of your package to the envir argument of make() or drake_config() (e.g. make(envir = getNamespace("yourPackage"))) so drake can watch your package’s functions for changes and rebuild downstream targets accordingly.
- If you load the package with devtools::load_all(), set the prework argument of make() (e.g. make(prework = "devtools::load_all()")) and set the packages argument manually so your package name is not included. (Everything in packages is loaded with library().)
For a minimal example, see Tiernan Martin’s drakepkg.
7.6 Other tools
drake enhances reproducibility, but not in all respects. Local library managers, containerization, and session management tools offer more robust solutions in their respective domains. Reproducibility encompasses a wide variety of tools and techniques all working together. Comprehensive overviews:
- PLOS article by Wilson et al.
- RStudio Conference 2019 presentation by Karthik Ram.
- rrtools by Ben Marwick.