Chapter 8 Script-based workflows

8.1 Function-oriented workflows

drake works best when you write functions for data analysis. Functions break down complicated ideas into manageable pieces.

# R/functions.R
get_data <- function(file){
  readxl::read_excel(file)
}

munge_data <- function(raw_data){
  raw_data %>%
    mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
}

fit_model <- function(munged_data){
  lm(Ozone ~ Wind + Temp, munged_data)
}

When we express computational steps as functions like get_data(), munge_data(), and fit_model(), we create special shorthand to make the rest of our code easier to read and understand.

# R/plan.R
plan <- drake_plan(
  raw_data = get_data(file_in("raw_data.xlsx")),
  munged_data = munge_data(raw_data),
  model = fit_model(munged_data)
)

This function-oriented approach is elegant, powerful, testable, scalable, and maintainable. However, it can be challenging to convert pre-existing traditional script-based analyses to function-oriented drake-powered workflows. This chapter describes a stopgap to retrofit drake to existing projects. Custom functions are still better in the long run, but the following workaround is quick and painless, and it does not require you to change your original scripts.

8.2 Traditional and legacy workflows

It is common to express data analysis tasks as a sequence of numbered scripts.

01_data.R
02_munge.R
03_histogram.R
04_regression.R
05_report.R

The numeric prefixes indicate the order in which these scripts need to run, and a master script typically sources them one after another.

# run_everything.R
source("01_data.R")
source("02_munge.R")
source("03_histogram.R")
source("04_regression.R")
source("05_report.R") # Calls rmarkdown::render() on report.Rmd.

8.3 Overcoming technical debt

code_to_function() creates drake_plan()-ready functions from scripts like these.

# R/functions.R
load_data <- code_to_function("01_data.R")
munge_data <- code_to_function("02_munge.R")
make_histogram <- code_to_function("03_histogram.R")
do_regression <- code_to_function("04_regression.R")
generate_report <- code_to_function("05_report.R")

Each function contains all the code from its corresponding script, along with a special final line to make sure we never return the same value twice.

#> function (...) 
#> {
#>     raw_data <- readxl::read_excel("raw_data.xlsx")
#>     saveRDS(raw_data, "data/loaded_data.RDS")
#>     list(time = Sys.time(), tempfile = tempfile())
#> }
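As a rough mental model (not drake's actual implementation), a wrapper like this can be built by parsing the script once and returning a closure that evaluates it, with the non-deterministic final line appended. The name script_to_function() below is a hypothetical stand-in, not part of drake's API.

```r
# Hypothetical sketch of a code_to_function()-style wrapper.
script_to_function <- function(script) {
  code <- parse(script) # read the script's expressions once
  function(...) {
    env <- new.env(parent = globalenv())
    for (expr in code) {
      eval(expr, envir = env) # run the script top to bottom
    }
    # The "special final line": a value that differs on every call,
    # so no two runs ever return the same result.
    list(time = Sys.time(), tempfile = tempfile())
  }
}
```

Because the closure always returns a fresh value, downstream targets can always detect that their upstream step actually ran.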

8.4 Dependencies

drake pays close attention to dependencies. In drake, a target’s dependencies are the things it needs in order to build. Dependencies can include functions, files, and other targets upstream. Any time a dependency changes, the target is no longer valid. The make() function automatically detects when dependencies change, and it rebuilds the targets that need to rebuild.
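Dependency detection of this kind boils down to static analysis of code. As an illustration (using the codetools package bundled with R, not drake's own machinery), the globals a function refers to can be read straight from its body:

```r
# findGlobals() lists the global functions and variables a function
# mentions; drake uses this style of static analysis to connect
# targets, functions, and files in its dependency graph.
library(codetools)

fit_model <- function(munged_data) {
  lm(Ozone ~ Wind + Temp, munged_data)
}

findGlobals(fit_model) # includes "lm"; excludes the local munged_data
```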

To leverage drake’s dependency-watching capabilities, we create a drake plan. This plan should include all the steps of the analysis, from loading the data to generating a report.

To write the plan, we plug in the functions we created from code_to_function().

simple_plan <- drake_plan(
  data        = load_data(),
  munged_data = munge_data(),
  hist        = make_histogram(),
  fit         = do_regression(),
  report      = generate_report()
)

It’s a start, but right now, drake has no idea which targets to run first and which need to wait for dependencies! In the following graph, there are no edges (arrows) connecting the targets!


8.5 Building the connections

Just as our original scripts had to run in a certain order, so do our targets now. We pass targets as function arguments to express this execution order.

For example, when we write munged_data = munge_data(data), we are signaling to drake that the munged_data target depends on the function munge_data() and the target data.

script_based_plan <- drake_plan(
  data        = load_data(),
  munged_data = munge_data(data),
  hist        = make_histogram(munged_data),
  fit         = do_regression(munged_data),
  report      = generate_report(hist, fit)
)

8.6 Run the workflow

We can now run the workflow with the make() function. The first call to make() runs all the data analysis tasks we got from the scripts.

make(script_based_plan)

#> ▶ target data
#> ▶ target munged_data
#> ▶ target fit
#> ▶ target hist
#> ▶ target report

8.7 Keeping the results up to date

Any time we change a script, we need to run code_to_function() again to keep the corresponding function up to date. drake notices when this function changes, and make() reruns the updated function along with all downstream targets that rely on its output.

For example, let’s fine-tune our histogram. We open 03_histogram.R, change the binwidth argument, and call code_to_function("03_histogram.R") again.

# We need to rerun code_to_function() to tell drake that the script changed.
make_histogram <- code_to_function("03_histogram.R")
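Why does rerunning code_to_function() matter? The regenerated function carries the script's new code in its body, so its fingerprint differs from the old version's. A minimal base-R illustration (these hist() calls are stand-ins, not the real script's contents):

```r
# Two versions of the same plotting step, before and after the edit.
old_code <- quote(hist(munged_data$Ozone, breaks = 10))
new_code <- quote(hist(munged_data$Ozone, breaks = 30))

# drake compares (hashes of) deparsed code like this; any difference
# invalidates the targets that depend on the function.
identical(deparse(old_code), deparse(new_code)) # FALSE: the code changed
```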

Targets hist and report depend on the code we modified, so drake marks those targets as outdated.

outdated(script_based_plan)

#> [1] "hist"   "report"

vis_drake_graph(script_based_plan, targets_only = TRUE)

When you call make(), drake runs make_histogram() because the underlying script changed, and it runs generate_report() because the report depends on hist.

#> ▶ target hist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> ▶ target report

All the targets are now up to date!

vis_drake_graph(script_based_plan, targets_only = TRUE)

8.8 Final thoughts

Countless data science workflows consist of numbered imperative scripts, and code_to_function() lets drake accommodate script-based projects too big and cumbersome to refactor.

However, for new projects, we strongly recommend that you write functions. Functions help organize your thoughts, and they improve portability, readability, and compatibility with drake. For a deeper discussion of functions and their role in drake, consider watching the webinar recording of the 2019-09-23 rOpenSci Community Call.

Even old projects are sometimes pliable enough to refactor into functions, especially with the new Rclean package.

Copyright Eli Lilly and Company