Chapter 8 Script-based workflows
8.1 Function-oriented workflows
drake works best when you write functions for data analysis. Functions break down complicated ideas into manageable pieces.
# R/functions.R
get_data <- function(file){
  readxl::read_excel(file)
}

munge_data <- function(raw_data){
  raw_data %>%
    mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
}

fit_model <- function(munged_data){
  lm(Ozone ~ Wind + Temp, munged_data)
}
When we express computational steps as functions like get_data(), munge_data(), and fit_model(), we create special shorthand to make the rest of our code easier to read and understand.
# R/plan.R
plan <- drake_plan(
  raw_data = get_data(file_in("raw_data.xlsx")),
  munged_data = munge_data(raw_data),
  model = fit_model(munged_data)
)
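With the functions and the plan in place, a single call to make() builds every target in dependency order and skips work that is already up to date. A minimal sketch, assuming drake is loaded:
# Builds raw_data, munged_data, and model, skipping anything already up to date.
library(drake)
make(plan)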
This function-oriented approach is elegant, powerful, testable, scalable, and maintainable. However, it can be challenging to convert pre-existing traditional script-based analyses to function-oriented drake-powered workflows. This chapter describes a stopgap to retrofit drake to existing projects. Custom functions are still better in the long run, but the following workaround is quick and painless, and it does not require you to change your original scripts.
8.2 Traditional and legacy workflows
It is common to express data analysis tasks as numbered scripts.
01_data.R
02_munge.R
03_histogram.R
04_regression.R
05_report.R
The numeric prefixes indicate the order in which these scripts need to run.
# run_everything.R
source("01_data.R")
source("02_munge.R")
source("03_histogram.R")
source("04_regression.R")
source("05_report.R") # Calls rmarkdown::render() on report.Rmd.
8.3 Overcoming technical debt
code_to_function() creates drake_plan()-ready functions from scripts like these.
# R/functions.R
<- code_to_function("01_data.R")
load_data <- code_to_function("02_munge.R")
munge_data <- code_to_function("03_histogram.R")
make_histogram <- code_to_function("04_regression.R")
do_regression <- code_to_function("05_report.R") generate_report
Each function contains all the code from its corresponding script, along with a special final line (returning the current time and a temporary file path) to make sure the function never returns the same value twice.
print(load_data)
#> function (...)
#> {
#> raw_data <- readxl::read_excel("raw_data.xlsx")
#> saveRDS(raw_data, "data/loaded_data.RDS")
#> list(time = Sys.time(), tempfile = tempfile())
#> }
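Calling the generated function executes the script's code and returns that unique marker. An illustrative call (the exact values change on every run):
out <- load_data()  # reads raw_data.xlsx and writes data/loaded_data.RDS as a side effect
out$time            # the time of this particular run
out$tempfile        # a fresh temporary file path, different every time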
8.4 Dependencies
drake pays close attention to dependencies. In drake, a target’s dependencies are the things it needs in order to build. Dependencies can include functions, files, and other targets upstream. Any time a dependency changes, the target is no longer valid. The make() function automatically detects when dependencies change, and it rebuilds the targets that need to rebuild.
To leverage drake’s dependency-watching capabilities, we create a drake plan. This plan should include all the steps of the analysis, from loading the data to generating a report. To write the plan, we plug in the functions we created with code_to_function().
simple_plan <- drake_plan(
  data = load_data(),
  munged_data = munge_data(),
  hist = make_histogram(),
  fit = do_regression(),
  report = generate_report()
)
It’s a start, but right now, drake has no idea which targets to run first and which need to wait for dependencies! In the following graph, there are no edges (arrows) connecting the targets!
vis_drake_graph(simple_plan)
8.5 Building the connections
Just as our original scripts had to run in a certain order, so do our targets now. We pass targets as function arguments to express this execution order. For example, when we write munged_data = munge_data(data), we are signaling to drake that the munged_data target depends on the function munge_data() and the target data.
script_based_plan <- drake_plan(
  data = load_data(),
  munged_data = munge_data(data),
  hist = make_histogram(munged_data),
  fit = do_regression(munged_data),
  report = generate_report(hist, fit)
)
vis_drake_graph(script_based_plan)
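To double-check which dependencies drake detects in a given command, drake's deps_code() utility can help. A quick illustration (it relies on static code analysis, so the exact output depends on what is defined in your session):
deps_code(quote(munge_data(data)))           # lists the function munge_data() and the target data
deps_code(quote(generate_report(hist, fit)))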
8.6 Run the workflow
We can now run the workflow with the make() function. The first call to make() runs all the data analysis tasks we got from the scripts.
make(script_based_plan)
#> ▶ target data
#> ▶ target munged_data
#> ▶ target fit
#> ▶ target hist
#> ▶ target report
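Right after a successful make(), nothing should be out of date, so outdated() typically comes back empty. A quick sanity check (the same function reappears in the next section once we modify a script):
outdated(script_based_plan)
#> character(0)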
8.7 Keeping the results up to date
Any time we change a script, we need to run code_to_function() again to keep our function up to date. drake notices when this function changes, and make() reruns the updated function and all the downstream targets that rely on its output.
For example, let’s fine-tune our histogram. We open 03_histogram.R, change the binwidth argument, and call code_to_function("03_histogram.R") all over again.
# We need to rerun code_to_function() to tell drake that the script changed.
<- code_to_function("03_histogram.R") make_histogram
Targets hist and report depend on the code we modified, so drake marks those targets as outdated.
outdated(script_based_plan)
#> [1] "hist" "report"
vis_drake_graph(script_based_plan, targets_only = TRUE)
When you call make(), drake runs make_histogram() because the underlying script changed, and it runs generate_report() because the report depends on hist.
make(script_based_plan)
#> ▶ target hist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> ▶ target report
All the targets are now up to date!
vis_drake_graph(script_based_plan, targets_only = TRUE)
8.8 Final thoughts
Countless data science workflows consist of numbered imperative scripts, and code_to_function() lets drake accommodate script-based projects too big and cumbersome to refactor.
However, for new projects, we strongly recommend that you write functions. Functions help organize your thoughts, and they improve portability, readability, and compatibility with drake. For a deeper discussion of functions and their role in drake, consider watching the webinar recording of the 2019-09-23 rOpenSci Community Call.
Even old projects are sometimes pliable enough to refactor into functions, especially with the new Rclean package.