Chapter 6 Dynamic branching

6.1 A note about versions

The first release of dynamic branching was in drake version 7.8.0, and dynamic branching behaves differently in subsequent versions. This manual describes how dynamic branching works in development drake (to become version 7.9.0 in early January 2020). If you are using version 7.8.0, please refer to the 7.8.0 version of this chapter instead.

6.2 Motivation

In large workflows, you may need more targets than you can easily type in a plan, and you may not be able to fully specify all targets in advance. Dynamic branching is an interface to declare new targets while make() is running. It lets you create more compact plans and graphs, it is easier to use than static branching, and it improves the startup speed of make() and friends.

6.3 Which kind of branching should I use?

With dynamic branching, make() is faster to initialize, and you have far more flexibility. With static branching, you have meaningful target names, and it is easier to predict what the plan is going to do in advance. There is a ton of room for overlap and personal judgement, and you can even use both kinds of branching together.

6.4 Dynamic targets

A dynamic target is a vector of sub-targets. We let make() figure out which sub-targets to create and how to aggregate them.

As an example, let’s fit a regression model to each continent in Gapminder data. To activate dynamic branching, use the dynamic argument of target().

library(broom)
library(drake)
library(gapminder)
library(tidyverse)

# Split the Gapminder data by continent.
gapminder_continents <- function() {
  gapminder %>%
    mutate(gdpPercap = scale(gdpPercap)) %>%
    split(f = .$continent)
}

# Fit a model to a continent.
fit_model <- function(continent_data) {
  data <- continent_data[[1]]
  data %>%
    lm(formula = gdpPercap ~ year) %>%
    tidy() %>%
    mutate(continent = data$continent[1]) %>%
    select(continent, term, statistic, p.value)
}

plan <- drake_plan(
  continents = gapminder_continents(),
  model = target(fit_model(continents), dynamic = map(continents))
)

make(plan)
#> ▶ target continents
#> ▶ dynamic model
#> ❯ subtarget model_c56e5407
#> ❯ subtarget model_706a1529
#> ❯ subtarget model_da843806
#> ❯ subtarget model_862f8003
#> ❯ subtarget model_ebb41f51
#> ■ finalize model

The data type of every sub-target is the same as that of the dynamic target it belongs to. In other words, model and model_c56e5407 are both data frames, and readd(model) and friends automatically concatenate all the model_* sub-targets.

readd(model)
#> # A tibble: 10 x 4
#>    continent term        statistic  p.value
#>    <fct>     <chr>           <dbl>    <dbl>
#>  1 Africa    (Intercept)     -4.44 1.08e- 5
#>  2 Africa    year             4.04 5.90e- 5
#>  3 Americas  (Intercept)     -5.56 6.10e- 8
#>  4 Americas  year             5.55 6.16e- 8
#>  5 Asia      (Intercept)     -2.74 6.39e- 3
#>  6 Asia      year             2.75 6.23e- 3
#>  7 Europe    (Intercept)    -14.4  3.12e-37
#>  8 Europe    year            14.5  7.06e-38
#>  9 Oceania   (Intercept)    -11.3  1.32e-10
#> 10 Oceania   year            11.5  9.48e-11

This behavior is powered by the vctrs package. A dynamic target like model above is really a “vctr” of sub-targets. Under the hood, the aggregated value of model is what you get from calling vec_c() on all the model_* sub-targets. When you dynamically map() over a non-dynamic object, you are taking slices with vec_slice(). (When you map() over a dynamic target, each element is a sub-target and vec_slice() is not necessary.)

library(vctrs)
#> 
#> Attaching package: 'vctrs'
#> The following object is masked from 'package:tibble':
#> 
#>     data_frame
#> The following object is masked from 'package:dplyr':
#> 
#>     data_frame

# same as readd(model)
s <- subtargets(model)
vec_c(
  readd(s[1], character_only = TRUE),
  readd(s[2], character_only = TRUE),
  readd(s[3], character_only = TRUE),
  readd(s[4], character_only = TRUE),
  readd(s[5], character_only = TRUE)
)
#> # A tibble: 10 x 4
#>    continent term        statistic  p.value
#>    <fct>     <chr>           <dbl>    <dbl>
#>  1 Africa    (Intercept)     -4.44 1.08e- 5
#>  2 Africa    year             4.04 5.90e- 5
#>  3 Americas  (Intercept)     -5.56 6.10e- 8
#>  4 Americas  year             5.55 6.16e- 8
#>  5 Asia      (Intercept)     -2.74 6.39e- 3
#>  6 Asia      year             2.75 6.23e- 3
#>  7 Europe    (Intercept)    -14.4  3.12e-37
#>  8 Europe    year            14.5  7.06e-38
#>  9 Oceania   (Intercept)    -11.3  1.32e-10
#> 10 Oceania   year            11.5  9.48e-11

loadd(model)

# Second slice if you were to map() over mtcars.
vec_slice(mtcars, 2)
#>               mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

# Fifth slice if you were to map() over letters.
vec_slice(letters, 5)
#> [1] "e"

You can use vec_c() and vec_slice() to anticipate edge cases in dynamic branching.

# If you map() over a list, each sub-target is a single-element list.
vec_slice(list(1, 2), 1)
#> [[1]]
#> [1] 1

# If each sub-target has multiple elements,
# the aggregated target (e.g. from readd())
# will have more elements than sub-targets.
subtarget1 <- c(1, 2)
subtarget2 <- c(3, 4)
vec_c(subtarget1, subtarget2)
#> [1] 1 2 3 4
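
One more edge case can be sketched with vctrs alone, with no drake cache involved: vec_c() drops NULL inputs, so if some sub-targets return NULL, the aggregated value has fewer pieces than there are sub-targets. The subtarget* objects below are illustrative placeholders, not real sub-targets from the plan.

```r
library(vctrs)

# Hypothetical sub-target values: the second one returned NULL.
subtarget1 <- c(1, 2)
subtarget2 <- NULL
subtarget3 <- 3

# vec_c() skips NULL inputs, so the NULL sub-target leaves no trace.
aggregated <- vec_c(subtarget1, subtarget2, subtarget3)
aggregated
#> [1] 1 2 3

vec_size(aggregated)
#> [1] 3
```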

Back in our plan, target(fit_model(continents), dynamic = map(continents)) is equivalent to commands fit_model(continents[1]) through fit_model(continents[5]). Since continents is really a list of data frames, continents[1] through continents[5] are also lists of data frames, which is why we need the line data <- continent_data[[1]] in fit_model().
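
The slicing described above can be reproduced outside drake. The sketch below uses a small made-up list of data frames (toy values, not the Gapminder data) to show that a slice of a list is still a list, which is why fit_model() has to unwrap it with [[1]].

```r
library(vctrs)

# Toy stand-in for the continents target: a named list of data frames.
toy_continents <- list(
  Africa = data.frame(year = c(1952, 1957), gdpPercap = c(0.1, 0.2)),
  Asia = data.frame(year = c(1952, 1957), gdpPercap = c(0.3, 0.4))
)

# What a sub-target receives when we map() over a list: a one-element list.
continent_data <- vec_slice(toy_continents, 1)
class(continent_data)
#> [1] "list"
names(continent_data)
#> [1] "Africa"

# Unwrap the single element to get the data frame itself.
data <- continent_data[[1]]
is.data.frame(data)
#> [1] TRUE
```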

To post-process our models, we can work with either the individual sub-targets or the whole vector of all the models. Below, year uses the former and intercept uses the latter.

plan <- drake_plan(
  continents = gapminder_continents(),
  model = target(fit_model(continents), dynamic = map(continents)),
  # Filter each model individually:
  year = target(filter(model, term == "year"), dynamic = map(model)),
  # Aggregate all the models, then filter the whole vector:
  intercept = filter(model, term != "year")
)

make(plan)
#> ℹ unloading 1 targets from environment
#> ▶ target intercept
#> ▶ dynamic year
#> ❯ subtarget year_20cb8ecb
#> ❯ subtarget year_f7502c3e
#> ❯ subtarget year_a22d53f2
#> ❯ subtarget year_1facb02b
#> ❯ subtarget year_399fff25
#> ■ finalize year

readd(year)
#> # A tibble: 5 x 4
#>   continent term  statistic  p.value
#>   <fct>     <chr>     <dbl>    <dbl>
#> 1 Africa    year       4.04 5.90e- 5
#> 2 Americas  year       5.55 6.16e- 8
#> 3 Asia      year       2.75 6.23e- 3
#> 4 Europe    year      14.5  7.06e-38
#> 5 Oceania   year      11.5  9.48e-11

readd(intercept)
#> # A tibble: 5 x 4
#>   continent term        statistic  p.value
#>   <fct>     <chr>           <dbl>    <dbl>
#> 1 Africa    (Intercept)     -4.44 1.08e- 5
#> 2 Americas  (Intercept)     -5.56 6.10e- 8
#> 3 Asia      (Intercept)     -2.74 6.39e- 3
#> 4 Europe    (Intercept)    -14.4  3.12e-37
#> 5 Oceania   (Intercept)    -11.3  1.32e-10

If automatic concatenation of sub-targets is confusing (e.g. if some sub-targets are NULL, as in https://github.com/ropensci-books/drake/issues/142) you can read the dynamic target as a named list (only in drake version 7.10.0 and above).

readd(model, subtarget_list = TRUE) # Requires drake >= 7.10.0.
#> $model_c56e5407
#> # A tibble: 2 x 4
#>   continent term        statistic   p.value
#>   <fct>     <chr>           <dbl>     <dbl>
#> 1 Africa    (Intercept)     -4.44 0.0000108
#> 2 Africa    year             4.04 0.0000590
#> 
#> $model_706a1529
#> # A tibble: 2 x 4
#>   continent term        statistic      p.value
#>   <fct>     <chr>           <dbl>        <dbl>
#> 1 Americas  (Intercept)     -5.56 0.0000000610
#> 2 Americas  year             5.55 0.0000000616
#> 
#> $model_da843806
#> # A tibble: 2 x 4
#>   continent term        statistic p.value
#>   <fct>     <chr>           <dbl>   <dbl>
#> 1 Asia      (Intercept)     -2.74 0.00639
#> 2 Asia      year             2.75 0.00623
#> 
#> $model_862f8003
#> # A tibble: 2 x 4
#>   continent term        statistic  p.value
#>   <fct>     <chr>           <dbl>    <dbl>
#> 1 Europe    (Intercept)     -14.4 3.12e-37
#> 2 Europe    year             14.5 7.06e-38
#> 
#> $model_ebb41f51
#> # A tibble: 2 x 4
#>   continent term        statistic  p.value
#>   <fct>     <chr>           <dbl>    <dbl>
#> 1 Oceania   (Intercept)     -11.3 1.32e-10
#> 2 Oceania   year             11.5 9.48e-11

Alternatively, you can identify an individual sub-target by its index.

subtargets(model)
#> [1] "model_c56e5407" "model_706a1529" "model_da843806" "model_862f8003"
#> [5] "model_ebb41f51"

readd(model, subtargets = 2) # equivalent to readd() on a single model_* sub-target
#> # A tibble: 2 x 4
#>   continent term        statistic      p.value
#>   <fct>     <chr>           <dbl>        <dbl>
#> 1 Americas  (Intercept)     -5.56 0.0000000610
#> 2 Americas  year             5.55 0.0000000616

If you don’t know the index offhand, you can find out using the sub-target’s name.

subtarget <- "model_706a1529" # name of the sub-target of interest
print(subtarget)
#> [1] "model_706a1529"

which(subtarget == subtargets(model))
#> [1] 2

If the sub-target errored out and subtargets() fails, the individual sub-target metadata will have a subtarget_index field.

diagnose(subtarget, character_only = TRUE)$subtarget_index
#> [1] 2

Either way, once you have the sub-target’s index, you can retrieve the section of data that the sub-target took as input. Below, we load the part of continents that the second sub-target of model used during make().

vctrs::vec_slice(readd(continents), 2)
#> $Americas
#> # A tibble: 300 x 6
#>    country   continent  year lifeExp      pop gdpPercap[,1]
#>    <fct>     <fct>     <int>   <dbl>    <int>         <dbl>
#>  1 Argentina Americas   1952    62.5 17876956      -0.132  
#>  2 Argentina Americas   1957    64.4 19610538      -0.0364 
#>  3 Argentina Americas   1962    65.1 21283783      -0.00833
#>  4 Argentina Americas   1967    65.6 22934225       0.0850 
#>  5 Argentina Americas   1972    67.1 24779799       0.226  
#>  6 Argentina Americas   1977    68.5 26983828       0.291  
#>  7 Argentina Americas   1982    69.9 29341374       0.181  
#>  8 Argentina Americas   1987    70.8 31620918       0.195  
#>  9 Argentina Americas   1992    71.9 33958947       0.212  
#> 10 Argentina Americas   1997    73.3 36203463       0.381  
#> # … with 290 more rows

If continents were dynamic, we could have just used readd(continents, subtargets = 2). But continents was a static target, so we needed to replicate drake’s dynamic branching behavior using vctrs.

6.5 Dynamic transformations

Dynamic branching supports the transformations map(), cross(), and group(). These transformations tell drake how to create sub-targets.

6.5.1 `map()`

map() iterates over the vector slices of the targets you supply as arguments. We saw above how map() iterates over lists. If you give it a data frame, it will map over the rows.

plan <- drake_plan(
  subset = head(gapminder),
  row = target(subset, dynamic = map(subset))
)

make(plan)
#> ▶ target subset
#> ▶ dynamic row
#> ❯ subtarget row_9939cae3
#> ❯ subtarget row_e8047114
#> ❯ subtarget row_2ef3db10
#> ❯ subtarget row_f9171bbe
#> ❯ subtarget row_7d6002e9
#> ❯ subtarget row_509468b3
#> ■ finalize row

readd(row_9939cae3)
#> # A tibble: 1 x 6
#>   country     continent  year lifeExp     pop gdpPercap
#>   <fct>       <fct>     <int>   <dbl>   <int>     <dbl>
#> 1 Afghanistan Asia       1952    28.8 8425333      779.
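
To anticipate how map() will split a data frame, you can count and take slices with vctrs directly. This is only a sketch: it assumes, as described above, that each sub-target corresponds to one vec_slice() of the input, and it uses the built-in mtcars data rather than a drake target.

```r
library(vctrs)

small <- head(mtcars)

# For a data frame, vec_size() counts rows, so we expect six slices.
vec_size(small)
#> [1] 6

# Each slice is a one-row data frame with the same columns.
first_row <- vec_slice(small, 1)
dim(first_row)
#> [1]  1 11
```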

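If you supply multiple targets, map() iterates over their slices in parallel: the first sub-target uses the first slice of each target, the second sub-target uses the second slice of each, and so on.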

plan <- drake_plan(
  numbers = seq_len(2),
  letters = c("a", "b"),
  zipped = target(paste0(numbers, letters), dynamic = map(numbers, letters))
)

make(plan)
#> ▶ target numbers
#> ▶ target letters
#> ▶ dynamic zipped
#> ❯ subtarget zipped_8ac3968c
#> ❯ subtarget zipped_4a7a9b07
#> ■ finalize zipped

readd(zipped)
#> [1] "1a" "2b"
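
As a rough mental model, a multi-target map() behaves like the loop below. This is a sketch with plain vctrs, not drake's implementation, and letters_ab is a hypothetical name chosen to avoid masking base R's letters.

```r
library(vctrs)

numbers <- seq_len(2)
letters_ab <- c("a", "b")

# One sub-target per index: slice i of each input feeds the same command.
subtarget_values <- lapply(
  seq_len(vec_size(numbers)),
  function(i) paste0(vec_slice(numbers, i), vec_slice(letters_ab, i))
)

# Aggregate the sub-target values with vec_c(), as readd() does.
zipped <- do.call(vec_c, subtarget_values)
zipped
#> [1] "1a" "2b"
```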

6.5.2 `cross()`

cross() creates a new sub-target for each combination of slices of the targets you supply as arguments.

plan <- drake_plan(
  numbers = seq_len(2),
  letters = c("a", "b"),
  combo = target(paste0(numbers, letters), dynamic = cross(numbers, letters))
)

make(plan)
#> ▶ dynamic combo
#> ❯ subtarget combo_8ac3968c
#> ❯ subtarget combo_ed1d2e7b
#> ❯ subtarget combo_ef37ab56
#> ❯ subtarget combo_4a7a9b07
#> ■ finalize combo

readd(combo)
#> [1] "1a" "1b" "2a" "2b"
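
The combinations can be enumerated by hand to check the ordering. The sketch below uses plain vctrs and expand.grid() and is not drake's own code; the index order, with the first argument varying slowest, was chosen to match the readd(combo) output above. The name letters_ab is hypothetical.

```r
library(vctrs)

numbers <- seq_len(2)
letters_ab <- c("a", "b")

# expand.grid() varies its first column fastest, so list j before i
# to make the index into numbers vary slowest.
index <- expand.grid(
  j = seq_len(vec_size(letters_ab)),
  i = seq_len(vec_size(numbers))
)

# One sub-target value per (i, j) pair of slices.
subtarget_values <- Map(
  function(i, j) paste0(vec_slice(numbers, i), vec_slice(letters_ab, j)),
  index$i,
  index$j
)

combo <- do.call(vec_c, subtarget_values)
combo
#> [1] "1a" "1b" "2a" "2b"
```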

6.5.3 `group()`

With group(), you can create multiple aggregates of a given target. Use the .by argument to set a grouping variable: each sub-target receives the slices of the target that share one value of that variable.

plan <- drake_plan(
  data = gapminder,
  by = data$continent,
  gdp = target(
    tibble(median = median(data$gdpPercap), continent = by[1]),
    dynamic = group(data, .by = by)
  )
)

make(plan)
#> ▶ target data
#> ▶ target by
#> ▶ dynamic gdp
#> ❯ subtarget gdp_9adfc39f
#> ❯ subtarget gdp_d9f30951
#> ❯ subtarget gdp_958a2f81
#> ❯ subtarget gdp_962b03c8
#> ❯ subtarget gdp_dc1cff81
#> ■ finalize gdp

readd(gdp)
#> # A tibble: 5 x 2
#>   median continent
#>    <dbl> <fct>    
#> 1  2647. Asia     
#> 2 12082. Europe   
#> 3  1192. Africa   
#> 4  5466. Americas 
#> 5 17983. Oceania
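
Grouping can be mimicked outside drake with vctrs. In the sketch below, vec_split() is only a convenient stand-in (this chapter does not say which function drake uses internally for group()), and the data frame holds made-up values. Like the readd(gdp) output above, the groups come out in order of first appearance rather than alphabetically.

```r
library(vctrs)

# Toy data with a grouping column (made-up values, not Gapminder).
toy <- data.frame(
  continent = c("Asia", "Europe", "Asia", "Africa"),
  gdpPercap = c(1, 10, 3, 7),
  stringsAsFactors = FALSE
)
by <- toy$continent

# Split the rows of toy by the grouping variable.
groups <- vec_split(toy, by)
groups$key
#> [1] "Asia"   "Europe" "Africa"

# One aggregate per group, analogous to one sub-target per group.
medians <- vapply(groups$val, function(d) median(d$gdpPercap), numeric(1))
medians
#> [1]  2 10  7
```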

6.6 Trace

All dynamic transforms have a .trace argument to record optional metadata for each sub-target. In the example from group(), the trace is another way to keep track of the continent of each median GDP value.

plan <- drake_plan(
  data = gapminder,
  by = data$continent,
  gdp = target(
    median(data$gdpPercap),
    dynamic = group(data, .by = by, .trace = by)
  )
)

make(plan)
#> ▶ dynamic gdp
#> ❯ subtarget gdp_7e88fb1c
#> ❯ subtarget gdp_a61b8e1b
#> ❯ subtarget gdp_278ff532
#> ❯ subtarget gdp_6f3facea
#> ❯ subtarget gdp_73037e69
#> ■ finalize gdp

The gdp target no longer contains any explicit reference to continent.

readd(gdp)
#> [1]  2646.787 12081.749  1192.138  5465.510 17983.304

However, we can look up the continents in the trace.

read_trace("by", gdp)
#> [1] Asia     Europe   Africa   Americas Oceania 
#> Levels: Africa Americas Asia Europe Oceania
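
Because the trace and the aggregated target line up element by element in the output above, one way to reattach the labels is to bind them into a data frame. The sketch below rebuilds the same plan so that it is self-contained; gdp_by_continent is a hypothetical name, and the approach assumes one trace value per sub-target, as shown above.

```r
library(drake)
library(gapminder)

plan <- drake_plan(
  data = gapminder,
  by = data$continent,
  gdp = target(
    median(data$gdpPercap),
    dynamic = group(data, .by = by, .trace = by)
  )
)

make(plan)

# Pair each median with the continent recorded in the trace.
gdp_by_continent <- data.frame(
  continent = read_trace("by", gdp),
  median = readd(gdp)
)
```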

6.7 `max_expand`

Suppose we want a model for each country.

gapminder_countries <- function() {
  gapminder %>%
    mutate(gdpPercap = scale(gdpPercap)) %>%
    split(f = .$country)
}

plan <- drake_plan(
  countries = gapminder_countries(),
  model = target(fit_model(countries), dynamic = map(countries))
)

The Gapminder dataset has 142 countries, which can get overwhelming. In the early stages of the workflow when we are still debugging and testing, we can limit the number of sub-targets using the max_expand argument of make().

make(plan, max_expand = 2)
#> ▶ target countries
#> ▶ dynamic model
#> ❯ subtarget model_ab009698
#> ❯ subtarget model_cc031a6d
#> ■ finalize model

readd(model)
#> # A tibble: 4 x 4
#>   continent term        statistic  p.value
#>   <fct>     <chr>           <dbl>    <dbl>
#> 1 Asia      (Intercept)    -1.48  0.170   
#> 2 Asia      year           -0.233 0.821   
#> 3 Europe    (Intercept)    -4.76  0.000773
#> 4 Europe    year            4.59  0.000998
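
Judging from the output above, max_expand = 2 kept the sub-targets for the first two countries. A rough vctrs analogy, offered as an assumption rather than a description of drake's internals, is mapping over only the first two slices of the input. The toy_countries list below is a made-up stand-in for the countries target.

```r
library(vctrs)

# Toy stand-in for the countries target.
toy_countries <- list(Afghanistan = 1, Albania = 2, Algeria = 3, Angola = 4)

# Keep at most max_expand slices, starting from the first.
max_expand <- 2
kept <- vec_slice(
  toy_countries,
  seq_len(min(max_expand, vec_size(toy_countries)))
)
names(kept)
#> [1] "Afghanistan" "Albania"
```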

Then, when we are confident and ready, we can scale up to the full number of models.

make(plan)