Chapter 11 Dynamic branching

11.1 Branching

Sometimes, a pipeline contains more targets than a user can comfortably type by hand. For projects with hundreds of targets, branching can make the _targets.R file more concise and easier to read and maintain.

targets supports two types of branching: dynamic branching and static branching. Some projects are better suited to dynamic branching, while others benefit more from static branching or a combination of both. Some users understand dynamic branching more easily because it avoids metaprogramming, while others prefer static branching because tar_manifest() and tar_visnetwork() provide immediate feedback. Except for the section on dynamic-within-static branching, you can read the two chapters on branching in any order (or skip them) depending on your needs.

11.2 About dynamic branching

Dynamic branching is the act of defining new targets (i.e. branches) while the pipeline is running. Prior to launching the pipeline, the user does not necessarily know which branches will spawn or how many branches there will be, and each branch’s inputs are determined at the last minute. Relative to static branching, dynamic branching is better suited to iterating over a larger number of very similar tasks (but can act as an inner layer inside static branching, as the next chapter demonstrates).

11.3 Patterns

To use dynamic branching, set the pattern argument of tar_target(). A pattern is a dynamic branching specification expressed in terms of functional programming. The following minimal example explores the mechanics of patterns (and examples of branching in real-world projects are linked from here).

# _targets.R
library(targets)
list(
  tar_target(w, c(1, 2)),
  tar_target(x, c(10, 20)),
  tar_target(y, w + x, pattern = map(w, x)),
  tar_target(z, sum(y)),
  tar_target(z2, length(y), pattern = map(y))
)
tar_visnetwork()
tar_make()
#> • start target w
#> • built target w
#> • start target x
#> • built target x
#> • start branch y_61ced14d
#> • built branch y_61ced14d
#> • start branch y_4096b6a8
#> • built branch y_4096b6a8
#> • built pattern y
#> • start branch z2_275d234a
#> • built branch z2_275d234a
#> • start branch z2_8c730b35
#> • built branch z2_8c730b35
#> • built pattern z2
#> • start target z
#> • built target z
#> • end pipeline

Above, targets w, x, and z are called stems because they do not create branches themselves. Target y is a pattern because it creates its own branches like y_4096b6a8. A branch is a special kind of target whose dependencies are slices of the targets mentioned in pattern = map(...) etc. If we read target y into memory, all the branches will load and automatically aggregate as a vector.13

tar_read(y)
#> y_61ced14d y_4096b6a8 
#>         11         22

Target z accepts this entire aggregate of y and sums it.

tar_read(z)
#> [1] 33

Target z2 maps over y, so each each branch of z2 accepts a branch of y.

tar_read(z2)
#> z2_275d234a z2_8c730b35 
#>           1           1

11.4 Pattern construction

targets supports the following pattern types.

  • map(): iterate over one or more targets in sequence.
  • cross(): iterate over combinations of slices of targets.
  • slice(): select individual branches slices by numeric index. For example, pattern = slice(x, index = c(3, 4)) applies the target’s command to the third and fourth slices (or branches) of upstream target x.
  • head(): restrict branching to the first few elements.
  • tail(): restrict branching to the last few elements.
  • sample(): restrict branching to a random subset of elements.

These patterns are composable. Below, target z creates six branches, one for each combination of w and (x, y) pair. The pattern cross(w, map(x, y)) is equivalent to tidyr::crossing(w, tidyr::nesting(x, y)).

# _targets.R
library(targets)
list(
  tar_target(w_comp, seq_len(2)),
  tar_target(x_comp, head(letters, 3)),
  tar_target(y_comp, head(LETTERS, 3)),
  tar_target(
    z_comp,
    data.frame(w = w_comp, x = x_comp, y = y_comp),
    pattern = cross(w_comp, map(x_comp, y_comp))
  )
)
# R console
tar_make()
#> • start target w_comp
#> • built target w_comp
#> • start target x_comp
#> • built target x_comp
#> • start target y_comp
#> • built target y_comp
#> • start branch z_comp_a601a6f3
#> • built branch z_comp_a601a6f3
#> • start branch z_comp_38222c1e
#> • built branch z_comp_38222c1e
#> • start branch z_comp_b9220de4
#> • built branch z_comp_b9220de4
#> • start branch z_comp_c3fe35d0
#> • built branch z_comp_c3fe35d0
#> • start branch z_comp_1a64d49a
#> • built branch z_comp_1a64d49a
#> • start branch z_comp_7cbac5ae
#> • built branch z_comp_7cbac5ae
#> • built pattern z_comp
#> • end pipeline
# R console
tar_read(z_comp)
#>   w x y
#> 1 1 a A
#> 2 1 b B
#> 3 1 c C
#> 4 2 a A
#> 5 2 b B
#> 6 2 c C

With slice(), you can select pieces of a pattern or upstream target as follows.

# _targets.R
library(targets)
list(
  tar_target(a_data, letters),
  tar_target(b_data, LETTERS),
  tar_target(
    c_data,
    paste(a_data, b_data),
    pattern = slice(cross(a_data, b_data), index = seq(2, 3))
  )
)
# R console
tar_make()
#> • start target a_data
#> • built target a_data
#> • start target b_data
#> • built target b_data
#> • start branch c_data_5a10da89
#> • built branch c_data_5a10da89
#> • start branch c_data_3a92ccbc
#> • built branch c_data_3a92ccbc
#> • built pattern c_data
#> • end pipeline
# R console
tar_read(c_data)
#> c_data_5a10da89 c_data_3a92ccbc 
#>           "a B"           "a C"

11.5 Branch provenance

The tar_branches() function identifies dependency relationships among individual branches. In the example pipeline below, we can find out the branch of y that each branch of z depends on.

# _targets.R
library(targets)
list(
  tar_target(x, seq_len(3)),
  tar_target(y, x + 1, pattern = map(x)),
  tar_target(z, y + 1, pattern = map(y))
)
tar_make()
#> • start target x
#> • built target x
#> • start branch y_29239c8a
#> • built branch y_29239c8a
#> • start branch y_7cc32924
#> • built branch y_7cc32924
#> • start branch y_bd602d50
#> • built branch y_bd602d50
#> • built pattern y
#> • start branch z_d0697b30
#> • built branch z_d0697b30
#> • start branch z_959da4d3
#> • built branch z_959da4d3
#> • start branch z_92be5b5f
#> • built branch z_92be5b5f
#> • built pattern z
#> • end pipeline
branches <- tar_branches(z, map(y))
branches
#> # A tibble: 3 × 2
#>   z          y         
#>   <chr>      <chr>     
#> 1 z_d0697b30 y_29239c8a
#> 2 z_959da4d3 y_7cc32924
#> 3 z_92be5b5f y_bd602d50
tar_read_raw(branches$y[2])
#> [1] 3

However, tar_branches() is not always helpful: for example, if we look at how y branches over x. x does not use dynamic branching, so tar_branches() does not return meaningful branch names.

branches <- tar_branches(y, map(x))
branches
#> # A tibble: 3 × 2
#>   y          x         
#>   <chr>      <chr>     
#> 1 y_29239c8a x_ccba877f
#> 2 y_7cc32924 x_084b9b29
#> 3 y_bd602d50 x_91e9ed7e
tar_read_raw(branches$x[2])
#> Error: target x_084b9b29 not found

In situations like this, it is best to proactively write targets that keep track of information about their upstream branches. Data frames and tibbles are useful for this.

# _targets.R
library(targets)
list(
  tar_target(x, seq_len(3)),
  tar_target(y, data.frame(x = x, y = x + 1), pattern = map(x))
)
tar_make()
#> ✔ skip target x
#> • start branch y_29239c8a
#> • built branch y_29239c8a
#> • start branch y_7cc32924
#> • built branch y_7cc32924
#> • start branch y_bd602d50
#> • built branch y_bd602d50
#> • built pattern y
#> • end pipeline
tar_read(y)
#>   x y
#> 1 1 2
#> 2 2 3
#> 3 3 4

11.6 Testing patterns

To check the correctness of a pattern without running the pipeline, use tar_pattern(). Simply supply the pattern itself and the length of each dependency target. The branch names in the data frames below are made up, but they convey a high-level picture of the branching structure.

tar_pattern(
  cross(w_comp, map(x_comp, y_comp)),
  w_comp = 2,
  x_comp = 3,
  y_comp = 3
)
#> # A tibble: 6 × 3
#>   w_comp   x_comp   y_comp  
#>   <chr>    <chr>    <chr>   
#> 1 w_comp_1 x_comp_1 y_comp_1
#> 2 w_comp_1 x_comp_2 y_comp_2
#> 3 w_comp_1 x_comp_3 y_comp_3
#> 4 w_comp_2 x_comp_1 y_comp_1
#> 5 w_comp_2 x_comp_2 y_comp_2
#> 6 w_comp_2 x_comp_3 y_comp_3
tar_pattern(
  head(cross(w_comp, map(x_comp, y_comp)), n = 2),
  w_comp = 2,
  x_comp = 3,
  y_comp = 3
)
#> # A tibble: 2 × 3
#>   w_comp   x_comp   y_comp  
#>   <chr>    <chr>    <chr>   
#> 1 w_comp_1 x_comp_1 y_comp_1
#> 2 w_comp_1 x_comp_2 y_comp_2
tar_pattern(
  cross(w_comp, sample(map(x_comp, y_comp), n = 2)),
  w_comp = 2,
  x_comp = 3,
  y_comp = 3
)
#> # A tibble: 4 × 3
#>   w_comp   x_comp   y_comp  
#>   <chr>    <chr>    <chr>   
#> 1 w_comp_1 x_comp_2 y_comp_2
#> 2 w_comp_1 x_comp_1 y_comp_1
#> 3 w_comp_2 x_comp_2 y_comp_2
#> 4 w_comp_2 x_comp_1 y_comp_1

11.7 Branching over files

Dynamic branching over files is tricky. A target with format = "file" treats the entire set of files as an irreducible bundle. That means in order to branch over files downstream, each file must already have its own branch.

# _targets.R
library(targets)
list(
  tar_target(paths, c("a.csv", "b.csv")),
  tar_target(files, paths, format = "file", pattern = map(paths)),
  tar_target(data, read_csv(files), pattern = map(files))
)

The tar_files() function from the tarchetypes package is shorthand for the first two targets above.

# _targets.R
library(targets)
library(tarchetypes)
list(
  tar_files(files, c("a.csv", "b.csv")),
  tar_target(data, read_csv(files), pattern = map(files))
)

11.8 Branching over rows

tarchetypes >= 0.1.0 has helpers for easy branching over subsets of data frames:

  • tar_group_by(): define row groups using dplyr::group_by() semantics.
  • tar_group_select(): define row groups using tidyselect semantics.
  • tar_group_count(): define a given number row groups.
  • tar_group_size(): define row groups of a given size.

If you define a target with one of these functions, all downstream dynamic targets will automatically branch over the row groups.

# _targets.R file:
library(targets)
library(tarchetypes)
produce_data <- function() {
  expand.grid(var1 = c("a", "b"), var2 = c("c", "d"), rep = c(1, 2, 3))
}
list(
  tar_group_by(data, produce_data(), var1, var2),
  tar_target(group, data, pattern = map(data))
)
# R console:
library(targets)
tar_make()
#> • start target data
#> • built target data
#> • start branch group_b3d7d010
#> • built branch group_b3d7d010
#> • start branch group_6a76c5c0
#> • built branch group_6a76c5c0
#> • start branch group_164b16bf
#> • built branch group_164b16bf
#> • start branch group_f5aae602
#> • built branch group_f5aae602
#> • built pattern group
#> • end pipeline

# First row group:
tar_read(group, branches = 1)
#> # A tibble: 3 × 4
#>   var1  var2    rep tar_group
#>   <fct> <fct> <dbl>     <int>
#> 1 a     c         1         1
#> 2 a     c         2         1
#> 3 a     c         3         1

# Second row group:
tar_read(group, branches = 2)
#> # A tibble: 3 × 4
#>   var1  var2    rep tar_group
#>   <fct> <fct> <dbl>     <int>
#> 1 a     d         1         2
#> 2 a     d         2         2
#> 3 a     d         3         2

For more information on how the row grouping mechanism works, see the section on “group” iteration below.

11.9 Iteration

The iteration arguments of tar_target() and tar_option_set() control the slicing of stems and the aggregation of patterns. For example, if you write a stem target with tar_target(x, fun(), iteration = "list"), then all downstream targets that branch over x will give list-like slices x[[1]], x[[2]], etc. to the individual branches instead of the default vector-like slices x[1], x[2], etc. Likewise, for a pattern defined with tar_target(y, fun(x), pattern = map(x), iteration = "list"), tar_read(y) will return a list instead of a vector, and a downstream target like tar_target(y, fun(z)) will treat z as a list instead of a vector.14

11.9.1 Vector iteration

targets uses vector iteration by default, and you can opt into this behavior by setting iteration = "vector" in tar_target(). In vector iteration, targets uses the vctrs package to split stems and aggregate branches. That means vctrs::vec_slice() slices up stems like x for mapping, and vctrs::vec_c() aggregates patterns like y for operations like tar_read().

For atomic vectors like in the example above, this behavior is already intuitive. But if we map over a data frame, each branch will get a row of the data frame due to vector iteration.

library(targets)
print_and_return <- function(x) {
  print(x)
  x
}
list(
  tar_target(x, data.frame(a = c(1, 2), b = c("a", "b"))),
  tar_target(y, print_and_return(x), pattern = map(x))
)
tar_make()
#> • start target x
#> • built target x
#> • start branch y_2e131731
#>   a b
#> 1 1 a
#> • built branch y_2e131731
#> • start branch y_6c3b7977
#>   a b
#> 1 2 b
#> • built branch y_6c3b7977
#> • built pattern y
#> • end pipeline

And since y also has iteration = "vector", the aggregate of y is a single data frame of all the rows.

tar_read(y)
#>   a b
#> 1 1 a
#> 2 2 b

11.9.2 List iteration

List iteration splits and aggregates targets as simple lists. If target x has "list" iteration, all branches of downstream patterns will get x[[1]], x[[2]], and so on. (vctrs::vec_slice() behaves more like [] than [[]].)

# _targets.R
library(targets)
print_and_return <- function(x) {
  print(x)
  x
}
list(
  tar_target(
    x,
    data.frame(a = c(1, 2), b = c("a", "b")),
    iteration = "list"
  ),
  tar_target(y, print_and_return(x), pattern = map(x)),
  tar_target(z, x, pattern = map(x), iteration = "list")
)
tar_make()
#> • start target x
#> • built target x
#> • start branch y_651b13c7
#> [1] 1 2
#> • built branch y_651b13c7
#> • start branch y_1c8c7726
#> [1] "a" "b"
#> • built branch y_1c8c7726
#> • built pattern y
#> • start branch z_651b13c7
#> • built branch z_651b13c7
#> • start branch z_1c8c7726
#> • built branch z_1c8c7726
#> • built pattern z
#> • end pipeline

Aggregation also happens differently. In this case, the vector iteration in y is not ideal, and the list iteration in z gives us more sensible output.

tar_read(y)
#> Error: Can't combine `y_651b13c7` <double> and `y_1c8c7726` <character>.
tar_read(z)
#> $z_651b13c7
#> [1] 1 2
#> 
#> $z_1c8c7726
#> [1] "a" "b"

11.9.3 Group iteration

Group iteration brings dplyr::group_by() functionality to patterns. This way, we can map or cross over custom subsets of rows. Consider the following data frame.

object <- data.frame(
  x = seq_len(6),
  id = rep(letters[seq_len(3)], each = 2)
)

object
#>   x id
#> 1 1  a
#> 2 2  a
#> 3 3  b
#> 4 4  b
#> 5 5  c
#> 6 6  c

To map over the groups of rows defined by the id column, we

  1. Use group_by() and tar_group() to define the groups of rows, and
  2. Use iteration = "group" in tar_target() to tell downstream patterns to use the row groups.

Put together, the pipeline looks like this.

# _targets.R
library(targets)
tar_option_set(packages = "dplyr")
list(
  tar_target(
    data,
    data.frame(
      x = seq_len(6),
      id = rep(letters[seq_len(3)], each = 2)
    ) %>%
      group_by(id) %>%
      tar_group(),
    iteration = "group"
  ),
  tar_target(
    subsets,
    data,
    pattern = map(data),
    iteration = "list"
  )
)
tar_make()
#> • start target data
#> • built target data
#> • start branch subsets_f6af59da
#> • built branch subsets_f6af59da
#> • start branch subsets_f91945cc
#> • built branch subsets_f91945cc
#> • start branch subsets_cae649ac
#> • built branch subsets_cae649ac
#> • built pattern subsets
#> • end pipeline
lapply(tar_read(subsets), as.data.frame)
#> $subsets_f6af59da
#>   x id tar_group
#> 1 1  a         1
#> 2 2  a         1
#> 
#> $subsets_f91945cc
#>   x id tar_group
#> 1 3  b         2
#> 2 4  b         2
#> 
#> $subsets_cae649ac
#>   x id tar_group
#> 1 5  c         3
#> 2 6  c         3

Row groups are defined in the special tar_group column created by tar_group().

data.frame(
  x = seq_len(6),
  id = rep(letters[seq_len(3)], each = 2)
) %>%
  dplyr::group_by(id) %>%
  tar_group()
#> # A tibble: 6 × 3
#>       x id    tar_group
#>   <int> <chr>     <int>
#> 1     1 a             1
#> 2     2 a             1
#> 3     3 b             2
#> 4     4 b             2
#> 5     5 c             3
#> 6     6 c             3

tar_group() creates this column based on the orderings of the grouping variables supplied to dplyr::group_by(), not the order of the rows in the data.

flip_order <- function(x) {
  ordered(x, levels = sort(unique(x), decreasing = TRUE))
}

data.frame(
  x = seq_len(6),
  id = flip_order(rep(letters[seq_len(3)], each = 2))
) %>%
  dplyr::group_by(id) %>%
  tar_group()
#> # A tibble: 6 × 3
#>       x id    tar_group
#>   <int> <ord>     <int>
#> 1     1 a             3
#> 2     2 a             3
#> 3     3 b             2
#> 4     4 b             2
#> 5     5 c             1
#> 6     6 c             1

The ordering in tar_group agrees with the ordering shown by dplyr::group_keys().

data.frame(
  x = seq_len(6),
  id = flip_order(rep(letters[seq_len(3)], each = 2))
) %>%
  dplyr::group_by(id) %>%
  dplyr::group_keys()
#> # A tibble: 3 × 1
#>   id   
#>   <ord>
#> 1 c    
#> 2 b    
#> 3 a

Branches are arranged in increasing order with respect to the integers in tar_group.

# _targets.R
library(targets)
tar_option_set(packages = "dplyr")

flip_order <- function(x) {
  ordered(x, levels = sort(unique(x), decreasing = TRUE))
}

list(
  tar_target(
    data,
    data.frame(
      x = seq_len(6),
      id = flip_order(rep(letters[seq_len(3)], each = 2))
    ) %>%
      group_by(id) %>%
      tar_group(),
    iteration = "group"
  ),
  tar_target(
    subsets,
    data,
    pattern = map(data),
    iteration = "list"
  )
)
tar_make()
#> • start target data
#> • built target data
#> • start branch subsets_13c06a5c
#> • built branch subsets_13c06a5c
#> • start branch subsets_f38b90a2
#> • built branch subsets_f38b90a2
#> • start branch subsets_d2717707
#> • built branch subsets_d2717707
#> • built pattern subsets
#> • end pipeline
lapply(tar_read(subsets), as.data.frame)
#> $subsets_13c06a5c
#>   x id tar_group
#> 1 5  c         1
#> 2 6  c         1
#> 
#> $subsets_f38b90a2
#>   x id tar_group
#> 1 3  b         2
#> 2 4  b         2
#> 
#> $subsets_d2717707
#>   x id tar_group
#> 1 1  a         3
#> 2 2  a         3

11.10 Dynamic branching performance

With dynamic branching, it is super easy to create an enormous number of targets. But when the number of targets starts to exceed a couple hundred, tar_make() slows down, and graphs from tar_visnetwork() start to become unmanageable.

If that happens to you, you might be tempted to use the names and shortcut arguments to tar_make(), e.g. tar_make(names = all_of("only", "these", "targets"), shortcut = TRUE), to completely temporarily omit thousands of upstream targets for the sake of concentrating on one section at a time. However, this technique is only a temporary measure, and it is best to eventually revert back to the default names = NULL and shortcut = FALSE to ensure reproducibility.

Another temporary workaround is to temporarily select subsets of branches. For example, instead of pattern = map(large_target) in tar_target(), you could prototype on a target that uses pattern = head(map(large_target), n = 1) or pattern = slice(map(large_target), c(4, 5, 6)). In the case of slice(), the tar_branch_index() function (only in targets version 0.5.0.9000 and above) can help you find the required integer indexes corresponding to individual branch names you may want.

However, both the above solutions are temporary workarounds for the sake of prototyping a pipeline and chipping away at a large computation. To help the entire pipeline scale more efficiently, consider batching your work into a smaller number of targets.

11.11 Batching

Targetopia packages usually have functions that support batching for various use cases. In stantargets, tar_stan_mcmc_rep_summary() and friends automatically use batching behind the scenes. The user simply needs to select the number of batches and number of reps per batch. Each batch is a dynamic branch with multiple reps, and each rep fits the user’s model once and computes summary statistics.

In tarchetypes, tar_rep() is a general-use target factory for dynamic branching. It allows you to repeat arbitrary code over multiple reps split into multiple batches. Each batch gets its own reproducible random number seed generated from the target name (as do all targets) and reps run sequentially within each batch, so the results are reproducible. The targets-stan repository has an example of batching implemented from scratch. The goal of the pipeline is to validate a Bayesian model by simulating thousands of dataset, analyzing each with a Bayesian model, and assessing the overall accuracy of the inference. Rather than define a target for each dataset in model, the pipeline breaks up the work into batches, where each batch has multiple datasets or multiple analyses. Here is a version of the pipeline with 40 batches and 25 simulation reps per batch (1000 reps total in a pipeline of 82 targets).

# _targets.R
library(targets)
list(
  tar_target(model_file, compile_model("stan/model.stan"), format = "file"),
  tar_target(index_batch, seq_len(40)),
  tar_target(index_sim, seq_len(25)),
  tar_target(
    data_continuous,
    purrr::map_dfr(index_sim, ~simulate_data_continuous()),
    pattern = map(index_batch)
  ),
  tar_target(
    fit_continuous,
    map_sims(data_continuous, model_file = model_file),
    pattern = map(data_continuous)
  )
)

  1. To use list aggregation, write tar_target(y, w + x, pattern = map(w, x), iteration = "list") in the pipeline.↩︎

  2. iteration does not control the aggregation of stems because stems are already aggregated, so there is nothing to be done. Likewise, it does not control the splitting of patterns because the pattern argument of tar_target() already does that.↩︎

Copyright Eli Lilly and Company