Chapter 7 Dynamic branching

7.1 Branching

Sometimes, a pipeline contains more targets than a user can comfortably type by hand. For projects with hundreds of targets, branching can make the _targets.R file more concise and easier to read and maintain.

targets supports two types of branching: dynamic branching and static branching. Some projects are better suited to dynamic branching, while others benefit more from static branching or a combination of both. Some users understand dynamic branching more easily because it avoids metaprogramming, while others prefer static branching because tar_manifest() and tar_visnetwork() provide immediate feedback. Except for the section on dynamic-within-static branching, you can read the two chapters on branching in any order (or skip them) depending on your needs.

7.2 About dynamic branching

Dynamic branching is the act of defining new targets (i.e. branches) while the pipeline is running. Prior to launching the pipeline, the user does not necessarily know which branches will spawn or how many branches there will be, and each branch’s inputs are determined at the last minute. Relative to static branching, dynamic branching is better suited to iterating over a larger number of very similar tasks (but can act as an inner layer inside static branching, as the next chapter demonstrates).

7.3 Patterns

To use dynamic branching, set the pattern argument of tar_target(). A pattern is a dynamic branching specification expressed in terms of functional programming. The following minimal example explores the mechanics of patterns; the targets documentation links to examples of branching in real-world projects.

# _targets.R
library(targets)
library(tidyverse)
list(
  tar_target(w, c(1, 2)),
  tar_target(x, c(10, 20)),
  tar_target(y, w + x, pattern = map(w, x)),
  tar_target(z, sum(y)),
  tar_target(z2, length(y), pattern = map(y))
)
tar_visnetwork()
tar_make()

Above, targets w, x, and z are called stems because they provide values for other targets to branch over. Target y is a pattern because it defines multiple sub-targets, or branches, based on the return values of the targets named inside map() or cross(). If we read target y into memory, all the branches will load and get aggregated according to the iteration argument of tar_target().

tar_read(y)

Target z accepts this entire aggregate of y and sums it.

tar_read(z)

Target z2 maps over y, so each branch of z2 accepts a branch of y.

tar_read(z2)
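
To summarize the mechanics, here is a rough base R analogy of the pipeline above. It is only an illustration of the functional-programming semantics; in the real pipeline, each branch is its own target with its own storage and metadata.

w <- c(1, 2)
x <- c(10, 20)
y <- unlist(Map(`+`, w, x))          # like tar_target(y, w + x, pattern = map(w, x))
z <- sum(y)                          # like tar_target(z, sum(y))
z2 <- vapply(y, length, integer(1))  # like tar_target(z2, length(y), pattern = map(y))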

7.4 Pattern construction

targets supports the following pattern types.

  • map(): iterate over one or more targets in sequence.
  • cross(): iterate over combinations of slices of targets.
  • head(): restrict branching to the first few elements.
  • tail(): restrict branching to the last few elements.
  • sample(): restrict branching to a random subset of elements.

These patterns are composable. Below, target z_comp creates six branches, one for each combination of an element of w_comp with an (x_comp, y_comp) pair. The pattern cross(w_comp, map(x_comp, y_comp)) is equivalent to tidyr::crossing(w_comp, tidyr::nesting(x_comp, y_comp)).

# _targets.R
library(targets)
list(
  tar_target(w_comp, seq_len(2)),
  tar_target(x_comp, head(letters, 3)),
  tar_target(y_comp, head(LETTERS, 3)),
  tar_target(
    z_comp,
    data.frame(w = w_comp, x = x_comp, y = y_comp),
    pattern = cross(w_comp, map(x_comp, y_comp))
  )
)
tar_make()
tar_read(z_comp)
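
To see the equivalence claimed above, the same six combinations can be generated directly with tidyr (shown for comparison only; the row order may differ, and this is not part of the pipeline).

library(tidyr)
crossing(
  w = seq_len(2),
  nesting(x = head(letters, 3), y = head(LETTERS, 3))
)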

7.5 Branch provenance

The tar_branches() function identifies dependency relationships among individual branches. In the example pipeline below, we can find out the branch of y that each branch of z depends on.

# _targets.R
library(targets)
list(
  tar_target(x, seq_len(3)),
  tar_target(y, x + 1, pattern = map(x)),
  tar_target(z, y + 1, pattern = map(y))
)
tar_make()
branches <- tar_branches(z, map(y))
branches
tar_read_raw(branches$y[2])

However, tar_branches() is not always helpful. Consider, for example, how y branches over x: x does not use dynamic branching, so it has no branches of its own, and tar_branches() cannot return meaningful branch names for it.

branches <- tar_branches(y, map(x))
branches
tar_read_raw(branches$x[2])

In situations like this, it is best to proactively write targets that keep track of information about their upstream branches. Data frames and tibbles are useful for this.

# _targets.R
library(targets)
library(tibble)
list(
  tar_target(x, seq_len(3)),
  tar_target(y, tibble(x = x, y = x + 1), pattern = map(x))
)
tar_make()
tar_read(y)
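
With the upstream value carried along in each row, provenance questions become ordinary data frame operations. A minimal sketch, assuming dplyr is attached:

library(dplyr)
tar_read(y) %>% filter(x == 2)  # the result produced by the branch that received x = 2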

7.6 Testing patterns

To check the correctness of a pattern without running the pipeline, use tar_pattern(). Simply supply the pattern itself and the length of each dependency target. The branch names in the data frames below are made up, but they convey a high-level picture of the branching structure.

tar_pattern(
  cross(w_comp, map(x_comp, y_comp)),
  w_comp = 2,
  x_comp = 3,
  y_comp = 3
)
tar_pattern(
  head(cross(w_comp, map(x_comp, y_comp)), n = 2),
  w_comp = 2,
  x_comp = 3,
  y_comp = 3
)
tar_pattern(
  cross(w_comp, sample(map(x_comp, y_comp), n = 2)),
  w_comp = 2,
  x_comp = 3,
  y_comp = 3
)
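
As a quick sanity check on branch counts (a sketch): the full cross should yield 2 * 3 = 6 branches, head(..., n = 2) keeps the first 2, and sampling 2 of the 3 (x_comp, y_comp) pairs before crossing yields 2 * 2 = 4.

nrow(tar_pattern(
  cross(w_comp, map(x_comp, y_comp)),
  w_comp = 2,
  x_comp = 3,
  y_comp = 3
))  # expected: 6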

7.7 Dynamic branching over files

Dynamic branching over files is tricky. A target with format = "file" treats the entire set of files as an irreducible bundle. That means in order to branch over files downstream, each file must already have its own branch.

# _targets.R
library(targets)
library(readr) # provides read_csv()
list(
  tar_target(paths, c("a.csv", "b.csv")),
  tar_target(files, paths, format = "file", pattern = map(paths)),
  tar_target(data, read_csv(files), pattern = map(files))
)

The tar_files() function from the tarchetypes package is shorthand for the first two targets above.

# _targets.R
library(targets)
library(tarchetypes)
library(readr) # provides read_csv()
list(
  tar_files(files, c("a.csv", "b.csv")),
  tar_target(data, read_csv(files), pattern = map(files))
)
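
To try either version, the files just need to exist before the pipeline runs. A minimal sketch (the value column is made up for illustration):

write.csv(data.frame(value = c(1, 2)), "a.csv", row.names = FALSE)
write.csv(data.frame(value = c(3, 4)), "b.csv", row.names = FALSE)
tar_make()
tar_read(data, branches = 1)  # the rows read from a.csv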

7.8 Iteration

There are many ways to slice up a stem for branching, and there are many ways to aggregate the branches of a pattern.¹ The iteration argument of tar_target() controls the splitting and aggregation protocol on a target-by-target basis, and you can set the default for all targets with the analogous argument of tar_option_set().

7.8.1 Vector iteration

targets uses vector iteration by default, and you can opt into this behavior by setting iteration = "vector" in tar_target(). In vector iteration, targets uses the vctrs package to split stems and aggregate branches. That means vctrs::vec_slice() slices up stems like x for mapping, and vctrs::vec_c() aggregates patterns like y for operations like tar_read().

For atomic vectors, such as those in the example above, this behavior is intuitive. But if we map over a data frame, vector iteration gives each branch a single row of the data frame.

# _targets.R
library(targets)
print_and_return <- function(x) {
  print(x)
  x
}
list(
  tar_target(x, data.frame(a = c(1, 2), b = c("a", "b"))),
  tar_target(y, print_and_return(x), pattern = map(x))
)
tar_make()

And since y also has iteration = "vector", the aggregate of y is a single data frame of all the rows.

tar_read(y)
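
The same splitting and aggregation can be reproduced directly with vctrs, which shows what vector iteration is doing under the hood (illustration only).

library(vctrs)
x <- data.frame(a = c(1, 2), b = c("a", "b"))
vec_slice(x, 1)                          # roughly what branch 1 of y receives
vec_c(vec_slice(x, 1), vec_slice(x, 2))  # roughly how tar_read(y) aggregates the branches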

7.8.2 List iteration

List iteration splits and aggregates targets as simple lists. If target x uses "list" iteration, downstream branches receive x[[1]], x[[2]], and so on. (By contrast, vctrs::vec_slice(), which vector iteration uses, behaves more like [ than [[.)
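
Because a data frame is a list of its columns, the description above implies that branching over a data frame with "list" iteration hands each downstream branch a column rather than a row. In plain R terms:

x <- data.frame(a = c(1, 2), b = c("a", "b"))
x[[1]]  # what the first downstream branch receives: column a
x[[2]]  # what the second downstream branch receives: column b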

# _targets.R
library(targets)
print_and_return <- function(x) {
  print(x)
  x
}
list(
  tar_target(
    x,
    data.frame(a = c(1, 2), b = c("a", "b")),
    iteration = "list"
  ),
  tar_target(y, print_and_return(x), pattern = map(x)),
  tar_target(z, x, pattern = map(x), iteration = "list")
)
tar_make()

Aggregation also happens differently. In this case, the vector iteration in y is not ideal, and the list iteration in z gives us more sensible output.

tar_read(y)
tar_read(z)

7.8.3 Group iteration

Group iteration brings dplyr::group_by() functionality to patterns. This way, we can map or cross over custom subsets of rows. Consider the following data frame.

object <- data.frame(
  x = seq_len(6),
  id = rep(letters[seq_len(3)], each = 2)
)

object

To map over the groups of rows defined by the id column, we

  1. Use group_by() and tar_group() to define the groups of rows, and
  2. Use iteration = "group" in tar_target() to tell downstream patterns to use the row groups.

Put together, the pipeline looks like this.

# _targets.R
library(targets)
tar_option_set(packages = "tidyverse")
list(
  tar_target(
    data,
    data.frame(
      x = seq_len(6),
      id = rep(letters[seq_len(3)], each = 2)
    ) %>%
      group_by(id) %>%
      tar_group(),
    iteration = "group"
  ),
  tar_target(
    subsets,
    data,
    pattern = map(data),
    iteration = "list"
  )
)
tar_make()
lapply(tar_read(subsets), as.data.frame)

Row groups are defined in the special tar_group column created by tar_group().

data.frame(
  x = seq_len(6),
  id = rep(letters[seq_len(3)], each = 2)
) %>%
  dplyr::group_by(id) %>%
  tar_group()

tar_group() creates this column based on the orderings of the grouping variables supplied to dplyr::group_by(), not the order of the rows in the data.

flip_order <- function(x) {
  ordered(x, levels = sort(unique(x), decreasing = TRUE))
}

data.frame(
  x = seq_len(6),
  id = flip_order(rep(letters[seq_len(3)], each = 2))
) %>%
  dplyr::group_by(id) %>%
  tar_group()

The ordering in tar_group agrees with the ordering shown by dplyr::group_keys().

data.frame(
  x = seq_len(6),
  id = flip_order(rep(letters[seq_len(3)], each = 2))
) %>%
  dplyr::group_by(id) %>%
  dplyr::group_keys()

Branches are arranged in increasing order with respect to the integers in tar_group.

# _targets.R
library(targets)
tar_option_set(packages = "tidyverse")

flip_order <- function(x) {
  ordered(x, levels = sort(unique(x), decreasing = TRUE))
}

list(
  tar_target(
    data,
    data.frame(
      x = seq_len(6),
      id = flip_order(rep(letters[seq_len(3)], each = 2))
    ) %>%
      group_by(id) %>%
      tar_group(),
    iteration = "group"
  ),
  tar_target(
    subsets,
    data,
    pattern = map(data),
    iteration = "list"
  )
)
tar_make()
lapply(tar_read(subsets), as.data.frame)

7.9 Batching

With dynamic branching, it is easy to create an enormous number of targets. But when the number of targets starts to exceed a couple hundred, tar_make() slows down, and graphs from tar_visnetwork() become unmanageable. If that happens to you, consider batching: group the work into a smaller number of targets, each of which performs several tasks.

Targetopia packages usually have functions that support batching for various use cases. In stantargets, tar_stan_mcmc_rep_summary() and friends automatically use batching behind the scenes. The user simply needs to select the number of batches and number of reps per batch. Each batch is a dynamic branch with multiple reps, and each rep fits the user’s model once and computes summary statistics.

In tarchetypes, tar_rep() is a general-purpose target factory for batched dynamic branching. It lets you repeat arbitrary code over multiple reps split into multiple batches. Each batch gets its own reproducible random number seed generated from the target name (as do all targets), and reps run sequentially within each batch, so the results are reproducible. A minimal sketch follows.
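
Here is a rough sketch of how tar_rep() batching might look. simulate_once() is a hypothetical placeholder for real simulation code, not part of tarchetypes.

# _targets.R
library(targets)
library(tarchetypes)
simulate_once <- function() data.frame(estimate = rnorm(1))
list(
  tar_rep(
    simulations,
    simulate_once(),
    batches = 40, # 40 dynamic branches
    reps = 25     # 25 reps per branch: 1000 reps total
  )
)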

The targets-stan repository has an example of batching implemented from scratch. The goal of the pipeline is to validate a Bayesian model by simulating thousands of datasets, analyzing each with the model, and assessing the overall accuracy of the inference. Rather than define one target per dataset or per model fit, the pipeline breaks the work into batches, where each batch produces multiple datasets or multiple analyses. Here is a version of the pipeline with 40 batches and 25 simulation reps per batch (1000 reps total in a pipeline of 82 targets).

list(
  tar_target(model_file, compile_model("stan/model.stan"), format = "file"),
  tar_target(index_batch, seq_len(40)),
  tar_target(index_sim, seq_len(25)),
  tar_target(
    data_continuous,
    purrr::map_dfr(index_sim, ~simulate_data_continuous()),
    pattern = map(index_batch)
  ),
  tar_target(
    fit_continuous,
    map_sims(data_continuous, model_file = model_file),
    pattern = map(data_continuous)
  )
)
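
After tar_make(), the 40 branches aggregate back into a single result. Assuming map_sims() returns one row of summaries per simulation rep (as the prose above suggests), the aggregate should have 40 * 25 = 1000 rows. A sketch:

results <- tar_read(fit_continuous)
nrow(results)  # expected: 40 * 25 = 1000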

  1. Slicing is always the same when we branch over an existing pattern. If we have tar_target(y, x, pattern = map(x)) and x is another pattern, then each branch of y always gets a branch of x regardless of the iteration method. Likewise, the aggregation of stems does not depend on the iteration method because every stem is already aggregated.↩︎
