16 Static branching

16.1 Branching

Performance

Branched pipelines can be computationally demanding. See the performance chapter for options, settings, and other choices to optimize and monitor large pipelines.

Sometimes, a pipeline contains more targets than a user can comfortably type by hand. For projects with hundreds of targets, branching can make the _targets.R file more concise and easier to read and maintain.

targets supports two types of branching: dynamic branching and static branching. Some projects are better suited to dynamic branching, while others benefit more from static branching or a combination of both. Here is a short list of tradeoffs.

Dynamic                                   | Static
Pipeline creates new targets at runtime.  | All targets defined in advance.
Cryptic target names.                     | Friendly target names.
Scales to hundreds of branches.           | Does not scale as easily for tar_visnetwork() etc.
No metaprogramming required.              | Familiarity with metaprogramming is helpful.

16.2 When to use static branching

Static branching is the act of defining a group of targets in bulk before the pipeline starts. Whereas dynamic branching uses last-minute dependency data to define the branches, static branching uses metaprogramming to modify the code of the pipeline up front. Whereas dynamic branching excels at creating a large number of very similar targets, static branching is most useful for a smaller number of heterogeneous targets. Some users find it more convenient because they can use tar_manifest() and tar_visnetwork() to check the correctness of static branching before launching the pipeline.
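As a rough illustration of what "modifying the code of the pipeline up front" means, the base R sketch below plugs values into a command template before anything runs. It is not how targets or tarchetypes are implemented. The names method1, method2, method_function, and data_source are placeholders chosen for illustration.

```r
# A command template with placeholders, captured as an unevaluated expression.
template <- quote(method_function(data_source, reps = 10))

# One element per branch: a symbol for the function and a string for the data.
values <- list(
  method_function = list(as.symbol("method1"), as.symbol("method2")),
  data_source = list("NIH", "NIAID")
)

# For each branch, pull out its values and substitute them into the template.
commands <- lapply(seq_along(values$data_source), function(i) {
  row <- lapply(values, `[[`, i)
  do.call(substitute, list(template, row))
})

# The results are ordinary R expressions that can be inspected as text
# without running any of the (hypothetical) method functions.
command_text <- vapply(commands, deparse, character(1))
print(command_text)
```

Because the commands exist as code before the pipeline starts, they can be reviewed in full ahead of time, which is the property that tar_manifest() exposes for real pipelines.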

16.3 Map

tar_map() from the tarchetypes package creates copies of existing target objects, where each new command is a variation on the original. In the example below, we have a data analysis workflow that iterates over datasets and analysis methods. The values data frame has the operational parameters of each data analysis, and tar_map() creates one new copy of each target per row.

# _targets.R file:
library(targets)
library(tarchetypes)
library(tibble)
values <- tibble(
  method_function = rlang::syms(c("method1", "method2")),
  data_source = c("NIH", "NIAID")
)
targets <- tar_map(
  values = values,
  tar_target(analysis, method_function(data_source, reps = 10)),
  tar_target(summary, summarize_analysis(analysis, data_source))
)
list(targets)
# R console
tar_manifest()
#> # A tibble: 4 × 3
#>   name                   command                                     description
#>   <chr>                  <chr>                                       <chr>      
#> 1 analysis_method2_NIAID "method2(\"NIAID\", reps = 10)"             method2 NI…
#> 2 analysis_method1_NIH   "method1(\"NIH\", reps = 10)"               method1 NIH
#> 3 summary_method2_NIAID  "summarize_analysis(analysis_method2_NIAID… method2 NI…
#> 4 summary_method1_NIH    "summarize_analysis(analysis_method1_NIH, … method1 NIH
tar_visnetwork(targets_only = TRUE)

For shorter target names, use the names argument of tar_map(). For more combinations of settings, use tidyr::expand_grid() to construct values.

# _targets.R file:
library(targets)
library(tarchetypes)
library(tidyr)
values <- expand_grid( # Use all possible combinations of input settings.
  method_function = rlang::syms(c("method1", "method2")),
  data_source = c("NIH", "NIAID")
)
targets <- tar_map(
  values = values,
  names = "data_source", # Select columns from `values` for target names.
  tar_target(analysis, method_function(data_source, reps = 10)),
  tar_target(summary, summarize_analysis(analysis, data_source))
)
list(targets)

It is especially important to run tar_manifest() to check that tar_map() generates the right R code for the targets. Sometimes, the metaprogramming may not produce the desired commands on your first try.

tar_manifest()
#> # A tibble: 8 × 3
#>   name             command                                           description
#>   <chr>            <chr>                                             <chr>      
#> 1 analysis_NIAID_1 "method2(\"NIAID\", reps = 10)"                   method2 NI…
#> 2 analysis_NIAID   "method1(\"NIAID\", reps = 10)"                   method1 NI…
#> 3 analysis_NIH_1   "method2(\"NIH\", reps = 10)"                     method2 NIH
#> 4 analysis_NIH     "method1(\"NIH\", reps = 10)"                     method1 NIH
#> 5 summary_NIAID_1  "summarize_analysis(analysis_NIAID_1, \"NIAID\")" method2 NI…
#> 6 summary_NIAID    "summarize_analysis(analysis_NIAID, \"NIAID\")"   method1 NI…
#> 7 summary_NIH_1    "summarize_analysis(analysis_NIH_1, \"NIH\")"     method2 NIH
#> 8 summary_NIH      "summarize_analysis(analysis_NIH, \"NIH\")"       method1 NIH

Also check the dependency graph to ensure the pipeline is properly connected. If tar_map() generates a lot of targets, the graph may render slowly or look too cumbersome. If that happens, choose a small subset of rows of values for tar_map() and then try again on the smaller pipeline.

# You may need to zoom out on this interactive graph to see all 8 targets.
tar_visnetwork(targets_only = TRUE)

16.3.1 Limitations

tar_map() generates R expressions to serve as commands in other targets. When it substitutes an element from values, it needs a way to transform the element into valid R code. For elements that are even slightly complicated, especially nested data frames and objects with attributes, this is not always possible. For these complicated elements, it is best to use quote() to work with the underlying expressions instead of the objects themselves. See https://github.com/ropensci/tarchetypes/discussions/105 for an example.
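The difference between substituting an object and substituting a quoted expression can be seen in a small base R sketch. This only mimics the general idea with substitute() and is not the actual tar_map() implementation. summarize_analysis is used as a hypothetical function name and is never called.

```r
# A command template with a placeholder called `data`.
template <- quote(summarize_analysis(data))

# Substituting an actual data frame embeds the whole object in the command.
with_object <- do.call(
  substitute,
  list(template, list(data = head(mtcars, 2)))
)

# Substituting a quoted expression keeps the command short and readable.
with_quote <- do.call(
  substitute,
  list(template, list(data = quote(head(mtcars, 2))))
)

# Compare the two commands as text.
object_text <- paste(deparse(with_object), collapse = "")
quote_text <- paste(deparse(with_quote), collapse = "")
print(nchar(object_text))
print(quote_text)
```

The quoted version stays a compact call to head(), while the object version has to spell out every column and attribute of the data frame as code.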

16.4 Dynamic-within-static branching

You can even combine static and dynamic branching. The static tar_map() is an excellent outer layer on top of targets with patterns. The following is a sketch of a pipeline that runs each of two data analysis methods 10 times, once per random seed. Static branching iterates over the method functions, while dynamic branching iterates over the seeds. tar_map() creates new patterns as well as new commands, so below, the summary targets map over the analysis targets both statically and dynamically.

# _targets.R file:
library(targets)
library(tarchetypes)
library(tibble)
random_seed_target <- tar_target(random_seed, seq_len(10))
targets <- tar_map(
  values = tibble(method_function = rlang::syms(c("method1", "method2"))),
  tar_target(
    analysis,
    method_function("NIH", seed = random_seed),
    pattern = map(random_seed)
  ),
  tar_target(
    summary,
    summarize_analysis(analysis),
    pattern = map(analysis)
  )
)
list(random_seed_target, targets)
# R console
tar_manifest()
#> # A tibble: 5 × 4
#>   name             command                                pattern    description
#>   <chr>            <chr>                                  <chr>      <chr>      
#> 1 random_seed      "seq_len(10)"                          <NA>       <NA>       
#> 2 analysis_method1 "method1(\"NIH\", seed = random_seed)" map(rando… method1    
#> 3 analysis_method2 "method2(\"NIH\", seed = random_seed)" map(rando… method2    
#> 4 summary_method1  "summarize_analysis(analysis_method1)" map(analy… method1    
#> 5 summary_method2  "summarize_analysis(analysis_method2)" map(analy… method2
tar_visnetwork(targets_only = TRUE)

16.5 Combine

tar_combine() from the tarchetypes package creates a new target to aggregate the results of upstream targets. In the simple example below, the combined target aggregates the rows returned from two other targets.

# _targets.R file:
library(targets)
library(tarchetypes)
library(tibble)
options(crayon.enabled = FALSE)
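# Target names match the manifest below and avoid masking base::head() and base::tail().
target1 <- tar_target(head_mtcars, head(mtcars, 1))
target2 <- tar_target(tail_mtcars, tail(mtcars, 1))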
target3 <- tar_combine(combined_target, target1, target2)
list(target1, target2, target3)
# R console
tar_manifest()
#> # A tibble: 3 × 2
#>   name            command                                                       
#>   <chr>           <chr>                                                         
#> 1 head_mtcars     head(mtcars, 1)                                               
#> 2 tail_mtcars     tail(mtcars, 1)                                               
#> 3 combined_target vctrs::vec_c(head_mtcars = head_mtcars, tail_mtcars = tail_mt…
tar_visnetwork(targets_only = TRUE)
tar_make()
#> ▶ dispatched target head_mtcars
#> ● completed target head_mtcars [0.001 seconds, 215 bytes]
#> ▶ dispatched target tail_mtcars
#> ● completed target tail_mtcars [0 seconds, 221 bytes]
#> ▶ dispatched target combined_target
#> ● completed target combined_target [0 seconds, 276 bytes]
#> ▶ ended pipeline [0.063 seconds]
tar_read(combined_target)
#>             mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> Mazda RX4  21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
#> Volvo 142E 21.4   4  121 109 4.11 2.78 18.60  1  1    4    2

To use tar_combine() and tar_map() together in more complicated situations, you may need to supply unlist = FALSE to tar_map(). That way, tar_map() will return a nested list of target objects, and you can combine the ones you want. The pipeline below extends the previous dynamic-within-static example by combining just the summaries, omitting the analyses from tar_combine(). Also note the use of bind_rows(!!!.x) below. This is how you supply custom code to combine the return values of other targets. .x is a placeholder for the return values, and !!! is the “unquote-splice” operator from the rlang package.

# _targets.R file:
library(targets)
library(tarchetypes)
library(tibble)
# The variable name differs from the target name because targets and
# global objects must have unique names.
random_seed_target <- tar_target(random_seed, seq_len(10))
mapped <- tar_map(
  unlist = FALSE, # Return a nested list from tar_map()
  values = tibble(method_function = rlang::syms(c("method1", "method2"))),
  tar_target(
    analysis,
    method_function("NIH", seed = random_seed),
    pattern = map(random_seed)
  ),
  tar_target(
    summary,
    summarize_analysis(analysis),
    pattern = map(analysis)
  )
)
combined <- tar_combine(
  combined_summaries,
  mapped[["summary"]],
  command = dplyr::bind_rows(!!!.x, .id = "method")
)
list(random_seed_target, mapped, combined)
# R console
tar_manifest()
#> # A tibble: 6 × 4
#>   name               command                                 pattern description
#>   <chr>              <chr>                                   <chr>   <chr>      
#> 1 random_seed        "seq_len(10)"                           <NA>    <NA>       
#> 2 analysis_method1   "method1(\"NIH\", seed = random_seed)"  map(ra… method1    
#> 3 analysis_method2   "method2(\"NIH\", seed = random_seed)"  map(ra… method2    
#> 4 summary_method1    "summarize_analysis(analysis_method1)"  map(an… method1    
#> 5 summary_method2    "summarize_analysis(analysis_method2)"  map(an… method2    
#> 6 combined_summaries "dplyr::bind_rows(summary_method1 = su… <NA>    <NA>
tar_visnetwork(targets_only = TRUE)
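To see what the unquote-splice operator does on its own, the sketch below builds a similar command with rlang::expr(). It only illustrates !!! and makes no claim about how tar_combine() works internally. The symbol names are copied from the manifest above, and dplyr does not need to be installed because the expression is never evaluated.

```r
library(rlang)

# Symbols standing in for the upstream summary targets.
upstream <- syms(c("summary_method1", "summary_method2"))
names(upstream) <- c("summary_method1", "summary_method2")

# !!! splices each element of the list into the call as its own named argument.
command <- expr(dplyr::bind_rows(!!!upstream, .id = "method"))

# View the resulting command as a single string.
command_text <- paste(deparse(command, width.cutoff = 500L), collapse = "")
print(command_text)
```

The names on the spliced list become argument names in the call, which is what lets .id = "method" label each row with the target it came from.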

16.6 Metaprogramming

Custom metaprogramming is a more flexible alternative to tar_map() and tar_combine(). tar_eval() from tarchetypes accepts an arbitrary expression and iteratively plugs in symbols. Below, we use it to branch over datasets.

# _targets.R
library(rlang)
library(targets)
library(tarchetypes)
string <- c("gapminder", "who", "imf")
symbol <- syms(string)
tar_eval(
  tar_target(symbol, get_data(string)),
  values = list(string = string, symbol = symbol)
)

tar_eval() has fewer guardrails than tar_map() or tar_combine(), so tar_manifest() is especially important for checking the correctness of your metaprogramming.

tar_manifest(fields = command)
#> # A tibble: 3 × 2
#>   name      command                  
#>   <chr>     <chr>                    
#> 1 imf       "get_data(\"imf\")"      
#> 2 gapminder "get_data(\"gapminder\")"
#> 3 who       "get_data(\"who\")"
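One possible way to preview the substitution without creating any target objects is tar_sub() from tarchetypes, which returns the substituted expressions unevaluated. The sketch below reuses the values from the example above and assumes tar_sub() substitutes them in the same way as tar_eval(). get_data remains a hypothetical function and is never called.

```r
library(rlang)
library(tarchetypes)

string <- c("gapminder", "who", "imf")
symbol <- syms(string)

# tar_sub() plugs in the values but returns unevaluated expressions
# instead of target objects.
expressions <- tar_sub(
  tar_target(symbol, get_data(string)),
  values = list(string = string, symbol = symbol)
)

# Convert each expression to text for inspection.
expression_text <- vapply(expressions, deparse, character(1))
print(expression_text)
```

Inspecting the expressions this way complements tar_manifest(): it shows the full tar_target() calls rather than only the resulting names and commands.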

16.7 Hooks

Hooks are supported in tarchetypes version 0.2.0 and above. They let you prepend or wrap code in multiple targets at a time. For example, tar_hook_before() is a robust way to invoke the conflicted package to resolve namespace conflicts: it works with distributed computing and does not require a project-level .Rprofile file.

# _targets.R file
library(targets) # Provides tar_target() and tar_option_set().
library(tarchetypes)
library(magrittr)
tar_option_set(packages = c("conflicted", "dplyr"))
source("R/functions.R")
list(
  tar_target(data, get_time_series_data()),
  tar_target(analysis1, analyze_months(data)),
  tar_target(analysis2, analyze_weeks(data))
) %>%
  tar_hook_before(
    hook = conflicted_prefer("filter", "dplyr"),
    names = starts_with("analysis")
  )
# R console
targets::tar_manifest(fields = command)
#> # A tibble: 3 × 2
#>   name      command                                                             
#>   <chr>     <chr>                                                               
#> 1 data      "get_time_series_data()"                                            
#> 2 analysis1 "{\n     conflicted_prefer(\"filter\", \"dplyr\")\n     analyze_mon…
#> 3 analysis2 "{\n     conflicted_prefer(\"filter\", \"dplyr\")\n     analyze_wee…

Similarly, tar_hook_outer() wraps expressions around target commands, and tar_hook_inner() wraps expressions around target dependencies. These hooks could potentially help encrypt targets before storage in _targets/ and decrypt them when downstream targets load them, as demonstrated in the sketch below.

Data security is the sole responsibility of the user and not the responsibility of targets, tarchetypes, or related pipeline packages. You as the user are responsible for validating your own target specifications and custom code and applying additional security precautions as appropriate for the situation.

# _targets.R file
library(targets) # Provides tar_target().
library(tarchetypes)
library(magrittr)
list(
  tar_target(data1, get_data1()),
  tar_target(data2, get_data2()),
  tar_target(analysis, analyze(data1, data2))
) %>%
  tar_hook_outer(encrypt(.x, threads = 2)) %>%
  tar_hook_inner(decrypt(.x))
# R console
targets::tar_manifest(fields = command)
#> # A tibble: 3 × 2
#>   name     command                                                      
#>   <chr>    <chr>                                                        
#> 1 data1    encrypt(get_data1(), threads = 2)                            
#> 2 data2    encrypt(get_data2(), threads = 2)                            
#> 3 analysis encrypt(analyze(decrypt(data1), decrypt(data2)), threads = 2)
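The effect of the two hooks on the analysis command can be imitated with base R substitution. This is only a sketch of the .x placeholder idea, not the tarchetypes implementation. encrypt, decrypt, and analyze are the same hypothetical functions as above and are never called.

```r
# The command of the analysis target before any hooks are applied.
command <- quote(analyze(data1, data2))

# Outer hook: replace .x with the whole command.
outer_hook <- quote(encrypt(.x, threads = 2))
command <- do.call(substitute, list(outer_hook, list(.x = command)))

# Inner hook: replace each upstream target name with the hook expression
# wrapped around that name.
inner_hook <- quote(decrypt(.x))
dependencies <- c("data1", "data2")
wrapped <- lapply(dependencies, function(name) {
  do.call(substitute, list(inner_hook, list(.x = as.symbol(name))))
})
names(wrapped) <- dependencies
command <- do.call(substitute, list(command, wrapped))

# View the final command as a single string.
command_text <- paste(deparse(command, width.cutoff = 500L), collapse = "")
print(command_text)
```

The result matches the analysis row of the manifest above: dependencies are decrypted on the way in, and the return value is encrypted on the way out.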