4  Debugging pipelines

This chapter offers advice on debugging targets pipelines. The repository at https://github.com/wlandau/targets-debug has example R code. To practice the debugging techniques explained below, download the code and step through the interactive R scripts demo_small.R, demo_browser.R, and demo_workspace.R. The help chapter has advice on asking effective questions if you need help from a human.

4.1 Debugging in targets is different

R code is easiest to debug when it is interactive. In the R console or RStudio IDE, you have full control over the code and the objects in the environment, and you are free to dissect, tinker, and test until you find and fix the issue. However, a pipeline is deliberately non-interactive because it tries to be automated and reproducible. In targets, several layers of encapsulation separate you from the code you want to debug:

  • The pipeline runs in an external non-interactive callr::r() process where you cannot use the R console.
  • The targets in the pipeline may run on parallel workers.
  • tar_make() automatically saves output data to disk.
  • tar_make() has its own error-catching system.

Although these layers are essential for reproducibility and scale, you will need to peel them back to diagnose and solve problems. This chapter explains how.

4.2 Debugging example

The following pipeline simulates a repeated measures dataset and analyzes it with generalized least squares.

# _targets.R
library(targets)
library(tarchetypes)
tar_option_set(
  packages = c("broom", "broom.mixed", "dplyr", "nlme", "tibble", "tidyr")
)

simulate_data <- function(units) {
  options(warn = -1) # Silence warnings for this session.
  tibble(unit = seq_len(units), factor = rnorm(n = units, mean = 2)) %>%
    expand_grid(measurement = seq_len(4)) %>%
    mutate(outcome = sqrt(factor) + rnorm(n()))
}

analyze_data <- function(data) {
  gls(
    model = outcome ~ factor,
    data = data,
    correlation = corSymm(form = ~ measurement | unit),
    weights = varIdent(form = ~ 1 | measurement)
  ) %>%
    tidy(conf.int = TRUE, conf.level = 0.95)
}

list(
  tar_target(name = data, command = simulate_data(100)),
  tar_target(name = model, command = analyze_data(data))
)

This pipeline has an error.

# R console
tar_make()
#> + data dispatched
#> ✔ data completed [173ms, 4.37 kB]
#> + model dispatched
#> ✖ model errored
#> ✖ errored pipeline [281ms, 1 completed, 0 skipped]
#> Error:
#> ! Error in tar_make():
#>   missing values in object
#>   See https://books.ropensci.org/targets/debugging.html

4.3 Finish the pipeline anyway

Even if you hit an error, you can still finish the successful parts of the pipeline. The error argument of tar_option_set() and tar_target() tells each target what to do if it hits an error. For example, tar_option_set(error = "null") tells errored targets to return NULL. The output as a whole will not be correct or up to date, but the pipeline will finish so you can look at preliminary results. This is especially helpful with dynamic branching.

# _targets.R
library(targets)
library(tarchetypes)
tar_option_set(
  packages = c("broom", "broom.mixed", "dplyr", "nlme", "tibble", "tidyr"),
  error = "null" # produce a result even if the target errored out.
)

# Functions etc...
# R console
tar_make()
#> + data dispatched
#> ✔ data completed [632ms, 4.37 kB]
#> + model dispatched
#> ✖ model errored
#> ✖ errored pipeline [804ms, 1 completed, 0 skipped]

# We have a NULL placeholder for the model target.
tar_read(model)
#> NULL

# But it is not up to date.
tar_outdated()
#> [1] "model"

4.4 Error messages

Still, it is important to fix known errors. The metadata in _targets/meta/meta is a good place to start. It stores the most recent error and warning messages for each target. tar_meta() can retrieve these messages.

# R console
tar_meta(fields = error, complete_only = TRUE)
#> # A tibble: 1 × 2
#>   name  error                   
#>   <chr> <chr>                   
#> 1 model missing values in object
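
Warning messages are stored in the metadata too. A similar call retrieves them:

# R console
tar_meta(fields = warnings, complete_only = TRUE)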

It looks like missing values in the data are responsible for the error in the model target. Maybe this clue alone is enough to repair the code. If not, read on.

4.5 Debugging in functions

Most errors come from user-defined functions such as simulate_data() and analyze_data(). See if you can reproduce the error in the R console.

# R console

# Restart your R session.
rstudioapi::restartSession()

library(targets)
library(tarchetypes)

# Load packages from tar_option_set() and globals like simulate_data() and analyze_data():
tar_load_globals()

# Load the data that the target depends on.
tar_load(data)

# Run the command of the errored target.
analyze_data(data)
#> Error in `na.fail.default()`:
#>   missing values in object

If you see the same error here that you saw in the pipeline, then good! Now that you are in an interactive R session, all the usual debugging techniques and tools such as debug() and browser() can help you figure out how to fix your code, and you can exclude targets from the rest of the debugging process.

# R console
debug(analyze_data)
analyze_data(data)
#> debugging in: analyze_data(data)
#> ...
Browse[2]> anyNA(data$outcome) # Do I need to handle missing values?
#> [1] TRUE

In some cases, however, you may not see the original error:

# R console
new_data <- simulate_data(100)
analyze_data(new_data)
#> # A tibble: 2 × 7
#>   term        estimate std.error statistic  p.value conf.low conf.high
#> * <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 (Intercept)    0.595    0.152       3.92 1.05e- 4    0.297     0.892
#> 2 factor         0.362    0.0476      7.61 2.04e-13    0.269     0.455

Above, the random number generator seed in your local session is different from the seed assigned to the target in the pipeline. The dataset from the pipeline has missing values, whereas the one in the local session does not.
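
Alternatively, you can reproduce the pipeline's exact dataset by rerunning simulate_data() under the seed that targets assigned to the data target. A minimal sketch, assuming the seed was set with withr::with_seed(), which is how targets seeds ordinary (non-batched) targets:

# R console
# Continue the session above, after tar_load_globals().
seed <- tar_meta(names = data, fields = seed)$seed
pipeline_data <- withr::with_seed(seed, simulate_data(100))
anyNA(pipeline_data$outcome) # Expect the missing values seen in the pipeline.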

If you cannot reproduce the error in an interactive R session, read on.

4.6 Pause the pipeline with browser()

Sometimes, you may still need to run the pipeline to find the problem. The following trick lets you pause the pipeline and tinker with a running target interactively:

  1. Insert browser() into the function that produces the error.
  2. Restart your R session to remove detritus from memory.
  3. Call tar_make() with callr_function = NULL, use_crew = FALSE, and as_job = FALSE to run the whole pipeline in your interactive session without launching a new callr::r() process, parallel crew workers, or an RStudio job.
  4. Poke around until you find the bug.
# _targets.R
# ...
analyze_data <- function(data) {
  browser() # Pause the pipeline here.
  gls(
    model = outcome ~ factor,
    data = data,
    correlation = corSymm(form = ~ measurement | unit),
    weights = varIdent(form = ~ 1 | measurement)
  ) %>%
    tidy(conf.int = TRUE, conf.level = 0.95)
}
# ...
# R console

# Restart your R session.
rstudioapi::restartSession()

library(targets)
library(tarchetypes)

# Run the whole pipeline in your interactive R session
# (no callr process, parallel workers, or RStudio job).
tar_make(callr_function = NULL, use_crew = FALSE, as_job = FALSE)
#> + model dispatched
#> Called from: analyze_data(data)

# Tinker with the R session to see if you can reproduce the error.
Browse[1]> model <- gls(
+   model = outcome ~ factor,
+   data = data,
+   correlation = corSymm(form = ~ measurement | unit),
+   weights = varIdent(form = ~ 1 | measurement)
+ )
#> Error in `na.fail.default()`: 
#>   missing values in object

# Figure out what it would take to fix the error.
Browse[1]> model <- gls(
+   model = outcome ~ factor,
+   data = data,
+   correlation = corSymm(form = ~ measurement | unit),
+   weights = varIdent(form = ~ 1 | measurement),
+   na.action = na.omit
+ )

# Confirm that the bug is fixed.
Browse[1]> tidy(model, conf.int = TRUE, conf.level = 0.95)
#> # A tibble: 2 × 7
#>   term        estimate std.error statistic       p.value conf.low conf.high
#> * <chr>          <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl>
#> 1 (Intercept)    0.795    0.148       5.36 0.000000145      0.504     1.09 
#> 2 factor         0.275    0.0466      5.92 0.00000000717    0.184     0.367

4.7 Pause the pipeline with the debug option

It may be too tedious to comb through all targets with browser(). For example, what if the pipeline has hundreds of simulated datasets? The following pipeline simulates 100 datasets with 58 experimental units each and 100 datasets with 70 experimental units each. Each dataset is analyzed with gls(). tar_map_rep() from the tarchetypes package organizes this simulation structure and batches the replications for computational efficiency.

# _targets.R
library(targets)
library(tarchetypes)

tar_option_set(
  packages = c("broom", "broom.mixed", "dplyr", "nlme", "tibble", "tidyr")
)

simulate_data <- function(units) {
  options(warn = -1) # Silence warnings for this session.
  tibble(unit = seq_len(units), factor = rnorm(units, mean = 3)) %>%
    expand_grid(measurement = seq_len(4)) %>%
    mutate(outcome = sqrt(factor) + rnorm(n()))
}

analyze_data <- function(data) {
  gls(
    model = outcome ~ factor,
    data = data,
    correlation = corSymm(form = ~ measurement | unit),
    weights = varIdent(form = ~ 1 | measurement)
  ) %>%
    tidy(conf.int = TRUE, conf.level = 0.95)
}

simulate_and_analyze_one_dataset <- function(units) {
  data <- simulate_data(units)
  analyze_data(data)
}

list(
  tar_map_rep( # from the {tarchetypes} package
    name = analysis,
    command = simulate_and_analyze_one_dataset(units),
    values = data.frame(units = c(58, 70)), # 2 data size scenarios.
    names = all_of("units"), # The columns of values to use to name the targets.
    batches = 20, # For each scenario, divide the 100 simulations into 20 dynamic branch targets.
    reps = 5 # Each branch target (batch) runs simulate_and_analyze_one_dataset(units) 5 times.
  )
)
# R console
tar_make()
#> + analysis_batch dispatched
#> ✔ analysis_batch completed [0ms, 97 B]
#> + analysis_58 declared [20 branches]
#> ✖ analysis_58_0ea05e90e3c60147 errored                      
#> ✖ errored pipeline [810ms, 2 completed, 0 skipped]
#> Error:
#> ! Error in tar_make():
#>   missing values in object
#>   See https://books.ropensci.org/targets/debugging.html

Remember, if you just want to see the results that succeeded, run the pipeline with error = "null" in tar_option_set(). This temporary workaround is especially helpful with so many simulations.

# _targets.R
library(targets)
library(tarchetypes)

tar_option_set(
  packages = c("broom", "broom.mixed", "dplyr", "nlme", "tibble", "tidyr"),
  error = "null"
)

# Functions etc...
# R console
tar_make()
#> + analysis_batch dispatched
#> ✔ analysis_batch completed [0ms, 97 B]
#> + analysis_58 declared [20 branches]
#> ✖ analysis_58_0ea05e90e3c60147 errored 
#> ✖ analysis_58_6b77f315fc663bee errored 
#> ✖ analysis_58_cef9803445fd2742 errored 
#> ✖ analysis_58_66e82c295e9c8fb7 errored 
#> ✖ analysis_58_62ba11e34d87b6d0 errored 
#> ✖ analysis_58_0aa7ec9d3a482b98 errored 
#> + analysis_70 declared [20 branches]  
#> ✖ analysis_70_0ea05e90e3c60147 errored 
#> ✖ analysis_70_e693a7e1ba8177f8 errored  
#> ✖ analysis_70_cef9803445fd2742 errored  
#> ✖ analysis_70_66e82c295e9c8fb7 errored
#> ✖ analysis_70_6319f7f8f2676866 errored 
#> ✖ analysis_70_ca6bd7e8ae4fc65f errored
#> ✖ analysis_70_62ba11e34d87b6d0 errored 
#> ✖ analysis_70_1ae85d59f0c9cedf errored 
#> ✖ analysis_70_d1f66671ad88fe8e errored 
#> + analysis_58_combine dispatched   
#> ✔ analysis_58_combine completed [0ms, 7.26 kB] 
#> + analysis_70_combine dispatched            
#> ✔ analysis_70_combine completed [0ms, 5.79 kB]  
#> + analysis dispatched                     
#> ✔ analysis completed [1ms, 12.69 kB]  
#> ✖ errored pipeline [11.8s, 29 completed, 0 skipped]

# Read the simulations that succeeded.
tar_read(analysis)
#> # A tibble: 250 × 12
#>    term        estimate std.error statistic  p.value conf.low conf.high units
#>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <dbl>
#>  1 (Intercept)    0.775    0.167       4.64 5.78e- 6   0.448      1.10     58
#>  2 factor         0.285    0.0525      5.43 1.42e- 7   0.182      0.388    58
#>  3 (Intercept)    0.892    0.161       5.53 8.75e- 8   0.576      1.21     58
#>  4 factor         0.262    0.0477      5.48 1.09e- 7   0.168      0.355    58
#>  5 (Intercept)    1.04     0.186       5.61 5.93e- 8   0.676      1.40     58
#>  6 factor         0.203    0.0607      3.34 9.72e- 4   0.0839     0.322    58
#>  7 (Intercept)    0.665    0.163       4.08 6.23e- 5   0.346      0.985    58
#>  8 factor         0.350    0.0508      6.88 5.75e-11   0.250      0.449    58
#>  9 (Intercept)    1.03     0.162       6.36 1.07e- 9   0.713      1.35     58
#> 10 factor         0.244    0.0514      4.74 3.77e- 6   0.143      0.345    58
#> # ℹ 240 more rows
#> # ℹ 4 more variables: tar_batch <int>, tar_rep <int>, tar_seed <int>,
#> #   tar_group <int>
#> # ℹ Use `print(n = ...)` to see more rows
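
Each errored branch also records its message in the metadata, so the tar_meta() technique from earlier pinpoints exactly which branches failed:

# R console
tar_meta(fields = error, complete_only = TRUE)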

Now let’s seriously debug this pipeline. If each call to simulate_and_analyze_one_dataset() takes a long time to run, then the first step is to set one rep per batch in tar_map_rep() while keeping the total number of reps the same. In other words, increase batches from 20 to 100 and decrease reps from 5 to 1. Also remove the units = 70 scenario because we can reproduce the error without it.

# _targets.R
library(targets)
library(tarchetypes)
tar_option_set(
  packages = c("broom", "broom.mixed", "dplyr", "nlme", "tibble", "tidyr"),
)

# Functions...

list(
  tar_map_rep(
    name = analysis,
    command = simulate_and_analyze_one_dataset(units),
    values = data.frame(units = 58), # Remove the units = 70 scenario.
    names = all_of("units"),
    batches = 100, # 100 batches now
    reps = 1 # 1 rep per batch now
  )
)
# R console
tar_make()
#> + analysis_batch dispatched
#> ✔ analysis_batch completed [0ms, 97 B]
#> + analysis_58 declared [100 branches]
#> ✖ analysis_58_8edfc70f9a7feaf4 errored                       
#> ✖ errored pipeline [1.1s, 8 completed, 0 skipped]          
#> Error:
#> ! Error in tar_make():
#>   missing values in object
#>   See https://books.ropensci.org/targets/debugging.html

8 targets ran successfully, and target analysis_58_8edfc70f9a7feaf4 hit an error. Let’s interactively debug analysis_58_8edfc70f9a7feaf4 without interfering with any other targets:

  1. Set the debug option to "analysis_58_8edfc70f9a7feaf4" in tar_option_set().
  2. Optional: set cue = tar_cue(mode = "never") in tar_option_set() to force skip all targets except:
    • analysis_58_8edfc70f9a7feaf4 and other targets in the debug option.
    • targets that do not already exist in the metadata.
    • targets that set their own cues.
  3. Restart your R session to remove detritus from memory.
  4. Run tar_make(callr_function = NULL, use_crew = FALSE, as_job = FALSE).
# _targets.R
library(targets)
library(tarchetypes)

tar_option_set(
  packages = c("broom", "broom.mixed", "dplyr", "nlme", "tibble", "tidyr"),
  debug = "analysis_58_8edfc70f9a7feaf4", # Set the target you want to debug.
  cue = tar_cue(mode = "never") # Force skip non-debugging outdated targets.
)

# Functions etc...
# R console

# Restart your R session.
rstudioapi::restartSession()

library(targets)

# Run the pipeline in your interactive R session
# (no callr process, no parallel crew workers, no RStudio job)
tar_make(callr_function = NULL, use_crew = FALSE, as_job = FALSE)
#> + analysis_58 declared [100 branches]
#> → You are now running an interactive debugger in target analysis_58_8edfc70f9a7feaf4.
#> You can enter code and print objects as with the normal R console.
#> How to use: https://adv-r.hadley.nz/debugging.html#browser
#> 
#> → The debugger is poised to run the command of target analysis_58_8edfc70f9a7feaf4:
#> 
#>     tarchetypes::tar_rep_run(command = tarchetypes::tar_append_static_values(object = simulate_and_analyze_one_dataset(58), 
#>     values = list(units = 58)), batch = analysis_batch, reps = 1, 
#>     rep_workers = 1L, iteration = "vector")
#>
#> → Tip: run debug(your_function) and then enter "c"
#> to move the debugger inside your_function(),
#> where your_function() is called from the command of target
#> analysis_58_8edfc70f9a7feaf4.
#> Then debug the function as you would normally (without `targets`).
#> Called from: eval(expr = expr, envir = envir)
Browse[1]>

At this point, we are in an interactive debugger again. Only this time, we quickly skipped straight to the target we want to debug. We can follow the advice in the prompt above, or we can tinker in other ways.

# R console
# Jump to the function we want to debug.
Browse[1]> debug(analyze_data)
Browse[1]> c # Continue to the next breakpoint.
#> debugging in: analyze_data(data)

# Tinker with the R session to see if you can reproduce the error.
Browse[2]> model <- gls(
+   model = outcome ~ factor,
+   data = data,
+   correlation = corSymm(form = ~ measurement | unit),
+   weights = varIdent(form = ~ 1 | measurement)
+ )
#> Error in `na.fail.default()`:
#>   missing values in object

# Figure out what it would take to fix the error.
Browse[2]> model <- gls(
+   model = outcome ~ factor,
+   data = data,
+   correlation = corSymm(form = ~ measurement | unit),
+   weights = varIdent(form = ~ 1 | measurement),
+   na.action = na.omit
+ )

# Confirm that the bug is fixed.
Browse[2]> tidy(model, conf.int = TRUE, conf.level = 0.95)
#> # A tibble: 2 × 7
#>   term        estimate std.error statistic   p.value conf.low conf.high
#> * <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
#> 1 (Intercept)    0.925    0.205       4.51 0.0000103    0.523     1.33 
#> 2 factor         0.279    0.0646      4.32 0.0000230    0.153     0.406

4.8 Workspaces

A workspace is a special file that helps locally reconstruct the environment of a target outside the pipeline. By default, every target that throws an error generates a workspace.
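
This behavior is configurable: the workspace_on_error argument of tar_option_set() toggles workspaces on error, and the separate workspaces argument selects targets that always save one, error or not. A sketch:

# _targets.R snippet
tar_option_set(
  workspace_on_error = TRUE, # Save a workspace when a target errors (the default).
  workspaces = "analysis" # Hypothetical: always save a workspace for a target named "analysis".
)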

# _targets.R
library(targets)
library(tarchetypes)

tar_option_set(
  packages = c("broom", "broom.mixed", "dplyr", "nlme", "tibble", "tidyr")
)

simulate_data <- function(units) {
  options(warn = -1) # Silence warnings for this session.
  tibble(unit = seq_len(units), factor = rnorm(units, mean = 3)) %>%
    expand_grid(measurement = seq_len(4)) %>%
    mutate(outcome = sqrt(factor) + rnorm(n()))
}

analyze_data <- function(data) {
  gls(
    model = outcome ~ factor,
    data = data,
    correlation = corSymm(form = ~ measurement | unit),
    weights = varIdent(form = ~ 1 | measurement)
  ) %>%
    tidy(conf.int = TRUE, conf.level = 0.95)
}

list(
  tar_target(rep, seq_len(100)),
  tar_target(data, simulate_data(100), pattern = map(rep)),
  tar_target(analysis, analyze_data(data), pattern = map(data))
)
# R console
tar_make()
#> + rep dispatched
#> ✔ rep completed [145ms, 97 B]
#> + data declared [100 branches]
#> ✔ data completed [410ms, 442.17 kB]                           
#> + analysis declared [100 branches]                            
#> ✖ analysis_9f60c6e05a6c5414 errored                            
#> ✖ errored pipeline [1.2s, 101 completed, 0 skipped]            
#> Error:
#> ! Error in tar_make():
#>   missing values in object
#>   See https://books.ropensci.org/targets/debugging.html

What went wrong with target analysis_9f60c6e05a6c5414? To find out, we load the workspace in an interactive session.

# R console
# List the available workspaces.
tar_workspaces()
#> [1] "analysis_9f60c6e05a6c5414"

# Load the workspace.
tar_workspace(analysis_9f60c6e05a6c5414)

At this point, the global objects, functions, and upstream dependencies of target analysis_9f60c6e05a6c5414 are in memory. In addition, the target’s original random number generator seed is set.
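
That seed lives in the metadata, and you can inspect it directly:

# R console
tar_meta(names = analysis_9f60c6e05a6c5414, fields = seed)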

# R console
ls()
#> [1] "analyze_data"  "data"          "simulate_data"

With the data and functions in hand, you can reproduce the error locally.

# R console
analyze_data(data)
#> Error in `na.fail.default()`:
#>   missing values in object

For more assistance, you can load the traceback from the workspace file.

tar_traceback(analysis_9f60c6e05a6c5414)
#>  [1] "analyze_data(data)"                                                           
#>  [2] "gls(model = outcome ~ factor, data = data, correlation = corSymm(form = ..."  
#>  [3] "tidy(., conf.int = TRUE, conf.level = 0.95)"                                  
#>  [4] "gls(model = outcome ~ factor, data = data, correlation = corSymm(form = ..."  
#>  [5] "do.call(model.frame, mfArgs)"                                                 
#>  [6] "(function (formula, ...)  UseMethod(\"model.frame\"))(formula = ~measureme..."
#>  [7] "model.frame.default(formula = ~measurement + unit + outcome +      facto..."  
#>  [8] "(function (object, ...)  UseMethod(\"na.fail\"))(structure(list(measuremen..."
#>  [9] "na.fail.default(structure(list(measurement = c(1L, 2L, 3L, 4L,  1L, 2L, ..."  
#> [10] "stop(\"missing values in object\")"                                           
#> [11] ".handleSimpleError(function (condition)  {     state$error <- build_mess..."  
#> [12] "h(simpleError(msg, call))"    

  1. tar_meta(fields = warnings, complete_only = TRUE) retrieves warnings.

  2. You can fix the bug by either removing the missing values from the dataset or by setting na.action = na.omit in gls().

  3. With callr_function = NULL, a messy local R environment can accidentally change the functions and objects that a target depends on, which can invalidate those targets and erase hard-earned results that were previously correct. This is why targets uses callr in the first place, and it is why callr_function = NULL is for debugging only. If you do need callr_function = NULL, please restart your R session first.

  4. In tarchetypes version 0.7.1.9000 and above, this re-batching will not change the random number generator seed assigned to each call to simulate_and_analyze_one_dataset().

  5. See the workspace_on_error argument of tar_option_set().

  6. tar_option_set() also has a workspaces argument to let you choose which targets save workspace files, regardless of whether they hit errors.

  7. You can retrieve this seed with tar_meta(names = analysis_9f60c6e05a6c5414, fields = seed). In the pipeline, targets sets this seed with withr::with_seed() just before running the target. However, other functions or target factories may set their own seeds. For example, tarchetypes::tar_map_rep() sets its own target seeds so they are resilient to re-batching. For more details on seeds, see the documentation of the seed argument of tar_option_set().
  7. You can retrieve this seed with tar_meta(names = analysis_9f60c6e05a6c5414, fields = seed). In the pipeline, targets sets this seed with withr::with_seed() just before running the target. However, other functions or target factories may set their own seeds. For example, tarchetypes::tar_map_rep() sets its own target seeds so they are resilient to re-batching. For more details on seeds, see the documentation of the seed argument of tar_option_set().↩︎