9  Pseudo-random numbers

Package versions

This chapter was written for targets >= 1.3.2.9001 and tarchetypes >= 0.7.8.9001, which use targets::tar_seed_create() and include additional safeguards for the statistical independence of pseudo-random numbers.

This chapter explains the behaviors, limitations, and trade-offs of targets with respect to pseudo-random number generation. It assumes basic familiarity with pseudo-random numbers, especially base R functions like sample() and set.seed().

9.1 Overview

A targets pipeline may run stochastic methods, and targets tries to ensure that the results are repeatable and correct. There are two major statistical challenges:

  1. Reproducibility: different runs of the same pipeline should produce the same results, even if a target runs a stochastic function like rnorm().
  2. Independence: pseudo-random numbers should behave like independent random samples. Pseudo-random number sequences in different targets should overlap as little as possible.
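
As a minimal base R illustration of these two goals (not specific to targets), resetting the same seed reproduces a draw exactly, while a different seed starts a different stream. The variable names are illustrative only, and note that non-identical draws are necessary but not sufficient for statistical independence.

```r
# Reproducibility: the same seed always yields the same draws.
set.seed(3)
first <- rnorm(n = 5)
set.seed(3)
second <- rnorm(n = 5)
identical(first, second)

# A different seed starts a different pseudo-random stream.
set.seed(4)
third <- rnorm(n = 5)
identical(first, third)
```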

9.2 Reproducibility

Each target runs with its own deterministic seed. The target seed is a function of:

  1. Its name, and
  2. The pipeline-level seed from tar_option_get("seed").

Consider the following simple pipeline.

# _targets.R file:
library(targets)
tar_option_set(seed = 3)
list(
  tar_target(name = target_1, command = runif(n = 1)),
  tar_target(name = target_2, command = runif(n = 1)),
  tar_target(name = target_3, command = runif(n = 1))
)

The seed of the target_1 target is:

expected_seed <- tar_seed_create("target_1", global_seed = 3)

expected_seed
#> [1] 1432713270

And the target runs the equivalent of:

withr::with_seed(seed = expected_seed, code = runif(n = 1))
#> [1] 0.1995242
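
withr::with_seed() runs code under a temporary seed and then restores the previous generator state. A rough base R sketch of that idea follows. The helper with_local_seed() is hypothetical and written only for illustration; it is not the implementation used by withr or targets, and it ignores details such as the RNG kind.

```r
# Hypothetical helper: evaluate `code` under a temporary seed,
# then put the caller's RNG state back the way it was.
with_local_seed <- function(seed, code) {
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_seed <- if (had_seed) get(".Random.seed", envir = globalenv()) else NULL
  on.exit({
    if (had_seed) {
      # Restore the state that existed before the call.
      assign(".Random.seed", old_seed, envir = globalenv())
    } else if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      # No state existed before, so remove the one created here.
      rm(list = ".Random.seed", envir = globalenv())
    }
  })
  set.seed(seed)
  force(code) # lazy evaluation: the code runs here, after set.seed()
}

# The outer stream is unaffected by the seeded call in the middle.
set.seed(1)
inside <- with_local_seed(42, runif(n = 1))
after <- runif(n = 1)

set.seed(1)
reference <- runif(n = 1)
identical(after, reference)
```

The final comparison checks that the draw taken after the seeded call matches the first draw of a fresh set.seed(1) stream, which is the sense in which the seeded code leaves the surrounding session undisturbed.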

We can run the pipeline with tar_make(), view the seed with tar_meta(), and view the result with tar_read() (see footnote 1).

tar_make()
#> ▶ dispatched target target_1
#> ● completed target target_1 [0.001 seconds]
#> ▶ dispatched target target_2
#> ● completed target target_2 [0 seconds]
#> ▶ dispatched target target_3
#> ● completed target target_3 [0 seconds]
#> ▶ ended pipeline [0.076 seconds]

tar_meta(names = any_of("target_1"), fields = any_of("seed"))
#> # A tibble: 1 × 2
#>   name           seed
#>   <chr>         <int>
#> 1 target_1 1432713270

tar_read(target_1)
#> [1] 0.1995242

The seed argument of tar_option_set() offers flexibility:

  1. If you set seed to a different integer, you will get a different (but still reproducible) set of stochastic results.
  2. If you set seed to NA, then targets will not set a seed at all. Different runs of the pipeline will generally produce different results, and those results will not be reproducible.

In case (2), every target always appears outdated in tar_make() and tar_outdated(). To allow such targets to count as up to date, disable the seed cue by setting cue = tar_cue(seed = FALSE) in tar_target() or tar_option_set().
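
For example, a pipeline that deliberately opts out of seeding while suppressing the seed cue might look like the sketch below. The target name draw is hypothetical, and the cue is set globally here only for brevity; it could instead be set per target in tar_target().

```r
# _targets.R file (sketch):
library(targets)
# seed = NA: targets sets no seed, so results are not reproducible.
# tar_cue(seed = FALSE): the seed alone does not mark targets outdated.
tar_option_set(seed = NA, cue = tar_cue(seed = FALSE))
list(
  tar_target(name = draw, command = runif(n = 1))
)
```

Under this configuration, a rerun of draw should still be expected to return a different value each time, since nothing fixes the generator state.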

9.3 Independence

Within a pipeline, different targets are guaranteed to have different names. Barring the vanishingly small chance of hash collisions in tar_seed_create(), that means they should also have different seeds.

tar_meta(targets_only = TRUE, fields = any_of("seed"))
#> # A tibble: 3 × 2
#>   name           seed
#>   <chr>         <int>
#> 1 target_1 1432713270
#> 2 target_2 -740972651
#> 3 target_3  227569066

Thus, different targets should have non-identical sequences of pseudo-random numbers.

tar_read(target_1)
#> [1] 0.1995242
tar_read(target_2)
#> [1] 0.2481659
tar_read(target_3)
#> [1] 0.5339015

In theory, these parallel random number generator streams could overlap and produce statistically correlated results. However, the risk is extremely small in practice. See https://docs.ropensci.org/targets/reference/tar_seed_create.html#rng-overlap for details, references, and justification.
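
The distinct-seed property described in this section can also be checked without running a pipeline. The sketch below assumes a version of targets that exports tar_seed_create() and simply recomputes the seeds for the three target names used in this chapter.

```r
library(targets)

# Recompute the per-target seeds for a pipeline-level seed of 3.
target_names <- c("target_1", "target_2", "target_3")
seeds <- vapply(
  target_names,
  tar_seed_create,
  FUN.VALUE = numeric(1),
  global_seed = 3
)

# anyDuplicated() returns 0 when every seed is distinct.
anyDuplicated(seeds) == 0L
```

Distinct seeds rule out identical streams, but they do not by themselves prove that the streams never overlap; that argument is the subject of the reference linked above.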

9.4 tarchetypes

Many target factories in the tarchetypes package support batched replication:

  • tar_rep()
  • tar_map_rep()
  • tar_map2_count()
  • tar_map2_size()
  • tar_quarto_rep()
  • tar_render_rep()

In batched replication, each target is a batch that runs multiple replications of a stochastic task. If you change the number of batches or number of replications per batch, the target name changes, which changes the seed of each target. To make pipelines more resilient, tar_rep() and friends set their own unique deterministic seeds from tar_seed_create() based on:

  1. tar_option_get("seed").
  2. The parent name of the dynamic target.
  3. The index of each replicate in the sequence.

If you return data frames or lists, those seeds are available in the tar_seed element of the output. Each replicate gets its own seed, and the default seeds from tar_meta() no longer apply.

# _targets.R file:
library(targets)
library(tarchetypes)
tar_option_set(seed = 3)
list(
  tar_rep(
    name = tasks,
    command = runif(n = 1),
    batches = 2,
    reps = 3
  )
)

tar_make()
#> ▶ dispatched target tasks_batch
#> ● completed target tasks_batch [0 seconds]
#> ▶ dispatched branch tasks_1b5e876cb04170df
#> ● completed branch tasks_1b5e876cb04170df [0.014 seconds]
#> ▶ dispatched branch tasks_e9280f6d9ede67e3
#> ● completed branch tasks_e9280f6d9ede67e3 [0.003 seconds]
#> ● completed pattern tasks
#> ▶ ended pipeline [0.101 seconds]

tar_read(tasks)
#> # A tibble: 6 × 4
#>   result tar_batch tar_rep   tar_seed
#>    <dbl>     <int>   <int>      <int>
#> 1  0.882         1       1 1161495390
#> 2  0.781         1       2 1040766653
#> 3  0.213         1       3  942098819
#> 4  0.913         2       1 -720434756
#> 5  0.545         2       2 1717229114
#> 6  0.298         2       3 -115675171

If you change the batching structure, the tar_rep and tar_batch columns will change, but the results and the seeds will stay the same.

# _targets.R file:
library(targets)
library(tarchetypes)
tar_option_set(seed = 3)
list(
  tar_rep(
    name = tasks,
    command = runif(n = 1),
    batches = 3, # previously 2
    reps = 2     # previously 3
  )
)

tar_make()
#> ▶ dispatched target tasks_batch
#> ● completed target tasks_batch [0 seconds]
#> ▶ dispatched branch tasks_1b5e876cb04170df
#> ● completed branch tasks_1b5e876cb04170df [0.014 seconds]
#> ▶ dispatched branch tasks_e9280f6d9ede67e3
#> ● completed branch tasks_e9280f6d9ede67e3 [0.003 seconds]
#> ▶ dispatched branch tasks_9e424147a252bf65
#> ● completed branch tasks_9e424147a252bf65 [0.003 seconds]
#> ● completed pattern tasks
#> ▶ ended pipeline [0.108 seconds]

tar_read(tasks)
#> # A tibble: 6 × 4
#>   result tar_batch tar_rep   tar_seed
#>    <dbl>     <int>   <int>      <int>
#> 1  0.882         1       1 1161495390
#> 2  0.781         1       2 1040766653
#> 3  0.213         2       1  942098819
#> 4  0.913         2       2 -720434756
#> 5  0.545         3       1 1717229114
#> 6  0.298         3       2 -115675171
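
One way to see why the seeds survive the change in batching: if each replicate's seed depends on its position in the overall sequence rather than on its batch, then 2 batches of 3 and 3 batches of 2 enumerate the same six positions. The base R sketch below illustrates that arithmetic. The function replicate_index() and its formula are assumptions made for illustration; the text above only states that the seed depends on the index of each replicate in the sequence.

```r
# Hypothetical mapping from (batch, rep) to an overall replicate index.
replicate_index <- function(batch, rep, reps) {
  (batch - 1L) * reps + rep
}

# Layout A: 2 batches of 3 replicates (rep varies fastest).
layout_a <- expand.grid(rep = 1:3, batch = 1:2)
layout_a$index <- replicate_index(layout_a$batch, layout_a$rep, reps = 3L)

# Layout B: 3 batches of 2 replicates.
layout_b <- expand.grid(rep = 1:2, batch = 1:3)
layout_b$index <- replicate_index(layout_b$batch, layout_b$rep, reps = 2L)

# Both layouts cover the same overall indices in the same order.
identical(layout_a$index, layout_b$index)
```

This is consistent with the two tables above: the fourth row is batch 2, replicate 1 in the first layout and batch 2, replicate 2 in the second, yet it carries the same tar_seed value in both.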

  1. tar_make() does not interfere with the pseudo-random number generator state of the calling R process.