# _targets.R file:
library(targets)
library(tarchetypes)
tar_option_set(seed = 3)
list(
  tar_target(name = target_1, command = runif(n = 1)),
  tar_target(name = target_2, command = runif(n = 1)),
  tar_target(name = target_3, command = runif(n = 1))
)
9 Pseudo-random numbers
This chapter was written for targets >= 1.3.2.9001 and tarchetypes >= 0.7.8.9001, which use targets::tar_seed_create() and additional safeguards for statistical independence of pseudo-random numbers.
This chapter explains the behaviors, limitations, and trade-offs of targets with respect to pseudo-random number generation. It assumes basic familiarity with pseudo-random numbers, especially base R functions like sample() and set.seed().
9.1 Overview
A targets pipeline may run stochastic methods, and targets tries to ensure that the results are repeatable and correct. There are two major statistical challenges:
- Reproducibility: different runs of the same pipeline should produce the same results, even if a target runs a stochastic function like rnorm().
- Independence: pseudo-random numbers should behave like independent random samples. Pseudo-random number sequences in different targets should overlap as little as possible.
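To make the reproducibility goal concrete, here is a minimal base R sketch that does not involve targets at all: resetting the seed to the same value replays the same pseudo-random draws. The seed value and variable names are illustrative only.

```r
# Draw three uniform numbers after setting a seed.
set.seed(3)
first_run <- runif(n = 3)

# Reset the same seed and draw again.
set.seed(3)
second_run <- runif(n = 3)

# The two runs replay the same sequence.
identical(first_run, second_run)
#> [1] TRUE
```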
9.2 Reproducibility
Each target runs with its own deterministic seed. The target seed is a function of:
- Its name, and
- The pipeline-level seed from tar_option_get("seed").
Consider the simple pipeline defined in the _targets.R file at the top of this chapter.
The seed of the target_1 target is:
expected_seed <- tar_seed_create("target_1", global_seed = 3)
expected_seed
#> [1] 1432713270
And the target runs the equivalent of:
withr::with_seed(seed = expected_seed, code = runif(n = 1))
#> [1] 0.1995242
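The key idea in the call above is that the seed applies only while the command runs. The following base R sketch illustrates that idea with a hypothetical helper, run_with_seed(). It is not part of targets or withr, and it is not claimed to match how either package is implemented.

```r
# Hypothetical helper: evaluate `code` under `seed`, then restore the
# caller's RNG state so that later draws are unaffected.
run_with_seed <- function(seed, code) {
  had_state <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (had_state) {
    old_state <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
  }
  on.exit({
    if (had_state) {
      assign(".Random.seed", old_state, envir = globalenv())
    } else {
      rm(list = ".Random.seed", envir = globalenv())
    }
  })
  set.seed(seed)
  # `code` is a lazily evaluated argument, so it first runs here,
  # after the seed has been set.
  force(code)
}

set.seed(1)
state_before <- .Random.seed
draw_a <- run_with_seed(seed = 3, code = runif(n = 1))
draw_b <- run_with_seed(seed = 3, code = runif(n = 1))
identical(draw_a, draw_b)              # same seed, same draw
identical(state_before, .Random.seed)  # the caller's state is untouched
```

Restoring the previous state on exit is what keeps a seeded command from disturbing any random draws made outside it.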
We can run the pipeline with tar_make(), view the seed with tar_meta(), and view the result with tar_read().¹
tar_make()
#> ▶ dispatched target target_1
#> ● completed target target_1 [0.001 seconds, 55 bytes]
#> ▶ dispatched target target_2
#> ● completed target target_2 [0 seconds, 55 bytes]
#> ▶ dispatched target target_3
#> ● completed target target_3 [0 seconds, 55 bytes]
#> ▶ ended pipeline [0.057 seconds]
tar_meta(names = any_of("target_1"), fields = any_of("seed"))
#> # A tibble: 1 × 2
#>   name           seed
#>   <chr>         <int>
#> 1 target_1 1432713270
tar_read(target_1)
#> [1] 0.1995242
The seed argument of tar_option_set() offers flexibility:
1. If you set seed to a different integer, you will get a different (but still reproducible) set of stochastic results.
2. If you set seed to NA, then targets will not set a seed at all. Different runs of the pipeline will generally produce different results, and those results will not be reproducible.
For (2), each target will always appear outdated in tar_make() and tar_outdated(). To force a target to be up to date, set cue = tar_cue(seed = FALSE) in tar_target() or tar_option_set().
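As an illustration of case (2), a _targets.R file along the following lines would turn off seeding and stop the missing seed from invalidating the target. This is a sketch rather than a recommended setup, and the target name noisy_draw is hypothetical.

```r
# _targets.R file (illustrative sketch):
library(targets)
# Case (2): do not set any seed, so results differ from run to run.
tar_option_set(seed = NA)
list(
  tar_target(
    name = noisy_draw,  # hypothetical target name
    command = runif(n = 1),
    # Do not let the seed alone mark this target as outdated.
    cue = tar_cue(seed = FALSE)
  )
)
```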
9.3 Independence
Within a pipeline, different targets are guaranteed to have different names. Barring the vanishingly small chance of hash collisions in tar_seed_create(), that means they should also have different seeds.
tar_meta(targets_only = TRUE, fields = any_of("seed"))
#> # A tibble: 3 × 2
#>   name           seed
#>   <chr>         <int>
#> 1 target_1 1432713270
#> 2 target_2 -740972651
#> 3 target_3  227569066
Thus, different targets should have non-identical sequences of pseudo-random numbers.
tar_read(target_1)
#> [1] 0.1995242
tar_read(target_2)
#> [1] 0.2481659
tar_read(target_3)
#> [1] 0.5339015
In theory, these parallel random number generator streams could overlap and produce statistically correlated results. However, the risk is extremely small in practice. See https://docs.ropensci.org/targets/reference/tar_seed_create.html#rng-overlap for details, references, and justification.
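As a quick sanity check on the claim that distinct names give distinct seeds, the sketch below calls tar_seed_create() directly on the three target names from the pipeline, with the same global_seed argument used earlier in this chapter. It assumes the targets package is installed, and the variable names are illustrative. Distinct seeds do not by themselves demonstrate statistical independence, which is what the linked reference addresses.

```r
library(targets)

target_names <- c("target_1", "target_2", "target_3")

# Compute the seed each target would receive under pipeline seed 3.
seeds <- sapply(
  target_names,
  function(name) tar_seed_create(name, global_seed = 3)
)

# Check that no two targets share a seed.
anyDuplicated(seeds) == 0
```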
9.4 tarchetypes
Many target factories in the tarchetypes package support batched replication:
- tar_rep()
- tar_map_rep()
- tar_map2_count()
- tar_map2_size()
- tar_quarto_rep()
- tar_render_rep()
In batched replication, each target is a batch that runs multiple replications of a stochastic task. If you change the number of batches or number of replications per batch, the target name changes, which changes the seed of each target. To make pipelines more resilient, tar_rep() and friends set their own unique deterministic seeds from tar_seed_create() based on:
- tar_option_get("seed").
- The parent name of the dynamic target.
- The index of each replicate in the sequence.
If you return data frames or lists, those seeds are available in the tar_seed element of the output. Each replicate gets its own seed, and the default seeds from tar_meta() no longer apply.
# _targets.R file:
library(targets)
library(tarchetypes)
tar_option_set(seed = 3)
list(
  tar_rep(
    name = tasks,
    command = runif(n = 1),
    batches = 2,
    reps = 3
  )
)
tar_make()
#> ▶ dispatched target tasks_batch
#> ● completed target tasks_batch [0 seconds, 98 bytes]
#> ▶ dispatched branch tasks_1b5e876cb04170df
#> ● completed branch tasks_1b5e876cb04170df [0.013 seconds, 216 bytes]
#> ▶ dispatched branch tasks_e9280f6d9ede67e3
#> ● completed branch tasks_e9280f6d9ede67e3 [0.003 seconds, 217 bytes]
#> ● completed pattern tasks
#> ▶ ended pipeline [0.098 seconds]
tar_read(tasks)
#> # A tibble: 6 × 4
#>   result tar_batch tar_rep   tar_seed
#>    <dbl>     <int>   <int>      <int>
#> 1  0.882         1       1 1161495390
#> 2  0.781         1       2 1040766653
#> 3  0.213         1       3  942098819
#> 4  0.913         2       1 -720434756
#> 5  0.545         2       2 1717229114
#> 6  0.298         2       3 -115675171
If you change the batching structure, the tar_rep and tar_batch columns will change, but the results and the seeds will stay the same.
# _targets.R file:
library(targets)
library(tarchetypes)
tar_option_set(seed = 3)
list(
  tar_rep(
    name = tasks,
    command = runif(n = 1),
    batches = 3, # previously 2
    reps = 2 # previously 3
  )
)
tar_make()
#> ▶ dispatched target tasks_batch
#> ● completed target tasks_batch [0.001 seconds, 99 bytes]
#> ▶ dispatched branch tasks_1b5e876cb04170df
#> ● completed branch tasks_1b5e876cb04170df [0.011 seconds, 201 bytes]
#> ▶ dispatched branch tasks_e9280f6d9ede67e3
#> ● completed branch tasks_e9280f6d9ede67e3 [0.002 seconds, 198 bytes]
#> ▶ dispatched branch tasks_9e424147a252bf65
#> ● completed branch tasks_9e424147a252bf65 [0.002 seconds, 202 bytes]
#> ● completed pattern tasks
#> ▶ ended pipeline [0.098 seconds]
tar_read(tasks)
#> # A tibble: 6 × 4
#>   result tar_batch tar_rep   tar_seed
#>    <dbl>     <int>   <int>      <int>
#> 1  0.882         1       1 1161495390
#> 2  0.781         1       2 1040766653
#> 3  0.213         2       1  942098819
#> 4  0.913         2       2 -720434756
#> 5  0.545         3       1 1717229114
#> 6  0.298         3       2 -115675171
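One way to read the two tables above is that each replicate keeps its position in the overall sequence no matter how the sequence is cut into batches. The sketch below illustrates that bookkeeping in base R, under the assumption that replicates are numbered consecutively across batches. replicate_index() is a hypothetical helper, not a tarchetypes function, and it does not compute actual seeds.

```r
# Hypothetical helper: lay out batches and reps, and number every
# replicate by its position in the overall sequence.
replicate_index <- function(batches, reps) {
  data.frame(
    tar_batch = rep(seq_len(batches), each = reps),
    tar_rep = rep(seq_len(reps), times = batches),
    index = seq_len(batches * reps)
  )
}

two_by_three <- replicate_index(batches = 2, reps = 3)
three_by_two <- replicate_index(batches = 3, reps = 2)

# The batch labels differ between the two layouts...
identical(two_by_three$tar_batch, three_by_two$tar_batch)
#> [1] FALSE
# ...but the overall replicate index does not.
identical(two_by_three$index, three_by_two$index)
#> [1] TRUE
```

A seed derived from the pipeline-level seed, the parent name, and this index therefore does not depend on the batch layout.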
¹ tar_make() does not interfere with the pseudo-random number generator state of the calling R process.