# 9  Pseudo-random numbers

Package versions

This chapter was written for `targets >= 1.3.2.9001` and `tarchetypes >= 0.7.8.9001`, which use `targets::tar_seed_create()` and additional safeguards for statistical independence of pseudo-random numbers.

This chapter explains the behaviors, limitations, and trade-offs of `targets` with respect to pseudo-random number generation. It assumes basic familiarity with pseudo-random numbers, especially base R functions like `sample()` and `set.seed()`.

## 9.1 Overview

A `targets` pipeline may run stochastic methods, and `targets` tries to ensure that the results are repeatable and correct. There are two major statistical challenges:

1. Reproducibility: different runs of the same pipeline should produce the same results, even if a target runs a stochastic function like `rnorm()`.
2. Independence: pseudo-random numbers should behave like independent random samples. Pseudo-random number sequences in different targets should overlap as little as possible.

## 9.2 Reproducibility

Each target runs with its own deterministic seed. The target seed is a function of:

1. Its name, and
2. The pipeline-level seed from `tar_option_get("seed")`.

Consider the following simple pipeline.

``````# _targets.R file:
library(targets)
tar_option_set(seed = 3)
list(
  tar_target(name = target_1, command = runif(n = 1)),
  tar_target(name = target_2, command = runif(n = 1)),
  tar_target(name = target_3, command = runif(n = 1))
)``````

The seed of the `target_1` target is:

``````expected_seed <- tar_seed_create("target_1", global_seed = 3)

expected_seed
#> [1] 1432713270``````

And the target runs the equivalent of:

``````withr::with_seed(seed = expected_seed, code = runif(n = 1))
#> [1] 0.1995242``````
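To make the idea of "running code under a seed" concrete, here is a minimal base R sketch of what a `withr::with_seed()`-style wrapper amounts to. The helper name `run_with_seed()` is hypothetical and is not part of `targets` or `withr`; the sketch only assumes that the goal is to set the seed, evaluate the code, and then restore the caller's generator state.

```r
# Hypothetical helper, for illustration only (not the targets internals).
run_with_seed <- function(seed, code) {
  # Remember the caller's RNG state, if one exists yet.
  had_state <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_state <- if (had_state) get(".Random.seed", envir = globalenv()) else NULL
  # Restore (or remove) the state when the function exits.
  on.exit({
    if (had_state) {
      assign(".Random.seed", old_state, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  })
  set.seed(seed)
  force(code) # lazy evaluation: the code runs here, after set.seed()
}

first <- run_with_seed(1432713270L, runif(n = 1))
second <- run_with_seed(1432713270L, runif(n = 1))
identical(first, second)
#> [1] TRUE
```

Because the seed is fixed, repeated calls return the same draw, and because the previous state is restored on exit, the surrounding R session's random number stream is left as it was.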

We can run the pipeline with `tar_make()`, view the seed with `tar_meta()`, and view the result with `tar_read()` (see footnote 1).

``````tar_make()
#> ▶ dispatched target target_1
#> ● completed target target_1 [0 seconds]
#> ▶ dispatched target target_2
#> ● completed target target_2 [0.001 seconds]
#> ▶ dispatched target target_3
#> ● completed target target_3 [0 seconds]
#> ▶ ended pipeline [0.062 seconds]

tar_meta(names = any_of("target_1"), fields = any_of("seed"))
#> # A tibble: 1 × 2
#>   name           seed
#>   <chr>         <int>
#> 1 target_1 1432713270

tar_read(target_1)
#> [1] 0.1995242``````

The `seed` argument of `tar_option_set()` offers flexibility:

1. If you set `seed` to a different integer, you will get a different (but still reproducible) set of stochastic results.
2. If you set `seed` to `NA`, then `targets` will not set a seed at all. Different runs of the pipeline will produce different results, and those results will not be reproducible.

For (2), each target will always appear outdated in `tar_make()` and `tar_outdated()`. To force a target to be up to date, set `cue = tar_cue(seed = FALSE)` in `tar_target()` or `tar_option_set()`.
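As a rough sketch of how these two settings combine, the `_targets.R` file below disables seeds for the whole pipeline and suppresses the seed cue for one target. The target names `draw_fresh` and `draw_kept` are hypothetical and chosen only for illustration.

```r
# _targets.R file (sketch):
library(targets)
tar_option_set(seed = NA) # no seed is set for any target
list(
  # With seed = NA, this target appears outdated on every run.
  tar_target(
    name = draw_fresh,
    command = runif(n = 1)
  ),
  # The seed cue is suppressed, so this target can stay up to date once built.
  tar_target(
    name = draw_kept,
    command = runif(n = 1),
    cue = tar_cue(seed = FALSE)
  )
)
```

Under these assumptions, neither target is reproducible from a seed; the cue only controls whether the missing seed invalidates the stored result.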

## 9.3 Independence

Within a pipeline, different targets are guaranteed to have different names. Barring the vanishingly small chance of hash collisions in `tar_seed_create()`, that means they should also have different seeds.

``````tar_meta(targets_only = TRUE, fields = any_of("seed"))
#> # A tibble: 3 × 2
#>   name           seed
#>   <chr>         <int>
#> 1 target_1 1432713270
#> 2 target_2 -740972651
#> 3 target_3  227569066``````

Thus, different targets should have non-identical sequences of pseudo-random numbers.

``````tar_read(target_1)
#> [1] 0.1995242
tar_read(target_2)
#> [1] 0.2481659
tar_read(target_3)
#> [1] 0.5339015``````

In theory, these parallel random number generator streams could overlap and produce statistically correlated results. However, the risk is extremely small in practice. See https://docs.ropensci.org/targets/reference/tar_seed_create.html#rng-overlap for details, references, and justification.
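For an informal look at this, the base R sketch below draws a longer stream from each of the three seeds shown above and computes the sample correlations between streams. The helper `draw()` and the stream length of 1000 are arbitrary choices for illustration. Correlations near zero are consistent with independence but do not prove it, so this is a sanity check rather than a formal test.

```r
# Seeds copied from the tar_meta() output above.
seeds <- c(
  target_1 = 1432713270L,
  target_2 = -740972651L,
  target_3 = 227569066L
)

# Hypothetical helper: draw n uniform variates starting from a given seed.
draw <- function(seed, n = 1000) {
  set.seed(seed)
  runif(n)
}

# One column per target, one row per draw.
streams <- vapply(seeds, draw, FUN.VALUE = numeric(1000))

# Pairwise sample correlations between the three streams.
round(cor(streams), 3)
```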

## 9.4 `tarchetypes`

Many target factories in the `tarchetypes` package support batched replication:

- `tar_rep()`
- `tar_map_rep()`
- `tar_map2_count()`
- `tar_map2_size()`
- `tar_quarto_rep()`
- `tar_render_rep()`

In batched replication, each target is a batch that runs multiple replications of a stochastic task. If you change the number of batches or number of replications per batch, the target name changes, which changes the seed of each target. To make pipelines more resilient, `tar_rep()` and friends set their own unique deterministic seeds from `tar_seed_create()` based on:

1. `tar_option_get("seed")`.
2. The parent name of the dynamic target.
3. The index of each replicate in the sequence.

If you return data frames or lists, those seeds are available in the `tar_seed` element of the output. Each replicate gets its own seed, and the default seeds from `tar_meta()` no longer apply.
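The reason a replicate index is a stable ingredient for a seed can be sketched in base R. The helper `rep_index()` below is hypothetical: it assumes the index is simply the position of a replicate when batches are laid end to end, and it does not reproduce how `tarchetypes` turns that index into a seed.

```r
# Hypothetical helper: position of a replicate in the overall sequence.
rep_index <- function(batch, rep, reps_per_batch) {
  (batch - 1L) * reps_per_batch + rep
}

# Two layouts with six replicates in total.
layout_a <- expand.grid(rep = 1:3, batch = 1:2) # 2 batches of 3 reps
layout_b <- expand.grid(rep = 1:2, batch = 1:3) # 3 batches of 2 reps

index_a <- rep_index(layout_a$batch, layout_a$rep, reps_per_batch = 3L)
index_b <- rep_index(layout_b$batch, layout_b$rep, reps_per_batch = 2L)

# Both layouts enumerate the same indices 1, 2, ..., 6.
all(index_a == index_b)
#> [1] TRUE
```

Since the sequence of indices does not depend on how replicates are grouped, a seed derived from the pipeline seed, the parent name, and this index is unaffected by the choice of batches and reps per batch.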

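``````# _targets.R file:
library(targets)
library(tarchetypes)
tar_option_set(seed = 3)
list(
  tar_rep(
    name = tasks, # required; matches the tasks_* names in the output
    # Return a data frame so each row gains tar_batch, tar_rep, and tar_seed.
    command = tibble::tibble(result = runif(n = 1)),
    batches = 2,
    reps = 3
  )
)``````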
``````tar_make()
#> ● completed target tasks_batch [0.001 seconds]
#> ● completed branch tasks_1b5e876cb04170df [0.011 seconds]
#> ● completed branch tasks_e9280f6d9ede67e3 [0.003 seconds]
#> ▶ ended pipeline [0.087 seconds]

tar_read(tasks)
#> # A tibble: 6 × 4
#>   result tar_batch tar_rep   tar_seed
#>    <dbl>     <int>   <int>      <int>
#> 1  0.882         1       1 1161495390
#> 2  0.781         1       2 1040766653
#> 3  0.213         1       3  942098819
#> 4  0.913         2       1 -720434756
#> 5  0.545         2       2 1717229114
#> 6  0.298         2       3 -115675171``````

If you change the batching structure, the `tar_rep` and `tar_batch` columns will change, but the results and the seeds will stay the same.

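``````# _targets.R file:
library(targets)
library(tarchetypes)
tar_option_set(seed = 3)
list(
  tar_rep(
    name = tasks, # required; matches the tasks_* names in the output
    # Return a data frame so each row gains tar_batch, tar_rep, and tar_seed.
    command = tibble::tibble(result = runif(n = 1)),
    batches = 3, # previously 2
    reps = 2     # previously 3
  )
)``````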
``````tar_make()
#> ● completed target tasks_batch [0 seconds]
#> ● completed branch tasks_1b5e876cb04170df [0.011 seconds]
#> ● completed branch tasks_e9280f6d9ede67e3 [0.002 seconds]
#> ● completed branch tasks_9e424147a252bf65 [0.003 seconds]
#> ▶ ended pipeline [0.093 seconds]``````

1. `tar_make()` does not interfere with the pseudo-random number generator state of the calling R process.