Chapter 10 Literate programming

Literate programming is the practice of mixing code and descriptive writing in order to execute and explain a data analysis simultaneously in the same document. The targets package supports literate programming through tight integration with R Markdown and knitr.

There are two kinds of literate programming in targets:

  1. Rendering within a target. Here, you define a special kind of target that runs a lightweight R Markdown report which depends on upstream targets.
  2. Target Markdown, an overarching system in which one or more R Markdown files write the _targets.R file and encapsulate the pipeline.

10.1 Rendering within a target

If you render an R Markdown report as part of a target, the report should be lightweight: mostly prose, minimal code, fast execution, and no output other than the rendered HTML/PDF document. In other words, R Markdown reports are just targets that document prior results. The bulk of the computation should have already happened upstream, and the most of the code chunks in the report itself should be terse calls to tar_read() and tar_load().

For this chapter, we will use the so-called “minimal” example, which is similar to the walkthrough example but with an R Markdown report and a slightly different pipeline. The report from that example looks like this:

Above, the report depends on targets fit and hist. The use of tar_read() and tar_load() allows us to run the report outside the pipeline. As long as _targets/ folder has data on the required targets from a previous tar_make(), you can open the RStudio IDE, edit the report, and click the Knit button like you would for any other R Markdown report.

To connect the target with the pipeline, we define a special kind of target using tar_render() from the tarchetypes package instead of the usual tar_target(), which

  1. Finds all the tar_load()/tar_read() dependencies in the report and inserts them into the target’s command. This enforces the proper dependency relationships. (tar_load_raw() and tar_read_raw() are ignored because those dependencies cannot be resolved with static code analysis.)
  2. Sets format = "file" (see tar_target()) so targets watches the files at the returned paths.
  3. Configures the target’s command to return both the output report files and the input source file. All these file paths are relative paths so the project stays portable.
  4. Forces the report to run in the user’s current working directory instead of the working directory of the report.
  5. Sets convenient default options such as deployment = "main" in tar_target() and quiet = TRUE in rmarkdown::render().

The target definition looks like this.

library(tarchetypes)
target <- tar_render(report, "report.Rmd") # Just defines a target object.
target$command$expr[[1]]
#> tarchetypes::tar_render_run(path = "report.Rmd", args = list(input = "report.Rmd", 
#>     knit_root_dir = getwd(), quiet = TRUE), deps = list(fit, 
#>     hist))

Because symbols fit and hist appear in the command, targets knows that report depends on the fit target and the hist target. When we put the report target in the pipeline, these dependency relationships show up in the graph.

# _targets.R
library(targets)
library(tarchetypes)
source("R/functions.R")
list(
  tar_target(
    raw_data_file,
    "data/raw_data.csv",
    format = "file"
  ),
  tar_target(
    raw_data,
    read_csv(raw_data_file, col_types = cols())
  ),
  tar_target(
    data,
    raw_data %>%
      mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
  ),
  tar_target(hist, create_plot(data)),
  tar_target(fit, biglm(Ozone ~ Wind + Temp, data)),
  tar_render(report, "report.Rmd") # Here is our call to tar_render().
)
# R console
tar_visnetwork()

10.1.1 Parameterized R Markdown

Functions in tarchetypes make it straightforward to use parameterized R Markdown in a targets pipeline. The next two subsections walk through the major use cases.

10.1.2 Single parameter set

In this scenario, the pipeline renders your parameterized R Markdown report one time using a single set of parameters. These parameters can be upstream targets, global objects, or fixed values. Simply pass a params argument to tarchetypes::tar_render():

# _targets.R
library(targets)
library(tarchetypes)
list(
  tar_target(data, data.frame(x = seq_len(26), y = letters))
  tar_render(report, "report.Rmd", params = list(your_param = data))
)

the report target will run:

# R console
rmarkdown::render("report.Rmd", params = list(your_param = your_target))

where report.Rmd has the following YAML front matter:

---
title: report
output_format: html_document
params:
  your_param: "default value"
---

and the following code chunk:

print(params$your_param)

See these examples for a demonstration.

10.1.3 Multiple parameter sets

In this scenario, you still have a single report, but you render it multiple times over multiple sets of R Markdown parameters. This time, use tarchetypes::tar_render_rep() and write code to reference or generate a grid of parameters with one row per rendered report and one column per parameter. Optionally, you can also include an output_file column to control the file paths of the generated reports, and you can set the number of batches to reduce the overhead that would otherwise ensue from creating a large number of targets.

# _targets.R
library(targets)
library(tarchetypes)
tar_option_set(packages = "tibble")
list(
  tar_target(x, "value_of_x"),
  tar_render_rep(
    report,
    "report.Rmd",
    params = tibble(
      par = c("par_val_1", "par_val_2", "par_val_3", "par_val_4"),
      output_file = c("f1.html", "f2.html", "f3.html", "f4.html")
    ),
    batches = 2
  )
)

where report.Rmd has the following YAML front matter:

title: report
output_format: html_document
params:
  par: "default value"

and the following R code chunk:

print(params$par)
print(tar_read(x))

tar_render_rep() creates multiple targets to set up the R Markdown part of the workflow, including a target for the grid of parameters and a dynamic branching target to iterate over the parameters in batches. In this case, we have two batches (dynamic branches) and each one renders the report twice.

# R console
tar_make()
#> ● run target x
#> ● run target report_params
#> ● run branch report_9e7470a1
#> ● run branch report_457829de
#> ● end pipeline

The third output file f3.html is below, and the rest look similar.

For more information, see these examples.

10.2 Target Markdown

Target Markdown, available in targets > 0.6.0, is a powerful R Markdown interface for reproducible analysis pipelines. With Target Markdown, you can define a fully scalable pipeline from within one or more R Markdown reports, anything from a single report to a whole bookdown or workflowr project. You get the best of both worlds: the human readable narrative of literate programming, and the sophisticated caching and dependency management systems of targets.

10.2.1 Access

This chapter’s example Target Markdown document is itself a tutorial and a simplified version of the chapter. There are two convenient ways to access the file:

  1. The use_targets() function.
  2. The RStudio R Markdown template system.

For (2), in the RStudio IDE, select a new R Markdown document in the New File dropdown menu in the upper left-hand corner of the window.

Then, select the Target Markdown template and click OK to open a copy of the report for editing.

10.2.2 Purpose

Target Markdown has two primary objectives:

  1. Interactively explore, prototype, and test the components of a targets pipeline using the R Markdown notebook interface.
  2. Set up a targets pipeline using convenient R Markdown code chunks.

Target Markdown supports a special {targets} language engine with an interactive mode for (1) and a non-interactive mode for (2). By default, the mode is interactive in the notebook interface and non-interactive when you knit/render the whole document.16. You can set the mode using the tar_interactive chunk option.

10.2.3 Example

The following example is based on the minimal targets project at https://github.com/wlandau/targets-minimal/. We process the base airquality dataset, fit a model, and display a histogram of ozone concentration.

10.2.4 Required packages

This example requires several R packages, and targets must be version 0.6.0 or above.

# R console
install.packages(c("biglm", "dplyr", "ggplot2", "readr", "targets", "tidyr"))

10.2.5 Setup

First, load targets to activate the specialized knitr engine for Target Markdown.

```{r}
library(targets)
```

Non-interactive Target Markdown writes scripts to a special _targets_r/ directory to define individual targets and global objects. In order to keep your target definitions up to date, it is recommended to remove _targets_r/ at the beginning of the R Markdown document(s) in order to clear out superfluous targets and globals from a previous version. tar_unscript() is a convenient way to do this.

```{r}
tar_unscript()
```

10.2.6 Globals

As usual, your targets depend on custom functions, global objects, and tar_option_set() options you define before the pipeline begins. Define these globals using the {targets} engine with tar_globals = TRUE chunk option.

```{targets some-globals, tar_globals = TRUE, tar_interactive = TRUE}
options(tidyverse.quiet = TRUE)
tar_option_set(packages = c("biglm", "dplyr", "ggplot2", "readr", "tidyr"))
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), bins = 12) +
    theme_gray(24)
}
```

In interactive mode, the chunk simply runs the R code in the tar_option_get("envir") environment (usually the global environment) and displays a message:

#> Run code and assign objects to the environment.

Here is the same chunk in non-interactive mode. Normally, there is no need to duplicate chunks like this, but we do so here in order to demonstrate both modes.

```{targets chunk-name, tar_globals = TRUE, tar_interactive = FALSE}
options(tidyverse.quiet = TRUE)
tar_option_set(packages = c("biglm", "dplyr", "ggplot2", "readr", "tidyr"))
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), bins = 12) +
    theme_gray(24)
}
```

In non-interactive mode, the chunk establishes a common _targets.R file and writes the R code to a script in _targets_r/globals/, and displays an informative message:17

#> Establish _targets.R and _targets_r/globals/chunk-name.R.

It is good practice to assign explicit chunk labels or set the tar_name chunk option on a chunk-by-chunk basis. Each chunk writes code to a script path that depends on the name, and all script paths need to be unique.18

10.2.7 Target definitions

To define targets of the pipeline, use the {targets} language engine with the tar_globals chunk option equal FALSE or NULL (default). The return value of the chunk must be a target object or a list of target objects, created by tar_target() or a similar function.

Below, we define a target to establish the air quality dataset in the pipeline.

```{targets raw-data, tar_interactive = TRUE}
tar_target(raw_data, airquality)
```

If you run this chunk in interactive mode, the target’s R command runs, the engine tests if the output can be saved and loaded from disk correctly, and then the return value gets assigned to the tar_option_get("envir") environment (usually the global environment).

#> Run targets and assign them to the environment.

In the process, some temporary files are created and destroyed, but your local file space will remain untouched (barring any custom side effects in your custom code).

After you run a target in interactive mode, the return value is available in memory, and you can write an ordinary R code chunk to read it.

```{r}
head(raw_data)
```

The output is the same as what tar_read(raw_data) would show after a serious pipeline run.

head(raw_data)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

For demonstration purposes, here is the raw_data target code chunk in non-interactive mode.

```{targets chunk-name-with-target, tar_interactive = FALSE}
tar_target(raw_data, airquality)
```

In non-interactive mode, the {targets} engine does not actually run any targets. Instead, it establishes a common _targets.R and writes the code to a script in _targets_r/targets/.

#> Establish _targets.R and _targets_r/targets/chunk-name-with-target.R.

Next, we define more targets to process the raw data and plot a histogram. Only the returned value of the chunk code actually becomes part of the pipeline, so if you define multiple targets in a single chunk, be sure to wrap them all in a list.

```{targets downstream-targets}
list(
  tar_target(data, raw_data %>% filter(!is.na(Ozone))),
  tar_target(hist, create_plot(data))
)
```

In non-interactive mode, the whole target list gets written to a single script.

#> Establish _targets.R and _targets_r/targets/downstream-targets.R.

Lastly, we define a target to fit a model to the data. For simple targets like this one, we can use convenient shorthand to convert the code in a chunk into a valid target. Simply set the tar_simple chunk option to TRUE.

```{targets fit, tar_simple = TRUE}
analysis_data <- data
biglm(Ozone ~ Wind + Temp, analysis_data)
```

When the chunk is preprocessed, chunk label (or the tar_name chunk option if you set it) becomes the target name, and the chunk code becomes the target command. All other arguments of tar_target() remain at their default values (configurable with tar_option_set() in a tar_globals = TRUE chunk). The output in the rendered R Markdown document reflects this preprocessing.

tar_target(fit, {
  biglm(Ozone ~ Wind + Temp, data)
})
#> Define target fit from chunk code.
#> Establish _targets.R and _targets_r/targets/fit.R.

10.2.8 Pipeline

If you ran all the {targets} chunks in non-interactive mode (i.e. pipeline construction mode), then the target script file and helper scripts should all be established, and you are ready to run the pipeline in with tar_make() in an ordinary {r} code chunk. This time, the output is written to persistent storage at the project root.

```{r}
tar_make()
```
#> • start target raw_data
#> • built target raw_data
#> • start target data
#> • built target data
#> • start target fit
#> • built target fit
#> • start target hist
#> • built target hist
#> • end pipeline: 0.608 seconds

10.2.9 Output

You can retrieve results from the _targets/ data store using tar_read() or tar_load().

```{r}
library(biglm)
tar_read(fit)
```
#> Large data regression model: biglm(Ozone ~ Wind + Temp, data)
#> Sample size =  116
```{r}
tar_read(hist)
```

The targets dependency graph helps your readers understand the steps of your pipeline at a high level.

```{r}
tar_visnetwork()
```

At this point, you can go back and run {targets} chunks in interactive mode without interfering with the code or data of the non-interactive pipeline.

10.2.10 Conditioning on interactive mode

targets version 0.6.0.9001 and above supports the tar_interactive() function, which suppresses code unless Target Markdown interactive mode is turned on. Similarly, tar_noninteractive() suppresses code in interactive mode, and tar_toggle() selects alternative pieces of code based on the current mode.

10.2.11 tar_interactive()

tar_interactive() is useful for dynamic branching. If a dynamic target branches over a target from a different chunk, this ordinarily breaks interactive mode.

```{targets condition, tar_interactive = TRUE}
tar_target(y, x ^ 2, pattern = map(x))
```
#> Run targets and assign them to the environment.
#> Error:
#> ! Target y tried to branch over x, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.

However, with tar_interactive(), you can define a version of x just for testing and prototyping in interactive mode. The chunk below fixes interactive mode without changing the pipeline in non-interactive mode.

```{targets condition-fixed, tar_interactive = TRUE}
list(
  tar_interactive(tar_target(x, seq_len(2))),
  tar_target(y, x ^ 2, pattern = map(x))
)
```
#> Run targets and assign them to the environment.

10.2.12 tar_toggle()

tar_toggle() is useful for scaling up and down the amount of work based on the current mode. Interactive mode should finish quickly for prototyping and testing, and non-interactive mode should take on the full level work required for a serious pipeline. Below, tar_toggle() seamlessly scales up and down the number of simulations repetitions in the example target from https://wlandau.github.io/rmedicine2021-pipeline/#target-definitions. To learn more about stantargets, visit https://docs.ropensci.org/stantargets/.

```{targets bayesian-model-validation, tar_interactive = TRUE}
tar_stan_mcmc_rep_summary(
  name = mcmc,
  stan_files = "model.stan",
  data = simulate_data(), # Defined in another code chunk.
  batches = tar_toggle(1, 100),
  reps = tar_toggle(1, 10),
  chains = tar_toggle(1, 4),
  parallel_chains = tar_toggle(1, 4),
  iter_warmup = tar_toggle(100, 4e4),
  iter_sampling = tar_toggle(100, 4e4),
  summaries = list(
    ~posterior::quantile2(.x, probs = c(0.025, 0.25, 0.5, 0.75, 0.975)),
    rhat = ~posterior::rhat(.x)
  ),
  deployment = "worker"
)
```

10.2.13 Chunk options

  • tar_globals: Logical of length 1, whether to define globals or targets. If TRUE, the chunk code defines functions, objects, and options common to all the targets. If FALSE or NULL (default), then the chunk returns formal targets for the pipeline.
  • tar_interactive: Logical of length 1 to choose whether to run the chunk in interactive mode or non-interactive mode.
  • tar_name: name to use for writing helper script files (e.g. _targets_r/targets/target_script.R) and specifying target names if the tar_simple chunk option is TRUE. All helper scripts and target names must have unique names, so please do not set this option globally with knitr::opts_chunk$set().
  • tar_script: Character of length 1, where to write the target script file in non-interactive mode. Most users can skip this option and stick with the default _targets.R script path. Helper script files are always written next to the target script in a folder with an "_r" suffix. The tar_script path must either be absolute or be relative to the project root (where you call tar_make() or similar). If not specified, the target script path defaults to tar_config_get("script") (default: _targets.R; helpers default: _targets_r/). When you run tar_make() etc. with a non-default target script, you must select the correct target script file either with the script argument or with tar_config_set(script = ...). The function will source() the script file from the current working directory (i.e. with chdir = FALSE in source()).
  • tar_simple: Logical of length 1. Set to TRUE to define a single target with a simplified interface. In code chunks with tar_simple equal to TRUE, the chunk label (or the tar_name chunk option if you set it) becomes the name, and the chunk code becomes the command. In other words, a code chunk with label targetname and command mycommand() automatically gets converted to tar_target(name = targetname, command = mycommand()). All other arguments of tar_target() remain at their default values (configurable with tar_option_set() in a tar_globals = TRUE chunk).

  1. In targets version 0.6.0, the mode is interactive if interactive() is TRUE. In subsequent versions, the mode is interactive if !isTRUE(getOption("knitr.in.progress")) is TRUE.↩︎

  2. The _targets.R file from Target Markdown never changes from chunk to chunk or report to report, so you can spread your work over multiple reports without worrying about aligning _targets.R scripts. Just be sure all your chunk names are unique across all the reports of a project, or you set the tar_name chunk option to specify base names of script file paths.↩︎

  3. In addition, for bookdown projects, chunk labels should only use alphanumeric characters and dashes.↩︎

Copyright Eli Lilly and Company