Chapter 6 External files and literate programming

The targets package automatically stores data and automatically responds to changed files to keep your targets up to date. The chapter below explains how to leverage this reproducibility for external datasets, external output files, and literate programming artifacts such as R Markdown reports. Real-world applications of these techniques are linked from here.

6.1 Internal files

Each project’s data lives in the _targets/ folder in the root directory (where you call tar_make()). The files in the _targets/ look like this:

_targets/
├── meta/
├────── meta
├────── process
├────── progress
├── objects/
├────── target1 
├────── target2
├────── branching_target_c7bcb4bd
├────── branching_target_285fb6a9
├────── branching_target_874ca381
└── scratch/ # tar_make() deletes this folder after it finishes.

Spreadsheets _targets/meta/meta keeps track of target metadata, _targets/meta/progress records runtime progress, and _targets/meta/process has high-level information (such as process ID) about the external R session orchestrating the targets. The scratch/ directory contains temporary files which can be safely deleted after tar_make() finishes. The _targets/objects/ folder contains the return values of the targets themselves.

A typical target returns an R object: for example, a dataset with tar_target(dataset, data.frame(x = rnorm(1000)), format = "fst") or a fitted model tar_target(model, biglm(ozone ~ temp + wind), format = "qs"). When you run the pipeline, targets computes this object and saves it as a file in _targets/objects/. The file name in _targets/objects/ is always the target name, and type of the file is determined by the format argument of tar_target(), and formats "fst" and "qs" are two of many choices explained in the help file of tar_target(). No matter what format you pick, targets watches the file for changes and recomputes the target in tar_make() if the the file gets corrupted (unless you suppress the file cue with tar_target(cue = tar_cue(file = FALSE))).

6.2 External input files

To reproducibly track an external input file, you need to define a new target that has

  1. A command that returns the file path as a character vector, and
  2. format = "file" in tar_target().

When the target runs in the pipeline, the returned character vector gets recorded in _targets/meta, and targets watches the data file and invalidates the target when that file changes. To track multiple files or directories this way, simply define a multi-element character vector where each element is a path.

The first two targets of the minimal example demonstrate how to track an input file.

# _targets.R
library(targets)
path_to_data <- function() {
  "data/raw_data.csv"
}
list(
  tar_target(
    raw_data_file,
    path_to_data(),
    format = "file"
  ),
  tar_target(
    raw_data,
    read_csv(raw_data_file, col_types = cols())
  )
)

Above, raw_data_file is the dynamic file target. The file data/raw_data.csv exists before we ever run the pipeline, and the R expression for the target returns the character vector "data/raw_data.csv". (We use the path_to_data() function to demonstrate that you need not literally write "data/raw_data.csv" as long as the path is returned somehow.)

All subsequent targets that depend on the file must reference the file using the symbol raw_data_file. This allows targets’ automatic static code analysis routines to detect which targets depend on the file. Because the raw_data target literally mentions the symbol raw_data_file, targets knows raw_data depends on raw_data_file. This ensures that

  1. raw_data_file gets processed before raw_data, and
  2. tar_make() automatically reruns raw_data if raw_data_file or "data/raw_data.csv" change.
tar_visnetwork()

If we were to omit the symbol raw_data_file from the R expression of raw_data, those targets would be disconnected in the graph and tar_make() would make incorrect decisions.

# _targets.R
library(targets)
path_to_data <- function() {
  "data/raw_data.csv"
}
list(
  tar_target(
    raw_data_file,
    path_to_data(),
    format = "file"
  ),
  tar_target(
    raw_data,
    read_csv("data/raw_data.csv", col_types = cols()) # incorrect
  )
)
tar_visnetwork()

6.3 External output files

We can generate and track custom external files too, and the mechanics are similar. We still return a file path and use format = "file", but this time, our R command writes a file before it returns a path. For an external plot file, our target might look like this.

tar_target(
  plot_file,
  save_plot_and_return_path(),
  format = "file"
)

where our custom save_plot_and_return_path() function does exactly what the name describes.

save_plot_and_return_path <- function() {
  plot <- ggplot(mtcars) +
    geom_point(aes(x = wt, y = mpg))
  ggsave("plot_file.png", plot, width = 7, height = 7)
  return("plot_file.png")
}

6.4 Literate programming

An R Markdown report should be lightweight: mostly prose, minimal code, fast execution, and no output other than the rendered HTML/PDF document. In other words, R Markdown reports are just targets that document prior results. The bulk of the computation should have already happened upstream, and the most of the code chunks in the report itself should be terse calls to tar_read() and tar_load().

The report from the minimal example looks like this:

Above, the report depends on targets fit and hist. The use of tar_read() and tar_load() allows us to run the report outside the pipeline. As long as _targets/ folder has data on the required targets from a previous tar_make(), you can open the RStudio IDE, edit the report, and click the Knit button like you would for any other R Markdown report.

To connect the target with the pipeline, we define a special kind of target using tar_render() from the tarchetypes package instead of the usual tar_target(), which

  1. Finds all the tar_load()/tar_read() dependencies in the report and inserts them into the target’s command. This enforces the proper dependency relationships. (tar_load_raw() and tar_read_raw() are ignored because those dependencies cannot be resolved with static code analysis.)
  2. Sets format = "file" (see tar_target()) so targets watches the files at the returned paths.
  3. Configures the target’s command to return both the output report files and the input source file. All these file paths are relative paths so the project stays portable.
  4. Forces the report to run in the user’s current working directory instead of the working directory of the report.
  5. Sets convenient default options such as deployment = "main" in tar_target() and quiet = TRUE in rmarkdown::render().

The target definition looks like this.

library(tarchetypes)
tar_render(report, "report.Rmd")

Because symbols fit and hist appear in the command, targets knows that report depends on fit and hist. When we put the report target in the pipeline, these dependency relationships show up in the graph.

# _targets.R
library(targets)
library(tarchetypes)
source("R/functions.R")
list(
  tar_target(
    raw_data_file,
    "data/raw_data.csv",
    format = "file"
  ),
  tar_target(
    raw_data,
    read_csv(raw_data_file, col_types = cols())
  ),
  tar_target(
    data,
    raw_data %>%
      mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
  ),
  tar_target(hist, create_plot(data)),
  tar_target(fit, biglm(Ozone ~ Wind + Temp, data)),
  tar_render(report, "report.Rmd") # Here is our call to tar_render().
)
tar_visnetwork()

6.5 Parameterized R Markdown

Functions in tarchetypes make it straightforward to use parameterized R Markdown in a targets pipeline. The next two subsections walk through the major use cases.

6.5.1 Single parameter set

In this scenario, the pipeline renders your parameterized R Markdown report one time using a single set of parameters. These parameters can be upstream targets, global objects, or fixed values. Simply pass a params argument to tarchetypes::tar_render():

# _targets.R
library(targets)
library(tarchetypes)
list(
  tar_target(data, data.frame(x = seq_len(26), y = letters))
  tar_render(report, "report.Rmd", params = list(your_param = data))
)

the report target will run:

rmarkdown::render("report.Rmd", params = list(your_param = your_target))

where report.Rmd has the following YAML front matter:

---
title: report
output_format: html_document
params:
  your_param: "default value"
---

and the following code chunk:

print(params$your_param)

See these examples for a demonstration.

6.5.2 Multiple parameter sets

In this scenario, you still have a single report, but you render it multiple times over multiple sets of R Markdown parameters. This time, use tarchetypes::tar_render_rep() and write code to reference or generate a grid of parameters with one row per rendered report and one column per parameter. Optionally, you can also include an output_file column to control the file paths of the generated reports, and you can set the number of batches to reduce the overhead that would otherwise ensue from creating a large number of targets.

# _targets.R
library(targets)
library(tarchetypes)
library(tibble)
list(
  tar_target(x, "value_of_x"),
  tar_render_rep(
    report,
    "report.Rmd",
    params = tibble(
      par = c("par_val_1", "par_val_2", "par_val_3", "par_val_4"),
      output_file = c("f1.html", "f2.html", "f3.html", "f4.html")
    ),
    batches = 2
  )
)

where report.Rmd has the following YAML front matter:

title: report
output_format: html_document
params:
  par: "default value"

and the following R code chunk:

print(params$par)
print(tar_read(x))

tar_render_rep() creates multiple targets to set up the R Markdown part of the workflow, including a target for the grid of parameters and a dynamic branching target to iterate over the parameters in batches. In this case, we have two batches (dynamic branches) and each one renders the report twice.

tar_make()
#> ● run target x
#> ● run target report_params
#> ● run branch report_9e7470a1
#> ● run branch report_457829de

The third output file f3.html is below, and the rest look similar.

For more information, see these examples.

Copyright Eli Lilly and Company