Chapter 2 Walkthrough

This chapter walks through a minimal example of a targets-powered data analysis project. The source code is available here, and a free RStudio Cloud workspace lets you try the code in your web browser. The documentation website links to other examples.

2.1 About this minimal example

The goal of this minimal workflow is to assess the relationship among ozone, wind, and temperature in base R’s airquality dataset. We read the data from a file, preprocess it, visualize some of the variables, fit a regression model, and generate an R Markdown report to communicate the results.

2.2 File structure

The file structure of the project looks like this.

├── _targets.R
├── R/
│   └── functions.R
├── data/
    └── raw_data.csv

raw_data.csv contains the data we want to analyze.

Ozone,Solar.R,Wind,Temp,Month,Day
36,118,8.0,72,5,2
12,149,12.6,74,5,3
...

functions.R contains our custom user-defined functions. (See the best practices chapter for a discussion of function-oriented workflows.)

# functions.R
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone)) +
    theme_gray(24)
}
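
If you want to check the function interactively before wiring it into the pipeline, you could call it on a cleaned copy of airquality. This is just a quick sketch: it assumes ggplot2 is attached in your current session, whereas inside the pipeline tar_option_set() loads the packages for you.

# Quick interactive check of create_plot() outside the pipeline.
library(ggplot2)
create_plot(subset(airquality, !is.na(Ozone)))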

2.3 Target script file

Whereas the files raw_data.csv and functions.R are typical user-defined components of a project-oriented workflow, the target script file _targets.R is special. Every targets workflow needs a target script file to formally define the targets in the pipeline. By default, the target script is a file called _targets.R in the project’s root directory.1 The functions tar_script() and tar_edit() can help you create a target script file. Ours looks like this:

# _targets.R file
library(targets)
source("R/functions.R")
options(tidyverse.quiet = TRUE)
tar_option_set(packages = c("biglm", "tidyverse"))
list(
  tar_target(
    raw_data_file,
    "data/raw_data.csv",
    format = "file"
  ),
  tar_target(
    raw_data,
    read_csv(raw_data_file, col_types = cols())
  ),
  tar_target(
    data,
    raw_data %>%
      filter(!is.na(Ozone))
  ),
  tar_target(hist, create_plot(data)),
  tar_target(fit, biglm(Ozone ~ Wind + Temp, data))
)

All target script files have these requirements.

  1. Load the targets package itself. (tar_script() automatically inserts a library(targets) line at the top of the target scripts it creates.)
  2. Load your custom functions and global objects into the R session. In our case, our only such object is the create_plot() function, and we load it into the session by calling source("R/functions.R").
  3. Call tar_option_set() to set the default settings for all your targets, such as the names of required packages and the data storage format. Individual targets can override these settings.
  4. Define individual targets with the tar_target() function. Each target is an intermediate step of the workflow. At minimum, a target must have a name and an R expression. This expression runs when the pipeline builds the target, and the return value is saved as a file in the _targets/objects/ folder. The only targets not stored in _targets/objects/ are dynamic files such as raw_data_file. Here, format = "file" makes raw_data_file a dynamic file. That means targets watches the data at the file paths returned from the expression (in this case, "data/raw_data.csv").2
  5. Every target script must end with a list of your tar_target() objects. Those objects can be nested, i.e. lists within lists.
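
If you prefer not to write _targets.R by hand, tar_script() can generate it for you. Below is a minimal sketch that writes only the first two targets of the pipeline above; in practice you would pass the full target list. (tar_script() adds the library(targets) line to the script automatically, and ask = FALSE suppresses the interactive confirmation before overwriting an existing script.)

# Sketch: generate _targets.R programmatically with tar_script().
library(targets)
tar_script({
  source("R/functions.R")
  tar_option_set(packages = c("biglm", "tidyverse"))
  list(
    tar_target(raw_data_file, "data/raw_data.csv", format = "file"),
    tar_target(raw_data, read_csv(raw_data_file, col_types = cols()))
  )
}, ask = FALSE)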

2.4 Inspect the pipeline

Before you run the pipeline for real, you should always inspect the manifest and the graph for errors. tar_manifest() shows you a data frame of information about the targets, and it lets you choose which targets and columns to return.

tar_manifest(fields = "command")
#> # A tibble: 5 × 2
#>   name          command                                      
#>   <chr>         <chr>                                        
#> 1 raw_data_file "\"data/raw_data.csv\""                      
#> 2 raw_data      "read_csv(raw_data_file, col_types = cols())"
#> 3 data          "raw_data %>% filter(!is.na(Ozone))"         
#> 4 fit           "biglm(Ozone ~ Wind + Temp, data)"           
#> 5 hist          "create_plot(data)"
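
For example, here is a sketch of restricting the manifest to particular targets and columns. The names argument accepts tidyselect expressions such as starts_with().

# Restrict the manifest to selected targets and columns.
tar_manifest(names = starts_with("raw"), fields = "command")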

There are also graphical displays with tar_glimpse()

tar_glimpse()

and tar_visnetwork().

tar_visnetwork()

Both graphing functions above visualize the underlying directed acyclic graph (DAG) and tell you how targets are connected. This DAG is indifferent to the order of targets in your pipeline. You will still get the same graph even if you rearrange them. This is because targets uses static code analysis to detect the dependencies of each target, and this process does not depend on target order. For details, visit the dependency detection section of the best practices guide.
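
If functions and other global objects clutter the display, tar_visnetwork() accepts a targets_only argument to hide them and show just the targets.

# Show only the targets in the dependency graph, hiding functions and globals.
tar_visnetwork(targets_only = TRUE)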

2.5 Run the pipeline

tar_make() runs the workflow. It creates a fresh, clean external R process, reads the target script to learn about the pipeline, runs the correct targets in the correct order given by the graph, and saves the necessary data to the _targets/ data store.3

tar_make()
#> • start target raw_data_file
#> • built target raw_data_file
#> • start target raw_data
#> • built target raw_data
#> • start target data
#> • built target data
#> • start target fit
#> • built target fit
#> • start target hist
#> • built target hist
#> • end pipeline

The next time you run tar_make(), targets skips everything that is already up to date, which saves a lot of time in large projects with long runtimes.

tar_make()
#> ✔ skip target raw_data_file
#> ✔ skip target raw_data
#> ✔ skip target data
#> ✔ skip target fit
#> ✔ skip target hist
#> ✔ skip pipeline

You can use tar_visnetwork() and tar_outdated() to check ahead of time which targets are up to date.

tar_visnetwork()
tar_outdated()
#> character(0)

2.6 Changes

The targets package notices when you make changes to code and data, and those changes affect which targets rerun and which targets are skipped. Internally, special rules called “cues” decide whether a target reruns. The tar_cue() function lets you suppress some of these cues, and the tarchetypes package supports nuanced cue factories and target factories to further customize target invalidation behavior. The tar_cue() function documentation explains cues in detail, as well as specifics on how targets detects changes to upstream dependencies.
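
As a hedged sketch, a cue can force a single target to rerun on every tar_make() regardless of whether its code or dependencies changed:

# Sketch: override the default invalidation rules for one target with tar_cue().
tar_target(
  fit,
  biglm(Ozone ~ Wind + Temp, data),
  cue = tar_cue(mode = "always") # "always" reruns the target every time.
)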

2.6.1 Change code

If you change one of your functions, the targets that depend on it will no longer be up to date, and tar_make() will rebuild them. For example, let’s set the number of bins in our histogram.

# Edit functions.R.
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), bins = 10) + # Set number of bins.
    theme_gray(24)
}

targets detects the change. hist is outdated (as would be any targets downstream of hist) and the others are still up to date.

tar_visnetwork()
tar_outdated()
#> [1] "hist"

That means tar_make() reruns hist and nothing else.

tar_make()
#> ✔ skip target raw_data_file
#> ✔ skip target raw_data
#> ✔ skip target data
#> ✔ skip target fit
#> • start target hist
#> • built target hist
#> • end pipeline

We would see similar behavior if we changed the R expressions in any tar_target() calls in the target script file.
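
For instance, a hypothetical edit like the one below, made to the fit target in _targets.R, would invalidate fit and anything downstream of it while the upstream targets stay up to date.

# Hypothetical edit in _targets.R: a changed command invalidates the fit target.
tar_target(fit, biglm(Ozone ~ Wind + Temp + Solar.R, data))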

2.6.2 Change data

If we change the data file raw_data.csv, targets notices the change. This is because raw_data_file is a dynamic file (i.e. tar_target(format = "file")) that returned "data/raw_data.csv". Let’s try it out. Below, let’s use only the first 100 rows of the airquality dataset.

write_csv(head(airquality, n = 100), "data/raw_data.csv")

Sure enough, raw_data_file and everything downstream is out of date, so all our targets are outdated.

tar_visnetwork()
tar_outdated()
#> [1] "raw_data"      "fit"           "hist"          "raw_data_file"
#> [5] "data"
tar_make()
#> • start target raw_data_file
#> • built target raw_data_file
#> • start target raw_data
#> • built target raw_data
#> • start target data
#> • built target data
#> • start target fit
#> • built target fit
#> • start target hist
#> • built target hist
#> • end pipeline

2.7 Read your data

targets has a convenient function tar_read() to read your data from the _targets/ data store.

tar_read(hist)

There is also a tar_load() function, which supports tidyselect verbs like starts_with().

tar_load(starts_with("fit"))
library(biglm)
#> Loading required package: DBI
fit
#> Large data regression model: biglm(Ozone ~ Wind + Temp, data)
#> Sample size =  69

The purpose of tar_read() and tar_load() is to make exploratory data analysis easy and convenient. Use these functions to verify the correctness of the output from the pipeline and come up with ideas for new targets if needed.
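
For example, a quick exploratory check might load the cleaned dataset and summarize it before you decide whether the pipeline needs a new target. This is just a sketch and assumes you have already run tar_make().

# Load the cleaned dataset from the data store and inspect it.
tar_load(data)
summary(data$Ozone)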

2.8 Read metadata

To read the build progress of your targets while tar_make() is running, you can open a new R session and run tar_progress(). It reads the spreadsheet in _targets/meta/progress and tells you which targets are running, built, errored, or cancelled.

tar_progress()
#> # A tibble: 5 × 2
#>   name          progress
#>   <chr>         <chr>   
#> 1 raw_data_file built   
#> 2 raw_data      built   
#> 3 data          built   
#> 4 fit           built   
#> 5 hist          built

Likewise, the tar_meta() function reads _targets/meta/meta and tells you high-level information about each target’s settings, data, and results. The warnings, error, and traceback columns give you diagnostic information about targets with problems.

tar_meta()
#> # A tibble: 6 × 17
#>   name  type  data  command depend    seed path  time                size  bytes
#>   <chr> <chr> <chr> <chr>   <chr>    <int> <lis> <dttm>              <chr> <int>
#> 1 crea… func… 658f… <NA>    <NA>   NA      <chr… NA                  <NA>     NA
#> 2 raw_… stem  2b16… b6df0c… ef46d…  2.11e9 <chr… 2021-07-23 12:50:31 9fb6…  1884
#> 3 raw_… stem  2f00… 000ed0… 6ef08… -9.80e8 <chr… 2021-07-23 12:50:36 acca…  1169
#> 4 data  stem  0e4c… 173ca5… eaa52…  1.59e9 <chr… 2021-07-23 12:50:36 4ccc…  1001
#> 5 fit   stem  66d1… aa0df6… f64d8…  1.78e9 <chr… 2021-07-23 12:50:36 d047…  1486
#> 6 hist  stem  2b6c… 688771… dbc9e… -1.03e9 <chr… 2021-07-23 12:50:36 fa9d… 44445
#> # … with 7 more variables: format <chr>, iteration <chr>, parent <lgl>,
#> #   children <list>, seconds <dbl>, warnings <lgl>, error <lgl>
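
To focus on diagnostics, you can restrict tar_meta() to the columns that report problems, as in the sketch below. The names and fields arguments accept tidyselect expressions here as well.

# Inspect only the diagnostic columns of the pipeline metadata.
tar_meta(fields = c("warnings", "error"))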

The _targets/meta/meta spreadsheet file is critically important. Although targets can still work properly if files are missing from _targets/objects, the pipeline will error out if _targets/meta/meta is corrupted. If tar_meta() works, the project should be fine.


  1. However, in targets version 0.5.0.9000 and above, you can set the target script file path to something other than _targets.R. You can either set the path persistently for your project using tar_config_set(), or you can set it temporarily for an individual function call using the script argument of tar_make() and related functions.↩︎

  2. You can also set the path of the data store to something other than _targets/. tar_config_set() sets it persistently, and the store argument of various functions like tar_make() lets you choose the data store path temporarily for a single function call.↩︎

  3. In targets version 0.3.1.9000 and above, you can set the path of the local data store to something other than _targets/. A project-level _targets.yaml file keeps track of the path. Functions tar_config_set() and tar_config_get() can help.↩︎
