Chapter 2 Walkthrough

This chapter walks through a minimal example of a targets-powered data analysis project. The source code is available here, and it has a free RStudio Cloud workspace where you can try the code in your web browser. The documentation website links to other examples.

2.1 About this minimal example

The goal of this minimal workflow is to assess the relationship among ozone, wind, and temperature in base R’s airquality dataset. We read the data from a file, preprocess it, visualize the ozone variable with a histogram, and fit a regression model.

2.2 File structure

The file structure of the project looks like this.

├── _targets.R
├── R/
│   └── functions.R
└── data/
    └── raw_data.csv

raw_data.csv contains the data we want to analyze.

Ozone,Solar.R,Wind,Temp,Month,Day
36,118,8.0,72,5,2
12,149,12.6,74,5,3
...

functions.R contains our custom user-defined functions. (See the best practices chapter for a discussion of function-oriented workflows.)

# functions.R
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone)) +
    theme_gray(24)
}

Whereas the files raw_data.csv and functions.R are typical user-defined components of a project-oriented workflow, the _targets.R file is special. Every targets workflow needs a file called _targets.R in the project’s root directory. The functions tar_script() and tar_edit() can help you create one. Ours looks like this:

# _targets.R
library(targets)
source("R/functions.R")
options(tidyverse.quiet = TRUE)
tar_option_set(packages = c("biglm", "tidyverse"))
list(
  tar_target(
    raw_data_file,
    "data/raw_data.csv",
    format = "file"
  ),
  tar_target(
    raw_data,
    read_csv(raw_data_file, col_types = cols())
  ),
  tar_target(
    data,
    raw_data %>%
      mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
  ),
  tar_target(hist, create_plot(data)),
  tar_target(fit, biglm(Ozone ~ Wind + Temp, data))
)

All _targets.R scripts have these requirements.

  1. Load the targets package itself. (_targets.R scripts created with tar_script() automatically insert a library(targets) line at the top by default.)
  2. Load your custom functions and global objects into the R session. In our case, our only such object is the create_plot() function, and we load it into the session by calling source("R/functions.R").
  3. Call tar_option_set() to set the default settings for all your targets, such as the names of required packages and the data storage format. Individual targets can override these settings.
  4. Define individual targets with the tar_target() function. Each target is an intermediate step of the workflow. At minimum, a target must have a name and an R expression. This expression runs when the pipeline builds the target, and the return value is saved as a file in the _targets/objects/ folder. The only targets not stored in _targets/objects/ are dynamic files such as raw_data_file. Here, format = "file" makes raw_data_file a dynamic file. That means targets watches the data at the file paths returned from the expression (in this case, "data/raw_data.csv").
  5. Every _targets.R script must end with a list of your tar_target() objects. Those objects can be nested, i.e. lists within lists.
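
As a rough illustration of the requirements above, the sketch below uses tar_script() to write a toy _targets.R in a temporary directory and then checks it with tar_validate(). The targets x, y, and z and their commands are hypothetical and are not part of the airquality example. The packages argument on y is one way an individual target can override a default from tar_option_set(), and the inner list() shows nesting.

```r
library(targets)

# Work in a throwaway directory so no real project is touched.
demo_dir <- tempfile("targets_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

# tar_script() writes the code below to _targets.R in the working directory.
tar_script({
  tar_option_set(packages = "stats") # default for every target
  list(
    tar_target(x, rnorm(10)),
    list( # nested lists of targets are allowed
      tar_target(y, head(x, 3), packages = "utils"), # per-target override
      tar_target(z, mean(y))
    )
  )
}, ask = FALSE)

tar_validate() # throws an error if the pipeline is malformed

script_lines <- readLines("_targets.R")
writeLines(script_lines)

setwd(old_wd)
```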

2.3 Inspect the pipeline

Before you run the pipeline for real, you should always inspect the manifest and the graph for errors. tar_manifest() returns a data frame of information about the targets, and its arguments let you choose which targets and which columns to return.

tar_manifest(fields = "command")
#> # A tibble: 5 x 2
#>   name         command                                                          
#>   <chr>        <chr>                                                            
#> 1 raw_data_fi… "\"data/raw_data.csv\""                                          
#> 2 raw_data     "read_csv(raw_data_file, col_types = cols())"                    
#> 3 data         "raw_data %>% mutate(Ozone = replace_na(Ozone, mean(Ozone,  \\n …
#> 4 fit          "biglm(Ozone ~ Wind + Temp, data)"                               
#> 5 hist         "create_plot(data)"

There are also graphical displays with tar_glimpse()

tar_glimpse()

and tar_visnetwork().

tar_visnetwork()

Both graphing functions above visualize the underlying directed acyclic graph (DAG) and tell you how targets are connected. This DAG is indifferent to the order of targets in your pipeline. You will still get the same graph even if you rearrange them. This is because targets uses static code analysis to detect the dependencies of each target, and this process does not depend on target order. For details, visit the dependency detection section of the best practices guide.
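
To make the idea of order-independent dependency detection concrete, here is a simplified base R sketch. It is not the algorithm targets uses internally: it only looks for target names among the symbols of each command, and the helper find_deps() and the simplified commands are hypothetical. The real analysis also tracks functions such as create_plot().

```r
# Hypothetical commands keyed by target name, deliberately listed "backwards".
commands <- list(
  hist = quote(create_plot(data)),
  fit = quote(lm(Ozone ~ Wind + Temp, data)),
  data = quote(transform(raw_data, Ozone = Ozone)),
  raw_data = quote(read.csv(raw_data_file)),
  raw_data_file = quote("data/raw_data.csv")
)

# A target depends on every other target whose name appears in its command.
find_deps <- function(expr) intersect(all.names(expr), names(commands))

deps <- lapply(commands, find_deps)
str(deps)

# Reversing the order of the targets yields the same set of edges.
deps_rev <- lapply(rev(commands), find_deps)
same_graph <- identical(
  deps[sort(names(deps))],
  deps_rev[sort(names(deps_rev))]
)
print(same_graph)
```

Because the edges come from the code of each command rather than from its position in the list, the resulting graph does not change when the targets are rearranged.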

2.4 Run the pipeline

tar_make() runs the workflow. It creates a fresh clean external R process, reads _targets.R to learn about the pipeline, runs the correct targets in the correct order given by the graph, and saves the necessary data to the _targets/ data store.

tar_make()
#> ● run target raw_data_file
#> ● run target raw_data
#> ● run target data
#> ● run target fit
#> ● run target hist
#> ● end pipeline

The next time you run tar_make(), targets skips everything that is already up to date, which saves a lot of time in large projects with long runtimes.

tar_make()
#> ✔ skip target raw_data_file
#> ✔ skip target raw_data
#> ✔ skip target data
#> ✔ skip target fit
#> ✔ skip target hist
#> ✔ skip pipeline

You can use tar_visnetwork() and tar_outdated() to check ahead of time which targets are up to date.

tar_visnetwork()
tar_outdated()
#> character(0)

2.5 Changes

The targets package notices when you make changes to your workflow, and tar_make() only runs the targets that need to build. There are custom rules called “cues” that targets uses to decide whether a target needs to rerun.[1]

2.5.1 Change code

If you change one of your functions, the targets that depend on it will no longer be up to date, and tar_make() will rebuild them. For example, let’s set the number of bins in our histogram.

# Edit functions.R.
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), bins = 10) + # Set number of bins.
    theme_gray(24)
}

targets detects the change. hist is outdated (as would be any targets downstream of hist) and the others are still up to date.

tar_visnetwork()
tar_outdated()
#> [1] "hist"

That means tar_make() reruns hist and nothing else.

tar_make()
#> ✔ skip target raw_data_file
#> ✔ skip target raw_data
#> ✔ skip target data
#> ✔ skip target fit
#> ● run target hist
#> ● end pipeline

We would see similar behavior if we changed the R expressions in any tar_target() calls in _targets.R.
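
The following sketch tries this with a toy pipeline in a temporary directory. The targets x and y are hypothetical stand-ins, not part of the airquality example. After the first tar_make(), only the command of y is edited, so tar_outdated() should report y alone.

```r
library(targets)

# Work in a throwaway directory so no real project is touched.
demo_dir <- tempfile("targets_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

# Initial pipeline: y depends on x.
tar_script({
  list(
    tar_target(x, 1 + 1),
    tar_target(y, x * 2)
  )
}, ask = FALSE)
tar_make()
outdated_before <- tar_outdated()

# Overwrite _targets.R, changing only the command of y.
tar_script({
  list(
    tar_target(x, 1 + 1),
    tar_target(y, x * 3)
  )
}, ask = FALSE)
outdated_after <- tar_outdated()

print(outdated_before) # nothing is outdated right after tar_make()
print(outdated_after)  # only the target whose command changed

setwd(old_wd)
```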

2.5.2 Change data

If we change the data file raw_data.csv, targets notices the change. This is because raw_data_file is a dynamic file (i.e. tar_target(format = "file")) whose expression returned "data/raw_data.csv". Let’s try it out. Below, we keep only the first 100 rows of the airquality dataset.

library(readr) # write_csv() comes from readr, which tidyverse also loads.
write_csv(head(airquality, n = 100), "data/raw_data.csv")

Sure enough, raw_data_file and everything downstream of it are out of date, so all our targets are outdated.

tar_visnetwork()
tar_outdated()
#> [1] "raw_data"      "fit"           "hist"          "raw_data_file"
#> [5] "data"
tar_make()
#> ● run target raw_data_file
#> ● run target raw_data
#> ● run target data
#> ● run target fit
#> ● run target hist
#> ● end pipeline
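
One way to picture how a watched file can be flagged as changed is to compare content hashes before and after an edit. The base R sketch below uses tools::md5sum() on a temporary file. It is only a conceptual illustration, and the assumption that a hash comparison is a fair mental model is ours: the exact mechanism targets uses for dynamic files is not described in this chapter.

```r
# Temporary stand-in for data/raw_data.csv.
path <- tempfile(fileext = ".csv")

write.csv(airquality, path, row.names = FALSE)
hash_full <- unname(tools::md5sum(path))

# Rewriting identical contents leaves the hash unchanged.
write.csv(airquality, path, row.names = FALSE)
hash_same <- unname(tools::md5sum(path))

# Keeping only the first 100 rows changes the contents, hence the hash.
write.csv(head(airquality, n = 100), path, row.names = FALSE)
hash_subset <- unname(tools::md5sum(path))

print(identical(hash_full, hash_same))   # TRUE
print(identical(hash_full, hash_subset)) # FALSE
```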

2.6 Read your data

targets has a convenient function, tar_read(), to read your data from the _targets/ data store.

tar_read(hist)

There is also a tar_load() function, which supports tidyselect verbs like starts_with().

tar_load(starts_with("fit"))
library(biglm)
#> Loading required package: DBI
fit
#> Large data regression model: biglm(Ozone ~ Wind + Temp, data)
#> Sample size =  100

The purpose of tar_read() and tar_load() is to make exploratory data analysis easy and convenient. Use these functions to verify the correctness of the output from the pipeline and come up with ideas for new targets if needed.
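
The difference between the two functions can be sketched with a toy pipeline in a temporary directory. The targets fit_small and fit_large are hypothetical and hold plain numbers rather than models. tar_read() returns a value, whereas tar_load() assigns objects into an environment, here a fresh one passed through the envir argument so the global workspace stays clean.

```r
library(targets)

# Work in a throwaway directory so no real project is touched.
demo_dir <- tempfile("targets_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

tar_script({
  list(
    tar_target(fit_small, 1 + 1),
    tar_target(fit_large, fit_small * 10)
  )
}, ask = FALSE)
tar_make()

# tar_read() returns the stored value without assigning it anywhere.
value <- tar_read(fit_large)

# tar_load() assigns every matching target into the given environment.
results <- new.env()
tar_load(starts_with("fit"), envir = results)
loaded <- sort(ls(results))

print(value)
print(loaded)

setwd(old_wd)
```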

2.7 Read metadata

To read the build progress of your targets while tar_make() is running, you can open a new R session and run tar_progress(). It reads the spreadsheet in _targets/meta/progress and tells you which targets are running, built, errored, or cancelled.

tar_progress()
#> # A tibble: 5 x 2
#>   name          progress
#>   <chr>         <chr>   
#> 1 raw_data_file built   
#> 2 raw_data      built   
#> 3 data          built   
#> 4 fit           built   
#> 5 hist          built

Likewise, the tar_meta() function reads _targets/meta/meta and tells you high-level information about each target’s settings, data, and results. The warnings, error, and traceback columns give you diagnostic information about targets with problems.

as.data.frame(tar_meta())
#>            name     type             data          command           depend
#> 1   create_plot function 658f2a44b31f63bb             <NA>             <NA>
#> 2 raw_data_file     stem 2b16f490030787ce b6df0c34fc22d1b9 ef46db3751d8e999
#> 3      raw_data     stem 1cad7cea1af1d9b9 000ed0cc054f0d35 6ef08fe7f06c82ac
#> 4          data     stem 31b0aae9cd840b97 df3101ce91c63e21 f64bb47ba166802c
#> 5           fit     stem d1ca2c6e2379fc3e aa0df6c5dbd10537 971da964f77940d5
#> 6          hist     stem a0a066b54682fa83 68877181ab74e51e 56365937b52e5161
#>          seed                      path             time             size bytes
#> 1          NA                        NA             <NA>             <NA>    NA
#> 2  2110307107         data/raw_data.csv 14b88a92911ded1d 9fb65f8dffe0153b  1884
#> 3  -979620141 _targets/objects/raw_data c0cc40cfeb0f9907 1f1c3f46554d3c28  1153
#> 4  1588979285     _targets/objects/data d1d129ff3461c81b 519c4c3e5705e4ce  1157
#> 5  1780184594      _targets/objects/fit 4f4bd89f515c8409 49c3093a2e8614d1  1609
#> 6 -1026346201     _targets/objects/hist 7e8871451d892923 6c682578505acaba 44326
#>   format iteration parent children seconds warnings error
#> 1   <NA>      <NA>     NA       NA      NA       NA    NA
#> 2   file    vector     NA       NA   1.457       NA    NA
#> 3    rds    vector     NA       NA   0.031       NA    NA
#> 4    rds    vector     NA       NA   0.012       NA    NA
#> 5    rds    vector     NA       NA   0.004       NA    NA
#> 6    rds    vector     NA       NA   0.012       NA    NA

The _targets/meta/meta spreadsheet file is critically important. Although targets can still work properly if files are missing from _targets/objects, the pipeline will error out if _targets/meta/meta is corrupted. If tar_meta() works, the project should be fine.
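
As a small sketch of how the metadata can be used for diagnostics, the code below builds a toy pipeline in a temporary directory and subsets the tar_meta() output with base R. The targets x and y are hypothetical. The column names are taken from the output shown above.

```r
library(targets)

# Work in a throwaway directory so no real project is touched.
demo_dir <- tempfile("targets_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

tar_script({
  list(
    tar_target(x, 1 + 1),
    tar_target(y, x * 2)
  )
}, ask = FALSE)
tar_make()

meta <- tar_meta()

# Keep a few columns for a quick overview of storage format and runtime.
timing <- as.data.frame(meta[, c("name", "format", "seconds")])
print(timing)

# Rows with a recorded warning or error point to targets with problems.
problems <- meta[!is.na(meta$error) | !is.na(meta$warnings), ]
print(nrow(problems))

setwd(old_wd)
```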


  1. For the full details on cues, read the “Details” section of the tar_cue() help file. (Enter ?targets::tar_cue into your R console.)

Copyright Eli Lilly and Company