Chapter 17 Visualization with drake

Data analysis projects have complicated networks of dependencies, and drake can help you visualize them with vis_drake_graph(), sankey_drake_graph(), and drake_ggraph() (note the two g’s).

17.1 Plotting plans

In drake versions later than 7.7.0, you can simply plot() the plan to show the targets and their dependency relationships.

library(drake)
# from https://github.com/wlandau/drake-examples/tree/main/mtcars
load_mtcars_example()

my_plan
#> # A tibble: 15 x 2
#>    target             command                                                   
#>    <chr>              <expr_lst>                                                
#>  1 report             knitr::knit(drake::knitr_in("report.Rmd"), drake::file_ou…
#>  2 small              simulate(48)                                             …
#>  3 large              simulate(64)                                             …
#>  4 regression1_small  reg1(small)                                              …
#>  5 regression1_large  reg1(large)                                              …
#>  6 regression2_small  reg2(small)                                              …
#>  7 regression2_large  reg2(large)                                              …
#>  8 summ_regression1_… suppressWarnings(summary(regression1_small$residuals))   …
#>  9 summ_regression1_… suppressWarnings(summary(regression1_large$residuals))   …
#> 10 summ_regression2_… suppressWarnings(summary(regression2_small$residuals))   …
#> 11 summ_regression2_… suppressWarnings(summary(regression2_large$residuals))   …
#> 12 coef_regression1_… suppressWarnings(summary(regression1_small))$coefficients…
#> 13 coef_regression1_… suppressWarnings(summary(regression1_large))$coefficients…
#> 14 coef_regression2_… suppressWarnings(summary(regression2_small))$coefficients…
#> 15 coef_regression2_… suppressWarnings(summary(regression2_large))$coefficients…

plot(my_plan)

17.1.1 vis_drake_graph()

Powered by visNetwork. Colors represent target status, and shapes represent data type. These graphs are interactive, so you can click, drag, zoom, and pan to adjust the size and position. Double-click on nodes to contract neighborhoods into clusters or expand them back out again. If you hover over a node, a tooltip shows the first few lines of

  • The command of a target, or
  • The body of an imported function, or
  • The content of an imported text file.

vis_drake_graph(my_plan)

To save this interactive widget for later, just supply the name of an HTML file.

vis_drake_graph(my_plan, file = "graph.html")

To save a static image file, supply a file name that ends in ".png", ".pdf", ".jpeg", or ".jpg".

vis_drake_graph(my_plan, file = "graph.png")

17.1.2 sankey_drake_graph()

These interactive networkD3 Sankey diagrams have more nuance: the height of each node is proportional to its number of connections. Nodes with many incoming connections tend to fall out of date more often, and nodes with many outgoing connections can invalidate bigger chunks of the downstream pipeline.

sankey_drake_graph(my_plan)

Saving the graphs is the same as before.

sankey_drake_graph(my_plan, file = "graph.html") # Interactive HTML widget
sankey_drake_graph(my_plan, file = "graph.png")  # Static image file

Unfortunately, a legend is not yet available for Sankey diagrams, but drake exposes a separate legend for the colors and shapes.

library(visNetwork)
legend_nodes()
#> # A tibble: 12 x 6
#>    label      color   shape    font.color font.size    id
#>    <chr>      <chr>   <chr>    <chr>          <dbl> <int>
#>  1 Up to date #228B22 dot      black             20     1
#>  2 Outdated   #000000 dot      black             20     2
#>  3 Running    #FF7221 dot      black             20     3
#>  4 Cancelled  #ECB753 dot      black             20     4
#>  5 Failed     #AA0000 dot      black             20     5
#>  6 Imported   #1874CD dot      black             20     6
#>  7 Missing    #9A32CD dot      black             20     7
#>  8 Object     #888888 dot      black             20     8
#>  9 Dynamic    #888888 star     black             20     9
#> 10 Function   #888888 triangle black             20    10
#> 11 File       #888888 square   black             20    11
#> 12 Cluster    #888888 diamond  black             20    12
visNetwork(nodes = legend_nodes())

17.1.3 drake_ggraph()

drake_ggraph() can handle larger workflows than the other graphing functions. If your project has thousands of targets and vis_drake_graph() or sankey_drake_graph() does not render properly, consider drake_ggraph(). Powered by ggraph, the graphs from drake_ggraph() are static ggplot2 objects, and you can save them with ggsave().

drake_ggraph(my_plan)
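
Because the result is an ordinary ggplot2 object, saving it works like saving any other plot. The following is a minimal sketch, assuming the ggraph and ggplot2 packages are installed. The file name drake_ggraph.png and the dimensions are arbitrary choices for illustration.

```r
library(drake)
library(ggplot2)

# Loads my_plan and its functions into the workspace.
# This also writes report.Rmd to the working directory.
load_mtcars_example()

# drake_ggraph() returns a ggplot2 object instead of an HTML widget.
g <- drake_ggraph(my_plan)

# Hypothetical file name and size; adjust to taste.
ggsave("drake_ggraph.png", plot = g, width = 8, height = 6)
```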

17.1.4 text_drake_graph()

If you are running R in a terminal without X Window support, the usual visualizations will not show up interactively in your session. In that case, you can use text_drake_graph() to see a text display in your terminal window. Terminal colors are deactivated in this manual, but you will see color in your console.

# Use nchar = 0 or nchar = 1 for better results.
# The color display is better in your own terminal.
text_drake_graph(my_plan, nchar = 3)
#>   reg                                                   sum  
#>                                          reg            coe  
#>                             lar                            
#>   dat            sim                                      sum  
#>                                          reg               
#>   ran                                                   coe  
#>                             sma                            
#>   fil                                                   sum  
#>                                          reg               
#>                rep                                      coe  
#>   kni                         fil                            
#>                                          reg            sum  
#>   reg                                                   coe

17.2 Underlying graph data: node and edge data frames

drake_graph_info() is used behind the scenes in vis_drake_graph(), sankey_drake_graph(), and drake_ggraph() to get the graph information ready for rendering. To save time, you can call drake_graph_info() to get these internals and then call render_drake_graph(), render_sankey_drake_graph(), or render_drake_ggraph().

str(drake_graph_info(my_plan))
#> List of 4
#>  $ nodes        : tibble [23 × 12] (S3: tbl_df/tbl/data.frame)
#>   ..$ id       : chr [1:23] "reg2" "n-NNXGS5DSHI5GW3TJOQ" "p-OJSXA33SOQXFE3LE" "random_rows" ...
#>   ..$ imported : logi [1:23] TRUE TRUE TRUE TRUE TRUE TRUE ...
#>   ..$ label    : chr [1:23] "reg2" "knitr::knit" "file report.Rmd" "random_rows" ...
#>   ..$ status   : chr [1:23] "imported" "imported" "imported" "imported" ...
#>   ..$ type     : chr [1:23] "function" "function" "file" "function" ...
#>   ..$ font.size: num [1:23] 20 20 20 20 20 20 20 20 20 20 ...
#>   ..$ color    : chr [1:23] "#1874CD" "#1874CD" "#1874CD" "#1874CD" ...
#>   ..$ shape    : chr [1:23] "triangle" "triangle" "square" "triangle" ...
#>   ..$ level    : num [1:23] 1 1 1 1 1 1 2 2 3 3 ...
#>   ..$ title    : chr [1:23] "Call drake_graph_info(hover = TRUE) for informative text." "Call drake_graph_info(hover = TRUE) for informative text." "Call drake_graph_info(hover = TRUE) for informative text." "Call drake_graph_info(hover = TRUE) for informative text." ...
#>   ..$ x        : num [1:23] -1 -1 -1 -1 -1 -1 -0.5 -0.5 0 0 ...
#>   ..$ y        : num [1:23] -0.918 -0.551 -0.184 0.184 0.551 ...
#>  $ edges        : tibble [23 × 3] (S3: tbl_df/tbl/data.frame)
#>   ..$ from  : chr [1:23] "small" "small" "reg2" "reg2" ...
#>   ..$ to    : chr [1:23] "regression1_small" "regression2_small" "regression2_large" "regression2_small" ...
#>   ..$ arrows: chr [1:23] "to" "to" "to" "to" ...
#>  $ legend_nodes : tibble [5 × 6] (S3: tbl_df/tbl/data.frame)
#>   ..$ label     : chr [1:5] "Outdated" "Imported" "Object" "Function" ...
#>   ..$ color     : chr [1:5] "#000000" "#1874CD" "#888888" "#888888" ...
#>   ..$ shape     : chr [1:5] "dot" "dot" "dot" "triangle" ...
#>   ..$ font.color: chr [1:5] "black" "black" "black" "black" ...
#>   ..$ font.size : num [1:5] 20 20 20 20 20
#>   ..$ id        : int [1:5] 2 6 8 10 11
#>  $ default_title: chr "Dependency graph"
#>  - attr(*, "class")= chr "drake_graph_info"
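
As a sketch of that workflow, the example below computes the graph information once, reuses it for two renderings, and then inspects the data frames directly. The object name info is arbitrary, and the visNetwork and networkD3 packages are assumed to be installed.

```r
library(drake)
load_mtcars_example()

# Compute the node and edge data frames once.
info <- drake_graph_info(my_plan)

# Reuse the same information for more than one rendering.
render_drake_graph(info)
render_sankey_drake_graph(info)

# The underlying data are ordinary data frames.
table(info$nodes$type)
head(info$edges)
```

Computing the information once is most useful when the project is large and you want to compare several renderings of the same graph.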

17.3 Visualizing target status

drake’s visuals tell you which targets are up to date and which are outdated.

make(my_plan, verbose = 0L)
outdated(my_plan)
#> character(0)

sankey_drake_graph(my_plan)

When you change a dependency, some targets fall out of date (black nodes).

reg2 <- function(d){
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
sankey_drake_graph(my_plan)
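
The same information is available without drawing a graph. The sketch below assumes the mtcars example and an installed knitr (needed to build the report target). It uses outdated() to list the stale targets and reads the status column of the node data frame. The names stale and nodes are arbitrary.

```r
library(drake)
load_mtcars_example()
make(my_plan, verbose = 0L)

# Redefine an imported function that some targets depend on.
reg2 <- function(d){
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}

# Targets downstream of reg2() are now out of date.
stale <- outdated(my_plan)
stale

# The node data frame records the same status for each node.
nodes <- drake_graph_info(my_plan)$nodes
nodes$id[nodes$status == "outdated"]
```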

17.4 Subgraphs

Graphs can grow enormous for serious projects, so there are multiple ways to focus on a manageable subgraph. The most brute-force way is to pick a manual subset of nodes with the subset argument. Be aware that this approach can drop intermediate nodes and edges, so the result may not be connected.

vis_drake_graph(
  my_plan,
  subset = c("regression2_small", "large")
)

The rest of the subgraph functionality preserves connectedness. Use targets_only to ignore the imports.

vis_drake_graph(my_plan, targets_only = TRUE)

Similarly, you can just show downstream nodes.

vis_drake_graph(my_plan, from = c("regression2_small", "regression2_large"))

Or upstream ones.

vis_drake_graph(my_plan, from = "small", mode = "in")

In fact, let us just take a small neighborhood around a target in both directions. For the graph below, the given order is 1, but all the custom file_out() output files of the neighborhood’s targets appear as well. This ensures consistent behavior between show_output_files = TRUE and show_output_files = FALSE (more on that later).

vis_drake_graph(my_plan, from = "small", mode = "all", order = 1)
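
To check what a neighborhood contains before rendering it, you can pass the same from, mode, and order arguments to drake_graph_info(). This is a small sketch, assuming the mtcars example is loaded. The name nbhd is arbitrary.

```r
library(drake)
load_mtcars_example()

# Same neighborhood as the graph above, but as data.
nbhd <- drake_graph_info(
  my_plan,
  from = "small",
  mode = "all",
  order = 1
)

# IDs of the nodes that survive the filtering.
nbhd$nodes$id

# Compare with the number of nodes in the full graph.
nrow(drake_graph_info(my_plan)$nodes)
```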

17.5 Control the vis_drake_graph() legend

Some arguments to vis_drake_graph() control the legend.

vis_drake_graph(my_plan, full_legend = TRUE, ncol_legend = 2)

To remove the legend altogether, set the ncol_legend argument to 0.

vis_drake_graph(my_plan, ncol_legend = 0)

17.6 Clusters

With the group and clusters arguments to the graphing functions, you can condense nodes into clusters. This is handy for workflows with lots of targets. Take the schools scenario from the drake plan guide. Our plan was generated with drake_plan(trace = TRUE), so it has wildcard columns that group nodes into natural clusters already. You can manually add such columns if you wish.

# Visit https://books.ropensci.org/drake/static.html
# to learn about the syntax with target(transform = ...).
plan <- drake_plan(
  school = target(
    get_school_data(id),
    transform = map(id = c(1, 2, 3))
  ),
  credits = target(
    fun(school),
    transform = cross(
      school,
      fun = c(check_credit_hours, check_students, check_graduations)
    )
  ),
  public_funds_school = target(
    command = check_public_funding(school),
    transform = map(school = c(school_1, school_2))
  ),
  trace = TRUE
)
plan
#> # A tibble: 14 x 7
#>    target       command      fun     school credits     public_funds_scho… id   
#>    <chr>        <expr_lst>   <chr>   <chr>  <chr>       <chr>              <chr>
#>  1 credits_che… check_credi… check_… schoo… credits_ch… <NA>               <NA> 
#>  2 credits_che… check_stude… check_… schoo… credits_ch… <NA>               <NA> 
#>  3 credits_che… check_gradu… check_… schoo… credits_ch… <NA>               <NA> 
#>  4 credits_che… check_credi… check_… schoo… credits_ch… <NA>               <NA> 
#>  5 credits_che… check_stude… check_… schoo… credits_ch… <NA>               <NA> 
#>  6 credits_che… check_gradu… check_… schoo… credits_ch… <NA>               <NA> 
#>  7 credits_che… check_credi… check_… schoo… credits_ch… <NA>               <NA> 
#>  8 credits_che… check_stude… check_… schoo… credits_ch… <NA>               <NA> 
#>  9 credits_che… check_gradu… check_… schoo… credits_ch… <NA>               <NA> 
#> 10 public_fund… check_publi… <NA>    schoo… <NA>        public_funds_scho… <NA> 
#> 11 public_fund… check_publi… <NA>    schoo… <NA>        public_funds_scho… <NA> 
#> 12 school_1     get_school_… <NA>    schoo… <NA>        <NA>               1    
#> 13 school_2     get_school_… <NA>    schoo… <NA>        <NA>               2    
#> 14 school_3     get_school_… <NA>    schoo… <NA>        <NA>               3

Ordinarily, the workflow graph gives a separate node to each individual import object or target.

vis_drake_graph(plan)

For large projects with hundreds of nodes, this can get quite cumbersome. But here, we can choose a wildcard column (or any other column in the plan, even custom columns) to condense nodes into natural clusters. For the group argument to the graphing functions, choose the name of a column in plan or a column you know will be in drake_graph_info(my_plan)$nodes. Then for clusters, choose the values in your group column that correspond to nodes you want to bunch together. The new graph is not as cumbersome.

vis_drake_graph(plan,
  group = "school",
  clusters = c("school_1", "school_2", "school_3")
)

As previously mentioned, you can group on any column in drake_graph_info(my_plan)$nodes. Let’s return to the mtcars project for demonstration.

vis_drake_graph(my_plan)

Let’s condense all the imports into one node and all the up-to-date targets into another. That way, the outdated targets stand out.

vis_drake_graph(
  my_plan,
  group = "status",
  clusters = c("imported", "up to date")
)
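
Clustering also changes the underlying node data, which you can inspect through the group and clusters arguments of drake_graph_info(). The sketch below assumes the mtcars example is loaded. It clusters only the imports so that it does not depend on whether make() has already run. The names full and clustered are arbitrary.

```r
library(drake)
load_mtcars_example()

full <- drake_graph_info(my_plan)
clustered <- drake_graph_info(
  my_plan,
  group = "status",
  clusters = "imported"
)

# The imported nodes are condensed into a cluster,
# so the clustered node data frame has fewer rows.
nrow(full$nodes)
nrow(clustered$nodes)
```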

17.7 Output files

drake can reproducibly track multiple output files per target and show them in the graph.

plan <- drake_plan(
  target1 = {
    file.copy(file_in("in1.txt"), file_out("out1.txt"))
    file.copy(file_in("in2.txt"), file_out("out2.txt"))
  },
  target2 = {
    file.copy(file_in("out1.txt"), file_out("out3.txt"))
    file.copy(file_in("out2.txt"), file_out("out4.txt"))
  }
)
writeLines("in1", "in1.txt")
writeLines("in2", "in2.txt")
make(plan)
#> ▶ target target1
#> ▶ target target2

writeLines("abcdefg", "out3.txt")
vis_drake_graph(plan, targets_only = TRUE)

If your graph is too busy, you can hide the output files with show_output_files = FALSE.

vis_drake_graph(plan, show_output_files = FALSE, targets_only = TRUE)

17.8 Node Selection

(Supported in drake > 7.7.0 only)

First, we define our plan, adding a custom column named “link”.

mtcars_link <-
  "https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html"

plan <- drake_plan(
  mtc = target(
    mtcars,
    link = !!mtcars_link
  ),
  mtc2 = target(
    mtc,
    link = !!mtcars_link
  ),
  mtc3 = target(
    modify_mtc2(mtc2, number),
    transform = map(number = !!c(1:3), .tag_in = cluster_id),
    link = !!mtcars_link
  ),
  trace = TRUE
)
unique_stems <- unique(plan$cluster_id)

17.8.1 Perform the default action on select

If you supply on_select = TRUE and on_select_col = "my_column" to vis_drake_graph(), drake treats the values in the column named "my_column" as hyperlinks. Click on a node in the graph to navigate to the corresponding link in your browser.

vis_drake_graph(
  plan,
  clusters = unique_stems,
  group = "cluster_id",
  on_select_col = "link",
  on_select = TRUE
)

17.8.2 Perform no action on select

No action will be taken if any of the following are given to vis_drake_graph():

  • on_select = NULL,
  • on_select = FALSE,
  • on_select_col = NULL

This is the default behaviour.

vis_drake_graph(
  plan,
  clusters = unique_stems,
  group = "cluster_id",
  on_select_col = "link",
  on_select = NULL
)

17.8.3 Customize the onSelect event behaviour

What if we instead wanted the browser to display an alert when a node is clicked?

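alert_behaviour <- function(){
  # Return the JavaScript callback as a string.
  # The callback reads the on_select_col value of the selected node.
  "
  function(props) {
    alert('selected node with on_select_col: \\r\\n' +
            this.body.data.nodes.get(props.nodes[0]).on_select_col);
  }"
}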

vis_drake_graph(
  plan,
  on_select_col = "link",
  on_select = alert_behaviour()
)

17.9 Enhanced interactivity

For enhanced interactivity, including custom interactive target documentation, see the mandrake R package. For a taste of the functionality, visit this vignette page and click the mtcars node in the graph.

Copyright Eli Lilly and Company