Chapter 11 An analysis of R package download trends
This chapter explores R package download trends using the cranlogs package, and it shows how drake's custom triggers can help with workflows that rely on remote data sources.
11.1 Get the code.
Write the code files to your workspace.
drake_example("packages")
The new packages
folder now includes a file structure of a serious drake
project, plus an interactive-tutorial.R
to narrate the example. The code is also online here.
11.2 Overview
This small data analysis project explores some trends in R package downloads over time. The datasets are downloaded using the cranlogs package.
library(cranlogs)
cran_downloads(packages = "dplyr", when = "last-week")
Above, each count is the number of times dplyr
was downloaded from the RStudio CRAN mirror on the given day. To stay up to date with the latest download statistics, we need to refresh the data frequently. With drake
, we can bring all our work up to date without restarting everything from scratch.
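For a quick look at what cranlogs returns, we can store the result and inspect the first few rows. (A sketch for illustration only; the data frame has one row per day, with date, count, and package columns.)
downloads <- cran_downloads(packages = "dplyr", when = "last-week")
head(downloads)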
11.3 Analysis
First, we load the required packages. drake
detects the packages you install and load.
library(cranlogs)
library(drake)
library(dplyr)
library(ggplot2)
library(knitr)
library(rvest)
We will want custom functions to summarize the CRAN logs we download.
make_my_table <- function(downloads){
  group_by(downloads, package) %>%
    summarize(mean_downloads = mean(count))
}
make_my_plot <- function(downloads){
  ggplot(downloads) +
    geom_line(aes(x = date, y = count, group = package, color = package))
}
Next, we generate the plan.
We want to explore the daily downloads from the knitr
, Rcpp
, and ggplot2
packages. We will use the cranlogs
package to get daily logs of package downloads from RStudio’s CRAN mirror. In our drake_plan()
, we declare targets older
and recent
to contain snapshots of the logs.
The following drake_plan() syntax is described here. It is supported in drake 7.0.0 and above.
plan <- drake_plan(
  older = cran_downloads(
    packages = c("knitr", "Rcpp", "ggplot2"),
    from = "2016-11-01",
    to = "2016-12-01"
  ),
  recent = target(
    command = cran_downloads(
      packages = c("knitr", "Rcpp", "ggplot2"),
      when = "last-month"
    ),
    trigger = trigger(change = latest_log_date())
  ),
  averages = target(
    make_my_table(data),
    transform = map(data = c(older, recent))
  ),
  plot = target(
    make_my_plot(data),
    transform = map(data)
  ),
  report = knit(
    knitr_in("report.Rmd"),
    file_out("report.md"),
    quiet = TRUE
  )
)
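The transform = map() steps expand into one target per dataset, which is why the cache later contains targets such as averages_older, averages_recent, plot_older, and plot_recent. To see the expanded plan, we can print it. (drake_plan_source() is available in recent versions of drake; printing the plan alone also works.)
plan                    # One row per target.
drake_plan_source(plan) # The expanded plan as drake_plan() code.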
Notice the custom trigger for the target recent
. Here, we are telling drake
to rebuild recent
whenever a new day’s log is uploaded to http://cran-logs.rstudio.com. In other words, drake
keeps track of the return value of latest_log_date()
and recomputes recent
(during make()
) if that value changed since the last make()
. Here, latest_log_date()
is one of our custom imported functions. We use it to scrape http://cran-logs.rstudio.com using the rvest
package.
latest_log_date <- function(){
  read_html("http://cran-logs.rstudio.com/") %>%
    html_nodes("li:last-of-type") %>%
    html_nodes("a:last-of-type") %>%
    html_text() %>%
    max
}
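To preview the value drake will track, we can call the function directly. (It needs an internet connection, and the exact string depends on what the site currently lists.)
latest_log_date() # The newest entry at http://cran-logs.rstudio.com/.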
Now, we run the project to download the data and analyze it.
The results will be summarized in the knitted report, report.md
,
but you can also read the results directly from the cache.
make(plan)
readd(averages_recent)
readd(averages_older)
readd(plot_recent)
readd(plot_older)
If we run make()
again right away, we see that everything is up to date. But if we wait until a new day’s log is uploaded, make()
will update recent
and everything that depends on it.
make(plan)
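To check which targets would rebuild without actually running them, we can ask drake directly. (A sketch; outdated() accepts the plan in the same way as make() in recent versions of drake.)
outdated(plan) # Character vector of targets that are no longer up to date.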
To visualize the build behavior, you can plot the dependency network.
vis_drake_graph(plan)
11.4 Other ways to trigger downloads
Sometimes, our remote data sources get revised, and web scraping may not be the best way to detect changes. We may want to look at our remote dataset’s modification time or HTTP ETag. To see how this works, consider the CRAN log file from February 9, 2018.
<- "http://cran-logs.rstudio.com/2018/2018-02-09-r.csv.gz" url
We can track the modification date using the httr
package.
library(httr) # For querying websites.
HEAD(url)$headers[["last-modified"]]
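If the server also sends an ETag header, that value could serve as an alternative fingerprint. (An assumption: not every server provides one.) Below, we stick with the timestamp.
HEAD(url)$headers[["etag"]] # Could replace the timestamp in trigger(change = ...).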
In our drake
plan, we can track this timestamp and trigger a download whenever it changes.
plan <- drake_plan(
  logs = target(
    get_logs(url),
    trigger = trigger(change = HEAD(url)$headers[["last-modified"]])
  )
)

plan
where
library(R.utils) # For unzipping the files we download.
library(curl) # For downloading data.
get_logs <- function(url){
  curl_download(url, "logs.csv.gz")       # Get a big file.
  gunzip("logs.csv.gz", overwrite = TRUE) # Unzip it.
  out <- read.csv("logs.csv", nrows = 4)  # Extract the data you need.
  unlink(c("logs.csv.gz", "logs.csv"))    # Remove the big files.
  out                                     # Value of the target.
}
When we are ready, we run the workflow.
make(plan)
readd(logs)
If the log file at the url
ever changes, the timestamp will update remotely, and make()
will download the file again.
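If we ever want to force a fresh download regardless of the header, one option (a sketch) is to remove the cached target and run make() again.
clean(logs) # Drop the cached target so the next make() rebuilds it.
make(plan)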