Chapter 10 Validating a small hierarchical model with Stan
The goal of this example workflow is to validate a small Bayesian hierarchical model.
y_i ~ iid Normal(alpha + x_i * beta, sigma^2)
alpha ~ Normal(0, 1)
beta ~ Normal(0, 1)
sigma ~ Uniform(0, 1)
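This specification lives in `stan/model.stan`. The authoritative version is in the example repository, but a minimal sketch of what such a file could contain looks like this:

```stan
data {
  int<lower=1> n;
  vector[n] x;
  vector[n] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0, upper=1> sigma; // bounds match the Uniform(0, 1) prior
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ uniform(0, 1);
  // Stan's normal() is parameterized by the standard deviation,
  // so sigma here corresponds to sigma^2 as the variance above.
  y ~ normal(alpha + x * beta, sigma);
}
```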
We simulate multiple datasets from the model and fit the model on each dataset. For each model fit, we determine if the 50% credible interval of the regression coefficient `beta` contains the true value of `beta` used to generate the data. If we implemented the model correctly, roughly 50% of the models should recapture the true `beta` in their 50% credible intervals.
10.1 The drake project
Because of the long computation time involved, this chapter of the manual does not actually run the analysis code. The complete code can be found at https://github.com/wlandau/drake-examples/tree/main/stan and downloaded with `drake::drake_example("stan")`, and we encourage you to try out the code yourself. This chapter walks through the functions and plan and explains the overall thought process.
The file structure is that of a typical `drake` project, with some additions to allow optional high-performance computing on a cluster.
```
├── run.sh
├── run.R
├── _drake.R
├── sge.tmpl
├── R/
├──── packages.R
├──── functions.R
├──── plan.R
├── stan/
├──── model.stan
└── report.Rmd
```
| File | Purpose |
|---|---|
| `run.sh` | Shell script to run `run.R` in a persistent background process. Works on Unix-like systems. Helpful for long computations on servers. |
| `run.R` | R script to run `r_make()`. |
| `_drake.R` | The special R script that powers functions `r_make()` and friends (details here). |
| `sge.tmpl` | A `clustermq` template file to deploy targets in parallel to a Sun Grid Engine cluster. |
| `R/packages.R` | A custom R script loading the packages we need. |
| `R/functions.R` | A custom R script with user-defined functions. |
| `R/plan.R` | A custom R script that defines the `drake` plan. |
| `stan/model.stan` | The specification of our Stan model. |
| `report.Rmd` | An R Markdown report summarizing the results of the analysis. |
The following sections walk through the functions and plan.
10.2 Functions
Good functions have meaningful inputs and outputs that are easy to generate. For data analysis, good inputs and outputs are typically datasets, models, and summaries of fitted models. The functions below for our Stan workflow follow this pattern.
First, we need a function to compile the model. It accepts a Stan model specification file (a `*.stan` text file) and returns the paths to both the model file and the compiled RDS file. (We need to set `rstan_options(auto_write = TRUE)` to make sure `stan_model()` generates the RDS file.) We return the file paths because the target that uses this function will be a dynamic file target. `drake` will reproducibly watch these files for changes and automatically recompile and run the model if changes are detected.
```r
compile_model <- function(model_file) {
  stan_model(model_file, auto_write = TRUE, save_dso = TRUE)
  c(model_file, path_ext_set(model_file, "rds"))
}
```
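For reference, `path_ext_set()` comes from the `fs` package (presumably loaded in `R/packages.R`); it swaps a file's extension, which is how we derive the compiled model path from the specification path:

```r
library(fs)

# Swap the .stan extension for .rds to get the compiled model path.
path_ext_set("stan/model.stan", "rds")
#> stan/model.rds
```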
Next, we need a function to simulate a dataset from the hierarchical model.
```r
simulate_data <- function() {
  alpha <- rnorm(1, 0, 1)
  beta <- rnorm(1, 0, 1)
  sigma <- runif(1, 0, 1)
  x <- rbinom(100, 1, 0.5)
  y <- rnorm(100, alpha + x * beta, sigma)
  tibble(x = x, y = y, beta_true = beta)
}
```
Lastly, we write a function to fit the compiled model to a simulated dataset. We pass in the `*.stan` model specification file, but `rstan` is smart enough to use the compiled RDS model file instead if available.
In Bayesian data analysis workflows with many runs of the same model, we need to make a conscious effort to conserve computing resources. That means we should not save all the posterior samples from every single model fit. Instead, we compute summary statistics on the chains such as posterior quantiles, coverage in credible intervals, and convergence diagnostics.
```r
fit_model <- function(model_file, data) {
  # From https://github.com/stan-dev/rstan/issues/444#issuecomment-445108185,
  # we need each stanfit object to have its own unique name, so we create
  # a special new environment for it.
  tmp_envir <- new.env(parent = baseenv())
  tmp_envir$model <- stan_model(model_file, auto_write = TRUE, save_dso = TRUE)
  tmp_envir$fit <- sampling(
    object = tmp_envir$model,
    data = list(x = data$x, y = data$y, n = nrow(data)),
    refresh = 0
  )
  mcmc_list <- As.mcmc.list(tmp_envir$fit)
  samples <- as.data.frame(as.matrix(mcmc_list))
  beta_25 <- quantile(samples$beta, 0.25)
  beta_median <- quantile(samples$beta, 0.5)
  beta_75 <- quantile(samples$beta, 0.75)
  beta_true <- data$beta_true[1]
  beta_cover <- beta_25 < beta_true && beta_true < beta_75
  psrf <- max(gelman.diag(mcmc_list, multivariate = FALSE)$psrf[, 1])
  ess <- min(effectiveSize(mcmc_list))
  tibble(
    beta_cover = beta_cover,
    beta_true = beta_true,
    beta_25 = beta_25,
    beta_median = beta_median,
    beta_75 = beta_75,
    psrf = psrf,
    ess = ess
  )
}
```
10.3 Plan
Our `drake` plan is defined in the `R/plan.R` script.
```r
plan <- drake_plan(
  model_file = target(
    compile_model("stan/model.stan"),
    format = "file",
    hpc = FALSE
  ),
  index = target(
    seq_len(1000), # Change the number of simulations here.
    hpc = FALSE
  ),
  data = target(
    simulate_data(),
    dynamic = map(index),
    format = "fst_tbl"
  ),
  fit = target(
    fit_model(model_file[1], data),
    dynamic = map(data),
    format = "fst_tbl"
  ),
  report = target(
    render(
      knitr_in("report.Rmd"),
      output_file = file_out("report.html"),
      quiet = TRUE
    ),
    hpc = FALSE
  )
)
```
The following subsections describe the strategy and practical adjustments behind each target.
10.3.1 Model
The `model_file` target is a dynamic file target that reproducibly tracks our Stan model specification file (`stan/model.stan`) and compiled model file (`stan/model.rds`). Below, `format = "file"` indicates that the target is a dynamic file target, and `hpc = FALSE` tells `drake` not to run the target on a parallel worker in high-performance computing scenarios.
```r
model_file = target(
  compile_model("stan/model.stan"),
  format = "file",
  hpc = FALSE
),
```
10.3.2 Index
The `index` target is simply a numeric vector from 1 to the number of simulations. To fit our model multiple times, we are going to dynamically map over `index`. This is a small target, and we do not want to waste expensive computing resources on it, so we set `hpc = FALSE`.
```r
index = target(
  seq_len(1000), # Change the number of simulations here.
  hpc = FALSE
)
```
10.3.3 Data
`data` is a dynamic target with one sub-target per simulated dataset, so we write `dynamic = map(index)` below. In addition, these datasets are data frames, so we choose `format = "fst_tbl"` below to increase read/write speeds and conserve storage space. Read here for more on specialized storage formats.
```r
data = target(
  simulate_data(),
  dynamic = map(index),
  format = "fst_tbl"
)
```
10.3.4 Fit
We want to fit our model once for each simulated dataset, so our `fit` target dynamically maps over the datasets with `dynamic = map(data)`. We also pass in the path to the `stan/model.stan` specification file, but `rstan` is smart enough to use the compiled model `stan/model.rds` instead if available. And since `fit_model()` returns a data frame, we also choose `format = "fst_tbl"` here.
```r
fit = target(
  fit_model(model_file[1], data),
  dynamic = map(data),
  format = "fst_tbl"
)
```
10.3.5 Report
R Markdown reports should never do any heavy lifting in `drake` pipelines. They should simply leverage the computationally expensive work done in the previous targets. If we follow this good practice and our report renders quickly, we should not need heavy computing resources to process it, and we can set `hpc = FALSE` below.
The `report.Rmd` file itself has `loadd()` and `readd()` statements to refer to these targets, and with the `knitr_in()` keyword below, `drake` knows that it needs to update the report when the models or datasets change. Similarly, `file_out("report.html")` tells `drake` to rerun the report if the output file gets corrupted.
```r
report = target(
  render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  ),
  hpc = FALSE
)
```
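To make this concrete, a chunk inside `report.Rmd` could look like the sketch below. The exact contents of the report are in the example repository; the target names here come from the plan, and the summary shown is one plausible check, not necessarily the one the report performs:

```r
library(drake)

# Load the row-bound summaries from all the dynamic sub-targets of `fit`.
loadd(fit)

# If the model is implemented correctly, roughly 50% of the 50% credible
# intervals should cover the true beta, so this should be near 0.5.
mean(fit$beta_cover)
```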
10.4 Try it out!
The complete code can be found at https://github.com/wlandau/drake-examples/tree/main/stan and downloaded with `drake::drake_example("stan")`, and we encourage you to try out the code yourself.
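Based on the file descriptions above, a minimal interactive session might look like the following sketch. The repository's own `run.R` may set additional options (for example, `clustermq` settings for the Sun Grid Engine template), so treat this as a starting point rather than the canonical invocation:

```r
library(drake)

drake_example("stan") # Download the example into a stan/ folder (once).
setwd("stan")         # Enter the project folder, which contains _drake.R.
r_make()              # Run the plan defined in _drake.R.
```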