Chapter 15 Memory management
The default settings of `drake` prioritize speed over memory efficiency. For projects with large data, this default behavior can cause problems. Consider the following hypothetical workflow, where we simulate several large datasets and summarize them.
```r
library(drake)
library(dplyr) # bind_rows()
library(tibble)

reps <- 10 # Serious workflows may have several times more.

# Reduce `n` to lighten the load if you want to try this workflow yourself.
# It is super high in this chapter to motivate the memory issues.
generate_large_data <- function(rep, n = 1e8) {
  tibble(x = rnorm(n), y = rnorm(n), rep = rep)
}

get_means <- function(...) {
  out <- NULL
  for (dataset in list(...)) {
    out <- bind_rows(out, colMeans(dataset))
  }
  out
}

plan <- drake_plan(
  large_data = target(
    generate_large_data(rep),
    transform = map(rep = !!seq_len(reps), .id = FALSE)
  ),
  means = target(
    get_means(large_data),
    transform = combine(large_data)
  ),
  summ = summary(means)
)

print(plan)

vis_drake_graph(plan)
```
If you call `make(plan)` with no additional arguments, `drake` will try to load all the datasets into the same R session. Each dataset from `generate_large_data(n = 1e8)` occupies about 2.4 GB of memory, and most machines cannot handle all the data at once. We should use memory more wisely.
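The arithmetic behind that figure: each dataset has three numeric columns (`x`, `y`, and `rep`), and each column holds 1e8 doubles at 8 bytes apiece.

```r
3 * 1e8 * 8 / 1e9 # Approximate size of one dataset in GB.
#> [1] 2.4
```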
15.1 Garbage collection and custom files
`make()` has a `garbage_collection` argument, which tells `drake` to periodically unload data objects that no longer belong to variables. You can also run garbage collection manually with the `gc()` function. For more on garbage collection, please refer to the memory usage chapter of *Advanced R*.
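As a quick illustration of why `gc()` matters (a toy snippet of ours, separate from the workflow): removing a variable binding does not immediately reclaim the memory; garbage collection does.

```r
x <- rnorm(1e7) # Allocate roughly 80 MB.
rm(x)           # Drop the binding; the memory is not reclaimed yet.
gc()            # Reclaim it now instead of waiting for an automatic pass.
```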
Let’s reduce the memory consumption of our example workflow:
- Call `gc()` after every loop iteration of `get_means()`.
- Avoid `drake`'s caching system with custom `file_out()` files in the plan.
- Call `make(plan, garbage_collection = TRUE)`.
```r
reps <- 10 # Serious workflows may have several times more.
files <- paste0(seq_len(reps), ".rds")

generate_large_data <- function(file, n = 1e8) {
  out <- tibble(x = rnorm(n), y = rnorm(n)) # 1e8 rows by default.
  saveRDS(out, file)
}

get_means <- function(files) {
  out <- NULL
  for (file in files) {
    x <- colMeans(readRDS(file))
    out <- bind_rows(out, x)
    gc() # Use the gc() function here to make sure each x gets unloaded.
  }
  out
}

plan <- drake_plan(
  large_data = target(
    generate_large_data(file = file_out(file)),
    transform = map(file = !!files, .id = FALSE)
  ),
  means = get_means(file_in(!!files)),
  summ = summary(means)
)

print(plan)

vis_drake_graph(plan)

make(plan, garbage_collection = TRUE)
```
15.2 Memory strategies
`make()` has a `memory_strategy` argument to customize how `drake` loads and unloads targets. With the right memory strategy, you can rely on `drake`'s built-in caching system without having to bother with messy `file_out()` files.
Each memory strategy follows three stages for each target:
- Initial discard: before building the target, optionally discard some other targets from the R session. The choice of discards depends on the memory strategy. (Note: we do not actually get the memory back until we call `gc()`.)
- Initial load: before building the target, optionally load any dependencies that are not already in memory.
- Final discard: optionally discard or keep the return value after the target finishes building. Either way, the return value is still stored in the cache, so you can load it with `loadd()` and `readd()`.
The implementation of these steps varies from strategy to strategy.
| Memory strategy | Initial discard | Initial load | Final discard |
|---|---|---|---|
| `"speed"` | Discard nothing. | Load any missing dependencies. | Keep the return value loaded. |
| `"autoclean"` | Discard all targets which are not dependencies of the current target. | Load any missing dependencies. | Discard the return value. |
| `"preclean"` | Discard all targets which are not dependencies of the current target. | Load any missing dependencies. | Keep the return value loaded. |
| `"lookahead"` | Discard all targets which are not dependencies of either (1) the current target or (2) other targets waiting to be checked or built. | Load any missing dependencies. | Keep the return value loaded. |
| `"unload"` | Unload all targets. | Load nothing. | Discard the return value. |
| `"none"` | Unload nothing. | Load nothing. | Discard the return value. |
With the `"speed"`, `"autoclean"`, `"preclean"`, and `"lookahead"` strategies, you can simply call `make(plan, memory_strategy = YOUR_CHOICE, garbage_collection = TRUE)` and trust that your targets will build normally. For the `"unload"` and `"none"` strategies, there is extra work to do: you will need to manually load each target's dependencies with `loadd()` or `readd()`. This manual bookkeeping lets you aggressively optimize your workflow, and it is less cumbersome than swarms of `file_out()` files. It is particularly useful when you have a large `combine()` step.
Let's redesign the workflow to reap the benefits of `make(plan, memory_strategy = "none", garbage_collection = TRUE)`. The trick is to use `match.call()` inside `get_means()` so we can load and unload dependencies one at a time instead of all at once.
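To see the trick in isolation, here is a toy function (ours, not part of the plan) that captures the names of its `...` arguments without evaluating them:

```r
f <- function(...) {
  as.character(match.call(expand.dots = FALSE)$...)
}
f(large_data_1, large_data_2)
#> [1] "large_data_1" "large_data_2"
```

Inside `get_means()`, those names are exactly the target names that `combine()` supplies, so we can pass them to `readd()` one at a time.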
```r
reps <- 10 # Serious workflows may have several times more.

generate_large_data <- function(rep, n = 1e8) {
  tibble(x = rnorm(n), y = rnorm(n), rep = rep)
}

# Load targets one at a time.
get_means <- function(...) {
  arg_symbols <- match.call(expand.dots = FALSE)$...
  arg_names <- as.character(arg_symbols)
  out <- NULL
  for (arg_name in arg_names) {
    dataset <- readd(arg_name, character_only = TRUE)
    out <- bind_rows(out, colMeans(dataset))
    gc() # Run garbage collection.
  }
  out
}

plan <- drake_plan(
  large_data = target(
    generate_large_data(rep),
    transform = map(rep = !!seq_len(reps), .id = FALSE)
  ),
  means = target(
    get_means(large_data),
    transform = combine(large_data)
  ),
  summ = {
    loadd(means) # Annoying, but necessary with the "none" strategy.
    summary(means)
  }
)
```
Now, we can build our targets.
```r
make(plan, memory_strategy = "none", garbage_collection = TRUE)
```
But there is a snag: we needed to manually load `means` in the command for `summ` (notice the call to `loadd()`). This is annoying, especially because `means` is quite small. Fortunately, `drake` lets you define different memory strategies for different targets in the plan. The target-specific memory strategies override the global one (i.e. the `memory_strategy` argument of `make()`).
```r
plan <- drake_plan(
  large_data = target(
    generate_large_data(rep),
    transform = map(rep = !!seq_len(reps), .id = FALSE),
    memory_strategy = "none"
  ),
  means = target(
    get_means(large_data),
    transform = combine(large_data),
    memory_strategy = "unload" # Be careful with this one.
  ),
  summ = summary(means)
)

print(plan)
```
In fact, now you can run `make()` without setting a global memory strategy at all.
```r
make(plan, garbage_collection = TRUE)
```
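Even though the return values were unloaded from the R session, they remain in the cache, so we can still retrieve the small summary afterwards:

```r
readd(summ)
```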
15.3 Data splitting
The `split()` transformation breaks up a dataset into smaller targets. The ordinary use of `split()` is to partition an in-memory dataset into slices.
```r
drake_plan(
  data = get_large_data(),
  x = target(
    data %>%
      analyze_data(),
    transform = split(data, slices = 4)
  )
)
```
However, you can also use `split()` to load individual pieces of a large file, thus conserving memory. The trick is to break up an index set instead of the data itself. In the following sketch, `get_number_of_rows()` and `read_selected_rows()` are user-defined functions, and `%>%` is the `magrittr` pipe.
```r
get_number_of_rows <- function(file) {
  # ...
}

read_selected_rows <- function(which_rows, file) {
  # ...
}

plan <- drake_plan(
  row_indices = file_in("large_file.csv") %>%
    get_number_of_rows() %>%
    seq_len(),
  subset = target(
    row_indices %>%
      read_selected_rows(file = file_in("large_file.csv")),
    transform = split(row_indices, slices = 4)
  )
)

plan

drake_plan_source(plan)
```
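The stubs above are intentionally left open. For concreteness, here is one hypothetical way to fill them in, assuming a comma-separated file with a single header row and assuming each slice from `split()` is a contiguous block of row indices (these helpers are ours, not part of the original sketch):

```r
get_number_of_rows <- function(file) {
  # Count the data rows, excluding the single header row.
  length(count.fields(file, sep = ",")) - 1
}

read_selected_rows <- function(which_rows, file) {
  # Reuse the column names from the header, then skip straight to the
  # first requested row and read only the rows in this slice.
  header <- read.csv(file, nrows = 1)
  out <- read.csv(
    file,
    skip = min(which_rows), # The header row plus the rows before the slice.
    nrows = length(which_rows),
    header = FALSE
  )
  names(out) <- names(header)
  out
}
```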