The following is the actual code for the stash()
function, the main function of the ‘mustashe’ package. I have only added
a few more comments for clarification.
stash <- function(var,
code,
depends_on = NULL,
functional = NULL,
verbose = NULL) {
if (is.null(functional)) functional <- mustashe_functional()
if (is.null(verbose)) verbose <- mustashe_verbose()
# Make sure the stashing directory ".mustashe" is available.
check_stash_dir()
# Deparse and format the code.
deparsed_code <- deparse(substitute(code))
formatted_code <- format_code(deparsed_code)
# Make sure the `var` and `code` are valid.
var <- validate_var(var, functional)
if (formatted_code == "NULL") stop("`code` cannot be NULL")
# The environment where all code is evaluated and variables assigned.
target_env <- parent.frame()
# Make a new hash table.
new_hash_tbl <- make_hash_table(formatted_code, depends_on, target_env)
# if the variable has been stashed:
# if the hash tables are equivalent:
# load the stored variable
# else:
# make a new stash
# else:
# make a new stash
if (has_been_stashed(var)) {
old_hash_tbl <- get_hash_table(var)
if (hash_tables_are_equivalent(old_hash_tbl, new_hash_tbl)) {
if (verbose) {
message("Loading stashed object.")
}
res <- load_variable(var, functional, target_env)
} else {
if (verbose) {
message("Updating stash.")
}
res <- new_stash(
var, formatted_code, new_hash_tbl, functional, target_env
)
}
} else {
if (verbose) {
message("Stashing object.")
}
res <- new_stash(var, formatted_code, new_hash_tbl, functional, target_env)
}
invisible(res)
}
Overall, I believe the logic is quite simple. The steps that the
stash()
function follows, further explained in the
following sections, are:
The first step taken by the stash()
function is to
deparse and format the code.
Deparsing the code means to turn the unevaluated expression into a
string. The deparsing is done by passing the code immediately
to substitute()
and deparse()
. This must be
done immediately, else the code will be evaluated. The
substitute()
function “returns the parse tree for the
(unevaluated) expression expr
, substituting any variables
bound in env.”
substitute(x <- 1)
#> x <- 1
The deparse()
function “Turn[s] unevaluated expressions
into character strings.” Paired with substitute()
, it
returns a string of the unevaluated code.
deparse(substitute(x <- 1))
#> [1] "x <- 1"
With the code now as a string, it is formatted using the
tidy_source()
function from ‘formatR’. An
internal function in ‘mustashe’, format_code()
handles this
process:
format_code <- function(code) {
fmt_code <- formatR::tidy_source(
text = code,
comment = FALSE,
blank = FALSE,
arrow = TRUE,
brace.newline = FALSE,
indent = 4,
wrap = TRUE,
output = FALSE,
width.cutoff = 80
)$text.tidy
paste(fmt_code, sep = "", collapse = "\n")
}
format_code("x <- 2")
#> [1] "x <- 2"
The purpose of formatting the code is so any stylistic changes to the
code
input do not affect the hash table. To demonstrate
this, notice how the output from format_code()
is the same
between the two different code examples.
format_code("x=2")
#> [1] "x <- 2"
format_code(("x <- 2 # a comment"))
#> [1] "x <- 2"
The hash table is a two-column table with the name and hash value of the code and any (optional) dependencies.
The hashing is handled by the ‘digest’ package. It takes a value and reproducibly produces a unique hash value.
digest::digest("mustashe")
#> [1] "ac2aad9fdb730500c56009bff6154a7e"
A hash value is made for the code and for any of the dependencies
linked to the object. This process is handled by the
make_hash_table(code, depends_on)
internal function.
To tell if the code or dependencies have changed, the new hash table
and stashed hashed table are compared. The function underlying this
process is all.equal()
from base R. This function compares
two objects and “If they are different, [a] comparison is still made to
some extent, and a report of the differences is returned.”
Here is an example of using all.equal()
to compare two
data frames.
# Two data frames with a small difference *
df1 <- data.frame(a = c(1, 2, 3), b = c(5, 6, 7))
df2 <- data.frame(a = c(1, 2, 3), b = c(5, 6, 8))
# When the two data frames are equivalent.
all.equal(df1, df1)
#> [1] TRUE
# When the two data frames are not equivalent.
all.equal(df1, df2)
#> [1] "Component \"b\": Mean relative difference: 0.1428571"
A word of caution, if using all.equal()
for a boolean
comparison (like in an if-statement), make sure to wrap it with
isTRUE
, otherwise it will return TRUE
or
comments on the differences, but not FALSE
. Below is the
actual function stash()
uses to compare hash tables.
If the hash tables are different, that means the code must be
evaluated, the new object be assigned to the desired name
(var
), and the new hash table and value stashed. This is
handled by the internal function new_stash()
.
# Make a new stash from a variable, code, and hash table.
new_stash <- function(var, code, hash_tbl, functional, target_env) {
val <- evaluate_code(code, target_env)
write_hash_table(var, hash_tbl)
write_val(var, val)
if (functional) {
return(val)
} else {
assign_value(var, val, target_env = target_env)
return(NULL)
}
}
The first step is to evaluate the code with the
evaluate_code(code)
function. It uses the
parse()
and eval()
functions and returns the
resulting value.
evaluate_code <- function(code, target_env) {
eval(parse(text = code), envir = new.env(parent = target_env))
}
This value is then assigned the desired name in the global
environment using the internal assign_value(var, val)
function, where .TargetEnv
is a variable in the package
pointing to .GlobalEnv
.
assign_value <- function(var, val, target_env) {
assign(var, val, envir = target_env)
}
Lastly, the hash table and value are written to file using wrapper
functions around qs::qsave()
.
Below are some implementation details that are useful to know about.
When I first created this library, I just made all assignments to and
searched for dependencies in the global environment
(.GlobalEnv
). This generally worked for standard use of the
package, but could easily result in strange behavior if
stash()
was used within functions. In early 2022, Magnus
Torfason (GitHub @torfason) addressed this shortcoming by
capturing the parent environment in stash()
and conducting
code evaluation and dependency search within that R environment.
As described in Issue #8, using functions as dependencies can be tricky because their hashes change after evaluation.
fxn <- function(x) {
x**2
}
print(digest(fxn))
#> [1] "a8f7c79007722107fca655394dbc84b4"
print(digest(fxn))
#> [1] "a8f7c79007722107fca655394dbc84b4"
a <- fxn(2)
print(digest(fxn))
#> [1] "8e9b4716b7dedc4b192dad64b8e783b5"
a <- fxn(2)
print(digest(fxn))
#> [1] "ebc68a800ff72515e3a424afd28e61a6"
a <- fxn(4)
print(digest(fxn))
#> [1] "ebc68a800ff72515e3a424afd28e61a6"
You can actually see a difference in the print-out of a function when in an interactive R session.
fxn <- function(x) {
x**2
}
fxn
#> function(x) {
#> x**2
#> }
a <- fxn(2)
a <- fxn(2)
fxn
#> function(x) {
#> x**2
#> }
#> <bytecode: 0x7f8135013388>
Therefore, stash()
uses a specific process for hashing
functions. As described in the section on “Functions” by Hadley
Wickham in Advanced R, “R
functions have three parts:
body()
, the code inside the function.formals()
, the list of arguments which controls how
you can call the function.environment()
, the “map” of the location of the
function’s variables.”Therefore, stash()
specifically deparses the code of a
function passed as dependency and uses the hash of the resulting string
(of the code of the function) as the hash.
fxn_str <- deparse(fxn)
fxn_str
#> [1] "function (x) " "{" " x^2" "}"
digest(fxn_str)
#> [1] "f04c64f61c3b9165d1869c4266f9ac01"
Any issues and feedback on ‘mustashe’ can be submitted here. Alternatively, I can be reached through the contact form on my website or on Twitter @JoshDoesa