Reproducibility is obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis. This definition is synonymous with “computational reproducibility,” and the terms are used interchangeably in this report.
Replicability is obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.
Science is evidence-based by definition, and findings build on one another. Being able to reproduce the findings of other researchers is therefore nearly as important as contributing ‘new’ evidence. Researchers are fallible in their ability to write bug-free code, and releasing code and data allows these errors to be detected and corrected. Beyond this, reproducibility and open science go hand-in-hand: code that is reproducible is ideally also open access, meaning that any researcher can take the code and data and reproduce the analyses.
Breaking changes to libraries. This occurs when the people maintaining a library make changes, such as adding a new feature, in a way that breaks the functionality of old code that depends on their codebase.
Different operating systems. Code could rely on few external libraries and still fail to run due to differences in installed backend libraries across operating systems. For instance, many spatial data analysis packages in R require system libraries outside of R (e.g., GDAL, GEOS, and PROJ) to be installed before the R packages themselves can be installed.
Link rot. Even if the data and code can run, it does not mean that they will be accessible to everyone, as platforms for hosting code may change over time, resulting in some hyperlinks not properly redirecting to the necessary files.
Resistance to data and code sharing. Data ownership and provenance are often listed as concerns among researchers who do not necessarily want to share their data and code. Recent NSF and NIH mandates are starting to force some researchers’ hands in terms of data and code sharing, but some fraction of researchers will likely always feel this sense of ownership.
The code itself is not reproducible. The last barrier, and perhaps the most obvious. Sharing code does not necessarily mean sharing good code; the shared code may not actually reproduce the analyses that the authors claim it does in their manuscript. This is not all that uncommon. Some authors’ shared code and data files contain only summary data that we must trust is correct, and the code they supply is largely just for visualization.
There are a number of ways we can improve reproducibility in the code we write.
Set the seed. Probabilistic functions give different output every time they run. For instance, each run of

rbinom(10, 1, 0.25)

produces a different string of ten 0s and 1s, and two users independently drawing the same numbers is vanishingly unlikely for something like runif(10, 0, 1). Running the rbinom line five times shows the variation:

## [1] 0 0 0 0 1 0 0 0 0 0
## [1] 0 0 0 1 0 1 0 0 0 0
## [1] 0 0 0 0 0 1 0 1 1 0
## [1] 0 1 0 1 1 0 0 1 0 0
## [1] 1 0 0 0 0 1 0 0 0 1

‘Setting the seed’ refers to setting the parameter which controls the state of the random number generator, and therefore the resulting probabilistic output. So now, if users enter set.seed(1234) before running the code above, everyone should have the same string of 10 numbers.
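As a quick sanity check (a minimal sketch), the same seed really does reproduce the same draws across calls:

```r
# Re-seeding with the same value restarts the random number generator,
# so two runs (or two users) get identical draws
set.seed(1234)
a <- rbinom(10, 1, 0.25)

set.seed(1234)
b <- rbinom(10, 1, 0.25)

identical(a, b)  # TRUE
```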
Document your code. Let’s explore an example of this. I wrote a function that performs a
“min-max standardization” on a given vector of data. This takes every
entry of a vector, subtracts the minimum value, then divides by the
maximum minus the minimum. The effect is that the data are now
scaled between 0 and 1. Below, the function is given proper documentation
following the format designed by the roxygen2 package developers and
incorporated into the workhorse library devtools, both of which are
maintained by Wickham and folks at RStudio/posit. The nice part of this
form of documentation is that the documentation and the function are in
the same file. Long ago, documenting functions for distribution as an R
package required the developer to edit a separate document with all the
details of each function.
#' Min-max standardization
#'
#' @param x a vector of numeric data
#'
#' @return the same vector x, but standardized between 0 and 1
#'
#' @examples
#' getMinMax(x=1:10)
getMinMax <- function(x){
  (x-min(x)) / (max(x)-min(x))
}
Here we include information on what the function does, what arguments it takes, what the output will be, and provide a use case. This is most important for package developers, but it is good practice to provide some documentation of your code.
We can also build in warnings and errors. Let’s take the example of the
function we wrote above. It’s well-documented, but what happens if I
give it a vector that contains an NA value?
## [1] NA NA NA NA NA NA NA NA NA NA NA
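That happens because R propagates missing values: if x contains even one NA, both min(x) and max(x) return NA, so every entry of the standardized vector becomes NA. A quick illustration:

```r
x <- c(1:10, NA)

# A single NA makes min() and max() return NA...
min(x)                # NA
max(x)                # NA

# ...unless we tell R to drop NA values first
min(x, na.rm = TRUE)  # 1
max(x, na.rm = TRUE)  # 10
```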
That’s not great. Ideally, it would keep the NA
values
where they are, but still standardize the vector. Maybe even provide the
user a warning
about NA
values being present?
Let’s implement that now.
#' Min-max standardization
#'
#' @param x a vector of numeric data
#'
#' @return the same vector x, but standardized between 0 and 1. NA values
#'   are retained in the output, but ignored when computing the min and max.
#'
#' @examples
#' getMinMax(x=1:10)
getMinMax <- function(x){
  if(any(is.na(x))){
    warning('The vector x contains NA values. These will be ignored when computing the min and max.')
  }
  (x-min(x, na.rm=TRUE)) / (max(x, na.rm=TRUE)-min(x, na.rm=TRUE))
}
So now, we warn the user that the input data contain NA values, and we
have programmed our function in a way that accounts for them. What other
ways should we think about modifying this function for clarity and
usability?
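As one possible direction (a sketch, not part of the original function): we could add errors for inputs the function cannot sensibly handle, such as non-numeric vectors, or constant vectors, for which the max minus the min is zero and the division would return NaN.

```r
# A hypothetical extension of getMinMax with basic input checks
getMinMax <- function(x){
  if(!is.numeric(x)){
    stop('The vector x must be numeric.')
  }
  if(any(is.na(x))){
    warning('The vector x contains NA values. These will be ignored.')
  }
  rng <- range(x, na.rm=TRUE)  # rng[1] is the min, rng[2] is the max
  if(rng[1] == rng[2]){
    stop('All values of x are equal; min-max standardization is undefined.')
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

getMinMax(x=1:10)       # scaled between 0 and 1
# getMinMax(x=letters)  # errors: x must be numeric
```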
sessionInfo at the end of the document.
Putting sessionInfo at the end of the document will clearly
list the R version, OS, and package versions that you are using. It’s a
couple steps short of full reproducibility, but it at least shows the
user (if you compile the code to html or pdf) exactly what set of
conditions allowed the code to run all the way through.

## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] stringr_1.5.0 rmarkdown_2.22 sf_1.0-13 raster_3.6-20 sp_2.0-0
## [6] rgbif_3.7.7
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.3 bslib_0.5.0 xfun_0.39 ggplot2_3.4.2
## [5] lattice_0.21-8 tzdb_0.4.0 vctrs_0.6.3 tools_4.3.0
## [9] generics_0.1.3 curl_5.0.1 parallel_4.3.0 tibble_3.2.1
## [13] proxy_0.4-27 fansi_1.0.4 highr_0.10 pkgconfig_2.0.3
## [17] KernSmooth_2.23-20 data.table_1.14.8 lifecycle_1.0.3 compiler_4.3.0
## [21] tinytex_0.45 munsell_0.5.0 terra_1.7-37 codetools_0.2-19
## [25] sass_0.4.6 htmltools_0.5.5 class_7.3-21 yaml_2.3.7
## [29] lazyeval_0.2.2 jquerylib_0.1.4 pillar_1.9.0 crayon_1.5.2
## [33] whisker_0.4.1 classInt_0.4-9 cachem_1.0.8 tidyselect_1.2.0
## [37] digest_0.6.32 stringi_1.7.12 dplyr_1.1.2 fastmap_1.1.1
## [41] grid_4.3.0 colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
## [45] triebeard_0.4.1 crul_1.4.0 utf8_1.2.3 e1071_1.7-13
## [49] withr_2.5.0 scales_1.2.1 bit64_4.0.5 oai_0.4.0
## [53] httr_1.4.6 bit_4.0.5 evaluate_0.21 knitr_1.43
## [57] rlang_1.1.1 urltools_1.7.3 Rcpp_1.0.10 glue_1.6.2
## [61] DBI_1.1.3 httpcode_0.3.0 xml2_1.3.4 vroom_1.6.3
## [65] jsonlite_1.8.5 R6_2.5.1 plyr_1.8.8 units_0.8-2