What do we mean by ‘reproducible’?

Reproducibility is obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis. This definition is synonymous with “computational reproducibility,” and the terms are used interchangeably in this report.

Replicability is obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.

What are some bad things that can happen when we fail to focus on reproducibility?

Science is evidence-based by definition, and findings build on one another. Being able to reproduce the findings of other researchers is therefore nearly as important as contributing ‘new’ evidence. Researchers are fallible in their ability to write bug-free code, and releasing code and data allows these errors to be detected and corrected. Beyond this, reproducibility and open science go hand in hand: reproducible code is ideally also open access, meaning that any researcher can take the code and data and reproduce the analyses.

What are the major barriers to reproducibility?

How can we improve reproducibility in our code?

There are a number of ways we can improve reproducibility in the code we write.

rbinom(10, 1, 0.25)
##  [1] 0 0 0 0 1 0 0 0 0 0
rbinom(10, 1, 0.25)
##  [1] 0 0 0 1 0 1 0 0 0 0
rbinom(10, 1, 0.25)
##  [1] 0 0 0 0 0 1 0 1 1 0
rbinom(10, 1, 0.25)
##  [1] 0 1 0 1 1 0 0 1 0 0
rbinom(10, 1, 0.25)
##  [1] 1 0 0 0 0 1 0 0 0 1
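
The five calls above return different draws each time, because the random number generator advances with every call. A minimal sketch of one common remedy, assuming the goal is simply to make the draws repeatable, is to fix the generator's state with set.seed() before drawing. The seed value 1234 and the object names first and second are arbitrary choices for illustration.

```r
# Fix the state of the random number generator, then draw
set.seed(1234)  # arbitrary seed value, chosen only for illustration
first <- rbinom(10, 1, 0.25)

# Resetting to the same seed replays the same sequence of draws
set.seed(1234)
second <- rbinom(10, 1, 0.25)

# The two vectors are element-for-element the same
identical(first, second)
# TRUE
```

Anyone who runs the script with the same seed, on the same R version and with the same generator settings, should get the same draws.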

Let’s explore an example of documenting code. I wrote a function that performs a “min-max standardization” on a given vector of data. It takes every entry of the vector, subtracts the minimum value, and then divides by the maximum minus the minimum. The effect is that the data are now scaled between 0 and 1. The documentation follows the format designed by the roxygen2 package developers and incorporated into the workhorse library devtools, both of which are maintained by Wickham and folks at RStudio/Posit. The nice part of this form of documentation is that the documentation and the function are in the same file. Long ago, documenting functions for distribution as an R package required the developer to edit a separate document with all the details of each function.

#' Min-max standardization
#' @param x a vector of numeric data
#' @return the same vector x, but standardized between 0 and 1
#' @examples 
#' getMinMax(x=1:10)

getMinMax <- function(x){
  # subtract the minimum, then divide by the range (max minus min)
  (x-min(x)) / (max(x)-min(x))
}

Here we include information on what the function does, what arguments it takes, what the output will be, and provide a use case. This is most important for package developers, but it is good practice to provide some documentation of your code.
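
To make the use case in the documentation concrete, here is a small worked example. The function body is repeated so the block runs on its own; the input values 2, 4, and 10 are made up for illustration and are not part of the original example.

```r
# Same function as documented above, repeated so this block is self-contained
getMinMax <- function(x){
  (x - min(x)) / (max(x) - min(x))
}

x <- c(2, 4, 10)        # illustrative values: min is 2, max is 10, range is 8
scaled <- getMinMax(x)

# Element by element: (2-2)/8 = 0, (4-2)/8 = 0.25, (10-2)/8 = 1
scaled
```

The smallest value always maps to 0 and the largest to 1, with everything else falling in between in proportion to its distance from the minimum.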

Let’s take the example of the function we wrote above. It’s well-documented, but let’s say I want to give it a vector that contains an NA value. What is going to happen?

test <- c(1:10, NA)
getMinMax(test)
##  [1] NA NA NA NA NA NA NA NA NA NA NA

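That’s not great. Because min() and max() return NA when any element of the vector is NA, every entry of the result is NA. Ideally, the function would keep the NA values where they are but still standardize the rest of the vector, and maybe even warn the user that NA values are present. Let’s implement that now.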

#' Min-max standardization
#' @param x a vector of numeric data
#' @return the same vector x, but standardized between 0 and 1
#' @examples 
#' getMinMax(x=1:10)

getMinMax <- function(x){
  if(any(is.na(x))){
    warning('The vector x contains NA values. These will be ignored.')
  }
  (x-min(x, na.rm=TRUE)) / (max(x, na.rm=TRUE)-min(x, na.rm=TRUE))
}

So now, we warn the user about the input data having the NA values and have programmed our function in a way to account for those NA values. What other ways should we think about modifying this function for clarity and usability?
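
As one possible answer to that question, the sketch below adds two further checks: it stops on non-numeric input, and it stops when all non-missing values are identical, since the range would then be zero and the division undefined. This is only one way to do it. The name getMinMaxChecked and the wording of the error messages are hypothetical and introduced here for illustration.

```r
# Hypothetical variant of getMinMax with extra input checks (name is made up here)
getMinMaxChecked <- function(x){
  # refuse input that cannot be standardized at all
  if(!is.numeric(x)){
    stop('x must be a numeric vector.')
  }
  # same NA warning as in the version above
  if(any(is.na(x))){
    warning('The vector x contains NA values. These will be ignored.')
  }
  # range() returns c(min, max), so the minimum is only computed once
  rng <- range(x, na.rm=TRUE)
  # a constant vector has zero range, which would mean dividing by zero
  if(rng[1] == rng[2]){
    stop('All non-missing values of x are identical, so the range is zero.')
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# NA stays in place, the other values are scaled between 0 and 1
res <- suppressWarnings(getMinMaxChecked(c(1:10, NA)))
res
```

Whether a constant vector should raise an error, return zeros, or return NA is a design decision; stopping is simply the most conservative choice for this sketch.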


sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## time zone: America/New_York
## tzcode source: system (glibc)
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] stringr_1.5.0  rmarkdown_2.22 sf_1.0-13      raster_3.6-20  sp_2.0-0      
## [6] rgbif_3.7.7   
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.3       bslib_0.5.0        xfun_0.39          ggplot2_3.4.2     
##  [5] lattice_0.21-8     tzdb_0.4.0         vctrs_0.6.3        tools_4.3.0       
##  [9] generics_0.1.3     curl_5.0.1         parallel_4.3.0     tibble_3.2.1      
## [13] proxy_0.4-27       fansi_1.0.4        highr_0.10         pkgconfig_2.0.3   
## [17] KernSmooth_2.23-20 data.table_1.14.8  lifecycle_1.0.3    compiler_4.3.0    
## [21] tinytex_0.45       munsell_0.5.0      terra_1.7-37       codetools_0.2-19  
## [25] sass_0.4.6         htmltools_0.5.5    class_7.3-21       yaml_2.3.7        
## [29] lazyeval_0.2.2     jquerylib_0.1.4    pillar_1.9.0       crayon_1.5.2      
## [33] whisker_0.4.1      classInt_0.4-9     cachem_1.0.8       tidyselect_1.2.0  
## [37] digest_0.6.32      stringi_1.7.12     dplyr_1.1.2        fastmap_1.1.1     
## [41] grid_4.3.0         colorspace_2.1-0   cli_3.6.1          magrittr_2.0.3    
## [45] triebeard_0.4.1    crul_1.4.0         utf8_1.2.3         e1071_1.7-13      
## [49] withr_2.5.0        scales_1.2.1       bit64_4.0.5        oai_0.4.0         
## [53] httr_1.4.6         bit_4.0.5          evaluate_0.21      knitr_1.43        
## [57] rlang_1.1.1        urltools_1.7.3     Rcpp_1.0.10        glue_1.6.2        
## [61] DBI_1.1.3          httpcode_0.3.0     xml2_1.3.4         vroom_1.6.3       
## [65] jsonlite_1.8.5     R6_2.5.1           plyr_1.8.8         units_0.8-2