Reading this week:
https://medium.com/@traffordDataLab/querying-apis-in-r-39029b73d5f1
Application programming interfaces (APIs) describe how one computer communicates with a web-based source of data in a programmatic manner. The computer (client) makes a request for data in a well-documented and consistent way, and receives data from the server. This is incredibly useful for acquiring data programmatically, allowing analyses to be re-run on dynamic data and creating a living analysis. This could be critical in situations such as an ongoing pandemic, where daily case count updates might change model parameters or predictions.
Many websites have APIs in order to allow users to make calls and develop apps on top of existing sites (e.g., Facebook and Spotify).
Why is this valuable? Contrast the API approach with pure web scraping. Web scraping refers to extracting data from HTML pages. This is different from querying an API, as APIs have documentation about how the data are represented. For instance, what happens when a website changes a bit of its formatting? This would potentially break any web scraping code, but the API backend would be maintained.
In the context of biological data, many data repositories (Figshare, Dryad, DataONE) have clear APIs. Further, it is becoming more common for large museum and government datasets to have developed APIs. I am a huge fan of this, as it is a departure from the longstanding “my data” mentality that has led to a clear power dynamic (e.g., being associated with labs with lots of data gives them a clear advantage…but those data were likely gathered with a combination of taxpayer dollars and researcher sweat. The dollars trump the sweat, in my opinion.).
APIs can be queried programmatically using a number of R packages. Some APIs promote their own R packages that have functions specific to their data structures or API. However, at the core of these packages are a small number of low-level packages for querying APIs.
The general structure of an API call is relatively the same regardless of the R package used. The parameters of an HTTP request are typically contained in the URL. For example, to return a map of Manchester using the Google Maps Static API, we would submit the following request:
https://maps.googleapis.com/maps/api/staticmap?center=Manchester,England&zoom=13&size=600x300&maptype=roadmap
The request contains:

+ a URL to the API endpoint (https://maps.googleapis.com/maps/api/staticmap?), and
+ a query containing the parameters of the request (center=Manchester,England&zoom=13&size=600x300&maptype=roadmap).

In this case, we have specified the location, zoom level, size, and type of map.
Web service APIs use two key HTTP verbs to enable data requests: GET and POST. A GET request is submitted within the URL with each parameter separated by an ampersand (&). A POST request is submitted in the message body which is separate from the URL. The advantage of using POST over GET requests is that there are no character limits and the request is more secure because it is not stored in the browser’s cache.
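As a quick sketch of the difference (using the httr package introduced below, and httpbin.org, a public request-echoing test service that is not tied to any of the data sources in this lesson):

# A GET request: parameters are appended to the URL as ?key=value&key=value
res_get <- httr::GET("https://httpbin.org/get",
                     query = list(q = "vonnegut", limit = 5))

# A POST request: parameters travel in the message body instead of the URL
res_post <- httr::POST("https://httpbin.org/post",
                       body = list(q = "vonnegut"), encode = "json")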
There are several types of Web service APIs (e.g. XML-RPC, JSON-RPC and SOAP) but the most popular is Representational State Transfer or REST. RESTful APIs can return output as XML, JSON, CSV and several other data formats. Each API has documentation and specifications which determine how data can be transferred.
install.packages(c("httr", "jsonlite"))
After installing the packages, we will be able to use them in our R scripts or RMarkdown files.
library(httr)
library(jsonlite)
Some APIs require a token, in order to identify who is accessing the data (e.g., Spotify, Facebook, etc.). Developer access is pretty easy to get when you have an account, but it is a bit too much of a hassle for demonstration purposes. So we will use an open API from the Internet Archive’s Open Library.
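For reference, when a token is required, the request usually carries it in a header. A minimal sketch, assuming a bearer-token scheme and using a placeholder URL and token (both hypothetical):

# hypothetical endpoint and placeholder token, for illustration only
my_token <- "PLACEHOLDER"
res <- httr::GET("https://api.example.com/v1/data",
                 httr::add_headers(Authorization = paste("Bearer", my_token)))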
We can send a request for data by using the documented API put out by the data provider, which consists of a base URL (http://openlibrary.org/search.json?) with an amended query for searching the database. We use httr's GET function, named for the HTTP (hypertext transfer protocol) verb it issues. It stores all the information from the response in an object whose content is structured as JSON. Below, we search the Open Library for “Vonnegut”, and receive a nested list which includes information on status codes, among other relevant information.
von <- httr::GET('http://openlibrary.org/search.json?q=vonnegut')
str(von)
## List of 10
## $ url : chr "http://openlibrary.org/search.json?q=vonnegut"
## $ status_code: int 200
## $ headers :List of 10
## ..$ server : chr "nginx/1.18.0 (Ubuntu)"
## ..$ date : chr "Mon, 12 Jun 2023 21:04:49 GMT"
## ..$ content-type : chr "application/json"
## ..$ transfer-encoding : chr "chunked"
## ..$ connection : chr "keep-alive"
## ..$ access-control-allow-origin: chr "*"
## ..$ access-control-allow-method: chr "GET, OPTIONS"
## ..$ access-control-max-age : chr "86400"
## ..$ x-ol-stats : chr "\"SR 1 0.520 TT 0 0.542\""
## ..$ referrer-policy : chr "no-referrer-when-downgrade"
## ..- attr(*, "class")= chr [1:2] "insensitive" "list"
## $ all_headers:List of 1
## ..$ :List of 3
## .. ..$ status : int 200
## .. ..$ version: chr "HTTP/1.1"
## .. ..$ headers:List of 10
## .. .. ..$ server : chr "nginx/1.18.0 (Ubuntu)"
## .. .. ..$ date : chr "Mon, 12 Jun 2023 21:04:49 GMT"
## .. .. ..$ content-type : chr "application/json"
## .. .. ..$ transfer-encoding : chr "chunked"
## .. .. ..$ connection : chr "keep-alive"
## .. .. ..$ access-control-allow-origin: chr "*"
## .. .. ..$ access-control-allow-method: chr "GET, OPTIONS"
## .. .. ..$ access-control-max-age : chr "86400"
## .. .. ..$ x-ol-stats : chr "\"SR 1 0.520 TT 0 0.542\""
## .. .. ..$ referrer-policy : chr "no-referrer-when-downgrade"
## .. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
## $ cookies :'data.frame': 0 obs. of 7 variables:
## ..$ domain : logi(0)
## ..$ flag : logi(0)
## ..$ path : logi(0)
## ..$ secure : logi(0)
## ..$ expiration: 'POSIXct' num(0)
## ..$ name : logi(0)
## ..$ value : logi(0)
## $ content : raw [1:399885] 7b 0a 20 20 ...
## $ date : POSIXct[1:1], format: "2023-06-12 21:04:49"
## $ times : Named num [1:6] 0 0.162 0.233 0.234 0.854 ...
## ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
## $ request :List of 7
## ..$ method : chr "GET"
## ..$ url : chr "http://openlibrary.org/search.json?q=vonnegut"
## ..$ headers : Named chr "application/json, text/xml, application/xml, */*"
## .. ..- attr(*, "names")= chr "Accept"
## ..$ fields : NULL
## ..$ options :List of 2
## .. ..$ useragent: chr "libcurl/7.81.0 r-curl/4.3.3 httr/1.4.6"
## .. ..$ httpget : logi TRUE
## ..$ auth_token: NULL
## ..$ output : list()
## .. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
## ..- attr(*, "class")= chr "request"
## $ handle :Class 'curl_handle' <externalptr>
## - attr(*, "class")= chr "response"
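Two asides before we parse this. First, the same request can be built without pasting parameters into the URL by hand, using httr's query argument. Second, it is good practice to confirm the request succeeded (a 200 status code) before doing anything with the response:

# equivalent request, letting httr construct the query string for us
von <- httr::GET('http://openlibrary.org/search.json',
                 query = list(q = 'vonnegut'))

# check the request succeeded before parsing
httr::status_code(von)   # 200 means success
httr::http_error(von)    # FALSE means no error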
The response content is in JSON, one of several common data formats (e.g., XML, HTML). To make it something that R can work with, we use the jsonlite package and its fromJSON function.
vonInfo <- jsonlite::fromJSON(content(von, "text"), simplifyVector = FALSE)
## No encoding supplied: defaulting to UTF-8.
This output is a list of length 4, each element containing a nested list. It is not the cleanest output, but it rarely is. All the information you could want is buried somewhere in here.
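To get oriented in the nested list, it helps to inspect the top level first; the field names below (numFound, docs) come from the Open Library search API's documented response format:

names(vonInfo)     # top-level fields of the search response
vonInfo$numFound   # total number of records matching the query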
This sapply statement extracts the title of each of the works returned from our search for “Vonnegut”.
sapply(vonInfo[[4]], function(x){
  x$title
})
## [1] "Slaughterhouse-Five"
## [2] "Kurt Vonnegut, Jr"
## [3] "The Vonnegut statement"
## [4] "The Vonnegut effect"
## [5] "Kurt Vonnegut"
## [6] "Kurt Vonnegut"
## [7] "THE VONNEGUT STATEMENT"
## [8] "Kurt Vonnegut"
## [9] "Kurt Vonnegut"
## [10] "The Vonnegut Statement"
## [11] "Kurt Vonnegut, Jr."
## [12] "Kurt Vonnegut, Jr"
## [13] "Kurt Vonnegut, Jr"
## [14] "Cat's Cradle"
## [15] "Kurt Vonnegut"
## [16] "Kurt Vonnegut"
## [17] "Kurt Vonnegut"
## [18] "Kurt Vonnegut"
## [19] "Kurt Vonnegut"
## [20] "The Vonnegut Encyclopedia"
## [21] "Vonnegut Omnibus"
## [22] "Kurt Vonnegut"
## [23] "Vonnegut statement"
## [24] "Kurt Vonnegut."
## [25] "Vonnegut statement."
## [26] "Kurt Vonnegut"
## [27] "Kurt Vonnegut"
## [28] "Kurt Vonnegut"
## [29] "Kurt Vonnegut"
## [30] "Kurt Vonnegut"
## [31] "Kurt Vonnegut"
## [32] "Kurt Vonnegut"
## [33] "Kurt Vonnegut"
## [34] "Kurt Vonnegut"
## [35] "Kurt Vonnegut"
## [36] "Talking Vonnegut"
## [37] "Ast-klassika.vonnegut Jailbird / AST-klassika.Vonnegut Retsidivist"
## [38] "Kurt Vonnegut, Jr. ..., a checklist"
## [39] "Vonnegut and Hemingway"
## [40] "Kurt Vonnegut Jr"
## [41] "Kurt Vonnegut"
## [42] "Vonnegut"
## [43] "Kurt Vonnegut's Slaugherhouse-Five"
## [44] "Kurt Vonnegut"
## [45] "Player Piano"
## [46] "Vonnegut in America"
## [47] "Vonnegut in America"
## [48] "Breaking down Vonnegut"
## [49] "Kurt Vonnegut's America"
## [50] "Kurt Vonnegut Remembered"
## [51] "Essential Vonnegut Interviews CD"
## [52] "The Vonnegut Quartet"
## [53] "Vonnegut in America"
## [54] "KURT VONNEGUT, JR"
## [55] "The Vonnegut Statement"
## [56] "Three By Vonnegut"
## [57] "Understanding Kurt Vonnegut"
## [58] "THE VONNEGUT STATEMENT"
## [59] "THE VONNEGUT STATEMENT"
## [60] "The Vonnegut statement"
## [61] "Kurt Vonnegut Drawings"
## [62] "The Vonnegut effect"
## [63] "Kurt Vonnegut's America"
## [64] "The Vonnegut Statement"
## [65] "Kurt Vonnegut"
## [66] "The Vonnegut Statement"
## [67] "Happy birthday, Kurt Vonnegut"
## [68] "Kurt Vonnegut"
## [69] "Trout Fishing with Vonnegut Again"
## [70] "Vonnegut in Fact"
## [71] "Conversations with Kurt Vonnegut"
## [72] "Kurt Vonnegut's Slaughterhouse-five"
## [73] "Kurt Vonnegut's Cat's cradle"
## [74] "Kurt Vonnegut's Slaughterhouse-Five"
## [75] "Kurt Vonnegut (Contemporary Writers)"
## [76] "Kurt Vonnegut's Slaughterhouse-five"
## [77] "Mother Night"
## [78] "Critical companion to Kurt Vonnegut"
## [79] "The Sirens of Titan"
## [80] "Kurt Vonnegut"
## [81] "Vonnegut's major works"
## [82] "Kurt Vonnegut, Jr"
## [83] "Bloom's how to write about Kurt Vonnegut"
## [84] "The Vonnegut chronicles"
## [85] "Jailbird"
## [86] "Breakfast of Champions"
## [87] "CliffsNotes on Vonnegut's Slaughterhouse-Five"
## [88] "CliffsNotes on Vonnegut's Major Works"
## [89] "Matka noc - Kurt Vonnegut [KSIÄĹťKA]"
## [90] "God Bless You, Mr. Vonnegut"
## [91] "Kurt Vonnegut, a comprehensive bibliography"
## [92] "CliffsNotes on Vonnegut's Slaughterhouse-Five"
## [93] "Kurt Vonnegut's America (Non Series)"
## [94] "(Mis)reading of Kurt Vonnegut"
## [95] "Critical essays on Kurt Vonnegut"
## [96] "Slaughterhouse-five [by] Kurt Vonnegut"
## [97] "Slaughterhouse-five"
## [98] "The novels of Kurt Vonnegut"
## [99] "Kurt Vonnegut and the American Novel"
## [100] "The Brothers Vonnegut"
But “Vonnegut” as a search term is misleading, because many people have written books on Vonnegut, while we may want books by Vonnegut.
von2 <- httr::GET('http://openlibrary.org/search.json?author=vonnegut')
str(von2)
## List of 10
## $ url : chr "http://openlibrary.org/search.json?author=vonnegut"
## $ status_code: int 200
## $ headers :List of 10
## ..$ server : chr "nginx/1.18.0 (Ubuntu)"
## ..$ date : chr "Mon, 12 Jun 2023 21:04:50 GMT"
## ..$ content-type : chr "application/json"
## ..$ transfer-encoding : chr "chunked"
## ..$ connection : chr "keep-alive"
## ..$ access-control-allow-origin: chr "*"
## ..$ access-control-allow-method: chr "GET, OPTIONS"
## ..$ access-control-max-age : chr "86400"
## ..$ x-ol-stats : chr "\"SR 1 0.305 TT 0 0.336\""
## ..$ referrer-policy : chr "no-referrer-when-downgrade"
## ..- attr(*, "class")= chr [1:2] "insensitive" "list"
## $ all_headers:List of 1
## ..$ :List of 3
## .. ..$ status : int 200
## .. ..$ version: chr "HTTP/1.1"
## .. ..$ headers:List of 10
## .. .. ..$ server : chr "nginx/1.18.0 (Ubuntu)"
## .. .. ..$ date : chr "Mon, 12 Jun 2023 21:04:50 GMT"
## .. .. ..$ content-type : chr "application/json"
## .. .. ..$ transfer-encoding : chr "chunked"
## .. .. ..$ connection : chr "keep-alive"
## .. .. ..$ access-control-allow-origin: chr "*"
## .. .. ..$ access-control-allow-method: chr "GET, OPTIONS"
## .. .. ..$ access-control-max-age : chr "86400"
## .. .. ..$ x-ol-stats : chr "\"SR 1 0.305 TT 0 0.336\""
## .. .. ..$ referrer-policy : chr "no-referrer-when-downgrade"
## .. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
## $ cookies :'data.frame': 0 obs. of 7 variables:
## ..$ domain : logi(0)
## ..$ flag : logi(0)
## ..$ path : logi(0)
## ..$ secure : logi(0)
## ..$ expiration: 'POSIXct' num(0)
## ..$ name : logi(0)
## ..$ value : logi(0)
## $ content : raw [1:558025] 7b 0a 20 20 ...
## $ date : POSIXct[1:1], format: "2023-06-12 21:04:50"
## $ times : Named num [1:6] 0 0.000022 0.000022 0.000045 0.41462 ...
## ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
## $ request :List of 7
## ..$ method : chr "GET"
## ..$ url : chr "http://openlibrary.org/search.json?author=vonnegut"
## ..$ headers : Named chr "application/json, text/xml, application/xml, */*"
## .. ..- attr(*, "names")= chr "Accept"
## ..$ fields : NULL
## ..$ options :List of 2
## .. ..$ useragent: chr "libcurl/7.81.0 r-curl/4.3.3 httr/1.4.6"
## .. ..$ httpget : logi TRUE
## ..$ auth_token: NULL
## ..$ output : list()
## .. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
## ..- attr(*, "class")= chr "request"
## $ handle :Class 'curl_handle' <externalptr>
## - attr(*, "class")= chr "response"
As before, the response content is JSON, so we again use the jsonlite package and its fromJSON function to bring it into R.
vonInfo2 <- jsonlite::fromJSON(content(von2, "text"), simplifyVector = FALSE)
## No encoding supplied: defaulting to UTF-8.
This is better, as it now provides information on books where Vonnegut was listed as an author.
sapply(vonInfo2[[4]], function(x){
  x$title
})
## [1] "Slaughterhouse-Five"
## [2] "Cat's Cradle"
## [3] "Player Piano"
## [4] "Mother Night"
## [5] "The Sirens of Titan"
## [6] "Jailbird"
## [7] "Breakfast of Champions"
## [8] "God bless you, Mr. Rosewater"
## [9] "Slapstick"
## [10] "Welcome to the Monkey House"
## [11] "Hocus Pocus"
## [12] "Palm Sunday"
## [13] "Bluebeard"
## [14] "Galapagos"
## [15] "Deadeye Dick"
## [16] "Timequake"
## [17] "Fates Worse than Death"
## [18] "Bagombo Snuff Box"
## [19] "Between time and Timbuktu"
## [20] "Wampeters, Foma & Granfalloons (opinions)"
## [21] "A man without a country"
## [22] "Galápagos"
## [23] "Happy birthday, Wanda June"
## [24] "Armageddon in Retrospect"
## [25] "Look at the birdie"
## [26] "Canary in a cat house"
## [27] "God bless you, Dr. Kevorkian"
## [28] "Wampeters, foma & granfalloons"
## [29] "Breakfast of champions, or, Goodbye blue Monday!"
## [30] "Again, Dangerous Visions"
## [31] "Gibier de potence"
## [32] "God bless you, Mr Rosewater; or, Pearls before swine"
## [33] "Sun, moon, star"
## [34] "Happy Birthday, Wanda June"
## [35] "Deadeye Dick"
## [36] "Like shaking hands with God"
## [37] "Galapagos"
## [38] "Pity the Reader"
## [39] "Welcome to the Monkey House CD"
## [40] "Hocus Pocus Or, What's the Hurry, Son?"
## [41] "Vonnegut Omnibus"
## [42] "Breakfast of Champions CD"
## [43] "Essential Vonnegut Interviews CD"
## [44] "Bagombo Snuff Box a Fmt (Export)"
## [45] "Zeitbeben"
## [46] "Slapstick/Mother Night"
## [47] "Galapagos"
## [48] "Slaughterhouse-five ; The sirens of Titan ; Player-piano ; cat's cradle ; Breakfast of champions ; Mother night"
## [49] "Le Berceau du chat"
## [50] "We Are What We Pretend to Be"
## [51] "Welcome to the Monkey House"
## [52] "Jailbird"
## [53] "Kurt Vonnegut"
## [54] "Armageddon in Retrospect"
## [55] "Timequake"
## [56] "Bluebeard"
## [57] "God Bless You, Dr.Kevorkian"
## [58] "While Mortals Sleep"
## [59] "Between Time and Timbuktu Or Prometheus 5"
## [60] "If This Isn't Nice, What Is?"
## [61] "Love, Kurt"
## [62] "Domestic goddesses"
## [63] "Galapagos CST"
## [64] "Retsidivist. Galapagosy"
## [65] "The Vonnegut Quartet"
## [66] "Schlachthof 5 oder der Kinderkreuzzug"
## [67] "Sirenene på Titan"
## [68] "Hokus Pokus oder wohin so eilig?"
## [69] "Slapstick, or, Lonesome no more!"
## [70] "Ghiaccio-nove"
## [71] "AITI YO"
## [72] "Confido"
## [73] "Hello, Red"
## [74] "Kurt Vonnegut Jr Conversations With Writers (L077 Cassette)"
## [75] "Next Door"
## [76] "Easy Readers - English - Level 4"
## [77] "Welcome to the Monkey House"
## [78] "Armageddon in retrospect, and other new and unpublished writings on war and peace"
## [79] "Three complete novels"
## [80] "Kurt Vonnegut's Slaughterhouse Five (Monarch Notes) A Critical Commentary"
## [81] "Conversations with Kurt Vonnegut"
## [82] "Op Timequake"
## [83] "Who Am I This Time"
## [84] "Cuna de Gato"
## [85] "Long Walk to Forever"
## [86] "Suche Traum, biete mich. Verstreute Kurzgeschichten"
## [87] "The Sirens of Titan"
## [88] "Abattoir 5 ou la croisade des enfants"
## [89] "Di wu hao tu zai chang"
## [90] "The Barnhouse Effect"
## [91] "Wampeters, Fomas & Granfalloons"
## [92] "Barbazul"
## [93] "The Kurt Vonnegut, Jr Soundbook/4 Cassettes and Program Booklet"
## [94] "Fates Worse/death Ltd"
## [95] "Barbe-Bleue, ou, La vie et les oeuvres de Rabo Karabekian, 1916-1988"
## [96] "On Mark Twain, Lincoln, imperialist wars and the weather"
## [97] "God Bless You, Mr. Rosewater or Pearls Before Swine"
## [98] "Man Without a Country, A"
## [99] "Slaughterhouse- Five. Arbeitsbuch für Englischkurse in der Sekundarstufe 2"
## [100] "Abracadabra"
A fun side note: you can access GitHub info through API calls as well.
tad <- httr::GET('https://api.github.com/users/taddallas')
tadInfo <- jsonlite::fromJSON(content(tad, "text"), simplifyVector = FALSE)
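The parsed result is again a named list, so fields from GitHub's users endpoint (e.g., login and public_repos) can be pulled out directly:

tadInfo$login          # GitHub username
tadInfo$public_repos   # number of public repositories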
So this gives a good background about APIs, but it does not give a good biological example of when they are useful. The books that Vonnegut wrote are unlikely to change, so the search queries to this API are fairly static, and a bit removed from biology.
One good biology API that is open is the Global Biodiversity Information Facility (GBIF). We can use a similar structure to our Open Library query to obtain information on species names and occurrences.
giraffe <- httr::GET('https://www.gbif.org/developer/species/Giraffa+camelopardalis')
str(giraffe)
## List of 10
## $ url : chr "https://www.gbif.org/developer/species/Giraffa+camelopardalis"
## $ status_code: int 404
## $ headers :List of 19
## ..$ content-security-policy : chr "default-src maxcdn.bootstrapcdn.com cdn.jsdelivr.net/codemirror.spell-checker/ 'self' *.gbif.org *.gbif-uat.org"| __truncated__
## ..$ x-dns-prefetch-control : chr "off"
## ..$ expect-ct : chr "max-age=0"
## ..$ x-frame-options : chr "SAMEORIGIN"
## ..$ strict-transport-security : chr "max-age=600; includeSubDomains"
## ..$ x-download-options : chr "noopen"
## ..$ x-content-type-options : chr "nosniff"
## ..$ x-permitted-cross-domain-policies: chr "none"
## ..$ x-xss-protection : chr "0"
## ..$ x-request-id : chr "773b95b0-0963-11ee-8690-f9e6b9db5d9e"
## ..$ content-type : chr "text/html; charset=utf-8"
## ..$ etag : chr "W/\"e600-f7J9KD5JDE0A2PlHIs9BtR8CvCc\""
## ..$ vary : chr "Accept-Encoding"
## ..$ content-encoding : chr "gzip"
## ..$ date : chr "Mon, 12 Jun 2023 20:55:30 GMT"
## ..$ x-varnish : chr "931137505 941424977"
## ..$ age : chr "560"
## ..$ via : chr "1.1 varnish (Varnish/6.0)"
## ..$ content-length : chr "8023"
## ..- attr(*, "class")= chr [1:2] "insensitive" "list"
## $ all_headers:List of 1
## ..$ :List of 3
## .. ..$ status : int 404
## .. ..$ version: chr "HTTP/2"
## .. ..$ headers:List of 19
## .. .. ..$ content-security-policy : chr "default-src maxcdn.bootstrapcdn.com cdn.jsdelivr.net/codemirror.spell-checker/ 'self' *.gbif.org *.gbif-uat.org"| __truncated__
## .. .. ..$ x-dns-prefetch-control : chr "off"
## .. .. ..$ expect-ct : chr "max-age=0"
## .. .. ..$ x-frame-options : chr "SAMEORIGIN"
## .. .. ..$ strict-transport-security : chr "max-age=600; includeSubDomains"
## .. .. ..$ x-download-options : chr "noopen"
## .. .. ..$ x-content-type-options : chr "nosniff"
## .. .. ..$ x-permitted-cross-domain-policies: chr "none"
## .. .. ..$ x-xss-protection : chr "0"
## .. .. ..$ x-request-id : chr "773b95b0-0963-11ee-8690-f9e6b9db5d9e"
## .. .. ..$ content-type : chr "text/html; charset=utf-8"
## .. .. ..$ etag : chr "W/\"e600-f7J9KD5JDE0A2PlHIs9BtR8CvCc\""
## .. .. ..$ vary : chr "Accept-Encoding"
## .. .. ..$ content-encoding : chr "gzip"
## .. .. ..$ date : chr "Mon, 12 Jun 2023 20:55:30 GMT"
## .. .. ..$ x-varnish : chr "931137505 941424977"
## .. .. ..$ age : chr "560"
## .. .. ..$ via : chr "1.1 varnish (Varnish/6.0)"
## .. .. ..$ content-length : chr "8023"
## .. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
## $ cookies :'data.frame': 0 obs. of 7 variables:
## ..$ domain : logi(0)
## ..$ flag : logi(0)
## ..$ path : logi(0)
## ..$ secure : logi(0)
## ..$ expiration: 'POSIXct' num(0)
## ..$ name : logi(0)
## ..$ value : logi(0)
## $ content : raw [1:58880] 0a 3c 21 44 ...
## $ date : POSIXct[1:1], format: "2023-06-12 20:55:30"
## $ times : Named num [1:6] 0 0.0389 0.1484 0.409 0.5186 ...
## ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
## $ request :List of 7
## ..$ method : chr "GET"
## ..$ url : chr "https://www.gbif.org/developer/species/Giraffa+camelopardalis"
## ..$ headers : Named chr "application/json, text/xml, application/xml, */*"
## .. ..- attr(*, "names")= chr "Accept"
## ..$ fields : NULL
## ..$ options :List of 2
## .. ..$ useragent: chr "libcurl/7.81.0 r-curl/4.3.3 httr/1.4.6"
## .. ..$ httpget : logi TRUE
## ..$ auth_token: NULL
## ..$ output : list()
## .. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
## ..- attr(*, "class")= chr "request"
## $ handle :Class 'curl_handle' <externalptr>
## - attr(*, "class")= chr "response"
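Note the status_code of 404 above: the URL we requested is GBIF's developer documentation page, not an API endpoint, so nothing useful came back. The REST API itself lives under https://api.gbif.org/v1/. A sketch against the species search endpoint (per GBIF's API documentation) should return JSON:

giraffeAPI <- httr::GET('https://api.gbif.org/v1/species/search',
                        query = list(q = 'Giraffa camelopardalis'))
httr::status_code(giraffeAPI)   # should now be 200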
Before we dive too far down this hole, it is important to note that this wheel has already been invented and is being maintained by the rOpenSci collective, a group of R programmers who develop tools for the access and analysis of data resources. We will use their rgbif package, not because it enhances reproducibility (as it is yet another dependency), but because it is well-written, well-documented, and from a group of scientists committed to helping make open data available and analyses reproducible.
# install.packages('rgbif')
library(rgbif)
giraffe <- rgbif::occ_search(scientificName = "Giraffa camelopardalis",
                             limit = 5000)
Depending on your operating system (OS) and previously installed packages, this is where we start to get into issues of dependencies. The rgbif package requires the installation of rgeos, which requires the geos library to be installed outside of R. Hopefully you will be able to successfully install the package, but it is important to note that handling these dependency and OS-specific issues is pretty central to making sure an analytical pipeline is reproducible.
So presumably we have successfully queried the GBIF API using the rgbif package, and now have a data.frame of giraffe occurrences (limited to 5000 occurrence points). The output data are structured as a list of lists, which is a bit confusing if we look at the str() of giraffe, as the authors of the package have created a class called gbif to hold all the data. This is common in R packages, and the authors have written in some great functionality for printing and working with these data, formatting the data as a tibble object. We will not go too much into tibbles, but they are pretty useful for keeping data in a “tidy” format, as columns in a tibble can be lists (e.g., what if we wanted to store spatial polygon information, vectors, etc. alongside each row element? tibble handles this need effectively).
typeof(giraffe)
## [1] "list"
class(giraffe)
## [1] "gbif"
For simplicity's sake, we can use some of the functionality within rgbif to return just the data we are interested in, rather than all the potential data (which includes images, etc.). Note the warning below: the return argument is defunct in recent versions of rgbif and is ignored, so it is the [[3]] indexing that actually extracts the data element.
giraffe2 <- rgbif::occ_search(scientificName = "Giraffa camelopardalis",
                              limit = 5000, return='data')[[3]]
## Warning in pchk(return, "occ_search"): `return` param in `occ_search` function is defunct as of rgbif v3.0.0, and is ignored
## See `?rgbif` for more information.
We will quickly plot these data out just to show them in geographic space. We first create a base map using the maps package, which is a really easy way to plot geopolitical boundaries. Then we layer on the occurrence data using the points function.
# install.packages('maps')
maps::map()
points(giraffe2$decimalLongitude, giraffe2$decimalLatitude, pch=16,
       col=grey(0.1,0.5))
Giraffes are in the US!!! These are almost certainly zoo records.
Accessing data through APIs is fantastic, but there are plenty of times when you may need to access large data files locally. Note that a database framework is only necessary if the data you are working with cannot fit into local memory; otherwise, there is no real advantage to using one, as it will likely be more expensive in terms of time and potential frustration.
For this, we will have to include yet another dependency (5+ in this lecture alone), which of course has a bunch of other dependencies. This is not really ideal given the focus on reproducible research in the course, but it is sadly pretty much necessary.
install.packages('dbplyr')
When referring to “databases”, we may think that there is a single type or schema, but this is definitely not the case. R accommodates different database types by including a dependency for each type (maybe not ideal). The different ways R interfaces with different database types are described at https://www.kaggle.com/anello/working-with-databases-in-r.
We will focus on the use of RSQLite as a backend to connect to SQLite databases. This will sadly require another dependency.
install.packages('RSQLite')
dir.create("data_raw", showWarnings = FALSE)
download.file(url = "https://ndownloader.figshare.com/files/2292171",
              destfile = "data_raw/portal_mammals.sqlite", mode = "wb")
library(DBI)
mammals <- DBI::dbConnect(RSQLite::SQLite(), "data_raw/portal_mammals.sqlite")
dbplyr::src_dbi(mammals)
## src: sqlite 3.41.2 [/media/tad/sanDisk1TB/Teaching/ecoInformatics/website/content/code/data_raw/portal_mammals.sqlite]
## tbls: plots, species, surveys
Just like a spreadsheet with multiple worksheets, a SQLite database can contain multiple tables. In this case, three of them are listed in the tbls row of the output above: plots, species, and surveys.
Now that we know we can connect to the database, let’s explore how to get the data from its tables into R.
To connect to tables within a database, you can use the tbl() function from dplyr. This function can be used to send SQL queries to the database. To demonstrate this functionality, let's select the columns “year”, “species_id”, and “plot_id” from the surveys table:
dplyr::tbl(mammals, dplyr::sql("SELECT year, species_id, plot_id FROM surveys"))
## # Source: SQL [?? x 3]
## # Database: sqlite 3.41.2 [/media/tad/sanDisk1TB/Teaching/ecoInformatics/website/content/code/data_raw/portal_mammals.sqlite]
## year species_id plot_id
## <int> <chr> <int>
## 1 1977 NL 2
## 2 1977 NL 3
## 3 1977 DM 2
## 4 1977 DM 7
## 5 1977 DM 3
## 6 1977 PF 1
## 7 1977 PE 2
## 8 1977 DM 1
## 9 1977 DM 1
## 10 1977 PF 6
## # ℹ more rows
With this approach you can use any of the SQL queries we have seen in the database lesson.
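For example, a query with a WHERE clause works the same way (1977 is a year we know exists in the data from the output above):

dplyr::tbl(mammals,
           dplyr::sql("SELECT year, species_id, plot_id FROM surveys WHERE year = 1977"))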
One of the strengths of dplyr is that the same operation can be done using dplyr verbs instead of writing SQL. First, we select the table on which to do the operations by creating the surveys object, and then we use the standard dplyr syntax as if it were a data frame:
library(dplyr)   # attach dplyr so the %>% pipe is available

surveys <- dplyr::tbl(mammals, "surveys")
surveys %>%
  dplyr::select(year, species_id, plot_id)
## # Source: SQL [?? x 3]
## # Database: sqlite 3.41.2 [/media/tad/sanDisk1TB/Teaching/ecoInformatics/website/content/code/data_raw/portal_mammals.sqlite]
## year species_id plot_id
## <int> <chr> <int>
## 1 1977 NL 2
## 2 1977 NL 3
## 3 1977 DM 2
## 4 1977 DM 7
## 5 1977 DM 3
## 6 1977 PF 1
## 7 1977 PE 2
## 8 1977 DM 1
## 9 1977 DM 1
## 10 1977 PF 6
## # ℹ more rows
In this case, the surveys object behaves like a data frame. Several functions that can be used with data frames can also be used on tables from a database. For instance, the head() function can be used to check the first 10 rows of the table:
head(surveys, n = 10)
## # Source: SQL [10 x 9]
## # Database: sqlite 3.41.2 [/media/tad/sanDisk1TB/Teaching/ecoInformatics/website/content/code/data_raw/portal_mammals.sqlite]
## record_id month day year plot_id species_id sex hindfoot_length weight
## <int> <int> <int> <int> <int> <chr> <chr> <int> <int>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 2 7 16 1977 3 NL M 33 NA
## 3 3 7 16 1977 2 DM F 37 NA
## 4 4 7 16 1977 7 DM M 36 NA
## 5 5 7 16 1977 3 DM M 35 NA
## 6 6 7 16 1977 1 PF M 14 NA
## 7 7 7 16 1977 2 PE F NA NA
## 8 8 7 16 1977 1 DM M 37 NA
## 9 9 7 16 1977 1 DM F 34 NA
## 10 10 7 16 1977 6 PF F 20 NA
surveys2 <- surveys %>%
  dplyr::filter(weight < 5) %>%
  dplyr::select(species_id, sex, weight)
R makes lazy calls to databases until you force it not to be lazy. That is, everything is held in the database, outside of R's memory, until you tell R that some portion of the data should be pulled in.
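You can see this laziness directly: dbplyr translates the chain of verbs into SQL, and show_query() prints the query that would be sent to the database, without executing the full data pull:

dplyr::show_query(surveys2)   # prints the SQL generated from the dplyr verbs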
head(surveys2)
## # Source: SQL [6 x 3]
## # Database: sqlite 3.41.2 [/media/tad/sanDisk1TB/Teaching/ecoInformatics/website/content/code/data_raw/portal_mammals.sqlite]
## species_id sex weight
## <chr> <chr> <int>
## 1 PF M 4
## 2 PF F 4
## 3 PF <NA> 4
## 4 PF F 4
## 5 PF F 4
## 6 RM M 4
To pull the data into memory so we can work with it locally, we must use the collect() function.
surveys3 <- dplyr::collect(surveys2)
head(surveys3)
## # A tibble: 6 × 3
## species_id sex weight
## <chr> <chr> <int>
## 1 PF M 4
## 2 PF F 4
## 3 PF <NA> 4
## 4 PF F 4
## 5 PF F 4
## 6 RM M 4
This difference becomes clear when we try to index a specific column of data and work with it.
surveys2$sex
## NULL
surveys3$sex
## [1] "M" "F" NA "F" "F" "M" "F" "M" "M" "M" "M" "F" "M" "M" "M" "M" "M"
dplyr syntax of joins

We discussed joins in the data manipulation section. Joins are pretty key to working with databases, as much of the benefit of database structure comes from nested data and the utility of key values. For instance, in our mammal sampling data, we have plot-level data…
plots <- dplyr::tbl(mammals, "plots")
plots
## # Source: table<plots> [?? x 2]
## # Database: sqlite 3.41.2 [/media/tad/sanDisk1TB/Teaching/ecoInformatics/website/content/code/data_raw/portal_mammals.sqlite]
## plot_id plot_type
## <int> <chr>
## 1 1 Spectab exclosure
## 2 2 Control
## 3 3 Long-term Krat Exclosure
## 4 4 Control
## 5 5 Rodent Exclosure
## 6 6 Short-term Krat Exclosure
## 7 7 Rodent Exclosure
## 8 8 Control
## 9 9 Spectab exclosure
## 10 10 Rodent Exclosure
## # ℹ more rows
and survey-level data, where plot-level data provides what is essentially metadata for each survey.
We can use joins to combine and manipulate these data, as we did before, but now in the context of the database structure.
survPlot <- dplyr::left_join(surveys, plots, by='plot_id')
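Like the other operations above, this join is lazy. A quick peek confirms the plot-level columns are now attached to each survey record:

head(survPlot)   # surveys rows with plot_type joined on via plot_id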
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] geodata_0.5-8 terra_1.7-29 maps_3.4.1 gbm_2.1.8.1 igraph_1.4.3
## [6] dplyr_1.1.2 plyr_1.8.8 DBI_1.1.3 rgbif_3.7.7 jsonlite_1.8.5
## [11] httr_1.4.6 rmarkdown_2.11 fastmap_1.1.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.3 xfun_0.29 ggplot2_3.4.2 lattice_0.21-8
## [5] vctrs_0.6.2 tools_4.3.0 generics_0.1.2 parallel_4.3.0
## [9] curl_4.3.3 tibble_3.2.1 fansi_1.0.2 RSQLite_2.3.1
## [13] highr_0.9 blob_1.2.4 pkgconfig_2.0.3 Matrix_1.5-1
## [17] data.table_1.14.6 dbplyr_2.3.2 lifecycle_1.0.3 compiler_4.3.0
## [21] stringr_1.5.0 munsell_0.5.0 codetools_0.2-19 htmltools_0.5.2
## [25] yaml_2.3.6 lazyeval_0.2.2 pillar_1.9.0 jquerylib_0.1.4
## [29] whisker_0.4.1 cachem_1.0.8 viridis_0.6.3 tidyselect_1.2.0
## [33] digest_0.6.31 stringi_1.7.12 purrr_1.0.1 splines_4.3.0
## [37] grid_4.3.0 colorspace_2.1-0 cli_3.6.1 magrittr_2.0.2
## [41] triebeard_0.4.1 survival_3.5-3 crul_1.4.0 utf8_1.2.2
## [45] withr_2.5.0 scales_1.2.1 bit64_4.0.5 oai_0.4.0
## [49] bit_4.0.5 gridExtra_2.3 memoise_2.0.1 evaluate_0.15
## [53] knitr_1.37 viridisLite_0.4.2 rlang_1.1.1 urltools_1.7.3
## [57] Rcpp_1.0.10 glue_1.6.2 httpcode_0.3.0 xml2_1.3.4
## [61] R6_2.5.1