Reading this week:
https://medium.com/@traffordDataLab/querying-apis-in-r-39029b73d5f1
Application programming interfaces (APIs) describe how one computer communicates with a web-based source of data in a programmatic manner. The computer (client) makes a request for data in a well-documented and consistent way, and receives data from the server. This is incredibly useful for acquiring data programmatically, allowing analyses to be re-run on dynamic data and creating a living analysis. This could be critical in situations such as an ongoing pandemic, where daily case count updates might change model parameters or predictions.
Many websites have APIs in order to allow users to make calls and develop apps on top of existing sites (e.g., Facebook and Spotify).
Why is this valuable? Contrast the API approach with pure web scraping. Web scraping refers to extracting data from HTML pages. This is different from querying an API, as APIs have documentation about how the data are represented. For instance, what happens when a website changes a bit of its formatting? This would potentially break any web scraping code, but the API backend would be maintained.
In the context of biological data, many data repositories (Figshare, Dryad, DataONE) have clear APIs. Further, it is becoming more common for large museum and government datasets to have developed APIs. I am a huge fan of this, as it is a departure from the longstanding “my data” mentality that has led to a clear power dynamic (e.g., being associated with labs with lots of data gives them a clear advantage…but those data were likely gathered with a combination of taxpayer dollars and researcher sweat. The dollars trump the sweat, in my opinion.).
APIs can be queried programmatically using a number of R packages. Some APIs promote their own R packages that have functions specific to their data structures or API. However, at the core of these packages are a small number of low-level packages for querying APIs.
The general structure of an API call is relatively the same regardless of the R package used. The parameters of an HTTP request are typically contained in the URL. For example, to return a map of Manchester using the Google Maps Static API, we would submit the following request:
https://maps.googleapis.com/maps/api/staticmap?center=Manchester,England&zoom=13&size=600x300&maptype=roadmap
The request contains:

+ a URL to the API endpoint (https://maps.googleapis.com/maps/api/staticmap?), and
+ a query containing the parameters of the request (center=Manchester,England&zoom=13&size=600x300&maptype=roadmap).

In this case, we have specified the location, zoom level, size, and type of map.
Web service APIs use two key HTTP verbs to enable data requests: GET and POST. A GET request is submitted within the URL with each parameter separated by an ampersand (&). A POST request is submitted in the message body which is separate from the URL. The advantage of using POST over GET requests is that there are no character limits and the request is more secure because it is not stored in the browser’s cache.
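As a quick sketch of the difference (using the httr package introduced below, and httpbin.org, a public request-echoing test service that is not tied to any of the data sources in this lesson):

# A GET request: parameters are appended to the URL as ?key=value&key=value
res_get <- httr::GET("https://httpbin.org/get",
                     query = list(q = "vonnegut", limit = 5))

# A POST request: parameters travel in the message body instead of the URL
res_post <- httr::POST("https://httpbin.org/post",
                       body = list(q = "vonnegut"), encode = "json")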
There are several types of Web service APIs (e.g. XML-RPC, JSON-RPC and SOAP) but the most popular is Representational State Transfer or REST. RESTful APIs can return output as XML, JSON, CSV and several other data formats. Each API has documentation and specifications which determine how data can be transferred.
install.packages(c("httr", "jsonlite"))
After installing the packages, we will be able to use them in our R scripts or RMarkdown files.
library(httr)
library(jsonlite)
Some APIs require a token, in order to identify who is accessing the data (e.g., Spotify, Facebook, etc.). Developer access is pretty easy to get when you have an account, but it is a bit too much of a hassle for demonstration purposes. So we will use an open API from the Internet Archive’s Open Library.
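For reference, when a token is required, the request usually carries it in a header. A minimal sketch, assuming a bearer-token scheme and using a placeholder URL and token (both hypothetical):

# hypothetical endpoint and placeholder token, for illustration only
my_token <- "PLACEHOLDER"
res <- httr::GET("https://api.example.com/v1/data",
                 httr::add_headers(Authorization = paste("Bearer", my_token)))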
We can send a request for data by using the documented API put out by the data provider, which consists of a base URL (http://openlibrary.org/search.json?) with an amended query for searching the database. We use httr's GET function, named for the HTTP (hypertext transfer protocol) verb it issues. It stores all the information from the response in an object whose content is structured as JSON. Below, we search the Open Library for “Vonnegut”, and receive a nested list which includes information on status codes, among other relevant information.
von <- httr::GET('http://openlibrary.org/search.json?q=vonnegut')
str(von)
## List of 10
## $ url : chr "http://openlibrary.org/search.json?q=vonnegut"
## $ status_code: int 200
## $ headers :List of 10
## ..$ server : chr "nginx/1.18.0 (Ubuntu)"
## ..$ date : chr "Mon, 12 Jun 2023 21:04:49 GMT"
## ..$ content-type : chr "application/json"
## ..$ transfer-encoding : chr "chunked"
## ..$ connection : chr "keep-alive"
## ..$ access-control-allow-origin: chr "*"
## ..$ access-control-allow-method: chr "GET, OPTIONS"
## ..$ access-control-max-age : chr "86400"
## ..$ x-ol-stats : chr "\"SR 1 0.520 TT 0 0.542\""
## ..$ referrer-policy : chr "no-referrer-when-downgrade"
## ..- attr(*, "class")= chr [1:2] "insensitive" "list"
## $ all_headers:List of 1
## ..$ :List of 3
## .. ..$ status : int 200
## .. ..$ version: chr "HTTP/1.1"
## .. ..$ headers:List of 10
## .. .. ..$ server : chr "nginx/1.18.0 (Ubuntu)"
## .. .. ..$ date : chr "Mon, 12 Jun 2023 21:04:49 GMT"
## .. .. ..$ content-type : chr "application/json"
## .. .. ..$ transfer-encoding : chr "chunked"
## .. .. ..$ connection : chr "keep-alive"
## .. .. ..$ access-control-allow-origin: chr "*"
## .. .. ..$ access-control-allow-method: chr "GET, OPTIONS"
## .. .. ..$ access-control-max-age : chr "86400"
## .. .. ..$ x-ol-stats : chr "\"SR 1 0.520 TT 0 0.542\""
## .. .. ..$ referrer-policy : chr "no-referrer-when-downgrade"
## .. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
## $ cookies :'data.frame': 0 obs. of 7 variables:
## ..$ domain : logi(0)
## ..$ flag : logi(0)
## ..$ path : logi(0)
## ..$ secure : logi(0)
## ..$ expiration: 'POSIXct' num(0)
## ..$ name : logi(0)
## ..$ value : logi(0)
## $ content : raw [1:399885] 7b 0a 20 20 ...
## $ date : POSIXct[1:1], format: "2023-06-12 21:04:49"
## $ times : Named num [1:6] 0 0.162 0.233 0.234 0.854 ...
## ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
## $ request :List of 7
## ..$ method : chr "GET"
## ..$ url : chr "http://openlibrary.org/search.json?q=vonnegut"
## ..$ headers : Named chr "application/json, text/xml, application/xml, */*"
## .. ..- attr(*, "names")= chr "Accept"
## ..$ fields : NULL
## ..$ options :List of 2
## .. ..$ useragent: chr "libcurl/7.81.0 r-curl/4.3.3 httr/1.4.6"
## .. ..$ httpget : logi TRUE
## ..$ auth_token: NULL
## ..$ output : list()
## .. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
## ..- attr(*, "class")= chr "request"
## $ handle :Class 'curl_handle' <externalptr>
## - attr(*, "class")= chr "response"
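Two asides before we parse this. First, the same request can be built without pasting parameters into the URL by hand, using httr's query argument. Second, it is good practice to confirm the request succeeded (a 200 status code) before doing anything with the response:

# equivalent request, letting httr construct the query string for us
von <- httr::GET('http://openlibrary.org/search.json',
                 query = list(q = 'vonnegut'))

# check the request succeeded before parsing
httr::status_code(von)   # 200 means success
httr::http_error(von)    # FALSE means no error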
The response content is in JSON, one of several common data formats (e.g., XML, HTML). To make it something that R can work with, we use the jsonlite package and its fromJSON function.
vonInfo <- jsonlite::fromJSON(content(von, "text"), simplifyVector = FALSE)
## No encoding supplied: defaulting to UTF-8.
This output is a list of length 4, each element containing a nested list. It is not the cleanest output, but it rarely is. All the information you could want is buried somewhere in here.
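To get oriented in the nested list, it helps to inspect the top level first; the field names below (numFound, docs) come from the Open Library search API's documented response format:

names(vonInfo)     # top-level fields of the search response
vonInfo$numFound   # total number of records matching the query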
This sapply statement extracts the title of each of the works returned from our search for “Vonnegut”.
sapply(vonInfo[[4]], function(x){
  x$title
})
## [1] "Slaughterhouse-Five"
## [2] "Kurt Vonnegut, Jr"
## [3] "The Vonnegut statement"
## [4] "The Vonnegut effect"
## [5] "Kurt Vonnegut"
## [6] "Kurt Vonnegut"
## [7] "THE VONNEGUT STATEMENT"
## [8] "Kurt Vonnegut"
## [9] "Kurt Vonnegut"
## [10] "The Vonnegut Statement"
## [11] "Kurt Vonnegut, Jr."
## [12] "Kurt Vonnegut, Jr"
## [13] "Kurt Vonnegut, Jr"
## [14] "Cat's Cradle"
## [15] "Kurt Vonnegut"
## [16] "Kurt Vonnegut"
## [17] "Kurt Vonnegut"
## [18] "Kurt Vonnegut"
## [19] "Kurt Vonnegut"
## [20] "The Vonnegut Encyclopedia"
## [21] "Vonnegut Omnibus"
## [22] "Kurt Vonnegut"
## [23] "Vonnegut statement"
## [24] "Kurt Vonnegut."
## [25] "Vonnegut statement."
## [26] "Kurt Vonnegut"
## [27] "Kurt Vonnegut"
## [28] "Kurt Vonnegut"
## [29] "Kurt Vonnegut"
## [30] "Kurt Vonnegut"
## [31] "Kurt Vonnegut"
## [32] "Kurt Vonnegut"
## [33] "Kurt Vonnegut"
## [34] "Kurt Vonnegut"
## [35] "Kurt Vonnegut"
## [36] "Talking Vonnegut"
## [37] "Ast-klassika.vonnegut Jailbird / AST-klassika.Vonnegut Retsidivist"
## [38] "Kurt Vonnegut, Jr. ..., a checklist"
## [39] "Vonnegut and Hemingway"
## [40] "Kurt Vonnegut Jr"
## [41] "Kurt Vonnegut"
## [42] "Vonnegut"
## [43] "Kurt Vonnegut's Slaugherhouse-Five"
## [44] "Kurt Vonnegut"
## [45] "Player Piano"
## [46] "Vonnegut in America"
## [47] "Vonnegut in America"
## [48] "Breaking down Vonnegut"
## [49] "Kurt Vonnegut's America"
## [50] "Kurt Vonnegut Remembered"
## [51] "Essential Vonnegut Interviews CD"
## [52] "The Vonnegut Quartet"
## [53] "Vonnegut in America"
## [54] "KURT VONNEGUT, JR"
## [55] "The Vonnegut Statement"
## [56] "Three By Vonnegut"
## [57] "Understanding Kurt Vonnegut"
## [58] "THE VONNEGUT STATEMENT"
## [59] "THE VONNEGUT STATEMENT"
## [60] "The Vonnegut statement"
## [61] "Kurt Vonnegut Drawings"
## [62] "The Vonnegut effect"
## [63] "Kurt Vonnegut's America"
## [64] "The Vonnegut Statement"
## [65] "Kurt Vonnegut"
## [66] "The Vonnegut Statement"
## [67] "Happy birthday, Kurt Vonnegut"
## [68] "Kurt Vonnegut"
## [69] "Trout Fishing with Vonnegut Again"
## [70] "Vonnegut in Fact"
## [71] "Conversations with Kurt Vonnegut"
## [72] "Kurt Vonnegut's Slaughterhouse-five"
## [73] "Kurt Vonnegut's Cat's cradle"
## [74] "Kurt Vonnegut's Slaughterhouse-Five"
## [75] "Kurt Vonnegut (Contemporary Writers)"
## [76] "Kurt Vonnegut's Slaughterhouse-five"
## [77] "Mother Night"
## [78] "Critical companion to Kurt Vonnegut"
## [79] "The Sirens of Titan"
## [80] "Kurt Vonnegut"
## [81] "Vonnegut's major works"
## [82] "Kurt Vonnegut, Jr"
## [83] "Bloom's how to write about Kurt Vonnegut"
## [84] "The Vonnegut chronicles"
## [85] "Jailbird"
## [86] "Breakfast of Champions"
## [87] "CliffsNotes on Vonnegut's Slaughterhouse-Five"
## [88] "CliffsNotes on Vonnegut's Major Works"
## [89] "Matka noc - Kurt Vonnegut [KSIÄĹťKA]"
## [90] "God Bless You, Mr. Vonnegut"
## [91] "Kurt Vonnegut, a comprehensive bibliography"
## [92] "CliffsNotes on Vonnegut's Slaughterhouse-Five"
## [93] "Kurt Vonnegut's America (Non Series)"
## [94] "(Mis)reading of Kurt Vonnegut"
## [95] "Critical essays on Kurt Vonnegut"
## [96] "Slaughterhouse-five [by] Kurt Vonnegut"
## [97] "Slaughterhouse-five"
## [98] "The novels of Kurt Vonnegut"
## [99] "Kurt Vonnegut and the American Novel"
## [100] "The Brothers Vonnegut"
But “Vonnegut” as a search term is misleading, because many people have written books on Vonnegut, while we may want books by Vonnegut.
von2 <- httr::GET('http://openlibrary.org/search.json?author=vonnegut')
str(von2)
## List of 10
## $ url : chr "http://openlibrary.org/search.json?author=vonnegut"
## $ status_code: int 200
## $ headers :List of 10
## ..$ server : chr "nginx/1.18.0 (Ubuntu)"
## ..$ date : chr "Mon, 12 Jun 2023 21:04:50 GMT"
## ..$ content-type : chr "application/json"
## ..$ transfer-encoding : chr "chunked"
## ..$ connection : chr "keep-alive"
## ..$ access-control-allow-origin: chr "*"
## ..$ access-control-allow-method: chr "GET, OPTIONS"
## ..$ access-control-max-age : chr "86400"
## ..$ x-ol-stats : chr "\"SR 1 0.305 TT 0 0.336\""
## ..$ referrer-policy : chr "no-referrer-when-downgrade"
## ..- attr(*, "class")= chr [1:2] "insensitive" "list"
## $ all_headers:List of 1
## ..$ :List of 3
## .. ..$ status : int 200
## .. ..$ version: chr "HTTP/1.1"
## .. ..$ headers:List of 10
## .. .. ..$ server : chr "nginx/1.18.0 (Ubuntu)"
## .. .. ..$ date : chr "Mon, 12 Jun 2023 21:04:50 GMT"
## .. .. ..$ content-type : chr "application/json"
## .. .. ..$ transfer-encoding : chr "chunked"
## .. .. ..$ connection : chr "keep-alive"
## .. .. ..$ access-control-allow-origin: chr "*"
## .. .. ..$ access-control-allow-method: chr "GET, OPTIONS"
## .. .. ..$ access-control-max-age : chr "86400"
## .. .. ..$ x-ol-stats : chr "\"SR 1 0.305 TT 0 0.336\""
## .. .. ..$ referrer-policy : chr "no-referrer-when-downgrade"
## .. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
## $ cookies :'data.frame': 0 obs. of 7 variables:
## ..$ domain : logi(0)
## ..$ flag : logi(0)
## ..$ path : logi(0)
## ..$ secure : logi(0)
## ..$ expiration: 'POSIXct' num(0)
## ..$ name : logi(0)
## ..$ value : logi(0)
## $ content : raw [1:558025] 7b 0a 20 20 ...
## $ date : POSIXct[1:1], format: "2023-06-12 21:04:50"
## $ times : Named num [1:6] 0 0.000022 0.000022 0.000045 0.41462 ...
## ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
## $ request :List of 7
## ..$ method : chr "GET"
## ..$ url : chr "http://openlibrary.org/search.json?author=vonnegut"
## ..$ headers : Named chr "application/json, text/xml, application/xml, */*"
## .. ..- attr(*, "names")= chr "Accept"
## ..$ fields : NULL
## ..$ options :List of 2
## .. ..$ useragent: chr "libcurl/7.81.0 r-curl/4.3.3 httr/1.4.6"
## .. ..$ httpget : logi TRUE
## ..$ auth_token: NULL
## ..$ output : list()
## .. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
## ..- attr(*, "class")= chr "request"
## $ handle :Class 'curl_handle' <externalptr>
## - attr(*, "class")= chr "response"
As before, the response content is JSON, so we again use the jsonlite package and its fromJSON function to bring it into R.
vonInfo2 <- jsonlite::fromJSON(content(von2, "text"), simplifyVector = FALSE)
## No encoding supplied: defaulting to UTF-8.
This is better, as it now provides information on books where Vonnegut was listed as an author.
sapply(vonInfo2[[4]], function(x){
  x$title
})
## [1] "Slaughterhouse-Five"
## [2] "Cat's Cradle"
## [3] "Player Piano"
## [4] "Mother Night"
## [5] "The Sirens of Titan"
## [6] "Jailbird"
## [7] "Breakfast of Champions"
## [8] "God bless you, Mr. Rosewater"
## [9] "Slapstick"
## [10] "Welcome to the Monkey House"
## [11] "Hocus Pocus"
## [12] "Palm Sunday"
## [13] "Bluebeard"
## [14] "Galapagos"
## [15] "Deadeye Dick"
## [16] "Timequake"
## [17] "Fates Worse than Death"
## [18] "Bagombo Snuff Box"
## [19] "Between time and Timbuktu"
## [20] "Wampeters, Foma & Granfalloons (opinions)"
## [21] "A man without a country"
## [22] "Galápagos"
## [23] "Happy birthday, Wanda June"
## [24] "Armageddon in Retrospect"
## [25] "Look at the birdie"
## [26] "Canary in a cat house"
## [27] "God bless you, Dr. Kevorkian"
## [28] "Wampeters, foma & granfalloons"
## [29] "Breakfast of champions, or, Goodbye blue Monday!"
## [30] "Again, Dangerous Visions"
## [31] "Gibier de potence"
## [32] "God bless you, Mr Rosewater; or, Pearls before swine"
## [33] "Sun, moon, star"
## [34] "Happy Birthday, Wanda June"
## [35] "Deadeye Dick"
## [36] "Like shaking hands with God"
## [37] "Galapagos"
## [38] "Pity the Reader"
## [39] "Welcome to the Monkey House CD"
## [40] "Hocus Pocus Or, What's the Hurry, Son?"
## [41] "Vonnegut Omnibus"
## [42] "Breakfast of Champions CD"
## [43] "Essential Vonnegut Interviews CD"
## [44] "Bagombo Snuff Box a Fmt (Export)"
## [45] "Zeitbeben"
## [46] "Slapstick/Mother Night"
## [47] "Galapagos"
## [48] "Slaughterhouse-five ; The sirens of Titan ; Player-piano ; cat's cradle ; Breakfast of champions ; Mother night"
## [49] "Le Berceau du chat"
## [50] "We Are What We Pretend to Be"
## [51] "Welcome to the Monkey House"
## [52] "Jailbird"
## [53] "Kurt Vonnegut"
## [54] "Armageddon in Retrospect"
## [55] "Timequake"
## [56] "Bluebeard"
## [57] "God Bless You, Dr.Kevorkian"
## [58] "While Mortals Sleep"
## [59] "Between Time and Timbuktu Or Prometheus 5"
## [60] "If This Isn't Nice, What Is?"
## [61] "Love, Kurt"
## [62] "Domestic goddesses"
## [63] "Galapagos CST"
## [64] "Retsidivist. Galapagosy"
## [65] "The Vonnegut Quartet"
## [66] "Schlachthof 5 oder der Kinderkreuzzug"
## [67] "Sirenene på Titan"
## [68] "Hokus Pokus oder wohin so eilig?"
## [69] "Slapstick, or, Lonesome no more!"
## [70] "Ghiaccio-nove"
## [71] "AITI YO"
## [72] "Confido"
## [73] "Hello, Red"
## [74] "Kurt Vonnegut Jr Conversations With Writers (L077 Cassette)"
## [75] "Next Door"
## [76] "Easy Readers - English - Level 4"
## [77] "Welcome to the Monkey House"
## [78] "Armageddon in retrospect, and other new and unpublished writings on war and peace"
## [79] "Three complete novels"
## [80] "Kurt Vonnegut's Slaughterhouse Five (Monarch Notes) A Critical Commentary"
## [81] "Conversations with Kurt Vonnegut"
## [82] "Op Timequake"
## [83] "Who Am I This Time"
## [84] "Cuna de Gato"
## [85] "Long Walk to Forever"
## [86] "Suche Traum, biete mich. Verstreute Kurzgeschichten"
## [87] "The Sirens of Titan"
## [88] "Abattoir 5 ou la croisade des enfants"
## [89] "Di wu hao tu zai chang"
## [90] "The Barnhouse Effect"
## [91] "Wampeters, Fomas & Granfalloons"
## [92] "Barbazul"
## [93] "The Kurt Vonnegut, Jr Soundbook/4 Cassettes and Program Booklet"
## [94] "Fates Worse/death Ltd"
## [95] "Barbe-Bleue, ou, La vie et les oeuvres de Rabo Karabekian, 1916-1988"
## [96] "On Mark Twain, Lincoln, imperialist wars and the weather"
## [97] "God Bless You, Mr. Rosewater or Pearls Before Swine"
## [98] "Man Without a Country, A"
## [99] "Slaughterhouse- Five. Arbeitsbuch für Englischkurse in der Sekundarstufe 2"
## [100] "Abracadabra"
A fun side note: you can access GitHub info through API calls as well.
tad <- httr::GET('https://api.github.com/users/taddallas')
tadInfo <- jsonlite::fromJSON(content(tad, "text"), simplifyVector = FALSE)
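The parsed result is again a named list, so fields from GitHub's users endpoint (e.g., login and public_repos) can be pulled out directly:

tadInfo$login          # GitHub username
tadInfo$public_repos   # number of public repositories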
So this gives a good background about APIs, but it does not give a good biological example of when they are useful. The books that Vonnegut wrote are unlikely to change, so the search queries to this API are fairly static, and a bit removed from biology.
One good biology API that is open is the Global Biodiversity Information Facility (GBIF). We can use a similar structure to our Open Library query to obtain information on species names and occurrences.
giraffe <- httr::GET('https://www.gbif.org/developer/species/Giraffa+camelopardalis')
str(giraffe)
## List of 10
## $ url : chr "https://www.gbif.org/developer/species/Giraffa+camelopardalis"
## $ status_code: int 404
## $ headers :List of 19
## ..$ content-security-policy : chr "default-src maxcdn.bootstrapcdn.com cdn.jsdelivr.net/codemirror.spell-checker/ 'self' *.gbif.org *.gbif-uat.org"| __truncated__
## ..$ x-dns-prefetch-control : chr "off"
## ..$ expect-ct : chr "max-age=0"
## ..$ x-frame-options : chr "SAMEORIGIN"
## ..$ strict-transport-security : chr "max-age=600; includeSubDomains"
## ..$ x-download-options : chr "noopen"
## ..$ x-content-type-options : chr "nosniff"
## ..$ x-permitted-cross-domain-policies: chr "none"
## ..$ x-xss-protection : chr "0"
## ..$ x-request-id : chr "773b95b0-0963-11ee-8690-f9e6b9db5d9e"
## ..$ content-type : chr "text/html; charset=utf-8"
## ..$ etag : chr "W/\"e600-f7J9KD5JDE0A2PlHIs9BtR8CvCc\""
## ..$ vary : chr "Accept-Encoding"
## ..$ content-encoding : chr "gzip"
## ..$ date : chr "Mon, 12 Jun 2023 20:55:30 GMT"
## ..$ x-varnish : chr "931137505 941424977"
## ..$ age : chr "560"
## ..$ via : chr "1.1 varnish (Varnish/6.0)"
## ..$ content-length : chr "8023"
## ..- attr(*, "class")= chr [1:2] "insensitive" "list"
## $ all_headers:List of 1
## ..$ :List of 3
## .. ..$ status : int 404
## .. ..$ version: chr "HTTP/2"
## .. ..$ headers:List of 19
## .. .. ..$ content-security-policy : chr "default-src maxcdn.bootstrapcdn.com cdn.jsdelivr.net/codemirror.spell-checker/ 'self' *.gbif.org *.gbif-uat.org"| __truncated__
## .. .. ..$ x-dns-prefetch-control : chr "off"
## .. .. ..$ expect-ct : chr "max-age=0"
## .. .. ..$ x-frame-options : chr "SAMEORIGIN"
## .. .. ..$ strict-transport-security : chr "max-age=600; includeSubDomains"
## .. .. ..$ x-download-options : chr "noopen"
## .. .. ..$ x-content-type-options : chr "nosniff"
## .. .. ..$ x-permitted-cross-domain-policies: chr "none"
## .. .. ..$ x-xss-protection : chr "0"
## .. .. ..$ x-request-id : chr "773b95b0-0963-11ee-8690-f9e6b9db5d9e"
## .. .. ..$ content-type : chr "text/html; charset=utf-8"
## .. .. ..$ etag : chr "W/\"e600-f7J9KD5JDE0A2PlHIs9BtR8CvCc\""
## .. .. ..$ vary : chr "Accept-Encoding"
## .. .. ..$ content-encoding : chr "gzip"
## .. .. ..$ date : chr "Mon, 12 Jun 2023 20:55:30 GMT"
## .. .. ..$ x-varnish : chr "931137505 941424977"
## .. .. ..$ age : chr "560"
## .. .. ..$ via : chr "1.1 varnish (Varnish/6.0)"
## .. .. ..$ content-length : chr "8023"
## .. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
## $ cookies :'data.frame': 0 obs. of 7 variables:
## ..$ domain : logi(0)
## ..$ flag : logi(0)
## ..$ path : logi(0)
## ..$ secure : logi(0)
## ..$ expiration: 'POSIXct' num(0)
## ..$ name : logi(0)
## ..$ value : logi(0)
## $ content : raw [1:58880] 0a 3c 21 44 ...
## $ date : POSIXct[1:1], format: "2023-06-12 20:55:30"
## $ times : Named num [1:6] 0 0.0389 0.1484 0.409 0.5186 ...
## ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
## $ request :List of 7
## ..$ method : chr "GET"
## ..$ url : chr "https://www.gbif.org/developer/species/Giraffa+camelopardalis"
## ..$ headers : Named chr "application/json, text/xml, application/xml, */*"
## .. ..- attr(*, "names")= chr "Accept"
## ..$ fields : NULL
## ..$ options :List of 2
## .. ..$ useragent: chr "libcurl/7.81.0 r-curl/4.3.3 httr/1.4.6"
## .. ..$ httpget : logi TRUE
## ..$ auth_token: NULL
## ..$ output : list()
## .. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
## ..- attr(*, "class")= chr "request"
## $ handle :Class 'curl_handle' <externalptr>
## - attr(*, "class")= chr "response"
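Note the status_code of 404 above: the URL we requested is GBIF's developer documentation page, not an API endpoint, so nothing useful came back. The REST API itself lives under https://api.gbif.org/v1/. A sketch against the species search endpoint (per GBIF's API documentation) should return JSON:

giraffeAPI <- httr::GET('https://api.gbif.org/v1/species/search',
                        query = list(q = 'Giraffa camelopardalis'))
httr::status_code(giraffeAPI)   # should now be 200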
Before we dive too far down this hole, it is important to note that this wheel has already been invented and is being maintained by the rOpenSci collective, a group of R programmers who develop tools for the access and analysis of data resources. We will use their rgbif package, not because it enhances reproducibility (as it is yet another dependency), but because it is well-written, well-documented, and from a group of scientists committed to helping make open data available and analyses reproducible.
# install.packages('rgbif')
library(rgbif)
giraffe <- rgbif::occ_search(scientificName = "Giraffa camelopardalis",
                             limit = 5000)
Depending on your operating system (OS) and previously installed packages, this is where we start to get into issues of dependencies. The rgbif package requires the installation of rgeos, which requires the geos library to be installed outside of R. Hopefully you will be able to successfully install the package, but it is important to note that handling these dependency and OS-specific issues is pretty central to making sure an analytical pipeline is reproducible.
So presumably we have successfully queried the GBIF API using the rgbif package, and now have a data.frame of giraffe occurrences (limited to 5000 occurrence points). The output data are structured as a list of lists, which is a bit confusing if we look at the str() of giraffe, as the authors of the package have created a class called gbif to hold all the data. This is common in R packages, and the authors have written in some great functionality for printing and working with these data, formatting the data as a tibble object. We will not go too much into tibbles, but they are pretty useful for keeping data in a “tidy” format, as columns in a tibble can be lists (e.g., what if we wanted to store spatial polygon information, vectors, etc. alongside each row element? tibble handles this need effectively).
typeof(giraffe)
## [1] "list"
class(giraffe)
## [1] "gbif"
For simplicity's sake, we can use some of the functionality within rgbif to return just the data we are interested in, rather than all the potential data (which includes images, etc.). Note the warning below: the return argument is defunct in recent versions of rgbif and is ignored, so it is the [[3]] indexing that actually extracts the data element.
giraffe2 <- rgbif::occ_search(scientificName = "Giraffa camelopardalis",
                              limit = 5000, return='data')[[3]]
## Warning in pchk(return, "occ_search"): `return` param in `occ_search` function is defunct as of rgbif v3.0.0, and is ignored
## See `?rgbif` for more information.
We will quickly plot these data out just to show them in geographic space. We first create a base map using the maps package, which is a really easy way to plot geopolitical boundaries. Then we layer on the occurrence data using the points function.
# install.packages('maps')
maps::map()
points(giraffe2$decimalLongitude, giraffe2$decimalLatitude, pch=16,
       col=grey(0.1,0.5))
Giraffes are in the US!!! These are almost certainly zoo records.
Accessing data through APIs is fantastic, but there are plenty of times when you may need to access large data files locally. Note that a database framework is only necessary if the data you are working with cannot fit into local memory; otherwise, there is no real advantage to using one, as it will likely be more expensive in terms of time and potential frustration.
For this, we will have to include yet another dependency (5+ in this lecture alone), which of course has a bunch of other dependencies. This is not really ideal given the focus on reproducible research in the course, but it is sadly pretty much necessary.
install.packages('dbplyr')
When referring to “databases”, we may think that there is a single type or schema, but this is definitely not the case. R accommodates different database types by including a dependency for each type (maybe not ideal). The different ways R interfaces with different database types are described at https://www.kaggle.com/anello/working-with-databases-in-r.
We will focus on the use of RSQLite as a backend to connect to SQLite databases. This will sadly require another dependency.
install.packages('RSQLite')
dir.create("data_raw", showWarnings = FALSE)
download.file(url = "https://ndownloader.figshare.com/files/2292171",
              destfile = "data_raw/portal_mammals.sqlite", mode = "wb")
library(DBI)
mammals <- DBI::dbConnect(RSQLite::SQLite(), "data_raw/portal_mammals.sqlite")
dbplyr::src_dbi(mammals)
## src: sqlite 3.41.2 [/media/tad/sanDisk1TB/Teaching/ecoInformatics/website/content/code/data_raw/portal_mammals.sqlite]
## tbls: plots, species, surveys
Just like a spreadsheet with multiple worksheets, a SQLite database can contain multiple tables. In this case, three of them are listed in the tbls row of the output above: plots, species, and surveys.
Now that we know we can connect to the database, let’s explore how to get the data from its tables into R.
To connect to tables within a database, you can use the tbl() function from dplyr. This function can be used to send SQL queries to the database. To demonstrate this functionality, let's select the columns “year”, “species_id”, and “plot_id” from the surveys table:
dplyr::tbl(mammals, dplyr::sql("SELECT year, species_id, plot_id FROM surveys"))
## # Source: SQL [?? x 3]
## # Database: sqlite 3.41.2 [/media/tad/sanDisk1TB/Teaching/ecoInformatics/website/content/code/data_raw/portal_mammals.sqlite]
## year species_id plot_id
## <int> <chr> <int>
## 1 1977 NL 2
## 2 1977 NL 3
## 3 1977 DM 2
## 4 1977 DM 7
## 5 1977 DM 3
## 6 1977 PF 1
## 7 1977 PE 2
## 8 1977 DM 1
## 9 1977 DM 1
## 10 1977 PF 6
## # ℹ more rows
With this approach you can use any of the SQL queries we have seen in the database lesson.
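For example, a query with a WHERE clause works the same way (1977 is a year we know exists in the data from the output above):

dplyr::tbl(mammals,
           dplyr::sql("SELECT year, species_id, plot_id FROM surveys WHERE year = 1977"))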
One of the strengths of dplyr is that the same operation can be done using dplyr verbs instead of writing SQL. First, we select the table on which to do the operations by creating the surveys object, and then we use the standard dplyr syntax as if it were a data frame:
library(dplyr)   # attach dplyr so the %>% pipe is available

surveys <- dplyr::tbl(mammals, "surveys")
surveys %>%
  dplyr::select(year, species_id, plot_id)
## # Source: SQL [?? x 3]
## # Database: sqlite 3.41.2 [/media/tad/sanDisk1TB/Teaching/ecoInformatics/website/content/code/data_raw/portal_mammals.sqlite]
## year species_id plot_id
## <int> <chr> <int>
## 1 1977 NL 2
## 2 1977 NL 3
## 3 1977 DM 2
## 4 1977 DM 7
## 5 1977 DM 3
## 6 1977 PF 1
## 7 1977 PE 2
## 8 1977 DM 1
## 9 1977 DM 1
## 10 1977 PF 6
## # ℹ more rows
In this case, the surveys object behaves like a data frame. Several functions that can be used with data frames can also be used on tables from a database. For instance, the head() function can be used to check the first 10 rows of the table:
head(surveys, n = 10)
## # Source: SQL [10 x 9]
## # Database: sqlite 3.41.2 [/media/tad/sanDisk1TB/Teaching/ecoInformatics/website/content/code/data_raw/portal_mammals.sqlite]
## record_id month day year plot_id species_id sex hindfoot_length weight
## <int> <int> <int> <int> <int> <chr> <chr> <int> <int>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 2 7 16 1977 3 NL M 33 NA
## 3 3 7 16 1977 2 DM F 37 NA
## 4 4 7 16 1977 7 DM M 36 NA
## 5 5 7 16 1977 3 DM M 35 NA
## 6 6 7 16 1977 1 PF M 14 NA
## 7 7 7 16 1977 2 PE F NA NA
## 8 8 7 16 1977 1 DM M 37 NA
## 9 9 7 16 1977 1 DM F 34 NA
## 10 10 7 16 1977 6 PF F 20 NA
surveys2 <- surveys %>%
  dplyr::filter(weight < 5) %>%
  dplyr::select(species_id, sex, weight)
R makes lazy calls to databases until you force it not to be lazy. That is, everything is held in the database, outside of R's memory, until you tell R that some portion of the data should be pulled in.
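You can see this laziness directly: dbplyr translates the chain of verbs into SQL, and show_query() prints the query that would be sent to the database, without executing the full data pull:

dplyr::show_query(surveys2)   # prints the SQL generated from the dplyr verbs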
head(surveys2)
## # Source: SQL [6 x 3]
## # Database: sqlite 3.41.2 [/media/tad/sanDisk1TB/Teaching/ecoInformatics/website/content/code/data_raw/portal_mammals.sqlite]
## species_id sex weight
## <chr> <chr> <int>
## 1 PF M 4
## 2 PF F 4
## 3 PF <NA> 4
## 4 PF F 4
## 5 PF F 4
## 6 RM M 4
To pull the data into memory so we can work with it locally, we must use the collect() function.
surveys3 <- dplyr::collect(surveys2)
head(surveys3)
## # A tibble: 6 × 3
## species_id sex weight
## <chr> <chr> <int>
## 1 PF M 4
## 2 PF F 4
## 3 PF <NA> 4
## 4 PF F 4
## 5 PF F 4
## 6 RM M 4
This difference becomes clear when we try to index a specific column of data and work with it.
surveys2$sex
## NULL
surveys3$sex
## [1] "M" "F" NA "F" "F" "M" "F" "M" "M" "M" "M" "F" "M" "M" "M" "M" "M"
dplyr syntax of joins

We discussed joins in the data manipulation section. Joins are pretty key to working with databases, as much of the benefit of database structure comes from nested data and the utility of key values. For instance, in our mammal sampling data, we have plot-level data…
plots <- dplyr::tbl(mammals, "plots")
plots
## # Source: table<plots> [?? x 2]
## # Database: sqlite 3.41.2 [/media/tad/sanDisk1TB/Teaching/ecoInformatics/website/content/code/data_raw/portal_mammals.sqlite]
## plot_id plot_type
## <int> <chr>
## 1 1 Spectab exclosure
## 2 2 Control
## 3 3 Long-term Krat Exclosure
## 4 4 Control
## 5 5 Rodent Exclosure
## 6 6 Short-term Krat Exclosure
## 7 7 Rodent Exclosure
## 8 8 Control
## 9 9 Spectab exclosure
## 10 10 Rodent Exclosure
## # ℹ more rows
and survey-level data, where plot-level data provides what is essentially metadata for each survey.
We can use joins to combine and manipulate these data, as we did before, but now in the context of the database structure.
survPlot <- dplyr::left_join(surveys, plots, by='plot_id')
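Like the other operations above, this join is lazy. A quick peek confirms the plot-level columns are now attached to each survey record:

head(survPlot)   # surveys rows with plot_type joined on via plot_id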
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] geodata_0.5-8 terra_1.7-29 maps_3.4.1 gbm_2.1.8.1 igraph_1.4.3
## [6] dplyr_1.1.2 plyr_1.8.8 DBI_1.1.3 rgbif_3.7.7 jsonlite_1.8.5
## [11] httr_1.4.6 rmarkdown_2.11 fastmap_1.1.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.3 xfun_0.29 ggplot2_3.4.2 lattice_0.21-8
## [5] vctrs_0.6.2 tools_4.3.0 generics_0.1.2 parallel_4.3.0
## [9] curl_4.3.3 tibble_3.2.1 fansi_1.0.2 RSQLite_2.3.1
## [13] highr_0.9 blob_1.2.4 pkgconfig_2.0.3 Matrix_1.5-1
## [17] data.table_1.14.6 dbplyr_2.3.2 lifecycle_1.0.3 compiler_4.3.0
## [21] stringr_1.5.0 munsell_0.5.0 codetools_0.2-19 htmltools_0.5.2
## [25] yaml_2.3.6 lazyeval_0.2.2 pillar_1.9.0 jquerylib_0.1.4
## [29] whisker_0.4.1 cachem_1.0.8 viridis_0.6.3 tidyselect_1.2.0
## [33] digest_0.6.31 stringi_1.7.12 purrr_1.0.1 splines_4.3.0
## [37] grid_4.3.0 colorspace_2.1-0 cli_3.6.1 magrittr_2.0.2
## [41] triebeard_0.4.1 survival_3.5-3 crul_1.4.0 utf8_1.2.2
## [45] withr_2.5.0 scales_1.2.1 bit64_4.0.5 oai_0.4.0
## [49] bit_4.0.5 gridExtra_2.3 memoise_2.0.1 evaluate_0.15
## [53] knitr_1.37 viridisLite_0.4.2 rlang_1.1.1 urltools_1.7.3
## [57] Rcpp_1.0.10 glue_1.6.2 httpcode_0.3.0 xml2_1.3.4
## [61] R6_2.5.1