Chapter 9 Fetch
The ft_get
function makes it easy to fetch full text articles. There are a few different ways to use ft_get
:
- Pass in only DOIs - leave
from
parameterNULL
. This route will first query Crossref API for the publisher of the DOI, then we’ll use the appropriate method to fetch full text from the publisher. If a publisher is not found for the DOI, then we’ll throw back a message telling you a publisher was not found. - Pass in DOIs (or other pub IDs) and use the
from
parameter. This route means we don’t have to make an extra API call to Crossref (thus, this route is faster) to determine the publisher for each DOI. We go straight to getting full text based on the publisher. - Use
ft_search()
to search for articles. Then pass that output to this function, which will use info in that object. This behaves the same as the previous option in that each DOI has publisher info so we know how to get full text for each DOI.
Note that some publishers are available through other data sources, e.g., through Entrez’s Pubmed.
ft_get
is a bit complicated. These are just some of the hurdles we’re jumping over:
- Negotiating various user inputs, likely seeing new publishers we’ve not dealt with
- Dealing with authentication and trying to make it easier for users
- Users sometimes being at an IP address that has access to a publisher and sometimes not
- Caching results to avoid unnecessary downloads if the content has already been acquired
Thus, expect some hiccups here, and please do report problems, and if a certain publisher is not supported yet.
9.1 Data formats
You can specify whether you want PDF, XML or plaint text with the type
parameter. It is sometimes ignored, sometimes used, depending on the data source. For certain data sources, they only accept one type. Details by data source/publisher:
- PLOS: pdf and xml
- Entrez: only xml
- eLife: pdf and xml
- Pensoft: pdf and xml
- arXiv: only pdf
- BiorXiv: only pdf
- Elsevier: pdf and plain
- Wiley: only pdf
- Peerj: pdf and xml
- Informa: only pdf
- FrontiersIn: pdf and xml
- Copernicus: pdf and xml
- Scientific Societies: only pdf
- Crossref: depends on the publisher
- other data sources/publishers: there are too many to cover here - will try to make a helper in the future for what is covered by different publishers
9.2 How data is stored
This depends on what backend
value you use. If you use the default (rds) we store all data in .rds
files. These are binary compressed files that are specific to R. Because they are specific to R, you don’t want to use this option if part of your downstream workflow is using another tool/programming language.
The three types are stored in differnt ways. xml and plain text are parsed to plain text then stored with whatever backend you choose. However, pdf is retrived as raw bytes and stored as such. Thus, we no longer write pdf files to disk. However, you can easily do that yourself with ft_extract()
or yourself by using pdftools::pdf_text
which accepts a file path to a pdf or raw bytes.
9.3 Usage
library(fulltext)
List backends available
ft_get_ls()
#> [1] "aaas" "aip" "amersocclinoncol"
#> [4] "amersocmicrobiol" "arxiv" "biorxiv"
#> [7] "bmc" "cambridge" "cob"
#> [10] "copernicus" "crossref" "elife"
#> [13] "elsevier" "entrez" "frontiersin"
#> [16] "ieee" "informa" "instinvestfil"
#> [19] "jama" "microbiology" "peerj"
#> [22] "pensoft" "plos" "pnas"
#> [25] "royalsocchem" "roysoc" "sciencedirect"
#> [28] "scientificsocieties" "transtech" "wiley"
The simplest approach is passing a DOI directly to ft_get
<- ft_get('10.1371/journal.pone.0086169')) (res
#> <fulltext text>
#> [Docs] 1
#> [Source] ext - /Users/sckott/Library/Caches/R/fulltext
#> [IDs] 10.1371/journal.pone.0086169 ...
$plos res
#> $found
#> [1] 1
#>
#> $dois
#> [1] "10.1371/journal.pone.0086169"
#>
#> $data
#> $data$backend
#> [1] "ext"
#>
#> $data$cache_path
#> [1] "~/Library/Caches/R/fulltext"
#>
#> $data$path
#> $data$path$`10.1371/journal.pone.0086169`
#> $data$path$`10.1371/journal.pone.0086169`$path
#> [1] "/Users/sckott/Library/Caches/R/fulltext/10_1371_journal_pone_0086169.xml"
#>
#> $data$path$`10.1371/journal.pone.0086169`$id
#> [1] "10.1371/journal.pone.0086169"
#>
#> $data$path$`10.1371/journal.pone.0086169`$type
#> [1] "xml"
#>
#> $data$path$`10.1371/journal.pone.0086169`$error
#> NULL
#>
#>
#>
#> $data$data
#> NULL
#>
#>
#> $opts
#> $opts$doi
#> [1] "10.1371/journal.pone.0086169"
#>
#> $opts$type
#> [1] "xml"
#>
#> $opts$progress
#> [1] FALSE
#>
#>
#> $errors
#> id error
#> 1 10.1371/journal.pone.0086169 <NA>
You can pass many DOIs in at once
<- ft_get(c('10.3389/fphar.2014.00109', '10.3389/feart.2015.00009'))) (res
#> <fulltext text>
#> [Docs] 2
#> [Source] ext - /Users/sckott/Library/Caches/R/fulltext
#> [IDs] 10.3389/fphar.2014.00109 10.3389/feart.2015.00009 ...
$frontiersin res
#> $found
#> [1] 2
#>
#> $dois
#> [1] "10.3389/fphar.2014.00109" "10.3389/feart.2015.00009"
#>
#> $data
#> $data$backend
#> [1] "ext"
#>
#> $data$cache_path
#> [1] "~/Library/Caches/R/fulltext"
#>
#> $data$path
#> $data$path$`10.3389/fphar.2014.00109`
#> $data$path$`10.3389/fphar.2014.00109`$path
#> [1] "/Users/sckott/Library/Caches/R/fulltext/10_3389_fphar_2014_00109.xml"
#>
#> $data$path$`10.3389/fphar.2014.00109`$id
#> [1] "10.3389/fphar.2014.00109"
#>
#> $data$path$`10.3389/fphar.2014.00109`$type
#> [1] "xml"
#>
#> $data$path$`10.3389/fphar.2014.00109`$error
#> NULL
#>
#>
#> $data$path$`10.3389/feart.2015.00009`
#> $data$path$`10.3389/feart.2015.00009`$path
#> [1] "/Users/sckott/Library/Caches/R/fulltext/10_3389_feart_2015_00009.xml"
#>
#> $data$path$`10.3389/feart.2015.00009`$id
#> [1] "10.3389/feart.2015.00009"
#>
#> $data$path$`10.3389/feart.2015.00009`$type
#> [1] "xml"
#>
#> $data$path$`10.3389/feart.2015.00009`$error
#> NULL
#>
#>
#>
#> $data$data
#> NULL
#>
#>
#> $opts
#> $opts$dois
#> [1] "10.3389/fphar.2014.00109" "10.3389/feart.2015.00009"
#>
#> $opts$type
#> [1] "xml"
#>
#> $opts$progress
#> [1] FALSE
#>
#>
#> $errors
#> id error
#> 1 10.3389/fphar.2014.00109 <NA>
#> 2 10.3389/feart.2015.00009 <NA>
9.4 Errors
ft_get()
for each article has an error
slot.
If the error
slot is NULL
then there is no error.
If the error
slot is not NULL
there was an error, and the error message
will be a character string in that slot.
Possible errors include:
- An error reported by the web service
- e.g. “Timeout was reached: Connection timed out after 10003 milliseconds”
- No link found
- e.g. “no link found from Crossref” OR “has no link available”
- We attempted to fetch the article but the content type wasn’t what was expected. In this case we skip to the next article.
- e.g. “type was supposed to be
pdf
, but wastext/html; charset=UTF-8
”
- e.g. “type was supposed to be
- Weird uninformative errors
- e.g. “Recv failure: Operation timed out” OR “Operation was aborted by an application callback”
- An error associated mostly with PLOS. PLOS gives DOIs for parts of articles, like figures, so it doesn’t make sense to get full text of a figure.
- e.g., “was not found or may be a DOI for a part of an article”
An additional top level slot called errors
has a data.frame of all errors
from each article, like:
<- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.aaaa'), from = "elife")
res $elife$errors
res
id error1 10.7554/eLife.03032 <NA>
2 10.7554/eLife.aaaa subscript out of bounds
Where the second DOI was invalid. Granted the error in the data.frame “subscript out of bounds” isn’t very informative, but we can work on that.
9.5 Cleanup
The above section about errors suggests that we often run into errors. When we run into errors downloading full text we capture the error message, if there is one, and delete the file we were trying to create. That is, we cleanup upon hitting an error such that you shouldn’t end up with blank files on your machine. Let us know if this isn’t true in your case and we’ll get it fixed.
Note that even if you exit out of the all to ft_get()
it should clean up a file
if it is not completey done creating it, so you shouldn’t end up with bad files
if you exit out of the function when it’s running.
9.6 Internals
What’s going on under the hood in ft_get()
? It goes like this:
- If you request data from a specific data source (use of
from
, only allowed for PLOS, Entrez, eLife, Pensoft, arXiv, BiorXiv, Elsevier and Wiley):- Grab publisher specific collector function
- Each of these functions has specific code for that publisher to pull full text given an article identifier
- If you don’t request a specific data source:
- Guess which publisher the DOI comes from
- If publisher discovered that we have plugins for
- Get publisher specific collector function
- If publisher not discovered
- Ping the https://ftdoi.org API
- If publisher found with ftdoi API:
- Link for full text used from the ftdoi API response
- If publisher not found with ftdoi API:
- Attempt fetch via Crossref API - if links found we try those, some work and some don’t
9.7 Notes about specific data sources
9.7.1 Elsevier
When you don’t have access to the full text of Elsevier articles they will often still give you something, but it will sometimes be just metadata of the paper or sometimes an abstract if you’re lucky. When you go to extract the text this will be rather obvious.