Chapter 9 Fetch

The ft_get function makes it easy to fetch full text articles. There are a few different ways to use ft_get, each sketched in code below:

  • Pass in only DOIs - leave the from parameter NULL. This route first queries the Crossref API for the publisher of each DOI, then uses the appropriate method to fetch full text from that publisher. If a publisher is not found for a DOI, we return a message telling you so.
  • Pass in DOIs (or other publication IDs) and use the from parameter. This route skips the extra API call to Crossref to determine the publisher of each DOI, and is thus faster; we go straight to fetching full text based on the publisher.
  • Use ft_search() to search for articles, then pass that output to this function, which will use the info in that object. This behaves the same as the previous option: each DOI carries publisher info, so we know how to get full text for each DOI.

Note that some publishers are available through other data sources, e.g., through Entrez’s Pubmed.
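
Here is a minimal sketch of the three routes (the DOIs and the query are illustrative):

library(fulltext)

# Route 1: DOIs only - ft_get() looks up the publisher via Crossref
res1 <- ft_get('10.1371/journal.pone.0086169')

# Route 2: DOIs plus from - skips the Crossref lookup, so it's faster
res2 <- ft_get('10.7554/eLife.03032', from = "elife")

# Route 3: pass ft_search() output, which already carries publisher info
res3 <- ft_get(ft_search(query = "ecology", from = "plos"))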

ft_get is a bit complicated. These are just some of the hurdles we’re jumping over:

  • Negotiating various user inputs, and likely encountering new publishers we haven’t dealt with before
  • Dealing with authentication and trying to make it easier for users
  • Users sometimes being at an IP address that has access to a publisher and sometimes not
  • Caching results to avoid unnecessary downloads if the content has already been acquired

Thus, expect some hiccups here, and please do report problems, including publishers that are not yet supported.

9.1 Data formats

You can specify whether you want PDF, XML, or plain text with the type parameter. Depending on the data source it is sometimes used and sometimes ignored, and certain data sources accept only one type. Details by data source/publisher (an example of setting type follows the list):

  • PLOS: pdf and xml
  • Entrez: only xml
  • eLife: pdf and xml
  • Pensoft: pdf and xml
  • arXiv: only pdf
  • BiorXiv: only pdf
  • Elsevier: pdf and plain
  • Wiley: only pdf
  • Peerj: pdf and xml
  • Informa: only pdf
  • FrontiersIn: pdf and xml
  • Copernicus: pdf and xml
  • Scientific Societies: only pdf
  • Crossref: depends on the publisher
  • other data sources/publishers: there are too many to cover here - we will try to add a helper in the future that reports which types are covered by different publishers
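
For example, to request a pdf from PLOS instead of the default xml (a sketch; the DOI is the one used later in this chapter):

# PLOS serves both pdf and xml, so type is honored here
res <- ft_get('10.1371/journal.pone.0086169', type = "pdf")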

9.2 How data is stored

This depends on what backend value you use. If you use the default (rds), we store all data in .rds files. These are compressed binary files specific to R; because they are specific to R, you don’t want to use this option if part of your downstream workflow uses another tool or programming language.
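
A minimal sketch of switching storage, assuming your version of fulltext configures caching via cache_options_set() (check ?cache_options_set in your installed version):

library(fulltext)
# assumption: cache_options_set() controls where and how files are cached;
# "ext" stores files under their natural extensions (.xml, .pdf, .txt)
cache_options_set(backend = "ext")
cache_options_get()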

The three types are stored in different ways. xml and plain text are parsed to plain text, then stored with whatever backend you choose. However, pdf is retrieved as raw bytes and stored as such; thus, we no longer write pdf files to disk. You can easily extract the text yourself, though, with ft_extract(), or by using pdftools::pdf_text(), which accepts a file path to a pdf or raw bytes.
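
For example, to get the text out of a fetched pdf yourself (a sketch, reusing the PLOS DOI used later in this chapter):

res <- ft_get('10.1371/journal.pone.0086169', type = "pdf")
# ft_extract() parses the pdf content to text
res_txt <- ft_extract(res)
# pdftools::pdf_text() also works, given a file path to a pdf or raw bytes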

9.3 Usage

library(fulltext)

List the data sources (publisher plugins) available:

ft_get_ls()
#>  [1] "aaas"                "aip"                 "amersocclinoncol"   
#>  [4] "amersocmicrobiol"    "arxiv"               "biorxiv"            
#>  [7] "bmc"                 "cambridge"           "cob"                
#> [10] "copernicus"          "crossref"            "elife"              
#> [13] "elsevier"            "entrez"              "frontiersin"        
#> [16] "ieee"                "informa"             "instinvestfil"      
#> [19] "jama"                "microbiology"        "peerj"              
#> [22] "pensoft"             "plos"                "pnas"               
#> [25] "royalsocchem"        "roysoc"              "sciencedirect"      
#> [28] "scientificsocieties" "transtech"           "wiley"

The simplest approach is passing a DOI directly to ft_get:

(res <- ft_get('10.1371/journal.pone.0086169'))
#> <fulltext text>
#> [Docs] 1 
#> [Source] ext - /Users/sckott/Library/Caches/R/fulltext 
#> [IDs] 10.1371/journal.pone.0086169 ...
res$plos
#> $found
#> [1] 1
#> 
#> $dois
#> [1] "10.1371/journal.pone.0086169"
#> 
#> $data
#> $data$backend
#> [1] "ext"
#> 
#> $data$cache_path
#> [1] "~/Library/Caches/R/fulltext"
#> 
#> $data$path
#> $data$path$`10.1371/journal.pone.0086169`
#> $data$path$`10.1371/journal.pone.0086169`$path
#> [1] "/Users/sckott/Library/Caches/R/fulltext/10_1371_journal_pone_0086169.xml"
#> 
#> $data$path$`10.1371/journal.pone.0086169`$id
#> [1] "10.1371/journal.pone.0086169"
#> 
#> $data$path$`10.1371/journal.pone.0086169`$type
#> [1] "xml"
#> 
#> $data$path$`10.1371/journal.pone.0086169`$error
#> NULL
#> 
#> 
#> 
#> $data$data
#> NULL
#> 
#> 
#> $opts
#> $opts$doi
#> [1] "10.1371/journal.pone.0086169"
#> 
#> $opts$type
#> [1] "xml"
#> 
#> $opts$progress
#> [1] FALSE
#> 
#> 
#> $errors
#>                             id error
#> 1 10.1371/journal.pone.0086169  <NA>

You can pass many DOIs in at once:

(res <- ft_get(c('10.3389/fphar.2014.00109', '10.3389/feart.2015.00009')))
#> <fulltext text>
#> [Docs] 2 
#> [Source] ext - /Users/sckott/Library/Caches/R/fulltext 
#> [IDs] 10.3389/fphar.2014.00109 10.3389/feart.2015.00009 ...
res$frontiersin
#> $found
#> [1] 2
#> 
#> $dois
#> [1] "10.3389/fphar.2014.00109" "10.3389/feart.2015.00009"
#> 
#> $data
#> $data$backend
#> [1] "ext"
#> 
#> $data$cache_path
#> [1] "~/Library/Caches/R/fulltext"
#> 
#> $data$path
#> $data$path$`10.3389/fphar.2014.00109`
#> $data$path$`10.3389/fphar.2014.00109`$path
#> [1] "/Users/sckott/Library/Caches/R/fulltext/10_3389_fphar_2014_00109.xml"
#> 
#> $data$path$`10.3389/fphar.2014.00109`$id
#> [1] "10.3389/fphar.2014.00109"
#> 
#> $data$path$`10.3389/fphar.2014.00109`$type
#> [1] "xml"
#> 
#> $data$path$`10.3389/fphar.2014.00109`$error
#> NULL
#> 
#> 
#> $data$path$`10.3389/feart.2015.00009`
#> $data$path$`10.3389/feart.2015.00009`$path
#> [1] "/Users/sckott/Library/Caches/R/fulltext/10_3389_feart_2015_00009.xml"
#> 
#> $data$path$`10.3389/feart.2015.00009`$id
#> [1] "10.3389/feart.2015.00009"
#> 
#> $data$path$`10.3389/feart.2015.00009`$type
#> [1] "xml"
#> 
#> $data$path$`10.3389/feart.2015.00009`$error
#> NULL
#> 
#> 
#> 
#> $data$data
#> NULL
#> 
#> 
#> $opts
#> $opts$dois
#> [1] "10.3389/fphar.2014.00109" "10.3389/feart.2015.00009"
#> 
#> $opts$type
#> [1] "xml"
#> 
#> $opts$progress
#> [1] FALSE
#> 
#> 
#> $errors
#>                         id error
#> 1 10.3389/fphar.2014.00109  <NA>
#> 2 10.3389/feart.2015.00009  <NA>

9.4 Errors

The output of ft_get() has an error slot for each article.

If the error slot is NULL, there was no error.

If the error slot is not NULL, there was an error, and the error message is stored in that slot as a character string.
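
For example, using the PLOS result from earlier in this chapter, you can check the slot directly (NULL means the fetch succeeded):

res <- ft_get('10.1371/journal.pone.0086169')
res$plos$data$path[['10.1371/journal.pone.0086169']]$error
#> NULL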

Possible errors include:

  • An error reported by the web service
    • e.g. “Timeout was reached: Connection timed out after 10003 milliseconds”
  • No link found
    • e.g. “no link found from Crossref” OR “has no link available”
  • We attempted to fetch the article but the content type wasn’t what was expected. In this case we skip to the next article.
    • e.g. “type was supposed to be pdf, but was text/html; charset=UTF-8”
  • Weird uninformative errors
    • e.g. “Recv failure: Operation timed out” OR “Operation was aborted by an application callback”
  • An error associated mostly with PLOS. PLOS gives DOIs for parts of articles, like figures, so it doesn’t make sense to get full text of a figure.
    • e.g., “was not found or may be a DOI for a part of an article”

An additional top-level slot called errors holds a data.frame of errors, one row per article, like:

res <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.aaaa'), from = "elife")
res$elife$errors
#>                    id                   error
#> 1 10.7554/eLife.03032                    <NA>
#> 2  10.7554/eLife.aaaa subscript out of bounds

Here the second DOI was invalid. Granted, the error in the data.frame, “subscript out of bounds”, isn’t very informative, but we can work on that.

9.5 Cleanup

The section above suggests that we often run into errors. When we run into an error downloading full text we capture the error message, if there is one, and delete the file we were trying to create. That is, we clean up upon hitting an error, so you shouldn’t end up with blank files on your machine. Let us know if this isn’t true in your case and we’ll get it fixed.

Note that even if you exit out of the call to ft_get(), it should clean up any file it has not completely finished creating, so you shouldn’t end up with bad files if you exit the function while it’s running.

9.6 Internals

What’s going on under the hood in ft_get()? It goes like this (a pseudocode sketch follows the list):

  • If you request data from a specific data source (use of from, only allowed for PLOS, Entrez, eLife, Pensoft, arXiv, BiorXiv, Elsevier and Wiley):
    • Grab the publisher-specific collector function
    • Each of these functions has code specific to that publisher to pull full text given an article identifier
  • If you don’t request a specific data source:
    • Guess which publisher the DOI comes from
    • If a publisher is discovered that we have a plugin for:
      • Get the publisher-specific collector function
    • If no publisher is discovered:
      • Ping the https://ftdoi.org API
      • If the publisher is found with the ftdoi API:
        • Use the full text link from the ftdoi API response
      • If the publisher is not found with the ftdoi API:
        • Attempt the fetch via the Crossref API - if links are found we try those; some work and some don’t
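
In pseudocode, the dispatch looks roughly like this (a sketch of the flow above, not the actual internals; all helper functions here are hypothetical):

# illustrative pseudocode only - every helper below is hypothetical
fetch_one <- function(id, from = NULL) {
  if (!is.null(from)) {
    # publisher-specific collector, as when the user sets `from`
    return(plugin_for(from)(id))
  }
  publisher <- guess_publisher(id)   # e.g. from the DOI prefix
  if (!is.null(publisher) && has_plugin(publisher)) {
    return(plugin_for(publisher)(id))
  }
  link <- ftdoi_lookup(id)           # ping the https://ftdoi.org API
  if (!is.null(link)) return(fetch_link(link))
  # last resort: links from the Crossref API; some work and some don't
  fetch_links(crossref_links(id))
}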

9.7 Notes about specific data sources

9.7.1 Elsevier

When you don’t have access to the full text of Elsevier articles, they will often still give you something, but it may be just the metadata of the paper, or, if you’re lucky, an abstract. This will be rather obvious when you go to extract the text.
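
A rough check on what you actually got is to collect the content and look at its length; metadata-only or abstract-only responses are typically very short. A sketch, assuming you fetched type "plain" and that the collected text lands in $data$data, as in the object structure printed earlier (the 2000-character threshold is arbitrary):

# `res` is an ft_get() result for Elsevier DOIs fetched with type = "plain"
res <- ft_collect(res)   # read cached files into the object
lens <- vapply(res$elsevier$data$data, nchar, integer(1))
# suspiciously short entries are likely metadata or abstracts, not full text
lens[lens < 2000]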