  • (c) Scott Chamberlain, 2020

fulltext manual

Chapter 6 Search

Search is likely where you'll start. The search functionality in fulltext lets you begin with words like ‘ecology’ or ‘cellular’, and the output of that search can be fed downstream to the next major task: fetching articles.

6.1 Usage

library(fulltext)

List backends available

ft_search_ls()
#> [1] "arxiv"      "biorxivr"   "bmc"        "crossref"   "entrez"    
#> [6] "europe_pmc" "ma"         "plos"       "scopus"

Search. By default, ft_search queries PLOS (Public Library of Science):

res <- ft_search(query = "ecology")

The output of ft_search is an S3 object of class ft, which prints a summary of the results:

res
#> Query:
#>   [ecology] 
#> Found:
#>   [PLoS: 58840; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 
#> Returned:
#>   [PLoS: 10; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]

and has slots for each data source:

names(res)
#> [1] "plos"     "bmc"      "crossref" "entrez"   "arxiv"    "biorxiv"  "europmc" 
#> [8] "scopus"   "ma"

Get data for a single source

res$plos
#> Query: [ecology] 
#> Records found, returned: [58840, 10] 
#> License: [CC-BY] 
#> # A tibble: 10 × 1
#>    id                          
#>    <chr>                       
#>  1 10.1371/journal.pone.0001248
#>  2 10.1371/journal.pone.0248090
#>  3 10.1371/journal.pone.0059813
#>  4 10.1371/journal.pone.0080763
#>  5 10.1371/journal.pone.0246749
#>  6 10.1371/journal.pone.0254411
#>  7 10.1371/journal.pone.0220747
#>  8 10.1371/journal.pone.0155019
#>  9 10.1371/journal.pone.0175014
#> 10 10.1371/journal.pone.0241618

Note that the metadata printed above the data.frame of results states the license for articles from the given data source. For some data sources the license is the same for every paper; for others it varies among papers.
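The per-source slot can also be used programmatically, for example to pull out the DOIs for later steps. The following is a minimal sketch; it assumes each slot is a list with elements named found, license, and data (only data is confirmed by the examples in this chapter), and it assumes the limit argument described in ?ft_search.

```r
library(fulltext)

# Small search against PLOS; `limit` caps the number of returned records
res <- ft_search(query = "ecology", from = "plos", limit = 5)

# Assumed element names on the per-source slot: found, license, data
res$plos$found      # total number of matching records
res$plos$license    # license string shown in the printed header

# The `id` column of the PLOS results holds the DOIs
dois <- res$plos$data$id
dois
```

A character vector of DOIs like this is a convenient input for the later chapters on abstracts, links, and fetching.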

6.2 Search many sources

Here, we search for the term “ecology” across PLOS, Crossref, and the arXiv preprint server.

res <- ft_search(query='ecology', from=c('plos','crossref','arxiv'))
res
#> Query:
#>   [ecology] 
#> Found:
#>   [PLoS: 58840; BMC: 0; Crossref: 229003; Entrez: 0; arxiv: 2629; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 
#> Returned:
#>   [PLoS: 10; BMC: 0; Crossref: 10; Entrez: 0; arxiv: 10; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]

Each source may return different results, and the data.frame for each source may have different columns:

res$plos
#> Query: [ecology] 
#> Records found, returned: [58840, 10] 
#> License: [CC-BY] 
#> # A tibble: 10 × 1
#>    id                          
#>    <chr>                       
#>  1 10.1371/journal.pone.0001248
#>  2 10.1371/journal.pone.0248090
#>  3 10.1371/journal.pone.0059813
#>  4 10.1371/journal.pone.0080763
#>  5 10.1371/journal.pone.0246749
#>  6 10.1371/journal.pone.0254411
#>  7 10.1371/journal.pone.0220747
#>  8 10.1371/journal.pone.0155019
#>  9 10.1371/journal.pone.0175014
#> 10 10.1371/journal.pone.0241618
res$arxiv
#> Query: [ecology] 
#> Records found, returned: [2629, 10] 
#> License: [variable, but should be free to text-mine, see http://arxiv.org/help/license and http://arxiv.org/help/bulk_data] 
#> # A tibble: 10 × 15
#>    id     submitted  updated  title  abstract authors affiliations link_abstract
#>    <chr>  <chr>      <chr>    <chr>  <chr>    <chr>   <chr>        <chr>        
#>  1 hep-p… 1993-03-0… 1993-03… "Coul… "  We e… John E… ""           http://arxiv…
#>  2 cond-… 1993-09-2… 1994-01… "The … "  Natu… Dallas… ""           http://arxiv…
#>  3 chao-… 1993-11-2… 1993-11… "Rele… "  Netw… Kunihi… "University… http://arxiv…
#>  4 adap-… 1993-11-2… 1993-11… "Chao… "  The … Kunihi… "University… http://arxiv…
#>  5 chao-… 1994-08-1… 1994-08… "Weak… "  Popu… Shin-i… "Department… http://arxiv…
#>  6 adap-… 1994-12-2… 1994-12… "Pref… "  Part… Dan As… "Dept. of M… http://arxiv…
#>  7 comp-… 1995-06-0… 1995-06… "Pred… "  In p… H. P. … "Institute … http://arxiv…
#>  8 math/… 1995-07-0… 1995-07… "Nece… "  This… Robert… ""           http://arxiv…
#>  9 cond-… 1996-02-0… 1996-02… "Mass… "  It i… Per Ba… ""           http://arxiv…
#> 10 cond-… 1996-07-0… 1998-09… "Simp… "  A ma… Susann… ""           http://arxiv…
#> # … with 7 more variables: link_pdf <chr>, link_doi <chr>, comment <chr>,
#> #   journal_ref <chr>, doi <chr>, primary_category <chr>, categories <chr>
res$crossref
#> Query: [ecology] 
#> Records found, returned: [229003, 10] 
#> License: [variable, see individual records] 
#> # A tibble: 10 × 32
#>    container.title  created  deposited published.print doi   indexed issn  issue
#>    <chr>            <chr>    <chr>     <chr>           <chr> <chr>   <chr> <chr>
#>  1 Functional Ecol… 2009-01… 2021-07-… 2009-02         10.1… 2021-0… 0269… 1    
#>  2 Ecology          2006-05… 2018-08-… 1957-10         10.2… 2021-0… 0012… 4    
#>  3 Ecology          2006-05… 2018-08-… 1957-07         10.2… 2021-0… 0012… 3    
#>  4 Journal of Indu… 2014-11… 2021-07-… 2014-12         10.1… 2021-0… 1088… 6    
#>  5 Ecology          2006-05… 2018-08-… 1942-04         10.2… 2021-0… 0012… 2    
#>  6 Ecology          2017-04… 2021-07-… 2017-06         10.1… 2021-0… 0012… 6    
#>  7 Ecology          2006-05… 2018-08-… 1966-09         10.2… 2021-0… 0012… 5    
#>  8 Ecology          2006-05… 2018-08-… 1927-10         10.2… 2021-0… 0012… 4    
#>  9 Ecology          2006-05… 2018-08-… 1976-11         10.2… 2021-0… 0012… 6    
#> 10 Ecology          2006-05… 2018-08-… 1927-04         10.2… 2021-0… 0012… 2    
#> # … with 24 more variables: issued <chr>, member <chr>, page <chr>,
#> #   prefix <chr>, publisher <chr>, score <chr>, source <chr>,
#> #   reference.count <chr>, references.count <chr>,
#> #   is.referenced.by.count <chr>, subject <chr>, title <chr>, type <chr>,
#> #   url <chr>, volume <chr>, language <chr>, author <list>, link <list>,
#> #   license <list>, reference <list>, short.container.title <chr>,
#> #   archive <chr>, published.online <chr>, subtitle <chr>

Note above how licenses for PLOS are all CC-BY, whereas licenses in arXiv and Crossref are variable. For arXiv we don’t get any license information in the results. But for Crossref we do get license information. Let’s get license information for the first article:

res$crossref$data$license[[1]]
#> # A tibble: 1 × 4
#>   date       content.version delay.in.days URL                                  
#>   <chr>      <chr>                   <int> <chr>                                
#> 1 2015-09-01 tdm                      2403 http://doi.wiley.com/10.1002/tdm_lic…

It shows a license specific to Wiley, and gives the URL so you can look it up.
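To check licenses for every Crossref record rather than just the first, you can walk the license list-column. The sketch below is one way to do that, not part of fulltext itself: first_license_url is a hypothetical helper, and mock_license is a made-up stand-in (with placeholder URLs) for res$crossref$data$license so that the example runs without a network call.

```r
# Hypothetical helper: take the first license URL from each element of a
# Crossref `license` list-column; elements without license data yield NA.
first_license_url <- function(license_col) {
  vapply(license_col, function(x) {
    if (!is.data.frame(x) || nrow(x) == 0 || is.null(x$URL)) {
      NA_character_
    } else {
      as.character(x$URL[1])
    }
  }, character(1))
}

# Made-up stand-in for res$crossref$data$license (placeholder URLs only)
mock_license <- list(
  data.frame(date = "2015-09-01", content.version = "tdm",
             delay.in.days = 2403L, URL = "https://example.org/tdm-license",
             stringsAsFactors = FALSE),
  NULL,
  data.frame(date = "2020-01-01", content.version = "vor",
             delay.in.days = 0L, URL = "https://example.org/cc-by",
             stringsAsFactors = FALSE)
)

urls <- first_license_url(mock_license)
urls

# With real search results this would be:
# urls <- first_license_url(res$crossref$data$license)
```

Records that come back as NA have no license metadata in the search results, so their terms need to be checked with the publisher.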

6.3 Search options

Each data source in ft_search accepts additional configuration through its own options parameter; for example, the europmc data source is configured with the euroopts parameter. Each of the *opts parameters expects a named list. See the ?ft_search docs for details.
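As a rough illustration of the *opts parameters, the sketch below passes a named list to two sources. It assumes that plosopts is forwarded to rplos::searchplos (where fl selects the fields to return) and that crossrefopts is forwarded to rcrossref::cr_works (where filter restricts the records); check ?ft_search and those packages' docs before relying on the specific option names.

```r
library(fulltext)

# PLOS: request extra fields in the results
# (assumes `fl` is passed through to rplos::searchplos)
res_plos <- ft_search(
  query = "ecology", from = "plos", limit = 5,
  plosopts = list(fl = c("id", "journal", "title"))
)
res_plos$plos$data

# Crossref: keep only records that advertise full-text links
# (assumes `filter` is passed through to rcrossref::cr_works)
res_cr <- ft_search(
  query = "ecology", from = "crossref", limit = 5,
  crossrefopts = list(filter = c(has_full_text = TRUE))
)
res_cr$crossref$data
```

Because each *opts list only affects its own source, you can combine several of them in a single call that searches many sources.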

Here, we search for the term “ecology” at Europe PMC.

res <- ft_search(query='ecology', from='europmc')
res$europmc
#> Query: [ecology] 
#> Records found, returned: [425357, 10] 
#>  
#> # A tibble: 10 × 27
#>    id           source pmid     doi   title authorString journalTitle journalVolume
#>    <chr>        <chr>  <chr>    <chr> <chr> <chr>        <chr>        <chr>        
#>  1 34563792     MED    34563792 10.1… "Fre… Yin B, Li J… J Plant Phy… 266          
#>  2 34593996     MED    34593996 10.1… "Mic… Mason-Jones… ISME J       <NA>         
#>  3 34579579     MED    34579579 10.1… "Mic… Frederickso… mBio         <NA>         
#>  4 34583083     MED    34583083 10.1… "Cor… Yin X, Zhan… Sci Total E… 806          
#>  5 34597569     MED    34597569 10.1… "New… Stockmann M… Sci Total E… <NA>         
#>  6 34129698     MED    34129698 10.1… "Low… Kreider MR,… Ecology      102          
#>  7 34498255     MED    34498255 10.1… "Err… <NA>         Ecology      102          
#>  8 34600006     MED    34600006 10.1… "Eff… Deng J, Zho… Chemosphere  <NA>         
#>  9 34523534     MED    34523534 10.1… "Bis… Zhou Y, Guo… Environ Pol… 290          
#> 10 IND607357494 AGR    <NA>     10.1… "Soi… Vieira AF, … Appl Soil E… 167          
#> # … with 19 more variables: pubYear <chr>, journalIssn <chr>, pageInfo <chr>,
#> #   pubType <chr>, isOpenAccess <chr>, inEPMC <chr>, inPMC <chr>, hasPDF <chr>,
#> #   hasBook <chr>, hasSuppl <chr>, citedByCount <int>, hasReferences <chr>,
#> #   hasTextMinedTerms <chr>, hasDbCrossReferences <chr>, hasLabsLinks <chr>,
#> #   hasTMAccessionNumbers <chr>, firstIndexDate <chr>,
#> #   firstPublicationDate <chr>, issue <chr>

Then get the next batch of results by passing the cursorMark value from the previous result:

ft_search(query='ecology', from='europmc', 
  euroopts = list(cursorMark = res$europmc$cursorMark))
#> Query:
#>   [ecology] 
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 425357; Scopus: 0; Microsoft: 0] 
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 10; Scopus: 0; Microsoft: 0]
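
The same pattern extends to a loop when you want several pages. This is a minimal sketch that repeats the call shown above, feeding each response's cursorMark into the next request; the number of extra pages (two) and the one-second pause are arbitrary choices for illustration.

```r
library(fulltext)

query <- "ecology"

# First page
res <- ft_search(query = query, from = "europmc")
pages <- list(res$europmc$data)
cursor <- res$europmc$cursorMark

# Fetch two more pages, passing the previous cursorMark each time
for (i in seq_len(2)) {
  Sys.sleep(1)  # short pause between requests
  res <- ft_search(query = query, from = "europmc",
                   euroopts = list(cursorMark = cursor))
  pages[[length(pages) + 1]] <- res$europmc$data
  cursor <- res$europmc$cursorMark
}

# Collect the record ids from all pages
ids <- unlist(lapply(pages, function(p) p$id))
length(ids)
```

The pages are kept as a list rather than row-bound into one data.frame, since the set of columns returned can differ from page to page.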