Chapter 6 Search
Search is likely where you'll start. Search functionality in fulltext means that you can begin by searching on words like ‘ecology’ or ‘cellular’, and the output of that search can be fed downstream to the next major task: fetching articles.
6.1 Usage
library(fulltext)
List backends available
ft_search_ls()
#> [1] "arxiv" "biorxivr" "bmc" "crossref" "entrez"
#> [6] "europe_pmc" "ma" "plos" "scopus"
Search - by default searches against PLOS (Public Library of Science)
res <- ft_search(query = "ecology")
The output of ft_search is an ft S3 object, with a summary of the results:
res
#> Query:
#> [ecology]
#> Found:
#> [PLoS: 58840; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]
#> Returned:
#> [PLoS: 10; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]
and has slots for each data source:
names(res)
#> [1] "plos" "bmc" "crossref" "entrez" "arxiv" "biorxiv" "europmc"
#> [8] "scopus" "ma"
Get data for a single source
res$plos
#> Query: [ecology]
#> Records found, returned: [58840, 10]
#> License: [CC-BY]
#> # A tibble: 10 × 1
#> id
#> <chr>
#> 1 10.1371/journal.pone.0001248
#> 2 10.1371/journal.pone.0248090
#> 3 10.1371/journal.pone.0059813
#> 4 10.1371/journal.pone.0080763
#> 5 10.1371/journal.pone.0246749
#> 6 10.1371/journal.pone.0254411
#> 7 10.1371/journal.pone.0220747
#> 8 10.1371/journal.pone.0155019
#> 9 10.1371/journal.pone.0175014
#> 10 10.1371/journal.pone.0241618
Note that the metadata section above the data.frame of results states the license for articles from the given data source. For some data sources the license is the same for every paper; for others it varies among papers.
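Since the identifiers are what gets passed on to the fetching step, it can help to pull them out as a plain character vector. The sketch below is a minimal illustration and runs offline on a mock object. It assumes that each source keeps its results in a `$data` element (as `res$crossref$data` does later in this chapter) and that the PLOS identifier column is named `id`, as in the printout above. `get_ids` and `mock_res` are hypothetical names, not part of fulltext.

```r
# Mock of the structure printed above. In real use this would come from
# res <- fulltext::ft_search(query = "ecology").
mock_res <- list(
  plos = list(
    data = data.frame(
      id = c("10.1371/journal.pone.0001248",
             "10.1371/journal.pone.0248090",
             "10.1371/journal.pone.0059813"),
      stringsAsFactors = FALSE
    )
  )
)

# Hypothetical helper: return the identifier column of one source, or an
# empty character vector when the source or the column is missing.
get_ids <- function(res, source = "plos", column = "id") {
  dat <- res[[source]]$data
  if (is.null(dat) || !column %in% names(dat)) {
    return(character(0))
  }
  as.character(dat[[column]])
}

dois <- get_ids(mock_res, "plos")
```

Returning an empty vector for a missing source keeps downstream code simple, because sources that were not searched need no special handling.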
6.2 Search many sources
Here, search for the term “ecology” across PLOS, Crossref, and the arXiv preprint server.
res <- ft_search(query='ecology', from=c('plos','crossref','arxiv'))
res
#> Query:
#> [ecology]
#> Found:
#> [PLoS: 58840; BMC: 0; Crossref: 229003; Entrez: 0; arxiv: 2629; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]
#> Returned:
#> [PLoS: 10; BMC: 0; Crossref: 10; Entrez: 0; arxiv: 10; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]
Each source may return different results, with different columns in each data.frame:
res$plos
#> Query: [ecology]
#> Records found, returned: [58840, 10]
#> License: [CC-BY]
#> # A tibble: 10 × 1
#> id
#> <chr>
#> 1 10.1371/journal.pone.0001248
#> 2 10.1371/journal.pone.0248090
#> 3 10.1371/journal.pone.0059813
#> 4 10.1371/journal.pone.0080763
#> 5 10.1371/journal.pone.0246749
#> 6 10.1371/journal.pone.0254411
#> 7 10.1371/journal.pone.0220747
#> 8 10.1371/journal.pone.0155019
#> 9 10.1371/journal.pone.0175014
#> 10 10.1371/journal.pone.0241618
res$arxiv
#> Query: [ecology]
#> Records found, returned: [2629, 10]
#> License: [variable, but should be free to text-mine, see http://arxiv.org/help/license and http://arxiv.org/help/bulk_data]
#> # A tibble: 10 × 15
#> id submitted updated title abstract authors affiliations link_abstract
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 hep-p… 1993-03-0… 1993-03… "Coul… " We e… John E… "" http://arxiv…
#> 2 cond-… 1993-09-2… 1994-01… "The … " Natu… Dallas… "" http://arxiv…
#> 3 chao-… 1993-11-2… 1993-11… "Rele… " Netw… Kunihi… "University… http://arxiv…
#> 4 adap-… 1993-11-2… 1993-11… "Chao… " The … Kunihi… "University… http://arxiv…
#> 5 chao-… 1994-08-1… 1994-08… "Weak… " Popu… Shin-i… "Department… http://arxiv…
#> 6 adap-… 1994-12-2… 1994-12… "Pref… " Part… Dan As… "Dept. of M… http://arxiv…
#> 7 comp-… 1995-06-0… 1995-06… "Pred… " In p… H. P. … "Institute … http://arxiv…
#> 8 math/… 1995-07-0… 1995-07… "Nece… " This… Robert… "" http://arxiv…
#> 9 cond-… 1996-02-0… 1996-02… "Mass… " It i… Per Ba… "" http://arxiv…
#> 10 cond-… 1996-07-0… 1998-09… "Simp… " A ma… Susann… "" http://arxiv…
#> # … with 7 more variables: link_pdf <chr>, link_doi <chr>, comment <chr>,
#> # journal_ref <chr>, doi <chr>, primary_category <chr>, categories <chr>
res$crossref
#> Query: [ecology]
#> Records found, returned: [229003, 10]
#> License: [variable, see individual records]
#> # A tibble: 10 × 32
#> container.title created deposited published.print doi indexed issn issue
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Functional Ecol… 2009-01… 2021-07-… 2009-02 10.1… 2021-0… 0269… 1
#> 2 Ecology 2006-05… 2018-08-… 1957-10 10.2… 2021-0… 0012… 4
#> 3 Ecology 2006-05… 2018-08-… 1957-07 10.2… 2021-0… 0012… 3
#> 4 Journal of Indu… 2014-11… 2021-07-… 2014-12 10.1… 2021-0… 1088… 6
#> 5 Ecology 2006-05… 2018-08-… 1942-04 10.2… 2021-0… 0012… 2
#> 6 Ecology 2017-04… 2021-07-… 2017-06 10.1… 2021-0… 0012… 6
#> 7 Ecology 2006-05… 2018-08-… 1966-09 10.2… 2021-0… 0012… 5
#> 8 Ecology 2006-05… 2018-08-… 1927-10 10.2… 2021-0… 0012… 4
#> 9 Ecology 2006-05… 2018-08-… 1976-11 10.2… 2021-0… 0012… 6
#> 10 Ecology 2006-05… 2018-08-… 1927-04 10.2… 2021-0… 0012… 2
#> # … with 24 more variables: issued <chr>, member <chr>, page <chr>,
#> # prefix <chr>, publisher <chr>, score <chr>, source <chr>,
#> # reference.count <chr>, references.count <chr>,
#> # is.referenced.by.count <chr>, subject <chr>, title <chr>, type <chr>,
#> # url <chr>, volume <chr>, language <chr>, author <list>, link <list>,
#> # license <list>, reference <list>, short.container.title <chr>,
#> # archive <chr>, published.online <chr>, subtitle <chr>
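Because the column sets differ so much, it is worth checking which columns the sources share before trying to combine results. The sketch below is one possible way to do that, shown on a mock object so that it runs offline. The column names are a small subset of those in the printouts above, the values are placeholders, and `mock_res` is a hypothetical stand-in for a real `ft_search` result.

```r
# Mock result for three sources. Column names are a subset of the ones
# printed above; the values are placeholders, not real records.
mock_res <- list(
  plos     = list(data = data.frame(id = "placeholder-id")),
  arxiv    = list(data = data.frame(id = "placeholder-id",
                                    title = "placeholder title",
                                    doi = NA_character_)),
  crossref = list(data = data.frame(doi = "placeholder-doi",
                                    title = "placeholder title",
                                    issn = "placeholder-issn"))
)

# Column names per source
cols <- lapply(mock_res, function(src) names(src$data))

# Columns present in every source (none in this mock)
shared <- Reduce(intersect, cols)

# Columns shared by a specific pair of sources
arxiv_crossref <- intersect(cols$arxiv, cols$crossref)
```

With a real result, subset to the sources that were actually searched first, e.g. `res[c("plos", "arxiv", "crossref")]`, since the other slots hold no data.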
Note above how licenses for PLOS are all CC-BY, whereas licenses in arXiv and Crossref are variable. For arXiv we don’t get any license information in the results, but for Crossref we do. Let’s get license information for the first article:
res$crossref$data$license[[1]]
#> # A tibble: 1 × 4
#> date content.version delay.in.days URL
#> <chr> <chr> <int> <chr>
#> 1 2015-09-01 tdm 2403 http://doi.wiley.com/10.1002/tdm_lic…
It shows a license specific to Wiley, and gives the URL so you can look it up.
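To look at licenses for all Crossref records at once rather than one at a time, the `license` list-column can be reduced to one URL per record. The following is a minimal sketch on mock data. It assumes each element of `license` is either a data.frame with a `URL` column, as in the output above, or empty. The DOIs and URLs are placeholders, and `first_license_url` is a hypothetical helper.

```r
# Mock of res$crossref$data with a `license` list-column; values are placeholders.
crossref_data <- data.frame(
  doi = c("10.1000/example.1", "10.1000/example.2"),
  stringsAsFactors = FALSE
)
crossref_data$license <- list(
  data.frame(date = "2015-09-01", content.version = "tdm",
             delay.in.days = 2403L, URL = "http://example.org/license-a",
             stringsAsFactors = FALSE),
  NULL  # a record with no license information
)

# Hypothetical helper: first license URL of a record, or NA when absent.
first_license_url <- function(lic) {
  if (is.null(lic) || NROW(lic) == 0 || !"URL" %in% names(lic)) {
    return(NA_character_)
  }
  as.character(lic$URL[1])
}

license_urls <- vapply(crossref_data$license, first_license_url, character(1))

# One row per record: DOI plus its first license URL
license_summary <- data.frame(
  doi = crossref_data$doi,
  license_url = license_urls,
  stringsAsFactors = FALSE
)
```

Only the first license per record is kept here; a record can list several licenses, so inspect the full list-column when that matters.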
6.3 Search options
Each of the data sources in ft_search can accept additional configuration. See the ?ft_search docs for details. Each data source has a corresponding parameter in ft_search; e.g., the europmc data source can be configured with the euroopts parameter. Each of the *opts parameters expects a named list.
Here, we search for the term “ecology” at Europe PMC.
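As a rough illustration of what “a named list” means in practice, the sketch below builds options for the PLOS backend and checks their shape before use. It assumes that `plosopts` is the PLOS counterpart of `euroopts`, that `limit` caps the number of records returned, and that `fl` (fields to return) is accepted by the PLOS backend; check ?ft_search before relying on any of these. `is_named_list` and `search_plos_fields` are hypothetical helpers. The search call is wrapped in a function and not executed, since it needs the fulltext package and network access.

```r
# Hypothetical helper: TRUE for a list whose elements all have names
# (an empty list also counts).
is_named_list <- function(x) {
  is.list(x) &&
    (length(x) == 0 || (!is.null(names(x)) && all(nzchar(names(x)))))
}

# Options for the PLOS backend; the field names are assumptions.
plos_options <- list(fl = c("id", "journal", "title"))
stopifnot(is_named_list(plos_options))

# Not run here: requires the fulltext package and network access.
search_plos_fields <- function() {
  fulltext::ft_search(query = "ecology", from = "plos", limit = 5,
                      plosopts = plos_options)
}
```

An unnamed list such as `list("id")` fails the check, which is the mistake this helper is meant to catch early.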
res <- ft_search(query='ecology', from='europmc')
res$europmc
#> Query: [ecology]
#> Records found, returned: [425357, 10]
#>
#> # A tibble: 10 × 27
#> id source pmid doi title authorString journalTitle journalVolume
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 34563792 MED 34563792 10.1… "Fre… Yin B, Li J… J Plant Phy… 266
#> 2 34593996 MED 34593996 10.1… "Mic… Mason-Jones… ISME J <NA>
#> 3 34579579 MED 34579579 10.1… "Mic… Frederickso… mBio <NA>
#> 4 34583083 MED 34583083 10.1… "Cor… Yin X, Zhan… Sci Total E… 806
#> 5 34597569 MED 34597569 10.1… "New… Stockmann M… Sci Total E… <NA>
#> 6 34129698 MED 34129698 10.1… "Low… Kreider MR,… Ecology 102
#> 7 34498255 MED 34498255 10.1… "Err… <NA> Ecology 102
#> 8 34600006 MED 34600006 10.1… "Eff… Deng J, Zho… Chemosphere <NA>
#> 9 34523534 MED 34523534 10.1… "Bis… Zhou Y, Guo… Environ Pol… 290
#> 10 IND607357494 AGR <NA> 10.1… "Soi… Vieira AF, … Appl Soil E… 167
#> # … with 19 more variables: pubYear <chr>, journalIssn <chr>, pageInfo <chr>,
#> # pubType <chr>, isOpenAccess <chr>, inEPMC <chr>, inPMC <chr>, hasPDF <chr>,
#> # hasBook <chr>, hasSuppl <chr>, citedByCount <int>, hasReferences <chr>,
#> # hasTextMinedTerms <chr>, hasDbCrossReferences <chr>, hasLabsLinks <chr>,
#> # hasTMAccessionNumbers <chr>, firstIndexDate <chr>,
#> # firstPublicationDate <chr>, issue <chr>
Then get the next batch of results, using the cursorMark result:
ft_search(query='ecology', from='europmc',
euroopts = list(cursorMark = res$europmc$cursorMark))
#> Query:
#> [ecology]
#> Found:
#> [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 425357; Scopus: 0; Microsoft: 0]
#> Returned:
#> [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 10; Scopus: 0; Microsoft: 0]
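The same step can be repeated in a loop to collect several batches. The sketch below is one possible way to do that, not a function provided by fulltext. It assumes that each result exposes `res$europmc$cursorMark` and `res$europmc$data` as used above, and that an unchanged cursor means there are no further results. `fetch_europmc_pages` and `mock_search` are hypothetical names. The search function is passed in as an argument so that the loop can be tried offline with a stand-in.

```r
# Hypothetical helper: page through Europe PMC results by feeding each
# response's cursorMark into the next request.
fetch_europmc_pages <- function(query, n_pages, search_fn) {
  pages <- vector("list", n_pages)
  cursor <- NULL
  for (i in seq_len(n_pages)) {
    opts <- if (is.null(cursor)) list() else list(cursorMark = cursor)
    res <- search_fn(query = query, from = "europmc", euroopts = opts)
    # pages[i] <- list(...) keeps the slot even when data is NULL
    pages[i] <- list(res$europmc$data)
    next_cursor <- res$europmc$cursorMark
    # Assumption: a missing or unchanged cursor means no more results
    if (is.null(next_cursor) || identical(next_cursor, cursor)) {
      pages <- pages[seq_len(i)]
      break
    }
    cursor <- next_cursor
  }
  pages
}

# Offline stand-in for ft_search: returns one fake record per page and
# uses the page number as the cursor.
mock_search <- function(query, from, euroopts = list()) {
  page <- if (is.null(euroopts$cursorMark)) 1L else as.integer(euroopts$cursorMark) + 1L
  list(europmc = list(
    data = data.frame(id = paste0("rec-", page), stringsAsFactors = FALSE),
    cursorMark = as.character(page)
  ))
}

pages <- fetch_europmc_pages("ecology", n_pages = 3, search_fn = mock_search)
page_ids <- vapply(pages, function(p) p$id, character(1))
```

For a real search, pass `fulltext::ft_search` as `search_fn`; each element of the returned list is then the data.frame for one batch.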