Chapter 10 Extracting text

Functions for extracting parts of articles used to live in fulltext, but have since moved to the pubchunks package.

The pubchunks::pub_chunks() function tries to make it easy to extract the parts of articles you want. It only works with articles in XML format: although we can get text out of PDFs, there is no machine-readable way to say “I want the abstract”.

Beyond being limited to XML, this function only knows about a select set of publishers, for which we’ve encoded how to get to the different sections of an article. Publishers do not all use the same XML format, so the route to each section differs slightly from one publisher to the next. That is, getting to the abstract requires slightly different XPath for publisher A vs. publisher B vs. publisher C.

An alternative to pubchunks is to use XPath or CSS selectors yourself to slice and dice the XML.
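As a rough sketch of the do-it-yourself route, the example below uses the xml2 package with XPath on two tiny, made-up XML snippets. The element names, the `abstract_xpath` lookup table, and the `get_abstract()` helper are all hypothetical and only illustrate why each publisher needs its own path to the abstract; real publisher XML is much larger and often uses namespaces.

```r
library(xml2)

# Two made-up article snippets standing in for two publishers.
# The element names are illustrative only.
pub_a <- read_xml(
  "<article><front><article-meta><abstract><p>Abstract from A.</p></abstract></article-meta></front></article>"
)
pub_b <- read_xml(
  "<doc><head><summary>Abstract from B.</summary></head></doc>"
)

# Hypothetical lookup table: one abstract XPath per publisher
abstract_xpath <- c(
  pub_a = "//front//abstract",
  pub_b = "//head/summary"
)

# Hypothetical helper: find the first matching node and return its text
get_abstract <- function(doc, publisher) {
  node <- xml_find_first(doc, abstract_xpath[[publisher]])
  xml_text(node, trim = TRUE)
}

abs_a <- get_abstract(pub_a, "pub_a")
abs_b <- get_abstract(pub_b, "pub_b")
```

The same question ("give me the abstract") needs a different XPath for each snippet, which is the kind of per-publisher knowledge that pub_chunks() encodes for you.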

10.1 Usage

library(fulltext)
library(pubchunks)

Get a full text article

x <- ft_get('10.1371/journal.pone.0086169')

Note that, unlike in previous versions of fulltext, you now have to collect the text from the XML file on disk with ft_collect(). You can then pass the result to pub_chunks(), here to get the authors.

x %>% ft_collect %>% pub_chunks("authors")

#> $plos
#> $plos$`10.1371/journal.pone.0086169`
#> <pub chunks>
#>   from: xml_document
#>   publisher/journal: plos/PLoS ONE
#>   sections: authors
#>   showing up to first 5: 
#>    authors (n=4): nested list
#> 
#> 
#> attr(,"ft_data")
#> [1] TRUE

In another example, let’s search for PLOS articles.

library("rplos")
(dois <- searchplos(q = "*:*", fl = "id",
  fq = list("doc_type:full", "article_type:\"research article\""),
  limit = 5)$data$id)

#> [1] "10.1371/journal.pbio.1000153" "10.1371/journal.pbio.1000159"
#> [3] "10.1371/journal.pbio.1000167" "10.1371/journal.pbio.1000173"
#> [5] "10.1371/journal.pbio.1000176"

Then get the full text

x <- ft_get(dois)

Then pull out various sections of each article.

Remember to collect the full text first with ft_collect().

x <- ft_collect(x)

x %>% pub_chunks("front")
x %>% pub_chunks("body")
x %>% pub_chunks("back")
x %>% pub_chunks("history")
x %>% pub_chunks("authors")
x %>% pub_chunks(c("doi","categories"))
x %>% pub_chunks("all")
x %>% pub_chunks("publisher")
x %>% pub_chunks("acknowledgments")
x %>% pub_chunks("permissions")
x %>% pub_chunks("journal_meta")
x %>% pub_chunks("article_meta")

10.2 Tabularize

The function pub_tabularize() is useful for coercing the output of pub_chunks() into a data.frame, the lingua franca of data work in R.

library(data.table)
x <- pub_chunks(x, c("doi", "title"))
x <- pub_tabularize(x)
rbindlist(x$plos, fill = TRUE)

#>                             doi
#> 1: 10.1371/journal.pbio.1000153
#> 2: 10.1371/journal.pbio.1000159
#> 3: 10.1371/journal.pbio.1000167
#> 4: 10.1371/journal.pbio.1000173
#> 5: 10.1371/journal.pbio.1000176
#>                                                                             title
#> 1:                 Emergence of a Stable Cortical Map for Neuroprosthetic Control
#> 2: Natural Killer Cell Signal Integration Balances Synapse Symmetry and Migration
#> 3: Ready…Go: Amplitude of the fMRI Signal Encodes Expectation of Cue Arrival Time
#> 4:       Hippocampus Leads Ventral Striatum in Replay of Place-Reward Information
#> 5:   β1 Integrin Maintains Integrity of the Embryonic Neocortical Stem Cell Niche
#>    .publisher
#> 1:       plos
#> 2:       plos
#> 3:       plos
#> 4:       plos
#> 5:       plos

10.3 Other inputs

pub_chunks() works with other inputs besides the output of fulltext::ft_get().

10.3.1 Files

x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml", 
  package = "pubchunks")

pub_chunks(x, "abstract")

#> <pub chunks>
#>   from: file
#>   publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#>   sections: abstract
#>   showing up to first 5: 
#>    abstract (n=1): Abstract
#>                
#>                   This pa ...

pub_chunks(x, "title")

#> <pub chunks>
#>   from: file
#>   publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#>   sections: title
#>   showing up to first 5: 
#>    title (n=1): On the driving of a piston with a rigid collar int ...

pub_chunks(x, "authors")

#> <pub chunks>
#>   from: file
#>   publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#>   sections: authors
#>   showing up to first 5: 
#>    authors (n=1): Chetaev, D.N

pub_chunks(x, c("title", "refs"))

#> <pub chunks>
#>   from: file
#>   publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#>   sections: title, refs
#>   showing up to first 5: 
#>    title (n=1): On the driving of a piston with a rigid collar int ...
#>    refs (n=6): Watson G.N.. 1949. Teoriia besselevykh funktsii. N

The output of pub_chunks() is a list with an S3 class pub_chunks to make internal work in the package easier. You can easily see the list structure by using unclass().
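A minimal sketch of inspecting that list structure, assuming the same bundled example file; the variable names `f`, `res`, and `raw` are illustrative.

```r
library(pubchunks)

# Example article bundled with the pubchunks package
f <- system.file("examples/10_1016_0021_8928_59_90156_x.xml",
  package = "pubchunks")
res <- pub_chunks(f, c("title", "authors"))

# Drop the S3 class so the default list methods are used
# instead of the compact <pub chunks> print method
raw <- unclass(res)
names(raw)
str(raw, max.level = 1)

# Individual sections are ordinary list elements
raw$title
```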

10.3.2 XML in a string

xml <- paste0(readLines(x), collapse = "")
pub_chunks(xml, "title")

#> <pub chunks>
#>   from: character
#>   publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#>   sections: title
#>   showing up to first 5: 
#>    title (n=1): On the driving of a piston with a rigid collar int ...

10.3.3 xml2 objects

xml <- paste0(readLines(x), collapse = "")
xml <- xml2::read_xml(xml)
pub_chunks(xml, "title")

#> <pub chunks>
#>   from: xml_document
#>   publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#>   sections: title
#>   showing up to first 5: 
#>    title (n=1): On the driving of a piston with a rigid collar int ...