Chapter 10 Extracting text
Functions for extracting parts of articles used to live inside fulltext, but have now
moved to the package pubchunks.
The pubchunks::pub_chunks() function tries to make it easy to extract the parts of
articles you want. This only works with articles in XML format, though: we can get
text out of PDFs, but there is no machine-readable way to say
“I want the abstract”.
In addition to only working with XML, this function only knows about a select set of publishers for which we’ve encoded how to get to the different sections of an article. Publishers do not all use the same XML format, so the path to each section differs slightly from one publisher to the next. That is, getting to the abstract requires a slightly different XPath expression for publisher A vs. publisher B vs. publisher C.
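To make the per-publisher difference concrete, here is a minimal sketch using xml2. The two XML fragments, the publisher names, and the XPath expressions in abstract_xpath are all invented for illustration; they are not the expressions pubchunks uses internally.

```r
library(xml2)

# Hypothetical lookup table: one XPath expression per publisher.
# These expressions are assumptions made up for this sketch.
abstract_xpath <- list(
  publisher_a = "//abstract/p",
  publisher_b = "//front/summary"
)

# Two made-up documents that store the abstract in different places.
docs <- list(
  publisher_a = read_xml("<article><abstract><p>Abstract A</p></abstract></article>"),
  publisher_b = read_xml("<article><front><summary>Abstract B</summary></front></article>")
)

# Pick the expression that matches the publisher, then extract the text.
get_abstract <- function(doc, publisher) {
  xml_text(xml_find_first(doc, abstract_xpath[[publisher]]))
}

abstracts <- vapply(names(docs), function(p) get_abstract(docs[[p]], p), character(1))
abstracts
```

The same request ("give me the abstract") resolves to a different query depending on the publisher, which is why a publisher has to be known in advance before its sections can be extracted.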
An alternative to pubchunks is to use XPath or CSS selectors yourself to slice and
dice the XML.
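As a rough sketch of the do-it-yourself route, the xml2 package can run XPath queries directly. The XML fragment below is made up and only loosely resembles a real article; real files are larger and may use namespaces, so the expressions would need adjusting.

```r
library(xml2)

# A made-up fragment for illustration; not taken from a real article.
doc <- read_xml('<article>
  <front>
    <article-meta>
      <title-group><article-title>An example title</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author"><name><surname>Smith</surname></name></contrib>
        <contrib contrib-type="author"><name><surname>Jones</surname></name></contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>')

# First matching node only
title <- xml_text(xml_find_first(doc, "//article-title"))

# All matching nodes, filtered on an attribute value
surnames <- xml_text(xml_find_all(doc, "//contrib[@contrib-type='author']//surname"))

title
surnames
```

xml_find_first() returns a single node and xml_find_all() returns every match, so the choice between them depends on whether the section can occur more than once.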
10.1 Usage
library(fulltext)
library(pubchunks)
Get a full text article
x <- ft_get('10.1371/journal.pone.0086169')
Note that, unlike in previous versions of fulltext, you now have to collect (ft_collect())
the text from the XML file on disk. Then you can pass it to pub_chunks(), here to get
authors.
x %>% ft_collect %>% pub_chunks("authors")
#> $plos
#> $plos$`10.1371/journal.pone.0086169`
#> <pub chunks>
#> from: xml_document
#> publisher/journal: plos/PLoS ONE
#> sections: authors
#> showing up to first 5:
#> authors (n=4): nested list
#>
#>
#> attr(,"ft_data")
#> [1] TRUE
In another example, let’s search for PLOS articles.
library("rplos")
(dois <- searchplos(q="*:*", fl='id',
  fq=list('doc_type:full',"article_type:\"research article\""),
  limit=5)$data$id)
#> [1] "10.1371/journal.pbio.1000153" "10.1371/journal.pbio.1000159"
#> [3] "10.1371/journal.pbio.1000167" "10.1371/journal.pbio.1000173"
#> [5] "10.1371/journal.pbio.1000176"
Then get the full text
x <- ft_get(dois)
Then pull out various sections of each article.
Remember to pull out the full text first.
x <- ft_collect(x)
x %>% pub_chunks("front")
x %>% pub_chunks("body")
x %>% pub_chunks("back")
x %>% pub_chunks("history")
x %>% pub_chunks("authors")
x %>% pub_chunks(c("doi","categories"))
x %>% pub_chunks("all")
x %>% pub_chunks("publisher")
x %>% pub_chunks("acknowledgments")
x %>% pub_chunks("permissions")
x %>% pub_chunks("journal_meta")
x %>% pub_chunks("article_meta")
10.2 Tabularize
The function pub_tabularize() is useful for coercing the output of pub_chunks()
into a data.frame, the lingua franca of data work in R.
library(data.table)
x <- pub_chunks(x, c("doi", "title"))
x <- pub_tabularize(x)
rbindlist(x$plos, fill = TRUE)
#> doi
#> 1: 10.1371/journal.pbio.1000153
#> 2: 10.1371/journal.pbio.1000159
#> 3: 10.1371/journal.pbio.1000167
#> 4: 10.1371/journal.pbio.1000173
#> 5: 10.1371/journal.pbio.1000176
#> title
#> 1: Emergence of a Stable Cortical Map for Neuroprosthetic Control
#> 2: Natural Killer Cell Signal Integration Balances Synapse Symmetry and Migration
#> 3: Ready…Go: Amplitude of the fMRI Signal Encodes Expectation of Cue Arrival Time
#> 4: Hippocampus Leads Ventral Striatum in Replay of Place-Reward Information
#> 5: β1 Integrin Maintains Integrity of the Embryonic Neocortical Stem Cell Niche
#> .publisher
#> 1: plos
#> 2: plos
#> 3: plos
#> 4: plos
#> 5: plos
10.3 Other inputs
pub_chunks() works with other inputs besides the output of fulltext::ft_get().
10.3.1 Files
x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml",
  package = "pubchunks")
pub_chunks(x, "abstract")
#> <pub chunks>
#> from: file
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: abstract
#> showing up to first 5:
#> abstract (n=1): Abstract
#>
#> This pa ...
pub_chunks(x, "title")
#> <pub chunks>
#> from: file
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: title
#> showing up to first 5:
#> title (n=1): On the driving of a piston with a rigid collar int ...
pub_chunks(x, "authors")
#> <pub chunks>
#> from: file
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: authors
#> showing up to first 5:
#> authors (n=1): Chetaev, D.N
pub_chunks(x, c("title", "refs"))
#> <pub chunks>
#> from: file
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: title, refs
#> showing up to first 5:
#> title (n=1): On the driving of a piston with a rigid collar int ...
#> refs (n=6): Watson G.N.. 1949. Teoriia besselevykh funktsii. N
The output of pub_chunks() is a list with an S3 class pub_chunks to make
internal work in the package easier. You can easily see the list structure
by using unclass().
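A short sketch of that, assuming pubchunks is installed and using the example file that ships with the package. The exact element names in the unclassed list are not spelled out in this chapter, so str() is used to inspect them rather than assuming a layout.

```r
library(pubchunks)

# Example file shipped with pubchunks
x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml",
  package = "pubchunks")
res <- pub_chunks(x, "title")

# The S3 class that drives the compact printed summary
class(res)

# Dropping the class leaves a plain list that prints in full
parts <- unclass(res)
str(parts)
```

Working with the unclassed list is mostly useful for exploration; for downstream analysis, pub_tabularize() is usually the more convenient route.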
10.3.2 XML in a string
xml <- paste0(readLines(x), collapse = "")
pub_chunks(xml, "title")
#> <pub chunks>
#> from: character
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: title
#> showing up to first 5:
#> title (n=1): On the driving of a piston with a rigid collar int ...
10.3.3 xml2 objects
xml <- paste0(readLines(x), collapse = "")
xml <- xml2::read_xml(xml)
pub_chunks(xml, "title")
#> <pub chunks>
#> from: xml_document
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: title
#> showing up to first 5:
#> title (n=1): On the driving of a piston with a rigid collar int ...