Chapter 11 Summarize articles on disk
The ft_table() function makes it easy to create a data.frame of the text of PDF, plain text, and XML files, together with DOIs/IDs for each article. It’s similar to the readtext::readtext() function, but is specific to this package.
With the output of ft_table() you can go directly into a text-mining package like quanteda.
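As a rough sketch of that hand-off, the snippet below builds a quanteda corpus from a data frame shaped like the output of ft_table(). The two-row data frame `articles` is a made-up stand-in (its DOIs, text, and paths are invented for illustration), and using `ids_norm` as the document ID is just one reasonable choice, not something the package prescribes.

```r
library(quanteda)

# Hypothetical stand-in for ft_table() output: same column names,
# invented contents.
articles <- data.frame(
  dois = c("10.1234/example.0001", "10.1234/example.0002"),
  ids_norm = c("10_1234_example_0001", "10_1234_example_0002"),
  text = c("Coral reefs support diverse fish communities.",
           "Fish communities respond to reef degradation."),
  paths = c("/tmp/example_0001.xml", "/tmp/example_0002.xml"),
  stringsAsFactors = FALSE
)

# Build a corpus; the remaining columns (dois, paths) are kept as
# document-level variables.
corp <- corpus(articles, docid_field = "ids_norm", text_field = "text")

# Tokenize and build a document-feature matrix for downstream analysis.
toks <- tokens(corp, remove_punct = TRUE)
dfmat <- dfm(toks)

ndoc(dfmat)
topfeatures(dfmat, 5)
```

With real data you would replace `articles` with the result of ft_table(); the rest of the pipeline should carry over as long as the column names match.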
11.1 Usage
library(fulltext)
Use ft_table() to pull out text from all articles:
ft_table()
#> # A tibble: 9 × 4
#> dois ids_norm text paths
#> <chr> <chr> <chr> <chr>
#> 1 10.1371/journal.pbio.1000153 10_1371_journal_pbio_1000153 "plosPLoS… /Users/s…
#> 2 10.1371/journal.pbio.1000159 10_1371_journal_pbio_1000159 "plosPLoS… /Users/s…
#> 3 10.1371/journal.pbio.1000167 10_1371_journal_pbio_1000167 "plosPLoS… /Users/s…
#> 4 10.1371/journal.pbio.1000173 10_1371_journal_pbio_1000173 "plosPLoS… /Users/s…
#> 5 10.1371/journal.pbio.1000176 10_1371_journal_pbio_1000176 "plosPLoS… /Users/s…
#> 6 10.1371/journal.pone.0086169 10_1371_journal_pone_0086169 "PLoS ONE… /Users/s…
#> 7 10.3389/feart.2015.00009 10_3389_feart_2015_00009 "Front. E… /Users/s…
#> 8 10.3389/fphar.2014.00109 10_3389_fphar_2014_00109 "Front. P… /Users/s…
#> 9 3399982 3399982 "03727252… /Users/s…
You can pull out just the text from XML files:
ft_table(type = "xml")
#> # A tibble: 9 × 4
#> dois ids_norm text paths
#> <chr> <chr> <chr> <chr>
#> 1 10.1371/journal.pbio.1000153 10_1371_journal_pbio_1000153 "plosPLoS… /Users/s…
#> 2 10.1371/journal.pbio.1000159 10_1371_journal_pbio_1000159 "plosPLoS… /Users/s…
#> 3 10.1371/journal.pbio.1000167 10_1371_journal_pbio_1000167 "plosPLoS… /Users/s…
#> 4 10.1371/journal.pbio.1000173 10_1371_journal_pbio_1000173 "plosPLoS… /Users/s…
#> 5 10.1371/journal.pbio.1000176 10_1371_journal_pbio_1000176 "plosPLoS… /Users/s…
#> 6 10.1371/journal.pone.0086169 10_1371_journal_pone_0086169 "PLoS ONE… /Users/s…
#> 7 10.3389/feart.2015.00009 10_3389_feart_2015_00009 "Front. E… /Users/s…
#> 8 10.3389/fphar.2014.00109 10_3389_fphar_2014_00109 "Front. P… /Users/s…
#> 9 3399982 3399982 "03727252… /Users/s…
You can pull out just the text from PDF files (the empty result below suggests that no PDF files were found in this example):
ft_table(type = "pdf")
#> # A tibble: 0 × 3
#> # … with 3 variables: ids_norm <chr>, text <chr>, paths <chr>
You can also pull out the XML without extracting the text. You then get XML strings that you can parse yourself with XPath, CSS selectors, etc.:
ft_table(xml_extract_text = FALSE)
#> # A tibble: 9 × 4
#> dois ids_norm text paths
#> <chr> <chr> <chr> <chr>
#> 1 10.1371/journal.pbio.1000153 10_1371_journal_pbio_1000153 "<?xml ve… /Users/s…
#> 2 10.1371/journal.pbio.1000159 10_1371_journal_pbio_1000159 "<?xml ve… /Users/s…
#> 3 10.1371/journal.pbio.1000167 10_1371_journal_pbio_1000167 "<?xml ve… /Users/s…
#> 4 10.1371/journal.pbio.1000173 10_1371_journal_pbio_1000173 "<?xml ve… /Users/s…
#> 5 10.1371/journal.pbio.1000176 10_1371_journal_pbio_1000176 "<?xml ve… /Users/s…
#> 6 10.1371/journal.pone.0086169 10_1371_journal_pone_0086169 "<?xml ve… /Users/s…
#> 7 10.3389/feart.2015.00009 10_3389_feart_2015_00009 "<?xml ve… /Users/s…
#> 8 10.3389/fphar.2014.00109 10_3389_fphar_2014_00109 "<?xml ve… /Users/s…
#> 9 3399982 3399982 "<?xml ve… /Users/s…
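To illustrate what parsing those strings yourself might look like, here is a minimal sketch using the xml2 package. The `xml_string` value is an invented, heavily trimmed JATS-like document standing in for a single entry of the `text` column; real articles are far larger, may use namespaces, and their element names depend on the publisher, so the XPath expressions are assumptions to adapt.

```r
library(xml2)

# Hypothetical stand-in for one value of the `text` column returned
# with xml_extract_text = FALSE.
xml_string <- paste0(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<article><front><article-meta><title-group>',
  '<article-title>An example title</article-title>',
  '</title-group></article-meta></front>',
  '<body><sec><title>Introduction</title>',
  '<p>First paragraph.</p><p>Second paragraph.</p></sec></body>',
  '</article>'
)

# Parse the literal XML string into a document object.
doc <- read_xml(xml_string)

# Pull out selected pieces with XPath.
title <- xml_text(xml_find_first(doc, "//article-title"))
section_titles <- xml_text(xml_find_all(doc, "//body//sec/title"))
paragraphs <- xml_text(xml_find_all(doc, "//body//p"))

title
section_titles
paragraphs
```

To process every article, the same steps could be wrapped in a function and applied over the `text` column, for example with lapply().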