Chapter 11 Summarize articles on disk

The ft_table() function builds a data.frame containing the text of PDF, plain text, and XML files, together with the DOI or other ID of each article. It is similar to readtext::readtext(), but is designed specifically for this package.

You can pass the output of ft_table() directly to a text-mining package such as quanteda.
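
As a rough sketch of that hand-off, the example below builds a quanteda corpus and a document-feature matrix from a data frame shaped like the ft_table() output. The data frame here is a hypothetical stand-in with made-up text, so the example runs without any cached articles; with real data you would pass the result of ft_table() instead. It assumes quanteda is installed.

```r
library(quanteda)

# Hypothetical stand-in for ft_table() output: only the two columns
# used below (ids_norm and text) are included, and the text is invented.
articles <- data.frame(
  ids_norm = c("10_1371_journal_pbio_1000153", "10_3389_feart_2015_00009"),
  text = c(
    "Gene expression varies across tissues and across species.",
    "Sediment cores record past changes in climate and sea level."
  ),
  stringsAsFactors = FALSE
)

# Build a corpus, using the normalized IDs as document names
corp <- corpus(articles, docid_field = "ids_norm", text_field = "text")

# Tokenize, dropping punctuation, then build a document-feature matrix
toks <- tokens(corp, remove_punct = TRUE)
mat <- dfm(toks)

# Most frequent features across both documents
topfeatures(mat, n = 5)
```

Using ids_norm as the document ID keeps each row of the matrix traceable back to its article.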

11.1 Usage

library(fulltext)

Use ft_table() to pull out text from all articles.

ft_table()

#> # A tibble: 9 × 4
#>   dois                         ids_norm                     text       paths    
#>   <chr>                        <chr>                        <chr>      <chr>    
#> 1 10.1371/journal.pbio.1000153 10_1371_journal_pbio_1000153 "plosPLoS… /Users/s…
#> 2 10.1371/journal.pbio.1000159 10_1371_journal_pbio_1000159 "plosPLoS… /Users/s…
#> 3 10.1371/journal.pbio.1000167 10_1371_journal_pbio_1000167 "plosPLoS… /Users/s…
#> 4 10.1371/journal.pbio.1000173 10_1371_journal_pbio_1000173 "plosPLoS… /Users/s…
#> 5 10.1371/journal.pbio.1000176 10_1371_journal_pbio_1000176 "plosPLoS… /Users/s…
#> 6 10.1371/journal.pone.0086169 10_1371_journal_pone_0086169 "PLoS ONE… /Users/s…
#> 7 10.3389/feart.2015.00009     10_3389_feart_2015_00009     "Front. E… /Users/s…
#> 8 10.3389/fphar.2014.00109     10_3389_fphar_2014_00109     "Front. P… /Users/s…
#> 9 3399982                      3399982                      "03727252… /Users/s…
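
Because the result is an ordinary data frame, you can inspect and subset it with base R. The sketch below uses a small hypothetical data frame with the same four columns as the output above (the text and paths are invented placeholders) to add a character count per article and keep only the rows whose DOI starts with the PLOS prefix.

```r
# Hypothetical stand-in for ft_table() output; text and paths are invented
articles <- data.frame(
  dois = c("10.1371/journal.pbio.1000153", "10.3389/feart.2015.00009"),
  ids_norm = c("10_1371_journal_pbio_1000153", "10_3389_feart_2015_00009"),
  text = c("placeholder text of a PLOS article",
           "placeholder text of a Frontiers article"),
  paths = c("/hypothetical/cache/a.xml", "/hypothetical/cache/b.xml"),
  stringsAsFactors = FALSE
)

# Length of each article's text, in characters
articles$n_chars <- nchar(articles$text)

# Keep only articles whose DOI starts with the PLOS prefix
plos <- articles[startsWith(articles$dois, "10.1371/"), ]
plos$dois
```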

To pull out text from XML files only:

ft_table(type = "xml")

#> # A tibble: 9 × 4
#>   dois                         ids_norm                     text       paths    
#>   <chr>                        <chr>                        <chr>      <chr>    
#> 1 10.1371/journal.pbio.1000153 10_1371_journal_pbio_1000153 "plosPLoS… /Users/s…
#> 2 10.1371/journal.pbio.1000159 10_1371_journal_pbio_1000159 "plosPLoS… /Users/s…
#> 3 10.1371/journal.pbio.1000167 10_1371_journal_pbio_1000167 "plosPLoS… /Users/s…
#> 4 10.1371/journal.pbio.1000173 10_1371_journal_pbio_1000173 "plosPLoS… /Users/s…
#> 5 10.1371/journal.pbio.1000176 10_1371_journal_pbio_1000176 "plosPLoS… /Users/s…
#> 6 10.1371/journal.pone.0086169 10_1371_journal_pone_0086169 "PLoS ONE… /Users/s…
#> 7 10.3389/feart.2015.00009     10_3389_feart_2015_00009     "Front. E… /Users/s…
#> 8 10.3389/fphar.2014.00109     10_3389_fphar_2014_00109     "Front. P… /Users/s…
#> 9 3399982                      3399982                      "03727252… /Users/s…

To pull out text from PDF files only (this example returns no rows):

ft_table(type = "pdf")

#> # A tibble: 0 × 3
#> # … with 3 variables: ids_norm <chr>, text <chr>, paths <chr>

You can also return the XML without extracting its text. The text column then holds raw XML strings that you can parse yourself with XPath, CSS selectors, and so on.

ft_table(xml_extract_text = FALSE)

#> # A tibble: 9 × 4
#>   dois                         ids_norm                     text       paths    
#>   <chr>                        <chr>                        <chr>      <chr>    
#> 1 10.1371/journal.pbio.1000153 10_1371_journal_pbio_1000153 "<?xml ve… /Users/s…
#> 2 10.1371/journal.pbio.1000159 10_1371_journal_pbio_1000159 "<?xml ve… /Users/s…
#> 3 10.1371/journal.pbio.1000167 10_1371_journal_pbio_1000167 "<?xml ve… /Users/s…
#> 4 10.1371/journal.pbio.1000173 10_1371_journal_pbio_1000173 "<?xml ve… /Users/s…
#> 5 10.1371/journal.pbio.1000176 10_1371_journal_pbio_1000176 "<?xml ve… /Users/s…
#> 6 10.1371/journal.pone.0086169 10_1371_journal_pone_0086169 "<?xml ve… /Users/s…
#> 7 10.3389/feart.2015.00009     10_3389_feart_2015_00009     "<?xml ve… /Users/s…
#> 8 10.3389/fphar.2014.00109     10_3389_fphar_2014_00109     "<?xml ve… /Users/s…
#> 9 3399982                      3399982                      "<?xml ve… /Users/s…
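
One possible way to parse such a string is with the xml2 package and XPath, sketched below. The XML here is a short hypothetical document, not real ft_table() output, and its element names (article-title, body, p) are assumptions chosen for illustration; real articles will have a richer structure, and may use namespaces that need extra handling.

```r
library(xml2)

# Hypothetical XML string standing in for one value of the text column
xml_string <- paste0(
  '<?xml version="1.0" encoding="UTF-8"?>',
  "<article><front><article-meta><title-group>",
  "<article-title>A hypothetical title</article-title>",
  "</title-group></article-meta></front>",
  "<body><p>First paragraph.</p><p>Second paragraph.</p></body>",
  "</article>"
)

# Parse the string into an XML document
doc <- read_xml(xml_string)

# Extract the title and the body paragraphs with XPath
title <- xml_text(xml_find_first(doc, "//article-title"))
paras <- xml_text(xml_find_all(doc, "//body/p"))

title
paras
```

To apply the same extraction to every article, you could wrap these steps in a function and call it on each element of the text column, for example with lapply().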