Chapter 10 Table

The ft_table() function builds a data.frame (returned as a tibble) holding the text of PDF, plain text, and XML files, together with the DOI/ID of each article. It is similar to readtext::readtext(), but is tailored to this package.

With the output of ft_table() you can go directly into a text-mining package like quanteda.
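As a minimal sketch of that hand-off, the example below builds a quanteda corpus from a data.frame shaped like ft_table() output. The `articles` data.frame is a toy stand-in with hypothetical DOIs, text, and paths, not real output; with real data you would pass the result of ft_table() instead. It assumes the quanteda package is installed.

```r
library(quanteda)

# Toy stand-in for ft_table() output: same column names, hypothetical values.
articles <- data.frame(
  dois = c("10.0000/example.1", "10.0000/example.2"),
  ids_norm = c("10_0000_example_1", "10_0000_example_2"),
  text = c("Soil microbes alter carbon cycling.",
           "Carbon cycling responds to warming."),
  paths = c("/tmp/example_1.xml", "/tmp/example_2.xml"),
  stringsAsFactors = FALSE
)

# Use the normalized IDs as document names; the remaining columns
# (dois, paths) are kept as document-level variables.
corp <- corpus(articles, docid_field = "ids_norm", text_field = "text")

# Tokenize and build a document-feature matrix for downstream analysis.
toks <- tokens(corp, remove_punct = TRUE)
dfmat <- dfm(toks)
```

Using ids_norm rather than dois as the document name is a choice made for this sketch; any column with unique values would work.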

10.1 Usage

library(fulltext)

Call ft_table() with no arguments to pull out the text of all articles:

ft_table()

#> # A tibble: 9 x 4
#>   dois         ids_norm       text                      paths                   
#>   <chr>        <chr>          <chr>                     <chr>                   
#> 1 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 2 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 3 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 4 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 5 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 6 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 7 10.3389/fea… 10_3389_feart… "Front. Earth Sci.Fronti… /Users/sckott/Library/C…
#> 8 10.3389/fph… 10_3389_fphar… "Front. Pharmacol.Fronti… /Users/sckott/Library/C…
#> 9 10.7717/pee… 10_7717_peerj… "PeerJPeerJPeerJPeerJ216… /Users/sckott/Library/C…

To pull out text from XML files only, set type = "xml":

ft_table(type = "xml")

#> # A tibble: 9 x 4
#>   dois         ids_norm       text                      paths                   
#>   <chr>        <chr>          <chr>                     <chr>                   
#> 1 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 2 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 3 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 4 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 5 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 6 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 7 10.3389/fea… 10_3389_feart… "Front. Earth Sci.Fronti… /Users/sckott/Library/C…
#> 8 10.3389/fph… 10_3389_fphar… "Front. Pharmacol.Fronti… /Users/sckott/Library/C…
#> 9 10.7717/pee… 10_7717_peerj… "PeerJPeerJPeerJPeerJ216… /Users/sckott/Library/C…

To pull out text from PDF files only, set type = "pdf". The result below has zero rows, which indicates that no PDF files were found:

ft_table(type = "pdf")

#> # A tibble: 0 x 3
#> # … with 3 variables: ids_norm <chr>, text <chr>, paths <chr>
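To see which file types are present before filtering by type, one option is to tabulate the file extensions in the paths column of the unfiltered table. The sketch below uses only base R; the `paths` vector is a hypothetical stand-in for that column, not real output.

```r
# Hypothetical paths standing in for the paths column of ft_table() output.
paths <- c("/tmp/cache/a.xml", "/tmp/cache/b.xml", "/tmp/cache/c.pdf")

# Count files per extension; an extension that is absent here would
# give an empty table when requested via the type argument.
ext_counts <- table(tools::file_ext(paths))
```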

You can also return the raw XML instead of the extracted text by setting xml_extract_text = FALSE. The text column then holds XML strings that you can parse yourself with XPath, CSS selectors, etc.

ft_table(xml_extract_text = FALSE)

#> # A tibble: 9 x 4
#>   dois         ids_norm       text                      paths                   
#>   <chr>        <chr>          <chr>                     <chr>                   
#> 1 10.1371/jou… 10_1371_journ… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 2 10.1371/jou… 10_1371_journ… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 3 10.1371/jou… 10_1371_journ… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 4 10.1371/jou… 10_1371_journ… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 5 10.1371/jou… 10_1371_journ… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 6 10.1371/jou… 10_1371_journ… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 7 10.3389/fea… 10_3389_feart… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 8 10.3389/fph… 10_3389_fphar… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 9 10.7717/pee… 10_7717_peerj… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
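One way to parse such strings is with the xml2 package, sketched below under some assumptions. The `xml_string` value is a hypothetical, heavily trimmed document standing in for one entry of the text column, and the element names in the XPath expressions are illustrative. Real publisher XML is far larger and may use namespaces that need extra handling.

```r
library(xml2)

# Hypothetical XML standing in for one value of the text column.
xml_string <- '<?xml version="1.0" encoding="UTF-8"?>
<article>
  <front><article-meta><title-group>
    <article-title>A hypothetical title</article-title>
  </title-group></article-meta></front>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</article>'

# Parse the string into an XML document.
doc <- read_xml(xml_string)

# Extract the first article title and all body paragraphs with XPath.
title <- xml_text(xml_find_first(doc, "//article-title"))
paragraphs <- xml_text(xml_find_all(doc, "//body/p"))
```

To process a whole table, the same steps could be applied to each element of the text column, for example with lapply().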