Chapter 10 Table
The ft_table() function makes it easy to create a data.frame of the text of PDF, plain text, and XML files, together with DOIs/IDs for each article. It is similar to the readtext::readtext() function, but is narrower in scope, targeting only the files this package works with.
With the output of ft_table() you can go directly into a text-mining package like quanteda.
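As a rough sketch of that hand-off, the example below builds a quanteda corpus from a small hand-made data.frame. The data.frame is only a stand-in: it reuses the column names shown in this chapter's output (dois, ids_norm, text, paths), but its values are invented placeholders so the code runs without downloading anything. It assumes the quanteda package is installed.

```r
library(quanteda)

# Stand-in for ft_table() output: same column names, made-up values
x <- data.frame(
  dois = c("10.0000/example.1", "10.0000/example.2"),
  ids_norm = c("10_0000_example_1", "10_0000_example_2"),
  text = c("Open access articles are easy to mine.",
           "Text mining needs clean article text."),
  paths = c("example1.xml", "example2.xml"),
  stringsAsFactors = FALSE
)

# Build a corpus, using the normalized IDs as document names;
# the remaining columns are kept as document-level variables
corp <- corpus(x, docid_field = "ids_norm", text_field = "text")

# Tokenize and build a document-feature matrix
toks <- tokens(corp, remove_punct = TRUE)
mat <- dfm(toks)
ndoc(mat)
```

With real data, x would be the tibble returned by ft_table() rather than the placeholder above.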
10.1 Usage
Use ft_table() to pull out text from all articles.
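A minimal sketch of this call, assuming the fulltext package is installed and that some articles have already been fetched (for example with ft_get()). The assumption here is that ft_table() called with no arguments reads from the package's cache directory; the variable name x is only illustrative.

```r
library(fulltext)

# With no arguments, ft_table() is assumed to read from the cache directory
x <- ft_table()
x
```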
#> # A tibble: 9 x 4
#> dois ids_norm text paths
#> <chr> <chr> <chr> <chr>
#> 1 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 2 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 3 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 4 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 5 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 6 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 7 10.3389/fea… 10_3389_feart… "Front. Earth Sci.Fronti… /Users/sckott/Library/C…
#> 8 10.3389/fph… 10_3389_fphar… "Front. Pharmacol.Fronti… /Users/sckott/Library/C…
#> 9 10.7717/pee… 10_7717_peerj… "PeerJPeerJPeerJPeerJ216… /Users/sckott/Library/C…
You can pull out text from just the XML files.
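One way to express this, sketched under the assumption that ft_table() takes a type argument accepting "xml" (the variable name x_xml is only illustrative):

```r
library(fulltext)

# Restrict the table to articles stored as XML
x_xml <- ft_table(type = "xml")
x_xml
```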
#> # A tibble: 9 x 4
#> dois ids_norm text paths
#> <chr> <chr> <chr> <chr>
#> 1 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 2 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 3 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 4 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 5 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 6 10.1371/jou… 10_1371_journ… "PLoS ONEplosplosonePLoS… /Users/sckott/Library/C…
#> 7 10.3389/fea… 10_3389_feart… "Front. Earth Sci.Fronti… /Users/sckott/Library/C…
#> 8 10.3389/fph… 10_3389_fphar… "Front. Pharmacol.Fronti… /Users/sckott/Library/C…
#> 9 10.7717/pee… 10_7717_peerj… "PeerJPeerJPeerJPeerJ216… /Users/sckott/Library/C…
You can pull out text from just the PDF files. Here the result is empty, which suggests no PDF files were available.
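The PDF case can be sketched the same way, again assuming a type argument that accepts "pdf" (x_pdf is an illustrative name). If no PDF files are present, a zero-row table is the expected result rather than an error.

```r
library(fulltext)

# Restrict the table to articles stored as PDF
x_pdf <- ft_table(type = "pdf")
nrow(x_pdf)
```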
#> # A tibble: 0 x 3
#> # … with 3 variables: ids_norm <chr>, text <chr>, paths <chr>
You can also pull out the XML without extracting its text. You then get raw XML strings that you can parse yourself with XPath, CSS selectors, etc.
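A sketch of this call, assuming ft_table() exposes an xml_extract_text argument that can be set to FALSE (x_raw is an illustrative name):

```r
library(fulltext)

# Keep the raw XML in the text column instead of the extracted text
x_raw <- ft_table(xml_extract_text = FALSE)
x_raw
```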
#> # A tibble: 9 x 4
#> dois ids_norm text paths
#> <chr> <chr> <chr> <chr>
#> 1 10.1371/jou… 10_1371_journ… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 2 10.1371/jou… 10_1371_journ… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 3 10.1371/jou… 10_1371_journ… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 4 10.1371/jou… 10_1371_journ… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 5 10.1371/jou… 10_1371_journ… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 6 10.1371/jou… 10_1371_journ… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 7 10.3389/fea… 10_3389_feart… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 8 10.3389/fph… 10_3389_fphar… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
#> 9 10.7717/pee… 10_7717_peerj… "<?xml version=\"1.0\" e… /Users/sckott/Library/C…
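To illustrate parsing such strings yourself, here is a small sketch using the xml2 package, which is an assumption (any XML parser would do). The XML string is a tiny invented stand-in for one element of the text column, and the element names (article-title, body, p) are illustrative; real articles are far larger and may use namespaces that need handling first.

```r
library(xml2)

# Invented stand-in for one element of the text column
xml_string <- paste0(
  "<article><front><article-title>An example title</article-title></front>",
  "<body><p>First paragraph.</p><p>Second paragraph.</p></body></article>"
)

# Parse the string into an XML document
doc <- read_xml(xml_string)

# XPath queries: the first title, and every paragraph under body
title <- xml_text(xml_find_first(doc, "//article-title"))
paras <- xml_text(xml_find_all(doc, "//body/p"))
```

With real output, the same steps could be applied to each element of the text column, for example with lapply().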