Function to get the total word-token count of a given Leipzig corpus file.
It is built on top of str_count.
corpus_size_leipzig( leipzig_path = "(full) filepath to Leipzig corpus files", word_regex = "\\b(?i)([-a-zA-Z0-9]+)\\b" )
Argument | Description |
---|---|
leipzig_path | file path to the directory in which the Leipzig corpus files are stored |
word_regex | regular expression defining what "a word" is |
A tibble containing corpus_id, size, and size_print (for text-printing).
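To illustrate what the default word_regex treats as a word token, here is a minimal sketch of the counting step. It is not the function's actual implementation: base-R gregexpr stands in for str_count so the sketch needs no packages, and count_tokens and sample_lines are hypothetical names and data introduced only for this example.

```r
# Default pattern copied from the usage line of corpus_size_leipzig().
word_regex <- "\\b(?i)([-a-zA-Z0-9]+)\\b"

# Hypothetical sample lines; not taken from any Leipzig corpus file.
sample_lines <- c(
  "A well-known example sentence.",
  "Another line with 5 tokens."
)

# Hypothetical helper: count regex matches per element of x.
# Assumption: base-R gregexpr() is used here as a stand-in for str_count.
count_tokens <- function(x, pattern) {
  matches <- gregexpr(pattern, x, perl = TRUE)
  # gregexpr() returns -1 when nothing matches, so count positive starts only.
  vapply(matches, function(m) sum(m > 0), integer(1))
}

tokens_per_line <- count_tokens(sample_lines, word_regex)
total_tokens <- sum(tokens_per_line)

print(tokens_per_line)
print(total_tokens)
```

Because the hyphen is part of the character class, "well-known" counts as a single token, and digits such as "5" count as tokens as well; punctuation is not counted.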