Function to get the total word-token count of a given Leipzig corpus file.
It is built on top of str_count.
corpus_size_leipzig( leipzig_path = "(full) filepath to Leipzig corpus files", word_regex = "\\b(?i)([-a-zA-Z0-9]+)\\b" )
| Argument | Description |
|---|---|
| leipzig_path | file path to the directory in which the Leipzig corpus files are stored |
| word_regex | regular expression defining what "a word" is |
A tibble containing corpus_id, size, and size_print (for text-printing).
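
The counting step described above can be illustrated with a minimal sketch. It assumes that str_count refers to `stringr::str_count()` and applies the default `word_regex` to a small, made-up character vector instead of a real Leipzig corpus file. The names `sentences`, `tokens_per_line`, and `total_tokens` are hypothetical and are not part of the function itself.

```r
library(stringr)

# Default word pattern from the usage above: runs of letters, digits, and
# hyphens between word boundaries, matched case-insensitively.
word_regex <- "\\b(?i)([-a-zA-Z0-9]+)\\b"

# Hypothetical stand-in for the lines of a Leipzig corpus file.
sentences <- c(
  "The well-known corpus has 3 files.",
  "Leipzig Corpora Collection"
)

# Count the matches of the word pattern in each line, then sum them
# to get a total word-token count.
tokens_per_line <- str_count(sentences, pattern = word_regex)
total_tokens <- sum(tokens_per_line)

print(tokens_per_line)
print(total_tokens)
```

Because the hyphen is part of the character class, a hyphenated form such as "well-known" is counted as one token. Passing a different `word_regex` changes what counts as a word, and therefore the reported size.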