A function to get the total word-token count of a given Leipzig Corpus file. It is built on top of str_count.

corpus_size_leipzig(
  leipzig_path = "(full) filepath to Leipzig corpus files",
  word_regex = "\\b(?i)([-a-zA-Z0-9]+)\\b"
)

Arguments

leipzig_path

File path to the directory in which the Leipzig corpus files are stored.

word_regex

A regular expression defining what counts as a word.

Value

A tibble containing corpus_id, size, and size_print (for text-printing).
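
Examples

The following is a minimal sketch of how the default word_regex behaves when passed to str_count. It does not reproduce the internals of corpus_size_leipzig. It assumes the stringr package is installed, and the two sample lines are made up for illustration rather than taken from a Leipzig corpus file.

```r
library(stringr)

# Default pattern copied from the usage section above
word_regex <- "\\b(?i)([-a-zA-Z0-9]+)\\b"

# Hypothetical sample lines standing in for the contents of a corpus file
corpus_lines <- c(
  "The well-known corpus has 3 files.",
  "Leipzig Corpora Collection"
)

# Count the regex matches (word tokens) in each line
tokens_per_line <- str_count(corpus_lines, word_regex)

# Sum over lines to get a total token count for the sample
total_tokens <- sum(tokens_per_line)

print(tokens_per_line)
print(total_tokens)
```

Because the character class includes the hyphen, a form such as "well-known" is counted as one token, and a bare digit such as "3" also counts as a word. Supplying a different word_regex changes what is counted.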