Generate tidyverse-style window-span collocates for the Leipzig Corpora

The function retrieves window-span collocates of a search pattern from Leipzig Corpora files and returns them as tibbles.

colloc_leipzig(
  leipzig_path = NULL,
  leipzig_corpus_list = NULL,
  pattern = NULL,
  window = "b",
  span = 2,
  case_insensitive = TRUE,
  to_lower_colloc = TRUE,
  save_results = FALSE,
  coll_output_name = "colloc_tidy_colloc_out.txt",
  sent_output_name = "colloc_tidy_sent_out.txt"
)

Arguments

leipzig_path	character vector of (i) the file names of the Leipzig corpus files, if they are in the working directory, or (ii) the complete file path to each Leipzig corpus file.
leipzig_corpus_list	a list in which each element is a Leipzig corpus file that has already been loaded as an R object. See `data("demo_corpus_leipzig")` for an example of this input type. Specify either `leipzig_path` or `leipzig_corpus_list`, and leave the other as `NULL`.
pattern	regular expression or exact string for the target (node) pattern.
window	direction of the collocate window: `"r"` (right of the node), `"l"` (left of the node), or `"b"` (both left and right of the node; the default).
span	integer vector indicating the span of the collocate window, that is, how far from the node collocates are retrieved (default `2`).
case_insensitive	whether the search pattern ignores case (TRUE -- the default) or not (FALSE).
to_lower_colloc	whether to lowercase the retrieved collocates and the nodes (TRUE -- default) or not (FALSE).
save_results	whether to save the results to tab-separated plain-text files (TRUE) or not (FALSE -- default).
coll_output_name	name of the file for the collocate tables.
sent_output_name	name of the file for the full-sentence matches containing the collocates.
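
To make the `window` and `span` arguments concrete, here is a minimal base-R sketch on a toy sentence. It is only an illustration of the general window-span idea, not the package's internal code: `window_collocates()` and `tokens` are hypothetical names introduced for this example, and the sketch assumes whitespace-tokenised input and an exact-match node rather than a regular expression.

```r
# Toy illustration of window-span collocates; NOT the package's internal code.
# `window_collocates()` is a hypothetical helper written for this sketch.
window_collocates <- function(tokens, node, window = "b", span = 2) {
  node_pos <- which(tokens == node)[1]          # first occurrence of the node
  if (is.na(node_pos)) return(character(0))     # node not found

  left_idx  <- node_pos - rev(seq_len(span))    # positions to the left, in text order
  right_idx <- node_pos + seq_len(span)         # positions to the right

  # drop positions that fall outside the sentence
  left_idx  <- left_idx[left_idx >= 1]
  right_idx <- right_idx[right_idx <= length(tokens)]

  switch(window,
         l = tokens[left_idx],
         r = tokens[right_idx],
         b = tokens[c(left_idx, right_idx)],
         stop("window must be \"l\", \"r\", or \"b\""))
}

tokens <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
window_collocates(tokens, "fox", window = "l", span = 2)  # "quick" "brown"
window_collocates(tokens, "fox", window = "r", span = 2)  # "jumps" "over"
window_collocates(tokens, "fox", window = "b", span = 3)  # three words on each side
```

In this sketch a window that runs past the start or end of the sentence is simply truncated; whether `colloc_leipzig()` handles sentence boundaries the same way is an assumption.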

Value

A list of two tibbles: (i) `collocates`, containing the collocates with the sentence number of the match, the window-span information, and the corpus file; and (ii) `sentence_matches`, containing the full sentence for each match with its sentence number and corpus file.

Examples

if (FALSE) {
# get the corpus file paths
# this example uses file-path input rather than a list of corpora
leipzig_corpus_path <- c("my/path/to/leipzig_corpus_file_1M-sent_1.txt",
                         "my/path/to/leipzig_corpus_file_300K-sent_2.txt",
                         "my/path/to/leipzig_corpus_file_300K-sent_3.txt")

# run the function
colloc <- colloc_leipzig(leipzig_path = leipzig_corpus_path[2:3],
                         pattern = "\\bterelakkan\\b",
                         window = "b",
                         span = 3,
                         save_results = FALSE,
                         to_lower_colloc = TRUE)
# Inspect outputs
## This one outputs the collocates tibble
colloc$collocates

## This one outputs the sentence matches tibble
colloc$sentence_matches
}