Collocates retrieval on sentence-based corpus — colloc

Performs a collocate search for a given word/pattern in corpus text files with one sentence per line (e.g., the Leipzig Corpora); see Details below. If the lines of the corpus do not each correspond to a sentence, use colloc_default instead.

colloc_sentence(
  corpus_path = "(full) filepath to sentence-based corpus",
  leipzig_input = TRUE,
  pattern = "regular expressions",
  window = c("r", "l", "b"),
  span = 3,
  case_insensitive = TRUE,
  to_lower_colloc = TRUE,
  save_interim_results = FALSE,
  coll_output_name = "colloc_tibble_out.txt"
)

Arguments

corpus_path	character vector of (full) file paths to the corpus text files in `.txt` plain-text format. Each file must be a sentence-based corpus, meaning that each line of the file corresponds to one sentence. The sentences can follow one another as cohesive running text (e.g., taken from a novel) or be randomised (as in the Leipzig Corpora).
leipzig_input	logical; whether the input corpus is a Leipzig Corpora file (`TRUE`), in which case the function removes the sentence number at the beginning of each line.
pattern	regular expression (or exact string) for the target word/pattern.
window	direction of the collocate window: `"r"` ('right of the node'), `"l"` ('left of the node'), or `"b"` ('both left and right context-window'; the default).
span	integer vector indicating the span of the collocate window, i.e., the number of words from the node within which collocates are retrieved.
case_insensitive	whether the search ignores case (TRUE -- the default) or not (FALSE).
to_lower_colloc	whether to lowercase the retrieved collocates and the nodes (TRUE -- default) or not (FALSE).
save_interim_results	whether to write the interim results (per corpus file) to a tab-separated plain-text file (TRUE) or not (FALSE -- default).
coll_output_name	name of the file for the collocate tables.
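
As a rough illustration of what `leipzig_input = TRUE` is described as doing, the sketch below strips a leading sentence number from each line using base R. It assumes the number is followed by whitespace (a tab in this toy input); the helper name `strip_sentence_id` and the sample lines are hypothetical and are not part of this package.

```r
# Hypothetical helper, not part of the package: drop a leading sentence
# number (digits followed by whitespace) from each corpus line.
strip_sentence_id <- function(lines) {
  sub("^\\d+\\s+", "", lines, perl = TRUE)
}

# Toy input imitating numbered, one-sentence-per-line corpus text
lines <- c("1\tDia tidak terkalahkan.",
           "2\tTim itu belum terkalahkan musim ini.")

strip_sentence_id(lines)
```

Lines without a leading number are returned unchanged, so the same helper is harmless on sentence-based corpora that are not numbered.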

Value

A tbl_df of raw collocates.

Details

This function, which is largely built on top of the tidyverse, is specifically designed to retrieve collocates without crossing the boundary of the sentence in which the search word/pattern occurs. The reason is that the sentences in a corpus can be randomised and totally unrelated to one another (as in the Leipzig Corpora). It is therefore important to keep the collocates of the search word/pattern within the boundary of the sentence in which it occurs. In this way, the function aims to maintain the cohesiveness of the word's meaning.
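
The idea of a sentence-bounded window can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions (crude tokenisation on non-letter characters, lowercasing of all words); `window_collocates` is a hypothetical name and this is not the package's actual implementation.

```r
# Hypothetical sketch, not the package's implementation: collect the words
# within `span` positions of each match, never leaving the sentence.
window_collocates <- function(sentences, pattern, span = 1, window = "b") {
  out <- character(0)
  for (s in sentences) {
    # crude tokenisation on non-letter characters (an assumption)
    words <- strsplit(tolower(s), "[^[:alpha:]]+")[[1]]
    words <- words[nzchar(words)]
    for (i in grep(pattern, words)) {
      left  <- if (window %in% c("l", "b")) -seq_len(span) else integer(0)
      right <- if (window %in% c("r", "b")) seq_len(span) else integer(0)
      pos <- i + c(left, right)
      # positions outside the sentence are dropped, so the window is
      # clipped at the sentence boundary
      pos <- pos[pos >= 1 & pos <= length(words)]
      out <- c(out, words[pos])
    }
  }
  out
}

sentences <- c("Dia tidak terkalahkan",
               "Terkalahkan bukan berarti kalah")

# Left window of one word: the node opens the second sentence, so nothing
# is borrowed from the end of the first sentence.
window_collocates(sentences, "^terkalahkan$", span = 1, window = "l")
# returns "tidak"
```

Because each sentence is tokenised and searched separately, a node at the start or end of a line simply yields fewer collocates instead of picking up words from a neighbouring, possibly unrelated sentence.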

Moreover, the function only outputs the raw collocate data; it neither tabulates the frequency of the collocates nor computes association measures between the collocates and the search word/pattern. A future iteration of this package aims to accommodate these features.

Examples

if (FALSE) {
# get the path of the Leipzig corpora
leipzig_corpus_path <- c("/my/path/to/leipzig_corpus_1M_1.txt",
                         "/my/path/to/leipzig_corpus_300K_2.txt",
                         "/my/path/to/leipzig_corpus_300K_3.txt")
# retrieve collocate list
df <- colloc_sentence(corpus_path = leipzig_corpus_path[2:3],
                      leipzig_input = TRUE,
                      pattern = "\\bterkalahkan\\b",
                      window = "l",
                      span = 1,
                      case_insensitive = TRUE,
                      to_lower_colloc = TRUE,
                      save_interim_results = FALSE)

# see the output
df

# count the frequency of the collocates (column `w`);
# calling dplyr::count() directly avoids needing the pipe operator attached
dplyr::count(df, w, sort = TRUE)
}