Perform a collocate search for a given word/pattern in corpus text files that contain one sentence per line (e.g., the Leipzig Corpora); see Details below. If the lines of the corpus do not each correspond to a sentence, use colloc_default instead.

colloc_sentence(
  corpus_path = "(full) filepath to sentence-based corpus",
  leipzig_input = TRUE,
  pattern = "regular expressions",
  window = c("r", "l", "b"),
  span = 3,
  case_insensitive = TRUE,
  to_lower_colloc = TRUE,
  save_interim_results = FALSE,
  coll_output_name = "colloc_tibble_out.txt"
)

Arguments

corpus_path

character vector of (full) filepaths to the corpus files in plain-text (.txt) format. Each file must be a sentence-based corpus, meaning that each line of the file corresponds to one sentence. The sentences can follow one another as a cohesive sequence (e.g., when taken from a novel) or be randomised (as in the Leipzig Corpora).

leipzig_input

logical; whether the input corpus is specifically a set of Leipzig Corpora files (TRUE), in which case the function removes the sentence number at the beginning of each line.

pattern

regular expression or exact string for the target word/pattern (the node).

window

direction of the collocate window relative to the node: "r" (right of the node), "l" (left of the node), or "b" (both the left and right context; the default).

span

integer indicating the span of the collocate window (the default is 3).

case_insensitive

whether the search ignores case (TRUE -- the default) or not (FALSE).

to_lower_colloc

whether to lowercase the retrieved collocates and the nodes (TRUE -- default) or not (FALSE).

save_interim_results

whether to output the interim results (per corpus file) into a tab-separated plain text (TRUE) or not (FALSE -- default).

coll_output_name

name of the file for the collocate tables.
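
As an illustration of what leipzig_input = TRUE is described as doing, the sketch below strips a leading sentence number from lines laid out the way Leipzig sentence files are (number, tab, sentence). The two sample lines are invented, and this is only a base-R approximation under that layout assumption, not the function's actual code.

```r
# Hypothetical two-line sample in the Leipzig "sentences" layout:
# a running sentence number, a tab, then the sentence itself.
leipzig_lines <- c(
  "1\tTim itu akhirnya terkalahkan.",
  "2\tMereka tidak pernah terkalahkan di kandang."
)

# Drop the leading number and the tab that follows it;
# lines without that prefix are left untouched.
sentences <- sub("^[0-9]+\t", "", leipzig_lines)
sentences
```

For a sentence-based corpus that carries no such numbering, leipzig_input = FALSE would presumably skip this step.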

Value

A tbl_df of raw collocates.

Details

This function, which is largely built on top of the tidyverse, is specifically designed for collocate searches that do not cross the boundary of the sentence in which the search word/pattern occurs. The reason is that adjacent sentences in the corpus can be randomised and totally unrelated (as in the Leipzig Corpora). It is therefore important that the collocates of the search word/pattern stay within the sentence in which the word/pattern occurs. In that way, the function aims to maintain the cohesiveness of the word's meaning.

Moreover, the function only outputs the raw collocate data; it neither tabulates the frequency of the collocates nor computes association measures between the collocates and the search word/pattern. Future iterations of this package aim to accommodate these features.
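
The sentence-bounded search described above can be pictured with a small base-R sketch. The helper collocates_in_sentence() is hypothetical and not part of the package; it assumes simple tokenisation on non-letters and lowercases everything, so it only approximates how window and span interact with the sentence boundary.

```r
# Hypothetical helper, not part of the package: collect the words within
# `span` positions of each match, never looking beyond the sentence.
collocates_in_sentence <- function(sentence, pattern, window = "b", span = 3) {
  # Naive tokenisation: split on anything that is not a letter, then lowercase.
  words <- tolower(strsplit(sentence, "[^[:alpha:]]+")[[1]])
  words <- words[nzchar(words)]
  hits <- grep(pattern, words, perl = TRUE)
  out <- character(0)
  for (i in hits) {
    left  <- if (window %in% c("l", "b")) i - rev(seq_len(span)) else integer(0)
    right <- if (window %in% c("r", "b")) i + seq_len(span) else integer(0)
    pos <- c(left, right)
    # Clip to the sentence so collocates never come from another line.
    pos <- pos[pos >= 1 & pos <= length(words)]
    out <- c(out, words[pos])
  }
  out
}

# Invented sample sentences, one per element (i.e., one per corpus line).
sentences <- c("Tim itu akhirnya terkalahkan.",
               "Mereka tidak pernah terkalahkan di kandang.")

# One word to the left of the node, per sentence.
left_colloc <- unlist(lapply(sentences, collocates_in_sentence,
                             pattern = "\\bterkalahkan\\b",
                             window = "l", span = 1))
left_colloc
```

Because each sentence is processed on its own, a node at the end of a line simply yields fewer right-hand collocates instead of borrowing words from the next, possibly unrelated, sentence.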

Examples

if (FALSE) {
# get the path of the Leipzig corpora
leipzig_corpus_path <- c("/my/path/to/leipzig_corpus_1M_1.txt",
                         "/my/path/to/leipzig_corpus_300K_2.txt",
                         "/my/path/to/leipzig_corpus_300K_3.txt")

# retrieve collocate list
df <- colloc_sentence(corpus_path = leipzig_corpus_path[2:3],
                      leipzig_input = TRUE,
                      pattern = "\\bterkalahkan\\b",
                      window = "l",
                      span = 1,
                      case_insensitive = TRUE,
                      to_lower_colloc = TRUE,
                      save_interim_results = FALSE)

# see the output
df

# count the frequency of the collocates
df %>% dplyr::count(w, sort = TRUE)
}