Collocates retrieval for raw corpus text — colloc

This function retrieves collocates for a word within a user-defined context window from raw (unannotated) corpus texts. It uses a vectorised approach: the vector positions of the collocates are determined relative to the vector position of the node word in the corpus word vector. The argument tokenise_corpus_to_sentence (cf. below) lets the user first split the raw input corpus into a character vector in which each element corresponds to one sentence.
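
The position-based idea can be illustrated with a minimal base-R sketch. This is not the package's internal code: the toy word vector, the variable names (`words`, `node_pos`, `offsets`, `hits`) and the use of `expand.grid()` are assumptions made purely for illustration.

```r
# Toy corpus word vector (assumed to be tokenised and lowercased already).
words <- c("ia", "nuju", "melajah", "ring", "sekolah", "ipun", "nuju", "ujan")

node_pattern <- "^nuju$"  # regular expression for the node word
span <- 2                 # number of positions to look at on each side
window <- "b"             # "l" (left), "r" (right), or "b" (both)

# Vector positions of the node word in the word vector.
node_pos <- grep(node_pattern, words, perl = TRUE)

# Relative offsets implied by the window direction and the span.
offsets <- switch(window,
  l = -span:-1,
  r = 1:span,
  b = c(-span:-1, 1:span)
)

# Every node/offset combination, built without an explicit loop.
hits <- expand.grid(node = node_pos, offset = offsets)
hits$pos <- hits$node + hits$offset

# Drop positions that fall before the start or after the end of the vector.
hits <- hits[hits$pos >= 1 & hits$pos <= length(words), ]

# Look up the collocates at the remaining positions.
hits$collocate <- words[hits$pos]
hits <- hits[order(hits$node, hits$offset), ]
```

In this sketch `hits$offset` plays the role of the span position of each collocate. The sketch treats the corpus as one flat vector, so it does not stop at sentence boundaries.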

colloc_default(
  corpus_path = NULL,
  corpus_list = NULL,
  pattern = NULL,
  window = "b",
  span = 3,
  word_split_regex = "([^a-zA-Z-]+|--)",
  case_insensitive = TRUE,
  to_lower_colloc = TRUE,
  tokenise_corpus_to_sentence = TRUE
)

Arguments

corpus_path	character vector of (full) file paths to the corpus text files in `.txt` plain-text format.
corpus_list	a named list object containing elements constituting a corpus text. The name of each element should correspond to the corpus file. There can be more than one element (hence more than one corpus text) within this list object.
pattern	regular expression or exact pattern matching the target (node) word.
window	direction of the context window for the collocates: `"r"` (right of the node), `"l"` (left of the node), or `"b"` (both left and right of the node; the default).
span	integer vector giving the span of the collocate scope, i.e. how many word positions to retrieve on each side of the node selected by `window`. The default is `3`.
word_split_regex	user-defined regular expression used to tokenise the corpus into words. The default splits at non-alphabetic characters but retains the hyphen "-", for instance to keep reduplicated forms intact. The regex for this default setting is `"([^a-zA-Z-]+|--)"`. An alternative splitting regex may also keep characters with diacritics (e.g., `'([^a-zA-Z\u00c0-\u00d6\u00d9-\u00f6\u00f9-\u00ff\u0100-\u017e\u1e00-\u1eff]+|--)'`).
case_insensitive	whether the search pattern ignores case (`TRUE`, the default) or not (`FALSE`).
to_lower_colloc	whether to lowercase the retrieved collocates (`TRUE`, the default) or not (`FALSE`).
tokenise_corpus_to_sentence	whether to tokenise the input corpus into sentences first, so that the retrieved collocates do not cross sentence boundaries. The default is `TRUE`; the corpus is split into sentences with `stri_split_boundaries` before being further tokenised into word tokens with `str_split`.
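
To make the effect of `word_split_regex` and `tokenise_corpus_to_sentence` more concrete, here is a rough base-R sketch of a two-step tokenisation. It is an approximation only: the sentence split uses a naive punctuation regex as a stand-in for `stri_split_boundaries`, base `strsplit()` stands in for `str_split`, and the toy text and the column names `sent_id` and `word` are hypothetical.

```r
# Invented toy text with a hyphenated form and a "--" separator.
raw_text <- "Anak-anak nuju melajah -- ring sekolah. Ipun nuju ngajeng."

# Step 1: naive sentence split at ., ! or ? (stand-in for a proper
# sentence-boundary detector; the punctuation itself is dropped).
sentences <- unlist(strsplit(raw_text, "[.!?]+\\s*", perl = TRUE))

# Step 2: split each sentence into words with the default regex, which
# splits at non-alphabetic characters and at "--" but keeps single hyphens.
word_split_regex <- "([^a-zA-Z-]+|--)"
tokens <- strsplit(sentences, word_split_regex, perl = TRUE)

# Adjacent separators leave empty strings behind; remove them.
tokens <- lapply(tokens, function(x) x[nzchar(x)])

# One row per word, with the number of the sentence it came from.
word_tbl <- data.frame(
  sent_id = rep(seq_along(tokens), lengths(tokens)),
  word = tolower(unlist(tokens)),
  stringsAsFactors = FALSE
)
```

Keeping a sentence number next to every word is what allows collocates to be restricted to the sentence of their node: a candidate collocate can be discarded whenever its `sent_id` differs from that of the node.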

Value

A list of three elements:

A tibble of all words in the corpus including the sentence number;
A tibble of all retrieved collocates, including their span position and sentence number;
Regular expression object of the search pattern.

Examples

if (FALSE) {
# do the collocate search using "corpus_path" input-option
df <- colloc_default(corpus_path = orti_bali_path,
                     pattern = "^nuju$",
                     window = "b", # focusing on both left and right context window
                     span = 3) # retrieve 3 collocates to the left and right of the node
}