This function retrieves collocates for a word within a user-defined context window, based on raw/unannotated corpus texts. The function uses a vectorisation approach to determine the vector position of the collocates in relation to the vector position of the node word in the corpus word vector. The argument tokenise_corpus_to_sentence (cf. below) allows the user to first split the raw input corpus into a character vector in which each element corresponds to one sentence.
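The position-based idea can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's actual implementation: the toy `words` vector and the object names (`node_pos`, `offsets`, `colloc_pos`) are invented for the example.

```r
# Toy corpus word vector (hypothetical data, for illustration only)
words <- c("the", "old", "dog", "chased", "the", "cat", "up", "a", "tree")

span <- 2                               # collocates on each side of the node
node_pos <- which(words == "chased")    # vector position(s) of the node word

# Offsets relative to the node: -2, -1, 1, 2 for a two-sided window
offsets <- c(-span:-1, 1:span)

# Add every offset to every node position in one vectorised step
colloc_pos <- as.vector(outer(node_pos, offsets, `+`))

# Drop positions that fall outside the corpus word vector
keep <- colloc_pos >= 1 & colloc_pos <= length(words)

collocates <- data.frame(
  w = words[colloc_pos[keep]],
  position = rep(offsets, each = length(node_pos))[keep]
)
collocates
```

Because the offsets are added to all node positions at once, no explicit loop over the matches is needed, which is the point of working with vector positions.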

colloc_default(
  corpus_path = NULL,
  corpus_list = NULL,
  pattern = NULL,
  window = "b",
  span = 3,
  word_split_regex = "([^a-zA-Z-]+|--)",
  case_insensitive = TRUE,
  to_lower_colloc = TRUE,
  tokenise_corpus_to_sentence = TRUE
)

Arguments

corpus_path

character vector of (full) file paths to the corpus text files in plain-text .txt format.

corpus_list

a named list object containing elements constituting a corpus text. The name of each element should correspond to the corpus file. There can be more than one element (hence more than one corpus text) within this list object.

pattern

regular expression or exact string for the target (node) pattern.

window

direction of the context window for the collocates: "r" ('right of the node'), "l" ('left of the node'), or "b" ('both left and right of the node'; the default).

span

integer vector indicating the span of the collocate scope, i.e. the number of word positions retrieved in each direction selected by window (3 by default).
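How window and span combine into relative positions can be sketched as below. The helper `window_offsets()` is hypothetical and written only for this illustration; it is not part of the package.

```r
# Hypothetical helper: map `window` and `span` to positions relative to the node
window_offsets <- function(window = "b", span = 3) {
  left  <- -rev(seq_len(span))   # e.g. -3, -2, -1
  right <- seq_len(span)         # e.g.  1,  2,  3
  switch(window,
         l = left,
         r = right,
         b = c(left, right),
         stop("`window` must be \"l\", \"r\", or \"b\""))
}

window_offsets("l", 3)   # left context only
window_offsets("b", 2)   # both sides of the node
```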

word_split_regex

user-defined regular expression used to tokenise the corpus. The default splits at non-alphabetic characters but retains the hyphen "-" in order to preserve reduplicated forms, for instance. The regex for this default setting is "([^a-zA-Z-]+|--)". An alternative splitting regex may additionally retain various characters with diacritics (e.g., '([^a-zA-Z\u00c0-\u00d6\u00d9-\u00f6\u00f9-\u00ff\u0100-\u017e\u1e00-\u1eff]+|--)').
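The effect of the default splitting regex can be checked on a single string. The sketch below uses base R strsplit as a stand-in for the package's own tokenisation step, and the example sentence is invented for the illustration.

```r
word_split_regex <- "([^a-zA-Z-]+|--)"
sentence <- "Anak-anak pergi -- ke pasar, lalu pulang."

# Split at non-alphabetic characters and at double hyphens
tokens <- strsplit(sentence, word_split_regex, perl = TRUE)[[1]]

# Adjacent delimiters (" ", "--", " ") leave empty strings behind; drop them
tokens <- tokens[nzchar(tokens)]
tokens
```

The reduplicated form "Anak-anak" survives as one token because the single hyphen is excluded from the splitting character class, while the free-standing "--" is treated as a delimiter.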

case_insensitive

whether the search pattern ignores case (TRUE -- the default) or not (FALSE).

to_lower_colloc

whether to lowercase the retrieved collocates (TRUE -- default) or not (FALSE).
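The two case-related arguments act at different stages: one affects how the node is matched, the other how the retrieved collocates are reported. A rough base R analogy, with grepl and tolower standing in for whatever the function uses internally (an assumption), and with invented toy words:

```r
words <- c("Run", "run", "rerun")
pattern <- "^run$"

# Analogous to case_insensitive = TRUE: matching ignores case
match_ci <- grepl(pattern, words, ignore.case = TRUE)

# Analogous to case_insensitive = FALSE: only the exact-case form matches
match_cs <- grepl(pattern, words)

# Analogous to to_lower_colloc = TRUE: output forms are lowercased
lowered <- tolower(words)
```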

tokenise_corpus_to_sentence

whether to tokenise the input corpus into sentences so that the retrieved collocates do not cross sentence boundaries. The default is TRUE; the corpus is first split into sentences with stri_split_boundaries and then further tokenised into words with str_split.
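Why sentence tokenisation matters can be seen in a small base R sketch. It uses a deliberately naive regex splitter as a simplified stand-in for stri_split_boundaries, and the helper names (`tokenise`, `collocates_of`) and the toy text are assumptions made for this illustration only.

```r
text <- "The dog barked. The cat ran away."
span <- 2

# Naive sentence splitter: break after ., ! or ? followed by whitespace
sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# Word-tokenise with the default splitting regex, dropping empty strings
tokenise <- function(x) {
  w <- strsplit(x, "([^a-zA-Z-]+|--)", perl = TRUE)[[1]]
  w[nzchar(w)]
}

# Collect the collocates of `node` within a single word vector
collocates_of <- function(words, node, span) {
  node_pos <- which(words == node)
  pos <- as.vector(outer(node_pos, c(-span:-1, 1:span), `+`))
  words[pos[pos >= 1 & pos <= length(words)]]
}

# Whole text as one word vector: the window reaches into the previous sentence
across_sentences <- collocates_of(tokenise(text), "cat", span)

# Sentence by sentence: the window stops at the sentence boundary
within_sentence <- unlist(lapply(
  sentences,
  function(s) collocates_of(tokenise(s), "cat", span)
))
```

In the first case "barked" is returned as a collocate of "cat" although it belongs to the preceding sentence; in the second case it is not.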

Value

A list of three elements:

  • A tibble of all words in the corpus including the sentence number;

  • A tibble of all retrieved collocates, including their span position and sentence number;

  • Regular expression object of the search pattern.

Examples

if (FALSE) {
# do the collocate search using "corpus_path" input-option
df <- colloc_default(corpus_path = orti_bali_path,
                     pattern = "^nuju$",
                     window = "b", # focusing on both left and right context window
                     span = 3) # retrieve 3 collocates to the left and right of the node
}