This function retrieves collocates for a word within a user-defined context window, based on raw (unannotated) corpus texts.
The function uses a vectorised approach: it determines the vector positions of the collocates relative to the vector position of the node word in the corpus word vector.
The tokenise_corpus_to_sentence argument (cf. below) lets the user first split the raw input corpus into a character vector whose elements each correspond to one sentence.
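The position-based idea described above can be illustrated with a minimal base R sketch. It is not the package's internal code: the toy word vector, the `sent_id` vector, and all variable names are assumptions made for illustration only.

```r
# Hypothetical toy corpus: one word per element, plus the sentence each word belongs to
words   <- c("the", "cat", "sat", "down", "a", "cat", "slept")
sent_id <- c(1, 1, 1, 1, 2, 2, 2)
span    <- 2

# Vector positions of the node word
node_pos <- which(words == "cat")

# Offsets for a two-sided window: -2, -1, 1, 2
offsets <- c(-span:-1, 1:span)

# Pair every node position with every offset and compute the candidate positions
cand <- expand.grid(node = node_pos, offset = offsets)
cand$pos <- cand$node + cand$offset

# Drop positions that fall outside the corpus vector
cand <- cand[cand$pos >= 1 & cand$pos <= length(words), ]

# Drop candidates that sit in a different sentence from their node
cand <- cand[sent_id[cand$pos] == sent_id[cand$node], ]

collocates <- data.frame(
  w       = words[cand$pos],
  span    = cand$offset,
  sent_id = sent_id[cand$pos],
  stringsAsFactors = FALSE
)
print(collocates)
```

In this sketch the word "down" is kept as a right-hand collocate of the first "cat" but not as a left-hand collocate of the second one, because the two sit in different sentences.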
    colloc_default(
      corpus_path = NULL,
      corpus_list = NULL,
      pattern = NULL,
      window = "b",
      span = 3,
      word_split_regex = "([^a-zA-Z-]+|--)",
      case_insensitive = TRUE,
      to_lower_colloc = TRUE,
      tokenise_corpus_to_sentence = TRUE
    )
Argument | Description |
---|---|
corpus_path | character vector of (full) file paths to the corpus text files. |
corpus_list | a named list whose elements each hold one corpus text. The name of each element should correspond to the corpus file. The list can contain more than one element (hence more than one corpus text). |
pattern | regular expression or exact pattern for the target (node) word. |
window | direction of the collocate window relative to the node. The default "b" covers both the left and the right context. |
span | integer indicating the span of the collocate scope, i.e. how many words away from the node to look (default 3). |
word_split_regex | user-defined regular expression used to tokenise the corpus. The default splits at non-alphabetic characters but retains the hyphen "-", for instance to keep reduplicated forms intact. The default regex is the word_split_regex value shown in the usage above. |
case_insensitive | whether the search pattern ignores case (TRUE, the default) or not (FALSE). |
to_lower_colloc | whether to lowercase the retrieved collocates (TRUE, the default) or not (FALSE). |
tokenise_corpus_to_sentence | whether to tokenise the input corpus by sentence so that retrieved collocates do not cross sentence boundaries. The default is TRUE. |
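To see what the default word_split_regex does to a line of text, the following base R sketch may help. It assumes `strsplit()` as a stand-in tokeniser (the function's internal tokeniser may differ), and the example sentence and variable names are hypothetical.

```r
# Default splitting regex, copied from the usage signature
word_split_regex <- "([^a-zA-Z-]+|--)"

# Hypothetical input sentence with a reduplicated form and a double hyphen
sentence <- "Anak-anak itu pergi ke pasar -- lalu pulang."

# Split at runs of non-alphabetic, non-hyphen characters, or at "--"
tokens <- strsplit(sentence, word_split_regex)[[1]]
tokens <- tokens[nzchar(tokens)]  # drop empty strings left between adjacent separators

# Case-insensitive search for the node, analogous to case_insensitive = TRUE
is_node <- grepl("^anak-anak$", tokens, ignore.case = TRUE)

# Lowercase the tokens, analogous to to_lower_colloc = TRUE
tokens_lower <- tolower(tokens)

print(tokens)
print(tokens[is_node])
```

The reduplicated form "Anak-anak" survives as a single token because the single hyphen is excluded from the split characters, while the double hyphen acts as a separator.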
A list of three elements:
A tibble of all words in the corpus including the sentence number;
A tibble of all retrieved collocates, including their span position and sentence number;
Regular expression object of the search pattern.
    if (FALSE) {
      # do the collocate search using the "corpus_path" input option
      df <- colloc_default(corpus_path = orti_bali_path,
                           pattern = "^nuju$",
                           window = "b", # focusing on both left and right context window
                           span = 3)     # retrieve 3 collocates to the left and right of the node
    }