R/corplingr_colloc_sentence.R
colloc_sentence.Rd
Performs a collocate search for a given word/pattern in corpus text files that contain one sentence per line (e.g., the Leipzig Corpora); cf. Details below.
If each line of the corpus does not correspond to a sentence, use colloc_default instead.
colloc_sentence(
  corpus_path = "(full) filepath to sentence-based corpus",
  leipzig_input = TRUE,
  pattern = "regular expressions",
  window = c("r", "l", "b"),
  span = 3,
  case_insensitive = TRUE,
  to_lower_colloc = TRUE,
  save_interim_results = FALSE,
  coll_output_name = "colloc_tibble_out.txt"
)
Argument | Description |
---|---|
corpus_path | character strings of (full) filepath for the corpus text files in |
leipzig_input | logical; to check if the input corpus is specifically the Leipzig corpus files ( |
pattern | regular expressions/exact patterns for the target pattern. |
window | window-span direction of the collocates: |
span | integer vector indicating the span of the collocate scope. |
case_insensitive | whether the search ignores case (TRUE -- the default) or not (FALSE). |
to_lower_colloc | whether to lowercase the retrieved collocates and the nodes (TRUE -- default) or not (FALSE). |
save_interim_results | whether to output the interim results (per corpus file) into a tab-separated plain text (TRUE) or not (FALSE -- default). |
coll_output_name | name of the file for the collocate tables. |
A tbl_df of raw collocates.
This function, which is largely built on top of the tidyverse, is specifically designed for collocate searches that do not cross the boundary of the sentence in which the search word/pattern occurs.
The reason is that the sentences in a corpus can be randomised and totally unrelated to one another (as in the Leipzig Corpora).
Thus, it is important to keep the collocates of the search word/pattern within the boundary of the sentence in which the word/pattern occurs. That way, the function aims to maintain the cohesiveness of meaning of the word.
Moreover, the function outputs only the raw collocate data; it does not tabulate the frequency of the collocates or compute association measures between the collocates and the search word/pattern. Future iterations of this package aim to accommodate these features.
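The sentence-bounded window described above can be illustrated with a minimal base-R sketch. This is not the package's implementation: `sentence_collocates()` is a hypothetical helper, the tokenisation is deliberately naive (splitting on non-alphanumeric characters), and the output column names other than `w` are assumptions made for the illustration. It only shows how discarding window positions that fall outside the sentence keeps collocates from leaking in from a neighbouring line.

```r
# Hypothetical helper (not part of corplingr): sentence-bounded collocate window.
sentence_collocates <- function(sentences, pattern, window = "b", span = 3,
                                case_insensitive = TRUE) {
  rows <- list()
  for (i in seq_along(sentences)) {
    # Naive tokenisation: split on runs of non-alphanumeric characters.
    words <- strsplit(sentences[i], "[^[:alnum:]]+")[[1]]
    words <- words[nzchar(words)]
    # Positions of the node word(s) within this sentence only.
    hits <- which(grepl(pattern, words, ignore.case = case_insensitive, perl = TRUE))
    for (h in hits) {
      left  <- if (window %in% c("l", "b")) h - rev(seq_len(span)) else integer(0)
      right <- if (window %in% c("r", "b")) h + seq_len(span) else integer(0)
      pos <- c(left, right)
      # Dropping positions outside 1..length(words) keeps the window
      # inside the sentence in which the node occurs.
      pos <- pos[pos >= 1 & pos <= length(words)]
      if (length(pos) > 0) {
        rows[[length(rows) + 1]] <- data.frame(
          sent_id = i,          # line (sentence) index in the input
          node = words[h],      # the matched search word
          w = words[pos],       # the collocate
          w_pos = pos - h,      # position relative to the node
          stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(rows) == 0) {
    return(data.frame(sent_id = integer(0), node = character(0),
                      w = character(0), w_pos = integer(0),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}

# The node "cat" ends the first sentence, so the right-hand window is empty:
# nothing from the unrelated second sentence is returned.
toy_sentences <- c("The dog chased the cat.", "Rain fell all night.")
res <- sentence_collocates(toy_sentences, "^cat$", window = "b", span = 2)
res
```

In this toy input the only collocates returned are the two words to the left of the node ("chased" and "the"); a window computed over the concatenated text would instead also pick up "Rain" and "fell".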
if (FALSE) {
# get the path of the Leipzig corpora
leipzig_corpus_path <- c("/my/path/to/leipzig_corpus_1M_1.txt",
                         "/my/path/to/leipzig_corpus_300K_2.txt",
                         "/my/path/to/leipzig_corpus_300K_3.txt")

# retrieve collocate list
df <- colloc_sentence(corpus_path = leipzig_corpus_path[2:3],
                      leipzig_input = TRUE,
                      pattern = "\\bterkalahkan\\b",
                      window = "l",
                      span = 1,
                      case_insensitive = TRUE,
                      to_lower_colloc = TRUE,
                      save_interim_results = FALSE)

# see the output
df

# count the frequency of the collocates
df %>%
  dplyr::count(w, sort = TRUE)
}