Generate window-span collocates for the Leipzig Corpora

The function retrieves collocates from Leipzig Corpora files and outputs them in tibble form.

colloc_leipzig(
  leipzig_path = NULL,
  leipzig_corpus_list = NULL,
  pattern = NULL,
  case_insensitive = TRUE,
  window = "b",
  span = 2,
  split_corpus_pattern = "([^a-zA-Z-¬]+|--)",
  to_lower_colloc = TRUE,
  save_interim = FALSE,
  freqlist_output_file = "collogetr_out_1_freqlist.txt",
  colloc_output_file = "collogetr_out_2_collocates.txt",
  corpussize_output_file = "collogetr_out_3_corpus_size.txt",
  search_pattern_output_file = "collogetr_out_4_search_pattern.txt"
)

Arguments

leipzig_path	Character vector of (i) the file names of the Leipzig corpus files, if they are in the working directory, or (ii) the complete file path to each Leipzig corpus file.
leipzig_corpus_list	Specify this argument if each Leipzig corpus file has already been loaded as an R object and is an element of a named list. An example of this type of input is `data("demo_corpus_leipzig")`. Specify either `leipzig_path` or `leipzig_corpus_list`, and leave the other as `NULL`.
pattern	Character vector containing a set of exact word forms to search for.
case_insensitive	Logical; whether the search for the `pattern` ignores case (`TRUE` -- default) or not (`FALSE`).
window	Character; window-span direction of the collocates: `"r"` ('right of the node'), `"l"` ('left of the node'), or `"b"` ('both left and right context-window'; the default).
span	A numeric vector indicating the span of the collocate scope. The default is `2` words around the node word.
split_corpus_pattern	Regular expression used to tokenise the corpus into a vector of words. The default regex is `"([^a-zA-Z-\u00AC]+|--)"`. The character `"\u00AC"` is the Unicode escape for `"¬"`, which may occur in the Leipzig Corpora as a separator between the root and the suffixes of a word, in addition to the hyphen. Tokenising the corpus into a word vector supports the vectorised method the function uses to gather the collocates of the search pattern.
to_lower_colloc	Logical; whether to lowercase the retrieved collocates and the nodes (`TRUE` -- default) or not (`FALSE`).
save_interim	Logical; whether to save interim results into plain text files or not (`FALSE` -- default).
freqlist_output_file	Character string naming the output file for the word frequency list of a corpus.
colloc_output_file	Character string naming the output file for the raw collocate table.
corpussize_output_file	Character string naming the output file for the total size (in word tokens) of a corpus.
search_pattern_output_file	Character string naming the output file for the search pattern.
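
To make the interplay of `split_corpus_pattern`, `window`, and `span` concrete, here is a minimal base-R sketch. It is not the package's internal code: the example sentence and the helper `get_window_collocates()` are hypothetical, and the sketch assumes a PCRE-style reading of the default regex (hence `perl = TRUE`). Dropping one-character tokens and lowercasing mirror steps 1.1 and 1.2 in the console messages shown under Examples.

```r
# Illustrative sketch only; NOT the implementation used by colloc_leipzig().
split_corpus_pattern <- "([^a-zA-Z-\u00AC]+|--)"

# Hypothetical one-sentence "corpus"
sentence <- "Presiden mengatakan bahwa harga beras akan turun -- kata menteri."

# Split on the pattern, drop empty and one-character tokens, then lowercase
tokens <- unlist(strsplit(sentence, split_corpus_pattern, perl = TRUE))
tokens <- tolower(tokens[nchar(tokens) > 1])

# Hypothetical helper: collect the tokens found within `span` positions
# to the left ("l"), right ("r"), or both sides ("b") of each node match
get_window_collocates <- function(tokens, node, window = "b", span = 2) {
  node_pos <- which(tokens == node)
  offsets <- switch(window,
                    l = -span:-1,
                    r = 1:span,
                    b = c(-span:-1, 1:span))
  # All node-position/offset combinations, kept only if inside the vector
  pos <- outer(node_pos, offsets, "+")
  pos <- pos[pos >= 1 & pos <= length(tokens)]
  tokens[pos]
}

get_window_collocates(tokens, "mengatakan", window = "r", span = 3)
# "bahwa" "harga" "beras"

get_window_collocates(tokens, "mengatakan", window = "b", span = 2)
# "presiden" "bahwa" "harga"
```

Because the tokens sit in one vector, the collocate positions can be computed for every node match at once with `outer()`, which is the kind of vectorised lookup that tokenising into a word vector makes possible.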

Value

A list containing the raw collocate items, the frequency list of all words in the loaded corpus files, the total number of word tokens in each loaded corpus, and the search pattern.

Examples

collout <- colloc_leipzig(leipzig_corpus_list = demo_corpus_leipzig,
                     pattern = "mengatakan",
                     window = "r",
                     span = 3,
                     save_interim = FALSE)
#> Detecting a 'named list' input!
#> You chose NOT to SAVE INTERIM RESULTS, which will be stored as a list in console!
#> 1. Tokenising the "ind_mixed_2012_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_mixed_2012_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_news_2008_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_news_2008_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_news_2009_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_news_2009_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_news_2010_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_news_2010_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_news_2011_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_news_2011_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_news_2012_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_news_2012_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_newscrawl_2011_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_newscrawl_2011_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_newscrawl_2012_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_newscrawl_2012_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_newscrawl_2015_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_newscrawl_2015_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_newscrawl_2016_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_newscrawl_2016_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_web_2011_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_web_2011_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_web_2012_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_web_2012_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_wikipedia_2016_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind_wikipedia_2016_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind-id_web_2013_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind-id_web_2013_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind-id_web_2015_3M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'mengatakan' in ind-id_web_2015_3M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 3. Storing all of the outputs...
#> 
#> DONE!
# collout <- colloc_leipzig(leipzig_path = c('path_to_corpus1.txt',
#                                            'path_to_corpus2.txt'),
#                           pattern = "mengatakan",
#                           window = "r",
#                           span = 3,
#                           save_interim = TRUE # save interim output files
#                           # you need to specify the paths in the
#                           # arguments ending in `_output_file`
#                           )
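
The named-list input accepted by `leipzig_corpus_list` could be prepared along the following lines. This is a sketch under stated assumptions: the file names and sentences are made up, and it assumes that each list element is a character vector of corpus lines as returned by `readLines()`. Inspect `demo_corpus_leipzig` (for example with `str()`) to confirm the structure the function actually expects.

```r
# Illustrative sketch: build a named list of corpora from plain-text files.
# File names and contents are hypothetical stand-ins for Leipzig corpus files.
tmp_dir <- tempdir()
corpus_files <- file.path(tmp_dir, c("corpus_a.txt", "corpus_b.txt"))
writeLines(c("1\tDia mengatakan hal itu kemarin.",
             "2\tMereka setuju."),
           corpus_files[1])
writeLines("1\tMenteri mengatakan harga akan turun.", corpus_files[2])

# Read each file into a character vector (one element per line)
corpus_list <- lapply(corpus_files, readLines, encoding = "UTF-8")

# Name each element after its file, without directory or extension
names(corpus_list) <- tools::file_path_sans_ext(basename(corpus_files))

names(corpus_list)
# "corpus_a" "corpus_b"

# The list could then be passed on (requires the package to be loaded):
# collout <- colloc_leipzig(leipzig_corpus_list = corpus_list,
#                           pattern = "mengatakan")
```

Naming the elements matters because the console messages identify each corpus by its name in the list.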