This function retrieves the collocates of a search pattern from Leipzig corpus files and returns them as tibbles.

colloc_leipzig(
  leipzig_path = NULL,
  leipzig_corpus_list = NULL,
  pattern = NULL,
  case_insensitive = TRUE,
  window = "b",
  span = 2,
  split_corpus_pattern = "([^a-zA-Z-¬]+|--)",
  to_lower_colloc = TRUE,
  save_interim = FALSE,
  freqlist_output_file = "collogetr_out_1_freqlist.txt",
  colloc_output_file = "collogetr_out_2_collocates.txt",
  corpussize_output_file = "collogetr_out_3_corpus_size.txt",
  search_pattern_output_file = "collogetr_out_4_search_pattern.txt"
)

Arguments

leipzig_path

Character vector of (i) the file names of the Leipzig corpus files if they are in the working directory, or (ii) the complete file path to each Leipzig corpus file.

leipzig_corpus_list

Specify this argument if each Leipzig corpus file has been loaded as an R object and stored as an element of a named list. An example of this type of input can be seen in data("demo_corpus_leipzig"). Specify either leipzig_path OR leipzig_corpus_list, and set the other to NULL.
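
For illustration, here is a minimal sketch of the named-list input. The corpus names and sentences below are made up; see data("demo_corpus_leipzig") for the actual structure, in which each corpus is stored as a character vector of sentence lines:

my_corpus_list <- list(
  corpus_a = c("1\tDia mengatakan bahwa ia akan datang.",
               "2\tKami menunggu kabar berikutnya."),
  corpus_b = c("1\tMereka mengatakan hal yang sama.")
)
# collout <- colloc_leipzig(leipzig_corpus_list = my_corpus_list,
#                           pattern = "mengatakan")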

pattern

Character vector containing the set of exact word forms to search for.

case_insensitive

Logical; whether the search for the pattern ignores case (TRUE -- default) or not (FALSE).
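
As a sketch of what case-insensitive matching of exact word forms amounts to (base R only, not the package internals):

grepl("^mengatakan$", c("Mengatakan", "mengatakan", "dikatakan"), ignore.case = TRUE)
#> [1]  TRUE  TRUE FALSE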

window

Character; direction of the context window for the collocates: "r" (right of the node), "l" (left of the node), or "b" (both left and right of the node; the default).

span

A numeric vector indicating the span of the collocate scope, namely how many words on the selected side(s) of the node are retrieved as collocates. The default is 2.
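
The following base-R sketch (not the package internals) illustrates how window and span interact; with window = "r" and span = 2, the collocates are the two words to the right of the node:

toks <- c("dia", "mengatakan", "bahwa", "hal", "itu")
node <- which(toks == "mengatakan")
toks[(node + 1):(node + 2)]
#> [1] "bahwa" "hal"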

split_corpus_pattern

Regular expression used to tokenise the corpus into a word vector. The default regex is "([^a-zA-Z-\u00AC]+|--)". The character "\u00AC" is the hexadecimal escape for "¬", which may occur in the Leipzig Corpora as a separator between the root and the suffixes of a word, in addition to the hyphen. This tokenisation supports the vectorised method the function uses to generate the collocates of the search pattern.
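
A quick base-R check of how the default regex tokenises a (made-up) sentence; strsplit here merely stands in for the package's internal tokeniser:

x <- "Ini kata-kata, dan meng¬atakan--nya."
strsplit(x, "([^a-zA-Z-\u00AC]+|--)", perl = TRUE)[[1]]
#> [1] "Ini"         "kata-kata"   "dan"         "meng¬atakan" "nya"

Note that the single hyphen and "¬" are kept inside tokens, while double hyphens, punctuation, and whitespace act as separators.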

to_lower_colloc

Logical; whether to lowercase the retrieved collocates and the nodes (TRUE -- default) or not (FALSE).

save_interim

Logical; whether to save interim results into plain text files or not (FALSE -- default).

freqlist_output_file

Character string naming the output file for the word-frequency list of a corpus.

colloc_output_file

Character string naming the output file for the raw collocate table.

corpussize_output_file

Character string naming the output file for the total word size (token count) of a corpus.

search_pattern_output_file

Character string naming the output file for the search pattern.

Value

A list containing (i) the raw collocate items, (ii) the frequency list of all words in the loaded corpus files, (iii) the total word tokens in each loaded corpus, and (iv) the search pattern.
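
Because the exact element names are easiest to check on your own output, a quick way to inspect the returned list (using collout from the example below):

names(collout)               # element names of the output list
str(collout, max.level = 1)  # one-level overview of each element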

Examples

collout <- colloc_leipzig(leipzig_corpus_list = demo_corpus_leipzig,
                          pattern = "mengatakan",
                          window = "r",
                          span = 3,
                          save_interim = FALSE)
#> Detecting a 'named list' input!
#> You chose NOT to SAVE INTERIM RESULTS, which will be stored as a list in console!
#> 1. Tokenising the "ind_mixed_2012_1M" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_mixed_2012_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_news_2008_300K" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_news_2008_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_news_2009_300K" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_news_2009_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_news_2010_300K" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_news_2010_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_news_2011_300K" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_news_2011_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_news_2012_300K" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_news_2012_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_newscrawl_2011_1M" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_newscrawl_2011_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_newscrawl_2012_1M" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_newscrawl_2012_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_newscrawl_2015_300K" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_newscrawl_2015_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_newscrawl_2016_1M" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_newscrawl_2016_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_web_2011_300K" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_web_2011_300K.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_web_2012_1M" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_web_2012_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind_wikipedia_2016_1M" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind_wikipedia_2016_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind-id_web_2013_1M" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind-id_web_2013_1M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 1. Tokenising the "ind-id_web_2015_3M" corpus. This process may take a while!
#> 1.1 Removing one-character tokens...
#> 1.2 Lowercasing the tokenised corpus...
#> At least a match is detected for 'mengatakan' in ind-id_web_2015_3M.
#> 2.1 Gathering the collocates for 'mengatakan' ...
#> 3. Storing all of the outputs...
#> DONE!
# Reading corpus files from disk and saving interim results:
# collout <- colloc_leipzig(leipzig_path = c('path_to_corpus1.txt',
#                                            'path_to_corpus2.txt'),
#                           pattern = "mengatakan",
#                           window = "r",
#                           span = 3,
#                           save_interim = TRUE) # save interim output files
# # You need to specify the output file paths
# # via the ...output_file arguments.