Split a corpus by sentence-boundary — tokenise

An internal function of the collocational framework that splits the input corpus into a character vector of sentences using stri_split_boundaries from the stringi package. Each sentence is padded, at the beginning and at the end, with the "ZSENTENCEZ" marker, repeated as many times as the collocational window span requires. The marker helps identify whether the collocates of a word cross the boundary of the sentence in which the word occurs. The function automatically detects and removes "ZSENTENCEZ" when it is part of the identified collocates.

tokenise_sentence(strings = NULL, to_lower = TRUE, window_span = NULL)

Arguments

strings	character vector of a corpus text.
to_lower	logical; turn the corpus into lowercase when `TRUE` (the default).
window_span	integer; supplied from the value of the `span` argument in the higher-level collocational function call. It determines how many times the `"ZSENTENCEZ"` marker is added at the beginning and at the end of each sentence.

Value

A character vector with one element per sentence in the input corpus, as identified by stri_split_boundaries. Each element is padded on both sides with the "ZSENTENCEZ" marker.
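
To make the returned value more concrete, here is a minimal sketch of how the splitting and padding could be done. It is not the package source: `pad_sentences` is a hypothetical name, and the trimming and empty-string handling are assumptions. Only `stri_split_boundaries`, `stri_trim_both`, and `stri_trans_tolower` from stringi are used.

```r
library(stringi)

# Hypothetical helper for illustration only; not the package's implementation.
pad_sentences <- function(strings, to_lower = TRUE, window_span = 1L) {
  # Split every corpus element at sentence boundaries and flatten the list.
  sents <- unlist(stri_split_boundaries(strings, type = "sentence"))
  # Assumption: trim whitespace left by the split and drop empty pieces.
  sents <- stri_trim_both(sents)
  sents <- sents[nzchar(sents)]
  if (to_lower) sents <- stri_trans_tolower(sents)
  # Repeat the marker `window_span` times, separated by single spaces.
  pad <- paste(rep("ZSENTENCEZ", window_span), collapse = " ")
  paste(pad, sents, pad)
}

padded <- pad_sentences("It is one sentence. It is another sentence!",
                        window_span = 2)
padded
```

Padding with exactly `window_span` markers means that a window of that width centred on the first or last word of a sentence can reach only markers, never words from a neighbouring sentence.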

Examples

txt <- c("It is one sentence. It is another sentence! There are TWO sentences.",
         "The second sentence and the second element of 'txt' corpus.",
         "Can we add another one (sentence) here as the third element?",
         "Of course!")
sent <- tokenise_sentence(strings = txt,
                          to_lower = TRUE,
                          window_span = 3)
sent
#> [1] "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ it is one sentence. ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ"                                         
#> [2] "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ it is another sentence! ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ"                                     
#> [3] "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ there are two sentences. ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ"                                    
#> [4] "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ the second sentence and the second element of 'txt' corpus. ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ" 
#> [5] "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ can we add another one (sentence) here as the third element? ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ"
#> [6] "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ of course! ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ"
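
The role of the marker can be shown with a short base R sketch. It assumes the padded sentences are flattened into one token stream and that collocates are taken from a fixed window around a node word. `window_collocates` is a hypothetical helper, not a function exported by the package.

```r
# Hypothetical helper: collect the tokens within `span` positions of a node,
# then drop the sentence-boundary marker from the result.
window_collocates <- function(tokens, node_index, span) {
  idx <- setdiff(seq(node_index - span, node_index + span), node_index)
  # Keep only positions that exist in the token vector.
  idx <- idx[idx >= 1 & idx <= length(tokens)]
  found <- tokens[idx]
  found[found != "ZSENTENCEZ"]
}

padded <- c("ZSENTENCEZ ZSENTENCEZ it is short. ZSENTENCEZ ZSENTENCEZ",
            "ZSENTENCEZ ZSENTENCEZ of course! ZSENTENCEZ ZSENTENCEZ")
tokens <- unlist(strsplit(padded, " ", fixed = TRUE))

# "short." is the last word of the first sentence; with span = 2 the right-hand
# window contains only markers, so "of" and "course!" are never reached.
node <- which(tokens == "short.")
collocs <- window_collocates(tokens, node, span = 2L)
collocs
#> [1] "it" "is"
```

Without the padding, the same window would have picked up "of" and "course!" from the following sentence.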