An internal function of the collocational framework that splits the input corpus into a vector of sentences using stri_split_boundaries from the stringi package.
Each sentence is padded, at the beginning and at the end, with the "ZSENTENCEZ" marker, repeated as many times as the collocational window span requires.
The marker makes it possible to identify whether the collocates of a word cross the boundary of the sentence in which the word occurs.
The function automatically detects and removes "ZSENTENCEZ" when it is part of an identified collocate.
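The splitting and padding described above can be sketched as follows. This is only an illustration, not the package's actual implementation: pad_sentences and its marker argument are hypothetical names, and the sketch assumes window_span is at least 1.

```r
library(stringi)

# Hypothetical helper (not part of the package): split a corpus into
# sentences and pad each one with `window_span` markers on both sides.
pad_sentences <- function(strings, to_lower = TRUE, window_span = 3L,
                          marker = "ZSENTENCEZ") {
  # Sentence boundaries as detected by ICU via stringi
  sents <- unlist(stri_split_boundaries(strings, type = "sentence"))
  # Drop the trailing whitespace kept by the splitter, and any empty pieces
  sents <- stri_trim_both(sents)
  sents <- sents[nzchar(sents)]
  if (to_lower) sents <- stri_trans_tolower(sents)
  # One padding string made of `window_span` space-separated markers
  pad <- paste(rep(marker, window_span), collapse = " ")
  paste(pad, sents, pad)
}

pad_sentences("It is one sentence. It is another sentence!", window_span = 2L)
```

Because the padding is as long as the window span, a window centred on any word of the sentence always stays inside the padded string.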
tokenise_sentence(strings = NULL, to_lower = TRUE, window_span = NULL)
Argument | Description |
---|---|
strings | character vector of a corpus text. |
to_lower | logical; turn the corpus into lowercase when |
window_span | integer; it is supplied from the value of the |
A character vector with as many sentences as there are in the input corpus, as identified by stri_split_boundaries.
txt <- c("It is one sentence. It is another sentence! There are TWO sentences.",
         "The second sentence and the second element of 'txt' corpus.",
         "Can we add another one (sentence) here as the third element?",
         "Of course!")
sent <- tokenise_sentence(strings = txt, to_lower = TRUE, window_span = 3)
sent
#> [1] "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ it is one sentence. ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ"
#> [2] "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ it is another sentence! ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ"
#> [3] "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ there are two sentences. ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ"
#> [4] "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ the second sentence and the second element of 'txt' corpus. ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ"
#> [5] "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ can we add another one (sentence) here as the third element? ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ"
#> [6] "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ of course! ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ"
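To illustrate how the padded sentences might be used downstream, the sketch below collects the words within a window around a node word and discards any "ZSENTENCEZ" markers that fall inside the window. The function window_collocates is a hypothetical name, not part of the package, and it assumes simple whitespace tokenisation, so punctuation stays attached to the words.

```r
# Hypothetical helper (not part of the package): collect the collocates
# within `span` tokens of `node` in one padded sentence, dropping markers.
window_collocates <- function(padded, node, span = 3L,
                              marker = "ZSENTENCEZ") {
  # Simplifying assumption: tokens are separated by single spaces
  tokens <- strsplit(padded, " ", fixed = TRUE)[[1]]
  hits <- which(tokens == node)
  out <- lapply(hits, function(i) {
    # Positions to the left and right of the node, excluding the node itself
    idx <- setdiff(seq(i - span, i + span), i)
    # Defensive bounds check; padding normally makes it unnecessary
    idx <- idx[idx >= 1 & idx <= length(tokens)]
    w <- tokens[idx]
    # Markers signal positions beyond the sentence boundary; remove them
    w[w != marker]
  })
  unlist(out)
}

padded <- "ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ of course! ZSENTENCEZ ZSENTENCEZ ZSENTENCEZ"
window_collocates(padded, node = "of", span = 3L)
#> [1] "course!"
```

Five of the six window slots around "of" are occupied by markers, so only "course!" is kept as a collocate; nothing from a neighbouring sentence can leak into the window.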