Detect word pairs in text

word_pairs searches sentences for occurrences of a pair of words. The two words may be separated by intervening words; the allowed number of intervening words is set with `min_intervening` and `max_intervening`.

word_pairs(corpus, word_1 = NULL, word_2 = NULL,
  min_intervening = 0L, max_intervening = 3L)

Arguments

corpus	A character vector of sentences.
word_1	A regular expression for the first word. The regex must enclose the word in word-boundary anchors (i.e. `"\\b"`).
word_2	A regular expression for the second word. The regex must enclose the word in word-boundary anchors (i.e. `"\\b"`).
min_intervening	Minimum number of intervening words between `word_1` and `word_2`. The default is `0L`.
max_intervening	Maximum number of intervening words between `word_1` and `word_2`. The default is `3L`. Use `Inf` to allow any number of intervening words after `word_1` and before the occurrence of `word_2`.
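
To illustrate how the four arguments interact, the following is a minimal base-R sketch of one way the word regexes and the intervening-word range could be combined into a single pattern. It is not the package's implementation: `build_pair_regex()` is a hypothetical helper, and the assumption that an intervening word is any whitespace-separated token is made only for this example.

```r
# Hypothetical helper (not part of the package): combine the two word regexes
# with a gap that allows min_intervening..max_intervening tokens in between.
build_pair_regex <- function(word_1, word_2,
                             min_intervening = 0L, max_intervening = 3L) {
  # An open upper bound ("{n,}") stands in for max_intervening = Inf.
  upper <- if (is.infinite(max_intervening)) "" else as.integer(max_intervening)
  # Assumption: an intervening word is a run of non-whitespace characters.
  gap <- paste0("(?:\\s+\\S+){", as.integer(min_intervening), ",", upper, "}")
  paste0(word_1, gap, "\\s+", word_2)
}

rgx <- build_pair_regex("\\bmen[a-z]{3,}kan\\b", "\\bkepada\\b")

# Toy sentences made up for this sketch.
sentences <- c("Dia menyampaikan secara langsung kepada saya.",
               "Mereka menawarkan jasa kepada warga.",
               "Ia pergi ke pasar.")

# Keep only the matched spans; sentences without a match are dropped.
hits <- regmatches(sentences, regexpr(rgx, sentences, perl = TRUE))
hits
```

With the defaults, the gap allows zero to three tokens, so the third sentence, which contains neither word, yields no match.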

Value

A list object with the following elements:

pattern: the extracted pattern spanning from the first word to the second word.
pattern_tagged: a version of pattern in which the first and the second word are tagged.
matches: the matched sentences, with the first and the second word of each pair tagged.

References

Rajeg, Gede Primahadi Wijaya. (2018). wordpairs: An R package to retrieve word pair in sentences of the (Indonesian) Leipzig Corpora.

Examples

# co-occurrence of *me-X-kan* transitive verbs with *kepada*
word_1 <- "\\bmen[a-z]{3,}kan\\b"
word_2 <- "\\bkepada\\b"
corpus <- my_leipzig_sample
m <- word_pairs(corpus,
                word_1 = word_1,
                word_2 = word_2,
                min_intervening = 0L,
                max_intervening = 3L)

# inspect a snippet of the results
head(m$pattern)
#> [1] "menyampaikan secara langsung kepada"            
#> [2] "mengutamakan pemberian kredit kepada"           
#> [3] "menuangkan pikiran dan menyampaikan kepada"     
#> [4] "mengembalikan ID card kepada"                   
#> [5] "mengeluhkan diskriminasi pemberian nilai kepada"
#> [6] "menawarkan jasa kepada"                         
head(m$pattern_tagged)
#> [1] "<w id='1'>menyampaikan</w> secara langsung <w id='2'>kepada</w>"            
#> [2] "<w id='1'>mengutamakan</w> pemberian kredit <w id='2'>kepada</w>"           
#> [3] "<w id='1'>menuangkan</w> pikiran dan menyampaikan <w id='2'>kepada</w>"     
#> [4] "<w id='1'>mengembalikan</w> ID card <w id='2'>kepada</w>"                   
#> [5] "<w id='1'>mengeluhkan</w> diskriminasi pemberian nilai <w id='2'>kepada</w>"
#> [6] "<w id='1'>menawarkan</w> jasa <w id='2'>kepada</w>"                         

# generate frequency table for the patterns
freq_tb <- table(m$pattern_tagged)

# sort in decreasing order of frequency
head(sort(freq_tb, decreasing = TRUE))
#> 
#>               <w id='1'>mengatakan</w> <w id='2'>kepada</w> 
#>                                                          56 
#>             <w id='1'>mengingatkan</w> <w id='2'>kepada</w> 
#>                                                           6 
#> <w id='1'>mengucapkan</w> terima kasih <w id='2'>kepada</w> 
#>                                                           5 
#>             <w id='1'>menyampaikan</w> <w id='2'>kepada</w> 
#>                                                           5 
#>             <w id='1'>mengemukakan</w> <w id='2'>kepada</w> 
#>                                                           4 
#>              <w id='1'>mengusulkan</w> <w id='2'>kepada</w> 
#>                                                           4
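
The tagged patterns can be post-processed with base R. The sketch below assumes the tag format shown in the output above (`<w id='1'>…</w>` and `<w id='2'>…</w>`) and uses a few strings copied from that output in place of `m$pattern_tagged`. The object names `tagged`, `verbs`, `between`, and `n_between` are illustrative only.

```r
# A few tagged patterns copied from the output above.
tagged <- c(
  "<w id='1'>menyampaikan</w> secara langsung <w id='2'>kepada</w>",
  "<w id='1'>mengembalikan</w> ID card <w id='2'>kepada</w>",
  "<w id='1'>menawarkan</w> jasa <w id='2'>kepada</w>",
  "<w id='1'>mengatakan</w> <w id='2'>kepada</w>"
)

# Extract the first word of each pair (the text inside the id='1' tag).
verbs <- sub(".*<w id='1'>([^<]+)</w>.*", "\\1", tagged)

# Extract the text between the two tags and count its tokens.
between <- sub(".*<w id='1'>[^<]+</w>\\s*(.*?)\\s*<w id='2'>.*", "\\1",
               tagged, perl = TRUE)
n_between <- lengths(strsplit(between, "\\s+"))

data.frame(verbs, n_between)
```

Tabulating `verbs` instead of the full tagged pattern groups the results by verb regardless of the intervening words.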
