Sentence match retriever with tagged collocations

This function extends colloc_sentmatch: it retrieves the sentence matches in which a given collocate occurs with the node. Compared with colloc_sentmatch, colloc_sentmatch_tagged additionally provides:

a tibble/data frame output format (similar to colloc_df, one of the outputs of colloc_leipzig)
tagging of the collocates (with "<c>...</c>") and the nodes (with "<n>...</n>") in the sentence matches
a column called "coll_pattern" containing the collocate-node pattern, extracted with a regular expression on the basis of the tagging.
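
The extraction of "coll_pattern" from a tagged sentence can be pictured with the minimal base R sketch below. The regular expression is an assumption made for illustration, not the one used inside colloc_sentmatch_tagged; the input string copies one of the tagged sentences shown in the Examples.

```r
# A tagged sentence shaped like the values in the sent_match_tagged column
tagged <- "74163 Dengan demikian pihaknya mengaku <n>sudah</n> tidak <c>ada</c> masalah lagi."

# Assumed pattern (hypothetical, for illustration only):
# lazily match from the node tag to the collocate tag, or the other way round
pattern <- "<n>.*?</n>.*?<c>.*?</c>|<c>.*?</c>.*?<n>.*?</n>"

# regexpr() locates the first match; regmatches() extracts the matched text
coll_pattern <- regmatches(tagged, regexpr(pattern, tagged, perl = TRUE))
coll_pattern
```

The alternation covers both orders, so a collocate to the left of the node would be captured in the same way.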

colloc_sentmatch_tagged(collout, colloc = NULL)

Arguments

collout	List output of `colloc_leipzig`.
colloc	Character vector of the collocate(s) whose sentence match(es) are to be retrieved.

Value

A data frame with the following variables/columns:

corpus_names - corpus file name
sent_id - sentence number of the collocate matches
w - the collocate whose full sentence matches with the node are retrieved by the function
span - the window-span of the collocate in relation to the node word
node - the node word
sent_match_tagged - tagged sentence matches for the collocate of interest with the node
coll_pattern - extracted collocate-node pattern from the sentence matches
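
Because the tags in sent_match_tagged are plain text, they can be removed or read back out with base R string functions. The sketch below assumes the "<n>...</n>" and "<c>...</c>" tag format described above and uses one tagged sentence copied from the Examples; the object names are hypothetical.

```r
# One value as it might appear in the sent_match_tagged column
tagged <- "74163 Dengan demikian pihaknya mengaku <n>sudah</n> tidak <c>ada</c> masalah lagi."

# Strip the opening and closing node/collocate tags to recover the plain sentence
plain <- gsub("</?[nc]>", "", tagged)

# Pull out the text enclosed by each tag
# (assumes one node tag and one collocate tag per sentence)
node      <- sub("^.*<n>(.*?)</n>.*$", "\\1", tagged, perl = TRUE)
collocate <- sub("^.*<c>(.*?)</c>.*$", "\\1", tagged, perl = TRUE)

plain
node
collocate
```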

Details

Examples

# retrieve the collocates of "sudah" 'already'
collout <- colloc_leipzig(leipzig_corpus_list = demo_corpus_leipzig,
                          pattern = "sudah",
                          window = "b",
                          span = 2,
                          save_interim = FALSE)
#> Detecting a 'named list' input!
#> You chose NOT to SAVE INTERIM RESULTS, which will be stored as a list in console!
#> 1. Tokenising the "ind_mixed_2012_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_mixed_2012_1M.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind_news_2008_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_news_2008_300K.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind_news_2009_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_news_2009_300K.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind_news_2010_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_news_2010_300K.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind_news_2011_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_news_2011_300K.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind_news_2012_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_news_2012_300K.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind_newscrawl_2011_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_newscrawl_2011_1M.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind_newscrawl_2012_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_newscrawl_2012_1M.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind_newscrawl_2015_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_newscrawl_2015_300K.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind_newscrawl_2016_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_newscrawl_2016_1M.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind_web_2011_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_web_2011_300K.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind_web_2012_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_web_2012_1M.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind_wikipedia_2016_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind_wikipedia_2016_1M.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind-id_web_2013_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind-id_web_2013_1M.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 1. Tokenising the "ind-id_web_2015_3M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'sudah' in ind-id_web_2015_3M.
#> 2.1 Gathering the collocates for 'sudah' ...
#> 3. Storing all of the outputs...
#> 
#> DONE!

# retrieve and tag the sentence matches for "sudah" with its collocate "ada" 'exist'
df_sentmatch <- colloc_sentmatch_tagged(collout, colloc = "ada")

# check the tagging in the sentence match
df_sentmatch$sent_match_tagged
#> [1] "128411 Sejak kecil aku hidup sendiri, bersama subo yang kini <n>sudah</n> tidak <c>ada</c>."                                                                                                                                                                               
#> [2] "182610 \"Kita <n>sudah</n> <c>ada</c> 62 negara tujuan ekspor, rasanya itu yang akan kita `maintance`."                                                                                                                                                                    
#> [3] "61890 Jumlah \"swing voter\" akan tinggi karena adanya krisis tokoh, simbol perubahan yang <n>sudah</n> <c>ada</c> tidak melekat dan isu yang disampaikan tidak mengalami kemajuan."                                                                                       
#> [4] "207401 \"Kami bersyukur, semangat menyatukan antara pusat dan daerah <n>sudah</n> <c>ada</c>."                                                                                                                                                                             
#> [5] "17527 Di sisi lain, kata dia, Pemkab Magelang sedang memikirkan pemberian bantuan bagi mereka yang tidak mengungsi karena dana yang tersedia hanya diperuntukkan bagi mereka yang mengungsi di tempat-tempat yang <n>sudah</n> <c>ada</c>."                                
#> [6] "74163 Dengan demikian pihaknya mengaku <n>sudah</n> tidak <c>ada</c> masalah lagi."                                                                                                                                                                                        
#> [7] "571094 \"Sudah,hasil kelengkapan berkas yang kami minta sesuai kepengurusan partai sampai di tingkat kecamatan dan kelurahan juga <n>sudah</n> tak <c>ada</c> masalah."                                                                                                    
#> [8] "408270\tKatanya, instansi vertikal itu <n>sudah</n> <c>ada</c> dana dari APBN,” tutur Manan, di hadapan para pejabat teras Pemkab Bolmong yang dipimpin Wakil Bupati Drs Hi Sehan Mokoapa Mokoagow dan Sekda Ir Hi Siswa Rachmat Mokodongan, serta segenap unsur Muspida." 
#> [9] "463754\tConnection: close Vary: Accept-Encoding Expires: Fri, 07 Aug 2015 03:14:15 GMT Vary: Accept-Encoding,User-Agent X-Upstream-Cache-Status: MISS Kasus Dwelling Time, Pejabat Kemendag Resmi Ditahan <n>Sudah</n> <c>ada</c> tiga orang yang ditahan dalam kasus ini."

# check the extracted collocation pattern and the window span
df_sentmatch[, c("span", "coll_pattern")]
#> # A tibble: 9 x 2
#>   span  coll_pattern                 
#>   <chr> <chr>                        
#> 1 r2    <n>sudah</n> tidak <c>ada</c>
#> 2 r1    <n>sudah</n> <c>ada</c>      
#> 3 r1    <n>sudah</n> <c>ada</c>      
#> 4 r1    <n>sudah</n> <c>ada</c>      
#> 5 r1    <n>sudah</n> <c>ada</c>      
#> 6 r2    <n>sudah</n> tidak <c>ada</c>
#> 7 r2    <n>sudah</n> tak <c>ada</c>  
#> 8 r1    <n>sudah</n> <c>ada</c>      
#> 9 r1    <n>Sudah</n> <c>ada</c>
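
One possible follow-up, sketched here under stated assumptions, is to count how often each collocate-node pattern and each span occurs. The data frame df_demo below is a hypothetical stand-in that copies the nine span and coll_pattern values printed above, so that the snippet runs without the package or the corpus; with real output, df_sentmatch could be used in its place.

```r
# Hypothetical stand-in for df_sentmatch, copying the nine rows printed above
df_demo <- data.frame(
  span = c("r2", "r1", "r1", "r1", "r1", "r2", "r2", "r1", "r1"),
  coll_pattern = c("<n>sudah</n> tidak <c>ada</c>",
                   "<n>sudah</n> <c>ada</c>",
                   "<n>sudah</n> <c>ada</c>",
                   "<n>sudah</n> <c>ada</c>",
                   "<n>sudah</n> <c>ada</c>",
                   "<n>sudah</n> tidak <c>ada</c>",
                   "<n>sudah</n> tak <c>ada</c>",
                   "<n>sudah</n> <c>ada</c>",
                   "<n>Sudah</n> <c>ada</c>"),
  stringsAsFactors = FALSE
)

# Lowercase first so that "Sudah" and "sudah" are counted as the same pattern,
# then tabulate and sort from most to least frequent
pattern_freq <- sort(table(tolower(df_demo$coll_pattern)), decreasing = TRUE)
pattern_freq

# Frequency of each window span
span_freq <- table(df_demo$span)
span_freq
```

Lowercasing is a design choice for this sketch: it merges the sentence-initial "Sudah" with the other matches, which may or may not be wanted depending on the analysis.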
