R/coll_sentmatch_tagged.R
colloc_sentmatch_tagged.Rd
This function extends colloc_sentmatch for retrieving the sentence matches in which a given collocate is found.
The features added in colloc_sentmatch_tagged (and not available in colloc_sentmatch) are:
- a tibble/data frame output format (similar to colloc_df, which is one of the outputs of colloc_leipzig)
- tagging of the collocates (with the "<c>...</c>" tag) and of the nodes (with the "<n>...</n>" tag) in the sentence matches
- a column called "coll_pattern" containing the collocate-node pattern extracted via regular expression on the basis of the tagging.
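To make the relationship between the tagging and the "coll_pattern" column concrete, the following base-R sketch extracts a collocate-node pattern from one tagged sentence. It is an illustration only: the regular expression `rgx` and the variable names are assumptions made for this example and are not taken from the package's source, whose actual expression may differ.

```r
# Illustrative sketch, NOT the package's internal code.
# A tagged sentence in the format found in the sent_match_tagged column.
tagged <- "74163 Dengan demikian pihaknya mengaku <n>sudah</n> tidak <c>ada</c> masalah lagi."

# Hypothetical regex: the shortest stretch running from one tag to the other,
# in either order (node ... collocate, or collocate ... node).
rgx <- "<n>.*?</n>.*?<c>.*?</c>|<c>.*?</c>.*?<n>.*?</n>"

# perl = TRUE enables the lazy quantifiers (.*?) used above.
coll_pattern <- regmatches(tagged, regexpr(rgx, tagged, perl = TRUE))
coll_pattern
#> [1] "<n>sudah</n> tidak <c>ada</c>"
```

Because the pattern is anchored on the tags rather than on the words themselves, the same expression would apply to any node and collocate pair.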
colloc_sentmatch_tagged(collout, colloc = NULL)
collout | List output of |
---|---|
colloc | Character vector of the collocate(s) whose sentence match(es) are to be retrieved. |
A data frame with the following variables/columns:
corpus_names - corpus file name
sent_id - sentence number of the collocate matches
w - the collocate whose full sentence match with the node is retrieved via the function
span - the window-span of the collocate in relation to the node word
node - the node word
sent_match_tagged - tagged sentence matches for the collocate of interest with the node
coll_pattern - extracted collocate-node pattern from the sentence matches
colloc_sentmatch for the untagged, character-vector version of the output; colloc_leipzig for collocate retrieval.
# retrieve the collocate of "sudah" 'already'
collout <- colloc_leipzig(leipzig_corpus_list = demo_corpus_leipzig,
                          pattern = "sudah",
                          window = "b",
                          span = 2,
                          save_interim = FALSE)

# retrieve and tag the sentence match for "sudah" with its collocate "ada" 'exist'
df_sentmatch <- colloc_sentmatch_tagged(collout, colloc = "ada")

# check the tagging in the sentence match
df_sentmatch$sent_match_tagged
#> [1] "128411 Sejak kecil aku hidup sendiri, bersama subo yang kini <n>sudah</n> tidak <c>ada</c>."
#> [2] "182610 \"Kita <n>sudah</n> <c>ada</c> 62 negara tujuan ekspor, rasanya itu yang akan kita `maintance`."
#> [3] "61890 Jumlah \"swing voter\" akan tinggi karena adanya krisis tokoh, simbol perubahan yang <n>sudah</n> <c>ada</c> tidak melekat dan isu yang disampaikan tidak mengalami kemajuan."
#> [4] "207401 \"Kami bersyukur, semangat menyatukan antara pusat dan daerah <n>sudah</n> <c>ada</c>."
#> [5] "17527 Di sisi lain, kata dia, Pemkab Magelang sedang memikirkan pemberian bantuan bagi mereka yang tidak mengungsi karena dana yang tersedia hanya diperuntukkan bagi mereka yang mengungsi di tempat-tempat yang <n>sudah</n> <c>ada</c>."
#> [6] "74163 Dengan demikian pihaknya mengaku <n>sudah</n> tidak <c>ada</c> masalah lagi."
#> [7] "571094 \"Sudah,hasil kelengkapan berkas yang kami minta sesuai kepengurusan partai sampai di tingkat kecamatan dan kelurahan juga <n>sudah</n> tak <c>ada</c> masalah."
#> [8] "408270\tKatanya, instansi vertikal itu <n>sudah</n> <c>ada</c> dana dari APBN,” tutur Manan, di hadapan para pejabat teras Pemkab Bolmong yang dipimpin Wakil Bupati Drs Hi Sehan Mokoapa Mokoagow dan Sekda Ir Hi Siswa Rachmat Mokodongan, serta segenap unsur Muspida."
#> [9] "463754\tConnection: close Vary: Accept-Encoding Expires: Fri, 07 Aug 2015 03:14:15 GMT Vary: Accept-Encoding,User-Agent X-Upstream-Cache-Status: MISS Kasus Dwelling Time, Pejabat Kemendag Resmi Ditahan <n>Sudah</n> <c>ada</c> tiga orang yang ditahan dalam kasus ini."

# check the extracted collocation pattern and the window span
df_sentmatch[, c("span", "coll_pattern")]
#> # A tibble: 9 x 2
#>   span  coll_pattern
#>   <chr> <chr>
#> 1 r2    <n>sudah</n> tidak <c>ada</c>
#> 2 r1    <n>sudah</n> <c>ada</c>
#> 3 r1    <n>sudah</n> <c>ada</c>
#> 4 r1    <n>sudah</n> <c>ada</c>
#> 5 r1    <n>sudah</n> <c>ada</c>
#> 6 r2    <n>sudah</n> tidak <c>ada</c>
#> 7 r2    <n>sudah</n> tak <c>ada</c>
#> 8 r1    <n>sudah</n> <c>ada</c>
#> 9 r1    <n>Sudah</n> <c>ada</c>
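
The "span" and "coll_pattern" columns lend themselves to simple frequency summaries. The sketch below is one possible follow-up in base R, not part of the package's documented examples. The data frame `spans_patterns` is a hypothetical stand-in whose nine rows are copied from the printed output above, so the code runs without the package or the demo corpus.

```r
# Hypothetical stand-in for df_sentmatch[, c("span", "coll_pattern")],
# with values copied from the printed tibble above.
spans_patterns <- data.frame(
  span = c("r2", "r1", "r1", "r1", "r1", "r2", "r2", "r1", "r1"),
  coll_pattern = c("<n>sudah</n> tidak <c>ada</c>",
                   "<n>sudah</n> <c>ada</c>",
                   "<n>sudah</n> <c>ada</c>",
                   "<n>sudah</n> <c>ada</c>",
                   "<n>sudah</n> <c>ada</c>",
                   "<n>sudah</n> tidak <c>ada</c>",
                   "<n>sudah</n> tak <c>ada</c>",
                   "<n>sudah</n> <c>ada</c>",
                   "<n>Sudah</n> <c>ada</c>"),
  stringsAsFactors = FALSE
)

# How often the collocate occurs at each window-span position.
span_freq <- table(spans_patterns$span)

# How often each collocate-node pattern occurs; lowercasing pools
# sentence-initial "Sudah" with "sudah".
pattern_freq <- sort(table(tolower(spans_patterns$coll_pattern)),
                     decreasing = TRUE)

span_freq
pattern_freq
```

With the real output, the same calls could be applied directly to `df_sentmatch$span` and `df_sentmatch$coll_pattern`.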