This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Please cite this R Notebook as follows (in Unified Style Sheet for Linguistics):
Rajeg, Gede Primahadi Wijaya, Karlina Denistia & Simon Musgrave. 2019. R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. https://doi.org/10.6084/m9.figshare.9970205. https://figshare.com/articles/R_Markdown_Notebook_for_i_Vector_space_model_and_the_usage_patterns_of_Indonesian_denominal_verbs_i_/9970205.
# global option chunk
knitr::opts_chunk$set(fig.width = 7,
fig.asp = 0.618,
dpi = 300,
dev = "pdf",
echo = TRUE,
message = FALSE,
warning = FALSE,
fig.path = "figures/")
This is an R Markdown Notebook (Rajeg, Denistia & Musgrave 2019a) for the statistical analyses accompanying our paper (Rajeg, Denistia & Musgrave 2019b) on vector space models and Indonesian denominal verbs (published in NUSA’s special issue titled Linguistic studies using large annotated corpora, edited by Hiroki Nomoto and David Moeljadi) (Nomoto & Moeljadi 2019). The Notebook, however, does not provide detailed exposition and discussion for each point, nor does it gloss the Indonesian words in the tables in English. Readers are referred to our paper for details. Check the README page here for the tutorial on downloading this Notebook and the data (Rajeg, Denistia & Musgrave 2019c) for the paper.
The following R packages have to be installed and loaded to run all the code in this Notebook:
Click the Code button to hide/reveal the R code.
# load the required packages
## make sure all packages are installed!
library(cluster)
library(tidyverse)
library(dendextend)
library(wordVectors) # should be installed from Ben Schmidt's GitHub
library(Rling) # <- `Rling` package from Natalia Levshina's (2015) book 'How to do linguistics with R' published by John Benjamins.
For illustration, we used six deverbal nouns with the suffix -an. They are bacaan ‘reading’ (from the root baca ‘to read’), tulisan ‘writing’ (from tulis ‘to write’), lukisan ‘painting’ (from lukis ‘to paint/draw’), masakan ‘cooking’ (from masak ‘to cook’), makanan ‘food’ (from makan ‘to eat’), and minuman ‘beverages’ (from minum ‘to drink’). We retrieved collocates within a three-word window on either side (left and right) of each noun. We used one of the Indonesian Leipzig Corpora files, namely ind_mixed_2012_1M-sentences.txt (see Biemann et al. 2007; Quasthoff & Goldhahn 2013).
The following code processes the retrieved collocates into a co-occurrence frequency table and prints a subset of this table. The collocates data is available in the "vsm_creation_data.rds" file. This .rds file contains a List data structure; the table version is available as tab-delimited .csv and .txt files (vsm_creation_data_tb.txt & vsm_creation_data_tb.csv).
df_list <- readRDS(file = "data/vsm_creation_data.rds")
df_ex <- purrr::map_df(df_list, dplyr::bind_rows); rm(df_list)
df_ex1 <- dplyr::filter(df_ex,
!is.na(w), # remove NAs
nchar(w) > 1, # remove one-character tokens
stringr::str_detect(w, "^[a-z-]+$"),
# remove incomplete words
!stringr::str_detect(w, "^[a-z]+-$"),
# remove incomplete words
!stringr::str_detect(w, "^-[a-z]+$"))
df_ex1_count <- dplyr::count(df_ex1, node, w, sort = TRUE)
df_ex1_count_spread <- tidyr::spread(df_ex1_count, w, n, fill = 0)
df_ex1_count_spread <- as.data.frame(df_ex1_count_spread)
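# keep the `node` column plus every context-word column with a non-zero total
# (the leading TRUE aligns the logical index with the first, non-numeric column)
df_ex1_count_spread <- df_ex1_count_spread[, c(TRUE, colSums(df_ex1_count_spread[, -1]) > 0)]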
rownames(df_ex1_count_spread) <- df_ex1_count_spread$node
df_ex1_count_spread <- df_ex1_count_spread[, -1]
selected_column <- c("alkitab",
"dituangkan", "halal", "keras",
"mengandung", "penutup")
df_ex1_print <- df_ex1_count_spread[, selected_column]
rownames(df_ex1_print) <- paste("*", rownames(df_ex1_print), "*", sep = "")
colnames(df_ex1_print) <- paste("*", colnames(df_ex1_print), "*", sep = "")
df_ex1_print$`...` <- rep("...", nrow(df_ex1_print))
knitr::kable(df_ex1_print, caption = "Raw co-occurrence frequency for the target words with the context words")
| | alkitab | dituangkan | halal | keras | mengandung | penutup | … |
---|---|---|---|---|---|---|---|
bacaan | 31 | 0 | 0 | 0 | 0 | 0 | … |
lukisan | 0 | 0 | 0 | 1 | 0 | 0 | … |
makanan | 0 | 0 | 19 | 12 | 55 | 7 | … |
masakan | 0 | 0 | 0 | 0 | 1 | 0 | … |
minuman | 0 | 0 | 0 | 150 | 14 | 0 | … |
tulisan | 16 | 6 | 0 | 2 | 2 | 0 | … |
The current approach in VSM adopts a more principled method of weighting the initial raw-frequency vectors into statistical measures of collocation strength before computing (dis)similarity measures between the target words. The goal of the weighting is “to give a higher weight to context words that co-occur significantly more often than expected by chance” (Heylen et al. 2015:156; cf. Clark 2015:503–504; Perek 2016:12). These significantly associated context words are assumed to be more informative for the semantics of the target words (Heylen et al. 2015:156). The popular weighting measure used in VSM is the Pointwise Mutual Information (PMI) (see Levshina 2015:327–328 for computing PMI in R):
\[\text{PMI}(x, y) = \log_{2} \frac{O_{xy}}{E_{xy}}\]
where \(O_{xy}\) represents the observed co-occurrence frequency between x and y, while \(E_{xy}\) is their expected co-occurrence frequency, that is, the frequency expected under the chance distribution between x and y given the overall distribution of x and y in the corpus. Negative PMI values are normally replaced with zero, resulting in the Positive PMI (PPMI) (Levshina 2015; Hilpert & Perek 2015).
mtx <- as.matrix(df_ex1_count_spread)
mtx_exp <- chisq.test(mtx)$expected
mtx_pmi <- log2(mtx/mtx_exp); rm(mtx)
mtx_ppmi <- ifelse(mtx_pmi < 0, 0, mtx_pmi); rm(mtx_exp)
ppmiprint <- as.data.frame(round(mtx_ppmi, digits = 2)); rm(mtx_pmi)
ppmiprint <- ppmiprint[, selected_column]
rownames(ppmiprint) <- paste("*", rownames(ppmiprint), "*", sep = "")
colnames(ppmiprint) <- paste("*", colnames(ppmiprint), "*", sep = "")
ppmiprint$`...` <- rep("...", nrow(df_ex1_count_spread))
knitr::kable(ppmiprint, caption = "Weighted co-occurrence frequency with *Positive Pointwise Mutual Information*")
| | alkitab | dituangkan | halal | keras | mengandung | penutup | … |
---|---|---|---|---|---|---|---|
bacaan | 3.62 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | … |
lukisan | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | … |
makanan | 0.00 | 0.00 | 1.05 | 0.00 | 0.67 | 1.05 | … |
masakan | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | … |
minuman | 0.00 | 0.00 | 0.00 | 3.07 | 0.85 | 0.00 | … |
tulisan | 0.31 | 1.87 | 0.00 | 0.00 | 0.00 | 0.00 | … |
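As a rough illustration of the weighting step, the sketch below recomputes PMI and PPMI by hand on a small made-up matrix. The counts and the object names (`toy`, `toy_exp`, and so on) are hypothetical and are not taken from the corpus. The sketch assumes, as in the code above, that the expected frequencies are those of the independence model (row total × column total / grand total), which is what `chisq.test()$expected` returns.

```r
# Toy co-occurrence matrix (made-up counts, for illustration only)
toy <- matrix(c(30,  0,
                 2, 18),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("target_a", "target_b"),
                              c("context_x", "context_y")))

# Expected frequencies under independence:
# (row total * column total) / grand total
toy_exp <- outer(rowSums(toy), colSums(toy)) / sum(toy)

# chisq.test() returns the same expected values; any warning about small
# expected counts is irrelevant here, so warnings are suppressed
toy_exp_chisq <- suppressWarnings(chisq.test(toy)$expected)

# PMI: log2 of observed over expected; zero counts yield -Inf
toy_pmi <- log2(toy / toy_exp)

# PPMI: negative values (including -Inf) are replaced with zero
toy_ppmi <- pmax(toy_pmi, 0)
round(toy_ppmi, 2)
```

Note that a zero raw count gives a PMI of `-Inf`, which the PPMI step turns into zero, so unattested target-context pairs carry no weight in the later similarity computation.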
Further analysis can be performed. The most common one is determining the semantic (dis)similarity between the target words (i.e., which words are more similar to, or more different from, one another). The following section briefly discusses the Cosine Similarity and Hierarchical Agglomerative Cluster (HAC) analyses as the exploratory tools (Levshina 2014; Levshina 2015).
In VSM, cosine similarity is the popular measure for computing pairwise (dis)similarity between the target words. Cosine similarity computes the cosine of the angle between the words’ vectors to capture their (dis)similarity. The cosine value between a pair of words is close to 1 when they are semantically more similar, and close to 0 otherwise (see @ref(tab:vsm-xmpl-cossim)) (Levshina 2015, Ch. 16).
The following code computes cosine similarity using the cossim() function from the Rling package (Levshina 2015:329).
# Cosine Similarity
mtx_cossim <- Rling::cossim(mtx_ppmi); rm(mtx_ppmi)
# Generate CosSim table output
attr(mtx_cossim, "dimnames")[[1]] <- paste("*",
attr(mtx_cossim, "dimnames")[[1]],
"*",
sep = "")
attr(mtx_cossim, "dimnames")[[2]] <- paste("*",
attr(mtx_cossim, "dimnames")[[2]],
"*",
sep = "")
knitr::kable(round(mtx_cossim, digits = 2),
caption = "Cosine similarity matrix between the deverbal nouns")
| | bacaan | lukisan | makanan | masakan | minuman | tulisan |
---|---|---|---|---|---|---|
bacaan | 1.00 | 0.04 | 0.02 | 0.03 | 0.02 | 0.06 |
lukisan | 0.04 | 1.00 | 0.01 | 0.03 | 0.04 | 0.04 |
makanan | 0.02 | 0.01 | 1.00 | 0.03 | 0.03 | 0.00 |
masakan | 0.03 | 0.03 | 0.03 | 1.00 | 0.04 | 0.02 |
minuman | 0.02 | 0.04 | 0.03 | 0.04 | 1.00 | 0.02 |
tulisan | 0.06 | 0.04 | 0.00 | 0.02 | 0.02 | 1.00 |
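To make the measure concrete, here is a minimal base-R sketch of cosine similarity between the rows of a matrix. The function `cosine_sim()` and the toy vectors are hypothetical. The sketch illustrates the general formula (dot product divided by the product of the vector lengths) and is not necessarily how `Rling::cossim()` is implemented internally.

```r
# Cosine similarity between all pairs of rows of a numeric matrix
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m^2))          # Euclidean length of each row vector
  (m %*% t(m)) / outer(norms, norms)   # dot products divided by length products
}

# Toy vectors (made-up values)
toy_vec <- rbind(word_a = c(1, 2, 0),
                 word_b = c(2, 4, 0),   # same direction as word_a
                 word_c = c(0, 0, 3))   # orthogonal to word_a and word_b

toy_cos <- cosine_sim(toy_vec)
round(toy_cos, 2)
```

In this toy case `word_a` and `word_b` point in the same direction and so reach the maximum similarity of 1, while `word_c` shares no dimension with them and scores 0. With non-negative weights such as PPMI, the values always fall between 0 and 1.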
For the cluster analysis, such as HAC (Levshina 2014; see also Gries 2013:336; Levshina 2015, Ch. 15; and Desagulier 2017:276 for R implementations of HAC), the similarity matrix/table above needs to be converted into a distance matrix as input for the cluster analysis (see the code below, adapted from Levshina 2015:330).
# Distance matrix computation
mtx_dist <- 1 - (mtx_cossim/max(mtx_cossim[mtx_cossim != 1]))
mtx_dist <- as.dist(mtx_dist)
# Cluster analysis
mtx_hcl <- hclust(mtx_dist, method = "ward.D2")
The output of HAC can be visualised as a dendrogram (see Figure @ref(fig:vsm-xmpl-dendrogram)). To determine the optimal number of clusters for grouping the nouns, we used the Average Silhouette Width (ASW) scores and tested two- to five-cluster solutions. The two-cluster solution produces the highest ASW score. See the code below.
# ASW calculation
mtx_asw <- sapply(2:(nrow(mtx_cossim) - 1), function(x) summary(cluster::silhouette(stats::cutree(mtx_hcl, k = x), mtx_dist))$avg.width)
# names the ASW scores representing the tested cluster-solutions
names(mtx_asw) <- 2:(nrow(mtx_cossim) - 1)
# identify the maximum ASW score
max_asw <- as.numeric(names(mtx_asw[mtx_asw == max(mtx_asw)]))
# plotting dendrogram
mtx_hcl$labels <- gsub('\\*', '', mtx_hcl$labels)
plot(mtx_hcl, hang = -1)
rect.hclust(mtx_hcl, k = max_asw)
# generate texts annotation within the plot
added_info <- paste("The ", max_asw, "-cluster solution is based on the highest Average Silhouette Width (ASW) score of ", round(max(mtx_asw), 3), ".\n(ASW ranges from 0 to 1; min. ASW score for assuming substantial cluster is 0.2)", sep = "")
mtext(added_info, side = 1, line = 1.75, cex = .8)
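The ASW-based selection can also be tried out on a small, fully reproducible case. The sketch below uses six made-up two-dimensional points forming two tight groups. The data and object names are hypothetical, and Euclidean distance from `dist()` stands in for the cosine-based distance used above. It follows the same steps: cut the tree at each candidate number of clusters, compute the average silhouette width, and keep the solution with the highest score.

```r
library(cluster)  # provides silhouette(); a recommended package shipped with R

# Six made-up points forming two tight, well-separated groups
toy <- rbind(c(0.0, 0.0), c(0.1, 0.0), c(0.0, 0.1),
             c(5.0, 5.0), c(5.1, 5.0), c(5.0, 5.1))
rownames(toy) <- paste0("w", 1:6)

toy_dist <- dist(toy)                           # Euclidean distances
toy_hcl  <- hclust(toy_dist, method = "ward.D2")

# Candidate cluster solutions: from 2 up to (number of items - 1)
ks <- 2:(nrow(toy) - 1)

# Average silhouette width for each candidate solution
toy_asw <- sapply(ks, function(k) {
  summary(silhouette(cutree(toy_hcl, k = k), toy_dist))$avg.width
})
names(toy_asw) <- ks

# The solution with the highest ASW
best_k <- as.numeric(names(which.max(toy_asw)))
best_k
```

Using `which.max()` returns a single value even when two solutions tie, which keeps the later call to `rect.hclust()` well defined. Whether the first of several tied solutions is the one to prefer is a judgment call.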
The following codes produce the plot visualising the number of cluster-solutions tested and their corresponding ASW scores.
The paper uses thirteen corpus files from the Indonesian Leipzig Corpora to train a vector space model. The following code shows how to load the corpus-size data (i.e. "wordcount_leipzig_allcorpus.RData") and print it.
# load the corpus size data
load("data/wordcount_leipzig_allcorpus.RData")
# print the table
leipzig.word.count %>%
select(-Size) %>%
rename(`Size (in word-tokens)`= Size_print,
`Corpus files` = Corpus) %>%
as.data.frame() %>%
knitr::kable(row.names = TRUE, caption = "Indonesian Leipzig Corpora and their sizes")
| | Corpus files | Size (in word-tokens) |
---|---|---|
1 | ind_mixed_2012_1M-sentences.txt | 15,052,159 |
2 | ind_news_2008_300K-sentences.txt | 5,875,376 |
3 | ind_news_2009_300K-sentences.txt | 5,868,276 |
4 | ind_news_2010_300K-sentences.txt | 5,874,158 |
5 | ind_news_2011_300K-sentences.txt | 5,852,211 |
6 | ind_news_2012_300K-sentences.txt | 5,873,523 |
7 | ind_newscrawl_2011_1M-sentences.txt | 16,376,426 |
8 | ind_newscrawl_2012_1M-sentences.txt | 16,916,778 |
9 | ind_web_2011_300K-sentences.txt | 4,472,885 |
10 | ind_web_2012_1M-sentences.txt | 15,844,629 |
11 | ind_wikipedia_2016_1M-sentences.txt | 16,506,714 |
12 | ind-id_web_2013_1M-sentences.txt | 16,406,671 |
13 | ind-id_web_2015_3M-sentences.txt | 49,849,398 |
In total, the thirteen corpus files amount to 180,769,204 word-tokens (i.e., about 180.8 million).
The database for the studied verbs is available in the .rds, .csv, and .txt tab-delimited files named "me_parsed_nountaggedbase". The verbs have been parsed and tagged using MorphInd (Larasati, Kuboň & Zeman 2011). We limited the study to verbs with noun-tagged roots that occur more than 20 times. For each root, the verb forms must occur in all three morphological schemas, namely meN-, meN-/-kan, and meN-/-i. For instance, we take mendasar ‘to be basic’, mendasari ‘to underlie sth.’, and mendasarkan ‘to base sth. on’, which are all derived from the nominal root dasar ‘base; foundation’.
The following code shows the filtering process for retrieving the relevant verbs, and Table @ref(tab:studied-verbs-retrieval) shows a snippet of the database.
# load the database
parsed_me_noun <- readRDS("data/me_parsed_nountaggedbase.rds")
# filtering parameters
min_freq <- 20 # freq. threshold
base_type <- "noun" # root category
# filtering process
df_noun <- parsed_me_noun %>%
filter(n > min_freq) %>% # retrieve only verbs occurring over 20 tokens
group_by(base) %>%
mutate(n_affix_new = n_distinct(affix)) %>%
filter(n_affix_new == 3) %>% # make sure each root occurs in the 3 schemas
ungroup()
# retrieve the target-word character vectors
me_words <- df_noun$word
# print the database snippet
df_noun %>%
select(-n_affix_new) %>%
filter(base %in% c("dasar")) %>%
rename(base_pos = base.pos,
token_freq = n) %>%
mutate(word = paste("*", word, "*", sep = ""),
base = paste("*", base, "*", sep = ""),
affix = paste("*", affix, "*", sep = ""),
morphind = gsub("(<|>)", "\\\\\\1", morphind, perl = TRUE)) %>%
knitr::kable(caption = "Snippet of the analysed denominal verbs")
word | token_freq | base | base_pos | morphind | affix |
---|---|---|---|---|---|
mendasar | 4571 | dasar | n | meN+dasar<n>_VSA | me |
mendasari | 1365 | dasar | n | meN+dasar<n>+i_VSA | me.i |
mendasarkan | 781 | dasar | n | meN+dasar<n>+kan_VSA | me.kan |
In total, we analysed 51 denominal verbs based on 17 root types occurring in three morphological schemas.
In the paper, we mention checking the existence of the studied verbs in MALINDO Morph, a morphological dictionary for Indonesian and Malay (Nomoto et al. 2018). We checked it with the latest version of MALINDO Morph from the file named "malindo_dic_20181125.tsv", which has been saved as "malindo_dbase.rds" (an R data format) and as a tab-delimited .csv file ("malindo_dbase.csv"). The following code documents the checking process. The printed verbs are those absent from MALINDO Morph but present in our corpus with more than 20 tokens overall.
# load MALINDO Dictionary data
malindo <- readRDS("data/malindo_dbase.rds")
# print the verbs that are available in the dataset but absent in MALINDO
(absent_in_malindo <- setdiff(me_words, malindo$word_form))
[1] "mengakhir" "membuah" "mengantung" "mewakil"
[5] "mewaris"
As MALINDO Morph is also based on the Leipzig Corpora, it only takes into account words occurring more than ten times in all the 300K-sentence versions of the corpus (Nomoto et al. 2018). Our frequency check of these absent verbs confirms that they all occur fewer than ten times in the 300K-sentence files that we use. The following code shows this information.
# load the freqlist per corpus and
# check freq. of words absent in MALINDO 300K corpus
df_all_pref <- read_tsv("data/wordlist_leipzig_ME_DI_TER_percorpus.tsv", progress = FALSE)
df_all_pref %>%
filter(word %in% absent_in_malindo,
grepl("300K", corpus, perl = TRUE)) %>%
mutate(word = paste("*", word, "*", sep = "")) %>%
arrange(word, corpus) %>%
knitr::kable(row.names = TRUE, caption = 'Token frequency of the absent verbs in the 300K-sentence files')
| | corpus | word | n |
---|---|---|---|
1 | ind_news_2008_300K | membuah | 5 |
2 | ind_news_2009_300K | membuah | 3 |
3 | ind_news_2010_300K | membuah | 1 |
4 | ind_news_2011_300K | membuah | 2 |
5 | ind_news_2012_300K | membuah | 5 |
6 | ind_web_2011_300K | membuah | 1 |
7 | ind_news_2008_300K | mengakhir | 2 |
8 | ind_news_2009_300K | mengakhir | 5 |
9 | ind_news_2010_300K | mengakhir | 3 |
10 | ind_news_2011_300K | mengakhir | 6 |
11 | ind_news_2011_300K | mengantung | 1 |
12 | ind_news_2012_300K | mengantung | 1 |
13 | ind_news_2009_300K | mewakil | 2 |
14 | ind_news_2010_300K | mewakil | 2 |
15 | ind_news_2011_300K | mewakil | 4 |
16 | ind_web_2011_300K | mewaris | 1 |
Detailed information concerning the training parameters is available in our paper. In short, we trained the Leipzig Corpora on the MonARCH HPC using the skip-gram learning algorithm from the word2vec model (Mikolov, Chen, et al. 2013; Mikolov, Sutskever, et al. 2013; Mikolov, Yih & Zweig 2013) via the wordVectors R package (Schmidt & Li 2017). The output model is available as a .bin file named "leipzig_w2v_vector_full.bin". The following code shows how the model is loaded into R using read.binary.vectors() from the wordVectors package.
To retrieve the vector space model (VSM) of the target denominal verbs, use the following code.
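A minimal sketch of the loading step might look as follows. The `data/` location of the .bin file is an assumption based on the paths used elsewhere in this Notebook. The object name `vsm` is taken from the code that follows. The guard around the call is only there so that the snippet does not fail when the file or the package is unavailable.

```r
# Assumed location of the trained model; adjust to where the .bin file is stored
model_path <- "data/leipzig_w2v_vector_full.bin"

if (requireNamespace("wordVectors", quietly = TRUE) && file.exists(model_path)) {
  # read the binary word2vec model into a VectorSpaceModel object
  vsm <- wordVectors::read.binary.vectors(model_path)
} else {
  message("wordVectors or the model file is not available; skipping the load.")
}
```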
# get the VSM for the studied denominal verbs
vsm_tgt <- vsm[rownames(vsm) %in% me_words, ]
# print the subset of the model
vsm_tgt
A VectorSpaceModel object of 51 words and 100 vectors
[,1] [,2] [,3] [,4]
mengatakan -0.15838896 0.05143370 0.02442524 0.24287735
menggunakan -0.10439277 0.07838520 0.18227956 0.38019755
mewakili -0.25855595 -0.09868021 0.05797758 0.19394790
menempatkan 0.01809703 -0.01772323 0.24649994 0.12339904
menempati -0.17115426 -0.10566988 0.16301627 0.17471313
mengakhiri 0.24585645 0.12317820 -0.12496730 0.18348111
melangkah 0.14984114 0.02910020 0.04853236 -0.34504056
mencontohkan -0.24159168 0.01203183 0.07734830 0.13243185
mendasar -0.45044890 -0.07471947 0.37076223 -0.02369694
menandai 0.01873370 0.19389461 0.17055722 0.15514162
[,5] [,6]
mengatakan -0.037270460 -0.128047243
menggunakan -0.012107351 -0.019670846
mewakili -0.219191238 -0.049940147
menempatkan 0.090888657 -0.063201532
menempati -0.105571389 -0.090712421
mengakhiri -0.032905310 -0.200849608
melangkah -0.339647442 -0.008438474
mencontohkan 0.002526026 0.068748802
mendasar 0.357669443 0.234728336
menandai 0.026616529 -0.258213133
attr(,".cache")
<environment: 0x7fbe56512620>
The following code runs the Hierarchical Agglomerative Cluster (HAC) analysis on the vector space model of the target verbs, as well as the Average Silhouette Width (ASW) statistics.
# A wrapper function for HCA and ASW computations and gathering the relevant results
svs_hca <- function(vect = NULL,
clust_method = c("complete", "ward.D", "ward.D2",
"single", "average", "mcquitty",
"median", "centroid")) {
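# 0 resolve the clustering method to a single value
# (falls back to "complete" when the argument is left at its default)
clust_method <- match.arg(clust_method)
# 1 cosine distance using function from `wordVectors`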
cosdist <- wordVectors::cosineDist(vect, vect)
cosdist <- as.dist(cosdist)
# 2 Hierarchical Cluster Analysis
hca <- hclust(cosdist, method = clust_method)
# 3 Compute the 'average silhouette width' (ASW)
# using the 'cutree' function for HCA partitioning
asw_f <- function(x) {
summary(cluster::silhouette(stats::cutree(hca, k = x), cosdist))$avg.width
}
asw_cutree <- sapply(2:(dim(vect)[1] - 1), asw_f)
names(asw_cutree) <- 2:(dim(vect)[1] - 1)
# 3.2 get the cluster number with highest ASW score
n_cluster <- as.numeric(names(asw_cutree[asw_cutree == max(asw_cutree)]))
max_asw <- asw_cutree[asw_cutree == max(asw_cutree)]
# 4 Put the results into a list
res <- list(cosdist, hca, n_cluster, max_asw, asw_cutree)
names(res) <- c("cosine_dist", "hcluster", "n_cluster", "asw", "asw_all")
return(res)
}
# Hierarchical Agglomerative Cluster (HAC) analysis for the denominal verbs
clust_method <- "ward.D2"
hca_res <- svs_hca(vect = vsm_tgt,
clust_method = clust_method)
The following codes generate the plot for the ASW scores in the paper.
# preparing data frame for the ASW and tested clusters
asw_df <- hca_res$asw_all
asw_df <- tibble::tibble(tested_cluster = names(asw_df),
asw = unname(asw_df))
asw_df_sort <- dplyr::arrange(asw_df, dplyr::desc(asw))
# plot with ggplot2
asw_df_sort %>%
ggplot(aes(x = reorder(tested_cluster, asw), y = asw, group = 1)) +
geom_step() +
ylim(c(0, 1)) +
coord_flip() +
theme_light() +
geom_text(aes(label = round(asw, 3)), hjust = -0.2, size = 2.5) +
labs(x = "Number of tested cluster solutions",
y = "Average Silhouette Width (ASW)",
caption = paste("The tested cluster solution ranges from 2 to ",
length(me_words) - 1,
" (i.e., the length of the analysed words (",
length(me_words), ") - 1).\nThe highest ASW of ",
round(hca_res$asw, 3),
" is for ", hca_res$n_cluster,
"-cluster solution",
sep = ""))
The dendrogram tree is generated using the following codes:
opar <- par(no.readonly = TRUE)
par(mar = c(2, 2, 2, 10))
dend <- as.dendrogram(hca_res$hcluster)
dend <- dendextend::set(dend = dend, "labels_cex", .9)
plot(dend, horiz = TRUE)
dendextend::rect.dendrogram(dend,
k = hca_res$n_cluster,
horiz = TRUE,
border = 2,
lwd = 0.35)
# A function to retrieve the token frequency of the studied verbs
wfreq <- function(df = df_noun, w, formatted = TRUE) {
freq <- dplyr::pull(dplyr::filter(df, word == w), n)
if (formatted) {
freq <- format(freq, big.mark = ",")
}
return(freq)
}
The subcluster of the motion verbs is extracted using the following codes:
clusters <- dendextend::cutree(hca_res$hcluster, k = hca_res$n_cluster)
opar <- par(no.readonly = TRUE)
par(mar = c(2, 2, 2, 7))
dend <- as.dendrogram(hca_res$hcluster)
to_prune <- names(clusters[!names(clusters) %in% c("melangkahkan", "melangkah", "menapak", "menapaki", "menapakkan", "menjejakkan", "menjejak", "menjejaki")])
plot(dendextend::prune(dend, to_prune), horiz = TRUE)
par(opar)
In this section, we present three tables for n-grams data of verbs with the root tapak ‘sole of the foot’. The data can be loaded as follows:
tapak <- readr::read_tsv("data/ngramexampl_3gr_menapak.txt")
tapaki <- readr::read_tsv("data/ngramexampl_3gr_menapaki.txt")
tapakkan <- readr::read_tsv("data/ngramexampl_3gr_menapakkan.txt")
The following code generates the ten most frequent right-side 3-grams for menapak.
tapakgr <- tapak %>%
filter(w1 == "menapak") %>%
count(ngrams, sort = TRUE) %>%
.[1:10, ] %>%
mutate(ngrams = paste("*", ngrams, "*", sep = "")) # make ngrams italics
knitr::kable(tapakgr, caption = "The ten most frequent 3-gram for *menapak*", row.names = TRUE)
| | ngrams | n |
---|---|---|
1 | menapak_masa_depan | 10 |
2 | menapak_ke_babak | 6 |
3 | menapak_di_jalan | 4 |
4 | menapak_di_lantai | 4 |
5 | menapak_karir_di | 4 |
6 | menapak_di_atas | 3 |
7 | menapak_tilas_jejak | 3 |
8 | menapak_di_bumi | 2 |
9 | menapak_di_jalanan | 2 |
10 | menapak_di_permukaan | 2 |
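The n-gram files loaded above already contain the `w1` and `ngrams` columns; how they were built is not shown in this Notebook. Purely as an illustration of the idea, the base-R sketch below builds 3-grams from a made-up token vector and counts those that start with a given node word. All names and tokens are hypothetical placeholders.

```r
# Made-up token sequence; "node" stands for the verb of interest
tokens <- c("node", "x", "y", "z", "node", "x", "y", "node", "x", "w")
n_tok  <- length(tokens)

# Shift the token vector to line up each word with its two right-hand neighbours
w1 <- tokens[1:(n_tok - 2)]
w2 <- tokens[2:(n_tok - 1)]
w3 <- tokens[3:n_tok]
tri <- data.frame(w1, w2, w3,
                  ngrams = paste(w1, w2, w3, sep = "_"),
                  stringsAsFactors = FALSE)

# Keep only the 3-grams whose first word is the node, then count and sort them
node_tri  <- tri[tri$w1 == "node", ]
tri_count <- sort(table(node_tri$ngrams), decreasing = TRUE)
head(tri_count, 10)
```

Filtering on the first slot is what makes these right-side 3-grams: the two words following the node are its immediate right-hand context.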
Menapak can be used as a transitive verb (items 1, 5, and 7) and as an intransitive verb (the remaining items in Table @ref(tab:tapak-gram-table)). Its transitive usage shares similar right-side collocation patterns with the MeN-/-i form menapaki, especially direct objects such as masa depan ‘future’ and karir ‘career’ (Table @ref(tab:tapaki-gram-table) below).
tapakigr <- tapaki %>%
filter(w1 == "menapaki") %>%
count(ngrams, sort = TRUE) %>%
.[1:10, ] %>%
mutate(ngrams = paste("*", ngrams, "*", sep = ""))
knitr::kable(tapakigr, caption = "The ten most frequent 3-gram for *menapaki*", row.names = TRUE)
| | ngrams | n |
---|---|---|
1 | menapaki_anak_tangga | 11 |
2 | menapaki_jalan_menuju | 11 |
3 | menapaki_masa_depan | 11 |
4 | menapaki_jalan_yang | 9 |
5 | menapaki_karir_di | 9 |
6 | menapaki_karier_sebagai | 7 |
7 | menapaki_karier_di | 6 |
8 | menapaki_babak_baru | 5 |
9 | menapaki_dunia_kerja | 5 |
10 | menapaki_karir_sebagai | 5 |
This is different from the transitive usage with the MeN-/-kan schema (Table @ref(tab:tapakkan-gram-table) below), which predominantly has kaki ‘foot’ as its direct object, followed by either locational/directional prepositional phrases or motion verb complements (e.g., memasuki ‘to enter’ [item 5] and maju ‘to move forward’ [item 8]).
tapakkangr <- tapakkan %>%
filter(w1 == "menapakkan") %>%
count(ngrams, sort = TRUE) %>%
.[1:10, ] %>%
mutate(ngrams = paste("*", ngrams, "*", sep = ""))
knitr::kable(tapakkangr, caption = "The ten most frequent 3-gram for *menapakkan*", row.names = TRUE)
| | ngrams | n |
---|---|---|
1 | menapakkan_kakinya_di | 24 |
2 | menapakkan_kaki_di | 14 |
3 | menapakkan_kaki_ke | 3 |
4 | menapakkan_dirinya_di | 2 |
5 | menapakkan_kaki_memasuki | 2 |
6 | menapakkan_kaki_saat | 2 |
7 | menapakkan_kakinya_ke | 2 |
8 | menapakkan_kakinya_maju | 2 |
9 | menapakkan_bisnis_toko | 1 |
10 | menapakkan_citra_donnie | 1 |
Codes for extracting the subset of the mixture of communication and psych verbs are as follows.
clusters <- dendextend::cutree(hca_res$hcluster, k = hca_res$n_cluster)
opar <- par(no.readonly = TRUE)
par(mar = c(2, 2, 2, 7))
dend <- as.dendrogram(hca_res$hcluster)
to_prune <- names(clusters[!names(clusters) %in% c("membayangkan", "menyesal", "menyesali", "mengatai")])
plot(dendextend::prune(dend, to_prune), horiz = TRUE)
par(opar)
The above dendrogram is from the top cluster in Figure @ref(fig:hca-plot), consisting of (i) menyesal ‘to be regretful’ (N = 2,556), (ii) menyesali ‘regret sth.’ (N = 956), and (iii) membayangkan ‘imagine; visualise’ (N = 2,719). The other subset of communication and psych verbs (i.e., mengatakan ‘to say sth.’ (N = 265,381), mencontohkan ‘to exemplify’ (N = 4,799), and menyesalkan ‘regret sth.’ (N = 1,976)) is extracted as follows.
clusters <- dendextend::cutree(hca_res$hcluster, k = hca_res$n_cluster)
opar <- par(no.readonly = TRUE)
par(mar = c(2, 2, 2, 7))
dend <- as.dendrogram(hca_res$hcluster)
to_prune <- names(clusters[!names(clusters) %in% c("mencontohkan", "mengatakan", "menyesalkan")])
plot(dendextend::prune(dend, to_prune), horiz = TRUE)
par(opar)
What is interesting about the verbs in these last two clusters is that the MeN-/-i (i.e. mengatai) and MeN-/-kan verbs (i.e. mengatakan) with the root kata ‘word’ are far apart in the dendrogram. A similar case is menyesalkan, which is separated from menyesal and menyesali; the latter two verbs cluster together and are merged first in Figure @ref(fig:subcluster-psychs-1) (see §@ref(cluster-split) for further discussion of this split).
This cluster type captures denominal verbs of a given root whose three morphological schemas cluster together (Figure @ref(fig:subcluster-root-based)). We have seen a few examples of these in the motion cluster with the roots tapak ‘sole of the foot’ and jejak ‘footprint’, the derived forms of which fall into one cluster but differ in terms of their within-cluster branching (Figure @ref(fig:subclust-motion)). The other examples are based on the following roots: wakil, susu, tempat, and dasar (cf. the verb forms listed in the code below).
The cluster subsets of these verbs are extracted from Figure @ref(fig:hca-plot) into Figure @ref(fig:subcluster-root-based) with the codes below.
clusters <- dendextend::cutree(hca_res$hcluster, k = hca_res$n_cluster)
opar <- par(no.readonly = TRUE)
par(mar = c(2, 2, 2, 7))
dend <- as.dendrogram(hca_res$hcluster)
to_prune <- names(clusters[!names(clusters) %in% c("mewakili", "mewakilkan", "mewakil", "menyusu", "menyusui", "menyusukan", "menempati", "menempatkan", "menempat", "mendasari", "mendasarkan", "mendasar")])
plot(dendextend::prune(dend, to_prune), horiz = TRUE)
dendextend::rect.dendrogram(dendextend::prune(dend, to_prune),
k = 4,
horiz = TRUE,
border = 2,
lwd = 0.35)
In the paper, we present n-gram data contrasting mewakili and mewakilkan in terms of the right-side collocates in their 3-grams. First, the code for generating n-grams for mewakili ‘to (be a) represent(ative of) X’ is shown below.
mewakili <- readr::read_tsv("data/ngramexampl_5gr_mewakili.txt", progress = FALSE)
mewakili_mini <- dplyr::filter(mewakili, w1 == "mewakili")
mewakili_count <- dplyr::count(mewakili_mini, w1, w2, w3, sort = TRUE)
mewakili_print <- mewakili_count[1:10,]
mewakili_print <- dplyr::mutate(mewakili_print,
ngrams = paste(w1, "_", w2, "_", w3, sep = ""),
ngrams = paste("*", ngrams, "*", sep = ""))
mewakili_print <- dplyr::select(mewakili_print, ngrams, n)
knitr::kable(mewakili_print, caption = "10 most frequent 3-gram for *mewakili* 'to (be a) represent(ative of) X'", row.names = TRUE)
| | ngrams | n |
---|---|---|
1 | mewakili_kebijakan_editorial | 174 |
2 | mewakili_indonesia_di | 142 |
3 | mewakili_indonesia_dalam | 129 |
4 | mewakili_indonesia_pada | 69 |
5 | mewakili_lebih_dari | 38 |
6 | mewakili_iklan_anda | 36 |
7 | mewakili_kepala_dinas | 35 |
8 | mewakili_indonesia_untuk | 32 |
9 | mewakili_indonesia_ke | 26 |
10 | mewakili_kepala_badan | 23 |
Then, the codes for 3-grams of mewakilkan ‘to make X as the representative (of Y)’.
mewakilkan <- readr::read_tsv("data/ngramexampl_5gr_mewakilkan.txt", progress = FALSE)
mewakilkan_mini <- dplyr::filter(mewakilkan, w1 == "mewakilkan")
mewakilkan_count <- dplyr::count(mewakilkan_mini, w1, w2, w3, sort = TRUE)
mewakilkan_print <- mewakilkan_count[1:10,]
mewakilkan_print <- dplyr::mutate(mewakilkan_print,
ngrams = paste(w1, "_", w2, "_", w3, sep = ""),
ngrams = paste("*", ngrams, "*", sep = ""))
mewakilkan_print <- dplyr::select(mewakilkan_print, ngrams, n)
knitr::kable(mewakilkan_print, caption = "10 most frequent 3-gram for *mewakilkan* 'to appoint/select X as the representative of Y'", row.names = TRUE)
| | ngrams | n |
---|---|---|
1 | mewakilkan_sebuah_film | 6 |
2 | mewakilkan_kepada_orang | 5 |
3 | mewakilkan_orang_lain | 4 |
4 | mewakilkan_benua_asia | 3 |
5 | mewakilkan_kehadirannya_kepada | 3 |
6 | mewakilkan_kepada_unais | 3 |
7 | mewakilkan_kepada_wakil | 3 |
8 | mewakilkan_6_perwakilan | 2 |
9 | mewakilkan_bisa_dengan | 2 |
10 | mewakilkan_dirinya_lewat | 2 |
This sub-section addresses in more detail the split between morphological schemas for a given root. The split, especially between MeN-/-kan and MeN-/-i verbs, reflects Sneddon et al.’s (2010:100–101) hypothesis concerning a clear semantic difference between some pairs of MeN-/-kan and MeN-/-i verbs with the same root. Our VSM-based approach allows us to visualise such a split through the dendrogram based on large-scale usage data. The sub-section also demonstrates how the difference between morphologically related verb pairs can be characterised further using the technique of nearest neighbours based on the VSM data.
In §@ref(semclust), we mentioned the clear split between the transitive melangkahkan ‘to move the foot forward’ and melangkahi ‘to step over’. Looking at the 2-gram data for each verb shows that they have different semantic orientations. The following code generates the 2-gram data for melangkahi.
langkahi <- readr::read_tsv("data/ngramexampl_3gr_melangkahi.txt")
langkahi_print <- langkahi %>%
filter(w1 == "melangkahi") %>%
mutate(ngrams = paste(w1, "_", w2, sep = ""),
ngrams = paste("*", ngrams, "*", sep = "")) %>%
count(ngrams, sort = TRUE) %>%
.[1:10, ]
knitr::kable(langkahi_print, caption = "Ten most frequent 2-gram for right-side collocates of *melangkahi*", row.names = TRUE)
| | ngrams | n |
---|---|---|
1 | melangkahi_kewenangan | 7 |
2 | melangkahi_aturan | 6 |
3 | melangkahi_batas-batas | 4 |
4 | melangkahi_apa | 3 |
5 | melangkahi_beberapa | 3 |
6 | melangkahi_mekanisme | 3 |
7 | melangkahi_pundak | 3 |
8 | melangkahi_tlundak | 3 |
9 | melangkahi_batasan | 2 |
10 | melangkahi_dasar-dasar | 2 |
Melangkahi predominantly conveys a metaphorical sense related to disobeying/disregarding certain (i) rules/protocols (i.e., aturan, batas-batas/batasan ‘limits; restriction’, mekanisme ‘mechanism’), (ii) foundations (dasar-dasar), or (iii) authority (kewenangan).
In contrast, melangkahkan predominantly takes kaki ‘foot’ as its direct object, which can be used in both literal, translational motion and metaphorical motion senses (see our paper for the example sentences).
langkahkan <- readr::read_tsv("data/ngramexampl_3gr_melangkahkan.txt")
langkahkan_print <- langkahkan %>%
filter(w1 == "melangkahkan") %>%
mutate(ngrams = paste(w1, "_", w2, sep = ""),
ngrams = paste("*", ngrams, "*", sep = "")) %>%
count(ngrams, sort = TRUE) %>%
.[1:10, ]
knitr::kable(langkahkan_print, caption = "Ten most frequent 2-gram for right-side collocates of *melangkahkan*", row.names = TRUE)
| | ngrams | n |
---|---|---|
1 | melangkahkan_kaki | 162 |
2 | melangkahkan_kakinya | 123 |
3 | melangkahkan_kakiku | 13 |
4 | melangkahkan_kedua | 3 |
5 | melangkahkan_satu | 3 |
6 | melangkahkan_kakimu | 2 |
7 | melangkahkan_bakti | 1 |
8 | melangkahkan_bidak | 1 |
9 | melangkahkan_gambaran | 1 |
10 | melangkahkan_jalan | 1 |
Observations on the n-grams can be enriched using information from the VSM of words. Given that the skip-gram algorithm of word2vec learns to predict the contextual environments given a target word (cf. Mikolov, Chen, et al. 2013), one can retrieve from the VSM a set of words that have a similar contextual-vector distribution to a given target verb on the basis of their cosine similarities; these words can be metaphorically referred to as the verb’s nearest neighbours. Table @ref(tab:nearest-to-melangkahi) illustrates the idea for melangkahi ‘to step over’.
# function to retrieve and print top-10 nearest neighbours
neighbours <- function(model, seed_word, n = 10) {
  # ask for one extra word because the seed word itself is always returned first
  topn <- n + 1
  df <- wordVectors::closest_to(model, seed_word, topn)
  # remove the seed word: it is most similar to itself and thus ranks first,
  # so the top-10 nearest neighbours are the words ranked 2 to 11
  df <- dplyr::filter(df, word != seed_word)
  # wrap the words in asterisks so that they are printed in italics
  df <- dplyr::mutate(df, word = paste("*", word, "*", sep = ""))
  df <- as.data.frame(df)
  colnames(df)[2] <- paste('similarity to "*', seed_word, '*"', sep = "")
  return(df)
}
# get the nearest neighbours for *melangkahi*
near_langkahi <- neighbours(vsm, seed_word = "melangkahi")
knitr::kable(near_langkahi,
caption = "10 closest words to *melangkahi* 'to step over'",
row.names = TRUE)
| | word | similarity to “melangkahi” |
---|---|---|
1 | mengangkangi | 0.5508479 |
2 | berkeras | 0.5435145 |
3 | memperhitungkannya | 0.5337096 |
4 | mengacuhkan | 0.5174865 |
5 | memagari | 0.5163426 |
6 | memegang | 0.5040103 |
7 | membelakangi | 0.5036757 |
8 | mematuhi | 0.4959915 |
9 | bersikeras | 0.4949589 |
10 | berbenturan | 0.4929108 |
The closest words are not necessarily similar in meaning (e.g., near-synonyms) but may stand in other kinds of relation to the target word, such as antonymy. Words in Table @ref(tab:nearest-to-melangkahi) conveying a more or less antonymous sense to melangkahi ‘to step over; to disregard’ include mengacuhkan ‘to care about/heed sth.’, mematuhi ‘to obey’, and (to a degree) memperhitungkannya ‘to take sth. into account’. Mengangkangi ‘to straddle sth.’ is the closest in meaning to melangkahi, as its physical, postural sense can be extended to a ‘disregarding’ sense: the 2-gram data for mengangkangi across the whole corpus (Table @ref(tab:kangkangi-gr)) reveal that it does co-occur with rule-related direct objects, such as hukum ‘law’ (3 tokens), peraturan ‘regulation’ (3), kebenaran ‘the truth’ (2), prinsip ‘principles’ (2), undang-undang ‘constitution’ (2), aturan ‘rules’ (1), inter alia.
ngangkangi <- readr::read_tsv("data/ngramexampl_3gr_mengangkangi.txt")
ngangkangi_print <- ngangkangi %>%
  filter(w1 == "mengangkangi") %>%
  mutate(ngrams = paste(w1, "_", w2, sep = ""),
         ngrams = paste("*", ngrams, "*", sep = "")) %>%
  count(ngrams, sort = TRUE)
knitr::kable(ngangkangi_print[ngangkangi_print$n > 1, ], caption = "The 2-gram data for right-side collocates of *mengangkangi* (n >= 2)", row.names = TRUE)
| | ngrams | n |
---|---|---|
1 | mengangkangi_tubuh | 4 |
2 | mengangkangi_dunia | 3 |
3 | mengangkangi_hukum | 3 |
4 | mengangkangi_peraturan | 3 |
5 | mengangkangi_seluruh | 3 |
6 | mengangkangi_bagian | 2 |
7 | mengangkangi_hak | 2 |
8 | mengangkangi_jembatan | 2 |
9 | mengangkangi_kebenaran | 2 |
10 | mengangkangi_mu | 2 |
11 | mengangkangi_prinsip | 2 |
12 | mengangkangi_sebuah | 2 |
13 | mengangkangi_tanah | 2 |
14 | mengangkangi_undang-undang | 2 |
The following code retrieves the words nearest to melangkahkan.
| | word | similarity to “melangkahkan” |
---|---|---|
1 | menjejakkan | 0.7487976 |
2 | melangkah | 0.7351338 |
3 | dilangkahkan | 0.7294263 |
4 | berlari | 0.7258124 |
5 | kakiku | 0.7195594 |
6 | menghunjamkan | 0.7150664 |
7 | menapakkan | 0.7126413 |
8 | berjingkat | 0.7079603 |
9 | kakinya | 0.7068803 |
10 | langkahkan | 0.7066612 |
The following code retrieves the nearest words to mengatai ‘to rebuke; to speak ill of sb.’.
| | word | similarity to “mengatai” |
---|---|---|
1 | memaki | 0.7476359 |
2 | marah-marah | 0.6818458 |
3 | cerewet | 0.6806308 |
4 | mengejek | 0.6795740 |
5 | memaki-maki | 0.6713904 |
6 | jengkel | 0.6675079 |
7 | diejek | 0.6645705 |
8 | diolok-olok | 0.6641099 |
9 | meledek | 0.6628723 |
10 | berbohong | 0.6597248 |
In contrast, mengatakan mostly appears as a communication verb whose distribution is similar to that of other reported-speech verbs.
| | word | similarity to “mengatakan” |
---|---|---|
1 | menegaskan | 0.8376837 |
2 | menyatakan | 0.8318030 |
3 | mengungkapkan | 0.8079668 |
4 | mengemukakan | 0.7967164 |
5 | menuturkan | 0.7925858 |
6 | menjelaskan | 0.7808896 |
7 | menyebutkan | 0.7640667 |
8 | menerangkan | 0.7568715 |
9 | mengakui | 0.7495804 |
10 | mengatkan | 0.7467988 |
The nearest words to membuahi ‘to breed sth.’ are retrieved as follows.
| | word | similarity to “membuahi” |
---|---|---|
1 | dibuahi | 0.8372309 |
2 | ovum | 0.8330801 |
3 | sperma | 0.7965994 |
4 | gamet | 0.7326975 |
5 | pembuahan | 0.7300101 |
6 | terbuahi | 0.7016097 |
7 | spermatozoid | 0.6901719 |
8 | spermatozoa | 0.6794663 |
9 | parthenogenesis | 0.6775866 |
10 | zigot | 0.6712285 |
The following code extracts the closest words to membuahkan ‘to bear fruit; to result in sth.’.
| | word | similarity to “membuahkan” |
---|---|---|
1 | berbuah | 0.6716316 |
2 | mem-buahkan | 0.6296921 |
3 | tercipta | 0.6214626 |
4 | membuah | 0.5991543 |
5 | menuai | 0.5729809 |
6 | tendangannya | 0.5533927 |
7 | ditepis | 0.5530693 |
8 | kerasnya | 0.5528045 |
9 | pinalti | 0.5482891 |
10 | dimentahkan | 0.5478375 |
Notes on the usage sentences for membuah are given below. First, load the sentence citations for membuah into R and print them to the console. The notes were then created by manually inspecting all the usage sentences. The sentence data are available as an .rds file containing a list (sentence_membuah.rds) and as a plain-text .txt file of sentences (sentence_membuah.txt).
membuah <- readRDS("data/sentence_membuah.rds")
membuah$ind_mixed_2012_1M # for instance, retrieve sentences found in `ind_mixed_2012_1M-sentences.txt`
[1] "ind_mixed_2012_1M__314113__Dan , entah sudah berapa kali pejabat di Dinas Tata Kota dan Permukiman ( DTKP ) Pemkot telah berganti , tetapi persoalan bangunan mangkrak itu tak juga <m>membuah</m> solusi cepat ."
[2] "ind_mixed_2012_1M__635047__Pertemuan dengan ibu saudaranya <m>membuah</m> seribu kegembiraan ."
[3] "ind_mixed_2012_1M__912835__Peluang Irak lewat tendangan Younis Khalef di dalam kotak penalti juga masih belum <m>membuah</m> hasil ."
[4] "ind_mixed_2012_1M__912841__Namun penyerbuan ke al-Karak , 1183-84 tidak <m>membuah</m> hasil yang memuaskan ."
The sentence format is corpus-file-name__sentence-id-number__sentence-citation, with the three fields separated by double underscores. The <m>...</m> tag marks the matched word (i.e., the verb in question).
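The citation strings can be split into their parts with regular expressions. The sketch below is one possible way to do so in base R, assuming that the fields are separated by double underscores as in the printed examples; `parse_citation()` is a hypothetical helper and is not part of the original Notebook.

```r
# one of the citations printed above
example <- "ind_mixed_2012_1M__635047__Pertemuan dengan ibu saudaranya <m>membuah</m> seribu kegembiraan ."

# hypothetical helper: split "corpus__id__sentence" and extract the tagged match
parse_citation <- function(x) {
  # capture corpus name, numeric sentence id, and sentence text
  fields <- regmatches(x, regexec("^(.*?)__([0-9]+)__(.*)$", x, perl = TRUE))
  corpus   <- vapply(fields, function(f) f[2], character(1))
  id       <- vapply(fields, function(f) f[3], character(1))
  sentence <- vapply(fields, function(f) f[4], character(1))
  data.frame(
    corpus      = corpus,
    sentence_id = as.integer(id),
    # the word enclosed in <m>...</m>
    match       = sub("^.*<m>(.*?)</m>.*$", "\\1", sentence, perl = TRUE),
    # the sentence with the <m> tags stripped
    sentence    = gsub("</?m>", "", sentence),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_citation(example)
parsed
```

Because the function is vectorised over `x`, it could in principle be applied to a whole character vector of citations, such as `membuah$ind_mixed_2012_1M`.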
membuah (56 citations; Summary: analogy to membuahkan = 24 citations; split with -kan = 1; misspelling of membuat = 25; misspelling of membuang = 5; unclear = 1)
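The tallies in this summary can be checked, and turned into proportions, with a few lines of R. This is only a convenience sketch: the counts are copied from the summary above, and the vector name `membuah_notes` is made up for illustration.

```r
# category counts copied from the manual summary of the membuah citations
membuah_notes <- c(
  "analogy to membuahkan"   = 24,
  "split with -kan"         = 1,
  "misspelling of membuat"  = 25,
  "misspelling of membuang" = 5,
  "unclear"                 = 1
)

# the categories should add up to the total number of citations
total <- sum(membuah_notes)
total

# share of each category in percent, rounded to one decimal place
shares <- round(100 * membuah_notes / total, 1)
shares
```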
The following code retrieves the nearest words to mengakhiri (N = 8,512).
| | word | similarity to “mengakhiri” |
---|---|---|
1 | mengakhir | 0.7149269 |
2 | menyudahi | 0.6486199 |
3 | mengahiri | 0.6346930 |
4 | akhiri | 0.6329971 |
5 | berakhir | 0.6263411 |
6 | mengakhirinya | 0.6060989 |
7 | mengakiri | 0.5811454 |
8 | memimpin | 0.5763634 |
9 | memperpanjang | 0.5681369 |
10 | memupus | 0.5669336 |
Readers who wish to check all usage sentences for mengakhir, to confirm that it occurs as a full word form and has usage patterns similar to those of the meN-/-i form mengakhiri (i.e., in transitive constructions), can use the following code. It prints all sentences for mengakhir to the R console.
mengakhir <- readRDS("data/sentence_mengakhir.rds")
unlist(unname(mengakhir)) # print all 57 sentences in the console
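What `unlist(unname(...))` does can be seen on a small stand-in list. The sketch assumes, as the `membuah$ind_mixed_2012_1M` example suggests, that the .rds files hold a list with one character vector of citations per corpus file; the element names and sentences in `toy_sentences` are invented.

```r
# stand-in for a list of citations, one character vector per corpus file
toy_sentences <- list(
  corpus_a = c("corpus_a__1__first <m>x</m> sentence .",
               "corpus_a__2__second <m>x</m> sentence ."),
  corpus_b = "corpus_b__7__third <m>x</m> sentence ."
)

# number of citations per corpus file
lengths(toy_sentences)

# drop the list names, then flatten into one plain character vector
flat <- unlist(unname(toy_sentences))
flat
```

Removing the names first keeps `unlist()` from attaching labels such as `corpus_a1` to each sentence, so the console output shows only the citations themselves.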
Next, the nearest words to mengakhirkan (N = 116) are shown below.
| | word | similarity to “mengakhirkan” |
---|---|---|
1 | shalat | 0.8595279 |
2 | menjamak | 0.8582150 |
3 | shubuh | 0.8510190 |
4 | dijamak | 0.8504550 |
5 | zhuhur | 0.8434480 |
6 | qashar | 0.8383760 |
7 | isya | 0.8300283 |
8 | disunnahkan | 0.8250431 |
9 | menjama | 0.8227094 |
10 | qabliyah | 0.8215906 |
The following code extracts the nearest words to mengantung (N = 28), which is a misspelling of menggantung ‘to hang’ (with double g) (N = 1,100). The output shows that menggantung is indeed the nearest neighbour of mengantung in terms of their co-occurrence patterns.
| | word | similarity to “mengantung” |
---|---|---|
1 | menggantung | 0.6580294 |
2 | blandar | 0.6541215 |
3 | memaku | 0.6125143 |
4 | ditelungkupkan | 0.5888941 |
5 | dibaringkan | 0.5863362 |
6 | loso | 0.5808274 |
7 | digergaji | 0.5766285 |
8 | tertelungkup | 0.5721740 |
9 | tersekap | 0.5694923 |
10 | ditindih | 0.5677198 |
The code below extracts the nearest words to mengantungkan (N = 44), a misspelling of menggantungkan ‘to hang sth. (onto sth.)’ (N = 1,264).
| | word | similarity to “mengantungkan” |
---|---|---|
1 | menggantungkan | 0.7764547 |
2 | penghidupannya | 0.6517582 |
3 | peladang | 0.5890552 |
4 | matapencaharian | 0.5649209 |
5 | mengentaskannya | 0.5516661 |
6 | bersawah | 0.5515007 |
7 | petani-petani | 0.5469401 |
8 | pengais | 0.5458413 |
9 | bertani | 0.5404030 |
10 | upahan | 0.5370267 |
Next are the nearest words to mengantungi (N = 327), whose closest neighbour is the more common spelling mengantongi (N = 3,574), based on the root kantong ‘pocket’.
| | word | similarity to “mengantungi” |
---|---|---|
1 | mengantongi | 0.7021752 |
2 | mengoleksi | 0.6954547 |
3 | memuncaki | 0.6904767 |
4 | raihan | 0.6513341 |
5 | terpaut | 0.6435489 |
6 | mengungguli | 0.6411234 |
7 | torehan | 0.6217388 |
8 | pemuncak | 0.6164198 |
9 | mengemas | 0.6040049 |
10 | diposisi | 0.6004999 |
Finally, consider the case where part of a complex word is split off. The paper illustrates this with menanda, which is part of menandatangani ‘to give signature; to sign’ but is written separately with whitespace (menanda tangani), so that the two parts were tokenised as separate words.
| | word | similarity to “menanda” |
---|---|---|
1 | ditanda | 0.7982222 |
2 | menandatangani | 0.6721747 |
3 | menanda-tangani | 0.6597414 |
4 | menandatangai | 0.6295324 |
5 | menandatangi | 0.6148956 |
6 | tanganinya | 0.6112830 |
7 | ditandatangani | 0.6063122 |
8 | ditanda-tangani | 0.6057313 |
9 | ditandatanganinya | 0.5950392 |
10 | meneken | 0.5891363 |
menanda (N = 121; Summary: split = 113; intransitive usage = 5; transitive usage analogy = 2; ambiguous = 1)
─ Session info ─────────────────────────────────────────────────
─ Packages ─────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.0)
broom 0.5.5 2020-02-29 [1] CRAN (R 3.6.0)
callr 3.2.0 2019-03-15 [1] CRAN (R 3.6.0)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.0)
cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.0)
cluster * 2.1.0 2019-06-19 [2] CRAN (R 3.6.3)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
DBI 1.0.0 2018-05-02 [1] CRAN (R 3.6.0)
dbplyr 1.4.2 2019-06-17 [1] CRAN (R 3.6.0)
dendextend * 1.13.2 2019-12-02 [1] CRAN (R 3.6.0)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.0)
digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.0)
dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.6.0)
ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0)
farver 2.0.1 2019-11-13 [1] CRAN (R 3.6.0)
forcats * 0.5.0 2020-03-01 [1] CRAN (R 3.6.0)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0)
generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0)
ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 3.6.0)
glue 1.3.2 2020-03-12 [1] CRAN (R 3.6.0)
gridExtra 2.3 2017-09-09 [1] CRAN (R 3.6.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
haven 2.2.0 2019-11-08 [1] CRAN (R 3.6.0)
highr 0.8 2019-03-20 [1] CRAN (R 3.6.0)
hms 0.5.3 2020-01-08 [1] CRAN (R 3.6.0)
htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.6.0)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.0)
jsonlite 1.6 2018-12-07 [1] CRAN (R 3.6.0)
knitr 1.28 2020-02-06 [1] CRAN (R 3.6.0)
labeling 0.3 2014-08-23 [1] CRAN (R 3.6.0)
lattice 0.20-38 2018-11-04 [2] CRAN (R 3.6.3)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.0)
lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.6.0)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
modelr 0.1.5 2019-08-08 [1] CRAN (R 3.6.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
nlme 3.1-144 2020-02-06 [2] CRAN (R 3.6.3)
pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.0)
pkgbuild 1.0.3 2019-03-20 [1] CRAN (R 3.6.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0)
processx 3.3.1 2019-05-08 [1] CRAN (R 3.6.0)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
purrr * 0.3.3 2019-10-18 [1] CRAN (R 3.6.0)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
Rcpp 1.0.4 2020-03-17 [1] CRAN (R 3.6.0)
readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.0)
readxl 1.3.1 2019-03-13 [1] CRAN (R 3.6.0)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.0)
reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.0)
rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.0)
Rling * 1.0 2019-09-12 [1] local
rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.0)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.6.0)
rvest 0.3.5 2019-11-08 [1] CRAN (R 3.6.0)
scales 1.1.0 2019-11-18 [1] CRAN (R 3.6.0)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.0)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
testthat 2.3.1 2019-12-01 [1] CRAN (R 3.6.0)
tibble * 2.1.3 2019-06-06 [1] CRAN (R 3.6.0)
tidyr * 1.0.2 2020-01-24 [1] CRAN (R 3.6.0)
tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.0)
tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.6.0)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.0)
vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.6.0)
viridis 0.5.1 2018-03-29 [1] CRAN (R 3.6.0)
viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.6.0)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
wordVectors * 2.0 2019-06-01 [1] Github (bmschmidt/wordVectors@7f1914c)
xfun 0.12 2020-01-13 [1] CRAN (R 3.6.0)
xml2 1.2.2 2019-08-09 [1] CRAN (R 3.6.0)
[1] /Users/Primahadi/Rlibs
[2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
Biemann, Chris, Gerhard Heyer, Uwe Quasthoff & Matthias Richter. 2007. The Leipzig Corpora Collection: Monolingual corpora of standard size. In Matthew Davies, Paul Rayson, Susan Hunston & Pernilla Danielsson (eds.), Proceedings of the Corpus Linguistics Conference. University of Birmingham, UK. http://ucrel.lancs.ac.uk/publications/CL2007/paper/190_Paper.pdf (5 March, 2014).
Clark, Stephen. 2015. Vector space models of lexical meaning. In Shalom Lappin & Chris Fox (eds.), The Handbook of Contemporary Semantic Theory, 493–522. Second Edition. Hoboken: John Wiley & Sons.
Desagulier, Guillaume. 2017. Corpus Linguistics and Statistics with R. Cham: Springer International Publishing. doi:10.1007/978-3-319-64572-8.
Galili, Tal. 2015. Dendextend: An R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. doi:10.1093/bioinformatics/btv428.
Gries, Stefan Th. 2013. Statistics for linguistics with R: A practical introduction. 2nd ed. Berlin: Mouton de Gruyter.
Heylen, Kris, Thomas Wielfaert, Dirk Speelman & Dirk Geeraerts. 2015. Monitoring polysemy: Word space models as a tool for large-scale lexical semantic analysis. Lingua 157. (Polysemy: Current Perspectives and Approaches). 153–172. doi:10.1016/j.lingua.2014.12.001.
Hilpert, Martin & Florent Perek. 2015. Meaning change in a petri dish: Constructions, semantic vector spaces, and motion charts. Linguistics Vanguard 1(1). doi:10.1515/lingvan-2015-0013.
Larasati, Septina Dian, Vladislav Kuboň & Daniel Zeman. 2011. Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus. In Systems and Frameworks for Computational Morphology, 119–129. Springer, Berlin, Heidelberg. doi:10.1007/978-3-642-23138-4_8.
Levshina, Natalia. 2014. Geographic variation of Quite + ADJ in twenty national varieties of English: A pilot study. Yearbook of the German Cognitive Linguistics Association 2(1). 109–126. doi:10.1515/gcla-2014-0008.
Levshina, Natalia. 2015. How to do Linguistics with R: Data exploration and statistical analysis. John Benjamins Publishing Company.
Maechler, Martin, Peter Rousseeuw, Anja Struyf, Mia Hubert & Kurt Hornik. 2018. Cluster: Cluster Analysis Basics and Extensions.
Mikolov, Tomas, Kai Chen, Greg Corrado & Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. http://arxiv.org/abs/1301.3781 (14 December, 2018).
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado & Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. http://arxiv.org/abs/1310.4546 (14 December, 2018).
Mikolov, Tomas, Wen-tau Yih & Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 746–751. Atlanta, Georgia: Association for Computational Linguistics. http://www.aclweb.org/anthology/N13-1090 (14 December, 2018).
Nomoto, Hiroki, Hannah Choi, David Moeljadi & Francis Bond. 2018. MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian. In Proceedings of the LREC 2018 Workshop "The 13th Workshop on Asian Language Resources", 36–43. http://lrec-conf.org/workshops/lrec2018/W29/pdf/8_W29.pdf.
Nomoto, Hiroki & David Moeljadi. 2019. Linguistic studies using large annotated corpora: Introduction. (Ed.) Hiroki Nomoto & David Moeljadi. NUSA 67. (Linguistic Studies Using Large Annotated Corpora). 1–6. http://repository.tufs.ac.jp/handle/10108/94450 (1 April, 2020).
Perek, Florent. 2016. Recent change in the productivity and schematicity of the Way-construction: A distributional semantic analysis. Corpus Linguistics and Linguistic Theory. doi:10.1515/cllt-2016-0014.
Quasthoff, Uwe & Dirk Goldhahn. 2013. Indonesian corpora. (Technical Report Series on Corpus Building). Leipzig, Germany: Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig. http://asvdoku.informatik.uni-leipzig.de/corpora/data/uploads/corpus-building-vol7-ind.pdf (26 July, 2015).
Rajeg, Gede Primahadi Wijaya, Karlina Denistia & Simon Musgrave. 2019a. R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. doi:10.6084/m9.figshare.9970205. https://figshare.com/articles/R_Markdown_Notebook_for_i_Vector_space_model_and_the_usage_patterns_of_Indonesian_denominal_verbs_i_/9970205.
Rajeg, Gede Primahadi Wijaya, Karlina Denistia & Simon Musgrave. 2019b. Vector space models and the usage patterns of Indonesian denominal verbs: A case study of verbs with meN-, meN-/-kan, and meN-/-i affixes. (Ed.) Hiroki Nomoto & David Moeljadi. NUSA 67. (Linguistic Studies Using Large Annotated Corpora). 35–76. http://repository.tufs.ac.jp/handle/10108/94452 (1 April, 2020).
Rajeg, Gede Primahadi Wijaya, Karlina Denistia & Simon Musgrave. 2019c. Dataset for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. doi:10.6084/m9.figshare.8187155. https://figshare.com/articles/Dataset_for_i_Vector_space_model_and_the_usage_patterns_of_Indonesian_denominal_verbs_i_/8187155.
Schmidt, Ben & Jian Li. 2017. wordVectors: Tools for creating and analyzing vector-space models of texts. http://github.com/bmschmidt/wordVectors.
Sneddon, James Neil, Alexander Adelaar, Dwi Noverini Djenar & Michael C. Ewing. 2010. Indonesian reference grammar. 2nd ed. Crows Nest, New South Wales, Australia: Allen & Unwin.
Wickham, Hadley & Garrett Grolemund. 2017. R for Data Science. Canada: O’Reilly. http://r4ds.had.co.nz/ (7 March, 2017).