This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Please cite this R Notebook as follows (in Unified Style Sheet for Linguistics):
Rajeg, Gede Primahadi Wijaya, Karlina Denistia & Simon Musgrave. 2019. R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. https://doi.org/10.6084/m9.figshare.9970205. https://figshare.com/articles/R_Markdown_Notebook_for_i_Vector_space_model_and_the_usage_patterns_of_Indonesian_denominal_verbs_i_/9970205.
# global chunk options
knitr::opts_chunk$set(fig.width = 7,
                      fig.asp = 0.618,
                      dpi = 300,
                      dev = "pdf",
                      echo = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      fig.path = "figures/")
This is an R Markdown Notebook (Rajeg, Denistia & Musgrave 2019a) for the statistical analyses accompanying our paper (Rajeg, Denistia & Musgrave 2019b) on vector space models and Indonesian denominal verbs (published in NUSA’s special issue titled Linguistic studies using large annotated corpora, edited by Hiroki Nomoto and David Moeljadi) (Nomoto & Moeljadi 2019). The Notebook, however, does not provide a detailed exposition and discussion of each point, nor English glosses for the Indonesian words in the tables. Readers are referred to our paper for details. Check the README page for the tutorial on downloading this Notebook and the data (Rajeg, Denistia & Musgrave 2019c) for the paper.
The following R packages have to be installed and loaded to run all the code in this Notebook:
Click the Code button to hide/reveal the R code.
# load the required packages
## make sure all packages are installed!
library(cluster)
library(tidyverse)
library(dendextend)
library(wordVectors) # should be installed from Ben Schmidt's GitHub
library(Rling) # <- `Rling` package from Natalia Levshina's (2015) book 'How to do linguistics with R' published by John Benjamins.
For illustration, we used six deverbal nouns with the suffix -an. They are bacaan ‘reading’ (from the root baca ‘to read’), tulisan ‘writing’ (from tulis ‘to write’), lukisan ‘painting’ (from lukis ‘to paint/draw’), masakan ‘cooking’ (from masak ‘to cook’), makanan ‘food’ (from makan ‘to eat’), and minuman ‘beverages’ (from minum ‘to drink’). We retrieved collocates within a three-word window on either side (left and right) of each noun. We used one of the Indonesian Leipzig Corpora files, namely ind_mixed_2012_1M-sentences.txt (see Biemann et al. 2007; Quasthoff & Goldhahn 2013).
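The window-based retrieval can be illustrated with a minimal base-R sketch. It is not the script used for the paper: the helper `get_window_collocates()` and the toy sentence are hypothetical, and they only show what a three-word window to the left and right of a node word amounts to.

```r
# Hypothetical helper (not from the paper): collect the tokens found within
# `span` positions to the left and right of every occurrence of `node`.
get_window_collocates <- function(tokens, node, span = 3) {
  hits <- which(tokens == node)
  collocates <- lapply(hits, function(i) {
    # positions inside the window, clipped at the sentence edges,
    # with the node position itself removed
    idx <- setdiff(seq(max(1, i - span), min(length(tokens), i + span)), i)
    tokens[idx]
  })
  unlist(collocates)
}

# Invented toy sentence, already lower-cased; split on single spaces
toy_sentence <- "dia suka membaca bacaan alkitab setiap hari minggu pagi"
toy_tokens <- strsplit(toy_sentence, " ", fixed = TRUE)[[1]]

toy_collocates <- get_window_collocates(toy_tokens, node = "bacaan", span = 3)
print(toy_collocates)
```

For the node at position 4, the call returns the three tokens on each side; a window that runs past the start or end of the sentence is simply truncated.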
The following code processes the retrieved collocates into a co-occurrence frequency table and prints a subset of this table. The collocate data are available in the "vsm_creation_data.rds" file. This .rds file contains a List data structure; the table version is available as tab-delimited .csv and .txt files (vsm_creation_data_tb.txt & vsm_creation_data_tb.csv).
df_list <- readRDS(file = "data/vsm_creation_data.rds")
df_ex <- purrr::map_df(df_list, dplyr::bind_rows); rm(df_list)
df_ex1 <- dplyr::filter(df_ex,
                        !is.na(w), # remove NAs
                        nchar(w) > 1, # remove one-character tokens
                        # keep tokens consisting of lowercase letters and hyphens only
                        stringr::str_detect(w, "^[a-z-]+$"),
                        # remove incomplete words ending in a hyphen
                        !stringr::str_detect(w, "^[a-z]+-$"),
                        # remove incomplete words starting with a hyphen
                        !stringr::str_detect(w, "^-[a-z]+$"))
df_ex1_count <- dplyr::count(df_ex1, node, w, sort = TRUE)
df_ex1_count_spread <- tidyr::spread(df_ex1_count, w, n, fill = 0)
df_ex1_count_spread <- as.data.frame(df_ex1_count_spread)
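# keep the `node` column (first position) plus every collocate column with a non-zero total
df_ex1_count_spread <- df_ex1_count_spread[, c(TRUE, colSums(df_ex1_count_spread[, -1]) > 0)]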
rownames(df_ex1_count_spread) <- df_ex1_count_spread$node
df_ex1_count_spread <- df_ex1_count_spread[, -1]
selected_column <- c("alkitab",
"dituangkan", "halal", "keras",
"mengandung", "penutup")
df_ex1_print <- df_ex1_count_spread[, selected_column]
rownames(df_ex1_print) <- paste("*", rownames(df_ex1_print), "*", sep = "")
colnames(df_ex1_print) <- paste("*", colnames(df_ex1_print), "*", sep = "")
df_ex1_print$`...` <- rep("...", nrow(df_ex1_print))
knitr::kable(df_ex1_print, caption = "Raw co-occurrence frequency for the target words with the context words")
| | alkitab | dituangkan | halal | keras | mengandung | penutup | … |
|---|---|---|---|---|---|---|---|
| bacaan | 31 | 0 | 0 | 0 | 0 | 0 | … |
| lukisan | 0 | 0 | 0 | 1 | 0 | 0 | … |
| makanan | 0 | 0 | 19 | 12 | 55 | 7 | … |
| masakan | 0 | 0 | 0 | 0 | 1 | 0 | … |
| minuman | 0 | 0 | 0 | 150 | 14 | 0 | … |
| tulisan | 16 | 6 | 0 | 2 | 2 | 0 | … |
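The counting step behind such a table can also be sketched in base R on a handful of invented node-collocate pairs. The data frame `toy_pairs` and its counts are made up for illustration; only the column names `node` and `w` follow the code above, and `table()` is used here as a stand-in for the `count()` and `spread()` calls.

```r
# Invented node-collocate pairs (one row per retrieved collocate token)
toy_pairs <- data.frame(
  node = c("bacaan", "bacaan", "minuman", "minuman", "minuman", "makanan"),
  w    = c("alkitab", "alkitab", "keras", "keras", "mengandung", "halal"),
  stringsAsFactors = FALSE
)

# Cross-tabulate nodes (rows) against collocates (columns);
# pairs that never co-occur receive a count of 0
toy_mtx <- unclass(table(node = toy_pairs$node, w = toy_pairs$w))
print(toy_mtx)
```

Each row of `toy_mtx` is the raw co-occurrence vector of one target word, which is the shape the later weighting and similarity steps expect.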
The current approach in VSM adopts a more principled method of weighting the initial raw-frequency vectors into statistical measures of collocation strength before computing (dis)similarity measures between the target words. The goal of the weighting is “to give a higher weight to context words that co-occur significantly more often than expected by chance” (Heylen et al. 2015:156; cf. Clark 2015:503–504; Perek 2016:12). These significantly associated context words are assumed to be more informative for the semantics of the target words (Heylen et al. 2015:156). The popular weighting measure used in VSM is the Pointwise Mutual Information (PMI) (see Levshina 2015:327–328 for computing PMI in R):
\[\text{PMI}(x, y) = \log_{2} \frac{O_{xy}}{E_{xy}}\]
where \(O_{xy}\) represents the observed co-occurrence frequency between x and y, while \(E_{xy}\) is their expected co-occurrence frequency, that is, the frequency expected under a chance distribution of x and y given their overall distribution in the corpus. Negative PMI values are normally replaced with zero, resulting in the Positive PMI (PPMI) (Levshina 2015; Hilpert & Perek 2015).
mtx <- as.matrix(df_ex1_count_spread)
mtx_exp <- chisq.test(mtx)$expected
mtx_pmi <- log2(mtx/mtx_exp); rm(mtx)
mtx_ppmi <- ifelse(mtx_pmi < 0, 0, mtx_pmi); rm(mtx_exp)
ppmiprint <- as.data.frame(round(mtx_ppmi, digits = 2)); rm(mtx_pmi)
ppmiprint <- ppmiprint[, selected_column]
rownames(ppmiprint) <- paste("*", rownames(ppmiprint), "*", sep = "")
colnames(ppmiprint) <- paste("*", colnames(ppmiprint), "*", sep = "")
ppmiprint$`...` <- rep("...", nrow(df_ex1_count_spread))
knitr::kable(ppmiprint, caption = "Weighted co-occurrence frequency with *Positive Pointwise Mutual Information*")
| | alkitab | dituangkan | halal | keras | mengandung | penutup | … |
|---|---|---|---|---|---|---|---|
| bacaan | 3.62 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | … |
| lukisan | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | … |
| makanan | 0.00 | 0.00 | 1.05 | 0.00 | 0.67 | 1.05 | … |
| masakan | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | … |
| minuman | 0.00 | 0.00 | 0.00 | 3.07 | 0.85 | 0.00 | … |
| tulisan | 0.31 | 1.87 | 0.00 | 0.00 | 0.00 | 0.00 | … |
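To make the weighting step concrete, the following sketch works through PMI and PPMI on a tiny invented 2 × 2 matrix. The counts and the names `target_a`, `target_b`, `context_x`, and `context_y` are assumptions for illustration; the expected frequencies are computed by hand as (row total × column total) / grand total, which is the quantity `chisq.test()` returns in `$expected`.

```r
# Invented co-occurrence counts: rows are target words, columns are context words
toy <- matrix(c(8, 2,
                1, 9),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("target_a", "target_b"),
                              c("context_x", "context_y")))

# Expected frequencies under independence
toy_expected <- outer(rowSums(toy), colSums(toy)) / sum(toy)

# PMI = log2(observed / expected); a zero count would give -Inf
toy_pmi <- log2(toy / toy_expected)

# PPMI: replace every negative PMI value (including -Inf) with zero
toy_ppmi <- pmax(toy_pmi, 0)

print(round(toy_expected, 2))
print(round(toy_ppmi, 2))
```

Here `target_a` occurs with `context_x` 8 times against an expected 4.5, so its PMI is positive, whereas its 2 co-occurrences with `context_y` fall below the expected 5.5 and are set to zero in the PPMI matrix.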
Further analysis can be performed. The most common one is determining the semantic (dis)similarity between the target words (i.e., which words are more similar to or more different from one another). The following section briefly discusses the Cosine Similarity and Hierarchical Agglomerative Cluster (HAC) analyses as the exploratory tools (Levshina 2014; Levshina 2015).
In VSM, cosine similarity is the popular measure for computing pairwise (dis)similarity between the target words. Cosine similarity computes the cosine of the angle between the words’ vectors to capture their (dis)similarity. The cosine value between a pair of words is close to 1 when they are semantically more similar, and close to 0 otherwise (see Table @ref(tab:vsm-xmpl-cossim)) (Levshina 2015, Ch. 16).
The following code computes cosine similarity using the cossim() function from the Rling package (Levshina 2015:329).
# Cosine Similarity
mtx_cossim <- Rling::cossim(mtx_ppmi); rm(mtx_ppmi)
# Generate CosSim table output
attr(mtx_cossim, "dimnames")[[1]] <- paste("*",
attr(mtx_cossim, "dimnames")[[1]],
"*",
sep = "")
attr(mtx_cossim, "dimnames")[[2]] <- paste("*",
attr(mtx_cossim, "dimnames")[[2]],
"*",
sep = "")
knitr::kable(round(mtx_cossim, digits = 2),
caption = "Cosine similarity matrix between the deverbal nouns")
| | bacaan | lukisan | makanan | masakan | minuman | tulisan |
|---|---|---|---|---|---|---|
| bacaan | 1.00 | 0.04 | 0.02 | 0.03 | 0.02 | 0.06 |
| lukisan | 0.04 | 1.00 | 0.01 | 0.03 | 0.04 | 0.04 |
| makanan | 0.02 | 0.01 | 1.00 | 0.03 | 0.03 | 0.00 |
| masakan | 0.03 | 0.03 | 0.03 | 1.00 | 0.04 | 0.02 |
| minuman | 0.02 | 0.04 | 0.03 | 0.04 | 1.00 | 0.02 |
| tulisan | 0.06 | 0.04 | 0.00 | 0.02 | 0.02 | 1.00 |
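What the cosine measure computes can be checked by hand with a few lines of base R. This is a sketch of the textbook definition (dot product divided by the product of the vector lengths), not the source code of `Rling::cossim()`; the function name `cosine_by_hand()` and the toy vectors are hypothetical.

```r
# Hypothetical helper: pairwise cosine similarity between the rows of a matrix
cosine_by_hand <- function(m) {
  norms <- sqrt(rowSums(m^2))           # Euclidean length of each row vector
  (m %*% t(m)) / outer(norms, norms)    # dot products scaled by the lengths
}

# Invented vectors: `b` points in the same direction as `a`,
# while `c` shares no non-zero dimension with either
toy_vectors <- rbind(a = c(1, 2, 0),
                     b = c(2, 4, 0),
                     c = c(0, 0, 3))

toy_cos <- cosine_by_hand(toy_vectors)
print(round(toy_cos, 2))
```

Because cosine depends only on direction, `a` and `b` score 1 even though `b` is twice as long, and `c` scores 0 against both. With non-negative weights such as PPMI, all values fall between 0 and 1.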
For a cluster analysis such as HAC (Levshina 2014; see also Gries 2013:336; Levshina 2015, Ch. 15; and Desagulier 2017:276 for R implementations of HAC), the similarity matrix/table above needs to be converted into a distance matrix as input for the cluster analysis (see the code below, adapted from Levshina 2015:330).
# Distance matrix computation
mtx_dist <- 1 - (mtx_cossim/max(mtx_cossim[mtx_cossim != 1]))
mtx_dist <- as.dist(mtx_dist)
# Cluster analysis
mtx_hcl <- hclust(mtx_dist, method = "ward.D2")
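The similarity-to-distance conversion can be traced on a small invented similarity matrix. The values in `toy_sim` are assumptions for illustration. As a small variation on the code above, the largest off-diagonal similarity is taken from the upper triangle rather than by excluding cells equal to 1, so the sketch does not rely on the diagonal being exactly 1.

```r
# Invented symmetric similarity matrix for three items
toy_sim <- matrix(c(1.00, 0.06, 0.02,
                    0.06, 1.00, 0.03,
                    0.02, 0.03, 1.00),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Largest similarity between two different items
max_offdiag <- max(toy_sim[upper.tri(toy_sim)])

# Rescale so that the most similar pair gets distance 0; the diagonal
# becomes negative here, but as.dist() only keeps the lower triangle
toy_dist <- as.dist(1 - toy_sim / max_offdiag)
print(round(toy_dist, 3))

# The resulting object can be passed to hclust() as in the code above
toy_hcl <- hclust(toy_dist, method = "ward.D2")
print(toy_hcl$merge)
```

Dividing by the largest off-diagonal value stretches the very small cosine values over the full 0 to 1 range, so the pair `a`-`b` ends up at distance 0 and is merged first.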
The output of HAC can be visualised as a dendrogram (see Figure @ref(fig:vsm-xmpl-dendrogram)). To determine the optimal number of clusters for grouping the nouns, we used the Average Silhouette Width (ASW) scores and tested two- to five-cluster solutions. The two-cluster solution produces the highest ASW score. See the code below.
# ASW calculation
mtx_asw <- sapply(2:(nrow(mtx_cossim) - 1), function(x) summary(cluster::silhouette(stats::cutree(mtx_hcl, k = x), mtx_dist))$avg.width)
# name the ASW scores after the cluster solutions tested (k = 2 to 5)
names(mtx_asw) <- 2:(nrow(mtx_cossim) - 1)
# identify the maximum ASW score
max_asw <- as.numeric(names(mtx_asw[mtx_asw == max(mtx_asw)]))
# plotting dendrogram
mtx_hcl$labels <- gsub('\\*', '', mtx_hcl$labels)
plot(mtx_hcl, hang = -1)
rect.hclust(mtx_hcl, k = max_asw)
# generate the text annotation within the plot
added_info <- paste("The ", max_asw, "-cluster solution is based on the highest Average Silhouette Width (ASW) score of ", round(max(mtx_asw), 3), ".\n(ASW ranges from -1 to 1; min. ASW score for assuming substantial cluster is 0.2)", sep = "")
mtext(added_info, side = 1, line = 1.75, cex = .8)
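The ASW-based choice of the number of clusters can be tried out on invented data where the answer is known in advance. In the sketch below, `toy_points` holds six made-up two-dimensional points forming two well-separated groups, and Euclidean distance from `dist()` stands in for the cosine-based distance used above; everything else follows the same `hclust()`, `cutree()`, and `silhouette()` sequence.

```r
library(cluster)

# Invented points: a, b, c lie close together, as do d, e, f
toy_points <- rbind(a = c(0, 0),   b = c(0, 1),   c = c(1, 0),
                    d = c(10, 10), e = c(10, 11), f = c(11, 10))

toy_dist <- dist(toy_points)                      # Euclidean distances
toy_hcl <- hclust(toy_dist, method = "ward.D2")   # Ward clustering

# Average silhouette width for every solution from 2 to n - 1 clusters
k_values <- 2:(nrow(toy_points) - 1)
toy_asw <- sapply(k_values, function(k) {
  sil <- cluster::silhouette(cutree(toy_hcl, k = k), toy_dist)
  summary(sil)$avg.width
})
names(toy_asw) <- k_values

# Number of clusters with the highest ASW
best_k <- as.numeric(names(which.max(toy_asw)))
print(round(toy_asw, 3))
print(best_k)
```

Splitting either tight group further creates singleton or near-singleton clusters, whose silhouette widths are 0 or small, so the ASW drops for every solution beyond two clusters.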