Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

1 How to cite this R Notebook

Please cite this R Notebook as follows (in Unified Style Sheet for Linguistics):

Rajeg, Gede Primahadi Wijaya, Karlina Denistia & Simon Musgrave. 2019. R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. https://doi.org/10.6084/m9.figshare.9970205. https://figshare.com/articles/R_Markdown_Notebook_for_i_Vector_space_model_and_the_usage_patterns_of_Indonesian_denominal_verbs_i_/9970205.

2 Preface

This is an R Markdown Notebook (Rajeg, Denistia & Musgrave 2019a) for the statistical analyses accompanying our paper (Rajeg, Denistia & Musgrave 2019b) on vector space models and Indonesian denominal verbs (published in NUSA’s special issue titled Linguistic studies using large annotated corpora, edited by Hiroki Nomoto and David Moeljadi) (Nomoto & Moeljadi 2019). The Notebook, however, does not provide detailed exposition and discussion for each point, including English glossing of the Indonesian words in the tables. Readers are referred to our paper for details. See the README page here for a tutorial on downloading this Notebook and the data (Rajeg, Denistia & Musgrave 2019c) for the paper.

The following R packages have to be installed and loaded to run all code in this Notebook (a setup sketch follows the list):

  • cluster (Maechler et al. 2018)
  • tidyverse (Wickham & Grolemund 2017)
  • dendextend (Galili 2015)
  • wordVectors (Schmidt & Li 2017)
  • Rling (Levshina 2015)
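
Below is a minimal setup sketch for installing and loading these packages. The installation routes for wordVectors and Rling are assumptions on our part: neither is on CRAN, wordVectors is typically installed from its GitHub repository, and Rling ships with the companion materials of Levshina (2015).

```r
# Minimal setup sketch; the sources for wordVectors and Rling are assumptions.
install.packages(c("cluster", "tidyverse", "dendextend"))
# devtools::install_github("bmschmidt/wordVectors")                    # assumed GitHub source
# install.packages("Rling_1.0.tar.gz", repos = NULL, type = "source")  # from book materials

library(cluster)
library(tidyverse)
library(dendextend)
library(wordVectors)
library(Rling)
```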

Click the Code button to hide or reveal the R code.

3 Illustration for generating Vector Space Models (VSM)

For illustration, we used six deverbal nouns with the suffix -an. They are bacaan ‘reading’ (from the root baca ‘to read’), tulisan ‘writing’ (from tulis ‘to write’), lukisan ‘painting’ (from lukis ‘to paint/draw’), masakan ‘cooking’ (from masak ‘to cook’), makanan ‘food’ (from makan ‘to eat’), and minuman ‘beverages’ (from minum ‘to drink’). We retrieved collocates within a three-word window on either side (left and right) of each noun. We used one of the Indonesian Leipzig Corpora files, namely ind_mixed_2012_1M-sentences.txt (see Biemann et al. 2007; Quasthoff & Goldhahn 2013).
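
The retrieval code itself is hidden in this section; the following is a minimal sketch of that step, assuming the Leipzig corpus file stores one sentence per line prefixed by a tab-separated sentence ID. All names below (e.g., get_collocates) are illustrative, not the Notebook’s own.

```r
# Sketch of the collocate retrieval; the file-format assumption
# (ID<TAB>sentence per line) and all names below are illustrative.
library(tidyverse)

nouns <- c("bacaan", "tulisan", "lukisan", "masakan", "makanan", "minuman")

sents <- read_lines("ind_mixed_2012_1M-sentences.txt") %>%
  str_replace("^\\d+\\t", "") %>%  # drop the sentence-ID column
  str_to_lower()

# collect tokens within a three-word window on either side of the node word
get_collocates <- function(sentence, node, span = 3) {
  toks <- str_split(sentence, "\\s+")[[1]]
  hits <- which(toks == node)
  unlist(lapply(hits, function(i) {
    toks[setdiff(max(1, i - span):min(length(toks), i + span), i)]
  }))
}

colloc <- map(set_names(nouns), function(nd) {
  unlist(lapply(sents[str_detect(sents, fixed(nd))], get_collocates, node = nd))
})
```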

3.1 Creating the co-occurrence frequency table

The following code processes the retrieved collocates into a co-occurrence frequency table and prints a subset of this table. The collocate data are available in the "vsm_creation_data.rds" file. This .rds file contains a list data structure; the table version is available as tab-delimited .txt and .csv files (vsm_creation_data_tb.txt & vsm_creation_data_tb.csv).
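
A minimal sketch of this step, assuming the list in "vsm_creation_data.rds" holds one character vector of collocates per target noun (the actual internal structure may differ):

```r
# Sketch: build the noun-by-collocate co-occurrence frequency table;
# assumes a named list with one character vector of collocates per noun.
colloc <- readRDS("vsm_creation_data.rds")

cooc_tb <- colloc %>%
  map_df(~ tibble(collocate = .x), .id = "noun") %>%
  count(noun, collocate) %>%
  pivot_wider(names_from = collocate, values_from = n, values_fill = 0)

# print the subset of the table shown below
cooc_tb %>%
  select(noun, alkitab, dituangkan, halal, keras, mengandung, penutup)
```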

          alkitab  dituangkan  halal  keras  mengandung  penutup
bacaan         31           0      0      0           0        0
lukisan         0           0      0      1           0        0
makanan         0           0     19     12          55        7
masakan         0           0      0      0           1        0
minuman         0           0      0    150          14        0
tulisan        16           6      0      2           2        0

3.2 Computing the Positive Pointwise Mutual Information (PPMI)

Current approaches in VSM adopt a more principled method: weighting the initial raw-frequency vectors with statistical measures of collocation strength before computing (dis)similarity measures between the target words. The goal of the weighting is “to give a higher weight to context words that co-occur significantly more often than expected by chance” (Heylen et al. 2015:156; cf. Clark 2015:503–504; Perek 2016:12). These significantly associated context words are assumed to be more informative for the semantics of the target words (Heylen et al. 2015:156). A popular weighting measure used in VSM is the Pointwise Mutual Information (PMI) (see Levshina 2015:327–328 for computing PMI in R):

\[\text{PMI}(x, y) = \log_{2} \frac{O_{xy}}{E_{xy}}\]

where \(O_{xy}\) represents the observed co-occurrence frequency of x and y, while \(E_{xy}\) is their expected co-occurrence frequency, that is, the frequency expected by chance given the overall distributions of x and y in the corpus. Negative PMI values are normally replaced with zero, resulting in the Positive PMI (PPMI) (Levshina 2015; Hilpert & Perek 2015).
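
A minimal sketch of the PPMI weighting applied to the co-occurrence table above; the name cooc_mat is illustrative, and the expected frequencies here are estimated from the table’s own marginals, which may differ from corpus-wide estimates (cf. Levshina 2015:327–328):

```r
# Sketch: weight the raw co-occurrence matrix with PPMI.
cooc_mat <- as.matrix(cooc_tb[, -1])
rownames(cooc_mat) <- cooc_tb$noun

N    <- sum(cooc_mat)                                # total co-occurrence tokens
E    <- rowSums(cooc_mat) %o% colSums(cooc_mat) / N  # expected frequencies E_xy
pmi  <- log2(cooc_mat / E)                           # PMI = log2(O_xy / E_xy)
ppmi <- pmax(pmi, 0)                                 # negative values (and -Inf) -> 0
round(ppmi, 2)
```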

          alkitab  dituangkan  halal  keras  mengandung  penutup
bacaan       3.62        0.00   0.00   0.00        0.00     0.00
lukisan      0.00        0.00   0.00   0.00        0.00     0.00
makanan      0.00        0.00   1.05   0.00        0.67     1.05
masakan      0.00        0.00   0.00   0.00        0.00     0.00
minuman      0.00        0.00   0.00   3.07        0.85     0.00
tulisan      0.31        1.87   0.00   0.00        0.00     0.00

Further analyses can be performed. The most common one is determining the semantic (dis)similarity between the target words (i.e., which words are more similar to, or different from, one another). The following section briefly discusses Cosine Similarity and Hierarchical Agglomerative Cluster (HAC) analysis as exploratory tools (Levshina 2014; Levshina 2015).

3.3 Exploring VSM with Cosine Similarity and Hierarchical Agglomerative Cluster (HAC) analysis

In VSM, cosine similarity is a popular measure for computing pairwise (dis)similarity between the target words. It computes the cosine of the angle between the words’ vectors to capture their (dis)similarity. The cosine value between a pair of words is close to 1 when they are semantically more similar, and close to 0 otherwise (see the table below) (Levshina 2015, Ch. 16).
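
For reference, the standard definition of the cosine similarity between two word vectors \(\vec{x}\) and \(\vec{y}\) is:

\[\text{cosine}(\vec{x}, \vec{y}) = \frac{\sum_{i} x_{i} y_{i}}{\sqrt{\sum_{i} x_{i}^{2}} \sqrt{\sum_{i} y_{i}^{2}}}\]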

The following code computes the cosine similarity using the cossim() function from the Rling package (Levshina 2015:329).
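
A minimal sketch, assuming cossim() takes two numeric vectors and returns their cosine similarity; the looping scaffold around it is ours, not Levshina’s, and ppmi is the weighted matrix from section 3.2.

```r
# Sketch: pairwise cosine similarity matrix from the PPMI-weighted vectors;
# assumes Rling::cossim(x, y) returns the cosine of two numeric vectors.
library(Rling)

nouns   <- rownames(ppmi)
cos_mat <- matrix(NA, length(nouns), length(nouns),
                  dimnames = list(nouns, nouns))
for (i in nouns) {
  for (j in nouns) {
    cos_mat[i, j] <- cossim(ppmi[i, ], ppmi[j, ])
  }
}
round(cos_mat, 2)
```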

          bacaan  lukisan  makanan  masakan  minuman  tulisan
bacaan      1.00     0.04     0.02     0.03     0.02     0.06
lukisan     0.04     1.00     0.01     0.03     0.04     0.04
makanan     0.02     0.01     1.00     0.03     0.03     0.00
masakan     0.03     0.03     0.03     1.00     0.04     0.02
minuman     0.02     0.04     0.03     0.04     1.00     0.02
tulisan     0.06     0.04     0.00     0.02     0.02     1.00

For cluster analysis, such as HAC (Levshina 2014; see also Gries 2013:336; Levshina 2015, Ch. 15; and Desagulier 2017:276 for R implementations of HAC), the similarity matrix/table above needs to be converted into a distance matrix as input for the cluster analysis (see the code below, adapted from Levshina 2015:330).
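
A sketch of the conversion and clustering steps; the Ward amalgamation method below is our assumption, as Levshina (2015:330) may use a different one.

```r
# Sketch: a cosine similarity of 1 means identical direction, so
# 1 - similarity serves as a dissimilarity (distance) measure.
dist_mat <- as.dist(1 - cos_mat)

# hierarchical agglomerative clustering; the method choice is an assumption
hac <- hclust(dist_mat, method = "ward.D2")
```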

The output of HAC can be visualised as a dendrogram (see the figure below). To determine the optimal number of clusters for grouping the nouns, we used the Average Silhouette Width (ASW) scores and tested two- up to five-cluster solutions. The two-cluster solution produces the highest ASW score. See the code below.
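
A minimal sketch of the dendrogram plot and the ASW computation with cluster::silhouette(); the plot title is illustrative.

```r
# Sketch: visualise the HAC result and compute the Average Silhouette
# Width (ASW) for two- up to five-cluster solutions.
plot(as.dendrogram(hac), main = "HAC of the six -an nouns")  # illustrative title

asw <- sapply(2:5, function(k) {
  cl <- cutree(hac, k = k)
  mean(silhouette(cl, dist_mat)[, "sil_width"])
})
names(asw) <- paste0(2:5, "-cluster")
asw  # the two-cluster solution yields the highest ASW
```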