Mutual Information score — collex

Compute the Mutual Information (MI) score as a collocation association measure.

collex_MI(df, collstr_digit = 3)

Arguments

df	The output of `assoc_prepare`.
collstr_digit	A numeric value giving the number of decimal digits for the collostruction strength. The default is `3`.

Value

A tibble containing the node word (column `node`), its collocates (column `w`), their co-occurrence frequencies with the node (column `a`), the expected co-occurrence frequencies with the node (column `a_exp`), the direction of the association, i.e., attraction or repulsion (column `assoc`), the Mutual Information score (column `MI`), and two uni-directional Delta P association measures (the columns prefixed with `dP_`).
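
The `MI` values in the example output on this page are consistent with the common definition of MI as the base-2 logarithm of the observed over the expected co-occurrence frequency. The following is a minimal sketch in base R under that assumption. The helper name `mi_score` is hypothetical and is not part of the package, and the package's internal implementation may differ in detail.

```r
# Hypothetical helper (not part of collogetr): MI as log2(observed / expected).
# `digits` mirrors the role of `collstr_digit`, assuming it controls rounding.
mi_score <- function(a, a_exp, digits = 3) {
  round(log2(a / a_exp), digits)
}

# Observed (a) and expected (a_exp) frequencies for "rumah", "luar", and "arah",
# copied from the example output on this page.
mi_score(a = c(10, 6, 5), a_exp = c(0.578, 0.273, 0.127))
```

Because the `a_exp` values shown in the printed tibble are already rounded, the results of this sketch should be close to, but not necessarily identical with, the `MI` column.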

Examples

out <- colloc_leipzig(leipzig_corpus_list = demo_corpus_leipzig,
    pattern = "ke", # it is a preposition meaning 'to(wards)'
    window = "r",
    span = 2L,
    save_interim = FALSE)
#> Detecting a 'named list' input!
#> You chose NOT to SAVE INTERIM RESULTS, which will be stored as a list in console!
#> 1. Tokenising the "ind_mixed_2012_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_mixed_2012_1M.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind_news_2008_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_news_2008_300K.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind_news_2009_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_news_2009_300K.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind_news_2010_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_news_2010_300K.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind_news_2011_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_news_2011_300K.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind_news_2012_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_news_2012_300K.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind_newscrawl_2011_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_newscrawl_2011_1M.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind_newscrawl_2012_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_newscrawl_2012_1M.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind_newscrawl_2015_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_newscrawl_2015_300K.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind_newscrawl_2016_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_newscrawl_2016_1M.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind_web_2011_300K" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_web_2011_300K.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind_web_2012_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_web_2012_1M.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind_wikipedia_2016_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind_wikipedia_2016_1M.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind-id_web_2013_1M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind-id_web_2013_1M.
#> 2.1 Gathering the collocates for 'ke' ...
#> 1. Tokenising the "ind-id_web_2015_3M" corpus. This process may take a while!
#>     1.1 Removing one-character tokens...
#>     1.2 Lowercasing the tokenised corpus...
#>     At least a match is detected for 'ke' in ind-id_web_2015_3M.
#> 2.1 Gathering the collocates for 'ke' ...
#> 3. Storing all of the outputs...
#> 
#> DONE!
assoc_tb <- assoc_prepare(colloc_out = out,
     stopword_list = collogetr::stopwords[collogetr::stopwords != "ke"])
#> Your colloc_leipzig output is stored as list!
#> You chose to combine the collocational and frequency list data from ALL CORPORA!
#> Tallying frequency list of all words in ALL CORPORA!
#> You chose to remove stopwords!

collex_MI(assoc_tb)
#> # A tibble: 301 x 8
#> # Groups:   node, w [301]
#>    node  w            a a_exp assoc        MI dP_collex_cue_cxn dP_cxn_cue_coll…
#>    <chr> <chr>    <int> <dbl> <chr>     <dbl>             <dbl>            <dbl>
#>  1 ke    rumah       10 0.578 attracti…  4.11             0.036            0.104
#>  2 ke    luar         6 0.273 attracti…  4.46             0.022            0.133
#>  3 ke    arah         5 0.127 attracti…  5.30             0.019            0.244
#>  4 ke    berbagai     5 0.33  attracti…  3.92             0.018            0.09 
#>  5 ke    kata         5 1.33  attracti…  1.92             0.014            0.018
#>  6 ke    negara       5 0.647 attracti…  2.95             0.017            0.043
#>  7 ke    daerah       4 0.489 attracti…  3.03             0.013            0.046
#>  8 ke    tempat       4 0.317 attracti…  3.66             0.014            0.074
#>  9 ke    belakang     3 0.076 attracti…  5.30             0.011            0.244
#> 10 ke    dua          3 0.609 attracti…  2.3              0.009            0.025
#> # … with 291 more rows