Franz-Josef Och. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung [Combinatorial Optimization Methods for Maximum Likelihood Estimation of Word Classes]. Studienarbeit [~Bachelor's Thesis!], Friedrich-Alexander-Universität Erlangen-Nürnburg. [BIBTEX]For those who don't want to slog through 100 pages of German, this post briefly describes how mkcls clusters words.
The clustering problem is formulated in terms of the following combinatorial optimization problem:
$$c^* = \arg \max_{c \in C} \prod_i p_{\mathrm{MLE}}(c(x_i) \mid c(x_{i-1})) \times p_{\mathrm{MLE}}(x_i \mid c(x_{i}))$$
where C is the set of clusterings of the vocabulary and x is a training corpus. Although this is an NP-hard problem, it was proposed (independently?) several times, and several specialized algorithms have been developed to solve it, including the "exchange algorithm" by some mixture of Kneser, Martin, and Ney (1993, 1998), an agglomerative algorithm by Brown et al. (1992), and Percy Liang's improvement of the Brown et al. algorithm (2005). In contrast to the more specialized approaches taken in other work, mkcls uses generic combinatorial optimization algorithms to solve the above objective. It includes implementations of the following algorithms:
- Stochastic Hill Climbing (HC)
- Simulated Annealing (SA)
- The Great Deluge algorithm (GDA)
- Threshold Accepting (TA)
- Record-to-Record Travel (RRT)
In addition to the five core optimization algorithms, mkcls includes several meta-optimization algorithms that use the core algorithms as subroutines:
- Multi-directional search with clipping (MSB). The intuition here is to run a bunch of independent optimizations in parallel and kill them if they are getting bad results early on. Empirically Franz found that good early performance is indicative of good final values. [Although the code for this appears to be included, this option is not enabled in the current version of mkcls.]
- Random restarts (OPT). The classic strategy for getting better solutions out of random algorithms. This option is enabled by adding opt to the end of the command line and setting the parameter n > 1.
- Iterated state reduction (ISR). This a solution tailored to the particular problem of finding a clustering of words. By comparing the results of different optimization runs (generated by different algorithms, different starting points, or different random seeds)—i.e., the clusterings found in different local optima—subsets of the vocabulary are identified that are common across the multiple runs and fused to elements. ISR then reruns the standard optimizer (TA by default), using these multi-word elements to take bigger steps in the search space. That is, rather than moving a single word, a block of words is moved between clusters all at once (this strategy is roughly analogous to how "blocked Gibbs sampling" compares to standard, single-variable Gibbs). ISR is enabled by adding izr-opt to the end of the command line and setting the parameter n > 1.
Since mkcls optimizes the same objective as other algorithms (e.g., Brown et al.), we can compare them in terms of the value of the objective at convergence. Using a corpus of 10M lowercased, tokenized English words (the English side of the FBIS Chinese–English parallel corpus), I clustered the 50k word types into k=80 clusters with the agglomerative algorithm of Liang (2005) and also with several different configurations of mkcls. I report perplexities on the training data (i.e., the objective) at convergence and give a link to a page that shows the learned clusterings.
- Agglomerative algorithm (Liang implementation), perplexity=249
- mkcls (TA), perplexity=236
- mkcls (n=2, OPT, TA), perplexity=236
- mkcls (n=10, OPT, TA), perplexity=234
- mkcls (n=100, OPT, TA), perplexity=233
- mkcls (n=2, ISR, TA), perplexity=231
- mkcls (n=10, ISR, TA), perplexity=230