## Monday, July 14, 2014

### Understanding mkcls

Clustering words based on contextual similarity is an effective way to learn lexical representations that improve performance, relative to using basic lexical identity features, in many supervised and unsupervised learning problems in NLP. One of the standard tools that does this (it is used, e.g., in the Moses training pipeline) is mkcls, written by Franz Och while he was doing his first university degree at the University of Erlangen in the mid-1990s. Despite its many years of use, there is some confusion about what mkcls actually does. In part, this confusion persists because the paper describing its behavior is written in German. Further complicating matters, a few years later, while doing his PhD, Franz extended mkcls to do bilingual clustering and published a paper on that at EACL in 1999, and it is the EACL paper that is usually read and cited. Unfortunately, this later paper says next to nothing about what the commonly used monolingual mkcls does. To find that out, you have to go to the original:
Franz-Josef Och. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung [Combinatorial Optimization Methods for Maximum Likelihood Estimation of Word Classes]. Studienarbeit [~Bachelor's Thesis!], Friedrich-Alexander-Universität Erlangen-Nürnberg. [BIBTEX]
For those who don't want to slog through 100 pages of German, this post briefly describes how mkcls clusters words.

The clustering problem is formulated in terms of the following combinatorial optimization problem:
$$c^* = \arg \max_{c \in C} \prod_i p_{\mathrm{MLE}}(c(x_i) \mid c(x_{i-1})) \times p_{\mathrm{MLE}}(x_i \mid c(x_{i}))$$
where C is the set of clusterings of the vocabulary and x is a training corpus. Although this is an NP-hard problem, it was proposed (independently?) several times, and several specialized algorithms have been developed to solve it, including the "exchange algorithm" by some mixture of Kneser, Martin, and Ney (1993, 1998), an agglomerative algorithm by Brown et al. (1992), and Percy Liang's improvement of the Brown et al. algorithm (2005). In contrast to the more specialized approaches taken in other work, mkcls uses generic combinatorial optimization algorithms to solve the above objective. It includes implementations of the following algorithms:
• Hill climbing (HC)
• Simulated annealing (SA)
• Threshold accepting (TA)
• The great deluge algorithm (GDA)
• Record-to-record travel (RRT)
Each algorithm starts with a heuristic partitioning of the vocabulary (the paper goes into detail on different initialization strategies, but there's nothing particularly interesting that hasn't since been proposed elsewhere) and then considers updates to the current value of c by moving single words between its clusters. (This is in contrast to the Brown et al. agglomerative algorithm which starts with |V| clusters and greedily merges them, terminating when k clusters remain after |V|-k iterations.) Proposed moves are accepted probabilistically, and all but HC implement some kind of annealing during search. Empirically, Franz found that TA reliably outperformed the others, and this is the default algorithm, but alternatives can be selected with the -a option. Incidentally, mkcls does not implement the Brown et al. (1992) algorithm, and, based on the citations list in his paper, Franz does not appear to have been aware of this work.
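To make the exchange-style search concrete, here is a toy sketch, my own illustration rather than mkcls's actual implementation, of the objective and a threshold-accepting move loop. The function names and the annealing schedule (`t0`, `decay`) are assumptions for exposition; real mkcls updates its counts incrementally instead of rescoring the whole corpus after every move.

```python
import math
import random
from collections import Counter

def log_likelihood(corpus, cluster_of):
    """Class-bigram log-likelihood of the clustering objective:
    sum_i [ log p(c(x_i) | c(x_{i-1})) + log p(x_i | c(x_i)) ],
    with maximum-likelihood (count-ratio) estimates. The sum starts
    at i = 1, so the first token contributes only its emission term."""
    classes = [cluster_of[w] for w in corpus]
    bigram = Counter()    # (c_{i-1}, c_i) counts
    context = Counter()   # c_{i-1} counts (transition denominators)
    cls = Counter()       # class unigram counts (emission denominators)
    word = Counter()      # word unigram counts
    for prev, cur in zip(classes, classes[1:]):
        bigram[(prev, cur)] += 1
        context[prev] += 1
    for w, c in zip(corpus, classes):
        cls[c] += 1
        word[w] += 1
    ll = 0.0
    for prev, cur in zip(classes, classes[1:]):
        ll += math.log(bigram[(prev, cur)] / context[prev])
    for w, c in zip(corpus, classes):
        ll += math.log(word[w] / cls[c])
    return ll

def threshold_accepting(corpus, k, iters=2000, t0=5.0, decay=0.995):
    """Toy exchange search with threshold accepting (TA): move one word
    at a time to a random cluster, and accept any move that does not
    make the objective worse by more than the current threshold. The
    threshold is annealed toward zero, so the search degenerates into
    pure hill climbing as it proceeds."""
    vocab = sorted(set(corpus))
    cluster_of = {w: i % k for i, w in enumerate(vocab)}  # crude init
    cur = log_likelihood(corpus, cluster_of)
    t = t0
    for _ in range(iters):
        w = random.choice(vocab)
        old_c = cluster_of[w]
        cluster_of[w] = random.randrange(k)
        new = log_likelihood(corpus, cluster_of)
        if new >= cur - t:       # accept if not much worse than current
            cur = new
        else:                    # otherwise undo the move
            cluster_of[w] = old_c
        t *= decay               # anneal the acceptance threshold
    return cluster_of, cur
```

Since the probabilities are maximum-likelihood estimates over the training corpus itself, every observed event has a nonzero count and no smoothing is needed; note also that moves are allowed to empty a cluster, which mkcls's bookkeeping handles more carefully.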

In addition to the five core optimization algorithms, mkcls includes several meta-optimization algorithms that use the core algorithms as subroutines:
• Multi-directional search with clipping (MSB). The intuition here is to run a bunch of independent optimizations in parallel and kill them if they are getting bad results early on. Empirically Franz found that good early performance is indicative of good final values. [Although the code for this appears to be included, this option is not enabled in the current version of mkcls.]
• Random restarts (OPT). The classic strategy for getting better solutions out of random algorithms. This option is enabled by adding opt to the end of the command line and setting the parameter n > 1.
• Iterated state reduction (ISR). This is a solution tailored to the particular problem of finding a clustering of words. By comparing the results of different optimization runs (generated by different algorithms, different starting points, or different random seeds)—i.e., the clusterings found in different local optima—subsets of the vocabulary are identified that are common across the multiple runs and fused into single elements. ISR then reruns the standard optimizer (TA by default), using these multi-word elements to take bigger steps in the search space. That is, rather than moving a single word, a block of words is moved between clusters all at once (this strategy is roughly analogous to how "blocked Gibbs sampling" compares to standard, single-variable Gibbs). ISR is enabled by adding izr-opt to the end of the command line and setting the parameter n > 1.
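The fusing step of ISR amounts to taking the common refinement of the partitions found by the different runs: words that land in the same cluster in every run form one block. A minimal sketch (the function name is my own, not mkcls's):

```python
from collections import defaultdict

def fuse(clusterings, vocab):
    """Group words that are clustered together in *every* run. A word's
    signature is the tuple of its cluster ids across runs; words sharing
    a signature form one fused block (the common refinement of the
    input partitions), which the optimizer can then move as a unit."""
    blocks = defaultdict(list)
    for w in vocab:
        signature = tuple(c[w] for c in clusterings)
        blocks[signature].append(w)
    return list(blocks.values())
```

For example, if one run produces {a,b | c,d} and another {a,b,c | d}, the fused blocks are {a,b}, {c}, and {d}: a and b agree everywhere, while c is split off by the second run.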
Running the meta-optimization algorithms is computationally rather expensive since it involves multiple independent runs of the core algorithms (the ISR implementation in particular is slow), although parallelization would be trivial. In any case, these options can add marginal improvements over the core algorithms for performance-critical cases.
### How well does mkcls work?

Since mkcls optimizes the same objective as other algorithms (e.g., Brown et al.), we can compare them in terms of the value of the objective at convergence. Using a corpus of 10M lowercased, tokenized English words (the English side of the FBIS Chinese-English parallel corpus), I clustered the 50k word types into k=80 clusters with the agglomerative algorithm of Liang (2005) and also with several different configurations of mkcls. I report perplexities on the training data (i.e., the objective) at convergence and give a link to a page that shows the learned clusterings.
### What clustering algorithm should you use?

Although the Brown et al. agglomerative clustering algorithm tends to have somewhat worse perplexities than mkcls-TA (I've seen the same pattern of results every time I've run these comparisons), it is relatively fast—especially as k gets larger, and this is important since values of k in the 500–1000 range appear to be optimal for many tasks. Moreover, the hierarchical clustering it produces is convenient for obtaining different cluster granularities from a single run, so it may be a good option for some applications (see here and here for examples). But, if you don't need hierarchical clustering and have a fast computer, mkcls is hard to beat.

Unknown said...

this is great, thanks! By any chance did you try any of the other algs in mkcls (other than TA) to confirm Franz' findings?

Chris said...

I didn't try any of the other configurations, but if you want to try them, here's the source data

kokomo40 said...

"Each algorithm starts with a heuristic partitioning of the vocabulary ..., and then considers updates to the current value of c by moving single words between its clusters." - isn't that the exchange algorithm used in (Martin et al., 1998) and (Och, 1999)? According to your explanation, mkcls seems to do what those papers (which most people cite) are doing, and adds combinatorial optimization strategies (e.g. threshold accepting) to improve the local optimum, as (Och, 1999) says.

Anonymous said...

Thanks for the article. FYI, the source data is compressed twice.