*auf deutsch*. Further complicating matters, a few years later when he was doing his PhD, Franz extended mkcls to do bilingual clustering and published a paper in 1999 at EACL on that, and it is the EACL paper that is often read/cited instead (for example). Unfortunately, this later paper says next to nothing about what the commonly used monolingual mkcls does. To find that out, you have to go to the original:

Franz-Josef Och. 1995.For those who don't want to slog through 100 pages of German, this post briefly describes how mkcls clusters words.Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung [Combinatorial Optimization Methods forMaximum Likelihood Estimation of Word Classes]. Studienarbeit [~Bachelor's Thesis!], Friedrich-Alexander-Universität Erlangen-Nürnburg. [BIBTEX]

The clustering problem is formulated in terms of the following combinatorial optimization problem:

$$c^* = \arg \max_{c \in C} \prod_i p_{\mathrm{MLE}}(c(x_i) \mid c(x_{i-1})) \times p_{\mathrm{MLE}}(x_i \mid c(x_{i}))$$

where

*C*is the set of clusterings of the vocabulary and

**x**is a training corpus. Although this is an NP-hard problem, it was proposed (independently?) several times, and several specialized algorithms have been developed to solve it, including the "exchange algorithm" by some mixture of Kneser, Martin, and Ney (1993, 1998), an agglomerative algorithm by Brown et al. (1992), and Percy Liang's improvement of the Brown et al. algorithm (2005). In contrast to the more specialized approaches taken in other work, mkcls uses generic combinatorial optimization algorithms to solve the above objective. It includes implementations of the following algorithms:

- Stochastic Hill Climbing (HC)
- Simulated Annealing (SA)
- The Great Deluge algorithm (GDA)
- Threshold Accepting (TA)
- Record-to-Record Travel (RRT)

*c*by moving single words between its clusters. (This is in contrast to the Brown et al. agglomerative algorithm which starts with |

*V*| clusters and greedily merges them, terminating when

*k*clusters remain after |

*V*|-

*k*iterations.) Proposed moves are accepted probabilistically, and all but HC implement some kind of annealing during search. Empirically, Franz found that TA reliably outperformed the others, and this is the default algorithm, but alternatives can be selected with the -a option. Incidentally, mkcls does not implement the Brown et al. (1992) algorithm, and, based on the citations list in his paper, Franz does not appear to have been aware of this work.

In addition to the five core optimization algorithms, mkcls includes several meta-optimization algorithms that use the core algorithms as subroutines:

**Multi-directional search with clipping**(MSB). The intuition here is to run a bunch of independent optimizations in parallel and kill them if they are getting bad results early on. Empirically Franz found that good early performance is indicative of good final values.*[Although the code for this appears to be included, this option is not enabled in the current version of mkcls.]***Random restarts**(OPT). The classic strategy for getting better solutions out of random algorithms. This option is enabled by adding opt to the end of the command line and setting the parameter n > 1.**Iterated state reduction**(ISR). This a solution tailored to the particular problem of finding a clustering of words. By comparing the results of different optimization runs (generated by different algorithms, different starting points, or different random seeds)—i.e., the clusterings found in different**local optima**—subsets of the vocabulary are identified that are common across the multiple runs and fused to elements. ISR then reruns the standard optimizer (TA by default), using these multi-word elements to take bigger steps in the search space. That is, rather than moving a single word, a block of words is moved between clusters all at once (this strategy is roughly analogous to how "blocked Gibbs sampling" compares to standard, single-variable Gibbs). ISR is enabled by adding izr-opt to the end of the command line and setting the parameter n > 1.

Since mkcls optimizes the same objective as other algorithms (e.g., Brown et al.), we can compare them in terms of the value of the objective at convergence. Using a corpus of 10M lowercased, tokenized English words (the English side of the FBIS Chinese–English parallel corpus), I clustered the 50k word types into

*k*=80 clusters with the agglomerative algorithm of Liang (2005) and also with several different configurations of mkcls. I report perplexities on the training data (i.e., the objective) at convergence and give a link to a page that shows the learned clusterings.

- Agglomerative algorithm (Liang implementation), perplexity=249
- mkcls (TA), perplexity=236
- mkcls (n=2, OPT, TA), perplexity=236
- mkcls (n=10, OPT, TA), perplexity=234
- mkcls (n=100, OPT, TA), perplexity=233
- mkcls (n=2, ISR, TA), perplexity=231
- mkcls (n=10, ISR, TA), perplexity=230

*k*gets larger, and this is important since values of

*k*in the 500–1000 range appear to be optimal for many tasks. Moreover, the hierarchical clustering it produces is convenient for obtaining different cluster granularities from a single run, so it may be a good option for some applications (see here and here for examples). But, if you don't need hierarchical clustering and have a fast computer, mkcls is hard to beat.