Franz-Josef Och. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung [Combinatorial Optimization Methods for Maximum Likelihood Estimation of Word Classes]. Studienarbeit [roughly, a Bachelor's thesis], Friedrich-Alexander-Universität Erlangen-Nürnberg.

For those who don't want to slog through 100 pages of German, this post briefly describes how mkcls clusters words.
The clustering problem is formulated in terms of the following combinatorial optimization problem:
$$c^* = \arg \max_{c \in C} \prod_i p_{\mathrm{MLE}}(c(x_i) \mid c(x_{i-1})) \times p_{\mathrm{MLE}}(x_i \mid c(x_{i}))$$
where C is the set of clusterings of the vocabulary and x is a training corpus. Although this is an NP-hard problem, it was proposed (independently?) several times, and several specialized algorithms have been developed to solve it, including the "exchange algorithm" by some mixture of Kneser, Martin, and Ney (1993, 1998), an agglomerative algorithm by Brown et al. (1992), and Percy Liang's improvement of the Brown et al. algorithm (2005). In contrast to the more specialized approaches taken in other work, mkcls uses generic combinatorial optimization algorithms to solve the above objective. It includes implementations of the following algorithms:
- Stochastic Hill Climbing (HC)
- Simulated Annealing (SA)
- The Great Deluge algorithm (GDA)
- Threshold Accepting (TA)
- Record-to-Record Travel (RRT)
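As a concrete illustration of the objective and the kind of single-word moves these local search algorithms make, here is a minimal Python sketch. This is my own illustration, not the mkcls implementation: the function names, the round-robin initialization, and the full recomputation of the objective after each move (mkcls updates counts incrementally) are all assumptions for clarity.

```python
import math
import random
from collections import Counter

def log_objective(corpus, assign):
    """Log of the class-bigram objective:
    sum_i log p(c(x_i) | c(x_{i-1})) + log p(x_i | c(x_i)),
    with all probabilities estimated by MLE from counts."""
    bigram = Counter()   # (c(x_{i-1}), c(x_i)) counts
    cls = Counter()      # class unigram counts
    word = Counter()     # word unigram counts
    for prev, cur in zip(corpus, corpus[1:]):
        bigram[(assign[prev], assign[cur])] += 1
    for w in corpus:
        cls[assign[w]] += 1
        word[w] += 1
    ll = 0.0
    for prev, cur in zip(corpus, corpus[1:]):
        ll += math.log(bigram[(assign[prev], assign[cur])] / cls[assign[prev]])
        ll += math.log(word[cur] / cls[assign[cur]])
    return ll

def hill_climb(corpus, k, iters=2000, seed=0):
    """Stochastic hill climbing: propose moving one word to a random
    other cluster; keep the move only if the objective improves."""
    rng = random.Random(seed)
    vocab = sorted(set(corpus))
    assign = {w: i % k for i, w in enumerate(vocab)}  # heuristic start
    best = log_objective(corpus, assign)
    for _ in range(iters):
        w = rng.choice(vocab)
        old, new = assign[w], rng.randrange(k)
        if new == old:
            continue
        assign[w] = new
        cand = log_objective(corpus, assign)
        if cand > best:
            best = cand          # accept the improving move
        else:
            assign[w] = old      # reject the move
    return assign, best
```

The other algorithms (SA, GDA, TA, RRT) differ only in the acceptance rule: instead of requiring strict improvement, each allows some worsening moves according to its own schedule or threshold.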
In addition to the five core optimization algorithms, mkcls includes several meta-optimization algorithms that use the core algorithms as subroutines:
- Multi-directional search with clipping (MSB). The intuition here is to run a bunch of independent optimizations in parallel and kill them if they are getting bad results early on. Empirically Franz found that good early performance is indicative of good final values. [Although the code for this appears to be included, this option is not enabled in the current version of mkcls.]
- Random restarts (OPT). The classic strategy for getting better solutions out of random algorithms. This option is enabled by adding opt to the end of the command line and setting the parameter n > 1.
- Iterated state reduction (ISR). This is a solution tailored to the particular problem of finding a clustering of words. By comparing the results of different optimization runs (generated by different algorithms, different starting points, or different random seeds)—i.e., the clusterings found in different local optima—subsets of the vocabulary are identified that are common across the multiple runs and fused into single elements. ISR then reruns the standard optimizer (TA by default), using these multi-word elements to take bigger steps in the search space. That is, rather than moving a single word, a block of words is moved between clusters all at once (this strategy is roughly analogous to how "blocked Gibbs sampling" compares to standard, single-variable Gibbs). ISR is enabled by adding izr-opt to the end of the command line and setting the parameter n > 1.
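The fusing step in ISR can be sketched as follows (again my own illustration, not the mkcls code): words that receive the same cluster assignment in every run are grouped into a block, and each block then moves between clusters as a unit.

```python
from collections import defaultdict

def fuse_common_subsets(clusterings):
    """Group words that land in the same cluster in *every* run.
    Each clustering is a dict mapping word -> cluster id. Words with
    identical per-run signatures always co-clustered, so they are
    fused into one block that later moves as a single unit."""
    blocks = defaultdict(list)
    for w in clusterings[0]:
        sig = tuple(c[w] for c in clusterings)  # cluster id in each run
        blocks[sig].append(w)
    return list(blocks.values())
```

Note that cluster ids need not be aligned across runs: two words are fused whenever their ids agree within each run, whatever those ids are.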
Since mkcls optimizes the same objective as other algorithms (e.g., Brown et al.), we can compare them in terms of the value of the objective at convergence. Using a corpus of 10M lowercased, tokenized English words (the English side of the FBIS Chinese–English parallel corpus), I clustered the 50k word types into k=80 clusters with the agglomerative algorithm of Liang (2005) and also with several different configurations of mkcls. I report perplexities on the training data (i.e., the objective) at convergence and give a link to a page that shows the learned clusterings.
- Agglomerative algorithm (Liang implementation), perplexity=249
- mkcls (TA), perplexity=236
- mkcls (n=2, OPT, TA), perplexity=236
- mkcls (n=10, OPT, TA), perplexity=234
- mkcls (n=100, OPT, TA), perplexity=233
- mkcls (n=2, ISR, TA), perplexity=231
- mkcls (n=10, ISR, TA), perplexity=230
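To relate these numbers back to the objective: perplexity is a monotone transformation of the (log-)likelihood, so lower perplexity means a higher objective value. A minimal sketch, assuming the standard convention of exponentiating the negative average log-likelihood per token (I have not verified that this matches mkcls's exact normalization):

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Perplexity of the training data: the geometric mean of the
    inverse per-token probability, exp(-LL / N)."""
    return math.exp(-log_likelihood / n_tokens)
```

For example, a model that assigned every token probability 1/4 would have perplexity 4 regardless of corpus length.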
5 comments:
this is great, thanks! By any chance did you try any of the other algs in mkcls (other than TA) to confirm Franz' findings?
I didn't try any of the other configurations, but if you want to try them, here's the source data.
"Each algorithm starts with a heuristic partitioning of the vocabulary ..., and then considers updates to the current value of c by moving single words between its clusters." - isn't that the exchange algorithm used in (Martin et al., 1998) and (Och, 1999)? According to your explanation, mkcls seems to do what those papers (which most people cite) are doing, and adds combinatorial optimization strategies (e.g., threshold accepting) to improve the local optimum, as (Och, 1999) says.
Thanks for the article. FYI, the source data is compressed twice.
Here it says "Franz extended mkcls to do bilingual clustering", but I couldn't find how to pass a bilingual corpus to mkcls to do bilingual clustering. All the examples I found are using it for monolingual word clustering. Do you know how to run it for bilingual clustering?