Monday, July 14, 2014

Understanding mkcls

Clustering words based on contextual similarity is an effective way to learn lexical representations that improve performance—relative to using basic lexical identity features—in many supervised and unsupervised learning problems in NLP. One of the standard tools (e.g., used in the Moses training pipeline) that does this is mkcls, written by Franz Och while he was doing his first university degree at the University of Erlangen in the mid-1990's. Despite its many years of use, there is some confusion about what mkcls actually does. In part, this confusion persists because the paper describing its behavior is written auf deutsch. Further complicating matters, a few years later when he was doing his PhD, Franz extended mkcls to do bilingual clustering and published a paper in 1999 at EACL on that, and it is the EACL paper that is often read/cited instead (for example). Unfortunately, this later paper says next to nothing about what the commonly used monolingual mkcls does. To find that out, you have to go to the original:
Franz-Josef Och. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung [Combinatorial Optimization Methods for Maximum Likelihood Estimation of Word Classes]. Studienarbeit [~Bachelor's Thesis!], Friedrich-Alexander-Universität Erlangen-Nürnburg. [BIBTEX]
For those who don't want to slog through 100 pages of German, this post briefly describes how mkcls clusters words.

The clustering problem is formulated in terms of the following combinatorial optimization problem:
$$c^* = \arg \max_{c \in C} \prod_i p_{\mathrm{MLE}}(c(x_i) \mid c(x_{i-1})) \times p_{\mathrm{MLE}}(x_i \mid c(x_{i}))$$
where C is the set of clusterings of the vocabulary and x is a training corpus. Although this is an NP-hard problem, it was proposed (independently?) several times, and several specialized algorithms have been developed to solve it, including the "exchange algorithm" by some mixture of Kneser, Martin, and Ney (1993, 1998), an agglomerative algorithm by Brown et al. (1992), and Percy Liang's improvement of the Brown et al. algorithm (2005). In contrast to the more specialized approaches taken in other work, mkcls uses generic combinatorial optimization algorithms to solve the above objective. It includes implementations of the following algorithms:
Each algorithm starts with a heuristic partitioning of the vocabulary (the paper goes into detail on different initialization strategies, but there's nothing particularly interesting that hasn't since been proposed elsewhere) and then considers updates to the current value of c by moving single words between its clusters. (This is in contrast to the Brown et al. agglomerative algorithm which starts with |V| clusters and greedily merges them, terminating when k clusters remain after |V|-k iterations.) Proposed moves are accepted probabilistically, and all but HC implement some kind of annealing during search. Empirically, Franz found that TA reliably outperformed the others, and this is the default algorithm, but alternatives can be selected with the -a option. Incidentally, mkcls does not implement the Brown et al. (1992) algorithm, and, based on the citations list in his paper, Franz does not appear to have been aware of this work.

In addition to the five core optimization algorithms, mkcls includes several meta-optimization algorithms that use the core algorithms as subroutines:
  • Multi-directional search with clipping (MSB). The intuition here is to run a bunch of independent optimizations in parallel and kill them if they are getting bad results early on. Empirically Franz found that good early performance is indicative of good final values. [Although the code for this appears to be included, this option is not enabled in the current version of mkcls.]
  • Random restarts (OPT). The classic strategy for getting better solutions out of random algorithms. This option is enabled by adding opt to the end of the command line and setting the parameter n > 1.
  • Iterated state reduction (ISR). This a solution tailored to the particular problem of finding a clustering of words. By comparing the results of different optimization runs (generated by different algorithms, different starting points, or different random seeds)—i.e., the clusterings found in different local optima—subsets of the vocabulary are identified that are common across the multiple runs and fused to elements. ISR then reruns the standard optimizer (TA by default), using these multi-word elements to take bigger steps in the search space. That is, rather than moving a single word, a block of words is moved between clusters all at once (this strategy is roughly analogous to how "blocked Gibbs sampling" compares to standard, single-variable Gibbs). ISR is enabled by adding izr-opt to the end of the command line and setting the parameter n > 1.
Running the meta-optimization algorithms is computationally rather expensive since it involves multiple independent runs of the core algorithms (the ISR implementation in particular is slow), although parallelization would be trivial. In any case, these options can add marginal improvements over the core algorithms for performance-critical cases.
How well does mkcls work?

Since mkcls optimizes the same objective as other algorithms (e.g., Brown et al.), we can compare them in terms of the value of the objective at convergence. Using a corpus of 10M lowercased, tokenized English words (the English side of the FBIS ChineseEnglish parallel corpus), I clustered the 50k word types into k=80 clusters with the agglomerative algorithm of Liang (2005) and also with several different configurations of mkcls. I report perplexities on the training data (i.e., the objective) at convergence and give a link to a page that shows the learned clusterings.
What clustering algorithm should you use?

Although the Brown et al. agglomerative clustering algorithm tends to have somewhat worse perplexities thank mkcls-TA (I've seen the same pattern of results every time I've run these comparisons), it is relatively fast—especially as k gets larger, and this is important since values of k in the 500–1000 range appear to be optimal for many tasks. Moreover, the hierarchical clustering it produces is convenient for obtaining different cluster granularities from a single run, so it may be a good option for some applications (see here and here for examples). But, if you don't need hierarchical clustering and have a fast computer, mkcls is hard to beat.

Monday, July 7, 2014

Devlin et al. 2014 - ACL 2014's best paper

Jacob Devlin and colleagues at BBN received ACL's 2014 best paper award for their work on neural translation models. This post gives my perspective on why this paper received the award (nb. I was not involved in the award decision). Brendan O'Connor suggested that I write this post for people outside MT who might not have fully understood the significance of the paper.

So, what's so great about this paper?

Neural networks are popular in MT in these days. In part, their popularity is due to them being this year's must-have accessory. More substantively, there is optimism that NNs provide a reasonable way to relax the naive independence assumptions made in current translation models without sacrificing statistical efficiency. However, doing inference in neural models of language is much less obviously computationally efficient, and (as recent acquisitions might suggest) getting NNs to perform at scale requires specialized expertise. MT is likewise a problem requiring a fair amount of specialized expertise, so getting the two to work together is quite an accomplishment.

Devlin et al. manage to get major improvements in translation quality over state-of-the-art baselines by using NNs to condition on the sort of nonlocal context that intuition tells us should be useful when making translation decisions—and they did so without giving up computational efficiency. Their proposed model is remarkably simple, based largely on Bengio et al.'s 2003 neural n-gram LM which predicts a sequence of words conditioning on the n-1 previous words, i.e.,
$$p(\textbf{e}) = \prod_{i=1}^{|\textbf{e}|} p(e_i \mid e_{i-n+1}, \ldots, e_{i-1})$$
but extended to condition on an aligned source word and its m-word surrounding context when generating each target word, i.e.,
$$p(\textbf{e} \mid \textbf{f}, \textbf{a}) = \prod_{i=1}^{|\textbf{e}|} p(e_i \mid e_{i-n+1}, \ldots, e_{i-1}, \underbrace{f_{a_i - m/2}, \ldots , f_{a_i}, \ldots, f_{a_i+m/2}}_{\textrm{new context}})$$
This formulation is reminiscent of "Markov" or "operation sequence" approaches to translation that have been around for the last 10 years.  Devlin et al.'s formulation is noteably close to the conditional variants of these models, such as one proposed in recent work of Feng and Cohn (alas, this was missed in the citations).

Including probabilities from this model as a feature in a conventional (e.g., phrase based, hierarchical phrase based, syntactic, etc.) translation model is, at least in principle, no different than including probabilities from an n-gram language model—all of the additional conditioning context comes from the fully-observed source side. The practical challenge is that working out the probability of the next word requires computing a normalization factor by summing over the entire output vocabulary for each conditioning context (although alternatives exist). Devlin et al. deal with the summation problem by altering the training so as to encourage the model to be implicitly normalized, that is, by penalizing deviations from log Z(f,a)=0 during training. (To further speed up inference, they show that the activation of the hidden layer can be precomputed; although one wonders if a caching approach would have worked as well and made the source conditioning possibilities richer.)

Evaluated on Arabic–English and Chinese–English translation, the proposed method works astoundingly well, obtaining improvements on the order of those seen when phrase-based models replaced word based models. This is a remarkable result, and the award is well deserved for these results alone.

I will conclude by saying that although this work is indisputably a major achievement, the paper does have a number of flaws. Calling an archetypically conditional model a "joint model" (in the title no less!) is confusing, in particular for those who might have wanted to use a best paper as a way to "read in" to a new area. Also, for those wishing to reimplement this (I'm aware of at least 3 efforts already underway), details such as the base of the logarithm in Table 1 would have been helpful as well as information about training performance on smaller datasets. Finally, there is also the matter of not modeling the alignment distribution p(a | f). This certainly does not need to be part of the NN model (as results clearly show!), but given the problems of reordering in MT, some sort of explicit model would seem to be an obvious thing to have considered.

Nevertheless, whatever these objections may be, this is great work that is much deserving of its award.

After 7 years...

This blog has been parked on the web for a really long time, so I'm going to start posting some of the ideas that I've had.