The StatMT Blog: Devlin et al. 2014 - ACL 2014's best paper

Jacob Devlin and colleagues at BBN received ACL's 2014 best paper award for their work on neural translation models. This post gives my perspective on why this paper received the award (nb. I was not involved in the award decision). Brendan O'Connor suggested that I write this post for people outside MT who might not have fully understood the significance of the paper.

So, what's so great about this paper?

Neural networks are popular in MT these days. In part, their popularity is due to their being this year's must-have accessory. More substantively, there is optimism that NNs provide a reasonable way to relax the naive independence assumptions made in current translation models without sacrificing statistical efficiency. However, inference in neural models of language is much less obviously computationally efficient, and (as recent acquisitions might suggest) getting NNs to perform at scale requires specialized expertise. MT is likewise a problem requiring a fair amount of specialized expertise, so getting the two to work together is quite an accomplishment.

Devlin et al. manage to get major improvements in translation quality over state-of-the-art baselines by using NNs to condition on the sort of nonlocal context that intuition tells us should be useful when making translation decisions, and they do so without giving up computational efficiency. Their proposed model is remarkably simple, based largely on Bengio et al.'s 2003 neural n-gram LM, which predicts each word in a sequence conditioned on the n-1 previous words, i.e.,
$$p(\textbf{e}) = \prod_{i=1}^{|\textbf{e}|} p(e_i \mid e_{i-n+1}, \ldots, e_{i-1})$$
but extended to condition on an aligned source word and its m-word surrounding context when generating each target word, i.e.,
$$p(\textbf{e} \mid \textbf{f}, \textbf{a}) = \prod_{i=1}^{|\textbf{e}|} p(e_i \mid e_{i-n+1}, \ldots, e_{i-1}, \underbrace{f_{a_i - m/2}, \ldots , f_{a_i}, \ldots, f_{a_i+m/2}}_{\textrm{new context}})$$
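The factorization above can be made concrete with a small sketch. This is only an illustration under assumptions, not Devlin et al.'s implementation: the function names, the padding symbols `<s>` and `<pad>`, and the `half_window` parameter (the number of source words taken on each side of the aligned word) are all hypothetical, and `cond_log_prob` stands in for whatever network scores a target word given its context.

```python
import math
from typing import Callable, List, Sequence, Tuple


def context_for_position(
    e: Sequence[str],
    f: Sequence[str],
    a: Sequence[int],
    i: int,
    n: int = 3,
    half_window: int = 1,
    bos: str = "<s>",
    pad: str = "<pad>",
) -> Tuple[List[str], List[str]]:
    """Return (target history, source window) used to predict e[i].

    a[i] is the index in f of the source word aligned to e[i].
    All names and padding symbols here are hypothetical.
    """
    # The n-1 previous target words, padded at the sentence start.
    history = [e[j] if j >= 0 else bos for j in range(i - n + 1, i)]
    # Source words centred on the aligned position, padded at the edges.
    center = a[i]
    window = [
        f[k] if 0 <= k < len(f) else pad
        for k in range(center - half_window, center + half_window + 1)
    ]
    return history, window


def sequence_log_prob(
    e: Sequence[str],
    f: Sequence[str],
    a: Sequence[int],
    cond_log_prob: Callable[[str, List[str], List[str]], float],
    n: int = 3,
    half_window: int = 1,
) -> float:
    """Sum log p(e_i | target history, source window) over all positions."""
    total = 0.0
    for i, word in enumerate(e):
        history, window = context_for_position(e, f, a, i, n, half_window)
        total += cond_log_prob(word, history, window)
    return total


if __name__ == "__main__":
    f = ["das", "haus", "ist", "klein"]
    e = ["the", "house", "is", "small"]
    a = [0, 1, 2, 3]  # monotone one-to-one alignment, chosen for the demo
    print(context_for_position(e, f, a, 0))
    print(context_for_position(e, f, a, 3))
    # A stand-in "model" that is uniform over a 4-word vocabulary.
    print(sequence_log_prob(e, f, a, lambda w, h, s: math.log(0.25)))
```

Because the source window depends only on the observed source sentence and the alignment, the only thing that grows relative to a plain n-gram LM is the size of the conditioning context passed to the scorer.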

This formulation is reminiscent of "Markov" or "operation sequence" approaches to translation that have been around for the last 10 years. Devlin et al.'s formulation is notably close to the conditional variants of these models, such as the one proposed in recent work by Feng and Cohn (alas, this was missed in the citations).

Including probabilities from this model as a feature in a conventional (e.g., phrase-based, hierarchical phrase-based, syntactic, etc.) translation model is, at least in principle, no different from including probabilities from an n-gram language model: all of the additional conditioning context comes from the fully observed source side. The practical challenge is that working out the probability of the next word requires computing a normalization factor by summing over the entire output vocabulary for each conditioning context (although alternatives exist). Devlin et al. deal with the summation problem by altering the training so as to encourage the model to be implicitly normalized, that is, by penalizing deviations from log Z(f,a)=0 during training, so that the unnormalized output score can be used directly as an approximate log probability at decoding time. (To further speed up inference, they show that the activation of the hidden layer can be precomputed, although one wonders if a caching approach would have worked as well and made the source conditioning possibilities richer.)
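The self-normalization idea can be sketched in a few lines. This is a minimal illustration, assuming the penalty is the squared log-normalizer added to the usual negative log-likelihood; the function names and the weight `alpha` are hypothetical, and a real system would compute this over minibatches inside a neural network toolkit rather than in plain Python.

```python
import math
from typing import Sequence


def log_partition(scores: Sequence[float]) -> float:
    """log Z = log sum_w exp(score_w), computed stably via the max trick."""
    mx = max(scores)
    return mx + math.log(sum(math.exp(s - mx) for s in scores))


def self_normalized_loss(
    scores: Sequence[float], gold: int, alpha: float = 0.1
) -> float:
    """Negative log-likelihood plus a penalty that pulls log Z toward 0.

    scores: unnormalized output-layer scores, one per vocabulary word.
    gold:   index of the correct next word.
    alpha:  penalty weight (a hypothetical hyperparameter for this sketch).
    """
    log_z = log_partition(scores)
    nll = -(scores[gold] - log_z)
    return nll + alpha * log_z ** 2


def approx_log_prob(scores: Sequence[float], word: int) -> float:
    """Decoding-time shortcut: treat the raw score as a log probability.

    Only reasonable if training has pushed log Z close to 0.
    """
    return scores[word]


if __name__ == "__main__":
    # Scores that already form a distribution: log Z is (numerically) 0,
    # so the penalty vanishes and the shortcut is essentially exact.
    normalized = [math.log(0.5), math.log(0.3), math.log(0.2)]
    print(log_partition(normalized), self_normalized_loss(normalized, 0))

    # Shifting every score by +1 leaves the softmax unchanged but makes
    # log Z roughly 1, which the penalty term now charges for.
    shifted = [s + 1.0 for s in normalized]
    print(log_partition(shifted), self_normalized_loss(shifted, 0))
```

At decoding time, `approx_log_prob` skips the sum over the vocabulary entirely, which is the point of the trick: the cost of scoring one word no longer depends on the vocabulary size.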

Evaluated on Arabic–English and Chinese–English translation, the proposed method works astoundingly well, obtaining improvements on the order of those seen when phrase-based models replaced word-based models. This is a remarkable result, and the award is well deserved for these results alone.

I will conclude by saying that although this work is indisputably a major achievement, the paper does have a number of flaws. Calling an archetypically conditional model a "joint model" (in the title no less!) is confusing, in particular for those who might have wanted to use a best paper as a way to "read in" to a new area. Also, for those wishing to reimplement this (I'm aware of at least 3 efforts already underway), details such as the base of the logarithm in Table 1 would have been helpful as well as information about training performance on smaller datasets. Finally, there is also the matter of not modeling the alignment distribution p(a | f). This certainly does not need to be part of the NN model (as results clearly show!), but given the problems of reordering in MT, some sort of explicit model would seem to be an obvious thing to have considered.

Nevertheless, whatever these objections may be, this is great work that is much deserving of its award.

6 comments:

hal said...: This comment has been removed by the author.; July 8, 2014 at 9:48 AM
hal said...: hi Chris --

great to see you in the blogging world, and thanks for the summary.

i think you're missing one sociocultural aspect of what's so great about this paper: that others have tried and (more or less) failed.

since naacl 2013 there have been probably dozens of papers that do some form of deep learning for MT, whether it's in phrase table features (like the Devlin paper -- go Terps!) or nbest reranking or completely scrapping the underlying model (yay -- that's the best approach!) or something else. of the ones i've seen (not all), the improvements were at best small.

and then the devlin paper comes around and gets huge gains using rather simple techniques (or, at least simple on the surface -- from chatting with some of the people trying to reproduce these results it's kind of hard).

so i think a large part of the "this is great" factor comes from the fact that people had been trying this without much luck, and then devlin made it work. (actually a similar phenomenon seems to hold for many past best paper awards too.) i'm not saying that to say in any way that it's not a great contribution -- it is -- but i think the culture in which it came about is a large part of why people are excited. we _wanted_ it to work and devlin (ugh it's hard to type that without typing "devil"!) made it happen.

that's also the _one_ nit pick i have with the devlin paper: _given_ that no one else had made this work particularly well before, the falsifiable hypothesis might well have been that deep learning isn't that helpful for MT. devlin successfully falsified this, but i'm not sure i'm left with any particular sense of what he did right that others did wrong. perhaps we'll figure it out soon :); July 8, 2014 at 11:30 AM
Chris said...: @Hal yeah, I agree completely that we give best paper awards to things we want to work (and if they only work after a long and arduous struggle, all the better). Maybe we need a new award for "most surprising paper"?; July 8, 2014 at 2:13 PM
Reza said...: This comment has been removed by the author.; July 9, 2014 at 11:27 PM
Reza said...: Thanks Chris for the great post.; July 9, 2014 at 11:27 PM
cyrine said...: Thank you for this share! so interesting paper. I would like to know if the source code of this program is available?

thanks; November 5, 2014 at 6:01 PM

Monday, July 7, 2014
