Fast and Robust Neural Network Joint Models for Statistical Machine Translation

Recent work has shown success in using neural network language models (NNLMs) as features in MT systems.

In recent years, neural network models have become increasingly popular in NLP.

Formally, our model approximates the probability of target hypothesis T conditioned on source sentence S. We follow the standard n-gram LM decomposition of the target, where each target word t_i is conditioned on the previous n − 1 target words.

Because our NNJM is fundamentally an n-gram NNLM with additional source context, it can easily be integrated into any SMT decoder.

Recall that our NNJM feature can be described with the following probability:
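The equation itself did not survive on this page. A reconstruction consistent with the sentences quoted here (n-gram decomposition of the target, plus an m-word source window for each target word) might look like the following. The symbols a_i (index of the source word affiliated with t_i) and \mathcal{S}_i (the source window) are notational assumptions, as is the choice of an odd m so that the window is centered on a_i.

```latex
P(T \mid S) \approx \prod_{i=1}^{|T|} P\left(t_i \mid t_{i-1}, \ldots, t_{i-n+1}, \mathcal{S}_i\right),
\qquad
\mathcal{S}_i = s_{a_i - \frac{m-1}{2}}, \ldots, s_{a_i}, \ldots, s_{a_i + \frac{m-1}{2}}
```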

Appears in 36 sentences as: BLEU (46)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM. (Page 1, “Abstract”)
- The NNJM features also produce a gain of +6.3 BLEU on top of a simpler baseline equivalent to Chiang’s (2007) original Hiero implementation. (Page 1, “Abstract”)
- Additionally, we present several variations of this model which provide significant additive BLEU gains. (Page 1, “Introduction”)
- The NNJM features produce an improvement of +3.0 BLEU on top of a baseline that is already better than the 1st place MT12 result and includes ... (Page 1, “Introduction”)
- Additionally, on top of a simpler decoder equivalent to Chiang’s (2007) original Hiero implementation, our NNJM features are able to produce an improvement of +6.3 BLEU, as much as all of the other features in our strong baseline system combined. (Page 2, “Introduction”)
- We demonstrate in Section 6.6 that using one hidden layer instead of two has minimal effect on BLEU. (Page 4, “Neural Network Joint Model (NNJM)”)
- We demonstrate in Section 6.6 that using the self-normalized/pre-computed NNJM results in only a very small BLEU degradation compared to the standard NNJM. (Page 4, “Neural Network Joint Model (NNJM)”)
- Table header: Ar-En BLEU, Ch-En BLEU. OpenMT12 1st Place: 49.5 (Ar-En), 32.6 (Ch-En). (Page 6, “Model Variations”)
- BLEU scores are mixed-case. (Page 6, “Model Variations”)
- On Arabic-English, the primary S2T/L2R NNJM gains +1.4 BLEU on top of our baseline, while the S2T NNLTM gains another +0.8, and the directional variations gain +0.8 BLEU more. (Page 6, “Model Variations”)
- This leads to a total improvement of +3.0 BLEU from the NNJM and its variations. (Page 6, “Model Variations”)

See all papers in *Proc. ACL 2014* that mention BLEU.


Appears in 33 sentences as: Neural Network (6) neural network (25) neural networks (1) neural network’s (1)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- Recent work has shown success in using neural network language models (NNLMs) as features in MT systems. (Page 1, “Abstract”)
- Here, we present a novel formulation for a neural network joint model (NNJM), which augments the NNLM with a source context window. (Page 1, “Abstract”)
- In recent years, neural network models have become increasingly popular in NLP. (Page 1, “Introduction”)
- Initially, these models were primarily used to create n-gram neural network language models (NNLMs) for speech recognition and machine translation (Bengio et al., 2003; Schwenk, 2010). (Page 1, “Introduction”)
- In this paper we use a basic neural network architecture and a lexicalized probability model to create a powerful MT decoding feature. (Page 1, “Introduction”)
- Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window. (Page 1, “Introduction”)
- We also present a novel technique for training the neural network to be self-normalized, which avoids the costly step of posteriorizing over the entire vocabulary in decoding. (Page 1, “Introduction”)
- Fortunately, neural network language models are able to elegantly scale up and take advantage of arbitrarily large context sizes. (Page 2, “Neural Network Joint Model (NNJM)”)
- 2.1 Neural Network Architecture (Page 2, “Neural Network Joint Model (NNJM)”)
- Our neural network architecture is almost identical to the original feed-forward NNLM architecture described in Bengio et al. (Page 2, “Neural Network Joint Model (NNJM)”)
- 2.2 Neural Network Training (Page 2, “Neural Network Joint Model (NNJM)”)


Appears in 17 sentences as: Hidden Layer (1) hidden layer (12) hidden layers (5)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- When used in conjunction with a precomputed hidden layer, these techniques speed up NNJM computation by a factor of 10,000X, with only a small reduction on MT accuracy. (Page 1, “Introduction”)
- We use two 512-dimensional hidden layers with tanh activation functions. (Page 2, “Neural Network Joint Model (NNJM)”)
- We chose these values for the hidden layer size, vocabulary size, and source window size because they seemed to work best on our data sets; larger sizes did not improve results, while smaller sizes degraded results. (Page 2, “Neural Network Joint Model (NNJM)”)
- 2.4 Pre-Computing the Hidden Layer (Page 4, “Neural Network Joint Model (NNJM)”)
- Here, we present a “trick” for pre-computing the first hidden layer, which further increases the speed of NNJM lookups by a factor of 1,000X. (Page 4, “Neural Network Joint Model (NNJM)”)
- Note that this technique only results in a significant speedup for self-normalized, feed-forward, NNLM-style networks with one hidden layer. (Page 4, “Neural Network Joint Model (NNJM)”)
- We demonstrate in Section 6.6 that using one hidden layer instead of two has minimal effect on BLEU. (Page 4, “Neural Network Joint Model (NNJM)”)
- For the neural network described in Section 2.1, computing the first hidden layer requires multiplying a 2689-dimensional input vector⁵ with a 2689 × 512 dimensional hidden layer matrix. (Page 4, “Neural Network Joint Model (NNJM)”)
- Therefore, for every word in the vocabulary, and for each position, we can pre-compute the dot product between the word embedding and the first hidden layer. (Page 4, “Neural Network Joint Model (NNJM)”)
- Computing the first hidden layer now only requires 15 scalar additions for each of the 512 hidden rows, one for each word in the input ... (Page 4, “Neural Network Joint Model (NNJM)”)
- If our neural network has only one hidden layer and is self-normalized, the only remaining computation is 512 calls to tanh() and a single 513-dimensional dot product for the final output score.⁶ Thus, only ~3500 arithmetic operations are required per n-gram lookup, compared to ~2.8M for self-normalized NNJM without pre-computation, and ~35M for the standard NNJM.⁷ (Page 4, “Neural Network Joint Model (NNJM)”)
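The pre-computation described in the sentences above can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The function names, the toy sizes, and the random initialization are all hypothetical, chosen only to show that summing pre-computed per-position, per-word vectors gives the same hidden activations as the full matrix product.

```python
import math
import random


def make_model(vocab_size, num_positions, embed_dim, hidden_dim, seed=0):
    """Build a toy one-hidden-layer model with random weights (illustrative only)."""
    rng = random.Random(seed)

    def rand():
        return rng.uniform(-0.1, 0.1)

    embeddings = [[rand() for _ in range(embed_dim)] for _ in range(vocab_size)]
    # weights[p][h] is the slice of hidden row h that multiplies the embedding
    # of the word sitting at context position p.
    weights = [[[rand() for _ in range(embed_dim)] for _ in range(hidden_dim)]
               for _ in range(num_positions)]
    bias = [rand() for _ in range(hidden_dim)]
    return embeddings, weights, bias


def hidden_direct(context, embeddings, weights, bias):
    """Standard computation: one dot product per position and hidden unit."""
    hidden = []
    for h in range(len(bias)):
        total = bias[h]
        for p, word in enumerate(context):
            total += sum(e * w for e, w in zip(embeddings[word], weights[p][h]))
        hidden.append(math.tanh(total))
    return hidden


def precompute(embeddings, weights):
    """table[p][word][h] = dot(embedding of word, weights for position p, unit h)."""
    return [[[sum(e * w for e, w in zip(emb, weights[p][h]))
              for h in range(len(weights[p]))]
             for emb in embeddings]
            for p in range(len(weights))]


def hidden_precomputed(context, table, bias):
    """Fast computation: per hidden unit, only additions of table entries plus bias."""
    hidden = []
    for h in range(len(bias)):
        total = bias[h]
        for p, word in enumerate(context):
            total += table[p][word][h]
        hidden.append(math.tanh(total))
    return hidden
```

The table trades memory (one vector per vocabulary word per context position) for lookup speed, which is why it is built once after training rather than per query.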


Appears in 14 sentences as: LM (15)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- These techniques speed up NNJM computation by a factor of 10,000X, making the model as fast as a standard back-off LM. (Page 1, “Abstract”)
- Formally, our model approximates the probability of target hypothesis T conditioned on source sentence S. We follow the standard n-gram LM decomposition of the target, where each target word t_i is conditioned on the previous n − 1 target words. (Page 2, “Neural Network Joint Model (NNJM)”)
- It is clear that this model is effectively an (n+m)-gram LM, and a 15-gram LM would be ... (Page 2, “Neural Network Joint Model (NNJM)”)
- Although self-normalization significantly improves the speed of NNJM lookups, the model is still several orders of magnitude slower than a back-off LM. (Page 4, “Neural Network Joint Model (NNJM)”)
- By combining self-normalization and pre-computation, we can achieve a speed of 1.4M lookups/second, which is on par with fast back-off LM implementations (Tanaka et al., 2013). (Page 4, “Neural Network Joint Model (NNJM)”)
- When performing hierarchical decoding with an n-gram LM, the leftmost and rightmost n − 1 words from each constituent must be stored in the state space. (Page 5, “Decoding with the NNJM”)
- 4-gram Kneser-Ney LM (Page 5, “Model Variations”)
- Dependency LM (Shen et al., 2010); Contextual lexical smoothing (Devlin, 2009); Length distribution (Shen et al., 2010) (Page 5, “Model Variations”)
- LM adaptation (Snover et al., 2008) (Page 5, “Model Variations”)
- 5-gram Kneser-Ney LM; Recurrent neural network language model (RNNLM) (Mikolov et al., 2010) (Page 6, “Model Variations”)
- Our NIST system is fully compatible with the OpenMT12 constrained track, which consists of 10M words of high-quality parallel training for Arabic, and 25M words for Chinese.¹⁰ The Kneser-Ney LM is trained on 5B words of data from English GigaWord. (Page 6, “Model Variations”)


Appears in 13 sentences as: NIST (13)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM. (Page 1, “Abstract”)
- We show primary results on the NIST OpenMT12 Arabic-English condition. (Page 1, “Introduction”)
- We also show strong improvements on the NIST OpenMT12 Chinese-English task, as well as the DARPA BOLT (Broad Operational Language Translation) Arabic-English and Chinese-English conditions. (Page 2, “Introduction”)
- For Arabic word tokenization, we use the MADA-ARZ tokenizer (Habash et al., 2013) for the BOLT condition, and the Sakhr⁹ tokenizer for the NIST condition. (Page 6, “Model Variations”)
- We present MT primary results on Arabic-English and Chinese-English for the NIST OpenMT12 and DARPA BOLT conditions. (Page 6, “Model Variations”)
- 6.1 NIST OpenMT12 Results (Page 6, “Model Variations”)
- Our NIST system is fully compatible with the OpenMT12 constrained track, which consists of 10M words of high-quality parallel training for Arabic, and 25M words for Chinese.¹⁰ The Kneser-Ney LM is trained on 5B words of data from English GigaWord. (Page 6, “Model Variations”)
- NIST MT12 Test (Page 6, “Model Variations”)
- Table 3: Primary results on Arabic-English and Chinese-English NIST MT12 Test Set. (Page 6, “Model Variations”)
- The first section corresponds to the top and bottom ranked systems from the evaluation, and are taken from the NIST website. (Page 6, “Model Variations”)
- 6.2 “Simple Hierarchical” NIST Results (Page 7, “Model Variations”)


Appears in 11 sentences as: n-gram (11)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- Initially, these models were primarily used to create n-gram neural network language models (NNLMs) for speech recognition and machine translation (Bengio et al., 2003; Schwenk, 2010). (Page 1, “Introduction”)
- Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window. (Page 1, “Introduction”)
- Formally, our model approximates the probability of target hypothesis T conditioned on source sentence S. We follow the standard n-gram LM decomposition of the target, where each target word t_i is conditioned on the previous n − 1 target words. (Page 2, “Neural Network Joint Model (NNJM)”)
- If our neural network has only one hidden layer and is self-normalized, the only remaining computation is 512 calls to tanh() and a single 513-dimensional dot product for the final output score.⁶ Thus, only ~3500 arithmetic operations are required per n-gram lookup, compared to ~2.8M for self-normalized NNJM without pre-computation, and ~35M for the standard NNJM.⁷ (Page 4, “Neural Network Joint Model (NNJM)”)
- “lookups/sec” is the number of unique n-gram probabilities that can be computed per second. (Page 4, “Neural Network Joint Model (NNJM)”)
- Because our NNJM is fundamentally an n-gram NNLM with additional source context, it can easily be integrated into any SMT decoder. (Page 4, “Decoding with the NNJM”)
- When performing hierarchical decoding with an n-gram LM, the leftmost and rightmost n − 1 words from each constituent must be stored in the state space. (Page 5, “Decoding with the NNJM”)
- We also train a separate lower-order n-gram model, which is necessary to compute estimate scores during hierarchical decoding. (Page 5, “Decoding with the NNJM”)
- Specifically, this means that we don’t use dependency-based rule extraction, and our decoder only contains the following MT features: (1) rule probabilities, (2) n-gram Kneser-Ney LM, (3) lexical smoothing, (4) target word count, (5) concat rule penalty. (Page 7, “Model Variations”)
- This does not include the cost of n-gram creation or cached lookups, which amount to ~0.03 seconds per source word in our current implementation.¹⁴ However, the n-grams created for the NNJM can be shared with the Kneser-Ney LM, which reduces the cost of that feature. (Page 8, “Model Variations”)
- ¹⁴ In our decoder, roughly 95% of NNJM n-gram lookups within the same sentence are duplicates. (Page 8, “Model Variations”)
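Because most lookups within a sentence are reported to be duplicates, a per-sentence cache is a natural fit. The sketch below shows one possible shape for such a cache. The class name `CachedLookup`, the key layout, and the placeholder `toy_score` function are hypothetical and are not taken from the paper's decoder.

```python
class CachedLookup:
    """Memoize scores for repeated (target history, source window, word) queries."""

    def __init__(self, score_fn):
        self._score_fn = score_fn  # the expensive model call being wrapped
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def score(self, target_history, source_window, word):
        # Tuples make the context hashable so it can serve as a dict key.
        key = (tuple(target_history), tuple(source_window), word)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._score_fn(*key)
        return self._cache[key]

    def clear(self):
        """Reset between sentences, since duplicates are counted per sentence."""
        self._cache.clear()


def toy_score(target_history, source_window, word):
    # Placeholder standing in for a real model lookup (assumption for illustration).
    return -0.1 * (len(target_history) + len(source_window) + len(word))
```

Calling `clear()` at sentence boundaries keeps memory bounded while still capturing the within-sentence repetition the footnote describes.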


Appears in 9 sentences as: Chinese-English (10)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- We also show strong improvements on the NIST OpenMT12 Chinese-English task, as well as the DARPA BOLT (Broad Operational Language Translation) Arabic-English and Chinese-English conditions. (Page 2, “Introduction”)
- An example of the NNJM context model for a Chinese-English parallel sentence is given in Figure 1. (Page 2, “Neural Network Joint Model (NNJM)”)
- We present MT primary results on Arabic-English and Chinese-English for the NIST OpenMT12 and DARPA BOLT conditions. (Page 6, “Model Variations”)
- Table 3: Primary results on Arabic-English and Chinese-English NIST MT12 Test Set. (Page 6, “Model Variations”)
- For the Chinese-English condition, there is an improvement of +0.8 BLEU from the primary NNJM and +1.3 BLEU overall. (Page 6, “Model Variations”)
- The smaller improvement on Chinese-English compared to Arabic-English is consistent with the behavior of our baseline features, as we show in the next section. (Page 7, “Model Variations”)
- For Chinese-English, the “Simple Hierarchical” system only degrades by -3.2 BLEU compared to our strongest baseline, and the NNJM features produce a gain of +2.1 BLEU on top of that. (Page 7, “Model Variations”)
- Table 4: Primary results on Arabic-English and Chinese-English BOLT Web Forum. (Page 7, “Model Variations”)
- They obtain a +0.5 BLEU improvement on Chinese-English (30.0 vs. 30.5). (Page 9, “Model Variations”)


Appears in 7 sentences as: joint model (5) joint modeling (1) joint models (1)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- Here, we present a novel formulation for a neural network joint model (NNJM), which augments the NNLM with a source context window. (Page 1, “Abstract”)
- Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window. (Page 1, “Introduction”)
- Unlike previous approaches to joint modeling (Le et al., 2012), our feature can be easily integrated into any statistical machine translation (SMT) decoder, which leads to substantially larger improvements than k-best rescoring only. (Page 1, “Introduction”)
- To make this a joint model, we also condition on source context vector S_i: (Page 2, “Neural Network Joint Model (NNJM)”)
- Although there has been a substantial amount of past work in lexicalized joint models (Marino et al., 2006; Crego and Yvon, 2010), nearly all of these papers have used older statistical techniques such as Kneser-Ney or Maximum Entropy. (Page 8, “Model Variations”)
- This is consistent with our rescoring-only result, which indicates that k-best rescoring is too shallow to take advantage of the power of a joint model. (Page 9, “Model Variations”)
- We have described a novel formulation for a neural network-based machine translation joint model, along with several simple variations of this model. (Page 9, “Model Variations”)


Appears in 6 sentences as: language model (3) language models (3)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- Recent work has shown success in using neural network language models (NNLMs) as features in MT systems. (Page 1, “Abstract”)
- Initially, these models were primarily used to create n-gram neural network language models (NNLMs) for speech recognition and machine translation (Bengio et al., 2003; Schwenk, 2010). (Page 1, “Introduction”)
- Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window. (Page 1, “Introduction”)
- Fortunately, neural network language models are able to elegantly scale up and take advantage of arbitrarily large context sizes. (Page 2, “Neural Network Joint Model (NNJM)”)
- In particular, we can reverse the translation direction of the languages, as well as the direction of the language model. (Page 5, “Model Variations”)
- 5-gram Kneser-Ney LM; Recurrent neural network language model (RNNLM) (Mikolov et al., 2010) (Page 6, “Model Variations”)


Appears in 6 sentences as: lexicalized (6)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- Our model is purely lexicalized and can be integrated into any MT decoder. (Page 1, “Abstract”)
- In this paper we use a basic neural network architecture and a lexicalized probability model to create a powerful MT decoding feature. (Page 1, “Introduction”)
- Although there has been a substantial amount of past work in lexicalized joint models (Marino et al., 2006; Crego and Yvon, 2010), nearly all of these papers have used older statistical techniques such as Kneser-Ney or Maximum Entropy. (Page 8, “Model Variations”)
- Le’s model also uses minimal phrases rather than being purely lexicalized, which has two main downsides: (a) a number of complex, handcrafted heuristics are required to define phrase boundaries, which may not transfer well to new languages, (b) the effective vocabulary size is much larger, which substantially increases data sparsity issues. (Page 9, “Model Variations”)
- The fact that the model is purely lexicalized, which avoids both data sparsity and implementation complexity. (Page 9, “Model Variations”)
- For example, creating a new type of decoder centered around a purely lexicalized neural network model. (Page 9, “Model Variations”)


Appears in 4 sentences as: machine translation (4)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- Initially, these models were primarily used to create n-gram neural network language models (NNLMs) for speech recognition and machine translation (Bengio et al., 2003; Schwenk, 2010). (Page 1, “Introduction”)
- Unlike previous approaches to joint modeling (Le et al., 2012), our feature can be easily integrated into any statistical machine translation (SMT) decoder, which leads to substantially larger improvements than k-best rescoring only. (Page 1, “Introduction”)
- We have described a novel formulation for a neural network-based machine translation joint model, along with several simple variations of this model. (Page 9, “Model Variations”)
- One of the biggest goals of this work is to quell any remaining doubts about the utility of neural networks in machine translation. (Page 9, “Model Variations”)


Appears in 4 sentences as: MT System (1) MT system (2) MT systems (1)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- Recent work has shown success in using neural network language models (NNLMs) as features in MT systems. (Page 1, “Abstract”)
- 5 MT System (Page 5, “Model Variations”)
- In this section, we describe the MT system used in our experiments. (Page 5, “Model Variations”)
- Because of this, the baseline BLEU scores are much higher than a typical MT system, especially a real-time, production engine which must support many language pairs. (Page 7, “Model Variations”)


Appears in 4 sentences as: word alignment (4) word aligns (1)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- This notion of affiliation is derived from the word alignment, but unlike word alignment, each target word must be affiliated with exactly one non-NULL source word. (Page 2, “Neural Network Joint Model (NNJM)”)
- For aligned target words, the normal affiliation heuristic can be used, since the word alignment is available within the rule. (Page 5, “Decoding with the NNJM”)
- We treat NULL as a normal target word, and if a source word aligns to multiple target words, it is treated as a single concatenated token. (Page 5, “Model Variations”)
- For word alignment, we align all of the training data with both GIZA++ (Och and Ney, 2003) and NILE (Riesa et al., 2011), and concatenate the corpora together for rule extraction. (Page 6, “Model Variations”)


Appears in 3 sentences as: objective function (2) objective function: (1)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- While we cannot train a neural network with this guarantee, we can explicitly encourage the log-softmax normalizer to be as close to 0 as possible by augmenting our training objective function: (Page 3, “Neural Network Joint Model (NNJM)”)
- Note that α = 0 is equivalent to the standard neural network objective function. (Page 4, “Neural Network Joint Model (NNJM)”)
- For MT feature weight optimization, we use iterative k-best optimization with an Expected-BLEU objective function (Rosti et al., 2010). (Page 6, “Model Variations”)
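The augmented objective referred to in the first two sentences above is not reproduced on this page. A minimal sketch of the idea, assuming a squared penalty on the log-normalizer weighted by α (the function names and the per-example formulation are illustrative, not the authors' code), could look like this:

```python
import math


def log_normalizer(scores):
    """log Z = log(sum(exp(score))) over the output layer, computed stably."""
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))


def self_normalized_loss(scores, target, alpha):
    """Negative of [log P(target) - alpha * (log Z)^2] for one training example.

    Assumption: the penalty is the squared log-normalizer scaled by alpha.
    With alpha = 0 this reduces to the ordinary negative log-likelihood.
    """
    log_z = log_normalizer(scores)
    log_prob = scores[target] - log_z
    return -(log_prob - alpha * log_z ** 2)
```

When training drives log Z toward 0, the raw output score for a word can stand in for its log-probability at decoding time, which is what removes the need to sum over the whole vocabulary.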


Appears in 3 sentences as: probability model (2) probability models (1)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- In this paper we use a basic neural network architecture and a lexicalized probability model to create a powerful MT decoding feature. (Page 1, “Introduction”)
- ... far too sparse for standard probability models such as Kneser-Ney back-off (Kneser and Ney, 1995) or Maximum Entropy (Rosenfeld, 1996). (Page 2, “Neural Network Joint Model (NNJM)”)
- Formally, the probability model is: (Page 5, “Model Variations”)


Appears in 3 sentences as: Translation Model (1) translation model (1) translation modeling (1)

In *Fast and Robust Neural Network Joint Models for Statistical Machine Translation*

- They have since been extended to translation modeling, parsing, and many other NLP tasks. (Page 1, “Introduction”)
- 4.1 Neural Network Lexical Translation Model (NNLTM) (Page 5, “Model Variations”)
- In order to assign a probability to every source word during decoding, we also train a neural network lexical translation model (NNLTM). (Page 5, “Model Variations”)
