Decoder Integration and Expected BLEU Training for Recurrent Neural Network Language Models

Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation.

Neural network-based language and translation models have achieved impressive accuracy improvements on statistical machine translation tasks (Allauzen et al., 2011; Le et al., 2012b; Schwenk et al., 2012; Vaswani et al., 2013; Gao et al., 2014).

Our model has a similar structure to the recurrent neural network language model of Mikolov et al.

We integrate the recurrent neural network language model as an additional feature into the standard log-linear framework of translation (Och, 2003).

Directly integrating our recurrent neural network language model into first-pass decoding enables us to search a much larger space than would be possible in rescoring.

Baseline.

We introduce an empirically effective approximation to integrate a recurrent neural network model into first pass decoding, thereby extending previous work on decoding with feed-forward neu-

We thank Michel Galley, Arul Menezes, Chris Quirk and Geoffrey Zweig for helpful discussions related to this work as well as the four anonymous reviewers for their comments.

Appears in 25 sentences as: BLEU (37)

In *Decoder Integration and Expected BLEU Training for Recurrent Neural Network Language Models*

- Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation.Page 1, “Abstract”
- We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion.Page 1, “Abstract”
- Our best results improve a phrase-based statistical machine translation system trained on WMT 2012 French-English data by up to 2.0 BLEU, and the expected BLEU objective improves over a cross-entropy trained model by up to 0.6 BLEU in a single reference setup.Page 1, “Abstract”
- The expected BLEU objective provides an efficient way of achieving this for machine translation (Rosti et al., 2010; Rosti et al., 2011; He and Deng, 2012; Gao and He, 2013; Gao et al., 2014) instead of solely relying on traditional optimizers such as Minimum Error Rate Training (MERT) that only adjust the weighting of entire component models within the log-linear framework of machine translation (§3).Page 1, “Introduction”
- We test the expected BLEU objective by training a recurrent neural network language model and obtain substantial improvements.Page 1, “Introduction”
- time algorithm, which unrolls the network and then computes error gradients over multiple time steps (Rumelhart et al., 1986); we use the expected BLEU loss (§3) to obtain the error with respect to the output activations.Page 2, “Recurrent Neural Network LMs”
- The n-best lists serve as an approximation to $\mathcal{E}(f)$ used in the next step for expected BLEU training of the recurrent neural network model (§3.Page 2, “Expected BLEU Training”
- 3.1 Expected BLEU ObjectivePage 3, “Expected BLEU Training”
- Formally, we define our loss function $l(\theta)$ as the negative expected BLEU score, denoted as $\mathrm{xBLEU}(\theta)$ for a given foreign sentence $f$:Page 3, “Expected BLEU Training”
- where $\mathrm{sBLEU}(e, e^{(i)})$ is a smoothed sentence-level BLEU score with respect to the reference translation $e^{(i)}$, and $\mathcal{E}(f)$ is the generation set given by an n-best list. We use a sentence-level BLEU approximation similar to He and Deng (2012). The normalized probability $p_{\Lambda,\theta}(e \mid f)$ of a particular translation $e$ given $f$ is defined as:Page 3, “Expected BLEU Training”
- Next, we define the gradient of the expected BLEU loss function $l(\theta)$ using the observation that the loss does not explicitly depend on $\theta$:Page 3, “Expected BLEU Training”
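
As a rough illustration of the objective quoted in this list, the sketch below computes the negative expected BLEU over an n-best list. The excerpts do not reproduce the paper's definition of the normalized probability, so the softmax form, the scaling factor `gamma`, the function names, and the toy numbers are all assumptions made for illustration.

```python
import math

def normalized_probs(model_scores, gamma=1.0):
    """Assumed form of p(e|f): a softmax over the model scores of the n-best list."""
    scaled = [gamma * s for s in model_scores]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [x / z for x in exps]

def expected_bleu_loss(model_scores, sbleu_scores, gamma=1.0):
    """Negative expected sentence-level BLEU over the n-best list."""
    probs = normalized_probs(model_scores, gamma)
    xbleu = sum(p * b for p, b in zip(probs, sbleu_scores))
    return -xbleu

# Toy n-best list with two hypotheses (hypothetical numbers).
sbleu = [0.2, 0.6]
loss_uniform = expected_bleu_loss([0.0, 0.0], sbleu)
loss_shifted = expected_bleu_loss([0.0, 2.0], sbleu)
```

In this toy case, raising the model score of the hypothesis with the higher sentence-level BLEU lowers the loss, which is the behavior the training objective rewards.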

See all papers in *Proc. ACL 2014* that mention BLEU.

See all papers in *Proc. ACL* that mention BLEU.


Appears in 23 sentences as: Neural Network (1) Neural network (1) neural network (20) neural networks (1)

- Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation.Page 1, “Abstract”
- We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion.Page 1, “Abstract”
- In this paper we focus on recurrent neural network architectures which have recently advanced the state of the art in language modeling (Mikolov et al., 2010; Mikolov et al., 2011; Sundermeyer et al., 2013) with several subsequent applications in machine translation (Auli et al., 2013; Kalchbrenner and Blunsom, 2013; Hu et al., 2014).Page 1, “Introduction”
- In practice, neural network models for machine translation are usually trained by maximizing the likelihood of the training data, either via a cross-entropy objective (Mikolov et al., 2010; SchwenkPage 1, “Introduction”
- Most previous work on neural networks for machine translation is based on a rescoring setup (Arisoy et al., 2012; Mikolov, 2012; Le et al., 2012a; Auli et al., 2013), thereby side stepping the algorithmic and engineering challenges of direct decoder-integration.Page 1, “Introduction”
- Decoder integration has the advantage for the neural network to directly influence search, unlike rescoring which is restricted to an n-best list or lattice.Page 1, “Introduction”
- We test the expected BLEU objective by training a recurrent neural network language model and obtain substantial improvements.Page 1, “Introduction”
- Figure 1: Structure of the recurrent neural network language model.Page 2, “Introduction”
- Our model has a similar structure to the recurrent neural network language model of Mikolov et al.Page 2, “Recurrent Neural Network LMs”
- We integrate the recurrent neural network language model as an additional feature into the standard log-linear framework of translation (Och, 2003).Page 2, “Expected BLEU Training”
- We summarize the weights of the recurrent neural network language model as $\theta = \{U, W, V\}$ and add the model as an additional feature to the log-linear translation model using the simplified notation $s_\theta(w_t) = s(w_t \mid w_1 \ldots w_{t-1}, h_{t-1})$:Page 2, “Expected BLEU Training”

Appears in 16 sentences as: language model (12) language modeling (1) language models (3)

- Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation.Page 1, “Abstract”
- We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion.Page 1, “Abstract”
- In this paper we focus on recurrent neural network architectures which have recently advanced the state of the art in language modeling (Mikolov et al., 2010; Mikolov et al., 2011; Sundermeyer et al., 2013) with several subsequent applications in machine translation (Auli et al., 2013; Kalchbrenner and Blunsom, 2013; Hu et al., 2014).Page 1, “Introduction”
- (2013) who demonstrated that feed-forward network-based language models are more accurate in first-pass decoding than in rescoring.Page 1, “Introduction”
- Decoding with feed-forward architectures is straightforward, since predictions are based on a fixed size input, similar to n-gram language models.Page 1, “Introduction”
- We test the expected BLEU objective by training a recurrent neural network language model and obtain substantial improvements.Page 1, “Introduction”
- Figure 1: Structure of the recurrent neural network language model.Page 2, “Introduction”
- Our model has a similar structure to the recurrent neural network language model of Mikolov et al.Page 2, “Recurrent Neural Network LMs”
- We integrate the recurrent neural network language model as an additional feature into the standard log-linear framework of translation (Och, 2003).Page 2, “Expected BLEU Training”
- We summarize the weights of the recurrent neural network language model as $\theta = \{U, W, V\}$ and add the model as an additional feature to the log-linear translation model using the simplified notation $s_\theta(w_t) = s(w_t \mid w_1 \ldots w_{t-1}, h_{t-1})$:Page 2, “Expected BLEU Training”
- which computes a sentence-level language model score as the sum of individual word scores.Page 2, “Expected BLEU Training”

Appears in 10 sentences as: hidden layer (14)

- (2010) which is factored into an input layer, a hidden layer with recurrent connections, and an output layer (Figure 1).Page 2, “Recurrent Neural Network LMs”
- The hidden layer state $h_t$ encodes the history of all words observed in the sequence up to time step $t$. The state of the hidden layer is determined by the input layer and the hidden layer configuration of the previous time step $h_{t-1}$.Page 2, “Recurrent Neural Network LMs”
- The weights of the connections between the layers are summarized in a number of matrices: U represents weights from the input layer to the hidden layer, and W represents connections from the previous hidden layer to the current hidden layer.Page 2, “Recurrent Neural Network LMs”
- Matrix V contains weights between the current hidden layer and the output layer.Page 2, “Recurrent Neural Network LMs”
- .$w_t$, $h_t$) for the next word given the previous $t$ input words and the current hidden layer configuration $h_t$.Page 2, “Recurrent Neural Network LMs”
- To solve this problem, we follow previous work on lattice rescoring with recurrent networks that maintained the usual n-gram context but kept a beam of hidden layer configurations at each state (Auli et al., 2013).Page 4, “Decoder Integration”
- In fact, to make decoding as efficient as possible, we only keep the single best scoring hidden layer configuration.Page 4, “Decoder Integration”
- As future cost estimate we score each phrase in isolation, resetting the hidden layer at the beginning of a phrase.Page 4, “Decoder Integration”
- The hidden layer uses 100 neurons unless otherwise stated.Page 4, “Experiments”
- To keep training times manageable, we reduce the hidden layer size to 30 neurons, thereby greatly increasing speed.Page 5, “Experiments”
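
The layer structure described in this list can be sketched as a single forward step. This is a minimal sketch, not the authors' implementation: the one-hot input encoding, the sigmoid hidden activation, the softmax output, the helper names, and the toy weights are assumptions in the spirit of a Mikolov-style network; only the roles of U, W, and V come from the excerpts.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(values):
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [x / total for x in exps]

def rnn_step(U, W, V, word_id, h_prev):
    """One time step: returns the new hidden state and a next-word distribution."""
    # With a one-hot input (assumed), U times the input is just column word_id of U.
    input_part = [row[word_id] for row in U]
    # W carries the previous hidden layer into the current one.
    recurrent_part = matvec(W, h_prev)
    h_t = [1.0 / (1.0 + math.exp(-(a + b)))
           for a, b in zip(input_part, recurrent_part)]
    # V maps the current hidden layer to the output layer.
    y_t = softmax(matvec(V, h_t))
    return h_t, y_t

# Toy model: vocabulary of 3 words, hidden layer of 2 neurons (hypothetical weights).
U = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
W = [[0.5, -0.3], [0.2, 0.1]]
V = [[0.3, -0.1], [0.0, 0.2], [-0.4, 0.6]]
h = [0.0, 0.0]
for w in [0, 2, 1]:
    h, probs = rnn_step(U, W, V, w, h)
```

Because the hidden state is threaded through every step, `probs` after the loop depends on the whole word history rather than on a fixed-size window.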

Appears in 7 sentences as: machine translation (8)

- Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation.Page 1, “Abstract”
- Our best results improve a phrase-based statistical machine translation system trained on WMT 2012 French-English data by up to 2.0 BLEU, and the expected BLEU objective improves over a cross-entropy trained model by up to 0.6 BLEU in a single reference setup.Page 1, “Abstract”
- Neural network-based language and translation models have achieved impressive accuracy improvements on statistical machine translation tasks (Allauzen et al., 2011; Le et al., 2012b; Schwenk et al., 2012; Vaswani et al., 2013; Gao et al., 2014).Page 1, “Introduction”
- In this paper we focus on recurrent neural network architectures which have recently advanced the state of the art in language modeling (Mikolov et al., 2010; Mikolov et al., 2011; Sundermeyer et al., 2013) with several subsequent applications in machine translation (Auli et al., 2013; Kalchbrenner and Blunsom, 2013; Hu et al., 2014).Page 1, “Introduction”
- In practice, neural network models for machine translation are usually trained by maximizing the likelihood of the training data, either via a cross-entropy objective (Mikolov et al., 2010; SchwenkPage 1, “Introduction”
- The expected BLEU objective provides an efficient way of achieving this for machine translation (Rosti et al., 2010; Rosti et al., 2011; He and Deng, 2012; Gao and He, 2013; Gao et al., 2014) instead of solely relying on traditional optimizers such as Minimum Error Rate Training (MERT) that only adjust the weighting of entire component models within the log-linear framework of machine translation (§3).Page 1, “Introduction”
- Most previous work on neural networks for machine translation is based on a rescoring setup (Arisoy et al., 2012; Mikolov, 2012; Le et al., 2012a; Auli et al., 2013), thereby side stepping the algorithmic and engineering challenges of direct decoder-integration.Page 1, “Introduction”

Appears in 7 sentences as: phrase-based (7)

- Our best results improve a phrase-based statistical machine translation system trained on WMT 2012 French-English data by up to 2.0 BLEU, and the expected BLEU objective improves over a cross-entropy trained model by up to 0.6 BLEU in a single reference setup.Page 1, “Abstract”
- Formally, our phrase-based model is parameterized by $M$ parameters $\Lambda$ where each $\lambda_m \in \Lambda$, $m = 1$ .Page 2, “Expected BLEU Training”
- Typically, phrase-based decoders maintain a set of states representing partial and complete translation hypothesis that are scored by a set of features.Page 3, “Decoder Integration”
- We use a phrase-based system similar to Moses (Koehn et al., 2007) based on a set of common features including maximum likelihood estimates $p_{ML}(e \mid f)$ and $p_{ML}(f \mid e)$, lexically weighted estimates $p_{LW}(e \mid f)$ and $p_{LW}(f \mid e)$, word and phrase-penalties, a hierarchical reordering model (Galley and Manning, 2008), a linear distortion feature, and a modified Kneser-Ney language model trained on the target-side of the parallel data.Page 4, “Experiments”
- ther lattices or the unique 100-best output of the phrase-based decoder and reestimate the log-linear weights by running a further iteration of MERT on the n-best list of the development set, augmented by scores corresponding to the neural network models.Page 4, “Experiments”
- We use the same data both for training the phrase-based system as well as the language model but find that the resulting bias did not hurt end-to-end accuracy (Yu et al., 2013).Page 4, “Experiments”
- Our best result improves the output of a phrase-based decoder by up to 2.0 BLEU on French-English translation, outperforming n-best rescoring by up to 1.1 BLEU and lattice rescoring by up to 0.4 BLEU.Page 5, “Conclusion and Future Work”

Appears in 6 sentences as: Log-linear (2) log-linear (4)

- The expected BLEU objective provides an efficient way of achieving this for machine translation (Rosti et al., 2010; Rosti et al., 2011; He and Deng, 2012; Gao and He, 2013; Gao et al., 2014) instead of solely relying on traditional optimizers such as Minimum Error Rate Training (MERT) that only adjust the weighting of entire component models within the log-linear framework of machine translation (§3).Page 1, “Introduction”
- We integrate the recurrent neural network language model as an additional feature into the standard log-linear framework of translation (Och, 2003).Page 2, “Expected BLEU Training”
- We summarize the weights of the recurrent neural network language model as $\theta = \{U, W, V\}$ and add the model as an additional feature to the log-linear translation model using the simplified notation $s_\theta(w_t) = s(w_t \mid w_1 \ldots w_{t-1}, h_{t-1})$:Page 2, “Expected BLEU Training”
- Log-linear weights are tuned with MERT.Page 4, “Experiments”
- Log-linear weights are estimated on the 2009 data set comprising 2525 sentences.Page 4, “Experiments”
- ther lattices or the unique 100-best output of the phrase-based decoder and reestimate the log-linear weights by running a further iteration of MERT on the n-best list of the development set, augmented by scores corresponding to the neural network models.Page 4, “Experiments”
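
To make the log-linear combination concrete, the following sketch scores one hypothesis as a weighted sum of baseline features plus the recurrent language model feature, with the sentence-level language model score taken as the sum of per-word scores. The function name, argument names, and numbers are hypothetical; the default weight of 1 for the added feature mirrors the setting quoted elsewhere on this page.

```python
def log_linear_score(weights, features, rnn_word_scores, rnn_weight=1.0):
    """Score a hypothesis under a log-linear model with an extra RNN LM feature.

    weights, features: the M baseline weights and feature values.
    rnn_word_scores:   per-word scores from the recurrent language model.
    rnn_weight:        weight of the added feature (index M+1).
    """
    baseline = sum(lam * h for lam, h in zip(weights, features))
    # Sentence-level language model score: sum of the individual word scores.
    rnn_sentence_score = sum(rnn_word_scores)
    return baseline + rnn_weight * rnn_sentence_score

# Two baseline features and a two-word hypothesis (hypothetical values).
score = log_linear_score([0.5, 2.0], [1.0, -1.0], [-1.0, -2.0])
```

Tuning with MERT then amounts to adjusting `weights` and `rnn_weight` while the feature values stay fixed.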

Appears in 4 sentences as: loss function (4)

- Next, we fix $\Lambda$, set $\lambda_{M+1} = 1$ and optimize $\theta$ with respect to the loss function on the training data using stochastic gradient descent (SGD).$^1$Page 2, “Expected BLEU Training”
- Formally, we define our loss function $l(\theta)$ as the negative expected BLEU score, denoted as $\mathrm{xBLEU}(\theta)$ for a given foreign sentence $f$:Page 3, “Expected BLEU Training”
- Next, we define the gradient of the expected BLEU loss function $l(\theta)$ using the observation that the loss does not explicitly depend on $\theta$:Page 3, “Expected BLEU Training”
- We rewrite the loss function (2) using (3) and separate it into two terms $G(\theta)$ and $Z(\theta)$ as follows:Page 3, “Expected BLEU Training”
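
The excerpts above stop before the equations themselves. One plausible reading of the split into two terms, writing $\theta$ for the network weights and assuming the normalized probability is a softmax over the n-best list with a scaling factor $\gamma$ and a model score $\phi(f, e)$ (both notational assumptions, not taken from this page), is:

```latex
\begin{align}
l(\theta) &= -\mathrm{xBLEU}(\theta)
           = -\sum_{e \in \mathcal{E}(f)} \mathrm{sBLEU}(e, e^{(i)})\, p_{\Lambda,\theta}(e \mid f) \\
p_{\Lambda,\theta}(e \mid f) &= \frac{\exp\{\gamma\, \phi(f, e)\}}
                                     {\sum_{e' \in \mathcal{E}(f)} \exp\{\gamma\, \phi(f, e')\}} \\
\mathrm{xBLEU}(\theta) &= \frac{G(\theta)}{Z(\theta)}, \qquad
G(\theta) = \sum_{e \in \mathcal{E}(f)} \exp\{\gamma\, \phi(f, e)\}\, \mathrm{sBLEU}(e, e^{(i)}), \qquad
Z(\theta) = \sum_{e \in \mathcal{E}(f)} \exp\{\gamma\, \phi(f, e)\}
\end{align}
```

Under this reading, $G(\theta)$ is the BLEU-weighted sum of unnormalized scores and $Z(\theta)$ is the normalizer, so the gradient follows from the quotient rule.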

Appears in 4 sentences as: n-gram (4)

- Decoding with feed-forward architectures is straightforward, since predictions are based on a fixed size input, similar to n-gram language models.Page 1, “Introduction”
- One exception is the n-gram language model which requires the preceding $n - 1$ words as well.Page 3, “Decoder Integration”
- To solve this problem, we follow previous work on lattice rescoring with recurrent networks that maintained the usual n-gram context but kept a beam of hidden layer configurations at each state (Auli et al., 2013).Page 4, “Decoder Integration”
- This approximation has been effective for lattice rescoring, since the translations represented by each state are in fact very similar: They share both the same source words as well as the same n-gram context which is likely to result in similar recurrent histories that can be safely pruned.Page 4, “Decoder Integration”
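
The approximation described in this list, keeping the usual n-gram context per decoder state but only one recurrent history, can be sketched as follows. This is a simplified illustration under stated assumptions: the function name and the tuple layout are hypothetical, and a real decoder state would also be keyed on the covered source words, which is omitted here.

```python
def recombine(hypotheses, order=3):
    """Keep one hidden layer configuration per n-gram context.

    hypotheses: iterable of (target_words, score, hidden_state) tuples.
    order:      n-gram order (assumed >= 2), so the context is the last n-1 words.
    Returns a dict mapping each context to the (score, hidden_state) of the
    best-scoring hypothesis that ends in that context.
    """
    best = {}
    for words, score, hidden in hypotheses:
        context = tuple(words[-(order - 1):])
        # Hypotheses sharing a context are merged; only the single best
        # scoring recurrent history survives, the others are pruned.
        if context not in best or score > best[context][0]:
            best[context] = (score, hidden)
    return best

# Three partial translations; the first two share the trigram context ("red", "house").
hyps = [
    (["the", "red", "house"], -2.0, [0.1]),
    (["a", "red", "house"], -1.5, [0.9]),
    (["the", "blue", "house"], -1.0, [0.3]),
]
states = recombine(hyps, order=3)
```

Keeping a beam of hidden states per context, as in the lattice rescoring work cited above, would replace the single `(score, hidden)` entry with a short sorted list.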

Appears in 4 sentences as: translation model (2) Translation models (1) translation models (1)

- Neural network-based language and translation models have achieved impressive accuracy improvements on statistical machine translation tasks (Allauzen et al., 2011; Le et al., 2012b; Schwenk et al., 2012; Vaswani et al., 2013; Gao et al., 2014).Page 1, “Introduction”
- We summarize the weights of the recurrent neural network language model as $\theta = \{U, W, V\}$ and add the model as an additional feature to the log-linear translation model using the simplified notation $s_\theta(w_t) = s(w_t \mid w_1 \ldots w_{t-1}, h_{t-1})$:Page 2, “Expected BLEU Training”
- The translation model is parameterized by $\Lambda$ and $\theta$ which are learned as follows (Gao et al., 2014):Page 2, “Expected BLEU Training”
- Translation models are estimated on 102M words of parallel data for French-English, and 99M words for German-English; about 6.5M words for each language pair are newswire, the remainder are parliamentary proceedings.Page 4, “Experiments”

Appears in 3 sentences as: development set (3)

- $^1$We tuned $\lambda_{M+1}$ on the development set but found that $\lambda_{M+1} = 1$ resulted in faster training and equal accuracy.Page 2, “Expected BLEU Training”
- We fix $\theta$ and re-optimize $\Lambda$ in the presence of the recurrent neural network model using Minimum Error Rate Training (Och, 2003) on the development set (§5).Page 3, “Expected BLEU Training”
- ther lattices or the unique 100-best output of the phrase-based decoder and reestimate the log-linear weights by running a further iteration of MERT on the n-best list of the development set, augmented by scores corresponding to the neural network models.Page 4, “Experiments”

Appears in 3 sentences as: language pair (1) language pairs (2)

- Translation models are estimated on 102M words of parallel data for French-English, and 99M words for German-English; about 6.5M words for each language pair are newswire, the remainder are parliamentary proceedings.Page 4, “Experiments”
- The vocabulary consists of words that occur in at least two different sentences, which is 31K words for both language pairs.Page 4, “Experiments”
- The results (Table 1 and Table 2) show that direct integration improves accuracy across all six test sets on both language pairs.Page 4, “Experiments”

Appears in 3 sentences as: parallel data (3)

- We use a phrase-based system similar to Moses (Koehn et al., 2007) based on a set of common features including maximum likelihood estimates $p_{ML}(e \mid f)$ and $p_{ML}(f \mid e)$, lexically weighted estimates $p_{LW}(e \mid f)$ and $p_{LW}(f \mid e)$, word and phrase-penalties, a hierarchical reordering model (Galley and Manning, 2008), a linear distortion feature, and a modified Kneser-Ney language model trained on the target-side of the parallel data.Page 4, “Experiments”
- Translation models are estimated on 102M words of parallel data for French-English, and 99M words for German-English; about 6.5M words for each language pair are newswire, the remainder are parliamentary proceedings.Page 4, “Experiments”
- All neural network models are trained on the news portion of the parallel data, corresponding to 136K sentences, which we found to be most useful in initial experiments.Page 4, “Experiments”
