Fast Consensus Decoding over Translation Forests

The minimum Bayes risk (MBR) decoding objective improves BLEU scores for machine translation output relative to the standard Viterbi objective of maximizing model score.

In statistical machine translation, output translations are evaluated by their similarity to human reference translations, where similarity is most often measured by BLEU (Papineni et al., 2002).

Let 6 be a candidate translation for a sentence f, where 6 may stand for a sentence or its derivation as appropriate.

We now turn our focus to efficiently computing feature expectations, in service of our fast consensus decoding procedure.

We evaluate these consensus decoding techniques on two different full-scale state-of-the-art hierarchical machine translation systems.

We have demonstrated substantial speed increases in k-best consensus decoding through a new procedure inspired by MBR under linear similarity measures.

Appears in 37 sentences as: BLEU (41)

In *Fast Consensus Decoding over Translation Forests*

- The minimum Bayes risk (MBR) decoding objective improves BLEU scores for machine translation output relative to the standard Viterbi objective of maximizing model score.Page 1, “Abstract”
- However, MBR targeting BLEU is prohibitively slow to optimize over kr-best lists for large k. In this paper, we introduce and analyze an alternative to MBR that is equally effective at improving performance, yet is asymptotically faster — running 80 times faster than MBR in experiments with 1000-best lists.Page 1, “Abstract”
- Our forest-based decoding objective consistently outperforms kr-best list MBR, giving improvements of up to 1.0 BLEU .Page 1, “Abstract”
- In statistical machine translation, output translations are evaluated by their similarity to human reference translations, where similarity is most often measured by BLEU (Papineni et al., 2002).Page 1, “Introduction”
- Unfortunately, with a nonlinear similarity measure like BLEU , we must resort to approximating the expected loss using a k-best list, which accounts for only a tiny fraction of a model’s full posterior distribution.Page 1, “Introduction”
- In experiments using BLEU over 1000-best lists, we found that our objective provided benefits very similar to MBR, only much faster.Page 1, “Introduction”
- In all, we show improvements of up to 1.0 BLEU from consensus approaches for state-of-the-art large-scale hierarchical translation systems.Page 2, “Introduction”
- 1Typically, MBR is defined as arg mineeElE [L(e; e’ for some loss function L, for example 1 — BLEU (e; 6’ These definitions are equivalent.Page 2, “Consensus Decoding Algorithms”
- Figure 1 compares Algorithms 1 and 2 using U(e; e’ Other linear functions have been explored for MBR, including Taylor approximations to the logarithm of BLEU (Tromble et al., 2008) and counts of matching constituents (Zhang and Gildea, 2008), which are discussed further in Section 3.3.Page 3, “Consensus Decoding Algorithms”
- Computing MBR even with simple nonlinear measures such as BLEU , NIST or bag-of-words Fl seems to require 0(k2) computation time.Page 3, “Consensus Decoding Algorithms”
- For example, we can express BLEU (e; e’) =Page 3, “Consensus Decoding Algorithms”

See all papers in *Proc. ACL 2009* that mention BLEU.

See all papers in *Proc. ACL* that mention BLEU.

Back to top.

Appears in 33 sentences as: similar measure (1) Similarity measure (1) similarity measure (21) Similarity Measures (1) similarity measures (9)

In *Fast Consensus Decoding over Translation Forests*

- The Bayes optimal decoding objective is to minimize risk based on the similarity measure used for evaluation.Page 1, “Introduction”
- Unfortunately, with a nonlinear similarity measure like BLEU, we must resort to approximating the expected loss using a k-best list, which accounts for only a tiny fraction of a model’s full posterior distribution.Page 1, “Introduction”
- We show that if the similarity measure is linear in features of a sentence, then computing expected similarity for all k sentences requires only k similarity evaluations.Page 1, “Introduction”
- Specific instances of this general algorithm have recently been proposed for two linear similarity measures (Tromble et al., 2008; Zhang and Gildea, 2008).Page 1, “Introduction”
- However, the sentence similarity measures we want to optimize in MT are not linear functions, and so this fast algorithm for MBR does not apply.Page 1, “Introduction”
- For this reason, we propose a new objective that retains the benefits of MBR, but can be 0p-timized efficiently, even for nonlinear similarity measures .Page 1, “Introduction”
- The contributions of this paper include a linear-time algorithm for MBR using linear similarities, a linear-time alternative to MBR using nonlinear similarity measures , and a forest-based extension to this procedure for similarities based on n-gram counts.Page 1, “Introduction”
- For MBR decoding, we instead leverage a similarity measure 8(6; 6’) to choose a translation using the model’s probability distribution P(e| f), which has support over a set of possible translations E. The Viterbi derivation 6* is the mode of this distribution.Page 2, “Consensus Decoding Algorithms”
- Given any similarity measure 8 and a k-best list E, the minimum Bayes risk translation can be found by computing the similarity between all pairs of sentences in E, as in Algorithm 1.Page 2, “Consensus Decoding Algorithms”
- An example of a linear similarity measure is bag-of-words precision, which can be written as:Page 3, “Consensus Decoding Algorithms”
- 2.3 Fast Consensus Decoding using NonLinear Similarity MeasuresPage 3, “Consensus Decoding Algorithms”

See all papers in *Proc. ACL 2009* that mention similarity measure.

See all papers in *Proc. ACL* that mention similarity measure.

Back to top.

Appears in 26 sentences as: N-Gram (1) n-gram (27)

In *Fast Consensus Decoding over Translation Forests*

- The contributions of this paper include a linear-time algorithm for MBR using linear similarities, a linear-time alternative to MBR using nonlinear similarity measures, and a forest-based extension to this procedure for similarities based on n-gram counts.Page 1, “Introduction”
- In this expression, BLEU(e; 6’) references 6’ only via its n-gram count features C(e’ , t).2 2The length penalty (1 — is also a function of n-Page 3, “Consensus Decoding Algorithms”
- In this section, we consider BLEU in particular, for which the relevant features gb(e) are n-gram counts up to length n = 4.Page 4, “Computing Feature Expectations”
- The nodes are states in the decoding process that include the span (2', j) of the sentence to be translated, the grammar symbol 3 over that span, and the left and right context words of the translation relevant for computing n-gram language model scores.3 Each hyper-edge h represents the application of a synchronous rule 7" that combines nodes corresponding to non-terminals inPage 4, “Computing Feature Expectations”
- Each 71- gram that appears in a translation 6 is associated with some h in its derivation: the h corresponding to the rule that produces the n-gram .Page 5, “Computing Feature Expectations”
- The n-gram language model score of e similarly decomposes over the h in e that produce n-grams.Page 5, “Computing Feature Expectations”
- 3.2 Computing Expected N-Gram CountsPage 5, “Computing Feature Expectations”
- We can compute expected n-gram counts efficiently from a translation forest by appealing to the linearity of expectations.Page 5, “Computing Feature Expectations”
- Let gb(e) be a vector of n-gram counts for a sentence 6.Page 5, “Computing Feature Expectations”
- Then, gb(e) is the sum of hyper-edge-specific n-gram count vectors gb(h) for all h in 6.Page 5, “Computing Feature Expectations”
- To compute n-gram expectations for a hyper-edge, we first compute the posterior probability of each h, conditioned on the input sentence f:Page 5, “Computing Feature Expectations”

See all papers in *Proc. ACL 2009* that mention n-gram.

See all papers in *Proc. ACL* that mention n-gram.

Back to top.

Appears in 12 sentences as: n-grams (13) n-grams: (1)

In *Fast Consensus Decoding over Translation Forests*

- Un-igrams are produced by lexical rules, while higher-order n-grams can be produced either directly by lexical rules, or by combining constituents.Page 5, “Computing Feature Expectations”
- The n-gram language model score of e similarly decomposes over the h in e that produce n-grams .Page 5, “Computing Feature Expectations”
- The linear similarity measure takes the following form, where Tn is the set of n-grams:Page 5, “Computing Feature Expectations”
- Let H be the number of hyper-edges in the forest or lattice, and T the number of n-grams that can potentially appear in a translation.Page 6, “Computing Feature Expectations”
- Computing count expectations requires 0(H) time, because only a constant number of n-grams can be produced by each hyper-edge.Page 6, “Computing Feature Expectations”
- Above, we compare the precision, relative to reference translations, of sets of n-grams chosen in two ways.Page 8, “Experimental Results”
- The left bar is the precision of the n-grams in 6*.Page 8, “Experimental Results”
- The right bar is the precision of n-grams with E[c(6, 75)] > p. To justify this comparison, we chose p so that both methods of choosing 71- grams gave the same n-gram recall: the fraction of n-grams in reference translations that also appeared in 6* or had E[6(6, 75)] > p.Page 8, “Experimental Results”
- To this end, we first tested the predictive accuracy of the n-grams proposed by 6*: the fraction of the n-grams in 6* that appear in a reference translation.Page 8, “Experimental Results”
- We compared this n-gram precision to a similar measure of predictive accuracy for expected n-gram counts: the fraction of the n-grams t with lE[c(6, 25)] 2 p that appear in a reference.Page 8, “Experimental Results”
- To make these two precisions comparable, we chose p such that the recall of reference n-grams was equal.Page 8, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention n-grams.

See all papers in *Proc. ACL* that mention n-grams.

Back to top.

Appears in 9 sentences as: Viterbi (9)

In *Fast Consensus Decoding over Translation Forests*

- The minimum Bayes risk (MBR) decoding objective improves BLEU scores for machine translation output relative to the standard Viterbi objective of maximizing model score.Page 1, “Abstract”
- The standard Viterbi decoding objective is to find 6* = arg maxe A - 6( f, e).Page 2, “Consensus Decoding Algorithms”
- For MBR decoding, we instead leverage a similarity measure 8(6; 6’) to choose a translation using the model’s probability distribution P(e| f), which has support over a set of possible translations E. The Viterbi derivation 6* is the mode of this distribution.Page 2, “Consensus Decoding Algorithms”
- This forest-based MBR approach improved translation output relative to Viterbi translations.Page 5, “Computing Feature Expectations”
- Table 2: Translation performance improves when computing expected sentences from translation forests rather than 104-best lists, which in turn improve over Viterbi translations.Page 7, “Experimental Results”
- Figure 3: N - grams with high expected count are more likely to appear in the reference translation that 71- grams in the translation model’s Viterbi translation, 6*.Page 8, “Experimental Results”
- We endeavored to test the hypothesis that expected n-gram counts under the forest distribution carry more predictive information than the baseline Viterbi derivation 6* , which is the mode of the distribution.Page 8, “Experimental Results”
- Figure 4: Three translations of an example Arabic sentence: its human- generated reference, the translation with the highest model score under Hiero ( Viterbi ), and the translation chosen by forest-based consensus decoding.Page 8, “Experimental Results”
- The consensus translation reconstructs content lost in the Viterbi translation.Page 8, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention Viterbi.

See all papers in *Proc. ACL* that mention Viterbi.

Back to top.

Appears in 9 sentences as: translation system (2) translation systems (6) translation system’s (1)

In *Fast Consensus Decoding over Translation Forests*

- We evaluate our procedure on translation forests from two large-scale, state-of-the-art hierarchical machine translation systems .Page 1, “Abstract”
- Translation forests compactly encode distributions over much larger sets of derivations and arise naturally in chart-based decoding for a wide variety of hierarchical translation systems (Chiang, 2007; Galley et al., 2006; Mi et al., 2008; Venugopal et al., 2007).Page 1, “Introduction”
- In all, we show improvements of up to 1.0 BLEU from consensus approaches for state-of-the-art large-scale hierarchical translation systems .Page 2, “Introduction”
- Modern statistical machine translation systems take as input some f and score each derivation 6 according to a linear model of features: A, -6i(f, e).Page 2, “Consensus Decoding Algorithms”
- The distribution P(e| f) can be induced from a translation system’s features and weights by expo-nentiating with base I) to form a log-linear model:Page 2, “Consensus Decoding Algorithms”
- Forests arise naturally in chart-based decoding procedures for many hierarchical translation systems (Chiang, 2007).Page 4, “Computing Feature Expectations”
- generated already by the decoder of a syntactic translation system .Page 6, “Computing Feature Expectations”
- We evaluate these consensus decoding techniques on two different full-scale state-of-the-art hierarchical machine translation systems .Page 6, “Experimental Results”
- SBMT is a string-to-tree translation system with rich target-side syntactic information encoded in the translation model.Page 6, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention translation systems.

See all papers in *Proc. ACL* that mention translation systems.

Back to top.

Appears in 7 sentences as: machine translation (7)

In *Fast Consensus Decoding over Translation Forests*

- The minimum Bayes risk (MBR) decoding objective improves BLEU scores for machine translation output relative to the standard Viterbi objective of maximizing model score.Page 1, “Abstract”
- We evaluate our procedure on translation forests from two large-scale, state-of-the-art hierarchical machine translation systems.Page 1, “Abstract”
- In statistical machine translation , output translations are evaluated by their similarity to human reference translations, where similarity is most often measured by BLEU (Papineni et al., 2002).Page 1, “Introduction”
- Modern statistical machine translation systems take as input some f and score each derivation 6 according to a linear model of features: A, -6i(f, e).Page 2, “Consensus Decoding Algorithms”
- Most similarity measures of interest for machine translation are not linear, and so Algorithm 2 does not apply.Page 3, “Consensus Decoding Algorithms”
- Exploiting forests has proven a fruitful avenue of research in both parsing (Huang, 2008) and machine translation (Mi et al., 2008).Page 4, “Computing Feature Expectations”
- We evaluate these consensus decoding techniques on two different full-scale state-of-the-art hierarchical machine translation systems.Page 6, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention machine translation.

See all papers in *Proc. ACL* that mention machine translation.

Back to top.

Appears in 6 sentences as: maxe (6) maxeeES (1)

In *Fast Consensus Decoding over Translation Forests*

- The standard Viterbi decoding objective is to find 6* = arg maxe A - 6( f, e).Page 2, “Consensus Decoding Algorithms”
- 6 = arg maxe EP(e/|f) [8(6; (3’)] arg maxe Z P(e’|f) - 8(6; 6’)Page 2, “Consensus Decoding Algorithms”
- arg maerE EP(e’|f) [3(6; 6/” = arg maxe Z P(e/|f) - ij'(€) ° ¢j(el) e’EE jPage 2, “Consensus Decoding Algorithms”
- = arg maxe Zeufle) Z FOB/If) ° ¢j(el)]Page 2, “Consensus Decoding Algorithms”
- : arg maxe Z wj(e) ° EP(e’|f) [¢j(el)] ' jPage 2, “Consensus Decoding Algorithms”
- 6 = arg maxeeES (e;]Ep(e/|f) [gb(e’)]).Page 3, “Consensus Decoding Algorithms”

See all papers in *Proc. ACL 2009* that mention maxe.

See all papers in *Proc. ACL* that mention maxe.

Back to top.

Appears in 6 sentences as: Sentence Pairs (3) sentence pairs (3)

In *Fast Consensus Decoding over Translation Forests*

- 2.1 Minimum Bayes Risk over Sentence PairsPage 2, “Consensus Decoding Algorithms”
- Algorithm 1 MBR over Sentence Pairs 12 A <— —00 2: for e E E do 3: A6 <— 0 4 for e’ E E do 5: Ae<—A6+P(e’|f)-S(e;e’) 6 7Page 2, “Consensus Decoding Algorithms”
- MBR over Sentence Pairs MBR over FeaturesPage 3, “Consensus Decoding Algorithms”
- Figure 1: For the linear similarity measure U (c; e’ ), which computes unigram precision, the MBR translation can be found by iterating either over sentence pairs (Algorithm 1) or over features (Algorithm 2).Page 3, “Consensus Decoding Algorithms”
- A phrase discovery procedure over word-aligned sentence pairs provides rule frequency counts, which are normalized to estimate features on rules.Page 6, “Experimental Results”
- The synchronous grammar rules are extracted from word aligned sentence pairs where the target sentence is annotated with a syntactic parse (Galley et al., 2004).Page 6, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention Sentence Pairs.

See all papers in *Proc. ACL* that mention Sentence Pairs.

Back to top.

Appears in 6 sentences as: Chinese-English (7)

In *Fast Consensus Decoding over Translation Forests*

- Speed Ratio 29 142 Chinese-English Objective Hiero SBMTPage 7, “Experimental Results”
- We evaluated on both Chinese-English and Arabic-English translation tasks.Page 7, “Experimental Results”
- For the Chinese-English experiments, we used 260 million words of word-aligned parallel text; the hierarchical system used all of this data, and the syntax-based system used a 65-million word subset.Page 7, “Experimental Results”
- All four systems used two language models: one trained from the combined English sides of both parallel texts, and another, larger, language model trained on 2 billion words of English text (1 billion for Chinese-English SBMT).Page 7, “Experimental Results”
- All systems were tuned on held-out data (1994 sentences for Arabic-English, 2010 sentences for Chinese-English) and tested on another dataset (2118 sentences for Arabic-English, 1994 sentences for Chinese-English ).Page 7, “Experimental Results”
- Chinese-English Expectations Similarity Hiero SBMTPage 7, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention Chinese-English.

See all papers in *Proc. ACL* that mention Chinese-English.

Back to top.

Appears in 5 sentences as: model score (3) model scores (1) model scoring (1)

In *Fast Consensus Decoding over Translation Forests*

- The minimum Bayes risk (MBR) decoding objective improves BLEU scores for machine translation output relative to the standard Viterbi objective of maximizing model score .Page 1, “Abstract”
- Translation forests compactly encode an exponential number of output translations for an input sentence, along with their model scores .Page 4, “Computing Feature Expectations”
- 3Decoder states can include additional information as well, such as local configurations for dependency language model scoring .Page 4, “Computing Feature Expectations”
- The n-gram language model score of e similarly decomposes over the h in e that produce n-grams.Page 5, “Computing Feature Expectations”
- Figure 4: Three translations of an example Arabic sentence: its human- generated reference, the translation with the highest model score under Hiero (Viterbi), and the translation chosen by forest-based consensus decoding.Page 8, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention model score.

See all papers in *Proc. ACL* that mention model score.

Back to top.

Appears in 5 sentences as: language model (5) language models (1)

In *Fast Consensus Decoding over Translation Forests*

- The nodes are states in the decoding process that include the span (2', j) of the sentence to be translated, the grammar symbol 3 over that span, and the left and right context words of the translation relevant for computing n-gram language model scores.3 Each hyper-edge h represents the application of a synchronous rule 7" that combines nodes corresponding to non-terminals inPage 4, “Computing Feature Expectations”
- 3Decoder states can include additional information as well, such as local configurations for dependency language model scoring.Page 4, “Computing Feature Expectations”
- The weight of h is the incremental score contributed to all translations containing the rule application, including translation model features on 7“ and language model features that depend on both 7“ and the English contexts of the child nodes.Page 5, “Computing Feature Expectations”
- The n-gram language model score of e similarly decomposes over the h in e that produce n-grams.Page 5, “Computing Feature Expectations”
- All four systems used two language models: one trained from the combined English sides of both parallel texts, and another, larger, language model trained on 2 billion words of English text (1 billion for Chinese-English SBMT).Page 7, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention language model.

See all papers in *Proc. ACL* that mention language model.

Back to top.

Appears in 4 sentences as: translation model (3) translation model’s (1)

In *Fast Consensus Decoding over Translation Forests*

- The weight of h is the incremental score contributed to all translations containing the rule application, including translation model features on 7“ and language model features that depend on both 7“ and the English contexts of the child nodes.Page 5, “Computing Feature Expectations”
- Hiero is a hierarchical system that expresses its translation model as a synchronous context-free grammar (Chiang, 2007).Page 6, “Experimental Results”
- SBMT is a string-to-tree translation system with rich target-side syntactic information encoded in the translation model .Page 6, “Experimental Results”
- Figure 3: N - grams with high expected count are more likely to appear in the reference translation that 71- grams in the translation model’s Viterbi translation, 6*.Page 8, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention translation model.

See all papers in *Proc. ACL* that mention translation model.

Back to top.

Appears in 4 sentences as: bigram (3) bigrams (1)

In *Fast Consensus Decoding over Translation Forests*

- Hyper-edges (boxes) are annotated with normalized transition probabilities, as well as the bigrams produced by each rule application.Page 4, “Computing Feature Expectations”
- The expected count of the bigram “man with” is the sum of posterior probabilities of the two hyper-edges that produce it.Page 4, “Computing Feature Expectations”
- 5For example, decoding under a variational approximation to the model’s posterior that decomposes over bigram probabilities is equivalent to fast consensus decoding withPage 6, “Computing Feature Expectations”
- where h(t) is the unigram prefix of bigram t.Page 6, “Computing Feature Expectations”

See all papers in *Proc. ACL 2009* that mention bigram.

See all papers in *Proc. ACL* that mention bigram.

Back to top.

Appears in 3 sentences as: log-linear model (2) log-linear model: (1)

In *Fast Consensus Decoding over Translation Forests*

- The distribution P(e| f) can be induced from a translation system’s features and weights by expo-nentiating with base I) to form a log-linear model:Page 2, “Consensus Decoding Algorithms”
- The log-linear model weights were trained using MIRA, a margin-based optimization procedure that accommodates many features (Crammer and Singer, 2003; Chiang et al., 2008).Page 6, “Experimental Results”
- We tuned b, the base of the log-linear model , to optimize consensus decoding performance.Page 7, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention log-linear model.

See all papers in *Proc. ACL* that mention log-linear model.

Back to top.

Appears in 3 sentences as: log-linear (3)

In *Fast Consensus Decoding over Translation Forests*

- The distribution P(e| f) can be induced from a translation system’s features and weights by expo-nentiating with base I) to form a log-linear model:Page 2, “Consensus Decoding Algorithms”
- The log-linear model weights were trained using MIRA, a margin-based optimization procedure that accommodates many features (Crammer and Singer, 2003; Chiang et al., 2008).Page 6, “Experimental Results”
- We tuned b, the base of the log-linear model, to optimize consensus decoding performance.Page 7, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention log-linear.

See all papers in *Proc. ACL* that mention log-linear.

Back to top.

Appears in 3 sentences as: language pair (1) language pairs (3)

In *Fast Consensus Decoding over Translation Forests*

- We also show that using forests outperforms using k-best lists consistently across language pairs .Page 1, “Introduction”
- (2008) showed that most of the improvement from lattice-based consensus decoding comes from lattice-based expectations, not search: searching over lattices instead of k-best lists did not change results for two language pairs, and improved a third language pair by 0.3 BLEU.Page 6, “Computing Feature Expectations”
- Despite this optimization, our new Algorithm 3 was an average of 80 times faster across systems and language pairs .Page 7, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention language pairs.

See all papers in *Proc. ACL* that mention language pairs.

Back to top.

Appears in 3 sentences as: unigram (2) unigrams (1)

In *Fast Consensus Decoding over Translation Forests*

- where T1 is the set of unigrams in the language, and 6(6, 25) is an indicator function that equals 1 if 75 appears in e and 0 otherwise.Page 3, “Consensus Decoding Algorithms”
- Figure 1: For the linear similarity measure U (c; e’ ), which computes unigram precision, the MBR translation can be found by iterating either over sentence pairs (Algorithm 1) or over features (Algorithm 2).Page 3, “Consensus Decoding Algorithms”
- where h(t) is the unigram prefix of bigram t.Page 6, “Computing Feature Expectations”

See all papers in *Proc. ACL 2009* that mention unigram.

See all papers in *Proc. ACL* that mention unigram.

Back to top.