Variational Decoding for Statistical Machine Translation

Statistical models in machine translation exhibit spurious ambiguity.

Ambiguity is a central issue in natural language processing.

In MT, spurious ambiguity occurs both in regular phrase-based systems (e.g., Koehn et al.

The Viterbi and crunching methods above approximate the intractable decoding of (2) by ignoring most of the derivations.

In place of the MAP decoding, another commonly used decision rule is minimum Bayes risk (MBR):

We report results using an open source MT toolkit, called Joshua (Li et al., 2009), which implements Hiero (Chiang, 2007).

We have successfully applied the general variational inference framework to a large-scale MT task, to approximate the intractable problem of MAP decoding in the presence of spurious ambiguity.

Appears in 32 sentences as: n-gram (35)

In *Variational Decoding for Statistical Machine Translation*

- Our particular variational distributions are parameterized as n-gram models.Page 1, “Abstract”
- We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008).Page 1, “Abstract”
- In practice, we approximate with several different variational families Q, corresponding to n-gram (Markov) models of different orders.Page 2, “Introduction”
- Since each q(y) is a distribution over output strings, a natural choice for Q is the family of n-gram models.Page 4, “Variational Approximate Decoding”
- where W is a set of n-gram types.Page 4, “Variational Approximate Decoding”
- Each 212 E W is an n-gram , which occurs cw(y) times in the string 3/, and 212 may be divided into an (n — l)-gram prefix (the history) and a l-gram suffix r(w) (the rightmost or current word).Page 4, “Variational Approximate Decoding”
- It is also the case when Q is a pure n-gram model and L is constructed not to back off beyond 71- grams; or when the variational family Q is defined by deliberately taking the FSA Q to have the same topology as L.Page 4, “Variational Approximate Decoding”
- In fact, if p were the empirical distribution over strings in a training corpus, then q* of (10) is just the maximum-likelihood n-gram model—whose parameters, trivially, are just unsmoothed ratios of the 71- gram and (n — 1)- gram counts in the training corpus.Page 4, “Variational Approximate Decoding”
- Our actual job is exactly the same, except that p is specified not by a corpus but by the hypergraph The only change is that the n-gram counts 6(w) are no longer integers from a corpus, but are expected counts under p: 10Page 4, “Variational Approximate Decoding”
- The algorithm then enumerates each n-gram and (n — 1)-gram in y and accumulates its soft count into the expected count, and finally obtains the parameters of q* by taking count ratios via (12).Page 4, “Variational Approximate Decoding”
- 2 for win y D each n-gram typePage 5, “Variational Approximate Decoding”

See all papers in *Proc. ACL 2009* that mention n-gram.

See all papers in *Proc. ACL* that mention n-gram.

Back to top.

Appears in 23 sentences as: Viterbi (24)

In *Variational Decoding for Statistical Machine Translation*

- Therefore, most systems use a simple Viterbi approximation that measures the goodness of a string using only its most probable derivation.Page 1, “Abstract”
- This corresponds to a Viterbi approximation that measures the goodness of an output string using only its most probable derivation, ignoring all the others.Page 1, “Introduction”
- 2.3 Viterbi ApproximationPage 3, “Background 2.1 Terminology”
- To approximate the intractable decoding problem of (2), most MT systems (Koehn et al., 2003; Chiang, 2007) use a simple Viterbi approximation,Page 3, “Background 2.1 Terminology”
- The Viterbi approximation is simple and tractable, but it ignores most derivations.Page 3, “Background 2.1 Terminology”
- The Viterbi and crunching methods above approximate the intractable decoding of (2) by ignoring most of the derivations.Page 3, “Variational Approximate Decoding”
- Lastly, note that Viterbi and variational approximation are different ways to approximate the exact probability p(y | c), and each of them has pros and cons.Page 5, “Variational Approximate Decoding”
- Specifically, Viterbi approximation uses the correct probability of one completePage 5, “Variational Approximate Decoding”
- Therefore, it is desirable to interpolate further with the Viterbi approximation when choosing the final translation output:11Page 6, “Variational Approximate Decoding”
- where the first term corresponds to the interpolated variational decoding of (15) and the second term corresponds to the Viterbi decoding of (4).12 Assuming 6,, > 0, the second term penalizes translations with no good derivation in the hypergraph.13 For n g m, any of these decoders (14)—(16) may be implemented efficiently by using the n-gram variational approximations q* to rescore HG(x)—preserving its hypergraph topology, but modifying the hyperedge weights.14 While the original weights gave derivation d a score of log p(d | cc), the weights as modified for (16) will give d a score of Zn 6,, - log q;‘.Page 6, “Variational Approximate Decoding”
- Viterbi 35.4 32.6 MBR (K=1000) 35.8 32.7 Crunching (N =10000) 35.7 32.8Page 7, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention Viterbi.

See all papers in *Proc. ACL* that mention Viterbi.

Back to top.

Appears in 15 sentences as: BLEU (15)

In *Variational Decoding for Statistical Machine Translation*

- We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008).Page 1, “Abstract”
- We geometrically interpolate the resulting approximations q with one another (and with the original distribution p), justifying this interpolation as similar to the minimum-risk decoding for BLEU proposed by Tromble et al.Page 2, “Introduction”
- However, in order to score well on the BLEU metric for MT evaluation (Papineni et al., 2001), which gives partial credit, we would also like to favor lower-order n-grams that are likely to appear in the reference, even if this means picking some less-likely high-order n-grams.Page 5, “Variational Approximate Decoding”
- They use the following loss function, of which a linear approximation to BLEU (Papineni et al., 2001) is a special case,Page 6, “Variational vs. Min-Risk Decoding”
- Table l: BLEU scores for Viterbi, Crunching, MBR, and variational decoding.Page 7, “Experimental Results”
- Table 1 presents the BLEU scores under Viterbi, crunching, MBR, and variational decoding.Page 7, “Experimental Results”
- Table 2 presents the BLEU results under different ways in using the variational models, as discussed in Section 3.2.3.Page 7, “Experimental Results”
- Moreover, a bigram (i.e., “2gram”) achieves the best BLEU scores among the four different orders of VMs.Page 7, “Experimental Results”
- 18We found the BLEU scores are not very sensitive to *y, contrasting to the observations by Tromble et a1.Page 7, “Experimental Results”
- Table 2: BLEU scores under different variational decoders discussed in Section 3.2.3.Page 8, “Experimental Results”
- The brevity penalty BP in BLEU is always 1, meaning that on average y* is no shorter than the reference translation, except for the “1 gram” systems in (a), which suffer from brevity penalties of 0.826 and 0.831.Page 8, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention BLEU.

See all papers in *Proc. ACL* that mention BLEU.

Back to top.

Appears in 8 sentences as: BLEU score (2) BLEU scores (6)

In *Variational Decoding for Statistical Machine Translation*

- Table l: BLEU scores for Viterbi, Crunching, MBR, and variational decoding.Page 7, “Experimental Results”
- Table 1 presents the BLEU scores under Viterbi, crunching, MBR, and variational decoding.Page 7, “Experimental Results”
- Moreover, a bigram (i.e., “2gram”) achieves the best BLEU scores among the four different orders of VMs.Page 7, “Experimental Results”
- 18We found the BLEU scores are not very sensitive to *y, contrasting to the observations by Tromble et a1.Page 7, “Experimental Results”
- Table 2: BLEU scores under different variational decoders discussed in Section 3.2.3.Page 8, “Experimental Results”
- While the BLEU scores reported show the prac-Page 8, “Experimental Results”
- Indeed, although Table 3 shows that better approximations can be obtained by using higher-order models, the best BLEU score in Tables 2a and 2c was obtained by the bigram model.Page 8, “Experimental Results”
- After all, p cannot perfectly predict the reference translation anyway, hence may not be worth approximating closely; but p may do a good job of predicting bigrams of the reference translation, and the BLEU score rewards us for those.Page 8, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention BLEU scores.

See all papers in *Proc. ACL* that mention BLEU scores.

Back to top.

Appears in 6 sentences as: language model (3) language models (3)

In *Variational Decoding for Statistical Machine Translation*

- Of course, this last point also means that our computation becomes intractable as n —> 00.8 However, if p(y | at) is defined by a hypergraph HG(:c) whose structure explicitly incorporates an m-gram language model , both training and decoding will be efficient when m 2 n. We will give algorithms for this case that are linear in the size of HG(:c).9Page 4, “Variational Approximate Decoding”
- 9A reviewer asks about the interaction with backed-off language models .Page 4, “Variational Approximate Decoding”
- We sketch a method that works for any language model given by a weighted FSA, L. The variational family Q can be specified by any deterministic weighted FSA, Q, with weights parameterized by ((5.Page 4, “Variational Approximate Decoding”
- (It is assumed that p is already smoothed appropriately, having been constructed from channel and language models that were estimated with smoothing from finite training data.)Page 4, “Variational Approximate Decoding”
- We also used a 5-gram language model with modified Kneser-Ney smoothing (Chen and Goodman, 1998), trained on a data set consisting of a 130M words in English Giga-word (LDC2007T07) and the English side of the parallel corpora.Page 7, “Experimental Results”
- We use GIZA++ (Och and Ney, 2000), a suffix-array (Lopez, 2007), SRILM (Stol-cke, 2002), and risk-based deterministic annealing (Smith and Eisner, 2006)17 to obtain word alignments, translation models, language models , and the optimal weights for combining these models, respectively.Page 7, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention language model.

See all papers in *Proc. ACL* that mention language model.

Back to top.

Appears in 5 sentences as: bigram (4) bigrams (1)

In *Variational Decoding for Statistical Machine Translation*

- However, because q* only approximates p, y* of (13) may be locally appropriate but globally inadequate as a translation of c. Observe, e. g., that an n-gram model q* will tend to favor short strings y, regardless of the length of c. Suppose cc 2 le chat chasse la souris (“the cat chases the mouse”) and q* is a bigram approximation to p(y | Presumably q* (the | START), q* (mouse | the), and q*(END | mouse) are all large in So the most probable string y* under q* may be simply “the mouse,” which is short and has a high probability but fails to cover cc.Page 5, “Variational Approximate Decoding”
- Moreover, a bigram (i.e., “2gram”) achieves the best BLEU scores among the four different orders of VMs.Page 7, “Experimental Results”
- This is necessarily true, but it is interesting to see that most of the improvement is obtained just by moving from a unigram to a bigram model.Page 8, “Experimental Results”
- Indeed, although Table 3 shows that better approximations can be obtained by using higher-order models, the best BLEU score in Tables 2a and 2c was obtained by the bigram model.Page 8, “Experimental Results”
- After all, p cannot perfectly predict the reference translation anyway, hence may not be worth approximating closely; but p may do a good job of predicting bigrams of the reference translation, and the BLEU score rewards us for those.Page 8, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention bigram.

See all papers in *Proc. ACL* that mention bigram.

Back to top.

Appears in 5 sentences as: MT system (3) MT systems (2)

In *Variational Decoding for Statistical Machine Translation*

- They recover additional latent variables—so-called nuisance variables—that are not of interest to the user.1 For example, though machine translation (MT) seeks to output a string, typical MT systems (Koehn et al., 2003; Chiang, 2007)Page 1, “Introduction”
- It can be used to encode exponentially many hypotheses generated by a phrase-based MT system (e.g., Koehn et al.Page 2, “Background 2.1 Terminology”
- (2003)) or a syntax-based MT system (e.g., Chiang (2007)).Page 2, “Background 2.1 Terminology”
- To approximate the intractable decoding problem of (2), most MT systems (Koehn et al., 2003; Chiang, 2007) use a simple Viterbi approximation,Page 3, “Background 2.1 Terminology”
- For each input sentence c, we assume that a baseline MT system generates a hypergraph HG(cc) that compactly encodes the derivation set D(cc) along with a score for each d E D(9c),5 which we interpret as p(y, d | c) (or proportional to it).Page 3, “Variational Approximate Decoding”

See all papers in *Proc. ACL 2009* that mention MT system.

See all papers in *Proc. ACL* that mention MT system.

Back to top.

Appears in 4 sentences as: n-grams (5)

In *Variational Decoding for Statistical Machine Translation*

- whose edges correspond to n-grams (weighted with negative log-probabilities) and whose vertices correspond to (n — l)-grams.Page 5, “Variational Approximate Decoding”
- This may be regarded as favoring n-grams that are likely to appear in the reference translation (because they are likely in the derivation forest).Page 5, “Variational Approximate Decoding”
- However, in order to score well on the BLEU metric for MT evaluation (Papineni et al., 2001), which gives partial credit, we would also like to favor lower-order n-grams that are likely to appear in the reference, even if this means picking some less-likely high-order n-grams .Page 5, “Variational Approximate Decoding”
- Now, let us divide N, which contains n-gram types of different n, into several subsets Wn, each of which contains only the n-grams with a given length n. We can now rewrite (19) as follows,Page 6, “Variational vs. Min-Risk Decoding”

See all papers in *Proc. ACL 2009* that mention n-grams.

See all papers in *Proc. ACL* that mention n-grams.

Back to top.

Appears in 3 sentences as: Dynamic programming (1) dynamic programming (1) dynamic programs (1)

In *Variational Decoding for Statistical Machine Translation*

- The trick is to parameterize q as a factorized distribution such that the estimation of q* and decoding using q* are both tractable through efficient dynamic programs .Page 3, “Variational Approximate Decoding”
- (2008) effectively do take n = 00, by maintaining the whole translation string in the dynamic programming state.Page 4, “Variational Approximate Decoding”
- Figure 4: Dynamic programming estimation of q*.Page 5, “Variational Approximate Decoding”

See all papers in *Proc. ACL 2009* that mention Dynamic programming.

See all papers in *Proc. ACL* that mention Dynamic programming.

Back to top.

Appears in 3 sentences as: loss function (3)

In *Variational Decoding for Statistical Machine Translation*

- They use the following loss function , of which a linear approximation to BLEU (Papineni et al., 2001) is a special case,Page 6, “Variational vs. Min-Risk Decoding”
- With the above loss function , Tromble et al.Page 6, “Variational vs. Min-Risk Decoding”
- 15 The MBR becomes the MAP decision rule of (1) if a so-called zero-one loss function is used: l(y, y’) = 0 if y = y’ ; otherwise l(y,y’) = 1.Page 6, “Variational vs. Min-Risk Decoding”

See all papers in *Proc. ACL 2009* that mention loss function.

See all papers in *Proc. ACL* that mention loss function.

Back to top.

Appears in 3 sentences as: phrase-based (3)

In *Variational Decoding for Statistical Machine Translation*

- In MT, spurious ambiguity occurs both in regular phrase-based systems (e.g., Koehn et al.Page 2, “Background 2.1 Terminology”
- Figure l: Segmentation ambiguity in phrase-based MT: two different segmentations lead to the same translation string.Page 2, “Background 2.1 Terminology”
- It can be used to encode exponentially many hypotheses generated by a phrase-based MT system (e.g., Koehn et al.Page 2, “Background 2.1 Terminology”

See all papers in *Proc. ACL 2009* that mention phrase-based.

See all papers in *Proc. ACL* that mention phrase-based.

Back to top.

Appears in 3 sentences as: segmentations (3)

In *Variational Decoding for Statistical Machine Translation*

- That is, the probability of an output string is split among many distinct derivations (e.g., trees or segmentations ).Page 1, “Abstract”
- (2003)), where different segmentations lead to the same translation string (Figure l), and in syntax-based systems (e.g., Chiang (2007)), where different derivation trees yield the same string (Figure 2).Page 2, “Background 2.1 Terminology”
- Figure l: Segmentation ambiguity in phrase-based MT: two different segmentations lead to the same translation string.Page 2, “Background 2.1 Terminology”

See all papers in *Proc. ACL 2009* that mention segmentations.

See all papers in *Proc. ACL* that mention segmentations.

Back to top.

Appears in 3 sentences as: state of the art (3)

In *Variational Decoding for Statistical Machine Translation*

- Experiments show that our approach improves the state of the art .Page 1, “Abstract”
- Experiments show that our approach improves the state of the art .Page 2, “Introduction”
- Our empirical results improve the state of the art .Page 8, “Conclusions and Future Work”

See all papers in *Proc. ACL 2009* that mention state of the art.

See all papers in *Proc. ACL* that mention state of the art.

Back to top.

Appears in 3 sentences as: translation model (2) translation models (1)

In *Variational Decoding for Statistical Machine Translation*

- However, suppose the hypergraph were very large (thanks to a large or smoothed translation model and weak pruning).Page 6, “Variational vs. Min-Risk Decoding”
- Our translation model was trained on about 1M parallel sentence pairs (about 28M words in each language), which are sub-sampled from corpora distributed by LDC for the NIST MT evaluation using a sampling method based on the n-gram matches between training and test sets in the foreign side.Page 7, “Experimental Results”
- We use GIZA++ (Och and Ney, 2000), a suffix-array (Lopez, 2007), SRILM (Stol-cke, 2002), and risk-based deterministic annealing (Smith and Eisner, 2006)17 to obtain word alignments, translation models , language models, and the optimal weights for combining these models, respectively.Page 7, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention translation model.

See all papers in *Proc. ACL* that mention translation model.

Back to top.

Appears in 3 sentences as: unigram (3)

In *Variational Decoding for Statistical Machine Translation*

- As shown in Table 2a, decoding with a single variational n-gram model (VM) as per (14) improves the Viterbi baseline (except the case with a unigram VM), though often not statistically significant.Page 7, “Experimental Results”
- The interpolation between a VM and a word penalty feature (“wp”) improves over the unigramPage 7, “Experimental Results”
- This is necessarily true, but it is interesting to see that most of the improvement is obtained just by moving from a unigram to a bigram model.Page 8, “Experimental Results”

See all papers in *Proc. ACL 2009* that mention unigram.

See all papers in *Proc. ACL* that mention unigram.

Back to top.