Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
Kumar, Shankar and Macherey, Wolfgang and Dyer, Chris and Och, Franz

Article Structure

Abstract

Minimum Error Rate Training (MERT) and Minimum Bayes-Risk (MBR) decoding are used in most current state-of-the-art Statistical Machine Translation (SMT) systems.

Introduction

Statistical Machine Translation (SMT) systems have improved considerably by directly using the error criterion in both training and decoding.

Translation Hypergraphs

A translation lattice compactly encodes a large number of hypotheses produced by a phrase-based SMT system.
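
For reference, the hypergraph generalizes this lattice: a hyperedge may have several tail nodes, as when a binary SCFG rule combines two sub-derivations. Below is a minimal Python sketch of the kind of structure involved; the field names and representation are illustrative choices of ours, not taken from the paper.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Hyperedge:
    head: int            # id of the head node this edge derives
    tails: List[int]     # ids of tail nodes (one for a lattice arc, two for a binary SCFG rule)
    rule: list           # target side: terminal words and integer slots indexing into `tails`
    weight: float = 0.0  # model score of applying this rule

@dataclass
class Hypergraph:
    nodes: List[int] = field(default_factory=list)        # node ids in topological order
    edges: List[Hyperedge] = field(default_factory=list)  # hyperedges over those nodes

A lattice is then the special case in which every hyperedge has exactly one tail node and its rule is a single word.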

Minimum Error Rate Training

Given a set of source sentences f_1^S with corresponding reference translations r_1^S, the objective of MERT is to find a parameter set \hat{\lambda}_1^M which minimizes an automated evaluation criterion under a linear model:
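
  \hat{\lambda}_1^M = \arg\min_{\lambda_1^M} \sum_{s=1}^{S} E\big(r_s, \hat{e}(f_s; \lambda_1^M)\big), \qquad \hat{e}(f_s; \lambda_1^M) = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f_s)

(reconstructed here in the standard notation of Och (2003) and Macherey et al. (2008), not copied verbatim from the paper), where E(r_s, e) is the sentence-level error of hypothesis e against reference r_s and h_1, ..., h_M are the feature functions of the linear model.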

Minimum Bayes-Risk Decoding

We first review Minimum Bayes-Risk (MBR) decoding for statistical MT.
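
For reference, the MBR decision rule under review has the standard form (restated from Kumar and Byrne (2004); a sketch, not a quotation of the paper's equation):

  \hat{E} = \arg\min_{E' \in G} \sum_{E \in G} L(E, E') \, P(E \mid F)

where G is the evidence space (an N-best list, lattice, or hypergraph), L(E, E') is the loss incurred by choosing E' when E is the correct translation (e.g. 1 - BLEU), and P(E|F) is the model posterior over translations of the source sentence F.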

MERT for MBR Parameter Optimization

Lattice MBR Decoding (Equation 3) assumes a linear form for the gain function (Equation 2).
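
For orientation, the two equations referenced here have, as given by Tromble et al. (2008) and recalled here as a sketch rather than a quotation, the form

  G(E, E') = \theta_0 |E'| + \sum_{w} \theta_{|w|} \, \#_w(E') \, \delta_w(E)   (gain, Equation 2)

  \hat{E} = \arg\max_{E' \in G} \Big\{ \theta_0 |E'| + \sum_{w} \theta_{|w|} \, \#_w(E') \, p(w|G) \Big\}   (lattice MBR, Equation 3)

where w ranges over n-grams up to the maximum order, \#_w(E') counts occurrences of w in E', \delta_w(E) is 1 if w occurs in E and 0 otherwise, and p(w|G) is the posterior probability of the n-gram w in the lattice. MERT, as described in this section, tunes the weights \theta_0, ..., \theta_N rather than fixing them heuristically.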

Experiments

We now describe our experiments to evaluate MERT and MBR on lattices and hypergraphs, and show how MERT can be used to tune MBR parameters.

Discussion

We have presented efficient algorithms which extend previous work on lattice-based MERT (Macherey et al., 2008) and MBR decoding (Tromble et al., 2008) to work with hypergraphs.

Topics

n-gram

Appears in 21 sentences as: n-gram (24)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
    Page 1, “Introduction”
  2. They approximated log(BLEU) score by a linear function of n-gram matches and candidate length.
    Page 4, “Minimum Bayes-Risk Decoding”
  3. where w is an n-gram present in either E or E', and θ_0, θ_1, ..., θ_N are weights which are determined empirically, where N is the maximum n-gram order.
    Page 4, “Minimum Bayes-Risk Decoding”
  4. Next, the posterior probability of each n-gram is computed.
    Page 4, “Minimum Bayes-Risk Decoding”
  5. A new automaton is then created by intersecting each n-gram, with its weight (from Equation 2), with an unweighted lattice.
    Page 4, “Minimum Bayes-Risk Decoding”
  6. The above steps are carried out one n-gram at a time.
    Page 4, “Minimum Bayes-Risk Decoding”
  7. The key idea behind this new algorithm is to rewrite the n-gram posterior probability (Equation 4) as follows:
    Page 5, “Minimum Bayes-Risk Decoding”
  8. p(w|G) = \sum_{E \in G} \sum_{e \in E} f(e, w, E) P(E|F)   (5), where f(e, w, E) is a score assigned to edge e on path E containing n-gram w:
    Page 5, “Minimum Bayes-Risk Decoding”
  9. We therefore approximate the quantity f(e, w, E) with f*(e, w, G) that counts the edge e with n-gram w that has the highest arc posterior probability relative to predecessors in the entire lattice G. f*(e, w, G) can be computed locally, and the n-gram posterior probability based on f* can be determined as follows:
    Page 5, “Minimum Bayes-Risk Decoding”
  10. For each node t in the lattice, we maintain a quantity Score(w, t) for each n-gram w that lies on a path from the source node to t. Score(w, t) is the highest posterior probability among all edges on the paths that terminate on t and contain n-gram w. The forward pass requires computing the n-grams introduced by each edge; to do this, we propagate n-grams (up to maximum order - 1) terminating on each node.
    Page 5, “Minimum Bayes-Risk Decoding”
  11. of each n-gram: for each edge e do: compute edge posterior probability P(e|G); compute n-gram posterior probs.
    Page 5, “Minimum Bayes-Risk Decoding”
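
The excerpts above (items 7-11) sketch the forward pass that replaces the per-n-gram automaton intersections. The Python below is a rough, self-contained sketch of that idea under our reading of the f* approximation: edge posteriors are assumed precomputed, the lattice is a list of single-word arcs in topological order, and the function and variable names are ours, not the paper's exact Algorithm 3.

from collections import defaultdict

def approx_ngram_posteriors(edges, edge_posterior, source, max_order=4):
    # edges: list of (u, v, word) arcs, sorted so that all arcs into a node
    #        precede all arcs out of it (topological order of the lattice)
    # edge_posterior: list of P(e|G) values parallel to `edges`, assumed
    #        precomputed by forward-backward over the lattice
    # source: id of the lattice start node
    # returns: dict mapping n-gram tuples w to an approximate posterior p(w|G)
    contexts = defaultdict(set, {source: {()}})  # suffixes (< max_order words) reaching each node
    score = defaultdict(dict)    # score[t][w]: best posterior of an edge containing w on a path into t
    posterior = defaultdict(float)

    for (u, v, word), p_e in zip(edges, edge_posterior):
        # n-grams completed by appending `word` to a suffix that reaches u
        new_ngrams = set()
        for ctx in contexts[u]:
            for start in range(len(ctx) + 1):
                new_ngrams.add(ctx[start:] + (word,))
            contexts[v].add((ctx + (word,))[-(max_order - 1):])
        # carry the best scores seen on paths into u forward to v
        for w, s in score[u].items():
            score[v][w] = max(score[v].get(w, 0.0), s)
        for w in new_ngrams:
            # f*-style test: count this edge only if it beats every
            # predecessor edge containing w on paths into u
            if p_e > score[u].get(w, 0.0):
                posterior[w] += p_e
            score[v][w] = max(score[v].get(w, 0.0), p_e)
    return dict(posterior)

The intended behaviour is that each edge contributes its posterior P(e|G) to p(w|G) only when no edge containing w on a path into its start node has a higher posterior, which is the locally computable stand-in for counting the first occurrence of w on each path.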


BLEU

Appears in 20 sentences as: BLEU (20)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
    Page 1, “Introduction”
  2. We employ MERT to select these weights by optimizing BLEU score on a development set.
    Page 1, “Introduction”
  3. In contrast, our MBR algorithm directly selects the hypothesis in the hypergraph with the maximum expected approximate corpus BLEU score (Tromble et al., 2008).
    Page 1, “Introduction”
  4. This reranking can be done for any sentence-level loss function such as BLEU (Papineni et al., 2001), Word Error Rate, or Position-independent Error Rate.
    Page 4, “Minimum Bayes-Risk Decoding”
  5. (2008) extended MBR decoding to translation lattices under an approximate BLEU score.
    Page 4, “Minimum Bayes-Risk Decoding”
  6. They approximated log( BLEU ) score by a linear function of n-gram matches and candidate length.
    Page 4, “Minimum Bayes-Risk Decoding”
  7. However, this does not guarantee that the resulting linear score (Equation 2) is close to the corpus BLEU .
    Page 6, “MERT for MBR Parameter Optimization”
  8. We now describe how MERT can be used to estimate these factors to achieve a better approximation to the corpus BLEU .
    Page 6, “MERT for MBR Parameter Optimization”
  9. We recall that MERT selects weights in a linear model to optimize an error criterion (e. g. corpus BLEU ) on a training set.
    Page 6, “MERT for MBR Parameter Optimization”
  10. The linear approximation to BLEU may not hold in practice for unseen test sets or language-pairs.
    Page 6, “MERT for MBR Parameter Optimization”
  11. We now have a total of N+2 feature functions which we optimize using MERT to obtain the highest BLEU score on a training set.
    Page 6, “MERT for MBR Parameter Optimization”
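
Excerpts 1 and 2 contrast the heuristic weight setting with MERT-tuned weights. The sketch below shows the heuristic side only, in an illustrative parameterization: we assume n-gram precisions decay as p * r**(n-1); the exact constants used by Tromble et al. (2008) differ, and the function and variable names are ours.

def heuristic_theta(p, r, T, max_order=4):
    # p: assumed unigram precision on held-out data
    # r: assumed per-order decay ratio of the n-gram precisions
    # T: length-related constant trading off the word-count term
    theta_0 = -1.0 / T   # penalizes hypothesis length
    thetas = [1.0 / (max_order * T * p * r ** (n - 1))  # rewards order-n matches
              for n in range(1, max_order + 1)]
    return [theta_0] + thetas

Under the MERT approach described in the paper, these N+1 values (plus a baseline model-cost feature) are instead treated as feature weights and tuned to maximize BLEU on a development set.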


phrase-based

Appears in 11 sentences as: phrase-based (12)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. These two techniques were originally developed for N-best lists of translation hypotheses and recently extended to translation lattices (Macherey et al., 2008; Tromble et al., 2008) generated by a phrase-based SMT system (Och and Ney, 2004).
    Page 1, “Introduction”
  2. SMT systems based on synchronous context free grammars (SCFG) (Chiang, 2007; Zollmann and Venugopal, 2006; Galley et al., 2006) have recently been shown to give competitive performance relative to phrase-based SMT.
    Page 1, “Introduction”
  3. A translation lattice compactly encodes a large number of hypotheses produced by a phrase-based SMT system.
    Page 2, “Translation Hypergraphs”
  4. Our phrase-based statistical MT system is similar to the alignment template system described in (Och and Ney, 2004; Tromble et al., 2008).
    Page 6, “Experiments”
  5. We also train two SCFG-based MT systems: a hierarchical phrase-based SMT (Chiang, 2007) system and a syntax augmented machine translation (SAMT) system using the approach described in Zollmann and Venugopal (2006).
    Page 6, “Experiments”
  6. Both systems are built on top of our phrase-based systems.
    Page 6, “Experiments”
  7. We first compare the new lattice MBR (Algorithm 3) with MBR decoding on 1000-best lists and FSAMBR (Tromble et al., 2008) on lattices generated by the phrase-based systems; evaluation is done using both BLEU and average runtime per sentence (Table 3).
    Page 7, “Experiments”
  8. Table 3: Lattice MBR for a phrase-based system.
    Page 7, “Experiments”
  9. We report results on nist03 set and present three systems for each language pair: phrase-based (pb), hierarchical (hier), and SAMT; Lattice MBR is done for the phrase-based system while HGMBR is used for the other two.
    Page 7, “Experiments”
  10. For the multi-language case, we train phrase-based systems and perform lattice MBR for all language pairs.
    Page 7, “Experiments”
  11. We believe that our efficient algorithms will make them more widely applicable in both SCFG-based and phrase-based MT systems.
    Page 8, “Discussion”


language pairs

Appears in 10 sentences as: language pair (3) language pairs (7)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. Our experiments show speedups from MERT and MBR as well as performance improvements from MBR decoding on several language pairs .
    Page 1, “Abstract”
  2. We report results on nist03 set and present three systems for each language pair : phrase-based (pb), hierarchical (hier), and SAMT; Lattice MBR is done for the phrase-based system while HGMBR is used for the other two.
    Page 7, “Experiments”
  3. For the multi-language case, we train phrase-based systems and perform lattice MBR for all language pairs .
    Page 7, “Experiments”
  4. When we optimize MBR features with MERT, the number of language pairs with gains/no changes/drops is 22/5/12.
    Page 8, “Experiments”
  5. We hypothesize that the default MBR parameters are suboptimal for some language pairs and that MERT helps to find better parameter settings.
    Page 8, “Experiments”
  6. In particular, MERT avoids the need for manually tuning these parameters by language pair .
    Page 8, “Experiments”
  7. This may not be optimal in practice for unseen test sets and language pairs , and the resulting linear loss may be quite different from the corpus level BLEU.
    Page 8, “Discussion”
  8. On an experiment with 40 language pairs , we obtain improvements on 26 pairs, no difference on 8 pairs and drops on 5 pairs.
    Page 8, “Discussion”
  9. This was achieved without any need for manual tuning for each language pair .
    Page 8, “Discussion”
  10. The baseline model cost feature helps the algorithm effectively back off to the MAP translation in language pairs where MBR features alone would not have helped.
    Page 8, “Discussion”


MT systems

Appears in 9 sentences as: MT System (1) MT system (3) MT systems (5)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. We here extend lattice-based MERT and MBR algorithms to work with hypergraphs that encode a vast number of translations produced by MT systems based on Synchronous Context Free Grammars.
    Page 1, “Abstract”
  2. In this paper, we extend MERT and MBR decoding to work on hypergraphs produced by SCFG-based MT systems.
    Page 1, “Introduction”
  3. MBR decoding for translation can be performed by reranking an N-best list of hypotheses generated by an MT system (Kumar and Byrne, 2004).
    Page 4, “Minimum Bayes-Risk Decoding”
  4. We next extend the Lattice MBR decoding algorithm (Algorithm 3) to rescore hypergraphs produced by an SCFG-based MT system.
    Page 5, “Minimum Bayes-Risk Decoding”
  5. 6.2 MT System Description
    Page 6, “Experiments”
  6. Our phrase-based statistical MT system is similar to the alignment template system described in (Och and Ney, 2004; Tromble et al., 2008).
    Page 6, “Experiments”
  7. We also train two SCFG-based MT systems: a hierarchical phrase-based SMT (Chiang, 2007) system and a syntax augmented machine translation (SAMT) system using the approach described in Zollmann and Venugopal (2006).
    Page 6, “Experiments”
  8. On hypergraphs produced by Hierarchical and Syntax Augmented MT systems , our MBR algorithm gives a 7X speedup relative to 1000-best MBR while giving comparable or even better performance.
    Page 8, “Discussion”
  9. We believe that our efficient algorithms will make them more widely applicable in both SCFG-based and phrase-based MT systems.
    Page 8, “Discussion”


NIST

Appears in 9 sentences as: NIST (10)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. The first one is the constrained data track of the NIST Arabic-to-English (aren) and Chinese-to-English (zhen) translation task.
    Page 6, “Experiments”
  2. Table 1: Statistics over the NIST dev/test sets.
    Page 6, “Experiments”
  3. Our development set (dev) consists of the NIST 2005 eval set; we use this set for optimizing MBR parameters.
    Page 6, “Experiments”
  4. We report results on NIST 2002 and NIST 2003 evaluation sets.
    Page 6, “Experiments”
  5. Table 5 shows results for NIST systems.
    Page 7, “Experiments”
  6. We observed in the NIST systems that MERT resulted in short translations relative to MAP on the unseen test set.
    Page 8, “Experiments”
  7. In the NIST systems, MERT yields small improvements on top of MBR with default parameters.
    Page 8, “Experiments”
  8. Thus, MERT has a bigger impact here than in the NIST systems.
    Page 8, “Experiments”
  9. Table 5: MBR Parameter Tuning on NIST systems
    Page 8, “Discussion”


BLEU score

Appears in 8 sentences as: BLEU score (8)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
    Page 1, “Introduction”
  2. We employ MERT to select these weights by optimizing BLEU score on a development set.
    Page 1, “Introduction”
  3. In contrast, our MBR algorithm directly selects the hypothesis in the hypergraph with the maximum expected approximate corpus BLEU score (Tromble et al., 2008).
    Page 1, “Introduction”
  4. (2008) extended MBR decoding to translation lattices under an approximate BLEU score .
    Page 4, “Minimum Bayes-Risk Decoding”
  5. We now have a total of N+2 feature functions which we optimize using MERT to obtain the highest BLEU score on a training set.
    Page 6, “MERT for MBR Parameter Optimization”
  6. MERT is then performed to optimize the BLEU score on a development set; for MERT, we use 40 random initial parameters as well as parameters computed using corpus-based statistics (Tromble et al., 2008).
    Page 7, “Experiments”
  7. We consider a BLEU score difference to be a) a gain if it is at least 0.2 points, b) a drop if it is at most -0.2 points, and c) no change otherwise.
    Page 7, “Experiments”
  8. When MBR does not produce a higher BLEU score relative to MAP on the development set, MERT assigns a higher weight to this feature function.
    Page 8, “Experiments”


development set

Appears in 7 sentences as: development set (7)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. We employ MERT to select these weights by optimizing BLEU score on a development set .
    Page 1, “Introduction”
  2. Our development set (dev) consists of the NIST 2005 eval set; we use this set for optimizing MBR parameters.
    Page 6, “Experiments”
  3. MERT is then performed to optimize the BLEU score on a development set; for MERT, we use 40 random initial parameters as well as parameters computed using corpus-based statistics (Tromble et al., 2008).
    Page 7, “Experiments”
  4. We select the MBR scaling factor (Tromble et al., 2008) based on the development set; it is set to 0.1, 0.01, 0.5, 0.2, 0.5 and 1.0 for the aren-phrase, aren-hier, aren-samt, zhen-phrase, zhen-hier and zhen-samt systems respectively.
    Page 7, “Experiments”
  5. We tune α on the development set so that the brevity score of MBR translation is close to that of the MAP translation.
    Page 8, “Experiments”
  6. When MBR does not produce a higher BLEU score relative to MAP on the development set , MERT assigns a higher weight to this feature function.
    Page 8, “Experiments”
  7. In this paper, we have described how MERT can be employed to estimate the weights for the linear loss function to maximize BLEU on a development set .
    Page 8, “Discussion”


n-grams

Appears in 6 sentences as: n-grams (7)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. First, the set of n-grams is extracted from the lattice.
    Page 4, “Minimum Bayes-Risk Decoding”
  2. For a moderately large lattice, there can be several thousands of n-grams and the procedure becomes expensive.
    Page 4, “Minimum Bayes-Risk Decoding”
  3. For each node t in the lattice, we maintain a quantity Score(w, t) for each n-gram w that lies on a path from the source node to t. Score(w, t) is the highest posterior probability among all edges on the paths that terminate on t and contain n-gram w. The forward pass requires computing the n-grams introduced by each edge; to do this, we propagate n-grams (up to maximum order - 1) terminating on each node.
    Page 5, “Minimum Bayes-Risk Decoding”
  4. This is necessary because unlike a lattice, new n-grams may be created at subsequent nodes by concatenating words both to the left and the right side of the n-gram.
    Page 5, “Minimum Bayes-Risk Decoding”
  5. As a result, we need to consider all new sequences which can be created by the cross-product of the n-grams on the two tail nodes.
    Page 5, “Minimum Bayes-Risk Decoding”
  6. This linear function contains N + 1 parameters θ_0, θ_1, ..., θ_N, where N is the maximum order of the n-grams involved.
    Page 6, “MERT for MBR Parameter Optimization”
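
Excerpts 3-5 describe the extra bookkeeping needed on hypergraphs: new n-grams can be created on both sides of a tail node's yield, so the contexts of all tail nodes must be combined. The Python below is a rough sketch of that combination for a single hyperedge; the yield-truncation scheme, the rule encoding, and the helper names are our own illustrative choices, not the paper's.

ELIDED = "<...>"   # sentinel marking the truncated middle of a long yield
N = 4              # maximum n-gram order

def truncate(words):
    # keep only the (N-1)-word boundaries once a yield grows long
    if len(words) <= 2 * (N - 1):
        return list(words)
    return words[:N - 1] + [ELIDED] + words[-(N - 1):]

def combine(rule, tail_states):
    # rule: target side of the hyperedge as terminals and integer slots
    #       indexing tail nodes, e.g. ["the", 0, "of", 1]
    # tail_states: truncated yields already computed for each tail node
    # returns (new_ngrams, head_state): n-grams first created on this
    # hyperedge, and the truncated yield to propagate to the head node
    yield_, spans = [], []
    for sym in rule:
        if isinstance(sym, int):
            spans.append((len(yield_), len(yield_) + len(tail_states[sym])))
            yield_.extend(tail_states[sym])
        else:
            yield_.append(sym)
    new_ngrams = set()
    for i in range(len(yield_)):
        for n in range(1, N + 1):
            window = yield_[i:i + n]
            if len(window) < n or ELIDED in window:
                continue
            # an n-gram is new unless it lies entirely inside one tail yield
            # (those were already counted when the tail node was built)
            if not any(a <= i and i + n <= b for a, b in spans):
                new_ngrams.add(tuple(window))
    return new_ngrams, truncate(yield_)

For example, combine(["the", 0, "of", 1], [["president"], ["france"]]) reports ("the", "president"), ("president", "of"), ("of", "france") and their longer extensions as new, while ("president",) and ("france",) are not, since they were produced at the tail nodes.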


SMT system

Appears in 5 sentences as: SMT system (3) SMT systems (2)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. These two techniques were originally developed for N-best lists of translation hypotheses and recently extended to translation lattices (Macherey et al., 2008; Tromble et al., 2008) generated by a phrase-based SMT system (Och and Ney, 2004).
    Page 1, “Introduction”
  2. SMT systems based on synchronous context free grammars (SCFG) (Chiang, 2007; Zollmann and Venugopal, 2006; Galley et al., 2006) have recently been shown to give competitive performance relative to phrase-based SMT.
    Page 1, “Introduction”
  3. A translation lattice compactly encodes a large number of hypotheses produced by a phrase-based SMT system .
    Page 2, “Translation Hypergraphs”
  4. The corresponding representation for an SMT system based on SCFGs (e.g.
    Page 2, “Translation Hypergraphs”
  5. MERT and MBR decoding are popular techniques for incorporating the final evaluation metric into the development of SMT systems .
    Page 8, “Discussion”


Machine Translation

Appears in 4 sentences as: Machine Translation (2) machine translation (2)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. Minimum Error Rate Training (MERT) and Minimum Bayes-Risk (MBR) decoding are used in most current state-of-the-art Statistical Machine Translation (SMT) systems.
    Page 1, “Abstract”
  2. Statistical Machine Translation (SMT) systems have improved considerably by directly using the error criterion in both training and decoding.
    Page 1, “Introduction”
  3. In the context of statistical machine translation, the optimization procedure was first described in Och (2003) for N-best lists and later extended to phrase-lattices in Macherey et al.
    Page 2, “Minimum Error Rate Training”
  4. We also train two SCFG-based MT systems: a hierarchical phrase-based SMT (Chiang, 2007) system and a syntax augmented machine translation (SAMT) system using the approach described in Zollmann and Venugopal (2006).
    Page 6, “Experiments”


Error Rate

Appears in 3 sentences as: Error Rate (4)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. Minimum Error Rate Training (MERT) and Minimum Bayes-Risk (MBR) decoding are used in most current state-of-the-art Statistical Machine Translation (SMT) systems.
    Page 1, “Abstract”
  2. Two popular techniques that incorporate the error criterion are Minimum Error Rate Training (MERT) (Och, 2003) and Minimum Bayes-Risk (MBR) decoding (Kumar and Byrne, 2004).
    Page 1, “Introduction”
  3. This reranking can be done for any sentence-level loss function such as BLEU (Papineni et al., 2001), Word Error Rate, or Position-independent Error Rate .
    Page 4, “Minimum Bayes-Risk Decoding”


loss function

Appears in 3 sentences as: loss function (3)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. This reranking can be done for any sentence-level loss function such as BLEU (Papineni et al., 2001), Word Error Rate, or Position-independent Error Rate.
    Page 4, “Minimum Bayes-Risk Decoding”
  2. Note that N-best MBR uses a sentence BLEU loss function.
    Page 7, “Experiments”
  3. In this paper, we have described how MERT can be employed to estimate the weights for the linear loss function to maximize BLEU on a development set.
    Page 8, “Discussion”
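
Excerpt 2 notes that N-best MBR uses a sentence-level BLEU loss. Below is a minimal, self-contained sketch of that reranking; the add-one-smoothed sentence BLEU and the single-reference assumption are simplifications of ours, not details taken from the paper.

import math
from collections import Counter

def sentence_bleu(hyp, ref, max_order=4):
    # add-one-smoothed sentence BLEU of `hyp` against one reference (token lists)
    log_prec = 0.0
    for n in range(1, max_order + 1):
        hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        clipped = sum((hyp_counts & ref_counts).values())
        log_prec += math.log((clipped + 1.0) / (sum(hyp_counts.values()) + 1.0))
    brevity = min(0.0, 1.0 - len(ref) / max(len(hyp), 1))  # log brevity penalty
    return math.exp(brevity + log_prec / max_order)

def nbest_mbr(nbest, posteriors):
    # nbest: list of hypotheses (token lists); posteriors: normalized P(E|F) values
    # returns the hypothesis with the highest expected sentence-BLEU gain,
    # i.e. the minimum-Bayes-risk candidate under a 1 - BLEU loss
    def expected_gain(cand):
        return sum(p * sentence_bleu(cand, evidence)
                   for evidence, p in zip(nbest, posteriors))
    return max(nbest, key=expected_gain)

This is the O(N^2) reranking that the lattice and hypergraph algorithms in the paper are designed to avoid.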


Statistical Machine Translation

Appears in 3 sentences as: Statistical Machine Translation (2) statistical machine translation (1)
In Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
  1. Minimum Error Rate Training (MERT) and Minimum Bayes-Risk (MBR) decoding are used in most current state-of-the-art Statistical Machine Translation (SMT) systems.
    Page 1, “Abstract”
  2. Statistical Machine Translation (SMT) systems have improved considerably by directly using the error criterion in both training and decoding.
    Page 1, “Introduction”
  3. In the context of statistical machine translation, the optimization procedure was first described in Och (2003) for N-best lists and later extended to phrase-lattices in Macherey et al.
    Page 2, “Minimum Error Rate Training”
