Joint Decoding with Multiple Translation Models
Liu, Yang and Mi, Haitao and Feng, Yang and Liu, Qun

Article Structure

Abstract

Current SMT systems usually decode with single translation models and cannot benefit from the strengths of other models in the decoding phase.

Introduction

System combination aims to find consensus translations among different machine translation systems.

Background

Statistical machine translation is a decision problem in which we need to decide on the best target sentence matching a source sentence.

Joint Decoding

There are two major challenges for combining multiple models directly in the decoding phase.

Extended Minimum Error Rate Training

Minimum error rate training (Och, 2003) is widely used to optimize feature weights for a linear model (Och and Ney, 2002).

Experiments

5.1 Data Preparation

Related Work

System combination has benefited various NLP tasks in recent years, such as products-of-experts (e.g., Smith and Eisner, 2005) and ensemble-based parsing (e.g., Henderson and Brill, 1999).

Conclusion

We have presented a framework for including multiple translation models in one decoder.

Topics

phrase-based

Appears in 14 sentences as: phrase-based (13), “phrase-based” (3)
In Joint Decoding with Multiple Translation Models
  1. We evaluated our joint decoder that integrated a hierarchical phrase-based model (Chiang, 2005; Chiang, 2007) and a tree-to-string model (Liu et al., 2006) on the NIST 2005 Chinese-English test-set.
    Page 1, “Introduction”
  2. Some researchers prefer saying “phrase-based approaches” or “phrase-based systems”.
    Page 1, “Introduction”
  3. On the other hand, other authors (e.g., Och and Ney, 2004; Koehn et al., 2003; Chiang, 2007) do use the expression “phrase-based models”.
    Page 1, “Introduction”
  4. In phrase-based models, a decision can be translating a source phrase into a target phrase or reordering the target phrases.
    Page 2, “Background”
  5. Figure 2(a) demonstrates a translation hypergraph for one model, for example, a hierarchical phrase-based model.
    Page 3, “Joint Decoding”
  6. Although phrase-based decoders usually produce translations from left to right, they can adopt bottom-up decoding in principle.
    Page 4, “Joint Decoding”
  7. Watanabe et al. (2006) propose left-to-right target generation for hierarchical phrase-based translation.
    Page 4, “Joint Decoding”
  8. For example, although different on the source side, both hierarchical phrase-based and tree-to-string models produce strings of terminals and nonterminals on the target side.
    Page 5, “Joint Decoding”
  9. It is appealing to combine them in such a way because the hierarchical phrase-based model provides excellent rule coverage while the tree-to-string model offers linguistically motivated nonlocal reordering.
    Page 5, “Joint Decoding”
  10. String-targeted models include phrase-based, hierarchical phrase-based, and tree-to-string models.
    Page 5, “Joint Decoding”
  11. The first model was the hierarchical phrase-based model (Chiang, 2005; Chiang, 2007).
    Page 7, “Experiments”


BLEU

Appears in 13 sentences as: BLEU (13) |BLEU| (1)
In Joint Decoding with Multiple Translation Models
  1. Comparable to the state-of-the-art system combination technique, joint decoding achieves an absolute improvement of 1.5 BLEU points over individual decoding.
    Page 1, “Abstract”
  2. As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
    Page 1, “Introduction”
  3. Joint decoding with multiple models achieves an absolute improvement of 1.5 BLEU points over individual decoding with single models (Section 5).
    Page 2, “Introduction”
  4. We evaluated the translation quality using case-insensitive BLEU metric (Papineni et al., 2002).
    Page 6, “Experiments”
  5. Table 2: Comparison of individual decoding and joint decoding in terms of speed (seconds/sentence) and BLEU score (case-insensitive).
    Page 7, “Experiments”
  6. With conventional max-derivation decoding, the hierarchical phrase-based model achieved a BLEU score of 30.11 on the test set, with an average decoding time of 40.53 seconds/sentence.
    Page 7, “Experiments”
  7. We found that accounting for all possible derivations in max-translation decoding resulted in a small negative effect on BLEU score (from 30.11 to 29.82), even though the feature weights were tuned with respect to BLEU score.
    Page 7, “Experiments”
  8. Max-derivation decoding with the tree-to-string model yielded a much lower BLEU score (i.e., 27.23) than the hierarchical phrase-based model.
    Page 7, “Experiments”
  9. When combining the two models at the translation level, the joint decoder achieved a BLEU score of 30.79 that outperformed the best result (i.e., 30.11) of individual decoding significantly (p < 0.05).
    Page 7, “Experiments”
  10. When combining the two models at the derivation level using max-derivation decoding, the joint decoder achieved a BLEU score of 31.63 that outperformed the best result (i.e., 30.11) of individual decoding.
    Page 7, “Experiments”
  11. | Method | Model | BLEU |
    Page 8, “Experiments”
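
The sentences above report case-insensitive BLEU (Papineni et al., 2002). As a reminder of what the metric computes, here is a minimal Python sketch of BLEU with clipped n-gram precisions up to 4-grams and a brevity penalty; the function names, smoothing floor, and example data are illustrative assumptions, and the paper's scores come from a standard evaluation script, not this code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Toy BLEU for one sentence: clipped n-gram precisions + brevity penalty."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # A tiny floor keeps the geometric mean defined when a precision is zero.
        log_precision_sum += math.log(max(clipped, 1e-9) / total) / max_n

    # Brevity penalty against the reference whose length is closest.
    cand_len = len(candidate)
    ref_len = min((abs(len(r) - cand_len), len(r)) for r in references)[1]
    brevity = 1.0 if cand_len > ref_len else math.exp(1.0 - ref_len / max(cand_len, 1))
    return brevity * math.exp(log_precision_sum)


# Example: partial n-gram overlap with a single reference.
print(bleu("the cat sat on the mat".split(),
           ["the cat is on the mat".split()]))
```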


feature weights

Appears in 12 sentences as: feature weight (3) feature weights (9)
In Joint Decoding with Multiple Translation Models
  1. As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
    Page 1, “Introduction”
  2. where h_m is a feature function, λ_m is the associated feature weight, and Z(f) is a constant for normalization:
    Page 2, “Background”
  3. Minimum error rate training (Och, 2003) is widely used to optimize feature weights for a linear model (Och and Ney, 2002).
    Page 5, “Extended Minimum Error Rate Training”
  4. The key idea of MERT is to tune one feature weight at a time to minimize the error rate while keeping the others fixed.
    Page 5, “Extended Minimum Error Rate Training”
  5. where a is the feature value of the current dimension, x is the feature weight being tuned, and b is the dot product of the other dimensions.
    Page 6, “Extended Minimum Error Rate Training”
  6. Unfortunately, minimum error rate training cannot be directly used to optimize feature weights of max-translation decoding because Eq.
    Page 6, “Extended Minimum Error Rate Training”
  7. Extended MERT runs on n-best translations plus n′-best derivations to optimize the feature weights.
    Page 6, “Extended Minimum Error Rate Training”
  8. Note that feature weights of various models are tuned jointly in extended MERT.
    Page 6, “Extended Minimum Error Rate Training”
  9. We found that accounting for all possible derivations in max-translation decoding resulted in a small negative effect on BLEU score (from 30.11 to 29.82), even though the feature weights were tuned with respect to BLEU score.
    Page 7, “Experiments”
  10. We concatenate and normalize their feature weights for the joint decoder.
    Page 8, “Experiments”
  11. As our decoder accounts for multiple derivations, we extend the MERT algorithm to tune feature weights with respect to BLEU score for max-translation decoding.
    Page 8, “Conclusion”
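
Items 4 and 5 above describe MERT's coordinate-wise tuning: with all other weights fixed, every candidate's model score is a linear function a·x + b of the weight x being tuned. The Python sketch below illustrates that setup with a simplified grid search over x rather than Och's exact line search over the points where the best candidate changes; all names, the toy data, and the error function are assumptions for illustration, not the paper's implementation.

```python
def tune_one_weight(nbest, error, grid):
    """Tune a single feature weight while the other weights stay fixed.

    `nbest` maps each source sentence to candidates (a, b, hypothesis),
    where `a` is the feature value of the dimension being tuned and `b`
    is the dot product of the remaining features with their fixed
    weights, so each candidate's model score is a * x + b.
    `error` maps a list of selected hypotheses to a corpus-level error
    (e.g., 1 - BLEU).  A grid search stands in for Och's exact line
    search, which only evaluates points where the best candidate changes.
    """
    best_x, best_error = None, float("inf")
    for x in grid:
        # For this weight value, pick the highest-scoring candidate per sentence.
        selected = [max(candidates, key=lambda c: c[0] * x + c[1])[2]
                    for candidates in nbest.values()]
        current = error(selected)
        if current < best_error:
            best_x, best_error = x, current
    return best_x, best_error


# Toy usage with two sentences, two candidates each; the error function
# simply counts hypotheses that miss a small "reference" set.
nbest = {
    "s1": [(1.0, 0.5, "good translation"), (0.2, 1.0, "bad translation")],
    "s2": [(0.8, 0.0, "another good one"), (0.1, 0.9, "another bad one")],
}
references = {"good translation", "another good one"}


def count_errors(hypotheses):
    return sum(h not in references for h in hypotheses)


print(tune_one_weight(nbest, count_errors, grid=[i / 10 for i in range(-20, 21)]))
```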


translation models

Appears in 10 sentences as: translation model (2) translation models (8)
In Joint Decoding with Multiple Translation Models
  1. Current SMT systems usually decode with single translation models and cannot benefit from the strengths of other models in the decoding phase.
    Page 1, “Abstract”
  2. We instead propose joint decoding, a method that combines multiple translation models in one decoder.
    Page 1, “Abstract”
  3. In this paper, we propose a framework for combining multiple translation models directly in the decoding phase.
    Page 1, “Introduction”
  4. Second, translation models differ in decoding algorithms.
    Page 3, “Joint Decoding”
  5. Despite the diversity of translation models, they all have to produce partial translations for substrings of input sentences.
    Page 3, “Joint Decoding”
  6. Therefore, we represent the search space of a translation model as a structure called translation hypergraph.
    Page 3, “Joint Decoding”
  7. As a general representation, a translation hypergraph is capable of characterizing the search space of an arbitrary translation model.
    Page 4, “Joint Decoding”
  8. Although the information inside a derivation differs widely among translation models, the beginning and end points (i.e., f and e, respectively) must be identical.
    Page 4, “Joint Decoding”
  9. The input is a source language sentence f and a set of translation models M.
    Page 4, “Joint Decoding”
  10. We have presented a framework for including multiple translation models in one decoder.
    Page 8, “Conclusion”
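
Items 5-8 above introduce the translation hypergraph as a shared representation of the search space, in which partial translations for source substrings are nodes and rule applications from any participating model are hyperedges. The following Python sketch is a deliberately simplified illustration of that data structure (real nodes would also carry target-side state such as language-model context); all class and field names are assumptions, not the paper's code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Span = Tuple[int, int]  # a substring of the source sentence, as (start, end)


@dataclass
class Hyperedge:
    """One derivation step that builds `head` from the `tails` sub-spans.

    `model` records which translation model proposed the step, so edges
    from different models can meet at the same nodes."""
    head: Span
    tails: List[Span]
    model: str          # e.g. "hiero" or "tree-to-string"
    score: float


@dataclass
class Node:
    span: Span
    incoming: List[Hyperedge] = field(default_factory=list)


class TranslationHypergraph:
    """Toy shared search space: nodes are source spans, hyperedges are rule
    applications contributed by any participating model."""

    def __init__(self, sentence_length: int):
        self.nodes = {}
        self.goal: Span = (0, sentence_length)

    def node(self, span: Span) -> Node:
        return self.nodes.setdefault(span, Node(span))

    def add_edge(self, head: Span, tails: List[Span], model: str, score: float):
        edge = Hyperedge(head, tails, model, score)
        for span in [head, *tails]:      # make sure all endpoints exist
            self.node(span)
        self.node(head).incoming.append(edge)
        return edge


# Two models contribute hyperedges over the same spans of a 5-word sentence.
hypergraph = TranslationHypergraph(5)
hypergraph.add_edge((0, 2), [], model="hiero", score=-1.2)
hypergraph.add_edge((2, 5), [], model="tree-to-string", score=-0.8)
hypergraph.add_edge((0, 5), [(0, 2), (2, 5)], model="hiero", score=-0.5)
print(len(hypergraph.node(hypergraph.goal).incoming), "hyperedge(s) into the goal node")
```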


BLEU score

Appears in 9 sentences as: BLEU score (10)
In Joint Decoding with Multiple Translation Models
  1. As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
    Page 1, “Introduction”
  2. Table 2: Comparison of individual decoding and joint decoding in terms of speed (seconds/sentence) and BLEU score (case-insensitive).
    Page 7, “Experiments”
  3. With conventional max-derivation decoding, the hierarchical phrase-based model achieved a BLEU score of 30.11 on the test set, with an average decoding time of 40.53 seconds/sentence.
    Page 7, “Experiments”
  4. We found that accounting for all possible derivations in max-translation decoding resulted in a small negative effect on BLEU score (from 30.11 to 29.82), even though the feature weights were tuned with respect to BLEU score.
    Page 7, “Experiments”
  5. Max-derivation decoding with the tree-to-string model yielded a much lower BLEU score (i.e., 27.23) than the hierarchical phrase-based model.
    Page 7, “Experiments”
  6. When combining the two models at the translation level, the joint decoder achieved a BLEU score of 30.79 that outperformed the best result (i.e., 30.11) of individual decoding significantly (p < 0.05).
    Page 7, “Experiments”
  7. When combining the two models at the derivation level using max-derivation decoding, the joint decoder achieved a BLEU score of 31.63 that outperformed the best result (i.e., 30.11) of individual decoding.
    Page 7, “Experiments”
  8. As shown in Table 3, taking the translations of the two individual decoders as input, the system combination method achieved a BLEU score of 31.50, slightly lower than that of joint decoding.
    Page 8, “Experiments”
  9. As our decoder accounts for multiple derivations, we extend the MERT algorithm to tune feature weights with respect to BLEU score for max-translation decoding.
    Page 8, “Conclusion”


phrase pairs

Appears in 8 sentences as: phrase pairs (8)
In Joint Decoding with Multiple Translation Models
  1. Decoders that use rules with flat structures (e.g., phrase pairs) usually generate target sentences from left to right while those using rules with hierarchical structures (e.g., SCFG rules) often run in a bottom-up style.
    Page 3, “Joint Decoding”
  2. Xiong et al. (2006) develop a bottom-up decoder for BTG (Wu, 1997) that uses only phrase pairs.
    Page 4, “Joint Decoding”
  3. Hierarchical phrase pairs are used for translating smaller units and tree-to-string rules for bigger ones.
    Page 5, “Joint Decoding”
  4. Similarly, Blunsom and Osborne (2008) use both hierarchical phrase pairs and tree-to-string rules in decoding, where source parse trees serve as conditioning context rather than hard constraints.
    Page 5, “Joint Decoding”
  5. About 2.6M hierarchical phrase pairs extracted from the training corpus were used on the test set.
    Page 7, “Experiments”
  6. This improvement resulted from the mixture of hierarchical phrase pairs and tree-to-string rules.
    Page 8, “Experiments”
  7. To produce the result, the joint decoder made use of 8,114 hierarchical phrase pairs learned from training data, 6,800 glue rules connecting partial translations monotonically, and 16,554 tree-to-string rules.
    Page 8, “Experiments”
  8. While tree-to-string rules offer linguistically motivated nonlocal reordering during decoding, hierarchical phrase pairs ensure good rule coverage.
    Page 8, “Experiments”


error rate

Appears in 5 sentences as: error rate (5)
In Joint Decoding with Multiple Translation Models
  1. As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
    Page 1, “Introduction”
  2. Minimum error rate training (Och, 2003) is widely used to optimize feature weights for a linear model (Och and Ney, 2002).
    Page 5, “Extended Minimum Error Rate Training”
  3. The key idea of MERT is to tune one feature weight at a time to minimize the error rate while keeping the others fixed.
    Page 5, “Extended Minimum Error Rate Training”
  4. Unfortunately, minimum error rate training cannot be directly used to optimize feature weights of max-translation decoding because Eq.
    Page 6, “Extended Minimum Error Rate Training”
  5. One possible reason is that we only used n-best derivations instead of all possible derivations for minimum error rate training.
    Page 7, “Experiments”


latent variable

Appears in 4 sentences as: latent variable (4)
In Joint Decoding with Multiple Translation Models
  1. Blunsom et al. (2008) present a latent variable model that describes the relationship between translation and derivation clearly.
    Page 2, “Background”
  2. Although originally proposed for supporting large sets of nonindependent and overlapping features, the latent variable model is actually a more general form of the conventional linear model (Och and Ney, 2002).
    Page 2, “Background”
  3. Accordingly, decoding for the latent variable model can be formalized as
    Page 2, “Background”
  4. They show that max-translation decoding outperforms max-derivation decoding for the latent variable model.
    Page 8, “Related Work”
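
Items 1-4 above refer to the latent-variable formulation and the two decoding objectives it admits. The LaTeX below restates the standard form of that model and of max-derivation versus max-translation decoding, using the h_m, λ_m, and Z(f) notation quoted under "feature weights"; the exact notation in the paper may differ slightly.

```latex
% Log-linear model over (derivation d, translation e) pairs, with feature
% functions h_m, weights \lambda_m, and normalizer Z(f) as quoted above:
P(d, e \mid f) = \frac{\exp\big(\sum_{m} \lambda_m \, h_m(d, e, f)\big)}{Z(f)}

% The derivation is latent, so a translation's probability sums over all
% derivations \Delta(e, f) that yield it:
P(e \mid f) = \sum_{d \in \Delta(e, f)} P(d, e \mid f)

% Max-derivation decoding approximates this sum by its largest term;
% max-translation decoding keeps the full sum for every candidate:
\hat{e}_{\text{max-der}} = \arg\max_{e} \; \max_{d \in \Delta(e, f)} P(d, e \mid f)
\qquad
\hat{e}_{\text{max-tra}} = \arg\max_{e} \; \sum_{d \in \Delta(e, f)} P(d, e \mid f)
```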


machine translation

Appears in 4 sentences as: machine translation (4)
In Joint Decoding with Multiple Translation Models
  1. System combination aims to find consensus translations among different machine translation systems.
    Page 1, “Introduction”
  2. Statistical machine translation is a decision problem in which we need to decide on the best target sentence matching a source sentence.
    Page 2, “Background”
  3. In machine translation, confusion-network based combination techniques (e.g., Rosti et al., 2007; He et al., 2008) have achieved state-of-the-art performance in MT evaluations.
    Page 8, “Related Work”
  4. Hypergraphs have been successfully used in parsing (Klein and Manning, 2001; Huang and Chiang, 2005; Huang, 2008) and machine translation (Huang and Chiang, 2007; Mi et al., 2008; Mi and Huang, 2008).
    Page 8, “Related Work”


language model

Appears in 3 sentences as: language model (2) Language Modeling (1) language models (1)
In Joint Decoding with Multiple Translation Models
  1. There are also features independent of derivations, such as the language model and word penalty.
    Page 2, “Joint Decoding”
  2. Although left-to-right decoding might enable a more efficient use of language models and hopefully produce better translations, we adopt bottom-up decoding in this paper just for convenience.
    Page 4, “Joint Decoding”
  3. For the language model, we used the SRI Language Modeling Toolkit (Stolcke, 2002) to train a 4-gram model on the Xinhua portion of the GIGAWORD corpus.
    Page 6, “Experiments”


SMT systems

Appears in 3 sentences as: SMT systems (3)
In Joint Decoding with Multiple Translation Models
  1. Current SMT systems usually decode with single translation models and cannot benefit from the strengths of other models in the decoding phase.
    Page 1, “Abstract”
  2. Most SMT systems approximate the summation over all possible derivations by using the 1-best derivation for efficiency.
    Page 2, “Background”
  3. By now, most current SMT systems, adopting either max-derivation decoding or max-translation decoding, have used only single models in the decoding phase.
    Page 2, “Background”
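
Items 2 and 3 above note that most systems replace the summation over all derivations with the single best derivation. The toy Python sketch below contrasts the two choices on a hypothetical n-best list in which several derivations yield the same translation string; grouping an n-best list this way is only an approximation of true max-translation decoding, and the data and scores are invented for illustration.

```python
import math
from collections import defaultdict

# Hypothetical n-best list: several derivations can yield the same translation.
# Each entry is (translation string, log-linear derivation score).
nbest = [
    ("he likes reading", -2.0),
    ("he likes reading", -2.3),   # a second derivation of the same string
    ("he likes to read", -1.9),
]

# Max-derivation decoding: keep only the single highest-scoring derivation.
max_derivation_choice = max(nbest, key=lambda entry: entry[1])[0]

# Max-translation decoding: sum the (unnormalized) probabilities of all
# derivations that produce the same string, then pick the best translation.
mass_per_translation = defaultdict(float)
for translation, score in nbest:
    mass_per_translation[translation] += math.exp(score)
max_translation_choice = max(mass_per_translation, key=mass_per_translation.get)

print("max-derivation :", max_derivation_choice)    # -> "he likes to read"
print("max-translation:", max_translation_choice)   # -> "he likes reading"
```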
