Collaborative Decoding: Partial Hypothesis Re-ranking Using Translation Consensus between Decoders
Li, Mu and Duan, Nan and Zhang, Dongdong and Li, Chi-Ho and Zhou, Ming

Article Structure

Abstract

This paper presents collaborative decoding (co-decoding), a new method to improve machine translation accuracy by leveraging translation consensus between multiple machine translation decoders.

Introduction

Recent research has shown that substantial improvements can be achieved by utilizing consensus statistics obtained from the outputs of multiple machine translation systems.

Collaborative Decoding

2.1 Overview

Experiments

In this section we present experiments to evaluate the co-decoding method.

Discussion

Word-level system combination (system combination hereafter) (Rosti et al., 2007; He et al., 2008) has been proven to be an effective way to improve machine translation quality by using outputs from multiple systems.

Conclusion

Improving machine translation with multiple systems has been a focus in recent SMT research.

Topics

NIST

Appears in 16 sentences as: NIST (21)
In Collaborative Decoding: Partial Hypothesis Re-ranking Using Translation Consensus between Decoders
  1. Experimental results on data sets for NIST Chinese-to-English machine translation task show that the co-decoding method can bring significant improvements to all baseline decoders, and the outputs from co-decoding can be used to further improve the result of system combination.
    Page 1, “Abstract”
  2. We will present experimental results on the data sets of NIST Chinese-to-English machine translation task, and demonstrate that co-decoding can bring significant improvements to baseline systems.
    Page 2, “Introduction”
  3. We conduct our experiments on the test data from the NIST 2005 and NIST 2008 Chinese-to-English machine translation tasks.
    Page 5, “Experiments”
  4. The NIST 2003 test data is used for development data to estimate model parameters.
    Page 5, “Experiments”
  5. In our experiments all the models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric as percentages.
    Page 5, “Experiments”
  6. Data set | # Sentences | # Words
     NIST 2003 (dev) | 919 | 23,782
     NIST 2005 (test) | 1,082 | 29,258
     NIST 2008 (test) | 1,357 | 31,592
    Page 5, “Experiments”
  7. We use the parallel data available for the NIST 2008 constrained track of Chinese-to-English machine translation task as bilingual training data, which contains 5.1M sentence pairs, 128M Chinese words and 147M English words after preprocessing.
    Page 5, “Experiments”
  8. We run two iterations of decoding for each member decoder, and hold the value of α in Equation 5 at a constant 0.05, which is tuned on the test data of the NIST 2004 Chinese-to-English machine translation task.
    Page 5, “Experiments”
  9. Parameters for both system combination and hypothesis selection are also tuned on NIST 2003 test data.
    Page 5, “Experiments”
  10. NIST 2005 | NIST 2008 (table column headers)
    Page 5, “Experiments”
  11. We also evaluate the performance of system combination using different n-best sizes, and the results on the NIST 2005 data set are shown in Figure 2, where the bl- and co- legends denote combination results of baseline decoding and co-decoding, respectively.
    Page 6, “Experiments”

See all papers in Proc. ACL 2009 that mention NIST.


machine translation

Appears in 16 sentences as: Machine Translation (1) machine translation (17)
  1. This paper presents collaborative decoding (co-decoding), a new method to improve machine translation accuracy by leveraging translation consensus between multiple machine translation decoders.
    Page 1, “Abstract”
  2. Different from system combination and MBR decoding, which post-process the n-best lists or word lattice of machine translation decoders, in our method multiple machine translation decoders collaborate by exchanging partial translation results.
    Page 1, “Abstract”
  3. Experimental results on data sets for NIST Chinese-to-English machine translation task show that the co-decoding method can bring significant improvements to all baseline decoders, and the outputs from co-decoding can be used to further improve the result of system combination.
    Page 1, “Abstract”
  4. Recent research has shown that substantial improvements can be achieved by utilizing consensus statistics obtained from the outputs of multiple machine translation systems.
    Page 1, “Introduction”
  5. Typically, the resulting systems take outputs of individual machine translation systems as
    Page 1, “Introduction”
  6. A common property of all the work mentioned above is that the combination models work on the basis of n-best translation lists (full hypotheses) of existing machine translation systems.
    Page 1, “Introduction”
  7. However, the n-best list only presents a very small portion of the entire search space of a Statistical Machine Translation (SMT) model while a majority of the space, within which there are many potentially good translations, is pruned away in decoding.
    Page 1, “Introduction”
  8. In this paper, we present collaborative decoding (or co-decoding), a new SMT decoding scheme to leverage consensus information between multiple machine translation systems.
    Page 1, “Introduction”
  9. In this scheme, instead of using a postprocessing step, multiple machine translation decoders collaborate during the decoding process, and translation consensus statistics are taken into account to improve ranking not only for full translations, but also for partial hypotheses.
    Page 1, “Introduction”
  10. We will present experimental results on the data sets of NIST Chinese-to-English machine translation task, and demonstrate that co-decoding can bring significant improvements to baseline systems.
    Page 2, “Introduction”
  11. Because it is usually not feasible to enumerate the entire hypothesis space for machine translation, we approximate H_k(f) with n-best hypotheses by convention.
    Page 3, “Collaborative Decoding”


n-gram

Appears in 15 sentences as: (1) n-gram (14)
  1. Using an iterative decoding approach, n-gram agreement statistics between translations of multiple decoders are employed to re-rank both full and partial hypotheses explored in decoding.
    Page 1, “Abstract”
  2. To compute the consensus measures, we further decompose each consensus measure G(e, e′) into n-gram matching statistics between e and e′.
    Page 4, “Collaborative Decoding”
  3. For each n-gram of order n, we introduce a pair of complementary consensus measure functions Gn+(e, e′) and Gn−(e, e′), described as follows:
    Page 4, “Collaborative Decoding”
  4. Gn+(e, e′) is the n-gram agreement measure function, which counts the number of occurrences in e′ of n-grams in e.
    Page 4, “Collaborative Decoding”
  5. Gn+(e, e′) = Σ_{i=1}^{|e|−n+1} I(e_i^{i+n−1}, e′), where I(·, ·) is a binary indicator function: I(e_i^{i+n−1}, e′) is 1 if the n-gram e_i^{i+n−1} occurs in e′, and 0 otherwise.
    Page 4, “Collaborative Decoding”
  6. Gn−(e, e′) is the n-gram disagreement measure function, which is complementary to Gn+(e, e′):
    Page 4, “Collaborative Decoding”
  7. Similar to a language model score, n-gram consensus-based feature values cannot be summed up from smaller hypotheses.
    Page 4, “Collaborative Decoding”
  8. All baseline decoders are extended with n-gram consensus-based co-decoding features to construct member decoders.
    Page 5, “Experiments”
  9. Table 4 shows the comparison results of a two-system co-decoding using different settings of n-gram agreement and disagreement features.
    Page 7, “Experiments”
  10. It is clearly shown that both n-gram agreement and disagreement types of features are helpful, and using them together is the best choice.
    Page 7, “Experiments”
  11. Table 4: Co-decoding with/without n-gram agreement and disagreement features
    Page 7, “Experiments”
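
The agreement and disagreement measures quoted above can be sketched in a few lines of Python. This is a minimal illustration of Gn+ and Gn− as per-position binary indicator counts; the function names and the `ngrams` helper are ours, not the paper's implementation.

```python
def ngrams(tokens, n):
    """All n-grams of order n in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def gn_plus(e, e_prime, n):
    """Gn+(e, e'): number of positions i in e whose n-gram also occurs in e'."""
    present = set(ngrams(e_prime, n))
    return sum(1 for g in ngrams(e, n) if g in present)

def gn_minus(e, e_prime, n):
    """Gn-(e, e'): complementary disagreement count, so Gn+ + Gn- = |e| - n + 1."""
    return len(ngrams(e, n)) - gn_plus(e, e_prime, n)
```

For example, with e = "the cat sat" and e′ = "the cat slept", the bigram agreement count is 1 (only "the cat" matches) and the disagreement count is 1.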


n-grams

Appears in 6 sentences as: n-grams (7)
  1. Here we do not discriminate among different lexical n-grams, and are only concerned with aggregating statistics over all n-grams of the same order.
    Page 4, “Collaborative Decoding”
  2. Gn+(e, e′) is the n-gram agreement measure function, which counts the number of occurrences in e′ of n-grams in e.
    Page 4, “Collaborative Decoding”
  3. So the corresponding feature value will be the expected number of occurrences in H_k(f) of all n-grams in e:
    Page 4, “Collaborative Decoding”
  4. In Table 5 we show in another dimension the impact of consensus-based features by restricting the maximum order of n-grams used to compute agreement statistics.
    Page 7, “Experiments”
  5. One reason could be that the data sparsity for high-order n-grams leads to overfitting on the development data.
    Page 7, “Experiments”
  6. Our method uses agreement information of n-grams, and consensus features are integrated into the decoding models.
    Page 7, “Discussion”
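
Excerpt 3 above describes the agreement feature as an expected count over H_k(f), approximated by an n-best list weighted by translation posteriors. A hedged sketch under assumed data shapes (hypotheses as (tokens, posterior) pairs; all names ours):

```python
def ngrams(tokens, n):
    """All n-grams of order n in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def expected_agreement(e, nbest, n):
    """Expected number of occurrences, over the n-best approximation of
    H_k(f), of e's n-grams: sum over hypotheses e' of P(e'|d_k) * Gn+(e, e').
    nbest is a list of (tokens, posterior) pairs whose posteriors sum to 1."""
    total = 0.0
    for e_prime, posterior in nbest:
        present = set(ngrams(e_prime, n))
        total += posterior * sum(1 for g in ngrams(e, n) if g in present)
    return total
```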


Word-level

Appears in 6 sentences as: Word-level (3) word-level (3)
  1. Most of the work focused on seeking better word alignment for consensus-based confusion network decoding (Matusov et al., 2006) or word-level system combination (He et al., 2008; Ayan et al., 2008).
    Page 1, “Introduction”
  2. We also conduct extensive investigations with different settings of co-decoding, and make comparisons with related methods such as word-level system combination or hypothesis selection from multiple n-best lists.
    Page 2, “Introduction”
  3. • Word-level system combination (Rosti et al., 2007) of member decoders’ n-best outputs
    Page 5, “Collaborative Decoding”
  4. We also implemented the word-level system combination (Rosti et al., 2007) and the hypothesis selection method (Hildebrand and Vogel, 2008).
    Page 5, “Experiments”
  5. Word-level Comb 40.45/40.85 29.52/30.35; Hypo Selection 40.09/40.50 29.02/29.71
    Page 5, “Experiments”
  6. Word-level system combination (system combination hereafter) (Rosti et al., 2007; He et al., 2008) has been proven to be an effective way to improve machine translation quality by using outputs from multiple systems.
    Page 7, “Discussion”


score function

Appears in 6 sentences as: score function (6) scoring function (1)
  1. 2.2 Generic Collaborative Decoding Model: For a given source sentence f, a member model in co-decoding finds the best translation e* among the set of possible candidate translations H(f) based on a scoring function F:
    Page 2, “Collaborative Decoding”
  2. where Φ_m(f, e) is the score function of the m-th baseline model, and each Ψ_k(e, H_k(f)) is a partial consensus score function with respect to d_k, defined over e and H_k(f):
    Page 2, “Collaborative Decoding”
  3. Note that in Equation 2, though the baseline score function Φ_m(f, e) can be computed inside each decoder, the case of Ψ_k(e, H_k(f)) is more complicated.
    Page 3, “Collaborative Decoding”
  4. One is the support for co-decoding features, including computation of feature values and the use of augmented co-decoding score function (Equation 2) for hypothesis ranking and pruning.
    Page 3, “Collaborative Decoding”
  5. where F_k(·) is the score function given in Equation 2, and α is a scaling factor following the work of Tromble et al.
    Page 4, “Collaborative Decoding”
  6. Since there is more than one model in co-decoding, we cannot rely on a member model’s score function to choose the single best translation from multiple decoders’ outputs, because the model scores are not directly comparable.
    Page 5, “Collaborative Decoding”
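
Excerpt 2's decomposition, a baseline score Φ_m plus partial consensus scores Ψ_k from the other decoders, amounts to a feature-augmented log-linear sum. A minimal sketch under assumed data shapes (flat feature lists per collaborating decoder; all names ours):

```python
def codecoding_score(baseline_score, consensus_features, weights):
    """Augmented score in the spirit of Equation 2: the baseline model
    score plus a weighted sum of consensus feature values contributed by
    each other member decoder k. consensus_features and weights map a
    decoder id to a list of feature values / feature weights."""
    score = baseline_score
    for k, feats in consensus_features.items():
        score += sum(w * f for w, f in zip(weights[k], feats))
    return score
```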


Model training

Appears in 5 sentences as: model trained (1) Model Training (1) Model training (2) model training (1)
  1. Model training.
    Page 2, “Collaborative Decoding”
  2. 2.5 Model Training
    Page 4, “Collaborative Decoding”
  3. Model training for co-decoding
    Page 4, “Collaborative Decoding”
  4. The language model used for all models (including the decoding models and system combination models described in Section 2.6) is a 5-gram model trained on the English part of the bilingual data and the Xinhua portion of the LDC English Gigaword corpus, version 3.
    Page 5, “Experiments”
  5. We parsed the language model training data with the Berkeley parser, and then trained a dependency language model based on the parsing output.
    Page 5, “Experiments”


translation task

Appears in 5 sentences as: translation task (4) translation tasks (1)
  1. Experimental results on data sets for NIST Chinese-to-English machine translation task show that the co-decoding method can bring significant improvements to all baseline decoders, and the outputs from co-decoding can be used to further improve the result of system combination.
    Page 1, “Abstract”
  2. We will present experimental results on the data sets of NIST Chinese-to-English machine translation task, and demonstrate that co-decoding can bring significant improvements to baseline systems.
    Page 2, “Introduction”
  3. We conduct our experiments on the test data from the NIST 2005 and NIST 2008 Chinese-to-English machine translation tasks.
    Page 5, “Experiments”
  4. We use the parallel data available for the NIST 2008 constrained track of Chinese-to-English machine translation task as bilingual training data, which contains 5.1M sentence pairs, 128M Chinese words and 147M English words after preprocessing.
    Page 5, “Experiments”
  5. We run two iterations of decoding for each member decoder, and hold the value of α in Equation 5 at a constant 0.05, which is tuned on the test data of the NIST 2004 Chinese-to-English machine translation task.
    Page 5, “Experiments”


BLEU

Appears in 5 sentences as: BLEU (5)
  1. In our experiments all the models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric as percentages.
    Page 5, “Experiments”
  2. Figure 3 shows the BLEU score curves with up to 1000 candidates used for re-ranking.
    Page 6, “Experiments”
  3. Figure 4 shows the BLEU scores of a two-system co-decoding as a function of re-decoding iterations.
    Page 6, “Experiments”
  4. The results show that member models help each other: although improvements can be made using a single member model, the best BLEU scores are only achieved when both member models are used, as shown by the results of iteration 2.
    Page 7, “Experiments”
  5. From the results we do not observe BLEU improvement for n > 4.
    Page 7, “Experiments”


log-linear

Appears in 4 sentences as: log-linear (4)
  1. In our work, any Maximum A Posteriori (MAP) SMT model with log-linear formulation (Och, 2002) can be a qualified candidate for a baseline model.
    Page 2, “Collaborative Decoding”
  2. The requirement for a log-linear model aims to provide a natural way to integrate the new co-decoding features.
    Page 2, “Collaborative Decoding”
  3. Referring to the log-linear model formulation, the translation posterior P(e′|d_k) can be computed as:
    Page 4, “Collaborative Decoding”
  4. In this paper, we present a framework of collaborative decoding, in which multiple MT decoders are coordinated to search for better translations by re-ranking partial hypotheses using augmented log-linear models with translation consensus-based features.
    Page 8, “Conclusion”
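
Excerpt 3 mentions computing the translation posterior P(e′|d_k) from the log-linear model; with a scaling factor α (the paper tunes the α of Equation 5 to 0.05), this is a scaled softmax over the n-best scores. A numerically stable sketch (names ours):

```python
import math

def translation_posterior(scores, alpha=0.05):
    """Posterior over an n-best list from log-linear model scores F_k(e'),
    scaled by alpha: P(e'|d_k) = exp(alpha*F) / sum exp(alpha*F).
    Subtracting the max score keeps exp() from overflowing."""
    m = max(scores)
    exps = [math.exp(alpha * (s - m)) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]
```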


language model

Appears in 4 sentences as: language model (4) language models (1)
  1. Similar to a language model score, n-gram consensus-based feature values cannot be summed up from smaller hypotheses.
    Page 4, “Collaborative Decoding”
  2. The language model used for all models (including the decoding models and system combination models described in Section 2.6) is a 5-gram model trained on the English part of the bilingual data and the Xinhua portion of the LDC English Gigaword corpus, version 3.
    Page 5, “Experiments”
  3. We parsed the language model training data with the Berkeley parser, and then trained a dependency language model based on the parsing output.
    Page 5, “Experiments”
  4. They also empirically show that n-gram agreement is the most important factor for improvement apart from language models.
    Page 7, “Discussion”


BLEU score

Appears in 4 sentences as: BLEU score (2) BLEU scores (2)
  1. In our experiments all the models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric as percentages.
    Page 5, “Experiments”
  2. Figure 3 shows the BLEU score curves with up to 1000 candidates used for re-ranking.
    Page 6, “Experiments”
  3. Figure 4 shows the BLEU scores of a two-system co-decoding as a function of re-decoding iterations.
    Page 6, “Experiments”
  4. The results show that member models help each other: although improvements can be made using a single member model, the best BLEU scores are only achieved when both member models are used, as shown by the results of iteration 2.
    Page 7, “Experiments”


translation systems

Appears in 4 sentences as: translation systems (4)
  1. Recent research has shown that substantial improvements can be achieved by utilizing consensus statistics obtained from the outputs of multiple machine translation systems.
    Page 1, “Introduction”
  2. Typically, the resulting systems take outputs of individual machine translation systems as
    Page 1, “Introduction”
  3. A common property of all the work mentioned above is that the combination models work on the basis of n-best translation lists (full hypotheses) of existing machine translation systems.
    Page 1, “Introduction”
  4. In this paper, we present collaborative decoding (or co-decoding), a new SMT decoding scheme to leverage consensus information between multiple machine translation systems.
    Page 1, “Introduction”


log-linear model

Appears in 3 sentences as: log-linear model (2) log-linear models (1)
  1. The requirement for a log-linear model aims to provide a natural way to integrate the new co-decoding features.
    Page 2, “Collaborative Decoding”
  2. Referring to the log-linear model formulation, the translation posterior P(e'|dk) can be computed as:
    Page 4, “Collaborative Decoding”
  3. In this paper, we present a framework of collaborative decoding, in which multiple MT decoders are coordinated to search for better translations by re-ranking partial hypotheses using augmented log-linear models with translation consensus-based features.
    Page 8, “Conclusion”


beam size

Appears in 3 sentences as: beam size (3) beam sizes (1)
  1. By default, a beam size of 20 is used for all decoders in the experiments.
    Page 5, “Experiments”
  2. For partial hypothesis re-ranking, obtaining more top-ranked results requires increasing the beam size, which is not affordable for large beam sizes in experiments.
    Page 6, “Experiments”
  3. We work around this issue by approximating beam sizes larger than 20 by only enlarging the beam size for the span covering the entire source sentence.
    Page 6, “Experiments”


feature weight

Appears in 3 sentences as: feature weight (3)
  1. Let λ_m be the feature weight vector for member decoder d_m; the training procedure proceeds as follows:
    Page 4, “Collaborative Decoding”
  2. For each decoder d_m, find a new feature weight vector λ′_m which optimizes the specified evaluation criterion L on D using the MERT algorithm, based on the n-best list generated by d_m:
    Page 4, “Collaborative Decoding”
  3. where T denotes the translations selected by re-ranking the translations in the n-best list using the new feature weight vector λ′_m.
    Page 4, “Collaborative Decoding”
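
The training loop quoted above re-estimates each decoder's weight vector with MERT over its own n-best list. As a toy stand-in (not Och's actual line-search MERT), a one-dimensional grid search conveys the idea: pick the weight whose re-ranked 1-best maximizes the evaluation criterion L. All names and data shapes here are illustrative.

```python
def rerank_best(nbest, weight):
    """1-best hypothesis under a single-feature linear model (toy case)."""
    return max(nbest, key=lambda h: weight * h["feature"])

def grid_search_weight(nbest, criterion, grid):
    """Toy stand-in for MERT: choose the weight in `grid` whose
    re-ranked 1-best scores highest under the evaluation criterion."""
    return max(grid, key=lambda w: criterion(rerank_best(nbest, w)))
```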


phrase-based

Appears in 3 sentences as: phrase-based (3)
  1. Similar to a typical phrase-based decoder (Koehn, 2004), we associate each hypothesis with a coverage vector C to track translated source words in it.
    Page 3, “Collaborative Decoding”
  2. But to be a general framework, this step is necessary for some state-of-the-art phrase-based decoders (Koehn, 2007; Och and Ney, 2004) because in these decoders, hypotheses with different coverage vectors can coexist in the same bin, or hypotheses associated with the same coverage vector might appear in different bins.
    Page 3, “Collaborative Decoding”
  3. The first one (SYS1) is a re-implementation of Hiero, a hierarchical phrase-based decoder.
    Page 5, “Experiments”


significant improvements

Appears in 3 sentences as: significant improvements (3)
  1. Experimental results on data sets for NIST Chinese-to-English machine translation task show that the co-decoding method can bring significant improvements to all baseline decoders, and the outputs from co-decoding can be used to further improve the result of system combination.
    Page 1, “Abstract”
  2. We will present experimental results on the data sets of NIST Chinese-to-English machine translation task, and demonstrate that co-decoding can bring significant improvements to baseline systems.
    Page 2, “Introduction”
  3. However, we did not observe any significant improvements for either combination scheme when the n-best size is larger than 20.
    Page 6, “Experiments”


TER

Appears in 3 sentences as: TER (3)
  1. In fact, we find that the TER score between the two member decoders’ outputs is significantly reduced (as shown in Table 3), which indicates that the outputs become more similar due to the use of consensus information.
    Page 6, “Experiments”
  2. For example, the TER score between SYS2 and SYS3 on the NIST 2008 outputs is reduced from 0.4238 to 0.2665.
    Page 6, “Experiments”
  3. Table 3: TER scores between co-decoding translation outputs
    Page 6, “Experiments”
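
TER counts word edits (including block shifts) divided by reference length; a shift-free approximation using only Levenshtein word edits gives the flavor of the similarity numbers quoted above. This sketch is our simplification, not the actual TER implementation:

```python
def word_edit_rate(hyp, ref):
    """Levenshtein word edits (insert/delete/substitute) divided by
    reference length; true TER additionally allows block shifts."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[m][n] / len(ref)
```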


weight vector

Appears in 3 sentences as: weight vector (3)
  1. Let λ_m be the feature weight vector for member decoder d_m; the training procedure proceeds as follows:
    Page 4, “Collaborative Decoding”
  2. For each decoder d_m, find a new feature weight vector λ′_m which optimizes the specified evaluation criterion L on D using the MERT algorithm, based on the n-best list generated by d_m:
    Page 4, “Collaborative Decoding”
  3. where T denotes the translations selected by re-ranking the translations in the n-best list using the new feature weight vector λ′_m.
    Page 4, “Collaborative Decoding”
