Boosting-Based System Combination for Machine Translation
Xiao, Tong and Zhu, Jingbo and Zhu, Muhua and Wang, Huizhen

Article Structure

Abstract

In this paper, we present a simple and effective method to address the issue of how to generate diversified translation systems from a single Statistical Machine Translation (SMT) engine for system combination.

Introduction

Recent research on Statistical Machine Translation (SMT) has achieved substantial progress.

Background

Given a source string f, the goal of SMT is to find a target string e* by the following equation.

Topics

BLEU

Appears in 25 sentences as: BLEU (30)
In Boosting-Based System Combination for Machine Translation
  1. As in other state-of-the-art SMT systems, BLEU is selected as the accuracy measure to define the error function used in MERT.
    Page 3, “Background”
  2. Since the weights of training samples are not taken into account in BLEU, we modify the original definition of BLEU to make it sensitive to the distribution Dt(i) over the training samples.
    Page 3, “Background”
  3. The modified version of BLEU is called weighted BLEU (WBLEU) in this paper.
    Page 3, “Background”
  4. The weighted BLEU metric has the following form:
    Page 3, “Background”
  5. In this paper, we use the NIST definition of BLEU where the effective reference length is the length of the shortest reference translation.
    Page 3, “Background”
  6. Obviously, the original BLEU is just a special case of WBLEU when all the training samples are equally weighted.
    Page 3, “Background”
  7. As the weighted BLEU is used to measure the translation accuracy on the training set, the error rate is defined to be:
    Page 3, “Background”
  8. li = BLEU(ei*, ri) − (1/k) Σj=1..k BLEU(eij, ri)   (7)
    Page 3, “Background”
  9. where BLEU(eij, ri) is the smoothed sentence-level BLEU score (Liang et al., 2006) of the translation eij with respect to the reference translations ri, and ei* is the oracle translation which is selected from {ei1, ..., eik} in terms of BLEU(eij, ri).
    Page 3, “Background”
  10. The translation quality is evaluated in terms of the case-insensitive NIST version of the BLEU metric.
    Page 5, “Background”
  11. Figures 2-5 show the BLEU curves on the development and test sets, where the x-axis is the iteration number, and the y-axis is the BLEU score of the system generated by the boosting-based system combination.
    Page 5, “Background”
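
Items 2-6 above describe WBLEU: n-gram statistics are weighted by the sample distribution Dt(i), and the NIST effective reference length (the shortest reference) is used. Below is a minimal Python sketch of that idea; the exact placement of Dt(i) inside the precision terms is an assumption based on these excerpts, not a transcription of the paper's formula.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """gn(s): the multi-set of all n-grams in a token sequence s."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def wbleu(hyps, refs_list, weights, max_n=4):
    """Weighted BLEU: clipped n-gram counts are scaled by the sample
    weight Dt(i), so heavily weighted sentences contribute more to the
    corpus-level score; with uniform weights this reduces to BLEU."""
    matched = [0.0] * max_n
    total = [0.0] * max_n
    hyp_len = ref_len = 0.0
    for e, refs, w in zip(hyps, refs_list, weights):
        hyp_len += w * len(e)
        ref_len += w * min(len(r) for r in refs)  # NIST: shortest reference
        for n in range(1, max_n + 1):
            h = ngrams(e, n)
            clip = Counter()
            for r in refs:
                clip |= ngrams(r, n)  # per-n-gram max over the references
            matched[n - 1] += w * sum(min(c, clip[g]) for g, c in h.items())
            total[n - 1] += w * max(sum(h.values()), 1)
    log_p = sum(math.log(max(m, 1e-9) / t) for m, t in zip(matched, total))
    bp = min(1.0, math.exp(1.0 - ref_len / max(hyp_len, 1e-9)))  # brevity penalty
    return bp * math.exp(log_p / max_n)
```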

SMT systems

Appears in 24 sentences as: SMT system (12) SMT systems (14)
In Boosting-Based System Combination for Machine Translation
  1. To adapt boosting to SMT system combination, several key components of the original boosting algorithms are redesigned in this work.
    Page 1, “Abstract”
  2. With the emergence of various structurally different SMT systems, more and more studies are focused on combining multiple SMT systems for achieving higher translation accuracy rather than using a single translation system.
    Page 1, “Introduction”
  3. One of the key factors in SMT system combination is the diversity in the ensemble of translation outputs (Macherey and Och, 2007).
    Page 1, “Introduction”
  4. However, this requirement cannot be met in many cases, since we do not always have access to multiple SMT engines due to the high cost of developing and tuning SMT systems.
    Page 1, “Introduction”
  5. Our experiments are conducted on Chinese-to-English translation in three state-of-the-art SMT systems, including a phrase-based system, a hierarchical phrase-based system and a syntax-based system.
    Page 1, “Introduction”
  6. where Pr(e|f) is the probability that e is the translation of the given source string f. To model the posterior probability Pr(e|f), most of the state-of-the-art SMT systems utilize the log-linear model proposed by Och and Ney (2002), as follows,
    Page 2, “Background”
  7. In this paper, u denotes a log-linear model that has M fixed features {h1(f,e), ..., hM(f,e)}, λ = {λ1, ..., λM} denotes the M parameters of u, and u(λ) denotes an SMT system based on u with parameters λ.
    Page 2, “Background”
  8. Suppose that there are T available SMT systems {u1(λ1*), ..., uT(λT*)}; the task of system combination is to build a new translation system v(u1(λ1*), ..., uT(λT*)) from {u1(λ1*), ..., uT(λT*)}.
    Page 2, “Background”
  9. However, since most of the boosting algorithms are designed for the classification problem that is very different from the translation problem in natural language processing, several key components have to be redesigned when boosting is adapted to SMT system combination.
    Page 3, “Background”
  10. As in other state-of-the-art SMT systems , BLEU is selected as the accuracy measure to define the error function used in MERT.
    Page 3, “Background”
  11. ψ(e, H(v)) is a consensus-based scoring function which has been successfully adopted in SMT system combination (Duan et al., 2009; Hildebrand and Vogel, 2008; Li et al., 2009).
    Page 4, “Background”

baseline systems

Appears in 20 sentences as: baseline system (4) Baseline Systems (1) baseline systems (15)
In Boosting-Based System Combination for Machine Translation
  1. First, a sequence of weak translation systems is generated from a baseline system in an iterative manner.
    Page 1, “Abstract”
  2. We evaluate our method on Chinese-to-English Machine Translation (MT) tasks in three baseline systems , including a phrase-based system, a hierarchical phrase-based system and a syntax-based system.
    Page 1, “Abstract”
  3. The experimental results on three NIST evaluation test sets show that our method leads to significant improvements in translation accuracy over the baseline systems .
    Page 1, “Abstract”
  4. In this method, a sequence of weak translation systems is generated from a baseline system in an iterative manner.
    Page 1, “Introduction”
  5. Experimental results show that our method leads to significant improvements in translation accuracy over the baseline systems .
    Page 2, “Introduction”
  6. 5.1 Baseline Systems
    Page 5, “Background”
  7. In this work, baseline system refers to the system produced by the boosting-based system combination when the number of iterations is set to 1.
    Page 5, “Background”
  8. To obtain satisfactory baseline performance, we train each SMT system 5 times using MERT with different initial values of feature weights to generate a group of baseline candidates, and then select the best-performing one from this group as the final baseline system.
    Page 5, “Background”
  9. All the word-aligned bilingual sentence pairs are used to extract phrases and rules for the baseline systems .
    Page 5, “Background”
  10. Beam search and cube pruning (Huang and Chiang, 2007) are used to prune the search space in all the three baseline systems .
    Page 5, “Background”
  11. The points at iteration 1 stand for the performance of the baseline systems .
    Page 5, “Background”
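
Item 8 describes how the baselines are obtained: MERT is run several times from different initial feature weights, and the best-performing run becomes the final baseline. A minimal sketch of that selection loop; the `system.train_mert` and `system.bleu` interfaces and the uniform random initialization are assumptions, since the excerpts do not specify them.

```python
import random

def select_baseline(system, dev_set, n_runs=5, seed=1):
    """Run MERT n_runs times from random initial weights; keep the best
    run as the baseline and return the whole candidate group, which is
    later used to measure the baseline's diversity (Section 5.1)."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_runs):
        init = [rng.uniform(-1.0, 1.0) for _ in range(system.num_features)]
        weights = system.train_mert(init, dev_set)  # hypothetical interface
        score = system.bleu(dev_set, weights)       # hypothetical interface
        candidates.append((score, weights))
    best_score, best_weights = max(candidates, key=lambda c: c[0])
    return best_weights, candidates
```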

phrase-based

Appears in 19 sentences as: phrase-based (23)
In Boosting-Based System Combination for Machine Translation
  1. We evaluate our method on Chinese-to-English Machine Translation (MT) tasks in three baseline systems, including a phrase-based system, a hierarchical phrase-based system and a syntax-based system.
    Page 1, “Abstract”
  2. Many SMT frameworks have been developed, including phrase-based SMT (Koehn et al., 2003), hierarchical phrase-based SMT (Chiang, 2005), syntax-based SMT (Eisner, 2003; Ding and Palmer, 2005; Liu et al., 2006; Galley et al., 2006; Cowan et al., 2006), etc.
    Page 1, “Introduction”
  3. Our experiments are conducted on Chinese-to-English translation in three state-of-the-art SMT systems, including a phrase-based system, a hierarchical phrase-based system and a syntax-based system.
    Page 1, “Introduction”
  4. The first SMT system is a phrase-based system with two reordering models including the maximum entropy-based lexicalized reordering model proposed by Xiong et al.
    Page 5, “Background”
  5. The second SMT system is an in-house reim-plementation of the Hiero system which is based on the hierarchical phrase-based model proposed by Chiang (2005).
    Page 5, “Background”
  6. After 5, 7 and 8 iterations, relatively stable improvements are achieved by the phrase-based system, the Hiero system and the syntax-based system, respectively.
    Page 5, “Background”
  7. Figures 2-5 also show that the boosting-based system combination seems to be more helpful to the phrase-based system than to the Hiero system and the syntax-based system.
    Page 5, “Background”
  8. For the phrase-based system, it yields over 0.6 BLEU point gains just after the 3rd iteration on all the data sets.
    Page 5, “Background”
  9. Also as shown in Table 1, over 0.7 BLEU point gains are obtained on the phrase-based system after 10 iterations.
    Page 6, “Background”
  10. The largest BLEU improvement on the phrase-based system is over 1 BLEU point in most cases.
    Page 6, “Background”
  11. These results reflect that our method is relatively more effective for the phrase-based system than for the other two systems.
    Page 6, “Background”

translation systems

Appears in 17 sentences as: translation system (10) translation systems (11)
In Boosting-Based System Combination for Machine Translation
  1. In this paper, we present a simple and effective method to address the issue of how to generate diversified translation systems from a single Statistical Machine Translation (SMT) engine for system combination.
    Page 1, “Abstract”
  2. First, a sequence of weak translation systems is generated from a baseline system in an iterative manner.
    Page 1, “Abstract”
  3. Then, a strong translation system is built from the ensemble of these weak translation systems .
    Page 1, “Abstract”
  4. With the emergence of various structurally different SMT systems, more and more studies are focused on combining multiple SMT systems for achieving higher translation accuracy rather than using a single translation system .
    Page 1, “Introduction”
  5. To reduce the burden of system development, an appealing alternative is to combine a set of translation systems built from a single translation engine.
    Page 1, “Introduction”
  6. A key issue here is how to generate an ensemble of diversified translation systems from a single translation engine in a principled way.
    Page 1, “Introduction”
  7. Addressing this issue, we propose a boosting-based system combination method to learn a combined translation system from a single SMT engine.
    Page 1, “Introduction”
  8. In this method, a sequence of weak translation systems is generated from a baseline system in an iterative manner.
    Page 1, “Introduction”
  9. In each iteration, a new weak translation system is learned, focusing more on the sentences that are relatively poorly translated by the previous weak translation system .
    Page 1, “Introduction”
  10. Finally, a strong translation system is built from the ensemble of the weak translation systems .
    Page 1, “Introduction”
  11. Suppose that there are T available SMT systems {u1(λ1*), ..., uT(λT*)}; the task of system combination is to build a new translation system v(u1(λ1*), ..., uT(λT*)) from {u1(λ1*), ..., uT(λT*)}.
    Page 2, “Background”
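
Items 2, 8 and 9 outline the core loop: weak systems are generated iteratively from one engine, with each round concentrating on the sentences the previous system translated poorly. A minimal sketch of such a loop; the `engine.mert` and `engine.loss` interfaces (the latter standing for the per-sentence loss li of Equation 7) and the exponential weight update are assumptions in the style of AdaBoost, as the excerpts do not quote the paper's exact update rule.

```python
import math

def boosting_combination(engine, train_set, rounds=30):
    """Generate a sequence of weak systems under an evolving sample
    distribution Dt, then hand them to the combination step."""
    n = len(train_set)
    D = [1.0 / n] * n                       # D1: uniform over samples
    members = []
    for t in range(rounds):
        lam = engine.mert(train_set, D)     # weak system ut, trained on
        members.append(lam)                 # the weighted training set
        losses = [engine.loss(lam, s) for s in train_set]
        D = [d * math.exp(l) for d, l in zip(D, losses)]  # boost hard sentences
        z = sum(D)
        D = [d / z for d in D]              # renormalize to a distribution
    return members  # combined afterwards by sentence-level selection
```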

n-gram

Appears in 11 sentences as: n-gram (15)
In Boosting-Based System Combination for Machine Translation
  1. The computation of ψ(e, H(v)) is based on a linear combination of a set of n-gram consensus-based features.
    Page 4, “Background”
  2. For each order of n-gram, hn+(e, H(v)) and hn−(e, H(v)) are defined to measure the n-gram agreement and disagreement between e and the other translation candidates in H(v), respectively.
    Page 4, “Background”
  3. If p orders of n-gram are used in computing ψ(e, H(v)), the total number of features in the system combination will be T + 2×p (T model-score-based features defined in Equation 8 and 2×p consensus-based features defined in Equation 9).
    Page 4, “Background”
  4. Another method to speed up the system is to accelerate the n-gram language model with n-gram caching techniques.
    Page 4, “Background”
  5. In this method, an n-gram cache is used to store the most frequently and recently accessed n-grams.
    Page 4, “Background”
  6. When a new n-gram is accessed during decoding, the cache is checked first.
    Page 4, “Background”
  7. If the required n-gram hits the cache, the corresponding n-gram probability is returned from the cached copy rather than by re-fetching the original data in the language model.
    Page 4, “Background”
  8. As the translation speed of an SMT system depends heavily on the computation of the n-gram language model, accelerating the language model generally leads to a substantial speedup of the system.
    Page 4, “Background”
  9. In our implementation, n-gram caching generally brings over a 30% speed improvement of the system.
    Page 4, “Background”
  10. The n-gram consensus-based features (in Equation 9) used in system combination range from unigram to 4-gram.
    Page 5, “Background”
  11. For example, they used the remove-one-feature strategy and varied the order of n-gram language model to obtain a satisfactory group of diverse systems.
    Page 8, “Background”
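
Items 1-3 define, for each n-gram order, an agreement feature hn+ and a disagreement feature hn− between a candidate e and the other hypotheses in H(v). A minimal sketch of how such features could be counted; the clipped-matching scheme is an assumption, as the excerpts only state that agreement and disagreement are measured.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def consensus_features(e, hypotheses, max_n=4):
    """hn+ / hn-: n-gram agreement and disagreement between candidate e
    and the other translation candidates in H(v), for n = 1..max_n."""
    feats = {}
    others = [h for h in hypotheses if h is not e]  # exclude e itself
    for n in range(1, max_n + 1):
        agree = disagree = 0.0
        e_grams = ngrams(e, n)
        e_total = sum(e_grams.values())
        for h in others:
            h_grams = ngrams(h, n)
            common = sum(min(c, h_grams[g]) for g, c in e_grams.items())
            agree += common                 # n-grams of e found in h
            disagree += e_total - common    # n-grams of e missing from h
        feats["h%d+" % n] = agree
        feats["h%d-" % n] = disagree
    return feats
```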

BLEU scores

Appears in 9 sentences as: BLEU score (2) BLEU scores (7)
In Boosting-Based System Combination for Machine Translation
  1. where BLEU(eij, ri) is the smoothed sentence-level BLEU score (Liang et al., 2006) of the translation eij with respect to the reference translations ri, and ei* is the oracle translation which is selected from {ei1, ..., eik} in terms of BLEU(eij, ri).
    Page 3, “Background”
  2. Figures 2-5 show the BLEU curves on the development and test sets, where the x-axis is the iteration number, and the y-axis is the BLEU score of the system generated by the boosting-based system combination.
    Page 5, “Background”
  3. The BLEU scores tend to converge to stable values after 20 iterations for all the systems.
    Page 5, “Background”
  4. Table 1 summarizes the evaluation results, where the BLEU scores at iterations 5, 10, 15, 20 and 30 are reported for comparison.
    Page 5, “Background”
  5. Figure 2: BLEU scores on the development set
    Page 6, “Background”
  6. Figure 4: BLEU scores on the test set of MT05
    Page 6, “Background”
  7. It achieves significant BLEU improvements after 15 iterations, and the highest BLEU scores are generally yielded after 20 iterations.
    Page 6, “Background”
  8. Figure 3: BLEU scores on the test set of MT04
    Page 6, “Background”
  9. Figure 5: BLEU scores on the test set of MT06
    Page 6, “Background”

NIST

Appears in 7 sentences as: NIST (7)
In Boosting-Based System Combination for Machine Translation
  1. The experimental results on three NIST evaluation test sets show that our method leads to significant improvements in translation accuracy over the baseline systems.
    Page 1, “Abstract”
  2. All the systems are evaluated on three NIST MT evaluation test sets.
    Page 2, “Introduction”
  3. In this paper, we use the NIST definition of BLEU where the effective reference length is the length of the shortest reference translation.
    Page 3, “Background”
  4. The data set used for weight training in boosting-based system combination comes from the NIST MT03 evaluation set.
    Page 5, “Background”
  5. The test sets are the NIST evaluation sets of MT04, MT05 and MT06.
    Page 5, “Background”
  6. The translation quality is evaluated in terms of the case-insensitive NIST version of the BLEU metric.
    Page 5, “Background”
  7. We apply our method to three state-of-the-art SMT systems, and conduct experiments on three NIST Chinese-to-English MT evaluation test sets.
    Page 9, “Background”

TER

Appears in 4 sentences as: TER (5)
In Boosting-Based System Combination for Machine Translation
  1. The diversity is measured in terms of the Translation Error Rate (TER) metric proposed in (Snover et al., 2006).
    Page 7, “Background”
  2. A higher TER score means that more edit operations are performed if we transform one translation output into another.
    Page 7, “Background”
  3. In this work, the TER score for a given group of member systems is calculated by averaging the TER scores between the outputs of each pair of member systems in this group.
    Page 7, “Background”
  4. In this work, the baseline’s diversity is the TER score of the group of baseline candidates that are generated in advance (Section 5.1).
    Page 7, “Background”
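
Item 3 defines the diversity of a group of member systems as the average TER between the outputs of each pair of systems. A minimal sketch; the sentence-level `ter` scorer is passed in as a parameter (e.g., a wrapper around the tool of Snover et al. (2006)), since reimplementing TER with shifts is out of scope here.

```python
from itertools import combinations

def pairwise_diversity(outputs, ter):
    """Average pairwise TER over a group of member systems.

    outputs : one list of translations per member system, aligned by sentence
    ter     : sentence-level TER function ter(hypothesis, reference)
    """
    pairs = list(combinations(range(len(outputs)), 2))
    total = 0.0
    for a, b in pairs:
        # score system a's output against system b's as the pseudo-reference
        scores = [ter(h, r) for h, r in zip(outputs[a], outputs[b])]
        total += sum(scores) / len(scores)
    return total / len(pairs)
```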

language model

Appears in 6 sentences as: language model (7)
In Boosting-Based System Combination for Machine Translation
  1. Since all the member systems share the same data resources, such as the language model and the translation table, we only need to keep one copy of the required resources in memory.
    Page 4, “Background”
  2. Another method to speed up the system is to accelerate the n-gram language model with n-gram caching techniques.
    Page 4, “Background”
  3. If the required n-gram hits the cache, the corresponding n-gram probability is returned from the cached copy rather than by re-fetching the original data in the language model.
    Page 4, “Background”
  4. As the translation speed of an SMT system depends heavily on the computation of the n-gram language model, accelerating the language model generally leads to a substantial speedup of the system.
    Page 4, “Background”
  5. A 5-gram language model is trained on the target-side
    Page 5, “Background”
  6. For example, they used the remove-one-feature strategy and varied the order of n-gram language model to obtain a satisfactory group of diverse systems.
    Page 8, “Background”
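
Items 2-4 describe caching the most frequently and recently accessed n-grams so that cache hits avoid re-fetching the language model. A minimal LRU-style sketch; the eviction policy and the `lm.logprob(ngram)` interface are assumptions, not the paper's actual implementation, which reportedly yields over a 30% speedup.

```python
from collections import OrderedDict

class CachedLM:
    """Wrap an n-gram language model with an LRU cache of probabilities."""

    def __init__(self, lm, capacity=1 << 20):
        self.lm = lm
        self.cache = OrderedDict()
        self.capacity = capacity

    def logprob(self, ngram):
        if ngram in self.cache:               # cache hit: reuse stored value
            self.cache.move_to_end(ngram)
            return self.cache[ngram]
        p = self.lm.logprob(ngram)            # cache miss: query the model
        self.cache[ngram] = p
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return p
```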

feature weights

Appears in 5 sentences as: feature weight (2) feature weights (3)
In Boosting-Based System Combination for Machine Translation
  1. where {hm(f, e) | m = 1, ..., M} is a set of features, and λm is the feature weight corresponding to the m-th feature.
    Page 2, “Background”
  2. In this work, Minimum Error Rate Training (MERT) proposed by Och (2003) is used to estimate the feature weights λ over a series of training samples.
    Page 3, “Background”
  3. where ht(e) is the log-scaled model score of e in the t-th member system, and βt is the corresponding feature weight.
    Page 4, “Background”
  4. βn+ and βn− are the feature weights corresponding to hn+(e, H(v)) and hn−(e, H(v)).
    Page 4, “Background”
  5. To obtain satisfactory baseline performance, we train each SMT system 5 times using MERT with different initial values of feature weights to generate a group of baseline candidates, and then select the best-performing one from this group as the final baseline system.
    Page 5, “Background”

development set

Appears in 4 sentences as: development set (4)
In Boosting-Based System Combination for Machine Translation
  1. The data set used for weight training is generally called the development set or tuning set in the SMT field.
    Page 2, “Background”
  2. We see, first of all, that all three systems are improved over the iterations on the development set.
    Page 5, “Background”
  3. Figure 2: BLEU scores on the development set
    Page 6, “Background”
  4. Figure 6: Diversity on the development set
    Page 7, “Background”

sentence-level

Appears in 4 sentences as: sentence-level (4)
In Boosting-Based System Combination for Machine Translation
  1. Sentence-level combination (Hildebrand and Vogel, 2008) simply selects one from the original translations, while some more sophisticated methods, such as word-level and phrase-level combination (Matusov et al., 2006; Rosti et al., 2007), can generate new translations differing from any of the original translations.
    Page 1, “Introduction”
  2. where BLEU(eij, ri) is the smoothed sentence-level BLEU score (Liang et al., 2006) of the translation eij with respect to the reference translations ri, and ei* is the oracle translation which is selected from {ei1, ..., eik} in terms of BLEU(eij, ri).
    Page 3, “Background”
  3. In this work, a sentence-level combination method is used to select the best translation from the pool of the n-best outputs of all the member systems.
    Page 4, “Background”
  4. In this work, we use a sentence-level system combination method to generate final translations.
    Page 9, “Background”
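
Items 3-4 describe sentence-level combination: the best translation is selected from the pool of n-best outputs of all member systems. A minimal sketch of such a selector, combining member model scores with a consensus term in the spirit of Equations 8-9 as excerpted in this index; `model_scores`, `betas` and `psi` are hypothetical interfaces.

```python
def combine_sentence_level(nbest_lists, model_scores, betas, psi):
    """Pick the best translation from the pool H(v) of all member
    systems' n-best outputs for one source sentence."""
    pool = [e for nbest in nbest_lists for e in nbest]  # H(v)

    def score(e):
        # weighted sum of log-scaled model scores; 0.0 when a member
        # system did not produce e (a simplifying assumption)
        model = sum(b * s.get(e, 0.0) for b, s in zip(betas, model_scores))
        return model + psi(e, pool)  # plus the consensus-based part

    return max(pool, key=score)
```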

Error Rate

Appears in 3 sentences as: Error Rate (2) error rate (1)
In Boosting-Based System Combination for Machine Translation
  1. In this work, Minimum Error Rate Training (MERT) proposed by Och (2003) is used to estimate the feature weights λ over a series of training samples.
    Page 3, “Background”
  2. As the weighted BLEU is used to measure the translation accuracy on the training set, the error rate is defined to be:
    Page 3, “Background”
  3. The diversity is measured in terms of the Translation Error Rate (TER) metric proposed in (Snover et al., 2006).
    Page 7, “Background”

log-linear

Appears in 3 sentences as: log-linear (3)
In Boosting-Based System Combination for Machine Translation
  1. where Pr(e|f) is the probability that e is the translation of the given source string f. To model the posterior probability Pr(e|f), most of the state-of-the-art SMT systems utilize the log-linear model proposed by Och and Ney (2002), as follows,
    Page 2, “Background”
  2. In this paper, u denotes a log-linear model that has M fixed features {h1(f,e), ..., hM(f,e)}, λ = {λ1, ..., λM} denotes the M parameters of u, and u(λ) denotes an SMT system based on u with parameters λ.
    Page 2, “Background”
  3. In this paper, we use the term training set to emphasize the training of the log-linear model.
    Page 2, “Background”

log-linear model

Appears in 3 sentences as: log-linear model (3)
In Boosting-Based System Combination for Machine Translation
  1. where Pr(e|f) is the probability that e is the translation of the given source string f. To model the posterior probability Pr(e|f), most of the state-of-the-art SMT systems utilize the log-linear model proposed by Och and Ney (2002), as follows,
    Page 2, “Background”
  2. In this paper, u denotes a log-linear model that has M fixed features {h1(f,e), ..., hM(f,e)}, λ = {λ1, ..., λM} denotes the M parameters of u, and u(λ) denotes an SMT system based on u with parameters λ.
    Page 2, “Background”
  3. In this paper, we use the term training set to emphasize the training of the log-linear model.
    Page 2, “Background”
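
Items 1-2 give the log-linear form of Och and Ney (2002): Pr(e|f) is proportional to the exponential of the weighted feature sum. A minimal sketch that normalizes over an explicit candidate set; in a real decoder the normalizer is never computed, since only the highest-scoring candidate is needed.

```python
import math

def loglinear_prob(e, f, features, lam, candidates):
    """Pr(e|f) = exp(sum over m of lam[m] * h_m(f, e)) / Z(f), where
    features is a list of feature functions h_m(f, e) and Z(f) sums
    over the given candidate set."""
    def score(c):
        return sum(l * h(f, c) for l, h in zip(lam, features))
    z = sum(math.exp(score(c)) for c in candidates)
    return math.exp(score(e)) / z
```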

Machine Translation

Appears in 3 sentences as: Machine Translation (3)
In Boosting-Based System Combination for Machine Translation
  1. In this paper, we present a simple and effective method to address the issue of how to generate diversified translation systems from a single Statistical Machine Translation (SMT) engine for system combination.
    Page 1, “Abstract”
  2. We evaluate our method on Chinese-to-English Machine Translation (MT) tasks in three baseline systems, including a phrase-based system, a hierarchical phrase-based system and a syntax-based system.
    Page 1, “Abstract”
  3. Recent research on Statistical Machine Translation (SMT) has achieved substantial progress.
    Page 1, “Introduction”

n-grams

Appears in 3 sentences as: n-grams (4)
In Boosting-Based System Combination for Machine Translation
  1. where gn(s) is the multi-set of all n-grams in a string s. In this definition, the n-grams in ei and {rij} are weighted by Dt(i).
    Page 3, “Background”
  2. If the i-th training sample has a larger weight, the corresponding n-grams will have more contributions to the overall score WBLEU(E,R) .
    Page 3, “Background”
  3. In this method, an n-gram cache is used to store the most frequently and recently accessed n-grams.
    Page 4, “Background”

significant improvements

Appears in 3 sentences as: significant improvements (3)
In Boosting-Based System Combination for Machine Translation
  1. The experimental results on three NIST evaluation test sets show that our method leads to significant improvements in translation accuracy over the baseline systems.
    Page 1, “Abstract”
  2. Experimental results show that our method leads to significant improvements in translation accuracy over the baseline systems.
    Page 2, “Introduction”
  3. It also gives us a rational explanation for the significant improvements achieved by our method as shown in Section 5.3.
    Page 8, “Background”

BLEU point

Appears in 3 sentences as: BLEU point (3)
In Boosting-Based System Combination for Machine Translation
  1. For the phrase-based system, it yields over 0.6 BLEU point gains just after the 3rd iteration on all the data sets.
    Page 5, “Background”
  2. Also as shown in Table 1, over 0.7 BLEU point gains are obtained on the phrase-based system after 10 iterations.
    Page 6, “Background”
  3. The largest BLEU improvement on the phrase-based system is over 1 BLEU point in most cases.
    Page 6, “Background”
