Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
He, Xiaodong and Deng, Li

Article Structure

Abstract

This paper proposes a new discriminative training method in constructing phrase and lexicon translation models.

Topics

BLEU

Appears in 44 sentences as: BLEU (52)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset.
    Page 1, “Abstract”
  2. The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system.
    Page 1, “Abstract”
  3. parameters in the phrase and lexicon translation models are estimated by relative frequency or maximizing joint likelihood, which may not correspond closely to the translation measure, e.g., bilingual evaluation understudy (BLEU) (Papineni et al., 2002).
    Page 1, “Abstract”
  4. The training objective is an expected BLEU score, which is closely linked to translation quality.
    Page 1, “Abstract”
  5. Experiments on the Europarl German-to-English dataset show that the proposed method leads to a 1.1 BLEU point improvement over a strong baseline.
    Page 2, “Abstract”
  6. bold updating), the author proposed a local updating strategy where the model parameters are updated towards a pseudo-reference (i.e., the hypothesis in the n-best list that gives the best BLEU score).
    Page 2, “Abstract”
  7. Experimental results showed that their approach outperformed a baseline by 0.8 BLEU point when using monotonic decoding, but there was no
    Page 2, “Abstract”
  8. In our work, we use the expectation of BLEU scores as the objective.
    Page 2, “Abstract”
  9. (2011) have proposed using a differentiable expected BLEU score as the objective to train system combination parameters.
    Page 2, “Abstract”
  10. Other work related to the computation of expected BLEU in common with ours includes minimum Bayes risk approaches (Smith and Eisner 2006, Tromble et al., 2008) and lattice-based MERT (Macherey et al., 2008).
    Page 2, “Abstract”
  11. U(θ) is proportional (with a factor of N) to the expected sentence BLEU score over the entire training set, i.e., after some algebra,
    Page 4, “Abstract”
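
Items 1, 4, and 11 above all refer to the expected-BLEU utility that drives training. As a minimal sketch in assumed notation (F_n is the n-th source sentence, nbest(F_n) its N-best list, E_n^r its reference, sBLEU a sentence-level BLEU, and P_θ the model posterior; this is the generic form of such a utility, not necessarily the paper's exact equation):

    U(\theta) = \sum_{n=1}^{N} \sum_{E \in \mathrm{nbest}(F_n)} P_\theta(E \mid F_n)\, \mathrm{sBLEU}(E, E_n^{r})

Dividing by N gives the average expected sentence BLEU over the training set, which is the "proportional with a factor of N" statement of item 11.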


translation models

Appears in 25 sentences as: Translation model (1) translation model (8) translation model: (1) translation models (17)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. This paper proposes a new discriminative training method in constructing phrase and lexicon translation models.
    Page 1, “Abstract”
  2. parameters in the phrase and lexicon translation models are estimated by relative frequency or maximizing joint likelihood, which may not correspond closely to the translation measure, e.g., bilingual evaluation understudy (BLEU) (Papineni et al., 2002).
    Page 1, “Abstract”
  3. However, the number of parameters in common phrase and lexicon translation models is much larger.
    Page 1, “Abstract”
  4. In this work, we present a new, highly effective discriminative learning method for phrase and lexicon translation models.
    Page 1, “Abstract”
  5. (2006) proposed a large set of lexical and Part-of-Speech features in addition to the phrase translation model.
    Page 2, “Abstract”
  6. In these earlier works, however, the phrase and lexicon translation models used remained unchanged.
    Page 2, “Abstract”
  7. (2010) proposed a method to train the phrase translation model using the Expectation-Maximization algorithm with a leave-one-out strategy.
    Page 2, “Abstract”
  8. Features used in a phrase-based system usually include LM, reordering model, word and phrase counts, and phrase and lexicon translation models.
    Page 3, “Abstract”
  9. Given the focus of this paper, we review only the phrase and lexicon translation models below.
    Page 3, “Abstract”
  10. Phrase translation model
    Page 3, “Abstract”
  11. The target-to-source (backward) phrase translation model is defined similarly.
    Page 3, “Abstract”
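
Items 10-11 above name the forward and backward phrase translation models; as noted under "translation probabilities" below, these are estimated as relative frequencies of extracted phrase pairs. A minimal Python sketch of that estimation (the data format and the absence of smoothing are assumptions, not the paper's implementation):

    from collections import Counter, defaultdict

    def relative_frequency_phrase_table(phrase_pairs):
        """Estimate forward p(e | f) and backward p(f | e) phrase translation
        probabilities as relative frequencies of extracted phrase pairs."""
        pair_counts = Counter(phrase_pairs)               # (f_phrase, e_phrase) -> count
        f_counts = Counter(f for f, _ in phrase_pairs)    # source phrase counts
        e_counts = Counter(e for _, e in phrase_pairs)    # target phrase counts
        forward, backward = defaultdict(dict), defaultdict(dict)
        for (f, e), c in pair_counts.items():
            forward[f][e] = c / f_counts[f]
            backward[e][f] = c / e_counts[e]
        return forward, backward

    # Toy usage with hypothetical phrase pairs.
    pairs = [("das haus", "the house"), ("das haus", "the house"), ("das haus", "the home")]
    fwd, bwd = relative_frequency_phrase_table(pairs)
    print(fwd["das haus"])   # {'the house': 0.666..., 'the home': 0.333...}

The backward table is the same construction with the source and target roles swapped, matching item 11.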


BLEU score

Appears in 19 sentences as: BLEU score (14) BLEU scores (7)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. The training objective is an expected BLEU score, which is closely linked to translation quality.
    Page 1, “Abstract”
  2. bold updating), the author proposed a local updating strategy where the model parameters are updated towards a pseudo-reference (i.e., the hypothesis in the n-best list that gives the best BLEU score).
    Page 2, “Abstract”
  3. In our work, we use the expectation of BLEU scores as the objective.
    Page 2, “Abstract”
  4. (2011) have proposed using a differentiable expected BLEU score as the objective to train system combination parameters.
    Page 2, “Abstract”
  5. U(θ) is proportional (with a factor of N) to the expected sentence BLEU score over the entire training set, i.e., after some algebra,
    Page 4, “Abstract”
  6. U_n(θ') takes a form similar to (6), but is the expected BLEU score for sentence n using models from the previous iteration.
    Page 5, “Abstract”
  7. This baseline achieves a BLEU score of 26.22% on the test set.
    Page 6, “Abstract”
  8. Table 2 reports the BLEU scores and gains over the baseline given different values of the regularization weight.
    Page 6, “Abstract”
  9. BLEU scores are reported on the validation set.
    Page 6, “Abstract”
  10. BLEU) score of N-best lists and the corpus-level BLEU score of 1-best translations.
    Page 7, “Abstract”
  11. From Fig. 2 it is clear that the expected BLEU score correlates strongly with the real BLEU score, justifying its use as our training objective.
    Page 7, “Abstract”
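
Items 10-11 above compare the expected sentence-level BLEU of N-best lists with the corpus-level BLEU of 1-best translations. A minimal Python sketch of the expected-BLEU side of that comparison; the softmax posterior over model scores and the scaling factor gamma are assumptions, and sentence_bleu is a hypothetical helper (one possible implementation appears under "sentence-level" below):

    import math

    def nbest_posteriors(scores, gamma=1.0):
        """Turn model scores of an N-best list into posteriors P(E | F)
        via a scaled softmax (the scaling gamma is an assumption)."""
        m = max(scores)
        exps = [math.exp(gamma * (s - m)) for s in scores]
        z = sum(exps)
        return [x / z for x in exps]

    def expected_sentence_bleu(nbest, reference, sentence_bleu, gamma=1.0):
        """nbest: list of (hypothesis, model_score) pairs for one source sentence.
        Returns sum_E P(E | F) * sBLEU(E, reference)."""
        hyps, scores = zip(*nbest)
        posts = nbest_posteriors(list(scores), gamma)
        return sum(p * sentence_bleu(h, reference) for p, h in zip(posts, hyps))

As gamma grows, the posterior concentrates on the 1-best hypothesis, so the expectation approaches that hypothesis's sentence BLEU; item 11 reports that, aggregated over the corpus, the expected score tracks the corpus-level BLEU of the 1-best translations.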


phrase table

Appears in 11 sentences as: phrase table (12)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. Another line of research that is closely related to our work is phrase table refinement and pruning.
    Page 2, “Abstract”
  2. The parallel sentences were forced to be aligned at the phrase level using the phrase table and other features as in a decoding process.
    Page 2, “Abstract”
  3. To prevent overfitting, the statistics of phrase pairs from a particular sentence were excluded from the phrase table when aligning that sentence.
    Page 2, “Abstract”
  4. To build the baseline phrase-based SMT system, we first perform word alignment on the training set using a hidden Markov model with lexicalized distortion (He 2007), then extract the phrase table from the word aligned bilingual texts (Koehn et al., 2003).
    Page 6, “Abstract”
  5. In our system, a primary phrase table is trained from the 110K TED parallel training data, and a 3-gram LM is trained on the English side of the parallel data.
    Page 8, “Abstract”
  6. From them, we train a secondary 5-gram LM on 115M sentences of supplementary English data, and a secondary phrase table from 500K sentences selected from the supplementary UN corpus by the method proposed by Axelrod et al.
    Page 8, “Abstract”
  7. We only train the parameters of the primary phrase table.
    Page 8, “Abstract”
  8. The secondary phrase table and LM are excluded from the training process since the out-of-domain phrase table is less relevant to the TED translation task, and the large LM slows down the N-best generation process significantly.
    Page 8, “Abstract”
  9. At the end, we perform one final MERT to tune the relative weights with all features including the secondary phrase table and LM.
    Page 8, “Abstract”
  10. The baseline is a phrase-based system with all features including the secondary phrase table and LM.
    Page 8, “Abstract”
  11. The new system uses the same features except that the primary phrase table is discriminatively trained.
    Page 8, “Abstract”
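
Items 2-3 above describe force-aligning each training sentence at the phrase level while excluding that sentence's own phrase-pair statistics from the phrase table, to prevent overfitting. A minimal Python sketch of the leave-one-out count adjustment; it illustrates the idea only, not the cited work's exact procedure (the floor value for pairs left with zero count is an assumption):

    from collections import Counter

    def leave_one_out_scores(global_pair_counts, global_src_counts, sent_pairs):
        """global_pair_counts: Counter of (f, e) phrase-pair counts over the corpus.
        global_src_counts: Counter of source-phrase counts over the corpus.
        sent_pairs: phrase pairs extracted from the sentence being aligned.
        Returns leave-one-out relative frequencies p(e | f) with that sentence's
        own contribution subtracted."""
        local_pairs = Counter(sent_pairs)
        local_src = Counter(f for f, _ in sent_pairs)
        scores = {}
        for (f, e) in set(sent_pairs):
            num = global_pair_counts[(f, e)] - local_pairs[(f, e)]
            den = global_src_counts[f] - local_src[f]
            scores[(f, e)] = num / den if num > 0 and den > 0 else 1e-7  # floor for singletons
        return scores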


phrase-based

Appears in 8 sentences as: Phrase-based (1) phrase-based (7)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. Our work is based on a phrase-based SMT system.
    Page 2, “Abstract”
  2. Phrase-based Translation System
    Page 2, “Abstract”
  3. The translation process of phrase-based SMT can be briefly described in three steps: segment the source sentence into a sequence of phrases, translate each
    Page 2, “Abstract”
  4. Features used in a phrase-based system usually include LM, reordering model, word and phrase counts, and phrase and lexicon translation models.
    Page 3, “Abstract”
  5. In a phrase-based SMT system, the total number of parameters of phrase and lexicon translation models, which we aim to learn discriminatively, is very large (see Table 1).
    Page 4, “Abstract”
  6. To build the baseline phrase-based SMT system, we first perform word alignment on the training set using a hidden Markov model with lexicalized distortion (He 2007), then extract the phrase table from the word aligned bilingual texts (Koehn et al., 2003).
    Page 6, “Abstract”
  7. A fast beam-search phrase-based decoder (Moore and Quirk 2007) is used and the distortion limit is set to four.
    Page 6, “Abstract”
  8. The baseline is a phrase-based system with all features including the secondary phrase table and LM.
    Page 8, “Abstract”


BLEU point

Appears in 7 sentences as: BLEU point (5) BLEU points (2)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system.
    Page 1, “Abstract”
  2. Experiments on the Europarl German-to-English dataset show that the proposed method leads to a 1.1 BLEU point improvement over a strong baseline.
    Page 2, “Abstract”
  3. Experimental results showed that their approach outperformed a baseline by 0.8 BLEU point when using monotonic decoding, but there was no
    Page 2, “Abstract”
  4. While a regularization weight of 5×10^-5 gives the best score on the validation set, the gain is shown to be substantially reduced to merely 0.2 BLEU point when the weight is set to 0, i.e., no regularization.
    Page 6, “Abstract”
  5. Compared with the baseline, training phrase or lexicon models alone gives a gain of 0.7 and 0.5 BLEU points, respectively, on the test set.
    Page 7, “Abstract”
  6. The two-stage training of both models gives the best result of 27.33%, outperforming the baseline by 1.1 BLEU points.
    Page 7, “Abstract”
  7. The results in Table 4 show that the proposed method leads to an improvement of 1.2 BLEU points over the baseline.
    Page 8, “Abstract”


translation probabilities

Appears in 7 sentences as: translation probabilities (8)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. For training, we derive growth transformations for phrase and lexicon translation probabilities to iteratively improve the objective.
    Page 1, “Abstract”
  2. For effective optimization, we derive updating formulas of growth transformation (GT) for phrase and lexicon translation probabilities.
    Page 1, “Abstract”
  3. Then the phrase translation probabilities were estimated based on the phrase alignments.
    Page 2, “Abstract”
  4. Phrase translation probabilities are then computed as relative frequencies of phrases over the training dataset.
    Page 3, “Abstract”
  5. Objective function: We denote by θ the set of all the parameters to be optimized, including forward phrase and lexicon translation probabilities and their backward counterparts.
    Page 3, “Abstract”
  6. Next, we study the effects of training the phrase translation probabilities and the lexicon translation probabilities according to the GT formulas presented in the preceding section.
    Page 7, “Abstract”
  7. The phrase translation probabilities (PT) are trained alone in the first stage, shown in blue color.
    Page 7, “Abstract”
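
Items 1-2 and 6 above refer to growth-transformation (GT) updates of the phrase and lexicon translation probabilities. A schematic Python sketch of the general GT update pattern (multiply each probability by its partial derivative plus a constant, then renormalize); the paper's actual formulas contain further terms and are not reproduced here:

    def growth_transform_update(theta, grad, C):
        """One GT-style reestimation step for a conditional table
        theta[f][e] = p(e | f).  grad[f][e] approximates dU/d theta[f][e];
        C is a constant large enough to keep all numerators positive.
        Schematic only, not the paper's exact formula."""
        new_theta = {}
        for f, row in theta.items():
            nums = {e: p * (grad[f][e] + C) for e, p in row.items()}
            z = sum(nums.values())
            new_theta[f] = {e: v / z for e, v in nums.items()}
        return new_theta

With C = 0 this reduces to the Baum-Eagon transformation quoted under "iteratively" below; a positive C keeps the update well defined when some derivatives are negative.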


LM

Appears in 7 sentences as: LM (8)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. Features used in a phrase-based system usually include LM, reordering model, word and phrase counts, and phrase and lexicon translation models.
    Page 3, “Abstract”
  2. Other models used in the baseline system include lexicalized ordering model, word count and phrase count, and a 3-gram LM trained on the English side of the parallel training corpus.
    Page 6, “Abstract”
  3. In our system, a primary phrase table is trained from the 110K TED parallel training data, and a 3-gram LM is trained on the English side of the parallel data.
    Page 8, “Abstract”
  4. From them, we train a secondary 5-gram LM on 115M sentences of supplementary English data, and a secondary phrase table from 500K sentences selected from the supplementary UN corpus by the method proposed by Axelrod et al.
    Page 8, “Abstract”
  5. The secondary phrase table and LM are excluded from the training process since the out-of-domain phrase table is less relevant to the TED translation task, and the large LM slows down the N-best generation process significantly.
    Page 8, “Abstract”
  6. At the end, we perform one final MERT to tune the relative weights with all features including the secondary phrase table and LM.
    Page 8, “Abstract”
  7. The baseline is a phrase-based system with all features including the secondary phrase table and LM.
    Page 8, “Abstract”


objective function

Appears in 6 sentences as: Objective function (1) objective function (5)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. Objective function: We denote by θ the set of all the parameters to be optimized, including forward phrase and lexicon translation probabilities and their backward counterparts.
    Page 3, “Abstract”
  2. Therefore, we design the objective function to be maximized as:
    Page 4, “Abstract”
  3. First, we propose a new objective function (Eq.
    Page 8, “Abstract”
  4. The objective function consists of 1) the utility function of expected BLEU score, and 2) the regularization term taking the form of KL divergence in the parameter space.
    Page 8, “Abstract”
  5. Second, through nontrivial derivation, we show that the novel objective function of Eq.
    Page 8, “Abstract”
  6. Third, the new objective function and new optimization technique are successfully applied to two important machine translation tasks, with implementation issues resolved (e.g., training schedule and hyper-parameter tuning, etc.).
    Page 8, “Abstract”
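
Item 4 above states that the objective combines the expected-BLEU utility with a KL-divergence regularizer in the parameter space. One plausible way to write such an objective, as a sketch only (the weight ρ, the direction of the KL divergence, and regularization toward the baseline parameters θ₀ are assumptions; the paper gives the exact form):

    O(\theta) = U(\theta) - \rho \sum_{f} D_{\mathrm{KL}}\!\left( p_{\theta_0}(\cdot \mid f) \,\|\, p_{\theta}(\cdot \mid f) \right)

Here U(θ) is the expected-BLEU utility discussed under "BLEU" above, and the sum runs over conditioning events such as source phrases; the regularizer pulls the trained distributions back toward the baseline estimates.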


feature weights

Appears in 6 sentences as: Feature weights (2) feature weights (4)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. Och (2003) proposed using a log-linear model to incorporate multiple features for translation, and proposed a minimum error rate training (MERT) method to train the feature weights to optimize a desirable translation metric.
    Page 1, “Abstract”
  2. (2009) improved a syntactic SMT system by adding as many as ten thousand syntactic features, and used Margin Infused Relaxed Algorithm (MIRA) to train the feature weights.
    Page 1, “Abstract”
  3. The feature weights are trained on a tuning set with 2010 sentences using MIRA.
    Page 2, “Abstract”
  4. Feature weights λ = {λ_m} are usually tuned by MERT.
    Page 3, “Abstract”
  5. The parameter set θ is optimized on the training set while the feature weights λ are tuned on a small tuning set.
    Page 6, “Abstract”
  6. Feature weights are tuned by MERT.
    Page 6, “Abstract”
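
Items 1-4 above concern the log-linear model whose feature weights λ_m are tuned by MERT or MIRA. A minimal Python sketch of log-linear scoring and 1-best selection over an N-best list; the feature names and weight values are illustrative placeholders, not tuned values:

    def loglinear_score(features, weights):
        """score(E, F) = sum_m lambda_m * h_m(E, F) for one hypothesis.
        `features` maps feature names to values h_m; `weights` maps the
        same names to lambda_m."""
        return sum(weights[name] * value for name, value in features.items())

    # Toy usage: pick the highest-scoring hypothesis from an N-best list.
    weights = {"lm": 0.5, "phrase_fwd": 0.3, "word_count": -0.1}   # placeholder weights
    nbest = [
        ("the house is small", {"lm": -4.1, "phrase_fwd": -2.0, "word_count": 4}),
        ("the home is small",  {"lm": -4.9, "phrase_fwd": -1.8, "word_count": 4}),
    ]
    best = max(nbest, key=lambda h: loglinear_score(h[1], weights))
    print(best[0])

MERT then searches for the weights that maximize a translation metric (typically corpus BLEU of the resulting 1-best translations) on the tuning set.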


translation task

Appears in 5 sentences as: translation task (4) translation tasks (1)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. Our experimental results on this open-domain spoken language translation task show that the proposed method leads to significant translation performance improvement over a state-of-the-art baseline, and the system using the proposed method achieved the best single system translation result in the Chinese-to-English MT track.
    Page 2, “Abstract”
  2. In the Chinese-to-English translation task, we are provided with human-translated Chinese text with punctuation inserted.
    Page 8, “Abstract”
  3. This is an open-domain spoken language translation task.
    Page 8, “Abstract”
  4. The secondary phrase table and LM are excluded from the training process since the out-of-domain phrase table is less relevant to the TED translation task, and the large LM slows down the N-best generation process significantly.
    Page 8, “Abstract”
  5. Third, the new objective function and new optimization technique are successfully applied to two important machine translation tasks, with implementation issues resolved (e.g., training schedule and hyper-parameter tuning, etc.).
    Page 8, “Abstract”


baseline system

Appears in 4 sentences as: baseline system (4)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. Build the baseline system, estimate {θ, λ}.
    Page 6, “Abstract”
  2. the baseline system, compute BLEU(E_n, E_n^r).
    Page 6, “Abstract”
  3. Other models used in the baseline system include lexicalized ordering model, word count and phrase count, and a 3-gram LM trained on the English side of the parallel training corpus.
    Page 6, “Abstract”
  4. This baseline system is also used to generate a 100-best list of the training corpus during maximum expected BLEU training.
    Page 6, “Abstract”


SMT system

Appears in 4 sentences as: SMT system (4)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. (2009) improved a syntactic SMT system by adding as many as ten thousand syntactic features, and used Margin Infused Relaxed Algorithm (MIRA) to train the feature weights.
    Page 1, “Abstract”
  2. Our work is based on a phrase-based SMT system.
    Page 2, “Abstract”
  3. In a phrase-based SMT system, the total number of parameters of phrase and lexicon translation models, which we aim to learn discriminatively, is very large (see Table 1).
    Page 4, “Abstract”
  4. To build the baseline phrase-based SMT system, we first perform word alignment on the training set using a hidden Markov model with lexicalized distortion (He 2007), then extract the phrase table from the word aligned bilingual texts (Koehn et al., 2003).
    Page 6, “Abstract”


machine translation

Appears in 3 sentences as: machine translation (3)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. Discriminative training is an active area in statistical machine translation (SMT) (e.g., Och et al., 2002, 2003, Liang et al., 2006, Blunsom et al., 2008, Chiang et al., 2009, Foster et al., 2010, Xiao et al.
    Page 1, “Abstract”
  2. 5.3 Experiments on the IWSLT2011 benchmark: As the second evaluation task, we apply our new method described in this paper to the 2011 IWSLT Chinese-to-English machine translation benchmark (Federico et al., 2011).
    Page 8, “Abstract”
  3. Third, the new objective function and new optimization technique are successfully applied to two important machine translation tasks, with implementation issues resolved (e.g., training schedule and hyper-parameter tuning, etc.).
    Page 8, “Abstract”


phrase pairs

Appears in 3 sentences as: phrase pair (1) phrase pairs (2)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. To prevent overfitting, the statistics of phrase pairs from a particular sentence were excluded from the phrase table when aligning that sentence.
    Page 2, “Abstract”
  2. A set of phrase pairs is extracted from the word-aligned parallel corpus according to phrase extraction rules (Koehn et al., 2003).
    Page 3, “Abstract”
  3. We use the word translation table from IBM Model 1 (Brown et al., 1993) and compute the sum over all possible word alignments within a phrase pair without normalizing for length (Quirk et al., 2005).
    Page 3, “Abstract”
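
Item 3 above scores a phrase pair with an IBM Model 1 word translation table by summing over all word alignments inside the pair, without length normalization. A minimal Python sketch of that computation; the inclusion of a NULL source word and the floor for unseen word pairs are assumptions:

    def model1_phrase_score(e_words, f_words, t, use_null=True):
        """Lexicon score of a phrase pair: for each target word, sum the word
        translation probabilities over all source words (Model 1 style), and
        multiply across target words, with no length normalization."""
        sources = (["NULL"] if use_null else []) + list(f_words)
        score = 1.0
        for e in e_words:
            score *= sum(t.get((e, f), 1e-9) for f in sources)  # floor for unseen pairs
        return score

    # Toy usage with a hypothetical word translation table t[(e, f)] = p(e | f).
    t = {("house", "haus"): 0.8, ("the", "das"): 0.7, ("the", "NULL"): 0.1}
    print(model1_phrase_score(["the", "house"], ["das", "haus"], t))

Dropping the usual Model 1 length factor means the score is not penalized for the number of source words, which is the "without normalizing for length" choice in item 3.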


log-linear model

Appears in 3 sentences as: log-linear model (3)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. Och (2003) proposed using a log-linear model to incorporate multiple features for translation, and proposed a minimum error rate training (MERT) method to train the feature weights to optimize a desirable translation metric.
    Page 1, “Abstract”
  2. While the log-linear model itself is discriminative, the phrase and lexicon translation features, which are among the most important components of SMT, are derived from either generative models or heuristics (Koehn et al., 2003, Brown et al., 1993).
    Page 1, “Abstract”
  3. In that work, multiple features, most of which are derived from generative models, are incorporated into a log-linear model, and their relative weights are tuned discriminatively on a small tuning set.
    Page 2, “Abstract”


log-linear

Appears in 3 sentences as: log-linear (3)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. Och (2003) proposed using a log-linear model to incorporate multiple features for translation, and proposed a minimum error rate training (MERT) method to train the feature weights to optimize a desirable translation metric.
    Page 1, “Abstract”
  2. While the log-linear model itself is discriminative, the phrase and lexicon translation features, which are among the most important components of SMT, are derived from either generative models or heuristics (Koehn et al., 2003, Brown et al., 1993).
    Page 1, “Abstract”
  3. In that work, multiple features, most of which are derived from generative models, are incorporated into a log-linear model, and their relative weights are tuned discriminatively on a small tuning set.
    Page 2, “Abstract”


sentence-level

Appears in 3 sentences as: sentence-level (3)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. In the updating formula, we need to compute the sentence-level BLEU(E_n, E_n^r).
    Page 5, “Abstract”
  2. a non-clipped BP, BP = e^(1 - r/c), for sentence-level BLEU.
    Page 5, “Abstract”
  3. sentence-level BLEU (Exp.
    Page 7, “Abstract”
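
Items 1-2 above call for a sentence-level BLEU with a non-clipped brevity penalty BP = e^(1 - r/c). A minimal Python sketch of such a metric; the smoothing constant for zero n-gram counts is an assumption, and the exact sentence-BLEU definition used in the paper may differ in detail:

    import math
    from collections import Counter

    def sentence_bleu_nonclipped(hyp, ref, max_n=4, smooth=1e-9):
        """Sentence-level BLEU over token lists, with a brevity penalty
        exp(1 - r/c) that is NOT clipped at 1 (it can exceed 1 when the
        hypothesis is longer than the reference)."""
        precisions = []
        for n in range(1, max_n + 1):
            hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
            ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            total = max(sum(hyp_ngrams.values()), 1)
            precisions.append(max(overlap, smooth) / total)   # smooth zero counts
        bp = math.exp(1.0 - len(ref) / max(len(hyp), 1))       # non-clipped BP
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

    print(sentence_bleu_nonclipped("the house is small".split(),
                                   "the house is very small".split()))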


iteratively

Appears in 3 sentences as: iteratively (3)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. For training, we derive growth transformations for phrase and lexicon translation probabilities to iteratively improve the objective.
    Page 1, “Abstract”
  2. In this section, we derived GT formulas for iteratively updating the parameters so as to optimize objective (9).
    Page 4, “Abstract”
  3. The Baum-Eagon inequality (Baum and Eagon, 1967) gives the GT formula to iteratively maximize positive-coefficient polynomials of random variables.
    Page 4, “Abstract”
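
Item 3 above invokes the Baum-Eagon inequality as the basis of the GT formulas. In the form it is usually quoted: for a homogeneous polynomial P with nonnegative coefficients in variables x_{ij} constrained by sum_j x_{ij} = 1, the transformation

    \hat{x}_{ij} = \frac{ x_{ij}\, \partial P / \partial x_{ij} }{ \sum_{j'} x_{ij'}\, \partial P / \partial x_{ij'} }

satisfies P(x̂) ≥ P(x), so repeating it increases the objective monotonically. Items 1-2 describe GT formulas of this type derived for the paper's objective (9).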


translation quality

Appears in 3 sentences as: translation quality (3)
In Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
  1. Therefore, it is desirable to train all these parameters to directly maximize an objective that is closely linked to translation quality.
    Page 1, “Abstract”
  2. The training objective is an expected BLEU score, which is closely linked to translation quality.
    Page 1, “Abstract”
  3. The expected BLEU score is closely linked to translation quality and the regularization is essential when many parameters are trained at scale.
    Page 8, “Abstract”
