A Discriminative Latent Variable Model for Statistical Machine Translation

Large-scale discriminative machine translation promises to further the state-of-the-art, but has failed to deliver convincing gains over current heuristic frequency count systems.

Statistical machine translation (SMT) has seen a resurgence in popularity in recent years, with progress being driven by a move to phrase-based and syntax-inspired approaches.

Discriminative models allow for the use of expressive features, in the order of thousands or millions, which can reference arbitrary aspects of the source sentence.

A synchronous context free grammar (SCFG) consists of paired CFG rules with co-indexed non-terminals (Lewis II and Stearns, 1968).
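
To make the idea of paired rules with co-indexed non-terminals concrete, here is a minimal Python sketch. The `SCFGRule` class, the `expand` helper, and the French/English toy rules are all hypothetical illustrations, not taken from the paper; integers stand for co-indexed non-terminals and strings for terminals.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# A symbol is either a terminal (str) or the index of a co-indexed
# non-terminal (int). This encoding is an assumption made for illustration.
Symbol = Union[str, int]
Yield = Tuple[List[str], List[str]]  # (source words, target words)


@dataclass(frozen=True)
class SCFGRule:
    lhs: str
    source: Tuple[Symbol, ...]
    target: Tuple[Symbol, ...]


def expand(rule: SCFGRule, children: List[Yield]) -> Yield:
    """Rewrite both sides of a rule, substituting each co-indexed
    non-terminal with the yield of the corresponding child."""
    def side(symbols: Tuple[Symbol, ...], which: int) -> List[str]:
        out: List[str] = []
        for sym in symbols:
            if isinstance(sym, int):
                out.extend(children[sym][which])
            else:
                out.append(sym)
        return out

    return side(rule.source, 0), side(rule.target, 1)


# Toy grammar: two lexical rules and one rule that swaps its children.
noun = expand(SCFGRule("X", ("maison",), ("house",)), [])
adjective = expand(SCFGRule("X", ("bleue",), ("blue",)), [])
swap = SCFGRule("X", (0, 1), (1, 0))

source_words, target_words = expand(swap, [noun, adjective])
```

Because the same index appears on both sides of `swap`, a single rule application fixes the source order and the reordered target order at once.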

Our model evaluation was motivated by the following questions: (1) the effect of maximising translations rather than derivations in training and decoding; (2) whether a regularised model performs better than a maximum likelihood model; (3) how the performance of our model compares with a frequency count based hierarchical system; and (4) how translation performance scales with the number of training examples.

We have shown that explicitly accounting for competing derivations yields translation improvements.

Appears in 10 sentences as: BLEU (11)

In *A Discriminative Latent Variable Model for Statistical Machine Translation*

- Although there is no direct relationship between BLEU and likelihood, it provides a rough measure for comparing performance. (Page 6, “Evaluation”)
- Footnote 6: We also experimented with using max-translation decoding for standard MER trained translation models, finding that it had a small negative impact on BLEU score. (Page 6, “Evaluation”)
- Figure 5 shows the relationship between beam width and development BLEU. (Page 6, “Evaluation”)
- To do this we use our own implementation of Hiero (Chiang, 2007), with the same grammar but with the traditional generative feature set trained in a linear model with minimum BLEU training. (Page 7, “Evaluation”)
- Table column headings: System, Test (BLEU). (Page 7, “Evaluation”)
- This is encouraging as our model was trained to optimise likelihood rather than BLEU, yet it is still competitive on that metric. (Page 7, “Evaluation”)
- As expected, the language model makes a significant difference to BLEU, however we believe that this effect is orthogonal to the choice of base translation model, thus we would expect a similar gain when integrating a language model into the discriminative system. (Page 7, “Evaluation”)
- translation optimising discriminative model more often produces quite fluent translations, yet not in ways that would lead to an increase in BLEU score (footnote 9). This could be considered a side-effect of optimising likelihood rather than BLEU. (Page 8, “Evaluation”)
- Footnote 9: Hiero was MERT trained on this set and has a 2% higher BLEU score compared to the discriminative model. (Page 8, “Discussion and Further Work”)
- Figure axis label: development BLEU (%). (Page 8, “Discussion and Further Work”)

Appears in 9 sentences as: latent variable (7) latent variables (2)

- We present a translation model which models derivations as a latent variable, in both training and decoding, and is fully discriminative and globally optimised. (Page 1, “Abstract”)
- Second, within this framework, we model the derivation, d, as a latent variable, p(e, d|f), which is marginalised out in training and decoding. (Page 2, “Introduction”)
- Instead we model the translation distribution with a latent variable for the derivation, which we marginalise out in training and decoding. (Page 2, “Challenges for Discriminative SMT”)
- As the training data only provides source and target sentences, the derivations are modelled as a latent variable. (Page 3, “Discriminative Synchronous Transduction”)
- Our findings echo those observed for latent variable log-linear models successfully used in monolingual parsing (Clark and Curran, 2007; Petrov et al., 2007). (Page 4, “Discriminative Synchronous Transduction”)
- This method has been demonstrated to be effective for (non-convex) log-linear models with latent variables (Clark and Curran, 2004; Petrov et al., 2007). (Page 4, “Discriminative Synchronous Transduction”)
- Clark and Curran (2004) provides a more complete discussion of parsing with a log-linear model and latent variables. (Page 4, “Discriminative Synchronous Transduction”)
- Derivational ambiguity: Table 1 shows the impact of accounting for derivational ambiguity in training and decoding (footnote 5). There are two options for training: we could use our latent variable model and optimise the probability of all derivations of the reference translation, or choose a single derivation that yields the reference and optimise its probability alone. (Page 6, “Evaluation”)
- Max-translation decoding for the model trained on single derivations has only a small positive effect, while for the latent variable model the impact is much larger (footnote 6). (Page 6, “Evaluation”)
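
The difference between picking the single best derivation and picking the best translation after summing over its derivations can be sketched in a few lines of Python. This is a toy illustration only: the function names, the explicit list of derivations, and the scores are hypothetical, and a real decoder would work over a packed chart, not an enumerated list.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

# Each entry is (translation string, log score of one derivation).
# The strings and scores below are made up for illustration.
Derivation = Tuple[str, float]


def max_derivation(derivations: List[Derivation]) -> str:
    """Return the translation of the single highest-scoring derivation."""
    return max(derivations, key=lambda d: d[1])[0]


def max_translation(derivations: List[Derivation]) -> str:
    """Sum the (exponentiated) scores of all derivations that yield the
    same translation, then return the translation with the largest total."""
    totals: Dict[str, float] = defaultdict(float)
    for translation, log_score in derivations:
        totals[translation] += math.exp(log_score)
    return max(totals, key=totals.get)


derivations = [
    ("blue house", math.log(0.4)),
    ("the blue house", math.log(0.3)),  # first derivation of this string
    ("the blue house", math.log(0.3)),  # second derivation of the same string
]

best_by_derivation = max_derivation(derivations)
best_by_translation = max_translation(derivations)
```

In this toy case the two criteria disagree: no single derivation of "the blue house" beats the best derivation of "blue house", but its two derivations together carry more mass.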

Appears in 7 sentences as: log-linear (7)

- First, we develop a log-linear model of translation which is globally trained on a significant number of parallel sentences. (Page 1, “Introduction”)
- 3.1 A global log-linear model (Page 3, “Discriminative Synchronous Transduction”)
- Our log-linear translation model defines a conditional probability distribution over the target translations of a given source sentence. (Page 3, “Discriminative Synchronous Transduction”)
- Our findings echo those observed for latent variable log-linear models successfully used in monolingual parsing (Clark and Curran, 2007; Petrov et al., 2007). (Page 4, “Discriminative Synchronous Transduction”)
- This method has been demonstrated to be effective for (non-convex) log-linear models with latent variables (Clark and Curran, 2004; Petrov et al., 2007). (Page 4, “Discriminative Synchronous Transduction”)
- Clark and Curran (2004) provides a more complete discussion of parsing with a log-linear model and latent variables. (Page 4, “Discriminative Synchronous Transduction”)
- Such approaches have been shown to be effective in log-linear word-alignment models where only a small supervised corpus is available (Blunsom and Cohn, 2006). (Page 8, “Discussion and Further Work”)

Appears in 6 sentences as: language model (8)

- similar to the methods for decoding with a SCFG intersected with an n-gram language model, which require language model contexts to be stored in each chart cell. (Page 5, “Discriminative Synchronous Transduction”)
- The feature set includes: a trigram language model (lm) trained ... (Page 7, “Evaluation”)
- To compare our model directly with these systems we would need to incorporate additional features and a language model, work which we have left for a later date. (Page 7, “Evaluation”)
- The relative scores confirm that our model, with its minimalist feature set, achieves comparable performance to the standard feature set without the language model. (Page 7, “Evaluation”)
- As expected, the language model makes a significant difference to BLEU, however we believe that this effect is orthogonal to the choice of base translation model, thus we would expect a similar gain when integrating a language model into the discriminative system. (Page 7, “Evaluation”)
- To do so would require integrating a language model feature into the max-translation decoding algorithm. (Page 8, “Discussion and Further Work”)

Appears in 5 sentences as: conditional probabilities (2) conditional probability (3)

- Our log-linear translation model defines a conditional probability distribution over the target translations of a given source sentence. (Page 3, “Discriminative Synchronous Transduction”)
- The conditional probability of a derivation, d, for a target translation, e, conditioned on the source, f, is given by: (Page 3, “Discriminative Synchronous Transduction”)
- Given (1), the conditional probability of a target translation given the source is the sum over all of its derivations: (Page 3, “Discriminative Synchronous Transduction”)
- This is illustrated in Table 2, which shows the conditional probabilities for rules, obtained by locally normalising the rule feature weights for a simple grammar extracted from the ambiguous pair of sentences presented in DeNero et al. (Page 7, “Evaluation”)
- The first column of conditional probabilities corresponds to a maximum likelihood estimate, i.e., without regularisation. (Page 7, “Evaluation”)
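
Two of the sentences quoted above end at a colon, and the equations they introduce did not survive extraction. A standard log-linear form that matches the quoted descriptions is sketched below. The symbol names are assumptions chosen for illustration and may differ from the paper's notation: \lambda_k are feature weights, H_k(d, e, f) is the value of feature k on a derivation, Z_\Lambda(f) normalises over all derivations of the source f, and \Delta(e, f) is the set of derivations of e from f.

```latex
% (1) probability of a derivation d of translation e, given source f
p_\Lambda(d, e \mid f) = \frac{\exp \sum_k \lambda_k H_k(d, e, f)}{Z_\Lambda(f)}

% (2) probability of a translation: sum over all of its derivations
p_\Lambda(e \mid f) = \sum_{d \in \Delta(e, f)} p_\Lambda(d, e \mid f)
```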

Appears in 5 sentences as: feature set (5) feature sets (1)

- This problem of over-fitting is exacerbated in discriminative models with large, expressive, feature sets. (Page 2, “Challenges for Discriminative SMT”)
- Learning with a large feature set requires many training examples and typically many iterations of a solver during training. (Page 2, “Challenges for Discriminative SMT”)
- To do this we use our own implementation of Hiero (Chiang, 2007), with the same grammar but with the traditional generative feature set trained in a linear model with minimum BLEU training. (Page 7, “Evaluation”)
- The feature set includes: a trigram language model (lm) trained ... (Page 7, “Evaluation”)
- The relative scores confirm that our model, with its minimalist feature set, achieves comparable performance to the standard feature set without the language model. (Page 7, “Evaluation”)

Appears in 5 sentences as: log-linear model (3) log-linear models (2)

- First, we develop a log-linear model of translation which is globally trained on a significant number of parallel sentences. (Page 1, “Introduction”)
- 3.1 A global log-linear model (Page 3, “Discriminative Synchronous Transduction”)
- Our findings echo those observed for latent variable log-linear models successfully used in monolingual parsing (Clark and Curran, 2007; Petrov et al., 2007). (Page 4, “Discriminative Synchronous Transduction”)
- This method has been demonstrated to be effective for (non-convex) log-linear models with latent variables (Clark and Curran, 2004; Petrov et al., 2007). (Page 4, “Discriminative Synchronous Transduction”)
- Clark and Curran (2004) provides a more complete discussion of parsing with a log-linear model and latent variables. (Page 4, “Discriminative Synchronous Transduction”)

Appears in 5 sentences as: machine translation (5)

- Large-scale discriminative machine translation promises to further the state-of-the-art, but has failed to deliver convincing gains over current heuristic frequency count systems. (Page 1, “Abstract”)
- Statistical machine translation (SMT) has seen a resurgence in popularity in recent years, with progress being driven by a move to phrase-based and syntax-inspired approaches. (Page 1, “Introduction”)
- These results could, and should, be applied to other models, discriminative and generative, phrase- and syntax-based, to further progress the state-of-the-art in machine translation. (Page 3, “Challenges for Discriminative SMT”)
- The development and test data was taken from the 2006 NAACL and 2007 ACL workshops on machine translation, also filtered for sentence length (footnote 4). Tuning of the regularisation parameter and MERT training of the benchmark models was performed on dev2006, while the test set was the concatenation of devtest2006, test2006 and test2007, amounting to 315 development and 1164 test sentences. (Page 5, “Evaluation”)
- Finally, while in this paper we have focussed on the science of discriminative machine translation, we believe that with suitable engineering this model will advance the state-of-the-art. (Page 8, “Discussion and Further Work”)

Appears in 5 sentences as: translation model (3) translation models (2)

- We present a translation model which models derivations as a latent variable, in both training and decoding, and is fully discriminative and globally optimised. (Page 1, “Abstract”)
- Our log-linear translation model defines a conditional probability distribution over the target translations of a given source sentence. (Page 3, “Discriminative Synchronous Transduction”)
- Footnote 6: We also experimented with using max-translation decoding for standard MER trained translation models, finding that it had a small negative impact on BLEU score. (Page 6, “Evaluation”)
- Firstly we show the relative scores of our model against Hiero without using reverse translation or lexical features (footnote 7). This allows us to directly study the differences between the two translation models without the added complication of the other features. (Page 7, “Evaluation”)
- As expected, the language model makes a significant difference to BLEU, however we believe that this effect is orthogonal to the choice of base translation model, thus we would expect a similar gain when integrating a language model into the discriminative system. (Page 7, “Evaluation”)

Appears in 4 sentences as: development set (4)

- A comparison on the impact of accounting for all derivations in training and decoding (development set). (Page 6, “Evaluation”)
- The effect of the beam width (log-scale) on max-translation decoding (development set). (Page 6, “Evaluation”)
- An informal comparison of the outputs on the development set, presented in Table 4, suggests that the ... (Page 7, “Evaluation”)
- ... sentences (development set) (Page 8, “Discussion and Further Work”)

Appears in 3 sentences as: beam search (3)

- Most prior work in SMT, both generative and discriminative, has approximated the sum over derivations by choosing a single ‘best’ derivation using a Viterbi or beam search algorithm. (Page 4, “Discriminative Synchronous Transduction”)
- Here we approximate the sum over derivations directly using a beam search in which we produce a beam of high probability translation sub-strings for each cell in the parse chart. (Page 4, “Discriminative Synchronous Transduction”)
- When the beam search is complete we have a list of translations in the top beam cell spanning the entire source sentence along with their approximated inside derivation scores. (Page 5, “Discriminative Synchronous Transduction”)

Appears in 3 sentences as: (1) lm (2)

- The feature set includes: a trigram language model (lm) trained ... (Page 7, “Evaluation”)
- Discriminative max-derivation: 25.78; Hiero (pd, gr, rc, wc): 26.48; Discriminative max-translation: 27.72; Hiero (pd, pr, pd^lex, pr^lex, gr, rc, wc): 28.14; Hiero (pd, pr, pd^lex, pr^lex, gr, rc, wc, lm): 32.00. (Page 7, “Evaluation”)
- Footnote 8: Hiero (pd, pr, pd^lex, pr^lex, gr, rc, wc, lm) represents state-... (Page 7, “Evaluation”)

Appears in 3 sentences as: model parameters (3)

- This itself provides robustness to noisy data, in addition to the explicit regularisation from a prior over the model parameters. (Page 3, “Challenges for Discriminative SMT”)
- Here k ranges over the model’s features, and Λ = {λ_k} are the model parameters (weights for their corresponding features). (Page 3, “Discriminative Synchronous Transduction”)
- Each L-BFGS iteration requires the objective value and its gradient with respect to the model parameters. (Page 4, “Discriminative Synchronous Transduction”)
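
As a rough illustration of what "the objective value and its gradient" look like for a latent variable log-linear model, the sketch below computes both for one training pair by brute-force enumeration. Everything here is an assumption made for illustration: the function names, the explicit list of derivations with feature vectors, and the zero-mean Gaussian prior with variance `sigma2`. A real system would compute the same expectations over packed charts instead of enumerating derivations, and the reference is assumed to have at least one derivation.

```python
import math
from typing import List, Sequence, Tuple

# (translation, feature vector of one derivation); toy representation.
Derivation = Tuple[str, Sequence[float]]


def log_sum_exp(xs: List[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def expected_features(derivs: List[Derivation], weights: Sequence[float]):
    """Return (log partition, expected feature vector) over `derivs`."""
    scores = [sum(w * h for w, h in zip(weights, feats)) for _, feats in derivs]
    log_z = log_sum_exp(scores)
    expect = [0.0] * len(weights)
    for score, (_, feats) in zip(scores, derivs):
        p = math.exp(score - log_z)
        for k, h in enumerate(feats):
            expect[k] += p * h
    return log_z, expect


def objective_and_gradient(derivs, reference, weights, sigma2=1.0):
    """Regularised log-likelihood of `reference`, marginalising over its
    derivations, and the gradient with respect to the weights."""
    ref_derivs = [d for d in derivs if d[0] == reference]
    log_z_ref, e_ref = expected_features(ref_derivs, weights)
    log_z_all, e_all = expected_features(derivs, weights)
    value = log_z_ref - log_z_all - sum(w * w for w in weights) / (2 * sigma2)
    # Gradient: features expected under derivations of the reference, minus
    # features expected under all derivations, minus the prior term.
    grad = [a - b - w / sigma2 for a, b, w in zip(e_ref, e_all, weights)]
    return value, grad


toy_derivs = [("a", [1.0, 0.0]), ("a", [0.0, 1.0]), ("b", [1.0, 1.0])]
value, grad = objective_and_gradient(toy_derivs, "a", [0.0, 0.0])
```

An optimiser such as L-BFGS would call a function of this shape once per iteration, typically on the negated value and gradient summed over the training set, since most implementations minimise.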
