Learning Hierarchical Translation Structure with Linguistic Annotations
Mylonakis, Markos and Sima'an, Khalil

Article Structure

Abstract

While it is generally accepted that many translation phenomena are correlated with linguistic structures, employing linguistic syntax for translation has proven a highly nontrivial task.

Introduction

Recent advances in Statistical Machine Translation (SMT) are widely centred around two concepts: (a) hierarchical translation processes, frequently employing Synchronous Context Free Grammars (SCFGs), and (b) transduction or synchronous rewrite processes over a linguistic syntactic tree.

Joint Translation Model

Our model is based on a probabilistic Synchronous CFG (Wu, 1997; Chiang, 2005).

Learning Translation Structure

3.1 Phrase-Pair Label Chart

Experiments

4.1 Decoding Model

Related Work

In this work, we focus on the combination of learning latent structure with syntax and linguistic annotations, exploring the crossroads of machine

Conclusions

In this work we contribute a method to learn and apply latent hierarchical translation structure.

Topics

language pairs

Appears in 12 sentences as: language pair (2) language pairs (10)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. We obtain statistically significant improvements across 4 different language pairs with English as source, mounting up to +1.92 BLEU for Chinese as target.
    Page 1, “Abstract”
  2. utilised an ITG-flavour which focused on hierarchical phrase-pairs to capture context-driven translation and reordering patterns with ‘gaps’, offering competitive performance particularly for language pairs with extensive reordering.
    Page 1, “Introduction”
  3. By advancing from structures which mimic linguistic syntax, to learning linguistically aware latent recursive structures targeting translation, we achieve significant improvements in translation quality for 4 different language pairs in comparison with a strong hierarchical translation baseline.
    Page 2, “Introduction”
  4. These extra features assess translation quality past the synchronous grammar derivation and learn general reordering or word emission preferences for the language pair.
    Page 6, “Experiments”
  5. We evaluate our method on four different language pairs with English as the source language and French, German, Dutch and Chinese as target.
    Page 6, “Experiments”
  6. The data for the first three language pairs are derived from parliament proceedings sourced from the Europarl corpus (Koehn, 2005), with WMT-07 development and test data for French and German.
    Page 6, “Experiments”
  7. For all language pairs we employ 200K and 400K sentence pairs for training, 2K for development and 2K for testing (single reference per source sentence).
    Page 6, “Experiments”
  8. Apart from evaluating against a state-of-the-art system, especially on the English-Chinese language pair, the comparison has an added interesting aspect.
    Page 7, “Experiments”
  9. Table 1 presents the results for the baseline and our method for the 4 language pairs, for training sets of both 200K and 400K sentence pairs.
    Page 7, “Experiments”
  10. Our system (lts) outperforms the baseline for all 4 language pairs for both BLEU and NIST scores, by a margin which scales up to +1.92 BLEU points for English to Chinese translation when training on the 400K set.
    Page 7, “Experiments”
  11. the robustness of our system is exemplified by delivering significant performance increases for all language pairs.
    Page 8, “Experiments”

overfitting

Appears in 8 sentences as: overfit (2) overfitting (7)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. Estimating such grammars under a Maximum Likelihood criterion is known to be plagued by strong overfitting leading to degenerate estimates (DeNero et al., 2006).
    Page 2, “Introduction”
  2. In contrast, our learning objective not only avoids overfitting the training data but, most importantly, learns joint stochastic synchronous grammars which directly aim at generalisation towards yet unseen instances.
    Page 2, “Introduction”
  3. On the other hand, estimating the parameters under Maximum-Likelihood Estimation (MLE) for the latent translation structure model p(σ) is bound to overfit towards memorising whole sentence-pairs, as discussed in (Mylonakis and Sima’an, 2010), with the resulting grammar estimate not being able to
    Page 5, “Learning Translation Structure”
  4. However, apart from overfitting towards long phrase-pairs, a grammar with millions of structural rules is also liable to overfit towards degenerate latent structures which, while fitting the training data well, have limited applicability to unseen sentences.
    Page 5, “Learning Translation Structure”
  5. The CV-criterion, apart from avoiding overfitting, results in discarding the structural rules which are only found in a single part of the training corpus, leading to a more compact grammar while still retaining millions of structural rules that are more likely to generalise.
    Page 5, “Learning Translation Structure”
  6. We show that a translation system based on such a joint model can perform competitively in comparison with conditional probability models, when it is augmented with a rich latent hierarchical structure trained adequately to avoid overfitting.
    Page 8, “Related Work”
  7. Cohn and Blunsom (2009) sample rules of the form proposed in (Galley et al., 2004) from a Bayesian model, employing Dirichlet Process priors favouring smaller rules to avoid overfitting.
    Page 9, “Related Work”
  8. We address overfitting issues by cross-validating while climbing the likelihood of the training data and propose solutions to increase the efficiency and accuracy of decoding.
    Page 9, “Conclusions”
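
Excerpts 3, 4, 5 and 8 above all refer to estimation under a cross-validated (CV) likelihood-climbing criterion: the training corpus is partitioned, and when collecting expected counts for one part, only latent structures built from rules attested in the remaining parts are considered. The Python sketch below illustrates that idea for a generic latent rule model; the interface (parts, derivations_for) and the single-multinomial normalisation are hypothetical simplifications for illustration, not the paper's actual CV procedure over the HR-SCFG.

    import math
    from collections import Counter

    def cv_likelihood_climbing(parts, derivations_for, iterations=5):
        # parts: list of corpus parts, each a list of training points.
        # derivations_for(x): candidate latent derivations of x, each a tuple of rule ids.
        points = [x for part in parts for x in part]
        rules = {r for x in points for d in derivations_for(x) for r in d}
        prob = {r: 1.0 / len(rules) for r in rules}  # uniform initialisation

        for _ in range(iterations):
            expected = Counter()
            for i, part in enumerate(parts):
                # Rules licensed for this part: only those attested in the other parts.
                allowed = {r for j, p in enumerate(parts) if j != i
                           for x in p for d in derivations_for(x) for r in d}
                for x in part:
                    cands = [d for d in derivations_for(x)
                             if all(r in allowed for r in d)]
                    weights = [math.prod(prob[r] for r in d) for d in cands]
                    z = sum(weights)
                    if z == 0.0:
                        continue  # no cross-validated derivation survives; skip this point
                    for d, w in zip(cands, weights):
                        for r in d:
                            expected[r] += w / z  # expected rule counts (E-step)
            # Relative-frequency re-estimation (M-step); the paper's grammar instead
            # normalises per left-hand-side nonterminal.
            total = sum(expected.values()) or 1.0
            prob = {r: expected[r] / total for r in rules}
        return prob

A rule occurring only inside a single part never receives expected counts and therefore ends up with zero probability, which mirrors the grammar-compaction effect described in excerpt 5.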

language model

Appears in 6 sentences as: language model (6)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. While in a decoder this is somehow mitigated by the use of a language model, we believe that the weakness of straightforward applications of SCFGs to model reordering structure at the sentence level misses a chance to learn this crucial part of the translation process during grammar induction.
    Page 3, “Joint Translation Model”
  2. As (Mylonakis and Sima’an, 2010) note, ‘plain’ SCFGs seem to perform worse than the grammars described next, mainly due to wrong long-range reordering decisions for which the language model can hardly help.
    Page 3, “Joint Translation Model”
  3. The final feature is the language model score for the target sentence, mounting up to the following model used at decoding time, with the feature weights λ trained by Minimum Error Rate Training (MERT) (Och, 2003) on a development corpus.
    Page 6, “Experiments”
  4. with a 3-gram language model smoothed with modified Kneser-Ney discounting (Chen and Goodman, 1998), trained on around 1M sentences per target language.
    Page 7, “Experiments”
  5. Table 2: Additional experiments for English to Chinese translation examining (a) the impact of the linguistic annotations in the LTS system (lts), when compared with an instance not employing such annotations (lts-nolabels) and (b) decoding with a 4th-order language model (-lm4).
    Page 8, “Experiments”
  6. The second additional experiment relates to the impact of employing a stronger language model during decoding, which may increase performance but slows down decoding speed.
    Page 8, “Experiments”
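
The language model feature in excerpts 1-3 above is a standard n-gram model; for the 3-gram case of excerpt 4, the target-sentence probability decomposes as follows (textbook notation, not taken from the paper):

    p(e_1 \dots e_m) \approx \prod_{i=1}^{m} p(e_i \mid e_{i-2}, e_{i-1})

with the conditional estimates smoothed by modified Kneser-Ney discounting; the 4-gram model of excerpts 5 and 6 simply conditions on one more preceding word.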

recursive

Appears in 6 sentences as: Recursive (1) recursive (5)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. As Hiero uses a single nonterminal and concentrates on overcoming translation lexicon sparsity, it barely explores the recursive nature of translation past the lexical level.
    Page 1, “Introduction”
  2. By advancing from structures which mimic linguistic syntax, to learning linguistically aware latent recursive structures targeting translation, we achieve significant improvements in translation quality for 4 different language pairs in comparison with a strong hierarchical translation baseline.
    Page 2, “Introduction”
  3. Figure 2: Recursive Reordering Grammar rule categories; A, B, C non-terminals; α, β source and target strings respectively.
    Page 3, “Joint Translation Model”
  4. structural part and their associated probabilities define a model p(σ) over the latent variable σ determining the recursive, reordering and phrase-pair segmenting structure of translation, as in Figure 4.
    Page 4, “Joint Translation Model”
  5. We aim to induce a recursive translation structure explaining the joint generation of the source and target
    Page 4, “Learning Translation Structure”
  6. As an example, while our probabilistic HR-SCFG maintains a separate joint phrase-pair emission distribution per nonterminal, the smoothing features (a) above assess the conditional translation of surface phrases irrespective of any notion of recursive translation structure.
    Page 6, “Experiments”
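
The Figure 2 caption in excerpt 3 above only names the ingredients of the rule categories. Purely as an illustration (an ITG-style reading, not the paper's exact HR-SCFG rule inventory), such categories typically comprise:

    A → ⟨B C, B C⟩    monotone: the target keeps the source constituent order
    A → ⟨B C, C B⟩    swapping: the two constituents are reordered on the target side
    A → ⟨α, β⟩        emission of a phrase pair with source string α and target string β

In the paper's grammar, the nonterminal labels additionally carry linguistic annotations (cf. Section 3.1), rather than being a single generic symbol as in Hiero.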

parse trees

Appears in 5 sentences as: parse tree (2) parse trees (4)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. The key assumption behind many approaches is that translation is guided by the source and/or target language parse, employing rules extracted from the parse tree or performing tree transformations.
    Page 1, “Abstract”
  2. Recent research tries to address these issues, by restructuring training data parse trees to better suit syntax-based SMT training (Wang et al., 2010), or by moving from linguistically motivated synchronous grammars to systems where linguistic plausibility of the translation is assessed through additional features in a phrase-based system (Venugopal et al., 2009; Chiang et al., 2009), obscuring the impact of higher level syntactic processes.
    Page 1, “Introduction”
  3. The results in Table 2(a) indicate that a large part of the performance improvement can be attributed to the use of the linguistic annotations extracted from the source parse trees, indicating the potential of the LTS system to take advantage of such additional annotations to deliver better translations.
    Page 8, “Experiments”
  4. Earlier approaches for linguistic syntax-based translation such as (Yamada and Knight, 2001; Galley et al., 2006; Huang et al., 2006; Liu et al., 2006) focus on memorising and reusing parts of the structure of the source and/or target parse trees and constraining decoding by the input parse tree.
    Page 8, “Related Work”
  5. A further promising direction is broadening this set with labels taking advantage of both source and target-language linguistic annotation or categories exploring additional phrase-pair properties past the parse trees such as semantic annotations.
    Page 9, “Conclusions”

translation model

Appears in 5 sentences as: translation model (3) translation models (3)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. Nevertheless, the successful employment of SCFGs for phrase-based SMT brought translation models assuming latent syntactic structure to the spotlight.
    Page 1, “Introduction”
  2. Section 2 discusses the weak independence assumptions of SCFGs and introduces a joint translation model which addresses these issues and separates hierarchical translation structure from phrase-pair emission.
    Page 2, “Introduction”
  3. The rest of the (sometimes thousands of) rule-specific features usually added to SCFG translation models do not directly help either, leaving reordering decisions disconnected from the rest of the derivation.
    Page 3, “Joint Translation Model”
  4. The induced joint translation model can be used to recover arg max_e p(e|f), as it is equal to arg max_e p(e, f). We employ the induced probabilistic HR-SCFG G as the backbone of a log-linear, feature-based translation model, with the derivation probability p(D) under the grammar estimate being
    Page 5, “Experiments”
  5. Most of the aforementioned work does concentrate on learning hierarchical, linguistically motivated translation models .
    Page 9, “Related Work”
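
The identity used in excerpt 4 above is the usual one: since the source sentence f is fixed at decoding time, maximising the conditional probability of the target sentence e equals maximising the joint probability,

    \hat{e} = \arg\max_e p(e \mid f) = \arg\max_e \frac{p(e, f)}{p(f)} = \arg\max_e p(e, f),

because p(f) is a constant with respect to e.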

BLEU

Appears in 4 sentences as: BLEU (5)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. We obtain statistically significant improvements across 4 different language pairs with English as source, mounting up to +1.92 BLEU for Chinese as target.
    Page 1, “Abstract”
  2. Our system (lts) outperforms the baseline for all 4 language pairs for both BLEU and NIST scores, by a margin which scales up to +1.92 BLEU points for English to Chinese translation when training on the 400K set.
    Page 7, “Experiments”
  3. BLEU scores for 200K and 400K training sentence pairs.
    Page 8, “Experiments”
  4. Notably, as can be seen in Table 2(b), switching to a 4-gram LM results in performance gains for both the baseline and our system, and while the margin between the two systems decreases, our system continues to deliver a considerable and significant improvement in translation BLEU scores.
    Page 8, “Experiments”
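
For reference, the gains quoted above are corpus-level BLEU points (Papineni et al., 2002): modified n-gram precisions p_n (up to n = 4) are combined with a brevity penalty BP, and on the usual 0-100 scale "+1.92 BLEU" means the score rises by 1.92 points,

    \mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{4} \tfrac{1}{4} \log p_n \Big), \qquad \mathrm{BP} = \min\big(1,\; e^{\,1 - r/c}\big),

with r the total reference length and c the total candidate length.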

sentence pairs

Appears in 4 sentences as: sentence pairs (4)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. For all language pairs we employ 200K and 400K sentence pairs for training, 2K for development and 2K for testing (single reference per source sentence).
    Page 6, “Experiments”
  2. Table 1 presents the results for the baseline and our method for the 4 language pairs, for training sets of both 200K and 400K sentence pairs.
    Page 7, “Experiments”
  3. In addition, increasing the size of the training data from 200K to 400K sentence pairs widens the performance margin between the baseline and our system, in some cases considerably.
    Page 7, “Experiments”
  4. BLEU scores for 200K and 400K training sentence pairs.
    Page 8, “Experiments”

significant improvements

Appears in 4 sentences as: significant improvement (1) significant improvements (3)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. We obtain statistically significant improvements across 4 different language pairs with English as source, mounting up to +1.92 BLEU for Chinese as target.
    Page 1, “Abstract”
  2. By advancing from structures which mimic linguistic syntax, to learning linguistically aware latent recursive structures targeting translation, we achieve significant improvements in translation quality for 4 different language pairs in comparison with a strong hierarchical translation baseline.
    Page 2, “Introduction”
  3. Even for Dutch and German, which pose additional challenges such as compound words and morphology which we do not explicitly treat in the current system, LTS still delivers significant improvements in performance.
    Page 7, “Experiments”
  4. Notably, as can be seen in Table 2(b), switching to a 4-gram LM results in performance gains for both the baseline and our system, and while the margin between the two systems decreases, our system continues to deliver a considerable and significant improvement in translation BLEU scores.
    Page 8, “Experiments”

translation system

Appears in 4 sentences as: translation system (2) translation systems (2)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. Interestingly, early on (Koehn et al., 2003) exemplified the difficulties of integrating linguistic information in translation systems .
    Page 1, “Introduction”
  2. We compare against a state-of-the-art hierarchical translation (Chiang, 2005) baseline, based on the Joshua translation system under the default training and decoding settings (josh-base).
    Page 7, “Experiments”
  3. The decoder does not employ any ‘glue grammar’ as is usual with hierarchical translation systems to limit reordering up to a certain cutoff length.
    Page 7, “Experiments”
  4. We show that a translation system based on such a joint model can perform competitively in comparison with conditional probability models, when it is augmented with a rich latent hierarchical structure trained adequately to avoid overfitting.
    Page 8, “Related Work”

joint model

Appears in 3 sentences as: joint model (3)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. Phrase-pairs are emitted jointly and the overall probabilistic SCFG is a joint model over parallel strings.
    Page 2, “Joint Translation Model”
  2. By splitting the joint model into a hierarchical structure model and a lexical emission one, we facilitate estimating the two models separately.
    Page 4, “Joint Translation Model”
  3. We show that a translation system based on such a joint model can perform competitively in comparison with conditional probability models, when it is augmented with a rich latent hierarchical structure trained adequately to avoid overfitting.
    Page 8, “Related Work”
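
Excerpts 1 and 2 above describe a joint model over parallel strings that is split into a hierarchical structure part and a lexical (phrase-pair) emission part. Schematically, in our notation rather than the paper's exact formulation, a synchronous derivation D contributes

    p(e, f) = \sum_{D\,:\,\mathrm{yield}(D) = (e, f)} p(D), \qquad p(D) = \prod_{r \in D_{\mathrm{struct}}} p(r) \cdot \prod_{(A \to \langle \alpha, \beta \rangle) \in D} p(\alpha, \beta \mid A),

where the first product is the structural (reordering) model and the second the per-nonterminal joint phrase-pair emission model; estimating the two separately is the point made in excerpt 2.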

latent variable

Appears in 3 sentences as: latent variable (3)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. structural part and their associated probabilities define a model p(σ) over the latent variable σ determining the recursive, reordering and phrase-pair segmenting structure of translation, as in Figure 4.
    Page 4, “Joint Translation Model”
  2. It works iteratively on a partition of the training data, climbing the likelihood of the training data while cross-validating the latent variable values, considering for every training data point only those which can be produced by models built from the rest of the data excluding the current part.
    Page 5, “Learning Translation Structure”
  3. The rich linguistically motivated latent variable learnt by our method delivers translation performance that compares favourably to a state-of-the-art system.
    Page 9, “Related Work”

log-linear

Appears in 3 sentences as: log-linear (3)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. The induced joint translation model can be used to recover arg max_e p(e|f), as it is equal to arg max_e p(e, f). We employ the induced probabilistic HR-SCFG G as the backbone of a log-linear, feature-based translation model, with the derivation probability p(D) under the grammar estimate being
    Page 5, “Experiments”
  2. We train the feature weights under MERT and decode with the resulting log-linear model.
    Page 7, “Experiments”
  3. Future work directions include investigating the impact of hierarchical phrases for our models as well as any gains from additional features in the log-linear decoding model.
    Page 9, “Conclusions”
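
The log-linear model of excerpts 1 and 2 above combines feature functions h_i over a derivation D, among them the grammar derivation probability p(D), the smoothing features and the language model score, with weights λ_i tuned by MERT. In the standard form (our notation, not necessarily the paper's exact feature set),

    \hat{D} = \arg\max_D \sum_i \lambda_i\, h_i(D), \qquad \text{e.g. } h_1(D) = \log p(D),\ h_{\mathrm{lm}}(D) = \log p_{\mathrm{LM}}(e).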

translation quality

Appears in 3 sentences as: translation quality (3)
In Learning Hierarchical Translation Structure with Linguistic Annotations
  1. By advancing from structures which mimic linguistic syntax, to learning linguistically aware latent recursive structures targeting translation, we achieve significant improvements in translation quality for 4 different language pairs in comparison with a strong hierarchical translation baseline.
    Page 2, “Introduction”
  2. These extra features assess translation quality past the synchronous grammar derivation and learn general reordering or word emission preferences for the language pair.
    Page 6, “Experiments”
  3. by interpolating them with less sparse ones, could in the future lead to an additional increase in translation quality.
    Page 9, “Conclusions”
