A Class-Based Agreement Model for Generating Accurately Inflected Translations
Green, Spence and DeNero, John

Article Structure

Abstract

When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors.

Introduction

Languages vary in the degree to which surface forms reflect grammatical relations.

A Class-based Model of Agreement

2.1 Morpho-syntactic Agreement

Inference during Translation Decoding

Scoring the agreement model as part of translation decoding requires a novel inference procedure.

Related Work

We compare our class-based model to previous approaches to scoring syntactic relations in MT.

Experiments

We first evaluate the Arabic segmenter and tagger components independently, then provide English-Arabic translation quality results.

Discussion of Translation Results

Tbl.

Conclusion and Outlook

Our class-based agreement model improves translation quality by promoting local agreement, but with a minimal increase in decoding time and no additional storage requirements for the phrase table.

Topics

phrase-based

Appears in 10 sentences as: Phrase-based (1) phrase-based (9)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. The model does not require bitext or phrase table annotations and can be easily implemented as a feature in many phrase-based decoders.
    Page 1, “Abstract”
  2. Agreement relations that cross statistical phrase boundaries are not explicitly modeled in most phrase-based MT systems (Avramidis and Koehn, 2008).
    Page 1, “Introduction”
  3. The model can be implemented using the feature APIs of popular phrase-based decoders such as Moses (Koehn et al., 2007) and Phrasal (Cer et al., 2010).
    Page 1, “Introduction”
  4. We chose a bigram model due to the aggressive recombination strategy in our phrase-based decoder.
    Page 4, “A Class-based Model of Agreement”
  5. 3.1 Phrase-based Translation Decoding
    Page 4, “Inference during Translation Decoding”
  6. We consider the standard phrase-based approach to MT (Och and Ney, 2004).
    Page 4, “Inference during Translation Decoding”
  7. Subotin (2011) recently extended factored translation models to hierarchical phrase-based translation and developed a discriminative model for predicting target-side morphology in English-Czech.
    Page 5, “Related Work”
  8. Experimental Setup Our decoder is based on the phrase-based approach to translation (Och and Ney, 2004) and contains various feature functions including phrase relative frequency, word-level alignment statistics, and lexicalized reordering models (Tillmann, 2004; Och et al., 2004).
    Page 6, “Experiments”
  9. Phrase Table Coverage In a standard phrase-based system, effective translation into a highly inflected target language requires that the phrase table contain the inflected word forms necessary to construct an output with correct agreement.
    Page 8, “Discussion of Translation Results”
  10. This large gap between the unigram recall of the actual translation output (top) and the lexical coverage of the phrase-based model (bottom) indicates that translation performance can be improved dramatically by altering the translation model through features such as ours, without expanding the search space of the decoder.
    Page 8, “Discussion of Translation Results”


CRF

Appears in 9 sentences as: CRF (9)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. We treat segmentation as a character-level sequence modeling problem and train a linear-chain conditional random field (CRF) model (Lafferty et al., 2001).
    Page 2, “A Class-based Model of Agreement”
  2. Class-based Agreement Model: t ∈ T, the set of morpho-syntactic classes; s ∈ S, the set of all word segments; θ_seg, learned weights for the CRF-based segmenter; θ_tag, learned weights for the CRF-based tagger; φ_o, φ_t, CRF potential functions (emission and transition)
    Page 3, “A Class-based Model of Agreement”
  3. For this task we also train a standard CRF model on full sentences with gold classes and segmentation.
    Page 3, “A Class-based Model of Agreement”
  4. The CRF tagging model predicts a target-side class sequence τ*
    Page 3, “A Class-based Model of Agreement”
  5. Features The tagging CRF includes emission features φ_o that indicate a class τ_i appearing with various orthographic characteristics of the word sequence being tagged.
    Page 3, “A Class-based Model of Agreement”
  6. In typical CRF inference, the entire observation sequence is available throughout inference, so these features can be scored on observed words in an arbitrary neighborhood around the current position i.
    Page 3, “A Class-based Model of Agreement”
  7. However, we conduct CRF inference in tandem with the translation decoding procedure (§3), creating an environment in which subsequent words of the observation are not available; the MT system has yet to generate the rest of the translation when the tagging features for a position are scored.
    Page 3, “A Class-based Model of Agreement”
  8. The CRF tagger model defines a conditional distribution p(τ | e; θ_tag) for a class sequence τ given a sentence e and model parameters θ_tag.
    Page 4, “A Class-based Model of Agreement”
  9. The model can be implemented with a standard CRF package, trained on existing treebanks for many languages, and integrated easily with many MT feature APIs.
    Page 8, “Conclusion and Outlook”
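The excerpts above describe linear-chain CRF models decoded with the Viterbi algorithm. As a minimal illustration, not the paper's implementation, here is Viterbi decoding over dictionary-backed emission and transition log-potentials (loosely named after the φ_o/φ_t potentials; all labels and scores below are invented):

```python
def viterbi(obs, states, emit, trans):
    """Best label sequence under log-linear emission/transition scores.

    emit[(state, symbol)] and trans[(prev, state)] hold log-potentials,
    in the spirit of the phi_o / phi_t CRF potentials described above.
    Missing entries default to a large negative score.
    """
    NEG = -1e9
    # delta[s]: best log-score of any label path ending in state s
    delta = {s: emit.get((s, obs[0]), NEG) for s in states}
    backptrs = []
    for sym in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best_p = max(states, key=lambda p: prev[p] + trans.get((p, s), NEG))
            delta[s] = (prev[best_p] + trans.get((best_p, s), NEG)
                        + emit.get((s, sym), NEG))
            ptr[s] = best_p
        backptrs.append(ptr)
    # recover the path by following backpointers from the best final state
    path = [max(delta, key=delta.get)]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With only a handful of labels (as in the four-label segmenter mentioned above), this dynamic program is very fast, since each step is quadratic in the label-set size.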


LM

Appears in 9 sentences as: LM (9)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. Intuition might suggest that the standard n-gram language model (LM) is sufficient to handle agreement phenomena.
    Page 1, “Introduction”
  2. However, LM statistics are sparse, and they are made sparser by morphological variation.
    Page 1, “Introduction”
  3. For English-to-Arabic translation, we achieve a +1.04 BLEU average improvement by tiling our model on top of a large LM.
    Page 1, “Introduction”
  4. For contexts in which the LM is guaranteed to back off (for instance, after an unseen bigram), our decoder maintains only the minimal state needed (perhaps only a single word).
    Page 4, “A Class-based Model of Agreement”
  5. initialize η to −∞; set η(t) = 0; compute τ* from parameters ⟨e, ē, π, is_goal⟩; compute q(e_1^{I+1}) = p(τ*) under the generative LM; set model state h_new = ⟨ē, τ̄⟩ for the prefix e_1^I. Output: q(e_1^{I+1})
    Page 4, “Inference during Translation Decoding”
  6. They used a target-side LM over Combinatorial Categorial Grammar (CCG) supertags, along with a penalty for the number of operator violations, and also modified the phrase probabilities based on the tags.
    Page 5, “Related Work”
  7. Then they mixed the classes into a word-based LM .
    Page 6, “Related Work”
  8. Target-Side Syntactic LMs Our agreement model is a form of syntactic LM, of which there is a long history of research, especially in speech processing.5 Syntactic LMs have traditionally been too slow for scoring during MT decoding.
    Page 6, “Related Work”
  9. The target-side structure enables scoring hypotheses with a trigram dependency LM .
    Page 6, “Related Work”


translation quality

Appears in 7 sentences as: Translation Quality (1) translation quality (6)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. However, using lexical coverage experiments, we show that there is ample room for translation quality improvements through better selection of forms that already exist in the translation model.
    Page 2, “Introduction”
  2. Segmentation is typically applied as a bitext preprocessing step, and there is a rich literature on the effect of different segmentation schemata on translation quality (Koehn and Knight, 2003; Habash and Sadat, 2006; El Kholy and Habash, 2012).
    Page 2, “A Class-based Model of Agreement”
  3. We first evaluate the Arabic segmenter and tagger components independently, then provide English-Arabic translation quality results.
    Page 6, “Experiments”
  4. 5.2 Translation Quality
    Page 6, “Experiments”
  5. We evaluated translation quality with BLEU-4 (Papineni et al., 2002) and computed statistical significance with the approximate randomization method of Riezler and Maxwell (2005).9
    Page 7, “Experiments”
  6. Tbl. 2 shows translation quality results on newswire, while Tbl.
    Page 7, “Discussion of Translation Results”
  7. Our class-based agreement model improves translation quality by promoting local agreement, but with a minimal increase in decoding time and no additional storage requirements for the phrase table.
    Page 8, “Conclusion and Outlook”


translation model

Appears in 7 sentences as: Translation Model (2) translation model (3) Translation Models (1) translation models (2)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. However, using lexical coverage experiments, we show that there is ample room for translation quality improvements through better selection of forms that already exist in the translation model.
    Page 2, “Introduction”
  2. Translation Model: e, target sequence of I words; f, source sequence of J words; a, sequence of K phrase alignments for (e, f); π, permutation of the alignments for target word order; h, sequence of M feature functions; λ, sequence of learned weights for the M features; H, a priority queue of hypotheses
    Page 3, “A Class-based Model of Agreement”
  3. 3.3 Translation Model Features
    Page 5, “Inference during Translation Decoding”
  4. Factored Translation Models Factored translation models (Koehn and Hoang, 2007) facilitate a more data-oriented approach to agreement modeling.
    Page 5, “Related Work”
  5. Subotin (2011) recently extended factored translation models to hierarchical phrase-based translation and developed a discriminative model for predicting target-side morphology in English-Czech.
    Page 5, “Related Work”
  6. We trained the translation model on 502 million words of parallel text collected from a variety of sources, including the Web.
    Page 7, “Experiments”
  7. This large gap between the unigram recall of the actual translation output (top) and the lexical coverage of the phrase-based model (bottom) indicates that translation performance can be improved dramatically by altering the translation model through features such as ours, without expanding the search space of the decoder.
    Page 8, “Discussion of Translation Results”


language model

Appears in 7 sentences as: language model (6) language models (1)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. Intuition might suggest that the standard n-gram language model (LM) is sufficient to handle agreement phenomena.
    Page 1, “Introduction”
  2. However, in MT, we seek a measure of sentence quality q(e) that is comparable across different hypotheses on the beam (much like the n-gram language model score).
    Page 4, “A Class-based Model of Agreement”
  3. We trained a simple add-1 smoothed bigram language model over gold class sequences in the same treebank training data:
    Page 4, “A Class-based Model of Agreement”
  4. With a trigram language model, the state might be the last two words of the translation prefix.
    Page 4, “Inference during Translation Decoding”
  5. Monz (2011) recently investigated parameter estimation for POS-based language models, but his classes did not include inflectional features.
    Page 6, “Related Work”
  6. One exception was the quadratic-time dependency language model presented by Galley and Manning (2009).
    Page 6, “Related Work”
  7. Our distributed 4—gram language model was trained on 600 million words of Arabic text, also collected from many sources including the Web (Brants et al., 2007).
    Page 7, “Experiments”
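Sentence 3 above mentions a simple add-1 smoothed bigram language model trained over gold class sequences. A minimal sketch of such a model, assuming ordinary <s>/</s> boundary padding (the paper's exact formulation may differ):

```python
import math
from collections import Counter

def train_bigram_lm(class_sequences):
    """Add-1 (Laplace) smoothed bigram model over class sequences.

    A sketch of the generative agreement LM described above: sequences
    are padded with <s>/</s> boundary markers and every bigram count is
    incremented by one before normalization.
    """
    unigrams, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for seq in class_sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    V = len(vocab)

    def prob(prev, cur):
        # add-1 smoothing: unseen bigrams still get nonzero probability
        return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
    return prob

def sequence_logprob(prob, seq):
    """Log-probability of a full class sequence under the bigram model."""
    padded = ["<s>"] + list(seq) + ["</s>"]
    return sum(math.log(prob(p, c)) for p, c in zip(padded, padded[1:]))
```

Because the model is a bigram, only the last class is needed as state, which matches the aggressive hypothesis-recombination strategy mentioned in the excerpts.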


bigram

Appears in 5 sentences as: Bigram (1) bigram (4)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. The features are indicators for (character, position, label) triples for a five Character window and bigram label transition indicators.
    Page 3, “A Class-based Model of Agreement”
  2. Bigram transition features gbt encode local agreement relations.
    Page 3, “A Class-based Model of Agreement”
  3. We trained a simple add-1 smoothed bigram language model over gold class sequences in the same treebank training data:
    Page 4, “A Class-based Model of Agreement”
  4. We chose a bigram model due to the aggressive recombination strategy in our phrase-based decoder.
    Page 4, “A Class-based Model of Agreement”
  5. For contexts in which the LM is guaranteed to back off (for instance, after an unseen bigram), our decoder maintains only the minimal state needed (perhaps only a single word).
    Page 4, “A Class-based Model of Agreement”


treebanks

Appears in 5 sentences as: Treebank (1) treebank (1) treebanks (3)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. More than 25 treebanks (in 22 languages) can be automatically mapped to this tag set, which includes “Noun” (nominals), “Verb” (verbs), “Adj” (adjectives), and “ADP” (pre- and postpositions).
    Page 3, “A Class-based Model of Agreement”
  2. Many of these treebanks also contain per-token morphological annotations.
    Page 3, “A Class-based Model of Agreement”
  3. We trained a simple add-1 smoothed bigram language model over gold class sequences in the same treebank training data:
    Page 4, “A Class-based Model of Agreement”
  4. Experimental Setup All experiments use the Penn Arabic Treebank (ATB) (Maamouri et al., 2004) parts 1–3 divided into training/dev/test sections according to the canonical split (Rambow et al., 2005).7
    Page 6, “Experiments”
  5. The model can be implemented with a standard CRF package, trained on existing treebanks for many languages, and integrated easily with many MT feature APIs.
    Page 8, “Conclusion and Outlook”


model scores

Appears in 5 sentences as: model score (2) model scores (3)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. Our model scores hypotheses during decoding.
    Page 1, “Introduction”
  2. The agreement model scores sequences of morpho-syntactic word classes, which express grammatical features relevant to agreement.
    Page 2, “A Class-based Model of Agreement”
  3. However, in MT, we seek a measure of sentence quality q(e) that is comparable across different hypotheses on the beam (much like the n-gram language model score).
    Page 4, “A Class-based Model of Agreement”
  4. Discriminative model scores have been used as MT features (Galley and Manning, 2009), but we obtained better results by scoring the 1-best class sequences with a generative model.
    Page 4, “A Class-based Model of Agreement”
  5. The agreement model score is one decoder feature function.
    Page 5, “Inference during Translation Decoding”


phrase table

Appears in 5 sentences as: Phrase Table (1) phrase table (5)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. The model does not require bitext or phrase table annotations and can be easily implemented as a feature in many phrase-based decoders.
    Page 1, “Abstract”
  2. Unlike previous models for scoring syntactic relations, our model does not require bitext annotations, phrase table features, or decoder modifications.
    Page 1, “Introduction”
  3. Phrase Table Coverage In a standard phrase-based system, effective translation into a highly inflected target language requires that the phrase table contain the inflected word forms necessary to construct an output with correct agreement.
    Page 8, “Discussion of Translation Results”
  4. During development, we observed that the phrase table of our large-scale English-Arabic system did often contain the inflected forms that we desired the system to select.
    Page 8, “Discussion of Translation Results”
  5. Our class-based agreement model improves translation quality by promoting local agreement, but with a minimal increase in decoding time and no additional storage requirements for the phrase table.
    Page 8, “Conclusion and Outlook”


MT system

Appears in 4 sentences as: MT system (3) MT systems (1)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. The MT system selects the correct verb stem, but with masculine inflection.
    Page 1, “Introduction”
  2. Agreement relations that cross statistical phrase boundaries are not explicitly modeled in most phrase-based MT systems (Avramidis and Koehn, 2008).
    Page 1, “Introduction”
  3. However, we conduct CRF inference in tandem with the translation decoding procedure (§3), creating an environment in which subsequent words of the observation are not available; the MT system has yet to generate the rest of the translation when the tagging features for a position are scored.
    Page 3, “A Class-based Model of Agreement”
  4. To our knowledge, Uszkoreit and Brants (2008) are the only recent authors to show an improvement in a state-of-the-art MT system using class-based LMs.
    Page 5, “Related Work”


statistically significant

Appears in 4 sentences as: statistical significance (1) statistically significant (3)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. The MT06 result is statistically significant at p ≤ 0.01; MT08 is significant at p ≤ 0.02.
    Page 7, “Experiments”
  2. We evaluated translation quality with BLEU-4 (Papineni et al., 2002) and computed statistical significance with the approximate randomization method of Riezler and Maxwell (2005).9
    Page 7, “Experiments”
  3. We realized smaller, yet statistically significant, gains on the mixed genre data sets.
    Page 7, “Discussion of Translation Results”
  4. The baseline contained 78 errors, while our system produced 66 errors, a statistically significant 15.4% error reduction at p ≤ 0.01 according to a paired t-test.
    Page 8, “Discussion of Translation Results”


beam search

Appears in 4 sentences as: beam search (4)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. This decoding problem is NP-hard, thus a beam search is often used (Fig. 4).
    Page 4, “Inference during Translation Decoding”
  2. The beam search relies on three operations, two of which affect the agreement model:
    Page 4, “Inference during Translation Decoding”
  3. Figure 4: Breadth-first beam search algorithm of Och and Ney (2004).
    Page 4, “Inference during Translation Decoding”
  4. The beam search maintains state for each derivation, the score of which is a linear combination of the feature values.
    Page 4, “Inference during Translation Decoding”
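The operations named above, extending a hypothesis with a new phrase pair and recombining hypotheses with identical states, can be sketched as a toy monotone beam search (names, states, and scores are invented; a real phrase-based decoder also tracks coverage sets, LM state, and distortion):

```python
def beam_search(source, phrases, score, beam_size=5):
    """Monotone beam-search sketch with hypothesis recombination.

    `phrases` maps each source word to candidate target words, and
    `score(prev, tgt)` is an invented log-score for appending `tgt`
    after `prev`. Hypotheses with identical states, here the pair
    (source words covered, last target word), are recombined so that
    only the best-scoring one survives.
    """
    # state -> (score, partial output); start with the empty hypothesis
    beam = {(0, "<s>"): (0.0, ())}
    for i, src in enumerate(source):
        expanded = {}
        for (_, last), (sc, out) in beam.items():
            # extend each hypothesis with a new phrase (here: one word)
            for tgt in phrases[src]:
                state = (i + 1, tgt)
                cand = (sc + score(last, tgt), out + (tgt,))
                # recombination: keep only the best hypothesis per state
                if state not in expanded or cand[0] > expanded[state][0]:
                    expanded[state] = cand
        # histogram pruning down to the beam size
        beam = dict(sorted(expanded.items(),
                           key=lambda kv: -kv[1][0])[:beam_size])
    return list(max(beam.values())[1])
```

Recombination is what makes the agreement model's small (bigram) state so attractive: hypotheses that agree on the last class and coverage collapse into one beam entry.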


fine-grained

Appears in 4 sentences as: fine-grained (4)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. We address this shortcoming with an agreement model that scores sequences of fine-grained morpho-syntactic classes.
    Page 1, “Introduction”
  2. After segmentation, we tag each segment with a fine-grained morpho-syntactic class.
    Page 3, “A Class-based Model of Agreement”
  3. For training the tagger, we automatically converted the ATB morphological analyses to the fine-grained class set.
    Page 6, “Experiments”
  4. Finally, +POS+Agr shows the class-based model with the fine-grained classes (e.g., “Noun+Fem+Sg”).
    Page 7, “Discussion of Translation Results”


phrase pair

Appears in 3 sentences as: phrase pair (2) Phrase pairs (1)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. • Extend a hypothesis with a new phrase pair • Recombine hypotheses with identical states
    Page 4, “Inference during Translation Decoding”
  2. Och (1999) showed a method for inducing bilingual word classes that placed each phrase pair into a two-dimensional equivalence class.
    Page 5, “Related Work”
  3. • Baseline system translation output: 44.6% • Phrase pairs matching source n-grams: 67.8%
    Page 8, “Discussion of Translation Results”
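The coverage percentages in sentence 3 above compare the words actually produced against the inflected forms available in the phrase table; elsewhere the paper calls this unigram recall. A tiny sketch of that statistic (assuming tokenized input; the paper's exact computation may differ):

```python
from collections import Counter

def unigram_recall(hypothesis, reference):
    """Clipped unigram recall: the fraction of reference tokens the
    hypothesis reproduces, counting each word type at most as often as
    it occurs in the hypothesis. A sketch of the lexical-coverage
    statistic discussed above."""
    hyp, ref = Counter(hypothesis), Counter(reference)
    matched = sum(min(count, hyp[word]) for word, count in ref.items())
    return matched / sum(ref.values())
```

Running it once with the reference against the system output, and once against the full set of phrase-table targets matching the source, gives the kind of top/bottom gap the excerpts describe.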


POS tags

Appears in 3 sentences as: POS tag (1) POS tags (2)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. The coarse categories are the universal POS tag set described by Petrov et al.
    Page 3, “A Class-based Model of Agreement”
  2. For Arabic, we used the coarse POS tags plus definiteness and the so-called phi features (gender, number, and person).4 For example, السيارة ‘the car’ would be tagged “Noun+Def+Sg+Fem”.
    Page 3, “A Class-based Model of Agreement”
  3. For comparison, +POS indicates our class-based model trained on the 11 coarse POS tags only (e.g., “Noun”).
    Page 7, “Discussion of Translation Results”


model training

Appears in 3 sentences as: model trained (1) model training (2)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. For comparison, +POS indicates our class-based model trained on the 11 coarse POS tags only (e.g., “Noun”).
    Page 7, “Discussion of Translation Results”
  2. The best result—a +1.04 BLEU average gain—was achieved when the class-based model training data, MT tuning set, and MT evaluation set contained the same genre.
    Page 7, “Discussion of Translation Results”
  3. We achieved best results when the model training data, MT tuning set, and MT evaluation set contained roughly the same genre.
    Page 8, “Conclusion and Outlook”


development set

Appears in 3 sentences as: development set (3)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. Table 1: Intrinsic evaluation accuracy [%] (development set) for Arabic segmentation and tagging.
    Page 6, “Experiments”
  2. Tbl. 1 shows development set accuracy for two settings.
    Page 6, “Experiments”
  3. We tuned the feature weights on a development set using lattice-based minimum error rate training (MERT) (Macherey et al.,
    Page 6, “Experiments”


BLEU

Appears in 3 sentences as: BLEU (3)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline.
    Page 1, “Abstract”
  2. For English-to-Arabic translation, we achieve a +1.04 BLEU average improvement by tiling our model on top of a large LM.
    Page 1, “Introduction”
  3. The best result—a +1.04 BLEU average gain—was achieved when the class-based model training data, MT tuning set, and MT evaluation set contained the same genre.
    Page 7, “Discussion of Translation Results”


Viterbi

Appears in 3 sentences as: Viterbi (3)
In A Class-Based Agreement Model for Generating Accurately Inflected Translations
  1. It can be learned from gold-segmented data, generally applies to languages with bound morphemes, and does not require a hand-compiled lexicon.3 Moreover, it has only four labels, so Viterbi decoding is very fast.
    Page 3, “A Class-based Model of Agreement”
  2. Incremental Greedy Decoding Decoding with the CRF-based tagger model in this setting requires some slight modifications to the Viterbi algorithm.
    Page 5, “Inference during Translation Decoding”
  3. This forces the Viterbi path to go through τ̃.
    Page 5, “Inference during Translation Decoding”
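The incremental modification described above pins the class labels of the already-tagged prefix so that the Viterbi path is forced through it and only the new suffix is searched. A toy sketch of such a constrained Viterbi (all labels and potentials invented):

```python
def constrained_viterbi(obs, states, emit, trans, forced):
    """Viterbi decoding with the first len(forced) labels pinned.

    A sketch of the incremental modification described above: pinning a
    label prefix forces the Viterbi path through it, so only the suffix
    is actually searched. Potentials are illustrative log-scores.
    """
    NEG = -1e9
    def allowed(i):
        # positions inside the pinned prefix admit exactly one label
        return [forced[i]] if i < len(forced) else states
    delta = {s: (emit.get((s, obs[0]), NEG) if s in allowed(0) else NEG)
             for s in states}
    backptrs = []
    for i, sym in enumerate(obs[1:], start=1):
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best_p = max(states, key=lambda p: prev[p] + trans.get((p, s), NEG))
            ok = s in allowed(i)
            delta[s] = (prev[best_p] + trans.get((best_p, s), NEG)
                        + emit.get((s, sym), NEG)) if ok else NEG
            ptr[s] = best_p
        backptrs.append(ptr)
    path = [max(delta, key=delta.get)]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

During decoding the already-committed prefix supplies `forced`, so each newly generated phrase only triggers a short suffix search rather than re-tagging from scratch.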
