Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
Vaswani, Ashish and Huang, Liang and Chiang, David

Article Structure

Abstract

Two decades after their invention, the IBM word-based translation models, widely available in the GIZA++ toolkit, remain the dominant approach to word alignment and an integral part of many statistical translation systems.

Introduction

Automatic word alignment is a vital component of nearly all current statistical translation pipelines.

Method

We start with a brief review of the IBM and HMM word alignment models, then describe how to extend them with a smoothed ℓ0 prior and how to efficiently train them.

Experiments

To demonstrate the effect of the ℓ0-norm on the IBM models, we performed experiments on four translation tasks: Arabic-English, Chinese-English, and Urdu-English from the NIST Open MT Evaluation, and the Czech-English translation from the Workshop on Machine Translation (WMT) shared task.

Related Work

Schoenemann (2011a), taking inspiration from Bodrumlu et al.

Conclusion

We have extended the IBM models and HMM model by the addition of an ℓ0 prior to the word-to-word translation model, which compacts the word-to-word translation table, reducing overfitting, and, in particular, the “garbage collection” effect.

Topics

word alignment

Appears in 17 sentences as: word alignment (11) word alignments (6)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. Two decades after their invention, the IBM word-based translation models, widely available in the GIZA++ toolkit, remain the dominant approach to word alignment and an integral part of many statistical translation systems.
    Page 1, “Abstract”
  2. We explain how to implement this extension efficiently for large-scale data (also released as a modification to GIZA++) and demonstrate, in experiments on Czech, Arabic, Chinese, and Urdu to English translation, significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 BLEU).
    Page 1, “Abstract”
  3. Automatic word alignment is a vital component of nearly all current statistical translation pipelines.
    Page 1, “Introduction”
  4. Although state-of-the-art translation models use rules that operate on units bigger than words (like phrases or tree fragments), they nearly always use word alignments to drive extraction of those translation rules.
    Page 1, “Introduction”
  5. The dominant approach to word alignment has been the IBM models (Brown et al., 1993) together with the HMM model (Vogel et al., 1996).
    Page 1, “Introduction”
  6. This extension follows our previous work on unsupervised part-of-speech tagging (Vaswani et al., 2010), but enables it to scale to the large datasets typical in word alignment, using an efficient training method based on projected gradient descent (Section 2.3).
    Page 1, “Introduction”
  7. Experiments on Czech-, Arabic-, Chinese- and Urdu-English translation (Section 3) demonstrate consistent significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 BLEU).
    Page 1, “Introduction”
  8. We start with a brief review of the IBM and HMM word alignment models, then describe how to extend them with a smoothed ℓ0 prior and how to efficiently train them.
    Page 2, “Method”
  9. In word alignment, one well-known manifestation of overfitting is that rare words can act as “garbage collectors”
    Page 2, “Method”
  10. Previously (Vaswani et al., 2010), we used ALGENCAN, a nonlinear optimization toolkit, but this solution does not scale well to the number of parameters involved in word alignment models.
    Page 3, “Method”
  11. We measured the accuracy of word alignments generated by GIZA++ with and without the ℓ0-norm,
    Page 4, “Experiments”

hyperparameters

Appears in 9 sentences as: hyperparameter (2) hyperparameters (8)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. The hyperparameter β controls the tightness of the approximation, as illustrated in Figure 1.
    Page 2, “Method”
  2. We have implemented our algorithm as an open-source extension to GIZA++. Usage of the extension is identical to standard GIZA++, except that the user can switch the ℓ0 prior on or off, and adjust the hyperparameters α and β.
    Page 4, “Experiments”
  3. We set the hyperparameters α and β by tuning on gold-standard word alignments (to maximize F1) when possible.
    Page 6, “Experiments”
  4. The fact that we had to use hand-aligned data to tune the hyperparameters α and β means that our method is no longer completely unsupervised.
    Page 6, “Experiments”
  5. However, our observation is that alignment accuracy is actually fairly robust to the choice of these hyperparameters, as shown in Table 2.
    Page 6, “Experiments”
  6. Table 4: Optimizing hyperparameters on alignment F1 score does not necessarily lead to optimal BLEU.
    Page 7, “Experiments”
  7. All of the tests showed significant improvements (p < 0.01), ranging from +0.4 BLEU to +1.4 BLEU. For Urdu, even though we didn’t have manual alignments to tune hyperparameters, we got significant gains over a good baseline.
    Page 8, “Experiments”
  8. We ran some contrastive experiments to investigate the impact of hyperparameter tuning on translation quality.
    Page 8, “Experiments”
  9. Even though we have used a small set of gold-standard alignments to tune our hyperparameters, we found that performance was fairly robust to variation in the hyperparameters, and translation performance was good even when gold-standard alignments were unavailable.
    Page 8, “Conclusion”
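
The excerpts above introduce the two hyperparameters: α, the weight of the ℓ0 prior, and β, which controls how tightly the smooth penalty approximates the true ℓ0 norm. The exact functional form is not reproduced in these excerpts, so the sketch below uses a common Gaussian-style smoothing, 1 − exp(−θ²/β), purely as an illustration; the function names and the specific smoothing are assumptions, not the paper's implementation.

```python
import numpy as np

def smoothed_l0_penalty(theta, alpha, beta):
    """Illustrative smoothed l0 penalty on a parameter vector theta.

    Uses the Gaussian-style smoothing 1 - exp(-theta**2 / beta) as a stand-in
    for counting nonzero parameters. alpha weights the penalty; beta controls
    the tightness of the approximation (smaller beta -> closer to the true
    l0 count, but a harder surface to optimize).
    """
    theta = np.asarray(theta, dtype=float)
    return alpha * np.sum(1.0 - np.exp(-theta ** 2 / beta))

# As beta shrinks, the penalty approaches alpha times the number of
# entries that are not (effectively) zero in this toy t(f|e) fragment.
row = np.array([0.0, 1e-4, 0.3, 0.7])
for beta in (1.0, 1e-2, 1e-6):
    print(beta, smoothed_l0_penalty(row, alpha=1.0, beta=beta))
```

In practice α and β would be tuned on held-out gold-standard alignments, as in sentences 3–5 above, rather than fixed by hand.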

NIST

Appears in 8 sentences as: NIST (9)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. To demonstrate the effect of the ℓ0-norm on the IBM models, we performed experiments on four translation tasks: Arabic-English, Chinese-English, and Urdu-English from the NIST Open MT Evaluation, and the Czech-English translation from the Workshop on Machine Translation (WMT) shared task.
    Page 4, “Experiments”
  2. • Chinese-English: selected data from the constrained task of the NIST 2009 Open MT Evaluation.
    Page 4, “Experiments”
  3. • Arabic-English: all available data for the constrained track of NIST 2009, excluding United Nations proceedings (LDC2004E13), ISI Automatically Extracted Parallel Text (LDC2007E08), and Ummah newswire text (LDC2004T18), for a total of 5.4+4.3 million words.
    Page 4, “Experiments”
  4. • Urdu-English: all available data for the constrained track of NIST 2009.
    Page 4, “Experiments”
  5. We used two 5-gram language models, one on the combined English sides of the NIST 2009 Arabic-English and Chinese-English constrained tracks (385M words), and another on 2 billion words of English.
    Page 7, “Experiments”
  6. The development data that were used for discriminative training were: for Chinese-English and Arabic-English, data from the NIST 2004 and NIST 2006 test sets, plus newsgroup data from the
    Page 7, “Experiments”
  7. GALE program (LDC2006E92); for Urdu-English, half of the NIST 2008 test set; for Czech-English, a training set of 2051 sentences provided by the WMT10 translation workshop.
    Page 8, “Experiments”
  8. Unfortunately, we find that optimizing F1 is not optimal for BLEU; using the second-best alignments yields a further improvement of 0.5 BLEU on the NIST 2009 data, which is statistically significant (p < 0.05).
    Page 8, “Experiments”

translation quality

Appears in 7 sentences as: translation quality (7)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. We explain how to implement this extension efficiently for large-scale data (also released as a modification to GIZA++) and demonstrate, in experiments on Czech, Arabic, Chinese, and Urdu to English translation, significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 BLEU).
    Page 1, “Abstract”
  2. In this paper, we propose a simple extension to the IBM/HMM models that is unsupervised like the IBM models, is as scalable as GIZA++ because it is implemented on top of GIZA++, and provides significant improvements in both alignment and translation quality.
    Page 1, “Introduction”
  3. Experiments on Czech-, Arabic-, Chinese- and Urdu-English translation (Section 3) demonstrate consistent significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 BLEU).
    Page 1, “Introduction”
  4. As we will see below, we still obtained strong improvements in translation quality when hand-aligned data was unavailable.
    Page 6, “Experiments”
  5. We then tested the effect of word alignments on translation quality using the hierarchical phrase-based translation system Hiero (Chiang, 2007).
    Page 7, “Experiments”
  6. We ran some contrastive experiments to investigate the impact of hyperparameter tuning on translation quality.
    Page 8, “Experiments”
  7. The method is implemented as a modification to the open-source toolkit GIZA++, and we have shown that it significantly improves translation quality across four different language pairs.
    Page 8, “Conclusion”

Chinese-English

Appears in 6 sentences as: Chinese-English (6)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. To demonstrate the effect of the ℓ0-norm on the IBM models, we performed experiments on four translation tasks: Arabic-English, Chinese-English, and Urdu-English from the NIST Open MT Evaluation, and the Czech-English translation from the Workshop on Machine Translation (WMT) shared task.
    Page 4, “Experiments”
  2. • Chinese-English: selected data from the constrained task of the NIST 2009 Open MT Evaluation.
    Page 4, “Experiments”
  3. For Arabic-English and Chinese-English, we used 346 and 184 hand-aligned sentences from LDC2006E86 and LDC2006E93.
    Page 6, “Experiments”
  4. Figure 2 shows four examples of Chinese-English alignment, comparing the baseline with our smoothed-ℓ0 method.
    Page 7, “Experiments”
  5. We used two 5-gram language models, one on the combined English sides of the NIST 2009 Arabic-English and Chinese-English constrained tracks (385M words), and another on 2 billion words of English.
    Page 7, “Experiments”
  6. The development data that were used for discriminative training were: for Chinese-English and Arabic-English, data from the NIST 2004 and NIST 2006 test sets, plus newsgroup data from the
    Page 7, “Experiments”

translation model

Appears in 6 sentences as: translation model (3) translation models (3)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. Two decades after their invention, the IBM word-based translation models, widely available in the GIZA++ toolkit, remain the dominant approach to word alignment and an integral part of many statistical translation systems.
    Page 1, “Abstract”
  2. In this paper, we propose a simple extension to the IBM models: an ℓ0 prior to encourage sparsity in the word-to-word translation model.
    Page 1, “Abstract”
  3. Although state-of-the-art translation models use rules that operate on units bigger than words (like phrases or tree fragments), they nearly always use word alignments to drive extraction of those translation rules.
    Page 1, “Introduction”
  4. It extends the IBM/HMM models by incorporating an ℓ0 prior, inspired by the principle of minimum description length (Barron et al., 1998), to encourage sparsity in the word-to-word translation model (Section 2.2).
    Page 1, “Introduction”
  5. Table 4 shows BLEU scores for translation models learned from these alignments.
    Page 8, “Experiments”
  6. We have extended the IBM models and HMM model by the addition of an ℓ0 prior to the word-to-word translation model, which compacts the word-to-word translation table, reducing overfitting, and, in particular, the “garbage collection” effect.
    Page 8, “Conclusion”

significant improvements

Appears in 5 sentences as: significant improvements (4) significantly improves (1)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. We explain how to implement this extension efficiently for large-scale data (also released as a modification to GIZA++) and demonstrate, in experiments on Czech, Arabic, Chinese, and Urdu to English translation, significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 BLEU).
    Page 1, “Abstract”
  2. In this paper, we propose a simple extension to the IBM/HMM models that is unsupervised like the IBM models, is as scalable as GIZA++ because it is implemented on top of GIZA++, and provides significant improvements in both alignment and translation quality.
    Page 1, “Introduction”
  3. Experiments on Czech-, Arabic-, Chinese- and Urdu-English translation (Section 3) demonstrate consistent significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 BLEU).
    Page 1, “Introduction”
  4. All of the tests showed significant improvements (p < 0.01), ranging from +0.4 BLEU to +1.4 BLEU. For Urdu, even though we didn’t have manual alignments to tune hyperparameters, we got significant gains over a good baseline.
    Page 8, “Experiments”
  5. The method is implemented as a modification to the open-source toolkit GIZA++, and we have shown that it significantly improves translation quality across four different language pairs.
    Page 8, “Conclusion”

language pair

Appears in 4 sentences as: language pair (2) language pairs (2)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. These models are unsupervised, making them applicable to any language pair for which parallel text is available.
    Page 1, “Introduction”
  2. Although manually-aligned data is very valuable, it is only available for a small number of language pairs.
    Page 1, “Introduction”
  3. For each language pair, we extracted grammar rules from the same data that were used for word alignment.
    Page 7, “Experiments”
  4. The method is implemented as a modification to the open-source toolkit GIZA++, and we have shown that it significantly improves translation quality across four different language pairs.
    Page 8, “Conclusion”

overfitting

Appears in 4 sentences as: overfitting (4)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. Maximum likelihood training is prone to overfitting, especially in models with many parameters.
    Page 2, “Method”
  2. In word alignment, one well-known manifestation of overfitting is that rare words can act as “garbage collectors”
    Page 2, “Method”
  3. We have previously proposed another simple remedy to overfitting in the context of unsupervised part-of-speech tagging (Vaswani et al., 2010), which is to minimize the size of the model using a smoothed ℓ0 prior.
    Page 2, “Method”
  4. We have extended the IBM models and HMM model by the addition of an ℓ0 prior to the word-to-word translation model, which compacts the word-to-word translation table, reducing overfitting, and, in particular, the “garbage collection” effect.
    Page 8, “Conclusion”

baseline system

Appears in 3 sentences as: baseline system (3)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. We found that adding word classes improved alignment quality a little, but more so for the baseline system (see Table 3).
    Page 6, “Experiments”
  2. Table 3: Adding word classes improves the F-score in both directions for Arabic-English alignment by a little, for the baseline system more so than ours.
    Page 7, “Experiments”
  3. In particular, the baseline system demonstrates typical “garbage collection” behavior (Moore, 2004) in all four examples.
    Page 7, “Experiments”

gold-standard

Appears in 3 sentences as: gold-standard (4)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. We set the hyperparameters α and β by tuning on gold-standard word alignments (to maximize F1) when possible.
    Page 6, “Experiments”
  2. First, we evaluated alignment accuracy directly by comparing against gold-standard word alignments.
    Page 6, “Experiments”
  3. Even though we have used a small set of gold-standard alignments to tune our hyperparameters, we found that performance was fairly robust to variation in the hyperparameters, and translation performance was good even when gold-standard alignments were unavailable.
    Page 8, “Conclusion”
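
The sentences above describe tuning α and β to maximize F1 against a small set of gold-standard word alignments. A minimal sketch of that evaluation loop, assuming alignments are represented as sets of (source position, target position) links; `alignment_f1`, `tune_hyperparameters`, and the `align` callback are hypothetical helpers, not part of GIZA++:

```python
def alignment_f1(predicted, gold):
    """F1 of predicted alignment links against gold-standard links.

    Both arguments are sets of (source_position, target_position) pairs,
    pooled over all hand-aligned sentence pairs. Gold links are treated as
    a single set; no sure/possible distinction is modeled here.
    """
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def tune_hyperparameters(align, gold, alphas, betas):
    """Grid search: pick the (alpha, beta) whose alignments score highest in F1.

    `align(alpha, beta)` is a callback that runs the aligner with the l0 prior
    switched on and returns the predicted link set for the hand-aligned data.
    """
    return max(((a, b) for a in alphas for b in betas),
               key=lambda ab: alignment_f1(align(*ab), gold))
```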

objective function

Appears in 3 sentences as: objective function (3)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. With the addition of the ℓ0 prior, the MAP (maximum a posteriori) objective function is
    Page 2, “Method”
  2. Let F(θ) be the objective function in
    Page 3, “Method”
  3. (Note that we don’t allow m = 0 because this can cause the updated parameters to land on the boundary of the probability simplex, where the objective function is undefined.)
    Page 4, “Method”

optimization problem

Appears in 3 sentences as: optimization problem (2) optimization problems (1)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. Substituting back into (4) and dropping constant terms, we get the following optimization problem: minimize
    Page 2, “Method”
  2. This optimization problem is non-convex, and we do not know of a closed-form solution.
    Page 3, “Method”
  3. Gradient projection methods are attractive solutions to constrained optimization problems, particularly when the constraints on the parameters are simple (Bertsekas, 1999).
    Page 3, “Method”
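
Sentence 3 notes that gradient projection methods fit this problem because the constraints are simple: each conditional distribution t(·|e) must remain a point on the probability simplex. Below is a minimal sketch of one projected-gradient update using the standard sorting-based Euclidean projection onto the simplex; the step-size rule, the line search, and the gradient of the paper's actual objective are not reproduced here, so `grad` and `step_size` are placeholders.

```python
import numpy as np

def project_onto_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1},
    via the standard sorting-based algorithm."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                      # sort descending
    cumulative = np.cumsum(u)
    ranks = np.arange(1, v.size + 1)
    rho = np.nonzero(u * ranks > cumulative - 1.0)[0][-1]
    tau = (cumulative[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def projected_gradient_step(theta, grad, step_size):
    """One gradient-projection update for a single distribution t(.|e):
    step against the gradient of the objective, then project back onto the
    simplex so the result is again a valid probability distribution."""
    return project_onto_simplex(theta - step_size * grad)

# Toy example: a 4-word translation distribution nudged by a made-up gradient.
theta = np.array([0.4, 0.3, 0.2, 0.1])
grad = np.array([0.5, -0.2, 0.1, -0.4])       # placeholder gradient values
print(projected_gradient_step(theta, grad, step_size=0.5))
```

In the full training procedure such a step would be applied to each source word's translation distribution, with the step size chosen by a line search, subject to the caveat in the "objective function" sentences above about not landing on the simplex boundary.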

translation systems

Appears in 3 sentences as: translation system (1) translation systems (2)
In Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
  1. Two decades after their invention, the IBM word-based translation models, widely available in the GIZA++ toolkit, remain the dominant approach to word alignment and an integral part of many statistical translation systems.
    Page 1, “Abstract”
  2. We then tested the effect of word alignments on translation quality using the hierarchical phrase-based translation system Hiero (Chiang, 2007).
    Page 7, “Experiments”
  3. We hope that our method, due to its simplicity, generality, and effectiveness, will find wide application for training better statistical translation systems.
    Page 8, “Conclusion”
