Fast and Adaptive Online Training of Feature-Rich Translation Models
Spence Green, Sida Wang, Daniel Cer, and Christopher D. Manning

Article Structure

Abstract

We present a fast and scalable online method for tuning statistical machine translation models with large feature sets.

Introduction

Sparse, overlapping features such as words and n-gram contexts improve many NLP systems such as parsers and taggers.

Adaptive Online Algorithms

Machine translation is an unusual machine learning setting because multiple correct translations exist and decoding is comparatively expensive.

Adaptive Online MT

Algorithm 1 shows the full algorithm introduced in this paper.

Experiments

We built Arabic-English and Chinese-English MT systems with Phrasal (Cer et al., 2010), a phrase-based system based on alignment templates (Och and Ney, 2004).

Analysis

5.1 Feature Overlap Analysis

Related Work

Our work relates most closely to that of Hasler et al.

Conclusion and Outlook

We introduced a new online method for tuning feature-rich translation models.

Topics

feature set

Appears in 14 sentences as: feature set (11) feature sets (3)
In Fast and Adaptive Online Training of Feature-Rich Translation Models
  1. We present a fast and scalable online method for tuning statistical machine translation models with large feature sets.
    Page 1, “Abstract”
  2. When we have a large feature set and therefore want to tune on a large data set, batch methods are infeasible.
    Page 2, “Adaptive Online Algorithms”
  3. For example, simple indicator features like lexicalized reordering classes are potentially useful yet bloat the feature set and, in the worst case, can negatively impact
    Page 4, “Adaptive Online MT”
  4. To the dense features we add three high dimensional “sparse” feature sets.
    Page 5, “Experiments”
  5. The primary baseline is the dense feature set tuned with MERT (Och, 2003).
    Page 5, “Experiments”
  6. with the PT feature set.
    Page 6, “Experiments”
  7. Moses PRO with the PT feature set is slightly worse, e.g., 44.52 on MT09.
    Page 6, “Experiments”
  8. The full feature set PT+AL+LO does help.
    Page 6, “Experiments”
  9. With the PT feature set alone, our algorithm tuned on MT05/6/8 scores well below the best model, e.g.
    Page 6, “Experiments”
  10. PRO learns a smaller model with the PT+AL+LO feature set, which is surprising given that it applies L2 regularization (AdaGrad uses L1).
    Page 7, “Experiments”
  11. Domain mismatch matters for the dense feature set (Haddow and Koehn, 2012).
    Page 7, “Experiments”

phrase table

Appears in 11 sentences as: Phrase table (1) phrase table (12)
In Fast and Adaptive Online Training of Feature-Rich Translation Models
  1. Finally, large data structures such as the language model (LM) and phrase table exist in shared memory, obviating the need for remote queries.
    Page 5, “Adaptive Online MT”
  2. Discriminative phrase table (PT): indicators for each rule in the phrase table.
    Page 5, “Experiments”
  3. Moses also contains the discriminative phrase table implementation of Hasler et al. (2012b), which is identical to our implementation using Phrasal.
    Page 5, “Experiments”
  4. Moses and Phrasal accept the same phrase table and LM formats, so we kept those data structures in common.
    Page 5, “Experiments”
  5. When we add the discriminative phrase table, our algorithm improves over kb-MIRA, and over batch PRO on two evaluation sets.
    Page 7, “Experiments”
  6. Table 6: Number of overlapping phrase table (+PT) features on various Zh-En dataset pairs.
    Page 7, “Experiments”
  7. In Table 6, $A$ is the set of phrase table features that received a nonzero weight when tuned on dataset $D_A$ (same for $B$); a minimal sketch of this overlap computation follows this list.
    Page 7, “Analysis”
  8. Phrase table features in $A \cap B$ are overwhelmingly short, simple, and correct phrases, suggesting L1 regularization is effective for feature selection.
    Page 8, “Analysis”
  9. To understand the domain adaptation issue, we compared the nonzero weights in the discriminative phrase table (PT) for Ar-En models tuned on bitext5k and MT05/6/8.
    Page 8, “Analysis”
  10. Bottom: rule counts in the discriminative phrase table (PT) for models tuned on the two tuning sets.
    Page 8, “Analysis”
  11. A discriminative phrase table helped them improve slightly over a dense, online MIRA baseline, but their best results required initialization with MERT-tuned weights and retuning a single, shared weight for the discriminative phrase table with MERT.
    Page 9, “Related Work”
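
The overlap computation in items 7-8 is easy to state concretely. Below is a minimal sketch, assuming sparse weights stored as {feature name: weight} dicts and a common "PT:" name prefix for discriminative phrase table features; the prefix, the rule strings, and the weights shown are illustrative assumptions, not the paper's actual representation.

```python
def nonzero_features(weights, prefix="PT:"):
    """Set of phrase table feature names with nonzero learned weight."""
    return {f for f, w in weights.items() if f.startswith(prefix) and w != 0.0}

# Hypothetical models tuned on two datasets D_A and D_B:
weights_A = {"PT:in ||| fy": 0.31, "PT:house ||| byt": 0.12, "PT:the ||| al": 0.0}
weights_B = {"PT:in ||| fy": 0.27, "PT:said ||| qal": 0.08}

A = nonzero_features(weights_A)
B = nonzero_features(weights_B)
print(len(A & B), sorted(A & B))  # |A ∩ B| and the overlapping rules (cf. Table 6)
```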

loss function

Appears in 7 sentences as: Loss Function (1) loss function (5) loss functions (1)
In Fast and Adaptive Online Training of Feature-Rich Translation Models
  1. We specify the loss function for MT in section 3.1.
    Page 2, “Adaptive Online Algorithms”
  2. The relationship to SGD can be seen by linearizing the loss function $\ell_t(w) \approx \ell_t(w_{t-1}) + (w - w_{t-1})^\top \nabla \ell_t(w_{t-1})$ and taking the derivative of (6).
    Page 2, “Adaptive Online Algorithms”
  3. MIRA/AROW requires selecting the loss function $\ell$ so that $w_t$ can be solved in closed form, by a quadratic program (QP), or in some other way that is better than linearizing.
    Page 3, “Adaptive Online Algorithms”
  4. On the other hand, AdaGrad/linearized AROW only requires that the gradient of the loss function can be computed efficiently.
    Page 3, “Adaptive Online Algorithms”
  5. AdaGrad (lines 9-10) is a crucial piece, but the loss function, regularization technique, and parallelization strategy described in this section are equally important in the MT setting.
    Page 3, “Adaptive Online MT”
  6. 3.1 Pairwise Logistic Loss Function
    Page 3, “Adaptive Online MT”
  7. The pairwise approach results in simple, convex loss functions suitable for online learning; a minimal sketch follows this list.
    Page 3, “Adaptive Online MT”
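
Items 2-5 describe the pieces of the update: a convex pairwise loss whose gradient is cheap to compute, plus per-coordinate AdaGrad step sizes with L1 regularization (item 10 under "feature set" above). The sketch below shows how those pieces can fit together, assuming sparse feature vectors stored as dicts; the hyperparameter values, helper names, and the FOBOS-style soft threshold are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import math

def pairwise_logistic_grad(w, phi_plus, phi_minus):
    """Gradient of log(1 + exp(-(w . (phi_plus - phi_minus)))), where
    phi_plus is the feature vector of the higher-metric hypothesis."""
    delta = {f: phi_plus.get(f, 0.0) - phi_minus.get(f, 0.0)
             for f in set(phi_plus) | set(phi_minus)}
    margin = sum(w.get(f, 0.0) * v for f, v in delta.items())
    coeff = -1.0 / (1.0 + math.exp(margin))   # = -sigmoid(-margin)
    return {f: coeff * v for f, v in delta.items() if v != 0.0}

def adagrad_l1_update(w, G, grad, eta=0.1, lam=1e-4, eps=1e-8):
    """Per-coordinate AdaGrad step with a FOBOS-style L1 soft threshold."""
    for f, g in grad.items():
        G[f] = G.get(f, 0.0) + g * g          # accumulated squared gradient
        rate = eta / (math.sqrt(G[f]) + eps)  # per-coordinate step size
        z = w.get(f, 0.0) - rate * g
        # shrink toward zero: this is what drives L1 feature selection
        w[f] = math.copysign(max(0.0, abs(z) - rate * lam), z)

# Toy usage on one hypothesis pair:
w, G = {}, {}
phi_plus  = {"PT:in ||| fy": 1.0, "LM": -4.2}
phi_minus = {"PT:in ||| mn": 1.0, "LM": -5.0}
adagrad_l1_update(w, G, pairwise_logistic_grad(w, phi_plus, phi_minus))
print(w)
```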

translation quality

Appears in 6 sentences as: translation quality (6)
In Fast and Adaptive Online Training of Feature-Rich Translation Models
  1. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems.
    Page 1, “Abstract”
  2. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features.
    Page 1, “Abstract”
  3. We conduct large-scale translation quality experiments on Arabic-English and Chinese-English.
    Page 1, “Introduction”
  4. Tables 2 and 3 show that adding tuning examples improves translation quality.
    Page 7, “Experiments”
  5. Simianer et al. (2012) showed significant translation quality gains by tuning on the bitext.
    Page 7, “Experiments”
  6. When tuned on bitext5k, the translation quality gains are significant for bitext5k-test relative to tuning on MT05/6/8, which has multiple references.
    Page 7, “Experiments”

machine translation

Appears in 4 sentences as: Machine translation (1) machine translation (3)
In Fast and Adaptive Online Training of Feature-Rich Translation Models
  1. We present a fast and scalable online method for tuning statistical machine translation models with large feature sets.
    Page 1, “Abstract”
  2. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation.
    Page 1, “Abstract”
  3. Adaptation of discriminative learning methods for these types of features to statistical machine translation (MT) systems, which have historically used idiosyncratic learning techniques for a few dense features, has been an active research area for the past half-decade.
    Page 1, “Introduction”
  4. Machine translation is an unusual machine learning setting because multiple correct translations exist and decoding is comparatively expensive.
    Page 2, “Adaptive Online Algorithms”

NIST

Appears in 4 sentences as: NIST (4)
In Fast and Adaptive Online Training of Feature-Rich Translation Models
  1. The first experiment uses standard tuning and test sets from the NIST OpenMT competitions.
    Page 1, “Introduction”
  2. 4.3 NIST OpenMT Experiment
    Page 6, “Experiments”
  3. However, the bitext5k models do not generalize as well to the NIST evaluation sets as represented by the MT04 result.
    Page 7, “Experiments”
  4. The mass is concentrated along the diagonal, probably because MT05/6/8 was prepared by NIST, an American agency, while the bitext was collected from many sources including Agence France Presse.
    Page 8, “Analysis”

Chinese-English

Appears in 3 sentences as: Chinese-English (3)
In Fast and Adaptive Online Training of Feature-Rich Translation Models
  1. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features.
    Page 1, “Abstract”
  2. We conduct large-scale translation quality experiments on Arabic-English and Chinese-English.
    Page 1, “Introduction”
  3. We built Arabic-English and Chinese-English MT systems with Phrasal (Cer et al., 2010), a phrase-based system based on alignment templates (Och and Ney, 2004).
    Page 5, “Experiments”

domain adaptation

Appears in 3 sentences as: Domain Adaptation (1) domain adaptation (2)
In Fast and Adaptive Online Training of Feature-Rich Translation Models
  1. Second, large bitexts often comprise many text genres (Haddow and Koehn, 2012), a virtue for classical dense MT models but a curse for high dimensional models: bitext tuning can lead to a significant domain adaptation problem when evaluating on standard test sets.
    Page 1, “Introduction”
  2. 5.2 Domain Adaptation Analysis
    Page 8, “Analysis”
  3. To understand the domain adaptation issue, we compared the nonzero weights in the discriminative phrase table (PT) for Ar-En models tuned on bitext5k and MT05/6/8.
    Page 8, “Analysis”

lexicalized

Appears in 3 sentences as: lexicalized (3)
In Fast and Adaptive Online Training of Feature-Rich Translation Models
  1. For example, simple indicator features like lexicalized reordering classes are potentially useful yet bloat the feature set and, in the worst case, can negatively impact
    Page 4, “Adaptive Online MT”
  2. The baseline “dense” model contains 19 features: the nine Moses baseline features, the hierarchical lexicalized reordering model of Galley and Manning (2008), the (log) count of each rule, and an indicator for unique rules.
    Page 5, “Experiments”
  3. Discriminative reordering (LO): indicators for eight lexicalized reordering classes, including the six standard monotone/swap/discontinuous classes plus the two simpler Moses monotone/non-monotone classes (see the sketch after this list).
    Page 5, “Experiments”
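
As a concrete illustration of item 3, the sketch below emits one indicator per reordering class for an applied rule: three orientations with respect to the previous phrase, three with respect to the next, and the two coarser Moses classes, eight in total. The previous/next split and all feature name strings are assumptions made for illustration, not the paper's actual feature templates.

```python
# Hypothetical sparse indicator features for the eight lexicalized
# reordering (LO) classes; orientations are assumed to be one of
# "monotone", "swap", or "discontinuous".

def reordering_features(orientation_prev, orientation_next):
    """Emit LO indicator features for one applied rule."""
    feats = {}
    feats[f"LO:prev:{orientation_prev}"] = 1.0   # 3 classes w.r.t. previous phrase
    feats[f"LO:next:{orientation_next}"] = 1.0   # 3 classes w.r.t. next phrase
    coarse = "monotone" if orientation_prev == "monotone" else "non-monotone"
    feats[f"LO:coarse:{coarse}"] = 1.0           # 2 simpler Moses-style classes
    return feats

print(reordering_features("swap", "monotone"))
```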

MT systems

Appears in 3 sentences as: MT system (1) MT systems (2)
In Fast and Adaptive Online Training of Feature-Rich Translation Models
  1. We introduce a new method for training feature-rich MT systems that is effective yet comparatively easy to implement.
    Page 1, “Introduction”
  2. SGD is sensitive to the learning rate $\eta$, which is difficult to set in an MT system that mixes frequent “dense” features (like the language model) with sparse features (e.g., for translation rules).
    Page 2, “Adaptive Online Algorithms”
  3. We built Arabic-English and Chinese-English MT systems with Phrasal (Cer et al., 2010), a phrase-based system based on alignment templates (Och and Ney, 2004).
    Page 5, “Experiments”

weight vector

Appears in 3 sentences as: weight vector (2) weight vectors (1)
In Fast and Adaptive Online Training of Feature-Rich Translation Models
  1. A fixed threadpool of workers computes gradients in parallel and sends them to a master thread, which updates a central weight vector (a minimal sketch follows this list).
    Page 4, “Adaptive Online MT”
  2. During a tuning run, the online method decodes the tuning set under many more weight vectors than a MERT-style batch method.
    Page 5, “Adaptive Online MT”
  3. Our algorithm decodes each example with a new weight vector, thus exploring more of the search space for the same tuning set.
    Page 7, “Experiments”
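
Item 1 describes the parallelization: a fixed pool of workers computes gradients while a single master thread owns the central weight vector. Here is a minimal sketch of that pattern, with decoding and the gradient computation stubbed out; the function names, simple AdaGrad-style update, and threadpool details are illustrative assumptions rather than the paper's implementation.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def worker(example, weights_snapshot):
    # Placeholder: decode the example under a (possibly stale) weight
    # snapshot, build n-best pairs, and return a sparse gradient dict.
    return {f: 0.01 for f in example["features"]}

def tune(examples, n_threads=4, epochs=1, eta=0.1):
    w, G = {}, {}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for _ in range(epochs):
            futures = [pool.submit(worker, ex, dict(w)) for ex in examples]
            for fut in futures:  # master thread applies all updates itself
                for f, g in fut.result().items():
                    G[f] = G.get(f, 0.0) + g * g
                    w[f] = w.get(f, 0.0) - eta * g / (math.sqrt(G[f]) + 1e-8)
    return w

print(tune([{"features": ["PT:in ||| fy"]}, {"features": ["LO:prev:swap"]}]))
```

Because only the master mutates the central weight vector, no locking is needed on updates; the cost is that workers may decode under slightly stale weights.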
