Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
Simianer, Patrick and Riezler, Stefan and Dyer, Chris

Article Structure

Abstract

With a few exceptions, discriminative training in statistical machine translation (SMT) has been content with tuning weights for large feature sets on small development data.

Introduction

The standard SMT training pipeline combines scores from large count-based translation models and language models with a few other features and tunes these using the well-understood line-search technique for error minimization of Och (2003).

Related Work

The great promise of discriminative training for SMT is the possibility to design arbitrarily expressive, complex, or overlapping features in great numbers.

Local Features for Synchronous CFGs

The work described in this paper is based on the SMT framework of hierarchical phrase-based translation (Chiang, 2005; Chiang, 2007).

Joint Feature Selection in Distributed Stochastic Learning

The following discussion of learning methods is based on pairwise ranking in a Stochastic Gradient Descent (SGD) framework.

Experiments

5.1 Data, Systems, Experiment Settings

Discussion

We presented an approach to scaling discriminative learning for SMT not only to large feature …

Topics

BLEU

Appears in 8 sentences as: BLEU (10)
In Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
  1. Let each translation candidate be represented by a feature vector x ∈ ℝ^D where preference pairs for training are prepared by sorting translations according to smoothed sentence-wise BLEU score (Liang et al., 2006a) against the reference (a sketch of this pair preparation follows this list).
    Page 3, “Joint Feature Selection in Distributed Stochastic Learning”
  2. Training data for discriminative learning are prepared by comparing a 100-best list of translations against a single reference using smoothed per-sentence BLEU (Liang et al., 2006a).
    Page 5, “Experiments”
  3. Figure 4 gives a boxplot depicting BLEU-4 results for 100 runs of the MIRA implementation of the cdec package, tuned on deV-nc, and evaluated on the respective test set test-11c.6 We see a high variance (whiskers denote standard deviations) around a median of 27.2 BLEU and a mean of 27.1 BLEU .
    Page 6, “Experiments”
  4. In contrast, the perceptron is deterministic when started from a zero-vector of weights and achieves favorable 28.0 BLEU on the news-commentary test set.
    Page 7, “Experiments”
  5. The results on the news-commentary (nc) data show that training on the development set does not benefit from adding large feature sets — BLEU result differences between tuning 12 default features
    Page 7, “Experiments”
  6. However, scaling all features to the full training set shows significant improvements for algorithm 3, and especially for algorithm 4, which gains 0.8 BLEU points over tuning 12 features on the development set.
    Page 8, “Experiments”
  7. Here tuning large feature sets on the respective dev sets yields significant improvements of around 2 BLEU points over tuning the 12 default features on the dev sets.
    Page 8, “Experiments”
  8. Another 0.5 BLEU points (test-crawl11) or even 1.3 BLEU points (test-crawl10) are gained when scaling to the full training set using iterative feature selection.
    Page 8, “Experiments”
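
Sentences 1 and 2 above describe how training pairs are prepared by scoring k-best translations with smoothed per-sentence BLEU against a single reference. The following is a minimal Python sketch of that preparation step, assuming a generic add-one smoothing of the n-gram precisions and exhaustive pairing; the exact smoothing of Liang et al. (2006a) and the paper's pair sampling may differ.

    import math
    from collections import Counter
    from itertools import combinations

    def ngrams(tokens, n):
        # Multiset of n-grams of a tokenized sentence.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def smoothed_sentence_bleu(hyp, ref, max_n=4):
        # Sentence-level BLEU-4 with add-one smoothed n-gram precisions and
        # the usual brevity penalty (illustrative smoothing only).
        log_prec = 0.0
        for n in range(1, max_n + 1):
            hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
            matches = sum(min(c, ref_ng[g]) for g, c in hyp_ng.items())
            total = sum(hyp_ng.values())
            log_prec += math.log((matches + 1.0) / (total + 1.0))
        bp = min(1.0, math.exp(1.0 - len(ref) / max(len(hyp), 1)))
        return bp * math.exp(log_prec / max_n)

    def preference_pairs(kbest, ref):
        # Sort a deduplicated k-best list (lists of tokens) by smoothed BLEU
        # and emit (better, worse) preference pairs for pairwise ranking.
        ranked = sorted(kbest, key=lambda h: smoothed_sentence_bleu(h, ref),
                        reverse=True)
        return [(better, worse) for better, worse in combinations(ranked, 2)]

Such pairs are exactly what the pairwise ranking perceptron (see the perceptron topic below) consumes.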

perceptron

Appears in 8 sentences as: perceptron (7) Perceptrons (2)
In Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
  1. Examples for adapted algorithms include Maximum-Entropy Models (Och and Ney, 2002; Blunsom et al., 2008), Pairwise Ranking Perceptrons (Shen et al., 2004; Watanabe et al., 2006; Hopkins and May, 2011), Structured Perceptrons (Liang et al., 2006a), Boosting (Duh and Kirchhoff, 2008; Wellington et al., 2009), Structured SVMs (Tillmann and Zhang, 2006; Hayashi et al., 2009), MIRA (Watanabe et al., 2007; Chiang et al., 2008; Chiang et al., 2009), and others.
    Page 2, “Related Work”
  2. The resulting algorithms can be seen as variants of the perceptron algorithm.
    Page 3, “Joint Feature Selection in Distributed Stochastic Learning”
  3. …subgradient leads to the perceptron algorithm for pairwise ranking (Shen and Joshi, 2005):
    Page 3, “Joint Feature Selection in Distributed Stochastic Learning”
  4. McDonald et al. (2010) also present an iterative mixing algorithm where weights are mixed from each shard after training a single epoch of the perceptron in parallel on each shard.
    Page 4, “Joint Feature Selection in Distributed Stochastic Learning”
  5. The baseline learner in our experiments is a pairwise ranking perceptron that is used on various features and training data and plugged into various meta-algorithms (a minimal update sketch follows this list).
    Page 6, “Experiments”
  6. The perceptron algorithm itself compares favorably to related learning techniques such as the MIRA adaptation of Chiang et al.
    Page 6, “Experiments”
  7. In contrast, the perceptron is deterministic when started from a zero-vector of weights and achieves favorable 28.0 BLEU on the news-commentary test set.
    Page 7, “Experiments”
  8. If not indicated otherwise, the perceptron was run for 10 epochs with learning rate η = 0.0001, started at zero weight vector, using deduplicated 100-best lists.
    Page 7, “Experiments”
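
The sentences above refer repeatedly to a pairwise ranking perceptron. As an informal illustration (not the paper's implementation), the sketch below runs one epoch of updates over preference pairs of sparse feature dictionaries; the function name and the handling of ties are our own choices.

    from collections import defaultdict

    def rank_perceptron_epoch(pairs, w=None, eta=0.0001):
        # One epoch of pairwise ranking perceptron updates. Each pair is
        # (x_good, x_bad), both sparse {feature: value} dicts. Whenever the
        # current weights do not score x_good above x_bad, the weights are
        # moved by eta times the feature difference.
        if w is None:
            w = defaultdict(float)  # zero initial weight vector, as in sentence 8
        for x_good, x_bad in pairs:
            feats = set(x_good) | set(x_bad)
            margin = sum(w[f] * (x_good.get(f, 0.0) - x_bad.get(f, 0.0))
                         for f in feats)
            if margin <= 0.0:  # ranking violated (ties treated as violations here)
                for f in feats:
                    w[f] += eta * (x_good.get(f, 0.0) - x_bad.get(f, 0.0))
        return w

Running this for 10 epochs with eta = 0.0001 from a zero vector mirrors the settings quoted in sentence 8.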

feature sets

Appears in 6 sentences as: feature sets (6)
In Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
  1. With a few exceptions, discriminative training in statistical machine translation (SMT) has been content with tuning weights for large feature sets on small development data.
    Page 1, “Abstract”
  2. Our resulting models are learned on large data sets, but they are small and outperform models that tune feature sets of various sizes on small development sets.
    Page 2, “Introduction”
  3. All approaches have been shown to scale to large feature sets and all include some kind of regularization method.
    Page 2, “Related Work”
  4. The results on the news-commentary (nc) data show that training on the development set does not benefit from adding large feature sets — BLEU result differences between tuning 12 default features
    Page 7, “Experiments”
  5. Here tuning large feature sets on the respective dev sets yields significant improvements of around 2 BLEU points over tuning the 12 default features on the dev sets.
    Page 8, “Experiments”
  6. In future work, we would like to investigate more sophisticated features, better learners, and in general improve the components of our system that have been neglected in the current investigation of relative improvements by scaling the size of data and feature sets.
    Page 8, “Discussion”

language models

Appears in 6 sentences as: language model (3) language models (5)
In Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
  1. The standard SMT training pipeline combines scores from large count-based translation models and language models with a few other features and tunes these using the well-understood line-search technique for error minimization of Och (2003).
    Page 1, “Introduction”
  2. The modeler’s goals might be to identify complex properties of translations, or to counter errors of pre-trained translation models and language models by explicitly down-weighting translations that exhibit certain undesired properties.
    Page 1, “Introduction”
  3. 3-gram (news-commentary) and 5-gram (Europarl) language models are trained on the data described in Table 1, using the SRILM toolkit (Stolcke, 2002) and binarized for efficient querying using kenlm (Heafield, 2011).
    Page 5, “Experiments”
  4. For the 5-gram language models, we replaced every word in the LM training data that did not appear in the English part of the parallel training data with <unk>, to build an open-vocabulary language model (a preprocessing sketch follows this list).
    Page 5, “Experiments”
  5. Absolute improvements would be possible, e.g., by using larger language models or by adding news data to the ep training set when evaluating on crawl test sets (see, e.g., Dyer et al.).
    Page 7, “Experiments”
  6. Negative log relative frequency p(e|f); log count(f); log count(e, f); lexical translation probabilities p(f|e) and p(e|f) (Koehn et al., 2003); indicator variable on singleton phrase e; indicator variable on singleton phrase pair (f, e); word penalty; language model weight; OOV count of language model; number of untranslated words; Hiero glue rules (Chiang, 2007).
    Page 7, “Experiments”
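
Sentence 4 describes an <unk>-replacement step before estimating the open-vocabulary 5-gram language model. A hedged sketch of that preprocessing is given below; the file paths and function names are placeholders rather than the paper's scripts, and the actual models were estimated with SRILM and binarized with kenlm.

    def build_vocab(parallel_english_path):
        # Vocabulary of the English side of the parallel training data.
        vocab = set()
        with open(parallel_english_path, encoding="utf-8") as f:
            for line in f:
                vocab.update(line.split())
        return vocab

    def replace_oov(lm_train_path, out_path, vocab, unk="<unk>"):
        # Rewrite the LM training data, mapping every token outside the
        # parallel-data vocabulary to <unk>, so the resulting LM assigns
        # probability mass to unseen words.
        with open(lm_train_path, encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                fout.write(" ".join(t if t in vocab else unk
                                    for t in line.split()) + "\n")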

development set

Appears in 4 sentences as: development set (2) development sets (2)
In Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
  1. We present experiments on learning on 1.5 million training sentences, and show significant improvements over tuning discriminative models on small development sets.
    Page 1, “Abstract”
  2. Our resulting models are learned on large data sets, but they are small and outperform models that tune feature sets of various sizes on small development sets.
    Page 2, “Introduction”
  3. The results on the news-commentary (nc) data show that training on the development set does not benefit from adding large feature sets — BLEU result differences between tuning 12 default features
    Page 7, “Experiments”
  4. However, scaling all features to the full training set shows significant improvements for algorithm 3, and especially for algorithm 4, which gains 0.8 BLEU points over tuning 12 features on the development set.
    Page 8, “Experiments”

feature vector

Appears in 4 sentences as: feature vector (2) feature vectors (2)
In Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
  1. The simple but effective idea is to randomly divide training data into evenly sized shards, use stochastic learning on each shard in parallel, while performing ℓ1/ℓ2 regularization for joint feature selection on the shards after each epoch, before starting a new epoch with a reduced feature vector averaged across shards (a sketch of this selection step follows this list).
    Page 2, “Introduction”
  2. Let each translation candidate be represented by a feature vector x ∈ ℝ^D where preference pairs for training are prepared by sorting translations according to smoothed sentence-wise BLEU score (Liang et al., 2006a) against the reference.
    Page 3, “Joint Feature Selection in Distributed Stochastic Learning”
  3. Parameter mixing by averaging will help to ease the feature sparsity problem; however, keeping feature vectors on the scale of several million features in memory can be prohibitive.
    Page 4, “Joint Feature Selection in Distributed Stochastic Learning”
  4. Algorithms 2 and 3 were infeasible to run on Europarl data beyond one epoch because feature vectors grew too large to be kept in memory.
    Page 8, “Experiments”
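
Sentence 1 summarizes the per-epoch joint feature selection: shard-local stochastic learning, an ℓ1/ℓ2 regularization step across shards, and averaging of the reduced weight vectors. The sketch below shows one plausible reading of that selection step, scoring each feature by the ℓ2 norm of its weights across shards, keeping the K highest-scoring features, and averaging the survivors; K and the data layout are illustrative, and the precise procedure is the paper's algorithm 4 (IterSelSGD).

    import math
    from collections import defaultdict

    def select_and_mix(shard_weights, k):
        # shard_weights: one sparse {feature: weight} dict per shard.
        sq = defaultdict(float)
        for w in shard_weights:
            for f, v in w.items():
                sq[f] += v * v  # accumulate squared weights per feature
        # l2 norm of each feature's weights across shards; keep the k largest
        keep = sorted(sq, key=lambda f: math.sqrt(sq[f]), reverse=True)[:k]
        # average the surviving features across shards; all others are dropped
        mixed = {f: sum(w.get(f, 0.0) for w in shard_weights) / len(shard_weights)
                 for f in keep}
        return mixed  # reduced feature vector used to start the next epoch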

weight vector

Appears in 4 sentences as: weight vector (3) weight vectors (1)
In Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
  1. The mixed weight vector is re-sent to each shard to start another epoch of training in parallel on each shard.
    Page 4, “Joint Feature Selection in Distributed Stochastic Learning”
  2. Reduced weight vectors are mixed and the result is re-sent to each shard to start another epoch of parallel training on each shard (a sketch of this mixing loop follows this list).
    Page 4, “Joint Feature Selection in Distributed Stochastic Learning”
  3. The initial weight vector was 0.
    Page 6, “Experiments”
  4. If not indicated otherwise, the perceptron was run for 10 epochs with learning rate η = 0.0001, started at zero weight vector, using deduplicated 100-best lists.
    Page 7, “Experiments”
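
Sentences 1 and 2 describe the mix-and-resend loop of the distributed learners. Below is a minimal sequential sketch of that loop under our own assumptions: uniform averaging as the mixing step, and train_epoch_on_shard as a placeholder for the per-shard ranking perceptron; in the paper the shards train in parallel.

    def iterative_mixing(shards, train_epoch_on_shard, epochs=10):
        # train_epoch_on_shard(shard, w) runs one epoch of the per-shard
        # learner started from weights w and returns a sparse weight dict.
        w = {}  # the initial weight vector is 0
        for _ in range(epochs):
            shard_ws = [train_epoch_on_shard(shard, dict(w)) for shard in shards]
            mixed = {}
            for sw in shard_ws:
                for f, v in sw.items():
                    mixed[f] = mixed.get(f, 0.0) + v / len(shard_ws)
            w = mixed  # mixed vector is re-sent to each shard for the next epoch
        return w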

BLEU points

Appears in 3 sentences as: BLEU points (4)
In Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
  1. However, scaling all features to the full training set shows significant improvements for algorithm 3, and especially for algorithm 4, which gains 0.8 BLEU points over tuning 12 features on the development set.
    Page 8, “Experiments”
  2. Here tuning large feature sets on the respective dev sets yields significant improvements of around 2 BLEU points over tuning the 12 default features on the dev sets.
    Page 8, “Experiments”
  3. Another 0.5 BLEU points (test-crawl11) or even 1.3 BLEU points (test-crawl10) are gained when scaling to the full training set using iterative feature selection.
    Page 8, “Experiments”

machine learning

Appears in 3 sentences as: machine learning (3)
In Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
  1. Evidence from machine learning indicates that increasing the training sample size results in better prediction.
    Page 1, “Abstract”
  2. This contradicts theoretical and practical evidence from machine learning that suggests that larger training samples should be beneficial to improve prediction also in SMT.
    Page 1, “Introduction”
  3. The focus of many approaches thus has been on feature engineering and on adaptations of machine learning algorithms to the special case of SMT (where gold standard rankings have to be created automatically).
    Page 2, “Related Work”

n-grams

Appears in 3 sentences as: n-grams (4)
In Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
  1. Such features include rule ids, rule-local n-grams, or types of rule shapes.
    Page 2, “Introduction”
  2. Rule n-grams: These features identify n-grams of consecutive items in a rule (an extraction sketch follows this list).
    Page 3, “Local Features for Synchronous CFGs”
  3. Feature templates such as rule n-grams and rule shapes only work if iterative mixing (algorithm 3) or feature selection (algorithm 4) are used.
    Page 8, “Experiments”
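
Sentence 2 defines rule n-gram features as n-grams of consecutive items in a rule. The sketch below extracts such sparse indicator features from one side of a synchronous rule; the rule representation and the feature naming scheme are our own assumptions, not the paper's internal format.

    def rule_ngram_features(rule_items, max_n=2):
        # rule_items is the sequence of items on one side of a synchronous
        # CFG rule, e.g. ['[X,1]', 'house', 'of', '[X,2]']; every n-gram of
        # consecutive items fires an indicator feature.
        feats = {}
        for n in range(1, max_n + 1):
            for i in range(len(rule_items) - n + 1):
                feats["rule_ngram:" + "_".join(rule_items[i:i + n])] = 1.0
        return feats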

overfitting

Appears in 3 sentences as: overfitting (3)
In Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
  1. Another possible reason why large training data did not yet show the expected improvements in discriminative SMT is a special overfitting problem of current popular online learning techniques.
    Page 2, “Introduction”
  2. Selecting features jointly across shards and averaging does counter the overfitting effect that is inherent to stochastic updating.
    Page 2, “Introduction”
  3. Our algorithm 4 (IterSelSGD) introduces feature selection into distributed learning for increased efficiency and as a more radical measure against overfitting.
    Page 4, “Joint Feature Selection in Distributed Stochastic Learning”

significant improvements

Appears in 3 sentences as: significant improvements (3)
In Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
  1. We present experiments on learning on 1.5 million training sentences, and show significant improvements over tuning discriminative models on small development sets.
    Page 1, “Abstract”
  2. However, scaling all features to the full training set shows significant improvements for algorithm 3, and especially for algorithm 4, which gains 0.8 BLEU points over tuning 12 features on the development set.
    Page 8, “Experiments”
  3. Here tuning large feature sets on the respective dev sets yields significant improvements of around 2 BLEU points over tuning the 12 default features on the dev sets.
    Page 8, “Experiments”
