Prediction of Learning Curves in Machine Translation
Kolachina, Prasanth and Cancedda, Nicola and Dymetman, Marc and Venkatapathy, Sriram

Article Structure

Abstract

Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific purpose.

Introduction

Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific business purpose.

Related Work

Learning curves are routinely used to illustrate how the performance of experimental methods depends on the amount of training data used.

Selecting a parametric family of curves

The first step in our approach consists in selecting a suitable family of shapes for the learning curves that we want to produce in the two scenarios being considered.

Inferring a learning curve from mostly monolingual data

In this section we address scenario S1: we have access to a source-language monolingual collection (from which portions to be manually translated could be sampled) and a target-language in-domain monolingual corpus, to supplement the target side of a parallel corpus while training a language model.

Topics

BLEU

Appears in 21 sentences as: BLEU (24)
In Prediction of Learning Curves in Machine Translation
  1. In both cases, the task consists in predicting an evaluation score (BLEU, throughout this work) on the test corpus as a function of the size of a subset of the source sample, assuming that we could have it manually translated and use the resulting bilingual corpus for training.
    Page 1, “Introduction”
  2. An extensive study across six parametric function families, empirically establishing that a certain three-parameter power-law family is well suited for modeling learning curves for the Moses SMT system when the evaluation score is BLEU .
    Page 1, “Introduction”
  3. They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6).
    Page 2, “Introduction”
  4. For a certain bilingual test dataset d, we consider a set of observations O_d = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where y_i is the performance on d (measured using BLEU (Papineni et al., 2002)) of a translation model trained on a parallel corpus of size x_i.
    Page 2, “Selecting a parametric family of curves”
  5. The last condition is related to our use of BLEU (which is bounded by 1) as a performance measure. It should be noted that some growth patterns which are sometimes proposed, such as a logarithmic regime of the form y = a + b log x, are not
    Page 2, “Selecting a parametric family of curves”
  6. The values are on the same scale as the BLEU scores.
    Page 3, “Selecting a parametric family of curves”
  7. [Figure: “BLEU scores” (axis label)]
    Page 4, “Selecting a parametric family of curves”
  8. Our objective is to predict the evolution of the BLEU score on the given test set as a function of the size of a random subset of the training data
    Page 4, “Inferring a learning curve from mostly monolingual data”
  9. We first train models to predict the BLEU score at m anchor sizes s_1, ..., s_m.
    Page 4, “Inferring a learning curve from mostly monolingual data”
  10. We then perform inference using these models to predict the BLEU score at each anchor, for the test case of interest.
    Page 4, “Inferring a learning curve from mostly monolingual data”
  11. ..., p_m with accurate estimates of BLEU at the anchor sizes.
    Page 5, “Inferring a learning curve from mostly monolingual data”
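
The items above establish that a certain three-parameter power-law family fits BLEU learning curves well. Below is a minimal sketch of fitting such a curve, assuming the form y(x) = c - a*x^(-alpha), which satisfies the growth constraints noted in item 5 (increasing, concave, and bounded, since BLEU is bounded by 1); the paper's exact parameterization may differ, and the data points here are invented.

```python
# Minimal sketch: fit a three-parameter power-law learning curve to
# (corpus size, BLEU) observations and query it at a larger size.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, c, a, alpha):
    # Increasing, concave, and bounded above by c (BLEU <= 1).
    return c - a * np.power(x, -alpha)

sizes = np.array([1_000, 5_000, 10_000, 20_000, 50_000, 75_000])  # segments
bleu = np.array([0.12, 0.18, 0.21, 0.24, 0.27, 0.28])             # invented

# Constrain c to [0, 1] and keep a, alpha non-negative.
params, _ = curve_fit(power_law, sizes, bleu, p0=[0.35, 1.0, 0.3],
                      bounds=([0.0, 0.0, 0.0], [1.0, np.inf, np.inf]))
c, a, alpha = params
print(f"predicted BLEU at 500K segments: {power_law(500_000, c, a, alpha):.3f}")
```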

parallel corpus

Appears in 11 sentences as: parallel corpus (11)
In Prediction of Learning Curves in Machine Translation
  1. We consider two scenarios: 1) monolingual samples in the source and target languages are available, and 2) an additional small parallel corpus is also available.
    Page 1, “Abstract”
  2. In the first scenario (S1), the SMT developer is given only monolingual source and target samples from the relevant domain, and a small test parallel corpus.
    Page 1, “Introduction”
  3. In the second scenario (S2), an additional small seed parallel corpus is given that can be used to train small in-domain models and measure (with some variance) the evaluation score at a few points on the initial portion of the learning curve.
    Page 1, “Introduction”
  4. For a certain bilingual test dataset d, we consider a set of observations O_d = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where y_i is the performance on d (measured using BLEU (Papineni et al., 2002)) of a translation model trained on a parallel corpus of size x_i.
    Page 2, “Selecting a parametric family of curves”
  5. The corpus size x_i is measured in terms of the number of segments (sentences) present in the parallel corpus.
    Page 2, “Selecting a parametric family of curves”
  6. In this section we address scenario S1: we have access to a source-language monolingual collection (from which portions to be manually translated could be sampled) and a target-language in-domain monolingual corpus, to supplement the target side of a parallel corpus while training a language model.
    Page 4, “Inferring a learning curve from mostly monolingual data”
  7. Section 5: Extrapolating a learning curve fitted on a small parallel corpus
    Page 5, “Inferring a learning curve from mostly monolingual data”
  8. Given a small “seed” parallel corpus, the translation system can be used to train small in-domain models and the evaluation score can be measured at a few initial sample sizes {(x_1, y_1), (x_2, y_2), ..., (x_p, y_p)}.
    Page 5, “Inferring a learning curve from mostly monolingual data”
  9. In scenario S2, the models trained from the seed parallel corpus and the features used for inference (Section 4) provide complementary information.
    Page 6, “Inferring a learning curve from mostly monolingual data”
  10. Let u be a new configuration with a seed parallel corpus of size x_u, and let x_l be the largest point in our grid for which x_l ≤ x_u.
    Page 6, “Inferring a learning curve from mostly monolingual data”
  11. For the cases where a slightly larger in-domain “seed” parallel corpus is available, we introduced an extrapolation method and a combined method yielding high-precision predictions: using models trained on up to 20K sentence pairs we can predict performance on a given test set with a root mean squared error in the order of 1 BLEU point at 75K sentence pairs, and in the order of 2-4 BLEU points at 500K.
    Page 8, “Inferring a learning curve from mostly monolingual data”
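
Items 8-10 above outline the scenario S2 extrapolation: measure BLEU at a few sizes sampled from the seed corpus, fit the curve family on those points only, and read off predictions at larger sizes. A minimal sketch under the same assumed power-law family as the earlier example; the seed measurements are invented.

```python
# Sketch of scenario S2: fit the learning-curve family on the few points
# measurable from a small seed parallel corpus, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, c, a, alpha):
    return c - a * np.power(x, -alpha)

seed_sizes = np.array([1_000, 5_000, 10_000, 20_000])  # x_1 ... x_p
seed_bleu = np.array([0.11, 0.17, 0.20, 0.23])         # y_1 ... y_p (invented)

params, _ = curve_fit(power_law, seed_sizes, seed_bleu, p0=[0.35, 1.0, 0.3],
                      bounds=([0.0, 0.0, 0.0], [1.0, np.inf, np.inf]))
for target in (75_000, 500_000):
    print(f"extrapolated BLEU at {target}: {power_law(target, *params):.3f}")
```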

BLEU score

Appears in 7 sentences as: BLEU score (4) BLEU scores (3)
In Prediction of Learning Curves in Machine Translation
  1. The values are on the same scale as the BLEU scores.
    Page 3, “Selecting a parametric family of curves”
  2. [Figure: “BLEU scores” (axis label)]
    Page 4, “Selecting a parametric family of curves”
  3. Our objective is to predict the evolution of the BLEU score on the given test set as a function of the size of a random subset of the training data
    Page 4, “Inferring a learning curve from mostly monolingual data”
  4. We first train models to predict the BLEU score at m anchor sizes s_1, ..., s_m.
    Page 4, “Inferring a learning curve from mostly monolingual data”
  5. We then perform inference using these models to predict the BLEU score at each anchor, for the test case of interest.
    Page 4, “Inferring a learning curve from mostly monolingual data”
  6. Feature correlation measures such as Pearson's r showed that the features corresponding to type-token ratios of both source and target languages and size of test set have a high correlation with the BLEU scores at the three anchor sizes.
    Page 6, “Inferring a learning curve from mostly monolingual data”
  7. The average distance is on the same scale as the BLEU score, which suggests that our best curves can predict the gold curve within 1.5 BLEU points on average (the best result being 0.7 BLEU points when the initial points are 1K-5K-10K-20K), which is a telling result.
    Page 8, “Inferring a learning curve from mostly monolingual data”
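
Item 7 above evaluates a predicted curve by its average distance, in BLEU, from the gold curve. A small sketch of that metric, assuming the gap is averaged over a grid of corpus sizes; the grid endpoints and the two curves below are illustrative assumptions.

```python
# Average absolute gap (in BLEU) between a predicted and a gold curve,
# computed over a grid of corpus sizes.
import numpy as np

def avg_curve_distance(pred_curve, gold_curve, lo=10_000, hi=500_000, n=100):
    grid = np.linspace(lo, hi, n)
    return float(np.mean(np.abs(pred_curve(grid) - gold_curve(grid))))

# Hypothetical fitted curves (same power-law form as the earlier sketches).
gold = lambda x: 0.32 - 1.2 * np.power(x, -0.35)
pred = lambda x: 0.30 - 1.0 * np.power(x, -0.33)
print(f"average distance: {avg_curve_distance(pred, gold):.3f} BLEU")
```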

parallel data

Appears in 6 sentences as: Parallel data (2) parallel data (4)
In Prediction of Learning Curves in Machine Translation
  1. Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific purpose.
    Page 1, “Abstract”
  2. Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific business purpose.
    Page 1, “Introduction”
  3. This prediction, or more generally the prediction of the learning curve of an SMT system as a function of available in-domain parallel data, is the objective of this paper.
    Page 1, “Introduction”
  4. They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6).
    Page 2, “Introduction”
  5. However, when a configuration of four initial points is used for the same amount of “seed” parallel data, it outperforms both the configurations with three initial points.
    Page 7, “Inferring a learning curve from mostly monolingual data”
  6. The ability to predict the amount of parallel data required to achieve a given level of quality is very valuable in planning business deployments of statistical machine translation; yet, we are not aware of any rigorous proposal for addressing this need.
    Page 8, “Inferring a learning curve from mostly monolingual data”

BLEU points

Appears in 5 sentences as: BLEU point (2) BLEU points (5)
In Prediction of Learning Curves in Machine Translation
  1. They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6).
    Page 2, “Introduction”
  2. As an example, the model estimated using Lasso for the 75K anchor size exhibits a root mean squared error of 6 BLEU points.
    Page 6, “Inferring a learning curve from mostly monolingual data”
  3. The average distance is on the same scale as the BLEU score, which suggests that our best curves can predict the gold curve within 1.5 BLEU points on average (the best result being 0.7 BLEU points when the initial points are 1K-5K-10K-20K), which is a telling result.
    Page 8, “Inferring a learning curve from mostly monolingual data”
  4. For the cases where a slightly larger in-domain “seed” parallel corpus is available, we introduced an extrapolation method and a combined method yielding high-precision predictions: using models trained on up to 20K sentence pairs we can predict performance on a given test set with a root mean squared error in the order of 1 BLEU point at 75K sentence pairs, and in the order of 2-4 BLEU points at 500K.
    Page 8, “Inferring a learning curve from mostly monolingual data”
  5. Considering that variations in the order of 1 BLEU point on the same test dataset can be observed simply due to the instability of the standard MERT parameter tuning algorithm (Foster and Kuhn, 2009; Clark et al., 2011), we believe our results to be close to what can be achieved in principle.
    Page 8, “Inferring a learning curve from mostly monolingual data”

regression model

Appears in 5 sentences as: Regression model (1) regression model (2) regression models (2)
In Prediction of Learning Curves in Machine Translation
  1. We consider such observations to be generated by a regression model of the form:
    Page 2, “Selecting a parametric family of curves”
  2. (Table 4 excerpt) Ridge regression root mean squared error: 0.063 at the 10K anchor, 0.060 at 75K, 0.053 at 500K.
    Page 6, “Inferring a learning curve from mostly monolingual data”
  3. Table 4: Root mean squared error of the linear regression models for each anchor size
    Page 6, “Inferring a learning curve from mostly monolingual data”
  4. Table 4 shows these results for Ridge and Lasso regression models at the three anchor sizes.
    Page 6, “Inferring a learning curve from mostly monolingual data”
  5. The Lasso regression model selected four features from the entire feature set: i) size of the test set (sentences & tokens), ii) perplexity of the language model (order 5) on the test set, and iii) type-token ratio of the target monolingual corpus.
    Page 6, “Inferring a learning curve from mostly monolingual data”
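
The items above describe linear regression, one model per anchor size, with Ridge and Lasso variants. A hedged sketch using scikit-learn; the feature matrix and targets are random placeholders standing in for the paper's monolingual features and measured BLEU at one anchor, and the regularization strengths are not the paper's.

```python
# Per-anchor regression sketch: fit Ridge and Lasso models that map
# features to BLEU at a single anchor size (e.g. 75K).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(96, 10))           # one row per (configuration, test set)
y_75k = rng.uniform(0.1, 0.4, size=96)  # placeholder BLEU at the 75K anchor

ridge = Ridge(alpha=1.0).fit(X, y_75k)
lasso = Lasso(alpha=0.01).fit(X, y_75k)

# Lasso's sparsity is what yields the handful of selected features noted above.
print("features kept by Lasso:", np.flatnonzero(lasso.coef_))
```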

feature vector

Appears in 4 sentences as: feature vector (4)
In Prediction of Learning Curves in Machine Translation
  1. The feature vector φ consists of the following features:
    Page 4, “Inferring a learning curve from mostly monolingual data”
  2. We construct the design matrix Φ with one column for each feature vector φ_ct corresponding to each combination of training configuration c and test set t.
    Page 5, “Inferring a learning curve from mostly monolingual data”
  3. For a new unseen configuration with feature vector φ_u, we determine the parameters θ_u of the corresponding learning curve as:
    Page 5, “Inferring a learning curve from mostly monolingual data”
  4. where φ_u is the feature vector for u, and w_j are the weights we obtained from the regression in Eq.
    Page 6, “Inferring a learning curve from mostly monolingual data”
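
Items 2-4 above describe stacking one feature vector φ_ct per (configuration, test set) pair into a design matrix and mapping an unseen configuration's features to curve parameters through learned weights. A minimal sketch assuming an ordinary least-squares estimator (the paper's exact estimator may differ); all values are random placeholders, and feature vectors are stored as rows here rather than the columns mentioned above, for numpy convenience.

```python
# Design-matrix sketch: learn weights W mapping feature vectors to curve
# parameters, then apply them to the features phi_u of an unseen config u.
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(96, 10))    # design matrix, one row per phi_ct
Theta = rng.normal(size=(96, 3))   # (c, a, alpha) of each fitted gold curve

# W solves Phi @ W ~= Theta in the least-squares sense.
W, *_ = np.linalg.lstsq(Phi, Theta, rcond=None)

phi_u = rng.normal(size=10)        # features of the unseen configuration u
theta_u = phi_u @ W                # predicted learning-curve parameters
print(theta_u)
```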

in-domain

Appears in 4 sentences as: in-domain (4)
In Prediction of Learning Curves in Machine Translation
  1. This prediction, or more generally the prediction of the learning curve of an SMT system as a function of available in-domain parallel data, is the objective of this paper.
    Page 1, “Introduction”
  2. In the second scenario (S2), an additional small seed parallel corpus is given that can be used to train small in-domain models and measure (with some variance) the evaluation score at a few points on the initial portion of the learning curve.
    Page 1, “Introduction”
  3. Given a small “seed” parallel corpus, the translation system can be used to train small in-domain models and the evaluation score can be measured at a few initial sample sizes {(x_1, y_1), (x_2, y_2), ..., (x_p, y_p)}.
    Page 5, “Inferring a learning curve from mostly monolingual data”
  4. For the cases where a slightly larger in-domain “seed” parallel corpus is available, we introduced an extrapolation method and a combined method yielding high-precision predictions: using models trained on up to 20K sentence pairs we can predict performance on a given test set with a root mean squared error in the order of 1 BLEU point at 75K sentence pairs, and in the order of 2-4 BLEU points at 500K.
    Page 8, “Inferring a learning curve from mostly monolingual data”

machine translation

Appears in 4 sentences as: machine translation (4)
In Prediction of Learning Curves in Machine Translation
  1. Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific purpose.
    Page 1, “Abstract”
  2. Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific business purpose.
    Page 1, “Introduction”
  3. (2008), the authors examined corpus features that contribute most to the machine translation performance.
    Page 2, “Related Work”
  4. The ability to predict the amount of parallel data required to achieve a given level of quality is very valuable in planning business deployments of statistical machine translation; yet, we are not aware of any rigorous proposal for addressing this need.
    Page 8, “Inferring a learning curve from mostly monolingual data”

models trained

Appears in 4 sentences as: model trained (1) models trained (4)
In Prediction of Learning Curves in Machine Translation
  1. For a certain bilingual test dataset d, we consider a set of observations O_d = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where y_i is the performance on d (measured using BLEU (Papineni et al., 2002)) of a translation model trained on a parallel corpus of size x_i.
    Page 2, “Selecting a parametric family of curves”
  2. In scenario S2, the models trained from the seed parallel corpus and the features used for inference (Section 4) provide complementary information.
    Page 6, “Inferring a learning curve from mostly monolingual data”
  3. Using the models trained for the experiments in Section 3, we estimate the squared extrapolation error at the anchors s_j when using models trained on sizes up to x_l, and set the confidence in the extrapolations for u to its inverse:
    Page 6, “Inferring a learning curve from mostly monolingual data”
  4. For the cases where a slightly larger in-domain “seed” parallel corpus is available, we introduced an extrapolation method and a combined method yielding high-precision predictions: using models trained on up to 20K sentence pairs we can predict performance on a given test set with a root mean squared error in the order of 1 BLEU point at 75K sentence pairs, and in the order of 2-4 BLEU points at 500K.
    Page 8, “Inferring a learning curve from mostly monolingual data”
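
Item 3 above sets the confidence in the extrapolated curve to the inverse of its estimated squared error at an anchor. The convex blend below is one plausible way to combine the extrapolated and feature-based predictions with such confidences; it is an assumption for illustration, not the paper's exact formula.

```python
# Confidence-weighted combination of two anchor predictions.
def combine(extrapolated, inferred, extrap_sq_err, inferred_sq_err):
    conf_extrap = 1.0 / extrap_sq_err    # confidence in the extrapolation
    conf_infer = 1.0 / inferred_sq_err   # confidence in the feature model
    w = conf_extrap / (conf_extrap + conf_infer)
    return w * extrapolated + (1.0 - w) * inferred

# Hypothetical anchor predictions at 75K segments:
print(combine(extrapolated=0.27, inferred=0.25,
              extrap_sq_err=0.0004, inferred_sq_err=0.0036))
```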

language model

Appears in 3 sentences as: language model (2) language models (1)
In Prediction of Learning Curves in Machine Translation
  1. In this section we address scenario S1: we have access to a source-language monolingual collection (from which portions to be manually translated could be sampled) and a target-language in-domain monolingual corpus, to supplement the target side of a parallel corpus while training a language model.
    Page 4, “Inferring a learning curve from mostly monolingual data”
  2. (b) perplexity of language models of order 2 to 5 derived from the monolingual source corpus computed on the source side of the test corpus.
    Page 4, “Inferring a learning curve from mostly monolingual data”
  3. The Lasso regression model selected four features from the entire feature set: i) size of the test set (sentences & tokens), ii) perplexity of the language model (order 5) on the test set, and iii) type-token ratio of the target monolingual corpus.
    Page 6, “Inferring a learning curve from mostly monolingual data”
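
The features named above include test-set size, language-model perplexity on the test set, and type-token ratio. A toy sketch of the latter two, substituting a bigram model with add-one smoothing for the order-5 language model used in the paper, purely to keep the example self-contained.

```python
# Toy feature computation: type-token ratio and n-gram LM perplexity.
import math
from collections import Counter

def type_token_ratio(tokens):
    return len(set(tokens)) / len(tokens)

def bigram_perplexity(train_tokens, test_tokens):
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    vocab = len(unigrams) + 1  # +1 budget for unseen tokens
    log_prob = 0.0
    for w1, w2 in zip(test_tokens, test_tokens[1:]):
        # Add-one smoothed bigram probability p(w2 | w1).
        log_prob += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab))
    return math.exp(-log_prob / (len(test_tokens) - 1))

corpus = "the cat sat on the mat the cat ran".split()
test = "the cat sat".split()
print(type_token_ratio(corpus), bigram_perplexity(corpus, test))
```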

language pair

Appears in 3 sentences as: language pair (3)
In Prediction of Learning Curves in Machine Translation
  1. Our experiments involve 30 distinct language pair and domain combinations and 96 different learning curves.
    Page 2, “Introduction”
  2. for all the six families on a test dataset for the English-German language pair.
    Page 4, “Selecting a parametric family of curves”
  3. For each configuration (combination of language pair and domain) c and test set t in Table 2, a gold curve is fitted using the selected tri-parameter power-law family on a fine grid of corpus sizes.
    Page 5, “Inferring a learning curve from mostly monolingual data”

SMT system

Appears in 3 sentences as: SMT system (3)
In Prediction of Learning Curves in Machine Translation
  1. This prediction, or more generally the prediction of the learning curve of an SMT system as a function of available in-domain parallel data, is the objective of this paper.
    Page 1, “Introduction”
  2. An extensive study across six parametric function families, empirically establishing that a certain three-parameter power-law family is well suited for modeling learning curves for the Moses SMT system when the evaluation score is BLEU.
    Page 1, “Introduction”
  3. For enabling this work we trained a multitude of instances of the same phrase-based SMT system on 30 distinct combinations of language-pair and domain, each with fourteen distinct training sets of increasing size, and tested these instances on multiple in-domain datasets, generating 96 learning curves.
    Page 8, “Inferring a learning curve from mostly monolingual data”

statistical machine translation

Appears in 3 sentences as: statistical machine translation (3)
In Prediction of Learning Curves in Machine Translation
  1. Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific purpose.
    Page 1, “Abstract”
  2. Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific business purpose.
    Page 1, “Introduction”
  3. The ability to predict the amount of parallel data required to achieve a given level of quality is very valuable in planning business deployments of statistical machine translation; yet, we are not aware of any rigorous proposal for addressing this need.
    Page 8, “Inferring a learning curve from mostly monolingual data”
