Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation
Cohn, Trevor and Specia, Lucia

Article Structure

Abstract

Annotating linguistic data is often a complex, time consuming and expensive endeavour.

Introduction

Most empirical work in Natural Language Processing (NLP) is based on supervised machine learning techniques which rely on human annotated data of some form or another.

Quality Estimation

Quality estimation (QE) for MT aims at providing an estimate on the quality of each translated segment — typically a sentence — without access to reference translations.

Gaussian Process Regression

Machine learning models for quality estimation typically treat the problem as regression, seeking to model the relationship between features of the text input and the human quality judgement as a continuous response variable.

Multitask Quality Estimation 4.1 Experimental Setup

Feature sets: In all experiments we use 17 shallow QE features that have been shown to perform well in previous work.

Conclusion

This paper presented a novel approach for learning from human linguistic annotations by explicitly training models of individual annotators (and possibly additional metadata) using multitask learning.

Topics

hyperparameters

Appears in 14 sentences as: Hyperparameter (1) hyperparameter (4) hyperparameters (9)
In Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation
  1. Specifically, we can derive the gradient of the (log) marginal likelihood with respect to the model hyperparameters (i.e., σ, σ_n, σ_s, etc.)
    Page 4, “Gaussian Process Regression”
  2. Note that in general the marginal likelihood is non-convex in the hyperparameter values, and consequently the solutions may only be locally optimal.
    Page 4, “Gaussian Process Regression”
  3. Here we bootstrap the learning of complex models with many hyperparameters by initialising
    Page 4, “Gaussian Process Regression”
  4. Moreover GPs provide greater flexibility in fitting the kernel hyperparameters even for complex composite kernels.
    Page 5, “Gaussian Process Regression”
  5. In typical usage, the kernel hyperparameters for an SVM are fit using held-out estimation, which is inefficient and often involves tying together parameters to limit the search complexity (e.g., using a single scale parameter in the squared exponential).
    Page 5, “Gaussian Process Regression”
  6. This corresponds to independent modelling of each task, although all models share the same data kernel, so this setting is not strictly equivalent to independent training with independent per-task data kernels (with different hyperparameters).
    Page 5, “Gaussian Process Regression”
  7. Combined A simple approach for B is a weighted combination of Independent and Pool, i.e., B = 1 + αI, where the hyperparameter α ≥ 0 controls the amount of intertask transfer between each task and the global ‘pooled’ task. For dissimilar tasks, a high value of α allows each task to be modelled independently, while for more similar tasks low α allows the use of a large pool of
    Page 5, “Gaussian Process Regression”
  8. In contrast to these earlier approaches, we learn the hyperparameter a directly, fitting the relative amounts of inter- versus intra-task transfer to the dataset.
    Page 6, “Gaussian Process Regression”
  9. Combined+ We consider an extension to the Combined kernel, B = 1 + diag(α), α_d ≥ 0, in which each task has a different hyperparameter modulating its independence from the global pool.
    Page 6, “Gaussian Process Regression”
  10. GP: All GP models were implemented using the GPML Matlab toolbox. Hyperparameter optimisation was performed using conjugate gradient ascent of the log marginal likelihood function, with up to 100 iterations.
    Page 6, “Multitask Quality Estimation 4.1 Experimental Setup”
  11. The simpler models were initialised with all hyperparameters set to one, while more complex models were initialised using the
    Page 6, “Multitask Quality Estimation 4.1 Experimental Setup”
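
Sentence 1 above refers to the gradient of the (log) marginal likelihood used to fit the hyperparameters. For reference, a sketch of the standard objective and its gradient (Rasmussen and Williams, 2006), writing K_θ for the full kernel matrix (noise term folded in) under hyperparameters θ:

```latex
% Log marginal likelihood of a GP with kernel matrix K_\theta
\log p(\mathbf{y} \mid X, \theta)
  = -\tfrac{1}{2}\,\mathbf{y}^{\top} K_{\theta}^{-1} \mathbf{y}
    - \tfrac{1}{2}\log \lvert K_{\theta} \rvert
    - \tfrac{n}{2}\log 2\pi

% Gradient with respect to a single hyperparameter \theta_j,
% the quantity needed for conjugate gradient ascent
\frac{\partial}{\partial \theta_j} \log p(\mathbf{y} \mid X, \theta)
  = \tfrac{1}{2}\,\mathbf{y}^{\top} K_{\theta}^{-1}
      \frac{\partial K_{\theta}}{\partial \theta_j}
      K_{\theta}^{-1} \mathbf{y}
    - \tfrac{1}{2}\,\mathrm{tr}\!\left( K_{\theta}^{-1}
      \frac{\partial K_{\theta}}{\partial \theta_j} \right)
```

As sentence 2 notes, this objective is non-convex in the hyperparameters, so gradient ascent only guarantees a locally optimal solution.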
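Sentences 6–9 above describe the task covariance matrix B that couples tasks which share a single data kernel. Below is a minimal numpy sketch of the Combined and Combined+ constructions; the squared-exponential data kernel, the single length-scale, and all function names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def sq_exp_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential data kernel k(x, x') shared by all tasks."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale ** 2)

def task_matrix_combined(n_tasks, alpha):
    """Combined: B = 1 + alpha * I (all-ones 'pool' matrix plus weighted identity)."""
    return np.ones((n_tasks, n_tasks)) + alpha * np.eye(n_tasks)

def task_matrix_combined_plus(alphas):
    """Combined+: B = 1 + diag(alpha), one independence weight per task."""
    alphas = np.asarray(alphas)
    return np.ones((len(alphas), len(alphas))) + np.diag(alphas)

def multitask_kernel(X1, t1, X2, t2, B, **data_kwargs):
    """K((x, t), (x', t')) = B[t, t'] * k_data(x, x') -- coregionalisation form."""
    return B[np.ix_(t1, t2)] * sq_exp_kernel(X1, X2, **data_kwargs)

# toy usage: 5 labelled instances, 3-dimensional features, 2 annotators (tasks)
X = np.random.randn(5, 3)
tasks = np.array([0, 0, 1, 1, 0])
B = task_matrix_combined(n_tasks=2, alpha=0.5)
K = multitask_kernel(X, tasks, X, tasks, B)   # 5x5 covariance over the instances
```

With α large the identity term dominates and tasks are modelled nearly independently; with α close to zero B approaches the all-ones Pool matrix and all tasks share one model.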
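Sentences 10–11 describe fitting the hyperparameters by gradient-based maximisation of the log marginal likelihood with GPML in Matlab. As a rough stand-in, here is a self-contained Python sketch that minimises the negative log marginal likelihood of a squared-exponential-plus-noise GP, using scipy's L-BFGS with numerical gradients rather than GPML's conjugate gradients; starting from all hyperparameters equal to one mirrors the simple-model initialisation mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    """Negative log marginal likelihood of a GP with a squared-exponential
    kernel plus additive noise; parameters are optimised in log space so
    they stay positive."""
    length_scale, signal_var, noise_var = np.exp(log_params)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = signal_var * np.exp(-0.5 * d2 / length_scale ** 2)
    K += noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y via Cholesky
    return (0.5 * y @ alpha
            + np.log(np.diag(L)).sum()                    # 0.5 * log|K|
            + 0.5 * len(X) * np.log(2 * np.pi))

# toy data standing in for the 17 shallow QE features and quality labels
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 17)), rng.normal(size=20)

# initialise all hyperparameters to one (log 1 = 0) and run up to 100 iterations
result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y),
                  method="L-BFGS-B", options={"maxiter": 100})
length_scale, signal_var, noise_var = np.exp(result.x)
```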


SVM

Appears in 10 sentences as: SVM (12)
In Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation
  1. In typical usage, the kernel hyperparameters for an SVM are fit using held-out estimation, which is inefficient and often involves tying together parameters to limit the search complexity (e.g., using a single scale parameter in the squared exponential).
    Page 5, “Gaussian Process Regression”
  2. Multiple-kernel learning (Gönen and Alpaydin, 2011) goes some way to addressing this problem within the SVM framework; however, this technique is limited to reweighting linear combinations of kernels and has high computational complexity.
    Page 5, “Gaussian Process Regression”
  3. Baselines: The baselines use the SVM regression algorithm with radial basis function kernel and parameters γ, ε and C optimised through grid-search and 5-fold cross validation on the training set.
    Page 6, “Multitask Quality Estimation 4.1 Experimental Setup”
  4. μ 0.8279 0.9899, SVM 0.6889 0.8201 (table excerpt; columns are MAE and RMSE)
    Page 7, “Multitask Quality Estimation 4.1 Experimental Setup”
  5. μ is a baseline which predicts the training mean, SVM uses the same system as the WMT12 QE task, and the remainder are GP regression models with different kernels (all include additive noise).
    Page 7, “Multitask Quality Estimation 4.1 Experimental Setup”
  6. From this we can see that all models do much better than the mean baseline and that most of the GP models have lower error than the state-of-the-art SVM .
    Page 7, “Multitask Quality Estimation 4.1 Experimental Setup”
  7. μ 0.8541 1.0119, Independent SVMs 0.7967 0.9673, EasyAdapt SVM 0.7655 0.9105 (table excerpt; columns are MAE and RMSE)
    Page 7, “Multitask Quality Estimation 4.1 Experimental Setup”
  8. The GP models significantly improve over the baselines, including an SVM trained independently and using the EasyAdapt method for multitask learning (Daumé III, 2007).
    Page 8, “Multitask Quality Estimation 4.1 Experimental Setup”
  9. While EasyAdapt showed an improvement over the independent SVM , it was a long way short of the GP models.
    Page 8, “Multitask Quality Estimation 4.1 Experimental Setup”
  10. Model / MAE / RMSE (table excerpt):
      μ 0.5596 0.7053
      μ_A 0.5184 0.6367
      μ_S 0.5888 0.7588
      μ_T 0.6300 0.8270
      Pooled SVM 0.5823 0.7472
      Independent_A SVM 0.5058 0.6351
      EasyAdapt SVM 0.7027 0.8816
      SINGLE-TASK LEARNING:
      Independent_A 0.5091 0.6362
      Independent_S 0.5980 0.7729
      Pooled 0.5834 0.7494
      Pooled & {N} 0.4932 0.6275
      MULTITASK LEARNING (Annotator):
      Combined_A 0.4815 0.6174
      Combined_A & {N} 0.4909 0.6268
      Combined+_A 0.4855 0.6203
      Combined+_A & {N} 0.4833 0.6102
      MULTITASK LEARNING (Translation system):
      Combined_S 0.5825 0.7482
      MULTITASK LEARNING (Sentence pair):
      Combined_T 0.5813 0.7410
      MULTITASK LEARNING (Combinations):
      Combined_{A,S} 0.4988 0.6490
      Combined_{A,S} & {N_{A,S}} 0.4707 0.6003
      Combined+_{A,S} 0.4772 0.6094
      Combined_{A,S,T} 0.4588 0.5852
      Combined_{A,S,T} & {N_{A,S}} 0.4723 0.6023
    Page 9, “Conclusion”
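
Sentence 3 above describes the SVM regression baseline with an RBF kernel and grid search over γ, ε and C under 5-fold cross validation. A minimal scikit-learn sketch of that setup follows; the grid values and the synthetic stand-in data are assumptions, not the values used in the paper.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# RBF-kernel SVR with gamma, epsilon and C tuned by grid search and
# 5-fold cross validation on the training set; grid values are illustrative.
param_grid = {
    "gamma":   [1e-3, 1e-2, 1e-1, 1.0],
    "epsilon": [0.01, 0.1, 0.5],
    "C":       [0.1, 1.0, 10.0, 100.0],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_mean_absolute_error")

# stand-in data: rows are the 17 shallow QE features, targets are quality scores
X_train = np.random.randn(50, 17)
y_train = np.random.rand(50) * 4 + 1
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)   # best setting and its MAE
```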
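Sentences 8–9 compare against EasyAdapt (Daumé III, 2007), which handles multiple tasks by feature augmentation: each instance keeps a shared copy of its features plus a task-specific copy, with zeros in the blocks belonging to other tasks, and a single model is trained on the augmented vectors. A small sketch, with illustrative names:

```python
import numpy as np

def easy_adapt(X, tasks, n_tasks):
    """Frustratingly-easy domain adaptation (Daumé III, 2007): each instance
    gets a shared copy of its features plus a task-specific copy, zeros
    elsewhere."""
    n, f = X.shape
    out = np.zeros((n, f * (n_tasks + 1)))
    out[:, :f] = X                                   # shared block
    for i, t in enumerate(tasks):
        out[i, f * (t + 1): f * (t + 2)] = X[i]      # per-task block
    return out

# usage: augment the QE features with annotator identity as the task,
# then train a single SVR on the augmented representation
X = np.random.randn(6, 17)
tasks = np.array([0, 1, 2, 0, 1, 2])
X_aug = easy_adapt(X, tasks, n_tasks=3)              # shape (6, 17 * 4)
```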


MT systems

Appears in 8 sentences as: MT system (4) MT systems (5)
In Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation
  1. In this paper we model the task of predicting the quality of sentence translations using datasets that have been annotated by several judges with different levels of expertise and reliability, containing translations from a variety of MT systems and on a range of different types of sentences.
    Page 2, “Introduction”
  2. Examples of applications of QE include improving post-editing efficiency by filtering out low quality segments which would require more effort and time to correct than translating from scratch (Specia et al., 2009), selecting high quality segments to be published as they are, without post-editing (Soricut and Echihabi, 2010), selecting a translation from either an MT system or a translation memory for post-editing (He et al., 2010), selecting the best translation from multiple MT systems (Specia et al., 2010), and highlighting subsegments that need revision (Bach et al., 2011).
    Page 2, “Quality Estimation”
  3. It is often desirable to include alternative translations of source sentences produced by multiple MT systems, which requires multiple annotators for unbiased judgements, particularly for labels such as post-editing time (a translation seen a second time will require less editing effort).
    Page 3, “Quality Estimation”
  4. It contains 299 English sentences translated into Spanish using two or more of eight MT systems randomly selected from all system submissions for WMT11 (Callison-Burch et al., 2011).
    Page 3, “Quality Estimation”
  5. These MT systems range from online and customised SMT systems to commercial rule-based systems.
    Page 3, “Quality Estimation”
  6. In our quality estimation experiments we consider as metadata the MT system which produced the translation, and the identity of the source sentence being translated.
    Page 6, “Gaussian Process Regression”
  7. Partitioning the data by annotator (μ_A) gives the best baseline result, while there is less information from the MT system or sentence identity.
    Page 8, “Multitask Quality Estimation 4.1 Experimental Setup”
  8. The multitask learning methods performed best when using the annotator identity as the task descriptor, and less well for the MT system and sentence pair, where they only slightly improved over the baseline.
    Page 8, “Multitask Quality Estimation 4.1 Experimental Setup”


shared task

Appears in 5 sentences as: shared task (5)
In Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation
  1. For an overview of various algorithms and features we refer the reader to the WMT12 shared task on QE (Callison-Burch et al., 2012).
    Page 2, “Quality Estimation”
  2. WMT12: This dataset was distributed as part of the WMT12 shared task on QE (Callison-Burch et al., 2012).
    Page 3, “Quality Estimation”
  3. These were used by a highly competitive baseline entry in the WMT12 shared task, and were extracted here using the system provided by that shared task. They include simple counts, e.g., the number of tokens in the sentences, as well as source and target language model probabilities.
    Page 6, “Multitask Quality Estimation 4.1 Experimental Setup”
  4. This is generally a very strong baseline: in the WMT12 QE shared task, only five out of 19 submissions were able to significantly outperform it, and only by including many complex additional features, tree kernels, etc.
    Page 6, “Multitask Quality Estimation 4.1 Experimental Setup”
  5. WMT12: Single task We start by comparing GP regression with alternative approaches using the WMT12 dataset on the standard task of predicting a weighted mean quality rating (as was done in the WMT12 QE shared task).
    Page 7, “Multitask Quality Estimation 4.1 Experimental Setup”
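
Sentence 3 above notes that the 17 baseline features include simple counts and language model probabilities. The sketch below computes a handful of features in that spirit; it is illustrative only, not the actual 17-feature set, and the pre-computed LM log-probabilities are assumed inputs.

```python
def shallow_qe_features(source, target, src_lm_logprob, tgt_lm_logprob):
    """A few illustrative shallow QE features: simple counts plus
    language-model scores. Not the actual WMT12 baseline feature set."""
    src_tokens, tgt_tokens = source.split(), target.split()
    return [
        len(src_tokens),                                        # source length
        len(tgt_tokens),                                        # target length
        len(tgt_tokens) / max(len(src_tokens), 1),              # length ratio
        sum(map(len, src_tokens)) / max(len(src_tokens), 1),    # avg source token length
        src_lm_logprob,                                         # source LM log-probability
        tgt_lm_logprob,                                         # target LM log-probability
    ]

# usage with LM scores computed elsewhere
feats = shallow_qe_features("the house is small", "la casa es pequeña", -12.3, -10.8)
```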


machine translation

Appears in 4 sentences as: machine translation (4)
In Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation
  1. Our experiments on two machine translation quality estimation datasets show uniform significant accuracy gains from multitask learning, and consistently outperform strong baselines.
    Page 1, “Abstract”
  2. This is the case, for example, of annotations on the quality of sentences generated using machine translation (MT) systems, which are often used to build quality estimation models (Blatz et al., 2004; Specia et al., 2009) — our application of interest.
    Page 1, “Introduction”
  3. Our experiments showed how our approach outperformed competitive baselines on two machine translation quality regression problems, including the highly challenging problem of predicting post-editing time.
    Page 9, “Conclusion”
  4. Models of individual annotators could be used to train machine translation systems to optimise an annotator-specific quality measure, or in active learning for corpus annotation, where the model can suggest the most appropriate instances for each annotator or the best annotator for a given instance.
    Page 9, “Conclusion”


translation quality

Appears in 4 sentences as: translation quality (4)
In Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation
  1. Our experiments on two machine translation quality estimation datasets show uniform significant accuracy gains from multitask learning, and consistently outperform strong baselines.
    Page 1, “Abstract”
  2. In addition to annotators’ own perceptions and expectations with respect to translation quality , a number of factors can affect their judgements on specific sentences.
    Page 1, “Introduction”
  3. We show in our experiments on two translation quality datasets that these multitask learning strategies are far superior to training individual per-task models or a single pooled model, and moreover that our multitask learning approach can achieve similar performance to these baselines using only a fraction of the training data.
    Page 2, “Introduction”
  4. Our experiments showed how our approach outperformed competitive baselines on two machine translation quality regression problems, including the highly challenging problem of predicting post-editing time.
    Page 9, “Conclusion”


translation system

Appears in 4 sentences as: Translation system (1) translation system (2) translation systems (1)
In Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation
  1. We address this problem using multitask learning in which we learn individual models for each context (the task, incorporating the annotator and other metadata: translation system and the source sentence) while also modelling correlations between tasks such that related tasks can mutually inform one another.
    Page 2, “Introduction”
  2. Let B^(i) be a square covariance matrix for the ith task descriptor of M, with a column and row for each value (e.g., annotator identity, translation system, etc.).
    Page 6, “Gaussian Process Regression”
  3. Model / MAE / RMSE (table excerpt):
      μ 0.5596 0.7053
      μ_A 0.5184 0.6367
      μ_S 0.5888 0.7588
      μ_T 0.6300 0.8270
      Pooled SVM 0.5823 0.7472
      Independent_A SVM 0.5058 0.6351
      EasyAdapt SVM 0.7027 0.8816
      SINGLE-TASK LEARNING:
      Independent_A 0.5091 0.6362
      Independent_S 0.5980 0.7729
      Pooled 0.5834 0.7494
      Pooled & {N} 0.4932 0.6275
      MULTITASK LEARNING (Annotator):
      Combined_A 0.4815 0.6174
      Combined_A & {N} 0.4909 0.6268
      Combined+_A 0.4855 0.6203
      Combined+_A & {N} 0.4833 0.6102
      MULTITASK LEARNING (Translation system):
      Combined_S 0.5825 0.7482
      MULTITASK LEARNING (Sentence pair):
      Combined_T 0.5813 0.7410
      MULTITASK LEARNING (Combinations):
      Combined_{A,S} 0.4988 0.6490
      Combined_{A,S} & {N_{A,S}} 0.4707 0.6003
      Combined+_{A,S} 0.4772 0.6094
      Combined_{A,S,T} 0.4588 0.5852
      Combined_{A,S,T} & {N_{A,S}} 0.4723 0.6023
    Page 9, “Conclusion”
  4. Models of individual annotators could be used to train machine translation systems to optimise an annotator-specific quality measure, or in active learning for corpus annotation, where the model can suggest the most appropriate instances for each annotator or the best annotator for a given instance.
    Page 9, “Conclusion”


best results

Appears in 3 sentences as: best result (1) best results (2)
In Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation
  1. Adding per-annotator noise to the pooled model provides a boost in performance; however, the best results are obtained using the Combined kernel, which brings the strengths of both the independent and pooled settings.
    Page 7, “Multitask Quality Estimation 4.1 Experimental Setup”
  2. The MTL model trained on 500 samples had an MAE of 0.7082 ± 0.0042, close to the best results from the full dataset in Table 2, despite using % as much data: here we use % as many training instances where each is singly (cf.
    Page 8, “Multitask Quality Estimation 4.1 Experimental Setup”
  3. However, making use of all these layers of metadata together gives substantial further improvements, reaching the best result with Combined_{A,S,T}.
    Page 8, “Multitask Quality Estimation 4.1 Experimental Setup”


error rates

Appears in 3 sentences as: error rates (3)
In Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation
  1. Shown above are the training mean baseline μ, single-task learning approaches, and multitask learning models, with the columns showing macro average error rates over all three response values.
    Page 7, “Multitask Quality Estimation 4.1 Experimental Setup”
  2. Note that here error rates are measured over all of the three annotators’ judgements, and consequently are higher than those measured against their average response in Table 1.
    Page 7, “Multitask Quality Estimation 4.1 Experimental Setup”
  3. To test this, we trained single-task, pooled and multitask models on randomly sub-sampled training sets of different sizes, and plot their error rates in Figure 1.
    Page 8, “Multitask Quality Estimation 4.1 Experimental Setup”


feature vector

Appears in 3 sentences as: feature vector (2) feature vectors (1)
In Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation
  1. In our regression task the data consists of n pairs D = {(x_i, y_i)}, where x_i ∈ R^F is an F-dimensional feature vector and y_i ∈ R is the response variable.
    Page 4, “Gaussian Process Regression”
  2. Each instance is a translation and the feature vector encodes its linguistic features; the response variable is a numerical quality judgement: post-editing time or Likert score.
    Page 4, “Gaussian Process Regression”
  3. GP regression assumes the presence of a latent function, f : R^F → R, which maps from the input space of feature vectors x to a scalar.
    Page 4, “Gaussian Process Regression”
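
For completeness, the standard GP regression predictive equations implied by sentences 1–3 (Rasmussen and Williams, 2006), with K the kernel matrix over training inputs, k_* the vector of kernel values between a test input x_* and the training inputs, and σ_n² the additive noise variance:

```latex
% Posterior predictive mean and variance at a test input x_*
\bar{f}_{*} = \mathbf{k}_{*}^{\top} \left( K + \sigma_n^{2} I \right)^{-1} \mathbf{y}

\mathbb{V}[f_{*}] = k(\mathbf{x}_{*}, \mathbf{x}_{*})
  - \mathbf{k}_{*}^{\top} \left( K + \sigma_n^{2} I \right)^{-1} \mathbf{k}_{*}
```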


machine learning

Appears in 3 sentences as: Machine learning (1) machine learning (2)
In Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation
  1. Most empirical work in Natural Language Processing (NLP) is based on supervised machine learning techniques which rely on human annotated data of some form or another.
    Page 1, “Introduction”
  2. Machine learning models for quality estimation typically treat the problem as regression, seeking to model the relationship between features of the text input and the human quality judgement as a continuous response variable.
    Page 3, “Gaussian Process Regression”
  3. In this paper we consider Gaussian Processes (GP) (Rasmussen and Williams, 2006), a probabilistic machine learning framework incorporating kernels and Bayesian non-parametrics, widely considered state-of-the-art for regression.
    Page 4, “Gaussian Process Regression”
