Robust Machine Translation Evaluation with Entailment Features
Sebastian Padó, Michel Galley, Dan Jurafsky, and Christopher D. Manning

Article Structure

Abstract

Existing evaluation metrics for machine translation lack crucial robustness: their correlations with human quality judgments vary considerably across languages and genres.

Introduction

Constant evaluation is vital to the progress of machine translation (MT).

Regression-based MT Quality Prediction

Current MT metrics tend to focus on a single dimension of linguistic information.

Textual Entailment vs. MT Evaluation

Our novel approach to MT evaluation exploits the similarity between MT evaluation and textual entailment (TE).

Experimental Evaluation

4.1 Experiments

Expt. 1: Predicting Absolute Scores

Data.

Expt. 2: Predicting Pairwise Preferences

In this experiment, we predict human pairwise preference judgments (cf.

Related Work

Researchers have exploited various resources to enable the matching between words or n-grams that are semantically close but not identical.

Conclusion and Outlook

In this paper, we have explored a strategy for the evaluation of MT output that aims at comprehensively assessing the meaning equivalence between reference and hypothesis.

Topics

NIST

Appears in 10 sentences as: NIST (6) NISTR (5)
In Robust Machine Translation Evaluation with Entailment Features
  1. We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings.
    Page 1, “Abstract”
  2. Since human evaluation is costly and difficult to do reliably, a major focus of research has been on automatic measures of MT quality, pioneered by BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).
    Page 1, “Introduction”
  3. BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations.
    Page 1, “Introduction”
  4. NISTR consists of 16 features.
    Page 4, “Experimental Evaluation”
  5. NIST-n scores (1 ≤ n ≤ 10) and information-weighted n-gram precision scores (1 ≤ n ≤ 4); NIST brevity penalty (BP); and NIST score divided by BP.
    Page 4, “Experimental Evaluation”
  6. Our first experiment evaluates the models we have proposed on a corpus with traditional annotation on a seven-point scale, namely the NIST OpenMT 2008 corpus. The corpus contains translations of newswire text into English from three source languages (Arabic (Ar), Chinese (Ch), Urdu (Ur)).
    Page 4, “Expt. 1: Predicting Absolute Scores”
  7. BLEUR, METEORR, and NISTR significantly predict one language each (all Arabic); TERR, MTR, and RTER predict two languages.
    Page 5, “Expt. 1: Predicting Absolute Scores”
  8. 1: Among individual metrics, METEORR and TERR do better than BLEUR and NISTR.
    Page 6, “Expt. 2: Predicting Pairwise Preferences”
  9. NISTR | 50.2 | 70.4 (table row: consistency %, system-level correlation)
    Page 7, “Expt. 2: Predicting Pairwise Preferences”
  10. Again, we see better results for METEORR and TERR than for BLEUR and NISTR, and the individual metrics do worse than the combination models.
    Page 7, “Expt. 2: Predicting Pairwise Preferences”
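Item 5 above enumerates the 16 components of NISTR (NIST-n for 1 ≤ n ≤ 10, information-weighted n-gram precisions for 1 ≤ n ≤ 4, the brevity penalty, and the NIST score divided by BP). As a rough illustration of the first group only, the sketch below computes per-sentence NIST-n scores with NLTK; NLTK does not expose the information-weighted precisions or the brevity penalty separately, so this is not the paper's feature extractor.

```python
# Sketch: per-sentence NIST-n scores (n = 1..10), one group of the NISTR features
# listed in item 5. Uses NLTK's NIST implementation with naive whitespace
# tokenization; the example sentences are made up.
from nltk.translate.nist_score import sentence_nist

reference = "the two sides agreed to resume the negotiations early next year".split()
hypothesis = "both sides agreed to restart the negotiations early in the coming year".split()

features = {}
for n in range(1, 11):
    try:
        features[f"nist_{n}"] = sentence_nist([reference], hypothesis, n=n)
    except ZeroDivisionError:  # can occur for hypotheses shorter than n words
        features[f"nist_{n}"] = 0.0

print(features)
```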

regression models

Appears in 9 sentences as: regression model (3) regression models (6)
In Robust Machine Translation Evaluation with Entailment Features
  1. This allows us to use an off-the-shelf RTE system to obtain features, and to combine them using a regression model as described in Section 2.
    Page 2, “Textual Entailment vs. MT Evaluation”
  2. They are small regression models as described in Section 2 over component scores of four widely used MT metrics.
    Page 4, “Experimental Evaluation”
  3. The regression models can simulate the behaviour of each component by setting the weights appropriately, but are strictly more powerful.
    Page 4, “Experimental Evaluation”
  4. We therefore verified that the three nontrivial “baseline” regression models indeed confer a benefit over the default component combination scores: BLEU-1 (which outperformed BLEU-4 in the MetricsMATR 2008 evaluation), NIST-4, and TER (with all costs set to 1).
    Page 4, “Experimental Evaluation”
  5. We found higher robustness and improved correlations for the regression models.
    Page 4, “Experimental Evaluation”
  6. The following three regression models implement the methods discussed in Sections 2 and 3.
    Page 4, “Experimental Evaluation”
  7. We optimize the weights of our regression models on two languages and then predict the human scores on the third language.
    Page 4, “Expt. 1: Predicting Absolute Scores”
  8. We also experimented with a logistic regression model that predicts binary preferences directly.
    Page 6, “Expt. 2: Predicting Pairwise Preferences”
  9. We have used an off-the-shelf RTE system to compute these features, and demonstrated that a regression model over these features can outperform an ensemble of traditional MT metrics in two experiments on different datasets.
    Page 8, “Conclusion and Outlook”
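Items 2 and 7 above describe the basic recipe: fit a regularized linear regression over metric component scores and predict human scores on a held-out source language. A minimal sketch, assuming scikit-learn and pandas; the file name, column names, and ridge penalty are illustrative assumptions, not details from the paper.

```python
# Sketch of a regression-based combination metric: component scores of several MT
# metrics are the features, human quality judgments the target. Train on two source
# languages, predict on the held-out third (cf. item 7). All names are illustrative.
import pandas as pd
from sklearn.linear_model import Ridge

data = pd.read_csv("mt_metric_features.csv")    # hypothetical: one row per hypothesis
feature_cols = [c for c in data.columns if c not in ("language", "human_score")]

train = data[data["language"].isin(["Arabic", "Chinese"])]
test = data[data["language"] == "Urdu"]

model = Ridge(alpha=1.0)                         # regularized linear regression
model.fit(train[feature_cols], train["human_score"])
predicted_scores = model.predict(test[feature_cols])

# Inspect which component scores carry the most weight in the ensemble.
for name, weight in sorted(zip(feature_cols, model.coef_), key=lambda x: -abs(x[1]))[:5]:
    print(f"{name:20s} {weight:+.3f}")
```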

human judgments

Appears in 8 sentences as: human judgment (1) human judgments (7)
In Robust Machine Translation Evaluation with Entailment Features
  1. BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations.
    Page 1, “Introduction”
  2. Unfortunately, each metric tends to concentrate on one particular type of linguistic information, none of which always correlates well with human judgments.
    Page 1, “Introduction”
  3. At the sentence level, we can correlate predictions in Experiment 1 directly with human judgments with Spearman’s ρ,
    Page 3, “Experimental Evaluation”
  4. Finally, the predictions are again correlated with human judgments using Spearman’s ρ. “Tie awareness” makes a considerable practical difference, improving correlation figures by 5–10 points.
    Page 4, “Experimental Evaluation”
  5. Since the default uniform cost does not always correlate well with human judgment, we duplicate these features for 9 nonuniform edit costs.
    Page 4, “Experimental Evaluation”
  6. The predictions of all models correlate highly significantly with human judgments, but we still see robustness issues for the individual MT metrics.
    Page 4, “Expt. 1: Predicting Absolute Scores”
  7. On the system level (bottom half of Table 1), there is high variance due to the small number of predictions per language, and many predictions are not significantly correlated with human judgments.
    Page 5, “Expt. 1: Predicting Absolute Scores”
  8. The right column shows Spearman’s ρ for the correlation between human judgments and tie-aware system-level predictions.
    Page 7, “Expt. 2: Predicting Pairwise Preferences”
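Items 3, 4, and 8 above evaluate predictions by rank correlation with human judgments. A small sketch using SciPy; the score arrays are placeholders, and SciPy's average-rank handling of ties is not necessarily the same as the "tie-aware" treatment referred to in item 4.

```python
# Sketch: Spearman's rho between metric predictions and human judgments.
# The two arrays below are illustrative placeholders, not data from the paper.
from scipy.stats import spearmanr

human_judgments = [5, 3, 4, 2, 5, 1]                   # e.g., 7-point adequacy ratings
metric_predictions = [0.71, 0.42, 0.55, 0.30, 0.69, 0.18]

rho, p_value = spearmanr(metric_predictions, human_judgments)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```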

BLEU

Appears in 5 sentences as: BLEU (8)
In Robust Machine Translation Evaluation with Entailment Features
  1. We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings.
    Page 1, “Abstract”
  2. Since human evaluation is costly and difficult to do reliably, a major focus of research has been on automatic measures of MT quality, pioneered by BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).
    Page 1, “Introduction”
  3. BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations.
    Page 1, “Introduction”
  4. (2006) have identified a number of problems with BLEU and related n-gram-based scores: (1) BLEU-like metrics are unreliable at the level of individual sentences due to data sparsity; (2) BLEU metrics can be “gamed” by permuting word order; (3) for some corpora and languages, the correlation to human ratings is very low even at the system level; (4) scores are biased towards statistical MT; (5) the quality gap between MT and human translations is not reflected in equally large BLEU differences.
    Page 1, “Introduction”
  5. BLEUR includes the following 18 sentence-level scores: BLEU-n and n-gram precision scores (1 ≤ n ≤ 4); BLEU brevity penalty (BP); BLEU score divided by BP.
    Page 4, “Experimental Evaluation”
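Item 5 above lists BLEU-n and n-gram precision scores (1 ≤ n ≤ 4) among the BLEUR features, and the n-gram topic below notes that sentence-level BLEU is smoothed. A sketch with NLTK's sentence-level BLEU; the smoothing method here is one of NLTK's built-ins, not necessarily the Lin and Och (2004) scheme used in the paper, and the example sentences are made up.

```python
# Sketch: smoothed sentence-level BLEU-n scores for n = 1..4, in the spirit of the
# BLEUR component features. Example sentences and the smoothing choice are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the two sides agreed to resume the negotiations early next year".split()
hypothesis = "both sides agreed to restart the negotiations early in the coming year".split()

smooth = SmoothingFunction().method1
features = {}
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))     # uniform weights up to order n
    features[f"bleu_{n}"] = sentence_bleu(
        [reference], hypothesis, weights=weights, smoothing_function=smooth
    )

print(features)
```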

feature set

Appears in 4 sentences as: Feature set (1) feature set (3)
In Robust Machine Translation Evaluation with Entailment Features
  1. (2005)), and thus predict the quality of MT hypotheses with a rich RTE feature set.
    Page 1, “Introduction”
  2. (2007) train binary classifiers on a feature set formed by a number of MT metrics.
    Page 2, “Regression-based MT Quality Prediction”
  3. Feature set | Consistency (%) | System-level correlation (ρ) (table header)
    Page 7, “Expt. 2: Predicting Pairwise Preferences”
  4. Conceptualizing MT evaluation as an entailment problem motivates the use of a rich feature set that covers, unlike almost all earlier metrics, a wide range of linguistic levels, including lexical, syntactic, and compositional phenomena.
    Page 8, “Conclusion and Outlook”

MT systems

Appears in 4 sentences as: MT system (1) MT systems (3)
In Robust Machine Translation Evaluation with Entailment Features
  1. Figure 1: Entailment status between an MT system hypothesis and a reference translation for equivalent (top) and nonequivalent (bottom) translations.
    Page 2, “Introduction”
  2. Each language consists of 1500–2800 sentence pairs produced by 7–15 MT systems.
    Page 4, “Expt. 1: Predicting Absolute Scores”
  3. 2) and may find use in uncovering systematic shortcomings of MT systems.
    Page 8, “Conclusion and Outlook”
  4. To some extent, of course, this problem holds as well for state-of-the-art MT systems.
    Page 8, “Conclusion and Outlook”

n-gram

Appears in 4 sentences as: n-gram (4)
In Robust Machine Translation Evaluation with Entailment Features
  1. BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations.
    Page 1, “Introduction”
  2. BLEUR includes the following 18 sentence-level scores: BLEU-n and n-gram precision scores (1 ≤ n ≤ 4); BLEU brevity penalty (BP); BLEU score divided by BP.
    Page 4, “Experimental Evaluation”
  3. To counteract BLEU’s brittleness at the sentence level, we also smooth BLEU-n and n-gram precision as in Lin and Och (2004).
    Page 4, “Experimental Evaluation”
  4. NIST-n scores (1 ≤ n ≤ 10) and information-weighted n-gram precision scores (1 ≤ n ≤ 4); NIST brevity penalty (BP); and NIST score divided by BP.
    Page 4, “Experimental Evaluation”

sentence-level

Appears in 4 sentences as: sentence-level (4)
In Robust Machine Translation Evaluation with Entailment Features
  1. System-level predictions are computed in both experiments from sentence-level predictions, as the ratio of sentences for which each system provided the best translation (Callison-Burch et al., 2008).
    Page 4, “Experimental Evaluation”
  2. BLEUR includes the following 18 sentence-level scores: BLEU-n and n-gram precision scores (1 ≤ n ≤ 4); BLEU brevity penalty (BP); BLEU score divided by BP.
    Page 4, “Experimental Evaluation”
  3. We first concentrate on the upper half (sentence-level results).
    Page 4, “Expt. 1: Predicting Absolute Scores”
  4. This result supports the conclusions we have drawn from the sentence-level analysis.
    Page 5, “Expt. 1: Predicting Absolute Scores”
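Item 1 above defines system-level predictions as the proportion of sentences for which a system produced the best translation. A minimal sketch of that aggregation; the nested score dictionary is a placeholder, and the handling of exact score ties is an assumption (the excerpt does not specify it).

```python
# Sketch: aggregate sentence-level predictions into system-level scores by counting,
# for each system, the fraction of segments on which it received the top score.
from collections import defaultdict

# sentence_scores[segment_id][system_id] = predicted sentence-level quality (placeholder data)
sentence_scores = {
    "seg1": {"sysA": 0.71, "sysB": 0.64, "sysC": 0.70},
    "seg2": {"sysA": 0.40, "sysB": 0.55, "sysC": 0.52},
    "seg3": {"sysA": 0.63, "sysB": 0.63, "sysC": 0.58},
}

systems = {s for scores in sentence_scores.values() for s in scores}
wins = defaultdict(int)
for scores in sentence_scores.values():
    best = max(scores.values())
    for system, score in scores.items():
        if score == best:               # ties credited to every top-scoring system (assumption)
            wins[system] += 1

system_level = {s: wins[s] / len(sentence_scores) for s in sorted(systems)}
print(system_level)                     # e.g., sysA and sysB each win 2 of 3 segments
```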

TER

Appears in 4 sentences as: TER (4)
In Robust Machine Translation Evaluation with Entailment Features
  1. We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings.
    Page 1, “Abstract”
  2. A number of metrics have been designed to account for paraphrase, either by making the matching more intelligent (TER, Snover et al.
    Page 1, “Introduction”
  3. We therefore verified that the three nontrivial “baseline” regression models indeed confer a benefit over the default component combination scores: BLEU-1 (which outperformed BLEU-4 in the MetricsMATR 2008 evaluation), NIST-4, and TER (with all costs set to 1).
    Page 4, “Experimental Evaluation”
  4. We start with the standard TER score and the number of each of the four edit operations.
    Page 4, “Experimental Evaluation”
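Item 4 above starts from the standard TER score and the counts of the four edit operations, and the "human judgments" topic notes that these features are duplicated under nine nonuniform edit costs. The sketch below is a plain word-level weighted edit distance, not real TER (it omits block shifts); it only illustrates the idea of turning one distance into several features by varying the cost parameters, and the example cost settings are made up.

```python
# Sketch: word-level weighted edit distance as a stand-in for TER-style features.
# Real TER also allows phrase shifts, which are omitted here. Varying the edit costs
# yields one feature per cost setting, echoing the nonuniform-cost duplication idea.
def weighted_edit_distance(hyp, ref, ins=1.0, delete=1.0, sub=1.0):
    m, n = len(hyp), len(ref)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * delete
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if hyp[i - 1] == ref[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + delete,        # drop a hypothesis word
                          d[i][j - 1] + ins,           # insert a reference word
                          d[i - 1][j - 1] + cost)      # keep or substitute
    return d[m][n] / max(n, 1)                         # normalize by reference length

hyp = "both sides agreed to restart the negotiations".split()
ref = "the two sides agreed to resume the negotiations".split()
for costs in [(1.0, 1.0, 1.0), (0.5, 1.0, 1.0), (1.0, 1.0, 0.5)]:  # illustrative settings
    print(costs, round(weighted_edit_distance(hyp, ref, *costs), 3))
```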

word order

Appears in 4 sentences as: word order (4)
In Robust Machine Translation Evaluation with Entailment Features
  1. (2006) have identified a number of problems with BLEU and related n-gram-based scores: (1) BLEU-like metrics are unreliable at the level of individual sentences due to data sparsity; (2) BLEU metrics can be “gamed” by permuting word order; (3) for some corpora and languages, the correlation to human ratings is very low even at the system level; (4) scores are biased towards statistical MT; (5) the quality gap between MT and human translations is not reflected in equally large BLEU differences.
    Page 1, “Introduction”
  2. The first example (top) shows a good translation that is erroneously assigned a low score by METEORR because (a) it cannot align fact and reality (METEORR aligns only synonyms) and (b) it punishes the change of word order through its “penalty” term.
    Page 6, “Expt. 1: Predicting Absolute Scores”
  3. The human rater’s favorite translation deviates considerably from the reference in lexical choice, syntactic structure, and word order, for which it is punished by MTR (rank 3/5).
    Page 7, “Expt. 2: Predicting Pairwise Preferences”
  4. Our data analysis has confirmed that each of the feature groups contributes to the overall success of the RTE metric, and that its gains come from its better success at abstracting away from valid variation (such as word order or lexical substitution), while still detecting major semantic divergences.
    Page 8, “Conclusion and Outlook”

feature weights

Appears in 3 sentences as: Feature Weights (1) feature weights (2)
In Robust Machine Translation Evaluation with Entailment Features
  1. Feature Weights.
    Page 7, “Expt. 2: Predicting Pairwise Preferences”
  2. Finally, we make two observations about feature weights in the RTER model.
    Page 7, “Expt. 2: Predicting Pairwise Preferences”
  3. Second, good MT evaluation feature weights are not good weights for RTE.
    Page 8, “Expt. 2: Predicting Pairwise Preferences”

linear regression

Appears in 3 sentences as: linear regression (3)
In Robust Machine Translation Evaluation with Entailment Features
  1. We first explore the combination of traditional scores into a more robust ensemble metric with linear regression.
    Page 1, “Introduction”
  2. We follow a similar idea, but use a regularized linear regression to directly predict human ratings.
    Page 2, “Regression-based MT Quality Prediction”
  3. We reuse the linear regression framework from Section 2 and predict pairwise preferences by predicting two absolute scores (as before) and comparing them.
    Page 6, “Expt. 2: Predicting Pairwise Preferences”
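Item 3 above reduces pairwise preference prediction to comparing two predicted absolute scores. A small sketch of that comparison step, reusing a fitted scikit-learn regressor like the one in the earlier regression sketch; the tie margin is an illustrative device, not a value from the paper.

```python
# Sketch: turn two absolute quality predictions into a pairwise preference.
# `model` is assumed to be a fitted regressor (e.g., the Ridge model sketched above);
# feats_a / feats_b are the feature vectors of the two competing hypotheses.
def pairwise_preference(model, feats_a, feats_b, tie_margin=0.0):
    score_a = model.predict([feats_a])[0]
    score_b = model.predict([feats_b])[0]
    if abs(score_a - score_b) <= tie_margin:
        return "tie"
    return "A" if score_a > score_b else "B"
```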

machine translation

Appears in 3 sentences as: Machine Translation (1) machine translation (2)
In Robust Machine Translation Evaluation with Entailment Features
  1. Existing evaluation metrics for machine translation lack crucial robustness: their correlations with human quality judgments vary considerably across languages and genres.
    Page 1, “Abstract”
  2. Constant evaluation is vital to the progress of machine translation (MT).
    Page 1, “Introduction”
  3. This experiment uses the 2006–2008 corpora of the Workshop on Statistical Machine Translation (WMT). It consists of data from EUROPARL (Koehn, 2005) and various news commentaries, with five source languages (French, German, Spanish, Czech, and Hungarian).
    Page 6, “Expt. 2: Predicting Pairwise Preferences”

sentence pairs

Appears in 3 sentences as: sentence pair (1) sentence pairs (2)
In Robust Machine Translation Evaluation with Entailment Features
  1. The average total runtime per sentence pair is 5 seconds on an AMD 2.6GHz Opteron core — efficient enough to perform regular evaluations on development and test sets.
    Page 3, “Textual Entailment vs. MT Evaluation”
  2. Each language consists of 1500–2800 sentence pairs produced by 7–15 MT systems.
    Page 4, “Expt. 1: Predicting Absolute Scores”
  3. RTER has a rather flat learning curve that climbs to within 2 points of the final correlation value for 20% of the training set (about 400 sentence pairs).
    Page 5, “Expt. 1: Predicting Absolute Scores”

translation quality

Appears in 3 sentences as: translation quality (3)
In Robust Machine Translation Evaluation with Entailment Features
  1. Our second, more fundamental, strategy replaces the use of loose surrogates of translation quality with a model that attempts to comprehensively assess meaning equivalence between references and MT hypotheses.
    Page 1, “Introduction”
  2. We thus expect even noisy RTE features to be predictive for translation quality.
    Page 2, “Textual Entailment vs. MT Evaluation”
  3. (2006) use the degree of overlap between the dependency trees of reference and hypothesis as a predictor of translation quality.
    Page 8, “Related Work”

WordNet

Appears in 3 sentences as: WordNet (3)
In Robust Machine Translation Evaluation with Entailment Features
  1. The computation of these scores makes extensive use of about ten lexical similarity resources, including WordNet, InfoMap, and Dekang Lin’s thesaurus.
    Page 3, “Textual Entailment vs. MT Evaluation”
  2. The first is that the lack of alignments for two function words is unproblematic; the second is that the alignment between fact and reality, which is established on the basis of WordNet similarity, is indeed licensed in the current context.
    Page 6, “Expt. 1: Predicting Absolute Scores”
  3. Banerjee and Lavie (2005) and Chan and Ng (2008) use WordNet, and Zhou et al.
    Page 8, “Related Work”
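Items 1 and 2 above describe WordNet as one of the lexical similarity resources used to license alignments such as fact ~ reality. The sketch below uses NLTK's WordNet interface and takes the maximum path similarity over synset pairs; this is one simple similarity choice, not necessarily the measure used by the RTE system, and it requires the NLTK WordNet data to be downloaded.

```python
# Sketch: WordNet-based lexical similarity between two words, as the maximum
# path similarity over all synset pairs. Requires `nltk.download("wordnet")`.
from nltk.corpus import wordnet as wn

def max_path_similarity(word1: str, word2: str) -> float:
    best = 0.0
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            sim = s1.path_similarity(s2)       # None for incomparable synsets
            if sim is not None and sim > best:
                best = sim
    return best

print(max_path_similarity("fact", "reality"))  # nonzero, suggesting an alignment is plausible
```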
