Abstract | In this work, we propose a novel approach for the meta-evaluation of MT evaluation metrics, since correlation coefficients with human judges do not reveal details about the advantages and disadvantages of particular metrics. |
Correlation with Human Judgements | Let us first analyze the correlation with human judgements for linguistic vs. n-gram based metrics. |
Correlation with Human Judgements | Although correlation with human judgements is considered the standard meta-evaluation criterion, it presents serious drawbacks. |
Correlation with Human Judgements | For instance, Table 2 shows the best 10 metrics in CEOS according to their correlation with human judges at the system level, and then the ranking they obtain in the AEOS testbed. |
Introduction | In this respect, we identify important drawbacks of the standard meta-evaluation methods based on correlation with human judgements. |
Metrics and Test Beds | Human assessments of adequacy and fluency, on a 1-5 scale, are available for a subset of sentences, each evaluated by two different human judges. |
Previous Work on Machine Translation Meta-Evaluation | In order to address this issue, they computed the translation-by-translation correlation with human judgements (i.e., correlation at the segment level). |
Previous Work on Machine Translation Meta-Evaluation | In all these cases, metrics were also evaluated by means of correlation with human judgements. |
Previous Work on Machine Translation Meta-Evaluation | Most approaches again rely on correlation with human judgements. |
Abstract | In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE). |
Current practice in summary evaluation | Manual assessment, performed by human judges, usually centers around two main aspects of summary quality: content and form. |
Current practice in summary evaluation | In fact, when it comes to evaluation of automatic summaries, BE shows higher correlations with human judgments than ROUGE, although the difference is not large enough to be statistically significant. |
Dependency-based evaluation | In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics. |
Dependency-based evaluation | In summary evaluation, as will be shown in Section 5, it leads to higher correlations with human judgments only in the case of human-produced model summaries, because almost any variation between two model summaries is “legal”, i.e. |
Dependency-based evaluation | For automatic summaries, which are of relatively poor quality, partial matching lowers our method’s ability to reflect human judgment, because it results in overly generous matching in situations where the examined information is neither a paraphrase nor relevant. |
Experimental results | Of course, the ideal evaluation metric would show high correlations with human judgment on both levels. |
Experimental results | The letters in parentheses indicate that a given DEPEVAL(summ) variant is significantly better at correlating with human judgment than ROUGE-2 (= R2), ROUGE-SU4 (= R4), or BE-HM (= B). |
Introduction | Despite relying on the same concept, our approach outperforms BE in most comparisons, and it often achieves higher correlations with human judgments than the string-matching metric ROUGE (Lin, 2004). |
Expt. 1: Predicting Absolute Scores | The predictions of all models correlate highly significantly with human judgments, but we still see robustness issues for the individual MT metrics. |
Expt. 1: Predicting Absolute Scores | On the system level (bottom half of Table 1), there is high variance due to the small number of predictions per language, and many predictions are not significantly correlated with human judgments. |
Experimental Evaluation | At the sentence level, we can correlate the predictions in Experiment 1 directly with human judgments using Spearman’s ρ, |
Experimental Evaluation | Finally, the predictions are again correlated with human judgments using Spearman’s ρ. “Tie awareness” makes a considerable practical difference, improving correlation figures by 5-10 points. |
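The tie-aware correlation mentioned above can be sketched in plain Python: Spearman’s ρ is the Pearson correlation of the ranks, with tied values assigned the average of the positions they span. This is a minimal illustration of the statistic, not the paper’s implementation; the function names are hypothetical.

```python
def average_ranks(values):
    """Assign ranks 1..n, giving tied values the mean of their 1-based positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to cover the whole run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho = Pearson correlation of the tie-aware ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Concordant rankings (including a tie) give rho near 1.0:
print(round(spearman_rho([1, 2, 2, 4], [10, 20, 20, 40]), 6))  # → 1.0
```

In practice one would use a library routine such as `scipy.stats.spearmanr`, which applies the same average-rank treatment of ties.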
Experimental Evaluation | Since the default uniform cost does not always correlate well with human judgment, we duplicate these features for 9 nonuniform edit costs. |
Expt. 2: Predicting Pairwise Preferences | The right column shows Spearman’s ρ for the correlation between human judgments and tie-aware system-level predictions. |
Introduction | BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations. |
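The n-gram overlap idea behind BLEU and NIST can be illustrated with a minimal clipped (modified) n-gram precision, where each hypothesis n-gram count is capped by its count in the reference. This sketch omits BLEU’s brevity penalty and the geometric mean over n-gram orders; the function names are illustrative, not from either metric’s reference implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hyp, ref, n):
    """Modified n-gram precision: hypothesis counts clipped by reference counts."""
    h, r = ngrams(hyp, n), ngrams(ref, n)
    overlap = sum(min(count, r[gram]) for gram, count in h.items())
    total = sum(h.values())
    return overlap / total if total else 0.0

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(clipped_precision(hyp, ref, 1))  # 5 of 6 hypothesis unigrams occur in the reference
print(clipped_precision(hyp, ref, 2))  # 3 of 5 hypothesis bigrams occur in the reference
```

The clipping step is what keeps a degenerate hypothesis such as "the the the the" from scoring highly: its unigram count for "the" is capped at the reference count.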
Introduction | Unfortunately, each metric tends to concentrate on one particular type of linguistic information, none of which always correlates well with human judgments. |
Introduction | When employing any such metric, it is crucial to verify that the predictions of the automated evaluation process agree with human judgements of the important aspects of the system output. |
Introduction | counter-examples to the claim that BLEU agrees with human judgements. |
Introduction | Also, Foster (2008) examined a range of automated metrics for evaluating generated multimodal output and found that few agreed with the preferences expressed by human judges. |