Index of papers in Proc. ACL 2009 that mention
  • human judgements
Amigó, Enrique and Giménez, Jesús and Gonzalo, Julio and Verdejo, Felisa
Abstract
In this work, we propose a novel approach for meta-evaluation of MT evaluation metrics, since correlation coefficients against human judges do not reveal details about the advantages and disadvantages of particular metrics.
Correlation with Human Judgements
Let us first analyze the correlation with human judgements for linguistic vs. n-gram based metrics.
Correlation with Human Judgements
Although correlation with human judgements is considered the standard meta-evaluation criterion, it presents serious drawbacks.
Correlation with Human Judgements
For instance, Table 2 shows the best 10 metrics in CEOS according to their correlation with human judges at the system level, and then the ranking they obtain in the AEOS testbed.
Introduction
In this respect, we identify important drawbacks of the standard meta-evaluation methods based on correlation with human judgements.
Metrics and Test Beds
Human assessments of adequacy and fluency, on a 1-5 scale, are available for a subset of sentences, each evaluated by two different human judges.
Previous Work on Machine Translation Meta-Evaluation
In order to address this issue, they computed the translation-by-translation correlation with human judgements (i.e., correlation at the segment level).
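(Illustrative sketch, not from the paper: the segment-level vs. system-level distinction can be made concrete in a few lines of Python. The metric and human scores below are invented, and SciPy is assumed to be available.)

from scipy.stats import pearsonr

# Invented per-segment scores for three hypothetical systems.
metric_scores = {"sysA": [0.42, 0.55, 0.31, 0.60],
                 "sysB": [0.38, 0.71, 0.45, 0.52],
                 "sysC": [0.25, 0.40, 0.33, 0.48]}
human_scores  = {"sysA": [3.0, 4.0, 2.5, 4.5],
                 "sysB": [3.5, 5.0, 3.0, 4.0],
                 "sysC": [2.0, 3.0, 2.5, 3.5]}

# Segment level: pool every translation and correlate the score pairs directly.
seg_metric = [s for scores in metric_scores.values() for s in scores]
seg_human  = [s for scores in human_scores.values() for s in scores]
r_segment, _ = pearsonr(seg_metric, seg_human)

# System level: average per system first, then correlate the system means.
sys_metric = [sum(v) / len(v) for v in metric_scores.values()]
sys_human  = [sum(v) / len(v) for v in human_scores.values()]
r_system, _ = pearsonr(sys_metric, sys_human)

print(f"segment-level r = {r_segment:.3f}, system-level r = {r_system:.3f}")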
Previous Work on Machine Translation Meta-Evaluation
In all these cases, metrics were also evaluated by means of correlation with human judgements.
Previous Work on Machine Translation Meta-Evaluation
Most approaches again rely on correlation with human judgements.
human judgements is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Owczarzak, Karolina
Abstract
In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE).
Current practice in summary evaluation
Manual assessment, performed by human judges, usually centers around two main aspects of summary quality: content and form.
Current practice in summary evaluation
In fact, when it comes to evaluation of automatic summaries, BE shows higher correlations with human judgments than ROUGE, although the difference is not large enough to be statistically significant.
Dependency-based evaluation
In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics.
Dependency-based evaluation
In summary evaluation, as will be shown in Section 5, it leads to higher correlations with human judgments only in the case of human-produced model summaries, because almost any variation between two model summaries is “legal”, i.e.
Dependency-based evaluation
For automatic summaries, which are of relatively poor quality, partial matching lowers our method’s ability to reflect human judgment, because it results in overly generous matching in situations where the examined information is neither a paraphrase nor relevant.
Experimental results
Of course, the ideal evaluation metric would show high correlations with human judgment on both levels.
Experimental results
The letters in parentheses indicate that a given DEPEVAL(summ) variant is significantly better at correlating with human judgment than ROUGE-2 (= R2), ROUGE-SU4 (= R4), or BE-HM (= B).
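(Illustrative sketch, not the paper's own significance test: one common way to check whether one metric correlates significantly better with human judgments than another is a bootstrap test over the evaluated summaries. The helper and all scores below are hypothetical, and SciPy is assumed.)

import random
from scipy.stats import pearsonr

def bootstrap_corr_diff(scores_a, scores_b, human, n_boot=10000, seed=0):
    # One-sided bootstrap p-value for "metric A correlates better than metric B".
    rng = random.Random(seed)
    n = len(human)
    observed = pearsonr(scores_a, human)[0] - pearsonr(scores_b, human)[0]
    not_better = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        h = [human[i] for i in idx]
        if pearsonr(a, h)[0] - pearsonr(b, h)[0] <= 0:
            not_better += 1
    return observed, not_better / n_boot

# Invented per-summary scores.
human    = [3.0, 4.5, 2.0, 3.5, 4.0, 2.5, 5.0, 3.0]
metric_a = [0.40, 0.62, 0.25, 0.50, 0.55, 0.30, 0.70, 0.42]
metric_b = [0.45, 0.50, 0.35, 0.40, 0.60, 0.38, 0.55, 0.44]
diff, p = bootstrap_corr_diff(metric_a, metric_b, human)
print(f"corr(A) - corr(B) = {diff:.3f}, bootstrap p = {p:.3f}")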
Introduction
Despite relying on the same concept, our approach outperforms BE in most comparisons, and it often achieves higher correlations with human judgments than the string-matching metric ROUGE (Lin, 2004).
human judgements is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Pado, Sebastian and Galley, Michel and Jurafsky, Dan and Manning, Christopher D.
EXpt. 1: Predicting Absolute Scores
The predictions of all models correlate highly significantly with human judgments, but we still see robustness issues for the individual MT metrics.
EXpt. 1: Predicting Absolute Scores
On the system level (bottom half of Table 1), there is high variance due to the small number of predictions per language, and many predictions are not significantly correlated with human judgments.
Experimental Evaluation
At the sentence level, we can correlate predictions in Experiment 1 directly with human judgments with Spearman’s ρ,
Experimental Evaluation
Finally, the predictions are again correlated with human judgments using Spearman’s ρ. “Tie awareness” makes a considerable practical difference, improving correlation figures by 5–10 points.
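(Illustrative sketch, not from the paper: Spearman's ρ between system-level predictions and human judgments can be computed with SciPy, whose spearmanr resolves tied values by assigning average ranks; that is one simple form of tie handling, not necessarily the exact tie-aware scheme used by the authors. All numbers are invented.)

from scipy.stats import spearmanr

predictions = [0.71, 0.71, 0.64, 0.80, 0.55]   # note the tied predictions
human       = [3.9, 4.1, 3.5, 4.4, 3.0]        # invented human judgments per system

rho, p_value = spearmanr(predictions, human)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")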
Experimental Evaluation
Since the default uniform cost does not always correlate well with human judgment, we duplicate these features for 9 nonuniform edit costs.
Expt. 2: Predicting Pairwise Preferences
The right column shows Spearman’s ρ for the correlation between human judgments and tie-aware system-level predictions.
Introduction
BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations.
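(Illustrative sketch of the n-gram overlap idea behind BLEU-style metrics, not the metrics' actual implementations: clipped n-gram precision of a hypothesis against a single reference, with no brevity penalty. The example sentences are invented.)

from collections import Counter

def ngram_precision(hypothesis, reference, n=2):
    # Clipped n-gram precision of hypothesis tokens against one reference.
    hyp_ngrams = Counter(zip(*[hypothesis[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[reference[i:] for i in range(n)]))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
    return overlap / max(sum(hyp_ngrams.values()), 1)

hyp = "the cat sat on the mat".split()
ref = "there is a cat on the mat".split()
print(ngram_precision(hyp, ref, n=2))  # 0.4 for this toy pair: 2 of 5 bigrams match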
Introduction
Unfortunately, each metric tends to concentrate on one particular type of linguistic information, none of which always correlates well with human judgments.
human judgements is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Foster, Mary Ellen and Giuliani, Manuel and Knoll, Alois
Introduction
When employing any such metric, it is crucial to verify that the predictions of the automated evaluation process agree with human judgements of the important aspects of the system output.
Introduction
counter-examples to the claim that BLEU agrees with human judgements.
Introduction
Also, Foster (2008) examined a range of automated metrics for evaluating generated multimodal output and found that few agreed with the preferences expressed by human judges.
human judgements is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: