Abstract | In this work, we propose a novel approach for the meta-evaluation of MT evaluation metrics, since correlation coefficients with human judges do not reveal details about the advantages and disadvantages of particular metrics. |
Correlation with Human Judgements | Let us first analyze the correlation with human judgements for linguistic vs. n-gram based metrics. |
Correlation with Human Judgements | Although correlation with human judgements is considered the standard meta-evaluation criterion, it presents serious drawbacks. |
Correlation with Human Judgements | For instance, Table 2 shows the best 10 metrics in CEOS according to their correlation with human judges at the system level, and then the ranking they obtain in the AEOS testbed. |
Introduction | In this respect, we identify important drawbacks of the standard meta-evaluation methods based on correlation with human judgements. |
Metrics and Test Beds | Human assessments of adequacy and fluency, on a 1-5 scale, are available for a subset of sentences, each evaluated by two different human judges. |
Previous Work on Machine Translation Meta-Evaluation | In order to address this issue, they computed the translation-by-translation correlation with human judgements (i.e., correlation at the segment level). |
Previous Work on Machine Translation Meta-Evaluation | In all these cases, metrics were also evaluated by means of correlation with human judgements. |
Previous Work on Machine Translation Meta-Evaluation | Most approaches again rely on correlation with human judgements. |
Abstract | In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE). |
Current practice in summary evaluation | Manual assessment, performed by human judges, usually centers around two main aspects of summary quality: content and form. |
Current practice in summary evaluation | In fact, when it comes to evaluation of automatic summaries, BE shows higher correlations with human judgments than ROUGE, although the difference is not large enough to be statistically significant. |
Dependency-based evaluation | In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics. |
Dependency-based evaluation | In summary evaluation, as will be shown in Section 5, it leads to higher correlations with human judgments only in the case of human-produced model summaries, because almost any variation between two model summaries is “legal”, i.e. |
Dependency-based evaluation | For automatic summaries, which are of relatively poor quality, partial matching lowers our method’s ability to reflect human judgment, because it results in overly generous matching in situations where the examined information is neither a paraphrase nor relevant. |
Experimental results | Of course, the ideal evaluation metric would show high correlations with human judgment on both levels. |
Experimental results | The letters in parentheses indicate that a given DEPEVAL(summ) variant is significantly better at correlating with human judgment than ROUGE-2 (= R2), ROUGE-SU4 (= R4), or BE-HM (= B). |
Introduction | Despite relying on the same concept, our approach outperforms BE in most comparisons, and it often achieves higher correlations with human judgments than the string-matching metric ROUGE (Lin, 2004). |
Expt. 1: Predicting Absolute Scores | The predictions of all models correlate highly significantly with human judgments, but we still see robustness issues for the individual MT metrics. |
Expt. 1: Predicting Absolute Scores | On the system level (bottom half of Table 1), there is high variance due to the small number of predictions per language, and many predictions are not significantly correlated with human judgments. |
Experimental Evaluation | At the sentence level, we can correlate the predictions in Experiment 1 directly with human judgments using Spearman’s ρ, |
Experimental Evaluation | Finally, the predictions are again correlated with human judgments using Spearman’s ρ. “Tie awareness” makes a considerable practical difference, improving correlation figures by 5-10 points. |
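The tie-aware correlation mentioned above can be sketched in plain Python: Spearman’s ρ is the Pearson correlation of the ranks, with tied values assigned the average of the positions they span. This is a minimal illustration of the statistic, not the paper’s implementation; the function names are hypothetical.

```python
def average_ranks(values):
    """Assign ranks 1..n, giving tied values the mean of their 1-based positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to cover the whole run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho = Pearson correlation of the tie-aware ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Concordant rankings (including a tie) give rho near 1.0:
print(round(spearman_rho([1, 2, 2, 4], [10, 20, 20, 40]), 6))  # → 1.0
```

In practice one would use a library routine such as `scipy.stats.spearmanr`, which applies the same average-rank treatment of ties.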
Experimental Evaluation | Since the default uniform cost does not always correlate well with human judgment, we duplicate these features for 9 nonuniform edit costs. |
Expt. 2: Predicting Pairwise Preferences | The right column shows Spearman’s ρ for the correlation between human judgments and tie-aware system-level predictions. |
Introduction | BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations. |
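The n-gram overlap idea behind BLEU and NIST can be illustrated with a minimal clipped (modified) n-gram precision, where each hypothesis n-gram count is capped by its count in the reference. This sketch omits BLEU’s brevity penalty and the geometric mean over n-gram orders; the function names are illustrative, not from either metric’s reference implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hyp, ref, n):
    """Modified n-gram precision: hypothesis counts clipped by reference counts."""
    h, r = ngrams(hyp, n), ngrams(ref, n)
    overlap = sum(min(count, r[gram]) for gram, count in h.items())
    total = sum(h.values())
    return overlap / total if total else 0.0

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(clipped_precision(hyp, ref, 1))  # 5 of 6 hypothesis unigrams occur in the reference
print(clipped_precision(hyp, ref, 2))  # 3 of 5 hypothesis bigrams occur in the reference
```

The clipping step is what keeps a degenerate hypothesis such as "the the the the" from scoring highly: its unigram count for "the" is capped at the reference count.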
Introduction | Unfortunately, each metric tends to concentrate on one particular type of linguistic information, none of which always correlates well with human judgments. |
Introduction | When employing any such metric, it is crucial to verify that the predictions of the automated evaluation process agree with human judgements of the important aspects of the system output. |
Introduction | counter-examples to the claim that BLEU agrees with human judgements. |
Introduction | Also, Foster (2008) examined a range of automated metrics for evaluating generated multimodal output and found that few agreed with the preferences expressed by human judges. |