Index of papers in Proc. ACL 2010 that mention
  • human judgments
Bojar, Ondřej and Kos, Kamil and Mareček, David
Conclusion
This is confirmed for other languages as well: the lower the BLEU score, the lower the correlation to human judgments.
Extensions of SemPOS
For the evaluation of metric correlation with human judgments at the system level, we used the Pearson correlation coefficient ρ applied to ranks.
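As a side note, Pearson's coefficient computed on ranks is equivalent to Spearman's rank correlation. Below is a minimal sketch of that computation; the metric and human system-level scores are hypothetical, purely for illustration, and not taken from the paper:

```python
# Minimal sketch: Pearson correlation applied to ranks (equivalent to
# Spearman's rank correlation). All score values below are hypothetical.

def ranks(values):
    """Map each value to its 1-based rank (average rank for ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # group tied values and assign them the average of their ranks
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            rank[order[k]] = avg
        i = j + 1
    return rank

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

metric_scores = [0.31, 0.42, 0.28, 0.39, 0.35]  # hypothetical system scores
human_scores = [3.1, 4.0, 2.7, 3.8, 3.5]        # hypothetical human ratings
print(pearson(ranks(metric_scores), ranks(human_scores)))  # -> 1.0 here
```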
Extensions of SemPOS
The MetricsMATR08 human judgments include preferences for pairs of MT systems, indicating which of the two systems is better, while the WMT08 and WMT09 data contain system scores (for up to 5 systems) on a scale of 1 to 5 for a given sentence.
Extensions of SemPOS
Metrics’ performance for translation to English and Czech was measured on the following test sets (the number of human judgments for a given source language in brackets):
Introduction
Many automatic metrics of MT quality have been proposed and evaluated in terms of correlation with human judgments while various techniques of manual judging are being examined as well, see e.g.
Problems of BLEU
Its correlation to human judgments was originally deemed high (for English) but better correlating metrics (esp.
Problems of BLEU
Figure 1 illustrates a very low correlation to human judgments when translating to Czech.
Problems of BLEU
This amounts to 34% of running unigrams, giving enough space for translations to differ in human judgments and still remain unscored.
"human judgments" is mentioned in 11 sentences in this paper.
Echizen-ya, Hiroshi and Araki, Kenji
Abstract
Evaluation experiments were conducted to calculate the correlation between human judgments and the scores produced using automatic evaluation methods for MT outputs obtained from the 12 machine translation systems in NTCIR-7.
Experiments
We calculated the correlation between the scores obtained using our method and scores produced by human judgment .
Experiments
Moreover, three human judges evaluated 1,200 English output sentences from the perspective of adequacy and fluency on a scale of 1-5.
Experiments
We used the median value of the three human judges' ratings as the final score on the 1-5 scale.
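A small sketch of this aggregation step, assuming the final score per sentence is simply the median of the three judges' 1-5 ratings; the ratings below are hypothetical, not the paper's data:

```python
# Aggregate three judges' 1-5 ratings per sentence by taking the median.
# All ratings below are hypothetical, for illustration only.
from statistics import median

judge_scores = [  # one row per sentence: (judge1, judge2, judge3)
    (4, 5, 3),
    (2, 2, 3),
    (5, 4, 4),
]
final_scores = [median(triple) for triple in judge_scores]
print(final_scores)  # -> [4, 2, 4]
```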
Introduction
The scores of some automatic evaluation methods can achieve high correlation with human judgment in document-level automatic evaluation (Coughlin, 2007).
Introduction
Evaluation experiments using MT outputs obtained by 12 machine translation systems in NTCIR-7 (Fujii et al., 2008) demonstrate that the scores obtained using our system yield the highest correlation with the human judgments among the automatic evaluation methods in both sentence-level adequacy and fluency.
"human judgments" is mentioned in 8 sentences in this paper.
Scheible, Christian
Experiments
To determine the sentiment of these adjectives, we asked 9 human judges, all native German speakers, to annotate them given the classes neutral, slightly negative, very negative, slightly positive, and very positive, reflecting the categories from the training data.
Experiments
Since human judges tend to interpret scales differently, we examine their agreement using Kendall's coefficient of concordance (W), including correction for ties (Legendre, 2005), which takes ranks into account.
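For reference, here is a minimal sketch of Kendall's W with the tie correction, following the standard formula W = 12S / (m²(n³ − n) − mT), where S is the sum of squared deviations of the rank sums and T is the per-judge tie correction term; the ratings are hypothetical and not the paper's annotation data:

```python
# Minimal sketch: Kendall's coefficient of concordance W with correction
# for ties. Each row of `judges` holds one judge's ratings for the same
# n items; all values below are hypothetical.

def tied_ranks(values):
    """1-based ranks, tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            rank[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return rank

def kendalls_w(ratings):
    m, n = len(ratings), len(ratings[0])      # m judges, n items
    rank_rows = [tied_ranks(row) for row in ratings]
    R = [sum(row[i] for row in rank_rows) for i in range(n)]  # rank sums
    mean_R = sum(R) / n
    S = sum((r - mean_R) ** 2 for r in R)     # deviation of rank sums
    # Tie correction: T_j = sum over tie groups of (t^3 - t), per judge.
    T = 0
    for row in ratings:
        counts = {}
        for v in row:
            counts[v] = counts.get(v, 0) + 1
        T += sum(t ** 3 - t for t in counts.values())
    return 12 * S / (m ** 2 * (n ** 3 - n) - m * T)

judges = [[3, 3, 4, 2, 1],   # hypothetical 1-5 ratings of 5 items
          [3, 4, 4, 2, 1],
          [2, 3, 5, 2, 1]]
print(kendalls_w(judges))    # -> ~0.953, i.e. high concordance
```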
Experiments
Due to disagreements between the human judges, there is no clear threshold between these categories.
"human judgments" is mentioned in 4 sentences in this paper.
Thater, Stefan and Fürstenau, Hagen and Pinkal, Manfred
Experiment: Ranking Word Senses
Based on agreement between human judges, Erk and McCarthy (2009) estimate an upper bound ρ of 0.544 for the dataset.
Experiment: Ranking Word Senses
The first column shows the correlation of our model’s predictions with the human judgments from the gold-standard, averaged over all instances.
Experiment: Ranking Word Senses
Table 4: Correlation of model predictions and human judgments
"human judgments" is mentioned in 3 sentences in this paper.