Conclusion | This is confirmed for other languages as well: the lower the BLEU score, the lower the correlation with human judgments.
Extensions of SemPOS | For the evaluation of metric correlation with human judgments at the system level, we used the Pearson correlation coefficient ρ applied to ranks.
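Since Pearson's coefficient computed on ranks is equivalent to Spearman's rank correlation, this system-level evaluation can be sketched in a few lines. The sketch below assumes one human score and one metric score per system; the values are invented for illustration.

```python
# Minimal sketch: Pearson's r applied to ranks (equivalent to Spearman's rho).
# The per-system scores below are illustrative, not from the paper.
from scipy.stats import pearsonr, rankdata

human_scores  = [3.1, 2.4, 4.0, 3.5]      # one human score per MT system
metric_scores = [0.21, 0.18, 0.30, 0.25]  # one metric score per system

rho, _ = pearsonr(rankdata(human_scores), rankdata(metric_scores))
print(f"system-level rank correlation: {rho:.3f}")
```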
Extensions of SemPOS | The MetricsMATR08 human judgments include preferences for pairs of MT systems, indicating which of the two systems is better, while the WMT08 and WMT09 data contain system scores (for up to 5 systems) on a scale of 1 to 5 for a given sentence.
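Because the two data sources use different formats (pairwise preferences vs. 1-to-5 sentence scores), a common evaluation has to reconcile them. One plausible reconciliation, sketched below with a hypothetical helper `preferences_from_scores`, derives a preference for every system pair from the sentence scores; this is an assumption for illustration, not necessarily the paper's procedure.

```python
from itertools import combinations

def preferences_from_scores(scores):
    """Hypothetical helper: turn per-system 1-5 sentence scores into
    pairwise preferences. The higher-scored system of each pair 'wins';
    equal scores count as a tie (None)."""
    prefs = []
    for a, b in combinations(scores, 2):
        if scores[a] > scores[b]:
            prefs.append((a, b, a))
        elif scores[b] > scores[a]:
            prefs.append((a, b, b))
        else:
            prefs.append((a, b, None))  # tie
    return prefs

print(preferences_from_scores({"sys1": 4, "sys2": 2, "sys3": 4}))
```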
Extensions of SemPOS | Metrics’ performance for translation to English and Czech was measured on the following test sets (the number of human judgments for a given source language is given in brackets):
Introduction | Many automatic metrics of MT quality have been proposed and evaluated in terms of correlation with human judgments, while various techniques of manual judging are being examined as well; see, e.g.,
Problems of BLEU | Its correlation with human judgments was originally deemed high (for English), but better-correlating metrics (esp.
Problems of BLEU | Figure 1 illustrates a very low correlation with human judgments when translating to Czech.
Problems of BLEU | This amounts to 34% of running unigrams, leaving enough room for systems to differ in human judgments while still remaining unscored by BLEU.
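To make the 34% figure concrete, the unconfirmed portion can be estimated as the share of running hypothesis unigrams left unmatched after BLEU-style clipping against the reference. The sketch below is an illustrative approximation, not the paper's exact accounting; the example sentences are made up.

```python
from collections import Counter

def unconfirmed_unigram_rate(hyp_tokens, ref_tokens):
    """Fraction of running hypothesis unigrams not matched (with clipping)
    by the reference -- the portion BLEU's unigram precision never scores."""
    hyp, ref = Counter(hyp_tokens), Counter(ref_tokens)
    matched = sum(min(c, ref[w]) for w, c in hyp.items())
    return 1.0 - matched / sum(hyp.values())

hyp = "the cat sat on on the mat".split()
ref = "the cat is on the mat".split()
print(f"{unconfirmed_unigram_rate(hyp, ref):.0%} of running unigrams unconfirmed")
```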
Abstract | Evaluation experiments were conducted to calculate the correlation between human judgments and the scores produced by automatic evaluation methods for MT outputs obtained from the 12 machine translation systems in NTCIR-7.
Experiments | We calculated the correlation between the scores obtained using our method and the scores produced by human judgment.
Experiments | Moreover, three human judges evaluated 1,200 English output sentences for adequacy and fluency on a scale of 1 to 5.
Experiments | We used the median of the three human judges' evaluations as the final score on the 1-to-5 scale.
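A minimal sketch of this aggregation step, assuming one list of three judge scores per sentence (the scores below are invented):

```python
from statistics import median

# Collapse three judges' 1-5 scores into one final score per sentence
# by taking the median; one inner list per sentence.
judge_scores = [[4, 5, 3], [2, 2, 3], [5, 4, 4]]
final = [median(s) for s in judge_scores]
print(final)  # [4, 2, 4]
```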
Introduction | The scores of some automatic evaluation methods can achieve high correlation with human judgment in document-level automatic evaluation (Coughlin, 2007).
Introduction | Evaluation experiments using MT outputs obtained from 12 machine translation systems in NTCIR-7 (Fujii et al., 2008) demonstrate that the scores obtained using our system yield the highest correlation with human judgments among the automatic evaluation methods in both sentence-level adequacy and fluency.
Experiments | To determine the sentiment of these adjectives, we asked 9 human judges, all native German speakers, to annotate them using the classes neutral, slightly negative, very negative, slightly positive, and very positive, reflecting the categories from the training data.
Experiments | Since human judges tend to interpret scales differently, we examine their agreement using Kendall's coefficient of concordance (W) with correction for ties (Legendre, 2005), which takes ranks into account.
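Kendall's W with the tie correction of Legendre (2005) is straightforward to compute from average ranks and rank sums. The sketch below assumes an m-judges-by-n-items matrix of raw scores; the ratings are invented for illustration.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's coefficient of concordance W with correction for ties
    (Legendre, 2005). `ratings` is an m x n array of m judges' raw scores
    for n items; ties within a judge receive average ranks."""
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    ranks = np.apply_along_axis(rankdata, 1, ratings)  # average ranks per judge
    R = ranks.sum(axis=0)                              # rank sum per item
    S = ((R - R.mean()) ** 2).sum()                    # deviation of rank sums
    # tie correction: sum of (t^3 - t) over every group of t tied scores
    T = 0.0
    for row in ratings:
        _, counts = np.unique(row, return_counts=True)
        T += float((counts ** 3 - counts).sum())
    return 12.0 * S / (m ** 2 * (n ** 3 - n) - m * T)

# 3 hypothetical judges rating 4 adjectives on the 5-point sentiment scale
ratings = [[1, 2, 2, 5],
           [1, 3, 3, 4],
           [2, 2, 3, 5]]
print(f"W = {kendalls_w(ratings):.3f}")
```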
Experiments | Due to disagreements between the human judges, there is no clear threshold between these categories.
Experiment: Ranking Word Senses | Based on agreement between human judges, Erk and McCarthy (2009) estimate an upper bound ρ of 0.544 for the dataset.
Experiment: Ranking Word Senses | The first column shows the correlation of our model's predictions with the human judgments from the gold standard, averaged over all instances.
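A minimal sketch of this per-instance averaging, assuming gold human ratings and model scores over the senses of each instance (all values invented):

```python
from scipy.stats import spearmanr

# Correlate predicted sense-applicability scores with gold human judgments
# per instance, then average the correlations over instances.
gold = [[4.2, 1.0, 3.5], [2.0, 4.8, 1.5]]  # human ratings per sense
pred = [[0.9, 0.1, 0.6], [0.3, 0.8, 0.2]]  # model scores per sense

rhos = []
for g, p in zip(gold, pred):
    rho, _ = spearmanr(g, p)
    rhos.append(rho)
print(f"average rho over {len(rhos)} instances: {sum(rhos) / len(rhos):.3f}")
```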
Experiment: Ranking Word Senses | Table 4: Correlation of model predictions and human judgments |