Conclusion | This is confirmed for other languages as well: the lower the BLEU score, the lower the correlation with human judgments.
Extensions of SemPOS | For the evaluation of metric correlation with human judgments at the system level, we used the Pearson correlation coefficient ρ applied to ranks.
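Since Pearson's coefficient computed on ranks is equivalent to Spearman's rank correlation, this system-level evaluation can be sketched in a few lines. The sketch below assumes one human score and one metric score per system; the values are invented for illustration.

```python
# Minimal sketch: Pearson's r applied to ranks (equivalent to Spearman's rho).
# The per-system scores below are illustrative, not from the paper.
from scipy.stats import pearsonr, rankdata

human_scores  = [3.1, 2.4, 4.0, 3.5]      # one human score per MT system
metric_scores = [0.21, 0.18, 0.30, 0.25]  # one metric score per system

rho, _ = pearsonr(rankdata(human_scores), rankdata(metric_scores))
print(f"system-level rank correlation: {rho:.3f}")
```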
Extensions of SemPOS | The MetricsMATR08 human judgments include preferences for pairs of MT systems, indicating which of the two systems is better, while the WMT08 and WMT09 data contain system scores (for up to 5 systems) on a scale of 1 to 5 for a given sentence.
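Because the two data sources use different formats (pairwise preferences vs. 1-to-5 sentence scores), a common evaluation has to reconcile them. One plausible reconciliation, sketched below with a hypothetical helper `preferences_from_scores`, derives a preference for every system pair from the sentence scores; this is an assumption for illustration, not necessarily the paper's procedure.

```python
from itertools import combinations

def preferences_from_scores(scores):
    """Hypothetical helper: turn per-system 1-5 sentence scores into
    pairwise preferences. The higher-scored system of each pair 'wins';
    equal scores count as a tie (None)."""
    prefs = []
    for a, b in combinations(scores, 2):
        if scores[a] > scores[b]:
            prefs.append((a, b, a))
        elif scores[b] > scores[a]:
            prefs.append((a, b, b))
        else:
            prefs.append((a, b, None))  # tie
    return prefs

print(preferences_from_scores({"sys1": 4, "sys2": 2, "sys3": 4}))
```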
Extensions of SemPOS | Metrics’ performance for translation to English and Czech was measured on the following test sets (the number of human judgments for a given source language is given in brackets):
Introduction | Many automatic metrics of MT quality have been proposed and evaluated in terms of correlation with human judgments, while various techniques of manual judging are being examined as well; see, e.g.,
Problems of BLEU | Its correlation with human judgments was originally deemed high (for English), but better-correlating metrics (esp.
Problems of BLEU | Figure 1 illustrates a very low correlation with human judgments when translating to Czech.
Problems of BLEU | This amounts to 34% of running unigrams, leaving enough room for systems to differ in human judgments while still remaining unscored by BLEU.
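To make the 34% figure concrete, the unconfirmed portion can be estimated as the share of running hypothesis unigrams left unmatched after BLEU-style clipping against the reference. The sketch below is an illustrative approximation, not the paper's exact accounting; the example sentences are made up.

```python
from collections import Counter

def unconfirmed_unigram_rate(hyp_tokens, ref_tokens):
    """Fraction of running hypothesis unigrams not matched (with clipping)
    by the reference -- the portion BLEU's unigram precision never scores."""
    hyp, ref = Counter(hyp_tokens), Counter(ref_tokens)
    matched = sum(min(c, ref[w]) for w, c in hyp.items())
    return 1.0 - matched / sum(hyp.values())

hyp = "the cat sat on on the mat".split()
ref = "the cat is on the mat".split()
print(f"{unconfirmed_unigram_rate(hyp, ref):.0%} of running unigrams unconfirmed")
```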
Abstract | Evaluation experiments were conducted to calculate the correlation between human judgments and the scores produced by automatic evaluation methods for MT outputs obtained from the 12 machine translation systems in NTCIR-7.
Experiments | We calculated the correlation between the scores obtained using our method and the scores produced by human judgment.
Experiments | Moreover, three human judges evaluated 1,200 English output sentences for adequacy and fluency on a scale of 1 to 5.
Experiments | We used the median of the three human judges' evaluations as the final score on the 1-to-5 scale.
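A minimal sketch of this aggregation step, assuming one list of three judge scores per sentence (the scores below are invented):

```python
from statistics import median

# Collapse three judges' 1-5 scores into one final score per sentence
# by taking the median; one inner list per sentence.
judge_scores = [[4, 5, 3], [2, 2, 3], [5, 4, 4]]
final = [median(s) for s in judge_scores]
print(final)  # [4, 2, 4]
```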
Introduction | The scores of some automatic evaluation methods can achieve high correlation with human judgment in document-level automatic evaluation (Coughlin, 2007).
Introduction | Evaluation experiments using MT outputs obtained from 12 machine translation systems in NTCIR-7 (Fujii et al., 2008) demonstrate that the scores obtained using our system yield the highest correlation with human judgments among the automatic evaluation methods in both sentence-level adequacy and fluency.
Experiments | To determine the sentiment of these adjectives, we asked 9 human judges, all native German speakers, to annotate them using the classes neutral, slightly negative, very negative, slightly positive, and very positive, reflecting the categories from the training data.
Experiments | Since human judges tend to interpret scales differently, we examine their agreement using Kendall's coefficient of concordance (W) with correction for ties (Legendre, 2005), which takes ranks into account.
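Kendall's W with the tie correction of Legendre (2005) is straightforward to compute from average ranks and rank sums. The sketch below assumes an m-judges-by-n-items matrix of raw scores; the ratings are invented for illustration.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's coefficient of concordance W with correction for ties
    (Legendre, 2005). `ratings` is an m x n array of m judges' raw scores
    for n items; ties within a judge receive average ranks."""
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    ranks = np.apply_along_axis(rankdata, 1, ratings)  # average ranks per judge
    R = ranks.sum(axis=0)                              # rank sum per item
    S = ((R - R.mean()) ** 2).sum()                    # deviation of rank sums
    # tie correction: sum of (t^3 - t) over every group of t tied scores
    T = 0.0
    for row in ratings:
        _, counts = np.unique(row, return_counts=True)
        T += float((counts ** 3 - counts).sum())
    return 12.0 * S / (m ** 2 * (n ** 3 - n) - m * T)

# 3 hypothetical judges rating 4 adjectives on the 5-point sentiment scale
ratings = [[1, 2, 2, 5],
           [1, 3, 3, 4],
           [2, 2, 3, 5]]
print(f"W = {kendalls_w(ratings):.3f}")
```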
Experiments | Due to disagreements between the human judges, there is no clear threshold between these categories.
Experiment: Ranking Word Senses | Based on agreement between human judges, Erk and McCarthy (2009) estimate an upper bound ρ of 0.544 for the dataset.
Experiment: Ranking Word Senses | The first column shows the correlation of our model's predictions with the human judgments from the gold standard, averaged over all instances.
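A minimal sketch of this per-instance averaging, assuming gold human ratings and model scores over the senses of each instance (all values invented):

```python
from scipy.stats import spearmanr

# Correlate predicted sense-applicability scores with gold human judgments
# per instance, then average the correlations over instances.
gold = [[4.2, 1.0, 3.5], [2.0, 4.8, 1.5]]  # human ratings per sense
pred = [[0.9, 0.1, 0.6], [0.3, 0.8, 0.2]]  # model scores per sense

rhos = []
for g, p in zip(gold, pred):
    rho, _ = spearmanr(g, p)
    rhos.append(rho)
print(f"average rho over {len(rhos)} instances: {sum(rhos) / len(rhos):.3f}")
```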
Experiment: Ranking Word Senses | Table 4: Correlation of model predictions and human judgments |