Abstract | Experiments show that, in addition to being simple and efficient to compute, these metrics correlate highly with human judgments.
Experiments | The average scores of the two human judges are shown in Table 3. |
Experiments | 5.2 Correlation with human judgments |
Experiments | Having established rough correspondences between BLEU/PINC scores and human judgments of se- |
Introduction | Without these resources, researchers have resorted to developing their own small, ad hoc datasets (Barzilay and McKeown, 2001; Shinyama et al., 2002; Barzilay and Lee, 2003; Quirk et al., 2004; Dolan et al., 2004), and have often relied on human judgments to evaluate their results (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005).
Introduction | Section 5 presents experimental results establishing a correlation between our automatic metric and human judgments.
Paraphrase Evaluation Metrics | While PEM was shown to correlate well with human judgments, it has some limitations.
Related Work | While most work on evaluating paraphrase systems has relied on human judges (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005) or indirect, task-based methods (Lin and Pantel, 2001; Callison-Burch et al., 2006), there have also been a few attempts at creating automatic metrics that can be more easily replicated and used to compare different systems. |
Related Work | In addition, the metric was shown to correlate well with human judgments.
Related Work | However, a significant drawback of this approach is that PEM requires substantial in-domain bilingual data to train the semantic adequacy evaluator, as well as sample human judgments to train the overall metric. |
Dataset Construction and Human Performance | In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions.
Dataset Construction and Human Performance | Additionally, to test the extent to which the individual human judges are biased, we evaluate the performance of two virtual meta-judges. |
Dataset Construction and Human Performance | Specifically, the MAJORITY meta-judge predicts “deceptive” when at least two out of three human judges believe the review to be deceptive, and the SKEPTIC meta-judge predicts “deceptive” when any human judge believes the review to be deceptive.
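Dataset Construction and Human Performance | The two meta-judge rules above amount to simple voting functions; the sketch below is illustrative only, and the function and variable names are not from the paper.

```python
# Illustrative sketch of the MAJORITY and SKEPTIC meta-judge rules
# (names are hypothetical, not from the paper).

def majority_meta_judge(votes):
    """Predict "deceptive" when at least two of the three judges say deceptive."""
    return "deceptive" if sum(votes) >= 2 else "truthful"

def skeptic_meta_judge(votes):
    """Predict "deceptive" when any judge says deceptive."""
    return "deceptive" if any(votes) else "truthful"

# votes: one boolean per human judge; True = judge believes the review is deceptive.
votes = [True, False, False]
print(majority_meta_judge(votes))  # -> truthful
print(skeptic_meta_judge(votes))   # -> deceptive
```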
Introduction | In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges, who perform roughly at chance, a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).
Related Work | However, while these studies compare n-gram-based deception classifiers to a random guess baseline of 50%, we additionally evaluate and compare two other computational approaches (described in Section 4), as well as the performance of human judges (described in Section 3.3).
Related Work | Unfortunately, most measures of quality employed in those works are based exclusively on human judgments , which we find in Section 3 to be poorly calibrated to detecting deceptive opinion spam. |
Results and Discussion | We observe that automated classifiers outperform human judges for every metric, except truthful recall where JUDGE 2 performs best. However, this is expected given that untrained humans often focus on unreliable cues to deception (Vrij, 2008).
Results and Discussion | mated classifier outperforms most human judges (one-tailed sign test p = 0.06,0.01,0.001 for the three judges, respectively, on the first fold). |
Abstract | We introduce a novel semiautomated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost. |
Abstract | The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER. |
Abstract | Callison-Burch et al. (2006) and Koehn and Monz (2006) report cases where BLEU strongly disagrees with human judgment on translation quality.
Abstract | We show that this constrained model’s analyses of speaker authority correlate very strongly with expert human judgments (r2 coefficient of 0.947).
Background | In general, however, we now have an automated model that is reliable in reproducing human judgments of authoritativeness. |
Introduction | In section 5, this model is evaluated on a subset of the MapTask corpus (Anderson et al., 1991) and shows a high correlation with human judgements of authoritativeness (r2 = 0.947). |
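Introduction | As a reminder of what the reported figure measures, the sketch below computes r2 as the squared Pearson correlation between model scores and human ratings; this is a minimal illustration with made-up arrays, assuming that interpretation of the statistic rather than reproducing the paper's evaluation.

```python
# Minimal sketch of the r^2 statistic cited above, assuming it is the squared
# Pearson correlation between model scores and human ratings (arrays are made up).
import numpy as np

def r_squared(model_scores, human_scores):
    r = np.corrcoef(model_scores, human_scores)[0, 1]
    return r ** 2

model = np.array([0.9, 0.4, 0.7, 0.2, 0.8])
human = np.array([0.85, 0.5, 0.65, 0.25, 0.9])
print(r_squared(model, human))
```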