Index of papers in Proc. ACL 2011 that mention
  • human judgments
Chen, David and Dolan, William
Abstract
In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
Experiments
The average scores of the two human judges are shown in Table 3.
Experiments
5.2 Correlation with human judgments
Experiments
Having established rough correspondences between BLEU/PINC scores and human judgments of se-
Introduction
Without these resources, researchers have resorted to developing their own small, ad hoc datasets (Barzilay and McKeown, 2001; Shinyama et al., 2002; Barzilay and Lee, 2003; Quirk et al., 2004; Dolan et al., 2004), and have often relied on human judgments to evaluate their results (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005).
Introduction
Section 5 presents experimental results establishing a correlation between our automatic metric and human judgments.
Paraphrase Evaluation Metrics
While PEM was shown to correlate well with human judgments, it has some limitations.
Related Work
While most work on evaluating paraphrase systems has relied on human judges (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005) or indirect, task-based methods (Lin and Pantel, 2001; Callison-Burch et al., 2006), there have also been a few attempts at creating automatic metrics that can be more easily replicated and used to compare different systems.
Related Work
In addition, the metric was shown to correlate well with human judgments.
Related Work
However, a significant drawback of this approach is that PEM requires substantial in-domain bilingual data to train the semantic adequacy evaluator, as well as sample human judgments to train the overall metric.
human judgments is mentioned in 15 sentences in this paper.
Ott, Myle and Choi, Yejin and Cardie, Claire and Hancock, Jeffrey T.
Dataset Construction and Human Performance
In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions.
Dataset Construction and Human Performance
Additionally, to test the extent to which the individual human judges are biased, we evaluate the performance of two virtual meta-judges.
Dataset Construction and Human Performance
Specifically, the MAJORITY meta-judge predicts “deceptive” when at least two out of three human judges believe the review to be deceptive, and the SKEPTIC meta-judge predicts “deceptive” when any human judge believes the review to be deceptive.
Introduction
In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges, who perform roughly at-chance, a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).
Related Work
However, while these studies compare n-gram-based deception classifiers to a random guess baseline of 50%, we additionally evaluate and compare two other computational approaches (described in Section 4), as well as the performance of human judges (described in Section 3.3).
Related Work
Unfortunately, most measures of quality employed in those works are based exclusively on human judgments, which we find in Section 3 to be poorly calibrated to detecting deceptive opinion spam.
Results and Discussion
We observe that automated classifiers outperform human judges for every metric, except truthful recall where JUDGE 2 performs best. However, this is expected given that untrained humans often focus on unreliable cues to deception (Vrij, 2008).
Results and Discussion
automated classifier outperforms most human judges (one-tailed sign test p = 0.06, 0.01, 0.001 for the three judges, respectively, on the first fold).
human judgments is mentioned in 12 sentences in this paper.
Lo, Chi-kiu and Wu, Dekai
Abstract
We introduce a novel semiautomated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost.
Abstract
The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER.
Abstract
(2006) and Koehn and Monz (2006) report cases where BLEU strongly disagrees with human judgment on translation quality.
human judgments is mentioned in 9 sentences in this paper.
Mayfield, Elijah and Penstein Rosé, Carolyn
Abstract
We show that this constrained model’s analyses of speaker authority correlate very strongly with expert human judgments (r2 coefficient of 0.947).
Background
In general, however, we now have an automated model that is reliable in reproducing human judgments of authoritativeness.
Introduction
In section 5, this model is evaluated on a subset of the MapTask corpus (Anderson et al., 1991) and shows a high correlation with human judgements of authoritativeness (r2 = 0.947).
human judgments is mentioned in 3 sentences in this paper.