Abstract | Experiments show that, in addition to being simple and efficient to compute, these metrics correlate highly with human judgments.
Experiments | The average scores of the two human judges are shown in Table 3. |
Experiments | 5.2 Correlation with human judgments |
Experiments | Having established rough correspondences between BLEU/PINC scores and human judgments of se- |
Introduction | Without these resources, researchers have resorted to developing their own small, ad hoc datasets (Barzilay and McKeown, 2001; Shinyama et al., 2002; Barzilay and Lee, 2003; Quirk et al., 2004; Dolan et al., 2004), and have often relied on human judgments to evaluate their results (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005).
Introduction | Section 5 presents experimental results establishing a correlation between our automatic metric and human judgments.
Paraphrase Evaluation Metrics | While PEM was shown to correlate well with human judgments, it has some limitations.
Related Work | While most work on evaluating paraphrase systems has relied on human judges (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005) or indirect, task-based methods (Lin and Pantel, 2001; Callison-Burch et al., 2006), there have also been a few attempts at creating automatic metrics that can be more easily replicated and used to compare different systems. |
Related Work | In addition, the metric was shown to correlate well with human judgments.
Related Work | However, a significant drawback of this approach is that PEM requires substantial in-domain bilingual data to train the semantic adequacy evaluator, as well as sample human judgments to train the overall metric. |
Dataset Construction and Human Performance | In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions.
Dataset Construction and Human Performance | Additionally, to test the extent to which the individual human judges are biased, we evaluate the performance of two virtual meta-judges. |
Dataset Construction and Human Performance | Specifically, the MAJORITY meta-judge predicts “deceptive” when at least two out of three human judges believe the review to be deceptive, and the SKEPTIC meta-judge predicts “deceptive” when any human judge believes the review to be deceptive.
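Dataset Construction and Human Performance | The two meta-judge rules above amount to simple voting functions; the sketch below is illustrative only, and the function and variable names are not from the paper.

```python
# Illustrative sketch of the MAJORITY and SKEPTIC meta-judge rules
# (names are hypothetical, not from the paper).

def majority_meta_judge(votes):
    """Predict "deceptive" when at least two of the three judges say deceptive."""
    return "deceptive" if sum(votes) >= 2 else "truthful"

def skeptic_meta_judge(votes):
    """Predict "deceptive" when any judge says deceptive."""
    return "deceptive" if any(votes) else "truthful"

# votes: one boolean per human judge; True = judge believes the review is deceptive.
votes = [True, False, False]
print(majority_meta_judge(votes))  # -> truthful
print(skeptic_meta_judge(votes))   # -> deceptive
```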
Introduction | In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges, who perform roughly at chance, a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).
Related Work | However, while these studies compare n-gram-based deception classifiers to a random guess baseline of 50%, we additionally evaluate and compare two other computational approaches (described in Section 4), as well as the performance of human judges (described in Section 3.3).
Related Work | Unfortunately, most measures of quality employed in those works are based exclusively on human judgments , which we find in Section 3 to be poorly calibrated to detecting deceptive opinion spam. |
Results and Discussion | We observe that automated classifiers outperform human judges for every metric, except truthful recall where JUDGE 2 performs best. However, this is expected given that untrained humans often focus on unreliable cues to deception (Vrij, 2008).
Results and Discussion | mated classifier outperforms most human judges (one-tailed sign test p = 0.06,0.01,0.001 for the three judges, respectively, on the first fold). |
Abstract | We introduce a novel semiautomated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost. |
Abstract | The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER. |
Abstract | Callison-Burch et al. (2006) and Koehn and Monz (2006) report cases where BLEU strongly disagrees with human judgment on translation quality.
Abstract | We show that this constrained model’s analyses of speaker authority correlate very strongly with expert human judgments (r2 coefficient of 0.947).
Background | In general, however, we now have an automated model that is reliable in reproducing human judgments of authoritativeness. |
Introduction | In section 5, this model is evaluated on a subset of the MapTask corpus (Anderson et al., 1991) and shows a high correlation with human judgements of authoritativeness (r2 = 0.947). |
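Introduction | As a reminder of what the reported figure measures, the sketch below computes r2 as the squared Pearson correlation between model scores and human ratings; this is a minimal illustration with made-up arrays, assuming that interpretation of the statistic rather than reproducing the paper's evaluation.

```python
# Minimal sketch of the r^2 statistic cited above, assuming it is the squared
# Pearson correlation between model scores and human ratings (arrays are made up).
import numpy as np

def r_squared(model_scores, human_scores):
    r = np.corrcoef(model_scores, human_scores)[0, 1]
    return r ** 2

model = np.array([0.9, 0.4, 0.7, 0.2, 0.8])
human = np.array([0.85, 0.5, 0.65, 0.25, 0.9])
print(r_squared(model, human))
```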