Abstract | We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that it outperforms competitive baselines and other neural language models.
Conclusion | We introduced a new dataset with human judgments on similarity between pairs of words in context, so as to evaluate models’ abilities to capture homonymy and polysemy of words in context.
Experiments | Our model also improves the correlation with human judgments on a word similarity task. |
Experiments | More important, we introduce a new dataset with human judgments on similarity of pairs of words in sentential context.
Experiments | Each pair is presented without context and associated with 13 to 16 human judgments on similarity and relatedness on a scale from 0 to 10. |
Introduction | However, one limitation of this evaluation is that the human judgments are on pairs of words presented in isolation, without sentential context.
Introduction | Since word interpretation in context is important especially for homonymous and polysemous words, we introduce a new dataset with human judgments on similarity between pairs of words in sentential context. |
Abstract | The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements.
Abstract | The focus of this paper is to determine the correlation of automatic measures with human judgements for this task. |
Abstract | We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets. |
Introduction | In this paper we estimate the correlation of human judgements with five automatic evaluation measures on two image description data sets. |
Introduction | lated against human judgements, ROUGE-SU4 and Smoothed BLEU are moderately correlated, and the strongest correlation is found with Meteor.
Methodology | We estimate Spearman’s ρ for five different automatic evaluation measures against human judgements for the automatic image description task.
Methodology | The automatic measures are calculated on the sentence level and correlated against human judgements of semantic correctness. |
Methodology | The images were retrieved from Flickr, the reference descriptions were collected from Mechanical Turk, and the human judgements were collected from expert annotators as follows: each image in the test data was paired with the highest scoring sentence(s) retrieved from all possible test sentences by the TRI5SEM model in Hodosh et al.
Abstract | In this work, we propose a novel approach for meta-evaluation of MT evaluation metrics, since correlation coefficients against human judges do not reveal details about the advantages and disadvantages of particular metrics.
Correlation with Human Judgements | Let us first analyze the correlation with human judgements for linguistic vs. n-gram based metrics. |
Correlation with Human Judgements | Although correlation with human judgements is considered the standard meta-evaluation criterion, it presents serious drawbacks. |
Correlation with Human Judgements | For instance, Table 2 shows the best 10 metrics in CE05 according to their correlation with human judges at the system level, and then the ranking they obtain in the AE05 testbed.
Introduction | In this respect, we identify important drawbacks of the standard meta-evaluation methods based on correlation with human judgements.
Metrics and Test Beds | Human assessments of adequacy and fluency, on a 1-5 scale, are available for a subset of sentences, each evaluated by two different human judges.
Previous Work on Machine Translation Meta-Evaluation | In order to address this issue, they computed the translation-by-translation correlation with human judgements (i.e., correlation at the segment level). |
Previous Work on Machine Translation Meta-Evaluation | In all these cases, metrics were also evaluated by means of correlation with human judgements.
Previous Work on Machine Translation Meta-Evaluation | Most approaches again rely on correlation with human judgements.
Abstract | In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use human judgments as the gold standard to validate the conclusions drawn from the automatic measures. |
Abstract | We observe that it is hard for the human judges to reach good agreement when asked to rate the quality of the dialogs from given perspectives. |
Abstract | When building prediction models of human judgments using previously proposed automatic measures, we find that we cannot reliably predict human ratings using a regression model, but we can predict human rankings by a ranking model. |
Introduction | However, our approach uses human judgments as the gold standard.
Introduction | Although to date there are few studies that use human judges to directly assess the quality of user simulation, we believe that this is a reliable approach to assess the simulated corpora as well as an important step towards developing a comprehensive set of user simulation evaluation measures. |
Introduction | First, we can estimate the difficulty of the task of distinguishing real and simulated corpora by knowing how hard it is for human judges to reach an agreement. |
Experimental Results | Spearman’s correlation with human judgments.
Experimental Results | Overall, we observe an average improvement of +.024 in the correlation with the human judgments.
Experimental Results | Kendall’s Tau with human judgments.
Experimental Setup | We measured the correlation of the metrics with the human judgments provided by the organizers. |
Experimental Setup | 4.2 Human Judgements and Learning |
Experimental Setup | As in the WMT12 experimental setup, we use these rankings to calculate correlation with human judgments at the sentence-level, i.e. |
Related Work | Here we suggest some simple ways to create such metrics, and we also show that they yield better correlation with human judgments.
Related Work | However, they could not improve correlation with human judgments, as evaluated on the MetricsMATR dataset.
Related Work | Compared to the previous work, (i) we use a different discourse representation (RST), (ii) we compare discourse parses using all-subtree kernels (Collins and Duffy, 2001), (iii) we evaluate on much larger datasets, for several language pairs and for multiple metrics, and (iv) we do demonstrate better correlation with human judgments.
Abstract | Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. |
Abstract | It has a better correlation with human judgment than BLEU. |
Abstract | PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties). |
Experiments | We used Spearman’s rank correlation coefficient ρ to measure correlation of the metric with system-level human judgments of translation.
Experiments | The human judgment score is based on the “Rank” only, i.e., how often the translations of the system were rated as better than those from other systems (Callison-Burch et al., 2008). |
Experiments | Table 2: Correlations with human judgment on WMT
Experiments | BLEU 0.792 0.215 0.777 0.240
Experiments | METEOR 0.834 0.231 0.835 0.225
Experiments | PORT 0.801 0.236 0.804 0.242
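As an illustration of the system-level meta-evaluation described above, the sketch below computes Spearman's rank correlation between a metric's per-system scores and human judgment scores. All scores are invented for illustration; none are taken from Table 2.

```python
# Sketch: Spearman's rank correlation between an automatic metric's
# per-system scores and human judgment scores (system-level meta-evaluation).
# All scores below are invented, not taken from the paper.

def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 0-based positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = rank(xs), rank(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

metric_scores = [0.31, 0.27, 0.35, 0.22, 0.29]  # hypothetical per-system metric scores
human_scores = [0.40, 0.55, 0.60, 0.30, 0.45]   # hypothetical per-system human scores
print(round(spearman_rho(metric_scores, human_scores), 3))  # → 0.6
```

With no ties, this agrees with the familiar 1 − 6Σd²/(n(n²−1)) formula; averaging tied ranks keeps it valid when systems receive equal scores.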
Introduction | Many of the metrics correlate better with human judgments of translation quality than BLEU, as shown in recent WMT Evaluation Task reports (Callison-Burch et al.).
Introduction | Second, though a tuning metric should correlate strongly with human judgment, MERT (and similar algorithms) invoke the chosen metric so often that it must be computed quickly.
Introduction | (2011) claimed that TESLA tuning performed better than BLEU tuning according to human judgment . |
Conclusion | This is confirmed for other languages as well: the lower the BLEU score, the lower the correlation to human judgments.
Extensions of SemPOS | For the evaluation of metric correlation with human judgments at the system level, we used the Pearson correlation coefficient ρ applied to ranks.
Extensions of SemPOS | The MetricsMATR08 human judgments include preferences for pairs of MT systems saying which one of the two systems is better, while the WMT08 and WMT09 data contain system scores (for up to 5 systems) on the scale 1 to 5 for a given sentence. |
Extensions of SemPOS | Metrics’ performance for translation to English and Czech was measured on the following testsets (the number of human judgments for a given source language in brackets): |
Introduction | Many automatic metrics of MT quality have been proposed and evaluated in terms of correlation with human judgments while various techniques of manual judging are being examined as well, see e.g. |
Problems of BLEU | Its correlation to human judgments was originally deemed high (for English) but better correlating metrics (esp. |
Problems of BLEU | Figure 1 illustrates a very low correlation to human judgments when translating to Czech. |
Problems of BLEU | This amounts to 34% of running unigrams, giving enough space to differ in human judgments and still remain unscored. |
Dataset Construction and Human Performance | In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions.
Dataset Construction and Human Performance | Additionally, to test the extent to which the individual human judges are biased, we evaluate the performance of two virtual meta-judges. |
Dataset Construction and Human Performance | Specifically, the MAJORITY meta-judge predicts “deceptive” when at least two out of three human judges believe the review to be deceptive, and the SKEPTIC meta-judge predicts “deceptive” when any human judge believes the review to be deceptive.
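The two virtual meta-judges described here can be expressed directly; the boolean vote encoding (True = judge believes the review is deceptive) is an assumption for illustration:

```python
# Sketch of the MAJORITY and SKEPTIC meta-judges described above.
# Vote encoding (True = "judge says deceptive") is an assumption.

def majority_metajudge(votes):
    """Predict deceptive when at least two of three judges say deceptive."""
    return sum(votes) >= 2

def skeptic_metajudge(votes):
    """Predict deceptive when any judge says deceptive."""
    return any(votes)

votes = [True, False, True]  # hypothetical per-review judge votes
print(majority_metajudge(votes), skeptic_metajudge(votes))  # → True True
```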
Introduction | In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges, who perform roughly at-chance—a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).
Related Work | However, while these studies compare n-gram-based deception classifiers to a random guess baseline of 50%, we additionally evaluate and compare two other computational approaches (described in Section 4), as well as the performance of human judges (described in Section 3.3).
Related Work | Unfortunately, most measures of quality employed in those works are based exclusively on human judgments, which we find in Section 3 to be poorly calibrated to detecting deceptive opinion spam.
Results and Discussion | We observe that automated classifiers outperform human judges for every metric, except truthful recall where JUDGE 2 performs best. However, this is expected given that untrained humans often focus on unreliable cues to deception (Vrij, 2008).
Results and Discussion | The automated classifier outperforms most human judges (one-tailed sign test p = 0.06, 0.01, 0.001 for the three judges, respectively, on the first fold).
Abstract | In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
Experiments | The average scores of the two human judges are shown in Table 3. |
Experiments | 5.2 Correlation with human judgments |
Experiments | Having established rough correspondences between BLEU/PINC scores and human judgments of se- |
Introduction | Without these resources, researchers have resorted to developing their own small, ad hoc datasets (Barzilay and McKeown, 2001; Shinyama et al., 2002; Barzilay and Lee, 2003; Quirk et al., 2004; Dolan et al., 2004), and have often relied on human judgments to evaluate their results (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005).
Introduction | Section 5 presents experimental results establishing a correlation between our automatic metric and human judgments.
Paraphrase Evaluation Metrics | While PEM was shown to correlate well with human judgments, it has some limitations.
Related Work | While most work on evaluating paraphrase systems has relied on human judges (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005) or indirect, task-based methods (Lin and Pantel, 2001; Callison-Burch et al., 2006), there have also been a few attempts at creating automatic metrics that can be more easily replicated and used to compare different systems. |
Related Work | In addition, the metric was shown to correlate well with human judgments.
Related Work | However, a significant drawback of this approach is that PEM requires substantial in-domain bilingual data to train the semantic adequacy evaluator, as well as sample human judgments to train the overall metric. |
Abstract | When evaluated on data from the ACL-07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop.
Automatic Evaluation Metrics | In the ACL-07 MT workshop, ParaEval based on recall (ParaEval-recall) achieves good correlation with human judgements.
Introduction | Since human evaluation of MT output is time consuming and expensive, having a robust and accurate automatic MT evaluation metric that correlates well with human judgement is invaluable. |
Introduction | Although BLEU has played a crucial role in the progress of MT research, it is becoming evident that BLEU does not correlate with human judgement |
Introduction | During the recent ACL-07 workshop on statistical MT (Callison-Burch et al., 2007), a total of 11 automatic MT evaluation metrics were evaluated for correlation with human judgement.
Metric Design Considerations | The ACL-07 MT workshop evaluated the translation quality of MT systems on various translation tasks, and also measured the correlation (with human judgement) of 11 automatic MT evaluation metrics.
Metric Design Considerations | For human evaluation of the MT submissions, four different criteria were used in the workshop: Adequacy (how much of the original meaning is expressed in a system translation), Fluency (the translation’s fluency), Rank (different translations of a single source sentence are compared and ranked from best to worst), and Constituent (some constituents from the parse tree of the source sentence are translated, and human judges have to rank these translations). |
Metric Design Considerations | For this dataset, human judgements are available on adequacy and fluency for six system submissions, and there are four English reference translation texts. |
Abstract | We introduce a novel semiautomated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost. |
Abstract | The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER. |
Abstract | Callison-Burch et al. (2006) and Koehn and Monz (2006) report cases where BLEU strongly disagrees with human judgment on translation quality.
Abstract | In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE). |
Current practice in summary evaluation | Manual assessment, performed by human judges, usually centers around two main aspects of summary quality: content and form.
Current practice in summary evaluation | In fact, when it comes to evaluation of automatic summaries, BE shows higher correlations with human judgments than ROUGE, although the difference is not large enough to be statistically significant. |
Dependency-based evaluation | In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics. |
Dependency-based evaluation | In summary evaluation, as will be shown in Section 5, it leads to higher correlations with human judgments only in the case of human-produced model summaries, because almost any variation between two model summaries is “legal”, i.e. |
Dependency-based evaluation | For automatic summaries, which are of relatively poor quality, partial matching lowers our method’s ability to reflect human judgment, because it results in overly generous matching in situations where the examined information is neither a paraphrase nor relevant.
Experimental results | Of course, the ideal evaluation metric would show high correlations with human judgment on both levels. |
Experimental results | The letters in parenthesis indicate that a given DEPEVAL(summ) variant is significantly better at correlating with human judgment than ROUGE-2 (= R2), ROUGE-SU4 (= R4), or BE-HM (= B). |
Introduction | Despite relying on the same concept, our approach outperforms BE in most comparisons, and it often achieves higher correlations with human judgments than the string-matching metric ROUGE (Lin, 2004).
Abstract | Evaluation experiments were conducted to calculate the correlation among human judgments, along with the scores produced using automatic evaluation methods for MT outputs obtained from the 12 machine translation systems in NTCIR-7.
Experiments | We calculated the correlation between the scores obtained using our method and scores produced by human judgment . |
Experiments | Moreover, three human judges evaluated 1,200 English output sentences from the perspective of adequacy and fluency on a scale of 1-5.
Experiments | We used the median value of the evaluation results of the three human judges as the final 1-5 score.
Introduction | The scores of some automatic evaluation methods can obtain high correlation with human judgment in document-level automatic evaluation (Coughlin, 2007).
Introduction | Evaluation experiments using MT outputs obtained by 12 machine translation systems in NTCIR-7 (Fujii et al., 2008) demonstrate that the scores obtained using our system yield the highest correlation with the human judgments among the automatic evaluation methods in both sentence-level adequacy and fluency.
Expt. 1: Predicting Absolute Scores | The predictions of all models correlate highly significantly with human judgments, but we still see robustness issues for the individual MT metrics.
Expt. 1: Predicting Absolute Scores | On the system level (bottom half of Table 1), there is high variance due to the small number of predictions per language, and many predictions are not significantly correlated with human judgments.
Experimental Evaluation | At the sentence level, we can correlate predictions in Experiment 1 directly with human judgments with Spearman’s ρ,
Experimental Evaluation | Finally, the predictions are again correlated with human judgments using Spearman’s ρ. “Tie awareness” makes a considerable practical difference, improving correlation figures by 5-10 points.
Experimental Evaluation | Since the default uniform cost does not always correlate well with human judgment , we duplicate these features for 9 nonuniform edit costs. |
Expt. 2: Predicting Pairwise Preferences | The right column shows Spearman’s ρ for the correlation between human judgments and tie-aware system-level predictions.
Introduction | BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations. |
Introduction | Unfortunately, each metric tends to concentrate on one particular type of linguistic information, none of which always correlates well with human judgments.
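The n-gram overlap that BLEU builds on can be illustrated by its unigram core, modified (clipped) precision; the example sentences below are invented:

```python
# Sketch: modified (clipped) unigram precision, the core of BLEU-1.
# Each hypothesis token is credited at most as many times as it appears
# in any single reference. Example sentences are invented.
from collections import Counter

def clipped_unigram_precision(hypothesis, references):
    hyp_tokens = hypothesis.split()
    hyp_counts = Counter(hyp_tokens)
    max_ref_counts = Counter()
    for ref in references:
        for token, count in Counter(ref.split()).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    clipped = sum(min(count, max_ref_counts[token])
                  for token, count in hyp_counts.items())
    return clipped / len(hyp_tokens)

hyp = "the cat sat on the mat"
refs = ["the cat is on the mat", "there is a cat on the mat"]
print(round(clipped_unigram_precision(hyp, refs), 3))  # → 0.833
```

Full BLEU additionally combines clipped precisions for higher-order n-grams with a brevity penalty; this sketch shows only the unigram overlap component.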
Evaluation methodology | The main idea of manual evaluation was (1) to make the assessment as simple as possible for a human judge and (2) to make the results of evaluation unambiguous. |
Evaluation methodology | This task is also much simpler for human judges to complete. |
Evaluation methodology | The idea is to run a standard sort algorithm and ask a human judge each time a comparison operation is required. |
Results | METEOR (with its built-in Russian lemmatisation) and GTM offer the best correlation with human judgements.
Results | Table 3: Correlation to human judgements |
Discussion and Future Work | This is probably due to the linguistic characteristics of Chinese, where human judges apparently give equal importance to function words and content words. |
Experiments | The correlations of character-level BLEU and the average human judgments are shown in the first row of Tables 2 and 3 for the IWSLT and the NIST data set, respectively. |
Experiments | The correlations between the TESLA-CELAB scores and human judgments are shown in the last row of Tables 2 and 3. |
Introduction | In the WMT shared tasks, many new generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments . |
Introduction | Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human judged translation quality. |
Introduction | The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 − TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
Abstract | Exploiting the human judgements that are already implicit in available resources, we avoid purpose-specific annotation. |
Introduction | A main problem we face is that evaluating the performance of these systems ultimately requires human judgement.
Introduction | Fortunately there is already an abundance of data that meets our requirements: every scientific paper contains human “judgements” in the form of citations to other papers which are contextually appropriate: that is, relevant to specific passages of the document and aligned with its argumentative structure. |
Introduction | Citation Resolution is a method for evaluating CBCR systems that is exclusively based on this source of human judgements.
Related work | Third, as we outlined above, existing citations between papers can be exploited as a source of human judgements.
The task: Citation Resolution | The core criterion of this task is to use only the human judgements for which we have the clearest evidence.
Experiment 1: Textual Similarity | Each sentence pair in the datasets was given a score from 0 to 5 (low to high similarity) by human judges, with a high inter-annotator agreement of around 0.90 when measured using the Pearson correlation coefficient.
Experiment 1: Textual Similarity | Three evaluation metrics are provided by the organizers of the SemEval-2012 STS task, all of which are based on the Pearson correlation r of human judgments with system outputs: (1) the correlation value for the concatenation of all five datasets (ALL), (2) a correlation value obtained on a concatenation of the outputs, separately normalized by least squares (ALLnrm), and (3) the weighted average of Pearson correlations across datasets (Mean).
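The Pearson-based scoring these STS metrics rely on reduces to the standard correlation of gold and system similarity scores; a minimal sketch with invented scores:

```python
# Sketch: Pearson correlation r between gold similarity judgments and
# system similarity scores, as used for STS evaluation. Scores are invented.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

gold = [4.8, 1.2, 3.5, 0.6, 2.9]    # hypothetical 0-5 human similarity judgments
system = [4.1, 1.9, 3.0, 1.1, 2.5]  # hypothetical system similarity outputs
print(round(pearson_r(gold, system), 3))  # → 0.982
```

Unlike Spearman's ρ, Pearson's r operates on the raw scores rather than their ranks, so it rewards linear agreement with the 0-5 gold scale, not just correct ordering.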
Experiment 1: Textual Similarity | MSRpar (MPar) is the only dataset in which TLsim (Šarić et al., 2012) achieves a higher correlation with human judgments.
Experiment 2: Word Similarity | Table 6 shows the Spearman’s ρ rank correlation coefficients with human judgments on the RG-65 dataset.
Experiment 3: Sense Similarity | Table 6: Spearman’s ρ correlation coefficients with human judgments on the RG-65 dataset.
Introduction | Third, we demonstrate that this single representation can achieve state-of-the-art performance on three similarity tasks, each operating at a different lexical level: (1) surpassing the highest scores on the SemEval-2012 task on textual similarity (Agirre et al., 2012) that compares sentences, (2) achieving a near-perfect performance on the TOEFL synonym selection task proposed by Landauer and Dumais (1997), which measures word pair similarity, and also obtaining state-of-the-art performance in terms of the correlation with human judgments on the RG-65 dataset (Rubenstein and Goodenough, 1965), and finally (3) surpassing the performance of Snow et al. |
Introduction | We show that XMEANT, a new cross-lingual version of MEANT (Lo et al., 2012), correlates with human judgment even more closely than MEANT for evaluating MT adequacy via semantic frames, despite discarding the need for expensive human reference translations. |
Related Work | In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy. |
Related Work | ULC (Gimenez and Marquez, 2007, 2008) incorporates several semantic features and shows improved correlation with human judgement on translation quality (Callison-Burch et al., 2007, 2008) but no work has been done towards tuning an SMT system using a pure form of ULC perhaps due to its expensive run time. |
Related Work | For UMEANT (Lo and Wu, 2012), they are estimated in an unsupervised manner using relative frequency of each semantic role label in the references and thus UMEANT is useful when human judgments on adequacy of the development set are unavailable. |
Conclusion | Additionally, our data-driven approach can be applied to any dimension that is meaningful to human judges, and it provides an elegant way to project multiple dimensions simultaneously, by including the relevant dimensions as features of the parameter models’ training data.
Evaluation Experiment | We then evaluate the output utterances using naive human judges to rate their perceived personality and naturalness. |
Evaluation Experiment | Table 5 shows several sample outputs and the mean personality ratings from the human judges.
Introduction | Another thread investigates SNLG scoring models trained using higher-level linguistic features to replicate human judgments of utterance quality (Rambow et al., 2001; Nakatsu and White, 2006; Stent and Guo, 2005). |
Parameter Estimation Models | Collects human judgments rating the personality of each utterance; |
Experimental Results II | 5.1 Intrinsic Evaluation: Human Judgements
Experimental Results II | Therefore, we also report the degree of agreement among human judges in Table 7, where we compute the agreement of one Turker with respect to the gold standard drawn from the rest of the Turkers, and take the average across all five Turkers.
Experimental Results II | C-LP: 77.0, 73.0; SENTIWN: 71.5, 69.0; HUMAN JUDGES: 66.0, 69.0
Introduction | We provide comparative empirical results over several variants of these approaches with comprehensive evaluations including lexicon-based, human judgments, and extrinsic evaluations.
Introduction | §5 presents a comprehensive evaluation with human judges and extrinsic evaluations.
Experiments | In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora. |
Experiments | We computed the name-aware BLEU scores on the subset and also the aggregated average scores from human judgments . |
Experiments | Figure 2 shows that NAMT consistently achieved higher scores with both the name-aware BLEU metric and human judgement.
Experiments | To determine the sentiment of these adjectives, we asked 9 human judges, all native German speakers, to annotate them given the classes neutral, slightly negative, very negative, slightly positive, and very positive, reflecting the categories from the training data.
Experiments | Since human judges tend to interpret scales differently, we examine their agreement using Kendall’s coefficient of concordance (W), including correction for ties (Legendre, 2005), which takes ranks into account.
Experiments | Due to disagreements between the human judges, there exists no clear threshold between these categories.
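Kendall's W with the tie correction of Legendre (2005) can be sketched as follows; the judge ratings are invented, not taken from the study:

```python
# Sketch: Kendall's coefficient of concordance W with tie correction
# (Legendre, 2005). Judge ratings below are invented for illustration.
from collections import Counter

def rank(values):
    """1-based ranks, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 0-based positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kendalls_w(ratings):
    """Kendall's W with tie correction; ratings is one list per judge."""
    p = len(ratings)     # number of judges
    n = len(ratings[0])  # number of rated items
    ranked = [rank(r) for r in ratings]
    totals = [sum(judge[i] for judge in ranked) for i in range(n)]  # R_i
    s = sum(t * t for t in totals)
    # tie correction: sum of (t^3 - t) over tie groups within each judge
    ties = sum(t ** 3 - t for judge in ranked for t in Counter(judge).values())
    return (12 * s - 3 * p * p * n * (n + 1) ** 2) / (p * p * (n ** 3 - n) - p * ties)

# Hypothetical 1-5 ratings from three judges over five adjectives
ratings = [[1, 2, 3, 4, 5],
           [1, 1, 3, 4, 5],   # this judge ties the first two items
           [1, 2, 3, 5, 4]]
print(round(kendalls_w(ratings), 3))  # → 0.944
```

W ranges from 0 (no agreement) to 1 (perfect agreement); the tie correction keeps W well-behaved when judges assign equal ratings, which is common on coarse ordinal scales.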
Evaluation III: Sentiment Analysis using ConnotationWordNet | Note that there is a difference in how humans judge the orientation and the degree of connotation for a given word out of context, and how the use of such words in context can be perceived as good/bad news.
Evaluation II: Human Evaluation on ConnotationWordNet | The agreement between the new lexicon and human judges varies between 84% and 86.98%.
Evaluation II: Human Evaluation on ConnotationWordNet | (2005a)) show a low agreement rate with humans, which is somewhat expected: human judges in this study are labeling for subtle connotation, not for more explicit sentiment.
Evaluation II: Human Evaluation on ConnotationWordNet | Because different human judges have different notions of scale, however, subtle differences are more likely to be noisy.
Abstract | Experimental results demonstrate that the multiplicative models are superior to the additive alternatives when compared against human judgments.
Evaluation Setup | The task involves examining the degree of linear relationship between the human judgments for two individual words and vector-based similarity values. |
Evaluation Setup | We assume that the inter-subject agreement can serve as an upper bound for comparing the fit of our models against the human judgments.
Results | Table 2: Model means for High and Low similarity items and correlation coefficients with human judgments (*: p < 0.05, **: p < 0.01)
Experiments | They are collected by the same human judges and belong to the same domain. |
Experiments | The pairs were presented to five human judges who rated each pair on a scale from 1 (very literal/denotative) to 4 (very non-literal/connotative). |
Experiments | Table 4: Comparing AN metaphor detection method to the baselines: accuracy of the 10-fold cross validation on annotations of five human judges.
Experiments | However, perplexity on the held-out test set does not reflect the semantic coherence of topics and may be contrary to human judgments (Chang et al., 2009).
Experiments | As our objective is to discover more coherent aspects, we recruited two human judges.
Introduction | However, researchers have shown that fully unsupervised models often produce incoherent topics because the objective functions of topic models do not always correlate well with human judgments (Chang et al., 2009). |
Empirical Evaluation | We evaluate Rex by estimating how closely its judgments correlate with those of human judges on the 30-pair word set of Miller & Charles (M&C), who aggregated the judgments of multiple human raters into mean ratings for these pairs. |
Related Work and Ideas | Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures. |
Related Work and Ideas | Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set. |
Related work | We utilize large and small modifiers (described in Section 4.1), which correspond to the textual clues mo (as many as, as large as) and shika (only, as few as), respectively, for detecting humans’ judgments.
Related work | We asked three human judges to annotate every numerical expression with one of six labels, small, relatively small, normal, relatively large, large, and unsure. |
Related work | The cause of this error is exemplified by the sentence, “there are two reasons.” Human judges label the numerical expression two reasons as normal, but the method predicts small.
Empirical Evaluation | The evaluation of this task requires human judges to read all the posts where the two users forming the pair have interacted. |
Empirical Evaluation | Two human judges were asked to independently read all the post interactions of 500 pairs and label each pair as overall “disagreeing” or overall “agreeing” or “none”. |
Phrase Ranking based on Relevance | For this and subsequent human judgment tasks, we use two judges (graduate students well versed in English). |
Experiments | We merged these topics and asked two human judges to judge their quality by assigning a score of either 0 or 1. The judges are graduate students living in Singapore and not involved in this project.
Experiments | based on the human judge’s understanding. |
Experiments | For ground truth, we consider a bursty topic to be correct if both human judges have scored it 1.
Hello. My name is Inigo Montoya. | None of these observations, however, serve as definitions, and indeed, we believe it desirable not to pre-commit to an abstract definition, but rather to adopt an operational formulation based on external human judgments.
Hello. My name is Inigo Montoya. | In designing our study, we focus on a domain in which (i) there is rich use of language, some of which has achieved deep cultural penetration; (ii) there already exist a large number of external human judgments — perhaps implicit, but in a form we can extract; and (iii) we can control for the setting in which the text was used. |
Never send a human to do a machine’s job. | Thus, the main conclusion from these prediction tasks is that abstracting notions such as distinctiveness and generality can produce relatively streamlined models that outperform much heavier-weight bag-of-words models, and can suggest steps toward approaching the performance of human judges who — very much unlike our system — have the full cultural context in which movies occur at their disposal. |
Abstract | We show that this constrained model’s analyses of speaker authority correlate very strongly with expert human judgments (r² coefficient of 0.947).
Background | In general, however, we now have an automated model that is reliable in reproducing human judgments of authoritativeness. |
Introduction | In section 5, this model is evaluated on a subset of the MapTask corpus (Anderson et al., 1991) and shows a high correlation with human judgements of authoritativeness (r² = 0.947).
Experiment: Ranking Word Senses | Based on agreement between human judges, Erk and McCarthy (2009) estimate an upper bound ρ of 0.544 for the dataset.
Experiment: Ranking Word Senses | The first column shows the correlation of our model’s predictions with the human judgments from the gold-standard, averaged over all instances. |
Experiment: Ranking Word Senses | Table 4: Correlation of model predictions and human judgments |
Introduction | When employing any such metric, it is crucial to verify that the predictions of the automated evaluation process agree with human judgements of the important aspects of the system output. |
Introduction | counter-examples to the claim that BLEU agrees with human judgements.
Introduction | Also, Foster (2008) examined a range of automated metrics for evaluating generated multimodal output and found that few agreed with the preferences expressed by human judges.