Evaluation methodology | The main idea of manual evaluation was (1) to make the assessment as simple as possible for a human judge and (2) to make the results of evaluation unambiguous. |
Evaluation methodology | This task is also much simpler for human judges to complete. |
Evaluation methodology | The idea is to run a standard sort algorithm and ask a human judge each time a comparison operation is required. |
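Evaluation methodology | A minimal sketch of this comparison-based protocol, assuming candidate outputs are ranked for a single source item; the prompt format and function names are illustrative, not taken from the paper. |

```python
from functools import cmp_to_key

def ask_judge(a, b):
    """Show two candidate outputs and ask the human judge which is better.
    Returns -1 if `a` should rank higher, 1 if `b` should, 0 for a tie."""
    answer = ""
    while answer not in {"a", "b", "="}:
        answer = input(f"Which is better?\n  (a) {a}\n  (b) {b}\n  (=) tie\n> ").strip().lower()
    return {"a": -1, "b": 1, "=": 0}[answer]

def rank_by_human_judgement(candidates):
    """Run a standard comparison sort; every comparison is answered by the judge."""
    return sorted(candidates, key=cmp_to_key(ask_judge))

# Example (interactive): rank three system outputs for one source sentence.
# ranked = rank_by_human_judgement(["output A", "output B", "output C"])
```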
Results | METEOR (with its built-in Russian lemmatisation) and GTM offer the best correlation with human judgements. |
Results | Table 3: Correlation with human judgements |
Experiment 1: Textual Similarity | Each sentence pair in the datasets was given a score from 0 to 5 (low to high similarity) by human judges, with a high inter-annotator agreement of around 0.90 when measured using the Pearson correlation coefficient. |
Experiment 1: Textual Similarity | Three evaluation metrics are provided by the organizers of the SemEval-2012 STS task, all of which are based on the Pearson correlation r of human judgments with system outputs: (1) the correlation value for the concatenation of all five datasets (ALL), (2) a correlation value obtained on a concatenation of the outputs, each separately normalized by least squares (ALLnrm), and (3) the weighted average of Pearson correlations across datasets (Mean). |
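Experiment 1: Textual Similarity | A sketch of how these three aggregate figures could be computed from per-dataset gold and system scores; the dictionary layout and the linear least-squares fit are assumptions, and the official scoring script may differ in detail. |

```python
import numpy as np
from scipy.stats import pearsonr

def sts_aggregate_scores(gold_by_dataset, sys_by_dataset):
    """gold_by_dataset / sys_by_dataset: dicts mapping dataset name -> np.array of scores."""
    names = list(gold_by_dataset)
    gold_all = np.concatenate([gold_by_dataset[n] for n in names])
    sys_all = np.concatenate([sys_by_dataset[n] for n in names])

    # (1) ALL: Pearson r over the concatenation of all datasets.
    all_r = pearsonr(gold_all, sys_all)[0]

    # (2) ALLnrm: each dataset's outputs are first rescaled towards the gold
    # scores by a least-squares linear fit, then concatenated.
    normed = []
    for n in names:
        g, s = gold_by_dataset[n], sys_by_dataset[n]
        slope, intercept = np.polyfit(s, g, 1)
        normed.append(slope * s + intercept)
    allnrm_r = pearsonr(gold_all, np.concatenate(normed))[0]

    # (3) Mean: per-dataset Pearson r, weighted by the number of sentence pairs.
    sizes = [len(gold_by_dataset[n]) for n in names]
    per_r = [pearsonr(gold_by_dataset[n], sys_by_dataset[n])[0] for n in names]
    mean_r = float(np.average(per_r, weights=sizes))

    return all_r, allnrm_r, mean_r
```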
Experiment 1: Textual Similarity | MSRpar (MPar) is the only dataset in which TLsim (Šarić et al., 2012) achieves a higher correlation with human judgments. |
Experiment 2: Word Similarity | Table 6 shows the Spearman's ρ rank correlation coefficients with human judgments on the RG-65 dataset. |
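Experiment 2: Word Similarity | A minimal sketch of this evaluation, assuming a list of word pairs with mean human ratings and a system similarity function; `sim` and the example pairs are placeholders rather than the actual RG-65 resources. |

```python
from scipy.stats import spearmanr

def rg65_spearman(pairs, human_ratings, sim):
    """pairs: list of (word1, word2); human_ratings: mean human rating per pair;
    sim: function returning the system's similarity score for two words.
    Returns Spearman's rho between system scores and human judgments."""
    system_scores = [sim(w1, w2) for w1, w2 in pairs]
    rho, _ = spearmanr(system_scores, human_ratings)
    return rho

# Toy usage (placeholder pairs, ratings, and model, not the full RG-65 data):
# rho = rg65_spearman([("car", "automobile"), ("noon", "string")],
#                     [3.92, 0.04], lambda a, b: my_model_similarity(a, b))
```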
Experiment 3: Sense Similarity | Table 6: Spearman's ρ correlation coefficients with human judgments on the RG-65 dataset. |
Introduction | Third, we demonstrate that this single representation can achieve state-of-the-art performance on three similarity tasks, each operating at a different lexical level: (1) surpassing the highest scores on the SemEval-2012 task on textual similarity (Agirre et al., 2012) that compares sentences, (2) achieving a near-perfect performance on the TOEFL synonym selection task proposed by Landauer and Dumais (1997), which measures word pair similarity, and also obtaining state-of-the-art performance in terms of the correlation with human judgments on the RG-65 dataset (Rubenstein and Goodenough, 1965), and finally (3) surpassing the performance of Snow et al. |
Experimental Results 11 | 5.1 Intrinsic Evaluation: Human Judgements |
Experimental Results 11 | Therefore, we also report the degree of agreement among human judges in Table 7, where we compute the agreement of one Turker with respect to the gold standard drawn from the rest of the Turkers, and take the average across all five Turkers. |
Experimental Results 11 | Table 7 (agreement, %): columns are C-LP, SENTIWN, and HUMAN JUDGES; row 9V0“: 77.0, 71.5, 66.0; row 95”“: 73.0, 69.0, 69.0. |
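Experimental Results 11 | A sketch of the leave-one-annotator-out agreement described above, assuming each Turker assigns one categorical label per item and the gold standard for the held-out Turker is the majority label of the remaining four; the variable names and tie handling are assumptions, not the paper's exact procedure. |

```python
from collections import Counter

def leave_one_out_agreement(labels):
    """labels: list of items, each a list with one label per Turker, e.g.
    [["pos", "pos", "neg", "pos", "neutral"], ...].
    For each Turker, compare their labels with the majority vote of the other
    Turkers, then average the per-Turker agreement rates."""
    n_turkers = len(labels[0])
    per_turker = []
    for t in range(n_turkers):
        agree = sum(
            item[t] == Counter(lab for i, lab in enumerate(item) if i != t).most_common(1)[0][0]
            for item in labels
        )
        per_turker.append(agree / len(labels))
    return sum(per_turker) / n_turkers
```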
Introduction | We provide comparative empirical results over several variants of these approaches, with comprehensive evaluations including lexicon-based evaluation, human judgments, and extrinsic evaluations. |
Introduction | §5 presents a comprehensive evaluation with human judges and extrinsic evaluations. |
Experiments | In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora. |
Experiments | We computed the name-aware BLEU scores on the subset and also the aggregated average scores from human judgments. |
Experiments | Figure 2 shows that NAMT consistently achieved higher scores under both the name-aware BLEU metric and human judgement. |
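Experiments | A sketch of this comparison, assuming per-sentence metric scores and a matrix of human adequacy scores (one column per bilingual judge) for each system; the scoring scale, granularity, and variable names are illustrative assumptions. |

```python
import numpy as np
from scipy.stats import pearsonr

def human_vs_metric(judge_scores, metric_scores):
    """judge_scores: array of shape (n_sentences, n_judges) with human scores
    for one system's translations; metric_scores: per-sentence scores from the
    automatic metric for the same translations.
    Returns the averaged human score per sentence and its Pearson correlation
    with the metric."""
    human_avg = np.asarray(judge_scores, dtype=float).mean(axis=1)
    r, _ = pearsonr(human_avg, np.asarray(metric_scores, dtype=float))
    return human_avg, r

# Comparing two systems on the same 250-sentence subset:
# baseline_avg, r_base = human_vs_metric(baseline_judgements, baseline_metric)
# namt_avg, r_namt = human_vs_metric(namt_judgements, namt_metric)
```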
Empirical Evaluation | The evaluation of this task requires human judges to read all the posts where the two users forming the pair have interacted. |
Empirical Evaluation | Two human judges were asked to independently read all the post interactions of 500 pairs and label each pair as overall “disagreeing”, overall “agreeing”, or “none”. |
Phrase Ranking based on Relevance | For this and subsequent human judgment tasks, we use two judges (graduate students well versed in English). |
Related work | We utilize large and small modifiers (described in Section 4.1), which correspond to textual clues mo (as many as, as large as) and shika (only, as few as), respectively, for detecting humans' judgments. |
Related work | We asked three human judges to annotate every numerical expression with one of six labels: small, relatively small, normal, relatively large, large, and unsure. |
Related work | The cause of this error is exemplified by the sentence "there are two reasons": human judges label the numerical expression two reasons as normal, but the method predicts small. |
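Related work | A minimal sketch of the clue-based detection described above, using a simple surface match on the modifier; the clue lists and the fallback label are illustrative, and the paper's method covers the full six-label scheme. |

```python
# Illustrative clue lists: "mo"-type clues signal that the writer judges the
# quantity to be large, "shika"-type clues that it is small.
LARGE_CLUES = ("as many as", "as large as")
SMALL_CLUES = ("only", "as few as")

def judge_numerical_expression(sentence):
    """Predict the writer's judgment of a numerical expression from textual clues.
    Falls back to 'normal' when no clue word is present."""
    s = sentence.lower()
    if any(clue in s for clue in LARGE_CLUES):
        return "large"
    if any(clue in s for clue in SMALL_CLUES):
        return "small"
    return "normal"

# The error case discussed above contains no clue word, so this simple matcher
# returns "normal"; the paper's method, by contrast, predicted "small".
# judge_numerical_expression("there are two reasons")  # -> "normal"
```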
Empirical Evaluation | We evaluate Rex by estimating how closely its judgments correlate with those of human judges on the 30-pair word set of Miller & Charles (M&C), who aggregated the judgments of multiple human raters into mean ratings for these pairs. |
Related Work and Ideas | Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures. |
Related Work and Ideas | Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set. |