Index of papers in Proc. ACL 2013 that mention
  • human judgments
Braslavski, Pavel and Beloborodov, Alexander and Khalilov, Maxim and Sharoff, Serge
Evaluation methodology
The main idea of manual evaluation was (1) to make the assessment as simple as possible for a human judge and (2) to make the results of evaluation unambiguous.
Evaluation methodology
This task is also much simpler for human judges to complete.
Evaluation methodology
The idea is to run a standard sort algorithm and ask a human judge each time a comparison operation is required.
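The sketch below illustrates this idea in Python: a standard comparison sort whose comparison callback simply asks the judge. The prompt loop and candidate strings are illustrative assumptions, not the authors' actual judging interface.

```python
import functools

def human_compare(translation_a: str, translation_b: str) -> int:
    """Ask the judge to compare two candidate translations; return -1, 0, or +1."""
    print("A:", translation_a)
    print("B:", translation_b)
    answer = ""
    while answer not in {"a", "b", "tie"}:
        answer = input("Which translation is better? [a/b/tie]: ").strip().lower()
    return {"a": -1, "b": 1, "tie": 0}[answer]

def rank_by_human_judge(candidates: list[str]) -> list[str]:
    # sorted() issues O(n log n) comparisons, so the judge answers only
    # as many questions as the sort actually needs; best-ranked output comes first.
    return sorted(candidates, key=functools.cmp_to_key(human_compare))
```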
Results
METEOR (with its built-in Russian lemmatisation) and GTM offer the best correlation with human judgements.
Results
Table 3: Correlation to human judgements
human judgments is mentioned in 7 sentences in this paper.
Pilehvar, Mohammad Taher and Jurgens, David and Navigli, Roberto
Experiment 1: Textual Similarity
Each sentence pair in the datasets was given a score from 0 to 5 (low to high similarity) by human judges, with a high inter-annotator agreement of around 0.90 when measured using the Pearson correlation coefficient.
Experiment 1: Textual Similarity
Three evaluation metrics are provided by the organizers of the SemEval-2012 STS task, all of which are based on the Pearson correlation r of human judgments with system outputs: (1) the correlation value for the concatenation of all five datasets (ALL), (2) a correlation value obtained on a concatenation of the outputs, separately normalized by least squares (ALLnrm), and (3) the weighted average of Pearson correlations across datasets (Mean).
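A rough sketch of these three aggregates is given below, assuming the gold and system scores are available as parallel NumPy arrays per dataset; the official SemEval scorer may differ in details such as how the least-squares normalization is fit.

```python
import numpy as np
from scipy.stats import pearsonr

def sts_aggregates(gold_by_dataset, system_by_dataset):
    """gold_by_dataset, system_by_dataset: lists of 1-D NumPy arrays, one per dataset."""
    gold_all = np.concatenate(gold_by_dataset)
    system_all = np.concatenate(system_by_dataset)

    # ALL: Pearson r on the concatenation of all datasets.
    r_all = pearsonr(system_all, gold_all)[0]

    # ALLnrm: least-squares fit of each dataset's system scores onto the gold
    # scale, then Pearson r on the concatenation of the normalized outputs.
    normalized = []
    for gold, system in zip(gold_by_dataset, system_by_dataset):
        slope, intercept = np.polyfit(system, gold, deg=1)
        normalized.append(slope * system + intercept)
    r_allnrm = pearsonr(np.concatenate(normalized), gold_all)[0]

    # Mean: per-dataset Pearson r, weighted by dataset size.
    sizes = [len(g) for g in gold_by_dataset]
    r_each = [pearsonr(s, g)[0] for g, s in zip(gold_by_dataset, system_by_dataset)]
    r_mean = float(np.average(r_each, weights=sizes))

    return r_all, r_allnrm, r_mean
```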
Experiment 1: Textual Similarity
MSRpar (MPar) is the only dataset in which TLsim (Šarić et al., 2012) achieves a higher correlation with human judgments.
Experiment 2: Word Similarity
Table 6 shows the Spearman's ρ rank correlation coefficients with human judgments on the RG-65 dataset.
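As a minimal illustration of this measure, the snippet below computes Spearman's ρ between a system's similarity scores and human ratings; the numbers are placeholders, not actual RG-65 values.

```python
from scipy.stats import spearmanr

human_scores = [3.92, 3.84, 0.45, 1.18, 2.97]   # mean human judgments (illustrative)
system_scores = [0.81, 0.77, 0.10, 0.33, 0.52]  # similarity scores from a system (illustrative)

rho, p_value = spearmanr(system_scores, human_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")
```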
Experiment 3: Sense Similarity
Table 6: Spearman's ρ correlation coefficients with human judgments on the RG-65 dataset.
Introduction
Third, we demonstrate that this single representation can achieve state-of-the-art performance on three similarity tasks, each operating at a different lexical level: (1) surpassing the highest scores on the SemEval-2012 task on textual similarity (Agirre et al., 2012) that compares sentences, (2) achieving a near-perfect performance on the TOEFL synonym selection task proposed by Landauer and Dumais (1997), which measures word pair similarity, and also obtaining state-of-the-art performance in terms of the correlation with human judgments on the RG-65 dataset (Rubenstein and Goodenough, 1965), and finally (3) surpassing the performance of Snow et al.
human judgments is mentioned in 6 sentences in this paper.
Feng, Song and Kang, Jun Seok and Kuznetsova, Polina and Choi, Yejin
Experimental Results
5.1 Intrinsic Evaluation: Human Judgements
Experimental Results
Therefore, we also report the degree of agreement among human judges in Table 7, where we compute the agreement of one Turker with respect to the gold standard drawn from the rest of the Turkers, and take the average across all five Turkers.
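A sketch of that leave-one-out agreement computation follows, under the assumption that the gold standard for each held-out Turker is the majority label of the remaining Turkers (the paper's exact aggregation may differ).

```python
from collections import Counter

def leave_one_out_agreement(labels_per_item):
    """labels_per_item: list of lists, one label per Turker for each annotated item."""
    n_turkers = len(labels_per_item[0])
    per_turker = []
    for t in range(n_turkers):
        hits = 0
        for labels in labels_per_item:
            held_out = labels[t]
            rest = [label for i, label in enumerate(labels) if i != t]
            gold = Counter(rest).most_common(1)[0][0]  # majority of the rest; ties broken arbitrarily
            hits += int(held_out == gold)
        per_turker.append(hits / len(labels_per_item))
    return sum(per_turker) / n_turkers  # average agreement over all Turkers
```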
Experimental Results
C-LP: 77.0 / 73.0; SENTIWN: 71.5 / 69.0; HUMAN JUDGES: 66.0 / 69.0
Introduction
We provide comparative empirical results over several variants of these approaches with comprehensive evaluations including lexicon-based, human judgments, and extrinsic evaluations.
Introduction
§5 presents comprehensive evaluation with human judges and extrinsic evaluations.
human judgments is mentioned in 5 sentences in this paper.
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Experiments
In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora.
Experiments
We computed the name-aware BLEU scores on the subset and also the aggregated average scores from human judgments.
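The aggregation and comparison step might look roughly like the following, assuming per-sentence metric scores and the three judges' scores as arrays; the correlation statistic used here (Pearson's r) is an assumption for illustration, not necessarily the analysis reported in the paper.

```python
import numpy as np
from scipy.stats import pearsonr

# Rows: sentences from the judged subset (only a few shown, with made-up values);
# columns: the three bilingual judges' scores.
judge_scores = np.array([
    [4, 5, 4],
    [3, 3, 2],
    [5, 4, 5],
    [2, 3, 3],
])
metric_scores = np.array([0.42, 0.28, 0.51, 0.30])  # hypothetical per-sentence metric scores

avg_human = judge_scores.mean(axis=1)  # aggregated average human judgment per sentence
r, _ = pearsonr(metric_scores, avg_human)
print(f"Pearson r between metric scores and averaged human judgments: {r:.3f}")
```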
Experiments
Figure 2 shows that NAMT consistently achieved higher scores with both the name-aware BLEU metric and human judgment.
human judgments is mentioned in 5 sentences in this paper.
Mukherjee, Arjun and Liu, Bing
Empirical Evaluation
The evaluation of this task requires human judges to read all the posts where the two users forming the pair have interacted.
Empirical Evaluation
Two human judges were asked to independently read all the post interactions of 500 pairs and label each pair as overall “disagreeing” or overall “agreeing” or “none”.
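One standard way to quantify how well two such judges agree is Cohen's kappa; the sketch below is purely illustrative and is not necessarily the agreement statistic the authors report. The labels are the three categories from the text.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

judge1 = ["agreeing", "disagreeing", "none", "agreeing"]      # made-up labels
judge2 = ["agreeing", "disagreeing", "agreeing", "agreeing"]  # made-up labels
print(f"kappa = {cohens_kappa(judge1, judge2):.3f}")
```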
Phrase Ranking based on Relevance
For this and subsequent human judgment tasks, we use two judges (graduate students well versed in English).
human judgments is mentioned in 3 sentences in this paper.
Narisawa, Katsuma and Watanabe, Yotaro and Mizuno, Junta and Okazaki, Naoaki and Inui, Kentaro
Related work
We utilize large and small modifiers (described in Section 4.1), which correspond to textual clues mo (as many as, as large as) and shika (only, as few as), respectively, for detecting humans' judgments.
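A toy rendering of this clue-based idea in Python is shown below; the surface-string matching is an assumption and a gross simplification of the actual detection method.

```python
def clue_based_judgment(expression: str) -> str:
    """Toy clue matcher: 'shika' suggests a small judgment, 'mo' a large one."""
    if "しか" in expression:   # shika: "only", "as few as"
        return "small"
    if "も" in expression:     # mo: "as many as", "as large as"
        return "large"
    return "normal"

print(clue_based_judgment("10人しか来なかった"))  # -> small ("only 10 people came")
print(clue_based_judgment("10人も来た"))          # -> large ("as many as 10 people came")
```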
Related work
We asked three human judges to annotate every numerical expression with one of six labels: small, relatively small, normal, relatively large, large, and unsure.
Related work
The cause of this error is exemplified by the sentence, “there are two reasons.” Human judges assign the label normal to the numerical expression two reasons, but the method predicts small.
human judgments is mentioned in 3 sentences in this paper.
Veale, Tony and Li, Guofu
Empirical Evaluation
We evaluate Rex by estimating how closely its judgments correlate with those of human judges on the 30-pair word set of Miller & Charles (M&C), who aggregated the judgments of multiple human raters into mean ratings for these pairs.
Related Work and Ideas
Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures.
Related Work and Ideas
Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set.
human judgments is mentioned in 3 sentences in this paper.