Evaluation methodology | The main idea of manual evaluation was (1) to make the assessment as simple as possible for a human judge and (2) to make the results of evaluation unambiguous. |
Evaluation methodology | This task is also much simpler for human judges to complete. |
Evaluation methodology | The idea is to run a standard sort algorithm and ask a human judge each time a comparison operation is required. |
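Evaluation methodology | A minimal sketch of this comparison-based protocol, assuming candidate outputs are ranked for a single source item; the prompt format and function names are illustrative, not taken from the paper. |

```python
from functools import cmp_to_key

def ask_judge(a, b):
    """Show two candidate outputs and ask the human judge which is better.
    Returns -1 if `a` should rank higher, 1 if `b` should, 0 for a tie."""
    answer = ""
    while answer not in {"a", "b", "="}:
        answer = input(f"Which is better?\n  (a) {a}\n  (b) {b}\n  (=) tie\n> ").strip().lower()
    return {"a": -1, "b": 1, "=": 0}[answer]

def rank_by_human_judgement(candidates):
    """Run a standard comparison sort; every comparison is answered by the judge."""
    return sorted(candidates, key=cmp_to_key(ask_judge))

# Example (interactive): rank three system outputs for one source sentence.
# ranked = rank_by_human_judgement(["output A", "output B", "output C"])
```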
Results | METEOR (with its built-in Russian lemmatisation) and GTM offer the best correlation with human judgements. |
Results | Table 3: Correlation with human judgements |
Experiment 1: Textual Similarity | Each sentence pair in the datasets was given a score from 0 to 5 (low to high similarity) by human judges, with a high inter-annotator agreement of around 0.90 when measured using the Pearson correlation coefficient. |
Experiment 1: Textual Similarity | Three evaluation metrics are provided by the organizers of the SemEval-2012 STS task, all of which are based on the Pearson correlation r of human judgments with system outputs: (1) the correlation value for the concatenation of all five datasets (ALL), (2) a correlation value obtained on a concatenation of the outputs, each separately normalized by least squares (ALLnrm), and (3) the weighted average of Pearson correlations across datasets (Mean). |
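Experiment 1: Textual Similarity | A sketch of how these three aggregate figures could be computed from per-dataset gold and system scores; the dictionary layout and the linear least-squares fit are assumptions, and the official scoring script may differ in detail. |

```python
import numpy as np
from scipy.stats import pearsonr

def sts_aggregate_scores(gold_by_dataset, sys_by_dataset):
    """gold_by_dataset / sys_by_dataset: dicts mapping dataset name -> np.array of scores."""
    names = list(gold_by_dataset)
    gold_all = np.concatenate([gold_by_dataset[n] for n in names])
    sys_all = np.concatenate([sys_by_dataset[n] for n in names])

    # (1) ALL: Pearson r over the concatenation of all datasets.
    all_r = pearsonr(gold_all, sys_all)[0]

    # (2) ALLnrm: each dataset's outputs are first rescaled towards the gold
    # scores by a least-squares linear fit, then concatenated.
    normed = []
    for n in names:
        g, s = gold_by_dataset[n], sys_by_dataset[n]
        slope, intercept = np.polyfit(s, g, 1)
        normed.append(slope * s + intercept)
    allnrm_r = pearsonr(gold_all, np.concatenate(normed))[0]

    # (3) Mean: per-dataset Pearson r, weighted by the number of sentence pairs.
    sizes = [len(gold_by_dataset[n]) for n in names]
    per_r = [pearsonr(gold_by_dataset[n], sys_by_dataset[n])[0] for n in names]
    mean_r = float(np.average(per_r, weights=sizes))

    return all_r, allnrm_r, mean_r
```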
Experiment 1: Textual Similarity | MSRpar (MPar) is the only dataset in which TLsim (Šarić et al., 2012) achieves a higher correlation with human judgments. |
Experiment 2: Word Similarity | Table 6 shows the Spearman's ρ rank correlation coefficients with human judgments on the RG-65 dataset. |
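Experiment 2: Word Similarity | A minimal sketch of this evaluation, assuming a list of word pairs with mean human ratings and a system similarity function; `sim` and the example pairs are placeholders rather than the actual RG-65 resources. |

```python
from scipy.stats import spearmanr

def rg65_spearman(pairs, human_ratings, sim):
    """pairs: list of (word1, word2); human_ratings: mean human rating per pair;
    sim: function returning the system's similarity score for two words.
    Returns Spearman's rho between system scores and human judgments."""
    system_scores = [sim(w1, w2) for w1, w2 in pairs]
    rho, _ = spearmanr(system_scores, human_ratings)
    return rho

# Toy usage (placeholder pairs, ratings, and model, not the full RG-65 data):
# rho = rg65_spearman([("car", "automobile"), ("noon", "string")],
#                     [3.92, 0.04], lambda a, b: my_model_similarity(a, b))
```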
Experiment 3: Sense Similarity | Table 6: Spearman's ρ correlation coefficients with human judgments on the RG-65 dataset. |
Introduction | Third, we demonstrate that this single representation can achieve state-of-the-art performance on three similarity tasks, each operating at a different lexical level: (1) surpassing the highest scores on the SemEval-2012 task on textual similarity (Agirre et al., 2012) that compares sentences, (2) achieving a near-perfect performance on the TOEFL synonym selection task proposed by Landauer and Dumais (1997), which measures word pair similarity, and also obtaining state-of-the-art performance in terms of the correlation with human judgments on the RG-65 dataset (Rubenstein and Goodenough, 1965), and finally (3) surpassing the performance of Snow et al. |
Experimental Results 11 | 5.1 Intrinsic Evaluation: Human Judgements |
Experimental Results 11 | Therefore, we also report the degree of agreement among human judges in Table 7, where we compute the agreement of one Turker with respect to the gold standard drawn from the rest of the Turkers, and take the average across all five Turkers. |
Experimental Results 11 | Table 7 (agreement, %): columns are C-LP, SENTIWN, and HUMAN JUDGES; row 9V0“: 77.0, 71.5, 66.0; row 95”“: 73.0, 69.0, 69.0. |
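Experimental Results 11 | A sketch of the leave-one-annotator-out agreement described above, assuming each Turker assigns one categorical label per item and the gold standard for the held-out Turker is the majority label of the remaining four; the variable names and tie handling are assumptions, not the paper's exact procedure. |

```python
from collections import Counter

def leave_one_out_agreement(labels):
    """labels: list of items, each a list with one label per Turker, e.g.
    [["pos", "pos", "neg", "pos", "neutral"], ...].
    For each Turker, compare their labels with the majority vote of the other
    Turkers, then average the per-Turker agreement rates."""
    n_turkers = len(labels[0])
    per_turker = []
    for t in range(n_turkers):
        agree = sum(
            item[t] == Counter(lab for i, lab in enumerate(item) if i != t).most_common(1)[0][0]
            for item in labels
        )
        per_turker.append(agree / len(labels))
    return sum(per_turker) / n_turkers
```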
Introduction | We provide comparative empirical results over several variants of these approaches, with comprehensive evaluations including lexicon-based evaluation, human judgments, and extrinsic evaluations. |
Introduction | §5 presents a comprehensive evaluation with human judges and extrinsic evaluations. |
Experiments | In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora. |
Experiments | We computed the name-aware BLEU scores on the subset and also the aggregated average scores from human judgments. |
Experiments | Figure 2 shows that NAMT consistently achieved higher scores under both the name-aware BLEU metric and human judgement. |
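Experiments | A sketch of this comparison, assuming per-sentence metric scores and a matrix of human adequacy scores (one column per bilingual judge) for each system; the scoring scale, granularity, and variable names are illustrative assumptions. |

```python
import numpy as np
from scipy.stats import pearsonr

def human_vs_metric(judge_scores, metric_scores):
    """judge_scores: array of shape (n_sentences, n_judges) with human scores
    for one system's translations; metric_scores: per-sentence scores from the
    automatic metric for the same translations.
    Returns the averaged human score per sentence and its Pearson correlation
    with the metric."""
    human_avg = np.asarray(judge_scores, dtype=float).mean(axis=1)
    r, _ = pearsonr(human_avg, np.asarray(metric_scores, dtype=float))
    return human_avg, r

# Comparing two systems on the same 250-sentence subset:
# baseline_avg, r_base = human_vs_metric(baseline_judgements, baseline_metric)
# namt_avg, r_namt = human_vs_metric(namt_judgements, namt_metric)
```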
Empirical Evaluation | The evaluation of this task requires human judges to read all the posts where the two users forming the pair have interacted. |
Empirical Evaluation | Two human judges were asked to independently read all the post interactions of 500 pairs and label each pair as overall “disagreeing”, overall “agreeing”, or “none”. |
Phrase Ranking based on Relevance | For this and subsequent human judgment tasks, we use two judges (graduate students well versed in English). |
Related work | We utilize large and small modifiers (described in Section 4.1), which correspond to textual clues mo (as many as, as large as) and shika (only, as few as), respectively, for detecting humans' judgments. |
Related work | We asked three human judges to annotate every numerical expression with one of six labels: small, relatively small, normal, relatively large, large, and unsure. |
Related work | The cause of this error is exemplified by the sentence "there are two reasons": human judges label the numerical expression two reasons as normal, but the method predicts small. |
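Related work | A minimal sketch of the clue-based detection described above, using a simple surface match on the modifier; the clue lists and the fallback label are illustrative, and the paper's method covers the full six-label scheme. |

```python
# Illustrative clue lists: "mo"-type clues signal that the writer judges the
# quantity to be large, "shika"-type clues that it is small.
LARGE_CLUES = ("as many as", "as large as")
SMALL_CLUES = ("only", "as few as")

def judge_numerical_expression(sentence):
    """Predict the writer's judgment of a numerical expression from textual clues.
    Falls back to 'normal' when no clue word is present."""
    s = sentence.lower()
    if any(clue in s for clue in LARGE_CLUES):
        return "large"
    if any(clue in s for clue in SMALL_CLUES):
        return "small"
    return "normal"

# The error case discussed above contains no clue word, so this simple matcher
# returns "normal"; the paper's method, by contrast, predicted "small".
# judge_numerical_expression("there are two reasons")  # -> "normal"
```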
Empirical Evaluation | We evaluate Rex by estimating how closely its judgments correlate with those of human judges on the 30-pair word set of Miller & Charles (M&C), who aggregated the judgments of multiple human raters into mean ratings for these pairs. |
Related Work and Ideas | Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures. |
Related Work and Ideas | Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set. |