Abstract | We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that it outperforms competitive baselines and other neural language models.
Conclusion | We introduced a new dataset with human judgments on similarity between pairs of words in context, so as to evaluate models’ abilities to capture homonymy and polysemy of words in context.
Experiments | Our model also improves the correlation with human judgments on a word similarity task. |
Experiments | More important, we introduce a new dataset with human judgments on similarity of pairs of words in sentential context.
Experiments | Each pair is presented without context and associated with 13 to 16 human judgments on similarity and relatedness on a scale from 0 to 10. |
Introduction | However, one limitation of this evaluation is that the human judgments are on pairs of words presented in isolation, without sentential context.
Introduction | Since word interpretation in context is important especially for homonymous and polysemous words, we introduce a new dataset with human judgments on similarity between pairs of words in sentential context. |
Abstract | The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements.
Abstract | The focus of this paper is to determine the correlation of automatic measures with human judgements for this task. |
Abstract | We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets. |
Introduction | In this paper we estimate the correlation of human judgements with five automatic evaluation measures on two image description data sets. |
Introduction | lated against human judgements, ROUGE-SU4 and Smoothed BLEU are moderately correlated, and the strongest correlation is found with Meteor.
Methodology | We estimate Spearman’s ρ for five different automatic evaluation measures against human judgements for the automatic image description task.
Methodology | The automatic measures are calculated on the sentence level and correlated against human judgements of semantic correctness. |
Methodology | The images were retrieved from Flickr, the reference descriptions were collected from Mechanical Turk, and the human judgements were collected from expert annotators as follows: each image in the test data was paired with the highest scoring sentence(s) retrieved from all possible test sentences by the TRI5SEM model in Hodosh et al.
Abstract | In this work, we propose a novel approach for meta-evaluation of MT evaluation metrics, since correlation coefficients against human judges do not reveal details about the advantages and disadvantages of particular metrics.
Correlation with Human Judgements | Let us first analyze the correlation with human judgements for linguistic vs. n-gram based metrics. |
Correlation with Human Judgements | Although correlation with human judgements is considered the standard meta-evaluation criterion, it presents serious drawbacks. |
Correlation with Human Judgements | For instance, Table 2 shows the best 10 metrics in CE05 according to their correlation with human judges at the system level, and then the ranking they obtain in the AE05 testbed.
Introduction | In this respect, we identify important drawbacks of the standard meta-evaluation methods based on correlation with human judgements.
Metrics and Test Beds | Human assessments of adequacy and fluency, on a 1-5 scale, are available for a subset of sentences, each evaluated by two different human judges.
Previous Work on Machine Translation Meta-Evaluation | In order to address this issue, they computed the translation-by-translation correlation with human judgements (i.e., correlation at the segment level). |
Previous Work on Machine Translation Meta-Evaluation | In all these cases, metrics were also evaluated by means of correlation with human judgements.
Previous Work on Machine Translation Meta-Evaluation | Most approaches again rely on correlation with human judgements.
Abstract | In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use human judgments as the gold standard to validate the conclusions drawn from the automatic measures. |
Abstract | We observe that it is hard for the human judges to reach good agreement when asked to rate the quality of the dialogs from given perspectives. |
Abstract | When building prediction models of human judgments using previously proposed automatic measures, we find that we cannot reliably predict human ratings using a regression model, but we can predict human rankings by a ranking model. |
Introduction | However, our approach uses human judgments as the gold standard.
Introduction | Although to date there are few studies that use human judges to directly assess the quality of user simulation, we believe that this is a reliable approach to assess the simulated corpora as well as an important step towards developing a comprehensive set of user simulation evaluation measures. |
Introduction | First, we can estimate the difficulty of the task of distinguishing real and simulated corpora by knowing how hard it is for human judges to reach an agreement. |
Experimental Results | Spearman’s correlation with human judgments.
Experimental Results | Overall, we observe an average improvement of +.024 in the correlation with the human judgments.
Experimental Results | Kendall’s Tau with human judgments.
Experimental Setup | We measured the correlation of the metrics with the human judgments provided by the organizers. |
Experimental Setup | 4.2 Human Judgements and Learning |
Experimental Setup | As in the WMT12 experimental setup, we use these rankings to calculate correlation with human judgments at the sentence-level, i.e. |
Related Work | Here we suggest some simple ways to create such metrics, and we also show that they yield better correlation with human judgments.
Related Work | However, they could not improve correlation with human judgments, as evaluated on the MetricsMATR dataset.
Related Work | Compared to the previous work, (i) we use a different discourse representation (RST), (ii) we compare discourse parses using all-subtree kernels (Collins and Duffy, 2001), (iii) we evaluate on much larger datasets, for several language pairs and for multiple metrics, and (iv) we do demonstrate better correlation with human judgments.
Abstract | Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. |
Abstract | It has a better correlation with human judgment than BLEU. |
Abstract | PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties). |
Experiments | We used Spearman’s rank correlation coefficient ρ to measure correlation of the metric with system-level human judgments of translation.
Experiments | The human judgment score is based on the “Rank” only, i.e., how often the translations of the system were rated as better than those from other systems (Callison-Burch et al., 2008). |
Experiments | Table 2: Correlations with human judgment on WMT
Experiments | BLEU 0.792 0.215 0.777 0.240
Experiments | METEOR 0.834 0.231 0.835 0.225
Experiments | PORT 0.801 0.236 0.804 0.242
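As an illustration of the system-level meta-evaluation described above, the sketch below computes Spearman's rank correlation between a metric's per-system scores and human judgment scores. All scores are invented for illustration; none are taken from Table 2.

```python
# Sketch: Spearman's rank correlation between an automatic metric's
# per-system scores and human judgment scores (system-level meta-evaluation).
# All scores below are invented, not taken from the paper.

def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 0-based positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = rank(xs), rank(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

metric_scores = [0.31, 0.27, 0.35, 0.22, 0.29]  # hypothetical per-system metric scores
human_scores = [0.40, 0.55, 0.60, 0.30, 0.45]   # hypothetical per-system human scores
print(round(spearman_rho(metric_scores, human_scores), 3))  # → 0.6
```

With no ties, this agrees with the familiar 1 − 6Σd²/(n(n²−1)) formula; averaging tied ranks keeps it valid when systems receive equal scores.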
Introduction | Many of the metrics correlate better with human judgments of translation quality than BLEU, as shown in recent WMT Evaluation Task reports (Callison-Burch et al.).
Introduction | Second, though a tuning metric should correlate strongly with human judgment, MERT (and similar algorithms) invoke the chosen metric so often that it must be computed quickly.
Introduction | (2011) claimed that TESLA tuning performed better than BLEU tuning according to human judgment . |
Conclusion | This is confirmed for other languages as well: the lower the BLEU score, the lower the correlation to human judgments.
Extensions of SemPOS | For the evaluation of metric correlation with human judgments at the system level, we used the Pearson correlation coefficient ρ applied to ranks.
Extensions of SemPOS | The MetricsMATR08 human judgments include preferences for pairs of MT systems saying which one of the two systems is better, while the WMT08 and WMT09 data contain system scores (for up to 5 systems) on the scale 1 to 5 for a given sentence. |
Extensions of SemPOS | Metrics’ performance for translation to English and Czech was measured on the following testsets (the number of human judgments for a given source language in brackets): |
Introduction | Many automatic metrics of MT quality have been proposed and evaluated in terms of correlation with human judgments while various techniques of manual judging are being examined as well, see e.g. |
Problems of BLEU | Its correlation to human judgments was originally deemed high (for English) but better correlating metrics (esp. |
Problems of BLEU | Figure 1 illustrates a very low correlation to human judgments when translating to Czech. |
Problems of BLEU | This amounts to 34% of running unigrams, giving enough space to differ in human judgments and still remain unscored. |
Dataset Construction and Human Performance | In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions.
Dataset Construction and Human Performance | Additionally, to test the extent to which the individual human judges are biased, we evaluate the performance of two virtual meta-judges. |
Dataset Construction and Human Performance | Specifically, the MAJORITY meta-judge predicts “deceptive” when at least two out of three human judges believe the review to be deceptive, and the SKEPTIC meta-judge predicts “deceptive” when any human judge believes the review to be deceptive.
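The two virtual meta-judges described here can be expressed directly; the boolean vote encoding (True = judge believes the review is deceptive) is an assumption for illustration:

```python
# Sketch of the MAJORITY and SKEPTIC meta-judges described above.
# Vote encoding (True = "judge says deceptive") is an assumption.

def majority_metajudge(votes):
    """Predict deceptive when at least two of three judges say deceptive."""
    return sum(votes) >= 2

def skeptic_metajudge(votes):
    """Predict deceptive when any judge says deceptive."""
    return any(votes)

votes = [True, False, True]  # hypothetical per-review judge votes
print(majority_metajudge(votes), skeptic_metajudge(votes))  # → True True
```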
Introduction | In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges, who perform roughly at-chance—a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).
Related Work | However, while these studies compare n-gram-based deception classifiers to a random guess baseline of 50%, we additionally evaluate and compare two other computational approaches (described in Section 4), as well as the performance of human judges (described in Section 3.3).
Related Work | Unfortunately, most measures of quality employed in those works are based exclusively on human judgments, which we find in Section 3 to be poorly calibrated to detecting deceptive opinion spam.
Results and Discussion | We observe that automated classifiers outperform human judges for every metric, except truthful recall where JUDGE 2 performs best. However, this is expected given that untrained humans often focus on unreliable cues to deception (Vrij, 2008).
Results and Discussion | The automated classifier outperforms most human judges (one-tailed sign test p = 0.06, 0.01, 0.001 for the three judges, respectively, on the first fold).
Abstract | In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
Experiments | The average scores of the two human judges are shown in Table 3. |
Experiments | 5.2 Correlation with human judgments |
Experiments | Having established rough correspondences between BLEU/PINC scores and human judgments of se- |
Introduction | Without these resources, researchers have resorted to developing their own small, ad hoc datasets (Barzilay and McKeown, 2001; Shinyama et al., 2002; Barzilay and Lee, 2003; Quirk et al., 2004; Dolan et al., 2004), and have often relied on human judgments to evaluate their results (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005).
Introduction | Section 5 presents experimental results establishing a correlation between our automatic metric and human judgments.
Paraphrase Evaluation Metrics | While PEM was shown to correlate well with human judgments, it has some limitations.
Related Work | While most work on evaluating paraphrase systems has relied on human judges (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005) or indirect, task-based methods (Lin and Pantel, 2001; Callison-Burch et al., 2006), there have also been a few attempts at creating automatic metrics that can be more easily replicated and used to compare different systems. |
Related Work | In addition, the metric was shown to correlate well with human judgments.
Related Work | However, a significant drawback of this approach is that PEM requires substantial in-domain bilingual data to train the semantic adequacy evaluator, as well as sample human judgments to train the overall metric. |
Abstract | When evaluated on data from the ACL-07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop.
Automatic Evaluation Metrics | In the ACL-07 MT workshop, ParaEval based on recall (ParaEval-recall) achieves good correlation with human judgements.
Introduction | Since human evaluation of MT output is time consuming and expensive, having a robust and accurate automatic MT evaluation metric that correlates well with human judgement is invaluable. |
Introduction | Although BLEU has played a crucial role in the progress of MT research, it is becoming evident that BLEU does not correlate with human judgement |
Introduction | During the recent ACL-07 workshop on statistical MT (Callison-Burch et al., 2007), a total of 11 automatic MT evaluation metrics were evaluated for correlation with human judgement.
Metric Design Considerations | The ACL-07 MT workshop evaluated the translation quality of MT systems on various translation tasks, and also measured the correlation (with human judgement) of 11 automatic MT evaluation metrics.
Metric Design Considerations | For human evaluation of the MT submissions, four different criteria were used in the workshop: Adequacy (how much of the original meaning is expressed in a system translation), Fluency (the translation’s fluency), Rank (different translations of a single source sentence are compared and ranked from best to worst), and Constituent (some constituents from the parse tree of the source sentence are translated, and human judges have to rank these translations). |
Metric Design Considerations | For this dataset, human judgements are available on adequacy and fluency for six system submissions, and there are four English reference translation texts. |
Abstract | We introduce a novel semiautomated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost. |
Abstract | The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER. |
Abstract | Callison-Burch et al. (2006) and Koehn and Monz (2006) report cases where BLEU strongly disagrees with human judgment on translation quality.
Abstract | In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE). |
Current practice in summary evaluation | Manual assessment, performed by human judges, usually centers around two main aspects of summary quality: content and form.
Current practice in summary evaluation | In fact, when it comes to evaluation of automatic summaries, BE shows higher correlations with human judgments than ROUGE, although the difference is not large enough to be statistically significant. |
Dependency-based evaluation | In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics. |
Dependency-based evaluation | In summary evaluation, as will be shown in Section 5, it leads to higher correlations with human judgments only in the case of human-produced model summaries, because almost any variation between two model summaries is “legal”, i.e. |
Dependency-based evaluation | For automatic summaries, which are of relatively poor quality, partial matching lowers our method’s ability to reflect human judgment, because it results in overly generous matching in situations where the examined information is neither a paraphrase nor relevant.
Experimental results | Of course, the ideal evaluation metric would show high correlations with human judgment on both levels. |
Experimental results | The letters in parenthesis indicate that a given DEPEVAL(summ) variant is significantly better at correlating with human judgment than ROUGE-2 (= R2), ROUGE-SU4 (= R4), or BE-HM (= B). |
Introduction | Despite relying on the same concept, our approach outperforms BE in most comparisons, and it often achieves higher correlations with human judgments than the string-matching metric ROUGE (Lin, 2004).
Abstract | Evaluation experiments were conducted to calculate the correlation among human judgments, along with the scores produced using automatic evaluation methods for MT outputs obtained from the 12 machine translation systems in NTCIR-7.
Experiments | We calculated the correlation between the scores obtained using our method and scores produced by human judgment . |
Experiments | Moreover, three human judges evaluated 1,200 English output sentences from the perspective of adequacy and fluency on a scale of 1-5.
Experiments | We used the median value of the evaluation results of the three human judges as the final 1-5 score.
Introduction | The scores of some automatic evaluation methods can obtain high correlation with human judgment in document-level automatic evaluation (Coughlin, 2007).
Introduction | Evaluation experiments using MT outputs obtained by 12 machine translation systems in NTCIR-7 (Fujii et al., 2008) demonstrate that the scores obtained using our system yield the highest correlation with the human judgments among the automatic evaluation methods in both sentence-level adequacy and fluency.
Expt. 1: Predicting Absolute Scores | The predictions of all models correlate highly significantly with human judgments, but we still see robustness issues for the individual MT metrics.
Expt. 1: Predicting Absolute Scores | On the system level (bottom half of Table 1), there is high variance due to the small number of predictions per language, and many predictions are not significantly correlated with human judgments.
Experimental Evaluation | At the sentence level, we can correlate predictions in Experiment 1 directly with human judgments with Spearman’s ρ,
Experimental Evaluation | Finally, the predictions are again correlated with human judgments using Spearman’s ρ. “Tie awareness” makes a considerable practical difference, improving correlation figures by 5-10 points.
Experimental Evaluation | Since the default uniform cost does not always correlate well with human judgment , we duplicate these features for 9 nonuniform edit costs. |
Expt. 2: Predicting Pairwise Preferences | The right column shows Spearman’s ρ for the correlation between human judgments and tie-aware system-level predictions.
Introduction | BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations. |
Introduction | Unfortunately, each metric tends to concentrate on one particular type of linguistic information, none of which always correlates well with human judgments.
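The n-gram overlap that BLEU builds on can be illustrated by its unigram core, modified (clipped) precision; the example sentences below are invented:

```python
# Sketch: modified (clipped) unigram precision, the core of BLEU-1.
# Each hypothesis token is credited at most as many times as it appears
# in any single reference. Example sentences are invented.
from collections import Counter

def clipped_unigram_precision(hypothesis, references):
    hyp_tokens = hypothesis.split()
    hyp_counts = Counter(hyp_tokens)
    max_ref_counts = Counter()
    for ref in references:
        for token, count in Counter(ref.split()).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    clipped = sum(min(count, max_ref_counts[token])
                  for token, count in hyp_counts.items())
    return clipped / len(hyp_tokens)

hyp = "the cat sat on the mat"
refs = ["the cat is on the mat", "there is a cat on the mat"]
print(round(clipped_unigram_precision(hyp, refs), 3))  # → 0.833
```

Full BLEU additionally combines clipped precisions for higher-order n-grams with a brevity penalty; this sketch shows only the unigram overlap component.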
Evaluation methodology | The main idea of manual evaluation was (1) to make the assessment as simple as possible for a human judge and (2) to make the results of evaluation unambiguous. |
Evaluation methodology | This task is also much simpler for human judges to complete. |
Evaluation methodology | The idea is to run a standard sort algorithm and ask a human judge each time a comparison operation is required. |
Results | METEOR (with its built-in Russian lemmatisation) and GTM offer the best correlation with human judgements.
Results | Table 3: Correlation to human judgements |
Discussion and Future Work | This is probably due to the linguistic characteristics of Chinese, where human judges apparently give equal importance to function words and content words. |
Experiments | The correlations of character-level BLEU and the average human judgments are shown in the first row of Tables 2 and 3 for the IWSLT and the NIST data set, respectively. |
Experiments | The correlations between the TESLA-CELAB scores and human judgments are shown in the last row of Tables 2 and 3. |
Introduction | In the WMT shared tasks, many new generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments . |
Introduction | Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human judged translation quality. |
Introduction | The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 − TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
Abstract | Exploiting the human judgements that are already implicit in available resources, we avoid purpose-specific annotation. |
Introduction | A main problem we face is that evaluating the performance of these systems ultimately requires human judgement.
Introduction | Fortunately there is already an abundance of data that meets our requirements: every scientific paper contains human “judgements” in the form of citations to other papers which are contextually appropriate: that is, relevant to specific passages of the document and aligned with its argumentative structure. |
Introduction | Citation Resolution is a method for evaluating CBCR systems that is exclusively based on this source of human judgements.
Related work | Third, as we outlined above, existing citations between papers can be exploited as a source of human judgements.
The task: Citation Resolution | The core criterion of this task is to use only the human judgements for which we have the clearest evidence.
Experiment 1: Textual Similarity | Each sentence pair in the datasets was given a score from 0 to 5 (low to high similarity) by human judges, with a high inter-annotator agreement of around 0.90 when measured using the Pearson correlation coefficient.
Experiment 1: Textual Similarity | Three evaluation metrics are provided by the organizers of the SemEval-2012 STS task, all of which are based on the Pearson correlation r of human judgments with system outputs: (1) the correlation value for the concatenation of all five datasets (ALL), (2) a correlation value obtained on a concatenation of the outputs, separately normalized by least squares (ALLnrm), and (3) the weighted average of Pearson correlations across datasets (Mean).
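The Pearson-based scoring these STS metrics rely on reduces to the standard correlation of gold and system similarity scores; a minimal sketch with invented scores:

```python
# Sketch: Pearson correlation r between gold similarity judgments and
# system similarity scores, as used for STS evaluation. Scores are invented.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

gold = [4.8, 1.2, 3.5, 0.6, 2.9]    # hypothetical 0-5 human similarity judgments
system = [4.1, 1.9, 3.0, 1.1, 2.5]  # hypothetical system similarity outputs
print(round(pearson_r(gold, system), 3))  # → 0.982
```

Unlike Spearman's ρ, Pearson's r operates on the raw scores rather than their ranks, so it rewards linear agreement with the 0-5 gold scale, not just correct ordering.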
Experiment 1: Textual Similarity | MSRpar (MPar) is the only dataset in which TLsim (Šarić et al., 2012) achieves a higher correlation with human judgments.
Experiment 2: Word Similarity | Table 6 shows the Spearman’s ρ rank correlation coefficients with human judgments on the RG-65 dataset.
Experiment 3: Sense Similarity | Table 6: Spearman’s ρ correlation coefficients with human judgments on the RG-65 dataset.
Introduction | Third, we demonstrate that this single representation can achieve state-of-the-art performance on three similarity tasks, each operating at a different lexical level: (1) surpassing the highest scores on the SemEval-2012 task on textual similarity (Agirre et al., 2012) that compares sentences, (2) achieving a near-perfect performance on the TOEFL synonym selection task proposed by Landauer and Dumais (1997), which measures word pair similarity, and also obtaining state-of-the-art performance in terms of the correlation with human judgments on the RG-65 dataset (Rubenstein and Goodenough, 1965), and finally (3) surpassing the performance of Snow et al. |
Introduction | We show that XMEANT, a new cross-lingual version of MEANT (Lo et al., 2012), correlates with human judgment even more closely than MEANT for evaluating MT adequacy via semantic frames, despite discarding the need for expensive human reference translations. |
Related Work | In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy. |
Related Work | ULC (Gimenez and Marquez, 2007, 2008) incorporates several semantic features and shows improved correlation with human judgement on translation quality (Callison-Burch et al., 2007, 2008) but no work has been done towards tuning an SMT system using a pure form of ULC perhaps due to its expensive run time. |
Related Work | For UMEANT (Lo and Wu, 2012), they are estimated in an unsupervised manner using relative frequency of each semantic role label in the references and thus UMEANT is useful when human judgments on adequacy of the development set are unavailable. |
Conclusion | Additionally, our data-driven approach can be applied to any dimension that is meaningful to human judges, and it provides an elegant way to project multiple dimensions simultaneously, by including the relevant dimensions as features of the parameter models’ training data.
Evaluation Experiment | We then evaluate the output utterances using naive human judges to rate their perceived personality and naturalness. |
Evaluation Experiment | Table 5 shows several sample outputs and the mean personality ratings from the human judges.
Introduction | Another thread investigates SNLG scoring models trained using higher-level linguistic features to replicate human judgments of utterance quality (Rambow et al., 2001; Nakatsu and White, 2006; Stent and Guo, 2005). |
Parameter Estimation Models | Collects human judgments rating the personality of each utterance; |
Experimental Results II | 5.1 Intrinsic Evaluation: Human Judgements
Experimental Results II | Therefore, we also report the degree of agreement among human judges in Table 7, where we compute the agreement of one Turker with respect to the gold standard drawn from the rest of the Turkers, and take the average across all five Turkers.
Experimental Results II | C-LP: 77.0, 73.0; SENTIWN: 71.5, 69.0; HUMAN JUDGES: 66.0, 69.0
Introduction | We provide comparative empirical results over several variants of these approaches with comprehensive evaluations including lexicon-based, human judgments, and extrinsic evaluations.
Introduction | §5 presents a comprehensive evaluation with human judges and extrinsic evaluations.
Experiments | In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora. |
Experiments | We computed the name-aware BLEU scores on the subset and also the aggregated average scores from human judgments . |
Experiments | Figure 2 shows that NAMT consistently achieved higher scores with both the name-aware BLEU metric and human judgement.
Experiments | To determine the sentiment of these adjectives, we asked 9 human judges, all native German speakers, to annotate them given the classes neutral, slightly negative, very negative, slightly positive, and very positive, reflecting the categories from the training data.
Experiments | Since human judges tend to interpret scales differently, we examine their agreement using Kendall’s coefficient of concordance (W), including correction for ties (Legendre, 2005), which takes ranks into account.
Experiments | Due to disagreements between the human judges, there exists no clear threshold between these categories.
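Kendall's W with the tie correction of Legendre (2005) can be sketched as follows; the judge ratings are invented, not taken from the study:

```python
# Sketch: Kendall's coefficient of concordance W with tie correction
# (Legendre, 2005). Judge ratings below are invented for illustration.
from collections import Counter

def rank(values):
    """1-based ranks, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 0-based positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kendalls_w(ratings):
    """Kendall's W with tie correction; ratings is one list per judge."""
    p = len(ratings)     # number of judges
    n = len(ratings[0])  # number of rated items
    ranked = [rank(r) for r in ratings]
    totals = [sum(judge[i] for judge in ranked) for i in range(n)]  # R_i
    s = sum(t * t for t in totals)
    # tie correction: sum of (t^3 - t) over tie groups within each judge
    ties = sum(t ** 3 - t for judge in ranked for t in Counter(judge).values())
    return (12 * s - 3 * p * p * n * (n + 1) ** 2) / (p * p * (n ** 3 - n) - p * ties)

# Hypothetical 1-5 ratings from three judges over five adjectives
ratings = [[1, 2, 3, 4, 5],
           [1, 1, 3, 4, 5],   # this judge ties the first two items
           [1, 2, 3, 5, 4]]
print(round(kendalls_w(ratings), 3))  # → 0.944
```

W ranges from 0 (no agreement) to 1 (perfect agreement); the tie correction keeps W well-behaved when judges assign equal ratings, which is common on coarse ordinal scales.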
Evaluation III: Sentiment Analysis using ConnotationWordNet | Note that there is a difference in how humans judge the orientation and the degree of connotation for a given word out of context, and how the use of such words in context can be perceived as good/bad news.
Evaluation II: Human Evaluation on ConnotationWordNet | The agreement between the new lexicon and human judges varies between 84% and 86.98%.
Evaluation II: Human Evaluation on ConnotationWordNet | (2005a)) show a low agreement rate with humans, which is somewhat expected: human judges in this study are labeling for subtle connotation, not for more explicit sentiment.
Evaluation II: Human Evaluation on ConnotationWordNet | Because different human judges have different notions of scale, however, subtle differences are more likely to be noisy.
Abstract | Experimental results demonstrate that the multiplicative models are superior to the additive alternatives when compared against human judgments.
Evaluation Setup | The task involves examining the degree of linear relationship between the human judgments for two individual words and vector-based similarity values. |
Evaluation Setup | We assume that the inter-subject agreement can serve as an upper bound for comparing the fit of our models against the human judgments.
Results | Table 2: Model means for High and Low similarity items and correlation coefficients with human judgments (*: p < 0.05, **: p < 0.01)
Experiments | They are collected by the same human judges and belong to the same domain. |
Experiments | The pairs were presented to five human judges who rated each pair on a scale from 1 (very literal/denotative) to 4 (very non-literal/connotative). |
Experiments | Table 4: Comparing AN metaphor detection method to the baselines: accuracy of the 10-fold cross validation on annotations of five human judges.
Experiments | However, perplexity on the held-out test set does not reflect the semantic coherence of topics and may be contrary to human judgments (Chang et al., 2009).
Experiments | As our objective is to discover more coherent aspects, we recruited two human judges.
Introduction | However, researchers have shown that fully unsupervised models often produce incoherent topics because the objective functions of topic models do not always correlate well with human judgments (Chang et al., 2009). |
Empirical Evaluation | We evaluate Rex by estimating how closely its judgments correlate with those of human judges on the 30-pair word set of Miller & Charles (M&C), who aggregated the judgments of multiple human raters into mean ratings for these pairs. |
Related Work and Ideas | Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures. |
Related Work and Ideas | Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set. |
Related work | We utilize large and small modifiers (described in Section 4.1), which correspond to the textual clues mo (as many as, as large as) and shika (only, as few as), respectively, for detecting humans’ judgments.
Related work | We asked three human judges to annotate every numerical expression with one of six labels, small, relatively small, normal, relatively large, large, and unsure. |
Related work | The cause of this error is exemplified by the sentence, “there are two reasons.” Human judges label the numerical expression two reasons as normal, but the method predicts small.
Empirical Evaluation | The evaluation of this task requires human judges to read all the posts where the two users forming the pair have interacted. |
Empirical Evaluation | Two human judges were asked to independently read all the post interactions of 500 pairs and label each pair as overall “disagreeing” or overall “agreeing” or “none”. |
Phrase Ranking based on Relevance | For this and subsequent human judgment tasks, we use two judges (graduate students well versed in English). |
Experiments | We merged these topics and asked two human judges to judge their quality by assigning a score of either 0 or 1. The judges are graduate students living in Singapore and not involved in this project.
Experiments | based on the human judge’s understanding. |
Experiments | For ground truth, we consider a bursty topic to be correct if both human judges have scored it 1.
Hello. My name is Inigo Montoya. | None of these observations, however, serve as definitions, and indeed, we believe it desirable not to pre-commit to an abstract definition, but rather to adopt an operational formulation based on external human judgments.
Hello. My name is Inigo Montoya. | In designing our study, we focus on a domain in which (i) there is rich use of language, some of which has achieved deep cultural penetration; (ii) there already exist a large number of external human judgments — perhaps implicit, but in a form we can extract; and (iii) we can control for the setting in which the text was used. |
Never send a human to do a machine’s job. | Thus, the main conclusion from these prediction tasks is that abstracting notions such as distinctiveness and generality can produce relatively streamlined models that outperform much heavier-weight bag-of-words models, and can suggest steps toward approaching the performance of human judges who — very much unlike our system — have the full cultural context in which movies occur at their disposal. |
Abstract | We show that this constrained model’s analyses of speaker authority correlate very strongly with expert human judgments (r² coefficient of 0.947).
Background | In general, however, we now have an automated model that is reliable in reproducing human judgments of authoritativeness. |
Introduction | In section 5, this model is evaluated on a subset of the MapTask corpus (Anderson et al., 1991) and shows a high correlation with human judgements of authoritativeness (r² = 0.947).
Experiment: Ranking Word Senses | Based on agreement between human judges, Erk and McCarthy (2009) estimate an upper bound ρ of 0.544 for the dataset.
Experiment: Ranking Word Senses | The first column shows the correlation of our model’s predictions with the human judgments from the gold-standard, averaged over all instances. |
Experiment: Ranking Word Senses | Table 4: Correlation of model predictions and human judgments |
Introduction | When employing any such metric, it is crucial to verify that the predictions of the automated evaluation process agree with human judgements of the important aspects of the system output. |
Introduction | counter-examples to the claim that BLEU agrees with human judgements.
Introduction | Also, Foster (2008) examined a range of automated metrics for evaluating generated multimodal output and found that few agreed with the preferences expressed by human judges.