Index of papers in Proc. ACL that mention
  • human judgments
Huang, Eric and Socher, Richard and Manning, Christopher and Ng, Andrew
Abstract
We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models.
Conclusion
We introduced a new dataset with human judgments on similarity between pairs of words in context, so as to evaluate a model's ability to capture homonymy and polysemy of words in context.
Experiments
Our model also improves the correlation with human judgments on a word similarity task.
Experiments
important, we introduce a new dataset with human judgments on similarity of pairs of words in sentential context.
Experiments
Each pair is presented without context and associated with 13 to 16 human judgments on similarity and relatedness on a scale from 0 to 10.
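As an illustration of how such judgments are typically used for evaluation, here is a minimal sketch (invented ratings and scores; Python with SciPy assumed): average the judgments per pair and correlate them with a model's similarity scores.

from statistics import mean
from scipy.stats import spearmanr

# Hypothetical per-pair judgments on a 0-10 scale (13-16 raters per pair).
human_ratings = {
    ("bank", "money"):    [8, 7, 9, 6, 8, 7, 9, 8, 7, 8, 9, 6, 7],
    ("bank", "river"):    [3, 2, 4, 1, 3, 2, 2, 3, 4, 2, 1, 3, 2, 3],
    ("plant", "factory"): [6, 7, 5, 6, 8, 7, 6, 5, 7, 6, 6, 7, 5],
}
# Hypothetical model similarity scores for the same pairs.
model_similarity = {("bank", "money"): 0.71, ("bank", "river"): 0.28,
                    ("plant", "factory"): 0.55}

pairs = list(human_ratings)
gold = [mean(human_ratings[p]) for p in pairs]     # average judgment per pair
pred = [model_similarity[p] for p in pairs]
rho, _ = spearmanr(gold, pred)
print(f"Spearman correlation with human judgments: {rho:.3f}")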
Introduction
However, one limitation of this evaluation is that the human judgments are on pairs
Introduction
Since word interpretation in context is important especially for homonymous and polysemous words, we introduce a new dataset with human judgments on similarity between pairs of words in sentential context.
human judgments is mentioned in 11 sentences in this paper.
Elliott, Desmond and Keller, Frank
Abstract
The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements.
Abstract
The focus of this paper is to determine the correlation of automatic measures with human judgements for this task.
Abstract
We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets.
Introduction
In this paper we estimate the correlation of human judgements with five automatic evaluation measures on two image description data sets.
Introduction
…correlated against human judgements, ROUGE-SU4 and Smoothed BLEU are moderately correlated, and the strongest correlation is found with Meteor.
Methodology
We estimate Spearman’s ρ for five different automatic evaluation measures against human judgements for the automatic image description task.
Methodology
The automatic measures are calculated on the sentence level and correlated against human judgements of semantic correctness.
Methodology
The images were retrieved from Flickr, the reference descriptions were collected from Mechanical Turk, and the human judgements were collected from expert annotators as follows: each image in the test data was paired with the highest scoring sentence(s) retrieved from all possible test sentences by the TRI5SEM model in Hodosh et al.
human judgments is mentioned in 30 sentences in this paper.
Amigó, Enrique and Giménez, Jesús and Gonzalo, Julio and Verdejo, Felisa
Abstract
In this work, we propose a novel approach for meta-evaluation of MT evaluation metrics, since correlation coefficients against human judges do not reveal details about the advantages and disadvantages of particular metrics.
Correlation with Human Judgements
Let us first analyze the correlation with human judgements for linguistic vs. n-gram based metrics.
Correlation with Human Judgements
Although correlation with human judgements is considered the standard meta-evaluation criterion, it presents serious drawbacks.
Correlation with Human Judgements
For instance, Table 2 shows the best 10 metrics in CEOS according to their correlation with human judges at the system level, and then the ranking they obtain in the AEOS testbed.
Introduction
In this respect, we identify important drawbacks of the standard meta-evaluation methods based on correlation with human judgements .
Metrics and Test Beds
Human assessments of adequacy and fluency, on a 1-5 scale, are available for a subset of sentences, each evaluated by two different human judges .
Previous Work on Machine Translation Meta-Evaluation
In order to address this issue, they computed the translation-by-translation correlation with human judgements (i.e., correlation at the segment level).
Previous Work on Machine Translation Meta-Evaluation
In all these cases, metrics were also evaluated by means of correlation with human judgements .
Previous Work on Machine Translation Meta-Evaluation
Most approaches again rely on correlation with human judgements .
human judgments is mentioned in 19 sentences in this paper.
Ai, Hua and Litman, Diane J.
Abstract
In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use human judgments as the gold standard to validate the conclusions drawn from the automatic measures.
Abstract
We observe that it is hard for the human judges to reach good agreement when asked to rate the quality of the dialogs from given perspectives.
Abstract
When building prediction models of human judgments using previously proposed automatic measures, we find that we cannot reliably predict human ratings using a regression model, but we can predict human rankings by a ranking model.
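A rough sketch of the two model types (invented features and ratings; scikit-learn assumed, not the paper's actual setup): the regression model predicts the absolute human rating, while the ranking model only predicts which dialog of a pair the humans rated higher.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))                      # automatic measures for 40 simulated dialogs
true_w = np.array([0.8, 0.1, -0.4, 0.0, 0.3])
y = X @ true_w + rng.normal(scale=0.5, size=40)   # simulated human quality ratings

# Regression model: predict the rating itself.
reg = LinearRegression().fit(X, y)

# Ranking model: classify which dialog of a pair was rated higher,
# using feature differences (a simple pairwise-preference reduction).
pair_feats, pair_labels = [], []
for i in range(len(y)):
    for j in range(i + 1, len(y)):
        pair_feats.append(X[i] - X[j])
        pair_labels.append(int(y[i] > y[j]))
ranker = LogisticRegression(max_iter=1000).fit(np.array(pair_feats), np.array(pair_labels))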
Introduction
However, our approach uses human judgments as the gold standard.
Introduction
Although to date there are few studies that use human judges to directly assess the quality of user simulation, we believe that this is a reliable approach to assess the simulated corpora as well as an important step towards developing a comprehensive set of user simulation evaluation measures.
Introduction
First, we can estimate the difficulty of the task of distinguishing real and simulated corpora by knowing how hard it is for human judges to reach an agreement.
human judgments is mentioned in 42 sentences in this paper.
Guzmán, Francisco and Joty, Shafiq and Màrquez, Llu'is and Nakov, Preslav
Experimental Results
Spearman’s correlation with human judgments .
Experimental Results
Overall, we observe an average improvement of +.024, in the correlation with the human judgments .
Experimental Results
Kendall’s Tau with human judgments .
Experimental Setup
We measured the correlation of the metrics with the human judgments provided by the organizers.
Experimental Setup
4.2 Human Judgements and Learning
Experimental Setup
As in the WMT12 experimental setup, we use these rankings to calculate correlation with human judgments at the sentence-level, i.e.
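A minimal sketch of correlating a metric with human rankings at the sentence level via Kendall's tau (invented numbers; SciPy assumed; the actual WMT variant counts concordant and discordant pairs over partial human rankings):

from scipy.stats import kendalltau

human_rank   = [1, 3, 2, 5, 4]                    # human ranking of five system outputs (1 = best)
metric_score = [0.42, 0.30, 0.38, 0.18, 0.25]     # metric scores, higher = better

# Negate the ranks so both sequences run in the "higher is better" direction.
tau, _ = kendalltau([-r for r in human_rank], metric_score)
print(f"Kendall's tau = {tau:.3f}")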
Related Work
Here we suggest some simple ways to create such metrics, and we also show that they yield better correlation with human judgments .
Related Work
However, they could not improve correlation with human judgments , as evaluated on the MetricsMATR dataset.
Related Work
Compared to the previous work, (i) we use a different discourse representation (RST), (ii) we compare discourse parses using all-subtree kernels (Collins and Duffy, 2001), (iii) we evaluate on much larger datasets, for several language pairs and for multiple metrics, and (iv) we do demonstrate better correlation with human judgments .
human judgments is mentioned in 16 sentences in this paper.
Chen, Boxing and Kuhn, Roland and Larkin, Samuel
Abstract
Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU.
Abstract
It has a better correlation with human judgment than BLEU.
Abstract
PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).
Experiments
We used Spearman’s rank correlation coefficient ρ to measure correlation of the metric with system-level human judgments of translation.
Experiments
The human judgment score is based on the “Rank” only, i.e., how often the translations of the system were rated as better than those from other systems (Callison-Burch et al., 2008).
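A small sketch of this evaluation (invented counts and scores; SciPy assumed): each system's human score is the fraction of pairwise comparisons its translations won, and the metric is scored by its system-level Spearman correlation with those values.

from scipy.stats import spearmanr

# Hypothetical pairwise-comparison outcomes from human judges.
wins        = {"sysA": 120, "sysB": 95, "sysC": 60}
comparisons = {"sysA": 200, "sysB": 200, "sysC": 200}
human_score = {s: wins[s] / comparisons[s] for s in wins}

# Hypothetical system-level metric scores (e.g., BLEU or PORT).
metric_score = {"sysA": 0.31, "sysB": 0.27, "sysC": 0.22}

systems = sorted(wins)
rho, _ = spearmanr([human_score[s] for s in systems],
                   [metric_score[s] for s in systems])
print(f"system-level Spearman rho = {rho:.3f}")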
Experiments
Table 2: Correlations with human judgment on WMT
BLEU:   0.792  0.215  0.777  0.240
METEOR: 0.834  0.231  0.835  0.225
PORT:   0.801  0.236  0.804  0.242
Introduction
Many of the metrics correlate better with human judgments of translation quality than BLEU, as shown in recent WMT Evaluation Task reports (Callison-Burch et
Introduction
Second, though a tuning metric should correlate strongly with human judgment , MERT (and similar algorithms) invoke the chosen metric so often that it must be computed quickly.
Introduction
(2011) claimed that TESLA tuning performed better than BLEU tuning according to human judgment .
human judgments is mentioned in 12 sentences in this paper.
Bojar, Ondřej and Kos, Kamil and Mareċek, David
Conclusion
This is confirmed for other languages as well: the lower the BLEU score, the lower the correlation to human judgments.
Extensions of SemPOS
For the evaluation of metric correlation with human judgments at the system level, we used the Pearson correlation coefficient ρ applied to ranks.
Extensions of SemPOS
The MetricsMATR08 human judgments include preferences for pairs of MT systems saying which one of the two systems is better, while the WMT08 and WMT09 data contain system scores (for up to 5 systems) on the scale 1 to 5 for a given sentence.
Extensions of SemPOS
Metrics’ performance for translation to English and Czech was measured on the following testsets (the number of human judgments for a given source language in brackets):
Introduction
Many automatic metrics of MT quality have been proposed and evaluated in terms of correlation with human judgments while various techniques of manual judging are being examined as well, see e.g.
Problems of BLEU
Its correlation to human judgments was originally deemed high (for English) but better correlating metrics (esp.
Problems of BLEU
Figure 1 illustrates a very low correlation to human judgments when translating to Czech.
Problems of BLEU
This amounts to 34% of running unigrams, giving enough space to differ in human judgments and still remain unscored.
human judgments is mentioned in 11 sentences in this paper.
Ott, Myle and Choi, Yejin and Cardie, Claire and Hancock, Jeffrey T.
Dataset Construction and Human Performance
In this section, we report our efforts to gather (and validate with human judgments ) the first publicly available opinion spam dataset with gold-standard deceptive opinions.
Dataset Construction and Human Performance
Additionally, to test the extent to which the individual human judges are biased, we evaluate the performance of two virtual meta-judges.
Dataset Construction and Human Performance
Specifically, the MAJORITY meta-judge predicts “deceptive” when at least two out of three human judges believe the review to be deceptive, and the SKEPTIC meta-judge predicts “deceptive” when any human judge believes the review to be deceptive.
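The two meta-judges amount to simple vote aggregation over the individual judges' labels; a minimal sketch (hypothetical labels, True meaning "deceptive"):

def majority_judge(labels):
    # "deceptive" when at least two of the three judges say deceptive
    return sum(labels) >= 2

def skeptic_judge(labels):
    # "deceptive" when any judge says deceptive
    return any(labels)

review_labels = [True, False, True]     # one human label per judge for a single review
print(majority_judge(review_labels), skeptic_judge(review_labels))   # True True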
Introduction
In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges , who perform roughly at-chance—a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).
Related Work
However, while these studies compare n-gram—based deception classifiers to a random guess baseline of 50%, we additionally evaluate and compare two other computational approaches (described in Section 4), as well as the performance of human judges (described in Section 3.3).
Related Work
Unfortunately, most measures of quality employed in those works are based exclusively on human judgments , which we find in Section 3 to be poorly calibrated to detecting deceptive opinion spam.
Results and Discussion
We observe that automated classifiers outperform human judges for every metric, except truthful recall where JUDGE 2 performs best. However, this is expected given that untrained humans often focus on unreliable cues to deception (Vrij, 2008).
Results and Discussion
…automated classifier outperforms most human judges (one-tailed sign test p = 0.06, 0.01, 0.001 for the three judges, respectively, on the first fold).
human judgments is mentioned in 12 sentences in this paper.
Chen, David and Dolan, William
Abstract
In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments .
Experiments
The average scores of the two human judges are shown in Table 3.
Experiments
5.2 Correlation with human judgments
Experiments
Having established rough correspondences between BLEU/PINC scores and human judgments of se-
Introduction
Without these resources, researchers have resorted to developing their own small, ad hoc datasets (Barzilay and McKeown, 2001; Shinyama et al., 2002; Barzilay and Lee, 2003; Quirk et al., 2004; Dolan et al., 2004), and have often relied on human judgments to evaluate their results (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005).
Introduction
Section 5 presents experimental results establishing a correlation between our automatic metric and human judgments .
Paraphrase Evaluation Metrics
While PEM was shown to correlate well with human judgments , it has some limitations.
Related Work
While most work on evaluating paraphrase systems has relied on human judges (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005) or indirect, task-based methods (Lin and Pantel, 2001; Callison-Burch et al., 2006), there have also been a few attempts at creating automatic metrics that can be more easily replicated and used to compare different systems.
Related Work
In addition, the metric was shown to correlate well with human judgments .
Related Work
However, a significant drawback of this approach is that PEM requires substantial in-domain bilingual data to train the semantic adequacy evaluator, as well as sample human judgments to train the overall metric.
human judgments is mentioned in 15 sentences in this paper.
Chan, Yee Seng and Ng, Hwee Tou
Abstract
When evaluated on data from the ACL—07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop.
Automatic Evaluation Metrics
In the ACL-07 MT workshop, ParaEval based on recall (ParaEval-recall) achieves good correlation with human judgements .
Introduction
Since human evaluation of MT output is time consuming and expensive, having a robust and accurate automatic MT evaluation metric that correlates well with human judgement is invaluable.
Introduction
Although BLEU has played a crucial role in the progress of MT research, it is becoming evident that BLEU does not correlate with human judgement
Introduction
During the recent ACL-07 workshop on statistical MT (Callison-Burch et al., 2007), a total of 11 automatic MT evaluation metrics were evaluated for correlation with human judgement .
Metric Design Considerations
The ACL-07 MT workshop evaluated the translation quality of MT systems on various translation tasks, and also measured the correlation (with human judgement ) of 11 automatic MT evaluation metrics.
Metric Design Considerations
For human evaluation of the MT submissions, four different criteria were used in the workshop: Adequacy (how much of the original meaning is expressed in a system translation), Fluency (the translation’s fluency), Rank (different translations of a single source sentence are compared and ranked from best to worst), and Constituent (some constituents from the parse tree of the source sentence are translated, and human judges have to rank these translations).
Metric Design Considerations
For this dataset, human judgements are available on adequacy and fluency for six system submissions, and there are four English reference translation texts.
human judgments is mentioned in 10 sentences in this paper.
Lo, Chi-kiu and Wu, Dekai
Abstract
We introduce a novel semiautomated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost.
Abstract
The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER.
Abstract
…(2006) and Koehn and Monz (2006) report cases where BLEU strongly disagrees with human judgment on translation quality.
human judgments is mentioned in 9 sentences in this paper.
Owczarzak, Karolina
Abstract
In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE).
Current practice in summary evaluation
Manual assessment, performed by human judges , usually centers around two main aspects of summary quality: content and form.
Current practice in summary evaluation
In fact, when it comes to evaluation of automatic summaries, BE shows higher correlations with human judgments than ROUGE, although the difference is not large enough to be statistically significant.
Dependency-based evaluation
In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics.
Dependency-based evaluation
In summary evaluation, as will be shown in Section 5, it leads to higher correlations with human judgments only in the case of human-produced model summaries, because almost any variation between two model summaries is “legal”, i.e.
Dependency-based evaluation
For automatic summaries, which are of relatively poor quality, partial matching lowers our method’s ability to reflect human judgment , because it results in overly generous matching in situations where the examined information is neither a paraphrase nor relevant.
Experimental results
Of course, the ideal evaluation metric would show high correlations with human judgment on both levels.
Experimental results
The letters in parenthesis indicate that a given DEPEVAL(summ) variant is significantly better at correlating with human judgment than ROUGE-2 (= R2), ROUGE-SU4 (= R4), or BE-HM (= B).
Introduction
Despite relying on the same concept, our approach outperforms BE in most comparisons, and it often achieves higher correlations with human judgments than the string-matching metric ROUGE (Lin, 2004).
human judgments is mentioned in 9 sentences in this paper.
Echizen-ya, Hiroshi and Araki, Kenji
Abstract
Evaluation experiments were conducted to calculate the correlation between human judgments and the scores produced using automatic evaluation methods for MT outputs obtained from the 12 machine translation systems in NTCIR-7.
Experiments
We calculated the correlation between the scores obtained using our method and scores produced by human judgment .
Experiments
Moreover, three human judges evaluated 1,200 English output sentences from the perspective of adequacy and fluency on a scale of 1-5.
Experiments
We used the median value in the evaluation results of three human judges as the final scores of 1-5.
Introduction
The scores of some automatic evaluation methods can obtain high correlation with human judgment in document-level automatic evaluation (Coughlin, 2007).
Introduction
Evaluation experiments using MT outputs obtained by 12 machine translation systems in NTCIR-7 (Fujii et al., 2008) demonstrate that the scores obtained using our system yield the highest correlation with the human judgments among the automatic evaluation methods in both sentence-level adequacy and fluency.
human judgments is mentioned in 8 sentences in this paper.
Pado, Sebastian and Galley, Michel and Jurafsky, Dan and Manning, Christopher D.
EXpt. 1: Predicting Absolute Scores
The predictions of all models correlate highly significantly with human judgments , but we still see robustness issues for the individual MT metrics.
EXpt. 1: Predicting Absolute Scores
On the system level (bottom half of Table 1), there is high variance due to the small number of predictions per language, and many predictions are not significantly correlated with human judgments .
Experimental Evaluation
At the sentence level, we can correlate predictions in Experiment 1 directly with human judgments with Spearman’s ρ,
Experimental Evaluation
Finally, the predictions are again correlated with human judgments using Spearman’s ρ. “Tie awareness” makes a considerable practical difference, improving correlation figures by 5-10 points.
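The snippets do not spell out the exact tie-aware procedure; as a generic illustration (SciPy assumed), one common step is to give items with equal predicted scores the same averaged rank before correlating:

from scipy.stats import rankdata, spearmanr

predictions = [0.9, 0.7, 0.7, 0.4]        # two items predicted as a tie
human_ranks = [1.0, 2.5, 2.5, 4.0]        # human ranks that also contain a tie

pred_ranks = rankdata([-p for p in predictions], method="average")   # [1. , 2.5, 2.5, 4. ]
rho, _ = spearmanr(pred_ranks, human_ranks)
print(f"tie-aware Spearman rho = {rho:.3f}")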
Experimental Evaluation
Since the default uniform cost does not always correlate well with human judgment , we duplicate these features for 9 nonuniform edit costs.
Expt. 2: Predicting Pairwise Preferences
The right column shows Spearman’s ρ for the correlation between human judgments and tie-aware system-level predictions.
Introduction
BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations.
Introduction
Unfortunately, each metric tends to concentrate on one particular type of linguistic information, none of which always correlates well with human judgments.
human judgments is mentioned in 8 sentences in this paper.
Braslavski, Pavel and Beloborodov, Alexander and Khalilov, Maxim and Sharoff, Serge
Evaluation methodology
The main idea of manual evaluation was (1) to make the assessment as simple as possible for a human judge and (2) to make the results of evaluation unambiguous.
Evaluation methodology
This task is also much simpler for human judges to complete.
Evaluation methodology
The idea is to run a standard sort algorithm and ask a human judge each time a comparison operation is required.
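A minimal sketch of that idea (prompt wording invented): a standard comparison sort where the comparison function asks a human judge which of two outputs is better.

from functools import cmp_to_key

def human_compare(a, b):
    # The human judge supplies the comparison for the sort algorithm.
    answer = input(f"Which translation is better?\n  1) {a}\n  2) {b}\n> ")
    return -1 if answer.strip() == "1" else 1

outputs = ["translation A", "translation B", "translation C"]
ranked = sorted(outputs, key=cmp_to_key(human_compare))   # best first, per the judge's answers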
Results
METEOR (with its built-in Russian lemmatisation) and GTM offer the best correlation with human judgements.
Results
Table 3: Correlation to human judgements
human judgments is mentioned in 7 sentences in this paper.
Liu, Chang and Ng, Hwee Tou
Discussion and Future Work
This is probably due to the linguistic characteristics of Chinese, where human judges apparently give equal importance to function words and content words.
Experiments
The correlations of character-level BLEU and the average human judgments are shown in the first row of Tables 2 and 3 for the IWSLT and the NIST data set, respectively.
Experiments
The correlations between the TESLA-CELAB scores and human judgments are shown in the last row of Tables 2 and 3.
Introduction
In the WMT shared tasks, many new generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments .
Introduction
Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human judged translation quality.
Introduction
The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1-TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
human judgments is mentioned in 6 sentences in this paper.
Duma, Daniel and Klein, Ewan
Abstract
Exploiting the human judgements that are already implicit in available resources, we avoid purpose-specific annotation.
Introduction
A main problem we face is that evaluating the performance of these systems ultimately requires human judgement .
Introduction
Fortunately there is already an abundance of data that meets our requirements: every scientific paper contains human “judgements” in the form of citations to other papers which are contextually appropriate: that is, relevant to specific passages of the document and aligned with its argumentative structure.
Introduction
Citation Resolution is a method for evaluating CBCR systems that is exclusively based on this source of human judgements .
Related work
Third, as we outlined above, existing citations between papers can be exploited as a source of human judgements .
The task: Citation Resolution
The core criterion of this task is to use only the human judgements that we have clearest evidence for.
human judgments is mentioned in 6 sentences in this paper.
Pilehvar, Mohammad Taher and Jurgens, David and Navigli, Roberto
Experiment 1: Textual Similarity
Each sentence pair in the datasets was given a score from 0 to 5 (low to high similarity) by human judges , with a high inter-annotator agreement of around 0.90 when measured using the Pearson correlation coefficient.
Experiment 1: Textual Similarity
Three evaluation metrics are provided by the organizers of the SemEval-2012 STS task, all of which are based on the Pearson correlation r of human judgments with system outputs: (1) the correlation value for the concatenation of all five datasets (ALL), (2) a correlation value obtained on a concatenation of the outputs, separately normalized by least squares (ALLnrm), and (3) the weighted average of Pearson correlations across datasets (Mean).
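A small sketch of the weighted-average ("Mean") criterion (invented scores; SciPy assumed); ALL and ALLnrm instead concatenate, and normalize then concatenate, the per-dataset outputs.

from scipy.stats import pearsonr

# dataset -> (system similarity outputs, gold human similarity scores), invented
datasets = {
    "MSRpar": ([0.6, 0.2, 0.9, 0.4], [3.1, 1.0, 4.5, 2.2]),
    "MSRvid": ([0.1, 0.8, 0.5],      [0.4, 4.2, 2.8]),
}

weighted_sum, total = 0.0, 0
for name, (system, gold) in datasets.items():
    r, _ = pearsonr(system, gold)
    weighted_sum += r * len(gold)       # weight each dataset by its size
    total += len(gold)
print(f"Mean (size-weighted Pearson) = {weighted_sum / total:.3f}")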
Experiment 1: Textual Similarity
MSRpar (MPar) is the only dataset in which TLsim (Šarić et al., 2012) achieves a higher correlation with human judgments.
Experiment 2: Word Similarity
Table 6 shows the Spearman’s ρ rank correlation coefficients with human judgments on the RG-65 dataset.
Experiment 3: Sense Similarity
Table 6: Spearman’s ρ correlation coefficients with human judgments on the RG-65 dataset.
Introduction
Third, we demonstrate that this single representation can achieve state-of-the-art performance on three similarity tasks, each operating at a different lexical level: (1) surpassing the highest scores on the SemEval-2012 task on textual similarity (Agirre et al., 2012) that compares sentences, (2) achieving a near-perfect performance on the TOEFL synonym selection task proposed by Landauer and Dumais (1997), which measures word pair similarity, and also obtaining state-of-the-art performance in terms of the correlation with human judgments on the RG-65 dataset (Rubenstein and Goodenough, 1965), and finally (3) surpassing the performance of Snow et al.
human judgments is mentioned in 6 sentences in this paper.
Lo, Chi-kiu and Beloucif, Meriem and Saers, Markus and Wu, Dekai
Introduction
We show that XMEANT, a new cross-lingual version of MEANT (Lo et al., 2012), correlates with human judgment even more closely than MEANT for evaluating MT adequacy via semantic frames, despite discarding the need for expensive human reference translations.
Related Work
In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy.
Related Work
ULC (Gimenez and Marquez, 2007, 2008) incorporates several semantic features and shows improved correlation with human judgement on translation quality (Callison-Burch et al., 2007, 2008) but no work has been done towards tuning an SMT system using a pure form of ULC perhaps due to its expensive run time.
Related Work
For UMEANT (Lo and Wu, 2012), they are estimated in an unsupervised manner using relative frequency of each semantic role label in the references and thus UMEANT is useful when human judgments on adequacy of the development set are unavailable.
human judgments is mentioned in 5 sentences in this paper.
Mairesse, François and Walker, Marilyn
Conclusion
Additionally, our data-driven approach can be applied to any dimension that is meaningful to human judges , and it provides an elegant way to project multiple dimensions simultaneously, by including the relevant dimensions as features of the parameter models’ training data.
Evaluation Experiment
We then evaluate the output utterances using naive human judges to rate their perceived personality and naturalness.
Evaluation Experiment
Table 5 shows several sample outputs and the mean personality ratings from the human judges .
Introduction
Another thread investigates SNLG scoring models trained using higher-level linguistic features to replicate human judgments of utterance quality (Rambow et al., 2001; Nakatsu and White, 2006; Stent and Guo, 2005).
Parameter Estimation Models
Collects human judgments rating the personality of each utterance;
human judgments is mentioned in 5 sentences in this paper.
Feng, Song and Kang, Jun Seok and Kuznetsova, Polina and Choi, Yejin
Experimental Results 11
5.1 Intrinsic Evaluation: Human Judgements
Experimental Results 11
Therefore, we also report the degree of agreement among human judges in Table 7, where we compute the agreement of one Turker with respect to the gold standard drawn from the rest of the Turkers, and take the average across all five Turkers.
Experimental Results 11
C-LP: 77.0, 73.0; SENTIWN: 71.5, 69.0; HUMAN JUDGES: 66.0, 69.0
Introduction
We provide comparative empirical results over several variants of these approaches with comprehensive evaluations including lexicon-based, human judgments , and extrinsic evaluations.
Introduction
§5 presents comprehensive evaluation with human judges and extrinsic evaluations.
human judgments is mentioned in 5 sentences in this paper.
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Experiments
In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora.
Experiments
We computed the name-aware BLEU scores on the subset and also the aggregated average scores from human judgments .
Experiments
Figure 2 shows that NAMT consistently achieved higher scores with both name-aware BLEU metric and human judgement .
human judgments is mentioned in 5 sentences in this paper.
Scheible, Christian
Experiments
To determine the sentiment of these adjectives, we asked 9 human judges , all native German speakers, to annotate them given the classes neutral, slightly negative, very negative, slightly positive, and very positive, reflecting the categories from the training data.
Experiments
Since human judges tend to interpret scales differently, we examine their agreement using Kendall’s coefficient of concordance (W) including correction for ties (Legendre, 2005) which takes ranks into account.
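A minimal sketch of Kendall's coefficient of concordance W with the usual correction for ties (invented ratings; NumPy/SciPy assumed):

import numpy as np
from scipy.stats import rankdata

# judges x items: ordinal sentiment ratings from three judges for five adjectives (invented)
ratings = np.array([
    [1, 2, 2, 4, 5],
    [1, 3, 2, 4, 4],
    [2, 2, 3, 5, 5],
])
m, n = ratings.shape
ranks = np.vstack([rankdata(row) for row in ratings])    # average ranks within each judge

S = np.sum((ranks.sum(axis=0) - m * (n + 1) / 2) ** 2)   # squared deviations of rank sums
T = 0.0                                                  # ties-correction term
for row in ranks:
    _, counts = np.unique(row, return_counts=True)
    T += np.sum(counts ** 3 - counts)

W = 12 * S / (m ** 2 * (n ** 3 - n) - m * T)
print(f"Kendall's W (tie-corrected) = {W:.3f}")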
Experiments
Due to disagreements between the human judges there exists no clear threshold between these categories.
human judgments is mentioned in 4 sentences in this paper.
Kang, Jun Seok and Feng, Song and Akoglu, Leman and Choi, Yejin
Evaluation 111: Sentiment Analysis using ConnotationWordNet
Note that there is a difference in how humans judge the orientation and the degree of connotation for a given word out of context, and how the use of such words in context can be perceived as good/bad news.
Evaluation 11: Human Evaluation on ConnotationWordNet
The agreement between the new lexicon and human judges varies between 84% and 86.98%.
Evaluation 11: Human Evaluation on ConnotationWordNet
(2005a)) show a low agreement rate with humans, which is somewhat expected: human judges in this study are labeling for subtle connotation, not for more explicit sentiment.
Evaluation 11: Human Evaluation on ConnotationWordNet
Because different human judges have different notions of scale, however, subtle differences are more likely to be noisy.
human judgments is mentioned in 4 sentences in this paper.
Mitchell, Jeff and Lapata, Mirella
Abstract
Experimental results demonstrate that the multiplicative models are superior to the additive alternatives when compared against human judgments .
Evaluation Setup
The task involves examining the degree of linear relationship between the human judgments for two individual words and vector-based similarity values.
Evaluation Setup
We assume that the inter-subject agreement can serve as an upper bound for comparing the fit of our models against the human judgments .
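One common way to estimate such an upper bound (not necessarily the paper's exact procedure; invented ratings, SciPy assumed) is leave-one-out: correlate each subject's judgments with the mean of the remaining subjects' judgments and average the results.

import numpy as np
from scipy.stats import spearmanr

ratings = np.array([        # subjects x items, similarity judgments (invented)
    [5, 3, 1, 4, 2],
    [4, 3, 2, 5, 1],
    [5, 2, 1, 4, 3],
], dtype=float)

agreements = []
for i in range(len(ratings)):
    others = np.delete(ratings, i, axis=0).mean(axis=0)   # average of the other subjects
    rho, _ = spearmanr(ratings[i], others)
    agreements.append(rho)
upper_bound = float(np.mean(agreements))
print(f"inter-subject agreement (upper bound) = {upper_bound:.3f}")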
Results
Table 2: Model means for High and Low similarity items and correlation coefficients with human judgments (*: p < 0.05, **: p < 0.01)
human judgments is mentioned in 4 sentences in this paper.
Tsvetkov, Yulia and Boytsov, Leonid and Gershman, Anatole and Nyberg, Eric and Dyer, Chris
Experiments
They are collected by the same human judges and belong to the same domain.
Experiments
The pairs were presented to five human judges who rated each pair on a scale from 1 (very literal/denotative) to 4 (very non-literal/connotative).
Experiments
Table 4: Comparing AN metaphor detection method to the baselines: accuracy of the 10-fold cross validation on annotations of five human judges .
human judgments is mentioned in 3 sentences in this paper.
Chen, Zhiyuan and Mukherjee, Arjun and Liu, Bing
Experiments
However, perplexity on the held-out test set does not reflect the semantic coherence of topics and may be contrary to human judgments (Chang et al., 2009).
Experiments
As our objective is to discover more coherent aspects, we recruited two human judges .
Introduction
However, researchers have shown that fully unsupervised models often produce incoherent topics because the objective functions of topic models do not always correlate well with human judgments (Chang et al., 2009).
human judgments is mentioned in 3 sentences in this paper.
Veale, Tony and Li, Guofu
Empirical Evaluation
We evaluate Rex by estimating how closely its judgments correlate with those of human judges on the 30-pair word set of Miller & Charles (M&C), who aggregated the judgments of multiple human raters into mean ratings for these pairs.
Related Work and Ideas
Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures.
Related Work and Ideas
Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set.
human judgments is mentioned in 3 sentences in this paper.
Narisawa, Katsuma and Watanabe, Yotaro and Mizuno, Junta and Okazaki, Naoaki and Inui, Kentaro
Related work
We utilize large and small modifiers (described in Section 4.1), which correspond to the textual clues mo (as many as, as large as) and shika (only, as few as), respectively, for detecting humans’ judgments.
Related work
We asked three human judges to annotate every numerical expression with one of six labels, small, relatively small, normal, relatively large, large, and unsure.
Related work
The cause of this error is exemplified by the sentence, “there are two reasons.” Human judges label normal to the numerical expression two reasons, but the method predicts small.
human judgments is mentioned in 3 sentences in this paper.
Mukherjee, Arjun and Liu, Bing
Empirical Evaluation
The evaluation of this task requires human judges to read all the posts where the two users forming the pair have interacted.
Empirical Evaluation
Two human judges were asked to independently read all the post interactions of 500 pairs and label each pair as overall “disagreeing” or overall “agreeing” or “none”.
Phrase Ranking based on Relevance
For this and subsequent human judgment tasks, we use two judges (graduate students well versed in English).
human judgments is mentioned in 3 sentences in this paper.
Diao, Qiming and Jiang, Jing and Zhu, Feida and Lim, Ee-Peng
Experiments
We merged these topics and asked two human judges to judge their quality by assigning a score of either 0 or 1. The judges are graduate students living in Singapore and not involved in this project.
Experiments
based on the human judge’s understanding.
Experiments
For ground truth, we consider a bursty topic to be correct if both human judges have scored it 1.
human judgments is mentioned in 3 sentences in this paper.
Danescu-Niculescu-Mizil, Cristian and Cheng, Justin and Kleinberg, Jon and Lee, Lillian
Hello. My name is Inigo Montoya.
None of these observations, however, serve as definitions, and indeed, we believe it desirable to not pre-commit to an abstract definition, but rather to adopt an operational formulation based on external human judgments .
Hello. My name is Inigo Montoya.
In designing our study, we focus on a domain in which (i) there is rich use of language, some of which has achieved deep cultural penetration; (ii) there already exist a large number of external human judgments — perhaps implicit, but in a form we can extract; and (iii) we can control for the setting in which the text was used.
Never send a human to do a machine’s job.
Thus, the main conclusion from these prediction tasks is that abstracting notions such as distinctiveness and generality can produce relatively streamlined models that outperform much heavier-weight bag-of-words models, and can suggest steps toward approaching the performance of human judges who — very much unlike our system — have the full cultural context in which movies occur at their disposal.
human judgments is mentioned in 3 sentences in this paper.
Mayfield, Elijah and Penstein Rosé, Carolyn
Abstract
We show that this constrained model’s analyses of speaker authority correlate very strongly with expert human judgments (r² coefficient of 0.947).
Background
In general, however, we now have an automated model that is reliable in reproducing human judgments of authoritativeness.
Introduction
In section 5, this model is evaluated on a subset of the MapTask corpus (Anderson et al., 1991) and shows a high correlation with human judgements of authoritativeness (r² = 0.947).
human judgments is mentioned in 3 sentences in this paper.
Thater, Stefan and Fürstenau, Hagen and Pinkal, Manfred
Experiment: Ranking Word Senses
Based on agreement between human judges, Erk and McCarthy (2009) estimate an upper bound ρ of 0.544 for the dataset.
Experiment: Ranking Word Senses
The first column shows the correlation of our model’s predictions with the human judgments from the gold-standard, averaged over all instances.
Experiment: Ranking Word Senses
Table 4: Correlation of model predictions and human judgments
human judgments is mentioned in 3 sentences in this paper.
Foster, Mary Ellen and Giuliani, Manuel and Knoll, Alois
Introduction
When employing any such metric, it is crucial to verify that the predictions of the automated evaluation process agree with human judgements of the important aspects of the system output.
Introduction
counter-examples to the claim that BLEU agrees with human judgements .
Introduction
Also, Foster (2008) examined a range of automated metrics for evaluating generated multimodal output and found that few agreed with the preferences expressed by human judges.
human judgments is mentioned in 3 sentences in this paper.