Index of papers in Proc. ACL 2014 that mention
  • human judgements
Elliott, Desmond and Keller, Frank
Abstract
The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements.
Abstract
The focus of this paper is to determine the correlation of automatic measures with human judgements for this task.
Abstract
We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets.
Introduction
In this paper we estimate the correlation of human judgements with five automatic evaluation measures on two image description data sets.
Introduction
…correlated against human judgements, ROUGE-SU4 and Smoothed BLEU are moderately correlated, and the strongest correlation is found with Meteor.
Methodology
We estimate Spearman’s ρ for five different automatic evaluation measures against human judgements for the automatic image description task.
Methodology
The automatic measures are calculated on the sentence level and correlated against human judgements of semantic correctness.
Methodology
The images were retrieved from Flickr, the reference descriptions were collected from Mechanical Turk, and the human judgements were collected from expert annotators as follows: each image in the test data was paired with the highest scoring sentence(s) retrieved from all possible test sentences by the TRI5SEM model in Hodosh et al.
human judgements is mentioned in 30 sentences in this paper.
Topics mentioned in this paper:
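The Elliott & Keller excerpts above describe correlating sentence-level scores from automatic measures against human judgements of semantic correctness using Spearman’s ρ. A minimal sketch of that kind of calculation follows; the metric scores and judgements are invented for illustration and are not data from the paper.

    # Illustrative only: hypothetical sentence-level scores for one automatic
    # measure (e.g. Meteor) and hypothetical human judgements of semantic
    # correctness for the same sentences.
    from scipy.stats import spearmanr

    metric_scores = [0.42, 0.13, 0.77, 0.55, 0.30, 0.68]
    human_judgements = [3, 1, 5, 4, 2, 4]

    rho, p_value = spearmanr(metric_scores, human_judgements)
    print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")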
Guzmán, Francisco and Joty, Shafiq and Màrquez, Lluís and Nakov, Preslav
Experimental Results
Spearman’s correlation with human judgments.
Experimental Results
Overall, we observe an average improvement of +.024 in the correlation with the human judgments.
Experimental Results
Kendall’s Tau with human judgments.
Experimental Setup
We measured the correlation of the metrics with the human judgments provided by the organizers.
Experimental Setup
4.2 Human Judgements and Learning
Experimental Setup
As in the WMT12 experimental setup, we use these rankings to calculate correlation with human judgments at the sentence-level, i.e.
Related Work
Here we suggest some simple ways to create such metrics, and we also show that they yield better correlation with human judgments .
Related Work
However, they could not improve correlation with human judgments , as evaluated on the MetricsMATR dataset.
Related Work
Compared to the previous work, (i) we use a different discourse representation (RST), (ii) we compare discourse parses using all-subtree kernels (Collins and Duffy, 2001), (iii) we evaluate on much larger datasets, for several language pairs and for multiple metrics, and (iv) we do demonstrate better correlation with human judgments .
human judgements is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
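The Guzmán et al. excerpts above measure Spearman’s correlation and Kendall’s Tau between evaluation metrics and the human rankings provided by the WMT organizers, at the sentence level. Below is a minimal sketch of a WMT-style sentence-level Kendall’s τ over pairwise system comparisons; tie handling varies across WMT editions, so the convention used here (skip human ties, count metric ties as discordant) is one common choice rather than necessarily the paper’s, and the rankings are invented.

    from itertools import combinations

    def sentence_level_tau(human_ranks, metric_scores):
        """human_ranks: lower is better; metric_scores: higher is better."""
        concordant = discordant = 0
        for i, j in combinations(range(len(human_ranks)), 2):
            if human_ranks[i] == human_ranks[j]:
                continue              # skip pairs tied in the human ranking
            if metric_scores[i] == metric_scores[j]:
                discordant += 1       # one common convention: metric ties count as discordant
                continue
            human_prefers_i = human_ranks[i] < human_ranks[j]
            metric_prefers_i = metric_scores[i] > metric_scores[j]
            if human_prefers_i == metric_prefers_i:
                concordant += 1
            else:
                discordant += 1
        return (concordant - discordant) / (concordant + discordant)

    # Hypothetical rankings of five system outputs for a single source sentence
    print(sentence_level_tau([1, 2, 2, 4, 5], [0.61, 0.55, 0.58, 0.30, 0.40]))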
Duma, Daniel and Klein, Ewan
Abstract
Exploiting the human judgements that are already implicit in available resources, we avoid purpose-specific annotation.
Introduction
A main problem we face is that evaluating the performance of these systems ultimately requires human judgement .
Introduction
Fortunately there is already an abundance of data that meets our requirements: every scientific paper contains human “judgements” in the form of citations to other papers which are contextually appropriate: that is, relevant to specific passages of the document and aligned with its argumentative structure.
Introduction
Citation Resolution is a method for evaluating CBCR systems that is exclusively based on this source of human judgements.
Related work
Third, as we outlined above, existing citations between papers can be exploited as a source of human judgements .
The task: Citation Resolution
The core criterion of this task is to use only the human judgements that we have clearest evidence for.
human judgements is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
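The Duma & Klein excerpts above evaluate context-based citation recommendation by treating the paper an author actually cited at a given passage as an implicit human judgement. A heavily hedged sketch of that evaluation idea follows; the retrieve() function, the data layout, and the top-1 accuracy measure are assumptions for illustration, not the paper’s exact protocol.

    # Illustrative only: retrieve() stands for a hypothetical CBCR system that
    # maps a citation context (the passage around a citation placeholder) to a
    # ranked list of candidate paper IDs from the document's reference list.
    def resolution_accuracy(contexts, cited_ids, retrieve):
        correct = 0
        for context, gold_id in zip(contexts, cited_ids):
            ranked = retrieve(context)           # best-first list of candidate IDs
            if ranked and ranked[0] == gold_id:  # the originally cited paper was ranked first
                correct += 1
        return correct / len(cited_ids)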
Lo, Chi-kiu and Beloucif, Meriem and Saers, Markus and Wu, Dekai
Introduction
We show that XMEANT, a new cross-lingual version of MEANT (Lo et al., 2012), correlates with human judgment even more closely than MEANT for evaluating MT adequacy via semantic frames, despite discarding the need for expensive human reference translations.
Related Work
In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy.
Related Work
ULC (Giménez and Màrquez, 2007, 2008) incorporates several semantic features and shows improved correlation with human judgement on translation quality (Callison-Burch et al., 2007, 2008), but no work has been done towards tuning an SMT system using a pure form of ULC, perhaps due to its expensive run time.
Related Work
For UMEANT (Lo and Wu, 2012), they are estimated in an unsupervised manner using the relative frequency of each semantic role label in the references, and thus UMEANT is useful when human judgments on adequacy of the development set are unavailable.
human judgements is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Kang, Jun Seok and Feng, Song and Akoglu, Leman and Choi, Yejin
Evaluation III: Sentiment Analysis using ConnotationWordNet
Note that there is a difference between how humans judge the orientation and the degree of connotation for a given word out of context, and how the use of such words in context can be perceived as good/bad news.
Evaluation II: Human Evaluation on ConnotationWordNet
The agreement between the new lexicon and human judges varies between 84% and 86.98%.
Evaluation II: Human Evaluation on ConnotationWordNet
…(2005a)) show a low agreement rate with humans, which is somewhat as expected: human judges in this study are labeling for subtle connotation, not for more explicit sentiment.
Evaluation II: Human Evaluation on ConnotationWordNet
Because different human judges have different notions of scale, however, subtle differences are more likely to be noisy.
human judgements is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
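The Kang et al. excerpts above report plain agreement rates (84% to 86.98%) between the connotation lexicon and human judges. A minimal sketch of that agreement computation is below; the labels are invented for illustration.

    def agreement_rate(lexicon_labels, judge_labels):
        """Fraction of words on which the lexicon and one human judge agree."""
        matches = sum(l == j for l, j in zip(lexicon_labels, judge_labels))
        return matches / len(judge_labels)

    # Invented polarity labels for five words
    lexicon = ["positive", "negative", "positive", "neutral", "negative"]
    judge   = ["positive", "negative", "positive", "positive", "negative"]
    print(f"agreement = {agreement_rate(lexicon, judge):.2%}")   # 80.00%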
Chen, Zhiyuan and Mukherjee, Arjun and Liu, Bing
Experiments
However, perplexity on the held-out test set does not reflect the semantic coherence of topics and may be contrary to human judgments (Chang et al., 2009).
Experiments
As our objective is to discover more coherent aspects, we recruited two human judges.
Introduction
However, researchers have shown that fully unsupervised models often produce incoherent topics because the objective functions of topic models do not always correlate well with human judgments (Chang et al., 2009).
human judgements is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Tsvetkov, Yulia and Boytsov, Leonid and Gershman, Anatole and Nyberg, Eric and Dyer, Chris
Experiments
They are collected by the same human judges and belong to the same domain.
Experiments
The pairs were presented to five human judges who rated each pair on a scale from 1 (very literal/denotative) to 4 (very non-literal/connotative).
Experiments
Table 4: Comparing the AN metaphor detection method to the baselines: accuracy of 10-fold cross-validation on annotations of five human judges.
human judgements is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
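The Tsvetkov et al. excerpts above evaluate metaphor detection by 10-fold cross-validation against annotations derived from five human judges’ ratings on a 1 to 4 literal/non-literal scale. Below is a minimal sketch of that style of evaluation with invented data; the features, the 2.5 threshold for binarizing averaged ratings, and the choice of classifier are all assumptions, not the paper’s setup.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))               # hypothetical features for 200 AN pairs
    mean_ratings = rng.uniform(1, 4, size=200)  # invented averaged 1-4 judge ratings
    y = (mean_ratings >= 2.5).astype(int)       # assume >= 2.5 counts as non-literal

    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"10-fold CV accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")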