Index of papers in Proc. ACL 2014 that mention
  • human judgements
Elliott, Desmond and Keller, Frank
Abstract
The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements.
Abstract
The focus of this paper is to determine the correlation of automatic measures with human judgements for this task.
Abstract
We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets.
Introduction
In this paper we estimate the correlation of human judgements with five automatic evaluation measures on two image description data sets.
Introduction
…correlated against human judgements, ROUGE-SU4 and Smoothed BLEU are moderately correlated, and the strongest correlation is found with Meteor.
Methodology
We estimate Spearman’s ρ for five different automatic evaluation measures against human judgements for the automatic image description task.
Methodology
The automatic measures are calculated on the sentence level and correlated against human judgements of semantic correctness.
Methodology
The images were retrieved from Flickr, the reference descriptions were collected from Mechanical Turk, and the human judgements were collected from expert annotators as follows: each image in the test data was paired with the highest scoring sentence(s) retrieved from all possible test sentences by the TRI5SEM model in Hodosh et al.
human judgements is mentioned in 30 sentences in this paper.
Topics mentioned in this paper:
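The Elliott & Keller excerpts above describe correlating sentence-level scores from automatic measures against human judgements of semantic correctness using Spearman’s ρ. A minimal sketch of that kind of calculation follows; the metric scores and judgements are invented for illustration and are not data from the paper.

    # Illustrative only: hypothetical sentence-level scores for one automatic
    # measure (e.g. Meteor) and hypothetical human judgements of semantic
    # correctness for the same sentences.
    from scipy.stats import spearmanr

    metric_scores = [0.42, 0.13, 0.77, 0.55, 0.30, 0.68]
    human_judgements = [3, 1, 5, 4, 2, 4]

    rho, p_value = spearmanr(metric_scores, human_judgements)
    print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")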
Guzmán, Francisco and Joty, Shafiq and Màrquez, Lluís and Nakov, Preslav
Experimental Results
Spearman’s correlation with human judgments.
Experimental Results
Overall, we observe an average improvement of +.024 in the correlation with the human judgments.
Experimental Results
Kendall’s Tau with human judgments.
Experimental Setup
We measured the correlation of the metrics with the human judgments provided by the organizers.
Experimental Setup
4.2 Human Judgements and Learning
Experimental Setup
As in the WMT12 experimental setup, we use these rankings to calculate correlation with human judgments at the sentence-level, i.e.
Related Work
Here we suggest some simple ways to create such metrics, and we also show that they yield better correlation with human judgments .
Related Work
However, they could not improve correlation with human judgments , as evaluated on the MetricsMATR dataset.
Related Work
Compared to the previous work, (i) we use a different discourse representation (RST), (ii) we compare discourse parses using all-subtree kernels (Collins and Duffy, 2001), (iii) we evaluate on much larger datasets, for several language pairs and for multiple metrics, and (iv) we do demonstrate better correlation with human judgments .
human judgements is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
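The Guzmán et al. excerpts above measure Spearman’s correlation and Kendall’s Tau between evaluation metrics and the human rankings provided by the WMT organizers, at the sentence level. Below is a minimal sketch of a WMT-style sentence-level Kendall’s τ over pairwise system comparisons; tie handling varies across WMT editions, so the convention used here (skip human ties, count metric ties as discordant) is one common choice rather than necessarily the paper’s, and the rankings are invented.

    from itertools import combinations

    def sentence_level_tau(human_ranks, metric_scores):
        """human_ranks: lower is better; metric_scores: higher is better."""
        concordant = discordant = 0
        for i, j in combinations(range(len(human_ranks)), 2):
            if human_ranks[i] == human_ranks[j]:
                continue              # skip pairs tied in the human ranking
            if metric_scores[i] == metric_scores[j]:
                discordant += 1       # one common convention: metric ties count as discordant
                continue
            human_prefers_i = human_ranks[i] < human_ranks[j]
            metric_prefers_i = metric_scores[i] > metric_scores[j]
            if human_prefers_i == metric_prefers_i:
                concordant += 1
            else:
                discordant += 1
        return (concordant - discordant) / (concordant + discordant)

    # Hypothetical rankings of five system outputs for a single source sentence
    print(sentence_level_tau([1, 2, 2, 4, 5], [0.61, 0.55, 0.58, 0.30, 0.40]))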
Duma, Daniel and Klein, Ewan
Abstract
Exploiting the human judgements that are already implicit in available resources, we avoid purpose-specific annotation.
Introduction
A main problem we face is that evaluating the performance of these systems ultimately requires human judgement .
Introduction
Fortunately there is already an abundance of data that meets our requirements: every scientific paper contains human “judgements” in the form of citations to other papers which are contextually appropriate: that is, relevant to specific passages of the document and aligned with its argumentative structure.
Introduction
Citation Resolution is a method for evaluating CBCR systems that is exclusively based on this source of human judgements.
Related work
Third, as we outlined above, existing citations between papers can be exploited as a source of human judgements .
The task: Citation Resolution
The core criterion of this task is to use only the human judgements that we have clearest evidence for.
human judgements is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
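The Duma & Klein excerpts above evaluate context-based citation recommendation by treating the paper an author actually cited at a given passage as an implicit human judgement. A heavily hedged sketch of that evaluation idea follows; the retrieve() function, the data layout, and the top-1 accuracy measure are assumptions for illustration, not the paper’s exact protocol.

    # Illustrative only: retrieve() stands for a hypothetical CBCR system that
    # maps a citation context (the passage around a citation placeholder) to a
    # ranked list of candidate paper IDs from the document's reference list.
    def resolution_accuracy(contexts, cited_ids, retrieve):
        correct = 0
        for context, gold_id in zip(contexts, cited_ids):
            ranked = retrieve(context)           # best-first list of candidate IDs
            if ranked and ranked[0] == gold_id:  # the originally cited paper was ranked first
                correct += 1
        return correct / len(cited_ids)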
Lo, Chi-kiu and Beloucif, Meriem and Saers, Markus and Wu, Dekai
Introduction
We show that XMEANT, a new cross-lingual version of MEANT (Lo et al., 2012), correlates with human judgment even more closely than MEANT for evaluating MT adequacy via semantic frames, despite discarding the need for expensive human reference translations.
Related Work
In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy.
Related Work
ULC (Giménez and Màrquez, 2007, 2008) incorporates several semantic features and shows improved correlation with human judgement on translation quality (Callison-Burch et al., 2007, 2008), but no work has been done towards tuning an SMT system using a pure form of ULC, perhaps due to its expensive run time.
Related Work
For UMEANT (Lo and Wu, 2012), they are estimated in an unsupervised manner using the relative frequency of each semantic role label in the references, and thus UMEANT is useful when human judgments on adequacy of the development set are unavailable.
human judgements is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Kang, Jun Seok and Feng, Song and Akoglu, Leman and Choi, Yejin
Evaluation III: Sentiment Analysis using ConnotationWordNet
Note that there is a difference between how humans judge the orientation and the degree of connotation for a given word out of context, and how the use of such words in context can be perceived as good/bad news.
Evaluation II: Human Evaluation on ConnotationWordNet
The agreement between the new lexicon and human judges varies between 84% and 86.98%.
Evaluation II: Human Evaluation on ConnotationWordNet
…(2005a)) show a low agreement rate with humans, which is somewhat as expected: human judges in this study are labeling for subtle connotation, not for more explicit sentiment.
Evaluation II: Human Evaluation on ConnotationWordNet
Because different human judges have different notions of scale, however, subtle differences are more likely to be noisy.
human judgements is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
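The Kang et al. excerpts above report plain agreement rates (84% to 86.98%) between the connotation lexicon and human judges. A minimal sketch of that agreement computation is below; the labels are invented for illustration.

    def agreement_rate(lexicon_labels, judge_labels):
        """Fraction of words on which the lexicon and one human judge agree."""
        matches = sum(l == j for l, j in zip(lexicon_labels, judge_labels))
        return matches / len(judge_labels)

    # Invented polarity labels for five words
    lexicon = ["positive", "negative", "positive", "neutral", "negative"]
    judge   = ["positive", "negative", "positive", "positive", "negative"]
    print(f"agreement = {agreement_rate(lexicon, judge):.2%}")   # 80.00%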
Chen, Zhiyuan and Mukherjee, Arjun and Liu, Bing
Experiments
However, perplexity on the held-out test set does not reflect the semantic coherence of topics and may be contrary to human judgments (Chang et al., 2009).
Experiments
As our objective is to discover more coherent aspects, we recruited two human judges.
Introduction
However, researchers have shown that fully unsupervised models often produce incoherent topics because the objective functions of topic models do not always correlate well with human judgments (Chang et al., 2009).
human judgements is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Tsvetkov, Yulia and Boytsov, Leonid and Gershman, Anatole and Nyberg, Eric and Dyer, Chris
Experiments
They are collected by the same human judges and belong to the same domain.
Experiments
The pairs were presented to five human judges who rated each pair on a scale from 1 (very literal/denotative) to 4 (very non-literal/connotative).
Experiments
Table 4: Comparing the AN metaphor detection method to the baselines: accuracy of 10-fold cross-validation on annotations of five human judges.
human judgements is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
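The Tsvetkov et al. excerpts above evaluate metaphor detection by 10-fold cross-validation against annotations derived from five human judges’ ratings on a 1 to 4 literal/non-literal scale. Below is a minimal sketch of that style of evaluation with invented data; the features, the 2.5 threshold for binarizing averaged ratings, and the choice of classifier are all assumptions, not the paper’s setup.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))               # hypothetical features for 200 AN pairs
    mean_ratings = rng.uniform(1, 4, size=200)  # invented averaged 1-4 judge ratings
    y = (mean_ratings >= 2.5).astype(int)       # assume >= 2.5 counts as non-literal

    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"10-fold CV accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")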