Index of papers in Proc. ACL 2012 that mention
  • human judgments
Chen, Boxing and Kuhn, Roland and Larkin, Samuel
Abstract
Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU.
Abstract
It has a better correlation with human judgment than BLEU.
Abstract
PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).
Experiments
We used Spearman’s rank correlation coefficient ρ to measure correlation of the metric with system-level human judgments of translation.
Experiments
The human judgment score is based on the “Rank” only, i.e., how often the translations of the system were rated as better than those from other systems (Callison-Burch et al., 2008).
Experiments
Table 2: Correlations with human judgment on WMT
  BLEU    0.792  0.215  0.777  0.240
  METEOR  0.834  0.231  0.835  0.225
  PORT    0.801  0.236  0.804  0.242
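The system-level evaluation described in these excerpts is straightforward to reproduce in outline: compute Spearman’s rank correlation ρ between a metric’s per-system scores and the rank-based human judgment scores. The sketch below (Python, using SciPy) uses invented scores purely for illustration; it is not the authors’ code and the numbers are not taken from the paper.

    # Spearman's rank correlation between an automatic metric's per-system
    # scores and human rank-based judgment scores (fraction of pairwise wins).
    # All numbers are illustrative placeholders, not values from the paper.
    from scipy.stats import spearmanr

    metric_scores = [0.231, 0.287, 0.264, 0.305, 0.249]  # e.g. BLEU per system
    human_scores = [0.41, 0.55, 0.48, 0.60, 0.44]         # per-system "Rank" score

    rho, p_value = spearmanr(metric_scores, human_scores)
    print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")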
Introduction
Many of the metrics correlate better with human judgments of translation quality than BLEU, as shown in recent WMT Evaluation Task reports (Callison-Burch et al.).
Introduction
Second, though a tuning metric should correlate strongly with human judgment, MERT (and similar algorithms) invoke the chosen metric so often that it must be computed quickly.
Introduction
Liu et al. (2011) claimed that TESLA tuning performed better than BLEU tuning according to human judgment.
“human judgments” is mentioned in 12 sentences in this paper.
Huang, Eric and Socher, Richard and Manning, Christopher and Ng, Andrew
Abstract
We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models.
Conclusion
We introduced a new dataset with human judgments on similarity between pairs of words in context, so as to evaluate a model’s abilities to capture homonymy and polysemy of words in context.
Experiments
Our model also improves the correlation with human judgments on a word similarity task.
Experiments
important, we introduce a new dataset with human judgments on similarity of pairs of words in sentential context.
Experiments
Each pair is presented without context and associated with 13 to 16 human judgments on similarity and relatedness on a scale from 0 to 10.
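Datasets of this form are typically scored by averaging the multiple human ratings for each pair and then correlating those averages with a model’s similarity scores. A minimal sketch of that procedure, with invented word pairs, ratings, and model scores (not the authors’ data or code):

    import numpy as np
    from scipy.stats import spearmanr

    # Each word pair has several human judgments on a 0-10 scale
    # (invented values for illustration).
    human_ratings = {
        ("bank", "money"): [8, 9, 7, 8, 9, 8],
        ("bank", "river"): [3, 4, 2, 5, 3, 4],
        ("plant", "factory"): [7, 6, 8, 7, 7, 6],
        ("plant", "tree"): [6, 7, 5, 6, 7, 6],
    }

    # Hypothetical model similarity scores (e.g. cosine similarity of embeddings).
    model_scores = {
        ("bank", "money"): 0.72,
        ("bank", "river"): 0.35,
        ("plant", "factory"): 0.61,
        ("plant", "tree"): 0.58,
    }

    pairs = list(human_ratings)
    gold = [np.mean(human_ratings[p]) for p in pairs]  # mean human rating per pair
    pred = [model_scores[p] for p in pairs]
    rho, _ = spearmanr(gold, pred)
    print(f"Spearman rho against mean human ratings: {rho:.3f}")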
Introduction
However, one limitation of this evaluation is that the human judgments are on pairs
Introduction
Since word interpretation in context is important especially for homonymous and polysemous words, we introduce a new dataset with human judgments on similarity between pairs of words in sentential context.
“human judgments” is mentioned in 11 sentences in this paper.
Liu, Chang and Ng, Hwee Tou
Discussion and Future Work
This is probably due to the linguistic characteristics of Chinese, where human judges apparently give equal importance to function words and content words.
Experiments
The correlations of character-level BLEU and the average human judgments are shown in the first row of Tables 2 and 3 for the IWSLT and the NIST data set, respectively.
Experiments
The correlations between the TESLA-CELAB scores and human judgments are shown in the last row of Tables 2 and 3.
Introduction
In the WMT shared tasks, many new generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments.
Introduction
Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human-judged translation quality.
Introduction
The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 − TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
“human judgments” is mentioned in 6 sentences in this paper.
Danescu-Niculescu-Mizil, Cristian and Cheng, Justin and Kleinberg, Jon and Lee, Lillian
Hello. My name is Inigo Montoya.
None of these observations, however, serve as definitions, and indeed, we believe it desirable to not pre-commit to an abstract definition, but rather to adopt an operational formulation based on external human judgments.
Hello. My name is Inigo Montoya.
In designing our study, we focus on a domain in which (i) there is rich use of language, some of which has achieved deep cultural penetration; (ii) there already exist a large number of external human judgments — perhaps implicit, but in a form we can extract; and (iii) we can control for the setting in which the text was used.
Never send a human to do a machine’s job.
Thus, the main conclusion from these prediction tasks is that abstracting notions such as distinctiveness and generality can produce relatively streamlined models that outperform much heavier-weight bag-of-words models, and can suggest steps toward approaching the performance of human judges who — very much unlike our system — have the full cultural context in which movies occur at their disposal.
“human judgments” is mentioned in 3 sentences in this paper.
Diao, Qiming and Jiang, Jing and Zhu, Feida and Lim, Ee-Peng
Experiments
We merged these topics and asked two human judges to judge their quality by assigning a score of either 0 or 1. The judges are graduate students living in Singapore and not involved in this project.
Experiments
based on the human judge’s understanding.
Experiments
For ground truth, we consider a bursty topic to be correct if both human judges have scored it 1.
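The ground-truth rule quoted above (a detected bursty topic is correct only if both judges scored it 1) amounts to taking the conjunction of the two judges’ binary scores. A tiny sketch with invented judge scores, not the authors’ data:

    # Binary quality scores from two human judges for five detected topics
    # (invented for illustration).
    judge_1 = [1, 1, 0, 1, 0]
    judge_2 = [1, 0, 0, 1, 1]

    correct = sum(a == 1 and b == 1 for a, b in zip(judge_1, judge_2))
    precision = correct / len(judge_1)
    print(f"{correct} of {len(judge_1)} topics judged correct "
          f"(precision = {precision:.2f})")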
“human judgments” is mentioned in 3 sentences in this paper.