Index of papers in Proc. ACL 2012 that mention
  • human judgments
Chen, Boxing and Kuhn, Roland and Larkin, Samuel
Abstract
Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU.
Abstract
It has a better correlation with human judgment than BLEU.
Abstract
PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).
Experiments
We used Spearman’s rank correlation coefficient ρ to measure correlation of the metric with system-level human judgments of translation.
Experiments
The human judgment score is based on the “Rank” only, i.e., how often the translations of the system were rated as better than those from other systems (Callison-Burch et al., 2008).
Experiments
Table 2: Correlations with human judgment on WMT
  BLEU    0.792  0.215  0.777  0.240
  METEOR  0.834  0.231  0.835  0.225
  PORT    0.801  0.236  0.804  0.242
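The system-level evaluation described in these excerpts is straightforward to reproduce in outline: compute Spearman’s rank correlation ρ between a metric’s per-system scores and the rank-based human judgment scores. The sketch below (Python, using SciPy) uses invented scores purely for illustration; it is not the authors’ code and the numbers are not taken from the paper.

    # Spearman's rank correlation between an automatic metric's per-system
    # scores and human rank-based judgment scores (fraction of pairwise wins).
    # All numbers are illustrative placeholders, not values from the paper.
    from scipy.stats import spearmanr

    metric_scores = [0.231, 0.287, 0.264, 0.305, 0.249]  # e.g. BLEU per system
    human_scores = [0.41, 0.55, 0.48, 0.60, 0.44]         # per-system "Rank" score

    rho, p_value = spearmanr(metric_scores, human_scores)
    print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")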
Introduction
Many of the metrics correlate better with human judgments of translation quality than BLEU, as shown in recent WMT Evaluation Task reports (Callison-Burch et al.).
Introduction
Second, though a tuning metric should correlate strongly with human judgment, MERT (and similar algorithms) invoke the chosen metric so often that it must be computed quickly.
Introduction
Liu et al. (2011) claimed that TESLA tuning performed better than BLEU tuning according to human judgment.
“human judgments” is mentioned in 12 sentences in this paper.
Huang, Eric and Socher, Richard and Manning, Christopher and Ng, Andrew
Abstract
We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models.
Conclusion
We introduced a new dataset with human judgments on similarity between pairs of words in context, so as to evaluate a model’s abilities to capture homonymy and polysemy of words in context.
Experiments
Our model also improves the correlation with human judgments on a word similarity task.
Experiments
important, we introduce a new dataset with human judgments on similarity of pairs of words in sentential context.
Experiments
Each pair is presented without context and associated with 13 to 16 human judgments on similarity and relatedness on a scale from 0 to 10.
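Datasets of this form are typically scored by averaging the multiple human ratings for each pair and then correlating those averages with a model’s similarity scores. A minimal sketch of that procedure, with invented word pairs, ratings, and model scores (not the authors’ data or code):

    import numpy as np
    from scipy.stats import spearmanr

    # Each word pair has several human judgments on a 0-10 scale
    # (invented values for illustration).
    human_ratings = {
        ("bank", "money"): [8, 9, 7, 8, 9, 8],
        ("bank", "river"): [3, 4, 2, 5, 3, 4],
        ("plant", "factory"): [7, 6, 8, 7, 7, 6],
        ("plant", "tree"): [6, 7, 5, 6, 7, 6],
    }

    # Hypothetical model similarity scores (e.g. cosine similarity of embeddings).
    model_scores = {
        ("bank", "money"): 0.72,
        ("bank", "river"): 0.35,
        ("plant", "factory"): 0.61,
        ("plant", "tree"): 0.58,
    }

    pairs = list(human_ratings)
    gold = [np.mean(human_ratings[p]) for p in pairs]  # mean human rating per pair
    pred = [model_scores[p] for p in pairs]
    rho, _ = spearmanr(gold, pred)
    print(f"Spearman rho against mean human ratings: {rho:.3f}")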
Introduction
However, one limitation of this evaluation is that the human judgments are on pairs
Introduction
Since word interpretation in context is important especially for homonymous and polysemous words, we introduce a new dataset with human judgments on similarity between pairs of words in sentential context.
“human judgments” is mentioned in 11 sentences in this paper.
Liu, Chang and Ng, Hwee Tou
Discussion and Future Work
This is probably due to the linguistic characteristics of Chinese, where human judges apparently give equal importance to function words and content words.
Experiments
The correlations of character-level BLEU and the average human judgments are shown in the first row of Tables 2 and 3 for the IWSLT and the NIST data set, respectively.
Experiments
The correlations between the TESLA-CELAB scores and human judgments are shown in the last row of Tables 2 and 3.
Introduction
In the WMT shared tasks, many new generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments.
Introduction
Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human-judged translation quality.
Introduction
The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 − TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
“human judgments” is mentioned in 6 sentences in this paper.
Danescu-Niculescu-Mizil, Cristian and Cheng, Justin and Kleinberg, Jon and Lee, Lillian
Hello. My name is Inigo Montoya.
None of these observations, however, serve as definitions, and indeed, we believe it desirable to not pre-commit to an abstract definition, but rather to adopt an operational formulation based on external human judgments.
Hello. My name is Inigo Montoya.
In designing our study, we focus on a domain in which (i) there is rich use of language, some of which has achieved deep cultural penetration; (ii) there already exist a large number of external human judgments — perhaps implicit, but in a form we can extract; and (iii) we can control for the setting in which the text was used.
Never send a human to do a machine’s job.
Thus, the main conclusion from these prediction tasks is that abstracting notions such as distinctiveness and generality can produce relatively streamlined models that outperform much heavier-weight bag-of-words models, and can suggest steps toward approaching the performance of human judges who — very much unlike our system — have the full cultural context in which movies occur at their disposal.
“human judgments” is mentioned in 3 sentences in this paper.
Diao, Qiming and Jiang, Jing and Zhu, Feida and Lim, Ee-Peng
Experiments
We merged these topics and asked two human judges to judge their quality by assigning a score of either 0 or 1. The judges are graduate students living in Singapore and not involved in this project.
Experiments
based on the human judge’s understanding.
Experiments
For ground truth, we consider a bursty topic to be correct if both human judges have scored it 1.
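The ground-truth rule quoted above (a detected bursty topic is correct only if both judges scored it 1) amounts to taking the conjunction of the two judges’ binary scores. A tiny sketch with invented judge scores, not the authors’ data:

    # Binary quality scores from two human judges for five detected topics
    # (invented for illustration).
    judge_1 = [1, 1, 0, 1, 0]
    judge_2 = [1, 0, 0, 1, 1]

    correct = sum(a == 1 and b == 1 for a, b in zip(judge_1, judge_2))
    precision = correct / len(judge_1)
    print(f"{correct} of {len(judge_1)} topics judged correct "
          f"(precision = {precision:.2f})")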
“human judgments” is mentioned in 3 sentences in this paper.