Index of papers in Proc. ACL 2008 that mention
  • human judgments
Ai, Hua and Litman, Diane J.
Abstract
In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use human judgments as the gold standard to validate the conclusions drawn from the automatic measures.
Abstract
We observe that it is hard for the human judges to reach good agreement when asked to rate the quality of the dialogs from given perspectives.
Abstract
When building prediction models of human judgments using previously proposed automatic measures, we find that we cannot reliably predict human ratings using a regression model, but we can predict human rankings by a ranking model.
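To make the regression-versus-ranking distinction concrete, the following minimal sketch contrasts the two setups on invented automatic measures and ratings; it is only an illustration of the general idea, not the models or features used in the paper.

```python
# Minimal sketch (invented data): predicting human ratings with a
# least-squares regression vs. predicting human rankings by ordering
# corpora by a predicted score. Not the paper's actual models/features.
import numpy as np

# Each row: automatic measures for one simulated corpus (made-up values).
X = np.array([[0.62, 0.15, 3.4],
              [0.71, 0.09, 2.8],
              [0.55, 0.22, 4.1],
              [0.66, 0.12, 3.1]])
human_ratings = np.array([3.2, 4.0, 2.5, 3.6])  # invented 1-5 quality ratings

# Regression model: ordinary least squares fit of ratings on the measures.
A = np.hstack([X, np.ones((len(X), 1))])        # add an intercept column
coeffs, *_ = np.linalg.lstsq(A, human_ratings, rcond=None)
predicted_ratings = A @ coeffs

# Ranking view: compare the ordering induced by the predictions with the
# ordering induced by the human ratings.
predicted_order = np.argsort(-predicted_ratings)
human_order = np.argsort(-human_ratings)
print("predicted order:", predicted_order, "human order:", human_order)
```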
Introduction
However, our approach uses human judgments as the gold standard.
Introduction
Although to date there are few studies that use human judges to directly assess the quality of user simulation, we believe that this is a reliable approach to assess the simulated corpora as well as an important step towards developing a comprehensive set of user simulation evaluation measures.
Introduction
First, we can estimate the difficulty of the task of distinguishing real and simulated corpora by knowing how hard it is for human judges to reach an agreement.
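Inter-judge agreement of this kind is commonly quantified with a chance-corrected statistic such as Cohen's kappa. The excerpt above does not specify the paper's agreement measure, so the sketch below is only a generic kappa computation over invented labels from two hypothetical judges.

```python
# Generic sketch of Cohen's kappa for two judges on invented binary labels
# ("real" vs. "sim"); not the agreement statistic reported in the paper.
from collections import Counter

judge_a = ["real", "sim", "sim", "real", "real", "sim", "real", "sim"]
judge_b = ["real", "sim", "real", "real", "sim", "sim", "real", "sim"]

n = len(judge_a)
observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n

# Chance agreement expected from each judge's marginal label distribution.
counts_a, counts_b = Counter(judge_a), Counter(judge_b)
expected = sum((counts_a[label] / n) * (counts_b[label] / n)
               for label in set(judge_a) | set(judge_b))

kappa = (observed - expected) / (1 - expected)
print(f"observed={observed:.2f}  expected={expected:.2f}  kappa={kappa:.2f}")
```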
“human judgments” is mentioned in 42 sentences in this paper.
Chan, Yee Seng and Ng, Hwee Tou
Abstract
When evaluated on data from the ACL-07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop.
Automatic Evaluation Metrics
In the ACL-07 MT workshop, ParaEval based on recall (ParaEval-recall) achieves good correlation with human judgements.
Introduction
Since human evaluation of MT output is time consuming and expensive, having a robust and accurate automatic MT evaluation metric that correlates well with human judgement is invaluable.
Introduction
Although BLEU has played a crucial role in the progress of MT research, it is becoming evident that BLEU does not correlate with human judgement
Introduction
During the recent ACL-07 workshop on statistical MT (Callison-Burch et al., 2007), a total of 11 automatic MT evaluation metrics were evaluated for correlation with human judgement.
Metric Design Considerations
The ACL-07 MT workshop evaluated the translation quality of MT systems on various translation tasks, and also measured the correlation (with human judgement) of 11 automatic MT evaluation metrics.
Metric Design Considerations
For human evaluation of the MT submissions, four different criteria were used in the workshop: Adequacy (how much of the original meaning is expressed in a system translation), Fluency (the translation’s fluency), Rank (different translations of a single source sentence are compared and ranked from best to worst), and Constituent (some constituents from the parse tree of the source sentence are translated, and human judges have to rank these translations).
Metric Design Considerations
For this dataset, human judgements are available on adequacy and fluency for six system submissions, and there are four English reference translation texts.
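Correlation with human judgement in such evaluations is typically reported as Pearson's r (or Spearman's rho) between per-system metric scores and averaged human scores. The sketch below illustrates that computation on invented numbers for six hypothetical systems; it does not reproduce the workshop's exact protocol.

```python
# Sketch of system-level correlation between an automatic MT metric and
# averaged human adequacy scores. All numbers are invented examples.
from scipy.stats import pearsonr, spearmanr

metric_scores  = [0.31, 0.28, 0.35, 0.22, 0.30, 0.26]  # one score per system
human_adequacy = [3.4, 3.1, 3.8, 2.6, 3.3, 2.9]        # averaged human ratings

r, _ = pearsonr(metric_scores, human_adequacy)
rho, _ = spearmanr(metric_scores, human_adequacy)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```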
“human judgments” is mentioned in 10 sentences in this paper.
Mairesse, François and Walker, Marilyn
Conclusion
Additionally, our data-driven approach can be applied to any dimension that is meaningful to human judges, and it provides an elegant way to project multiple dimensions simultaneously, by including the relevant dimensions as features of the parameter models’ training data.
Evaluation Experiment
We then evaluate the output utterances using naive human judges to rate their perceived personality and naturalness.
Evaluation Experiment
Table 5 shows several sample outputs and the mean personality ratings from the human judges.
Introduction
Another thread investigates SNLG scoring models trained using higher-level linguistic features to replicate human judgments of utterance quality (Rambow et al., 2001; Nakatsu and White, 2006; Stent and Guo, 2005).
Parameter Estimation Models
Collects human judgments rating the personality of each utterance;
“human judgments” is mentioned in 5 sentences in this paper.
Mitchell, Jeff and Lapata, Mirella
Abstract
Experimental results demonstrate that the multiplicative models are superior to the additive alternatives when compared against human judgments .
Evaluation Setup
The task involves examining the degree of linear relationship between the human judgments for two individual words and vector-based similarity values.
Evaluation Setup
We assume that the inter-subject agreement can serve as an upper bound for comparing the fit of our models against the human judgments.
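The additive-versus-multiplicative comparison against human judgments can be sketched as follows: compose two word vectors, score each item by cosine similarity against a comparison vector, and correlate the scores with human ratings. All vectors and ratings below are invented placeholders, not the paper's corpus-derived representations or its actual evaluation items.

```python
# Sketch: additive vs. multiplicative composition of two word vectors,
# scored by cosine similarity against a comparison vector and correlated
# with human similarity ratings. All data below are invented placeholders.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical items: (word-1 vector, word-2 vector, comparison vector,
# human similarity rating on a 1-7 scale).
items = [
    (np.array([0.8, 0.1, 0.3]), np.array([0.7, 0.2, 0.4]),
     np.array([0.7, 0.2, 0.5]), 6.1),
    (np.array([0.1, 0.9, 0.2]), np.array([0.2, 0.7, 0.8]),
     np.array([0.9, 0.1, 0.1]), 2.3),
    (np.array([0.3, 0.4, 0.9]), np.array([0.4, 0.3, 0.8]),
     np.array([0.4, 0.4, 0.8]), 5.4),
    (np.array([0.6, 0.5, 0.2]), np.array([0.5, 0.6, 0.3]),
     np.array([0.2, 0.2, 0.9]), 1.8),
]

add_sims  = [cosine(u + v, t) for u, v, t, _ in items]
mult_sims = [cosine(u * v, t) for u, v, t, _ in items]
human     = [h for *_, h in items]

rho_add, _  = spearmanr(add_sims, human)
rho_mult, _ = spearmanr(mult_sims, human)
print(f"additive rho = {rho_add:.2f}, multiplicative rho = {rho_mult:.2f}")
```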
Results
Table 2: Model means for High and Low similarity items and correlation coefficients with human judgments (*: p < 0.05, **: p < 0.01)
“human judgments” is mentioned in 4 sentences in this paper.