Abstract | In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use human judgments as the gold standard to validate the conclusions drawn from the automatic measures. |
Abstract | We observe that it is hard for the human judges to reach good agreement when asked to rate the quality of the dialogs from given perspectives. |
Abstract | When building prediction models of human judgments using previously proposed automatic measures, we find that we cannot reliably predict human ratings using a regression model, but we can predict human rankings with a ranking model. |
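For concreteness, the sketch below illustrates the two modelling choices contrasted here: fitting a regression model to absolute dialog ratings versus fitting a pairwise ranker to relative preferences. It is a minimal illustration only, not the paper's actual models; the feature matrix, ratings, and variable names are synthetic placeholders.

```python
# Minimal sketch (not the paper's models): predicting human quality judgments
# from automatic dialog measures, once as absolute ratings (regression) and
# once as relative rankings (pairwise ranker). All data are placeholders.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression, LogisticRegression
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))      # automatic measures for 60 simulated dialogs
y = rng.integers(1, 6, size=60)   # 1-5 human quality ratings (placeholder)

# Regression: predict the absolute rating directly.
reg = LinearRegression().fit(X, y)
r, _ = pearsonr(reg.predict(X), y)
print(f"regression fit, Pearson r = {r:.2f}")

# Ranking: reduce rated dialogs to ordered pairs and learn which of the two
# dialogs the judges preferred (a simple pairwise-ranking reduction).
pairs, labels = [], []
for i, j in combinations(range(len(y)), 2):
    if y[i] != y[j]:
        pairs.append(X[i] - X[j])
        labels.append(1 if y[i] > y[j] else 0)
ranker = LogisticRegression().fit(np.array(pairs), np.array(labels))
print(f"pairwise ranking accuracy = {ranker.score(np.array(pairs), np.array(labels)):.2f}")
```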
Introduction | However, our approach uses human judgments as the gold standard. |
Introduction | Although to date there are few studies that use human judges to directly assess the quality of user simulation, we believe that this is a reliable approach to assess the simulated corpora as well as an important step towards developing a comprehensive set of user simulation evaluation measures. |
Introduction | First, we can estimate the difficulty of the task of distinguishing real and simulated corpora by knowing how hard it is for human judges to reach an agreement. |
Abstract | When evaluated on data from the ACL-07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop. |
Automatic Evaluation Metrics | In the ACL-07 MT workshop, ParaEval based on recall (ParaEval-recall) achieves good correlation with human judgements.
Introduction | Since human evaluation of MT output is time consuming and expensive, having a robust and accurate automatic MT evaluation metric that correlates well with human judgement is invaluable. |
Introduction | Although BLEU has played a crucial role in the progress of MT research, it is becoming evident that BLEU does not correlate with human judgement. |
Introduction | During the recent ACL-07 workshop on statistical MT (Callison-Burch et al., 2007), a total of 11 automatic MT evaluation metrics were evaluated for correlation with human judgement.
Metric Design Considerations | The ACL-07 MT workshop evaluated the translation quality of MT systems on various translation tasks, and also measured the correlation (with human judgement) of 11 automatic MT evaluation metrics.
Metric Design Considerations | For human evaluation of the MT submissions, four different criteria were used in the workshop: Adequacy (how much of the original meaning is expressed in a system translation), Fluency (how fluent the translation is), Rank (different translations of a single source sentence are compared and ranked from best to worst), and Constituent (some constituents from the parse tree of the source sentence are translated, and human judges rank these translations).
Metric Design Considerations | For this dataset, human judgements are available on adequacy and fluency for six system submissions, and there are four English reference translation texts. |
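As a rough illustration of how such correlations are typically measured, the sketch below computes system-level Spearman and Pearson correlation between an automatic metric and mean human adequacy scores. The six systems and all scores are invented placeholders, not the workshop data.

```python
# Illustrative sketch only: system-level correlation between an automatic MT
# metric and human judgments, in the spirit of the ACL-07 methodology.
# The systems and all scores below are made-up placeholders.
from scipy.stats import spearmanr, pearsonr

systems = ["sysA", "sysB", "sysC", "sysD", "sysE", "sysF"]
metric_score = [0.31, 0.27, 0.35, 0.22, 0.29, 0.33]    # automatic metric, per system
human_adequacy = [3.4, 3.1, 3.8, 2.6, 3.0, 3.6]        # mean human adequacy, per system

rho, _ = spearmanr(metric_score, human_adequacy)
r, _ = pearsonr(metric_score, human_adequacy)
print(f"Spearman rho = {rho:.2f}, Pearson r = {r:.2f}")
```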
Conclusion | Additionally, our data-driven approach can be applied to any dimension that is meaningful to human judges, and it provides an elegant way to project multiple dimensions simultaneously, by including the relevant dimensions as features of the parameter models’ training data.
Evaluation Experiment | We then evaluate the output utterances using naive human judges to rate their perceived personality and naturalness. |
Evaluation Experiment | Table 5 shows several sample outputs and the mean personality ratings from the human judges.
Introduction | Another thread investigates SNLG scoring models trained using higher-level linguistic features to replicate human judgments of utterance quality (Rambow et al., 2001; Nakatsu and White, 2006; Stent and Guo, 2005). |
Parameter Estimation Models | Collects human judgments rating the personality of each utterance; |
Abstract | Experimental results demonstrate that the multiplicative models are superior to the additive alternatives when compared against human judgments . |
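For concreteness, here is a toy sketch of the two composition operations being contrasted: additive (p = u + v) and multiplicative (element-wise, p = u * v). The vectors are random placeholders rather than real co-occurrence statistics, and the function names are mine, not the paper's.

```python
# Toy sketch of the compared composition models: additive (p = u + v) versus
# multiplicative (element-wise, p = u * v). Vectors are random placeholders.
import numpy as np

rng = np.random.default_rng(1)
u = rng.random(10)   # e.g. vector for a verb (placeholder)
v = rng.random(10)   # e.g. vector for its subject noun (placeholder)

p_additive = u + v
p_multiplicative = u * v

def cosine(a, b):
    # Cosine similarity between two composed phrase vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Phrase similarities under each model can then be correlated with human
# ratings (see the correlation sketch further below).
print(cosine(p_additive, p_multiplicative))
```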
Evaluation Setup | The task involves examining the degree of linear relationship between the human similarity judgments for pairs of individual words and the vector-based similarity values.
Evaluation Setup | We assume that the inter-subject agreement can serve as an upper bound for comparing the fit of our models against the human judgments.
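The sketch below illustrates this evaluation setup: Pearson correlation between model similarity values and mean human ratings, with inter-subject agreement (computed here as a simple leave-one-judge-out correlation, one plausible reading of the upper bound) alongside it. All ratings are invented placeholders.

```python
# Sketch of the setup above: correlate model similarities with mean human
# ratings, and estimate an inter-subject upper bound by correlating each
# judge with the mean of the remaining judges. All data are placeholders.
import numpy as np
from scipy.stats import pearsonr

# rows = items (word pairs), columns = individual human judges
ratings = np.array([[6, 7, 5], [2, 1, 2], [4, 5, 5], [7, 6, 7], [1, 2, 1]], dtype=float)
model_sim = np.array([0.82, 0.15, 0.55, 0.90, 0.10])   # model similarity per item

# Model fit against the mean human rating per item.
r_model, _ = pearsonr(model_sim, ratings.mean(axis=1))

# Inter-subject agreement: each judge against the mean of the other judges.
agreements = []
for j in range(ratings.shape[1]):
    others = ratings[:, [k for k in range(ratings.shape[1]) if k != j]].mean(axis=1)
    agreements.append(pearsonr(ratings[:, j], others)[0])
upper_bound = float(np.mean(agreements))

print(f"model r = {r_model:.2f}, inter-subject upper bound = {upper_bound:.2f}")
```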
Results | Table 2: Model means for High and Low similarity items and correlation coefficients with human judgments (*: p < 0.05, **: p < 0.01)