Abstract | In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use human judgments as the gold standard to validate the conclusions drawn from the automatic measures. |
Abstract | We observe that it is hard for the human judges to reach good agreement when asked to rate the quality of the dialogs from given perspectives. |
Abstract | When building prediction models of human judgments using previously proposed automatic measures, we find that we cannot reliably predict human ratings using a regression model, but we can predict human rankings with a ranking model. |
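For concreteness, the sketch below illustrates the two modelling choices contrasted here: fitting a regression model to absolute dialog ratings versus fitting a pairwise ranker to relative preferences. It is a minimal illustration only, not the paper's actual models; the feature matrix, ratings, and variable names are synthetic placeholders.

```python
# Minimal sketch (not the paper's models): predicting human quality judgments
# from automatic dialog measures, once as absolute ratings (regression) and
# once as relative rankings (pairwise ranker). All data are placeholders.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression, LogisticRegression
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))      # automatic measures for 60 simulated dialogs
y = rng.integers(1, 6, size=60)   # 1-5 human quality ratings (placeholder)

# Regression: predict the absolute rating directly.
reg = LinearRegression().fit(X, y)
r, _ = pearsonr(reg.predict(X), y)
print(f"regression fit, Pearson r = {r:.2f}")

# Ranking: reduce rated dialogs to ordered pairs and learn which of the two
# dialogs the judges preferred (a simple pairwise-ranking reduction).
pairs, labels = [], []
for i, j in combinations(range(len(y)), 2):
    if y[i] != y[j]:
        pairs.append(X[i] - X[j])
        labels.append(1 if y[i] > y[j] else 0)
ranker = LogisticRegression().fit(np.array(pairs), np.array(labels))
print(f"pairwise ranking accuracy = {ranker.score(np.array(pairs), np.array(labels)):.2f}")
```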
Introduction | However, our approach uses human judgments as the gold standard. |
Introduction | Although to date there are few studies that use human judges to directly assess the quality of user simulation, we believe that this is a reliable approach to assess the simulated corpora as well as an important step towards developing a comprehensive set of user simulation evaluation measures. |
Introduction | First, we can estimate the difficulty of the task of distinguishing real and simulated corpora by knowing how hard it is for human judges to reach an agreement. |
Abstract | When evaluated on data from the ACL-07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop. |
Automatic Evaluation Metrics | In the ACL-07 MT workshop, ParaEval based on recall (ParaEval-recall) achieves good correlation with human judgements.
Introduction | Since human evaluation of MT output is time consuming and expensive, having a robust and accurate automatic MT evaluation metric that correlates well with human judgement is invaluable. |
Introduction | Although BLEU has played a crucial role in the progress of MT research, it is becoming evident that BLEU does not correlate with human judgement. |
Introduction | During the recent ACL-07 workshop on statistical MT (Callison-Burch et al., 2007), a total of 11 automatic MT evaluation metrics were evaluated for correlation with human judgement.
Metric Design Considerations | The ACL-07 MT workshop evaluated the translation quality of MT systems on various translation tasks, and also measured the correlation (with human judgement) of 11 automatic MT evaluation metrics.
Metric Design Considerations | For human evaluation of the MT submissions, four different criteria were used in the workshop: Adequacy (how much of the original meaning is expressed in a system translation), Fluency (how fluent the translation is), Rank (different translations of a single source sentence are compared and ranked from best to worst), and Constituent (some constituents from the parse tree of the source sentence are translated, and human judges rank these translations).
Metric Design Considerations | For this dataset, human judgements are available on adequacy and fluency for six system submissions, and there are four English reference translation texts. |
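As a rough illustration of how such correlations are typically measured, the sketch below computes system-level Spearman and Pearson correlation between an automatic metric and mean human adequacy scores. The six systems and all scores are invented placeholders, not the workshop data.

```python
# Illustrative sketch only: system-level correlation between an automatic MT
# metric and human judgments, in the spirit of the ACL-07 methodology.
# The systems and all scores below are made-up placeholders.
from scipy.stats import spearmanr, pearsonr

systems = ["sysA", "sysB", "sysC", "sysD", "sysE", "sysF"]
metric_score = [0.31, 0.27, 0.35, 0.22, 0.29, 0.33]    # automatic metric, per system
human_adequacy = [3.4, 3.1, 3.8, 2.6, 3.0, 3.6]        # mean human adequacy, per system

rho, _ = spearmanr(metric_score, human_adequacy)
r, _ = pearsonr(metric_score, human_adequacy)
print(f"Spearman rho = {rho:.2f}, Pearson r = {r:.2f}")
```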
Conclusion | Additionally, our data-driven approach can be applied to any dimension that is meaningful to human judges, and it provides an elegant way to project multiple dimensions simultaneously, by including the relevant dimensions as features of the parameter models’ training data.
Evaluation Experiment | We then evaluate the output utterances using naive human judges to rate their perceived personality and naturalness. |
Evaluation Experiment | Table 5 shows several sample outputs and the mean personality ratings from the human judges.
Introduction | Another thread investigates SNLG scoring models trained using higher-level linguistic features to replicate human judgments of utterance quality (Rambow et al., 2001; Nakatsu and White, 2006; Stent and Guo, 2005). |
Parameter Estimation Models | Collects human judgments rating the personality of each utterance; |
Abstract | Experimental results demonstrate that the multiplicative models are superior to the additive alternatives when compared against human judgments . |
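For concreteness, here is a toy sketch of the two composition operations being contrasted: additive (p = u + v) and multiplicative (element-wise, p = u * v). The vectors are random placeholders rather than real co-occurrence statistics, and the function names are mine, not the paper's.

```python
# Toy sketch of the compared composition models: additive (p = u + v) versus
# multiplicative (element-wise, p = u * v). Vectors are random placeholders.
import numpy as np

rng = np.random.default_rng(1)
u = rng.random(10)   # e.g. vector for a verb (placeholder)
v = rng.random(10)   # e.g. vector for its subject noun (placeholder)

p_additive = u + v
p_multiplicative = u * v

def cosine(a, b):
    # Cosine similarity between two composed phrase vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Phrase similarities under each model can then be correlated with human
# ratings (see the correlation sketch further below).
print(cosine(p_additive, p_multiplicative))
```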
Evaluation Setup | The task involves examining the degree of linear relationship between the human similarity judgments for pairs of individual words and the vector-based similarity values.
Evaluation Setup | We assume that the inter-subject agreement can serve as an upper bound for comparing the fit of our models against the human judgments.
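The sketch below illustrates this evaluation setup: Pearson correlation between model similarity values and mean human ratings, with inter-subject agreement (computed here as a simple leave-one-judge-out correlation, one plausible reading of the upper bound) alongside it. All ratings are invented placeholders.

```python
# Sketch of the setup above: correlate model similarities with mean human
# ratings, and estimate an inter-subject upper bound by correlating each
# judge with the mean of the remaining judges. All data are placeholders.
import numpy as np
from scipy.stats import pearsonr

# rows = items (word pairs), columns = individual human judges
ratings = np.array([[6, 7, 5], [2, 1, 2], [4, 5, 5], [7, 6, 7], [1, 2, 1]], dtype=float)
model_sim = np.array([0.82, 0.15, 0.55, 0.90, 0.10])   # model similarity per item

# Model fit against the mean human rating per item.
r_model, _ = pearsonr(model_sim, ratings.mean(axis=1))

# Inter-subject agreement: each judge against the mean of the other judges.
agreements = []
for j in range(ratings.shape[1]):
    others = ratings[:, [k for k in range(ratings.shape[1]) if k != j]].mean(axis=1)
    agreements.append(pearsonr(ratings[:, j], others)[0])
upper_bound = float(np.mean(agreements))

print(f"model r = {r_model:.2f}, inter-subject upper bound = {upper_bound:.2f}")
```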
Results | Table 2: Model means for High and Low similarity items and correlation coefficients with human judgments (*: p < 0.05, **: p < 0.01)