Assessing Dialog System User Simulation Evaluation Measures Using Human Judges
Ai, Hua and Litman, Diane J.

Article Structure

Abstract

Previous studies evaluate simulated dialog corpora using evaluation measures which can be automatically extracted from the dialog systems’ logs.

Introduction

User simulation has been widely used in different phases in spoken dialog system development.

Related Work

A lot of research has been done on evaluating different components of Spoken Dialog Systems as well as overall system performance.

System and User Simulation Models

In this section, we describe our dialog system (ITSPOKE) and the user simulation models which we use in the assessment study.

Assessment Study Design

We decided to conduct a medium-scale assessment study that involved 30 human judges.

Assessment Study Results

In the initial analysis, we observe that rating on the 5-point scale is a difficult task for human judges, and agreement among the judges is fairly low.

Validating Automatic Measures

Since it is expensive to use human judges to rate simulated dialogs, we are interested in building prediction models of human judgments using automatic measures.

Conclusion and Future Work

Automatic evaluation measures are used in evaluating simulated dialog corpora.

Topics

human judgments

Appears in 42 sentences as: Human judges (1) human judges (13) human judges’ (8) Human judgment (1) human judgment (3) human judgments (18)
In Assessing Dialog System User Simulation Evaluation Measures Using Human Judges
  1. In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use human judgments as the gold standard to validate the conclusions drawn from the automatic measures.
    Page 1, “Abstract”
  2. We observe that it is hard for the human judges to reach good agreement when asked to rate the quality of the dialogs from given perspectives.
    Page 1, “Abstract”
  3. When building prediction models of human judgments using previously proposed automatic measures, we find that we cannot reliably predict human ratings using a regression model, but we can predict human rankings by a ranking model.
    Page 1, “Abstract”
  4. However, our approach uses human judgments as the gold standard.
    Page 1, “Introduction”
  5. Although to date there are few studies that use human judges to directly assess the quality of user simulation, we believe that this is a reliable approach to assess the simulated corpora as well as an important step towards developing a comprehensive set of user simulation evaluation measures.
    Page 1, “Introduction”
  6. First, we can estimate the difficulty of the task of distinguishing real and simulated corpora by knowing how hard it is for human judges to reach an agreement.
    Page 1, “Introduction”
  7. Second, human judgments can be used as the gold standard of the automatic evaluation measures.
    Page 1, “Introduction”
  8. measures by correlating the conclusions drawn from the automatic measures with the human judgments.
    Page 2, “Introduction”
  9. In this study, we recruit human judges to assess the quality of three user simulation models.
    Page 2, “Introduction”
  10. We first assess human judges’ abilities in distinguishing real from simulated users.
    Page 2, “Introduction”
  11. We find that it is hard for human judges to reach good agreement on the ratings.
    Page 2, “Introduction”
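
The excerpts above note that judges found it hard to reach good agreement when rating dialogs on a 5-point scale. As a rough illustration of how such agreement can be quantified, the sketch below computes pairwise Cohen's kappa with linear weights (which credits near-misses on an ordinal scale) over hypothetical ratings; the judges, the ratings, and the choice of metric are assumptions for illustration, not necessarily what the paper used.

    # Illustrative sketch (not the paper's code): pairwise weighted kappa
    # between hypothetical judges who rated the same six dialogs on a 1-5 scale.
    from itertools import combinations
    from sklearn.metrics import cohen_kappa_score

    ratings = {
        "judge_A": [5, 4, 2, 3, 4, 1],
        "judge_B": [4, 4, 1, 3, 5, 2],
        "judge_C": [3, 5, 2, 2, 4, 1],
    }

    # Linear weights treat a 4-vs-5 disagreement as milder than a 1-vs-5 one,
    # which suits ordinal (Likert-style) rating scales.
    for (name_a, r_a), (name_b, r_b) in combinations(ratings.items(), 2):
        kappa = cohen_kappa_score(r_a, r_b, weights="linear")
        print(f"{name_a} vs {name_b}: weighted kappa = {kappa:.2f}")

Low pairwise kappa values on such data would be consistent with the observation quoted above that agreement among judges was fairly low.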

cross validation

Appears in 9 sentences as: Cross Validation (1) cross validation (6) Cross Validations (1) cross validations (1)
In Assessing Dialog System User Simulation Evaluation Measures Using Human Judges
  1. Cross Validation     d_TUR   d_QLT   d_PAT
     Regular              0.176   0.155   0.151
     Minus-one-model      0.224   0.180   0.178
    Page 7, “Validating Automatic Measures”
  2. Table 7: LOSS scores for Regular and Minus-one-model (during training) Cross Validations
    Page 7, “Validating Automatic Measures”
  3. First, we use regular 4-fold cross validation where we randomly hold out 25% of the data for testing and train on the remaining 75% of the data for 4 rounds.
    Page 7, “Validating Automatic Measures”
  4. We call this approach the minus-one-model cross validation.
    Page 7, “Validating Automatic Measures” (a sketch contrasting the two cross-validation setups follows this list)
  5. Table 7 shows the LOSS scores for both cross validations.
    Page 7, “Validating Automatic Measures”
  6. When comparing the two cross validation results for the same question, we see more LOSS in the more difficult minus-one-model case.
    Page 7, “Validating Automatic Measures”
  7. To address this question, we use AMR scores to reevaluate all cross validation results.
    Page 7, “Validating Automatic Measures”
  8. Table 8 shows the human-rated and predicted AMR scores averaged over four rounds of testing on the regular cross validation results.
    Page 7, “Validating Automatic Measures”
  9. When applying AMR on the minus-one-model cross validation results, we see similar results that the ranking model reproduces human …
    Page 7, “Validating Automatic Measures”
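
Items 3 and 4 above contrast regular 4-fold cross validation with the minus-one-model variant. The excerpts do not spell out the splitting code, so the sketch below is an assumed reconstruction: regular folds hold out a random 25% of the dialogs per round, while the minus-one-model variant holds out every dialog produced by one simulation model; the corpus, the model names, and the dialog count are hypothetical.

    # Illustrative sketch (assumed protocol, not the paper's code): contrast a
    # regular 4-fold split with a "minus-one-model" split that holds out all
    # dialogs generated by one simulation model.
    import random

    # Hypothetical corpus: each dialog is tagged with the simulation model
    # that produced it.
    dialogs = [{"id": i, "model": random.choice(["model_A", "model_B", "model_C"])}
               for i in range(100)]

    def regular_4fold(data):
        """Each round tests on a random 25% and trains on the remaining 75%."""
        shuffled = random.sample(data, len(data))
        folds = [shuffled[k::4] for k in range(4)]
        for k in range(4):
            train = [d for j, fold in enumerate(folds) if j != k for d in fold]
            yield train, folds[k]

    def minus_one_model(data):
        """Each round holds out every dialog produced by one simulation model."""
        for held_out in ("model_A", "model_B", "model_C"):
            train = [d for d in data if d["model"] != held_out]
            test = [d for d in data if d["model"] == held_out]
            yield held_out, train, test

    for held_out, train, test in minus_one_model(dialogs):
        print(f"held out {held_out}: train={len(train)} dialogs, test={len(test)} dialogs")

Holding out an entire simulation model rather than a random 25% is a harder generalization test, which is consistent with the larger LOSS reported for the minus-one-model case in Table 7.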

regression model

Appears in 6 sentences as: Regression Model (1) regression model (3) regression models (2)
In Assessing Dialog System User Simulation Evaluation Measures Using Human Judges
  1. When building prediction models of human judgments using previously proposed automatic measures, we find that we cannot reliably predict human ratings using a regression model, but we can predict human rankings by a ranking model.
    Page 1, “Abstract”
  2. Similarly, when we use previously proposed automatic measures to predict human judgments, we cannot reliably predict human ratings using a regression model, but we can consistently mimic human judges’ rankings using a ranking model.
    Page 2, “Introduction” (a minimal sketch of both model types follows this list)
  3. Some studies (e.g., Walker et al., 1997) build regression models to predict user satisfaction scores from the system log as well as the user survey.
    Page 2, “Related Work”
  4. In this study, we build both a regression model and a ranking model to evaluate user simulation.
    Page 2, “Related Work”
  5. 6.1 The Regression Model
    Page 6, “Validating Automatic Measures”
  6. We would also want to include more automatic measures that may be available in the richer corpora to improve the ranking and the regression models.
    Page 8, “Conclusion and Future Work”
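
The excerpts above contrast a regression model, which must predict absolute human ratings, with a ranking model, which only needs to reproduce relative orderings. The sketch below is a minimal illustration of that distinction on synthetic data; the feature set, the linear and logistic learners, and the pairwise reduction used for ranking are assumptions for illustration rather than the paper's actual models.

    # Illustrative sketch (not the paper's models): predict human judgments from
    # automatic dialog measures with (a) a regression model over absolute ratings
    # and (b) a pairwise ranking model over relative orderings.
    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(0)

    # Hypothetical data: each row holds automatic measures for one dialog;
    # y is a human rating on a 1-5 scale loosely tied to those measures.
    X = rng.normal(size=(60, 4))
    y = np.clip(np.round(3 + X @ np.array([0.8, -0.5, 0.3, 0.0])
                         + rng.normal(scale=0.7, size=60)), 1, 5)

    # (a) Regression: predict the absolute rating directly.
    reg = LinearRegression().fit(X, y)
    print("predicted ratings:", reg.predict(X[:3]).round(2))

    # (b) Ranking via pairwise reduction: learn which of two dialogs is rated
    # higher from the difference of their feature vectors.
    pairs, labels = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] != y[j]:
                pairs.append(X[i] - X[j])
                labels.append(int(y[i] > y[j]))
    ranker = LogisticRegression(max_iter=1000).fit(np.array(pairs), np.array(labels))

    # Rank dialogs by the learned linear score; higher means predicted-better.
    scores = X[:5] @ ranker.coef_.ravel()
    print("ranking of first 5 dialogs (best first):", np.argsort(-scores))

Because the ranking model never has to match the absolute 1-5 scale, it can reproduce judges' orderings even when exact scores are hard to predict, which mirrors the finding quoted above.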

gold standard

Appears in 4 sentences as: gold standard (4)
In Assessing Dialog System User Simulation Evaluation Measures Using Human Judges
  1. In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use human judgments as the gold standard to validate the conclusions drawn from the automatic measures.
    Page 1, “Abstract”
  2. However, our approach uses human judgments as the gold standard.
    Page 1, “Introduction”
  3. Second, human judgments can be used as the gold standard of the automatic evaluation measures.
    Page 1, “Introduction”
  4. The second and third columns show the human-rated score as the gold standard and the machine-predicted score in the testing phase, respectively.
    Page 7, “Validating Automatic Measures”
