Index of papers in Proc. ACL 2009 that mention
  • evaluation metrics
Amigó, Enrique and Giménez, Jesús and Gonzalo, Julio and Verdejo, Felisa
Abstract
In this work, we propose a novel approach for meta-evaluation of MT evaluation metrics, since correlation coefficients against human judges do not reveal details about the advantages and disadvantages of particular metrics.
Abstract
We then use this approach to investigate the benefits of introducing linguistic features into evaluation metrics.
Alternatives to Correlation-based Meta-evaluation
However, each automatic evaluation metric has its own scale properties.
Alternatives to Correlation-based Meta-evaluation
This conclusion motivates the incorporation of linguistic processing into automatic evaluation metrics.
Alternatives to Correlation-based Meta-evaluation
In order to obtain additional evidence about the usefulness of combining evaluation metrics at different processing levels, let us consider the following situation: given a set of reference translations, we want to train a combined system that takes the most appropriate translation approach for each text segment.
Correlation with Human Judgements
Figure 1 shows the correlation obtained by each automatic evaluation metric at system level (horizontal axis) versus segment level (vertical axis) in our test beds.
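To make the two axes concrete: system-level correlation is computed over one aggregate score per system, while segment-level correlation is computed over individual segment scores. The sketch below illustrates this distinction with made-up scores and scipy.stats.pearsonr; it is an illustration of the general practice, not the paper's code.

    # Illustrative only: hypothetical per-segment metric and human scores for three systems.
    from statistics import mean
    from scipy.stats import pearsonr

    scores = {
        "sysA": {"metric": [0.30, 0.50, 0.40], "human": [2.0, 3.5, 3.0]},
        "sysB": {"metric": [0.60, 0.70, 0.50], "human": [4.0, 4.2, 3.1]},
        "sysC": {"metric": [0.20, 0.40, 0.30], "human": [1.8, 2.9, 2.5]},
    }

    # System level: correlate one averaged score per system.
    r_system, _ = pearsonr([mean(s["metric"]) for s in scores.values()],
                           [mean(s["human"]) for s in scores.values()])

    # Segment level: pool every segment's scores and correlate them directly.
    seg_metric = [x for s in scores.values() for x in s["metric"]]
    seg_human = [x for s in scores.values() for x in s["human"]]
    r_segment, _ = pearsonr(seg_metric, seg_human)

    print(f"system-level r = {r_system:.3f}, segment-level r = {r_segment:.3f}")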
Introduction
These automatic evaluation metrics allow developers to optimize their systems without the need for expensive human assessments for each of their possible system configurations.
Introduction
In the context of Machine Translation, a considerable effort has also been made to include deeper linguistic information in automatic evaluation metrics, both syntactic and semantic (see Section 2 for details).
Introduction
Analyzing the reliability of evaluation metrics requires meta-evaluation criteria.
Previous Work on Machine Translation Meta-Evaluation
Insofar as automatic evaluation metrics for machine translation have been proposed, different meta-evaluation frameworks have been gradually introduced.
evaluation metrics is mentioned in 12 sentences in this paper.
Owczarzak, Karolina
Abstract
In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE).
Current practice in summary evaluation
Since this type of evaluation processes information in stages (constituent parser, dependency extraction, and the method of dependency matching between a candidate and a reference), there is potential for variance in performance among dependency-based evaluation metrics that use different components.
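The final matching stage can be pictured as a set overlap over dependency triples extracted from a candidate and a reference. The sketch below uses hypothetical (relation, head, dependent) triples and exact set intersection; the actual DEPEVAL(summ) matching component may differ (e.g. partial or lemma-based matching).

    # Hypothetical dependency triples; not the DEPEVAL(summ) implementation.
    candidate = {("subj", "win", "team"), ("obj", "win", "match"), ("mod", "match", "final")}
    reference = {("subj", "win", "team"), ("obj", "win", "game"), ("mod", "game", "final")}

    matched = candidate & reference          # exact-match overlap
    precision = len(matched) / len(candidate)
    recall = len(matched) / len(reference)
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)

    print(f"P = {precision:.2f}, R = {recall:.2f}, F = {f_score:.2f}")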
Dependency-based evaluation
In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics.
Discussion and future work
Admittedly, we could just ignore this problem and focus on increasing correlations for automatic summaries only; after all, the whole point of creating evaluation metrics is to score and rank the output of systems.
Discussion and future work
Since there is no single winner among all 32 variants of DEPEVAL(summ) on TAC 2008 data, we must decide which of the categories is most important to a successful automatic evaluation metric.
Discussion and future work
This ties in with the purpose which the evaluation metric should serve.
Experimental results
Of course, the ideal evaluation metric would show high correlations with human judgment on both levels.
Experimental results
Table 1: System-level Pearson’s correlation between automatic and manual evaluation metrics for TAC 2008 data.
Introduction
In this paper, we explore one such evaluation metric, DEPEVAL(summ), based on the comparison of Lexical-Functional Grammar (LFG) dependencies between a candidate summary and a reference.
evaluation metrics is mentioned in 9 sentences in this paper.
Pervouchine, Vladimir and Li, Haizhou and Lin, Bo
Abstract
This paper studies transliteration alignment, its evaluation metrics and applications.
Abstract
We propose a new evaluation metric, alignment entropy, grounded in information theory, to evaluate the alignment quality without the need for a gold-standard reference, and compare the metric with F-score.
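The abstract only names the metric. As a rough illustration of the information-theoretic idea, the sketch below computes the Shannon entropy of a distribution over alignment links; this is a generic entropy computation with made-up links, not the paper's exact definition of alignment entropy.

    # Generic Shannon entropy over alignment-link counts (illustrative only).
    from collections import Counter
    from math import log2

    # Hypothetical (source grapheme, target grapheme) alignment links.
    links = [("a", "A"), ("k", "K"), ("a", "K"), ("a", "A"), ("k", "Q")]

    counts = Counter(links)
    total = sum(counts.values())
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())

    print(f"entropy of the link distribution = {entropy:.3f} bits")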
Experiments
From the figures, we can observe a clear correlation between alignment entropy and F-score, which validates the effectiveness of alignment entropy as an evaluation metric.
Experiments
This once again demonstrates the desired property of alignment entropy as an evaluation metric of alignment.
Introduction
In Section 3, we introduce both statistically and phonologically motivated alignment techniques, and in Section 4 we advocate an evaluation metric, alignment entropy, which measures the alignment quality.
Related Work
Although there are many studies of evaluation metrics of word alignment for MT (Lambert, 2008), there has been much less reported work on evaluation metrics of transliteration alignment.
Related Work
Three evaluation metrics are used: precision, recall, and F-score, the latter being a function of the former two.
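For reference, the balanced F-score combines the two as their harmonic mean (the standard definition; a weighted variant is not stated here):

    F = \frac{2 P R}{P + R}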
Related Work
In this paper, we propose a novel evaluation metric for transliteration alignment grounded in information theory.
evaluation metrics is mentioned in 8 sentences in this paper.