Abstract | In this work, we propose a novel approach for the meta-evaluation of MT evaluation metrics, since correlation coefficients against human judgments do not reveal details about the advantages and disadvantages of particular metrics. |
Abstract | We then use this approach to investigate the benefits of introducing linguistic features into evaluation metrics. |
Alternatives to Correlation-based Meta-evaluation | However, each automatic evaluation metric has its own scale properties. |
Alternatives to Correlation-based Meta-evaluation | This conclusion motivates the incorporation of linguistic processing into automatic evaluation metrics. |
Alternatives to Correlation-based Meta-evaluation | In order to obtain additional evidence about the usefulness of combining evaluation metrics at different processing levels, let us consider the following situation: given a set of reference translations, we want to train a combined system that takes the most appropriate translation approach for each text segment. |
Correlation with Human Judgements | Figure 1 shows the correlation obtained by each automatic evaluation metric at system level (horizontal axis) versus segment level (vertical axis) in our test beds. |
Introduction | These automatic evaluation metrics allow developers to optimize their systems without the need for expensive human assessments for each of their possible system configurations. |
Introduction | In the context of Machine Translation, a considerable effort has also been made to include deeper linguistic information, both syntactic and semantic, in automatic evaluation metrics (see Section 2 for details). |
Introduction | Analyzing the reliability of evaluation metrics requires meta-evaluation criteria. |
Previous Work on Machine Translation Meta-Evaluation | As automatic evaluation metrics for machine translation have been proposed, different meta-evaluation frameworks have gradually been introduced. |
Abstract | In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE). |
Current practice in summary evaluation | Since this type of evaluation processes information in stages (constituent parser, dependency extraction, and the method of dependency matching between a candidate and a reference), there is potential for variance in performance among dependency-based evaluation metrics that use different components. |
Dependency-based evaluation | In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics. |
Discussion and future work | Admittedly, we could just ignore this problem and focus on increasing correlations for automatic summaries only; after all, the whole point of creating evaluation metrics is to score and rank the output of systems. |
Discussion and future work | Since there is no single winner among all 32 variants of DEPEVAL(summ) on TAC 2008 data, we must decide which of the categories is most important to a successful automatic evaluation metric. |
Discussion and future work | This ties in with the purpose which the evaluation metric should serve. |
Experimental results | Of course, the ideal evaluation metric would show high correlations with human judgment on both levels. |
Experimental results | Table 1: System-level Pearson’s correlation between automatic and manual evaluation metrics for TAC 2008 data. |
Introduction | In this paper, we explore one such evaluation metric, DEPEVAL(summ), based on the comparison of Lexical-Functional Grammar (LFG) dependencies between a candidate summary and a reference. |
Abstract | This paper studies transliteration alignment, its evaluation metrics and applications. |
Abstract | We propose a new evaluation metric, alignment entropy, grounded in information theory, to evaluate the alignment quality without the need for a gold-standard reference, and we compare the metric with F-score. |
Experiments | From the figures, we can observe a clear correlation between alignment entropy and F-score, which validates the effectiveness of alignment entropy as an evaluation metric. |
Experiments | This once again demonstrates the desired property of alignment entropy as an evaluation metric of alignment. |
Introduction | In Section 3, we introduce both statistically and phonologically motivated alignment techniques, and in Section 4 we advocate an evaluation metric, alignment entropy, that measures the alignment quality. |
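The paper's exact definition of alignment entropy is not reproduced in these excerpts. The sketch below illustrates one plausible reading, in which the entropy measures how consistently each source unit maps to target units across the aligned data (lower entropy = more consistent, hence presumably better, alignment); the example pairs and the link-frequency weighting are illustrative assumptions, not the paper's formulation.

```python
from collections import Counter
from math import log2

def alignment_entropy(aligned_pairs):
    """Entropy of the source-to-target alignment distribution (a sketch).

    aligned_pairs: list of (source_unit, target_unit) links taken from an
    alignment of transliteration pairs. For each source unit we compute the
    entropy of its target distribution, then average, weighting each source
    unit by how often it is linked.
    """
    per_source = {}
    for s, t in aligned_pairs:
        per_source.setdefault(s, Counter())[t] += 1
    total_links = len(aligned_pairs)
    h = 0.0
    for targets in per_source.values():
        n_s = sum(targets.values())
        h_s = -sum((c / n_s) * log2(c / n_s) for c in targets.values())
        h += (n_s / total_links) * h_s  # weight by link frequency of s
    return h

# A fully consistent alignment ("k" always maps to "ke") has entropy 0;
# an inconsistent alignment is penalized with positive entropy.
print(alignment_entropy([("k", "ke"), ("k", "ke"), ("n", "n")]))  # 0.0
print(alignment_entropy([("k", "ke"), ("k", "ka"), ("n", "n")]) > 0)  # True
```

Note that, unlike F-score, this quantity needs no gold-standard alignment: it is computed from the alignment itself.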
Related Work | Although there are many studies of evaluation metrics of word alignment for MT (Lambert, 2008), there has been much less reported work on evaluation metrics of transliteration alignment. |
Related Work | Three evaluation metrics are used: precision, recall, and F-score, the latter being a function of the former two. |
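For completeness, the standard F-score is the harmonic mean of precision and recall; the β-weighted variant below generalizes it (β = 1 recovers the usual F1). The example numbers are illustrative:

```python
def f_score(precision, recall, beta=1.0):
    """F-score: the (beta-weighted) harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# E.g. an alignment proposing 10 links of which 8 are correct (precision 0.8)
# while the gold standard contains 12 links (recall 8/12):
print(round(f_score(0.8, 8 / 12), 3))  # → 0.727
```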
Related Work | In this paper we propose a novel evaluation metric for transliteration alignment grounded in information theory. |