Abstract | In this work, we propose a novel approach for the meta-evaluation of MT evaluation metrics, since correlation coefficients against human judgments do not reveal details about the advantages and disadvantages of particular metrics.
Abstract | We then use this approach to investigate the benefits of introducing linguistic features into evaluation metrics.
Alternatives to Correlation-based Meta-evaluation | However, each automatic evaluation metric has its own scale properties. |
Alternatives to Correlation-based Meta-evaluation | This conclusion motivates the incorporation of linguistic processing into automatic evaluation metrics.
Alternatives to Correlation-based Meta-evaluation | In order to obtain additional evidence about the usefulness of combining evaluation metrics at different processing levels, let us consider the following situation: given a set of reference translations, we want to train a combined system that takes the most appropriate translation approach for each text segment.
Correlation with Human Judgements | Figure 1 shows the correlation obtained by each automatic evaluation metric at system level (horizontal axis) versus segment level (vertical axis) in our test beds. |
Introduction | These automatic evaluation metrics allow developers to optimize their systems without the need for expensive human assessments for each of their possible system configurations. |
Introduction | In the context of Machine Translation, a considerable effort has also been made to include deeper linguistic information in automatic evaluation metrics, both syntactic and semantic (see Section 2 for details).
Introduction | Analyzing the reliability of evaluation metrics requires meta-evaluation criteria. |
Previous Work on Machine Translation Meta-Evaluation | As automatic evaluation metrics for machine translation have been proposed, different meta-evaluation frameworks have gradually been introduced.
Abstract | Then, we show that these measures can help improve a number of existing machine translation evaluation metrics both at the segment- and at the system-level. |
Abstract | Rather than proposing a single new metric, we show that discourse information is complementary to the state-of-the-art evaluation metrics, and thus should be taken into account in the development of future richer evaluation metrics.
Experimental Setup | 4.1 MT Evaluation Metrics |
Introduction | We believe that the semantic and pragmatic information captured in the form of DTs (i) can help develop discourse-aware SMT systems that produce coherent translations, and (ii) can yield better MT evaluation metrics . |
Introduction | In this paper, rather than proposing yet another MT evaluation metric, we show that discourse information is complementary to many existing evaluation metrics, and thus should not be ignored.
Introduction | We first design two discourse-aware similarity measures, which use DTs generated by a publicly-available discourse parser (Joty et al., 2012); then, we show that they can help improve a number of MT evaluation metrics at the segment- and at the system-level in the context of the WMT11 and the WMT12 metrics shared tasks (Callison-Burch et al., 2011; Callison-Burch et al., 2012).
Our Discourse-Based Measures | In order to develop a discourse-aware evaluation metric, we first generate discourse trees for the reference and the system-translated sentences using a discourse parser, and then we measure the similarity between the two discourse trees.
Related Work | A common argument is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012).
Related Work | Thus, there is consensus that discourse-informed MT evaluation metrics are needed in order to advance research in this direction. |
Related Work | The field of automatic evaluation metrics for MT is very active, and new metrics are continuously being proposed, especially in the context of the evaluation campaigns that run as part of the Workshops on Statistical Machine Translation (WMT 2008-2012), and NIST Metrics for Machine Translation Challenge (MetricsMATR), among others. |
Abstract | Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. |
Abstract | This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems.
BLEU and PORT | Several ordering measures have been integrated into MT evaluation metrics recently. |
Experiments | 3.1 PORT as an Evaluation Metric |
Experiments | We studied PORT as an evaluation metric on WMT data; test sets include WMT 2008, WMT 2009, and WMT 2010 all-to-English, plus 2009, 2010 English-to-all submissions. |
Experiments | This is because we designed PORT to carry out tuning; we did not optimize its performance as an evaluation metric, but rather its system tuning performance.
Introduction | Automatic evaluation metrics for machine translation (MT) quality are a key part of building statistical MT (SMT) systems. |
Introduction | VIT Evaluation Metric for Tuning |
Introduction | These methods perform repeated decoding runs with different system parameter values, which are tuned to optimize the value of the evaluation metric over a development set with reference translations. |
Abstract | As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. |
Abstract | But more accurate, nonautomatic adequacy-oriented MT evaluation metrics like HTER are highly labor-intensive, which bottlenecks the evaluation cycle. |
Abstract | We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semiautomated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure.
Abstract | We propose an automatic machine translation (MT) evaluation metric that calculates a similarity score (based on precision and recall) of a pair of sentences. |
Abstract | When evaluated on data from the ACL-07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop.
Introduction | Since human evaluation of MT output is time consuming and expensive, having a robust and accurate automatic MT evaluation metric that correlates well with human judgement is invaluable. |
Introduction | Among all the automatic MT evaluation metrics, BLEU (Papineni et al., 2002) is the most widely used.
Introduction | During the recent ACL-07 workshop on statistical MT (Callison-Burch et al., 2007), a total of 11 automatic MT evaluation metrics were evaluated for correlation with human judgement. |
Metric Design Considerations | We first review some aspects of existing metrics and highlight issues that should be considered when designing an MT evaluation metric.
Metric Design Considerations | The ACL-07 MT workshop evaluated the translation quality of MT systems on various translation tasks, and also measured the correlation (with human judgement) of 11 automatic MT evaluation metrics . |
Metric Design Considerations | In this paper, we present MAXSIM, a new automatic MT evaluation metric that computes a similarity score between corresponding items across a sentence pair, and uses a bipartite graph to obtain an optimal matching between item pairs. |
Abstract | In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE). |
Current practice in summary evaluation | Since this type of evaluation processes information in stages (constituent parser, dependency extraction, and the method of dependency matching between a candidate and a reference), there is potential for variance in performance among dependency-based evaluation metrics that use different components. |
Dependency-based evaluation | In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics.
Discussion and future work | Admittedly, we could just ignore this problem and focus on increasing correlations for automatic summaries only; after all, the whole point of creating evaluation metrics is to score and rank the output of systems. |
Discussion and future work | Since there is no single winner among all 32 variants of DEPEVAL(summ) on TAC 2008 data, we must decide which of the categories is most important to a successful automatic evaluation metric.
Discussion and future work | This ties in with the purpose which the evaluation metric should serve. |
Experimental results | Of course, the ideal evaluation metric would show high correlations with human judgment on both levels. |
Experimental results | Table 1: System-level Pearson’s correlation between automatic and manual evaluation metrics for TAC 2008 data. |
Introduction | In this paper, we explore one such evaluation metric, DEPEVAL(summ), based on the comparison of Lexical-Functional Grammar (LFG) dependencies between a candidate summary and a reference.
Abstract | This paper studies transliteration alignment, its evaluation metrics and applications. |
Abstract | We propose a new evaluation metric, alignment entropy, grounded in information theory, to evaluate alignment quality without the need for a gold-standard reference, and compare the metric with F-score.
Experiments | From the figures, we can observe a clear correlation between alignment entropy and F-score, which validates the effectiveness of alignment entropy as an evaluation metric.
Experiments | This once again demonstrates the desired property of alignment entropy as an evaluation metric of alignment. |
Introduction | In Section 3, we introduce both statistically and phonologically motivated alignment techniques, and in Section 4 we advocate an evaluation metric, alignment entropy, that measures alignment quality.
Related Work | Although there are many studies of evaluation metrics of word alignment for MT (Lambert, 2008), there has been much less reported work on evaluation metrics of transliteration alignment. |
Related Work | Three evaluation metrics are used: precision, recall, and F-score, the latter being a function of the former two.
Related Work | In this paper we propose a novel evaluation metric for transliteration alignment grounded in information theory.
Conclusions | This evaluation metric allows for a deeper understanding of how certain normalization actions impact the output of the parser. |
Evaluation | 5.1 Evaluation Metrics |
Evaluation | Therefore, we propose a new evaluation metric that directly equates normalization performance with the performance of a common downstream application—dependency parsing. |
Introduction | Another potential problem with state-of-the-art normalization is the lack of appropriate evaluation metrics.
Introduction | For instance, it is unclear how performance measured by the typical normalization evaluation metrics of word error rate and BLEU score (Papineni et al., 2002) translates into performance on a parsing task, where a well placed punctuation mark may provide more substantial improvements than changing a nonstandard word form.
Introduction | To address this problem, this work introduces an evaluation metric that ties normalization performance directly to the performance of a downstream dependency parser. |
Introduction | These methods are effective because they tune the system to maximize an automatic evaluation metric such as BLEU, which serves as a surrogate objective for translation quality.
Introduction | While many alternatives have been proposed, such a perfect evaluation metric remains elusive. |
Introduction | As a result, many MT evaluation campaigns now report multiple evaluation metrics (Callison-Burch et al., 2011; Paul, 2010).
Opportunities and Limitations | Leveraging the diverse perspectives of different evaluation metrics has the potential to improve overall quality. |
Related Work | It would be unfortunate if a good evaluation metric could not be used for tuning.
Conclusion | In this work, we devise a new MT evaluation metric in the family of TESLA (Translation Evaluation of Sentences with Linear-programming-based Analysis), called TESLA-CELAB (Character-level Evaluation for Languages with Ambiguous word Boundaries), to address the problem of fuzzy word boundaries in the Chinese language, although neither the phenomenon nor the method is unique to Chinese. |
Introduction | The Workshop on Statistical Machine Translation (WMT) hosts regular campaigns comparing different machine translation evaluation metrics (Callison-Burch et al., 2009; Callison-Burch et al., 2010; Callison-Burch et al., 2011). |
Introduction | The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 − TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
The Algorithm | Notice that all n-grams are put in the same matching problem regardless of n, unlike in translation evaluation metrics designed for European languages. |
The Algorithm | This relationship is implicit in the matching problem for English translation evaluation metrics where words are well delimited. |
The Algorithm | Many prior translation evaluation metrics such as MAXSIM (Chan and Ng, 2008) and TESLA (Liu et al., 2010; Dahlmeier et al., 2011) use the F-0.8 measure as the final score: |
Abstract | A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. |
Introduction | However, a lack of standard datasets and automatic evaluation metrics has impeded progress in the field. |
Introduction | Second, we define a new evaluation metric, PINC (Paraphrase In N-gram Changes), that relies on simple BLEU-like n-gram comparisons to measure the degree of novelty of automatically generated paraphrases.
Paraphrase Evaluation Metrics | A good paraphrase, according to our evaluation metric, has few n-gram overlaps with the source sentence but many n-gram overlaps with the reference sentences.
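The source-dissimilarity half of this trade-off can be sketched concretely (the reference-similarity half is measured separately, e.g. by BLEU). A minimal PINC-style scorer, assuming whitespace tokenization and n-gram orders 1 through 4; the official implementation may differ in detail:

```python
def ngrams(tokens, n):
    """Set of n-grams of order n in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """PINC-style novelty score (sketch): average, over n-gram orders,
    of the fraction of candidate n-grams NOT found in the source.
    Higher means more lexical change from the source sentence."""
    src, cand = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            continue  # candidate shorter than n
        overlap = len(cand_ngrams & ngrams(src, n))
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0
```

A verbatim copy of the source scores 0, and a candidate sharing no n-grams with the source scores 1.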
Related Work | The more recently proposed metric PEM (Paraphrase Evaluation Metric ) (Liu et al., 2010) produces a single score that captures the semantic adequacy, fluency, and lexical dissimilarity of candidate paraphrases, relying on bilingual data to learn semantic equivalences without using n- gram similarity between candidate and reference sentences. |
Abstract | This work proposes a new segmentation evaluation metric , named boundary similarity (B), an inter-coder agreement coefficient adaptation, and a confusion-matrix for segmentation that are all based upon an adaptation of the boundary edit distance in Fournier and Inkpen (2012). |
Conclusions | In this work, a new segmentation evaluation metric , referred to as boundary similarity (B) is proposed as an unbiased metric, along with a boundary-edit-distance-based (BED-based) confusion matrix to compute predictably biased IR metrics such as precision and recall. |
Conclusions | B also allows for an intuitive comparison of boundary pairs between segmentations, as opposed to the window counts of WD or the simplistic edit count normalization of S. When an unbiased segmentation evaluation metric is desired, this work recommends the use of B together with an upper and lower bound to provide context.
Evaluation of Automatic Segmenters | An ideal segmentation evaluation metric should, in theory, place the three automatic segmenters between the upper and lower bounds in terms of performance if the metrics, and the segmenters, function properly. |
Introduction | To select an automatic segmenter for a particular task, a variety of segmentation evaluation metrics have been proposed, including Pk (Beeferman and Berger, 1999, pp.
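For concreteness, Pk can be sketched as a sliding-window disagreement count. This is a simplified reading of Beeferman and Berger (1999); the representation (segment masses) and the default window-size heuristic are assumptions:

```python
def seg_ids(masses):
    """Expand segment masses into per-unit segment ids, e.g. [2, 3] -> [0, 0, 1, 1, 1]."""
    ids = []
    for seg, m in enumerate(masses):
        ids.extend([seg] * m)
    return ids

def pk(ref_masses, hyp_masses, k=None):
    """P_k (sketch): slide a window of width k over the text and count
    positions where reference and hypothesis disagree about whether the
    two window ends lie in the same segment. k defaults to roughly half
    the mean reference segment length (a common convention)."""
    ref, hyp = seg_ids(ref_masses), seg_ids(hyp_masses)
    assert len(ref) == len(hyp), "segmentations must cover the same text"
    if k is None:
        k = max(2, round(len(ref) / (2 * len(ref_masses))))
    errors, count = 0, 0
    for i in range(len(ref) - k):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        errors += same_ref != same_hyp
        count += 1
    return errors / count if count else 0.0
```

An identical segmentation scores 0 (no penalty); lower is better.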
Abstract | We introduce XMEANT—a new cross-lingual version of the semantic frame based MT evaluation metric MEANT—which can correlate even more closely with human adequacy judgments than monolingual MEANT and eliminates the need for expensive human references.
Introduction | It is well established that the MEANT family of metrics correlates better with human adequacy judgments than commonly used MT evaluation metrics (Lo and Wu, 2011a, 2012; Lo et al., 2012; Lo and Wu, 2013b; Machacek and Bojar, 2013). |
Introduction | We therefore propose XMEANT, a cross-lingual MT evaluation metric that modifies MEANT using (1) simple translation probabilities (in our experiments,
Related Work | 2.1 MT evaluation metrics |
Results | Table 1 shows that for human adequacy judgments at the sentence level, the f-score based XMEANT (1) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT.
Abstract | We present a novel approach to the automatic acquisition of a VerbNet-like classification of French verbs which involves the use (i) of a neural clustering method which associates clusters with features, (ii) of several supervised and unsupervised evaluation metrics and (iii) of various existing syntactic and semantic lexical resources.
Clustering Methods, Evaluation Metrics and Experimental Setup | 3.2 Evaluation metrics |
Clustering Methods, Evaluation Metrics and Experimental Setup | We use several evaluation metrics which bear on different properties of the clustering. |
Clustering Methods, Evaluation Metrics and Experimental Setup | As pointed out in (Lamirel et al., 2008; Attik et al., 2006), unsupervised evaluation metrics based on cluster labelling and feature maximisation can prove very useful for identifying the best clustering strategy. |
Features and Data | Moreover, for this data set, the unsupervised evaluation metrics (cf. |
Abstract | With this in mind, it is striking that virtually all evaluations of syntactic annotation efforts use uncorrected parser evaluation metrics such as bracket F1 (for phrase structure) and accuracy scores (for dependencies). |
Abstract | To evaluate our metric we first present a number of synthetic experiments to better control the sources of noise and gauge the metric’s responses, before finally contrasting the behaviour of our chance-corrected metric with that of uncorrected parser evaluation metrics on real-world corpora.
Conclusion | In this task, inserting and deleting nodes is an integral part of the annotation, and if two annotators insert or delete different nodes, the all-or-nothing requirement of identical yield makes LAS unusable as an evaluation metric in this setting.
Real-world corpora | In our evaluation, we will contrast labelled accuracy, the standard parser evaluation metric, and our three α metrics.
Synthetic experiments | The de facto standard parser evaluation metric in dependency parsing.
Experiments and Evaluations | We first describe our experimental settings and define evaluation metrics to evaluate induced soft clusterings of verb classes. |
Experiments and Evaluations | 4.2 Evaluation Metrics |
Experiments and Evaluations | This kind of normalization for soft clusterings was performed for other evaluation metrics as in Springorum et al. |
Conclusions and Future Work | • the evaluation metrics employed are to be questioned (certainly),
Evaluation | 5.1 Data and Evaluation Metrics |
Evaluation | We evaluate our system with the coreference resolution evaluation metrics that were used for the CoNLL shared tasks on coreference, which are MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998) and CEAFe (Luo, 2005). |
Evaluation | We also report the unweighted average of the three scores, which was the official evaluation metric in the shared tasks. |
Evaluation | 4.1 Evaluation Metrics |
Evaluation | Designing evaluation metrics for keyphrase extraction is by no means an easy task. |
Evaluation | To score the output of a keyphrase extraction system, the typical approach, which is also adopted by the SemEval-2010 shared task on keyphrase extraction, is (1) to create a mapping between the keyphrases in the gold standard and those in the system output using exact match, and then (2) score the output using evaluation metrics such as precision (P), recall (R), and F-score (F).
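The exact-match scoring in steps (1)–(2) can be sketched as follows. Case-normalization and whitespace stripping are assumptions here; shared-task scorers may additionally stem keyphrases before matching:

```python
def keyphrase_prf(gold, predicted):
    """Exact-match keyphrase scoring (sketch): precision, recall, and
    F-score over case-normalized phrase strings."""
    gold_set = {g.lower().strip() for g in gold}
    pred_set = {p.lower().strip() for p in predicted}
    tp = len(gold_set & pred_set)  # exact-match mapping
    p = tp / len(pred_set) if pred_set else 0.0
    r = tp / len(gold_set) if gold_set else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

As the surrounding text notes, exact matching is strict: a near-miss such as "neural net" against gold "neural nets" counts as a full error.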
Abstract | Experimental results show that our graph propagation method significantly improves performance over two strong baselines under intrinsic and extrinsic evaluation metrics.
Experiments & Results 4.1 Experimental Setup | Two intrinsic evaluation metrics that we use to evaluate the possible translations for oovs are Mean Reciprocal Rank (MRR) (Voorhees, 1999) and Recall. |
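A minimal sketch of both intrinsic metrics, assuming each OOV query comes with a ranked candidate list and a gold set of acceptable translations (the cutoff k for Recall is an assumption):

```python
def mrr(ranked_lists, gold):
    """Mean Reciprocal Rank (sketch): for each query, 1/rank of the
    first correct candidate in its ranked list, 0 if none; averaged."""
    total = 0.0
    for cands, answers in zip(ranked_lists, gold):
        for rank, c in enumerate(cands, start=1):
            if c in answers:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at(ranked_lists, gold, k=10):
    """Fraction of queries with at least one correct candidate in the top k."""
    hits = sum(1 for cands, answers in zip(ranked_lists, gold)
               if any(c in answers for c in cands[:k]))
    return hits / len(ranked_lists)
```

For instance, with two queries where the correct translation is ranked 2nd and absent respectively, MRR is (1/2 + 0)/2 = 0.25 and Recall@2 is 0.5.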
Experiments & Results 4.1 Experimental Setup | Intrinsic evaluation metrics are faster to apply and are used to optimize different hyper-parameters of the approach (e.g. |
Experiments & Results 4.1 Experimental Setup | BLEU (Papineni et al., 2002) is still the de facto evaluation metric for machine translation and we use that to measure the quality of our proposed approaches for MT. |
Experiments | 3.1 Evaluation metrics |
Experiments | We use two evaluation metrics in our experiments. |
Experiments | Our segmenter achieves higher scores than MADA and MADA-ARZ on all datasets under both evaluation metrics . |
Experiments | 4.2 Evaluation Metrics |
Experiments | We will use conventional sequence labeling evaluation metrics such as sequence accuracy and character accuracy.
Experiments | Other evaluation metrics are also proposed by Zheng et al. (2011a), but they are only suitable for their system, since our system uses a joint model.
Experiments | 4.1 Datasets and Evaluation Metrics |
Experiments | Evaluation Metrics: We evaluate the proposed method in terms of precision (P), recall (R), and F-measure (F).
Experiments | To take into account the correctly expanded terms for both positive and negative seeds, we use Accuracy as the evaluation metric, |
Experiments | 3.1 Data Set and Evaluation Metrics |
Experiments | Evaluation Metrics: We evaluate the performance of question retrieval using the following metrics: Mean Average Precision (MAP) and Precision@N (P@N).
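Both retrieval metrics can be sketched briefly using their standard definitions (the evaluation scorer actually used may differ in tie handling or in how unjudged documents are treated):

```python
def precision_at_n(ranked, relevant, n):
    """P@N: fraction of the top-n retrieved items that are relevant."""
    return sum(1 for d in ranked[:n] if d in relevant) / n

def average_precision(ranked, relevant):
    """AP for one query: mean of P@k taken at each rank k that holds
    a relevant item, divided by the number of relevant items."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: average of per-query AP over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a query with relevant set {a, c} and ranking [a, b, c], AP = (1/1 + 2/3)/2 = 5/6, and P@2 = 0.5.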
Our Approach | where the feature vector Φ(q, d) = (s_VSM(q, d), s(q1, d1), s(q2, d2), …, s(qP, dP)), and θ is the corresponding weight vector; we optimize this parameter for our evaluation metrics directly using the Powell Search algorithm (Paul et al., 1992) via cross-validation.
Evaluation Setup | Evaluation Metrics: We use two evaluation metrics.
Experiment and Analysis | Moreover, increasing the number of coarse annotations used in training leads to further improvement on different evaluation metrics . |
Experiment and Analysis | Figure 5 also illustrates slightly different characteristics of transfer performance between the two evaluation metrics.
Experiment 1: Textual Similarity | Three evaluation metrics are provided by the organizers of the SemEval-2012 STS task, all of which are based on Pearson correlation r of human judgments with system outputs: (1) the correlation value for the concatenation of all five datasets (ALL), (2) a correlation value obtained on a concatenation of the outputs, separately normalized by least square (ALLnrm), and (3) the weighted average of Pearson correlations across datasets (Mean).
Experiment 1: Textual Similarity | Table 2 shows the scores obtained by ADW for the three evaluation metrics, as well as the Pearson correlation values obtained on each of the five test sets (rightmost columns).
Experiment 1: Textual Similarity | As can be seen from Table 2, our system (ADW) outperforms all the 88 participating systems according to all the evaluation metrics.
Conclusions and Future Work | We also proposed a new name-aware evaluation metric.
Introduction | Propose a new MT evaluation metric which can discriminate between names and non-informative words (Section 4).
Name-aware MT Evaluation | Traditional MT evaluation metrics such as BLEU (Papineni et al., 2002) and Translation Edit Rate (TER) (Snover et al., 2006) assign the same weights to all tokens equally. |
Abstract | Our experiments on the CoNLL-2012 Shared Task English datasets (gold mentions) indicate that our method is robust relative to different clustering strategies and evaluation metrics, showing large and consistent improvements over a single pairwise model using the same base features.
Experiments | 5.3 Evaluation metrics |
Introduction | As will be shown based on a variety of experiments on the CoNLL-2012 Shared Task English datasets, these improvements are consistent across different evaluation metrics and for the most part independent of the clustering decoder that was used. |
Conclusion and Future Work | We present an evaluation metric for whole-sentence semantic analysis, and show that it can be computed efficiently. |
Introduction | In this work, we provide an evaluation metric that uses the degree of overlap between two whole-sentence semantic structures as the partial credit. |
Semantic Overlap | Our evaluation metric measures precision, recall, and f-score of the triples in the second AMR against the triples in the first AMR, i.e., the amount of propositional overlap. |
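A simplified sketch of this triple-overlap computation. The key simplification, flagged in the comments, is that the AMR variables are assumed to be already aligned; the full metric must additionally search over variable mappings to maximize the overlap:

```python
def triple_overlap(ref_triples, hyp_triples):
    """Precision, recall, and f-score of one AMR's triples against
    another's (sketch). Assumes variable names are pre-aligned; the
    full metric searches for the mapping maximizing this overlap."""
    ref, hyp = set(ref_triples), set(hyp_triples)
    inter = len(ref & hyp)  # shared propositions
    p = inter / len(hyp) if hyp else 0.0
    r = inter / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Each triple is a (node, relation, value) tuple, so partial credit accrues per shared proposition rather than all-or-nothing per sentence.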
Paraphrasing with a Dual SMT System | MERT integrates the automatic evaluation metrics into the training process to achieve optimal end-to-end performance. |
Paraphrasing with a Dual SMT System | where G is the automatic evaluation metric for paraphrasing.
Paraphrasing with a Dual SMT System | 2.2 Paraphrase Evaluation Metrics |
Evaluation | Before describing the experiments and presenting the results, we first describe the evaluation metrics we use. |
Evaluation | 4.0.1 Evaluation Metrics |
Evaluation | We use two evaluation metrics to evaluate subgroups detection accuracy: Purity and Entropy. |
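The two clustering metrics can be sketched as follows, assuming each detected subgroup is represented as a list of its members' gold labels. Normalization conventions for entropy vary, so this sketch reports raw bits, weighted by cluster size:

```python
from collections import Counter
from math import log2

def purity(clusters):
    """Purity (higher is better): each cluster is credited with its
    majority gold label; return the fraction of items so covered."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters if c) / n

def entropy(clusters):
    """Size-weighted average label entropy of the clusters, in bits
    (lower is better; 0 means every cluster is pure)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        h = -sum((v / len(c)) * log2(v / len(c))
                 for v in Counter(c).values())
        total += (len(c) / n) * h
    return total
```

Perfectly homogeneous clusters give purity 1.0 and entropy 0.0; mixing labels within a cluster lowers purity and raises entropy.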
Introduction | Good performance on this task is the most desired property of evaluation metrics during system development. |
Results and discussion | These input-level accuracies compare favorably with automatic evaluation metrics for other natural language processing tasks. |
Results and discussion | For example, at the 2008 ACL Workshop on Statistical Machine Translation, all fifteen automatic evaluation metrics, including variants of BLEU scores, achieved between 42% and 56% pairwise accuracy with human judgments at the sentence level (Callison-Burch et al., 2008).