Abstract | Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. |
Abstract | This paper presents PORT, a new MT evaluation metric which combines precision, recall, and an ordering measure, and which is primarily designed for tuning MT systems.
BLEU and PORT | Several ordering measures have been integrated into MT evaluation metrics recently. |
Experiments | 3.1 PORT as an Evaluation Metric |
Experiments | We studied PORT as an evaluation metric on WMT data; the test sets include the WMT 2008, 2009, and 2010 all-to-English submissions, plus the 2009 and 2010 English-to-all submissions.
Experiments | This is because we designed PORT for tuning: we did not optimize its performance as an evaluation metric, but rather its performance for system tuning.
Introduction | Automatic evaluation metrics for machine translation (MT) quality are a key part of building statistical MT (SMT) systems. |
Introduction | MT Evaluation Metric for Tuning
Introduction | These methods perform repeated decoding runs with different system parameter values, which are tuned to optimize the value of the evaluation metric over a development set with reference translations. |
Introduction | These methods are effective because they tune the system to maximize an automatic evaluation metric, such as BLEU, which serves as a surrogate objective for translation quality.
Introduction | While many alternatives have been proposed, such a perfect evaluation metric remains elusive. |
Introduction | As a result, many MT evaluation campaigns now report multiple evaluation metrics (Callison-Burch et al., 2011; Paul, 2010).
Opportunities and Limitations | Leveraging the diverse perspectives of different evaluation metrics has the potential to improve overall quality. |
Related Work | It would be unfortunate if a good evaluation metric could not be used for tuning.
Conclusion | In this work, we devise a new MT evaluation metric in the family of TESLA (Translation Evaluation of Sentences with Linear-programming-based Analysis), called TESLA-CELAB (Character-level Evaluation for Languages with Ambiguous word Boundaries), to address the problem of fuzzy word boundaries in the Chinese language, although neither the phenomenon nor the method is unique to Chinese. |
Introduction | The Workshop on Statistical Machine Translation (WMT) hosts regular campaigns comparing different machine translation evaluation metrics (Callison-Burch et al., 2009; Callison-Burch et al., 2010; Callison-Burch et al., 2011). |
Introduction | The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 − TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
The Algorithm | Notice that all n-grams are put in the same matching problem regardless of n, unlike in translation evaluation metrics designed for European languages. |
The Algorithm | This relationship is implicit in the matching problem for English translation evaluation metrics where words are well delimited. |
The Algorithm | Many prior translation evaluation metrics such as MAXSIM (Chan and Ng, 2008) and TESLA (Liu et al., 2010; Dahlmeier et al., 2011) use the F-0.8 measure as the final score: |
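The Algorithm | A minimal sketch of that score, assuming the weighted harmonic mean of precision and recall used by these metrics, which places more weight on recall than on precision:
The Algorithm | F_0.8 = (P × R) / (0.8 × P + 0.2 × R), where P and R denote the precision and recall of the n-gram matching.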
Abstract | We present a novel approach to the automatic acquisition of a VerbNet-like classification of French verbs which involves the use of (i) a neural clustering method which associates clusters with features, (ii) several supervised and unsupervised evaluation metrics, and (iii) various existing syntactic and semantic lexical resources.
Clustering Methods, Evaluation Metrics and Experimental Setup | 3.2 Evaluation metrics |
Clustering Methods, Evaluation Metrics and Experimental Setup | We use several evaluation metrics which bear on different properties of the clustering. |
Clustering Methods, Evaluation Metrics and Experimental Setup | As pointed out in (Lamirel et al., 2008; Attik et al., 2006), unsupervised evaluation metrics based on cluster labelling and feature maximisation can prove very useful for identifying the best clustering strategy. |
Features and Data | Moreover, for this data set, the unsupervised evaluation metrics (cf. |
Evaluation | Before describing the experiments and presenting the results, we first describe the evaluation metrics we use. |
Evaluation | 4.0.1 Evaluation Metrics |
Evaluation | We use two evaluation metrics to evaluate subgroup detection accuracy: Purity and Entropy.
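Evaluation | As a minimal sketch, assuming the standard clustering definitions (the paper's exact normalization may differ): let n_kj be the number of items of gold class j assigned to cluster k, n_k = Σ_j n_kj, and N the total number of items.
Evaluation | Then Purity = (1/N) Σ_k max_j n_kj and Entropy = Σ_k (n_k / N) · ( − Σ_j (n_kj / n_k) log (n_kj / n_k) ); higher purity and lower entropy indicate more accurate subgroup detection.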
Paraphrasing with a Dual SMT System | MERT integrates the automatic evaluation metric into the training process to achieve optimal end-to-end performance.
Paraphrasing with a Dual SMT System | In this objective (Equation 2), G is the automatic evaluation metric for paraphrasing.
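Paraphrasing with a Dual SMT System | A minimal sketch of such a tuning objective, assuming the standard MERT formulation (Och, 2003) rather than the exact form of Equation 2 in the source:
Paraphrasing with a Dual SMT System | λ̂ = argmax_λ Σ_s G(ê(f_s; λ), r_s), with ê(f_s; λ) = argmax_e λ · h(e, f_s), where f_s is a source sentence, r_s its reference paraphrase, and h(e, f_s) the feature vector of candidate e.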
Paraphrasing with a Dual SMT System | 2.2 Paraphrase Evaluation Metrics |