Index of papers in Proc. ACL 2012 that mention
  • evaluation metrics
Chen, Boxing and Kuhn, Roland and Larkin, Samuel
Abstract
Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU.
Abstract
This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems.
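(A rough illustration of how precision, recall, and an ordering score can be folded into one tuning metric; the Python sketch below is a hypothetical blend, not PORT's published formula, and the alpha weight and function name are assumptions.)

# Hypothetical sketch: combine precision, recall and an ordering score
# into a single segment-level score. NOT PORT's actual formula.
def combined_metric(precision, recall, ordering, alpha=0.5):
    """Weighted harmonic mean of precision/recall, scaled by an ordering score in [0, 1]."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    f_mean = (precision * recall) / (alpha * precision + (1.0 - alpha) * recall)
    return f_mean * ordering  # heavily scrambled word order pulls the score down

print(combined_metric(0.62, 0.55, 0.90))  # roughly 0.52 for these made-up values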
BLEU and PORT
Several ordering measures have been integrated into MT evaluation metrics recently.
Experiments
3.1 PORT as an Evaluation Metric
Experiments
We studied PORT as an evaluation metric on WMT data; test sets include WMT 2008, WMT 2009, and WMT 2010 all-to-English, plus 2009, 2010 English-to-all submissions.
Experiments
This is because we designed PORT to carry out tuning; we did not optimize its performance as an evaluation metric, but rather optimized it for system tuning performance.
Introduction
Automatic evaluation metrics for machine translation (MT) quality are a key part of building statistical MT (SMT) systems.
Introduction
MT Evaluation Metric for Tuning
Introduction
These methods perform repeated decoding runs with different system parameter values, which are tuned to optimize the value of the evaluation metric over a development set with reference translations.
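(A minimal Python sketch of such a metric-driven tuning loop; decode and metric are stand-in functions, and the random search here is only a toy stand-in for what MERT or PRO actually do with n-best lists and line search.)

import random

def tune(decode, metric, dev_source, dev_refs, n_candidates=50, dim=8, seed=0):
    """Toy stand-in for MERT-style tuning: repeatedly decode the development set
    with different feature-weight vectors and keep the vector that maximizes the metric."""
    rng = random.Random(seed)
    best_weights, best_score = None, float("-inf")
    for _ in range(n_candidates):
        weights = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        hypotheses = decode(dev_source, weights)   # one decoding run per candidate vector
        score = metric(hypotheses, dev_refs)       # e.g. corpus-level BLEU against the references
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score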
evaluation metrics is mentioned in 12 sentences in this paper.
Duh, Kevin and Sudoh, Katsuhito and Wu, Xianchao and Tsukada, Hajime and Nagata, Masaaki
Introduction
These methods are effective because they tune the system to maximize an automatic evaluation metric such as BLEU, which serves as a surrogate objective for translation quality.
Introduction
While many alternatives have been proposed, such a perfect evaluation metric remains elusive.
Introduction
As a result, many MT evaluation campaigns now report multiple evaluation metrics (Callison-Burch et al., 2011; Paul, 2010).
Opportunities and Limitations
Leveraging the diverse perspectives of different evaluation metrics has the potential to improve overall quality.
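(One generic way to act on several metrics at once is to compare candidate configurations by Pareto dominance rather than by a single combined score; the Python sketch below only illustrates that idea and is not necessarily this paper's procedure. The metric tuples are made-up (BLEU, METEOR) pairs.)

def dominates(a, b):
    """a dominates b if a is at least as good on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates):
    """Keep the candidates whose metric vectors no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

print(pareto_frontier([(0.30, 0.52), (0.28, 0.55), (0.27, 0.50)]))
# -> [(0.3, 0.52), (0.28, 0.55)]; the third candidate is dominated by the first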
Related Work
If a good evaluation metric could not be used for tuning, it would be a pity.
evaluation metrics is mentioned in 6 sentences in this paper.
Liu, Chang and Ng, Hwee Tou
Conclusion
In this work, we devise a new MT evaluation metric in the family of TESLA (Translation Evaluation of Sentences with Linear-programming-based Analysis), called TESLA-CELAB (Character-level Evaluation for Languages with Ambiguous word Boundaries), to address the problem of fuzzy word boundaries in the Chinese language, although neither the phenomenon nor the method is unique to Chinese.
Introduction
The Workshop on Statistical Machine Translation (WMT) hosts regular campaigns comparing different machine translation evaluation metrics (Callison-Burch et al., 2009; Callison-Burch et al., 2010; Callison-Burch et al., 2011).
Introduction
The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 - TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
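(A minimal Python sketch of what character-level evaluation amounts to: split hypothesis and reference into single characters before scoring, so word segmentation no longer matters; the unigram precision below is only a stand-in for BLEU, NIST, METEOR, GTM or TER.)

def char_tokenize(text):
    """Treat every non-space character as a token, ignoring any word segmentation."""
    return [ch for ch in text if not ch.isspace()]

def unigram_precision(hyp_tokens, ref_tokens):
    """Clipped unigram precision of the hypothesis against a single reference."""
    ref_counts = {}
    for t in ref_tokens:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    matched = 0
    for t in hyp_tokens:
        if ref_counts.get(t, 0) > 0:
            ref_counts[t] -= 1
            matched += 1
    return matched / len(hyp_tokens) if hyp_tokens else 0.0

hyp = char_tokenize("北京 大学")   # a segmented hypothesis
ref = char_tokenize("北京大学")    # an unsegmented reference
print(unigram_precision(hyp, ref))  # 1.0: the differing segmentations agree at character level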
The Algorithm
Notice that all n-grams are put in the same matching problem regardless of n, unlike in translation evaluation metrics designed for European languages.
The Algorithm
This relationship is implicit in the matching problem for English translation evaluation metrics where words are well delimited.
The Algorithm
Many prior translation evaluation metrics such as MAXSIM (Chan and Ng, 2008) and TESLA (Liu et al., 2010; Dahlmeier et al., 2011) use the F-0.8 measure as the final score:
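(The snippet stops where the formula would follow; as a reconstruction from the standard definition rather than a verbatim quote of the paper's equation, the weighted F-measure these metrics call F-0.8 is conventionally written, with P for precision and R for recall,

F_{0.8} = \frac{P \cdot R}{0.8\,P + 0.2\,R},

which places more weight on recall than on precision.)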
evaluation metrics is mentioned in 6 sentences in this paper.
Falk, Ingrid and Gardent, Claire and Lamirel, Jean-Charles
Abstract
We present a novel approach to the automatic acquisition of a VerbNet-like classification of French verbs which involves the use (i) of a neural clustering method which associates clusters with features, (ii) of several supervised and unsupervised evaluation metrics and (iii) of various existing syntactic and semantic lexical resources.
Clustering Methods, Evaluation Metrics and Experimental Setup
3.2 Evaluation metrics
Clustering Methods, Evaluation Metrics and Experimental Setup
We use several evaluation metrics which bear on different properties of the clustering.
Clustering Methods, Evaluation Metrics and Experimental Setup
As pointed out in (Lamirel et al., 2008; Attik et al., 2006), unsupervised evaluation metrics based on cluster labelling and feature maximisation can prove very useful for identifying the best clustering strategy.
Features and Data
Moreover, for this data set, the unsupervised evaluation metrics (cf.
evaluation metrics is mentioned in 5 sentences in this paper.
Abu-Jbara, Amjad and Dasigi, Pradeep and Diab, Mona and Radev, Dragomir
Evaluation
Before describing the experiments and presenting the results, we first describe the evaluation metrics we use.
Evaluation
4.0.1 Evaluation Metrics
Evaluation
We use two evaluation metrics to evaluate subgroup detection accuracy: Purity and Entropy.
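(For reference, both metrics compare the predicted clusters against gold labels; below is a minimal Python sketch of the standard definitions, noting that the paper may normalize entropy differently, for example by the log of the number of classes.)

import math
from collections import Counter

def _group_by_cluster(clusters, gold):
    by_cluster = {}
    for c, g in zip(clusters, gold):
        by_cluster.setdefault(c, []).append(g)
    return by_cluster

def purity(clusters, gold):
    """Fraction of items that carry their cluster's majority gold label (higher is better)."""
    by_cluster = _group_by_cluster(clusters, gold)
    return sum(Counter(members).most_common(1)[0][1]
               for members in by_cluster.values()) / len(gold)

def entropy(clusters, gold):
    """Size-weighted average of the gold-label entropy inside each cluster (lower is better)."""
    by_cluster = _group_by_cluster(clusters, gold)
    total = 0.0
    for members in by_cluster.values():
        counts = Counter(members)
        h = -sum((k / len(members)) * math.log2(k / len(members)) for k in counts.values())
        total += (len(members) / len(gold)) * h
    return total

clusters = [0, 0, 0, 1, 1, 1]
gold     = ["A", "A", "B", "B", "B", "B"]
print(purity(clusters, gold), entropy(clusters, gold))  # 0.833..., 0.459...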
evaluation metrics is mentioned in 3 sentences in this paper.
Sun, Hong and Zhou, Ming
Paraphrasing with a Dual SMT System
MERT integrates the automatic evaluation metrics into the training process to achieve optimal end-to-end performance.
Paraphrasing with a Dual SMT System
where G is the automatic evaluation metric for paraphrasing (Equation 2 in the paper).
Paraphrasing with a Dual SMT System
2.2 Paraphrase Evaluation Metrics
evaluation metrics is mentioned in 3 sentences in this paper.