Index of papers in Proc. ACL 2014 that mention
  • evaluation metrics
Guzmán, Francisco and Joty, Shafiq and Màrquez, Lluís and Nakov, Preslav
Abstract
Then, we show that these measures can help improve a number of existing machine translation evaluation metrics both at the segment- and at the system-level.
Abstract
Rather than proposing a single new metric, we show that discourse information is complementary to the state-of-the-art evaluation metrics, and thus should be taken into account in the development of future richer evaluation metrics.
Experimental Setup
4.1 MT Evaluation Metrics
Introduction
We believe that the semantic and pragmatic information captured in the form of DTs (i) can help develop discourse-aware SMT systems that produce coherent translations, and (ii) can yield better MT evaluation metrics.
Introduction
In this paper, rather than proposing yet another MT evaluation metric, we show that discourse information is complementary to many existing evaluation metrics, and thus should not be ignored.
Introduction
We first design two discourse-aware similarity measures, which use DTs generated by a publicly-available discourse parser (Joty et al., 2012); then, we show that they can help improve a number of MT evaluation metrics at the segment- and at the system-level in the context of the WMT11 and the WMT12 metrics shared tasks (Callison-Burch et al., 2011; Callison-Burch et al., 2012).
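The combination itself can be as simple as a weighted interpolation of segment-level scores. A minimal illustrative sketch in Python follows; the weight and the component scores are hypothetical, not the paper's tuned combination.

```python
# Illustrative sketch only: combining an existing segment-level metric score
# with a discourse similarity score by linear interpolation.
# Weight and scores below are made up for illustration.
def combined_score(base_metric_score, discourse_score, weight=0.5):
    return (1.0 - weight) * base_metric_score + weight * discourse_score

print(combined_score(0.42, 0.60, weight=0.3))  # 0.474
```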
Our Discourse-Based Measures
In order to develop a discourse-aware evaluation metric, we first generate discourse trees for the reference and the system-translated sentences using a discourse parser, and then we measure the similarity between the two discourse trees.
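The paper's measures are based on tree kernels over discourse trees; the sketch below is not the authors' method, only a toy node-path overlap between two small trees, with a hypothetical nested-tuple tree representation, to make the "compare reference and hypothesis discourse trees" idea concrete.

```python
# Toy sketch of a discourse-tree similarity (NOT the paper's tree kernels):
# each tree is (label, [children]); similarity is a Dice coefficient over
# the multisets of root-to-node label paths.
from collections import Counter

def node_paths(tree, prefix=()):
    """Yield root-to-node label paths, a crude proxy for tree structure."""
    label, children = tree
    path = prefix + (label,)
    yield path
    for child in children:
        yield from node_paths(child, path)

def tree_similarity(ref_tree, hyp_tree):
    """Dice coefficient over the label-path multisets of two trees."""
    ref_paths = Counter(node_paths(ref_tree))
    hyp_paths = Counter(node_paths(hyp_tree))
    overlap = sum((ref_paths & hyp_paths).values())
    total = sum(ref_paths.values()) + sum(hyp_paths.values())
    return 2.0 * overlap / total if total else 0.0

# Hypothetical discourse trees with relation/nuclearity labels.
ref = ("Elaboration", [("Nucleus", []), ("Satellite", [])])
hyp = ("Elaboration", [("Nucleus", []), ("Nucleus", [])])
print(tree_similarity(ref, hyp))  # 0.666...
```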
Related Work
A common argument is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012).
Related Work
Thus, there is consensus that discourse-informed MT evaluation metrics are needed in order to advance research in this direction.
Related Work
The field of automatic evaluation metrics for MT is very active, and new metrics are continuously being proposed, especially in the context of the evaluation campaigns that run as part of the Workshops on Statistical Machine Translation (WMT 2008-2012), and NIST Metrics for Machine Translation Challenge (MetricsMATR), among others.
evaluation metrics is mentioned in 21 sentences in this paper.
Lo, Chi-kiu and Beloucif, Meriem and Saers, Markus and Wu, Dekai
Abstract
We introduce XMEANT—a new cross-lingual version of the semantic frame based MT evaluation metric MEANT—which can correlate even more closely with human adequacy judgments than monolingual MEANT and eliminates the need for expensive human references.
Introduction
It is well established that the MEANT family of metrics correlates better with human adequacy judgments than commonly used MT evaluation metrics (Lo and Wu, 2011a, 2012; Lo et al., 2012; Lo and Wu, 2013b; Machacek and Bojar, 2013).
Introduction
We therefore propose XMEANT, a cross-lingual MT evaluation metric that modifies MEANT using (1) simple translation probabilities (in our experiments,
Related Work
2.1 MT evaluation metrics
Results
Table 1 shows that for human adequacy judgments at the sentence level, the f-score based XMEANT (1) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT.
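How closely a metric "correlates" with human adequacy judgments is usually summarized by a correlation coefficient over segment-level scores. A minimal sketch follows, with made-up numbers and SciPy's Kendall and Pearson coefficients; this is only the general idea, not the exact correlation protocol used in the paper.

```python
# Sketch of segment-level metric-human correlation; all scores are
# hypothetical, purely for illustration.
from scipy.stats import kendalltau, pearsonr

human_adequacy = [4.0, 2.5, 3.0, 1.0, 4.5]       # hypothetical human judgments
metric_scores  = [0.62, 0.35, 0.41, 0.20, 0.71]  # hypothetical metric outputs

tau, _ = kendalltau(human_adequacy, metric_scores)
r, _ = pearsonr(human_adequacy, metric_scores)
print(f"Kendall tau = {tau:.3f}, Pearson r = {r:.3f}")
```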
evaluation metrics is mentioned in 5 sentences in this paper.
Skjaerholt, Arne
Abstract
With this in mind, it is striking that virtually all evaluations of syntactic annotation efforts use uncorrected parser evaluation metrics such as bracket F1 (for phrase structure) and accuracy scores (for dependencies).
Abstract
To evaluate our metric, we first present a number of synthetic experiments to better control the sources of noise and gauge the metric’s responses, before finally contrasting the behaviour of our chance-corrected metric with that of uncorrected parser evaluation metrics on real corpora.
Conclusion
In this task, inserting and deleting nodes is an integral part of the annotation, and if two annotators insert or delete different nodes, the all-or-nothing requirement of identical yield makes the LAS metric unusable as an evaluation metric in this setting.
Real-world corpora
In our evaluation, we will contrast labelled accuracy, the standard parser evaluation metric, and our three α metrics.
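Labelled accuracy for dependencies counts a token as correct only when both its head and its relation label match the gold analysis. A minimal sketch with a hypothetical data layout:

```python
# Minimal sketch of labelled accuracy (LAS) for dependency analyses.
# The (head_index, relation_label) layout is hypothetical.
def labelled_accuracy(gold, pred):
    """gold, pred: lists of (head_index, relation_label), one per token."""
    assert len(gold) == len(pred)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold) if gold else 0.0

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (1, "obj")]
print(labelled_accuracy(gold, pred))  # 0.666...
```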
Synthetic experiments
The de facto standard parser evaluation metric in dependency parsing.
evaluation metrics is mentioned in 5 sentences in this paper.
Hasan, Kazi Saidul and Ng, Vincent
Evaluation
4.1 Evaluation Metrics
Evaluation
Designing evaluation metrics for keyphrase extraction is by no means an easy task.
Evaluation
To score the output of a keyphrase extraction system, the typical approach, which is also adopted by the SemEval-2010 shared task on keyphrase extraction, is (1) to create a mapping between the keyphrases in the gold standard and those in the system output using exact match, and then (2) to score the output using evaluation metrics such as precision (P), recall (R), and F-score (F).
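Under exact match, the mapping reduces to set intersection between normalized gold and predicted keyphrases. A minimal sketch; normalization here is simple lower-casing, whereas the SemEval-2010 protocol additionally applies stemming, which is omitted:

```python
# Sketch of exact-match precision/recall/F-score for keyphrase extraction.
def prf_exact_match(gold_phrases, predicted_phrases):
    gold = {p.lower().strip() for p in gold_phrases}
    pred = {p.lower().strip() for p in predicted_phrases}
    matched = gold & pred                     # exact-match mapping
    precision = len(matched) / len(pred) if pred else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

print(prf_exact_match(["machine translation", "evaluation metric"],
                      ["Evaluation metric", "discourse parsing"]))
# (0.5, 0.5, 0.5)
```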
evaluation metrics is mentioned in 4 sentences in this paper.
Kawahara, Daisuke and Peterson, Daniel W. and Palmer, Martha
Experiments and Evaluations
We first describe our experimental settings and define evaluation metrics to evaluate induced soft clusterings of verb classes.
Experiments and Evaluations
4.2 Evaluation Metrics
Experiments and Evaluations
This kind of normalization for soft clusterings was performed for other evaluation metrics as in Springorum et al.
evaluation metrics is mentioned in 4 sentences in this paper.
Jia, Zhongye and Zhao, Hai
Experiments
4.2 Evaluation Metrics
Experiments
We will use conventional sequence labeling evaluation metrics such as sequence accuracy and character accuracy.
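Both measures are plain accuracies at different granularities: sequence accuracy requires the whole output sequence to match, character accuracy counts matches position by position. A minimal sketch with a hypothetical data layout, not the authors' exact scoring script:

```python
# Sketch of sequence accuracy vs. character accuracy for sequence labeling.
def sequence_accuracy(gold_seqs, pred_seqs):
    """Fraction of sequences predicted entirely correctly."""
    correct = sum(1 for g, p in zip(gold_seqs, pred_seqs) if g == p)
    return correct / len(gold_seqs) if gold_seqs else 0.0

def character_accuracy(gold_seqs, pred_seqs):
    """Fraction of individual positions predicted correctly."""
    total = correct = 0
    for g, p in zip(gold_seqs, pred_seqs):
        total += len(g)
        correct += sum(1 for gc, pc in zip(g, p) if gc == pc)
    return correct / total if total else 0.0

gold = [["你", "好"], ["世", "界"]]
pred = [["你", "好"], ["世", "戒"]]
print(sequence_accuracy(gold, pred))   # 0.5
print(character_accuracy(gold, pred))  # 0.75
```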
Experiments
Other evaluation metrics are also proposed by Zheng et al. (2011a), which are only suitable for their system, since our system uses a joint model.
evaluation metrics is mentioned in 3 sentences in this paper.
Monroe, Will and Green, Spence and Manning, Christopher D.
Experiments
3.1 Evaluation metrics
Experiments
We use two evaluation metrics in our experiments.
Experiments
Our segmenter achieves higher scores than MADA and MADA-ARZ on all datasets under both evaluation metrics .
evaluation metrics is mentioned in 3 sentences in this paper.
Xu, Liheng and Liu, Kang and Lai, Siwei and Zhao, Jun
Experiments
4.1 Datasets and Evaluation Metrics
Experiments
Evaluation Metrics: We evaluate the proposed method in terms of precision (P), recall (R), and F-measure (F).
Experiments
To take into account the correctly expanded terms for both positive and negative seeds, we use Accuracy as the evaluation metric.
evaluation metrics is mentioned in 3 sentences in this paper.