Abstract | Then, we show that these measures can help improve a number of existing machine translation evaluation metrics, both at the segment level and at the system level.
Abstract | Rather than proposing a single new metric, we show that discourse information is complementary to the state-of-the-art evaluation metrics, and thus should be taken into account in the development of future richer evaluation metrics.
Experimental Setup | 4.1 MT Evaluation Metrics |
Introduction | We believe that the semantic and pragmatic information captured in the form of DTs (i) can help develop discourse-aware SMT systems that produce coherent translations, and (ii) can yield better MT evaluation metrics.
Introduction | In this paper, rather than proposing yet another MT evaluation metric, we show that discourse information is complementary to many existing evaluation metrics, and thus should not be ignored.
Introduction | We first design two discourse-aware similarity measures, which use DTs generated by a publicly-available discourse parser (Joty et al., 2012); then, we show that they can help improve a number of MT evaluation metrics at the segment and system levels in the context of the WMT11 and WMT12 metrics shared tasks (Callison-Burch et al., 2011; Callison-Burch et al., 2012).
Our Discourse-Based Measures | In order to develop a discourse-aware evaluation metric, we first generate discourse trees for the reference and the system-translated sentences using a discourse parser, and then we measure the similarity between the two discourse trees.
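As a hypothetical illustration of the second step, measuring similarity between two discourse trees, one simple option is to score the overlap of their labeled subtrees. This is only a sketch of the general idea; the paper's actual measures are defined over the parser's output and may differ substantially.

```python
# Hypothetical sketch: compare two discourse trees, represented as nested
# tuples like ('Elaboration', ('EDU', 'a'), ('EDU', 'b')), by the Dice
# overlap of their labeled subtrees. Not the paper's exact measure.

def subtrees(tree):
    """Yield every labeled subtree of a nested-tuple discourse tree."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from subtrees(child)

def dt_similarity(ref_tree, sys_tree):
    """Dice coefficient over the sets of subtrees of the two trees."""
    r_set = set(subtrees(ref_tree))
    s_set = set(subtrees(sys_tree))
    if not r_set and not s_set:
        return 1.0
    return 2 * len(r_set & s_set) / (len(r_set) + len(s_set))
```

Identical trees score 1.0; trees that share leaves but disagree on the relation label at the root score lower, which is the kind of gradient a similarity-based metric needs.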
Related Work | A common argument is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012).
Related Work | Thus, there is consensus that discourse-informed MT evaluation metrics are needed in order to advance research in this direction. |
Related Work | The field of automatic evaluation metrics for MT is very active, and new metrics are continuously being proposed, especially in the context of the evaluation campaigns that run as part of the Workshops on Statistical Machine Translation (WMT 2008-2012) and the NIST Metrics for Machine Translation Challenge (MetricsMATR), among others.
Abstract | We introduce XMEANT, a new cross-lingual version of the semantic frame based MT evaluation metric MEANT, which can correlate even more closely with human adequacy judgments than monolingual MEANT while eliminating the need for expensive human references.
Introduction | It is well established that the MEANT family of metrics correlates better with human adequacy judgments than commonly used MT evaluation metrics (Lo and Wu, 2011a, 2012; Lo et al., 2012; Lo and Wu, 2013b; Machacek and Bojar, 2013). |
Introduction | We therefore propose XMEANT, a cross-lingual MT evaluation metric, that modifies MEANT using (1) simple translation probabilities (in our experiments,
Related Work | 2.1 MT evaluation metrics |
Results | Table 1 shows that for human adequacy judgments at the sentence level, the f-score based XMEANT (1) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT.
Abstract | With this in mind, it is striking that virtually all evaluations of syntactic annotation efforts use uncorrected parser evaluation metrics such as bracket F1 (for phrase structure) and accuracy scores (for dependencies). |
Abstract | To evaluate our metric we first present a number of synthetic experiments to better control the sources of noise and gauge the metric’s responses, before finally contrasting the behaviour of our chance-corrected metric with that of uncorrected parser evaluation metrics on real |
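Chance correction in the kappa/alpha family generally rescales observed agreement against the agreement expected by chance alone. The sketch below shows only this shared correction scheme, not the paper's specific chance-corrected metric, whose agreement and expectation terms are computed over syntactic annotations.

```python
def chance_corrected(observed, expected):
    """Generic chance correction used by the kappa/alpha family of
    agreement measures: how far observed agreement exceeds what chance
    alone would yield, scaled so 1.0 is perfect and 0.0 is chance level."""
    if expected >= 1.0:
        raise ValueError("expected agreement must be < 1")
    return (observed - expected) / (1.0 - expected)
```

This is why uncorrected scores like bracket F1 can look deceptively high: with a high chance baseline, the corrected value can be far lower than the raw agreement.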
Conclusion | In this task, inserting and deleting nodes is an integral part of the annotation; if two annotators insert or delete different nodes, the all-or-nothing requirement of identical yield makes LAS impossible to use as an evaluation metric in this setting.
Real-world corpora | In our evaluation, we will contrast labelled accuracy, the standard parser evaluation metric, and our three α metrics.
Synthetic experiments | 6 The de facto standard parser evaluation metric in dependency parsing.
Evaluation | 4.1 Evaluation Metrics |
Evaluation | Designing evaluation metrics for keyphrase extraction is by no means an easy task. |
Evaluation | To score the output of a keyphrase extraction system, the typical approach, which is also adopted by the SemEval-2010 shared task on keyphrase extraction, is (1) to create a mapping between the keyphrases in the gold standard and those in the system output using exact match, and then (2) to score the output using evaluation metrics such as precision (P), recall (R), and F-score (F).
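The two-step exact-match procedure described above can be sketched as follows. The lowercasing normalisation is an assumption for illustration; the shared-task description itself only specifies exact match.

```python
def keyphrase_prf(gold, predicted):
    """Exact-match keyphrase scoring: map system keyphrases onto the gold
    standard by string identity, then compute precision, recall, and F-score.
    Lowercasing is an illustrative normalisation choice, not part of the spec."""
    gold_set = {k.lower() for k in gold}
    pred_set = {k.lower() for k in predicted}
    matched = len(gold_set & pred_set)
    p = matched / len(pred_set) if pred_set else 0.0
    r = matched / len(gold_set) if gold_set else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Exact match is strict by design: a system keyphrase that differs from the gold phrase by a single token (e.g. a plural) counts as a miss, which is one reason designing such metrics is not easy.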
Experiments and Evaluations | We first describe our experimental settings and define evaluation metrics to evaluate induced soft clusterings of verb classes. |
Experiments and Evaluations | 4.2 Evaluation Metrics |
Experiments and Evaluations | This kind of normalization for soft clusterings was performed for other evaluation metrics as in Springorum et al. |
Experiments | 4.2 Evaluation Metrics |
Experiments | We will use conventional sequence labeling evaluation metrics such as sequence accuracy and character accuracy.
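These two conventional metrics can be computed as below; this sketch assumes each predicted label sequence is aligned with, and the same length as, its gold counterpart, which holds for character-level labeling.

```python
def sequence_accuracy(gold_seqs, pred_seqs):
    """Fraction of sequences whose entire label sequence is correct."""
    correct = sum(g == p for g, p in zip(gold_seqs, pred_seqs))
    return correct / len(gold_seqs)

def character_accuracy(gold_seqs, pred_seqs):
    """Fraction of per-character labels predicted correctly, assuming
    gold and predicted sequences are aligned and of equal length."""
    total = sum(len(g) for g in gold_seqs)
    correct = sum(gc == pc
                  for g, p in zip(gold_seqs, pred_seqs)
                  for gc, pc in zip(g, p))
    return correct / total
```

Sequence accuracy is the stricter of the two: a single wrong character label makes the whole sequence count as incorrect.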
Experiments | Other evaluation metrics are also proposed by Zheng et al. (2011a), but they are suitable only for their system, since our system uses a joint model.
Experiments | 3.1 Evaluation metrics |
Experiments | We use two evaluation metrics in our experiments. |
Experiments | Our segmenter achieves higher scores than MADA and MADA-ARZ on all datasets under both evaluation metrics.
Experiments | 4.1 Datasets and Evaluation Metrics |
Experiments | Evaluation Metrics: We evaluate the proposed method in terms of precision (P), recall (R), and F-measure (F).
Experiments | To take into account the correctly expanded terms for both positive and negative seeds, we use Accuracy as the evaluation metric, |
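A minimal sketch of such an accuracy computation, counting a positive seed as correct when it is expanded and a negative seed as correct when it is not; the function and variable names are illustrative, not taken from the paper.

```python
def expansion_accuracy(positive_seeds, negative_seeds, expanded_terms):
    """Accuracy over both seed polarities: a positive seed is correct if
    it appears among the expanded terms, a negative seed if it does not.
    Illustrative sketch; names are assumptions, not from the source."""
    correct = sum(t in expanded_terms for t in positive_seeds)
    correct += sum(t not in expanded_terms for t in negative_seeds)
    return correct / (len(positive_seeds) + len(negative_seeds))
```

Unlike precision and recall alone, this single number rewards the system both for expanding the right terms and for refusing to expand the wrong ones.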