Index of papers in Proc. ACL that mention
  • evaluation metrics
Amigó, Enrique and Giménez, Jesús and Gonzalo, Julio and Verdejo, Felisa
Abstract
In this work, we propose a novel approach for the meta-evaluation of MT evaluation metrics, since correlation coefficients against human judges do not reveal details about the advantages and disadvantages of particular metrics.
Abstract
We then use this approach to investigate the benefits of introducing linguistic features into evaluation metrics.
Alternatives to Correlation-based Meta-evaluation
However, each automatic evaluation metric has its own scale properties.
Alternatives to Correlation-based Meta-evaluation
This conclusion motivates the incorporation of linguistic processing into automatic evaluation metrics.
Alternatives to Correlation-based Meta-evaluation
In order to obtain additional evidence about the usefulness of combining evaluation metrics at different processing levels, let us consider the following situation: given a set of reference translations we want to train a combined system that takes the most appropriate translation approach for each text segment.
Correlation with Human Judgements
Figure 1 shows the correlation obtained by each automatic evaluation metric at system level (horizontal axis) versus segment level (vertical axis) in our test beds.
Introduction
These automatic evaluation metrics allow developers to optimize their systems without the need for expensive human assessments for each of their possible system configurations.
Introduction
In the context of Machine Translation, a considerable effort has also been made to include deeper linguistic information in automatic evaluation metrics, both syntactic and semantic (see Section 2 for details).
Introduction
Analyzing the reliability of evaluation metrics requires meta-evaluation criteria.
Previous Work on Machine Translation Meta-Evaluation
As automatic evaluation metrics for machine translation have been proposed, different meta-evaluation frameworks have gradually been introduced.
evaluation metrics is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Guzmán, Francisco and Joty, Shafiq and Màrquez, Lluís and Nakov, Preslav
Abstract
Then, we show that these measures can help improve a number of existing machine translation evaluation metrics both at the segment- and at the system-level.
Abstract
Rather than proposing a single new metric, we show that discourse information is complementary to the state-of-the-art evaluation metrics, and thus should be taken into account in the development of future richer evaluation metrics.
Experimental Setup
4.1 MT Evaluation Metrics
Introduction
We believe that the semantic and pragmatic information captured in the form of DTs (i) can help develop discourse-aware SMT systems that produce coherent translations, and (ii) can yield better MT evaluation metrics.
Introduction
In this paper, rather than proposing yet another MT evaluation metric, we show that discourse information is complementary to many existing evaluation metrics, and thus should not be ignored.
Introduction
We first design two discourse-aware similarity measures, which use DTs generated by a publicly-available discourse parser (Joty et al., 2012); then, we show that they can help improve a number of MT evaluation metrics at the segment- and at the system-level in the context of the WMT11 and the WMT12 metrics shared tasks (Callison-Burch et al., 2011; Callison-Burch et al., 2012).
Our Discourse-Based Measures
In order to develop a discourse-aware evaluation metric, we first generate discourse trees for the reference and the system-translated sentences using a discourse parser, and then we measure the similarity between the two discourse trees.
Related Work
A common argument is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012).
Related Work
Thus, there is consensus that discourse-informed MT evaluation metrics are needed in order to advance research in this direction.
Related Work
The field of automatic evaluation metrics for MT is very active, and new metrics are continuously being proposed, especially in the context of the evaluation campaigns that run as part of the Workshops on Statistical Machine Translation (WMT 2008-2012), and NIST Metrics for Machine Translation Challenge (MetricsMATR), among others.
evaluation metrics is mentioned in 21 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Kuhn, Roland and Larkin, Samuel
Abstract
Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU.
Abstract
This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems.
BLEU and PORT
Several ordering measures have been integrated into MT evaluation metrics recently.
Experiments
3.1 PORT as an Evaluation Metric
Experiments
We studied PORT as an evaluation metric on WMT data; test sets include WMT 2008, WMT 2009, and WMT 2010 all-to-English, plus 2009, 2010 English-to-all submissions.
Experiments
This is because we designed PORT to carry out tuning; we did not optimize its performance as an evaluation metric, but rather optimized system tuning performance.
Introduction
Automatic evaluation metrics for machine translation (MT) quality are a key part of building statistical MT (SMT) systems.
Introduction
MT Evaluation Metric for Tuning
Introduction
These methods perform repeated decoding runs with different system parameter values, which are tuned to optimize the value of the evaluation metric over a development set with reference translations.
evaluation metrics is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Lo, Chi-kiu and Wu, Dekai
Abstract
As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent.
Abstract
But more accurate, non-automatic adequacy-oriented MT evaluation metrics like HTER are highly labor-intensive, which bottlenecks the evaluation cycle.
Abstract
We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semiautomated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure.
evaluation metrics is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Chan, Yee Seng and Ng, Hwee Tou
Abstract
We propose an automatic machine translation (MT) evaluation metric that calculates a similarity score (based on precision and recall) of a pair of sentences.
Abstract
When evaluated on data from the ACL-07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop.
Introduction
Since human evaluation of MT output is time consuming and expensive, having a robust and accurate automatic MT evaluation metric that correlates well with human judgement is invaluable.
Introduction
Among all the automatic MT evaluation metrics, BLEU (Papineni et al., 2002) is the most widely used.
Introduction
During the recent ACL-07 workshop on statistical MT (Callison-Burch et al., 2007), a total of 11 automatic MT evaluation metrics were evaluated for correlation with human judgement.
Metric Design Considerations
We first review some aspects of existing metrics and highlight issues that should be considered when designing an MT evaluation metric.
Metric Design Considerations
The ACL-07 MT workshop evaluated the translation quality of MT systems on various translation tasks, and also measured the correlation (with human judgement) of 11 automatic MT evaluation metrics.
Metric Design Considerations
In this paper, we present MAXSIM, a new automatic MT evaluation metric that computes a similarity score between corresponding items across a sentence pair, and uses a bipartite graph to obtain an optimal matching between item pairs.
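As a rough illustration of the bipartite-matching idea described above, here is a minimal Python sketch (not the authors' implementation; the toy similarity matrix and the precision/recall normalizations are assumptions):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def optimal_match_score(sim):
        # sim[i][j]: similarity between candidate item i and reference item j
        sim = np.asarray(sim, dtype=float)
        # linear_sum_assignment minimizes cost, so negate to maximize similarity
        rows, cols = linear_sum_assignment(-sim)
        total = sim[rows, cols].sum()
        precision = total / sim.shape[0]  # normalized by candidate items (assumption)
        recall = total / sim.shape[1]     # normalized by reference items (assumption)
        return precision, recall

    p, r = optimal_match_score([[0.9, 0.1], [0.2, 0.8], [0.0, 0.3]])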
evaluation metrics is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Owczarzak, Karolina
Abstract
In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE).
Current practice in summary evaluation
Since this type of evaluation processes information in stages (constituent parser, dependency extraction, and the method of dependency matching between a candidate and a reference), there is potential for variance in performance among dependency-based evaluation metrics that use different components.
Dependency-based evaluation
In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics.
Discussion and future work
Admittedly, we could just ignore this problem and focus on increasing correlations for automatic summaries only; after all, the whole point of creating evaluation metrics is to score and rank the output of systems.
Discussion and future work
Since there is no single winner among all 32 variants of DEPEVAL(summ) on TAC 2008 data, we must decide which of the categories is most important to a successful automatic evaluation metric.
Discussion and future work
This ties in with the purpose which the evaluation metric should serve.
Experimental results
Of course, the ideal evaluation metric would show high correlations with human judgment on both levels.
Experimental results
Table 1: System-level Pearson’s correlation between automatic and manual evaluation metrics for TAC 2008 data.
Introduction
In this paper, we explore one such evaluation metric, DEPEVAL(summ), based on the comparison of Lexical-Functional Grammar (LFG) dependencies between a candidate summary and
evaluation metrics is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Pervouchine, Vladimir and Li, Haizhou and Lin, Bo
Abstract
This paper studies transliteration alignment, its evaluation metrics and applications.
Abstract
We propose a new evaluation metric, alignment entropy, grounded in information theory, to evaluate the alignment quality without the need for a gold standard reference, and compare the metric with F-score.
Experiments
From the figures, we can observe a clear correlation between the alignment entropy and F-score, which validates the effectiveness of alignment entropy as an evaluation metric.
Experiments
This once again demonstrates the desired property of alignment entropy as an evaluation metric of alignment.
Introduction
In Section 3, we introduce both statistically and phonologically motivated alignment techniques, and in Section 4 we advocate an evaluation metric, alignment entropy, that measures the alignment quality.
Related Work
Although there are many studies of evaluation metrics of word alignment for MT (Lambert, 2008), there has been much less reported work on evaluation metrics of transliteration alignment.
Related Work
Three evaluation metrics are used: precision, recall, and F-score, the latter being a function of the former two.
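For reference, with A the set of predicted alignment links and G the gold-standard links, the usual definitions are (the balanced harmonic mean shown for F is the standard convention; the cited work may parameterize it differently):

    \[
    P = \frac{|A \cap G|}{|A|}, \qquad
    R = \frac{|A \cap G|}{|G|}, \qquad
    F = \frac{2PR}{P + R}
    \]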
Related Work
In this paper we propose a novel evaluation metric for transliteration alignment grounded in information theory.
evaluation metrics is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Zhang, Congle and Baldwin, Tyler and Ho, Howard and Kimelfeld, Benny and Li, Yunyao
Conclusions
This evaluation metric allows for a deeper understanding of how certain normalization actions impact the output of the parser.
Evaluation
5.1 Evaluation Metrics
Evaluation
Therefore, we propose a new evaluation metric that directly equates normalization performance with the performance of a common downstream application—dependency parsing.
Introduction
Another potential problem with state-of-the-art normalization is the lack of appropriate evaluation metrics.
Introduction
For instance, it is unclear how performance measured by the typical normalization evaluation metrics of word error rate and BLEU score (Papineni et al., 2002) translates into performance on a parsing task, where a well placed punctuation mark may provide more substantial improvements than changing a nonstandard word form.
Introduction
To address this problem, this work introduces an evaluation metric that ties normalization performance directly to the performance of a downstream dependency parser.
evaluation metrics is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Duh, Kevin and Sudoh, Katsuhito and Wu, Xianchao and Tsukada, Hajime and Nagata, Masaaki
Introduction
These methods are effective because they tune the system to maximize an automatic evaluation metric such as BLEU, which serves as a surrogate objective for translation quality.
Introduction
While many alternatives have been proposed, such a perfect evaluation metric remains elusive.
Introduction
As a result, many MT evaluation campaigns now report multiple evaluation metrics (Callison-Burch et al., 2011; Paul, 2010).
Opportunities and Limitations
Leveraging the diverse perspectives of different evaluation metrics has the potential to improve overall quality.
Related Work
If a good evaluation metric could not be used for tuning, it would be a pity.
evaluation metrics is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Liu, Chang and Ng, Hwee Tou
Conclusion
In this work, we devise a new MT evaluation metric in the family of TESLA (Translation Evaluation of Sentences with Linear-programming-based Analysis), called TESLA-CELAB (Character-level Evaluation for Languages with Ambiguous word Boundaries), to address the problem of fuzzy word boundaries in the Chinese language, although neither the phenomenon nor the method is unique to Chinese.
Introduction
The Workshop on Statistical Machine Translation (WMT) hosts regular campaigns comparing different machine translation evaluation metrics (Callison-Burch et al., 2009; Callison-Burch et al., 2010; Callison-Burch et al., 2011).
Introduction
The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 - TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
The Algorithm
Notice that all n-grams are put in the same matching problem regardless of n, unlike in translation evaluation metrics designed for European languages.
The Algorithm
This relationship is implicit in the matching problem for English translation evaluation metrics where words are well delimited.
The Algorithm
Many prior translation evaluation metrics such as MAXSIM (Chan and Ng, 2008) and TESLA (Liu et al., 2010; Dahlmeier et al., 2011) use the F-0.8 measure as the final score:
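The formula itself is cut off in this excerpt; one common way to write a weighted F measure of this kind, with the exact weighting convention left to the cited papers, is:

    \[
    F_{\alpha} = \frac{P \cdot R}{\alpha P + (1 - \alpha) R}, \qquad \alpha = 0.8
    \]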
evaluation metrics is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Chen, David and Dolan, William
Abstract
A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years.
Introduction
However, a lack of standard datasets and automatic evaluation metrics has impeded progress in the field.
Introduction
Second, we define a new evaluation metric, PINC (Paraphrase In N-gram Changes), that relies on simple BLEU-like n-gram comparisons to measure the degree of novelty of automatically generated paraphrases.
Paraphrase Evaluation Metrics
A good paraphrase, according to our evaluation metric, has few n-gram overlaps with the source sentence but many n-gram overlaps with the reference sentences.
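A minimal Python sketch of an n-gram novelty score of this kind, based on the description above (tokenization, the n-gram range, and other details are assumptions rather than the authors' exact implementation):

    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def novelty(source, candidate, max_n=4):
        # fraction of candidate n-grams NOT found in the source, averaged over n
        src, cand = source.split(), candidate.split()
        scores = []
        for n in range(1, max_n + 1):
            cand_ngrams = ngrams(cand, n)
            if not cand_ngrams:
                continue
            overlap = len(cand_ngrams & ngrams(src, n))
            scores.append(1.0 - overlap / len(cand_ngrams))
        return sum(scores) / len(scores) if scores else 0.0

    print(novelty("the cat sat on the mat", "a cat rested on a rug"))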
Related Work
The more recently proposed metric PEM (Paraphrase Evaluation Metric) (Liu et al., 2010) produces a single score that captures the semantic adequacy, fluency, and lexical dissimilarity of candidate paraphrases, relying on bilingual data to learn semantic equivalences without using n-gram similarity between candidate and reference sentences.
evaluation metrics is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Fournier, Chris
Abstract
This work proposes a new segmentation evaluation metric, named boundary similarity (B), an inter-coder agreement coefficient adaptation, and a confusion-matrix for segmentation that are all based upon an adaptation of the boundary edit distance in Fournier and Inkpen (2012).
Conclusions
In this work, a new segmentation evaluation metric, referred to as boundary similarity (B), is proposed as an unbiased metric, along with a boundary-edit-distance-based (BED-based) confusion matrix to compute predictably biased IR metrics such as precision and recall.
Conclusions
B also allows for an intuitive comparison of boundary pairs between segmentations, as opposed to the window counts of WD or the simplistic edit count normalization of S. When an unbiased segmentation evaluation metric is desired, this work recommends the usage of B and the use of an upper and lower bound to provide context.
Evaluation of Automatic Segmenters
An ideal segmentation evaluation metric should, in theory, place the three automatic segmenters between the upper and lower bounds in terms of performance if the metrics, and the segmenters, function properly.
Introduction
To select an automatic segmenter for a particular task, a variety of segmentation evaluation metrics have been proposed, including Pk (Beeferman and Berger, 1999, pp.
evaluation metrics is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Lo, Chi-kiu and Beloucif, Meriem and Saers, Markus and Wu, Dekai
Abstract
We introduce XMEANT—a new cross-lingual version of the semantic frame based MT evaluation metric MEANT—which can correlate even more closely with human adequacy judgments than monolingual MEANT and eliminates the need for expensive human references.
Introduction
It is well established that the MEANT family of metrics correlates better with human adequacy judgments than commonly used MT evaluation metrics (Lo and Wu, 2011a, 2012; Lo et al., 2012; Lo and Wu, 2013b; Machacek and Bojar, 2013).
Introduction
We therefore propose XMEANT, a cross-lingual MT evaluation metric, that modifies MEANT using (1) simple translation probabilities (in our experiments,
Related Work
2.1 MT evaluation metrics
Results
Table 1 shows that for human adequacy judgments at the sentence level, the f-score based XMEANT (1) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT.
evaluation metrics is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Falk, Ingrid and Gardent, Claire and Lamirel, Jean-Charles
Abstract
We present a novel approach to the automatic acquisition of a VerbNet-like classification of French verbs which involves the use (i) of a neural clustering method which associates clusters with features, (ii) of several supervised and unsupervised evaluation metrics and (iii) of various existing syntactic and semantic lexical resources.
Clustering Methods, Evaluation Metrics and Experimental Setup
3.2 Evaluation metrics
Clustering Methods, Evaluation Metrics and Experimental Setup
We use several evaluation metrics which bear on different properties of the clustering.
Clustering Methods, Evaluation Metrics and Experimental Setup
As pointed out in (Lamirel et al., 2008; Attik et al., 2006), unsupervised evaluation metrics based on cluster labelling and feature maximisation can prove very useful for identifying the best clustering strategy.
Features and Data
Moreover, for this data set, the unsupervised evaluation metrics (cf.
evaluation metrics is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Skjaerholt, Arne
Abstract
With this in mind, it is striking that virtually all evaluations of syntactic annotation efforts use uncorrected parser evaluation metrics such as bracket F1 (for phrase structure) and accuracy scores (for dependencies).
Abstract
To evaluate our metric we first present a number of synthetic experiments to better control the sources of noise and gauge the metric’s responses, before finally contrasting the behaviour of our chance-corrected metric with that of uncorrected parser evaluation metrics on real
Conclusion
In this task inserting and deleting nodes is an integral part of the annotation, and if two annotators insert or delete different nodes the all-or-nothing requirement of identical yield of the LAS metric makes it impossible as an evaluation metric in this setting.
Real-world corpora
In our evaluation, we will contrast labelled accuracy, the standard parser evaluation metric, and our three α metrics.
Synthetic experiments
The de facto standard parser evaluation metric in depen-
evaluation metrics is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Kawahara, Daisuke and Peterson, Daniel W. and Palmer, Martha
Experiments and Evaluations
We first describe our experimental settings and define evaluation metrics to evaluate induced soft clusterings of verb classes.
Experiments and Evaluations
4.2 Evaluation Metrics
Experiments and Evaluations
This kind of normalization for soft clusterings was performed for other evaluation metrics as in Springorum et al.
evaluation metrics is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Martschat, Sebastian
Conclusions and Future Work
• the evaluation metrics employed are to be questioned (certainly),
Evaluation
5.1 Data and Evaluation Metrics
Evaluation
We evaluate our system with the coreference resolution evaluation metrics that were used for the CoNLL shared tasks on coreference, which are MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998) and CEAFe (Luo, 2005).
Evaluation
We also report the unweighted average of the three scores, which was the official evaluation metric in the shared tasks.
evaluation metrics is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Hasan, Kazi Saidul and Ng, Vincent
Evaluation
4.1 Evaluation Metrics
Evaluation
Designing evaluation metrics for keyphrase extraction is by no means an easy task.
Evaluation
To score the output of a keyphrase extraction system, the typical approach, which is also adopted by the SemEval—2010 shared task on keyphrase extraction, is (1) to create a mapping between the keyphrases in the gold standard and those in the system output using exact match, and then (2) score the output using evaluation metrics such as precision (P), recall (R), and F-score (F).
evaluation metrics is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Siahbani, Maryam and Haffari, Reza and Sarkar, Anoop
Abstract
Experimental results show that our graph propagation method significantly improves performance over two strong baselines under intrinsic and extrinsic evaluation metrics.
Experiments & Results 4.1 Experimental Setup
Two intrinsic evaluation metrics that we use to evaluate the possible translations for OOVs are Mean Reciprocal Rank (MRR) (Voorhees, 1999) and Recall.
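For reference, a minimal Python sketch of Mean Reciprocal Rank over OOV types, following the standard definition (the exact matching criterion against the reference translations is an assumption):

    def mean_reciprocal_rank(proposals, references):
        # proposals:  {oov: ranked list of candidate translations}
        # references: {oov: set of acceptable translations}
        reciprocal_ranks = []
        for oov, ranked in proposals.items():
            gold = references.get(oov, set())
            score = 0.0
            for rank, candidate in enumerate(ranked, start=1):
                if candidate in gold:
                    score = 1.0 / rank  # reciprocal rank of first correct translation
                    break
            reciprocal_ranks.append(score)
        return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0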
Experiments & Results 4.1 Experimental Setup
Intrinsic evaluation metrics are faster to apply and are used to optimize different hyper-parameters of the approach (e.g.
Experiments & Results 4.1 Experimental Setup
BLEU (Papineni et al., 2002) is still the de facto evaluation metric for machine translation and we use that to measure the quality of our proposed approaches for MT.
evaluation metrics is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Monroe, Will and Green, Spence and Manning, Christopher D.
Experiments
3.1 Evaluation metrics
Experiments
We use two evaluation metrics in our experiments.
Experiments
Our segmenter achieves higher scores than MADA and MADA-ARZ on all datasets under both evaluation metrics.
evaluation metrics is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Jia, Zhongye and Zhao, Hai
Experiments
4.2 Evaluation Metrics
Experiments
We will use conventional sequence labeling evaluation metrics such as sequence accuracy and character accuracy.
Experiments
Other evaluation metrics are also proposed by Zheng et al. (2011a), which are only suitable for their system since our system uses a joint model
evaluation metrics is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Xu, Liheng and Liu, Kang and Lai, Siwei and Zhao, Jun
Experiments
4.1 Datasets and Evaluation Metrics
Experiments
Evaluation Metrics: We evaluate the proposed method in terms of precision (P), recall (R), and F-measure (F).
Experiments
To take into account the correctly expanded terms for both positive and negative seeds, we use Accuracy as the evaluation metric,
evaluation metrics is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zhou, Guangyou and Liu, Fang and Liu, Yang and He, Shizhu and Zhao, Jun
Experiments
3.1 Data Set and Evaluation Metrics
Experiments
Evaluation Metrics: We evaluate the performance of question retrieval using the following metrics: Mean Average Precision (MAP) and Precision@N (P@N).
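For reference, the standard definitions of these retrieval metrics, with rel_q the set of relevant documents for query q and d_{q,k} the document ranked k-th for q:

    \[
    \mathrm{P@}N(q) = \frac{|\mathrm{rel}_q \cap \mathrm{top}_N(q)|}{N}, \qquad
    \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{|\mathrm{rel}_q|}
                   \sum_{k=1}^{n_q} \mathrm{P@}k(q) \cdot \mathbf{1}[d_{q,k} \in \mathrm{rel}_q]
    \]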
Our Approach
where the feature vector Φ(q, d) = (s_VSM(q, d), s(q1, d1), s(q2, d2), ..., s(qP, dP)), and θ is the corresponding weight vector; we optimize this parameter for our evaluation metrics directly using the Powell Search algorithm (Paul et al., 1992) via cross-validation.
evaluation metrics is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zhang, Yuan and Barzilay, Regina and Globerson, Amir
Evaluation Setup
Evaluation Metrics: We use two evaluation metrics.
Experiment and Analysis
Moreover, increasing the number of coarse annotations used in training leads to further improvement on different evaluation metrics.
Experiment and Analysis
Figure 5 also illustrates slightly different characteristics of transfer performance between the two evaluation metrics.
evaluation metrics is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Pilehvar, Mohammad Taher and Jurgens, David and Navigli, Roberto
Experiment 1: Textual Similarity
Three evaluation metrics are provided by the organizers of the SemEval-2012 STS task, all of which are based on Pearson correlation r of human judgments with system outputs: (1) the correlation value for the concatenation of all five datasets (ALL), (2) a correlation value obtained on a concatenation of the outputs, separately normalized by least square (ALLnrm), and (3) the weighted average of Pearson correlations across datasets (Mean).
Experiment 1: Textual Similarity
Table 2 shows the scores obtained by ADW for the three evaluation metrics, as well as the Pearson correlation values obtained on each of the five test sets (rightmost columns).
Experiment 1: Textual Similarity
As can be seen from Table 2, our system (ADW) outperforms all the 88 participating systems according to all the evaluation metrics.
evaluation metrics is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Conclusions and Future Work
We also proposed a new name-aware evaluation metric.
Introduction
Propose a new MT evaluation metric which can discriminate names and noninformative words (Section 4).
Name-aware MT Evaluation
Traditional MT evaluation metrics such as BLEU (Papineni et al., 2002) and Translation Edit Rate (TER) (Snover et al., 2006) assign the same weights to all tokens equally.
evaluation metrics is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Lassalle, Emmanuel and Denis, Pascal
Abstract
Our experiments on the CoNLL-2012 Shared Task English datasets (gold mentions) indicate that our method is robust relative to different clustering strategies and evaluation metrics, showing large and consistent improvements over a single pairwise model using the same base features.
Experiments
5.3 Evaluation metrics
Introduction
As will be shown based on a variety of experiments on the CoNLL-2012 Shared Task English datasets, these improvements are consistent across different evaluation metrics and for the most part independent of the clustering decoder that was used.
evaluation metrics is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Cai, Shu and Knight, Kevin
Conclusion and Future Work
We present an evaluation metric for whole-sentence semantic analysis, and show that it can be computed efficiently.
Introduction
In this work, we provide an evaluation metric that uses the degree of overlap between two whole-sentence semantic structures as the partial credit.
Semantic Overlap
Our evaluation metric measures precision, recall, and f-score of the triples in the second AMR against the triples in the first AMR, i.e., the amount of propositional overlap.
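A minimal Python sketch of the triple-overlap scoring described above, under the simplifying assumption that the variable alignment between the two AMRs is already fixed (the metric itself additionally searches for the variable mapping that maximizes this overlap):

    def triple_overlap(reference_triples, test_triples):
        # triples as (source, relation, target) tuples, e.g. ("w", "instance", "want-01")
        ref, test = set(reference_triples), set(test_triples)
        matched = len(ref & test)
        precision = matched / len(test) if test else 0.0
        recall = matched / len(ref) if ref else 0.0
        f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f_score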
evaluation metrics is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Sun, Hong and Zhou, Ming
Paraphrasing with a Dual SMT System
MERT integrates the automatic evaluation metrics into the training process to achieve optimal end-to-end performance.
Paraphrasing with a Dual SMT System
where G is the automatic evaluation metric for paraphrasing.
Paraphrasing with a Dual SMT System
2.2 Paraphrase Evaluation Metrics
evaluation metrics is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Abu-Jbara, Amjad and Dasigi, Pradeep and Diab, Mona and Radev, Dragomir
Evaluation
Before describing the experiments and presenting the results, we first describe the evaluation metrics we use.
Evaluation
4.0.1 Evaluation Metrics
Evaluation
We use two evaluation metrics to evaluate subgroups detection accuracy: Purity and Entropy.
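For reference, common formulations of these two clustering measures, with c_k the predicted subgroups, t_j the gold subgroups, N the total number of items, and J the number of gold classes (the entropy normalization shown is one convention; the paper's exact definition may differ):

    \[
    \mathrm{Purity} = \frac{1}{N} \sum_{k} \max_{j} |c_k \cap t_j|, \qquad
    \mathrm{Entropy} = \sum_{k} \frac{|c_k|}{N}
        \left( -\frac{1}{\log J} \sum_{j} \frac{|c_k \cap t_j|}{|c_k|}
               \log \frac{|c_k \cap t_j|}{|c_k|} \right)
    \]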
evaluation metrics is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Pitler, Emily and Louis, Annie and Nenkova, Ani
Introduction
Good performance on this task is the most desired property of evaluation metrics during system development.
Results and discussion
These input-level accuracies compare favorably with automatic evaluation metrics for other natural language processing tasks.
Results and discussion
For example, at the 2008 ACL Workshop on Statistical Machine Translation, all fifteen automatic evaluation metrics, including variants of BLEU scores, achieved between 42% and 56% pairwise accuracy with human judgments at the sentence level (Callison-Burch et al., 2008).
evaluation metrics is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: