Abstract | In this work, we propose a novel approach for the meta-evaluation of MT evaluation metrics, since correlation coefficients against human judgments do not reveal details about the advantages and disadvantages of particular metrics.
Abstract | We then use this approach to investigate the benefits of introducing linguistic features into evaluation metrics.
Alternatives to Correlation-based Meta-evaluation | However, each automatic evaluation metric has its own scale properties. |
Alternatives to Correlation-based Meta-evaluation | This conclusion motivates the incorporation of linguistic processing into automatic evaluation metrics.
Alternatives to Correlation-based Meta-evaluation | In order to obtain additional evidence about the usefulness of combining evaluation metrics at different processing levels, let us consider the following situation: given a set of reference translations, we want to train a combined system that takes the most appropriate translation approach for each text segment.
Correlation with Human Judgements | Figure 1 shows the correlation obtained by each automatic evaluation metric at system level (horizontal axis) versus segment level (vertical axis) in our test beds. |
Introduction | These automatic evaluation metrics allow developers to optimize their systems without the need for expensive human assessments for each of their possible system configurations. |
Introduction | In the context of Machine Translation, a considerable effort has also been made to include deeper linguistic information in automatic evaluation metrics, both syntactic and semantic (see Section 2 for details).
Introduction | Analyzing the reliability of evaluation metrics requires meta-evaluation criteria. |
Previous Work on Machine Translation Meta-Evaluation | As automatic evaluation metrics for machine translation have been proposed, different meta-evaluation frameworks have gradually been introduced.
Abstract | Then, we show that these measures can help improve a number of existing machine translation evaluation metrics both at the segment- and at the system-level. |
Abstract | Rather than proposing a single new metric, we show that discourse information is complementary to the state-of-the-art evaluation metrics, and thus should be taken into account in the development of future richer evaluation metrics.
Experimental Setup | 4.1 MT Evaluation Metrics |
Introduction | We believe that the semantic and pragmatic information captured in the form of DTs (i) can help develop discourse-aware SMT systems that produce coherent translations, and (ii) can yield better MT evaluation metrics . |
Introduction | In this paper, rather than proposing yet another MT evaluation metric, we show that discourse information is complementary to many existing evaluation metrics, and thus should not be ignored.
Introduction | We first design two discourse-aware similarity measures, which use DTs generated by a publicly-available discourse parser (Joty et al., 2012); then, we show that they can help improve a number of MT evaluation metrics at the segment- and at the system-level in the context of the WMT11 and the WMT12 metrics shared tasks (Callison-Burch et al., 2011; Callison-Burch et al., 2012).
Our Discourse-Based Measures | In order to develop a discourse-aware evaluation metric, we first generate discourse trees for the reference and the system-translated sentences using a discourse parser, and then we measure the similarity between the two discourse trees.
Related Work | A common argument is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012).
Related Work | Thus, there is consensus that discourse-informed MT evaluation metrics are needed in order to advance research in this direction. |
Related Work | The field of automatic evaluation metrics for MT is very active, and new metrics are continuously being proposed, especially in the context of the evaluation campaigns that run as part of the Workshops on Statistical Machine Translation (WMT 2008-2012), and NIST Metrics for Machine Translation Challenge (MetricsMATR), among others. |
Abstract | Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. |
Abstract | This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems.
BLEU and PORT | Several ordering measures have been integrated into MT evaluation metrics recently. |
Experiments | 3.1 PORT as an Evaluation Metric |
Experiments | We studied PORT as an evaluation metric on WMT data; test sets include WMT 2008, WMT 2009, and WMT 2010 all-to-English, plus 2009, 2010 English-to-all submissions. |
Experiments | This is because we designed PORT to carry out tuning; we did not optimize its performance as an evaluation metric, but rather its system tuning performance.
Introduction | Automatic evaluation metrics for machine translation (MT) quality are a key part of building statistical MT (SMT) systems. |
Introduction | VIT Evaluation Metric for Tuning |
Introduction | These methods perform repeated decoding runs with different system parameter values, which are tuned to optimize the value of the evaluation metric over a development set with reference translations. |
Abstract | As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. |
Abstract | But more accurate, nonautomatic adequacy-oriented MT evaluation metrics like HTER are highly labor-intensive, which bottlenecks the evaluation cycle. |
Abstract | We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semiautomated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure.
Abstract | We propose an automatic machine translation (MT) evaluation metric that calculates a similarity score (based on precision and recall) of a pair of sentences. |
Abstract | When evaluated on data from the ACL-07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop.
Introduction | Since human evaluation of MT output is time consuming and expensive, having a robust and accurate automatic MT evaluation metric that correlates well with human judgement is invaluable. |
Introduction | Among all the automatic MT evaluation metrics, BLEU (Papineni et al., 2002) is the most widely used.
Introduction | During the recent ACL-07 workshop on statistical MT (Callison-Burch et al., 2007), a total of 11 automatic MT evaluation metrics were evaluated for correlation with human judgement. |
Metric Design Considerations | We first review some aspects of existing metrics and highlight issues that should be considered when designing an MT evaluation metric.
Metric Design Considerations | The ACL-07 MT workshop evaluated the translation quality of MT systems on various translation tasks, and also measured the correlation (with human judgement) of 11 automatic MT evaluation metrics . |
Metric Design Considerations | In this paper, we present MAXSIM, a new automatic MT evaluation metric that computes a similarity score between corresponding items across a sentence pair, and uses a bipartite graph to obtain an optimal matching between item pairs. |
Abstract | In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE). |
Current practice in summary evaluation | Since this type of evaluation processes information in stages (constituent parser, dependency extraction, and the method of dependency matching between a candidate and a reference), there is potential for variance in performance among dependency-based evaluation metrics that use different components. |
Dependency-based evaluation | In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics.
Discussion and future work | Admittedly, we could just ignore this problem and focus on increasing correlations for automatic summaries only; after all, the whole point of creating evaluation metrics is to score and rank the output of systems. |
Discussion and future work | Since there is no single winner among all 32 variants of DEPEVAL(summ) on TAC 2008 data, we must decide which of the categories is most important to a successful automatic evaluation metric.
Discussion and future work | This ties in with the purpose which the evaluation metric should serve. |
Experimental results | Of course, the ideal evaluation metric would show high correlations with human judgment on both levels. |
Experimental results | Table 1: System-level Pearson’s correlation between automatic and manual evaluation metrics for TAC 2008 data. |
Introduction | In this paper, we explore one such evaluation metric, DEPEVAL(summ), based on the comparison of Lexical-Functional Grammar (LFG) dependencies between a candidate summary and a reference.
Abstract | This paper studies transliteration alignment, its evaluation metrics and applications. |
Abstract | We propose a new evaluation metric, alignment entropy, grounded in information theory, to evaluate alignment quality without the need for a gold-standard reference, and compare the metric with F-score.
Experiments | From the figures, we can observe a clear correlation between alignment entropy and F-score, which validates the effectiveness of alignment entropy as an evaluation metric.
Experiments | This once again demonstrates the desired property of alignment entropy as an evaluation metric of alignment. |
Introduction | In Section 3, we introduce both statistically and phonologically motivated alignment techniques, and in Section 4 we advocate an evaluation metric, alignment entropy, that measures alignment quality.
Related Work | Although there are many studies of evaluation metrics of word alignment for MT (Lambert, 2008), there has been much less reported work on evaluation metrics of transliteration alignment. |
Related Work | Three evaluation metrics are used: precision, recall, and F-score, the latter being a function of the former two.
Related Work | In this paper we propose a novel evaluation metric for transliteration alignment grounded in information theory.
Conclusions | This evaluation metric allows for a deeper understanding of how certain normalization actions impact the output of the parser. |
Evaluation | 5.1 Evaluation Metrics |
Evaluation | Therefore, we propose a new evaluation metric that directly equates normalization performance with the performance of a common downstream application—dependency parsing. |
Introduction | Another potential problem with state-of-the-art normalization is the lack of appropriate evaluation metrics.
Introduction | For instance, it is unclear how performance measured by the typical normalization evaluation metrics of word error rate and BLEU score (Papineni et al., 2002) translates into performance on a parsing task, where a well placed punctuation mark may provide more substantial improvements than changing a nonstandard word form.
Introduction | To address this problem, this work introduces an evaluation metric that ties normalization performance directly to the performance of a downstream dependency parser. |
Introduction | These methods are effective because they tune the system to maximize an automatic evaluation metric such as BLEU, which serves as a surrogate objective for translation quality.
Introduction | While many alternatives have been proposed, such a perfect evaluation metric remains elusive. |
Introduction | As a result, many MT evaluation campaigns now report multiple evaluation metrics (Callison-Burch et al., 2011; Paul, 2010).
Opportunities and Limitations | Leveraging the diverse perspectives of different evaluation metrics has the potential to improve overall quality. |
Related Work | It would be unfortunate if a good evaluation metric could not be used for tuning.
Conclusion | In this work, we devise a new MT evaluation metric in the family of TESLA (Translation Evaluation of Sentences with Linear-programming-based Analysis), called TESLA-CELAB (Character-level Evaluation for Languages with Ambiguous word Boundaries), to address the problem of fuzzy word boundaries in the Chinese language, although neither the phenomenon nor the method is unique to Chinese. |
Introduction | The Workshop on Statistical Machine Translation (WMT) hosts regular campaigns comparing different machine translation evaluation metrics (Callison-Burch et al., 2009; Callison-Burch et al., 2010; Callison-Burch et al., 2011). |
Introduction | The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 − TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
The Algorithm | Notice that all n-grams are put in the same matching problem regardless of n, unlike in translation evaluation metrics designed for European languages. |
The Algorithm | This relationship is implicit in the matching problem for English translation evaluation metrics where words are well delimited. |
The Algorithm | Many prior translation evaluation metrics such as MAXSIM (Chan and Ng, 2008) and TESLA (Liu et al., 2010; Dahlmeier et al., 2011) use the F-0.8 measure as the final score: |
Abstract | A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. |
Introduction | However, a lack of standard datasets and automatic evaluation metrics has impeded progress in the field. |
Introduction | Second, we define a new evaluation metric, PINC (Paraphrase In N-gram Changes), that relies on simple BLEU-like n-gram comparisons to measure the degree of novelty of automatically generated paraphrases.
Paraphrase Evaluation Metrics | A good paraphrase, according to our evaluation metric, has few n-gram overlaps with the source sentence but many n-gram overlaps with the reference sentences.
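The source-dissimilarity half of this trade-off can be sketched concretely (the reference-similarity half is measured separately, e.g. by BLEU). A minimal PINC-style scorer, assuming whitespace tokenization and n-gram orders 1 through 4; the official implementation may differ in detail:

```python
def ngrams(tokens, n):
    """Set of n-grams of order n in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """PINC-style novelty score (sketch): average, over n-gram orders,
    of the fraction of candidate n-grams NOT found in the source.
    Higher means more lexical change from the source sentence."""
    src, cand = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            continue  # candidate shorter than n
        overlap = len(cand_ngrams & ngrams(src, n))
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0
```

A verbatim copy of the source scores 0, and a candidate sharing no n-grams with the source scores 1.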
Related Work | The more recently proposed metric PEM (Paraphrase Evaluation Metric ) (Liu et al., 2010) produces a single score that captures the semantic adequacy, fluency, and lexical dissimilarity of candidate paraphrases, relying on bilingual data to learn semantic equivalences without using n- gram similarity between candidate and reference sentences. |
Abstract | This work proposes a new segmentation evaluation metric , named boundary similarity (B), an inter-coder agreement coefficient adaptation, and a confusion-matrix for segmentation that are all based upon an adaptation of the boundary edit distance in Fournier and Inkpen (2012). |
Conclusions | In this work, a new segmentation evaluation metric , referred to as boundary similarity (B) is proposed as an unbiased metric, along with a boundary-edit-distance-based (BED-based) confusion matrix to compute predictably biased IR metrics such as precision and recall. |
Conclusions | B also allows for an intuitive comparison of boundary pairs between segmentations, as opposed to the window counts of WD or the simplistic edit count normalization of S. When an unbiased segmentation evaluation metric is desired, this work recommends the use of B together with an upper and lower bound to provide context.
Evaluation of Automatic Segmenters | An ideal segmentation evaluation metric should, in theory, place the three automatic segmenters between the upper and lower bounds in terms of performance if the metrics, and the segmenters, function properly. |
Introduction | To select an automatic segmenter for a particular task, a variety of segmentation evaluation metrics have been proposed, including Pk (Beeferman and Berger, 1999, pp.
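For concreteness, Pk can be sketched as a sliding-window disagreement count. This is a simplified reading of Beeferman and Berger (1999); the representation (segment masses) and the default window-size heuristic are assumptions:

```python
def seg_ids(masses):
    """Expand segment masses into per-unit segment ids, e.g. [2, 3] -> [0, 0, 1, 1, 1]."""
    ids = []
    for seg, m in enumerate(masses):
        ids.extend([seg] * m)
    return ids

def pk(ref_masses, hyp_masses, k=None):
    """P_k (sketch): slide a window of width k over the text and count
    positions where reference and hypothesis disagree about whether the
    two window ends lie in the same segment. k defaults to roughly half
    the mean reference segment length (a common convention)."""
    ref, hyp = seg_ids(ref_masses), seg_ids(hyp_masses)
    assert len(ref) == len(hyp), "segmentations must cover the same text"
    if k is None:
        k = max(2, round(len(ref) / (2 * len(ref_masses))))
    errors, count = 0, 0
    for i in range(len(ref) - k):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        errors += same_ref != same_hyp
        count += 1
    return errors / count if count else 0.0
```

An identical segmentation scores 0 (no penalty); lower is better.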
Abstract | We introduce XMEANT—a new cross-lingual version of the semantic frame based MT evaluation metric MEANT—which can correlate even more closely with human adequacy judgments than monolingual MEANT and eliminates the need for expensive human references.
Introduction | It is well established that the MEANT family of metrics correlates better with human adequacy judgments than commonly used MT evaluation metrics (Lo and Wu, 2011a, 2012; Lo et al., 2012; Lo and Wu, 2013b; Machacek and Bojar, 2013). |
Introduction | We therefore propose XMEANT, a cross-lingual MT evaluation metric that modifies MEANT using (1) simple translation probabilities (in our experiments,
Related Work | 2.1 MT evaluation metrics |
Results | Table 1 shows that for human adequacy judgments at the sentence level, the f-score based XMEANT (1) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT.
Abstract | We present a novel approach to the automatic acquisition of a VerbNet-like classification of French verbs which involves the use (i) of a neural clustering method which associates clusters with features, (ii) of several supervised and unsupervised evaluation metrics and (iii) of various existing syntactic and semantic lexical resources.
Clustering Methods, Evaluation Metrics and Experimental Setup | 3.2 Evaluation metrics |
Clustering Methods, Evaluation Metrics and Experimental Setup | We use several evaluation metrics which bear on different properties of the clustering. |
Clustering Methods, Evaluation Metrics and Experimental Setup | As pointed out in (Lamirel et al., 2008; Attik et al., 2006), unsupervised evaluation metrics based on cluster labelling and feature maximisation can prove very useful for identifying the best clustering strategy. |
Features and Data | Moreover, for this data set, the unsupervised evaluation metrics (cf. |
Abstract | With this in mind, it is striking that virtually all evaluations of syntactic annotation efforts use uncorrected parser evaluation metrics such as bracket F1 (for phrase structure) and accuracy scores (for dependencies). |
Abstract | To evaluate our metric we first present a number of synthetic experiments to better control the sources of noise and gauge the metric’s responses, before finally contrasting the behaviour of our chance-corrected metric with that of uncorrected parser evaluation metrics on real-world corpora.
Conclusion | In this task, inserting and deleting nodes is an integral part of the annotation, and if two annotators insert or delete different nodes, the all-or-nothing requirement of identical yield makes LAS unusable as an evaluation metric in this setting.
Real-world corpora | In our evaluation, we will contrast labelled accuracy, the standard parser evaluation metric, and our three α metrics.
Synthetic experiments | The de facto standard parser evaluation metric in dependency parsing.
Experiments and Evaluations | We first describe our experimental settings and define evaluation metrics to evaluate induced soft clusterings of verb classes. |
Experiments and Evaluations | 4.2 Evaluation Metrics |
Experiments and Evaluations | This kind of normalization for soft clusterings was performed for other evaluation metrics as in Springorum et al. |
Conclusions and Future Work | • the evaluation metrics employed are to be questioned (certainly),
Evaluation | 5.1 Data and Evaluation Metrics |
Evaluation | We evaluate our system with the coreference resolution evaluation metrics that were used for the CoNLL shared tasks on coreference, which are MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998) and CEAFe (Luo, 2005). |
Evaluation | We also report the unweighted average of the three scores, which was the official evaluation metric in the shared tasks. |
Evaluation | 4.1 Evaluation Metrics |
Evaluation | Designing evaluation metrics for keyphrase extraction is by no means an easy task. |
Evaluation | To score the output of a keyphrase extraction system, the typical approach, which is also adopted by the SemEval-2010 shared task on keyphrase extraction, is (1) to create a mapping between the keyphrases in the gold standard and those in the system output using exact match, and then (2) score the output using evaluation metrics such as precision (P), recall (R), and F-score (F).
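The exact-match scoring in steps (1)–(2) can be sketched as follows. Case-normalization and whitespace stripping are assumptions here; shared-task scorers may additionally stem keyphrases before matching:

```python
def keyphrase_prf(gold, predicted):
    """Exact-match keyphrase scoring (sketch): precision, recall, and
    F-score over case-normalized phrase strings."""
    gold_set = {g.lower().strip() for g in gold}
    pred_set = {p.lower().strip() for p in predicted}
    tp = len(gold_set & pred_set)  # exact-match mapping
    p = tp / len(pred_set) if pred_set else 0.0
    r = tp / len(gold_set) if gold_set else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

As the surrounding text notes, exact matching is strict: a near-miss such as "neural net" against gold "neural nets" counts as a full error.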
Abstract | Experimental results show that our graph propagation method significantly improves performance over two strong baselines under intrinsic and extrinsic evaluation metrics.
Experiments & Results 4.1 Experimental Setup | Two intrinsic evaluation metrics that we use to evaluate the possible translations for oovs are Mean Reciprocal Rank (MRR) (Voorhees, 1999) and Recall. |
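A minimal sketch of both intrinsic metrics, assuming each OOV query comes with a ranked candidate list and a gold set of acceptable translations (the cutoff k for Recall is an assumption):

```python
def mrr(ranked_lists, gold):
    """Mean Reciprocal Rank (sketch): for each query, 1/rank of the
    first correct candidate in its ranked list, 0 if none; averaged."""
    total = 0.0
    for cands, answers in zip(ranked_lists, gold):
        for rank, c in enumerate(cands, start=1):
            if c in answers:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at(ranked_lists, gold, k=10):
    """Fraction of queries with at least one correct candidate in the top k."""
    hits = sum(1 for cands, answers in zip(ranked_lists, gold)
               if any(c in answers for c in cands[:k]))
    return hits / len(ranked_lists)
```

For instance, with two queries where the correct translation is ranked 2nd and absent respectively, MRR is (1/2 + 0)/2 = 0.25 and Recall@2 is 0.5.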
Experiments & Results 4.1 Experimental Setup | Intrinsic evaluation metrics are faster to apply and are used to optimize different hyper-parameters of the approach (e.g. |
Experiments & Results 4.1 Experimental Setup | BLEU (Papineni et al., 2002) is still the de facto evaluation metric for machine translation and we use that to measure the quality of our proposed approaches for MT. |
Experiments | 3.1 Evaluation metrics |
Experiments | We use two evaluation metrics in our experiments. |
Experiments | Our segmenter achieves higher scores than MADA and MADA-ARZ on all datasets under both evaluation metrics . |
Experiments | 4.2 Evaluation Metrics |
Experiments | We will use conventional sequence labeling evaluation metrics such as sequence accuracy and character accuracy.
Experiments | Other evaluation metrics are also proposed by Zheng et al. (2011a), but they are only suitable for their system, since our system uses a joint model.
Experiments | 4.1 Datasets and Evaluation Metrics |
Experiments | Evaluation Metrics: We evaluate the proposed method in terms of precision (P), recall (R), and F-measure (F).
Experiments | To take into account the correctly expanded terms for both positive and negative seeds, we use Accuracy as the evaluation metric, |
Experiments | 3.1 Data Set and Evaluation Metrics |
Experiments | Evaluation Metrics: We evaluate the performance of question retrieval using the following metrics: Mean Average Precision (MAP) and Precision@N (P@N).
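Both retrieval metrics can be sketched briefly using their standard definitions (the evaluation scorer actually used may differ in tie handling or in how unjudged documents are treated):

```python
def precision_at_n(ranked, relevant, n):
    """P@N: fraction of the top-n retrieved items that are relevant."""
    return sum(1 for d in ranked[:n] if d in relevant) / n

def average_precision(ranked, relevant):
    """AP for one query: mean of P@k taken at each rank k that holds
    a relevant item, divided by the number of relevant items."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: average of per-query AP over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a query with relevant set {a, c} and ranking [a, b, c], AP = (1/1 + 2/3)/2 = 5/6, and P@2 = 0.5.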
Our Approach | where the feature vector Φ(q, d) = (s_VSM(q, d), s(q1, d1), s(q2, d2), …, s(qP, dP)), and θ is the corresponding weight vector; we optimize this parameter for our evaluation metrics directly using the Powell Search algorithm (Paul et al., 1992) via cross-validation.
Evaluation Setup | Evaluation Metrics: We use two evaluation metrics.
Experiment and Analysis | Moreover, increasing the number of coarse annotations used in training leads to further improvement on different evaluation metrics . |
Experiment and Analysis | Figure 5 also illustrates slightly different characteristics of transfer performance between the two evaluation metrics.
Experiment 1: Textual Similarity | Three evaluation metrics are provided by the organizers of the SemEval-2012 STS task, all of which are based on Pearson correlation r of human judgments with system outputs: (1) the correlation value for the concatenation of all five datasets (ALL), (2) a correlation value obtained on a concatenation of the outputs, separately normalized by least square (ALLnrm), and (3) the weighted average of Pearson correlations across datasets (Mean).
Experiment 1: Textual Similarity | Table 2 shows the scores obtained by ADW for the three evaluation metrics, as well as the Pearson correlation values obtained on each of the five test sets (rightmost columns).
Experiment 1: Textual Similarity | As can be seen from Table 2, our system (ADW) outperforms all the 88 participating systems according to all the evaluation metrics.
Conclusions and Future Work | We also proposed a new name-aware evaluation metric.
Introduction | Propose a new MT evaluation metric which can discriminate between names and non-informative words (Section 4).
Name-aware MT Evaluation | Traditional MT evaluation metrics such as BLEU (Papineni et al., 2002) and Translation Edit Rate (TER) (Snover et al., 2006) assign the same weights to all tokens equally. |
Abstract | Our experiments on the CoNLL-2012 Shared Task English datasets (gold mentions) indicate that our method is robust relative to different clustering strategies and evaluation metrics, showing large and consistent improvements over a single pairwise model using the same base features.
Experiments | 5.3 Evaluation metrics |
Introduction | As will be shown based on a variety of experiments on the CoNLL-2012 Shared Task English datasets, these improvements are consistent across different evaluation metrics and for the most part independent of the clustering decoder that was used. |
Conclusion and Future Work | We present an evaluation metric for whole-sentence semantic analysis, and show that it can be computed efficiently. |
Introduction | In this work, we provide an evaluation metric that uses the degree of overlap between two whole-sentence semantic structures as the partial credit. |
Semantic Overlap | Our evaluation metric measures precision, recall, and f-score of the triples in the second AMR against the triples in the first AMR, i.e., the amount of propositional overlap. |
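A simplified sketch of this triple-overlap computation. The key simplification, flagged in the comments, is that the AMR variables are assumed to be already aligned; the full metric must additionally search over variable mappings to maximize the overlap:

```python
def triple_overlap(ref_triples, hyp_triples):
    """Precision, recall, and f-score of one AMR's triples against
    another's (sketch). Assumes variable names are pre-aligned; the
    full metric searches for the mapping maximizing this overlap."""
    ref, hyp = set(ref_triples), set(hyp_triples)
    inter = len(ref & hyp)  # shared propositions
    p = inter / len(hyp) if hyp else 0.0
    r = inter / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Each triple is a (node, relation, value) tuple, so partial credit accrues per shared proposition rather than all-or-nothing per sentence.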
Paraphrasing with a Dual SMT System | MERT integrates the automatic evaluation metrics into the training process to achieve optimal end-to-end performance. |
Paraphrasing with a Dual SMT System | where G is the automatic evaluation metric for paraphrasing.
Paraphrasing with a Dual SMT System | 2.2 Paraphrase Evaluation Metrics |
Evaluation | Before describing the experiments and presenting the results, we first describe the evaluation metrics we use. |
Evaluation | 4.0.1 Evaluation Metrics |
Evaluation | We use two evaluation metrics to evaluate subgroups detection accuracy: Purity and Entropy. |
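The two clustering metrics can be sketched as follows, assuming each detected subgroup is represented as a list of its members' gold labels. Normalization conventions for entropy vary, so this sketch reports raw bits, weighted by cluster size:

```python
from collections import Counter
from math import log2

def purity(clusters):
    """Purity (higher is better): each cluster is credited with its
    majority gold label; return the fraction of items so covered."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters if c) / n

def entropy(clusters):
    """Size-weighted average label entropy of the clusters, in bits
    (lower is better; 0 means every cluster is pure)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        h = -sum((v / len(c)) * log2(v / len(c))
                 for v in Counter(c).values())
        total += (len(c) / n) * h
    return total
```

Perfectly homogeneous clusters give purity 1.0 and entropy 0.0; mixing labels within a cluster lowers purity and raises entropy.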
Introduction | Good performance on this task is the most desired property of evaluation metrics during system development. |
Results and discussion | These input-level accuracies compare favorably with automatic evaluation metrics for other natural language processing tasks. |
Results and discussion | For example, at the 2008 ACL Workshop on Statistical Machine Translation, all fifteen automatic evaluation metrics, including variants of BLEU scores, achieved between 42% and 56% pairwise accuracy with human judgments at the sentence level (Callison-Burch et al., 2008).