Conclusions | This evaluation metric allows for a deeper understanding of how certain normalization actions impact the output of the parser. |
Evaluation | 5.1 Evaluation Metrics |
Evaluation | Therefore, we propose a new evaluation metric that directly equates normalization performance with the performance of a common downstream application—dependency parsing. |
Introduction | Another potential problem with state-of-the-art normalization is the lack of appropriate evaluation metrics.
Introduction | For instance, it is unclear how performance measured by the typical normalization evaluation metrics of word error rate and BLEU score (Papineni et al., 2002) translates into performance on a parsing task, where a well-placed punctuation mark may provide more substantial improvements than changing a nonstandard word form.
Introduction | To address this problem, this work introduces an evaluation metric that ties normalization performance directly to the performance of a downstream dependency parser. |
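As a concrete illustration of such a parser-based score (a minimal sketch, not the authors' exact procedure): parse the system-normalized sentence and the reference-normalized sentence with the same dependency parser, then measure their agreement, e.g. with unlabeled attachment score (UAS). The head arrays below are invented, and the sketch assumes normalization preserves the token count:

```python
def uas(heads_a, heads_b):
    """Unlabeled attachment score between two parses of the same
    token sequence: the fraction of tokens assigned the same head."""
    assert len(heads_a) == len(heads_b)  # assumes token count is preserved
    agree = sum(a == b for a, b in zip(heads_a, heads_b))
    return agree / len(heads_a)

# invented head indices (0 = root), one per token
parse_of_system_output = [2, 0, 2, 3]
parse_of_reference = [2, 0, 2, 2]
print(uas(parse_of_system_output, parse_of_reference))  # 0.75
```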
Abstract | This work proposes a new segmentation evaluation metric, named boundary similarity (B), an inter-coder agreement coefficient adaptation, and a confusion matrix for segmentation that are all based upon an adaptation of the boundary edit distance in Fournier and Inkpen (2012).
Conclusions | In this work, a new segmentation evaluation metric, referred to as boundary similarity (B), is proposed as an unbiased metric, along with a boundary-edit-distance-based (BED-based) confusion matrix to compute predictably biased IR metrics such as precision and recall.
Conclusions | B also allows for an intuitive comparison of boundary pairs between segmentations, as opposed to the window counts of WD or the simplistic edit count normalization of S. When an unbiased segmentation evaluation metric is desired, this work recommends using B, with an upper and lower bound to provide context.
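To make the contrast with window counting concrete, here is a deliberately simplified sketch of the boundary-pair idea: exact boundary matches, near misses paired within a small window, and remaining boundaries counted as full errors. The window n = 2 and the 0.5 near-miss weight are illustrative assumptions, not the published boundary-edit-distance parameters:

```python
def boundary_similarity(bounds_a, bounds_b, n=2, w_t=0.5):
    """Simplified illustration of boundary similarity (B): count exact
    boundary matches, pair leftover boundaries within distance n as
    near misses, and treat the rest as full errors (additions)."""
    a, b = set(bounds_a), set(bounds_b)
    matches = a & b
    rest_a, rest_b = sorted(a - matches), sorted(b - matches)
    near = 0
    for pos in rest_a:
        close = [q for q in rest_b if abs(q - pos) <= n]
        if close:
            rest_b.remove(close[0])
            near += 1
    additions = (len(rest_a) - near) + len(rest_b)
    total = len(matches) + near + additions
    return 1.0 if total == 0 else 1 - (additions + w_t * near) / total

# one exact match, one near miss, one full error -> 0.375
print(boundary_similarity({5, 11, 20}, {5, 12, 27}))
```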
Evaluation of Automatic Segmenters | An ideal segmentation evaluation metric should place the three automatic segmenters between the upper and lower bounds in terms of performance, provided that both the metric and the segmenters function properly.
Introduction | To select an automatic segmenter for a particular task, a variety of segmentation evaluation metrics have been proposed, including Pk (Beeferman and Berger, 1999).
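For reference, Pk itself is easy to state: slide a window of fixed width k over the text and count how often the reference and the hypothesis disagree on whether the window's two endpoints lie in the same segment. A minimal sketch, assuming segmentations are given as per-unit segment labels:

```python
def pk(ref, hyp, k=None):
    """Pk (Beeferman & Berger, 1999): probability that a width-k window
    is classified differently (same segment vs. different segments)
    by the reference and the hypothesis."""
    if k is None:
        # conventional choice: half the mean reference segment length
        k = max(2, round(len(ref) / len(set(ref)) / 2))
    trials = len(ref) - k
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(trials))
    return errors / trials

ref = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
hyp = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
print(pk(ref, hyp))  # 0.25: modest penalty for the near-miss boundary
```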
Conclusions and Future Work | the evaluation metrics employed are to be questioned (certainly),
Evaluation | 5.1 Data and Evaluation Metrics |
Evaluation | We evaluate our system with the coreference resolution evaluation metrics that were used for the CoNLL shared tasks on coreference, which are MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998) and CEAFe (Luo, 2005). |
Evaluation | We also report the unweighted average of the three scores, which was the official evaluation metric in the shared tasks. |
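A worked sketch of that official score, assuming (precision, recall) pairs for MUC, B3 and CEAFe are already available (the numbers below are invented):

```python
def f1(p, r):
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def conll_score(muc, b3, ceafe):
    """Unweighted average of the three F1 scores (the official metric)."""
    return sum(f1(p, r) for p, r in (muc, b3, ceafe)) / 3

# invented (precision, recall) pairs for MUC, B3 and CEAFe
print(conll_score((0.70, 0.65), (0.60, 0.55), (0.58, 0.52)))
```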
Abstract | Experimental results show that our graph propagation method significantly improves performance over two strong baselines under intrinsic and extrinsic evaluation metrics.
Experiments & Results 4.1 Experimental Setup | Two intrinsic evaluation metrics that we use to evaluate the possible translations for OOVs are Mean Reciprocal Rank (MRR) (Voorhees, 1999) and Recall.
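Both metrics are straightforward to compute given ranked candidate lists and reference translations per OOV; a minimal sketch with invented data:

```python
def mrr(ranked_lists, gold):
    """Mean reciprocal rank of the first correct translation;
    OOVs with no correct candidate contribute 0."""
    total = 0.0
    for cands, ref in zip(ranked_lists, gold):
        for rank, c in enumerate(cands, start=1):
            if c in ref:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall(ranked_lists, gold):
    """Fraction of OOVs for which at least one correct candidate appears."""
    hits = sum(any(c in ref for c in cands)
               for cands, ref in zip(ranked_lists, gold))
    return hits / len(ranked_lists)

# invented candidates and reference translations for two OOVs
cands = [["maison", "domicile"], ["chat", "chien"]]
refs = [{"domicile"}, {"lapin"}]
print(mrr(cands, refs), recall(cands, refs))  # 0.25 0.5
```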
Experiments & Results 4.1 Experimental Setup | Intrinsic evaluation metrics are faster to apply and are used to optimize different hyper-parameters of the approach (e.g. |
Experiments & Results 4.1 Experimental Setup | BLEU (Papineni et al., 2002) is still the de facto evaluation metric for machine translation, and we use it to measure the quality of our proposed approaches for MT.
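For reference, corpus-level BLEU can be computed with an off-the-shelf implementation such as NLTK's (one readily available option, not necessarily the toolkit used in the paper):

```python
from nltk.translate.bleu_score import corpus_bleu

# one tokenized hypothesis sentence with one tokenized reference
references = [[["the", "cat", "sat", "on", "the", "mat"]]]
hypotheses = [["the", "cat", "sat", "on", "a", "mat"]]
print(corpus_bleu(references, hypotheses))  # default uniform 1-4 gram weights
```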
Conclusion and Future Work | We present an evaluation metric for whole-sentence semantic analysis, and show that it can be computed efficiently. |
Introduction | In this work, we provide an evaluation metric that uses the degree of overlap between two whole-sentence semantic structures as the partial credit. |
Semantic Overlap | Our evaluation metric measures precision, recall, and f-score of the triples in the second AMR against the triples in the first AMR, i.e., the amount of propositional overlap. |
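A minimal sketch of that triple-overlap computation, representing each AMR as a set of (relation, source, target) triples; note that it deliberately ignores the variable-alignment search that a full metric such as Smatch performs:

```python
def triple_overlap(amr_first, amr_second):
    """Precision/recall/F-score of the second AMR's triples against
    the first AMR's triples (simplified: variable names assumed fixed)."""
    a, b = set(amr_first), set(amr_second)
    shared = len(a & b)
    p = shared / len(b)
    r = shared / len(a)
    f = 0.0 if p + r == 0 else 2 * p * r / (p + r)
    return p, r, f

gold = {("instance", "w", "want-01"), ("instance", "b", "boy"),
        ("ARG0", "w", "b")}
pred = {("instance", "w", "want-01"), ("instance", "b", "boy"),
        ("ARG1", "w", "b")}
print(triple_overlap(gold, pred))  # (0.667, 0.667, 0.667)
```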
Abstract | Our experiments on the CoNLL-2012 Shared Task English datasets (gold mentions) indicate that our method is robust relative to different clustering strategies and evaluation metrics, showing large and consistent improvements over a single pairwise model using the same base features.
Experiments | 5.3 Evaluation metrics |
Introduction | As will be shown based on a variety of experiments on the CoNLL-2012 Shared Task English datasets, these improvements are consistent across different evaluation metrics and for the most part independent of the clustering decoder that was used. |
Conclusions and Future Work | We also proposed a new name-aware evaluation metric.
Introduction | Propose a new MT evaluation metric that can discriminate between names and non-informative words (Section 4).
Name-aware MT Evaluation | Traditional MT evaluation metrics such as BLEU (Papineni et al., 2002) and Translation Edit Rate (TER) (Snover et al., 2006) assign equal weight to all tokens.
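As a toy illustration of the underlying idea (weighting name tokens more heavily than ordinary tokens), and emphatically not the paper's actual name-aware metric, consider a weighted token-level precision; the name set and weight are invented:

```python
def name_weighted_precision(hyp_tokens, ref_tokens, names, name_weight=2.0):
    """Token-level precision in which tokens known to be names count
    name_weight times as much as ordinary tokens. A toy sketch only."""
    ref = set(ref_tokens)
    weight = lambda t: name_weight if t in names else 1.0
    matched = sum(weight(t) for t in hyp_tokens if t in ref)
    total = sum(weight(t) for t in hyp_tokens)
    return matched / total

hyp = ["obama", "visited", "the", "paris", "summit"]
ref = ["obama", "attended", "the", "paris", "summit"]
print(name_weighted_precision(hyp, ref, names={"obama", "paris"}))  # 0.857
```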
Experiment 1: Textual Similarity | Three evaluation metrics are provided by the organizers of the SemEval-2012 STS task, all of which are based on Pearson correlation r of human judgments with system outputs: (1) the correlation value for the concatenation of all five datasets (ALL), (2) a correlation value obtained on a concatenation of the outputs, separately normalized by least squares (ALLnrm), and (3) the weighted average of Pearson correlations across datasets (Mean).
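The ALL and Mean variants can be sketched directly (ALLnrm additionally rescales each dataset's outputs by least squares before concatenating, which is omitted here); human and system are assumed to be parallel lists of per-dataset score lists:

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def all_and_mean(human, system):
    """ALL: r over the concatenation of all datasets.
    Mean: per-dataset r averaged, weighted by dataset size."""
    all_r = pearson([v for d in human for v in d],
                    [v for d in system for v in d])
    sizes = [len(d) for d in human]
    mean_r = sum(pearson(h, s) * n
                 for h, s, n in zip(human, system, sizes)) / sum(sizes)
    return all_r, mean_r
```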
Experiment 1: Textual Similarity | Table 2 shows the scores obtained by ADW for the three evaluation metrics, as well as the Pearson correlation values obtained on each of the five test sets (rightmost columns).
Experiment 1: Textual Similarity | As can be seen from Table 2, our system (ADW) outperforms all 88 participating systems according to all the evaluation metrics.
Evaluation Setup | Evaluation Metrics: We use two evaluation metrics.
Experiment and Analysis | Moreover, increasing the number of coarse annotations used in training leads to further improvement on different evaluation metrics . |
Experiment and Analysis | Figure 5 also illustrates slightly different characteristics of transfer performance between the two evaluation metrics.
Experiments | 3.1 Data Set and Evaluation Metrics |
Experiments | Evaluation Metrics: We evaluate the performance of question retrieval using the following metrics: Mean Average Precision (MAP) and Precision@N (P@N).
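Both are standard IR metrics; a minimal sketch, with ranked and relevant standing in for retrieved questions and relevance judgments (the data is invented):

```python
def average_precision(ranked, relevant):
    """AP for one query: mean of the precision values at each relevant hit."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def precision_at_n(ranked, relevant, n):
    return sum(doc in relevant for doc in ranked[:n]) / n

ranked = ["q3", "q1", "q7", "q2"]
relevant = {"q1", "q2"}
print(average_precision(ranked, relevant))  # (1/2 + 2/4) / 2 = 0.5
print(precision_at_n(ranked, relevant, 2))  # 0.5
```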
Our Approach | where the feature vector Φ(q, d) = (s_VSM(q, d), s(q1, d1), s(q2, d2), ..., s(qP, dP)), and θ is the corresponding weight vector; we optimize θ for our evaluation metrics directly using the Powell Search algorithm (Paul et al., 1992) via cross-validation.
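A sketch of what such metric-direct tuning can look like using SciPy's derivative-free Powell method (one readily available implementation, not necessarily the one used in the paper); the feature matrices and relevance labels below are invented:

```python
import numpy as np
from scipy.optimize import minimize

def map_score(theta, queries):
    """MAP over queries; each query is (feature_matrix, relevance_labels),
    with one row of features per candidate document."""
    aps = []
    for feats, rel in queries:
        order = np.argsort(-(feats @ theta))  # rank by theta . phi(q, d)
        ranked_rel = np.asarray(rel)[order]
        precs = np.cumsum(ranked_rel) / np.arange(1, len(rel) + 1)
        aps.append(precs[ranked_rel == 1].mean())
    return float(np.mean(aps))

# invented data: 2 queries, 3 candidates each, 2 features per candidate
queries = [
    (np.array([[0.9, 0.1], [0.4, 0.8], [0.2, 0.3]]), [1, 0, 0]),
    (np.array([[0.3, 0.7], [0.8, 0.2], [0.5, 0.5]]), [0, 1, 1]),
]

# Powell's method is derivative-free, so it can optimize the
# non-differentiable ranking metric directly.
result = minimize(lambda th: -map_score(th, queries),
                  x0=np.ones(2), method="Powell")
print(result.x, -result.fun)
```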