Experimental Setup | We report an F-score as well (the harmonic mean of precision and recall).
Experimental Setup | We use the standard parsing F-score evaluation measure. |
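Since every excerpt below leans on the same definition, here is a minimal sketch of the balanced F-score (the harmonic mean of precision and recall, as stated above); the function name is ours, not any paper's API.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: precision 0.80, recall 0.60 -> F = 2*0.8*0.6/1.4 ≈ 0.686
print(f_score(0.80, 0.60))
```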
Introduction | We use two measures to evaluate the performance of our algorithm, precision and F-score.
Introduction | Precision reflects the algorithm’s applicability for creating training data to be used by supervised SRL models, while the standard SRL F-score measures the model’s performance when used by itself. |
Introduction | The first stage of our algorithm is shown to outperform a strong baseline both in terms of F-score and of precision. |
Related Work | Better performance is achieved on classification: state-of-the-art supervised approaches achieve about 81% F-score on the in-domain identification task, and about 95% of the identified arguments are subsequently labeled correctly (Màrquez et al., 2008).
Results | In the “Collocation Maximum F-score” configuration, the collocation parameters were tuned so that the maximum possible F-score for the collocation algorithm is achieved.
Results | The best or close to best F-score is achieved when using the clause detection algorithm alone (59.14% for English, 23.34% for Spanish). |
Results | Note that for both English and Spanish, the F-score improvements are achieved via a precision improvement that outweighs the recall degradation.
Abstract | We propose a new evaluation metric, alignment entropy, grounded in information theory, to evaluate alignment quality without the need for a gold standard reference, and compare the metric with the F-score.
Experiments | Next we conduct three experiments to study 1) alignment entropy vs. F-score, 2) the impact of alignment quality on transliteration accuracy, and 3) how to validate transliteration using alignment metrics.
Experiments | 5.1 Alignment entropy vs. F-score |
Experiments | We have manually aligned a random set of 3,000 transliteration pairs from the Xinhua training set to serve as the gold standard, on which we calculate the precision, recall and F-score as well as alignment entropy for each alignment. |
Related Work | Denoting the number of cross-lingual mappings that are common in both A and G as C_AG, the number of cross-lingual mappings in A as C_A, and the number of cross-lingual mappings in G as C_G, precision Pr is given as C_AG/C_A, recall Rc as C_AG/C_G, and F-score as 2·Pr·Rc/(Pr + Rc).
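Following the definitions just given, a small sketch of how these counts yield precision, recall and F-score over alignment link sets; representing mappings as (source, target) index pairs is our assumption, not the paper's.

```python
def alignment_prf(a_links, g_links):
    """Precision, recall and F-score of alignment A against gold G.

    a_links, g_links -- iterables of cross-lingual mappings, e.g.
    (source_index, target_index) pairs (representation assumed).
    """
    a, g = set(a_links), set(g_links)
    c_ag = len(a & g)                        # C_AG: mappings common to A and G
    pr = c_ag / len(a) if a else 0.0         # C_AG / C_A
    rc = c_ag / len(g) if g else 0.0         # C_AG / C_G
    f = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f
```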
Transliteration alignment entropy | We expect and will show that this estimate is a good indicator of the alignment quality, and is as effective as the F-score, but without the need for a gold standard reference.
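The excerpts do not reproduce the formal definition of alignment entropy. Purely as an assumption, one way to instantiate such a gold-free measure is the frequency-weighted entropy of what each target grapheme aligns to, where P(c) and P(e | c) are empirical distributions estimated from the alignment itself; the paper's actual definition may differ.

```latex
H(A) = -\sum_{c} P(c) \sum_{e} P(e \mid c)\, \log P(e \mid c)
```

Under this reading, a lower-entropy (more consistent) alignment is taken as higher quality.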
Classification Results | The table lists the f-score for each of the target relations, with overall accuracy shown in brackets. |
Classification Results | Given that the experiments are run on the natural distribution of the data, which is skewed towards Expansion relations, the f-score is the more important measure to track.
Classification Results | Our random baseline is the f-score one would achieve by randomly assigning classes in proportion to their true distribution in the test set.
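To make this baseline concrete: if a relation class makes up proportion p of the test set, assigning labels at random in that proportion gives expected precision p (a random prediction of the class is right with probability p) and expected recall p (a random true instance is hit with probability p), so the expected per-class f-score is simply

```latex
F = \frac{2 \cdot p \cdot p}{p + p} = p
```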
Abstract | Evaluation on the Penn Chinese Treebank indicates that a converted dependency treebank helps constituency parsing and the use of unlabeled data by self-training further increases parsing f-score to 85.2%, resulting in 6% error reduction over the previous best result. |
Experiments of Grammar Formalism Conversion | Finally Q-10-method achieved an f-score of 93.8% on WSJ section 22, an absolute 4.4% improvement (42% error reduction) over the best result of Xia et al. |
Experiments of Grammar Formalism Conversion | Finally Q-10-method achieved an f-score of 93.6% on WSJ sections 2~18 and 20~22, better than that of Q-0-method and comparable with that of Q-10-method in Section 3.1.
Experiments of Parsing | Finally we decided that the optimal value of λ was 0.4 and the optimal weight of CTB was 1, which brought the best performance on the development set (an f-score of 86.1%).
Experiments of Parsing | In comparison with the results in Section 4.1, the average index of converted trees in the 200-best list increased to 2, and their average unlabeled dependency f-score dropped to 65.4%.
Experiments of Parsing | 84.2% f-score, better than the result of the reranking parser with CTB and CDTPS as training data (shown in Table 5).
Introduction | Our conversion method achieves 93.8% f-score on dependency trees produced from WSJ section 22, resulting in 42% error reduction over the previous best result for DS to PS conversion. |
Introduction | When coupled with self-training technique, a reranking parser with CTB and converted CDT as labeled data achieves 85.2% f-score on CTB test set, an absolute 1.0% improvement (6% error reduction) over the previous best result for Chinese parsing. |
Our Two-Step Solution | Therefore we modified the selection metric in Section 2.1 by interpolating two scores, the probability of a conversion candidate from the parser and its unlabeled dependency f-score , shown as follows: |
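The excerpt stops at the colon without the formula. A plausible reconstruction of the interpolated selection metric, consistent with the interpolation weight λ = 0.4 tuned above but not confirmed by these excerpts, is

```latex
score(c) = \lambda \cdot P_{\mathrm{parser}}(c) + (1 - \lambda) \cdot F_{\mathrm{dep}}(c)
```

where P_parser(c) is the parser's probability for conversion candidate c and F_dep(c) is its unlabeled dependency f-score.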
Abstract | Using a rich set of shallow lexical, syntactic and structural features from the input text, our parser achieves, in linear time, 73.9% of professional annotators’ human agreement F-score . |
Building a Discourse Parser | Current state-of-the-art results in automatic segmenting are much closer to human levels than full structure labeling (F-score ratios of automatic performance over gold standard reported in LeThanh et al.
Evaluation | Standard performance indicators for such a task are precision, recall and F-score as measured by the PARSEVAL metrics (Black et al., 1991), with the specific adaptations to the case of RST trees made by Marcu (2000, page 143-144). |
Evaluation |
              S     N     R     F      S     N     R     F
  Precision  83.0  68.4  55.3  54.8   69.5  56.1  44.9  44.4
  Recall     83.0  68.4  55.3  54.8   69.2  55.8  44.7  44.2
  F-Score    83.0  68.4  55.3  54.8   69.3  56.0  44.8  44.3
Evaluation |
             Manual                  SPADE
              S     N     R     F      S     N     R     F      S     N     R     F
  Precision  84.1  70.6  55.6  55.1   70.6  58.1  46.0  45.6   88.0  77.5  66.0  65.2
  Recall     84.1  70.6  55.6  55.1   71.2  58.6  46.4  46.0   88.1  77.6  66.1  65.3
  F-Score    84.1  70.6  55.6  55.1   70.9  58.3  46.2  45.8   88.1  77.5  66.0  65.3
Abstract | Additionally, we remove low confidence alignment links from the word alignment of a bilingual training corpus, which increases the alignment F-score , improves Chinese-English and Arabic-English translation quality and significantly reduces the phrase translation table size. |
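A minimal sketch of the confidence-based link filtering idea (the “+ALF” rows in the tables further down): drop alignment links whose confidence falls below a threshold. The link representation and the threshold value are assumptions, not the paper's.

```python
def filter_low_confidence_links(links, confidence, threshold=0.5):
    """Keep only alignment links whose confidence clears the threshold.

    links      -- iterable of (src_pos, tgt_pos) alignment links (form assumed)
    confidence -- mapping from link to its confidence score
    threshold  -- cutoff below which links are removed (value assumed)
    """
    return [link for link in links if confidence[link] >= threshold]
```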
Alignment Link Confidence Measure | Table 2 shows the precision, recall and F-score of individual alignments and the combined alignment.
Alignment Link Confidence Measure | Overall it improves the F-score by 1.5 points (from 69.3 to 70.8), with a 1.8-point improvement for content words and 1.0 point for function words.
Improved MaXEnt Aligner with Confidence-based Link Filtering |
              Precision  Recall  F-score
  Baseline      72.66     66.17   69.26
  +ALF          78.14     64.36   70.59
Improved MaXEnt Aligner with Confidence-based Link Filtering |
              Precision  Recall  F-score
  Baseline      84.43     83.64   84.04
  +ALF          88.29     83.14   85.64
Sentence Alignment Confidence Measure | Aligner   F-score   Corr.
Sentence Alignment Confidence Measure | The results in Figure 2 show a strong correlation between the confidence measure and the alignment F-score, with a correlation coefficient of -0.69.
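For illustration, such a coefficient can be computed as a standard Pearson correlation between per-sentence confidence and per-sentence F-score; the numbers below are toy values, not the paper's data.

```python
import numpy as np

confidences = np.array([0.2, 0.5, 0.7, 0.9])      # toy sentence confidences
f_scores    = np.array([0.81, 0.74, 0.66, 0.60])  # toy alignment F-scores
r = np.corrcoef(confidences, f_scores)[0, 1]      # Pearson correlation
print(round(r, 2))                                # negative, as in the excerpt
```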
Discourse vs. non-discourse usage | Using the string of the connective as the only feature sets a reasonably high baseline, with an f-score of 75.33% and an accuracy of 85.86%. |
Discourse vs. non-discourse usage | Interestingly, using only the syntactic features, ignoring the identity of the connective, is even better, resulting in an f-score of 88.19% and accuracy of 92.25%. |
Discourse vs. non-discourse usage | Using both the connective and syntactic features is better than either individually, with an f-score of 92.28% and accuracy of 95.04%. |
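A hedged sketch of the combined setup these three results describe: the connective string plus syntactic context as features of a single classifier. The feature names and the choice of logistic regression are illustrative assumptions, not the paper's exact setup.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training examples: 1 = discourse usage, 0 = non-discourse usage.
X = [
    {"conn": "and",   "self_cat": "CC", "parent_cat": "NP"},
    {"conn": "and",   "self_cat": "CC", "parent_cat": "S"},
    {"conn": "after", "self_cat": "IN", "parent_cat": "PP"},
    {"conn": "once",  "self_cat": "RB", "parent_cat": "ADVP"},
]
y = [0, 1, 1, 0]

clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(X, y)
print(clf.predict([{"conn": "and", "self_cat": "CC", "parent_cat": "S"}]))
```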
Experiments and Results | Table 2 reports the exact numbers of manually labeled tokens needed to reach the maximal (supervised) F-score on both corpora.
Experiments and Results | On the MUC7 corpus, FuSAL requires 7,374 annotated NPs to yield an F-score of 87%, while SeSAL reaches the same F-score with only 4,017 NPs.
Experiments and Results | On PennBioIE, SeSAL also saves about 45% of the annotation effort compared to FuSAL to achieve an F-score of 81%.
Evaluation | The third row is similar, but for sentences for which the oracle F-score is greater than 92%.
The CCG to PTB Conversion | shows that converting gold-standard CCG derivations into the GRs in DepBank resulted in an F-score of only 85%; hence the upper bound on the performance of the CCG parser, using this evaluation scheme, was only 85%. |
The CCG to PTB Conversion | The numbers are bracketing precision, recall, F-score and complete sentence matches, using the EVALB evaluation script. |