Index of papers in Proc. ACL 2009 that mention
  • F-score
Abend, Omri and Reichart, Roi and Rappoport, Ari
Experimental Setup
We report an F-score as well (the harmonic mean of precision and recall).
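As a quick illustration of that definition, here is a minimal sketch in Python (the function name is ours, not taken from any of the indexed papers):

    def f_score(precision, recall):
        # Harmonic mean of precision and recall (the balanced F-measure).
        if precision + recall == 0.0:
            return 0.0
        return 2.0 * precision * recall / (precision + recall)

    print(f_score(0.75, 0.60))  # -> 0.666..., pulled toward the lower of the two values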
Experimental Setup
We use the standard parsing F-score evaluation measure.
Introduction
We use two measures to evaluate the performance of our algorithm, precision and F-score.
Introduction
Precision reflects the algorithm’s applicability for creating training data to be used by supervised SRL models, while the standard SRL F-score measures the model’s performance when used by itself.
Introduction
The first stage of our algorithm is shown to outperform a strong baseline both in terms of F-score and of precision.
Related Work
Better performance is achieved on the classification, where state-of-the-art supervised approaches achieve about 81% F-score on the in-domain identification task, of which about 95% are later labeled correctly (Màrquez et al., 2008).
Results
In the “Collocation Maximum F-score” setting, the collocation parameters were tuned so that the collocation algorithm achieves its maximum possible F-score.
Results
The best or close to best F-score is achieved when using the clause detection algorithm alone (59.14% for English, 23.34% for Spanish).
Results
Note that for both English and Spanish, F-score improvements are achieved via a precision improvement that is more significant than the recall degradation.
F-score is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Pervouchine, Vladimir and Li, Haizhou and Lin, Bo
Abstract
We propose a new evaluation metric, alignment entropy, grounded in information theory, to evaluate alignment quality without the need for a gold standard reference, and compare the metric with F-score.
Experiments
Next we conduct three experiments to study 1) alignment entropy vs. F-score, 2) the impact of alignment quality on transliteration accuracy, and 3) how to validate transliteration using alignment metrics.
Experiments
5.1 Alignment entropy vs. F-score
Experiments
We have manually aligned a random set of 3,000 transliteration pairs from the Xinhua training set to serve as the gold standard, on which we calculate the precision, recall and F-score as well as alignment entropy for each alignment.
Related Work
Denoting the number of cross-lingual mappings that are common in both A and G as C_{A∩G}, the number of cross-lingual mappings in A as C_A and the number of cross-lingual mappings in G as C_G, precision Pr is given as C_{A∩G}/C_A, recall Rc as C_{A∩G}/C_G, and F-score as 2·Pr·Rc/(Pr + Rc).
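Read as set operations, those definitions amount to the sketch below, assuming alignments are represented as sets of cross-lingual mappings (the representation and names are illustrative, not taken from the paper):

    def alignment_prf(A, G):
        # A: candidate alignment, G: gold alignment, both nonempty sets of
        # cross-lingual mappings, e.g. (source_substring, target_substring) pairs.
        common = len(A & G)                  # C_{A∩G}
        pr = common / len(A)                 # precision: C_{A∩G} / C_A
        rc = common / len(G)                 # recall:    C_{A∩G} / C_G
        f = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
        return pr, rc, f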
Transliteration alignment entropy
We expect and will show that this estimate is a good indicator of the alignment quality, and is as effective as the F-score, but without the need for a gold standard reference.
F-score is mentioned in 18 sentences in this paper.
Topics mentioned in this paper:
Pitler, Emily and Louis, Annie and Nenkova, Ani
Classification Results
The table lists the f-score for each of the target relations, with overall accuracy shown in brackets.
Classification Results
Given that the experiments are run on the natural distribution of the data, which is skewed towards Expansion relations, the f-score is the more important measure to track.
Classification Results
Our random baseline is the f-score one would achieve by randomly assigning classes in proportion to their true distribution in the test set.
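For a class making up a fraction p of the test set, proportional random assignment gives expected precision and recall of roughly p each, so the expected f-score is also roughly p. A sketch of that computation, with made-up class proportions (the paper's actual distribution is not reproduced in this index):

    # Hypothetical skew toward Expansion, echoing the setting described above.
    true_dist = {"Expansion": 0.50, "Contingency": 0.25,
                 "Comparison": 0.15, "Temporal": 0.10}

    for relation, p in true_dist.items():
        precision = recall = p              # both ~p under proportional random guessing
        f = 2 * precision * recall / (precision + recall)   # reduces to p
        print(f"{relation}: random-baseline f-score ~ {f:.2f}")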
F-score is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Niu, Zheng-Yu and Wang, Haifeng and Wu, Hua
Abstract
Evaluation on the Penn Chinese Treebank indicates that a converted dependency treebank helps constituency parsing, and the use of unlabeled data by self-training further increases the parsing f-score to 85.2%, resulting in a 6% error reduction over the previous best result.
Experiments of Grammar Formalism Conversion
Finally Q-10-method achieved an f-score of 93.8% on WSJ section 22, an absolute 4.4% improvement (42% error reduction) over the best result of Xia et al.
Experiments of Grammar Formalism Conversion
Finally Q-10-method achieved an f-score of 93.6% on WSJ sections 2–18 and 20–22, better than that of Q-0-method and comparable with that of Q-10-method in Section 3.1.
Experiments of Parsing
Finally we decided that the optimal value of λ was 0.4 and the optimal weight of CTB was 1, which brought the best performance on the development set (an f-score of 86.1%).
Experiments of Parsing
In comparison with the results in Section 4.1, the average index of converted trees in the 200-best list increased to 2, and their average unlabeled dependency f-score dropped to 65.4%.
Experiments of Parsing
84.2% f-score, better than the result of the reranking parser with CTB and CDTPS as training data (shown in Table 5).
Introduction
Our conversion method achieves 93.8% f-score on dependency trees produced from WSJ section 22, resulting in 42% error reduction over the previous best result for DS to PS conversion.
Introduction
When coupled with the self-training technique, a reranking parser with CTB and converted CDT as labeled data achieves an 85.2% f-score on the CTB test set, an absolute 1.0% improvement (6% error reduction) over the previous best result for Chinese parsing.
Our Two-Step Solution
Therefore we modified the selection metric in Section 2.1 by interpolating two scores, the probability of a conversion candidate from the parser and its unlabeled dependency f-score, shown as follows:
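The formula itself is not reproduced in this index. A plausible reading of "interpolating two scores", with λ = 0.4 as tuned on the development set above, is a simple linear combination; the exact form used in the paper may differ:

    LAMBDA = 0.4  # the value reported as optimal on the development set

    def selection_score(parser_prob, unlabeled_dep_fscore, lam=LAMBDA):
        # Hypothetical interpolation of the parser's probability for a
        # conversion candidate with its unlabeled dependency f-score.
        return lam * parser_prob + (1.0 - lam) * unlabeled_dep_fscore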
F-score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
duVerle, David and Prendinger, Helmut
Abstract
Using a rich set of shallow lexical, syntactic and structural features from the input text, our parser achieves, in linear time, 73.9% of professional annotators’ human agreement F-score .
Building a Discourse Parser
Current state-of-the-art results in automatic segmenting are much closer to human levels than full structure labeling (F-score ratios of automatic performance over gold standard reported in LeThanh et al.
Evaluation
Standard performance indicators for such a task are precision, recall and F-score as measured by the PARSEVAL metrics (Black et al., 1991), with the specific adaptations to the case of RST trees made by Marcu (2000, pages 143–144).
Evaluation
                S      N      R      F        S      N      R      F
    Precision   83.0   68.4   55.3   54.8     69.5   56.1   44.9   44.4
    Recall      83.0   68.4   55.3   54.8     69.2   55.8   44.7   44.2
    F-Score     83.0   68.4   55.3   54.8     69.3   56.0   44.8   44.3
Evaluation
                Manual                         SPADE                          -
                S      N      R      F        S      N      R      F        S      N      R      F
    Precision   84.1   70.6   55.6   55.1     70.6   58.1   46.0   45.6     88.0   77.5   66.0   65.2
    Recall      84.1   70.6   55.6   55.1     71.2   58.6   46.4   46.0     88.1   77.6   66.1   65.3
    F-Score     84.1   70.6   55.6   55.1     70.9   58.3   46.2   45.8     88.1   77.5   66.0   65.3
F-score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Huang, Fei
Abstract
Additionally, we remove low-confidence alignment links from the word alignment of a bilingual training corpus, which increases the alignment F-score, improves Chinese-English and Arabic-English translation quality, and significantly reduces the phrase translation table size.
Alignment Link Confidence Measure
Table 2 shows the precision, recall and F-score of individual alignments and the combined alignment.
Alignment Link Confidence Measure
Overall it improves the F-score by 1.5 points (from 69.3 to 70.8), with a 1.8-point improvement for content words and a 1.0-point improvement for function words.
Improved MaXEnt Aligner with Confidence-based Link Filtering
                Precision   Recall   F-score
    Baseline    72.66       66.17    69.26
    +ALF        78.14       64.36    70.59
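The F-score column follows directly from the precision and recall columns; a quick check of the table above also illustrates the point that a precision gain can outweigh a recall drop:

    def f_score(p, r):
        return 2 * p * r / (p + r)

    print(f_score(72.66, 66.17))  # ~69.26, the baseline row
    print(f_score(78.14, 64.36))  # ~70.59, +ALF: higher F despite lower recall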
Improved MaXEnt Aligner with Confidence-based Link Filtering
                Precision   Recall   F-score
    Baseline    84.43       83.64    84.04
    +ALF        88.29       83.14    85.64
Sentence Alignment Confidence Measure
The results in Figure 2 show a strong correlation between the confidence measure and the alignment F-score, with a correlation coefficient of -0.69.
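A correlation of that kind is an ordinary Pearson coefficient between per-sentence confidence and per-sentence alignment F-score; a minimal sketch with placeholder numbers (not the paper's data):

    import numpy as np

    # Hypothetical per-sentence values: higher confidence, lower F-score here.
    confidence = np.array([2.1, 1.4, 3.0, 0.9, 2.6])
    fscore     = np.array([65.0, 72.0, 58.0, 80.0, 61.0])

    r = np.corrcoef(confidence, fscore)[0, 1]
    print(r)  # negative, in the spirit of the reported -0.69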
F-score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Pitler, Emily and Nenkova, Ani
Discourse vs. non-discourse usage
Using the string of the connective as the only feature sets a reasonably high baseline, with an f-score of 75.33% and an accuracy of 85.86%.
Discourse vs. non-discourse usage
Interestingly, using only the syntactic features, ignoring the identity of the connective, is even better, resulting in an f-score of 88.19% and accuracy of 92.25%.
Discourse vs. non-discourse usage
Using both the connective and syntactic features is better than either individually, with an f-score of 92.28% and accuracy of 95.04%.
F-score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Tomanek, Katrin and Hahn, Udo
Experiments and Results
Table 2 lists the exact numbers of manually labeled tokens needed to reach the maximal (supervised) F-score on both corpora.
Experiments and Results
On the MUC7 corpus, FuSAL requires 7,374 annotated NPs to yield an F-score of 87%, while SeSAL reaches the same F-score with only 4,017 NPs.
Experiments and Results
On PENNBIOIE, SeSAL also saves about 45% compared to FuSAL to achieve an F-score of 81%.
F-score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Clark, Stephen and Curran, James R.
Evaluation
The third row is similar, but for sentences for which the oracle F-score is greater than 92%.
The CCG to PTB Conversion
shows that converting gold-standard CCG derivations into the GRs in DepBank resulted in an F-score of only 85%; hence the upper bound on the performance of the CCG parser, using this evaluation scheme, was only 85%.
The CCG to PTB Conversion
The numbers are bracketing precision, recall, F-score and complete sentence matches, using the EVALB evaluation script.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: