Experimental Setup | We report an F-score as well (the harmonic mean of precision and recall).
Experimental Setup | We use the standard parsing F-score evaluation measure. |
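Since every excerpt below leans on the same definition, here is a minimal sketch of the balanced F-score (the harmonic mean of precision and recall, as stated above); the function name is ours, not any paper's API.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: precision 0.80, recall 0.60 -> F = 2*0.8*0.6/1.4 ≈ 0.686
print(f_score(0.80, 0.60))
```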
Introduction | We use two measures to evaluate the performance of our algorithm, precision and F-score.
Introduction | Precision reflects the algorithm’s applicability for creating training data to be used by supervised SRL models, while the standard SRL F-score measures the model’s performance when used by itself. |
Introduction | The first stage of our algorithm is shown to outperform a strong baseline both in terms of F-score and of precision. |
Related Work | Better performance is achieved on classification: state-of-the-art supervised approaches achieve about 81% F-score on the in-domain identification task, and about 95% of the identified arguments are subsequently labeled correctly (Màrquez et al., 2008).
Results | In the “Collocation Maximum F-score” configuration, the collocation parameters were tuned so that the maximum possible F-score for the collocation algorithm is achieved.
Results | The best or close to best F-score is achieved when using the clause detection algorithm alone (59.14% for English, 23.34% for Spanish). |
Results | Note that for both English and Spanish, the F-score improvements are achieved via a precision improvement that outweighs the recall degradation.
Abstract | We propose a new evaluation metric, alignment entropy, grounded in information theory, to evaluate alignment quality without the need for a gold standard reference, and compare the metric with the F-score.
Experiments | Next we conduct three experiments to study 1) alignment entropy vs. F-score, 2) the impact of alignment quality on transliteration accuracy, and 3) how to validate transliteration using alignment metrics.
Experiments | 5.1 Alignment entropy vs. F-score |
Experiments | We have manually aligned a random set of 3,000 transliteration pairs from the Xinhua training set to serve as the gold standard, on which we calculate the precision, recall and F-score as well as alignment entropy for each alignment. |
Related Work | Denoting the number of cross-lingual mappings that are common in both A and G as C_AG, the number of cross-lingual mappings in A as C_A, and the number of cross-lingual mappings in G as C_G, precision Pr is given as C_AG/C_A, recall Rc as C_AG/C_G, and F-score as 2·Pr·Rc/(Pr + Rc).
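Following the definitions just given, a small sketch of how these counts yield precision, recall and F-score over alignment link sets; representing mappings as (source, target) index pairs is our assumption, not the paper's.

```python
def alignment_prf(a_links, g_links):
    """Precision, recall and F-score of alignment A against gold G.

    a_links, g_links -- iterables of cross-lingual mappings, e.g.
    (source_index, target_index) pairs (representation assumed).
    """
    a, g = set(a_links), set(g_links)
    c_ag = len(a & g)                        # C_AG: mappings common to A and G
    pr = c_ag / len(a) if a else 0.0         # C_AG / C_A
    rc = c_ag / len(g) if g else 0.0         # C_AG / C_G
    f = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f
```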
Transliteration alignment entropy | We expect and will show that this estimate is a good indicator of the alignment quality, and is as effective as the F-score, but without the need for a gold standard reference.
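The excerpts do not reproduce the formal definition of alignment entropy. Purely as an assumption, one way to instantiate such a gold-free measure is the frequency-weighted entropy of what each target grapheme aligns to, where P(c) and P(e | c) are empirical distributions estimated from the alignment itself; the paper's actual definition may differ.

```latex
H(A) = -\sum_{c} P(c) \sum_{e} P(e \mid c)\, \log P(e \mid c)
```

Under this reading, a lower-entropy (more consistent) alignment is taken as higher quality.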
Classification Results | The table lists the f-score for each of the target relations, with overall accuracy shown in brackets. |
Classification Results | Given that the experiments are run on the natural distribution of the data, which is skewed towards Expansion relations, the f-score is the more important measure to track.
Classification Results | Our random baseline is the f-score one would achieve by randomly assigning classes in proportion to their true distribution in the test set.
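To make this baseline concrete: if a relation class makes up proportion p of the test set, assigning labels at random in that proportion gives expected precision p (a random prediction of the class is right with probability p) and expected recall p (a random true instance is hit with probability p), so the expected per-class f-score is simply

```latex
F = \frac{2 \cdot p \cdot p}{p + p} = p
```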
Abstract | Evaluation on the Penn Chinese Treebank indicates that a converted dependency treebank helps constituency parsing and the use of unlabeled data by self-training further increases parsing f-score to 85.2%, resulting in 6% error reduction over the previous best result. |
Experiments of Grammar Formalism Conversion | Finally Q-10-method achieved an f-score of 93.8% on WSJ section 22, an absolute 4.4% improvement (42% error reduction) over the best result of Xia et al. |
Experiments of Grammar Formalism Conversion | Finally Q-10-method achieved an f-score of 93.6% on WSJ sections 2~18 and 20~22, better than that of Q-0-method and comparable with that of Q-10-method in Section 3.1.
Experiments of Parsing | Finally we decided that the optimal value of λ was 0.4 and the optimal weight of CTB was 1, which brought the best performance on the development set (an f-score of 86.1%).
Experiments of Parsing | In comparison with the results in Section 4.1, the average index of converted trees in the 200-best list increased to 2, and their average unlabeled dependency f-score dropped to 65.4%.
Experiments of Parsing | 84.2% f-score, better than the result of the reranking parser with CTB and CDTPS as training data (shown in Table 5).
Introduction | Our conversion method achieves 93.8% f-score on dependency trees produced from WSJ section 22, resulting in 42% error reduction over the previous best result for DS to PS conversion. |
Introduction | When coupled with self-training technique, a reranking parser with CTB and converted CDT as labeled data achieves 85.2% f-score on CTB test set, an absolute 1.0% improvement (6% error reduction) over the previous best result for Chinese parsing. |
Our Two-Step Solution | Therefore we modified the selection metric in Section 2.1 by interpolating two scores, the probability of a conversion candidate from the parser and its unlabeled dependency f-score , shown as follows: |
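The excerpt stops at the colon without the formula. A plausible reconstruction of the interpolated selection metric, consistent with the interpolation weight λ = 0.4 tuned above but not confirmed by these excerpts, is

```latex
score(c) = \lambda \cdot P_{\mathrm{parser}}(c) + (1 - \lambda) \cdot F_{\mathrm{dep}}(c)
```

where P_parser(c) is the parser's probability for conversion candidate c and F_dep(c) is its unlabeled dependency f-score.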
Abstract | Using a rich set of shallow lexical, syntactic and structural features from the input text, our parser achieves, in linear time, 73.9% of professional annotators’ human agreement F-score . |
Building a Discourse Parser | Current state-of-the-art results in automatic segmenting are much closer to human levels than full structure labeling (F-score ratios of automatic performance over gold standard reported in LeThanh et al.
Evaluation | Standard performance indicators for such a task are precision, recall and F-score as measured by the PARSEVAL metrics (Black et al., 1991), with the specific adaptations to the case of RST trees made by Marcu (2000, page 143-144). |
Evaluation |
              S     N     R     F      S     N     R     F
  Precision  83.0  68.4  55.3  54.8   69.5  56.1  44.9  44.4
  Recall     83.0  68.4  55.3  54.8   69.2  55.8  44.7  44.2
  F-Score    83.0  68.4  55.3  54.8   69.3  56.0  44.8  44.3
Evaluation |
             Manual                  SPADE
              S     N     R     F      S     N     R     F      S     N     R     F
  Precision  84.1  70.6  55.6  55.1   70.6  58.1  46.0  45.6   88.0  77.5  66.0  65.2
  Recall     84.1  70.6  55.6  55.1   71.2  58.6  46.4  46.0   88.1  77.6  66.1  65.3
  F-Score    84.1  70.6  55.6  55.1   70.9  58.3  46.2  45.8   88.1  77.5  66.0  65.3
Abstract | Additionally, we remove low confidence alignment links from the word alignment of a bilingual training corpus, which increases the alignment F-score , improves Chinese-English and Arabic-English translation quality and significantly reduces the phrase translation table size. |
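A minimal sketch of the confidence-based link filtering idea (the “+ALF” rows in the tables further down): drop alignment links whose confidence falls below a threshold. The link representation and the threshold value are assumptions, not the paper's.

```python
def filter_low_confidence_links(links, confidence, threshold=0.5):
    """Keep only alignment links whose confidence clears the threshold.

    links      -- iterable of (src_pos, tgt_pos) alignment links (form assumed)
    confidence -- mapping from link to its confidence score
    threshold  -- cutoff below which links are removed (value assumed)
    """
    return [link for link in links if confidence[link] >= threshold]
```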
Alignment Link Confidence Measure | Table 2 shows the precision, recall and F-score of individual alignments and the combined alignment.
Alignment Link Confidence Measure | Overall it improves the F-score by 1.5 points (from 69.3 to 70.8), with a 1.8-point improvement for content words and 1.0 point for function words.
Improved MaXEnt Aligner with Confidence-based Link Filtering |
              Precision  Recall  F-score
  Baseline      72.66     66.17   69.26
  +ALF          78.14     64.36   70.59
Improved MaXEnt Aligner with Confidence-based Link Filtering |
              Precision  Recall  F-score
  Baseline      84.43     83.64   84.04
  +ALF          88.29     83.14   85.64
Sentence Alignment Confidence Measure | Aligner   F-score   Corr.
Sentence Alignment Confidence Measure | The results in Figure 2 show a strong correlation between the confidence measure and the alignment F-score, with a correlation coefficient of -0.69.
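For illustration, such a coefficient can be computed as a standard Pearson correlation between per-sentence confidence and per-sentence F-score; the numbers below are toy values, not the paper's data.

```python
import numpy as np

confidences = np.array([0.2, 0.5, 0.7, 0.9])      # toy sentence confidences
f_scores    = np.array([0.81, 0.74, 0.66, 0.60])  # toy alignment F-scores
r = np.corrcoef(confidences, f_scores)[0, 1]      # Pearson correlation
print(round(r, 2))                                # negative, as in the excerpt
```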
Discourse vs. non-discourse usage | Using the string of the connective as the only feature sets a reasonably high baseline, with an f-score of 75.33% and an accuracy of 85.86%. |
Discourse vs. non-discourse usage | Interestingly, using only the syntactic features, ignoring the identity of the connective, is even better, resulting in an f-score of 88.19% and accuracy of 92.25%. |
Discourse vs. non-discourse usage | Using both the connective and syntactic features is better than either individually, with an f-score of 92.28% and accuracy of 95.04%. |
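A hedged sketch of the combined setup these three results describe: the connective string plus syntactic context as features of a single classifier. The feature names and the choice of logistic regression are illustrative assumptions, not the paper's exact setup.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training examples: 1 = discourse usage, 0 = non-discourse usage.
X = [
    {"conn": "and",   "self_cat": "CC", "parent_cat": "NP"},
    {"conn": "and",   "self_cat": "CC", "parent_cat": "S"},
    {"conn": "after", "self_cat": "IN", "parent_cat": "PP"},
    {"conn": "once",  "self_cat": "RB", "parent_cat": "ADVP"},
]
y = [0, 1, 1, 0]

clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(X, y)
print(clf.predict([{"conn": "and", "self_cat": "CC", "parent_cat": "S"}]))
```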
Experiments and Results | Table 2 reports the exact numbers of manually labeled tokens needed to reach the maximal (supervised) F-score on both corpora.
Experiments and Results | On the MUC7 corpus, FuSAL requires 7,374 annotated NPs to yield an F-score of 87%, while SeSAL reaches the same F-score with only 4,017 NPs.
Experiments and Results | On PennBioIE, SeSAL also saves about 45% of the annotation effort compared to FuSAL to achieve an F-score of 81%.
Evaluation | The third row is similar, but for sentences for which the oracle F-score is greater than 92%.
The CCG to PTB Conversion | shows that converting gold-standard CCG derivations into the GRs in DepBank resulted in an F-score of only 85%; hence the upper bound on the performance of the CCG parser, using this evaluation scheme, was only 85%. |
The CCG to PTB Conversion | The numbers are bracketing precision, recall, F-score and complete sentence matches, using the EVALB evaluation script. |