Index of papers in Proc. ACL 2011 that mention
  • F-score
Habash, Nizar and Roth, Ryan
Abstract
Our best approach achieves an absolute increase of roughly 15% in F-score over a simple but reasonable baseline.
Results
We present the results in terms of F-score only for simplicity; we then conduct an error analysis that examines precision and recall.
Results
Feature Set          F-score   %Imp
word                 43.85     —
word+nw              43.86     0.0
word+na              44.78     2.1
word+lem             45.85     4.6
word+pos             45.91     4.7
word+nw+pos+lem+na   46.34     5.7
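The %Imp column matches relative improvement over the word-only baseline; as a worked check (my reconstruction, not stated in the excerpt):

    \%\mathrm{Imp} = \frac{F_{\mathrm{feat}} - F_{\mathrm{word}}}{F_{\mathrm{word}}} \times 100,
    \quad \text{e.g.}\quad \frac{46.34 - 43.85}{43.85} \times 100 \approx 5.7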
F-score is mentioned in 13 sentences in this paper.
Khapra, Mitesh M. and Joshi, Salil and Chatterjee, Arindam and Bhattacharyya, Pushpak
Discussions
For small seed sizes, the F-score of bilingual bootstrapping is consistently better than the F-score obtained by training only on the seed data without using any bootstrapping.
Discussions
To further illustrate this, we take some sample points from the graph and compare the number of tagged words needed by BiBoot and OnlySeed to reach the same (or nearly the same) F-score.
Experimental Setup
Seed Size vs. F-score
Experimental Setup
Figure 1: Comparison of BiBoot, MonoBoot, OnlySeed and WFS on Hindi Health data (both panels plot F-score against seed size in words, 0–5000).
Results
a. BiBoot: This curve represents the F-score obtained after 10 iterations by using bilingual bootstrapping with different amounts of seed data.
Results
b. MonoBoot: This curve represents the F-score obtained after 10 iterations by using monolingual bootstrapping with different amounts of seed data.
Results
c. OnlySeed: This curve represents the F-score obtained by training on the seed data alone without using any bootstrapping.
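As a rough illustration of the loop these excerpts describe (train on seed data, label unlabeled data, keep confident predictions, repeat for 10 iterations), here is a minimal monolingual self-training sketch. The classifier choice, the 0.8 threshold, and all names are assumptions of mine; the bilingual variant would additionally project labels across the two languages.

    # Minimal self-training sketch under assumed interfaces; not the paper's code.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def bootstrap(seed_X, seed_y, unlabeled_X, iterations=10, threshold=0.8):
        """Grow the training set with confidently self-labeled examples."""
        X, y, pool = seed_X, seed_y, unlabeled_X
        clf = LogisticRegression(max_iter=1000)
        for _ in range(iterations):
            if len(pool) == 0:
                break
            clf.fit(X, y)
            probs = clf.predict_proba(pool)
            conf = probs.max(axis=1)              # confidence of the top label
            keep = conf >= threshold              # accept confident predictions
            if not keep.any():
                break
            new_labels = clf.classes_[probs[keep].argmax(axis=1)]
            X = np.vstack([X, pool[keep]])
            y = np.concatenate([y, new_labels])
            pool = pool[~keep]                    # shrink the unlabeled pool
        clf.fit(X, y)
        return clf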
F-score is mentioned in 12 sentences in this paper.
Han, Bo and Baldwin, Timothy
Conclusion and Future Work
In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
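A sketch of how those three signals might be combined for a single out-of-vocabulary token; every function, weight, and data structure here is my own illustrative assumption, not the authors' implementation:

    from difflib import SequenceMatcher

    def word_similarity(a, b):
        # Stand-in surface similarity (the paper's measure may differ).
        return SequenceMatcher(None, a, b).ratio()

    def normalise(token, dictionary, candidates, bigram_counts, prev_word):
        """Dictionary lookup first; otherwise rank candidates by a blend of
        surface similarity and context support (a toy bigram count here)."""
        if token in dictionary:
            return dictionary[token]
        def score(cand):
            support = bigram_counts.get((prev_word, cand), 0)
            return word_similarity(token, cand) + 0.1 * support
        return max(candidates, key=score)

    # e.g. normalise("moview", {}, ["movie", "moves"],
    #                {("the", "movie"): 3}, "the")  ->  "movie"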
Experiments
We evaluate detection performance by token-level precision, recall and F-score (β = 1).
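For reference, F-score with β = 1 is the harmonic mean of precision and recall; a minimal computation from token-level counts (a sketch of mine, not the paper's code):

    def f_score(tp, fp, fn, beta=1.0):
        """F_beta from true positives, false positives, and false negatives."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0:
            return 0.0
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    # beta = 1 weights precision and recall equally, as in the setting quoted
    # above: f_score(tp=80, fp=20, fn=40) -> ~0.727 (P = 0.80, R ~ 0.67)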
Experiments
For candidate selection, we once again evaluate using token-level precision, recall and F-score.
Experiments
Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation.
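BLEU scores n-gram overlap against a reference rather than aligning tokens one-to-one, so it tolerates the insertions and deletions that SMT-style normalisation can introduce. A quick illustration with NLTK (my example sentences, not the paper's data):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = ["see you at the movie tonight".split()]
    hypothesis = "see you at the movies tonight".split()   # one token altered

    # Smoothing avoids zero scores on short sentences missing n-gram orders.
    score = sentence_bleu(reference, hypothesis,
                          smoothing_function=SmoothingFunction().method1)
    print(round(score, 3))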
F-score is mentioned in 10 sentences in this paper.
Auli, Michael and Lopez, Adam
Oracle Parsing
To answer this question, we computed oracle best and worst values for labelled dependency F-score using the algorithm of Huang (2008) on the hybrid model of Clark and Curran (2007), the best model of their C&C parser.
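For context, labelled dependency F-score compares predicted (head, dependent, label) triples against the gold set; the sketch below is my own illustration of that metric, not Huang's (2008) forest-oracle algorithm, which searches a packed forest for the best and worst reachable parses:

    def labelled_dep_f1(gold, predicted):
        """F1 over labelled dependencies, each a (head, dependent, label) triple."""
        gold_set, pred_set = set(gold), set(predicted)
        matched = len(gold_set & pred_set)
        if matched == 0:
            return 0.0
        precision = matched / len(pred_set)
        recall = matched / len(gold_set)
        return 2 * precision * recall / (precision + recall)

    # e.g. labelled_dep_f1([(2, 1, "det"), (0, 2, "root")],
    #                      [(2, 1, "det"), (2, 3, "obj")])  ->  0.5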
Oracle Parsing
Digging deeper, we compared parser model score against Viterbi F-score and oracle F-score at a va-
F-score is mentioned in 6 sentences in this paper.
Chen, Harr and Benson, Edward and Naseem, Tahira and Barzilay, Regina
Experimental Setup
For these reasons, we evaluate on both sentence-level and token-level precision, recall, and F-score.
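To make the sentence/token distinction concrete: sentence-level scoring is all-or-nothing per sentence, while token-level scoring gives partial credit. A toy example (the data and layout are mine, not the paper's):

    # Relevant tokens per sentence: gold annotations vs. model extractions.
    gold = [{"storm", "flood"}, {"quake"}]
    pred = [{"storm"}, {"quake"}]

    sentence_acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)  # 0.5
    tp = sum(len(g & p) for g, p in zip(gold, pred))   # 2
    fp = sum(len(p - g) for g, p in zip(gold, pred))   # 0
    fn = sum(len(g - p) for g, p in zip(gold, pred))   # 1
    token_precision = tp / (tp + fp)                   # 1.0
    token_recall = tp / (tp + fn)                      # ~0.67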
Results
However, the best F-score corresponding to the optimal number of clusters is 42.2, still far below our model's 66.0 F-score.
Results
Our results show a large gap in F-score between the sentence-level and token-level evaluations for both the USP baseline and our model.
F-score is mentioned in 3 sentences in this paper.