Index of papers in Proc. ACL 2011 that mention
  • F1 score
Chambers, Nathanael and Jurafsky, Dan
Abstract
We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.
Discussion
We achieved results with comparable precision, and an F1 score of .40 that approaches prior algorithms that rely on handcrafted knowledge.
Information Extraction: Slot Filling
The bombing template performs best with an F1 score of .72.
Specific Evaluation
Kidnap improves most significantly in F1 score (7 F1 points absolute), but the others only change slightly.
Standard Evaluation
The standard evaluation for this corpus is to report the F1 score for slot type accuracy, ignoring the template type.
Standard Evaluation
F1 Score   Kidnap   Bomb   Arson   Attack
Results    .53      .43    .42     .16 / .25
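To make the slot-type evaluation above concrete, here is a minimal Python sketch of micro-averaged slot-filler F1 that ignores the template type. The `slot_f1` helper and the toy gold/predicted fillers are hypothetical illustrations, not the paper's scorer, which additionally handles details such as optional slots and partial string matches.

```python
def slot_f1(gold, pred):
    """Micro-averaged F1 over extracted slot fillers, ignoring template type.

    gold, pred: dicts mapping a slot type (e.g. 'Victim') to the set of
    filler strings for one document.
    """
    tp = fp = fn = 0
    for slot in set(gold) | set(pred):
        g, p = gold.get(slot, set()), pred.get(slot, set())
        tp += len(g & p)   # fillers extracted with the correct slot type
        fp += len(p - g)   # spurious fillers
        fn += len(g - p)   # missed fillers
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one correct filler, one wrong one -> P = R = F1 = 0.5.
gold = {'Perpetrator': {'FMLN'}, 'Victim': {'mayor'}}
pred = {'Perpetrator': {'FMLN'}, 'Victim': {'judge'}}
print(slot_f1(gold, pred))  # 0.5
```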
F1 score is mentioned in 6 sentences in this paper.
Rüd, Stefan and Ciaramita, Massimiliano and Müller, Jens and Schütze, Hinrich
Results and discussion
Lin and Wu (2009) report an F1 score of 90.90 on the original split of the CoNLL data.
Results and discussion
Our F1 scores > 92% can be explained by a combination of randomly partitioning the data and the fact that the four-class problem is easier than the five-class problem LOC-ORG-PER-MISC-O.
Results and discussion
We use the t-test to compute significance on the two sets of five F1 scores from the two experiments that are being compared (two-tailed, p < .01 for t > 3.36). CoNLL scores that are significantly different from line c7 are marked with ∗.
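For readers who want to reproduce this style of test: with two independent samples of five scores, there are 5 + 5 - 2 = 8 degrees of freedom, and the two-tailed critical value at p = .01 is indeed about 3.36, matching the threshold quoted above. Below is a small sketch using SciPy's two-sample t-test; the two score lists are invented for illustration, not the paper's data.

```python
from scipy.stats import ttest_ind

# Two hypothetical sets of five F1 scores (one per run),
# standing in for the two experiments being compared.
f1_run_a = [92.1, 92.4, 91.8, 92.6, 92.0]
f1_run_b = [90.2, 90.8, 90.5, 89.9, 90.4]

# Two-sample t-test, two-tailed by default; df = 5 + 5 - 2 = 8.
t_stat, p_value = ttest_ind(f1_run_a, f1_run_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("significant at p < .01" if abs(t_stat) > 3.36 else "not significant")
```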
F1 score is mentioned in 4 sentences in this paper.
Hoffmann, Raphael and Zhang, Congle and Ling, Xiao and Zettlemoyer, Luke and Weld, Daniel S.
Experiments
At the highest recall point, MULTIR reaches 72.4% precision and 51.9% recall, for an F1 score of 60.5%.
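As a quick consistency check (ours, not the paper's), the quoted F1 is the harmonic mean of the quoted precision and recall:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.724 \times 0.519}{0.724 + 0.519}
    \approx 0.605
```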
Experiments
On average across relations, precision increases 12 points but recall drops 26 points, for an overall reduction in F1 score from 60.5% to 40.3%.
Related Work
(2010) describe a system similar to KYLIN, but which dynamically generates lexicons in order to handle sparse data, learning over 5000 Infobox relations with an average F1 score of 61%.
F1 score is mentioned in 3 sentences in this paper.
Singh, Sameer and Subramanya, Amarnag and Pereira, Fernando and McCallum, Andrew
Experiments
Table 2: F1 Scores on the Wikipedia Link Data.
Experiments
We use N = 100, 500, and the B3 F1 scores obtained for each case are shown in Figure 7.
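The B3 metric (Bagga and Baldwin, 1998) scores each mention by the overlap between the predicted and gold clusters containing it, then averages over mentions. Below is a minimal Python sketch under that definition; the function name and the toy clusterings are hypothetical, not the authors' code.

```python
def b_cubed_f1(gold_clusters, pred_clusters):
    """B3 F1 (Bagga and Baldwin, 1998) for two clusterings of the same mentions.

    gold_clusters, pred_clusters: lists of sets of mention ids.
    """
    gold_of = {m: c for c in gold_clusters for m in c}
    pred_of = {m: c for c in pred_clusters for m in c}
    mentions = list(gold_of)
    # Per-mention precision/recall from the overlap of its two clusters.
    p = sum(len(gold_of[m] & pred_of[m]) / len(pred_of[m]) for m in mentions)
    r = sum(len(gold_of[m] & pred_of[m]) / len(gold_of[m]) for m in mentions)
    precision, recall = p / len(mentions), r / len(mentions)
    return 2 * precision * recall / (precision + recall)

# Toy example: one spurious merge in the prediction.
gold = [{'a', 'b'}, {'c'}]
pred = [{'a', 'b', 'c'}]
print(round(b_cubed_f1(gold, pred), 3))  # 0.714
```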
Introduction
On this dataset, our proposed model yields a B3 (Bagga and Baldwin, 1998) F1 score of 73.7%, improving over the baseline by 16% absolute (corresponding to 38% error reduction).
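The two figures are consistent (a check of ours, not the paper's): a 16-point absolute gain over a baseline of 73.7 - 16 = 57.7% shrinks the error from 42.3 points to 26.3 points, and

```latex
\frac{42.3 - 26.3}{42.3} \approx 0.38,
```

i.e. the quoted 38% error reduction.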
F1 score is mentioned in 3 sentences in this paper.