Abstract | We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates. |
Discussion | We achieved results with comparable precision and an F1 score of .40, approaching the performance of prior algorithms that rely on handcrafted knowledge.
Information Extraction: Slot Filling | The bombing template performs best with an F1 score of .72. |
Specific Evaluation | Kidnap improves most significantly in F1 score (7 F1 points absolute), but the others only change slightly.
Standard Evaluation | The standard evaluation for this corpus is to report the F1 score for slot type accuracy, ignoring the template type. |
Standard Evaluation | F1 scores: Kidnap .53, Bomb .43, Arson .42, Attack .16 / .25
Results and discussion | Lin and Wu (2009) report an F1 score of 90.90 on the original split of the CoNLL data. |
Results and discussion | Our F1 scores above 92% can be explained by a combination of randomly partitioning the data and the fact that the four-class problem is easier than the five-class problem (LOC-ORG-PER-MISC-O).
Results and discussion | We use the t-test to compute significance on the two sets of five F1 scores from the two experiments that are being compared (two-tailed, p < .01 for t > 3.36). CoNLL scores that are significantly different from line c7 are marked with >|<.
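The significance test described above compares two sets of five F1 scores; a minimal pooled-variance two-sample t-test sketch, using hypothetical scores (not taken from the paper):

```python
from statistics import mean, variance

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    # Pooled sample variance over na + nb - 2 degrees of freedom
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

# Hypothetical F1 scores from five runs of each system (illustrative only):
run_a = [92.1, 92.4, 92.0, 92.3, 92.2]
run_b = [91.0, 91.2, 90.8, 91.1, 90.9]
t = two_sample_t(run_a, run_b)
# With 5 + 5 - 2 = 8 degrees of freedom, |t| > 3.36 means p < .01 (two-tailed)
print(abs(t) > 3.36)
```

The critical value 3.36 matches the paper's setup: a two-tailed test at p < .01 with 8 degrees of freedom.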
Experiments | At the highest recall point, MULTIR reaches 72.4% precision and 51.9% recall, for an F1 score of 60.5%. |
Experiments | On average across relations, precision increases 12 points but recall drops 26 points, for an overall reduction in F1 score from 60.5% to 40.3%. |
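The F1 figures in the two lines above follow directly from precision and recall, since F1 is their harmonic mean; a minimal sketch checking the reported MULTIR operating point:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The operating point reported above: 72.4% precision, 51.9% recall
print(round(f1(0.724, 0.519) * 100, 1))  # 60.5
```

Note the asymmetry this creates: because the harmonic mean is dominated by the smaller value, a 26-point recall drop outweighs a 12-point precision gain, consistent with the fall from 60.5% to 40.3% F1.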
Related Work | (2010) describe a system similar to KYLIN, but which dynamically generates lexicons in order to handle sparse data, learning over 5000 Infobox relations with an average F1 score of 61%. |
Experiments | Table 2: F1 Scores on the Wikipedia Link Data. |
Experiments | We use N = 100 and 500; the B3 F1 scores obtained for each case are shown in Figure 7.
Introduction | On this dataset, our proposed model yields a B3 (Bagga and Baldwin, 1998) F1 score of 73.7%, improving over the baseline by 16% absolute (corresponding to 38% error reduction). |
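The B3 metric (Bagga and Baldwin, 1998) scores each mention by the overlap between its system cluster and its gold cluster; a minimal sketch on a toy clustering (the data here is illustrative, not from the paper):

```python
def b_cubed(system, gold):
    """B^3 precision, recall, and F1 for two clusterings of the same mentions.

    system, gold: lists of sets of mention ids.
    """
    sys_of = {m: c for c in system for m in c}   # mention -> its system cluster
    gold_of = {m: c for c in gold for m in c}    # mention -> its gold cluster
    mentions = list(gold_of)
    # Per-mention precision: overlap / system cluster size, averaged
    p = sum(len(sys_of[m] & gold_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
    # Per-mention recall: overlap / gold cluster size, averaged
    r = sum(len(sys_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy example: the system splits one gold cluster, so precision stays perfect
# while recall suffers for the split mentions.
gold = [{1, 2, 3}, {4, 5}]
system = [{1, 2}, {3}, {4, 5}]
p, r, f = b_cubed(system, gold)
```

Splitting clusters hurts only recall and merging them hurts only precision, which is why B3 is the standard choice for cross-document coreference evaluations like the one reported above.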