Abstract | To evaluate our method, we use the word clusters in an NER system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters. |
Experiments | We treat the F1 score |
Experiments | Table 1 shows the F1 score of NER when trained on these monolingual German word clusters.
Experiments | For Turkish, the F1 score improves by 1.0 point over the setting with no distributional clusters, which clearly shows that the word alignment information improves the clustering quality.
Conclusion | We show an overall performance of 61% F1 score , and present experiments evaluating LUCHS’s individual components. |
Experiments | Table 2: Lexicon and Gaussian features greatly expand F1 score (F1-LUCHS) over the baseline (F1-B), in particular for attributes with few training examples.
Experiments | Figure 3 shows the distribution of obtained F1 scores . |
Experiments | Averaging across all attributes we obtain F1 scores of 0.56 and 0.60 for textual and numeric values respectively. |
Introduction | 0 We evaluate the overall end-to-end performance of LUCHS, showing an F1 score of 61% when extracting relations from randomly selected Wikipedia pages. |
Experiments | Evaluation Metric: We evaluated the results with both Mean Average Precision (MAP) and F1 Score . |
Experiments | (b) Macro-Averaged F1 Score |
Experimental Results | We compute the precision, recall and F1 scores for each EC on the test set, and collect their counts in the reference and system output. |
Experimental Results | The F1 scores for the majority of the ECs are above 70%, except for “*”, which is relatively rare in the data.
Experimental Results | For the two categories that are interesting to MT, *pro* and *PRO*, the predictor achieves 74.3% and 81.5% in F1 scores , respectively. |
Abstract | We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.
Discussion | We achieved results with comparable precision, and an F1 score of .40 that approaches prior algorithms that rely on handcrafted knowledge. |
Information Extraction: Slot Filling | The bombing template performs best with an F1 score of .72. |
Specific Evaluation | Kidnap improves most significantly in F1 score (7 F1 points absolute), but the others only change slightly.
Standard Evaluation | The standard evaluation for this corpus is to report the F1 score for slot type accuracy, ignoring the template type. |
Standard Evaluation | F1 scores by template: Kidnap .53, Bomb .43, Arson .42, Attack .16 / .25
Experiment | We used EVALB to compute the F1 score.
Experiment | Table 2 shows the F1 scores of the CFG, TSG and SR-TSG parsers for small and full training sets. |
Experiment | Table 3 shows the F1 scores of an SR-TSG and conventional parsers with the full training set. |
Introduction | Our SR-TSG parser achieves an F1 score of 92.4% in the WSJ English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and superior to state-of-the-art discriminative reranking parsers. |
Experiments | Performance is measured by four metrics: accuracy, precision, recall, and F1 score on the test set, shown in the first section in each subtable. |
Experiments | However, under this discourse condition, the distribution of positive and negative instances in both training and test sets is extremely skewed, which makes it more sensible to compare the recall and F1 scores for evaluation. |
Experiments | In fact, our features achieve much higher recall and F1 score despite a much lower precision and a slightly lower accuracy. |
Experiments | Table 4: F1 scores for each sentiment category (positive, negative and neutral) for semi-supervised sentiment classification on the MD dataset |
Experiments | Table 4 shows the results in terms of F1 scores for each sentiment category (positive, negative and neutral). |
Experiments | We observe that the DOCORACLE baseline provides very strong F1 scores on the positive and negative categories especially in the Books and Music domains, but very poor F1 on the neutral category. |
Experiment and Discussion | Table 1 shows the micro and macro averages of F1 scores . |
Experiment and Discussion | Moreover, the macro-averaged F1 scores clearly showed improvements resulting from using role groups. |
Experiment and Discussion | In Table 2, we show that the micro-averaged F1 score for roles having 10 instances or less was improved (by 15.46 points) when all role groups were used. |
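The micro- vs. macro-averaged F1 scores discussed above can be computed from per-role counts; the sketch below is a generic illustration (function and variable names are ours, not from the paper):

```python
def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per role/class.

    Macro-F1 averages per-class F1 (each class counts equally);
    micro-F1 pools the counts first (frequent classes dominate).
    """
    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    macro = sum(f1(*c) for c in counts) / len(counts)
    tp, fp, fn = (sum(col) for col in zip(*counts))
    micro = f1(tp, fp, fn)
    return micro, macro
```

Because macro-averaging weights all roles equally, rare roles (e.g. those with 10 instances or fewer) move the macro score far more than the micro score, which is why the two averages are reported separately.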
Experiments | The F1 scores reported here are the average of all 5 rounds. |
Experiments | The tree kernel-based approach and linear regression achieved similar F1 scores, while linear SVM made a 5% improvement over them.
Experiments | By integrating unlabeled data, the manifold model under setting (1) made a 15% improvement over the linear regression model in F1 score, where the improvement was significant across all relations.
Model | We use standard measures of word-level precision, recall, and F1 score , for evaluating each task. |
Model | Figure 2 shows the F1 scores of the proposed model (SegTagDep) on CTB-Sc-l with respect to the training epoch and different parsing feature weights, where “Seg”, “Tag”, and “Dep” respectively denote the F1 scores of word segmentation, POS tagging, and dependency parsing. |
Model | Table 3: F1 scores and speed (in sentences per sec.) |
Citation Extraction Data | Table 1: Set of constraints learned and F1 scores . |
Citation Extraction Data | This final feature improves the F1 score on the cleaned test set from 94.0 to 94.44, which we use as a baseline score.
Citation Extraction Data | We assess performance in terms of field-level F1 score, which is the harmonic mean of precision and recall for predicted segments.
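The field-level F1 defined above is the standard harmonic mean of precision and recall; a minimal sketch (the function name is ours, not from the paper):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, `f1_score(0.5, 1.0)` gives 2/3: the harmonic mean penalizes an imbalance between precision and recall more than the arithmetic mean would.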
Results and discussion | Lin and Wu (2009) report an F1 score of 90.90 on the original split of the CoNLL data. |
Results and discussion | Our F1 scores > 92% can be explained by a combination of randomly partitioning the data and the fact that the four-class problem is easier than the five-class problem LOC-ORG-PER-MISC-O. |
Results and discussion | We use the t-test to compute significance on the two sets of five F1 scores from the two experiments that are being compared (two-tailed, p < .01 for t > 3.36). CoNLL scores that are significantly different from line c7 are marked with *.
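The t-test above compares two sets of five F1 scores; a sketch of an unpaired, pooled-variance two-sample t statistic (the paper does not state the exact variant, so this choice is an assumption). With five scores per set, df = 5 + 5 - 2 = 8, and the critical value for p < .01 two-tailed is indeed about 3.36:

```python
from statistics import mean, variance

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic for two score lists."""
    n_a, n_b = len(a), len(b)
    pooled = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    se = (pooled * (1 / n_a + 1 / n_b)) ** 0.5
    return (mean(a) - mean(b)) / se
```

A difference is then called significant when `abs(two_sample_t(a, b)) > 3.36`.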
Abstract | It achieves its best F1 scores of 91.86% and 85.58% on the two languages, respectively, and further pushes them to 92.80% and 85.60% via combination with other high-performance parsers.
Constituent Recombination | The parameters λ and p are tuned by Powell’s method (Powell, 1964) on a development set, using the F1 score of PARSEVAL (Black et al., 1991) as the objective.
Introduction | Combined with other high-performance parsers under the framework of constituent recombination (Sagae and Lavie, 2006; Fossum and Knight, 2009), this model further enhances the F1 scores to 92.80% and 85.60%, the highest ones achieved so far on these two data sets. |
Empirical Evaluation | We compute the aggregate PU, CO, and F1 scores over all predicates in the same way as Lang and Lapata (2011a), by weighting the scores of each predicate by the number of its argument occurrences.
Empirical Evaluation | Our models are robust to parameter settings; the parameters were tuned (to an order of magnitude) to optimize the F1 score on the held-out development set and were as follows. |
Empirical Evaluation | Boldface is used to highlight the best F1 scores . |
Experimental Results | Table 2 shows that our general CRF model based on question segmentation with group L1 regularization outperforms the baselines significantly in all three measures (gCRF-QS-l1 is 13.99% better than SVM in precision, 9.77% better in recall and 11.72% better in F1 score).
Experimental Results | It is observed that our gCRF-QS-l1 model improves precision, recall and F1 score on all three measures of ROUGE-1, ROUGE-2 and ROUGE-L by a significant margin over the other baselines, due to the use of local and nonlocal contextual factors and factors based on QS with group L1 regularization.
Experimental Setting | In our experiments, we also compare the precision, recall and F1 score in the ROUGE-1, ROUGE-2 and ROUGE-L measures (Lin, 2004) for answer summarization performance.
Experiments | Table 2: F1 Scores on the Wikipedia Link Data. |
Experiments | We use N = 100, 500, and the B3 F1 scores obtained for each case are shown in Figure 7.
Introduction | On this dataset, our proposed model yields a B3 (Bagga and Baldwin, 1998) F1 score of 73.7%, improving over the baseline by 16% absolute (corresponding to 38% error reduction). |
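The 38% relative error reduction quoted above follows from the 16-point absolute gain; a quick arithmetic check (the 57.7% baseline is inferred from the stated numbers, not given explicitly):

```python
def error_reduction(baseline: float, system: float) -> float:
    """Relative error reduction: fraction of the baseline's errors removed."""
    return (system - baseline) / (1.0 - baseline)

# 16-point absolute gain over a ~57.7% baseline ≈ 38% error reduction
print(round(error_reduction(0.577, 0.737), 2))  # prints 0.38
```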
Evaluation | For the final evaluation, we optimized the number of clusters based on F1 score on calibration and validation sets (cf. |
Results | We omit the F1 score because its use for precision and recall estimates from different samples is unclear. |
Results | Note that for these methods, precision and recall can be traded off against each other by varying the number of clusters; we chose the number of clusters by optimizing the F1 score on the calibration and validation sets.
CR + LS + DMM + DPM 39.32* +24% 47.86* +20% | We adapted them to this dataset by weighing each answer by its overlap with gold answers, where overlap is measured as the highest F1 score between the candidate and a gold answer. |
CR + LS + DMM + DPM 39.32* +24% 47.86* +20% | Thus, P@1 reduces to this F1 score for the top answer. |
CR + LS + DMM + DPM 39.32* +24% 47.86* +20% | For example, if the best answer for a question appears at rank 2 with an F1 score of 0.3, the corresponding MRR score is 0.3 / 2. |
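The overlap-weighted MRR described above can be sketched as follows (the function name is ours; the metric credits each question with the best answer's F1 divided by its rank):

```python
def soft_mrr(rank: int, f1: float) -> float:
    """MRR credit for one question: F1 of its best answer divided by the rank."""
    return f1 / rank

# The example from the text: best answer at rank 2 with F1 = 0.3
print(soft_mrr(2, 0.3))  # prints 0.15
```

Under this scheme P@1 is simply `soft_mrr(1, f1)`, i.e. the F1 score of the top-ranked answer.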
Experiments | The human F1 score on end-to-end relation extraction is only about 70%, which indicates it is a very challenging task. |
Experiments | Furthermore, the F1 score of the inter-annotator agreement is 51.9%, which is only 2.4% above that of our proposed method. |
Experiments | For entity mention extraction, our joint model achieved 79.7% on 5-fold cross-validation, which is comparable with the best F1 score of 79.2% reported by Florian et al. (2006) on a single fold.
Experiments | At the highest recall point, MULTIR reaches 72.4% precision and 51.9% recall, for an F1 score of 60.5%. |
Experiments | On average across relations, precision increases 12 points but recall drops 26 points, for an overall reduction in F1 score from 60.5% to 40.3%. |
Related Work | (2010) describe a system similar to KYLIN, but which dynamically generates lexicons in order to handle sparse data, learning over 5000 Infobox relations with an average F1 score of 61%. |
Experiment | We can see that the parsing F1 score decreased by about 8.5 points when using automatically assigned POS tags instead of gold-standard ones, which shows that the pipeline approach is greatly affected by the quality of its preliminary POS tagging step.
Experiment | Compared with the JointParsing system which does not employ any alignment strategy, the Padding system only achieved a slight improvement on parsing F1 score , but no improvement on POS tagging accuracy. |
Experiment | In contrast, our StateAlign system achieved an improvement of 0.6% on parsing F1 score and 0.4% on POS tagging accuracy. |
Coreference Subtask Analysis | The MUC scoring algorithm (Vilain et al., 1995) computes the F1 score (harmonic mean) of precision and recall based on the identification of unique coreference links.
Coreference Subtask Analysis | Precision and recall for a set of documents are computed as the mean over all CEs in the documents and the F1 score of precision and recall is reported. |
Resolution Complexity | We then count the number of unique correct/incorrect links that the system introduced on top of the correct partial clustering and compute precision, recall, and F1 score . |