Index of papers in Proc. ACL that mention
  • F1 score
Faruqui, Manaal and Dyer, Chris
Abstract
To evaluate our method, we use the word clusters in an NER system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters.
Experiments
We treat the F1 score
Experiments
Table 1 shows the F1 score of NER when trained on these monolingual German word clusters.
Experiments
For Turkish the F1 score improves by 1.0 point over the setting with no distributional clusters, which clearly shows that the word alignment information improves the clustering quality.
F1 score is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Hoffmann, Raphael and Zhang, Congle and Weld, Daniel S.
Conclusion
We show an overall performance of 61% F1 score, and present experiments evaluating LUCHS’s individual components.
Experiments
Table 2: Lexicon and Gaussian features greatly expand F1 score (F1-LUCHS) over the baseline (F1-B), in particular for attributes with few training examples.
Experiments
Figure 3 shows the distribution of obtained F1 scores .
Experiments
Averaging across all attributes we obtain F1 scores of 0.56 and 0.60 for textual and numeric values respectively.
Introduction
• We evaluate the overall end-to-end performance of LUCHS, showing an F1 score of 61% when extracting relations from randomly selected Wikipedia pages.
F1 score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Martineau, Justin and Chen, Lu and Cheng, Doreen and Sheth, Amit
Experiments
Evaluation Metric: We evaluated the results with both Mean Average Precision (MAP) and F1 Score.
Experiments
Macro-averaged F1 Score (figure axis label; the plotted values are not recoverable from the extracted text).
Experiments
(b) Macro-Averaged F1 Score
F1 score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Xiang, Bing and Luo, Xiaoqiang and Zhou, Bowen
Experimental Results
We compute the precision, recall and F1 scores for each EC on the test set, and collect their counts in the reference and system output.
Experimental Results
The F1 scores for the majority of the ECs are above 70%, except for “*”, which is relatively rare in the data.
Experimental Results
For the two categories that are interesting to MT, *pro* and *PRO*, the predictor achieves 74.3% and 81.5% in F1 scores, respectively.
F1 score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Chambers, Nathanael and Jurafsky, Dan
Abstract
We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.
Discussion
We achieved results with comparable precision, and an F1 score of .40 that approaches prior algorithms that rely on handcrafted knowledge.
Information Extraction: Slot Filling
The bombing template performs best with an F1 score of .72.
Specific Evaluation
Kidnap improves most significantly in F1 score (7 F1 points absolute), but the others only change slightly.
Standard Evaluation
The standard evaluation for this corpus is to report the F1 score for slot type accuracy, ignoring the template type.
Standard Evaluation
F1 Score results by template: Kidnap .53, Bomb .43, Arson .42, Attack .16 / .25.
F1 score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Shindo, Hiroyuki and Miyao, Yusuke and Fujino, Akinori and Nagata, Masaaki
Experiment
We used EVALB to compute the F1 score.
Experiment
Table 2 shows the F1 scores of the CFG, TSG and SR-TSG parsers for small and full training sets.
Experiment
Table 3 shows the F1 scores of an SR-TSG and conventional parsers with the full training set.
Introduction
Our SR-TSG parser achieves an F1 score of 92.4% in the WSJ English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and superior to state-of-the-art discriminative reranking parsers.
F1 score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Feng, Vanessa Wei and Hirst, Graeme
Experiments
Performance is measured by four metrics: accuracy, precision, recall, and F1 score on the test set, shown in the first section in each subtable.
Experiments
However, under this discourse condition, the distribution of positive and negative instances in both training and test sets is extremely skewed, which makes it more sensible to compare the recall and F1 scores for evaluation.
Experiments
In fact, our features achieve much higher recall and F1 score despite a much lower precision and a slightly lower accuracy.
F1 score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Yang, Bishan and Cardie, Claire
Experiments
Table 4: F1 scores for each sentiment category (positive, negative and neutral) for semi-supervised sentiment classification on the MD dataset
Experiments
Table 4 shows the results in terms of F1 scores for each sentiment category (positive, negative and neutral).
Experiments
We observe that the DOCORACLE baseline provides very strong F1 scores on the positive and negative categories especially in the Books and Music domains, but very poor F1 on the neutral category.
F1 score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Matsubayashi, Yuichiroh and Okazaki, Naoaki and Tsujii, Jun'ichi
Experiment and Discussion
Table 1 shows the micro and macro averages of F1 scores.
Experiment and Discussion
Moreover, the macro-averaged F1 scores clearly showed improvements resulting from using role groups.
Experiment and Discussion
In Table 2, we show that the micro-averaged F1 score for roles having 10 instances or less was improved (by 15.46 points) when all role groups were used.
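The two averaging schemes referred to above differ in when the per-role scores are pooled. The following is a minimal sketch of that distinction, not code from the paper; the role names and counts are hypothetical.

```python
# Micro- vs. macro-averaged F1 over per-role counts (hypothetical data).
from collections import namedtuple

Counts = namedtuple("Counts", "tp fp fn")  # true positives, false positives, false negatives

per_role = {
    "Agent":    Counts(tp=90, fp=10, fn=20),
    "Theme":    Counts(tp=40, fp=15, fn=25),
    "RareRole": Counts(tp=2,  fp=3,  fn=8),   # a low-frequency role
}

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Macro average: compute F1 per role, then take the unweighted mean,
# so rare roles count as much as frequent ones.
macro_f1 = sum(f1(*c) for c in per_role.values()) / len(per_role)

# Micro average: pool counts over all roles first, then compute one F1,
# so frequent roles dominate.
tp = sum(c.tp for c in per_role.values())
fp = sum(c.fp for c in per_role.values())
fn = sum(c.fn for c in per_role.values())
micro_f1 = f1(tp, fp, fn)

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```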
F1 score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Wang, Chang and Fan, James
Experiments
The F1 scores reported here are the average of all 5 rounds.
Experiments
The tree kernel-based approach and linear regression achieved similar F1 scores, while linear SVM made a 5% improvement over them.
Experiments
By integrating unlabeled data, the manifold model under setting (1) made a 15% improvement over linear regression model on F1 score , where the improvement was significant across all relations.
F1 score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Hatori, Jun and Matsuzaki, Takuya and Miyao, Yusuke and Tsujii, Jun'ichi
Model
We use standard measures of word-level precision, recall, and F1 score for evaluating each task.
Model
Figure 2 shows the F1 scores of the proposed model (SegTagDep) on CTB-Sc-l with respect to the training epoch and different parsing feature weights, where “Seg”, “Tag”, and “Dep” respectively denote the F1 scores of word segmentation, POS tagging, and dependency parsing.
Model
Table 3: F1 scores and speed (in sentences per sec.)
F1 score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Anzaroot, Sam and Passos, Alexandre and Belanger, David and McCallum, Andrew
Citation Extraction Data
Table 1: Set of constraints learned and F1 scores.
Citation Extraction Data
This final feature improves the F1 score on the cleaned test set from 94.0 F1 to 94.44 F1, which we use as a baseline score.
Citation Extraction Data
We assess performance in terms of field-level F1 score, which is the harmonic mean of precision and recall for predicted segments.
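For reference, the harmonic mean named in the snippet above is the standard F1 definition: F1 = 2 · P · R / (P + R), where P and R are precision and recall computed over predicted segments.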
F1 score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Rüd, Stefan and Ciaramita, Massimiliano and Müller, Jens and Schütze, Hinrich
Results and discussion
Lin and Wu (2009) report an F1 score of 90.90 on the original split of the CoNLL data.
Results and discussion
Our F1 scores > 92% can be explained by a combination of randomly partitioning the data and the fact that the four-class problem is easier than the five-class problem LOC-ORG-PER-MISC-O.
Results and discussion
We use the t-test to compute significance on the two sets of five F1 scores from the two experiments that are being compared (two-tailed, p < .01 for t > 3.36). CoNLL scores that are significantly different from line c7 are marked with *.
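A minimal sketch of the kind of two-tailed t-test over two sets of five F1 scores described above; this is not the authors' code, the scores are hypothetical, SciPy is assumed to be available, and an independent two-sample test is shown (the snippet does not state the exact variant used).

```python
# Two-tailed t-test over two sets of five F1 scores (hypothetical values).
from scipy import stats

f1_run_a = [92.1, 92.4, 91.8, 92.6, 92.0]  # five F1 scores from experiment A
f1_run_b = [90.3, 90.9, 90.1, 90.7, 90.5]  # five F1 scores from experiment B

t_stat, p_value = stats.ttest_ind(f1_run_a, f1_run_b)  # two-sided by default
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.01:
    print("difference is significant at p < .01")
```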
F1 score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Chen, Xiao and Kit, Chunyu
Abstract
It achieves its best F1 scores of 91.86% and 85.58% on the two languages, respectively, and further pushes them to 92.80% and 85.60% via combination with other high-performance parsers.
Constituent Recombination
The parameters λ and p are tuned by Powell’s method (Powell, 1964) on a development set, using the F1 score of PARSEVAL (Black et al., 1991) as the objective.
Introduction
Combined with other high-performance parsers under the framework of constituent recombination (Sagae and Lavie, 2006; Fossum and Knight, 2009), this model further enhances the F1 scores to 92.80% and 85.60%, the highest ones achieved so far on these two data sets.
F1 score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Titov, Ivan and Klementiev, Alexandre
Empirical Evaluation
We compute the aggregate PU, CO, and F1 scores over all predicates in the same way as (Lang and Lapata, 2011a) by weighting the scores of each predicate by the number of its argument occurrences.
Empirical Evaluation
Our models are robust to parameter settings; the parameters were tuned (to an order of magnitude) to optimize the F1 score on the held-out development set and were as follows.
Empirical Evaluation
Boldface is used to highlight the best F1 scores.
F1 score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Chan, Wen and Zhou, Xiangdong and Wang, Wei and Chua, Tat-Seng
Experimental Results
Table 2 shows that our general CRF model based on question segmentation with group L1 regularization outperforms the baselines significantly in all three measures (gCRF-QS-l1 is 13.99% better than SVM in precision, 9.77% better in recall, and 11.72% better in F1 score).
Experimental Results
It is observed that our gCRF-QS-l1 model improves the performance in terms of precision, recall and F1 score on all three measurements of ROUGE-1, ROUGE-2 and ROUGE-L by a significant margin compared to other baselines, due to the use of local and nonlocal contextual factors and factors based on QS with group L1 regularization.
Experimental Setting
In our experiments, we also compare the precision, recall and F1 score in the ROUGE-1, ROUGE-2 and ROUGE-L measures (Lin, 2004) for answer summarization performance.
F1 score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Singh, Sameer and Subramanya, Amarnag and Pereira, Fernando and McCallum, Andrew
Experiments
Table 2: F1 Scores on the Wikipedia Link Data.
Experiments
We use N = 100, 500, and the B3 F1 scores obtained for each case are shown in Figure 7.
Introduction
On this dataset, our proposed model yields a B3 (Bagga and Baldwin, 1998) F1 score of 73.7%, improving over the baseline by 16% absolute (corresponding to 38% error reduction).
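Since B3 (Bagga and Baldwin, 1998) is cited above without a definition, here is a minimal sketch of the standard mention-level computation; it is not the authors' implementation, and the clusterings are hypothetical.

```python
# B-cubed (B3) precision, recall, and F1 for coreference (hypothetical clusterings).
def b_cubed_f1(system_clusters, gold_clusters):
    """Each argument is a list of sets of mention ids covering the same mentions."""
    sys_of = {m: c for c in system_clusters for m in c}   # mention -> its system cluster
    gold_of = {m: c for c in gold_clusters for m in c}    # mention -> its gold cluster
    mentions = list(gold_of)
    p = sum(len(sys_of[m] & gold_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
    r = sum(len(sys_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    return 2 * p * r / (p + r) if p + r else 0.0

system = [{"m1", "m2"}, {"m3"}]       # the system splits one gold entity in two
gold = [{"m1", "m2", "m3"}]
print(round(b_cubed_f1(system, gold), 3))  # perfect precision, reduced recall
```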
F1 score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zeller, Britta and Šnajder, Jan and Padó, Sebastian
Evaluation
For the final evaluation, we optimized the number of clusters based on the F1 score on the calibration and validation sets.
Results
We omit the F1 score because its use for precision and recall estimates from different samples is unclear.
Results
Note that for these methods, precision and recall can be traded off against each other by varying the number of clusters; we chose the number of clusters by optimizing the F1 score on the calibration and validation sets.
F1 score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Jansen, Peter and Surdeanu, Mihai and Clark, Peter
CR + LS + DMM + DPM 39.32* +24% 47.86* +20%
We adapted them to this dataset by weighing each answer by its overlap with gold answers, where overlap is measured as the highest F1 score between the candidate and a gold answer.
CR + LS + DMM + DPM 39.32* +24% 47.86* +20%
Thus, P@1 reduces to this F1 score for the top answer.
CR + LS + DMM + DPM 39.32* +24% 47.86* +20%
For example, if the best answer for a question appears at rank 2 with an F1 score of 0.3, the corresponding MRR score is 0.3 / 2.
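The worked example above can be read as an F1-weighted mean reciprocal rank. Here is a minimal sketch of one plausible reading of that adapted metric; it is not the authors' code, and the candidate scores are hypothetical.

```python
# MRR where each question's credit is the F1 overlap of its best candidate
# divided by that candidate's rank (one reading of the adapted metric).
def f1_weighted_mrr(ranked_f1_per_question):
    """ranked_f1_per_question: for each question, the F1 overlap of each
    candidate answer with the gold answers, in rank order."""
    total = 0.0
    for f1s in ranked_f1_per_question:
        best_rank, best_f1 = max(enumerate(f1s, start=1), key=lambda x: x[1])
        if best_f1 > 0:
            total += best_f1 / best_rank
    return total / len(ranked_f1_per_question)

# The snippet's example: best answer at rank 2 with F1 = 0.3 contributes 0.3 / 2.
print(f1_weighted_mrr([[0.0, 0.3, 0.1]]))  # -> 0.15
```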
F1 score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Li, Qi and Ji, Heng
Experiments
The human F1 score on end-to-end relation extraction is only about 70%, which indicates it is a very challenging task.
Experiments
Furthermore, the F1 score of the inter-annotator agreement is 51.9%, which is only 2.4% above that of our proposed method.
Experiments
For entity mention extraction, our joint model achieved 79.7% on 5-fold cross-validation, which is comparable with the best F1 score of 79.2% reported by Florian et al. (2006) on a single fold.
F1 score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Hoffmann, Raphael and Zhang, Congle and Ling, Xiao and Zettlemoyer, Luke and Weld, Daniel S.
Experiments
At the highest recall point, MULTIR reaches 72.4% precision and 51.9% recall, for an F1 score of 60.5%.
Experiments
On average across relations, precision increases 12 points but recall drops 26 points, for an overall reduction in F1 score from 60.5% to 40.3%.
Related Work
(2010) describe a system similar to KYLIN, but which dynamically generates lexicons in order to handle sparse data, learning over 5000 Infobox relations with an average F1 score of 61%.
F1 score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Wang, Zhiguo and Xue, Nianwen
Experiment
We can see that parsing performance decreased by about 8.5 percentage points in F1 score when using automatically assigned POS tags instead of gold-standard ones, which shows that the pipeline approach is greatly affected by the quality of its preliminary POS tagging step.
Experiment
Compared with the JointParsing system, which does not employ any alignment strategy, the Padding system achieved only a slight improvement in parsing F1 score, but no improvement in POS tagging accuracy.
Experiment
In contrast, our StateAlign system achieved an improvement of 0.6% on parsing F1 score and 0.4% on POS tagging accuracy.
F1 score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Stoyanov, Veselin and Gilbert, Nathan and Cardie, Claire and Riloff, Ellen
Coreference Subtask Analysis
The MUC scoring algorithm (Vilain et al., 1995) computes the F1 score (harmonic mean) of precision and recall based on the identification of unique coreference links.
Coreference Subtask Analysis
Precision and recall for a set of documents are computed as the mean over all CEs in the documents, and the F1 score of precision and recall is reported.
Resolution Complexity
We then count the number of unique correct/incorrect links that the system introduced on top of the correct partial clustering and compute precision, recall, and F1 score.
F1 score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: