Index of papers in Proc. ACL 2012 that mention
  • F1 score
Feng, Vanessa Wei and Hirst, Graeme
Experiments
Performance is measured by four metrics: accuracy, precision, recall, and F1 score on the test set, shown in the first section of each subtable.
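For reference, a minimal sketch of how these four metrics are typically computed from binary predictions; this is illustrative only, not the paper's code, and the names y_true and y_pred are hypothetical.

def binary_metrics(y_true, y_pred):
    # Counts over the positive class (label 1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

On a heavily skewed test set, accuracy can remain high while recall and F1 expose the behaviour on the rare positive class, which is why such papers emphasize the latter two.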
Experiments
However, under this discourse condition, the distribution of positive and negative instances in both training and test sets is extremely skewed, which makes it more sensible to compare the recall and F1 scores for evaluation.
Experiments
In fact, our features achieve much higher recall and F1 score despite a much lower precision and a slightly lower accuracy.
F1 score is mentioned in 6 sentences in this paper.
Shindo, Hiroyuki and Miyao, Yusuke and Fujino, Akinori and Nagata, Masaaki
Experiment
We used EVALB to compute the F1 score.
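EVALB is the standard implementation of the PARSEVAL metric; schematically, labeled bracket F1 is the harmonic mean of bracket precision and recall. The sketch below is not EVALB itself, and the (label, start, end) span representation is an assumption.

from collections import Counter

def bracket_f1(gold_brackets, pred_brackets):
    # Each bracket is a (label, start, end) constituent span; multisets
    # allow duplicate spans, as in PARSEVAL scoring.
    gold, pred = Counter(gold_brackets), Counter(pred_brackets)
    matched = sum((gold & pred).values())  # multiset intersection
    n_gold, n_pred = sum(gold.values()), sum(pred.values())
    p = matched / n_pred if n_pred else 0.0
    r = matched / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0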
Experiment
Table 2 shows the F1 scores of the CFG, TSG and SR-TSG parsers for small and full training sets.
Experiment
Table 3 shows the F1 scores of an SR-TSG and conventional parsers with the full training set.
Introduction
Our SR-TSG parser achieves an F1 score of 92.4% in the WSJ English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and superior to state-of-the-art discriminative reranking parsers.
F1 score is mentioned in 6 sentences in this paper.
Hatori, Jun and Matsuzaki, Takuya and Miyao, Yusuke and Tsujii, Jun'ichi
Model
We use standard measures of word-level precision, recall, and F1 score for evaluating each task.
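For word segmentation in particular, word-level F1 is usually computed by comparing words as character-offset spans; the following is a rough sketch under that assumption (the helper names are hypothetical, not the authors' code).

def word_spans(words):
    # Convert a segmentation into character-offset spans.
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def word_f1(gold_words, pred_words):
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)  # words with identical boundaries
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0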
Model
Figure 2 shows the F1 scores of the proposed model (SegTagDep) on CTB-Sc-l with respect to the training epoch and different parsing feature weights, where “Seg”, “Tag”, and “Dep” respectively denote the F1 scores of word segmentation, POS tagging, and dependency parsing.
Model
Table 3: F1 scores and speed (in sentences per sec.)
F1 score is mentioned in 4 sentences in this paper.
Chan, Wen and Zhou, Xiangdong and Wang, Wei and Chua, Tat-Seng
Experimental Results
Table 2 shows that our general CRF model based on question segmentation with group L1 regularization outperforms the baselines significantly in all three measures (gCRF-QS-ll is 13.99% better than SVM in precision, 9.77% better in recall, and 11.72% better in F1 score).
Experimental Results
Our gCRF-QS-ll model improves precision, recall, and F1 score on all three ROUGE measures (ROUGE-1, ROUGE-2, and ROUGE-L) by a significant margin over the other baselines, owing to its local and nonlocal contextual factors and the QS-based factors with group L1 regularization.
Experimental Setting
In our experiments, we also compare the precision, recall, and F1 score in the ROUGE-1, ROUGE-2, and ROUGE-L measures (Lin, 2004) for answer summarization performance.
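ROUGE-1 (Lin, 2004), for instance, reports precision, recall, and F1 over clipped unigram overlap; a simplified sketch follows (real ROUGE adds stemming and stopword options that this omits).

from collections import Counter

def rouge1(reference_tokens, summary_tokens):
    ref, hyp = Counter(reference_tokens), Counter(summary_tokens)
    overlap = sum((ref & hyp).values())  # clipped unigram matches
    p = overlap / max(len(summary_tokens), 1)
    r = overlap / max(len(reference_tokens), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1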
F1 score is mentioned in 3 sentences in this paper.
Chen, Xiao and Kit, Chunyu
Abstract
It achieves its best F1 scores of 91.86% and 85.58% on the two languages, respectively, and further pushes them to 92.80% and 85.60% via combination with other high-performance parsers.
Constituent Recombination
The parameters λ and p are tuned by Powell's method (Powell, 1964) on a development set, using the F1 score of PARSEVAL (Black et al., 1991) as the objective.
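Powell's method is a derivative-free optimizer, so it can tune parameters against F1 directly even though F1 is not differentiable in them. One way to reproduce such a tuning loop (an assumption, not the authors' code; dev_f1 is a hypothetical function that evaluates PARSEVAL F1 on the development set) is via scipy:

from scipy.optimize import minimize

def tune(dev_f1, x0=(1.0, 1.0)):
    # Maximize development-set F1 by minimizing its negation with Powell's method.
    result = minimize(lambda x: -dev_f1(*x), x0, method="Powell")
    return result.x  # tuned parameter values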
Introduction
Combined with other high-performance parsers under the framework of constituent recombination (Sagae and Lavie, 2006; Fossum and Knight, 2009), this model further enhances the F1 scores to 92.80% and 85.60%, the highest ones achieved so far on these two data sets.
F1 score is mentioned in 3 sentences in this paper.
Titov, Ivan and Klementiev, Alexandre
Empirical Evaluation
We compute the aggregate PU, CO, and F1 scores over all predicates in the same way as Lang and Lapata (2011a), by weighting the scores of each predicate by the number of its argument occurrences.
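In other words, the aggregate is an occurrence-weighted average; a minimal sketch of that weighting (the names are hypothetical, not the authors' code):

def aggregate_score(per_predicate):
    # per_predicate: list of (score, num_argument_occurrences) pairs.
    total = sum(n for _, n in per_predicate)
    return sum(s * n for s, n in per_predicate) / total if total else 0.0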
Empirical Evaluation
Our models are robust to parameter settings; the parameters were tuned (to an order of magnitude) to optimize the F1 score on the held-out development set and were as follows.
Empirical Evaluation
Boldface is used to highlight the best F1 scores.
F1 score is mentioned in 3 sentences in this paper.