Experiments | Performance is measured by four metrics on the test set: accuracy, precision, recall, and F1 score, shown in the first section of each subtable. |
Experiments | However, under this discourse condition, the distribution of positive and negative instances in both training and test sets is extremely skewed, which makes it more sensible to compare the recall and F1 scores for evaluation. |
Experiments | In fact, our features achieve much higher recall and F1 score despite a much lower precision and a slightly lower accuracy. |
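A minimal sketch of these four metrics for a binary task, and of why accuracy alone misleads on skewed data; the 95/5 label counts and the all-negative predictor below are invented purely for illustration:

```python
# Precision, recall, and F1 from binary gold/predicted labels.
def precision_recall_f1(gold, pred, positive=1):
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 95 negatives vs. 5 positives: predicting all-negative scores 95% accuracy
# yet 0 precision, recall, and F1 on the positive class.
gold = [0] * 95 + [1] * 5
pred = [0] * 100
print(precision_recall_f1(gold, pred))  # (0.0, 0.0, 0.0)
```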
Experiment | We used EVALB to compute the F1 score. |
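EVALB is a standalone bracket scorer; as a rough sketch of the labeled bracketing F1 it reports (PARSEVAL-style), the following computes F1 over labeled constituent spans of nested-list trees. The tree encoding and the `spans` helper are our own assumptions for illustration, not EVALB's implementation:

```python
from collections import Counter

def spans(tree, start=0):
    """Return (constituent_spans, end) for a tree like
    ['S', ['NP', ['DT', 'the'], ['NN', 'cat']], ['VP', ['VBD', 'sat']]]."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return [], start + 1  # preterminal: POS brackets are not scored
    out, pos = [], start
    for child in children:
        child_spans, pos = spans(child, pos)
        out.extend(child_spans)
    out.append((label, start, pos))
    return out, pos

def bracket_f1(gold_tree, test_tree):
    gold = Counter(spans(gold_tree)[0])  # labeled spans, with multiplicity
    test = Counter(spans(test_tree)[0])
    if not gold or not test:
        return 0.0
    match = sum((gold & test).values())
    p, r = match / sum(test.values()), match / sum(gold.values())
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ['S', ['NP', ['DT', 'the'], ['NN', 'cat']], ['VP', ['VBD', 'sat']]]
test = ['S', ['NP', ['DT', 'the'], ['NN', 'cat']], ['VBD', 'sat']]  # missing VP
print(bracket_f1(gold, test))  # 0.8
```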
Experiment | Table 2 shows the F1 scores of the CFG, TSG and SR-TSG parsers for small and full training sets. |
Experiment | Table 3 shows the F1 scores of an SR-TSG and conventional parsers with the full training set. |
Introduction | Our SR-TSG parser achieves an F1 score of 92.4% in the WSJ English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and superior to state-of-the-art discriminative reranking parsers. |
Model | We use standard measures of word-level precision, recall, and F1 score for evaluating each task. |
Model | Figure 2 shows the F1 scores of the proposed model (SegTagDep) on CTB-Sc-l with respect to the training epoch and different parsing feature weights, where “Seg”, “Tag”, and “Dep” respectively denote the F1 scores of word segmentation, POS tagging, and dependency parsing. |
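For the word segmentation score in particular, word-level F1 is conventionally computed over character spans: a predicted word counts as correct only if both of its boundaries match the gold segmentation. A minimal sketch, with helper names that are ours rather than the paper's:

```python
# Word-level P/R/F1 over the character spans of two segmentations.
def word_spans(words):
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def word_f1(gold_words, pred_words):
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    match = len(gold & pred)
    p, r = match / len(pred), match / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

# Only the first word's boundaries agree, so P = R = F1 = 1/3.
print(word_f1(["他", "喜欢", "猫"], ["他", "喜", "欢猫"]))
```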
Model | Table 3: F1 scores and speed (in sentences per second) |
Experimental Results | Table 2 shows that our general CRF model based on question segmentation with group L1 regularization outperforms the baselines significantly in all three measures (gCRF-QS-l1 is 13.99% better than SVM in precision, 9.77% better in recall, and 11.72% better in F1 score). |
Experimental Results | It is observed that our gCRF-QS-l1 model improves precision, recall, and F1 score on all three measures (ROUGE-1, ROUGE-2, and ROUGE-L) by a significant margin over the other baselines, owing to the use of local and nonlocal contextual factors and factors based on QS with group L1 regularization. |
Experimental Setting | In our experiments, we also compare the precision, recall, and F1 score in the ROUGE-1, ROUGE-2, and ROUGE-L measures (Lin, 2004) for answer summarization performance. |
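As a reference point for these measures, a hedged sketch of ROUGE-1 (unigram overlap) in the spirit of Lin (2004); the real toolkit adds stemming, stopword handling, and multi-reference support:

```python
from collections import Counter

def rouge_1(reference, candidate):
    # Unigram-overlap precision, recall, and F1 between two texts.
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())
    p = overlap / sum(cand.values()) if cand else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(rouge_1("the cat sat on the mat", "the cat lay on the mat"))
# (0.833..., 0.833..., 0.833...)
```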
Abstract | It achieves its best F1 scores of 91.86% and 85.58% on the two languages, respectively, and further pushes them to 92.80% and 85.60% via combination with other high-performance parsers. |
Constituent Recombination | The parameters λ and ρ are tuned by Powell's method (Powell, 1964) on a development set, using the F1 score of PARSEVAL (Black et al., 1991) as the objective. |
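Since PARSEVAL F1 is not differentiable in these parameters, a derivative-free optimizer such as Powell's method is a natural fit. A hedged sketch using SciPy's implementation, where `dev_f1` is a hypothetical stand-in for rescoring the development set (the toy quadratic surface below is invented):

```python
import numpy as np
from scipy.optimize import minimize

def dev_f1(params):
    lam, rho = params
    # In the real pipeline: rerun constituent recombination with
    # (lam, rho) and return PARSEVAL F1 on the development set.
    return -((lam - 0.6) ** 2 + (rho - 0.3) ** 2)  # toy stand-in surface

# Powell's method minimizes, so negate the F1 objective.
result = minimize(lambda p: -dev_f1(p), x0=np.array([0.5, 0.5]), method="Powell")
print(result.x)  # tuned parameter values, here close to (0.6, 0.3)
```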
Introduction | Combined with other high-performance parsers under the framework of constituent recombination (Sagae and Lavie, 2006; Fossum and Knight, 2009), this model further enhances the F1 scores to 92.80% and 85.60%, the highest ones achieved so far on these two data sets. |
Empirical Evaluation | We compute the aggregate PU, CO, and F1 scores over all predicates in the same way as Lang and Lapata (2011a), weighting the scores of each predicate by the number of its argument occurrences. |
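A small sketch of that occurrence-weighted aggregation; the per-predicate figures below are invented for illustration, and the same scheme applies to PU, CO, or F1:

```python
def weighted_aggregate(per_predicate):
    """per_predicate maps predicate -> (score, argument_occurrence_count)."""
    total = sum(n for _, n in per_predicate.values())
    return sum(score * n for score, n in per_predicate.values()) / total

per_predicate = {"say": (0.82, 120), "give": (0.75, 40), "open": (0.60, 10)}
print(weighted_aggregate(per_predicate))  # ~0.791
```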
Empirical Evaluation | Our models are robust to parameter settings; the parameters were tuned (to an order of magnitude) to optimize the F1 score on the held-out development set and were as follows. |
Empirical Evaluation | Boldface is used to highlight the best F1 scores. |