Index of papers in Proc. ACL 2014 that mention
  • F-score
Fu, Ruiji and Guo, Jiang and Qin, Bing and Che, Wanxiang and Wang, Haifeng and Liu, Ting
Abstract
Our result, an F-score of 73.74%, outperforms the state-of-the-art methods on a manually labeled test dataset.
Abstract
Moreover, combining our method with a previous manually-built hierarchy extension method can further improve F-score to 80.29%.
Experimental Setup
We use precision, recall, and F-score as our metrics to evaluate the performances of the methods.
Introduction
The experimental results show that our method achieves an F-score of 73.74% which significantly outperforms the previous state-of-the-art methods.
Introduction
…(2008) can further improve F-score to 80.29%.
Results and Analysis 5.1 Varying the Amount of Clusters
Table 3 shows that the proposed method achieves a better recall and F-score than all of the previous methods do.
Results and Analysis 5.1 Varying the Amount of Clusters
It can significantly (p < 0.01) improve the F-score over the state-of-the-art method MWik+CilinE.
Results and Analysis 5.1 Varying the Amount of Clusters
The F-score is further improved from 73.74% to 76.29%.
F-score is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Huang, Fei and Xu, Jian-Ming and Ittycheriah, Abraham and Roukos, Salim
Discussion and Conclusion
With the 26 proposed features derived from the decoding process and source-sentence syntactic analysis, the proposed QE model achieved better TER prediction, higher correlation with human correction of MT output, and a higher F-score in finding good translations.
Experiments
Here we report the precision, recall and F-score of finding such “Good” sentences (with TER ≤ 0.1) on the three documents in Table 3.
Experiments
Again, the adaptive QE model produces higher recall, mostly higher precision, and significantly improved F-score.
Experiments
The overall F-score of the adaptive QE model is 0.282.
F-score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Muller, Philippe and Fabre, Cécile and Adam, Clémentine
Evaluation of lexical similarity in context
In case one wants to optimize the F-score (the harmonic mean of precision and recall) when extracting relevant pairs, we can see that the optimal point is at .24 for a threshold of .22 on Lin’s score.
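For reference, the definition invoked here (the harmonic mean of precision P and recall R) is the balanced F-score:

    F_1 = \frac{2PR}{P + R}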
Experiments: predicting relevance in context
Other popular methods (maximum entropy, SVM) have shown a slightly inferior combined F-score, even though precision and recall may show larger variations.
Experiments: predicting relevance in context
As a baseline, we can also consider a simple threshold on the lexical similarity score, in our case Lin’s measure, which we have shown to yield the best F-score of 24% when set at 0.22.
Experiments: predicting relevance in context
If we take the best simple classifier (random forests), the precision and recall are 68.1% and 24.2% for an F-score of 35.7%, and this is significantly beaten by the Naive Bayes method as precision and recall are more even (F-score of 41.5%).
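As a quick check, plugging the random-forest figures into the harmonic-mean formula above reproduces the quoted value:

    F = \frac{2 \times 68.1 \times 24.2}{68.1 + 24.2} \approx 35.7\%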
Related work
Precision  Recall  F-score
40.4       54.3    46.3
37.4       52.8    43.8
36.1       49.5    41.8
36.5       54.8    43.8
F-score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Zhang, Zhe and Singh, Munindar P.
Experiments
Accuracy | Macro F-score | Micro F-score
Experiments
Figure 8 reports the accuracy, macro F-score, and micro F-score.
Experiments
It shows that the BR learner produces better accuracy and micro F-score than the FR learner, but a slightly worse macro F-score.
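The two averages contrasted here differ in where the averaging happens: macro F-score averages per-class F-scores, while micro F-score pools the counts before computing a single F-score. A minimal illustrative sketch (not the authors' code; the function names are ours):

    from collections import Counter

    def f1(tp, fp, fn):
        # Balanced F-score from true-positive, false-positive, false-negative counts.
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def macro_micro_f1(gold, pred):
        # gold, pred: parallel lists of class labels for the same items.
        tp, fp, fn = Counter(), Counter(), Counter()
        for g, p in zip(gold, pred):
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1  # predicted p, but the gold label was g
                fn[g] += 1
        labels = sorted(set(gold) | set(pred))
        macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)  # mean of per-class F1
        micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))    # F1 of pooled counts
        return macro, micro

    # e.g. macro_micro_f1(list("aabc"), list("abbb")) -> (0.3888..., 0.5)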
F-score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Johnson, Mark and Christophe, Anne and Dupoux, Emmanuel and Demuth, Katherine
Abstract
This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case “function words”, setting a new state-of-the-art of 92.4% token f-score.
Introduction
While absolute accuracy is not directly relevant to the main point of the paper, we note that the models that learn generalisations about function words perform unsupervised word segmentation at 92.5% token f-score on the standard Bernstein-Ratner (1987) corpus, which improves the previous state-of-the-art by more than 4%.
Introduction
…that achieves the best token f-score expects function words to appear at the left edge of phrases.
Word segmentation results
                      f-score  precision  recall
Baseline              0.872    0.918      0.956
+ left FWs            0.924    0.935      0.990
+ left + right FWs    0.912    0.957      0.953
Word segmentation results
Figure 2 presents the standard token and lexicon (i.e., type) f-score evaluations for word segmentations proposed by these models (Brent, 1999), and Table 1 summarises the token and lexicon f-scores for the major models discussed in this paper.
Word segmentation results
It is interesting to note that adding “function words” improves token f-score by more than 4%, corresponding to a 40% reduction in overall error rate.
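The 40% figure follows from the token f-scores in the table above: the baseline error rate is 1 − 0.872 = 12.8%, the function-word model's is 1 − 0.924 = 7.6%, and (12.8 − 7.6) / 12.8 ≈ 41%, i.e. roughly a 40% relative error reduction.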
Word segmentation with Adaptor Grammars
The starting point and baseline for our extension is the adaptor grammar with syllable structure phonotactic constraints and three levels of collocational structure (5-21), as prior work has found that this yields the highest word segmentation token f-score (Johnson and Goldwater, 2009).
F-score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Chen, Yanping and Zheng, Qinghua and Zhang, Wei
Abstract
The results show a significant improvement in Chinese relation extraction, outperforming other methods in F-score by 10% in 6 relation types and 15% in 18 relation subtypes.
Feature Construction
F-score is computed by
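The equation itself did not survive extraction. Assuming the conventional definition, the general weighted form is

    F_\beta = \frac{(1 + \beta^2) PR}{\beta^2 P + R}

which reduces to the usual F_1 = 2PR/(P + R) at β = 1; the paper's exact formula may differ.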
Feature Construction
In Row 2, with only the omni-word feature, the F-score already reaches 77.74% in 6 types and 60.31% in 18 subtypes.
Feature Construction
In Table 3, it is shown that our system outperforms other systems in F-score by 10% on 6 relation types and by 15% on 18 subtypes.
Introduction
The performance of relation extraction is still unsatisfactory, with an F-score of 67.5% for English (23 subtypes) (Zhou et al., 2010).
Introduction
Chinese relation extraction also performs weakly, with an F-score of about 66.6% in 18 subtypes (Dandan et al., 2012).
F-score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Lo, Chi-kiu and Beloucif, Meriem and Saers, Markus and Wu, Dekai
Related Work
MEANT (Lo et al., 2012), which is the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers, outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgments.
Related Work
In this paper, we employ a newer version of MEANT that uses f-score to aggregate individual token similarities into the composite phrasal similarities of semantic role fillers, as our experiments indicate this is more accurate than the previously used aggregation functions.
Related Work
Compute the weighted f-score over the matching role labels of these aligned predicates and role fillers according to definitions similar to those in section 2.2, except for replacing REF with IN in qij and wil.
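As a schematic only (these excerpts do not define MEANT's weights or match terms, so the symbols below are illustrative, not the paper's notation), a weighted f-score over labels i with weights w_i, matched counts m_i, and output/reference totals c_i^{out} and c_i^{ref} takes the form

    P = \frac{\sum_i w_i m_i}{\sum_i w_i c_i^{out}}, \qquad R = \frac{\sum_i w_i m_i}{\sum_i w_i c_i^{ref}}, \qquad F = \frac{2PR}{P + R}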
Results
Table 1 shows that for human adequacy judgments at the sentence level, the f-score based XMEANT (1) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT.
XMEANT: a cross-lingual MEANT
3.1 Applying MEANT’s f-score within semantic role fillers
XMEANT: a cross-lingual MEANT
The first natural approach is to extend MEANT’s f-score based method of aggregating semantic parse accuracy, so as to also apply to aggregating…
F-score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Pershina, Maria and Min, Bonan and Xu, Wei and Grishman, Ralph
Abstract
Experiments show that our approach achieves a statistically significant increase of 13.5% in F-score and 37% in area under the precision-recall curve.
Available at http://nlp.stanford.edu/software/mimlre.shtml.
Figure 2 shows that our model consistently outperforms all six algorithms at almost all recall levels and improves the maximum F-score by more than 13.5% relative to MIML (from 28.35% to 32.19%), as well as increasing the area under the precision-recall curve by more than 37% (from 11.74 to 16.1).
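Both relative gains are consistent with the absolute numbers quoted: (32.19 − 28.35) / 28.35 ≈ 13.5% for the maximum F-score, and (16.1 − 11.74) / 11.74 ≈ 37% for the area under the precision-recall curve.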
Available at http://nlp.stanford.edu/software/mimlre.shtml.
Performance of Guided DS also compares favorably with the best-scoring hand-coded systems for a similar task, such as the Sun et al. (2011) system for KBP 2011, which reports an F-score of 25.7%.
Introduction
…proposed approach, we extend MIML (Surdeanu et al., 2012), a state-of-the-art distant supervision model, and show a significant improvement of 13.5% in F-score on the relation extraction benchmark TAC-KBP (Ji and Grishman, 2011) dataset.
Introduction
While prior work employed tens of thousands of human-labeled examples (Zhang et al., 2012) and obtained only a 6.5% increase in F-score over a logistic regression baseline, our approach uses much less labeled data (about 1/8) yet achieves a much larger improvement over stronger baselines.
Training
Training MIML on a simple fusion of distantly-labeled and human-labeled datasets does not improve the maximum F-score since this hand-labeled data is swamped by a much larger amount of distant-supervised data of much lower quality.
F-score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Xu, Wenduan and Clark, Stephen and Zhang, Yue
Abstract
Standard CCGBank tests show that the model achieves labeled F-score improvements of up to 1.05 over three existing, competitive CCG parsing models.
Experiments
On both the full and reduced sets, our parser achieves the highest F-score.
Experiments
In comparison with C&C, our parser shows significant increases across all metrics, with 0.57% and 1.06% absolute F-score improvements over the hybrid and normal-form models, respectively.
Experiments
While our parser achieved lower precision than Z&C, it is more balanced and gives higher recall for all of the dependency relations except the last one, and higher F-score for over half of them.
Introduction
Results on the standard CCGBank tests show that our parser achieves absolute labeled F-score gains of up to 0.5 over the shift-reduce parser of Zhang and Clark (2011); and up to 1.05 and 0.64 over the normal-form and hybrid models of Clark and Curran (2007), respectively.
F-score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Gkatzia, Dimitra and Hastie, Helen and Lemon, Oliver
Abstract
We show that this method generates output closer to the feedback that lecturers actually generated, achieving 3.5% higher accuracy and 15% higher F-score than multiple simple classifiers that keep a history of selected templates.
Evaluation
The accuracy, the weighted precision, the weighted recall, and the weighted F-score of the classifiers are shown in Table 3.
Evaluation
It was found that in 10-fold cross validation RAkEL performs significantly better in all these automatic measures (accuracy = 76.95%, F-score = 85.50%).
Evaluation
Remarkably, ML achieves more than 10% higher F-score than the other methods (Table 3).
F-score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Pei, Wenzhe and Ge, Tao and Chang, Baobao
Experiment
As we can see, by using tag embeddings, the F-score is improved by +0.6% and OOV recall is improved by +1.0%, which shows that tag embeddings succeed in modeling the tag-tag and tag-character interactions.
Experiment
The F-score is improved by +0.6% while OOV recall is improved by +3.2%, which indicates that the tensor-based transformation captures more interactional information than a simple nonlinear transformation.
Experiment
As shown in Table 5 (last three rows), both the F-score and OOV recall of our model are boosted by pre-training.
F-score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper: