Abstract | Our result, an F-score of 73.74%, outperforms the state-of-the-art methods on a manually labeled test dataset.
Abstract | Moreover, combining our method with a previous manually-built hierarchy extension method can further improve F-score to 80.29%. |
Experimental Setup | We use precision, recall, and F-score as our metrics to evaluate the performances of the methods. |
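The metrics named above follow the standard definitions; a minimal sketch in Python (the counts `tp`, `fp`, `fn` are hypothetical, not figures from any of the papers excerpted here):

```python
# Minimal sketch of the three metrics, computed from raw counts.
# tp, fp, fn below are hypothetical counts, for illustration only.
def precision_recall_fscore(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

p, r, f = precision_recall_fscore(tp=60, fp=20, fn=40)
# p = 0.75, r = 0.6, f ≈ 0.667
```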
Introduction | The experimental results show that our method achieves an F-score of 73.74%, which significantly outperforms the previous state-of-the-art methods.
Introduction | (2008) can further improve F-score to 80.29%. |
Results and Analysis 5.1 Varying the Amount of Clusters | Table 3 shows that the proposed method achieves a better recall and F-score than all of the previous methods do. |
Results and Analysis 5.1 Varying the Amount of Clusters | It can significantly (p < 0.01) improve the F-score over the state-of-the-art method MWikHCilmE. |
Results and Analysis 5.1 Varying the Amount of Clusters | The F-score is further improved from 73.74% to 76.29%. |
Discussion and Conclusion | With the 26 proposed features derived from the decoding process and source-sentence syntactic analysis, the proposed QE model achieved better TER prediction, higher correlation with human correction of MT output, and a higher F-score in finding good translations.
Experiments | Here we report the precision, recall, and F-score of finding such “Good” sentences (with TER ≤ 0.1) on the three documents in Table 3.
Experiments | Again, the adaptive QE model produces higher recall, mostly higher precision, and significantly improved F-score.
Experiments | The overall F-score of the adaptive QE model is 0.282. |
Evaluation of lexical similarity in context | If one wants to optimize the F-score (the harmonic mean of precision and recall) when extracting relevant pairs, the optimal point is an F-score of .24 at a threshold of .22 on Lin’s score.
Experiments: predicting relevance in context | Other popular methods (maximum entropy, SVM) have shown slightly inferior combined F-score, even though their precision and recall individually vary more widely.
Experiments: predicting relevance in context | As a baseline, we can also consider a simple threshold on the lexical similarity score, in our case Lin’s measure, which we have shown to yield the best F-score of 24% when set at 0.22. |
Experiments: predicting relevance in context | If we take the best simple classifier (random forests), the precision and recall are 68.1% and 24.2% for an F-score of 35.7%, and this is significantly beaten by the Naive Bayes method as precision and recall are more even ( F-score of 41.5%). |
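The trade-off described above is easy to verify numerically: the harmonic mean rewards balanced precision and recall. A small check using the figures quoted in the excerpt:

```python
def fscore(p, r):
    # Harmonic mean of precision and recall (here in percent).
    return 2 * p * r / (p + r)

# Random forests: high precision, low recall -> a modest harmonic mean.
assert round(fscore(68.1, 24.2), 1) == 35.7
# A more even precision/recall split lifts the harmonic mean, which is
# how the Naive Bayes method reaches 41.5% despite lower precision.
```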
Related work | Precision / Recall / F-score: 40.4 / 54.3 / 46.3; 37.4 / 52.8 / 43.8; 36.1 / 49.5 / 41.8; 36.5 / 54.8 / 43.8
Experiments | Figure 8 reports the accuracy, macro F-score, and micro F-score.
Experiments | It shows that the BR learner produces better accuracy and micro F-score than the FR learner, but a slightly worse macro F-score.
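A gap between macro and micro F-score, as noted above, arises because macro averaging weights all classes equally while micro averaging pools the counts, letting frequent classes dominate. A minimal sketch with hypothetical per-class counts:

```python
def micro_macro_f(per_class_counts):
    """per_class_counts: list of (tp, fp, fn) tuples, one per label."""
    def f(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0
    # Macro: average the per-class F-scores (every class weighs equally).
    macro = sum(f(*c) for c in per_class_counts) / len(per_class_counts)
    # Micro: pool the counts first (frequent classes dominate).
    tp, fp, fn = map(sum, zip(*per_class_counts))
    return macro, f(tp, fp, fn)

# Hypothetical counts: a large class handled well, a small class handled poorly.
macro, micro = micro_macro_f([(90, 10, 10), (1, 5, 5)])
# micro > macro here: pooling hides the weak small class.
```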
Abstract | This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case “function words”, setting a new state-of-the-art of 92.4% token f-score . |
Introduction | While absolute accuracy is not directly relevant to the main point of the paper, we note that the models that learn generalisations about function words perform unsupervised word segmentation at 92.5% token f-score on the standard Bernstein-Ratner (1987) corpus, which improves the previous state-of-the-art by more than 4%. |
Introduction | that achieves the best token f-score expects function words to appear at the left edge of phrases. |
Word segmentation results | f-score / precision / recall: Baseline 0.872 / 0.918 / 0.956; + left FWs 0.924 / 0.935 / 0.990; + left + right FWs 0.912 / 0.957 / 0.953
Word segmentation results | Figure 2 presents the standard token and lexicon (i.e., type) f-score evaluations for word segmentations proposed by these models (Brent, 1999), and Table 1 summarises the token and lexicon f-scores for the major models discussed in this paper. |
Word segmentation results | It is interesting to note that adding “function words” improves token f-score by more than 4%, corresponding to a 40% reduction in overall error rate. |
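The arithmetic behind the error-rate claim can be checked directly: a token f-score gain from 0.872 to 0.924 removes roughly 40% of the remaining error mass.

```python
# Token f-scores from the excerpt; the error reduction follows directly.
baseline_f, improved_f = 0.872, 0.924
absolute_gain = improved_f - baseline_f             # 0.052, i.e. >4% absolute
error_reduction = absolute_gain / (1 - baseline_f)  # 0.052 / 0.128 ≈ 0.406
# Roughly a 40% reduction in residual segmentation error.
```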
Word segmentation with Adaptor Grammars | The starting point and baseline for our extension is the adaptor grammar with syllable structure phonotactic constraints and three levels of collo-cational structure (5-21), as prior work has found that this yields the highest word segmentation token f-score (Johnson and Goldwater, 2009). |
Abstract | The results show a significant improvement in Chinese relation extraction, outperforming other methods in F-score by 10% in 6 relation types and 15% in 18 relation subtypes. |
Feature Construction | F-score is computed by F = 2 × P × R / (P + R), where P is precision and R is recall.
Feature Construction | In Row 2, with only the .7-"0w feature, the F-score already reaches 77.74% in 6 types and 60.31% in 18 subtypes. |
Feature Construction | In Table 3, it is shown that our system outperforms other systems, in F-score, by 10% on 6 relation types and by 15% on 18 subtypes.
Introduction | The performance of relation extraction is still unsatisfactory, with an F-score of 67.5% for English (23 subtypes) (Zhou et al., 2010).
Introduction | Chinese relation extraction also performs weakly, with an F-score of about 66.6% on 18 subtypes (Dandan et al., 2012).
Related Work | MEANT (Lo et al., 2012) is the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers; it outperforms BLEU, NIST, METEOR, WER, CDER, and TER in correlation with human adequacy judgments.
Related Work | In this paper, we employ a newer version of MEANT that uses f-score to aggregate individual token similarities into the composite phrasal similarities of semantic role fillers, as our experiments indicate this is more accurate than the previously used aggregation functions. |
Related Work | Compute the weighted f-score over the matching role labels of these aligned predicates and role fillers, according to definitions similar to those in section 2.2, except replacing REF with IN in q_{i,j} and w_{i,l}.
Results | Table 1 shows that for human adequacy judgments at the sentence level, the f-score-based XMEANT (1) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT.
XMEANT: a cross-lingual MEANT | 3.1 Applying MEANT’s f-score within semantic role fillers |
XMEANT: a cross-lingual MEANT | The first natural approach is to extend MEANT’s f-score based method of aggregating semantic parse accuracy, so as to also apply to aggregat- |
Abstract | Experiments show that our approach achieves a statistically significant increase of 13.5% in F-score and 37% in area under the precision recall curve. |
Available at http://nlp.stanford.edu/software/mimlre.shtml. | Figure 2 shows that our model consistently outperforms all six algorithms at almost all recall levels, improving the maximum F-score by more than 13.5% relative to MIML (from 28.35% to 32.19%) and increasing the area under the precision-recall curve by more than 37% (from 11.74 to 16.1).
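The relative improvements quoted above can be reproduced from the reported absolute numbers:

```python
# Reproducing the relative gains from the absolute figures in the excerpt.
def rel_gain(old, new):
    return (new - old) / old

max_f_gain = 100 * rel_gain(28.35, 32.19)   # ≈ 13.5 (max F-score)
auc_gain = 100 * rel_gain(11.74, 16.10)     # ≈ 37.1 (area under P-R curve)
```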
Available at http://nlp.stanford.edu/software/mimlre.shtml. | Performance of Guided DS also compares favorably with the best-scoring hand-coded systems for a similar task, such as the Sun et al. (2011) system for KBP 2011, which reports an F-score of 25.7%.
Introduction | proposed approach, we extend MIML (Surdeanu et al., 2012), a state-of-the-art distant supervision model, and show a significant improvement of 13.5% in F-score on the relation extraction benchmark TAC-KBP (Ji and Grishman, 2011) dataset.
Introduction | While prior work employed tens of thousands of human-labeled examples (Zhang et al., 2012) and obtained only a 6.5% increase in F-score over a logistic regression baseline, our approach uses far less labeled data (about 1/8 as much) yet achieves a larger improvement over stronger baselines.
Training | Training MIML on a simple fusion of distantly-labeled and human-labeled datasets does not improve the maximum F-score since this hand-labeled data is swamped by a much larger amount of distant-supervised data of much lower quality. |
Abstract | Standard CCGBank tests show the model achieves labeled F-score improvements of up to 1.05 over three existing, competitive CCG parsing models.
Experiments | On both the full and reduced sets, our parser achieves the highest F-score.
Experiments | In comparison with C&C, our parser shows significant increases across all metrics, with 0.57% and 1.06% absolute F-score improvements over the hybrid and normal-form models, respectively. |
Experiments | While our parser achieved lower precision than Z&C, it is more balanced and gives higher recall for all of the dependency relations except the last one, and higher F-score for over half of them. |
Introduction | Results on the standard CCGBank tests show that our parser achieves absolute labeled F-score gains of up to 0.5 over the shift-reduce parser of Zhang and Clark (2011); and up to 1.05 and 0.64 over the normal-form and hybrid models of Clark and Curran (2007), respectively. |
Abstract | We show that this method generates output closer to the feedback that lecturers actually generated, achieving 3.5% higher accuracy and 15% higher F-score than multiple simple classifiers that keep a history of selected templates. |
Evaluation | The accuracy, the weighted precision, the weighted recall, and the weighted F-score of the classifiers are shown in Table 3. |
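Weighted precision, recall, and F-score, as reported above, average the per-class scores weighted by class support. A minimal sketch with hypothetical per-class values:

```python
def weighted_f(per_class, support):
    """per_class: (precision, recall) pairs, one per label; support: class sizes.
    All values here are hypothetical, for illustration only."""
    f = [2 * p * r / (p + r) if p + r else 0.0 for p, r in per_class]
    return sum(fi * s for fi, s in zip(f, support)) / sum(support)

# A frequent, well-predicted class dominates the weighted average.
w_f = weighted_f([(0.9, 0.8), (0.6, 0.7)], support=[80, 20])
# w_f ≈ 0.807
```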
Evaluation | It was found that in 10-fold cross validation RAkEL performs significantly better in all these automatic measures (accuracy = 76.95%, F-score = 85.50%). |
Evaluation | Remarkably, ML achieves an F-score more than 10% higher than the other methods (Table 3).
Experiment | As we can see, by using tag embeddings, the F-score is improved by +0.6% and OOV recall is improved by +1.0%, which shows that tag embeddings succeed in modeling the tag-tag interaction and tag-character interaction.
Experiment | The F-score is improved by +0.6% while OOV recall is improved by +3.2%, which indicates that the tensor-based transformation captures more interactional information than a simple nonlinear transformation.
Experiment | As shown in Table 5 (last three rows), both the F-score and OOV recall of our model improve with pre-training.