Error Classification | Using held-out validation data, we jointly tune the three parameters introduced in the previous paragraph to optimize the F-score achieved by b_i for error e_i. However, an exact solution to this optimization problem is computationally expensive.
Error Classification | Consequently, we find a local maximum by employing the simulated annealing algorithm (Kirkpatrick et al., 1983), altering one parameter at a time to optimize F-score by holding the remaining parameters fixed. |
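A minimal sketch of this kind of one-parameter-at-a-time annealing search (not the authors' implementation; the parameter dictionary, the perturbation scale, and the f_score_on_dev callback are placeholder assumptions):

```python
import math
import random

def anneal_one_at_a_time(params, f_score_on_dev, steps=200, t0=1.0, cooling=0.98):
    """Coordinate-wise simulated annealing over real-valued parameters.

    params         : dict of parameter name -> initial value (hypothetical names)
    f_score_on_dev : callable(dict) -> F-score on held-out validation data
    """
    current, current_f = dict(params), f_score_on_dev(params)
    best, best_f = dict(current), current_f
    temperature = t0
    for _ in range(steps):
        name = random.choice(list(current))        # alter one parameter at a time
        candidate = dict(current)
        candidate[name] += random.gauss(0.0, 0.1)  # small random perturbation (assumed scale)
        cand_f = f_score_on_dev(candidate)
        # always accept improvements; accept worse moves with annealing probability
        if cand_f >= current_f or random.random() < math.exp((cand_f - current_f) / temperature):
            current, current_f = candidate, cand_f
            if current_f > best_f:
                best, best_f = dict(current), current_f
        temperature *= cooling
    return best, best_f
```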
Error Classification | Other ways we could measure our system’s performance (such as macro F-score) would consider our system’s performance on the less frequent errors no less important than its performance on the more frequent ones.
Evaluation | To evaluate our thesis clarity error type identification system, we compute precision, recall, micro F-score, and macro F-score , which are calculated as follows. |
Evaluation | Then, the precision (P_i), recall (R_i), and F-score (F_i) for b_i and the macro F-score (F) of the combined system for one test fold are calculated as F_i = 2 P_i R_i / (P_i + R_i), with F taken as the average of the F_i over error types.
Evaluation | However, the macro F-score calculation can be seen as giving too much weight to the less frequent errors. |
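To make the two averaging schemes discussed above concrete, here is a small illustrative sketch (not taken from the paper): it computes a per-error-type F-score from hypothetical (tp, fp, fn) counts, the macro F-score as their unweighted mean, and the micro F-score from the pooled counts, so the frequent type dominates the micro score but counts no more than the rare type in the macro score.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f(counts):
    """counts: {error_type: (tp, fp, fn)} -> (macro F-score, micro F-score)."""
    macro = sum(prf(*c)[2] for c in counts.values()) / len(counts)
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    return macro, prf(*pooled)[2]

# toy data: one frequent and one rare error type
print(macro_micro_f({"frequent": (90, 10, 20), "rare": (1, 4, 5)}))
```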
Abstract | On top of the pruning framework, we also propose a discriminative ITG alignment model using hierarchical phrase pairs, which improves both F-score and Bleu score over the baseline alignment system of GIZA++. |
Evaluation | An alternative criterion is the upper bound on alignment F-score , which essentially measures how many links in annotated alignment can be kept in ITG parse. |
Evaluation | The calculation of F-score upper bound is done in a bottom-up way like ITG parsing. |
Evaluation | The upper bound of alignment F-score can thus be calculated as well. |
The DITG Models | The MERT module for DITG takes the alignment F-score of a sentence pair as the performance measure.
The DITG Models | Given an input sentence pair and the reference annotated alignment, MERT aims to maximize the F-score of DITG-produced alignment. |
Abstract | Our best approach achieves a roughly 15% absolute increase in F-score over a simple but reasonable baseline.
Results | We present the results in terms of F-score only for simplicity; we then conduct an error analysis that examines precision and recall. |
Results | Feature Set (F-score, %Imp): word (43.85, —); word+nw (43.86, 0.0); word+na (44.78, 2.1); word+lem (45.85, 4.6); word+pos (45.91, 4.7); word+nw+pos+lem+na (46.34, 5.7).
Discussions | For small seed sizes, the F-score of bilingual bootstrapping is consistently better than the F-score obtained by training only on the seed data without using any bootstrapping. |
Discussions | To further illustrate this, we take some sample points from the graph and compare the number of tagged words needed by BiBoot and OnlySeed to reach the same (or nearly the same) F-score . |
Experimental Setup | [Figure 1: Comparison of BiBoot, MonoBoot, OnlySeed and WFS on Hindi Health data; the plots show F-score against seed size (words).]
Results | a. BiBoot: This curve represents the F-score obtained after 10 iterations by using bilingual bootstrapping with different amounts of seed data. |
Results | b. MonoBoot: This curve represents the F-score obtained after 10 iterations by using monolingual bootstrapping with different amounts of seed data. |
Results | c. OnlySeed: This curve represents the F-score obtained by training on the seed data alone without using any bootstrapping. |
Classification Results | The table lists the f-score for each of the target relations, with overall accuracy shown in brackets. |
Classification Results | Given that the experiments are run on the natural distribution of the data, which is skewed towards Expansion relations, the f-score is the more important measure to track.
Classification Results | Our random baseline is the f-score one would achieve by randomly assigning classes in proportion to its true distribution in the test set. |
Abstract | We propose a new evaluation metric, alignment entropy, grounded in information theory, to evaluate alignment quality without the need for a gold standard reference, and compare the metric with F-score.
Experiments | Next we conduct three experiments to study 1) alignment entropy vs. F-score , 2) the impact of alignment quality on transliteration accuracy, and 3) how to validate transliteration using alignment metrics. |
Experiments | 5.1 Alignment entropy vs. F-score |
Experiments | We have manually aligned a random set of 3,000 transliteration pairs from the Xinhua training set to serve as the gold standard, on which we calculate the precision, recall and F-score as well as alignment entropy for each alignment. |
Related Work | Denoting the number of cross-lingual mappings that are common in both A and G as C_AG, the number of cross-lingual mappings in A as C_A and the number of cross-lingual mappings in G as C_G, precision Pr is given as C_AG/C_A, recall Rc as C_AG/C_G and F-score as 2·Pr·Rc/(Pr + Rc).
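A direct transcription of these definitions as a sketch, treating the system output A and the gold standard G as sets of cross-lingual mappings (the function name, variable names and toy data are mine):

```python
def mapping_prf(A, G):
    """Precision, recall and F-score over sets of cross-lingual mappings."""
    A, G = set(A), set(G)
    c_ag = len(A & G)                    # mappings common to both A and G
    pr = c_ag / len(A) if A else 0.0     # C_AG / C_A
    rc = c_ag / len(G) if G else 0.0     # C_AG / C_G
    return pr, rc, 2 * pr * rc / (pr + rc) if pr + rc else 0.0

print(mapping_prf({("a", "x"), ("b", "y"), ("c", "z")},
                  {("a", "x"), ("b", "y"), ("d", "w")}))  # (0.667, 0.667, 0.667)
```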
Transliteration alignment entropy | We expect and will show that this estimate is a good indicator of the alignment quality, and is as effective as the F-score , but without the need for a gold standard reference. |
Abstract | Our result, an F-score of 73.74%, outperforms the state-of-the-art methods on a manually labeled test dataset. |
Abstract | Moreover, combining our method with a previous manually-built hierarchy extension method can further improve F-score to 80.29%. |
Experimental Setup | We use precision, recall, and F-score as our metrics to evaluate the performances of the methods. |
Introduction | The experimental results show that our method achieves an F-score of 73.74% which significantly outperforms the previous state-of-the-art methods. |
Introduction | (2008) can further improve F-score to 80.29%. |
Results and Analysis 5.1 Varying the Amount of Clusters | Table 3 shows that the proposed method achieves better recall and F-score than all of the previous methods.
Results and Analysis 5.1 Varying the Amount of Clusters | It can significantly (p < 0.01) improve the F-score over the state-of-the-art method MWikHCilmE. |
Results and Analysis 5.1 Varying the Amount of Clusters | The F-score is further improved from 73.74% to 76.29%. |
Experimental Setup | We report an F-score as well (the harmonic mean of precision and recall). |
Experimental Setup | We use the standard parsing F-score evaluation measure. |
Introduction | We use two measures to evaluate the performance of our algorithm, precision and F-score . |
Introduction | Precision reflects the algorithm’s applicability for creating training data to be used by supervised SRL models, while the standard SRL F-score measures the model’s performance when used by itself. |
Introduction | The first stage of our algorithm is shown to outperform a strong baseline both in terms of F-score and of precision. |
Related Work | Better performance is achieved on the classification, where state-of-the-art supervised approaches achieve about 81% F-score on the in-domain identification task, of which about 95% are later labeled correctly (Marquez et al., 2008). |
Results | In the “Collocation Maximum F-score” setting, the collocation parameters were tuned such that the maximum possible F-score for the collocation algorithm was achieved.
Results | The best or close to best F-score is achieved when using the clause detection algorithm alone (59.14% for English, 23.34% for Spanish). |
Results | Note that for both English and Spanish F-score improvements are achieved via a precision improvement that is more significant than the recall degradation. |
Experiments | Table 2 shows that F-score has dropped by 0.61%. |
Introduction | These features are targeted at improving the recovery of NP structure, increasing parser performance by 0.64% F-score . |
Evaluation | F-Score (F) is the harmonic mean of precision and recall; it is the most common non-referential detection metric.
Results | Table 4 gives precision, recall, F-score , and accuracy on the Train/ Test split. |
Results | Note that while the LL system has high detection precision, it has very low recall, sharply reducing F-score . |
Results | The MINIPL approach sacrifices some precision for much higher recall, but again has fairly low F-score . |
Evaluation Setup | First, following previous work, we evaluate our method using the labeled and unlabeled predicate-argument dependency F-score . |
Evaluation Setup | The dependency F-score captures both the target- |
Experiment and Analysis | For instance, there is a gain of 6.2% in labeled dependency F-score for HPSG formalism when 15,000 CFG trees are used. |
Experiment and Analysis | Across all three grammars, we can observe that adding CFG data has a more pronounced effect on the PARSEVAL measure than the dependency F-score . |
Experiment and Analysis | On the other hand, predicate-argument dependency F-score (Figure 5ac) also relies on the target grammar information. |
Implementation | This results in a drop on the dependency F-score by about 5%. |
Introduction | For instance, the model trained on 500 HPSG sentences achieves labeled dependency F-score of 72.3%. |
Introduction | Adding 15,000 Penn Treebank sentences during training leads to 78.5% labeled dependency F-score , an absolute improvement of 6.2%. |
Abstract | Evaluation on the Penn Chinese Treebank indicates that a converted dependency treebank helps constituency parsing and the use of unlabeled data by self-training further increases parsing f-score to 85.2%, resulting in 6% error reduction over the previous best result. |
Experiments of Grammar Formalism Conversion | Finally Q-10-method achieved an f-score of 93.8% on WSJ section 22, an absolute 4.4% improvement (42% error reduction) over the best result of Xia et al. |
Experiments of Grammar Formalism Conversion | Finally Q-10-method achieved an f-score of 93.6% on WSJ sections 2~18 and 20~22, better than that of Q-0-method and comparable with that of Q-10-method in Section 3.1.
Experiments of Parsing | Finally we decided that the optimal value of λ was 0.4 and the optimal weight of CTB was 1, which brought the best performance on the development set (an f-score of 86.1%).
Experiments of Parsing | In comparison with the results in Section 4.1, the average index of converted trees in 200-best list increased to 2, and their average unlabeled dependency f-score dropped to 65.4%. |
Experiments of Parsing | 84.2% f-score , better than the result of the reranking parser with CTB and CDTPS as training data (shown in Table 5). |
Introduction | Our conversion method achieves 93.8% f-score on dependency trees produced from WSJ section 22, resulting in 42% error reduction over the previous best result for DS to PS conversion. |
Introduction | When coupled with self-training technique, a reranking parser with CTB and converted CDT as labeled data achieves 85.2% f-score on CTB test set, an absolute 1.0% improvement (6% error reduction) over the previous best result for Chinese parsing. |
Our Two-Step Solution | Therefore we modified the selection metric in Section 2.1 by interpolating two scores, the probability of a conversion candidate from the parser and its unlabeled dependency f-score , shown as follows: |
Conclusion and Future Work | In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
Experiments | We evaluate detection performance by token-level precision, recall and F-score (β = 1).
Experiments | For candidate selection, we once again evaluate using token-level precision, recall and F-score . |
Experiments | Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation. |
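A hedged sketch of this kind of evaluation, assuming one output token per input token for the token-level metrics and using NLTK's sentence-level BLEU as a stand-in scorer for the normalised message (the example tokens are invented):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def token_prf(src, sys, gold):
    """Token-level P/R/F for normalisation, assuming aligned, equal-length token streams."""
    changed = [i for i, (s, o) in enumerate(zip(src, sys)) if s != o]   # tokens the system altered
    needed = [i for i, (s, g) in enumerate(zip(src, gold)) if s != g]   # tokens requiring normalisation
    correct = [i for i in changed if i in needed and sys[i] == gold[i]]
    p = len(correct) / len(changed) if changed else 0.0
    r = len(correct) / len(needed) if needed else 0.0
    return p, r, 2 * p * r / (p + r) if p + r else 0.0

def message_bleu(sys, gold):
    """BLEU over one normalised message; robust to token insertions/deletions."""
    return sentence_bleu([gold], sys, smoothing_function=SmoothingFunction().method1)

src  = "c u 2moro at skool".split()
gold = "see you tomorrow at school".split()
sys_ = "see you 2moro at school".split()
print(token_prf(src, sys_, gold), message_bleu(sys_, gold))
```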
Introduction | Using an adapted supertagger with ambiguity levels tuned to match the baseline system, we were also able to increase F-score on labelled grammatical relations by 0.75%. |
Results | Interestingly, while the decrease in supertag accuracy in the previous experiment did not translate into a decrease in F-score, the increase in tag accuracy here does translate into an increase in F-score . |
Results | The increase in F-score has two sources. |
Results | As Table 6 shows, this change translates into an improvement of up to 0.75% in F-score on Section |
Word segmentation with adaptor grammars | We evaluated the f-score of the recovered word constituents (Goldwater et al., 2006b). |
Word segmentation with adaptor grammars | Table 1: Word segmentation f-score results for all models, as a function of the DP concentration parameter α.
Word segmentation with adaptor grammars | With α = 1 and α = 10 we obtained a word segmentation f-score of 0.55.
Experiments | Three metrics are used for evaluation: precision (P), recall (R) and balanced f-score (F) defined by 2PR/(P+R). |
Experiments | The baseline of the character-based joint solver (CTagctb) is competitive, and achieves an f-score of 92.93. |
Experiments | The tagging model achieves an f-score of 94.03.
Introduction | Our structure-based stacking model achieves an f-score of 94.36, which is superior to a feature-based stacking model introduced in (Jiang et al., 2009). |
Introduction | Our final system achieves an f-score of 94.68, which yields a relative error reduction of 11% over the best published result (94.02). |
Experiments | Figure 8 reports the accuracy, macro F-score, and micro F-score . |
Experiments | It shows that the BR learner produces better accuracy and micro F-score than the FR learner, but a slightly worse macro F-score.
Discussion and Conclusion | With the 26 proposed features derived from decoding process and source sentence syntactic analysis, the proposed QE model achieved better TER prediction, higher correlation with human correction of MT output and higher F-score in finding good translations. |
Experiments | Here we report the precision, recall and F-score of finding such “Good” sentences (with TER ≤ 0.1) on the three documents in Table 3.
Experiments | Again, the adaptive QE model produces higher recall, mostly higher precision, and significantly improved F-score . |
Experiments | The overall F-score of the adaptive QE model is 0.282. |
Evaluation of lexical similarity in context | In case one wants to optimize the F-score (the harmonic mean of precision and recall) when extracting relevant pairs, we can see that the optimal point is at .24 for a threshold of .22 on Lin’s score. |
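Such an F-optimal operating point can be found by sweeping candidate thresholds over the scored pairs; a minimal sketch with invented toy data (the function and variable names are mine):

```python
def best_threshold(scored_pairs):
    """scored_pairs: (similarity_score, is_relevant) tuples.

    Returns the (threshold, F-score) maximising F when extracting pairs with score >= threshold.
    """
    total_relevant = sum(1 for _, rel in scored_pairs if rel)
    best = (None, 0.0)
    for t in sorted({s for s, _ in scored_pairs}):
        extracted = [rel for s, rel in scored_pairs if s >= t]
        tp = sum(extracted)
        p = tp / len(extracted) if extracted else 0.0
        r = tp / total_relevant if total_relevant else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        if f > best[1]:
            best = (t, f)
    return best

print(best_threshold([(0.9, True), (0.5, True), (0.3, False), (0.22, True), (0.1, False)]))
```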
Experiments: predicting relevance in context | Other popular methods (maximum entropy, SVM) have shown a slightly inferior combined F-score, even though precision and recall individually show larger variations.
Experiments: predicting relevance in context | As a baseline, we can also consider a simple threshold on the lexical similarity score, in our case Lin’s measure, which we have shown to yield the best F-score of 24% when set at 0.22. |
Experiments: predicting relevance in context | If we take the best simple classifier (random forests), the precision and recall are 68.1% and 24.2% for an F-score of 35.7%, and this is significantly beaten by the Naive Bayes method as precision and recall are more even ( F-score of 41.5%). |
Related work | Recall F-score 40.4 54.3 46.3 37.4 52.8 43.8 36.1 49.5 41.8 36.5 54.8 43.8 |
Experiments | features in the updated version. However, our initial experiments show that, even with this much simpler feature set, our 50-best reranker performed equally well as theirs (both with an F-score of 91.4; see Tables 3 and 4).
Experiments | With only local features, our forest reranker achieves an F-score of 91.25, and with the addition of non- |
Forest Reranking | where function F returns the F-score.
Forest Reranking | In case multiple candidates get the same highest F-score, we choose the parse with the highest log probability from the baseline parser to be the oracle parse (Collins, 2000).
Introduction | we achieved an F-score of 91.7, which is a 19% error reduction from the 1-best baseline, and outperforms both 50-best and 100-best reranking.
Supporting Forest Algorithms | 4.1 Forest Oracle. Recall that the Parseval F-score is the harmonic mean of labelled precision P and labelled recall R: F = 2PR / (P + R) = 2|y ∩ y*| / (|y| + |y*|).
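A literal sketch of the formula above, with brackets represented as (label, start, end) tuples and |y ∩ y*| taken as a multiset intersection:

```python
from collections import Counter

def parseval_f(test_brackets, gold_brackets):
    """Parseval F-score: F = 2PR/(P+R) = 2|y ∩ y*| / (|y| + |y*|)."""
    if not test_brackets or not gold_brackets:
        return 0.0
    matched = sum((Counter(test_brackets) & Counter(gold_brackets)).values())  # |y ∩ y*|
    return 2 * matched / (len(test_brackets) + len(gold_brackets))

gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)]
test = [("S", 0, 5), ("NP", 0, 1), ("VP", 2, 5)]
print(parseval_f(test, gold))  # 2 matched brackets out of 3 + 3 -> 0.667
```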
Supporting Forest Algorithms | In other words, the optimal F-score tree in a forest is not guaranteed to be composed of two optimal F-score subtrees. |
Supporting Forest Algorithms | Shown in Pseudocode 4, we perform these computations in a bottom-up topological order, and finally at the root node TOP, we can compute the best global F-score by maximizing over different numbers of test brackets (line 7). |
Introduction | Experiments on the data from the Chinese Treebank (CTB-7) and Microsoft Research (MSR) show that the proposed model results in significant improvement over other comparative candidates in terms of F-score and out-of-vocabulary (OOV) recall.
Method | The performance measurement indicators for word segmentation and POS tagging (joint S&T) are the balanced F-score, F = 2PR/(P+R), the harmonic mean of precision (P) and recall (R), and out-of-vocabulary recall (OOV-R).
Method | It obtains increases of 0.92% and 2.32% in terms of F-score and OOV-R respectively.
Method | On the whole, for segmentation, they achieve average improvements of 1.02% and 6.8% in F-score and OOV-R; whereas for POS tagging, the average increments of F-score and OOV-R are 0.87% and 6.45%.
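A sketch of how these two indicators can be computed for word segmentation, matching predicted and gold words by their character spans and measuring OOV recall over gold words absent from the training lexicon (helper names and the toy example are assumptions, not the paper's code):

```python
def to_spans(words):
    """Character spans (start, end) of each word in a segmentation, in order."""
    spans, start = [], 0
    for w in words:
        spans.append((start, start + len(w)))
        start += len(w)
    return spans

def seg_f_and_oov_recall(sys_words, gold_words, train_lexicon):
    """Balanced F-score F = 2PR/(P+R) over word spans, plus OOV recall (OOV-R)."""
    sys_spans = set(to_spans(sys_words))
    gold_spans = to_spans(gold_words)
    correct = sys_spans & set(gold_spans)
    p = len(correct) / len(sys_spans)
    r = len(correct) / len(gold_spans)
    f = 2 * p * r / (p + r) if p + r else 0.0
    oov = [span for span, w in zip(gold_spans, gold_words) if w not in train_lexicon]
    oov_r = sum(1 for span in oov if span in sys_spans) / len(oov) if oov else 0.0
    return f, oov_r

print(seg_f_and_oov_recall(["我", "喜欢", "自然", "语言"],
                           ["我", "喜欢", "自然语言"],
                           {"我", "喜欢"}))  # (0.571..., 0.0)
```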
Related Work | Prior supervised joint S&T models present approximate 0.2% - 1.3% improvement in F-score over supervised pipeline ones. |
Abstract | Using a rich set of shallow lexical, syntactic and structural features from the input text, our parser achieves, in linear time, 73.9% of professional annotators’ human agreement F-score . |
Building a Discourse Parser | Current state-of-the-art results in automatic segmenting are much closer to human levels than full structure labeling ( F-score ratios of automatic performance over gold standard reported in LeThanh et al. |
Evaluation | Standard performance indicators for such a task are precision, recall and F-score as measured by the PARSEVAL metrics (Black et al., 1991), with the specific adaptations to the case of RST trees made by Marcu (2000, page 143-144). |
Evaluation |             S      N      R      F   |  S      N      R      F
Evaluation | Precision   83.0   68.4   55.3   54.8 |  69.5   56.1   44.9   44.4
Evaluation | Recall      83.0   68.4   55.3   54.8 |  69.2   55.8   44.7   44.2
Evaluation | F-Score     83.0   68.4   55.3   54.8 |  69.3   56.0   44.8   44.3
Evaluation | (three column groups of S / N / R / F; group headers in the source include Manual and SPADE)
Evaluation | Precision   84.1  70.6  55.6  55.1 |  70.6  58.1  46.0  45.6 |  88.0  77.5  66.0  65.2
Evaluation | Recall      84.1  70.6  55.6  55.1 |  71.2  58.6  46.4  46.0 |  88.1  77.6  66.1  65.3
Evaluation | F-Score     84.1  70.6  55.6  55.1 |  70.9  58.3  46.2  45.8 |  88.1  77.5  66.0  65.3
Abstract | Additionally, we remove low confidence alignment links from the word alignment of a bilingual training corpus, which increases the alignment F-score , improves Chinese-English and Arabic-English translation quality and significantly reduces the phrase translation table size. |
Alignment Link Confidence Measure | Table 2 shows the precision, recall and F-score of the individual alignments and the combined alignment.
Alignment Link Confidence Measure | Overall it improves the F-score by 1.5 points (from 69.3 to 70.8), with a 1.8-point improvement for content words and a 1.0-point improvement for function words.
Improved MaxEnt Aligner with Confidence-based Link Filtering | Baseline: Precision 72.66, Recall 66.17, F-score 69.26; +ALF: Precision 78.14, Recall 64.36, F-score 70.59.
Improved MaxEnt Aligner with Confidence-based Link Filtering | Baseline: Precision 84.43, Recall 83.64, F-score 84.04; +ALF: Precision 88.29, Recall 83.14, F-score 85.64.
Sentence Alignment Confidence Measure | The results in Figure 2 show a strong correlation between the confidence measure and the alignment F-score, with a correlation coefficient of -0.69.
Abstract | This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case “function words”, setting a new state-of-the-art of 92.4% token f-score . |
Introduction | While absolute accuracy is not directly relevant to the main point of the paper, we note that the models that learn generalisations about function words perform unsupervised word segmentation at 92.5% token f-score on the standard Bernstein-Ratner (1987) corpus, which improves the previous state-of-the-art by more than 4%. |
Introduction | that achieves the best token f-score expects function words to appear at the left edge of phrases. |
Word segmentation results | Baseline: f-score 0.872, precision 0.918, recall 0.956; + left FWs: f-score 0.924, precision 0.935, recall 0.990; + left + right FWs: f-score 0.912, precision 0.957, recall 0.953.
Word segmentation results | Figure 2 presents the standard token and lexicon (i.e., type) f-score evaluations for word segmentations proposed by these models (Brent, 1999), and Table 1 summarises the token and lexicon f-scores for the major models discussed in this paper. |
Word segmentation results | It is interesting to note that adding “function words” improves token f-score by more than 4%, corresponding to a 40% reduction in overall error rate. |
Word segmentation with Adaptor Grammars | The starting point and baseline for our extension is the adaptor grammar with syllable structure phonotactic constraints and three levels of collocational structure (5-21), as prior work has found that this yields the highest word segmentation token f-score (Johnson and Goldwater, 2009).
Discourse vs. non-discourse usage | Using the string of the connective as the only feature sets a reasonably high baseline, with an f-score of 75.33% and an accuracy of 85.86%. |
Discourse vs. non-discourse usage | Interestingly, using only the syntactic features, ignoring the identity of the connective, is even better, resulting in an f-score of 88.19% and accuracy of 92.25%. |
Discourse vs. non-discourse usage | Using both the connective and syntactic features is better than either individually, with an f-score of 92.28% and accuracy of 95.04%. |
Related Work | MEANT (Lo et al., 2012), the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers, outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgments.
Related Work | In this paper, we employ a newer version of MEANT that uses f-score to aggregate individual token similarities into the composite phrasal similarities of semantic role fillers, as our experiments indicate this is more accurate than the previously used aggregation functions. |
Related Work | Compute the weighted f-score over the matching role labels of these aligned predicates and role fillers according to the definitions similar to those in section 2.2 except for replacing REF with IN in qij and wil . |
Results | Table 1 shows that for human adequacy judgments at the sentence level, the f-score based XMEANT (l) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT. |
XMEANT: a cross-lingual MEANT | 3.1 Applying MEANT’s f-score within semantic role fillers |
XMEANT: a cross-lingual MEANT | The first natural approach is to extend MEANT’s f-score based method of aggregating semantic parse accuracy, so as to also apply to aggregat- |
Experiment | Both the f-score and OOV recall increase.
Experiment | Comparing No-balance and ADD-N alone, we find that ignoring the tag balance issue yields a relatively high f-score but slightly hurts the OOV recall.
INTRODUCTION | For example, the most widely used Chinese segmenter, ICTCLAS, yields an f-score of 0.95 on news corpora but only 0.82 on micro-blog data.
Abstract | The results show a significant improvement in Chinese relation extraction, outperforming other methods in F-score by 10% in 6 relation types and 15% in 18 relation subtypes. |
Feature Construction | F-score is computed by F = 2PR / (P + R).
Feature Construction | In Row 2, with only the F_ow feature, the F-score already reaches 77.74% in 6 types and 60.31% in 18 subtypes.
Feature Construction | In Table 3, it is shown that our system outperforms other systems, in F-score , by 10% on 6 relation types and by 15% on 18 subtypes. |
Introduction | The performance of relation extraction is still unsatisfactory, with an F-score of 67.5% for English (23 subtypes) (Zhou et al., 2010).
Introduction | Chinese relation extraction also performs weakly, with an F-score of about 66.6% over 18 subtypes (Dandan et al., 2012).
Abstract | The results using Reuters documents showed that the method was comparable to the current state-of-the-art biased-SVM method as the F-score obtained by our method was 0.627 and biased-SVM was 0.614. |
Conclusion | The results using the 1996 Reuters corpora showed that the method was comparable to the current state-of-the-art biased-SVM method as the F-score obtained by our method was 0.627 and biased-SVM was 0.614. |
Experiments | We empirically selected values of two parameters, “c” (the trade-off between training error and margin) and “j” (the cost-factor by which training errors on positive examples outweigh errors on negative examples), that optimized the F-score obtained by classification of test documents.
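A rough equivalent of this parameter search, sketched here with scikit-learn rather than the SVMlight implementation the passage refers to; class_weight={1: j} stands in for SVMlight's positive cost-factor -j, and the grids, names and data interfaces are assumptions:

```python
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def tune_c_and_j(X_train, y_train, X_dev, y_dev,
                 c_grid=(0.1, 1, 10, 100), j_grid=(1, 2, 5, 10)):
    """Grid search over the error/margin trade-off c and the positive cost-factor j,
    keeping the pair that maximises F-score on held-out documents (labels assumed 0/1)."""
    best = (None, None, -1.0)
    for c in c_grid:
        for j in j_grid:
            clf = SVC(kernel="linear", C=c, class_weight={1: j}).fit(X_train, y_train)
            f = f1_score(y_dev, clf.predict(X_dev))
            if f > best[2]:
                best = (c, j, f)
    return best  # (best c, best j, dev F-score)
```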
Experiments | Figure 3 shows micro-averaged F-score against the 6 value. |
Abstract | Experiments show that our approach achieves a statistically significant increase of 13.5% in F-score and 37% in area under the precision recall curve. |
Available at http://nlp.stanford.edu/software/mimlre.shtml | Figure 2 shows that our model consistently outperforms all six algorithms at almost all recall levels and improves the maximum F-score by more than 13.5% relative to MIML (from 28.35% to 32.19%), as well as increasing the area under the precision-recall curve by more than 37% (from 11.74 to 16.1).
Available at http://nlp.stanford.edu/software/mimlre.shtml | The performance of Guided DS also compares favorably with the best scoring hand-coded systems for a similar task, such as the Sun et al. (2011) system for KBP 2011, which reports an F-score of 25.7%.
Introduction | posed approach, we extend MIML (Surdeanu et al., 2012), a state-of-the-art distant supervision model and show a significant improvement of 13.5% in F-score on the relation extraction benchmark TAC-KBP (Ji and Grishman, 2011) dataset. |
Introduction | While prior work employed tens of thousands of human labeled examples (Zhang et al., 2012) and only got a 6.5% increase in F-score over a logistic regression baseline, our approach uses much less labeled data (about 1/8) but achieves much higher improvement on performance over stronger baselines. |
Training | Training MIML on a simple fusion of distantly-labeled and human-labeled datasets does not improve the maximum F-score since this hand-labeled data is swamped by a much larger amount of distant-supervised data of much lower quality. |
Discussion | Table 8 summarises the results, showing that the error reduction rate (ERR) over the parsing F-score is up to 6.9%, which is remarkable given the relatively superficial strategy for incorporating sense information into the parser.
Experimental setting | We evaluate the parsers via labelled bracketing recall (R), precision (P) and F-score (F1).
Results | The SFU representation produces the best results for Bikel (F-score 0.010 above baseline), while for Charniak the best performance is obtained with word+SF ( F-score 0.007 above baseline). |
Results | Overall, Bikel obtains a superior F-score in all configurations. |
Results | Again, the F-score for the semantic representations is better than the baseline in all cases. |
Oracle Parsing | To answer this question we computed oracle best and worst values for labelled dependency F-score using the algorithm of Huang (2008) on the hybrid model of Clark and Curran (2007), the best model of their C&C parser. |
Oracle Parsing | Digging deeper, we compared parser model score against Viterbi F-score and oracle F-score at a va-
Conclusion and outlook | We find that our Bigram model reaches 77% /t/-recovery F-score when run with knowledge of true word boundaries and when it can make use of both the preceding and the following phonological context, and that unlike the Unigram model it is able to learn the probability of /t/-deletion in different contexts.
Conclusion and outlook | When performing joint word segmentation on the Buckeye corpus, our Bigram model reaches around 55% F-score for recovering deleted /t/s, with a word segmentation F-score of around 72%, which is 2% better than running a Bigram model that does not model /t/-deletion.
Experiments 4.1 The data | We evaluate the model in terms of F-score , the harmonic mean of recall (the fraction of underlying /t/s the model correctly recovered) and precision (the fraction of underlying /t/s the model predicted that were correct). |
Experiments 4.1 The data | Looking at the segmentation performance this isn’t too surprising: the Unigram model’s poorer token F-score , the standard measure of segmentation performance on a word token level, suggests that it misses many more boundaries than the Bigram model to begin with and, consequently, can’t recover any potential underlying /t/s at these boundaries. |
Experiments 4.1 The data | The generally worse performance of handling variation as measured by /t/-recovery F-score when performing joint segmentation is consistent with the finding of Elsner et al. |
Experiments and Results | Table 2 depicts the exact numbers of manually labeled tokens to reach the maximal (supervised) F-score on both corpora. |
Experiments and Results | On the MUC7 corpus, FuSAL requires 7,374 annotated NPs to yield an F-score of 87%, while SeSAL hits the same F-score with only 4,017 NPs.
Experiments and Results | On PennBioIE, SeSAL also saves about 45% compared to FuSAL to achieve an F-score of 81%.
Abstract | Our results show that our approach achieves 80.0% F-Score accuracy, compared to an F-Score of 66.7% produced by a state-of-the-art semantic parser, on a dataset of input format specifications from the ACM International Collegiate Programming Contest (which were written in English for humans with no intention of providing support for automated processing).
Experimental Results | The two versions achieve very close performance (80% vs 84% in F-Score ), even though Full Model is trained with noisy feedback. |
Experimental Setup | Model | Recall | Precision | F-Score
Introduction | However, when trained using the noisy supervision, our method achieves substantially more accurate translations than a state-of-the-art semantic parser (Clarke et al., 2010) (specifically, 80.0% in F-Score compared to an F-Score of 66.7%).
Introduction | The strength of our model in the face of such weak supervision is also highlighted by the fact that it retains an F-Score of 77% even when only one input example is provided for each input |
Abstract | Standard CCGBank tests show the model achieves up to 1.05 labeled F-score improvements over three existing, competitive CCG parsing models. |
Experiments | On both the full and reduced sets, our parser achieves the highest F-score . |
Experiments | In comparison with C&C, our parser shows significant increases across all metrics, with 0.57% and 1.06% absolute F-score improvements over the hybrid and normal-form models, respectively. |
Experiments | While our parser achieved lower precision than Z&C, it is more balanced and gives higher recall for all of the dependency relations except the last one, and higher F-score for over half of them. |
Introduction | Results on the standard CCGBank tests show that our parser achieves absolute labeled F-score gains of up to 0.5 over the shift-reduce parser of Zhang and Clark (2011); and up to 1.05 and 0.64 over the normal-form and hybrid models of Clark and Curran (2007), respectively. |
Experimental Setup | [Figure: recall, precision and F-score under ROUGE-1 and ROUGE-L.]
Results | F-score is higher for the phrase-based system but not significantly. |
Results | The sentence ILP model outperforms the lead baseline with respect to recall but not precision or F-score . |
Results | The phrase ILP achieves a significantly better F-score over the lead baseline with both ROUGE-1 and ROUGE-L.
Experiments | We only compare the F-score, since all the compared systems have an attempted rate of 1.0,
Experiments | F-score (F1).
Abstract | Our NR classification evaluation strictly follows the ACL SemEval-07 Task 4 datasets and protocol, obtaining an f-score of 70.6, as opposed to 64.8 of the best previous work that did not use the manually provided WordNet sense disambiguation tags. |
Results | In fact, our results ( f-score 62.0, accuracy 64.5) are better than the averaged results (58.0, 61.1) of the group that did not utilize WN tags. |
Results | Table 2 shows the HITS-based classification results ( F-score and Accuracy) and the number of positively labeled clusters (C) for each relation. |
Results | We have used the exact evaluation procedure described in (Turney, 2006), achieving a class f-score average of 60.1, as opposed to 54.6 in (Turney, 2005) and 51.2 in (Nastase et al., 2006). |
Evaluation | We calculated precision, recall, and f-score for our system, the baselines, and the upper bound as follows, with all_system being the number of pairs labelled as paraphrase or happens-before, all_gold the respective number of pairs in the gold standard, and correct the number of pairs labeled correctly by the system.
Evaluation | The f-score for the upper bound is in the column upper. |
Evaluation | For the f-score values, we calculated the significance for the difference between our system and the baselines as well as the upper bound, using a resampling test (Edgington, 1986). |
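A sketch of a paired approximate-randomization (resampling) test on an F-score difference, in the spirit of the test cited above; representing each test item by its (tp, fp, fn) counts, and all names, are my assumptions:

```python
import random

def pooled_f(items):
    """items: list of per-item (tp, fp, fn) counts; returns the pooled F-score."""
    tp, fp, fn = (sum(c[i] for c in items) for i in range(3))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def randomization_test(sys_a, sys_b, trials=10000, seed=0):
    """Estimate a p-value for the observed F-score difference between two systems
    evaluated on the same items, by randomly swapping their per-item outputs."""
    rng = random.Random(seed)
    observed = abs(pooled_f(sys_a) - pooled_f(sys_b))
    hits = 0
    for _ in range(trials):
        a, b = [], []
        for ca, cb in zip(sys_a, sys_b):
            if rng.random() < 0.5:     # swap the two systems' outputs on this item
                ca, cb = cb, ca
            a.append(ca)
            b.append(cb)
        if abs(pooled_f(a) - pooled_f(b)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```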
Experiment | As we can see, by using tag embeddings, the F-score is improved by +0.6% and OOV recall is improved by +1.0%, which shows that tag embeddings succeed in modeling the tag-tag and tag-character interactions.
Experiment | The F-score is improved by +0.6% while OOV recall is improved by +3.2%, which denotes that tensor-based transformation captures more interactional information than simple nonlinear transformation. |
Experiment | As shown in Table 5 (last three rows), both the F-score and OOV recall of our model improve with pre-training.
Abstract | We show that this method generates output closer to the feedback that lecturers actually generated, achieving 3.5% higher accuracy and 15% higher F-score than multiple simple classifiers that keep a history of selected templates. |
Evaluation | The accuracy, the weighted precision, the weighted recall, and the weighted F-score of the classifiers are shown in Table 3. |
Evaluation | It was found that in 10-fold cross validation RAkEL performs significantly better in all these automatic measures (accuracy = 76.95%, F-score = 85.50%). |
Evaluation | Remarkably, ML achieves more than 10% higher F-score than the other methods (Table 3). |
Experiments | of our system that approximates the submodular objective function proposed by Lin and Bilmes (2011). As shown in the results, our best system, which uses the hs dispersion function, achieves a better ROUGE-1 F-score than all other systems.
Experiments | (4) To understand the effect of utilizing syntactic structure and semantic similarity for constructing the summarization graph, we ran the experiments using just the unigrams and bigrams; we obtained a ROUGE-1 F-score of 37.1. |
Experiments | Note that Lin & Bilmes (2011) report a slightly higher ROUGE-1 score (F-score 38.90) on DUC 2004.
A UCCA-Annotated Corpus | We derive an F-score from these counts. |
A UCCA-Annotated Corpus | The table presents the average F-score between the annotators, as well as the average F-score when comparing to the gold standard. |
A UCCA-Annotated Corpus | An average taken over a sample of passages annotated by all four annotators yielded an F-score of 93.7%. |
Experimental Setup | We follow the suggestion of Scharenborg et al. (2010) and use a 20-ms tolerance window to compute the recall, precision and F-score of the segmentation our model proposes for TIMIT's training set.
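A minimal sketch of such tolerance-window boundary scoring (greedy one-to-one matching within ±20 ms; boundary times in seconds, and the toy values, are mine):

```python
def boundary_prf(hyp_boundaries, ref_boundaries, tolerance=0.020):
    """Precision, recall and F-score for phone-boundary detection.

    A hypothesised boundary is correct if it lies within `tolerance` seconds of a
    still-unmatched reference boundary (each reference boundary matches at most once)."""
    unmatched = sorted(ref_boundaries)
    hits = 0
    for b in sorted(hyp_boundaries):
        match = next((r for r in unmatched if abs(r - b) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    p = hits / len(hyp_boundaries) if hyp_boundaries else 0.0
    r = hits / len(ref_boundaries) if ref_boundaries else 0.0
    return p, r, 2 * p * r / (p + r) if p + r else 0.0

print(boundary_prf([0.10, 0.31, 0.58], [0.11, 0.30, 0.45, 0.57]))  # (1.0, 0.75, 0.857...)
```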
Introduction | the-art unsupervised method and improves the relative F-score by 18.8 points (Dusan and Rabiner, 2006). |
Results | Recall / Precision / F-score (%): Dusan (2006): 75.2 / 66.8 / 70.8; Qiao et al.
Results | When compared to the baseline in which the number of phone boundaries in each utterance was also unknown (Dusan and Rabiner, 2006), our model outperforms in both recall and precision, improving the relative F-score by 18.8%. |
Experiment | We evaluated the performance (F-score) of our model on the three development sets using different α values, where α is progressively increased in steps of 0.1 (0 < α < 1.0).
Experiment | Table 2 shows the F-score results of word segmentation on CTB-5, CTB-6 and CTB-7 testing sets. |
Experiment | Table 2: F-score (%) results of five CWS models on CTB-5, CTB-6 and CTB-7. |
Experiments | The proposed method achieved about 44% recall and nearly 80% precision, outperforming all other systems in terms of precision, F-score and average precision.
Experiments | Table 4: Recall (R), precision (P), F-score (F) and average precision (aP) of the problem report recognizers. |
Experiments | Table 6: Recall (R), precision (P), F-score (F) and average precision (aP) of the problem-aid match recognizers. |
Evaluation | For four out of five conditions its F-score performance outperforms the baselines by 42-83%. |
Evaluation | These are the Most Frequent SCF (O’Donovan et al., 2005) which uniformly assigns to all verbs the two most frequent SCFs in general language, transitive (SUBJ-DOBJ) and intransitive (SUBJ) (and results in poor F-score ), and a filtering that removes frames with low corpus frequencies (which results in low recall even when trying to provide the maximum recall for a given precision level). |
Evaluation | The task we address is therefore to improve the precision of the corpus statistics baseline in a way that does not substantially harm the F-score . |
Evaluation | The third row is similar, but for sentences for which the oracle F-score is greater than 92%.
The CCG to PTB Conversion | shows that converting gold-standard CCG derivations into the GRs in DepBank resulted in an F-score of only 85%; hence the upper bound on the performance of the CCG parser, using this evaluation scheme, was only 85%. |
The CCG to PTB Conversion | The numbers are bracketing precision, recall, F-score and complete sentence matches, using the EVALB evaluation script. |
Experiment 3: Sense Similarity | Table 7: F-score sense merging evaluation on three hand-labeled datasets: OntoNotes (Onto), Senseval-2 (SE-2), and combined (Onto+SE-2). |
Experiment 3: Sense Similarity | For a binary classification task, we can directly calculate precision, recall and F-score by constructing a contingency table. |
Experiment 3: Sense Similarity | In addition, we show in Table 7 the F-score results provided by Snow et al. |
Experiments | To evaluate the parsing performance, we use the standard unlabeled (i.e., hierarchical spans) and labeled (i.e., nuclearity and relation) precision, recall and F-score as described in (Marcu, 2000b). |
Experiments | Table 2 presents F-score parsing results for our parsers and the existing systems on the two corpora. On both corpora, our parsers, namely 1S-1S (TSP 1-1) and sliding window (TSP SW), outperform existing systems by a wide margin (p < 7.1e-05). On RST-DT, our parsers achieve absolute F-score improvements of 8%, 9.4% and 11.4% in span, nuclearity and relation, respectively, over HILDA.
Experiments | On the Instructional genre, our parsers deliver absolute F-score improvements of 10.5%, 13.6% and 8.14% in span, nuclearity and relations, respectively, over the ILP-based approach. |
Experimental Setup | [Figure: precondition prediction F-score; curves for the Model, SVM and All-text systems.]
Introduction | Specifically, it yields an F-score of 66% compared to the 65% of the baseline. |
A Latent Variable CCG Parser | To determine statistical significance, we obtain p-values from Bikel's randomized parsing evaluation comparator, modified for use with tagging accuracy, F-score and dependency accuracy.
A Latent Variable CCG Parser | In this section we evaluate the parsers using the traditional PARSEVAL measures which measure recall, precision and F-score on constituents in |
A Latent Variable CCG Parser | The Petrov parser has better results by a statistically significant margin for both labeled and unlabeled recall and unlabeled F-score . |
Intrinsic evaluation | We report the alignment quality in terms of precision, recall and F-score . |
Intrinsic evaluation | The F-score corresponding to perfect precision and the upper-bound recall is 94.75%. |
Intrinsic evaluation | Overall, the MM models obtain lower precision but higher recall and F-score than 1-1 models, which is to be expected as the gold standard is defined in terms of MM links. |
Experimental Setup | For these reasons, we evaluate on both sentence-level and token-level precision, recall, and F-score . |
Results | However, the best F-Score corresponding to the optimal number of clusters is 42.2, still far below our model’s 66.0 F-score . |
Results | Our results show a large gap in F-score between the sentence and token-level evaluations for both the USP baseline and our model. |