Index of papers in Proc. ACL 2013 that mention
  • F-score
Persing, Isaac and Ng, Vincent
Error Classification
Using held-out validation data, we jointly tune the three parameters described in the previous paragraph to optimize the F-score achieved by b_i for error e_i. However, an exact solution to this optimization problem is computationally expensive.
Error Classification
Consequently, we find a local maximum by employing the simulated annealing algorithm (Kirkpatrick et al., 1983), altering one parameter at a time to optimize F-score while holding the remaining parameters fixed.
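As an illustration (not the authors' code), a minimal sketch of such a coordinate-wise simulated-annealing tuning loop: one parameter is altered at a time and moves are accepted according to an annealing criterion on held-out F-score. The parameter grid, F-score function, and cooling schedule below are assumed placeholders.

    import math, random

    def tune(params, grid, f_score, steps=200, temp=1.0, cooling=0.95):
        """Alter one parameter at a time; accept worse validation F-scores
        with a temperature-dependent probability (simulated annealing)."""
        current = dict(params)
        cur_f = f_score(current)
        best, best_f = dict(current), cur_f
        for _ in range(steps):
            name = random.choice(list(current))        # pick one parameter to alter
            proposal = dict(current)
            proposal[name] = random.choice(grid[name])
            new_f = f_score(proposal)                  # F-score on held-out validation data
            if new_f >= cur_f or random.random() < math.exp((new_f - cur_f) / temp):
                current, cur_f = proposal, new_f       # accept the move
                if cur_f > best_f:
                    best, best_f = dict(current), cur_f
            temp *= cooling                            # cool down
        return best, best_f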
Error Classification
Other ways we could measure our system’s performance (such as macro F-score ) would consider our system’s performance on the less frequent errors no less important than its performance on the
Evaluation
To evaluate our thesis clarity error type identification system, we compute precision, recall, micro F-score, and macro F-score , which are calculated as follows.
Evaluation
Then, the precision (P_i), recall (R_i), and F-score (F_i) for b_i and the macro F-score (F) of the combined system for one test fold are calculated as F_i = 2 P_i R_i / (P_i + R_i), with F obtained by averaging the F_i over all error types.
Evaluation
However, the macro F-score calculation can be seen as giving too much weight to the less frequent errors.
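A short illustration (not from the paper) of the micro vs. macro distinction in the excerpts above: micro F-score pools true positives, false positives, and false negatives across all error types before computing F, while macro F-score averages the per-type F-scores, so infrequent error types weigh as much as frequent ones.

    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def micro_macro(counts):
        """counts: one (tp, fp, fn) triple per error type."""
        macro = sum(f1(*c) for c in counts) / len(counts)
        micro = f1(*[sum(c[i] for c in counts) for i in range(3)])
        return micro, macro

    # A frequent type handled well and a rare type handled poorly drag macro F
    # down far more than micro F: here micro is roughly 0.87 while macro is 0.55.
    print(micro_macro([(90, 10, 10), (1, 4, 4)]))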
F-score is mentioned in 27 sentences in this paper.
Topics mentioned in this paper:
Zhang, Yuan and Barzilay, Regina and Globerson, Amir
Evaluation Setup
First, following previous work, we evaluate our method using the labeled and unlabeled predicate-argument dependency F-score .
Evaluation Setup
The dependency F-score captures both the target-
Experiment and Analysis
For instance, there is a gain of 6.2% in labeled dependency F-score for HPSG formalism when 15,000 CFG trees are used.
Experiment and Analysis
Across all three grammars, we can observe that adding CFG data has a more pronounced effect on the PARSEVAL measure than the dependency F-score .
Experiment and Analysis
On the other hand, predicate-argument dependency F-score (Figure 5ac) also relies on the target grammar information.
Implementation
This results in a drop on the dependency F-score by about 5%.
Introduction
For instance, the model trained on 500 HPSG sentences achieves labeled dependency F-score of 72.3%.
Introduction
Adding 15,000 Penn Treebank sentences during training leads to 78.5% labeled dependency F-score , an absolute improvement of 6.2%.
F-score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Zeng, Xiaodong and Wong, Derek F. and Chao, Lidia S. and Trancoso, Isabel
Introduction
Experiments on data from the Chinese Treebank (CTB-7) and Microsoft Research (MSR) show that the proposed model yields significant improvements over the other comparative candidates in terms of F-score and out-of-vocabulary (OOV) recall.
Method
The performance measurement indicators for word segmentation and POS tagging (joint S&T) are the balanced F-score, F = 2PR/(P+R), i.e., the harmonic mean of precision (P) and recall (R), and out-of-vocabulary recall (OOV-R).
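A rough sketch (illustration under assumed inputs, not the authors' evaluation script) of how segmentation F-score and OOV recall of this kind are commonly computed: words are mapped to character-offset spans, F = 2PR/(P+R) is taken over exactly matching spans, and OOV-R is recall restricted to gold words absent from the training lexicon.

    def word_spans(words):
        """Map a segmentation (list of words) to (start, end, word) triples."""
        out, i = [], 0
        for w in words:
            out.append((i, i + len(w), w))
            i += len(w)
        return out

    def seg_eval(gold_words, pred_words, train_lexicon):
        gold = word_spans(gold_words)
        pred = {(s, e) for s, e, _ in word_spans(pred_words)}
        hits = [(s, e) for s, e, _ in gold if (s, e) in pred]
        p = len(hits) / len(pred) if pred else 0.0
        r = len(hits) / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        oov = [(s, e) for s, e, w in gold if w not in train_lexicon]
        oov_r = sum(span in pred for span in oov) / len(oov) if oov else 0.0
        return p, r, f, oov_r

    # Toy example: gold "我 喜欢 新词", predicted "我 喜 欢 新词", "新词" unseen in training.
    print(seg_eval(["我", "喜欢", "新词"], ["我", "喜", "欢", "新词"], {"我", "喜欢"}))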
Method
It obtains 0.92% and 2.32% increases in terms of F-score and OOV-R, respectively.
Method
On the whole, for segmentation, they achieve average improvements of 1.02% and 6.8% in F-score and OOV-R; whereas for POS tagging, the average increments in F-score and OOV-R are 0.87% and 6.45%.
Related Work
Prior supervised joint S&T models report approximately 0.2%-1.3% improvements in F-score over supervised pipeline ones.
F-score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Fukumoto, Fumiyo and Suzuki, Yoshimi and Matsuyoshi, Suguru
Abstract
The results using Reuters documents showed that the method was comparable to the current state-of-the-art biased-SVM method as the F-score obtained by our method was 0.627 and biased-SVM was 0.614.
Conclusion
The results using the 1996 Reuters corpora showed that the method was comparable to the current state-of-the-art biased-SVM method as the F-score obtained by our method was 0.627 and biased-SVM was 0.614.
Experiments
We empirically selected values of two parameters, “c” (the tradeoff between training error and margin) and “j” (the cost-factor by which training errors on positive examples outweigh errors on negative examples), that optimized the F-score obtained by classification of test documents.
Experiments
Figure 3 shows micro-averaged F-score against the 6 value.
Experiments
F-score
F-score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Zhang, Longkai and Li, Li and He, Zhengyan and Wang, Houfeng and Sun, Ni
Experiment
F-score
Experiment
Both the f-score and OOV-recall increase.
Experiment
By comparing No-balance and ADD-N alone, we find that we achieve a relatively high f-score if we ignore the tag-balance issue, while slightly hurting the OOV-recall.
INTRODUCTION
For example, the most widely used Chinese segmenter, “ICTCLAS”, yields a 0.95 f-score on news corpora but only a 0.82 f-score on micro-blog data.
F-score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Börschinger, Benjamin and Johnson, Mark and Demuth, Katherine
Conclusion and outlook
We find that our Bigram model reaches 77% /t/-recovery F-score when run with knowledge of true word-boundaries and when it can make use of both the preceding and the following phonological context, and that unlike the Unigram model it is able to learn the probability of /t/-deletion in different contexts.
Conclusion and outlook
When performing joint word segmentation on the Buckeye corpus, our Bigram model reaches above 55% F-score for recovering deleted /t/s, with a word segmentation F-score of around 72%, which is 2% better than a Bigram model that does not model /t/-deletion.
Experiments 4.1 The data
We evaluate the model in terms of F-score , the harmonic mean of recall (the fraction of underlying /t/s the model correctly recovered) and precision (the fraction of underlying /t/s the model predicted that were correct).
Experiments 4.1 The data
Looking at the segmentation performance this isn’t too surprising: the Unigram model’s poorer token F-score , the standard measure of segmentation performance on a word token level, suggests that it misses many more boundaries than the Bigram model to begin with and, consequently, can’t recover any potential underlying /t/s at these boundaries.
Experiments 4.1 The data
The generally worse performance of handling variation as measured by /t/-recovery F-score when performing joint segmentation is consistent with the finding of Elsner et al.
F-score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Lei, Tao and Long, Fan and Barzilay, Regina and Rinard, Martin
Abstract
Our results show that our approach achieves 80.0% F-Score accuracy compared to an F-Score of 66.7% produced by a state-of-the-art semantic parser on a dataset of input format specifications from the ACM International Collegiate Programming Contest (which were written in English for humans with no intention of providing support for automated processing).
Experimental Results
The two versions achieve very close performance (80% vs 84% in F-Score ), even though Full Model is trained with noisy feedback.
Experimental Setup
Model | Recall | Precision | F-Score
Introduction
However, when trained using the noisy supervision, our method achieves substantially more accurate translations than a state-of-the-art semantic parser (Clarke et al., 2010) (specifically, 80.0% in F-Score compared to an F-Score of 66.7%).
Introduction
The strength of our model in the face of such weak supervision is also highlighted by the fact that it retains an F-Score of 77% even when only one input example is provided for each input
F-score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Abend, Omri and Rappoport, Ari
A UCCA-Annotated Corpus
We derive an F-score from these counts.
A UCCA-Annotated Corpus
The table presents the average F-score between the annotators, as well as the average F-score when comparing to the gold standard.
A UCCA-Annotated Corpus
An average taken over a sample of passages annotated by all four annotators yielded an F-score of 93.7%.
F-score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Dasgupta, Anirban and Kumar, Ravi and Ravi, Sujith
Experiments
of our system that approximates the submodular objective function proposed by Lin and Bilmes (2011). As shown in the results, our best system, which uses the hs dispersion function, achieves a better ROUGE-1 F-score than all other systems.
Experiments
(4) To understand the effect of utilizing syntactic structure and semantic similarity for constructing the summarization graph, we ran the experiments using just the unigrams and bigrams; we obtained a ROUGE-1 F-score of 37.1.
Experiments
Note that Lin & Bilmes (2011) report a slightly higher ROUGE-1 score (F-score 38.90) on DUC 2004.
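A minimal sketch of a ROUGE-1 F-score computation (illustration only; the official ROUGE toolkit adds stemming, stopword, and length options omitted here): precision and recall are taken over clipped unigram overlap between a system summary and a reference, then combined as their harmonic mean.

    from collections import Counter

    def rouge1_f(system, reference):
        """ROUGE-1 F-score over whitespace tokens (clipped unigram matches)."""
        sys_counts = Counter(system.lower().split())
        ref_counts = Counter(reference.lower().split())
        overlap = sum((sys_counts & ref_counts).values())   # clipped match counts
        p = overlap / sum(sys_counts.values())
        r = overlap / sum(ref_counts.values())
        return 2 * p * r / (p + r) if p + r else 0.0

    print(rouge1_f("the cat sat on the mat", "a cat sat on a red mat"))   # ~0.615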
F-score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Joty, Shafiq and Carenini, Giuseppe and Ng, Raymond and Mehdad, Yashar
Experiments
To evaluate the parsing performance, we use the standard unlabeled (i.e., hierarchical spans) and labeled (i.e., nuclearity and relation) precision, recall and F-score as described in (Marcu, 2000b).
Experiments
Table 2 presents F-score parsing results for our parsers and the existing systems on the two corpora. On both corpora, our parsers, namely 1S-1S (TSP 1-1) and sliding window (TSP SW), outperform existing systems by a wide margin (p < 7.1e-05). On RST-DT, our parsers achieve absolute F-score improvements of 8%, 9.4% and 11.4% in span, nuclearity and relation, respectively, over HILDA.
Experiments
On the Instructional genre, our parsers deliver absolute F-score improvements of 10.5%, 13.6% and 8.14% in span, nuclearity and relations, respectively, over the ILP-based approach.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Pilehvar, Mohammad Taher and Jurgens, David and Navigli, Roberto
Experiment 3: Sense Similarity
Table 7: F-score sense merging evaluation on three hand-labeled datasets: OntoNotes (Onto), Senseval-2 (SE-2), and combined (Onto+SE-2).
Experiment 3: Sense Similarity
For a binary classification task, we can directly calculate precision, recall and F-score by constructing a contingency table.
Experiment 3: Sense Similarity
In addition, we show in Table 7 the F-score results provided by Snow et al.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Reichart, Roi and Korhonen, Anna
Evaluation
For four out of five conditions its F-score performance outperforms the baselines by 42-83%.
Evaluation
These are the Most Frequent SCF baseline (O’Donovan et al., 2005), which uniformly assigns to all verbs the two most frequent SCFs in general language, transitive (SUBJ-DOBJ) and intransitive (SUBJ), and results in a poor F-score; and a filtering baseline that removes frames with low corpus frequencies, which results in low recall even when trying to provide the maximum recall for a given precision level.
Evaluation
The task we address is therefore to improve the precision of the corpus statistics baseline in a way that does not substantially harm the F-score .
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Varga, István and Sano, Motoki and Torisawa, Kentaro and Hashimoto, Chikara and Ohtake, Kiyonori and Kawai, Takao and Oh, Jong-Hoon and De Saeger, Stijn
Experiments
The proposed method achieved about 44% recall and nearly 80% precision, outperforming all other systems in terms of precision, F-score and average precision.
Experiments
Table 4: Recall (R), precision (P), F-score (F) and average precision (aP) of the problem report recognizers.
Experiments
Table 6: Recall (R), precision (P), F-score (F) and average precision (aP) of the problem-aid match recognizers.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zeng, Xiaodong and Wong, Derek F. and Chao, Lidia S. and Trancoso, Isabel
Experiment
We evaluated the performance (F-score) of our model on the three development sets by using different α values, where α is progressively increased in steps of 0.1 (0 < α < 1.0).
Experiment
Table 2 shows the F-score results of word segmentation on CTB-5, CTB-6 and CTB-7 testing sets.
Experiment
Table 2: F-score (%) results of five CWS models on CTB-5, CTB-6 and CTB-7.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: