Index of papers in Proc. ACL that mention
  • F-score
Persing, Isaac and Ng, Vincent
Error Classification
Using held-out validation data, we jointly tune the three parameters described in the previous paragraph to optimize the F-score achieved by bi for error ei. However, an exact solution to this optimization problem is computationally expensive.
Error Classification
Consequently, we find a local maximum by employing the simulated annealing algorithm (Kirkpatrick et al., 1983), altering one parameter at a time to optimize F-score by holding the remaining parameters fixed.
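The excerpt above only names the optimizer. Purely as a hypothetical illustration of coordinate-wise simulated annealing for F-score tuning (one parameter altered per step, with occasional acceptance of degradations), a minimal Python sketch could look like the following; the `fscore` evaluator, candidate grids, and cooling schedule are assumptions, not details from the paper.

```python
import math
import random

def tune_by_annealing(fscore, init_params, candidate_values, steps=200, t0=1.0, cooling=0.97):
    """Coordinate-wise simulated annealing: alter one parameter at a time,
    accepting worse held-out F-scores with a probability that shrinks as the
    temperature cools (Kirkpatrick et al., 1983)."""
    params = dict(init_params)
    current_f = best_f = fscore(params)
    best_params = dict(params)
    temperature = t0
    for _ in range(steps):
        name = random.choice(list(params))            # pick one parameter to perturb
        proposal = dict(params)
        proposal[name] = random.choice(candidate_values[name])
        f = fscore(proposal)
        # always accept improvements; accept degradations with Boltzmann probability
        if f >= current_f or random.random() < math.exp((f - current_f) / temperature):
            params, current_f = proposal, f
            if f > best_f:
                best_params, best_f = dict(proposal), f
        temperature *= cooling
    return best_params, best_f
```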
Error Classification
Other ways we could measure our system’s performance (such as macro F-score ) would consider our system’s performance on the less frequent errors no less important than its performance on the
Evaluation
To evaluate our thesis clarity error type identification system, we compute precision, recall, micro F-score, and macro F-score , which are calculated as follows.
Evaluation
Then, the precision (Pi), recall (Ri), and F-score (Fi) for bi and the macro F-score (F) of the combined system for one test fold are calculated in the usual way, with Fi = 2·Pi·Ri / (Pi + Ri) and F taken as the average of the Fi over the error types.
Evaluation
However, the macro F-score calculation can be seen as giving too much weight to the less frequent errors.
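Purely to illustrate the micro/macro distinction these excerpts discuss (the paper's exact per-error-type definitions are not fully recoverable here), a generic computation from hypothetical per-class true-positive/false-positive/false-negative counts might look like:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_micro_f(per_class_counts):
    """per_class_counts: non-empty list of (tp, fp, fn) tuples, one per error type."""
    # macro F: average the per-class F-scores, so rare classes count as much as frequent ones
    macro = sum(prf(*c)[2] for c in per_class_counts) / len(per_class_counts)
    # micro F: pool the counts first, so frequent classes dominate
    tp, fp, fn = (sum(c[i] for c in per_class_counts) for i in range(3))
    micro = prf(tp, fp, fn)[2]
    return macro, micro
```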
F-score is mentioned in 27 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Li, Chi-Ho and Zhou, Ming
Abstract
On top of the pruning framework, we also propose a discriminative ITG alignment model using hierarchical phrase pairs, which improves both F-score and Bleu score over the baseline alignment system of GIZA++.
Evaluation
An alternative criterion is the upper bound on alignment F-score , which essentially measures how many links in annotated alignment can be kept in ITG parse.
Evaluation
The calculation of F-score upper bound is done in a bottom-up way like ITG parsing.
Evaluation
The upper bound of alignment F-score can thus be calculated as well.
The DITG Models
The MERT module for DITG takes alignment F-score of a sentence pair as the performance measure.
The DITG Models
Given an input sentence pair and the reference annotated alignment, MERT aims to maximize the F-score of DITG-produced alignment.
F-score is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Habash, Nizar and Roth, Ryan
Abstract
Our best approach achieves a roughly 15% absolute increase in F-score over a simple but reasonable baseline.
Results
We present the results in terms of F-score only for simplicity; we then conduct an error analysis that examines precision and recall.
Results
Feature Set          | F-score | %Imp
word                 | 43.85   | —
word+nw              | 43.86   | N0
word+na              | 44.78   | 2.1
word+lem             | 45.85   | 4.6
word+pos             | 45.91   | 4.7
word+nw+pos+lem+na   | 46.34   | 5.7
Results
Feature Set | F-score | %Imp
word        | 43.85   | —
F-score is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Khapra, Mitesh M. and Joshi, Salil and Chatterjee, Arindam and Bhattacharyya, Pushpak
Discussions
For small seed sizes, the F-score of bilingual bootstrapping is consistently better than the F-score obtained by training only on the seed data without using any bootstrapping.
Discussions
To further illustrate this, we take some sample points from the graph and compare the number of tagged words needed by BiBoot and OnlySeed to reach the same (or nearly the same) F-score .
Experimental Setup
Seed Size v/s F-score
Experimental Setup
Figure 1: Comparison of BiBoot, MonoBoot, OnlySeed and WFS on Hindi Health data (Seed Size vs. F-score).
Experimental Setup
Seed Size v/s F-score
Results
a. BiBoot: This curve represents the F-score obtained after 10 iterations by using bilingual bootstrapping with different amounts of seed data.
Results
b. MonoBoot: This curve represents the F-score obtained after 10 iterations by using monolingual bootstrapping with different amounts of seed data.
Results
c. OnlySeed: This curve represents the F-score obtained by training on the seed data alone without using any bootstrapping.
F-score is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Pitler, Emily and Louis, Annie and Nenkova, Ani
Classification Results
The table lists the f-score for each of the target relations, with overall accuracy shown in brackets.
Classification Results
Given that the experiments are run on natural distribution of the data, which are skewed towards Expansion relations, the f-score is the more important measure to track.
Classification Results
Our random baseline is the f-score one would achieve by randomly assigning classes in proportion to its true distribution in the test set.
F-score is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Pervouchine, Vladimir and Li, Haizhou and Lin, Bo
Abstract
We propose a new evaluation metric, alignment entropy, grounded on the information theory, to evaluate the alignment quality without the need for the gold standard reference and compare the metric with F-score .
Experiments
Next we conduct three experiments to study 1) alignment entropy vs. F-score , 2) the impact of alignment quality on transliteration accuracy, and 3) how to validate transliteration using alignment metrics.
Experiments
5.1 Alignment entropy vs. F-score
Experiments
We have manually aligned a random set of 3,000 transliteration pairs from the Xinhua training set to serve as the gold standard, on which we calculate the precision, recall and F-score as well as alignment entropy for each alignment.
Related Work
Denoting the number of cross-lingual mappings that are common to both A and G as C_AG, the number of cross-lingual mappings in A as C_A and the number of cross-lingual mappings in G as C_G, precision Pr is given as C_AG/C_A, recall Rc as C_AG/C_G and F-score as 2·Pr·Rc/(Pr + Rc).
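Taking the definitions in this excerpt at face value (Pr = C_AG/C_A, Rc = C_AG/C_G, F = 2·Pr·Rc/(Pr + Rc)), a direct set-based computation is sketched below; representing each cross-lingual mapping as a tuple is an assumption made for illustration only.

```python
def alignment_prf(auto_links, gold_links):
    """auto_links, gold_links: sets of cross-lingual mappings, e.g. (source_unit, target_unit) pairs."""
    common = auto_links & gold_links                           # C_AG
    pr = len(common) / len(auto_links) if auto_links else 0.0  # C_AG / C_A
    rc = len(common) / len(gold_links) if gold_links else 0.0  # C_AG / C_G
    f = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f
```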
Transliteration alignment entropy
We expect and will show that this estimate is a good indicator of the alignment quality, and is as effective as the F-score , but without the need for a gold standard reference.
F-score is mentioned in 18 sentences in this paper.
Topics mentioned in this paper:
Fu, Ruiji and Guo, Jiang and Qin, Bing and Che, Wanxiang and Wang, Haifeng and Liu, Ting
Abstract
Our result, an F-score of 73.74%, outperforms the state-of-the-art methods on a manually labeled test dataset.
Abstract
Moreover, combining our method with a previous manually-built hierarchy extension method can further improve F-score to 80.29%.
Experimental Setup
We use precision, recall, and F-score as our metrics to evaluate the performances of the methods.
Introduction
The experimental results show that our method achieves an F-score of 73.74% which significantly outperforms the previous state-of-the-art methods.
Introduction
(2008) can further improve F-score to 80.29%.
Results and Analysis 5.1 Varying the Amount of Clusters
Table 3 shows that the proposed method achieves a better recall and F-score than all of the previous methods do.
Results and Analysis 5.1 Varying the Amount of Clusters
It can significantly (p < 0.01) improve the F-score over the state-of-the-art method MWiki+CilinE.
Results and Analysis 5.1 Varying the Amount of Clusters
The F-score is further improved from 73.74% to 76.29%.
F-score is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Abend, Omri and Reichart, Roi and Rappoport, Ari
Experimental Setup
We report an F-score as well (the harmonic mean of precision and recall).
Experimental Setup
We use the standard parsing F-score evaluation measure.
Introduction
We use two measures to evaluate the performance of our algorithm, precision and F-score .
Introduction
Precision reflects the algorithm’s applicability for creating training data to be used by supervised SRL models, while the standard SRL F-score measures the model’s performance when used by itself.
Introduction
The first stage of our algorithm is shown to outperform a strong baseline both in terms of F-score and of precision.
Related Work
Better performance is achieved on the classification, where state-of-the-art supervised approaches achieve about 81% F-score on the in-domain identification task, of which about 95% are later labeled correctly (Marquez et al., 2008).
Results
In the “Collocation Maximum F-score” the collocation parameters were generally tuned such that the maximum possible F-score for the collocation algorithm is achieved.
Results
The best or close to best F-score is achieved when using the clause detection algorithm alone (59.14% for English, 23.34% for Spanish).
Results
Note that for both English and Spanish F-score improvements are achieved via a precision improvement that is more significant than the recall degradation.
F-score is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Vadas, David and Curran, James R.
Experiments
| PREC | RECALL | F-SCORE
Experiments
| PREC | RECALL | F-SCORE
Experiments
Table 2 shows that F-score has dropped by 0.61%.
Introduction
These features are targeted at improving the recovery of NP structure, increasing parser performance by 0.64% F-score .
F-score is mentioned in 20 sentences in this paper.
Topics mentioned in this paper:
Bergsma, Shane and Lin, Dekang and Goebel, Randy
Evaluation
F-Score (F) is the geometric average of precision and recall; it is the most common non-referential detection metric.
Results
Table 4 gives precision, recall, F-score , and accuracy on the Train/ Test split.
Results
Note that while the LL system has high detection precision, it has very low recall, sharply reducing F-score .
Results
The MINIPL approach sacrifices some precision for much higher recall, but again has fairly low F-score .
F-score is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Zhang, Yuan and Barzilay, Regina and Globerson, Amir
Evaluation Setup
First, following previous work, we evaluate our method using the labeled and unlabeled predicate-argument dependency F-score .
Evaluation Setup
The dependency F-score captures both the target-
Experiment and Analysis
For instance, there is a gain of 6.2% in labeled dependency F-score for HPSG formalism when 15,000 CFG trees are used.
Experiment and Analysis
Across all three grammars, we can observe that adding CFG data has a more pronounced effect on the PARSEVAL measure than the dependency F-score .
Experiment and Analysis
On the other hand, predicate-argument dependency F-score (Figure 5ac) also relies on the target grammar information.
Implementation
This results in a drop on the dependency F-score by about 5%.
Introduction
For instance, the model trained on 500 HPSG sentences achieves labeled dependency F-score of 72.3%.
Introduction
Adding 15,000 Penn Treebank sentences during training leads to 78.5% labeled dependency F-score , an absolute improvement of 6.2%.
F-score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Niu, Zheng-Yu and Wang, Haifeng and Wu, Hua
Abstract
Evaluation on the Penn Chinese Treebank indicates that a converted dependency treebank helps constituency parsing and the use of unlabeled data by self-training further increases parsing f-score to 85.2%, resulting in 6% error reduction over the previous best result.
Experiments of Grammar Formalism Conversion
Finally Q-10-method achieved an f-score of 93.8% on WSJ section 22, an absolute 4.4% improvement (42% error reduction) over the best result of Xia et al.
Experiments of Grammar Formalism Conversion
Finally Q-10-method achieved an f-score of 93.6% on WSJ sections 2~18 and 20~22, better than that of Q-0-method and comparable with that of Q-10-method in Section 3.1.
Experiments of Parsing
Finally we decided that the optimal value of A was 0.4 and the optimal weight of CTB was 1, which brought the best performance on the development set (an f-score of 86.1%).
Experiments of Parsing
In comparison with the results in Section 4.1, the average index of converted trees in 200-best list increased to 2, and their average unlabeled dependency f-score dropped to 65.4%.
Experiments of Parsing
84.2% f-score , better than the result of the reranking parser with CTB and CDTPS as training data (shown in Table 5).
Introduction
Our conversion method achieves 93.8% f-score on dependency trees produced from WSJ section 22, resulting in 42% error reduction over the previous best result for DS to PS conversion.
Introduction
When coupled with self-training technique, a reranking parser with CTB and converted CDT as labeled data achieves 85.2% f-score on CTB test set, an absolute 1.0% improvement (6% error reduction) over the previous best result for Chinese parsing.
Our Two-Step Solution
Therefore we modified the selection metric in Section 2.1 by interpolating two scores, the probability of a conversion candidate from the parser and its unlabeled dependency f-score , shown as follows:
F-score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Han, Bo and Baldwin, Timothy
Conclusion and Future Work
In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
Experiments
We evaluate detection performance by token-level precision, recall and F-score (6 = 1).
Experiments
For candidate selection, we once again evaluate using token-level precision, recall and F-score .
Experiments
Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation.
F-score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Kummerfeld, Jonathan K. and Roesner, Jessika and Dawborn, Tim and Haggerty, James and Curran, James R. and Clark, Stephen
Introduction
Using an adapted supertagger with ambiguity levels tuned to match the baseline system, we were also able to increase F-score on labelled grammatical relations by 0.75%.
Results
Interestingly, while the decrease in supertag accuracy in the previous experiment did not translate into a decrease in F-score, the increase in tag accuracy here does translate into an increase in F-score .
Results
The increase in F-score has two sources.
Results
As Table 6 shows, this change translates into an improvement of up to 0.75% in F-score on Section
F-score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Johnson, Mark
Word segmentation with adaptor grammars
We evaluated the f-score of the recovered word constituents (Goldwater et al., 2006b).
Word segmentation with adaptor grammars
Table 1: Word segmentation f-score results for all models, as a function of DP concentration parameter α.
Word segmentation with adaptor grammars
With α = 1 and α = 10 we obtained a word segmentation f-score of 0.55.
F-score is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Sun, Weiwei and Wan, Xiaojun
Experiments
Three metrics are used for evaluation: precision (P), recall (R) and balanced f-score (F) defined by 2PR/(P+R).
Experiments
The baseline of the character-based joint solver (CTagctb) is competitive, and achieves an f-score of 92.93.
Experiments
…tagging model achieves an f-score of 94.03.
Introduction
Our structure-based stacking model achieves an f-score of 94.36, which is superior to a feature-based stacking model introduced in (Jiang et al., 2009).
Introduction
Our final system achieves an f-score of 94.68, which yields a relative error reduction of 11% over the best published result (94.02).
F-score is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Zhang, Zhe and Singh, Munindar P.
Experiments
Accuracy Macro F-score Micro F-score
Experiments
Figure 8 reports the accuracy, macro F-score, and micro F-score .
Experiments
It shows that the BR learner produces better accuracy and micro F-score than the FR learner, but a slightly worse macro F-score.
F-score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Huang, Fei and Xu, Jian-Ming and Ittycheriah, Abraham and Roukos, Salim
Discussion and Conclusion
With the 26 proposed features derived from decoding process and source sentence syntactic analysis, the proposed QE model achieved better TER prediction, higher correlation with human correction of MT output and higher F-score in finding good translations.
Experiments
Here we report the precision, recall and F-score of finding such “Good” sentences (with TER g 0.1) on the three documents in Table 3.
Experiments
Again, the adaptive QE model produces higher recall, mostly higher precision, and significantly improved F-score .
Experiments
The overall F-score of the adaptive QE model is 0.282.
F-score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Muller, Philippe and Fabre, Cécile and Adam, Clémentine
Evaluation of lexical similarity in context
In case one wants to optimize the F-score (the harmonic mean of precision and recall) when extracting relevant pairs, we can see that the optimal point is at .24 for a threshold of .22 on Lin’s score.
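The excerpt describes picking the similarity threshold that maximizes F-score over extracted pairs. A hedged sketch of that sweep follows; the data structures (a score per pair and a gold set of relevant pairs) are assumptions, not the paper's actual representation.

```python
def best_threshold(scored_pairs, gold_relevant, thresholds):
    """scored_pairs: {pair: similarity score}; gold_relevant: set of pairs judged relevant.
    Sweep candidate thresholds and keep the one with the highest F-score."""
    best_t, best_f = None, 0.0
    for t in thresholds:
        predicted = {p for p, s in scored_pairs.items() if s >= t}
        if not predicted or not gold_relevant:
            continue
        tp = len(predicted & gold_relevant)
        p, r = tp / len(predicted), tp / len(gold_relevant)
        f = 2 * p * r / (p + r) if p + r else 0.0
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```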
Experiments: predicting relevance in context
Other popular methods (maximum entropy, SVM) have shown slightly inferior combined F-score , even though precision and recall might yield more important variations.
Experiments: predicting relevance in context
As a baseline, we can also consider a simple threshold on the lexical similarity score, in our case Lin’s measure, which we have shown to yield the best F-score of 24% when set at 0.22.
Experiments: predicting relevance in context
If we take the best simple classifier (random forests), the precision and recall are 68.1% and 24.2% for an F-score of 35.7%, and this is significantly beaten by the Naive Bayes method as precision and recall are more even ( F-score of 41.5%).
Related work
Precision | Recall | F-score
40.4      | 54.3   | 46.3
37.4      | 52.8   | 43.8
36.1      | 49.5   | 41.8
36.5      | 54.8   | 43.8
F-score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Huang, Liang
Experiments
…features in the updated version. However, our initial experiments show that, even with this much simpler feature set, our 50-best reranker performed equally well as theirs (both with an F-score of 91.4, see Tables 3 and 4).
Experiments
With only local features, our forest reranker achieves an F-score of 91.25, and with the addition of non-
Forest Reranking
yEeand where function F returns the F-score .
Forest Reranking
In case multiple candidates get the same highest F-score, we choose the parse with the highest log probability from the baseline parser to be the oracle parse (Collins, 2000).
Introduction
we achieved an F-score of 91.7, which is a 19% error reduction from the 1-best baseline, and outperforms both 50-best and 100-best reranking.
Supporting Forest Algorithms
4.1 Forest Oracle Recall that the Parseval F-score is the harmonic mean of labelled precision P and labelled recall R: F = 2PR / (P + R) = 2|y ∩ y*| / (|y| + |y*|).
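Following the identity quoted above (F = 2PR/(P+R) = 2|y ∩ y*| / (|y| + |y*|)), bracket-level F-score can be computed directly from matched brackets. In the sketch below, representing a bracket as a (label, start, end) span is an assumption; the sketch does not reproduce the paper's forest-oracle dynamic program.

```python
from collections import Counter

def parseval_f(test_brackets, gold_brackets):
    """Brackets as (label, start, end) spans; matches are counted as a multiset intersection."""
    y, y_star = Counter(test_brackets), Counter(gold_brackets)
    matched = sum((y & y_star).values())               # |y ∩ y*|
    total = sum(y.values()) + sum(y_star.values())     # |y| + |y*|
    return 2 * matched / total if total else 0.0
```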
Supporting Forest Algorithms
In other words, the optimal F-score tree in a forest is not guaranteed to be composed of two optimal F-score subtrees.
Supporting Forest Algorithms
Shown in Pseudocode 4, we perform these computations in a bottom-up topological order, and finally at the root node TOP, we can compute the best global F-score by maximizing over different numbers of test brackets (line 7).
F-score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Zeng, Xiaodong and Wong, Derek F. and Chao, Lidia S. and Trancoso, Isabel
Introduction
Experiments on the data from the Chinese tree bank (CTB-7) and Microsoft Research (MSR) show that the proposed model results in significant improvement over other comparative candidates in terms of F-score and out-of-vocabulary (OOV) recall.
Method
The performance measurement indicators for word segmentation and POS tagging (joint S&T) are balanced F-score, F = 2PR/(P+R), the harmonic mean of precision (P) and recall (R), and out-of-vocabulary recall (OOV-R).
Method
It obtains 0.92% and 2.32% increases in F-score and OOV-R, respectively.
Method
On the whole, for segmentation, they achieve average improvements of 1.02% and 6.8% in F-score and OOV-R; whereas for POS tagging, the average increments of F-score and OOV-R are 0.87% and 6.45%.
Related Work
Prior supervised joint S&T models present approximate 0.2% - 1.3% improvement in F-score over supervised pipeline ones.
F-score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
duVerle, David and Prendinger, Helmut
Abstract
Using a rich set of shallow lexical, syntactic and structural features from the input text, our parser achieves, in linear time, 73.9% of professional annotators’ human agreement F-score .
Building a Discourse Parser
Current state-of-the-art results in automatic segmenting are much closer to human levels than full structure labeling ( F-score ratios of automatic performance over gold standard reported in LeThanh et al.
Evaluation
Standard performance indicators for such a task are precision, recall and F-score as measured by the PARSEVAL metrics (Black et al., 1991), with the specific adaptations to the case of RST trees made by Marcu (2000, page 143-144).
Evaluation
          |  S   |  N   |  R   |  F   |  S   |  N   |  R   |  F
Precision | 83.0 | 68.4 | 55.3 | 54.8 | 69.5 | 56.1 | 44.9 | 44.4
Recall    | 83.0 | 68.4 | 55.3 | 54.8 | 69.2 | 55.8 | 44.7 | 44.2
F-Score   | 83.0 | 68.4 | 55.3 | 54.8 | 69.3 | 56.0 | 44.8 | 44.3
Evaluation
          | Manual: S / N / R / F     | SPADE: S / N / R / F      | S / N / R / F
Precision | 84.1 / 70.6 / 55.6 / 55.1 | 70.6 / 58.1 / 46.0 / 45.6 | 88.0 / 77.5 / 66.0 / 65.2
Recall    | 84.1 / 70.6 / 55.6 / 55.1 | 71.2 / 58.6 / 46.4 / 46.0 | 88.1 / 77.6 / 66.1 / 65.3
F-Score   | 84.1 / 70.6 / 55.6 / 55.1 | 70.9 / 58.3 / 46.2 / 45.8 | 88.1 / 77.5 / 66.0 / 65.3
F-score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Huang, Fei
Abstract
Additionally, we remove low confidence alignment links from the word alignment of a bilingual training corpus, which increases the alignment F-score , improves Chinese-English and Arabic-English translation quality and significantly reduces the phrase translation table size.
Alignment Link Confidence Measure
Table 2 shows the precision, recall and F-score of individual alignments and the combined align-
Alignment Link Confidence Measure
Overall it improves the F-score by 1.5 points (from 69.3 to 70.8), 1.8 point improvement for content words and 1.0 point for function words.
Improved MaXEnt Aligner with Confidence-based Link Filtering
         | Precision | Recall | F-score
Baseline | 72.66     | 66.17  | 69.26
+ALF     | 78.14     | 64.36  | 70.59
Improved MaXEnt Aligner with Confidence-based Link Filtering
         | Precision | Recall | F-score
Baseline | 84.43     | 83.64  | 84.04
+ALF     | 88.29     | 83.14  | 85.64
Sentence Alignment Confidence Measure
Aligner F-score Cor.
Sentence Alignment Confidence Measure
The results in Figure 2 show a strong correlation between the confidence measure and the alignment F-score, with a correlation coefficient of -0.69.
F-score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Johnson, Mark and Christophe, Anne and Dupoux, Emmanuel and Demuth, Katherine
Abstract
This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case “function words”, setting a new state-of-the-art of 92.4% token f-score .
Introduction
While absolute accuracy is not directly relevant to the main point of the paper, we note that the models that learn generalisations about function words perform unsupervised word segmentation at 92.5% token f-score on the standard Bernstein-Ratner (1987) corpus, which improves the previous state-of-the-art by more than 4%.
Introduction
that achieves the best token f-score expects function words to appear at the left edge of phrases.
Word segmentation results
                   | f-score | precision | recall
Baseline           | 0.872   | 0.918     | 0.956
+ left FWs         | 0.924   | 0.935     | 0.990
+ left + right FWs | 0.912   | 0.957     | 0.953
Word segmentation results
Figure 2 presents the standard token and lexicon (i.e., type) f-score evaluations for word segmentations proposed by these models (Brent, 1999), and Table 1 summarises the token and lexicon f-scores for the major models discussed in this paper.
Word segmentation results
It is interesting to note that adding “function words” improves token f-score by more than 4%, corresponding to a 40% reduction in overall error rate.
Word segmentation with Adaptor Grammars
The starting point and baseline for our extension is the adaptor grammar with syllable structure phonotactic constraints and three levels of collocational structure (5-21), as prior work has found that this yields the highest word segmentation token f-score (Johnson and Goldwater, 2009).
F-score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Pitler, Emily and Nenkova, Ani
Discourse vs. non-discourse usage
Using the string of the connective as the only feature sets a reasonably high baseline, with an f-score of 75.33% and an accuracy of 85.86%.
Discourse vs. non-discourse usage
Interestingly, using only the syntactic features, ignoring the identity of the connective, is even better, resulting in an f-score of 88.19% and accuracy of 92.25%.
Discourse vs. non-discourse usage
Using both the connective and syntactic features is better than either individually, with an f-score of 92.28% and accuracy of 95.04%.
F-score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Lo, Chi-kiu and Beloucif, Meriem and Saers, Markus and Wu, Dekai
Related Work
MEANT (Lo et al., 2012), which is the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers, outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgments.
Related Work
In this paper, we employ a newer version of MEANT that uses f-score to aggregate individual token similarities into the composite phrasal similarities of semantic role fillers, as our experiments indicate this is more accurate than the previously used aggregation functions.
Related Work
Compute the weighted f-score over the matching role labels of these aligned predicates and role fillers according to the definitions similar to those in section 2.2 except for replacing REF with IN in qij and wil .
Results
Table 1 shows that for human adequacy judgments at the sentence level, the f-score based XMEANT (1) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT.
XMEANT: a cross-lingual MEANT
3.1 Applying MEANT’s f-score within semantic role fillers
XMEANT: a cross-lingual MEANT
The first natural approach is to extend MEANT’s f-score based method of aggregating semantic parse accuracy, so as to also apply to aggregat-
F-score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Zhang, Longkai and Li, Li and He, Zhengyan and Wang, Houfeng and Sun, Ni
Experiment
F-score
Experiment
Both the f-score and OOV-recall increase.
Experiment
By comparing No-balance and ADD-N alone we can find that we achieve a relatively high f-score if we ignore the tag balance issue, while slightly hurting the OOV-Recall.
INTRODUCTION
For example, the most widely used Chinese segmenter "ICTCLAS" yields a 0.95 f-score on news corpora, but only a 0.82 f-score on micro-blog data.
F-score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Chen, Yanping and Zheng, Qinghua and Zhang, Wei
Abstract
The results show a significant improvement in Chinese relation extraction, outperforming other methods in F-score by 10% in 6 relation types and 15% in 18 relation subtypes.
Feature Construction
F-score is computed by
Feature Construction
In Row 2, with only the omni-word feature, the F-score already reaches 77.74% in 6 types and 60.31% in 18 subtypes.
Feature Construction
In Table 3, it is shown that our system outperforms other systems, in F-score , by 10% on 6 relation types and by 15% on 18 subtypes.
Introduction
The performance of relation extraction is still unsatisfactory, with an F-score of 67.5% for English (23 subtypes) (Zhou et al., 2010).
Introduction
Chinese relation extraction also shows weak performance, with an F-score of about 66.6% in 18 subtypes (Dandan et al., 2012).
F-score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Fukumoto, Fumiyo and Suzuki, Yoshimi and Matsuyoshi, Suguru
Abstract
The results using Reuters documents showed that the method was comparable to the current state-of-the-art biased-SVM method as the F-score obtained by our method was 0.627 and biased-SVM was 0.614.
Conclusion
The results using the 1996 Reuters corpora showed that the method was comparable to the current state-of-the-art biased-SVM method as the F-score obtained by our method was 0.627 and biased-SVM was 0.614.
Experiments
We empirically selected values of two parameters, "c" (tradeoff between training error and margin) and "j" (the cost-factor by which training errors on positive examples outweigh errors on negative examples), that optimized the F-score obtained by classification of test documents.
Experiments
Figure 3 shows micro-averaged F-score against the 6 value.
Experiments
F-score
F-score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Pershina, Maria and Min, Bonan and Xu, Wei and Grishman, Ralph
Abstract
Experiments show that our approach achieves a statistically significant increase of 13.5% in F-score and 37% in area under the precision recall curve.
Available at http://nlp.stanford.edu/software/mimlre.shtml.
Figure 2 shows that our model consistently outperforms all six algorithms at almost all recall levels and improves the maximum F-score by more than 13.5% relative to MIML (from 28.35% to 32.19%) as well as increases the area under precision-recall curve by more than 37% (from 11.74 to 16.1).
Available at http://nlp.stanford.edu/software/mimlre.shtml.
Performance of Guided DS also compares favorably with best scored hand-coded systems for a similar task such as Sun et al., (2011) system for KBP 2011, which reports an F-score of 25.7%.
Introduction
…proposed approach, we extend MIML (Surdeanu et al., 2012), a state-of-the-art distant supervision model, and show a significant improvement of 13.5% in F-score on the relation extraction benchmark TAC-KBP (Ji and Grishman, 2011) dataset.
Introduction
While prior work employed tens of thousands of human labeled examples (Zhang et al., 2012) and only got a 6.5% increase in F-score over a logistic regression baseline, our approach uses much less labeled data (about 1/8) but achieves much higher improvement on performance over stronger baselines.
Training
Training MIML on a simple fusion of distantly-labeled and human-labeled datasets does not improve the maximum F-score since this hand-labeled data is swamped by a much larger amount of distant-supervised data of much lower quality.
F-score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Agirre, Eneko and Baldwin, Timothy and Martinez, David
Discussion
Table 8 summarises the results, showing that the error reduction rate (ERR) over the parsing F-score is up to 6.9%, which is remarkable given the relatively superficial strategy for incorporating sense information into the parser.
Experimental setting
We evaluate the parsers via labelled bracketing recall (R), precision (P) and F-score (F1).
Results
The SFU representation produces the best results for Bikel (F-score 0.010 above baseline), while for Charniak the best performance is obtained with word+SF ( F-score 0.007 above baseline).
Results
Overall, Bikel obtains a superior F-score in all configurations.
Results
Again, the F-score for the semantic representations is better than the baseline in all cases.
F-score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Auli, Michael and Lopez, Adam
Experiments
Labelled F-score
Oracle Parsing
To answer this question we computed oracle best and worst values for labelled dependency F-score using the algorithm of Huang (2008) on the hybrid model of Clark and Curran (2007), the best model of their C&C parser.
Oracle Parsing
Labelled F-score
Oracle Parsing
Digging deeper, we compared parser model score against Viterbi F-score and oracle F-score at a va-
F-score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Börschinger, Benjamin and Johnson, Mark and Demuth, Katherine
Conclusion and outlook
We find that our Bigram model reaches 77% /t/-recovery F-score when run with knowledge of true word-boundaries and when it can make use of both the preceding and the following phonological context, and that unlike the Unigram model it is able to learn the probability of /t/-deletion in different contexts.
Conclusion and outlook
When performing joint word segmentation on the Buckeye corpus, our Bigram model reaches around 55% F-score for recovering deleted /t/s, with a word segmentation F-score of around 72%, which is 2% better than running a Bigram model that does not model /t/-deletion.
Experiments 4.1 The data
We evaluate the model in terms of F-score , the harmonic mean of recall (the fraction of underlying /t/s the model correctly recovered) and precision (the fraction of underlying /t/s the model predicted that were correct).
Experiments 4.1 The data
Looking at the segmentation performance this isn’t too surprising: the Unigram model’s poorer token F-score , the standard measure of segmentation performance on a word token level, suggests that it misses many more boundaries than the Bigram model to begin with and, consequently, can’t recover any potential underlying /t/s at these boundaries.
Experiments 4.1 The data
The generally worse performance of handling variation as measured by /t/-recovery F-score when performing joint segmentation is consistent with the finding of Elsner et al.
F-score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Tomanek, Katrin and Hahn, Udo
Experiments and Results
Table 2 depicts the exact numbers of manually labeled tokens to reach the maximal (supervised) F-score on both corpora.
Experiments and Results
On the MUC7 corpus, FuSAL requires 7,374 annotated NPs to yield an F-score of 87%, while SeSAL hits the same F-score with only 4,017 NPs.
Experiments and Results
On PennBioIE, SeSAL also saves about 45% compared to FuSAL to achieve an F-score of 81%.
F-score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Lei, Tao and Long, Fan and Barzilay, Regina and Rinard, Martin
Abstract
Our results show that our approach achieves 80.0% F-Score accuracy compared to an F-Score of 66.7% produced by a state-of-the-art semantic parser on a dataset of input format specifications from the ACM International Collegiate Programming Contest (which were written in English for humans with no intention of providing support for automated processing).
Experimental Results
The two versions achieve very close performance (80% vs 84% in F-Score ), even though Full Model is trained with noisy feedback.
Experimental Setup
Model | Recall | Precision | F-Score
Introduction
However, when trained using the noisy supervision, our method achieves substantially more accurate translations than a state-of-the-art semantic parser (Clarke et al., 2010) (specifically, 80.0% in F-Score compared to an F-Score of 66.7%).
Introduction
The strength of our model in the face of such weak supervision is also highlighted by the fact that it retains an F-Score of 77% even when only one input example is provided for each input
F-score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Xu, Wenduan and Clark, Stephen and Zhang, Yue
Abstract
Standard CCGBank tests show the model achieves up to 1.05 labeled F-score improvements over three existing, competitive CCG parsing models.
Experiments
On both the full and reduced sets, our parser achieves the highest F-score .
Experiments
In comparison with C&C, our parser shows significant increases across all metrics, with 0.57% and 1.06% absolute F-score improvements over the hybrid and normal-form models, respectively.
Experiments
While our parser achieved lower precision than Z&C, it is more balanced and gives higher recall for all of the dependency relations except the last one, and higher F-score for over half of them.
Introduction
Results on the standard CCGBank tests show that our parser achieves absolute labeled F-score gains of up to 0.5 over the shift-reduce parser of Zhang and Clark (2011); and up to 1.05 and 0.64 over the normal-form and hybrid models of Clark and Curran (2007), respectively.
F-score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Woodsend, Kristian and Lapata, Mirella
Experimental Setup
[Figure: Recall, Precision and F-score for Rouge-1 and Rouge-L]
Results
F-score is higher for the phrase-based system but not significantly.
Results
The sentence ILP model outperforms the lead baseline with respect to recall but not precision or F-score .
Results
The phrase ILP achieves a significantly better F-score than the lead baseline with both ROUGE-1 and ROUGE-L.
F-score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Li, Linlin and Roth, Benjamin and Sporleder, Caroline
Experiments
We only compare the F-score, since all the compared systems have an attempted rate of 1.0,
Experiments
), F-score (F1).
Experiments
System F-score
F-score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Davidov, Dmitry and Rappoport, Ari
Abstract
Our NR classification evaluation strictly follows the ACL SemEval-07 Task 4 datasets and protocol, obtaining an f-score of 70.6, as opposed to 64.8 of the best previous work that did not use the manually provided WordNet sense disambiguation tags.
Results
In fact, our results ( f-score 62.0, accuracy 64.5) are better than the averaged results (58.0, 61.1) of the group that did not utilize WN tags.
Results
Table 2 shows the HITS-based classification results ( F-score and Accuracy) and the number of positively labeled clusters (C) for each relation.
Results
We have used the exact evaluation procedure described in (Turney, 2006), achieving a class f-score average of 60.1, as opposed to 54.6 in (Turney, 2005) and 51.2 in (Nastase et al., 2006).
F-score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Regneri, Michaela and Koller, Alexander and Pinkal, Manfred
Evaluation
We calculated precision, recall, and f-score for our system, the baselines, and the upper bound as follows, with all_system being the number of pairs labelled as paraphrase or happens-before, all_gold as the respective number of pairs in the gold standard, and correct as the number of pairs labelled correctly by the system.
Evaluation
The f-score for the upper bound is in the column upper.
Evaluation
For the f-score values, we calculated the significance for the difference between our system and the baselines as well as the upper bound, using a resampling test (Edgington, 1986).
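The excerpt cites a resampling test (Edgington, 1986) for the F-score differences. As a generic, hypothetical sketch of a paired approximate randomization test (the per-item statistics and the exact procedure used in the paper are not specified here), one could write:

```python
import random

def randomization_test(items_a, items_b, metric, trials=10000):
    """items_a, items_b: per-item statistics for two systems (e.g. (tp, fp, fn) per item);
    metric: function mapping such a list to a corpus-level score such as F-score.
    Returns an estimated p-value for the observed score difference."""
    observed = abs(metric(items_a) - metric(items_b))
    at_least_as_large = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(items_a, items_b):
            # randomly swap which system each item's output is attributed to
            if random.random() < 0.5:
                shuffled_a.append(a); shuffled_b.append(b)
            else:
                shuffled_a.append(b); shuffled_b.append(a)
        if abs(metric(shuffled_a) - metric(shuffled_b)) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (trials + 1)
```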
F-score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Pei, Wenzhe and Ge, Tao and Chang, Baobao
Experiment
As we can see, by using tag embedding, the F-score is improved by +0.6% and OOV recall is improved by +1.0%, which shows that tag embeddings succeed in modeling the tag-tag interaction and tag-character interaction.
Experiment
The F-score is improved by +0.6% while OOV recall is improved by +3.2%, which denotes that tensor-based transformation captures more interactional information than simple nonlinear transformation.
Experiment
As shown in Table 5 (last three rows), both the F-score and OOV recall of our model improve with pre-training.
F-score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Gkatzia, Dimitra and Hastie, Helen and Lemon, Oliver
Abstract
We show that this method generates output closer to the feedback that lecturers actually generated, achieving 3.5% higher accuracy and 15% higher F-score than multiple simple classifiers that keep a history of selected templates.
Evaluation
The accuracy, the weighted precision, the weighted recall, and the weighted F-score of the classifiers are shown in Table 3.
Evaluation
It was found that in 10-fold cross validation RAkEL performs significantly better in all these automatic measures (accuracy = 76.95%, F-score = 85.50%).
Evaluation
Remarkably, ML achieves more than 10% higher F-score than the other methods (Table 3).
F-score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Dasgupta, Anirban and Kumar, Ravi and Ravi, Sujith
Experiments
of our system that approximates the submodular objective function proposed by Lin and Bilmes (2011). As shown in the results, our best system, which uses the hs dispersion function, achieves a better ROUGE-1 F-score than all other systems.
Experiments
(4) To understand the effect of utilizing syntactic structure and semantic similarity for constructing the summarization graph, we ran the experiments using just the unigrams and bigrams; we obtained a ROUGE-1 F-score of 37.1.
Experiments
Note that Lin & Bilmes (2011) report a slightly higher ROUGE-1 score (F-score 38.90) on DUC 2004.
F-score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Abend, Omri and Rappoport, Ari
A UCCA-Annotated Corpus
We derive an F-score from these counts.
A UCCA-Annotated Corpus
The table presents the average F-score between the annotators, as well as the average F-score when comparing to the gold standard.
A UCCA-Annotated Corpus
An average taken over a sample of passages annotated by all four annotators yielded an F-score of 93.7%.
F-score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Lee, Chia-ying and Glass, James
Experimental Setup
We follow the suggestion of (Scharenborg et al., 2010) and use a 20-ms tolerance window to compute recall, precision rates and F-score of the segmentation our model proposed for TIMIT’s training set.
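The excerpt specifies a 20-ms tolerance window for scoring proposed phone boundaries. A minimal sketch of such an evaluation follows; the one-to-one greedy matching policy is an assumption, since the excerpt only fixes the tolerance.

```python
def boundary_prf(proposed, reference, tolerance=0.02):
    """proposed, reference: lists of boundary times in seconds.
    A proposed boundary is a hit if it lies within `tolerance` (20 ms) of an unused reference boundary."""
    used = [False] * len(reference)
    hits = 0
    for b in proposed:
        for i, r in enumerate(reference):
            if not used[i] and abs(b - r) <= tolerance:
                used[i] = True
                hits += 1
                break
    precision = hits / len(proposed) if proposed else 0.0
    recall = hits / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```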
Introduction
…state-of-the-art unsupervised method and improves the relative F-score by 18.8 points (Dusan and Rabiner, 2006).
Results
unit (%)     | Recall | Precision | F-score
Dusan (2006) | 75.2   | 66.8      | 70.8
Qiao et al.
Results
When compared to the baseline in which the number of phone boundaries in each utterance was also unknown (Dusan and Rabiner, 2006), our model outperforms in both recall and precision, improving the relative F-score by 18.8%.
F-score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zeng, Xiaodong and Wong, Derek F. and Chao, Lidia S. and Trancoso, Isabel
Experiment
We evaluated the performance (F-score) of our model on the three development sets by using different α values, where α is progressively increased in steps of 0.1 (0 < α < 1.0).
Experiment
Table 2 shows the F-score results of word segmentation on CTB-5, CTB-6 and CTB-7 testing sets.
Experiment
Table 2: F-score (%) results of five CWS models on CTB-5, CTB-6 and CTB-7.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Varga, István and Sano, Motoki and Torisawa, Kentaro and Hashimoto, Chikara and Ohtake, Kiyonori and Kawai, Takao and Oh, Jong-Hoon and De Saeger, Stijn
Experiments
The proposed method achieved about 44% recall and nearly 80% precision, outperforming all other systems in terms of precision, F-score and average precision.
Experiments
Table 4: Recall (R), precision (P), F-score (F) and average precision (aP) of the problem report recognizers.
Experiments
Table 6: Recall (R), precision (P), F-score (F) and average precision (aP) of the problem-aid match recognizers.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Reichart, Roi and Korhonen, Anna
Evaluation
For four out of five conditions its F-score performance outperforms the baselines by 42-83%.
Evaluation
These are the Most Frequent SCF (O’Donovan et al., 2005) which uniformly assigns to all verbs the two most frequent SCFs in general language, transitive (SUBJ-DOBJ) and intransitive (SUBJ) (and results in poor F-score ), and a filtering that removes frames with low corpus frequencies (which results in low recall even when trying to provide the maximum recall for a given precision level).
Evaluation
The task we address is therefore to improve the precision of the corpus statistics baseline in a way that does not substantially harm the F-score .
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Clark, Stephen and Curran, James R.
Evaluation
The third row is similar, but for sentences for which the oracle F-score is greater than 92%.
The CCG to PTB Conversion
shows that converting gold-standard CCG derivations into the GRs in DepBank resulted in an F-score of only 85%; hence the upper bound on the performance of the CCG parser, using this evaluation scheme, was only 85%.
The CCG to PTB Conversion
The numbers are bracketing precision, recall, F-score and complete sentence matches, using the EVALB evaluation script.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Pilehvar, Mohammad Taher and Jurgens, David and Navigli, Roberto
Experiment 3: Sense Similarity
Table 7: F-score sense merging evaluation on three hand-labeled datasets: OntoNotes (Onto), Senseval-2 (SE-2), and combined (Onto+SE-2).
Experiment 3: Sense Similarity
For a binary classification task, we can directly calculate precision, recall and F-score by constructing a contingency table.
Experiment 3: Sense Similarity
In addition, we show in Table 7 the F-score results provided by Snow et al.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Joty, Shafiq and Carenini, Giuseppe and Ng, Raymond and Mehdad, Yashar
Experiments
To evaluate the parsing performance, we use the standard unlabeled (i.e., hierarchical spans) and labeled (i.e., nuclearity and relation) precision, recall and F-score as described in (Marcu, 2000b).
Experiments
Table 2 presents F-score parsing results for our parsers and the existing systems on the two corpora. On both corpora, our parsers, namely 1S-1S (TSP 1-1) and sliding window (TSP SW), outperform existing systems by a wide margin (p < 7.1e-05). On RST-DT, our parsers achieve absolute F-score improvements of 8%, 9.4% and 11.4% in span, nuclearity and relation, respectively, over HILDA.
Experiments
On the Instructional genre, our parsers deliver absolute F-score improvements of 10.5%, 13.6% and 8.14% in span, nuclearity and relations, respectively, over the ILP-based approach.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Branavan, S.R.K. and Kushman, Nate and Lei, Tao and Barzilay, Regina
Experimental Setup
[Figure: curves for Model F-score, SVM F-score, and All-text F-score]
Experimental Setup
Precondition prediction F-score
Introduction
Specifically, it yields an F-score of 66% compared to the 65% of the baseline.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Fowler, Timothy A. D. and Penn, Gerald
A Latent Variable CCG Parser
To determine statistical significance, we obtain p-values from Bikel's randomized parsing evaluation comparator, modified for use with tagging accuracy, F-score and dependency accuracy.
A Latent Variable CCG Parser
In this section we evaluate the parsers using the traditional PARSEVAL measures which measure recall, precision and F-score on constituents in
A Latent Variable CCG Parser
The Petrov parser has better results by a statistically significant margin for both labeled and unlabeled recall and unlabeled F-score .
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Jiampojamarn, Sittichai and Kondrak, Grzegorz
Intrinsic evaluation
We report the alignment quality in terms of precision, recall and F-score .
Intrinsic evaluation
The F-score corresponding to perfect precision and the upper-bound recall is 94.75%.
Intrinsic evaluation
Overall, the MM models obtain lower precision but higher recall and F-score than 1-1 models, which is to be expected as the gold standard is defined in terms of MM links.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Chen, Harr and Benson, Edward and Naseem, Tahira and Barzilay, Regina
Experimental Setup
For these reasons, we evaluate on both sentence-level and token-level precision, recall, and F-score .
Results
However, the best F-Score corresponding to the optimal number of clusters is 42.2, still far below our model’s 66.0 F-score .
Results
Our results show a large gap in F-score between the sentence and token-level evaluations for both the USP baseline and our model.
F-score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: