Error Classification | Using held-out validation data, we jointly tune the three parameters introduced in the previous paragraph to optimize the F-score achieved by b_i for error e_i. However, an exact solution to this optimization problem is computationally expensive.
Error Classification | Consequently, we find a local maximum by employing the simulated annealing algorithm (Kirkpatrick et al., 1983), altering one parameter at a time to optimize F-score by holding the remaining parameters fixed. |
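A minimal sketch of this kind of one-parameter-at-a-time annealing search (not the authors' implementation; the parameter dictionary, the perturbation scale, and the f_score_on_dev callback are placeholder assumptions):

```python
import math
import random

def anneal_one_at_a_time(params, f_score_on_dev, steps=200, t0=1.0, cooling=0.98):
    """Coordinate-wise simulated annealing over real-valued parameters.

    params         : dict of parameter name -> initial value (hypothetical names)
    f_score_on_dev : callable(dict) -> F-score on held-out validation data
    """
    current, current_f = dict(params), f_score_on_dev(params)
    best, best_f = dict(current), current_f
    temperature = t0
    for _ in range(steps):
        name = random.choice(list(current))        # alter one parameter at a time
        candidate = dict(current)
        candidate[name] += random.gauss(0.0, 0.1)  # small random perturbation (assumed scale)
        cand_f = f_score_on_dev(candidate)
        # always accept improvements; accept worse moves with annealing probability
        if cand_f >= current_f or random.random() < math.exp((cand_f - current_f) / temperature):
            current, current_f = candidate, cand_f
            if current_f > best_f:
                best, best_f = dict(current), current_f
        temperature *= cooling
    return best, best_f
```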
Error Classification | Other ways we could measure our system’s performance (such as macro F-score) would consider our system’s performance on the less frequent errors no less important than its performance on the more frequent ones.
Evaluation | To evaluate our thesis clarity error type identification system, we compute precision, recall, micro F-score, and macro F-score , which are calculated as follows. |
Evaluation | Then, the precision (P_i), recall (R_i), and F-score (F_i) for b_i and the macro F-score (F) of the combined system for one test fold are calculated as F_i = 2 P_i R_i / (P_i + R_i), with F taken as the average of the F_i over error types.
Evaluation | However, the macro F-score calculation can be seen as giving too much weight to the less frequent errors. |
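To make the two averaging schemes discussed above concrete, here is a small illustrative sketch (not taken from the paper): it computes a per-error-type F-score from hypothetical (tp, fp, fn) counts, the macro F-score as their unweighted mean, and the micro F-score from the pooled counts, so the frequent type dominates the micro score but counts no more than the rare type in the macro score.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f(counts):
    """counts: {error_type: (tp, fp, fn)} -> (macro F-score, micro F-score)."""
    macro = sum(prf(*c)[2] for c in counts.values()) / len(counts)
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    return macro, prf(*pooled)[2]

# toy data: one frequent and one rare error type
print(macro_micro_f({"frequent": (90, 10, 20), "rare": (1, 4, 5)}))
```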
Abstract | On top of the pruning framework, we also propose a discriminative ITG alignment model using hierarchical phrase pairs, which improves both F-score and Bleu score over the baseline alignment system of GIZA++. |
Evaluation | An alternative criterion is the upper bound on alignment F-score , which essentially measures how many links in annotated alignment can be kept in ITG parse. |
Evaluation | The calculation of F-score upper bound is done in a bottom-up way like ITG parsing. |
Evaluation | The upper bound of alignment F-score can thus be calculated as well. |
The DITG Models | The MERT module for DITG takes the alignment F-score of a sentence pair as the performance measure.
The DITG Models | Given an input sentence pair and the reference annotated alignment, MERT aims to maximize the F-score of DITG-produced alignment. |
Abstract | Our best approach achieves a roughly 15% absolute increase in F-score over a simple but reasonable baseline.
Results | We present the results in terms of F-score only for simplicity; we then conduct an error analysis that examines precision and recall. |
Results | Feature Set (F-score, %Imp): word (43.85, —); word+nw (43.86, 0.0); word+na (44.78, 2.1); word+lem (45.85, 4.6); word+pos (45.91, 4.7); word+nw+pos+lem+na (46.34, 5.7).
Discussions | For small seed sizes, the F-score of bilingual bootstrapping is consistently better than the F-score obtained by training only on the seed data without using any bootstrapping. |
Discussions | To further illustrate this, we take some sample points from the graph and compare the number of tagged words needed by BiBoot and OnlySeed to reach the same (or nearly the same) F-score . |
Experimental Setup | [Figure 1: Comparison of BiBoot, MonoBoot, OnlySeed and WFS on Hindi Health data; the plots show F-score against seed size (words).]
Results | a. BiBoot: This curve represents the F-score obtained after 10 iterations by using bilingual bootstrapping with different amounts of seed data. |
Results | b. MonoBoot: This curve represents the F-score obtained after 10 iterations by using monolingual bootstrapping with different amounts of seed data. |
Results | c. OnlySeed: This curve represents the F-score obtained by training on the seed data alone without using any bootstrapping. |
Classification Results | The table lists the f-score for each of the target relations, with overall accuracy shown in brackets. |
Classification Results | Given that the experiments are run on the natural distribution of the data, which is skewed towards Expansion relations, the f-score is the more important measure to track.
Classification Results | Our random baseline is the f-score one would achieve by randomly assigning classes in proportion to its true distribution in the test set. |
Abstract | We propose a new evaluation metric, alignment entropy, grounded in information theory, to evaluate alignment quality without the need for a gold standard reference, and compare the metric with F-score.
Experiments | Next we conduct three experiments to study 1) alignment entropy vs. F-score , 2) the impact of alignment quality on transliteration accuracy, and 3) how to validate transliteration using alignment metrics. |
Experiments | 5.1 Alignment entropy vs. F-score |
Experiments | We have manually aligned a random set of 3,000 transliteration pairs from the Xinhua training set to serve as the gold standard, on which we calculate the precision, recall and F-score as well as alignment entropy for each alignment. |
Related Work | Denoting the number of cross-lingual mappings that are common in both A and G as C_AG, the number of cross-lingual mappings in A as C_A and the number of cross-lingual mappings in G as C_G, precision Pr is given as C_AG/C_A, recall Rc as C_AG/C_G and F-score as 2·Pr·Rc/(Pr + Rc).
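A direct transcription of these definitions as a sketch, treating the system output A and the gold standard G as sets of cross-lingual mappings (the function name, variable names and toy data are mine):

```python
def mapping_prf(A, G):
    """Precision, recall and F-score over sets of cross-lingual mappings."""
    A, G = set(A), set(G)
    c_ag = len(A & G)                    # mappings common to both A and G
    pr = c_ag / len(A) if A else 0.0     # C_AG / C_A
    rc = c_ag / len(G) if G else 0.0     # C_AG / C_G
    return pr, rc, 2 * pr * rc / (pr + rc) if pr + rc else 0.0

print(mapping_prf({("a", "x"), ("b", "y"), ("c", "z")},
                  {("a", "x"), ("b", "y"), ("d", "w")}))  # (0.667, 0.667, 0.667)
```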
Transliteration alignment entropy | We expect and will show that this estimate is a good indicator of the alignment quality, and is as effective as the F-score , but without the need for a gold standard reference. |
Abstract | Our result, an F-score of 73.74%, outperforms the state-of-the-art methods on a manually labeled test dataset. |
Abstract | Moreover, combining our method with a previous manually-built hierarchy extension method can further improve F-score to 80.29%. |
Experimental Setup | We use precision, recall, and F-score as our metrics to evaluate the performances of the methods. |
Introduction | The experimental results show that our method achieves an F-score of 73.74% which significantly outperforms the previous state-of-the-art methods. |
Introduction | (2008) can further improve F-score to 80.29%. |
Results and Analysis 5.1 Varying the Amount of Clusters | Table 3 shows that the proposed method achieves better recall and F-score than all of the previous methods.
Results and Analysis 5.1 Varying the Amount of Clusters | It can significantly (p < 0.01) improve the F-score over the state-of-the-art method MWikHCilmE. |
Results and Analysis 5.1 Varying the Amount of Clusters | The F-score is further improved from 73.74% to 76.29%. |
Experimental Setup | We report an F-score as well (the harmonic mean of precision and recall). |
Experimental Setup | We use the standard parsing F-score evaluation measure. |
Introduction | We use two measures to evaluate the performance of our algorithm, precision and F-score . |
Introduction | Precision reflects the algorithm’s applicability for creating training data to be used by supervised SRL models, while the standard SRL F-score measures the model’s performance when used by itself. |
Introduction | The first stage of our algorithm is shown to outperform a strong baseline both in terms of F-score and of precision. |
Related Work | Better performance is achieved on the classification, where state-of-the-art supervised approaches achieve about 81% F-score on the in-domain identification task, of which about 95% are later labeled correctly (Marquez et al., 2008). |
Results | In the “Collocation Maximum F-score” setting, the collocation parameters were tuned such that the maximum possible F-score for the collocation algorithm was achieved.
Results | The best or close to best F-score is achieved when using the clause detection algorithm alone (59.14% for English, 23.34% for Spanish). |
Results | Note that for both English and Spanish F-score improvements are achieved via a precision improvement that is more significant than the recall degradation. |
Experiments | Table 2 shows that F-score has dropped by 0.61%. |
Introduction | These features are targeted at improving the recovery of NP structure, increasing parser performance by 0.64% F-score . |
Evaluation | F-Score (F) is the harmonic mean of precision and recall; it is the most common non-referential detection metric.
Results | Table 4 gives precision, recall, F-score , and accuracy on the Train/ Test split. |
Results | Note that while the LL system has high detection precision, it has very low recall, sharply reducing F-score . |
Results | The MINIPL approach sacrifices some precision for much higher recall, but again has fairly low F-score . |
Evaluation Setup | First, following previous work, we evaluate our method using the labeled and unlabeled predicate-argument dependency F-score . |
Evaluation Setup | The dependency F-score captures both the target- |
Experiment and Analysis | For instance, there is a gain of 6.2% in labeled dependency F-score for HPSG formalism when 15,000 CFG trees are used. |
Experiment and Analysis | Across all three grammars, we can observe that adding CFG data has a more pronounced effect on the PARSEVAL measure than the dependency F-score . |
Experiment and Analysis | On the other hand, predicate-argument dependency F-score (Figure 5ac) also relies on the target grammar information. |
Implementation | This results in a drop on the dependency F-score by about 5%. |
Introduction | For instance, the model trained on 500 HPSG sentences achieves labeled dependency F-score of 72.3%. |
Introduction | Adding 15,000 Penn Treebank sentences during training leads to 78.5% labeled dependency F-score , an absolute improvement of 6.2%. |
Abstract | Evaluation on the Penn Chinese Treebank indicates that a converted dependency treebank helps constituency parsing and the use of unlabeled data by self-training further increases parsing f-score to 85.2%, resulting in 6% error reduction over the previous best result. |
Experiments of Grammar Formalism Conversion | Finally Q-10-method achieved an f-score of 93.8% on WSJ section 22, an absolute 4.4% improvement (42% error reduction) over the best result of Xia et al. |
Experiments of Grammar Formalism Conversion | Finally Q-10-method achieved an f-score of 93.6% on WSJ sections 2~18 and 20~22, better than that of Q-0-method and comparable with that of Q-10-method in Section 3.1.
Experiments of Parsing | Finally we decided that the optimal value of λ was 0.4 and the optimal weight of CTB was 1, which brought the best performance on the development set (an f-score of 86.1%).
Experiments of Parsing | In comparison with the results in Section 4.1, the average index of converted trees in 200-best list increased to 2, and their average unlabeled dependency f-score dropped to 65.4%. |
Experiments of Parsing | 84.2% f-score , better than the result of the reranking parser with CTB and CDTPS as training data (shown in Table 5). |
Introduction | Our conversion method achieves 93.8% f-score on dependency trees produced from WSJ section 22, resulting in 42% error reduction over the previous best result for DS to PS conversion. |
Introduction | When coupled with self-training technique, a reranking parser with CTB and converted CDT as labeled data achieves 85.2% f-score on CTB test set, an absolute 1.0% improvement (6% error reduction) over the previous best result for Chinese parsing. |
Our Two-Step Solution | Therefore we modified the selection metric in Section 2.1 by interpolating two scores, the probability of a conversion candidate from the parser and its unlabeled dependency f-score , shown as follows: |
Conclusion and Future Work | In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
Experiments | We evaluate detection performance by token-level precision, recall and F-score (β = 1).
Experiments | For candidate selection, we once again evaluate using token-level precision, recall and F-score . |
Experiments | Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation. |
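A hedged sketch of this kind of evaluation, assuming one output token per input token for the token-level metrics and using NLTK's sentence-level BLEU as a stand-in scorer for the normalised message (the example tokens are invented):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def token_prf(src, sys, gold):
    """Token-level P/R/F for normalisation, assuming aligned, equal-length token streams."""
    changed = [i for i, (s, o) in enumerate(zip(src, sys)) if s != o]   # tokens the system altered
    needed = [i for i, (s, g) in enumerate(zip(src, gold)) if s != g]   # tokens requiring normalisation
    correct = [i for i in changed if i in needed and sys[i] == gold[i]]
    p = len(correct) / len(changed) if changed else 0.0
    r = len(correct) / len(needed) if needed else 0.0
    return p, r, 2 * p * r / (p + r) if p + r else 0.0

def message_bleu(sys, gold):
    """BLEU over one normalised message; robust to token insertions/deletions."""
    return sentence_bleu([gold], sys, smoothing_function=SmoothingFunction().method1)

src  = "c u 2moro at skool".split()
gold = "see you tomorrow at school".split()
sys_ = "see you 2moro at school".split()
print(token_prf(src, sys_, gold), message_bleu(sys_, gold))
```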
Introduction | Using an adapted supertagger with ambiguity levels tuned to match the baseline system, we were also able to increase F-score on labelled grammatical relations by 0.75%. |
Results | Interestingly, while the decrease in supertag accuracy in the previous experiment did not translate into a decrease in F-score, the increase in tag accuracy here does translate into an increase in F-score . |
Results | The increase in F-score has two sources. |
Results | As Table 6 shows, this change translates into an improvement of up to 0.75% in F-score on Section |
Word segmentation with adaptor grammars | We evaluated the f-score of the recovered word constituents (Goldwater et al., 2006b). |
Word segmentation with adaptor grammars | Table 1: Word segmentation f-score results for all models, as a function of the DP concentration parameter α.
Word segmentation with adaptor grammars | With α = 1 and α = 10 we obtained a word segmentation f-score of 0.55.
Experiments | Three metrics are used for evaluation: precision (P), recall (R) and balanced f-score (F) defined by 2PR/(P+R). |
Experiments | The baseline of the character-based joint solver (CTagctb) is competitive, and achieves an f-score of 92.93. |
Experiments | The tagging model achieves an f-score of 94.03.
Introduction | Our structure-based stacking model achieves an f-score of 94.36, which is superior to a feature-based stacking model introduced in (Jiang et al., 2009). |
Introduction | Our final system achieves an f-score of 94.68, which yields a relative error reduction of 11% over the best published result (94.02). |
Experiments | Figure 8 reports the accuracy, macro F-score, and micro F-score . |
Experiments | It shows that the BR learner produces better accuracy and micro F-score than the FR learner, but a slightly worse macro F-score.
Discussion and Conclusion | With the 26 proposed features derived from decoding process and source sentence syntactic analysis, the proposed QE model achieved better TER prediction, higher correlation with human correction of MT output and higher F-score in finding good translations. |
Experiments | Here we report the precision, recall and F-score of finding such “Good” sentences (with TER ≤ 0.1) on the three documents in Table 3.
Experiments | Again, the adaptive QE model produces higher recall, mostly higher precision, and significantly improved F-score . |
Experiments | The overall F-score of the adaptive QE model is 0.282. |
Evaluation of lexical similarity in context | In case one wants to optimize the F-score (the harmonic mean of precision and recall) when extracting relevant pairs, we can see that the optimal point is at .24 for a threshold of .22 on Lin’s score. |
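Such an F-optimal operating point can be found by sweeping candidate thresholds over the scored pairs; a minimal sketch with invented toy data (the function and variable names are mine):

```python
def best_threshold(scored_pairs):
    """scored_pairs: (similarity_score, is_relevant) tuples.

    Returns the (threshold, F-score) maximising F when extracting pairs with score >= threshold.
    """
    total_relevant = sum(1 for _, rel in scored_pairs if rel)
    best = (None, 0.0)
    for t in sorted({s for s, _ in scored_pairs}):
        extracted = [rel for s, rel in scored_pairs if s >= t]
        tp = sum(extracted)
        p = tp / len(extracted) if extracted else 0.0
        r = tp / total_relevant if total_relevant else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        if f > best[1]:
            best = (t, f)
    return best

print(best_threshold([(0.9, True), (0.5, True), (0.3, False), (0.22, True), (0.1, False)]))
```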
Experiments: predicting relevance in context | Other popular methods (maximum entropy, SVM) have shown a slightly inferior combined F-score, even though precision and recall individually show larger variations.
Experiments: predicting relevance in context | As a baseline, we can also consider a simple threshold on the lexical similarity score, in our case Lin’s measure, which we have shown to yield the best F-score of 24% when set at 0.22. |
Experiments: predicting relevance in context | If we take the best simple classifier (random forests), the precision and recall are 68.1% and 24.2% for an F-score of 35.7%, and this is significantly beaten by the Naive Bayes method as precision and recall are more even ( F-score of 41.5%). |
Related work | Recall F-score 40.4 54.3 46.3 37.4 52.8 43.8 36.1 49.5 41.8 36.5 54.8 43.8 |
Experiments | features in the updated version. However, our initial experiments show that, even with this much simpler feature set, our 50-best reranker performed equally well as theirs (both with an F-score of 91.4; see Tables 3 and 4).
Experiments | With only local features, our forest reranker achieves an F-score of 91.25, and with the addition of non- |
Forest Reranking | where function F returns the F-score.
Forest Reranking | In case multiple candidates get the same highest F-score, we choose the parse with the highest log probability from the baseline parser to be the oracle parse (Collins, 2000).
Introduction | we achieved an F-score of 91.7, which is a 19% error reduction from the 1-best baseline, and outperforms both 50-best and 100-best reranking.
Supporting Forest Algorithms | 4.1 Forest Oracle. Recall that the Parseval F-score is the harmonic mean of labelled precision P and labelled recall R: F = 2PR / (P + R) = 2|y ∩ y*| / (|y| + |y*|).
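A literal sketch of the formula above, with brackets represented as (label, start, end) tuples and |y ∩ y*| taken as a multiset intersection:

```python
from collections import Counter

def parseval_f(test_brackets, gold_brackets):
    """Parseval F-score: F = 2PR/(P+R) = 2|y ∩ y*| / (|y| + |y*|)."""
    if not test_brackets or not gold_brackets:
        return 0.0
    matched = sum((Counter(test_brackets) & Counter(gold_brackets)).values())  # |y ∩ y*|
    return 2 * matched / (len(test_brackets) + len(gold_brackets))

gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)]
test = [("S", 0, 5), ("NP", 0, 1), ("VP", 2, 5)]
print(parseval_f(test, gold))  # 2 matched brackets out of 3 + 3 -> 0.667
```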
Supporting Forest Algorithms | In other words, the optimal F-score tree in a forest is not guaranteed to be composed of two optimal F-score subtrees. |
Supporting Forest Algorithms | Shown in Pseudocode 4, we perform these computations in a bottom-up topological order, and finally at the root node TOP, we can compute the best global F-score by maximizing over different numbers of test brackets (line 7). |
Introduction | Experiments on the data from the Chinese Treebank (CTB-7) and Microsoft Research (MSR) show that the proposed model results in significant improvement over other comparative candidates in terms of F-score and out-of-vocabulary (OOV) recall.
Method | The performance measurement indicators for word segmentation and POS tagging (joint S&T) are the balanced F-score, F = 2PR/(P+R), the harmonic mean of precision (P) and recall (R), and out-of-vocabulary recall (OOV-R).
Method | It obtains increases of 0.92% and 2.32% in terms of F-score and OOV-R respectively.
Method | On the whole, for segmentation, they achieve average improvements of 1.02% and 6.8% in F-score and OOV-R; whereas for POS tagging, the average increments of F-score and OOV-R are 0.87% and 6.45%.
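A sketch of how these two indicators can be computed for word segmentation, matching predicted and gold words by their character spans and measuring OOV recall over gold words absent from the training lexicon (helper names and the toy example are assumptions, not the paper's code):

```python
def to_spans(words):
    """Character spans (start, end) of each word in a segmentation, in order."""
    spans, start = [], 0
    for w in words:
        spans.append((start, start + len(w)))
        start += len(w)
    return spans

def seg_f_and_oov_recall(sys_words, gold_words, train_lexicon):
    """Balanced F-score F = 2PR/(P+R) over word spans, plus OOV recall (OOV-R)."""
    sys_spans = set(to_spans(sys_words))
    gold_spans = to_spans(gold_words)
    correct = sys_spans & set(gold_spans)
    p = len(correct) / len(sys_spans)
    r = len(correct) / len(gold_spans)
    f = 2 * p * r / (p + r) if p + r else 0.0
    oov = [span for span, w in zip(gold_spans, gold_words) if w not in train_lexicon]
    oov_r = sum(1 for span in oov if span in sys_spans) / len(oov) if oov else 0.0
    return f, oov_r

print(seg_f_and_oov_recall(["我", "喜欢", "自然", "语言"],
                           ["我", "喜欢", "自然语言"],
                           {"我", "喜欢"}))  # (0.571..., 0.0)
```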
Related Work | Prior supervised joint S&T models present approximate 0.2% - 1.3% improvement in F-score over supervised pipeline ones. |
Abstract | Using a rich set of shallow lexical, syntactic and structural features from the input text, our parser achieves, in linear time, 73.9% of professional annotators’ human agreement F-score . |
Building a Discourse Parser | Current state-of-the-art results in automatic segmenting are much closer to human levels than full structure labeling ( F-score ratios of automatic performance over gold standard reported in LeThanh et al. |
Evaluation | Standard performance indicators for such a task are precision, recall and F-score as measured by the PARSEVAL metrics (Black et al., 1991), with the specific adaptations to the case of RST trees made by Marcu (2000, page 143-144). |
Evaluation |             S      N      R      F   |  S      N      R      F
Evaluation | Precision   83.0   68.4   55.3   54.8 |  69.5   56.1   44.9   44.4
Evaluation | Recall      83.0   68.4   55.3   54.8 |  69.2   55.8   44.7   44.2
Evaluation | F-Score     83.0   68.4   55.3   54.8 |  69.3   56.0   44.8   44.3
Evaluation | (three column groups of S / N / R / F; group headers in the source include Manual and SPADE)
Evaluation | Precision   84.1  70.6  55.6  55.1 |  70.6  58.1  46.0  45.6 |  88.0  77.5  66.0  65.2
Evaluation | Recall      84.1  70.6  55.6  55.1 |  71.2  58.6  46.4  46.0 |  88.1  77.6  66.1  65.3
Evaluation | F-Score     84.1  70.6  55.6  55.1 |  70.9  58.3  46.2  45.8 |  88.1  77.5  66.0  65.3
Abstract | Additionally, we remove low confidence alignment links from the word alignment of a bilingual training corpus, which increases the alignment F-score , improves Chinese-English and Arabic-English translation quality and significantly reduces the phrase translation table size. |
Alignment Link Confidence Measure | Table 2 shows the precision, recall and F-score of the individual alignments and the combined alignment.
Alignment Link Confidence Measure | Overall it improves the F-score by 1.5 points (from 69.3 to 70.8), with a 1.8-point improvement for content words and a 1.0-point improvement for function words.
Improved MaxEnt Aligner with Confidence-based Link Filtering | Baseline: Precision 72.66, Recall 66.17, F-score 69.26; +ALF: Precision 78.14, Recall 64.36, F-score 70.59.
Improved MaxEnt Aligner with Confidence-based Link Filtering | Baseline: Precision 84.43, Recall 83.64, F-score 84.04; +ALF: Precision 88.29, Recall 83.14, F-score 85.64.
Sentence Alignment Confidence Measure | The results in Figure 2 show a strong correlation between the confidence measure and the alignment F-score, with a correlation coefficient of -0.69.
Abstract | This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case “function words”, setting a new state-of-the-art of 92.4% token f-score . |
Introduction | While absolute accuracy is not directly relevant to the main point of the paper, we note that the models that learn generalisations about function words perform unsupervised word segmentation at 92.5% token f-score on the standard Bernstein-Ratner (1987) corpus, which improves the previous state-of-the-art by more than 4%. |
Introduction | that achieves the best token f-score expects function words to appear at the left edge of phrases. |
Word segmentation results | Baseline: f-score 0.872, precision 0.918, recall 0.956; + left FWs: f-score 0.924, precision 0.935, recall 0.990; + left + right FWs: f-score 0.912, precision 0.957, recall 0.953.
Word segmentation results | Figure 2 presents the standard token and lexicon (i.e., type) f-score evaluations for word segmentations proposed by these models (Brent, 1999), and Table 1 summarises the token and lexicon f-scores for the major models discussed in this paper. |
Word segmentation results | It is interesting to note that adding “function words” improves token f-score by more than 4%, corresponding to a 40% reduction in overall error rate. |
Word segmentation with Adaptor Grammars | The starting point and baseline for our extension is the adaptor grammar with syllable structure phonotactic constraints and three levels of collocational structure (5-21), as prior work has found that this yields the highest word segmentation token f-score (Johnson and Goldwater, 2009).
Discourse vs. non-discourse usage | Using the string of the connective as the only feature sets a reasonably high baseline, with an f-score of 75.33% and an accuracy of 85.86%. |
Discourse vs. non-discourse usage | Interestingly, using only the syntactic features, ignoring the identity of the connective, is even better, resulting in an f-score of 88.19% and accuracy of 92.25%. |
Discourse vs. non-discourse usage | Using both the connective and syntactic features is better than either individually, with an f-score of 92.28% and accuracy of 95.04%. |
Related Work | MEANT (Lo et al., 2012), the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers, outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgments.
Related Work | In this paper, we employ a newer version of MEANT that uses f-score to aggregate individual token similarities into the composite phrasal similarities of semantic role fillers, as our experiments indicate this is more accurate than the previously used aggregation functions. |
Related Work | Compute the weighted f-score over the matching role labels of these aligned predicates and role fillers according to the definitions similar to those in section 2.2 except for replacing REF with IN in qij and wil . |
Results | Table 1 shows that for human adequacy judgments at the sentence level, the f-score based XMEANT (l) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT. |
XMEANT: a cross-lingual MEANT | 3.1 Applying MEANT’s f-score within semantic role fillers |
XMEANT: a cross-lingual MEANT | The first natural approach is to extend MEANT’s f-score based method of aggregating semantic parse accuracy, so as to also apply to aggregat- |
Experiment | Both the f-score and OOV recall increase.
Experiment | Comparing No-balance and ADD-N alone, we find that ignoring the tag balance issue yields a relatively high f-score but slightly hurts the OOV recall.
INTRODUCTION | For example, the most widely used Chinese segmenter, ICTCLAS, yields an f-score of 0.95 on news corpora but only 0.82 on micro-blog data.
Abstract | The results show a significant improvement in Chinese relation extraction, outperforming other methods in F-score by 10% in 6 relation types and 15% in 18 relation subtypes. |
Feature Construction | F-score is computed by F = 2PR / (P + R).
Feature Construction | In Row 2, with only the F_ow feature, the F-score already reaches 77.74% in 6 types and 60.31% in 18 subtypes.
Feature Construction | In Table 3, it is shown that our system outperforms other systems, in F-score , by 10% on 6 relation types and by 15% on 18 subtypes. |
Introduction | The performance of relation extraction is still unsatisfactory, with an F-score of 67.5% for English (23 subtypes) (Zhou et al., 2010).
Introduction | Chinese relation extraction also performs weakly, with an F-score of about 66.6% over 18 subtypes (Dandan et al., 2012).
Abstract | The results using Reuters documents showed that the method was comparable to the current state-of-the-art biased-SVM method as the F-score obtained by our method was 0.627 and biased-SVM was 0.614. |
Conclusion | The results using the 1996 Reuters corpora showed that the method was comparable to the current state-of-the-art biased-SVM method as the F-score obtained by our method was 0.627 and biased-SVM was 0.614. |
Experiments | We empirically selected values of two parameters, “c” (the trade-off between training error and margin) and “j” (the cost-factor by which training errors on positive examples outweigh errors on negative examples), that optimized the F-score obtained by classification of test documents.
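A rough equivalent of this parameter search, sketched here with scikit-learn rather than the SVMlight implementation the passage refers to; class_weight={1: j} stands in for SVMlight's positive cost-factor -j, and the grids, names and data interfaces are assumptions:

```python
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def tune_c_and_j(X_train, y_train, X_dev, y_dev,
                 c_grid=(0.1, 1, 10, 100), j_grid=(1, 2, 5, 10)):
    """Grid search over the error/margin trade-off c and the positive cost-factor j,
    keeping the pair that maximises F-score on held-out documents (labels assumed 0/1)."""
    best = (None, None, -1.0)
    for c in c_grid:
        for j in j_grid:
            clf = SVC(kernel="linear", C=c, class_weight={1: j}).fit(X_train, y_train)
            f = f1_score(y_dev, clf.predict(X_dev))
            if f > best[2]:
                best = (c, j, f)
    return best  # (best c, best j, dev F-score)
```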
Experiments | Figure 3 shows micro-averaged F-score against the 6 value. |
Abstract | Experiments show that our approach achieves a statistically significant increase of 13.5% in F-score and 37% in area under the precision recall curve. |
Available at http://nlp.stanford.edu/software/mimlre.shtml | Figure 2 shows that our model consistently outperforms all six algorithms at almost all recall levels and improves the maximum F-score by more than 13.5% relative to MIML (from 28.35% to 32.19%), as well as increasing the area under the precision-recall curve by more than 37% (from 11.74 to 16.1).
Available at http://nlp.stanford.edu/software/mimlre.shtml | The performance of Guided DS also compares favorably with the best scoring hand-coded systems for a similar task, such as the Sun et al. (2011) system for KBP 2011, which reports an F-score of 25.7%.
Introduction | posed approach, we extend MIML (Surdeanu et al., 2012), a state-of-the-art distant supervision model and show a significant improvement of 13.5% in F-score on the relation extraction benchmark TAC-KBP (Ji and Grishman, 2011) dataset. |
Introduction | While prior work employed tens of thousands of human labeled examples (Zhang et al., 2012) and only got a 6.5% increase in F-score over a logistic regression baseline, our approach uses much less labeled data (about 1/8) but achieves much higher improvement on performance over stronger baselines. |
Training | Training MIML on a simple fusion of distantly-labeled and human-labeled datasets does not improve the maximum F-score since this hand-labeled data is swamped by a much larger amount of distant-supervised data of much lower quality. |
Discussion | Table 8 summarises the results, showing that the error reduction rate (ERR) over the parsing F-score is up to 6.9%, which is remarkable given the relatively superficial strategy for incorporating sense information into the parser.
Experimental setting | We evaluate the parsers via labelled bracketing recall (R), precision (P) and F-score (F1).
Results | The SFU representation produces the best results for Bikel (F-score 0.010 above baseline), while for Charniak the best performance is obtained with word+SF ( F-score 0.007 above baseline). |
Results | Overall, Bikel obtains a superior F-score in all configurations. |
Results | Again, the F-score for the semantic representations is better than the baseline in all cases. |
Oracle Parsing | To answer this question we computed oracle best and worst values for labelled dependency F-score using the algorithm of Huang (2008) on the hybrid model of Clark and Curran (2007), the best model of their C&C parser. |
Oracle Parsing | Digging deeper, we compared parser model score against Viterbi F-score and oracle F-score at a va-
Conclusion and outlook | We find that our Bigram model reaches 77% /t/-recovery F-score when run with knowledge of true word boundaries and when it can make use of both the preceding and the following phonological context, and that unlike the Unigram model it is able to learn the probability of /t/-deletion in different contexts.
Conclusion and outlook | When performing joint word segmentation on the Buckeye corpus, our Bigram model reaches around 55% F-score for recovering deleted /t/s, with a word segmentation F-score of around 72%, which is 2% better than running a Bigram model that does not model /t/-deletion.
Experiments 4.1 The data | We evaluate the model in terms of F-score , the harmonic mean of recall (the fraction of underlying /t/s the model correctly recovered) and precision (the fraction of underlying /t/s the model predicted that were correct). |
Experiments 4.1 The data | Looking at the segmentation performance this isn’t too surprising: the Unigram model’s poorer token F-score , the standard measure of segmentation performance on a word token level, suggests that it misses many more boundaries than the Bigram model to begin with and, consequently, can’t recover any potential underlying /t/s at these boundaries. |
Experiments 4.1 The data | The generally worse performance of handling variation as measured by /t/-recovery F-score when performing joint segmentation is consistent with the finding of Elsner et al. |
Experiments and Results | Table 2 depicts the exact numbers of manually labeled tokens to reach the maximal (supervised) F-score on both corpora. |
Experiments and Results | On the MUC7 corpus, FuSAL requires 7,374 annotated NPs to yield an F-score of 87%, while SeSAL hits the same F-score with only 4,017 NPs.
Experiments and Results | On PennBioIE, SeSAL also saves about 45% compared to FuSAL to achieve an F-score of 81%.
Abstract | Our results show that our approach achieves 80.0% F-Score accuracy, compared to an F-Score of 66.7% produced by a state-of-the-art semantic parser, on a dataset of input format specifications from the ACM International Collegiate Programming Contest (which were written in English for humans with no intention of providing support for automated processing).
Experimental Results | The two versions achieve very close performance (80% vs 84% in F-Score ), even though Full Model is trained with noisy feedback. |
Experimental Setup | Model | Recall | Precision | F-Score
Introduction | However, when trained using the noisy supervision, our method achieves substantially more accurate translations than a state-of-the-art semantic parser (Clarke et al., 2010) (specifically, 80.0% in F-Score compared to an F-Score of 66.7%).
Introduction | The strength of our model in the face of such weak supervision is also highlighted by the fact that it retains an F-Score of 77% even when only one input example is provided for each input |
Abstract | Standard CCGBank tests show the model achieves up to 1.05 labeled F-score improvements over three existing, competitive CCG parsing models. |
Experiments | On both the full and reduced sets, our parser achieves the highest F-score . |
Experiments | In comparison with C&C, our parser shows significant increases across all metrics, with 0.57% and 1.06% absolute F-score improvements over the hybrid and normal-form models, respectively. |
Experiments | While our parser achieved lower precision than Z&C, it is more balanced and gives higher recall for all of the dependency relations except the last one, and higher F-score for over half of them. |
Introduction | Results on the standard CCGBank tests show that our parser achieves absolute labeled F-score gains of up to 0.5 over the shift-reduce parser of Zhang and Clark (2011); and up to 1.05 and 0.64 over the normal-form and hybrid models of Clark and Curran (2007), respectively. |
Experimental Setup | [Figure: recall, precision and F-score under ROUGE-1 and ROUGE-L.]
Results | F-score is higher for the phrase-based system but not significantly. |
Results | The sentence ILP model outperforms the lead baseline with respect to recall but not precision or F-score . |
Results | The phrase ILP achieves a significantly better F-score over the lead baseline with both ROUGE-1 and ROUGE-L.
Experiments | We only compare the F-score, since all the compared systems have an attempted rate of 1.0,
Experiments | F-score (F1).
Abstract | Our NR classification evaluation strictly follows the ACL SemEval-07 Task 4 datasets and protocol, obtaining an f-score of 70.6, as opposed to 64.8 of the best previous work that did not use the manually provided WordNet sense disambiguation tags. |
Results | In fact, our results ( f-score 62.0, accuracy 64.5) are better than the averaged results (58.0, 61.1) of the group that did not utilize WN tags. |
Results | Table 2 shows the HITS-based classification results ( F-score and Accuracy) and the number of positively labeled clusters (C) for each relation. |
Results | We have used the exact evaluation procedure described in (Turney, 2006), achieving a class f-score average of 60.1, as opposed to 54.6 in (Turney, 2005) and 51.2 in (Nastase et al., 2006). |
Evaluation | We calculated precision, recall, and f-score for our system, the baselines, and the upper bound as follows, with all_system being the number of pairs labelled as paraphrase or happens-before, all_gold the respective number of pairs in the gold standard, and correct the number of pairs labeled correctly by the system.
Evaluation | The f-score for the upper bound is in the column upper. |
Evaluation | For the f-score values, we calculated the significance for the difference between our system and the baselines as well as the upper bound, using a resampling test (Edgington, 1986). |
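A sketch of a paired approximate-randomization (resampling) test on an F-score difference, in the spirit of the test cited above; representing each test item by its (tp, fp, fn) counts, and all names, are my assumptions:

```python
import random

def pooled_f(items):
    """items: list of per-item (tp, fp, fn) counts; returns the pooled F-score."""
    tp, fp, fn = (sum(c[i] for c in items) for i in range(3))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def randomization_test(sys_a, sys_b, trials=10000, seed=0):
    """Estimate a p-value for the observed F-score difference between two systems
    evaluated on the same items, by randomly swapping their per-item outputs."""
    rng = random.Random(seed)
    observed = abs(pooled_f(sys_a) - pooled_f(sys_b))
    hits = 0
    for _ in range(trials):
        a, b = [], []
        for ca, cb in zip(sys_a, sys_b):
            if rng.random() < 0.5:     # swap the two systems' outputs on this item
                ca, cb = cb, ca
            a.append(ca)
            b.append(cb)
        if abs(pooled_f(a) - pooled_f(b)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```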
Experiment | As we can see, by using tag embeddings, the F-score is improved by +0.6% and OOV recall is improved by +1.0%, which shows that tag embeddings succeed in modeling the tag-tag and tag-character interactions.
Experiment | The F-score is improved by +0.6% while OOV recall is improved by +3.2%, which denotes that tensor-based transformation captures more interactional information than simple nonlinear transformation. |
Experiment | As shown in Table 5 (last three rows), both the F-score and OOV recall of our model improve with pre-training.
Abstract | We show that this method generates output closer to the feedback that lecturers actually generated, achieving 3.5% higher accuracy and 15% higher F-score than multiple simple classifiers that keep a history of selected templates. |
Evaluation | The accuracy, the weighted precision, the weighted recall, and the weighted F-score of the classifiers are shown in Table 3. |
Evaluation | It was found that in 10-fold cross validation RAkEL performs significantly better in all these automatic measures (accuracy = 76.95%, F-score = 85.50%). |
Evaluation | Remarkably, ML achieves more than 10% higher F-score than the other methods (Table 3). |
Experiments | of our system that approximates the submodular objective function proposed by Lin and Bilmes (2011). As shown in the results, our best system, which uses the hs dispersion function, achieves a better ROUGE-1 F-score than all other systems.
Experiments | (4) To understand the effect of utilizing syntactic structure and semantic similarity for constructing the summarization graph, we ran the experiments using just the unigrams and bigrams; we obtained a ROUGE-1 F-score of 37.1. |
Experiments | Note that Lin & Bilmes (2011) report a slightly higher ROUGE-1 score (F-score 38.90) on DUC 2004.
A UCCA-Annotated Corpus | We derive an F-score from these counts. |
A UCCA-Annotated Corpus | The table presents the average F-score between the annotators, as well as the average F-score when comparing to the gold standard. |
A UCCA-Annotated Corpus | An average taken over a sample of passages annotated by all four annotators yielded an F-score of 93.7%. |
Experimental Setup | We follow the suggestion of Scharenborg et al. (2010) and use a 20-ms tolerance window to compute the recall, precision and F-score of the segmentation our model proposes for TIMIT's training set.
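A minimal sketch of such tolerance-window boundary scoring (greedy one-to-one matching within ±20 ms; boundary times in seconds, and the toy values, are mine):

```python
def boundary_prf(hyp_boundaries, ref_boundaries, tolerance=0.020):
    """Precision, recall and F-score for phone-boundary detection.

    A hypothesised boundary is correct if it lies within `tolerance` seconds of a
    still-unmatched reference boundary (each reference boundary matches at most once)."""
    unmatched = sorted(ref_boundaries)
    hits = 0
    for b in sorted(hyp_boundaries):
        match = next((r for r in unmatched if abs(r - b) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    p = hits / len(hyp_boundaries) if hyp_boundaries else 0.0
    r = hits / len(ref_boundaries) if ref_boundaries else 0.0
    return p, r, 2 * p * r / (p + r) if p + r else 0.0

print(boundary_prf([0.10, 0.31, 0.58], [0.11, 0.30, 0.45, 0.57]))  # (1.0, 0.75, 0.857...)
```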
Introduction | the-art unsupervised method and improves the relative F-score by 18.8 points (Dusan and Rabiner, 2006). |
Results | Recall / Precision / F-score (%): Dusan (2006): 75.2 / 66.8 / 70.8; Qiao et al.
Results | When compared to the baseline in which the number of phone boundaries in each utterance was also unknown (Dusan and Rabiner, 2006), our model outperforms in both recall and precision, improving the relative F-score by 18.8%. |
Experiment | We evaluated the performance (F-score) of our model on the three development sets using different α values, where α is progressively increased in steps of 0.1 (0 < α < 1.0).
Experiment | Table 2 shows the F-score results of word segmentation on CTB-5, CTB-6 and CTB-7 testing sets. |
Experiment | Table 2: F-score (%) results of five CWS models on CTB-5, CTB-6 and CTB-7. |
Experiments | The proposed method achieved about 44% recall and nearly 80% precision, outperforming all other systems in terms of precision, F-score and average precision.
Experiments | Table 4: Recall (R), precision (P), F-score (F) and average precision (aP) of the problem report recognizers. |
Experiments | Table 6: Recall (R), precision (P), F-score (F) and average precision (aP) of the problem-aid match recognizers. |
Evaluation | For four out of five conditions its F-score performance outperforms the baselines by 42-83%. |
Evaluation | These are the Most Frequent SCF (O’Donovan et al., 2005) which uniformly assigns to all verbs the two most frequent SCFs in general language, transitive (SUBJ-DOBJ) and intransitive (SUBJ) (and results in poor F-score ), and a filtering that removes frames with low corpus frequencies (which results in low recall even when trying to provide the maximum recall for a given precision level). |
Evaluation | The task we address is therefore to improve the precision of the corpus statistics baseline in a way that does not substantially harm the F-score . |
Evaluation | The third row is similar, but for sentences for which the oracle F-score is greater than 92%.
The CCG to PTB Conversion | shows that converting gold-standard CCG derivations into the GRs in DepBank resulted in an F-score of only 85%; hence the upper bound on the performance of the CCG parser, using this evaluation scheme, was only 85%. |
The CCG to PTB Conversion | The numbers are bracketing precision, recall, F-score and complete sentence matches, using the EVALB evaluation script. |
Experiment 3: Sense Similarity | Table 7: F-score sense merging evaluation on three hand-labeled datasets: OntoNotes (Onto), Senseval-2 (SE-2), and combined (Onto+SE-2). |
Experiment 3: Sense Similarity | For a binary classification task, we can directly calculate precision, recall and F-score by constructing a contingency table. |
Experiment 3: Sense Similarity | In addition, we show in Table 7 the F-score results provided by Snow et al. |
Experiments | To evaluate the parsing performance, we use the standard unlabeled (i.e., hierarchical spans) and labeled (i.e., nuclearity and relation) precision, recall and F-score as described in (Marcu, 2000b). |
Experiments | Table 2 presents F-score parsing results for our parsers and the existing systems on the two corpora. On both corpora, our parsers, namely 1S-1S (TSP 1-1) and sliding window (TSP SW), outperform existing systems by a wide margin (p < 7.1e-05). On RST-DT, our parsers achieve absolute F-score improvements of 8%, 9.4% and 11.4% in span, nuclearity and relation, respectively, over HILDA.
Experiments | On the Instructional genre, our parsers deliver absolute F-score improvements of 10.5%, 13.6% and 8.14% in span, nuclearity and relations, respectively, over the ILP-based approach. |
Experimental Setup | [Figure: precondition prediction F-score; curves for the Model, SVM and All-text systems.]
Introduction | Specifically, it yields an F-score of 66% compared to the 65% of the baseline. |
A Latent Variable CCG Parser | To determine statistical significance, we obtain p-values from Bikel's randomized parsing evaluation comparator, modified for use with tagging accuracy, F-score and dependency accuracy.
A Latent Variable CCG Parser | In this section we evaluate the parsers using the traditional PARSEVAL measures which measure recall, precision and F-score on constituents in |
A Latent Variable CCG Parser | The Petrov parser has better results by a statistically significant margin for both labeled and unlabeled recall and unlabeled F-score . |
Intrinsic evaluation | We report the alignment quality in terms of precision, recall and F-score . |
Intrinsic evaluation | The F-score corresponding to perfect precision and the upper-bound recall is 94.75%. |
Intrinsic evaluation | Overall, the MM models obtain lower precision but higher recall and F-score than 1-1 models, which is to be expected as the gold standard is defined in terms of MM links. |
Experimental Setup | For these reasons, we evaluate on both sentence-level and token-level precision, recall, and F-score . |
Results | However, the best F-Score corresponding to the optimal number of clusters is 42.2, still far below our model’s 66.0 F-score . |
Results | Our results show a large gap in F-score between the sentence and token-level evaluations for both the USP baseline and our model. |