Abstract | Our result, an F-score of 73.74%, outperforms the state-of-the-art methods on a manually labeled test dataset.
Abstract | Moreover, combining our method with a previous manually-built hierarchy extension method can further improve F-score to 80.29%. |
Experimental Setup | We use precision, recall, and F-score as our metrics to evaluate the performances of the methods. |
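The metrics named above follow the standard definitions; a minimal sketch in Python (the counts `tp`, `fp`, `fn` are hypothetical, not figures from any of the papers excerpted here):

```python
# Minimal sketch of the three metrics, computed from raw counts.
# tp, fp, fn below are hypothetical counts, for illustration only.
def precision_recall_fscore(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

p, r, f = precision_recall_fscore(tp=60, fp=20, fn=40)
# p = 0.75, r = 0.6, f ≈ 0.667
```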
Introduction | The experimental results show that our method achieves an F-score of 73.74%, which significantly outperforms the previous state-of-the-art methods.
Introduction | (2008) can further improve F-score to 80.29%. |
Results and Analysis 5.1 Varying the Amount of Clusters | Table 3 shows that the proposed method achieves a better recall and F-score than all of the previous methods do. |
Results and Analysis 5.1 Varying the Amount of Clusters | It can significantly (p < 0.01) improve the F-score over the state-of-the-art method MWikHCilmE. |
Results and Analysis 5.1 Varying the Amount of Clusters | The F-score is further improved from 73.74% to 76.29%. |
Discussion and Conclusion | With the 26 proposed features derived from the decoding process and source-sentence syntactic analysis, the proposed QE model achieved better TER prediction, higher correlation with human correction of MT output, and a higher F-score in finding good translations.
Experiments | Here we report the precision, recall, and F-score of finding such “Good” sentences (with TER ≤ 0.1) on the three documents in Table 3.
Experiments | Again, the adaptive QE model produces higher recall, mostly higher precision, and significantly improved F-score.
Experiments | The overall F-score of the adaptive QE model is 0.282. |
Evaluation of lexical similarity in context | If one wants to optimize the F-score (the harmonic mean of precision and recall) when extracting relevant pairs, the optimal point is an F-score of .24 at a threshold of .22 on Lin’s score.
Experiments: predicting relevance in context | Other popular methods (maximum entropy, SVM) have shown slightly inferior combined F-score, even though their precision and recall individually vary more widely.
Experiments: predicting relevance in context | As a baseline, we can also consider a simple threshold on the lexical similarity score, in our case Lin’s measure, which we have shown to yield the best F-score of 24% when set at 0.22. |
Experiments: predicting relevance in context | If we take the best simple classifier (random forests), the precision and recall are 68.1% and 24.2% for an F-score of 35.7%, and this is significantly beaten by the Naive Bayes method as precision and recall are more even ( F-score of 41.5%). |
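The trade-off described above is easy to verify numerically: the harmonic mean rewards balanced precision and recall. A small check using the figures quoted in the excerpt:

```python
def fscore(p, r):
    # Harmonic mean of precision and recall (here in percent).
    return 2 * p * r / (p + r)

# Random forests: high precision, low recall -> a modest harmonic mean.
assert round(fscore(68.1, 24.2), 1) == 35.7
# A more even precision/recall split lifts the harmonic mean, which is
# how the Naive Bayes method reaches 41.5% despite lower precision.
```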
Related work | Precision / Recall / F-score: 40.4 / 54.3 / 46.3; 37.4 / 52.8 / 43.8; 36.1 / 49.5 / 41.8; 36.5 / 54.8 / 43.8
Experiments | Figure 8 reports the accuracy, macro F-score, and micro F-score.
Experiments | It shows that the BR learner produces better accuracy and micro F-score than the FR learner, but a slightly worse macro F-score.
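A gap between macro and micro F-score, as noted above, arises because macro averaging weights all classes equally while micro averaging pools the counts, letting frequent classes dominate. A minimal sketch with hypothetical per-class counts:

```python
def micro_macro_f(per_class_counts):
    """per_class_counts: list of (tp, fp, fn) tuples, one per label."""
    def f(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0
    # Macro: average the per-class F-scores (every class weighs equally).
    macro = sum(f(*c) for c in per_class_counts) / len(per_class_counts)
    # Micro: pool the counts first (frequent classes dominate).
    tp, fp, fn = map(sum, zip(*per_class_counts))
    return macro, f(tp, fp, fn)

# Hypothetical counts: a large class handled well, a small class handled poorly.
macro, micro = micro_macro_f([(90, 10, 10), (1, 5, 5)])
# micro > macro here: pooling hides the weak small class.
```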
Abstract | This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case “function words”, setting a new state-of-the-art of 92.4% token f-score . |
Introduction | While absolute accuracy is not directly relevant to the main point of the paper, we note that the models that learn generalisations about function words perform unsupervised word segmentation at 92.5% token f-score on the standard Bernstein-Ratner (1987) corpus, which improves the previous state-of-the-art by more than 4%. |
Introduction | that achieves the best token f-score expects function words to appear at the left edge of phrases. |
Word segmentation results | f-score / precision / recall: Baseline 0.872 / 0.918 / 0.956; + left FWs 0.924 / 0.935 / 0.990; + left + right FWs 0.912 / 0.957 / 0.953
Word segmentation results | Figure 2 presents the standard token and lexicon (i.e., type) f-score evaluations for word segmentations proposed by these models (Brent, 1999), and Table 1 summarises the token and lexicon f-scores for the major models discussed in this paper. |
Word segmentation results | It is interesting to note that adding “function words” improves token f-score by more than 4%, corresponding to a 40% reduction in overall error rate. |
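The arithmetic behind the error-rate claim can be checked directly: a token f-score gain from 0.872 to 0.924 removes roughly 40% of the remaining error mass.

```python
# Token f-scores from the excerpt; the error reduction follows directly.
baseline_f, improved_f = 0.872, 0.924
absolute_gain = improved_f - baseline_f             # 0.052, i.e. >4% absolute
error_reduction = absolute_gain / (1 - baseline_f)  # 0.052 / 0.128 ≈ 0.406
# Roughly a 40% reduction in residual segmentation error.
```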
Word segmentation with Adaptor Grammars | The starting point and baseline for our extension is the adaptor grammar with syllable structure phonotactic constraints and three levels of collo-cational structure (5-21), as prior work has found that this yields the highest word segmentation token f-score (Johnson and Goldwater, 2009). |
Abstract | The results show a significant improvement in Chinese relation extraction, outperforming other methods in F-score by 10% in 6 relation types and 15% in 18 relation subtypes. |
Feature Construction | F-score is computed by F = 2 × P × R / (P + R), where P is precision and R is recall.
Feature Construction | In Row 2, with only the .7-"0w feature, the F-score already reaches 77.74% in 6 types and 60.31% in 18 subtypes. |
Feature Construction | In Table 3, it is shown that our system outperforms other systems, in F-score, by 10% on 6 relation types and by 15% on 18 subtypes.
Introduction | The performance of relation extraction is still unsatisfactory, with an F-score of 67.5% for English (23 subtypes) (Zhou et al., 2010).
Introduction | Chinese relation extraction also performs weakly, with an F-score of about 66.6% on 18 subtypes (Dandan et al., 2012).
Related Work | MEANT (Lo et al., 2012) is the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers; it outperforms BLEU, NIST, METEOR, WER, CDER, and TER in correlation with human adequacy judgments.
Related Work | In this paper, we employ a newer version of MEANT that uses f-score to aggregate individual token similarities into the composite phrasal similarities of semantic role fillers, as our experiments indicate this is more accurate than the previously used aggregation functions. |
Related Work | Compute the weighted f-score over the matching role labels of these aligned predicates and role fillers, according to definitions similar to those in section 2.2, except replacing REF with IN in q_{i,j} and w_{i,l}.
Results | Table 1 shows that for human adequacy judgments at the sentence level, the f-score-based XMEANT (1) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT.
XMEANT: a cross-lingual MEANT | 3.1 Applying MEANT’s f-score within semantic role fillers |
XMEANT: a cross-lingual MEANT | The first natural approach is to extend MEANT’s f-score based method of aggregating semantic parse accuracy, so as to also apply to aggregat- |
Abstract | Experiments show that our approach achieves a statistically significant increase of 13.5% in F-score and 37% in area under the precision recall curve. |
Available at http://nlp.stanford.edu/software/mimlre.shtml. | Figure 2 shows that our model consistently outperforms all six algorithms at almost all recall levels, improving the maximum F-score by more than 13.5% relative to MIML (from 28.35% to 32.19%) and increasing the area under the precision-recall curve by more than 37% (from 11.74 to 16.1).
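The relative improvements quoted above can be reproduced from the reported absolute numbers:

```python
# Reproducing the relative gains from the absolute figures in the excerpt.
def rel_gain(old, new):
    return (new - old) / old

max_f_gain = 100 * rel_gain(28.35, 32.19)   # ≈ 13.5 (max F-score)
auc_gain = 100 * rel_gain(11.74, 16.10)     # ≈ 37.1 (area under P-R curve)
```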
Available at http://nlp.stanford.edu/software/mimlre.shtml. | Performance of Guided DS also compares favorably with the best-scoring hand-coded systems for a similar task, such as the Sun et al. (2011) system for KBP 2011, which reports an F-score of 25.7%.
Introduction | proposed approach, we extend MIML (Surdeanu et al., 2012), a state-of-the-art distant supervision model, and show a significant improvement of 13.5% in F-score on the relation extraction benchmark TAC-KBP (Ji and Grishman, 2011) dataset.
Introduction | While prior work employed tens of thousands of human-labeled examples (Zhang et al., 2012) and obtained only a 6.5% increase in F-score over a logistic regression baseline, our approach uses far less labeled data (about 1/8 as much) yet achieves a larger improvement over stronger baselines.
Training | Training MIML on a simple fusion of distantly-labeled and human-labeled datasets does not improve the maximum F-score since this hand-labeled data is swamped by a much larger amount of distant-supervised data of much lower quality. |
Abstract | Standard CCGBank tests show the model achieves labeled F-score improvements of up to 1.05 over three existing, competitive CCG parsing models.
Experiments | On both the full and reduced sets, our parser achieves the highest F-score.
Experiments | In comparison with C&C, our parser shows significant increases across all metrics, with 0.57% and 1.06% absolute F-score improvements over the hybrid and normal-form models, respectively. |
Experiments | While our parser achieved lower precision than Z&C, it is more balanced and gives higher recall for all of the dependency relations except the last one, and higher F-score for over half of them. |
Introduction | Results on the standard CCGBank tests show that our parser achieves absolute labeled F-score gains of up to 0.5 over the shift-reduce parser of Zhang and Clark (2011); and up to 1.05 and 0.64 over the normal-form and hybrid models of Clark and Curran (2007), respectively. |
Abstract | We show that this method generates output closer to the feedback that lecturers actually generated, achieving 3.5% higher accuracy and 15% higher F-score than multiple simple classifiers that keep a history of selected templates. |
Evaluation | The accuracy, the weighted precision, the weighted recall, and the weighted F-score of the classifiers are shown in Table 3. |
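Weighted precision, recall, and F-score, as reported above, average the per-class scores weighted by class support. A minimal sketch with hypothetical per-class values:

```python
def weighted_f(per_class, support):
    """per_class: (precision, recall) pairs, one per label; support: class sizes.
    All values here are hypothetical, for illustration only."""
    f = [2 * p * r / (p + r) if p + r else 0.0 for p, r in per_class]
    return sum(fi * s for fi, s in zip(f, support)) / sum(support)

# A frequent, well-predicted class dominates the weighted average.
w_f = weighted_f([(0.9, 0.8), (0.6, 0.7)], support=[80, 20])
# w_f ≈ 0.807
```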
Evaluation | It was found that in 10-fold cross validation RAkEL performs significantly better in all these automatic measures (accuracy = 76.95%, F-score = 85.50%). |
Evaluation | Remarkably, ML achieves an F-score more than 10% higher than the other methods (Table 3).
Experiment | As we can see, by using tag embeddings, the F-score is improved by +0.6% and OOV recall is improved by +1.0%, which shows that tag embeddings succeed in modeling the tag-tag interaction and tag-character interaction.
Experiment | The F-score is improved by +0.6% while OOV recall is improved by +3.2%, which indicates that the tensor-based transformation captures more interactional information than a simple nonlinear transformation.
Experiment | As shown in Table 5 (last three rows), both the F-score and OOV recall of our model improve with pre-training.