Cross-lingual Features | This led to an overall improvement in F-measure of 1.8 to 3.4 points (absolute) or 4.2% to 5.7% (relative). |
Cross-lingual Features | ture, transliteration mining slightly lowered precision (except for the TWEETS test set, where the drop in precision was significant) and increased recall, leading to an overall improvement in F-measure for all test sets.
Related Work | They reported 80%, 37%, and 47% F-measure for locations, organizations, and persons respectively on the ANERCORP dataset that they created and publicly released. |
Related Work | They reported 87%, 46%, and 52% F-measure for locations, organizations, and persons respectively. |
Related Work | Using POS tagging generally improved recall at the expense of precision, leading to overall improvements in F-measure.
Lexicon Bootstrapping | The set of parameters is optimized using a grid search on the development data, with F-measure for subjectivity classification as the objective.
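The grid search described above could be sketched as follows. This is a minimal illustration, not the paper's implementation: the `evaluate` callback and the parameter names are hypothetical stand-ins for whatever produces dev-set true-positive, false-positive, and false-negative counts.

```python
from itertools import product

def f_measure(tp, fp, fn):
    """Balanced F-measure from true-positive, false-positive, and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def grid_search(param_grid, evaluate):
    """Return the parameter setting with the highest development-set F-measure.

    `param_grid` maps parameter names to candidate value lists;
    `evaluate` maps a parameter dict to (tp, fp, fn) counts on the dev data.
    """
    best_params, best_f = None, -1.0
    names = sorted(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        f = f_measure(*evaluate(params))
        if f > best_f:
            best_params, best_f = params, f
    return best_params, best_f
```

The exhaustive product over candidate values is feasible only for a handful of parameters, which matches the small-grid setting typical of lexicon tuning.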
Lexicon Evaluations | and F-measure results. |
Lexicon Evaluations | For polarity classification we get comparable F-measure but much higher recall for Lg compared to SWN.
Lexicon Evaluations | Figure 1: Precision (x-axis), recall (y-axis) and F-measure (in the table) for English.
Experiments | Since the model can give the same score for a permutation and the original document, we also compute F-measure where recall is correct/total and precision equals correct/decisions. |
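The F-measure variant described above, where the model may abstain (so decisions ≤ total), can be written directly from the stated definitions of precision and recall:

```python
def permutation_f_measure(correct, decisions, total):
    """F-measure for the permutation experiment described above.

    The model can assign the same score to a permutation and the original
    document and so may not decide every case; hence
    recall = correct / total (all test cases), while
    precision = correct / decisions (cases actually decided).
    """
    precision = correct / decisions if decisions else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, 80 correct out of 90 decisions over 100 pairs gives precision 0.889, recall 0.8, and F-measure about 0.842.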
Experiments | For evaluation purposes, the accuracy still corresponds to the number of correct ratings divided by the number of comparisons, while the F-measure combines recall and precision measures. |
Experiments | Moreover, in contrast to the first experiment, when accounting for the number of entities “shared” by two sentences (PW), values of accuracy and F-measure are lower. |
Conclusions | Table 3: Classification performance in F-measure for semantically ambiguous words on the most frequently confused descriptive tags in the movie domain. |
Experiments | Figure 2: F-measure for semantic clustering performance. |
Experiments | As expected, we see a drop in F-measure on all models on descriptive tags. |
Concluding Remarks | Using MDL alone, one proposed method outperforms the original regularized compressor (Chen et al., 2012) in precision by 2 percentage points and in F-measure by 1 percentage point.
Evaluation | Segmentation performance is measured using word-level precision (P), recall (R), and F-measure (F). |
Evaluation | We found that, in all three settings, G2 outperforms the baseline by 1 to 2 percentage points in F-measure.
Evaluation | The best performance achieved by G2 in our experiment is 81.7 in word-level F-measure, although this was obtained from search setting (c), using a heuristic p value of 0.37.
Abstract | This approach generates alignments that are 2.6 f-Measure points better than a baseline supervised aligner. |
Conclusion | We also proposed a model that scores alignments given source and target sentence reorderings, improving a supervised alignment model by 2.6 points in f-Measure.
Results and Discussions | [Table header: Type, f-Measure (words)]
Results and Discussions | The f-Measure of this aligner is 78.1% (see row 1, column 2). |
Results and Discussions | [Table: Method, f-Measure, mBLEU] Base correction model: 78.1, 55.1; Correction model, C(f′|a): 78.1, 56.4; P(a|f′), C(f′|a): 80.7, 57.6.
Analysis and Discussions | [Figure: F-Measure for SO Prediction]
Analysis and Discussions | [Figure: F-Measure for IQAPs Inference]
Abstract | We test our approach on a held-out test set from EUROVOC and perform precision, recall and f-measure evaluations for 20 European language pairs. |
Experiments 5.1 Data Sources | To test the classifier’s performance we evaluated it against a list of positive and negative examples of bilingual term pairs using the measures of precision, recall and F-measure.
Method | First, we evaluate the performance of the classifier on a held-out term-pair list from EUROVOC using the standard measures of recall, precision and F-measure.
Abstract | Results over a dataset of entities from four product domains show that the proposed approach achieves an F-measure of 0.96, significantly above the baseline.
Experimental Evaluation | Of the three individual modules, the n-gram and clustering methods achieve F-measure of around 0.9, while the ontology-based module performs only modestly above baseline. |
Experimental Evaluation | The final system that employed all modules produced an F-measure of 0.960, a significant (p < 0.01) absolute increase of 15.4% over the baseline. |
Experiments | Each learner uses a small amount of development data to tune a threshold on scores for predicting new-sense or not-a-new-sense, using macro F-measure as an objective. |
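The threshold-tuning step described above could be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: scores, labels, and the candidate thresholds are hypothetical, and macro F is averaged over the two classes (new-sense vs. not-a-new-sense).

```python
def binary_f(tp, fp, fn):
    """Balanced F-measure for one class from its tp/fp/fn counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f(scores, labels, threshold):
    """Macro-averaged F over the two classes (1 = new-sense, 0 = not-a-new-sense)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    per_class = []
    for cls in (0, 1):
        tp = sum(1 for p, y in zip(preds, labels) if p == cls and y == cls)
        fp = sum(1 for p, y in zip(preds, labels) if p == cls and y != cls)
        fn = sum(1 for p, y in zip(preds, labels) if p != cls and y == cls)
        per_class.append(binary_f(tp, fp, fn))
    return sum(per_class) / len(per_class)

def tune_threshold(scores, labels, candidates):
    """Pick the score threshold maximizing macro F on the development data."""
    return max(candidates, key=lambda t: macro_f(scores, labels, t))
```

Macro averaging gives the rare new-sense class equal weight with the majority class, which is why it is a sensible tuning objective here.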
Experiments | are relatively weak for predicting new senses on EMEA data but stronger on Subs (TYPEONLY AUC performance is higher than both baselines) and even stronger on Science data (TYPEONLY AUC and f-measure performance is higher than both baselines as well as the ALLFEATURES model).
Experiments | Recall that the microlevel evaluation computes precision, recall, and f-measure for all word tokens of a given word type and then averages across word types. |
Experiments | The performance measurement for word segmentation is balanced F-measure, F = 2PR/(P + R), a function of precision P and recall R, where P is the percentage of words in the segmentation results that are segmented correctly, and R is the percentage of gold-standard words that are correctly segmented.
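The word-level P/R/F evaluation above can be computed by comparing character-offset spans, the usual convention in segmentation bakeoffs: a predicted word counts as correct only if both its boundaries match a gold word. A minimal sketch (function and variable names are ours, not from the paper):

```python
def word_spans(words):
    """Convert a word sequence into a set of (start, end) character-offset spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def segmentation_prf(predicted, gold):
    """Word-level precision P, recall R, and balanced F = 2PR/(P + R).

    `predicted` and `gold` are word lists segmenting the same string;
    a predicted word is correct iff its character span appears in the gold
    segmentation.
    """
    pred_spans, gold_spans = word_spans(predicted), word_spans(gold)
    correct = len(pred_spans & gold_spans)
    p = correct / len(pred_spans) if pred_spans else 0.0
    r = correct / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, segmenting "thecatsat" as ["the", "ca", "tsat"] against gold ["the", "cat", "sat"] matches only one of three spans, giving P = R = F = 1/3.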
Experiments | Wikipedia brings an F-measure increment of 0.93 points.
Introduction | Experimental results show that the knowledge implied in the natural annotations can significantly improve the performance of a baseline segmenter trained on CTB 5.0, yielding an F-measure increment of 0.93 points on the CTB test set and an average increment of 1.53 points on 7 other domains.