Abstract | Our best approach achieves a roughly 15% absolute increase in F-score over a simple but reasonable baseline. |
Results | We present the results in terms of F-score only for simplicity; we then conduct an error analysis that examines precision and recall. |
Results | Feature Set          F-score   %Imp
         word                 43.85     —
         word+nw              43.86     0.0
         word+na              44.78     2.1
         word+lem             45.85     4.6
         word+pos             45.91     4.7
         word+nw+pos+lem+na   46.34     5.7
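The %Imp column appears to be the relative improvement over the word baseline rather than an absolute difference; a minimal check of that reading (the scores are copied from the table above, the helper function is ours):

```python
# Sketch: check that %Imp is the relative improvement over the "word" baseline.
# Scores are copied from the table above; rel_improvement is our own helper.

def rel_improvement(score: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (score - baseline) / baseline * 100.0

BASELINE = 43.85  # "word" feature set
SCORES = {
    "word+nw": 43.86,
    "word+na": 44.78,
    "word+lem": 45.85,
    "word+pos": 45.91,
    "word+nw+pos+lem+na": 46.34,
}

for features, f_score in SCORES.items():
    print(f"{features}: {rel_improvement(f_score, BASELINE):.1f}%")
# Matches the %Imp column: 0.0, 2.1, 4.6, 4.7, 5.7
```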
Discussions | For small seed sizes, the F-score of bilingual bootstrapping is consistently better than the F-score obtained by training only on the seed data without using any bootstrapping. |
Discussions | To further illustrate this, we take some sample points from the graph and compare the number of tagged words needed by BiBoot and OnlySeed to reach the same (or nearly the same) F-score.
Experimental Setup | [Figure 1: Comparison of BiBoot, MonoBoot, OnlySeed and WFS on Hindi Health data; plot title "Seed Size v/s F-score", x-axis: Seed Size (words), y-axis: F-score]
Results | a. BiBoot: This curve represents the F-score obtained after 10 iterations by using bilingual bootstrapping with different amounts of seed data (a generic bootstrapping loop is sketched after this list).
Results | b. MonoBoot: This curve represents the F-score obtained after 10 iterations by using monolingual bootstrapping with different amounts of seed data. |
Results | c. OnlySeed: This curve represents the F-score obtained by training on the seed data alone without using any bootstrapping. |
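The bootstrapping procedure itself is not spelled out in these fragments; the sketch below shows only the generic self-training loop that the curves suggest (train on the seed, tag untagged data, keep confident predictions, repeat for 10 iterations). The helpers train_classifier, tag and select_confident are hypothetical, and the cross-lingual projection that distinguishes BiBoot from MonoBoot is not shown.

```python
# Illustrative self-training-style bootstrapping loop (assumed structure only).
# train_classifier, tag and select_confident are hypothetical helpers; the
# cross-lingual projection step that makes BiBoot "bilingual" is omitted.

def bootstrap(seed_data, untagged_data, iterations=10, threshold=0.9):
    labeled = list(seed_data)
    for _ in range(iterations):
        model = train_classifier(labeled)                     # train on current pool
        predictions = tag(model, untagged_data)               # tag the untagged corpus
        confident = select_confident(predictions, threshold)  # keep high-confidence tags
        labeled.extend(confident)                             # grow the training pool
    return train_classifier(labeled)
```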
Conclusion and Future Work | For normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
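How the three signals are integrated is not shown in this excerpt; the following is one plausible reading, in which each normalisation candidate is scored by a weighted combination of dictionary membership, string similarity to the noisy token, and context support. The helper functions, weights and example data are our assumptions, not the paper's actual model.

```python
# Illustrative scoring of normalisation candidates by combining dictionary
# lookup, word similarity and context support. Weights, helpers and data are
# assumptions for illustration, not the paper's integration.
import difflib

def string_similarity(noisy: str, candidate: str) -> float:
    """Character-level similarity in [0, 1]."""
    return difflib.SequenceMatcher(None, noisy, candidate).ratio()

def score_candidate(noisy, candidate, dictionary, context_support):
    dict_score = 1.0 if candidate in dictionary else 0.0
    sim_score = string_similarity(noisy, candidate)
    ctx_score = context_support.get(candidate, 0.0)  # e.g. LM score of candidate in context
    return 0.3 * dict_score + 0.4 * sim_score + 0.3 * ctx_score  # hypothetical weights

dictionary = {"tomorrow", "tonight"}
context_support = {"tomorrow": 0.8, "tonight": 0.1}
candidates = ["tomorrow", "tonight"]
best = max(candidates, key=lambda c: score_candidate("2moro", c, dictionary, context_support))
print(best)  # tomorrow
```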
Experiments | We evaluate detection performance by token-level precision, recall and F-score (β = 1).
Experiments | For candidate selection, we once again evaluate using token-level precision, recall and F-score.
Experiments | Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation. |
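As a point of reference, here is a minimal sketch of the token-level precision, recall and F-score (β = 1) computation, under the simplifying assumption that gold and predicted labels are aligned one-to-one per token; this is exactly the assumption that the SMT method's token-stream perturbations break, which is why BLEU is used as a complement.

```python
# Minimal token-level precision / recall / F1 (beta = 1), assuming gold and
# predicted binary label sequences are aligned one-to-one (1 = detected /
# needs normalisation, 0 = otherwise).

def precision_recall_f1(gold, predicted):
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold      = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]
print(precision_recall_f1(gold, predicted))  # roughly (0.667, 0.667, 0.667)
```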
Oracle Parsing | To answer this question we computed oracle best and worst values for labelled dependency F-score using the algorithm of Huang (2008) on the hybrid model of Clark and Curran (2007), the best model of their C&C parser. |
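For context, labelled dependency F-score is a set-overlap measure over labelled dependencies; the sketch below shows the metric only, computed here over assumed (head, dependent, label) triples, and not the forest-oracle search of Huang (2008) that yields the best and worst achievable values.

```python
# Labelled dependency F-score as overlap between gold and predicted sets of
# (head, dependent, label) triples; the triple representation is an assumption
# for illustration. This is the metric only, not Huang (2008)'s oracle search.

def labelled_dep_f_score(gold_deps, predicted_deps):
    gold, pred = set(gold_deps), set(predicted_deps)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = {("ate", "John", "nsubj"), ("ate", "apple", "dobj"), ("apple", "an", "det")}
pred = {("ate", "John", "nsubj"), ("ate", "apple", "iobj"), ("apple", "an", "det")}
print(round(labelled_dep_f_score(gold, pred), 3))  # 0.667
```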
Oracle Parsing | Digging deeper, we compared parser model score against Viterbi F-score and oracle F-score at a va-
Experimental Setup | For these reasons, we evaluate on both sentence-level and token-level precision, recall, and F-score.
Results | However, the best F-score corresponding to the optimal number of clusters is 42.2, still far below our model’s 66.0 F-score.
Results | Our results show a large gap in F-score between the sentence and token-level evaluations for both the USP baseline and our model. |