Abstract | Experimental results show that our algorithm yields a relative error reduction of 6.3% in F-measure for the minority classes in comparison to a baseline that learns solely from the labeled data. |
Baseline Approaches | Results are reported in terms of precision (P), recall (R), and F-measure (F), which are computed by aggregating over the 14 shapers as follows. |
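The aggregation over the 14 shapers described here is micro-averaging: true positives, false positives, and false negatives are pooled across classes before the ratios are computed. A minimal sketch (the per-shaper counts below are hypothetical, for illustration only):

```python
def micro_prf(counts):
    """Micro-averaged precision, recall, and F-measure.

    counts: list of (tp, fp, fn) tuples, one per class (shaper).
    Counts are pooled across all classes before computing the ratios.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical counts for three of the 14 shapers.
p, r, f = micro_prf([(10, 5, 3), (4, 2, 6), (7, 1, 2)])
```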
Baseline Approaches | Our second baseline is similar to the first, except that we tune the classification threshold (CT) to optimize F-measure. |
Baseline Approaches | Using the development data, we tune the 14 CTs jointly to optimize overall F-measure. |
Evaluation | Micro-averaged 5-fold cross-validation results of this baseline for all 14 shapers and for just the 10 minority classes (reflecting our focus on improving minority-class prediction) are expressed as percentages in terms of precision (P), recall (R), and F-measure (F) in the first row of Table 4. |
Evaluation | As we can see, the baseline achieves an F-measure of 45.4 (14 shapers) and 35.4 (10 shapers). |
Evaluation | Comparing these two results, the higher F-measure achieved using all 14 shapers can be attributed primarily to improvements in recall. |
Introduction | In comparison to a supervised baseline approach where a classifier is acquired solely based on the training set, our bootstrapping approach yields a relative error reduction of 6.3% in F-measure for the minority classes. |
Our Bootstrapping Algorithm | In particular, if the second baseline is used, we tune CT and k jointly on the development data using the local search algorithm described previously: in each step of the search, we adjust the values of both CT and k for one of the 14 classifiers to optimize the overall F-measure. |
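The search described here amounts to greedy coordinate ascent over per-classifier parameter pairs. A minimal sketch, with the scoring function and candidate grid left abstract; all names are illustrative, not the paper's code:

```python
def local_search(score, init_params, candidates, max_iters=50):
    """Greedy coordinate search over per-classifier (CT, k) pairs.

    In each step, re-tune the (CT, k) pair of a single classifier to
    maximize the overall F-measure, holding the other classifiers'
    parameters fixed; stop when no single-classifier change helps.
    score: maps a full parameter list to overall F-measure.
    candidates: grid of (CT, k) values tried for each classifier.
    """
    params = list(init_params)
    best = score(params)
    for _ in range(max_iters):
        improved = False
        for i in range(len(params)):
            for cand in candidates:
                trial = params[:i] + [cand] + params[i + 1:]
                s = score(trial)
                if s > best:
                    best, params, improved = s, trial, True
        if not improved:
            break
    return params, best
```

On a toy separable objective, the search recovers the optimal per-classifier settings in one or two sweeps; with a real F-measure objective it only guarantees a local optimum.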
Experiment | For performance evaluation of opinion and polarity detection, we use precision, recall, and F-measure, the same measures used to report the official results at the NTCIR MOAT workshop. |
Experiment | System parameters are optimized for F-measure using the NTCIR6 dataset with lenient evaluation. |
Experiment | Model (Precision / Recall / F-measure): BASELINE 0.305/0.866/0.451; VS 0.331/0.807/0.470; BM25 0.327/0.795/0.464; LM 0.325/0.794/0.461; LSA 0.315/0.806/0.453; PMI 0.342/0.603/0.436; DTP 0.322/0.778/0.455; VS-LSA 0.335/0.769/0.466; VS-PMI 0.311/0.833/0.453; VS-DTP 0.342/0.745/0.469 |
Discussion | Here, we compared our method with baseline 3, whose F-measure was the highest among the four baselines described in Section 5.1. |
Discussion | Recall (%) / Precision (%) / F-measure: by human 89.82 (459/511) / 89.82 (459/511) / 89.82; our method 82.19 (420/511) / 81.71 (420/514) / 81.95 |
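The figures reported here can be reproduced from the raw counts, with F-measure as the harmonic mean of precision and recall:

```python
def prf(correct, guessed, relevant):
    """Precision, recall, and F-measure (as percentages) from raw counts."""
    p = 100 * correct / guessed
    r = 100 * correct / relevant
    f = 2 * p * r / (p + r)  # harmonic mean of p and r
    return p, r, f

# "our method" row: 420 correct out of 514 guessed and 511 relevant.
p, r, f = prf(420, 514, 511)
# p ≈ 81.71, r ≈ 82.19, f ≈ 81.95
```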
Discussion | In F-measure, our method achieved 91.24% (81.95/89.82) of the result obtained by the human annotator. |
Experiment | On the other hand, the F-measure and the sentence accuracy of our method were 81.43% and 53.15%, respectively. |
Abstract | On average, across a variety of testing scenarios, our model achieves an 8.8-point absolute gain in F-measure. |
Experimental setup | In both cases our implementation achieves an F-measure in the range of 69–70% on WSJ10, broadly in line with the performance reported by Klein and Manning (2002). |
Experimental setup | To evaluate both our model as well as the baseline, we use (unlabeled) bracket precision, recall, and F-measure (Klein and Manning, 2002). |
Experimental setup | We also report the upper bound on F-measure for binary trees. |
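Unlabeled bracket scoring compares the sets of (start, end) spans in the induced and gold trees. A minimal sketch, with hypothetical span sets:

```python
def bracket_prf(gold, guess):
    """Unlabeled bracket precision, recall, and F-measure.

    gold, guess: sets of (start, end) spans from the gold and induced
    trees (trivial spans excluded beforehand if desired).
    """
    correct = len(gold & guess)
    p = correct / len(guess) if guess else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical bracketings of a 5-word sentence.
gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
guess = {(0, 5), (1, 3), (2, 5)}
p, r, f = bracket_prf(gold, guess)
```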
Introduction | On average, over all the testing scenarios that we studied, our model achieves an absolute increase in F-measure of 8.8 points, and a 19% reduction in error relative to a theoretical upper bound. |
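The relative error reduction quoted here treats the gap to the upper bound as the error. The scores below are hypothetical, chosen only so the arithmetic mirrors the quoted 8.8-point gain and 19% reduction:

```python
def relative_error_reduction(f_base, f_new, f_upper):
    """Fraction of the baseline's gap to the upper bound that is closed.

    Error is defined as (f_upper - f); the reduction is the shrinkage of
    that error relative to the baseline's error.
    """
    err_base = f_upper - f_base
    err_new = f_upper - f_new
    return (err_base - err_new) / err_base

# Hypothetical F-measures: baseline 40.0, new model 48.8 (an 8.8-point
# gain), upper bound 86.3 -> about a 19% relative error reduction.
red = relative_error_reduction(40.0, 48.8, 86.3)
```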
Results | On average, the bilingual model gains 10.2 percentage points in precision, 7.7 in recall, and 8.8 in F-measure. |
Results | The Korean-English pairing results in substantial improvements for Korean and quite large improvements for English, for which the absolute gain reaches 28 points in F-measure . |
Conclusion and Future Works | It obtains a considerable F-measure improvement, about 0.8 points for word segmentation and 1 point for Joint S&T, with corresponding error reductions of 30.2% and 14%. |
Conclusion and Future Works | Moreover, this improvement brings a further striking F-measure gain for Chinese parsing, about 0.8 points, corresponding to an error-propagation reduction of 38%. |
Experiments | For word segmentation, the model after annotation adaptation (row 4 in the upper subtable) achieves an F-measure gain of 0.8 points over the baseline model, corresponding to an error reduction of 30.2%; for Joint S&T, the gain of the adapted model (row 4 in the lower subtable) is 1 point, corresponding to an error reduction of 14%. |
Experiments | Note that if we input the gold-standard segmented test set into the parser, the F-measure under the two definitions is the same. |
Experiments | The parsing F-measure corresponding to the gold-standard segmentation, 82.35, represents the “oracle” accuracy (i.e., upper bound) of parsing on top of automatic word segmentation. |
Conclusion | selected among multiple alignments, and it obtained a 0.8-point F-measure improvement over the single best Chinese-English aligner. |
Conclusion | The second is the alignment link confidence measure, which selects the most reliable links from multiple alignments and obtained a 1.5-point F-measure improvement. |
Improved MaxEnt Aligner with Confidence-based Link Filtering | the highest F-measure among the three aligners, although the algorithm described below can be applied to any aligner. |
Improved MaxEnt Aligner with Confidence-based Link Filtering | For CE alignment, removing low-confidence alignment links increased alignment precision by 5.5 points while decreasing recall by 1.8 points, and the overall alignment F-measure increased by 1.3 points. |
Improved MaxEnt Aligner with Confidence-based Link Filtering | Examining the links removed during the filtering process, we found that 80% of them (1320 out of 1661 links) were incorrect alignments. For A-E alignment, filtering increased precision by 3 points while reducing recall by 0.5 points, and the alignment F-measure increased by about 1.5 points absolute, a 10% relative reduction in alignment error rate. |
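The filtering step itself reduces to thresholding: keep only alignment links whose confidence exceeds a cutoff tuned on held-out data. A hedged sketch; the confidence values and threshold below are assumptions, not the paper's exact formulation:

```python
def filter_links(links, threshold):
    """Keep alignment links whose confidence is at or above the threshold.

    links: iterable of ((src_idx, tgt_idx), confidence) pairs.
    Removing low-confidence links trades some recall for precision.
    """
    return [link for link, conf in links if conf >= threshold]

# Hypothetical links with made-up confidence scores.
links = [((0, 0), 0.95), ((1, 2), 0.40), ((2, 1), 0.88)]
kept = filter_links(links, threshold=0.5)
# kept == [(0, 0), (2, 1)]
```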
Experiments | Precision = #correct / #guessed, Recall = #correct / #relevant, and F-measure is the harmonic mean of the two. |
Experiments | Finally, both of the OpPr systems are better than both baselines in Accuracy as well as F-measure for all four debates. |
Experiments | The F-measure improves, on average, by 25 percentage points over the OpTopic system, and by 17 percentage points over the OpPMI system. |
Experiments and results | The F-measure score using the initial training data is 0.69. |
Experiments and results | Most previous work on prosodic event detection reported results using classification accuracy instead of F-measure. |
Experiments and results | Table 3: Results (F-measure) of prosodic event detection for supervised and co-training approaches. |
Abstract | Experimental results show that our approach improved the F-measure by 3.6–10.3%. |
Motivation | (2008), which was applied only to Japanese and achieved around 80% in F-measure. |
Motivation | Experimental results showed that our method based on bilingual co-training improved the performance of monolingual hyponymy-relation acquisition by about 3.6–10.3% in F-measure. |