Experiments and Results | Trained and tested on the Penn WSJ TreeBank, the POS tagger obtained an accuracy of 97% and the NP chunker produced an F-measure above 94% (Zhou and Su, 2000). |
Experiments and Results | Evaluated on the MUC-6 and MUC-7 Named-Entity task, the NER module (Zhou and Su, 2002) achieved an F-measure of 96.6% (MUC-6) and 94.1% (MUC-7). |
Experiments and Results | The overall F-measures for NWire, NPaper and BNews are 60.4%, 57.9% and 62.9%, respectively. |
Cross-lingual Features | This led to an overall improvement in F-measure of 1.8 to 3.4 points (absolute) or 4.2% to 5.7% (relative). |
Cross-lingual Features | Transliteration mining slightly lowered precision — except for the TWEETS test set, where the drop in precision was significant — and increased recall, leading to an overall improvement in F-measure for all test sets. |
Related Work | They reported 80%, 37%, and 47% F-measure for locations, organizations, and persons respectively on the ANERCORP dataset that they created and publicly released. |
Related Work | They reported 87%, 46%, and 52% F-measure for locations, organizations, and persons respectively. |
Related Work | Using POS tagging generally improved recall at the expense of precision, leading to overall improvements in F-measure. |
Clustering Methods, Evaluation Metrics and Experimental Setup | Feature maximisation is a cluster quality metric which associates each cluster with maximal features, i.e., features whose Feature F-measure is maximal. |
Clustering Methods, Evaluation Metrics and Experimental Setup | Feature F-measure is the harmonic mean of Feature Recall and Feature Precision which in turn are defined as: |
Clustering Methods, Evaluation Metrics and Experimental Setup | represents the weight of the feature f for element x, and FC designates the set of features associated with the verbs occurring in the cluster c. A feature is then said to be maximal for a given cluster iff its Feature F-measure is higher for that cluster than for any other cluster. |
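Clustering Methods, Evaluation Metrics and Experimental Setup | For reference, a sketch of these quantities under the usual feature-maximisation formulation; the weight notation W_x^f and the cluster set C are our assumptions, not taken verbatim from the excerpt:

\[
FR(f,c) = \frac{\sum_{x \in c} W_x^f}{\sum_{c' \in C} \sum_{x \in c'} W_x^f}, \qquad
FP(f,c) = \frac{\sum_{x \in c} W_x^f}{\sum_{f' \in F_c} \sum_{x \in c} W_x^{f'}}, \qquad
FF(f,c) = \frac{2 \, FR(f,c) \, FP(f,c)}{FR(f,c) + FP(f,c)}
\]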
Introduction | Their approach achieves an F-measure of 55.1 on 116 verbs occurring at least 150 times in Lexschem. |
Introduction | The best performance is achieved when restricting the approach to verbs occurring at least 4000 times (43 verbs) with an F-measure of 65.4. |
Introduction | We show that the approach yields promising results (F-measure of 70%) and that the resulting clustering systematically associates verbs with syntactic frames and thematic grids, thereby providing an interesting basis for the creation and evaluation of a Verbnet-like classification. |
Conclusions | At an optimal decision threshold, however, both yielded a similar f-measure result. |
Experiments and Results | Since both algorithms show different behaviour with increasing experience and PROMODES-H yields a higher f-measure across all datasets, we will investigate in the next experiments how these differences manifest themselves at the boundary level. |
Abstract | Experimental results show that our algorithm yields a relative error reduction of 6.3% in F-measure for the minority classes in comparison to a baseline that learns solely from the labeled data. |
Baseline Approaches | Results are reported in terms of precision (P), recall (R), and F-measure (F), which are computed by aggregating over the 14 shapers as follows. |
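Baseline Approaches | A plausible reading of this aggregation, assuming a micro-average over the per-shaper true positives, false positives and false negatives (the TP/FP/FN notation is ours, not the paper's):

\[
P = \frac{\sum_{i=1}^{14} TP_i}{\sum_{i=1}^{14} (TP_i + FP_i)}, \qquad
R = \frac{\sum_{i=1}^{14} TP_i}{\sum_{i=1}^{14} (TP_i + FN_i)}, \qquad
F = \frac{2PR}{P + R}
\]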
Baseline Approaches | Our second baseline is similar to the first, except that we tune the classification threshold (CT) to optimize F-measure. |
Baseline Approaches | Using the development data, we tune the 14 CTs jointly to optimize overall F-measure. |
Evaluation | Micro-averaged 5-fold cross validation results of this baseline for all 14 shapers and for just 10 minority classes (due to our focus on improving minority class prediction) are expressed as percentages in terms of precision (P), recall (R), and F-measure (F) in the first row of Table 4. |
Evaluation | As we can see, the baseline achieves an F-measure of 45.4 (14 shapers) and 35.4 (10 shapers). |
Evaluation | Comparing these two results, the higher F-measure achieved using all 14 shapers can be attributed primarily to improvements in recall. |
Introduction | In comparison to a supervised baseline approach where a classifier is acquired solely based on the training set, our bootstrapping approach yields a relative error reduction of 6.3% in F-measure for the minority classes. |
Our Bootstrapping Algorithm | In particular, if the second baseline is used, we will tune CT and k jointly on the development data using the local search algorithm described previously, where we adjust the values of both CT and k for one of the 14 classifiers in each step of the search process to optimize the overall F-measure score. |
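Our Bootstrapping Algorithm | A minimal sketch of such a joint local search, assuming a scoring callable that returns the overall F-measure on the development data for a given assignment of thresholds and k values (the function and grid names are hypothetical, not the paper's implementation):

```python
from typing import Callable, List, Sequence, Tuple

def tune_ct_and_k(
    n_classifiers: int,
    dev_f_measure: Callable[[List[float], List[int]], float],  # overall F on dev data
    ct_grid: Sequence[float],
    k_grid: Sequence[int],
) -> Tuple[List[float], List[int], float]:
    # One classification threshold (CT) and one k per classifier.
    cts = [ct_grid[0]] * n_classifiers
    ks = [k_grid[0]] * n_classifiers
    best = dev_f_measure(cts, ks)
    improved = True
    while improved:  # greedy local search: stop when no single change helps
        improved = False
        for i in range(n_classifiers):  # adjust CT and k of one classifier per step
            for ct in ct_grid:
                for k in k_grid:
                    old_ct, old_k = cts[i], ks[i]
                    cts[i], ks[i] = ct, k
                    score = dev_f_measure(cts, ks)
                    if score > best:
                        best, improved = score, True
                    else:
                        cts[i], ks[i] = old_ct, old_k  # revert non-improving move
    return cts, ks, best
```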
BLANC for Imperfect Response Mentions | and we propose to extend the coreference F-measure and non-coreference F-measure as follows. |
BLANC for Imperfect Response Mentions | Coreference recall, precision and F-measure are changed to: |
BLANC for Imperfect Response Mentions | Non-coreference recall, precision and F-measure are changed to: |
Introduction | It calculates recall, precision and F-measure separately on coreference and non-coreference links in the usual way, and defines the overall recall, precision and F-measure as the mean of the respective measures for coreference and non-coreference links. |
Original BLANC | BLANC-gold solves this problem by averaging the F-measure computed over coreference links and the F-measure over non-coreference links. |
Original BLANC | Using the notations in Section 2, the recall, precision, and F-measure on coreference links are: |
Original BLANC | Similarly, the recall, precision, and F-measure on non-coreference links are computed as: |
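Original BLANC | A sketch of the quantities referred to above, writing C_k, C_r for the gold and response coreference link sets and N_k, N_r for the corresponding non-coreference link sets (our stand-in for the notation of Section 2):

\[
R_c = \frac{|C_k \cap C_r|}{|C_k|}, \quad
P_c = \frac{|C_k \cap C_r|}{|C_r|}, \quad
F_c = \frac{2 R_c P_c}{R_c + P_c}; \qquad
R_n = \frac{|N_k \cap N_r|}{|N_k|}, \quad
P_n = \frac{|N_k \cap N_r|}{|N_r|}, \quad
F_n = \frac{2 R_n P_n}{R_n + P_n}; \qquad
\mathrm{BLANC} = \frac{F_c + F_n}{2}
\]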
Abstract | Experimental results show that the proposed model achieves 83% in F-measure, and outperforms the state-of-the-art baseline by over 7%. |
Conclusions and Future Work | Experimental results show our proposed framework outperforms the state-of-the-art baseline by over 7% in F-measure. |
Experiments | It is worth noting that the F-measure reported for the event phrase extraction is only 64% in the baseline approach (Ritter et al., 2012). |
Introduction | We have conducted experiments on a Twitter corpus and the results show that our proposed approach outperforms TwiCal, the state-of-the-art open event extraction system, by 7.7% in F-measure . |
Experiment | For performance evaluations of opinion and polarity detection, we use precision, recall, and F-measure, the same measures used to report the official results at the NTCIR MOAT workshop. |
Experiment | System parameters are optimized for F-measure using the NTCIR6 dataset with lenient evaluations. |
Experiment | Model     Precision  Recall  F-Measure
Experiment | BASELINE  0.305      0.866   0.451
Experiment | VS        0.331      0.807   0.470
Experiment | BM25      0.327      0.795   0.464
Experiment | LM        0.325      0.794   0.461
Experiment | LSA       0.315      0.806   0.453
Experiment | PMI       0.342      0.603   0.436
Experiment | DTP       0.322      0.778   0.455
Experiment | VS-LSA    0.335      0.769   0.466
Experiment | VS-PMI    0.311      0.833   0.453
Experiment | VS-DTP    0.342      0.745   0.469 |
Abstract | We conduct experiments on data sets from the NEWS 2010 shared task on transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted. |
Conclusion | We evaluated it against the semi-supervised systems of NEWS10 and achieved a high F-measure, performing better than most of the semi-supervised systems. |
Conclusion | We also evaluated our method on parallel corpora and achieved a high F-measure. |
Experiments | “Our” shows the F-measure of our filtered data against the gold standard using the supplied evaluation tool, “Systems” is the total number of participants in the subtask, and “Rank” is the rank we would have obtained if our system had participated. |
Experiments | We calculate the F-measure of our filtered transliteration pairs against the supplied gold standard using the supplied evaluation tool. |
Experiments | On the English/Russian data set, our system achieves 76% F-measure which is not good compared with the systems that participated in the shared task. |
Introduction | We achieve an F-measure of up to 92% outperforming most of the semi-supervised systems. |
Abstract | Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system. |
Conclusion | We treat word alignment as a parsing problem, and by taking advantage of English syntax and the hypergraph structure of our search algorithm, we report significant increases in both F-measure and BLEU score over standard baselines in use by most state-of-the-art MT systems today. |
Discriminative training | Note that Equation 2 is equivalent to maximizing the sum of the F-measure and model score of y: |
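Discriminative training | Schematically, and with our own symbols rather than those of Equation 2 (w the weight vector, h(y) the feature vector, y* the gold alignment), the selected hypothesis would be:

\[
\hat{y} = \arg\max_{y} \Big( F_1\!\left(y, y^{*}\right) + \mathbf{w} \cdot \mathbf{h}(y) \Big)
\]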
Experiments | These plots show the current F-measure on the training set as time passes. |
Experiments | The first three columns of Table 2 show the balanced F-measure, Precision, and Recall of our alignments versus the two GIZA++ Model-4 baselines. |
Word Alignment as a Hypergraph | (1) that using the structure of 1-best English syntactic parse trees is a reasonable way to frame and drive our search, and (2) that F-measure approximately decomposes over hyperedges. |
Experimental Evaluation | leading to an F-measure of 53% for our initialization heuristic. |
Experimental Evaluation | on various quality metrics, of which F-Measure is typically considered most important. |
Experimental Evaluation | Our pure-LM setting (i.e., A = 1) was seen to perform up to 6 F-Measure points better than the pure-TM setting (i.e., A = 0), whereas the uniform mix is seen to be able to harness both to give a 1.4 point (i.e., 2.2%) improvement over the pure-LM case. |
Experiments | Table 1: ROUGE-1 recall (R) and F-measure (F) results (%) on DUC-04. |
Experiments | On the other hand, when using L1(S) with α < 1 (the value of α was determined on DUC-03 using a grid search), a ROUGE-1 F-measure score of 38.65% |
Discussion | Here, we compared our method with baseline 3, whose F-measure was the highest among the four baselines described in Section 5.1. |
Discussion | Method      Recall (%)       Precision (%)    F-measure
Discussion | by human    89.82 (459/511)  89.82 (459/511)  89.82
Discussion | our method  82.19 (420/511)  81.71 (420/514)  81.95 |
Discussion | In F-measure, our method achieved 91.24% (81.95/89.82) of the result by the human annotator. |
Experiment | On the other hand, the F-measure and the sentence accuracy of our method were 81.43 and 53.15%, respectively. |
Abstract | On average, across a variety of testing scenarios, our model achieves an 8.8-point absolute gain in F-measure. |
Experimental setup | In both cases our implementation achieves an F-measure in the range of 69-70% on WSJ10, broadly in line with the performance reported by Klein and Manning (2002). |
Experimental setup | To evaluate both our model as well as the baseline, we use (unlabeled) bracket precision, recall, and F-measure (Klein and Manning, 2002). |
Experimental setup | We also report the upper bound on F-measure for binary trees. |
Introduction | On average, over all the testing scenarios that we studied, our model achieves an absolute increase in F-measure of 8.8 points, and a 19% reduction in error relative to a theoretical upper bound. |
Results | On average, the bilingual model gains 10.2 percentage points in precision, 7.7 in recall, and 8.8 in F-measure. |
Results | The Korean-English pairing results in substantial improvements for Korean and quite large improvements for English, for which the absolute gain reaches 28 points in F-measure. |
Introduction | Neither recall nor f-measure is reported. |
Introduction | The feature whose omission causes the biggest drop in f-measure is set aside as a strong feature. |
Introduction | From the ranked list of features f1 to fn we evaluate increasingly extended feature sets f1..fi for i = 2..n. We select the feature set that yields the best balanced performance, at 45.7% precision and 53.6% f-measure. |
Lexicon Bootstrapping | The set of parameters 5 is optimized using a grid search on the development data using F-measure for subjectivity classification. |
Lexicon Evaluations | and F-measure results. |
Lexicon Evaluations | For polarity classification we get comparable F-measure but much higher recall for Lg compared to SWN. |
Lexicon Evaluations | Figure 1: Precision (x-axis), recall (y-axis) and F-measure (in the table) for English: L? |
Conclusion | Comparing with TextRunner, WOEPOS runs at the same speed, but achieves an F-measure which is between 18% and 34% greater on three corpora; WOEparse achieves an F-measure which is between 72% and 91% higher than that of TextRunner, but runs about 30 times slower due to the time required for parsing. |
Experiments | Figure 3: WOEPOS achieves an F-measure which is between 18% and 34% better than TextRunner's. |
Experiments | Figure 4: WOEparse’s F-measure decreases more slowly with sentence length than WOEPOS and TextRunner, due to its better handling of difficult sentences using parser features. |
Introduction | Compared with TextRunner (the state of the art) on three corpora, WOE yields between 72% and 91% improved F-measure — generalizing well beyond Wikipedia. |
Wikipedia-based Open IE | As shown in the experiments on three corpora, WOEparse achieves an F-measure which is between 72% and 91% greater than TextRunner's. |
Wikipedia-based Open IE | As shown in the experiments, WOEPOS achieves an F-measure between 18% and 34% better than TextRunner's on three corpora, and this is mainly due to the increase in precision. |
Experiments | Evaluation Metrics: We select precision (P), recall (R) and f-measure (F) as metrics. |
Experiments | The experimental results are shown in Tables 2, 3, 4 and 5, where the last column presents the average F-measure scores for multiple domains. |
Experiments | Due to space limitations, we only show the F-measure of CR_WP on four domains. |
Conclusion and Future Works | It obtains a considerable F-measure increment, about 0.8 points for word segmentation and 1 point for Joint S&T, with corresponding error reductions of 30.2% and 14%. |
Conclusion and Future Works | Moreover, this improvement further brings a striking F-measure increment for Chinese parsing, about 0.8 points, corresponding to an error propagation reduction of 38%. |
Experiments | For word segmentation, the model after annotation adaptation (row 4 in the upper subtable) achieves an F-measure increment of 0.8 points over the baseline model, corresponding to an error reduction of 30.2%; while for Joint S&T, the F-measure increment of the adapted model (row 4 in the lower subtable) is 1 point, which corresponds to an error reduction of 14%. |
Experiments | Note that if we input the gold-standard segmented test set into the parser, the F-measure under the two definitions is the same. |
Experiments | The parsing F-measure corresponding to the gold-standard segmentation, 82.35, represents the "oracle" accuracy (i.e., upper bound) of parsing on top of automatic word segmentation. |
Results | For the alignment detection, we report the precision, recall, and F-measure associated with correctly detecting matches. |
Results | The threshold weight learned from the bias feature strongly influences the point at which real scores change from non-matches to matches, and given the threshold weight learned by the algorithm, we find an F-measure of 0.72, with precision (P) = 0.85 and recall (R) = 0.62. |
Results | By manually varying the threshold, we find a maximum F-measure of 0.76, with P=0.79 and R=0.74. |
Introduction | Experimental results show that our method can achieve a 5.4% F-measure improvement over the traditional convolution tree kernel based method. |
Introduction | The overall performance of CTK and FTK is shown in Table 1; the F-measure improvements over CTK are also shown inside the parentheses. |
Introduction | FTK on the 7 major relation types and their F-measure improvement over CTK |
Experiments | Evaluation Metrics: We evaluate the proposed method in terms of precision (P), recall (R) and F-measure (F). |
Experiments | Figure 5 shows the performance under different N, where the F-Measure saturates once N reaches 40 and beyond. |
Experiments | Since the model can give the same score for a permutation and the original document, we also compute F-measure where recall is correct/total and precision equals correct/decisions. |
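Experiments | In our own symbols, that is:

\[
R = \frac{\#\mathrm{correct}}{\#\mathrm{total}}, \qquad
P = \frac{\#\mathrm{correct}}{\#\mathrm{decisions}}, \qquad
F = \frac{2PR}{P + R}
\]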
Experiments | For evaluation purposes, the accuracy still corresponds to the number of correct ratings divided by the number of comparisons, while the F-measure combines recall and precision measures. |
Experiments | Moreover, in contrast to the first experiment, when accounting for the number of entities “shared” by two sentences (PW), values of accuracy and F-measure are lower. |
Conclusions | Table 3: Classification performance in F-measure for semantically ambiguous words on the most frequently confused descriptive tags in the movie domain. |
Experiments | Figure 2: F-measure for semantic clustering performance. |
Experiments | As expected, we see a drop in F-measure on all models on descriptive tags. |
Concluding Remarks | Using MDL alone, one proposed method outperforms the original regularized compressor (Chen et al., 2012) in precision by 2 percentage points and in F-measure by 1. |
Evaluation | Segmentation performance is measured using word-level precision (P), recall (R), and F-measure (F). |
Evaluation | We found that, in all three settings, G2 outperforms the baseline by 1 to 2 percentage points in F-measure . |
Evaluation | The best result achieved by G2 in our experiment is 81.7 in word-level F-measure, although this was obtained from search setting (c), using a heuristic p value of 0.37. |
Abstract | Without using any additional labeled data this new approach obtained 7.6% higher F-Measure in trigger labeling and 6% higher F-Measure in argument labeling over a state-of-the-art IE system which extracts events independently for each sentence. |
Experimental Results and Analysis | We select the thresholds (d with k=1~13) for various confidence metrics by optimizing the F-measure score of each rule on the development set, as shown in Figures 2 and 3. |
Experimental Results and Analysis | The labeled point on each curve shows the best F-measure that can be obtained on the development set by adjusting the threshold for that rule. |
Experimental Results and Analysis | Table 5 shows the overall Precision (P), Recall (R) and F-Measure (F) scores for the blind test set. |
Abstract | This approach generates alignments that are 2.6 f-Measure points better than a baseline supervised aligner. |
Conclusion | We also proposed a model that scores alignments given source and target sentence reorderings that improves a supervised alignment model by 2.6 points in f-Measure . |
Results and Discussions | The f-Measure of this aligner is 78.1% (see row 1, column 2). |
Results and Discussions | Method                     f-Measure  mBLEU
Results and Discussions | Base Correction model      78.1       55.1
Results and Discussions | Correction model, C(f'|a)  78.1       56.4
Results and Discussions | P(a|f'), C(f'|a)           80.7       57.6 |
Conclusion | selected among multiple alignments, and it obtained a 0.8-point F-measure improvement over the single best Chinese-English aligner. |
Conclusion | The second is the alignment link confidence measure, which selects the most reliable links from multiple alignments and obtained a 1.5-point F-measure improvement. |
Improved MaXEnt Aligner with Confidence-based Link Filtering | the highest F-measure among the three aligners, although the algorithm described below can be applied to any aligner. |
Improved MaXEnt Aligner with Confidence-based Link Filtering | For CE alignment, removing low-confidence alignment links increased alignment precision by 5.5 points while decreasing recall by 1.8 points, and the overall alignment F-measure increased by 1.3 points. |
Improved MaXEnt Aligner with Confidence-based Link Filtering | When looking into the alignment links which are removed during the alignment link filtering process, we found that 80% of the removed links (1320 out of 1661 links) are incorrect alignments. For A-E alignment, filtering increased the precision by 3 points while reducing recall by 0.5 points, and the alignment F-measure increased by about 1.5 points absolute, a 10% relative alignment error rate reduction. |
Data and task | For Bulgarian, the F-measure of the full model is 92.8 compared to the best baseline result of 83.2. |
Data and task | Within the semi-CRF model, the contribution of English sentence context was substantial, leading to a 2.5-point increase in F-measure for Bulgarian (92.8 versus 90.3) and a 4.0-point increase for Korean (91.2 versus 87.2). |
Data and task | Preliminary results show performance of over 80 F-measure for such monolingual models. |
Introduction | Our results show that the semi-CRF model improves on the performance of projection models by more than 10 points in F-measure, and that we can achieve a tagging F-measure of over 91 using a very small number of annotated sentence pairs. |
Evaluation | We measured correction performance by recall, precision, and F-measure. |
Evaluation | The simple error case frame-based method achieves an F-measure of 0.189. |
Evaluation | The hybrid methods achieve the best performances in F-measure . |
Experiments | Overall, we improve the labelled F-measure by almost 1.1% and unlabelled F-measure by 0.6% over the baseline. |
Experiments | Table 6: Parsing time in seconds per sentence (vs. F-measure) on section 00. |
Oracle Parsing | The inverse relationship between model score and F-score shows that the supertagger restricts the parser to mostly good parses (under F-measure) that the model would otherwise disprefer. |
Experiments | #Correct’ Recall m and F-measure #guessed #relevant |
Experiments | Finally, both of the OpPr systems are better than both baselines in Accuracy as well as F-measure for all four debates. |
Experiments | The F-measure improves, on average, by 25 percentage points over the OpTopic system, and by 17 percentage points over the OpPMI system. |
Empirical Evaluation | Since the purpose of solving this problem is to identify the threads which should be brought to the notice of the instructors, we measure the performance of our models using F-measure of the positive class. |
Empirical Evaluation | 6 shows 10-fold cross validation F-measure of the positive class for LR when different types of features are excluded from the full set. |
Abstract | We transform tagged character sequences to word segmentations first, and then evaluate word segmentations by F-measure, as defined in Section 5.2. |
Abstract | We focus on the task of word segmentation only in this work and show that a comparable F-measure is achievable in a much more efficient manner. |
Abstract | The addition of these features makes a moderate improvement on the F-measure, from 0.974 to 0.975. |
Experiments | The performance measurement for word segmentation is balanced F-measure, F = 2PR/(P + R), a function of precision P and recall R, where P is the percentage of words in segmentation results that are segmented correctly, and R is the percentage of correctly segmented words in the gold standard words. |
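Experiments | A minimal sketch of this word-level scoring, assuming a word counts as correct only if its exact character span also appears in the gold segmentation (function and variable names are ours, not the paper's):

```python
def word_spans(words):
    # Map a segmentation to the set of (start, end) character spans of its words.
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def segmentation_prf(predicted_words, gold_words):
    pred, gold = word_spans(predicted_words), word_spans(gold_words)
    correct = len(pred & gold)
    p = correct / len(pred) if pred else 0.0   # precision over predicted words
    r = correct / len(gold) if gold else 0.0   # recall over gold words
    f = 2 * p * r / (p + r) if p + r else 0.0  # balanced F-measure
    return p, r, f

# Example: two of the three predicted words match the gold segmentation,
# giving P = 2/3, R = 2/4 and F ≈ 0.571.
print(segmentation_prf(["他", "来到", "北京"], ["他", "来", "到", "北京"]))
```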
Experiments | Wikipedia brings an F-measure increment of 0.93 points. |
Introduction | Experimental results show that the knowledge implied in the natural annotations can significantly improve the performance of a baseline segmenter trained on CTB 5.0, yielding an F-measure increment of 0.93 points on the CTB test set and an average increment of 1.53 points on 7 other domains. |
Experiments and results | The F-measure score using the initial training data is 0.69. |
Experiments and results | Most of the previous work for prosodic event detection reported their results using classification accuracy instead of F-measure . |
Experiments and results | Table 3: The results (F-measure) of prosodic event detection for supervised and co-training approaches. |
Experiments | Each learner uses a small amount of development data to tune a threshold on scores for predicting new-sense or not-a-new-sense, using macro F-measure as an objective. |
Experiments | are relatively weak for predicting new senses on EMEA data but stronger on Subs (TYPEONLY AUC performance is higher than both baselines) and even stronger on Science data (TYPEONLY AUC and f-measure performance is higher than both baselines as well as the ALLFEATURES model). |
Experiments | Recall that the microlevel evaluation computes precision, recall, and f-measure for all word tokens of a given word type and then averages across word types. |
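Experiments | Schematically (our notation, with V the set of evaluated word types and F(w) the token-level f-measure for type w):

\[
F_{\text{micro-level}} = \frac{1}{|V|} \sum_{w \in V} F(w)
\]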
Abstract | Results over a dataset of entities from four product domains show that the proposed approach achieves an F-measure of 0.96, significantly above the baseline. |
Experimental Evaluation | Of the three individual modules, the n-gram and clustering methods achieve F-measure of around 0.9, while the ontology-based module performs only modestly above baseline. |
Experimental Evaluation | The final system that employed all modules produced an F-measure of 0.960, a significant (p < 0.01) absolute increase of 15.4% over the baseline. |
Abstract | We test our approach on a held-out test set from EUROVOC and perform precision, recall and f-measure evaluations for 20 European language pairs. |
Experiments 5.1 Data Sources | To test the classifier’s performance we evaluated it against a list of positive and negative examples of bilingual term pairs using the measures of precision, recall and F-measure . |
Method | First, we evaluate the performance of the classifier on a held-out term-pair list from EUROVOC using the standard measures of recall, precision and F-measure . |
Evaluation and Discussion | The performance is measured using precision, recall, and f-measure. |
Evaluation and Discussion | We additionally used the weighted f-measure (wF) to aggregate the f-measure of the three categories. |
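Evaluation and Discussion | A plausible reading of this aggregation, assuming weights proportional to category frequency (n_c instances of category c out of N in total; our notation):

\[
wF = \sum_{c} \frac{n_c}{N} \, F_c
\]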
Evaluation and Discussion | The overall average of the weighted f-measure among issues was 0.68, 0.59, and 0.48 for DrC, QbC, and Sim., respectively. |
Experiments | Using the same evaluation setting, our baseline RE system achieves a competitive 71.4 F-measure . |
Experiments | The results show that by using syntactico-semantic structures, we obtain significant F-measure improvements of 8.3, 7.2, and 5.5 for binary, coarse-grained, and fine-grained relation predictions respectively. |
Experiments | The results show that by leveraging syntactico-semantic structures, we obtain significant F-measure improvements of 8.2, 4.6, and 3.6 for binary, coarse-grained, and fine-grained relation predictions respectively. |
Substructure Spaces for BTKs | The coefficients for the composite kernel are tuned with respect to F-measure (F) on the development set of the HIT corpus. |
Substructure Spaces for BTKs | Those thresholds are also tuned on the development set of the HIT corpus with respect to F-measure. |
Substructure Spaces for BTKs | The evaluation is conducted by means of Precision (P), Recall (R) and F-measure (F). |
Experiments | F-Measure (F): the harmonic mean of purity and inverse purity. |
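Experiments | That is:

\[
F = \frac{2 \cdot \mathrm{Purity} \cdot \mathrm{InversePurity}}{\mathrm{Purity} + \mathrm{InversePurity}}
\]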
Experiments | We use F-measure as the primary measure, just like WePS1 and WePS2. |
Experiments | The F-Measure vs. 1 on three data sets |
Extracting Conversational Networks from Literature | Table 2: Precision, recall, and F-measure of three methods for detecting bilateral conversations in literary texts. |
Extracting Conversational Networks from Literature | The precision and recall values shown for the baselines in Table 2 represent the highest performance we achieved by varying t between 0 and 1 (maximizing F-measure over t). |
Extracting Conversational Networks from Literature | Both baselines performed significantly worse in precision and F-measure than our quoted speech adjacency method for detecting conversations. |
Abstract | Experimental results show that our approach improved the F-measure by 3.6-10.3%. |
Motivation | (2008), which was only applied for Japanese and achieved around 80% in F-measure . |
Motivation | Experimental results showed that our method based on bilingual co-training improved the performance of monolingual hyponymy-relation acquisition by about 3.6-10.3% in F-measure. |