Index of papers in Proc. ACL that mention
  • F-measure
Yang, Xiaofeng and Su, Jian and Lang, Jun and Tan, Chew Lim and Liu, Ting and Li, Sheng
Experiments and Results
Trained and tested on Penn WSJ TreeBank, the POS tagger could obtain an accuracy of 97% and the NP chunker could produce an F-measure above 94% (Zhou and Su, 2000).
Experiments and Results
Evaluated for the MUC-6 and MUC-7 Named-Entity task, the NER module (Zhou and Su, 2002) could provide an F-measure of 96.6% (MUC-6) and 94.1% (MUC-7).
Experiments and Results
The overall F-measure for NWire, NPaper and BNews is 60.4%, 57.9% and 62.9% respectively.
F-measure is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Darwish, Kareem
Cross-lingual Features
This led to an overall improvement in F-measure of 1.8 to 3.4 points (absolute) or 4.2% to 5.7% (relative).
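(For reference, the relative figure is simply the absolute gain divided by the baseline score: a 1.8-point absolute gain corresponds to a 4.2% relative gain only when the baseline F-measure is around 1.8/0.042 ≈ 43 points.)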
Cross-lingual Features
Transliteration mining slightly lowered precision — except for the TWEETS test set, where the drop in precision was significant — and increased recall, leading to an overall improvement in F-measure for all test sets.
Related Work
They reported 80%, 37%, and 47% F-measure for locations, organizations, and persons respectively on the ANERCORP dataset that they created and publicly released.
Related Work
They reported 87%, 46%, and 52% F-measure for locations, organizations, and persons respectively.
Related Work
Using POS tagging generally improved recall at the expense of precision, leading to overall improvements in F-measure.
F-measure is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Falk, Ingrid and Gardent, Claire and Lamirel, Jean-Charles
Clustering Methods, Evaluation Metrics and Experimental Setup
Feature maximisation is a cluster quality metric which associates each cluster with maximal features, i.e., features whose Feature F-measure is maximal.
Clustering Methods, Evaluation Metrics and Experimental Setup
Feature F-measure is the harmonic mean of Feature Recall and Feature Precision which in turn are defined as:
Clustering Methods, Evaluation Metrics and Experimental Setup
represents the weight of the feature f for element x, and FC designates the set of features associated with the verbs occurring in the cluster c. A feature is then said to be maximal for a given cluster iff its Feature F-measure is higher for that cluster than for any other cluster.
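A rough sketch of the feature-maximisation computation described above, assuming the usual definitions (Feature Recall: a feature's weight mass inside the cluster over its weight mass across all clusters; Feature Precision: its weight mass inside the cluster over the total weight mass of all features in that cluster). The weight values here are only a stand-in for the paper's weighting scheme:

    from collections import defaultdict

    def feature_f_measures(weights, clusters):
        # weights[x][f]: weight of feature f for element x (stand-in values).
        # clusters: dict mapping cluster id -> list of elements.
        # Returns ff[c][f] = Feature F-measure of feature f in cluster c.
        total_f = defaultdict(float)
        for x, feats in weights.items():
            for f, w in feats.items():
                total_f[f] += w
        ff = {}
        for c, members in clusters.items():
            in_c = defaultdict(float)
            for x in members:
                for f, w in weights[x].items():
                    in_c[f] += w
            cluster_mass = sum(in_c.values())
            ff[c] = {}
            for f, w in in_c.items():
                recall = w / total_f[f] if total_f[f] else 0.0
                precision = w / cluster_mass if cluster_mass else 0.0
                ff[c][f] = (2 * precision * recall / (precision + recall)
                            if precision + recall else 0.0)
        return ff

    def maximal_features(ff):
        # A feature is maximal for the cluster where its Feature F-measure is highest.
        best = {}
        for c, feats in ff.items():
            for f, v in feats.items():
                if f not in best or v > best[f][1]:
                    best[f] = (c, v)
        return {f: c for f, (c, _) in best.items()}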
Introduction
Their approach achieves an F-measure of 55.1 on 116 verbs occurring at least 150 times in Lexschem.
Introduction
The best performance is achieved when restricting the approach to verbs occurring at least 4000 times (43 verbs) with an F-measure of 65.4.
Introduction
We show that the approach yields promising results (F-measure of 70%) and that the clustering produced systematically associates verbs with syntactic frames and thematic grids, thereby providing an interesting basis for the creation and evaluation of a Verbnet-like classification.
F-measure is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Spiegler, Sebastian and Flach, Peter A.
Conclusions
At an optimal decision threshold, however, both yielded a similar f-measure result.
Experiments and Results
Since both algorithms show different behaviour with increasing experience and PROMODES-H yields a higher f-measure across all datasets, we will investigate in the next experiments how these differences manifest themselves at the boundary level.
F-measure is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Persing, Isaac and Ng, Vincent
Abstract
Experimental results show that our algorithm yields a relative error reduction of 6.3% in F-measure for the minority classes in comparison to a baseline that learns solely from the labeled data.
Baseline Approaches
Results are reported in terms of precision (P), recall (R), and F-measure (F), which are computed by aggregating over the 14 shapers as follows.
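A minimal sketch of such an aggregation, assuming the standard micro-averaged definitions (true positives, false positives and false negatives pooled over the per-shaper classifiers before computing P, R and F), consistent with the micro-averaged results quoted below; the exact bookkeeping in the paper may differ:

    def micro_prf(per_class_counts):
        # per_class_counts: list of (tp, fp, fn) tuples, one per class (e.g., 14 shapers).
        tp = sum(c[0] for c in per_class_counts)
        fp = sum(c[1] for c in per_class_counts)
        fn = sum(c[2] for c in per_class_counts)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f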
Baseline Approaches
Our second baseline is similar to the first, except that we tune the classification threshold (CT) to optimize F-measure.
Baseline Approaches
Using the development data, we tune the 14 CTs jointly to optimize overall F-measure.
Evaluation
Micro-averaged 5-fold cross validation results of this baseline for all 14 shapers and for just 10 minority classes (due to our focus on improving minority class prediction) are expressed as percentages in terms of precision (P), recall (R), and F-measure (F) in the first row of Table 4.
Evaluation
As we can see, the baseline achieves an F-measure of 45.4 (14 shapers) and 35.4 (10 shapers).
Evaluation
Comparing these two results, the higher F-measure achieved using all 14 shapers can be attributed primarily to improvements in recall.
Introduction
In comparison to a supervised baseline approach where a classifier is acquired solely based on the training set, our bootstrapping approach yields a relative error reduction of 6.3% in F-measure for the minority classes.
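For reference, "relative error reduction in F-measure" is usually computed on the error 1 - F; a small sketch follows, where the 0.395 in the comment is back-derived from the 35.4 baseline quoted above and the 6.3% figure, purely for illustration:

    def relative_error_reduction(f_baseline, f_new):
        # Relative error reduction when error is defined as 1 - F.
        return (f_new - f_baseline) / (1.0 - f_baseline)

    # Illustrative only: a baseline F of 0.354 improving to about 0.395 gives
    # (0.395 - 0.354) / (1 - 0.354) ≈ 0.063, i.e., a 6.3% relative error reduction.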
Our Bootstrapping Algorithm
In particular, if the second baseline is used, we will tune CT and k jointly on the development data using the local search algorithm described previously, where we adjust the values of both CT and k for one of the 14 classifiers in each step of the search process to optimize the overall F-measure score.
F-measure is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Luo, Xiaoqiang and Pradhan, Sameer and Recasens, Marta and Hovy, Eduard
BLANC for Imperfect Response Mentions
and we propose to extend the coreference F-measure and non-coreference F-measure as follows.
BLANC for Imperfect Response Mentions
Coreference recall, precision and F-measure are changed to:
BLANC for Imperfect Response Mentions
Non-coreference recall, precision and F-measure are changed to:
Introduction
It calculates recall, precision and F-measure separately on coreference and non-coreference links in the usual way, and defines the overall recall, precision and F-measure as the mean of the respective measures for coreference and non-coreference links.
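A minimal sketch of that averaging over link counts; the quantities are the usual ones (correct links in the response, total links in the response, total links in the key), and the extension to imperfect response mentions described above adds terms not shown here:

    def blanc(coref_correct, coref_response, coref_key,
              noncoref_correct, noncoref_response, noncoref_key):
        # F on coreference links and F on non-coreference links, then their mean.
        def f(correct, response, key):
            p = correct / response if response else 0.0
            r = correct / key if key else 0.0
            return 2 * p * r / (p + r) if p + r else 0.0
        return (f(coref_correct, coref_response, coref_key)
                + f(noncoref_correct, noncoref_response, noncoref_key)) / 2.0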
Original BLANC
BLANC-gold solves this problem by averaging the F-measure computed over coreference links and the F-measure over non-coreference links.
Original BLANC
Using the notations in Section 2, the recall, precision, and F-measure on coreference links are:
Original BLANC
Similarly, the recall, precision, and F-measure on non-coreference links are computed as:
F-measure is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Zhou, Deyu and Chen, Liangyu and He, Yulan
Abstract
Experimental results show that the proposed model achieves 83% in F-measure, and outperforms the state-of-the-art baseline by over 7%.
Conclusions and Future Work
Experimental results show our proposed framework outperforms the state-of-the-art baseline by over 7% in F-measure.
Experiments
It is worth noting that the F-measure reported for the event phrase extraction is only 64% in the baseline approach (Ritter et al., 2012).
Experiments
Method Tuple Evaluated Precision Recall F-measure
Introduction
We have conducted experiments on a Twitter corpus and the results show that our proposed approach outperforms TwiCal, the state-of-the-art open event extraction system, by 7.7% in F-measure .
F-measure is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Kim, Jungi and Li, Jin-Ji and Lee, Jong-Hyeok
Experiment
For performance evaluations of opinion and polarity detection, we use precision, recall, and F-measure, the same measures used to report the official results at the NTCIR MOAT workshop.
Experiment
System parameters are optimized for F-measure using NTCIR6 dataset with lenient evaluations.
Experiment
Model      Precision  Recall  F-Measure
BASELINE   0.305      0.866   0.451
VS         0.331      0.807   0.470
BM25       0.327      0.795   0.464
LM         0.325      0.794   0.461
LSA        0.315      0.806   0.453
PMI        0.342      0.603   0.436
DTP        0.322      0.778   0.455
VS-LSA     0.335      0.769   0.466
VS-PMI     0.311      0.833   0.453
VS-DTP     0.342      0.745   0.469
F-measure is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut
Abstract
We conduct experiments on data sets from the NEWS 2010 shared task on transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted.
Conclusion
We evaluated it against the semi-supervised systems of NEWS10 and achieved high F-measure and performed better than most of the semi-supervised systems.
Conclusion
We also evaluated our method on parallel corpora and achieved high F-measure.
Experiments
“Our” shows the F-measure of our filtered data against the gold standard using the supplied evaluation tool, “Systems” is the total number of participants in the subtask, and “Rank” is the rank we would have obtained if our system had participated.
Experiments
We calculate the F-measure of our filtered transliteration pairs against the supplied gold standard using the supplied evaluation tool.
Experiments
On the English/Russian data set, our system achieves 76% F-measure which is not good compared with the systems that participated in the shared task.
Introduction
We achieve an F-measure of up to 92% outperforming most of the semi-supervised systems.
F-measure is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Riesa, Jason and Marcu, Daniel
Abstract
Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system.
Conclusion
We treat word alignment as a parsing problem, and by taking advantage of English syntax and the hypergraph structure of our search algorithm, we report significant increases in both F-measure and BLEU score over standard baselines in use by most state-of-the-art MT systems today.
Discriminative training
Note that Equation 2 is equivalent to maximizing the sum of the F-measure and model score of y:
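A toy illustration of scoring by F-measure plus model score; the paper does this inside a hypergraph search, whereas this sketch just takes an argmax over an explicit candidate list, with alignments represented as sets of (i, j) links:

    def alignment_f(hyp_links, gold_links):
        # Balanced F-measure between two sets of word-alignment links.
        hyp, gold = set(hyp_links), set(gold_links)
        correct = len(hyp & gold)
        p = correct / len(hyp) if hyp else 0.0
        r = correct / len(gold) if gold else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def best_by_f_plus_model_score(candidates, gold_links, model_score):
        # Candidate maximizing F-measure + model score (cf. Equation 2 in the paper).
        return max(candidates, key=lambda y: alignment_f(y, gold_links) + model_score(y))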
Experiments
These plots show the current F-measure on the training set as time passes.
Experiments
The first three columns of Table 2 show the balanced F-measure, Precision, and Recall of our alignments versus the two GIZA++ Model-4 baselines.
Word Alignment as a Hypergraph
(1) that using the structure of l-best English syntactic parse trees is a reasonable way to frame and drive our search, and (2) that F-measure approximately decomposes over hyperedges.
F-measure is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
P, Deepak and Visweswariah, Karthik
Experimental Evaluation
leading to an F-measure of 53% for our initialization heuristic.
Experimental Evaluation
on various quality metrics, of which F-Measure is typically considered most important.
Experimental Evaluation
Our pure-LM setting (i.e., λ = 1) was seen to perform up to 6 F-Measure points better than the pure-TM setting (i.e., λ = 0), whereas the uniform mix is seen to be able to harness both to give a 1.4 point (i.e., 2.2%) improvement over the pure-LM case.
F-measure is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Lin, Hui and Bilmes, Jeff
Experiments
Table 1: ROUGE-1 recall (R) and F-measure (F) results (%) on DUC-04.
Experiments
On the other hand, when using L1(S) with an α < 1 (the value of α was determined on DUC-03 using a grid search), a ROUGE-1 F-measure score of 38.65%
F-measure is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Ohno, Tomohiro and Murata, Masaki and Matsubara, Shigeki
Discussion
Here, we compared our method with baseline 3, whose F-measure was the highest among the four baselines described in Section 5.1.
Discussion
             recall (%)         precision (%)      F-measure
by human     89.82 (459/511)    89.82 (459/511)    89.82
our method   82.19 (420/511)    81.71 (420/514)    81.95
Discussion
In F-measure, our method achieved 91.24% (81.95/89.82) of the result by the human annotator.
Experiment
On the other hand, the F-measure and the sentence accuracy of our method were 81.43 and 53.15%, respectively.
F-measure is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Snyder, Benjamin and Naseem, Tahira and Barzilay, Regina
Abstract
On average, across a variety of testing scenarios, our model achieves an 8.8-point absolute gain in F-measure.
Experimental setup
In both cases our implementation achieves an F-measure in the range of 69-70% on WSJ10, broadly in line with the performance reported by Klein and Manning (2002).
Experimental setup
To evaluate both our model as well as the baseline, we use (unlabeled) bracket precision, recall, and F-measure (Klein and Manning, 2002).
Experimental setup
We also report the upper bound on F-measure for binary trees.
Introduction
On average, over all the testing scenarios that we studied, our model achieves an absolute increase in F-measure of 8.8 points, and a 19% reduction in error relative to a theoretical upper bound.
Results
On average, the bilingual model gains 10.2 percentage points in precision, 7.7 in recall, and 8.8 in F-measure.
Results
The Korean-English pairing results in substantial improvements for Korean and quite large improvements for English, for which the absolute gain reaches 28 points in F-measure.
F-measure is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Reiter, Nils and Frank, Anette
Introduction
Neither recall nor f-measure is reported.
Introduction
The feature whose omission causes the biggest drop in f-measure is set aside as a strong feature.
Introduction
From the ranked list of features f1 to fn we evaluate increasingly extended feature sets f1..fi for i = 2..n. We select the feature set that yields the best balanced performance, at 45.7% precision and 53.6% f-measure.
F-measure is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Volkova, Svitlana and Wilson, Theresa and Yarowsky, David
Lexicon Bootstrapping
The set of parameters is optimized using a grid search on the development data, using F-measure for subjectivity classification.
Lexicon Evaluations
and F-measure results.
Lexicon Evaluations
For polarity classification we get comparable F-measure but much higher recall for Lg compared to SWN.
Lexicon Evaluations
Figure 1: Precision (x-axis), recall (y-axis) and F-measure (in the table) for English: L?
F-measure is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Wu, Fei and Weld, Daniel S.
Conclusion
Compared with TextRunner, WOEPOS runs at the same speed, but achieves an F-measure which is between 18% and 34% greater on three corpora; WOEparse achieves an F-measure which is between 72% and 91% higher than that of TextRunner, but runs about 30 times slower due to the time required for parsing.
Experiments
Figure 3: WOEPOS achieves an F-measure which is between 18% and 34% better than TextRunner’s.
Experiments
Figure 4: WOEparse’s F-measure decreases more slowly with sentence length than WOEPOS and TextRunner, due to its better handling of difficult sentences using parser features.
Introduction
Compared with TextRunner (the state of the art) on three corpora, WOE yields between 72% and 91% improved F-measure — generalizing well beyond Wikipedia.
Wikipedia-based Open IE
As shown in the experiments on three corpora, WOEparse achieves an F-measure which is between 72% and 91% greater than TextRunner’s.
Wikipedia-based Open IE
As shown in the experiments, WOEPOS achieves an F-measure between 18% and 34% better than TextRunner’s on three corpora, and this is mainly due to the increase in precision.
F-measure is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Liu, Kang and Xu, Liheng and Zhao, Jun
Experiments
Evaluation Metrics: We select precision (P), recall (R) and f-measure (F) as metrics.
Experiments
The experimental results are shown in Table 2, 3, 4 and 5, where the last column presents the average F-measure scores for multiple domains.
Experiments
Due to space limitation, we only show the F-measure of CR_WP on four domains.
F-measure is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Jiang, Wenbin and Huang, Liang and Liu, Qun
Conclusion and Future Works
It obtains a considerable F-measure increment, about 0.8 points for word segmentation and 1 point for Joint S&T, with corresponding error reductions of 30.2% and 14%.
Conclusion and Future Works
Moreover, such improvement further brings striking F-measure increment for Chinese parsing, about 0.8 points, corresponding to an error propagation reduction of 38%.
Experiments
For word segmentation, the model after annotation adaptation (row 4 in upper subtable) achieves an F-measure increment of 0.8 points over the baseline model, corresponding to an error reduction of 30.2%; while for Joint S&T, the F-measure increment of the adapted model (row 4 in subtable below) is 1 point, which corresponds to an error reduction of 14%.
Experiments
Note that if we input the gold-standard segmented test set into the parser, the F-measure under the two definitions is the same.
Experiments
The parsing F-measure corresponding to the gold-standard segmentation, 82.35, represents the “oracle” accuracy (i.e., upperbound) of parsing on top of automatic word segmention.
F-measure is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Mohler, Michael and Bunescu, Razvan and Mihalcea, Rada
Results
For the alignment detection, we report the precision, recall, and F-measure associated with correctly detecting matches.
Results
The threshold weight learned from the bias feature strongly influences the point at which real scores change from non-matches to matches, and given the threshold weight learned by the algorithm, we find an F-measure of 0.72, with precision (P) = 0.85 and recall (R) = 0.62.
Results
By manually varying the threshold, we find a maximum F-measure of 0.76, with P=0.79 and R=0.74.
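A minimal sketch of such a threshold sweep (the scores, labels and candidate thresholds are placeholders, not the paper's data):

    def best_threshold(scores, labels, thresholds):
        # Pick the threshold maximizing F-measure; labels are 1 for a true match, 0 otherwise.
        best_t, best_f = None, 0.0
        for t in thresholds:
            tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
            fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
            fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            f = 2 * p * r / (p + r) if p + r else 0.0
            if f > best_f:
                best_t, best_f = t, f
        return best_t, best_f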
F-measure is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Sun, Le and Han, Xianpei
Introduction
Experimental results show that our method can achieve a 5.4% F-measure improvement over the traditional convolution tree kernel based method.
Introduction
The overall performance of CTK and FTK is shown in Table 1; the F-measure improvements over CTK are also shown inside the parentheses.
Introduction
FTK on the 7 major relation types and their F-measure improvement over CTK
F-measure is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Xu, Liheng and Liu, Kang and Lai, Siwei and Zhao, Jun
Experiments
Evaluation Metrics: We evaluate the proposed method in terms of precision (P), recall (R) and F-measure (F).
Experiments
Figure 5 shows the performance under different N, where the F-Measure saturates when N reaches 40 and beyond.
F-measure is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Guinaudeau, Camille and Strube, Michael
Experiments
Since the model can give the same score for a permutation and the original document, we also compute F-measure where recall is correct/total and precision equals correct/decisions.
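A minimal sketch of that measure, assuming "decisions" counts only the comparisons on which the model commits to an order (so precision can exceed recall when some pairs are left tied):

    def ordering_f(correct, decisions, total):
        # Recall = correct/total, precision = correct/decisions, F = harmonic mean.
        r = correct / total if total else 0.0
        p = correct / decisions if decisions else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0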
Experiments
For evaluation purposes, the accuracy still corresponds to the number of correct ratings divided by the number of comparisons, while the F-measure combines recall and precision measures.
Experiments
Moreover, in contrast to the first experiment, when accounting for the number of entities “shared” by two sentences (PW), values of accuracy and F-measure are lower.
F-measure is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Celikyilmaz, Asli and Hakkani-Tur, Dilek and Tur, Gokhan and Sarikaya, Ruhi
Conclusions
Table 3: Classification performance in F-measure for semantically ambiguous words on the most frequently confused descriptive tags in the movie domain.
Experiments
Figure 2: F-measure for semantic clustering performance.
Experiments
As expected, we see a drop in F-measure on all models on descriptive tags.
F-measure is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Chen, Ruey-Cheng
Concluding Remarks
Using MDL alone, one proposed method outperforms the original regularized compressor (Chen et al., 2012) in precision by 2 percentage points and in F-measure by 1.
Evaluation
Segmentation performance is measured using word-level precision (P), recall (R), and F-measure (F).
Evaluation
We found that, in all three settings, G2 outperforms the baseline by 1 to 2 percentage points in F-measure.
Evaluation
The best performance result achieved by G2 in our experiment is 81.7 in word-level F-measure , although this was obtained from search setting (c), using a heuristic p value 0.37.
F-measure is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Ji, Heng and Grishman, Ralph
Abstract
Without using any additional labeled data this new approach obtained 7.6% higher F-Measure in trigger labeling and 6% higher F-Measure in argument labeling over a state-of-the-art IE system which extracts events independently for each sentence.
Experimental Results and Analysis
We select the thresholds for the various confidence metrics (k = 1~13) by optimizing the F-measure score of each rule on the development set, as shown in Figures 2 and 3.
Experimental Results and Analysis
The labeled point on each curve shows the best F-measure that can be obtained on the development set by adjusting the threshold for that rule.
Experimental Results and Analysis
Table 5 shows the overall Precision (P), Recall (R) and F-Measure (F) scores for the blind test set.
F-measure is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Visweswariah, Karthik and Khapra, Mitesh M. and Ramanathan, Ananthakrishnan
Abstract
This approach generates alignments that are 2.6 f-Measure points better than a baseline supervised aligner.
Conclusion
We also proposed a model that scores alignments given source and target sentence reorderings that improves a supervised alignment model by 2.6 points in f-Measure.
Results and Discussions
The f-Measure of this aligner is 78.1% (see row 1, column 2).
Results and Discussions
Method                       f-Measure   mBLEU
Base Correction model        78.1        55.1
Correction model, C(f'|a)    78.1        56.4
P(a|f'), C(f'|a)             80.7        57.6
F-measure is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Huang, Fei
Conclusion
selected among multiple alignments, and it obtained a 0.8-point F-measure improvement over the single best Chinese-English aligner.
Conclusion
The second is the alignment link confidence measure, which selects the most reliable links from multiple alignments and obtained a 1.5-point F-measure improvement.
Improved MaXEnt Aligner with Confidence-based Link Filtering
the highest F-measure among the three aligners, although the algorithm described below can be applied to any aligner.
Improved MaXEnt Aligner with Confidence-based Link Filtering
For CE alignment, removing low confidence alignment links increased alignment precision by 5.5 points, while decreasing recall by 1.8 points, and the overall alignment F-measure increased by 1.3 points.
Improved MaXEnt Aligner with Confidence-based Link Filtering
When looking into the alignment links which are removed during the alignment link filtering process, we found that 80% of the removed links (1320 out of 1661 links) are incorrect alignments. For A-E alignment, filtering increased the precision by 3 points while reducing recall by 0.5 points, and the alignment F-measure increased by about 1.5 points absolute, a 10% relative alignment error rate reduction.
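To see how a precision gain can outweigh a recall loss in F-measure, a small worked check; the starting P and R below are illustrative, since the excerpts do not quote the absolute values:

    def f(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0

    # Illustrative only: from P = 0.850, R = 0.750 (F = 0.797), a +5.5 point precision
    # gain and a -1.8 point recall loss give P = 0.905, R = 0.732 and F = 0.809,
    # i.e., roughly the +1.3 point F-measure gain reported for CE alignment.
    print(round(f(0.850, 0.750), 3), round(f(0.905, 0.732), 3))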
F-measure is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Kim, Sungchul and Toutanova, Kristina and Yu, Hwanjo
Data and task
For Bulgarian, the F-measure of the full model is 92.8 compared to the best baseline result of 83.2.
Data and task
Within the semi-CRF model, the contribution of English sentence context was substantial, leading to a 2.5 point increase in F-measure for Bulgarian (92.8 versus 90.3 F-measure), and a 4.0 point increase for Korean (91.2 versus 87.2).
Data and task
Preliminary results show performance of over 80 F-measure for such monolingual models.
Introduction
Our results show that the semi-CRF model improves on the performance of projection models by more than 10 points in F-measure, and that we can achieve a tagging F-measure of over 91 using a very small number of annotated sentence pairs.
F-measure is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Nagata, Ryo and Vilenius, Mikko and Whittaker, Edward
Evaluation
We measured correction performance by recall, precision, and F-measure.
Evaluation
The simple error case frame-based method achieves an F-measure of 0.189.
Evaluation
The hybrid methods achieve the best performances in F-measure.
F-measure is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Auli, Michael and Lopez, Adam
Experiments
Overall, we improve the labelled F-measure by almost 1.1% and unlabelled F-measure by 0.6% over the baseline.
Experiments
Table 6: Parsing time in seconds per sentence (vs. F-measure) on section 00.
Oracle Parsing
The inverse relationship between model score and F-score shows that the supertagger restricts the parser to mostly good parses (under F-measure) that the model would otherwise disprefer.
F-measure is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Somasundaran, Swapna and Wiebe, Janyce
Experiments
#Correct’ Recall m and F-measure #guessed #relevant
Experiments
Finally, both of the OpPr systems are better than both baselines in Accuracy as well as F-measure for all four debates.
Experiments
The F-measure improves, on average, by 25 percentage points over the OpTopic system, and by 17 percentage points over the OpPMI system.
F-measure is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Mohtarami, Mitra and Lan, Man and Tan, Chew Lim
Analysis and Discussions
F-Measure (SO Prediction); F-Measure (IQAPs Inference)
F-measure is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Chaturvedi, Snigdha and Goldwasser, Dan and Daumé III, Hal
Empirical Evaluation
Since the purpose of solving this problem is to identify the threads which should be brought to the notice of the instructors, we measure the performance of our models using F-measure of the positive class.
Empirical Evaluation
Figure 6 shows 10-fold cross-validation F-measure of the positive class for LR when different types of features are excluded from the full set.
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zhao, Qiuye and Marcus, Mitch
Abstract
We transform tagged character sequences to word segmentations first, and then evaluate word segmentations by F-measure, as defined in Section 5.2.
Abstract
We focus on the task of word segmentation only in this work and show that a comparable F-measure is achievable in a much more efficient manner.
Abstract
The addition of these features makes a moderate improvement on the F-measure, from 0.974 to 0.975.
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Jiang, Wenbin and Sun, Meng and Lü, Yajuan and Yang, Yating and Liu, Qun
Experiments
The performance measurement for word segmentation is balanced F-measure, F = 2PR/(P + R), a function of precision P and recall R, where P is the percentage of words in segmentation results that are segmented correctly, and R is the percentage of correctly segmented words in the gold standard words.
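A minimal sketch of this word-level F-measure, assuming words are compared as character spans (start, end) so that only identically segmented words count as correct:

    def segmentation_prf(hyp_words, gold_words):
        # hyp_words, gold_words: word lists whose concatenation is the same sentence.
        def spans(words):
            out, start = set(), 0
            for w in words:
                out.add((start, start + len(w)))
                start += len(w)
            return out
        hyp, gold = spans(hyp_words), spans(gold_words)
        correct = len(hyp & gold)
        p = correct / len(hyp) if hyp else 0.0
        r = correct / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f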
Experiments
wikipedia brings an F-measure increment of 0.93 points.
Introduction
Experimental results show that, the knowledge implied in the natural annotations can significantly improve the performance of a baseline segmenter trained on CTB 5.0, an F-measure increment of 0.93 points on CTB test set, and an average increment of 1.53 points on 7 other domains.
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Jeon, Je Hun and Liu, Yang
Experiments and results
The F-measure score using the initial training data is 0.69.
Experiments and results
Most of the previous work for prosodic event detection reported their results using classification accuracy instead of F-measure .
Experiments and results
Table 3: The results (F-measure) of prosodic event detection for supervised and co-training approaches.
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Carpuat, Marine and Daume III, Hal and Henry, Katharine and Irvine, Ann and Jagarlamudi, Jagadeesh and Rudinger, Rachel
Experiments
Each learner uses a small amount of development data to tune a threshold on scores for predicting new-sense or not-a-new-sense, using macro F-measure as an objective.
Experiments
are relatively weak for predicting new senses on EMEA data but stronger on Subs (TYPEONLY AUC performance is higher than both baselines) and even stronger on Science data (TYPEONLY AUC and f-measure performance is higher than both baselines as well as the ALLFEATURES model).
Experiments
Recall that the microlevel evaluation computes precision, recall, and f-measure for all word tokens of a given word type and then averages across word types.
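A minimal sketch of that two-level evaluation: token-level P, R and F are computed per word type and then averaged (unweighted) across types; the per-type counts here are placeholders:

    def per_type_then_average(counts_by_type):
        # counts_by_type: dict mapping word type -> (tp, fp, fn) over its tokens.
        def prf(tp, fp, fn):
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            f = 2 * p * r / (p + r) if p + r else 0.0
            return p, r, f
        scores = [prf(*c) for c in counts_by_type.values()]
        if not scores:
            return 0.0, 0.0, 0.0
        n = len(scores)
        return tuple(sum(s[i] for s in scores) / n for i in range(3))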
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Baldwin, Tyler and Li, Yunyao and Alexe, Bogdan and Stanoi, Ioana R.
Abstract
Results over a dataset of entities from four product domains show that the proposed approach achieves an F-measure of 0.96, significantly above the baseline.
Experimental Evaluation
Of the three individual modules, the n-gram and clustering methods achieve F-measure of around 0.9, while the ontology-based module performs only modestly above baseline.
Experimental Evaluation
The final system that employed all modules produced an F-measure of 0.960, a significant (p < 0.01) absolute increase of 15.4% over the baseline.
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Aker, Ahmet and Paramita, Monica and Gaizauskas, Rob
Abstract
We test our approach on a held-out test set from EUROVOC and perform precision, recall and f-measure evaluations for 20 European language pairs.
Experiments 5.1 Data Sources
To test the classifier’s performance we evaluated it against a list of positive and negative examples of bilingual term pairs using the measures of precision, recall and F-measure .
Method
First, we evaluate the performance of the classifier on a held-out term-pair list from EUROVOC using the standard measures of recall, precision and F-measure .
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Park, Souneil and Lee, Kyung Soon and Song, Junehwa
Evaluation and Discussion
The performance is measured using precision, recall, and f-measure.
Evaluation and Discussion
We additionally used the weighted f-measure (wF) to aggregate the f-measure of the three categories.
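A minimal sketch of one such weighted aggregation, assuming each category's F-measure is weighted by its share of instances; the excerpts do not spell out the weights actually used:

    def weighted_f(per_category):
        # per_category: list of (f_measure, n_instances) pairs, one per category.
        total = sum(n for _, n in per_category)
        return sum(f * n for f, n in per_category) / total if total else 0.0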
Evaluation and Discussion
The overall average of the weighted f-measure among issues was 0.68, 0.59, and 0.48 for the DrC, QbC, and Sim.
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Chan, Yee Seng and Roth, Dan
Experiments
Using the same evaluation setting, our baseline RE system achieves a competitive 71.4 F-measure .
Experiments
The results show that by using syntactico-semantic structures, we obtain significant F-measure improvements of 8.3, 7.2, and 5.5 for binary, coarse-grained, and fine-grained relation predictions respectively.
Experiments
The results show that by leveraging syntactico-semantic structures, we obtain significant F-measure improvements of 8.2, 4.6, and 3.6 for binary, coarse-grained, and fine-grained relation predictions respectively.
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Sun, Jun and Zhang, Min and Tan, Chew Lim
Substructure Spaces for BTKs
The coefficients αi for the composite kernel are tuned with respect to F-measure (F) on the development set of HIT corpus.
Substructure Spaces for BTKs
Those thresholds are also tuned on the development set of HIT corpus with respect to F-measure.
Substructure Spaces for BTKs
The evaluation is conducted by means of Precision (P), Recall (R) and F-measure (F).
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Han, Xianpei and Zhao, Jun
Experiments
F-Measure (F): the harmonic mean of purity and inverse purity.
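A minimal sketch of that measure, using the usual set-based definitions of purity and inverse purity for system clusters against gold categories:

    def purity_based_f(clusters, categories):
        # clusters, categories: lists of sets of item ids (system output vs. gold classes).
        n = sum(len(c) for c in clusters)
        def purity(parts, refs):
            # Each part is credited with its best-matching reference set.
            return sum(max(len(p & r) for r in refs) for p in parts) / n if n else 0.0
        pur = purity(clusters, categories)      # purity
        inv = purity(categories, clusters)      # inverse purity
        return 2 * pur * inv / (pur + inv) if pur + inv else 0.0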
Experiments
We use F-measure as the primary measure, just like WePS1 and WePS2.
Experiments
The F-Measure vs. 1 on three data sets
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Elson, David and Dames, Nicholas and McKeown, Kathleen
Extracting Conversational Networks from Literature
Table 2: Precision, recall, and F-measure of three methods for detecting bilateral conversations in literary texts.
Extracting Conversational Networks from Literature
The precision and recall values shown for the baselines in Table 2 represent the highest performance we achieved by varying t between 0 and 1 (maximizing F-measure over 25).
Extracting Conversational Networks from Literature
Both baselines performed significantly worse in precision and F-measure than our quoted speech adjacency method for detecting conversations.
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Oh, Jong-Hoon and Uchimoto, Kiyotaka and Torisawa, Kentaro
Abstract
Experimental results show that our approach improved the F-measure by 3.6–10.3%.
Motivation
(2008), which was only applied for Japanese and achieved around 80% in F-measure .
Motivation
Experimental results showed that our method based on bilingual co-training improved the performance of monolingual hyponymy-relation acquisition by about 3.6–10.3% in F-measure.
F-measure is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: