Building the Resource | We then revised the rules with the aim of increasing both precision and recall . |
Evaluation | To obtain reliable estimates of both precision and recall, we decided to draw two different samples: (1) a sample of lemma pairs drawn from the induced derivational families, on which we estimate precision (P-sample), and (2) a sample of lemma pairs drawn from the set of possibly derivationally related lemma pairs, on which we estimate recall (R-sample).
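A minimal sketch of this two-sample estimation scheme, with hypothetical function names and data structures (not from the paper): precision is estimated by judging pairs drawn from the induced families, recall by checking how many known related pairs the resource recovers.

```python
# Hypothetical sketch of the two-sample estimation described above.

def estimate_precision(p_sample, is_truly_related):
    """p_sample: lemma pairs drawn from the induced derivational families.
    is_truly_related(pair) -> True if human judges deem the pair related."""
    correct = sum(1 for pair in p_sample if is_truly_related(pair))
    return correct / len(p_sample)

def estimate_recall(r_sample, induced_pairs):
    """r_sample: truly related lemma pairs drawn from the candidate set.
    induced_pairs: set of lemma pairs grouped together by the resource."""
    found = sum(1 for pair in r_sample if pair in induced_pairs)
    return found / len(r_sample)
```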
Evaluation | Table 5: Precision and recall on test samples |
Introduction | We conduct a thorough evaluation of the induced derivational families both regarding precision and recall . |
Results | We omit the F1 score because it is unclear how to combine precision and recall estimates obtained from different samples.
Results | The string distance-based approaches achieve more balanced precision and recall scores. |
Results | Note that for these methods, precision and recall can be traded off against each other by varying the number of clusters; we chose the number of clusters by optimizing the F1 score on the calibration and validation sets.
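A minimal sketch of this kind of model selection, assuming a generic evaluate(k) callback (hypothetical, not the paper's code): sweep candidate cluster counts and keep the one maximizing F1 on the held-out data.

```python
# Hypothetical sketch: choose the number of clusters by F1 on held-out data.

def f1(p, r):
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def select_num_clusters(candidate_ks, evaluate):
    """evaluate(k) -> (precision, recall) on calibration/validation data."""
    return max(candidate_ks, key=lambda k: f1(*evaluate(k)))
```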
Abstract | We attempt to tease apart the effects that this simple but effective modification has on alignment precision and recall tradeoffs, and how rare and common words are affected across several language pairs. |
Word alignment results | Consequently, in addition to AER, we focus on precision and recall . |
Word alignment results | Figure 3 shows the change in precision and recall with the amount of provided training data for the Hansards corpus. |
Word alignment results | We see that agreement constraints improve both precision and recall when we |
Evaluation of lexical similarity in context | Figure 3 shows the influence of the threshold used to select relevant pairs, plotting the precision and recall of the pairs kept at each threshold value, evaluated against the human annotation of relevance in context.
Evaluation of lexical similarity in context | If one wants to optimize the F-score (the harmonic mean of precision and recall) when extracting relevant pairs, the optimal point is at .24, for a threshold of .22 on Lin’s score.
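For reference, the F-score meant here is the standard balanced harmonic mean of precision P and recall R:

```latex
F = \frac{2PR}{P + R}
```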
Experiments: predicting relevance in context | Figure 3: Precision and recall on relevant links with respect to a threshold on the similarity measure (Lin’s score) |
Experiments: predicting relevance in context | We have seen that the relevant/not relevant classification is very imbalanced, biased towards the “not relevant” category (about 11%/89%), so we applied methods designed to counterbalance this, and will focus on the precision and recall of the predicted relevant links.
Experiments: predicting relevance in context | Other popular methods (maximum entropy, SVM) have shown a slightly inferior combined F-score, although their precision and recall vary more widely.
Introduction | Therefore, to develop a more nuanced self-monitoring reranker that is more robust to such parsing mistakes, we trained an SVM using dependency precision and recall features for all three parses, their n-best parsing results, and per-label precision and recall for each type of dependency, together with the realizer’s normalized perceptron model score as a feature. |
Reranking with SVMs 4.1 Methods | precision and recall: labeled and unlabeled precision and recall for each parser’s best parse
Reranking with SVMs 4.1 Methods | per-label precision and recall (dep): precision and recall for each type of dependency obtained from each parser’s best parse (using zero if not defined, for lack of predicted or gold dependencies with a given label)
Reranking with SVMs 4.1 Methods | n-best precision and recall (nbest): labeled and unlabeled precision and recall for each parser’s top five parses, along with the same features for the most accurate of these parses
Conclusion | Table 6: Comparison of grammar/lexicon observed in the model tagging vs. gold tagging in terms of precision and recall measures for supertagging on CCG-TUT.
Experiments | Precision and recall of grammar and lexicon. |
Experiments | Table 3: Comparison of grammar/lexicon observed in the model tagging vs. gold tagging in terms of precision and recall measures for supertagging on CCGbank data. |
Experiments | We can obtain a more-fine grained understanding of how the models differ by considering the precision and recall values for the grammars and lexicons of the different models, given in Table 3. |
Abstract | Experimental results show our method can achieve high precision and recall in measure word generation. |
Conclusion and Future Work | Experimental results show that our method not only achieves high precision and recall for generating measure words, but also improves the quality of English-to-Chinese SMT systems. |
Experiments | Table 3 and Table 4 show the precision and recall of our measure word generation method. |
Experiments | In addition to precision and recall , we also evaluate the Bleu score (Papineni et al., 2002) changes before and after applying our measure word generation method to the SMT output. |
Experiments | Table 6 and Table 7 show the precision and recall when using different features. |
Conclusion and Future Work | For experience detection, the performance was very promising, close to 92% in precision and recall when all the features were used.
Experience Detection | We not only compared our results with the baseline in terms of precision and recall but also |
Experience Detection | The performance for the best case with all the features included is very promising, close to 92% precision and recall.
Experience Detection | In order to see the effect of including individual features in the feature set, precision and recall were measured after eliminating a particular feature from the full set. |
Lexicon Construction | Note that the precision and recall are macro-averaged values across the two classes, activity and state. |
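A minimal sketch of macro-averaging over the two classes (names are illustrative, not the paper's code): the per-class precisions and recalls are averaged without weighting by class frequency.

```python
# Hypothetical sketch: macro-averaged precision and recall over two classes.

def macro_average(per_class_pr):
    """per_class_pr: {'activity': (p, r), 'state': (p, r)}"""
    precisions, recalls = zip(*per_class_pr.values())
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```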
Experimental Setup | Aggregate Extraction: Let A_e be the set of extracted relations for any of the systems; we compute aggregate precision and recall by comparing A_e with A.
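A minimal sketch of this set-based aggregate scoring (identifiers are illustrative): precision is the fraction of extracted relations found in the reference set, recall the fraction of reference relations that were extracted.

```python
# Hypothetical sketch of aggregate extraction scoring over relation sets.

def aggregate_pr(extracted, reference):
    """extracted, reference: sets of relation tuples."""
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall
```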
Experimental Setup | We then report precision and recall for each system on this set of sampled sentences. |
Experiments | Since the data contains an unbalanced number of instances of each relation, we also report precision and recall for each of the ten most frequent relations. |
Experiments | Table 1 presents this approximate precision and recall for MULTIR on each of the relations, along with statistics we computed to measure the quality of the weak supervision. |
Introduction | use supervised learning of relation-specific examples, which can achieve high precision and recall . |
Related Work | While they offer high precision and recall , these methods are unlikely to scale to the thousands of relations found in text on the Web. |
Conclusion | We also present results on the precision and recall tradeoffs inherent in this task. |
Experiments | We note that previous work achieves higher ancestor precision, while our approach achieves a more even balance between precision and recall . |
Experiments | Of course, precision and recall should both ideally be high, even if some applications weigh one over the other. |
Introduction | We note that our approach falls at a different point in the space of performance tradeoffs from past work: by producing complete, highly articulated trees, we naturally see a more even balance between precision and recall, while past work generally focused on precision.
Introduction | While different applications will value precision and recall differently, and past work was often intentionally precision-focused, it is certainly the case that an ideal solution would maximize both.
Conclusion | Many researchers are trying to use IE to create large-scale knowledge bases from natural language text on the Web, but existing relation-specific techniques do not scale to the thousands of relations encoded in Web text — while relation-independent techniques suffer from lower precision and recall , and do not canonicalize the relations. |
Extraction with Lexicons | We expect that lists with higher similarity are more likely to contain phrases which are related to our seeds; hence, by varying the similarity threshold one may produce lexicons representing different compromises between lexicon precision and recall . |
Introduction | Open extraction is more scalable, but has lower precision and recall . |
Related Work | Open IE, self-supervised learning of unlexicalized, relation-independent extractors (Banko et al., 2007), is a more scalable approach, but suffers from lower precision and recall , and doesn’t canonicalize the relations. |
Related Work | The goal of set expansion techniques is to generate high precision sets of related items; hence, these techniques are evaluated based on lexicon precision and recall . |
Abstract | In this work, the problem of extracting phrase translations is formulated as an information retrieval process implemented with a log-linear model aiming for balanced precision and recall.
Conclusions | In this paper, the problem of extracting phrase translations is formulated as an information retrieval process implemented with a log-linear model aiming for balanced precision and recall.
Discussions | The generic phrase training algorithm follows an information retrieval perspective as in (Venugopal et al., 2003) but aims to improve both precision and recall with the trainable log-linear model. |
Discussions | It implies a balancing process between precision and recall . |
Introduction | As in information retrieval, precision and recall issues need to be addressed with a right balance for building a phrase translation table. |
Abstract | This paper presents WOE, an open IE system which improves dramatically on TextRunner’s precision and recall . |
Abstract | WOE can operate in two modes: when restricted to POS tag features, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.
Introduction | high precision and recall , they are limited by the availability of training data and are unlikely to scale to the thousands of relations found in text on the Web. |
Introduction | WOE can operate in two modes: when restricted to shallow features like part-of-speech (POS) tags, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.
Results and Discussion | Both precision and recall are improved with two exceptions: recall of B³ decreases from line 2 to 3 and from 15 to 16.
Results and Discussion | In contrast to F1, there is no consistent trend for precision and recall . |
Results and Discussion | But this higher variability for precision and recall is to be expected since every system trades the two measures off differently. |
Results | We present the results in terms of F-score only for simplicity; we then conduct an error analysis that examines precision and recall . |
Results | We consider the performance in terms of precision and recall in addition to F-score — see Table 7 (a). |
Results | Overall, there is no major tradeoff between precision and recall across the different settings, although we can observe the following: (i) adding more training data helps precision more than recall, by over three times (compare the last two columns in Table 7 (a)); and (ii) the best setting has slightly lower precision than all features but much better recall (compare columns 4 and 5 in Table 7 (a)).
Introduction | do not report (NR) separate values for precision and recall on this dataset. |
Introduction | Differences in both precision and recall between the baseline and the other systems are statistically significant at p < 0.01 using the two-tailed Fisher’s exact test. |
Introduction | Differences in both precision and recall between the baseline and the Span-HMM systems are statistically significant at p < 0.01 using the two-tailed Fisher’s exact test. |
Coreference Subtask Analysis | The MUC scoring algorithm (Vilain et al., 1995) computes the F1 score (harmonic mean) of precision and recall based on the identification of unique coreference links.
Coreference Subtask Analysis | The B³ algorithm (Bagga and Baldwin, 1998) computes a precision and recall score for each CE:
Coreference Subtask Analysis | Precision and recall for a set of documents are computed as the mean over all CEs in the documents and the F1 score of precision and recall is reported. |
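A minimal sketch of B-cubed scoring as described above, assuming every CE appears in both the response and the key (handling of twinless mentions is omitted):

```python
# Hypothetical sketch of B-cubed: per-mention precision/recall, then averaged.

def b_cubed(response, key):
    """response, key: dicts mapping each mention (CE) to the set of mentions
    in its cluster (each cluster set contains the mention itself)."""
    mentions = list(key)
    p = sum(len(response[m] & key[m]) / len(response[m]) for m in mentions)
    r = sum(len(response[m] & key[m]) / len(key[m]) for m in mentions)
    return p / len(mentions), r / len(mentions)
```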
Conclusion | The resulting predictions improve the precision and recall of both alignment links and extracted phrase pairs in Chinese-English experiments.
Experimental Results | The bidirectional model improves both precision and recall relative to all heuristic combination techniques, including grow-diag-final (Koehn et al., 2003). |
Experimental Results | As our model only provides small improvements in alignment precision and recall for the union combiner, the magnitude of the BLEU improvement is not surprising. |
Experiments and Results | It seems that the model gets quickly saturated in terms of incorporating new information and therefore precision and recall do not drastically change for increasing dataset sizes. |
Experiments and Results | For this reason we broke down the summary measures of precision and recall into their original components: true/false positive (TP/FP) and negative (TN/FN) counts, presented in the 2 × 2 contingency table of Figure 1.
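For reference, precision and recall recover from those four counts as follows; note that true negatives enter neither measure, which is exactly why the full contingency table is more informative here.

```python
# Precision and recall from the 2x2 contingency counts.

def pr_from_counts(tp, fp, fn, tn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # tn is deliberately unused: true negatives affect neither measure.
    return precision, recall
```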
Experiments and Results | The optimal solution applying µ* = 0.38 is more balanced between precision and recall and
Experiments | In figure 5 we compare the precision and recall of LDA-SP against the top two performing systems described by Pantel et al. |
Experiments | We find that LDA-SP achieves both higher precision and recall than ISP.IIM-∨.
Experiments | Figure 5: Precision and recall on the inference filtering task. |
CD | While the first level of constituent analysis has high precision and recall on NPs, the second level often does well finding prepositional phrases (PPs), especially in WSJ; see Table 7.
Phrasal punctuation revisited | The table shows absolute improvement (+) or decline (−) in precision and recall when phrasal punctuation is removed from the data.
Tasks and Benchmark | It measures precision and recall on constituents produced by a parser as compared to gold standard constituents. |
BLEU and PORT | 2.2.1 Precision and Recall |
BLEU and PORT | To combine precision and recall , we tried four averaging methods: arithmetic (A), geometric (G), harmonic (H), and quadratic (Q) mean. |
BLEU and PORT | We chose the quadratic mean to combine precision and recall , as follows: |
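The formula itself is elided above; for reference, the standard unweighted forms of the four candidate means of precision P and recall R are as follows (the paper's actual combination may introduce weights):

```latex
A = \frac{P + R}{2}, \qquad
G = \sqrt{PR}, \qquad
H = \frac{2PR}{P + R}, \qquad
Q = \sqrt{\frac{P^2 + R^2}{2}}
```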
Extracting Conversational Networks from Literature | The precision and recall of our method for detecting conversations are shown in Table 2.
Extracting Conversational Networks from Literature | To calculate precision and recall for the two baseline social networks, we set a threshold t to derive a binary prediction from the continuous edge weights.
Extracting Conversational Networks from Literature | The precision and recall values shown for the baselines in Table 2 represent the highest performance we achieved by varying t between 0 and 1 (maximizing F-measure over t).
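A minimal sketch of this baseline sweep (data structures are hypothetical): binarize the weighted edges at each threshold t in [0, 1] and keep the t with the best F-measure against the gold network.

```python
# Hypothetical sketch: pick the edge-weight threshold maximizing F-measure.

def best_threshold(edge_weights, gold_edges, steps=100):
    """edge_weights: {edge: weight in [0, 1]}; gold_edges: set of edges."""
    best = (0.0, 0.0, 0.0, 0.0)  # (f, t, precision, recall)
    for i in range(steps + 1):
        t = i / steps
        predicted = {e for e, w in edge_weights.items() if w >= t}
        correct = len(predicted & gold_edges)
        p = correct / len(predicted) if predicted else 0.0
        r = correct / len(gold_edges) if gold_edges else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        best = max(best, (f, t, p, r))
    return best
```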
Fitting the Model | We also measure the quality of the two observed grammars/dictionaries by computing their precision and recall against the grammar/dictionary we observe in the gold tagging. We find that precision of the observed grammar increases from 0.73 (EM) to 0.94 (IP+EM).
Fitting the Model | Figure 6: Comparison of observed grammars from the model tagging vs. gold tagging in terms of precision and recall measures. |
Restarts and More Data | Figure 7: Comparison of observed dictionaries from the model tagging vs. gold tagging in terms of precision and recall measures. |
Comparative Evaluation | MENTA is the closest system to ours, obtaining slightly higher precision and recall . |
Comparative Evaluation | Notably, however, MENTA outputs the first WordNet sense of entity for 13.17% of all the given answers, which, despite being correct and accounted for in precision and recall, is uninformative.
Phase 1: Inducing the Page Taxonomy | Not only does our taxonomy show high precision and recall in extracting ambiguous hypernyms, it also disambiguates more than 3/4 of the hypernyms with high precision. |
Dependency-based evaluation | Overlap between the candidate bag and the reference bag is calculated in the form of precision, recall, and the f-measure (with precision and recall equally weighted). |
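A minimal sketch of this bag-overlap scoring (illustrative, not the evaluation tool itself): the bags are multisets of dependencies, so the overlap is a multiset intersection.

```python
# Hypothetical sketch: precision/recall/F over bags (multisets) of dependencies.
from collections import Counter

def bag_overlap_prf(candidate, reference):
    """candidate, reference: lists of dependency triples."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # size of the multiset intersection
    p = overlap / sum(cand.values()) if cand else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```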
Discussion and future work | Much like the inverse relation of precision and recall , changes and additions that improve a metric’s correlation with human scores for model summaries often weaken the correlation for system summaries, and vice versa. |
Lexical-Functional Grammar and the LFG parser | (2004) obtains high precision and recall rates. |
Experiments | Figure 4: Labeled precision and recall relative to dependency length when the normal-form model is used.
Experiments | Table 1 shows the accuracies of all parsers on the development set, in terms of labeled precision and recall over the predicate-argument dependencies in CCGBank. |
Experiments | To probe this further we compare labeled precision and recall relative to dependency length, as measured by the distance between the two words in a dependency, grouped into bins of 5 values. |