Comparison to Human Judgments | As Table 6 shows, we achieved 78.4% accuracy using all verbs (and 72.3% with the first verb from each worker), which is a statistically significant improvement
Relational Similarity Experiments | Our best model 2) + p + c performs a bit better, 71.3% vs. 67.4%, but the difference is not statistically significant.
Relational Similarity Experiments | Our best model achieves 40.5% accuracy, which is slightly better than LRA’s 39.8%, but the difference is not statistically significant.
Relational Similarity Experiments | However, this time coordinating conjunctions (with prepositions) do help a bit (the difference is not statistically significant), since SAT verbal analogy questions ask for a broader range of relations, e.g., antonymy, for which coordinating conjunctions like but are helpful.
Discussion | The improvement in PP attachment was larger (20.5% ERR) and also statistically significant.
Experimental setting | We use Bikel’s randomized parsing evaluation comparator3 (with p < 0.05 throughout) to test the statistical significance of the results using word sense information, relative to the respective baseline parser using only lexical features. |
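The comparator cited above is a paired randomization test: paired scores are randomly swapped between the two systems to estimate how often a difference as large as the observed one arises by chance. A minimal pure-Python sketch of that idea (illustrative only, not Bikel's actual tool; function name is hypothetical):

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10_000, seed=0):
    """Paired approximate randomization test (two-sided).

    For each trial, randomly swap the paired scores of the two systems
    and count how often the absolute difference in means is at least as
    large as the observed one. Returns an estimated p-value.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) / n - sum(scores_b) / n)
    at_least_as_extreme = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # swap this pair with probability 1/2
                a, b = b, a
            sum_a += a
            sum_b += b
        if abs(sum_a / n - sum_b / n) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```

With clearly separated score distributions the estimated p-value is tiny; with identical inputs the observed difference is zero, so every trial is "at least as extreme" and the p-value is 1.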
Experimental setting | Statistical significance was calculated based on |
Results | These results are statistically significant in some cases (as indicated by *). |
Results | As in full-parsing, Bikel outperforms Charniak, but in this case the difference in the baselines is not statistically significant.
Results | As was the case for parsing, the performance with IST reaches and in many instances surpasses gold-standard levels, achieving statistical significance over the baseline in places. |
Experiments | 2All results are statistically significant at α = 0.01 with two exceptions: the difference between trigrams and bigrams for the system trained and tested on texts is statistically significant at α = 0.1, and for the system trained on sentences and tested on texts is not statistically significant at α = 0.01.
Experiments | The statistical significance of the results depends on the genre and size of the n-gram: on product reviews, all results are statistically significant at the α = 0.025 level; on movie reviews, the difference between Naïve Bayes and SVM is statistically significant at α = 0.01 but the significance diminishes as the size of the n-gram increases; on news, only bigrams produce a statistically significant (α = 0.01) difference between the two machine learning methods, while on blogs the difference between SVMs and Naïve Bayes is most pronounced when unigrams are used (α = 0.025).
Integrating the Corpus-based and Dictionary-based Approaches | Then, using an SVM meta-classifier trained on a small number of target-domain examples to combine the nine base classifiers, they obtained a statistically significant improvement on out-of-domain texts from book reviews, knowledge-base feedback, and product support services survey data.
Integrating the Corpus-based and Dictionary-based Approaches | The results reported in Table 6 are statistically significant at α = 0.01.
Integrating the Corpus-based and Dictionary-based Approaches | are statistically significant at α = 0.01, except the runs on movie reviews where the difference between the LBS and Ensemble classifiers was significant at α = 0.05.
Experiments | We apply the same reasoning to test for statistical significance in GMAP improvements. |
Results | combination is a statistically significant improvement (α = 0.05) over our new transcript set (that is, over the best single transcript result).
Results | Tests for statistically significant improvements in GMAP are computed using our paired log AP test, as discussed in Section 4.2.2. |
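The paired log AP test mentioned above rests on the fact that a GMAP improvement is equivalent to a positive mean per-query difference in log AP, so a standard paired t-test on the log-transformed scores tests GMAP gains. A hypothetical sketch of the resulting t statistic (the paper's exact procedure may differ; the statistic would be compared against a t distribution with n-1 degrees of freedom):

```python
import math
from statistics import mean, stdev

def paired_log_ap_t(ap_baseline, ap_system):
    """Paired t statistic on per-query log(AP) differences.

    Assumes strictly positive AP values for every query (log is
    undefined at zero), and that the per-query differences vary.
    """
    diffs = [math.log(s) - math.log(b)
             for b, s in zip(ap_baseline, ap_system)]
    n = len(diffs)
    # mean difference over its standard error (sample stdev / sqrt(n))
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```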
Results | Secondly, it is the only combination approach able to produce statistically significant relative improvements on both measures for both conditions. |
Conclusion and Future work | While results are similar and not statistically significant in the WSJ test set, when testing on the Brown out-of-domain test set the difference in favor of PropBank plus the mapping step is statistically significant.
Mapping into VerbNet Thematic Roles | If we compare these results to those obtained by VerbNet in the SemEval setting (second row of Table 5), they are 0.5 points better, but the difference is not statistically significant.
Mapping into VerbNet Thematic Roles | The performance drop compared to using the hand-annotated VerbNet class is 2 points and statistically significant, and 0.2 points above the results obtained using VerbNet directly under the same conditions (fourth row of the same table).
Mapping into VerbNet Thematic Roles | In this case, the difference is larger, 1.9 points, and statistically significant in favor of the mapping approach. |
On the Generalization of Role Sets | In the second setting (‘CoNLL setting’ row in the same table) the PropBank classifier degrades slightly, but the difference is not statistically significant . |
On the Generalization of Role Sets | The results in the ‘CoNLL setting (no 5th)’ rows of Table 1 show that the drop for PropBank is negligible and not significant, while the drop for VerbNet is larger and statistically significant.
Abstract | Although the 1987 version of the Thesaurus is better, we show that the 1911 version performs surprisingly well and that often the differences between the versions of Roget’s and WordNet are not statistically significant.
Comparison on applications | Even on the largest set (Finkelstein et al., 2001), however, the differences between Roget’s Thesaurus and the Vector method are not statistically significant at the p < 0.05 level for either thesaurus on a two-tailed test4. |
Comparison on applications | On the (Miller and Charles, 1991) and (Rubenstein and Goodenough, 1965) data sets the best system did not show a statistically significant improvement over the 1911 or 1987 Roget’s Thesauri, even at p < 0.1 for a two-tailed test. |
Comparison on applications | Much like (Miller and Charles, 1991), the data set used here is not large enough to determine if any system’s improvement is statistically significant.
Conclusion and future work | The 1987 version of Roget’s Thesaurus performed better than the 1911 version on all our tests, but we did not find the differences to be statistically significant . |
Results and Analysis | Our main result is that the allCP and allCP+pr methods rank matches statistically significantly better than the baselines in all setups (according to the Wilcoxon double-sided signed-ranks test at the level of 0.01 (Wilcoxon, 1945)). |
Results and Analysis | Furthermore, relative to this cpv(r, 25) model from (Pantel et al., 2007), our combined allCP model, with or without the prior (first row of Table 2), obtains statistically significantly better ranking (at the level of 0.01).
Results and Analysis | Comparing between the algorithms for matching 0pm (Section 3.2.2), we found that while rankedCBC is statistically significantly better than binaryCBC, rankedCBC and LIN generally achieve the same results.
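The excerpts above rely on the Wilcoxon double-sided signed-ranks test (Wilcoxon, 1945): paired differences are ranked by magnitude and the sum of positive ranks is compared to its expectation under the null. A minimal sketch using the normal approximation (no tie or continuity correction; real analyses should use an established implementation such as scipy.stats.wilcoxon):

```python
import math

def signed_rank_test(x, y):
    """Two-sided Wilcoxon signed-ranks test, normal approximation.

    Zero differences are dropped; tied absolute differences receive
    average ranks. Returns (z, p).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j < n and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + 1 + j) / 2          # average of positions i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg_rank
        i = j
    w_plus = sum(ranks[i] for i in range(n) if diffs[i] > 0)
    mu = n * (n + 1) / 4                     # E[W+] under the null
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal tail
    return z, p
```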
Results | The difference between COMBO and DISTRIB is not statistically significant, while both are significantly better than the rule-based approaches.8 This provides strong motivation for a “lightweight” approach to non-referential it detection, one that does not require parsing or handcrafted rules and is easily ported to new languages and text domains.
Results | Using no truncation (Unaltered) drops the F-Score by 4.3%, while truncating the patterns to a length of four drops the F-Score by only 1.4%, a difference which is not statistically significant.
Results | Neither is statistically significant; however, there seem to be diminishing returns from longer context patterns.
Experiments 6.1 Experiment Settings | We also conducted t-tests to determine whether the improvement is statistically significant.
Experiments 6.1 Experiment Settings | model is statistically significant with p-value < 0.05,
Experiments 6.1 Experiment Settings | Moreover, the improvements on five collections are statistically significant . |
Empirical Evaluation | As to statistical significance, we use the two-tailed paired Student's t-test in all the experiments to compare two specific methods.
Empirical Evaluation | Meanwhile, both semantic similarities have lower accuracy than CWS, and the differences are also statistically significant even with the conservative Bonferroni adjustment (i.e., the p-values in Table 1 are multiplied by three). |
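The Bonferroni adjustment referred to above is exactly the "p-values multiplied by three" correction: with m comparisons, each raw p-value is multiplied by m (capped at 1.0) before being compared to the significance level. A minimal sketch:

```python
def bonferroni(p_values):
    """Bonferroni adjustment for multiple comparisons.

    Each raw p-value is multiplied by the number of comparisons and
    capped at 1.0; a result is significant at level alpha only if its
    adjusted p-value still falls below alpha.
    """
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

For three pairwise tests (as in the excerpt above), `bonferroni([0.01, 0.02, 0.5])` yields adjusted values of 0.03, 0.06, and 1.0, so only the first remains significant at α = 0.05.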
Empirical Evaluation | The increase in precision is at least 0.04 with statistical significance . |
Evaluation Setup | A Wilcoxon rank sum test confirmed that the difference is statistically significant (p < 0.01). |
Results | The difference between High and Low similarity values estimated by these models are statistically significant (p < 0.01 using the Wilcoxon rank sum test). |
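The Wilcoxon rank sum test used above (unlike the paired signed-ranks test) compares two independent samples: all observations are pooled and ranked, and the rank sum of one sample is compared to its expectation under the null. A minimal sketch with a normal approximation and no tie correction (an established implementation such as scipy.stats.ranksums is preferable in practice):

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum (Mann-Whitney) test.

    Pools both samples, assigns average ranks to ties, and returns
    (z, p) from the normal approximation to the rank sum of x.
    """
    nx, ny = len(x), len(y)
    combined = sorted([(v, 'x') for v in x] + [(v, 'y') for v in y])
    values = [v for v, _ in combined]
    rank_x = 0.0
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        avg_rank = (i + 1 + j) / 2        # average rank of the tie group
        for k in range(i, j):
            if combined[k][1] == 'x':
                rank_x += avg_rank
        i = j
    mu = nx * (nx + ny + 1) / 2            # E[rank sum of x] under the null
    sigma = math.sqrt(nx * ny * (nx + ny + 1) / 12)
    z = (rank_x - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    return z, p
```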
Results | However, the difference between the two models is not statistically significant . |
Future Work | It is entirely possible that, within this protocol, the baselines that have performed so well in our experiments, such as length or, in read news, position, will utterly fail, and that less traditional acoustic or spoken language features will genuinely, and with statistical significance , add value to a purely transcript-based text summarization system. |
Results and analysis | The best performance is achieved by using all of the features together, but the length baseline, which uses only those features in bold type from Figure 3, is very close (no statistically significant difference), as is MMR.6 |
Results and analysis | The difference with respect to either of these baselines is statistically significant within the popular 10-30% compression range, as is the classifier trained on all features but acoustic
Conclusion | When using predictive class-based models in combination with a word-based language model trained on very large amounts of data, the improvements continue to be statistically significant on the test and nist06 sets. |
Experiments | Adding the class-based models leads to small improvements in BLEU score, with the highest improvements for both dev and nist06 being statistically significant.2
Experiments | 2Differences of more than 0.0051 are statistically significant at the 0.05 level using bootstrap resampling (Noreen, 1989; Koehn, 2004) |
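Bootstrap resampling in the style of Koehn (2004) repeatedly resamples test segments with replacement and checks how often one system scores higher than the other on the resampled set. A simplified sketch using per-segment scores and their mean as a stand-in for the corpus-level metric (real BLEU is corpus-level, so Koehn's procedure recomputes BLEU on each resample rather than averaging segment scores):

```python
import random

def bootstrap_win_rate(scores_a, scores_b, samples=1000, seed=0):
    """Paired bootstrap resampling over test segments.

    Draws `samples` resamples of segment indices with replacement and
    returns the fraction on which system A outscores system B; values
    near 1.0 suggest A's advantage is robust to test-set variation.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples
```

A system that beats the baseline on, say, 95% or more of the resamples is conventionally reported as significantly better at the 0.05 level.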