Comparison to Human Judgments | As Table 6 shows, we achieved 78.4% accuracy using all verbs (and 72.3% with the first verb from each worker), which is a statistically significant improvement
Relational Similarity Experiments | Our best model 2) + p + c performs a bit better, 71.3% vs. 67.4%, but the difference is not statistically significant.
Relational Similarity Experiments | Our best model achieves 40.5% accuracy, which is slightly better than LRA’s 39.8%, but the difference is not statistically significant.
Relational Similarity Experiments | However, this time coordinating conjunctions (with prepositions) do help a bit (the difference is not statistically significant), since SAT verbal analogy questions ask for a broader range of relations, e.g., antonymy, for which coordinating conjunctions like but are helpful.
Discussion | The improvement in PP attachment was larger (20.5% ERR) and also statistically significant.
Experimental setting | We use Bikel’s randomized parsing evaluation comparator3 (with p < 0.05 throughout) to test the statistical significance of the results using word sense information, relative to the respective baseline parser using only lexical features. |
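The comparator cited above is a paired randomization test: paired scores are randomly swapped between the two systems to estimate how often a difference as large as the observed one arises by chance. A minimal pure-Python sketch of that idea (illustrative only, not Bikel's actual tool; function name is hypothetical):

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10_000, seed=0):
    """Paired approximate randomization test (two-sided).

    For each trial, randomly swap the paired scores of the two systems
    and count how often the absolute difference in means is at least as
    large as the observed one. Returns an estimated p-value.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) / n - sum(scores_b) / n)
    at_least_as_extreme = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # swap this pair with probability 1/2
                a, b = b, a
            sum_a += a
            sum_b += b
        if abs(sum_a / n - sum_b / n) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```

With clearly separated score distributions the estimated p-value is tiny; with identical inputs the observed difference is zero, so every trial is "at least as extreme" and the p-value is 1.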
Experimental setting | Statistical significance was calculated based on |
Results | These results are statistically significant in some cases (as indicated by *). |
Results | As in full-parsing, Bikel outperforms Charniak, but in this case the difference in the baselines is not statistically significant.
Results | As was the case for parsing, the performance with IST reaches and in many instances surpasses gold-standard levels, achieving statistical significance over the baseline in places. |
Experiments | 2All results are statistically significant at α = 0.01 with two exceptions: the difference between trigrams and bigrams for the system trained and tested on texts is statistically significant at α = 0.1, and for the system trained on sentences and tested on texts is not statistically significant at α = 0.01.
Experiments | The statistical significance of the results depends on the genre and size of the n-gram: on product reviews, all results are statistically significant at the α = 0.025 level; on movie reviews, the difference between Naïve Bayes and SVM is statistically significant at α = 0.01 but the significance diminishes as the size of the n-gram increases; on news, only bigrams produce a statistically significant (α = 0.01) difference between the two machine learning methods, while on blogs the difference between SVMs and Naïve Bayes is most pronounced when unigrams are used (α = 0.025).
Integrating the Corpus-based and Dictionary-based Approaches | Then, using an SVM meta-classifier trained on a small number of target-domain examples to combine the nine base classifiers, they obtained a statistically significant improvement on out-of-domain texts from book reviews, knowledge-base feedback, and product support services survey data.
Integrating the Corpus-based and Dictionary-based Approaches | The results reported in Table 6 are statistically significant at α = 0.01.
Integrating the Corpus-based and Dictionary-based Approaches | are statistically significant at α = 0.01, except the runs on movie reviews where the difference between the LBS and Ensemble classifiers was significant at α = 0.05.
Experiments | We apply the same reasoning to test for statistical significance in GMAP improvements. |
Results | combination is a statistically significant improvement (α = 0.05) over our new transcript set (that is, over the best single transcript result).
Results | Tests for statistically significant improvements in GMAP are computed using our paired log AP test, as discussed in Section 4.2.2. |
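The paired log AP test mentioned above rests on the fact that a GMAP improvement is equivalent to a positive mean per-query difference in log AP, so a standard paired t-test on the log-transformed scores tests GMAP gains. A hypothetical sketch of the resulting t statistic (the paper's exact procedure may differ; the statistic would be compared against a t distribution with n-1 degrees of freedom):

```python
import math
from statistics import mean, stdev

def paired_log_ap_t(ap_baseline, ap_system):
    """Paired t statistic on per-query log(AP) differences.

    Assumes strictly positive AP values for every query (log is
    undefined at zero), and that the per-query differences vary.
    """
    diffs = [math.log(s) - math.log(b)
             for b, s in zip(ap_baseline, ap_system)]
    n = len(diffs)
    # mean difference over its standard error (sample stdev / sqrt(n))
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```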
Results | Secondly, it is the only combination approach able to produce statistically significant relative improvements on both measures for both conditions. |
Conclusion and Future work | While results are similar and not statistically significant in the WSJ test set, when testing on the Brown out-of-domain test set the difference in favor of PropBank plus the mapping step is statistically significant.
Mapping into VerbNet Thematic Roles | If we compare these results to those obtained by VerbNet in the SemEval setting (second row of Table 5), they are 0.5 points better, but the difference is not statistically significant.
Mapping into VerbNet Thematic Roles | The performance drop compared to using the hand-annotated VerbNet class is 2 points and statistically significant, and 0.2 points above the results obtained using VerbNet directly under the same conditions (fourth row of the same table).
Mapping into VerbNet Thematic Roles | In this case, the difference is larger, 1.9 points, and statistically significant in favor of the mapping approach. |
On the Generalization of Role Sets | In the second setting (‘CoNLL setting’ row in the same table) the PropBank classifier degrades slightly, but the difference is not statistically significant . |
On the Generalization of Role Sets | The results in the ‘CoNLL setting (no 5th)’ rows of Table 1 show that the drop for PropBank is negligible and not significant, while the drop for VerbNet is larger and statistically significant.
Abstract | Although the 1987 version of the Thesaurus is better, we show that the 1911 version performs surprisingly well and that often the differences between the versions of Roget’s and WordNet are not statistically significant.
Comparison on applications | Even on the largest set (Finkelstein et al., 2001), however, the differences between Roget’s Thesaurus and the Vector method are not statistically significant at the p < 0.05 level for either thesaurus on a two-tailed test4. |
Comparison on applications | On the (Miller and Charles, 1991) and (Rubenstein and Goodenough, 1965) data sets the best system did not show a statistically significant improvement over the 1911 or 1987 Roget’s Thesauri, even at p < 0.1 for a two-tailed test. |
Comparison on applications | Much like (Miller and Charles, 1991), the data set used here is not large enough to determine if any system’s improvement is statistically significant.
Conclusion and future work | The 1987 version of Roget’s Thesaurus performed better than the 1911 version on all our tests, but we did not find the differences to be statistically significant . |
Results and Analysis | Our main result is that the allCP and allCP+pr methods rank matches statistically significantly better than the baselines in all setups (according to the Wilcoxon double-sided signed-ranks test at the level of 0.01 (Wilcoxon, 1945)). |
Results and Analysis | Furthermore, relative to this cpv(r, 25) model from (Pantel et al., 2007), our combined allCP model, with or without the prior (first row of Table 2), obtains statistically significantly better ranking (at the level of 0.01).
Results and Analysis | Comparing between the algorithms for matching 0pm (Section 3.2.2), we found that while rankedCBC is statistically significantly better than binaryCBC, rankedCBC and LIN generally achieve the same results.
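The excerpts above rely on the Wilcoxon double-sided signed-ranks test (Wilcoxon, 1945): paired differences are ranked by magnitude and the sum of positive ranks is compared to its expectation under the null. A minimal sketch using the normal approximation (no tie or continuity correction; real analyses should use an established implementation such as scipy.stats.wilcoxon):

```python
import math

def signed_rank_test(x, y):
    """Two-sided Wilcoxon signed-ranks test, normal approximation.

    Zero differences are dropped; tied absolute differences receive
    average ranks. Returns (z, p).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j < n and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + 1 + j) / 2          # average of positions i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg_rank
        i = j
    w_plus = sum(ranks[i] for i in range(n) if diffs[i] > 0)
    mu = n * (n + 1) / 4                     # E[W+] under the null
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal tail
    return z, p
```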
Results | The difference between COMBO and DISTRIB is not statistically significant, while both are significantly better than the rule-based approaches.8 This provides strong motivation for a “lightweight” approach to non-referential it detection, one that does not require parsing or handcrafted rules and is easily ported to new languages and text domains.
Results | Using no truncation (Unaltered) drops the F-Score by 4.3%, while truncating the patterns to a length of four drops the F-Score by only 1.4%, a difference which is not statistically significant.
Results | Neither is statistically significant; however, there seem to be diminishing returns from longer context patterns.
Experiments 6.1 Experiment Settings | We also conducted t-tests to determine whether the improvement is statistically significant.
Experiments 6.1 Experiment Settings | model is statistically significant with p-value < 0.05,
Experiments 6.1 Experiment Settings | Moreover, the improvements on five collections are statistically significant . |
Empirical Evaluation | As to statistical significance, we use the two-tailed paired Student's t-test in all the experiments to compare two specific methods.
Empirical Evaluation | Meanwhile, both semantic similarities have lower accuracy than CWS, and the differences are also statistically significant even with the conservative Bonferroni adjustment (i.e., the p-values in Table 1 are multiplied by three). |
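The Bonferroni adjustment referred to above is exactly the "p-values multiplied by three" correction: with m comparisons, each raw p-value is multiplied by m (capped at 1.0) before being compared to the significance level. A minimal sketch:

```python
def bonferroni(p_values):
    """Bonferroni adjustment for multiple comparisons.

    Each raw p-value is multiplied by the number of comparisons and
    capped at 1.0; a result is significant at level alpha only if its
    adjusted p-value still falls below alpha.
    """
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

For three pairwise tests (as in the excerpt above), `bonferroni([0.01, 0.02, 0.5])` yields adjusted values of 0.03, 0.06, and 1.0, so only the first remains significant at α = 0.05.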
Empirical Evaluation | The increase in precision is at least 0.04 with statistical significance . |
Evaluation Setup | A Wilcoxon rank sum test confirmed that the difference is statistically significant (p < 0.01). |
Results | The difference between High and Low similarity values estimated by these models are statistically significant (p < 0.01 using the Wilcoxon rank sum test). |
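The Wilcoxon rank sum test used above (unlike the paired signed-ranks test) compares two independent samples: all observations are pooled and ranked, and the rank sum of one sample is compared to its expectation under the null. A minimal sketch with a normal approximation and no tie correction (an established implementation such as scipy.stats.ranksums is preferable in practice):

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum (Mann-Whitney) test.

    Pools both samples, assigns average ranks to ties, and returns
    (z, p) from the normal approximation to the rank sum of x.
    """
    nx, ny = len(x), len(y)
    combined = sorted([(v, 'x') for v in x] + [(v, 'y') for v in y])
    values = [v for v, _ in combined]
    rank_x = 0.0
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        avg_rank = (i + 1 + j) / 2        # average rank of the tie group
        for k in range(i, j):
            if combined[k][1] == 'x':
                rank_x += avg_rank
        i = j
    mu = nx * (nx + ny + 1) / 2            # E[rank sum of x] under the null
    sigma = math.sqrt(nx * ny * (nx + ny + 1) / 12)
    z = (rank_x - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    return z, p
```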
Results | However, the difference between the two models is not statistically significant . |
Future Work | It is entirely possible that, within this protocol, the baselines that have performed so well in our experiments, such as length or, in read news, position, will utterly fail, and that less traditional acoustic or spoken language features will genuinely, and with statistical significance , add value to a purely transcript-based text summarization system. |
Results and analysis | The best performance is achieved by using all of the features together, but the length baseline, which uses only those features in bold type from Figure 3, is very close (no statistically significant difference), as is MMR.6 |
Results and analysis | The difference with respect to either of these baselines is statistically significant within the popular 10-30% compression range, as is the classifier trained on all features but acoustic
Conclusion | When using predictive class-based models in combination with a word-based language model trained on very large amounts of data, the improvements continue to be statistically significant on the test and nist06 sets. |
Experiments | Adding the class-based models leads to small improvements in BLEU score, with the highest improvements for both dev and nist06 being statistically significant.2
Experiments | 2Differences of more than 0.0051 are statistically significant at the 0.05 level using bootstrap resampling (Noreen, 1989; Koehn, 2004) |
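Bootstrap resampling in the style of Koehn (2004) repeatedly resamples test segments with replacement and checks how often one system scores higher than the other on the resampled set. A simplified sketch using per-segment scores and their mean as a stand-in for the corpus-level metric (real BLEU is corpus-level, so Koehn's procedure recomputes BLEU on each resample rather than averaging segment scores):

```python
import random

def bootstrap_win_rate(scores_a, scores_b, samples=1000, seed=0):
    """Paired bootstrap resampling over test segments.

    Draws `samples` resamples of segment indices with replacement and
    returns the fraction on which system A outscores system B; values
    near 1.0 suggest A's advantage is robust to test-set variation.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples
```

A system that beats the baseline on, say, 95% or more of the resamples is conventionally reported as significantly better at the 0.05 level.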