Index of papers in Proc. ACL that mention
  • statistically significant
Yancheva, Maria and Rudzicz, Frank
Related Work
Despite statistically significant predictors of deception such as shorter talking time, fewer semantic details, and less coherent statements, DePaulo et al.
Related Work
and revealed that verbal cues based on lexical categories extracted using the LIWC tool show statistically significant, though small, differences between truth- and lie-tellers.
Results
Statistically significant results are in bold.
Results
performs best, with 59.5% cross-validation accuracy, which is a statistically significant improvement over the baselines of LR (t(4) = 22.25, p < .0001), and NB (t(4) = 16.19, p
Results
In comparison with classification accuracy on pooled data, a paired t-test shows statistically significant improvement across all age groups using RF, t(3) = 1037, p
statistically significant is mentioned in 8 sentences in this paper.
Nakov, Preslav and Hearst, Marti A.
Comparison to Human Judgments
As Table 6 shows, we achieved 78.4% accuracy using all verbs (and 72.3% with the first verb from each worker), which is a statistically significant improve-
Relational Similarity Experiments
Our best model v + p + c performs a bit better, 71.3% vs. 67.4%, but the difference is not statistically significant.
Relational Similarity Experiments
Our best model achieves 40.5% accuracy, which is slightly better than LRA’s 39.8%, but the difference is not statistically significant.
Relational Similarity Experiments
However, this time coordinating conjunctions (with prepositions) do help a bit (the difference is not statistically significant) since SAT verbal analogy questions ask for a broader range of relations, e.g., antonymy, for which coordinating conjunctions like but are helpful.
statistically significant is mentioned in 8 sentences in this paper.
Wang, WenTing and Su, Jian and Tan, Chew Lim
Abstract
The experiment shows tree kernel approach is able to give statistical significant improvements over flat syntactic path feature.
Abstract
Besides, we further propose to leverage on temporal ordering information to constrain the interpretation of discourse relation, which also demonstrate statistical significant improvements for discourse relation recognition on PDTB 2.0 for both explicit and implicit as well.
Conclusions and Future Works
The experimental results on PDTB v2.0 show that our kernel-based approach is able to give statistical significant improvement over flat syntactic path method.
Conclusions and Future Works
In addition, we also propose to incorporate temporal ordering information to constrain the interpretation of discourse relations, which also demonstrate statistical significant improvements for discourse relation recognition, both explicit and implicit.
Experiments and Results
conduct chi square statistical significance test on All relations between flat path approach and Simple-Expansion approach, which shows the performance improvements are statistical significant (p < 0.05) through incorporating tree kernel.
Experiments and Results
We conduct chi square statistical significant test on All relations, which shows the performance improvement is statistical significant (p < 0.05).
Introduction
The experiment shows that tree kernel is able to effectively incorporate syntactic structural information and produce statistical significant improvements over flat syntactic path feature for the recognition of both explicit and implicit relation in Penn Discourse Treebank (PDTB; Prasad et al., 2008).
Introduction
Besides, inspired by the linguistic study on tense and discourse anaphor (Webber, 1988), we further propose to incorporate temporal ordering information to constrain the interpretation of discourse relation, which also demonstrates statistical significant improvements for discourse relation recognition on PDTB v2.0 for both explicit and implicit relations.
statistically significant is mentioned in 8 sentences in this paper.
Riezler, Stefan and Simianer, Patrick and Haas, Carolin
Experiments
Statistical significance is measured using Approximate Randomization (Noreen, 1989) where result differences with a p-value smaller than 0.05 are considered statistically significant.
Experiments
All result differences are statistically significant.
Experiments
However, the result differences between these two systems do not score as statistically significant.
statistically significant is mentioned in 7 sentences in this paper.
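The approximate randomization procedure named in the excerpts above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each system has one numeric score per test item and compares mean scores, and the function name, parameters, and add-one smoothing are choices made here for illustration.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test on the mean score difference.

    scores_a and scores_b are hypothetical per-item scores of two systems
    on the same test items.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b))) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Swap the two systems' scores for this item with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Under the 0.05 threshold quoted above, a difference would count as significant when the returned value is below 0.05.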
Kondadadi, Ravi and Howald, Blake and Schilder, Frank
Evaluation and Discussion
There is no statistically significant difference between DocSys and DocBase generations for METEOR and BLEU-4. However, there is a statistically significant difference in the syntactic variability metric for both domains (weather - χ²=137.16, d.f.=1, p<.0001; biography - χ²=96.641, d.f.=1, p<.
Evaluation and Discussion
In terms of significance, there are no statistically significant differences between the systems for weather (DocOrig vs. DocSys - χ²=.347, d.f.=1, p=.555; DocOrig vs. DocBase - χ²=.090, d.f.=1, p=.764; DocSys vs. DocBase - χ²=.790, d.f.=1, p=.373).
Evaluation and Discussion
For biography, the trend fits nicely both numerically and in terms of statistical significance (DocOrig vs. DocSys - χ²=5.094, d.f.=1, p=.
statistically significant is mentioned in 7 sentences in this paper.
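The chi-square figures quoted above all have one degree of freedom, where the p-value has a closed form. The sketch below is an illustration under that assumption only; the function names and the 2x2 table layout are hypothetical and not taken from the paper.

```python
import math


def chi2_pvalue_df1(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(statistic / 2))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) on the table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column needs at least one observation")
    statistic = n * (a * d - b * c) ** 2 / margins
    return statistic, chi2_pvalue_df1(statistic)
```

Calling `chi2_pvalue_df1(0.347)` and `chi2_pvalue_df1(0.090)` should come out close to the p=.555 and p=.764 reported for the weather domain.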
Constant, Matthieu and Sigogne, Anthony and Watrin, Patrick
Abstract
However, it has no statistically significant impact in terms of F-score as incorrect multiword expression recognition has important side effects on parsing.
Evaluation
In order to establish the statistical significance of results between two parsing experiments in terms of F1 and UAS, we used a unidirectional t-test for two independent samples.
Evaluation
The statistical significance between two MWE identification experiments was established by using McNemar’s test (Gillick and Cox, 1989).
Evaluation
The results of the two experiments are considered statistically significant with the computed value p < 0.01.
statistically significant is mentioned in 7 sentences in this paper.
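McNemar's test, mentioned above for comparing two MWE identification experiments, looks only at the items on which the two systems disagree. The following is a minimal sketch of the exact (binomial) two-sided variant; whether the paper used this variant or the chi-square approximation is not stated in the excerpt, and the function and argument names are hypothetical.

```python
from math import comb


def mcnemar_exact(only_a_correct, only_b_correct):
    """Exact two-sided McNemar test from the two discordant counts.

    only_a_correct: items system A gets right and system B gets wrong.
    only_b_correct: items system B gets right and system A gets wrong.
    """
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0  # the systems never disagree, so there is no evidence of a difference
    k = min(only_a_correct, only_b_correct)
    # Probability of k or fewer successes under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With the p < 0.01 criterion quoted above, two experiments would be considered different when the returned value falls below 0.01.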
Echizen-ya, Hiroshi and Araki, Kenji
Experiments
Underlining in our method signifies that the differences between correlation coefficients obtained using our method and IMPACT are statistically significant at the 5% significance level.
Experiments
8 and “All” of Tables 2 and 4, the differences between correlation coefficients obtained using our method and IMPACT are statistically significant at the 5% significance level.
Experiments
The differences between correlation coefficients obtained using our method and IMPACT are statistically significant at the 5% significance level for adequacy of SMT.
Introduction
Moreover, the differences between correlation coefficients obtained using our method and other methods are statistically significant at the 5% or lower significance level for adequacy.
statistically significant is mentioned in 7 sentences in this paper.
Andreevskaia, Alina and Bergler, Sabine
Experiments
All results are statistically significant at α = 0.01 with two exceptions: the difference between trigrams and bigrams for the system trained and tested on texts is statistically significant at α = 0.1 and for the system trained on sentences and tested on texts is not statistically significant at α = 0.01.
Experiments
The statistical significance of the results depends on the genre and size of the n-gram: on product reviews, all results are statistically significant at α = 0.025 level; on movie reviews, the difference between Naïve Bayes and SVM is statistically significant at α = 0.01 but the significance diminishes as the size of the n-gram increases; on news, only bigrams produce a statistically significant (α = 0.01) difference between the two machine learning methods, while on blogs the difference between SVMs and Naïve Bayes is most pronounced when unigrams are used (α = 0.025).
Integrating the Corpus-based and Dictionary-based Approaches
Using then an SVM meta-classifier trained on a small number of target domain examples to combine the nine base classifiers, they obtained a statistically significant improvement on out-of-domain texts from book reviews, knowledge-base feedback, and product support services survey data.
Integrating the Corpus-based and Dictionary-based Approaches
The results reported in Table 6 are statistically significant at α = 0.01.
Integrating the Corpus-based and Dictionary-based Approaches
are statistically significant at α = 0.01, except the runs on movie reviews where the difference between the LBS and Ensemble classifiers was significant at α = 0.05.
statistically significant is mentioned in 6 sentences in this paper.
Zhu, Xiaodan and Guo, Hongyu and Mohammad, Saif and Kiritchenko, Svetlana
Conclusions
performance statistically significantly.
Experimental results
Models marked with an asterisk (*) are statistically significantly better than the random baseline.
Experimental results
Double asterisks ** indicates a statistically significantly different from model (6), and the model with the double dagger ‡ is significantly better than model (7).
Experimental results
We can observe statistically significant differences of shifting abilities between many negator pairs such as that between “is_never” and “do_not” as well as between “does_not” and “can_not”.
Negation models based on heuristics
We will show that this simple modification improves the fitting performance statistically significantly.
Negation models based on heuristics
We will show that this model also statistically significantly outperforms the basic shifting without overfitting, although the number of parameters have increased.
statistically significant is mentioned in 6 sentences in this paper.
Salloum, Wael and Elfardy, Heba and Alamir-Salloum, Linda and Habash, Nizar and Diab, Mona
Discussion and Error Analysis
However, our best system selection approach improves over MSA-Pivot by a small margin of 0.2% BLEU absolute only, albeit a statistically significant improvement.
MT System Selection
It improves over the best single system baseline (MSA-Pivot) by a statistically significant 0.5% BLEU.
MT System Selection
Improvements are statistically significant.
MT System Selection
The differences in BLEU are statistically significant.
Machine Translation Experiments
All differences in BLEU scores between the four systems are statistically significant above the 95% level.
Machine Translation Experiments
Statistical significance is computed using paired bootstrap re-sampling (Koehn, 2004).
statistically significant is mentioned in 6 sentences in this paper.
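Paired bootstrap resampling, cited above from Koehn (2004), can be sketched as below. This is a simplified illustration: it assumes hypothetical per-sentence scores that are summed over each resampled test set, whereas BLEU is a corpus-level metric that would have to be recomputed on every resample. The function name and defaults are not from the paper.

```python
import random


def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Fraction of resampled test sets on which system A scores strictly higher than B."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need the same non-empty set of items for both systems")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        # Draw n item indices with replacement; the same indices are used for both systems.
        indices = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in indices)
        total_b = sum(scores_b[i] for i in indices)
        if total_a > total_b:
            wins += 1
    return wins / samples
```

A returned fraction of at least 0.95 corresponds to the "above the 95% level" wording in the excerpt.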
Wang, Aobo and Kan, Min-Yen
Experiment
To determine statistical significance of the improvements, we also compute paired, one-tailed t tests.
Experiment
‘i’(‘*’) in the top four lines indicates statistical significance at p < 0.001 (0.05) when compared with the previous row.
Experiment
‘i’ or ‘*’ in the top four rows indicates statistical significance at p < 0.001 or < 0.05 compared with the previous row.
statistically significant is mentioned in 6 sentences in this paper.
Wang, Lu and Raghavan, Hema and Castelli, Vittorio and Florian, Radu and Cardie, Claire
Abstract
Our best model achieves statistically significant improvement over the state-of-the-art systems on several metrics (e.g., 8.0% and 5.4% improvements in ROUGE-2 respectively) for the DUC 2006 and 2007 summarization task.
Introduction
We evaluate the summarization models on the standard Document Understanding Conference (DUC) 2006 and 2007 corpora for query-focused MDS and find that all of our compression-based summarization models achieve statistically significantly better performance than the best DUC 2006 systems.
Introduction
With these results we believe we are the first to successfully show that sentence compression can provide statistically significant improvements over pure extraction-based approaches for query-focused MDS.
Results
Our sentence-compression-based systems (marked with T) show statistically significant improvements over pure extractive summarization for both R-2 and R-SU4 (paired t-test, p < 0.01).
Results
In Table 7, our context-aware and head-driven tree-based compression systems show statistically significantly (p < 0.01) higher precisions (Uni-
Results
For grammatical relation evaluation, our head-driven tree-based system obtains statistically significantly (p < 0.01) better F1 score (Rel-F1) than all the other systems except the rule-based system.
statistically significant is mentioned in 6 sentences in this paper.
Habash, Nizar and Roth, Ryan
Results
However, the differences among this model and the other models using lem in Table 4 are not statistically significant.
Results
The differences between this model and the other lower performing models are statistically significant (p<0.05).
Results
The differences among the last four models (all including lem) in Table 5 are not statistically significant.
statistically significant is mentioned in 6 sentences in this paper.
Cheung, Jackie Chi Kit and Penn, Gerald
Abstract
Then, we incorporate the model enhanced with topological fields into a natural language generation system that generates constituent orders for German text, and show that the added coherence component improves performance slightly, though not statistically significantly.
Introduction
We add contextual features using topological field transitions to the model of Filippova and Strube (2007b) and achieve a slight improvement over their model in a constituent ordering task, though not statistically significantly.
Introduction
Two-tailed sign tests were calculated for each result against the best performing model in each column (1: p = 0.101; 2: p = 0.053; +: statistically significant, p < 0.05; ++: very statistically significant, p < 0.01).
Introduction
We embed entity topological field transitions into their probabilistic model, and show that the added coherence component slightly improves the performance of the baseline NLG system in generating constituent orderings in a German corpus, though not to a statistically significant degree.
statistically significant is mentioned in 6 sentences in this paper.
Agirre, Eneko and Baldwin, Timothy and Martinez, David
Discussion
The improvement in PP attachment was larger (20.5% ERR), and also statistically significant.
Experimental setting
We use Bikel’s randomized parsing evaluation comparator (with p < 0.05 throughout) to test the statistical significance of the results using word sense information, relative to the respective baseline parser using only lexical features.
Experimental setting
Statistical significance was calculated based on
Results
These results are statistically significant in some cases (as indicated by *).
Results
As in full-parsing, Bikel outperforms Charniak, but in this case the difference in the baselines is not statistically significant.
Results
As was the case for parsing, the performance with IST reaches and in many instances surpasses gold-standard levels, achieving statistical significance over the baseline in places.
statistically significant is mentioned in 6 sentences in this paper.
Olsson, J. Scott and Oard, Douglas W.
Experiments
We apply the same reasoning to test for statistical significance in GMAP improvements.
Results
combination is a statistically significant improvement (α = 0.05) over our new transcript set (that is, over the best single transcript result).
Results
Tests for statistically significant improvements in GMAP are computed using our paired log AP test, as discussed in Section 4.2.2.
Results
Secondly, it is the only combination approach able to produce statistically significant relative improvements on both measures for both conditions.
statistically significant is mentioned in 6 sentences in this paper.
Zapirain, Beñat and Agirre, Eneko and Màrquez, Lluís
Conclusion and Future work
While results are similar and not statistically significant in the WSJ test set, when testing on the Brown out-of-domain test set the difference in favor of PropBank plus mapping step is statistically significant.
Mapping into VerbNet Thematic Roles
If we compare these results to those obtained by VerbNet in the SemEval setting (second row of Table 5), they are 0.5 points better, but the difference is not statistically significant.
Mapping into VerbNet Thematic Roles
The performance drop compared to the use of the hand-annotated VerbNet class is of 2 points and statistically significant, and 0.2 points above the results obtained using VerbNet directly on the same conditions (fourth row of the same Table).
Mapping into VerbNet Thematic Roles
In this case, the difference is larger, 1.9 points, and statistically significant in favor of the mapping approach.
On the Generalization of Role Sets
In the second setting (‘CoNLL setting’ row in the same table) the PropBank classifier degrades slightly, but the difference is not statistically significant.
On the Generalization of Role Sets
The results in the ‘CoNLL setting (no 5th)’ rows of Table 1 show that the drop for PropBank is negligible and not significant, while the drop for VerbNet is more important, and statistically significant.
statistically significant is mentioned in 6 sentences in this paper.
Silberer, Carina and Ferrari, Vittorio and Lapata, Mirella
Experimental Setup
All correlation coefficients are statistically significant (p < 0.01, N = 435).
Results
Differences in correlation coefficients between models with two versus one modality are all statistically significant (p < 0.01 using a t-test), with the exception of Concat when compared against VisAttr.
Results
All correlation coefficients are statistically significant (p < 0.01, N = 1,716).
Results
All correlation coefficients are statistically significant (p < 0.01, N = 435).
statistically significant is mentioned in 5 sentences in this paper.
Pauls, Adam and Klein, Dan
Experiments
The differences between scores marked with l are not statistically significant.
Experiments
Our model outperforms all other generative models, though the improvement over the n-gram model is not statistically significant.
Experiments
We did not find that the use of our syntactic language model made any statistically significant increases in BLEU score.
statistically significant is mentioned in 5 sentences in this paper.
Cheung, Jackie Chi Kit and Penn, Gerald
Experiments
The model average is statistically significantly different from all the other conditions p < 10⁻⁷ (Study 1).
Experiments
The averages of the tested conditions are shown in Table 2, and are statistically significant.
Experiments
The differences in density between the human average and the non-baseline conditions are highly statistically significant, according to paired two-tailed Wilcoxon signed-rank tests for the statistic calculated for each topic cluster.
statistically significant is mentioned in 5 sentences in this paper.
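The paired two-tailed Wilcoxon signed-rank test mentioned above can be sketched as follows. This is a rough illustration rather than the authors' procedure: it assumes the large-sample normal approximation without tie or continuity corrections, drops zero differences, and uses hypothetical names. For small samples an exact test from a statistics package would be more appropriate.

```python
import math


def wilcoxon_signed_rank(xs, ys):
    """Two-tailed p-value for paired samples via the normal approximation."""
    if len(xs) != len(ys):
        raise ValueError("samples must be paired")
    diffs = [x - y for x, y in zip(xs, ys) if x != y]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # tied magnitudes share the mean 1-based rank
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    w_plus = sum(rank for rank, diff in zip(ranks, diffs) if diff > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))
```

In the setting of the excerpt, `xs` and `ys` would hold the statistic for each topic cluster under the two conditions being compared.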
Guinaudeau, Camille and Strube, Michael
Experiments
Indeed, the difference between our best results and those of Elsner and Charniak are not statistically significant.
Experiments
However, this improvement is not statistically significant.
Experiments
The best results, that present a statistically significant improvement when compared to the random baseline, are obtained when distance information and the number of entities “shared” by two sentences are taken into account (PW).
statistically significant is mentioned in 5 sentences in this paper.
Li, Linlin and Roth, Benjamin and Sporleder, Caroline
Abstract
In all three cases, we outperform state-of-the-art systems either quantitatively or statistically significantly.
Conclusion
We find that all models outperform comparable state-of-the-art systems either quantitatively or statistically significantly.
Experiments
We find that Model I performs better than both the best unsupervised system, RACAI (Ion and Tufis, 2007) and the most frequent sense baseline (BmeS), although these differences are not statistically significant due to the small size of the available test data (465).
Experiments
For both tasks, our models outperform the state-of-the-art systems of the same type either quantitatively or statistically significantly.
Experiments
system by Li and Sporleder (2009), although not statistically significantly.
statistically significant is mentioned in 5 sentences in this paper.
Kennedy, Alistair and Szpakowicz, Stan
Abstract
Although the 1987 version of the Thesaurus is better, we show that the 1911 version performs surprisingly well and that often the differences between the versions of Roget’s and WordNet are not statistically significant.
Comparison on applications
Even on the largest set (Finkelstein et al., 2001), however, the differences between Roget’s Thesaurus and the Vector method are not statistically significant at the p < 0.05 level for either thesaurus on a two-tailed test.
Comparison on applications
On the (Miller and Charles, 1991) and (Rubenstein and Goodenough, 1965) data sets the best system did not show a statistically significant improvement over the 1911 or 1987 Roget’s Thesauri, even at p < 0.1 for a two-tailed test.
Comparison on applications
Much like (Miller and Charles, 1991), the data set used here is not large enough to determine if any system’s improvement is statistically significant.
Conclusion and future work
The 1987 version of Roget’s Thesaurus performed better than the 1911 version on all our tests, but we did not find the differences to be statistically significant.
statistically significant is mentioned in 5 sentences in this paper.
Bhat, Suma and Xue, Huichao and Yoon, Su-Youn
Conclusions
Including the measure of syntactic complexity in an automatic scoring model resulted in statistically significant performance gains over the state-of-the-art.
Experimental Results
The correlation was approximately 0.1 higher in absolute value than that of cos4, which was the best performing feature in the VSM-based model and the difference is statistically significant.
Experimental Results
We note that the performance gain of Base+mescore over Base as well as over Base + cos4 is statistically significant at level = 0.01.
Experimental Results
The performance gain of Base+cos4 over Base, however, is not statistically significant at level = 0.01.
Introduction
In addition, including our proposed measure of syntactic complexity in an automatic scoring model results in a statistically significant performance gain over the state-of-the-art.
statistically significant is mentioned in 5 sentences in this paper.
Candito, Marie and Constant, Matthieu
Experiments
To evaluate statistical significance of parsing performance differences, we use eval07.pl with -b option, and then Dan Bikel’s comparator. For MWEs, we use the F-measure for recognition of untagged MWEs (hereafter FUM) and for recognition of tagged MWEs (hereafter FTM).
Experiments
For each architecture except the PIPELINE one, differences between the baseline and the best setting are statistically significant (p < 0.01).
Experiments
Best JOINT has statistically significant difference (p < 0.01) over both best JOINT-REG and best PIPELINE.
statistically significant is mentioned in 5 sentences in this paper.
Setiawan, Hendra and Kan, Min Yen and Li, Haizhou and Resnik, Philip
Experimental Results
Doubling the number of words (N = 64) produces a small gain, and defining the pairwise dominance model using N = 128 most frequent words produces a statistically significant 1-point gain over the baseline (p < 0.01).
Experimental Results
Larger values of N yield statistically significant performance above the baseline, but without further improvements over N = 128.
Experimental Results
Statistically significant results (p < 0.01) over the baseline are in bold.
Experimental Setup
all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by (Koehn, 2004).
statistically significant is mentioned in 5 sentences in this paper.
Dwyer, Kenneth and Kondrak, Grzegorz
Results
Statistically significant improvements were realized on Dutch, French, and German.
Results
The only case where it had no statistically significant effect was on English.
Results
From this perspective, to achieve statistically significant improvements on five of six L2P datasets (without ever being beaten by random) is an excellent result for QBB.
statistically significant is mentioned in 4 sentences in this paper.
Zhou, Guangyou and Liu, Fang and Liu, Yang and He, Shizhu and Zhao, Jun
Experiments
(2) Taking advantage of potentially rich semantic information drawn from other languages via statistical machine translation, question retrieval performance can be significantly improved (row 3, row 4, row 5 and row 6 vs. row 7, row 8 and row 9, all these comparisons are statistically significant at p < 0.05).
Experiments
(2012) (row 7 vs. row 8, the comparison is statistically significant at p < 0.05).
Experiments
(1) Our proposed matrix factorization can significantly improve the performance of question retrieval (row 1 vs. row2; row3 vs. row4, the improvements are statistically significant at p < 0.05).
statistically significant is mentioned in 4 sentences in this paper.
Jansen, Peter and Surdeanu, Mihai and Clark, Peter
CR + LS + DMM + DPM 39.32* +24% 47.86* +20%
The transferred models always outperform the baselines, but only the ensemble model’s improvement is statistically significant.
CR + LS + DMM + DPM 39.32* +24% 47.86* +20%
The results of the transferred models that include LS features are slightly lower, but still approach statistical significance for P@1 and are significant for MRR.
CR + LS + DMM + DPM 39.32* +24% 47.86* +20%
T indicates approaching statistical significance with p = 0.07 or 0.06.
Introduction
Our results show statistically significant improvements of up to 24% on top of state-of-the-art LS models (Yih et al., 2013).
statistically significant is mentioned in 4 sentences in this paper.
Xiong, Deyi and Zhang, Min and Li, Haizhou
Experiments
Statistical significance in BLEU differences
Experiments
Such an improvement is statistically significant (p < 0.01).
Experiments
Such a gain, which is statistically significant, confirms the effectiveness of semantic features.
statistically significant is mentioned in 4 sentences in this paper.
Green, Spence and DeNero, John
Discussion of Translation Results
We realized smaller, yet statistically significant, gains on the mixed genre data sets.
Discussion of Translation Results
The baseline contained 78 errors, while our system produced 66 errors, a statistically significant 15.4% error reduction at p ≤ 0.01 according to a paired t-test.
Experiments
The MT06 result is statistically significant at p ≤ 0.01; MT08 is significant at p ≤ 0.02.
Experiments
We evaluated translation quality with BLEU-4 (Papineni et al., 2002) and computed statistical significance with the approximate randomization method of Riezler and Maxwell (2005).
statistically significant is mentioned in 4 sentences in this paper.
Druck, Gregory and Pang, Bo
Experiments
They provide statistically significant improvements in review and segment F1, as well as accuracy, over the baseline models.
Experiments
Augmenting RS-MiXMiX with sequential dependencies, yielding RS-MiXHMM, provides a moderate (though not statistically significant) improvement in segment F1.
Experiments
As a result, in addition to yielding alignments, RSA-MiXHMM provides small improvements over RS-MiXHMM (though they are not statistically significant).
statistically significant is mentioned in 4 sentences in this paper.
Chang, Kai-min K. and Cherkassky, Vladimir L. and Mitchell, Tom M. and Just, Marcel Adam
Brain Imaging Experiments on Adjective-Noun Comprehension
However, the difference between the multiplicative model and the noun model is not statistically significant in this case.
Brain Imaging Experiments on Adjective-Noun Comprehension
The difference is statistically significant at p < 0.05.
Brain Imaging Experiments on Adjective-Noun Comprehension
Although neither difference is statistically significant, this clearly shows a pattern different from the attribute-specifying adjectives.
Introduction
They compared the composition models to human similarity ratings and found that all models were statistically significantly correlated with human judgements.
statistically significant is mentioned in 4 sentences in this paper.
Fowler, Timothy A. D. and Penn, Gerald
A Latent Variable CCG Parser
To determine statistical significance, we obtain p-values from Bikel’s randomized parsing evaluation comparator, modified for use with tagging accuracy, F-score and dependency accuracy.
A Latent Variable CCG Parser
The difference in accuracy is only statistically significant between Clark and Curran’s Normal Form model ignoring features and the Petrov parser trained on CCGbank without features (p-value = 0.013).
A Latent Variable CCG Parser
These results show that the features in CCGbank actually inhibit accuracy (to a statistically significant degree in the case of unlabeled accuracy on section 00) when used as training data for the Petrov parser.
statistically significant is mentioned in 4 sentences in this paper.
Cahill, Aoife and Riester, Arndt
Abstract
We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features.
Conclusions
In comparison to a baseline model, we achieve statistically significant improvement in BLEU score.
Generation Ranking Experiments
The improvement in BLEU is statistically significant (p < 0.01) using the paired bootstrap resampling significance test (Koehn, 2004).
Generation Ranking Experiments
(2007) and the model that only takes syntactic-based asymmetries into account is not statistically significant, while the difference between Model 1 and this model is statistically significant (p < 0.05).
statistically significant is mentioned in 4 sentences in this paper.
Kummerfeld, Jonathan K. and Roesner, Jessika and Dawborn, Tim and Haggerty, James and Curran, James R. and Clark, Stephen
Evaluation
To check whether changes were statistically significant we applied the test described by Chinchor (1995).
Results
The BFGS, GIS and MIRA models produced mixed results, but no statistically significant decrease in accuracy, and as the amount of parser-annotated data was increased, parsing speed increased by up to 85%.
Results
All changes in F-score are statistically significant.
Results
All of the new models in the table make a statistically significant improvement over the baseline.
statistically significant is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Celikyilmaz, Asli and Hakkani-Tur, Dilek
Experiments and Discussions
Results in bold show statistical significance over baseline in corresponding metric.
Experiments and Discussions
When stop words are used the HybHSumg outperforms state-of-the-art by 2.5-7% except R-2 (with statistical significance).
Experiments and Discussions
Results are statistically significant based on t-test.
statistically significant is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Wu, Stephen and Bachrach, Asaf and Cardenas, Carlos and Schuler, William
Evaluation
We report factors as statistically significant contributors to reading time if the absolute value of the t-value is greater than 2.
Results
The first data column shows the regression on all data; the second and third columns divide the data into open and closed classes, because an evaluation (not reported in detail here) showed statistically significant interactions between word class and 3 of the predictors.
Results
Out of the non-parser-based metrics, word order and bigram probability are statistically significant regardless of the data subset; though reciprocal length and unigram frequency do not reach significance here, likelihood ratio tests (not shown) confirm that they contribute to the model as a whole.
Results
It can be seen that nearly all the slopes have been estimated with signs as expected, with the exception of reciprocal length (which is not statistically significant ).
statistically significant is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Bendersky, Michael and Croft, W. Bruce and Smith, David A.
Experiments
* and † denote statistically significant differences with i-QRY and i-PRF, respectively.
Experiments
In order to test the statistical significance of improvements attained by the proposed methods we use a two-sided Fisher’s randomization test with 20,000 permutations.
Experiments
Results with p-value < 0.05 are considered statistically significant.
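A two-sided paired randomization test of the kind quoted in this entry can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `randomization_test` is hypothetical, the per-query scores are assumed to be paired, and the add-one smoothing of the p-value is one common convention rather than something the source specifies.

```python
import random


def randomization_test(scores_a, scores_b, n_permutations=20000, seed=0):
    """Two-sided paired randomization test on per-query score differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the two system labels are exchangeable,
        # so each paired difference keeps or flips its sign with equal odds.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)
```

A result would then be called significant at the threshold stated above when the returned value is below 0.05.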
statistically significant is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Szpektor, Idan and Dagan, Ido and Bar-Haim, Roy and Goldberger, Jacob
Results and Analysis
Our main result is that the allCP and allCP+pr methods rank matches statistically significantly better than the baselines in all setups (according to the Wilcoxon double-sided signed-ranks test at the level of 0.01 (Wilcoxon, 1945)).
Results and Analysis
Furthermore, relative to this cpv(7“, 25) model from (Pantel et al., 2007), our combined allCP model, with or without the prior (first row of Table 2), obtains statistically significantly better ranking (at the level of 0.01).
Results and Analysis
Comparing between the algorithms for matching 0pm (Section 3.2.2) we found that while rankedCBC is statistically significantly better than binaryCBC, rankedCBC and LIN generally achieve the same results.
statistically significant is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Rosenthal, Sara and McKeown, Kathleen
Experiments and Results
the averages of the accuracies from the 10 cross-validation runs and all results were compared for statistical significance using the t-test where applicable.
Experiments and Results
Unless otherwise marked, all accuracies are statistically significant at p<=.0005 for both baselines.
Experiments and Results
1 not statistically significant over Online-Behavior and Interests.
statistically significant is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Narayan, Shashi and Gardent, Claire
Experiments
Pairwise comparisons between all models and their statistical significance were carried out using a one-way ANOVA with post-hoc Tukey HSD tests and are shown in Table 6.
Experiments
With regard to simplification, our system ranks first and is very close to the manually simplified input (the difference is not statistically significant).
Related Work
No human evaluation is provided but the approach is shown to result in statistically significant improvements over a traditional phrase based approach.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Mehdad, Yashar and Carenini, Giuseppe and Ng, Raymond T.
Experimental Setup
Abstractive vs. Extractive: our full query-based abstractive summarization system shows statistically significant improvements over baselines
Experimental Setup
4 The statistical significance tests were calculated by approximate randomization, as described in (Yeh, 2000).
Introduction
Automatic evaluation on the chat dataset and manual evaluation over the meetings and emails show that our system uniformly and statistically significantly outperforms baseline systems, as well as a state-of-the-art query-based extractive summarization system.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Lau, Jey Han and Cook, Paul and McCarthy, Diana and Gella, Spandana and Baldwin, Timothy
WordNet Experiments
Based on McNemar’s Test with Yates correction for continuity, MKWC is significantly better over BNC and HDP-WSI is significantly better over FINANCE (p < 0.0001 in both cases), but the difference over SPORTS is not statistically significant (p > 0.1).
WordNet Experiments
Testing for statistical significance over the paired JS divergence values for each lemma using the Wilcoxon signed-rank test, the result for FINANCE is significant (p < 0.05) but the results for the other two datasets are not (p > 0.1 in each case).
WordNet Experiments
To summarise, the results for MKWC and HDP-WSI are fairly even for predominant sense learning (each outperforms the other at a level of statistical significance over one dataset), but HDP-WSI is better at inducing the overall sense distribution.
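McNemar's test with the Yates continuity correction, as named in this entry, reduces to a short calculation on the two discordant counts. The sketch below is illustrative only: the function name `mcnemar_yates` and the argument names `b` and `c` are assumptions, and the p-value uses the chi-square distribution with one degree of freedom, whose tail probability equals erfc(sqrt(x / 2)).

```python
import math


def mcnemar_yates(b, c):
    """McNemar's test with Yates continuity correction.

    b: items only the first system got right.
    c: items only the second system got right.
    Returns (chi-square statistic, p-value).
    """
    if b + c == 0:
        # No discordant pairs: there is no evidence of a difference.
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Items that both systems got right, or both got wrong, do not enter the statistic; only the disagreements matter.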
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Cao, Guihong and Robertson, Stephen and Nie, Jian-Yun
Experiments 6.1 Experiment Settings
We also conducted t-tests to determine whether the improvement is statistically significant.
Experiments 6.1 Experiment Settings
model is statistically significant with p-value<0.05,
Experiments 6.1 Experiment Settings
Moreover, the improvements on five collections are statistically significant.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Farra, Noura and Tomeh, Nadi and Rozovskaya, Alla and Habash, Nizar
Experiments
Rows marked with an asterisk (*) are statistically significant compared to CEC (for the first half of the table) or CEC+MLE (for the second half of the table), with p < 0.05.
Experiments
In fact, CEC+MLE and GSEC+MLE perform similarly (p = 0.36, not statistically significant).
Experiments
The two results are statistically significant (p < 0.0001) with respect to CEC and CEC+MLE respectively.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Bergsma, Shane and Lin, Dekang and Goebel, Randy
Results
The difference between COMBO and DISTRIB is not statistically significant, while both are significantly better than the rule-based approaches. This provides strong motivation for a “lightweight” approach to non-referential it detection: one that does not require parsing or handcrafted rules and is easily ported to new languages and text domains.
Results
Using no truncation (Unaltered) drops the F-Score by 4.3%, while truncating the patterns to a length of four only drops the F-Score by 1.4%, a difference which is not statistically significant.
Results
Neither is statistically significant; however, there seem to be diminishing returns from longer context patterns.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Chambers, Nathanael and Jurafsky, Daniel
Results
Statistical significance tests were calculated using the approximate randomization test (Yeh, 2000) with 1000 iterations.
Results
* indicates statistical significance with the column’s Baseline at the p < 0.01 level, † at p < 0.05.
Results
All numbers are statistically significant * with p-value < 0.01 from the number to their left.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Turchi, Marco and Anastasopoulos, Antonios and C. de Souza, José G. and Negri, Matteo
Experiments with CAT data
These results (MAE reductions are always statistically significant) suggest that, when dealing with datasets with very different label distributions, the evident limitations of batch methods are more easily overcome by learning from scratch from the feedback of a new post-editor.
Experiments with WMT12 data
11 Results marked with the “*” symbol are NOT statistically significant compared to the corresponding batch model.
Experiments with WMT12 data
The others are always statistically significant at p ≤ 0.005, calculated with approximate randomization (Yeh, 2000).
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Varga, István and Sano, Motoki and Torisawa, Kentaro and Hashimoto, Chikara and Ohtake, Kiyonori and Kawai, Takao and Oh, Jong-Hoon and De Saeger, Stijn
Experiments
The improvement in precision when using TR&EX is statistically significant (p < 0.05).9 Note that F-measure dropped
Experiments
The improvement in precision when using TR&EX is statistically significant (p < 0.01).
Experiments
statistically significant in both settings (p < 0.01).
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Carenini, Giuseppe and Ng, Raymond T. and Zhou, Xiaodong
Empirical Evaluation
As to the statistical significance, we use the 2-tail pairwise student t-test in all the experiments to compare two specific methods.
Empirical Evaluation
Meanwhile, both semantic similarities have lower accuracy than CWS, and the differences are also statistically significant even with the conservative Bonferroni adjustment (i.e., the p-values in Table 1 are multiplied by three).
Empirical Evaluation
The increase in precision is at least 0.04 with statistical significance.
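The Bonferroni adjustment described in this entry (multiplying each p-value by the number of comparisons, three in the quoted case) can be sketched as follows. The function name `bonferroni` is hypothetical, and capping the adjusted value at 1.0 is a standard convention that the quoted text does not spell out.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical p-values for three pairwise comparisons (not from the paper).
adjusted = bonferroni([0.01, 0.02, 0.5])
```

A comparison remains significant after the adjustment only if its adjusted p-value is still below the chosen threshold.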
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Huang, Fei and Yates, Alexander
Introduction
Differences in both precision and recall between the baseline and the other systems are statistically significant at p < 0.01 using the two-tailed Fisher’s exact test.
Introduction
Differences in both precision and recall between the baseline and the Span-HMM systems are statistically significant at p < 0.01 using the two-tailed Fisher’s exact test.
Introduction
were not statistically significant , except that the difference in precision between the Multi-Span-HMM and the Span-HMM-Base10 is significant at p < .1.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Mitchell, Jeff and Lapata, Mirella
Evaluation Setup
A Wilcoxon rank sum test confirmed that the difference is statistically significant (p < 0.01).
Results
The differences between High and Low similarity values estimated by these models are statistically significant (p < 0.01 using the Wilcoxon rank sum test).
Results
However, the difference between the two models is not statistically significant.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Croce, Danilo and Moschitti, Alessandro and Basili, Roberto and Palmer, Martha
Experiments
We also derive statistical significance of the results by using the model described in (Yeh, 2000) and implemented in (Pado, 2006).
Experiments
Third, PTK, which produces more general structures, improves over BR by almost 1.5 (statistically significant result) when using our dependency structures GRCT and LCT.
Experiments
Finally, the best model of SPTK (i.e., using LCT) improves over the best PTK (i.e., using LCT) by almost 1 point (statistically significant result): this difference is only given by lexical similarity.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Valitutti, Alessandro and Toivonen, Hannu and Doucet, Antoine and Toivanen, Jukka M.
Background
The evaluations indicate statistical significance, but the test settings are relatively specific.
Conclusions
The statistical significance is particularly high, even though there were several limitations in the experimental setting.
Evaluation
According to the one-sided Wilcoxon rank-sum test, both Collective Funniness and all Upper Agreements increase from FORM to FORM+TABOO and from FORM+TABOO to FORM+TABOO+CONT statistically significantly (in all cases p < .002).
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Melamud, Oren and Berant, Jonathan and Dagan, Ido and Goldberger, Jacob and Szpektor, Idan
Results
to compute MAP values and corresponding statistical significance, we randomly split each test set into 30 subsets.
Results
This improvement is statistically significant at p < 0.01 for BInc and Lin, and p < 0.015 for Cosine, using paired t-test.
Results
On test-setivc, where context mismatches are abundant, our model outperformed all other baselines (statistically significant at p < 0.01).
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Uszkoreit, Jakob and Brants, Thorsten
Conclusion
When using predictive class-based models in combination with a word-based language model trained on very large amounts of data, the improvements continue to be statistically significant on the test and nist06 sets.
Experiments
Adding the class-based models leads to small improvements in BLEU score, with the highest improvements for both dev and nist06 being statistically significant 2.
Experiments
2 Differences of more than 0.0051 are statistically significant at the 0.05 level using bootstrap resampling (Noreen, 1989; Koehn, 2004)
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Faruqui, Manaal and Dyer, Chris
Abstract
To evaluate our method, we use the word clusters in an NER system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters.
Experiments
For English and Turkish we observe a statistically significant improvement over the monolingual model (cf.
Experiments
English again has a statistically significant improvement over the baseline.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Braune, Fabienne and Seemann, Nina and Quernheim, Daniel and Maletti, Andreas
Experiments
The starred results are statistically significant improvements over the Baseline (at confidence p < 0.05).
Experiments
This improvement is statistically significant at confidence p < 0.05, which we computed using the pairwise bootstrap resampling technique of Koehn (2004).
Experiments
However, this improvement (0.34) is not statistically significant.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Alfonseca, Enrique and Pighin, Daniele and Garrido, Guillermo
Introduction
When compared to a state-of-the-art open-domain headline abstraction system (Filippova, 2010), the new headlines are statistically significantly better both in terms of readability and informativeness.
Results
• Amongst the automatic systems, HEADY performed better than MSC, with statistical significance at 95% for all the metrics.
Results
• The most frequent pattern baseline and HEADY have comparable performance across all the metrics (not statistically significantly different), although HEADY has slightly better scores for all metrics except for informativeness.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Li, Zhenghua and Liu, Ting and Che, Wanxiang
Experiments and Analysis
The p-values in parentheses present the statistical significance of the improvements.
Experiments and Analysis
The improvements shown in parentheses are all statistically significant (p < 10^-5).
Experiments and Analysis
The improvements shown in parentheses are all statistically significant (p < 10^-5).
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Chan, Wen and Zhou, Xiangdong and Wang, Wei and Chua, Tat-Seng
Experimental Results
The up arrow denotes a performance improvement over the previous method (above) that is statistically significant at a p-value of 0.05; the short line ’-’ denotes no statistically significant difference.
Experimental Results
which just utilize the independent sentence-level features, do not behave very well here, and there is no statistically significant performance difference between them.
Experimental Results
We also find that LCRF, which utilizes the local context information between sentences, performs better than the LR method in precision and F1 with statistical significance.
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Penn, Gerald and Zhu, Xiaodan
Future Work
It is entirely possible that, within this protocol, the baselines that have performed so well in our experiments, such as length or, in read news, position, will utterly fail, and that less traditional acoustic or spoken language features will genuinely, and with statistical significance, add value to a purely transcript-based text summarization system.
Results and analysis
The best performance is achieved by using all of the features together, but the length baseline, which uses only those features in bold type from Figure 3, is very close (no statistically significant difference), as is MMR.6
Results and analysis
The difference with respect to either of these baselines is statistically significant within the popular 10-30% compression range, as is the classifier trained on all features but acoustic
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Danescu-Niculescu-Mizil, Cristian and Cheng, Justin and Kleinberg, Jon and Lee, Lillian
I’m ready for my closeup.
For the null hypothesis of random guessing, these results are statistically significant, p < 2^-6 ≈ .016.
I’m ready for my closeup.
Table 2 shows that all the subjects performed (sometimes much) better than chance, and against the null hypothesis that all subjects are guessing randomly, the results are statistically significant, p < 2^-6 ≈ .016.
Never send a human to do a machine’s job.
Accuracies statistically significantly greater than bag-of-words according to a two-tailed t-test are indicated with *(p<.05) and **(p<.01).
statistically significant is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: