Related Work | Despite statistically significant predictors of deception such as shorter talking time, fewer semantic details, and less coherent statements, DePaulo et al. |
Related Work | and revealed that verbal cues based on lexical categories extracted using the LIWC tool show statistically significant, though small, differences between truth- and lie-tellers.
Results | Statistically significant results are in bold. |
Results | performs best, with 59.5% cross-validation accuracy, which is a statistically significant improvement over the baselines of LR (χ²(4) = 22.25, p < .0001) and NB (χ²(4) = 16.19).
Results | In comparison with classification accuracy on pooled data, a paired t-test shows statistically significant improvement across all age groups using RF (t(3) = 10.37).
Comparison to Human Judgments | As Table 6 shows, we achieved 78.4% accuracy using all verbs (and 72.3% with the first verb from each worker), which is a statistically significant improvement.
Relational Similarity Experiments | Our best model v + p + c performs a bit better, 71.3% vs. 67.4%, but the difference is not statistically significant.
Relational Similarity Experiments | Our best model achieves 40.5% accuracy, which is slightly better than LRA’s 39.8%, but the difference is not statistically significant.
Relational Similarity Experiments | However, this time coordinating conjunctions (with prepositions) do help a bit (the difference is not statistically significant), since SAT verbal analogy questions ask for a broader range of relations, e.g., antonymy, for which coordinating conjunctions like but are helpful.
Abstract | The experiment shows that the tree kernel approach is able to give statistically significant improvements over the flat syntactic path feature.
Abstract | In addition, we further propose to leverage temporal ordering information to constrain the interpretation of discourse relations, which also demonstrates statistically significant improvements for discourse relation recognition on PDTB 2.0 for both explicit and implicit relations.
Conclusions and Future Works | The experimental results on PDTB v2.0 show that our kernel-based approach is able to give a statistically significant improvement over the flat syntactic path method.
Conclusions and Future Works | In addition, we also propose to incorporate temporal ordering information to constrain the interpretation of discourse relations, which also demonstrates statistically significant improvements for discourse relation recognition, both explicit and implicit.
Experiments and Results | We conduct a chi-square statistical significance test on All relations between the flat path approach and the Simple-Expansion approach, which shows that the performance improvements from incorporating the tree kernel are statistically significant (p < 0.05).
Experiments and Results | We conduct a chi-square statistical significance test on All relations, which shows that the performance improvement is statistically significant (p < 0.05).
Introduction | The experiment shows that the tree kernel is able to effectively incorporate syntactic structural information and produce statistically significant improvements over the flat syntactic path feature for the recognition of both explicit and implicit relations in the Penn Discourse Treebank (PDTB; Prasad et al., 2008).
Introduction | In addition, inspired by the linguistic study on tense and discourse anaphora (Webber, 1988), we further propose to incorporate temporal ordering information to constrain the interpretation of discourse relations, which also demonstrates statistically significant improvements for discourse relation recognition on PDTB v2.0 for both explicit and implicit relations.
Experiments | Statistical significance is measured using Approximate Randomization (Noreen, 1989), where result differences with a p-value smaller than 0.05 are considered statistically significant.
Experiments | All result differences are statistically significant.
Experiments | However, the result differences between these two systems are not statistically significant.
Evaluation and Discussion | There is no statistically significant difference between DocSys and DocBase generations for METEOR and BLEU-4. However, there is a statistically significant difference in the syntactic variability metric for both domains (weather: χ² = 137.16, d.f. = 1, p < .0001; biography: χ² = 96.641, d.f. = 1, p < .0001).
Evaluation and Discussion | In terms of significance, there are no statistically significant differences between the systems for weather (DocOrig vs. DocSys: χ² = .347, d.f. = 1, p = .555; DocOrig vs. DocBase: χ² = .090, d.f. = 1, p = .764; DocSys vs. DocBase: χ² = .790, d.f. = 1, p = .373).
Evaluation and Discussion | For biography, the trend fits nicely both numerically and in terms of statistical significance (DocOrig vs. DocSys: χ² = 5.094, d.f. = 1, p = .
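The χ² comparisons quoted above all come down to a one-degree-of-freedom Pearson test on a 2×2 contingency table. As a minimal sketch of that computation (the function name and the counts are invented for illustration, not taken from any of the quoted papers):

```python
def chi2_1df(table):
    """Pearson chi-square statistic (1 d.f.) for a 2x2 contingency table.

    table = [[a, b], [c, d]], e.g. correct/incorrect counts for two systems.
    Uses the closed form n*(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# With 1 d.f., chi2 > 3.841 corresponds to p < .05 and chi2 > 10.828 to p < .001.
stat = chi2_1df([[60, 40], [40, 60]])
significant_05 = stat > 3.841
```

Papers then report the statistic, the degrees of freedom, and the p-value, exactly as in the quotes above.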
Abstract | However, it has no statistically significant impact in terms of F-score as incorrect multiword expression recognition has important side effects on parsing. |
Evaluation | In order to establish the statistical significance of results between two parsing experiments in terms of F1 and UAS, we used a unidirectional t-test for two independent samples.
Evaluation | The statistical significance between two MWE identification experiments was established by using McNemar’s test (Gillick and Cox, 1989).
Evaluation | The results of the two experiments are considered statistically significant when the computed value is p < 0.01.
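McNemar's test, cited above via Gillick and Cox (1989), compares two systems on the same items using only the discordant counts. A rough sketch with Yates' continuity correction (the helper name and the counts are made up for illustration):

```python
def mcnemar_chi2(b, c):
    """McNemar chi-square statistic (1 d.f.) with Yates continuity correction.

    b: items system A classified correctly and system B incorrectly;
    c: the reverse. Concordant items do not enter the statistic.
    """
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# Example: A wins on 30 items, loses on 10; compare against the 1-d.f.
# chi-square critical value 6.635 for alpha = 0.01.
stat = mcnemar_chi2(30, 10)
reject_at_01 = stat > 6.635
```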
Experiments | Underlining in our method signifies that the differences between correlation coefficients obtained using our method and IMPACT are statistically significant at the 5% significance level. |
Experiments | 8 and “All” of Tables 2 and 4, the differences between correlation coefficients obtained using our method and IMPACT are statistically significant at the 5% significance level. |
Experiments | The differences between correlation coefficients obtained using our method and IMPACT are statistically significant at the 5% significance level for adequacy of SMT. |
Introduction | Moreover, the differences between correlation coefficients obtained using our method and other methods are statistically significant at the 5% or lower significance level for adequacy. |
Experiments | All results are statistically significant at α = 0.01 with two exceptions: the difference between trigrams and bigrams for the system trained and tested on texts is statistically significant at α = 0.1, and for the system trained on sentences and tested on texts it is not statistically significant at α = 0.01.
Experiments | The statistical significance of the results depends on the genre and the size of the n-gram: on product reviews, all results are statistically significant at the α = 0.025 level; on movie reviews, the difference between Naïve Bayes and SVM is statistically significant at α = 0.01, but the significance diminishes as the size of the n-gram increases; on news, only bigrams produce a statistically significant (α = 0.01) difference between the two machine learning methods, while on blogs the difference between SVMs and Naïve Bayes is most pronounced when unigrams are used (α = 0.025).
Integrating the Corpus-based and Dictionary-based Approaches | Using then an SVM meta-classifier trained on a small number of target domain examples to combine the nine base classifiers, they obtained a statistically significant improvement on out-of-domain texts from book reviews, knowledge-base feedback, and product support services survey data. |
Integrating the Corpus-based and Dictionary-based Approaches | The results reported in Table 6 are statistically significant at α = 0.01.
Integrating the Corpus-based and Dictionary-based Approaches | are statistically significant at α = 0.01, except the runs on movie reviews, where the difference between the LBS and Ensemble classifiers was significant at α = 0.05.
Conclusions | performance statistically significantly.
Experimental results | Models marked with an asterisk (*) are statistically significantly better than the random baseline. |
Experimental results | A double asterisk (**) indicates a statistically significant difference from model (6), and the model marked with a double dagger (‡) is significantly better than model (7).
Experimental results | We can observe statistically significant differences in shifting abilities between many negator pairs, such as between “is_never” and “do_not”, as well as between “does_not” and “can_not”.
Negation models based on heuristics | We will show that this simple modification improves the fitting performance statistically significantly.
Negation models based on heuristics | We will show that this model also statistically significantly outperforms the basic shifting without overfitting, although the number of parameters has increased.
Discussion and Error Analysis | However, our best system selection approach improves over MSA-Pivot by only a small margin of 0.2% absolute BLEU, albeit a statistically significant one.
MT System Selection | It improves over the best single system baseline (MSA-Pivot) by a statistically significant 0.5% BLEU. |
MT System Selection | Improvements are statistically significant.
MT System Selection | The differences in BLEU are statistically significant.
Machine Translation Experiments | All differences in BLEU scores between the four systems are statistically significant above the 95% level. |
Machine Translation Experiments | Statistical significance is computed using paired bootstrap re-sampling (Koehn, 2004). |
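Paired bootstrap resampling in the style of Koehn (2004), referenced above, resamples the test sentences with replacement and counts how often one system beats the other on the resampled sets. A simplified sketch (real BLEU aggregates per-sentence n-gram statistics rather than averaging per-sentence scores; the score lists below are invented):

```python
import random

def bootstrap_win_rate(metric, sys_a, sys_b, trials=1000, seed=0):
    """Fraction of paired bootstrap resamples on which system A beats system B."""
    rng = random.Random(seed)
    n = len(sys_a)
    wins = 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        if metric([sys_a[i] for i in idx]) > metric([sys_b[i] for i in idx]):
            wins += 1
    return wins / trials

mean = lambda xs: sum(xs) / len(xs)
a = [0.30, 0.35, 0.28, 0.40, 0.33, 0.37, 0.31, 0.36]  # per-sentence scores, system A
b = [0.25, 0.30, 0.27, 0.32, 0.29, 0.31, 0.26, 0.30]  # per-sentence scores, system B
win_rate = bootstrap_win_rate(mean, a, b)
# A win rate of at least 0.95 is conventionally read as significance at the 95% level.
```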
Experiment | To determine statistical significance of the improvements, we also compute paired, one-tailed t tests. |
Experiment | ‘†’ (‘*’) in the top four lines indicates statistical significance at p < 0.001 (0.05) when compared with the previous row.
Experiment | ‘†’ or ‘*’ in the top four rows indicates statistical significance at p < 0.001 or < 0.05 compared with the previous row.
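Several quotes in this collection report paired, one-tailed t-tests over per-item scores. A minimal sketch of the statistic (the score lists are invented; 2.015 is the one-tailed t critical value at p < 0.05 for 5 degrees of freedom):

```python
import math

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for per-item scores."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1

a = [0.82, 0.79, 0.85, 0.81, 0.84, 0.80]
b = [0.78, 0.75, 0.80, 0.79, 0.80, 0.77]
t, df = paired_t(a, b)
one_tailed_05 = t > 2.015  # t critical value for df = 5
```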
Abstract | Our best model achieves statistically significant improvements over the state-of-the-art systems on several metrics (e.g., 8.0% and 5.4% improvements in ROUGE-2 for the DUC 2006 and 2007 summarization tasks, respectively).
Introduction | We evaluate the summarization models on the standard Document Understanding Conference (DUC) 2006 and 2007 corpora for query-focused MDS and find that all of our compression-based summarization models achieve statistically significantly better performance than the best DUC 2006 systems.
Introduction | With these results we believe we are the first to successfully show that sentence compression can provide statistically significant improvements over pure extraction-based approaches for query-focused MDS. |
Results | Our sentence-compression-based systems (marked with T) show statistically significant improvements over pure extractive summarization for both R-2 and R-SU4 (paired t-test, p < 0.01). |
Results | In Table 7, our context-aware and head-driven tree-based compression systems show statistically significantly (p < 0.01) higher precisions (Uni- |
Results | For grammatical relation evaluation, our head-driven tree-based system obtains a statistically significantly (p < 0.01) better F1 score (Rel-F1) than all the other systems except the rule-based system.
Results | However, the differences between this model and the other models using lem in Table 4 are not statistically significant.
Results | The differences between this model and the other lower-performing models are statistically significant (p < 0.05).
Results | The differences among the last four models (all including lem) in Table 5 are not statistically significant.
Abstract | Then, we incorporate the model enhanced with topological fields into a natural language generation system that generates constituent orders for German text, and show that the added coherence component improves performance slightly, though not statistically significantly . |
Introduction | We add contextual features using topological field transitions to the model of Filippova and Strube (2007b) and achieve a slight improvement over their model in a constituent ordering task, though not statistically significantly . |
Introduction | Two-tailed sign tests were calculated for each result against the best-performing model in each column (1: p = 0.101; 2: p = 0.053; +: statistically significant, p < 0.05; ++: very statistically significant, p < 0.01).
Introduction | We embed entity topological field transitions into their probabilistic model, and show that the added coherence component slightly improves the performance of the baseline NLG system in generating constituent orderings in a German corpus, though not to a statistically significant degree. |
Discussion | The improvement in PP attachment was larger (20.5% ERR), and also statistically significant.
Experimental setting | We use Bikel’s randomized parsing evaluation comparator (with p < 0.05 throughout) to test the statistical significance of the results using word sense information, relative to the respective baseline parser using only lexical features.
Experimental setting | Statistical significance was calculated based on |
Results | These results are statistically significant in some cases (as indicated by *). |
Results | As in full-parsing, Bikel outperforms Charniak, but in this case the difference in the baselines is not statistically significant.
Results | As was the case for parsing, the performance with IST reaches and in many instances surpasses gold-standard levels, achieving statistical significance over the baseline in places. |
Experiments | We apply the same reasoning to test for statistical significance in GMAP improvements. |
Results | combination is a statistically significant improvement (α = 0.05) over our new transcript set (that is, over the best single transcript result).
Results | Tests for statistically significant improvements in GMAP are computed using our paired log AP test, as discussed in Section 4.2.2. |
Results | Secondly, it is the only combination approach able to produce statistically significant relative improvements on both measures for both conditions. |
Conclusion and Future work | While results are similar and not statistically significant on the WSJ test set, when testing on the Brown out-of-domain test set the difference in favor of PropBank plus the mapping step is statistically significant.
Mapping into VerbNet Thematic Roles | If we compare these results to those obtained by VerbNet in the SemEval setting (second row of Table 5), they are 0.5 points better, but the difference is not statistically significant.
Mapping into VerbNet Thematic Roles | The performance drop compared to the use of the hand-annotated VerbNet class is 2 points and statistically significant, and 0.2 points above the results obtained using VerbNet directly under the same conditions (fourth row of the same table).
Mapping into VerbNet Thematic Roles | In this case, the difference is larger, 1.9 points, and statistically significant in favor of the mapping approach. |
On the Generalization of Role Sets | In the second setting (‘CoNLL setting’ row in the same table) the PropBank classifier degrades slightly, but the difference is not statistically significant.
On the Generalization of Role Sets | The results in the ‘CoNLL setting (no 5th)’ rows of Table 1 show that the drop for PropBank is negligible and not significant, while the drop for VerbNet is larger, and statistically significant.
Experimental Setup | All correlation coefficients are statistically significant (p < 0.01, N = 435). |
Results | Differences in correlation coefficients between models with two versus one modality are all statistically significant (p < 0.01 using a t-test), with the exception of Concat when compared against VisAttr. |
Results | All correlation coefficients are statistically significant (p < 0.01, N = 1,716). |
Results | All correlation coefficients are statistically significant (p < 0.01, N = 435). |
Experiments | The differences between scores marked with † are not statistically significant.
Experiments | Our model outperforms all other generative models, though the improvement over the n-gram model is not statistically significant.
Experiments | We did not find that the use of our syntactic language model made any statistically significant increases in BLEU score. |
Experiments | The model average is statistically significantly different from all the other conditions at p < 10⁻⁷ (Study 1).
Experiments | The averages of the tested conditions are shown in Table 2, and are statistically significant.
Experiments | The differences in density between the human average and the non-baseline conditions are highly statistically significant, according to paired two-tailed Wilcoxon signed-rank tests for the statistic calculated for each topic cluster.
Experiments | Indeed, the difference between our best results and those of Elsner and Charniak is not statistically significant.
Experiments | However, this improvement is not statistically significant.
Experiments | The best results, that present a statistically significant improvement when compared to the random baseline, are obtained when distance information and the number of entities “shared” by two sentences are taken into account (PW). |
Abstract | In all three cases, we outperform state-of-the-art systems either quantitatively or statistically significantly . |
Conclusion | We find that all models outperform comparable state-of-the-art systems either quantitatively or statistically significantly . |
Experiments | We find that Model I performs better than both the best unsupervised system, RACAI (Ion and Tufis, 2007) and the most frequent sense baseline (BmeS), although these differences are not statistically significant due to the small size of the available test data (465). |
Experiments | For both tasks, our models outperform the state-of-the-art systems of the same type either quantitatively or statistically significantly . |
Experiments | system by Li and Sporleder (2009), although not statistically significantly . |
Abstract | Although the 1987 version of the Thesaurus is better, we show that the 1911 version performs surprisingly well and that often the differences between the versions of Roget’s and WordNet are not statistically significant.
Comparison on applications | Even on the largest set (Finkelstein et al., 2001), however, the differences between Roget’s Thesaurus and the Vector method are not statistically significant at the p < 0.05 level for either thesaurus on a two-tailed test.
Comparison on applications | On the (Miller and Charles, 1991) and (Rubenstein and Goodenough, 1965) data sets the best system did not show a statistically significant improvement over the 1911 or 1987 Roget’s Thesauri, even at p < 0.1 for a two-tailed test. |
Comparison on applications | Much like (Miller and Charles, 1991), the data set used here is not large enough to determine whether any system’s improvement is statistically significant.
Conclusion and future work | The 1987 version of Roget’s Thesaurus performed better than the 1911 version on all our tests, but we did not find the differences to be statistically significant . |
Conclusions | Including the measure of syntactic complexity in an automatic scoring model resulted in statistically significant performance gains over the state-of-the-art. |
Experimental Results | The correlation was approximately 0.1 higher in absolute value than that of cos4, which was the best-performing feature in the VSM-based model, and the difference is statistically significant.
Experimental Results | We note that the performance gain of Base+mescore over Base, as well as over Base+cos4, is statistically significant at the 0.01 level.
Experimental Results | The performance gain of Base+cos4 over Base, however, is not statistically significant at the 0.01 level.
Introduction | In addition, including our proposed measure of syntactic complexity in an automatic scoring model results in a statistically significant performance gain over the state-of-the-art. |
Experiments | To evaluate the statistical significance of parsing performance differences, we use eval07.pl with the -b option, and then Dan Bikel’s comparator. For MWEs, we use the F-measure for recognition of untagged MWEs (hereafter FUM) and for recognition of tagged MWEs (hereafter FTM).
Experiments | For each architecture except the PIPELINE one, differences between the baseline and the best setting are statistically significant (p < 0.01). |
Experiments | Best JOINT has statistically significant difference (p < 0.01) over both best JOINT-REG and best PIPELINE. |
Experimental Results | Doubling the number of words (N = 64) produces a small gain, and defining the pairwise dominance model using N = 128 most frequent words produces a statistically significant 1-point gain over the baseline (p < 0.01). |
Experimental Results | Larger values of N yield statistically significant performance above the baseline, but without further improvements over N = 128. |
Experimental Results | Statistically significant results (p < 0.01) over the baseline are in bold. |
Experimental Setup | In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004).
Results | Statistically significant improvements were realized on Dutch, French, and German. |
Results | The only case where it had no statistically significant effect was on English. |
Results | From this perspective, to achieve statistically significant improvements on five of six L2P datasets (without ever being beaten by random) is an excellent result for QBB. |
Experiments | (2) Taking advantage of potentially rich semantic information drawn from other languages via statistical machine translation, question retrieval performance can be significantly improved (row 3, row 4, row 5 and row 6 vs. row 7, row 8 and row 9, all these comparisons are statistically significant at p < 0.05). |
Experiments | (2012) (row 7 vs. row 8, the comparison is statistically significant at p < 0.05). |
Experiments | (1) Our proposed matrix factorization can significantly improve the performance of question retrieval (row 1 vs. row 2; row 3 vs. row 4; the improvements are statistically significant at p < 0.05).
CR + LS + DMM + DPM 39.32* +24% 47.86* +20% | The transferred models always outperform the baselines, but only the ensemble model’s improvement is statistically significant.
CR + LS + DMM + DPM 39.32* +24% 47.86* +20% | The results of the transferred models that include LS features are slightly lower, but still approach statistical significance for P@1 and are significant for MRR. |
CR + LS + DMM + DPM 39.32* +24% 47.86* +20% | † indicates approaching statistical significance with p = 0.07 or 0.06.
Introduction | Our results show statistically significant improvements of up to 24% on top of state-of-the-art LS models (Yih et al., 2013). |
Experiments | Statistical significance in BLEU differences |
Experiments | Such an improvement is statistically significant (p < 0.01). |
Experiments | Such a gain, which is statistically significant , confirms the effectiveness of semantic features. |
Discussion of Translation Results | We realized smaller, yet statistically significant, gains on the mixed genre data sets.
Discussion of Translation Results | The baseline contained 78 errors, while our system produced 66 errors, a statistically significant 15.4% error reduction at p ≤ 0.01 according to a paired t-test.
Experiments | The MT06 result is statistically significant at p ≤ 0.01; MT08 is significant at p ≤ 0.02.
Experiments | We evaluated translation quality with BLEU-4 (Papineni et al., 2002) and computed statistical significance with the approximate randomization method of Riezler and Maxwell (2005).
Experiments | They provide statistically significant improvements in review and segment F1, as well as accuracy, over the baseline models. |
Experiments | Augmenting RS-MiXMiX with sequential dependencies, yielding RS-MiXHMM, provides a moderate (though not statistically significant) improvement in segment F1.
Experiments | As a result, in addition to yielding alignments, RSA-MiXHMM provides small improvements over RS-MiXHMM (though they are not statistically significant).
Brain Imaging Experiments on Adjective-Noun Comprehension | However, the difference between the multiplicative model and the noun model is not statistically significant in this case.
Brain Imaging Experiments on Adjective-Noun Comprehension | The difference is statistically significant at p < 0.05.
Brain Imaging Experiments on Adjective-Noun Comprehension | Although neither difference is statistically significant, this clearly shows a pattern different from the attribute-specifying adjectives.
Introduction | They compared the composition models to human similarity ratings and found that all models were statistically significantly correlated with human judgements. |
A Latent Variable CCG Parser | To determine statistical significance, we obtain p-values from Bikel’s randomized parsing evaluation comparator, modified for use with tagging accuracy, F-score and dependency accuracy.
A Latent Variable CCG Parser | The difference in accuracy is only statistically significant between Clark and Curran’s Normal Form model ignoring features and the Petrov parser trained on CCGbank without features (p-value = 0.013). |
A Latent Variable CCG Parser | These results show that the features in CCGbank actually inhibit accuracy (to a statistically significant degree in the case of unlabeled accuracy on section 00) when used as training data for the Petrov parser.
Abstract | We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features. |
Conclusions | In comparison to a baseline model, we achieve statistically significant improvement in BLEU score. |
Generation Ranking Experiments | The improvement in BLEU is statistically significant (p < 0.01) using the paired bootstrap resampling significance test (Koehn, 2004). |
Generation Ranking Experiments | (2007) and the model that only takes syntactic-based asymmetries into account is not statistically significant, while the difference between Model 1 and this model is statistically significant (p < 0.05). |
Evaluation | To check whether changes were statistically significant we applied the test described by Chinchor (1995). |
Results | The BFGS, GIS and MIRA models produced mixed results, but no statistically significant decrease in accuracy, and as the amount of parser-annotated data was increased, parsing speed increased by up to 85%. |
Results | All changes in F-score are statistically significant.
Results | All of the new models in the table make a statistically significant improvement over the baseline. |
Experiments and Discussions | Results in bold show statistical significance over the baseline in the corresponding metric.
Experiments and Discussions | When stop words are used, HybHSum2 outperforms the state-of-the-art by 2.5-7%, except on R-2 (with statistical significance).
Experiments and Discussions | Results are statistically significant based on a t-test.
Evaluation | We report factors as statistically significant contributors to reading time if the absolute value of the t-value is greater than 2. |
Results | The first data column shows the regression on all data; the second and third columns divide the data into open and closed classes, because an evaluation (not reported in detail here) showed statistically significant interactions between word class and 3 of the predictors. |
Results | Out of the non-parser-based metrics, word order and bigram probability are statistically significant regardless of the data subset; though reciprocal length and unigram frequency do not reach significance here, likelihood ratio tests (not shown) confirm that they contribute to the model as a whole. |
Results | It can be seen that nearly all the slopes have been estimated with signs as expected, with the exception of reciprocal length (which is not statistically significant).
Experiments | * and † denote statistically significant differences with i-QRY and i-PRF, respectively.
Experiments | In order to test the statistical significance of improvements attained by the proposed methods we use a two-sided Fisher’s randomization test with 20,000 permutations. |
Experiments | Results with p-value < 0.05 are considered statistically significant . |
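The Fisher randomization test quoted above can be sketched for paired per-item scores by randomly flipping the sign of each difference; approximate randomization, cited elsewhere in this collection, follows the same scheme with a fixed number of random permutations. The function name and scores below are invented for illustration:

```python
import random

def randomization_test(a, b, trials=20000, seed=0):
    """Two-sided paired randomization test; returns an approximate p-value.

    Under the null hypothesis the two systems are exchangeable, so each
    paired difference keeps or flips its sign with equal probability.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(trials):
        shuffled = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(shuffled)) / len(diffs) >= observed:
            hits += 1
    return hits / trials

a = [0.61, 0.58, 0.66, 0.63, 0.59, 0.64, 0.62, 0.60, 0.65, 0.63]
b = [0.52, 0.50, 0.55, 0.51, 0.49, 0.54, 0.53, 0.50, 0.56, 0.52]
p_value = randomization_test(a, b)
```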
Results and Analysis | Our main result is that the allCP and allCP+pr methods rank matches statistically significantly better than the baselines in all setups (according to the Wilcoxon double-sided signed-ranks test at the level of 0.01 (Wilcoxon, 1945)). |
Results and Analysis | Furthermore, relative to this cpv(r, t) model from (Pantel et al., 2007), our combined allCP model, with or without the prior (first row of Table 2), obtains statistically significantly better ranking (at the level of 0.01).
Results and Analysis | Comparing the algorithms for matching CPs (Section 3.2.2), we found that while rankedCBC is statistically significantly better than binaryCBC, rankedCBC and LIN generally achieve the same results.
Experiments and Results | the averages of the accuracies from the 10 cross-validation runs, and all results were compared for statistical significance using the t-test where applicable.
Experiments and Results | Unless otherwise marked, all accuracies are statistically significant at p ≤ .0005 for both baselines.
Experiments and Results | †: not statistically significant over Online-Behavior and Interests.
Experiments | Pairwise comparisons between all models and their statistical significance were carried out using a one-way ANOVA with post-hoc Tukey HSD tests and are shown in Table 6. |
Experiments | With regard to simplification, our system ranks first and is very close to the manually simplified input (the difference is not statistically significant).
Related Work | No human evaluation is provided but the approach is shown to result in statistically significant improvements over a traditional phrase based approach. |
Experimental Setup | Abstractive vs. Extractive: our full query-based abstractive summarization system shows statistically significant improvements over baselines.
Experimental Setup | The statistical significance tests were calculated by approximate randomization, as described in (Yeh, 2000).
Introduction | Automatic evaluation on the chat dataset and manual evaluation over the meetings and emails show that our system uniformly and statistically significantly outperforms baseline systems, as well as a state-of-the-art query-based extractive summarization system. |
WordNet Experiments | Based on McNemar’s Test with Yates’ correction for continuity, MKWC is significantly better over BNC and HDP-WSI is significantly better over FINANCE (p < 0.0001 in both cases), but the difference over SPORTS is not statistically significant (p > 0.1).
WordNet Experiments | Testing for statistical significance over the paired JS divergence values for each lemma using the Wilcoxon signed-rank test, the result for FINANCE is significant (p < 0.05) but the results for the other two datasets are not (p > 0.1 in each case).
WordNet Experiments | To summarise, the results for MKWC and HDP-WSI are fairly even for predominant sense learning (each outperforms the other at a level of statistical significance over one dataset), but HDP-WSI is better at inducing the overall sense distribution.
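The Wilcoxon signed-rank test used in the quote above (and in several other quotes in this collection) ranks the absolute paired differences and sums the ranks of the positive ones. A sketch using the normal approximation, with invented scores; production implementations (e.g. scipy.stats.wilcoxon) handle ties and zeros more carefully:

```python
import math

def wilcoxon_signed_rank(a, b):
    """Wilcoxon signed-rank test via the normal approximation.

    Drops zero differences and average-ranks ties; returns (z, two-sided p).
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1  # extend the group of tied absolute differences
        avg = (i + j) / 2 + 1  # average rank of the tied group (1-based)
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p

b = [0.50] * 12
a = [0.50 + 0.01 * (i + 1) for i in range(12)]  # system A wins every pair
z, p = wilcoxon_signed_rank(a, b)
```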
Experiments 6.1 Experiment Settings | We also conducted t-tests to determine whether the improvement is statistically significant.
Experiments 6.1 Experiment Settings | model is statistically significant with p-value < 0.05,
Experiments 6.1 Experiment Settings | Moreover, the improvements on five collections are statistically significant . |
Experiments | Rows marked with an asterisk (*) are statistically significant compared to CEC (for the first half of the table) or CEC+MLE (for the second half of the table), with p < 0.05. |
Experiments | In fact, CEC+MLE and GSEC+MLE perform similarly (p = 0.36, not statistically significant).
Experiments | The two results are statistically significant (p < 0.0001) with respect to CBC and CEC+MLE respectively. |
Results | The difference between COMBO and DISTRIB is not statistically significant, while both are significantly better than the rule-based approaches.8 This provides strong motivation for a “lightweight” approach to non-referential it detection — one that does not require parsing or handcrafted rules — and is easily ported to new languages and text domains.
Results | Using no truncation (Unaltered) drops the F-Score by 4.3%, while truncating the patterns to a length of four only drops the F-Score by 1.4%, a difference which is not statistically significant.
Results | Neither is statistically significant; however, there seem to be diminishing returns from longer context patterns.
Results | Statistical significance tests were calculated using the approximate randomization test (Yeh, 2000) with 1000 iterations. |
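The approximate randomization test of Yeh (2000), cited repeatedly in these excerpts, shuffles system labels within each paired example and counts how often the shuffled score difference reaches the observed one. A minimal sketch; the function name, metric, and iteration count are illustrative:

```python
import random

def approx_randomization(scores_a, scores_b, metric, iters=1000, seed=0):
    """Approximate randomization test for paired per-example scores.

    For each iteration, swap each example's pair of scores with
    probability 0.5 and check whether the shuffled metric difference
    is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(metric(scores_a) - metric(scores_b))
    count = 0
    for _ in range(iters):
        a, b = [], []
        for x, y in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a.append(y); b.append(x)
            else:
                a.append(x); b.append(y)
        if abs(metric(a) - metric(b)) >= observed:
            count += 1
    # Add-one smoothing keeps the estimated p-value strictly positive
    return (count + 1) / (iters + 1)

mean = lambda s: sum(s) / len(s)
p = approx_randomization([1] * 30, [0] * 30, mean)
```

Unlike the bootstrap, this test holds the set of examples fixed and randomizes only the assignment of scores to systems, so it directly tests the null hypothesis that the two systems are interchangeable.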
Results | * indicates statistical significance with respect to the column’s Baseline at the p < 0.01 level, † at p < 0.05.
Results | All numbers are statistically significant* with p-value < 0.01 compared to the number to their left.
Experiments with CAT data | These results (MAE reductions are always statistically significant) suggest that, when dealing with datasets with very different label distributions, the evident limitations of batch methods are more easily overcome by learning from scratch from the feedback of a new post-editor.
Experiments with WMT12 data | 11Results marked with the “*” symbol are NOT statistically significant compared to the corresponding batch model.
Experiments with WMT12 data | The others are always statistically significant at p ≤ 0.005, calculated with approximate randomization (Yeh, 2000).
Experiments | The improvement in precision when using TR&EX is statistically significant (p < 0.05).9 Note that F-measure dropped |
Experiments | The improvement in precision when using TR&EX is statistically significant (p < 0.01). |
Experiments | statistically significant in both settings (p < 0.01). |
Empirical Evaluation | As to statistical significance, we use the two-tailed paired Student’s t-test in all the experiments to compare two specific methods.
Empirical Evaluation | Meanwhile, both semantic similarities have lower accuracy than CWS, and the differences are also statistically significant even with the conservative Bonferroni adjustment (i.e., the p-values in Table 1 are multiplied by three). |
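The Bonferroni adjustment mentioned above guards against inflated false-positive rates when several comparisons are made against the same baseline: each p-value is multiplied by the number of tests. A minimal sketch with SciPy (function name and scores are illustrative, not from the paper):

```python
from scipy.stats import ttest_rel

def bonferroni_paired(baseline, systems):
    """Paired t-test of each system against the baseline; the conservative
    Bonferroni adjustment multiplies each p-value by the number of tests
    (capped at 1.0)."""
    m = len(systems)
    return {name: min(ttest_rel(scores, baseline).pvalue * m, 1.0)
            for name, scores in systems.items()}

# Hypothetical per-fold accuracies for a baseline and one competing system
base = [0.50, 0.52, 0.48, 0.51, 0.49, 0.53, 0.47, 0.50, 0.52, 0.48]
sys_a = [0.61, 0.61, 0.59, 0.60, 0.60, 0.62, 0.58, 0.61, 0.61, 0.59]
adjusted = bonferroni_paired(base, {"sys_a": sys_a})
```

With three comparisons, as in the excerpt's Table 1, `m` would be 3 and each raw p-value would be tripled before being compared to the significance threshold.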
Empirical Evaluation | The increase in precision is at least 0.04 with statistical significance . |
Introduction | Differences in both precision and recall between the baseline and the other systems are statistically significant at p < 0.01 using the two-tailed Fisher’s exact test. |
Introduction | Differences in both precision and recall between the baseline and the Span-HMM systems are statistically significant at p < 0.01 using the two-tailed Fisher’s exact test. |
Introduction | were not statistically significant, except that the difference in precision between the Multi-Span-HMM and the Span-HMM-Base10 is significant at p < .1.
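The two-tailed Fisher's exact test used in the excerpts above compares correct/incorrect counts for two systems via a 2x2 contingency table. A minimal sketch with SciPy; the counts are illustrative, not taken from the paper:

```python
from scipy.stats import fisher_exact

# Hypothetical correct/incorrect counts on 100 test items each
table = [[70, 30],   # baseline: 70 correct, 30 incorrect
         [90, 10]]   # system:   90 correct, 10 incorrect
odds_ratio, p = fisher_exact(table, alternative='two-sided')
```

Fisher's exact test computes the tail probability of the hypergeometric distribution directly, so unlike a chi-squared test it remains valid even when some cell counts are small.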
Evaluation Setup | A Wilcoxon rank sum test confirmed that the difference is statistically significant (p < 0.01). |
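The Wilcoxon rank-sum test cited here compares two independent samples without assuming normality. A minimal sketch with SciPy; the synthetic score distributions are purely illustrative:

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
# Hypothetical similarity scores for two independent conditions
high = rng.normal(0.8, 0.1, size=100)
low = rng.normal(0.5, 0.1, size=100)

# Two-sided rank-sum test on the pooled ranks of the two samples
stat, p = ranksums(high, low)
```

Because the test ranks the pooled observations rather than using their raw values, it is robust to outliers and to monotone rescalings of the similarity scores.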
Results | The difference between the High and Low similarity values estimated by these models is statistically significant (p < 0.01 using the Wilcoxon rank-sum test).
Results | However, the difference between the two models is not statistically significant.
Experiments | We also derive statistical significance of the results by using the model described in (Yeh, 2000) and implemented in (Pado, 2006). |
Experiments | Third, PTK, which produces more general structures, improves over BR by almost 1.5 (a statistically significant result) when using our dependency structures GRCT and LCT.
Experiments | Finally, the best model of SPTK (i.e., using LCT) improves over the best PTK (i.e., using LCT) by almost 1 point (a statistically significant result): this difference is only given by lexical similarity.
Background | The evaluations indicate statistical significance , but the test settings are relatively specific. |
Conclusions | The statistical significance is particularly high, even though there were several limitations in the experimental setting. |
Evaluation | According to the one-sided Wilcoxon rank-sum test, both Collective Funniness and all Upper Agreements increase from FORM to FORM+TABOO and from FORM+TABOO to FORM+TABOO+CONT statistically significantly (in all cases p < .002). |
Results | to compute MAP values and corresponding statistical significance, we randomly split each test set into 30 subsets.
Results | This improvement is statistically significant at p < 0.01 for BInc and Lin, and p < 0.015 for Cosine, using paired t-test. |
Results | On test-setivc, where context mismatches are abundant, our model outperformed all other baselines (statistically significant at p < 0.01).
Conclusion | When using predictive class-based models in combination with a word-based language model trained on very large amounts of data, the improvements continue to be statistically significant on the test and nist06 sets. |
Experiments | Adding the class-based models leads to small improvements in BLEU score, with the highest improvements for both dev and nist06 being statistically significant.2
Experiments | 2Differences of more than 0.0051 are statistically significant at the 0.05 level using bootstrap resampling (Noreen, 1989; Koehn, 2004) |
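The paired bootstrap resampling of Koehn (2004), referenced above, resamples test items with replacement and asks how often one system beats the other across resamples. A minimal sketch over per-item scores; the function name and the use of per-item additive scores are illustrative simplifications (corpus BLEU is not additive, so a full implementation would recompute BLEU per resample):

```python
import random

def paired_bootstrap_wins(scores_a, scores_b, iters=1000, seed=0):
    """Paired bootstrap: resample test items with replacement and return
    the fraction of resamples in which system A's total score beats
    system B's on the same resampled items."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / iters

frac = paired_bootstrap_wins([1] * 50, [0] * 50)
```

A win fraction of, say, 0.95 or higher is conventionally read as significance at the 0.05 level; crucially, both systems are scored on the *same* resampled item set in every iteration.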
Abstract | To evaluate our method, we use the word clusters in an NER system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters. |
Experiments | For English and Turkish we observe a statistically significant improvement over the monolingual model (cf. |
Experiments | English again has a statistically significant improvement over the baseline. |
Experiments | The starred results are statistically significant improvements over the Baseline (at confidence p < 0.05). |
Experiments | This improvement is statistically significant at confidence p < 0.05, which we computed using the pairwise bootstrap resampling technique of Koehn (2004). |
Experiments | However, this improvement (0.34) is not statistically significant.
Introduction | When compared to a state-of-the-art open-domain headline abstraction system (Filippova, 2010), the new headlines are statistically significantly better both in terms of readability and informativeness. |
Results | • Amongst the automatic systems, HEADY performed better than MSC, with statistical significance at 95% for all the metrics.
Results | • The most frequent pattern baseline and HEADY have comparable performance across all the metrics (not statistically significantly different), although HEADY has slightly better scores for all metrics except for informativeness.
Experiments and Analysis | The p-values in parentheses present the statistical significance of the improvements.
Experiments and Analysis | The improvements shown in parentheses are all statistically significant (p < 10⁻⁵).
Experimental Results | The up-arrow denotes a performance improvement over the previous method (above) with statistical significance at a p-value of 0.05; the short line ‘-’ denotes no statistically significant difference.
Experimental Results | which just utilize the independent sentence-level features, do not perform very well here, and there is no statistically significant performance difference between them.
Experimental Results | We also find that LCRF, which utilizes the local context information between sentences, performs better than the LR method in precision and F1 with statistical significance.
Future Work | It is entirely possible that, within this protocol, the baselines that have performed so well in our experiments, such as length or, in read news, position, will utterly fail, and that less traditional acoustic or spoken-language features will genuinely, and with statistical significance, add value to a purely transcript-based text summarization system.
Results and analysis | The best performance is achieved by using all of the features together, but the length baseline, which uses only those features in bold type from Figure 3, is very close (no statistically significant difference), as is MMR.6 |
Results and analysis | The difference with respect to either of these baselines is statistically significant within the popular 10–30% compression range, as is the classifier trained on all features but acoustic
I’m ready for my closeup. | For the null hypothesis of random guessing, these results are statistically significant, p < 2⁻⁶ ≈ .016.
I’m ready for my closeup. | Table 2 shows that all the subjects performed (sometimes much) better than chance, and against the null hypothesis that all subjects are guessing randomly, the results are statistically significant, p < 2⁻⁶ ≈ .016.
Never send a human to do a machine’s job. | Accuracies statistically significantly greater than bag-of-words according to a two-tailed t-test are indicated with *(p<.05) and **(p<.01). |