Related Work | Despite statistically significant predictors of deception such as shorter talking time, fewer semantic details, and less coherent statements, DePaulo et al. |
Related Work | and revealed that verbal cues based on lexical categories extracted using the LIWC tool show statistically significant, though small, differences between truth- and lie-tellers. |
Results | Statistically significant results are in bold. |
Results | performs best, with 59.5% cross-validation accuracy, which is a statistically significant improvement over the baselines of LR (t(4) = 22.25, p < .0001) and NB (t(4) = 16.19, p |
Results | In comparison with classification accuracy on pooled data, a paired t-test shows statistically significant improvement across all age groups using RF, t(3) = 1037, p |
Evaluation and Discussion | There is no statistically significant difference between DocSys and DocBase generations for METEOR and BLEU-4. However, there is a statistically significant difference in the syntactic variability metric for both domains (weather - χ²=137.16, d.f.=1, p<.0001; biography - χ²=96.641, d.f.=1, p<. |
Evaluation and Discussion | In terms of significance, there are no statistically significant differences between the systems for weather (DocOrig vs. DocSys - χ²=.347, d.f.=1, p=.555; DocOrig vs. DocBase - χ²=.090, d.f.=1, p=.764; DocSys vs. DocBase - χ²=.790, d.f.=1, p=.373). |
Evaluation and Discussion | For biography, the trend fits nicely both numerically and in terms of statistical significance (DocOrig vs. DocSys - χ²=5.094, d.f.=1, p=. |
Abstract | Our best model achieves statistically significant improvement over the state-of-the-art systems on several metrics (e.g., 8.0% and 5.4% improvements in ROUGE-2, respectively) for the DUC 2006 and 2007 summarization task. |
Introduction | We evaluate the summarization models on the standard Document Understanding Conference (DUC) 2006 and 2007 corpora for query-focused MDS and find that all of our compression-based summarization models achieve statistically significantly better performance than the best DUC 2006 systems. |
Introduction | With these results we believe we are the first to successfully show that sentence compression can provide statistically significant improvements over pure extraction-based approaches for query-focused MDS. |
Results | Our sentence-compression-based systems (marked with T) show statistically significant improvements over pure extractive summarization for both R-2 and R-SU4 (paired t-test, p < 0.01). |
Results | In Table 7, our context-aware and head-driven tree-based compression systems show statistically significantly (p < 0.01) higher precisions (Uni- |
Results | For grammatical relation evaluation, our head-driven tree-based system obtains a statistically significantly (p < 0.01) better F1 score (Rel-F1) than all the other systems except the rule-based system. |
Experiment | To determine statistical significance of the improvements, we also compute paired, one-tailed t tests. |
Experiment | ‘i’(‘*’) in the top four lines indicates statistical significance at p < 0.001 (0.05) when compared with the previous row. |
Experiment | ‘i’ or ‘*’ in the top four rows indicates statistical significance at p < 0.001 or < 0.05 compared with the previous row. |
Experiments | The model average is statistically significantly different from all the other conditions, p < 10⁻⁷ (Study 1). |
Experiments | The averages of the tested conditions are shown in Table 2, and are statistically significant. |
Experiments | The differences in density between the human average and the non-baseline conditions are highly statistically significant, according to paired two-tailed Wilcoxon signed-rank tests for the statistic calculated for each topic cluster. |
Experiments | Indeed, the difference between our best results and those of Elsner and Charniak is not statistically significant. |
Experiments | However, this improvement is not statistically significant. |
Experiments | The best results, which present a statistically significant improvement when compared to the random baseline, are obtained when distance information and the number of entities “shared” by two sentences are taken into account (PW). |
Experimental Setup | All correlation coefficients are statistically significant (p < 0.01, N = 435). |
Results | Differences in correlation coefficients between models with two versus one modality are all statistically significant (p < 0.01 using a t-test), with the exception of Concat when compared against VisAttr. |
Results | All correlation coefficients are statistically significant (p < 0.01, N = 1,716). |
Results | All correlation coefficients are statistically significant (p < 0.01, N = 435). |
Experiments | (2) Taking advantage of potentially rich semantic information drawn from other languages via statistical machine translation, question retrieval performance can be significantly improved (row 3, row 4, row 5 and row 6 vs. row 7, row 8 and row 9, all these comparisons are statistically significant at p < 0.05). |
Experiments | (2012) (row 7 vs. row 8, the comparison is statistically significant at p < 0.05). |
Experiments | (1) Our proposed matrix factorization can significantly improve the performance of question retrieval (row 1 vs. row 2; row 3 vs. row 4, the improvements are statistically significant at p < 0.05). |
Introduction | When compared to a state-of-the-art open-domain headline abstraction system (Filippova, 2010), the new headlines are statistically significantly better both in terms of readability and informativeness. |
Results | Amongst the automatic systems, HEADY performed better than MSC, with statistical significance at 95% for all the metrics. |
Results | The most frequent pattern baseline and HEADY have comparable performance across all the metrics (not statistically significantly different), although HEADY has slightly better scores for all metrics except for informativeness. |
Experiments | The starred results are statistically significant improvements over the Baseline (at confidence p < 0.05). |
Experiments | This improvement is statistically significant at confidence p < 0.05, which we computed using the pairwise bootstrap resampling technique of Koehn (2004). |
Experiments | However, this improvement (0.34) is not statistically significant. |
Abstract | To evaluate our method, we use the word clusters in an NER system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters. |
Experiments | For English and Turkish we observe a statistically significant improvement over the monolingual model (cf. |
Experiments | English again has a statistically significant improvement over the baseline. |
Results | to compute MAP values and corresponding statistical significance, we randomly split each test set into 30 subsets. |
Results | This improvement is statistically significant at p < 0.01 for BInc and Lin, and p < 0.015 for Cosine, using paired t-test. |
Results | On test-setivc, where context mismatches are abundant, our model outperformed all other baselines (statistically significant at p < 0.01). |
Background | The evaluations indicate statistical significance, but the test settings are relatively specific. |
Conclusions | The statistical significance is particularly high, even though there were several limitations in the experimental setting. |
Evaluation | According to the one-sided Wilcoxon rank-sum test, both Collective Funniness and all Upper Agreements increase from FORM to FORM+TABOO and from FORM+TABOO to FORM+TABOO+CONT statistically significantly (in all cases p < .002). |
Experiments | The improvement in precision when using TR&EX is statistically significant (p < 0.05). Note that F-measure dropped |
Experiments | The improvement in precision when using TR&EX is statistically significant (p < 0.01). |
Experiments | statistically significant in both settings (p < 0.01). |
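
Several of the snippets above report significance from paired tests, including a pairwise bootstrap resampling technique attributed to Koehn (2004). The following is a minimal sketch of that general idea, not the procedure used in any of the quoted papers. It assumes per-item scores are available for two systems on the same test items and that the metric can be summed over items; the function name `paired_bootstrap` and its parameters are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of resamples in which system B outscores system A.

    scores_a and scores_b are per-item scores for the same test items
    (an assumption of this sketch; corpus-level metrics such as BLEU
    would need to be recomputed on each resample instead of summed).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired item by item")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw n test items with replacement; use the same items for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in idx)
        total_b = sum(scores_b[i] for i in idx)
        if total_b > total_a:
            wins += 1
    return wins / n_resamples
```

Under this scheme, system B would be reported as better than system A at p < 0.05 when the returned fraction is at least 0.95. Ties count against B here, which is a conservative choice made for this sketch.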