Experimental Results | Doubling the number of words (N = 64) produces a small gain, and defining the pairwise dominance model using N = 128 most frequent words produces a statistically significant 1-point gain over the baseline (p < 0.01). |
Experimental Results | Larger values of N yield statistically significant performance above the baseline, but without further improvements over N = 128. |
Experimental Results | Statistically significant results (p < 0.01) over the baseline are in bold. |
Experimental Setup | In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004). |
Abstract | We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features. |
Conclusions | In comparison to a baseline model, we achieve statistically significant improvement in BLEU score. |
Generation Ranking Experiments | The improvement in BLEU is statistically significant (p < 0.01) using the paired bootstrap resampling significance test (Koehn, 2004). |
Generation Ranking Experiments | (2007) and the model that only takes syntactic-based asymmetries into account is not statistically significant, while the difference between Model 1 and this model is statistically significant (p < 0.05). |
Brain Imaging Experiments on Adjective-Noun Comprehension | However, the difference between the multiplicative model and the noun model is not statistically significant in this case. |
Brain Imaging Experiments on Adjective-Noun Comprehension | The difference is statistically significant at p < 0.05. |
Brain Imaging Experiments on Adjective-Noun Comprehension | Although neither difference is statistically significant, this clearly shows a pattern different from the attribute-specifying adjectives. |
Introduction | They compared the composition models to human similarity ratings and found that all models were statistically significantly correlated with human judgements. |
Results | Statistically significant improvements were realized on Dutch, French, and German. |
Results | The only case where it had no statistically significant effect was on English. |
Results | From this perspective, to achieve statistically significant improvements on five of six L2P datasets (without ever being beaten by random) is an excellent result for QBB. |
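
Several of the excerpts above report significance via paired bootstrap resampling (Koehn, 2004). A minimal sketch of the idea is given here, under stated assumptions: the function name `paired_bootstrap` and its parameters are hypothetical, and each system is represented by a list of per-sentence scores whose sum stands in for the corpus-level metric. A faithful BLEU version would instead recompute corpus-level n-gram statistics on every resampled test set; none of the cited papers' exact procedures are reproduced here.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of resampled test sets on which system A beats system B.

    scores_a, scores_b: per-sentence scores for the same test sentences
    (an assumption of this sketch; real BLEU is a corpus-level metric).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw n sentence indices with replacement; the same indices are
        # used for both systems, which is what makes the test "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in idx)
        total_b = sum(scores_b[i] for i in idx)
        if total_a > total_b:
            wins += 1
    return wins / n_resamples


# Usage: system A is conventionally called better at p < 0.05 if it wins
# on at least 95% of the resampled test sets.
win_rate = paired_bootstrap([0.5] * 20, [0.4] * 20)
```

The same index sample is applied to both systems in each iteration, so sentence-level difficulty affects both scores equally and only the difference between systems drives the win rate.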