Experiments | Statistical significance is measured using Approximate Randomization (Noreen, 1989), where result differences with a p-value smaller than 0.05 are considered statistically significant. |
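The approximate randomization test named in this row can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: it assumes per-item scores for two systems and uses the difference in mean score as the test statistic, whereas corpus-level metrics such as BLEU would need to be recomputed on each shuffled assignment. The function name and parameters are hypothetical.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test (illustrative sketch).

    scores_a, scores_b: per-item scores of two systems on the same items.
    Returns an estimated p-value for the observed difference in mean score.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equally long score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    # Observed test statistic: absolute difference in mean score.
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b))) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap each pair's outputs with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Under the 0.05 threshold stated above, a difference would count as significant when the returned estimate falls below 0.05.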
Experiments | All result differences are statistically significant. |
Experiments | However, the result differences between these two systems do not score as statistically significant. |
Discussion and Error Analysis | However, our best system selection approach improves over MSA-Pivot by a small margin of 0.2% BLEU absolute only, albeit a statistically significant improvement. |
MT System Selection | It improves over the best single system baseline (MSA-Pivot) by a statistically significant 0.5% BLEU. |
MT System Selection | Improvements are statistically significant. |
MT System Selection | The differences in BLEU are statistically significant. |
Machine Translation Experiments | All differences in BLEU scores between the four systems are statistically significant above the 95% level. |
Machine Translation Experiments | Statistical significance is computed using paired bootstrap re-sampling (Koehn, 2004). |
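Paired bootstrap resampling can be sketched roughly as below. This is an assumption-laden illustration rather than the procedure used in the cited work: it resamples test items with replacement and counts how often system A scores higher than system B on the resample. The default `metric` is a plain mean over per-item scores, a stand-in for recomputing corpus-level BLEU on each resample; all names here are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0, metric=None):
    """Fraction of bootstrap resamples on which system A beats system B.

    scores_a, scores_b: per-item scores of two systems on the same test items.
    metric: function mapping a list of item scores to one number
            (assumed here to be the mean if not supplied).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equally long score lists")
    if metric is None:
        metric = lambda xs: sum(xs) / len(xs)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        # Draw n item indices with replacement; the same indices are used
        # for both systems, which is what makes the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        if metric([scores_a[i] for i in idx]) > metric([scores_b[i] for i in idx]):
            wins += 1
    return wins / samples
```

A win rate of at least 0.95 would correspond to the "95% level" reading: system A is better on at least 95% of the resampled test sets.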
Conclusions | formance statistically significantly. |
Experimental results | Models marked with an asterisk (*) are statistically significantly better than the random baseline. |
Experimental results | Double asterisks (**) indicate a statistically significant difference from model (6), and the model with the double dagger (‡) is significantly better than model (7). |
Experimental results | We can observe statistically significant differences in shifting ability between many negator pairs, such as between “is_never” and “do_not” as well as between “does_not” and “can_not”. |
Negation models based on heuristics | We will show that this simple modification improves the fitting performance statistically significantly. |
Negation models based on heuristics | We will show that this model also statistically significantly outperforms the basic shifting without overfitting, although the number of parameters has increased. |
Conclusions | Including the measure of syntactic complexity in an automatic scoring model resulted in statistically significant performance gains over the state-of-the-art. |
Experimental Results | The correlation was approximately 0.1 higher in absolute value than that of cos4, which was the best performing feature in the VSM-based model, and the difference is statistically significant. |
Experimental Results | We note that the performance gain of Base+mescore over Base as well as over Base+cos4 is statistically significant at the 0.01 level. |
Experimental Results | The performance gain of Base+cos4 over Base, however, is not statistically significant at the 0.01 level. |
Introduction | In addition, including our proposed measure of syntactic complexity in an automatic scoring model results in a statistically significant performance gain over the state-of-the-art. |
Experiments | To evaluate the statistical significance of parsing performance differences, we use eval07.pl with the -b option, and then Dan Bikel’s comparator. For MWEs, we use the F-measure for recognition of untagged MWEs (hereafter FUM) and for recognition of tagged MWEs (hereafter FTM). |
Experiments | For each architecture except the PIPELINE one, differences between the baseline and the best setting are statistically significant (p < 0.01). |
Experiments | Best JOINT has a statistically significant difference (p < 0.01) over both best JOINT-REG and best PIPELINE. |
CR + LS + DMM + DPM 39.32* +24% 47.86* +20% | The transferred models always outperform the baselines, but only the ensemble model’s improvement is statistically significant. |
CR + LS + DMM + DPM 39.32* +24% 47.86* +20% | The results of the transferred models that include LS features are slightly lower, but still approach statistical significance for P@1 and are significant for MRR. |
CR + LS + DMM + DPM 39.32* +24% 47.86* +20% | † indicates approaching statistical significance with p = 0.07 or 0.06. |
Introduction | Our results show statistically significant improvements of up to 24% on top of state-of-the-art LS models (Yih et al., 2013). |
Experiments | Rows marked with an asterisk (*) are statistically significant compared to CEC (for the first half of the table) or CEC+MLE (for the second half of the table), with p < 0.05. |
Experiments | In fact, CEC+MLE and GSEC+MLE perform similarly (p = 0.36, not statistically significant). |
Experiments | The two results are statistically significant (p < 0.0001) with respect to CEC and CEC+MLE, respectively. |
WordNet Experiments | Based on McNemar’s test with Yates’ correction for continuity, MKWC is significantly better over BNC and HDP-WSI is significantly better over FINANCE (p < 0.0001 in both cases), but the difference over SPORTS is not statistically significant (p > 0.1). |
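For reference, McNemar’s test with Yates’ continuity correction can be sketched as below. This is a generic illustration, not the cited authors' code: `b` and `c` are assumed to be the discordant counts (items only the first system got right, and items only the second system got right), and the p-value uses the chi-square distribution with one degree of freedom. The function name is hypothetical.

```python
import math


def mcnemar_yates(b, c):
    """McNemar's test with Yates' continuity correction (illustrative sketch).

    b: number of items the first system got right and the second got wrong.
    c: number of items the second system got right and the first got wrong.
    Returns (chi-square statistic, p-value) with one degree of freedom.
    """
    if b + c == 0:
        # No discordant pairs: there is no evidence of any difference.
        return 0.0, 1.0
    # Clamp at zero so that b == c yields a statistic of 0 (an assumption
    # of this sketch; it mirrors skipping the correction when b equals c).
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of chi-square with 1 d.o.f.: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

With hypothetical counts `b = 25` and `c = 5`, the statistic is (20 - 1)^2 / 30, about 12.03, which corresponds to a p-value well below 0.05.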
WordNet Experiments | Testing for statistical significance over the paired JS divergence values for each lemma using the Wilcoxon signed-rank test, the result for FINANCE is significant (p < 0.05) but the results for the other two datasets are not (p > 0.1 in each case). |
WordNet Experiments | To summarise, the results for MKWC and HDP-WSI are fairly even for predominant sense learning (each outperforms the other at a level of statistical significance over one dataset), but HDP-WSI is better at inducing the overall sense distribution. |
Experimental Setup | Abstractive vs. Extractive: our full query-based abstractive summarization system shows statistically significant improvements over baselines |
Experimental Setup | The statistical significance tests were calculated by approximate randomization, as described in (Yeh, 2000). |
Introduction | Automatic evaluation on the chat dataset and manual evaluation over the meetings and emails show that our system uniformly and statistically significantly outperforms baseline systems, as well as a state-of-the-art query-based extractive summarization system. |
Experiments | Pairwise comparisons between all models and their statistical significance were carried out using a one-way ANOVA with post-hoc Tukey HSD tests and are shown in Table 6. |
Experiments | With regard to simplification, our system ranks first and is very close to the manually simplified input (the difference is not statistically significant). |
Related Work | No human evaluation is provided, but the approach is shown to result in statistically significant improvements over a traditional phrase-based approach. |
Experiments with CAT data | These results (MAE reductions are always statistically significant) suggest that, when dealing with datasets with very different label distributions, the evident limitations of batch methods are more easily overcome by learning from scratch from the feedback of a new post-editor. |
Experiments with WMT12 data | Results marked with the “*” symbol are NOT statistically significant compared to the corresponding batch model. |
Experiments with WMT12 data | The others are always statistically significant at p ≤ 0.005, calculated with approximate randomization (Yeh, 2000). |