Index of papers in Proc. ACL 2014 that mention
  • statistically significant
Riezler, Stefan and Simianer, Patrick and Haas, Carolin
Experiments
Statistical significance is measured using Approximate Randomization (Noreen, 1989) where result differences with a p-value smaller than 0.05 are considered statistically significant.
Experiments
All result differences are statistically significant.
Experiments
However, the result differences between these two systems do not score as statistically significant.
statistically significant is mentioned in 7 sentences in this paper.
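The approximate randomization test named in the first sentence above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: it assumes each system has one score per test item and that the statistic of interest is the difference in total (equivalently, mean) score. The function name, the trial count, and the seed are hypothetical choices. For a corpus-level metric such as BLEU, the metric would have to be recomputed on each shuffled assignment instead of summing per-item scores.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided approximate randomization test on paired per-item scores.

    Assumption: the test statistic is the absolute difference of summed
    per-item scores. Returns an estimated p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    # Swapping the two systems' outputs on one item flips the sign of
    # that item's score difference, so we shuffle signs directly.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(shuffled) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)
```

Under the 0.05 threshold quoted above, a difference would be called significant when the returned value is below 0.05.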
Salloum, Wael and Elfardy, Heba and Alamir-Salloum, Linda and Habash, Nizar and Diab, Mona
Discussion and Error Analysis
However, our best system selection approach improves over MSA-Pivot by a small margin of 0.2% BLEU absolute only, albeit a statistically significant improvement.
MT System Selection
It improves over the best single system baseline (MSA-Pivot) by a statistically significant 0.5% BLEU.
MT System Selection
Improvements are statistically significant.
MT System Selection
The differences in BLEU are statistically significant.
Machine Translation Experiments
All differences in BLEU scores between the four systems are statistically significant above the 95% level.
Machine Translation Experiments
Statistical significance is computed using paired bootstrap re-sampling (Koehn, 2004).
statistically significant is mentioned in 6 sentences in this paper.
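Paired bootstrap resampling, as named in the last sentence above, can be sketched roughly as follows. This is an illustration under stated assumptions, not the cited authors' setup: it assumes one score per test sentence for each system and compares summed scores on each resample, whereas BLEU is a corpus-level metric that would need to be recomputed per resample. The function name, sample count, and seed are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A outscores system B.

    Assumption: per-sentence scores whose sum stands in for the metric.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        # Draw n sentence indices with replacement and score both
        # systems on the same resampled test set.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples
```

A returned fraction of at least 0.95 corresponds to the "above the 95% level" reading used in the sentences above.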
Zhu, Xiaodan and Guo, Hongyu and Mohammad, Saif and Kiritchenko, Svetlana
Conclusions
... performance statistically significantly.
Experimental results
Models marked with an asterisk (*) are statistically significantly better than the random baseline.
Experimental results
Double asterisks (**) indicate a statistically significant difference from model (6), and the model with the double dagger (‡) is significantly better than model (7).
Experimental results
We can observe statistically significant differences of shifting abilities between many negator pairs such as that between “is_never” and “do_not” as well as between “does_not” and “can_not”.
Negation models based on heuristics
We will show that this simple modification improves the fitting performance statistically significantly.
Negation models based on heuristics
We will show that this model also statistically significantly outperforms the basic shifting without overfitting, although the number of parameters has increased.
statistically significant is mentioned in 6 sentences in this paper.
Bhat, Suma and Xue, Huichao and Yoon, Su-Youn
Conclusions
Including the measure of syntactic complexity in an automatic scoring model resulted in statistically significant performance gains over the state-of-the-art.
Experimental Results
The correlation was approximately 0.1 higher in absolute value than that of cos4, which was the best performing feature in the VSM-based model, and the difference is statistically significant.
Experimental Results
We note that the performance gain of Base+mescore over Base as well as over Base + cos4 is statistically significant at level = 0.01.
Experimental Results
The performance gain of Base+cos4 over Base, however, is not statistically significant at level = 0.01.
Introduction
In addition, including our proposed measure of syntactic complexity in an automatic scoring model results in a statistically significant performance gain over the state-of-the-art.
statistically significant is mentioned in 5 sentences in this paper.
Candito, Marie and Constant, Matthieu
Experiments
To evaluate statistical significance of parsing performance differences, we use eval07.pl with the -b option, and then Dan Bikel’s comparator. For MWEs, we use the F-measure for recognition of untagged MWEs (hereafter FUM) and for recognition of tagged MWEs (hereafter FTM).
Experiments
For each architecture except the PIPELINE one, differences between the baseline and the best setting are statistically significant (p < 0.01).
Experiments
Best JOINT has a statistically significant difference (p < 0.01) over both best JOINT-REG and best PIPELINE.
statistically significant is mentioned in 5 sentences in this paper.
Jansen, Peter and Surdeanu, Mihai and Clark, Peter
CR + LS + DMM + DPM 39.32* +24% 47.86* +20%
The transferred models always outperform the baselines, but only the ensemble model’s improvement is statistically significant.
CR + LS + DMM + DPM 39.32* +24% 47.86* +20%
The results of the transferred models that include LS features are slightly lower, but still approach statistical significance for P@1 and are significant for MRR.
CR + LS + DMM + DPM 39.32* +24% 47.86* +20%
† indicates approaching statistical significance with p = 0.07 or 0.06.
Introduction
Our results show statistically significant improvements of up to 24% on top of state-of-the-art LS models (Yih et al., 2013).
statistically significant is mentioned in 4 sentences in this paper.
Farra, Noura and Tomeh, Nadi and Rozovskaya, Alla and Habash, Nizar
Experiments
Rows marked with an asterisk (*) are statistically significant compared to CEC (for the first half of the table) or CEC+MLE (for the second half of the table), with p < 0.05.
Experiments
In fact, CEC+MLE and GSEC+MLE perform similarly (p = 0.36, not statistically significant).
Experiments
The two results are statistically significant (p < 0.0001) with respect to CEC and CEC+MLE respectively.
statistically significant is mentioned in 3 sentences in this paper.
Lau, Jey Han and Cook, Paul and McCarthy, Diana and Gella, Spandana and Baldwin, Timothy
WordNet Experiments
Based on the McNemar’s Test with Yates correction for continuity, MKWC is significantly better over BNC and HDP-WSI is significantly better over FINANCE (p < 0.0001 in both cases), but the difference over SPORTS is not statistically significant (p > 0.1).
WordNet Experiments
Testing for statistical significance over the paired JS divergence values for each lemma using the Wilcoxon signed-rank test, the result for FINANCE is significant (p < 0.05) but the results for the other two datasets are not (p > 0.1 in each case).
WordNet Experiments
To summarise, the results for MKWC and HDP-WSI are fairly even for predominant sense learning (each outperforms the other at a level of statistical significance over one dataset), but HDP-WSI is better at inducing the overall sense distribution.
statistically significant is mentioned in 3 sentences in this paper.
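McNemar's test with Yates' continuity correction, mentioned in the first sentence above, compares two systems on the same items using only the items where they disagree. The sketch below is a generic textbook version, not the cited authors' code. The function name and the argument names b and c (counts of items that only the first or only the second system got right) are hypothetical. The statistic is clamped at zero when the two counts are equal, which is a choice made here for illustration.

```python
import math


def mcnemar_yates(b, c):
    """McNemar's chi-square test with Yates' continuity correction.

    b: items only system A got right; c: items only system B got right.
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    if b + c == 0:
        # No disagreements, so there is no evidence of a difference.
        return 0.0, 1.0
    # Clamp at zero so equal counts do not yield a spurious statistic.
    corrected = max(abs(b - c) - 1, 0)
    chi2 = corrected ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

For example, with 40 items won only by the first system and 10 won only by the second, the statistic is (|40 - 10| - 1)^2 / 50 = 16.82.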
Mehdad, Yashar and Carenini, Giuseppe and Ng, Raymond T.
Experimental Setup
Abstractive vs. Extractive: our full query-based abstractive summarization system shows statistically significant improvements over baselines
Experimental Setup
The statistical significance tests were calculated by approximate randomization, as described in (Yeh, 2000).
Introduction
Automatic evaluation on the chat dataset and manual evaluation over the meetings and emails show that our system uniformly and statistically significantly outperforms baseline systems, as well as a state-of-the-art query-based extractive summarization system.
statistically significant is mentioned in 3 sentences in this paper.
Narayan, Shashi and Gardent, Claire
Experiments
Pairwise comparisons between all models and their statistical significance were carried out using a one-way ANOVA with post-hoc Tukey HSD tests and are shown in Table 6.
Experiments
With regard to simplification, our system ranks first and is very close to the manually simplified input (the difference is not statistically significant).
Related Work
No human evaluation is provided but the approach is shown to result in statistically significant improvements over a traditional phrase based approach.
statistically significant is mentioned in 3 sentences in this paper.
Turchi, Marco and Anastasopoulos, Antonios and C. de Souza, José G. and Negri, Matteo
Experiments with CAT data
These results (MAE reductions are always statistically significant) suggest that, when dealing with datasets with very different label distributions, the evident limitations of batch methods are more easily overcome by learning from scratch from the feedback of a new post-editor.
Experiments with WMT12 data
Results marked with the “*” symbol are NOT statistically significant compared to the corresponding batch model.
Experiments with WMT12 data
The others are always statistically significant at p ≤ 0.005, calculated with approximate randomization (Yeh, 2000).
statistically significant is mentioned in 3 sentences in this paper.