Index of papers in Proc. ACL 2014 that mention "BLEU scores"
Duan, Manjuan and White, Michael
Abstract
Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make.
Abstract
However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved.
Introduction
With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model.
Introduction
With the SVM reranker, we obtain a significant improvement in BLEU scores over the averaged perceptron model.
Introduction
Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases).
Reranking with SVMs 4.1 Methods
In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments.
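The pairwise setup described here can be sketched in a few lines: candidate realizations are scored against the reference with sentence-level BLEU, and every pair with distinct scores yields a (better, worse) training example for the ranker. The sketch below uses NLTK's BLEU implementation; the function name and smoothing choice are illustrative assumptions, not the paper's actual code.

    # Sketch: deriving pairwise preference data from BLEU scores.
    from itertools import combinations
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    smooth = SmoothingFunction().method1  # assumption: sentence-level BLEU needs smoothing

    def preference_pairs(reference, candidates):
        """Score candidates against the reference and emit (better, worse) pairs."""
        scored = [(sentence_bleu([reference.split()], c.split(),
                                 smoothing_function=smooth), c)
                  for c in candidates]
        pairs = []
        for (s_a, a), (s_b, b) in combinations(scored, 2):
            if s_a != s_b:
                pairs.append((a, b) if s_a > s_b else (b, a))
        return pairs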
Reranking with SVMs 4.1 Methods
The complete model, BBS+dep+nbest, achieved a BLEU score of 88.73, significantly improving upon the perceptron model (p < 0.02).
Simple Reranking
Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations
Simple Reranking
Simple ranking of the generative model's n-best realizations with the Berkeley parser raised the BLEU score from 85.55 to 86.07, still well below the averaged perceptron model's BLEU score of 87.93.
BLEU scores are mentioned in 16 sentences in this paper.
Yan, Rui and Gao, Mingkun and Pavlick, Ellie and Callison-Burch, Chris
Evaluation
In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations.
Evaluation
This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators.
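A minimal sketch of this multi-reference evaluation setup, using NLTK's corpus_bleu (the toy data below is illustrative; the actual experiments score against four sets of three reference translations):

    from nltk.translate.bleu_score import corpus_bleu

    # Toy data: one hypothesis segment with three reference translations.
    hypotheses = ["the cat sat on the mat".split()]
    references = [[
        "the cat sat on the mat".split(),
        "a cat was sitting on the mat".split(),
        "the cat is on the mat".split(),
    ]]

    score = corpus_bleu(references, hypotheses)  # BLEU-4 by default
    print(f"BLEU: {100 * score:.2f}")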
Evaluation
As expected, random selection performs poorly, with a BLEU score of 30.52.
BLEU scores are mentioned in 7 sentences in this paper.
Cai, Jingsheng and Utiyama, Masao and Sumita, Eiichiro and Zhang, Yujie
Abstract
We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data.
Conclusion
The results showed that our approach achieved a BLEU score gain of 1.61.
Dependency-based Pre-ordering Rule Set
In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set.
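One plausible reading of this filtering procedure is a greedy loop over candidate rules, keeping a rule only if it raises development-set BLEU. A minimal sketch under that assumption, where decode_and_score is a hypothetical stand-in for running the pre-ordering and MT pipeline on the development set:

    # Sketch: greedy filtering of candidate pre-ordering rules by dev-set BLEU.
    def filter_rules(candidate_rules, decode_and_score):
        kept = []
        baseline = decode_and_score(kept)  # BLEU with no candidate rules applied
        for rule in candidate_rules:
            score = decode_and_score(kept + [rule])
            if score > baseline:  # keep only rules that help on the dev set
                kept.append(rule)
                baseline = score
        return kept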
Experiments
For evaluation, we used BLEU scores (Papineni et al., 2002).
Experiments
It shows the BLEU scores on the test set and the statistics of pre-ordering on the training set, including the total count of each rule set and the number of sentences they were applied to.
Introduction
Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61.
BLEU scores are mentioned in 6 sentences in this paper.
Mason, Rebecca and Charniak, Eugene
Our Approach
Figure 1: BLEU scores vs. k for SumBasic extraction.
Our Approach
As shown in Figure 1, our system’s BLEU scores increase rapidly until about k = 25.
BLEU scores are mentioned in 6 sentences in this paper.
Salloum, Wael and Elfardy, Heba and Alamir-Salloum, Linda and Habash, Nizar and Diab, Mona
Conclusion and Future Work
We plan to give different weights to different training examples based on the drop in BLEU score that an example would cause if classified incorrectly.
MT System Selection
We run the 5,562 sentences of the classification training data through our four MT systems and produce sentence-level BLEU scores (with length penalty).
MT System Selection
We pick the name of the MT system with the highest BLEU score as the class label for that sentence.
MT System Selection
When there is a tie in BLEU scores, we pick, from the tied systems, the label of the one with the better overall BLEU score.
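Taken together, the labeling scheme is: score each system's output with sentence-level BLEU, take the argmax, and break ties by overall system BLEU. A minimal sketch (the dictionary names are illustrative, not from the paper's code):

    # Sketch: assigning a class label (an MT system name) to one sentence.
    def label_sentence(sent_bleu, overall_bleu):
        """sent_bleu:    {system: sentence-level BLEU for this sentence}
        overall_bleu: {system: corpus-level BLEU of that system overall}"""
        best = max(sent_bleu.values())
        tied = [s for s, b in sent_bleu.items() if b == best]
        # Tie-break in favor of the system that is better overall.
        return max(tied, key=lambda s: overall_bleu[s])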
Machine Translation Experiments
All differences in BLEU scores between the four systems are statistically significant above the 95% level.
Machine Translation Experiments
We also report in Table 1 an oracle system selection where we pick, for each sentence, the English translation that yields the best BLEU score.
BLEU scores are mentioned in 6 sentences in this paper.
Gyawali, Bikash and Gardent, Claire
Conclusion
We observed that this often fails to return the best output in terms of BLEU score, fluency, grammaticality and/or meaning.
Results and Discussion
Figure 6: BLEU scores and Grammar Size (Number of Elementary TAG trees)
Results and Discussion
The average BLEU score is given with respect to all input (All) and to those inputs for which the systems generate at least one sentence (Covered).
Results and Discussion
In terms of BLEU score, the best version of our system (AUTEXP) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results similar to the fully handcrafted UDEL system.
BLEU scores are mentioned in 5 sentences in this paper.
Narayan, Shashi and Gardent, Claire
Experiments
BLEU score: We used the Moses support tool multi-bleu to calculate BLEU scores.
Experiments
The BLEU scores in Table 4 show that our system produces simplifications that are closest to the reference.
Experiments
In sum, the automatic metrics indicate that our system produces simplifications that are consistently closest to the reference in terms of edit distance, number of splits, and BLEU score.
Related Work
(2010), namely an aligned corpus of 100/131 EWKP/SWKP sentences, and show that they achieve a better BLEU score.
BLEU scores are mentioned in 4 sentences in this paper.
Tu, Mei and Zhou, Yu and Zong, Chengqing
Experiments
In Table 3, almost all BLEU scores improve, regardless of which strategy is used.
Experiments
The final BLEU scores on NIST05 and NIST06 are given in Table 4.
Experiments
BLEU scores on the large-scale training data.
BLEU scores are mentioned in 3 sentences in this paper.
van Gompel, Maarten and van den Bosch, Antal
Experiments & Results
The BLEU scores, not included in the figure but shown in Table 2, show a similar trend.
Experiments & Results
Statistical significance on the BLEU scores was tested using pairwise bootstrap sampling (Koehn, 2004).
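For reference, pairwise bootstrap sampling (Koehn, 2004) repeatedly resamples test segments with replacement and counts how often one system outscores the other. A minimal sketch, where corpus_bleu_from_segments is a hypothetical helper that recomputes corpus BLEU over a resampled set of segments:

    import random

    # Sketch: paired bootstrap resampling for BLEU significance (Koehn, 2004).
    def paired_bootstrap(segs_a, segs_b, refs, corpus_bleu_from_segments,
                         n_samples=1000):
        n = len(refs)
        wins = 0
        for _ in range(n_samples):
            idx = [random.randrange(n) for _ in range(n)]  # sample with replacement
            a = corpus_bleu_from_segments([segs_a[i] for i in idx],
                                          [refs[i] for i in idx])
            b = corpus_bleu_from_segments([segs_b[i] for i in idx],
                                          [refs[i] for i in idx])
            if a > b:
                wins += 1
        return wins / n_samples  # fraction of samples where system A beats system B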
Experiments & Results
Another discrepancy is found in the BLEU scores of the English→Chinese experiments, where we measure an unexpected drop in BLEU score below the baseline.
BLEU scores are mentioned in 3 sentences in this paper.
Wang, Xiaolin and Utiyama, Masao and Finch, Andrew and Sumita, Eiichiro
Complexity Analysis
It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings.
Complexity Analysis
Table 4 presents the BLEU scores for Moses using different segmentation methods.
Introduction
• improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
BLEU scores are mentioned in 3 sentences in this paper.