Index of papers in Proc. ACL 2011 that mention
  • BLEU
Chen, David and Dolan, William
Experiments
As more training pairs are used, the model produces more varied sentences (PINC) but preserves the meaning less well (BLEU).
Experiments
As a comparison, evaluating each human description as a paraphrase for the other descriptions in the same cluster resulted in a BLEU score of 52.9 and a PINC score of 77.2.
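The PINC score quoted above rewards lexical dissimilarity from the source sentence, while BLEU against the other descriptions rewards meaning preservation. Below is a minimal sketch of a PINC-style score, assuming the usual definition (one minus the fraction of candidate n-grams that also occur in the source, averaged over n-gram orders up to 4); the function names and example sentences are illustrative, not taken from the paper.

```python
# Minimal sketch of a PINC-style score: lexical dissimilarity between a
# source sentence and a candidate paraphrase, averaged over n-gram orders
# 1..4. Names and data are illustrative placeholders.

def ngram_set(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """One minus the fraction of candidate n-grams also found in the
    source, averaged over n = 1..max_n (higher = more dissimilar)."""
    src, cand = source.split(), candidate.split()
    per_order = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_set(cand, n)
        if not cand_ngrams:
            continue  # candidate shorter than n
        overlap = len(cand_ngrams & ngram_set(src, n))
        per_order.append(1.0 - overlap / len(cand_ngrams))
    return sum(per_order) / len(per_order) if per_order else 0.0

print(pinc("a man is playing a guitar", "someone strums a guitar"))
```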
Introduction
In addition to the lack of standard datasets for training and testing, there are also no standard metrics like BLEU (Papineni et al., 2002) for evaluating paraphrase systems.
Paraphrase Evaluation Metrics
One of the limitations to the development of machine paraphrasing is the lack of standard metrics like BLEU , which has played a crucial role in driving progress in MT.
Paraphrase Evaluation Metrics
Thus, researchers have been unable to rely on BLEU or some derivative: the optimal paraphrasing engine under these terms would be one that simply returns the input.
Paraphrase Evaluation Metrics
To measure semantic equivalence, we simply use BLEU with multiple references.
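For illustration, a minimal sketch of sentence-level BLEU computed against multiple references with NLTK; the smoothing choice, tokenization, and example sentences are assumptions, not details reported in the paper.

```python
# Sentence-level BLEU with multiple references, using NLTK.
# Example sentences and the smoothing method are placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is playing a guitar".split(),
    "someone is strumming a guitar".split(),
    "a person plays the guitar".split(),
]
candidate = "a man plays the guitar".split()

score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"multi-reference BLEU: {score:.3f}")
```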
BLEU is mentioned in 28 sentences in this paper.
Topics mentioned in this paper:
Clifton, Ann and Sarkar, Anoop
Experimental Results
All the BLEU scores reported are for lowercase evaluation.
Experimental Results
m-BLEU indicates that the segmented output was evaluated against a segmented version of the reference (this measure does not have the same correlation with human judgement as BLEU).
Experimental Results
No Uni indicates the segmented BLEU score without unigrams.
Models 2.1 Baseline Models
performance of unsupervised segmentation for translation, our third baseline is a segmented translation model based on a supervised segmentation model (called Sup), using the hand-built Omorfi morphological analyzer (Pirinen and Listenmaa, 2007), which provided slightly higher BLEU scores than the word-based baseline.
Translation and Morphology
Automatic evaluation measures for MT, such as BLEU (Papineni et al., 2002), WER (Word Error Rate) and PER (Position-Independent Word Error Rate), use the word as the basic unit rather than the morpheme.
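For reference, a small sketch of the word-level WER and PER measures named above; PER formulations differ slightly across toolkits, so this is one common variant rather than necessarily the one used in the paper.

```python
# Word-level WER (edit distance / reference length) and PER (errors counted
# on bags of words, ignoring order). One common formulation; example data
# are placeholders.
from collections import Counter

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # DP row for edit distance
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution / match
    return d[len(hyp)] / len(ref)

def per(reference, hypothesis):
    """Position-independent Error Rate: bag-of-words mismatches / reference length."""
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    matches = sum((ref & hyp).values())
    errors = max(sum(ref.values()), sum(hyp.values())) - matches
    return errors / sum(ref.values())

print(wer("the cat sat on the mat", "the cat on mat sat"))
print(per("the cat sat on the mat", "the cat on mat sat"))
```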
Translation and Morphology
Our proposed approaches are significantly better than the state of the art, achieving the highest reported BLEU scores on the English-Finnish Europarl version 3 dataset.
BLEU is mentioned in 17 sentences in this paper.
Topics mentioned in this paper:
Tan, Ming and Zhou, Wenli and Zheng, Lei and Wang, Shaojun
Abstract
The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
Experimental results
We substitute our language model and use MERT (Och, 2003) to optimize the BLEU score (Papineni et al., 2002).
Experimental results
We partition the data into ten pieces: nine pieces are used as training data to optimize the BLEU score (Papineni et al., 2002) by MERT (Och, 2003), and the remaining piece is used to re-rank the 1000-best list and obtain the BLEU score.
Introduction
When we apply our language models to the task of re-ranking the N-best list from Hiero (Chiang, 2005; Chiang, 2007), a state-of-the-art parsing-based MT system, we achieve significantly better translation quality measured by the BLEU score and “readability”.
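The re-ranking described in these excerpts scores each N-best hypothesis with a weighted combination of model features (the language model contributing one feature) and returns the highest-scoring candidate. A generic sketch with placeholder feature names, weights, and hypotheses, not the paper's actual model:

```python
# Generic N-best re-ranking with a log-linear (weighted-sum) model.
# Feature names, weights, and hypotheses are illustrative placeholders;
# in practice the weights would be tuned toward BLEU, e.g. by MERT.
def rerank_nbest(nbest, weights):
    """nbest: list of (hypothesis, feature_dict); return the best-scoring pair."""
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return max(nbest, key=lambda item: score(item[1]))

nbest = [
    ("the parliament passed the bill", {"tm": -2.1, "lm": -3.4, "wp": 5}),
    ("the bill was passed parliament", {"tm": -1.8, "lm": -5.2, "wp": 5}),
]
weights = {"tm": 1.0, "lm": 0.7, "wp": -0.1}
print(rerank_nbest(nbest, weights)[0])
```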
BLEU is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Zhao, Bing and Lee, Young-Suk and Luo, Xiaoqiang and Li, Liu
Abstract
The syntax-based translation system integrating the proposed techniques outperforms the best Arabic-English unconstrained system in NIST-08 evaluations by 1.3 absolute BLEU, which is statistically significant.
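Significance of a corpus-level BLEU difference such as the one above is commonly assessed with paired bootstrap resampling (Koehn, 2004). The sketch below is a generic version using NLTK's corpus_bleu; the function name, data layout, and sample count are assumptions, and this is not necessarily the authors' own test.

```python
# Paired bootstrap resampling for BLEU differences (Koehn, 2004), generic
# sketch. refs/hyps_a/hyps_b are equal-length lists of token lists.
import random
from nltk.translate.bleu_score import corpus_bleu

def bootstrap_bleu_pvalue(refs, hyps_a, hyps_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A does NOT beat system B."""
    rng = random.Random(seed)
    ids = list(range(len(refs)))
    losses = 0
    for _ in range(n_samples):
        sample = [rng.choice(ids) for _ in ids]    # resample with replacement
        sample_refs = [[refs[i]] for i in sample]  # one reference per segment
        bleu_a = corpus_bleu(sample_refs, [hyps_a[i] for i in sample])
        bleu_b = corpus_bleu(sample_refs, [hyps_b[i] for i in sample])
        losses += bleu_a <= bleu_b
    return losses / n_samples

# Usage with pre-tokenized data (A is the improved system, B the baseline):
# p = bootstrap_bleu_pvalue(refs, hyps_improved, hyps_baseline)
```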
Experiments
We use BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) to evaluate translation qualities.
Experiments
and we achieved a BLEUr4n4 of 55.01 for MT08-NW, or a cased BLEU of 53.31, which is close to the best officially reported result of 53.85 for unconstrained systems. We expose the statistical decisions in Eqn. 3 as an additional cost; the translation results in Table 11 show that this helps BLEU by 0.29 points (56.13 vs.
BLEU is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Wu, Xianchao and Matsuzaki, Takuya and Tsujii, Jun'ichi
Abstract
Extensive experiments involving large-scale English-to-Japanese translation revealed a significant improvement of 1.8 points in BLEU score, as compared with a strong forest-to-string baseline system.
Conclusion
Extensive experiments on large-scale English-to-Japanese translation resulted in a significant improvement in BLEU score of 1.8 points (p < 0.01), as compared with our implementation of a strong forest-to-string baseline system (Mi et al., 2008; Mi and Huang, 2008).
Experiments
Table excerpt: BLEU (%) 26.15 / 27.07 / 27.93 / 28.89.
Experiments
Here, fw denotes function word, DT denotes the decoding time, and the BLEU scores were computed on the test set.
Experiments
the final BLEU scores of C3-T with Min-F and C3-F.
Introduction
(2008) achieved a 3.1-point improvement in BLEU score (Papineni et al., 2002) by including bilingual syntactic phrases in their forest-based system.
Introduction
Using the composed rules of the present study in a baseline forest-to-string translation system results in a 1.8-point improvement in the BLEU score for large-scale English-to-Japanese translation.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Fang, Licheng and Xu, Peng and Wu, Xiaoyun
Abstract
Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system).
Experiments
To evaluate the translation results, we use BLEU (Papineni et al., 2002).
Experiments
On the English-Chinese data set, the improvement over the phrase-based system is 1.3 BLEU points, and 0.8 over the hierarchical phrase-based system.
Experiments
In the tasks of translating to European languages, the improvements over the phrase-based baseline are in the range of 0.5 to 1.0 BLEU points, and 0.3 to 0.5 over the hierarchical phrase-based system.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Lo, Chi-kiu and Wu, Dekai
Abstract
As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU , which fail to properly evaluate adequacy, become more apparent.
Abstract
We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the nonautomatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacy judgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER.
Abstract
We argue that BLEU (Papineni et al., 2002) and other automatic n- gram based MT evaluation metrics do not adequately capture the similarity in meaning between the machine translation and the reference translation—which, ultimately, is essential for MT output to be useful.
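The sentence-level correlations with human adequacy judgments quoted in these excerpts (0.43 for HMEANT versus 0.20 for BLEU) can be computed with a rank correlation statistic. A sketch using Kendall's tau from SciPy with placeholder data; this does not reproduce the paper's exact protocol.

```python
# Sentence-level correlation between metric scores and human adequacy
# judgments, illustrated with Kendall's tau. The data are placeholders.
from scipy.stats import kendalltau

human_adequacy = [3, 1, 4, 2, 5, 2, 4]   # one human judgment per sentence
metric_scores  = [0.41, 0.18, 0.52, 0.27, 0.66, 0.31, 0.49]

tau, p_value = kendalltau(metric_scores, human_adequacy)
print(f"sentence-level Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```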
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Zarriess, Sina and Cahill, Aoife and Kuhn, Jonas
Conclusion
This strategy leads to a better balanced distribution of the alternations in the training data, such that our linguistically informed generation ranking model achieves high BLEU scores and accurately predicts active and passive.
Experimental Setup
Table excerpt: Match 15.45 / 15.04 / 11.89; LM BLEU 0.68 / 0.68 / 0.65.
Experimental Setup
Table excerpt: Model BLEU 0.764 / 0.759 / 0.747; NIST 13.18 / 13.14 / 13.01.
Experimental Setup
We use several standard measures: a) exact match: how often the model selects the original corpus sentence, b) BLEU: n-gram overlap between the top-ranked and the original sentence, c) NIST: a modification of BLEU giving more weight to less frequent n-grams.
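A minimal sketch of the three measures just described, using NLTK's sentence-level BLEU and NIST implementations; the original sentence, the top-ranked candidate, and the smoothing choice are placeholders, not values from the paper.

```python
# Exact match, BLEU, and NIST for one top-ranked generation candidate
# against the original corpus sentence. Example data are placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.nist_score import sentence_nist

original   = "the bill was passed by the parliament".split()
top_ranked = "the parliament passed the bill".split()

exact_match = top_ranked == original
bleu = sentence_bleu([original], top_ranked,
                     smoothing_function=SmoothingFunction().method1)
nist = sentence_nist([original], top_ranked, n=4)  # NIST up-weights rarer n-grams

print(exact_match, round(bleu, 3), round(nist, 3))
```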
Experiments
The differences in BLEU between the candidate sets and models are
Experiments
Its BLEU score and match accuracy decrease only slightly (though statistically significantly).
Experiments
Table header excerpt: Features | Match BLEU | Voice Prec.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Neubig, Graham and Watanabe, Taro and Sumita, Eiichiro and Mori, Shinsuke and Kawahara, Tatsuya
Experimental Evaluation
For most models, while likelihood continued to increase gradually for all 100 iterations, BLEU score gains plateaued after 5-10 iterations, likely due to the strong prior information
Experimental Evaluation
It can also be seen that combining phrase tables from multiple samples improved the BLEU score for HLEN, but not for HIER.
Flat ITG Model
The average gain across all data sets was approximately 0.8 BLEU points.
Hierarchical ITG Model
(2003) that using phrases where max(|e|, |f|) ≤ 3 causes significant improvements in BLEU score, while using larger phrases results in diminishing returns.
Introduction
We also find that it achieves superior BLEU scores over previously proposed ITG-based phrase alignment approaches.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Knight, Kevin
Machine Translation as a Decipherment Task
Evaluation: All the MT systems are run on the Spanish test data and the quality of the resulting English translations is evaluated using two different measures: (1) Normalized edit distance score (Navarro, 2001), and (2) BLEU (Papineni et al., 2002).
Machine Translation as a Decipherment Task
The figure also shows the corresponding BLEU scores in parentheses for comparison (higher scores indicate better MT output).
Machine Translation as a Decipherment Task
Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Mylonakis, Markos and Sima'an, Khalil
Abstract
We obtain statistically significant improvements across 4 different language pairs with English as source, mounting up to +1.92 BLEU for Chinese as target.
Experiments
Our system (its) outperforms the baseline for all 4 language pairs for both BLEU and NIST scores, by a margin which scales up to +1.92 BLEU points for English to Chinese translation when training on the 400K set.
Experiments
BLEU scores for 200K and 400K training sentence pairs.
Experiments
Notably, as can be seen in Table 2(b), switching to a 4-gram LM results in performance gains for both the baseline and our system and while the margin between the two systems decreases, our system continues to deliver a considerable and significant improvement in translation BLEU scores.
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Han, Bo and Baldwin, Timothy
Conclusion and Future Work
In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
Experiments
The 10-fold cross-validated BLEU score (Papineni et al., 2002) over this data is 0.81.
Experiments
Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Schwartz, Lane and Callison-Burch, Chris and Schuler, William and Wu, Stephen
Abstract
We present empirical results on a constrained Urdu-English translation task that demonstrate a significant BLEU score improvement and a large decrease in perplexity.
Related Work
Figure 9 shows a statistically significant improvement to the BLEU score when using the HHMM and the n-gram LMs together on this reduced test set.
Related Work
Table header excerpt: Moses LM(s) | BLEU.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zollmann, Andreas and Vogel, Stephan
Experiments
Unfortunately, variance in development set BLEU scores tends to be higher than in test set scores, despite SAMT MERT’s inbuilt algorithms to overcome local optima, such as random restarts and zeroing-out.
Experiments
We have noticed that using an L0-penalized BLEU score (footnote 5) as MERT’s objective on the merged n-best lists over all iterations is more stable and will therefore use this score to determine N.
Experiments
Footnote 5: Given by BLEU − 5 × |{i ∈ {1, …
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: