Experimental Results | All the BLEU scores reported are for lowercase evaluation. |
Experimental Results | m-BLEU indicates that the segmented output was evaluated against a segmented version of the reference (this measure does not have the same correlation with human judgement as BLEU).
Experimental Results | No Uni indicates the segmented BLEU score without unigrams.
Models 2.1 Baseline Models | performance of unsupervised segmentation for translation, our third baseline is a segmented translation model based on a supervised segmentation model (called Sup), using the hand-built Omorfi morphological analyzer (Pirinen and Listenmaa, 2007), which provided slightly higher BLEU scores than the word-based baseline.
Translation and Morphology | Automatic evaluation measures for MT, BLEU (Papineni et al., 2002), WER (Word Error Rate) and PER (Position Independent Word Error Rate) use the word as the basic unit rather than morphemes. |
Translation and Morphology | Our proposed approaches are significantly better than the state of the art, achieving the highest reported BLEU scores on the English-Finnish Europarl version 3 dataset. |
Abstract | BLEU , TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality. |
Experiments | As metrics we use BLEU and RIBES (which demonstrated good human correlation in this language pair (Goto et al., 2011)). |
Experiments | As metrics we use BLEU and NTER. |
Experiments | BLEU = BP × (∏_{n=1}^{4} prec_n)^{1/4}.
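The formula above is the standard BLEU combination of a brevity penalty with the geometric mean of 1- to 4-gram modified precisions. A minimal sentence-level sketch, with simple add-one smoothing (an illustrative choice, not taken from any of the cited papers):

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """BP times the geometric mean of modified 1..max_n-gram precisions."""
    c, r = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(c, n), ngrams(r, n)
        # Candidate n-gram counts are clipped by the reference counts.
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        total = sum(cand.values())
        # Add-one smoothing so one empty n-gram order does not zero the score.
        log_prec += log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(c) >= len(r) else exp(1 - len(r) / max(len(c), 1))
    return bp * exp(log_prec)
```

A candidate identical to its reference scores 1.0; shortening the candidate triggers the brevity penalty.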
Introduction | These methods are effective because they tune the system to maximize an automatic evaluation metric such as BLEU, which serves as a surrogate objective for translation quality.
Introduction | However, we know that a single metric such as BLEU is not enough. |
Introduction | For example, while BLEU (Papineni et al., 2002) focuses on word-based n-gram precision, METEOR (Lavie and Agarwal, 2007) allows for stem/synonym matching and incorporates recall. |
Multi-objective Algorithms | If we had used BLEU scores rather than the {0,1} labels in line 8, the entire PMO-PRO algorithm would revert to single-objective PRO. |
Theory of Pareto Optimality 2.1 Definitions and Concepts | For example, suppose K = 2, M1(h) computes the BLEU score, and M2(h) gives the METEOR score of h. Figure 1 illustrates the set of vectors {M(h)} in a 10-best list.
Abstract | Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU . |
Abstract | In principle, tuning on these metrics should yield better systems than tuning on BLEU . |
Abstract | It has a better correlation with human judgment than BLEU . |
Introduction | BLEU (Papineni et al., 2002), NIST (Doddington, 2002), WER, PER, TER (Snover et al., 2006), and LRscore (Birch and Osborne, 2011) do not use external linguistic
Introduction | Among these metrics, BLEU is the most widely used for both evaluation and tuning. |
Introduction | Many of the metrics correlate better with human judgments of translation quality than BLEU , as shown in recent WMT Evaluation Task reports (Callison-Burch et |
Abstract | The syntax-based translation system integrating the proposed techniques outperforms the best Arabic-English unconstrained system in NIST-08 evaluations by 1.3 absolute BLEU, which is statistically significant.
Experiments | We use BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) to evaluate translation qualities. |
Experiments | and we achieved a BLEUr4n4 score of 55.01 for MT08-NW, or a cased BLEU of 53.31, which is close to the best officially reported result of 53.85 for unconstrained systems. We expose the statistical decisions in Eqn. 3 as additional cost; the translation results in Table 11 show it helps BLEU by 0.29 BLEU points (56.13 vs.
Abstract | The large scale distributed composite language model gives a drastic perplexity reduction over n-grams and achieves significantly better translation quality, measured by the BLEU score and “readability”, when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
Experimental results | We substitute our language model and use MERT (Och, 2003) to optimize the BLEU score (Papineni et al., 2002). |
Experimental results | We partition the data into ten pieces: nine are used as training data to optimize the BLEU score (Papineni et al., 2002) by MERT (Och, 2003), and the remaining piece is used to re-rank the 1000-best list and obtain the BLEU score.
Introduction | ply our language models to the task of re-ranking the N-best list from Hiero (Chiang, 2005; Chiang, 2007), a state-of-the-art parsing-based MT system, we achieve significantly better translation quality measured by the BLEU score and “readability”. |
Abstract | On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM.
Abstract | The NNJM features also produce a gain of +6.3 BLEU on top of a simpler baseline equivalent to Chiang’s (2007) original Hiero implementation.
Introduction | Additionally, we present several variations of this model which provide significant additive BLEU gains. |
Introduction | The NNJM features produce an improvement of +3.0 BLEU on top of a baseline that is already better than the 1st place MT12 result and includes
Introduction | Additionally, on top of a simpler decoder equivalent to Chiang’s (2007) original Hiero implementation, our NNJM features are able to produce an improvement of +6.3 BLEU, as much as all of the other features in our strong baseline system combined.
Model Variations | OpenMT12 1st Place: Ar-En BLEU 49.5, Ch-En BLEU 32.6.
Model Variations | BLEU scores are mixed-case. |
Model Variations | On Arabic-English, the primary S2T/L2R NNJM gains +1.4 BLEU on top of our baseline, while the S2T NNLTM gains another +0.8, and the directional variations gain +0.8 BLEU more.
Neural Network Joint Model (NNJM) | We demonstrate in Section 6.6 that using one hidden layer instead of two has minimal effect on BLEU.
Neural Network Joint Model (NNJM) | We demonstrate in Section 6.6 that using the self-normalized/pre-computed NNJM results in only a very small BLEU degradation compared to the standard NNJM.
Abstract | Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make. |
Abstract | However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved. |
Introduction | With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model. |
Introduction | With the SVM reranker, we obtain a significant improvement in BLEU scores over |
Introduction | Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases). |
Reranking with SVMs 4.1 Methods | In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments. |
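The pairwise setup described above can be sketched as follows; the function name and the optional margin parameter are illustrative assumptions, not the authors' implementation:

```python
def preference_pairs(candidates, bleu_scores, margin=0.0):
    """Order candidate realizations by sentence BLEU against the reference.

    Returns (better_index, worse_index) pairs suitable as training
    examples for a pairwise ranker such as a ranking SVM.
    """
    pairs = []
    for i in range(len(candidates)):
        for j in range(len(candidates)):
            # Only emit a pair when the BLEU gap exceeds the margin.
            if bleu_scores[i] > bleu_scores[j] + margin:
                pairs.append((i, j))
    return pairs
```

Each pair then yields one constraint for the ranker: the feature vector of the better candidate should score higher than that of the worse one.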
Simple Reranking | Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations |
Simple Reranking | Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93. |
Simple Reranking | In sum, although simple ranking helps to avoid vicious ambiguity in some cases, the overall results of simple ranking are no better than the perceptron model (according to BLEU, at least), as parse failures that are not reflective of human interpretive tendencies too often lead the ranker to choose dispreferred realizations.
Abstract | The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements.
Abstract | We estimate the correlation of unigram and Smoothed BLEU , TER, ROUGE-SU4, and Meteor against human judgements on two data sets. |
Abstract | The main finding is that unigram BLEU has a weak correlation, and Meteor has the strongest correlation with human judgements. |
Introduction | The main finding of our analysis is that TER and unigram BLEU are weakly correlated against human judgements, ROUGE-SU4 and Smoothed BLEU are moderately correlated, and the strongest correlation is found with Meteor.
Methodology | BLEU measures the effective overlap between a reference sentence X and a candidate sentence Y. |
Methodology | BLEU = BP · exp(∑_{n=1}^{N} w_n log p_n)
Methodology | Unigram BLEU without a brevity penalty has been reported by Kulkarni et al.
Experimental Results | Group III: contains other important evaluation metrics, which were not considered in the WMT12 metrics task: NIST and ROUGE for both system- and segment-level, and BLEU and TER at segment-level. |
Experimental Results | II: TER .812 .836 .848; BLEU .810 .830 .846.
Experimental Results | We can see that DR is already competitive by itself: on average, it has a correlation of .807, very close to BLEU and TER scores (.810 and .812, respectively). |
Experimental Setup | To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms). |
Experimental Setup | Combination of five metrics based on lexical similarity: BLEU , NIST, METEOR-ex, ROUGE-W, and TERp-A. |
Related Work | A common argument, is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012). |
Related Work | For BLEU and TER, they observed improved correlation with human judgments on the MTC4 dataset when linearly interpolating these metrics with their lexical cohesion score. |
Abstract | Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation. |
Abstract | We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion. |
Abstract | Our best results improve a phrase-based statistical machine translation system trained on WMT 2012 French-English data by up to 2.0 BLEU, and the expected BLEU objective improves over a cross-entropy trained model by up to 0.6 BLEU in a single reference setup. |
Expected BLEU Training | The n-best lists serve as an approximation to S(f) used in the next step for expected BLEU training of the recurrent neural network model (§3.1).
Expected BLEU Training | 3.1 Expected BLEU Objective |
Expected BLEU Training | Formally, we define our loss function ℓ(θ) as the negative expected BLEU score, denoted as xBLEU(θ), for a given foreign sentence f:
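One way to realize this objective, sketched here under the assumption that sentence-level BLEU is precomputed for each n-best hypothesis and that the model defines a softmax distribution over the list (a simplification of the actual training setup):

```python
from math import exp

def expected_bleu(nbest_scores, nbest_bleus):
    """xBLEU: probability-weighted sentence BLEU over an n-best list.

    nbest_scores: unnormalized model log-scores for each hypothesis
    nbest_bleus:  precomputed sentence-level BLEU for each hypothesis
    """
    m = max(nbest_scores)
    # Softmax over the list, shifted by the max for numerical stability.
    weights = [exp(s - m) for s in nbest_scores]
    z = sum(weights)
    probs = [w / z for w in weights]
    return sum(p * b for p, b in zip(probs, nbest_bleus))

def xbleu_loss(nbest_scores, nbest_bleus):
    """The training loss is the negative expected BLEU."""
    return -expected_bleu(nbest_scores, nbest_bleus)
```

Gradients of this loss with respect to the model scores then push probability mass toward high-BLEU hypotheses.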
Introduction | The expected BLEU objective provides an efficient way of achieving this for machine translation (Rosti et al., 2010; Rosti et al., 2011; He and Deng, 2012; Gao and He, 2013; Gao et al., 2014) instead of solely relying on traditional optimizers such as Minimum Error Rate Training (MERT) that only adjust the weighting of entire component models within the log-linear framework of machine translation (§3). |
Introduction | We test the expected BLEU objective by training a recurrent neural network language model and obtain substantial improvements. |
Recurrent Neural Network LMs | time algorithm, which unrolls the network and then computes error gradients over multiple time steps (Rumelhart et al., 1986); we use the expected BLEU loss (§3) to obtain the error with respect to the output activations. |
Experiments | As more training pairs are used, the model produces more varied sentences (PINC) but preserves the meaning less well (BLEU).
Experiments | As a comparison, evaluating each human description as a paraphrase for the other descriptions in the same cluster resulted in a BLEU score of 52.9 and a PINC score of 77.2. |
Introduction | In addition to the lack of standard datasets for training and testing, there are also no standard metrics like BLEU (Papineni et al., 2002) for evaluating paraphrase systems. |
Paraphrase Evaluation Metrics | One of the limitations to the development of machine paraphrasing is the lack of standard metrics like BLEU , which has played a crucial role in driving progress in MT. |
Paraphrase Evaluation Metrics | Thus, researchers have been unable to rely on BLEU or some derivative: the optimal paraphrasing engine under these terms would be one that simply returns the input. |
Paraphrase Evaluation Metrics | To measure semantic equivalence, we simply use BLEU with multiple references. |
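Multi-reference BLEU clips each candidate n-gram count by the maximum count observed in any single reference. A small illustrative sketch of that clipping step (not the authors' code):

```python
from collections import Counter

def clipped_precision(candidate, references, n=1):
    """Modified n-gram precision with counts clipped per n-gram by the
    most generous single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.split())
    # Take, for each n-gram, the maximum count over all references.
    max_ref = Counter()
    for ref in references:
        for g, cnt in ngrams(ref.split()).items():
            max_ref[g] = max(max_ref[g], cnt)
    overlap = sum(min(cnt, max_ref[g]) for g, cnt in cand.items())
    return overlap / max(sum(cand.values()), 1)
```

With the classic example, the candidate "the the the" against the reference "the cat" gets unigram precision 1/3, since only one occurrence of "the" is credited.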
Background | As in other state-of-the-art SMT systems, BLEU is selected as the accuracy measure to define the error function used in MERT. |
Background | Since the weights of training samples are not taken into account in BLEU, we modify the original definition of BLEU to make it sensitive to the distribution Dt(i) over the training samples.
Background | The modified version of BLEU is called weighted BLEU (WBLEU) in this paper.
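A minimal sketch of one way such a weighted BLEU could look, assuming each sentence's n-gram counts and lengths are simply scaled by its sample weight Dt(i) before pooling; the exact WBLEU definition in the paper may differ:

```python
from collections import Counter
from math import exp, log

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def weighted_bleu(cands, refs, sample_weights, max_n=4):
    """Corpus BLEU with each sentence's counts scaled by its weight Dt(i)."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        overlap = total = 0.0
        for cand, ref, w in zip(cands, refs, sample_weights):
            cc = ngram_counts(cand.split(), n)
            rc = ngram_counts(ref.split(), n)
            # Scale clipped matches and candidate counts by the sample weight.
            overlap += w * sum(min(cnt, rc[g]) for g, cnt in cc.items())
            total += w * sum(cc.values())
        log_prec += log((overlap + 1) / (total + 1)) / max_n  # add-one smoothing
    # Brevity penalty on weighted lengths.
    c_len = sum(w * len(c.split()) for c, w in zip(cands, sample_weights))
    r_len = sum(w * len(r.split()) for r, w in zip(refs, sample_weights))
    bp = 1.0 if c_len >= r_len else exp(1 - r_len / max(c_len, 1e-9))
    return bp * exp(log_prec)
```

With uniform weights this reduces to an (add-one smoothed) corpus BLEU.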
Abstract | Using this consistent training of phrase models we are able to achieve improvements of up to 1.4 points in BLEU . |
Alignment | We perform minimum error rate training with the downhill simplex algorithm (Nelder and Mead, 1965) on the development data to obtain a set of scaling factors that achieve a good BLEU score. |
Experimental Evaluation | The scaling factors of the translation models have been optimized for BLEU on the DEV data. |
Experimental Evaluation | The metrics used for evaluation are the case-sensitive BLEU (Papineni et al., 2002) score and the translation edit rate (TER) (Snover et al., 2006) with one reference translation. |
Introduction | Our results show that the proposed phrase model training improves translation quality on the test set by 0.9 BLEU points over our baseline. |
Introduction | We find that by interpolation with the heuristically extracted phrases, translation performance can reach up to a 1.4 BLEU point improvement over the baseline on the test set.
Abstract | Medium-scale experiments show an absolute and statistically significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system even with fewer rules. |
Experiments | We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on development set. |
Experiments | The baseline system extracts 31.9M 625 rules, 77.9M 525 rules respectively and achieves a BLEU score of 34.17 on the test set.
Experiments | As shown in the third line in the column of BLEU score, the performance drops 1.7 BLEU points over baseline system due to the poorer rule coverage. |
Introduction | Medium data experiments (Section 5) show a statistically significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system even with fewer translation rules; this is also the first time that a tree-to-tree model has surpassed its tree-to-string counterparts.
Model | (2009), their forest-based constituency-to-constituency system achieves a comparable performance against Moses (Koehn et al., 2007), but a significant improvement of +3.6 BLEU points over the 1-best tree-based constituency-to-constituency system. |
Abstract | As compared to baseline systems, we achieve absolute improvements of 2.40 BLEU score on a phrase-based SMT system and 1.76 BLEU score on a parsing-based SMT system. |
Experiments on Parsing-Based SMT | BLEU (%): Joshua 30.05; + Improved word alignments 31.81.
Experiments on Parsing-Based SMT | The system using the improved word alignments achieves an absolute improvement of 1.76 BLEU score, which indicates that the improvements of word alignments are also effective to improve the performance of the parsing-based SMT systems. |
Experiments on Phrase-Based SMT | We use BLEU (Papineni et al., 2002) as evaluation metrics. |
Experiments on Phrase-Based SMT | BLEU (%): Moses 29.62; + Phrase collocation probability 30.47.
Experiments on Phrase-Based SMT | If the same alignment method is used, the systems using CM-3 got the highest BLEU scores. |
Experiments on Word Alignment | BLEU (%): Baseline 29.62. Our methods: WA-1 with CM-1 30.85, CM-2 31.28, CM-3 31.48; WA-2 with CM-1 31.00, CM-2 31.33, CM-3 31.51; WA-3 with CM-1 31.43, CM-2 31.62, CM-3 31.78.
Introduction | The alignment improvement results in an improvement of 2.16 BLEU score on phrase-based SMT system and an improvement of 1.76 BLEU score on parsing-based SMT system. |
Introduction | SMT performance is further improved by 0.24 BLEU score. |
Abstract | BLEU ) when applied to morphologically rich languages such as Czech. |
Introduction | Section 2 illustrates and explains severe problems of a widely used BLEU metric (Papineni et al., 2002) when applied to Czech as a representative of languages with rich morphology. |
Introduction | Figure 1: BLEU and human ranks of systems participating in the English-to-Czech WMT09 shared task.
Problems of BLEU | BLEU (Papineni et al., 2002) is an established language-independent MT metric. |
Problems of BLEU | The unbeaten advantage of BLEU is its simplicity. |
Problems of BLEU | We plot the official BLEU score against the rank established as the percentage of sentences where a system ranked no worse than all its competitors (Callison-Burch et al., 2009). |
Experiments | The second score is BLEU (Papineni et al., 2001), computed between the reconstructed and the original sentences, which allows us to check how well the quality of reconstruction correlates with the internal score.
Experiments | In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences. |
Abstract | At a speed of roughly 70 words per second, Moses reaches 17.2% BLEU , whereas our approach yields 20.0% with identical models. |
Experimental Evaluation | system: BLEU [%], #HYP, #LM, w/s. N0 = ∞: baseline 20.1, 3.0K, 322K, 2.2; +presort 20.1, 2.5K, 183K, 3.6. N0 = 100:
Experimental Evaluation | We evaluate with BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). |
Introduction | We also run comparisons with the Moses decoder (Koehn et al., 2007), which yields the same performance in BLEU , but is outperformed significantly in terms of scalability for faster translation. |
Introduction | Experiments show that our approach significantly outperforms both phrase-based (Koehn et al., 2007) and string-to-dependency approaches (Shen et al., 2008) in terms of BLEU and TER.
Introduction | Adding dependency language model (“depLM”) and the maximum entropy shift-reduce parsing model (“maxent”) significantly improves BLEU and TER on the development set, both separately and jointly. |
Experiment Results | We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences). |
Experiment Results | On average the improvement is 1.07 BLEU points (45.66
Experiment Results | Table 4: Arabic-English true case translation scores in BLEU metric. |
Phrasal-Hiero Model | Compare BLEU scores of translation using all extracted rules (the first row) and translation using only rules without nonaligned subphrases (the second row). |
Baseline MT | The scaling factors for all features are optimized by minimum error rate training algorithm to maximize BLEU score (Och, 2003). |
Experiments | We can see that except for the BOLT3 data set with BLEU metric, our NAMT approach consistently outperformed the baseline system for all data sets with all metrics, and provided up to 23.6% relative error reduction on name translation. |
Experiments | According to Wilcoxon Matched-Pairs Signed-Ranks Test, the improvement is not significant with BLEU metric, but is significant at 98% confidence level with all of the other metrics. |
Introduction | The current dominant automatic MT scoring metrics (such as the Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002)) treat all words equally, but names have relatively low frequency in text (about 6% in newswire and only 3% in web documents) and are thus vastly outnumbered by function words, common nouns, etc.
Name-aware MT Evaluation | Traditional MT evaluation metrics such as BLEU (Papineni et al., 2002) and Translation Edit Rate (TER) (Snover et al., 2006) assign the same weights to all tokens equally. |
Name-aware MT Evaluation | In order to properly evaluate the translation quality of NAMT methods, we propose to modify the BLEU metric so that it can dynamically assign more weights to names during evaluation.
Name-aware MT Evaluation | BLEU considers the correspondence between a system translation and a human translation: |
Abstract | We show empirical results on the OPUS data: our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster).
Experiments and Results | To evaluate translation quality, we use BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation. |
Experiments and Results | We show that our method achieves the best performance ( BLEU scores) on this task while being significantly faster than both the previous approaches. |
Experiments and Results | We also report the first BLEU results on such a large-scale MT task under truly nonparallel settings (without using any parallel data or seed lexicon). |
Abstract | The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points.
Abstract | Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
Introduction | We built a phrasal Machine Translation (MT) system on adapted Egyptian/English parallel data, which outperformed a non-adapted baseline by 1.87 BLEU points.
Previous Work | The system trained on AR (B1) performed poorly compared to the one trained on EG (B2) with a 6.75 BLEU points difference. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | S1, which used only EG’ for training, showed an improvement of 1.67 BLEU points over the best baseline system (B4).
Proposed Methods 3.1 Egyptian to EG’ Conversion | Phrase merging that preferred phrases learnt from EG’ data over AR data performed the best with a BLEU score of 16.96. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | The Egyptian sentence “wbyHtrmwA AlnAs AltAnyp” produced “lyfizfij (OOV) the second people” (BLEU = 0.31).
Abstract | Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU . |
Experiments | We compared the performance of Moses using the alignment produced by our model and the baseline alignment, evaluating translation quality using BLEU (Papineni et al., 2002) with case-insensitive n-gram matching with n = 4. |
Experiments | We used minimum error rate training (Och, 2003) to tune the feature weights to maximise the BLEU score on the development set. |
Experiments | 5 The effect on translation scores is modest, roughly amounting to +0.2 BLEU versus using a single sample. |
Introduction | The model produces uniformly better translations than those of a competitive phrase-based baseline, amounting to an improvement of up to 3.4 BLEU points absolute. |
Abstract | We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set. |
Additional Experiments | As can be seen in Table 4, in the smaller feature set, RM and MERT were the best performers, with the exception that on MT08, MIRA yielded somewhat better (+0.7) BLEU but a somewhat worse (-0.9) TER score than RM. |
Additional Experiments | On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain. |
Additional Experiments | Interestingly, RM achieved substantially higher BLEU precision scores in all tests for both language pairs. |
Experiments | We used cdec (Dyer et al., 2010) as our hierarchical phrase-based decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 corpus. |
Experiments | The bound constraint B was set to 1.4. The approximate sentence-level BLEU cost Δ is computed in a manner similar to (Chiang et al., 2009), namely, in the context of previous 1-best translations of the tuning set.
Experiments | We explored alternative values for B, as well as scaling it by the current candidate’s cost, and found that the optimizer is fairly insensitive to these changes, resulting in only minor differences in BLEU . |
Introduction | Automatic evaluation (using ROUGE (Lin and Hovy, 2003) and BLEU (Papineni et al., 2002)) against manually generated focused summaries shows that our sum-marizers uniformly and statistically significantly outperform two baseline systems as well as a state-of-the-art supervised extraction-based system. |
Results | To evaluate the full abstract generation system, the BLEU score (Papineni et al., 2002) (the precision of unigrams and bigrams with a brevity penalty) is computed with human abstracts as reference.
Results | BLEU has a fairly good agreement with human judgement and has been used to evaluate a variety of language generation systems (Angeli et al., 2010; Konstas and Lapata, 2012). |
Abstract | We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores. |
Abstract | Our maximal set of source and target side transformations, coupled with some additional techniques, provides a 39% relative improvement, from a baseline of 17.08 to 23.78 BLEU, all averaged over 10 training and test sets.
Experimental Setup and Results | For evaluation, we used the BLEU metric (Papineni et al., 2001).
Experimental Setup and Results | Wherever meaningful, we report the average BLEU scores over 10 data sets along with the maximum and minimum values and the standard deviation. |
Experimental Setup and Results | We can observe that the combined syntax-to-morphology transformations on the source side provide a substantial improvement by themselves and a simple target side transformation on top of those provides a further boost to 21.96 BLEU which represents a 28.57% relative improvement over the word-based baseline and a 18.00% relative improvement over the factored baseline. |
Introduction | We find that with the full set of syntax-to-morphology transformations and some additional techniques we can get about 39% relative improvement in BLEU scores over a word-based baseline and about 28% improvement of a factored baseline, all experiments being done over 10 training and test sets. |
Syntax-to-Morphology Mapping | We find (and elaborate later) that this reduction in the English side of the training corpus, in general, is about 30%, and is correlated with improved BLEU scores. |
Abstract | Furthermore, integrated Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction in comparison with the pure SMT system. |
Conclusion and Future Work | The experiments show that the proposed Model-III outperforms both the TM and the SMT systems significantly (p < 0.05) in either BLEU or TER when fuzzy match score is above 0.4. |
Conclusion and Future Work | Compared with the pure SMT system, Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction on a Chinese—English TM database. |
Experiments | In the tables, the best translation results (either in BLEU or TER) at each interval have been marked in bold. |
Experiments | Compared with TM and SMT, Model-I is significantly better than the SMT system in either BLEU or TER when the fuzzy match score is above 0.7; Model-II significantly outperforms both the TM and the SMT systems in either BLEU or TER when the fuzzy match score is above 0.5; Model-III significantly exceeds both the TM and the SMT systems in either BLEU or TER when the fuzzy match score is above 0.4. |
Experiments | SMT 8.03 BLEU points at interval [0.9, 1.0), while the advantage is only 2.97 BLEU points at interval [0.6, 0.7). |
Introduction | Compared with the pure SMT system, the proposed integrated Model-III achieves 3.48 BLEU points improvement and 2.62 TER points reduction overall. |
Abstract | In addition, a revised BLEU score (called iBLEU) which measures the adequacy and diversity of the generated paraphrase sentence is proposed for tuning parameters in SMT systems. |
Experiments and Results | Joint learning (BLEU / self-BLEU / iBLEU): No Joint 27.16 / 35.42 / –; α = 1: 30.75 / 53.51 / 30.75.
Experiments and Results | We show the BLEU score (computed against references) to measure the adequacy and self-BLEU (computed against source sentence) to evaluate the dissimilarity (lower is better). |
Experiments and Results | From the results we can see that, when the value of α decreases to place a heavier penalty on self-paraphrase, the self-BLEU score rapidly decays, while the BLEU score computed against references also drops sharply.
Introduction | The jointly-learned dual SMT system: (1) adapts the SMT systems so that they are tuned specifically for paraphrase generation purposes, e.g., to increase the dissimilarity; (2) employs a revised BLEU score (named iBLEU, as it is an input-aware BLEU metric) that measures adequacy and dissimilarity of the paraphrase results at the same time.
Paraphrasing with a Dual SMT System | Two issues are also raised in (Zhao and Wang, 2010) about using automatic metrics: a paraphrase that changes less receives a larger BLEU score, and the evaluations of paraphrase quality and rate tend to be incompatible.
Paraphrasing with a Dual SMT System | iBLEU(s, r_s, c) = α · BLEU(c, r_s) − (1 − α) · BLEU(c, s) (3)
Paraphrasing with a Dual SMT System | BLEU(c, r_s) captures the semantic equivalency between the candidates and the references (Finch et al.
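Equation (3) can be sketched directly on top of any sentence-level BLEU scorer. In the minimal sketch below, `unigram_prec` is only a toy stand-in scorer for illustration, not the BLEU variant used in the paper:

```python
def ibleu(source, references, candidate, bleu_fn, alpha=0.8):
    """iBLEU (Eq. 3): alpha * BLEU(c, r_s) - (1 - alpha) * BLEU(c, s).
    The first term rewards adequacy against the references; the second
    penalizes candidates that merely copy the source."""
    adequacy = bleu_fn(candidate, references)
    self_sim = bleu_fn(candidate, [source])
    return alpha * adequacy - (1 - alpha) * self_sim

def unigram_prec(candidate, references):
    """Toy stand-in for a sentence-level BLEU scorer: unigram precision."""
    ref_words = {w for ref in references for w in ref}
    return sum(w in ref_words for w in candidate) / max(len(candidate), 1)
```

With alpha = 1 the measure reduces to plain BLEU against the references; lowering alpha trades adequacy for dissimilarity, which is exactly the tension between quality and paraphrase rate discussed above.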
Experimental Evaluation | We show that our method performs better by 1.6 BLEU than the best performing method described in (Ravi and Knight, 2011) while |
Experimental Evaluation | In the case of the OPUS and VERBMOBIL corpora, we evaluate the results using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) against reference translations.
Experimental Evaluation | For BLEU higher values are better, for TER lower values are better. |
Related Work | They perform experiments on a Spanish-English task with vocabulary sizes of about 500 words and achieve a performance of around 20 BLEU, compared to 70 BLEU obtained by a system that was trained on parallel data.
Experiments | BLEU , sentence-level geometric mean of 1- to 4-gram precision, as in (Belz et al., 2011) |
Experiments | BLEU-T, sentence-level BLEU computed on post-processed output where predicted referring expressions for victim and perp are replaced in the sentences (both gold and predicted) by their original role label; this score does not penalize lexical mismatches between corpus and system REs
Experiments | When REG and linearization are applied on shallowSyn_re with gold shallow trees, the BLEU score is lower (60.57) as compared to the system that applies syntax and linearization on deepSyn_re, deep trees with gold REs (BLEU score of 63.9).
Abstract | We show empirically that TESLA-CELAB significantly outperforms character-level BLEU in the English-Chinese translation evaluation tasks.
Experiments | 4.3.1 BLEU |
Experiments | Although word-level BLEU has often been found inferior to the new-generation metrics when the target language is English or other European languages, prior research has shown that character-level BLEU is highly competitive when the target language is Chinese (Li et al., 2011). |
Experiments | We use character-level BLEU as our main baseline.
Introduction | Since the introduction of BLEU (Papineni et al., 2002), automatic machine translation (MT) evaluation has received a lot of research interest. |
Introduction | In the WMT shared tasks, many new generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments. |
Introduction | Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human judged translation quality. |
Inferring a learning curve from mostly monolingual data | Our objective is to predict the evolution of the BLEU score on the given test set as a function of the size of a random subset of the training data |
Inferring a learning curve from mostly monolingual data | We first train models to predict the BLEU score at m anchor sizes s_1, ..., s_m.
Inferring a learning curve from mostly monolingual data | We then perform inference using these models to predict the BLEU score at each anchor, for the test case of interest. |
Introduction | In both cases, the task consists in predicting an evaluation score ( BLEU , throughout this work) on the test corpus as a function of the size of a subset of the source sample, assuming that we could have it manually translated and use the resulting bilingual corpus for training. |
Introduction | An extensive study across six parametric function families, empirically establishing that a certain three-parameter power-law family is well suited for modeling learning curves for the Moses SMT system when the evaluation score is BLEU . |
Introduction | They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6). |
Selecting a parametric family of curves | For a certain bilingual test dataset d, we consider a set of observations O_d = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where y_i is the performance on d (measured using BLEU (Papineni et al., 2002)) of a translation model trained on a parallel corpus of size x_i.
Selecting a parametric family of curves | The last condition is related to our use of BLEU (which is bounded by 1) as a performance measure; it should be noted that some growth patterns which are sometimes proposed, such as a logarithmic regime of the form y = a + b log x, are not
Selecting a parametric family of curves | The values are on the same scale as the BLEU scores. |
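A minimal sketch of fitting such a three-parameter power-law family to (corpus size, BLEU) observations. The parameterization y = c − a·x^(−b) and the grid-search fitting strategy are illustrative assumptions; the paper's actual family and fitting procedure may differ in detail:

```python
def fit_power_law(xs, ys, b_grid=None):
    """Fit y = c - a * x**(-b): for each candidate exponent b the model is
    linear in (c, a), so solve that 2x2 least-squares system in closed form
    and keep the b with the smallest squared residual."""
    if b_grid is None:
        b_grid = [i / 100 for i in range(1, 301)]  # b in (0, 3]
    best = None
    n = len(xs)
    for b in b_grid:
        z = [x ** (-b) for x in xs]  # regress y on [1, -z]
        sz, szz = sum(z), sum(v * v for v in z)
        sy, szy = sum(ys), sum(v * y for v, y in zip(z, ys))
        det = n * szz - sz * sz
        if abs(det) < 1e-12:
            continue
        # normal equations for y ~ c - a*z
        a = (sz * sy - n * szy) / det
        c = (sy + a * sz) / n
        sse = sum((c - a * zi - y) ** 2 for zi, y in zip(z, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c
```

The asymptote c is the predicted BLEU ceiling as training data grows; evaluating the fitted curve at an unseen corpus size gives the kind of anchor-size prediction described above.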
Abstract | In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. |
Abstract | The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. |
Abstract | parameters in the phrase and lexicon translation models are estimated by relative frequency or maximizing joint likelihood, which may not correspond closely to the translation measure, e.g., bilingual evaluation understudy ( BLEU ) (Papineni et al., 2002). |
Experiments | Translation quality was evaluated using both the BLEU score proposed by Papineni et al. |
Experiments | (2002) and also the modified BLEU (BLEU-Fix) score used in the IWSLT 2008 evaluation campaign, where the brevity calculation is modified to use closest reference length instead of shortest reference length.
Experiments | Method (BLEU / BLEU-Fix): Triangulation 33.70/27.46, 31.59/25.02; Transfer 33.52/28.34, 31.36/26.20; Synthetic 34.35/27.21, 32.00/26.07; Combination 38.14/29.32, 34.76/27.39
Translation Selection | In this paper, we modify the method in Albrecht and Hwa (2007) to only prepare human reference translations for the training examples, and then evaluate the translations produced by the subject systems against the references using BLEU score (Papineni et al., 2002). |
Translation Selection | We use smoothed sentence-level BLEU score to replace the human assessments, where we use additive smoothing to avoid zero BLEU scores when we calculate the n-gram precisions. |
Translation Selection | In the context of translation selection, 3/ is assigned as the smoothed BLEU score. |
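A minimal sketch of such an additively smoothed sentence-level BLEU. The smoothing constant and the brevity-penalty details here are illustrative assumptions, not the exact recipe used in the paper:

```python
import math
from collections import Counter

def smoothed_sentence_bleu(candidate, reference, max_n=4, eps=1.0):
    """Sentence-level BLEU with additive (add-eps) smoothing of the
    n-gram counts, so that a missing higher-order match never zeroes
    the whole score."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        log_p += math.log((clipped + eps) / (total + eps))
    # brevity penalty against the single reference length
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(log_p / max_n)
```

Because of the smoothing, even a hypothesis with no n-gram matches receives a small positive score, which is what makes the value usable as a regression target in place of human assessments.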
Automatic Evaluation Metrics | In this section, we describe BLEU, and the three metrics which achieved higher correlation results than BLEU in the recent ACL-07 MT workshop. |
Automatic Evaluation Metrics | 2.1 BLEU |
Automatic Evaluation Metrics | BLEU (Papineni et al., 2002) is essentially a precision-based metric and is currently the standard metric for automatic evaluation of MT performance. |
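As a concrete illustration of this precision-based definition (brevity penalty times the geometric mean of clipped 1- to 4-gram precisions), here is a minimal single-segment sketch; it is not the official mteval implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Single-segment BLEU: brevity penalty times the geometric mean of
    clipped n-gram precisions for n = 1..max_n."""
    if not candidate:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # clip each candidate n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in references:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_prec_sum += math.log(clipped / total)
    # brevity penalty against the closest reference length
    c_len = len(candidate)
    r_len = min((len(r) for r in references), key=lambda l: (abs(l - c_len), l))
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(log_prec_sum / max_n)
```

Clipping is what makes BLEU a precision metric rather than a simple overlap count: a candidate cannot be rewarded for repeating an n-gram more often than any reference contains it.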
Introduction | Among all the automatic MT evaluation metrics, BLEU (Papineni et al., 2002) is the most widely used. |
Introduction | Although BLEU has played a crucial role in the progress of MT research, it is becoming evident that BLEU does not correlate with human judgement |
Introduction | The results show that, as compared to BLEU , several recently proposed metrics such as Semantic-role overlap (Gimenez and Marquez, 2007), ParaEval-recall (Zhou et al., 2006), and METEOR (Banerjee and Lavie, 2005) achieve higher correlation. |
A Generic Phrase Training Procedure | lation engine to minimize the final translation errors measured by automatic metrics such as BLEU (Papineni et al., 2002). |
Discussions | [Figure: BLEU score vs. phrase table size]
Discussions | After reaching its peak, the BLEU score drops as the threshold 7' increases. |
Discussions | Table 4: Translation Results ( BLEU ) of discriminative phrase training approach using different features |
Experimental Results | We measure translation performance by the BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) scores with multiple translation references. |
Experimental Results | BLEU Scores |
Experimental Results | The translation results as measured by BLEU and METEOR scores are presented in Table 3. |
Abstract | We propose and extensively evaluate a simple method for using alignment models to produce alignments better-suited for phrase-based MT systems, and show significant gains (as measured by BLEU score) in end-to-end translation systems for six languages pairs used in recent MT competitions. |
Conclusions | Table 3: BLEU scores for all language pairs using all available data. |
Introduction | Our contribution is a large scale evaluation of this methodology for word alignments, an investigation of how the produced alignments differ and how they can be used to consistently improve machine translation performance (as measured by BLEU score) across many languages on training corpora with up to hundred thousand sentences. |
Introduction | In 10 out of 12 cases we improve BLEU score by at least half a point, and by more than 1 point in 4 out of 12 cases.
Phrase-based machine translation | We report BLEU scores using a script available with the baseline system. |
Phrase-based machine translation | Figure 8: BLEU score as the amount of training data is increased on the Hansards corpus for the best decoding method for each alignment model. |
Phrase-based machine translation | In principle, we would like to tune the threshold by optimizing BLEU score on a development set, but that is impractical for experiments with many pairs of languages. |
Word alignment results | Unfortunately, as was shown by Fraser and Marcu (2007), AER can have weak correlation with translation performance as measured by BLEU score (Papineni et al., 2002), when the alignments are used to train a phrase-based translation system.
Abstract | Large-scale experiments show an absolute improvement of 1.7 BLEU points over the 1-best baseline.
Experiments | BLEU score |
Experiments | We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on the dev set. |
Experiments | The BLEU score of the baseline 1-best decoding is 0.2325, which is consistent with the result of 0.2302 in (Liu et al., 2007) on the same training, development and test sets, and with the same rule extraction procedure. |
Introduction | Large-scale experiments (Section 4) show an improvement of 1.7 BLEU points over the 1-best baseline, which is also 0.8 points higher than decoding with 30-best trees, and takes even less time thanks to the sharing of common subtrees.
Abstract | Our experiments show that the string-to-dependency decoder achieves 1.48 point improvement in BLEU and 2.53 point improvement in TER compared to a standard hierarchical string-to-string system on the NIST 04 Chinese-English evaluation set.
Conclusions and Future Work | Our string-to-dependency system generates 80% fewer rules, and achieves 1.48 point improvement in BLEU and 2.53 point improvement in TER on the decoding output on the NIST 04 Chinese-English evaluation set. |
Experiments | All models are tuned on BLEU (Papineni et al., 2001), and evaluated on both BLEU and Translation Error Rate (TER) (Snover et al., 2006) so that we could detect over-tuning on one metric. |
Experiments | BLEU% (lower / mixed) and TER% (lower / mixed). Decoding (3-gram LM): baseline 38.18 / 35.77, 58.91 / 56.60; filtered 37.92 / 35.48, 57.80 / 55.43; str-dep 39.52 / 37.25, 56.27 / 54.07. Rescoring (5-gram LM): baseline 40.53 / 38.26, 56.35 / 54.15; filtered 40.49 / 38.26, 55.57 / 53.47; str-dep 41.60 / 39.47, 55.06 / 52.96
Experiments | Table 2: BLEU and TER scores on the test set. |
Introduction | For example, Chiang (2007) showed that the Hiero system achieved about 1 to 3 point improvement in BLEU on the NIST 03/04/05 Chinese-English evaluation sets compared to a state-of-the-art phrasal system.
Introduction | Our string-to-dependency decoder shows 1.48 point improvement in BLEU and 2.53 point improvement in TER on the NIST 04 Chinese-English MT evaluation set. |
Abstract | We applied our inflection generation models in translating English into two morphologically complex languages, Russian and Arabic, and show that our model improves the quality of SMT over both phrasal and syntax-based SMT systems according to BLEU and human judgements.
Integration of inflection models with MT systems | We performed a grid search on the values of A and n, to maximize the BLEU score of the final system on a development set (dev) of 1000 sentences (Table 2). |
MT performance results | For automatically measuring performance, we used 4-gram BLEU against a single reference translation. |
MT performance results | We also report oracle BLEU scores which incorporate two kinds of oracle knowledge. |
MT performance results | For the methods using n=1 translation from a base MT system, the oracle BLEU score is the BLEU score of the stemmed translation compared to the stemmed reference, which represents the upper bound achievable by changing only the inflected forms (but not stems) of the words in a translation.
Evaluation | Metric: Since we have four professional translation sets, we can calculate the Bilingual Evaluation Understudy (BLEU) score (Papineni et al., 2002) for one professional translator (P1) using the other three (P2, P3, P4) as a reference set.
Evaluation | In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations. |
Evaluation | This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators. |
Abstract | An additional fast decoding pass maximizing the expected count of correct translation hypotheses increases the BLEU score significantly. |
Decoding to Maximize BLEU | BLEU is based on n-gram precision, and since each synchronous constituent in the tree adds a new 4-gram to the translation at the point where its children are concatenated, the additional pass approximately maximizes BLEU . |
Experiments | We evaluate the translation results by comparing them against the reference translations using the BLEU metric. |
Experiments | Hyperedges and BLEU: Bigram Pass 167K hyperedges, BLEU 21.77; Trigram Pass 167K + 629.7K = 796.7K hyperedges, BLEU 23.56
Experiments | Table 1: Speed and BLEU scores for two-pass decoding.
Introduction | With this heuristic, we achieve the same BLEU scores and model cost as a trigram decoder with essentially the same speed as a bigram decoder. |
Introduction | Maximizing the expected count of synchronous constituents approximately maximizes BLEU . |
Introduction | We find a significant increase in BLEU in the experiments, with minimal additional time. |
Abstract | The minimum Bayes risk (MBR) decoding objective improves BLEU scores for machine translation output relative to the standard Viterbi objective of maximizing model score. |
Abstract | However, MBR targeting BLEU is prohibitively slow to optimize over k-best lists for large k. In this paper, we introduce and analyze an alternative to MBR that is equally effective at improving performance, yet is asymptotically faster, running 80 times faster than MBR in experiments with 1000-best lists.
Abstract | Our forest-based decoding objective consistently outperforms k-best list MBR, giving improvements of up to 1.0 BLEU.
Consensus Decoding Algorithms | Typically, MBR is defined as argmin_{e in E} E[L(e; e')] for some loss function L, for example 1 − BLEU(e; e'). These definitions are equivalent.
Consensus Decoding Algorithms | Figure 1 compares Algorithms 1 and 2 using U(e; e'). Other linear functions have been explored for MBR, including Taylor approximations to the logarithm of BLEU (Tromble et al., 2008) and counts of matching constituents (Zhang and Gildea, 2008), which are discussed further in Section 3.3.
Consensus Decoding Algorithms | Computing MBR even with simple nonlinear measures such as BLEU, NIST or bag-of-words F1 seems to require O(k^2) computation time.
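The O(k^2) k-best MBR computation described here can be sketched as follows; the `overlap` function in the demo is only a toy stand-in for a sentence-level BLEU gain:

```python
def mbr_decode(kbest, probs, gain):
    """O(k^2) Minimum Bayes Risk over a k-best list: return the hypothesis
    maximizing expected gain (equivalently, minimizing expected loss)
    under the model posterior restricted to the same list."""
    best, best_score = None, float("-inf")
    for e in kbest:
        expected = sum(p * gain(e, e2) for p, e2 in zip(probs, kbest))
        if expected > best_score:
            best, best_score = e, expected
    return best

def overlap(e, e2):
    """Toy stand-in for BLEU: Jaccard overlap of the token sets."""
    a, b = set(e.split()), set(e2.split())
    return len(a & b) / max(len(a | b), 1)
```

Note that the double loop over hypotheses is exactly where the quadratic cost comes from: each candidate is scored against every other candidate, which is why MBR with a nonlinear gain becomes slow for large k.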
Introduction | In statistical machine translation, output translations are evaluated by their similarity to human reference translations, where similarity is most often measured by BLEU (Papineni et al., 2002). |
Introduction | Unfortunately, with a nonlinear similarity measure like BLEU , we must resort to approximating the expected loss using a k-best list, which accounts for only a tiny fraction of a model’s full posterior distribution. |
Introduction | In experiments using BLEU over 1000-best lists, we found that our objective provided benefits very similar to MBR, only much faster. |
Introduction | Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
Introduction | We employ MERT to select these weights by optimizing BLEU score on a development set. |
Introduction | In contrast, our MBR algorithm directly selects the hypothesis in the hypergraph with the maximum expected approximate corpus BLEU score (Tromble et al., 2008). |
MERT for MBR Parameter Optimization | However, this does not guarantee that the resulting linear score (Equation 2) is close to the corpus BLEU . |
MERT for MBR Parameter Optimization | We now describe how MERT can be used to estimate these factors to achieve a better approximation to the corpus BLEU . |
MERT for MBR Parameter Optimization | We recall that MERT selects weights in a linear model to optimize an error criterion (e. g. corpus BLEU ) on a training set. |
Minimum Bayes-Risk Decoding | This reranking can be done for any sentence-level loss function such as BLEU (Papineni et al., 2001), Word Error Rate, or Position-independent Error Rate. |
Minimum Bayes-Risk Decoding | (2008) extended MBR decoding to translation lattices under an approximate BLEU score. |
Minimum Bayes-Risk Decoding | They approximated the log(BLEU) score by a linear function of n-gram matches and candidate length.
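That linear function can be sketched as follows; the particular weight values in the demo are illustrative placeholders, not the heuristically set or MERT-tuned values from the papers:

```python
from collections import Counter

def linear_bleu_gain(hyp, ref, thetas):
    """Linear approximation of log(BLEU) in the style of Tromble et al.
    (2008): theta_0 times the candidate length plus theta_n times the
    number of matching n-grams, for n = 1..4."""
    gain = thetas[0] * len(hyp)
    for n in range(1, 5):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matches = sum(min(c, r[g]) for g, c in h.items())
        gain += thetas[n] * matches
    return gain
```

Because the gain is linear in n-gram match counts, the expectation over a lattice or hypergraph decomposes into expected n-gram counts, which is what makes lattice MBR tractable where exact BLEU is not.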
Abstract | Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets. |
Evaluation | We use case-insensitive BLEU (Papineni et al., 2002) to evaluate translation quality. |
Evaluation | Table 4 presents the results of these variations; overall, by taking into account generated candidates appropriately and using bigrams (“SLP 2-gram”), we obtained a 1.13 BLEU gain on the test set. |
Evaluation | In "HalfMono", we use only half of the monolingual comparable corpora, and still obtain an improvement of 0.56 BLEU points, indicating that adding more monolingual data is likely to improve the system further.
Introduction | This enhancement alone results in an improvement of almost 1.4 BLEU points. |
Introduction | We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models. |
Abstract | We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008). |
Experimental Results | Table 1: BLEU scores for Viterbi, Crunching, MBR, and variational decoding.
Experimental Results | Table 1 presents the BLEU scores under Viterbi, crunching, MBR, and variational decoding. |
Experimental Results | Table 2 presents the BLEU results under different ways in using the variational models, as discussed in Section 3.2.3. |
Introduction | We geometrically interpolate the resulting approximations q with one another (and with the original distribution p), justifying this interpolation as similar to the minimum-risk decoding for BLEU proposed by Tromble et al. |
Variational Approximate Decoding | However, in order to score well on the BLEU metric for MT evaluation (Papineni et al., 2001), which gives partial credit, we would also like to favor lower-order n-grams that are likely to appear in the reference, even if this means picking some less-likely high-order n-grams. |
Variational vs. Min-Risk Decoding | They use the following loss function, of which a linear approximation to BLEU (Papineni et al., 2001) is a special case, |
Abstract | Comparable to the state-of-the-art system combination technique, joint decoding achieves an absolute improvement of 1.5 BLEU points over individual decoding. |
Experiments | We evaluated the translation quality using case-insensitive BLEU metric (Papineni et al., 2002). |
Experiments | Table 2: Comparison of individual decoding and joint decoding in terms of speed (seconds/sentence) and BLEU score (case-insensitive).
Experiments | With conventional max-derivation decoding, the hierarchical phrase-based model achieved a BLEU score of 30.11 on the test set, with an average decoding time of 40.53 seconds/sentence. |
Introduction | As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
Introduction | Joint decoding with multiple models achieves an absolute improvement of 1.5 BLEU points over individual decoding with single models (Section 5).
Abstract | Comparable to the state-of-the-art phrase-based system Moses, using packed forests in tree-to-tree translation results in a significant absolute improvement of 3.6 BLEU points over using 1-best trees.
Experiments | We evaluated the translation quality using the BLEU metric, as calculated by mteval-v11b.pl with its default setting except that we used case-insensitive matching of n-grams.
Experiments | avg trees # of rules BLEU |
Experiments | Table 3: Comparison of BLEU scores for tree-based and forest-based tree-to-tree models. |
Introduction | Comparable to Moses, our forest-based tree-to-tree model achieves an absolute improvement of 3.6 BLEU points over conventional tree-based model. |
Abstract | Our best result improves over the best single MT system baseline by 1.0% BLEU and over a strong system selection baseline by 0.6% BLEU on a blind test set. |
Introduction | Our best system selection approach improves over our best baseline single MT system by 1.0% absolute BLEU point on a blind test set. |
MT System Selection | We run the 5,562 sentences of the classification training data through our four MT systems and produce sentence-level BLEU scores (with length penalty). |
MT System Selection | We pick the name of the MT system with the highest BLEU score as the class label for that sentence. |
MT System Selection | When there is a tie in BLEU scores, we pick the system label that yields better overall BLEU scores from the systems tied. |
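The labeling scheme described in the last two sentences can be sketched as follows; using the per-system average of sentence-level scores as the "overall" tie-breaker is an illustrative assumption (the paper's overall score may be corpus-level BLEU), and all system names are hypothetical:

```python
def select_labels(sent_bleu):
    """sent_bleu: {system_name: [sentence-level BLEU per sentence]}.
    Label each sentence with its best-scoring system; break ties in
    favor of the tied system with the best overall (average) score."""
    systems = list(sent_bleu)
    overall = {s: sum(v) / len(v) for s, v in sent_bleu.items()}
    n_sents = len(next(iter(sent_bleu.values())))
    labels = []
    for i in range(n_sents):
        top = max(sent_bleu[s][i] for s in systems)
        tied = [s for s in systems if sent_bleu[s][i] == top]
        labels.append(max(tied, key=overall.get))
    return labels
```

The resulting per-sentence labels are exactly the class labels used to train the system-selection classifier.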
Machine Translation Experiments | Feature weights are tuned to maximize BLEU on tuning sets using Minimum Error Rate Training (Och, 2003). |
Machine Translation Experiments | Results are presented in terms of BLEU (Papineni et al., 2002). |
Machine Translation Experiments | All differences in BLEU scores between the four systems are statistically significant above the 95% level. |
Experimental Setup | We evaluate our system using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). |
Methods | This could improve translation quality, as it brings our training scenario closer to our test scenario (test BLEU is always measured on unsegmented references). |
Related Work | We use both segmented and unsegmented language models, and tune automatically to optimize BLEU . |
Related Work | (2008) also tune on unsegmented references by simply desegmenting SMT output before MERT collects sufficient statistics for BLEU . |
Results | For English-to-Arabic, 1-best desegmentation results in a 0.7 BLEU point improvement over training on unsegmented Arabic. |
Results | Moving to lattice desegmentation more than doubles that improvement, resulting in a BLEU score of 34.4 and an improvement of 1.0 BLEU point over 1-best desegmentation. |
Results | 1000-best desegmentation also works well, resulting in a 0.6 BLEU point improvement over 1-best. |
Experiments | [Tables: Recall, F1, and BLEU scores for the compared methods; superscript indices mark statistically significant differences]
Experiments | Method 4, named REBOL, implements REsponse-Based Online Learning by instantiating y+ and y− to the form described in Section 4: in addition to the model score s, it uses a cost function c based on sentence-level BLEU (Nakov et al., 2012) and tests translation hypotheses for task-based feedback using a binary execution function e.
Response-based Online Learning | Computation of distance to the reference translation usually involves cost functions based on sentence-level BLEU (Nakov et al. |
Response-based Online Learning | In addition, we can use translation-specific cost functions based on sentence-level BLEU in order to boost similarity of translations to human reference translations. |
Response-based Online Learning | Our cost function c(y^(i), y) = 1 − BLEU(y^(i), y) is based on a version of sentence-level BLEU (Nakov et al., 2012).
Experiments | Each utterance in the test data has more than one response that elicits the same goal emotion, because these responses are used to compute the BLEU score (see Section 5.3).
Experiments | We first use BLEU score (Papineni et al., 2002) to perform automatic evaluation (Ritter et al., 2011). |
Experiments | In this evaluation, the system is provided with the utterance and the goal emotion in the test data and the generated responses are evaluated through BLEU score. |
Abstract | On NIST MT08 set, our most advanced model brings around +2.0 BLEU and -1.0 TER improvement. |
Experiments | MT08 nw (BLEU / TER) and MT08 wb (BLEU / TER)
Experiments | The best TER and BLEU results on each genre are in bold. |
Experiments | For BLEU , higher scores are better, while for TER, lower scores are better. |
Discussion and Further Work | Hiero was MERT-trained on this set and has a 2% higher BLEU score compared to the discriminative model.
Discussion and Further Work | [Figure: development BLEU (%)]
Evaluation | Although there is no direct relationship between BLEU and likelihood, it provides a rough measure for comparing performance. |
Evaluation | We also experimented with using max-translation decoding for standard MERT-trained translation models, finding that it had a small negative impact on BLEU score.
Evaluation | Figure 5 shows the relationship between beam width and development BLEU . |
Abstract | We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model) as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. |
Evaluation | Model: Pb0 | Pb1 | Pb2 | M1 | M2; BLEU: 14.3 | 16.25 | 16.13 | 18.6 | 17.05
Evaluation | Both our systems (Model-1 and Model-2) beat the baseline phrase-based system with a BLEU point difference of 4.30 and 2.75 respectively. |
Evaluation | The difference of 2.35 BLEU points between M1 and Pb1 indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.
Final Results | This section shows the improvement in BLEU score by applying heuristics and combinations of heuristics in both the models. |
Final Results | BLEU point improvement and combined with all the heuristics (M2H123) gives an overall gain of 1.95 BLEU points and is close to our best results (M1H12). |
Final Results | One important issue that has not been investigated yet is that BLEU has not been shown to perform well for morphologically rich target languages like Urdu, but no metric is known to work better.
Introduction | Section 4 discusses the training data, parameter optimization and the initial set of experiments that compare our two models with a baseline Hindi-Urdu phrase-based system and with two transliteration-aided phrase-based systems in terms of BLEU scores |
Abstract | We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
Conclusion | The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks. |
Experiments | Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English. |
Experiments | minimum error rate training (Och, 2003) with BLEU score as the objective function. |
Experiments | Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task. |
Experiments | To assess and compare simplification systems, two main automatic metrics have been used in previous work, namely BLEU and the Flesch-Kincaid Grade Level Index (FKG).
Experiments | BLEU gives a measure of how close a system’s output is to the gold standard simple sentence. |
Experiments | Because there are many possible ways of simplifying a sentence, BLEU alone fails to correctly assess the appropriateness of a simplification. |
Related Work | (2010), namely an aligned corpus of 100/131 EWKP/SWKP sentences, and show that they achieve a better BLEU score.
Baselines | where m ranges over IN and OUT, pm(é| f) is an estimate from a component phrase table, and each Am is a weight in the top-level log-linear model, set so as to maximize dev-set BLEU using minimum error rate training (Och, 2003). |
Conclusion & Future Work | We showed that this approach can gain up to 2.2 BLEU points over its concatenation baseline and 0.39 BLEU points over a powerful mixture model. |
Ensemble Decoding | In Section 4.2, we compare the BLEU scores of different mixture operations on a French-English experimental setup. |
Ensemble Decoding | However, experiments showed that replacing the scores with the normalized scores hurts the BLEU score radically.
Ensemble Decoding | However, we did not try it, as the BLEU scores we got using the normalization heuristic were not promising, and it would impose a cost in decoding as well.
Experiments & Results 4.1 Experimental Setup | Since the Hiero baseline results were substantially better than those of the phrase-based model, we also implemented the best-performing baseline, linear mixture, in our Hiero-style MT system; in fact it achieves the highest BLEU score among all the baselines, as shown in Table 2.
Experiments & Results 4.1 Experimental Setup | This baseline is run three times and the score is averaged over the BLEU scores, with a standard deviation of 0.34.
Experiments & Results 4.1 Experimental Setup | We also reported the BLEU scores when we applied the span-wise normalization heuristic. |
Experiments and Results | Statistical significance in BLEU score differences was tested by paired bootstrap re-sampling (Koehn, 2004). |
Experiments and Results | BLEU: 0.4029 / 0.3146; NIST: 7.0419 / 8.8462; METEOR: 0.5785 / 0.5335
Experiments and Results | Both SMP and ESSP outperform baseline consistently in BLEU , NIST and METEOR. |
Abstract | We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features. |
Conclusions | In comparison to a baseline model, we achieve statistically significant improvement in BLEU score. |
Discussion | Given that we only looked at IS factors within a sentence, we think that such a significant improvement in BLEU and exact match scores is very encouraging. |
Generation Ranking Experiments | Model BLEU Match (%) |
Generation Ranking Experiments | We evaluate the string chosen by the log-linear model against the original treebank string in terms of exact match and BLEU score (Papineni et al., 2002).
Generation Ranking Experiments | We achieve an improvement of 0.0168 BLEU points and 1.91 percentage points in exact match. |
Cohesive Decoding | Initially, we were not certain to what extent this feature would be used by the MERT module, as BLEU is not always sensitive to syntactic improvements. |
Cohesive Phrasal Output | We tested this approach on our English-French development set, and saw no improvement in BLEU score. |
Conclusion | Our experiments have shown that roughly 1/5 of our baseline English-French translations contain cohesion violations, and these translations tend to receive lower BLEU scores. |
Conclusion | Our soft constraint produced improvements ranging between 0.5 and 1.1 BLEU points on sentences for which the baseline produces uncohesive translations. |
Experiments | We first present our soft cohesion constraint’s effect on BLEU score (Papineni et al., 2002) for both our dev-test and test sets. |
Experiments | First of all, looking across columns, we can see that there is a definite divide in BLEU score between our two evaluation subsets. |
Experiments | Sentences with cohesive baseline translations receive much higher BLEU scores than those with uncohesive baseline translations. |
Abstract | The performance measured by BLEU is at least comparable to that of the traditional batch training method.
Conclusion and Future Work | The method assumes that a combined model is derived from a hierarchical Pitman-Yor process with each prior learned separately in each domain, and achieves BLEU scores competitive with traditional batch-based ones. |
Experiment | The BLEU scores reported in this paper are the average of 5 independent runs of independent batch-MIRA weight training, as suggested by Clark et al. (2011).
Experiment | In the IWSLT2012 data set, there is a huge gap between the HIT corpus and the BTEC corpus, and our method gains a 0.814 BLEU improvement.
Experiment | While the FBIS data set is artificially divided, with no clear human-assigned differences among sub-domains, our method loses 0.09 BLEU.
Abstract | The experimental results show that our proposed approach achieves significant improvements of 1.6~3.6 points of BLEU in the oral domain and 0.5~1 points in the news domain.
Discussion | on BLEU score |
Experiments | The metrics for automatic evaluation were BLEU and TER (Snover et al., 2005).
Experiments | (s0i, s1i) are selected for the extraction of paraphrase rules if two conditions are satisfied: (1) BLEU(e2i) − BLEU(e1i) > δ1, and (2) BLEU(e2i) > δ2, where BLEU(·) is a function for computing the BLEU score, and δ1 and δ2 are thresholds balancing the number and the quality of the extracted paraphrase rules.
Experiments | Our system gains significant improvements of 1.6~3.6 points of BLEU in the oral domain, and 0.5~1 points of BLEU in the news domain. |
Extraction of Paraphrase Rules | As mentioned above, the detailed procedure produces T1, S1, and T2 as described; finally we compute BLEU (Papineni et al., 2002) for the corresponding sentences.
Extraction of Paraphrase Rules | If the sentence in T 2 has a higher BLEU score than the aligned sentence in T1, the corresponding sentences in S0 and S1 are selected as candidate paraphrase sentence pairs, which are used in the following steps of paraphrase extractions. |
Introduction | The experimental results show that our proposed approach achieves significant improvements of 1.6~3.6 points of BLEU in the oral domain and 0.5~1 points in the news domain.
Abstract | Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on the out-of-domain NTCIR PatentMT corpus.
Complexity Analysis | In this section, the proposed method is first validated on monolingual segmentation tasks, and then evaluated in the context of SMT to study whether the translation quality, measured by BLEU , can be improved. |
Complexity Analysis | For the bilingual tasks, the publicly available system of Moses (Koehn et al., 2007) with default settings is employed to perform machine translation, and BLEU (Papineni et al., 2002) was used to evaluate the quality. |
Complexity Analysis | It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings. |
Introduction | • improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
Abstract | Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets. |
Conclusion and future work | We use dependency scores as an extra feature in our MT experiments, and found that our dependency model provides significant gains over a competitive baseline that incorporates a large 5-gram language model (0.92% TER and 0.45% BLEU absolute improvements). |
Dependency parsing for machine translation | We found that dependency scores with or without loop elimination are generally close and highly correlated, and that MT performance without final loop removal was about the same (generally less than 0.2% BLEU ). |
Introduction | In our experiments, we build a competitive baseline (Koehn et al., 2007) incorporating a 5-gram LM trained on a large part of Gigaword and show that our dependency language model provides improvements on five different test sets, with an overall gain of 0.92 in TER and 0.45 in BLEU scores. |
Machine translation experiments | Parameter tuning was done with minimum error rate training (Och, 2003), which was used to maximize BLEU (Papineni et al., 2001). |
Machine translation experiments | In the final evaluations, we report results using both TER (Snover et al., 2006) and the original BLEU metric as described in (Papineni et al., 2001). |
Machine translation experiments | For BLEU evaluations, differences are significant in four out of six cases, and in the case of TER, all differences are significant. |
Abstract | Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system). |
Experiments | To evaluate the translation results, we use BLEU (Papineni et al., 2002). |
Experiments | On the English-Chinese data set, the improvement over the phrase-based system is 1.3 BLEU points, and 0.8 over the hierarchical phrase-based system. |
Experiments | In the tasks of translating to European languages, the improvements over the phrase-based baseline are in the range of 0.5 to 1.0 BLEU points, and 0.3 to 0.5 over the hierarchical phrase-based system. |
Abstract | Extensive experiments involving large-scale English-to-Japanese translation revealed a significant improvement of 1.8 points in BLEU score, as compared with a strong forest-to-string baseline system. |
Conclusion | Extensive experiments on large-scale English-to-Japanese translation resulted in a significant improvement in BLEU score of 1.8 points (p < 0.01), as compared with our implementation of a strong forest-to-string baseline system (Mi et al., 2008; Mi and Huang, 2008). |
Experiments | BLEU (%) 26.15 27.07 27.93 28.89 |
Experiments | Here, fw denotes function word, DT denotes decoding time, and the BLEU scores were computed on the test set.
Experiments | the final BLEU scores of C3-T with Min-F and C3-F.
Introduction | (2008) achieved a 3.1-point improvement in BLEU score (Papineni et al., 2002) by including bilingual syntactic phrases in their forest-based system. |
Introduction | Using the composed rules of the present study in a baseline forest-to-string translation system results in a 1.8-point improvement in the BLEU score for large-scale English-to-Japanese translation. |
AL-SMT: Multilingual Setting | The translation quality is measured by TQ for the individual systems M_{F_d→E}; it can be the BLEU score or WER/PER (word error rate and position-independent WER), which induces a maximization or minimization problem, respectively.
AL-SMT: Multilingual Setting | This process is continued iteratively until a certain level of translation quality is met (we use the BLEU score, WER and PER) (Papineni et al., 2002). |
Experiments | The number of weights is 3 plus the number of source languages, and they are trained using minimum error-rate training (MERT) to maximize the BLEU score (Och, 2003) on a development set.
Experiments | Avg BLEU Score |
Sentence Selection: Multiple Language Pairs | • Let e_c be the consensus among all the candidate translations, then define the disagreement as Σ_d α_d (1 − BLEU(e_c, e_d)).
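A minimal sketch of this disagreement term; the similarity callback stands in for sentence-level BLEU, and the function and parameter names are illustrative:

```python
def disagreement(consensus, candidates, weights, sim):
    """Disagreement of the consensus e_c with candidate translations e_d:
    sum_d alpha_d * (1 - BLEU(e_c, e_d)); `sim` stands in for BLEU and
    `weights` for the alpha_d coefficients."""
    return sum(a * (1.0 - sim(consensus, c))
               for a, c in zip(weights, candidates))
```

A candidate set that fully agrees with the consensus thus yields a disagreement of zero, and disagreement grows as the candidates diverge.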
Experiments | Training data for discriminative learning are prepared by comparing a 100-best list of translations against a single reference using smoothed per-sentence BLEU (Liang et al., 2006a). |
Experiments | Figure 4 gives a boxplot depicting BLEU-4 results for 100 runs of the MIRA implementation of the cdec package, tuned on dev-nc, and evaluated on the respective test set test-nc. We see a high variance (whiskers denote standard deviations) around a median of 27.2 BLEU and a mean of 27.1 BLEU.
Experiments | In contrast, the perceptron is deterministic when started from a zero-vector of weights and achieves favorable 28.0 BLEU on the news-commentary test set. |
Joint Feature Selection in Distributed Stochastic Learning | Let each translation candidate be represented by a feature vector x ∈ R^D, where preference pairs for training are prepared by sorting translations according to smoothed sentence-wise BLEU score (Liang et al., 2006a) against the reference.
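Pair preparation of this kind can be sketched as follows; the scoring callback stands in for smoothed sentence-wise BLEU against the reference, and the function name is illustrative:

```python
def preference_pairs(nbest, score):
    """Sort an n-best list by a sentence-level quality score (standing in
    for smoothed sentence-wise BLEU) and emit (better, worse) preference
    pairs for pairwise ranking training."""
    ranked = sorted(nbest, key=score, reverse=True)
    return [(ranked[i], ranked[j])
            for i in range(len(ranked))
            for j in range(i + 1, len(ranked))]
```

Each emitted pair then supplies one training constraint: the learner should score the first element above the second.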
Abstract | The data generated allows us to train a reordering model that gives an improvement of 1.8 BLEU points on the NIST MT-08 Urdu-English evaluation set over a reordering model that only uses manual word alignments, and a gain of 5.2 BLEU points over a standard phrase-based baseline.
Conclusion | Cumulatively, we see a gain of 1.8 BLEU points over a baseline reordering model that only uses manual word alignments, a gain of 2.0 BLEU points over a hierarchical phrase-based system, and a gain of 5.2 BLEU points over a phrase-based baseline.
Experimental setup | All experiments were done on Urdu-English and we evaluate reordering in two ways: Firstly, we evaluate reordering performance directly by comparing the reordered source sentence in Urdu with a reference reordering obtained from the manual word alignments using BLEU (Papineni et al., 2002) (we call this measure monolingual BLEU or mBLEU). |
Experimental setup | Additionally, we evaluate the effect of reordering on our final systems for machine translation measured using BLEU . |
Introduction | This results in a 1.8 BLEU point gain in machine translation performance on an Urdu-English machine translation task over a preordering model trained using only manual word alignments. |
Introduction | In all, this increases the gain in performance by using the preordering model to 5.2 BLEU points over a standard phrase-based system with no preordering. |
Results and Discussions | We see a significant gain of 1.8 BLEU points in machine translation by going beyond manual word alignments using the best reordering model reported in Table 3. |
Results and Discussions | We also note a gain of 2.0 BLEU points over a hierarchical phrase based system. |
Analysis | • The constituent boundary matching feature (CBMF) is a very important feature, which by itself achieves a significant improvement over the baseline (up to 1.13 BLEU).
Analysis | 5.2 Beyond BLEU |
Analysis | Since BLEU is not sufficient |
Experiments | Statistical significance in BLEU score differences was tested by paired bootstrap re-sampling (Koehn, 2004). |
Experiments | Like (Marton and Resnik, 2008), we find that the XP+ feature obtains a significant improvement of 1.08 BLEU over the baseline. |
Experiments | However, using all syntax-driven features described in section 3.2, our SDB models achieve larger improvements of up to 1.67 BLEU . |
Introduction | Our experimental results show that our SDB model achieves a substantial improvement over the baseline and significantly outperforms XP+ according to the BLEU metric (Papineni et al., 2002).
Introduction | In addition, our analysis provides further evidence of the performance gain from a perspective different from that of BLEU.
Abstract | Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU . |
Abstract | On general domain and speech translation tasks where test conditions substantially differ from standard government and news training text, web-mined training data improves performance substantially, resulting in improvements of up to 1.5 BLEU on standard test sets, and 5 BLEU on test sets outside of the news domain. |
Abstract | For all language pairs and both test sets (WMT 2011 and WMT 2012), we show an improvement of around 0.5 BLEU . |
Abstract | When the selected sentence pairs are evaluated on an end-to-end MT task, our methods can increase the translation performance by 3 BLEU points. |
Conclusion | Compared with the methods which only employ language model for data selection, we observe that our methods are able to select high-quality do-main-relevant sentence pairs and improve the translation performance by nearly 3 BLEU points. |
Experiments | The BLEU scores of the In-domain and General-domain baseline system are listed in Table 2. |
Experiments | The results show that General-domain system trained on a larger amount of bilingual resources outperforms the system trained on the in-domain corpus by over 12 BLEU points. |
Experiments | The horizontal coordinate represents the number of selected sentence pairs and vertical coordinate is the BLEU scores of MT systems. |
Experiments and evaluation | We present three types of evaluation: BLEU scores (Papineni et al., 2001), prediction accuracy on clean data and a manual evaluation of the best system in section 5.3. |
Experiments and evaluation | Table 5 gives results in case-insensitive BLEU . |
Experiments and evaluation | While the inflection prediction systems (1-4) are significantly better than the surface-form system (0), the different versions of the inflection systems are not distinguishable in terms of BLEU; however, our manual evaluation shows that the new features have a positive impact on translation quality.
Discussion | At the same time, there has been no negative impact on overall quality as measured by BLEU . |
End-to-End results | To make sure our name transliterator does not degrade the overall translation quality, we evaluated our base SMT system with BLEU , as well as our transliteration-augmented SMT system. |
End-to-End results | The BLEU scores for the two systems were 50.70 and 50.96 respectively. |
Evaluation | General MT metrics such as BLEU , TER, METEOR are not suitable for evaluating named entity translation and transliteration, because they are not focused on named entities (NEs). |
Integration with SMT | In a tuning step, the Minimum Error Rate Training component of our SMT system iteratively adjusts the set of rule weights, including the weight associated with the transliteration feature, such that the English translations are optimized with respect to a set of known reference translations according to the BLEU translation metric.
Introduction | First, although names are important to human readers, automatic MT scoring metrics (such as BLEU ) do not encourage researchers to improve name translation in the context of MT. |
Introduction | A secondary goal is to make sure that our overall translation quality (as measured by BLEU ) does not degrade as a result of the name-handling techniques we introduce. |
Introduction | • We evaluate both the base SMT system and the augmented system in terms of entity translation accuracy and BLEU (Sections 2 and 6).
Abstract | Results on five Chinese-English NIST tasks show that our model improves the baseline system by 1.32 BLEU and 1.53 TER on average. |
Conclusion | Experimental results show that our model is stable and improves the baseline system by 0.98 BLEU and 1.21 TER (trained by CRFs) and 1.32 BLEU and 1.53 TER (trained by RNN). |
Experiments | • BLEU (Papineni et al., 2001) and TER (Snover et al., 2005): all reported scores are calculated in a case-insensitive (lowercase) way.
Experiments | An Index column is added for score reference convenience (B for BLEU ; T for TER). |
Experiments | For the proposed model, significance testing results on both BLEU and TER are reported (B2 and B3 compared to B1, T2 and T3 compared to T1). |
Abstract | Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation. |
Translation Model Architecture | We found that this had no significant effects on BLEU . |
Translation Model Architecture | We report translation quality using BLEU (Papineni et |
Translation Model Architecture | For the IT test set, the system with gold labels and TM adaptation yields an improvement of 0.7 BLEU (21.1 → 21.8), LM adaptation yields 1.3 BLEU (21.1 → 22.4), and adapting both models outperforms the baseline by 2.1 BLEU (21.1 → 23.2).
Abstract | Trained on 8,975 dependency structures of a Chinese Dependency Treebank, the realizer achieves a BLEU score of 0.8874. |
Experiments | In addition to BLEU score, percentage of exactly matched sentences and average NIST simple string accuracy (SSA) are adopted as evaluation metrics. |
Experiments | We observe that the BLEU score is boosted from 0.1478 to 0.5943 by using the RPD method. |
Experiments | All of the four feature functions we have tested achieve considerable improvement in BLEU scores. |
Log-linear Models | BLEU score, a method originally proposed to automatically evaluate machine translation quality (Papineni et al., 2002), has been widely used as a metric to evaluate general-purpose sentence generation (Langkilde, 2002; White et al., 2007; Guo et al. |
Log-linear Models | The BLEU measure computes the geometric mean of the precision of n-grams of various lengths between a sentence realization and a (set of) reference(s). |
Log-linear Models | 3 The BLEU scoring script is supplied by NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
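The geometric-mean computation described above can be sketched as a minimal single-reference sentence scorer with clipped counts and the brevity penalty; real scorers such as the NIST mteval script add smoothing and multi-reference support, so this is only an illustration:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    """Minimal BLEU sketch: brevity penalty times the geometric mean of
    clipped 1..max_n-gram precisions (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        precisions.append(overlap / max(1, sum(h.values())))
    if min(precisions) == 0:
        return 0.0  # any empty precision zeroes the geometric mean
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(1, len(hyp)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

With max_n = 4 this is exactly the BP × (Π prec_n)^(1/4) form quoted elsewhere in these excerpts.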
Experiments | In Table 3, almost all BLEU scores are improved, no matter what strategy is used. |
Experiments | In particular, the best results, marked in bold, are as much as 1.24, 0.94, and 0.82 BLEU points over the baseline system on NIST04, CWMT08 Development, and CWMT08 Evaluation data, respectively.
Related Work | They added the labels assigned to connectives as an additional input to an SMT system, but their experimental results show that the improvements under the evaluation metric of BLEU were not significant. |
Related Work | To the best of our knowledge, our work is the first attempt to exploit the source functional relationship to generate target transitional expressions for grammatical cohesion, and we have successfully incorporated the proposed models into an SMT system with significant improvements in BLEU.
Discussion | Table 6: Performance gain in BLEU over baseline and MR08 systems averaged over all test sets. |
Discussion | Table 9: Performance ( BLEU score) comparison between non-oracle and oracle experiments. |
Experiments | We use the NIST MT 06 dataset (1664 sentence pairs) for tuning, and the NIST MT 03, 05, and 08 datasets (919, 1082, and 1357 sentence pairs, respectively) for evaluation. We use BLEU (Papineni et al., 2002) for both tuning and evaluation.
Experiments | Our first group of experiments investigates whether the syntactic reordering models are able to improve translation quality in terms of BLEU . |
Experiments | Table 5: System performance in BLEU scores. |
Abstract | As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU , which fail to properly evaluate adequacy, become more apparent. |
Abstract | We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the nonautomatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacy judgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER. |
Abstract | We argue that BLEU (Papineni et al., 2002) and other automatic n- gram based MT evaluation metrics do not adequately capture the similarity in meaning between the machine translation and the reference translation—which, ultimately, is essential for MT output to be useful. |
Our Approach | Figure 1: BLEU scores vs. k for SumBasic extraction.
Our Approach | Although BLEU (Papineni et al., 2002) scores are widely used for image caption evaluation, we find them to be poor indicators of the quality of our model. |
Conclusion | We have also shown that, by integrating this hypertagger with a broad-coverage CCG chart realizer, considerably faster realization times are possible (approximately twice as fast as compared with a realizer that performs simple lexical lookups) with higher BLEU , METEOR and exact string match scores. |
Conclusion | Moreover, the hypertagger-augmented realizer finds more than twice the number of complete realizations, and further analysis revealed that the realization quality (as per modified BLEU and METEOR) is higher in the cases when the realizer finds a complete realization. |
Introduction | Moreover, the overall BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007) scores, as well as numbers of exact string matches (as measured against to the original sentences in the CCGbank) are higher for the hypertagger-seeded realizer than for the preexisting realizer. |
Results and Discussion | Table 5 shows that increasing the number of complete realizations also yields improved BLEU and METEOR scores, as well as more exact matches. |
Results and Discussion | In particular, the hypertagger makes possible a more than 6-point improvement in the overall BLEU score on both the development and test sections, and a more than 12-point improvement on the sentences with complete realizations. |
Results and Discussion | Even with the current incomplete set of semantic templates, the hypertagger brings realizer performance roughly up to state-of-the-art levels, as our overall test set BLEU score (0.6701) slightly exceeds that of Cahill and van Genabith (2006), though at a coverage of 96% instead of 98%. |
The Approach | compared the percentage of complete realizations (versus fragmentary ones) with their top scoring model against an oracle model that uses a simplified BLEU score based on the target string, which is useful for regression testing as it guides the best-first search to the reference sentence. |
Conclusion and Future Work | In this paper, we only tried Dice coefficient of n-grams and symmetrical sentence level BLEU as similarity measures. |
Experiments and Results | Instead of using graph-based consensus confidence as features in the log-linear model, we perform structured label propagation (Struct-LP) to re-rank the n-best list directly, and the similarity measures for source sentences and translation candidates are symmetrical sentence level BLEU (equation (10)). |
Features and Training | defined in equation (3), takes symmetrical sentence-level BLEU as the similarity measure:
Features and Training | BLEU_sym(f, f') = (BLEU(f, f') + BLEU(f', f)) / 2 (10), where BLEU(f, f') is the IBM BLEU score computed over i-grams for hypothesis f using f' as the reference.
Features and Training | 1 BLEU is not symmetric, which means, different scores are obtained depending on which one is reference and which one is hypothesis. |
Graph Construction | In our experiment we measure similarity by symmetrical sentence level BLEU of source sentences, and 0.3 is taken as the threshold for edge creation. |
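The symmetrization these passages describe can be sketched as averaging the two directions; the exact formula in the original equation (10) may differ, so this is an illustrative stand-in:

```python
def symmetric_bleu(s1, s2, bleu):
    """Symmetrical sentence-level similarity: average BLEU in both
    directions, since BLEU depends on which side is the reference."""
    return 0.5 * (bleu(s1, s2) + bleu(s2, s1))
```

An edge between two source sentences would then be created when this symmetric score exceeds the 0.3 threshold mentioned above.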
Conclusion and Future Work | Large-scale experiments show improvement on both the reordering metric and SMT performance, with up to a 1.73-point BLEU gain in our evaluation test.
Experiments | Table 2: BLEU (%) score on dev and test data for both EJ and J-E experiment. |
Experiments | We compare their influence on RankingSVM accuracy, alignment crossing-link number, end-to-end BLEU score, and the model size. |
Experiments | Columns: RankingSVM accuracy (%), CLN, BLEU, Feat.#. E-J: tag+label 88.6 / 16.4 / 22.24 / 26k; +dst 91.5 / 13.5 / 22.66 / 55k; +pct 92.2 / 13.1 / 22.73 / 79k; +lex100 92.9 / 12.1 / 22.85 / 347k; +lex1000 94.0 / 11.5 / 22.79 / 2,410k; +lex2000 95.2 / 10.7 / 22.81 / 3,794k. J-E: tag+fw 85.0 / 18.6 / 25.43 / 31k; +dst 90.3 / 16.9 / 25.62 / 65k; +lex100 91.6 / 15.7 / 25.87 / 293k; +lex1000 92.4 / 14.8 / 25.91 / 2,156k; +lex2000 93.0 / 14.3 / 25.84 / 3,297k.
Conclusions and Future Work | Experimental results show that both models are able to significantly improve translation accuracy in terms of BLEU score.
Experiments | Statistical significance in BLEU differences |
Experiments | Our first group of experiments is to investigate whether the predicate translation model is able to improve translation accuracy in terms of BLEU and whether semantic features are useful. |
Experiments | • The proposed predicate translation models achieve an average improvement of 0.57 BLEU points across the two NIST test sets when all features (lex+sem) are used.
Experiments | Table 5 shows baseline translation BLEU scores for a lossless (non-randomized) language model with parameter values quantized into 5 to 8 bits. |
Experiments | Table 5: Baseline BLEU scores with lossless n-gram model and different quantization levels (bits). |
Experiments | Figure 3: BLEU scores on the MT05 data set. |
Conclusion | • The sense-based translation model is able to substantially improve translation quality in terms of both BLEU and NIST.
Experiments | System / BLEU (%) / NIST: STM (±5w) 34.64, 9.4346; STM (±10w) 34.76, 9.5114; STM (±15w) -, -
Experiments | System / BLEU (%) / NIST: Base 33.53, 9.0561; STM (sense) 34.15, 9.2596; STM (sense+lexicon) 34.73, 9.4184
Experiments | System / BLEU (%) / NIST: Base 33.53, 9.0561; Reformulated WSD 34.16, 9.3820; STM 34.73, 9.4184
Experiments | We use BLEU (Papineni et al., 2002) score with shortest length penalty as the evaluation metric and apply the pairwise re-sampling approach (Koehn, 2004) to perform the significance test. |
Experiments | We can see from the table that the domain lexicon is very helpful and significantly outperforms the baseline by more than 4.0 BLEU points.
Experiments | When it is enhanced with the in-domain language model, it can further improve the translation performance by more than 2.5 BLEU points. |
Experiments | To confirm the effectiveness of noun-phrase chunking, we performed the experiment using a system combining BLEU with our method. |
Experiments | In this case, BLEU scores were used as score_wd in Eq.
Experiments | This experimental result is shown as “BLEU with our method” in Tables 2—5. |
Introduction | Methods based on word strings (e.g., BLEU (Papineni et al., 2002), NIST (NIST, 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin and Och, 2004),
Abstract | Conditioning lexical probabilities on the topic biases translations toward topic-relevant output, resulting in significant improvements of up to 1 BLEU and 3 TER on Chinese to English translation over a strong baseline. |
Experiments | 2010) as our decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 tuning corpus using the Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006; Eidelman, 2012). |
Experiments | On FBIS, we can see that both models achieve moderate but consistent gains over the baseline on both BLEU and TER. |
Experiments | The best model, LTM-10, achieves a gain of about 0.5 and 0.6 BLEU and 2 TER. |
Introduction | Incorporating these features into our hierarchical phrase-based translation system significantly improved translation performance, by up to 1 BLEU and 3 TER over a strong Chinese to English baseline.
Abstract | We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data. |
Conclusion | The results showed that our approach achieved a BLEU score gain of 1.61. |
Dependency-based Pre-ordering Rule Set | In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set. |
Experiments | Lng the performance ( BLEU ) on the test set, the total |
Experiments | For evaluation, we used BLEU scores (Papineni et al., 2002). |
Experiments | It shows the BLEU scores on the test set and the statistics of pre-ordering on the training set, which includes the total count of each rule set and the number of sentences they were ap- |
Introduction | Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61. |
Conclusion | This strategy leads to a better balanced distribution of the alternations in the training data, such that our linguistically informed generation ranking model achieves high BLEU scores and accurately predicts active and passive. |
Experimental Setup | Match 15.45 15.04 11.89 LM BLEU 0.68 0.68 0.65 |
Experimental Setup | Model BLEU 0.764 0.759 0.747 NIST 13.18 13.14 13.01 |
Experimental Setup | use several standard measures: a) exact match: how often does the model select the original corpus sentence, b) BLEU: n-gram overlap between top-ranked and original sentence, c) NIST: modification of BLEU giving more weight to less frequent n-grams. |
Experiments | The differences in BLEU between the candidate sets and models are |
Experiments | Its BLEU score and match accuracy decrease only slightly (though statistically significantly). |
Experiments | Features | Match BLEU | Voice Prec. |
Abstract | Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system. |
Conclusion | We treat word alignment as a parsing problem, and by taking advantage of English syntax and the hypergraph structure of our search algorithm, we report significant increases in both F-measure and BLEU score over standard baselines in use by most state-of-the-art MT systems today. |
Experiments | BLEU Words .696 45.1 2,538 .674 46.4 2,262 |
Experiments | Our hypergraph alignment algorithm allows us a 1.1 BLEU increase over the best baseline system, Model-4 grow-diag-final. |
Experiments | We also report a 2.4 BLEU increase over a system trained with alignments from Model-4 union. |
Related Work | Very recent work in word alignment has also started to report downstream effects on BLEU score. |
Related Work | (2009) confirm and extend these results, showing BLEU improvement for a hierarchical phrase-based MT system on a small Chinese corpus. |
Abstract | Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model. |
Experiment | Table 1: Performance on Japanese-to-English Translation Measured by BLEU (%) |
Experiment | Table 1 shows the performance for the test data measured by case sensitive BLEU (Papineni et al., 2002). |
Experiment | Under the Moses phrase-based SMT system (Koehn et al., 2007) with the default settings, we achieved a 26.80% BLEU score. |
Introduction | Further, our independent model achieves a more than 1 point gain in BLEU , which resolves the sparseness problem introduced by the bi-word observations. |
Abstract | We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data. |
Conclusion and Future Work | The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data. |
Evaluation | Table 1 shows the case-insensitive IBM-version BLEU and TER scores of different systems. |
Evaluation | Seen from row −lmT of Table 1, the removal of the skeletal language model results in a significant drop in both BLEU and TER performance.
Evaluation | Row s-space of Table 1 shows the BLEU and TER results of restricting the baseline system to the space of skeleton-consistent derivations, i.e., we remove both the skeleton-based translation model and language model from the SBMT system. |
Introduction | We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data.
Introduction | In addition, the translation adequacy across different genres (ranging from formal news to informal web forum and public speech) and different languages (English and Chinese) is improved by replacing BLEU or TER with MEANT during parameter tuning (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b). |
Related Work | Surface-form oriented metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) do not correctly reflect the meaning similarities of the input sentence.
Related Work | In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy. |
Related Work | TINE (Rios et al., 2011) is a recall-oriented metric which aims to preserve the basic event structure but it performs comparably to BLEU and worse than METEOR on correlation with human adequacy judgments. |
Experiments | The reported BLEU scores are averaged over 5 runs of MERT (Och, 2003).
Experiments | We illustrate the relationship among translation accuracy ( BLEU ), the number of retrieved documents (N) and the length of hidden layers (L) on different testing datasets. |
Experiments | Figure 3: End-to-end translation results ( BLEU %) |
Experiments | Case-insensitive BLEU is employed as the evaluation metric. |
Experiments | Specifically, the Significance algorithm can safely discard 64% of the phrase table at its threshold 12 with only 0.1 BLEU loss in the overall test. |
Experiments | In contrast, our BRAE-based algorithm can remove 72% of the phrase table at its threshold 0.7 with only 0.06 BLEU loss in the overall evaluation. |
Introduction | The experiments show that up to 72% of the phrase table can be discarded without a significant decrease in translation quality, and that decoding with phrasal semantic similarities achieves up to a 1.7 BLEU score improvement over the state-of-the-art baseline.
Related Work | (2013) also use bag-of-words but learn BLEU-sensitive phrase embeddings.
Conclusion | We observed that this often fails to return the best output in terms of BLEU score, fluency, grammaticality and/or meaning. |
Results and Discussion | Figure 6: BLEU scores and Grammar Size (Number of Elementary TAG trees)
Results and Discussion | The average BLEU score is given with respect to all input (All) and to those inputs for which the systems generate at least one sentence (Covered). |
Results and Discussion | In terms of BLEU score, the best version of our system (AUTEXP) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results similar to the fully handcrafted UDEL system.
Experimental Results | The feature functions are combined under a log-linear framework, and the weights are tuned by the minimum-error-rate training (Och, 2003) using BLEU (Papineni et al., 2002) as the optimization metric. |
Experimental Results | This precision is extremely high because the BLEU score (precision with brevity penalty) that one obtains for a Chinese sentence is normally between 30% and 50%.
Experimental Results | 4.5.2 BLEU on NIST MT Test Sets |
Introduction | We carry out experiments on a state-of-the-art SMT system, i.e., Moses (Koehn et al., 2007), and show that the abbreviation translations consistently improve the translation performance (in terms of BLEU (Papineni et al., 2002)) on various NIST MT test sets. |
Abstract | On top of the pruning framework, we also propose a discriminative ITG alignment model using hierarchical phrase pairs, which improves both F-score and Bleu score over the baseline alignment system of GIZA++. |
Evaluation | Finally, we also do end-to-end evaluation using both F-score in alignment and Bleu score in translation. |
Evaluation | HP-DITG using DPDI achieves the best Bleu score with acceptable time cost. |
Evaluation | It shows that HP-DITG (with DPDI) is better than the three baselines both in alignment F-score and Bleu score. |
Analysis and Discussion | Table 4: Results ( BLEU %) of Chinese-to-English large data (CE_LD) and small data (CE_SD) NIST task by applying one feature.
Analysis and Discussion | Table 5: Results ( BLEU %) for combination of two similarity scores. |
Analysis and Discussion | Table 6: Results ( BLEU %) of using simple features based on context on small data NIST task. |
Experiments | Our evaluation metric is IBM BLEU (Papineni et al., 2002), which performs case-insensitive matching of n-grams up to n = 4.
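As a concrete illustration of what such an n-gram matching metric computes, here is a minimal single-reference, sentence-level BLEU sketch in Python. It is not the IBM BLEU script (which is corpus-level and unsmoothed); the add-one smoothing and function name are assumptions made for readability.

```python
from collections import Counter
from math import exp, log

def sentence_bleu(candidate, reference, max_n=4):
    """Minimal single-reference sentence-level BLEU sketch.

    Case-insensitive n-gram matching up to max_n, clipped counts,
    add-one smoothing (so short sentences never score exactly zero),
    and the standard brevity penalty.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clipped match count: a candidate n-gram matches at most as often
        # as it occurs in the reference
        matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_prec += log((matched + 1) / (total + 1))
    # brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else exp(1.0 - len(ref) / max(len(cand), 1))
    return bp * exp(log_prec / max_n)
```

The geometric mean of the four n-gram precisions is what the `(Π prec_n)^{1/4}` form of BLEU denotes; the brevity penalty is the only recall-like component.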
Experiments | Table 2: Results ( BLEU %) of small data Chinese-to-English NIST task. |
Experiments | Table 3: Results ( BLEU %) of large data Chinese-to-English NIST task and German-to-English WMT task.
Introduction | In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e′, d′)), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e′, then the model score of (e*, d*) with this weight will also be greater than that of (e′, d′). In this paper, a pair (e*, e′) for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
Introduction | to that of Moses: on the NIST05 test set, L-Hiero achieves a BLEU score of 25.1 and Moses achieves 24.8.
Introduction | Since both MERT and PRO tuning toolkits involve randomness in their implementations, all BLEU scores reported in the experiments are the average of five tuning runs, as suggested by Clark et al. |
Experiments | 9Hence the BLEU scores we get for the baselines may appear lower than those reported in the literature.
Experiments | 10Using the factorised alignments directly in a translation system resulted in a slight loss in BLEU versus using the un-factorised alignments. |
Experiments | We use minimum error rate training (Och, 2003) with an n-best list size of 100 to optimize the feature weights for maximum development BLEU.
Experimental Evaluation | 6For most models, while likelihood continued to increase gradually for all 100 iterations, BLEU score gains plateaued after 5-10 iterations, likely due to the strong prior information |
Experimental Evaluation | It can also be seen that combining phrase tables from multiple samples improved the BLEU score for HLEN, but not for HIER. |
Flat ITG Model | The average gain across all data sets was approximately 0.8 BLEU points. |
Hierarchical ITG Model | (2003) that using phrases where max(|e|, |f|) ≤ 3 causes significant improvements in BLEU score, while using larger phrases results in diminishing returns.
Introduction | We also find that it achieves superior BLEU scores over previously proposed ITG-based phrase alignment approaches. |
Experimental Evaluation | For MCE learning, we selected the reference compression that maximizes the BLEU score (Papineni et al., 2002) (i.e., argmax_{r∈R} BLEU(r, R\r)) from the set of reference compressions and used it as correct data for training.
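The selection rule above (for each source, take the reference that scores highest against the remaining references) can be sketched as follows. The `unigram_overlap` scorer is a deliberately crude stand-in for BLEU, and both function names are hypothetical:

```python
from collections import Counter

def unigram_overlap(cand, refs):
    """Crude stand-in for multi-reference BLEU: fraction of candidate
    tokens covered by clipped counts over the reference pool."""
    cand_counts = Counter(cand.split())
    pool = Counter()
    for r in refs:
        pool |= Counter(r.split())  # per-token max over references
    matched = sum(min(c, pool[t]) for t, c in cand_counts.items())
    return matched / max(sum(cand_counts.values()), 1)

def pick_training_reference(references):
    """Return argmax over r in R of score(r, R \\ {r}): the reference
    most similar to the rest of the set."""
    best, best_score = None, -1.0
    for i, r in enumerate(references):
        others = references[:i] + references[i + 1:]
        score = unigram_overlap(r, others)
        if score > best_score:
            best, best_score = r, score
    return best
```

Swapping `unigram_overlap` for a real BLEU implementation recovers the argmax rule as stated.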
Experimental Evaluation | For automatic evaluation, we employed BLEU (Papineni et al., 2002) by following (Unno et al., 2006). |
Experimental Evaluation | Label / BLEU: Proposed .679; w/o PLM .617; w/o IPTW .635; Hori- .493
Results and Discussion | Our method achieved the highest BLEU score. |
Results and Discussion | For example, ‘w/o PLM + Dep’ achieved the second highest BLEU score. |
Results and Discussion | Compared to ‘Hori-’, ‘Hori’ achieved a significantly higher BLEU score.
Machine Translation as a Decipherment Task | Evaluation: All the MT systems are run on the Spanish test data and the quality of the resulting English translations is evaluated using two different measures: (1) normalized edit distance score (Navarro, 2001),6 and (2) BLEU (Papineni et al., 2002).
Machine Translation as a Decipherment Task | The figure also shows the corresponding BLEU scores in parentheses for comparison (higher scores indicate better MT output). |
Machine Translation as a Decipherment Task | Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment. |
Experiments | Is our topic similarity model able to improve translation quality in terms of BLEU ? |
Experiments | Case-insensitive NIST BLEU (Papineni et al., 2002) was used to measure translation quality.
Experiments | By using all the features (last line in the table), we improve the translation performance over the baseline system by 0.87 BLEU point on average. |
Introduction | Experiments on Chinese-English translation tasks (Section 6) show that our method outperforms the baseline hierarchical phrase-based system by +0.9 BLEU points.
Experiments | The BLEU scores for these outputs are 32.7, 27.8, and 20.8. |
Experiments | In particular, their translations had a lower BLEU score, making their task easier. |
Experiments | We see that our system prefers the reference much more often than the 5-gram language model.11 However, we also note that the easiness of the task is correlated with the quality of translations (as measured by BLEU score).
Abstract | Experiments on a Chinese to English translation task show that our proposed RZNN can outperform the state-of-the-art baseline by about 1.5 points in BLEU . |
Conclusion and Future Work | We conduct experiments on a Chinese-to-English translation task, and our method outperforms a state-of-the-art baseline by about 1.5 BLEU points.
Experiments and Results | When we remove it from RZNN, WEPPE based method drops about 10 BLEU points on development data and more than 6 BLEU points on test data. |
Experiments and Results | TCBPPE based method drops about 3 BLEU points on both development and test data sets. |
Introduction | We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we get about 1.5 BLEU points improvement, compared with a state-of-the-art baseline system. |
Abstract | We evaluate our model on a Chinese to English translation task and obtain up to 1.2 BLEU improvement over strong baselines. |
Experiments | We refer to the SMT model without domain adaptation as baseline.5 LDA marginally improves machine translation (less than half a BLEU point). |
Experiments | These improvements are not redundant: our new ptLDA-dict model, which has aspects of both models yields the best performance among these approaches—up to a 1.2 BLEU point gain (higher is better), and -2.6 TER improvement (lower is better). |
Experiments | The BLEU improvement is significant (Koehn, 2004) at p = 0.01,6 except on MT03 with variational and variational-hybrid inference. |
Experiment | Model / BLEU (%): Moses 25.68; TT2S 26.08; TTS2S 26.95; FT2S 27.66; FTS2S 28.83
Experiment | The 9% tree sequence rules contribute 1.17 BLEU score improvement (28.83-27.66 in Table 1) to FTS2S over FT2S. |
Experiment | BLEU (%) by n-best list size (FT2S / FTS2S): 100-best 27.40 / 28.61; 500-best 27.66 / 28.83; 2500-best 27.66 / 28.96; 5000-best 27.79 / 28.89
Experiments | System / Model / BLEU: Moses cBP 23.86; STSSG 25.92; SncTSSG 26.53
Experiments | ID / Rule Set / BLEU: 1 CR (STSSG) 25.92; 2 CR w/o ncPR 25.87; 3 CR w/o ncPR + tgtncR 26.14; 4 CR w/o ncPR + srcncR 26.50; 5 CR w/o ncPR + src&tgtncR 26.51; 6 CR + tgtncR 26.11; 7 CR + srcncR 26.56; 8 CR + src&tgtncR (SncTSSG) 26.53
Experiments | 2) Not only that, after comparing Exp 6, 7, 8 against Exp 3, 4, 5 respectively, we find that the ability of rules derived from noncontiguous tree sequence pairs generally covers that of the rules derived from the contiguous tree sequence pairs, given the slight change in BLEU score.
Abstract | Experiments on Chinese-English translation on four NIST MT test sets show that the HD-HPB model significantly outperforms Chiang's model with average gains of 1.91 points absolute in BLEU.
Experiments | For evaluation, the NIST BLEU script (version 12) with the default settings is used to calculate the BLEU scores. |
Experiments | Table 3 lists the translation performance with BLEU scores. |
Experiments | Table 3 shows that our HD-HPB model significantly outperforms Chiang’s HPB model with an average improvement of 1.91 in BLEU (and similar improvements over Moses HPB). |
Abstract | We compare this metric against a combination metric of four state-of-the-art scores ( BLEU , NIST, TER, and METEOR) in two different settings.
Experimental Evaluation | BLEUR includes the following 18 sentence-level scores: BLEU-n and n-gram precision scores (1 ≤ n ≤ 4); BLEU brevity penalty (BP); BLEU score divided by BP.
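The brevity penalty feature listed above has a simple closed form (Papineni et al., 2002); a one-function sketch, with `brevity_penalty` being an illustrative name:

```python
from math import exp

def brevity_penalty(cand_len, ref_len):
    """BLEU brevity penalty (Papineni et al., 2002): 1 when the candidate
    is at least as long as the reference, exp(1 - r/c) otherwise."""
    if cand_len >= ref_len:
        return 1.0
    return exp(1.0 - ref_len / cand_len)
```

Dividing a sentence's BLEU score by this penalty, as in the BP-normalized feature above, isolates the n-gram precision component from the length component.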
Introduction | Since human evaluation is costly and difficult to do reliably, a major focus of research has been on automatic measures of MT quality, pioneered by BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).
Introduction | BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations. |
Introduction | (2006) have identified a number of problems with BLEU and related n-gram-based scores: (1) BLEU-like metrics are unreliable at the level of individual sentences due to data sparsity; (2) BLEU metrics can be “gamed” by permuting word order; (3) for some corpora and languages, the correlation to human ratings is very low even at the system level; (4) scores are biased towards statistical MT; (5) the quality gap between MT and human translations is not reflected in equally large BLEU differences. |
Conclusion and Future Directions | 12Similar results were found for character- and word-based BLEU , but are omitted for lack of space.
Experiments | Minimum error rate training was performed to maximize word-based BLEU score for all systems.11 For language models, word-based translation uses a word 5-gram model, and character-based translation uses a character 12-gram model, both smoothed using interpolated Kneser-Ney.
Experiments | We evaluate translation quality using BLEU score (Papineni et al., 2002), both on the word and character level (with n = 4), as well as METEOR (Denkowski and Lavie, 2011) on the word level. |
Experiments | When compared with word-based translation, character-based translation achieves better, comparable, or inferior results on character-based BLEU, comparable or inferior results on METEOR, and inferior results on word-based BLEU . |
Experimental Results | The MT systems are optimized with pairwise ranking optimization (Hopkins and May, 2011) to maximize BLEU (Papineni et al., 2002). |
Experimental Results | The BLEU scores from different systems are shown in Table 10 and Table 11, respectively. |
Experimental Results | Preprocessing of the data with ECs inserted improves the BLEU scores by about 0.6 for newswire and 0.2 to 0.3 for the weblog data, compared to each baseline separately. |
Experiments | training data and not necessarily exactly follow the tendency of the final BLEU scores. |
Experiments | For example, CCG is worse than Malt in terms of P/R yet with a higher BLEU score. |
Experiments | Also, PAS+sem has a lower P/R than Berkeley, yet their final BLEU scores are not statistically different. |
Experiments | In our experiments all the models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric as percentages.
Experiments | Figure 3 shows the BLEU score curves with up to 1000 candidates used for re-ranking. |
Experiments | Figure 4 shows the BLEU scores of a two-system co-decoding as a function of re-decoding iterations. |
Evaluation | We report on BLEU , NIST, METEOR, and word error rate metrics WER and PER. |
Experiments & Results | The BLEU scores, not included in the figure but shown in Table 2, show a similar trend. |
Experiments & Results | Statistical significance on the BLEU scores was tested using pairwise bootstrap sampling (Koehn, 2004). |
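Pairwise bootstrap sampling of this kind can be sketched as below. Note one simplification: this version sums per-sentence scores, whereas a faithful implementation of Koehn (2004) would recompute corpus-level BLEU from resampled n-gram statistics on each draw.

```python
import random

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Paired bootstrap resampling in the spirit of Koehn (2004).

    Resamples sentence indices with replacement and returns the fraction
    of resamples on which system A's total score beats system B's."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples  # close to 1.0: A significantly better than B
```

A result of, say, 0.99 over 1000 resamples corresponds to significance at roughly p < 0.01 in this paired setup.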
Experiments & Results | Another discrepancy is found in the BLEU scores of the English→Chinese experiments, where we measure an unexpected drop in BLEU score under the baseline.
Alternatives to Correlation-based Meta-evaluation | We have studied 100 sentence evaluation cases from representatives of each metric family, including: 1-PER, BLEU , DP-Or-*, GTM (e = 2), METEOR and ROUGE-L. The evaluation cases have been extracted from the four test beds.
Metrics and Test Beds | At the lexical level, we have included several standard metrics, based on different similarity assumptions: edit distance (WER, PER and TER), lexical precision ( BLEU and NIST), lexical recall (ROUGE), and F-measure (GTM and METEOR). |
Previous Work on Machine Translation Meta-Evaluation | (2001) introduced the BLEU metric and evaluated its reliability in terms of Pearson correlation with human assessments for adequacy and fluency judgements. |
Previous Work on Machine Translation Meta-Evaluation | With the aim of overcoming some of the deficiencies of BLEU , Doddington (2002) introduced the NIST metric. |
Previous Work on Machine Translation Meta-Evaluation | Lin and Och (2004) experimented, unlike previous works, with a wide set of metrics, including NIST, WER (Nießen et al., 2000), PER (Tillmann et al., 1997), and variants of ROUGE, BLEU and GTM.
Abstract | On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU , TER, and NIST. |
Corpus Data and Baseline SMT | Our phrase-based decoder is similar to Moses (Koehn et al., 2007) and uses the phrase pairs and target LM to perform beam search stack decoding based on a standard log-linear model, the parameters of which were tuned with MERT (Och, 2003) on a held-out development set (3,534 sentence pairs, 45K words) using BLEU as the tuning metric. |
Experimental Setup and Results | Table 1 summarizes test set performance in BLEU (Papineni et al., 2001), NIST (Doddington, 2002) and TER (Snover et al., 2006).
Experimental Setup and Results | In the ASR setting, which simulates a real-world deployment scenario, this system achieves improvements of 0.39 ( BLEU ), -0.6 (TER) and 0.08 (NIST). |
Introduction | With this approach, we demonstrate significant improvements over a baseline phrase-based SMT system as measured by BLEU , TER and NIST scores on an English-to-Iraqi CSLT task. |
Experiments | Rule Type / BLEU (%): TR (STSG) 24.71; TR+TSR_L 25.72; TR+TSR_L+TSR_P 25.93; TR+TSR 26.07
Experiments | Rule Type / BLEU (%): TR+TSR 26.07; (TR+TSR) w/o SRR 24.62; (TR+TSR) w/o DPR 25.78
Results | Since MT systems are tuned for word-based overlap measures (such as BLEU ), verb deletion is penalized equally as, for example, determiner deletion. |
5W System | model score and word penalty for a combination of BLEU and TER (2*(1-BLEU) + TER).
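The combined tuning objective quoted above is trivial to compute; a sketch, treating both metrics as fractions in [0, 1] (an assumption, since they are sometimes reported as percentages):

```python
def combined_error(bleu, ter):
    """Combined tuning objective 2*(1-BLEU) + TER, as quoted above.
    Lower is better: minimizing it rewards high BLEU and low TER at once,
    with the BLEU term weighted twice as heavily as the TER term."""
    return 2.0 * (1.0 - bleu) + ter
```

Because BLEU is a quality score and TER an error rate, the 1-BLEU transformation puts both on an error scale before combining them.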
5W System | Bleu scores on the government-supplied test set in December 2008 were 35.2 for formal text, 29.2 for informal text, 33.2 for formal speech, and 27.6 for informal speech.
The Chinese-English 5W Task | Unlike word- or phrase-overlap measures such as BLEU , the 5W evaluation takes into account “concept” or “nugget” translation.
Discussion and Future Work | When we visually inspect and compare the outputs of our system with those of the baseline, we observe that improved BLEU score often corresponds to visible improvements in the subjective translation quality. |
Discussion and Future Work | Perhaps surprisingly, translation performance, 30.90 BLEU , was around the level we obtained when using frequency to approximate function words at N = 64. |
Experimental Results | These results confirm that the pairwise dominance model can significantly increase performance as measured by the BLEU score, with a consistent pattern of results across the MT06 and MT08 test sets. |
Experimental Setup | In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004).
Abstract | Evaluated in French by 10-fold-cross validation, the system achieves a 9.3% Word Error Rate and a 0.83 BLEU score. |
Conclusion and perspectives | Evaluated by tenfold cross-validation, the system seems efficient, and its performance in terms of BLEU score and WER is quite encouraging.
Evaluation | The system was evaluated in terms of BLEU score (Papineni et al., 2001), Word Error Rate (WER) and Sentence Error Rate (SER). |
Evaluation | The copy-paste results just inform about the real deviation of our corpus from the traditional spelling conventions, and highlight the fact that our system is still at pains to significantly reduce the SER, while results in terms of WER and BLEU score are quite encouraging.
Experiments | We adopted three state-of-the-art metrics, BLEU (Papineni et al., 2002), NIST (Doddington et al., 2000) and METEOR (Banerjee and Lavie, 2005), to evaluate the translation quality. |
Experiments | Overall, the boldface numbers in the last row illustrate that our model obtains average improvements of 1.89, 1.76 and 1.61 on BLEU, NIST and METEOR, respectively.
Experiments | Models / BLEU / NIST / METEOR: CS 29.38 / 59.85 / 54.07; SMS 30.05 / 61.33 / 55.95; UBS 30.15 / 61.56 / 55.39; Stanford 30.40 / 61.94 / 56.01
Abstract | We obtain statistically significant improvements across 4 different language pairs with English as source, reaching up to +1.92 BLEU for Chinese as target.
Experiments | Our system (its) outperforms the baseline for all 4 language pairs for both BLEU and NIST scores, by a margin which scales up to +1.92 BLEU points for English to Chinese translation when training on the 400K set. |
Experiments | BLEU scores for 200K and 400K training sentence pairs. |
Experiments | Notably, as can be seen in Table 2(b), switching to a 4-gram LM results in performance gains for both the baseline and our system and while the margin between the two systems decreases, our system continues to deliver a considerable and significant improvement in translation BLEU scores. |
Conclusion | Our results showed improvement over the baselines both in intrinsic evaluations and on BLEU . |
Experiments & Results 4.1 Experimental Setup | BLEU (Papineni et al., 2002) is still the de facto evaluation metric for machine translation and we use that to measure the quality of our proposed approaches for MT. |
Experiments & Results 4.1 Experimental Setup | Table 6 reports the Bleu scores for different domains when the oov translations from the graph propagation are added to the phrase-table, and compares them with the baseline system.
Introduction | In general, copied-over oovs are a hindrance to fluent, high quality translation, and we can see evidence of this in automatic measures such as BLEU (Papineni et al., 2002) and also in human evaluation scores such as HTER. |
Abstract | Experimental evaluation on the ATIS domain shows that our model outperforms a competitive discriminative system both using BLEU and in a judgment elicitation study. |
Results | As can be seen, inclusion of lexical features gives our decoder an absolute increase of 6.73% in BLEU over the 1-BEST system.
Results | System / BLEU / METEOR: 1-BEST+BASE+ALIGN 21.93 / 34.01; k-BEST+BASE+ALIGN+LEX 28.66 / 45.18; k-BEST+BASE+ALIGN+LEX+STR 30.62 / 46.07; ANGELI 26.77 / 42.41
Results | over the 1-BEST system and 3.85% over ANGELI in terms of BLEU.
Experiment | Specifically, after integrating the inside context information of PAS into transformation, we can see that system IC-PASTR significantly outperforms system PASTR by 0.71 BLEU points. |
Experiment | Moreover, after we import the MEPD model into system PASTR, we get a significant improvement over PASTR (by 0.54 BLEU points). |
Experiment | We can see that this system further achieves a remarkable improvement over system PASTR (0.95 BLEU points). |
Experiments | Corpus / BLEU (%) / RCW (%)
Experiments | Table 4: Case-insensitive BLEU score and ratio of correct words (RCW) on the training, development and test corpus. |
Experiments | Table 4 shows the case-insensitive BLEU score and the percentage of words that are labeled as correct according to the method described above on the training, development and test corpus. |
SMT System | The performance, in terms of BLEU (Papineni et al., 2002) score, is shown in Table 4. |
Experiments | System / BLEU: Baseline 12.60; our system 13.06
Experiments | We measured the overall translation quality with the help of 4-gram BLEU (Papineni et al., 2002), which was computed on tokenized and lower-cased data for both systems. |
Experiments | We obtain a BLEU score of 13.06, which is a gain of 0.46 BLEU points over the baseline. |
Introduction | The translation quality is automatically measured using BLEU scores, and we confirm the findings by providing linguistic evidence (see Section 5). |
Abstract | Experiments on large scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic to English and +1.4 BLEU on Chinese to English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation. |
Experiments | The 3-feature version of VSM yields +1.8 BLEU over the baseline for Arabic to English, and +1.4 BLEU for Chinese to English. |
Experiments | For instance, with an initial Chinese system that employs linear mixture LM adaptation (lin-lm) and has a BLEU of 32.1, adding l-feature VSM adaptation (+vsm, joint) improves performance to 33.1 (improvement significant at p < 0.01), while adding 3-feature VSM instead (+vsm, 3 feat.) |
Experiments | To get an intuition for how VSM adaptation improves BLEU scores, we compared outputs from the baseline and VSM-adapted system (“vsm, joint” in Table 5) on the Chinese test data. |
Evaluation methodology | In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003). |
Results | 081 usually has the highest overall score (except BLEU ); it also has the highest scores for ‘regulations’ (more formal texts), while P1 scores are better for the news documents.
Results | Metric / sentence-level Median / Mean / Trimmed mean / corpus level: BLEU 0.357 / 0.298 / 0.348 / 0.833; NIST 0.357 / 0.291 / 0.347 / 0.810; Meteor 0.429 / 0.348 / 0.393 / 0.714; TER 0.214 / 0.186 / 0.204 / 0.619; GTM 0.429 / 0.340 / 0.392 / 0.714
Experiments | Table 3: BLEU scores for different datasets in different translation directions (left to right), broken down by training corpus (top to bottom).
Experiments | The BLEU scores for the different parallel corpora are shown in Table 3 and the top 10 out-of-vocabulary (OOV) words for each dataset are shown in Table 4. |
Experiments | However, by combining the Weibo parallel data with this standard data, improvements in BLEU are obtained. |
Code was provided by Deng et al. (2012). | To compute evaluation measures, we take the average scores of BLEU(1) and F-score (unigram-based with respect to content words) over k = 5 candidate captions.
Code was provided by Deng et al. (2012). | Therefore, we also report scores based on semantic matching, which gives partial credit to word pairs based on their lexical similarity.5 The best performing approach with semantic matching is VISUAL (with LM = Image corpus), improving BLEU, Precision, and F-score substantially over those of ORIG, demonstrating the extrinsic utility of our newly generated image-text parallel corpus in comparison to the original database.
Related Work | When computing BLEU with semantic matching, we look for the match with the highest similarity score among words that have not been matched before. |
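That greedy highest-similarity-first matching can be sketched as below; the `sim` word-similarity function is assumed to be supplied externally (e.g. from a lexical resource), and names and threshold are illustrative rather than the paper's implementation:

```python
def greedy_semantic_matches(cand_words, ref_words, sim, threshold=0.0):
    """Greedy one-to-one matching: repeatedly take the highest-similarity
    pair among still-unmatched words, so each word is matched at most once."""
    scored = [(sim(c, r), i, j)
              for i, c in enumerate(cand_words)
              for j, r in enumerate(ref_words)]
    scored.sort(reverse=True)  # best pairs first
    used_c, used_r, matches = set(), set(), []
    for s, i, j in scored:
        if s <= threshold:
            break  # remaining pairs are no better
        if i not in used_c and j not in used_r:
            used_c.add(i)
            used_r.add(j)
            matches.append((cand_words[i], ref_words[j], s))
    return matches
```

With exact string equality as `sim`, this degenerates to ordinary clipped unigram matching; a graded similarity gives the partial-credit behaviour described above.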
Experiments | In addition to precision and recall, we also evaluate the Bleu score (Papineni et al., 2002) changes before and after applying our measure word generation method to the SMT output. |
Experiments | For our test data, we only consider sentences containing measure words for Bleu score evaluation. |
Experiments | Our measure word generation step leads to a Bleu score improvement of 0.32 where the window size is set to 10, which shows that it can improve the translation quality of an English-to-Chinese SMT system. |
Experiments | Given an unlimited amount of time, we would tune the prior to maximize end-to-end performance, using an objective function such as BLEU . |
Experiments | We do compare VB against EM in terms of final BLEU scores in the translation experiments to ensure that this sparse prior has a sig- |
Experiments | Minimum Error Rate training (Och, 2003) over BLEU was used to optimize the weights for each of these models over the development test data. |
Abstract | In our experiments, our model improved 2.9 BLEU points for Japanese-English and 2.6 BLEU points for Chinese-English translation compared to the lexical reordering models.
Experiment | To stabilize the MERT results, we tuned three times by MERT using the first half of the development data and we selected the SMT weighting parameter set that performed the best on the second half of the development data based on the BLEU scores from the three SMT weighting parameter sets. |
Experiment | To investigate the tolerance for sparsity of the training data, we reduced the training data for the sequence model to 20,000 sentences for JE translation.14 SEQUENCE using this model with a distortion limit of 30 achieved a BLEU score of 32.22.15 Although the score is lower than the score of SEQUENCE with a distortion limit of 30 in Table 3, the score was still higher than those of LINEAR, LINEAR+LEX, and 9-CLASS for JE in Table 3. |
Abstract | For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. |
Discussion of Translation Results | The best result—a +1.04 BLEU average gain—was achieved when the class-based model training data, MT tuning set, and MT evaluation set contained the same genre. |
Introduction | For English-to-Arabic translation, we achieve a +1.04 BLEU average improvement by tiling our model on top of a large LM. |
Experiments | Unfortunately, variance in development set BLEU scores tends to be higher than in test set scores, despite SAMT MERT's inbuilt algorithms for overcoming local optima, such as random restarts and zeroing-out.
Experiments | We have noticed that using an L0-penalized BLEU score5 as MERT’s objective on the merged n-best lists over all iterations is more stable and will therefore use this score to determine N. |
Experiments | 5Given by: BLEU − 5 × |{i ∈ {1, …
Abstract | We present empirical results on a constrained Urdu-English translation task that demonstrate a significant BLEU score improvement and a large decrease in perplexity.
Related Work | Figure 9 shows a statistically significant improvement to the BLEU score when using the HHMM and the n-gram LMs together on this reduced test set. |
Abstract | On two Chinese-English tasks, our semi-supervised DAE features obtain statistically significant improvements of 1.34/2.45 (IWSLT) and 0.82/1.52 (NIST) BLEU points over the unsupervised DBN features and the baseline features, respectively.
Conclusions | The results also demonstrate that DNN (DAE and HCDAE) features are complementary to the original features for SMT, and adding them together obtain statistically significant improvements of 3.16 (IWSLT) and 2.06 (NIST) BLEU points over the baseline features. |
Experiments and Results | Adding new DNN features as extra features significantly improves translation accuracy (row 2-17 vs. 1), with the highest increase of 2.45 (IWSLT) and 1.52 (NIST) (row 14 vs. 1) BLEU points over the baseline features. |
Conclusion and Future Work | In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
Experiments | The 10-fold cross-validated BLEU score (Papineni et al., 2002) over this data is 0.81. |
Experiments | Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation. |