Index of papers in Proc. ACL 2012 that mention

BLEU

Seen in text as:

BLEU (363)

Seen in 305 sentences in 22 papers.

1. Learning to Translate with Multiple Objectives

Duh, Kevin and Sudoh, Katsuhito and Wu, Xianchao and Tsukada, Hajime and Nagata, Masaaki

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Abstract	BLEU , TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality.
Experiments	As metrics we use BLEU and RIBES (which demonstrated good human correlation in this language pair (Goto et al., 2011)).
Experiments	As metrics we use BLEU and NTER.
Experiments	o BLEU = BP >< (Hprecn)1/4.
Introduction	These methods are effective because they tune the system to maximize an automatic evaluation metric such as BLEU , which serve as surrogate objective for translation quality.
Introduction	However, we know that a single metric such as BLEU is not enough.
Introduction	For example, while BLEU (Papineni et al., 2002) focuses on word-based n-gram precision, METEOR (Lavie and Agarwal, 2007) allows for stem/synonym matching and incorporates recall.
Multi-objective Algorithms	If we had used BLEU scores rather than the {0,1} labels in line 8, the entire PMO-PRO algorithm would revert to single-objective PRO.
Theory of Pareto Optimality 2.1 Definitions and Concepts	For example, suppose K = 2, M1(h) computes the BLEU score, and M2(h) gives the METEOR score of h. Figure 1 illustrates the set of vectors {M in a lO-best list.

BLEU is mentioned in 22 sentences in this paper.

Topics mentioned in this paper:

2. PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning

Chen, Boxing and Kuhn, Roland and Larkin, Samuel

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Abstract	Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU .
Abstract	In principle, tuning on these metrics should yield better systems than tuning on BLEU .
Abstract	It has a better correlation with human judgment than BLEU .
Introduction	0 BLEU (Papineni et al., 2002), NIST (Doddington, 2002), WER, PER, TER (Snover et al., 2006), and LRscore (Birch and Osborne, 2011) do not use external linguistic
Introduction	Among these metrics, BLEU is the most widely used for both evaluation and tuning.
Introduction	Many of the metrics correlate better with human judgments of translation quality than BLEU , as shown in recent WMT Evaluation Task reports (Callison-Burch et

BLEU is mentioned in 66 sentences in this paper.

Topics mentioned in this paper:

3. Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

He, Xiaodong and Deng, Li

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Abstract	In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset.
Abstract	The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system.
Abstract	parameters in the phrase and lexicon translation models are estimated by relative frequency or maximizing joint likelihood, which may not correspond closely to the translation measure, e.g., bilingual evaluation understudy ( BLEU ) (Papineni et al., 2002).

BLEU is mentioned in 44 sentences in this paper.

Topics mentioned in this paper:

4. Prediction of Learning Curves in Machine Translation

Kolachina, Prasanth and Cancedda, Nicola and Dymetman, Marc and Venkatapathy, Sriram

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Inferring a learning curve from mostly monolingual data	Our objective is to predict the evolution of the BLEU score on the given test set as a function of the size of a random subset of the training data
Inferring a learning curve from mostly monolingual data	We first train models to predict the BLEU score at m anchor sizes 81, .
Inferring a learning curve from mostly monolingual data	We then perform inference using these models to predict the BLEU score at each anchor, for the test case of interest.
Introduction	In both cases, the task consists in predicting an evaluation score ( BLEU , throughout this work) on the test corpus as a function of the size of a subset of the source sample, assuming that we could have it manually translated and use the resulting bilingual corpus for training.
Introduction	An extensive study across six parametric function families, empirically establishing that a certain three-parameter power-law family is well suited for modeling learning curves for the Moses SMT system when the evaluation score is BLEU .
Introduction	They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6).
Selecting a parametric family of curves	For a certain bilingual test dataset d, we consider a set of observations 0d 2 {(301, yl), ($2, yg)...(;vn, 3471)}, where y, is the performance on d (measured using BLEU (Papineni et al., 2002)) of a translation model trained on a parallel corpus of size 307;.
Selecting a parametric family of curves	The last condition is related to our use of BLEU —which is bounded by l — as a performance measure; It should be noted that some growth patterns which are sometimes proposed, such as a logarithmic regime of the form y 2 a + blog :10, are not
Selecting a parametric family of curves	The values are on the same scale as the BLEU scores.

BLEU is mentioned in 21 sentences in this paper.

Topics mentioned in this paper:

5. Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation

Wuebker, Joern and Ney, Hermann and Zens, Richard

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Abstract	At a speed of roughly 70 words per second, Moses reaches 17.2% BLEU , whereas our approach yields 20.0% with identical models.
Experimental Evaluation	system \| BLEU [%] \ #HYP \ #LM \ w/s N0 2 oo baseline 20.1 3.0K 322K 2.2 +presort 20.1 2.5K 183K 3.6 N0 = 100
Experimental Evaluation	We evaluate with BLEU (Papineni et al., 2002) and TER (Snover et al., 2006).
Experimental Evaluation	BLEU [%]
Introduction	We also run comparisons with the Moses decoder (Koehn et al., 2007), which yields the same performance in BLEU , but is outperformed significantly in terms of scalability for faster translation.

BLEU is mentioned in 15 sentences in this paper.

Topics mentioned in this paper:

LM (42)
BLEU (15)
beam search (6)

6. Joint Learning of a Dual SMT System for Paraphrase Generation

Sun, Hong and Zhou, Ming

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Abstract	In addition, a revised BLEU score (called iBLEU) which measures the adequacy and diversity of the generated paraphrase sentence is proposed for tuning parameters in SMT systems.
Experiments and Results	Jomt learnlng BLEU BLEU zB LE U No Joint 27.16 35.42 / oz 2 1 30.75 53.51 30.75
Experiments and Results	We show the BLEU score (computed against references) to measure the adequacy and self-BLEU (computed against source sentence) to evaluate the dissimilarity (lower is better).
Experiments and Results	From the results we can see that, when the value of a decreases to address more penalty on self-paraphrase, the self-BLEU score rapidly decays while the consequence effect is that BLEU score computed against references also drops seriously.
Introduction	The jointly-learned dual SMT system: (1) Adapts the SMT systems so that they are tuned specifically for paraphrase generation purposes, e. g., to increase the dissimilarity; (2) Employs a revised BLEU score (named iBLEU, as it’s an input-aware BLEU metric) that measures adequacy and dissimilarity of the paraphrase results at the same time.
Paraphrasing with a Dual SMT System	Two issues are also raised in (Zhao and Wang, 2010) about using automatic metrics: paraphrase changes less gets larger BLEU score and the evaluations of paraphrase quality and rate tend to be incompatible.
Paraphrasing with a Dual SMT System	iBLEU(s,rS,c) = aBLEU(c,7“S) — (l—a) BLEU (c,s) (3)
Paraphrasing with a Dual SMT System	BLEU (C, r3) captures the semantic equivalency between the candidates and the references (Finch et al.

BLEU is mentioned in 13 sentences in this paper.

Topics mentioned in this paper:

SMT systems (18)
BLEU (13)
BLEU score (10)

7. Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries

Liu, Chang and Ng, Hwee Tou

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Abstract	We show empirically that TESLA—CELAB significantly outperforms character-level BLEU in the English—Chinese translation evaluation tasks.
Experiments	4.3.1 BLEU
Experiments	Although word-level BLEU has often been found inferior to the new-generation metrics when the target language is English or other European languages, prior research has shown that character-level BLEU is highly competitive when the target language is Chinese (Li et al., 2011).
Experiments	use character-level BLEU as our main baseline.
Introduction	Since the introduction of BLEU (Papineni et al., 2002), automatic machine translation (MT) evaluation has received a lot of research interest.
Introduction	In the WMT shared tasks, many new generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments.
Introduction	Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human judged translation quality.

BLEU is mentioned in 19 sentences in this paper.

Topics mentioned in this paper:

BLEU (19)
word-level (15)
n-grams (12)

8. Deciphering Foreign Language by Combining Language Models and Context Vectors

Nuhn, Malte and Mauser, Arne and Ney, Hermann

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Experimental Evaluation	We show that our method performs better by 1.6 BLEU than the best performing method described in (Ravi and Knight, 2011) while
Experimental Evaluation	In case of the OPUS and VERBMOBIL corpus, we evaluate the results using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) to reference translations.
Experimental Evaluation	For BLEU higher values are better, for TER lower values are better.
Related Work	They perform experiments on a SpanislflEnglish task with vocabulary sizes of about 500 words and achieve a performance of around 20 BLEU compared to 70 BLEU obtained by a system that was trained on parallel data.

BLEU is mentioned in 16 sentences in this paper.

Topics mentioned in this paper:

LM (27)
BLEU (16)
translation model (15)

9. Mixing Multiple Translation Models in Statistical Machine Translation

Razmara, Majid and Foster, George and Sankaran, Baskaran and Sarkar, Anoop

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Baselines	where m ranges over IN and OUT, pm(é\| f) is an estimate from a component phrase table, and each Am is a weight in the top-level log-linear model, set so as to maximize dev-set BLEU using minimum error rate training (Och, 2003).
Conclusion & Future Work	We showed that this approach can gain up to 2.2 BLEU points over its concatenation baseline and 0.39 BLEU points over a powerful mixture model.
Ensemble Decoding	In Section 4.2, we compare the BLEU scores of different mixture operations on a French-English experimental setup.
Ensemble Decoding	However, experiments showed changing the scores with the normalized scores hurts the BLEU score radically.
Ensemble Decoding	However, we did not try it as the BLEU scores we got using the normalization heuristic was not promissing and it would impose a cost in decoding as well.
Experiments & Results 4.1 Experimental Setup	Since the Hiero baselines results were substantially better than those of the phrase-based model, we also implemented the best-performing baseline, linear mixture, in our Hiero-style MT system and in fact it achieves the hights BLEU score among all the baselines as shown in Table 2.
Experiments & Results 4.1 Experimental Setup	This baseline is run three times the score is averaged over the BLEU scores with standard deviation of 0.34.
Experiments & Results 4.1 Experimental Setup	We also reported the BLEU scores when we applied the span-wise normalization heuristic.

BLEU is mentioned in 10 sentences in this paper.

Topics mentioned in this paper:

10. Improve SMT Quality with Automatically Extracted Paraphrase Rules

He, Wei and Wu, Hua and Wang, Haifeng and Liu, Ting

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Abstract	The experimental results show that our proposed approach achieves significant improvements of l.6~3.6 points of BLEU in the oral domain and 0.5~l points in the news domain.
Discussion	on BLEU score
Experiments	The metrics for automatic evaluation were BLEU 3 and TER 4 (Snover et al., 2005).
Experiments	(00,-, 01,-) are selected for the extraction of paraphrase rules if two conditions are satisfied: (1) BLEU(eZi) — BLEU(eli) > 61, and (2) BLEU(eZi) > 62, where BLEU(-) is a function for computing BLEU score; 61 and 62 are thresholds for balancing the rules number and the quality of paraphrase rules.
Experiments	Our system gains significant improvements of 1.6~3.6 points of BLEU in the oral domain, and 0.5~1 points of BLEU in the news domain.
Extraction of Paraphrase Rules	As mentioned above, the detailed procedure is: T1 = S1 = T2 = Finally we compute BLEU (Papineni et al.
Extraction of Paraphrase Rules	If the sentence in T 2 has a higher BLEU score than the aligned sentence in T1, the corresponding sentences in S0 and S1 are selected as candidate paraphrase sentence pairs, which are used in the following steps of paraphrase extractions.
Introduction	The experimental results show that our proposed approach achieves significant improvements of l.6~3.6 points of BLEU in the oral domain and 0.5~l points in the news domain.

BLEU is mentioned in 9 sentences in this paper.

Topics mentioned in this paper:

11. Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

Simianer, Patrick and Riezler, Stefan and Dyer, Chris

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Experiments	Training data for discriminative learning are prepared by comparing a 100-best list of translations against a single reference using smoothed per-sentence BLEU (Liang et al., 2006a).
Experiments	Figure 4 gives a boxplot depicting BLEU-4 results for 100 runs of the MIRA implementation of the cdec package, tuned on deV-nc, and evaluated on the respective test set test-11c.6 We see a high variance (whiskers denote standard deviations) around a median of 27.2 BLEU and a mean of 27.1 BLEU .
Experiments	In contrast, the perceptron is deterministic when started from a zero-vector of weights and achieves favorable 28.0 BLEU on the news-commentary test set.
Joint Feature Selection in Distributed Stochastic Learning	Let each translation candidate be represented by a feature vector x 6 RD where preference pairs for training are prepared by sorting translations according to smoothed sentence-wise BLEU score (Liang et al., 2006a) against the reference.

BLEU is mentioned in 8 sentences in this paper.

Topics mentioned in this paper:

12. A Ranking-based Approach to Word Reordering for Statistical Machine Translation

Yang, Nan and Li, Mu and Zhang, Dongdong and Yu, Nenghai

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Conclusion and Future Work	Large scale experiment shows improvement on both reordering metric and SMT performance, with up to 1.73 point BLEU gain in our evaluation test.
Experiments	Table 2: BLEU (%) score on dev and test data for both EJ and J-E experiment.
Experiments	We compare their influence on RankingSVM accuracy, alignment crossing-link number, end-to-end BLEU score, and the model size.
Experiments	CLN BLEU Feat.# tag+label 88.6 16.4 22.24 26k +dst 91.5 13.5 22.66 55k E_J +pct 92.2 13.1 22.73 79k +lezv100 92.9 12.1 22.85 347k +l€$1000 94.0 11.5 22.79 2,410k +l€$2000 95.2 10.7 22.81 3,794k tag+fw 85.0 18.6 25.43 31k +dst 90.3 16.9 25.62 65k J_E +lezv100 91.6 15.7 25.87 293k +l€$1000 92.4 14.8 25.91 2,156k +le$2000 93.0 14.3 25.84 3,297k

BLEU is mentioned in 7 sentences in this paper.

Topics mentioned in this paper:

13. Modeling the Translation of Predicate-Argument Structure for SMT

Xiong, Deyi and Zhang, Min and Li, Haizhou

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Conclusions and Future Work	EXperimental results show that both models are able to significantly improve translation accuracy in terms of BLEU score.
Experiments	Statistical significance in BLEU differences
Experiments	Our first group of experiments is to investigate whether the predicate translation model is able to improve translation accuracy in terms of BLEU and whether semantic features are useful.
Experiments	0 The proposed predicate translation models achieve an average improvement of 0.57 BLEU points across the two NIST test sets when all features (lex+sem) are used.

BLEU is mentioned in 7 sentences in this paper.

Topics mentioned in this paper:

14. Learning Translation Consensus with Structured Label Propagation

Liu, Shujie and Li, Chi-Ho and Li, Mu and Zhou, Ming

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Conclusion and Future Work	In this paper, we only tried Dice coefficient of n-grams and symmetrical sentence level BLEU as similarity measures.
Experiments and Results	Instead of using graph-based consensus confidence as features in the log-linear model, we perform structured label propagation (Struct-LP) to re-rank the n-best list directly, and the similarity measures for source sentences and translation candidates are symmetrical sentence level BLEU (equation (10)).
Features and Training	defined in equation (3), takes symmetrical sentence level BLEU as similarity measure]:
Features and Training	BLEUWW ) = (10) where i — BLE U (f, f ') is the IBM BLEU score computed over i-grams for hypothesis f using f ’ as reference.
Features and Training	1 BLEU is not symmetric, which means, different scores are obtained depending on which one is reference and which one is hypothesis.
Graph Construction	In our experiment we measure similarity by symmetrical sentence level BLEU of source sentences, and 0.3 is taken as the threshold for edge creation.

BLEU is mentioned in 7 sentences in this paper.

Topics mentioned in this paper:

15. Topic Models for Dynamic Translation Model Adaptation

Eidelman, Vladimir and Boyd-Graber, Jordan and Resnik, Philip

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Abstract	Conditioning lexical probabilities on the topic biases translations toward topic-relevant output, resulting in significant improvements of up to 1 BLEU and 3 TER on Chinese to English translation over a strong baseline.
Experiments	2010) as our decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 tuning corpus using the Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006; Eidelman, 2012).
Experiments	On FBIS, we can see that both models achieve moderate but consistent gains over the baseline on both BLEU and TER.
Experiments	The best model, LTM-10, achieves a gain of about 0.5 and 0.6 BLEU and 2 TER.
Introduction	Incorporating these features into our hierarchical phrase-based translation system significantly improved translation performance, by up to l BLEU and 3 TER over a strong Chinese to English baseline.

BLEU is mentioned in 7 sentences in this paper.

Topics mentioned in this paper:

16. Large-Scale Syntactic Language Modeling with Treelets

Pauls, Adam and Klein, Dan

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Experiments	The BLEU scores for these outputs are 32.7, 27.8, and 20.8.
Experiments	In particular, their translations had a lower BLEU score, making their task easier.
Experiments	We see that our system prefers the reference much more often than the S-GRAM language model.11 However, we also note that the easiness of the task is correlated with the quality of translations (as measured in BLEU score).

BLEU is mentioned in 6 sentences in this paper.

Topics mentioned in this paper:

Xiao, Xinyan and Xiong, Deyi and Zhang, Min and Liu, Qun and Lin, Shouxun

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Experiments	Is our topic similarity model able to improve translation quality in terms of BLEU ?
Experiments	Case-insensitive NIST BLEU (Papineni et al., 2002) was used to mea-
Experiments	By using all the features (last line in the table), we improve the translation performance over the baseline system by 0.87 BLEU point on average.
Introduction	Experiments on Chinese-English translation tasks (Section 6) show that, our method outperforms the baseline hierarchial phrase-based system by +0.9 BLEU points.

BLEU is mentioned in 6 sentences in this paper.

Topics mentioned in this paper:

18. Machine Translation without Words through Substring Alignment

Neubig, Graham and Watanabe, Taro and Mori, Shinsuke and Kawahara, Tatsuya

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Conclusion and Future Directions	12Similar results were found for character and word—based BLEU , but are omitted for lack of space.
Experiments	Minimum error rate training was performed to maximize word-based BLEU score for all systems.11 For language models, word-based translation uses a word S-gram model, and character-based translation uses a character 12-gram model, both smoothed using interpolated Kneser—Ney.
Experiments	We evaluate translation quality using BLEU score (Papineni et al., 2002), both on the word and character level (with n = 4), as well as METEOR (Denkowski and Lavie, 2011) on the word level.
Experiments	When compared with word-based translation, character-based translation achieves better, comparable, or inferior results on character-based BLEU, comparable or inferior results on METEOR, and inferior results on word-based BLEU .

BLEU is mentioned in 5 sentences in this paper.

Topics mentioned in this paper:

19. Head-Driven Hierarchical Phrase-based Translation

Li, Junhui and Tu, Zhaopeng and Zhou, Guodong and van Genabith, Josef

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Abstract	Experiments on Chinese—English translation on four NIST MT test sets show that the HD—HPB model significantly outperforms Chiang’s model with average gains of 1.91 points absolute in BLEU .
Experiments	For evaluation, the NIST BLEU script (version 12) with the default settings is used to calculate the BLEU scores.
Experiments	Table 3 lists the translation performance with BLEU scores.
Experiments	Table 3 shows that our HD-HPB model significantly outperforms Chiang’s HPB model with an average improvement of 1.91 in BLEU (and similar improvements over Moses HPB).

BLEU is mentioned in 5 sentences in this paper.

Topics mentioned in this paper:

20. A Comparative Study of Target Dependency Structures for Statistical Machine Translation

Wu, Xianchao and Sudoh, Katsuhito and Duh, Kevin and Tsukada, Hajime and Nagata, Masaaki

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Experiments	training data and not necessarily exactly follow the tendency of the final BLEU scores.
Experiments	For example, CCG is worse than Malt in terms of P/R yet with a higher BLEU score.
Experiments	Also, PAS+sem has a lower P/R than Berkeley, yet their final BLEU scores are not statistically different.

BLEU is mentioned in 5 sentences in this paper.

Topics mentioned in this paper:

dependency trees (17)
CCG (10)
BLEU (5)

21. Concept-to-text Generation via Discriminative Reranking

Konstas, Ioannis and Lapata, Mirella

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Abstract	Experimental evaluation on the ATIS domain shows that our model outperforms a competitive discriminative system both using BLEU and in a judgment elicitation study.
Results	As can be seen, inclusion of lexical features gives our decoder an absolute increase of 6.73% in BLEU over the l-BEST system.
Results	System BLEU METEOR l-BEST+BASE+ALIGN 21.93 34.01 k-BEST+BASE+ALIGN+LEX 28.66 45.18 k-BEST+BASE+ALIGN+LEX+STR 30.62 46.07 ANGELI 26.77 42.41
Results	over the l-BEST system and 3.85% over ANGELI in terms of BLEU .

BLEU is mentioned in 4 sentences in this paper.

Topics mentioned in this paper:

22. A Class-Based Agreement Model for Generating Accurately Inflected Translations

Green, Spence and DeNero, John

In Proc. ACL 2012, part of Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Abstract	For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline.
Discussion of Translation Results	The best result—a +1.04 BLEU average gain—was achieved when the class-based model training data, MT tuning set, and MT evaluation set contained the same genre.
Introduction	For English-to-Arabic translation, we achieve a +1.04 BLEU average improvement by tiling our model on top of a large LM.

BLEU is mentioned in 3 sentences in this paper.

Topics mentioned in this paper:

phrase-based (10)
CRF (9)
LM (9)