Abstract | Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation. |
Abstract | We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion. |
Abstract | Our best results improve a phrase-based statistical machine translation system trained on WMT 2012 French-English data by up to 2.0 BLEU, and the expected BLEU objective improves over a cross-entropy trained model by up to 0.6 BLEU in a single reference setup. |
Expected BLEU Training | The n-best lists serve as an approximation to the candidate translation set for f, used in the next step for expected BLEU training of the recurrent neural network model (§3).
Expected BLEU Training | 3.1 Expected BLEU Objective |
Expected BLEU Training | Formally, we define our loss function ℓ(θ) as the negative expected BLEU score, denoted xBLEU(θ), for a given foreign sentence f:
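The loss itself does not survive in this excerpt; a plausible reconstruction in standard notation is the following, where GEN(f) for the n-best candidate set, e* for the reference, and sBLEU for sentence-level BLEU are our symbols, not necessarily the paper's:

```latex
\ell(\theta) \;=\; -\,\mathrm{xBLEU}(\theta)
            \;=\; -\sum_{e \,\in\, \mathrm{GEN}(f)} \mathrm{sBLEU}(e, e^{*}) \; p_{\theta}(e \mid f)
```

Minimizing ℓ(θ) shifts probability mass toward candidates with high sentence-level BLEU, rather than toward the reference alone as cross-entropy does.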
Introduction | The expected BLEU objective provides an efficient way of achieving this for machine translation (Rosti et al., 2010; Rosti et al., 2011; He and Deng, 2012; Gao and He, 2013; Gao et al., 2014) instead of solely relying on traditional optimizers such as Minimum Error Rate Training (MERT) that only adjust the weighting of entire component models within the log-linear framework of machine translation (§3). |
Introduction | We test the expected BLEU objective by training a recurrent neural network language model and obtain substantial improvements. |
Recurrent Neural Network LMs | The model is trained with the back-propagation-through-time algorithm, which unrolls the network and then computes error gradients over multiple time steps (Rumelhart et al., 1986); we use the expected BLEU loss (§3) to obtain the error with respect to the output activations.
Experimental Setup | We evaluate our system using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). |
Methods | This could improve translation quality, as it brings our training scenario closer to our test scenario (test BLEU is always measured on unsegmented references). |
Related Work | We use both segmented and unsegmented language models, and tune automatically to optimize BLEU.
Related Work | (2008) also tune on unsegmented references by simply desegmenting SMT output before MERT collects sufficient statistics for BLEU.
Results | For English-to-Arabic, 1-best desegmentation results in a 0.7 BLEU point improvement over training on unsegmented Arabic. |
Results | Moving to lattice desegmentation more than doubles that improvement, resulting in a BLEU score of 34.4 and an improvement of 1.0 BLEU point over 1-best desegmentation. |
Results | 1000-best desegmentation also works well, resulting in a 0.6 BLEU point improvement over 1-best. |
Experiments | [Table: Recall, F1, and BLEU scores for the compared methods.]
Experiments | Method 4, named REBOL, implements REsponse-Based Online Learning by instantiating y+ and y- to the form described in Section 4: in addition to the model score s, it uses a cost function c based on sentence-level BLEU (Nakov et al., 2012) and tests translation hypotheses for task-based feedback using a binary execution function e.
Response-based Online Learning | Computation of the distance to the reference translation usually involves cost functions based on sentence-level BLEU (Nakov et al., 2012).
Response-based Online Learning | In addition, we can use translation-specific cost functions based on sentence-level BLEU in order to boost similarity of translations to human reference translations. |
Response-based Online Learning | Our cost function c(y(i), y) = 1 − BLEU(y(i), y) is based on a version of sentence-level BLEU (Nakov et al., 2012).
Abstract | Our best result improves over the best single MT system baseline by 1.0% BLEU and over a strong system selection baseline by 0.6% BLEU on a blind test set. |
Introduction | Our best system selection approach improves over our best baseline single MT system by 1.0% absolute BLEU point on a blind test set. |
MT System Selection | We run the 5,562 sentences of the classification training data through our four MT systems and produce sentence-level BLEU scores (with length penalty). |
MT System Selection | We pick the name of the MT system with the highest BLEU score as the class label for that sentence. |
MT System Selection | When there is a tie in BLEU scores, we pick the system label that yields better overall BLEU scores from the systems tied. |
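The labeling scheme described above (pick the system with the highest sentence-level BLEU, break ties by the tied systems' overall BLEU) can be sketched as follows; the function and variable names are ours, not the paper's:

```python
# Hypothetical sketch: label each training sentence with the MT system whose
# output has the highest sentence-level BLEU; ties go to the tied system
# with the better overall (corpus-level) BLEU.
def pick_labels(sent_bleu, overall_bleu):
    """sent_bleu: {system: [per-sentence BLEU]}; overall_bleu: {system: corpus BLEU}."""
    systems = list(sent_bleu)
    n_sents = len(next(iter(sent_bleu.values())))
    labels = []
    for i in range(n_sents):
        best = max(sent_bleu[s][i] for s in systems)
        tied = [s for s in systems if sent_bleu[s][i] == best]
        # tie-break: prefer the tied system with the highest overall BLEU
        labels.append(max(tied, key=lambda s: overall_bleu[s]))
    return labels
```

A usage example: with two systems tied on the first sentence, the label falls to the one with the better corpus score.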
Machine Translation Experiments | Feature weights are tuned to maximize BLEU on tuning sets using Minimum Error Rate Training (Och, 2003). |
Machine Translation Experiments | Results are presented in terms of BLEU (Papineni et al., 2002). |
Machine Translation Experiments | All differences in BLEU scores between the four systems are statistically significant above the 95% level. |
Abstract | Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets. |
Evaluation | We use case-insensitive BLEU (Papineni et al., 2002) to evaluate translation quality. |
Evaluation | Table 4 presents the results of these variations; overall, by taking into account generated candidates appropriately and using bigrams (“SLP 2-gram”), we obtained a 1.13 BLEU gain on the test set. |
Evaluation | In the “HalfMono” setting, we use only half of the monolingual comparable corpora, and still obtain an improvement of 0.56 BLEU points, indicating that adding more monolingual data is likely to improve the system further.
Introduction | This enhancement alone results in an improvement of almost 1.4 BLEU points. |
Introduction | We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models. |
Experimental Results | Group III: contains other important evaluation metrics, which were not considered in the WMT12 metrics task: NIST and ROUGE for both system- and segment-level, and BLEU and TER at segment-level. |
Experimental Results | II: TER .812 .836 .848; BLEU .810 .830 .846
Experimental Results | We can see that DR is already competitive by itself: on average, it has a correlation of .807, very close to BLEU and TER scores (.810 and .812, respectively). |
Experimental Setup | To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms). |
Experimental Setup | Combination of five metrics based on lexical similarity: BLEU, NIST, METEOR-ex, ROUGE-W, and TERp-A.
Related Work | A common argument is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012).
Related Work | For BLEU and TER, they observed improved correlation with human judgments on the MTC4 dataset when linearly interpolating these metrics with their lexical cohesion score. |
Abstract | The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements.
Abstract | We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets.
Abstract | The main finding is that unigram BLEU has a weak correlation, and Meteor has the strongest correlation with human judgements. |
Introduction | The main finding of our analysis is that TER and unigram BLEU are weakly correlated with human judgements, ROUGE-SU4 and Smoothed BLEU are moderately correlated, and the strongest correlation is found with Meteor.
Methodology | BLEU measures the effective overlap between a reference sentence X and a candidate sentence Y. |
Methodology | BLEU = BP · exp( ∑_{n=1}^{N} w_n log p_n )
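As a concrete illustration of this formula, here is a minimal single-reference BLEU sketch with uniform weights w_n = 1/N; real implementations handle multiple references and aggregate statistics over whole corpora, so treat this as illustrative only:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU = BP * exp(sum_n w_n log p_n), w_n = 1/max_n."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # clipped n-gram counts: a candidate n-gram is credited at most
        # as many times as it appears in the reference
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0  # a zero precision drives the geometric mean to zero
        log_p += math.log(clipped / total) / max_n
    # brevity penalty: BP = 1 if c > r, else exp(1 - r/c)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_p)
```

The brevity penalty is what distinguishes this from the unigram-precision variants discussed below: without it, overly short candidates are not penalized.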
Methodology | Unigram BLEU without a brevity penalty has been reported by Kulkarni et al.
Abstract | Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make. |
Abstract | However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved. |
Introduction | With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model. |
Introduction | With the SVM reranker, we obtain a significant improvement in BLEU scores over |
Introduction | Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases). |
Reranking with SVMs 4.1 Methods | In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments. |
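Deriving a preference order over candidate pairs from BLEU scores, as described above, can be sketched as follows (the function name and the optional margin are our assumptions, not the paper's):

```python
# Hypothetical sketch: turn per-candidate BLEU scores into (better, worse)
# preference pairs for training a ranking SVM, assuming BLEU approximates
# human preference judgments.
def preference_pairs(scored, margin=0.0):
    """scored: list of (candidate_id, bleu). Returns (better, worse) pairs."""
    pairs = []
    for i, (a, score_a) in enumerate(scored):
        for b, score_b in scored[i + 1:]:
            if score_a > score_b + margin:
                pairs.append((a, b))
            elif score_b > score_a + margin:
                pairs.append((b, a))
    return pairs
```

Candidates with (near-)equal BLEU yield no pair, which keeps the ranker from being trained on distinctions the metric cannot support.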
Simple Reranking | Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations |
Simple Reranking | Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93. |
Simple Reranking | In sum, although simple ranking helps to avoid vicious ambiguity in some cases, the overall results of simple ranking are no better than the perceptron model (according to BLEU, at least), as parse failures that are not reflective of human interpretive tendencies too often lead the ranker to choose dispreferred realizations.
Abstract | On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM.
Abstract | The NNJM features also produce a gain of +6.3 BLEU on top of a simpler baseline equivalent to Chiang’s (2007) original Hiero implementation.
Introduction | Additionally, we present several variations of this model which provide significant additive BLEU gains. |
Introduction | The NNJM features produce an improvement of +3.0 BLEU on top of a baseline that is already better than the 1st place MT12 result and includes
Introduction | Additionally, on top of a simpler decoder equivalent to Chiang’s (2007) original Hiero implementation, our NNJM features are able to produce an improvement of +6.3 BLEU, as much as all of the other features in our strong baseline system combined.
Model Variations | OpenMT12 1st Place: 49.5 BLEU (Ar-En), 32.6 BLEU (Ch-En)
Model Variations | BLEU scores are mixed-case. |
Model Variations | On Arabic-English, the primary S2T/L2R NNJM gains +1.4 BLEU on top of our baseline, while the S2T NNLTM gains another +0.8, and the directional variations gain +0.8 BLEU more.
Neural Network Joint Model (NNJM) | We demonstrate in Section 6.6 that using one hidden layer instead of two has minimal effect on BLEU.
Neural Network Joint Model (NNJM) | We demonstrate in Section 6.6 that using the self-normalized/pre-computed NNJM results in only a very small BLEU degradation compared to the standard NNJM.
Evaluation | Metric: Since we have four professional translation sets, we can calculate the Bilingual Evaluation Understudy (BLEU) score (Papineni et al., 2002) for one professional translator (P1) using the other three (P2,3,4) as a reference set.
Evaluation | In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations. |
Evaluation | This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators. |
Experiments | To assess and compare simplification systems, two main automatic metrics have been used in previous work, namely BLEU and the Flesch-Kincaid Grade Level Index (FKG).
Experiments | BLEU gives a measure of how close a system’s output is to the gold standard simple sentence. |
Experiments | Because there are many possible ways of simplifying a sentence, BLEU alone fails to correctly assess the appropriateness of a simplification. |
Related Work | (2010), namely an aligned corpus of 100/131 EWKP/SWKP sentences, and show that they achieve a better BLEU score.
Abstract | Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on NTCIR PatentMT corpus which is out-of-domain. |
Complexity Analysis | In this section, the proposed method is first validated on monolingual segmentation tasks, and then evaluated in the context of SMT to study whether the translation quality, measured by BLEU, can be improved.
Complexity Analysis | For the bilingual tasks, the publicly available Moses system (Koehn et al., 2007) with default settings is employed to perform machine translation, and BLEU (Papineni et al., 2002) is used to evaluate the quality.
Complexity Analysis | It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings. |
Introduction | • improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
Discussion | Table 6: Performance gain in BLEU over baseline and MR08 systems averaged over all test sets. |
Discussion | Table 9: Performance (BLEU score) comparison between non-oracle and oracle experiments.
Experiments | We use the NIST MT 06 dataset (1664 sentence pairs) for tuning, and the NIST MT 03, 05, and 08 datasets (919, 1082, and 1357 sentence pairs, respectively) for evaluation. We use BLEU (Papineni et al., 2002) for both tuning and evaluation.
Experiments | Our first group of experiments investigates whether the syntactic reordering models are able to improve translation quality in terms of BLEU . |
Experiments | Table 5: System performance in BLEU scores. |
Experiments | In Table 3, almost all BLEU scores are improved, no matter what strategy is used. |
Experiments | In particular, the best improvements, marked in bold, are as high as 1.24, 0.94, and 0.82 BLEU points over the baseline system on the NIST04, CWMT08 Development, and CWMT08 Evaluation data, respectively.
Related Work | They added the labels assigned to connectives as an additional input to an SMT system, but their experimental results show that the improvements under the evaluation metric of BLEU were not significant. |
Related Work | To the best of our knowledge, our work is the first attempt to exploit the source functional relationship to generate the target transitional expressions for grammatical cohesion, and we have successfully incorporated the proposed models into an SMT system with significant BLEU improvements.
Abstract | When the selected sentence pairs are evaluated on an end-to-end MT task, our methods can increase the translation performance by 3 BLEU points. |
Conclusion | Compared with the methods which only employ language model for data selection, we observe that our methods are able to select high-quality do-main-relevant sentence pairs and improve the translation performance by nearly 3 BLEU points. |
Experiments | The BLEU scores of the In-domain and General-domain baseline system are listed in Table 2. |
Experiments | The results show that General-domain system trained on a larger amount of bilingual resources outperforms the system trained on the in-domain corpus by over 12 BLEU points. |
Experiments | The horizontal coordinate represents the number of selected sentence pairs and vertical coordinate is the BLEU scores of MT systems. |
Our Approach | Figure 1: BLEU scores vs. k for SumBasic extraction.
Our Approach | Although BLEU (Papineni et al., 2002) scores are widely used for image caption evaluation, we find them to be poor indicators of the quality of our model. |
Conclusion | • The sense-based translation model is able to substantially improve translation quality in terms of both BLEU and NIST.
Experiments | System (BLEU % / NIST): STM (i5w) 34.64 / 9.4346; STM (i10w) 34.76 / 9.5114; STM (i15w) - / -
Experiments | System (BLEU % / NIST): Base 33.53 / 9.0561; STM (sense) 34.15 / 9.2596; STM (sense+lexicon) 34.73 / 9.4184
Experiments | System (BLEU % / NIST): Base 33.53 / 9.0561; Reformulated WSD 34.16 / 9.3820; STM 34.73 / 9.4184
Abstract | We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data. |
Conclusion | The results showed that our approach achieved a BLEU score gain of 1.61. |
Dependency-based Pre-ordering Rule Set | In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set. |
Experiments | For evaluation, we used BLEU scores (Papineni et al., 2002). |
Experiments | It shows the BLEU scores on the test set and the statistics of pre-ordering on the training set, which include the total count of each rule set and the number of sentences they were applied to.
Introduction | Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61. |
Introduction | In addition, the translation adequacy across different genres (ranging from formal news to informal web forum and public speech) and different languages (English and Chinese) is improved by replacing BLEU or TER with MEANT during parameter tuning (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b). |
Related Work | Surface-form oriented metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) do not correctly reflect the meaning similarities of the input sentence.
Related Work | In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy. |
Related Work | TINE (Rios et al., 2011) is a recall-oriented metric which aims to preserve the basic event structure but it performs comparably to BLEU and worse than METEOR on correlation with human adequacy judgments. |
Abstract | We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data. |
Conclusion and Future Work | The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data. |
Evaluation | Table 1 shows the case-insensitive IBM-version BLEU and TER scores of different systems. |
Evaluation | As seen from row -lm of Table 1, the removal of the skeletal language model results in a significant drop in both BLEU and TER performance.
Evaluation | Row s-space of Table 1 shows the BLEU and TER results of restricting the baseline system to the space of skeleton-consistent derivations, i.e., we remove both the skeleton-based translation model and language model from the SBMT system. |
Introduction | 0 We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data. |
Conclusion | We observed that this often fails to return the best output in terms of BLEU score, fluency, grammaticality and/or meaning. |
Results and Discussion | Figure 6: BLEU scores and grammar size (number of elementary TAG trees).
Results and Discussion | The average BLEU score is given with respect to all input (All) and to those inputs for which the systems generate at least one sentence (Covered). |
Results and Discussion | In terms of BLEU score, the best version of our system (AUTEXP) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results similar to the fully handcrafted UDEL system.
Experiments | The reported BLEU scores are averaged over 5 runs of MERT (Och, 2003).
Experiments | We illustrate the relationship among translation accuracy (BLEU), the number of retrieved documents (N) and the length of hidden layers (L) on different testing datasets.
Experiments | Figure 3: End-to-end translation results (BLEU %)
Experiments | Case-insensitive BLEU is employed as the evaluation metric. |
Experiments | Specifically, the Significance algorithm can safely discard 64% of the phrase table at its threshold 12 with only 0.1 BLEU loss in the overall test. |
Experiments | In contrast, our BRAE-based algorithm can remove 72% of the phrase table at its threshold 0.7 with only 0.06 BLEU loss in the overall evaluation. |
Introduction | The experiments show that up to 72% of the phrase table can be discarded without significant decrease on the translation quality, and in decoding with phrasal semantic similarities up to 1.7 BLEU score improvement over the state-of-the-art baseline can be achieved. |
Related Work | (2013) also use bag-of-words but learn BLEU-sensitive phrase embeddings.
Abstract | Experiments on a Chinese to English translation task show that our proposed RZNN can outperform the state-of-the-art baseline by about 1.5 points in BLEU . |
Conclusion and Future Work | We conduct experiments on a Chinese-to-English translation task, and our method outperforms a state-of-the-art baseline by about 1.5 BLEU points.
Experiments and Results | When we remove it from RZNN, WEPPE based method drops about 10 BLEU points on development data and more than 6 BLEU points on test data. |
Experiments and Results | TCBPPE based method drops about 3 BLEU points on both development and test data sets. |
Introduction | We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we get about 1.5 BLEU points improvement, compared with a state-of-the-art baseline system. |
Evaluation | We report on BLEU , NIST, METEOR, and word error rate metrics WER and PER. |
Experiments & Results | The BLEU scores, not included in the figure but shown in Table 2, show a similar trend. |
Experiments & Results | Statistical significance on the BLEU scores was tested using pairwise bootstrap sampling (Koehn, 2004). |
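Pairwise bootstrap sampling (Koehn, 2004) can be sketched as below; for brevity this resamples precomputed per-sentence scores with a generic aggregate function, whereas a faithful implementation would recompute corpus BLEU from resampled per-sentence n-gram statistics:

```python
import random

def paired_bootstrap(scores_a, scores_b, metric, trials=1000, seed=0):
    """Fraction of bootstrap resamples in which system A beats system B.

    scores_a / scores_b: per-sentence scores for the two systems on the
    same test set; metric: aggregates a resampled list into one number.
    """
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(trials):
        # draw a test set of the same size, with replacement,
        # using the SAME sentence indices for both systems (paired)
        idx = [rng.randrange(n) for _ in range(n)]
        if metric([scores_a[i] for i in idx]) > metric([scores_b[i] for i in idx]):
            wins += 1
    return wins / trials
```

If system A wins in at least 95% of resamples, the difference is conventionally reported as significant at the 95% level.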
Experiments & Results | Another discrepancy is found in the BLEU scores of the English-to-Chinese experiments, where we measure an unexpected drop in BLEU score below the baseline.
Abstract | We evaluate our model on a Chinese to English translation task and obtain up to 1.2 BLEU improvement over strong baselines. |
Experiments | We refer to the SMT model without domain adaptation as the baseline. LDA marginally improves machine translation (less than half a BLEU point).
Experiments | These improvements are not redundant: our new ptLDA-dict model, which has aspects of both models, yields the best performance among these approaches, with up to a 1.2 BLEU point gain (higher is better) and a 2.6-point TER reduction (lower is better).
Experiments | The BLEU improvement is significant (Koehn, 2004) at p = 0.01, except on MT03 with variational and variational-hybrid inference.
Experiments | We adopted three state-of-the-art metrics, BLEU (Papineni et al., 2002), NIST (Doddington et al., 2000) and METEOR (Banerjee and Lavie, 2005), to evaluate the translation quality. |
Experiments | Overall, the boldface numbers in the last row illustrate that our model obtains average improvements of 1.89, 1.76, and 1.61 on BLEU, NIST, and METEOR, respectively.
Experiments | Models (BLEU / NIST / METEOR): CS 29.38 / 59.85 / 54.07; SMS 30.05 / 61.33 / 55.95; UBS 30.15 / 61.56 / 55.39; Stanford 30.40 / 61.94 / 56.01
Abstract | On two Chinese-English tasks, our semi-supervised DAE features obtain statistically significant improvements of 1.34/2.45 (IWSLT) and 0.82/1.52 (NIST) BLEU points over the unsupervised DBN features and the baseline features, respectively.
Conclusions | The results also demonstrate that DNN (DAE and HCDAE) features are complementary to the original features for SMT, and adding them together obtain statistically significant improvements of 3.16 (IWSLT) and 2.06 (NIST) BLEU points over the baseline features. |
Experiments and Results | Adding new DNN features as extra features significantly improves translation accuracy (row 2-17 vs. 1), with the highest increase of 2.45 (IWSLT) and 1.52 (NIST) (row 14 vs. 1) BLEU points over the baseline features. |