Abstract | The training objective is an expected BLEU score, which is closely linked to translation quality. |
Abstract | bold updating), the author proposed a local updating strategy where the model parameters are updated towards a pseudo-reference (i.e., the hypothesis in the n-best list that gives the best BLEU score). |
Abstract | In our work, we use the expectation of BLEU scores as the objective. |
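The expected-BLEU objective mentioned above can be sketched as follows; this is a minimal illustration only (the softmax parameterization over n-best model scores and the scaling factor `gamma` are assumptions, not the authors' exact formulation):

```python
import math

def expected_bleu(model_scores, bleu_scores, gamma=1.0):
    """Expected BLEU over an n-best list: each hypothesis's BLEU is
    weighted by its probability under a softmax of the model scores."""
    m = max(model_scores)  # subtract the max for numerical stability
    exps = [math.exp(gamma * (s - m)) for s in model_scores]
    z = sum(exps)
    return sum(w / z * b for w, b in zip(exps, bleu_scores))

# Uniform model scores reduce the expectation to a plain average (0.3 here);
# sharpening the distribution (large gamma) approaches the 1-best BLEU.
avg = expected_bleu([0.0, 0.0], [0.2, 0.4])
```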
Abstract | The large-scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system. |
Experimental results | We substitute our language model and use MERT (Och, 2003) to optimize the BLEU score (Papineni et al., 2002). |
Experimental results | We partition the data into ten pieces: nine pieces are used as training data to optimize the BLEU score (Papineni et al., 2002) by MERT (Och, 2003), and the remaining piece is used to re-rank the 1000-best list and obtain the BLEU score.
Introduction | ply our language models to the task of re-ranking the N-best list from Hiero (Chiang, 2005; Chiang, 2007), a state-of-the-art parsing-based MT system, we achieve significantly better translation quality measured by the BLEU score and “readability”. |
Abstract | We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores. |
Experimental Setup and Results | Wherever meaningful, we report the average BLEU scores over 10 data sets along with the maximum and minimum values and the standard deviation. |
Experimental Setup and Results | Table 1: BLEU scores for a variety of transformation combinations |
Experimental Setup and Results | Note that in this case, the translations would be generated in the same format, but we then split such postpositions from the words they are attached to during decoding, and then evaluate the BLEU score. |
Introduction | We find that with the full set of syntax-to-morphology transformations and some additional techniques we can get about 39% relative improvement in BLEU scores over a word-based baseline and about 28% improvement of a factored baseline, all experiments being done over 10 training and test sets. |
Syntax-to-Morphology Mapping | We find (and elaborate later) that this reduction in the English side of the training corpus, in general, is about 30%, and is correlated with improved BLEU scores. |
Discussion | Table 6: CRR translation results (BLEU scores) by using different RBMT systems |
Discussion | The BLEU scores are 43.90 and 29.77 for System A and System B, respectively. |
Discussion | If we compare the results with those only using SMT systems as described in Table 3, the translation quality was greatly improved by at least 3 BLEU points, even if the translation ac-
Experiments | Translation quality was evaluated using both the BLEU score proposed by Papineni et al. |
Experiments | The results also show that our translation selection method is very effective, achieving absolute improvements of about 4 and 1 BLEU points on CRR and ASR inputs, respectively. |
Experiments | As compared with those in Table 3, the translation quality was greatly improved, with absolute improvements of at least 5.1 and 3.9 BLEU points on CRR and ASR inputs for system combination results. |
Translation Selection | In this paper, we modify the method in Albrecht and Hwa (2007) to only prepare human reference translations for the training examples, and then evaluate the translations produced by the subject systems against the references using BLEU score (Papineni et al., 2002). |
Translation Selection | We use smoothed sentence-level BLEU score to replace the human assessments, where we use additive smoothing to avoid zero BLEU scores when we calculate the n-gram precisions. |
Translation Selection | In the context of translation selection, y is assigned as the smoothed BLEU score. |
Abstract | Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make. |
Abstract | However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved. |
Introduction | With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model. |
Introduction | With the SVM reranker, we obtain a significant improvement in BLEU scores over |
Introduction | Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases). |
Reranking with SVMs 4.1 Methods | In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments. |
Reranking with SVMs 4.1 Methods | The complete model, BBS+dep+nbest, achieved a BLEU score of 88.73, significantly improving upon the perceptron model (p < 0.02). |
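The pairwise setup described above, using BLEU differences to induce a preference order over candidate realizations, can be sketched as follows (the function name and data layout are illustrative assumptions, not the authors' code):

```python
def preference_pairs(candidates):
    """Turn (realization, BLEU) tuples into ordered (better, worse)
    training pairs for a pairwise ranker; BLEU ties yield no pair."""
    pairs = []
    for i, (real_i, bleu_i) in enumerate(candidates):
        for real_j, bleu_j in candidates[i + 1:]:
            if bleu_i > bleu_j:
                pairs.append((real_i, real_j))
            elif bleu_j > bleu_i:
                pairs.append((real_j, real_i))
    return pairs
```

Each emitted pair becomes one ranking constraint for the SVM: the first element should receive a higher model score than the second.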
Simple Reranking | Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations |
Simple Reranking | Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93. |
Abstract | We propose and extensively evaluate a simple method for using alignment models to produce alignments better-suited for phrase-based MT systems, and show significant gains (as measured by BLEU score) in end-to-end translation systems for six language pairs used in recent MT competitions. |
Conclusions | Table 3: BLEU scores for all language pairs using all available data. |
Introduction | Our contribution is a large-scale evaluation of this methodology for word alignments, an investigation of how the produced alignments differ, and how they can be used to consistently improve machine translation performance (as measured by BLEU score) across many languages on training corpora with up to a hundred thousand sentences. |
Introduction | In 10 out of 12 cases we improve BLEU score by at least ⅓ point, and by more than 1 point in 4 out of 12 cases. |
Phrase-based machine translation | We report BLEU scores using a script available with the baseline system. |
Phrase-based machine translation | Figure 8: BLEU score as the amount of training data is increased on the Hansards corpus for the best decoding method for each alignment model. |
Phrase-based machine translation | In principle, we would like to tune the threshold by optimizing BLEU score on a development set, but that is impractical for experiments with many pairs of languages. |
Word alignment results | Unfortunately, as was shown by Fraser and Marcu (2007), AER can have weak correlation with translation performance as measured by BLEU score (Papineni et al., 2002) when the alignments are used to train a phrase-based translation system. |
Experiments | We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on the dev set. |
Experiments | The BLEU score of the baseline 1-best decoding is 0.2325, which is consistent with the result of 0.2302 in (Liu et al., 2007) on the same training, development and test sets, and with the same rule extraction procedure. |
Abstract | In addition, a revised BLEU score (called iBLEU) which measures the adequacy and diversity of the generated paraphrase sentence is proposed for tuning parameters in SMT systems. |
Conclusion | Furthermore, a revised BLEU score that balances between paraphrase adequacy and dissimilarity is proposed in our training process. |
Discussion | The first part of iBLEU, which is the traditional BLEU score, helps to ensure the quality of the machine translation results. |
Experiments and Results | We show the BLEU score (computed against references) to measure the adequacy and self-BLEU (computed against source sentence) to evaluate the dissimilarity (lower is better). |
Experiments and Results | From the results we can see that, when the value of α decreases to place more penalty on self-paraphrase, the self-BLEU score rapidly decays, while the BLEU score computed against references also drops seriously. |
Experiments and Results | This is not achievable without joint learning, or with the traditional BLEU score, which does not take self-paraphrase into consideration. |
Introduction | The jointly-learned dual SMT system: (1) adapts the SMT systems so that they are tuned specifically for paraphrase generation purposes, e.g., to increase the dissimilarity; (2) employs a revised BLEU score (named iBLEU, as it is an input-aware BLEU metric) that measures adequacy and dissimilarity of the paraphrase results at the same time. |
Paraphrasing with a Dual SMT System | Two issues are also raised in (Zhao and Wang, 2010) about using automatic metrics: a paraphrase that changes less gets a larger BLEU score, and the evaluations of paraphrase quality and rate tend to be incompatible. |
Paraphrasing with a Dual SMT System | (2005) have shown the capability for measuring semantic equivalency using BLEU score); BLEU(c, s) is the BLEU score computed between the candidate and the source sentence to measure the dissimilarity. |
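An iBLEU-style combination, rewarding BLEU against the references while penalizing BLEU(c, s) against the source, can be sketched as below; this is a minimal sketch, and the default alpha is an assumed value, not necessarily the setting used in the source:

```python
def ibleu(bleu_refs, bleu_src, alpha=0.8):
    """iBLEU-style score: alpha trades off adequacy (BLEU against the
    references) against dissimilarity (BLEU against the source, i.e.
    self-BLEU, which is penalized)."""
    return alpha * bleu_refs - (1.0 - alpha) * bleu_src

# A candidate that copies the source verbatim is penalized even when
# its BLEU against the references is identical to a true paraphrase's.
paraphrase_score = ibleu(bleu_refs=0.5, bleu_src=0.2)
copy_score = ibleu(bleu_refs=0.5, bleu_src=0.9)
```

With alpha = 1 the metric degenerates to plain BLEU, which is exactly the failure mode discussed above: self-paraphrase goes unpenalized.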
Integration of inflection models with MT systems | We performed a grid search on the values of λ and n to maximize the BLEU score of the final system on a development set (dev) of 1000 sentences (Table 2). |
MT performance results | We also report oracle BLEU scores which incorporate two kinds of oracle knowledge. |
MT performance results | For the methods using n=1 translation from a base MT system, the oracle BLEU score is the BLEU score of the stemmed translation compared to the stemmed reference, which represents the upper bound achievable by changing only the inflected forms (but not stems) of the words in a translation. |
MT performance results | This system achieves a substantially better BLEU score (by 6.76 points) than the treelet system. |
Abstract | We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score. |
Conclusion | The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks. |
Experiments | Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English. |
Experiments | minimum error rate training (Och, 2003) with BLEU score as the objective function. |
Experiments | Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task. |
Abstract | We show empirical results on the OPUS data: our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders of magnitude faster). |
Discussion and Future Work | These, when combined with standard MT systems such as Moses (Koehn et al., 2007) trained on parallel corpora, have been shown to yield some BLEU score improvements. |
Experiments and Results | To evaluate translation quality, we use the BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation. |
Experiments and Results | We show that our method achieves the best performance (BLEU scores) on this task while being significantly faster than both of the previous approaches. |
Experiments and Results | For both the MT tasks, we also report BLEU scores for a baseline system using identity translations for common words (words appearing in both source/target vocabularies) and random translations for other words. |
Experiments | In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences. |
Experiments | Figure 6 presents decoder scores and BLEU scores as functions of time for the two corpora. |
Background | where BLEU(e_ij, r_i) is the smoothed sentence-level BLEU score (Liang et al., 2006) of the translation e_ij with respect to the reference translations r_i, and e_i* is the oracle translation, which is selected from {e_i1, ..., e_im} in terms of BLEU(e_ij, r_i). |
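The oracle selection described above, choosing the hypothesis from the n-best list with the best sentence-level BLEU, can be sketched as follows; the bigram-order, add-one-smoothed scorer here is an illustrative stand-in, not the smoothed metric of Liang et al. (2006):

```python
from collections import Counter
import math

def sentence_bleu_proxy(hyp, ref, max_n=2, k=1.0):
    """Toy smoothed sentence-level BLEU (bigram order, add-k smoothed);
    a stand-in for BLEU(e_ij, r_i) in the description above."""
    h, r = hyp.split(), ref.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        hg = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
        rg = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        m = sum(min(c, rg[g]) for g, c in hg.items())
        t = max(sum(hg.values()), 1)
        log_p += math.log((m + k) / (t + k))
    bp = min(1.0, math.exp(1 - len(r) / max(len(h), 1)))
    return bp * math.exp(log_p / max_n)

def select_oracle(n_best, reference):
    """Oracle translation: argmax over the n-best list of sentence-level BLEU."""
    return max(n_best, key=lambda e: sentence_bleu_proxy(e, reference))
```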
Background | Figures 2-5 show the BLEU curves on the development and test sets, where the x-axis is the iteration number, and the y-axis is the BLEU score of the system generated by the boosting-based system combination. |
Background | The BLEU scores tend to converge to the stable values after 20 iterations for all the systems. |
Conclusion and Future Work | Using all constituency-to-dependency translation rules and bilingual phrases, our model achieves a significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system. |
Experiments | We use standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on the development set. |
Experiments | The baseline system extracts 31.9M 625 rules and 77.9M 525 rules respectively, and achieves a BLEU score of 34.17 on the test set. |
Experiments | As shown in the third line of the BLEU score column, the performance drops 1.7 BLEU points below the baseline system due to the poorer rule coverage. |
Abstract | As compared to baseline systems, we achieve absolute improvements of 2.40 BLEU points on a phrase-based SMT system and 1.76 BLEU points on a parsing-based SMT system. |
Conclusion | The improved word alignment results in an improvement of 2.16 BLEU points on a phrase-based SMT system and an improvement of 1.76 BLEU points on a parsing-based SMT system. |
Conclusion | When we also used phrase collocation probabilities as additional features, the phrase-based SMT performance is finally improved by 2.40 BLEU points as compared with the baseline system. |
Experiments on Parsing-Based SMT | The system using the improved word alignments achieves an absolute improvement of 1.76 BLEU points, which indicates that the improvements in word alignment are also effective for improving the performance of parsing-based SMT systems. |
Experiments on Phrase-Based SMT | If the same alignment method is used, the systems using CM-3 got the highest BLEU scores. |
Experiments on Phrase-Based SMT | When the phrase collocation probabilities are incorporated into the SMT system, the translation quality is improved, achieving an absolute improvement of 0.85 BLEU points. |
Experiments on Phrase-Based SMT | As compared with the baseline system, an absolute improvement of 2.40 BLEU points is achieved. |
Introduction | The alignment improvement results in an improvement of 2.16 BLEU points on the phrase-based SMT system and an improvement of 1.76 BLEU points on the parsing-based SMT system. |
Introduction | SMT performance is further improved by 0.24 BLEU points. |
Conclusion | As our decoder accounts for multiple derivations, we extend the MERT algorithm to tune feature weights with respect to BLEU score for max-translation decoding. |
Experiments | Table 2: Comparison of individual decoding (time in seconds/sentence) and BLEU score (case-insensitive). |
Experiments | With conventional max-derivation decoding, the hierarchical phrase-based model achieved a BLEU score of 30.11 on the test set, with an average decoding time of 40.53 seconds/sentence. |
Experiments | We found that accounting for all possible derivations in max-translation decoding resulted in a small negative effect on BLEU score (from 30.11 to 29.82), even though the feature weights were tuned with respect to BLEU score . |
Introduction | 0 As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4). |
Experiments | MERT is then performed to optimize the BLEU score on a development set; For MERT, we use 40 random initial parameters as well as parameters computed using corpus based statistics (Tromble et al., 2008). |
Experiments | We consider a BLEU score difference to be a) a gain if it is at least 0.2 points, b) a drop if it is at most -0.2 points, and c) no change otherwise. |
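The ±0.2-point bucketing described above is simple to state in code (a sketch; the function name is ours, not from the source):

```python
def classify_bleu_delta(delta, threshold=0.2):
    """Label a BLEU score difference as a gain, a drop, or no change,
    using the +/-0.2-point threshold from the text."""
    if delta >= threshold:
        return "gain"
    if delta <= -threshold:
        return "drop"
    return "no change"
```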
Experiments | When MBR does not produce a higher BLEU score relative to MAP on the development set, MERT assigns a higher weight to this feature function. |
Introduction | Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice. |
Introduction | We employ MERT to select these weights by optimizing BLEU score on a development set. |
Introduction | In contrast, our MBR algorithm directly selects the hypothesis in the hypergraph with the maximum expected approximate corpus BLEU score (Tromble et al., 2008). |
MERT for MBR Parameter Optimization | We now have a total of N+2 feature functions, which we optimize using MERT to obtain the highest BLEU score on a training set. |
Minimum Bayes-Risk Decoding | (2008) extended MBR decoding to translation lattices under an approximate BLEU score . |
Experiments | Table 3: Comparison of BLEU scores for tree-based and forest-based tree-to-tree models. |
Experiments | Table 3 shows the BLEU scores of tree-based and forest-based tree-to-tree models achieved on the test set over different pruning thresholds. |
Experiments | With the increase of the number of rules used, the BLEU score increased accordingly. |
Experimental Results | Table 1: BLEU scores for Viterbi, Crunching, MBR, and variational decoding. |
Experimental Results | Table 1 presents the BLEU scores under Viterbi, crunching, MBR, and variational decoding. |
Experimental Results | Moreover, a bigram (i.e., “2gram”) achieves the best BLEU scores among the four different orders of VMs. |
Abstract | Extensive experiments involving large-scale English-to-Japanese translation revealed a significant improvement of 1.8 points in BLEU score , as compared with a strong forest-to-string baseline system. |
Conclusion | Extensive experiments on large-scale English-to-Japanese translation resulted in a significant improvement in BLEU score of 1.8 points (p < 0.01), as compared with our implementation of a strong forest-to-string baseline system (Mi et al., 2008; Mi and Huang, 2008). |
Experiments | Here, fw denotes function word, DT denotes the decoding time, and the BLEU scores were computed on the test set |
Experiments | the final BLEU scores of C3-T with Min-F and C3-F. |
Experiments | Using the composed rule set C3-F in our forest-based decoder, we achieved an optimal BLEU score of 28.89%. |
Introduction | (2008) achieved a 3.1-point improvement in BLEU score (Papineni et al., 2002) by including bilingual syntactic phrases in their forest-based system. |
Introduction | Using the composed rules of the present study in a baseline forest-to-string translation system results in a 1.8-point improvement in the BLEU score for large-scale English-to-Japanese translation. |
Experiment Results | We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences). |
Experiment Results | On average the improvement is 1.07 BLEU points (45.66
Experiment Results | without new phrase-based features and 1.14 BLEU points over the baseline Hiero system. |
Phrasal-Hiero Model | Compare BLEU scores of translation using all extracted rules (the first row) and translation using only rules without nonaligned subphrases (the second row). |
Conclusion and Future Work | We found that using a segmented translation model based on unsupervised morphology induction and a model that combined morpheme segments in the translation model with a postprocessing morphology prediction model gave us better BLEU scores than a word-based baseline. |
Experimental Results | All the BLEU scores reported are for lowercase evaluation. |
Experimental Results | No Uni indicates the segmented BLEU score without unigrams. |
Experimental Results | a variation of the m-BLEU score (Luong et al., 2010), where the BLEU score is computed by comparing the segmented output with a segmented reference translation. |
Models 2.1 Baseline Models | performance of unsupervised segmentation for translation, our third baseline is a segmented translation model based on a supervised segmentation model (called Sup), using the hand-built Omorfi morphological analyzer (Pirinen and Listenmaa, 2007), which provided slightly higher BLEU scores than the word-based baseline. |
Translation and Morphology | Our proposed approaches are significantly better than the state of the art, achieving the highest reported BLEU scores on the English-Finnish Europarl version 3 dataset. |
Abstract | An additional fast decoding pass maximizing the expected count of correct translation hypotheses increases the BLEU score significantly. |
Conclusion | This technique, together with the progressive search at previous stages, gives a decoder that produces the highest BLEU score we have obtained on the data in a very reasonable amount of time. |
Experiments | Table 1: Speed and BLEU scores for two-pass decoding. |
Experiments | However, model scores do not directly translate into BLEU scores . |
Experiments | In order to maximize BLEU score using the algorithm described in Section 4, we need a sizable trigram forest as a starting point. |
Introduction | With this heuristic, we achieve the same BLEU scores and model cost as a trigram decoder with essentially the same speed as a bigram decoder. |
Inferring a learning curve from mostly monolingual data | Our objective is to predict the evolution of the BLEU score on the given test set as a function of the size of a random subset of the training data |
Inferring a learning curve from mostly monolingual data | We first train models to predict the BLEU score at m anchor sizes s1, ..., sm. |
Inferring a learning curve from mostly monolingual data | We then perform inference using these models to predict the BLEU score at each anchor, for the test case of interest. |
Selecting a parametric family of curves | The values are on the same scale as the BLEU scores . |
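One way to realize the anchor-based curve fitting sketched above is to fit a parametric family to the (size, BLEU) anchor points and then extrapolate; the power-law family BLEU(s) ≈ c - a·s^(-b) and the coarse grid-search fit below are illustrative assumptions, not necessarily the family chosen in the source:

```python
def fit_power_curve(sizes, bleus):
    """Fit BLEU(s) ~ c - a * s**(-b) to (size, BLEU) anchor points.
    For each candidate exponent b on a coarse grid, (c, a) has a
    closed-form least-squares solution; keep the best-fitting triple."""
    best = None
    for b in [x / 100 for x in range(5, 105, 5)]:
        xs = [s ** (-b) for s in sizes]  # with b fixed, the model is linear in x
        n = len(xs)
        sx, sy = sum(xs), sum(bleus)
        sxx = sum(x * x for x in xs)
        sxy = sum(x * y for x, y in zip(xs, bleus))
        denom = n * sxx - sx * sx
        if abs(denom) < 1e-12:  # degenerate design (e.g., identical sizes)
            continue
        a = -(n * sxy - sx * sy) / denom      # slope of the linear fit is -a
        c = (sy + a * sx) / n
        err = sum((c - a * x - y) ** 2 for x, y in zip(xs, bleus))
        if best is None or err < best[0]:
            best = (err, a, b, c)
    _, a, b, c = best
    return lambda s: c - a * s ** (-b)
```

The returned callable predicts BLEU at unseen (larger) training-set sizes, which is exactly the extrapolation task described above.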
Evaluation | In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations. |
Evaluation | This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators. |
Evaluation | As expected, random selection yields bad performance, with a BLEU score of 30.52. |
Abstract | Trained on 8,975 dependency structures of a Chinese Dependency Treebank, the realizer achieves a BLEU score of 0.8874. |
Experiments | In addition to the BLEU score, the percentage of exactly matched sentences and the average NIST simple string accuracy (SSA) are adopted as evaluation metrics. |
Experiments | We observe that the BLEU score is boosted from 0.1478 to 0.5943 by using the RPD method. |
Experiments | All of the four feature functions we have tested achieve considerable improvement in BLEU scores . |
Log-linear Models | The BLEU score, a method originally proposed to automatically evaluate machine translation quality (Papineni et al., 2002), has been widely used as a metric to evaluate general-purpose sentence generation (Langkilde, 2002; White et al., 2007; Guo et al. |
Log-linear Models | 3 The BLEU scoring script is supplied by the NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl |
AL-SMT: Multilingual Setting | The translation quality is measured by TQ for individual systems M_{Fd→E}; it can be the BLEU score or WER/PER (word error rate and position-independent WER), which induces a maximization or minimization problem, respectively. |
AL-SMT: Multilingual Setting | This process is continued iteratively until a certain level of translation quality is met (we use the BLEU score, WER and PER) (Papineni et al., 2002). |
Experiments | The number of weights w_i is 3 plus the number of source languages, and they are trained using minimum error-rate training (MERT) to maximize the BLEU score (Och, 2003) on a development set. |
Experiments | Each utterance in the test data has more than one response that elicits the same goal emotion, because these are used to compute the BLEU score (see Section 5.3). |
Experiments | We first use BLEU score (Papineni et al., 2002) to perform automatic evaluation (Ritter et al., 2011). |
Experiments | In this evaluation, the system is provided with the utterance and the goal emotion in the test data and the generated responses are evaluated through BLEU score . |
Cohesive Phrasal Output | We tested this approach on our English-French development set, and saw no improvement in BLEU score . |
Conclusion | Our experiments have shown that roughly 1/5 of our baseline English-French translations contain cohesion violations, and these translations tend to receive lower BLEU scores . |
Experiments | We first present our soft cohesion constraint’s effect on BLEU score (Papineni et al., 2002) for both our dev-test and test sets. |
Experiments | First of all, looking across columns, we can see that there is a definite divide in BLEU score between our two evaluation subsets. |
Experiments | Sentences with cohesive baseline translations receive much higher BLEU scores than those with uncohesive baseline translations. |
Ensemble Decoding | In Section 4.2, we compare the BLEU scores of different mixture operations on a French-English experimental setup. |
Ensemble Decoding | However, experiments showed that replacing the scores with the normalized scores hurts the BLEU score radically. |
Ensemble Decoding | However, we did not try it, as the BLEU scores we got using the normalization heuristic were not promising, and it would impose a cost in decoding as well. |
Experiments & Results 4.1 Experimental Setup | Since the Hiero baseline results were substantially better than those of the phrase-based model, we also implemented the best-performing baseline, linear mixture, in our Hiero-style MT system, and in fact it achieves the highest BLEU score among all the baselines, as shown in Table 2. |
Experiments & Results 4.1 Experimental Setup | This baseline is run three times and the score is averaged over the BLEU scores, with a standard deviation of 0.34. |
Experiments & Results 4.1 Experimental Setup | We also reported the BLEU scores when we applied the span-wise normalization heuristic. |
Experiments | Table 5 shows baseline translation BLEU scores for a lossless (non-randomized) language model with parameter values quantized into 5 to 8 bits. |
Experiments | Table 5: Baseline BLEU scores with lossless n-gram model and different quantization levels (bits). |
Experiments | Figure 3: BLEU scores on the MT05 data set. |
Conclusion and Future Work | We plan to give different weights to different training examples based on the drop in BLEU score the example can cause if classified incorrectly. |
MT System Selection | We run the 5,562 sentences of the classification training data through our four MT systems and produce sentence-level BLEU scores (with length penalty). |
MT System Selection | We pick the name of the MT system with the highest BLEU score as the class label for that sentence. |
MT System Selection | When there is a tie in BLEU scores, we pick the system label that yields the better overall BLEU score from among the systems tied. |
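The labeling scheme above, choosing the system with the highest sentence-level BLEU and breaking ties by overall corpus BLEU, can be sketched as follows; the data layout (dicts keyed by system name) is an illustrative assumption:

```python
def label_sentences(sent_bleus, corpus_bleu):
    """For each sentence, pick the system with the highest sentence-level
    BLEU as its class label; break ties in favor of the system with the
    better overall corpus BLEU.

    sent_bleus: list of dicts {system_name: sentence_BLEU}
    corpus_bleu: dict {system_name: corpus_BLEU}
    """
    labels = []
    for scores in sent_bleus:
        best = max(scores.values())
        tied = [name for name, s in scores.items() if s == best]
        labels.append(max(tied, key=lambda name: corpus_bleu[name]))
    return labels
```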
Machine Translation Experiments | All differences in BLEU scores between the four systems are statistically significant above the 95% level. |
Machine Translation Experiments | We also report in Table 1 an oracle system selection where we pick, for each sentence, the English translation that yields the best BLEU score. |
Our Approach | Figure 1: BLEU scores vs. k for SumBasic extraction. |
Our Approach | As shown in Figure 1, our system’s BLEU scores increase rapidly until about k = 25. |
Baseline MT | The scaling factors for all features are optimized by minimum error rate training algorithm to maximize BLEU score (Och, 2003). |
Experiments | In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora. |
Experiments | We computed the name-aware BLEU scores on the subset and also the aggregated average scores from human judgments. |
Experiments | Furthermore, we calculated three Pearson product-moment correlation coefficients between human judgment scores and name-aware BLEU scores of these two MT systems. |
Name-aware MT Evaluation | Based on BLEU score , we design a name-aware BLEU metric as follows. |
Name-aware MT Evaluation | Finally the name-aware BLEU score is defined as: |
Conclusion and Future Work | The method assumes that a combined model is derived from a hierarchical Pitman-Yor process with each prior learned separately in each domain, and achieves BLEU scores competitive with traditional batch-based ones. |
Experiment | The BLEU scores reported in this paper are the average of 5 independent runs of independent batch-MIRA weight training, as suggested by (Clark et al., 2011). |
Experiment | When comparing the hier-combin with the pialign-batch, the BLEU scores are a little higher while the time spent on training is much lower, almost one quarter of that of the pialign-batch. |
Experiment | Table 4 shows the BLEU scores for the three data sets, in which the order of combining phrase tables from each domain is alternated in the ascending and descending of the similarity to the test data. |
Abstract | We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data. |
Conclusion | The results showed that our approach achieved a BLEU score gain of 1.61. |
Dependency-based Pre-ordering Rule Set | In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set. |
Experiments | For evaluation, we used BLEU scores (Papineni et al., 2002). |
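Many of the snippets above evaluate with BLEU (Papineni et al., 2002). As a reference point, a minimal single-reference corpus BLEU can be sketched as follows; this is a simplified illustration, not the official multi-reference implementation, so reported scores should still come from standard tools such as mteval or sacreBLEU:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    # BLEU = brevity penalty * geometric mean of modified n-gram precisions,
    # accumulated over the whole corpus (single reference per hypothesis here).
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
            # clipped ("modified") counts: a hypothesis n-gram is only
            # credited as often as it appears in the reference
            match[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += max(len(hyp) - n + 1, 0)
    if min(match) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram level has no match
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)
```

A hypothesis identical to its reference scores 1.0, and any corpus with no 4-gram match scores 0.0 under this unsmoothed variant.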
Experiments | It shows the BLEU scores on the test set and the statistics of pre-ordering on the training set, which includes the total count of each rule set and the number of sentences they were applied to.
Introduction | Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61. |
Experiments | training data and do not necessarily follow the tendency of the final BLEU scores.
Experiments | For example, CCG is worse than Malt in terms of P/R yet with a higher BLEU score.
Experiments | Also, PAS+sem has a lower P/R than Berkeley, yet their final BLEU scores are not statistically different. |
Introduction | In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e', d')), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e', then the model score of (e*, d*) with this weight will also be greater than that of (e', d'). In this paper, a pair (e*, e') for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
Introduction | to that of Moses: on the NIST05 test set, L-Hiero achieves a BLEU score of 25.1 and Moses achieves 24.8.
Introduction | Since both MERT and PRO tuning toolkits involve randomness in their implementations, all BLEU scores reported in the experiments are the average of five tuning runs, as suggested by Clark et al. |
Experiments | When REG and linearization are applied on shallowSyn_re with gold shallow trees, the BLEU score is lower (60.57) compared to the system that applies syntax and linearization on deepSyn_re, deep trees with gold REs (BLEU score of 63.9).
Experiments | The revision-based system with disjoint modelling of implicits shows a slight, non-significant increase in BLEU score.
Experiments | By contrast, the BLEU score is significantly better for the joint approach.
Experiments | The BLEU scores for these outputs are 32.7, 27.8, and 20.8. |
Experiments | In particular, their translations had a lower BLEU score, making their task easier.
Experiments | We see that our system prefers the reference much more often than the S-GRAM language model. However, we also note that the ease of the task is correlated with the quality of translations (as measured by the BLEU score).
Discussions | After reaching its peak, the BLEU score drops as the threshold τ increases.
Discussions | On the other hand, adding phrase pairs extracted by the new method only (PP3) can lead to significant BLEU score increases (comparing row 1 vs. 3, and row 2 vs. 4). |
Experimental Results | BLEU Scores |
Experimental Results | Once we have computed all feature values for all phrase pairs in the training corpus, we discriminatively train the feature weights λk and the threshold τ using the downhill simplex method to maximize the BLEU score on the 06dev set.
Experimental Results | Roughly, it has a 0.5% higher BLEU score on the 2006 sets and 1.5% to 3% higher on other sets than the Model-4 based ViterbiExtract method.
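The tuning step described above (maximizing dev-set BLEU over feature weights with a derivative-free method) can be sketched as follows. This uses a simple coordinate search as a stand-in for the downhill simplex method, and `score` is a placeholder for the real objective, which would decode the dev set under weights `w` and compute BLEU:

```python
def maximize_bleu(score, w0, step=1.0, tol=1e-3, max_iter=200):
    # Derivative-free coordinate search: repeatedly try moving each weight
    # up or down by `step`, keep any move that raises score(w), and halve
    # the step once no move helps, until the step falls below `tol`.
    w = list(w0)
    best = score(w)
    it = 0
    while step > tol and it < max_iter:
        improved = False
        for k in range(len(w)):
            for delta in (step, -step):
                cand = list(w)
                cand[k] += delta
                s = score(cand)
                if s > best:
                    w, best, improved = cand, s, True
        if not improved:
            step /= 2
        it += 1
    return w, best
```

Like downhill simplex, this needs only black-box evaluations of the BLEU objective, which is non-differentiable in the weights.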
Conclusion | We observed that this often fails to return the best output in terms of BLEU score, fluency, grammaticality and/or meaning.
Results and Discussion | Figure 6: BLEU scores and Grammar Size (Number of Elementary TAG trees)
Results and Discussion | The average BLEU score is given with respect to all input (All) and to those inputs for which the systems generate at least one sentence (Covered). |
Results and Discussion | In terms of BLEU score, the best version of our system (AUTEXP) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results similar to the fully handcrafted UDEL system.
Machine Translation as a Decipherment Task | The figure also shows the corresponding BLEU scores in parentheses for comparison (higher scores indicate better MT output). |
Machine Translation as a Decipherment Task | Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment. |
Machine Translation as a Decipherment Task | Figure 4 plots the BLEU scores versus training sizes for different MT systems on the Time corpus. |
Abstract | On top of the pruning framework, we also propose a discriminative ITG alignment model using hierarchical phrase pairs, which improves both F-score and Bleu score over the baseline alignment system of GIZA++. |
Evaluation | Finally, we also do end-to-end evaluation using both F-score in alignment and Bleu score in translation. |
Evaluation | HP-DITG using DPDI achieves the best Bleu score with acceptable time cost. |
Evaluation | It shows that HP-DITG (with DPDI) is better than the three baselines in both alignment F-score and Bleu score.
Experiments | In columns 2 and 4, we report the BLEU scores, while in columns 3 and 5, we report the TER scores.
Experiments | Model 2, which conditions POL on OR, provides an additional +0.2 BLEU score improvement consistently across the two genres.
Experiments | The inclusion of explicit MOS modeling in Model 4 gives a significant BLEU score improvement of +0.5 but no TER improvement in newswire. |
Experiments | BLEU score: We used the Moses support tool multi-bleu to calculate BLEU scores.
Experiments | The BLEU scores in Table 4 show that our system produces simplifications that are closest to the reference.
Experiments | In sum, the automatic metrics indicate that our system produces simplifications that are consistently closest to the reference in terms of edit distance, number of splits and BLEU score.
Related Work | (2010), namely an aligned corpus of 100/131 EWKP/SWKP sentences, and show that they achieve a better BLEU score.
Abstract | We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features. |
Conclusions | In comparison to a baseline model, we achieve a statistically significant improvement in BLEU score.
Generation Ranking Experiments | We evaluate the string chosen by the log-linear model against the original treebank string in terms of exact match and BLEU score (Papineni et al., 2002).
Generation Ranking Experiments | The difference in BLEU score between the model of Cahill et al. |
Experimental Evaluation | For MCE learning, we selected the reference compression that maximizes the BLEU score (Papineni et al., 2002) (= argmax_{r∈R} BLEU(r, R\r)) from the set of reference compressions and used it as correct data for training.
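The pseudo-reference selection argmax_{r∈R} BLEU(r, R\r) above can be sketched as follows. The sentence-level metric here is a smoothed bigram BLEU proxy, and scoring one reference against the rest by averaging pairwise scores is our simplification (the original would use the full multi-reference BLEU):

```python
import math
from collections import Counter

def sent_bleu(hyp, ref, max_n=2):
    # Tiny add-one-smoothed sentence-level BLEU (up to bigrams),
    # a stand-in for the full metric of Papineni et al. (2002).
    logp = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())
        total = max(len(hyp) - n + 1, 1)
        logp += math.log((match + 1) / (total + 1))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(logp / max_n)

def best_reference(refs):
    # argmax over r in R of BLEU(r, R \ r): pick the reference compression
    # that agrees most with the remaining references as the training target.
    def score(i):
        others = refs[:i] + refs[i + 1:]
        return sum(sent_bleu(refs[i], o) for o in others) / len(others)
    return max(range(len(refs)), key=score)
```

The selected reference is the one most "central" among the references, which gives MCE training a single consistent target.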
Results and Discussion | Our method achieved the highest BLEU score.
Results and Discussion | For example, ‘w/o PLM + Dep’ achieved the second highest BLEU score.
Results and Discussion | Compared to ‘Hori-’, ‘Hori’ achieved a significantly higher BLEU score.
Experiments | In our experiments all the models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric as percentages.
Experiments | Figure 3 shows the BLEU score curves with up to 1000 candidates used for re-ranking. |
Experiments | Figure 4 shows the BLEU scores of a two-system co-decoding as a function of re-decoding iterations. |
Abstract | Evaluated in French by 10-fold-cross validation, the system achieves a 9.3% Word Error Rate and a 0.83 BLEU score . |
Conclusion and perspectives | Evaluated by tenfold cross-validation, the system seems efficient, and the performance in terms of BLEU score and WER is quite encouraging.
Evaluation | The system was evaluated in terms of BLEU score (Papineni et al., 2001), Word Error Rate (WER) and Sentence Error Rate (SER). |
Evaluation | The copy-paste results just inform about the real deviation of our corpus from the traditional spelling conventions, and highlight the fact that our system is still at pains to significantly reduce the SER, while results in terms of WER and BLEU score are quite encouraging.
Experimental Results | The BLEU scores from different systems are shown in Table 10 and Table 11, respectively. |
Experimental Results | Preprocessing of the data with ECs inserted improves the BLEU scores by about 0.6 for newswire and 0.2 to 0.3 for the weblog data, compared to each baseline separately. |
Experimental Results | Table 10: BLEU scores in the Hiero system. |
Conclusion | This is confirmed for other languages as well: the lower the BLEU score the lower the correlation to human judgments. |
Problems of BLEU | We plot the official BLEU score against the rank established as the percentage of sentences where a system ranked no worse than all its competitors (Callison-Burch et al., 2009). |
Problems of BLEU | Figure 3 documents the issue across languages: the lower the BLEU score itself (i.e. |
Problems of BLEU | A phrase-based system like Moses (cu-bojar) can sometimes produce a long sequence of tokens exactly as required by the reference, leading to a high BLEU score.
Additional Experiments | On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain. |
Discussion | This correlates with our observation that RM’s overall BLEU score is negatively impacted by the BP, as the BLEU precision scores are noticeably higher. |
Discussion | We also notice that while PRO had the lowest BLEU scores in Chinese, it was competitive in Arabic with the highest number of features. |
Experiments | In the small feature set RAMPION yielded similar best BLEU scores, but worse TER.
Experiments | Hence the BLEU scores we get for the baselines may appear lower than those reported in the literature.
Experiments | Table 3 shows the BLEU scores for the three translation tasks UR/AR/FA→EN based on our method against the baselines.
Experiments | For our models, we report the average BLEU score of the 5 independent runs as well as that of the aggregate phrase table generated by these 5 independent runs. |
Experimental Evaluation | For most models, while the likelihood continued to increase gradually for all 100 iterations, BLEU score gains plateaued after 5-10 iterations, likely due to the strong prior information
Experimental Evaluation | It can also be seen that combining phrase tables from multiple samples improved the BLEU score for HLEN, but not for HIER. |
Hierarchical ITG Model | (2003) that using phrases where max(|e|, |f|) ≤ 3 causes significant improvements in BLEU score, while using larger phrases results in diminishing returns.
Introduction | We also find that it achieves superior BLEU scores over previously proposed ITG-based phrase alignment approaches. |
Abstract | We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model) as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. |
Final Results | This section shows the improvement in BLEU score by applying heuristics and combinations of heuristics in both the models. |
Final Results | For other parts of the data where the translators have heavily used transliteration, the system may receive a higher BLEU score.
Introduction | Section 4 discusses the training data, parameter optimization and the initial set of experiments that compare our two models with a baseline Hindi-Urdu phrase-based system and with two transliteration-aided phrase-based systems in terms of BLEU scores.
Experiments | We compare their influence on RankingSVM accuracy, alignment crossing-link number, end-to-end BLEU score, and the model size.
Experiments | is RankingSVM accuracy in percentage on the training data; CLN is the crossing-link number per sentence on the parallel corpus with automatically generated word alignment; BLEU is the BLEU score in percentage on the web test set in the Rank-IT setting (system with the integrated rank reordering model); lex-n denotes the n most frequent lexicons in the training corpus.
Experiments | These features also correspond to BLEU score improvement for End-to-End evaluations. |
Discussion and Future Work | When we visually inspect and compare the outputs of our system with those of the baseline, we observe that improved BLEU score often corresponds to visible improvements in the subjective translation quality. |
Experimental Results | These results confirm that the pairwise dominance model can significantly increase performance as measured by the BLEU score, with a consistent pattern of results across the MT06 and MT08 test sets.
Experimental Setup | In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004).
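The bootstrap significance test of Koehn (2004), cited repeatedly above, can be sketched as follows. For brevity this resamples per-sentence scores and compares their sums; for real BLEU one would resample per-sentence sufficient statistics and recompute corpus BLEU for each replicate, since BLEU does not decompose over sentences:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    # Paired bootstrap resampling: draw test sets with replacement and
    # count how often system A beats system B on the resampled set.
    # A fraction close to 1.0 means A is significantly better than B.
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples
```

Because the same resampled sentence indices are used for both systems, the test is paired: it measures the difference between the systems on identical subsets rather than their absolute scores.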
Results and Discussion | In particular, the hypertagger makes possible a more than 6-point improvement in the overall BLEU score on both the development and test sections, and a more than 12-point improvement on the sentences with complete realizations. |
Results and Discussion | Even with the current incomplete set of semantic templates, the hypertagger brings realizer performance roughly up to state-of-the-art levels, as our overall test set BLEU score (0.6701) slightly exceeds that of Cahill and van Genabith (2006), though at a coverage of 96% instead of 98%. |
The Approach | compared the percentage of complete realizations (versus fragmentary ones) with their top scoring model against an oracle model that uses a simplified BLEU score based on the target string, which is useful for regression testing as it guides the best-first search to the reference sentence. |
Complexity Analysis | It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings. |
Complexity Analysis | Table 4 presents the BLEU scores for Moses using different segmentation methods. |
Introduction | • improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
Experiments & Results | The BLEU scores, not included in the figure but shown in Table 2, show a similar trend.
Experiments & Results | Statistical significance on the BLEU scores was tested using pairwise bootstrap sampling (Koehn, 2004). |
Experiments & Results | Another discrepancy is found in the BLEU scores of the English→Chinese experiments, where we measure an unexpected drop in BLEU score below the baseline.
Experiments | In Table 3, almost all BLEU scores are improved, no matter what strategy is used. |
Experiments | The final BLEU scores on NIST05 and NIST06 are given in Table 4. |
Experiments | BLEU scores on the large-scale training data. |
Experiments | In addition to precision and recall, we also evaluate the change in Bleu score (Papineni et al., 2002) before and after applying our measure word generation method to the SMT output.
Experiments | For our test data, we only consider sentences containing measure words for Bleu score evaluation. |
Experiments | Our measure word generation step leads to a Bleu score improvement of 0.32 where the window size is set to 10, which shows that it can improve the translation quality of an English-to-Chinese SMT system. |
Conclusion and Future Work | In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
Experiments | The 10-fold cross-validated BLEU score (Papineni et al., 2002) over this data is 0.81. |
Experiments | Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation. |
Alignment | We perform minimum error rate training with the downhill simplex algorithm (Nelder and Mead, 1965) on the development data to obtain a set of scaling factors that achieve a good BLEU score . |
Experimental Evaluation | A second iteration of the training algorithm shows nearly no changes in BLEU score , but a small improvement in TER. |
Experimental Evaluation | yields a BLEU score slightly lower than with fixed interpolation on both DEV and TEST. |
Experiment | The 9% tree sequence rules contribute a 1.17 BLEU score improvement (28.83-27.66 in Table 1) to FTS2S over FT2S.
Experiment | Even in the 5000-best case, tree sequence is still able to contribute a 1.1 BLEU score improvement (28.89-27.79).
Experiment | 2) The BLEU scores are very similar to each other when we increase the forest pruning threshold. |
Abstract | Table 8: BLEU scores for several language pairs for systems trained on WMT data.
Abstract | Table 9: BLEU scores for French-English and English-French before and after adding the mined parallel data to systems trained on WMT data, including the French-English Gigaword (Callison-Burch et al., 2011).
Abstract | Table 12: BLEU scores for Spanish-English before and after adding the mined parallel data to a baseline Europarl system. |
Discussion | on BLEU score |
Experiments | (e1i, e2i) are selected for the extraction of paraphrase rules if two conditions are satisfied: (1) BLEU(e2i) - BLEU(e1i) > δ1, and (2) BLEU(e2i) > δ2, where BLEU(·) is a function for computing the BLEU score; δ1 and δ2 are thresholds for balancing the number of rules and the quality of the paraphrase rules.
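The two-threshold selection criterion can be sketched directly. Here each candidate is a pair of (hypothesis, BLEU score) tuples; the function and variable names are ours, not the paper's:

```python
def select_pairs(candidates, delta1, delta2):
    # Keep a candidate pair (e1, e2) for paraphrase rule extraction only if
    # (1) BLEU(e2) - BLEU(e1) > delta1  (e2 is a clear improvement over e1), and
    # (2) BLEU(e2) > delta2             (e2 is good in absolute terms).
    # Raising either threshold yields fewer but higher-quality rules.
    return [(e1, e2) for (e1, b1), (e2, b2) in candidates
            if b2 - b1 > delta1 and b2 > delta2]
```

Both conditions are strict inequalities, so a pair exactly at a threshold is discarded, matching the "balancing" role of δ1 and δ2 described above.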
Extraction of Paraphrase Rules | If the sentence in T 2 has a higher BLEU score than the aligned sentence in T1, the corresponding sentences in S0 and S1 are selected as candidate paraphrase sentence pairs, which are used in the following steps of paraphrase extractions. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | Phrase merging that preferred phrases learnt from EG’ data over AR data performed the best with a BLEU score of 16.96. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | In further analysis, we examined 1% of the sentences with the largest difference in BLEU score . |
Proposed Methods 3.1 Egyptian to EG’ Conversion | Out of these, more than 70% were cases where the EG’ model achieved a higher BLEU score . |
Abstract | Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system. |
Conclusion | We treat word alignment as a parsing problem, and by taking advantage of English syntax and the hypergraph structure of our search algorithm, we report significant increases in both F-measure and BLEU score over standard baselines in use by most state-of-the-art MT systems today. |
Related Work | Very recent work in word alignment has also started to report downstream effects on BLEU score.
Experiments and Results | Statistical significance in BLEU score differences was tested by paired bootstrap re-sampling (Koehn, 2004). |
Experiments and Results | Best ESSP (wchpwen) is significantly better than the baseline (p<0.01) in BLEU score; best SMP (wdpwen) is significantly better than the baseline (p<0.05) in BLEU score.
Experiments and Results | wchpwen is significantly better than baseline (p<0.04) in BLEU score . |