Index of papers in Proc. ACL 2008 that mention
  • BLEU
Mi, Haitao and Huang, Liang and Liu, Qun
Abstract
Large-scale experiments show an absolute improvement of 1.7 BLEU points over the 1-best baseline.
Experiments
BLEU score
Experiments
We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on the dev set.
Experiments
The BLEU score of the baseline 1-best decoding is 0.2325, which is consistent with the result of 0.2302 in (Liu et al., 2007) on the same training, development and test sets, and with the same rule extraction procedure.
Introduction
Large-scale experiments (Section 4) show an improvement of 1.7 BLEU points over the 1-best baseline, which is also 0.8 points higher than decoding with 30-best trees, and takes even less time thanks to the sharing of common subtrees.
BLEU is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Gildea, Daniel
Abstract
An additional fast decoding pass maximizing the expected count of correct translation hypotheses increases the BLEU score significantly.
Decoding to Maximize BLEU
BLEU is based on n-gram precision, and since each synchronous constituent in the tree adds a new 4-gram to the translation at the point where its children are concatenated, the additional pass approximately maximizes BLEU.
Experiments
We evaluate the translation results by comparing them against the reference translations using the BLEU metric.
Experiments
                       Hyperedges            BLEU
Bigram Pass            167K                  21.77
Trigram Pass
  UNI                  —                     —
  BO                   +629.7K = 796.7K      23.56
  BO+BB                +2.7K = 169.7K        …
Experiments
Table 1: Speed and BLEU scores for two-pass decoding.
Introduction
With this heuristic, we achieve the same BLEU scores and model cost as a trigram decoder with essentially the same speed as a bigram decoder.
Introduction
Maximizing the expected count of synchronous constituents approximately maximizes BLEU.
Introduction
We find a significant increase in BLEU in the experiments, with minimal additional time.
BLEU is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Toutanova, Kristina and Suzuki, Hisami and Ruopp, Achim
Abstract
We applied our inflection generation models in translating English into two morphologically complex languages, Russian and Arabic, and show that our model improves the quality of SMT over both phrasal and syntax-based SMT systems according to BLEU and human judgements.
Integration of inflection models with MT systems
We performed a grid search on the values of λ and n, to maximize the BLEU score of the final system on a development set (dev) of 1000 sentences (Table 2).
MT performance results
For automatically measuring performance, we used 4-gram BLEU against a single reference translation.
MT performance results
We also report oracle BLEU scores which incorporate two kinds of oracle knowledge.
MT performance results
For the methods using n=1 translation from a base MT system, the oracle BLEU score is the BLEU score of the stemmed translation compared to the stemmed reference, which represents the upper bound achievable by changing only the inflected forms (but not stems) of the words in a translation.
BLEU is mentioned in 26 sentences in this paper.
Topics mentioned in this paper:
Shen, Libin and Xu, Jinxi and Weischedel, Ralph
Abstract
Our experiments show that the string-to-dependency decoder achieves 1.48 point improvement in BLEU and 2.53 point improvement in TER compared to a standard hierarchical string-to-string system on the NIST 04 Chinese-English evaluation set.
Conclusions and Future Work
Our string-to-dependency system generates 80% fewer rules, and achieves 1.48 point improvement in BLEU and 2.53 point improvement in TER on the decoding output on the NIST 04 Chinese-English evaluation set.
Experiments
All models are tuned on BLEU (Papineni et al., 2001), and evaluated on both BLEU and Translation Error Rate (TER) (Snover et al., 2006) so that we could detect over-tuning on one metric.
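As a side note, the dual BLEU/TER check described in this excerpt can be reproduced with off-the-shelf tooling. The sketch below uses the sacrebleu package (which postdates this paper and is not what the authors used); the hypothesis and reference sentences are made-up placeholders.

```python
# Minimal sketch: score the same outputs with both BLEU and TER to guard
# against over-tuning on a single metric. Assumes a recent sacrebleu release
# (pip install sacrebleu); the sentences below are invented examples.
from sacrebleu.metrics import BLEU, TER

hyps = ["the cat sat on the mat", "he reads the book"]        # system outputs
refs = [["the cat is on the mat", "he is reading the book"]]  # one reference stream

print(BLEU().corpus_score(hyps, refs))  # corpus-level BLEU
print(TER().corpus_score(hyps, refs))   # corpus-level TER
```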
Experiments
                       BLEU% (lower / mixed)    TER% (lower / mixed)
Decoding (3-gram LM)
  baseline             38.18 / 35.77            58.91 / 56.60
  filtered             37.92 / 35.48            57.80 / 55.43
  str-dep              39.52 / 37.25            56.27 / 54.07
Rescoring (5-gram LM)
  baseline             40.53 / 38.26            56.35 / 54.15
  filtered             40.49 / 38.26            55.57 / 53.47
  str-dep              41.60 / 39.47            55.06 / 52.96
Experiments
Table 2: BLEU and TER scores on the test set.
Introduction
For example, Chiang (2007) showed that the Hiero system achieved about 1 to 3 point improvement in BLEU on the NIST 03/04/05 Chinese-English evaluation sets compared to a state-of-the-art phrasal system.
Introduction
Our string-to-dependency decoder shows 1.48 point improvement in BLEU and 2.53 point improvement in TER on the NIST 04 Chinese-English MT evaluation set.
BLEU is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Ganchev, Kuzman and Graça, João V. and Taskar, Ben
Abstract
We propose and extensively evaluate a simple method for using alignment models to produce alignments better-suited for phrase-based MT systems, and show significant gains (as measured by BLEU score) in end-to-end translation systems for six language pairs used in recent MT competitions.
Conclusions
Table 3: BLEU scores for all language pairs using all available data.
Introduction
Our contribution is a large scale evaluation of this methodology for word alignments, an investigation of how the produced alignments differ and how they can be used to consistently improve machine translation performance (as measured by BLEU score) across many languages on training corpora with up to a hundred thousand sentences.
Introduction
In 10 out of 12 cases we improve BLEU score by at least i point and by more than 1 point in 4 out of 12 cases.
Phrase-based machine translation
We report BLEU scores using a script available with the baseline system.
Phrase-based machine translation
Figure 8: BLEU score as the amount of training data is increased on the Hansards corpus for the best decoding method for each alignment model.
Phrase-based machine translation
In principle, we would like to tune the threshold by optimizing BLEU score on a development set, but that is impractical for experiments with many pairs of languages.
Word alignment results
Unfortunately, as was shown by Fraser and Marcu (2007), AER can have weak correlation with translation performance as measured by BLEU score (Papineni et al., 2002), when the alignments are used to train a phrase-based translation system.
BLEU is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Deng, Yonggang and Xu, Jia and Gao, Yuqing
A Generic Phrase Training Procedure
…translation engine to minimize the final translation errors measured by automatic metrics such as BLEU (Papineni et al., 2002).
Discussions
[Figure: BLEU vs. phrase table size]
Discussions
After reaching its peak, the BLEU score drops as the threshold τ increases.
Discussions
Table 4: Translation Results (BLEU) of discriminative phrase training approach using different features
Experimental Results
We measure translation performance by the BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) scores with multiple translation references.
Experimental Results
BLEU Scores
Experimental Results
The translation results as measured by BLEU and METEOR scores are presented in Table 3.
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Chan, Yee Seng and Ng, Hwee Tou
Automatic Evaluation Metrics
In this section, we describe BLEU, and the three metrics which achieved higher correlation results than BLEU in the recent ACL-07 MT workshop.
Automatic Evaluation Metrics
2.1 BLEU
Automatic Evaluation Metrics
BLEU (Papineni et al., 2002) is essentially a precision-based metric and is currently the standard metric for automatic evaluation of MT performance.
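For reference, the excerpt above summarizes what BLEU measures; a minimal single-reference, sentence-level sketch is given below. It is unsmoothed (so short sentences with no matching 4-gram score zero) and real evaluations aggregate counts over a whole corpus; the example sentences are invented.

```python
# Sketch of single-reference BLEU: modified n-gram precisions (n = 1..4)
# combined by a geometric mean, times a brevity penalty. Illustrative only.
from collections import Counter
import math

def sentence_bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())   # clipped n-gram matches
        precisions.append(clipped / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram precision is zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * geo_mean

print(sentence_bleu("the cat sat on the mat today".split(),
                    "the cat sat on the mat".split()))  # ≈ 0.81 for this toy pair
```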
Introduction
Among all the automatic MT evaluation metrics, BLEU (Papineni et al., 2002) is the most widely used.
Introduction
Although BLEU has played a crucial role in the progress of MT research, it is becoming evident that BLEU does not correlate with human judgement
Introduction
The results show that, as compared to BLEU, several recently proposed metrics such as Semantic-role overlap (Gimenez and Marquez, 2007), ParaEval-recall (Zhou et al., 2006), and METEOR (Banerjee and Lavie, 2005) achieve higher correlation.
BLEU is mentioned in 20 sentences in this paper.
Topics mentioned in this paper:
Blunsom, Phil and Cohn, Trevor and Osborne, Miles
Discussion and Further Work
Hiero was MERT trained on this set and has a 2% higher BLEU score compared to the discriminative model.
Discussion and Further Work
[Figure: development BLEU (%)]
Evaluation
Although there is no direct relationship between BLEU and likelihood, it provides a rough measure for comparing performance.
Evaluation
We also experimented with using max-translation decoding for standard MERT trained translation models, finding that it had a small negative impact on BLEU score.
Evaluation
Figure 5 shows the relationship between beam width and development BLEU .
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Uszkoreit, Jakob and Brants, Thorsten
Abstract
We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
Conclusion
The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks.
Experiments
Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English.
Experiments
minimum error rate training (Och, 2003) with BLEU score as the objective function.
Experiments
Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task.
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Cherry, Colin
Cohesive Decoding
Initially, we were not certain to what extent this feature would be used by the MERT module, as BLEU is not always sensitive to syntactic improvements.
Cohesive Phrasal Output
We tested this approach on our English-French development set, and saw no improvement in BLEU score.
Conclusion
Our experiments have shown that roughly 1/5 of our baseline English-French translations contain cohesion violations, and these translations tend to receive lower BLEU scores.
Conclusion
Our soft constraint produced improvements ranging between 0.5 and 1.1 BLEU points on sentences for which the baseline produces uncohesive translations.
Experiments
We first present our soft cohesion constraint’s effect on BLEU score (Papineni et al., 2002) for both our dev-test and test sets.
Experiments
First of all, looking across columns, we can see that there is a definite divide in BLEU score between our two evaluation subsets.
Experiments
Sentences with cohesive baseline translations receive much higher BLEU scores than those with uncohesive baseline translations.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Hermjakob, Ulf and Knight, Kevin and Daumé III, Hal
Discussion
At the same time, there has been no negative impact on overall quality as measured by BLEU .
End-to-End results
To make sure our name transliterator does not degrade the overall translation quality, we evaluated our base SMT system with BLEU, as well as our transliteration-augmented SMT system.
End-to-End results
The BLEU scores for the two systems were 50.70 and 50.96 respectively.
Evaluation
General MT metrics such as BLEU, TER, METEOR are not suitable for evaluating named entity translation and transliteration, because they are not focused on named entities (NEs).
Integration with SMT
In a tuning step, the Minimum Error Rate Training component of our SMT system iteratively adjusts the set of rule weights, including the weight associated with the transliteration feature, such that the English translations are optimized with respect to a set of known reference translations according to the BLEU translation metric.
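The excerpt describes BLEU-driven weight tuning only in the abstract. The toy loop below illustrates the idea as a coordinate-wise random search over feature weights; it is not Och's (2003) line-search MERT, and `decode` and `dev_bleu` are hypothetical stand-ins for a real decoder and BLEU scorer.

```python
# Toy sketch: keep a weight perturbation only if dev-set BLEU improves.
import random

def tune(decode, dev_bleu, num_feats, iters=200, seed=0):
    rng = random.Random(seed)
    best_w = [1.0] * num_feats
    best = dev_bleu(decode(best_w))
    for _ in range(iters):
        w = list(best_w)
        i = rng.randrange(num_feats)       # pick one feature weight
        w[i] += rng.uniform(-0.5, 0.5)     # perturb it
        score = dev_bleu(decode(w))        # re-decode the dev set, re-score
        if score > best:                   # accept only BLEU-improving changes
            best_w, best = w, score
    return best_w, best
```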
Introduction
First, although names are important to human readers, automatic MT scoring metrics (such as BLEU) do not encourage researchers to improve name translation in the context of MT.
Introduction
A secondary goal is to make sure that our overall translation quality (as measured by BLEU) does not degrade as a result of the name-handling techniques we introduce.
Introduction
We evaluate both the base SMT system and the augmented system in terms of entity translation accuracy and BLEU (Sections 2 and 6).
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Espinosa, Dominic and White, Michael and Mehay, Dennis
Conclusion
We have also shown that, by integrating this hypertagger with a broad-coverage CCG chart realizer, considerably faster realization times are possible (approximately twice as fast as compared with a realizer that performs simple lexical lookups) with higher BLEU , METEOR and exact string match scores.
Conclusion
Moreover, the hypertagger-augmented realizer finds more than twice the number of complete realizations, and further analysis revealed that the realization quality (as per modified BLEU and METEOR) is higher in the cases when the realizer finds a complete realization.
Introduction
Moreover, the overall BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007) scores, as well as numbers of exact string matches (as measured against the original sentences in the CCGbank), are higher for the hypertagger-seeded realizer than for the preexisting realizer.
Results and Discussion
Table 5 shows that increasing the number of complete realizations also yields improved BLEU and METEOR scores, as well as more exact matches.
Results and Discussion
In particular, the hypertagger makes possible a more than 6-point improvement in the overall BLEU score on both the development and test sections, and a more than 12-point improvement on the sentences with complete realizations.
Results and Discussion
Even with the current incomplete set of semantic templates, the hypertagger brings realizer performance roughly up to state-of-the-art levels, as our overall test set BLEU score (0.6701) slightly exceeds that of Cahill and van Genabith (2006), though at a coverage of 96% instead of 98%.
The Approach
compared the percentage of complete realizations (versus fragmentary ones) with their top scoring model against an oracle model that uses a simplified BLEU score based on the target string, which is useful for regression testing as it guides the best-first search to the reference sentence.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Talbot, David and Brants, Thorsten
Experiments
Table 5 shows baseline translation BLEU scores for a lossless (non-randomized) language model with parameter values quantized into 5 to 8 bits.
Experiments
Table 5: Baseline BLEU scores with lossless n-gram model and different quantization levels (bits).
Experiments
Figure 3: BLEU scores on the MT05 data set.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Li, Zhifei and Yarowsky, David
Experimental Results
The feature functions are combined under a log-linear framework, and the weights are tuned by the minimum-error-rate training (Och, 2003) using BLEU (Papineni et al., 2002) as the optimization metric.
Experimental Results
This precision is extremely high because the BLEU score (precision with brevity penalty) that one obtains for a Chinese sentence is normally between 30% and 50%.
Experimental Results
4.5.2 BLEU on NIST MT Test Sets
Introduction
We carry out experiments on a state-of-the-art SMT system, i.e., Moses (Koehn et al., 2007), and show that the abbreviation translations consistently improve the translation performance (in terms of BLEU (Papineni et al., 2002)) on various NIST MT test sets.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Zhang, Min and Jiang, Hongfei and Aw, Aiti and Li, Haizhou and Tan, Chew Lim and Li, Sheng
Experiments
BLEU (%)
Experiments
Rule Type (STSG)    TR       TR+TSR_L    TR+TSR_L+TSR_P    TR+TSR
BLEU (%)            24.71    25.72       25.93             26.07
Experiments
Rule Type            BLEU (%)
TR+TSR               26.07
(TR+TSR) w/o SRR     24.62
(TR+TSR) w/o DPR     25.78
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zhang, Dongdong and Li, Mu and Duan, Nan and Li, Chi-Ho and Zhou, Ming
Experiments
In addition to precision and recall, we also evaluate the Bleu score (Papineni et al., 2002) changes before and after applying our measure word generation method to the SMT output.
Experiments
For our test data, we only consider sentences containing measure words for Bleu score evaluation.
Experiments
Our measure word generation step leads to a Bleu score improvement of 0.32 where the window size is set to 10, which shows that it can improve the translation quality of an English-to-Chinese SMT system.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Quirk, Chris and Moore, Robert C. and Gildea, Daniel
Experiments
Given an unlimited amount of time, we would tune the prior to maximize end-to-end performance, using an objective function such as BLEU .
Experiments
We do compare VB against EM in terms of final BLEU scores in the translation experiments to ensure that this sparse prior has a sig-
Experiments
Minimum Error Rate training (Och, 2003) over BLEU was used to optimize the weights for each of these models over the development test data.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: