Index of papers in Proc. ACL 2008 that mention
  • BLEU score
Ganchev, Kuzman and Graça, João V. and Taskar, Ben
Abstract
We propose and extensively evaluate a simple method for using alignment models to produce alignments better-suited for phrase-based MT systems, and show significant gains (as measured by BLEU score) in end-to-end translation systems for six language pairs used in recent MT competitions.
Conclusions
Table 3: BLEU scores for all language pairs using all available data.
Introduction
Our contribution is a large-scale evaluation of this methodology for word alignments, an investigation of how the produced alignments differ and how they can be used to consistently improve machine translation performance (as measured by BLEU score) across many languages on training corpora with up to a hundred thousand sentences.
Introduction
In 10 out of 12 cases we improve BLEU score by at least 1 point, and by more than 1 point in 4 out of 12 cases.
Phrase-based machine translation
We report BLEU scores using a script available with the baseline system.
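For reference, the core of such a script is compact. The sketch below is a minimal corpus-level BLEU-4 in Python, assuming a single tokenized reference per segment and no smoothing; the function names are ours, and production scripts additionally handle tokenization and multiple references.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        """All n-grams of a token list, as a Counter."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def corpus_bleu(references, hypotheses, max_n=4):
        """Corpus-level BLEU, one reference per segment, no smoothing."""
        clipped = [0] * max_n   # clipped n-gram matches per order
        totals = [0] * max_n    # hypothesis n-gram counts per order
        hyp_len = ref_len = 0
        for ref, hyp in zip(references, hypotheses):
            hyp_len += len(hyp)
            ref_len += len(ref)
            for n in range(1, max_n + 1):
                hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
                totals[n - 1] += sum(hyp_ng.values())
                clipped[n - 1] += sum(min(c, ref_ng[g]) for g, c in hyp_ng.items())
        if min(clipped) == 0:
            return 0.0
        log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
        brevity = min(1.0, math.exp(1.0 - ref_len / hyp_len))  # brevity penalty
        return brevity * math.exp(log_prec)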
Phrase-based machine translation
Figure 8: BLEU score as the amount of training data is increased on the Hansards corpus for the best decoding method for each alignment model.
Phrase-based machine translation
In principle, we would like to tune the threshold by optimizing BLEU score on a development set, but that is impractical for experiments with many pairs of languages.
Word alignment results
Unfortunately, as was shown by Fraser and Marcu (2007), AER can have weak correlation with translation performance as measured by BLEU score (Papineni et al., 2002), when the alignments are used to train a phrase-based translation system.
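For reference, the metric of Papineni et al. (2002) combines clipped n-gram precisions with a brevity penalty:

    \mathrm{BLEU} \;=\; \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big),
    \qquad
    \mathrm{BP} \;=\;
    \begin{cases}
      1 & \text{if } c > r, \\
      e^{\,1 - r/c} & \text{if } c \le r,
    \end{cases}

where p_n are the modified (clipped) n-gram precisions, w_n = 1/N are uniform weights with N = 4 in the standard setting, c is the total candidate length, and r is the effective reference length.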
BLEU score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Mi, Haitao and Huang, Liang and Liu, Qun
Experiments
We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on the dev set.
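Minimum error-rate training repeatedly rescores a fixed n-best list and moves the feature weights toward higher dev-set BLEU. Och's algorithm uses an exact line search over the n-best list; the Python sketch below substitutes plain random search over weight vectors, assumes each n-best entry is a (feature_vector, tokens) pair, and borrows NLTK's corpus_bleu for scoring.

    import random
    from nltk.translate.bleu_score import corpus_bleu

    def rescore(nbest, weights):
        """For each sentence, pick the candidate with the best weighted model score."""
        picks = []
        for candidates in nbest:  # candidates: list of (feature_vector, tokens) pairs
            best = max(candidates,
                       key=lambda c: sum(w * f for w, f in zip(weights, c[0])))
            picks.append(best[1])
        return picks

    def tune_weights(nbest, references, dim, trials=200, seed=0):
        """Random-search stand-in for MERT: maximize dev-set BLEU over weights."""
        rng = random.Random(seed)
        best_w, best_bleu = None, -1.0
        for _ in range(trials):
            w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            bleu = corpus_bleu([[r] for r in references], rescore(nbest, w))
            if bleu > best_bleu:
                best_w, best_bleu = w, bleu
        return best_w, best_bleu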
Experiments
The BLEU score of the baseline 1-best decoding is 0.2325, which is consistent with the result of 0.2302 in (Liu et al., 2007) on the same training, development and test sets, and with the same rule extraction procedure.
BLEU score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Toutanova, Kristina and Suzuki, Hisami and Ruopp, Achim
Integration of inflection models with MT systems
We performed a grid search on the values of λ and n to maximize the BLEU score of the final system on a development set (dev) of 1000 sentences (Table 2).
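A grid search of this kind simply evaluates dev-set BLEU at every point of a small Cartesian grid and keeps the best. In the Python sketch below, the evaluate_bleu callable is a hypothetical stand-in for decoding the dev set with the given settings and scoring the output; the example call uses a made-up scoring surface only to show the interface.

    from itertools import product

    def grid_search(evaluate_bleu, lams, ns):
        """Exhaustively evaluate dev-set BLEU over a grid of (lambda, n) and
        return the best point. evaluate_bleu(lam, n) is assumed to decode the
        dev set with those settings and return corpus BLEU."""
        return max(product(lams, ns), key=lambda point: evaluate_bleu(*point))

    # Example call with a made-up scoring surface, just to show the interface:
    best_lam, best_n = grid_search(
        lambda lam, n: -(lam - 0.3) ** 2 - (n - 5) ** 2 / 100.0,
        [0.1 * i for i in range(11)], [1, 2, 5, 10])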
MT performance results
We also report oracle BLEU scores which incorporate two kinds of oracle knowledge.
MT performance results
For the methods using n=1 translations from a base MT system, the oracle BLEU score is the BLEU score of the stemmed translation compared to the stemmed reference, which represents the upper bound achievable by changing only the inflected forms (but not the stems) of the words in a translation.
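In other words, the oracle stems both sides before scoring, so inflection errors cost nothing. A sketch follows, using NLTK's English Porter stemmer purely for illustration; the paper's actual target languages would need their own stemmers.

    from nltk.stem import PorterStemmer
    from nltk.translate.bleu_score import corpus_bleu

    stemmer = PorterStemmer()

    def stem_tokens(tokens):
        """Reduce each token to its stem so that inflection differences vanish."""
        return [stemmer.stem(t) for t in tokens]

    def stemmed_oracle_bleu(references, hypotheses):
        """BLEU of stemmed hypotheses against stemmed references: the upper
        bound reachable by re-inflecting hypothesis words without changing stems."""
        refs = [[stem_tokens(r)] for r in references]  # one reference per segment
        hyps = [stem_tokens(h) for h in hypotheses]
        return corpus_bleu(refs, hyps)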
MT performance results
This system achieves a substantially better BLEU score (by 6.76) than the treelet system.
BLEU score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Uszkoreit, Jakob and Brants, Thorsten
Abstract
We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
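The log-linear framework referred to here scores candidate translations by a weighted sum of feature functions, so each language model, word-based or class-based, enters as one feature h_m with its own weight:

    \hat{e} \;=\; \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f),
    \qquad
    h_{\text{word}}(e, f) = \log p_{\text{word}}(e),
    \quad
    h_{\text{class}}(e, f) = \log p_{\text{class}}(e)

with the feature weights λ_m tuned on held-out data, for instance by minimum error rate training against BLEU.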
Conclusion
The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks.
Experiments
Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English.
Experiments
Feature weights are tuned using minimum error rate training (Och, 2003) with BLEU score as the objective function.
Experiments
Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task.
BLEU score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Gildea, Daniel
Abstract
An additional fast decoding pass maximizing the expected count of correct translation hypotheses increases the BLEU score significantly.
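Maximizing an expected-accuracy objective rather than model score is the minimum-Bayes-risk idea. The Python sketch below shows a common simplified instantiation over an n-best list, picking the hypothesis with the highest expected sentence-level BLEU under the normalized posterior; the paper's own pass works over a forest and maximizes expected counts of correct hypotheses, which this sketch does not reproduce.

    import math
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    smooth = SmoothingFunction().method1

    def mbr_select(nbest):
        """nbest: list of (log_score, tokens) pairs from the model. Returns the
        hypothesis with the highest expected sentence BLEU under the posterior."""
        m = max(s for s, _ in nbest)
        post = [math.exp(s - m) for s, _ in nbest]  # unnormalized posterior
        z = sum(post)
        post = [p / z for p in post]
        best_hyp, best_gain = None, -1.0
        for _, hyp in nbest:
            gain = sum(p * sentence_bleu([ref], hyp, smoothing_function=smooth)
                       for p, (_, ref) in zip(post, nbest))
            if gain > best_gain:
                best_hyp, best_gain = hyp, gain
        return best_hyp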
Conclusion
This technique, together with the progressive search at previous stages, gives a decoder that produces the highest BLEU score we have obtained on the data in a very reasonable amount of time.
Experiments
Table 1: Speed and BLEU scores for two-pass decoding.
Experiments
However, model scores do not directly translate into BLEU scores.
Experiments
In order to maximize BLEU score using the algorithm described in Section 4, we need a sizable trigram forest as a starting point.
Introduction
With this heuristic, we achieve the same BLEU scores and model cost as a trigram decoder with essentially the same speed as a bigram decoder.
BLEU score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Cherry, Colin
Cohesive Phrasal Output
We tested this approach on our English-French development set, and saw no improvement in BLEU score .
Conclusion
Our experiments have shown that roughly 1/5 of our baseline English-French translations contain cohesion violations, and these translations tend to receive lower BLEU scores .
Experiments
We first present our soft cohesion constraint’s effect on BLEU score (Papineni et al., 2002) for both our dev-test and test sets.
Experiments
First of all, looking across columns, we can see that there is a definite divide in BLEU score between our two evaluation subsets.
Experiments
Sentences with cohesive baseline translations receive much higher BLEU scores than those with uncohesive baseline translations.
BLEU score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Talbot, David and Brants, Thorsten
Experiments
Table 5 shows baseline translation BLEU scores for a lossless (non-randomized) language model with parameter values quantized into 5 to 8 bits.
Experiments
Table 5: Baseline BLEU scores with lossless n-gram model and different quantization levels (bits).
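Quantizing to b bits means each stored parameter becomes an index into a codebook of 2^b values. The paper's codebook construction is not given in this index, so the uniform-binning rule in the Python sketch below is an assumption, illustrated over n-gram log-probabilities.

    import numpy as np

    def quantize(log_probs, bits):
        """Uniformly bin log-probabilities into 2**bits codebook entries.
        Returns (codes, codebook), where codebook[codes] approximates the input."""
        log_probs = np.asarray(log_probs, dtype=float)
        levels = 2 ** bits
        edges = np.linspace(log_probs.min(), log_probs.max(), levels + 1)
        codebook = (edges[:-1] + edges[1:]) / 2.0  # bin centers
        codes = np.clip(np.digitize(log_probs, edges) - 1, 0, levels - 1)
        return codes.astype(np.uint8 if bits <= 8 else np.uint16), codebook

    def dequantize(codes, codebook):
        """Recover approximate log-probabilities from the stored codes."""
        return codebook[codes]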
Experiments
Figure 3: BLEU scores on the MT05 data set.
BLEU score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Deng, Yonggang and Xu, Jia and Gao, Yuqing
Discussions
After reaching its peak, the BLEU score drops as the threshold τ increases.
Discussions
On the other hand, adding phrase pairs extracted by the new method only (PP3) can lead to significant BLEU score increases (comparing row 1 vs. 3, and row 2 vs. 4).
Experimental Results
BLEU Scores
Experimental Results
Once we have computed all feature values for all phrase pairs in the training corpus, we discriminatively train the feature weights λk and the threshold τ using the downhill simplex method to maximize the BLEU score on the 06dev set.
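Downhill simplex (Nelder-Mead) needs no gradients, which suits a non-smooth objective like BLEU. The sketch below drives SciPy's Nelder-Mead on the negated objective; dev_bleu is a hypothetical callable that decodes the dev set with weights params[:-1] and threshold params[-1] and returns corpus BLEU.

    import numpy as np
    from scipy.optimize import minimize

    def tune(dev_bleu, init_weights, init_threshold):
        """Maximize dev-set BLEU over feature weights and the threshold tau
        with downhill simplex; Nelder-Mead minimizes, so BLEU is negated."""
        x0 = np.append(np.asarray(init_weights, dtype=float), init_threshold)
        result = minimize(lambda x: -dev_bleu(x), x0, method="Nelder-Mead",
                          options={"xatol": 1e-3, "fatol": 1e-4, "maxiter": 500})
        return result.x[:-1], result.x[-1], -result.fun  # weights, tau, best BLEU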
Experimental Results
Roughly, it has a 0.5% higher BLEU score on the 2006 sets and a 1.5% to 3% higher score on the other sets than the Model-4 based ViterbiExtract method.
BLEU score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Espinosa, Dominic and White, Michael and Mehay, Dennis
Results and Discussion
In particular, the hypertagger makes possible a more than 6-point improvement in the overall BLEU score on both the development and test sections, and a more than 12-point improvement on the sentences with complete realizations.
Results and Discussion
Even with the current incomplete set of semantic templates, the hypertagger brings realizer performance roughly up to state-of-the-art levels, as our overall test set BLEU score (0.6701) slightly exceeds that of Cahill and van Genabith (2006), though at a coverage of 96% instead of 98%.
The Approach
compared the percentage of complete realizations (versus fragmentary ones) with their top-scoring model against an oracle model that uses a simplified BLEU score based on the target string, which is useful for regression testing as it guides the best-first search to the reference sentence.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zhang, Dongdong and Li, Mu and Duan, Nan and Li, Chi-Ho and Zhou, Ming
Experiments
In addition to precision and recall, we also evaluate the BLEU score (Papineni et al., 2002) changes before and after applying our measure word generation method to the SMT output.
Experiments
For our test data, we only consider sentences containing measure words for BLEU score evaluation.
Experiments
Our measure word generation step leads to a BLEU score improvement of 0.32 when the window size is set to 10, which shows that it can improve the translation quality of an English-to-Chinese SMT system.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: