Abstract | We propose and extensively evaluate a simple method for using alignment models to produce alignments better suited for phrase-based MT systems, and show significant gains (as measured by BLEU score) in end-to-end translation systems for six language pairs used in recent MT competitions.
Conclusions | Table 3: BLEU scores for all language pairs using all available data. |
Introduction | Our contribution is a large-scale evaluation of this methodology for word alignments, an investigation of how the produced alignments differ and how they can be used to consistently improve machine translation performance (as measured by BLEU score) across many languages on training corpora with up to a hundred thousand sentences.
Introduction | In 10 out of 12 cases we improve BLEU score by at least 1 point and by more than 1 point in 4 out of 12 cases.
Phrase-based machine translation | We report BLEU scores using a script available with the baseline system. |
Phrase-based machine translation | Figure 8: BLEU score as the amount of training data is increased on the Hansards corpus for the best decoding method for each alignment model. |
Phrase-based machine translation | In principle, we would like to tune the threshold by optimizing BLEU score on a development set, but that is impractical for experiments with many pairs of languages. |
Word alignment results | Unfortunately, as was shown by Fraser and Marcu (2007), AER can have weak correlation with translation performance as measured by BLEU score (Papineni et al., 2002), when the alignments are used to train a phrase-based translation system.
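For reference, the metric cited throughout these excerpts (Papineni et al., 2002) combines clipped n-gram precisions with a brevity penalty. A minimal single-reference, sentence-level sketch in Python follows; the function name is my own, and smoothing is deliberately omitted, so this is an illustration of the formula rather than the scoring script any of the quoted systems used:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of modified (clipped)
    n-gram precisions times a brevity penalty. No smoothing, so any
    zero n-gram overlap yields a score of 0."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # "Modified" precision: clip each candidate n-gram count by its
        # count in the reference, so repeated words cannot be over-credited.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean
```

In practice, corpus-level BLEU aggregates n-gram counts over all sentences before taking the geometric mean, which is why single-sentence scores can differ sharply from the corpus score.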
Experiments | BLEU score |
Experiments | We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on the dev set. |
Experiments | The BLEU score of the baseline 1-best decoding is 0.2325, which is consistent with the result of 0.2302 in (Liu et al., 2007) on the same training, development and test sets, and with the same rule extraction procedure. |
Integration of inflection models with MT systems | We performed a grid search on the values of λ and n, to maximize the BLEU score of the final system on a development set (dev) of 1000 sentences (Table 2).
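A grid search of the kind described can be sketched as follows. Here `dev_bleu` stands in for the expensive step of decoding the dev set and scoring it with BLEU; both that scorer and the toy objective with a known optimum are hypothetical placeholders, not part of the quoted system:

```python
import itertools

def grid_search(dev_bleu, lam_values, n_values):
    """Exhaustive grid search: evaluate every (lam, n) pair with the
    supplied dev-set scorer and return the best-scoring combination."""
    return max(itertools.product(lam_values, n_values),
               key=lambda pair: dev_bleu(*pair))

# Toy objective with its maximum at lam=0.5, n=2 (illustration only;
# in the quoted setup this would be a full decode-and-score pass).
toy_score = lambda lam, n: -((lam - 0.5) ** 2) - (n - 2) ** 2
```

Grid search is attractive here because the dev-set BLEU surface is noisy and non-differentiable, so gradient-based tuning is not an option; the cost is one full decoding pass per grid point.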
MT performance results | We also report oracle BLEU scores which incorporate two kinds of oracle knowledge. |
MT performance results | For the methods using n=1 translation from a base MT system, the oracle BLEU score is the BLEU score of the stemmed translation compared to the stemmed reference, which represents the upper bound achievable by changing only the inflected forms (but not stems) of the words in a translation.
MT performance results | This system achieves a substantially better BLEU score (by 6.76) than the treelet system. |
Abstract | We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
Conclusion | The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks. |
Experiments | Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English. |
Experiments | minimum error rate training (Och, 2003) with BLEU score as the objective function. |
Experiments | Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task. |
Abstract | An additional fast decoding pass maximizing the expected count of correct translation hypotheses increases the BLEU score significantly. |
Conclusion | This technique, together with the progressive search at previous stages, gives a decoder that produces the highest BLEU score we have obtained on the data in a very reasonable amount of time. |
Experiments | Table 1: Speed and BLEU scores for two-pass decoding.
Experiments | However, model scores do not directly translate into BLEU scores.
Experiments | In order to maximize BLEU score using the algorithm described in Section 4, we need a sizable trigram forest as a starting point. |
Introduction | With this heuristic, we achieve the same BLEU scores and model cost as a trigram decoder with essentially the same speed as a bigram decoder. |
Cohesive Phrasal Output | We tested this approach on our English-French development set, and saw no improvement in BLEU score . |
Conclusion | Our experiments have shown that roughly 1/5 of our baseline English-French translations contain cohesion violations, and these translations tend to receive lower BLEU scores . |
Experiments | We first present our soft cohesion constraint’s effect on BLEU score (Papineni et al., 2002) for both our dev-test and test sets. |
Experiments | First of all, looking across columns, we can see that there is a definite divide in BLEU score between our two evaluation subsets. |
Experiments | Sentences with cohesive baseline translations receive much higher BLEU scores than those with uncohesive baseline translations. |
Experiments | Table 5 shows baseline translation BLEU scores for a lossless (non-randomized) language model with parameter values quantized into 5 to 8 bits. |
Experiments | Table 5: Baseline BLEU scores with lossless n-gram model and different quantization levels (bits). |
Experiments | Figure 3: BLEU scores on the MT05 data set. |
Discussions | After reaching its peak, the BLEU score drops as the threshold τ increases.
Discussions | On the other hand, adding phrase pairs extracted by the new method only (PP3) can lead to significant BLEU score increases (comparing row 1 vs. 3, and row 2 vs. 4). |
Experimental Results | BLEU Scores |
Experimental Results | Once we have computed all feature values for all phrase pairs in the training corpus, we discriminatively train feature weights λk and the threshold τ using the downhill simplex method to maximize the BLEU score on the 06dev set.
Experimental Results | Roughly, it has a 0.5% higher BLEU score on the 2006 sets and a 1.5% to 3% higher score on the other sets than the Model-4-based ViterbiExtract method.
Results and Discussion | In particular, the hypertagger makes possible a more than 6-point improvement in the overall BLEU score on both the development and test sections, and a more than 12-point improvement on the sentences with complete realizations. |
Results and Discussion | Even with the current incomplete set of semantic templates, the hypertagger brings realizer performance roughly up to state-of-the-art levels, as our overall test set BLEU score (0.6701) slightly exceeds that of Cahill and van Genabith (2006), though at a coverage of 96% instead of 98%. |
The Approach | compared the percentage of complete realizations (versus fragmentary ones) with their top-scoring model against an oracle model that uses a simplified BLEU score based on the target string, which is useful for regression testing as it guides the best-first search to the reference sentence.
Experiments | In addition to precision and recall, we also evaluate the Bleu score (Papineni et al., 2002) changes before and after applying our measure word generation method to the SMT output. |
Experiments | For our test data, we only consider sentences containing measure words for Bleu score evaluation. |
Experiments | Our measure word generation step leads to a Bleu score improvement of 0.32 when the window size is set to 10, which shows that it can improve the translation quality of an English-to-Chinese SMT system.