Abstract | The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system. |
Experimental results | We substitute our language model and use MERT (Och, 2003) to optimize the BLEU score (Papineni et al., 2002). |
Experimental results | We partition the data into ten pieces: 9 pieces are used as training data to optimize the BLEU score (Papineni et al., 2002) by MERT (Och, 2003), and the remaining single piece is used to re-rank the 1000-best list and obtain the BLEU score. |
Introduction | ply our language models to the task of re-ranking the N-best list from Hiero (Chiang, 2005; Chiang, 2007), a state-of-the-art parsing-based MT system, we achieve significantly better translation quality measured by the BLEU score and “readability”. |
Conclusion and Future Work | We found that using a segmented translation model based on unsupervised morphology induction and a model that combined morpheme segments in the translation model with a postprocessing morphology prediction model gave us better BLEU scores than a word-based baseline. |
Experimental Results | All the BLEU scores reported are for lowercase evaluation. |
Experimental Results | No Uni indicates the segmented BLEU score without unigrams. |
Experimental Results | .on of m-BLEU score (Luong et al., 2010) where the BLEU score is computed by comparing the segmented output with a segmented reference translation. |
Models 2.1 Baseline Models | performance of unsupervised segmentation for translation, our third baseline is a segmented translation model based on a supervised segmentation model (called Sup), using the hand-built Omorfi morphological analyzer (Pirinen and Listenmaa, 2007), which provided slightly higher BLEU scores than the word-based baseline. |
Translation and Morphology | Our proposed approaches are significantly better than the state of the art, achieving the highest reported BLEU scores on the English-Finnish Europarl version 3 dataset. |
Abstract | Extensive experiments involving large-scale English-to-Japanese translation revealed a significant improvement of 1.8 points in BLEU score, as compared with a strong forest-to-string baseline system. |
Conclusion | Extensive experiments on large-scale English-to-Japanese translation resulted in a significant improvement in BLEU score of 1.8 points (p < 0.01), as compared with our implementation of a strong forest-to-string baseline system (Mi et al., 2008; Mi and Huang, 2008). |
Experiments | Here, fw denotes function word, and DT denotes the decoding time, and the BLEU scores were computed on the test set |
Experiments | the final BLEU scores of C3-T with Min-F and C3-F. |
Experiments | Using the composed rule set C3-F in our forest-based decoder, we achieved an optimal BLEU score of 28.89 (%). |
Introduction | (2008) achieved a 3.1-point improvement in BLEU score (Papineni et al., 2002) by including bilingual syntactic phrases in their forest-based system. |
Introduction | Using the composed rules of the present study in a baseline forest-to-string translation system results in a 1.8-point improvement in the BLEU score for large-scale English-to-Japanese translation. |
Machine Translation as a Decipherment Task | The figure also shows the corresponding BLEU scores in parentheses for comparison (higher scores indicate better MT output). |
Machine Translation as a Decipherment Task | Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment. |
Machine Translation as a Decipherment Task | Figure 4 plots the BLEU scores versus training sizes for different MT systems on the Time corpus. |
Experimental Evaluation | For most models, while likelihood continued to increase gradually for all 100 iterations, BLEU score gains plateaued after 5-10 iterations, likely due to the strong prior information |
Experimental Evaluation | It can also be seen that combining phrase tables from multiple samples improved the BLEU score for HLEN, but not for HIER. |
Hierarchical ITG Model | (2003) that using phrases where max(|e|, |f|) ≤ 3 cause significant improvements in BLEU score, while using larger phrases results in diminishing returns. |
Introduction | We also find that it achieves superior BLEU scores over previously proposed ITG-based phrase alignment approaches. |
Conclusion and Future Work | In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling. |
Experiments | The 10-fold cross-validated BLEU score (Papineni et al., 2002) over this data is 0.81. |
Experiments | Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation. |
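
Every excerpt above reports a BLEU score (Papineni et al., 2002), so a small worked computation may help fix what the number measures. The following is a minimal sketch, not the evaluation script of any of the quoted papers. It assumes pre-tokenized input, a single reference per hypothesis, n-gram orders 1 to 4, and no smoothing. The function names `ngram_counts` and `corpus_bleu` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, assuming exactly one reference per hypothesis.

    hypotheses, references: parallel lists of token lists.
    Returns a value in [0, 1]; no smoothing is applied.
    """
    matched = [0] * max_n  # clipped n-gram matches, one slot per order
    total = [0] * max_n    # hypothesis n-grams, one slot per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # unsmoothed BLEU is zero when any order has no match

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

The function returns a fraction, as in the 0.81 figure quoted above. Multiplying by 100 gives the percentage form used for figures such as 28.89 (%). Tokenization, casing (the lowercase evaluation mentioned above) and the number of references all change the result, so scores from different papers are comparable only when those settings match. Under the same assumptions, a segmented variant such as m-BLEU would pass morpheme-segmented token lists for both hypothesis and reference.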