Experiments | As more training pairs are used, the model produces more varied sentences (PINC) but preserves the meaning less well (BLEU). |
Experiments | As a comparison, evaluating each human description as a paraphrase for the other descriptions in the same cluster resulted in a BLEU score of 52.9 and a PINC score of 77.2. |
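The PINC score used above measures lexical dissimilarity from the source sentence. A minimal stdlib sketch of its standard formulation, the average fraction of candidate n-grams not found in the source (function names are ours, not from any of the cited papers):

```python
def ngram_set(tokens, n):
    """All distinct n-grams of the token list as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """PINC: for each n up to max_n, the fraction of candidate n-grams
    absent from the source, averaged over n. Higher = more novelty."""
    scores = []
    for n in range(1, max_n + 1):
        cand = ngram_set(candidate, n)
        if not cand:
            continue  # candidate shorter than n tokens
        overlap = len(cand & ngram_set(source, n))
        scores.append(1.0 - overlap / len(cand))
    return sum(scores) / len(scores) if scores else 0.0
```

An identical copy of the source scores 0, while a candidate sharing no n-grams with the source scores 1, which is why high PINC alone (without BLEU against references) would reward meaning-destroying rewrites.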
Introduction | In addition to the lack of standard datasets for training and testing, there are also no standard metrics like BLEU (Papineni et al., 2002) for evaluating paraphrase systems. |
Paraphrase Evaluation Metrics | One of the limitations to the development of machine paraphrasing is the lack of standard metrics like BLEU, which has played a crucial role in driving progress in MT. |
Paraphrase Evaluation Metrics | Thus, researchers have been unable to rely on BLEU or some derivative: the optimal paraphrasing engine under these terms would be one that simply returns the input. |
Paraphrase Evaluation Metrics | To measure semantic equivalence, we simply use BLEU with multiple references. |
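A minimal sketch of BLEU with multiple references, where each candidate n-gram count is clipped by its maximum count in any single reference. This is an illustration of the standard metric, not the authors' implementation; the smoothing floor and helper names are our own choices:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def multi_ref_bleu(candidate, references, max_n=4):
    """BLEU with multiple references: clipped n-gram precision for
    n = 1..max_n, geometric mean, times a brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # clip each candidate n-gram by its max count over all references
        max_ref = Counter()
        for ref in references:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        log_prec += math.log(max(clipped, 1e-9) / total)  # tiny floor avoids log(0)
    # brevity penalty against the reference closest in length
    c_len = len(candidate)
    r_len = min((len(r) for r in references),
                key=lambda rl: (abs(rl - c_len), rl))
    bp = 1.0 if c_len >= r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(log_prec / max_n)
```

With multiple references, a paraphrase gets credit for matching any of several valid wordings, which is what makes BLEU usable as a semantic-adequacy proxy here.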
Experimental Results | All the BLEU scores reported are for lowercase evaluation. |
Experimental Results | m-BLEU indicates that the segmented output was evaluated against a segmented version of the reference (this measure does not have the same correlation with human judgement as BLEU). |
Experimental Results | No Uni indicates the segmented BLEU score without unigrams. |
Models 2.1 Baseline Models | performance of unsupervised segmentation for translation, our third baseline is a segmented translation model based on a supervised segmentation model (called Sup), using the hand-built Omorfi morphological analyzer (Pirinen and Listenmaa, 2007), which provided slightly higher BLEU scores than the word-based baseline. |
Translation and Morphology | Automatic evaluation measures for MT, such as BLEU (Papineni et al., 2002), WER (Word Error Rate) and PER (Position-Independent Word Error Rate), use the word as the basic unit rather than morphemes. |
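To make the word-level error rates concrete, here is a sketch of PER in one common formulation (variants differ in how a length mismatch is penalized, so treat the exact normalization as an assumption):

```python
from collections import Counter

def per(hyp, ref):
    """Position-Independent Error Rate: compare hypothesis and reference
    as bags of words, ignoring order; 0.0 = perfect, higher = worse."""
    matches = sum((Counter(hyp) & Counter(ref)).values())  # multiset overlap
    # penalize surplus hypothesis words, normalize by reference length
    return 1.0 - (matches - max(0, len(hyp) - len(ref))) / max(len(ref), 1)
```

Unlike WER, reordering the words of a correct translation leaves PER at zero, which is exactly the order-insensitivity the name refers to.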
Translation and Morphology | Our proposed approaches are significantly better than the state of the art, achieving the highest reported BLEU scores on the English-Finnish Europarl version 3 dataset. |
Abstract | The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system. |
Experimental results | We substitute our language model and use MERT (Och, 2003) to optimize the BLEU score (Papineni et al., 2002). |
Experimental results | We partition the data into ten pieces: 9 pieces are used as training data to optimize the BLEU score (Papineni et al., 2002) by MERT (Och, 2003), and the remaining single piece is used to re-rank the 1000-best list and obtain the BLEU score. |
Introduction | When we apply our language models to the task of re-ranking the N-best list from Hiero (Chiang, 2005; Chiang, 2007), a state-of-the-art parsing-based MT system, we achieve significantly better translation quality measured by the BLEU score and “readability”. |
Abstract | The syntax-based translation system integrating the proposed techniques outperforms the best Arabic-English unconstrained system in NIST-08 evaluations by 1.3 absolute BLEU, which is statistically significant. |
Experiments | We use BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) to evaluate translation qualities. |
Experiments | and we achieved a BLEUr4n4 score of 55.01 for MT08-NW, or a cased BLEU of 53.31, which is close to the best officially reported result of 53.85 for unconstrained systems. We expose the statistical decisions in Eqn. 3 as an additional cost; the translation results in Table 11 show it helps BLEU by 0.29 points (56.13 vs. |
Abstract | Extensive experiments involving large-scale English-to-Japanese translation revealed a significant improvement of 1.8 points in BLEU score, as compared with a strong forest-to-string baseline system. |
Conclusion | Extensive experiments on large-scale English-to-Japanese translation resulted in a significant improvement in BLEU score of 1.8 points (p < 0.01), as compared with our implementation of a strong forest-to-string baseline system (Mi et al., 2008; Mi and Huang, 2008). |
Experiments | BLEU (%): 26.15 / 27.07 / 27.93 / 28.89 |
Experiments | Here, fw denotes function word, DT denotes the decoding time, and the BLEU scores were computed on the test set. |
Experiments | the final BLEU scores of C3-T with Min-F and C3-F. |
Introduction | (2008) achieved a 3.1-point improvement in BLEU score (Papineni et al., 2002) by including bilingual syntactic phrases in their forest-based system. |
Introduction | Using the composed rules of the present study in a baseline forest-to-string translation system results in a 1.8-point improvement in the BLEU score for large-scale English-to-Japanese translation. |
Abstract | Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system). |
Experiments | To evaluate the translation results, we use BLEU (Papineni et al., 2002). |
Experiments | On the English-Chinese data set, the improvement over the phrase-based system is 1.3 BLEU points, and 0.8 over the hierarchical phrase-based system. |
Experiments | In the tasks of translating to European languages, the improvements over the phrase-based baseline are in the range of 0.5 to 1.0 BLEU points, and 0.3 to 0.5 over the hierarchical phrase-based system. |
Abstract | As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram-based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. |
Abstract | We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the nonautomatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacy judgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER. |
Abstract | We argue that BLEU (Papineni et al., 2002) and other automatic n-gram-based MT evaluation metrics do not adequately capture the similarity in meaning between the machine translation and the reference translation, which, ultimately, is essential for MT output to be useful. |
Conclusion | This strategy leads to a better balanced distribution of the alternations in the training data, such that our linguistically informed generation ranking model achieves high BLEU scores and accurately predicts active and passive. |
Experimental Setup | Match: 15.45 / 15.04 / 11.89; LM BLEU: 0.68 / 0.68 / 0.65 |
Experimental Setup | Model BLEU: 0.764 / 0.759 / 0.747; NIST: 13.18 / 13.14 / 13.01 |
Experimental Setup | use several standard measures: a) exact match: how often does the model select the original corpus sentence, b) BLEU: n-gram overlap between top-ranked and original sentence, c) NIST: modification of BLEU giving more weight to less frequent n-grams. |
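The NIST weighting described in (c) credits each matching n-gram by its information value relative to its prefix, estimated from reference text. A sketch under that standard definition (the helper name and corpus shape are our own):

```python
import math
from collections import Counter

def nist_info_weights(reference_corpus, max_n=2):
    """NIST information weights from a tokenized reference corpus:
    Info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn)),
    so n-grams that are rare given their prefix earn more credit."""
    counts = Counter()
    for sent in reference_corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[tuple(sent[i:i + n])] += 1
    total_unigrams = sum(c for g, c in counts.items() if len(g) == 1)
    info = {}
    for gram, c in counts.items():
        # for unigrams the "prefix count" is the total number of tokens
        prefix_count = total_unigrams if len(gram) == 1 else counts[gram[:-1]]
        info[gram] = math.log2(prefix_count / c)
    return info
```

For example, a frequent word like "the" receives a lower weight than a rarer content word, which is the sense in which NIST "gives more weight to less frequent n-grams" than plain BLEU.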
Experiments | The differences in BLEU between the candidate sets and models are |
Experiments | Its BLEU score and match accuracy decrease only slightly (though statistically significantly). |
Experimental Evaluation | For most models, while likelihood continued to increase gradually for all 100 iterations, BLEU score gains plateaued after 5-10 iterations, likely due to the strong prior information. |
Experimental Evaluation | It can also be seen that combining phrase tables from multiple samples improved the BLEU score for HLEN, but not for HIER. |
Flat ITG Model | The average gain across all data sets was approximately 0.8 BLEU points. |
Hierarchical ITG Model | (2003) that using phrases where max(|e|, |f|) ≤ 3 causes significant improvements in BLEU score, while using larger phrases results in diminishing returns. |
Introduction | We also find that it achieves superior BLEU scores over previously proposed ITG-based phrase alignment approaches. |
Machine Translation as a Decipherment Task | Evaluation: All the MT systems are run on the Spanish test data and the quality of the resulting English translations is evaluated using two different measures: (1) normalized edit distance score (Navarro, 2001),6 and (2) BLEU (Papineni et |
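A sketch of a normalized edit distance over tokens: Levenshtein distance divided by the longer sentence length. This is one common normalization; the exact choice in the cited evaluation may differ:

```python
def normalized_edit_distance(hyp, ref):
    """Token-level Levenshtein distance, normalized to [0, 1] by the
    longer length (0.0 = identical). Uses the standard two-row DP."""
    m, n = len(hyp), len(ref)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / max(m, n)
```

Note the polarity is opposite to BLEU: lower edit distance is better, which is why the figure reports BLEU alongside it for comparison.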
Machine Translation as a Decipherment Task | The figure also shows the corresponding BLEU scores in parentheses for comparison (higher scores indicate better MT output). |
Machine Translation as a Decipherment Task | Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment. |
Abstract | We obtain statistically significant improvements across 4 different language pairs with English as source, of up to +1.92 BLEU for Chinese as target. |
Experiments | Our system (its) outperforms the baseline for all 4 language pairs for both BLEU and NIST scores, by a margin which scales up to +1.92 BLEU points for English to Chinese translation when training on the 400K set. |
Experiments | BLEU scores for 200K and 400K training sentence pairs. |
Experiments | Notably, as can be seen in Table 2(b), switching to a 4-gram LM results in performance gains for both the baseline and our system, and while the margin between the two systems decreases, our system continues to deliver a considerable and significant improvement in translation BLEU scores. |
Conclusion and Future Work | In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling. |
Experiments | The 10-fold cross-validated BLEU score (Papineni et al., 2002) over this data is 0.81. |
Experiments | Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation. |
Abstract | We present empirical results on a constrained Urdu-English translation task that demonstrate a significant BLEU score improvement and a large decrease in perplexity. |
Related Work | Figure 9 shows a statistically significant improvement to the BLEU score when using the HHMM and the n-gram LMs together on this reduced test set. |
Experiments | Unfortunately, variance in development set BLEU scores tends to be higher than in test set scores, despite SAMT MERT’s inbuilt algorithms to overcome local optima, such as random restarts and zeroing-out. |
Experiments | We have noticed that using an L0-penalized BLEU score5 as MERT’s objective on the merged n-best lists over all iterations is more stable and will therefore use this score to determine N. |
Experiments | 5Given by: BLEU − 5 × |{i ∈ {1, . |
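Since the footnote above is truncated, the exact objective is not recoverable from this text. Purely as an illustration, an L0-penalized objective of this kind might subtract a penalty proportional to the number of nonzero feature weights; the constant and the normalization here are assumptions, not the paper's formula:

```python
def l0_penalized_objective(bleu, weights, penalty=5.0):
    """Illustrative L0-penalized tuning objective: BLEU minus a penalty
    on the fraction of nonzero feature weights (the L0 'norm'). The
    constant 5.0 and the per-feature normalization are assumptions."""
    nonzero = sum(1 for w in weights if w != 0)
    return bleu - penalty * nonzero / max(len(weights), 1)
```

The intuition matches the surrounding text: penalizing active features discourages MERT from overfitting the merged n-best lists, stabilizing the development-set objective across iterations.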