Experiments | The second score is BLEU (Papineni et al., 2001), computed between the reconstructed and the original sentences, which allows us to check how well the quality of reconstruction correlates with the internal score.
Experiments | In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences. |
Abstract | The minimum Bayes risk (MBR) decoding objective improves BLEU scores for machine translation output relative to the standard Viterbi objective of maximizing model score. |
Abstract | However, MBR targeting BLEU is prohibitively slow to optimize over k-best lists for large k. In this paper, we introduce and analyze an alternative to MBR that is equally effective at improving performance, yet is asymptotically faster, running 80 times faster than MBR in experiments with 1000-best lists.
Abstract | Our forest-based decoding objective consistently outperforms k-best list MBR, giving improvements of up to 1.0 BLEU.
Consensus Decoding Algorithms | Typically, MBR is defined as $\arg\min_{e \in E} \mathbb{E}_{e'}[L(e; e')]$ for some loss function $L$, for example $1 - \mathrm{BLEU}(e; e')$. These definitions are equivalent.
Consensus Decoding Algorithms | Figure 1 compares Algorithms 1 and 2 using $U(e; e')$. Other linear functions have been explored for MBR, including Taylor approximations to the logarithm of BLEU (Tromble et al., 2008) and counts of matching constituents (Zhang and Gildea, 2008), which are discussed further in Section 3.3.
Consensus Decoding Algorithms | Computing MBR even with simple nonlinear measures such as BLEU, NIST or bag-of-words F1 seems to require $O(k^2)$ computation time.
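To make the quadratic cost concrete, here is a minimal sketch of MBR decoding over a k-best list with a sentence-level BLEU gain; the softmax posterior, the sentence_bleu scorer, and the scaling constant are illustrative assumptions, not details taken from the papers above.

import math

def kbest_mbr(hypotheses, model_scores, sentence_bleu, scale=1.0):
    """Return the hypothesis with maximum expected sentence-level BLEU
    under a posterior estimated from the k-best list.
    Requires O(k^2) calls to sentence_bleu."""
    # Normalize model scores into a posterior via a softmax.
    m = max(model_scores)
    weights = [math.exp(scale * (s - m)) for s in model_scores]
    z = sum(weights)
    posterior = [w / z for w in weights]

    best_hyp, best_gain = None, float("-inf")
    for e in hypotheses:  # candidate being scored
        gain = sum(p * sentence_bleu(e, e_prime)
                   for e_prime, p in zip(hypotheses, posterior))
        if gain > best_gain:
            best_hyp, best_gain = e, gain
    return best_hyp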
Introduction | In statistical machine translation, output translations are evaluated by their similarity to human reference translations, where similarity is most often measured by BLEU (Papineni et al., 2002). |
Introduction | Unfortunately, with a nonlinear similarity measure like BLEU, we must resort to approximating the expected loss using a k-best list, which accounts for only a tiny fraction of a model’s full posterior distribution.
Introduction | In experiments using BLEU over 1000-best lists, we found that our objective provided benefits very similar to MBR, only much faster. |
Experiments | Translation quality was evaluated using both the BLEU score proposed by Papineni et al. |
Experiments | (2002) and also the modified BLEU (BLEU-Fix) score used in the IWSLT 2008 evaluation campaign, where the brevity calculation is modified to use closest reference length instead of shortest reference length.
Experiments | Method         BLEU           BLEU-Fix
Triangulation  33.70/27.46    31.59/25.02
Transfer       33.52/28.34    31.36/26.20
Synthetic      34.35/27.21    32.00/26.07
Combination    38.14/29.32    34.76/27.39
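A minimal sketch of the brevity-penalty difference described above, assuming a tokenized candidate of length c and a list of reference lengths; the function and its default are illustrative, not the IWSLT implementation.

import math

def brevity_penalty(c, ref_lens, use_closest=True):
    """BLEU brevity penalty. With use_closest=True the effective reference
    length is the one closest to the candidate length c (the BLEU-Fix
    behaviour described above); otherwise the shortest reference is used."""
    if use_closest:
        r = min(ref_lens, key=lambda length: (abs(length - c), length))
    else:
        r = min(ref_lens)
    return 1.0 if c > r else math.exp(1.0 - r / c)

# Example: c = 10 with reference lengths [7, 11].
# Shortest-reference BP = 1.0; closest-reference BP = exp(1 - 11/10) ≈ 0.905.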
Translation Selection | In this paper, we modify the method in Albrecht and Hwa (2007) to only prepare human reference translations for the training examples, and then evaluate the translations produced by the subject systems against the references using BLEU score (Papineni et al., 2002). |
Translation Selection | We use smoothed sentence-level BLEU score to replace the human assessments, where we use additive smoothing to avoid zero BLEU scores when we calculate the n-gram precisions. |
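A minimal sketch of a smoothed sentence-level BLEU along these lines, using additive smoothing so that the n-gram precisions (and hence the score) never collapse to zero on single sentences; this is an illustrative reimplementation, not the authors' exact scorer.

import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_sentence_bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing of the n-gram precisions
    and the usual exponential brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
        matches, total = sum((h & r).values()), sum(h.values())
        # Additive smoothing avoids log(0) when no n-gram matches.
        log_prec += math.log((matches + 1.0) / (total + 1.0)) / max_n
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1.0 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec)

# Example usage:
# smoothed_sentence_bleu("the cat sat".split(), "the cat sat down".split())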
Translation Selection | In the context of translation selection, y is assigned as the smoothed BLEU score.
Abstract | Comparable to the state-of-the-art phrase-based system Moses, using packed forests in tree-to-tree translation results in a significant absolute improvement of 3.6 BLEU points over using 1-best trees.
Experiments | We evaluated the translation quality using the BLEU metric, as calculated by mteval-v11b.pl with its default settings except that we used case-insensitive matching of n-grams.
Experiments | Table 3: Comparison of BLEU scores for tree-based and forest-based tree-to-tree models. |
Introduction | Comparable to Moses, our forest-based tree-to-tree model achieves an absolute improvement of 3.6 BLEU points over the conventional tree-based model.
Abstract | Comparable to the state-of-the-art system combination technique, joint decoding achieves an absolute improvement of 1.5 BLEU points over individual decoding. |
Experiments | We evaluated the translation quality using case-insensitive BLEU metric (Papineni et al., 2002). |
Experiments | Table 2: Comparison of individual decoding and joint decoding in terms of decoding time (seconds/sentence) and BLEU score (case-insensitive).
Experiments | With conventional max-derivation decoding, the hierarchical phrase-based model achieved a BLEU score of 30.11 on the test set, with an average decoding time of 40.53 seconds/sentence. |
Introduction | As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
Introduction | Joint decoding with multiple models achieves an absolute improvement of 1.5 BLEU points over individual decoding with single models (Section 5).
Abstract | We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008). |
Experimental Results | Table 1: BLEU scores for Viterbi, Crunching, MBR, and variational decoding.
Experimental Results | Table 1 presents the BLEU scores under Viterbi, crunching, MBR, and variational decoding. |
Experimental Results | Table 2 presents the BLEU results under different ways in using the variational models, as discussed in Section 3.2.3. |
Introduction | We geometrically interpolate the resulting approximations q with one another (and with the original distribution p), justifying this interpolation as similar to the minimum-risk decoding for BLEU proposed by Tromble et al. |
Variational Approximate Decoding | However, in order to score well on the BLEU metric for MT evaluation (Papineni et al., 2001), which gives partial credit, we would also like to favor lower-order n-grams that are likely to appear in the reference, even if this means picking some less-likely high-order n-grams. |
Variational vs. Min-Risk Decoding | They use the following loss function, of which a linear approximation to BLEU (Papineni et al., 2001) is a special case, |
Introduction | Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
Introduction | We employ MERT to select these weights by optimizing BLEU score on a development set. |
Introduction | In contrast, our MBR algorithm directly selects the hypothesis in the hypergraph with the maximum expected approximate corpus BLEU score (Tromble et al., 2008). |
MERT for MBR Parameter Optimization | However, this does not guarantee that the resulting linear score (Equation 2) is close to the corpus BLEU.
MERT for MBR Parameter Optimization | We now describe how MERT can be used to estimate these factors to achieve a better approximation to the corpus BLEU . |
MERT for MBR Parameter Optimization | We recall that MERT selects weights in a linear model to optimize an error criterion (e.g., corpus BLEU) on a training set.
Minimum Bayes-Risk Decoding | This reranking can be done for any sentence-level loss function such as BLEU (Papineni et al., 2001), Word Error Rate, or Position-independent Error Rate. |
Minimum Bayes-Risk Decoding | (2008) extended MBR decoding to translation lattices under an approximate BLEU score. |
Minimum Bayes-Risk Decoding | They approximated the log(BLEU) score by a linear function of n-gram matches and candidate length.
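A minimal sketch of such a linear gain: a weight on candidate length plus per-order weights on n-gram matches. The heuristic weight schedule in the closing comment, with n-gram precisions assumed to decay as p·r^(n-1), paraphrases the setup these papers attribute to Tromble et al. (2008); the function and parameter names are illustrative.

from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def linear_bleu_gain(candidate, evidence, theta0, thetas):
    """Linear approximation to (log) BLEU: theta0 * |candidate| plus a
    weighted count of n-gram matches between candidate and evidence.
    thetas[n-1] is the weight for n-gram matches, n = 1..len(thetas)."""
    gain = theta0 * len(candidate)
    for n, theta_n in enumerate(thetas, start=1):
        matches = sum((ngram_counts(candidate, n) & ngram_counts(evidence, n)).values())
        gain += theta_n * matches
    return gain

# Heuristic weights assume n-gram precisions decay exponentially with n,
# e.g. thetas[n-1] proportional to 1.0 / (p * r ** (n - 1)); MERT, as described
# above, can instead tune theta0 and thetas on a development set.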
Abstract | We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features. |
Conclusions | In comparison to a baseline model, we achieve statistically significant improvement in BLEU score. |
Discussion | Given that we only looked at IS factors within a sentence, we think that such a significant improvement in BLEU and exact match scores is very encouraging. |
Generation Ranking Experiments | We evaluate the string chosen by the log-linear model against the original treebank string in terms of exact match and BLEU score (Papineni et al., 2002).
Generation Ranking Experiments | We achieve an improvement of 0.0168 BLEU points and 1.91 percentage points in exact match. |
AL-SMT: Multilingual Setting | The translation quality is measured by TQ for individual systems $M^{F_d \rightarrow E}$; it can be the BLEU score or WER/PER (word error rate and position-independent word error rate), which induce a maximization or minimization problem, respectively.
AL-SMT: Multilingual Setting | This process is continued iteratively until a certain level of translation quality is met (we use the BLEU score, WER and PER) (Papineni et al., 2002). |
Experiments | The number of weights is 3 plus the number of source languages, and they are trained using minimum error-rate training (MERT) to maximize the BLEU score (Och, 2003) on a development set.
Sentence Selection: Multiple Language Pairs | Let $e_c$ be the consensus among all the candidate translations; then define the disagreement as $\sum_d \alpha_d (1 - \mathrm{BLEU}(e_c, e_d))$.
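A minimal sketch of that disagreement term, assuming a sentence_bleu(hyp, ref) scorer like the one sketched earlier and per-language weights alpha_d; all names are illustrative.

def disagreement(consensus, candidates, alphas, sentence_bleu):
    """Weighted disagreement between the consensus translation e_c and the
    candidate translations e_d: sum_d alpha_d * (1 - BLEU(e_c, e_d))."""
    return sum(a_d * (1.0 - sentence_bleu(consensus, e_d))
               for e_d, a_d in zip(candidates, alphas))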
Abstract | Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets. |
Conclusion and future work | We use dependency scores as an extra feature in our MT experiments, and found that our dependency model provides significant gains over a competitive baseline that incorporates a large 5-gram language model (0.92% TER and 0.45% BLEU absolute improvements). |
Dependency parsing for machine translation | We found that dependency scores with or without loop elimination are generally close and highly correlated, and that MT performance without final loop removal was about the same (generally less than 0.2% BLEU).
Introduction | In our experiments, we build a competitive baseline (Koehn et al., 2007) incorporating a 5-gram LM trained on a large part of Gigaword and show that our dependency language model provides improvements on five different test sets, with an overall gain of 0.92 in TER and 0.45 in BLEU scores. |
Machine translation experiments | Parameter tuning was done with minimum error rate training (Och, 2003), which was used to maximize BLEU (Papineni et al., 2001). |
Machine translation experiments | In the final evaluations, we report results using both TER (Snover et al., 2006) and the original BLEU metric as described in (Papineni et al., 2001). |
Machine translation experiments | For BLEU evaluations, differences are significant in four out of six cases, and in the case of TER, all differences are significant. |
Abstract | Trained on 8,975 dependency structures of a Chinese Dependency Treebank, the realizer achieves a BLEU score of 0.8874. |
Experiments | In addition to BLEU score, percentage of exactly matched sentences and average NIST simple string accuracy (SSA) are adopted as evaluation metrics. |
Experiments | We observe that the BLEU score is boosted from 0.1478 to 0.5943 by using the RPD method. |
Experiments | All of the four feature functions we have tested achieve considerable improvement in BLEU scores. |
Log-linear Models | BLEU score, a method originally proposed to automatically evaluate machine translation quality (Papineni et al., 2002), has been widely used as a metric to evaluate general-purpose sentence generation (Langkilde, 2002; White et al., 2007; Guo et al. |
Log-linear Models | The BLEU measure computes the geometric mean of the precision of n-grams of various lengths between a sentence realization and a (set of) reference(s). |
Log-linear Models | The BLEU scoring script is supplied by the NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
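For reference, the standard form of this geometric-mean score, with modified n-gram precisions $p_n$, uniform weights $w_n = 1/N$ (typically $N = 4$), candidate length $c$, and effective reference length $r$, is:

\mathrm{BLEU} = BP \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}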
Analysis | The constituent boundary matching feature (CBMF) is a very important feature, which by itself achieves significant improvement over the baseline (up to 1.13 BLEU).
Analysis | 5.2 Beyond BLEU |
Analysis | Since BLEU is not sufficient |
Experiments | Statistical significance in BLEU score differences was tested by paired bootstrap re-sampling (Koehn, 2004). |
Experiments | Like (Marton and Resnik, 2008), we find that the XP+ feature obtains a significant improvement of 1.08 BLEU over the baseline. |
Experiments | However, using all syntax-driven features described in section 3.2, our SDB models achieve larger improvements of up to 1.67 BLEU.
Introduction | Our experimental results show that our SDB model achieves a substantial improvement over the baseline and significantly outperforms XP+ according to the BLEU metric (Papineni et al., 2002).
Introduction | In addition, our analysis shows further evidence of the performance gain from a different perspective than that of BLEU.
Experimental Evaluation | For MCE learning, we selected the reference compression that maximizes the BLEU score (Papineni et al., 2002), i.e., $\arg\max_{r \in R} \mathrm{BLEU}(r, R \setminus r)$, from the set of reference compressions and used it as correct data for training.
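A minimal sketch of that leave-one-out selection, assuming a generic bleu(candidate, references) scorer; the names are illustrative.

def select_training_reference(references, bleu):
    """Return the reference r maximizing BLEU(r, R without r), i.e. the
    compression most similar to the remaining references."""
    best_ref, best_score = None, float("-inf")
    for i, r in enumerate(references):
        others = references[:i] + references[i + 1:]
        score = bleu(r, others)
        if score > best_score:
            best_ref, best_score = r, score
    return best_ref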
Experimental Evaluation | For automatic evaluation, we employed BLEU (Papineni et al., 2002) by following (Unno et al., 2006). |
Experimental Evaluation | Label      BLEU
Proposed   .679
w/o PLM    .617
w/o IPTW   .635
Hori-      .493
Results and Discussion | Our method achieved the highest BLEU score. |
Results and Discussion | For example, ‘w/o PLM + Dep’ achieved the second highest BLEU score. |
Results and Discussion | Compared to ‘Hori-’, ‘Hori’ achieved a significantly higher BLEU score.
Experiments | In our experiments, all the models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric as percentages.
Experiments | Figure 3 shows the BLEU score curves with up to 1000 candidates used for re-ranking. |
Experiments | Figure 4 shows the BLEU scores of a two-system co-decoding as a function of re-decoding iterations. |
Abstract | We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings.
Experimental Evaluation | BLEUR includes the following 18 sentence-level scores: BLEU-n and n-gram precision scores (1 ≤ n ≤ 4); BLEU brevity penalty (BP); BLEU score divided by BP.
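A minimal sketch of assembling such a sentence-level feature group (cumulative BLEU-n, individual n-gram precisions, the brevity penalty, and BLEU divided by BP); the smoothing and the exact feature count here are illustrative simplifications, not the full 18-score set.

import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_features(hyp, ref, max_n=4):
    """Return [BLEU-1..BLEU-4, p_1..p_4, BP, BLEU-4 / BP] for one sentence."""
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
        matches, total = sum((h & r).values()), sum(h.values())
        precisions.append((matches + 1.0) / (total + 1.0))  # add-one smoothing
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1.0 - len(ref) / max(len(hyp), 1))
    bleu_n = [bp * math.exp(sum(math.log(p) for p in precisions[:n]) / n)
              for n in range(1, max_n + 1)]
    return bleu_n + precisions + [bp, bleu_n[-1] / bp]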
Introduction | Since human evaluation is costly and difficult to do reliably, a major focus of research has been on automatic measures of MT quality, pioneered by BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).
Introduction | BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations. |
Introduction | (2006) have identified a number of problems with BLEU and related n-gram-based scores: (1) BLEU-like metrics are unreliable at the level of individual sentences due to data sparsity; (2) BLEU metrics can be “gamed” by permuting word order; (3) for some corpora and languages, the correlation to human ratings is very low even at the system level; (4) scores are biased towards statistical MT; (5) the quality gap between MT and human translations is not reflected in equally large BLEU differences. |
Experiments | System / Model   BLEU
Moses (cBP)      23.86
STSSG            25.92
SncTSSG          26.53
Experiments | ID  Rule Set                       BLEU
1   CR (STSSG)                     25.92
2   CR w/o ncPR                    25.87
3   CR w/o ncPR + tgtncR           26.14
4   CR w/o ncPR + srcncR           26.50
5   CR w/o ncPR + src&tgtncR       26.51
6   CR + tgtncR                    26.11
7   CR + srcncR                    26.56
8   CR + src&tgtncR (SncTSSG)      26.53
Experiments | 2) Not only that, after comparing Exps 6, 7, and 8 against Exps 3, 4, and 5 respectively, we find that the ability of rules derived from noncontiguous tree sequence pairs generally covers that of the rules derived from the contiguous tree sequence pairs, as indicated by the only slight change in BLEU score.
Alternatives to Correlation-based Meta-evaluation | We have studied 100 sentence evaluation cases from representatives of each metric family, including 1-PER, BLEU, DP-Or-*, GTM (e = 2), METEOR and ROUGE-L. The evaluation cases have been extracted from the four test beds.
Metrics and Test Beds | At the lexical level, we have included several standard metrics, based on different similarity assumptions: edit distance (WER, PER and TER), lexical precision (BLEU and NIST), lexical recall (ROUGE), and F-measure (GTM and METEOR).
Previous Work on Machine Translation Meta-Evaluation | (2001) introduced the BLEU metric and evaluated its reliability in terms of Pearson correlation with human assessments for adequacy and fluency judgements. |
Previous Work on Machine Translation Meta-Evaluation | With the aim of overcoming some of the deficiencies of BLEU , Doddington (2002) introduced the NIST metric. |
Previous Work on Machine Translation Meta-Evaluation | Lin and Och (2004) experimented, unlike previous works, with a wide set of metrics, including NIST, WER (Nießen et al., 2000), PER (Tillmann et al., 1997), and variants of ROUGE, BLEU and GTM.
Experiment | Model   BLEU (%)
Moses    25.68
TT2S     26.08
TTS2S    26.95
FT2S     27.66
FTS2S    28.83
Experiment | The 9% tree sequence rules contribute 1.17 BLEU score improvement (28.83-27.66 in Table 1) to FTS2S over FT2S. |
Experiment | BLEU (%)
N-best \ Model   FT2S     FTS2S
100 Best         27.40    28.61
500 Best         27.66    28.83
2500 Best        27.66    28.96
5000 Best        27.79    28.89
Results | Since MT systems are tuned for word-based overlap measures (such as BLEU), verb deletion is penalized the same as, for example, determiner deletion.
5W System | model score and word penalty for a combination of BLEU and TER (2*(1-BLEU) + TER).
5W System | BLEU scores on the government-supplied test set in December 2008 were 35.2 for formal text, 29.2 for informal text, 33.2 for formal speech, and 27.6 for informal speech.
The Chinese-English 5W Task | Unlike word- or phrase-overlap measures such as BLEU, the 5W evaluation takes into account “concept” or “nugget” translation.
Discussion and Future Work | When we visually inspect and compare the outputs of our system with those of the baseline, we observe that improved BLEU score often corresponds to visible improvements in the subjective translation quality. |
Discussion and Future Work | Perhaps surprisingly, translation performance, 30.90 BLEU , was around the level we obtained when using frequency to approximate function words at N = 64. |
Experimental Results | These results confirm that the pairwise dominance model can significantly increase performance as measured by the BLEU score, with a consistent pattern of results across the MT06 and MT08 test sets. |
Experimental Setup | In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004).