Index of papers in Proc. ACL 2009 that mention
  • BLEU
Zaslavskiy, Mikhail and Dymetman, Marc and Cancedda, Nicola
Conclusion
BLEU score
Experiments
BLEU score
Experiments
The second score is BLEU (Papineni et al., 2001), computed between the reconstructed and the original sentences, which allows us to check how well the quality of reconstruction correlates with the internal score.
Experiments
In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences.
Future Work
BLEU score
BLEU is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
DeNero, John and Chiang, David and Knight, Kevin
Abstract
The minimum Bayes risk (MBR) decoding objective improves BLEU scores for machine translation output relative to the standard Viterbi objective of maximizing model score.
Abstract
However, MBR targeting BLEU is prohibitively slow to optimize over k-best lists for large k. In this paper, we introduce and analyze an alternative to MBR that is equally effective at improving performance, yet is asymptotically faster, running 80 times faster than MBR in experiments with 1000-best lists.
Abstract
Our forest-based decoding objective consistently outperforms k-best list MBR, giving improvements of up to 1.0 BLEU.
Consensus Decoding Algorithms
Typically, MBR is defined as arg min_{e ∈ E} E[L(e; e′)] for some loss function L, for example 1 − BLEU(e; e′). These definitions are equivalent.
Consensus Decoding Algorithms
Figure 1 compares Algorithms 1 and 2 using U(e; e′). Other linear functions have been explored for MBR, including Taylor approximations to the logarithm of BLEU (Tromble et al., 2008) and counts of matching constituents (Zhang and Gildea, 2008), which are discussed further in Section 3.3.
Consensus Decoding Algorithms
Computing MBR even with simple nonlinear measures such as BLEU, NIST, or bag-of-words F1 seems to require O(k²) computation time.
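The excerpts above define MBR with a 1 − BLEU loss and note its O(k²) cost over k-best lists. As an illustration only (the sentence-level `bleu` helper and the list representation are assumptions, not the authors' code), a minimal Python sketch of that baseline:

```python
# Minimal sketch (not code from the paper): MBR decoding over a k-best
# list with the loss 1 - BLEU(e; e'). `bleu` is an assumed sentence-level
# scorer returning a value in [0, 1]; the nested loop over candidate
# pairs is what makes this O(k^2) in the list size.
def mbr_decode(kbest, posteriors, bleu):
    """kbest: list of candidate translations; posteriors: their
    normalized model probabilities (same length, summing to 1)."""
    best, best_risk = None, float("inf")
    for e in kbest:
        # expected loss of e under the model's posterior over e'
        risk = sum(p * (1.0 - bleu(e, e_prime))
                   for e_prime, p in zip(kbest, posteriors))
        if risk < best_risk:
            best, best_risk = e, risk
    return best
```

The quadratic loop over candidate pairs is exactly the cost that the paper's faster forest-based objective is designed to avoid.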
Introduction
In statistical machine translation, output translations are evaluated by their similarity to human reference translations, where similarity is most often measured by BLEU (Papineni et al., 2002).
Introduction
Unfortunately, with a nonlinear similarity measure like BLEU, we must resort to approximating the expected loss using a k-best list, which accounts for only a tiny fraction of a model’s full posterior distribution.
Introduction
In experiments using BLEU over 1000-best lists, we found that our objective provided benefits very similar to MBR, only much faster.
BLEU is mentioned in 37 sentences in this paper.
Topics mentioned in this paper:
Wu, Hua and Wang, Haifeng
Experiments
Translation quality was evaluated using both the BLEU score proposed by Papineni et al.
Experiments
(2002) and also the modified BLEU (BLEU-Fix) score used in the IWSLT 2008 evaluation campaign, where the brevity calculation is modified to use the closest reference length instead of the shortest reference length.
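As a hedged sketch of the distinction this sentence draws (the function name and the tie-breaking rule are assumptions, not the IWSLT 2008 script), the two brevity-penalty policies look like this:

```python
import math

# Sketch of the two brevity-penalty policies contrasted above: shortest
# vs. closest reference length. Illustration under assumed conventions,
# not the BLEU-Fix script itself.
def brevity_penalty(hyp_len, ref_lens, policy="closest"):
    if hyp_len == 0:
        return 0.0
    if policy == "shortest":
        r = min(ref_lens)
    else:
        # closest: reference length nearest to the hypothesis length
        # (ties broken toward the shorter reference)
        r = min(ref_lens, key=lambda length: (abs(length - hyp_len), length))
    return 1.0 if hyp_len > r else math.exp(1.0 - r / hyp_len)
```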
Experiments
Method          BLEU           BLEU-Fix
Triangulation   33.70/27.46    31.59/25.02
Transfer        33.52/28.34    31.36/26.20
Synthetic       34.35/27.21    32.00/26.07
Combination     38.14/29.32    34.76/27.39
Translation Selection
In this paper, we modify the method in Albrecht and Hwa (2007) to only prepare human reference translations for the training examples, and then evaluate the translations produced by the subject systems against the references using BLEU score (Papineni et al., 2002).
Translation Selection
We use smoothed sentence-level BLEU score to replace the human assessments, where we use additive smoothing to avoid zero BLEU scores when we calculate the n-gram precisions.
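A minimal sketch of such a smoothed sentence-level BLEU (the add-one constant, the single reference, and the function names are illustrative assumptions, not the authors' exact implementation):

```python
import math
from collections import Counter

# Sketch: sentence-level BLEU with additive smoothing of the n-gram
# precisions, so a zero higher-order match count does not zero out the
# whole score.
def smoothed_sentence_bleu(hyp_tokens, ref_tokens, max_n=4):
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    if not hyp_tokens:
        return 0.0
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp_tokens, n), ngrams(ref_tokens, n)
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # additive (add-one) smoothing keeps every precision strictly positive
        log_precisions += math.log((matches + 1.0) / (total + 1.0))
    c, r = len(hyp_tokens), len(ref_tokens)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_precisions / max_n)
```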
Translation Selection
In the context of translation selection, 3/ is assigned as the smoothed BLEU score.
BLEU is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang and Lü, Yajuan and Liu, Qun
Abstract
Comparable to the state-of-the-art phrase-based system Moses, using packed forests in tree-to-tree translation results in a significant absolute improvement of 3.6 BLEU points over using 1-best trees.
Experiments
We evaluated the translation quality using the BLEU metric, as calculated by mteval-v11b.pl with its default setting except that we used case-insensitive matching of n-grams.
Experiments
avg trees # of rules BLEU
Experiments
Table 3: Comparison of BLEU scores for tree-based and forest-based tree-to-tree models.
Introduction
Comparable to Moses, our forest-based tree-to-tree model achieves an absolute improvement of 3.6 BLEU points over conventional tree-based model.
BLEU is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang and Mi, Haitao and Feng, Yang and Liu, Qun
Abstract
Comparable to the state-of-the-art system combination technique, joint decoding achieves an absolute improvement of 1.5 BLEU points over individual decoding.
Experiments
We evaluated the translation quality using case-insensitive BLEU metric (Papineni et al., 2002).
Experiments
Table 2: Comparison of individual decoding and joint decoding in terms of speed (seconds/sentence) and BLEU score (case-insensitive).
Experiments
With conventional max-derivation decoding, the hierarchical phrase-based model achieved a BLEU score of 30.11 on the test set, with an average decoding time of 40.53 seconds/sentence.
Introduction
• As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
Introduction
Joint decoding with multiple models achieves an absolute improvement of 1.5 BLEU points over individual decoding with single models (Section 5).
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Li, Zhifei and Eisner, Jason and Khudanpur, Sanjeev
Abstract
We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008).
Experimental Results
Table 1: BLEU scores for Viterbi, Crunching, MBR, and variational decoding.
Experimental Results
Table 1 presents the BLEU scores under Viterbi, crunching, MBR, and variational decoding.
Experimental Results
Table 2 presents the BLEU results under different ways of using the variational models, as discussed in Section 3.2.3.
Introduction
We geometrically interpolate the resulting approximations q with one another (and with the original distribution p), justifying this interpolation as similar to the minimum-risk decoding for BLEU proposed by Tromble et al.
Variational Approximate Decoding
However, in order to score well on the BLEU metric for MT evaluation (Papineni et al., 2001), which gives partial credit, we would also like to favor lower-order n-grams that are likely to appear in the reference, even if this means picking some less-likely high-order n-grams.
Variational vs. Min-Risk Decoding
They use the following loss function, of which a linear approximation to BLEU (Papineni et al., 2001) is a special case,
BLEU is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Kumar, Shankar and Macherey, Wolfgang and Dyer, Chris and Och, Franz
Introduction
Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
Introduction
We employ MERT to select these weights by optimizing BLEU score on a development set.
Introduction
In contrast, our MBR algorithm directly selects the hypothesis in the hypergraph with the maximum expected approximate corpus BLEU score (Tromble et al., 2008).
MERT for MBR Parameter Optimization
However, this does not guarantee that the resulting linear score (Equation 2) is close to the corpus BLEU.
MERT for MBR Parameter Optimization
We now describe how MERT can be used to estimate these factors to achieve a better approximation to the corpus BLEU .
MERT for MBR Parameter Optimization
We recall that MERT selects weights in a linear model to optimize an error criterion (e.g., corpus BLEU) on a training set.
Minimum Bayes-Risk Decoding
This reranking can be done for any sentence-level loss function such as BLEU (Papineni et al., 2001), Word Error Rate, or Position-independent Error Rate.
Minimum Bayes-Risk Decoding
(2008) extended MBR decoding to translation lattices under an approximate BLEU score.
Minimum Bayes-Risk Decoding
They approximated the log(BLEU) score by a linear function of n-gram matches and candidate length.
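The linear form referred to in these excerpts is usually written as follows; this is a reconstruction following the form attributed to Tromble et al. (2008), not a quotation, and the weight settings are heuristic:

```latex
% Sketch of the linear gain: a candidate-length term plus weighted
% n-gram match counts (reconstruction, not quoted from the paper).
G(e, e') = \theta_0 \, |e'| + \sum_{n=1}^{N} \theta_n \sum_{w \in \mathcal{N}_n} \#_w(e') \, \delta_w(e)
```

Here #_w(e′) counts occurrences of the n-gram w in the candidate e′, δ_w(e) indicates whether w appears in e, and the θ_n are the factors that, per the excerpts above, are either set heuristically under the exponential-decay assumption or tuned with MERT.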
BLEU is mentioned in 20 sentences in this paper.
Topics mentioned in this paper:
Cahill, Aoife and Riester, Arndt
Abstract
We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features.
Conclusions
In comparison to a baseline model, we achieve statistically significant improvement in BLEU score.
Discussion
Given that we only looked at IS factors within a sentence, we think that such a significant improvement in BLEU and exact match scores is very encouraging.
Generation Ranking Experiments
Model BLEU Match (%)
Generation Ranking Experiments
We evaluate the string chosen by the log-linear model against the original treebank string in terms of exact match and BLEU score (Papineni et al.,
Generation Ranking Experiments
We achieve an improvement of 0.0168 BLEU points and 1.91 percentage points in exact match.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Haffari, Gholamreza and Sarkar, Anoop
AL-SMT: Multilingual Setting
The translation quality is measured by TQ for the individual systems M_{Fd→E}; it can be the BLEU score or WER/PER (word error rate and position-independent WER), which induces a maximization or minimization problem, respectively.
AL-SMT: Multilingual Setting
This process is continued iteratively until a certain level of translation quality is met (we use the BLEU score, WER and PER) (Papineni et al., 2002).
Experiments
The number of weights is 3 plus the number of source languages, and they are trained using minimum error-rate training (MERT) to maximize the BLEU score (Och, 2003) on a development set.
Experiments
Avg BLEU Score
Experiments
Avg BLEU Score
Sentence Selection: Multiple Language Pairs
• Let e_c be the consensus among all the candidate translations, then define the disagreement as Σ_d α_d (1 − BLEU(e_c, e_d)).
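A minimal sketch of that disagreement term (the dictionary layout and the sentence-level `bleu` helper are illustrative assumptions, not the authors' code):

```python
# Sketch: disagreement of per-source-language candidates e_d from the
# consensus translation e_c, weighted by alpha_d, as in the excerpt above.
# `bleu` is an assumed sentence-level scorer returning a value in [0, 1].
def disagreement(e_c, candidates, alphas, bleu):
    """candidates: dict mapping language d -> candidate e_d;
    alphas: dict mapping language d -> weight alpha_d."""
    return sum(alphas[d] * (1.0 - bleu(e_c, candidates[d])) for d in candidates)
```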
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Galley, Michel and Manning, Christopher D.
Abstract
Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets.
Conclusion and future work
We use dependency scores as an extra feature in our MT experiments, and found that our dependency model provides significant gains over a competitive baseline that incorporates a large 5-gram language model (0.92% TER and 0.45% BLEU absolute improvements).
Dependency parsing for machine translation
We found that dependency scores with or without loop elimination are generally close and highly correlated, and that MT performance without final loop removal was about the same (generally less than 0.2% BLEU).
Introduction
In our experiments, we build a competitive baseline (Koehn et al., 2007) incorporating a 5-gram LM trained on a large part of Gigaword and show that our dependency language model provides improvements on five different test sets, with an overall gain of 0.92 in TER and 0.45 in BLEU scores.
Machine translation experiments
Parameter tuning was done with minimum error rate training (Och, 2003), which was used to maximize BLEU (Papineni et al., 2001).
Machine translation experiments
In the final evaluations, we report results using both TER (Snover et al., 2006) and the original BLEU metric as described in (Papineni et al., 2001).
Machine translation experiments
For BLEU evaluations, differences are significant in four out of six cases, and in the case of TER, all differences are significant.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wang, Haifeng and Guo, Yuqing and Liu, Ting
Abstract
Trained on 8,975 dependency structures of a Chinese Dependency Treebank, the realizer achieves a BLEU score of 0.8874.
Experiments
In addition to BLEU score, percentage of exactly matched sentences and average NIST simple string accuracy (SSA) are adopted as evaluation metrics.
Experiments
We observe that the BLEU score is boosted from 0.1478 to 0.5943 by using the RPD method.
Experiments
All of the four feature functions we have tested achieve considerable improvement in BLEU scores.
Log-linear Models
BLEU score, a method originally proposed to automatically evaluate machine translation quality (Papineni et al., 2002), has been widely used as a metric to evaluate general-purpose sentence generation (Langkilde, 2002; White et al., 2007; Guo et al.
Log-linear Models
The BLEU measure computes the geometric mean of the precision of n-grams of various lengths between a sentence realization and a (set of) reference(s).
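For reference, the standard form of the score that sentence describes, with uniform weights w_n = 1/N over the n-gram precisions p_n, brevity penalty BP, candidate length c, and effective reference length r:

```latex
% Standard BLEU definition (Papineni et al., 2002), shown for reference.
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```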
Log-linear Models
The BLEU scoring script is supplied by the NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min and Aw, Aiti and Li, Haizhou
Analysis
• The constituent boundary matching feature (CBMF) is a very important feature, which by itself achieves significant improvement over the baseline (up to 1.13 BLEU).
Analysis
5.2 Beyond BLEU
Analysis
Since BLEU is not sufficient
Experiments
Statistical significance in BLEU score differences was tested by paired bootstrap re-sampling (Koehn, 2004).
Experiments
Like (Marton and Resnik, 2008), we find that the XP+ feature obtains a significant improvement of 1.08 BLEU over the baseline.
Experiments
However, using all syntax-driven features described in section 3.2, our SDB models achieve larger improvements of up to 1.67 BLEU.
Introduction
Our experimental results display that our SDB model achieves a substantial improvement over the baseline and significantly outperforms XP+ according to the BLEU metric (Papineni et al., 2002).
Introduction
In addition, our analysis shows further evidence of the performance gain from a different perspective than that of BLEU.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Hirao, Tsutomu and Suzuki, Jun and Isozaki, Hideki
Experimental Evaluation
For MCE learning, we selected the reference compression that maximizes the BLEU score (Papineni et al., 2002), i.e., argmax_{r ∈ R} BLEU(r, R\r), from the set of reference compressions and used it as correct data for training.
Experimental Evaluation
For automatic evaluation, we employed BLEU (Papineni et al., 2002) by following (Unno et al., 2006).
Experimental Evaluation
Label       BLEU
Proposed    .679
w/o PLM     .617
w/o IPTW    .635
Hori-       .493
Results and Discussion
Our method achieved the highest BLEU score.
Results and Discussion
For example, ‘w/o PLM + Dep’ achieved the second highest BLEU score.
Results and Discussion
Compared to ‘Hori—’, ‘Hori’ achieved a significantly higher BLEU score.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Li, Mu and Duan, Nan and Zhang, Dongdong and Li, Chi-Ho and Zhou, Ming
Experiments
In our experiments, all the models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric as percentages.
Experiments
Figure 3 shows the BLEU score curves with up to 1000 candidates used for re-ranking.
Experiments
Figure 4 shows the BLEU scores of a two-system co-decoding as a function of re-decoding iterations.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Pado, Sebastian and Galley, Michel and Jurafsky, Dan and Manning, Christopher D.
Abstract
We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings.
Experimental Evaluation
BLEUR includes the following 18 sentence-level scores: BLEU-n and n-gram precision scores (1 ≤ n ≤ 4); BLEU brevity penalty (BP); BLEU score divided by BP.
Introduction
Since human evaluation is costly and difficult to do reliably, a major focus of research has been on automatic measures of MT quality, pioneered by BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).
Introduction
BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations.
Introduction
(2006) have identified a number of problems with BLEU and related n-gram-based scores: (1) BLEU-like metrics are unreliable at the level of individual sentences due to data sparsity; (2) BLEU metrics can be “gamed” by permuting word order; (3) for some corpora and languages, the correlation to human ratings is very low even at the system level; (4) scores are biased towards statistical MT; (5) the quality gap between MT and human translations is not reflected in equally large BLEU differences.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Sun, Jun and Zhang, Min and Tan, Chew Lim
Experiments
System   Model     BLEU
Moses    cBP       23.86
         STSSG     25.92
         SncTSSG   26.53
Experiments
ID   Rule Set                      BLEU
1    CR (STSSG)                    25.92
2    CR w/o ncPR                   25.87
3    CR w/o ncPR + tgtncR          26.14
4    CR w/o ncPR + srcncR          26.50
5    CR w/o ncPR + src&tgtncR      26.51
6    CR + tgtncR                   26.11
7    CR + srcncR                   26.56
8    CR + src&tgtncR (SncTSSG)     26.53
Experiments
2) Not only that, after comparing Exp 6, 7, 8 against Exp 3, 4, 5 respectively, we find that the ability of rules derived from noncontiguous tree sequence pairs generally covers that of the rules derived from the contiguous tree sequence pairs, due to the slight change in BLEU score.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Amigó, Enrique and Giménez, Jesús and Gonzalo, Julio and Verdejo, Felisa
Alternatives to Correlation-based Meta-evaluation
We have studied 100 sentence evaluation cases from representatives of each metric family including: 1-PER, BLEU, DP-Or-*, GTM (e = 2), METEOR and ROUGE-L. The evaluation cases have been extracted from the four test beds.
Metrics and Test Beds
At the lexical level, we have included several standard metrics, based on different similarity assumptions: edit distance (WER, PER and TER), lexical precision (BLEU and NIST), lexical recall (ROUGE), and F-measure (GTM and METEOR).
Previous Work on Machine Translation Meta-Evaluation
Papineni et al. (2001) introduced the BLEU metric and evaluated its reliability in terms of Pearson correlation with human assessments for adequacy and fluency judgements.
Previous Work on Machine Translation Meta-Evaluation
With the aim of overcoming some of the deficiencies of BLEU , Doddington (2002) introduced the NIST metric.
Previous Work on Machine Translation Meta-Evaluation
Lin and Och (2004) experimented, unlike previous works, with a wide set of metrics, including NIST, WER (Nießen et al., 2000), PER (Tillmann et al., 1997), and variants of ROUGE, BLEU and GTM.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hui and Zhang, Min and Li, Haizhou and Aw, Aiti and Tan, Chew Lim
Experiment
Model    BLEU (%)
Moses    25.68
TT2S     26.08
TTS2S    26.95
FT2S     27.66
FTS2S    28.83
Experiment
The 9% tree sequence rules contribute a 1.17-point BLEU score improvement (28.83 − 27.66 in Table 1) to FTS2S over FT2S.
Experiment
BLEU (%)
N-best \ model   FT2S    FTS2S
100 Best         27.40   28.61
500 Best         27.66   28.83
2500 Best        27.66   28.96
5000 Best        27.79   28.89
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Parton, Kristen and McKeown, Kathleen R. and Coyne, Bob and Diab, Mona T. and Grishman, Ralph and Hakkani-Tür, Dilek and Harper, Mary and Ji, Heng and Ma, Wei Yun and Meyers, Adam and Stolbach, Sara and Sun, Ang and Tur, Gokhan and Xu, Wei and Yaman, Sibel
Results
Since MT systems are tuned for word-based overlap measures (such as BLEU), verb deletion is penalized the same as, for example, determiner deletion.
5W System
model score and word penalty for a combination of BLEU and TER (2*(1-BLEU) + TER).
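A trivial sketch of the tuning criterion quoted in that fragment, assuming BLEU and TER are both expressed as fractions so that lower combined values are better:

```python
# Sketch: the combined tuning criterion 2*(1-BLEU) + TER from the excerpt
# above; both inputs are assumed to be fractions in [0, 1], and smaller
# combined values indicate better output.
def combined_error(bleu, ter):
    return 2.0 * (1.0 - bleu) + ter
```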
5W System
BLEU scores on the government-supplied test set in December 2008 were 35.2 for formal text, 29.2 for informal text, 33.2 for formal speech, and 27.6 for informal speech.
The Chinese-English 5W Task
Unlike word- or phrase-overlap measures such as BLEU, the 5W evaluation takes into account “concept” or “nugget” translation.
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Setiawan, Hendra and Kan, Min Yen and Li, Haizhou and Resnik, Philip
Discussion and Future Work
When we visually inspect and compare the outputs of our system with those of the baseline, we observe that an improved BLEU score often corresponds to visible improvements in the subjective translation quality.
Discussion and Future Work
Perhaps surprisingly, translation performance, 30.90 BLEU, was around the level we obtained when using frequency to approximate function words at N = 64.
Experimental Results
These results confirm that the pairwise dominance model can significantly increase performance as measured by the BLEU score, with a consistent pattern of results across the MT06 and MT08 test sets.
Experimental Setup
In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004).
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper: