Index of papers in Proc. ACL 2012 that mention
  • BLEU
Duh, Kevin and Sudoh, Katsuhito and Wu, Xianchao and Tsukada, Hajime and Nagata, Masaaki
Abstract
BLEU , TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality.
Experiments
As metrics we use BLEU and RIBES (which demonstrated good human correlation in this language pair (Goto et al., 2011)).
Experiments
As metrics we use BLEU and NTER.
Experiments
BLEU = BP × (∏_{n=1}^{4} prec_n)^{1/4}.
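As a quick illustration of this formula, here is a minimal Python sketch that combines a brevity penalty with the geometric mean of modified n-gram precisions; the precision values and sentence lengths below are invented, not taken from the paper.

    import math

    def bleu_from_precisions(precisions, hyp_len, ref_len):
        """Combine modified n-gram precisions with the brevity penalty:
        BLEU = BP * (prod_n prec_n)^(1/N), with N = len(precisions)."""
        if any(p == 0.0 for p in precisions):
            return 0.0  # unsmoothed BLEU is zero if any precision is zero
        bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
        return bp * geo_mean

    # Illustrative 1- to 4-gram precisions; hypothesis slightly shorter than reference.
    print(bleu_from_precisions([0.6, 0.4, 0.25, 0.15], hyp_len=18, ref_len=20))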
Introduction
These methods are effective because they tune the system to maximize an automatic evaluation metric such as BLEU, which serves as a surrogate objective for translation quality.
Introduction
However, we know that a single metric such as BLEU is not enough.
Introduction
For example, while BLEU (Papineni et al., 2002) focuses on word-based n-gram precision, METEOR (Lavie and Agarwal, 2007) allows for stem/synonym matching and incorporates recall.
Multi-objective Algorithms
If we had used BLEU scores rather than the {0,1} labels in line 8, the entire PMO-PRO algorithm would revert to single-objective PRO.
Theory of Pareto Optimality 2.1 Definitions and Concepts
For example, suppose K = 2, M1(h) computes the BLEU score, and M2(h) gives the METEOR score of h. Figure 1 illustrates the set of vectors {M(h)} in a 10-best list.
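To make the Pareto idea concrete, here is a minimal sketch that keeps only the non-dominated metric vectors of an n-best list; the (BLEU, METEOR) pairs are invented for illustration, and this is not the PMO-PRO algorithm itself.

    def pareto_frontier(points):
        """Return the non-dominated points among K-dimensional metric vectors,
        assuming higher is better in every dimension (e.g. (BLEU, METEOR) pairs)."""
        def dominates(a, b):
            return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
        return [p for p in points if not any(dominates(q, p) for q in points)]

    # Hypothetical (BLEU, METEOR) vectors for hypotheses in an n-best list.
    vectors = [(0.31, 0.52), (0.29, 0.55), (0.33, 0.49), (0.28, 0.50), (0.30, 0.53)]
    print(pareto_frontier(vectors))   # (0.28, 0.50) is dominated and dropped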
BLEU is mentioned in 22 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Kuhn, Roland and Larkin, Samuel
Abstract
Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU .
Abstract
In principle, tuning on these metrics should yield better systems than tuning on BLEU .
Abstract
It has a better correlation with human judgment than BLEU .
Introduction
BLEU (Papineni et al., 2002), NIST (Doddington, 2002), WER, PER, TER (Snover et al., 2006), and LRscore (Birch and Osborne, 2011) do not use external linguistic
Introduction
Among these metrics, BLEU is the most widely used for both evaluation and tuning.
Introduction
Many of the metrics correlate better with human judgments of translation quality than BLEU , as shown in recent WMT Evaluation Task reports (Callison-Burch et
BLEU is mentioned in 66 sentences in this paper.
Topics mentioned in this paper:
He, Xiaodong and Deng, Li
Abstract
In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset.
Abstract
The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system.
Abstract
parameters in the phrase and lexicon translation models are estimated by relative frequency or maximizing joint likelihood, which may not correspond closely to the translation measure, e.g., bilingual evaluation understudy ( BLEU ) (Papineni et al., 2002).
BLEU is mentioned in 44 sentences in this paper.
Topics mentioned in this paper:
Kolachina, Prasanth and Cancedda, Nicola and Dymetman, Marc and Venkatapathy, Sriram
Inferring a learning curve from mostly monolingual data
Our objective is to predict the evolution of the BLEU score on the given test set as a function of the size of a random subset of the training data.
Inferring a learning curve from mostly monolingual data
We first train models to predict the BLEU score at m anchor sizes s1, ..., sm.
Inferring a learning curve from mostly monolingual data
We then perform inference using these models to predict the BLEU score at each anchor, for the test case of interest.
Introduction
In both cases, the task consists in predicting an evaluation score ( BLEU , throughout this work) on the test corpus as a function of the size of a subset of the source sample, assuming that we could have it manually translated and use the resulting bilingual corpus for training.
Introduction
An extensive study across six parametric function families, empirically establishing that a certain three-parameter power-law family is well suited for modeling learning curves for the Moses SMT system when the evaluation score is BLEU .
Introduction
They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6).
Selecting a parametric family of curves
For a certain bilingual test dataset d, we consider a set of observations O_d = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where y_i is the performance on d (measured using BLEU (Papineni et al., 2002)) of a translation model trained on a parallel corpus of size x_i.
Selecting a parametric family of curves
The last condition is related to our use of BLEU, which is bounded by 1, as a performance measure; it should be noted that some growth patterns which are sometimes proposed, such as a logarithmic regime of the form y = a + b log x, are not
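As a rough sketch of fitting such a bounded learning curve, the code below fits a generic three-parameter power law y = c - a * x^(-alpha) with scipy; the exact parametric family selected in the paper is not reproduced here, and the (corpus size, BLEU) observations are invented.

    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(x, c, a, alpha):
        """Three-parameter power law y = c - a * x**(-alpha): grows with x
        and stays below the asymptote c (BLEU itself is bounded by 1)."""
        return c - a * np.power(x, -alpha)

    # Hypothetical (training size in segments, BLEU) observations.
    sizes = np.array([1e4, 2e4, 5e4, 1e5, 2e5])
    bleu = np.array([0.18, 0.21, 0.25, 0.27, 0.29])

    params, _ = curve_fit(power_law, sizes, bleu, p0=[0.35, 10.0, 0.5], maxfev=10000)
    print(params)                      # fitted (c, a, alpha)
    print(power_law(7.5e4, *params))   # predicted BLEU at an unseen size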
Selecting a parametric family of curves
The values are on the same scale as the BLEU scores.
BLEU is mentioned in 21 sentences in this paper.
Topics mentioned in this paper:
Wuebker, Joern and Ney, Hermann and Zens, Richard
Abstract
At a speed of roughly 70 words per second, Moses reaches 17.2% BLEU , whereas our approach yields 20.0% with identical models.
Experimental Evaluation
system | BLEU [%] | #HYP | #LM | w/s
N0 = ∞
baseline | 20.1 | 3.0K | 322K | 2.2
+presort | 20.1 | 2.5K | 183K | 3.6
N0 = 100
Experimental Evaluation
We evaluate with BLEU (Papineni et al., 2002) and TER (Snover et al., 2006).
Introduction
We also run comparisons with the Moses decoder (Koehn et al., 2007), which yields the same performance in BLEU , but is outperformed significantly in terms of scalability for faster translation.
BLEU is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Sun, Hong and Zhou, Ming
Abstract
In addition, a revised BLEU score (called iBLEU) which measures the adequacy and diversity of the generated paraphrase sentence is proposed for tuning parameters in SMT systems.
Experiments and Results
Joint learning | BLEU | self-BLEU | iBLEU
No Joint | 27.16 | 35.42 | /
α = 1 | 30.75 | 53.51 | 30.75
Experiments and Results
We show the BLEU score (computed against references) to measure the adequacy and self-BLEU (computed against source sentence) to evaluate the dissimilarity (lower is better).
Experiments and Results
From the results we can see that, when the value of α decreases to put more penalty on self-paraphrase, the self-BLEU score decays rapidly, while the BLEU score computed against references also drops considerably.
Introduction
The jointly-learned dual SMT system: (1) Adapts the SMT systems so that they are tuned specifically for paraphrase generation purposes, e. g., to increase the dissimilarity; (2) Employs a revised BLEU score (named iBLEU, as it’s an input-aware BLEU metric) that measures adequacy and dissimilarity of the paraphrase results at the same time.
Paraphrasing with a Dual SMT System
Two issues are also raised in (Zhao and Wang, 2010) about using automatic metrics: a paraphrase that changes less gets a larger BLEU score, and the evaluations of paraphrase quality and paraphrase rate tend to be incompatible.
Paraphrasing with a Dual SMT System
iBLEU(s, r_s, c) = α · BLEU(c, r_s) − (1 − α) · BLEU(c, s)   (3)
Paraphrasing with a Dual SMT System
BLEU(c, r_s) captures the semantic equivalency between the candidates and the references (Finch et al.
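A minimal sketch of equation (3), using NLTK's smoothed sentence-level BLEU as the underlying BLEU(.); the alpha value, smoothing choice, and example sentences are illustrative assumptions rather than the paper's settings.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    smooth = SmoothingFunction().method1

    def ibleu(source, references, candidate, alpha=0.9):
        """iBLEU(s, r_s, c) = alpha * BLEU(c, r_s) - (1 - alpha) * BLEU(c, s):
        rewards adequacy against the references, penalizes copying the source."""
        adequacy = sentence_bleu([r.split() for r in references],
                                 candidate.split(), smoothing_function=smooth)
        self_bleu = sentence_bleu([source.split()],
                                  candidate.split(), smoothing_function=smooth)
        return alpha * adequacy - (1 - alpha) * self_bleu

    src = "the weather is nice today"
    refs = ["today the weather is pleasant"]
    cand = "today the weather is pleasant"
    print(ibleu(src, refs, cand))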
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Liu, Chang and Ng, Hwee Tou
Abstract
We show empirically that TESLA-CELAB significantly outperforms character-level BLEU in the English-Chinese translation evaluation tasks.
Experiments
Although word-level BLEU has often been found inferior to the new-generation metrics when the target language is English or other European languages, prior research has shown that character-level BLEU is highly competitive when the target language is Chinese (Li et al., 2011).
Experiments
use character-level BLEU as our main baseline.
Introduction
Since the introduction of BLEU (Papineni et al., 2002), automatic machine translation (MT) evaluation has received a lot of research interest.
Introduction
In the WMT shared tasks, many new generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments.
Introduction
Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human judged translation quality.
BLEU is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Nuhn, Malte and Mauser, Arne and Ney, Hermann
Experimental Evaluation
We show that our method performs better by 1.6 BLEU than the best performing method described in (Ravi and Knight, 2011) while
Experimental Evaluation
In the case of the OPUS and VERBMOBIL corpora, we evaluate the results using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) against reference translations.
Experimental Evaluation
For BLEU higher values are better, for TER lower values are better.
Related Work
They perform experiments on a Spanish-English task with vocabulary sizes of about 500 words and achieve a performance of around 20 BLEU, compared to 70 BLEU obtained by a system that was trained on parallel data.
BLEU is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Foster, George and Sankaran, Baskaran and Sarkar, Anoop
Baselines
where m ranges over IN and OUT, p_m(ē|f) is an estimate from a component phrase table, and each λ_m is a weight in the top-level log-linear model, set so as to maximize dev-set BLEU using minimum error rate training (Och, 2003).
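For illustration, a minimal sketch of two common ways to mix component phrase-table estimates p_m(e|f) with weights lambda_m (linear versus log-linear); the probabilities and weights are made up, and the ensemble-decoding mixture operations studied in the paper differ in detail.

    import math

    def linear_mixture(p_components, lambdas):
        """Linear mixture: p(e|f) = sum_m lambda_m * p_m(e|f)."""
        return sum(lam * p for lam, p in zip(lambdas, p_components))

    def loglinear_mixture(p_components, lambdas):
        """Log-linear mixture: score proportional to prod_m p_m(e|f)**lambda_m
        (unnormalized, as used inside a log-linear decoder)."""
        return math.exp(sum(lam * math.log(p) for lam, p in zip(lambdas, p_components)))

    # Hypothetical IN- and OUT-domain estimates for one phrase pair.
    p_in, p_out = 0.30, 0.05
    weights = [0.7, 0.3]
    print(linear_mixture([p_in, p_out], weights))
    print(loglinear_mixture([p_in, p_out], weights))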
Conclusion & Future Work
We showed that this approach can gain up to 2.2 BLEU points over its concatenation baseline and 0.39 BLEU points over a powerful mixture model.
Ensemble Decoding
In Section 4.2, we compare the BLEU scores of different mixture operations on a French-English experimental setup.
Ensemble Decoding
However, experiments showed that replacing the scores with the normalized scores hurts the BLEU score drastically.
Ensemble Decoding
However, we did not try it, as the BLEU scores we obtained using the normalization heuristic were not promising, and it would impose a cost in decoding as well.
Experiments & Results 4.1 Experimental Setup
Since the Hiero baseline results were substantially better than those of the phrase-based model, we also implemented the best-performing baseline, linear mixture, in our Hiero-style MT system; in fact it achieves the highest BLEU score among all the baselines, as shown in Table 2.
Experiments & Results 4.1 Experimental Setup
This baseline is run three times and the score is averaged over the BLEU scores, with a standard deviation of 0.34.
Experiments & Results 4.1 Experimental Setup
We also reported the BLEU scores when we applied the span-wise normalization heuristic.
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wu, Hua and Wang, Haifeng and Liu, Ting
Abstract
The experimental results show that our proposed approach achieves significant improvements of 1.6~3.6 points of BLEU in the oral domain and 0.5~1 points in the news domain.
Experiments
The metrics for automatic evaluation were BLEU and TER (Snover et al., 2005).
Experiments
(s0_i, s1_i) are selected for the extraction of paraphrase rules if two conditions are satisfied: (1) BLEU(e2_i) − BLEU(e1_i) > δ1, and (2) BLEU(e2_i) > δ2, where BLEU(·) is a function for computing the BLEU score; δ1 and δ2 are thresholds for balancing the number of rules against the quality of the paraphrase rules.
Experiments
Our system gains significant improvements of 1.6~3.6 points of BLEU in the oral domain, and 0.5~1 points of BLEU in the news domain.
Extraction of Paraphrase Rules
As mentioned above, the detailed procedure produces the translation sets T1 and T2; finally we compute BLEU (Papineni et al.
Extraction of Paraphrase Rules
If the sentence in T2 has a higher BLEU score than the aligned sentence in T1, the corresponding sentences in S0 and S1 are selected as candidate paraphrase sentence pairs, which are used in the following steps of paraphrase extraction.
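A minimal sketch of this selection step, assuming sentence-level BLEU scores for the aligned sentences in T1 and T2 are already available; the threshold values are illustrative stand-ins for δ1 and δ2, not the paper's settings.

    def select_paraphrase_pairs(s0, s1, bleu_t1, bleu_t2, delta1=0.05, delta2=0.2):
        """Keep (s0_i, s1_i) when the sentence in T2 outscores the aligned sentence
        in T1 by more than delta1 and its own BLEU exceeds delta2."""
        selected = []
        for s0_i, s1_i, b1, b2 in zip(s0, s1, bleu_t1, bleu_t2):
            if (b2 - b1) > delta1 and b2 > delta2:
                selected.append((s0_i, s1_i))
        return selected

    # Illustrative sentence-level BLEU scores for aligned sentences in T1 and T2.
    pairs = select_paraphrase_pairs(
        ["src a", "src b"], ["para a", "para b"],
        bleu_t1=[0.30, 0.25], bleu_t2=[0.38, 0.26])
    print(pairs)   # only the first pair clears both thresholds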
Introduction
The experimental results show that our proposed approach achieves significant improvements of 1.6~3.6 points of BLEU in the oral domain and 0.5~1 points in the news domain.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Simianer, Patrick and Riezler, Stefan and Dyer, Chris
Experiments
Training data for discriminative learning are prepared by comparing a 100-best list of translations against a single reference using smoothed per-sentence BLEU (Liang et al., 2006a).
Experiments
Figure 4 gives a boxplot depicting BLEU-4 results for 100 runs of the MIRA implementation of the cdec package, tuned on dev-nc and evaluated on the respective test set test-nc. We see a high variance (whiskers denote standard deviations) around a median of 27.2 BLEU and a mean of 27.1 BLEU.
Experiments
In contrast, the perceptron is deterministic when started from a zero-vector of weights and achieves favorable 28.0 BLEU on the news-commentary test set.
Joint Feature Selection in Distributed Stochastic Learning
Let each translation candidate be represented by a feature vector x ∈ R^D, where preference pairs for training are prepared by sorting translations according to smoothed sentence-wise BLEU score (Liang et al., 2006a) against the reference.
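A minimal sketch of turning a k-best list into such preference pairs by sorting on a per-sentence BLEU score; the scores and the exhaustive pairing below are illustrative, and the paper's pair sampling may differ.

    from itertools import combinations

    def preference_pairs(kbest, sent_bleu):
        """Build (better, worse) preference pairs by sorting candidates on their
        smoothed sentence-level BLEU against the reference."""
        ranked = sorted(kbest, key=sent_bleu, reverse=True)
        return [(hi, lo) for hi, lo in combinations(ranked, 2)
                if sent_bleu(hi) > sent_bleu(lo)]   # skip ties

    # Illustrative scores standing in for smoothed per-sentence BLEU.
    scores = {"hyp_a": 0.42, "hyp_b": 0.35, "hyp_c": 0.35}
    print(preference_pairs(list(scores), scores.get))
    # [('hyp_a', 'hyp_b'), ('hyp_a', 'hyp_c')]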
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Yang, Nan and Li, Mu and Zhang, Dongdong and Yu, Nenghai
Conclusion and Future Work
Large-scale experiments show improvements on both the reordering metric and SMT performance, with up to a 1.73-point BLEU gain in our evaluation test.
Experiments
Table 2: BLEU (%) score on dev and test data for both E-J and J-E experiments.
Experiments
We compare their influence on RankingSVM accuracy, alignment crossing-link number, end-to-end BLEU score, and the model size.
Experiments
features | accuracy | CLN | BLEU | Feat.#
E-J:
tag+label | 88.6 | 16.4 | 22.24 | 26k
+dst | 91.5 | 13.5 | 22.66 | 55k
+pct | 92.2 | 13.1 | 22.73 | 79k
+lex100 | 92.9 | 12.1 | 22.85 | 347k
+lex1000 | 94.0 | 11.5 | 22.79 | 2,410k
+lex2000 | 95.2 | 10.7 | 22.81 | 3,794k
J-E:
tag+fw | 85.0 | 18.6 | 25.43 | 31k
+dst | 90.3 | 16.9 | 25.62 | 65k
+lex100 | 91.6 | 15.7 | 25.87 | 293k
+lex1000 | 92.4 | 14.8 | 25.91 | 2,156k
+lex2000 | 93.0 | 14.3 | 25.84 | 3,297k
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min and Li, Haizhou
Conclusions and Future Work
Experimental results show that both models are able to significantly improve translation accuracy in terms of BLEU score.
Experiments
Statistical significance in BLEU differences
Experiments
Our first group of experiments is to investigate whether the predicate translation model is able to improve translation accuracy in terms of BLEU and whether semantic features are useful.
Experiments
The proposed predicate translation models achieve an average improvement of 0.57 BLEU points across the two NIST test sets when all features (lex+sem) are used.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Li, Chi-Ho and Li, Mu and Zhou, Ming
Conclusion and Future Work
In this paper, we only tried Dice coefficient of n-grams and symmetrical sentence level BLEU as similarity measures.
Experiments and Results
Instead of using graph-based consensus confidence as features in the log-linear model, we perform structured label propagation (Struct-LP) to re-rank the n-best list directly, and the similarity measures for source sentences and translation candidates are symmetrical sentence level BLEU (equation (10)).
Features and Training
defined in equation (3), takes symmetrical sentence-level BLEU as the similarity measure:
Features and Training
The symmetrical sentence-level BLEU is defined in equation (10), where i-BLEU(f, f') is the IBM BLEU score computed over i-grams for hypothesis f using f' as reference.
Features and Training
BLEU is not symmetric, which means that different scores are obtained depending on which sentence is the reference and which is the hypothesis.
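One simple way to symmetrize sentence-level BLEU is to average the two directions, as sketched below with NLTK's smoothed sentence BLEU; this averaging form is an assumption for illustration and is not necessarily the paper's equation (10).

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    smooth = SmoothingFunction().method1

    def sym_sentence_bleu(f, f_prime):
        """Average BLEU(f vs f') and BLEU(f' vs f), since the two directions
        generally give different scores."""
        forward = sentence_bleu([f_prime.split()], f.split(), smoothing_function=smooth)
        backward = sentence_bleu([f.split()], f_prime.split(), smoothing_function=smooth)
        return 0.5 * (forward + backward)

    a = "the cat sat on the mat"
    b = "the cat is sitting on the mat"
    print(sym_sentence_bleu(a, b))   # e.g. as a similarity for graph edge creation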
Graph Construction
In our experiment we measure similarity by symmetrical sentence level BLEU of source sentences, and 0.3 is taken as the threshold for edge creation.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Eidelman, Vladimir and Boyd-Graber, Jordan and Resnik, Philip
Abstract
Conditioning lexical probabilities on the topic biases translations toward topic-relevant output, resulting in significant improvements of up to 1 BLEU and 3 TER on Chinese to English translation over a strong baseline.
Experiments
2010) as our decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 tuning corpus using the Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006; Eidelman, 2012).
Experiments
On FBIS, we can see that both models achieve moderate but consistent gains over the baseline on both BLEU and TER.
Experiments
The best model, LTM-10, achieves a gain of about 0.5 and 0.6 BLEU and 2 TER.
Introduction
Incorporating these features into our hierarchical phrase-based translation system significantly improved translation performance, by up to 1 BLEU and 3 TER over a strong Chinese to English baseline.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Pauls, Adam and Klein, Dan
Experiments
The BLEU scores for these outputs are 32.7, 27.8, and 20.8.
Experiments
In particular, their translations had a lower BLEU score, making their task easier.
Experiments
We see that our system prefers the reference much more often than the 5-GRAM language model. However, we also note that the ease of the task is correlated with the quality of the translations (as measured by BLEU score).
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Xiao, Xinyan and Xiong, Deyi and Zhang, Min and Liu, Qun and Lin, Shouxun
Experiments
Is our topic similarity model able to improve translation quality in terms of BLEU ?
Experiments
Case-insensitive NIST BLEU (Papineni et al., 2002) was used to mea-
Experiments
By using all the features (last line in the table), we improve the translation performance over the baseline system by 0.87 BLEU point on average.
Introduction
Experiments on Chinese-English translation tasks (Section 6) show that our method outperforms the baseline hierarchical phrase-based system by +0.9 BLEU points.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Neubig, Graham and Watanabe, Taro and Mori, Shinsuke and Kawahara, Tatsuya
Conclusion and Future Directions
Similar results were found for character- and word-based BLEU, but are omitted for lack of space.
Experiments
Minimum error rate training was performed to maximize word-based BLEU score for all systems. For language models, word-based translation uses a word 5-gram model, and character-based translation uses a character 12-gram model, both smoothed using interpolated Kneser-Ney.
Experiments
We evaluate translation quality using BLEU score (Papineni et al., 2002), both on the word and character level (with n = 4), as well as METEOR (Denkowski and Lavie, 2011) on the word level.
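A minimal sketch of word- versus character-level BLEU with NLTK, where character-level scoring simply re-tokenizes each sentence into characters; the example sentences and smoothing choice are illustrative.

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    smooth = SmoothingFunction().method1

    def char_tokens(sentence):
        """Character-level tokenization: every non-space character is a token."""
        return [ch for ch in sentence if not ch.isspace()]

    refs = ["今天 天气 很 好"]
    hyps = ["今天 天气 不错"]

    word_bleu = corpus_bleu([[r.split()] for r in refs], [h.split() for h in hyps],
                            smoothing_function=smooth)
    char_bleu = corpus_bleu([[char_tokens(r)] for r in refs],
                            [char_tokens(h) for h in hyps],
                            smoothing_function=smooth)
    print(word_bleu, char_bleu)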
Experiments
When compared with word-based translation, character-based translation achieves better, comparable, or inferior results on character-based BLEU, comparable or inferior results on METEOR, and inferior results on word-based BLEU .
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Li, Junhui and Tu, Zhaopeng and Zhou, Guodong and van Genabith, Josef
Abstract
Experiments on Chinese-English translation on four NIST MT test sets show that the HD-HPB model significantly outperforms Chiang's model with average gains of 1.91 points absolute in BLEU.
Experiments
For evaluation, the NIST BLEU script (version 12) with the default settings is used to calculate the BLEU scores.
Experiments
Table 3 lists the translation performance with BLEU scores.
Experiments
Table 3 shows that our HD-HPB model significantly outperforms Chiang’s HPB model with an average improvement of 1.91 in BLEU (and similar improvements over Moses HPB).
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Wu, Xianchao and Sudoh, Katsuhito and Duh, Kevin and Tsukada, Hajime and Nagata, Masaaki
Experiments
training data and not necessarily exactly follow the tendency of the final BLEU scores.
Experiments
For example, CCG is worse than Malt in terms of P/R yet with a higher BLEU score.
Experiments
Also, PAS+sem has a lower P/R than Berkeley, yet their final BLEU scores are not statistically different.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Konstas, Ioannis and Lapata, Mirella
Abstract
Experimental evaluation on the ATIS domain shows that our model outperforms a competitive discriminative system both using BLEU and in a judgment elicitation study.
Results
As can be seen, inclusion of lexical features gives our decoder an absolute increase of 6.73% in BLEU over the 1-BEST system.
Results
System | BLEU | METEOR
1-BEST+BASE+ALIGN | 21.93 | 34.01
k-BEST+BASE+ALIGN+LEX | 28.66 | 45.18
k-BEST+BASE+ALIGN+LEX+STR | 30.62 | 46.07
ANGELI | 26.77 | 42.41
Results
over the 1-BEST system and 3.85% over ANGELI in terms of BLEU.
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Green, Spence and DeNero, John
Abstract
For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline.
Discussion of Translation Results
The best result—a +1.04 BLEU average gain—was achieved when the class-based model training data, MT tuning set, and MT evaluation set contained the same genre.
Introduction
For English-to-Arabic translation, we achieve a +1.04 BLEU average improvement by tiling our model on top of a large LM.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: