Abstract | (BLEU, TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality.
Experiments | As metrics we use BLEU and RIBES (which demonstrated good human correlation in this language pair (Goto et al., 2011)). |
Experiments | As metrics we use BLEU and NTER. |
Experiments | BLEU = BP × (∏_{n=1}^{4} prec_n)^{1/4}.
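The reconstructed formula (brevity penalty times the geometric mean of the 1- to 4-gram precisions) can be sketched in a few lines of Python; this is an illustrative implementation with ad-hoc smoothing of zero counts, not the scorer used in any of the cited systems:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """BLEU = BP * (product of n-gram precisions)^(1/max_n), clipped counts."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())   # clipped n-gram matches
        total = max(sum(hyp_ng.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero precisions
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

With a single reference, a hypothesis identical to the reference scores 1.0 (BP = 1, all precisions 1); shorter or divergent hypotheses are penalized by the brevity penalty and the clipped precisions.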
Introduction | These methods are effective because they tune the system to maximize an automatic evaluation metric such as BLEU, which serves as a surrogate objective for translation quality.
Introduction | However, we know that a single metric such as BLEU is not enough. |
Introduction | For example, while BLEU (Papineni et al., 2002) focuses on word-based n-gram precision, METEOR (Lavie and Agarwal, 2007) allows for stem/synonym matching and incorporates recall. |
Multi-objective Algorithms | If we had used BLEU scores rather than the {0,1} labels in line 8, the entire PMO-PRO algorithm would revert to single-objective PRO. |
Theory of Pareto Optimality 2.1 Definitions and Concepts | For example, suppose K = 2, M1(h) computes the BLEU score, and M2(h) gives the METEOR score of h. Figure 1 illustrates the set of vectors {M(h)} in a 10-best list.
Abstract | Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU.
Abstract | In principle, tuning on these metrics should yield better systems than tuning on BLEU.
Abstract | It has a better correlation with human judgment than BLEU.
Introduction | BLEU (Papineni et al., 2002), NIST (Doddington, 2002), WER, PER, TER (Snover et al., 2006), and LRscore (Birch and Osborne, 2011) do not use external linguistic
Introduction | Among these metrics, BLEU is the most widely used for both evaluation and tuning. |
Introduction | Many of the metrics correlate better with human judgments of translation quality than BLEU, as shown in recent WMT Evaluation Task reports (Callison-Burch et
Abstract | In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. |
Abstract | The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. |
Abstract | parameters in the phrase and lexicon translation models are estimated by relative frequency or maximizing joint likelihood, which may not correspond closely to the translation measure, e.g., bilingual evaluation understudy (BLEU) (Papineni et al., 2002).
Inferring a learning curve from mostly monolingual data | Our objective is to predict the evolution of the BLEU score on the given test set as a function of the size of a random subset of the training data.
Inferring a learning curve from mostly monolingual data | We first train models to predict the BLEU score at m anchor sizes s1, …, sm.
Inferring a learning curve from mostly monolingual data | We then perform inference using these models to predict the BLEU score at each anchor, for the test case of interest. |
Introduction | In both cases, the task consists in predicting an evaluation score (BLEU, throughout this work) on the test corpus as a function of the size of a subset of the source sample, assuming that we could have it manually translated and use the resulting bilingual corpus for training.
Introduction | An extensive study across six parametric function families, empirically establishing that a certain three-parameter power-law family is well suited for modeling learning curves for the Moses SMT system when the evaluation score is BLEU . |
Introduction | They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6). |
Selecting a parametric family of curves | For a certain bilingual test dataset d, we consider a set of observations O_d = {(x1, y1), (x2, y2), …, (xn, yn)}, where y_i is the performance on d (measured using BLEU (Papineni et al., 2002)) of a translation model trained on a parallel corpus of size x_i.
Selecting a parametric family of curves | The last condition is related to our use of BLEU, which is bounded by 1, as a performance measure; it should be noted that some growth patterns which are sometimes proposed, such as a logarithmic regime of the form y = a + b log x, are not
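One plausible reading of the three-parameter power-law family mentioned above is y = c − a·x^(−α), which respects BLEU's upper bound (the curve approaches c as the corpus size x grows). A minimal fitting sketch, using a grid search over α with closed-form least squares for c and a; the function name, grid, and parameterization are my own illustration, not the cited work's procedure:

```python
def fit_power_law(sizes, scores, alphas=None):
    """Fit y = c - a * x**(-alpha) by grid search over alpha, using
    closed-form simple linear regression for (c, a) at each candidate:
    with z = x**(-alpha), the model is linear, y = c + (-a) * z."""
    if alphas is None:
        alphas = [i / 100 for i in range(1, 201)]  # candidate alphas 0.01..2.00
    n = len(sizes)
    best = None
    for alpha in alphas:
        z = [x ** (-alpha) for x in sizes]
        zbar = sum(z) / n
        ybar = sum(scores) / n
        szz = sum((zi - zbar) ** 2 for zi in z)
        szy = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, scores))
        slope = szy / szz if szz else 0.0       # y = intercept + slope * z
        intercept = ybar - slope * zbar
        sse = sum((intercept + slope * zi - yi) ** 2
                  for zi, yi in zip(z, scores))  # squared fitting error
        if best is None or sse < best[0]:
            best = (sse, intercept, -slope, alpha)
    _, c, a, alpha = best
    return c, a, alpha                           # y = c - a * x**(-alpha)
```

Given anchor sizes and their measured BLEU scores, the fitted curve can then be evaluated at unseen corpus sizes to extrapolate the learning curve.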
Selecting a parametric family of curves | The values are on the same scale as the BLEU scores. |
Abstract | At a speed of roughly 70 words per second, Moses reaches 17.2% BLEU, whereas our approach yields 20.0% with identical models.
Experimental Evaluation | system: BLEU [%] / #HYP / #LM / w/s. N0 = ∞: baseline 20.1 / 3.0K / 322K / 2.2; +presort 20.1 / 2.5K / 183K / 3.6. N0 = 100
Experimental Evaluation | We evaluate with BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). |
Experimental Evaluation | BLEU [%] |
Introduction | We also run comparisons with the Moses decoder (Koehn et al., 2007), which yields the same performance in BLEU, but is outperformed significantly in terms of scalability for faster translation.
Abstract | In addition, a revised BLEU score (called iBLEU) which measures the adequacy and diversity of the generated paraphrase sentence is proposed for tuning parameters in SMT systems. |
Experiments and Results | Joint learning: BLEU / self-BLEU / iBLEU. No Joint: 27.16 / 35.42 / –. α = 1: 30.75 / 53.51 / 30.75
Experiments and Results | We show the BLEU score (computed against references) to measure the adequacy and self-BLEU (computed against source sentence) to evaluate the dissimilarity (lower is better). |
Experiments and Results | From the results we can see that, as the value of α decreases to place a larger penalty on self-paraphrase, the self-BLEU score rapidly decays, but as a consequence the BLEU score computed against the references also drops sharply.
Introduction | The jointly-learned dual SMT system: (1) Adapts the SMT systems so that they are tuned specifically for paraphrase generation purposes, e.g., to increase the dissimilarity; (2) Employs a revised BLEU score (named iBLEU, as it is an input-aware BLEU metric) that measures adequacy and dissimilarity of the paraphrase results at the same time.
Paraphrasing with a Dual SMT System | Two issues are also raised in (Zhao and Wang, 2010) about using automatic metrics: paraphrases that change less receive larger BLEU scores, and the evaluations of paraphrase quality and paraphrase rate tend to be incompatible.
Paraphrasing with a Dual SMT System | iBLEU(s, r_s, c) = α BLEU(c, r_s) − (1 − α) BLEU(c, s) (3)
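Equation (3) above is directly computable given any sentence-level BLEU function; a minimal sketch (the default α here is illustrative, not a value taken from the text):

```python
def ibleu(source, references, candidate, bleu, alpha=0.9):
    """iBLEU as in equation (3): reward adequacy against the references,
    penalize self-similarity to the source sentence.  `bleu` is any
    sentence-level BLEU function bleu(hypothesis, reference); alpha
    trades off the two terms."""
    adequacy = bleu(candidate, references)   # BLEU(c, r_s)
    self_sim = bleu(candidate, source)       # BLEU(c, s)
    return alpha * adequacy - (1 - alpha) * self_sim
```

A candidate that copies the source verbatim keeps the penalty term high, so iBLEU favors paraphrases that stay adequate while diverging from the input.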
Paraphrasing with a Dual SMT System | BLEU(c, r_s) captures the semantic equivalency between the candidates and the references (Finch et al.
Abstract | We show empirically that TESLA-CELAB significantly outperforms character-level BLEU in the English-Chinese translation evaluation tasks.
Experiments | 4.3.1 BLEU |
Experiments | Although word-level BLEU has often been found inferior to the new-generation metrics when the target language is English or other European languages, prior research has shown that character-level BLEU is highly competitive when the target language is Chinese (Li et al., 2011). |
Experiments | use character-level BLEU as our main baseline. |
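Character-level BLEU reduces to ordinary n-gram BLEU applied after character tokenization; a minimal sketch (the helper names are my own, not from the cited evaluation):

```python
def char_tokenize(sentence):
    """Character-level tokenization: treat every non-space character as a
    token, so that a standard word-level BLEU implementation effectively
    operates on character n-grams."""
    return " ".join(ch for ch in sentence if not ch.isspace())

def char_bleu(hypothesis, reference, bleu):
    """Character-level BLEU: run any word-level BLEU function on the
    character-tokenized strings."""
    return bleu(char_tokenize(hypothesis), char_tokenize(reference))
```

This sidesteps Chinese word segmentation entirely, which is one reason character-level BLEU is a natural baseline when the target language is Chinese.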
Introduction | Since the introduction of BLEU (Papineni et al., 2002), automatic machine translation (MT) evaluation has received a lot of research interest. |
Introduction | In the WMT shared tasks, many new generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments. |
Introduction | Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human judged translation quality. |
Experimental Evaluation | We show that our method performs better by 1.6 BLEU than the best performing method described in (Ravi and Knight, 2011) while |
Experimental Evaluation | In the case of the OPUS and VERBMOBIL corpora, we evaluate the results using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) against reference translations.
Experimental Evaluation | For BLEU higher values are better, for TER lower values are better. |
Related Work | They perform experiments on a Spanish-English task with vocabulary sizes of about 500 words and achieve a performance of around 20 BLEU, compared to 70 BLEU obtained by a system that was trained on parallel data.
Baselines | where m ranges over IN and OUT, p_m(ē|f) is an estimate from a component phrase table, and each λ_m is a weight in the top-level log-linear model, set so as to maximize dev-set BLEU using minimum error rate training (Och, 2003).
Conclusion & Future Work | We showed that this approach can gain up to 2.2 BLEU points over its concatenation baseline and 0.39 BLEU points over a powerful mixture model. |
Ensemble Decoding | In Section 4.2, we compare the BLEU scores of different mixture operations on a French-English experimental setup. |
Ensemble Decoding | However, experiments showed that replacing the scores with the normalized scores hurts the BLEU score radically.
Ensemble Decoding | However, we did not try it, as the BLEU scores we obtained using the normalization heuristic were not promising, and it would impose a cost in decoding as well.
Experiments & Results 4.1 Experimental Setup | Since the Hiero baseline results were substantially better than those of the phrase-based model, we also implemented the best-performing baseline, linear mixture, in our Hiero-style MT system, and in fact it achieves the highest BLEU score among all the baselines, as shown in Table 2.
Experiments & Results 4.1 Experimental Setup | This baseline is run three times and the score is averaged over the BLEU scores, with a standard deviation of 0.34.
Experiments & Results 4.1 Experimental Setup | We also reported the BLEU scores when we applied the span-wise normalization heuristic. |
Abstract | The experimental results show that our proposed approach achieves significant improvements of 1.6~3.6 points of BLEU in the oral domain and 0.5~1 points in the news domain.
Discussion | on BLEU score |
Experiments | The metrics for automatic evaluation were BLEU and TER (Snover et al., 2005).
Experiments | (e0_i, e1_i) are selected for the extraction of paraphrase rules if two conditions are satisfied: (1) BLEU(e2_i) − BLEU(e1_i) > δ1, and (2) BLEU(e2_i) > δ2, where BLEU(·) is a function for computing the BLEU score; δ1 and δ2 are thresholds for balancing the number of rules and the quality of paraphrase rules.
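The two threshold conditions can be applied as a simple filter over candidate pairs; a sketch with hypothetical threshold values, since δ1 and δ2 are not specified in the excerpt:

```python
def select_pairs(candidates, bleu_e1, bleu_e2, delta1=0.05, delta2=0.2):
    """Keep candidate pair i only if (1) the paraphrased translation e2_i
    beats the original translation e1_i by more than delta1 BLEU, and
    (2) e2_i itself scores above delta2.  Threshold values here are
    illustrative placeholders."""
    selected = []
    for i, pair in enumerate(candidates):
        if bleu_e2[i] - bleu_e1[i] > delta1 and bleu_e2[i] > delta2:
            selected.append(pair)
    return selected
```

Raising δ1 and δ2 yields fewer but higher-quality rules, which is exactly the rule-number/quality trade-off the text describes.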
Experiments | Our system gains significant improvements of 1.6~3.6 points of BLEU in the oral domain, and 0.5~1 points of BLEU in the news domain. |
Extraction of Paraphrase Rules | As mentioned above, the detailed procedure is: T1 = …, S1 = …, T2 = …; finally we compute BLEU (Papineni et al.
Extraction of Paraphrase Rules | If the sentence in T2 has a higher BLEU score than the aligned sentence in T1, the corresponding sentences in S0 and S1 are selected as candidate paraphrase sentence pairs, which are used in the following steps of paraphrase extraction.
Introduction | The experimental results show that our proposed approach achieves significant improvements of 1.6~3.6 points of BLEU in the oral domain and 0.5~1 points in the news domain.
Experiments | Training data for discriminative learning are prepared by comparing a 100-best list of translations against a single reference using smoothed per-sentence BLEU (Liang et al., 2006a). |
Experiments | Figure 4 gives a boxplot depicting BLEU-4 results for 100 runs of the MIRA implementation of the cdec package, tuned on dev-nc, and evaluated on the respective test set test-nc. We see a high variance (whiskers denote standard deviations) around a median of 27.2 BLEU and a mean of 27.1 BLEU.
Experiments | In contrast, the perceptron is deterministic when started from a zero-vector of weights and achieves favorable 28.0 BLEU on the news-commentary test set. |
Joint Feature Selection in Distributed Stochastic Learning | Let each translation candidate be represented by a feature vector x ∈ R^D, where preference pairs for training are prepared by sorting translations according to smoothed sentence-wise BLEU score (Liang et al., 2006a) against the reference.
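Pair generation by sorting an n-best list on smoothed sentence-level BLEU can be sketched as follows; a minimal illustration of the idea, not the cited implementation:

```python
def preference_pairs(nbest, max_pairs=None):
    """Build training preference pairs from an n-best list of
    (feature_vector, sentence_bleu) tuples: sort by BLEU descending and
    pair each hypothesis with every strictly lower-scored one, so a
    ranking learner is trained to prefer the first element of each pair."""
    ranked = sorted(nbest, key=lambda h: h[1], reverse=True)
    pairs = []
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            if ranked[i][1] > ranked[j][1]:          # skip ties
                pairs.append((ranked[i][0], ranked[j][0]))
    if max_pairs is not None:
        pairs = pairs[:max_pairs]                    # optional subsampling
    return pairs
```

For a 100-best list this produces up to 100·99/2 pairs, which is why implementations typically sample or cap the pair set.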
Conclusion and Future Work | Large-scale experiments show improvements on both the reordering metric and SMT performance, with up to a 1.73-point BLEU gain in our evaluation test.
Experiments | Table 2: BLEU (%) scores on dev and test data for both E-J and J-E experiments.
Experiments | We compare their influence on RankingSVM accuracy, alignment crossing-link number, end-to-end BLEU score, and the model size. |
Experiments | Acc. / CLN / BLEU / Feat.#: E-J: tag+label 88.6 / 16.4 / 22.24 / 26k; +dst 91.5 / 13.5 / 22.66 / 55k; +pct 92.2 / 13.1 / 22.73 / 79k; +lex100 92.9 / 12.1 / 22.85 / 347k; +lex1000 94.0 / 11.5 / 22.79 / 2,410k; +lex2000 95.2 / 10.7 / 22.81 / 3,794k. J-E: tag+fw 85.0 / 18.6 / 25.43 / 31k; +dst 90.3 / 16.9 / 25.62 / 65k; +lex100 91.6 / 15.7 / 25.87 / 293k; +lex1000 92.4 / 14.8 / 25.91 / 2,156k; +lex2000 93.0 / 14.3 / 25.84 / 3,297k
Conclusions and Future Work | Experimental results show that both models are able to significantly improve translation accuracy in terms of BLEU score.
Experiments | Statistical significance in BLEU differences |
Experiments | Our first group of experiments is to investigate whether the predicate translation model is able to improve translation accuracy in terms of BLEU and whether semantic features are useful. |
Experiments | The proposed predicate translation models achieve an average improvement of 0.57 BLEU points across the two NIST test sets when all features (lex+sem) are used.
Conclusion and Future Work | In this paper, we only tried Dice coefficient of n-grams and symmetrical sentence level BLEU as similarity measures. |
Experiments and Results | Instead of using graph-based consensus confidence as features in the log-linear model, we perform structured label propagation (Struct-LP) to re-rank the n-best list directly, and the similarity measures for source sentences and translation candidates are symmetrical sentence level BLEU (equation (10)). |
Features and Training | defined in equation (3), takes symmetrical sentence-level BLEU as the similarity measure:
Features and Training | BLEU_sym(f, f′) = (BLEU(f, f′) + BLEU(f′, f)) / 2 (10), where BLEU(f, f′) is the IBM BLEU score computed over i-grams for hypothesis f using f′ as reference.
Features and Training | BLEU is not symmetric, which means different scores are obtained depending on which sentence is the reference and which is the hypothesis.
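Given the asymmetry noted above, a natural way to symmetrize sentence-level BLEU is to average the two directions; a minimal sketch (my reading of the excerpt, with any smoothed sentence-level BLEU plugged in):

```python
def symmetric_bleu(sent_a, sent_b, bleu):
    """Symmetrized sentence-level BLEU: average the score with each
    sentence taking its turn as hypothesis and as reference, since
    bleu(a, b) != bleu(b, a) in general."""
    return 0.5 * (bleu(sent_a, sent_b) + bleu(sent_b, sent_a))
```

The result is a proper symmetric similarity, suitable for the edge weights of a similarity graph where neither side is privileged as the reference.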
Graph Construction | In our experiment we measure similarity by symmetrical sentence level BLEU of source sentences, and 0.3 is taken as the threshold for edge creation. |
Abstract | Conditioning lexical probabilities on the topic biases translations toward topic-relevant output, resulting in significant improvements of up to 1 BLEU and 3 TER on Chinese to English translation over a strong baseline. |
Experiments | 2010) as our decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 tuning corpus using the Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006; Eidelman, 2012). |
Experiments | On FBIS, we can see that both models achieve moderate but consistent gains over the baseline on both BLEU and TER. |
Experiments | The best model, LTM-10, achieves a gain of about 0.5 and 0.6 BLEU and 2 TER. |
Introduction | Incorporating these features into our hierarchical phrase-based translation system significantly improved translation performance, by up to 1 BLEU and 3 TER over a strong Chinese to English baseline.
Experiments | The BLEU scores for these outputs are 32.7, 27.8, and 20.8. |
Experiments | In particular, their translations had a lower BLEU score, making their task easier. |
Experiments | We see that our system prefers the reference much more often than the 5-GRAM language model. However, we also note that the ease of the task is correlated with the quality of the translations (as measured by BLEU score).
Experiments | Is our topic similarity model able to improve translation quality in terms of BLEU?
Experiments | Case-insensitive NIST BLEU (Papineni et al., 2002) was used to mea- |
Experiments | By using all the features (last line in the table), we improve the translation performance over the baseline system by 0.87 BLEU point on average. |
Introduction | Experiments on Chinese-English translation tasks (Section 6) show that our method outperforms the baseline hierarchical phrase-based system by +0.9 BLEU points.
Conclusion and Future Directions | Similar results were found for character- and word-based BLEU, but are omitted for lack of space.
Experiments | Minimum error rate training was performed to maximize word-based BLEU score for all systems. For language models, word-based translation uses a word 5-gram model, and character-based translation uses a character 12-gram model, both smoothed using interpolated Kneser-Ney.
Experiments | We evaluate translation quality using BLEU score (Papineni et al., 2002), both on the word and character level (with n = 4), as well as METEOR (Denkowski and Lavie, 2011) on the word level. |
Experiments | When compared with word-based translation, character-based translation achieves better, comparable, or inferior results on character-based BLEU, comparable or inferior results on METEOR, and inferior results on word-based BLEU.
Abstract | Experiments on Chinese-English translation on four NIST MT test sets show that the HD-HPB model significantly outperforms Chiang's model with average gains of 1.91 points absolute in BLEU.
Experiments | For evaluation, the NIST BLEU script (version 12) with the default settings is used to calculate the BLEU scores. |
Experiments | Table 3 lists the translation performance with BLEU scores. |
Experiments | Table 3 shows that our HD-HPB model significantly outperforms Chiang’s HPB model with an average improvement of 1.91 in BLEU (and similar improvements over Moses HPB). |
Experiments | training data and not necessarily exactly follow the tendency of the final BLEU scores. |
Experiments | For example, CCG is worse than Malt in terms of P/R yet with a higher BLEU score. |
Experiments | Also, PAS+sem has a lower P/R than Berkeley, yet their final BLEU scores are not statistically different. |
Abstract | Experimental evaluation on the ATIS domain shows that our model outperforms a competitive discriminative system both using BLEU and in a judgment elicitation study. |
Results | As can be seen, inclusion of lexical features gives our decoder an absolute increase of 6.73% in BLEU over the 1-BEST system.
Results | System: BLEU / METEOR. 1-BEST+BASE+ALIGN: 21.93 / 34.01; k-BEST+BASE+ALIGN+LEX: 28.66 / 45.18; k-BEST+BASE+ALIGN+LEX+STR: 30.62 / 46.07; ANGELI: 26.77 / 42.41
Results | over the 1-BEST system and 3.85% over ANGELI in terms of BLEU.
Abstract | For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. |
Discussion of Translation Results | The best result—a +1.04 BLEU average gain—was achieved when the class-based model training data, MT tuning set, and MT evaluation set contained the same genre. |
Introduction | For English-to-Arabic translation, we achieve a +1.04 BLEU average improvement by tiling our model on top of a large LM. |