Abstract | Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation. |
Abstract | We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion. |
Abstract | Our best results improve a phrase-based statistical machine translation system trained on WMT 2012 French-English data by up to 2.0 BLEU, and the expected BLEU objective improves over a cross-entropy trained model by up to 0.6 BLEU in a single reference setup. |
Expected BLEU Training | The n-best lists serve as an approximation to the candidate translation set for f, used in the next step for expected BLEU training of the recurrent neural network model (§3).
Expected BLEU Training | 3.1 Expected BLEU Objective |
Expected BLEU Training | Formally, we define our loss function ℓ(θ) as the negative expected BLEU score, denoted xBLEU(θ), for a given foreign sentence f:
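The loss itself does not survive in this excerpt; a plausible reconstruction in standard notation is the following, where GEN(f) for the n-best candidate set, e* for the reference, and sBLEU for sentence-level BLEU are our symbols, not necessarily the paper's:

```latex
\ell(\theta) \;=\; -\,\mathrm{xBLEU}(\theta)
            \;=\; -\sum_{e \,\in\, \mathrm{GEN}(f)} \mathrm{sBLEU}(e, e^{*}) \; p_{\theta}(e \mid f)
```

Minimizing ℓ(θ) shifts probability mass toward candidates with high sentence-level BLEU, rather than toward the reference alone as cross-entropy does.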
Introduction | The expected BLEU objective provides an efficient way of achieving this for machine translation (Rosti et al., 2010; Rosti et al., 2011; He and Deng, 2012; Gao and He, 2013; Gao et al., 2014) instead of solely relying on traditional optimizers such as Minimum Error Rate Training (MERT) that only adjust the weighting of entire component models within the log-linear framework of machine translation (§3). |
Introduction | We test the expected BLEU objective by training a recurrent neural network language model and obtain substantial improvements. |
Recurrent Neural Network LMs | The model is trained with the back-propagation-through-time algorithm, which unrolls the network and then computes error gradients over multiple time steps (Rumelhart et al., 1986); we use the expected BLEU loss (§3) to obtain the error with respect to the output activations.
Experimental Setup | We evaluate our system using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). |
Methods | This could improve translation quality, as it brings our training scenario closer to our test scenario (test BLEU is always measured on unsegmented references). |
Related Work | We use both segmented and unsegmented language models, and tune automatically to optimize BLEU.
Related Work | (2008) also tune on unsegmented references by simply desegmenting SMT output before MERT collects sufficient statistics for BLEU.
Results | For English-to-Arabic, 1-best desegmentation results in a 0.7 BLEU point improvement over training on unsegmented Arabic. |
Results | Moving to lattice desegmentation more than doubles that improvement, resulting in a BLEU score of 34.4 and an improvement of 1.0 BLEU point over 1-best desegmentation. |
Results | 1000-best desegmentation also works well, resulting in a 0.6 BLEU point improvement over 1-best. |
Experiments | [Table: Recall, F1, and BLEU scores for the compared methods.]
Experiments | Method 4, named REBOL, implements REsponse-Based Online Learning by instantiating y+ and y- to the form described in Section 4: in addition to the model score s, it uses a cost function c based on sentence-level BLEU (Nakov et al., 2012) and tests translation hypotheses for task-based feedback using a binary execution function e.
Response-based Online Learning | Computation of the distance to the reference translation usually involves cost functions based on sentence-level BLEU (Nakov et al., 2012).
Response-based Online Learning | In addition, we can use translation-specific cost functions based on sentence-level BLEU in order to boost similarity of translations to human reference translations. |
Response-based Online Learning | Our cost function c(y(i), y) = 1 − BLEU(y(i), y) is based on a version of sentence-level BLEU (Nakov et al., 2012).
Abstract | Our best result improves over the best single MT system baseline by 1.0% BLEU and over a strong system selection baseline by 0.6% BLEU on a blind test set. |
Introduction | Our best system selection approach improves over our best baseline single MT system by 1.0% absolute BLEU point on a blind test set. |
MT System Selection | We run the 5,562 sentences of the classification training data through our four MT systems and produce sentence-level BLEU scores (with length penalty). |
MT System Selection | We pick the name of the MT system with the highest BLEU score as the class label for that sentence. |
MT System Selection | When there is a tie in BLEU scores, we pick the system label that yields better overall BLEU scores from the systems tied. |
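The labeling scheme described above (pick the system with the highest sentence-level BLEU, break ties by the tied systems' overall BLEU) can be sketched as follows; the function and variable names are ours, not the paper's:

```python
# Hypothetical sketch: label each training sentence with the MT system whose
# output has the highest sentence-level BLEU; ties go to the tied system
# with the better overall (corpus-level) BLEU.
def pick_labels(sent_bleu, overall_bleu):
    """sent_bleu: {system: [per-sentence BLEU]}; overall_bleu: {system: corpus BLEU}."""
    systems = list(sent_bleu)
    n_sents = len(next(iter(sent_bleu.values())))
    labels = []
    for i in range(n_sents):
        best = max(sent_bleu[s][i] for s in systems)
        tied = [s for s in systems if sent_bleu[s][i] == best]
        # tie-break: prefer the tied system with the highest overall BLEU
        labels.append(max(tied, key=lambda s: overall_bleu[s]))
    return labels
```

A usage example: with two systems tied on the first sentence, the label falls to the one with the better corpus score.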
Machine Translation Experiments | Feature weights are tuned to maximize BLEU on tuning sets using Minimum Error Rate Training (Och, 2003). |
Machine Translation Experiments | Results are presented in terms of BLEU (Papineni et al., 2002). |
Machine Translation Experiments | All differences in BLEU scores between the four systems are statistically significant above the 95% level. |
Abstract | Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets. |
Evaluation | We use case-insensitive BLEU (Papineni et al., 2002) to evaluate translation quality. |
Evaluation | Table 4 presents the results of these variations; overall, by taking into account generated candidates appropriately and using bigrams (“SLP 2-gram”), we obtained a 1.13 BLEU gain on the test set. |
Evaluation | In the “HalfMono” setting, we use only half of the monolingual comparable corpora, and still obtain an improvement of 0.56 BLEU points, indicating that adding more monolingual data is likely to improve the system further.
Introduction | This enhancement alone results in an improvement of almost 1.4 BLEU points. |
Introduction | We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models. |
Experimental Results | Group III: contains other important evaluation metrics, which were not considered in the WMT12 metrics task: NIST and ROUGE for both system- and segment-level, and BLEU and TER at segment-level. |
Experimental Results | II: TER .812 .836 .848; BLEU .810 .830 .846
Experimental Results | We can see that DR is already competitive by itself: on average, it has a correlation of .807, very close to BLEU and TER scores (.810 and .812, respectively). |
Experimental Setup | To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms). |
Experimental Setup | Combination of five metrics based on lexical similarity: BLEU, NIST, METEOR-ex, ROUGE-W, and TERp-A.
Related Work | A common argument is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012).
Related Work | For BLEU and TER, they observed improved correlation with human judgments on the MTC4 dataset when linearly interpolating these metrics with their lexical cohesion score. |
Abstract | The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements.
Abstract | We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets.
Abstract | The main finding is that unigram BLEU has a weak correlation, and Meteor has the strongest correlation with human judgements. |
Introduction | The main finding of our analysis is that TER and unigram BLEU are weakly correlated with human judgements, ROUGE-SU4 and Smoothed BLEU are moderately correlated, and the strongest correlation is found with Meteor.
Methodology | BLEU measures the effective overlap between a reference sentence X and a candidate sentence Y. |
Methodology | BLEU = BP · exp( ∑_{n=1}^{N} w_n log p_n )
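As a concrete illustration of this formula, here is a minimal single-reference BLEU sketch with uniform weights w_n = 1/N; real implementations handle multiple references and aggregate statistics over whole corpora, so treat this as illustrative only:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU = BP * exp(sum_n w_n log p_n), w_n = 1/max_n."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # clipped n-gram counts: a candidate n-gram is credited at most
        # as many times as it appears in the reference
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0  # a zero precision drives the geometric mean to zero
        log_p += math.log(clipped / total) / max_n
    # brevity penalty: BP = 1 if c > r, else exp(1 - r/c)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_p)
```

The brevity penalty is what distinguishes this from the unigram-precision variants discussed below: without it, overly short candidates are not penalized.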
Methodology | Unigram BLEU without a brevity penalty has been reported by Kulkarni et al.
Abstract | Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make. |
Abstract | However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved. |
Introduction | With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model. |
Introduction | With the SVM reranker, we obtain a significant improvement in BLEU scores over |
Introduction | Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases). |
Reranking with SVMs 4.1 Methods | In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments. |
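Deriving a preference order over candidate pairs from BLEU scores, as described above, can be sketched as follows (the function name and the optional margin are our assumptions, not the paper's):

```python
# Hypothetical sketch: turn per-candidate BLEU scores into (better, worse)
# preference pairs for training a ranking SVM, assuming BLEU approximates
# human preference judgments.
def preference_pairs(scored, margin=0.0):
    """scored: list of (candidate_id, bleu). Returns (better, worse) pairs."""
    pairs = []
    for i, (a, score_a) in enumerate(scored):
        for b, score_b in scored[i + 1:]:
            if score_a > score_b + margin:
                pairs.append((a, b))
            elif score_b > score_a + margin:
                pairs.append((b, a))
    return pairs
```

Candidates with (near-)equal BLEU yield no pair, which keeps the ranker from being trained on distinctions the metric cannot support.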
Simple Reranking | Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations |
Simple Reranking | Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93. |
Simple Reranking | In sum, although simple ranking helps to avoid vicious ambiguity in some cases, the overall results of simple ranking are no better than the perceptron model (according to BLEU, at least), as parse failures that are not reflective of human interpretive tendencies too often lead the ranker to choose dispreferred realizations.
Abstract | On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM.
Abstract | The NNJM features also produce a gain of +6.3 BLEU on top of a simpler baseline equivalent to Chiang’s (2007) original Hiero implementation.
Introduction | Additionally, we present several variations of this model which provide significant additive BLEU gains. |
Introduction | The NNJM features produce an improvement of +3.0 BLEU on top of a baseline that is already better than the 1st place MT12 result and includes
Introduction | Additionally, on top of a simpler decoder equivalent to Chiang’s (2007) original Hiero implementation, our NNJM features are able to produce an improvement of +6.3 BLEU, as much as all of the other features in our strong baseline system combined.
Model Variations | OpenMT12 1st Place: 49.5 BLEU (Ar-En), 32.6 BLEU (Ch-En)
Model Variations | BLEU scores are mixed-case. |
Model Variations | On Arabic-English, the primary S2T/L2R NNJM gains +1.4 BLEU on top of our baseline, while the S2T NNLTM gains another +0.8, and the directional variations gain +0.8 BLEU more.
Neural Network Joint Model (NNJM) | We demonstrate in Section 6.6 that using one hidden layer instead of two has minimal effect on BLEU.
Neural Network Joint Model (NNJM) | We demonstrate in Section 6.6 that using the self-normalized/pre-computed NNJM results in only a very small BLEU degradation compared to the standard NNJM.
Evaluation | Metric: Since we have four professional translation sets, we can calculate the Bilingual Evaluation Understudy (BLEU) score (Papineni et al., 2002) for one professional translator (P1) using the other three (P2,3,4) as a reference set.
Evaluation | In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations. |
Evaluation | This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators. |
Experiments | To assess and compare simplification systems, two main automatic metrics have been used in previous work, namely BLEU and the Flesch-Kincaid Grade Level Index (FKG).
Experiments | BLEU gives a measure of how close a system’s output is to the gold standard simple sentence. |
Experiments | Because there are many possible ways of simplifying a sentence, BLEU alone fails to correctly assess the appropriateness of a simplification. |
Related Work | (2010), namely an aligned corpus of 100/131 EWKP/SWKP sentences, and show that they achieve a better BLEU score.
Abstract | Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on NTCIR PatentMT corpus which is out-of-domain. |
Complexity Analysis | In this section, the proposed method is first validated on monolingual segmentation tasks, and then evaluated in the context of SMT to study whether the translation quality, measured by BLEU, can be improved.
Complexity Analysis | For the bilingual tasks, the publicly available Moses system (Koehn et al., 2007) with default settings is employed to perform machine translation, and BLEU (Papineni et al., 2002) is used to evaluate the quality.
Complexity Analysis | It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings. |
Introduction | • improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
Discussion | Table 6: Performance gain in BLEU over baseline and MR08 systems averaged over all test sets. |
Discussion | Table 9: Performance (BLEU score) comparison between non-oracle and oracle experiments.
Experiments | We use the NIST MT 06 dataset (1664 sentence pairs) for tuning, and the NIST MT 03, 05, and 08 datasets (919, 1082, and 1357 sentence pairs, respectively) for evaluation. We use BLEU (Papineni et al., 2002) for both tuning and evaluation.
Experiments | Our first group of experiments investigates whether the syntactic reordering models are able to improve translation quality in terms of BLEU . |
Experiments | Table 5: System performance in BLEU scores. |
Experiments | In Table 3, almost all BLEU scores are improved, no matter what strategy is used. |
Experiments | In particular, the best improvements, marked in bold, are as high as 1.24, 0.94, and 0.82 BLEU points over the baseline system on the NIST04, CWMT08 Development, and CWMT08 Evaluation data, respectively.
Related Work | They added the labels assigned to connectives as an additional input to an SMT system, but their experimental results show that the improvements under the evaluation metric of BLEU were not significant. |
Related Work | To the best of our knowledge, our work is the first attempt to exploit the source functional relationship to generate the target transitional expressions for grammatical cohesion, and we have successfully incorporated the proposed models into an SMT system with significant BLEU improvements.
Abstract | When the selected sentence pairs are evaluated on an end-to-end MT task, our methods can increase the translation performance by 3 BLEU points. |
Conclusion | Compared with the methods which only employ language model for data selection, we observe that our methods are able to select high-quality do-main-relevant sentence pairs and improve the translation performance by nearly 3 BLEU points. |
Experiments | The BLEU scores of the In-domain and General-domain baseline system are listed in Table 2. |
Experiments | The results show that General-domain system trained on a larger amount of bilingual resources outperforms the system trained on the in-domain corpus by over 12 BLEU points. |
Experiments | The horizontal coordinate represents the number of selected sentence pairs and vertical coordinate is the BLEU scores of MT systems. |
Our Approach | Figure 1: BLEU scores vs. k for SumBasic extraction.
Our Approach | Although BLEU (Papineni et al., 2002) scores are widely used for image caption evaluation, we find them to be poor indicators of the quality of our model. |
Conclusion | • The sense-based translation model is able to substantially improve translation quality in terms of both BLEU and NIST.
Experiments | System (BLEU % / NIST): STM (i5w) 34.64 / 9.4346; STM (i10w) 34.76 / 9.5114; STM (i15w) - / -
Experiments | System (BLEU % / NIST): Base 33.53 / 9.0561; STM (sense) 34.15 / 9.2596; STM (sense+lexicon) 34.73 / 9.4184
Experiments | System (BLEU % / NIST): Base 33.53 / 9.0561; Reformulated WSD 34.16 / 9.3820; STM 34.73 / 9.4184
Abstract | We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data. |
Conclusion | The results showed that our approach achieved a BLEU score gain of 1.61. |
Dependency-based Pre-ordering Rule Set | In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set. |
Experiments | For evaluation, we used BLEU scores (Papineni et al., 2002). |
Experiments | It shows the BLEU scores on the test set and the statistics of pre-ordering on the training set, which include the total count of each rule set and the number of sentences they were applied to.
Introduction | Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61. |
Introduction | In addition, the translation adequacy across different genres (ranging from formal news to informal web forum and public speech) and different languages (English and Chinese) is improved by replacing BLEU or TER with MEANT during parameter tuning (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b). |
Related Work | Surface-form oriented metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) do not correctly reflect the meaning similarities of the input sentence.
Related Work | In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy. |
Related Work | TINE (Rios et al., 2011) is a recall-oriented metric which aims to preserve the basic event structure but it performs comparably to BLEU and worse than METEOR on correlation with human adequacy judgments. |
Abstract | We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data. |
Conclusion and Future Work | The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data. |
Evaluation | Table 1 shows the case-insensitive IBM-version BLEU and TER scores of different systems. |
Evaluation | As seen from row -lm of Table 1, the removal of the skeletal language model results in a significant drop in both BLEU and TER performance.
Evaluation | Row s-space of Table 1 shows the BLEU and TER results of restricting the baseline system to the space of skeleton-consistent derivations, i.e., we remove both the skeleton-based translation model and language model from the SBMT system. |
Introduction | 0 We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data. |
Conclusion | We observed that this often fails to return the best output in terms of BLEU score, fluency, grammaticality and/or meaning. |
Results and Discussion | Figure 6: BLEU scores and grammar size (number of elementary TAG trees).
Results and Discussion | The average BLEU score is given with respect to all input (All) and to those inputs for which the systems generate at least one sentence (Covered). |
Results and Discussion | In terms of BLEU score, the best version of our system (AUTEXP) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results similar to the fully handcrafted UDEL system.
Experiments | The reported BLEU scores are averaged over 5 runs of MERT (Och, 2003).
Experiments | We illustrate the relationship among translation accuracy (BLEU), the number of retrieved documents (N) and the length of hidden layers (L) on different testing datasets.
Experiments | Figure 3: End-to-end translation results (BLEU %)
Experiments | Case-insensitive BLEU is employed as the evaluation metric. |
Experiments | Specifically, the Significance algorithm can safely discard 64% of the phrase table at its threshold 12 with only 0.1 BLEU loss in the overall test. |
Experiments | In contrast, our BRAE-based algorithm can remove 72% of the phrase table at its threshold 0.7 with only 0.06 BLEU loss in the overall evaluation. |
Introduction | The experiments show that up to 72% of the phrase table can be discarded without significant decrease on the translation quality, and in decoding with phrasal semantic similarities up to 1.7 BLEU score improvement over the state-of-the-art baseline can be achieved. |
Related Work | (2013) also use bag-of-words but learn BLEU-sensitive phrase embeddings.
Abstract | Experiments on a Chinese to English translation task show that our proposed RZNN can outperform the state-of-the-art baseline by about 1.5 points in BLEU . |
Conclusion and Future Work | We conduct experiments on a Chinese-to-English translation task, and our method outperforms a state-of-the-art baseline by about 1.5 BLEU points.
Experiments and Results | When we remove it from RZNN, WEPPE based method drops about 10 BLEU points on development data and more than 6 BLEU points on test data. |
Experiments and Results | TCBPPE based method drops about 3 BLEU points on both development and test data sets. |
Introduction | We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we get about 1.5 BLEU points improvement, compared with a state-of-the-art baseline system. |
Evaluation | We report on BLEU , NIST, METEOR, and word error rate metrics WER and PER. |
Experiments & Results | The BLEU scores, not included in the figure but shown in Table 2, show a similar trend. |
Experiments & Results | Statistical significance on the BLEU scores was tested using pairwise bootstrap sampling (Koehn, 2004). |
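Pairwise bootstrap sampling (Koehn, 2004) can be sketched as below; for brevity this resamples precomputed per-sentence scores with a generic aggregate function, whereas a faithful implementation would recompute corpus BLEU from resampled per-sentence n-gram statistics:

```python
import random

def paired_bootstrap(scores_a, scores_b, metric, trials=1000, seed=0):
    """Fraction of bootstrap resamples in which system A beats system B.

    scores_a / scores_b: per-sentence scores for the two systems on the
    same test set; metric: aggregates a resampled list into one number.
    """
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(trials):
        # draw a test set of the same size, with replacement,
        # using the SAME sentence indices for both systems (paired)
        idx = [rng.randrange(n) for _ in range(n)]
        if metric([scores_a[i] for i in idx]) > metric([scores_b[i] for i in idx]):
            wins += 1
    return wins / trials
```

If system A wins in at least 95% of resamples, the difference is conventionally reported as significant at the 95% level.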
Experiments & Results | Another discrepancy is found in the BLEU scores of the English-to-Chinese experiments, where we measure an unexpected drop in BLEU score below the baseline.
Abstract | We evaluate our model on a Chinese to English translation task and obtain up to 1.2 BLEU improvement over strong baselines. |
Experiments | We refer to the SMT model without domain adaptation as the baseline. LDA marginally improves machine translation (less than half a BLEU point).
Experiments | These improvements are not redundant: our new ptLDA-dict model, which has aspects of both models, yields the best performance among these approaches, with up to a 1.2 BLEU point gain (higher is better) and a 2.6-point TER reduction (lower is better).
Experiments | The BLEU improvement is significant (Koehn, 2004) at p = 0.01, except on MT03 with variational and variational-hybrid inference.
Experiments | We adopted three state-of-the-art metrics, BLEU (Papineni et al., 2002), NIST (Doddington et al., 2000) and METEOR (Banerjee and Lavie, 2005), to evaluate the translation quality. |
Experiments | Overall, the boldface numbers in the last row illustrate that our model obtains average improvements of 1.89, 1.76, and 1.61 on BLEU, NIST, and METEOR, respectively.
Experiments | Models (BLEU / NIST / METEOR): CS 29.38 / 59.85 / 54.07; SMS 30.05 / 61.33 / 55.95; UBS 30.15 / 61.56 / 55.39; Stanford 30.40 / 61.94 / 56.01
Abstract | On two Chinese-English tasks, our semi-supervised DAE features obtain statistically significant improvements of 1.34/2.45 (IWSLT) and 0.82/1.52 (NIST) BLEU points over the unsupervised DBN features and the baseline features, respectively.
Conclusions | The results also demonstrate that DNN (DAE and HCDAE) features are complementary to the original features for SMT, and adding them together obtain statistically significant improvements of 3.16 (IWSLT) and 2.06 (NIST) BLEU points over the baseline features. |
Experiments and Results | Adding new DNN features as extra features significantly improves translation accuracy (row 2-17 vs. 1), with the highest increase of 2.45 (IWSLT) and 1.52 (NIST) (row 14 vs. 1) BLEU points over the baseline features. |