Experimental Results | All the BLEU scores reported are for lowercase evaluation. |
Experimental Results | m-BLEU indicates that the segmented output was evaluated against a segmented version of the reference (this measure does not have the same correlation with human judgement as BLEU).
Experimental Results | No Uni indicates the segmented BLEU score without unigrams.
Models 2.1 Baseline Models | performance of unsupervised segmentation for translation, our third baseline is a segmented translation model based on a supervised segmentation model (called Sup), using the hand-built Omorfi morphological analyzer (Pirinen and Listenmaa, 2007), which provided slightly higher BLEU scores than the word-based baseline.
Translation and Morphology | Automatic evaluation measures for MT, BLEU (Papineni et al., 2002), WER (Word Error Rate) and PER (Position Independent Word Error Rate) use the word as the basic unit rather than morphemes. |
Translation and Morphology | Our proposed approaches are significantly better than the state of the art, achieving the highest reported BLEU scores on the English-Finnish Europarl version 3 dataset. |
Abstract | BLEU , TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality. |
Experiments | As metrics we use BLEU and RIBES (which demonstrated good human correlation in this language pair (Goto et al., 2011)). |
Experiments | As metrics we use BLEU and NTER. |
Experiments | BLEU = BP × (∏_{n=1}^{4} prec_n)^{1/4}.
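The formula above is the standard BLEU combination of a brevity penalty with the geometric mean of 1- to 4-gram modified precisions. A minimal sentence-level sketch, with simple add-one smoothing (an illustrative choice, not taken from any of the cited papers):

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """BP times the geometric mean of modified 1..max_n-gram precisions."""
    c, r = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(c, n), ngrams(r, n)
        # Candidate n-gram counts are clipped by the reference counts.
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        total = sum(cand.values())
        # Add-one smoothing so one empty n-gram order does not zero the score.
        log_prec += log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(c) >= len(r) else exp(1 - len(r) / max(len(c), 1))
    return bp * exp(log_prec)
```

A candidate identical to its reference scores 1.0; shortening the candidate triggers the brevity penalty.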
Introduction | These methods are effective because they tune the system to maximize an automatic evaluation metric such as BLEU, which serves as a surrogate objective for translation quality.
Introduction | However, we know that a single metric such as BLEU is not enough. |
Introduction | For example, while BLEU (Papineni et al., 2002) focuses on word-based n-gram precision, METEOR (Lavie and Agarwal, 2007) allows for stem/synonym matching and incorporates recall. |
Multi-objective Algorithms | If we had used BLEU scores rather than the {0,1} labels in line 8, the entire PMO-PRO algorithm would revert to single-objective PRO. |
Theory of Pareto Optimality 2.1 Definitions and Concepts | For example, suppose K = 2, M1(h) computes the BLEU score, and M2(h) gives the METEOR score of h. Figure 1 illustrates the set of vectors {M(h)} in a 10-best list.
Abstract | Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU . |
Abstract | In principle, tuning on these metrics should yield better systems than tuning on BLEU . |
Abstract | It has a better correlation with human judgment than BLEU . |
Introduction | BLEU (Papineni et al., 2002), NIST (Doddington, 2002), WER, PER, TER (Snover et al., 2006), and LRscore (Birch and Osborne, 2011) do not use external linguistic
Introduction | Among these metrics, BLEU is the most widely used for both evaluation and tuning. |
Introduction | Many of the metrics correlate better with human judgments of translation quality than BLEU , as shown in recent WMT Evaluation Task reports (Callison-Burch et |
Abstract | The syntax-based translation system integrating the proposed techniques outperforms the best Arabic-English unconstrained system in NIST-08 evaluations by 1.3 absolute BLEU, which is statistically significant.
Experiments | We use BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) to evaluate translation qualities. |
Experiments | and we achieved a BLEUr4n4 score of 55.01 for MT08-NW, or a cased BLEU of 53.31, which is close to the best officially reported result of 53.85 for unconstrained systems. We expose the statistical decisions in Eqn. 3 as additional cost; the translation results in Table 11 show it helps BLEU by 0.29 BLEU points (56.13 vs.
Abstract | The large scale distributed composite language model gives a drastic perplexity reduction over n-grams and achieves significantly better translation quality, measured by the BLEU score and “readability”, when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
Experimental results | We substitute our language model and use MERT (Och, 2003) to optimize the BLEU score (Papineni et al., 2002). |
Experimental results | We partition the data into ten pieces: nine are used as training data to optimize the BLEU score (Papineni et al., 2002) by MERT (Och, 2003), and the remaining piece is used to re-rank the 1000-best list and obtain the BLEU score.
Introduction | ply our language models to the task of re-ranking the N-best list from Hiero (Chiang, 2005; Chiang, 2007), a state-of-the-art parsing-based MT system, we achieve significantly better translation quality measured by the BLEU score and “readability”. |
Abstract | On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM.
Abstract | The NNJM features also produce a gain of +6.3 BLEU on top of a simpler baseline equivalent to Chiang’s (2007) original Hiero implementation.
Introduction | Additionally, we present several variations of this model which provide significant additive BLEU gains. |
Introduction | The NNJM features produce an improvement of +3.0 BLEU on top of a baseline that is already better than the 1st place MT12 result and includes
Introduction | Additionally, on top of a simpler decoder equivalent to Chiang’s (2007) original Hiero implementation, our NNJM features are able to produce an improvement of +6.3 BLEU, as much as all of the other features in our strong baseline system combined.
Model Variations | OpenMT12 1st Place: Ar-En BLEU 49.5, Ch-En BLEU 32.6.
Model Variations | BLEU scores are mixed-case. |
Model Variations | On Arabic-English, the primary S2T/L2R NNJM gains +1.4 BLEU on top of our baseline, while the S2T NNLTM gains another +0.8, and the directional variations gain +0.8 BLEU more.
Neural Network Joint Model (NNJM) | We demonstrate in Section 6.6 that using one hidden layer instead of two has minimal effect on BLEU.
Neural Network Joint Model (NNJM) | We demonstrate in Section 6.6 that using the self-normalized/pre-computed NNJM results in only a very small BLEU degradation compared to the standard NNJM.
Abstract | Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make. |
Abstract | However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved. |
Introduction | With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model. |
Introduction | With the SVM reranker, we obtain a significant improvement in BLEU scores over |
Introduction | Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases). |
Reranking with SVMs 4.1 Methods | In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments. |
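The pairwise setup described above can be sketched as follows; the function name and the optional margin parameter are illustrative assumptions, not the authors' implementation:

```python
def preference_pairs(candidates, bleu_scores, margin=0.0):
    """Order candidate realizations by sentence BLEU against the reference.

    Returns (better_index, worse_index) pairs suitable as training
    examples for a pairwise ranker such as a ranking SVM.
    """
    pairs = []
    for i in range(len(candidates)):
        for j in range(len(candidates)):
            # Only emit a pair when the BLEU gap exceeds the margin.
            if bleu_scores[i] > bleu_scores[j] + margin:
                pairs.append((i, j))
    return pairs
```

Each pair then yields one constraint for the ranker: the feature vector of the better candidate should score higher than that of the worse one.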
Simple Reranking | Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations |
Simple Reranking | Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93. |
Simple Reranking | In sum, although simple ranking helps to avoid vicious ambiguity in some cases, the overall results of simple ranking are no better than the perceptron model (according to BLEU, at least), as parse failures that are not reflective of human interpretive tendencies too often lead the ranker to choose dispreferred realizations.
Abstract | The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements.
Abstract | We estimate the correlation of unigram and Smoothed BLEU , TER, ROUGE-SU4, and Meteor against human judgements on two data sets. |
Abstract | The main finding is that unigram BLEU has a weak correlation, and Meteor has the strongest correlation with human judgements. |
Introduction | The main finding of our analysis is that TER and unigram BLEU are weakly correlated against human judgements, ROUGE-SU4 and Smoothed BLEU are moderately correlated, and the strongest correlation is found with Meteor.
Methodology | BLEU measures the effective overlap between a reference sentence X and a candidate sentence Y. |
Methodology | BLEU = BP · exp(∑_{n=1}^{N} w_n log p_n)
Methodology | Unigram BLEU without a brevity penalty has been reported by Kulkarni et al.
Experimental Results | Group III: contains other important evaluation metrics, which were not considered in the WMT12 metrics task: NIST and ROUGE for both system- and segment-level, and BLEU and TER at segment-level. |
Experimental Results | II: TER .812 .836 .848; BLEU .810 .830 .846.
Experimental Results | We can see that DR is already competitive by itself: on average, it has a correlation of .807, very close to BLEU and TER scores (.810 and .812, respectively). |
Experimental Setup | To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms). |
Experimental Setup | Combination of five metrics based on lexical similarity: BLEU , NIST, METEOR-ex, ROUGE-W, and TERp-A. |
Related Work | A common argument, is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012). |
Related Work | For BLEU and TER, they observed improved correlation with human judgments on the MTC4 dataset when linearly interpolating these metrics with their lexical cohesion score. |
Abstract | Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation. |
Abstract | We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion. |
Abstract | Our best results improve a phrase-based statistical machine translation system trained on WMT 2012 French-English data by up to 2.0 BLEU, and the expected BLEU objective improves over a cross-entropy trained model by up to 0.6 BLEU in a single reference setup. |
Expected BLEU Training | The n-best lists serve as an approximation to S(f) used in the next step for expected BLEU training of the recurrent neural network model (§3.1).
Expected BLEU Training | 3.1 Expected BLEU Objective |
Expected BLEU Training | Formally, we define our loss function ℓ(θ) as the negative expected BLEU score, denoted as xBLEU(θ), for a given foreign sentence f:
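One way to realize this objective, sketched here under the assumption that sentence-level BLEU is precomputed for each n-best hypothesis and that the model defines a softmax distribution over the list (a simplification of the actual training setup):

```python
from math import exp

def expected_bleu(nbest_scores, nbest_bleus):
    """xBLEU: probability-weighted sentence BLEU over an n-best list.

    nbest_scores: unnormalized model log-scores for each hypothesis
    nbest_bleus:  precomputed sentence-level BLEU for each hypothesis
    """
    m = max(nbest_scores)
    # Softmax over the list, shifted by the max for numerical stability.
    weights = [exp(s - m) for s in nbest_scores]
    z = sum(weights)
    probs = [w / z for w in weights]
    return sum(p * b for p, b in zip(probs, nbest_bleus))

def xbleu_loss(nbest_scores, nbest_bleus):
    """The training loss is the negative expected BLEU."""
    return -expected_bleu(nbest_scores, nbest_bleus)
```

Gradients of this loss with respect to the model scores then push probability mass toward high-BLEU hypotheses.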
Introduction | The expected BLEU objective provides an efficient way of achieving this for machine translation (Rosti et al., 2010; Rosti et al., 2011; He and Deng, 2012; Gao and He, 2013; Gao et al., 2014) instead of solely relying on traditional optimizers such as Minimum Error Rate Training (MERT) that only adjust the weighting of entire component models within the log-linear framework of machine translation (§3). |
Introduction | We test the expected BLEU objective by training a recurrent neural network language model and obtain substantial improvements. |
Recurrent Neural Network LMs | time algorithm, which unrolls the network and then computes error gradients over multiple time steps (Rumelhart et al., 1986); we use the expected BLEU loss (§3) to obtain the error with respect to the output activations. |
Experiments | As more training pairs are used, the model produces more varied sentences (PINC) but preserves the meaning less well (BLEU).
Experiments | As a comparison, evaluating each human description as a paraphrase for the other descriptions in the same cluster resulted in a BLEU score of 52.9 and a PINC score of 77.2. |
Introduction | In addition to the lack of standard datasets for training and testing, there are also no standard metrics like BLEU (Papineni et al., 2002) for evaluating paraphrase systems. |
Paraphrase Evaluation Metrics | One of the limitations to the development of machine paraphrasing is the lack of standard metrics like BLEU , which has played a crucial role in driving progress in MT. |
Paraphrase Evaluation Metrics | Thus, researchers have been unable to rely on BLEU or some derivative: the optimal paraphrasing engine under these terms would be one that simply returns the input. |
Paraphrase Evaluation Metrics | To measure semantic equivalence, we simply use BLEU with multiple references. |
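Multi-reference BLEU clips each candidate n-gram count by the maximum count observed in any single reference. A small illustrative sketch of that clipping step (not the authors' code):

```python
from collections import Counter

def clipped_precision(candidate, references, n=1):
    """Modified n-gram precision with counts clipped per n-gram by the
    most generous single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.split())
    # Take, for each n-gram, the maximum count over all references.
    max_ref = Counter()
    for ref in references:
        for g, cnt in ngrams(ref.split()).items():
            max_ref[g] = max(max_ref[g], cnt)
    overlap = sum(min(cnt, max_ref[g]) for g, cnt in cand.items())
    return overlap / max(sum(cand.values()), 1)
```

With the classic example, the candidate "the the the" against the reference "the cat" gets unigram precision 1/3, since only one occurrence of "the" is credited.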
Background | As in other state-of-the-art SMT systems, BLEU is selected as the accuracy measure to define the error function used in MERT. |
Background | Since the weights of training samples are not taken into account in BLEU, we modify the original definition of BLEU to make it sensitive to the distribution Dt(i) over the training samples.
Background | The modified version of BLEU is called weighted BLEU (WBLEU) in this paper.
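A minimal sketch of one way such a weighted BLEU could look, assuming each sentence's n-gram counts and lengths are simply scaled by its sample weight Dt(i) before pooling; the exact WBLEU definition in the paper may differ:

```python
from collections import Counter
from math import exp, log

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def weighted_bleu(cands, refs, sample_weights, max_n=4):
    """Corpus BLEU with each sentence's counts scaled by its weight Dt(i)."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        overlap = total = 0.0
        for cand, ref, w in zip(cands, refs, sample_weights):
            cc = ngram_counts(cand.split(), n)
            rc = ngram_counts(ref.split(), n)
            # Scale clipped matches and candidate counts by the sample weight.
            overlap += w * sum(min(cnt, rc[g]) for g, cnt in cc.items())
            total += w * sum(cc.values())
        log_prec += log((overlap + 1) / (total + 1)) / max_n  # add-one smoothing
    # Brevity penalty on weighted lengths.
    c_len = sum(w * len(c.split()) for c, w in zip(cands, sample_weights))
    r_len = sum(w * len(r.split()) for r, w in zip(refs, sample_weights))
    bp = 1.0 if c_len >= r_len else exp(1 - r_len / max(c_len, 1e-9))
    return bp * exp(log_prec)
```

With uniform weights this reduces to an (add-one smoothed) corpus BLEU.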
Abstract | Using this consistent training of phrase models we are able to achieve improvements of up to 1.4 points in BLEU . |
Alignment | We perform minimum error rate training with the downhill simplex algorithm (Nelder and Mead, 1965) on the development data to obtain a set of scaling factors that achieve a good BLEU score. |
Experimental Evaluation | The scaling factors of the translation models have been optimized for BLEU on the DEV data. |
Experimental Evaluation | The metrics used for evaluation are the case-sensitive BLEU (Papineni et al., 2002) score and the translation edit rate (TER) (Snover et al., 2006) with one reference translation. |
Introduction | Our results show that the proposed phrase model training improves translation quality on the test set by 0.9 BLEU points over our baseline. |
Introduction | We find that by interpolation with the heuristically extracted phrases, translation performance can reach up to a 1.4 BLEU point improvement over the baseline on the test set.
Abstract | Medium-scale experiments show an absolute and statistically significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system even with fewer rules. |
Experiments | We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on development set. |
Experiments | The baseline system extracts 31.9M 625 rules, 77.9M 525 rules respectively and achieves a BLEU score of 34.17 on the test set.
Experiments | As shown in the third line in the column of BLEU score, the performance drops 1.7 BLEU points over baseline system due to the poorer rule coverage. |
Introduction | Medium data experiments (Section 5) show a statistically significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system even with fewer translation rules; this is also the first time that a tree-to-tree model has surpassed its tree-to-string counterparts.
Model | (2009), their forest-based constituency-to-constituency system achieves a comparable performance against Moses (Koehn et al., 2007), but a significant improvement of +3.6 BLEU points over the 1-best tree-based constituency-to-constituency system. |
Abstract | As compared to baseline systems, we achieve absolute improvements of 2.40 BLEU score on a phrase-based SMT system and 1.76 BLEU score on a parsing-based SMT system. |
Experiments on Parsing-Based SMT | BLEU (%): Joshua 30.05; + Improved word alignments 31.81.
Experiments on Parsing-Based SMT | The system using the improved word alignments achieves an absolute improvement of 1.76 BLEU score, which indicates that the improvements of word alignments are also effective to improve the performance of the parsing-based SMT systems. |
Experiments on Phrase-Based SMT | We use BLEU (Papineni et al., 2002) as evaluation metrics. |
Experiments on Phrase-Based SMT | BLEU (%): Moses 29.62; + Phrase collocation probability 30.47.
Experiments on Phrase-Based SMT | If the same alignment method is used, the systems using CM-3 got the highest BLEU scores. |
Experiments on Word Alignment | BLEU (%): Baseline 29.62. Our methods: WA-1 with CM-1 30.85, CM-2 31.28, CM-3 31.48; WA-2 with CM-1 31.00, CM-2 31.33, CM-3 31.51; WA-3 with CM-1 31.43, CM-2 31.62, CM-3 31.78.
Introduction | The alignment improvement results in an improvement of 2.16 BLEU score on phrase-based SMT system and an improvement of 1.76 BLEU score on parsing-based SMT system. |
Introduction | SMT performance is further improved by 0.24 BLEU score. |
Abstract | BLEU ) when applied to morphologically rich languages such as Czech. |
Introduction | Section 2 illustrates and explains severe problems of a widely used BLEU metric (Papineni et al., 2002) when applied to Czech as a representative of languages with rich morphology. |
Introduction | Figure 1: BLEU and human ranks of systems participating in the English-to-Czech WMT09 shared task.
Problems of BLEU | BLEU (Papineni et al., 2002) is an established language-independent MT metric. |
Problems of BLEU | The unbeaten advantage of BLEU is its simplicity. |
Problems of BLEU | We plot the official BLEU score against the rank established as the percentage of sentences where a system ranked no worse than all its competitors (Callison-Burch et al., 2009). |
Experiments | The second score is BLEU (Papineni et al., 2001), computed between the reconstructed and the original sentences, which allows us to check how well the quality of reconstruction correlates with the internal score.
Experiments | In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences. |
Abstract | At a speed of roughly 70 words per second, Moses reaches 17.2% BLEU , whereas our approach yields 20.0% with identical models. |
Experimental Evaluation | system: BLEU [%], #HYP, #LM, w/s. N0 = ∞: baseline 20.1, 3.0K, 322K, 2.2; +presort 20.1, 2.5K, 183K, 3.6. N0 = 100:
Experimental Evaluation | We evaluate with BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). |
Introduction | We also run comparisons with the Moses decoder (Koehn et al., 2007), which yields the same performance in BLEU , but is outperformed significantly in terms of scalability for faster translation. |
Introduction | Experiments show that our approach significantly outperforms both phrase-based (Koehn et al., 2007) and string-to-dependency approaches (Shen et al., 2008) in terms of BLEU and TER.
Introduction | Adding dependency language model (“depLM”) and the maximum entropy shift-reduce parsing model (“maxent”) significantly improves BLEU and TER on the development set, both separately and jointly. |
Experiment Results | We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences). |
Experiment Results | On average the improvement is 1.07 BLEU points (45.66
Experiment Results | Table 4: Arabic-English true case translation scores in BLEU metric. |
Phrasal-Hiero Model | Compare BLEU scores of translation using all extracted rules (the first row) and translation using only rules without nonaligned subphrases (the second row). |
Baseline MT | The scaling factors for all features are optimized by minimum error rate training algorithm to maximize BLEU score (Och, 2003). |
Experiments | We can see that except for the BOLT3 data set with BLEU metric, our NAMT approach consistently outperformed the baseline system for all data sets with all metrics, and provided up to 23.6% relative error reduction on name translation. |
Experiments | According to Wilcoxon Matched-Pairs Signed-Ranks Test, the improvement is not significant with BLEU metric, but is significant at 98% confidence level with all of the other metrics. |
Introduction | The current dominant automatic MT scoring metrics (such as the Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002)) treat all words equally, but names have relatively low frequency in text (about 6% in newswire and only 3% in web documents) and are thus vastly outnumbered by function words, common nouns, etc.
Name-aware MT Evaluation | Traditional MT evaluation metrics such as BLEU (Papineni et al., 2002) and Translation Edit Rate (TER) (Snover et al., 2006) assign the same weights to all tokens equally. |
Name-aware MT Evaluation | In order to properly evaluate the translation quality of NAMT methods, we propose to modify the BLEU metric so that it can dynamically assign more weights to names during evaluation.
Name-aware MT Evaluation | BLEU considers the correspondence between a system translation and a human translation: |
Abstract | We show empirical results on the OPUS data: our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster).
Experiments and Results | To evaluate translation quality, we use BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation. |
Experiments and Results | We show that our method achieves the best performance ( BLEU scores) on this task while being significantly faster than both the previous approaches. |
Experiments and Results | We also report the first BLEU results on such a large-scale MT task under truly nonparallel settings (without using any parallel data or seed lexicon). |
Abstract | The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points.
Abstract | Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
Introduction | We built a phrasal Machine Translation (MT) system on adapted Egyptian/English parallel data, which outperformed a non-adapted baseline by 1.87 BLEU points.
Previous Work | The system trained on AR (B1) performed poorly compared to the one trained on EG (B2) with a 6.75 BLEU points difference. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | S1, which used only EG’ for training, showed an improvement of 1.67 BLEU points over the best baseline system (B4).
Proposed Methods 3.1 Egyptian to EG’ Conversion | Phrase merging that preferred phrases learnt from EG’ data over AR data performed the best with a BLEU score of 16.96. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | The Egyptian sentence “wbyHtrmwA AlnAs AltAnyp” produced “lyfizfij (OOV) the second people” (BLEU = 0.31).
Abstract | Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU . |
Experiments | We compared the performance of Moses using the alignment produced by our model and the baseline alignment, evaluating translation quality using BLEU (Papineni et al., 2002) with case-insensitive n-gram matching with n = 4. |
Experiments | We used minimum error rate training (Och, 2003) to tune the feature weights to maximise the BLEU score on the development set. |
Experiments | 5 The effect on translation scores is modest, roughly amounting to +0.2 BLEU versus using a single sample. |
Introduction | The model produces uniformly better translations than those of a competitive phrase-based baseline, amounting to an improvement of up to 3.4 BLEU points absolute. |
Abstract | We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set. |
Additional Experiments | As can be seen in Table 4, in the smaller feature set, RM and MERT were the best performers, with the exception that on MT08, MIRA yielded somewhat better (+0.7) BLEU but a somewhat worse (-0.9) TER score than RM. |
Additional Experiments | On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain. |
Additional Experiments | Interestingly, RM achieved substantially higher BLEU precision scores in all tests for both language pairs. |
Experiments | We used cdec (Dyer et al., 2010) as our hierarchical phrase-based decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 corpus. |
Experiments | The bound constraint B was set to 1.4. The approximate sentence-level BLEU cost Δ is computed in a manner similar to (Chiang et al., 2009), namely, in the context of previous 1-best translations of the tuning set.
Experiments | We explored alternative values for B, as well as scaling it by the current candidate’s cost, and found that the optimizer is fairly insensitive to these changes, resulting in only minor differences in BLEU . |
Introduction | Automatic evaluation (using ROUGE (Lin and Hovy, 2003) and BLEU (Papineni et al., 2002)) against manually generated focused summaries shows that our sum-marizers uniformly and statistically significantly outperform two baseline systems as well as a state-of-the-art supervised extraction-based system. |
Results | To evaluate the full abstract generation system, the BLEU score (Papineni et al., 2002) (the precision of unigrams and bigrams with a brevity penalty) is computed with human abstracts as reference.
Results | BLEU has a fairly good agreement with human judgement and has been used to evaluate a variety of language generation systems (Angeli et al., 2010; Konstas and Lapata, 2012). |
Abstract | We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores. |
Abstract | Our maximal set of source and target side transformations, coupled with some additional techniques, provides a 39% relative improvement, from a baseline of 17.08 to 23.78 BLEU, all averaged over 10 training and test sets.
Experimental Setup and Results | For evaluation, we used the BLEU metric (Papineni et al., 2001).
Experimental Setup and Results | Wherever meaningful, we report the average BLEU scores over 10 data sets along with the maximum and minimum values and the standard deviation. |
Experimental Setup and Results | We can observe that the combined syntax-to-morphology transformations on the source side provide a substantial improvement by themselves and a simple target side transformation on top of those provides a further boost to 21.96 BLEU which represents a 28.57% relative improvement over the word-based baseline and a 18.00% relative improvement over the factored baseline. |
Introduction | We find that with the full set of syntax-to-morphology transformations and some additional techniques we can get about 39% relative improvement in BLEU scores over a word-based baseline and about 28% improvement of a factored baseline, all experiments being done over 10 training and test sets. |
Syntax-to-Morphology Mapping | We find (and elaborate later) that this reduction in the English side of the training corpus, in general, is about 30%, and is correlated with improved BLEU scores. |
Abstract | Furthermore, integrated Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction in comparison with the pure SMT system. |
Conclusion and Future Work | The experiments show that the proposed Model-III outperforms both the TM and the SMT systems significantly (p < 0.05) in either BLEU or TER when fuzzy match score is above 0.4. |
Conclusion and Future Work | Compared with the pure SMT system, Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction on a Chinese—English TM database. |
Experiments | In the tables, the best translation results (either in BLEU or TER) at each interval have been marked in bold. |
Experiments | Compared with TM and SMT, Model-I is significantly better than the SMT system in either BLEU or TER when the fuzzy match score is above 0.7; Model-II significantly outperforms both the TM and the SMT systems in either BLEU or TER when the fuzzy match score is above 0.5; Model-III significantly exceeds both the TM and the SMT systems in either BLEU or TER when the fuzzy match score is above 0.4. |
Experiments | SMT 8.03 BLEU points at interval [0.9, 1.0), while the advantage is only 2.97 BLEU points at interval [0.6, 0.7). |
Introduction | Compared with the pure SMT system, the proposed integrated Model-III achieves 3.48 BLEU points improvement and 2.62 TER points reduction overall. |
Abstract | In addition, a revised BLEU score (called iBLEU) which measures the adequacy and diversity of the generated paraphrase sentence is proposed for tuning parameters in SMT systems. |
Experiments and Results | Joint learning (BLEU / self-BLEU / iBLEU): No Joint 27.16 / 35.42 / –; α = 1: 30.75 / 53.51 / 30.75.
Experiments and Results | We show the BLEU score (computed against references) to measure the adequacy and self-BLEU (computed against source sentence) to evaluate the dissimilarity (lower is better). |
Experiments and Results | From the results we can see that, when the value of α decreases to place a heavier penalty on self-paraphrase, the self-BLEU score rapidly decays, while the BLEU score computed against references also drops sharply.
Introduction | The jointly-learned dual SMT system: (1) adapts the SMT systems so that they are tuned specifically for paraphrase generation purposes, e.g., to increase the dissimilarity; (2) employs a revised BLEU score (named iBLEU, as it is an input-aware BLEU metric) that measures adequacy and dissimilarity of the paraphrase results at the same time.
Paraphrasing with a Dual SMT System | Two issues are also raised in (Zhao and Wang, 2010) about using automatic metrics: a paraphrase that changes less receives a larger BLEU score, and the evaluations of paraphrase quality and rate tend to be incompatible.
Paraphrasing with a Dual SMT System | iBLEU(s, r_s, c) = α · BLEU(c, r_s) − (1 − α) · BLEU(c, s) (3)
Paraphrasing with a Dual SMT System | BLEU(c, r_s) captures the semantic equivalency between the candidates and the references (Finch et al.
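Equation (3) can be sketched directly on top of any sentence-level BLEU scorer. In the minimal sketch below, `unigram_prec` is only a toy stand-in scorer for illustration, not the BLEU variant used in the paper:

```python
def ibleu(source, references, candidate, bleu_fn, alpha=0.8):
    """iBLEU (Eq. 3): alpha * BLEU(c, r_s) - (1 - alpha) * BLEU(c, s).
    The first term rewards adequacy against the references; the second
    penalizes candidates that merely copy the source."""
    adequacy = bleu_fn(candidate, references)
    self_sim = bleu_fn(candidate, [source])
    return alpha * adequacy - (1 - alpha) * self_sim

def unigram_prec(candidate, references):
    """Toy stand-in for a sentence-level BLEU scorer: unigram precision."""
    ref_words = {w for ref in references for w in ref}
    return sum(w in ref_words for w in candidate) / max(len(candidate), 1)
```

With alpha = 1 the measure reduces to plain BLEU against the references; lowering alpha trades adequacy for dissimilarity, which is exactly the tension between quality and paraphrase rate discussed above.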
Experimental Evaluation | We show that our method performs better by 1.6 BLEU than the best performing method described in (Ravi and Knight, 2011) while |
Experimental Evaluation | In the case of the OPUS and VERBMOBIL corpora, we evaluate the results using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) against reference translations.
Experimental Evaluation | For BLEU higher values are better, for TER lower values are better. |
Related Work | They perform experiments on a Spanish-English task with vocabulary sizes of about 500 words and achieve a performance of around 20 BLEU, compared to 70 BLEU obtained by a system that was trained on parallel data.
Experiments | BLEU , sentence-level geometric mean of 1- to 4-gram precision, as in (Belz et al., 2011) |
Experiments | BLEU-T, sentence-level BLEU computed on post-processed output where predicted referring expressions for victim and perp are replaced in the sentences (both gold and predicted) by their original role label; this score does not penalize lexical mismatches between corpus and system REs
Experiments | When REG and linearization are applied on shallowSyn_re with gold shallow trees, the BLEU score is lower (60.57) as compared to the system that applies syntax and linearization on deepSyn_re, deep trees with gold REs (BLEU score of 63.9).
Abstract | We show empirically that TESLA-CELAB significantly outperforms character-level BLEU in the English-Chinese translation evaluation tasks.
Experiments | 4.3.1 BLEU |
Experiments | Although word-level BLEU has often been found inferior to the new-generation metrics when the target language is English or other European languages, prior research has shown that character-level BLEU is highly competitive when the target language is Chinese (Li et al., 2011). |
Experiments | We use character-level BLEU as our main baseline.
Introduction | Since the introduction of BLEU (Papineni et al., 2002), automatic machine translation (MT) evaluation has received a lot of research interest. |
Introduction | In the WMT shared tasks, many new generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments. |
Introduction | Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human judged translation quality. |
Inferring a learning curve from mostly monolingual data | Our objective is to predict the evolution of the BLEU score on the given test set as a function of the size of a random subset of the training data |
Inferring a learning curve from mostly monolingual data | We first train models to predict the BLEU score at m anchor sizes s_1, ..., s_m.
Inferring a learning curve from mostly monolingual data | We then perform inference using these models to predict the BLEU score at each anchor, for the test case of interest. |
Introduction | In both cases, the task consists in predicting an evaluation score ( BLEU , throughout this work) on the test corpus as a function of the size of a subset of the source sample, assuming that we could have it manually translated and use the resulting bilingual corpus for training. |
Introduction | An extensive study across six parametric function families, empirically establishing that a certain three-parameter power-law family is well suited for modeling learning curves for the Moses SMT system when the evaluation score is BLEU . |
Introduction | They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6). |
Selecting a parametric family of curves | For a certain bilingual test dataset d, we consider a set of observations O_d = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where y_i is the performance on d (measured using BLEU (Papineni et al., 2002)) of a translation model trained on a parallel corpus of size x_i.
Selecting a parametric family of curves | The last condition is related to our use of BLEU (which is bounded by 1) as a performance measure; it should be noted that some growth patterns which are sometimes proposed, such as a logarithmic regime of the form y = a + b log x, are not
Selecting a parametric family of curves | The values are on the same scale as the BLEU scores. |
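A minimal sketch of fitting such a three-parameter power-law family to (corpus size, BLEU) observations. The parameterization y = c − a·x^(−b) and the grid-search fitting strategy are illustrative assumptions; the paper's actual family and fitting procedure may differ in detail:

```python
def fit_power_law(xs, ys, b_grid=None):
    """Fit y = c - a * x**(-b): for each candidate exponent b the model is
    linear in (c, a), so solve that 2x2 least-squares system in closed form
    and keep the b with the smallest squared residual."""
    if b_grid is None:
        b_grid = [i / 100 for i in range(1, 301)]  # b in (0, 3]
    best = None
    n = len(xs)
    for b in b_grid:
        z = [x ** (-b) for x in xs]  # regress y on [1, -z]
        sz, szz = sum(z), sum(v * v for v in z)
        sy, szy = sum(ys), sum(v * y for v, y in zip(z, ys))
        det = n * szz - sz * sz
        if abs(det) < 1e-12:
            continue
        # normal equations for y ~ c - a*z
        a = (sz * sy - n * szy) / det
        c = (sy + a * sz) / n
        sse = sum((c - a * zi - y) ** 2 for zi, y in zip(z, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c
```

The asymptote c is the predicted BLEU ceiling as training data grows; evaluating the fitted curve at an unseen corpus size gives the kind of anchor-size prediction described above.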
Abstract | In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. |
Abstract | The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. |
Abstract | parameters in the phrase and lexicon translation models are estimated by relative frequency or maximizing joint likelihood, which may not correspond closely to the translation measure, e.g., bilingual evaluation understudy ( BLEU ) (Papineni et al., 2002). |
Experiments | Translation quality was evaluated using both the BLEU score proposed by Papineni et al. |
Experiments | (2002) and also the modified BLEU (BLEU-Fix) score used in the IWSLT 2008 evaluation campaign, where the brevity calculation is modified to use closest reference length instead of shortest reference length.
Experiments | Method (BLEU / BLEU-Fix): Triangulation 33.70/27.46, 31.59/25.02; Transfer 33.52/28.34, 31.36/26.20; Synthetic 34.35/27.21, 32.00/26.07; Combination 38.14/29.32, 34.76/27.39
Translation Selection | In this paper, we modify the method in Albrecht and Hwa (2007) to only prepare human reference translations for the training examples, and then evaluate the translations produced by the subject systems against the references using BLEU score (Papineni et al., 2002). |
Translation Selection | We use smoothed sentence-level BLEU score to replace the human assessments, where we use additive smoothing to avoid zero BLEU scores when we calculate the n-gram precisions. |
Translation Selection | In the context of translation selection, 3/ is assigned as the smoothed BLEU score. |
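A minimal sketch of such an additively smoothed sentence-level BLEU. The smoothing constant and the brevity-penalty details here are illustrative assumptions, not the exact recipe used in the paper:

```python
import math
from collections import Counter

def smoothed_sentence_bleu(candidate, reference, max_n=4, eps=1.0):
    """Sentence-level BLEU with additive (add-eps) smoothing of the
    n-gram counts, so that a missing higher-order match never zeroes
    the whole score."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        log_p += math.log((clipped + eps) / (total + eps))
    # brevity penalty against the single reference length
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(log_p / max_n)
```

Because of the smoothing, even a hypothesis with no n-gram matches receives a small positive score, which is what makes the value usable as a regression target in place of human assessments.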
Automatic Evaluation Metrics | In this section, we describe BLEU, and the three metrics which achieved higher correlation results than BLEU in the recent ACL-07 MT workshop. |
Automatic Evaluation Metrics | 2.1 BLEU |
Automatic Evaluation Metrics | BLEU (Papineni et al., 2002) is essentially a precision-based metric and is currently the standard metric for automatic evaluation of MT performance. |
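As a concrete illustration of this precision-based definition (brevity penalty times the geometric mean of clipped 1- to 4-gram precisions), here is a minimal single-segment sketch; it is not the official mteval implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Single-segment BLEU: brevity penalty times the geometric mean of
    clipped n-gram precisions for n = 1..max_n."""
    if not candidate:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # clip each candidate n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in references:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_prec_sum += math.log(clipped / total)
    # brevity penalty against the closest reference length
    c_len = len(candidate)
    r_len = min((len(r) for r in references), key=lambda l: (abs(l - c_len), l))
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(log_prec_sum / max_n)
```

Clipping is what makes BLEU a precision metric rather than a simple overlap count: a candidate cannot be rewarded for repeating an n-gram more often than any reference contains it.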
Introduction | Among all the automatic MT evaluation metrics, BLEU (Papineni et al., 2002) is the most widely used. |
Introduction | Although BLEU has played a crucial role in the progress of MT research, it is becoming evident that BLEU does not correlate with human judgement |
Introduction | The results show that, as compared to BLEU , several recently proposed metrics such as Semantic-role overlap (Gimenez and Marquez, 2007), ParaEval-recall (Zhou et al., 2006), and METEOR (Banerjee and Lavie, 2005) achieve higher correlation. |
A Generic Phrase Training Procedure | lation engine to minimize the final translation errors measured by automatic metrics such as BLEU (Papineni et al., 2002). |
Discussions | [Figure: BLEU score vs. phrase table size]
Discussions | After reaching its peak, the BLEU score drops as the threshold 7' increases. |
Discussions | Table 4: Translation Results ( BLEU ) of discriminative phrase training approach using different features |
Experimental Results | We measure translation performance by the BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) scores with multiple translation references. |
Experimental Results | BLEU Scores |
Experimental Results | The translation results as measured by BLEU and METEOR scores are presented in Table 3. |
Abstract | We propose and extensively evaluate a simple method for using alignment models to produce alignments better-suited for phrase-based MT systems, and show significant gains (as measured by BLEU score) in end-to-end translation systems for six languages pairs used in recent MT competitions. |
Conclusions | Table 3: BLEU scores for all language pairs using all available data. |
Introduction | Our contribution is a large scale evaluation of this methodology for word alignments, an investigation of how the produced alignments differ and how they can be used to consistently improve machine translation performance (as measured by BLEU score) across many languages on training corpora with up to hundred thousand sentences. |
Introduction | In 10 out of 12 cases we improve BLEU score by at least half a point, and by more than 1 point in 4 out of 12 cases.
Phrase-based machine translation | We report BLEU scores using a script available with the baseline system. |
Phrase-based machine translation | Figure 8: BLEU score as the amount of training data is increased on the Hansards corpus for the best decoding method for each alignment model. |
Phrase-based machine translation | In principle, we would like to tune the threshold by optimizing BLEU score on a development set, but that is impractical for experiments with many pairs of languages. |
Word alignment results | Unfortunately, as was shown by Fraser and Marcu (2007), AER can have weak correlation with translation performance as measured by BLEU score (Papineni et al., 2002), when the alignments are used to train a phrase-based translation system.
Abstract | Large-scale experiments show an absolute improvement of 1.7 BLEU points over the 1-best baseline.
Experiments | BLEU score |
Experiments | We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on the dev set. |
Experiments | The BLEU score of the baseline 1-best decoding is 0.2325, which is consistent with the result of 0.2302 in (Liu et al., 2007) on the same training, development and test sets, and with the same rule extraction procedure. |
Introduction | Large-scale experiments (Section 4) show an improvement of 1.7 BLEU points over the 1-best baseline, which is also 0.8 points higher than decoding with 30-best trees, and takes even less time thanks to the sharing of common subtrees.
Abstract | Our experiments show that the string-to-dependency decoder achieves 1.48 point improvement in BLEU and 2.53 point improvement in TER compared to a standard hierarchical string-to-string system on the NIST 04 Chinese-English evaluation set.
Conclusions and Future Work | Our string-to-dependency system generates 80% fewer rules, and achieves 1.48 point improvement in BLEU and 2.53 point improvement in TER on the decoding output on the NIST 04 Chinese-English evaluation set. |
Experiments | All models are tuned on BLEU (Papineni et al., 2001), and evaluated on both BLEU and Translation Error Rate (TER) (Snover et al., 2006) so that we could detect over-tuning on one metric. |
Experiments | BLEU% (lower / mixed) and TER% (lower / mixed). Decoding (3-gram LM): baseline 38.18 / 35.77, 58.91 / 56.60; filtered 37.92 / 35.48, 57.80 / 55.43; str-dep 39.52 / 37.25, 56.27 / 54.07. Rescoring (5-gram LM): baseline 40.53 / 38.26, 56.35 / 54.15; filtered 40.49 / 38.26, 55.57 / 53.47; str-dep 41.60 / 39.47, 55.06 / 52.96
Experiments | Table 2: BLEU and TER scores on the test set. |
Introduction | For example, Chiang (2007) showed that the Hiero system achieved about 1 to 3 point improvement in BLEU on the NIST 03/04/05 Chinese-English evaluation sets compared to a state-of-the-art phrasal system.
Introduction | Our string-to-dependency decoder shows 1.48 point improvement in BLEU and 2.53 point improvement in TER on the NIST 04 Chinese-English MT evaluation set. |
Abstract | We applied our inflection generation models in translating English into two morphologically complex languages, Russian and Arabic, and show that our model improves the quality of SMT over both phrasal and syntax-based SMT systems according to BLEU and human judgements.
Integration of inflection models with MT systems | We performed a grid search on the values of A and n, to maximize the BLEU score of the final system on a development set (dev) of 1000 sentences (Table 2). |
MT performance results | For automatically measuring performance, we used 4-gram BLEU against a single reference translation. |
MT performance results | We also report oracle BLEU scores which incorporate two kinds of oracle knowledge. |
MT performance results | For the methods using n=1 translation from a base MT system, the oracle BLEU score is the BLEU score of the stemmed translation compared to the stemmed reference, which represents the upper bound achievable by changing only the inflected forms (but not stems) of the words in a translation.
Evaluation | Metric: Since we have four professional translation sets, we can calculate the Bilingual Evaluation Understudy (BLEU) score (Papineni et al., 2002) for one professional translator (P1) using the other three (P2, P3, P4) as a reference set.
Evaluation | In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations. |
Evaluation | This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators. |
Abstract | An additional fast decoding pass maximizing the expected count of correct translation hypotheses increases the BLEU score significantly. |
Decoding to Maximize BLEU | BLEU is based on n-gram precision, and since each synchronous constituent in the tree adds a new 4-gram to the translation at the point where its children are concatenated, the additional pass approximately maximizes BLEU . |
Experiments | We evaluate the translation results by comparing them against the reference translations using the BLEU metric. |
Experiments | Hyperedges and BLEU: Bigram Pass 167K hyperedges, BLEU 21.77; Trigram Pass 167K + 629.7K = 796.7K hyperedges, BLEU 23.56
Experiments | Table 1: Speed and BLEU scores for two-pass decoding.
Introduction | With this heuristic, we achieve the same BLEU scores and model cost as a trigram decoder with essentially the same speed as a bigram decoder. |
Introduction | Maximizing the expected count of synchronous constituents approximately maximizes BLEU . |
Introduction | We find a significant increase in BLEU in the experiments, with minimal additional time. |
Abstract | The minimum Bayes risk (MBR) decoding objective improves BLEU scores for machine translation output relative to the standard Viterbi objective of maximizing model score. |
Abstract | However, MBR targeting BLEU is prohibitively slow to optimize over k-best lists for large k. In this paper, we introduce and analyze an alternative to MBR that is equally effective at improving performance, yet is asymptotically faster, running 80 times faster than MBR in experiments with 1000-best lists.
Abstract | Our forest-based decoding objective consistently outperforms k-best list MBR, giving improvements of up to 1.0 BLEU.
Consensus Decoding Algorithms | Typically, MBR is defined as argmin_{e in E} E[L(e; e')] for some loss function L, for example 1 − BLEU(e; e'). These definitions are equivalent.
Consensus Decoding Algorithms | Figure 1 compares Algorithms 1 and 2 using U(e; e'). Other linear functions have been explored for MBR, including Taylor approximations to the logarithm of BLEU (Tromble et al., 2008) and counts of matching constituents (Zhang and Gildea, 2008), which are discussed further in Section 3.3.
Consensus Decoding Algorithms | Computing MBR even with simple nonlinear measures such as BLEU, NIST or bag-of-words F1 seems to require O(k^2) computation time.
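The O(k^2) k-best MBR computation described here can be sketched as follows; the `overlap` function in the demo is only a toy stand-in for a sentence-level BLEU gain:

```python
def mbr_decode(kbest, probs, gain):
    """O(k^2) Minimum Bayes Risk over a k-best list: return the hypothesis
    maximizing expected gain (equivalently, minimizing expected loss)
    under the model posterior restricted to the same list."""
    best, best_score = None, float("-inf")
    for e in kbest:
        expected = sum(p * gain(e, e2) for p, e2 in zip(probs, kbest))
        if expected > best_score:
            best, best_score = e, expected
    return best

def overlap(e, e2):
    """Toy stand-in for BLEU: Jaccard overlap of the token sets."""
    a, b = set(e.split()), set(e2.split())
    return len(a & b) / max(len(a | b), 1)
```

Note that the double loop over hypotheses is exactly where the quadratic cost comes from: each candidate is scored against every other candidate, which is why MBR with a nonlinear gain becomes slow for large k.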
Introduction | In statistical machine translation, output translations are evaluated by their similarity to human reference translations, where similarity is most often measured by BLEU (Papineni et al., 2002). |
Introduction | Unfortunately, with a nonlinear similarity measure like BLEU , we must resort to approximating the expected loss using a k-best list, which accounts for only a tiny fraction of a model’s full posterior distribution. |
Introduction | In experiments using BLEU over 1000-best lists, we found that our objective provided benefits very similar to MBR, only much faster. |
Introduction | Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
Introduction | We employ MERT to select these weights by optimizing BLEU score on a development set. |
Introduction | In contrast, our MBR algorithm directly selects the hypothesis in the hypergraph with the maximum expected approximate corpus BLEU score (Tromble et al., 2008). |
MERT for MBR Parameter Optimization | However, this does not guarantee that the resulting linear score (Equation 2) is close to the corpus BLEU . |
MERT for MBR Parameter Optimization | We now describe how MERT can be used to estimate these factors to achieve a better approximation to the corpus BLEU . |
MERT for MBR Parameter Optimization | We recall that MERT selects weights in a linear model to optimize an error criterion (e. g. corpus BLEU ) on a training set. |
Minimum Bayes-Risk Decoding | This reranking can be done for any sentence-level loss function such as BLEU (Papineni et al., 2001), Word Error Rate, or Position-independent Error Rate. |
Minimum Bayes-Risk Decoding | (2008) extended MBR decoding to translation lattices under an approximate BLEU score. |
Minimum Bayes-Risk Decoding | They approximated the log(BLEU) score by a linear function of n-gram matches and candidate length.
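That linear function can be sketched as follows; the particular weight values in the demo are illustrative placeholders, not the heuristically set or MERT-tuned values from the papers:

```python
from collections import Counter

def linear_bleu_gain(hyp, ref, thetas):
    """Linear approximation of log(BLEU) in the style of Tromble et al.
    (2008): theta_0 times the candidate length plus theta_n times the
    number of matching n-grams, for n = 1..4."""
    gain = thetas[0] * len(hyp)
    for n in range(1, 5):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matches = sum(min(c, r[g]) for g, c in h.items())
        gain += thetas[n] * matches
    return gain
```

Because the gain is linear in n-gram match counts, the expectation over a lattice or hypergraph decomposes into expected n-gram counts, which is what makes lattice MBR tractable where exact BLEU is not.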
Abstract | Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets. |
Evaluation | We use case-insensitive BLEU (Papineni et al., 2002) to evaluate translation quality. |
Evaluation | Table 4 presents the results of these variations; overall, by taking into account generated candidates appropriately and using bigrams (“SLP 2-gram”), we obtained a 1.13 BLEU gain on the test set. |
Evaluation | In "HalfMono", we use only half of the monolingual comparable corpora, and still obtain an improvement of 0.56 BLEU points, indicating that adding more monolingual data is likely to improve the system further.
Introduction | This enhancement alone results in an improvement of almost 1.4 BLEU points. |
Introduction | We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models. |
Abstract | We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008). |
Experimental Results | Table 1: BLEU scores for Viterbi, Crunching, MBR, and variational decoding.
Experimental Results | Table 1 presents the BLEU scores under Viterbi, crunching, MBR, and variational decoding. |
Experimental Results | Table 2 presents the BLEU results under different ways in using the variational models, as discussed in Section 3.2.3. |
Introduction | We geometrically interpolate the resulting approximations q with one another (and with the original distribution p), justifying this interpolation as similar to the minimum-risk decoding for BLEU proposed by Tromble et al. |
Variational Approximate Decoding | However, in order to score well on the BLEU metric for MT evaluation (Papineni et al., 2001), which gives partial credit, we would also like to favor lower-order n-grams that are likely to appear in the reference, even if this means picking some less-likely high-order n-grams. |
Variational vs. Min-Risk Decoding | They use the following loss function, of which a linear approximation to BLEU (Papineni et al., 2001) is a special case, |
Abstract | Comparable to the state-of-the-art system combination technique, joint decoding achieves an absolute improvement of 1.5 BLEU points over individual decoding. |
Experiments | We evaluated the translation quality using case-insensitive BLEU metric (Papineni et al., 2002). |
Experiments | Table 2: Comparison of individual decoding and joint decoding in terms of speed (seconds/sentence) and BLEU score (case-insensitive).
Experiments | With conventional max-derivation decoding, the hierarchical phrase-based model achieved a BLEU score of 30.11 on the test set, with an average decoding time of 40.53 seconds/sentence. |
Introduction | As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
Introduction | Joint decoding with multiple models achieves an absolute improvement of 1.5 BLEU points over individual decoding with single models (Section 5).
Abstract | Comparable to the state-of-the-art phrase-based system Moses, using packed forests in tree-to-tree translation results in a significant absolute improvement of 3.6 BLEU points over using 1-best trees.
Experiments | We evaluated the translation quality using the BLEU metric, as calculated by mteval-v11b.pl with its default setting except that we used case-insensitive matching of n-grams.
Experiments | avg trees # of rules BLEU |
Experiments | Table 3: Comparison of BLEU scores for tree-based and forest-based tree-to-tree models. |
Introduction | Comparable to Moses, our forest-based tree-to-tree model achieves an absolute improvement of 3.6 BLEU points over conventional tree-based model. |
Abstract | Our best result improves over the best single MT system baseline by 1.0% BLEU and over a strong system selection baseline by 0.6% BLEU on a blind test set. |
Introduction | Our best system selection approach improves over our best baseline single MT system by 1.0% absolute BLEU point on a blind test set. |
MT System Selection | We run the 5,562 sentences of the classification training data through our four MT systems and produce sentence-level BLEU scores (with length penalty). |
MT System Selection | We pick the name of the MT system with the highest BLEU score as the class label for that sentence. |
MT System Selection | When there is a tie in BLEU scores, we pick the system label that yields better overall BLEU scores from the systems tied. |
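The labeling scheme described in the last two sentences can be sketched as follows; using the per-system average of sentence-level scores as the "overall" tie-breaker is an illustrative assumption (the paper's overall score may be corpus-level BLEU), and all system names are hypothetical:

```python
def select_labels(sent_bleu):
    """sent_bleu: {system_name: [sentence-level BLEU per sentence]}.
    Label each sentence with its best-scoring system; break ties in
    favor of the tied system with the best overall (average) score."""
    systems = list(sent_bleu)
    overall = {s: sum(v) / len(v) for s, v in sent_bleu.items()}
    n_sents = len(next(iter(sent_bleu.values())))
    labels = []
    for i in range(n_sents):
        top = max(sent_bleu[s][i] for s in systems)
        tied = [s for s in systems if sent_bleu[s][i] == top]
        labels.append(max(tied, key=overall.get))
    return labels
```

The resulting per-sentence labels are exactly the class labels used to train the system-selection classifier.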
Machine Translation Experiments | Feature weights are tuned to maximize BLEU on tuning sets using Minimum Error Rate Training (Och, 2003). |
Machine Translation Experiments | Results are presented in terms of BLEU (Papineni et al., 2002). |
Machine Translation Experiments | All differences in BLEU scores between the four systems are statistically significant above the 95% level. |
Experimental Setup | We evaluate our system using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). |
Methods | This could improve translation quality, as it brings our training scenario closer to our test scenario (test BLEU is always measured on unsegmented references). |
Related Work | We use both segmented and unsegmented language models, and tune automatically to optimize BLEU . |
Related Work | (2008) also tune on unsegmented references by simply desegmenting SMT output before MERT collects sufficient statistics for BLEU . |
Results | For English-to-Arabic, 1-best desegmentation results in a 0.7 BLEU point improvement over training on unsegmented Arabic. |
Results | Moving to lattice desegmentation more than doubles that improvement, resulting in a BLEU score of 34.4 and an improvement of 1.0 BLEU point over 1-best desegmentation. |
Results | 1000-best desegmentation also works well, resulting in a 0.6 BLEU point improvement over 1-best. |
Experiments | [Tables: Recall, F1, and BLEU scores for the compared methods; superscript indices mark statistically significant differences]
Experiments | Method 4, named REBOL, implements REsponse-Based Online Learning by instantiating y+ and y− to the form described in Section 4: in addition to the model score s, it uses a cost function c based on sentence-level BLEU (Nakov et al., 2012) and tests translation hypotheses for task-based feedback using a binary execution function e.
Response-based Online Learning | Computation of distance to the reference translation usually involves cost functions based on sentence-level BLEU (Nakov et al. |
Response-based Online Learning | In addition, we can use translation-specific cost functions based on sentence-level BLEU in order to boost similarity of translations to human reference translations. |
Response-based Online Learning | Our cost function c(y^(i), y) = 1 − BLEU(y^(i), y) is based on a version of sentence-level BLEU (Nakov et al., 2012).
Experiments | Each utterance in the test data has more than one response that elicits the same goal emotion, because these responses are used to compute the BLEU score (see Section 5.3).
Experiments | We first use BLEU score (Papineni et al., 2002) to perform automatic evaluation (Ritter et al., 2011). |
Experiments | In this evaluation, the system is provided with the utterance and the goal emotion in the test data and the generated responses are evaluated through BLEU score. |
Abstract | On NIST MT08 set, our most advanced model brings around +2.0 BLEU and -1.0 TER improvement. |
Experiments | MT08 nw (BLEU / TER) and MT08 wb (BLEU / TER)
Experiments | The best TER and BLEU results on each genre are in bold. |
Experiments | For BLEU , higher scores are better, while for TER, lower scores are better. |
Discussion and Further Work | Hiero was MERT-trained on this set and has a 2% higher BLEU score compared to the discriminative model.
Discussion and Further Work | [Figure: development BLEU (%)]
Evaluation | Although there is no direct relationship between BLEU and likelihood, it provides a rough measure for comparing performance. |
Evaluation | We also experimented with using max-translation decoding for standard MERT-trained translation models, finding that it had a small negative impact on BLEU score.
Evaluation | Figure 5 shows the relationship between beam width and development BLEU . |
Abstract | We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model) as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. |
Evaluation | Model: Pb0 | Pb1 | Pb2 | M1 | M2; BLEU: 14.3 | 16.25 | 16.13 | 18.6 | 17.05
Evaluation | Both our systems (Model-1 and Model-2) beat the baseline phrase-based system with a BLEU point difference of 4.30 and 2.75 respectively. |
Evaluation | The difference of 2.35 BLEU points between M1 and Pb1 indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.
Final Results | This section shows the improvement in BLEU score by applying heuristics and combinations of heuristics in both the models. |
Final Results | BLEU point improvement and combined with all the heuristics (M2H123) gives an overall gain of 1.95 BLEU points and is close to our best results (M1H12). |
Final Results | One important issue that has not been investigated yet is that BLEU has not been shown to perform well for morphologically rich target languages like Urdu, but no metric is known to work better.
Introduction | Section 4 discusses the training data, parameter optimization and the initial set of experiments that compare our two models with a baseline Hindi-Urdu phrase-based system and with two transliteration-aided phrase-based systems in terms of BLEU scores |
Abstract | We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
Conclusion | The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks. |
Experiments | Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English. |
Experiments | minimum error rate training (Och, 2003) with BLEU score as the objective function. |
Experiments | Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task. |
Experiments | To assess and compare simplification systems, two main automatic metrics have been used in previous work, namely BLEU and the Flesch-Kincaid Grade Level Index (FKG).
Experiments | BLEU gives a measure of how close a system’s output is to the gold standard simple sentence. |
Experiments | Because there are many possible ways of simplifying a sentence, BLEU alone fails to correctly assess the appropriateness of a simplification. |
Related Work | (2010), namely an aligned corpus of 100/131 EWKP/SWKP sentences, and show that they achieve a better BLEU score.
Baselines | where m ranges over IN and OUT, pm(é| f) is an estimate from a component phrase table, and each Am is a weight in the top-level log-linear model, set so as to maximize dev-set BLEU using minimum error rate training (Och, 2003). |
Conclusion & Future Work | We showed that this approach can gain up to 2.2 BLEU points over its concatenation baseline and 0.39 BLEU points over a powerful mixture model. |
Ensemble Decoding | In Section 4.2, we compare the BLEU scores of different mixture operations on a French-English experimental setup. |
Ensemble Decoding | However, experiments showed that replacing the scores with the normalized scores hurts the BLEU score radically.
Ensemble Decoding | However, we did not try it, as the BLEU scores we got using the normalization heuristic were not promising, and it would impose a cost in decoding as well.
Experiments & Results 4.1 Experimental Setup | Since the Hiero baseline results were substantially better than those of the phrase-based model, we also implemented the best-performing baseline, linear mixture, in our Hiero-style MT system; in fact it achieves the highest BLEU score among all the baselines, as shown in Table 2.
Experiments & Results 4.1 Experimental Setup | This baseline is run three times and the score is averaged over the BLEU scores, with a standard deviation of 0.34.
Experiments & Results 4.1 Experimental Setup | We also reported the BLEU scores when we applied the span-wise normalization heuristic. |
Experiments and Results | Statistical significance in BLEU score differences was tested by paired bootstrap re-sampling (Koehn, 2004). |
Experiments and Results | BLEU: 0.4029 / 0.3146; NIST: 7.0419 / 8.8462; METEOR: 0.5785 / 0.5335
Experiments and Results | Both SMP and ESSP outperform baseline consistently in BLEU , NIST and METEOR. |
Abstract | We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features. |
Conclusions | In comparison to a baseline model, we achieve statistically significant improvement in BLEU score. |
Discussion | Given that we only looked at IS factors within a sentence, we think that such a significant improvement in BLEU and exact match scores is very encouraging. |
Generation Ranking Experiments | Model BLEU Match (%) |
Generation Ranking Experiments | We evaluate the string chosen by the log-linear model against the original treebank string in terms of exact match and BLEU score (Papineni et al., 2002).
Generation Ranking Experiments | We achieve an improvement of 0.0168 BLEU points and 1.91 percentage points in exact match. |
Cohesive Decoding | Initially, we were not certain to what extent this feature would be used by the MERT module, as BLEU is not always sensitive to syntactic improvements. |
Cohesive Phrasal Output | We tested this approach on our English-French development set, and saw no improvement in BLEU score. |
Conclusion | Our experiments have shown that roughly 1/5 of our baseline English-French translations contain cohesion violations, and these translations tend to receive lower BLEU scores. |
Conclusion | Our soft constraint produced improvements ranging between 0.5 and 1.1 BLEU points on sentences for which the baseline produces uncohesive translations. |
Experiments | We first present our soft cohesion constraint’s effect on BLEU score (Papineni et al., 2002) for both our dev-test and test sets. |
Experiments | First of all, looking across columns, we can see that there is a definite divide in BLEU score between our two evaluation subsets. |
Experiments | Sentences with cohesive baseline translations receive much higher BLEU scores than those with uncohesive baseline translations. |
Abstract | The performance measured by BLEU is at least comparable to that of the traditional batch training method.
Conclusion and Future Work | The method assumes that a combined model is derived from a hierarchical Pitman-Yor process with each prior learned separately in each domain, and achieves BLEU scores competitive with traditional batch-based ones. |
Experiment | The BLEU scores reported in this paper are the average of 5 independent runs of independent batch-MIRA weight training, as suggested by Clark et al. (2011).
Experiment | In the IWSLT2012 data set, there is a huge gap between the HIT corpus and the BTEC corpus, and our method gains a 0.814 BLEU improvement.
Experiment | While the FBIS data set is artificially divided, with no clear human-assigned differences among sub-domains, our method loses 0.09 BLEU.
Abstract | The experimental results show that our proposed approach achieves significant improvements of 1.6~3.6 points of BLEU in the oral domain and 0.5~1 points in the news domain.
Discussion | on BLEU score |
Experiments | The metrics for automatic evaluation were BLEU and TER (Snover et al., 2005).
Experiments | (s0i, s1i) are selected for the extraction of paraphrase rules if two conditions are satisfied: (1) BLEU(e2i) − BLEU(e1i) > δ1, and (2) BLEU(e2i) > δ2, where BLEU(·) is a function for computing the BLEU score, and δ1 and δ2 are thresholds balancing the number and the quality of the extracted paraphrase rules.
Experiments | Our system gains significant improvements of 1.6~3.6 points of BLEU in the oral domain, and 0.5~1 points of BLEU in the news domain. |
Extraction of Paraphrase Rules | As mentioned above, the detailed procedure produces T1, S1, and T2 as described; finally we compute BLEU (Papineni et al., 2002) for the corresponding sentences.
Extraction of Paraphrase Rules | If the sentence in T 2 has a higher BLEU score than the aligned sentence in T1, the corresponding sentences in S0 and S1 are selected as candidate paraphrase sentence pairs, which are used in the following steps of paraphrase extractions. |
Introduction | The experimental results show that our proposed approach achieves significant improvements of 1.6~3.6 points of BLEU in the oral domain and 0.5~1 points in the news domain.
Abstract | Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on the out-of-domain NTCIR PatentMT corpus.
Complexity Analysis | In this section, the proposed method is first validated on monolingual segmentation tasks, and then evaluated in the context of SMT to study whether the translation quality, measured by BLEU , can be improved. |
Complexity Analysis | For the bilingual tasks, the publicly available system of Moses (Koehn et al., 2007) with default settings is employed to perform machine translation, and BLEU (Papineni et al., 2002) was used to evaluate the quality. |
Complexity Analysis | It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings. |
Introduction | • improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
Abstract | Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets. |
Conclusion and future work | We use dependency scores as an extra feature in our MT experiments, and found that our dependency model provides significant gains over a competitive baseline that incorporates a large 5-gram language model (0.92% TER and 0.45% BLEU absolute improvements). |
Dependency parsing for machine translation | We found that dependency scores with or without loop elimination are generally close and highly correlated, and that MT performance without final loop removal was about the same (generally less than 0.2% BLEU ). |
Introduction | In our experiments, we build a competitive baseline (Koehn et al., 2007) incorporating a 5-gram LM trained on a large part of Gigaword and show that our dependency language model provides improvements on five different test sets, with an overall gain of 0.92 in TER and 0.45 in BLEU scores. |
Machine translation experiments | Parameter tuning was done with minimum error rate training (Och, 2003), which was used to maximize BLEU (Papineni et al., 2001). |
Machine translation experiments | In the final evaluations, we report results using both TER (Snover et al., 2006) and the original BLEU metric as described in (Papineni et al., 2001). |
Machine translation experiments | For BLEU evaluations, differences are significant in four out of six cases, and in the case of TER, all differences are significant. |
Abstract | Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system). |
Experiments | To evaluate the translation results, we use BLEU (Papineni et al., 2002). |
Experiments | On the English-Chinese data set, the improvement over the phrase-based system is 1.3 BLEU points, and 0.8 over the hierarchical phrase-based system. |
Experiments | In the tasks of translating to European languages, the improvements over the phrase-based baseline are in the range of 0.5 to 1.0 BLEU points, and 0.3 to 0.5 over the hierarchical phrase-based system. |
Abstract | Extensive experiments involving large-scale English-to-Japanese translation revealed a significant improvement of 1.8 points in BLEU score, as compared with a strong forest-to-string baseline system. |
Conclusion | Extensive experiments on large-scale English-to-Japanese translation resulted in a significant improvement in BLEU score of 1.8 points (p < 0.01), as compared with our implementation of a strong forest-to-string baseline system (Mi et al., 2008; Mi and Huang, 2008). |
Experiments | BLEU (%) 26.15 27.07 27.93 28.89 |
Experiments | Here, fw denotes function word, DT denotes decoding time, and the BLEU scores were computed on the test set.
Experiments | the final BLEU scores of C3-T with Min-F and C3-F.
Introduction | (2008) achieved a 3.1-point improvement in BLEU score (Papineni et al., 2002) by including bilingual syntactic phrases in their forest-based system. |
Introduction | Using the composed rules of the present study in a baseline forest-to-string translation system results in a 1.8-point improvement in the BLEU score for large-scale English-to-Japanese translation. |
AL-SMT: Multilingual Setting | The translation quality is measured by TQ for the individual systems M_{F_d→E}; it can be the BLEU score or WER/PER (word error rate and position-independent WER), which induces a maximization or minimization problem, respectively.
AL-SMT: Multilingual Setting | This process is continued iteratively until a certain level of translation quality is met (we use the BLEU score, WER and PER) (Papineni et al., 2002). |
Experiments | The number of weights is 3 plus the number of source languages, and they are trained using minimum error-rate training (MERT) to maximize the BLEU score (Och, 2003) on a development set.
Experiments | Avg BLEU Score |
Sentence Selection: Multiple Language Pairs | • Let e_c be the consensus among all the candidate translations, then define the disagreement as Σ_d α_d (1 − BLEU(e_c, e_d)).
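A minimal sketch of this disagreement term; the similarity callback stands in for sentence-level BLEU, and the function and parameter names are illustrative:

```python
def disagreement(consensus, candidates, weights, sim):
    """Disagreement of the consensus e_c with candidate translations e_d:
    sum_d alpha_d * (1 - BLEU(e_c, e_d)); `sim` stands in for BLEU and
    `weights` for the alpha_d coefficients."""
    return sum(a * (1.0 - sim(consensus, c))
               for a, c in zip(weights, candidates))
```

A candidate set that fully agrees with the consensus thus yields a disagreement of zero, and disagreement grows as the candidates diverge.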
Experiments | Training data for discriminative learning are prepared by comparing a 100-best list of translations against a single reference using smoothed per-sentence BLEU (Liang et al., 2006a). |
Experiments | Figure 4 gives a boxplot depicting BLEU-4 results for 100 runs of the MIRA implementation of the cdec package, tuned on dev-nc, and evaluated on the respective test set test-nc. We see a high variance (whiskers denote standard deviations) around a median of 27.2 BLEU and a mean of 27.1 BLEU.
Experiments | In contrast, the perceptron is deterministic when started from a zero-vector of weights and achieves favorable 28.0 BLEU on the news-commentary test set. |
Joint Feature Selection in Distributed Stochastic Learning | Let each translation candidate be represented by a feature vector x ∈ R^D, where preference pairs for training are prepared by sorting translations according to smoothed sentence-wise BLEU score (Liang et al., 2006a) against the reference.
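Pair preparation of this kind can be sketched as follows; the scoring callback stands in for smoothed sentence-wise BLEU against the reference, and the function name is illustrative:

```python
def preference_pairs(nbest, score):
    """Sort an n-best list by a sentence-level quality score (standing in
    for smoothed sentence-wise BLEU) and emit (better, worse) preference
    pairs for pairwise ranking training."""
    ranked = sorted(nbest, key=score, reverse=True)
    return [(ranked[i], ranked[j])
            for i in range(len(ranked))
            for j in range(i + 1, len(ranked))]
```

Each emitted pair then supplies one training constraint: the learner should score the first element above the second.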
Abstract | The data generated allows us to train a reordering model that gives an improvement of 1.8 BLEU points on the NIST MT-08 Urdu-English evaluation set over a reordering model that only uses manual word alignments, and a gain of 5.2 BLEU points over a standard phrase-based baseline.
Conclusion | Cumulatively, we see a gain of 1.8 BLEU points over a baseline reordering model that only uses manual word alignments, a gain of 2.0 BLEU points over a hierarchical phrase-based system, and a gain of 5.2 BLEU points over a phrase-based baseline.
Experimental setup | All experiments were done on Urdu-English and we evaluate reordering in two ways: Firstly, we evaluate reordering performance directly by comparing the reordered source sentence in Urdu with a reference reordering obtained from the manual word alignments using BLEU (Papineni et al., 2002) (we call this measure monolingual BLEU or mBLEU). |
Experimental setup | Additionally, we evaluate the effect of reordering on our final systems for machine translation measured using BLEU . |
Introduction | This results in a 1.8 BLEU point gain in machine translation performance on an Urdu-English machine translation task over a preordering model trained using only manual word alignments. |
Introduction | In all, this increases the gain in performance by using the preordering model to 5.2 BLEU points over a standard phrase-based system with no preordering. |
Results and Discussions | We see a significant gain of 1.8 BLEU points in machine translation by going beyond manual word alignments using the best reordering model reported in Table 3. |
Results and Discussions | We also note a gain of 2.0 BLEU points over a hierarchical phrase based system. |
Analysis | • The constituent boundary matching feature (CBMF) is a very important feature, which by itself achieves a significant improvement over the baseline (up to 1.13 BLEU).
Analysis | 5.2 Beyond BLEU |
Analysis | Since BLEU is not sufficient |
Experiments | Statistical significance in BLEU score differences was tested by paired bootstrap re-sampling (Koehn, 2004). |
Experiments | Like (Marton and Resnik, 2008), we find that the XP+ feature obtains a significant improvement of 1.08 BLEU over the baseline. |
Experiments | However, using all syntax-driven features described in section 3.2, our SDB models achieve larger improvements of up to 1.67 BLEU . |
Introduction | Our experimental results show that our SDB model achieves a substantial improvement over the baseline and significantly outperforms XP+ according to the BLEU metric (Papineni et al., 2002).
Introduction | In addition, our analysis provides further evidence of the performance gain from a perspective different from that of BLEU.
Abstract | Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU . |
Abstract | On general domain and speech translation tasks where test conditions substantially differ from standard government and news training text, web-mined training data improves performance substantially, resulting in improvements of up to 1.5 BLEU on standard test sets, and 5 BLEU on test sets outside of the news domain. |
Abstract | For all language pairs and both test sets (WMT 2011 and WMT 2012), we show an improvement of around 0.5 BLEU . |
Abstract | When the selected sentence pairs are evaluated on an end-to-end MT task, our methods can increase the translation performance by 3 BLEU points. |
Conclusion | Compared with the methods which only employ language model for data selection, we observe that our methods are able to select high-quality do-main-relevant sentence pairs and improve the translation performance by nearly 3 BLEU points. |
Experiments | The BLEU scores of the In-domain and General-domain baseline system are listed in Table 2. |
Experiments | The results show that General-domain system trained on a larger amount of bilingual resources outperforms the system trained on the in-domain corpus by over 12 BLEU points. |
Experiments | The horizontal coordinate represents the number of selected sentence pairs and vertical coordinate is the BLEU scores of MT systems. |
Experiments and evaluation | We present three types of evaluation: BLEU scores (Papineni et al., 2001), prediction accuracy on clean data and a manual evaluation of the best system in section 5.3. |
Experiments and evaluation | Table 5 gives results in case-insensitive BLEU . |
Experiments and evaluation | While the inflection prediction systems (1-4) are significantly better than the surface-form system (0), the different versions of the inflection systems are not distinguishable in terms of BLEU; however, our manual evaluation shows that the new features have a positive impact on translation quality.
Discussion | At the same time, there has been no negative impact on overall quality as measured by BLEU . |
End-to-End results | To make sure our name transliterator does not degrade the overall translation quality, we evaluated our base SMT system with BLEU , as well as our transliteration-augmented SMT system. |
End-to-End results | The BLEU scores for the two systems were 50.70 and 50.96 respectively. |
Evaluation | General MT metrics such as BLEU , TER, METEOR are not suitable for evaluating named entity translation and transliteration, because they are not focused on named entities (NEs). |
Integration with SMT | In a tuning step, the Minimum Error Rate Training component of our SMT system iteratively adjusts the set of rule weights, including the weight associated with the transliteration feature, such that the English translations are optimized with respect to a set of known reference translations according to the BLEU translation metric.
Introduction | First, although names are important to human readers, automatic MT scoring metrics (such as BLEU ) do not encourage researchers to improve name translation in the context of MT. |
Introduction | A secondary goal is to make sure that our overall translation quality (as measured by BLEU ) does not degrade as a result of the name-handling techniques we introduce. |
Introduction | • We evaluate both the base SMT system and the augmented system in terms of entity translation accuracy and BLEU (Sections 2 and 6).
Abstract | Results on five Chinese-English NIST tasks show that our model improves the baseline system by 1.32 BLEU and 1.53 TER on average. |
Conclusion | Experimental results show that our model is stable and improves the baseline system by 0.98 BLEU and 1.21 TER (trained by CRFs) and 1.32 BLEU and 1.53 TER (trained by RNN). |
Experiments | • BLEU (Papineni et al., 2001) and TER (Snover et al., 2005): all reported scores are calculated in a case-insensitive (lowercase) way.
Experiments | An Index column is added for score reference convenience (B for BLEU ; T for TER). |
Experiments | For the proposed model, significance testing results on both BLEU and TER are reported (B2 and B3 compared to B1, T2 and T3 compared to T1). |
Abstract | Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation. |
Translation Model Architecture | We found that this had no significant effects on BLEU . |
Translation Model Architecture | We report translation quality using BLEU (Papineni et |
Translation Model Architecture | For the IT test set, the system with gold labels and TM adaptation yields an improvement of 0.7 BLEU (21.1 → 21.8), LM adaptation yields 1.3 BLEU (21.1 → 22.4), and adapting both models outperforms the baseline by 2.1 BLEU (21.1 → 23.2).
Abstract | Trained on 8,975 dependency structures of a Chinese Dependency Treebank, the realizer achieves a BLEU score of 0.8874. |
Experiments | In addition to BLEU score, percentage of exactly matched sentences and average NIST simple string accuracy (SSA) are adopted as evaluation metrics. |
Experiments | We observe that the BLEU score is boosted from 0.1478 to 0.5943 by using the RPD method. |
Experiments | All of the four feature functions we have tested achieve considerable improvement in BLEU scores. |
Log-linear Models | BLEU score, a method originally proposed to automatically evaluate machine translation quality (Papineni et al., 2002), has been widely used as a metric to evaluate general-purpose sentence generation (Langkilde, 2002; White et al., 2007; Guo et al. |
Log-linear Models | The BLEU measure computes the geometric mean of the precision of n-grams of various lengths between a sentence realization and a (set of) reference(s). |
Log-linear Models | 3 The BLEU scoring script is supplied by NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
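The geometric-mean computation described above can be sketched as a minimal single-reference sentence scorer with clipped counts and the brevity penalty; real scorers such as the NIST mteval script add smoothing and multi-reference support, so this is only an illustration:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    """Minimal BLEU sketch: brevity penalty times the geometric mean of
    clipped 1..max_n-gram precisions (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        precisions.append(overlap / max(1, sum(h.values())))
    if min(precisions) == 0:
        return 0.0  # any empty precision zeroes the geometric mean
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(1, len(hyp)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

With max_n = 4 this is exactly the BP × (Π prec_n)^(1/4) form quoted elsewhere in these excerpts.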
Experiments | In Table 3, almost all BLEU scores are improved, no matter what strategy is used. |
Experiments | In particular, the best results, marked in bold, are as much as 1.24, 0.94, and 0.82 BLEU points over the baseline system on NIST04, CWMT08 Development, and CWMT08 Evaluation data, respectively.
Related Work | They added the labels assigned to connectives as an additional input to an SMT system, but their experimental results show that the improvements under the evaluation metric of BLEU were not significant. |
Related Work | To the best of our knowledge, our work is the first attempt to exploit the source functional relationship to generate target transitional expressions for grammatical cohesion, and we have successfully incorporated the proposed models into an SMT system with significant improvements in BLEU.
Discussion | Table 6: Performance gain in BLEU over baseline and MR08 systems averaged over all test sets. |
Discussion | Table 9: Performance ( BLEU score) comparison between non-oracle and oracle experiments. |
Experiments | We use the NIST MT 06 dataset (1664 sentence pairs) for tuning, and the NIST MT 03, 05, and 08 datasets (919, 1082, and 1357 sentence pairs, respectively) for evaluation. We use BLEU (Papineni et al., 2002) for both tuning and evaluation.
Experiments | Our first group of experiments investigates whether the syntactic reordering models are able to improve translation quality in terms of BLEU . |
Experiments | Table 5: System performance in BLEU scores. |
Abstract | As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU , which fail to properly evaluate adequacy, become more apparent. |
Abstract | We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the nonautomatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacy judgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER. |
Abstract | We argue that BLEU (Papineni et al., 2002) and other automatic n- gram based MT evaluation metrics do not adequately capture the similarity in meaning between the machine translation and the reference translation—which, ultimately, is essential for MT output to be useful. |
Our Approach | Figure 1: BLEU scores vs. k for SumBasic extraction.
Our Approach | Although BLEU (Papineni et al., 2002) scores are widely used for image caption evaluation, we find them to be poor indicators of the quality of our model. |
Conclusion | We have also shown that, by integrating this hypertagger with a broad-coverage CCG chart realizer, considerably faster realization times are possible (approximately twice as fast as compared with a realizer that performs simple lexical lookups) with higher BLEU , METEOR and exact string match scores. |
Conclusion | Moreover, the hypertagger-augmented realizer finds more than twice the number of complete realizations, and further analysis revealed that the realization quality (as per modified BLEU and METEOR) is higher in the cases when the realizer finds a complete realization. |
Introduction | Moreover, the overall BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007) scores, as well as numbers of exact string matches (as measured against to the original sentences in the CCGbank) are higher for the hypertagger-seeded realizer than for the preexisting realizer. |
Results and Discussion | Table 5 shows that increasing the number of complete realizations also yields improved BLEU and METEOR scores, as well as more exact matches. |
Results and Discussion | In particular, the hypertagger makes possible a more than 6-point improvement in the overall BLEU score on both the development and test sections, and a more than 12-point improvement on the sentences with complete realizations. |
Results and Discussion | Even with the current incomplete set of semantic templates, the hypertagger brings realizer performance roughly up to state-of-the-art levels, as our overall test set BLEU score (0.6701) slightly exceeds that of Cahill and van Genabith (2006), though at a coverage of 96% instead of 98%. |
The Approach | compared the percentage of complete realizations (versus fragmentary ones) with their top scoring model against an oracle model that uses a simplified BLEU score based on the target string, which is useful for regression testing as it guides the best-first search to the reference sentence. |
Conclusion and Future Work | In this paper, we only tried Dice coefficient of n-grams and symmetrical sentence level BLEU as similarity measures. |
Experiments and Results | Instead of using graph-based consensus confidence as features in the log-linear model, we perform structured label propagation (Struct-LP) to re-rank the n-best list directly, and the similarity measures for source sentences and translation candidates are symmetrical sentence level BLEU (equation (10)). |
Features and Training | defined in equation (3), takes symmetrical sentence-level BLEU as the similarity measure:
Features and Training | BLEU_sym(f, f') = (BLEU(f, f') + BLEU(f', f)) / 2 (10), where BLEU(f, f') is the IBM BLEU score computed over i-grams for hypothesis f using f' as the reference.
Features and Training | 1 BLEU is not symmetric, which means, different scores are obtained depending on which one is reference and which one is hypothesis. |
Graph Construction | In our experiment we measure similarity by symmetrical sentence level BLEU of source sentences, and 0.3 is taken as the threshold for edge creation. |
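The symmetrization these passages describe can be sketched as averaging the two directions; the exact formula in the original equation (10) may differ, so this is an illustrative stand-in:

```python
def symmetric_bleu(s1, s2, bleu):
    """Symmetrical sentence-level similarity: average BLEU in both
    directions, since BLEU depends on which side is the reference."""
    return 0.5 * (bleu(s1, s2) + bleu(s2, s1))
```

An edge between two source sentences would then be created when this symmetric score exceeds the 0.3 threshold mentioned above.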
Conclusion and Future Work | Large-scale experiments show improvement on both the reordering metric and SMT performance, with up to a 1.73-point BLEU gain in our evaluation test.
Experiments | Table 2: BLEU (%) score on dev and test data for both EJ and J-E experiment. |
Experiments | We compare their influence on RankingSVM accuracy, alignment crossing-link number, end-to-end BLEU score, and the model size. |
Experiments | Columns: RankingSVM accuracy (%), CLN, BLEU, Feat.#. E-J: tag+label 88.6 / 16.4 / 22.24 / 26k; +dst 91.5 / 13.5 / 22.66 / 55k; +pct 92.2 / 13.1 / 22.73 / 79k; +lex100 92.9 / 12.1 / 22.85 / 347k; +lex1000 94.0 / 11.5 / 22.79 / 2,410k; +lex2000 95.2 / 10.7 / 22.81 / 3,794k. J-E: tag+fw 85.0 / 18.6 / 25.43 / 31k; +dst 90.3 / 16.9 / 25.62 / 65k; +lex100 91.6 / 15.7 / 25.87 / 293k; +lex1000 92.4 / 14.8 / 25.91 / 2,156k; +lex2000 93.0 / 14.3 / 25.84 / 3,297k.
Conclusions and Future Work | Experimental results show that both models are able to significantly improve translation accuracy in terms of BLEU score.
Experiments | Statistical significance in BLEU differences |
Experiments | Our first group of experiments is to investigate whether the predicate translation model is able to improve translation accuracy in terms of BLEU and whether semantic features are useful. |
Experiments | • The proposed predicate translation models achieve an average improvement of 0.57 BLEU points across the two NIST test sets when all features (lex+sem) are used.
Experiments | Table 5 shows baseline translation BLEU scores for a lossless (non-randomized) language model with parameter values quantized into 5 to 8 bits. |
Experiments | Table 5: Baseline BLEU scores with lossless n-gram model and different quantization levels (bits). |
Experiments | Figure 3: BLEU scores on the MT05 data set. |
Conclusion | • The sense-based translation model is able to substantially improve translation quality in terms of both BLEU and NIST.
Experiments | System / BLEU (%) / NIST: STM (±5w) 34.64, 9.4346; STM (±10w) 34.76, 9.5114; STM (±15w) -, -
Experiments | System / BLEU (%) / NIST: Base 33.53, 9.0561; STM (sense) 34.15, 9.2596; STM (sense+lexicon) 34.73, 9.4184
Experiments | System / BLEU (%) / NIST: Base 33.53, 9.0561; Reformulated WSD 34.16, 9.3820; STM 34.73, 9.4184
Experiments | We use BLEU (Papineni et al., 2002) score with shortest length penalty as the evaluation metric and apply the pairwise re-sampling approach (Koehn, 2004) to perform the significance test. |
Experiments | We can see from the table that the domain lexicon is very helpful and significantly outperforms the baseline by more than 4.0 BLEU points.
Experiments | When it is enhanced with the in-domain language model, it can further improve the translation performance by more than 2.5 BLEU points. |
Experiments | To confirm the effectiveness of noun-phrase chunking, we performed the experiment using a system combining BLEU with our method. |
Experiments | In this case, BLEU scores were used as score_wd in Eq.
Experiments | This experimental result is shown as “BLEU with our method” in Tables 2—5. |
Introduction | Methods based on word strings (e.g., BLEU (Papineni et al., 2002), NIST (NIST, 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin and Och, 2004),
Abstract | Conditioning lexical probabilities on the topic biases translations toward topic-relevant output, resulting in significant improvements of up to 1 BLEU and 3 TER on Chinese to English translation over a strong baseline. |
Experiments | 2010) as our decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 tuning corpus using the Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006; Eidelman, 2012). |
Experiments | On FBIS, we can see that both models achieve moderate but consistent gains over the baseline on both BLEU and TER. |
Experiments | The best model, LTM-10, achieves a gain of about 0.5 and 0.6 BLEU and 2 TER. |
Introduction | Incorporating these features into our hierarchical phrase-based translation system significantly improved translation performance, by up to 1 BLEU and 3 TER over a strong Chinese to English baseline.
Abstract | We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data. |
Conclusion | The results showed that our approach achieved a BLEU score gain of 1.61. |
Dependency-based Pre-ordering Rule Set | In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set. |
Experiments | Lng the performance ( BLEU ) on the test set, the total |
Experiments | For evaluation, we used BLEU scores (Papineni et al., 2002). |
Experiments | It shows the BLEU scores on the test set and the statistics of pre-ordering on the training set, which includes the total count of each rule set and the number of sentences they were ap- |
Introduction | Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61. |
Conclusion | This strategy leads to a better balanced distribution of the alternations in the training data, such that our linguistically informed generation ranking model achieves high BLEU scores and accurately predicts active and passive. |
Experimental Setup | Match 15.45 15.04 11.89 LM BLEU 0.68 0.68 0.65 |
Experimental Setup | Model BLEU 0.764 0.759 0.747 NIST 13.18 13.14 13.01 |
Experimental Setup | use several standard measures: a) exact match: how often does the model select the original corpus sentence, b) BLEU: n-gram overlap between top-ranked and original sentence, c) NIST: modification of BLEU giving more weight to less frequent n-grams. |
Experiments | The differences in BLEU between the candidate sets and models are |
Experiments | Its BLEU score and match accuracy decrease only slightly (though statistically significantly). |
Experiments | Features | Match BLEU | Voice Prec. |
Abstract | Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system. |
Conclusion | We treat word alignment as a parsing problem, and by taking advantage of English syntax and the hypergraph structure of our search algorithm, we report significant increases in both F-measure and BLEU score over standard baselines in use by most state-of-the-art MT systems today. |
Experiments | BLEU Words .696 45.1 2,538 .674 46.4 2,262 |
Experiments | Our hypergraph alignment algorithm allows us a 1.1 BLEU increase over the best baseline system, Model-4 grow-diag-final. |
Experiments | We also report a 2.4 BLEU increase over a system trained with alignments from Model-4 union. |
Related Work | Very recent work in word alignment has also started to report downstream effects on BLEU score. |
Related Work | (2009) confirm and extend these results, showing BLEU improvement for a hierarchical phrase-based MT system on a small Chinese corpus. |
Abstract | Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model. |
Experiment | Table 1: Performance on Japanese-to-English Translation Measured by BLEU (%) |
Experiment | Table 1 shows the performance for the test data measured by case sensitive BLEU (Papineni et al., 2002). |
Experiment | Under the Moses phrase-based SMT system (Koehn et al., 2007) with the default settings, we achieved a 26.80% BLEU score. |
Introduction | Further, our independent model achieves a more than 1 point gain in BLEU , which resolves the sparseness problem introduced by the bi-word observations. |
Abstract | We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data. |
Conclusion and Future Work | The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data. |
Evaluation | Table 1 shows the case-insensitive IBM-version BLEU and TER scores of different systems. |
Evaluation | Seen from row −lmT of Table 1, the removal of the skeletal language model results in a significant drop in both BLEU and TER performance.
Evaluation | Row s-space of Table 1 shows the BLEU and TER results of restricting the baseline system to the space of skeleton-consistent derivations, i.e., we remove both the skeleton-based translation model and language model from the SBMT system. |
Introduction | We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data.
Introduction | In addition, the translation adequacy across different genres (ranging from formal news to informal web forum and public speech) and different languages (English and Chinese) is improved by replacing BLEU or TER with MEANT during parameter tuning (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b). |
Related Work | Surface-form oriented metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) do not correctly reflect the meaning similarities of the input sentence.
Related Work | In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy. |
Related Work | TINE (Rios et al., 2011) is a recall-oriented metric which aims to preserve the basic event structure but it performs comparably to BLEU and worse than METEOR on correlation with human adequacy judgments. |
Experiments | The reported BLEU scores are averaged over 5 runs of MERT (Och, 2003).
Experiments | We illustrate the relationship among translation accuracy ( BLEU ), the number of retrieved documents (N) and the length of hidden layers (L) on different testing datasets. |
Experiments | Figure 3: End-to-end translation results ( BLEU %) |
Experiments | Case-insensitive BLEU is employed as the evaluation metric. |
Experiments | Specifically, the Significance algorithm can safely discard 64% of the phrase table at its threshold 12 with only 0.1 BLEU loss in the overall test. |
Experiments | In contrast, our BRAE-based algorithm can remove 72% of the phrase table at its threshold 0.7 with only 0.06 BLEU loss in the overall evaluation. |
Introduction | The experiments show that up to 72% of the phrase table can be discarded without a significant decrease in translation quality, and that decoding with phrasal semantic similarities achieves up to a 1.7 BLEU score improvement over the state-of-the-art baseline.
Related Work | (2013) also use bag-of-words but learn BLEU-sensitive phrase embeddings.
Conclusion | We observed that this often fails to return the best output in terms of BLEU score, fluency, grammaticality and/or meaning. |
Results and Discussion | Figure 6: BLEU scores and Grammar Size (Number of Elementary TAG trees)
Results and Discussion | The average BLEU score is given with respect to all input (All) and to those inputs for which the systems generate at least one sentence (Covered). |
Results and Discussion | In terms of BLEU score, the best version of our system (AUTEXP) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results similar to the fully handcrafted UDEL system.
Experimental Results | The feature functions are combined under a log-linear framework, and the weights are tuned by the minimum-error-rate training (Och, 2003) using BLEU (Papineni et al., 2002) as the optimization metric. |
Experimental Results | This precision is extremely high because the BLEU score (precision with brevity penalty) that one obtains for a Chinese sentence is normally between 30% and 50%.
Experimental Results | 4.5.2 BLEU on NIST MT Test Sets |
Introduction | We carry out experiments on a state-of-the-art SMT system, i.e., Moses (Koehn et al., 2007), and show that the abbreviation translations consistently improve the translation performance (in terms of BLEU (Papineni et al., 2002)) on various NIST MT test sets. |
Abstract | On top of the pruning framework, we also propose a discriminative ITG alignment model using hierarchical phrase pairs, which improves both F-score and Bleu score over the baseline alignment system of GIZA++. |
Evaluation | Finally, we also do end-to-end evaluation using both F-score in alignment and Bleu score in translation. |
Evaluation | HP-DITG using DPDI achieves the best Bleu score with acceptable time cost. |
Evaluation | It shows that HP-DITG (with DPDI) is better than the three baselines both in alignment F-score and Bleu score. |
Analysis and Discussion | Table 4: Results ( BLEU %) of Chinese-to-English large data (CE_LD) and small data (CE_SD) NIST task by applying one feature.
Analysis and Discussion | Table 5: Results ( BLEU %) for combination of two similarity scores. |
Analysis and Discussion | Table 6: Results ( BLEU %) of using simple features based on context on small data NIST task. |
Experiments | Our evaluation metric is IBM BLEU (Papineni et al., 2002), which performs case-insensitive matching of n-grams up to n = 4.
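As a concrete illustration of what such an n-gram matching metric computes, here is a minimal single-reference, sentence-level BLEU sketch in Python. It is not the IBM BLEU script (which is corpus-level and unsmoothed); the add-one smoothing and function name are assumptions made for readability.

```python
from collections import Counter
from math import exp, log

def sentence_bleu(candidate, reference, max_n=4):
    """Minimal single-reference sentence-level BLEU sketch.

    Case-insensitive n-gram matching up to max_n, clipped counts,
    add-one smoothing (so short sentences never score exactly zero),
    and the standard brevity penalty.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clipped match count: a candidate n-gram matches at most as often
        # as it occurs in the reference
        matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_prec += log((matched + 1) / (total + 1))
    # brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else exp(1.0 - len(ref) / max(len(cand), 1))
    return bp * exp(log_prec / max_n)
```

The geometric mean of the four n-gram precisions is what the `(Π prec_n)^{1/4}` form of BLEU denotes; the brevity penalty is the only recall-like component.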
Experiments | Table 2: Results ( BLEU %) of small data Chinese-to-English NIST task. |
Experiments | Table 3: Results ( BLEU %) of large data Chinese-to-English NIST task and German-to-English WMT task.
Introduction | In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e′, d′)), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e′, then the model score of (e*, d*) with this weight will also be greater than that of (e′, d′). In this paper, a pair (e*, e′) for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
Introduction | to that of Moses: on the NIST05 test set, L-Hiero achieves a BLEU score of 25.1 and Moses achieves 24.8.
Introduction | Since both MERT and PRO tuning toolkits involve randomness in their implementations, all BLEU scores reported in the experiments are the average of five tuning runs, as suggested by Clark et al. |
Experiments | 9Hence the BLEU scores we get for the baselines may appear lower than those reported in the literature.
Experiments | 10Using the factorised alignments directly in a translation system resulted in a slight loss in BLEU versus using the un-factorised alignments. |
Experiments | We use minimum error rate training (Och, 2003) with an n-best list size of 100 to optimize the feature weights for maximum development BLEU.
Experimental Evaluation | 6For most models, while likelihood continued to increase gradually for all 100 iterations, BLEU score gains plateaued after 5-10 iterations, likely due to the strong prior information |
Experimental Evaluation | It can also be seen that combining phrase tables from multiple samples improved the BLEU score for HLEN, but not for HIER. |
Flat ITG Model | The average gain across all data sets was approximately 0.8 BLEU points. |
Hierarchical ITG Model | (2003) that using phrases where max(|e|, |f|) ≤ 3 causes significant improvements in BLEU score, while using larger phrases results in diminishing returns.
Introduction | We also find that it achieves superior BLEU scores over previously proposed ITG-based phrase alignment approaches. |
Experimental Evaluation | For MCE learning, we selected the reference compression that maximizes the BLEU score (Papineni et al., 2002) (i.e., argmax_{r∈R} BLEU(r, R\r)) from the set of reference compressions and used it as correct data for training.
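The selection rule above (for each source, take the reference that scores highest against the remaining references) can be sketched as follows. The `unigram_overlap` scorer is a deliberately crude stand-in for BLEU, and both function names are hypothetical:

```python
from collections import Counter

def unigram_overlap(cand, refs):
    """Crude stand-in for multi-reference BLEU: fraction of candidate
    tokens covered by clipped counts over the reference pool."""
    cand_counts = Counter(cand.split())
    pool = Counter()
    for r in refs:
        pool |= Counter(r.split())  # per-token max over references
    matched = sum(min(c, pool[t]) for t, c in cand_counts.items())
    return matched / max(sum(cand_counts.values()), 1)

def pick_training_reference(references):
    """Return argmax over r in R of score(r, R \\ {r}): the reference
    most similar to the rest of the set."""
    best, best_score = None, -1.0
    for i, r in enumerate(references):
        others = references[:i] + references[i + 1:]
        score = unigram_overlap(r, others)
        if score > best_score:
            best, best_score = r, score
    return best
```

Swapping `unigram_overlap` for a real BLEU implementation recovers the argmax rule as stated.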
Experimental Evaluation | For automatic evaluation, we employed BLEU (Papineni et al., 2002) by following (Unno et al., 2006). |
Experimental Evaluation | Label / BLEU: Proposed .679; w/o PLM .617; w/o IPTW .635; Hori- .493
Results and Discussion | Our method achieved the highest BLEU score. |
Results and Discussion | For example, ‘w/o PLM + Dep’ achieved the second highest BLEU score. |
Results and Discussion | Compared to ‘Hori-’, ‘Hori’ achieved a significantly higher BLEU score.
Machine Translation as a Decipherment Task | Evaluation: All the MT systems are run on the Spanish test data and the quality of the resulting English translations is evaluated using two different measures: (1) normalized edit distance score (Navarro, 2001),6 and (2) BLEU (Papineni et al., 2002).
Machine Translation as a Decipherment Task | The figure also shows the corresponding BLEU scores in parentheses for comparison (higher scores indicate better MT output). |
Machine Translation as a Decipherment Task | Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment. |
Experiments | Is our topic similarity model able to improve translation quality in terms of BLEU ? |
Experiments | Case-insensitive NIST BLEU (Papineni et al., 2002) was used to measure translation quality.
Experiments | By using all the features (last line in the table), we improve the translation performance over the baseline system by 0.87 BLEU point on average. |
Introduction | Experiments on Chinese-English translation tasks (Section 6) show that our method outperforms the baseline hierarchical phrase-based system by +0.9 BLEU points.
Experiments | The BLEU scores for these outputs are 32.7, 27.8, and 20.8. |
Experiments | In particular, their translations had a lower BLEU score, making their task easier. |
Experiments | We see that our system prefers the reference much more often than the 5-gram language model.11 However, we also note that the easiness of the task is correlated with the quality of translations (as measured by BLEU score).
Abstract | Experiments on a Chinese to English translation task show that our proposed RZNN can outperform the state-of-the-art baseline by about 1.5 points in BLEU . |
Conclusion and Future Work | We conduct experiments on a Chinese-to-English translation task, and our method outperforms a state-of-the-art baseline by about 1.5 BLEU points.
Experiments and Results | When we remove it from RZNN, WEPPE based method drops about 10 BLEU points on development data and more than 6 BLEU points on test data. |
Experiments and Results | TCBPPE based method drops about 3 BLEU points on both development and test data sets. |
Introduction | We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we get about 1.5 BLEU points improvement, compared with a state-of-the-art baseline system. |
Abstract | We evaluate our model on a Chinese to English translation task and obtain up to 1.2 BLEU improvement over strong baselines. |
Experiments | We refer to the SMT model without domain adaptation as baseline.5 LDA marginally improves machine translation (less than half a BLEU point). |
Experiments | These improvements are not redundant: our new ptLDA-dict model, which has aspects of both models yields the best performance among these approaches—up to a 1.2 BLEU point gain (higher is better), and -2.6 TER improvement (lower is better). |
Experiments | The BLEU improvement is significant (Koehn, 2004) at p = 0.01,6 except on MT03 with variational and variational-hybrid inference. |
Experiment | Model / BLEU (%): Moses 25.68; TT2S 26.08; TTS2S 26.95; FT2S 27.66; FTS2S 28.83
Experiment | The 9% tree sequence rules contribute 1.17 BLEU score improvement (28.83-27.66 in Table 1) to FTS2S over FT2S. |
Experiment | BLEU (%) by n-best list size (FT2S / FTS2S): 100-best 27.40 / 28.61; 500-best 27.66 / 28.83; 2500-best 27.66 / 28.96; 5000-best 27.79 / 28.89
Experiments | System / Model / BLEU: Moses cBP 23.86; STSSG 25.92; SncTSSG 26.53
Experiments | ID / Rule Set / BLEU: 1 CR (STSSG) 25.92; 2 CR w/o ncPR 25.87; 3 CR w/o ncPR + tgtncR 26.14; 4 CR w/o ncPR + srcncR 26.50; 5 CR w/o ncPR + src&tgtncR 26.51; 6 CR + tgtncR 26.11; 7 CR + srcncR 26.56; 8 CR + src&tgtncR (SncTSSG) 26.53
Experiments | 2) Not only that, after comparing Exp 6, 7, 8 against Exp 3, 4, 5 respectively, we find that the ability of rules derived from noncontiguous tree sequence pairs generally covers that of the rules derived from the contiguous tree sequence pairs, given the slight change in BLEU score.
Abstract | Experiments on Chinese-English translation on four NIST MT test sets show that the HD-HPB model significantly outperforms Chiang's model with average gains of 1.91 points absolute in BLEU.
Experiments | For evaluation, the NIST BLEU script (version 12) with the default settings is used to calculate the BLEU scores. |
Experiments | Table 3 lists the translation performance with BLEU scores. |
Experiments | Table 3 shows that our HD-HPB model significantly outperforms Chiang’s HPB model with an average improvement of 1.91 in BLEU (and similar improvements over Moses HPB). |
Abstract | We compare this metric against a combination metric of four state-of-the-art scores ( BLEU , NIST, TER, and METEOR) in two different settings.
Experimental Evaluation | BLEUR includes the following 18 sentence-level scores: BLEU-n and n-gram precision scores (1 ≤ n ≤ 4); BLEU brevity penalty (BP); BLEU score divided by BP.
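The brevity penalty feature listed above has a simple closed form (Papineni et al., 2002); a one-function sketch, with `brevity_penalty` being an illustrative name:

```python
from math import exp

def brevity_penalty(cand_len, ref_len):
    """BLEU brevity penalty (Papineni et al., 2002): 1 when the candidate
    is at least as long as the reference, exp(1 - r/c) otherwise."""
    if cand_len >= ref_len:
        return 1.0
    return exp(1.0 - ref_len / cand_len)
```

Dividing a sentence's BLEU score by this penalty, as in the BP-normalized feature above, isolates the n-gram precision component from the length component.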
Introduction | Since human evaluation is costly and difficult to do reliably, a major focus of research has been on automatic measures of MT quality, pioneered by BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).
Introduction | BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations. |
Introduction | (2006) have identified a number of problems with BLEU and related n-gram-based scores: (1) BLEU-like metrics are unreliable at the level of individual sentences due to data sparsity; (2) BLEU metrics can be “gamed” by permuting word order; (3) for some corpora and languages, the correlation to human ratings is very low even at the system level; (4) scores are biased towards statistical MT; (5) the quality gap between MT and human translations is not reflected in equally large BLEU differences. |
Conclusion and Future Directions | 12Similar results were found for character- and word-based BLEU , but are omitted for lack of space.
Experiments | Minimum error rate training was performed to maximize word-based BLEU score for all systems.11 For language models, word-based translation uses a word 5-gram model, and character-based translation uses a character 12-gram model, both smoothed using interpolated Kneser-Ney.
Experiments | We evaluate translation quality using BLEU score (Papineni et al., 2002), both on the word and character level (with n = 4), as well as METEOR (Denkowski and Lavie, 2011) on the word level. |
Experiments | When compared with word-based translation, character-based translation achieves better, comparable, or inferior results on character-based BLEU, comparable or inferior results on METEOR, and inferior results on word-based BLEU . |
Experimental Results | The MT systems are optimized with pairwise ranking optimization (Hopkins and May, 2011) to maximize BLEU (Papineni et al., 2002). |
Experimental Results | The BLEU scores from different systems are shown in Table 10 and Table 11, respectively. |
Experimental Results | Preprocessing of the data with ECs inserted improves the BLEU scores by about 0.6 for newswire and 0.2 to 0.3 for the weblog data, compared to each baseline separately. |
Experiments | training data and not necessarily exactly follow the tendency of the final BLEU scores. |
Experiments | For example, CCG is worse than Malt in terms of P/R yet with a higher BLEU score. |
Experiments | Also, PAS+sem has a lower P/R than Berkeley, yet their final BLEU scores are not statistically different. |
Experiments | In our experiments all the models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric as percentages.
Experiments | Figure 3 shows the BLEU score curves with up to 1000 candidates used for re-ranking. |
Experiments | Figure 4 shows the BLEU scores of a two-system co-decoding as a function of re-decoding iterations. |
Evaluation | We report on BLEU , NIST, METEOR, and word error rate metrics WER and PER. |
Experiments & Results | The BLEU scores, not included in the figure but shown in Table 2, show a similar trend. |
Experiments & Results | Statistical significance on the BLEU scores was tested using pairwise bootstrap sampling (Koehn, 2004). |
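Pairwise bootstrap sampling of this kind can be sketched as below. Note one simplification: this version sums per-sentence scores, whereas a faithful implementation of Koehn (2004) would recompute corpus-level BLEU from resampled n-gram statistics on each draw.

```python
import random

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Paired bootstrap resampling in the spirit of Koehn (2004).

    Resamples sentence indices with replacement and returns the fraction
    of resamples on which system A's total score beats system B's."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples  # close to 1.0: A significantly better than B
```

A result of, say, 0.99 over 1000 resamples corresponds to significance at roughly p < 0.01 in this paired setup.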
Experiments & Results | Another discrepancy is found in the BLEU scores of the English→Chinese experiments, where we measure an unexpected drop in BLEU score under the baseline.
Alternatives to Correlation-based Meta-evaluation | We have studied 100 sentence evaluation cases from representatives of each metric family, including: 1-PER, BLEU , DP-Or-*, GTM (e = 2), METEOR and ROUGE-L. The evaluation cases have been extracted from the four test beds.
Metrics and Test Beds | At the lexical level, we have included several standard metrics, based on different similarity assumptions: edit distance (WER, PER and TER), lexical precision ( BLEU and NIST), lexical recall (ROUGE), and F-measure (GTM and METEOR). |
Previous Work on Machine Translation Meta-Evaluation | (2001) introduced the BLEU metric and evaluated its reliability in terms of Pearson correlation with human assessments for adequacy and fluency judgements. |
Previous Work on Machine Translation Meta-Evaluation | With the aim of overcoming some of the deficiencies of BLEU , Doddington (2002) introduced the NIST metric. |
Previous Work on Machine Translation Meta-Evaluation | Lin and Och (2004) experimented, unlike previous works, with a wide set of metrics, including NIST, WER (Nießen et al., 2000), PER (Tillmann et al., 1997), and variants of ROUGE, BLEU and GTM.
Abstract | On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU , TER, and NIST. |
Corpus Data and Baseline SMT | Our phrase-based decoder is similar to Moses (Koehn et al., 2007) and uses the phrase pairs and target LM to perform beam search stack decoding based on a standard log-linear model, the parameters of which were tuned with MERT (Och, 2003) on a held-out development set (3,534 sentence pairs, 45K words) using BLEU as the tuning metric. |
Experimental Setup and Results | Table 1 summarizes test set performance in BLEU (Papineni et al., 2001), NIST (Doddington, 2002) and TER (Snover et al., 2006).
Experimental Setup and Results | In the ASR setting, which simulates a real-world deployment scenario, this system achieves improvements of 0.39 ( BLEU ), -0.6 (TER) and 0.08 (NIST). |
Introduction | With this approach, we demonstrate significant improvements over a baseline phrase-based SMT system as measured by BLEU , TER and NIST scores on an English-to-Iraqi CSLT task. |
Experiments | Rule Type / BLEU (%): TR (STSG) 24.71; TR+TSR_L 25.72; TR+TSR_L+TSR_P 25.93; TR+TSR 26.07
Experiments | Rule Type / BLEU (%): TR+TSR 26.07; (TR+TSR) w/o SRR 24.62; (TR+TSR) w/o DPR 25.78
Results | Since MT systems are tuned for word-based overlap measures (such as BLEU ), verb deletion is penalized equally as, for example, determiner deletion. |
5W System | model score and word penalty for a combination of BLEU and TER (2*(1-BLEU) + TER).
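The combined tuning objective quoted above is trivial to compute; a sketch, treating both metrics as fractions in [0, 1] (an assumption, since they are sometimes reported as percentages):

```python
def combined_error(bleu, ter):
    """Combined tuning objective 2*(1-BLEU) + TER, as quoted above.
    Lower is better: minimizing it rewards high BLEU and low TER at once,
    with the BLEU term weighted twice as heavily as the TER term."""
    return 2.0 * (1.0 - bleu) + ter
```

Because BLEU is a quality score and TER an error rate, the 1-BLEU transformation puts both on an error scale before combining them.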
5W System | Bleu scores on the government-supplied test set in December 2008 were 35.2 for formal text, 29.2 for informal text, 33.2 for formal speech, and 27.6 for informal speech.
The Chinese-English 5W Task | Unlike word- or phrase-overlap measures such as BLEU , the 5W evaluation takes into account “concept” or “nugget” translation.
Discussion and Future Work | When we visually inspect and compare the outputs of our system with those of the baseline, we observe that improved BLEU score often corresponds to visible improvements in the subjective translation quality. |
Discussion and Future Work | Perhaps surprisingly, translation performance, 30.90 BLEU , was around the level we obtained when using frequency to approximate function words at N = 64. |
Experimental Results | These results confirm that the pairwise dominance model can significantly increase performance as measured by the BLEU score, with a consistent pattern of results across the MT06 and MT08 test sets. |
Experimental Setup | In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004).
Abstract | Evaluated in French by 10-fold-cross validation, the system achieves a 9.3% Word Error Rate and a 0.83 BLEU score. |
Conclusion and perspectives | Evaluated by tenfold cross-validation, the system seems efficient, and its performance in terms of BLEU score and WER is quite encouraging.
Evaluation | The system was evaluated in terms of BLEU score (Papineni et al., 2001), Word Error Rate (WER) and Sentence Error Rate (SER). |
Evaluation | The copy-paste results just inform about the real deviation of our corpus from the traditional spelling conventions, and highlight the fact that our system is still at pains to significantly reduce the SER, while results in terms of WER and BLEU score are quite encouraging.
Experiments | We adopted three state-of-the-art metrics, BLEU (Papineni et al., 2002), NIST (Doddington et al., 2000) and METEOR (Banerjee and Lavie, 2005), to evaluate the translation quality. |
Experiments | Overall, the boldface numbers in the last row illustrate that our model obtains average improvements of 1.89, 1.76 and 1.61 on BLEU, NIST and METEOR, respectively.
Experiments | Models / BLEU / NIST / METEOR: CS 29.38 / 59.85 / 54.07; SMS 30.05 / 61.33 / 55.95; UBS 30.15 / 61.56 / 55.39; Stanford 30.40 / 61.94 / 56.01
Abstract | We obtain statistically significant improvements across 4 different language pairs with English as source, reaching up to +1.92 BLEU for Chinese as target.
Experiments | Our system (its) outperforms the baseline for all 4 language pairs for both BLEU and NIST scores, by a margin which scales up to +1.92 BLEU points for English to Chinese translation when training on the 400K set. |
Experiments | BLEU scores for 200K and 400K training sentence pairs. |
Experiments | Notably, as can be seen in Table 2(b), switching to a 4-gram LM results in performance gains for both the baseline and our system and while the margin between the two systems decreases, our system continues to deliver a considerable and significant improvement in translation BLEU scores. |
Conclusion | Our results showed improvement over the baselines both in intrinsic evaluations and on BLEU . |
Experiments & Results 4.1 Experimental Setup | BLEU (Papineni et al., 2002) is still the de facto evaluation metric for machine translation and we use that to measure the quality of our proposed approaches for MT. |
Experiments & Results 4.1 Experimental Setup | Table 6 reports the Bleu scores for different domains when the oov translations from the graph propagation are added to the phrase-table, and compares them with the baseline system.
Introduction | In general, copied-over oovs are a hindrance to fluent, high quality translation, and we can see evidence of this in automatic measures such as BLEU (Papineni et al., 2002) and also in human evaluation scores such as HTER. |
Abstract | Experimental evaluation on the ATIS domain shows that our model outperforms a competitive discriminative system both using BLEU and in a judgment elicitation study. |
Results | As can be seen, inclusion of lexical features gives our decoder an absolute increase of 6.73% in BLEU over the 1-BEST system.
Results | System / BLEU / METEOR: 1-BEST+BASE+ALIGN 21.93 / 34.01; k-BEST+BASE+ALIGN+LEX 28.66 / 45.18; k-BEST+BASE+ALIGN+LEX+STR 30.62 / 46.07; ANGELI 26.77 / 42.41
Results | over the 1-BEST system and 3.85% over ANGELI in terms of BLEU.
Experiment | Specifically, after integrating the inside context information of PAS into transformation, we can see that system IC-PASTR significantly outperforms system PASTR by 0.71 BLEU points. |
Experiment | Moreover, after we import the MEPD model into system PASTR, we get a significant improvement over PASTR (by 0.54 BLEU points). |
Experiment | We can see that this system further achieves a remarkable improvement over system PASTR (0.95 BLEU points). |
Experiments | Corpus / BLEU (%) / RCW (%)
Experiments | Table 4: Case-insensitive BLEU score and ratio of correct words (RCW) on the training, development and test corpus. |
Experiments | Table 4 shows the case-insensitive BLEU score and the percentage of words that are labeled as correct according to the method described above on the training, development and test corpus. |
SMT System | The performance, in terms of BLEU (Papineni et al., 2002) score, is shown in Table 4. |
Experiments | System / BLEU: Baseline 12.60; our system 13.06
Experiments | We measured the overall translation quality with the help of 4-gram BLEU (Papineni et al., 2002), which was computed on tokenized and lower-cased data for both systems. |
Experiments | We obtain a BLEU score of 13.06, which is a gain of 0.46 BLEU points over the baseline. |
Introduction | The translation quality is automatically measured using BLEU scores, and we confirm the findings by providing linguistic evidence (see Section 5). |
Abstract | Experiments on large scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic to English and +1.4 BLEU on Chinese to English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation. |
Experiments | The 3-feature version of VSM yields +1.8 BLEU over the baseline for Arabic to English, and +1.4 BLEU for Chinese to English. |
Experiments | For instance, with an initial Chinese system that employs linear mixture LM adaptation (lin-lm) and has a BLEU of 32.1, adding l-feature VSM adaptation (+vsm, joint) improves performance to 33.1 (improvement significant at p < 0.01), while adding 3-feature VSM instead (+vsm, 3 feat.) |
Experiments | To get an intuition for how VSM adaptation improves BLEU scores, we compared outputs from the baseline and VSM-adapted system (“vsm, joint” in Table 5) on the Chinese test data. |
Evaluation methodology | In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003). |
Results | 081 usually has the highest overall score (except BLEU ); it also has the highest scores for ‘regulations’ (more formal texts), while P1 scores are better for the news documents.
Results | Metric / sentence-level Median / Mean / Trimmed mean / corpus level: BLEU 0.357 / 0.298 / 0.348 / 0.833; NIST 0.357 / 0.291 / 0.347 / 0.810; Meteor 0.429 / 0.348 / 0.393 / 0.714; TER 0.214 / 0.186 / 0.204 / 0.619; GTM 0.429 / 0.340 / 0.392 / 0.714
Experiments | Table 3: BLEU scores for different datasets in different translation directions (left to right), broken down by training corpus (top to bottom).
Experiments | The BLEU scores for the different parallel corpora are shown in Table 3 and the top 10 out-of-vocabulary (OOV) words for each dataset are shown in Table 4. |
Experiments | However, by combining the Weibo parallel data with this standard data, improvements in BLEU are obtained. |
Code was provided by Deng et al. (2012). | To compute evaluation measures, we take the average scores of BLEU(1) and F-score (unigram-based with respect to content words) over k = 5 candidate captions.
Code was provided by Deng et al. (2012). | Therefore, we also report scores based on semantic matching, which gives partial credit to word pairs based on their lexical similarity.5 The best performing approach with semantic matching is VISUAL (with LM = Image corpus), improving BLEU, Precision, and F-score substantially over those of ORIG, demonstrating the extrinsic utility of our newly generated image-text parallel corpus in comparison to the original database.
Related Work | When computing BLEU with semantic matching, we look for the match with the highest similarity score among words that have not been matched before. |
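That greedy highest-similarity-first matching can be sketched as below; the `sim` word-similarity function is assumed to be supplied externally (e.g. from a lexical resource), and names and threshold are illustrative rather than the paper's implementation:

```python
def greedy_semantic_matches(cand_words, ref_words, sim, threshold=0.0):
    """Greedy one-to-one matching: repeatedly take the highest-similarity
    pair among still-unmatched words, so each word is matched at most once."""
    scored = [(sim(c, r), i, j)
              for i, c in enumerate(cand_words)
              for j, r in enumerate(ref_words)]
    scored.sort(reverse=True)  # best pairs first
    used_c, used_r, matches = set(), set(), []
    for s, i, j in scored:
        if s <= threshold:
            break  # remaining pairs are no better
        if i not in used_c and j not in used_r:
            used_c.add(i)
            used_r.add(j)
            matches.append((cand_words[i], ref_words[j], s))
    return matches
```

With exact string equality as `sim`, this degenerates to ordinary clipped unigram matching; a graded similarity gives the partial-credit behaviour described above.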
Experiments | In addition to precision and recall, we also evaluate the Bleu score (Papineni et al., 2002) changes before and after applying our measure word generation method to the SMT output. |
Experiments | For our test data, we only consider sentences containing measure words for Bleu score evaluation. |
Experiments | Our measure word generation step leads to a Bleu score improvement of 0.32 where the window size is set to 10, which shows that it can improve the translation quality of an English-to-Chinese SMT system. |
Experiments | Given an unlimited amount of time, we would tune the prior to maximize end-to-end performance, using an objective function such as BLEU . |
Experiments | We do compare VB against EM in terms of final BLEU scores in the translation experiments to ensure that this sparse prior has a sig- |
Experiments | Minimum Error Rate training (Och, 2003) over BLEU was used to optimize the weights for each of these models over the development test data. |
Abstract | In our experiments, our model improved 2.9 BLEU points for Japanese-English and 2.6 BLEU points for Chinese-English translation compared to the lexical reordering models.
Experiment | To stabilize the MERT results, we tuned three times by MERT using the first half of the development data and we selected the SMT weighting parameter set that performed the best on the second half of the development data based on the BLEU scores from the three SMT weighting parameter sets. |
Experiment | To investigate the tolerance for sparsity of the training data, we reduced the training data for the sequence model to 20,000 sentences for JE translation.14 SEQUENCE using this model with a distortion limit of 30 achieved a BLEU score of 32.22.15 Although the score is lower than the score of SEQUENCE with a distortion limit of 30 in Table 3, the score was still higher than those of LINEAR, LINEAR+LEX, and 9-CLASS for JE in Table 3. |
Abstract | For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. |
Discussion of Translation Results | The best result—a +1.04 BLEU average gain—was achieved when the class-based model training data, MT tuning set, and MT evaluation set contained the same genre. |
Introduction | For English-to-Arabic translation, we achieve a +1.04 BLEU average improvement by tiling our model on top of a large LM. |
Experiments | Unfortunately, variance in development set BLEU scores tends to be higher than in test set scores, despite SAMT MERT's inbuilt algorithms for overcoming local optima, such as random restarts and zeroing-out.
Experiments | We have noticed that using an L0-penalized BLEU score5 as MERT’s objective on the merged n-best lists over all iterations is more stable and will therefore use this score to determine N. |
Experiments | 5Given by: BLEU − 5 × |{i ∈ {1, …
Abstract | We present empirical results on a constrained Urdu-English translation task that demonstrate a significant BLEU score improvement and a large decrease in perplexity.
Related Work | Figure 9 shows a statistically significant improvement to the BLEU score when using the HHMM and the n-gram LMs together on this reduced test set. |
Abstract | On two Chinese-English tasks, our semi-supervised DAE features obtain statistically significant improvements of 1.34/2.45 (IWSLT) and 0.82/1.52 (NIST) BLEU points over the unsupervised DBN features and the baseline features, respectively.
Conclusions | The results also demonstrate that DNN (DAE and HCDAE) features are complementary to the original features for SMT, and adding them together obtain statistically significant improvements of 3.16 (IWSLT) and 2.06 (NIST) BLEU points over the baseline features. |
Experiments and Results | Adding new DNN features as extra features significantly improves translation accuracy (row 2-17 vs. 1), with the highest increase of 2.45 (IWSLT) and 1.52 (NIST) (row 14 vs. 1) BLEU points over the baseline features. |
Conclusion and Future Work | In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
Experiments | The 10-fold cross-validated BLEU score (Papineni et al., 2002) over this data is 0.81. |
Experiments | Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation. |