Experiments | BLEU, sentence-level geometric mean of 1- to 4-gram precision, as in (Belz et al., 2011) |
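The definition in the record above (sentence-level geometric mean of 1- to 4-gram precisions) can be sketched as follows. This is a minimal illustrative implementation, not code from any cited system; the function names, the standard brevity penalty, and the absence of smoothing are assumptions of this sketch:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped 1..max_n-gram
    precisions, scaled by the brevity penalty. No smoothing is applied,
    so any n-gram order with zero matches yields a score of 0.0."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clipped counts: a candidate n-gram is credited at most as
        # many times as it appears in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean
```

Because the geometric mean zeroes out whenever a higher-order precision is zero, practical sentence-level variants usually add smoothing; the papers excerpted here differ in exactly how they handle this.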
Experiments | BLEUT, sentence-level BLEU computed on post-processed output where the predicted referring expressions for victim and perp are replaced in the sentences (both gold and predicted) by their original role label; this score does not penalize lexical mismatches between corpus and system REs |
Experiments | When REG and linearization are applied on shallowSyn_re with gold shallow trees, the BLEU score is lower (60.57) than for the system that applies syntax and linearization on deepSyn+re, deep trees with gold REs (BLEU score of 63.9).
Abstract | Furthermore, the integrated Model-III achieves an overall improvement of 3.48 BLEU points and a reduction of 2.62 TER points in comparison with the pure SMT system.
Conclusion and Future Work | The experiments show that the proposed Model-III outperforms both the TM and the SMT systems significantly (p < 0.05) in either BLEU or TER when fuzzy match score is above 0.4. |
Conclusion and Future Work | Compared with the pure SMT system, Model-III achieves an overall improvement of 3.48 BLEU points and a reduction of 2.62 TER points on a Chinese-English TM database.
Experiments | In the tables, the best translation results (either in BLEU or TER) at each interval have been marked in bold. |
Experiments | Compared with TM and SMT, Model-I is significantly better than the SMT system in either BLEU or TER when the fuzzy match score is above 0.7; Model-II significantly outperforms both the TM and the SMT systems in either BLEU or TER when the fuzzy match score is above 0.5; Model-III significantly exceeds both the TM and the SMT systems in either BLEU or TER when the fuzzy match score is above 0.4. |
Experiments | The advantage over SMT is 8.03 BLEU points at interval [0.9, 1.0), while it is only 2.97 BLEU points at interval [0.6, 0.7).
Introduction | Compared with the pure SMT system, the proposed integrated Model-III achieves 3.48 BLEU points improvement and 2.62 TER points reduction overall. |
Introduction | Automatic evaluation (using ROUGE (Lin and Hovy, 2003) and BLEU (Papineni et al., 2002)) against manually generated focused summaries shows that our summarizers uniformly and statistically significantly outperform two baseline systems as well as a state-of-the-art supervised extraction-based system.
Results | To evaluate the full abstract generation system, the BLEU score (Papineni et al., 2002) (the precision of unigrams and bigrams with a brevity penalty) is computed with human abstracts as reference.
Results | BLEU has a fairly good agreement with human judgement and has been used to evaluate a variety of language generation systems (Angeli et al., 2010; Konstas and Lapata, 2012). |
Abstract | We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set. |
Additional Experiments | As can be seen in Table 4, in the smaller feature set, RM and MERT were the best performers, with the exception that on MT08, MIRA yielded somewhat better (+0.7) BLEU but a somewhat worse (-0.9) TER score than RM. |
Additional Experiments | On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain. |
Additional Experiments | Interestingly, RM achieved substantially higher BLEU precision scores in all tests for both language pairs. |
Experiments | We used cdec (Dyer et al., 2010) as our hierarchical phrase-based decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 corpus. |
Experiments | The bound constraint B was set to 1.4. The approximate sentence-level BLEU cost Δ is computed in a manner similar to (Chiang et al., 2009), namely, in the context of previous 1-best translations of the tuning set.
Experiments | We explored alternative values for B, as well as scaling it by the current candidate’s cost, and found that the optimizer is fairly insensitive to these changes, resulting in only minor differences in BLEU . |
Abstract | Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU.
Experiments | We compared the performance of Moses using the alignment produced by our model and the baseline alignment, evaluating translation quality using BLEU (Papineni et al., 2002) with case-insensitive n-gram matching with n = 4. |
Experiments | We used minimum error rate training (Och, 2003) to tune the feature weights to maximise the BLEU score on the development set. |
Experiments | 5 The effect on translation scores is modest, roughly amounting to +0.2 BLEU versus using a single sample. |
Introduction | The model produces uniformly better translations than those of a competitive phrase-based baseline, amounting to an improvement of up to 3.4 BLEU points absolute. |
Abstract | The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points.
Abstract | Further, adapting large MSA-English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
Introduction | - We built a phrasal Machine Translation (MT) system on adapted Egyptian-English parallel data, which outperformed a non-adapted baseline by 1.87 BLEU points.
Previous Work | Train LM BLEU OOV
Previous Work | The system trained on AR (B1) performed poorly compared to the one trained on EG (B2), with a 6.75 BLEU point difference.
Proposed Methods 3.1 Egyptian to EG’ Conversion | S1, which used only EG’ for training, showed an improvement of 1.67 BLEU points over the best baseline system (B4).
Proposed Methods 3.1 Egyptian to EG’ Conversion | Phrase merging that preferred phrases learnt from EG’ data over AR data performed the best with a BLEU score of 16.96. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | The Egyptian sentence “wbyHtrmwA AlnAs AltAnyp” produced “lyfizfij (OOV) the second people” (BLEU = 0.31).
Abstract | We show empirical results on the OPUS data: our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders of magnitude faster).
Experiments and Results | To evaluate translation quality, we use BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation. |
Experiments and Results | We show that our method achieves the best performance ( BLEU scores) on this task while being significantly faster than both the previous approaches. |
Experiments and Results | We also report the first BLEU results on such a large-scale MT task under truly nonparallel settings (without using any parallel data or seed lexicon). |
Experiment Results | We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences). |
Experiment Results | On average the improvement is 1.07 BLEU points (45.66
Experiment Results | Table 4: Arabic-English true-case translation scores in the BLEU metric.
Phrasal-Hiero Model | Compare BLEU scores of translation using all extracted rules (the first row) and translation using only rules without nonaligned subphrases (the second row). |
Baseline MT | The scaling factors for all features are optimized by minimum error rate training algorithm to maximize BLEU score (Och, 2003). |
Experiments | We can see that except for the BOLT3 data set with BLEU metric, our NAMT approach consistently outperformed the baseline system for all data sets with all metrics, and provided up to 23.6% relative error reduction on name translation. |
Experiments | According to Wilcoxon Matched-Pairs Signed-Ranks Test, the improvement is not significant with BLEU metric, but is significant at 98% confidence level with all of the other metrics. |
Introduction | - The current dominant automatic MT scoring metrics (such as Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002)) treat all words equally, but names have relatively low frequency in text (about 6% in newswire and only 3% in web documents) and thus are vastly outnumbered by function words, common nouns, etc.
Name-aware MT Evaluation | Traditional MT evaluation metrics such as BLEU (Papineni et al., 2002) and Translation Edit Rate (TER) (Snover et al., 2006) assign the same weight to all tokens.
Name-aware MT Evaluation | In order to properly evaluate the translation quality of NAMT methods, we propose to modify the BLEU metric so that it can dynamically assign more weights to names during evaluation.
Name-aware MT Evaluation | BLEU considers the correspondence between a system translation and a human translation: |
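The formula itself is elided in this excerpt, and the paper's exact name-weighting scheme is not reproduced here. As a purely illustrative sketch of the idea in the two records above, one way to give name tokens extra weight is to scale clipped n-gram counts; the weight value and name set below are hypothetical:

```python
from collections import Counter

def weighted_ngram_precision(candidate, reference, names, n=1, name_weight=2.0):
    """Clipped n-gram precision where any n-gram containing a token from
    `names` counts with weight `name_weight` instead of 1.0. This is an
    illustrative re-weighting, not the cited paper's exact formulation."""
    def ngrams(toks):
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

    def w(gram):
        # Hypothetical weighting rule: boost n-grams touching a name.
        return name_weight if any(t in names for t in gram) else 1.0

    cand, ref = Counter(ngrams(candidate)), Counter(ngrams(reference))
    matched = sum(min(c, ref[g]) * w(g) for g, c in cand.items())
    total = sum(c * w(g) for g, c in cand.items())
    return matched / total if total else 0.0
```

With name_weight = 1.0 this reduces to ordinary clipped n-gram precision, so the re-weighting only shifts how much a correct or incorrect name affects the score.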
Introduction | Experiments show that our approach significantly outperforms both phrase-based (Koehn et al., 2007) and string-to-dependency approaches (Shen et al., 2008) in terms of BLEU and TER.
Introduction | | features | BLEU | TER | |
Introduction | Adding dependency language model (“depLM”) and the maximum entropy shift-reduce parsing model (“maxent”) significantly improves BLEU and TER on the development set, both separately and jointly. |
Abstract | On the NIST MT08 set, our most advanced model brings around +2.0 BLEU and -1.0 TER improvement.
Experiments | MT08 nw (BLEU | TER), MT08 wb (BLEU | TER)
Experiments | The best TER and BLEU results on each genre are in bold. |
Experiments | For BLEU , higher scores are better, while for TER, lower scores are better. |
Experiments | Each utterance in the test data has more than one response eliciting the same goal emotion, because the responses are used to compute the BLEU score (see section 5.3).
Experiments | We first use BLEU score (Papineni et al., 2002) to perform automatic evaluation (Ritter et al., 2011). |
Experiments | In this evaluation, the system is provided with the utterance and the goal emotion in the test data and the generated responses are evaluated through BLEU score. |
Abstract | The performance measured by BLEU is at least comparable to that of the traditional batch training method.
Conclusion and Future Work | The method assumes that a combined model is derived from a hierarchical Pitman-Yor process with each prior learned separately in each domain, and achieves BLEU scores competitive with traditional batch-based ones. |
Experiment | The BLEU scores reported in this paper are the average of 5 independent runs of batch-MIRA weight training, as suggested by Clark et al. (2011).
Experiment | In the IWSLT2012 data set, there is a large gap between the HIT corpus and the BTEC corpus, and our method gains a 0.814 BLEU improvement.
Experiment | Since the FBIS data set is artificially divided, with no clear human-assigned differences among sub-domains, our method loses 0.09 BLEU.
Abstract | Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation. |
Translation Model Architecture | We found that this had no significant effects on BLEU . |
Translation Model Architecture | We report translation quality using BLEU (Papineni et al., 2002).
Translation Model Architecture | For the IT test set, the system with gold labels and TM adaptation yields an improvement of 0.7 BLEU (21.1 → 21.8), LM adaptation yields 1.3 BLEU (21.1 → 22.4), and adapting both models outperforms the baseline by 2.1 BLEU (21.1 → 23.2).
Abstract | Results on five Chinese-English NIST tasks show that our model improves the baseline system by 1.32 BLEU and 1.53 TER on average. |
Conclusion | Experimental results show that our model is stable and improves the baseline system by 0.98 BLEU and 1.21 TER (trained by CRFs) and 1.32 BLEU and 1.53 TER (trained by RNN). |
Experiments | - BLEU (Papineni et al., 2001) and TER (Snover et al., 2005): all reported scores are calculated in a case-insensitive (lowercase) way.
Experiments | An Index column is added for score reference convenience (B for BLEU ; T for TER). |
Experiments | For the proposed model, significance testing results on both BLEU and TER are reported (B2 and B3 compared to B1, T2 and T3 compared to T1). |
Abstract | Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU . |
Abstract | On general domain and speech translation tasks where test conditions substantially differ from standard government and news training text, web-mined training data improves performance substantially, resulting in improvements of up to 1.5 BLEU on standard test sets, and 5 BLEU on test sets outside of the news domain. |
Abstract | For all language pairs and both test sets (WMT 2011 and WMT 2012), we show an improvement of around 0.5 BLEU . |
Abstract | The data generated allows us to train a reordering model that gives an improvement of 1.8 BLEU points on the NIST MT-08 Urdu-English evaluation set over a reordering model that only uses manual word alignments, and a gain of 5.2 BLEU points over a standard phrase-based baseline.
Conclusion | Cumulatively, we see a gain of 1.8 BLEU points over a baseline reordering model that only uses manual word alignments, a gain of 2.0 BLEU points over a hierarchical phrase-based system, and a gain of 5.2 BLEU points over a standard phrase-based baseline.
Experimental setup | All experiments were done on Urdu-English and we evaluate reordering in two ways: Firstly, we evaluate reordering performance directly by comparing the reordered source sentence in Urdu with a reference reordering obtained from the manual word alignments using BLEU (Papineni et al., 2002) (we call this measure monolingual BLEU or mBLEU). |
Experimental setup | Additionally, we evaluate the effect of reordering on our final systems for machine translation measured using BLEU . |
Introduction | This results in a 1.8 BLEU point gain in machine translation performance on an Urdu-English machine translation task over a preordering model trained using only manual word alignments. |
Introduction | In all, this increases the gain in performance by using the preordering model to 5.2 BLEU points over a standard phrase-based system with no preordering. |
Results and Discussions | We see a significant gain of 1.8 BLEU points in machine translation by going beyond manual word alignments using the best reordering model reported in Table 3. |
Results and Discussions | We also note a gain of 2.0 BLEU points over a hierarchical phrase-based system.
Experiments and evaluation | We present three types of evaluation: BLEU scores (Papineni et al., 2001), prediction accuracy on clean data and a manual evaluation of the best system in section 5.3. |
Experiments and evaluation | Table 5 gives results in case-insensitive BLEU . |
Experiments and evaluation | While the inflection prediction systems (1-4) are significantly better than the surface-form system (0), the different versions of the inflection systems are not distinguishable in terms of BLEU; however, our manual evaluation shows that the new features have a positive impact on translation quality.
Experiments | We use BLEU (Papineni et al., 2002) score with shortest length penalty as the evaluation metric and apply the pairwise re-sampling approach (Koehn, 2004) to perform the significance test. |
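The pairwise re-sampling significance test mentioned above (in the spirit of Koehn, 2004) can be sketched as follows. This is an approximation for illustration: the real test re-computes corpus BLEU from the resampled sentences' n-gram counts, whereas this sketch sums pre-computed per-sentence scores:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Paired bootstrap resampling: repeatedly resample test sentences
    with replacement and count how often system A beats system B on the
    resampled set. `scores_a` and `scores_b` are per-sentence quality
    scores for the same test sentences, in the same order."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Draw a bootstrap sample of sentence indices (with replacement);
        # the same indices are applied to both systems, hence "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples  # fraction of samples where A outperforms B
```

A win fraction of at least 0.95 is then read as A being better than B at the 95% confidence level.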
Experiments | We can see from the table that the domain lexicon is very helpful and significantly outperforms the baseline by more than 4.0 BLEU points.
Experiments | When it is enhanced with the in-domain language model, it can further improve the translation performance by more than 2.5 BLEU points. |
Introduction | In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e′, d′)), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e′, then the model score of (e*, d*) with this weight will also be greater than that of (e′, d′). In this paper, a pair (e*, e′) for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
Introduction | to that of Moses: on the NIST05 test set, L-Hiero achieves a BLEU score of 25.1 and Moses achieves 24.8.
Introduction | Since both MERT and PRO tuning toolkits involve randomness in their implementations, all BLEU scores reported in the experiments are the average of five tuning runs, as suggested by Clark et al. |
Abstract | Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model. |
Experiment | Table 1: Performance on Japanese-to-English Translation Measured by BLEU (%) |
Experiment | Table 1 shows the performance for the test data measured by case sensitive BLEU (Papineni et al., 2002). |
Experiment | Under the Moses phrase-based SMT system (Koehn et al., 2007) with the default settings, we achieved a 26.80% BLEU score. |
Introduction | Further, our independent model achieves a more than 1 point gain in BLEU , which resolves the sparseness problem introduced by the bi-word observations. |
Experiments | Hence the BLEU scores we get for the baselines may appear lower than those reported in the literature.
Experiments | Using the factorised alignments directly in a translation system resulted in a slight loss in BLEU versus using the un-factorised alignments.
Experiments | We use minimum error rate training (Och, 2003) with nbest list size 100 to optimize the feature weights for maximum development BLEU . |
Abstract | On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU , TER, and NIST. |
Corpus Data and Baseline SMT | Our phrase-based decoder is similar to Moses (Koehn et al., 2007) and uses the phrase pairs and target LM to perform beam search stack decoding based on a standard log-linear model, the parameters of which were tuned with MERT (Och, 2003) on a held-out development set (3,534 sentence pairs, 45K words) using BLEU as the tuning metric. |
Experimental Setup and Results | Table 1 summarizes test set performance in BLEU (Papineni et al., 2001), NIST (Doddington, 2002) and TER (Snover et al., 2006).
Experimental Setup and Results | In the ASR setting, which simulates a real-world deployment scenario, this system achieves improvements of 0.39 ( BLEU ), -0.6 (TER) and 0.08 (NIST). |
Introduction | With this approach, we demonstrate significant improvements over a baseline phrase-based SMT system as measured by BLEU , TER and NIST scores on an English-to-Iraqi CSLT task. |
Experimental Results | The MT systems are optimized with pairwise ranking optimization (Hopkins and May, 2011) to maximize BLEU (Papineni et al., 2002). |
Experimental Results | The BLEU scores from different systems are shown in Table 10 and Table 11, respectively. |
Experimental Results | Preprocessing of the data with ECs inserted improves the BLEU scores by about 0.6 for newswire and 0.2 to 0.3 for the weblog data, compared to each baseline separately. |
Conclusion | Our results showed improvement over the baselines both in intrinsic evaluations and on BLEU . |
Experiments & Results 4.1 Experimental Setup | BLEU (Papineni et al., 2002) is still the de facto evaluation metric for machine translation and we use that to measure the quality of our proposed approaches for MT. |
Experiments & Results 4.1 Experimental Setup | Table 6 reports the BLEU scores for different domains when the OOV translations from the graph propagation are added to the phrase-table and compares them with the baseline system (i.e.
Introduction | In general, copied-over OOVs are a hindrance to fluent, high-quality translation, and we can see evidence of this in automatic measures such as BLEU (Papineni et al., 2002) and also in human evaluation scores such as HTER.
Experiment | Specifically, after integrating the inside context information of PAS into transformation, we can see that system IC-PASTR significantly outperforms system PASTR by 0.71 BLEU points. |
Experiment | Moreover, after we import the MEPD model into system PASTR, we get a significant improvement over PASTR (by 0.54 BLEU points). |
Experiment | We can see that this system further achieves a remarkable improvement over system PASTR (0.95 BLEU points). |
Abstract | Experiments on large scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic to English and +1.4 BLEU on Chinese to English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation. |
Experiments | The 3-feature version of VSM yields +1.8 BLEU over the baseline for Arabic to English, and +1.4 BLEU for Chinese to English. |
Experiments | For instance, with an initial Chinese system that employs linear mixture LM adaptation (lin-lm) and has a BLEU of 32.1, adding l-feature VSM adaptation (+vsm, joint) improves performance to 33.1 (improvement significant at p < 0.01), while adding 3-feature VSM instead (+vsm, 3 feat.) |
Experiments | To get an intuition for how VSM adaptation improves BLEU scores, we compared outputs from the baseline and VSM-adapted system (“vsm, joint” in Table 5) on the Chinese test data. |
Experiments | System BLEU: Baseline 12.60; [MB OT * 13.06
Experiments | We measured the overall translation quality with the help of 4-gram BLEU (Papineni et al., 2002), which was computed on tokenized and lower-cased data for both systems. |
Experiments | We obtain a BLEU score of 13.06, which is a gain of 0.46 BLEU points over the baseline. |
Introduction | The translation quality is automatically measured using BLEU scores, and we confirm the findings by providing linguistic evidence (see Section 5). |
Evaluation methodology | In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003). |
Results | OS1 usually has the highest overall score (except BLEU); it also has the highest scores for ‘regulations’ (more formal texts), while P1 scores are better for the news documents.
Results | Metric | Sentence level (Median | Mean | Trimmed) | Corpus level
BLEU | 0.357 | 0.298 | 0.348 | 0.833
NIST | 0.357 | 0.291 | 0.347 | 0.810
Meteor | 0.429 | 0.348 | 0.393 | 0.714
TER | 0.214 | 0.186 | 0.204 | 0.619
GTM | 0.429 | 0.340 | 0.392 | 0.714
Abstract | In our experiments, our model improved by 2.9 BLEU points for Japanese-English and 2.6 BLEU points for Chinese-English translation compared to the lexical reordering models.
Experiment | To stabilize the MERT results, we tuned three times by MERT using the first half of the development data and we selected the SMT weighting parameter set that performed the best on the second half of the development data based on the BLEU scores from the three SMT weighting parameter sets. |
Experiment | To investigate the tolerance for sparsity of the training data, we reduced the training data for the sequence model to 20,000 sentences for JE translation. SEQUENCE using this model with a distortion limit of 30 achieved a BLEU score of 32.22. Although the score is lower than the score of SEQUENCE with a distortion limit of 30 in Table 3, the score was still higher than those of LINEAR, LINEAR+LEX, and 9-CLASS for JE in Table 3.
Code was provided by Deng et al. (2012). | To compute evaluation measures, we take the average scores of BLEU (1) and F-score (unigram-based with respect to content words) over k = 5 candidate captions.
Code was provided by Deng et al. (2012). | Therefore, we also report scores based on semantic matching, which gives partial credit to word pairs based on their lexical similarity. The best performing approach with semantic matching is VISUAL (with LM = Image corpus), improving BLEU, Precision, and F-score substantially over those of ORIG, demonstrating the extrinsic utility of our newly generated image-text parallel corpus in comparison to the original database.
Related Work | When computing BLEU with semantic matching, we look for the match with the highest similarity score among words that have not been matched before. |
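The greedy matching procedure described in the record above can be sketched as follows. The similarity function `sim`, the threshold, and the normalization by candidate length are assumptions of this sketch, not details taken from the cited work:

```python
def greedy_semantic_match(candidate, reference, sim, threshold=0.0):
    """Greedy one-to-one matching: each candidate token is paired with
    the not-yet-matched reference token of highest similarity under
    `sim`, and earns that similarity as partial credit. `sim` is any
    word-similarity function returning values in [0, 1]."""
    available = list(reference)
    credit = 0.0
    for tok in candidate:
        if not available:
            break
        best = max(available, key=lambda r: sim(tok, r))
        score = sim(tok, best)
        if score > threshold:
            credit += score
            available.remove(best)  # each reference token matches at most once
    return credit / len(candidate) if candidate else 0.0
```

With an exact-match `sim` (1.0 if the words are identical, else 0.0), this reduces to ordinary unigram precision; a softer `sim` (e.g. from a lexical similarity resource) is what yields the partial credit described above.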
Experiments | Table 3: BLEU scores for different datasets in different translation directions (left to right), broken down by training corpus (top to bottom).
Experiments | The BLEU scores for the different parallel corpora are shown in Table 3 and the top 10 out-of-vocabulary (OOV) words for each dataset are shown in Table 4. |
Experiments | However, by combining the Weibo parallel data with this standard data, improvements in BLEU are obtained. |