Index of papers in Proc. ACL that mention
  • BLEU
Clifton, Ann and Sarkar, Anoop
Experimental Results
All the BLEU scores reported are for lowercase evaluation.
Experimental Results
m-BLEU indicates that the segmented output was evaluated against a segmented version of the reference (this measure does not have the same correlation with human judgement as BLEU).
Experimental Results
No Uni indicates the segmented BLEU score without unigrams.
Models 2.1 Baseline Models
performance of unsupervised segmentation for translation, our third baseline is a segmented translation model based on a supervised segmentation model (called Sup), using the hand-built Omorfi morphological analyzer (Pirinen and Listenmaa, 2007), which provided slightly higher BLEU scores than the word-based baseline.
Translation and Morphology
Automatic evaluation measures for MT, BLEU (Papineni et al., 2002), WER (Word Error Rate) and PER (Position Independent Word Error Rate) use the word as the basic unit rather than morphemes.
Translation and Morphology
Our proposed approaches are significantly better than the state of the art, achieving the highest reported BLEU scores on the English-Finnish Europarl version 3 dataset.
BLEU is mentioned in 17 sentences in this paper.
Topics mentioned in this paper:
Duh, Kevin and Sudoh, Katsuhito and Wu, Xianchao and Tsukada, Hajime and Nagata, Masaaki
Abstract
BLEU , TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality.
Experiments
As metrics we use BLEU and RIBES (which demonstrated good human correlation in this language pair (Goto et al., 2011)).
Experiments
As metrics we use BLEU and NTER.
Experiments
BLEU $= BP \times \big(\prod_{n=1}^{4} \mathrm{prec}_n\big)^{1/4}$.
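As a concrete illustration of this definition (our own arithmetic, not from the paper): with a brevity penalty of 1 and n-gram precisions of 0.75, 0.5, 0.4 and 0.3, the score is $BP \times (0.75 \cdot 0.5 \cdot 0.4 \cdot 0.3)^{1/4} = 0.045^{1/4} \approx 0.46$.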
Introduction
These methods are effective because they tune the system to maximize an automatic evaluation metric such as BLEU, which serves as a surrogate objective for translation quality.
Introduction
However, we know that a single metric such as BLEU is not enough.
Introduction
For example, while BLEU (Papineni et al., 2002) focuses on word-based n-gram precision, METEOR (Lavie and Agarwal, 2007) allows for stem/synonym matching and incorporates recall.
Multi-objective Algorithms
If we had used BLEU scores rather than the {0,1} labels in line 8, the entire PMO-PRO algorithm would revert to single-objective PRO.
Theory of Pareto Optimality 2.1 Definitions and Concepts
For example, suppose K = 2, M1(h) computes the BLEU score, and M2(h) gives the METEOR score of h. Figure 1 illustrates the set of vectors {M(h)} in a 10-best list.
BLEU is mentioned in 22 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Kuhn, Roland and Larkin, Samuel
Abstract
Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU .
Abstract
In principle, tuning on these metrics should yield better systems than tuning on BLEU .
Abstract
It has a better correlation with human judgment than BLEU .
Introduction
• BLEU (Papineni et al., 2002), NIST (Doddington, 2002), WER, PER, TER (Snover et al., 2006), and LRscore (Birch and Osborne, 2011) do not use external linguistic
Introduction
Among these metrics, BLEU is the most widely used for both evaluation and tuning.
Introduction
Many of the metrics correlate better with human judgments of translation quality than BLEU , as shown in recent WMT Evaluation Task reports (Callison-Burch et
BLEU is mentioned in 66 sentences in this paper.
Topics mentioned in this paper:
Zhao, Bing and Lee, Young-Suk and Luo, Xiaoqiang and Li, Liu
Abstract
The syntax-based translation system integrating the proposed techniques outperforms the best Arabic-English unconstrained system in NIST-08 evaluations by 1.3 absolute BLEU, which is statistically significant.
Experiments
We use BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) to evaluate translation qualities.
Experiments
and we achieved a BLEUr4n4 55.01 for MT08-NW, or a cased BLEU of 53.31, which is close to the best officially reported result 53.85 for unconstrained systems.2 We expose the statistical decisions in Eqn.
Experiments
3 as additional cost, the translation results in Table 11 show it helps BLEU by 0.29 BLEU points (56.13 vs.
BLEU is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Tan, Ming and Zhou, Wenli and Zheng, Lei and Wang, Shaojun
Abstract
The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
Experimental results
We substitute our language model and use MERT (Och, 2003) to optimize the BLEU score (Papineni et al., 2002).
Experimental results
We partition the data into ten pieces, 9 pieces are used as training data to optimize the BLEU score (Papineni et al., 2002) by MERT (Och,
Experimental results
2003), a remaining single piece is used to re-rank the 1000-best list and obtain the BLEU score.
Introduction
ply our language models to the task of re-ranking the N-best list from Hiero (Chiang, 2005; Chiang, 2007), a state-of-the-art parsing-based MT system, we achieve significantly better translation quality measured by the BLEU score and “readability”.
BLEU is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Devlin, Jacob and Zbib, Rabih and Huang, Zhongqiang and Lamar, Thomas and Schwartz, Richard and Makhoul, John
Abstract
On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM.
Abstract
The NNJM features also produce a gain of +6.3 BLEU on top of a simpler baseline equivalent to Chiang’s (2007) original Hiero implementation.
Introduction
Additionally, we present several variations of this model which provide significant additive BLEU gains.
Introduction
The NNJM features produce an improvement of +3.0 BLEU on top of a baseline that is already better than the 1st place MT12 result and includes
Introduction
Additionally, on top of a simpler decoder equivalent to Chiang’s (2007) original Hiero implementation, our NNJM features are able to produce an improvement of +6.3 BLEU, as much as all of the other features in our strong baseline system combined.
Model Variations
Ar-En BLEU / Ch-En BLEU: OpenMT12 1st Place 49.5 / 32.6.
Model Variations
BLEU scores are mixed-case.
Model Variations
On Arabic-English, the primary S2Tm2R NNJM gains +1.4 BLEU on top of our baseline, while the S2T NNLTM gains another +0.8, and the directional variations gain +0.8 BLEU more.
Neural Network Joint Model (NNJM)
We demonstrate in Section 6.6 that using one hidden layer instead of two has minimal effect on BLEU.
Neural Network Joint Model (NNJM)
We demonstrate in Section 6.6 that using the self-normalized/pre-computed NNJM results in only a very small BLEU degradation compared to the standard NNJM.
BLEU is mentioned in 36 sentences in this paper.
Topics mentioned in this paper:
Duan, Manjuan and White, Michael
Abstract
Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make.
Abstract
However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved.
Introduction
With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model.
Introduction
With the SVM reranker, we obtain a significant improvement in BLEU scores over
Introduction
Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases).
Reranking with SVMs 4.1 Methods
In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments.
Simple Reranking
Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations
Simple Reranking
Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93.
Simple Reranking
In sum, although simple ranking helps to avoid vicious ambiguity in some cases, the overall results of simple ranking are no better than the perceptron model (according to BLEU, at least), as parse failures that are not reflective of human interpretive tendencies too often lead the ranker to choose dispreferred realizations.
BLEU is mentioned in 20 sentences in this paper.
Topics mentioned in this paper:
Elliott, Desmond and Keller, Frank
Abstract
The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements.
Abstract
We estimate the correlation of unigram and Smoothed BLEU , TER, ROUGE-SU4, and Meteor against human judgements on two data sets.
Abstract
The main finding is that unigram BLEU has a weak correlation, and Meteor has the strongest correlation with human judgements.
Introduction
The main finding of our analysis is that TER and unigram BLEU are weakly correlated
Introduction
against human judgements, ROUGE-SU4 and Smoothed BLEU are moderately correlated, and the strongest correlation is found with Meteor.
Methodology
BLEU measures the effective overlap between a reference sentence X and a candidate sentence Y.
Methodology
BLEU $= BP \cdot \exp\big(\sum_{n=1}^{N} w_n \log p_n\big)$
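For readers who want to see the quoted formula in executable form, here is a minimal Python sketch (our own illustration, not code from any of the indexed papers) that computes sentence-level BLEU with uniform weights w_n = 1/N from clipped n-gram precisions and the brevity penalty; real evaluations use the official corpus-level scripts (e.g., mteval or sacrebleu).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Minimal sentence-level BLEU: BP * exp(sum_n w_n * log p_n) with w_n = 1/max_n.

    Returns 0.0 if any clipped n-gram precision is zero (i.e., unsmoothed BLEU).
    """
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # clipped (modified) n-gram matches
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if matches == 0:
            return 0.0
        log_prec_sum += (1.0 / max_n) * math.log(matches / total)
    # brevity penalty: 1 if the candidate is longer than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1.0 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum)

print(round(sentence_bleu("the cat sat on the mat",
                          "the cat sat on the red mat"), 3))  # ~0.673
```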
Methodology
Unigram BLEU without a brevity penalty has been reported by Kulkarni et al.
BLEU is mentioned in 27 sentences in this paper.
Topics mentioned in this paper:
Guzmán, Francisco and Joty, Shafiq and Màrquez, Llu'is and Nakov, Preslav
Experimental Results
Group III: contains other important evaluation metrics, which were not considered in the WMT12 metrics task: NIST and ROUGE for both system- and segment-level, and BLEU and TER at segment-level.
Experimental Results
II: TER .812 / .836 / .848; BLEU .810 / .830 / .846
Experimental Results
We can see that DR is already competitive by itself: on average, it has a correlation of .807, very close to BLEU and TER scores (.810 and .812, respectively).
Experimental Setup
To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms).
Experimental Setup
Combination of five metrics based on lexical similarity: BLEU , NIST, METEOR-ex, ROUGE-W, and TERp-A.
Related Work
A common argument is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012).
Related Work
For BLEU and TER, they observed improved correlation with human judgments on the MTC4 dataset when linearly interpolating these metrics with their lexical cohesion score.
BLEU is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Auli, Michael and Gao, Jianfeng
Abstract
Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation.
Abstract
We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion.
Abstract
Our best results improve a phrase-based statistical machine translation system trained on WMT 2012 French-English data by up to 2.0 BLEU, and the expected BLEU objective improves over a cross-entropy trained model by up to 0.6 BLEU in a single reference setup.
Expected BLEU Training
The n-best lists serve as an approximation to 5 (f) used in the next step for expected BLEU training of the recurrent neural network model (§3.
Expected BLEU Training
3.1 Expected BLEU Objective
Expected BLEU Training
Formally, we define our loss function ℓ(θ) as the negative expected BLEU score, denoted as xBLEU(θ), for a given foreign sentence f:
Introduction
The expected BLEU objective provides an efficient way of achieving this for machine translation (Rosti et al., 2010; Rosti et al., 2011; He and Deng, 2012; Gao and He, 2013; Gao et al., 2014) instead of solely relying on traditional optimizers such as Minimum Error Rate Training (MERT) that only adjust the weighting of entire component models within the log-linear framework of machine translation (§3).
Introduction
We test the expected BLEU objective by training a recurrent neural network language model and obtain substantial improvements.
Recurrent Neural Network LMs
time algorithm, which unrolls the network and then computes error gradients over multiple time steps (Rumelhart et al., 1986); we use the expected BLEU loss (§3) to obtain the error with respect to the output activations.
BLEU is mentioned in 25 sentences in this paper.
Topics mentioned in this paper:
Chen, David and Dolan, William
Experiments
As more training pairs are used, the model produces more varied sentences (PINC) but preserves the meaning less well (BLEU)
Experiments
As a comparison, evaluating each human description as a paraphrase for the other descriptions in the same cluster resulted in a BLEU score of 52.9 and a PINC score of 77.2.
Introduction
In addition to the lack of standard datasets for training and testing, there are also no standard metrics like BLEU (Papineni et al., 2002) for evaluating paraphrase systems.
Paraphrase Evaluation Metrics
One of the limitations to the development of machine paraphrasing is the lack of standard metrics like BLEU , which has played a crucial role in driving progress in MT.
Paraphrase Evaluation Metrics
Thus, researchers have been unable to rely on BLEU or some derivative: the optimal paraphrasing engine under these terms would be one that simply returns the input.
Paraphrase Evaluation Metrics
To measure semantic equivalence, we simply use BLEU with multiple references.
BLEU is mentioned in 28 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhu, Muhua and Wang, Huizhen
Background
As in other state-of-the-art SMT systems, BLEU is selected as the accuracy measure to define the error function used in MERT.
Background
Since the weights of training samples are not taken into account in BLEU, we modify the original definition of BLEU to make it sensitive to the distribution Dt(i) over the training samples.
Background
The modified version of BLEU is called weighted BLEU (WBLEU) in this paper.
BLEU is mentioned in 25 sentences in this paper.
Topics mentioned in this paper:
Wuebker, Joern and Mauser, Arne and Ney, Hermann
Abstract
Using this consistent training of phrase models we are able to achieve improvements of up to 1.4 points in BLEU .
Alignment
We perform minimum error rate training with the downhill simplex algorithm (Nelder and Mead, 1965) on the development data to obtain a set of scaling factors that achieve a good BLEU score.
Experimental Evaluation
The scaling factors of the translation models have been optimized for BLEU on the DEV data.
Experimental Evaluation
The metrics used for evaluation are the case-sensitive BLEU (Papineni et al., 2002) score and the translation edit rate (TER) (Snover et al., 2006) with one reference translation.
Introduction
Our results show that the proposed phrase model training improves translation quality on the test set by 0.9 BLEU points over our baseline.
Introduction
We find that by interpolation with the heuristically extracted phrases translation performance can reach up to 1.4 BLEU improvement over the baseline on the test set.
BLEU is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Mi, Haitao and Liu, Qun
Abstract
Medium-scale experiments show an absolute and statistically significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system even with fewer rules.
Experiments
We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on development set.
Experiments
The baseline system extracts 31.9M 625 rules, 77.9M 525 rules respectively and achieves a BLEU score of 34.17 on the test set3.
Experiments
As shown in the third line in the column of BLEU score, the performance drops 1.7 BLEU points over baseline system due to the poorer rule coverage.
Introduction
Medium data experiments (Section 5) show a statistically significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system even with less translation rules, this is also the first time that a tree-to-tree model can surpass tree-to-string counterparts.
Model
(2009), their forest-based constituency-to-constituency system achieves a comparable performance against Moses (Koehn et al., 2007), but a significant improvement of +3.6 BLEU points over the 1-best tree-based constituency-to-constituency system.
BLEU is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Liu, Zhanyi and Wang, Haifeng and Wu, Hua and Li, Sheng
Abstract
As compared to baseline systems, we achieve absolute improvements of 2.40 BLEU score on a phrase-based SMT system and 1.76 BLEU score on a parsing-based SMT system.
Experiments on Parsing-Based SMT
BLEU (%): Joshua 30.05; + Improved word alignments 31.81
Experiments on Parsing-Based SMT
The system using the improved word alignments achieves an absolute improvement of 1.76 BLEU score, which indicates that the improvements of word alignments are also effective to improve the performance of the parsing-based SMT systems.
Experiments on Phrase-Based SMT
We use BLEU (Papineni et al., 2002) as evaluation metrics.
Experiments on Phrase-Based SMT
BLEU (%): Moses 29.62; + Phrase collocation probability 30.47
Experiments on Phrase-Based SMT
If the same alignment method is used, the systems using CM-3 got the highest BLEU scores.
Experiments on Word Alignment
BLEU (%): Baseline 29.62; WA-1: CM-1 30.85, CM-2 31.28, CM-3 31.48; our methods, WA-2: CM-1 31.00, CM-2 31.33, CM-3 31.51; WA-3: CM-1 31.43, CM-2 31.62, CM-3 31.78
Introduction
The alignment improvement results in an improvement of 2.16 BLEU score on phrase-based SMT system and an improvement of 1.76 BLEU score on parsing-based SMT system.
Introduction
SMT performance is further improved by 0.24 BLEU score.
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Bojar, Ondřej and Kos, Kamil and Mareċek, David
Abstract
BLEU ) when applied to morphologically rich languages such as Czech.
Introduction
Section 2 illustrates and explains severe problems of a widely used BLEU metric (Papineni et al., 2002) when applied to Czech as a representative of languages with rich morphology.
Introduction
Figure 1: BLEU and human ranks of systems participating in the English-to-Czech WMT09 shared task.
Problems of BLEU
BLEU (Papineni et al., 2002) is an established language-independent MT metric.
Problems of BLEU
The unbeaten advantage of BLEU is its simplicity.
Problems of BLEU
We plot the official BLEU score against the rank established as the percentage of sentences where a system ranked no worse than all its competitors (Callison-Burch et al., 2009).
BLEU is mentioned in 22 sentences in this paper.
Topics mentioned in this paper:
Zaslavskiy, Mikhail and Dymetman, Marc and Cancedda, Nicola
Experiments
The second score is BLEU (Papineni et al., 2001), computed between the reconstructed and the original sentences, which allows us to check how well the quality of reconstruction correlates with the internal score.
Experiments
In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences.
BLEU is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Wuebker, Joern and Ney, Hermann and Zens, Richard
Abstract
At a speed of roughly 70 words per second, Moses reaches 17.2% BLEU , whereas our approach yields 20.0% with identical models.
Experimental Evaluation
System / BLEU [%] / #HYP / #LM / w/s. N0 = ∞: baseline 20.1, 3.0K, 322K, 2.2; +presort 20.1, 2.5K, 183K, 3.6. N0 = 100:
Experimental Evaluation
We evaluate with BLEU (Papineni et al., 2002) and TER (Snover et al., 2006).
Introduction
We also run comparisons with the Moses decoder (Koehn et al., 2007), which yields the same performance in BLEU , but is outperformed significantly in terms of scalability for faster translation.
BLEU is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang
Introduction
Experiments show that our approach significantly outperforms both phrase-based (Koehn et al., 2007) and string-to-dependency approaches (Shen et al., 2008) in terms of BLEU and TER.
Introduction
Adding dependency language model (“depLM”) and the maximum entropy shift-reduce parsing model (“maxent”) significantly improves BLEU and TER on the development set, both separately and jointly.
BLEU is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Nguyen, ThuyLinh and Vogel, Stephan
Experiment Results
We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences).
Experiment Results
On average the improvement is 1.07 BLEU score (45.66
Experiment Results
Table 4: Arabic-English true case translation scores in BLEU metric.
Phrasal-Hiero Model
Compare BLEU scores of translation using all extracted rules (the first row) and translation using only rules without nonaligned subphrases (the second row).
BLEU is mentioned in 24 sentences in this paper.
Topics mentioned in this paper:
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Baseline MT
The scaling factors for all features are optimized by minimum error rate training algorithm to maximize BLEU score (Och, 2003).
Experiments
We can see that except for the BOLT3 data set with BLEU metric, our NAMT approach consistently outperformed the baseline system for all data sets with all metrics, and provided up to 23.6% relative error reduction on name translation.
Experiments
According to Wilcoxon Matched-Pairs Signed-Ranks Test, the improvement is not significant with BLEU metric, but is significant at 98% confidence level with all of the other metrics.
Introduction
• The current dominant automatic MT scoring metrics (such as Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002)) treat all words equally, but names have relatively low frequency in text (about 6% in newswire and only 3% in web documents) and thus are vastly outnumbered by function words and common nouns, etc.
Name-aware MT Evaluation
Traditional MT evaluation metrics such as BLEU (Papineni et al., 2002) and Translation Edit Rate (TER) (Snover et al., 2006) assign the same weights to all tokens equally.
Name-aware MT Evaluation
In order to properly evaluate the translation quality of NAMT methods, we propose to modify the BLEU metric so that they can dynamically assign more weights to names during evaluation.
Name-aware MT Evaluation
BLEU considers the correspondence between a system translation and a human translation:
BLEU is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith
Abstract
We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster).
Experiments and Results
To evaluate translation quality, we use BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation.
Experiments and Results
We show that our method achieves the best performance ( BLEU scores) on this task while being significantly faster than both the previous approaches.
Experiments and Results
We also report the first BLEU results on such a large-scale MT task under truly nonparallel settings (without using any parallel data or seed lexicon).
BLEU is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Sajjad, Hassan and Darwish, Kareem and Belinkov, Yonatan
Abstract
The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points.
Abstract
Further, adapting large MSA-English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
Introduction
- We built a phrasal Machine Translation (MT) system on adapted Egyptian-English parallel data, which outperformed a non-adapted baseline by 1.87 BLEU points.
Previous Work
The system trained on AR (B1) performed poorly compared to the one trained on EG (B2) with a 6.75 BLEU points difference.
Proposed Methods 3.1 Egyptian to EG’ Conversion
S1, which used only EG’ for training, showed an improvement of 1.67 BLEU points from the best baseline system (B4).
Proposed Methods 3.1 Egyptian to EG’ Conversion
Phrase merging that preferred phrases learnt from EG’ data over AR data performed the best with a BLEU score of 16.96.
Proposed Methods 3.1 Egyptian to EG’ Conversion
tian sentence “wbyHtrmwA AlnAs AltAnyp” Until produced “lyfizfij (OOV) the second people” ( BLEU = 0.31).
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Feng, Yang and Cohn, Trevor
Abstract
Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU .
Experiments
We compared the performance of Moses using the alignment produced by our model and the baseline alignment, evaluating translation quality using BLEU (Papineni et al., 2002) with case-insensitive n-gram matching with n = 4.
Experiments
We used minimum error rate training (Och, 2003) to tune the feature weights to maximise the BLEU score on the development set.
Experiments
The effect on translation scores is modest, roughly amounting to +0.2 BLEU versus using a single sample.
Introduction
The model produces uniformly better translations than those of a competitive phrase-based baseline, amounting to an improvement of up to 3.4 BLEU points absolute.
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Eidelman, Vladimir and Marton, Yuval and Resnik, Philip
Abstract
We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.
Additional Experiments
As can be seen in Table 4, in the smaller feature set, RM and MERT were the best performers, with the exception that on MT08, MIRA yielded somewhat better (+0.7) BLEU but a somewhat worse (-0.9) TER score than RM.
Additional Experiments
On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain.
Additional Experiments
Interestingly, RM achieved substantially higher BLEU precision scores in all tests for both language pairs.
Experiments
We used cdec (Dyer et al., 2010) as our hierarchical phrase-based decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 corpus.
Experiments
The bound constraint B was set to 1.4. The approximate sentence-level BLEU cost Δ is computed in a manner similar to (Chiang et al., 2009), namely, in the context of previous 1-best translations of the tuning set.
Experiments
We explored alternative values for B, as well as scaling it by the current candidate’s cost, and found that the optimizer is fairly insensitive to these changes, resulting in only minor differences in BLEU .
BLEU is mentioned in 18 sentences in this paper.
Topics mentioned in this paper:
Wang, Lu and Cardie, Claire
Introduction
Automatic evaluation (using ROUGE (Lin and Hovy, 2003) and BLEU (Papineni et al., 2002)) against manually generated focused summaries shows that our sum-marizers uniformly and statistically significantly outperform two baseline systems as well as a state-of-the-art supervised extraction-based system.
Results
To evaluate the full abstract generation system, the BLEU score (Papineni et al., 2002) (the precision of unigrams and bigrams with a brevity penalty) is computed with human abstracts as reference.
Results
BLEU has a fairly good agreement with human judgement and has been used to evaluate a variety of language generation systems (Angeli et al., 2010; Konstas and Lapata, 2012).
BLEU is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Yeniterzi, Reyyan and Oflazer, Kemal
Abstract
We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores.
Abstract
Our maximal set of source and target side transformations, coupled with some additional techniques, provides a 39% relative improvement, from a baseline of 17.08 to 23.78 BLEU, all averaged over 10 training and test sets.
Experimental Setup and Results
For evaluation, we used the BLEU metric (Papineni et al., 2001).
Experimental Setup and Results
Wherever meaningful, we report the average BLEU scores over 10 data sets along with the maximum and minimum values and the standard deviation.
Experimental Setup and Results
We can observe that the combined syntax-to-morphology transformations on the source side provide a substantial improvement by themselves and a simple target side transformation on top of those provides a further boost to 21.96 BLEU which represents a 28.57% relative improvement over the word-based baseline and a 18.00% relative improvement over the factored baseline.
Introduction
We find that with the full set of syntax-to-morphology transformations and some additional techniques we can get about 39% relative improvement in BLEU scores over a word-based baseline and about 28% improvement of a factored baseline, all experiments being done over 10 training and test sets.
Syntax-to-Morphology Mapping
We find (and elaborate later) that this reduction in the English side of the training corpus, in general, is about 30%, and is correlated with improved BLEU scores.
BLEU is mentioned in 35 sentences in this paper.
Topics mentioned in this paper:
Wang, Kun and Zong, Chengqing and Su, Keh-Yih
Abstract
Furthermore, integrated Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction in comparison with the pure SMT system.
Conclusion and Future Work
The experiments show that the proposed Model-III outperforms both the TM and the SMT systems significantly (p < 0.05) in either BLEU or TER when fuzzy match score is above 0.4.
Conclusion and Future Work
Compared with the pure SMT system, Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction on a Chinese—English TM database.
Experiments
In the tables, the best translation results (either in BLEU or TER) at each interval have been marked in bold.
Experiments
Compared with TM and SMT, Model-I is significantly better than the SMT system in either BLEU or TER when the fuzzy match score is above 0.7; Model-II significantly outperforms both the TM and the SMT systems in either BLEU or TER when the fuzzy match score is above 0.5; Model-III significantly exceeds both the TM and the SMT systems in either BLEU or TER when the fuzzy match score is above 0.4.
Experiments
SMT 8.03 BLEU points at interval [0.9, 1.0), while the advantage is only 2.97 BLEU points at interval [0.6, 0.7).
Introduction
Compared with the pure SMT system, the proposed integrated Model-III achieves 3.48 BLEU points improvement and 2.62 TER points reduction overall.
BLEU is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Sun, Hong and Zhou, Ming
Abstract
In addition, a revised BLEU score (called iBLEU) which measures the adequacy and diversity of the generated paraphrase sentence is proposed for tuning parameters in SMT systems.
Experiments and Results
Joint learning (BLEU / self-BLEU / iBLEU): No Joint 27.16, 35.42, /; α = 1: 30.75, 53.51, 30.75
Experiments and Results
We show the BLEU score (computed against references) to measure the adequacy and self-BLEU (computed against source sentence) to evaluate the dissimilarity (lower is better).
Experiments and Results
From the results we can see that, when the value of α decreases to address more penalty on self-paraphrase, the self-BLEU score rapidly decays, while the consequent effect is that the BLEU score computed against references also drops seriously.
Introduction
The jointly-learned dual SMT system: (1) Adapts the SMT systems so that they are tuned specifically for paraphrase generation purposes, e.g., to increase the dissimilarity; (2) Employs a revised BLEU score (named iBLEU, as it’s an input-aware BLEU metric) that measures adequacy and dissimilarity of the paraphrase results at the same time.
Paraphrasing with a Dual SMT System
Two issues are also raised in (Zhao and Wang, 2010) about using automatic metrics: a paraphrase that changes less gets a larger BLEU score, and the evaluations of paraphrase quality and rate tend to be incompatible.
Paraphrasing with a Dual SMT System
iBLEU$(s, r_s, c) = \alpha \, \mathrm{BLEU}(c, r_s) - (1 - \alpha) \, \mathrm{BLEU}(c, s)$ (3)
Paraphrasing with a Dual SMT System
BLEU$(c, r_s)$ captures the semantic equivalency between the candidates and the references (Finch et al.
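To make Eq. (3) concrete, here is a minimal sketch (our own, not from the paper); the two sentence-level BLEU values would come from any BLEU implementation, and the α values shown are arbitrary:

```python
def ibleu(bleu_vs_refs, bleu_vs_source, alpha):
    """iBLEU(s, r_s, c) = alpha * BLEU(c, r_s) - (1 - alpha) * BLEU(c, s).

    bleu_vs_refs:   adequacy term, BLEU of candidate c against the references r_s
    bleu_vs_source: dissimilarity penalty, BLEU of c against the input sentence s
    alpha:          trade-off weight; alpha = 1 reduces iBLEU to plain BLEU
    """
    return alpha * bleu_vs_refs - (1.0 - alpha) * bleu_vs_source

# A near-copy of the source scores high against the source and gets penalized,
# while a more dissimilar paraphrase with similar adequacy ranks higher.
print(round(ibleu(0.35, 0.95, alpha=0.8), 2))  # 0.09
print(round(ibleu(0.30, 0.40, alpha=0.8), 2))  # 0.16
```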
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Nuhn, Malte and Mauser, Arne and Ney, Hermann
Experimental Evaluation
We show that our method performs better by 1.6 BLEU than the best performing method described in (Ravi and Knight, 2011) while
Experimental Evaluation
In case of the OPUS and VERBMOBIL corpus, we evaluate the results using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) to reference translations.
Experimental Evaluation
For BLEU higher values are better, for TER lower values are better.
Related Work
They perform experiments on a Spanish-English task with vocabulary sizes of about 500 words and achieve a performance of around 20 BLEU compared to 70 BLEU obtained by a system that was trained on parallel data.
BLEU is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Zarriess, Sina and Kuhn, Jonas
Experiments
BLEU , sentence-level geometric mean of 1- to 4-gram precision, as in (Belz et al., 2011)
Experiments
BLEUT, sentence-level BLEU computed on post-processed output where predicted referring expressions for victim and perp are replaced in the sentences (both gold and predicted) by their original role label; this score does not penalize lexical mismatches between corpus and system REs
Experiments
When REG and linearization are applied on shallowSyn_re with gold shallow trees, the BLEU score is lower (60.57) as compared to the system that applies syntax and linearization on deepSyn+re, deep trees with gold REs (BLEU score of 63.9).
BLEU is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Liu, Chang and Ng, Hwee Tou
Abstract
We show empirically that TESLA—CELAB significantly outperforms character-level BLEU in the English—Chinese translation evaluation tasks.
Experiments
4.3.1 BLEU
Experiments
Although word-level BLEU has often been found inferior to the new-generation metrics when the target language is English or other European languages, prior research has shown that character-level BLEU is highly competitive when the target language is Chinese (Li et al., 2011).
Experiments
use character-level BLEU as our main baseline.
Introduction
Since the introduction of BLEU (Papineni et al., 2002), automatic machine translation (MT) evaluation has received a lot of research interest.
Introduction
In the WMT shared tasks, many new generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments.
Introduction
Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human judged translation quality.
BLEU is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Kolachina, Prasanth and Cancedda, Nicola and Dymetman, Marc and Venkatapathy, Sriram
Inferring a learning curve from mostly monolingual data
Our objective is to predict the evolution of the BLEU score on the given test set as a function of the size of a random subset of the training data
Inferring a learning curve from mostly monolingual data
We first train models to predict the BLEU score at m anchor sizes $s_1, \ldots, s_m$.
Inferring a learning curve from mostly monolingual data
We then perform inference using these models to predict the BLEU score at each anchor, for the test case of interest.
Introduction
In both cases, the task consists in predicting an evaluation score ( BLEU , throughout this work) on the test corpus as a function of the size of a subset of the source sample, assuming that we could have it manually translated and use the resulting bilingual corpus for training.
Introduction
An extensive study across six parametric function families, empirically establishing that a certain three-parameter power-law family is well suited for modeling learning curves for the Moses SMT system when the evaluation score is BLEU .
Introduction
They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6).
Selecting a parametric family of curves
For a certain bilingual test dataset d, we consider a set of observations $O_d = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $y_i$ is the performance on d (measured using BLEU (Papineni et al., 2002)) of a translation model trained on a parallel corpus of size $x_i$.
Selecting a parametric family of curves
The last condition is related to our use of BLEU (which is bounded by 1) as a performance measure; it should be noted that some growth patterns which are sometimes proposed, such as a logarithmic regime of the form $y = a + b \log x$, are not
Selecting a parametric family of curves
The values are on the same scale as the BLEU scores.
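The excerpts above describe fitting a parametric learning curve of BLEU against training-set size and extrapolating it to larger corpora. The exact three-parameter power-law family is not spelled out in these excerpts, so the sketch below is only illustrative: the functional form y(x) = c - a*x^(-b), the anchor points, and the use of SciPy's curve_fit are all our assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, c, a, b):
    # Assumed three-parameter power law: BLEU rises toward a ceiling c as the
    # training-set size x grows; a and b control the offset and decay rate.
    return c - a * np.power(x, -b)

# Hypothetical (training size in segments, BLEU) anchor points.
sizes = np.array([5_000.0, 10_000.0, 20_000.0, 40_000.0, 75_000.0])
bleu = np.array([0.18, 0.21, 0.24, 0.26, 0.28])

params, _ = curve_fit(power_law, sizes, bleu, p0=[0.35, 5.0, 0.5],
                      bounds=([0.0, 0.0, 0.0], [1.0, np.inf, 2.0]))
c, a, b = params
print(f"fitted curve: y = {c:.3f} - {a:.2f} * x^(-{b:.3f})")
print(f"extrapolated BLEU at 150K segments: {power_law(150_000.0, *params):.3f}")
```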
BLEU is mentioned in 21 sentences in this paper.
Topics mentioned in this paper:
He, Xiaodong and Deng, Li
Abstract
In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset.
Abstract
The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system.
Abstract
parameters in the phrase and lexicon translation models are estimated by relative frequency or maximizing joint likelihood, which may not correspond closely to the translation measure, e.g., bilingual evaluation understudy ( BLEU ) (Papineni et al., 2002).
BLEU is mentioned in 44 sentences in this paper.
Topics mentioned in this paper:
Wu, Hua and Wang, Haifeng
Experiments
Translation quality was evaluated using both the BLEU score proposed by Papineni et al.
Experiments
(2002) and also the modified BLEU (BLEU-Fix) score used in the IWSLT 2008 evaluation campaign, where the brevity calculation is modified to use the closest reference length instead of the shortest reference length.
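The only difference between plain BLEU and the BLEU-Fix variant quoted above is which reference length enters the brevity penalty. A minimal sketch of that difference (our own illustration; the tie-breaking rule and other details of the IWSLT script are assumptions):

```python
import math

def brevity_penalty(candidate_len, ref_lens, use_closest=True):
    """Brevity penalty using either the closest or the shortest reference length.

    use_closest=True  mirrors the closest-reference-length behaviour described above;
    use_closest=False mirrors the original shortest-reference-length behaviour.
    Ties on closeness are broken here in favour of the shorter reference.
    """
    if use_closest:
        r = min(ref_lens, key=lambda length: (abs(length - candidate_len), length))
    else:
        r = min(ref_lens)
    return 1.0 if candidate_len > r else math.exp(1.0 - r / candidate_len)

refs = [22, 28, 35]                                   # three reference lengths
print(brevity_penalty(27, refs, use_closest=True))    # closest r=28 -> exp(1 - 28/27)
print(brevity_penalty(27, refs, use_closest=False))   # shortest r=22 -> 1.0, no penalty
```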
Experiments
Method (BLEU / BLEU-Fix): Triangulation 33.70/27.46, 31.59/25.02; Transfer 33.52/28.34, 31.36/26.20; Synthetic 34.35/27.21, 32.00/26.07; Combination 38.14/29.32, 34.76/27.39
Translation Selection
In this paper, we modify the method in Albrecht and Hwa (2007) to only prepare human reference translations for the training examples, and then evaluate the translations produced by the subject systems against the references using BLEU score (Papineni et al., 2002).
Translation Selection
We use smoothed sentence-level BLEU score to replace the human assessments, where we use additive smoothing to avoid zero BLEU scores when we calculate the n-gram precisions.
Translation Selection
In the context of translation selection, y is assigned as the smoothed BLEU score.
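As a small illustration of the additive smoothing described above (our own sketch; the excerpts do not specify the exact smoothing constant used in the paper):

```python
def smoothed_precisions(matches, totals, k=1.0):
    """Add-k smoothing of modified n-gram precisions.

    matches[n-1] and totals[n-1] are the clipped n-gram matches and the candidate's
    n-gram counts for n = 1..N; adding k to both keeps every precision above zero,
    so the geometric mean (and hence sentence-level BLEU) never collapses to 0.
    """
    return [(m + k) / (t + k) for m, t in zip(matches, totals)]

# Example: no 4-gram matched, so the unsmoothed sentence-level BLEU would be 0.
print(smoothed_precisions([5, 3, 1, 0], [6, 5, 4, 3]))
# [0.857..., 0.666..., 0.4, 0.25]
```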
BLEU is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Chan, Yee Seng and Ng, Hwee Tou
Automatic Evaluation Metrics
In this section, we describe BLEU, and the three metrics which achieved higher correlation results than BLEU in the recent ACL-07 MT workshop.
Automatic Evaluation Metrics
2.1 BLEU
Automatic Evaluation Metrics
BLEU (Papineni et al., 2002) is essentially a precision-based metric and is currently the standard metric for automatic evaluation of MT performance.
Introduction
Among all the automatic MT evaluation metrics, BLEU (Papineni et al., 2002) is the most widely used.
Introduction
Although BLEU has played a crucial role in the progress of MT research, it is becoming evident that BLEU does not correlate with human judgement
Introduction
The results show that, as compared to BLEU , several recently proposed metrics such as Semantic-role overlap (Gimenez and Marquez, 2007), ParaEval-recall (Zhou et al., 2006), and METEOR (Banerjee and Lavie, 2005) achieve higher correlation.
BLEU is mentioned in 20 sentences in this paper.
Topics mentioned in this paper:
Deng, Yonggang and Xu, Jia and Gao, Yuqing
A Generic Phrase Training Procedure
lation engine to minimize the final translation errors measured by automatic metrics such as BLEU (Papineni et al., 2002).
Discussions
After reaching its peak, the BLEU score drops as the threshold 7' increases.
Discussions
Table 4: Translation Results ( BLEU ) of discriminative phrase training approach using different features
Experimental Results
We measure translation performance by the BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) scores with multiple translation references.
Experimental Results
The translation results as measured by BLEU and METEOR scores are presented in Table 3.
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Ganchev, Kuzman and Graça, João V. and Taskar, Ben
Abstract
We propose and extensively evaluate a simple method for using alignment models to produce alignments better-suited for phrase-based MT systems, and show significant gains (as measured by BLEU score) in end-to-end translation systems for six languages pairs used in recent MT competitions.
Conclusions
Table 3: BLEU scores for all language pairs using all available data.
Introduction
Our contribution is a large scale evaluation of this methodology for word alignments, an investigation of how the produced alignments differ and how they can be used to consistently improve machine translation performance (as measured by BLEU score) across many languages on training corpora with up to hundred thousand sentences.
Introduction
In 10 out of 12 cases we improve BLEU score by at least i point and by more than 1 point in 4 out of 12 cases.
Phrase-based machine translation
We report BLEU scores using a script available with the baseline system.
Phrase-based machine translation
Figure 8: BLEU score as the amount of training data is increased on the Hansards corpus for the best decoding method for each alignment model.
Phrase-based machine translation
In principle, we would like to tune the threshold by optimizing BLEU score on a development set, but that is impractical for experiments with many pairs of languages.
Word alignment results
Unfortunately, as was shown by Fraser and Marcu (2007), AER can have weak correlation with translation performance as measured by BLEU score (Papineni et al., 2002), when the alignments are used to train a phrase-based translation system.
BLEU is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Mi, Haitao and Huang, Liang and Liu, Qun
Abstract
Large-scale experiments show an absolute improvement of 1.7 BLEU points over the 1-best baseline.
Experiments
We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on the dev set.
Experiments
The BLEU score of the baseline 1-best decoding is 0.2325, which is consistent with the result of 0.2302 in (Liu et al., 2007) on the same training, development and test sets, and with the same rule extraction procedure.
Introduction
Large-scale experiments (Section 4) show an improvement of 1.7 BLEU points over the 1-best baseline, which is also 0.8 points higher than decoding with 30-best trees, and takes even less time thanks to the sharing of common subtrees.
BLEU is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Shen, Libin and Xu, Jinxi and Weischedel, Ralph
Abstract
Our experiments show that the string-to-dependency decoder achieves 1.48 point improvement in BLEU and 2.53 point improvement in TER compared to a standard hierarchical string-to-string system on the NIST 04 Chinese-English evaluation set.
Conclusions and Future Work
Our string-to-dependency system generates 80% fewer rules, and achieves 1.48 point improvement in BLEU and 2.53 point improvement in TER on the decoding output on the NIST 04 Chinese-English evaluation set.
Experiments
All models are tuned on BLEU (Papineni et al., 2001), and evaluated on both BLEU and Translation Error Rate (TER) (Snover et al., 2006) so that we could detect over-tuning on one metric.
Experiments
BLEU% (lower / mixed), TER% (lower / mixed). Decoding (3-gram LM): baseline 38.18 / 35.77, 58.91 / 56.60; filtered 37.92 / 35.48, 57.80 / 55.43; str-dep 39.52 / 37.25, 56.27 / 54.07. Rescoring (5-gram LM): baseline 40.53 / 38.26, 56.35 / 54.15; filtered 40.49 / 38.26, 55.57 / 53.47; str-dep 41.60 / 39.47, 55.06 / 52.96.
Experiments
Table 2: BLEU and TER scores on the test set.
Introduction
For example, Chiang (2007) showed that the Hiero system achieved about 1 to 3 point improvement in BLEU on the NIST 03/04/05 Chinese-English evaluation sets compared to a state-of-the-art phrasal system.
Introduction
Our string-to-dependency decoder shows 1.48 point improvement in BLEU and 2.53 point improvement in TER on the NIST 04 Chinese-English MT evaluation set.
BLEU is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Toutanova, Kristina and Suzuki, Hisami and Ruopp, Achim
Abstract
We applied our inflection generation models in translating English into two morphologically complex languages, Russian and Arabic, and show that our model improves the quality of SMT over both phrasal and syntax-based SMT systems according to BLEU and human judgements.
Integration of inflection models with MT systems
We performed a grid search on the values of A and n, to maximize the BLEU score of the final system on a development set (dev) of 1000 sentences (Table 2).
MT performance results
For automatically measuring performance, we used 4-gram BLEU against a single reference translation.
MT performance results
We also report oracle BLEU scores which incorporate two kinds of oracle knowledge.
MT performance results
For the methods using n=1 translation from a base MT system, the oracle BLEU score is the BLEU score of the stemmed translation compared to the stemmed reference, which represents the upper bound achievable by changing only the inflected forms (but not stems) of the words in a translation.
BLEU is mentioned in 26 sentences in this paper.
Topics mentioned in this paper:
Yan, Rui and Gao, Mingkun and Pavlick, Ellie and Callison-Burch, Chris
Evaluation
Metric Since we have four professional translation sets, we can calculate the Bilingual Evaluation Understudy ( BLEU ) score (Papineni et al., 2002) for one professional translator (Pl) using the other three (P2,3,4) as a reference set.
Evaluation
In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations.
Evaluation
This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators.
BLEU is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Gildea, Daniel
Abstract
An additional fast decoding pass maximizing the expected count of correct translation hypotheses increases the BLEU score significantly.
Decoding to Maximize BLEU
BLEU is based on n-gram precision, and since each synchronous constituent in the tree adds a new 4-gram to the translation at the point where its children are concatenated, the additional pass approximately maximizes BLEU .
Experiments
We evaluate the translation results by comparing them against the reference translations using the BLEU metric.
Experiments
Hyperedges / BLEU. Bigram Pass: 167K, 21.77. Trigram Pass: UNI+BO +629.7K = 796.7K, 23.56; BO+BB +2.7K = 169.
Experiments
Table 1: Speed and BLEU scores for two-pass decoding.
Introduction
With this heuristic, we achieve the same BLEU scores and model cost as a trigram decoder with essentially the same speed as a bigram decoder.
Introduction
Maximizing the expected count of synchronous constituents approximately maximizes BLEU .
Introduction
We find a significant increase in BLEU in the experiments, with minimal additional time.
BLEU is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
DeNero, John and Chiang, David and Knight, Kevin
Abstract
The minimum Bayes risk (MBR) decoding objective improves BLEU scores for machine translation output relative to the standard Viterbi objective of maximizing model score.
Abstract
However, MBR targeting BLEU is prohibitively slow to optimize over k-best lists for large k. In this paper, we introduce and analyze an alternative to MBR that is equally effective at improving performance, yet is asymptotically faster, running 80 times faster than MBR in experiments with 1000-best lists.
Abstract
Our forest-based decoding objective consistently outperforms k-best list MBR, giving improvements of up to 1.0 BLEU.
Consensus Decoding Algorithms
Typically, MBR is defined as $\arg\min_{e \in E} \mathbb{E}[L(e; e')]$ for some loss function L, for example $1 - \mathrm{BLEU}(e; e')$. These definitions are equivalent.
Consensus Decoding Algorithms
Figure 1 compares Algorithms 1 and 2 using $U(e; e')$. Other linear functions have been explored for MBR, including Taylor approximations to the logarithm of BLEU (Tromble et al., 2008) and counts of matching constituents (Zhang and Gildea, 2008), which are discussed further in Section 3.3.
Consensus Decoding Algorithms
Computing MBR even with simple nonlinear measures such as BLEU, NIST or bag-of-words F1 seems to require $O(k^2)$ computation time.
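For orientation, a generic k-best MBR reranker looks like the sketch below (our own illustration, not code from the paper); `similarity` stands in for sentence-level BLEU or any other gain function, and the nested loop is exactly the O(k^2) cost noted above.

```python
def mbr_rerank(kbest, posteriors, similarity):
    """Pick the hypothesis with the highest expected gain under the model posterior.

    kbest:      list of hypothesis strings e
    posteriors: matching list of probabilities p(e' | f), summing to ~1
    similarity: gain function U(e, e'), e.g. sentence-level BLEU(e; e')
    """
    best, best_gain = None, float("-inf")
    for e in kbest:
        expected_gain = sum(p * similarity(e, e_prime)
                            for e_prime, p in zip(kbest, posteriors))
        if expected_gain > best_gain:
            best, best_gain = e, expected_gain
    return best

def unigram_f1(a, b):
    # Toy similarity used only for the demo; a cheap stand-in for BLEU.
    sa, sb = set(a.split()), set(b.split())
    return 2.0 * len(sa & sb) / (len(sa) + len(sb)) if sa and sb else 0.0

hyps = ["the cat sat", "a cat sat", "the dog ran"]
probs = [0.4, 0.35, 0.25]
print(mbr_rerank(hyps, probs, unigram_f1))  # "the cat sat"
```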
Introduction
In statistical machine translation, output translations are evaluated by their similarity to human reference translations, where similarity is most often measured by BLEU (Papineni et al., 2002).
Introduction
Unfortunately, with a nonlinear similarity measure like BLEU , we must resort to approximating the expected loss using a k-best list, which accounts for only a tiny fraction of a model’s full posterior distribution.
Introduction
In experiments using BLEU over 1000-best lists, we found that our objective provided benefits very similar to MBR, only much faster.
BLEU is mentioned in 37 sentences in this paper.
Topics mentioned in this paper:
Kumar, Shankar and Macherey, Wolfgang and Dyer, Chris and Och, Franz
Introduction
Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
Introduction
We employ MERT to select these weights by optimizing BLEU score on a development set.
Introduction
In contrast, our MBR algorithm directly selects the hypothesis in the hypergraph with the maximum expected approximate corpus BLEU score (Tromble et al., 2008).
MERT for MBR Parameter Optimization
However, this does not guarantee that the resulting linear score (Equation 2) is close to the corpus BLEU .
MERT for MBR Parameter Optimization
We now describe how MERT can be used to estimate these factors to achieve a better approximation to the corpus BLEU .
MERT for MBR Parameter Optimization
We recall that MERT selects weights in a linear model to optimize an error criterion (e.g., corpus BLEU) on a training set.
Minimum Bayes-Risk Decoding
This reranking can be done for any sentence-level loss function such as BLEU (Papineni et al., 2001), Word Error Rate, or Position-independent Error Rate.
Minimum Bayes-Risk Decoding
(2008) extended MBR decoding to translation lattices under an approximate BLEU score.
Minimum Bayes-Risk Decoding
They approximated log( BLEU ) score by a linear function of n-gram matches and candidate length.
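Schematically, the linear approximation referred to above (Tromble et al., 2008) scores a hypothesis e' against an evidence hypothesis e with a weighted combination of its length and its n-gram matches, roughly $G(e, e') = \theta_0 |e'| + \sum_{n=1}^{4} \theta_n c_n(e, e')$, where $c_n(e, e')$ counts matching n-grams and $\theta_0, \ldots, \theta_4$ are the factors that the surrounding excerpts describe setting heuristically or tuning with MERT; the notation here is ours, not quoted from the papers.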
BLEU is mentioned in 20 sentences in this paper.
Topics mentioned in this paper:
Saluja, Avneesh and Hassan, Hany and Toutanova, Kristina and Quirk, Chris
Abstract
Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets.
Evaluation
We use case-insensitive BLEU (Papineni et al., 2002) to evaluate translation quality.
Evaluation
Table 4 presents the results of these variations; overall, by taking into account generated candidates appropriately and using bigrams (“SLP 2-gram”), we obtained a 1.13 BLEU gain on the test set.
Evaluation
HalfMono”, we use only half of the monolingual comparable corpora, and still obtain an improvement of 0.56 BLEU points, indicating that adding more monolingual data is likely to improve the system further.
Introduction
This enhancement alone results in an improvement of almost 1.4 BLEU points.
Introduction
We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models.
BLEU is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Li, Zhifei and Eisner, Jason and Khudanpur, Sanjeev
Abstract
We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008).
Experimental Results
Table 1: BLEU scores for Viterbi, Crunching, MBR, and variational decoding.
Experimental Results
Table 1 presents the BLEU scores under Viterbi, crunching, MBR, and variational decoding.
Experimental Results
Table 2 presents the BLEU results under different ways in using the variational models, as discussed in Section 3.2.3.
Introduction
We geometrically interpolate the resulting approximations q with one another (and with the original distribution p), justifying this interpolation as similar to the minimum-risk decoding for BLEU proposed by Tromble et al.
Variational Approximate Decoding
However, in order to score well on the BLEU metric for MT evaluation (Papineni et al., 2001), which gives partial credit, we would also like to favor lower-order n-grams that are likely to appear in the reference, even if this means picking some less-likely high-order n-grams.
Variational vs. Min-Risk Decoding
They use the following loss function, of which a linear approximation to BLEU (Papineni et al., 2001) is a special case,
BLEU is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang and Mi, Haitao and Feng, Yang and Liu, Qun
Abstract
Comparable to the state-of-the-art system combination technique, joint decoding achieves an absolute improvement of 1.5 BLEU points over individual decoding.
Experiments
We evaluated the translation quality using case-insensitive BLEU metric (Papineni et al., 2002).
Experiments
Table 2: Comparison of individual decoding and joint decoding in terms of decoding time (seconds/sentence) and BLEU score (case-insensitive).
Experiments
With conventional max-derivation decoding, the hierarchical phrase-based model achieved a BLEU score of 30.11 on the test set, with an average decoding time of 40.53 seconds/sentence.
Introduction
• As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
Introduction
Joint decoding with multiple models achieves an absolute improvement of 1.5 BLEU points over individual decoding with single models (Section 5).
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang and Lü, Yajuan and Liu, Qun
Abstract
Comparable to the state-of-the-art phrase-based system Moses, using packed forests in tree-to-tree translation results in a significant absolute improvement of 3.6 BLEU points over using 1-best trees.
Experiments
We evaluated the translation quality using the BLEU metric, as calculated by mteval-v11b.pl with its default setting except that we used case-insensitive matching of n-grams.
Experiments
[Table header: avg. # of trees, # of rules, BLEU]
Experiments
Table 3: Comparison of BLEU scores for tree-based and forest-based tree-to-tree models.
Introduction
Comparable to Moses, our forest-based tree-to-tree model achieves an absolute improvement of 3.6 BLEU points over conventional tree-based model.
BLEU is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Salloum, Wael and Elfardy, Heba and Alamir-Salloum, Linda and Habash, Nizar and Diab, Mona
Abstract
Our best result improves over the best single MT system baseline by 1.0% BLEU and over a strong system selection baseline by 0.6% BLEU on a blind test set.
Introduction
Our best system selection approach improves over our best baseline single MT system by 1.0% absolute BLEU point on a blind test set.
MT System Selection
We run the 5,562 sentences of the classification training data through our four MT systems and produce sentence-level BLEU scores (with length penalty).
MT System Selection
We pick the name of the MT system with the highest BLEU score as the class label for that sentence.
MT System Selection
When there is a tie in BLEU scores, we pick the label of the system that yields the better overall BLEU score among the tied systems.
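A hypothetical sketch of that labeling step (system names and scores are illustrative, not the authors' data): each sentence receives the label of the system with the highest sentence-level BLEU, and ties are broken by the tied systems' overall corpus-level BLEU.

```python
def label_sentence(sentence_bleu, overall_bleu):
    """sentence_bleu: {system: sentence-level BLEU for this sentence};
       overall_bleu:  {system: corpus-level BLEU of that system}."""
    best = max(sentence_bleu.values())
    tied = [s for s, b in sentence_bleu.items() if b == best]
    # Break ties using the tied systems' overall BLEU.
    return max(tied, key=lambda s: overall_bleu[s])

print(label_sentence({"sys_A": 32.1, "sys_B": 32.1, "sys_C": 30.0},
                     {"sys_A": 35.2, "sys_B": 36.0, "sys_C": 34.8}))  # -> sys_B
```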
Machine Translation Experiments
Feature weights are tuned to maximize BLEU on tuning sets using Minimum Error Rate Training (Och, 2003).
Machine Translation Experiments
Results are presented in terms of BLEU (Papineni et al., 2002).
Machine Translation Experiments
All differences in BLEU scores between the four systems are statistically significant above the 95% level.
BLEU is mentioned in 25 sentences in this paper.
Topics mentioned in this paper:
Salameh, Mohammad and Cherry, Colin and Kondrak, Grzegorz
Experimental Setup
We evaluate our system using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006).
Methods
This could improve translation quality, as it brings our training scenario closer to our test scenario (test BLEU is always measured on unsegmented references).
Related Work
We use both segmented and unsegmented language models, and tune automatically to optimize BLEU .
Related Work
(2008) also tune on unsegmented references by simply desegmenting SMT output before MERT collects sufficient statistics for BLEU .
Results
For English-to-Arabic, 1-best desegmentation results in a 0.7 BLEU point improvement over training on unsegmented Arabic.
Results
Moving to lattice desegmentation more than doubles that improvement, resulting in a BLEU score of 34.4 and an improvement of 1.0 BLEU point over 1-best desegmentation.
Results
1000-best desegmentation also works well, resulting in a 0.6 BLEU point improvement over 1-best.
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Riezler, Stefan and Simianer, Patrick and Haas, Carolin
Experiments
[Table fragment: recall, F1, and BLEU scores for the compared methods]
Experiments
[Table fragment: recall, F1, and BLEU scores for a second set of compared methods]
Experiments
Method 4, named REBOL, implements REsponse-Based Online Learning by instantiating y+ and y− to the form described in Section 4: in addition to the model score s, it uses a cost function c based on sentence-level BLEU (Nakov et al., 2012) and tests translation hypotheses for task-based feedback using a binary execution function e.
Response-based Online Learning
Computation of distance to the reference translation usually involves cost functions based on sentence-level BLEU (Nakov et al.
Response-based Online Learning
In addition, we can use translation-specific cost functions based on sentence-level BLEU in order to boost similarity of translations to human reference translations.
Response-based Online Learning
Our cost function c(y(i), y) = 1 − BLEU(y(i), y) is based on a version of sentence-level BLEU (Nakov et al.
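A minimal sketch of such a cost, assuming a simple add-one-smoothed sentence-level BLEU; the exact smoothing of Nakov et al. (2012) is not reproduced here, so the numbers are only illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing on n-gram precisions (an assumption)."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        overlap = sum((ngrams(hyp, n) & ngrams(ref, n)).values())
        total = max(len(hyp) - n + 1, 0)
        log_prec += math.log((overlap + 1.0) / (total + 1.0))  # smoothed precision
    bp = min(1.0, math.exp(1.0 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

def cost(hyp, ref):
    # c(y_i, y) = 1 - sentence-level BLEU, as in the description above.
    return 1.0 - sentence_bleu(hyp.split(), ref.split())

print(cost("the cat sat on the mat", "the cat is on the mat"))
```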
BLEU is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Hasegawa, Takayuki and Kaji, Nobuhiro and Yoshinaga, Naoki and Toyoda, Masashi
Experiments
Each utterance in the test data has more than one response that elicits the same goal emotion, because they are used to compute the BLEU score (see Section 5.3).
Experiments
We first use BLEU score (Papineni et al., 2002) to perform automatic evaluation (Ritter et al., 2011).
Experiments
In this evaluation, the system is provided with the utterance and the goal emotion in the test data and the generated responses are evaluated through BLEU score.
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Setiawan, Hendra and Zhou, Bowen and Xiang, Bing and Shen, Libin
Abstract
On NIST MT08 set, our most advanced model brings around +2.0 BLEU and -1.0 TER improvement.
Experiments
[Table header: BLEU and TER on MT08 newswire (nw) and MT08 web (wb)]
Experiments
The best TER and BLEU results on each genre are in bold.
Experiments
For BLEU , higher scores are better, while for TER, lower scores are better.
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Blunsom, Phil and Cohn, Trevor and Osborne, Miles
Discussion and Further Work
Hiero was MERT-trained on this set and has a 2% higher BLEU score compared to the discriminative model.
Discussion and Further Work
[Figure: development BLEU (%) vs. beam width]
Evaluation
Although there is no direct relationship between BLEU and likelihood, it provides a rough measure for comparing performance.
Evaluation
We also experimented with using max-translation decoding for standard MERT-trained translation models, finding that it had a small negative impact on BLEU score.
Evaluation
Figure 5 shows the relationship between beam width and development BLEU .
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Durrani, Nadir and Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut
Abstract
We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model) as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system.
Evaluation
[Table: BLEU by system: Pb0 14.3, Pb1 16.25, Pb2 16.13, M1 18.6, M2 17.05]
Evaluation
Both our systems (Model-1 and Model-2) beat the baseline phrase-based system with a BLEU point difference of 4.30 and 2.75 respectively.
Evaluation
The difference of 2.35 BLEU points between M1 and Pbl indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.
Final Results
This section shows the improvement in BLEU score by applying heuristics and combinations of heuristics in both the models.
Final Results
BLEU point improvement and combined with all the heuristics (M2H123) gives an overall gain of 1.95 BLEU points and is close to our best results (M1H12).
Final Results
One important issue that has not been investigated yet is that BLEU has not yet been shown to have good performance in morphologically rich target languages like Urdu, but there is no metric known to work better.
Introduction
Section 4 discusses the training data, parameter optimization and the initial set of experiments that compare our two models with a baseline Hindi-Urdu phrase-based system and with two transliteration-aided phrase-based systems in terms of BLEU scores
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Uszkoreit, Jakob and Brants, Thorsten
Abstract
We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
Conclusion
The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks.
Experiments
Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English.
Experiments
minimum error rate training (Och, 2003) with BLEU score as the objective function.
Experiments
Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task.
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Narayan, Shashi and Gardent, Claire
Experiments
To assess and compare simplification systems, two main automatic metrics have been used in previous work namely, BLEU and the Flesch-Kincaid Grade Level Index (FKG).
Experiments
BLEU gives a measure of how close a system’s output is to the gold standard simple sentence.
Experiments
Because there are many possible ways of simplifying a sentence, BLEU alone fails to correctly assess the appropriateness of a simplification.
Related Work
(2010) namely, an aligned corpus of 100/131 EWKP/SWKP sentences and show that they achieve better BLEU score.
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Foster, George and Sankaran, Baskaran and Sarkar, Anoop
Baselines
where m ranges over IN and OUT, p_m(ē|f̄) is an estimate from a component phrase table, and each λ_m is a weight in the top-level log-linear model, set so as to maximize dev-set BLEU using minimum error rate training (Och, 2003).
Conclusion & Future Work
We showed that this approach can gain up to 2.2 BLEU points over its concatenation baseline and 0.39 BLEU points over a powerful mixture model.
Ensemble Decoding
In Section 4.2, we compare the BLEU scores of different mixture operations on a French-English experimental setup.
Ensemble Decoding
However, experiments showed that replacing the scores with the normalized scores hurts the BLEU score drastically.
Ensemble Decoding
However, we did not try it, as the BLEU scores we got using the normalization heuristic were not promising and it would impose a cost in decoding as well.
Experiments & Results 4.1 Experimental Setup
Since the Hiero baseline results were substantially better than those of the phrase-based model, we also implemented the best-performing baseline, linear mixture, in our Hiero-style MT system, and in fact it achieves the highest BLEU score among all the baselines, as shown in Table 2.
Experiments & Results 4.1 Experimental Setup
This baseline is run three times and the score is averaged over the BLEU scores, with a standard deviation of 0.34.
Experiments & Results 4.1 Experimental Setup
We also reported the BLEU scores when we applied the span-wise normalization heuristic.
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Duan, Xiangyu and Zhang, Min and Li, Haizhou
Experiments and Results
Statistical significance in BLEU score differences was tested by paired bootstrap re-sampling (Koehn, 2004).
Experiments and Results
[Table: BLEU 0.4029 vs. 0.3146, NIST 7.0419 vs. 8.8462, METEOR 0.5785 vs. 0.5335 for the two compared systems]
Experiments and Results
Both SMP and ESSP outperform baseline consistently in BLEU , NIST and METEOR.
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Cahill, Aoife and Riester, Arndt
Abstract
We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features.
Conclusions
In comparison to a baseline model, we achieve statistically significant improvement in BLEU score.
Discussion
Given that we only looked at IS factors within a sentence, we think that such a significant improvement in BLEU and exact match scores is very encouraging.
Generation Ranking Experiments
[Table header: Model, BLEU, Match (%)]
Generation Ranking Experiments
We evaluate the string chosen by the log-linear model against the original treebank string in terms of exact match and BLEU score (Papineni et al.,
Generation Ranking Experiments
We achieve an improvement of 0.0168 BLEU points and 1.91 percentage points in exact match.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Cherry, Colin
Cohesive Decoding
Initially, we were not certain to what extent this feature would be used by the MERT module, as BLEU is not always sensitive to syntactic improvements.
Cohesive Phrasal Output
We tested this approach on our English-French development set, and saw no improvement in BLEU score.
Conclusion
Our experiments have shown that roughly 1/5 of our baseline English-French translations contain cohesion violations, and these translations tend to receive lower BLEU scores.
Conclusion
Our soft constraint produced improvements ranging between 0.5 and 1.1 BLEU points on sentences for which the baseline produces uncohesive translations.
Experiments
We first present our soft cohesion constraint’s effect on BLEU score (Papineni et al., 2002) for both our dev-test and test sets.
Experiments
First of all, looking across columns, we can see that there is a definite divide in BLEU score between our two evaluation subsets.
Experiments
Sentences with cohesive baseline translations receive much higher BLEU scores than those with uncohesive baseline translations.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Zhu, Conghui and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Abstract
The performance measured by BLEU is at least comparable to that of the traditional batch training method.
Conclusion and Future Work
The method assumes that a combined model is derived from a hierarchical Pitman-Yor process with each prior learned separately in each domain, and achieves BLEU scores competitive with traditional batch-based ones.
Experiment
The BLEU scores reported in this paper are the average of 5 independent runs of independent batch-MIRA weight training, as suggested by (Clark et al., 2011).
Experiment
In the IWSLT2012 data set, there is a huge gap between the HIT corpus and the BTEC corpus, and our method gains a 0.814 BLEU improvement.
Experiment
While the FBIS data set is artificially divided, with no clear human-assigned differences among sub-domains, our method loses 0.09 BLEU.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wu, Hua and Wang, Haifeng and Liu, Ting
Abstract
The experimental results show that our proposed approach achieves significant improvements of 1.6~3.6 points of BLEU in the oral domain and 0.5~1 points in the news domain.
Discussion
[Figure caption fragment: … on BLEU score]
Experiments
The metrics for automatic evaluation were BLEU and TER (Snover et al., 2005).
Experiments
(s0i, s1i) are selected for the extraction of paraphrase rules if two conditions are satisfied: (1) BLEU(e2i) − BLEU(e1i) > δ1, and (2) BLEU(e2i) > δ2, where BLEU(·) is a function for computing the BLEU score; δ1 and δ2 are thresholds for balancing the number of rules and the quality of the paraphrase rules.
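A trivial sketch of those two selection conditions; the threshold values used here are placeholders, not the paper's settings.

```python
def select_pair(bleu_e2, bleu_e1, delta_1=0.05, delta_2=0.20):
    """Keep a candidate pair only if e2 beats e1 by more than delta_1
    and e2's own BLEU exceeds delta_2 (both thresholds are illustrative)."""
    return (bleu_e2 - bleu_e1 > delta_1) and (bleu_e2 > delta_2)

print(select_pair(0.31, 0.22))  # True: passes both conditions
print(select_pair(0.18, 0.16))  # False: fails both conditions
```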
Experiments
Our system gains significant improvements of 1.6~3.6 points of BLEU in the oral domain, and 0.5~1 points of BLEU in the news domain.
Extraction of Paraphrase Rules
As mentioned above, the detailed procedure constructs T1, S1, and T2 in turn; finally we compute BLEU (Papineni et al.
Extraction of Paraphrase Rules
If the sentence in T2 has a higher BLEU score than the aligned sentence in T1, the corresponding sentences in S0 and S1 are selected as candidate paraphrase sentence pairs, which are used in the following steps of paraphrase extractions.
Introduction
The experimental results show that our proposed approach achieves significant improvements of 1.6~3.6 points of BLEU in the oral domain and 0.5~1 points in the news domain.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Wang, Xiaolin and Utiyama, Masao and Finch, Andrew and Sumita, Eiichiro
Abstract
Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on NTCIR PatentMT corpus which is out-of-domain.
Complexity Analysis
In this section, the proposed method is first validated on monolingual segmentation tasks, and then evaluated in the context of SMT to study whether the translation quality, measured by BLEU , can be improved.
Complexity Analysis
For the bilingual tasks, the publicly available system of Moses (Koehn et al., 2007) with default settings is employed to perform machine translation, and BLEU (Papineni et al., 2002) was used to evaluate the quality.
Complexity Analysis
It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings.
Introduction
• Improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Galley, Michel and Manning, Christopher D.
Abstract
Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets.
Conclusion and future work
We use dependency scores as an extra feature in our MT experiments, and found that our dependency model provides significant gains over a competitive baseline that incorporates a large 5-gram language model (0.92% TER and 0.45% BLEU absolute improvements).
Dependency parsing for machine translation
We found that dependency scores with or without loop elimination are generally close and highly correlated, and that MT performance without final loop removal was about the same (generally less than 0.2% BLEU ).
Introduction
In our experiments, we build a competitive baseline (Koehn et al., 2007) incorporating a 5-gram LM trained on a large part of Gigaword and show that our dependency language model provides improvements on five different test sets, with an overall gain of 0.92 in TER and 0.45 in BLEU scores.
Machine translation experiments
Parameter tuning was done with minimum error rate training (Och, 2003), which was used to maximize BLEU (Papineni et al., 2001).
Machine translation experiments
In the final evaluations, we report results using both TER (Snover et al., 2006) and the original BLEU metric as described in (Papineni et al., 2001).
Machine translation experiments
For BLEU evaluations, differences are significant in four out of six cases, and in the case of TER, all differences are significant.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Fang, Licheng and Xu, Peng and Wu, Xiaoyun
Abstract
Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system).
Experiments
To evaluate the translation results, we use BLEU (Papineni et al., 2002).
Experiments
On the English-Chinese data set, the improvement over the phrase-based system is 1.3 BLEU points, and 0.8 over the hierarchical phrase-based system.
Experiments
In the tasks of translating to European languages, the improvements over the phrase-based baseline are in the range of 0.5 to 1.0 BLEU points, and 0.3 to 0.5 over the hierarchical phrase-based system.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Wu, Xianchao and Matsuzaki, Takuya and Tsujii, Jun'ichi
Abstract
Extensive experiments involving large-scale English-to-Japanese translation revealed a significant improvement of 1.8 points in BLEU score, as compared with a strong forest-to-string baseline system.
Conclusion
Extensive experiments on large-scale English-to-Japanese translation resulted in a significant improvement in BLEU score of 1.8 points (p < 0.01), as compared with our implementation of a strong forest-to-string baseline system (Mi et al., 2008; Mi and Huang, 2008).
Experiments
[Table row: BLEU (%) of 26.15, 27.07, 27.93, and 28.89 for the compared configurations]
Experiments
Here, fw denotes function word, DT denotes the decoding time, and the BLEU scores were computed on the test set.
Experiments
the final BLEU scores of C3-T with Min-F and C3-F.
Introduction
(2008) achieved a 3.1-point improvement in BLEU score (Papineni et al., 2002) by including bilingual syntactic phrases in their forest-based system.
Introduction
Using the composed rules of the present study in a baseline forest-to-string translation system results in a 1.8-point improvement in the BLEU score for large-scale English-to-Japanese translation.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Haffari, Gholamreza and Sarkar, Anoop
AL-SMT: Multilingual Setting
The translation quality is measured by TQ for the individual systems M_{Fd→E}; it can be the BLEU score or WER/PER (word error rate and position-independent WER), which induces a maximization or minimization problem, respectively.
AL-SMT: Multilingual Setting
This process is continued iteratively until a certain level of translation quality is met (we use the BLEU score, WER and PER) (Papineni et al., 2002).
Experiments
The number of weights is 3 plus the number of source languages, and they are trained using minimum error-rate training (MERT) to maximize the BLEU score (Och, 2003) on a development set.
Experiments
[Figure: average BLEU score curves for the active-learning experiments]
Sentence Selection: Multiple Language Pairs
• Let e_c be the consensus among all the candidate translations, then define the disagreement as Σ_d α_d (1 − BLEU(e_c, e_d)).
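A minimal sketch of that disagreement term; `bleu_fn` stands in for any sentence-level BLEU implementation (such as the smoothed sketch earlier), and the toy overlap function below is only a placeholder so the snippet runs on its own.

```python
def disagreement(consensus, candidates, alphas, bleu_fn):
    """candidates: {d: candidate translation e_d}; alphas: {d: weight alpha_d};
       bleu_fn(hyp, ref) -> sentence-level BLEU in [0, 1]."""
    return sum(alphas[d] * (1.0 - bleu_fn(consensus, e_d))
               for d, e_d in candidates.items())

# Toy usage with a unigram-overlap stand-in for sentence-level BLEU.
toy_bleu = lambda h, r: len(set(h.split()) & set(r.split())) / max(len(set(r.split())), 1)
print(disagreement("the house is red",
                   {"de": "the house is red", "fr": "a red house"},
                   {"de": 0.5, "fr": 0.5}, toy_bleu))
```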
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Simianer, Patrick and Riezler, Stefan and Dyer, Chris
Experiments
Training data for discriminative learning are prepared by comparing a 100-best list of translations against a single reference using smoothed per-sentence BLEU (Liang et al., 2006a).
Experiments
Figure 4 gives a boxplot depicting BLEU-4 results for 100 runs of the MIRA implementation of the cdec package, tuned on dev-nc, and evaluated on the respective test set test-nc. We see a high variance (whiskers denote standard deviations) around a median of 27.2 BLEU and a mean of 27.1 BLEU.
Experiments
In contrast, the perceptron is deterministic when started from a zero-vector of weights and achieves favorable 28.0 BLEU on the news-commentary test set.
Joint Feature Selection in Distributed Stochastic Learning
Let each translation candidate be represented by a feature vector x 6 RD where preference pairs for training are prepared by sorting translations according to smoothed sentence-wise BLEU score (Liang et al., 2006a) against the reference.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Visweswariah, Karthik and Khapra, Mitesh M. and Ramanathan, Ananthakrishnan
Abstract
The data generated allows us to train a reordering model that gives an improvement of 1.8 BLEU points on the NIST MT—08 Urdu-English evaluation set over a reordering model that only uses manual word alignments, and a gain of 5.2 BLEU points over a standard phrase-based baseline.
Conclusion
Cumulatively, we see a gain of 1.8 BLEU points over a baseline reordering model that only uses manual word alignments, a gain of 2.0 BLEU points over a hierarchical phrase based system, and a gain of 5.2 BLEU points over a phrase based
Experimental setup
All experiments were done on Urdu-English and we evaluate reordering in two ways: Firstly, we evaluate reordering performance directly by comparing the reordered source sentence in Urdu with a reference reordering obtained from the manual word alignments using BLEU (Papineni et al., 2002) (we call this measure monolingual BLEU or mBLEU).
Experimental setup
Additionally, we evaluate the effect of reordering on our final systems for machine translation measured using BLEU .
Introduction
This results in a 1.8 BLEU point gain in machine translation performance on an Urdu-English machine translation task over a preordering model trained using only manual word alignments.
Introduction
In all, this increases the gain in performance by using the preordering model to 5.2 BLEU points over a standard phrase-based system with no preordering.
Results and Discussions
We see a significant gain of 1.8 BLEU points in machine translation by going beyond manual word alignments using the best reordering model reported in Table 3.
Results and Discussions
We also note a gain of 2.0 BLEU points over a hierarchical phrase based system.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min and Aw, Aiti and Li, Haizhou
Analysis
• The constituent boundary matching feature (CBMF) is a very important feature, which by itself achieves significant improvement over the baseline (up to 1.13 BLEU).
Analysis
5.2 Beyond BLEU
Analysis
Since BLEU is not sufficient
Experiments
Statistical significance in BLEU score differences was tested by paired bootstrap re-sampling (Koehn, 2004).
Experiments
Like (Marton and Resnik, 2008), we find that the XP+ feature obtains a significant improvement of 1.08 BLEU over the baseline.
Experiments
However, using all syntax-driven features described in section 3.2, our SDB models achieve larger improvements of up to 1.67 BLEU .
Introduction
Our experimental results display that our SDB model achieves a substantial improvement over the baseline and significantly outperforms XP+ according to the BLEU metric (Papineni et al., 2002).
Introduction
In addition, our analysis shows further evidences of the performance gain from a different perspective than that of BLEU .
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Smith, Jason R. and Saint-Amand, Herve and Plamada, Magdalena and Koehn, Philipp and Callison-Burch, Chris and Lopez, Adam
Abstract
Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU .
Abstract
On general domain and speech translation tasks where test conditions substantially differ from standard government and news training text, web-mined training data improves performance substantially, resulting in improvements of up to 1.5 BLEU on standard test sets, and 5 BLEU on test sets outside of the news domain.
Abstract
For all language pairs and both test sets (WMT 2011 and WMT 2012), we show an improvement of around 0.5 BLEU .
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Liu, Le and Hong, Yu and Liu, Hao and Wang, Xing and Yao, Jianmin
Abstract
When the selected sentence pairs are evaluated on an end-to-end MT task, our methods can increase the translation performance by 3 BLEU points.
Conclusion
Compared with the methods which only employ language model for data selection, we observe that our methods are able to select high-quality do-main-relevant sentence pairs and improve the translation performance by nearly 3 BLEU points.
Experiments
The BLEU scores of the In-domain and General-domain baseline system are listed in Table 2.
Experiments
The results show that General-domain system trained on a larger amount of bilingual resources outperforms the system trained on the in-domain corpus by over 12 BLEU points.
Experiments
The horizontal coordinate represents the number of selected sentence pairs and vertical coordinate is the BLEU scores of MT systems.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Weller, Marion and Fraser, Alexander and Schulte im Walde, Sabine
Experiments and evaluation
We present three types of evaluation: BLEU scores (Papineni et al., 2001), prediction accuracy on clean data and a manual evaluation of the best system in section 5.3.
Experiments and evaluation
Table 5 gives results in case-insensitive BLEU .
Experiments and evaluation
While the inflection prediction systems (1-4) are significantly better than the surface-form system (0), the different versions of the inflection systems are not distinguishable in terms of BLEU; however, our manual evaluation shows that the new features have a positive impact on translation quality.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Hermjakob, Ulf and Knight, Kevin and Daumé III, Hal
Discussion
At the same time, there has been no negative impact on overall quality as measured by BLEU .
End-to-End results
To make sure our name transliterator does not degrade the overall translation quality, we evaluated our base SMT system with BLEU , as well as our transliteration-augmented SMT system.
End-to-End results
The BLEU scores for the two systems were 50.70 and 50.96 respectively.
Evaluation
General MT metrics such as BLEU , TER, METEOR are not suitable for evaluating named entity translation and transliteration, because they are not focused on named entities (NEs).
Integration with SMT
In a tuning step, the Minimum Error Rate Training component of our SMT system iteratively adjusts the set of rule weights, including the weight associated with the transliteration feature, such that the English translations are optimized with respect to a set of known reference translations according to the BLEU translation metric.
Introduction
First, although names are important to human readers, automatic MT scoring metrics (such as BLEU ) do not encourage researchers to improve name translation in the context of MT.
Introduction
A secondary goal is to make sure that our overall translation quality (as measured by BLEU ) does not degrade as a result of the name-handling techniques we introduce.
Introduction
• We evaluate both the base SMT system and the augmented system in terms of entity translation accuracy and BLEU (Sections 2 and 6).
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Feng, Minwei and Peter, Jan-Thorsten and Ney, Hermann
Abstract
Results on five Chinese-English NIST tasks show that our model improves the baseline system by 1.32 BLEU and 1.53 TER on average.
Conclusion
Experimental results show that our model is stable and improves the baseline system by 0.98 BLEU and 1.21 TER (trained by CRFs) and 1.32 BLEU and 1.53 TER (trained by RNN).
Experiments
• BLEU (Papineni et al., 2001) and TER (Snover et al., 2005): all reported scores are calculated on lowercased output.
Experiments
An Index column is added for score reference convenience (B for BLEU ; T for TER).
Experiments
For the proposed model, significance testing results on both BLEU and TER are reported (B2 and B3 compared to B1, T2 and T3 compared to T1).
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Sennrich, Rico and Schwenk, Holger and Aransa, Walid
Abstract
Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.
Translation Model Architecture
We found that this had no significant effects on BLEU .
Translation Model Architecture
We report translation quality using BLEU (Papineni et
Translation Model Architecture
For the IT test set, the system with gold labels and TM adaptation yields an improvement of 0.7 BLEU (21.1 → 21.8), LM adaptation yields 1.3 BLEU (21.1 → 22.4), and adapting both models outperforms the baseline by 2.1 BLEU (21.1 → 23.2).
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wang, Haifeng and Guo, Yuqing and Liu, Ting
Abstract
Trained on 8,975 dependency structures of a Chinese Dependency Treebank, the realizer achieves a BLEU score of 0.8874.
Experiments
In addition to BLEU score, percentage of exactly matched sentences and average NIST simple string accuracy (SSA) are adopted as evaluation metrics.
Experiments
We observe that the BLEU score is boosted from 0.1478 to 0.5943 by using the RPD method.
Experiments
All of the four feature functions we have tested achieve considerable improvement in BLEU scores.
Log-linear Models
BLEU score, a method originally proposed to automatically evaluate machine translation quality (Papineni et al., 2002), has been widely used as a metric to evaluate general-purpose sentence generation (Langkilde, 2002; White et al., 2007; Guo et al.
Log-linear Models
The BLEU measure computes the geometric mean of the precision of n-grams of various lengths between a sentence realization and a (set of) reference(s).
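Concretely, the standard form of the measure described here (Papineni et al., 2002) combines the modified n-gram precisions p_n (uniform weights w_n = 1/N, typically N = 4) with a brevity penalty BP computed from the candidate length c and reference length r:

```latex
\mathrm{BLEU} \;=\; \mathrm{BP}\cdot\exp\!\Bigl(\sum_{n=1}^{N} w_n \log p_n\Bigr),
\qquad
\mathrm{BP} \;=\;
\begin{cases}
1 & \text{if } c > r,\\
e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```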
Log-linear Models
The BLEU scoring script is supplied by the NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Tu, Mei and Zhou, Yu and Zong, Chengqing
Experiments
In Table 3, almost all BLEU scores are improved, no matter what strategy is used.
Experiments
In particular, the best performance marked in bold is as high as 1.24, 0.94, and 0.82 BLEU points, respectively, over the baseline system on NIST04, CWMT08 Development, and CWMT08 Evaluation data.
Related Work
They added the labels assigned to connectives as an additional input to an SMT system, but their experimental results show that the improvements under the evaluation metric of BLEU were not significant.
Related Work
To the best of our knowledge, our work is the first attempt to exploit the source functional relationship to generate the target transitional expressions for grammatical cohesion, and we have successfully incorporated the proposed models into an SMT system with significant improvement of BLEU metrics.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Li, Junhui and Marton, Yuval and Resnik, Philip and Daumé III, Hal
Discussion
Table 6: Performance gain in BLEU over baseline and MR08 systems averaged over all test sets.
Discussion
Table 9: Performance ( BLEU score) comparison between non-oracle and oracle experiments.
Experiments
We use the NIST MT 06 dataset (1664 sentence pairs) for tuning, and the NIST MT 03, 05, and 08 datasets (919, 1082, and 1357 sentence pairs, respectively) for evaluation. We use BLEU (Papineni et al., 2002) for both tuning and evaluation.
Experiments
Our first group of experiments investigates whether the syntactic reordering models are able to improve translation quality in terms of BLEU .
Experiments
Table 5: System performance in BLEU scores.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Lo, Chi-kiu and Wu, Dekai
Abstract
As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU , which fail to properly evaluate adequacy, become more apparent.
Abstract
We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the nonautomatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacy judgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER.
Abstract
We argue that BLEU (Papineni et al., 2002) and other automatic n- gram based MT evaluation metrics do not adequately capture the similarity in meaning between the machine translation and the reference translation—which, ultimately, is essential for MT output to be useful.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Mason, Rebecca and Charniak, Eugene
Our Approach
Figure 1: BLEU scores vs. k for SumBasic extraction.
Our Approach
Although BLEU (Papineni et al., 2002) scores are widely used for image caption evaluation, we find them to be poor indicators of the quality of our model.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Espinosa, Dominic and White, Michael and Mehay, Dennis
Conclusion
We have also shown that, by integrating this hypertagger with a broad-coverage CCG chart realizer, considerably faster realization times are possible (approximately twice as fast as compared with a realizer that performs simple lexical lookups) with higher BLEU , METEOR and exact string match scores.
Conclusion
Moreover, the hypertagger-augmented realizer finds more than twice the number of complete realizations, and further analysis revealed that the realization quality (as per modified BLEU and METEOR) is higher in the cases when the realizer finds a complete realization.
Introduction
Moreover, the overall BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007) scores, as well as the numbers of exact string matches (as measured against the original sentences in the CCGbank), are higher for the hypertagger-seeded realizer than for the preexisting realizer.
Results and Discussion
Table 5 shows that increasing the number of complete realizations also yields improved BLEU and METEOR scores, as well as more exact matches.
Results and Discussion
In particular, the hypertagger makes possible a more than 6-point improvement in the overall BLEU score on both the development and test sections, and a more than 12-point improvement on the sentences with complete realizations.
Results and Discussion
Even with the current incomplete set of semantic templates, the hypertagger brings realizer performance roughly up to state-of-the-art levels, as our overall test set BLEU score (0.6701) slightly exceeds that of Cahill and van Genabith (2006), though at a coverage of 96% instead of 98%.
The Approach
compared the percentage of complete realizations (versus fragmentary ones) with their top scoring model against an oracle model that uses a simplified BLEU score based on the target string, which is useful for regression testing as it guides the best-first search to the reference sentence.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Li, Chi-Ho and Li, Mu and Zhou, Ming
Conclusion and Future Work
In this paper, we only tried Dice coefficient of n-grams and symmetrical sentence level BLEU as similarity measures.
Experiments and Results
Instead of using graph-based consensus confidence as features in the log-linear model, we perform structured label propagation (Struct-LP) to re-rank the n-best list directly, and the similarity measures for source sentences and translation candidates are symmetrical sentence level BLEU (equation (10)).
Features and Training
defined in equation (3), takes symmetrical sentence-level BLEU as the similarity measure:
Features and Training
The symmetrical sentence-level BLEU is defined in equation (10), where i-BLEU(f, f′) is the IBM BLEU score computed over i-grams for hypothesis f using f′ as reference.
Features and Training
BLEU is not symmetric: different scores are obtained depending on which sentence is the reference and which is the hypothesis.
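Since the exact combination in equation (10) is not legible here, the following is only an assumed sketch of one natural symmetrization, averaging the two asymmetric directions; `bleu_fn` stands in for any sentence-level BLEU implementation, and the toy overlap function is a placeholder so the snippet runs standalone.

```python
def symmetric_sentence_bleu(f, f_prime, bleu_fn):
    """Average the two asymmetric directions of a sentence-level BLEU function."""
    return 0.5 * (bleu_fn(f, f_prime) + bleu_fn(f_prime, f))

# Toy usage with a unigram-overlap stand-in for sentence-level BLEU.
toy_bleu = lambda hyp, ref: (len(set(hyp.split()) & set(ref.split()))
                             / max(len(set(ref.split())), 1))
print(symmetric_sentence_bleu("the house is red", "a red house", toy_bleu))
```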
Graph Construction
In our experiment we measure similarity by symmetrical sentence level BLEU of source sentences, and 0.3 is taken as the threshold for edge creation.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Yang, Nan and Li, Mu and Zhang, Dongdong and Yu, Nenghai
Conclusion and Future Work
Large scale experiment shows improvement on both reordering metric and SMT performance, with up to 1.73 point BLEU gain in our evaluation test.
Experiments
Table 2: BLEU (%) scores on dev and test data for both the E-J and J-E experiments.
Experiments
We compare their influence on RankingSVM accuracy, alignment crossing-link number, end-to-end BLEU score, and the model size.
Experiments
[Table: RankingSVM accuracy / CLN / BLEU / Feat.#]
E-J: tag+label 88.6 / 16.4 / 22.24 / 26k; +dst 91.5 / 13.5 / 22.66 / 55k; +pct 92.2 / 13.1 / 22.73 / 79k; +lex100 92.9 / 12.1 / 22.85 / 347k; +lex1000 94.0 / 11.5 / 22.79 / 2,410k; +lex2000 95.2 / 10.7 / 22.81 / 3,794k.
J-E: tag+fw 85.0 / 18.6 / 25.43 / 31k; +dst 90.3 / 16.9 / 25.62 / 65k; +lex100 91.6 / 15.7 / 25.87 / 293k; +lex1000 92.4 / 14.8 / 25.91 / 2,156k; +lex2000 93.0 / 14.3 / 25.84 / 3,297k.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min and Li, Haizhou
Conclusions and Future Work
EXperimental results show that both models are able to significantly improve translation accuracy in terms of BLEU score.
Experiments
Statistical significance in BLEU differences
Experiments
Our first group of experiments is to investigate whether the predicate translation model is able to improve translation accuracy in terms of BLEU and whether semantic features are useful.
Experiments
• The proposed predicate translation models achieve an average improvement of 0.57 BLEU points across the two NIST test sets when all features (lex+sem) are used.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Talbot, David and Brants, Thorsten
Experiments
Table 5 shows baseline translation BLEU scores for a lossless (non-randomized) language model with parameter values quantized into 5 to 8 bits.
Experiments
Table 5: Baseline BLEU scores with lossless n-gram model and different quantization levels (bits).
Experiments
Figure 3: BLEU scores on the MT05 data set.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min
Conclusion
• The sense-based translation model is able to substantially improve translation quality in terms of both BLEU and NIST.
Experiments
[Table: System / BLEU (%) / NIST: STM (i5w) 34.64 / 9.4346; STM (i10w) 34.76 / 9.5114; STM (i15w) - / -]
Experiments
[Table: System / BLEU (%) / NIST: Base 33.53 / 9.0561; STM (sense) 34.15 / 9.2596; STM (sense+lexicon) 34.73 / 9.4184]
Experiments
[Table: System / BLEU (%) / NIST: Base 33.53 / 9.0561; Reformulated WSD 34.16 / 9.3820; STM 34.73 / 9.4184]
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Zhang, Jiajun and Zong, Chengqing
Experiments
We use BLEU (Papineni et al., 2002) score with shortest length penalty as the evaluation metric and apply the pairwise re-sampling approach (Koehn, 2004) to perform the significance test.
Experiments
We can see from the table that the domain lexicon is much helpful and significantly outperforms the baseline with more than 4.0 BLEU points.
Experiments
When it is enhanced with the in-domain language model, it can further improve the translation performance by more than 2.5 BLEU points.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Echizen-ya, Hiroshi and Araki, Kenji
Experiments
To confirm the effectiveness of noun-phrase chunking, we performed the experiment using a system combining BLEU with our method.
Experiments
In this case, BLEU scores were used as score_wd in Eq.
Experiments
This experimental result is shown as “BLEU with our method” in Tables 2—5.
Introduction
Methods based on word strings (e.g., BLEU (Papineni et al., 2002), NIST (NIST, 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin and Och, 2004),
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Eidelman, Vladimir and Boyd-Graber, Jordan and Resnik, Philip
Abstract
Conditioning lexical probabilities on the topic biases translations toward topic-relevant output, resulting in significant improvements of up to 1 BLEU and 3 TER on Chinese to English translation over a strong baseline.
Experiments
2010) as our decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 tuning corpus using the Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006; Eidelman, 2012).
Experiments
On FBIS, we can see that both models achieve moderate but consistent gains over the baseline on both BLEU and TER.
Experiments
The best model, LTM-10, achieves a gain of about 0.5 and 0.6 BLEU and 2 TER.
Introduction
Incorporating these features into our hierarchical phrase-based translation system significantly improved translation performance, by up to 1 BLEU and 3 TER over a strong Chinese to English baseline.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Cai, Jingsheng and Utiyama, Masao and Sumita, Eiichiro and Zhang, Yujie
Abstract
We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data.
Conclusion
The results showed that our approach achieved a BLEU score gain of 1.61.
Dependency-based Pre-ordering Rule Set
In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set.
Experiments
[Table caption fragment: … the performance (BLEU) on the test set, the total …]
Experiments
For evaluation, we used BLEU scores (Papineni et al., 2002).
Experiments
It shows the BLEU scores on the test set and the statistics of pre-ordering on the training set, which includes the total count of each rule set and the number of sentences they were applied to.
Introduction
Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Zarriess, Sina and Cahill, Aoife and Kuhn, Jonas
Conclusion
This strategy leads to a better balanced distribution of the alternations in the training data, such that our linguistically informed generation ranking model achieves high BLEU scores and accurately predicts active and passive.
Experimental Setup
[Table rows: Match 15.45 / 15.04 / 11.89; LM BLEU 0.68 / 0.68 / 0.65 for the three compared settings]
Experimental Setup
[Table rows: Model BLEU 0.764 / 0.759 / 0.747; NIST 13.18 / 13.14 / 13.01 for the three compared settings]
Experimental Setup
use several standard measures: a) exact match: how often does the model select the original corpus sentence, b) BLEU: n-gram overlap between top-ranked and original sentence, c) NIST: modification of BLEU giving more weight to less frequent n-grams.
Experiments
The differences in BLEU between the candidate sets and models are
Experiments
Its BLEU score and match accuracy decrease only slightly (though statistically significantly).
Experiments
[Table header: Features | Match, BLEU | Voice Prec.]
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Riesa, Jason and Marcu, Daniel
Abstract
Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system.
Conclusion
We treat word alignment as a parsing problem, and by taking advantage of English syntax and the hypergraph structure of our search algorithm, we report significant increases in both F-measure and BLEU score over standard baselines in use by most state-of-the-art MT systems today.
Experiments
[Table fragment: alignment F-measure, BLEU, and word counts for the compared systems]
Experiments
Our hypergraph alignment algorithm allows us a 1.1 BLEU increase over the best baseline system, Model-4 grow-diag-final.
Experiments
We also report a 2.4 BLEU increase over a system trained with alignments from Model-4 union.
Related Work
Very recent work in word alignment has also started to report downstream effects on BLEU score.
Related Work
(2009) confirm and extend these results, showing BLEU improvement for a hierarchical phrase-based MT system on a small Chinese corpus.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Tamura, Akihiro and Watanabe, Taro and Sumita, Eiichiro and Takamura, Hiroya and Okumura, Manabu
Abstract
Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model.
Experiment
Table 1: Performance on Japanese-to-English Translation Measured by BLEU (%)
Experiment
Table 1 shows the performance for the test data measured by case sensitive BLEU (Papineni et al., 2002).
Experiment
Under the Moses phrase-based SMT system (Koehn et al., 2007) with the default settings, we achieved a 26.80% BLEU score.
Introduction
Further, our independent model achieves a more than 1 point gain in BLEU , which resolves the sparseness problem introduced by the bi-word observations.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhang, Chunliang
Abstract
We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data.
Conclusion and Future Work
The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data.
Evaluation
Table 1 shows the case-insensitive IBM-version BLEU and TER scores of different systems.
Evaluation
As seen from row -lmT of Table 1, the removal of the skeletal language model results in a significant drop in both BLEU and TER performance.
Evaluation
Row s-space of Table 1 shows the BLEU and TER results of restricting the baseline system to the space of skeleton-consistent derivations, i.e., we remove both the skeleton-based translation model and language model from the SBMT system.
Introduction
• We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Lo, Chi-kiu and Beloucif, Meriem and Saers, Markus and Wu, Dekai
Introduction
In addition, the translation adequacy across different genres (ranging from formal news to informal web forum and public speech) and different languages (English and Chinese) is improved by replacing BLEU or TER with MEANT during parameter tuning (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b).
Related Work
Surface-form oriented metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) do not correctly reflect the meaning similarities of the input sentence.
Related Work
In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy.
Related Work
TINE (Rios et al., 2011) is a recall-oriented metric which aims to preserve the basic event structure but it performs comparably to BLEU and worse than METEOR on correlation with human adequacy judgments.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Cui, Lei and Zhang, Dongdong and Liu, Shujie and Chen, Qiming and Li, Mu and Zhou, Ming and Yang, Muyun
Experiments
The reported BLEU scores are averaged over 5 times of running MERT (Och, 2003).
Experiments
We illustrate the relationship among translation accuracy ( BLEU ), the number of retrieved documents (N) and the length of hidden layers (L) on different testing datasets.
Experiments
Figure 3: End-to-end translation results ( BLEU %)
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Zhang, Jiajun and Liu, Shujie and Li, Mu and Zhou, Ming and Zong, Chengqing
Experiments
Case-insensitive BLEU is employed as the evaluation metric.
Experiments
Specifically, the Significance algorithm can safely discard 64% of the phrase table at its threshold 12 with only 0.1 BLEU loss in the overall test.
Experiments
In contrast, our BRAE-based algorithm can remove 72% of the phrase table at its threshold 0.7 with only 0.06 BLEU loss in the overall evaluation.
Introduction
The experiments show that up to 72% of the phrase table can be discarded without significant decrease on the translation quality, and in decoding with phrasal semantic similarities up to 1.7 BLEU score improvement over the state-of-the-art baseline can be achieved.
Related Work
(2013) also use bag-of-words but learn BLEU sensitive phrase embeddings.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Gyawali, Bikash and Gardent, Claire
Conclusion
We observed that this often fails to return the best output in terms of BLEU score, fluency, grammaticality and/or meaning.
Results and Discussion
Figure 6: BLEU scores and Grammar Size (Number of Elementary TAG trees)
Results and Discussion
The average BLEU score is given with respect to all input (All) and to those inputs for which the systems generate at least one sentence (Covered).
Results and Discussion
In terms of BLEU score, the best version of our system (AUTEXP) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results close to those of the fully handcrafted UDEL system.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Li, Zhifei and Yarowsky, David
Experimental Results
The feature functions are combined under a log-linear framework, and the weights are tuned by the minimum-error-rate training (Och, 2003) using BLEU (Papineni et al., 2002) as the optimization metric.
Experimental Results
This precision is extremely high because the BLEU score (precision with brevity penalty) that one obtains for a Chinese sentence is normally between 30% and 50%.
Experimental Results
4.5.2 BLEU on NIST MT Test Sets
Introduction
We carry out experiments on a state-of-the-art SMT system, i.e., Moses (Koehn et al., 2007), and show that the abbreviation translations consistently improve the translation performance (in terms of BLEU (Papineni et al., 2002)) on various NIST MT test sets.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Li, Chi-Ho and Zhou, Ming
Abstract
On top of the pruning framework, we also propose a discriminative ITG alignment model using hierarchical phrase pairs, which improves both F-score and Bleu score over the baseline alignment system of GIZA++.
Evaluation
Finally, we also do end-to-end evaluation using both F-score in alignment and Bleu score in translation.
Evaluation
HP-DITG using DPDI achieves the best Bleu score with acceptable time cost.
Evaluation
It shows that HP-DITG (with DPDI) is better than the three baselines both in alignment F-score and Bleu score.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Foster, George and Kuhn, Roland
Analysis and Discussion
Table 4: Results (BLEU %) of the Chinese-to-English large data (CE_LD) and small data (CE_SD) NIST tasks by applying one feature.
Analysis and Discussion
Table 5: Results ( BLEU %) for combination of two similarity scores.
Analysis and Discussion
Table 6: Results ( BLEU %) of using simple features based on context on small data NIST task.
Experiments
Our evaluation metric is IBM BLEU (Papineni et al., 2002), which performs case-insensitive matching of n- grams up to n = 4.
Experiments
Table 2: Results ( BLEU %) of small data Chinese-to-English NIST task.
Experiments
Table 3: Results (BLEU %) of the large data Chinese-to-English NIST task and the German-to-English WMT task.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
liu, lemao and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Introduction
In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e′, d′)), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e′, then the model score of (e*, d*) with this weight will also be greater than that of (e′, d′). In this paper, a pair (e*, e′) for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
Introduction
to that of Moses: on the NIST05 test set, L-Hiero achieves a BLEU score of 25.1 and Moses achieves 24.8.
Introduction
Since both MERT and PRO tuning toolkits involve randomness in their implementations, all BLEU scores reported in the experiments are the average of five tuning runs, as suggested by Clark et al.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Cohn, Trevor and Haffari, Gholamreza
Experiments
Hence the BLEU scores we get for the baselines may appear lower than those reported in the literature.
Experiments
Using the factorised alignments directly in a translation system resulted in a slight loss in BLEU versus using the un-factorised alignments.
Experiments
We use minimum error rate training (Och, 2003) with nbest list size 100 to optimize the feature weights for maximum development BLEU .
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Neubig, Graham and Watanabe, Taro and Sumita, Eiichiro and Mori, Shinsuke and Kawahara, Tatsuya
Experimental Evaluation
For most models, while likelihood continued to increase gradually for all 100 iterations, BLEU score gains plateaued after 5-10 iterations, likely due to the strong prior information.
Experimental Evaluation
It can also be seen that combining phrase tables from multiple samples improved the BLEU score for HLEN, but not for HIER.
Experimental Evaluation
BLEU
Flat ITG Model
The average gain across all data sets was approximately 0.8 BLEU points.
Hierarchical ITG Model
(2003) that using phrases where max(|e|, |f|) ≤ 3 causes significant improvements in BLEU score, while using larger phrases results in diminishing returns.
Introduction
We also find that it achieves superior BLEU scores over previously proposed ITG-based phrase alignment approaches.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Hirao, Tsutomu and Suzuki, Jun and Isozaki, Hideki
Experimental Evaluation
For MCE learning, we selected the reference compression that maximizes the BLEU score (Papineni et al., 2002) (= argmax_{r ∈ R} BLEU(r, R \ r)) from the set of reference compressions and used it as correct data for training.
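A small sketch of that selection step, assuming a sentence_bleu(hyp, refs) helper is available (the names are placeholders, not the paper's code): each candidate reference r is scored against the remaining references R \ r and the argmax is kept as the training target.

    def best_reference(references, sentence_bleu):
        # references: the set R of reference compressions for one sentence (token lists)
        # sentence_bleu(hyp, refs): BLEU of hyp measured against a list of references
        best, best_score = None, -1.0
        for i, r in enumerate(references):
            others = references[:i] + references[i + 1:]   # R \ r
            score = sentence_bleu(r, others)
            if score > best_score:
                best, best_score = r, score
        return best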
Experimental Evaluation
For automatic evaluation, we employed BLEU (Papineni et al., 2002) by following (Unno et al., 2006).
Experimental Evaluation
Label / BLEU: Proposed .679; w/o PLM .617; w/o IPTW .635; Hori— .493
Results and Discussion
Our method achieved the highest BLEU score.
Results and Discussion
For example, ‘w/o PLM + Dep’ achieved the second highest BLEU score.
Results and Discussion
Compared to ‘Hori—’, ‘Hori’ achieved a significantly higher BLEU score.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Knight, Kevin
Machine Translation as a Decipherment Task
Evaluation: All the MT systems are run on the Spanish test data and the quality of the resulting English translations is evaluated using two different measures—(1) Normalized edit distance score (Navarro, 2001), and (2) BLEU (Papineni et al., 2002).
Machine Translation as a Decipherment Task
The figure also shows the corresponding BLEU scores in parentheses for comparison (higher scores indicate better MT output).
Machine Translation as a Decipherment Task
Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Xiao, Xinyan and Xiong, Deyi and Zhang, Min and Liu, Qun and Lin, Shouxun
Experiments
Is our topic similarity model able to improve translation quality in terms of BLEU ?
Experiments
Case-insensitive NIST BLEU (Papineni et al., 2002) was used to measure translation quality.
Experiments
By using all the features (last line in the table), we improve the translation performance over the baseline system by 0.87 BLEU point on average.
Introduction
Experiments on Chinese-English translation tasks (Section 6) show that, our method outperforms the baseline hierarchial phrase-based system by +0.9 BLEU points.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Pauls, Adam and Klein, Dan
Experiments
The BLEU scores for these outputs are 32.7, 27.8, and 20.8.
Experiments
In particular, their translations had a lower BLEU score, making their task easier.
Experiments
We see that our system prefers the reference much more often than the S-GRAM language model.11 However, we also note that the easiness of the task is correlated with the quality of translations (as measured in BLEU score).
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Yang, Nan and Li, Mu and Zhou, Ming
Abstract
Experiments on a Chinese to English translation task show that our proposed RZNN can outperform the state-of-the-art baseline by about 1.5 points in BLEU .
Conclusion and Future Work
We conduct experiments on a Chinese-to-English translation task, and our method outperforms a state-of-the-art baseline by about 1.5 BLEU points.
Experiments and Results
When we remove it from RZNN, WEPPE based method drops about 10 BLEU points on development data and more than 6 BLEU points on test data.
Experiments and Results
TCBPPE based method drops about 3 BLEU points on both development and test data sets.
Introduction
We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we get about 1.5 BLEU points improvement, compared with a state-of-the-art baseline system.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Hu, Yuening and Zhai, Ke and Eidelman, Vladimir and Boyd-Graber, Jordan
Abstract
We evaluate our model on a Chinese to English translation task and obtain up to 1.2 BLEU improvement over strong baselines.
Experiments
We refer to the SMT model without domain adaptation as baseline.5 LDA marginally improves machine translation (less than half a BLEU point).
Experiments
These improvements are not redundant: our new ptLDA-dict model, which has aspects of both models yields the best performance among these approaches—up to a 1.2 BLEU point gain (higher is better), and -2.6 TER improvement (lower is better).
Experiments
The BLEU improvement is significant (Koehn, 2004) at p = 0.01, except on MT03 with variational and variational-hybrid inference.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hui and Zhang, Min and Li, Haizhou and Aw, Aiti and Tan, Chew Lim
Experiment
Model / BLEU (%): Moses 25.68; TT2S 26.08; TTS2S 26.95; FT2S 27.66; FTS2S 28.83
Experiment
The 9% tree sequence rules contribute 1.17 BLEU score improvement (28.83-27.66 in Table 1) to FTS2S over FT2S.
Experiment
BLEU (%), N-best \ model (FT2S / FTS2S): 100-best 27.40 / 28.61; 500-best 27.66 / 28.83; 2500-best 27.66 / 28.96; 5000-best 27.79 / 28.89
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Sun, Jun and Zhang, Min and Tan, Chew Lim
Experiments
System / Model / BLEU: Moses (cBP) 23.86; STSSG 25.92; SncTSSG 26.53
Experiments
ID / Rule Set / BLEU: 1 CR (STSSG) 25.92; 2 CR w/o ncPR 25.87; 3 CR w/o ncPR + tgtncR 26.14; 4 CR w/o ncPR + srcncR 26.50; 5 CR w/o ncPR + src&tgtncR 26.51; 6 CR + tgtncR 26.11; 7 CR + srcncR 26.56; 8 CR + src&tgtncR (SncTSSG) 26.53
Experiments
2) Not only that, after comparing Exp 6,7,8 against Exp 3,4,5 respectively, we find that the ability of rules derived from noncontiguous tree sequence pairs generally covers that of the rules derived from the contiguous tree sequence pairs, due to the slight change in BLEU score.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Li, Junhui and Tu, Zhaopeng and Zhou, Guodong and van Genabith, Josef
Abstract
Experiments on Chinese—English translation on four NIST MT test sets show that the HD—HPB model significantly outperforms Chiang’s model with average gains of 1.91 points absolute in BLEU .
Experiments
For evaluation, the NIST BLEU script (version 12) with the default settings is used to calculate the BLEU scores.
Experiments
Table 3 lists the translation performance with BLEU scores.
Experiments
Table 3 shows that our HD-HPB model significantly outperforms Chiang’s HPB model with an average improvement of 1.91 in BLEU (and similar improvements over Moses HPB).
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Pado, Sebastian and Galley, Michel and Jurafsky, Dan and Manning, Christopher D.
Abstract
We compare this metric against a combination metric of four state—of—the—art scores ( BLEU , NIST, TER, and METEOR) in two different settings.
Experimental Evaluation
BLEUR includes the following 18 sentence-level scores: BLEU-n and n-gram precision scores (1 ≤ n ≤ 4); BLEU brevity penalty (BP); BLEU score divided by BP.
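A sketch of how a per-sentence feature vector could be assembled from the BLEU components named in this extract (only those components; the full 18-feature set is not reproduced, and the helper functions are assumed rather than taken from the paper):

    def bleu_feature_vector(hyp, ref, ngram_precision, brevity_penalty, bleu_n):
        # ngram_precision(hyp, ref, n): clipped n-gram precision
        # bleu_n(hyp, ref, n): BLEU computed over orders 1..n; bleu_n(hyp, ref, 4) is standard BLEU
        feats = {}
        for n in range(1, 5):
            feats['prec_%d' % n] = ngram_precision(hyp, ref, n)
            feats['bleu_%d' % n] = bleu_n(hyp, ref, n)
        bp = brevity_penalty(hyp, ref)
        feats['bp'] = bp
        feats['bleu_over_bp'] = feats['bleu_4'] / bp if bp > 0 else 0.0
        return feats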
Introduction
Since human evaluation is costly and difficult to do reliably, a major focus of research has been on automatic measures of MT quality, pioneered by BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).
Introduction
BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations.
Introduction
(2006) have identified a number of problems with BLEU and related n-gram-based scores: (1) BLEU-like metrics are unreliable at the level of individual sentences due to data sparsity; (2) BLEU metrics can be “gamed” by permuting word order; (3) for some corpora and languages, the correlation to human ratings is very low even at the system level; (4) scores are biased towards statistical MT; (5) the quality gap between MT and human translations is not reflected in equally large BLEU differences.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Neubig, Graham and Watanabe, Taro and Mori, Shinsuke and Kawahara, Tatsuya
Conclusion and Future Directions
Similar results were found for character- and word-based BLEU, but are omitted for lack of space.
Experiments
Minimum error rate training was performed to maximize word-based BLEU score for all systems. For language models, word-based translation uses a word 5-gram model, and character-based translation uses a character 12-gram model, both smoothed using interpolated Kneser-Ney.
Experiments
We evaluate translation quality using BLEU score (Papineni et al., 2002), both on the word and character level (with n = 4), as well as METEOR (Denkowski and Lavie, 2011) on the word level.
Experiments
When compared with word-based translation, character-based translation achieves better, comparable, or inferior results on character-based BLEU, comparable or inferior results on METEOR, and inferior results on word-based BLEU .
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Xiang, Bing and Luo, Xiaoqiang and Zhou, Bowen
Experimental Results
The MT systems are optimized with pairwise ranking optimization (Hopkins and May, 2011) to maximize BLEU (Papineni et al., 2002).
Experimental Results
The BLEU scores from different systems are shown in Table 10 and Table 11, respectively.
Experimental Results
Preprocessing of the data with ECs inserted improves the BLEU scores by about 0.6 for newswire and 0.2 to 0.3 for the weblog data, compared to each baseline separately.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Wu, Xianchao and Sudoh, Katsuhito and Duh, Kevin and Tsukada, Hajime and Nagata, Masaaki
Experiments
training data and not necessarily exactly follow the tendency of the final BLEU scores.
Experiments
For example, CCG is worse than Malt in terms of P/R yet with a higher BLEU score.
Experiments
Also, PAS+sem has a lower P/R than Berkeley, yet their final BLEU scores are not statistically different.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Li, Mu and Duan, Nan and Zhang, Dongdong and Li, Chi-Ho and Zhou, Ming
Experiments
In our experiments all the models are optimized with case-insensitive NIST version of BLEU score and we report results using this metric in percentage numbers.
Experiments
Figure 3 shows the BLEU score curves with up to 1000 candidates used for re-ranking.
Experiments
Figure 4 shows the BLEU scores of a two-system co-decoding as a function of re-decoding iterations.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
van Gompel, Maarten and van den Bosch, Antal
Evaluation
We report on BLEU , NIST, METEOR, and word error rate metrics WER and PER.
Experiments & Results
The BLEU scores, not included in the figure but shown in Table 2, show a similar trend.
Experiments & Results
Statistical significance on the BLEU scores was tested using pairwise bootstrap sampling (Koehn, 2004).
Experiments & Results
Another discrepancy is found in the BLEU scores of the English—>Chinese experiments, where we measure an unexpected drop in BLEU score under baseline.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Amigó, Enrique and Giménez, Jesús and Gonzalo, Julio and Verdejo, Felisa
Alternatives to Correlation-based Meta-evaluation
We have studied 100 sentence evaluation cases from representatives of each metric family including: 1-PER, BLEU , DP-Or-⋆, GTM (e = 2), METEOR and ROUGE-L. The evaluation cases have been extracted from the four test beds.
Metrics and Test Beds
At the lexical level, we have included several standard metrics, based on different similarity assumptions: edit distance (WER, PER and TER), lexical precision ( BLEU and NIST), lexical recall (ROUGE), and F-measure (GTM and METEOR).
Previous Work on Machine Translation Meta-Evaluation
(2001) introduced the BLEU metric and evaluated its reliability in terms of Pearson correlation with human assessments for adequacy and fluency judgements.
Previous Work on Machine Translation Meta-Evaluation
With the aim of overcoming some of the deficiencies of BLEU , Doddington (2002) introduced the NIST metric.
Previous Work on Machine Translation Meta-Evaluation
Lin and Och (2004) experimented, unlike previous works, with a wide set of metrics, including NIST, WER (Nießen et al., 2000), PER (Tillmann et al., 1997), and variants of ROUGE, BLEU and GTM.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Hewavitharana, Sanjika and Mehay, Dennis and Ananthakrishnan, Sankaranarayanan and Natarajan, Prem
Abstract
On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU , TER, and NIST.
Corpus Data and Baseline SMT
Our phrase-based decoder is similar to Moses (Koehn et al., 2007) and uses the phrase pairs and target LM to perform beam search stack decoding based on a standard log-linear model, the parameters of which were tuned with MERT (Och, 2003) on a held-out development set (3,534 sentence pairs, 45K words) using BLEU as the tuning metric.
Experimental Setup and Results
Table 1 summarizes test set performance in BLEU (Papineni et al., 2001), NIST (Doddington, 2002) and TER (Snover et al., 2006).
Experimental Setup and Results
In the ASR setting, which simulates a real-world deployment scenario, this system achieves improvements of 0.39 ( BLEU ), -0.6 (TER) and 0.08 (NIST).
Introduction
With this approach, we demonstrate significant improvements over a baseline phrase-based SMT system as measured by BLEU , TER and NIST scores on an English-to-Iraqi CSLT task.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zhang, Min and Jiang, Hongfei and Aw, Aiti and Li, Haizhou and Tan, Chew Lim and Li, Sheng
Experiments
BLEU (%)
Experiments
Rule Type / BLEU (%): TR (STSG) 24.71; TR+TSR_L 25.72; TR+TSR_L+TSR_P 25.93; TR+TSR 26.07
Experiments
Rule Type / BLEU (%): TR+TSR 26.07; (TR+TSR) w/o SRR 24.62; (TR+TSR) w/o DPR 25.78
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Parton, Kristen and McKeown, Kathleen R. and Coyne, Bob and Diab, Mona T. and Grishman, Ralph and Hakkani-Tür, Dilek and Harper, Mary and Ji, Heng and Ma, Wei Yun and Meyers, Adam and Stolbach, Sara and Sun, Ang and Tur, Gokhan and Xu, Wei and Yaman, Sibel
Results
Since MT systems are tuned for word-based overlap measures (such as BLEU ), verb deletion is penalized the same as, for example, determiner deletion.
SW System
model score and word penalty for a combination of BLEU and TER (2*(1-BLEU) + TER).
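That combined tuning criterion is straightforward to reproduce; a minimal sketch, assuming BLEU and TER are given as fractions in [0, 1] and that lower combined values are better:

    def combined_error(bleu, ter):
        # 2*(1-BLEU) + TER: both terms behave like error rates, so tuning minimizes this value
        return 2.0 * (1.0 - bleu) + ter

    # e.g. BLEU = 0.352, TER = 0.480  ->  2*(1 - 0.352) + 0.480 = 1.776
    print(combined_error(0.352, 0.480))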
SW System
Bleu scores on the government supplied test set in December 2008 were 35.2 for formal text, 29.2 for informal text, 33.2 for formal speech, and 27.6 for informal speech.
The Chinese-English 5W Task
Unlike word- or phrase-overlap measures such as BLEU , the SW evaluation takes into account “concept” or “nugget” translation.
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Setiawan, Hendra and Kan, Min Yen and Li, Haizhou and Resnik, Philip
Discussion and Future Work
When we visually inspect and compare the outputs of our system with those of the baseline, we observe that improved BLEU score often corresponds to visible improvements in the subjective translation quality.
Discussion and Future Work
Perhaps surprisingly, translation performance, 30.90 BLEU , was around the level we obtained when using frequency to approximate function words at N = 64.
Experimental Results
These results confirm that the pairwise dominance model can significantly increase performance as measured by the BLEU score, with a consistent pattern of results across the MT06 and MT08 test sets.
Experimental Setup
all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by (Koehn, 2004).
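For reference, here is a compact sketch of the paired bootstrap test attributed to Koehn (2004), under common assumptions: resample test sentences with replacement, rescore both systems on each resample, and report how often one system wins; the corpus_bleu helper and the resample count are illustrative.

    import random

    def paired_bootstrap(hyps_a, hyps_b, refs, corpus_bleu, n_resamples=1000):
        # hyps_a, hyps_b, refs: parallel lists of tokenized sentences; corpus_bleu(hyps, refs) -> float
        idx = list(range(len(refs)))
        wins = 0
        for _ in range(n_resamples):
            sample = [random.choice(idx) for _ in idx]       # resample sentences with replacement
            bleu_a = corpus_bleu([hyps_a[i] for i in sample], [refs[i] for i in sample])
            bleu_b = corpus_bleu([hyps_b[i] for i in sample], [refs[i] for i in sample])
            wins += bleu_a > bleu_b
        return wins / float(n_resamples)   # a win rate >= 0.95 suggests significance at p < 0.05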
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Beaufort, Richard and Roekhaut, Sophie and Cougnon, Louise-Amélie and Fairon, Cédrick
Abstract
Evaluated in French by 10-fold-cross validation, the system achieves a 9.3% Word Error Rate and a 0.83 BLEU score.
Conclusion and perspectives
Evaluated by tenfold cross-validation, the system seems efficient, and the performance in terms of BLEU score and WER is quite encouraging.
Evaluation
The system was evaluated in terms of BLEU score (Papineni et al., 2001), Word Error Rate (WER) and Sentence Error Rate (SER).
Evaluation
The copy-paste results just inform about the real deviation of our corpus from the traditional spelling conventions, and highlight the fact that our system is still at pains to significantly reduce the SER, while results in terms of WER and BLEU score are quite encouraging.
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zeng, Xiaodong and Chao, Lidia S. and Wong, Derek F. and Trancoso, Isabel and Tian, Liang
Experiments
We adopted three state-of-the-art metrics, BLEU (Papineni et al., 2002), NIST (Doddington et al., 2000) and METEOR (Banerjee and Lavie, 2005), to evaluate the translation quality.
Experiments
Overall, the boldface numbers in the last row illustrate that our model obtains average improvements of 1.89, 1.76 and 1.61 on BLEU, NIST and METEOR, respectively.
Experiments
Models / BLEU / NIST / METEOR: CS 29.38 / 59.85 / 54.07; SMS 30.05 / 61.33 / 55.95; UBS 30.15 / 61.56 / 55.39; Stanford 30.40 / 61.94 / 56.01
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Mylonakis, Markos and Sima'an, Khalil
Abstract
We obtain statistically significant improvements across 4 different language pairs with English as source, mounting up to +1.92 BLEU for Chinese as target.
Experiments
Our system (its) outperforms the baseline for all 4 language pairs for both BLEU and NIST scores, by a margin which scales up to +1.92 BLEU points for English to Chinese translation when training on the 400K set.
Experiments
BLEU scores for 200K and 400K training sentence pairs.
Experiments
Notably, as can be seen in Table 2(b), switching to a 4-gram LM results in performance gains for both the baseline and our system and while the margin between the two systems decreases, our system continues to deliver a considerable and significant improvement in translation BLEU scores.
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Siahbani, Maryam and Haffari, Reza and Sarkar, Anoop
Conclusion
Our results showed improvement over the baselines both in intrinsic evaluations and on BLEU .
Experiments & Results 4.1 Experimental Setup
BLEU (Papineni et al., 2002) is still the de facto evaluation metric for machine translation and we use that to measure the quality of our proposed approaches for MT.
Experiments & Results 4.1 Experimental Setup
Table 6 reports the Bleu scores for different domains when the oov translations from the graph propagation are added to the phrase-table, and compares them with the baseline system.
Introduction
In general, copied-over oovs are a hindrance to fluent, high quality translation, and we can see evidence of this in automatic measures such as BLEU (Papineni et al., 2002) and also in human evaluation scores such as HTER.
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Konstas, Ioannis and Lapata, Mirella
Abstract
Experimental evaluation on the ATIS domain shows that our model outperforms a competitive discriminative system both using BLEU and in a judgment elicitation study.
Results
As can be seen, inclusion of lexical features gives our decoder an absolute increase of 6.73% in BLEU over the 1-BEST system.
Results
System / BLEU / METEOR: 1-BEST+BASE+ALIGN 21.93 / 34.01; k-BEST+BASE+ALIGN+LEX 28.66 / 45.18; k-BEST+BASE+ALIGN+LEX+STR 30.62 / 46.07; ANGELI 26.77 / 42.41
Results
over the 1-BEST system and 3.85% over ANGELI in terms of BLEU .
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zhai, Feifei and Zhang, Jiajun and Zhou, Yu and Zong, Chengqing
Experiment
Specifically, after integrating the inside context information of PAS into transformation, we can see that system IC-PASTR significantly outperforms system PASTR by 0.71 BLEU points.
Experiment
Moreover, after we import the MEPD model into system PASTR, we get a significant improvement over PASTR (by 0.54 BLEU points).
Experiment
We can see that this system further achieves a remarkable improvement over system PASTR (0.95 BLEU points).
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min and Li, Haizhou
Experiments
Corpus / BLEU (%) / RCW (%)
Experiments
Table 4: Case-insensitive BLEU score and ratio of correct words (RCW) on the training, development and test corpus.
Experiments
Table 4 shows the case-insensitive BLEU score and the percentage of words that are labeled as correct according to the method described above on the training, development and test corpus.
SMT System
The performance, in terms of BLEU (Papineni et al., 2002) score, is shown in Table 4.
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Braune, Fabienne and Seemann, Nina and Quernheim, Daniel and Maletti, Andreas
Experiments
System / BLEU: Baseline 12.60; ℓMBOT 13.06
Experiments
We measured the overall translation quality with the help of 4-gram BLEU (Papineni et al., 2002), which was computed on tokenized and lower-cased data for both systems.
Experiments
We obtain a BLEU score of 13.06, which is a gain of 0.46 BLEU points over the baseline.
Introduction
The translation quality is automatically measured using BLEU scores, and we confirm the findings by providing linguistic evidence (see Section 5).
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Kuhn, Roland and Foster, George
Abstract
Experiments on large scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic to English and +1.4 BLEU on Chinese to English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation.
Experiments
The 3-feature version of VSM yields +1.8 BLEU over the baseline for Arabic to English, and +1.4 BLEU for Chinese to English.
Experiments
For instance, with an initial Chinese system that employs linear mixture LM adaptation (lin-lm) and has a BLEU of 32.1, adding 1-feature VSM adaptation (+vsm, joint) improves performance to 33.1 (improvement significant at p < 0.01), while adding 3-feature VSM instead (+vsm, 3 feat.)
Experiments
To get an intuition for how VSM adaptation improves BLEU scores, we compared outputs from the baseline and VSM-adapted system (“vsm, joint” in Table 5) on the Chinese test data.
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Braslavski, Pavel and Beloborodov, Alexander and Khalilov, Maxim and Sharoff, Serge
Evaluation methodology
In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003).
Results
OS1 usually has the highest overall score (except BLEU ); it also has the highest scores for ‘regulations’ (more formal texts), while P1 scores are better for the news documents.
Results
Metric / Sentence level (Median, Mean, Trimmed) / Corpus level: BLEU 0.357, 0.298, 0.348 / 0.833; NIST 0.357, 0.291, 0.347 / 0.810; Meteor 0.429, 0.348, 0.393 / 0.714; TER 0.214, 0.186, 0.204 / 0.619; GTM 0.429, 0.340, 0.392 / 0.714
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Ling, Wang and Xiang, Guang and Dyer, Chris and Black, Alan and Trancoso, Isabel
Experiments
Table 3: BLEU scores for different datasets in different translation directions (left to right), broken with different training corpora (top to bottom).
Experiments
The BLEU scores for the different parallel corpora are shown in Table 3 and the top 10 out-of-vocabulary (OOV) words for each dataset are shown in Table 4.
Experiments
However, by combining the Weibo parallel data with this standard data, improvements in BLEU are obtained.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Kuznetsova, Polina and Ordonez, Vicente and Berg, Alexander and Berg, Tamara and Choi, Yejin
Code was provided by Deng et al. (2012).
To compute evaluation measures, we take the average scores of BLEU (1) and F-score (unigram-based with respect to content-words) over k = 5 candidate captions.
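A rough sketch of that averaging step, with bleu1 and content_word_fscore as placeholder scoring functions (they are not defined in the extract above):

    def average_caption_scores(candidates, reference, bleu1, content_word_fscore, k=5):
        # candidates: ranked candidate captions for one image; reference: the original caption
        top_k = candidates[:k]
        avg_bleu = sum(bleu1(c, reference) for c in top_k) / float(len(top_k))
        avg_f = sum(content_word_fscore(c, reference) for c in top_k) / float(len(top_k))
        return avg_bleu, avg_f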
Code was provided by Deng et al. (2012).
Therefore, we also report scores based on semantic matching, which gives partial credits to word pairs based on their lexical similarity. The best performing approach with semantic matching is VISUAL (with LM = Image corpus), improving BLEU , Precision, F-score substantially over those of ORIG, demonstrating the extrinsic utility of our newly generated image-text parallel corpus in comparison to the original database.
Related Work
When computing BLEU with semantic matching, we look for the match with the highest similarity score among words that have not been matched before.
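A toy sketch of such greedy one-to-one matching for unigrams, assuming a word-pair similarity function sim(w1, w2) in [0, 1]; the threshold and the use of the similarity value as partial credit are illustrative choices, not necessarily the paper's exact procedure.

    def greedy_semantic_matches(hyp_tokens, ref_tokens, sim, threshold=0.5):
        # each reference token can be matched at most once; take the most similar unmatched one
        unmatched = list(ref_tokens)
        credit = 0.0
        for h in hyp_tokens:
            if not unmatched:
                break
            best_j = max(range(len(unmatched)), key=lambda j: sim(h, unmatched[j]))
            best_sim = sim(h, unmatched[best_j])
            if best_sim >= threshold:
                credit += best_sim            # partial credit for near-synonyms
                unmatched.pop(best_j)
        return credit                         # stands in for the clipped unigram match count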
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zhang, Dongdong and Li, Mu and Duan, Nan and Li, Chi-Ho and Zhou, Ming
Experiments
In addition to precision and recall, we also evaluate the Bleu score (Papineni et al., 2002) changes before and after applying our measure word generation method to the SMT output.
Experiments
For our test data, we only consider sentences containing measure words for Bleu score evaluation.
Experiments
Our measure word generation step leads to a Bleu score improvement of 0.32 where the window size is set to 10, which shows that it can improve the translation quality of an English-to-Chinese SMT system.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Quirk, Chris and Moore, Robert C. and Gildea, Daniel
Experiments
Given an unlimited amount of time, we would tune the prior to maximize end-to-end performance, using an objective function such as BLEU .
Experiments
We do compare VB against EM in terms of final BLEU scores in the translation experiments to ensure that this sparse prior has a significant impact.
Experiments
Minimum Error Rate training (Och, 2003) over BLEU was used to optimize the weights for each of these models over the development test data.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Goto, Isao and Utiyama, Masao and Sumita, Eiichiro and Tamura, Akihiro and Kurohashi, Sadao
Abstract
In our experiments, our model improved 2.9 BLEU points for J apanese-English and 2.6 BLEU points for Chinese-English translation compared to the lexical reordering models.
Experiment
To stabilize the MERT results, we tuned three times by MERT using the first half of the development data and we selected the SMT weighting parameter set that performed the best on the second half of the development data based on the BLEU scores from the three SMT weighting parameter sets.
Experiment
To investigate the tolerance for sparsity of the training data, we reduced the training data for the sequence model to 20,000 sentences for JE translation. SEQUENCE using this model with a distortion limit of 30 achieved a BLEU score of 32.22. Although the score is lower than the score of SEQUENCE with a distortion limit of 30 in Table 3, the score was still higher than those of LINEAR, LINEAR+LEX, and 9-CLASS for JE in Table 3.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Green, Spence and DeNero, John
Abstract
For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline.
Discussion of Translation Results
The best result—a +1.04 BLEU average gain—was achieved when the class-based model training data, MT tuning set, and MT evaluation set contained the same genre.
Introduction
For English-to-Arabic translation, we achieve a +1.04 BLEU average improvement by tiling our model on top of a large LM.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zollmann, Andreas and Vogel, Stephan
Experiments
Unfortunately, variance in development set BLEU scores tends to be higher than in test set scores, despite SAMT MERT’s inbuilt algorithms to overcome local optima, such as random restarts and zeroing-out.
Experiments
We have noticed that using an L0-penalized BLEU score as MERT’s objective on the merged n-best lists over all iterations is more stable, and we therefore use this score to determine N.
Experiments
Given by BLEU minus a penalty proportional to the number of nonzero feature weights (an L0 penalty).
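One plausible reading of that penalized objective, sketched below with an illustrative penalty coefficient (not necessarily the paper's exact constant):

    def l0_penalized_bleu(bleu, weights, penalty=0.0005):
        # BLEU minus a penalty that grows with the number of nonzero feature weights (an L0 penalty)
        nonzero = sum(1 for w in weights if w != 0.0)
        return bleu - penalty * nonzero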
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Schwartz, Lane and Callison-Burch, Chris and Schuler, William and Wu, Stephen
Abstract
We present empirical results on a constrained Urdu-English translation task that demonstrate a significant BLEU score improvement and a large decrease in perplexity.
Related Work
Figure 9 shows a statistically significant improvement to the BLEU score when using the HHMM and the n-gram LMs together on this reduced test set.
Related Work
Moses LM(s) / BLEU
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Lu, Shixiang and Chen, Zhenbiao and Xu, Bo
Abstract
On two Chinese-English tasks, our semi-supervised DAE features obtain statistically significant improvements of 1.34/2.45 (IWSLT) and 0.82/1.52 (NIST) BLEU points over the unsupervised DBN features and the baseline features, respectively.
Conclusions
The results also demonstrate that DNN (DAE and HCDAE) features are complementary to the original features for SMT, and adding them together obtain statistically significant improvements of 3.16 (IWSLT) and 2.06 (NIST) BLEU points over the baseline features.
Experiments and Results
Adding new DNN features as extra features significantly improves translation accuracy (row 2-17 vs. 1), with the highest increase of 2.45 (IWSLT) and 1.52 (NIST) (row 14 vs. 1) BLEU points over the baseline features.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Han, Bo and Baldwin, Timothy
Conclusion and Future Work
In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
Experiments
The 10-fold cross-validated BLEU score (Papineni et al., 2002) over this data is 0.81.
Experiments
Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: