Index of papers in Proc. ACL 2009 that mention
  • BLEU score
Wu, Hua and Wang, Haifeng
Discussion
Table 6: CRR translation results (BLEU scores) by using different RBMT systems
Discussion
The BLEU scores are 43.90 and 29.77 for System A and System B, respectively.
Discussion
If we compare the results with those only using SMT systems as described in Table 3, the translation quality was greatly improved by at least 3 BLEU scores, even if the translation ac-
Experiments
Translation quality was evaluated using both the BLEU score proposed by Papineni et al.
Experiments
The results also show that our translation selection method is very effective, which achieved absolute improvements of about 4 and 1 BLEU scores on CRR and ASR inputs, respectively.
Experiments
As compared with those in Table 3, the translation quality was greatly improved, with absolute improvements of at least 5.1 and 3.9 BLEU scores on CRR and ASR inputs for system combination results.
Translation Selection
In this paper, we modify the method in Albrecht and Hwa (2007) to only prepare human reference translations for the training examples, and then evaluate the translations produced by the subject systems against the references using BLEU score (Papineni et al., 2002).
Translation Selection
We use smoothed sentence-level BLEU score to replace the human assessments, where we use additive smoothing to avoid zero BLEU scores when we calculate the n-gram precisions.
Translation Selection
In the context of translation selection, y is assigned as the smoothed BLEU score.
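To make the smoothed sentence-level metric described above concrete, here is a minimal Python sketch of sentence-level BLEU with additive (add-one) smoothing of the n-gram precisions; the maximum n-gram order, the smoothing constant, and the brevity-penalty handling are illustrative assumptions, not details taken from the paper.

    import math
    from collections import Counter

    def ngram_counts(tokens, n):
        # Multiset of n-grams in a token sequence.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def smoothed_sentence_bleu(hypothesis, reference, max_n=4):
        hyp, ref = hypothesis.split(), reference.split()
        log_precision_sum = 0.0
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp, n)
            ref_ngrams = ngram_counts(ref, n)
            matched = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            total = sum(hyp_ngrams.values())
            # Additive smoothing keeps every n-gram precision strictly positive,
            # so the score never collapses to zero on short sentences.
            log_precision_sum += math.log((matched + 1.0) / (total + 1.0))
        # Standard brevity penalty against a single reference.
        brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / max(len(hyp), 1))
        return brevity * math.exp(log_precision_sum / max_n)

    print(smoothed_sentence_bleu("the cat sat on the mat", "the cat is on the mat"))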
BLEU score is mentioned in 14 sentences in this paper.
Liu, Yang and Mi, Haitao and Feng, Yang and Liu, Qun
Conclusion
As our decoder accounts for multiple derivations, we extend the MERT algorithm to tune feature weights with respect to BLEU score for max-translation decoding.
Experiments
Table 2: Comparison of individual decoding speed (seconds/sentence) and BLEU score (case-insensitive).
Experiments
With conventional max-derivation decoding, the hierarchical phrase-based model achieved a BLEU score of 30.11 on the test set, with an average decoding time of 40.53 seconds/sentence.
Experiments
We found that accounting for all possible derivations in max-translation decoding resulted in a small negative effect on BLEU score (from 30.11 to 29.82), even though the feature weights were tuned with respect to BLEU score.
Introduction
As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
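The contrast between max-derivation and max-translation decoding that this MERT extension targets can be illustrated with a toy sketch; the derivation list and probabilities below are invented for illustration, and a real decoder operates over a packed forest rather than an explicit list.

    from collections import defaultdict

    # Toy derivations: (target string, model probability); several derivations
    # may yield the same target string. The numbers are made up.
    derivations = [
        ("he went home", 0.30),
        ("he went home", 0.25),
        ("he goes home", 0.40),
    ]

    # Max-derivation decoding: output the string of the single best derivation.
    max_derivation = max(derivations, key=lambda d: d[1])[0]

    # Max-translation decoding: sum probabilities over derivations that share a
    # string, then output the string with the largest total.
    totals = defaultdict(float)
    for string, prob in derivations:
        totals[string] += prob
    max_translation = max(totals, key=totals.get)

    print(max_derivation)   # "he goes home" (best single derivation, 0.40)
    print(max_translation)  # "he went home" (0.30 + 0.25 = 0.55 beats 0.40)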
BLEU score is mentioned in 9 sentences in this paper.
Zaslavskiy, Mikhail and Dymetman, Marc and Cancedda, Nicola
Experiments
In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences.
Experiments
Figure 6 presents Decoder and BLEU scores as functions of time for the two corpora.
BLEU score is mentioned in 9 sentences in this paper.
Kumar, Shankar and Macherey, Wolfgang and Dyer, Chris and Och, Franz
Experiments
MERT is then performed to optimize the BLEU score on a development set. For MERT, we use 40 random initial parameters as well as parameters computed using corpus-based statistics (Tromble et al., 2008).
Experiments
We consider a BLEU score difference to be a) a gain if it is at least 0.2 points, b) a drop if it is at most -0.2 points, and c) no change otherwise.
Experiments
When MBR does not produce a higher BLEU score relative to MAP on the development set, MERT assigns a higher weight to this feature function.
Introduction
Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
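As a rough sketch of the heuristic weight setting mentioned here, the linear loss assigns one weight to the hypothesis length and one weight per n-gram order, with the n-gram weights derived from an assumed unigram precision p and a per-order decay ratio r; the parameterization below follows the form given by Tromble et al. (2008), and the constant values are purely illustrative.

    def heuristic_linear_bleu_weights(p, r, T, max_n=4):
        # Heuristic weights for the linear BLEU approximation, assuming the
        # n-gram precision at order n is roughly p * r**(n - 1).
        # p: assumed unigram precision, r: per-order decay ratio,
        # T: a length-related constant; all three are tuning knobs.
        theta_0 = -1.0 / T                                   # weight on hypothesis length
        theta_n = [1.0 / (4.0 * T * p * r ** (n - 1)) for n in range(1, max_n + 1)]
        return theta_0, theta_n

    # Illustrative constants only; the paper instead tunes such weights with MERT.
    print(heuristic_linear_bleu_weights(p=0.85, r=0.7, T=10.0))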
Introduction
We employ MERT to select these weights by optimizing BLEU score on a development set.
Introduction
In contrast, our MBR algorithm directly selects the hypothesis in the hypergraph with the maximum expected approximate corpus BLEU score (Tromble et al., 2008).
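As a simplification of selecting from a hypergraph, the sketch below performs MBR over an explicit N-best list, with a crude unigram-overlap gain standing in for the approximate corpus BLEU; the candidate list, probabilities, and gain function are placeholders rather than the paper's actual components.

    def mbr_decode(candidates, gain):
        # candidates: list of (hypothesis, posterior probability) pairs.
        # gain(hyp, other): similarity in [0, 1], e.g. a sentence-level BLEU surrogate.
        # Returns the hypothesis with the maximum expected gain under the posterior.
        def expected_gain(hyp):
            return sum(prob * gain(hyp, other) for other, prob in candidates)
        return max((hyp for hyp, _ in candidates), key=expected_gain)

    # Placeholder gain: unigram overlap ratio, standing in for approximate BLEU.
    def unigram_overlap(hyp, other):
        h, o = set(hyp.split()), set(other.split())
        return len(h & o) / max(len(h), 1)

    nbest = [("he went home", 0.40), ("he goes home", 0.35), ("she went home", 0.25)]
    print(mbr_decode(nbest, unigram_overlap))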
MERT for MBR Parameter Optimization
We now have a total of N+2 feature functions, which we optimize using MERT to obtain the highest BLEU score on a training set.
Minimum Bayes-Risk Decoding
Tromble et al. (2008) extended MBR decoding to translation lattices under an approximate BLEU score.
BLEU score is mentioned in 8 sentences in this paper.
Li, Zhifei and Eisner, Jason and Khudanpur, Sanjeev
Experimental Results
Table 1: BLEU scores for Viterbi, Crunching, MBR, and variational decoding.
Experimental Results
Table 1 presents the BLEU scores under Viterbi, crunching, MBR, and variational decoding.
Experimental Results
Moreover, a bigram (i.e., “2gram”) achieves the best BLEU scores among the four different orders of VMs.
BLEU score is mentioned in 8 sentences in this paper.
Liu, Yang and Lü, Yajuan and Liu, Qun
Experiments
Table 3: Comparison of BLEU scores for tree-based and forest-based tree-to-tree models.
Experiments
Table 3 shows the BLEU scores of tree-based and forest-based tree-to-tree models achieved on the test set over different pruning thresholds.
Experiments
As the number of rules used increased, the BLEU score increased accordingly.
BLEU score is mentioned in 8 sentences in this paper.
Haffari, Gholamreza and Sarkar, Anoop
AL-SMT: Multilingual Setting
The translation quality is measured by TQ for the individual systems M_{Fd→E}; it can be the BLEU score or WER/PER (word error rate and position-independent WER), which induces a maximization or minimization problem, respectively.
AL-SMT: Multilingual Setting
This process is continued iteratively until a certain level of translation quality is met (we use the BLEU score, WER and PER) (Papineni et al., 2002).
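A schematic of the iterative loop implied by this passage is sketched below; the callables (select_batch, annotate, retrain, evaluate_quality) and the stopping threshold are hypothetical placeholders, not the authors' actual interface, and a WER/PER criterion would be minimized rather than maximized.

    def active_learning_loop(unlabeled_pool, labeled_data, select_batch, annotate,
                             retrain, evaluate_quality, target_quality, max_rounds=20):
        # Generic active-learning loop for SMT: keep adding newly translated
        # sentences until the development-set quality (e.g. BLEU) reaches a target.
        model = retrain(labeled_data)
        for _ in range(max_rounds):
            if evaluate_quality(model) >= target_quality or not unlabeled_pool:
                break
            batch = select_batch(model, unlabeled_pool)    # pick informative sentences
            unlabeled_pool = [s for s in unlabeled_pool if s not in batch]
            labeled_data = labeled_data + annotate(batch)  # obtain reference translations
            model = retrain(labeled_data)                  # retrain the SMT system
        return model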
Experiments
The number of weights is 3 plus the number of source languages, and they are trained using minimum error-rate training (MERT) to maximize the BLEU score (Och, 2003) on a development set.
BLEU score is mentioned in 7 sentences in this paper.
He, Wei and Wang, Haifeng and Guo, Yuqing and Liu, Ting
Abstract
Trained on 8,975 dependency structures of a Chinese Dependency Treebank, the realizer achieves a BLEU score of 0.8874.
Experiments
In addition to BLEU score, percentage of exactly matched sentences and average NIST simple string accuracy (SSA) are adopted as evaluation metrics.
Experiments
We observe that the BLEU score is boosted from 0.1478 to 0.5943 by using the RPD method.
Experiments
All of the four feature functions we have tested achieve considerable improvement in BLEU scores.
Log-linear Models
BLEU score, a method originally proposed to automatically evaluate machine translation quality (Papineni et al., 2002), has been widely used as a metric to evaluate general-purpose sentence generation (Langkilde, 2002; White et al., 2007; Guo et al.
Log-linear Models
3 The BLEU scoring script is supplied by NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
BLEU score is mentioned in 7 sentences in this paper.
Cahill, Aoife and Riester, Arndt
Abstract
We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features.
Conclusions
In comparison to a baseline model, we achieve statistically significant improvement in BLEU score .
Generation Ranking Experiments
We evaluate the string chosen by the log-linear model against the original treebank string in terms of exact match and BLEU score (Papineni et al.,
Generation Ranking Experiments
The difference in BLEU score between the model of Cahill et al.
BLEU score is mentioned in 4 sentences in this paper.
Hirao, Tsutomu and Suzuki, Jun and Isozaki, Hideki
Experimental Evaluation
For MCE learning, we selected the reference compression that maximizes the BLEU score (Papineni et al., 2002) (= argmax_{r∈R} BLEU(r, R\r)) from the set of reference compressions and used it as correct data for training.
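A minimal sketch of this leave-one-out selection, using NLTK's smoothed sentence-level BLEU as a stand-in for the paper's scorer; the reference strings and helper names are illustrative.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def best_reference(references):
        # Pick the reference compression scoring highest against the remaining
        # ones, i.e. the argmax over r in R of BLEU(r, R \ r).
        smooth = SmoothingFunction().method1
        def score(i):
            others = [references[j].split() for j in range(len(references)) if j != i]
            return sentence_bleu(others, references[i].split(), smoothing_function=smooth)
        return references[max(range(len(references)), key=score)]

    refs = ["the bill was passed yesterday", "the bill passed", "parliament passed the bill"]
    print(best_reference(refs))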
Results and Discussion
Our method achieved the highest BLEU score .
Results and Discussion
For example, ‘w/o PLM + Dep’ achieved the second highest BLEU score .
Results and Discussion
Compared to ‘Hori—’, ‘Hori’ achieved a significantly higher BLEU score .
BLEU score is mentioned in 4 sentences in this paper.
Li, Mu and Duan, Nan and Zhang, Dongdong and Li, Chi-Ho and Zhou, Ming
Experiments
In our experiments all the models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric in percentage numbers.
Experiments
Figure 3 shows the BLEU score curves with up to 1000 candidates used for re-ranking.
Experiments
Figure 4 shows the BLEU scores of a two-system co-decoding as a function of re-decoding iterations.
BLEU score is mentioned in 4 sentences in this paper.
Setiawan, Hendra and Kan, Min Yen and Li, Haizhou and Resnik, Philip
Discussion and Future Work
When we visually inspect and compare the outputs of our system with those of the baseline, we observe that improved BLEU score often corresponds to visible improvements in the subjective translation quality.
Experimental Results
These results confirm that the pairwise dominance model can significantly increase performance as measured by the BLEU score, with a consistent pattern of results across the MT06 and MT08 test sets.
Experimental Setup
In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004).
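The bootstrapping test referenced here can be sketched compactly; the function below follows the paired bootstrap resampling idea of Koehn (2004), with the corpus-level scorer passed in as a callable (the names, the number of samples, and the significance reading are illustrative assumptions).

    import random

    def paired_bootstrap(sys_a, sys_b, refs, corpus_score, samples=1000, seed=0):
        # sys_a, sys_b: lists of hypothesis strings; refs: reference strings.
        # corpus_score: callable(hypotheses, references) -> score, e.g. corpus BLEU.
        # Returns the fraction of resampled test sets on which system A outscores B.
        rng = random.Random(seed)
        n = len(refs)
        wins = 0
        for _ in range(samples):
            idx = [rng.randrange(n) for _ in range(n)]   # resample sentences with replacement
            a = corpus_score([sys_a[i] for i in idx], [refs[i] for i in idx])
            b = corpus_score([sys_b[i] for i in idx], [refs[i] for i in idx])
            wins += a > b
        return wins / samples   # e.g. a fraction >= 0.95 is read as significant at p < 0.05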
BLEU score is mentioned in 3 sentences in this paper.
Zhang, Hui and Zhang, Min and Li, Haizhou and Aw, Aiti and Tan, Chew Lim
Experiment
The 9% tree sequence rules contribute 1.17 BLEU score improvement (28.83-27.66 in Table 1) to FTS2S over FT2S.
Experiment
Even in the 5000-best case, tree sequence is still able to contribute a 1.1 BLEU score improvement (28.89-27.79).
Experiment
2) The BLEU scores are very similar to each other when we increase the forest pruning threshold.
BLEU score is mentioned in 3 sentences in this paper.