Discussion | Table 6: CRR translation results (BLEU scores) using different RBMT systems |
Discussion | The BLEU scores are 43.90 and 29.77 for System A and System B, respectively. |
Discussion | If we compare these results with those using only SMT systems, as described in Table 3, translation quality was greatly improved by at least 3 BLEU points, even if the translation ac-
Experiments | Translation quality was evaluated using both the BLEU score proposed by Papineni et al. |
Experiments | The results also show that our translation selection method is very effective, achieving absolute improvements of about 4 and 1 BLEU points on CRR and ASR inputs, respectively.
Experiments | As compared with those in Table 3, the translation quality was greatly improved, with absolute improvements of at least 5.1 and 3.9 BLEU points on CRR and ASR inputs for the system combination results.
Translation Selection | In this paper, we modify the method in Albrecht and Hwa (2007) to only prepare human reference translations for the training examples, and then evaluate the translations produced by the subject systems against the references using BLEU score (Papineni et al., 2002). |
Translation Selection | We use smoothed sentence-level BLEU score to replace the human assessments, where we use additive smoothing to avoid zero BLEU scores when we calculate the n-gram precisions. |
Translation Selection | In the context of translation selection, y is assigned as the smoothed BLEU score.
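The additive smoothing described above can be sketched as follows; the function name, the add-one constant, and the brevity-penalty form are illustrative assumptions, not the authors' exact implementation:

```python
from collections import Counter
import math

def smoothed_sentence_bleu(hypothesis, reference, max_n=4, k=1.0):
    """Sentence-level BLEU with additive smoothing on the n-gram precisions.

    A constant k is added to both the clipped match count and the total
    n-gram count at every order, so no precision is ever exactly zero.
    """
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(len(hyp) - n + 1, 0)
        # additive smoothing keeps the log finite even with zero matches
        log_prec += math.log((matches + k) / (total + k))
    # standard brevity penalty against the single reference length
    bp = min(1.0, math.exp(1.0 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec / max_n)
```

With k = 1 an exact match still scores 1.0, while a hypothesis shorter than `max_n` tokens gets a nonzero score instead of a hard zero.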
Conclusion | As our decoder accounts for multiple derivations, we extend the MERT algorithm to tune feature weights with respect to BLEU score for max-translation decoding. |
Experiments | Table 2: Comparison of individual decoding speed (seconds/sentence) and BLEU score (case-insensitive).
Experiments | With conventional max-derivation decoding, the hierarchical phrase-based model achieved a BLEU score of 30.11 on the test set, with an average decoding time of 40.53 seconds/sentence. |
Experiments | We found that accounting for all possible derivations in max-translation decoding resulted in a small negative effect on BLEU score (from 30.11 to 29.82), even though the feature weights were tuned with respect to BLEU score.
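The contrast between max-derivation and max-translation decoding can be illustrated with a toy sketch; representing the derivation forest as a flat list of (translation, probability) pairs is an assumption made purely for illustration:

```python
from collections import defaultdict

def max_translation(derivations):
    """Given (translation_string, derivation_probability) pairs, pick the
    translation whose derivation probabilities sum highest (max-translation),
    rather than the translation of the single best derivation (max-derivation).
    """
    totals = defaultdict(float)
    for translation, prob in derivations:
        totals[translation] += prob  # sum over derivations of the same string
    return max(totals, key=totals.get)
```

For example, with derivations [("A", 0.4), ("B", 0.3), ("B", 0.3)], max-derivation decoding returns "A" (highest single derivation), while max-translation decoding returns "B" (highest summed mass, 0.6).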
Introduction | As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
Experiments | In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences. |
Experiments | Figure 6 presents decoder and BLEU scores as functions of time for the two corpora.
Experiments | MERT is then performed to optimize the BLEU score on a development set. For MERT, we use 40 random initial parameters as well as parameters computed using corpus-based statistics (Tromble et al., 2008).
Experiments | We consider a BLEU score difference to be (a) a gain if it is at least 0.2 points, (b) a drop if it is at most -0.2 points, and (c) no change otherwise.
Experiments | When MBR does not produce a higher BLEU score relative to MAP on the development set, MERT assigns a higher weight to this feature function. |
Introduction | Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
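A minimal sketch of such an exponential-decay heuristic follows; the values of the unigram precision p and decay ratio r, and the exact functional form of the weights, are illustrative assumptions rather than the precise parameterization of Tromble et al. (2008):

```python
def linear_bleu_ngram_weights(p=0.85, r=0.7, max_n=4):
    """Illustrative per-order weights for a linear BLEU approximation,
    assuming n-gram precisions decay exponentially: prec(n) ~ p * r**(n-1).
    Higher-order matches get larger weights to offset their lower precision."""
    return [1.0 / (max_n * p * r ** (n - 1)) for n in range(1, max_n + 1)]
```

Because r < 1, the resulting weights increase monotonically with n, which is the qualitative behavior the heuristic is meant to capture.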
Introduction | We employ MERT to select these weights by optimizing BLEU score on a development set. |
Introduction | In contrast, our MBR algorithm directly selects the hypothesis in the hypergraph with the maximum expected approximate corpus BLEU score (Tromble et al., 2008). |
MERT for MBR Parameter Optimization | We now have a total of N + 2 feature functions, which we optimize using MERT to obtain the highest BLEU score on a training set.
Minimum Bayes-Risk Decoding | Tromble et al. (2008) extended MBR decoding to translation lattices under an approximate BLEU score.
Experimental Results | Table 1: BLEU scores for Viterbi, Crunching, MBR, and variational decoding.
Experimental Results | Table 1 presents the BLEU scores under Viterbi, crunching, MBR, and variational decoding. |
Experimental Results | Moreover, a bigram (i.e., “2gram”) achieves the best BLEU scores among the four different orders of VMs. |
Experiments | Table 3: Comparison of BLEU scores for tree-based and forest-based tree-to-tree models. |
Experiments | Table 3 shows the BLEU scores of tree-based and forest-based tree-to-tree models achieved on the test set over different pruning thresholds. |
Experiments | As the number of rules used increased, the BLEU score increased accordingly.
AL-SMT: Multilingual Setting | The translation quality is measured by TQ for the individual systems M_{F_d→E}; it can be the BLEU score or WER/PER (word error rate and position-independent WER), which induces a maximization or minimization problem, respectively.
AL-SMT: Multilingual Setting | This process is continued iteratively until a certain level of translation quality is met (we use the BLEU score, WER, and PER) (Papineni et al., 2002).
Experiments | The number of weights λ_i is 3 plus the number of source languages, and they are trained using minimum error-rate training (MERT) to maximize the BLEU score (Och, 2003) on a development set.
Experiments | Avg BLEU Score |
Abstract | Trained on 8,975 dependency structures of a Chinese Dependency Treebank, the realizer achieves a BLEU score of 0.8874. |
Experiments | In addition to BLEU score , percentage of exactly matched sentences and average NIST simple string accuracy (SSA) are adopted as evaluation metrics. |
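Of these metrics, the percentage of exactly matched sentences is the simplest to compute; a minimal sketch, with the function name assumed:

```python
def exact_match_pct(hypotheses, references):
    """Percentage of hypothesis strings identical to their reference string.
    Assumes the two lists are aligned sentence by sentence."""
    matches = sum(h == r for h, r in zip(hypotheses, references))
    return 100.0 * matches / len(hypotheses)
```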
Experiments | We observe that the BLEU score is boosted from 0.1478 to 0.5943 by using the RPD method. |
Experiments | All four of the feature functions we tested achieve considerable improvements in BLEU score.
Log-linear Models | BLEU score , a method originally proposed to automatically evaluate machine translation quality (Papineni et al., 2002), has been widely used as a metric to evaluate general-purpose sentence generation (Langkilde, 2002; White et al., 2007; Guo et al. |
Log-linear Models | 3 The BLEU scoring script is supplied by the NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
Abstract | We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features. |
Conclusions | In comparison to a baseline model, we achieve statistically significant improvement in BLEU score . |
Generation Ranking Experiments | We evaluate the string chosen by the log-linear model against the original treebank string in terms of exact match and BLEU score (Papineni et al., 2002).
Generation Ranking Experiments | The difference in BLEU score between the model of Cahill et al. |
Experimental Evaluation | For MCE learning, we selected the reference compression that maximizes the BLEU score (Papineni et al., 2002) from the set of reference compressions, i.e., r̂ = argmax_{r∈R} BLEU(r, R\{r}), and used it as the correct data for training.
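A selection rule of this shape, picking the reference that scores highest against the remaining references, might be sketched as follows; the pluggable `sentence_bleu` argument and the averaging over the other references are illustrative assumptions, not the authors' implementation:

```python
def select_training_reference(references, sentence_bleu):
    """Pick the reference compression scoring highest against the others,
    i.e. argmax over r in R of BLEU(r, R \\ {r}), where the multi-reference
    score is approximated here by averaging pairwise sentence-level BLEU."""
    best_ref, best_score = None, float("-inf")
    for i, r in enumerate(references):
        others = references[:i] + references[i + 1:]
        # average r's score against each remaining reference
        score = sum(sentence_bleu(r, o) for o in others) / len(others)
        if score > best_score:
            best_ref, best_score = r, score
    return best_ref
```

Any sentence-level BLEU function can be passed in; the rule simply favors the reference most representative of the full reference set.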
Results and Discussion | Our method achieved the highest BLEU score . |
Results and Discussion | For example, ‘w/o PLM + Dep’ achieved the second highest BLEU score . |
Results and Discussion | Compared to 'Hori-', 'Hori' achieved a significantly higher BLEU score.
Experiments | In our experiments, all models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric as percentages.
Experiments | Figure 3 shows the BLEU score curves with up to 1000 candidates used for re-ranking. |
Experiments | Figure 4 shows the BLEU scores of a two-system co-decoding as a function of re-decoding iterations. |
Discussion and Future Work | When we visually inspect and compare the outputs of our system with those of the baseline, we observe that improved BLEU score often corresponds to visible improvements in the subjective translation quality. |
Experimental Results | These results confirm that the pairwise dominance model can significantly increase performance as measured by the BLEU score , with a consistent pattern of results across the MT06 and MT08 test sets. |
Experimental Setup | In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004).
Experiment | The 9% tree sequence rules contribute a 1.17 BLEU point improvement (28.83 vs. 27.66 in Table 1) to FTS2S over FT2S.
Experiment | Even in the 5000-best case, tree sequences still contribute a 1.1 BLEU point improvement (28.89 vs. 27.79).
Experiment | 2) The BLEU scores are very similar to each other when we increase the forest pruning threshold. |