Abstract | The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. |
Abstract | Experiments on the Europarl German-to-English dataset show that the proposed method leads to a 1.1 BLEU point improvement over a strong baseline. |
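Every result in this section is stated as a BLEU-point delta. For concreteness, here is a minimal sketch of corpus-level BLEU in its standard formulation (modified n-gram precisions up to 4-grams, geometric mean, brevity penalty). Unlike production toolkits it applies no smoothing or tokenization, and the names are illustrative, not taken from any of the cited systems:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list (empty if the list is too short).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypotheses, references, max_n=4):
    """Corpus BLEU with a single reference per segment, on the usual 0-100 scale."""
    clipped = [0] * max_n   # clipped n-gram matches, per order
    total = [0] * max_n     # total hypothesis n-grams, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts = Counter(ngrams(h, n))
            r_counts = Counter(ngrams(r, n))
            # Modified precision: clip each n-gram count by its reference count.
            clipped[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

On this 0-100 scale, a "1.1 BLEU point improvement" means the system's corpus score exceeds the baseline's by 1.1, e.g. `bleu(sys_out, refs) - bleu(base_out, refs)`.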
Abstract | Experimental results showed that their approach outperformed a baseline by 0.8 BLEU points when using monotonic decoding, but there was no
Inferring a learning curve from mostly monolingual data | As an example, the model estimated using Lasso for the 75K anchor size exhibits a root mean squared error of 6 BLEU points.
Inferring a learning curve from mostly monolingual data | The average distance is on the same scale as the BLEU score, which suggests that our best curves can predict the gold curve within 1.5 BLEU points on average (the best result being 0.7 BLEU points when the initial points are 1K-5K-10K-20K), which is a telling result.
Inferring a learning curve from mostly monolingual data | For the cases where a slightly larger in-domain “seed” parallel corpus is available, we introduced an extrapolation method and a combined method yielding high-precision predictions: using models trained on up to 20K sentence pairs we can predict performance on a given test set with a root mean squared error in the order of 1 BLEU point at 75K sentence pairs, and in the order of 2-4 BLEU points at 500K.
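The extrapolation idea above can be sketched as follows: fit a parametric learning curve to the BLEU scores observed at the seed sizes, then measure the prediction error at a larger size. Everything below is illustrative — the gold scores are synthetic, and the power-law form and grid-search fit are assumptions for the sketch, not the paper's actual model:

```python
import math

# Synthetic "gold" BLEU scores at increasing training-set sizes (sentence pairs).
gold = {1_000: 19.9, 5_000: 25.7, 10_000: 27.4, 20_000: 28.8, 75_000: 30.9}
seed_sizes = [1_000, 5_000, 10_000, 20_000]  # points available for fitting

def curve(n, a, b, c):
    # Power-law learning curve: BLEU approaches the asymptote a as n grows.
    return a - b * n ** (-c)

def fit(sizes, scores):
    # Coarse grid search minimising squared error on the seed points.
    best, best_err = None, float("inf")
    for a in (x / 2 for x in range(50, 81)):            # asymptote 25.0 .. 40.0
        for b in (5 * x for x in range(10, 41)):        # scale 50 .. 200
            for c in (x / 100 for x in range(10, 51)):  # exponent 0.10 .. 0.50
                err = sum((curve(n, a, b, c) - s) ** 2
                          for n, s in zip(sizes, scores))
                if err < best_err:
                    best, best_err = (a, b, c), err
    return best

a, b, c = fit(seed_sizes, [gold[n] for n in seed_sizes])
pred = curve(75_000, a, b, c)
error = abs(pred - gold[75_000])  # extrapolation error at 75K, in BLEU points
```

With seed points up to 20K, the fitted curve lands within about one BLEU point of the held-out 75K score here, mirroring the scale of errors reported above.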
Introduction | They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6). |
Experiments | By using all the features (last line in the table), we improve the translation performance over the baseline system by 0.87 BLEU points on average.
Experiments | We clearly find that the two rule-topic distributions improve the performance by 0.48 and 0.38 BLEU points over the baseline, respectively.
Experiments | Our topic similarity method on monotone rules achieves the largest improvement, 0.6 BLEU points, while the improvement on reordering rules is the smallest among the three types.
Introduction | Experiments on Chinese-English translation tasks (Section 6) show that our method outperforms the baseline hierarchical phrase-based system by +0.9 BLEU points.
Experiments | The proposed predicate translation models achieve an average improvement of 0.57 BLEU points across the two NIST test sets when all features (lex+sem) are used.
Experiments | When we integrate both lexical and semantic features (lex+sem) described in Section 3.2, we obtain an improvement of about 0.33 BLEU points over the system where only lexical features (lex) are used.
Experiments | We obtain an average improvement of 0.4 BLEU points on the two test sets over the baseline when we incorporate the proposed argument reordering model into our system. |
Conclusion & Future Work | We showed that this approach can gain up to 2.2 BLEU points over its concatenation baseline and 0.39 BLEU points over a powerful mixture model. |
Experiments & Results 4.1 Experimental Setup | In particular, Switching:Max could gain up to 2.2 BLEU points over the concatenation baseline and 0.39 BLEU points over the best performing baseline (i.e. |
Experiments & Results 4.1 Experimental Setup | lowest score among the mixture operations; however, after tuning, it learns to bias the weights towards one of the models and hence improves by 1.31 BLEU points.
Experiments | However, scaling all features to the full training set shows significant improvements for algorithm 3, and especially for algorithm 4, which gains 0.8 BLEU points over tuning 12 features on the development set. |
Experiments | Here tuning large feature sets on the respective dev sets yields significant improvements of around 2 BLEU points over tuning the 12 default features on the dev sets. |
Experiments | Another 0.5 BLEU points (test-crawl11) or even 1.3 BLEU points (test-crawl10) are gained when scaling to the full training set using iterative feature selection.