Experiment Results | (2008), i.e., that the Hiero baseline system underperforms the phrase-based system with lexicalized reordering for Arabic-English on all test sets, on average by about 0.60 BLEU points (46.13 versus 46.73). |
Experiment Results | As mentioned in section 2.1, Phrasal-Hiero uses only 48.54% of the rules but performs as well as or better than (on average 0.24 BLEU points better) the original Hiero system using the full set of rules. |
Experiment Results | Table 4 shows that the P.H.+lex system gains on average 0.67 BLEU points (47.04 versus 46.37). |
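A note on units for the margins quoted throughout this section: BLEU is conventionally reported on a 0-100 scale, so a gain of "0.67 BLEU points" means 0.67 units on that scale (e.g., 46.37 versus 47.04). As a hedged reminder of how the score is computed, here is a minimal single-reference, unsmoothed sketch in the spirit of the standard BLEU definition; production evaluations aggregate clipped counts over the whole corpus rather than scoring one sentence.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Multiset of n-grams occurring in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Minimal single-reference BLEU on the conventional 0-100 scale.

    Geometric mean of modified (clipped) n-gram precisions for n = 1..max_n,
    multiplied by the brevity penalty. A "BLEU point" is one unit of this score.
    """
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_counts = ngram_counts(cand, n)
        r_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        precisions.append(clipped / max(1, sum(c_counts.values())))
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: any empty n-gram overlap zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    brevity = min(1.0, math.exp(1.0 - len(ref) / len(cand)))
    return 100.0 * brevity * math.exp(log_mean)
```

An identical hypothesis scores 100.0, and any missing n-gram order zeroes the unsmoothed score, which is why corpus-level aggregation (or smoothing) is used in practice.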
Abstract | Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets. |
Conclusion | In this work, we presented an approach that can expand a translation model extracted from a sentence-aligned, bilingual corpus using a large amount of unstructured, monolingual data in both source and target languages, which leads to improvements of 1.4 and 1.2 BLEU points over strong baselines on evaluation sets, and in some scenarios gains in excess of 4 BLEU points. |
Evaluation | In the “HalfMono” setup, we use only half of the monolingual comparable corpora and still obtain an improvement of 0.56 BLEU points, indicating that adding more monolingual data is likely to improve the system further. |
Evaluation | In the first setup, we get a huge improvement of 4.2 BLEU points (“SLP+Noisy”) when using the monolingual data and the noisy parallel data for graph construction. |
Evaluation | Furthermore, despite completely unaligned, non-comparable monolingual text on the Urdu and English sides, and a very large language model, we can still achieve gains in excess of 1.2 BLEU points (“SLP”) in a difficult evaluation scenario, which shows that the technique adds a genuine translation improvement over and above naïve memorization of n-gram sequences. |
Introduction | This enhancement alone results in an improvement of almost 1.4 BLEU points. |
Introduction | We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models. |
Abstract | Medium-scale experiments show an absolute and statistically significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system even with fewer rules. |
Experiments | As shown in the third row of the BLEU column, performance drops 1.7 BLEU points below the baseline system due to the poorer rule coverage. |
Experiments | This suggests that the dependency language model does improve translation quality, though by less than 1 BLEU point. |
Experiments | With the help of the dependency language model, our new model achieves a significant improvement of +0.7 BLEU points over the forest baseline system (p < 0.05, using the sign-test suggested by |
Introduction | Medium data experiments (Section 5) show a statistically significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system even with fewer translation rules; this is also the first time that a tree-to-tree model surpasses its tree-to-string counterparts. |
Model | (2009), their forest-based constituency-to-constituency system achieves performance comparable to Moses (Koehn et al., 2007), but a significant improvement of +3.6 BLEU points over the 1-best tree-based constituency-to-constituency system. |
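Several excerpts in this section cite a sign test over sentence-level comparisons for significance (the exact reference is truncated in the source). As a hedged illustration of the general procedure, not the cited paper's exact test, here is a minimal two-sided exact sign test using only the standard library; the win/loss counts in the example are invented.

```python
import math

def sign_test(wins: int, losses: int) -> float:
    """Two-sided exact sign test (ties already discarded).

    Under the null hypothesis that each system wins a given test sentence
    with probability 0.5, returns the probability of a split at least as
    lopsided as the one observed.
    """
    n = wins + losses
    k = max(wins, losses)
    # One-sided tail P(X >= k) for X ~ Binomial(n, 0.5), then doubled.
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Invented example: the improved system wins on 60 of 100 test sentences
# (ties dropped); this split is not quite significant at p < 0.05.
p = sign_test(60, 40)
```

The test only asks which system produced the better sentence-level score, so it is robust to the magnitude of per-sentence differences, which is one reason it is popular for comparing MT systems.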
Abstract | The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. |
Abstract | Experiments on the Europarl German-to-English dataset show that the proposed method leads to a 1.1 BLEU point improvement over a strong baseline. |
Abstract | Experimental results showed that their approach outperformed a baseline by 0.8 BLEU point when using monotonic decoding, but there was no |
Abstract | The data generated allows us to train a reordering model that gives an improvement of 1.8 BLEU points on the NIST MT-08 Urdu-English evaluation set over a reordering model that only uses manual word alignments, and a gain of 5.2 BLEU points over a standard phrase-based baseline. |
Conclusion | Cumulatively, we see a gain of 1.8 BLEU points over a baseline reordering model that only uses manual word alignments, a gain of 2.0 BLEU points over a hierarchical phrase-based system, and a gain of 5.2 BLEU points over a phrase-based |
Introduction | This results in a 1.8 BLEU point gain in machine translation performance on an Urdu-English machine translation task over a preordering model trained using only manual word alignments. |
Introduction | In all, this brings the gain from the preordering model to 5.2 BLEU points over a standard phrase-based system with no preordering. |
Results and Discussions | We see a significant gain of 1.8 BLEU points in machine translation by going beyond manual word alignments using the best reordering model reported in Table 3. |
Results and Discussions | We also note a gain of 2.0 BLEU points over a hierarchical phrase-based system. |
Abstract | The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. |
Conclusion | adapted parallel data showed an improvement of 1.87 BLEU points over our best baseline. |
Conclusion | Using phrase table merging that combined AR and EG’ training data in a way that preferred adapted dialectal data yielded an extra 0.86 BLEU points . |
Introduction | We built a phrasal Machine Translation (MT) system on adapted Egyptian/English parallel data, which outperformed a non-adapted baseline by 1.87 BLEU points. |
Previous Work | The system trained on AR (B1) performed poorly compared to the one trained on EG (B2), with a 6.75 BLEU point difference. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | S1, which used only EG’ for training, showed an improvement of 1.67 BLEU points over the best baseline system (B4). |
Abstract | When the selected sentence pairs are evaluated on an end-to-end MT task, our methods can increase the translation performance by 3 BLEU points. |
Conclusion | Compared with methods that employ only a language model for data selection, we observe that our methods select high-quality domain-relevant sentence pairs and improve translation performance by nearly 3 BLEU points. |
Experiments | The results show that the general-domain system, trained on a larger amount of bilingual resources, outperforms the system trained on the in-domain corpus by over 12 BLEU points. |
Experiments | In the end-to-end SMT evaluation, TM selects only the top 600k sentence pairs of the general-domain corpus, yet increases translation performance by 2.7 BLEU points. |
Experiments | Meanwhile, TM+LM and Bidirectional TM+LM gain improvements of 3.66 and 3.56 BLEU points, respectively, over the general-domain baseline system. |
MT performance results | We can see that Method 1 results in a good improvement of 1.2 BLEU points, even when using only the best (n = 1) translation from the baseline. |
MT performance results | The oracle improvement achievable by predicting inflections is quite substantial: more than 7 BLEU points. |
MT performance results | Propagating the uncertainty of the baseline system by using more input hypotheses consistently improves performance across the different methods, with an additional improvement of between 0.2 and 0.4 BLEU points. |
Experiments | We can see from the table that the domain lexicon is very helpful, significantly outperforming the baseline by more than 4.0 BLEU points. |
Experiments | When it is enhanced with the in-domain language model, it further improves translation performance by more than 2.5 BLEU points. |
Experiments | From the results, we see that transductive learning further improves translation performance significantly, by 0.6 BLEU points. |
Inferring a learning curve from mostly monolingual data | As an example, the model estimated using Lasso for the 75K anchor size exhibits a root mean squared error of 6 BLEU points. |
Inferring a learning curve from mostly monolingual data | The average distance is on the same scale as the BLEU score, which suggests that our best curves can predict the gold curve within 1.5 BLEU points on average (the best result being 0.7 BLEU points when the initial points are 1K-5K-10K-20K), which is a telling result. |
Inferring a learning curve from mostly monolingual data | For the cases where a slightly larger in-domain “seed” parallel corpus is available, we introduced an extrapolation method and a combined method yielding high-precision predictions: using models trained on up to 20K sentence pairs, we can predict performance on a given test set with a root mean squared error on the order of 1 BLEU point at 75K sentence pairs, and on the order of 2-4 BLEU points at 500K. |
Introduction | They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6). |
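The extrapolation idea described in these excerpts can be illustrated with a toy sketch: fit a three-parameter power-law learning curve, BLEU(n) ≈ a - b * n^(-c), to a few seed points and read it off at a larger corpus size. The functional form, the grid search over c, and the seed numbers are all illustrative assumptions, not the paper's actual estimator.

```python
def fit_power_law(sizes, bleus):
    """Fit BLEU(n) = a - b * n**(-c) by grid search over the exponent c,
    solving for (a, b) in closed form via simple linear regression of
    BLEU on x = n**(-c) at each candidate c."""
    best = None
    for c in (i / 100.0 for i in range(5, 151)):  # c in [0.05, 1.50]
        xs = [n ** -c for n in sizes]
        m = len(xs)
        mx, my = sum(xs) / m, sum(bleus) / m
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, bleus))
        slope = sxy / sxx
        a = my - slope * mx  # asymptotic BLEU as n grows
        b = -slope           # decay coefficient (positive for a rising curve)
        sse = sum((a - b * x - y) ** 2 for x, y in zip(xs, bleus))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]

# Invented seed points: (sentence pairs, BLEU) for small training runs.
seed = [(1000, 12.0), (5000, 17.5), (10000, 20.0), (20000, 22.0)]
a, b, c = fit_power_law([n for n, _ in seed], [y for _, y in seed])
pred_75k = a - b * 75000 ** -c  # extrapolated BLEU at 75K sentence pairs
```

The prediction error quoted above (on the order of 1 BLEU point at 75K from 20K seed points) is then just the RMSE between such extrapolations and the gold curve.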
Experiments | 3 as additional cost, the translation results in Table 11 show it helps BLEU by 0.29 BLEU points (56.13 vs. |
Experiments | over the baseline from +0.18 (via rightmost binarization) to +0.52 (via head-out-right) BLEU points. |
Experiments | Table 11 shows that when we add such boundary markups in our rules, an improvement of 0.33 BLEU points was obtained (56.46 vs. |
Experiments | By using all the features (last line in the table), we improve the translation performance over the baseline system by 0.87 BLEU point on average. |
Experiments | We clearly find that the two rule-topic distributions improve the performance by 0.48 and 0.38 BLEU points over the baseline respectively. |
Experiments | Our topic similarity method on monotone rules achieves the largest improvement, 0.6 BLEU points, while the improvement on reordering rules is the smallest among the three types. |
Introduction | Experiments on Chinese-English translation tasks (Section 6) show that our method outperforms the baseline hierarchical phrase-based system by +0.9 BLEU points. |
Experiments | The proposed predicate translation models achieve an average improvement of 0.57 BLEU points across the two NIST test sets when all features (lex+sem) are used. |
Experiments | When we integrate both lexical and semantic features (lex+sem) described in Section 3.2, we obtain an improvement of about 0.33 BLEU points over the system where only lexical features (lex) are used. |
Experiments | We obtain an average improvement of 0.4 BLEU points on the two test sets over the baseline when we incorporate the proposed argument reordering model into our system. |
Experiment | Specifically, after integrating the inside context information of PAS into the transformation, we can see that system IC-PASTR significantly outperforms system PASTR by 0.71 BLEU points. |
Experiment | Moreover, after we incorporate the MEPD model into system PASTR, we get a significant improvement over PASTR (by 0.54 BLEU points). |
Experiment | We can see that this system further achieves a remarkable improvement over system PASTR (0.95 BLEU points). |
Conclusion | When applied to English-to-Arabic translation, lattice desegmentation results in a 1.0 BLEU point improvement over one-best desegmentation, and a 1.7 BLEU point improvement over unsegmented translation. |
Results | For English-to-Arabic, 1-best desegmentation results in a 0.7 BLEU point improvement over training on unsegmented Arabic. |
Results | Moving to lattice desegmentation more than doubles that improvement, resulting in a BLEU score of 34.4 and an improvement of 1.0 BLEU point over 1-best desegmentation. |
Results | 1000-best desegmentation also works well, resulting in a 0.6 BLEU point improvement over 1-best. |
Abstract | Furthermore, integrated Model-III achieves an overall improvement of 3.48 BLEU points and a reduction of 2.62 TER points in comparison with the pure SMT system. |
Conclusion and Future Work | Compared with the pure SMT system, Model-III achieves an overall improvement of 3.48 BLEU points and a reduction of 2.62 TER points on a Chinese-English TM database. |
Experiments | The advantage over the pure SMT system is 8.03 BLEU points at interval [0.9, 1.0), while it is only 2.97 BLEU points at interval [0.6, 0.7). |
Introduction | Compared with the pure SMT system, the proposed integrated Model-III achieves an overall improvement of 3.48 BLEU points and a reduction of 2.62 TER points. |
Experiments and Results | When we remove it from RZNN, the WEPPE-based method drops about 10 BLEU points on development data and more than 6 BLEU points on test data. |
Experiments and Results | The TCBPPE-based method drops about 3 BLEU points on both the development and test data sets. |
Introduction | We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we obtain an improvement of about 1.5 BLEU points compared with a state-of-the-art baseline system. |
Abstract | On two Chinese-English tasks, our semi-supervised DAE features obtain statistically significant improvements of 1.34/2.45 (IWSLT) and 0.82/1.52 (NIST) BLEU points over the unsupervised DBN features and the baseline features, respectively. |
Conclusions | The results also demonstrate that DNN (DAE and HCDAE) features are complementary to the original features for SMT, and adding them together obtains statistically significant improvements of 3.16 (IWSLT) and 2.06 (NIST) BLEU points over the baseline features. |
Experiments and Results | Adding new DNN features as extra features significantly improves translation accuracy (rows 2-17 vs. row 1), with the highest increases of 2.45 (IWSLT) and 1.52 (NIST) BLEU points over the baseline features (row 14 vs. row 1). |
Experiments | Our sense-based translation model achieves a substantial improvement of 1.2 BLEU points over the baseline. |
Experiments | If we integrate only sense features into the sense-based translation model, we can still outperform the baseline by 0.62 BLEU points. |
Experiments | From the table, we find that the sense-based translation model outperforms the reformulated WSD by 0.57 BLEU points. |
Experiments | Adding fertility results in a further +1 BLEU point improvement. |
Experiments | While the baseline results vary by up to 1.7 BLEU points across the different alignments, our Markov model provided more stable results, with a maximum difference of 0.6 BLEU points. |
Introduction | The model produces uniformly better translations than those of a competitive phrase-based baseline, amounting to an improvement of up to 3.4 BLEU points absolute. |
Experiments | However, scaling all features to the full training set shows significant improvements for algorithm 3, and especially for algorithm 4, which gains 0.8 BLEU points over tuning 12 features on the development set. |
Experiments | Here tuning large feature sets on the respective dev sets yields significant improvements of around 2 BLEU points over tuning the 12 default features on the dev sets. |
Experiments | Another 0.5 BLEU points (test-crawl11) or even 1.3 BLEU points (test-crawl10) are gained when scaling to the full training set using iterative feature selection. |
Conclusion & Future Work | We showed that this approach can gain up to 2.2 BLEU points over its concatenation baseline and 0.39 BLEU points over a powerful mixture model. |
Experiments & Results 4.1 Experimental Setup | In particular, Switching:Max could gain up to 2.2 BLEU points over the concatenation baseline and 0.39 BLEU points over the best performing baseline (i.e. |
Experiments & Results 4.1 Experimental Setup | lowest score among the mixture operations; however, after tuning, it learns to bias the weights towards one of the models and hence improves by 1.31 BLEU points. |
Abstract | Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system). |
Experiments | On the English-Chinese data set, the improvement over the phrase-based system is 1.3 BLEU points, and 0.8 over the hierarchical phrase-based system. |
Experiments | In the tasks of translating to European languages, the improvements over the phrase-based baseline are in the range of 0.5 to 1.0 BLEU points, and 0.3 to 0.5 over the hierarchical phrase-based system. |
Experimental Setup and Results | While Noun+Adj transformations give us an increase of 2.73 BLEU points, Verbs improve the result by only 0.8 points, and the improvement with Adverbs is even lower. |
Related Work | (2007) have integrated more syntax in a factored translation approach by using CCG supertags as a separate factor and have reported a 0.46 BLEU point improvement in Dutch-to-English translations. |
Related Work | In the context of reordering, one recent work (Xu et al., 2009) was able to get an improvement of 0.6 BLEU points by using source syntactic analysis and a constituent reordering scheme like ours for English-to-Turkish translation, but without using any morphology. |
Background | For the phrase-based system, it yields over 0.6 BLEU point gains just after the 3rd iteration on all the data sets. |
Background | Also as shown in Table 1, over 0.7 BLEU point gains are obtained on the phrase-based system after 10 iterations. |
Background | The largest BLEU improvement on the phrase-based system is over 1 BLEU point in most cases. |
Evaluation | Both our systems (Model-1 and Model-2) beat the baseline phrase-based system with BLEU point differences of 4.30 and 2.75, respectively. |
Evaluation | The difference of 2.35 BLEU points between M1 and Pb1 indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu. |
Final Results | BLEU point improvement and combined with all the heuristics (M2H123) gives an overall gain of 1.95 BLEU points and is close to our best results (M1H12). |
Abstract | Comparable to the state-of-the-art phrase-based system Moses, using packed forests in tree-to-tree translation results in a significant absolute improvement of 3.6 BLEU points over using 1-best trees. |
Experiments | The absolute improvement of 3.6 BLEU points (from 0.2021 to 0.2385) is statistically significant at p < 0.01 using the sign-test as described by Collins et al. |
Introduction | Comparable to Moses, our forest-based tree-to-tree model achieves an absolute improvement of 3.6 BLEU points over the conventional tree-based model. |