Index of papers in Proc. ACL that mention
  • BLEU points
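Every excerpt indexed below is quantified in "BLEU points". As a quick reference for the metric behind them, the following is a minimal sketch of BLEU: the geometric mean of clipped n-gram precisions, scaled by a brevity penalty. This is a simplified sentence-level, single-reference variant (unsmoothed, whitespace-tokenized); the indexed papers report corpus-level scores computed by standard tools, so treat it only as an illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU on a 0-1 scale: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Unsmoothed, single-reference sketch for illustration only."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram's count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        precisions.append(overlap / sum(cand_counts.values()))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
print(round(100 * bleu(ref, ref), 2))  # identical output scores 100 BLEU points
```

One "BLEU point" is one unit on the 0-100 scale, so a reported gain of, say, 1.8 BLEU points corresponds to a 0.018 difference on the raw 0-1 score.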
Nguyen, ThuyLinh and Vogel, Stephan
Experiment Results
(2008), i.e., that the Hiero baseline system underperforms compared to the phrase-based system with lexicalized phrase-based reordering for Arabic-English in all test sets, on average by about 0.60 BLEU points (46.13 versus 46.73).
Experiment Results
As mentioned in section 2.1, Phrasal-Hiero only uses 48.54% of the rules but achieves as good or even better performance (on average 0.24 BLEU points better) compared to the original Hiero system using the full set of rules.
Experiment Results
Table 4 shows that the P.H.+lex system gains on average 0.67 BLEU points (47.04 versus 46.37).
BLEU points is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Saluja, Avneesh and Hassan, Hany and Toutanova, Kristina and Quirk, Chris
Abstract
Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets.
Conclusion
In this work, we presented an approach that can expand a translation model extracted from a sentence-aligned, bilingual corpus using a large amount of unstructured, monolingual data in both source and target languages, which leads to improvements of 1.4 and 1.2 BLEU points over strong baselines on evaluation sets, and in some scenarios gains in excess of 4 BLEU points .
Evaluation
HalfMono”, we use only half of the monolingual comparable corpora, and still obtain an improvement of 0.56 BLEU points , indicating that adding more monolingual data is likely to improve the system further.
Evaluation
In the first setup, we get a huge improvement of 4.2 BLEU points (“SLP+Noisy”) when using the monolingual data and the noisy parallel data for graph construction.
Evaluation
Furthermore, despite completely unaligned, non-comparable monolingual text on the Urdu and English sides, and a very large language model, we can still achieve gains in excess of 1.2 BLEU points (“SLP”) in a difficult evaluation scenario, which shows that the technique adds a genuine translation improvement over and above naïve memorization of n-gram sequences.
Introduction
This enhancement alone results in an improvement of almost 1.4 BLEU points .
Introduction
We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points , even when using very large language models.
BLEU points is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Mi, Haitao and Liu, Qun
Abstract
Medium-scale experiments show an absolute and statistically significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system even with fewer rules.
Experiments
As shown in the third line in the column of BLEU score, the performance drops 1.7 BLEU points over baseline system due to the poorer rule coverage.
Experiments
This suggests that using the dependency language model really improves the translation quality by less than 1 BLEU point .
Experiments
With the help of the dependency language model, our new model achieves a significant improvement of +0.7 BLEU points over the forest 625 baseline system (p < 0.05, using the sign-test suggested by
Introduction
Medium data experiments (Section 5) show a statistically significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system even with fewer translation rules; this is also the first time that a tree-to-tree model can surpass tree-to-string counterparts.
Model
(2009), their forest-based constituency-to-constituency system achieves a comparable performance against Moses (Koehn et al., 2007), but a significant improvement of +3.6 BLEU points over the 1-best tree-based constituency-to-constituency system.
BLEU points is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
He, Xiaodong and Deng, Li
Abstract
The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system.
Abstract
Experiments on the Europarl German-to-English dataset show that the proposed method leads to a 1.1 BLEU point improvement over a strong baseline.
Abstract
Experimental results showed that their approach outperformed a baseline by 0.8 BLEU point when using monotonic decoding, but there was no
BLEU points is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Visweswariah, Karthik and Khapra, Mitesh M. and Ramanathan, Ananthakrishnan
Abstract
The data generated allows us to train a reordering model that gives an improvement of 1.8 BLEU points on the NIST MT-08 Urdu-English evaluation set over a reordering model that only uses manual word alignments, and a gain of 5.2 BLEU points over a standard phrase-based baseline.
Conclusion
Cumulatively, we see a gain of 1.8 BLEU points over a baseline reordering model that only uses manual word alignments, a gain of 2.0 BLEU points over a hierarchical phrase based system, and a gain of 5.2 BLEU points over a phrase based
Introduction
This results in a 1.8 BLEU point gain in machine translation performance on an Urdu-English machine translation task over a preordering model trained using only manual word alignments.
Introduction
In all, this increases the gain in performance by using the preordering model to 5.2 BLEU points over a standard phrase-based system with no preordering.
Results and Discussions
We see a significant gain of 1.8 BLEU points in machine translation by going beyond manual word alignments using the best reordering model reported in Table 3.
Results and Discussions
We also note a gain of 2.0 BLEU points over a hierarchical phrase based system.
BLEU points is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Sajjad, Hassan and Darwish, Kareem and Belinkov, Yonatan
Abstract
The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points .
Conclusion
adapted parallel data showed an improvement of 1.87 BLEU points over our best baseline.
Conclusion
Using phrase table merging that combined AR and EG’ training data in a way that preferred adapted dialectal data yielded an extra 0.86 BLEU points .
Introduction
We built a phrasal Machine Translation (MT) system on adapted Egyptian/English parallel data, which outperformed a non-adapted baseline by 1.87 BLEU points .
Previous Work
The system trained on AR (B1) performed poorly compared to the one trained on EG (B2) with a 6.75 BLEU points difference.
Proposed Methods 3.1 Egyptian to EG’ Conversion
S], which used only EG’ for training showed an improvement of 1.67 BLEU points from the best baseline system (B4).
BLEU points is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Liu, Le and Hong, Yu and Liu, Hao and Wang, Xing and Yao, Jianmin
Abstract
When the selected sentence pairs are evaluated on an end-to-end MT task, our methods can increase the translation performance by 3 BLEU points .
Conclusion
Compared with the methods which only employ language model for data selection, we observe that our methods are able to select high-quality do-main-relevant sentence pairs and improve the translation performance by nearly 3 BLEU points .
Experiments
The results show that General-domain system trained on a larger amount of bilingual resources outperforms the system trained on the in-domain corpus by over 12 BLEU points .
Experiments
In the end-to-end SMT evaluation, TM selects top 600k sentence pairs of general-domain corpus, but increases the translation performance by 2.7 BLEU points .
Experiments
Meanwhile, the TM+LM and Bidirectional TM+LM have gained 3.66 and 3.56 BLEU point improvements compared against the general-domain baseline system.
BLEU points is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Toutanova, Kristina and Suzuki, Hisami and Ruopp, Achim
MT performance results
We can see that Method 1 results in a good improvement of 1.2 BLEU points , even when using only the best (n = 1) translation from the baseline.
MT performance results
The oracle improvement achievable by predicting inflections is quite substantial: more than 7 BLEU points .
MT performance results
Propagating the uncertainty of the baseline system by using more input hypotheses consistently improves performance across the different methods, with an additional improvement of between .2 and .4 BLEU points .
BLEU points is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zhang, Jiajun and Zong, Chengqing
Experiments
We can see from the table that the domain lexicon is very helpful and significantly outperforms the baseline by more than 4.0 BLEU points .
Experiments
When it is enhanced with the in-domain language model, it can further improve the translation performance by more than 2.5 BLEU points .
Experiments
From the results, we see that transductive learning can further improve the translation performance significantly by 0.6 BLEU points .
BLEU points is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Kolachina, Prasanth and Cancedda, Nicola and Dymetman, Marc and Venkatapathy, Sriram
Inferring a learning curve from mostly monolingual data
As an example, the model estimated using Lasso for the 75K anchor size exhibits a root mean squared error of 6 BLEU points .
Inferring a learning curve from mostly monolingual data
The average distance is on the same scale as the BLEU score, which suggests that our best curves can predict the gold curve within 1.5 BLEU points on average (the best result being 0.7 BLEU points when the initial points are 1K-5K-10K-20K), which is a telling result.
Inferring a learning curve from mostly monolingual data
For the cases where a slightly larger in-domain “seed” parallel corpus is available, we introduced an extrapolation method and a combined method yielding high-precision predictions: using models trained on up to 20K sentence pairs we can predict performance on a given test set with a root mean squared error in the order of 1 BLEU point at 75K sentence pairs, and in the order of 2-4 BLEU points at 500K.
Introduction
They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6).
BLEU points is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zhao, Bing and Lee, Young-Suk and Luo, Xiaoqiang and Li, Liu
Experiments
3 as additional cost, the translation results in Table 11 show it helps BLEU by 0.29 BLEU points (56.13 vs.
Experiments
over the baseline from +0.18 (via rightmost binarization) to +0.52 (via head-out-right) BLEU points .
Experiments
Table 11 shows that when we add such boundary markups in our rules, an improvement of 0.33 BLEU points was obtained (56.46 vs.
BLEU points is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Xiao, Xinyan and Xiong, Deyi and Zhang, Min and Liu, Qun and Lin, Shouxun
Experiments
By using all the features (last line in the table), we improve the translation performance over the baseline system by 0.87 BLEU point on average.
Experiments
We clearly find that the two rule-topic distributions improve the performance by 0.48 and 0.38 BLEU points over the baseline respectively.
Experiments
Our topic similarity method on monotone rule achieves the most improvement which is 0.6 BLEU points , while the improvement on reordering rules is the smallest among the three types.
Introduction
Experiments on Chinese-English translation tasks (Section 6) show that, our method outperforms the baseline hierarchical phrase-based system by +0.9 BLEU points .
BLEU points is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min and Li, Haizhou
Experiments
• The proposed predicate translation models achieve an average improvement of 0.57 BLEU points across the two NIST test sets when all features (lex+sem) are used.
Experiments
• When we integrate both lexical and semantic features (lex+sem) described in Section 3.2, we obtain an improvement of about 0.33 BLEU points over the system where only lexical features (lex) are used.
Experiments
We obtain an average improvement of 0.4 BLEU points on the two test sets over the baseline when we incorporate the proposed argument reordering model into our system.
BLEU points is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zhai, Feifei and Zhang, Jiajun and Zhou, Yu and Zong, Chengqing
Experiment
Specifically, after integrating the inside context information of PAS into transformation, we can see that system IC-PASTR significantly outperforms system PASTR by 0.71 BLEU points .
Experiment
Moreover, after we import the MEPD model into system PASTR, we get a significant improvement over PASTR (by 0.54 BLEU points ).
Experiment
We can see that this system further achieves a remarkable improvement over system PASTR (0.95 BLEU points ).
BLEU points is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Salameh, Mohammad and Cherry, Colin and Kondrak, Grzegorz
Conclusion
When applied to English-to-Arabic translation, lattice desegmentation results in a 1.0 BLEU point improvement over one-best desegmentation, and a 1.7 BLEU point improvement over unsegmented translation.
Results
For English-to-Arabic, 1-best desegmentation results in a 0.7 BLEU point improvement over training on unsegmented Arabic.
Results
Moving to lattice desegmentation more than doubles that improvement, resulting in a BLEU score of 34.4 and an improvement of 1.0 BLEU point over 1-best desegmentation.
Results
1000-best desegmentation also works well, resulting in a 0.6 BLEU point improvement over 1-best.
BLEU points is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Wang, Kun and Zong, Chengqing and Su, Keh-Yih
Abstract
Furthermore, the integrated Model-III achieves an overall improvement of 3.48 BLEU points and a reduction of 2.62 TER points in comparison with the pure SMT system.
Conclusion and Future Work
Compared with the pure SMT system, Model-III achieves an overall improvement of 3.48 BLEU points and a reduction of 2.62 TER points on a Chinese-English TM database.
Experiments
SMT 8.03 BLEU points at interval [0.9, 1.0), while the advantage is only 2.97 BLEU points at interval [0.6, 0.7).
Introduction
Compared with the pure SMT system, the proposed integrated Model-III achieves an overall improvement of 3.48 BLEU points and a reduction of 2.62 TER points.
BLEU points is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Yang, Nan and Li, Mu and Zhou, Ming
Experiments and Results
When we remove it from RZNN, WEPPE based method drops about 10 BLEU points on development data and more than 6 BLEU points on test data.
Experiments and Results
TCBPPE based method drops about 3 BLEU points on both development and test data sets.
Introduction
We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we get about 1.5 BLEU points improvement, compared with a state-of-the-art baseline system.
BLEU points is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Lu, Shixiang and Chen, Zhenbiao and Xu, Bo
Abstract
On two Chinese-English tasks, our semi-supervised DAE features obtain statistically significant improvements of 1.34/2.45 (IWSLT) and 0.82/1.52 (NIST) BLEU points over the unsupervised DBN features and the baseline features, respectively.
Conclusions
The results also demonstrate that DNN (DAE and HCDAE) features are complementary to the original features for SMT, and adding them together obtain statistically significant improvements of 3.16 (IWSLT) and 2.06 (NIST) BLEU points over the baseline features.
Experiments and Results
Adding new DNN features as extra features significantly improves translation accuracy (row 2-17 vs. 1), with the highest increase of 2.45 (IWSLT) and 1.52 (NIST) (row 14 vs. 1) BLEU points over the baseline features.
BLEU points is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min
Experiments
• Our sense-based translation model achieves a substantial improvement of 1.2 BLEU points over the baseline.
Experiments
• If we only integrate sense features into the sense-based translation model, we can still outperform the baseline by 0.62 BLEU points .
Experiments
From the table, we can find that the sense-based translation model outperforms the reformulated WSD by 0.57 BLEU points .
BLEU points is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Feng, Yang and Cohn, Trevor
Experiments
Adding fertility results in a further +1 BLEU point improvement.
Experiments
While the baseline results vary by up to 1.7 BLEU points for the different alignments, our Markov model provided more stable results with the biggest difference of 0.6.
Introduction
The model produces uniformly better translations than those of a competitive phrase-based baseline, amounting to an improvement of up to 3.4 BLEU points absolute.
BLEU points is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Simianer, Patrick and Riezler, Stefan and Dyer, Chris
Experiments
However, scaling all features to the full training set shows significant improvements for algorithm 3, and especially for algorithm 4, which gains 0.8 BLEU points over tuning 12 features on the development set.
Experiments
Here tuning large feature sets on the respective dev sets yields significant improvements of around 2 BLEU points over tuning the 12 default features on the dev sets.
Experiments
Another 0.5 BLEU points (test-crawl11) or even 1.3 BLEU points (test-crawl10) are gained when scaling to the full training set using iterative feature selection.
BLEU points is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Foster, George and Sankaran, Baskaran and Sarkar, Anoop
Conclusion & Future Work
We showed that this approach can gain up to 2.2 BLEU points over its concatenation baseline and 0.39 BLEU points over a powerful mixture model.
Experiments & Results 4.1 Experimental Setup
In particular, Switching:Max could gain up to 2.2 BLEU points over the concatenation baseline and 0.39 BLEU points over the best performing baseline (i.e.
Experiments & Results 4.1 Experimental Setup
lowest score among the mixture operations, however after tuning, it learns to bias the weights towards one of the models and hence improves by 1.31 BLEU points .
BLEU points is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Fang, Licheng and Xu, Peng and Wu, Xiaoyun
Abstract
Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system).
Experiments
On the English-Chinese data set, the improvement over the phrase-based system is 1.3 BLEU points , and 0.8 over the hierarchical phrase-based system.
Experiments
In the tasks of translating to European languages, the improvements over the phrase-based baseline are in the range of 0.5 to 1.0 BLEU points , and 0.3 to 0.5 over the hierarchical phrase-based system.
BLEU points is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Yeniterzi, Reyyan and Oflazer, Kemal
Experimental Setup and Results
While Noun+Adj transformations give us an increase of 2.73 BLEU points , Verbs improve the result by only 0.8 points and the improvement with Adverbs is even lower.
Related Work
(2007) have integrated more syntax in a factored translation approach by using CCG supertags as a separate factor and have reported a 0.46 BLEU point improvement in Dutch-to-English translations.
Related Work
In the context of reordering, one recent work (Xu et al., 2009), was able to get an improvement of 0.6 BLEU points by using source syntactic analysis and a constituent reordering scheme like ours for English-to-Turkish translation, but without using any morphology.
BLEU points is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhu, Muhua and Wang, Huizhen
Background
For the phrase-based system, it yields over 0.6 BLEU point gains just after the 3rd iteration on all the data sets.
Background
Also as shown in Table 1, over 0.7 BLEU point gains are obtained on the phrase-based system after 10 iterations.
Background
The largest BLEU improvement on the phrase-based system is over 1 BLEU point in most cases.
BLEU points is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Durrani, Nadir and Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut
Evaluation
Both our systems (Model-1 and Model-2) beat the baseline phrase-based system with a BLEU point difference of 4.30 and 2.75 respectively.
Evaluation
The difference of 2.35 BLEU points between M1 and Pb1 indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.
Final Results
BLEU point improvement and combined with all the heuristics (M2H123) gives an overall gain of 1.95 BLEU points and is close to our best results (M1H12).
BLEU points is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang and Lü, Yajuan and Liu, Qun
Abstract
Comparable to the state-of-the-art phrase-based system Moses, using packed forests in tree-to-tree translation results in a significant absolute improvement of 3.6 BLEU points over using 1-best trees.
Experiments
The absolute improvement of 3.6 BLEU points (from 0.2021 to 0.2385) is statistically significant at p < 0.01 using the sign-test as described by Collins et al.
Introduction
Comparable to Moses, our forest-based tree-to-tree model achieves an absolute improvement of 3.6 BLEU points over conventional tree-based model.
BLEU points is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: