Index of papers in Proc. ACL 2014 that mention
  • BLEU
Auli, Michael and Gao, Jianfeng
Abstract
Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation.
Abstract
We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion.
Abstract
Our best results improve a phrase-based statistical machine translation system trained on WMT 2012 French-English data by up to 2.0 BLEU, and the expected BLEU objective improves over a cross-entropy trained model by up to 0.6 BLEU in a single reference setup.
Expected BLEU Training
The n-best lists serve as an approximation to the set of candidate translations of f used in the next step for expected BLEU training of the recurrent neural network model (§3).
Expected BLEU Training
3.1 Expected BLEU Objective
Expected BLEU Training
Formally, we define our loss function ℓ(θ) as the negative expected BLEU score, denoted as xBLEU(θ), for a given foreign sentence f:
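The displayed equation that this sentence introduces was lost in extraction; a hedged reconstruction, assuming E(f) denotes the n-best approximation mentioned in §3 and sBLEU a sentence-level BLEU against the reference e*(f), is:

\[
\mathrm{xBLEU}(\theta) \;=\; \sum_{e \,\in\, E(f)} p_\theta(e \mid f)\,\mathrm{sBLEU}\big(e, e^{*}(f)\big),
\qquad
\ell(\theta) \;=\; -\,\mathrm{xBLEU}(\theta).
\]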
Introduction
The expected BLEU objective provides an efficient way of achieving this for machine translation (Rosti et al., 2010; Rosti et al., 2011; He and Deng, 2012; Gao and He, 2013; Gao et al., 2014) instead of solely relying on traditional optimizers such as Minimum Error Rate Training (MERT) that only adjust the weighting of entire component models within the log-linear framework of machine translation (§3).
Introduction
We test the expected BLEU objective by training a recurrent neural network language model and obtain substantial improvements.
Recurrent Neural Network LMs
The network is trained with the back-propagation through time algorithm, which unrolls the network and then computes error gradients over multiple time steps (Rumelhart et al., 1986); we use the expected BLEU loss (§3) to obtain the error with respect to the output activations.
BLEU is mentioned in 25 sentences in this paper.
Topics mentioned in this paper:
Salameh, Mohammad and Cherry, Colin and Kondrak, Grzegorz
Experimental Setup
We evaluate our system using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006).
Methods
This could improve translation quality, as it brings our training scenario closer to our test scenario (test BLEU is always measured on unsegmented references).
Related Work
We use both segmented and unsegmented language models, and tune automatically to optimize BLEU.
Related Work
(2008) also tune on unsegmented references by simply desegmenting SMT output before MERT collects sufficient statistics for BLEU.
Results
For English-to-Arabic, 1-best desegmentation results in a 0.7 BLEU point improvement over training on unsegmented Arabic.
Results
Moving to lattice desegmentation more than doubles that improvement, resulting in a BLEU score of 34.4 and an improvement of 1.0 BLEU point over 1-best desegmentation.
Results
1000-best desegmentation also works well, resulting in a 0.6 BLEU point improvement over 1-best.
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Riezler, Stefan and Simianer, Patrick and Haas, Carolin
Experiments
[Table excerpt: Recall, F1, and BLEU scores for the compared methods.]
Experiments
Method 4, named REBOL, implements REsponse-Based Online Learning by instantiating y+ and y− to the form described in Section 4: in addition to the model score s, it uses a cost function c based on sentence-level BLEU (Nakov et al., 2012) and tests translation hypotheses for task-based feedback using a binary execution function e.
Response-based Online Learning
Computation of distance to the reference translation usually involves cost functions based on sentence-level BLEU (Nakov et al., 2012).
Response-based Online Learning
In addition, we can use translation-specific cost functions based on sentence-level BLEU in order to boost similarity of translations to human reference translations.
Response-based Online Learning
Our cost function c(y(i), y) = 1 − BLEU(y(i), y) is based on a version of sentence-level BLEU (Nakov et al., 2012).
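Purely as an illustrative sketch (not the authors' implementation), such a cost can be computed from a smoothed sentence-level BLEU; NLTK's smoothing is used below as a stand-in for the Nakov et al. (2012) variant:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def cost(hypothesis_tokens, reference_tokens):
    """c(y) = 1 - sentence-level BLEU of y against the reference; higher cost is worse."""
    smooth = SmoothingFunction().method3  # stand-in smoothing, not Nakov et al.'s exact scheme
    bleu = sentence_bleu([reference_tokens], hypothesis_tokens,
                         smoothing_function=smooth)
    return 1.0 - bleu

# Example: a hypothesis sharing most tokens with the reference gets a low cost.
print(cost("the cat is on the mat".split(), "the cat sat on the mat".split()))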
BLEU is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Salloum, Wael and Elfardy, Heba and Alamir-Salloum, Linda and Habash, Nizar and Diab, Mona
Abstract
Our best result improves over the best single MT system baseline by 1.0% BLEU and over a strong system selection baseline by 0.6% BLEU on a blind test set.
Introduction
Our best system selection approach improves over our best baseline single MT system by 1.0% absolute BLEU point on a blind test set.
MT System Selection
We run the 5,562 sentences of the classification training data through our four MT systems and produce sentence-level BLEU scores (with length penalty).
MT System Selection
We pick the name of the MT system with the highest BLEU score as the class label for that sentence.
MT System Selection
When there is a tie in BLEU scores, we pick the system label that yields better overall BLEU scores from the systems tied.
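A minimal sketch of this labeling rule under stated assumptions: NLTK's smoothed sentence-level BLEU stands in for the paper's sentence-level BLEU with length penalty, and overall_bleu holds each system's precomputed corpus-level score used for tie-breaking:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def label_sentence(outputs, reference_tokens, overall_bleu):
    """outputs: {system_name: hypothesis_tokens}; overall_bleu: {system_name: corpus BLEU}.
    Returns the system whose output scores highest at the sentence level,
    breaking ties in favour of the system with the better overall BLEU."""
    smooth = SmoothingFunction().method3
    scores = {name: sentence_bleu([reference_tokens], hyp, smoothing_function=smooth)
              for name, hyp in outputs.items()}
    best = max(scores.values())
    tied = [name for name, score in scores.items() if score == best]
    return max(tied, key=lambda name: overall_bleu[name])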
Machine Translation Experiments
Feature weights are tuned to maximize BLEU on tuning sets using Minimum Error Rate Training (Och, 2003).
Machine Translation Experiments
Results are presented in terms of BLEU (Papineni et al., 2002).
Machine Translation Experiments
All differences in BLEU scores between the four systems are statistically significant above the 95% level.
BLEU is mentioned in 25 sentences in this paper.
Topics mentioned in this paper:
Saluja, Avneesh and Hassan, Hany and Toutanova, Kristina and Quirk, Chris
Abstract
Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets.
Evaluation
We use case-insensitive BLEU (Papineni et al., 2002) to evaluate translation quality.
Evaluation
Table 4 presents the results of these variations; overall, by taking into account generated candidates appropriately and using bigrams (“SLP 2-gram”), we obtained a 1.13 BLEU gain on the test set.
Evaluation
In the “HalfMono” setting, we use only half of the monolingual comparable corpora, and still obtain an improvement of 0.56 BLEU points, indicating that adding more monolingual data is likely to improve the system further.
Introduction
This enhancement alone results in an improvement of almost 1.4 BLEU points.
Introduction
We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models.
BLEU is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Guzmán, Francisco and Joty, Shafiq and Màrquez, Llu'is and Nakov, Preslav
Experimental Results
Group III: contains other important evaluation metrics, which were not considered in the WMT12 metrics task: NIST and ROUGE for both system- and segment-level, and BLEU and TER at segment-level.
Experimental Results
Group II:   TER    .812   .836   .848
            BLEU   .810   .830   .846
Experimental Results
We can see that DR is already competitive by itself: on average, it has a correlation of .807, very close to BLEU and TER scores (.810 and .812, respectively).
Experimental Setup
To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms).
Experimental Setup
Combination of five metrics based on lexical similarity: BLEU, NIST, METEOR-ex, ROUGE-W, and TERp-A.
Related Work
A common argument is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012).
Related Work
For BLEU and TER, they observed improved correlation with human judgments on the MTC4 dataset when linearly interpolating these metrics with their lexical cohesion score.
BLEU is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Elliott, Desmond and Keller, Frank
Abstract
The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements.
Abstract
We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets.
Abstract
The main finding is that unigram BLEU has a weak correlation, and Meteor has the strongest correlation with human judgements.
Introduction
The main finding of our analysis is that TER and unigram BLEU are weakly correlated against human judgements, ROUGE-SU4 and Smoothed BLEU are moderately correlated, and the strongest correlation is found with Meteor.
Methodology
BLEU measures the effective overlap between a reference sentence X and a candidate sentence Y.
Methodology
BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n )
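To make the reconstructed formula concrete, a small sketch that evaluates it from precomputed modified n-gram precisions p_n, assuming uniform weights w_n = 1/N and a non-empty candidate:

import math

def bleu_from_precisions(precisions, cand_len, ref_len):
    """BLEU = BP * exp(sum_n w_n * log p_n), with BP = min(1, exp(1 - ref_len/cand_len))."""
    if any(p == 0.0 for p in precisions):
        return 0.0  # unsmoothed BLEU is zero if any n-gram precision is zero
    bp = 1.0 if cand_len > ref_len else math.exp(1.0 - ref_len / cand_len)
    n = len(precisions)
    return bp * math.exp(sum(math.log(p) for p in precisions) / n)

# Example: 4-gram precisions 0.6, 0.4, 0.3, 0.2 with candidate length 18 and reference length 20.
print(bleu_from_precisions([0.6, 0.4, 0.3, 0.2], 18, 20))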
Methodology
Unigram BLEU without a brevity penalty has been reported by Kulkarni et al.
BLEU is mentioned in 27 sentences in this paper.
Topics mentioned in this paper:
Duan, Manjuan and White, Michael
Abstract
Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make.
Abstract
However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved.
Introduction
With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model.
Introduction
With the SVM reranker, we obtain a significant improvement in BLEU scores over
Introduction
Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases).
Reranking with SVMs 4.1 Methods
In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments.
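As an illustration only of one way such a preference order can be derived (the authors' exact pairing scheme may differ), candidates can be paired by comparing their sentence-level BLEU against the reference:

from itertools import combinations

def preference_pairs(candidates, bleu_scores):
    """candidates: list of realization strings; bleu_scores: parallel list of their
    BLEU scores against the reference. Yields (preferred, dispreferred) pairs."""
    for i, j in combinations(range(len(candidates)), 2):
        if bleu_scores[i] > bleu_scores[j]:
            yield candidates[i], candidates[j]
        elif bleu_scores[j] > bleu_scores[i]:
            yield candidates[j], candidates[i]
        # equal scores yield no training pair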
Simple Reranking
Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations
Simple Reranking
Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93.
Simple Reranking
In sum, although simple ranking helps to avoid vicious ambiguity in some cases, the overall results of simple ranking are no better than the perceptron model (according to BLEU, at least), as parse failures that are not reflective of human interpretive tendencies too often lead the ranker to choose dispreferred realizations.
BLEU is mentioned in 20 sentences in this paper.
Topics mentioned in this paper:
Devlin, Jacob and Zbib, Rabih and Huang, Zhongqiang and Lamar, Thomas and Schwartz, Richard and Makhoul, John
Abstract
On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM.
Abstract
The NNJM features also produce a gain of +6.3 BLEU on top of a simpler baseline equivalent to Chiang’s (2007) original Hiero implementation.
Introduction
Additionally, we present several variations of this model which provide significant additive BLEU gains.
Introduction
The NNJM features produce an improvement of +3.0 BLEU on top of a baseline that is already better than the 1st place MT12 result and includes
Introduction
Additionally, on top of a simpler decoder equivalent to Chiang’s (2007) original Hiero implementation, our NNJM features are able to produce an improvement of +6.3 BLEU, as much as all of the other features in our strong baseline system combined.
Model Variations
                       Ar-En BLEU   Ch-En BLEU
OpenMT12 - 1st Place       49.5         32.6
Model Variations
BLEU scores are mixed-case.
Model Variations
On Arabic-English, the primary S2T/L2R NNJM gains +1.4 BLEU on top of our baseline, while the S2T NNLTM gains another +0.8, and the directional variations gain +0.8 BLEU more.
Neural Network Joint Model (NNJM)
We demonstrate in Section 6.6 that using one hidden layer instead of two has minimal effect on BLEU.
Neural Network Joint Model (NNJM)
We demonstrate in Section 6.6 that using the self-normalized/pre-computed NNJM results in only a very small BLEU degradation compared to the standard NNJM.
BLEU is mentioned in 36 sentences in this paper.
Topics mentioned in this paper:
Yan, Rui and Gao, Mingkun and Pavlick, Ellie and Callison-Burch, Chris
Evaluation
Metric: Since we have four professional translation sets, we can calculate the Bilingual Evaluation Understudy (BLEU) score (Papineni et al., 2002) for one professional translator (P1) using the other three (P2,3,4) as a reference set.
Evaluation
In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations.
Evaluation
This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators.
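A hedged sketch of this leave-one-out computation; sacrebleu is assumed here purely for illustration, since the excerpt cites Papineni et al. (2002) but does not name an implementation:

import sacrebleu

def translator_bleu(translations):
    """translations: {"P1": [...], "P2": [...], "P3": [...], "P4": [...]}, each a list of
    sentence strings over the same source segments. Scores each translator's output
    against the other three treated as a multi-reference set."""
    scores = {}
    for name, hyps in translations.items():
        refs = [sents for other, sents in translations.items() if other != name]
        scores[name] = sacrebleu.corpus_bleu(hyps, refs).score
    return scores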
BLEU is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Narayan, Shashi and Gardent, Claire
Experiments
To assess and compare simplification systems, two main automatic metrics have been used in previous work, namely BLEU and the Flesch-Kincaid Grade Level Index (FKG).
Experiments
BLEU gives a measure of how close a system’s output is to the gold standard simple sentence.
Experiments
Because there are many possible ways of simplifying a sentence, BLEU alone fails to correctly assess the appropriateness of a simplification.
Related Work
(2010), namely an aligned corpus of 100/131 EWKP/SWKP sentences, and show that they achieve a better BLEU score.
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Wang, Xiaolin and Utiyama, Masao and Finch, Andrew and Sumita, Eiichiro
Abstract
Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on NTCIR PatentMT corpus which is out-of-domain.
Complexity Analysis
In this section, the proposed method is first validated on monolingual segmentation tasks, and then evaluated in the context of SMT to study whether the translation quality, measured by BLEU, can be improved.
Complexity Analysis
For the bilingual tasks, the publicly available system of Moses (Koehn et al., 2007) with default settings is employed to perform machine translation, and BLEU (Papineni et al., 2002) was used to evaluate the quality.
Complexity Analysis
It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings.
Introduction
• Improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Li, Junhui and Marton, Yuval and Resnik, Philip and Daumé III, Hal
Discussion
Table 6: Performance gain in BLEU over baseline and MR08 systems averaged over all test sets.
Discussion
Table 9: Performance (BLEU score) comparison between non-oracle and oracle experiments.
Experiments
We use NIST MT 06 dataset (1664 sentence pairs) for tuning, and NIST MT 03, 05, and 08 datasets (919, 1082, and 1357 sentence pairs, respectively) for evaluation. We use BLEU (Papineni et al., 2002) for both tuning and evaluation.
Experiments
Our first group of experiments investigates whether the syntactic reordering models are able to improve translation quality in terms of BLEU.
Experiments
Table 5: System performance in BLEU scores.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Tu, Mei and Zhou, Yu and Zong, Chengqing
Experiments
In Table 3, almost all BLEU scores are improved, no matter what strategy is used.
Experiments
In particular, the best improvements, marked in bold, are as high as 1.24, 0.94, and 0.82 BLEU points over the baseline system on NIST04, CWMT08 Development, and CWMT08 Evaluation data, respectively.
Related Work
They added the labels assigned to connectives as an additional input to an SMT system, but their experimental results show that the improvements under the evaluation metric of BLEU were not significant.
Related Work
To the best of our knowledge, our work is the first attempt to exploit the source functional relationship to generate the target transitional expressions for grammatical cohesion, and we have successfully incorporated the proposed models into an SMT system with significant improvement of BLEU metrics.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Liu, Le and Hong, Yu and Liu, Hao and Wang, Xing and Yao, Jianmin
Abstract
When the selected sentence pairs are evaluated on an end-to-end MT task, our methods can increase the translation performance by 3 BLEU points.
Conclusion
Compared with the methods which only employ language model for data selection, we observe that our methods are able to select high-quality domain-relevant sentence pairs and improve the translation performance by nearly 3 BLEU points.
Experiments
The BLEU scores of the In-domain and General-domain baseline system are listed in Table 2.
Experiments
The results show that General-domain system trained on a larger amount of bilingual resources outperforms the system trained on the in-domain corpus by over 12 BLEU points.
Experiments
The horizontal coordinate represents the number of selected sentence pairs and vertical coordinate is the BLEU scores of MT systems.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Mason, Rebecca and Charniak, Eugene
Our Approach
Figure 1: BLEU scores vs. k for SumBasic extraction.
Our Approach
Although BLEU (Papineni et al., 2002) scores are widely used for image caption evaluation, we find them to be poor indicators of the quality of our model.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min
Conclusion
• The sense-based translation model is able to substantially improve translation quality in terms of both BLEU and NIST.
Experiments
System        BLEU (%)   NIST
STM (i5w)     34.64      9.4346
STM (i10w)    34.76      9.5114
STM (i15w)    -          -
Experiments
System                BLEU (%)   NIST
Base                  33.53      9.0561
STM (sense)           34.15      9.2596
STM (sense+lexicon)   34.73      9.4184
Experiments
System             BLEU (%)   NIST
Base               33.53      9.0561
Reformulated WSD   34.16      9.3820
STM                34.73      9.4184
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Cai, Jingsheng and Utiyama, Masao and Sumita, Eiichiro and Zhang, Yujie
Abstract
We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data.
Conclusion
The results showed that our approach achieved a BLEU score gain of 1.61.
Dependency-based Pre-ordering Rule Set
In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set.
Experiments
For evaluation, we used BLEU scores (Papineni et al., 2002).
Experiments
It shows the BLEU scores on the test set and the statistics of pre-ordering on the training set, which includes the total count of each rule set and the number of sentences they were applied to.
Introduction
Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Lo, Chi-kiu and Beloucif, Meriem and Saers, Markus and Wu, Dekai
Introduction
In addition, the translation adequacy across different genres (ranging from formal news to informal web forum and public speech) and different languages (English and Chinese) is improved by replacing BLEU or TER with MEANT during parameter tuning (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b).
Related Work
Surface-form oriented metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) do not correctly reflect the meaning similarities of the input sentence.
Related Work
In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy.
Related Work
TINE (Rios et al., 2011) is a recall-oriented metric which aims to preserve the basic event structure but it performs comparably to BLEU and worse than METEOR on correlation with human adequacy judgments.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhang, Chunliang
Abstract
We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data.
Conclusion and Future Work
The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data.
Evaluation
Table 1 shows the case-insensitive IBM-version BLEU and TER scores of different systems.
Evaluation
As seen from row −lmT of Table 1, the removal of the skeletal language model results in a significant drop in both BLEU and TER performance.
Evaluation
Row s-space of Table 1 shows the BLEU and TER results of restricting the baseline system to the space of skeleton-consistent derivations, i.e., we remove both the skeleton-based translation model and language model from the SBMT system.
Introduction
0 We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Gyawali, Bikash and Gardent, Claire
Conclusion
We observed that this often fails to return the best output in terms of BLEU score, fluency, grammaticality and/or meaning.
Results and Discussion
Figure 6: BLEU scores and Grammar Size (Number of Elementary TAG trees)
Results and Discussion
The average BLEU score is given with respect to all input (All) and to those inputs for which the systems generate at least one sentence (Covered).
Results and Discussion
In terms of BLEU score, the best version of our system (AUTEXP) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results similar to the fully handcrafted UDEL system.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Cui, Lei and Zhang, Dongdong and Liu, Shujie and Chen, Qiming and Li, Mu and Zhou, Ming and Yang, Muyun
Experiments
The reported BLEU scores are averaged over 5 times of running MERT (Och, 2003).
Experiments
We illustrate the relationship among translation accuracy (BLEU), the number of retrieved documents (N) and the length of hidden layers (L) on different testing datasets.
Experiments
Figure 3: End-to-end translation results (BLEU%)
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Zhang, Jiajun and Liu, Shujie and Li, Mu and Zhou, Ming and Zong, Chengqing
Experiments
Case-insensitive BLEU is employed as the evaluation metric.
Experiments
Specifically, the Significance algorithm can safely discard 64% of the phrase table at its threshold 12 with only 0.1 BLEU loss in the overall test.
Experiments
In contrast, our BRAE-based algorithm can remove 72% of the phrase table at its threshold 0.7 with only 0.06 BLEU loss in the overall evaluation.
Introduction
The experiments show that up to 72% of the phrase table can be discarded without a significant decrease in translation quality, and that decoding with phrasal semantic similarities achieves up to a 1.7 BLEU point improvement over the state-of-the-art baseline.
Related Work
(2013) also use bag-of-words but learn BLEU-sensitive phrase embeddings.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Yang, Nan and Li, Mu and Zhou, Ming
Abstract
Experiments on a Chinese to English translation task show that our proposed RZNN can outperform the state-of-the-art baseline by about 1.5 points in BLEU .
Conclusion and Future Work
We conduct experiments on a Chinese-to-English translation task, and our method outperforms a state-of-the-art baseline by about 1.5 BLEU points.
Experiments and Results
When we remove it from RZNN, WEPPE based method drops about 10 BLEU points on development data and more than 6 BLEU points on test data.
Experiments and Results
TCBPPE based method drops about 3 BLEU points on both development and test data sets.
Introduction
We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we get about 1.5 BLEU points improvement, compared with a state-of-the-art baseline system.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
van Gompel, Maarten and van den Bosch, Antal
Evaluation
We report on BLEU , NIST, METEOR, and word error rate metrics WER and PER.
Experiments & Results
The BLEU scores, not included in the figure but shown in Table 2, show a similar trend.
Experiments & Results
Statistical significance on the BLEU scores was tested using pairwise bootstrap sampling (Koehn, 2004).
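A minimal sketch of paired bootstrap resampling in the spirit of Koehn (2004), assuming sacrebleu and a single reference per segment; this is an illustration, not the authors' code:

import random
import sacrebleu

def paired_bootstrap(hyps_a, hyps_b, refs, n_samples=1000, seed=0):
    """Fraction of bootstrap resamples on which system A outscores system B in
    corpus BLEU; 1 minus this value approximates the p-value against 'A beats B'."""
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        sample = [rng.randrange(n) for _ in range(n)]
        score_a = sacrebleu.corpus_bleu([hyps_a[i] for i in sample],
                                        [[refs[i] for i in sample]]).score
        score_b = sacrebleu.corpus_bleu([hyps_b[i] for i in sample],
                                        [[refs[i] for i in sample]]).score
        if score_a > score_b:
            wins += 1
    return wins / n_samples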
Experiments & Results
Another discrepancy is found in the BLEU scores of the English→Chinese experiments, where we measure an unexpected drop in BLEU score below the baseline.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Hu, Yuening and Zhai, Ke and Eidelman, Vladimir and Boyd-Graber, Jordan
Abstract
We evaluate our model on a Chinese to English translation task and obtain up to 1.2 BLEU improvement over strong baselines.
Experiments
We refer to the SMT model without domain adaptation as the baseline. LDA marginally improves machine translation (less than half a BLEU point).
Experiments
These improvements are not redundant: our new ptLDA-dict model, which has aspects of both models, yields the best performance among these approaches, up to a 1.2 BLEU point gain (higher is better) and a 2.6 TER improvement (lower is better).
Experiments
The BLEU improvement is significant (Koehn, 2004) at p = 0.01, except on MT03 with variational and variational-hybrid inference.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zeng, Xiaodong and Chao, Lidia S. and Wong, Derek F. and Trancoso, Isabel and Tian, Liang
Experiments
We adopted three state-of-the-art metrics, BLEU (Papineni et al., 2002), NIST (Doddington et al., 2000) and METEOR (Banerjee and Lavie, 2005), to evaluate the translation quality.
Experiments
Overall, the boldface numbers in the last row illustrate that our model obtains average improvements of 1.89, 1.76 and 1.61 on BLEU, NIST and METEOR, respectively.
Experiments
Models     BLEU    NIST    METEOR
CS         29.38   59.85   54.07
SMS        30.05   61.33   55.95
UBS        30.15   61.56   55.39
Stanford   30.40   61.94   56.01
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Lu, Shixiang and Chen, Zhenbiao and Xu, Bo
Abstract
On two Chinese-English tasks, our semi-supervised DAE features obtain statistically significant improvements of 1.34/2.45 (IWSLT) and 0.82/1.52 (NIST) BLEU points over the unsupervised DBN features and the baseline features, respectively.
Conclusions
The results also demonstrate that DNN (DAE and HCDAE) features are complementary to the original features for SMT, and adding them together obtain statistically significant improvements of 3.16 (IWSLT) and 2.06 (NIST) BLEU points over the baseline features.
Experiments and Results
Adding new DNN features as extra features significantly improves translation accuracy (row 2-17 vs. 1), with the highest increase of 2.45 (IWSLT) and 1.52 (NIST) (row 14 vs. 1) BLEU points over the baseline features.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: