Index of papers in Proc. ACL 2013 that mention
  • BLEU
Zarriess, Sina and Kuhn, Jonas
Experiments
BLEU , sentence-level geometric mean of 1- to 4-gram precision, as in (Belz et al., 2011)
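For reference, a minimal sketch of a sentence-level BLEU of this kind (geometric mean of modified 1- to 4-gram precisions), assuming whitespace tokenization, simple epsilon smoothing, and a standard brevity penalty; the exact smoothing and penalty choices of Belz et al. (2011) may differ.

```python
# Illustrative sentence-level BLEU: geometric mean of clipped 1- to 4-gram
# precisions with epsilon smoothing and a brevity penalty. A sketch only,
# not the scorer used in the paper.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4, eps=1e-9):
    hyp, ref = hyp.split(), ref.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())   # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        log_precisions.append(math.log(max(overlap / total, eps)))
    geo_mean = math.exp(sum(log_precisions) / max_n)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * geo_mean

print(round(sentence_bleu("the police arrested the suspect",
                          "police arrested the suspect"), 3))
```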
Experiments
BLEUT, sentence-level BLEU computed on post-processed output where predicted referring expressions for victim and perp are replaced in the sentences (both gold and predicted) by their original role label; this score does not penalize lexical mismatches between corpus and system REs
Experiments
When REG and linearization are applied on shallowSyn_re with gold shallow trees, the BLEU score is lower (60.57) compared to the system that applies syntax and linearization on deepSyn_re, deep trees with gold REs (BLEU score of 63.9).
BLEU is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Wang, Kun and Zong, Chengqing and Su, Keh-Yih
Abstract
Furthermore, integrated Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction in comparison with the pure SMT system.
Conclusion and Future Work
The experiments show that the proposed Model-III outperforms both the TM and the SMT systems significantly (p < 0.05) in either BLEU or TER when fuzzy match score is above 0.4.
Conclusion and Future Work
Compared with the pure SMT system, Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction on a Chinese—English TM database.
Experiments
In the tables, the best translation results (either in BLEU or TER) at each interval have been marked in bold.
Experiments
Compared with TM and SMT, Model-I is significantly better than the SMT system in either BLEU or TER when the fuzzy match score is above 0.7; Model-II significantly outperforms both the TM and the SMT systems in either BLEU or TER when the fuzzy match score is above 0.5; Model-III significantly exceeds both the TM and the SMT systems in either BLEU or TER when the fuzzy match score is above 0.4.
Experiments
SMT 8.03 BLEU points at interval [0.9, 1.0), while the advantage is only 2.97 BLEU points at interval [0.6, 0.7).
Introduction
Compared with the pure SMT system, the proposed integrated Model-III achieves 3.48 BLEU points improvement and 2.62 TER points reduction overall.
BLEU is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Wang, Lu and Cardie, Claire
Introduction
Automatic evaluation (using ROUGE (Lin and Hovy, 2003) and BLEU (Papineni et al., 2002)) against manually generated focused summaries shows that our summarizers uniformly and statistically significantly outperform two baseline systems as well as a state-of-the-art supervised extraction-based system.
Results
To evaluate the full abstract generation system, the BLEU score (Papineni et al., 2002) (the precision of uni-grams and bigrams with a brevity penalty) is computed with human abstracts as reference.
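For reference, the standard formulation from Papineni et al. (2002), with c the candidate length, r the reference length, and p_n the n-gram precisions (N = 2 in the unigram/bigram setup described above):

\mathrm{BLEU} = \mathrm{BP}\cdot\exp\Big(\tfrac{1}{N}\sum_{n=1}^{N}\log p_n\Big), \qquad \mathrm{BP} = \min\big(1,\ e^{\,1 - r/c}\big).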
Results
BLEU has a fairly good agreement with human judgement and has been used to evaluate a variety of language generation systems (Angeli et al., 2010; Konstas and Lapata, 2012).
Results
BLEU
BLEU is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Eidelman, Vladimir and Marton, Yuval and Resnik, Philip
Abstract
We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.
Additional Experiments
As can be seen in Table 4, in the smaller feature set, RM and MERT were the best performers, with the exception that on MT08, MIRA yielded somewhat better (+0.7) BLEU but a somewhat worse (-0.9) TER score than RM.
Additional Experiments
On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain.
Additional Experiments
Interestingly, RM achieved substantially higher BLEU precision scores in all tests for both language pairs.
Experiments
We used cdec (Dyer et al., 2010) as our hierarchical phrase-based decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 corpus.
Experiments
The bound constraint B was set to 1.4. The approximate sentence-level BLEU cost Δ is computed in a manner similar to (Chiang et al., 2009), namely, in the context of previous 1-best translations of the tuning set.
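A rough sketch of the history idea referenced here (Chiang et al., 2009): BLEU sufficient statistics from previous 1-best translations of the tuning set are kept as a decayed running total and added to a candidate's own statistics before scoring. The decay value, helper names, and plain vector representation below are assumptions for illustration, not the authors' implementation.

```python
# Sketch: "in-context" sentence-level BLEU cost via a decayed history of
# sufficient statistics from previous 1-best translations (after Chiang
# et al., 2009). DECAY and the helpers are illustrative assumptions.
DECAY = 0.9

def update_history(history, onebest_stats):
    """Fold one sentence's 1-best statistics into the decayed history."""
    return [DECAY * h + s for h, s in zip(history, onebest_stats)]

def in_context_cost(history, candidate_stats, bleu_from_stats):
    """Cost of a candidate = 1 - BLEU computed over (history + candidate)."""
    combined = [h + c for h, c in zip(history, candidate_stats)]
    return 1.0 - bleu_from_stats(combined)
```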
Experiments
We explored alternative values for B, as well as scaling it by the current candidate’s cost, and found that the optimizer is fairly insensitive to these changes, resulting in only minor differences in BLEU .
BLEU is mentioned in 18 sentences in this paper.
Topics mentioned in this paper:
Feng, Yang and Cohn, Trevor
Abstract
Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU .
Experiments
We compared the performance of Moses using the alignment produced by our model and the baseline alignment, evaluating translation quality using BLEU (Papineni et al., 2002) with case-insensitive n-gram matching with n = 4.
Experiments
We used minimum error rate training (Och, 2003) to tune the feature weights to maximise the BLEU score on the development set.
Experiments
5 The effect on translation scores is modest, roughly amounting to +0.2 BLEU versus using a single sample.
Introduction
The model produces uniformly better translations than those of a competitive phrase-based baseline, amounting to an improvement of up to 3.4 BLEU points absolute.
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Sajjad, Hassan and Darwish, Kareem and Belinkov, Yonatan
Abstract
The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points.
Abstract
Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
Introduction
• We built a phrasal Machine Translation (MT) system on adapted Egyptian/English parallel data, which outperformed a non-adapted baseline by 1.87 BLEU points.
Previous Work
Train | LM | BLEU | OOV
Previous Work
The system trained on AR (B1) performed poorly compared to the one trained on EG (B2) with a 6.75 BLEU points difference.
Proposed Methods 3.1 Egyptian to EG’ Conversion
S1, which used only EG' for training, showed an improvement of 1.67 BLEU points over the best baseline system (B4).
Proposed Methods 3.1 Egyptian to EG’ Conversion
Phrase merging that preferred phrases learnt from EG’ data over AR data performed the best with a BLEU score of 16.96.
Proposed Methods 3.1 Egyptian to EG’ Conversion
The Egyptian sentence "wbyHtrmwA AlnAs AltAnyp" produced "lyfizfij (OOV) the second people" (BLEU = 0.31).
BLEU is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith
Abstract
We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster).
Experiments and Results
To evaluate translation quality, we use BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation.
Experiments and Results
We show that our method achieves the best performance ( BLEU scores) on this task while being significantly faster than both the previous approaches.
Experiments and Results
We also report the first BLEU results on such a large-scale MT task under truly nonparallel settings (without using any parallel data or seed lexicon).
BLEU is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Nguyen, ThuyLinh and Vogel, Stephan
Experiment Results
We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences).
Experiment Results
On average the improvement is 1.07 BLEU score (45.66
Experiment Results
Table 4: Arabic-English true case translation scores in BLEU metric.
Phrasal-Hiero Model
Compare BLEU scores of translation using all extracted rules (the first row) and translation using only rules without nonaligned subphrases (the second row).
BLEU is mentioned in 24 sentences in this paper.
Topics mentioned in this paper:
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Baseline MT
The scaling factors for all features are optimized by minimum error rate training algorithm to maximize BLEU score (Och, 2003).
Experiments
We can see that except for the BOLT3 data set with BLEU metric, our NAMT approach consistently outperformed the baseline system for all data sets with all metrics, and provided up to 23.6% relative error reduction on name translation.
Experiments
According to Wilcoxon Matched-Pairs Signed-Ranks Test, the improvement is not significant with BLEU metric, but is significant at 98% confidence level with all of the other metrics.
Introduction
• The current dominant automatic MT scoring metrics (such as Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002)) treat all words equally, but names have relatively low frequency in text (about 6% in newswire and only 3% in web documents) and thus are vastly outnumbered by function words and common nouns, etc.
Name-aware MT Evaluation
Traditional MT evaluation metrics such as BLEU (Papineni et al., 2002) and Translation Edit Rate (TER) (Snover et al., 2006) assign the same weights to all tokens equally.
Name-aware MT Evaluation
In order to properly evaluate the translation quality of NAMT methods, we propose to modify the BLEU metric so that it can dynamically assign more weights to names during evaluation.
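One illustration of such weighting, shown only for clipped unigram matches: each matched token contributes a weight rather than a count, with name tokens up-weighted. The flat NAME_WEIGHT and the function below are assumptions for illustration, not the dynamic weighting scheme of Li et al.

```python
# Illustrative name-weighted unigram precision: matched tokens contribute
# their weight (names count more) instead of a plain count. A sketch only.
from collections import Counter

NAME_WEIGHT = 3.0   # assumed up-weight for name tokens; other tokens get 1.0

def weighted_unigram_precision(hyp_tokens, ref_tokens, name_set):
    weight = lambda tok: NAME_WEIGHT if tok in name_set else 1.0
    ref_counts = Counter(ref_tokens)
    matched = total = 0.0
    for tok in hyp_tokens:
        total += weight(tok)
        if ref_counts[tok] > 0:          # clipped match, as in standard BLEU
            matched += weight(tok)
            ref_counts[tok] -= 1
    return matched / total if total else 0.0

hyp = "obama met the chancellor in berlin".split()
ref = "president obama met chancellor merkel in berlin".split()
print(weighted_unigram_precision(hyp, ref, {"obama", "merkel", "berlin"}))
```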
Name-aware MT Evaluation
BLEU considers the correspondence between a system translation and a human translation:
BLEU is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang
Introduction
Experiments show that our approach significantly outperforms both phrase-based (Koehn et al., 2007) and string-to-dependency approaches (Shen et al., 2008) in terms of BLEU and TER.
Introduction
| features | BLEU | TER |
Introduction
Adding dependency language model (“depLM”) and the maximum entropy shift-reduce parsing model (“maxent”) significantly improves BLEU and TER on the development set, both separately and jointly.
BLEU is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Setiawan, Hendra and Zhou, Bowen and Xiang, Bing and Shen, Libin
Abstract
On NIST MT08 set, our most advanced model brings around +2.0 BLEU and -1.0 TER improvement.
Experiments
MT08 nw: BLEU | TER; MT08 wb: BLEU | TER
Experiments
The best TER and BLEU results on each genre are in bold.
Experiments
For BLEU , higher scores are better, while for TER, lower scores are better.
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Hasegawa, Takayuki and Kaji, Nobuhiro and Yoshinaga, Naoki and Toyoda, Masashi
Experiments
Each utterance in the test data has more than one response that elicits the same goal emotion, because these are used to compute the BLEU score (see section 5.3).
Experiments
We first use BLEU score (Papineni et al., 2002) to perform automatic evaluation (Ritter et al., 2011).
Experiments
In this evaluation, the system is provided with the utterance and the goal emotion in the test data and the generated responses are evaluated through BLEU score.
BLEU is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Zhu, Conghui and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Abstract
The performance measured by BLEU is at least comparable to the traditional batch training method.
Conclusion and Future Work
The method assumes that a combined model is derived from a hierarchical Pitman-Yor process with each prior learned separately in each domain, and achieves BLEU scores competitive with traditional batch-based ones.
Experiment
The BLEU scores reported in this paper are the average of 5 independent runs of independent batch-MIRA weight training, as suggested by (Clark et al., 2011).
Experiment
In the IWSLT2012 data set, there is a huge gap between the HIT corpus and the BTEC corpus, and our method gains a 0.814 BLEU improvement.
Experiment
While the FBIS data set is artificially divided, with no clear human-assigned differences among sub-domains, our method loses 0.09 BLEU.
BLEU is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Sennrich, Rico and Schwenk, Holger and Aransa, Walid
Abstract
Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.
Translation Model Architecture
We found that this had no significant effects on BLEU .
Translation Model Architecture
We report translation quality using BLEU (Papineni et al., 2002).
Translation Model Architecture
For the IT test set, the system with gold labels and TM adaptation yields an improvement of 0.7 BLEU (21.1 → 21.8), LM adaptation yields 1.3 BLEU (21.1 → 22.4), and adapting both models outperforms the baseline by 2.1 BLEU (21.1 → 23.2).
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Feng, Minwei and Peter, Jan-Thorsten and Ney, Hermann
Abstract
Results on five Chinese-English NIST tasks show that our model improves the baseline system by 1.32 BLEU and 1.53 TER on average.
Conclusion
Experimental results show that our model is stable and improves the baseline system by 0.98 BLEU and 1.21 TER (trained by CRFs) and 1.32 BLEU and 1.53 TER (trained by RNN).
Experiments
• BLEU (Papineni et al., 2001) and TER (Snover et al., 2005): all reported scores are calculated in a case-insensitive (lowercased) way.
Experiments
An Index column is added for score reference convenience (B for BLEU ; T for TER).
Experiments
For the proposed model, significance testing results on both BLEU and TER are reported (B2 and B3 compared to B1, T2 and T3 compared to T1).
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Smith, Jason R. and Saint-Amand, Herve and Plamada, Magdalena and Koehn, Philipp and Callison-Burch, Chris and Lopez, Adam
Abstract
Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU .
Abstract
On general domain and speech translation tasks where test conditions substantially differ from standard government and news training text, web-mined training data improves performance substantially, resulting in improvements of up to 1.5 BLEU on standard test sets, and 5 BLEU on test sets outside of the news domain.
Abstract
For all language pairs and both test sets (WMT 2011 and WMT 2012), we show an improvement of around 0.5 BLEU .
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Visweswariah, Karthik and Khapra, Mitesh M. and Ramanathan, Ananthakrishnan
Abstract
The data generated allows us to train a reordering model that gives an improvement of 1.8 BLEU points on the NIST MT-08 Urdu-English evaluation set over a reordering model that only uses manual word alignments, and a gain of 5.2 BLEU points over a standard phrase-based baseline.
Conclusion
Cumulatively, we see a gain of 1.8 BLEU points over a baseline reordering model that only uses manual word alignments, a gain of 2.0 BLEU points over a hierarchical phrase based system, and a gain of 5.2 BLEU points over a phrase based
Experimental setup
All experiments were done on Urdu-English and we evaluate reordering in two ways: Firstly, we evaluate reordering performance directly by comparing the reordered source sentence in Urdu with a reference reordering obtained from the manual word alignments using BLEU (Papineni et al., 2002) (we call this measure monolingual BLEU or mBLEU).
Experimental setup
Additionally, we evaluate the effect of reordering on our final systems for machine translation measured using BLEU .
Introduction
This results in a 1.8 BLEU point gain in machine translation performance on an Urdu-English machine translation task over a preordering model trained using only manual word alignments.
Introduction
In all, this increases the gain in performance by using the preordering model to 5.2 BLEU points over a standard phrase-based system with no preordering.
Results and Discussions
We see a significant gain of 1.8 BLEU points in machine translation by going beyond manual word alignments using the best reordering model reported in Table 3.
Results and Discussions
We also note a gain of 2.0 BLEU points over a hierarchical phrase based system.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Weller, Marion and Fraser, Alexander and Schulte im Walde, Sabine
Experiments and evaluation
We present three types of evaluation: BLEU scores (Papineni et al., 2001), prediction accuracy on clean data and a manual evaluation of the best system in section 5.3.
Experiments and evaluation
Table 5 gives results in case-insensitive BLEU .
Experiments and evaluation
While the inflection prediction systems (1-4) are significantly better than the surface-form system (0), the different versions of the inflection systems are not distinguishable in terms of BLEU; however, our manual evaluation shows that the new features have a positive impact on translation quality.
BLEU is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Zhang, Jiajun and Zong, Chengqing
Experiments
We use BLEU (Papineni et al., 2002) score with shortest length penalty as the evaluation metric and apply the pairwise re-sampling approach (Koehn, 2004) to perform the significance test.
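For context, a minimal sketch of the pairwise re-sampling test of Koehn (2004), written against a generic corpus_score callable (e.g. corpus-level BLEU); the number of trials and helper names are illustrative assumptions.

```python
# Paired bootstrap re-sampling (after Koehn, 2004): resample test sentences
# with replacement and count how often system A's corpus score beats B's.
import random

def paired_bootstrap(sys_a, sys_b, refs, corpus_score, trials=1000, seed=0):
    rng = random.Random(seed)
    n, wins = len(refs), 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]
        a = corpus_score([sys_a[i] for i in idx], [refs[i] for i in idx])
        b = corpus_score([sys_b[i] for i in idx], [refs[i] for i in idx])
        wins += a > b
    return wins / trials   # e.g. > 0.95 suggests A better than B at p < 0.05
```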
Experiments
We can see from the table that the domain lexicon is very helpful and significantly outperforms the baseline by more than 4.0 BLEU points.
Experiments
When it is enhanced with the in-domain language model, it can further improve the translation performance by more than 2.5 BLEU points.
BLEU is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
liu, lemao and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Introduction
In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e', d')), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e', then the model score of (e*, d*) with this weight will also be greater than that of (e', d'). In this paper, a pair (e*, e') for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
Introduction
to that of Moses: on the NIST05 test set, L-Hiero achieves a BLEU score of 25.1 and Moses achieves 24.8.
Introduction
Since both MERT and PRO tuning toolkits involve randomness in their implementations, all BLEU scores reported in the experiments are the average of five tuning runs, as suggested by Clark et al.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Tamura, Akihiro and Watanabe, Taro and Sumita, Eiichiro and Takamura, Hiroya and Okumura, Manabu
Abstract
Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model.
Experiment
Table 1: Performance on Japanese-to-English Translation Measured by BLEU (%)
Experiment
Table 1 shows the performance for the test data measured by case sensitive BLEU (Papineni et al., 2002).
Experiment
Under the Moses phrase-based SMT system (Koehn et al., 2007) with the default settings, we achieved a 26.80% BLEU score.
Introduction
Further, our independent model achieves a more than 1 point gain in BLEU , which resolves the sparseness problem introduced by the bi-word observations.
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Cohn, Trevor and Haffari, Gholamreza
Experiments
Hence the BLEU scores we get for the baselines may appear lower than those reported in the literature.
Experiments
Using the factorised alignments directly in a translation system resulted in a slight loss in BLEU versus using the un-factorised alignments.
Experiments
We use minimum error rate training (Och, 2003) with nbest list size 100 to optimize the feature weights for maximum development BLEU .
BLEU is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Hewavitharana, Sanjika and Mehay, Dennis and Ananthakrishnan, Sankaranarayanan and Natarajan, Prem
Abstract
On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU , TER, and NIST.
Corpus Data and Baseline SMT
Our phrase-based decoder is similar to Moses (Koehn et al., 2007) and uses the phrase pairs and target LM to perform beam search stack decoding based on a standard log-linear model, the parameters of which were tuned with MERT (Och, 2003) on a held-out development set (3,534 sentence pairs, 45K words) using BLEU as the tuning metric.
Experimental Setup and Results
Table 1 summarizes test set performance in BLEU (Papineni et al., 2001), NIST (Doddington, 2002) and TER (Snover et al., 2006).
Experimental Setup and Results
In the ASR setting, which simulates a real-world deployment scenario, this system achieves improvements of 0.39 ( BLEU ), -0.6 (TER) and 0.08 (NIST).
Introduction
With this approach, we demonstrate significant improvements over a baseline phrase-based SMT system as measured by BLEU , TER and NIST scores on an English-to-Iraqi CSLT task.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Xiang, Bing and Luo, Xiaoqiang and Zhou, Bowen
Experimental Results
The MT systems are optimized with pairwise ranking optimization (Hopkins and May, 2011) to maximize BLEU (Papineni et al., 2002).
Experimental Results
The BLEU scores from different systems are shown in Table 10 and Table 11, respectively.
Experimental Results
Preprocessing of the data with ECs inserted improves the BLEU scores by about 0.6 for newswire and 0.2 to 0.3 for the weblog data, compared to each baseline separately.
BLEU is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Siahbani, Maryam and Haffari, Reza and Sarkar, Anoop
Conclusion
Our results showed improvement over the baselines both in intrinsic evaluations and on BLEU .
Experiments & Results 4.1 Experimental Setup
BLEU (Papineni et al., 2002) is still the de facto evaluation metric for machine translation and we use that to measure the quality of our proposed approaches for MT.
Experiments & Results 4.1 Experimental Setup
Table 6 reports the BLEU scores for different domains when the OOV translations from the graph propagation are added to the phrase-table and compares them with the baseline system (i.e.
Introduction
In general, copied-over OOVs are a hindrance to fluent, high-quality translation, and we can see evidence of this in automatic measures such as BLEU (Papineni et al., 2002) and also in human evaluation scores such as HTER.
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zhai, Feifei and Zhang, Jiajun and Zhou, Yu and Zong, Chengqing
Experiment
Specifically, after integrating the inside context information of PAS into transformation, we can see that system IC-PASTR significantly outperforms system PASTR by 0.71 BLEU points.
Experiment
Moreover, after we import the MEPD model into system PASTR, we get a significant improvement over PASTR (by 0.54 BLEU points).
Experiment
We can see that this system further achieves a remarkable improvement over system PASTR (0.95 BLEU points).
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Kuhn, Roland and Foster, George
Abstract
Experiments on large scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic to English and +1.4 BLEU on Chinese to English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation.
Experiments
The 3-feature version of VSM yields +1.8 BLEU over the baseline for Arabic to English, and +1.4 BLEU for Chinese to English.
Experiments
For instance, with an initial Chinese system that employs linear mixture LM adaptation (lin-lm) and has a BLEU of 32.1, adding 1-feature VSM adaptation (+vsm, joint) improves performance to 33.1 (improvement significant at p < 0.01), while adding 3-feature VSM instead (+vsm, 3 feat.)
Experiments
To get an intuition for how VSM adaptation improves BLEU scores, we compared outputs from the baseline and VSM-adapted system (“vsm, joint” in Table 5) on the Chinese test data.
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Braune, Fabienne and Seemann, Nina and Quernheim, Daniel and Maletti, Andreas
Experiments
System | BLEU
Baseline | 12.60
lMBOT* | 13.06
Experiments
We measured the overall translation quality with the help of 4-gram BLEU (Papineni et al., 2002), which was computed on tokenized and lower-cased data for both systems.
Experiments
We obtain a BLEU score of 13.06, which is a gain of 0.46 BLEU points over the baseline.
Introduction
The translation quality is automatically measured using BLEU scores, and we confirm the findings by providing linguistic evidence (see Section 5).
BLEU is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Braslavski, Pavel and Beloborodov, Alexander and Khalilov, Maxim and Sharoff, Serge
Evaluation methodology
In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003).
Results
OS1 usually has the highest overall score (except BLEU); it also has the highest scores for 'regulations' (more formal texts), while P1 scores are better for the news documents.
Results
Metric | Median (sentence level) | Mean (sentence level) | Trimmed (sentence level) | Corpus level
BLEU | 0.357 | 0.298 | 0.348 | 0.833
NIST | 0.357 | 0.291 | 0.347 | 0.810
Meteor | 0.429 | 0.348 | 0.393 | 0.714
TER | 0.214 | 0.186 | 0.204 | 0.619
GTM | 0.429 | 0.340 | 0.392 | 0.714
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Goto, Isao and Utiyama, Masao and Sumita, Eiichiro and Tamura, Akihiro and Kurohashi, Sadao
Abstract
In our experiments, our model improved 2.9 BLEU points for Japanese-English and 2.6 BLEU points for Chinese-English translation compared to the lexical reordering models.
Experiment
To stabilize the MERT results, we tuned three times by MERT using the first half of the development data and we selected the SMT weighting parameter set that performed the best on the second half of the development data based on the BLEU scores from the three SMT weighting parameter sets.
Experiment
To investigate the tolerance for sparsity of the training data, we reduced the training data for the sequence model to 20,000 sentences for JE translation. SEQUENCE using this model with a distortion limit of 30 achieved a BLEU score of 32.22. Although the score is lower than the score of SEQUENCE with a distortion limit of 30 in Table 3, the score was still higher than those of LINEAR, LINEAR+LEX, and 9-CLASS for JE in Table 3.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Kuznetsova, Polina and Ordonez, Vicente and Berg, Alexander and Berg, Tamara and Choi, Yejin
Code was provided by Deng et al. (2012).
To compute evaluation measures, we take the average scores of BLEU (1) and F-score (unigram-based with respect to content-words) over k = 5 candidate captions.
Code was provided by Deng et al. (2012).
Therefore, we also report scores based on semantic matching, which gives partial credit to word pairs based on their lexical similarity. The best performing approach with semantic matching is VISUAL (with LM = Image corpus), improving BLEU, Precision, F-score substantially over those of ORIG, demonstrating the extrinsic utility of our newly generated image-text parallel corpus in comparison to the original database.
Related Work
When computing BLEU with semantic matching, we look for the match with the highest similarity score among words that have not been matched before.
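A small sketch of this kind of greedy matching for unigrams, with sim standing in for whatever lexical similarity resource is used; the actual resource and any thresholding in the paper may differ.

```python
# Greedy semantic matching for unigram precision: each hypothesis word is
# matched to the most similar, not-yet-used reference word and earns that
# similarity as partial credit. Illustrative sketch only.
def semantic_unigram_precision(hyp_tokens, ref_tokens, sim):
    available = list(ref_tokens)
    credit = 0.0
    for h in hyp_tokens:
        if not available:
            break
        best_i = max(range(len(available)), key=lambda i: sim(h, available[i]))
        best_sim = sim(h, available[best_i])
        if best_sim > 0:
            credit += best_sim
            available.pop(best_i)        # each reference word matches at most once
    return credit / len(hyp_tokens) if hyp_tokens else 0.0

# toy similarity: exact match = 1, otherwise 0 (stand-in for a real resource)
print(semantic_unigram_precision("a dog runs".split(), "the dog ran".split(),
                                 lambda a, b: 1.0 if a == b else 0.0))
```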
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Ling, Wang and Xiang, Guang and Dyer, Chris and Black, Alan and Trancoso, Isabel
Experiments
Table 3: BLEU scores for different datasets in different translation directions (left to right), broken with different training corpora (top to bottom).
Experiments
The BLEU scores for the different parallel corpora are shown in Table 3 and the top 10 out-of-vocabulary (OOV) words for each dataset are shown in Table 4.
Experiments
However, by combining the Weibo parallel data with this standard data, improvements in BLEU are obtained.
BLEU is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: