Index of papers in Proc. ACL 2013 that mention
  • BLEU scores
Ravi, Sujith
Abstract
We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders of magnitude faster).
Discussion and Future Work
These, when combined with standard MT systems such as Moses (Koehn et al., 2007) trained on parallel corpora, have been shown to yield some BLEU score improvements.
Experiments and Results
To evaluate translation quality, we use BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation.
Experiments and Results
We show that our method achieves the best performance (BLEU scores) on this task while being significantly faster than both previous approaches.
Experiments and Results
For both the MT tasks, we also report BLEU scores for a baseline system using identity translations for common words (words appearing in both source/target vocabularies) and random translations for other words.
BLEU scores are mentioned in 10 sentences in this paper.
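The last excerpt above describes the baseline construction: identity translations for words that appear in both vocabularies and random translations for all other words. Below is a minimal sketch of that idea, assuming toy vocabularies and a fixed random seed; neither comes from the paper.

    import random

    def identity_random_baseline(src_vocab, tgt_vocab, seed=0):
        """Translate a source word by itself if it also appears in the
        target vocabulary, otherwise pick a random target word."""
        rng = random.Random(seed)
        tgt_words = sorted(tgt_vocab)
        return {w: (w if w in tgt_vocab else rng.choice(tgt_words))
                for w in src_vocab}

    # Toy vocabularies for illustration only, not the OPUS data.
    src = {"hotel", "taxi", "danke"}
    tgt = {"hotel", "taxi", "thanks"}
    print(identity_random_baseline(src, tgt))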
Nguyen, ThuyLinh and Vogel, Stephan
Experiment Results
We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences).
Experiment Results
On average the improvement is 1.07 BLEU score (45.66
Experiment Results
without new phrase-based features and 1.14 BLEU score over the baseline Hiero system.
Phrasal-Hiero Model
Compare BLEU scores of translation using all extracted rules (the first row) and translation using only rules without nonaligned subphrases (the second row).
BLEU scores are mentioned in 8 sentences in this paper.
Hasegawa, Takayuki and Kaji, Nobuhiro and Yoshinaga, Naoki and Toyoda, Masashi
Experiments
Each utterance in the test data has more than one response that elicits the same goal emotion, because these responses are used to compute the BLEU score (see section 5.3).
Experiments
We first use BLEU score (Papineni et al., 2002) to perform automatic evaluation (Ritter et al., 2011).
Experiments
In this evaluation, the system is provided with the utterance and the goal emotion in the test data, and the generated responses are evaluated with the BLEU score.
BLEU scores are mentioned in 7 sentences in this paper.
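The excerpts above score generated responses with BLEU (Papineni et al., 2002), using the several reference responses collected per utterance and goal emotion. As a reminder, corpus-level BLEU multiplies a brevity penalty by a weighted geometric mean of n-gram precisions, BLEU = BP * exp(sum_n w_n log p_n). Below is a minimal sketch using NLTK's implementation; the tokenized responses are placeholders, not data from the paper.

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # Two reference responses for one utterance (placeholders), and one
    # generated response scored against both of them.
    references = [[["that", "sounds", "great"], ["that", "is", "wonderful"]]]
    hypotheses = [["that", "sounds", "wonderful"]]

    # Default weights give a uniform 1- to 4-gram geometric mean;
    # smoothing avoids zero precisions on short responses.
    score = corpus_bleu(references, hypotheses,
                        smoothing_function=SmoothingFunction().method1)
    print(round(100 * score, 2))  # conventionally reported on a 0-100 scale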
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Baseline MT
The scaling factors for all features are optimized by the minimum error rate training algorithm to maximize the BLEU score (Och, 2003).
Experiments
In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora.
Experiments
We computed the name-aware BLEU scores on the subset and also the aggregated average scores from human judgments.
Experiments
Furthermore, we calculated three Pearson product-moment correlation coefficients between human judgment scores and name-aware BLEU scores of these two MT systems.
Name-aware MT Evaluation
Based on the BLEU score, we design a name-aware BLEU metric as follows.
Name-aware MT Evaluation
Finally, the name-aware BLEU score is defined as:
BLEU scores are mentioned in 6 sentences in this paper.
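The experiments above correlate name-aware BLEU scores with human judgments via the Pearson product-moment correlation, r = cov(X, Y) / (sigma_X * sigma_Y). Below is a minimal sketch of that computation; the score lists are placeholders rather than the paper's 250-sentence evaluation data.

    from statistics import mean
    from math import sqrt

    def pearson(xs, ys):
        """Pearson product-moment correlation coefficient of two score lists."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # Illustrative system-level scores only, not the paper's data.
    name_aware_bleu = [23.1, 25.4, 24.0, 26.2]
    human_judgment  = [3.1, 3.6, 3.3, 3.8]
    print(round(pearson(name_aware_bleu, human_judgment), 3))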
Zhu, Conghui and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Conclusion and Future Work
The method assumes that a combined model is derived from a hierarchical Pitman-Yor process with each prior learned separately in each domain, and achieves BLEU scores competitive with traditional batch-based ones.
Experiment
The BLEU scores reported in this paper are the average of 5 independent runs of independent batch-MIRA weight training, as suggested by Clark et al. (2011).
Experiment
When comparing hier-combin with pialign-batch, the BLEU scores are a little higher while the time spent on training is much lower, almost one quarter of that of pialign-batch.
Experiment
Table 4 shows the BLEU scores for the three data sets, in which the order of combining phrase tables from each domain is alternated between ascending and descending order of similarity to the test data.
BLEU scores are mentioned in 6 sentences in this paper.
liu, lemao and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Introduction
In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e’, d’)), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e’, then the model score of (e*, d*) with this weight will also be greater than that of (e’, d’). In this paper, a pair (e*, e’) for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
Introduction
to that of Moses: on the NIST05 test set, L-Hiero achieves a BLEU score of 25.1 and Moses achieves 24.8.
Introduction
Since both MERT and PRO tuning toolkits involve randomness in their implementations, all BLEU scores reported in the experiments are the average of five tuning runs, as suggested by Clark et al. (2011).
BLEU scores are mentioned in 5 sentences in this paper.
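The first excerpt above gives the intuition behind the preference-pair objective: whenever the BLEU score of e* exceeds that of e’, the tuned weight should rank (e*, d*) above (e’, d’) under the model score. The paper's actual AdNN objective is not reproduced in this index, so the sketch below is only a generic max-margin (hinge) formulation of that intuition; the function names and the margin of 1.0 are assumptions for illustration.

    def hinge_loss(score_better, score_worse, margin=1.0):
        """Penalty for one preference pair: zero once the higher-BLEU
        candidate outscores the lower-BLEU one by at least `margin`."""
        return max(0.0, margin - (score_better - score_worse))

    def pairwise_objective(pairs, weights, features):
        """Sum of hinge losses over preference pairs (e_star, e_prime),
        where a candidate's model score is the dot product of `weights`
        with its feature vector."""
        def score(cand):
            return sum(w * f for w, f in zip(weights, features(cand)))
        return sum(hinge_loss(score(better), score(worse))
                   for better, worse in pairs)

    # Toy usage: two candidates with 2-dimensional features, e* preferred.
    feats = {"e_star": [1.0, 0.5], "e_prime": [0.2, 0.9]}
    print(pairwise_objective([("e_star", "e_prime")], [0.7, 0.3],
                             lambda c: feats[c]))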
Zarriess, Sina and Kuhn, Jonas
Experiments
When REG and linearization are applied on shallowSyn_re with gold shallow trees, the BLEU score is lower (60.57) as compared to the system that applies syntax and linearization on deepSyn_re, deep trees with gold REs (BLEU score of 63.9).
Experiments
The revision-based system with disjoint modelling of implicits shows a slight, nonsignificant increase in BLEU score.
Experiments
By contrast, the BLEU score is significantly better for the joint approach.
BLEU scores are mentioned in 5 sentences in this paper.
Cohn, Trevor and Haffari, Gholamreza
Experiments
Hence the BLEU scores we get for the baselines may appear lower than those reported in the literature.
Experiments
Table 3 shows the BLEU scores for the three translation tasks UR/AR/FA→EN based on our method against the baselines.
Experiments
For our models, we report the average BLEU score of the 5 independent runs as well as that of the aggregate phrase table generated by these 5 independent runs.
BLEU scores are mentioned in 4 sentences in this paper.
Eidelman, Vladimir and Marton, Yuval and Resnik, Philip
Additional Experiments
On the large feature set, RM is again the best performer, except perhaps for a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain.
Discussion
This correlates with our observation that RM’s overall BLEU score is negatively impacted by the brevity penalty (BP), as the BLEU precision scores are noticeably higher.
Discussion
We also notice that while PRO had the lowest BLEU scores in Chinese, it was competitive in Arabic with the highest number of features.
Experiments
In the small feature set, RAMPION yielded similar best BLEU scores, but worse TER.
BLEU scores are mentioned in 4 sentences in this paper.
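The discussion excerpt above attributes RM's lower overall BLEU to the brevity penalty despite its higher n-gram precisions. For reference, the standard BLEU brevity penalty is 1 when the candidate output is longer than the reference and exp(1 - r/c) otherwise; the lengths below are placeholders.

    from math import exp

    def brevity_penalty(candidate_len, reference_len):
        """BLEU brevity penalty: discounts output that is shorter than the
        reference even when its n-gram precisions are high."""
        if candidate_len > reference_len:
            return 1.0
        return exp(1.0 - reference_len / candidate_len)

    print(brevity_penalty(18, 20))  # placeholder lengths, approx. 0.895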
Setiawan, Hendra and Zhou, Bowen and Xiang, Bing and Shen, Libin
Experiments
In columns 2 and 4, we report the BLEU scores, while in columns 3 and 5, we report the TER scores.
Experiments
Model 2, which conditions POL on OR, provides an additional +0.2 BLEU score improvement consistently across the two genres.
Experiments
The inclusion of explicit MOS modeling in Model 4 gives a significant BLEU score improvement of +0.5 but no TER improvement in newswire.
BLEU scores are mentioned in 4 sentences in this paper.
Xiang, Bing and Luo, Xiaoqiang and Zhou, Bowen
Experimental Results
The BLEU scores from different systems are shown in Table 10 and Table 11, respectively.
Experimental Results
Preprocessing of the data with ECs inserted improves the BLEU scores by about 0.6 for newswire and 0.2 to 0.3 for the weblog data, compared to each baseline separately.
Experimental Results
Table 10: BLEU scores in the Hiero system.
BLEU scores are mentioned in 4 sentences in this paper.
Sajjad, Hassan and Darwish, Kareem and Belinkov, Yonatan
Proposed Methods 3.1 Egyptian to EG’ Conversion
Phrase merging that preferred phrases learnt from EG’ data over AR data performed the best with a BLEU score of 16.96.
Proposed Methods 3.1 Egyptian to EG’ Conversion
In further analysis, we examined 1% of the sentences with the largest difference in BLEU score .
Proposed Methods 3.1 Egyptian to EG’ Conversion
Out of these, more than 70% were cases where the EG’ model achieved a higher BLEU score .
BLEU scores are mentioned in 3 sentences in this paper.
Smith, Jason R. and Saint-Amand, Herve and Plamada, Magdalena and Koehn, Philipp and Callison-Burch, Chris and Lopez, Adam
Abstract
Table 8: BLEU scores for several language pairs' systems trained on data from WMT data.
Abstract
Table 9: BLEU scores for French-English and English-French before and after adding the mined parallel data to systems trained on data from WMT data including the French-English Gigaword (Callison-Burch et al., 2011).
Abstract
Table 12: BLEU scores for Spanish-English before and after adding the mined parallel data to a baseline Europarl system.
BLEU scores are mentioned in 3 sentences in this paper.