Index of papers in Proc. ACL that mention
  • BLEU score
He, Xiaodong and Deng, Li
Abstract
The training objective is an expected BLEU score, which is closely linked to translation quality.
Abstract
bold updating), the author proposed a local updating strategy where the model parameters are updated towards a pseudo-reference (i.e., the hypothesis in the n-best list that gives the best BLEU score).
Abstract
In our work, we use the expectation of BLEU scores as the objective.
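A minimal sketch of such an expected-BLEU objective (an illustration under assumptions, not the authors' code): the expectation is taken over the model's posterior, here approximated by softmax-normalising model scores over an n-best list; `sentence_bleu` stands for any sentence-level BLEU implementation.

```python
# Expected BLEU over an n-best list: E[BLEU] = sum_e P(e | f) * BLEU(e, r),
# with P(e | f) approximated by softmax-normalised model scores.
import math

def expected_bleu(nbest, reference, sentence_bleu, scale=1.0):
    """nbest: list of (hypothesis, model_score) pairs for one source sentence."""
    max_score = max(score for _, score in nbest)
    # Shift by the max score for numerical stability before exponentiating.
    weights = [math.exp(scale * (score - max_score)) for _, score in nbest]
    z = sum(weights)
    return sum((w / z) * sentence_bleu(hyp, reference)
               for (hyp, _), w in zip(nbest, weights))
```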
BLEU score is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Tan, Ming and Zhou, Wenli and Zheng, Lei and Wang, Shaojun
Abstract
The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
Experimental results
We substitute our language model and use MERT (Och, 2003) to optimize the BLEU score (Papineni et al., 2002).
Experimental results
We partition the data into ten pieces; nine pieces are used as training data to optimize the BLEU score (Papineni et al., 2002) by MERT (Och, 2003), and the remaining piece is used to re-rank the 1000-best list and obtain the BLEU score.
Introduction
ply our language models to the task of re-ranking the N-best list from Hiero (Chiang, 2005; Chiang, 2007), a state-of-the-art parsing-based MT system, we achieve significantly better translation quality measured by the BLEU score and “readability”.
BLEU score is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Yeniterzi, Reyyan and Oflazer, Kemal
Abstract
We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores.
Experimental Setup and Results
Wherever meaningful, we report the average BLEU scores over 10 data sets along with the maximum and minimum values and the standard deviation.
Experimental Setup and Results
Table 1: BLEU scores for a variety of transformation combinations
Experimental Setup and Results
Note that in this case, the translations would be generated in the same format, but we then split such postpositions from the words they are attached to, during decoding, and then evaluate the BLEU score.
Introduction
We find that with the full set of syntax-to-morphology transformations and some additional techniques we can get about 39% relative improvement in BLEU scores over a word-based baseline and about 28% improvement over a factored baseline, all experiments being done over 10 training and test sets.
Syntax-to-Morphology Mapping
We find (and elaborate later) that this reduction in the English side of the training corpus, in general, is about 30%, and is correlated with improved BLEU scores.
BLEU score is mentioned in 22 sentences in this paper.
Topics mentioned in this paper:
Wu, Hua and Wang, Haifeng
Discussion
Table 6: CRR translation results (BLEU scores) by using different RBMT systems
Discussion
The BLEU scores are 43.90 and 29.77 for System A and System B, respectively.
Discussion
If we compare the results with those only using SMT systems as described in Table 3, the translation quality was greatly improved by at least 3 BLEU scores, even if the translation ac-
Experiments
Translation quality was evaluated using both the BLEU score proposed by Papineni et al.
Experiments
The results also show that our translation selection method is very effective, which achieved absolute improvements of about 4 and 1 BLEU scores on CRR and ASR inputs, respectively.
Experiments
As compared with those in Table 3, the translation quality was greatly improved, with absolute improvements of at least 5.1 and 3.9 BLEU scores on CRR and ASR inputs for system combination results.
Translation Selection
In this paper, we modify the method in Albrecht and Hwa (2007) to only prepare human reference translations for the training examples, and then evaluate the translations produced by the subject systems against the references using BLEU score (Papineni et al., 2002).
Translation Selection
We use smoothed sentence-level BLEU score to replace the human assessments, where we use additive smoothing to avoid zero BLEU scores when we calculate the n-gram precisions.
Translation Selection
In the context of translation selection, y is assigned as the smoothed BLEU score.
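A minimal sketch of a smoothed sentence-level BLEU of the kind described here, assuming simple add-one smoothing of the n-gram precisions (the paper's exact smoothing may differ):

```python
# Sentence-level BLEU with additive smoothing so that a missing n-gram match
# never zeroes out the whole score.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_sentence_bleu(hyp, ref, max_n=4):
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(count, r[gram]) for gram, count in h.items())
        total = sum(h.values())
        # Additive smoothing: +1 in numerator and denominator avoids zero precisions.
        log_prec += math.log((match + 1.0) / (total + 1.0))
    # Standard brevity penalty.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1.0 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec / max_n)
```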
BLEU score is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Duan, Manjuan and White, Michael
Abstract
Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make.
Abstract
However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved.
Introduction
With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model.
Introduction
With the SVM reranker, we obtain a significant improvement in BLEU scores over
Introduction
Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases).
Reranking with SVMs 4.1 Methods
In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments.
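A hedged sketch of this pairwise construction (names and the optional margin are illustrative, not taken from the paper): candidate realizations are ranked by sentence-level BLEU against the reference, and ordered (better, worse) pairs become training examples for the SVM ranker.

```python
# Build (preferred, dispreferred) pairs from candidate realizations ranked by
# sentence-level BLEU against the reference sentence.
def preference_pairs(candidates, reference, sentence_bleu, margin=0.0):
    scored = sorted(((sentence_bleu(c, reference), c) for c in candidates),
                    reverse=True)
    pairs = []
    for i, (bleu_i, cand_i) in enumerate(scored):
        for bleu_j, cand_j in scored[i + 1:]:
            if bleu_i - bleu_j > margin:      # keep only pairs with a clear BLEU preference
                pairs.append((cand_i, cand_j))  # cand_i is preferred over cand_j
    return pairs
```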
Reranking with SVMs 4.1 Methods
The complete model, BBS+dep+nbest, achieved a BLEU score of 88.73, significantly improving upon the perceptron model (p < 0.02).
Simple Reranking
Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations
Simple Reranking
Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93.
BLEU score is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Ganchev, Kuzman and Graça, João V. and Taskar, Ben
Abstract
We propose and extensively evaluate a simple method for using alignment models to produce alignments better-suited for phrase-based MT systems, and show significant gains (as measured by BLEU score) in end-to-end translation systems for six language pairs used in recent MT competitions.
Conclusions
Table 3: BLEU scores for all language pairs using all available data.
Introduction
Our contribution is a large scale evaluation of this methodology for word alignments, an investigation of how the produced alignments differ and how they can be used to consistently improve machine translation performance (as measured by BLEU score) across many languages on training corpora with up to a hundred thousand sentences.
Introduction
In 10 out of 12 cases we improve BLEU score by at least 1 point and by more than 1 point in 4 out of 12 cases.
Phrase-based machine translation
We report BLEU scores using a script available with the baseline system.
Phrase-based machine translation
Figure 8: BLEU score as the amount of training data is increased on the Hansards corpus for the best decoding method for each alignment model.
Phrase-based machine translation
In principle, we would like to tune the threshold by optimizing BLEU score on a development set, but that is impractical for experiments with many pairs of languages.
Word alignment results
Unfortunately, as was shown by Fraser and Marcu (2007), AER can have weak correlation with translation performance as measured by BLEU score (Papineni et al., 2002), when the alignments are used to train a phrase-based translation system.
BLEU score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Mi, Haitao and Huang, Liang and Liu, Qun
Experiments
We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on the dev set.
Experiments
The BLEU score of the baseline 1-best decoding is 0.2325, which is consistent with the result of 0.2302 in (Liu et al., 2007) on the same training, development and test sets, and with the same rule extraction procedure.
BLEU score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Sun, Hong and Zhou, Ming
Abstract
In addition, a revised BLEU score (called iBLEU) which measures the adequacy and diversity of the generated paraphrase sentence is proposed for tuning parameters in SMT systems.
Conclusion
Furthermore, a revised BLEU score that balances between paraphrase adequacy and dissimilarity is proposed in our training process.
Discussion
The first part of iBLEU, which is the traditional BLEU score, helps to ensure the quality of the machine translation results.
Experiments and Results
We show the BLEU score (computed against references) to measure the adequacy and self-BLEU (computed against source sentence) to evaluate the dissimilarity (lower is better).
Experiments and Results
From the results we can see that, when the value of α decreases to place more penalty on self-paraphrase, the self-BLEU score rapidly decays, while the consequent effect is that the BLEU score computed against references also drops seriously.
Experiments and Results
This is not achievable without joint learning, or with the traditional BLEU score, which does not take self-paraphrase into consideration.
Introduction
The jointly-learned dual SMT system: (1) Adapts the SMT systems so that they are tuned specifically for paraphrase generation purposes, e.g., to increase the dissimilarity; (2) Employs a revised BLEU score (named iBLEU, as it's an input-aware BLEU metric) that measures adequacy and dissimilarity of the paraphrase results at the same time.
Paraphrasing with a Dual SMT System
Two issues are also raised in (Zhao and Wang, 2010) about using automatic metrics: a paraphrase that changes less gets a larger BLEU score, and the evaluations of paraphrase quality and rate tend to be incompatible.
Paraphrasing with a Dual SMT System
(2005) have shown the capability for measuring semantic equivalency using BLEU score); BLEU(c, s) is the BLEU score computed between the candidate and the source sentence to measure the dissimilarity.
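A sketch of the iBLEU combination as described in these excerpts (the exact weighting used in the paper may differ; alpha here is only illustrative): adequacy is measured by BLEU against the references, dissimilarity by penalising BLEU against the source.

```python
# iBLEU rewards similarity to the reference(s) while penalising similarity to
# the input, so that trivial copies of the source score poorly.
# `bleu` is any BLEU implementation; alpha is the trade-off weight.
def ibleu(candidate, references, source, bleu, alpha=0.8):
    adequacy = bleu(candidate, references)   # BLEU(c, r): closeness to the reference(s)
    self_bleu = bleu(candidate, [source])    # BLEU(c, s): closeness to the input
    return alpha * adequacy - (1.0 - alpha) * self_bleu
```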
BLEU score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Toutanova, Kristina and Suzuki, Hisami and Ruopp, Achim
Integration of inflection models with MT systems
We performed a grid search on the values of λ and n, to maximize the BLEU score of the final system on a development set (dev) of 1000 sentences (Table 2).
MT performance results
We also report oracle BLEU scores which incorporate two kinds of oracle knowledge.
MT performance results
For the methods using n=1 translation from a base MT system, the oracle BLEU score is the BLEU score of the stemmed translation compared to the stemmed reference, which represents the upper bound achievable by changing only the inflected forms (but not stems) of the words in a translation.
MT performance results
This system achieves a substantially better BLEU score (by 6.76) than the treelet system.
BLEU score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Uszkoreit, Jakob and Brants, Thorsten
Abstract
We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
Conclusion
The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks.
Experiments
Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English.
Experiments
minimum error rate training (Och, 2003) with BLEU score as the objective function.
Experiments
Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task.
BLEU score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith
Abstract
We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders of magnitude faster).
Discussion and Future Work
These when combined with standard MT systems such as Moses (Koehn et al., 2007) trained on parallel corpora, have been shown to yield some BLEU score improvements.
Experiments and Results
To evaluate translation quality, we use BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation.
Experiments and Results
We show that our method achieves the best performance (BLEU scores) on this task while being significantly faster than both the previous approaches.
Experiments and Results
For both the MT tasks, we also report BLEU scores for a baseline system using identity translations for common words (words appearing in both source/target vocabularies) and random translations for other words.
BLEU score is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Zaslavskiy, Mikhail and Dymetman, Marc and Cancedda, Nicola
Experiments
In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences.
Experiments
Figure 6 presents Decoder and Bleu scores as functions of time for the two corpuses.
BLEU score is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhu, Muhua and Wang, Huizhen
Background
where BLEU(e_ij, r_i) is the smoothed sentence-level BLEU score (Liang et al., 2006) of the translation e_ij with respect to the reference translations r_i, and e_i* is the oracle translation, which is selected from {e_i1, ..., e_im} in terms of BLEU(e_ij, r_i).
Background
Figures 2-5 show the BLEU curves on the development and test sets, where the x-axis is the iteration number, and the y-axis is the BLEU score of the system generated by the boosting-based system combination.
Background
The BLEU scores tend to converge to the stable values after 20 iterations for all the systems.
BLEU score is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Mi, Haitao and Liu, Qun
Conclusion and Future Work
Using all constituency-to-dependency translation rules and bilingual phrases, our model achieves +0.7 points improvement in BLEU score significantly over a state-of-the-art forest-based tree-to-string system.
Experiments
We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on development set.
Experiments
The baseline system extracts 31.9M 625 rules, 77.9M 525 rules respectively and achieves a BLEU score of 34.17 on the test set.
Experiments
As shown in the third line in the column of BLEU score, the performance drops 1.7 BLEU points over the baseline system due to the poorer rule coverage.
BLEU score is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Liu, Zhanyi and Wang, Haifeng and Wu, Hua and Li, Sheng
Abstract
As compared to baseline systems, we achieve absolute improvements of 2.40 BLEU score on a phrase-based SMT system and 1.76 BLEU score on a parsing-based SMT system.
Conclusion
The improved word alignment results in an improvement of 2.16 BLEU score on a phrase-based SMT system and an improvement of 1.76 BLEU score on a parsing-based SMT system.
Conclusion
When we also used phrase collocation probabilities as additional features, the phrase-based SMT performance is finally improved by 2.40 BLEU score as compared with the baseline system.
Experiments on Parsing-Based SMT
The system using the improved word alignments achieves an absolute improvement of 1.76 BLEU score, which indicates that the improvements of word alignments are also effective in improving the performance of the parsing-based SMT systems.
Experiments on Phrase-Based SMT
If the same alignment method is used, the systems using CM-3 got the highest BLEU scores.
Experiments on Phrase-Based SMT
When the phrase collocation probabilities are incorporated into the SMT system, the translation quality is improved, achieving an absolute improvement of 0.85 BLEU score.
Experiments on Phrase-Based SMT
As compared with the baseline system, an absolute improvement of 2.40 BLEU score is achieved.
Introduction
The alignment improvement results in an improvement of 2.16 BLEU score on phrase-based SMT system and an improvement of 1.76 BLEU score on parsing-based SMT system.
Introduction
SMT performance is further improved by 0.24 BLEU score .
BLEU score is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang and Mi, Haitao and Feng, Yang and Liu, Qun
Conclusion
As our decoder accounts for multiple derivations, we extend the MERT algorithm to tune feature weights with respect to BLEU score for max-translation decoding.
Experiments
Table 2: Comparison of individual decoding ... (seconds/sentence) and BLEU score (case-insensitive).
Experiments
With conventional max-derivation decoding, the hierarchical phrase-based model achieved a BLEU score of 30.11 on the test set, with an average decoding time of 40.53 seconds/sentence.
Experiments
We found that accounting for all possible derivations in max-translation decoding resulted in a small negative effect on BLEU score (from 30.11 to 29.82), even though the feature weights were tuned with respect to BLEU score .
Introduction
As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
BLEU score is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Kumar, Shankar and Macherey, Wolfgang and Dyer, Chris and Och, Franz
Experiments
MERT is then performed to optimize the BLEU score on a development set; For MERT, we use 40 random initial parameters as well as parameters computed using corpus based statistics (Tromble et al., 2008).
Experiments
We consider a BLEU score difference to be a) a gain if it is at least 0.2 points, b) a drop if it is at most -0.2 points, and c) no change otherwise.
Experiments
When MBR does not produce a higher BLEU score relative to MAP on the development set, MERT assigns a higher weight to this feature function.
Introduction
Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
Introduction
We employ MERT to select these weights by optimizing BLEU score on a development set.
Introduction
In contrast, our MBR algorithm directly selects the hypothesis in the hypergraph with the maximum expected approximate corpus BLEU score (Tromble et al., 2008).
MERT for MBR Parameter Optimization
We now have a total of N +2 feature functions which we optimize using MERT to obtain highest BLEU score on a training set.
Minimum Bayes-Risk Decoding
(2008) extended MBR decoding to translation lattices under an approximate BLEU score.
BLEU score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang and Lü, Yajuan and Liu, Qun
Experiments
Table 3: Comparison of BLEU scores for tree-based and forest-based tree-to-tree models.
Experiments
Table 3 shows the BLEU scores of tree-based and forest-based tree-to-tree models achieved on the test set over different pruning thresholds.
Experiments
With the increase of the number of rules used, the BLEU score increased accordingly.
BLEU score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Li, Zhifei and Eisner, Jason and Khudanpur, Sanjeev
Experimental Results
Table 1: BLEU scores for Viterbi, Crunching, MBR, and variational decoding.
Experimental Results
Table 1 presents the BLEU scores under Viterbi, crunching, MBR, and variational decoding.
Experimental Results
Moreover, a bigram (i.e., “2gram”) achieves the best BLEU scores among the four different orders of VMs.
BLEU score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Wu, Xianchao and Matsuzaki, Takuya and Tsujii, Jun'ichi
Abstract
Extensive experiments involving large-scale English-to-Japanese translation revealed a significant improvement of 1.8 points in BLEU score , as compared with a strong forest-to-string baseline system.
Conclusion
Extensive experiments on large-scale English-to-Japanese translation resulted in a significant improvement in BLEU score of 1.8 points (p < 0.01), as compared with our implementation of a strong forest-to-string baseline system (Mi et al., 2008; Mi and Huang, 2008).
Experiments
Here, fw denotes function word, DT denotes the decoding time, and the BLEU scores were computed on the test set.
Experiments
the final BLEU scores of C3-T with Min-F and C3-F.
Experiments
Using the composed rule set C3-F in our forest-based decoder, we achieved an optimal BLEU score of 28.89%.
Introduction
(2008) achieved a 3.1-point improvement in BLEU score (Papineni et al., 2002) by including bilingual syntactic phrases in their forest-based system.
Introduction
Using the composed rules of the present study in a baseline forest-to-string translation system results in a 1.8-point improvement in the BLEU score for large-scale English-to-Japanese translation.
BLEU score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Nguyen, ThuyLinh and Vogel, Stephan
Experiment Results
We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences).
Experiment Results
On average the improvement is 1.07 BLEU score (45.66 ...) without new phrase-based features and 1.14 BLEU score over the baseline Hiero system.
Phrasal-Hiero Model
Compare BLEU scores of translation using all extracted rules (the first row) and translation using only rules without nonaligned subphrases (the second row).
BLEU score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Clifton, Ann and Sarkar, Anoop
Conclusion and Future Work
We found that using a segmented translation model based on unsupervised morphology induction and a model that combined morpheme segments in the translation model with a postprocessing morphology prediction model gave us better BLEU scores than a word-based baseline.
Experimental Results
All the BLEU scores reported are for lowercase evaluation.
Experimental Results
No Uni indicates the segmented BLEU score without unigrams.
Experimental Results
...of m-BLEU score (Luong et al., 2010) where the BLEU score is computed by comparing the segmented output with a segmented reference translation.
Models 2.1 Baseline Models
performance of unsupervised segmentation for translation, our third baseline is a segmented translation model based on a supervised segmentation model (called Sup), using the hand-built Omorfi morphological analyzer (Pirinen and Listenmaa, 2007), which provided slightly higher BLEU scores than the word-based baseline.
Translation and Morphology
Our proposed approaches are significantly better than the state of the art, achieving the highest reported BLEU scores on the English-Finnish Europarl version 3 dataset.
BLEU score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Gildea, Daniel
Abstract
An additional fast decoding pass maximizing the expected count of correct translation hypotheses increases the BLEU score significantly.
Conclusion
This technique, together with the progressive search at previous stages, gives a decoder that produces the highest BLEU score we have obtained on the data in a very reasonable amount of time.
Experiments
Table 1: Speed and BLEU scores for two-pass decoding.
Experiments
However, model scores do not directly translate into BLEU scores .
Experiments
In order to maximize BLEU score using the algorithm described in Section 4, we need a sizable trigram forest as a starting point.
Introduction
With this heuristic, we achieve the same BLEU scores and model cost as a trigram decoder with essentially the same speed as a bigram decoder.
BLEU score is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Kolachina, Prasanth and Cancedda, Nicola and Dymetman, Marc and Venkatapathy, Sriram
Inferring a learning curve from mostly monolingual data
Our objective is to predict the evolution of the BLEU score on the given test set as a function of the size of a random subset of the training data.
Inferring a learning curve from mostly monolingual data
We first train models to predict the BLEU score at m anchor sizes s1, ..., sm.
Inferring a learning curve from mostly monolingual data
We then perform inference using these models to predict the BLEU score at each anchor, for the test case of interest.
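An illustrative sketch of this curve-fitting step, assuming a power-law family (the paper compares several parametric families; the anchor sizes and scores below are made up):

```python
# Fit a three-parameter power-law curve BLEU(n) ~ c - a * n**(-b) through the
# predicted BLEU values at the anchor sizes, then extrapolate to larger n.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, c, a, b):
    return c - a * np.power(n, -b)

anchor_sizes = np.array([10_000, 20_000, 50_000, 100_000])  # hypothetical anchors s1..sm
anchor_bleu = np.array([18.2, 20.1, 22.4, 23.8])             # hypothetical predictions

params, _ = curve_fit(power_law, anchor_sizes, anchor_bleu,
                      p0=[30.0, 100.0, 0.3], maxfev=10_000)
print(power_law(500_000, *params))  # extrapolated BLEU at 500k sentence pairs
```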
Selecting a parametric family of curves
The values are on the same scale as the BLEU scores .
BLEU score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Yan, Rui and Gao, Mingkun and Pavlick, Ellie and Callison-Burch, Chris
Evaluation
In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations.
Evaluation
This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators.
Evaluation
As expected, random selection yields bad performance, with a BLEU score of 30.52.
BLEU score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wang, Haifeng and Guo, Yuqing and Liu, Ting
Abstract
Trained on 8,975 dependency structures of a Chinese Dependency Treebank, the realizer achieves a BLEU score of 0.8874.
Experiments
In addition to BLEU score, the percentage of exactly matched sentences and the average NIST simple string accuracy (SSA) are adopted as evaluation metrics.
Experiments
We observe that the BLEU score is boosted from 0.1478 to 0.5943 by using the RPD method.
Experiments
All of the four feature functions we have tested achieve considerable improvement in BLEU scores .
Log-linear Models
BLEU score, a method originally proposed to automatically evaluate machine translation quality (Papineni et al., 2002), has been widely used as a metric to evaluate general-purpose sentence generation (Langkilde, 2002; White et al., 2007; Guo et al.
Log-linear Models
The BLEU scoring script is supplied by NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
BLEU score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Haffari, Gholamreza and Sarkar, Anoop
AL-SMT: Multilingual Setting
The translation quality is measured by TQ for individual systems M_{Fd→E}; it can be the BLEU score or WER/PER (word error rate and position-independent WER), which induces a maximization or minimization problem, respectively.
AL-SMT: Multilingual Setting
This process is continued iteratively until a certain level of translation quality is met (we use the BLEU score, WER and PER) (Papineni et al., 2002).
Experiments
The number of weights w_i is 3 plus the number of source languages, and they are trained using minimum error-rate training (MERT) to maximize the BLEU score (Och, 2003) on a development set.
BLEU score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Hasegawa, Takayuki and Kaji, Nobuhiro and Yoshinaga, Naoki and Toyoda, Masashi
Experiments
Each utterance in the test data has multiple responses that elicit the same goal emotion, because they are used to compute BLEU score (see section 5.3).
Experiments
We first use BLEU score (Papineni et al., 2002) to perform automatic evaluation (Ritter et al., 2011).
Experiments
In this evaluation, the system is provided with the utterance and the goal emotion in the test data and the generated responses are evaluated through BLEU score .
BLEU score is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Cherry, Colin
Cohesive Phrasal Output
We tested this approach on our English-French development set, and saw no improvement in BLEU score .
Conclusion
Our experiments have shown that roughly 1/5 of our baseline English-French translations contain cohesion violations, and these translations tend to receive lower BLEU scores .
Experiments
We first present our soft cohesion constraint’s effect on BLEU score (Papineni et al., 2002) for both our dev-test and test sets.
Experiments
First of all, looking across columns, we can see that there is a definite divide in BLEU score between our two evaluation subsets.
Experiments
Sentences with cohesive baseline translations receive much higher BLEU scores than those with uncohesive baseline translations.
BLEU score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Foster, George and Sankaran, Baskaran and Sarkar, Anoop
Ensemble Decoding
In Section 4.2, we compare the BLEU scores of different mixture operations on a French-English experimental setup.
Ensemble Decoding
However, experiments showed that replacing the scores with the normalized scores hurts the BLEU score radically.
Ensemble Decoding
However, we did not try it, as the BLEU scores we got using the normalization heuristic were not promising and it would impose a cost in decoding as well.
Experiments & Results 4.1 Experimental Setup
Since the Hiero baseline results were substantially better than those of the phrase-based model, we also implemented the best-performing baseline, linear mixture, in our Hiero-style MT system, and in fact it achieves the highest BLEU score among all the baselines, as shown in Table 2.
Experiments & Results 4.1 Experimental Setup
This baseline is run three times and the score is averaged over the BLEU scores, with a standard deviation of 0.34.
Experiments & Results 4.1 Experimental Setup
We also reported the BLEU scores when we applied the span-wise normalization heuristic.
BLEU score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Talbot, David and Brants, Thorsten
Experiments
Table 5 shows baseline translation BLEU scores for a lossless (non-randomized) language model with parameter values quantized into 5 to 8 bits.
Experiments
Table 5: Baseline BLEU scores with lossless n-gram model and different quantization levels (bits).
Experiments
Figure 3: BLEU scores on the MT05 data set.
BLEU score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Salloum, Wael and Elfardy, Heba and Alamir-Salloum, Linda and Habash, Nizar and Diab, Mona
Conclusion and Future Work
We plan to give different weights to different training examples based on the drop in BLEU score the example can cause if classified incorrectly.
MT System Selection
We run the 5,562 sentences of the classification training data through our four MT systems and produce sentence-level BLEU scores (with length penalty).
MT System Selection
We pick the name of the MT system with the highest BLEU score as the class label for that sentence.
MT System Selection
When there is a tie in BLEU scores, we pick the label of the system that yields the better overall BLEU score among the tied systems.
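A minimal sketch of this labelling scheme (not the authors' code): each sentence is labelled with the system achieving the highest sentence-level BLEU, and ties are broken by overall corpus BLEU.

```python
# Produce one class label (an MT system name) per training sentence.
def make_labels(sent_bleu, corpus_bleu_by_system):
    """sent_bleu: dict mapping system name -> list of sentence-level BLEU scores."""
    systems = list(sent_bleu)
    num_sentences = len(next(iter(sent_bleu.values())))
    labels = []
    for i in range(num_sentences):
        best = max(sent_bleu[s][i] for s in systems)
        tied = [s for s in systems if sent_bleu[s][i] == best]
        # Tie-break: prefer the system with the higher overall (corpus) BLEU.
        labels.append(max(tied, key=lambda s: corpus_bleu_by_system[s]))
    return labels
```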
Machine Translation Experiments
All differences in BLEU scores between the four systems are statistically significant above the 95% level.
Machine Translation Experiments
We also report in Table 1 an oracle system selection where we pick, for each sentence, the English translation that yields the best BLEU score.
BLEU score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Mason, Rebecca and Charniak, Eugene
Our Approach
Figure 1: BLEU scores vs. k for SumBasic extraction.
Our Approach
As shown in Figure 1, our system’s BLEU scores increase rapidly until about k = 25.
BLEU score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Baseline MT
The scaling factors for all features are optimized by the minimum error rate training algorithm to maximize BLEU score (Och, 2003).
Experiments
In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora.
Experiments
We computed the name-aware BLEU scores on the subset and also the aggregated average scores from human judgments.
Experiments
Furthermore, we calculated three Pearson product-moment correlation coefficients between human judgment scores and name-aware BLEU scores of these two MT systems.
Name-aware MT Evaluation
Based on BLEU score , we design a name-aware BLEU metric as follows.
Name-aware MT Evaluation
Finally the name-aware BLEU score is defined as:
BLEU score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Zhu, Conghui and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Conclusion and Future Work
The method assumes that a combined model is derived from a hierarchical Pitman-Yor process with each prior learned separately in each domain, and achieves BLEU scores competitive with traditional batch-based ones.
Experiment
The BLEU scores reported in this paper are the average of 5 independent runs of independent batch-MIRA weight training, as suggested by (Clark et al., 2011).
Experiment
When comparing the hier-combin with the pialign-batch, the BLEU scores are a little higher while the time spent for training is much lower, almost one quarter of the pialign-batch.
Experiment
Table 4 shows the BLEU scores for the three data sets, in which the order of combining phrase tables from each domain is alternated in the ascending and descending of the similarity to the test data.
BLEU score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Cai, Jingsheng and Utiyama, Masao and Sumita, Eiichiro and Zhang, Yujie
Abstract
We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data.
Conclusion
The results showed that our approach achieved a BLEU score gain of 1.61.
Dependency-based Pre-ordering Rule Set
In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set.
Experiments
For evaluation, we used BLEU scores (Papineni et al., 2002).
Experiments
It shows the BLEU scores on the test set and the statistics of pre-ordering on the training set, which includes the total count of each rule set and the number of sentences they were ap-
Introduction
Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61.
BLEU score is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Wu, Xianchao and Sudoh, Katsuhito and Duh, Kevin and Tsukada, Hajime and Nagata, Masaaki
Experiments
training data and not necessarily exactly follow the tendency of the final BLEU scores .
Experiments
For example, CCG is worse than Malt in terms of P/R yet with a higher BLEU score .
Experiments
Also, PAS+sem has a lower P/R than Berkeley, yet their final BLEU scores are not statistically different.
BLEU score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
liu, lemao and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Introduction
In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e′, d′)), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e′, then the model score of (e*, d*) with this weight will also be greater than that of (e′, d′). In this paper, a pair (e*, e′) for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
Introduction
to that of Moses: on the NIST05 test set, L-Hiero achieves 25.1 BLEU scores and Moses achieves 24.8.
Introduction
Since both MERT and PRO tuning toolkits involve randomness in their implementations, all BLEU scores reported in the experiments are the average of five tuning runs, as suggested by Clark et al.
BLEU score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zarriess, Sina and Kuhn, Jonas
Experiments
When REG and linearization are applied on shallowSyn_re with gold shallow trees, the BLEU score is lower (60.57) as compared to the system that applies syntax and linearization on deepSyn_re, deep trees with gold REs (BLEU score of 63.9).
Experiments
The revision-based system with disjoint modelling of implicits shows a slight, nonsignificant increase in BLEU score .
Experiments
By contrast, the BLEU score is significantly better for the joint approach.
BLEU score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Pauls, Adam and Klein, Dan
Experiments
The BLEU scores for these outputs are 32.7, 27.8, and 20.8.
Experiments
In particular, their translations had a lower BLEU score , making their task easier.
Experiments
We see that our system prefers the reference much more often than the S-GRAM language model. However, we also note that the easiness of the task is correlated with the quality of translations (as measured in BLEU score).
BLEU score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Deng, Yonggang and Xu, Jia and Gao, Yuqing
Discussions
After reaching its peak, the BLEU score drops as the threshold τ increases.
Discussions
On the other hand, adding phrase pairs extracted by the new method only (PP3) can lead to significant BLEU score increases (comparing row 1 vs. 3, and row 2 vs. 4).
Experimental Results
Once we have computed all feature values for all phrase pairs in the training corpus, we discriminatively train the feature weights λ_k and the threshold τ using the downhill simplex method to maximize the BLEU score on the 06dev set.
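An illustrative sketch of this tuning loop, assuming SciPy's Nelder-Mead (downhill simplex) implementation and a hypothetical `bleu_on_dev` function that decodes the dev set with the given weights and threshold and returns corpus BLEU:

```python
# Downhill simplex tuning of feature weights and an extraction threshold
# by maximising dev-set BLEU (minimising its negative).
import numpy as np
from scipy.optimize import minimize

def tune(bleu_on_dev, init_lambdas, init_tau):
    x0 = np.append(init_lambdas, init_tau)

    def neg_bleu(x):
        lambdas, tau = x[:-1], x[-1]
        return -bleu_on_dev(lambdas, tau)  # minimise negative BLEU = maximise BLEU

    result = minimize(neg_bleu, x0, method="Nelder-Mead",
                      options={"xatol": 1e-3, "fatol": 1e-3})
    return result.x[:-1], result.x[-1]
```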
Experimental Results
Roughly, it has a 0.5% higher BLEU score on the 2006 sets and 1.5% to 3% higher on other sets than the Model-4-based ViterbiExtract method.
BLEU score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Gyawali, Bikash and Gardent, Claire
Conclusion
We observed that this often fails to return the best output in terms of BLEU score , fluency, grammaticality and/or meaning.
Results and Discussion
Figure 6: BLEU scores and Grammar Size (Number of Elementary TAG trees)
Results and Discussion
The average BLEU score is given with respect to all input (All) and to those inputs for which the systems generate at least one sentence (Covered).
Results and Discussion
In terms of BLEU score, the best version of our system (AUTEXP) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results similar to the fully handcrafted UDEL system (-...).
BLEU score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Knight, Kevin
Machine Translation as a Decipherment Task
The figure also shows the corresponding BLEU scores in parentheses for comparison (higher scores indicate better MT output).
Machine Translation as a Decipherment Task
Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment.
Machine Translation as a Decipherment Task
Figure 4 plots the BLEU scores versus training sizes for different MT systems on the Time corpus.
BLEU score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Li, Chi-Ho and Zhou, Ming
Abstract
On top of the pruning framework, we also propose a discriminative ITG alignment model using hierarchical phrase pairs, which improves both F-score and Bleu score over the baseline alignment system of GIZA++.
Evaluation
Finally, we also do end-to-end evaluation using both F-score in alignment and Bleu score in translation.
Evaluation
HP-DITG using DPDI achieves the best Bleu score with acceptable time cost.
Evaluation
It shows that HP-DITG (with DPDI) is better than the three baselines both in alignment F-score and Bleu score.
BLEU score is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Setiawan, Hendra and Zhou, Bowen and Xiang, Bing and Shen, Libin
Experiments
In columns 2 and 4, we report the BLEU scores, while in columns 3 and 5, we report the TER scores.
Experiments
Model 2, which conditions POL on OR, provides an additional +0.2 improvement in BLEU score consistently across the two genres.
Experiments
The inclusion of explicit MOS modeling in Model 4 gives a significant BLEU score improvement of +0.5 but no TER improvement in newswire.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Narayan, Shashi and Gardent, Claire
Experiments
BLEU score: We used the Moses support tool multi-bleu to calculate BLEU scores.
Experiments
The BLEU scores shown in Table 4 show that our system produces simplifications that are closest to the reference.
Experiments
In sum, the automatic metrics indicate that our system produces simplifications that are consistently closest to the reference in terms of edit distance, number of splits and BLEU score.
Related Work
(2010) namely, an aligned corpus of 100/131 EWKP/SWKP sentences, and show that they achieve a better BLEU score.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Cahill, Aoife and Riester, Arndt
Abstract
We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features.
Conclusions
In comparison to a baseline model, we achieve statistically significant improvement in BLEU score .
Generation Ranking Experiments
We evaluate the string chosen by the log-linear model against the original treebank string in terms of exact match and BLEU score (Papineni et al.,
Generation Ranking Experiments
The difference in BLEU score between the model of Cahill et al.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Hirao, Tsutomu and Suzuki, Jun and Isozaki, Hideki
Experimental Evaluation
For MCE learning, we selected the reference compression that maximizes the BLEU score (Papineni et al., 2002) (= argmax_{r∈R} BLEU(r, R\r)) from the set of reference compressions and used it as correct data for training.
Results and Discussion
Our method achieved the highest BLEU score.
Results and Discussion
For example, ‘w/o PLM + Dep’ achieved the second highest BLEU score.
Results and Discussion
Compared to ‘Hori-’, ‘Hori’ achieved a significantly higher BLEU score.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Li, Mu and Duan, Nan and Zhang, Dongdong and Li, Chi-Ho and Zhou, Ming
Experiments
In our experiments all the models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric in percentage numbers.
Experiments
Figure 3 shows the BLEU score curves with up to 1000 candidates used for re-ranking.
Experiments
Figure 4 shows the BLEU scores of a two-system co-decoding as a function of re-decoding iterations.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Beaufort, Richard and Roekhaut, Sophie and Cougnon, Louise-Amélie and Fairon, Cédrick
Abstract
Evaluated in French by 10-fold-cross validation, the system achieves a 9.3% Word Error Rate and a 0.83 BLEU score .
Conclusion and perspectives
Evaluated by tenfold cross-validation, the system seems efficient, and the performance in terms of BLEU score and WER is quite encouraging.
Evaluation
The system was evaluated in terms of BLEU score (Papineni et al., 2001), Word Error Rate (WER) and Sentence Error Rate (SER).
Evaluation
The copy-paste results just inform about the real deviation of our corpus from the traditional spelling conventions, and highlight the fact that our system is still at pains to significantly reduce the SER, while results in terms of WER and BLEU score are quite encouraging.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Xiang, Bing and Luo, Xiaoqiang and Zhou, Bowen
Experimental Results
The BLEU scores from different systems are shown in Table 10 and Table 11, respectively.
Experimental Results
Preprocessing of the data with ECs inserted improves the BLEU scores by about 0.6 for newswire and 0.2 to 0.3 for the weblog data, compared to each baseline separately.
Experimental Results
Table 10: BLEU scores in the Hiero system.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Bojar, Ondřej and Kos, Kamil and Mareċek, David
Conclusion
This is confirmed for other languages as well: the lower the BLEU score the lower the correlation to human judgments.
Problems of BLEU
We plot the official BLEU score against the rank established as the percentage of sentences where a system ranked no worse than all its competitors (Callison-Burch et al., 2009).
Problems of BLEU
Figure 3 documents the issue across languages: the lower the BLEU score itself (i.e.
Problems of BLEU
A phrase-based system like Moses (cu-bojar) can sometimes produce a long sequence of tokens exactly as required by the reference, leading to a high BLEU score.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Eidelman, Vladimir and Marton, Yuval and Resnik, Philip
Additional Experiments
On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain.
Discussion
This correlates with our observation that RM’s overall BLEU score is negatively impacted by the BP, as the BLEU precision scores are noticeably higher.
Discussion
We also notice that while PRO had the lowest BLEU scores in Chinese, it was competitive in Arabic with the highest number of features.
Experiments
In the small feature set RAMPION yielded similar best BLEU scores, but worse TER.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Cohn, Trevor and Haffari, Gholamreza
Experiments
Hence the BLEU scores we get for the baselines may appear lower than those reported in the literature.
Experiments
Table 3 shows the BLEU scores for the three translation tasks UR/AlUFA—>EN based on our method against the baselines.
Experiments
For our models, we report the average BLEU score of the 5 independent runs as well as that of the aggregate phrase table generated by these 5 independent runs.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Neubig, Graham and Watanabe, Taro and Sumita, Eiichiro and Mori, Shinsuke and Kawahara, Tatsuya
Experimental Evaluation
For most models, while likelihood continued to increase gradually for all 100 iterations, BLEU score gains plateaued after 5-10 iterations, likely due to the strong prior information.
Experimental Evaluation
It can also be seen that combining phrase tables from multiple samples improved the BLEU score for HLEN, but not for HIER.
Hierarchical ITG Model
(2003) that using phrases where max(|e|, |f|) ≤ 3 causes significant improvements in BLEU score, while using larger phrases results in diminishing returns.
Introduction
We also find that it achieves superior BLEU scores over previously proposed ITG-based phrase alignment approaches.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Durrani, Nadir and Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut
Abstract
We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model) as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system.
Final Results
This section shows the improvement in BLEU score by applying heuristics and combinations of heuristics in both the models.
Final Results
For other parts of the data where the translators have heavily used transliteration, the system may receive a higher BLEU score .
Introduction
Section 4 discusses the training data, parameter optimization and the initial set of experiments that compare our two models with a baseline Hindi-Urdu phrase-based system and with two transliteration-aided phrase-based systems in terms of BLEU scores.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Yang, Nan and Li, Mu and Zhang, Dongdong and Yu, Nenghai
Experiments
We compare their influence on RankingSVM accuracy, alignment crossing-link number, end-to-end BLEU score , and the model size.
Experiments
is RankingSVM accuracy in percentage on the training data; CLN is the crossing-link number per sentence on the parallel corpus with automatically generated word alignment; BLEU is the BLEU score in percentage on the web test set in the Rank-IT setting (system with integrated rank reordering model); leacn means 11 most frequent lexicons in the training corpus.
Experiments
These features also correspond to BLEU score improvement for End-to-End evaluations.
BLEU score is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Setiawan, Hendra and Kan, Min Yen and Li, Haizhou and Resnik, Philip
Discussion and Future Work
When we visually inspect and compare the outputs of our system with those of the baseline, we observe that improved BLEU score often corresponds to visible improvements in the subjective translation quality.
Experimental Results
These results confirm that the pairwise dominance model can significantly increase performance as measured by the BLEU score, with a consistent pattern of results across the MT06 and MT08 test sets.
Experimental Setup
In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004).
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Espinosa, Dominic and White, Michael and Mehay, Dennis
Results and Discussion
In particular, the hypertagger makes possible a more than 6-point improvement in the overall BLEU score on both the development and test sections, and a more than 12-point improvement on the sentences with complete realizations.
Results and Discussion
Even with the current incomplete set of semantic templates, the hypertagger brings realizer performance roughly up to state-of-the-art levels, as our overall test set BLEU score (0.6701) slightly exceeds that of Cahill and van Genabith (2006), though at a coverage of 96% instead of 98%.
The Approach
compared the percentage of complete realizations (versus fragmentary ones) with their top scoring model against an oracle model that uses a simplified BLEU score based on the target string, which is useful for regression testing as it guides the best-first search to the reference sentence.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Wang, Xiaolin and Utiyama, Masao and Finch, Andrew and Sumita, Eiichiro
Complexity Analysis
It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings.
Complexity Analysis
Table 4 presents the BLEU scores for Moses using different segmentation methods.
Introduction
• improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
van Gompel, Maarten and van den Bosch, Antal
Experiments & Results
The BLEU scores, not included in the figure but shown in Table 2, show a similar trend.
Experiments & Results
Statistical significance on the BLEU scores was tested using pairwise bootstrap sampling (Koehn, 2004).
Experiments & Results
Another discrepancy is found in the BLEU scores of the English→Chinese experiments, where we measure an unexpected drop in BLEU score below the baseline.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Tu, Mei and Zhou, Yu and Zong, Chengqing
Experiments
In Table 3, almost all BLEU scores are improved, no matter what strategy is used.
Experiments
The final BLEU scores on NIST05 and NIST06 are given in Table 4.
Experiments
BLEU scores on the large-scale training data.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zhang, Dongdong and Li, Mu and Duan, Nan and Li, Chi-Ho and Zhou, Ming
Experiments
In addition to precision and recall, we also evaluate the Bleu score (Papineni et al., 2002) changes before and after applying our measure word generation method to the SMT output.
Experiments
For our test data, we only consider sentences containing measure words for Bleu score evaluation.
Experiments
Our measure word generation step leads to a Bleu score improvement of 0.32 where the window size is set to 10, which shows that it can improve the translation quality of an English-to-Chinese SMT system.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Han, Bo and Baldwin, Timothy
Conclusion and Future Work
In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
Experiments
The 10-fold cross-validated BLEU score (Papineni et al., 2002) over this data is 0.81.
Experiments
Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Wuebker, Joern and Mauser, Arne and Ney, Hermann
Alignment
We perform minimum error rate training with the downhill simplex algorithm (Nelder and Mead, 1965) on the development data to obtain a set of scaling factors that achieve a good BLEU score .
Experimental Evaluation
A second iteration of the training algorithm shows nearly no changes in BLEU score , but a small improvement in TER.
Experimental Evaluation
yields a BLEU score slightly lower than with fixed interpolation on both DEV and TEST.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hui and Zhang, Min and Li, Haizhou and Aw, Aiti and Tan, Chew Lim
Experiment
The 9% tree sequence rules contribute 1.17 BLEU score improvement (28.83-27.66 in Table 1) to FTS2S over FT2S.
Experiment
Even in the 5000-best case, tree sequence is still able to contribute 1.1 BLEU score improvement (28.89-27.79).
Experiment
2) The BLEU scores are very similar to each other when we increase the forest pruning threshold.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Smith, Jason R. and Saint-Amand, Herve and Plamada, Magdalena and Koehn, Philipp and Callison-Burch, Chris and Lopez, Adam
Abstract
Table 8: BLEU scores for several language pairs before and after adding the mined parallel data to systems trained on data from WMT data.
Abstract
Table 9: BLEU scores for French-English and English-French before and after adding the mined parallel data to systems trained on data from WMT data including the French-English Gigaword (Callison-Burch et al., 2011).
Abstract
Table 12: BLEU scores for Spanish-English before and after adding the mined parallel data to a baseline Europarl system.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wu, Hua and Wang, Haifeng and Liu, Ting
Experiments
The pair (s0i, s1i) is selected for the extraction of paraphrase rules if two conditions are satisfied: (1) BLEU(e2i) − BLEU(e1i) > δ1, and (2) BLEU(e2i) > δ2, where BLEU(·) is a function for computing the BLEU score; δ1 and δ2 are thresholds for balancing the number of rules and the quality of the paraphrase rules.
Extraction of Paraphrase Rules
If the sentence in T2 has a higher BLEU score than the aligned sentence in T1, the corresponding sentences in S0 and S1 are selected as candidate paraphrase sentence pairs, which are used in the following steps of paraphrase extraction.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Sajjad, Hassan and Darwish, Kareem and Belinkov, Yonatan
Proposed Methods 3.1 Egyptian to EG’ Conversion
Phrase merging that preferred phrases learnt from EG’ data over AR data performed the best with a BLEU score of 16.96.
Proposed Methods 3.1 Egyptian to EG’ Conversion
In further analysis, we examined 1% of the sentences with the largest difference in BLEU score .
Proposed Methods 3.1 Egyptian to EG’ Conversion
Out of these, more than 70% were cases where the EG’ model achieved a higher BLEU score.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Riesa, Jason and Marcu, Daniel
Abstract
Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system.
Conclusion
We treat word alignment as a parsing problem, and by taking advantage of English syntax and the hypergraph structure of our search algorithm, we report significant increases in both F-measure and BLEU score over standard baselines in use by most state-of-the-art MT systems today.
Related Work
Very recent work in word alignment has also started to report downstream effects on BLEU score .
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Duan, Xiangyu and Zhang, Min and Li, Haizhou
Experiments and Results
Statistical significance in BLEU score differences was tested by paired bootstrap re-sampling (Koehn, 2004).
Experiments and Results
Best ESSP (Wchpwen) is significantly better than the baseline (p<0.01) in BLEU score; best SMP (wdpwen) is significantly better than the baseline (p<0.05) in BLEU score.
Experiments and Results
wchpwen is significantly better than the baseline (p<0.04) in BLEU score.
BLEU score is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: