Abstract | The training objective is an expected BLEU score, which is closely linked to translation quality. |
Abstract | bold updating), the author proposed a local updating strategy where the model parameters are updated towards a pseudo-reference (i.e., the hypothesis in the n-best list that gives the best BLEU score). |
Abstract | In our work, we use the expectation of BLEU scores as the objective. |
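The expected-BLEU objective mentioned above can be sketched as follows; this is a minimal illustration only (the softmax parameterization over n-best model scores and the scaling factor `gamma` are assumptions, not the authors' exact formulation):

```python
import math

def expected_bleu(model_scores, bleu_scores, gamma=1.0):
    """Expected BLEU over an n-best list: each hypothesis's BLEU is
    weighted by its probability under a softmax of the model scores."""
    m = max(model_scores)  # subtract the max for numerical stability
    exps = [math.exp(gamma * (s - m)) for s in model_scores]
    z = sum(exps)
    return sum(w / z * b for w, b in zip(exps, bleu_scores))

# Uniform model scores reduce the expectation to a plain average (0.3 here);
# sharpening the distribution (large gamma) approaches the 1-best BLEU.
avg = expected_bleu([0.0, 0.0], [0.2, 0.4])
```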
Abstract | The large-scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system. |
Experimental results | We substitute our language model and use MERT (Och, 2003) to optimize the BLEU score (Papineni et al., 2002). |
Experimental results | We partition the data into ten pieces: nine pieces are used as training data to optimize the BLEU score (Papineni et al., 2002) by MERT (Och, 2003), and the remaining piece is used to re-rank the 1000-best list and obtain the BLEU score.
Introduction | ply our language models to the task of re-ranking the N-best list from Hiero (Chiang, 2005; Chiang, 2007), a state-of-the-art parsing-based MT system, we achieve significantly better translation quality measured by the BLEU score and “readability”. |
Abstract | We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores. |
Experimental Setup and Results | Wherever meaningful, we report the average BLEU scores over 10 data sets along with the maximum and minimum values and the standard deviation. |
Experimental Setup and Results | Table 1: BLEU scores for a variety of transformation combinations |
Experimental Setup and Results | Note that in this case, the translations would be generated in the same format, but we then split such postpositions from the words they are attached to during decoding, and then evaluate the BLEU score. |
Introduction | We find that with the full set of syntax-to-morphology transformations and some additional techniques we can get about 39% relative improvement in BLEU scores over a word-based baseline and about 28% improvement of a factored baseline, all experiments being done over 10 training and test sets. |
Syntax-to-Morphology Mapping | We find (and elaborate later) that this reduction in the English side of the training corpus, in general, is about 30%, and is correlated with improved BLEU scores. |
Discussion | Table 6: CRR translation results (BLEU scores) by using different RBMT systems |
Discussion | The BLEU scores are 43.90 and 29.77 for System A and System B, respectively. |
Discussion | If we compare the results with those only using SMT systems as described in Table 3, the translation quality was greatly improved by at least 3 BLEU points, even if the translation ac-
Experiments | Translation quality was evaluated using both the BLEU score proposed by Papineni et al. |
Experiments | The results also show that our translation selection method is very effective, achieving absolute improvements of about 4 and 1 BLEU points on CRR and ASR inputs, respectively. |
Experiments | As compared with those in Table 3, the translation quality was greatly improved, with absolute improvements of at least 5.1 and 3.9 BLEU points on CRR and ASR inputs for system combination results. |
Translation Selection | In this paper, we modify the method in Albrecht and Hwa (2007) to only prepare human reference translations for the training examples, and then evaluate the translations produced by the subject systems against the references using BLEU score (Papineni et al., 2002). |
Translation Selection | We use smoothed sentence-level BLEU score to replace the human assessments, where we use additive smoothing to avoid zero BLEU scores when we calculate the n-gram precisions. |
Translation Selection | In the context of translation selection, y is assigned as the smoothed BLEU score. |
Abstract | Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make. |
Abstract | However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved. |
Introduction | With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model. |
Introduction | With the SVM reranker, we obtain a significant improvement in BLEU scores over |
Introduction | Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases). |
Reranking with SVMs 4.1 Methods | In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments. |
Reranking with SVMs 4.1 Methods | The complete model, BBS+dep+nbest, achieved a BLEU score of 88.73, significantly improving upon the perceptron model (p < 0.02). |
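The pairwise setup described above, using BLEU differences to induce a preference order over candidate realizations, can be sketched as follows (the function name and data layout are illustrative assumptions, not the authors' code):

```python
def preference_pairs(candidates):
    """Turn (realization, BLEU) tuples into ordered (better, worse)
    training pairs for a pairwise ranker; BLEU ties yield no pair."""
    pairs = []
    for i, (real_i, bleu_i) in enumerate(candidates):
        for real_j, bleu_j in candidates[i + 1:]:
            if bleu_i > bleu_j:
                pairs.append((real_i, real_j))
            elif bleu_j > bleu_i:
                pairs.append((real_j, real_i))
    return pairs
```

Each emitted pair becomes one ranking constraint for the SVM: the first element should receive a higher model score than the second.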
Simple Reranking | Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations |
Simple Reranking | Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93. |
Abstract | We propose and extensively evaluate a simple method for using alignment models to produce alignments better-suited for phrase-based MT systems, and show significant gains (as measured by BLEU score) in end-to-end translation systems for six language pairs used in recent MT competitions. |
Conclusions | Table 3: BLEU scores for all language pairs using all available data. |
Introduction | Our contribution is a large-scale evaluation of this methodology for word alignments, an investigation of how the produced alignments differ, and how they can be used to consistently improve machine translation performance (as measured by BLEU score) across many languages on training corpora with up to a hundred thousand sentences. |
Introduction | In 10 out of 12 cases we improve BLEU score by at least ⅓ point, and by more than 1 point in 4 out of 12 cases. |
Phrase-based machine translation | We report BLEU scores using a script available with the baseline system. |
Phrase-based machine translation | Figure 8: BLEU score as the amount of training data is increased on the Hansards corpus for the best decoding method for each alignment model. |
Phrase-based machine translation | In principle, we would like to tune the threshold by optimizing BLEU score on a development set, but that is impractical for experiments with many pairs of languages. |
Word alignment results | Unfortunately, as was shown by Fraser and Marcu (2007), AER can have weak correlation with translation performance as measured by BLEU score (Papineni et al., 2002) when the alignments are used to train a phrase-based translation system. |
Experiments | We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on the dev set. |
Experiments | The BLEU score of the baseline 1-best decoding is 0.2325, which is consistent with the result of 0.2302 in (Liu et al., 2007) on the same training, development and test sets, and with the same rule extraction procedure. |
Abstract | In addition, a revised BLEU score (called iBLEU) which measures the adequacy and diversity of the generated paraphrase sentence is proposed for tuning parameters in SMT systems. |
Conclusion | Furthermore, a revised BLEU score that balances between paraphrase adequacy and dissimilarity is proposed in our training process. |
Discussion | The first part of iBLEU, which is the traditional BLEU score, helps to ensure the quality of the machine translation results. |
Experiments and Results | We show the BLEU score (computed against references) to measure the adequacy and self-BLEU (computed against source sentence) to evaluate the dissimilarity (lower is better). |
Experiments and Results | From the results we can see that, when the value of α decreases to place more penalty on self-paraphrase, the self-BLEU score rapidly decays, while the BLEU score computed against references also drops seriously. |
Experiments and Results | This is not achievable without joint learning, or with the traditional BLEU score, which does not take self-paraphrase into consideration. |
Introduction | The jointly-learned dual SMT system: (1) adapts the SMT systems so that they are tuned specifically for paraphrase generation purposes, e.g., to increase the dissimilarity; (2) employs a revised BLEU score (named iBLEU, as it is an input-aware BLEU metric) that measures adequacy and dissimilarity of the paraphrase results at the same time. |
Paraphrasing with a Dual SMT System | Two issues are also raised in (Zhao and Wang, 2010) about using automatic metrics: a paraphrase that changes less gets a larger BLEU score, and the evaluations of paraphrase quality and rate tend to be incompatible. |
Paraphrasing with a Dual SMT System | (2005) have shown the capability for measuring semantic equivalency using BLEU score); BLEU(c, s) is the BLEU score computed between the candidate and the source sentence to measure the dissimilarity. |
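An iBLEU-style combination, rewarding BLEU against the references while penalizing BLEU(c, s) against the source, can be sketched as below; this is a minimal sketch, and the default alpha is an assumed value, not necessarily the setting used in the source:

```python
def ibleu(bleu_refs, bleu_src, alpha=0.8):
    """iBLEU-style score: alpha trades off adequacy (BLEU against the
    references) against dissimilarity (BLEU against the source, i.e.
    self-BLEU, which is penalized)."""
    return alpha * bleu_refs - (1.0 - alpha) * bleu_src

# A candidate that copies the source verbatim is penalized even when
# its BLEU against the references is identical to a true paraphrase's.
paraphrase_score = ibleu(bleu_refs=0.5, bleu_src=0.2)
copy_score = ibleu(bleu_refs=0.5, bleu_src=0.9)
```

With alpha = 1 the metric degenerates to plain BLEU, which is exactly the failure mode discussed above: self-paraphrase goes unpenalized.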
Integration of inflection models with MT systems | We performed a grid search on the values of λ and n to maximize the BLEU score of the final system on a development set (dev) of 1000 sentences (Table 2). |
MT performance results | We also report oracle BLEU scores which incorporate two kinds of oracle knowledge. |
MT performance results | For the methods using n=1 translation from a base MT system, the oracle BLEU score is the BLEU score of the stemmed translation compared to the stemmed reference, which represents the upper bound achievable by changing only the inflected forms (but not stems) of the words in a translation. |
MT performance results | This system achieves a substantially better BLEU score (by 6.76 points) than the treelet system. |
Abstract | We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score. |
Conclusion | The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks. |
Experiments | Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English. |
Experiments | minimum error rate training (Och, 2003) with BLEU score as the objective function. |
Experiments | Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task. |
Abstract | We show empirical results on the OPUS data: our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders of magnitude faster). |
Discussion and Future Work | These, when combined with standard MT systems such as Moses (Koehn et al., 2007) trained on parallel corpora, have been shown to yield some BLEU score improvements. |
Experiments and Results | To evaluate translation quality, we use the BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation. |
Experiments and Results | We show that our method achieves the best performance (BLEU scores) on this task while being significantly faster than both of the previous approaches. |
Experiments and Results | For both the MT tasks, we also report BLEU scores for a baseline system using identity translations for common words (words appearing in both source/target vocabularies) and random translations for other words. |
Experiments | In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences. |
Experiments | Figure 6 presents decoder scores and BLEU scores as functions of time for the two corpora. |
Background | where BLEU(e_ij, r_i) is the smoothed sentence-level BLEU score (Liang et al., 2006) of the translation e_ij with respect to the reference translations r_i, and e_i* is the oracle translation, which is selected from {e_i1, ..., e_im} in terms of BLEU(e_ij, r_i). |
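The oracle selection described above, choosing the hypothesis from the n-best list with the best sentence-level BLEU, can be sketched as follows; the bigram-order, add-one-smoothed scorer here is an illustrative stand-in, not the smoothed metric of Liang et al. (2006):

```python
from collections import Counter
import math

def sentence_bleu_proxy(hyp, ref, max_n=2, k=1.0):
    """Toy smoothed sentence-level BLEU (bigram order, add-k smoothed);
    a stand-in for BLEU(e_ij, r_i) in the description above."""
    h, r = hyp.split(), ref.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        hg = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
        rg = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        m = sum(min(c, rg[g]) for g, c in hg.items())
        t = max(sum(hg.values()), 1)
        log_p += math.log((m + k) / (t + k))
    bp = min(1.0, math.exp(1 - len(r) / max(len(h), 1)))
    return bp * math.exp(log_p / max_n)

def select_oracle(n_best, reference):
    """Oracle translation: argmax over the n-best list of sentence-level BLEU."""
    return max(n_best, key=lambda e: sentence_bleu_proxy(e, reference))
```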
Background | Figures 2-5 show the BLEU curves on the development and test sets, where the x-axis is the iteration number, and the y-axis is the BLEU score of the system generated by the boosting-based system combination. |
Background | The BLEU scores tend to converge to the stable values after 20 iterations for all the systems. |
Conclusion and Future Work | Using all constituency-to-dependency translation rules and bilingual phrases, our model achieves a significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system. |
Experiments | We use standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system’s BLEU score on the development set. |
Experiments | The baseline system extracts 31.9M 625 rules and 77.9M 525 rules respectively, and achieves a BLEU score of 34.17 on the test set. |
Experiments | As shown in the third line of the BLEU score column, the performance drops 1.7 BLEU points below the baseline system due to the poorer rule coverage. |
Abstract | As compared to baseline systems, we achieve absolute improvements of 2.40 BLEU points on a phrase-based SMT system and 1.76 BLEU points on a parsing-based SMT system. |
Conclusion | The improved word alignment results in an improvement of 2.16 BLEU points on a phrase-based SMT system and an improvement of 1.76 BLEU points on a parsing-based SMT system. |
Conclusion | When we also used phrase collocation probabilities as additional features, the phrase-based SMT performance is finally improved by 2.40 BLEU points as compared with the baseline system. |
Experiments on Parsing-Based SMT | The system using the improved word alignments achieves an absolute improvement of 1.76 BLEU points, which indicates that the improvements in word alignment are also effective for improving the performance of parsing-based SMT systems. |
Experiments on Phrase-Based SMT | If the same alignment method is used, the systems using CM-3 got the highest BLEU scores. |
Experiments on Phrase-Based SMT | When the phrase collocation probabilities are incorporated into the SMT system, the translation quality is improved, achieving an absolute improvement of 0.85 BLEU points. |
Experiments on Phrase-Based SMT | As compared with the baseline system, an absolute improvement of 2.40 BLEU points is achieved. |
Introduction | The alignment improvement results in an improvement of 2.16 BLEU points on the phrase-based SMT system and an improvement of 1.76 BLEU points on the parsing-based SMT system. |
Introduction | SMT performance is further improved by 0.24 BLEU points. |
Conclusion | As our decoder accounts for multiple derivations, we extend the MERT algorithm to tune feature weights with respect to BLEU score for max-translation decoding. |
Experiments | Table 2: Comparison of individual decoding (time in seconds/sentence) and BLEU score (case-insensitive). |
Experiments | With conventional max-derivation decoding, the hierarchical phrase-based model achieved a BLEU score of 30.11 on the test set, with an average decoding time of 40.53 seconds/sentence. |
Experiments | We found that accounting for all possible derivations in max-translation decoding resulted in a small negative effect on BLEU score (from 30.11 to 29.82), even though the feature weights were tuned with respect to BLEU score . |
Introduction | 0 As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4). |
Experiments | MERT is then performed to optimize the BLEU score on a development set; For MERT, we use 40 random initial parameters as well as parameters computed using corpus based statistics (Tromble et al., 2008). |
Experiments | We consider a BLEU score difference to be a) a gain if it is at least 0.2 points, b) a drop if it is at most -0.2 points, and c) no change otherwise. |
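The ±0.2-point bucketing described above is simple to state in code (a sketch; the function name is ours, not from the source):

```python
def classify_bleu_delta(delta, threshold=0.2):
    """Label a BLEU score difference as a gain, a drop, or no change,
    using the +/-0.2-point threshold from the text."""
    if delta >= threshold:
        return "gain"
    if delta <= -threshold:
        return "drop"
    return "no change"
```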
Experiments | When MBR does not produce a higher BLEU score relative to MAP on the development set, MERT assigns a higher weight to this feature function. |
Introduction | Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice. |
Introduction | We employ MERT to select these weights by optimizing BLEU score on a development set. |
Introduction | In contrast, our MBR algorithm directly selects the hypothesis in the hypergraph with the maximum expected approximate corpus BLEU score (Tromble et al., 2008). |
MERT for MBR Parameter Optimization | We now have a total of N+2 feature functions, which we optimize using MERT to obtain the highest BLEU score on a training set. |
Minimum Bayes-Risk Decoding | (2008) extended MBR decoding to translation lattices under an approximate BLEU score . |
Experiments | Table 3: Comparison of BLEU scores for tree-based and forest-based tree-to-tree models. |
Experiments | Table 3 shows the BLEU scores of tree-based and forest-based tree-to-tree models achieved on the test set over different pruning thresholds. |
Experiments | With the increase of the number of rules used, the BLEU score increased accordingly. |
Experimental Results | Table 1: BLEU scores for Viterbi, Crunching, MBR, and variational decoding. |
Experimental Results | Table 1 presents the BLEU scores under Viterbi, crunching, MBR, and variational decoding. |
Experimental Results | Moreover, a bigram (i.e., “2gram”) achieves the best BLEU scores among the four different orders of VMs. |
Abstract | Extensive experiments involving large-scale English-to-Japanese translation revealed a significant improvement of 1.8 points in BLEU score , as compared with a strong forest-to-string baseline system. |
Conclusion | Extensive experiments on large-scale English-to-Japanese translation resulted in a significant improvement in BLEU score of 1.8 points (p < 0.01), as compared with our implementation of a strong forest-to-string baseline system (Mi et al., 2008; Mi and Huang, 2008). |
Experiments | Here, fw denotes function word, DT denotes the decoding time, and the BLEU scores were computed on the test set |
Experiments | the final BLEU scores of C3-T with Min-F and C3-F. |
Experiments | Using the composed rule set C3-F in our forest-based decoder, we achieved an optimal BLEU score of 28.89%. |
Introduction | (2008) achieved a 3.1-point improvement in BLEU score (Papineni et al., 2002) by including bilingual syntactic phrases in their forest-based system. |
Introduction | Using the composed rules of the present study in a baseline forest-to-string translation system results in a 1.8-point improvement in the BLEU score for large-scale English-to-Japanese translation. |
Experiment Results | We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences). |
Experiment Results | On average the improvement is 1.07 BLEU points (45.66
Experiment Results | without new phrase-based features and 1.14 BLEU points over the baseline Hiero system. |
Phrasal-Hiero Model | Compare BLEU scores of translation using all extracted rules (the first row) and translation using only rules without nonaligned subphrases (the second row). |
Conclusion and Future Work | We found that using a segmented translation model based on unsupervised morphology induction and a model that combined morpheme segments in the translation model with a postprocessing morphology prediction model gave us better BLEU scores than a word-based baseline. |
Experimental Results | All the BLEU scores reported are for lowercase evaluation. |
Experimental Results | No Uni indicates the segmented BLEU score without unigrams. |
Experimental Results | a variation of the m-BLEU score (Luong et al., 2010), where the BLEU score is computed by comparing the segmented output with a segmented reference translation. |
Models 2.1 Baseline Models | performance of unsupervised segmentation for translation, our third baseline is a segmented translation model based on a supervised segmentation model (called Sup), using the hand-built Omorfi morphological analyzer (Pirinen and Listenmaa, 2007), which provided slightly higher BLEU scores than the word-based baseline. |
Translation and Morphology | Our proposed approaches are significantly better than the state of the art, achieving the highest reported BLEU scores on the English-Finnish Europarl version 3 dataset. |
Abstract | An additional fast decoding pass maximizing the expected count of correct translation hypotheses increases the BLEU score significantly. |
Conclusion | This technique, together with the progressive search at previous stages, gives a decoder that produces the highest BLEU score we have obtained on the data in a very reasonable amount of time. |
Experiments | Table 1: Speed and BLEU scores for two-pass decoding. |
Experiments | However, model scores do not directly translate into BLEU scores . |
Experiments | In order to maximize BLEU score using the algorithm described in Section 4, we need a sizable trigram forest as a starting point. |
Introduction | With this heuristic, we achieve the same BLEU scores and model cost as a trigram decoder with essentially the same speed as a bigram decoder. |
Inferring a learning curve from mostly monolingual data | Our objective is to predict the evolution of the BLEU score on the given test set as a function of the size of a random subset of the training data |
Inferring a learning curve from mostly monolingual data | We first train models to predict the BLEU score at m anchor sizes s1, ..., sm. |
Inferring a learning curve from mostly monolingual data | We then perform inference using these models to predict the BLEU score at each anchor, for the test case of interest. |
Selecting a parametric family of curves | The values are on the same scale as the BLEU scores . |
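One way to realize the anchor-based curve fitting sketched above is to fit a parametric family to the (size, BLEU) anchor points and then extrapolate; the power-law family BLEU(s) ≈ c - a·s^(-b) and the coarse grid-search fit below are illustrative assumptions, not necessarily the family chosen in the source:

```python
def fit_power_curve(sizes, bleus):
    """Fit BLEU(s) ~ c - a * s**(-b) to (size, BLEU) anchor points.
    For each candidate exponent b on a coarse grid, (c, a) has a
    closed-form least-squares solution; keep the best-fitting triple."""
    best = None
    for b in [x / 100 for x in range(5, 105, 5)]:
        xs = [s ** (-b) for s in sizes]  # with b fixed, the model is linear in x
        n = len(xs)
        sx, sy = sum(xs), sum(bleus)
        sxx = sum(x * x for x in xs)
        sxy = sum(x * y for x, y in zip(xs, bleus))
        denom = n * sxx - sx * sx
        if abs(denom) < 1e-12:  # degenerate design (e.g., identical sizes)
            continue
        a = -(n * sxy - sx * sy) / denom      # slope of the linear fit is -a
        c = (sy + a * sx) / n
        err = sum((c - a * x - y) ** 2 for x, y in zip(xs, bleus))
        if best is None or err < best[0]:
            best = (err, a, b, c)
    _, a, b, c = best
    return lambda s: c - a * s ** (-b)
```

The returned callable predicts BLEU at unseen (larger) training-set sizes, which is exactly the extrapolation task described above.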
Evaluation | In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations. |
Evaluation | This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators. |
Evaluation | As expected, random selection yields bad performance, with a BLEU score of 30.52. |
Abstract | Trained on 8,975 dependency structures of a Chinese Dependency Treebank, the realizer achieves a BLEU score of 0.8874. |
Experiments | In addition to the BLEU score, the percentage of exactly matched sentences and the average NIST simple string accuracy (SSA) are adopted as evaluation metrics. |
Experiments | We observe that the BLEU score is boosted from 0.1478 to 0.5943 by using the RPD method. |
Experiments | All of the four feature functions we have tested achieve considerable improvement in BLEU scores . |
Log-linear Models | The BLEU score, a method originally proposed to automatically evaluate machine translation quality (Papineni et al., 2002), has been widely used as a metric to evaluate general-purpose sentence generation (Langkilde, 2002; White et al., 2007; Guo et al. |
Log-linear Models | 3 The BLEU scoring script is supplied by the NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl |
AL-SMT: Multilingual Setting | The translation quality is measured by TQ for individual systems M_{Fd→E}; it can be the BLEU score or WER/PER (word error rate and position-independent WER), which induces a maximization or minimization problem, respectively. |
AL-SMT: Multilingual Setting | This process is continued iteratively until a certain level of translation quality is met (we use the BLEU score, WER and PER) (Papineni et al., 2002). |
Experiments | The number of weights w_i is 3 plus the number of source languages, and they are trained using minimum error-rate training (MERT) to maximize the BLEU score (Och, 2003) on a development set. |
Experiments | Each utterance in the test data has more than one response that elicits the same goal emotion, because these are used to compute the BLEU score (see Section 5.3). |
Experiments | We first use BLEU score (Papineni et al., 2002) to perform automatic evaluation (Ritter et al., 2011). |
Experiments | In this evaluation, the system is provided with the utterance and the goal emotion in the test data and the generated responses are evaluated through BLEU score . |
Cohesive Phrasal Output | We tested this approach on our English-French development set, and saw no improvement in BLEU score . |
Conclusion | Our experiments have shown that roughly 1/5 of our baseline English-French translations contain cohesion violations, and these translations tend to receive lower BLEU scores . |
Experiments | We first present our soft cohesion constraint’s effect on BLEU score (Papineni et al., 2002) for both our dev-test and test sets. |
Experiments | First of all, looking across columns, we can see that there is a definite divide in BLEU score between our two evaluation subsets. |
Experiments | Sentences with cohesive baseline translations receive much higher BLEU scores than those with uncohesive baseline translations. |
Ensemble Decoding | In Section 4.2, we compare the BLEU scores of different mixture operations on a French-English experimental setup. |
Ensemble Decoding | However, experiments showed that replacing the scores with the normalized scores hurts the BLEU score radically. |
Ensemble Decoding | However, we did not try it, as the BLEU scores we got using the normalization heuristic were not promising, and it would impose a cost in decoding as well. |
Experiments & Results 4.1 Experimental Setup | Since the Hiero baseline results were substantially better than those of the phrase-based model, we also implemented the best-performing baseline, linear mixture, in our Hiero-style MT system, and in fact it achieves the highest BLEU score among all the baselines, as shown in Table 2. |
Experiments & Results 4.1 Experimental Setup | This baseline is run three times and the score is averaged over the BLEU scores, with a standard deviation of 0.34. |
Experiments & Results 4.1 Experimental Setup | We also reported the BLEU scores when we applied the span-wise normalization heuristic. |
Experiments | Table 5 shows baseline translation BLEU scores for a lossless (non-randomized) language model with parameter values quantized into 5 to 8 bits. |
Experiments | Table 5: Baseline BLEU scores with lossless n-gram model and different quantization levels (bits). |
Experiments | Figure 3: BLEU scores on the MT05 data set. |
Conclusion and Future Work | We plan to give different weights to different training examples based on the drop in BLEU score the example can cause if classified incorrectly. |
MT System Selection | We run the 5,562 sentences of the classification training data through our four MT systems and produce sentence-level BLEU scores (with length penalty). |
MT System Selection | We pick the name of the MT system with the highest BLEU score as the class label for that sentence. |
MT System Selection | When there is a tie in BLEU scores, we pick the system label that yields the better overall BLEU score from among the systems tied. |
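The labeling scheme above, choosing the system with the highest sentence-level BLEU and breaking ties by overall corpus BLEU, can be sketched as follows; the data layout (dicts keyed by system name) is an illustrative assumption:

```python
def label_sentences(sent_bleus, corpus_bleu):
    """For each sentence, pick the system with the highest sentence-level
    BLEU as its class label; break ties in favor of the system with the
    better overall corpus BLEU.

    sent_bleus: list of dicts {system_name: sentence_BLEU}
    corpus_bleu: dict {system_name: corpus_BLEU}
    """
    labels = []
    for scores in sent_bleus:
        best = max(scores.values())
        tied = [name for name, s in scores.items() if s == best]
        labels.append(max(tied, key=lambda name: corpus_bleu[name]))
    return labels
```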
Machine Translation Experiments | All differences in BLEU scores between the four systems are statistically significant above the 95% level. |
Machine Translation Experiments | We also report in Table 1 an oracle system selection where we pick, for each sentence, the English translation that yields the best BLEU score. |
Our Approach | Figure 1: BLEU scores vs. k for SumBasic extraction. |
Our Approach | As shown in Figure 1, our system’s BLEU scores increase rapidly until about k = 25. |
Baseline MT | The scaling factors for all features are optimized by minimum error rate training algorithm to maximize BLEU score (Och, 2003). |
Experiments | In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora. |
Experiments | We computed the name-aware BLEU scores on the subset and also the aggregated average scores from human judgments. |
Experiments | Furthermore, we calculated three Pearson product-moment correlation coefficients between human judgment scores and name-aware BLEU scores of these two MT systems. |
Name-aware MT Evaluation | Based on BLEU score , we design a name-aware BLEU metric as follows. |
Name-aware MT Evaluation | Finally the name-aware BLEU score is defined as: |
Conclusion and Future Work | The method assumes that a combined model is derived from a hierarchical Pitman-Yor process with each prior learned separately in each domain, and achieves BLEU scores competitive with traditional batch-based ones. |
Experiment | The BLEU scores reported in this paper are the average of 5 independent runs of independent batch-MIRA weight training, as suggested by (Clark et al., 2011). |
Experiment | When comparing the hier-combin with the pialign-batch, the BLEU scores are a little higher while the time spent on training is much lower, almost one quarter of that of the pialign-batch. |
Experiment | Table 4 shows the BLEU scores for the three data sets, in which the order of combining phrase tables from each domain is alternated in the ascending and descending of the similarity to the test data. |
Abstract | We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data. |
Conclusion | The results showed that our approach achieved a BLEU score gain of 1.61. |
Dependency-based Pre-ordering Rule Set | In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set. |
Experiments | For evaluation, we used BLEU scores (Papineni et al., 2002). |
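Many of the snippets above evaluate with BLEU (Papineni et al., 2002). As a reference point, a minimal single-reference corpus BLEU can be sketched as follows; this is a simplified illustration, not the official multi-reference implementation, so reported scores should still come from standard tools such as mteval or sacreBLEU:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    # BLEU = brevity penalty * geometric mean of modified n-gram precisions,
    # accumulated over the whole corpus (single reference per hypothesis here).
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
            # clipped ("modified") counts: a hypothesis n-gram is only
            # credited as often as it appears in the reference
            match[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += max(len(hyp) - n + 1, 0)
    if min(match) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram level has no match
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)
```

A hypothesis identical to its reference scores 1.0, and any corpus with no 4-gram match scores 0.0 under this unsmoothed variant.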
Experiments | It shows the BLEU scores on the test set and the statistics of pre-ordering on the training set, which includes the total count of each rule set and the number of sentences they were applied to.
Introduction | Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61. |
Experiments | training data and do not necessarily follow the tendency of the final BLEU scores.
Experiments | For example, CCG is worse than Malt in terms of P/R yet with a higher BLEU score.
Experiments | Also, PAS+sem has a lower P/R than Berkeley, yet their final BLEU scores are not statistically different. |
Introduction | In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e', d')), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e', then the model score of (e*, d*) with this weight will also be greater than that of (e', d'). In this paper, a pair (e*, e') for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
Introduction | to that of Moses: on the NIST05 test set, L-Hiero achieves a BLEU score of 25.1 and Moses achieves 24.8.
Introduction | Since both MERT and PRO tuning toolkits involve randomness in their implementations, all BLEU scores reported in the experiments are the average of five tuning runs, as suggested by Clark et al. |
Experiments | When REG and linearization are applied on shallowSyn_re with gold shallow trees, the BLEU score is lower (60.57) compared to the system that applies syntax and linearization on deepSyn_re, deep trees with gold REs (BLEU score of 63.9).
Experiments | The revision-based system with disjoint modelling of implicits shows a slight, non-significant increase in BLEU score.
Experiments | By contrast, the BLEU score is significantly better for the joint approach.
Experiments | The BLEU scores for these outputs are 32.7, 27.8, and 20.8. |
Experiments | In particular, their translations had a lower BLEU score, making their task easier.
Experiments | We see that our system prefers the reference much more often than the S-GRAM language model. However, we also note that the ease of the task is correlated with the quality of translations (as measured by the BLEU score).
Discussions | After reaching its peak, the BLEU score drops as the threshold τ increases.
Discussions | On the other hand, adding phrase pairs extracted by the new method only (PP3) can lead to significant BLEU score increases (comparing row 1 vs. 3, and row 2 vs. 4). |
Experimental Results | BLEU Scores |
Experimental Results | Once we have computed all feature values for all phrase pairs in the training corpus, we discriminatively train the feature weights λk and the threshold τ using the downhill simplex method to maximize the BLEU score on the 06dev set.
Experimental Results | Roughly, it has a 0.5% higher BLEU score on the 2006 sets and 1.5% to 3% higher on other sets than the Model-4 based ViterbiExtract method.
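The tuning step described above (maximizing dev-set BLEU over feature weights with a derivative-free method) can be sketched as follows. This uses a simple coordinate search as a stand-in for the downhill simplex method, and `score` is a placeholder for the real objective, which would decode the dev set under weights `w` and compute BLEU:

```python
def maximize_bleu(score, w0, step=1.0, tol=1e-3, max_iter=200):
    # Derivative-free coordinate search: repeatedly try moving each weight
    # up or down by `step`, keep any move that raises score(w), and halve
    # the step once no move helps, until the step falls below `tol`.
    w = list(w0)
    best = score(w)
    it = 0
    while step > tol and it < max_iter:
        improved = False
        for k in range(len(w)):
            for delta in (step, -step):
                cand = list(w)
                cand[k] += delta
                s = score(cand)
                if s > best:
                    w, best, improved = cand, s, True
        if not improved:
            step /= 2
        it += 1
    return w, best
```

Like downhill simplex, this needs only black-box evaluations of the BLEU objective, which is non-differentiable in the weights.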
Conclusion | We observed that this often fails to return the best output in terms of BLEU score, fluency, grammaticality and/or meaning.
Results and Discussion | Figure 6: BLEU scores and Grammar Size (Number of Elementary TAG trees)
Results and Discussion | The average BLEU score is given with respect to all input (All) and to those inputs for which the systems generate at least one sentence (Covered). |
Results and Discussion | In terms of BLEU score, the best version of our system (AUTEXP) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results similar to the fully handcrafted UDEL system.
Machine Translation as a Decipherment Task | The figure also shows the corresponding BLEU scores in parentheses for comparison (higher scores indicate better MT output). |
Machine Translation as a Decipherment Task | Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment. |
Machine Translation as a Decipherment Task | Figure 4 plots the BLEU scores versus training sizes for different MT systems on the Time corpus. |
Abstract | On top of the pruning framework, we also propose a discriminative ITG alignment model using hierarchical phrase pairs, which improves both F-score and Bleu score over the baseline alignment system of GIZA++. |
Evaluation | Finally, we also do end-to-end evaluation using both F-score in alignment and Bleu score in translation. |
Evaluation | HP-DITG using DPDI achieves the best Bleu score with acceptable time cost. |
Evaluation | It shows that HP-DITG (with DPDI) is better than the three baselines in both alignment F-score and Bleu score.
Experiments | In columns 2 and 4, we report the BLEU scores, while in columns 3 and 5, we report the TER scores.
Experiments | Model 2, which conditions POL on OR, provides an additional +0.2 BLEU score improvement consistently across the two genres.
Experiments | The inclusion of explicit MOS modeling in Model 4 gives a significant BLEU score improvement of +0.5 but no TER improvement in newswire. |
Experiments | BLEU score: We used the Moses support tool multi-bleu to calculate BLEU scores.
Experiments | The BLEU scores in Table 4 show that our system produces simplifications that are closest to the reference.
Experiments | In sum, the automatic metrics indicate that our system produces simplifications that are consistently closest to the reference in terms of edit distance, number of splits and BLEU score.
Related Work | (2010), namely an aligned corpus of 100/131 EWKP/SWKP sentences, and show that they achieve a better BLEU score.
Abstract | We show that it achieves a statistically significantly higher BLEU score than the baseline system without these features. |
Conclusions | In comparison to a baseline model, we achieve a statistically significant improvement in BLEU score.
Generation Ranking Experiments | We evaluate the string chosen by the log-linear model against the original treebank string in terms of exact match and BLEU score (Papineni et al., 2002).
Generation Ranking Experiments | The difference in BLEU score between the model of Cahill et al. |
Experimental Evaluation | For MCE learning, we selected the reference compression that maximizes the BLEU score (Papineni et al., 2002) (= argmax_{r∈R} BLEU(r, R\r)) from the set of reference compressions and used it as correct data for training.
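The pseudo-reference selection argmax_{r∈R} BLEU(r, R\r) above can be sketched as follows. The sentence-level metric here is a smoothed bigram BLEU proxy, and scoring one reference against the rest by averaging pairwise scores is our simplification (the original would use the full multi-reference BLEU):

```python
import math
from collections import Counter

def sent_bleu(hyp, ref, max_n=2):
    # Tiny add-one-smoothed sentence-level BLEU (up to bigrams),
    # a stand-in for the full metric of Papineni et al. (2002).
    logp = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())
        total = max(len(hyp) - n + 1, 1)
        logp += math.log((match + 1) / (total + 1))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(logp / max_n)

def best_reference(refs):
    # argmax over r in R of BLEU(r, R \ r): pick the reference compression
    # that agrees most with the remaining references as the training target.
    def score(i):
        others = refs[:i] + refs[i + 1:]
        return sum(sent_bleu(refs[i], o) for o in others) / len(others)
    return max(range(len(refs)), key=score)
```

The selected reference is the one most "central" among the references, which gives MCE training a single consistent target.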
Results and Discussion | Our method achieved the highest BLEU score.
Results and Discussion | For example, ‘w/o PLM + Dep’ achieved the second highest BLEU score.
Results and Discussion | Compared to ‘Hori-’, ‘Hori’ achieved a significantly higher BLEU score.
Experiments | In our experiments all the models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric as percentages.
Experiments | Figure 3 shows the BLEU score curves with up to 1000 candidates used for re-ranking. |
Experiments | Figure 4 shows the BLEU scores of a two-system co-decoding as a function of re-decoding iterations. |
Abstract | Evaluated in French by 10-fold-cross validation, the system achieves a 9.3% Word Error Rate and a 0.83 BLEU score . |
Conclusion and perspectives | Evaluated by tenfold cross-validation, the system seems efficient, and the performance in terms of BLEU score and WER is quite encouraging.
Evaluation | The system was evaluated in terms of BLEU score (Papineni et al., 2001), Word Error Rate (WER) and Sentence Error Rate (SER). |
Evaluation | The copy-paste results just inform about the real deviation of our corpus from the traditional spelling conventions, and highlight the fact that our system is still at pains to significantly reduce the SER, while results in terms of WER and BLEU score are quite encouraging.
Experimental Results | The BLEU scores from different systems are shown in Table 10 and Table 11, respectively. |
Experimental Results | Preprocessing of the data with ECs inserted improves the BLEU scores by about 0.6 for newswire and 0.2 to 0.3 for the weblog data, compared to each baseline separately. |
Experimental Results | Table 10: BLEU scores in the Hiero system. |
Conclusion | This is confirmed for other languages as well: the lower the BLEU score the lower the correlation to human judgments. |
Problems of BLEU | We plot the official BLEU score against the rank established as the percentage of sentences where a system ranked no worse than all its competitors (Callison-Burch et al., 2009). |
Problems of BLEU | Figure 3 documents the issue across languages: the lower the BLEU score itself (i.e. |
Problems of BLEU | A phrase-based system like Moses (cu-bojar) can sometimes produce a long sequence of tokens exactly as required by the reference, leading to a high BLEU score.
Additional Experiments | On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain. |
Discussion | This correlates with our observation that RM’s overall BLEU score is negatively impacted by the BP, as the BLEU precision scores are noticeably higher. |
Discussion | We also notice that while PRO had the lowest BLEU scores in Chinese, it was competitive in Arabic with the highest number of features. |
Experiments | In the small feature set RAMPION yielded similar best BLEU scores, but worse TER.
Experiments | Hence the BLEU scores we get for the baselines may appear lower than those reported in the literature.
Experiments | Table 3 shows the BLEU scores for the three translation tasks UR/AR/FA→EN based on our method against the baselines.
Experiments | For our models, we report the average BLEU score of the 5 independent runs as well as that of the aggregate phrase table generated by these 5 independent runs. |
Experimental Evaluation | For most models, while the likelihood continued to increase gradually for all 100 iterations, BLEU score gains plateaued after 5-10 iterations, likely due to the strong prior information
Experimental Evaluation | It can also be seen that combining phrase tables from multiple samples improved the BLEU score for HLEN, but not for HIER. |
Hierarchical ITG Model | (2003) that using phrases where max(|e|, |f|) ≤ 3 causes significant improvements in BLEU score, while using larger phrases results in diminishing returns.
Introduction | We also find that it achieves superior BLEU scores over previously proposed ITG-based phrase alignment approaches. |
Abstract | We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model) as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. |
Final Results | This section shows the improvement in BLEU score by applying heuristics and combinations of heuristics in both the models. |
Final Results | For other parts of the data where the translators have heavily used transliteration, the system may receive a higher BLEU score.
Introduction | Section 4 discusses the training data, parameter optimization and the initial set of experiments that compare our two models with a baseline Hindi-Urdu phrase-based system and with two transliteration-aided phrase-based systems in terms of BLEU scores.
Experiments | We compare their influence on RankingSVM accuracy, alignment crossing-link number, end-to-end BLEU score, and the model size.
Experiments | is RankingSVM accuracy in percentage on the training data; CLN is the crossing-link number per sentence on the parallel corpus with automatically generated word alignment; BLEU is the BLEU score in percentage on the web test set in the Rank-IT setting (system with the integrated rank reordering model); lex-n denotes the n most frequent lexicons in the training corpus.
Experiments | These features also correspond to BLEU score improvement for End-to-End evaluations. |
Discussion and Future Work | When we visually inspect and compare the outputs of our system with those of the baseline, we observe that improved BLEU score often corresponds to visible improvements in the subjective translation quality. |
Experimental Results | These results confirm that the pairwise dominance model can significantly increase performance as measured by the BLEU score, with a consistent pattern of results across the MT06 and MT08 test sets.
Experimental Setup | In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004).
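The bootstrap significance test of Koehn (2004), cited repeatedly above, can be sketched as follows. For brevity this resamples per-sentence scores and compares their sums; for real BLEU one would resample per-sentence sufficient statistics and recompute corpus BLEU for each replicate, since BLEU does not decompose over sentences:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    # Paired bootstrap resampling: draw test sets with replacement and
    # count how often system A beats system B on the resampled set.
    # A fraction close to 1.0 means A is significantly better than B.
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples
```

Because the same resampled sentence indices are used for both systems, the test is paired: it measures the difference between the systems on identical subsets rather than their absolute scores.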
Results and Discussion | In particular, the hypertagger makes possible a more than 6-point improvement in the overall BLEU score on both the development and test sections, and a more than 12-point improvement on the sentences with complete realizations. |
Results and Discussion | Even with the current incomplete set of semantic templates, the hypertagger brings realizer performance roughly up to state-of-the-art levels, as our overall test set BLEU score (0.6701) slightly exceeds that of Cahill and van Genabith (2006), though at a coverage of 96% instead of 98%. |
The Approach | compared the percentage of complete realizations (versus fragmentary ones) with their top scoring model against an oracle model that uses a simplified BLEU score based on the target string, which is useful for regression testing as it guides the best-first search to the reference sentence. |
Complexity Analysis | It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings. |
Complexity Analysis | Table 4 presents the BLEU scores for Moses using different segmentation methods. |
Introduction | • improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
Experiments & Results | The BLEU scores, not included in the figure but shown in Table 2, show a similar trend.
Experiments & Results | Statistical significance on the BLEU scores was tested using pairwise bootstrap sampling (Koehn, 2004). |
Experiments & Results | Another discrepancy is found in the BLEU scores of the English→Chinese experiments, where we measure an unexpected drop in BLEU score below the baseline.
Experiments | In Table 3, almost all BLEU scores are improved, no matter what strategy is used. |
Experiments | The final BLEU scores on NIST05 and NIST06 are given in Table 4. |
Experiments | BLEU scores on the large-scale training data. |
Experiments | In addition to precision and recall, we also evaluate the change in Bleu score (Papineni et al., 2002) before and after applying our measure word generation method to the SMT output.
Experiments | For our test data, we only consider sentences containing measure words for Bleu score evaluation. |
Experiments | Our measure word generation step leads to a Bleu score improvement of 0.32 where the window size is set to 10, which shows that it can improve the translation quality of an English-to-Chinese SMT system. |
Conclusion and Future Work | In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
Experiments | The 10-fold cross-validated BLEU score (Papineni et al., 2002) over this data is 0.81. |
Experiments | Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation. |
Alignment | We perform minimum error rate training with the downhill simplex algorithm (Nelder and Mead, 1965) on the development data to obtain a set of scaling factors that achieve a good BLEU score . |
Experimental Evaluation | A second iteration of the training algorithm shows nearly no changes in BLEU score , but a small improvement in TER. |
Experimental Evaluation | yields a BLEU score slightly lower than with fixed interpolation on both DEV and TEST. |
Experiment | The 9% tree sequence rules contribute a 1.17 BLEU score improvement (28.83-27.66 in Table 1) to FTS2S over FT2S.
Experiment | Even in the 5000-best case, tree sequence is still able to contribute a 1.1 BLEU score improvement (28.89-27.79).
Experiment | 2) The BLEU scores are very similar to each other when we increase the forest pruning threshold. |
Abstract | Table 8: BLEU scores for several language pairs for systems trained on WMT data.
Abstract | Table 9: BLEU scores for French-English and English-French before and after adding the mined parallel data to systems trained on WMT data, including the French-English Gigaword (Callison-Burch et al., 2011).
Abstract | Table 12: BLEU scores for Spanish-English before and after adding the mined parallel data to a baseline Europarl system. |
Discussion | on BLEU score |
Experiments | (e1i, e2i) are selected for the extraction of paraphrase rules if two conditions are satisfied: (1) BLEU(e2i) - BLEU(e1i) > δ1, and (2) BLEU(e2i) > δ2, where BLEU(·) is a function for computing the BLEU score; δ1 and δ2 are thresholds for balancing the number of rules and the quality of the paraphrase rules.
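The two-threshold selection criterion can be sketched directly. Here each candidate is a pair of (hypothesis, BLEU score) tuples; the function and variable names are ours, not the paper's:

```python
def select_pairs(candidates, delta1, delta2):
    # Keep a candidate pair (e1, e2) for paraphrase rule extraction only if
    # (1) BLEU(e2) - BLEU(e1) > delta1  (e2 is a clear improvement over e1), and
    # (2) BLEU(e2) > delta2             (e2 is good in absolute terms).
    # Raising either threshold yields fewer but higher-quality rules.
    return [(e1, e2) for (e1, b1), (e2, b2) in candidates
            if b2 - b1 > delta1 and b2 > delta2]
```

Both conditions are strict inequalities, so a pair exactly at a threshold is discarded, matching the "balancing" role of δ1 and δ2 described above.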
Extraction of Paraphrase Rules | If the sentence in T 2 has a higher BLEU score than the aligned sentence in T1, the corresponding sentences in S0 and S1 are selected as candidate paraphrase sentence pairs, which are used in the following steps of paraphrase extractions. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | Phrase merging that preferred phrases learnt from EG’ data over AR data performed the best with a BLEU score of 16.96. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | In further analysis, we examined 1% of the sentences with the largest difference in BLEU score . |
Proposed Methods 3.1 Egyptian to EG’ Conversion | Out of these, more than 70% were cases where the EG’ model achieved a higher BLEU score . |
Abstract | Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system. |
Conclusion | We treat word alignment as a parsing problem, and by taking advantage of English syntax and the hypergraph structure of our search algorithm, we report significant increases in both F-measure and BLEU score over standard baselines in use by most state-of-the-art MT systems today. |
Related Work | Very recent work in word alignment has also started to report downstream effects on BLEU score.
Experiments and Results | Statistical significance in BLEU score differences was tested by paired bootstrap re-sampling (Koehn, 2004). |
Experiments and Results | Best ESSP (wchpwen) is significantly better than the baseline (p<0.01) in BLEU score; best SMP (wdpwen) is significantly better than the baseline (p<0.05) in BLEU score.
Experiments and Results | wchpwen is significantly better than baseline (p<0.04) in BLEU score . |