Abstract | We show empirical results on the OPUS data: our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders of magnitude faster).
Discussion and Future Work | These, when combined with standard MT systems such as Moses (Koehn et al., 2007) trained on parallel corpora, have been shown to yield some BLEU score improvements.
Experiments and Results | To evaluate translation quality, we use the BLEU score (Papineni et al., 2002), a standard evaluation measure in machine translation.
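Since BLEU recurs throughout these extracts, a minimal sketch of computing it may help. This assumes the sacrebleu library; the hypothesis and reference strings are illustrative placeholders.

```python
# Minimal sketch: corpus-level BLEU (Papineni et al., 2002) via sacrebleu.
# The hypothesis/reference strings below are placeholders.
import sacrebleu

hypotheses = ["the cat sat on the mat", "he reads a book"]
references = ["the cat is on the mat", "he is reading a book"]

# corpus_bleu takes a list of hypotheses and a list of reference streams
# (one inner list per reference set, aligned with the hypotheses).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```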
Experiments and Results | We show that our method achieves the best performance (BLEU scores) on this task while being significantly faster than both previous approaches.
Experiments and Results | For both MT tasks, we also report BLEU scores for a baseline system that uses identity translations for common words (words appearing in both the source and target vocabularies) and random translations for other words.
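A rough sketch of such an identity-plus-random baseline; the vocabulary contents, function name, and tokenized input are assumptions for illustration.

```python
# Sketch of the baseline described above (hypothetical names and data):
# copy a source word unchanged if it also appears in the target vocabulary,
# otherwise substitute a random target word.
import random

def baseline_translate(src_tokens, tgt_vocab):
    tgt_list = list(tgt_vocab)
    return [tok if tok in tgt_vocab else random.choice(tgt_list)
            for tok in src_tokens]

tgt_vocab = {"haus", "katze", "buch", "der", "die"}
print(baseline_translate(["der", "cat", "buch"], tgt_vocab))
```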
Experiment Results | We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences). |
Experiment Results | On average the improvement is 1.07 BLEU points (45.66 …) without new phrase-based features and 1.14 BLEU points over the baseline Hiero system.
Phrasal-Hiero Model | Compare BLEU scores of translation using all extracted rules (the first row) and translation using only rules without nonaligned subphrases (the second row). |
Experiments | Each utterance in the test data has more than one response that elicits the same goal emotion, because these responses are used to compute BLEU score (see Section 5.3).
Experiments | We first use BLEU score (Papineni et al., 2002) to perform automatic evaluation (Ritter et al., 2011). |
Experiments | In this evaluation, the system is provided with the utterance and the goal emotion from the test data, and the generated responses are evaluated using BLEU score.
Baseline MT | The scaling factors for all features are optimized with the minimum error rate training (MERT) algorithm to maximize the BLEU score (Och, 2003).
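MERT performs an exact line search over n-best lists; as a hedged stand-in, the sketch below grid-searches a single scaling factor. The n-best layout and the evaluate_bleu callback are illustrative assumptions, not Och's actual algorithm.

```python
# Simplified, MERT-flavoured sketch: choose a scaling factor `lam` for one
# feature so that the rescored 1-best hypotheses maximize corpus BLEU.
# Real MERT (Och, 2003) performs an exact line search, not a grid search.
def rescore_1best(nbest, lam):
    # nbest: list of (hypothesis, base_score, feature_value) triples
    return max(nbest, key=lambda h: h[1] + lam * h[2])[0]

def tune_factor(nbest_lists, references, evaluate_bleu):
    grid = [i / 20.0 - 1.0 for i in range(41)]   # lam in [-1.0, 1.0]
    return max(grid, key=lambda lam: evaluate_bleu(
        [rescore_1best(nb, lam) for nb in nbest_lists], references))
```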
Experiments | In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora. |
Experiments | We computed the name-aware BLEU scores on the subset and also the aggregated average scores from human judgments. |
Experiments | Furthermore, we calculated three Pearson product-moment correlation coefficients between human judgment scores and name-aware BLEU scores of these two MT systems. |
Name-aware MT Evaluation | Based on the BLEU score, we design a name-aware BLEU metric as follows.
Name-aware MT Evaluation | Finally, the name-aware BLEU score is defined as:
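The extract ends before the formula itself. As a loudly hypothetical sketch only, a name-aware variant could follow BLEU's usual shape with name-weighted n-gram precisions; this is not necessarily the paper's actual definition.

```latex
% Hypothetical shape only -- the extract omits the actual definition.
% \tilde{p}_n: n-gram precisions in which n-grams containing name tokens
% are up-weighted relative to ordinary n-grams; BP: brevity penalty.
\mathrm{BLEU_{NA}} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log \tilde{p}_n\Big)
```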
Conclusion and Future Work | The method assumes that a combined model is derived from a hierarchical Pitman-Yor process with each prior learned separately in each domain, and achieves BLEU scores competitive with traditional batch-based ones. |
Experiment | The BLEU scores reported in this paper are the average of 5 independent runs of independent batch-MIRA weight training, as suggested by Clark et al. (2011).
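Averaging over independent tuning runs is simple to reproduce; a minimal sketch, where the five scores are placeholders rather than the paper's numbers:

```python
# Sketch: mean and spread of BLEU over independent tuning runs, as
# recommended for randomized optimizers by Clark et al. (2011).
from statistics import mean, stdev

run_scores = [45.2, 45.8, 45.5, 45.1, 45.9]   # hypothetical 5-run BLEU
print(f"BLEU = {mean(run_scores):.2f} +/- {stdev(run_scores):.2f}")
```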
Experiment | When comparing hier-combin with pialign-batch, the BLEU scores are slightly higher while the training time is much lower, almost one quarter of that of pialign-batch.
Experiment | Table 4 shows the BLEU scores for the three data sets, in which the order of combining the phrase tables from each domain alternates between ascending and descending similarity to the test data.
Introduction | In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e′, d′)), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e′, then the model score of (e*, d*) with this weight will also be greater than that of (e′, d′). In this paper, a pair (e*, e′) for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
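The objective itself is missing from the extract. Assuming it resembles standard PRO-style margin training over preference pairs, a generic max-margin (hinge-loss) form, given here as a sketch rather than the paper's exact objective, would be:

```latex
% Generic max-margin sketch over preference pairs (e^*, e'); s_\theta is the
% model score. An assumed standard form, not necessarily the paper's own.
\min_{\theta}\; \frac{1}{2}\lVert\theta\rVert^{2}
  + C \sum_{(e^{*},\, e')} \max\Bigl(0,\; 1 - \bigl(s_\theta(e^{*}, d^{*}) - s_\theta(e', d')\bigr)\Bigr)
```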
Introduction | to that of Moses: on the NIST05 test set, L-Hiero achieves a BLEU score of 25.1 and Moses achieves 24.8.
Introduction | Since both the MERT and PRO tuning toolkits involve randomness in their implementations, all BLEU scores reported in the experiments are the average of five tuning runs, as suggested by Clark et al. (2011).
Experiments | When REG and linearization are applied on shallowSyn_re with gold shallow trees, the BLEU score is lower (60.57) compared to the system that applies syntax and linearization on deepSyn_re, deep trees with gold REs (BLEU score of 63.9).
Experiments | The revision-based system with disjoint modelling of implicits shows a slight, nonsignificant increase in BLEU score.
Experiments | By contrast, the BLEU score is significantly better for the joint approach.
Experiments | Hence the BLEU scores we get for the baselines may appear lower than those reported in the literature.
Experiments | Table 3 shows the BLEU scores for the three translation tasks UR/AR/FA→EN based on our method against the baselines.
Experiments | For our models, we report the average BLEU score of the 5 independent runs as well as that of the aggregate phrase table generated by these 5 independent runs. |
Additional Experiments | On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain. |
Discussion | This correlates with our observation that RM’s overall BLEU score is negatively impacted by the brevity penalty (BP), as the BLEU precision scores are noticeably higher.
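For reference, BLEU factors into a brevity penalty times the geometric mean of n-gram precisions (Papineni et al., 2002), which is how high precisions can coexist with a depressed overall score when the BP is small:

```latex
% Standard BLEU decomposition: c = candidate length, r = effective
% reference length, p_n = modified n-gram precisions.
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
  1 & \text{if } c > r,\\
  e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```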
Discussion | We also notice that while PRO had the lowest BLEU scores in Chinese, it was competitive in Arabic with the highest number of features. |
Experiments | In the small feature set, RAMPION yielded similar best BLEU scores, but worse TER.
Experiments | In columns 2 and 4, we report the BLEU scores, while in columns 3 and 5, we report the TER scores.
Experiments | Model 2, which conditions POL on OR, provides an additional +0.2 BLEU improvement consistently across the two genres.
Experiments | The inclusion of explicit MOS modeling in Model 4 gives a significant BLEU score improvement of +0.5 but no TER improvement in newswire. |
Experimental Results | The BLEU scores from different systems are shown in Table 10 and Table 11, respectively. |
Experimental Results | Preprocessing of the data with ECs inserted improves the BLEU scores by about 0.6 for newswire and 0.2 to 0.3 for the weblog data, compared to each baseline separately. |
Experimental Results | Table 10: BLEU scores in the Hiero system. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | Phrase merging that preferred phrases learnt from EG’ data over AR data performed the best with a BLEU score of 16.96. |
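The preference-based merging described here can be sketched as a dictionary merge; the table layout, entry format, and names are assumptions for illustration.

```python
# Sketch of preference-based phrase-table merging (hypothetical layout):
# when a source phrase occurs in both tables, keep the EG' entry and drop
# the AR one; phrases unique to either table are carried over unchanged.
def merge_phrase_tables(eg_prime_table, ar_table):
    merged = dict(ar_table)          # start from the AR entries
    merged.update(eg_prime_table)    # EG' entries override on collisions
    return merged

eg_prime = {"ktyr": [("a lot", 0.7)]}
ar = {"ktyr": [("much", 0.5)], "qlm": [("pen", 0.9)]}
print(merge_phrase_tables(eg_prime, ar))
```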
Proposed Methods 3.1 Egyptian to EG’ Conversion | In further analysis, we examined 1% of the sentences with the largest difference in BLEU score . |
Proposed Methods 3.1 Egyptian to EG’ Conversion | Out of these, more than 70% were cases where the EG’ model achieved a higher BLEU score . |
Abstract | Table 8: BLEU scores for several language pairs' systems trained on data from WMT.
Abstract | Table 9: BLEU scores for French-English and English-French before and after adding the mined parallel data to systems trained on data from WMT, including the French-English Gigaword (Callison-Burch et al., 2011).
Abstract | Table 12: BLEU scores for Spanish-English before and after adding the mined parallel data to a baseline Europarl system. |