Abstract | We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set. |
Additional Experiments | As can be seen in Table 4, in the smaller feature set, RM and MERT were the best performers, with the exception that on MT08, MIRA yielded somewhat better (+0.7) BLEU but a somewhat worse (-0.9) TER score than RM. |
Additional Experiments | On the large feature set, RM is again the best performer, except for a tied BLEU score with MIRA on MT08, where it nevertheless shows a clear 1.8 TER gain.
Conclusions and Future Work | Experimentation in statistical MT yielded significant improvements over several other state-of-the-art optimizers, especially in a high-dimensional feature space (up to 2 BLEU and 4.3 TER on average). |
Discussion | RM’s loss was only up to 0.8 BLEU (0.7 TER) from MERT or MIRA, while its gains were up to 1.7 BLEU and 2.1 TER over MIRA. |
Discussion | Small set (BLEU, TER) and large set (BLEU, TER) by optimizer: MERT (0.4, 2.6; -, -), MIRA (0.5, 3.0; 1.4, 4.3), PRO (1.4, 2.9; 2.0, 1.7), RAMPION (0.6, 1.6; 1.2, 2.8)
Discussion | Error Analysis: The inconclusive advantage of RM over MIRA (in BLEU vs. TER scores) on Arabic-English MT08 calls for a closer look. |
Experiments | As can be seen from the results in Table 3, our RM method was the best performer in all Chinese-English tests according to all measures — up to 1.9 BLEU and 6.6 TER over MIRA — even though we only optimized for BLEU (footnote 5). Surprisingly, it seems that MIRA did not benefit as much from the sparse features as RM.
Experiments | The results are especially notable for the basic feature setting — up to 1.2 BLEU and 4.6 TER improvement over MERT — since MERT has been shown to be competitive with small numbers of features compared to high-dimensional optimizers such as MIRA (Chiang et al., 2008). |
Experiments | Footnote 5: In the small feature set, RAMPION yielded similar best BLEU scores, but worse TER.
Abstract | Furthermore, integrated Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction in comparison with the pure SMT system. |
Experiments | In this work, the translation performance is measured with case-insensitive BLEU-4 score (Papineni et al., 2002) and TER score (Snover et al., 2006). |
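As a concrete illustration of what such a score computes, here is a minimal sketch of a case-insensitive sentence-level BLEU-4 in Python. The function name `bleu4` is ours, and this single-reference, sentence-level version only approximates the official corpus-level mteval scoring used in these papers.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(hypothesis, reference):
    """Case-insensitive sentence-level BLEU-4 with brevity penalty
    (single reference; the official mteval script is corpus-level
    and multi-reference, so its scores will differ)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    log_prec = 0.0
    for n in range(1, 5):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_ng.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(count, ref_ng[g]) for g, count in hyp_ng.items())
        if clipped == 0:
            return 0.0  # no matching n-grams at this order
        log_prec += math.log(clipped / total) / 4
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0 regardless of casing, while a hypothesis with no overlapping n-grams scores 0.0.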
Experiments | In the tables, the best translation results (either in BLEU or TER) at each interval have been marked in bold.
Experiments | It can be seen that TM significantly exceeds SMT at the interval [0.9, 1.0) in TER score, which illustrates why professional translators prefer TM over SMT as their assistant tool.
Introduction | Compared with the pure SMT system, the proposed integrated Model-III achieves 3.48 BLEU points improvement and 2.62 TER points reduction overall. |
Abstract | On NIST MT08 set, our most advanced model brings around +2.0 BLEU and -1.0 TER improvement. |
Experiments | Table columns: MT08 nw (BLEU \ TER), MT08 wb (BLEU \ TER)
Experiments | The best TER and BLEU results on each genre are in bold. |
Experiments | For BLEU, higher scores are better, while for TER, lower scores are better.
Introduction | Experiments show that our approach significantly outperforms both phrase-based (Koehn et al., 2007) and string-to-dependency (Shen et al., 2008) approaches in terms of BLEU and TER.
Introduction | | features | BLEU | TER | |
Introduction | Adding dependency language model (“depLM”) and the maximum entropy shift-reduce parsing model (“maxent”) significantly improves BLEU and TER on the development set, both separately and jointly. |
Abstract | Results on five Chinese-English NIST tasks show that our model improves the baseline system by 1.32 BLEU and 1.53 TER on average. |
Conclusion | Experimental results show that our model is stable and improves the baseline system by 0.98 BLEU and 1.21 TER (trained by CRFs) and 1.32 BLEU and 1.53 TER (trained by RNN). |
Experiments | BLEU (Papineni et al., 2001) and TER (Snover et al., 2005) scores are all calculated on lowercased (case-insensitive) text.
Experiments | An Index column is added for score reference convenience (B for BLEU; T for TER).
Experiments | For the proposed model, significance testing results on both BLEU and TER are reported (B2 and B3 compared to B1, T2 and T3 compared to T1). |
Abstract | On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU, TER, and NIST.
Experimental Setup and Results | Table 1 summarizes test set performance in BLEU (Papineni et al., 2001), NIST (Doddington, 2002) and TER (Snover et al., 2006).
Experimental Setup and Results | In the ASR setting, which simulates a real-world deployment scenario, this system achieves improvements of 0.39 (BLEU), -0.6 (TER) and 0.08 (NIST).
Introduction | With this approach, we demonstrate significant improvements over a baseline phrase-based SMT system as measured by BLEU, TER and NIST scores on an English-to-Iraqi CSLT task. |
Experiments | Besides the new name-aware MT metric, we also adopt two traditional metrics: TER to evaluate the overall translation performance and Named Entity Weak Accuracy (NEWA) (Hermjakob et al., 2008) to evaluate the name translation performance.
Experiments | TER measures the number of edits required to change a system output into one of the reference translations.
Experiments | TER = (# of edits) / (average # of reference words)
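The definition above can be sketched numerically as word-level edit distance normalized by the average reference length. The helper names below are ours, and this simplified version omits the phrase-shift edits that the full TER also counts as single edits.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word tokens
    (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i in range(1, len(hyp) + 1):
        cur = [i]
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            cur.append(min(prev[j] + 1,          # delete from hyp
                           cur[j - 1] + 1,       # insert into hyp
                           prev[j - 1] + cost))  # substitute (or match)
        prev = cur
    return prev[-1]

def simple_ter(hypothesis, references):
    """Edits needed to reach the closest reference, divided by the
    average number of reference words. Real TER additionally allows
    block shifts of word sequences, which this sketch omits."""
    hyp = hypothesis.lower().split()
    refs = [r.lower().split() for r in references]
    edits = min(word_edit_distance(hyp, r) for r in refs)
    avg_ref_len = sum(len(r) for r in refs) / len(refs)
    return edits / avg_ref_len
```

For example, dropping one word from a three-word reference costs one deletion, giving a TER of 1/3.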
Name-aware MT Evaluation | Traditional MT evaluation metrics such as BLEU (Papineni et al., 2002) and Translation Edit Rate (TER) (Snover et al., 2006) assign the same weight to all tokens.
Evaluation methodology | In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003). |
Results | While TER and GTM are known to provide better correlation with post-editing efforts for English (O’Brien, 2011), free word order and greater data sparseness on the sentence level make TER much less reliable for Russian.
Results | Sentence-level (median, mean, trimmed) and corpus-level values by metric: BLEU (0.357, 0.298, 0.348; 0.833), NIST (0.357, 0.291, 0.347; 0.810), Meteor (0.429, 0.348, 0.393; 0.714), TER (0.214, 0.186, 0.204; 0.619), GTM (0.429, 0.340, 0.392; 0.714)