Abstract | Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets. |
Conclusion and future work | We used dependency scores as an extra feature in our MT experiments and found that our dependency model provides significant gains over a competitive baseline that incorporates a large 5-gram language model (0.92% TER and 0.45% BLEU absolute improvements). |
Introduction | In our experiments, we build a competitive baseline (Koehn et al., 2007) incorporating a 5-gram LM trained on a large part of Gigaword and show that our dependency language model provides improvements on five different test sets, with an overall gain of 0.92 in TER and 0.45 in BLEU scores. |
Machine translation experiments | In the final evaluations, we report results using both TER (Snover et al., 2006) and the original BLEU metric as described in (Papineni et al., 2001). |
Machine translation experiments | For BLEU evaluations, differences are significant in four out of six cases, and in the case of TER, all differences are significant. |
Machine translation experiments | On the other hand, the difference on MT08 is significant in terms of TER. |
Abstract | We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings. |
Experimental Evaluation | We therefore verified that the three nontrivial "baseline" regression models indeed confer a benefit over the default component combination scores: BLEU-1 (which outperformed BLEU-4 in the MetricsMATR 2008 evaluation), NIST-4, and TER (with all costs set to 1). |
Experimental Evaluation | We start with the standard TER score and the number of each of the four edit operations. |
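The four TER edit operations are insertions, deletions, substitutions, and block shifts (Snover et al., 2006). The shift operation requires a greedy search, but the remaining three operations reduce to a standard word-level edit distance. A minimal illustrative sketch (shifts omitted, so this is only an approximation of full TER; the function name is ours, not from any cited system):

```python
def simplified_ter(hyp: str, ref: str) -> float:
    """Approximate TER as (insertions + deletions + substitutions) / |ref|.

    Full TER also permits block shifts, each counted as one edit; they are
    omitted here, so this sketch can overestimate the true score.
    """
    h, r = hyp.split(), ref.split()
    m, n = len(h), len(r)
    # dp[i][j] = min edits to turn the first i hypothesis words
    # into the first j reference words (word-level Levenshtein).
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i                          # delete hypothesis words
    for j in range(1, n + 1):
        dp[0][j] = j                          # insert reference words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[m][n] / n
```

Lower is better: a perfect match scores 0, and the score can exceed 1 when the hypothesis needs more edits than the reference has words.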
Introduction | A number of metrics have been designed to account for paraphrase, either by making the matching more intelligent (TER, Snover et al.
Experiments | In fact, we find that the TER scores between member decoders' outputs are significantly reduced (as shown in Table 3), which indicates that the outputs become more similar due to the use of consensus information. |
Experiments | For example, the TER score between SYS2 and SYS3 of the NIST 2008 outputs is reduced from 0.4238 to 0.2665. |
Experiments | Table 3: TER scores between co-decoding translation outputs |