Index of papers in Proc. ACL 2013 that mention
  • TER
Eidelman, Vladimir and Marton, Yuval and Resnik, Philip
Abstract
We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.
Additional Experiments
As can be seen in Table 4, in the smaller feature set, RM and MERT were the best performers, with the exception that on MT08, MIRA yielded somewhat better (+0.7) BLEU but a somewhat worse (-0.9) TER score than RM.
Additional Experiments
On the large feature set, RM is again the best performer, except perhaps for a tied BLEU score with MIRA on MT08, though even there it shows a clear 1.8 TER gain.
Conclusions and Future Work
Experimentation in statistical MT yielded significant improvements over several other state-of-the-art optimizers, especially in a high-dimensional feature space (up to 2 BLEU and 4.3 TER on average).
Discussion
RM’s loss was only up to 0.8 BLEU (0.7 TER) from MERT or MIRA, while its gains were up to 1.7 BLEU and 2.1 TER over MIRA.
Discussion
Optimizer   Small set (BLEU / TER)   Large set (BLEU / TER)
MERT        0.4 / 2.6                -   / -
MIRA        0.5 / 3.0                1.4 / 4.3
PRO         1.4 / 2.9                2.0 / 1.7
RAMPION     0.6 / 1.6                1.2 / 2.8
Discussion
Error Analysis: The inconclusive advantage of RM over MIRA (in BLEU vs. TER scores) on Arabic-English MT08 calls for a closer look.
Experiments
As can be seen from the results in Table 3, our RM method was the best performer in all Chinese-English tests according to all measures (up to 1.9 BLEU and 6.6 TER over MIRA), even though we only optimized for BLEU [5]. Surprisingly, it seems that MIRA did not benefit as much from the sparse features as RM.
Experiments
The results are especially notable for the basic feature setting — up to 1.2 BLEU and 4.6 TER improvement over MERT — since MERT has been shown to be competitive with small numbers of features compared to high-dimensional optimizers such as MIRA (Chiang et al., 2008).
Experiments
[5] In the small feature set, RAMPION yielded similar best BLEU scores, but worse TER.
TER is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Wang, Kun and Zong, Chengqing and Su, Keh-Yih
Abstract
Furthermore, the integrated Model-III achieves an overall improvement of 3.48 BLEU points and a reduction of 2.62 TER points in comparison with the pure SMT system.
Experiments
In this work, the translation performance is measured with case-insensitive BLEU-4 score (Papineni et al., 2002) and TER score (Snover et al., 2006).
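Several papers in this index score with case-insensitive BLEU-4 and TER. As a rough illustration of such scoring (not the original mteval/tercom tools the authors used), the following sketch assumes the sacrebleu package and its documented BLEU/TER metric classes; the example sentences are made up.

    # Sketch: case-insensitive corpus-level BLEU-4 and TER with sacrebleu
    # (pip install sacrebleu). Exact scores may differ from the original
    # mteval/tercom tools used in the indexed papers.
    from sacrebleu.metrics import BLEU, TER

    hypotheses = ["the cat sat on the mat"]       # system outputs, one per segment
    references = [["The cat is on the mat."]]     # one list per reference stream

    bleu = BLEU(lowercase=True)                   # case-insensitive BLEU-4
    ter = TER(case_sensitive=False)               # case-insensitive TER

    print(bleu.corpus_score(hypotheses, references))
    print(ter.corpus_score(hypotheses, references))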
Experiments
In the tables, the best translation results (either in BLEU or TER) at each interval have been marked in bold.
Experiments
It can be seen that TM significantly exceeds SMT at the interval [0.9, 1.0) in TER score, which illustrates why professional translators prefer TM to SMT as their assistant tool.
Introduction
Compared with the pure SMT system, the proposed integrated Model-III achieves an overall improvement of 3.48 BLEU points and a reduction of 2.62 TER points.
TER is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Setiawan, Hendra and Zhou, Bowen and Xiang, Bing and Shen, Libin
Abstract
On the NIST MT08 set, our most advanced model brings around +2.0 BLEU and -1.0 TER improvement.
Experiments
Results table header: BLEU and TER scores on MT08 newswire (nw) and MT08 web (wb).
Experiments
The best TER and BLEU results on each genre are in bold.
Experiments
For BLEU, higher scores are better, while for TER, lower scores are better.
TER is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang
Introduction
Experiments show that our approach significantly outperforms both phrase-based (Koehn et al., 2007) and string-to-dependency approaches (Shen et al., 2008) in terms of BLEU and TER.
Introduction
Results table columns: features, BLEU, TER.
Introduction
Adding dependency language model (“depLM”) and the maximum entropy shift-reduce parsing model (“maxent”) significantly improves BLEU and TER on the development set, both separately and jointly.
TER is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Feng, Minwei and Peter, Jan-Thorsten and Ney, Hermann
Abstract
Results on five Chinese-English NIST tasks show that our model improves the baseline system by 1.32 BLEU and 1.53 TER on average.
Conclusion
Experimental results show that our model is stable and improves the baseline system by 0.98 BLEU and 1.21 TER (trained by CRFs) and 1.32 BLEU and 1.53 TER (trained by RNN).
Experiments
BLEU (Papineni et al., 2001) and TER (Snover et al., 2005) are reported; all scores are calculated in a case-insensitive (lowercased) way.
Experiments
An Index column is added for convenient score reference (B for BLEU; T for TER).
Experiments
For the proposed model, significance testing results on both BLEU and TER are reported (B2 and B3 compared to B1, T2 and T3 compared to T1).
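The excerpt does not say which significance test was used; paired bootstrap resampling (Koehn, 2004) is a common choice for corpus-level metrics such as BLEU and TER. A minimal, metric-agnostic sketch in which corpus_metric, sys_a, sys_b, and refs are hypothetical placeholders:

    # Paired bootstrap resampling (Koehn, 2004): resample test segments with
    # replacement, rescore both systems on each sample, and count how often
    # system B beats system A. This is an illustration, not necessarily the
    # test used in the indexed paper.
    import random

    def paired_bootstrap(sys_a, sys_b, refs, corpus_metric,
                         higher_better=True, n_samples=1000, seed=0):
        """Return an approximate p-value for 'B is not better than A'."""
        rng = random.Random(seed)
        n = len(refs)
        wins = 0
        for _ in range(n_samples):
            idx = [rng.randrange(n) for _ in range(n)]             # resampled segment ids
            a = corpus_metric([sys_a[i] for i in idx], [refs[i] for i in idx])
            b = corpus_metric([sys_b[i] for i in idx], [refs[i] for i in idx])
            wins += (b > a) if higher_better else (b < a)          # flip for TER
        return 1.0 - wins / n_samples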
TER is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Hewavitharana, Sanjika and Mehay, Dennis and Ananthakrishnan, Sankaranarayanan and Natarajan, Prem
Abstract
On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU, TER, and NIST.
Experimental Setup and Results
Table 1 summarizes test set performance in BLEU (Papineni et al., 2001), NIST (Doddington, 2002), and TER (Snover et al., 2006).
Experimental Setup and Results
In the ASR setting, which simulates a real-world deployment scenario, this system achieves improvements of 0.39 (BLEU), -0.6 (TER), and 0.08 (NIST).
Introduction
With this approach, we demonstrate significant improvements over a baseline phrase-based SMT system as measured by BLEU, TER and NIST scores on an English-to-Iraqi CSLT task.
TER is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Experiments
Besides the new name-aware MT metric, we also adopt two traditional metrics: TER, to evaluate the overall translation performance, and Named Entity Weak Accuracy (NEWA) (Hermjakob et al., 2008), to evaluate the name translation performance.
Experiments
TER measures the number of edits required to change a system output into one of the reference translations.
Experiments
\mathrm{TER} = \frac{\text{\# of edits}}{\text{average \# of reference words}}
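A minimal sketch of the ratio above: word-level edit distance (insertions, deletions, substitutions) against the closest reference, divided by the average reference length. Full TER as defined by Snover et al. (2006) also counts block shift edits, which this sketch omits; the function names are illustrative.

    # Simplified TER: Levenshtein word edits to the closest reference,
    # normalized by the average reference length. Real TER additionally
    # allows phrase "shift" edits (see tercom, the reference implementation).
    def edit_distance(hyp, ref):
        """Word-level Levenshtein distance between two token lists."""
        row = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            prev, row[0] = row[0], i
            for j, r in enumerate(ref, 1):
                prev, row[j] = row[j], min(row[j] + 1,         # delete h
                                           row[j - 1] + 1,     # insert r
                                           prev + (h != r))    # substitute / match
        return row[len(ref)]

    def simplified_ter(hypothesis, references):
        hyp = hypothesis.lower().split()
        refs = [r.lower().split() for r in references]
        edits = min(edit_distance(hyp, r) for r in refs)       # closest reference
        avg_ref_len = sum(len(r) for r in refs) / len(refs)
        return edits / avg_ref_len

    # One substitution against six reference words on average -> TER ~ 0.167
    print(simplified_ter("the cat sat on the mat",
                         ["the cat is on the mat", "a cat sat on the mat"]))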
Name-aware MT Evaluation
Traditional MT evaluation metrics such as BLEU (Papineni et al., 2002) and Translation Edit Rate (TER) (Snover et al., 2006) assign equal weight to all tokens.
TER is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Braslavski, Pavel and Beloborodov, Alexander and Khalilov, Maxim and Sharoff, Serge
Evaluation methodology
In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003).
Results
While TER and GTM are known to provide better correlation with post-editing efforts for English (O’Brien, 2011), free word order and greater data sparseness at the sentence level make TER much less reliable for Russian.
Results
Metric   Sentence-level median   Sentence-level mean   Sentence-level trimmed   Corpus level
BLEU     0.357                   0.298                 0.348                    0.833
NIST     0.357                   0.291                 0.347                    0.810
Meteor   0.429                   0.348                 0.393                    0.714
TER      0.214                   0.186                 0.204                    0.619
GTM      0.429                   0.340                 0.392                    0.714
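The excerpt does not name the correlation coefficient behind these numbers; as an illustration of how sentence-level metric scores can be correlated with human judgments, here is a sketch using Spearman rank correlation from scipy, with made-up scores.

    # Correlating an automatic metric with human judgments (illustrative only).
    # For an error metric like TER, the correlation is expected to be negative
    # because lower TER means a better translation.
    from scipy.stats import spearmanr

    human_adequacy = [4, 2, 5, 3, 1, 4, 3]                    # hypothetical 1-5 judgments
    ter_scores     = [0.20, 0.55, 0.10, 0.35, 0.80, 0.25, 0.40]

    rho, p_value = spearmanr(human_adequacy, ter_scores)
    print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")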
TER is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Cohen, Shay B. and Johnson, Mark
TER is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: