Index of papers in Proc. ACL that mention
  • TER
Yan, Rui and Gao, Mingkun and Pavlick, Ellie and Callison-Burch, Chris
Crowdsourcing Translation
"8.0 0.5 1.0 1.5 2.0 TER between pre- and post-edit translation
Crowdsourcing Translation
Aggressiveness (x-axis) is measured as the TER between the pre-edit and post-edit version of the translation, and effectiveness (y-axis) is measured as the average amount by which the editing reduces the translation’s TERgold.
Crowdsourcing Translation
We use translation edit rate ( TER ) as a measure of translation similarity.
Evaluation
Lowest TER: 35.78
Evaluation
The first method selects the translation with the minimum average TER (Snover et al., 2006) against the other translations; intuitively, this would represent the “consensus” translation.
Evaluation
The second method selects the translation generated by the Turker who, on average, provides translations with the minimum average TER .
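The two selection heuristics above are easy to prototype. The Python sketch below (not the authors' code) picks the "consensus" translation by minimum average TER against the other candidates; the `ter(hypothesis, reference)` scorer is assumed to be supplied by an existing TER implementation such as tercom.

```python
from typing import Callable, List


def consensus_translation(candidates: List[str],
                          ter: Callable[[str, str], float]) -> str:
    """Return the candidate with the lowest average TER against the others.

    `ter` is an externally supplied scorer, ter(hypothesis, reference) -> score,
    where lower values mean the two strings are more similar.
    """
    def avg_ter(candidate: str) -> float:
        others = [c for c in candidates if c is not candidate]
        return sum(ter(candidate, ref) for ref in others) / max(len(others), 1)

    return min(candidates, key=avg_ter)
```

The second heuristic would aggregate the same pairwise scores per Turker rather than per sentence, keeping the translations of the worker with the lowest average TER.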
TER is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Huang, Fei and Xu, Jian-Ming and Ittycheriah, Abraham and Roukos, Salim
Static MT Quality Estimation
In the rest of the paper, we use TER and HTER interchangeably.
Static MT Quality Estimation
To evaluate the effectiveness of the proposed features, we train various classifiers with different feature configurations to predict whether a translation output is useful (with lower TER ) as described in the following section.
Static MT Quality Estimation
Predicting TER with various input features can be treated as a regression problem.
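A minimal sketch of that regression setup, under assumed inputs: a feature matrix X with one row of extracted features per MT output and a vector y of gold TER scores (both randomly generated placeholders here); scikit-learn's linear regression stands in for whichever regressor or classifier the paper actually uses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder data: each row of X holds features extracted from one MT output
# (e.g. sentence length, LM score, alignment statistics); y holds its TER/HTER.
rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = rng.random(1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
predicted_ter = model.predict(X_test)

# Thresholding the regression output gives the binary "useful translation"
# decision described above (lower predicted TER = more useful); 0.4 is arbitrary.
useful = predicted_ter < 0.4
```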
TER is mentioned in 28 sentences in this paper.
Topics mentioned in this paper:
Guzmán, Francisco and Joty, Shafiq and Màrquez, Lluís and Nakov, Preslav
Conclusions and Future Work
Group III: ROUGE .205 / — / .218 / .242; TER .262 / — / .274 / .296
Experimental Results
Group III: contains other important evaluation metrics, which were not considered in the WMT12 metrics task: NIST and ROUGE for both system- and segment-level, and BLEU and TER at segment-level.
Experimental Results
Group II: TER .812 / .836 / .848; BLEU .810 / .830 / .846
Experimental Results
We can see that DR is already competitive by itself: on average, it has a correlation of .807, very close to BLEU and TER scores (.810 and .812, respectively).
Experimental Setup
From the original ULC, we only replaced TER and Meteor individual metrics by newer versions taking into account synonymy lookup and paraphrasing: TERp-A and METEOR-pa in ASIYA’s terminology.
Experimental Setup
To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms).
Related Work
For BLEU and TER , they observed improved correlation with human judgments on the MTC4 dataset when linearly interpolating these metrics with their lexical cohesion score.
TER is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Elliott, Desmond and Keller, Frank
Abstract
We estimate the correlation of unigram and Smoothed BLEU, TER , ROUGE-SU4, and Meteor against human judgements on two data sets.
Introduction
The main finding of our analysis is that TER and unigram BLEU are weakly correlated with human judgements.
Methodology
TER measures the number of modifications a human would need to make to transform a candidate Y into a reference X.
Methodology
TER is expressed as the percentage of the sentence that needs to be changed, and can be greater than 100 if the candidate is longer than the reference.
Methodology
TER = (# of edits) / |reference tokens|
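For intuition, a simplified TER-style score can be computed as word-level edit distance normalized by reference length; the sketch below omits the block shifts of the full metric (Snover et al., 2006), so it is only an approximation, not the scorer used in the paper.

```python
def simple_ter(candidate: str, reference: str) -> float:
    """Word-level edit distance (insert/delete/substitute) over reference length.

    Simplification of TER: the real metric also counts block shifts. The result
    is a fraction (multiply by 100 for a percentage) and can exceed 1.0, i.e.
    100%, when the candidate is much longer than the reference.
    """
    hyp, ref = candidate.split(), reference.split()
    # Dynamic-programming edit distance over words.
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i
    for j in range(len(ref) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution/match
    return dist[len(hyp)][len(ref)] / max(len(ref), 1)


# A candidate much longer than its reference scores above 100%.
print(simple_ter("a small grey cat sits on the old wooden table", "a cat"))  # 4.0
```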
Results
TER is only weakly correlated with human judgements but could prove useful in comparing the types of differences between models.
Results
An analysis of the distribution of TER scores in Figure 2(a) shows that differences in candidate and reference length are prevalent in the image description task.
TER is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Wang, Kun and Zong, Chengqing and Su, Keh-Yih
Abstract
Furthermore, integrated Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction in comparison with the pure SMT system.
Experiments
In this work, the translation performance is measured with case-insensitive BLEU-4 score (Papineni et al., 2002) and TER score (Snover et al., 2006).
Experiments
In the tables, the best translation results (either in BLEU or TER ) at each interval have been marked in bold.
Experiments
It can be seen that TM significantly exceeds SMT at the interval [0.9, 1.0) in TER score, which illustrates why professional translators prefer TM rather than SMT as their assistant tool.
Introduction
Compared with the pure SMT system, the proposed integrated Model-III achieves 3.48 BLEU points improvement and 2.62 TER points reduction overall.
TER is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Eidelman, Vladimir and Marton, Yuval and Resnik, Philip
Abstract
We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.
Additional Experiments
As can be seen in Table 4, in the smaller feature set, RM and MERT were the best performers, with the exception that on MT08, MIRA yielded somewhat better (+0.7) BLEU but a somewhat worse (-0.9) TER score than RM.
Additional Experiments
On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain.
Conclusions and Future Work
Experimentation in statistical MT yielded significant improvements over several other state-of-the-art optimizers, especially in a high-dimensional feature space (up to 2 BLEU and 4.3 TER on average).
Discussion
RM’s loss was only up to 0.8 BLEU (0.7 TER) from MERT or MIRA, while its gains were up to 1.7 BLEU and 2.1 TER over MIRA.
Discussion
Optimizer: small set (BLEU / TER), large set (BLEU / TER). MERT 0.4 / 2.6, - / -; MIRA 0.5 / 3.0, 1.4 / 4.3; PRO 1.4 / 2.9, 2.0 / 1.7; RAMPION 0.6 / 1.6, 1.2 / 2.8
Discussion
Error Analysis: The inconclusive advantage of RM over MIRA (in BLEU vs. TER scores) on Arabic-English MT08 calls for a closer look.
Experiments
As can be seen from the results in Table 3, our RM method was the best performer in all Chinese-English tests according to all measures — up to 1.9 BLEU and 6.6 TER over MIRA — even though we only optimized for BLEU. Surprisingly, it seems that MIRA did not benefit as much from the sparse features as RM.
Experiments
The results are especially notable for the basic feature setting — up to 1.2 BLEU and 4.6 TER improvement over MERT — since MERT has been shown to be competitive with small numbers of features compared to high-dimensional optimizers such as MIRA (Chiang et al., 2008).
Experiments
In the small feature set, RAMPION yielded similar best BLEU scores, but worse TER .
TER is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Feng, Yansong and Lapata, Mirella
Experimental Setup
Our automatic evaluation was based on Translation Edit Rate ( TER , Snover et al., 2006).
Experimental Setup
TER is defined as the minimum number of edits a human would have to perform to change the system output so that it exactly matches a reference translation.
Experimental Setup
TER(E, E_r) = (Ins + Del + Sub + Shft) / M    (16)
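Reading the equation above with illustrative counts (not values from the paper), and taking M to be the reference length as in standard TER: one insertion, two deletions, one substitution, no shifts, and a 10-token reference give

```latex
\mathrm{TER}(E, E_r) = \frac{\mathrm{Ins} + \mathrm{Del} + \mathrm{Sub} + \mathrm{Shft}}{M}
                     = \frac{1 + 2 + 1 + 0}{10} = 0.4
```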
Results
Table 2 reports our results on the test set using TER .
Results
The abstractive models obtain the best TER scores overall; however, they generate shorter captions than the other models (closer to the length of the gold standard), and as a result TER treats them favorably, simply because the number of edits is smaller.
TER is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Setiawan, Hendra and Zhou, Bowen and Xiang, Bing and Shen, Libin
Abstract
On NIST MT08 set, our most advanced model brings around +2.0 BLEU and -1.0 TER improvement.
Experiments
Table header: BLEU and TER columns for the MT08 nw and MT08 wb test genres.
Experiments
The best TER and BLEU results on each genre are in bold.
Experiments
For BLEU, higher scores are better, while for TER , lower scores are better.
TER is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Shen, Libin and Xu, Jinxi and Weischedel, Ralph
Abstract
Our experiments show that the string-to-dependency decoder achieves 1.48 point improvement in BLEU and 2.53 point improvement in TER compared to a standard hierarchical string-to-string system on the NIST 04 Chinese-English evaluation set.
Conclusions and Future Work
Our string-to-dependency system generates 80% fewer rules, and achieves 1.48 point improvement in BLEU and 2.53 point improvement in TER on the decoding output on the NIST 04 Chinese-English evaluation set.
Experiments
All models are tuned on BLEU (Papineni et al., 2001), and evaluated on both BLEU and Translation Error Rate ( TER ) (Snover et al., 2006) so that we could detect over-tuning on one metric.
Experiments
Columns: BLEU% (lower / mixed), TER% (lower / mixed).
Decoding (3-gram LM): baseline 38.18 / 35.77, 58.91 / 56.60; filtered 37.92 / 35.48, 57.80 / 55.43; str-dep 39.52 / 37.25, 56.27 / 54.07.
Rescoring (5-gram LM): baseline 40.53 / 38.26, 56.35 / 54.15; filtered 40.49 / 38.26, 55.57 / 53.47; str-dep 41.60 / 39.47, 55.06 / 52.96.
Experiments
Table 2: BLEU and TER scores on the test set.
Introduction
Our string-to-dependency decoder shows 1.48 point improvement in BLEU and 2.53 point improvement in TER on the NIST 04 Chinese-English MT evaluation set.
TER is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Nuhn, Malte and Mauser, Arne and Ney, Hermann
Experimental Evaluation
For the OPUS and VERBMOBIL corpora, we evaluate the results using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) against reference translations.
Experimental Evaluation
For BLEU higher values are better, for TER lower values are better.
Experimental Evaluation
Figure 3 and Figure 4 show the evolution of BLEU and TER scores for applying our method using a 2-gram and a 3-gram LM.
TER is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang
Introduction
Experiments show that our approach significantly outperforms both phrase-based (Koehn et al., 2007) and string-to-dependency approaches (Shen et al., 2008) in terms of BLEU and TER .
Introduction
Table columns: features, BLEU, TER.
Introduction
Adding dependency language model (“depLM”) and the maximum entropy shift-reduce parsing model (“maxent”) significantly improves BLEU and TER on the development set, both separately and jointly.
TER is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Galley, Michel and Manning, Christopher D.
Abstract
Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets.
Conclusion and future work
We use dependency scores as an extra feature in our MT experiments, and found that our dependency model provides significant gains over a competitive baseline that incorporates a large 5-gram language model (0.92% TER and 0.45% BLEU absolute improvements).
Introduction
In our experiments, we build a competitive baseline (Koehn et al., 2007) incorporating a 5-gram LM trained on a large part of Gigaword and show that our dependency language model provides improvements on five different test sets, with an overall gain of 0.92 in TER and 0.45 in BLEU scores.
Machine translation experiments
In the final evaluations, we report results using both TER (Snover et al., 2006) and the original BLEU metric as described in (Papineni et al., 2001).
Machine translation experiments
For BLEU evaluations, differences are significant in four out of six cases, and in the case of TER , all differences are significant.
Machine translation experiments
On the other hand, the difference on MT08 is significant in terms of TER .
TER is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Feng, Minwei and Peter, Jan-Thorsten and Ney, Hermann
Abstract
Results on five Chinese-English NIST tasks show that our model improves the baseline system by 1.32 BLEU and 1.53 TER on average.
Conclusion
Experimental results show that our model is stable and improves the baseline system by 0.98 BLEU and 1.21 TER (trained by CRFs) and 1.32 BLEU and 1.53 TER (trained by RNN).
Experiments
BLEU (Papineni et al., 2001) and TER (Snover et al., 2005); all reported scores are calculated in a case-insensitive (lowercase) way.
Experiments
An Index column is added for score reference convenience (B for BLEU; T for TER ).
Experiments
For the proposed model, significance testing results on both BLEU and TER are reported (B2 and B3 compared to B1, T2 and T3 compared to T1).
TER is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Zhao, Bing and Lee, Young-Suk and Luo, Xiaoqiang and Li, Liu
Experiments
We use BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) to evaluate translation qualities.
Experiments
Table columns: Setups, TER, BLEUr4n4.
Experiments
Table 10: TER and BLEU for MT08-NW.
TER is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhu, Muhua and Wang, Huizhen
Background
Diversity ( TER [%])
Background
The diversity is measured in terms of the Translation Error Rate ( TER ) metric proposed in (Snover et al., 2006).
TER is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Eidelman, Vladimir and Boyd-Graber, Jordan and Resnik, Philip
Abstract
Conditioning lexical probabilities on the topic biases translations toward topic-relevant output, resulting in significant improvements of up to 1 BLEU and 3 TER on Chinese to English translation over a strong baseline.
Experiments
On FBIS, we can see that both models achieve moderate but consistent gains over the baseline on both BLEU and TER .
Experiments
The best model, LTM-10, achieves a gain of about 0.5 and 0.6 BLEU and 2 TER .
Experiments
Although the performance on BLEU for both the 20 topic models LTM-20 and GTM-20 is suboptimal, the TER improvement is better.
Introduction
Incorporating these features into our hierarchical phrase-based translation system significantly improved translation performance, by up to 1 BLEU and 3 TER over a strong Chinese to English baseline.
TER is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Wuebker, Joern and Mauser, Arne and Ney, Hermann
Conclusion
In TER , improvements are 0.4 and 1.7 points.
Experimental Evaluation
Table columns: BLEU, TER.
Experimental Evaluation
The metrics used for evaluation are the case-sensitive BLEU (Papineni et al., 2002) score and the translation edit rate ( TER ) (Snover et al., 2006) with one reference translation.
Experimental Evaluation
A second iteration of the training algorithm shows nearly no changes in BLEU score, but a small improvement in TER .
TER is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhang, Chunliang
Abstract
We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data.
Conclusion and Future Work
The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data.
Evaluation
Table 1 shows the case-insensitive IBM-version BLEU and TER scores of different systems.
Evaluation
As seen from row -lmT of Table 1, the removal of the skeletal language model results in a significant drop in both BLEU and TER performance.
Evaluation
Row s-space of Table 1 shows the BLEU and TER results of restricting the baseline system to the space of skeleton-consistent derivations, i.e., we remove both the skeleton-based translation model and language model from the SBMT system.
Introduction
We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data.
TER is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Kuhn, Roland and Larkin, Samuel
Experiments
We employed BLEU4, METEOR (V1.0), TER (v0.7.25), and the new metric PORT.
Experiments
In the table, TER scores are presented as 1-TER to ensure that for all metrics, higher scores mean higher quality.
Introduction
BLEU (Papineni et al., 2002), NIST (Doddington, 2002), WER, PER, TER (Snover et al., 2006), and LRscore (Birch and Osborne, 2011) do not use external linguistic information; they are fast to compute (except TER ).
Introduction
(2010) showed that BLEU tuning is more robust than tuning with other metrics (METEOR, TER , etc.).
TER is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Duh, Kevin and Sudoh, Katsuhito and Wu, Xianchao and Tsukada, Hajime and Nagata, Masaaki
Abstract
BLEU, TER ) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality.
Introduction
TER (Snover et al., 2006) allows arbitrary chunk movements, while permutation metrics like RIBES (Isozaki et al., 2010; Birch et al., 2010) measure deviation in word order.
Introduction
Experiments on NIST Chinese-English and PubMed English-Japanese translation using BLEU, TER , and RIBES are presented in Section 4.
TER is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Hewavitharana, Sanjika and Mehay, Dennis and Ananthakrishnan, Sankaranarayanan and Natarajan, Prem
Abstract
On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU, TER , and NIST.
Experimental Setup and Results
Table 1 summarizes test set performance in BLEU (Papineni et al., 2001), NIST (Doddington, 2002) and TER (Snover et al., 2006).
Experimental Setup and Results
In the ASR setting, which simulates a real-world deployment scenario, this system achieves improvements of 0.39 (BLEU), -0.6 ( TER ) and 0.08 (NIST).
Introduction
With this approach, we demonstrate significant improvements over a baseline phrase-based SMT system as measured by BLEU, TER and NIST scores on an English-to-Iraqi CSLT task.
TER is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Experiments
Besides the new name-aware MT metric, we also adopt two traditional metrics, TER to evaluate the overall translation performance and Named Entity Weak Accuracy (NEWA) (Hermjakob et al., 2008) to evaluate the name translation performance.
Experiments
TER measures the amount of edits required to change a system output into one of the reference translations.
Experiments
TER = # of edits / average # of reference words    (10)
Name-aware MT Evaluation
Traditional MT evaluation metrics such as BLEU (Papineni et al., 2002) and Translation Edit Rate ( TER ) (Snover et al., 2006) assign the same weights to all tokens equally.
TER is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Pado, Sebastian and Galley, Michel and Jurafsky, Dan and Manning, Christopher D.
Abstract
We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER , and METEOR) in two different settings.
Experimental Evaluation
We therefore verified that the three nontrivial “baseline” regression models indeed confer a benefit over the default component combination scores: BLEU-1 (which outperformed BLEU-4 in the MetricsMATR 2008 evaluation), NIST-4, and TER (with all costs set to 1).
Experimental Evaluation
We start with the standard TER score and the number of each of the four edit operations.
Introduction
A number of metrics have been designed to account for paraphrase, either by making the matching more intelligent ( TER , Snover et al.
TER is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Lo, Chi-kiu and Beloucif, Meriem and Saers, Markus and Wu, Dekai
Introduction
In addition, the translation adequacy across different genres (ranging from formal news to informal web forum and public speech) and different languages (English and Chinese) is improved by replacing BLEU or TER with MEANT during parameter tuning (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b).
Related Work
Surface-form oriented metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) do not correctly reflect the meaning similarities of the input sentence.
Related Work
MEANT (Lo et al., 2012), the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers, outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgments.
Related Work
Tuning MT systems against MEANT produces more robustly adequate translations than the common practice of tuning against BLEU or TER across different data genres, such as formal newswire text, informal web forum text and informal public speech.
TER is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Salameh, Mohammad and Cherry, Colin and Kondrak, Grzegorz
Experimental Setup
We evaluate our system using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006).
Results
Nonetheless, the 1000-best and lattice desegmenters both produce significant improvements over the 1-best desegmentation baseline, with Lattice Deseg achieving a 1-point improvement in TER .
Results
Model / Dev BLEU / Test BLEU / Test TER: Unsegmented 15.4 / 15.1 / 70.
TER is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Braslavski, Pavel and Beloborodov, Alexander and Khalilov, Maxim and Sharoff, Serge
Evaluation methodology
In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003).
Results
While TER and GTM are known to provide better correlation with post-editing efforts for English (O’Brien, 2011), free word order and greater data sparseness on the sentence level makes TER much less reliable for Russian.
Results
Metric, sentence-level (median / mean / trimmed), corpus level: BLEU 0.357 / 0.298 / 0.348, 0.833; NIST 0.357 / 0.291 / 0.347, 0.810; Meteor 0.429 / 0.348 / 0.393, 0.714; TER 0.214 / 0.186 / 0.204, 0.619; GTM 0.429 / 0.340 / 0.392, 0.714
TER is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Cohen, Shay B. and Johnson, Mark
TER is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Bojar, Ondřej and Kos, Kamil and Mareċek, David
Conclusion
Functor 0.21 / 0.40 / 0.09; Voidpar 0.16 / 0.53 / -0.08; PER 0.12 / 0.53 / -0.09; TER 0.07 / 0.53 / -0.23
Extensions of SemPOS
NIST 0.69 / 0.90 / 0.53; SemPOS 0.69 / 0.95 / 0.30; 2-SemPOS+1-BLEU4 0.68 / 0.91 / 0.09; BLEU1 0.68 / 0.87 / 0.43; BLEU2 0.68 / 0.90 / 0.26; BLEU3 0.66 / 0.90 / 0.14; BLEU 0.66 / 0.91 / 0.20; TER 0.63 / 0.87 / 0.29; PER 0.63 / 0.88 / 0.32; BLEU4 0.61 / 0.90 / -0.31; Functorpar 0.57 / 0.83 / -0.03; Functor 0.55 / 0.82 / -0.09
Extensions of SemPOS
The error metrics PER and TER showed the lowest correlation with human judgments for translation to Czech.
TER is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Li, Mu and Duan, Nan and Zhang, Dongdong and Li, Chi-Ho and Zhou, Ming
Experiments
Actually, we find that the TER score between two member decoders’ outputs is significantly reduced (as shown in Table 3), which indicates that the outputs become more similar due to the use of consensus information.
Experiments
For example, the TER score between SYS2 and SYS3 of the NIST 2008 outputs is reduced from 0.4238 to 0.2665.
Experiments
Table 3: TER scores between co-decoding translation outputs
TER is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: