Abstract | We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets.
Introduction | The main finding of our analysis is that TER and unigram BLEU are weakly correlated with human judgements.
Methodology | TER measures the number of modifications a human would need to make to transform a candidate Y into a reference X. |
Methodology | TER is expressed as the percentage of the sentence that needs to be changed, and can be greater than 100 if the candidate is longer than the reference. |
Methodology | $\mathrm{TER} = \frac{\text{number of edits}}{|\text{reference tokens}|}$
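Methodology | As a minimal illustration of this formula, the sketch below computes a simplified TER from word-level edit distance; note that full TER also counts block shifts as single edits, which this sketch omits, and the function names here are our own.

```python
def edit_distance(candidate, reference):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    m, n = len(candidate), len(reference)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if candidate[i - 1] == reference[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def simplified_ter(candidate, reference):
    """Edits needed to turn candidate into reference, as a % of reference length."""
    cand, ref = candidate.split(), reference.split()
    return 100.0 * edit_distance(cand, ref) / len(ref)

# The score can exceed 100 when the candidate is longer than the reference:
print(simplified_ter("the cat sat on the mat today", "cat sat"))  # 250.0
```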
Results | TER is only weakly correlated with human judgements but could prove useful in comparing the types of differences between models. |
Results | An analysis of the distribution of TER scores in Figure 2(a) shows that differences in candidate and reference length are prevalent in the image description task. |
Conclusions and Future Work | Group III: ROUGE .205, —, .218, .242; TER .262, —, .274, .296
Experimental Results | Group III contains other important evaluation metrics that were not considered in the WMT12 metrics task: NIST and ROUGE at both system and segment level, and BLEU and TER at segment level.
Experimental Results | Group II: TER .812, .836, .848; BLEU .810, .830, .846
Experimental Results | We can see that DR is already competitive by itself: on average, it has a correlation of .807, very close to BLEU and TER scores (.810 and .812, respectively). |
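Experimental Results | For concreteness, system-level correlations like these are typically computed as below; the metric and human scores here are invented placeholders, not values from the paper.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system scores: one metric score and one human
# adequacy score per MT system. Since lower TER is better, TER is
# usually negated (or 1 - TER used) before correlating.
metric_scores = [0.31, 0.28, 0.35, 0.25, 0.30]
human_scores = [3.9, 3.5, 4.2, 3.1, 3.8]

r, _ = pearsonr(metric_scores, human_scores)
rho, _ = spearmanr(metric_scores, human_scores)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```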
Experimental Setup | From the original ULC, we replaced only the individual TER and Meteor metrics with newer versions that take synonymy lookup and paraphrasing into account: TERp-A and METEOR-pa in ASIYA’s terminology.
Experimental Setup | To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms). |
Related Work | For BLEU and TER, they observed improved correlation with human judgments on the MTC4 dataset when linearly interpolating these metrics with their lexical cohesion score.
Static MT Quality Estimation | In the rest of the paper, we use TER and HTER interchangeably.
Static MT Quality Estimation | To evaluate the effectiveness of the proposed features, we train various classifiers with different feature configurations to predict whether a translation output is useful (i.e., has lower TER), as described in the following section.
Static MT Quality Estimation | Predicting TER with various input features can be treated as a regression problem. |
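Static MT Quality Estimation | A minimal sketch of this regression setup, assuming generic numeric feature vectors (the concrete feature set and model choice are our placeholders, not the paper's):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 10))   # placeholder feature vectors, one per translation output
y = rng.random(200) * 100   # placeholder TER scores (the regression targets)

model = GradientBoostingRegressor().fit(X[:150], y[:150])
predicted_ter = model.predict(X[150:])

# The classification view mentioned above: threshold the predicted TER
# to decide whether a translation output counts as "useful".
useful = predicted_ter < 50.0  # 50.0 is an arbitrary threshold for illustration
```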
Crowdsourcing Translation | "8.0 0.5 1.0 1.5 2.0 TER between pre- and post-edit translation |
Crowdsourcing Translation | Aggressiveness (x-axis) is measured as the TER between the pre-edit and post-edit version of the translation, and effectiveness (y-axis) is measured as the average amount by which the editing reduces the translation’s TERgold. |
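Crowdsourcing Translation | Under these definitions, the two quantities could be sketched as follows, reusing the simplified_ter() function from the Methodology sketch above; the argument order in the pre/post-edit comparison is our assumption.

```python
def aggressiveness(pre_edit, post_edit):
    # How much the editor changed the translation:
    # TER between the pre-edit and post-edit versions.
    return simplified_ter(post_edit, pre_edit)

def effectiveness(pre_edit, post_edit, gold):
    # How much the edit moved the translation toward the gold reference:
    # the reduction in TER against the gold standard.
    return simplified_ter(pre_edit, gold) - simplified_ter(post_edit, gold)
```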
Crowdsourcing Translation | We use translation edit rate (TER) as a measure of translation similarity.
Evaluation | Lowest TER: 35.78
Evaluation | The first method selects the translation with the minimum average TER (Snover et al., 2006) against the other translations; intuitively, this would represent the “consensus” translation. |
Evaluation | The second method selects the translation generated by the Turker who, on average, provides translations with the minimum average TER . |
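Evaluation | Both selection rules reduce to the same pattern: score each candidate by its average TER against the other translations and pick the minimum. A sketch of the first rule, again reusing the simplified_ter() function sketched earlier:

```python
def consensus_translation(translations):
    """Select the translation with minimum average TER against the others."""
    def avg_ter(candidate):
        others = [t for t in translations if t is not candidate]
        return sum(simplified_ter(candidate, ref) for ref in others) / len(others)
    return min(translations, key=avg_ter)
```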
Abstract | We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data. |
Conclusion and Future Work | The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data. |
Evaluation | Table 1 shows the case-insensitive IBM-version BLEU and TER scores of different systems. |
Evaluation | As seen from row −lmT of Table 1, the removal of the skeletal language model results in a significant drop in both BLEU and TER performance.
Evaluation | Row s-space of Table 1 shows the BLEU and TER results of restricting the baseline system to the space of skeleton-consistent derivations, i.e., we remove both the skeleton-based translation model and language model from the SBMT system. |
Introduction | • We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data.
Introduction | In addition, the translation adequacy across different genres (ranging from formal news to informal web forum and public speech) and different languages (English and Chinese) is improved by replacing BLEU or TER with MEANT during parameter tuning (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b). |
Related Work | Surface-form-oriented metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) do not correctly reflect the meaning similarities of the input sentence.
Related Work | MEANT (Lo et al., 2012) is the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers; it outperforms BLEU, NIST, METEOR, WER, CDER, and TER in correlation with human adequacy judgments.
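Related Work | As a rough illustration of the weighted f-score idea only (MEANT's actual per-role weighting and frame alignment are more involved; the counts below are placeholders):

```python
def weighted_fscore(matched, num_candidate, num_reference, beta=1.0):
    # Precision and recall over matched semantic role labels.
    precision = matched / num_candidate
    recall = matched / num_reference
    return ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)

print(weighted_fscore(matched=6, num_candidate=8, num_reference=10))  # ~0.667
```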
Related Work | Tuning systems against MEANT produces more robustly adequate translations than the common practice of tuning against BLEU or TER across different data genres, such as formal newswire text, informal web forum text, and informal public speech.
Experimental Setup | We evaluate our system using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). |
Results | Nonetheless, the 1000-best and lattice desegmenters both produce significant improvements over the 1-best desegmentation baseline, with Lattice Deseg achieving a 1-point improvement in TER . |
Results | Model: Unsegmented; Dev BLEU: 15.4; Test BLEU: 15.1; Test TER: 70.