Abstract | We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets.
Introduction | The main finding of our analysis is that TER and unigram BLEU are weakly correlated with human judgements.
Methodology | TER measures the number of modifications a human would need to make to transform a candidate Y into a reference X. |
Methodology | TER is expressed as the percentage of the sentence that needs to be changed, and can be greater than 100 if the candidate is longer than the reference. |
Methodology | $\mathrm{TER} = \frac{\text{number of edits}}{|\text{reference tokens}|}$
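Methodology | As a minimal illustration of this formula, the sketch below computes a simplified TER from word-level edit distance; note that full TER also counts block shifts as single edits, which this sketch omits, and the function names here are our own.

```python
def edit_distance(candidate, reference):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    m, n = len(candidate), len(reference)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if candidate[i - 1] == reference[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def simplified_ter(candidate, reference):
    """Edits needed to turn candidate into reference, as a % of reference length."""
    cand, ref = candidate.split(), reference.split()
    return 100.0 * edit_distance(cand, ref) / len(ref)

# The score can exceed 100 when the candidate is longer than the reference:
print(simplified_ter("the cat sat on the mat today", "cat sat"))  # 250.0
```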
Results | TER is only weakly correlated with human judgements but could prove useful in comparing the types of differences between models. |
Results | An analysis of the distribution of TER scores in Figure 2(a) shows that differences in candidate and reference length are prevalent in the image description task. |
Conclusions and Future Work | Group III: ROUGE .205, —, .218, .242; TER .262, —, .274, .296
Experimental Results | Group III contains other important evaluation metrics that were not considered in the WMT12 metrics task: NIST and ROUGE at both system and segment level, and BLEU and TER at segment level.
Experimental Results | Group II: TER .812, .836, .848; BLEU .810, .830, .846
Experimental Results | We can see that DR is already competitive by itself: on average, it has a correlation of .807, very close to BLEU and TER scores (.810 and .812, respectively). |
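Experimental Results | For concreteness, system-level correlations like these are typically computed as below; the metric and human scores here are invented placeholders, not values from the paper.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system scores: one metric score and one human
# adequacy score per MT system. Since lower TER is better, TER is
# usually negated (or 1 - TER used) before correlating.
metric_scores = [0.31, 0.28, 0.35, 0.25, 0.30]
human_scores = [3.9, 3.5, 4.2, 3.1, 3.8]

r, _ = pearsonr(metric_scores, human_scores)
rho, _ = spearmanr(metric_scores, human_scores)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```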
Experimental Setup | From the original ULC, we replaced only the individual TER and Meteor metrics with newer versions that take synonymy lookup and paraphrasing into account: TERp-A and METEOR-pa in ASIYA’s terminology.
Experimental Setup | To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms). |
Related Work | For BLEU and TER, they observed improved correlation with human judgments on the MTC4 dataset when linearly interpolating these metrics with their lexical cohesion score.
Static MT Quality Estimation | In the rest of the paper, we use TER and HTER interchangeably.
Static MT Quality Estimation | To evaluate the effectiveness of the proposed features, we train various classifiers with different feature configurations to predict whether a translation output is useful (i.e., has lower TER), as described in the following section.
Static MT Quality Estimation | Predicting TER with various input features can be treated as a regression problem. |
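Static MT Quality Estimation | A minimal sketch of this regression setup, assuming generic numeric feature vectors (the concrete feature set and model choice are our placeholders, not the paper's):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 10))   # placeholder feature vectors, one per translation output
y = rng.random(200) * 100   # placeholder TER scores (the regression targets)

model = GradientBoostingRegressor().fit(X[:150], y[:150])
predicted_ter = model.predict(X[150:])

# The classification view mentioned above: threshold the predicted TER
# to decide whether a translation output counts as "useful".
useful = predicted_ter < 50.0  # 50.0 is an arbitrary threshold for illustration
```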
Crowdsourcing Translation | "8.0 0.5 1.0 1.5 2.0 TER between pre- and post-edit translation |
Crowdsourcing Translation | Aggressiveness (x-axis) is measured as the TER between the pre-edit and post-edit version of the translation, and effectiveness (y-axis) is measured as the average amount by which the editing reduces the translation’s TERgold. |
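Crowdsourcing Translation | Under these definitions, the two quantities could be sketched as follows, reusing the simplified_ter() function from the Methodology sketch above; the argument order in the pre/post-edit comparison is our assumption.

```python
def aggressiveness(pre_edit, post_edit):
    # How much the editor changed the translation:
    # TER between the pre-edit and post-edit versions.
    return simplified_ter(post_edit, pre_edit)

def effectiveness(pre_edit, post_edit, gold):
    # How much the edit moved the translation toward the gold reference:
    # the reduction in TER against the gold standard.
    return simplified_ter(pre_edit, gold) - simplified_ter(post_edit, gold)
```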
Crowdsourcing Translation | We use translation edit rate (TER) as a measure of translation similarity.
Evaluation | Lowest TER: 35.78
Evaluation | The first method selects the translation with the minimum average TER (Snover et al., 2006) against the other translations; intuitively, this would represent the “consensus” translation. |
Evaluation | The second method selects the translation generated by the Turker who, on average, provides translations with the minimum average TER . |
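Evaluation | Both selection rules reduce to the same pattern: score each candidate by its average TER against the other translations and pick the minimum. A sketch of the first rule, again reusing the simplified_ter() function sketched earlier:

```python
def consensus_translation(translations):
    """Select the translation with minimum average TER against the others."""
    def avg_ter(candidate):
        others = [t for t in translations if t is not candidate]
        return sum(simplified_ter(candidate, ref) for ref in others) / len(others)
    return min(translations, key=avg_ter)
```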
Abstract | We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data. |
Conclusion and Future Work | The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data. |
Evaluation | Table 1 shows the case-insensitive IBM-version BLEU and TER scores of different systems. |
Evaluation | As seen from row −lmT of Table 1, the removal of the skeletal language model results in a significant drop in both BLEU and TER performance.
Evaluation | Row s-space of Table 1 shows the BLEU and TER results of restricting the baseline system to the space of skeleton-consistent derivations, i.e., we remove both the skeleton-based translation model and language model from the SBMT system. |
Introduction | • We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data.
Introduction | In addition, the translation adequacy across different genres (ranging from formal news to informal web forum and public speech) and different languages (English and Chinese) is improved by replacing BLEU or TER with MEANT during parameter tuning (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b). |
Related Work | Surface-form-oriented metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) do not correctly reflect the meaning similarities of the input sentence.
Related Work | MEANT (Lo et al., 2012) is the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers; it outperforms BLEU, NIST, METEOR, WER, CDER, and TER in correlation with human adequacy judgments.
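Related Work | As a rough illustration of the weighted f-score idea only (MEANT's actual per-role weighting and frame alignment are more involved; the counts below are placeholders):

```python
def weighted_fscore(matched, num_candidate, num_reference, beta=1.0):
    # Precision and recall over matched semantic role labels.
    precision = matched / num_candidate
    recall = matched / num_reference
    return ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)

print(weighted_fscore(matched=6, num_candidate=8, num_reference=10))  # ~0.667
```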
Related Work | Tuning systems against MEANT produces more robustly adequate translations than the common practice of tuning against BLEU or TER across different data genres, such as formal newswire text, informal web forum text, and informal public speech.
Experimental Setup | We evaluate our system using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). |
Results | Nonetheless, the 1000-best and lattice desegmenters both produce significant improvements over the 1-best desegmentation baseline, with Lattice Deseg achieving a 1-point improvement in TER . |
Results | Model: Unsegmented; Dev BLEU: 15.4; Test BLEU: 15.1; Test TER: 70.