Tackling Sparse Data Issue in Machine Translation Evaluation
Bojar, Ondřej and Kos, Kamil and Mareček, David

Article Structure

Abstract

We illustrate and explain problems of n-gram-based machine translation (MT) metrics (e.g. BLEU) when applied to morphologically rich languages such as Czech.

Introduction

Automatic metrics of machine translation (MT) quality are vital for rapid research progress.

Problems of BLEU

BLEU (Papineni et al., 2002) is an established language-independent MT metric.

Extensions of SemPOS

SemPOS (Kos and Bojar, 2009) is inspired by metrics based on the overlap of linguistic features in the reference and in the translation (Giménez and Màrquez, 2007).

Conclusion

This paper documented problems of single-reference BLEU when applied to morphologically rich languages such as Czech.

Topics

BLEU

Appears in 22 sentences as: BLEU (22)
In Tackling Sparse Data Issue in Machine Translation Evaluation
  1. …(e.g. BLEU) when applied to morphologically rich languages such as Czech.
    Page 1, “Abstract”
  2. Section 2 illustrates and explains severe problems of a widely used BLEU metric (Papineni et al., 2002) when applied to Czech as a representative of languages with rich morphology.
    Page 1, “Introduction”
  3. [Figure 1 plot: BLEU (x-axis, 0.06–0.14) against human rank for systems including cu-bojar and uedin]
    Page 1, “Introduction”
  4. Figure 1: BLEU and human ranks of systems participating in the English-to-Czech WMT09 shared task.
    Page 1, “Introduction”
  5. BLEU (Papineni et al., 2002) is an established language-independent MT metric.
    Page 1, “Problems of BLEU”
  6. The unbeaten advantage of BLEU is its simplicity.
    Page 1, “Problems of BLEU”
  7. We plot the official BLEU score against the rank established as the percentage of sentences where a system ranked no worse than all its competitors (Callison-Burch et al., 2009). [This rank computation is sketched in code after this list.]
    Page 1, “Problems of BLEU”
  8. In a manual analysis, we identified the reasons for the low correlation: BLEU is overly sensitive to sequences and forms in the hypothesis matching the reference translation.
    Page 1, “Problems of BLEU”
  9. In terms of BLEU, both hypotheses are equally poor but 90% of their tokens were not evaluated.
    Page 2, “Problems of BLEU”
  10. Figure 3 documents the issue across languages: the lower the BLEU score itself (i.e. fewer confirmed n-grams), the lower the correlation to human judgments regardless of the target language (WMT09 shared task, 2025 sentences per language).
    Page 2, “Problems of BLEU”
  11. A phrase-based system like Moses (cu-bojar) can sometimes produce a long sequence of tokens exactly as required by the reference, leading to a high BLEU score.
    Page 2, “Problems of BLEU”
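
Item 7 above describes how the human rank of a system was established. Below is a minimal sketch of that computation, assuming sentence-level judgments arrive as dictionaries mapping system names to ranks (1 = best); the data format, function name, and example systems are illustrative assumptions, not the WMT tooling.

```python
from collections import defaultdict

def human_rank_scores(judgments):
    """Fraction of judged sentences where a system ranked no worse than
    every competitor in the same judgment (the rank described in
    Callison-Burch et al., 2009). Format of `judgments` is assumed."""
    wins = defaultdict(int)    # judgments where the system tied for best
    judged = defaultdict(int)  # judgments the system appeared in
    for ranking in judgments:  # ranking: {system_name: rank}, 1 = best
        best = min(ranking.values())
        for system, rank in ranking.items():
            judged[system] += 1
            if rank == best:   # no competitor ranked strictly better
                wins[system] += 1
    return {s: wins[s] / judged[s] for s in judged}

# Hypothetical sentence-level judgments over three systems:
judgments = [
    {"cu-bojar": 1, "uedin": 2, "google": 1},
    {"cu-bojar": 2, "uedin": 1},
    {"uedin": 3, "google": 1},
]
print(human_rank_scores(judgments))
# {'cu-bojar': 0.5, 'uedin': 0.333..., 'google': 1.0}
```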


human judgments

Appears in 11 sentences as: human judgments (11)
In Tackling Sparse Data Issue in Machine Translation Evaluation
  1. Many automatic metrics of MT quality have been proposed and evaluated in terms of correlation with human judgments, while various techniques of manual judging are being examined as well; see e.g. the MetricsMATR08, WMT08 and WMT09 evaluation campaigns.
    Page 1, “Introduction”
  2. Its correlation to human judgments was originally deemed high (for English) but better correlating metrics (esp.
    Page 1, “Problems of BLEU”
  3. Figure 1 illustrates a very low correlation to human judgments when translating to Czech.
    Page 1, “Problems of BLEU”
  4. This amounts to 34% of running unigrams, giving enough space to differ in human judgments and still remain unscored.
    Page 2, “Problems of BLEU”
  5. The lower the BLEU score itself (i.e. fewer confirmed n-grams), the lower the correlation to human judgments regardless of the target language (WMT09 shared task, 2025 sentences per language).
    Page 2, “Problems of BLEU”
  6. For the evaluation of metric correlation with human judgments at the system level, we used the Pearson correlation coefficient ρ applied to ranks. [A sketch of this computation follows this list.]
    Page 4, “Extensions of SemPOS”
  7. The MetricsMATR08 human judgments include preferences for pairs of MT systems saying which one of the two systems is better, while the WMT08 and WMT09 data contain system scores (for up to 5 systems) on the scale 1 to 5 for a given sentence.
    Page 4, “Extensions of SemPOS”
  8. Metrics’ performance for translation to English and Czech was measured on the following testsets (the number of human judgments for a given source language in brackets):
    Page 4, “Extensions of SemPOS”
  9. We assume this is because BLEU4 can capture correctly translated fixed phrases, which is positively reflected in human judgments.
    Page 4, “Extensions of SemPOS”
  10. The error metrics PER and TER showed the lowest correlation with human judgments for translation to Czech.
    Page 5, “Extensions of SemPOS”
  11. This is confirmed for other languages as well: the lower the BLEU score, the lower the correlation to human judgments.
    Page 5, “Conclusion”
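
Item 6 above notes that system-level correlation was measured with Pearson's coefficient applied to ranks, which coincides with Spearman's ρ when there are no ties. A minimal sketch with made-up system scores (the numbers are illustrative only):

```python
from scipy.stats import pearsonr, rankdata

# Hypothetical system-level scores for five MT systems:
metric_scores = [0.142, 0.101, 0.128, 0.090, 0.115]  # e.g. BLEU
human_scores = [0.62, 0.41, 0.55, 0.30, 0.48]        # e.g. human rank score

# Pearson's r computed on the ranks of the scores:
rho, _ = pearsonr(rankdata(metric_scores), rankdata(human_scores))
print(f"rank correlation: {rho:.2f}")  # 1.00 here: the orderings agree
```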


n-grams

Appears in 7 sentences as: n-grams (8)
In Tackling Sparse Data Issue in Machine Translation Evaluation
  1. Total n-grams: 35,531 (1-grams), 33,891 (2-grams), 32,251 (3-grams), 30,611 (4-grams)
    Page 2, “Problems of BLEU”
  2. Table 1: n-grams confirmed by the reference and containing error flags.
    Page 2, “Problems of BLEU”
  3. The suspicious cases are n-grams confirmed by the reference but still containing a flag (false positives) and n-grams not confirmed despite containing no error flag (false negatives). [This classification is sketched in code after this list.]
    Page 2, “Problems of BLEU”
  4. Fortunately, there are relatively few false positives in n-gram based metrics: 6.3% of unigrams and far fewer higher n-grams.
    Page 2, “Problems of BLEU”
  5. 30 to 40% of n-grams do not contain any error and yet they are not confirmed by the reference.
    Page 2, “Problems of BLEU”
  6. The lower the BLEU score itself (i.e. fewer confirmed n-grams), the lower the correlation to human judgments regardless of the target language (WMT09 shared task, 2025 sentences per language).
    Page 2, “Problems of BLEU”
  7. Surprisingly, BLEU-2 performed better than any other n-gram order, for reasons that have yet to be examined.
    Page 5, “Extensions of SemPOS”
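
Items 2 and 3 above define the counting behind Table 1: each hypothesis n-gram is cross-classified by whether the reference confirms it and whether it contains a manually flagged error token. Below is a minimal sketch of that classification; the token and flag representation is an assumption, and reference n-grams are matched as a set rather than with BLEU-style clipped counts, which is a simplification.

```python
def ngrams(tokens, n):
    """All n-grams of the token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def confirmation_counts(hyp_tokens, hyp_flags, ref_tokens, n):
    """Cross-classify hypothesis n-grams: confirmed by the reference
    vs. containing a flagged error token. hyp_flags[i] is True iff
    hypothesis token i carries a manual error flag (assumed format)."""
    ref_ngrams = set(ngrams(ref_tokens, n))
    counts = {
        "confirmed+flagged": 0,    # the false positives of the text
        "unconfirmed+clean": 0,    # the false negatives of the text
        "confirmed+clean": 0,
        "unconfirmed+flagged": 0,
    }
    for i, gram in enumerate(ngrams(hyp_tokens, n)):
        confirmed = gram in ref_ngrams
        flagged = any(hyp_flags[i:i + n])  # any flag inside the n-gram
        key = ("confirmed" if confirmed else "unconfirmed") + \
              ("+flagged" if flagged else "+clean")
        counts[key] += 1
    return counts
```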


BLEU score

Appears in 4 sentences as: BLEU score (4)
In Tackling Sparse Data Issue in Machine Translation Evaluation
  1. We plot the official BLEU score against the rank established as the percentage of sentences where a system ranked no worse than all its competitors (Callison-Burch et al., 2009).
    Page 1, “Problems of BLEU”
  2. Figure 3 documents the issue across languages: the lower the BLEU score itself (i.e. fewer confirmed n-grams), the lower the correlation to human judgments regardless of the target language (WMT09 shared task, 2025 sentences per language).
    Page 2, “Problems of BLEU”
  3. A phrase-based system like Moses (cu-bojar) can sometimes produce a long sequence of tokens exactly as required by the reference, leading to a high BLEU score.
    Page 2, “Problems of BLEU”
  4. This is confirmed for other languages as well: the lower the BLEU score, the lower the correlation to human judgments.
    Page 5, “Conclusion”


unigrams

Appears in 4 sentences as: unigrams (4)
In Tackling Sparse Data Issue in Machine Translation Evaluation
  1. Fortunately, there are relatively few false positives in n-gram based metrics: 6.3% of unigrams and far fewer higher n-grams.
    Page 2, “Problems of BLEU”
  2. This amounts to 34% of running unigrams, giving enough space to differ in human judgments and still remain unscored.
    Page 2, “Problems of BLEU”
  3. For the purposes of the combination, we compute BLEU only on unigrams up to fourgrams (denoted BLEU1, ..., BLEU4) but including the brevity penalty as usual. [A sketch of these restricted BLEU variants follows this list.]
    Page 3, “Extensions of SemPOS”
  4. This is also confirmed by the observation that using BLEU alone is rather unreliable for Czech and BLEU-1 (which judges unigrams only) is even worse.
    Page 5, “Extensions of SemPOS”
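
Item 3 above defines BLEU1, ..., BLEU4 as BLEU restricted to n-gram orders up to 1, ..., 4, with the brevity penalty kept. The sketch below is my single-sentence, single-reference, unsmoothed reading of that definition; the real metric is computed at corpus level.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_k(hyp, ref, k):
    """BLEU over n-gram orders 1..k with the usual brevity penalty
    (sentence-level, single reference, no smoothing -- a sketch)."""
    precisions = []
    for n in range(1, k + 1):
        hyp_c, ref_c = ngram_counts(hyp, n), ngram_counts(ref, n)
        confirmed = sum(min(c, ref_c[g]) for g, c in hyp_c.items())  # clipped
        total = sum(hyp_c.values())
        precisions.append(confirmed / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0  # the unsmoothed geometric mean collapses to zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / k)
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * geo_mean

hyp = "the cat sat on the mat".split()
ref = "the cat was sitting on the mat".split()
print(bleu_k(hyp, ref, 1), bleu_k(hyp, ref, 4))  # BLEU1 vs. BLEU4
```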


n-gram

Appears in 3 sentences as: n-gram (3)
In Tackling Sparse Data Issue in Machine Translation Evaluation
  1. Aside from including dependency and n-gram relations in the scoring, we also apply and evaluate SemPOS for English.
    Page 1, “Introduction”
  2. Table 1 estimates the overall magnitude of this issue: for 1-grams to 4-grams in 1640 instances (different MT outputs and different annotators) of 200 sentences with manually flagged errors, we count how often the n-gram is confirmed by the reference and how often it contains an error flag.
    Page 2, “Problems of BLEU”
  3. Fortunately, there are relatively few false positives in n-gram based metrics: 6.3% of unigrams and far fewer higher n-grams.
    Page 2, “Problems of BLEU”


TER

Appears in 3 sentences as: TER (3)
In Tackling Sparse Data Issue in Machine Translation Evaluation
  1. (Flattened results table; each metric is followed by three correlation figures, sorted by the first:)
     NIST              0.69  0.90   0.53
     SemPOS            0.69  0.95   0.30
     2·SemPOS+1·BLEU4  0.68  0.91   0.09
     BLEU1             0.68  0.87   0.43
     BLEU2             0.68  0.90   0.26
     BLEU3             0.66  0.90   0.14
     BLEU              0.66  0.91   0.20
     TER               0.63  0.87   0.29
     PER               0.63  0.88   0.32
     BLEU4             0.61  0.90  -0.31
     Functorpar        0.57  0.83  -0.03
     Functor           0.55  0.82  -0.09
    Page 5, “Extensions of SemPOS”
  2. The error metrics PER and TER showed the lowest correlation with human judgments for translation to Czech.
    Page 5, “Extensions of SemPOS”
  3. (Flattened results table; each metric is followed by three correlation figures:)
     Functor   0.21  0.40   0.09
     Voidpar   0.16  0.53  -0.08
     PER       0.12  0.53  -0.09
     TER       0.07  0.53  -0.23
    Page 5, “Conclusion”


word order

Appears in 3 sentences as: word order (3)
In Tackling Sparse Data Issue in Machine Translation Evaluation
  1. This focus goes directly against the properties of Czech: relatively free word order allows many permutations of words, and rich morphology renders many valid word forms not confirmed by the reference. These problems are to some extent mitigated if several reference translations are available, but this is often not the case.
    Page 2, “Problems of BLEU”
  2. One of the major drawbacks of SemPOS is that it completely ignores word order. [An overlap sketch illustrating this follows this list.]
    Page 3, “Extensions of SemPOS”
  3. This is too coarse even for languages with relatively free word order like Czech.
    Page 3, “Extensions of SemPOS”
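
Items 2 and 3 above concern SemPOS ignoring word order. As I understand the metric from Kos and Bojar (2009), it overlaps tectogrammatical lemmas grouped by semantic part of speech; the sketch below only approximates that idea, assuming each sentence is a list of (lemma, semantic-POS) pairs. Token positions are never consulted, which is exactly the drawback quoted above.

```python
from collections import Counter

def sempos_overlap(hyp_sents, ref_sents):
    """SemPOS-style overlap: per semantic POS, clipped counts of
    matching (lemma, POS) pairs divided by the reference count,
    averaged over POS classes. Word order plays no role."""
    matched, total = Counter(), Counter()
    for hyp, ref in zip(hyp_sents, ref_sents):
        hyp_c, ref_c = Counter(hyp), Counter(ref)
        for (lemma, pos), cnt in ref_c.items():
            total[pos] += cnt
            matched[pos] += min(cnt, hyp_c[(lemma, pos)])
    overlaps = [matched[pos] / total[pos] for pos in total]
    return sum(overlaps) / len(overlaps) if overlaps else 0.0

# Hypothetical tecto-annotated sentences as (lemma, semantic POS) pairs:
ref = [[("kočka", "n.denot"), ("sedět", "v"), ("rohož", "n.denot")]]
hyp = [[("rohož", "n.denot"), ("kočka", "n.denot"), ("sedět", "v")]]
print(sempos_overlap(hyp, ref))  # 1.0 -- a permuted order scores perfectly
```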
