Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
Liu, Chang and Ng, Hwee Tou

Article Structure

Abstract

In this work, we introduce the TESLA-CELAB metric (Translation Evaluation of Sentences with Linear-programming-based Analysis — Character-level Evaluation for Languages with Ambiguous word Boundaries) for automatic machine translation evaluation.

Introduction

Since the introduction of BLEU (Papineni et al., 2002), automatic machine translation (MT) evaluation has received a lot of research interest.

Motivation

Li et al.

The Algorithm

3.1 Basic Matching

Experiments

In this section, we test the effectiveness of TESLA-CELAB on some real-world English-Chinese translation tasks.

Discussion and Future Work

5.1 Other Languages with Ambiguous Word Boundaries

Conclusion

In this work, we devise a new MT evaluation metric in the family of TESLA (Translation Evaluation of Sentences with Linear-programming-based Analysis), called TESLA-CELAB (Character-level Evaluation for Languages with Ambiguous word Boundaries), to address the problem of fuzzy word boundaries in the Chinese language, although neither the phenomenon nor the method is unique to Chinese.

Topics

BLEU

Appears in 19 sentences as: BLEU (20)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. We show empirically that TESLA-CELAB significantly outperforms character-level BLEU in the English-Chinese translation evaluation tasks.
    Page 1, “Abstract”
  2. Since the introduction of BLEU (Papineni et al., 2002), automatic machine translation (MT) evaluation has received a lot of research interest.
    Page 1, “Introduction”
  3. In the WMT shared tasks, many new-generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments.
    Page 1, “Introduction”
  4. Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human judged translation quality.
    Page 1, “Introduction”
  5. The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 − TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
    Page 2, “Introduction”
  6. 4.3.1 BLEU
    Page 5, “Experiments”
  7. Although word-level BLEU has often been found inferior to the new-generation metrics when the target language is English or other European languages, prior research has shown that character-level BLEU is highly competitive when the target language is Chinese (Li et al., 2011).
    Page 5, “Experiments”
  8. We use character-level BLEU as our main baseline.
    Page 6, “Experiments”
  9. The correlations of character-level BLEU and the average human judgments are shown in the first row of Tables 2 and 3 for the IWSLT and the NIST data set, respectively.
    Page 6, “Experiments”
  10. In addition to character-level BLEU, we also present the correlations for the word-level metric TESLA.
    Page 6, “Experiments”
  11. Compared to BLEU, TESLA allows more sophisticated weighting of n-grams and measures of word similarity including synonym relations.
    Page 6, “Experiments”
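
Character-level BLEU, the main baseline in the quotes above (see quotes 5, 7 and 8), amounts to treating every Chinese character as a token before computing ordinary BLEU. The following sketch illustrates that preprocessing step with NLTK's BLEU implementation; the tokenization helper, the example strings, and the choice of NLTK are illustrative assumptions, not the paper's actual implementation.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def to_chars(sentence):
        # Character-level tokenization: every non-space character is a token.
        return [ch for ch in sentence if not ch.isspace()]

    # Hypothetical reference and candidate translations (placeholders).
    reference = "这 是 一 个 简 单 的 例 子"
    candidate = "这 是 个 例 子"

    # Character-level BLEU with the default 4-gram weights; smoothing avoids
    # zero scores on short segments with no higher-order n-gram matches.
    score = sentence_bleu([to_chars(reference)], to_chars(candidate),
                          smoothing_function=SmoothingFunction().method1)
    print(score)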

word-level

Appears in 15 sentences as: word-level (15)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. For languages such as Chinese where words usually have meaningful internal structure and word boundaries are often fuzzy, TESLA-CELAB acknowledges the advantage of character-level evaluation over word-level evaluation.
    Page 1, “Abstract”
  2. In this work, we attempt to address both of these issues by introducing TESLA-CELAB, a character-level metric that also models word-level linguistic phenomena.
    Page 2, “Motivation”
  3. Although word-level BLEU has often been found inferior to the new-generation metrics when the target language is English or other European languages, prior research has shown that character-level BLEU is highly competitive when the target language is Chinese (Li et al., 2011).
    Page 5, “Experiments”
  4. word-level Character-level
    Page 6, “Experiments”
  5. word-level Character-level
    Page 6, “Experiments”
  6. In addition to character-level BLEU, we also present the correlations for the word-level metric TESLA.
    Page 6, “Experiments”
  7. We use TESLA as a representative of a competitive word-level metric.
    Page 6, “Experiments”
  8. The scores show that word-level TESLA-M has no clear advantage over character-level BLEU, despite its use of linguistic features.
    Page 6, “Experiments”
  9. This gives TESLA-CELAB the ability to detect word-level synonyms and turns it into a linear programming based character-level metric.
    Page 7, “Experiments”
  10. TESLA-M can process word-level synonyms, but does not award character-level matches.
    Page 7, “Experiments”
  11. TESLA-CELAB− and character-level BLEU award character-level matches, but do not consider word-level synonyms.
    Page 7, “Experiments”

n-grams

Appears in 12 sentences as: N-grams (2) n-grams (11)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. For example, between two segmentation variants of the same Chinese phrase, higher-order n-grams that span the differing word boundaries still have no match, and will be penalized accordingly, even though the two variants mean exactly the same thing.
    Page 2, “Motivation”
  2. N-grams which cross natural word boundaries and are meaningless by themselves can be particularly tricky.
    Page 2, “Motivation”
  3. Two n-grams are connected if they are identical, or if they are identified as synonyms by Cilin.
    Page 2, “The Algorithm”
  4. Notice that all n-grams are put in the same matching problem regardless of n, unlike in translation evaluation metrics designed for European languages.
    Page 2, “The Algorithm”
  5. This enables us to designate n-grams with different values of n as synonyms, such as a bigram (n = 2) and a unigram (n = 1).
    Page 2, “The Algorithm”
  6. Two n-grams are considered synonyms if they can be segmented into synonyms that are aligned.
    Page 3, “The Algorithm”
  7. Consequently, due to the maximum covering weights constraint, we can give the following value assignment, implying that all n-grams have been matched.
    Page 4, “The Algorithm”
  8. The recall is a function of Σ_X c_ref(X), and the precision is a function of Σ_Y c_cand(Y), where X ranges over all n-grams of the reference and Y over all n-grams of the candidate translation.
    Page 4, “The Algorithm”
  9. Compared to BLEU, TESLA allows more sophisticated weighting of n-grams and measures of word similarity including synonym relations.
    Page 6, “Experiments”
  10. The covered n-gram matching rule is then able to award tricky n-grams that cross natural word boundaries.
    Page 7, “Experiments”
  11. In the current formulation of TESLA-CELAB, two n-grams X and Y are either synonyms which completely match each other, or are completely unrelated.
    Page 8, “Discussion and Future Work”
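
Several of the quotes above rest on pooling character n-grams of every order into one matching problem, with n capped at four (see the “Chinese word” topic below). Here is a minimal sketch of that extraction step; the function name and the example are mine, not from the paper.

    def char_ngrams(sentence, max_n=4):
        # All character n-grams for n = 1 .. max_n, pooled regardless of n.
        chars = [ch for ch in sentence if not ch.isspace()]
        ngrams = []
        for n in range(1, max_n + 1):
            for i in range(len(chars) - n + 1):
                ngrams.append("".join(chars[i:i + n]))
        return ngrams

    # A 3-character input yields 3 unigrams, 2 bigrams and 1 trigram.
    print(char_ngrams("abc"))   # ['a', 'b', 'c', 'ab', 'bc', 'abc']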

linear programming

Appears in 9 sentences as: linear programming (9)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. By reformulating the problem in the linear programming framework, TESLA-CELAB addresses several drawbacks of the character-level metrics, in particular the modeling of synonyms spanning multiple characters.
    Page 1, “Abstract”
  2. We formulate the n-gram matching process as a real-valued linear programming problem, which can be solved efficiently.
    Page 2, “Motivation”
  3. The linear programming problem is mathematically described as follows.
    Page 3, “The Algorithm”
  4. The linear programming solver may come up with any of the weight assignments that satisfy these constraints.
    Page 3, “The Algorithm”
  5. Instead, we model covered n-gram matching in the linear programming problem itself.
    Page 4, “The Algorithm”
  6. However, the max(·) operator is not allowed in the linear programming formulation.
    Page 4, “The Algorithm”
  7. Returning to our sample problem, the linear programming solver simply needs to assign:
    Page 4, “The Algorithm”
  8. We are also constrained by the linear programming framework, hence we set the objective function as
    Page 4, “The Algorithm”
  9. This gives TESLA-CELAB the ability to detect word-level synonyms and turns it into a linear programming based character-level metric.
    Page 7, “Experiments”
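
The quotes above describe the n-gram matching as a real-valued linear program. The toy sketch below shows the general shape of such a formulation using scipy; it is my own simplification covering only the basic bipartite matching with fractional weights in [0, 1], not the covering weights or the recall/precision objective of the actual metric.

    import numpy as np
    from scipy.optimize import linprog

    # Reference n-grams r0, r1 and candidate n-grams c0, c1.
    # Matching is allowed on the edges (r0,c0), (r0,c1), (r1,c1),
    # i.e. identical n-grams or Cilin synonyms; one weight per edge.
    edges = [(0, 0), (0, 1), (1, 1)]
    n_ref, n_cand = 2, 2

    # Maximize the total matched weight => minimize its negative.
    c = -np.ones(len(edges))

    # Each reference and each candidate n-gram may be matched
    # with total weight at most 1.
    A_ub = np.zeros((n_ref + n_cand, len(edges)))
    for k, (i, j) in enumerate(edges):
        A_ub[i, k] = 1.0              # constraint row for reference i
        A_ub[n_ref + j, k] = 1.0      # constraint row for candidate j
    b_ub = np.ones(n_ref + n_cand)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1), method="highs")
    print(res.x)   # fractional matching weights for the three edges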

n-gram

Appears in 9 sentences as: n-gram (9)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. We formulate the n-gram matching process as a real-valued linear programming problem, which can be solved efficiently.
    Page 2, “Motivation”
  2. The basic n-gram matching problem is shown in Figure 2.
    Page 2, “The Algorithm”
  3. We observe that once an n-gram has been matched, all its sub-n-grams should be considered matched as well. We call this the covered n-gram matching rule.
    Page 3, “The Algorithm”
  4. However, we cannot simply perform covered n-gram matching as a post processing step.
    Page 3, “The Algorithm”
  5. Instead, we model covered n-gram matching in the linear programming problem itself.
    Page 4, “The Algorithm”
  6. On top of the variables already introduced, we add the maximum covering weight variables. Each c(X) represents the maximum w(Y) variable, where n-gram Y completely covers n-gram X.
    Page 4, “The Algorithm”
  7. Based on these synonyms, TESLA-CELAB is able to award less trivial n-gram matches.
    Page 7, “Experiments”
  8. The covered n-gram matching rule is then able to award tricky n-grams that cross natural word boundaries.
    Page 7, “Experiments”
  9. The TESLA-M metric allows each n-gram to have a weight, which is primarily used to discount function words.
    Page 8, “Discussion and Future Work”
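
Quote 3 states the covered n-gram matching rule: once an n-gram is matched, every n-gram it contains counts as matched too. The helper below illustrates just the containment test; the names are mine, and, as quotes 4 and 5 note, the real metric enforces this rule inside the linear program rather than as a post-processing step.

    def sub_ngrams(ngram):
        # All contiguous substrings, i.e. the n-grams covered by this n-gram.
        return {ngram[i:j] for i in range(len(ngram))
                           for j in range(i + 1, len(ngram) + 1)}

    def is_covered(ngram, matched):
        # True if some already-matched n-gram completely covers this one.
        return any(ngram in sub_ngrams(m) for m in matched)

    matched = {"abc"}                  # suppose this trigram was matched
    print(is_covered("bc", matched))   # True: 'bc' lies inside 'abc'
    print(is_covered("cd", matched))   # False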

word segmentation

Appears in 8 sentences as: word segmentation (6) word segmentations (1) word segmenter (1)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. The most obvious challenge for Chinese is that of word segmentation.
    Page 1, “Introduction”
  2. However, many different segmentation standards exist for different purposes, such as Microsoft Research Asia (MSRA) for Named Entity Recognition (NER), Chinese Treebank (CTB) for parsing and part-of-speech (POS) tagging, and City University of Hong Kong (CITYU) and Academia Sinica (AS) for general word segmentation and POS tagging.
    Page 1, “Introduction”
  3. The only prior work attempting to address the problem of word segmentation in automatic MT evaluation for Chinese that we are aware of is Li et al. (2011).
    Page 1, “Introduction”
  4. Character-based metrics do not suffer from errors and differences in word segmentation, so the two segmentation variants would be judged exactly equal.
    Page 2, “Motivation”
  5. We use the Stanford Chinese word segmenter (Tseng et al., 2005) and POS tagger (Toutanova et al., 2003) for preprocessing and Cilin for synonyms.
    Page 6, “Experiments”
  6. Note also that the word segmentations shown in these examples are for clarity only.
    Page 7, “Experiments”
  7. Chinese word segmentation.
    Page 8, “Discussion and Future Work”
  8. TESLA-CELAB does not have a segmentation step, hence it will not introduce word segmentation errors.
    Page 8, “Conclusion”

evaluation metrics

Appears in 6 sentences as: evaluation metric (1) evaluation metrics (5)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. The Workshop on Statistical Machine Translation (WMT) hosts regular campaigns comparing different machine translation evaluation metrics (Callison-Burch et al., 2009; Callison-Burch et al., 2010; Callison-Burch et al., 2011).
    Page 1, “Introduction”
  2. The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 − TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
    Page 2, “Introduction”
  3. Notice that all n-grams are put in the same matching problem regardless of n, unlike in translation evaluation metrics designed for European languages.
    Page 2, “The Algorithm”
  4. This relationship is implicit in the matching problem for English translation evaluation metrics where words are well delimited.
    Page 3, “The Algorithm”
  5. Many prior translation evaluation metrics such as MAXSIM (Chan and Ng, 2008) and TESLA (Liu et al., 2010; Dahlmeier et al., 2011) use the F-0.8 measure as the final score:
    Page 4, “The Algorithm”
  6. In this work, we devise a new MT evaluation metric in the family of TESLA (Translation Evaluation of Sentences with Linear-programming-based Analysis), called TESLA-CELAB (Character-level Evaluation for Languages with Ambiguous word Boundaries), to address the problem of fuzzy word boundaries in the Chinese language, although neither the phenomenon nor the method is unique to Chinese.
    Page 8, “Conclusion”
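
Quote 5 ends just before the F-0.8 formula. In the MAXSIM and TESLA papers it cites, the final score is a recall-weighted F-measure; a reconstruction under that assumption (the exact parameterization should be checked against those papers) is

    F_{0.8} = \frac{P \cdot R}{0.8 \, P + 0.2 \, R}

where P is precision and R is recall. At P = R this measure is four times as sensitive to recall as to precision, which matches the sensitivity ratio stated under the “objective function” topic below.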

human judgments

Appears in 6 sentences as: human judged (1) human judges (1) human judgments (4)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. In the WMT shared tasks, many new-generation metrics, such as METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), and TESLA (Liu et al., 2010) have consistently outperformed BLEU as judged by the correlations with human judgments.
    Page 1, “Introduction”
  2. Some recent research (Liu et al., 2011) has shown evidence that replacing BLEU by a newer metric, TESLA, can improve the human judged translation quality.
    Page 1, “Introduction”
  3. The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 − TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
    Page 2, “Introduction”
  4. The correlations of character-level BLEU and the average human judgments are shown in the first row of Tables 2 and 3 for the IWSLT and the NIST data set, respectively.
    Page 6, “Experiments”
  5. The correlations between the TESLA-CELAB scores and human judgments are shown in the last row of Tables 2 and 3.
    Page 6, “Experiments”
  6. This is probably due to the linguistic characteristics of Chinese, where human judges apparently give equal importance to function words and content words.
    Page 8, “Discussion and Future Work”
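
Quotes 4 and 5 report correlations between metric scores and averaged human judgments. These excerpts do not say which correlation statistic is used; the sketch below, with made-up numbers, simply shows how such per-segment correlations are commonly computed with scipy, using Pearson's r and Kendall's tau as stand-ins.

    from scipy.stats import pearsonr, kendalltau

    # Hypothetical per-segment metric scores and averaged human judgments.
    metric_scores   = [0.31, 0.55, 0.47, 0.62, 0.28]
    human_judgments = [2.0, 3.5, 3.0, 4.0, 1.5]

    r, _   = pearsonr(metric_scores, human_judgments)
    tau, _ = kendalltau(metric_scores, human_judgments)
    print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}")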

machine translation

Appears in 6 sentences as: Machine Translation (1) machine translation (7)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. In this work, we introduce the TESLA-CELAB metric (Translation Evaluation of Sentences with Linear-programming-based Analysis — Character-level Evaluation for Languages with Ambiguous word Boundaries) for automatic machine translation evaluation.
    Page 1, “Abstract”
  2. Since the introduction of BLEU (Papineni et al., 2002), automatic machine translation (MT) evaluation has received a lot of research interest.
    Page 1, “Introduction”
  3. The Workshop on Statistical Machine Translation (WMT) hosts regular campaigns comparing different machine translation evaluation metrics (Callison-Burch et al., 2009; Callison-Burch et al., 2010; Callison-Burch et al., 2011).
    Page 1, “Introduction”
  4. The research on automatic machine translation evaluation is important for a number of reasons.
    Page 1, “Introduction”
  5. Automatic translation evaluation gives machine translation researchers a cheap and reproducible way to guide their research and makes it possible to compare machine translation methods across different studies.
    Page 1, “Introduction”
  6. In addition, machine translation system parameters are tuned by maximizing the automatic scores.
    Page 1, “Introduction”

objective function

Appears in 6 sentences as: Objective Function (1) objective function (6)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. 3.4 The Objective Function
    Page 4, “The Algorithm”
  2. We now define our objective function in terms of the variables.
    Page 4, “The Algorithm”
  3. We are also constrained by the linear programming framework, hence we set the objective function as
    Page 4, “The Algorithm”
  4. We set f = 0.25 so that our objective function is also four times as sensitive to recall as to precision. The value of this objective function is our TESLA-CELAB score.
    Page 5, “The Algorithm”
  5. Similar to the other TESLA metrics, when there are N multiple references, we match the candidate translation against each of them and use the average of the N objective function values as the segment level score.
    Page 5, “The Algorithm”
  6. Accordingly, our objective function is replaced by:
    Page 8, “Discussion and Future Work”
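
Quote 3 ends just before the formula itself. The excerpts pin down only that the objective must be linear (a requirement of the linear programming framework), that it combines precision and recall, and that with f = 0.25 it is four times as sensitive to recall as to precision. One form consistent with those constraints, stated here as an assumption rather than the paper's exact expression, is

    \mathrm{Obj} = \frac{R + f \cdot P}{1 + f}, \qquad f = 0.25 \;\Rightarrow\; \mathrm{Obj} = 0.8\,R + 0.2\,P,

where P and R are the precision and recall terms built from the matching variables. Quote 5 is explicit about multiple references: with N references, the segment-level score is the average of the N per-reference objective values,

    \text{score} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Obj}_i.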

NIST

Appears in 5 sentences as: NIST (5)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 − TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
    Page 2, “Introduction”
  2. Table 1: Inter-judge Kappa for the NIST 2008 English-Chinese task
    Page 5, “Experiments”
  3. 4.2 NIST 2008 English-Chinese MT Task
    Page 5, “Experiments”
  4. The NIST 2008 English-Chinese MT task consists of 127 documents with 1,830 segments, each with four reference translations and eleven automatic MT system translations.
    Page 5, “Experiments”
  5. The correlations of character-level BLEU and the average human judgments are shown in the first row of Tables 2 and 3 for the IWSLT and the NIST data set, respectively.
    Page 6, “Experiments”

POS tags

Appears in 4 sentences as: POS tagger (1) POS tagging (1) POS tags (2)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. However, many different segmentation standards exist for different purposes, such as Microsoft Research Asia (MSRA) for Named Entity Recognition (NER), Chinese Treebank (CTB) for parsing and part-of-speech (POS) tagging, and City University of Hong Kong (CITYU) and Academia Sinica (AS) for general word segmentation and POS tagging.
    Page 1, “Introduction”
  2. However, its use of POS tags and synonym dictionaries prevents its use at the character level.
    Page 6, “Experiments”
  3. We use the Stanford Chinese word segmenter (Tseng et al., 2005) and POS tagger (Toutanova et al., 2003) for preprocessing and Cilin for synonyms.
    Page 6, “Experiments”
  4. We can then award partial scores for related words, such as those identified as such by WordNet or those with the same POS tags.
    Page 8, “Discussion and Future Work”

Chinese word

Appears in 3 sentences as: Chinese word (2) Chinese words (1)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. We use the Stanford Chinese word segmenter (Tseng et al., 2005) and POS tagger (Toutanova et al., 2003) for preprocessing and Cilin for synonyms.
    Page 6, “Experiments”
  2. In all our experiments here we use TESLA-CELAB with n-grams for n up to four, since the vast majority of Chinese words, and therefore synonyms, are at most four characters long.
    Page 6, “Experiments”
  3. Chinese word segmentation.
    Page 8, “Discussion and Future Work”

MT systems

Appears in 3 sentences as: MT system (1) MT systems (2)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. The test set was translated by seven MT systems, and each translation has been manually judged for adequacy and fluency.
    Page 5, “Experiments”
  2. In addition, the translation outputs of the MT systems are also manually ranked according to their translation quality.
    Page 5, “Experiments”
  3. The NIST 2008 English-Chinese MT task consists of 127 documents with 1,830 segments, each with four reference translations and eleven automatic MT system translations.
    Page 5, “Experiments”

significantly outperforms

Appears in 3 sentences as: significantly outperforms (3)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. We show empirically that TESLA-CELAB significantly outperforms character-level BLEU in the English-Chinese translation evaluation tasks.
    Page 1, “Abstract”
  2. The results indicate that TESLA-CELAB significantly outperforms BLEU.
    Page 6, “Experiments”
  3. We show empirically that TESLA-CELAB significantly outperforms the strong baseline of character-level BLEU in two well known English-Chinese MT evaluation data sets.
    Page 8, “Conclusion”

similarity measures

Appears in 3 sentences as: Similarity Measures (1) similarity measures (2)
In Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  1. 5.2 Fractional Similarity Measures
    Page 8, “Discussion and Future Work”
  2. In contrast, the linear-programming based TESLA metric allows fractional similarity measures between 0 (completely unrelated) and 1 (exact synonyms).
    Page 8, “Discussion and Future Work”
  3. Supporting fractional similarity measures is nontrivial in the TESLA-CELAB framework.
    Page 8, “Discussion and Future Work”
