Error Detection for Statistical Machine Translation Using Linguistic Features
Xiong, Deyi and Zhang, Min and Li, Haizhou

Article Structure

Abstract

Automatic error detection is desired in the postprocessing to improve machine translation quality.

Introduction

Translation hypotheses generated by a statistical machine translation (SMT) system always contain both correct parts (e.g. …

Related Work

In this section, we present an overview of confidence estimation (CE) for machine translation at the word level.

Features

We explore two sets of linguistic features for each word in a machine generated translation hypothesis.

Error Detection with a Maximum Entropy Model

As mentioned before, we consider error detection as a binary classification task.

SMT System

To obtain machine-generated translation hypotheses for our error detection, we use a state-of-the-art phrase-based machine translation system MOSES (Koehn et al., 2003; Koehn et al., 2007).

Experiments

We conducted our experiments at several levels.

Conclusions and Future Work

In this paper, we have presented a maximum entropy based approach to automatically detect errors in translation hypotheses generated by SMT …

Topics

development set

Appears in 9 sentences as: development set (9)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. 3) Divide words into two groups (correct translations and errors) by using a classification threshold optimized on a development set.
    Page 1, “Introduction”
  2. … directly output prediction results from our discriminatively trained classifier without optimizing a classification threshold on a distinct development set beforehand. Most previous approaches make decisions based on a pre-tuned classification threshold τ as follows …
    Page 3, “Related Work”
  3. This does not mean we do not need a development set.
    Page 3, “Features”
  4. We do validate our feature selection and other experimental settings on the development set.
    Page 3, “Features”
  5. We optimize the discrete factor on our development set and find the optimal value is 1.
    Page 4, “Features”
  6. To avoid overfitting, we optimize the Gaussian prior on the development set.
    Page 4, “Error Detection with a Maximum Entropy Model”
  7. For minimum error rate tuning (Och, 2003), we use NIST MT-02 as the development set for the translation task.
    Page 5, “SMT System”
  8. We find that the LG parser cannot fully parse 560 sentences (63.8%) in the training set (MT-02), 731 sentences (67.6%) in the development set (MT-05) and 660 sentences (71.8%) in the test set (MT-03).
    Page 5, “Experiments”
  9. To compare with previous work using word posterior probabilities for confidence estimation, we carried out experiments using wpp estimated from N-best lists with the classification threshold τ, which was optimized on our development set to minimize CER.
    Page 6, “Experiments”
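
Entries 1, 2, and 9 above all refer to the same procedure: sweep a classification threshold τ over word confidence scores on the development set and keep the value that minimizes classification error rate (CER). A minimal sketch of that tuning loop, with all function and variable names ours rather than the paper's:

    # Hypothetical sketch of threshold tuning on a development set: sweep tau
    # and keep the value whose hard decisions minimize classification error
    # rate (CER). All names are illustrative, not from the paper.

    def cer(predictions, gold):
        """Classification error rate: fraction of words labeled wrongly."""
        return sum(p != g for p, g in zip(predictions, gold)) / len(gold)

    def tune_threshold(wpp_scores, gold_labels, steps=100):
        """Pick tau so that 'correct iff wpp > tau' minimizes CER on dev data."""
        best_tau, best_cer = 0.0, float("inf")
        for k in range(steps + 1):
            tau = k / steps
            preds = ["correct" if s > tau else "incorrect" for s in wpp_scores]
            err = cer(preds, gold_labels)
            if err < best_cer:
                best_tau, best_cer = tau, err
        return best_tau, best_cer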

machine translation

Appears in 7 sentences as: machine translation (7)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. Automatic error detection is desired in the postprocessing to improve machine translation quality.
    Page 1, “Abstract”
  2. We propose to incorporate two groups of linguistic features, which convey information from outside machine translation systems, into error detection: lexical and syntactic features.
    Page 1, “Abstract”
  3. Translation hypotheses generated by a statistical machine translation (SMT) system always contain both correct parts (e.g. …
    Page 1, “Introduction”
  4. Automatically distinguishing incorrect parts from correct parts is therefore very desirable not only for post-editing and interactive machine translation (Ueffing and Ney, 2007) but also for SMT itself: either by rescoring hypotheses in the N-best list using the probability of correctness calculated for each hypothesis (Zens and Ney, 2006) or by generating new hypotheses using N-best lists from one SMT system or multiple systems …
    Page 1, “Introduction”
  5. In this section, we present an overview of confidence estimation (CE) for machine translation at the word level.
    Page 2, “Related Work”
  6. To obtain machine-generated translation hypotheses for our error detection, we use a state-of-the-art phrase-based machine translation system MOSES (Koehn et al., 2003; Koehn et al., 2007).
    Page 4, “SMT System”
  7. Therefore our approach can be used for other machine translation systems, such as rule-based or example-based systems, which generally do not produce N-best lists.
    Page 8, “Conclusions and Future Work”

MaxEnt

Appears in 7 sentences as: MaxEnt (7)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. We integrate two sets of linguistic features into a maximum entropy (MaxEnt) model and develop a MaxEnt-based binary classifier to predict the category (correct or incorrect) for each word in a generated target sentence.
    Page 2, “Introduction”
  2. We tune our model feature weights using an off-the-shelf MaxEnt toolkit (Zhang, 2004).
    Page 4, “Error Detection with a Maximum Entropy Model”
  3. During test, if the probability p(correct|ψ) is larger than p(incorrect|ψ) according to the trained MaxEnt model, the word is labeled as correct, otherwise incorrect.
    Page 4, “Error Detection with a Maximum Entropy Model”
  4. Starting with MaxEnt models with a single linguistic feature or word posterior probability based feature, we incorporated additional features incrementally by combining features together.
    Page 5, “Experiments”
  5. We conducted three groups of experiments using the MaxEnt based error detection model with various feature combinations.
    Page 6, “Experiments”
  6. Using discrete word posterior probabilities as features in the MaxEnt based error detection model is marginally better than word posterior probability thresholding in terms of CER, but obtains a 13.79% relative improvement in F measure.
    Page 6, “Experiments”
  7. Figure 3 shows CERs for the feature combination MaxEnt (dwpp + wd + pos + link) when the number of training sentences is enlarged incrementally.
    Page 7, “Experiments”
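
A minimal sketch of the train/predict cycle described in entries 1-3, using scikit-learn's LogisticRegression as a stand-in for Zhang's (2004) MaxEnt toolkit (binary logistic regression is the two-class maximum entropy model); the feature dictionaries and values below are purely illustrative:

    # Stand-in sketch for the MaxEnt classifier: scikit-learn's binary
    # logistic regression is mathematically the two-class MaxEnt model.
    # Feature dicts are illustrative; dwpp is a discretized bucket label,
    # in the spirit of the paper's discretized word posterior probabilities.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    train_feats = [{"wd": "house", "pos": "NN", "dwpp": "0.8"},
                   {"wd": "of",    "pos": "IN", "dwpp": "0.2"}]
    train_labels = ["correct", "incorrect"]

    vec = DictVectorizer()
    X = vec.fit_transform(train_feats)
    clf = LogisticRegression(C=1.0)  # C plays the role of the Gaussian prior
    clf.fit(X, train_labels)

    # Decision rule from entry 3: label a word 'correct' iff
    # p(correct|psi) > p(incorrect|psi), i.e. simply take the argmax class.
    test = vec.transform([{"wd": "house", "pos": "NN", "dwpp": "0.8"}])
    print(clf.predict(test)[0])

As entry 3 stresses, the label comes from comparing the two class probabilities directly, so no separate classification threshold needs to be tuned.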

SMT system

Appears in 7 sentences as: SMT system (4) SMT systems (3)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. Automatically distinguishing incorrect parts from correct parts is therefore very desirable not only for post-editing and interactive machine translation (Ueffing and Ney, 2007) but also for SMT itself: either by rescoring hypotheses in the N-best list using the probability of correctness calculated for each hypothesis (Zens and Ney, 2006) or by generating new hypotheses using N-best lists from one SMT system or multiple systems …
    Page 1, “Introduction”
  2. 1) Calculate features that express the correctness of words either based on the SMT model (e.g. translation/language model) or based on SMT system output (e.g. …
    Page 1, “Introduction”
  3. However, it is not adequate to just use these features as discussed in (Shi and Zhou, 2005) because the information that they carry is either from the inner components of SMT systems or from system outputs.
    Page 1, “Introduction”
  4. To some extent, it has already been considered by SMT systems.
    Page 1, “Introduction”
  5. … sources from outside SMT systems is desired for error detection.
    Page 2, “Introduction”
  6. In Section 5, we describe the SMT system which we use to generate translation hypotheses.
    Page 2, “Introduction”
  7. Given a source sentence f, let {e_n}_{n=1}^{N} be the N-best list generated by an SMT system, and let e_{n,i} be the i-th word in e_n.
    Page 3, “Features”
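
Entry 7 fixes the N-best notation from which word posterior probabilities (wpp) are computed. A simplified, fixed-position sketch follows; practical implementations such as Ueffing and Ney (2007) align hypotheses (e.g., by edit distance) before summing, and would work in log space to avoid underflow, refinements omitted here:

    import math

    # Simplified fixed-position word posterior probability over an N-best
    # list: the probability mass of hypotheses that put `word` at position
    # i, normalized over all hypotheses. Names are ours, not the paper's.

    def word_posterior(nbest, i, word):
        """nbest: list of (hypothesis_word_list, model_log_score) pairs."""
        total = sum(math.exp(score) for _, score in nbest)
        mass = sum(math.exp(score) for words, score in nbest
                   if i < len(words) and words[i] == word)
        return mass / total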

feature vector

Appears in 6 sentences as: feature vector (6)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. To formalize this task, we use a feature vector ψ to represent a word w in question, and a binary variable c to indicate whether this word is correct or not.
    Page 4, “Error Detection with a Maximum Entropy Model”
  2. In the feature vector, we look at 2 words before and 2 words after the current word position (w_{-2}, w_{-1}, w, w_{1}, w_{2}).
    Page 4, “Error Detection with a Maximum Entropy Model”
  3. We collect features {wd, pos, link, dwpp} for each word among these words and combine them into the feature vector ψ for w.
    Page 4, “Error Detection with a Maximum Entropy Model”
  4. As such, we want the feature vector to capture the contextual environment, e.g., the pos sequence pattern and the syntactic pattern, in which the word w occurs.
    Page 4, “Error Detection with a Maximum Entropy Model”
  5. For classification, we employ the maximum entropy model (Berger et al., 1996) to predict whether a word w is correct or incorrect given its feature vector ψ.
    Page 4, “Error Detection with a Maximum Entropy Model”
  6. where f_i is a binary model feature defined on c and the feature vector ψ, and θ_i is the weight of f_i.
    Page 4, “Error Detection with a Maximum Entropy Model”
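
A sketch of how entries 2 and 3 combine into a concrete feature map: collect {wd, pos, link, dwpp} for the word and its two neighbors on each side. The template names and the boundary padding token are our choices, not the paper's exact templates:

    # Hypothetical sketch of building the feature vector psi for the word at
    # position j from a two-word window on each side.

    def extract_features(words, pos_tags, links, dwpps, j):
        """Collect {wd, pos, link, dwpp} for positions j-2 .. j+2."""
        feats = {}
        for offset in range(-2, 3):
            k = j + offset
            if 0 <= k < len(words):
                feats[f"wd_{offset}"] = words[k]
                feats[f"pos_{offset}"] = pos_tags[k]
                feats[f"link_{offset}"] = links[k]
                feats[f"dwpp_{offset}"] = dwpps[k]
            else:
                feats[f"wd_{offset}"] = "<s>"  # boundary padding (our choice)
        return feats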

binary classification

Appears in 4 sentences as: binary classification (2) binary classifier (1) binary classifiers (1)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. Sometimes step 2) is not necessary if only one effective feature is used (Ueffing and Ney, 2007); and sometimes steps 2) and 3) can be merged into a single step if we directly output prediction results from binary classifiers instead of making a thresholding decision.
    Page 1, “Introduction”
  2. We integrate two sets of linguistic features into a maximum entropy (MaxEnt) model and develop a MaxEnt-based binary classifier to predict the category (correct or incorrect) for each word in a generated target sentence.
    Page 2, “Introduction”
  3. We treat error detection as a complete binary classification problem.
    Page 2, “Related Work”
  4. As mentioned before, we consider error detection as a binary classification task.
    Page 4, “Error Detection with a Maximum Entropy Model”

BLEU

Appears in 4 sentences as: BLEU (4)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. The performance, in terms of BLEU (Papineni et al., 2002) score, is shown in Table 4.
    Page 5, “SMT System”
  2. Corpus | BLEU (%) | RCW (%) (header row of Table 4)
    Page 6, “Experiments”
  3. Table 4: Case-insensitive BLEU score and ratio of correct words (RCW) on the training, development and test corpus.
    Page 6, “Experiments”
  4. Table 4 shows the case-insensitive BLEU score and the percentage of words that are labeled as correct according to the method described above on the training, development and test corpus.
    Page 6, “Experiments”

error rate

Appears in 4 sentences as: error rate (4)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. The experimental results show that 1) linguistic features alone outperform word posterior probability based confidence estimation in error detection; and 2) linguistic features can further provide complementary information when combined with word confidence scores, which collectively reduce the classification error rate by 18.52% and improve the F measure by 16.37%.
    Page 1, “Abstract”
  2. For minimum error rate tuning (Och, 2003), we use NIST MT-02 as the development set for the translation task.
    Page 5, “SMT System”
  3. To determine the true class of a word in a generated translation hypothesis, we follow (Blatz et al., 2003) to use the word error rate (WER).
    Page 5, “Experiments”
  4. To evaluate the overall performance of the error detection, we use the commonly used metric, classification error rate (CER), to evaluate our classifiers.
    Page 6, “Experiments”
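
Entry 3 determines gold labels with the word error rate, following Blatz et al. (2003): under the usual reading, a hypothesis word is tagged correct iff the Levenshtein (WER) alignment matches it to an identical reference word. A self-contained sketch of that labeling, ignoring details such as multiple references:

    # Hedged sketch of WER-based gold labeling: compute a Levenshtein
    # alignment between hypothesis and reference, then tag a hypothesis
    # word 'correct' iff it is matched to an identical reference word.

    def wer_labels(hyp, ref):
        n, m = len(hyp), len(ref)
        # d[i][j] = edit distance between hyp[:i] and ref[:j]
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if hyp[i - 1] == ref[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # match / substitution
        # Trace back through the alignment to label each hypothesis word.
        labels = ["incorrect"] * n
        i, j = n, m
        while i > 0 and j > 0:
            if hyp[i - 1] == ref[j - 1] and d[i][j] == d[i - 1][j - 1]:
                labels[i - 1] = "correct"
                i, j = i - 1, j - 1
            elif d[i][j] == d[i - 1][j - 1] + 1:
                i, j = i - 1, j - 1
            elif d[i][j] == d[i - 1][j] + 1:
                i -= 1
            else:
                j -= 1
        return labels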

maximum entropy

Appears in 4 sentences as: maximum entropy (4)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. We use a maximum entropy classifier to predict translation errors by integrating word posterior probability feature and linguistic features.
    Page 1, “Abstract”
  2. We integrate two sets of linguistic features into a maximum entropy (MaxEnt) model and develop a MaxEnt-based binary classifier to predict the category (correct or incorrect) for each word in a generated target sentence.
    Page 2, “Introduction”
  3. For classification, we employ the maximum entropy model (Berger et al., 1996) to predict whether a word w is correct or incorrect given its feature vector ψ.
    Page 4, “Error Detection with a Maximum Entropy Model”
  4. In this paper, we have presented a maximum entropy based approach to automatically detect errors in translation hypotheses generated by SMT …
    Page 7, “Conclusions and Future Work”
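
The fragments quoted in entry 3 here and under "feature vector" above (binary features f_i over the class c and the feature vector ψ, with weights θ_i) refer to the model equation without showing it. Reconstructed in the standard conditional MaxEnt form of Berger et al. (1996), it reads:

    p(c \mid \psi) = \frac{\exp\big(\sum_i \theta_i f_i(c, \psi)\big)}{\sum_{c'} \exp\big(\sum_i \theta_i f_i(c', \psi)\big)}

The decision rule quoted under "MaxEnt" above is then simply the argmax over the two classes, c* = argmax_c p(c | ψ).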

NIST

Appears in 4 sentences as: NIST (4)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. The translation task is on the official NIST Chinese-to-English evaluation data.
    Page 4, “SMT System”
  2. For minimum error rate tuning (Och, 2003), we use NIST MT-02 as the development set for the translation task.
    Page 5, “SMT System”
  3. In order to calculate word posterior probabilities, we generate 10,000-best lists for NIST MT-02/03/05 respectively.
    Page 5, “SMT System”
  4. For the error detection task, we use the best translation hypotheses of NIST MT-02/05/03 generated by MOSES as our training, development, and test corpus respectively.
    Page 5, “Experiments”

translation task

Appears in 4 sentences as: translation task (4)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. The translation task is on the official NIST Chinese-to-English evaluation data.
    Page 4, “SMT System”
  2. Table 2 shows the corpora that we use for the translation task.
    Page 4, “SMT System”
  3. Table 2: Training corpora for the translation task.
    Page 5, “SMT System”
  4. For minimum error rate tuning (Och, 2003), we use NIST MT-02 as the development set for the translation task.
    Page 5, “SMT System”

language model

Appears in 3 sentences as: language model (4)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. (2009) study several confidence features based on mutual information between words and n-gram and backward n-gram language model for word-level and sentence-level CE.
    Page 2, “Related Work”
  2. To some extent, these two features have a similar function to a target language model or a pos-based target language model.
    Page 3, “Features”
  3. We build a four-gram language model using the SRILM toolkit (Stolcke, 2002), which is trained …
    Page 4, “SMT System”

translation systems

Appears in 3 sentences as: translation system (1) translation systems (2)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. We propose to incorporate two groups of linguistic features, which convey information from outside machine translation systems, into error detection: lexical and syntactic features.
    Page 1, “Abstract”
  2. To obtain machine-generated translation hypotheses for our error detection, we use a state-of-the-art phrase-based machine translation system MOSES (Koehn et al., 2003; Koehn et al., 2007).
    Page 4, “SMT System”
  3. Therefore our approach can be used for other machine translation systems, such as rule-based or example-based systems, which generally do not produce N-best lists.
    Page 8, “Conclusions and Future Work”

word-level

Appears in 3 sentences as: word-level (3)
In Error Detection for Statistical Machine Translation Using Linguistic Features
  1. In Section 2, we review the previous work on word-level confidence estimation which is used for error detection.
    Page 2, “Introduction”
  2. Ueffing and Ney (2007) exhaustively explore various word-level confidence measures to label each word in a generated translation hypothesis as correct or incorrect.
    Page 2, “Related Work”
  3. (2009) study several confidence features based on mutual information between words and n-gram and backward n-gram language model for word-level and sentence-level CE.
    Page 2, “Related Work”
