Adaptive HTER Estimation for Document-Specific MT Post-Editing
Huang, Fei and Xu, Jian-Ming and Ittycheriah, Abraham and Roukos, Salim

Article Structure

Abstract

We present an adaptive translation quality estimation (QE) method to predict the human-targeted translation error rate (HTER) for a document-specific machine translation model.

Introduction

Machine translation (MT) systems suffer from an inconsistent and unstable translation quality.

Related Work

There has been a long history of study in confidence estimation of machine translation.

Document-specific MT System

In our MT post-editing setup, we are given documents in the domain of software manuals, technical outlook or customer support materials.

Static MT Quality Estimation

MT quality estimation is typically formulated as a prediction problem: estimating the confidence

Adaptive MT Quality Estimation

The above QE regression model is trained on a portion of the sentences from the input document, and evaluated on the remaining sentences from the same document.

Experiments

In this section, we first discuss experiments that compare the adaptive QE method with the static QE method on a few documents, and then present the results we obtained after deploying the adaptive QE method in an English-to-Japanese MT post-editing project.

Discussion and Conclusion

In this paper we proposed a method to adaptively train a quality estimation model for document-specific MT post editing.

Topics

TER

Appears in 28 sentences as: TER (30) TERs (3)
In Adaptive HTER Estimation for Document-Specific MT Post-Editing
  1. In the rest of the paper, we use TER and HTER interchangeably.
    Page 3, “Static MT Quality Estimation”
  2. To evaluate the effectiveness of the proposed features, we train various classifiers with different feature configurations to predict whether a translation output is useful (with lower TER) as described in the following section.
    Page 4, “Static MT Quality Estimation”
  3. Predicting TER with various input features can be treated as a regression problem.
    Page 4, “Static MT Quality Estimation”
  4. Therefore we also develop algorithms that classify the translation at different levels, depending on whether the TER is less than a given threshold.
    Page 4, “Static MT Quality Estimation”
  5. We compute TER for each sentence using the human correction as the reference.
    Page 4, “Static MT Quality Estimation”
  6. The TER of the whole document is 0.31, which means that roughly 30% of the translation needs to be corrected.
    Page 4, “Static MT Quality Estimation”
  7. In the classification task, our goal is to predict whether a sentence is a Good translation (with TER ≤ 0.1), and label them for human correction.
    Page 4, “Static MT Quality Estimation”
  8. In the test set, there are 46 sentences with TER ≤ 0.1.
    Page 4, “Static MT Quality Estimation”
  9. First, we can see that because the overall TER is around 0.3, predicting all sentences as negative is already a strong baseline: 77%.
    Page 4, “Static MT Quality Estimation”
  10. For the QE regression task, we predict the TER for each sentence translation using the above 26 features.
    Page 5, “Static MT Quality Estimation”
  11. With the same training and test data set up, we predict the TER for each sentence in the test set, and compute the correlation coefficient (r) and root mean square error (RMSE).
    Page 5, “Static MT Quality Estimation”
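
The excerpts above describe computing a sentence-level TER against the human correction and labelling a sentence as Good when its TER falls under a threshold. A minimal sketch of that idea follows. It is a simplification and not the authors' implementation: true TER also allows block shifts, which this word-level edit distance ignores, and the function names `word_edit_distance`, `approx_ter` and `is_good` are hypothetical. Treating the 0.1 threshold as inclusive is likewise an assumption.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete the hypothesis word
                            curr[j - 1] + 1,      # insert the reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_ter(hypothesis: str, reference: str) -> float:
    """Edits divided by reference length. Block shifts are NOT modeled."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)


def is_good(ter: float, threshold: float = 0.1) -> bool:
    """Label a translation as Good when its TER does not exceed the threshold."""
    return ter <= threshold
```

For example, `approx_ter("a b c", "a b c d")` counts one missing word against a four-word reference. Because shifts are not modeled, this sketch can overestimate the score of a translation whose only error is word order.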

See all papers in Proc. ACL 2014 that mention TER.


MT system

Appears in 13 sentences as: MT system (11) MT systems (1) MT system’s (1)
In Adaptive HTER Estimation for Document-Specific MT Post-Editing
  1. Depending on the difficulty of the input sentences (sentence length, OOV words, complex sentence structures and the coverage of the MT system’s training data), some translation outputs can be perfect, while others are ungrammatical, missing important words or even totally garbled.
    Page 1, “Introduction”
  2. This shortcoming is one of the main obstacles for the adoption of MT systems, especially in machine assisted human translation: MT post-editing, where human translators have an option to edit MT proposals or translate from scratch.
    Page 1, “Introduction”
  3. In section 3 we will introduce the document-specific MT system built for post-editing.
    Page 2, “Introduction”
  4. Building a general MT system using all the parallel data not only produces a huge translation model (unless it is pruned very aggressively), but also yields suboptimal performance on the given input document due to the unwanted dominance of out-of-domain data.
    Page 2, “Document-specific MT System”
  5. The document-specific system is built based on sub-sampling: from the parallel corpora we select sentence pairs that are the most similar to the sentences from the input document, then build the MT system with the sub-sampled sentence pairs.
    Page 2, “Document-specific MT System”
  6. However for the post-editing task, we argue that it could also be cast as a classification problem: MT system
    Page 4, “Static MT Quality Estimation”
  7. We build a document-specific MT system to translate this document, then ask a human translator to correct the translation output.
    Page 4, “Static MT Quality Estimation”
  8. The source side of the QE training data Sq is combined with the input document Sd for MT system training data subsampling.
    Page 5, “Adaptive MT Quality Estimation”
  9. Once the document-specific MT system is trained, we use it to translate both the input document and the source QE training data, obtaining the translation Td and
    Page 5, “Adaptive MT Quality Estimation”
  10. As the QE model is adaptively retrained for each document-specific MT system, its prediction is more accurate and consistent.
    Page 5, “Adaptive MT Quality Estimation”
  11. Figure 1 shows the flow of our MT system with the adaptive QE training integrated as part of the build.
    Page 5, “Adaptive MT Quality Estimation”
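
The adaptive flow described in the last few excerpts (combine the QE source sentences Sq with the input document Sd for subsampling, build the document-specific MT system, translate both sets, then train the QE model on TER computed against Rq) can be sketched as a small orchestration function. This is only an illustration under stated assumptions: every callable parameter (`build_mt`, `extract_features`, `compute_ter`, `train_regressor`) is a hypothetical placeholder for a component the excerpts mention but do not specify.

```python
from typing import Callable, List, Sequence, Tuple

Translator = Callable[[str], str]
Predictor = Callable[[List[float]], float]


def adaptive_qe_build(
    doc_sources: Sequence[str],      # S_d: sentences of the input document
    qe_sources: Sequence[str],       # S_q: source side of the fixed QE training set
    qe_references: Sequence[str],    # R_q: references for the QE training set
    build_mt: Callable[[Sequence[str]], Translator],
    extract_features: Callable[[str, str], List[float]],
    compute_ter: Callable[[str, str], float],
    train_regressor: Callable[[List[List[float]], List[float]], Predictor],
) -> List[Tuple[str, float]]:
    """Return (translation, predicted TER) for every sentence of the document."""
    # 1. Subsample and train the document-specific MT system on S_q + S_d.
    translate = build_mt(list(qe_sources) + list(doc_sources))
    # 2. Translate both the input document (T_d) and the QE sources (T_q).
    doc_translations = [translate(s) for s in doc_sources]
    qe_translations = [translate(s) for s in qe_sources]
    # 3. Label T_q with TER against R_q and extract its features.
    labels = [compute_ter(t, r) for t, r in zip(qe_translations, qe_references)]
    feats = [extract_features(s, t) for s, t in zip(qe_sources, qe_translations)]
    # 4. Retrain the QE regression model for this MT system.
    predict = train_regressor(feats, labels)
    # 5. Predict TER for the document translations.
    return [(t, predict(extract_features(s, t)))
            for s, t in zip(doc_sources, doc_translations)]
```

Passing the components in as callables keeps the sketch independent of any particular decoder or learner. The point it illustrates is that the QE labels and features come from the same MT system that translates the document.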


translation model

Appears in 9 sentences as: translation model (5) translation models (4)
In Adaptive HTER Estimation for Document-Specific MT Post-Editing
  1. We present an adaptive translation quality estimation (QE) method to predict the human-targeted translation error rate (HTER) for a document-specific machine translation model.
    Page 1, “Abstract”
  2. First, existing approaches to MT quality estimation rely on lexical and syntactical features defined over parallel sentence pairs, which include source sentences, MT outputs and references, and translation models (Blatz et al., 2004; Ueffing and Ney, 2007; Specia et al., 2009a; Xiong et al., 2010; Soricut and Echihabi, 2010a; Bach et al., 2011).
    Page 1, “Introduction”
  3. Building a general MT system using all the parallel data not only produces a huge translation model (unless it is pruned very aggressively), but also yields suboptimal performance on the given input document due to the unwanted dominance of out-of-domain data.
    Page 2, “Document-specific MT System”
  4. Here we adopt the same strategy, building a document-specific translation model for each input document.
    Page 2, “Document-specific MT System”
  5. derived from a Maximum Entropy translation model (Ittycheriah and Roukos, 2005).
    Page 4, “Static MT Quality Estimation”
  6. Therefore it is necessary to build a QE regression model that’s robust to different document-specific translation models.
    Page 5, “Adaptive MT Quality Estimation”
  7. In a typical MT QE scenario, the QE model is pre-trained and applied to various MT outputs, even though the QE training data and MT outputs are generated from different translation models.
    Page 6, “Experiments”
  8. We train the static QE model with this training set, including the source sentences, references and MT outputs (from multiple translation models).
    Page 6, “Experiments”
  9. To train the adaptive QE model for each test document, we build a translation model whose subsampling data includes source sentences from both the test document and the QE training data.
    Page 6, “Experiments”


F-score

Appears in 8 sentences as: F-score (8)
In Adaptive HTER Estimation for Document-Specific MT Post-Editing
  1. Here we report the precision, recall and F-score of finding such “Good” sentences (with TER ≤ 0.1) on the three documents in Table 3.
    Page 6, “Experiments”
  2. Again, the adaptive QE model produces higher recall, mostly higher precision, and significantly improved F-score.
    Page 6, “Experiments”
  3. The overall F-score of the adaptive QE model is 0.282.
    Page 6, “Experiments”
  4. In Table 5 we present the precision, recall and F-score of the “Good” sentences in the FM and NP categories, similar to those shown in Table 3.
    Page 7, “Experiments”
  5. We consistently observe higher performance on the FM sentences, in terms of precision, recall and F-score.
    Page 7, “Experiments”
  6. The overall F-score is in line with the test set results shown in Table 3.
    Page 7, “Experiments”
  7. Type  Precision  Recall  F-score
     FM    0.71       0.23    0.35
     NP    0.67       0.18    0.29
    Page 9, “Experiments”
  8. With the 26 proposed features derived from the decoding process and source sentence syntactic analysis, the proposed QE model achieved better TER prediction, higher correlation with human correction of MT output, and a higher F-score in finding good translations.
    Page 9, “Discussion and Conclusion”
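
The precision, recall and F-score of finding “Good” sentences, as reported in the excerpts above, can be computed from per-sentence predictions and true labels. The sketch below assumes the balanced F1 measure (harmonic mean of precision and recall), since the excerpts say only “F-score”. The function name `precision_recall_f` is hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f(predicted: Sequence[bool],
                       actual: Sequence[bool]) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-score for the positive ("Good") class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

A model with high precision but low recall, as in the FM and NP rows quoted above, flags few sentences as Good but is usually right when it does.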


sentence pairs

Appears in 7 sentences as: sentence pairs (8)
In Adaptive HTER Estimation for Document-Specific MT Post-Editing
  1. First, existing approaches to MT quality estimation rely on lexical and syntactical features defined over parallel sentence pairs, which include source sentences, MT outputs and references, and translation models (Blatz et al., 2004; Ueffing and Ney, 2007; Specia et al., 2009a; Xiong et al., 2010; Soricut and Echihabi, 2010a; Bach et al., 2011).
    Page 1, “Introduction”
  2. Our parallel corpora include tens of millions of sentence pairs covering a wide range of topics.
    Page 2, “Document-specific MT System”
  3. The document-specific system is built based on sub-sampling: from the parallel corpora we select sentence pairs that are the most similar to the sentences from the input document, then build the MT system with the sub-sampled sentence pairs.
    Page 2, “Document-specific MT System”
  4. From the extracted sentence pairs, we utilize the standard pipeline in SMT system building: word align-
    Page 2, “Document-specific MT System”
  5. The high FM phrases are selected from sentence pairs which are closest in terms of n-gram overlap to the input sentence.
    Page 4, “Static MT Quality Estimation”
  6. Our proposed method is as follows: we select a fixed set of sentence pairs (Sq, Rq) to train the QE model.
    Page 5, “Adaptive MT Quality Estimation”
  7. Another option is to select the sentence pairs from the MT system's subsampled training data, which is more similar to the input document, so the trained QE model could be a better match to the input document.
    Page 9, “Discussion and Conclusion”
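
The sub-sampling step quoted above selects the parallel sentence pairs most similar to the input document. The excerpts do not define the similarity measure used for this step, so the sketch below assumes a simple one: the fraction of a source sentence's n-grams (up to trigrams) that also occur in the document. The names `ngrams` and `subsample` and the parameters `top_k` and `max_n` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], max_n: int = 3) -> Set[Tuple[str, ...]]:
    """All n-grams of length 1..max_n in a token sequence."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def subsample(corpus: List[Tuple[str, str]],
              document: List[str],
              top_k: int,
              max_n: int = 3) -> List[Tuple[str, str]]:
    """Keep the top_k (source, target) pairs whose source best overlaps the document."""
    doc_ngrams: Set[Tuple[str, ...]] = set()
    for sentence in document:
        doc_ngrams |= ngrams(sentence.split(), max_n)

    def overlap(pair: Tuple[str, str]) -> float:
        src = ngrams(pair[0].split(), max_n)
        return len(src & doc_ngrams) / len(src) if src else 0.0

    # sorted() is stable, so ties keep their original corpus order.
    return sorted(corpus, key=overlap, reverse=True)[:top_k]
```

With tens of millions of sentence pairs, a production system would need an index rather than a full scan. The linear scan here is only meant to make the selection criterion concrete.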


translation quality

Appears in 7 sentences as: translation quality (7)
In Adaptive HTER Estimation for Document-Specific MT Post-Editing
  1. We present an adaptive translation quality estimation (QE) method to predict the human-targeted translation error rate (HTER) for a document-specific machine translation model.
    Page 1, “Abstract”
  2. Machine translation (MT) systems suffer from an inconsistent and unstable translation quality.
    Page 1, “Introduction”
  3. It is demonstrated in (Roukos et al., 2012) that document-specific MT models significantly improve the translation quality.
    Page 1, “Introduction”
  4. • The average translation probability of the phrase translation pairs in the final translation, which provides the overall translation quality on the phrase level.
    Page 4, “Static MT Quality Estimation”
  5. The external features capture the syntactic structure of the source sentence, as well as the coverage of the training data with regard to the input sentence, which are good indicators of the translation quality.
    Page 5, “Static MT Quality Estimation”
  6. As seen in Table 4, we do not notice translation quality degradation.
    Page 6, “Experiments”
  7. However, adding such data in the sub-sampling process extracts more bilingual data for building the MT models, which slightly increases the model building time but also improves the translation quality.
    Page 9, “Discussion and Conclusion”


regression model

Appears in 6 sentences as: regression model (6) regression models (1)
In Adaptive HTER Estimation for Document-Specific MT Post-Editing
  1. Soricut and Echihabi (2010b) proposed various regression models to predict the expected BLEU score of a given sentence translation hypothesis.
    Page 2, “Related Work”
  2. We experiment with several models: a linear regression model, a decision-tree-based regression model and an SVM model.
    Page 5, “Static MT Quality Estimation”
  3. Our experiments show that the decision tree-based regression model obtains the highest correlation coefficients (0.53) and lowest RMSE (0.23) in both the training and test sets.
    Page 5, “Static MT Quality Estimation”
  4. The above QE regression model is trained on a portion of the sentences from the input document, and evaluated on the remaining sentences from the same document.
    Page 5, “Adaptive MT Quality Estimation”
  5. Therefore it is necessary to build a QE regression model that’s robust to different document-specific translation models.
    Page 5, “Adaptive MT Quality Estimation”
  6. We compute the TER of Tq using Rq as the reference, and train a QE regression model with the 26 features proposed in section 4.1.
    Page 5, “Adaptive MT Quality Estimation”
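
The regression excerpts above evaluate TER prediction with the correlation coefficient (r) and the root mean square error (RMSE). A minimal sketch of both metrics follows, assuming r denotes the Pearson correlation coefficient. The function names `pearson_r` and `rmse` are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between predicted and true values."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean square error between predicted and true values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))
```

The two metrics are complementary: r measures whether predicted and true TER rise and fall together, while RMSE measures how far the predictions are from the true values in absolute terms.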


maxent

Appears in 4 sentences as: MaxEnt (2) maxent (2)
In Adaptive HTER Estimation for Document-Specific MT Post-Editing
  1. Target part-of-speech and null dependency link are exploited in a MaxEnt classifier to improve the MT quality estimation (Xiong et al., 2010).
    Page 2, “Related Work”
  2. ment (HMM (Vogel et al., 1996) and MaxEnt (Ittycheriah and Roukos, 2005) alignment models), phrase pair extraction, MT model training (Ittycheriah and Roukos, 2007) and LM model training.
    Page 3, “Document-specific MT System”
  3. • 17 decoding features, including phrase translation probabilities (source-to-target and target-to-source), word translation probabilities (also in both directions), maxent probabilities¹, word count, phrase count, distor-
    Page 3, “Static MT Quality Estimation”
  4. ¹ The maxent probability is the translation probability
    Page 3, “Static MT Quality Estimation”


model training

Appears in 4 sentences as: model training (4) models trained (1)
In Adaptive HTER Estimation for Document-Specific MT Post-Editing
  1. ment (HMM (Vogel et al., 1996) and MaxEnt (Ittycheriah and Roukos, 2005) alignment models), phrase pair extraction, MT model training (Ittycheriah and Roukos, 2007) and LM model training.
    Page 3, “Document-specific MT System”
  2. Figure 2: Correlation coefficient r between predicted TER (x-axis) and true TER (y-axis) for QE models trained from the same document (top figure) or a different document (bottom figure).
    Page 5, “Adaptive MT Quality Estimation”
  3. As our MT model training data include proprietary data, the MT performance is significantly better than publicly available MT software.
    Page 6, “Experiments”
  4. However, the QE model training data is no longer constant.
    Page 9, “Discussion and Conclusion”


error rate

Appears in 3 sentences as: error rate (3)
In Adaptive HTER Estimation for Document-Specific MT Post-Editing
  1. We present an adaptive translation quality estimation (QE) method to predict the human-targeted translation error rate (HTER) for a document-specific machine translation model.
    Page 1, “Abstract”
  2. In this paper we propose an adaptive quality estimation that predicts sentence-level human-targeted translation error rate (HTER) (Snover et al., 2006) for a document-specific MT post-editing system.
    Page 1, “Introduction”
  3. score or translation error rate of the translated sentences or documents based on a set of features.
    Page 3, “Static MT Quality Estimation”


machine translation

Appears in 3 sentences as: Machine translation (1) machine translation (2)
In Adaptive HTER Estimation for Document-Specific MT Post-Editing
  1. We present an adaptive translation quality estimation (QE) method to predict the human-targeted translation error rate (HTER) for a document-specific machine translation model.
    Page 1, “Abstract”
  2. Machine translation (MT) systems suffer from an inconsistent and unstable translation quality.
    Page 1, “Introduction”
  3. There has been a long history of study in confidence estimation of machine translation.
    Page 2, “Related Work”


translation probabilities

Appears in 3 sentences as: translation probabilities (2) translation probability (2)
In Adaptive HTER Estimation for Document-Specific MT Post-Editing
  1. • 17 decoding features, including phrase translation probabilities (source-to-target and target-to-source), word translation probabilities (also in both directions), maxent probabilities¹, word count, phrase count, distor-
    Page 3, “Static MT Quality Estimation”
  2. ¹ The maxent probability is the translation probability
    Page 3, “Static MT Quality Estimation”
  3. • The average translation probability of the phrase translation pairs in the final translation, which provides the overall translation quality on the phrase level.
    Page 4, “Static MT Quality Estimation”
