Distortion Model Considering Rich Context for Statistical Machine Translation
Goto, Isao and Utiyama, Masao and Sumita, Eiichiro and Tamura, Akihiro and Kurohashi, Sadao

Article Structure

Abstract

This paper proposes new distortion models for phrase-based SMT.

Introduction

Estimating appropriate word order in a target language is one of the most difficult problems for statistical machine translation (SMT).

Distortion Model for Phrase-Based SMT

A Moses-style phrase-based SMT generates target hypotheses sequentially from left to right.

Proposed Method

In this section, we first define our distortion model and explain our learning strategy.

Experiment

In order to confirm the effects of our distortion model, we conducted a series of Japanese to English (JE) and Chinese to English (CE) translation experiments.

Related Works

We discuss related work other than that discussed in Section 2.

Conclusion

This paper described our distortion models for phrase-based SMT.

Topics

word order

Appears in 22 sentences as: word order (19), word orders (3)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. It enables our model to learn the effect of relative word order among NP candidates as well as to learn the effect of distances from the training data.
    Page 1, “Abstract”
  2. Estimating appropriate word order in a target language is one of the most difficult problems for statistical machine translation (SMT).
    Page 1, “Introduction”
  3. This is particularly true when translating between languages with widely different word orders.
    Page 1, “Introduction”
  4. It enables our model to learn the effect of relative word order among NP candidates as well as to learn the effect of distances from the training data.
    Page 1, “Introduction”
  5. This estimation is used to produce a hypothesis in the target-language word order, sequentially from left to right.
    Page 1, “Introduction”
  6. One of the reasons for this difference is the relative word order between words.
    Page 3, “Distortion Model for Phrase-Based SMT”
  7. Thus, considering relative word order is important.
    Page 3, “Distortion Model for Phrase-Based SMT”
  8. In (d) and (e) in Figure 2, the word kare at the CP and the word order between katta and karita are the same.
    Page 3, “Distortion Model for Phrase-Based SMT”
  9. In summary, in order to estimate the NP, the following should be considered simultaneously: the word at the NP, the word at the CP, the relative word order among the NPCs, the words surrounding NP and CP (context), and the words between the CP and the NPC.
    Page 3, “Distortion Model for Phrase-Based SMT”
  10. The MSD lexical reordering model (Tillman, 2004; Koehn et al., 2005; Galley and Manning, 2008) only calculates probabilities for the three kinds of phrase reorderings (monotone, swap, and discontinuous), and does not consider relative word order or words between the CP and the NPC.
    Page 3, “Distortion Model for Phrase-Based SMT”
  11. However, their model did not use context, relative word order, or words between the CP and the NPC.
    Page 3, “Distortion Model for Phrase-Based SMT”
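
The excerpts above (especially item 9) enumerate the signals to weigh when choosing the next source position (NP) from its candidates (NPCs): the words at the current position (CP) and at each candidate, their relative order, surrounding context, and the words in between. A minimal sketch of gathering those signals for one candidate; the function and field names are mine, not the paper's:

```python
# Illustrative sketch (not the paper's code): collect the context signals
# listed in excerpt 9 for one (CP, NPC) position pair.
def extract_signals(words, cp, npc, window=2):
    """Gather the cues a rich-context distortion model could condition on
    when judging whether `npc` is a good next position after `cp`."""
    lo, hi = (cp, npc) if cp < npc else (npc, cp)
    return {
        "word_at_cp": words[cp],
        "word_at_npc": words[npc],
        "relative_order": "right" if npc > cp else "left",
        "context_cp": words[max(0, cp - window): cp + window + 1],
        "context_npc": words[max(0, npc - window): npc + window + 1],
        "words_between": words[lo + 1: hi],
    }

if __name__ == "__main__":
    sent = "kare wa hon wo katta".split()
    print(extract_signals(sent, cp=0, npc=4))
```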

phrase-based

Appears in 18 sentences as: Phrase-based (2), phrase-based (17)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. This paper proposes new distortion models for phrase-based SMT.
    Page 1, “Abstract”
  2. To address this problem, there has been a lot of research done into word reordering: lexical reordering model (Tillman, 2004), which is one of the distortion models, reordering constraint (Zens et al., 2004), pre-ordering (Xia and McCord, 2004), hierarchical phrase-based SMT (Chiang, 2007), and syntax-based SMT (Yamada and Knight, 2001).
    Page 1, “Introduction”
  3. Phrase-based SMT (Koehn et al., 2007) is a widely used SMT method that does not use a parser.
    Page 1, “Introduction”
  4. Phrase-based SMT mainly estimates word reordering using distortion models.
    Page 1, “Introduction”
  5. Therefore, distortion models are one of the most important components for phrase-based SMT.
    Page 1, “Introduction”
  6. On the other hand, there are methods other than distortion models for improving word reordering for phrase-based SMT, such as pre-ordering or reordering constraints.
    Page 1, “Introduction”
  7. However, these methods also use distortion models when translating by phrase-based SMT.
    Page 1, “Introduction”
  8. If there is a good distortion model, it will improve the translation quality of phrase-based SMT and benefit the methods that use distortion models.
    Page 1, “Introduction”
  9. In this paper, we propose two distortion models for phrase-based SMT.
    Page 1, “Introduction”
  10. In this paper, reordering models for phrase-based SMT, which are intended to estimate the source word position to be translated next in decoding, are called distortion models.
    Page 1, “Introduction”
  11. A Moses-style phrase-based SMT generates target hypotheses sequentially from left to right.
    Page 2, “Distortion Model for Phrase-Based SMT”
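
For contrast with what the excerpts above motivate: the default distortion cost in a Moses-style decoder is a pure distance penalty with no lexical or contextual information. A minimal sketch of that baseline (the function name is mine):

```python
# The standard linear distortion penalty of Moses-style phrase-based SMT:
# the jump distance between the end of the previously translated source
# phrase and the start of the next one, |start_i - end_{i-1} - 1|.
def linear_distortion(prev_phrase_end, next_phrase_start):
    """Zero for monotone continuation, growing with the jump size."""
    return abs(next_phrase_start - prev_phrase_end - 1)

assert linear_distortion(prev_phrase_end=3, next_phrase_start=4) == 0  # monotone
assert linear_distortion(prev_phrase_end=3, next_phrase_start=9) == 5  # forward jump
assert linear_distortion(prev_phrase_end=3, next_phrase_start=0) == 4  # backward jump
```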

feature templates

Appears in 11 sentences as: feature template (3), Feature templates (1), feature templates (9)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. Table 1: Feature templates.
    Page 4, “Proposed Method”
  2. Table 1 shows the feature templates used to produce the features.
    Page 4, “Proposed Method”
  3. A feature is an instance of a feature template.
    Page 4, “Proposed Method”
  4. We conjoined all the feature templates in Table 1 with an additional feature template ⟨l_i, l_j⟩ to include the labels into features, where l_i is the label corresponding to the position of i.
    Page 5, “Proposed Method”
  5. Note that in the feature templates in Table 1, i and j are used to specify two positions.
    Page 6, “Proposed Method”
  6. were the ones that had been counted, using the feature templates in Table 1, at least four times for all of the (i, j) position pairs in the training sentences.
    Page 7, “Experiment”
  7. We conjoined the features with three types of label pairs, ⟨C, I⟩, ⟨I, N⟩, or ⟨C, N⟩, as instances of the feature template ⟨l_i, l_j⟩ to produce features for SEQUENCE.
    Page 7, “Experiment”
  8. We used the following feature templates to produce features for the outbound model: ⟨s_{i-2}⟩, ⟨s_{i-1}⟩, ⟨s_i⟩, ⟨s_{i+1}⟩, ⟨s_{i+2}⟩, ⟨t_i⟩, ⟨t_{i-1}, t_i⟩, ⟨t_i, t_{i+1}⟩, and ⟨s_i, t_i⟩.
    Page 7, “Experiment”
  9. These feature templates correspond to the components of the feature templates of our distortion models.
    Page 7, “Experiment”
  10. For the inbound model, i of the feature templates was changed to j.
    Page 7, “Experiment”
  11. When we counted features for selection, we only counted features that were from the feature templates of ⟨s_i, s_j⟩, ⟨t_i, t_j⟩, ⟨s_i, t_i, t_j⟩, and ⟨s_j, t_i, t_j⟩ in Table 1 when j was not the NP, in order to avoid increasing the number of features.
    Page 7, “Experiment”
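
A minimal sketch of how feature instances could be produced from templates such as those in Table 1 and then conjoined with a label pair ⟨l_i, l_j⟩, as the excerpts above describe. The template inventory and all names here are simplified assumptions, not the paper's exact set:

```python
# Illustrative sketch: a feature is an instance of a feature template,
# i.e. the template name plus the concrete values at positions i and j.
def instantiate(templates, s, t, i, j):
    """s: surface words; t: their word classes (assumed). Returns one
    feature string per template."""
    values = {"si": s[i], "sj": s[j], "ti": t[i], "tj": t[j]}
    return [name + "=" + "|".join(values[a] for a in atoms)
            for name, atoms in templates]

TEMPLATES = [
    ("s_i,s_j", ("si", "sj")),
    ("t_i,t_j", ("ti", "tj")),
    ("s_i,t_i,t_j", ("si", "ti", "tj")),
    ("s_j,t_i,t_j", ("sj", "ti", "tj")),
]

feats = instantiate(TEMPLATES,
                    s=["kare", "wa", "hon", "wo", "katta"],
                    t=["PRON", "PRT", "NOUN", "PRT", "VERB"], i=0, j=4)
# Conjoining with a label pair <l_i, l_j>, here (C, N):
print([f + "&l=C,N" for f in feats])
```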

word alignments

Appears in 9 sentences as: word aligned (1), word alignment (3), word alignments (6), words aligned (1)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. The lines represent word alignments.
    Page 6, “Proposed Method”
  2. The English side arrows point to the nearest word aligned on the right.
    Page 6, “Proposed Method”
  3. The training data is built from a parallel corpus and word alignments between corresponding source words and target words.
    Page 6, “Proposed Method”
  4. We select the target words aligned to the source words sequentially from left to right (target side arrows).
    Page 6, “Proposed Method”
  5. GIZA++ and grow-diag-final-and heuristics were used to obtain word alignments.
    Page 7, “Experiment”
  6. In order to reduce word alignment errors, we removed articles {a, an, the} in English and particles {ga, wo, wa} in Japanese before performing word alignments because these function words do not correspond to any words in the other languages.
    Page 7, “Experiment”
  7. After word alignment, we restored the removed words and shifted the word alignment positions to the original word positions.
    Page 7, “Experiment”
  8. Our distortion model was trained as follows: We used 0.2 million sentence pairs and their word alignments from the data used to build the translation model as the training data for our distortion models.
    Page 7, “Experiment”
  9. The third (CORPUS) is the probabilities for the actual distortions in the training data that were obtained from the word alignments used to build the translation model.
    Page 9, “Experiment”
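
A minimal sketch of the training-data construction the excerpts describe: walking the target side left to right and taking each target word's aligned source position recovers the order in which source positions were translated. Helper names are mine:

```python
# Illustrative sketch: derive the source-position visit order from word
# alignments (e.g. GIZA++ output), scanning the target side left to right.
def source_visit_order(alignment, target_len):
    """alignment: iterable of (src_idx, tgt_idx) pairs. Unaligned target
    words are skipped, and each source position is visited once."""
    tgt_to_src = {}
    for s_idx, t_idx in sorted(alignment):
        tgt_to_src.setdefault(t_idx, s_idx)  # keep the leftmost source link
    order, seen = [], set()
    for t in range(target_len):
        s_pos = tgt_to_src.get(t)
        if s_pos is not None and s_pos not in seen:
            order.append(s_pos)
            seen.add(s_pos)
    return order

# "kare wa hon wo katta" -> "he bought a book":
# he<-kare(0), bought<-katta(4), book<-hon(2)
print(source_visit_order({(0, 0), (4, 1), (2, 3)}, target_len=4))  # [0, 4, 2]
```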

CRF

Appears in 7 sentences as: CRF (7)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. We use a sequence discrimination technique based on CRF (Lafferty et al., 2001) to identify the label sequence that corresponds to the NP.
    Page 5, “Proposed Method”
  2. There are two differences between our task and the CRF task.
    Page 5, “Proposed Method”
  3. One difference is that CRF discriminates label sequences that consist of labels from all of the label candidates, whereas we constrain the label sequences to sequences where the label at the CP is C, the label at an NPC is N, and the labels between the CP and the NPC are I.
    Page 5, “Proposed Method”
  4. The other difference is that CRF is designed for discriminating label sequences corresponding to the same object sequence, whereas we do not assign labels to words outside the spans from the CP to each NPC.
    Page 5, “Proposed Method”
  5. However, when we assume that another label such as E has been assigned to the words outside the spans and there are no features involving label E, CRF with our label constraints can be applied to our task.
    Page 5, “Proposed Method”
  6. In this paper, the method designed to discriminate label sequences corresponding to the different word sequence lengths is called partial CRF.
    Page 5, “Proposed Method”
  7. The sequence model based on partial CRF is derived by extending the pair model.
    Page 5, “Proposed Method”
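
A minimal sketch of the label constraint described above: for each NPC, the only admissible sequence has C at the CP, N at the NPC, and I on every word in between, so the "partial CRF" reduces to a softmax over NPCs of per-sequence scores. Feature handling is simplified and all names are mine:

```python
import math

def constrained_labels(cp, npc):
    """The single admissible (position, label) sequence for this NPC:
    C at the CP, I between, N at the NPC."""
    step = 1 if npc > cp else -1
    seq = [(cp, "C")]
    seq += [(k, "I") for k in range(cp + step, npc, step)]
    seq.append((npc, "N"))
    return seq

def sequence_probs(cp, npcs, score):
    """Softmax over NPCs of summed per-position scores: a CRF whose label
    sequences are restricted as in the excerpts above."""
    logits = [sum(score(pos, lab) for pos, lab in constrained_labels(cp, j))
              for j in npcs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return {j: e / z for j, e in zip(npcs, exps)}

# Toy scorer that prefers labeling position 4 as the next position (N):
toy_score = lambda pos, lab: 0.5 if (pos, lab) == (4, "N") else 0.0
print(sequence_probs(cp=0, npcs=[1, 4], score=toy_score))
```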

translation model

Appears in 5 sentences as: translation model (5)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. The translation model was trained using sentences of 40 words or less from the training data.
    Page 6, “Experiment”
  2. The common SMT feature set consists of four translation model features, phrase penalty, word penalty, and a language model feature.
    Page 7, “Experiment”
  3. Our distortion model was trained as follows: We used 0.2 million sentence pairs and their word alignments from the data used to build the translation model as the training data for our distortion models.
    Page 7, “Experiment”
  4. The MSD bidirectional lexical distortion model was built using all of the data used to build the translation model.
    Page 7, “Experiment”
  5. The third (CORPUS) is the probabilities for the actual distortions in the training data that were obtained from the word alignments used to build the translation model.
    Page 9, “Experiment”
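
The "common SMT feature set" in excerpt 2 is combined in the standard log-linear fashion, with MERT tuning the weights. A minimal sketch; the feature names and toy values are mine:

```python
# Log-linear model score: a weighted sum of feature values (translation
# model scores, penalties, language model, distortion, ...).
def loglinear_score(features, weights):
    return sum(weights[name] * value for name, value in features.items())

hyp_feats = {"tm": -12.3, "lm": -35.1, "phrase_penalty": 5.0,
             "word_penalty": 9.0, "distortion": -4.0}
lambdas = {"tm": 1.0, "lm": 0.5, "phrase_penalty": -0.2,
           "word_penalty": -0.3, "distortion": 0.6}  # what MERT tunes
print(loglinear_score(hyp_feats, lambdas))
```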

maximum entropy

Appears in 4 sentences as: maximum entropy (4)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. In this work, we use the maximum entropy method (Berger et al., 1996) as a discriminative machine learning method.
    Page 4, “Proposed Method”
  2. The reason for this is that a model based on the maximum entropy method can calculate probabilities.
    Page 4, “Proposed Method”
  3. The L-BFGS method (Liu and Nocedal, 1989) was used to estimate the weight parameters of maximum entropy models.
    Page 7, “Experiment”
  4. The maximum entropy method with Gaussian prior smoothing was used to estimate the model parameters.
    Page 7, “Experiment”
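
A minimal sketch of a maximum entropy model of the kind described: a softmax over weighted feature vectors, with the Gaussian prior entering training as an L2 penalty on the weights (the L-BFGS optimization loop itself is omitted; names are mine):

```python
import math

def maxent_probs(candidates, weights):
    """candidates: {label: {feature: value}}. Returns p(label | context)."""
    scores = {lab: sum(weights.get(f, 0.0) * v for f, v in feats.items())
              for lab, feats in candidates.items()}
    m = max(scores.values())
    exps = {lab: math.exp(s - m) for lab, s in scores.items()}
    z = sum(exps.values())
    return {lab: e / z for lab, e in exps.items()}

def gaussian_prior_penalty(weights, sigma2=1.0):
    """Term added to the negative log-likelihood during L-BFGS training."""
    return sum(w * w for w in weights.values()) / (2.0 * sigma2)

w = {"f1": 0.3, "f2": 1.2}
print(maxent_probs({"left": {"f1": 1.0}, "right": {"f1": 1.0, "f2": 1.0}}, w))
print(gaussian_prior_penalty(w))
```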

BLEU

Appears in 3 sentences as: BLEU (4)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. In our experiments, our model improved 2.9 BLEU points for Japanese-English and 2.6 BLEU points for Chinese-English translation compared to the lexical reordering models.
    Page 1, “Abstract”
  2. To stabilize the MERT results, we tuned three times by MERT using the first half of the development data and we selected the SMT weighting parameter set that performed the best on the second half of the development data based on the BLEU scores from the three SMT weighting parameter sets.
    Page 7, “Experiment”
  3. To investigate the tolerance for sparsity of the training data, we reduced the training data for the sequence model to 20,000 sentences for JE translation. SEQUENCE using this model with a distortion limit of 30 achieved a BLEU score of 32.22. Although the score is lower than the score of SEQUENCE with a distortion limit of 30 in Table 3, the score was still higher than those of LINEAR, LINEAR+LEX, and 9-CLASS for JE in Table 3.
    Page 8, “Experiment”
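
For reference, a minimal sketch of the corpus-level BLEU formula behind the scores quoted above: a brevity penalty times the geometric mean of modified n-gram precisions for n = 1..4. Real evaluations use a standard tool; this toy version has no smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    match, total = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            match[n - 1] += sum(min(c, r[g]) for g, c in h.items())  # clipped
            total[n - 1] += sum(h.values())
    precisions = [m / t if t else 0.0 for m, t in zip(match, total)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(corpus_bleu([["the", "cat", "sat", "on", "the", "mat"]],
                  [["the", "cat", "sat", "on", "a", "mat"]]))  # ~53.7
```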

Chinese-English

Appears in 3 sentences as: Chinese-English (3)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. In our experiments, our model improved 2.9 BLEU points for Japanese-English and 2.6 BLEU points for Chinese-English translation compared to the lexical reordering models.
    Page 1, “Abstract”
  2. Experiments confirmed the effectiveness of our method for Japanese-English and Chinese-English translation, using NTCIR-9 Patent Machine Translation Task data sets (Goto et al., 2011).
    Page 2, “Introduction”
  3. HIER: 30.47 (Japanese-English), 32.66 (Chinese-English)
    Page 8, “Experiment”

language model

Appears in 3 sentences as: language model (2), language models (1)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. A language model also supports the estimation.
    Page 1, “Introduction”
  2. We used 5-gram language models that were trained using the English side of each set of bilingual training data.
    Page 7, “Experiment”
  3. The common SMT feature set consists of four translation model features, phrase penalty, word penalty, and a language model feature.
    Page 7, “Experiment”
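
A minimal sketch of how a 5-gram language model scores a hypothesis: each word is conditioned on its four predecessors. The toy dictionary model and the uniform fallback are stand-ins; real systems query a trained model via a toolkit such as SRILM or KenLM:

```python
import math

def lm_logprob(tokens, model, order=5, vocab_size=10000):
    """Sum of log p(w_i | four preceding words), with sentence padding."""
    padded = ["<s>"] * (order - 1) + tokens + ["</s>"]
    logp = 0.0
    for i in range(order - 1, len(padded)):
        context = tuple(padded[i - order + 1: i])
        p = model.get((context, padded[i]), 1.0 / vocab_size)  # crude fallback
        logp += math.log(p)
    return logp

toy_model = {(("<s>", "<s>", "<s>", "<s>"), "the"): 0.2}
print(lm_logprob(["the", "cat"], toy_model))
```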

Machine Translation

Appears in 3 sentences as: Machine Translation (2), machine translation (1)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. Estimating appropriate word order in a target language is one of the most difficult problems for statistical machine translation (SMT).
    Page 1, “Introduction”
  2. Experiments confirmed the effectiveness of our method for Japanese-English and Chinese-English translation, using NTCIR-9 Patent Machine Translation Task data sets (Goto et al., 2011).
    Page 2, “Introduction”
  3. We used the patent data for the Japanese to English and Chinese to English translation subtasks from the NTCIR-9 Patent Machine Translation Task (Goto et al., 2011).
    Page 6, “Experiment”

proposed models

Appears in 3 sentences as: Proposed Models (1), proposed models (2)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. The proposed models are the pair model and the sequence model.
    Page 1, “Introduction”
  2. Then, we describe two proposed models: the pair model and the sequence model, which is the further improved model.
    Page 3, “Proposed Method”
  3. 4.2 Training for the Proposed Models
    Page 7, “Experiment”

sentence pairs

Appears in 3 sentences as: sentence pairs (3)
In Distortion Model Considering Rich Context for Statistical Machine Translation
  1. So approximately 2.05 million sentence pairs consisting of approximately 54 million
    Page 6, “Experiment”
  2. And approximately 0.49 million sentence pairs consisting of 14.9 million Chinese tokens whose lexicon size was 169k and 16.3 million English tokens whose lexicon size was 240k were used for CE.
    Page 7, “Experiment”
  3. Our distortion model was trained as follows: We used 0.2 million sentence pairs and their word alignments from the data used to build the translation model as the training data for our distortion models.
    Page 7, “Experiment”
