Deciphering Foreign Language
Ravi, Sujith and Knight, Kevin

Article Structure

Abstract

In this work, we tackle the task of machine translation (MT) without parallel training data.

Introduction

Bilingual corpora are a staple of statistical machine translation (SMT) research.

Word Substitution Decipherment

Before we tackle machine translation without parallel data, we first solve a simpler problem—word substitution decipherment.

Machine Translation as a Decipherment Task

We now turn to the problem of MT without parallel data.

Conclusion

Our work is the first attempt at doing MT without parallel data.

Topics

translation model

Appears in 15 sentences as: translation model (14) translation models (1)
In Deciphering Foreign Language
  1. We frame the MT problem as a decipherment task, treating the foreign text as a cipher for English and present novel methods for training translation models from nonparallel text.
    Page 1, “Abstract”
  2. From these corpora, we estimate translation model parameters: word-to-word translation tables, fertilities, distortion parameters, phrase tables, syntactic transformations, etc.
    Page 1, “Introduction”
  3. In this paper, we address the problem of learning a full translation model from nonparallel data, and we use the
    Page 1, “Introduction”
  4. How can we learn a translation model from nonparallel data?
    Page 1, “Introduction”
  5. Intuitively, we try to construct translation model tables which, when applied to observed foreign text, consistently yield sensible English.
    Page 1, “Introduction”
  6. same full translation model.
    Page 2, “Introduction”
  7. A language model P(e) is typically used in SMT decoding (Koehn, 2009), but here P(e) actually plays a central role in training translation model parameters.
    Page 2, “Introduction”
  8. We give first results for training a full translation model from nonparallel text, and we apply the model to translate previously-unseen text.
    Page 2, “Introduction”
  9. Probabilistic decipherment: Unlike parallel training, here we have to estimate the translation model Pθ(f|e) parameters using only monolingual data.
    Page 5, “Machine Translation as a Decipherment Task”
  10. We then estimate parameters of the translation model Pθ(f|e) during training.
    Page 5, “Machine Translation as a Decipherment Task”
  11. EM Decipherment: We propose a new translation model for MT decipherment which can be efficiently trained using the EM algorithm.
    Page 5, “Machine Translation as a Decipherment Task”
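
Items 4, 5, 9, and 10 above all point at the same training objective: without parallel data, the translation model parameters θ are chosen so that the observed foreign text becomes probable when explained by some English string under a fixed language model P(e). A minimal LaTeX rendering of that decipherment objective, written from the quoted sentences rather than copied from the paper's numbered equations, is:

    \hat{\theta} \;=\; \arg\max_{\theta} \; \prod_{f} \; \sum_{e} P(e)\, P_{\theta}(f \mid e)

where the product runs over the monolingual foreign sentences and the sum ranges over candidate English strings e.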

LM

Appears in 14 sentences as: LM (16)
In Deciphering Foreign Language
  1. We model P(e) using a statistical word n-gram English language model (LM).
    Page 2, “Word Substitution Decipherment”
  2. We use an English word bi-gram LM as the base distribution (P0) for the source model and specify a uniform P0 distribution for the channel.
    Page 3, “Word Substitution Decipherment”
  3. In order to sample at position i, we choose the top K English words Y ranked by P(X Y Z), which can be computed offline from a statistical word bigram LM.
    Page 4, “Word Substitution Decipherment”
  4. We also have access to a separate English corpus (which is not parallel to the ciphertext) containing 125k temporal expressions (242k word tokens, 201 word types) for LM training.
    Page 4, “Word Substitution Decipherment”
  5. The data consists of 10k cipher sentences (102k tokens, 3397 word types); and a plaintext corpus of 402k English sentences (2.7M word tokens, 25761 word types) for LM training.
    Page 4, “Word Substitution Decipherment”
  6. EM with 2-gram LM: 87.8 / Intractable
    Page 5, “Word Substitution Decipherment”
  7. Iterative EM with 2-gram LM: 87.8 / 70.5 / 71.8
    Page 5, “Word Substitution Decipherment”
  8. Bayesian with 2-gram LM: 88.6 / 60.1 / 80.0; with 3-gram LM: 82.5
    Page 5, “Word Substitution Decipherment”
  9. build an English word n-gram LM, which is used in the decipherment process.
    Page 5, “Word Substitution Decipherment”
  10. For P(e), we use a word n-gram LM trained on monolingual English data.
    Page 5, “Machine Translation as a Decipherment Task”
  11. Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment.
    Page 8, “Machine Translation as a Decipherment Task”
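
Several of the sentences above (items 1-5, 9, and 10) rely on a word bigram LM playing the role of P(e). As an illustration only, and not the authors' code, a tiny add-k smoothed bigram LM could look like the sketch below; the smoothing scheme, class name, and method names are assumptions.

    from collections import Counter
    import math

    class BigramLM:
        """Add-k smoothed word bigram language model (illustrative sketch)."""

        def __init__(self, sentences, k=0.01):
            self.k = k
            self.context = Counter()   # counts of each preceding word
            self.bigrams = Counter()   # counts of (prev, word) pairs
            vocab = set()
            for sent in sentences:
                tokens = ["<s>"] + sent.split() + ["</s>"]
                vocab.update(tokens)
                self.context.update(tokens[:-1])
                self.bigrams.update(zip(tokens[:-1], tokens[1:]))
            self.vocab_size = len(vocab)

        def logprob(self, prev, word):
            num = self.bigrams[(prev, word)] + self.k
            den = self.context[prev] + self.k * self.vocab_size
            return math.log(num / den)

        def sentence_logprob(self, sent):
            tokens = ["<s>"] + sent.split() + ["</s>"]
            return sum(self.logprob(p, w) for p, w in zip(tokens[:-1], tokens[1:]))

For example, BigramLM(["the man saw the dog", "the dog ran"]).sentence_logprob("the dog saw the man") scores a candidate English string, which is exactly the role P(e) plays throughout the decipherment quotes.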

parallel data

Appears in 12 sentences as: parallel data (12)
In Deciphering Foreign Language
  1. Of course, for many language pairs and domains, parallel data is not available.
    Page 1, “Introduction”
  2. Before we tackle machine translation without parallel data, we first solve a simpler problem—word substitution decipherment.
    Page 2, “Word Substitution Decipherment”
  3. We now turn to the problem of MT without parallel data.
    Page 5, “Machine Translation as a Decipherment Task”
  4. Next, we present two novel decipherment approaches for MT training without parallel data.
    Page 5, “Machine Translation as a Decipherment Task”
  5. Bayesian Decipherment: We introduce a novel method for estimating IBM Model 3 parameters without parallel data, using Bayesian learning.
    Page 5, “Machine Translation as a Decipherment Task”
  6. Instead, we propose a simpler generative story for MT without parallel data.
    Page 6, “Machine Translation as a Decipherment Task”
  7. Bayesian Formulation: Our goal is to learn the probability tables t (translation parameters), n (fertility parameters), d (distortion parameters), and p (English NULL word probabilities) without parallel data.
    Page 6, “Machine Translation as a Decipherment Task”
  8. Decipherment without parallel data using: (a) EM method (from Section 3.1), and (b) Bayesian method (from Section 3.2).
    Page 8, “Machine Translation as a Decipherment Task”
  9. However, higher improvements are observed when using parallel data in comparison to decipherment training which only uses monolingual data.
    Page 8, “Machine Translation as a Decipherment Task”
  10. Figure 4: Comparison of training data size versus MT accuracy in terms of BLEU score under different training conditions: (1) Parallel training—(a) MOSES, (b) IBM Model 3 without distortion, and (2) Decipherment without parallel data using EM method (from Section 3.1).
    Page 9, “Machine Translation as a Decipherment Task”
  11. In other words, “how much nonparallel data is worth how much parallel data in order to achieve the same MT accuracy?” Figure 4 provides a reasonable answer to this question for the Spanish/English MT task described here.
    Page 9, “Machine Translation as a Decipherment Task”

n-gram

Appears in 7 sentences as: n-gram (7)
In Deciphering Foreign Language
  1. We model P(e) using a statistical word n-gram English language model (LM).
    Page 2, “Word Substitution Decipherment”
  2. Our method holds several other advantages over the EM approach—(1) inference using smart sampling strategies permits efficient training, allowing us to scale to large data/vocabulary sizes, (2) incremental scoring of derivations during sampling allows efficient inference even when we use higher-order n-gram LMs, (3) there are no memory bottlenecks since the full channel model and derivation lattice are never instantiated during training, and (4) prior specification allows us to learn skewed distributions that are useful here—word substitution ciphers exhibit 1-to-1 correspondence between plaintext and cipher types.
    Page 3, “Word Substitution Decipherment”
  3. build an English word n-gram LM, which is used in the decipherment process.
    Page 5, “Word Substitution Decipherment”
  4. For P(e), we use a word n-gram LM trained on monolingual English data.
    Page 5, “Machine Translation as a Decipherment Task”
  5. Whole-segment Language Models: When using word n-gram models of English for decipherment, we find that some of the foreign sentences are decoded into sequences (such as “THANK YOU TALKING ABOUT ?”) that are not good English.
    Page 6, “Machine Translation as a Decipherment Task”
  6. This stems from the fact that n-gram LMs have no global information about what constitutes a valid English segment.
    Page 6, “Machine Translation as a Decipherment Task”
  7. We then use this model (in place of word n-gram LMs) for decipherment training and decoding.
    Page 6, “Machine Translation as a Decipherment Task”
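
Item 2 above mentions that incremental scoring of derivations keeps sampling efficient even with higher-order n-gram LMs. The idea is that changing one plaintext word only affects the n-grams that touch its position; under a bigram LM that is exactly two bigrams. A hedged sketch, reusing the hypothetical BigramLM class from the earlier LM sketch:

    def rescore_after_substitution(lm, plaintext, i, new_word, old_logprob):
        """Update a sentence log-probability after replacing plaintext[i] with
        new_word. Only the bigrams (e[i-1], e[i]) and (e[i], e[i+1]) change, so
        the full sentence never has to be rescored."""
        tokens = ["<s>"] + list(plaintext) + ["</s>"]
        j = i + 1  # index of plaintext[i] inside the padded token list
        old = lm.logprob(tokens[j - 1], tokens[j]) + lm.logprob(tokens[j], tokens[j + 1])
        new = lm.logprob(tokens[j - 1], new_word) + lm.logprob(new_word, tokens[j + 1])
        return old_logprob - old + new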

BLEU

Appears in 6 sentences as: BLEU (6)
In Deciphering Foreign Language
  1. Evaluation: All the MT systems are run on the Spanish test data and the quality of the resulting English translations is evaluated using two different measures: (1) normalized edit distance score (Navarro, 2001), and (2) BLEU (Papineni et al., 2002).
    Page 8, “Machine Translation as a Decipherment Task”
  2. The figure also shows the corresponding BLEU scores in parentheses for comparison (higher scores indicate better MT output).
    Page 8, “Machine Translation as a Decipherment Task”
  3. Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment.
    Page 8, “Machine Translation as a Decipherment Task”
  4. Figure 4 plots the BLEU scores versus training sizes for different MT systems on the Time corpus.
    Page 8, “Machine Translation as a Decipherment Task”
  5. BLEU score (higher is better)
    Page 9, “Machine Translation as a Decipherment Task”
  6. Figure 4: Comparison of training data size versus MT accuracy in terms of BLEU score under different training conditions: (1) Parallel training—(a) MOSES, (b) IBM Model 3 without distortion, and (2) Decipherment without parallel data using EM method (from Section 3.1).
    Page 9, “Machine Translation as a Decipherment Task”

language model

Appears in 6 sentences as: language model (5) Language Models (1)
In Deciphering Foreign Language
  1. The variable e ranges over all possible English strings, and P(e) is a language model built from large amounts of English text that is unrelated to the foreign strings.
    Page 1, “Introduction”
  2. A language model P(e) is typically used in SMT decoding (Koehn, 2009), but here P(e) actually plays a central role in training translation model parameters.
    Page 2, “Introduction”
  3. We model P(e) using a statistical word n-gram English language model (LM).
    Page 2, “Word Substitution Decipherment”
  4. For word substitution decipherment, we want to keep the language model probabilities fixed during training, and hence we set the prior on that model to be high (α = 10^4).
    Page 4, “Word Substitution Decipherment”
  5. Whole-segment Language Models: When using word n-gram models of English for decipherment, we find that some of the foreign sentences are decoded into sequences (such as “THANK YOU TALKING ABOUT ?”) that are not good English.
    Page 6, “Machine Translation as a Decipherment Task”
  6. For Bayesian MT decipherment, we set a high prior value on the language model (10^4) and use sparse priors for the IBM 3 model parameters t, n, d, p (0.01, 0.01, 0.01, 0.01).
    Page 7, “Machine Translation as a Decipherment Task”

model parameters

Appears in 6 sentences as: model parameters (6)
In Deciphering Foreign Language
  1. From these corpora, we estimate translation model parameters: word-to-word translation tables, fertilities, distortion parameters, phrase tables, syntactic transformations, etc.
    Page 1, “Introduction”
  2. A language model P(e) is typically used in SMT decoding (Koehn, 2009), but here P(e) actually plays a central role in training translation model parameters.
    Page 2, “Introduction”
  3. During decipherment, our goal is to estimate the channel model parameters θ.
    Page 2, “Word Substitution Decipherment”
  4. These methods are attractive for their ability to manage uncertainty about model parameters and allow one to incorporate prior knowledge during inference.
    Page 3, “Word Substitution Decipherment”
  5. During decipherment training, our objective is to estimate the model parameters θ in order to maximize the probability of the foreign corpus f. From Equation 4 we have:
    Page 5, “Machine Translation as a Decipherment Task”
  6. For Bayesian MT decipherment, we set a high prior value on the language model (10^4) and use sparse priors for the IBM 3 model parameters t, n, d, p (0.01, 0.01, 0.01, 0.01).
    Page 7, “Machine Translation as a Decipherment Task”
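
The prior values quoted in items 4 and 6 (10^4 on the language model, 0.01 on the IBM Model 3 tables t, n, d, p) typically enter Bayesian decipherment through a Chinese-Restaurant-Process-style cache probability. The generic form of that rule, given here as a sketch rather than a formula copied from the paper, is:

    P(f \mid e, \text{history}) \;=\; \frac{\alpha \, P_0(f \mid e) \;+\; \mathrm{count}(e, f)}{\alpha \;+\; \mathrm{count}(e)}

where P_0 is the base distribution and α the prior (concentration) value. A very large α, as on the language model, keeps the distribution pinned to its base, matching the stated goal of holding LM probabilities fixed; small α values on the channel tables encourage the sparse, skewed distributions the quoted sentences say the method needs.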

BLEU score

Appears in 5 sentences as: BLEU score (3) BLEU scores (2)
In Deciphering Foreign Language
  1. The figure also shows the corresponding BLEU scores in parentheses for comparison (higher scores indicate better MT output).
    Page 8, “Machine Translation as a Decipherment Task”
  2. Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment.
    Page 8, “Machine Translation as a Decipherment Task”
  3. Figure 4 plots the BLEU scores versus training sizes for different MT systems on the Time corpus.
    Page 8, “Machine Translation as a Decipherment Task”
  4. BLEU score (higher is better)
    Page 9, “Machine Translation as a Decipherment Task”
  5. Figure 4: Comparison of training data size versus MT accuracy in terms of BLEU score under different training conditions: (1) Parallel training—(a) MOSES, (b) IBM Model 3 without distortion, and (2) Decipherment without parallel data using EM method (from Section 3.1).
    Page 9, “Machine Translation as a Decipherment Task”

edit distance

Appears in 5 sentences as: edit distance (5)
In Deciphering Foreign Language
  1. Evaluation: All the MT systems are run on the Spanish test data and the quality of the resulting English translations is evaluated using two different measures: (1) normalized edit distance score (Navarro, 2001), and (2) BLEU (Papineni et al., 2002).
    Page 8, “Machine Translation as a Decipherment Task”
  2. When computing edit distance, we account for substitutions, insertions, deletions, as well as local-swap edit operations required to convert a given English string into the (gold) reference translation.
    Page 8, “Machine Translation as a Decipherment Task”
  3. Results: Figure 3 compares the results of various MT systems (using parallel versus decipherment training) on the two test corpora in terms of edit distance scores (a lower score indicates closer match to the gold translation).
    Page 8, “Machine Translation as a Decipherment Task”
  4. On the Time corpus, the best decipherment (Method 2a in the figure) achieves an edit distance score of 28.7 (versus 4.7 for MOSES).
    Page 8, “Machine Translation as a Decipherment Task”
  5. Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment.
    Page 8, “Machine Translation as a Decipherment Task”
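
Item 2 above spells out the edit operations: substitutions, insertions, deletions, and local swaps, applied at the word level and then normalized. A hedged Python sketch of such a measure follows (the normalization by reference length is an assumption, since the quotes do not specify it):

    def normalized_edit_distance(hyp, ref):
        """Word-level edit distance allowing substitutions, insertions,
        deletions, and adjacent-word swaps, divided by the reference length."""
        h, r = hyp.split(), ref.split()
        n, m = len(h), len(r)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if h[i - 1] == r[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution / match
                if i > 1 and j > 1 and h[i - 1] == r[j - 2] and h[i - 2] == r[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # local swap
        return d[n][m] / max(m, 1)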

MT systems

Appears in 5 sentences as: MT Systems (1) MT systems (5)
In Deciphering Foreign Language
  1. MT Systems: We build and compare different MT systems under two training scenarios:
    Page 8, “Machine Translation as a Decipherment Task”
  2. Evaluation: All the MT systems are run on the Spanish test data and the quality of the resulting English translations is evaluated using two different measures: (1) normalized edit distance score (Navarro, 2001), and (2) BLEU (Papineni et al., 2002).
    Page 8, “Machine Translation as a Decipherment Task”
  3. Results: Figure 3 compares the results of various MT systems (using parallel versus decipherment training) on the two test corpora in terms of edit distance scores (a lower score indicates closer match to the gold translation).
    Page 8, “Machine Translation as a Decipherment Task”
  4. We also investigate how the performance of different MT systems varies with the size of the training data.
    Page 8, “Machine Translation as a Decipherment Task”
  5. Figure 4 plots the BLEU scores versus training sizes for different MT systems on the Time corpus.
    Page 8, “Machine Translation as a Decipherment Task”

language pairs

Appears in 4 sentences as: language pair (1) language pairs (3)
In Deciphering Foreign Language
  1. Of course, for many language pairs and domains, parallel data is not available.
    Page 1, “Introduction”
  2. As successful work develops along this line, we expect more domains and language pairs to be conquered by SMT.
    Page 1, “Introduction”
  3. Data: We work with the Spanish/English language pair and use the following corpora in our MT experiments:
    Page 7, “Machine Translation as a Decipherment Task”
  4. OPUS movie subtitle corpus: This is a large open source collection of parallel corpora available for multiple language pairs (Tiedemann, 2009).
    Page 7, “Machine Translation as a Decipherment Task”

machine translation

Appears in 4 sentences as: machine translation (4)
In Deciphering Foreign Language
  1. In this work, we tackle the task of machine translation (MT) without parallel training data.
    Page 1, “Abstract”
  2. Bilingual corpora are a staple of statistical machine translation (SMT) research.
    Page 1, “Introduction”
  3. Before we tackle machine translation without parallel data, we first solve a simpler problem—word substitution decipherment.
    Page 2, “Word Substitution Decipherment”
  4. From a decipherment perspective, machine translation is a much more complex task than word substitution decipherment and poses several technical challenges: (1) scalability due to large corpora sizes and huge translation tables, (2) nondeterminism in translation mappings (a word can have multiple translations), (3) reordering of words
    Page 5, “Machine Translation as a Decipherment Task”

Gibbs sampling

Appears in 3 sentences as: Gibbs sampling (3)
In Deciphering Foreign Language
  1. We perform inference using point-wise Gibbs sampling (Geman and Geman, 1984).
    Page 4, “Word Substitution Decipherment”
  2. Parallelized Gibbs sampling: Secondly, we parallelize our sampling step using a Map-Reduce framework.
    Page 4, “Word Substitution Decipherment”
  3. Sampling IBM Model 3: We use point-wise Gibbs sampling to estimate the IBM Model 3 parameters.
    Page 7, “Machine Translation as a Decipherment Task”
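
Putting the quotes above together (point-wise sampling over cipher positions, a bigram LM for the source, a cache-style channel with a uniform base distribution), one sampling sweep for word substitution decipherment can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: lm follows the BigramLM sketch given earlier, pair_counts and e_counts form the channel cache, and candidates_fn returns the top-K English words for a position.

    import math
    import random

    def gibbs_sweep(cipher, plaintext, lm, pair_counts, e_counts,
                    candidates_fn, p0=1e-4, alpha=0.01):
        """One point-wise Gibbs sweep: resample the English word behind each
        cipher token in proportion to P(e | context) * P(c | e), where the
        channel uses a CRP-style cache with uniform base probability p0 and
        concentration alpha."""
        for i, c in enumerate(cipher):
            prev = plaintext[i - 1] if i > 0 else "<s>"
            nxt = plaintext[i + 1] if i + 1 < len(plaintext) else "</s>"
            e_old = plaintext[i]
            # exclude the current assignment from the cache before resampling it
            pair_counts[(e_old, c)] = pair_counts.get((e_old, c), 0) - 1
            e_counts[e_old] = e_counts.get(e_old, 0) - 1
            cands = candidates_fn(prev, nxt)
            scores = []
            for e in cands:
                lm_score = lm.logprob(prev, e) + lm.logprob(e, nxt)
                chan = (alpha * p0 + pair_counts.get((e, c), 0)) / (alpha + e_counts.get(e, 0))
                scores.append(lm_score + math.log(chan))
            # draw the new English word in proportion to exp(score)
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            e_new = random.choices(cands, weights=weights, k=1)[0]
            plaintext[i] = e_new
            pair_counts[(e_new, c)] = pair_counts.get((e_new, c), 0) + 1
            e_counts[e_new] = e_counts.get(e_new, 0) + 1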

iteratively

Appears in 3 sentences as: iteratively (3)
In Deciphering Foreign Language
  1. Instead of instantiating the entire channel model (with all its parameters), we iteratively train the model in small steps.
    Page 3, “Word Substitution Decipherment”
  2. Goto Step 2 and repeat the procedure, extending the channel size iteratively in each stage.
    Page 3, “Word Substitution Decipherment”
  3. For Iterative EM, we start with a channel of size 101x101 (K = 100) and in every pass we iteratively increase the vocabulary sizes by 50, repeating the training procedure until the channel size becomes 351x351.
    Page 5, “Machine Translation as a Decipherment Task”
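
Item 3 above fixes the schedule: start with a 101x101 channel (the top K = 100 word types on each side plus one extra symbol, presumably a catch-all; that reading of the extra symbol is an assumption) and grow both vocabularies by 50 types per pass until the channel reaches 351x351. A minimal sketch of that loop, where run_em is a hypothetical callback that trains the restricted channel and is warm-started from the previous pass:

    def iterative_em_schedule(cipher_corpus, english_vocab_by_freq, cipher_vocab_by_freq,
                              run_em, start_k=100, step=50, max_k=350):
        """Grow the channel vocabulary in stages and re-run EM at each stage."""
        table = None
        k = start_k
        while k <= max_k:
            table = run_em(cipher_corpus,
                           english_vocab_by_freq[:k],
                           cipher_vocab_by_freq[:k],
                           init=table)
            k += step
        return table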

sentence pairs

Appears in 3 sentences as: sentence pairs (3)
In Deciphering Foreign Language
  1. Starting with the classic IBM work (Brown et al., 1993), training has been viewed as a maximization problem involving hidden word alignments (a) that are assumed to underlie observed sentence pairs
    Page 1, “Introduction”
  2. Brown et al. (1993) provide an efficient algorithm for training the IBM Model 3 translation model when parallel sentence pairs are available.
    Page 6, “Machine Translation as a Decipherment Task”
  3. We see that deciphering with 10k monolingual Spanish sentences yields the same performance as training with around 200-500 parallel English/Spanish sentence pairs.
    Page 9, “Machine Translation as a Decipherment Task”

Viterbi

Appears in 3 sentences as: Viterbi (3)
In Deciphering Foreign Language
  1. Finally, we decode the given ciphertext c by using the Viterbi algorithm to choose the plaintext decoding e that maximizes P(e) · Pθ-trained(c|e)^3, stretching the channel probabilities (Knight et al., 2006).
    Page 3, “Word Substitution Decipherment”
  2. We then use the Viterbi algorithm to choose the English plaintext e that maximizes P(e) · Pθ-trained(c|e)^3.
    Page 4, “Word Substitution Decipherment”
  3. Finally, we use the Viterbi algorithm to decode the foreign sentence f and produce an English translation e that maximizes P(e) ·
    Page 6, “Machine Translation as a Decipherment Task”
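
All three quotes describe the same decoding step: a Viterbi search for the English sequence maximizing P(e) times the trained channel probability raised to the third power (the "stretching" of channel probabilities). A hedged sketch for the word substitution case, assuming the BigramLM from earlier, a candidates[i] list of English options per cipher position, and a channel_logprob(c, e) function returning log P(c|e):

    def viterbi_decode(cipher, candidates, lm, channel_logprob, stretch=3.0):
        """Choose the English sequence maximizing log P(e) + stretch * log P(c|e)."""
        best = [{} for _ in cipher]  # best[i][e] = (score, backpointer)
        for e in candidates[0]:
            best[0][e] = (lm.logprob("<s>", e) + stretch * channel_logprob(cipher[0], e), None)
        for i in range(1, len(cipher)):
            for e in candidates[i]:
                emit = stretch * channel_logprob(cipher[i], e)
                score, back = max((best[i - 1][p][0] + lm.logprob(p, e) + emit, p)
                                  for p in best[i - 1])
                best[i][e] = (score, back)
        # pick the best final word including the end-of-sentence transition
        last, _ = max(best[-1].items(),
                      key=lambda kv: kv[1][0] + lm.logprob(kv[0], "</s>"))
        path = [last]
        for i in range(len(cipher) - 1, 0, -1):
            path.append(best[i][path[-1]][1])
        return list(reversed(path))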
