Scalable Decipherment for Machine Translation via Hash Sampling
Ravi, Sujith

Article Structure

Abstract

In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora.

Introduction

Statistical machine translation (SMT) systems these days are built using large amounts of bilingual parallel corpora.

Decipherment Model for Machine Translation

We now describe the decipherment problem formulation for machine translation.

Feature-based representation for Source and Target

The model described in the previous section, while flexible in describing the translation process, poses several challenges for training.

Bayesian MT Decipherment via Hash Sampling

The next step is to use the feature representations described earlier and iteratively sample a target word (or phrase) translation candidate ei for every ...

Training Algorithm

Putting together all the pieces described in the previous section, we perform the following steps:

Experiments and Results

We test our method on two different corpora.

Discussion and Future Work

There exists some work (Dou and Knight, 2012; Klementiev et al., 2012) that uses monolingual corpora to induce phrase tables, etc.

Conclusion

To summarize, our method is significantly faster than previous methods based on EM or Bayesian inference with standard Gibbs sampling, and it obtains better results than any previously published method for the same task.

Topics

translation model

Appears in 30 sentences as: Translation Model (2) translation model (14) translation models (14)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models.
    Page 1, “Abstract”
  2. The parallel corpora are used to estimate translation model parameters involving word-to-word translation tables, fertilities, distortion, phrase translations, syntactic transformations, etc.
    Page 1, “Introduction”
  3. Learning translation models from monolingual corpora could help address the challenges faced by modern-day MT systems, especially for low resource language pairs.
    Page 1, “Introduction”
  4. Recently, this topic has been receiving increasing attention from researchers and new methods have been proposed to train statistical machine translation models using only monolingual data in the source and target language.
    Page 1, “Introduction”
  5. The body of work that is more closely related to ours includes that of Ravi and Knight (2011b), who introduced a decipherment approach for training translation models using only monolingual corpora.
    Page 1, “Introduction”
  6. Their best performing method uses an EM algorithm to train a word translation model and they show results on a Spanish/English task.
    Page 2, “Introduction”
  7. In this work we propose a new Bayesian inference method for estimating translation models from scratch using only monolingual corpora.
    Page 2, “Introduction”
  8. The new sampler allows us to perform fast, efficient inference with more complex translation models (than previously used) and scale better to large vocabulary and corpora sizes compared to existing methods as evidenced by our experimental results on two different corpora.
    Page 2, “Introduction”
  9. Contrary to standard machine translation training scenarios, here we have to estimate the translation model Pθ(f|e) parameters using only monolingual data.
    Page 2, “Decipherment Model for Machine Translation”
  10. We then estimate the parameters of the translation model Pθ(f|e) during training.
    Page 2, “Decipherment Model for Machine Translation”
  11. Translation Model: Machine translation is a much more complex task than solving other decipherment tasks such as word substitution ciphers (Ravi and Knight, 2011b; Dou and Knight, 2012).
    Page 2, “Decipherment Model for Machine Translation”
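
The excerpts above concern estimating the word translation table Pθ(f|e) from monolingual data alone. As a rough illustration of the kind of Bayesian translation table involved (a minimal sketch, not the paper's exact model; the class name, the uniform base distribution, and the single concentration parameter alpha are assumptions), one can keep sparse counts and compute a collapsed predictive probability:

```python
from collections import defaultdict

class TranslationTable:
    """Sparse Bayesian word-translation table P(f|e) with a Dirichlet prior.

    Illustrative sketch only: assumes a uniform base distribution over the
    source vocabulary and a single concentration parameter alpha.
    """

    def __init__(self, source_vocab_size, alpha=1.0):
        self.alpha = alpha
        self.base = 1.0 / source_vocab_size      # uniform base P0(f|e)
        self.pair_counts = defaultdict(int)      # count(e, f)
        self.target_counts = defaultdict(int)    # count(e)

    def prob(self, f, e):
        # Collapsed predictive probability of f given e under the prior.
        return ((self.pair_counts[(e, f)] + self.alpha * self.base)
                / (self.target_counts[e] + self.alpha))

    def observe(self, f, e, delta=1):
        # delta=+1 when a sampled translation link is added, -1 when removed.
        self.pair_counts[(e, f)] += delta
        self.target_counts[e] += delta
```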


BLEU

Appears in 14 sentences as: BLEU (14)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. We show empirical results on the OPUS data: our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders of magnitude faster).
    Page 1, “Abstract”
  2. To evaluate translation quality, we use BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation.
    Page 7, “Experiments and Results”
  3. We show that our method achieves the best performance (BLEU scores) on this task while being significantly faster than both the previous approaches.
    Page 7, “Experiments and Results”
  4. We also report the first BLEU results on such a large-scale MT task under truly nonparallel settings (without using any parallel data or seed lexicon).
    Page 7, “Experiments and Results”
  5. For both the MT tasks, we also report BLEU scores for a baseline system using identity translations for common words (words appearing in both source/target vocabularies) and random translations for other words.
    Page 7, “Experiments and Results”
  6. We use the entire Spanish source text for decipherment training and evaluate the final English output to report BLEU scores.
    Page 7, “Experiments and Results”
  7. OPUS: We compare the MT results (BLEU scores) from different systems on the OPUS corpus in Table 2.
    Page 7, “Experiments and Results”
  8. With a 3-gram LM, the new method achieves the best performance; the highest BLEU score reported on this task.
    Page 7, “Experiments and Results”
  9. Table 2: Comparison of MT performance (BLEU scores) on the Spanish/English OPUS corpus using only nonparallel corpora.
    Page 8, “Experiments and Results”
  10. Method BLEU
    Page 8, “Experiments and Results”
  11. The comparable number for Table 2 is 63.6 BLEU.
    Page 8, “Experiments and Results”
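
Since BLEU (Papineni et al., 2002) is the evaluation measure used throughout, here is a minimal scoring sketch; it assumes the third-party sacrebleu package and hypothetical file names for the deciphered output and the English reference:

```python
# Hypothetical evaluation snippet: assumes the third-party sacrebleu package,
# and that "decipherment_output.en" / "reference.en" (illustrative file names)
# hold one sentence per line in the same order.
import sacrebleu

with open("decipherment_output.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("reference.en", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```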


BLEU score

Appears in 10 sentences as: BLEU score (5) BLEU scores (5)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. We show empirical results on the OPUS data: our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders of magnitude faster).
    Page 1, “Abstract”
  2. To evaluate translation quality, we use BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation.
    Page 7, “Experiments and Results”
  3. We show that our method achieves the best performance (BLEU scores) on this task while being significantly faster than both the previous approaches.
    Page 7, “Experiments and Results”
  4. For both the MT tasks, we also report BLEU scores for a baseline system using identity translations for common words (words appearing in both source/target vocabularies) and random translations for other words.
    Page 7, “Experiments and Results”
  5. We use the entire Spanish source text for decipherment training and evaluate the final English output to report BLEU scores.
    Page 7, “Experiments and Results”
  6. OPUS: We compare the MT results (BLEU scores) from different systems on the OPUS corpus in Table 2.
    Page 7, “Experiments and Results”
  7. With a 3-gram LM, the new method achieves the best performance; the highest BLEU score reported on this task.
    Page 7, “Experiments and Results”
  8. We report the first BLEU score results on such a large-scale task using a 2-gram LM.
    Page 8, “Experiments and Results”
  9. For comparison purposes, we also evaluate MT performance on this task using parallel training (MOSES trained with a hundred sentence pairs) and observe a BLEU score of 11.7.
    Page 8, “Experiments and Results”
  10. These, when combined with standard MT systems such as Moses (Koehn et al., 2007) trained on parallel corpora, have been shown to yield some BLEU score improvements.
    Page 9, “Discussion and Future Work”


machine translation

Appears in 9 sentences as: Machine translation (1) machine translation (8)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora.
    Page 1, “Abstract”
  2. Statistical machine translation (SMT) systems these days are built using large amounts of bilingual parallel corpora.
    Page 1, “Introduction”
  3. Recently, this topic has been receiving increasing attention from researchers and new methods have been proposed to train statistical machine translation models using only monolingual data in the source and target language.
    Page 1, “Introduction”
  4. We now describe the decipherment problem formulation for machine translation.
    Page 2, “Decipherment Model for Machine Translation”
  5. Contrary to standard machine translation training scenarios, here we have to estimate the translation model Pθ(f|e) parameters using only monolingual data.
    Page 2, “Decipherment Model for Machine Translation”
  6. Translation Model: Machine translation is a much more complex task than solving other decipherment tasks such as word substitution ciphers (Ravi and Knight, 2011b; Dou and Knight, 2012).
    Page 2, “Decipherment Model for Machine Translation”
  7. We use a similar technique to theirs, but with a different approximate distribution for the proposal, one that is better suited for machine translation models and avoids some of the additional overhead required for computing certain terms in the original formulation.
    Page 5, “Bayesian MT Decipherment via Hash Sampling”
  8. To evaluate translation quality, we use BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation.
    Page 7, “Experiments and Results”
  9. for unsupervised machine translation which can help further improve the performance in addition to accelerating the sampling process.
    Page 9, “Discussion and Future Work”


Gibbs sampling

Appears in 8 sentences as: Gibbs sampler (2) Gibbs sampling (6)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. In spite of using Bayesian inference, which is typically slow in practice (with standard Gibbs sampling), we show later that our method is scalable and permits decipherment training using more complex translation models (with several additional parameters).
    Page 2, “Decipherment Model for Machine Translation”
  2. with this problem by using a fast, efficient sampler based on hashing that allows us to speed up the Bayesian inference significantly whereas standard Gibbs sampling would be extremely slow.
    Page 3, “Decipherment Model for Machine Translation”
  3. Additionally, performing Bayesian inference with such a complex model using standard Gibbs sampling can be very slow in practice.
    Page 3, “Feature-based representation for Source and Target”
  4. Doing standard collapsed Gibbs sampling in this scenario would be very slow and intractable.
    Page 4, “Bayesian MT Decipherment via Hash Sampling”
  5. To do collapsed Gibbs sampling under this model, we would perform the following steps during sampling:
    Page 5, “Bayesian MT Decipherment via Hash Sampling”
  6. So, during decipherment training a standard collapsed Gibbs sampler will waste most of its time on expensive computations that will be discarded in the end anyway.
    Page 5, “Bayesian MT Decipherment via Hash Sampling”
  7. The table also demonstrates the significant speedup achieved by the hash sampler over a standard Gibbs sampler for the same model (~85 times faster when using a 2-gram LM).
    Page 8, “Experiments and Results”
  8. To summarize, our method is significantly faster than previous methods based on EM or Bayesian inference with standard Gibbs sampling, and it obtains better results than any previously published method for the same task.
    Page 9, “Conclusion”
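
The contrast drawn above is between exhaustively scoring every target candidate (standard collapsed Gibbs sampling) and drawing from a cheap proposal with an accept/reject correction (the hash sampler). The sketch below shows only the general Metropolis-Hastings pattern, not the paper's exact sampler; propose, proposal_prob, and true_loglik are assumed callables supplied by the model:

```python
import math
import random

def mh_resample(f_word, current_e, propose, proposal_prob, true_loglik):
    """One Metropolis-Hastings-style update of the translation for f_word.

    Sketch of the general pattern only: a cheap proposal distribution (e.g.,
    built from a small hash-selected candidate set) replaces the exhaustive
    Gibbs scan over the full target vocabulary, and the expensive true model
    likelihood is evaluated only for the current and the proposed sample.
    """
    new_e = propose(f_word)  # cheap draw from the pruned candidate set
    log_ratio = (true_loglik(f_word, new_e) - true_loglik(f_word, current_e)
                 + math.log(proposal_prob(f_word, current_e))
                 - math.log(proposal_prob(f_word, new_e)))
    if math.log(random.random()) < min(0.0, log_ratio):
        return new_e          # accept the proposed translation
    return current_e          # reject and keep the old one
```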


language model

Appears in 7 sentences as: language model (7)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. For P(e), we use a word n-gram language model (LM) trained on monolingual target text.
    Page 2, “Decipherment Model for Machine Translation”
  2. Generate a target (e.g., English) string e = e1...el with probability P(e) according to an n-gram language model.
    Page 2, “Decipherment Model for Machine Translation”
  3. Secondly, for Bayesian inference we need to sample from a distribution that involves computing probabilities for all the components (language model, translation model, fertility, etc.)
    Page 4, “Bayesian MT Decipherment via Hash Sampling”
  4. Note that the (translation) model in our case consists of multiple exponential-family components: a multinomial pertaining to the language model (which remains fixed), and other components pertaining to the translation probabilities Pθ(fi|ei), fertility, etc.
    Page 5, “Bayesian MT Decipherment via Hash Sampling”
  5. where pold(·), pnew(·) are the true conditional likelihood probabilities according to our model (including the language model component) for the old and new sample, respectively.
    Page 6, “Bayesian MT Decipherment via Hash Sampling”
  6. The latter is used to construct a target language model used for decipherment training.
    Page 7, “Experiments and Results”
  7. Overall, using a 3-gram language model (instead of 2-gram) for decipherment training improves the performance for all methods.
    Page 7, “Experiments and Results”
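
For concreteness, here is a tiny add-alpha-smoothed bigram LM of the kind that could supply P(e); this is an illustrative stand-in, not the LM used in the paper (the experiments use fixed 2-gram/3-gram LMs trained on the monolingual target text):

```python
import math
from collections import defaultdict

class BigramLM:
    """Tiny add-alpha-smoothed bigram LM for scoring target strings P(e).

    Illustrative stand-in only; the paper uses fixed word n-gram LMs
    (2-gram/3-gram) trained on the monolingual target corpus.
    """

    def __init__(self, sentences, alpha=0.1):
        self.alpha = alpha
        self.unigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        self.vocab = set()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.unigrams[prev] += 1
                self.bigrams[(prev, cur)] += 1

    def logprob(self, sentence):
        # Sum of smoothed bigram log-probabilities over the sentence.
        tokens = ["<s>"] + sentence + ["</s>"]
        lp = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, cur)] + self.alpha
            den = self.unigrams[prev] + self.alpha * len(self.vocab)
            lp += math.log(num / den)
        return lp
```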


LM

Appears in 7 sentences as: LM (8)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. For P(e), we use a word n-gram language model (LM) trained on monolingual target text.
    Page 2, “Decipherment Model for Machine Translation”
  2. A high value for the LM concentration parameter α ensures that the LM probabilities do not deviate too far from the original fixed base distribution during sampling.
    Page 5, “Bayesian MT Decipherment via Hash Sampling”
  3. We observe that our method produces much better results than the others even with a 2-gram LM.
    Page 7, “Experiments and Results”
  4. With a 3-gram LM, the new method achieves the best performance; the highest BLEU score reported on this task.
    Page 7, “Experiments and Results”
  5. Bayesian Hash Sampling with 2-gram LM: vocab=full (Ve), add_fertility=no: 4.2; vocab=pruned*, add_fertility=yes: 5.3
    Page 8, “Experiments and Results”
  6. The table also demonstrates the significant speedup achieved by the hash sampler over a standard Gibbs sampler for the same model (~85 times faster when using a 2-gram LM).
    Page 8, “Experiments and Results”
  7. We report the first BLEU score results on such a large-scale task using a 2-gram LM.
    Page 8, “Experiments and Results”


feature vectors

Appears in 6 sentences as: feature vector (3) feature vectors (5)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. But unlike documents, here each word w is associated with a feature vector w1...wd (where wi represents the weight for the feature indexed by i), which is constructed from monolingual corpora.
    Page 3, “Feature-based representation for Source and Target”
  2. Unlike the target word feature vectors (which can be precomputed from the monolingual target corpus), the feature vector for every source word fj is dynamically constructed from the target translation sampled in each training iteration.
    Page 4, “Feature-based representation for Source and Target”
  3. ..., it results in the feature representation becoming more sparse (especially for source feature vectors), which can cause problems in efficiency as well as robustness when computing similarity against other vectors.
    Page 4, “Feature-based representation for Source and Target”
  4. One possible strategy is to compute similarity scores s(wfi, we') between the current source word feature vector wfi and the feature vectors we', e' ∈ Ve, for all possible candidates in the target vocabulary.
    Page 4, “Bayesian MT Decipherment via Hash Sampling”
  5. This makes the complexity far worse (in practice), since the dimensionality d of the feature vectors is very high. Computing similarity scores alone (naively) would incur O(|Ve| · d) time, which is prohibitively expensive since we have to do this for every token in the source language corpus.
    Page 4, “Bayesian MT Decipherment via Hash Sampling”
  6. (a) Generate a proposal distribution by computing the Hamming distance between the feature vectors for the source word and each target translation candidate.
    Page 6, “Training Algorithm”
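
A minimal sketch of the kind of context feature extraction described above, collecting preceding-word, following-word, and same-sentence co-occurrence counts from a monolingual corpus; the feature names are illustrative, not the paper's f-context/f+context notation:

```python
from collections import Counter

def context_features(sentences):
    """Collect sparse context feature vectors for every word in a monolingual corpus.

    Sketch of the kind of features quoted above: the immediately preceding
    word, the immediately following word, and generic same-sentence
    co-occurrence counts. Feature names here are illustrative.
    """
    features = {}
    for sent in sentences:
        for i, word in enumerate(sent):
            vec = features.setdefault(word, Counter())
            if i > 0:
                vec[("prev", sent[i - 1])] += 1   # context preceding the word
            if i + 1 < len(sent):
                vec[("next", sent[i + 1])] += 1   # context following the word
            for j, other in enumerate(sent):
                if j != i:
                    vec[("cooc", other)] += 1     # position-free co-occurrence
    return features
```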


parallel corpora

Appears in 5 sentences as: parallel corpora (5)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. Statistical machine translation (SMT) systems these days are built using large amounts of bilingual parallel corpora.
    Page 1, “Introduction”
  2. The parallel corpora are used to estimate translation model parameters involving word-to-word translation tables, fertilities, distortion, phrase translations, syntactic transformations, etc.
    Page 1, “Introduction”
  3. OPUS movie subtitle corpus (Tiedemann, 2009): This is a large open source collection of parallel corpora available for multiple language pairs.
    Page 7, “Experiments and Results”
  4. This is achieved without using any seed lexicon or parallel corpora.
    Page 8, “Experiments and Results”
  5. These when combined with standard MT systems such as Moses (Koehn et al., 2007) trained on parallel corpora , have been shown to yield some BLEU score improvements.
    Page 9, “Discussion and Future Work”


n-gram

Appears in 4 sentences as: n-gram (4)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. For P(e), we use a word n-gram language model (LM) trained on monolingual target text.
    Page 2, “Decipherment Model for Machine Translation”
  2. Generate a target (e.g., English) string 6 = 61.43;, with probability P (6) according to an n-gram language model.
    Page 2, “Decipherment Model for Machine Translation”
  3. For instance, context features for word w may include other words (or phrases) that appear in the immediate context (n-gram window) surrounding w in the monolingual corpus.
    Page 3, “Feature-based representation for Source and Target”
  4. The feature construction process is described in more detail below: Target Language: We represent each word (or phrase) ei with the following contextual features along with their counts: (a) f-context: every (word n-gram, position) pair immediately preceding ei in the monolingual corpus (n=1, position=-1), (b) similar features f+context to model the context following ei, and (c) we also throw in generic context features without position information: every word that co-occurs with ei in the same sentence.
    Page 3, “Feature-based representation for Source and Target”


generative process

Appears in 3 sentences as: generative process (3)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. So, instead we use a simplified generative process for the translation model as proposed by Ravi and Knight (2011b) and used by others (Nuhn et al., 2012) for this task:
    Page 2, “Decipherment Model for Machine Translation”
  2. We now extend the generative process (described earlier) to more complex translation models.
    Page 3, “Decipherment Model for Machine Translation”
  3. Nonlocal Reordering: The generative process described earlier limits reordering to local or adjacent word pairs in a source sentence.
    Page 3, “Decipherment Model for Machine Translation”
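
As a toy illustration of the simplified generative story referenced above (and not the paper's full model with fertility and nonlocal reordering), the sketch below draws a target sentence from the LM, substitutes each word via P(f|e), and applies only local adjacent swaps; sample_target_sentence and sample_translation are assumed callables:

```python
import random

def generate_source(sample_target_sentence, sample_translation, swap_prob=0.1):
    """Toy version of the simplified generative story quoted above.

    Sketch only: (1) draw a target sentence e from the n-gram LM, (2) replace
    each target word with a source word drawn from P(f|e), and (3) optionally
    swap adjacent source words to model local reordering. The fertility and
    nonlocal-reordering extensions of the full model are omitted here.
    """
    e = sample_target_sentence()                  # step 1: from the LM
    f = [sample_translation(e_i) for e_i in e]    # step 2: word substitution
    i = 0
    while i + 1 < len(f):                         # step 3: adjacent swaps
        if random.random() < swap_prob:
            f[i], f[i + 1] = f[i + 1], f[i]
            i += 2
        else:
            i += 1
    return f
```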


model parameters

Appears in 3 sentences as: model parameters (3)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. The parallel corpora are used to estimate translation model parameters involving word-to-word translation tables, fertilities, distortion, phrase translations, syntactic transformations, etc.
    Page 1, “Introduction”
  2. During decipherment training, our objective is to estimate the model parameters in order to maximize the probability of the source text f as suggested by Ravi and Knight (2011b).
    Page 2, “Decipherment Model for Machine Translation”
  3. Instead, we propose a new Bayesian inference framework to estimate the translation model parameters.
    Page 2, “Decipherment Model for Machine Translation”


parallel data

Appears in 3 sentences as: parallel data (3)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. But obtaining parallel data is an expensive process and it is not available for all language pairs.
    Page 1, “Introduction”
  2. We also report the first BLEU results on such a large-scale MT task under truly nonparallel settings (without using any parallel data or seed lexicon).
    Page 7, “Experiments and Results”
  3. The results are encouraging and demonstrate the ability of the method to scale to large-scale settings while performing efficient inference with complex models, which we believe will be especially useful for future MT applications in scenarios where parallel data is hard to obtain.
    Page 8, “Experiments and Results”


similarity scores

Appears in 3 sentences as: similarity scores (3)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. One possible strategy is to compute similarity scores s(wfi, we') between the current source word feature vector wfi and the feature vectors we', e' ∈ Ve, for all possible candidates in the target vocabulary.
    Page 4, “Bayesian MT Decipherment via Hash Sampling”
  2. Following this, we can prune the translation candidate set by keeping only the top candidates e* according to the similarity scores.
    Page 4, “Bayesian MT Decipherment via Hash Sampling”
  3. This makes the complexity far worse (in practice), since the dimensionality d of the feature vectors is very high. Computing similarity scores alone (naively) would incur O(|Ve| · d) time, which is prohibitively expensive since we have to do this for every token in the source language corpus.
    Page 4, “Bayesian MT Decipherment via Hash Sampling”
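
To avoid the O(|Ve| · d) scan discussed above, one standard trick is to hash feature vectors to short binary signatures and rank candidates by Hamming distance; the random-hyperplane sketch below is one flavor of locality-sensitive hashing assumed here for illustration, not necessarily the exact hash scheme used in the paper:

```python
import numpy as np

def binary_signatures(feature_matrix, n_bits=64, seed=0):
    """Map feature vectors to n_bits-bit signatures via random hyperplanes.

    One flavor of locality-sensitive hashing, used for illustration only.
    feature_matrix is a dense array of shape (|V|, d).
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((feature_matrix.shape[1], n_bits))
    return feature_matrix @ planes > 0            # boolean array (|V|, n_bits)

def top_candidates(source_sig, target_sigs, k=50):
    # Rank all target signatures by Hamming distance to the source signature
    # and keep the k closest as the pruned candidate set.
    dists = np.count_nonzero(target_sigs != source_sig, axis=1)
    return np.argsort(dists)[:k]
```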


statistical machine translation

Appears in 3 sentences as: Statistical machine translation (1) statistical machine translation (2)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora.
    Page 1, “Abstract”
  2. Statistical machine translation (SMT) systems these days are built using large amounts of bilingual parallel corpora.
    Page 1, “Introduction”
  3. Recently, this topic has been receiving increasing attention from researchers and new methods have been proposed to train statistical machine translation models using only monolingual data in the source and target language.
    Page 1, “Introduction”


topic models

Appears in 3 sentences as: topic model (1) topic models (2)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. Similarly, we can add other features based on topic models, orthography (Haghighi et al., 2008), temporal (Klementiev et al., 2012), etc.
    Page 3, “Feature-based representation for Source and Target”
  2. We note that the new sampling framework is easily extensible to many additional feature types (for example, monolingual topic model features, etc.)
    Page 4, “Feature-based representation for Source and Target”
  3. Firstly, we would like to include as many features as possible to represent the source/target words in our framework besides simple bag-of-words context similarity (for example, left-context, right-context, and other general-purpose features based on topic models, etc.).
    Page 4, “Bayesian MT Decipherment via Hash Sampling”


bag-of-words

Appears in 3 sentences as: bag-of-words (3)
In Scalable Decipherment for Machine Translation via Hash Sampling
  1. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models.
    Page 1, “Abstract”
  2. Secondly, we introduce a new feature-based representation for sampling translation candidates that allows one to incorporate any amount of additional features (beyond simple bag-of-words) as side-information during decipherment training.
    Page 2, “Introduction”
  3. Firstly, we would like to include as many features as possible to represent the source/target words in our framework besides simple bag-of-words context similarity (for example, left-context, right-context, and other general-purpose features based on topic models, etc.).
    Page 4, “Bayesian MT Decipherment via Hash Sampling”
