A Markov Model of Machine Translation using Non-parametric Bayesian Inference
Feng, Yang and Cohn, Trevor

Article Structure

Abstract

Most modern machine translation systems use phrase pairs as translation units, allowing for accurate modelling of phrase-internal translation and reordering.

Introduction

Recent years have witnessed burgeoning development of statistical machine translation research, notably phrase-based (Koehn et al., 2003) and syntax-based approaches (Chiang, 2005; Galley et al., 2006; Liu et al., 2006).

Related Work

Word based models have a long history in machine translation, starting with the venerable IBM translation models (Brown et al., 1993) and the hidden Markov model (Vogel et al., 1996).

Model

Given a source sentence, our model infers a latent derivation which produces a target translation and meanwhile gives a word alignment between the source and the target.

Gibbs Sampling

To train the model, we use Gibbs sampling, a Markov Chain Monte Carlo (MCMC) technique for posterior inference.
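
The excerpts collected under the "Gibbs sampling" topic below add that the sampler visits the sentence pairs in random order, resamples the aligned source position of every target word, and runs for 1000 burn-in iterations followed by 500 further iterations thinned to every 50th sample. The Python sketch below is only a schematic rendering of that loop under those assumptions; the corpus layout and the model.decrement / model.resample_alignment / model.increment methods are hypothetical stand-ins, not the authors' implementation.

```python
import random

def gibbs_sample(corpus, model, burn_in=1000, iters=500, thin=50):
    """Schematic collapsed Gibbs sampler over word alignments.
    `corpus` holds (source, target, alignment) triples, where alignment[i]
    is the source position linked to target position i; `model` exposes
    hypothetical increment/decrement/resample_alignment methods."""
    samples = []
    for it in range(burn_in + iters):
        order = list(range(len(corpus)))
        random.shuffle(order)                        # visit sentence pairs in random order
        for idx in order:
            source, target, alignment = corpus[idx]
            for i in range(len(target)):
                model.decrement(source, target, alignment, i)             # remove the current decision
                alignment[i] = model.resample_alignment(source, target, alignment, i)
                model.increment(source, target, alignment, i)             # restore counts with the new decision
        if it >= burn_in and (it - burn_in) % thin == 0:                  # keep every `thin`-th post-burn-in sample
            samples.append([list(a) for (_, _, a) in corpus])
    return samples
```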

Experiments

In principle our model could be directly used as a MT decoder or as a feature in a decoder.

Conclusions and Future Work

This paper proposes a word-based Markov model of translation which correlates translation decisions by conditioning on recent decisions, and incorporates a hierarchical Pitman-Yor process prior permitting elaborate backoff behaviour.
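
The hierarchical Pitman-Yor prior is what provides the "elaborate backoff behaviour" mentioned here. As background only, the standard Pitman-Yor predictive probability in its Chinese-restaurant-process form is shown below, with discount d, concentration θ, customer counts c, table counts t, and base distribution P_0; the paper's exact parameterisation and conditioning contexts may differ.

```latex
P(w \mid \text{seating}) \;=\; \frac{c_w - d\, t_w}{c_{\bullet} + \theta}
  \;+\; \frac{\theta + d\, t_{\bullet}}{c_{\bullet} + \theta}\; P_0(w)
```

In the hierarchical version, P_0 is itself a Pitman-Yor process over a shorter conditioning context, so probability mass falls back smoothly from richer to simpler contexts.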

Topics

phrase-based

Appears in 16 sentences as: phrase-based (16)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. However phrase-based approaches are much less able to model sentence level effects between different phrase-pairs.
    Page 1, “Abstract”
  2. Recent years have witnessed burgeoning development of statistical machine translation research, notably phrase-based (Koehn et al., 2003) and syntax-based approaches (Chiang, 2005; Galley et al., 2006; Liu et al., 2006).
    Page 1, “Introduction”
  3. These approaches model sentence translation as a sequence of simple translation decisions, such as the application of a phrase translation in phrase-based methods or a grammar rule in syntax-based approaches.
    Page 1, “Introduction”
  4. This conflicts with the intuition behind phrase-based MT, namely that translation decisions should be dependent on con-
    Page 1, “Introduction”
  5. 3. providing a unifying framework spanning word-based and phrase-based model of translation, while incorporating explicit transla-
    Page 1, “Introduction”
  6. The model produces uniformly better translations than those of a competitive phrase-based baseline, amounting to an improvement of up to 3.4 BLEU points absolute.
    Page 2, “Introduction”
  7. This idea has been developed explicitly in a number of previous approaches, in grammar based (Chiang, 2005) and phrase-based systems (Galley and Manning, 2010).
    Page 2, “Related Work”
  8. We consider a process in which the target string is generated using a left-to-right order, similar to the decoding strategy used by phrase-based machine translation systems (Koehn et al., 2003).
    Page 2, “Model”
  9. In contrast to phrase-based models, we use words as our basic translation unit, rather than multi-word phrases.
    Page 2, “Model”
  10. However in this paper we limit our focus to inducing word alignments, i.e., by using the model to infer alignments which are then used in a standard phrase-based translation pipeline.
    Page 6, “Experiments”
  11. We leave full decoding for later work, which we anticipate would further improve performance by exploiting gapping phrases and other phenomena that implicitly form part of our model but are not represented in the phrase-based decoder.
    Page 6, “Experiments”

BLEU

Appears in 13 sentences as: BLEU (13)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU.
    Page 1, “Abstract”
  2. The model produces uniformly better translations than those of a competitive phrase-based baseline, amounting to an improvement of up to 3.4 BLEU points absolute.
    Page 2, “Introduction”
  3. We compared the performance of Moses using the alignment produced by our model and the baseline alignment, evaluating translation quality using BLEU (Papineni et al., 2002) with case-insensitive n-gram matching with n = 4.
    Page 7, “Experiments”
  4. We used minimum error rate training (Och, 2003) to tune the feature weights to maximise the BLEU score on the development set.
    Page 7, “Experiments”
  5. The effect on translation scores is modest, roughly amounting to +0.2 BLEU versus using a single sample.
    Page 7, “Experiments”
  6. Table 2: Impact of adding factors to our Markov model, showing BLEU scores on IWSLT.
    Page 7, “Experiments”
  7. Note that even the simplest Markov model far outperforms the GIZA++ baseline (+1.5 BLEU) despite the baseline (IBM model 4) including a number of advanced features (e.g., jump, fertility) that are not present in the basic Markov model.
    Page 7, “Experiments”
  8. Jump yields an improvement of +1 BLEU by capturing consistent reordering patterns.
    Page 7, “Experiments”
  9. Adding fertility results in a further +1 BLEU point improvement.
    Page 7, “Experiments”
  10. While the baseline results vary by up to 1.7 BLEU points for the different alignments, our Markov model provided more stable results with the biggest difference of 0.6.
    Page 8, “Experiments”
  11. Table 4: Translation performance on Chinese to English translation, showing BLEU % for models trained on the FBIS data set.
    Page 9, “Experiments”
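
Excerpt 3 above describes scoring with case-insensitive matching of n-grams up to n = 4. As a reminder of what that metric computes, here is a minimal single-reference BLEU sketch (clipped n-gram precisions combined with a brevity penalty); the reported results were produced with the standard multi-reference tooling, so treat this as illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Minimal case-insensitive, single-reference BLEU-4 (illustrative only)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngr, ref_ngr = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, ref_ngr[g]) for g, c in cand_ngr.items())   # clipped n-gram counts
        total = max(sum(cand_ngr.values()), 1)
        log_prec += math.log(max(overlap, 1e-9) / total) / max_n         # geometric mean of precisions
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))  # brevity penalty
    return bp * math.exp(log_prec)

# Example:
# bleu("the cat sat on the mat", "the cat is on the mat")  -> a value in [0, 1]
```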

Gibbs sampling

Appears in 9 sentences as: Gibbs sampler (3) Gibbs samplers (2) Gibbs sampling (5)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. This makes the approach more suitable for learning alignments, e.g., to account for word fertilities (see §3.3), while also permitting inference using Gibbs sampling (§4).
    Page 3, “Model”
  2. To train the model, we use Gibbs sampling, a Markov Chain Monte Carlo (MCMC) technique for posterior inference.
    Page 5, “Gibbs Sampling”
  3. Our Gibbs sampler operates by sampling an update to the alignment of each target word in the corpus.
    Page 5, “Gibbs Sampling”
  4. (2009a) of using multiple processors to perform approximate Gibbs sampling which they showed achieved equivalent performance to the exact Gibbs sampler.
    Page 6, “Gibbs Sampling”
  5. For each data set, Gibbs sampling was performed on the training set in each direction (source-to-target and target-to-source), initialized using GIZA++. We used the grow heuristic to combine the GIZA++ alignments in both directions (Koehn et al., 2003), which we then intersect with the predictions of GIZA++ in the relevant translation direction.
    Page 7, “Experiments”
  6. The two Gibbs samplers were “burned in” for the first 1000 iterations, after which we ran a further 500 iterations selecting every 50th sample.
    Page 7, “Experiments”
  7. Because the data set is small, we performed Gibbs sampling on a single processor.
    Page 7, “Experiments”
  8. The Gibbs samplers were initialized with three different alignments, shown as columns.
    Page 8, “Experiments”
  9. As the FBIS data set is large, we employed 3-processor MPI for each Gibbs sampler, which ran in half the time compared to using a single processor.
    Page 9, “Experiments”
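
Excerpt 5 above combines the two directional GIZA++ alignments with the grow heuristic of Koehn et al. (2003) and then intersects the result with the directional alignment for the relevant translation direction. The sketch below is a simplified version of that style of symmetrisation (seeded with the intersection, then grown with adjacent points from the union); the heuristic actually used may differ in details such as diagonal neighbours and final unaligned-word steps.

```python
def grow(src2tgt, tgt2src):
    """Simplified grow symmetrisation.
    src2tgt: iterable of (src_pos, tgt_pos) links from the source-to-target model.
    tgt2src: iterable of (tgt_pos, src_pos) links from the target-to-source model."""
    s2t = set(src2tgt)
    t2s = {(i, j) for (j, i) in tgt2src}          # flip into (src_pos, tgt_pos)
    alignment = s2t & t2s                         # start from the intersection
    union = s2t | t2s
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - alignment):
            aligned_src = {a for (a, _) in alignment}
            aligned_tgt = {b for (_, b) in alignment}
            touches = any((i + di, j + dj) in alignment for di, dj in neighbours)
            if touches and (i not in aligned_src or j not in aligned_tgt):
                alignment.add((i, j))             # grow with a neighbouring union point
                added = True
    return alignment

# The excerpt then intersects such a grown alignment with the directional
# output for the relevant translation direction, e.g.:
# final = grow(s2t_links, t2s_links) & set(s2t_links)
```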

machine translation

Appears in 7 sentences as: Machine translation (1) machine translation (6)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. Most modern machine translation systems use phrase pairs as translation units, allowing for accurate modelling of phrase-internal translation and reordering.
    Page 1, “Abstract”
  2. Recent years have witnessed burgeoning development of statistical machine translation research, notably phrase-based (Koehn et al., 2003) and syntax-based approaches (Chiang, 2005; Galley et al., 2006; Liu et al., 2006).
    Page 1, “Introduction”
  3. Word based models have a long history in machine translation, starting with the venerable IBM translation models (Brown et al., 1993) and the hidden Markov model (Vogel et al., 1996).
    Page 2, “Related Work”
  4. More recently, a number of authors have proposed Markov models for machine translation.
    Page 2, “Related Work”
  5. We consider a process in which the target string is generated using a left-to-right order, similar to the decoding strategy used by phrase-based machine translation systems (Koehn et al., 2003).
    Page 2, “Model”
  6. We used the Moses machine translation decoder (Koehn et al., 2007), using the default features and decoding settings.
    Page 7, “Experiments”
  7. Table 3: Machine translation performance in BLEU% on the IWSLT 2005 Chinese-English test set.
    Page 8, “Experiments”

sentence pair

Appears in 6 sentences as: sentence pair (3) sentence pairs (3)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. Therefore, we introduce fertility to denote the number of target positions a source word is linked to in a sentence pair.
    Page 5, “Model”
  2. where φ_j is the fertility of source word f_j in the sentence pair ⟨f_1^J, e_1^I⟩ and p_b is the basic model defined in Eq.
    Page 5, “Model”
  3. Specifically we seek to infer the latent sequence of translation decisions given a corpus of sentence pairs.
    Page 5, “Gibbs Sampling”
  4. It visits each sentence pair in the corpus in a random order and resamples the alignments for each target position as follows.
    Page 5, “Gibbs Sampling”
  5. Here the training data consists of the non-UN portions and non-HK Hansards portions of the NIST training corpora distributed by the LDC, totalling 303k sentence pairs with 8m and 9.4m words of Chinese and English, respectively.
    Page 8, “Experiments”
  6. Overall there are 276k sentence pairs and 8.21m and 8.97m words in Arabic and English, respectively.
    Page 9, “Experiments”
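
Excerpts 1 and 2 above define fertility as the number of target positions each source word is linked to in a sentence pair. The tiny snippet below just makes that definition concrete for a single alignment; the variable layout is an assumption for illustration.

```python
from collections import Counter

def fertilities(alignment, src_len):
    """alignment: list where alignment[i] is the source position linked to
    target position i (or None for an unaligned target word)."""
    counts = Counter(j for j in alignment if j is not None)
    return [counts[j] for j in range(src_len)]   # fertility of each source word

# Example: source of length 3, four target words
# fertilities([0, 2, 2, None], 3) -> [1, 0, 2]
```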

word alignment

Appears in 6 sentences as: word alignment (4) word alignments (2)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. In this paper we propose a new model to drop the independence assumption, by instead modelling correlations between translation decisions, which we use to induce translation derivations from aligned sentences (akin to word alignment).
    Page 1, “Introduction”
  2. Given a source sentence, our model infers a latent derivation which produces a target translation and meanwhile gives a word alignment between the source and the target.
    Page 2, “Model”
  3. Given the structure of our model, a word alignment uniquely specifies the translation decisions and the sequence follows the order of the target sentence left to right.
    Page 5, “Gibbs Sampling”
  4. However in this paper we limit our focus to inducing word alignments, i.e., by using the model to infer alignments which are then used in a standard phrase-based translation pipeline.
    Page 6, “Experiments”
  5. We present results on translation quality and word alignment.
    Page 7, “Experiments”
  6. In this paper the model was only used to infer word alignments; in future work we intend to develop a decoding algorithm for directly translating with the model.
    Page 9, “Conclusions and Future Work”

Chinese-English

Appears in 5 sentences as: Chinese-English (5)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. We demonstrate our model on Chinese-English and Arabic-English translation datasets.
    Page 2, “Introduction”
  2. The first experiments are on the IWSLT data set for Chinese-English translation.
    Page 7, “Experiments”
  3. Table 3: Machine translation performance in BLEU% on the IWSLT 2005 Chinese-English test set.
    Page 8, “Experiments”
  4. To test whether our improvements carry over to larger datasets, we assess the performance of our model on the FBIS Chinese-English data set.
    Page 8, “Experiments”
  5. In general the improvements are more modest than for the Chinese-English results above.
    Page 9, “Experiments”

BLEU points

Appears in 3 sentences as: BLEU point (1) BLEU points (2)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. The model produces uniformly better translations than those of a competitive phrase-based baseline, amounting to an improvement of up to 3.4 BLEU points absolute.
    Page 2, “Introduction”
  2. Adding fertility results in a further +1 BLEU point improvement.
    Page 7, “Experiments”
  3. While the baseline results vary by up to 1.7 BLEU points for the different alignments, our Markov model provided more stable results with the biggest difference of 0.6.
    Page 8, “Experiments”

development set

Appears in 3 sentences as: development set (3)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. We used minimum error rate training (Och, 2003) to tune the feature weights to maximise the BLEU score on the development set.
    Page 7, “Experiments”
  2. For the development set we use both ASR devset 1 and 2 from IWSLT 2005, and
    Page 7, “Experiments”
  3. For the development set we use the NIST 2002 test set, and evaluate performance on the test sets from NIST 2003
    Page 8, “Experiments”

generative process

Appears in 3 sentences as: generative process (3)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. The generative process employs the following recursive procedure to construct the target sentence conditioned on the source:
    Page 2, “Model”
  2. This generative process resembles the sequence of translation decisions considered by a standard MT decoder (Koehn et al., 2003), but note that our approach differs in that there is no constraint that all words are translated exactly once.
    Page 3, “Model”
  3. Decoding under our model would be straightforward in principle, as the generative process was designed to closely parallel the search procedure in the phrase-based model. Three data sets were used in the experiments: two Chinese to English data sets on small (IWSLT) and larger corpora (FBIS), and Arabic
    Page 6, “Experiments”
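
The excerpts above only sketch the generative story: the target sentence is built left to right by a sequence of translation decisions, with no requirement that every source word be covered exactly once. The toy Python rendering below shows the shape of such a process; the distributions p_jump, p_emit and p_stop are hypothetical stand-ins, not the paper's actual conditional models (which also include factors such as fertility).

```python
import random

def sample(dist):
    """Draw an outcome from a dict mapping outcomes to probabilities."""
    r, acc, outcome = random.random(), 0.0, None
    for x, p in dist.items():
        outcome, acc = x, acc + p
        if r <= acc:
            break
    return outcome

def generate(source, p_jump, p_emit, p_stop, max_len=100):
    """Toy left-to-right generative process: repeatedly jump to a source
    position and emit a target word conditioned on it and the history.
    p_jump, p_emit and p_stop are hypothetical conditional distributions."""
    target, alignment = [], []
    prev = -1                                    # position before the sentence start
    while len(target) < max_len and random.random() >= p_stop(target):
        j = sample(p_jump(prev, source))         # next source position; words may be skipped or revisited
        e = sample(p_emit(source[j], target))    # target word given the source word and recent decisions
        target.append(e)
        alignment.append(j)
        prev = j
    return target, alignment
```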

language model

Appears in 3 sentences as: language model (4)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. (2011) develop a bilingual language model which incorporates words in the source and target languages to predict the next unit, which they use as a feature in a translation system.
    Page 2, “Related Work”
  2. The language model is a 3-gram language model trained using the SRILM toolkit (Stolcke, 2002) on the English side of the training data.
    Page 7, “Experiments”
  3. The language model is a 3-gram LM trained on the Xinhua portion of the Gigaword corpus using the SRILM toolkit with modified Kneser-Ney smoothing.
    Page 9, “Experiments”

NIST

Appears in 3 sentences as: NIST (4)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. Here the training data consists of the non-UN portions and non-HK Hansards portions of the NIST training corpora distributed by the LDC, totalling 303k sentence pairs with 8m and 9.4m words of Chinese and English, respectively.
    Page 8, “Experiments”
  2. For the development set we use the NIST 2002 test set, and evaluate performance on the test sets from NIST 2003
    Page 8, “Experiments”
  3. We evaluate on the NIST test sets from 2003 and 2005, and the 2002 test set was used for MERT training.
    Page 9, “Experiments”

phrase pairs

Appears in 3 sentences as: phrase pairs (4)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. Most modern machine translation systems use phrase pairs as translation units, allowing for accurate modelling of phrase-internal translation and reordering.
    Page 1, “Abstract”
  2. This mechanism implicitly supports not only traditional phrase pairs, but also gapping phrases which are nonconsecutive in the source.
    Page 1, “Abstract”
  3. In contrast, our model better aligns the function words, such that many more useful phrase pairs can be extracted, i.e., <[Chinese], 'm>, <[Chinese], looking for>, <[Chinese], grill-type> and their combinations with neighbouring phrase pairs.
    Page 8, “Experiments”

translation systems

Appears in 3 sentences as: translation system (1) translation systems (2)
In A Markov Model of Machine Translation using Non-parametric Bayesian Inference
  1. Most modern machine translation systems use phrase pairs as translation units, allowing for accurate modelling of phrase-internal translation and reordering.
    Page 1, “Abstract”
  2. (2011) develop a bilingual language model which incorporates words in the source and target languages to predict the next unit, which they use as a feature in a translation system.
    Page 2, “Related Work”
  3. We consider a process in which the target string is generated using a left-to-right order, similar to the decoding strategy used by phrase-based machine translation systems (Koehn et al., 2003).
    Page 2, “Model”
