An Infinite Hierarchical Bayesian Model of Phrasal Translation
Cohn, Trevor and Haffari, Gholamreza

Article Structure

Abstract

Modern phrase-based machine translation systems make extensive use of word-based translation models for inducing alignments from parallel corpora.

Introduction

The phrase-based approach (Koehn et al., 2003) to machine translation (MT) has transformed MT from a narrow research topic into a truly useful technology to end users.

Related Work

Inversion transduction grammar (or ITG) (Wu, 1997) is a well studied synchronous grammar formalism.

Model

The generative process of the model follows that of ITG with the following simple grammar

Experiments

Datasets: We train our model across three language pairs: Urdu→English (UR-EN), Farsi→English (FA-EN), and Arabic→English (AR-EN).

Analysis

In this section, we present some insights about the learned grammar and the model hyper-parameters.

Topics

phrase-based

Appears in 11 sentences as: phrase-based (12)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. Modern phrase-based machine translation systems make extensive use of word-based translation models for inducing alignments from parallel corpora.
    Page 1, “Abstract”
  2. This paper presents a novel method for inducing phrase-based translation units directly from parallel data, which we frame as learning an inversion transduction grammar (ITG) using a recursive Bayesian prior.
    Page 1, “Abstract”
  3. The phrase-based approach (Koehn et al., 2003) to machine translation (MT) has transformed MT from a narrow research topic into a truly useful technology to end users.
    Page 1, “Introduction”
  4. Word-based translation models (Brown et al., 1993) remain central to phrase-based model training, where they are used to infer word-level alignments from sentence-aligned parallel data, from … (a phrase-extraction sketch follows this list)
    Page 1, “Introduction”
  5. Firstly, many phrase-based phenomena which do not decompose into word translations (e.g., idioms) will be missed, as the underlying word-based alignment model is unlikely to propose the correct alignments.
    Page 1, “Introduction”
  6. This paper develops a phrase-based translation model which aims to address the above shortcomings of the phrase-based translation pipeline.
    Page 1, “Introduction”
  7. The model is richly parameterised, such that it can describe phrase-based phenomena while also explicitly modelling the relationships between phrase-pairs and their component expansions, thus ameliorating the disconnect between the treatment of words versus phrases in the current MT pipeline.
    Page 1, “Introduction”
  8. Moreover, our approach results in consistent translation improvements across a number of translation tasks compared to Neubig et al.’s method, and a competitive phrase-based baseline.
    Page 2, “Introduction”
  9. A number of other approaches have been developed for learning phrase-based models from bilingual data, starting with Marcu and Wong (2002) who developed an extension to IBM model 1 to handle multi-word units.
    Page 2, “Related Work”
  10. As a baseline, we train a phrase-based model using the moses toolkit based on the word alignments obtained using GIZA++ in both directions and symmetrized using the grow-diag-final-and heuristic (Koehn et al., 2003).
    Page 6, “Experiments”
  11. We have presented a novel method for learning a phrase-based model of translation directly from parallel data which we have framed as learning an inversion transduction grammar (ITG) using a recursive Bayesian prior.
    Page 8, “Analysis”
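
The pipeline quoted in mention 4 ends with phrase pairs being extracted heuristically from word alignments. Below is a minimal sketch of the standard consistency-based extraction step, under stated assumptions: the function name and toy data are ours, and it omits the extension over unaligned boundary words that full extractors (e.g., in Moses) perform.

```python
def extract_phrases(e_len, alignment, max_len=7):
    """Extract phrase pairs consistent with a word alignment: every
    alignment point inside the source span must land inside the induced
    target span and vice versa (the standard extraction heuristic)."""
    phrases = set()
    for e1 in range(e_len):
        for e2 in range(e1, min(e1 + max_len, e_len)):
            # Target span induced by the alignment points in [e1, e2].
            fs = [f for e, f in alignment if e1 <= e <= e2]
            if not fs:
                continue
            f1, f2 = min(fs), max(fs)
            if f2 - f1 + 1 > max_len:
                continue
            # Consistency check: no point in [f1, f2] escapes [e1, e2].
            if all(e1 <= e <= e2 for e, f in alignment if f1 <= f <= f2):
                phrases.add(((e1, e2), (f1, f2)))
    return phrases

# Toy sentence pair "the house" / "la maison", aligned 0-0 and 1-1.
print(sorted(extract_phrases(2, {(0, 0), (1, 1)})))
# -> [((0, 0), (0, 0)), ((0, 1), (0, 1)), ((1, 1), (1, 1))]
```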

BLEU

Appears in 6 sentences as: BLEU (6)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. Hence the BLEU scores we get for the baselines may appear lower than those reported in the literature.
    Page 5, “Experiments”
  2. Using the factorised alignments directly in a translation system resulted in a slight loss in BLEU versus using the un-factorised alignments.
    Page 5, “Experiments”
  3. We use minimum error rate training (Och, 2003) with an n-best list size of 100 to optimize the feature weights for maximum development BLEU; a BLEU computation sketch follows this list.
    Page 6, “Experiments”
  4. Table 3 shows the BLEU scores for the three translation tasks UR/AR/FA→EN based on our method against the baselines.
    Page 7, “Experiments”
  5. For our models, we report the average BLEU score of the 5 independent runs as well as that of the aggregate phrase table generated by these 5 independent runs.
    Page 7, “Experiments”
  6. Firstly, combining the phrase tables from independent runs results in increased BLEU scores, possibly due to the representation of uncertainty in the outputs, and the representation of different modes captured by the individual models.
    Page 7, “Experiments”
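
Since every mention above turns on BLEU, here is a minimal sketch of corpus-level BLEU: the geometric mean of modified n-gram precisions, scaled by a brevity penalty. The function names and toy sentence pair are ours, and real evaluations add tokenization and multiple references that this sketch ignores.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis: geometric mean
    of modified n-gram precisions, scaled by a brevity penalty."""
    clipped = Counter()  # clipped n-gram matches, keyed by n
    totals = Counter()   # hypothesis n-gram counts, keyed by n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = Counter(ngrams(ref, n))
            hyp_counts = Counter(ngrams(hyp, n))
            clipped[n] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n] += max(len(hyp) - n + 1, 0)
    if any(clipped[n] == 0 for n in range(1, max_n + 1)):
        return 0.0  # a zero precision zeroes the geometric mean
    log_prec = sum(math.log(clipped[n] / totals[n]) for n in range(1, max_n + 1))
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec / max_n)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat".split()
print(round(corpus_bleu([ref], [hyp]), 3))  # ~0.537
```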

machine translation

Appears in 5 sentences as: machine translation (5)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. Modern phrase-based machine translation systems make extensive use of word-based translation models for inducing alignments from parallel corpora.
    Page 1, “Abstract”
  2. The phrase-based approach (Koehn et al., 2003) to machine translation (MT) has transformed MT from a narrow research topic into a truly useful technology to end users.
    Page 1, “Introduction”
  3. In the context of machine translation, ITG has been explored for statistical word alignment in both unsupervised (Zhang and Gildea, 2005; Cherry and Lin, 2007; Zhang et al., 2008; Pauls et al., 2010) and supervised (Haghighi et al., 2009; Cherry and Lin, 2006) settings, and for decoding (Petrov et al., 2008).
    Page 2, “Related Work”
  4. As mentioned above, ours is not the first work attempting to generalise adaptor grammars for machine translation; Neubig et al. (2011) also developed a similar approach based around ITG using a Pitman-Yor process prior.
    Page 2, “Related Work”
  5. The time complexity of our inference algorithm is O(n^6), which can be prohibitive for large-scale machine translation tasks.
    Page 5, “Experiments”

recursive

Appears in 5 sentences as: recursive (6)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. This paper presents a novel method for inducing phrase-based translation units directly from parallel data, which we frame as learning an inversion transduction grammar (ITG) using a recursive Bayesian prior.
    Page 1, “Abstract”
  2. Additionally, we have extended the model to allow recursive nesting of adapted non-terminals, such that we end up with an infinitely recursive formulation where the top-level and base distributions are explicitly linked together.
    Page 2, “Related Work”
  3. This generative process is mutually recursive: P2 makes draws from P1, and P1 makes draws from P2 (a toy sketch follows this list).
    Page 3, “Model”
  4. where the conditioning of the second recursive call to P2 reflects that the counts n^- and K^- may be affected by the first draw from P2.
    Page 4, “Model”
  5. We have presented a novel method for leam-ing a phrase-based model of translation directly from parallel data which we have framed as leam-ing an inverse transduction grammar (ITG) using a recursive Bayesian prior.
    Page 8, “Analysis”
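
Mentions 3 and 4 describe the heart of the model: an adapted distribution P2 whose base distribution P1 can itself call back into P2, with customer and table counts cached under a Pitman-Yor prior. The toy sketch below shows the mechanics of such a mutually recursive Pitman-Yor adaptor; the class, vocabulary, rule probabilities, and hyperparameter values are invented for illustration and are not the paper's actual model.

```python
import random

class PYP:
    """Chinese-restaurant view of a Pitman-Yor process with discount a,
    concentration b, and a callable base distribution."""
    def __init__(self, a, b, base):
        self.a, self.b, self.base = a, b, base
        self.tables = []  # the value ("dish") served at each table
        self.counts = []  # customers seated at each table
        self.n = 0        # total customers

    def draw(self):
        # Sit at table k with prob (counts[k] - a) / (n + b), reusing its
        # cached value; otherwise open a new table with a fresh base draw.
        u = random.uniform(0, self.n + self.b)
        for k, c in enumerate(self.counts):
            u -= c - self.a
            if u < 0:
                self.counts[k] += 1
                self.n += 1
                return self.tables[k]
        value = self.base()
        self.tables.append(value)
        self.counts.append(1)
        self.n += 1
        return value

# Toy mutual recursion in the spirit of the paper's P1/P2: P2 caches
# phrase-pairs; its base P1 either emits a word pair or composes two
# smaller draws from P2, mirroring the ITG's terminal and binary rules.
def p1():
    if random.random() < 0.6:  # terminal rule: emit a word pair
        return ((random.choice(["the", "blue", "house"]),
                 random.choice(["la", "bleue", "maison"])),)
    return p2.draw() + p2.draw()  # binary rule: compose cached phrases

p2 = PYP(a=0.5, b=1.0, base=p1)
random.seed(1)
for _ in range(5):
    print(p2.draw())
```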

translation tasks

Appears in 5 sentences as: translation tasks (5)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. Moreover, our approach results in consistent translation improvements across a number of translation tasks compared to Neubig et al.’s method, and a competitive phrase-based baseline.
    Page 2, “Introduction”
  2. The corpora statistics of these translation tasks are summarised in Table 2.
    Page 5, “Experiments”
  3. The time complexity of our inference algorithm is O(n^6), which can be prohibitive for large-scale machine translation tasks.
    Page 5, “Experiments”
  4. Table 3 shows the BLEU scores for the three translation tasks UR/AR/FA→EN based on our method against the baselines.
    Page 7, “Experiments”
  5. Our experiments on Urdu-English, Arabic-English, and Farsi-English translation tasks all demonstrate improvements over competitive baseline systems.
    Page 8, “Analysis”

word alignments

Appears in 5 sentences as: word alignment (2) word alignments (3)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. In the context of machine translation, ITG has been explored for statistical word alignment in both unsupervised (Zhang and Gildea, 2005; Cherry and Lin, 2007; Zhang et al., 2008; Pauls et al., 2010) and supervised (Haghighi et al., 2009; Cherry and Lin, 2006) settings, and for decoding (Petrov et al., 2008).
    Page 2, “Related Work”
  2. Our paper fits into the recent line of work for jointly inducing the phrase table and word alignment (DeNero and Klein, 2010; Neubig et al., 2011).
    Page 2, “Related Work”
  3. Following Levenberg et al. (2012) and Neubig et al. (2011), we evaluate our model by using its output word alignments to construct a phrase table.
    Page 6, “Experiments”
  4. As a baseline, we train a phrase-based model using the moses toolkit based on the word alignments obtained using GIZA++ in both directions and symmetrized using the grow-diag-final-and heuristic (Koehn et al., 2003); a symmetrization sketch follows this list.
    Page 6, “Experiments”
  5. These are taken from the final Model 4 word alignments, using the intersection of the source-target and target-source models.
    Page 6, “Experiments”
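
Mention 4 relies on the grow-diag-final-and heuristic for symmetrizing the two directional GIZA++ alignments. Below is a rough sketch of the idea; Moses's real implementation differs in detail (it fixes the order in which points are considered and separates the final and final-and passes), so treat this as illustrative only.

```python
def grow_diag_final_and(e2f, f2e):
    """Symmetrise two directional word alignments (sets of (e, f) index
    pairs), roughly following the grow-diag-final-and heuristic."""
    union = e2f | f2e
    alignment = set(e2f & f2e)  # start from the intersection
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1),
                  (-1, -1), (-1, 1), (1, -1), (1, 1)]

    def e_aligned(e): return any(pe == e for pe, _ in alignment)
    def f_aligned(f): return any(pf == f for _, pf in alignment)

    # grow-diag: keep adding union points adjacent (incl. diagonally) to
    # current points, as long as they cover a still-unaligned word.
    grew = True
    while grew:
        grew = False
        for e, f in sorted(alignment):
            for de, df in neighbours:
                cand = (e + de, f + df)
                if (cand in union and cand not in alignment
                        and (not e_aligned(cand[0]) or not f_aligned(cand[1]))):
                    alignment.add(cand)
                    grew = True
    # final-and: add remaining union points whose words are both unaligned.
    for e, f in sorted(union - alignment):
        if not e_aligned(e) and not f_aligned(f):
            alignment.add((e, f))
    return alignment

e2f = {(0, 0), (1, 1), (2, 2)}
f2e = {(0, 0), (1, 2), (2, 2)}
print(sorted(grow_diag_final_and(e2f, f2e)))  # [(0, 0), (1, 1), (2, 2)]
```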

baseline system

Appears in 4 sentences as: baseline system (2) baseline systems (2)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. Our experiments on Arabic, Urdu and Farsi to English demonstrate improvements over competitive baseline systems.
    Page 1, “Abstract”
  2. Our approach improves upon theirs in terms of the model and inference, and critically, this is borne out in our experiments where we show uniform improvements in translation quality over a baseline system, as compared to their almost entirely negative results.
    Page 2, “Related Work”
  3. Our baseline system uses the latter.
    Page 5, “Experiments”
  4. Our experiments on Urdu-English, Arabic-English, and Farsi-English translation tasks all demonstrate improvements over competitive baseline systems.
    Page 8, “Analysis”

BLEU scores

Appears in 4 sentences as: BLEU score (1) BLEU scores (3)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. Hence the BLEU scores we get for the baselines may appear lower than those reported in the literature.
    Page 5, “Experiments”
  2. Table 3 shows the BLEU scores for the three translation tasks UR/AR/FA→EN based on our method against the baselines.
    Page 7, “Experiments”
  3. For our models, we report the average BLEU score of the 5 independent runs as well as that of the aggregate phrase table generated by these 5 independent runs.
    Page 7, “Experiments”
  4. Firstly, combining the phrase tables from independent runs results in increased BLEU scores, possibly due to the representation of uncertainty in the outputs, and the representation of different modes captured by the individual models.
    Page 7, “Experiments”

language model

Appears in 4 sentences as: language model (2) language modelling (1) language models (1)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. We develop a Bayesian approach using a Pitman-Yor process prior, which is capable of modelling a diverse range of geometrically decaying distributions over infinite event spaces (here translation phrase-pairs), an approach shown to be state of the art for language modelling (Teh, 2006).
    Page 1, “Introduction”
  2. In the end-to-end MT pipeline we use a standard set of features: relative-frequency and lexical translation model probabilities in both directions; distance-based distortion model; language model and word count.
    Page 6, “Experiments”
  3. We train 3-gram language models using modified Kneser-Ney smoothing; a smoothing sketch follows this list.
    Page 6, “Experiments”
  4. For AR-EN experiments the language model is trained on English data as in Blunsom et al. (2009a), and for FA-EN and UR-EN the English data are the target sides of the bilingual training data.
    Page 6, “Experiments”
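
Mention 3's modified Kneser-Ney smoothing is easiest to see in its plain interpolated form. The sketch below implements one-discount interpolated Kneser-Ney for bigrams; the modified variant used in the paper employs several count-dependent discounts and higher orders, and the function and toy corpus here are ours.

```python
from collections import Counter, defaultdict

def train_kn_bigram(sentences, discount=0.75):
    """Interpolated Kneser-Ney for bigrams with a single fixed discount
    (the modified variant uses several count-dependent discounts)."""
    bigrams, histories = Counter(), Counter()
    continuations = defaultdict(set)  # words seen after each history
    preceders = defaultdict(set)      # distinct left contexts of each word
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[(w1, w2)] += 1
            histories[w1] += 1
            continuations[w1].add(w2)
            preceders[w2].add(w1)
    bigram_types = len(bigrams)

    def prob(w1, w2):
        # Discounted bigram estimate plus leftover mass times the
        # Kneser-Ney continuation probability of w2.
        p_cont = len(preceders[w2]) / bigram_types
        if histories[w1] == 0:
            return p_cont
        discounted = max(bigrams[(w1, w2)] - discount, 0) / histories[w1]
        backoff = discount * len(continuations[w1]) / histories[w1]
        return discounted + backoff * p_cont

    return prob

p = train_kn_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(p("the", "cat"), p("the", "sat"))  # seen vs. unseen bigram
```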

phrase table

Appears in 4 sentences as: phrase table (3) phrase tables (1)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. Our paper fits into the recent line of work for jointly inducing the phrase table and word alignment (DeNero and Klein, 2010; Neubig et al., 2011).
    Page 2, “Related Work”
  2. Following Levenberg et al. (2012) and Neubig et al. (2011), we evaluate our model by using its output word alignments to construct a phrase table.
    Page 6, “Experiments”
  3. For our models, we report the average BLEU score of the 5 independent runs as well as that of the aggregate phrase table generated by these 5 independent runs.
    Page 7, “Experiments”
  4. Firstly, combining the phrase tables from independent runs results in increased BLEU scores, possibly due to the representation of uncertainty in the outputs, and the representation of different modes captured by the individual models.
    Page 7, “Experiments”

sentence pair

Appears in 4 sentences as: sentence pair (4)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. … additional constraints on how phrase-pairs can be tiled to produce a sentence pair, and moreover, we seek to model the embedding of phrase-pairs in one another, something not considered by this prior work.
    Page 2, “Related Work”
  2. This way we don’t insist on a single tiling of phrases for a sentence pair, but explicitly model the set of hierarchically nested phrases as defined by an ITG derivation.
    Page 3, “Model”
  3. S → sig(t) : (n_t − K_t a_s) / (n + b_s); sig(t) → yield(t) : 1. For every word pair e/f in the sentence pair, …
    Page 5, “Model”
  4. This process is then repeated for each sentence pair in the corpus in a random order.
    Page 5, “Model”

translation model

Appears in 4 sentences as: translation model (2) translation models (2)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. Modern phrase-based machine translation systems make extensive use of word-based translation models for inducing alignments from parallel corpora.
    Page 1, “Abstract”
  2. Word-based translation models (Brown et al., 1993) remain central to phrase-based model training, where they are used to infer word-level alignments from sentence-aligned parallel data, from …
    Page 1, “Introduction”
  3. This paper develops a phrase-based translation model which aims to address the above shortcomings of the phrase-based translation pipeline.
    Page 1, “Introduction”
  4. In the end-to-end MT pipeline we use a standard set of features: relative-frequency and lexical translation model probabilities in both directions; distance-based distortion model; language model and word count.
    Page 6, “Experiments”
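
The relative-frequency translation model probabilities in mention 4 are simple conditional relative frequencies over extracted phrase-pair counts. A minimal sketch, with an invented toy phrase-pair list:

```python
from collections import Counter

def relative_frequency(phrase_pairs):
    """Relative-frequency phrase translation probabilities in both
    directions, p(f|e) and p(e|f), from phrase-pair counts."""
    pair_counts = Counter(phrase_pairs)
    e_counts = Counter(e for e, f in phrase_pairs)
    f_counts = Counter(f for e, f in phrase_pairs)
    p_f_given_e = {(e, f): c / e_counts[e] for (e, f), c in pair_counts.items()}
    p_e_given_f = {(e, f): c / f_counts[f] for (e, f), c in pair_counts.items()}
    return p_f_given_e, p_e_given_f

pairs = [("the house", "la maison"), ("the house", "la maison"),
         ("the house", "maison"), ("blue", "bleue")]
fwd, rev = relative_frequency(pairs)
print(fwd[("the house", "la maison")])  # 2/3
print(rev[("the house", "la maison")])  # 1.0
```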

generative process

Appears in 3 sentences as: generative process (3)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. The generative process of the model follows that of ITG with the following simple grammar
    Page 3, “Model”
  2. The generative process is that we draw a complete ITG tree, t ~ P2, as follows (a toy generator is sketched after this list):
    Page 3, “Model”
  3. This generative process is mutually recursive: P2 makes draws from P1 and P1 makes draws from P2.
    Page 3, “Model”
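
The simple ITG grammar behind this generative process has three rule types: straight concatenation, inverted concatenation, and terminal word-pair emission. Below is a toy generator in that spirit; the vocabulary and rule probabilities are invented, and the actual model draws them under the Pitman-Yor prior rather than fixing them.

```python
import random

# The simple ITG grammar has three rule types:
#   X -> [X X]   straight: children concatenated in the same order
#   X -> <X X>   inverted: child order reversed on the target side
#   X -> e/f     terminal: emit a word pair
LEXICON = [("the", "la"), ("blue", "bleue"), ("house", "maison")]

def draw_itg(p_term=0.6):
    """Draw an aligned source/target string pair from the toy ITG."""
    if random.random() < p_term:
        e, f = random.choice(LEXICON)  # terminal rule
        return [e], [f]
    e1, f1 = draw_itg(p_term)
    e2, f2 = draw_itg(p_term)
    if random.random() < 0.5:          # straight rule [X X]
        return e1 + e2, f1 + f2
    return e1 + e2, f2 + f1            # inverted rule <X X>

random.seed(0)
e, f = draw_itg()
print(" ".join(e), "|||", " ".join(f))
```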

parallel data

Appears in 3 sentences as: parallel data (3)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. This paper presents a novel method for inducing phrase-based translation units directly from parallel data, which we frame as learning an inversion transduction grammar (ITG) using a recursive Bayesian prior.
    Page 1, “Abstract”
  2. Word-based translation models (Brown et al., 1993) remain central to phrase-based model training, where they are used to infer word-level alignments from sentence-aligned parallel data, from …
    Page 1, “Introduction”
  3. We have presented a novel method for learning a phrase-based model of translation directly from parallel data which we have framed as learning an inversion transduction grammar (ITG) using a recursive Bayesian prior.
    Page 8, “Analysis”

time complexity

Appears in 3 sentences as: time complexity (3)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. The time complexity of our inference algorithm is O(n^6), which can be prohibitive for large-scale machine translation tasks; a counting sketch follows this list.
    Page 5, “Experiments”
  2. The average time complexity for the latter is roughly O(l^4), as plotted in green: t = 2 × 10^-7 l^4.
    Page 6, “Experiments”
  3. However, the time complexity is still high, so we set the maximum sentence length to 30 to keep our experiments practicable.
    Page 6, “Experiments”
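
The O(n^6) figure in mentions 1 and 3 comes from exhaustive bitext chart parsing: O(n^4) chart cells (one per source-span/target-span pair), each combined over O(n^2) split-point pairs. The sketch below simply counts those combination steps; the function is ours and ignores constant factors such as the straight/inverted rule choice.

```python
def itg_chart_operations(n):
    """Count basic combination steps in exhaustive ITG biparsing of an
    n-by-n sentence pair: O(n^4) cells (a source span crossed with a
    target span), each built from O(n^2) split-point pairs."""
    ops = 0
    for i in range(n):                  # source span [i, j)
        for j in range(i + 1, n + 1):
            for k in range(n):          # target span [k, l)
                for l in range(k + 1, n + 1):
                    ops += (j - i - 1) * (l - k - 1)  # split choices
    return ops

for n in (10, 20, 30):
    print(n, itg_chart_operations(n))
# Doubling n multiplies the work by roughly 2^6 = 64, which is why the
# paper caps sentences at length 30 to keep experiments practicable.
```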

translation systems

Appears in 3 sentences as: translation system (1) translation systems (2)
In An Infinite Hierarchical Bayesian Model of Phrasal Translation
  1. Modern phrase-based machine translation systems make extensive use of word-based translation models for inducing alignments from parallel corpora.
    Page 1, “Abstract”
  2. Leading translation systems (Chiang, 2007; Koehn et al., 2007; Marcu et al., 2006) all use some kind of multi-word translation unit, which allows translations to be produced from large canned units of text from the training corpus.
    Page 1, “Introduction”
  3. Using the factorised alignments directly in a translation system resulted in a slight loss in BLEU versus using the un-factorised alignments.
    Page 5, “Experiments”
