Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
Zhang, Hao and Quirk, Chris and Moore, Robert C. and Gildea, Daniel

Article Structure

Abstract

We combine the strengths of Bayesian modeling and synchronous grammar in unsupervised learning of basic translation phrase pairs.

Introduction

Most state-of-the-art statistical machine translation systems are based on large phrase tables extracted from parallel text using word-level alignments.
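
As a point of reference for the baseline this paper improves on, below is a minimal sketch of the standard consistency-based phrase-pair extraction that such phrase tables are typically built from. It omits the usual expansion over unaligned boundary words, and all names are illustrative rather than taken from the paper.

```python
# Minimal sketch of consistency-based phrase-pair extraction from a word
# alignment (the usual basis of phrase tables). Names are illustrative.

def extract_phrase_pairs(src_len, links, max_len=7):
    """links: set of (i, j) word-alignment pairs, 0-based.
    Returns span pairs ((i1, i2), (j1, j2)), inclusive on both ends."""
    pairs = []
    for i1 in range(src_len):
        for i2 in range(i1, min(src_len, i1 + max_len)):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in links if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no word inside [j1, j2] may link outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in links):
                continue
            pairs.append(((i1, i2), (j1, j2)))
    return pairs

# Example: extract_phrase_pairs(3, {(0, 0), (1, 2), (2, 1)}) returns, among
# others, ((0, 0), (0, 0)), ((1, 2), (1, 2)) and ((0, 2), (0, 2)).
```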

Phrasal Inversion Transduction Grammar

We use a phrasal extension of Inversion Transduction Grammar (Wu, 1997) as the generative framework.
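
Putting together the pieces named later in this summary (two nonterminals X and C, binary splits in straight or inverted order, and C as the preterminal over terminal phrase pairs), the rule schema can be written roughly as follows; the exact notation in the paper may differ:

$$
X \;\rightarrow\; [\,X\;X\,] \;\mid\; \langle X\;X \rangle \;\mid\; C,
\qquad
C \;\rightarrow\; e/f,
$$

where the square brackets keep the two smaller phrase pairs in the same order in both languages, the angle brackets invert the order on one side, and P(e/f) with Σ_{e/f} P(e/f) = 1 is the multinomial distribution over terminal phrase pairs.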

Variational Bayes for ITG

Goldwater and Griffiths (2007) and Johnson (2007) show that modifying an HMM to include a sparse prior over its parameters and using Bayesian estimation leads to improved accuracy for unsupervised part-of-speech tagging.
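
For concreteness, the standard mean-field Variational Bayes update for multinomial parameters under a Dirichlet prior, written with the quantities named later in this summary (the digamma function ψ, the s = 3 right-hand sides of X, the m observed phrase pairs, and small concentration parameters α), looks as follows; this is the textbook form, and the paper's own equations may differ in notation:

$$
\hat{P}(X \rightarrow \beta) \;=\;
\frac{\exp\!\big(\psi(E[c(X \rightarrow \beta)] + \alpha_X)\big)}
     {\exp\!\big(\psi\big(\sum_{\beta'} E[c(X \rightarrow \beta')] + s\,\alpha_X\big)\big)},
\qquad
\hat{P}(e/f) \;=\;
\frac{\exp\!\big(\psi(E[c(e/f)] + \alpha_C)\big)}
     {\exp\!\big(\psi\big(\sum_{e'/f'} E[c(e'/f')] + m\,\alpha_C\big)\big)},
$$

where E[c(·)] are expected counts from the inside-outside pass; setting α well below 1 is what biases the estimates toward sparse, parsimonious solutions.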

Bitext Pruning Strategy

ITG is slow mainly because it considers every pair of spans in two sentences as a possible chart element.
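
To make that cost concrete, here is the standard complexity count for synchronous parsing (a well-known bound, not a figure taken from the paper), assuming a source sentence of length n and a target sentence of length m:

$$
\underbrace{O(n^2 m^2)}_{\text{bitext cells}} \;\times\; \underbrace{O(n m)}_{\text{split points per cell}} \;=\; O(n^3 m^3) \;\approx\; O(n^6) \quad \text{for } m \approx n,
$$

which is why the pruning strategy described here restricts parsing to a sparse subset of bitext cells.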

Bootstrapping Phrasal ITG from Word-based ITG

This section introduces a technique that bootstraps candidate phrase pairs for phrase-based ITG from word-based ITG Viterbi alignments.
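
The sentences quoted under the topics below describe placing hard constraints on the phrasal alignment space using confident word links and anchoring phrases on one-to-one word alignments. A minimal way to state that anchoring constraint is the consistency check sketched here; it covers only the hard-link part, not the paper's full non-compositionality condition, and all names are illustrative.

```python
# Hypothetical predicate: may the bitext cell with source span [i1, i2]
# and target span [j1, j2] be kept in the chart, given hard one-to-one
# word links taken from a simpler word-level model?

def respects_links(i1, i2, j1, j2, links):
    """A cell is kept only if no link crosses its boundary: every link
    must lie entirely inside the cell or entirely outside it."""
    for (i, j) in links:
        inside_src = i1 <= i <= i2
        inside_tgt = j1 <= j <= j2
        if inside_src != inside_tgt:
            return False
    return True

# Example: with links {(0, 0), (1, 2), (2, 1)}, the cell (0, 1, 0, 2) is
# rejected because link (2, 1) enters the target span but not the source
# span, while the cell (1, 2, 1, 2) is kept.
```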

Summary of the Pipeline

We summarize the pipeline of our system, demonstrating the interactions between the three main contributions of this paper: Variational Bayes, tic-tac-toe pruning, and word-to-phrase bootstrapping.
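
The stages named here and in the quotes below (tic-tac-toe pruning, a word-based ITG trained with Variational Bayes, a non-compositional constraint from its Viterbi links, a phrasal ITG, and standard phrase extraction) line up roughly as in the sketch below. Every callable is a hypothetical hook standing in for a component described in the paper, not a real API.

```python
from typing import Callable, Iterable, Tuple

def staged_phrase_alignment(
    bitext: Iterable[Tuple[str, str]],
    tic_tac_toe_prune: Callable,        # bidirectional pruning of bitext cells
    train_word_itg: Callable,           # word-based ITG trained with VB
    noncompositional_filter: Callable,  # constraint from word-level Viterbi links
    train_phrasal_itg: Callable,        # phrasal ITG trained with EM or VB
    extract_phrases: Callable,          # standard phrase extraction
):
    """Rough outline of the staged pipeline; not the authors' implementation."""
    bitext = list(bitext)

    # 1. Prune the bitext space of every sentence pair.
    charts = [tic_tac_toe_prune(src, tgt) for src, tgt in bitext]

    # 2. Train the word-based ITG on the pruned charts and take its
    #    Viterbi alignments.
    word_itg = train_word_itg(bitext, charts, iterations=10)
    word_links = [word_itg.viterbi(src, tgt) for src, tgt in bitext]

    # 3. Restrict the charts further with the non-compositional constraint
    #    derived from those links.
    charts = [noncompositional_filter(chart, links)
              for chart, links in zip(charts, word_links)]

    # 4. Train the phrasal ITG on the residual charts and run a final
    #    Viterbi pass to obtain non-compositional phrasal alignments.
    phrasal_itg = train_phrasal_itg(bitext, charts, iterations=10)
    alignments = [phrasal_itg.viterbi(src, tgt) for src, tgt in bitext]

    # 5. Extract phrase pairs in the usual manner for a phrase-based system.
    return extract_phrases(bitext, alignments)
```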

Experiments

The training data was a subset of 175K sentence pairs from the NIST Chinese-English training data, automatically selected to maximize character-level overlap with the source side of the test data.
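
The selection procedure is not spelled out in this summary, so purely as an illustration of what "maximize character-level overlap" could mean in practice, here is one simple, hypothetical way to do it: score each training source sentence by character n-gram overlap with the test source side and keep the highest-scoring pairs. This is an assumption, not the authors' method.

```python
from collections import Counter

# Hypothetical subset selection by character n-gram overlap with the
# test source side; the paper does not describe its exact procedure.
def select_subset(train_pairs, test_source, k=175_000, n=4):
    test_ngrams = Counter(
        s[i:i + n] for s in test_source for i in range(len(s) - n + 1)
    )
    def overlap(src):
        grams = [src[i:i + n] for i in range(len(src) - n + 1)]
        return sum(g in test_ngrams for g in grams) / len(grams) if grams else 0.0
    return sorted(train_pairs, key=lambda p: overlap(p[0]), reverse=True)[:k]
```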

Conclusion

We have presented an improved and more efficient method of estimating phrase pairs directly.

Topics

phrase pairs

Appears in 20 sentences as: phrase pair (5) phrase pairs (20)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. We combine the strengths of Bayesian modeling and synchronous grammar in unsupervised learning of basic translation phrase pairs.
    Page 1, “Abstract”
  2. The structured space of a synchronous grammar is a natural fit for phrase pair probability estimation, though the search space can be prohibitively large.
    Page 1, “Abstract”
  3. Computational complexity arises from the exponentially large number of decompositions of a sentence pair into phrase pairs; overfitting is a problem because as EM attempts to maximize the likelihood of its training data, it prefers to directly explain a sentence pair with a single phrase pair.
    Page 1, “Introduction”
  4. Our ITG has two nonterminals: X and C, where X represents compositional phrase pairs that can have recursive structures and C is the preterminal over terminal phrase pairs.
    Page 2, “Phrasal Inversion Transduction Grammar”
  5. They split the left-hand side constituent which represents a phrase pair into two smaller phrase pairs on the right-hand side and order them according to one of the two possible permutations.
    Page 2, “Phrasal Inversion Transduction Grammar”
  6. where P(e/f), with Σ_{e/f} P(e/f) = 1, is a multinomial distribution over phrase pairs.
    Page 2, “Phrasal Inversion Transduction Grammar”
  7. We can train this model using a two-dimensional extension of the inside-outside algorithm on bilingual data, assuming every phrase pair that can appear as a leaf in a parse tree of the grammar is a valid candidate.
    Page 2, “Phrasal Inversion Transduction Grammar”
  8. A sparse prior over a multinomial distribution such as the distribution of phrase pairs may bias the estimator toward skewed distributions that generalize better.
    Page 3, “Variational Bayes for ITG”
  9. The other is the distribution of the phrase pairs.
    Page 3, “Variational Bayes for ITG”
  10. By adjusting α_C to a very small number, we hope to place more posterior mass on parsimonious solutions with fewer but more confident and general phrase pairs.
    Page 3, “Variational Bayes for ITG”
  11. where ψ is the digamma function (Beal, 2003), s = 3 is the number of right-hand sides for X, and m is the number of observed phrase pairs in the data.
    Page 3, “Variational Bayes for ITG”

word alignment

Appears in 13 sentences as: Word Alignment (1) word alignment (8) word alignments (4)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. Incorporating a sparse prior using Variational Bayes biases the models toward generalizable, parsimonious parameter sets, leading to significant improvements in word alignment.
    Page 1, “Abstract”
  2. This preference for sparse solutions together with effective pruning methods forms a phrase alignment regimen that produces better end-to-end translations than standard word alignment approaches.
    Page 1, “Abstract”
  3. As these word-level alignment models restrict the word alignment complexity by requiring each target word to align to zero or one source words, results are improved by aligning both source-to-target as well as target-to-source,
    Page 1, “Introduction”
  4. Finally, the set of phrases consistent with the word alignments are extracted from every sentence pair; these form the basis of the decoding process.
    Page 1, “Introduction”
  5. Furthermore it would obviate the need for heuristic combination of word alignments.
    Page 1, “Introduction”
  6. First we train a lower level word alignment model, then we place hard constraints on the phrasal alignment space using confident word links from this simpler model.
    Page 2, “Phrasal Inversion Transduction Grammar”
  7. The scope of iterative phrasal ITG training, therefore, is limited to determining the boundaries of the phrases anchored on the given one-to-one word alignments.
    Page 6, “Bootstrapping Phrasal ITG from Word-based ITG”
  8. Second, we do not need to worry about non-ITG word alignments, such as the (2, 4, 1, 3) permutation patterns.
    Page 6, “Bootstrapping Phrasal ITG from Word-based ITG”
  9. Figure 3 (a) shows all possible non-compositional phrases given the Viterbi word alignment of the example sentence pair.
    Page 6, “Bootstrapping Phrasal ITG from Word-based ITG”
  10. 7.1 Word Alignment Evaluation
    Page 7, “Experiments”
  11. The output of the word alignment systems (GIZA++ or ITG) was fed to a standard phrase extraction procedure that extracted all phrases of length up to 7 and estimated the conditional probabilities of source given target and target given source using relative frequencies.
    Page 8, “Experiments”

sentence pair

Appears in 10 sentences as: sentence pair (6) sentence pairs (5)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. Finally, the set of phrases consistent with the word alignments are extracted from every sentence pair; these form the basis of the decoding process.
    Page 1, “Introduction”
  2. Computational complexity arises from the exponentially large number of decompositions of a sentence pair into phrase pairs; overfitting is a problem because as EM attempts to maximize the likelihood of its training data, it prefers to directly explain a sentence pair with a single phrase pair.
    Page 1, “Introduction”
  3. However, it is easy to show that the maximum likelihood training will lead to the saturated solution where P_C = 1: each sentence pair is generated by a single phrase spanning the whole sentence.
    Page 2, “Phrasal Inversion Transduction Grammar”
  4. If we do not put any constraint on the distribution of phrases, EM overfits the data by memorizing every sentence pair.
    Page 3, “Variational Bayes for ITG”
  5. Figure 3 (a) shows all possible non-compositional phrases given the Viterbi word alignment of the example sentence pair.
    Page 6, “Bootstrapping Phrasal ITG from Word-based ITG”
  6. Then we use the efficient bidirectional tic-tac-toe pruning to prune the bitext space within each of the sentence pairs; ITG parsing will be carried out on only this sparse set of bitext cells.
    Page 7, “Summary of the Pipeline”
  7. The training data was a subset of 175K sentence pairs from the NIST Chinese-English training data, automatically selected to maximize character-level overlap with the source side of the test data.
    Page 7, “Experiments”
  8. We put a length limit of 35 on both sides, producing a training set of 141K sentence pairs.
    Page 7, “Experiments”
  9. Figure 4 examines the difference between EM and VB with varying sparse priors for the word-based model of ITG on the 500 sentence pairs, both after 10 iterations of training.
    Page 7, “Experiments”
  10. Table 1: Translation results on Chinese-English, using the subset of training data (141K sentence pairs) that have length limit 35 on both sides.
    Page 8, “Experiments”

Viterbi

Appears in 8 sentences as: Viterbi (8)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. We can then find the best scoring sequence using the familiar Viterbi algorithm.
    Page 5, “Bitext Pruning Strategy”
  2. This section introduces a technique that bootstraps candidate phrase pairs for phrase-based ITG from word-based ITG Viterbi alignments.
    Page 6, “Bootstrapping Phrasal ITG from Word-based ITG”
  3. We use ITG Viterbi alignments instead.
    Page 6, “Bootstrapping Phrasal ITG from Word-based ITG”
  4. Figure 3 (a) shows all possible non-compositional phrases given the Viterbi word alignment of the example sentence pair.
    Page 6, “Bootstrapping Phrasal ITG from Word-based ITG”
  5. After several training iterations, we obtain the Viterbi alignments on the training data according to the final model.
    Page 7, “Summary of the Pipeline”
  6. In the end, a Viterbi pass for the phrasal ITG is executed to produce the non-compositional phrasal alignments.
    Page 7, “Summary of the Pipeline”
  7. The charts were then pruned further by applying the non-compositional constraint from the Viterbi alignment links of that model.
    Page 8, “Experiments”
  8. Finally we ran 10 iterations of phrase-based ITG over the residual charts, using EM or VB, and extracted the Viterbi alignments.
    Page 8, “Experiments”

word-level

Appears in 8 sentences as: word-level (8)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. Most state-of-the-art statistical machine translation systems are based on large phrase tables extracted from parallel text using word-level alignments.
    Page 1, “Introduction”
  2. These word-level alignments are most often obtained using Expectation Maximization on the conditional generative models of Brown et al.
    Page 1, “Introduction”
  3. As these word-level alignment models restrict the word alignment complexity by requiring each target word to align to zero or one source words, results are improved by aligning both source-to-target as well as target-to-source,
    Page 1, “Introduction”
  4. While this approach has been very successful, poor word-level alignments are nonetheless a common source of error in machine translation systems.
    Page 1, “Introduction”
  5. A natural solution to several of these issues is to unite the word-level and phrase-level models into one learning procedure.
    Page 1, “Introduction”
  6. Ideally, such a procedure would remedy the deficiencies of word-level alignment models, including the strong restrictions on the form of the alignment, and the strong independence assumption between words.
    Page 1, “Introduction”
  7. We test our model by extracting longer phrases from our model’s alignments using traditional phrase extraction, and find that a phrase table based on our system improves MT results over a phrase table extracted from traditional word-level alignments.
    Page 2, “Introduction”
  8. Combining the two approaches, we have a staged training procedure going from the simplest unconstrained word based model to a constrained Bayesian word-level ITG model, and finally proceeding to a constrained Bayesian phrasal model.
    Page 2, “Phrasal Inversion Transduction Grammar”

overfitting

Appears in 7 sentences as: overfit (1) overfits (1) overfitting (5)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. In this direction, Expectation Maximization at the phrase level was proposed by Marcu and Wong (2002), who, however, experienced two major difficulties: computational complexity and controlling overfitting.
    Page 1, “Introduction”
  2. Computational complexity arises from the exponentially large number of decompositions of a sentence pair into phrase pairs; overfitting is a problem because as EM attempts to maximize the likelihood of its training data, it prefers to directly explain a sentence pair with a single phrase pair.
    Page 1, “Introduction”
  3. We address the tendency of EM to overfit by using Bayesian methods, where sparse priors assign greater mass to parameter vectors with fewer nonzero values, therefore favoring shorter, more frequent phrases.
    Page 2, “Introduction”
  4. If we do not put any constraint on the distribution of phrases, EM overfits the data by memorizing every sentence pair.
    Page 3, “Variational Bayes for ITG”
  5. Using EM, because of overfitting, AER drops first and increases again as the number of iterations varies from 1 to 10.
    Page 7, “Experiments”
  6. The gain is especially large on the test data set, indicating VB is less prone to overfitting.
    Page 8, “Experiments”
  7. On top of these hard constraints, the sparse prior of VB helps make the model less prone to overfitting to infrequent phrase pairs, and thus improves the quality of the phrase pairs the model learns.
    Page 8, “Conclusion”

alignment models

Appears in 6 sentences as: alignment model (1) alignment models (5)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. As these word-level alignment models restrict the word alignment complexity by requiring each target word to align to zero or one source words, results are improved by aligning both source-to-target as well as target-to-source,
    Page 1, “Introduction”
  2. Ideally, such a procedure would remedy the deficiencies of word-level alignment models, including the strong restrictions on the form of the alignment, and the strong independence assumption between words.
    Page 1, “Introduction”
  3. Our second approach was to constrain the search space using simpler alignment models, which has the further benefit of significantly speeding up training.
    Page 2, “Phrasal Inversion Transduction Grammar”
  4. First we train a lower level word alignment model, then we place hard constraints on the phrasal alignment space using confident word links from this simpler model.
    Page 2, “Phrasal Inversion Transduction Grammar”
  5. alignment models is the EM algorithm (Brown et al., 1993) which iteratively updates parameters to maximize the likelihood of the data.
    Page 3, “Variational Bayes for ITG”
  6. However, our best system does not apply VB to a single probability model, as we found an appreciable benefit from bootstrapping each model from simpler models, much as the IBM word alignment models are usually trained in succession.
    Page 8, “Conclusion”

phrase-based

Appears in 4 sentences as: phrase-based (4)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. The drawback of maximum likelihood is obvious for phrase-based models.
    Page 3, “Variational Bayes for ITG”
  2. This section introduces a technique that bootstraps candidate phrase pairs for phrase-based ITG from word-based ITG Viterbi alignments.
    Page 6, “Bootstrapping Phrasal ITG from Word-based ITG”
  3. From this alignment, phrase pairs are extracted in the usual manner, and a phrase-based translation system is trained.
    Page 7, “Summary of the Pipeline”
  4. Finally we ran 10 iterations of phrase-based ITG over the residual charts, using EM or VB, and extracted the Viterbi alignments.
    Page 8, “Experiments”

translation systems

Appears in 4 sentences as: translation system (1) translation systems (3)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. Most state-of-the-art statistical machine translation systems are based on large phrase tables extracted from parallel text using word-level alignments.
    Page 1, “Introduction”
  2. While this approach has been very successful, poor word-level alignments are nonetheless a common source of error in machine translation systems.
    Page 1, “Introduction”
  3. From this alignment, phrase pairs are extracted in the usual manner, and a phrase-based translation system is trained.
    Page 7, “Summary of the Pipeline”
  4. We trained several phrasal translation systems , varying only the word alignment (or phrasal alignment) method.
    Page 8, “Experiments”

BLEU

Appears in 3 sentences as: BLEU (3)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. Given an unlimited amount of time, we would tune the prior to maximize end-to-end performance, using an objective function such as BLEU.
    Page 7, “Experiments”
  2. We do compare VB against EM in terms of final BLEU scores in the translation experiments to ensure that this sparse prior has a sig-
    Page 7, “Experiments”
  3. Minimum Error Rate training (Och, 2003) over BLEU was used to optimize the weights for each of these models over the development test data.
    Page 8, “Experiments”

end-to-end

Appears in 3 sentences as: End-to-end (1) end-to-end (2)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. This preference for sparse solutions together with effective pruning methods forms a phrase alignment regimen that produces better end-to-end translations than standard word alignment approaches.
    Page 1, “Abstract”
  2. 7.2 End-to-end Evaluation
    Page 7, “Experiments”
  3. Given an unlimited amount of time, we would tune the prior to maximize end-to-end performance, using an objective function such as BLEU.
    Page 7, “Experiments”

machine translation

Appears in 3 sentences as: machine translation (3)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. Most state-of-the-art statistical machine translation systems are based on large phrase tables extracted from parallel text using word-level alignments.
    Page 1, “Introduction”
  2. While this approach has been very successful, poor word-level alignments are nonetheless a common source of error in machine translation systems.
    Page 1, “Introduction”
  3. These experiments also indicate that a very sparse prior is needed for machine translation tasks.
    Page 7, “Experiments”

objective function

Appears in 3 sentences as: objective function (3)
In Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing
  1. First we change the objective function by incorporating a prior over the phrasal parameters.
    Page 2, “Phrasal Inversion Transduction Grammar”
  2. Given an unlimited amount of time, we would tune the prior to maximize end-to-end performance, using an objective function such as BLEU.
    Page 7, “Experiments”
  3. By both changing the objective function to include a bias toward sparser models and improving the pruning techniques and efficiency, we achieve significant gains on test data with practical speed.
    Page 8, “Conclusion”
