Unsupervised Multilingual Grammar Induction
Snyder, Benjamin and Naseem, Tahira and Barzilay, Regina

Article Structure

Abstract

We investigate the task of unsupervised constituency parsing from bilingual parallel corpora.

Introduction

In this paper we investigate the task of unsupervised constituency parsing when bilingual parallel text is available.

Related Work

The unsupervised grammar induction task has been studied extensively, mostly in a monolingual setting (Charniak and Carroll, 1992; Stolcke and Omohundro, 1994; Klein and Manning, 2002; Seginer, 2007).

Model

We propose an unsupervised Bayesian model for learning bilingual syntactic structure using parallel corpora.

Experimental setup

We test our model on three corpora of bilingual parallel sentences: English-Korean, English-Urdu, and English-Chinese.

Results

Table 1 reports the full results of our experiments.

Conclusion

We have presented a probabilistic model for bilingual grammar induction which uses raw parallel text to learn tree pairs and their alignments.

Topics

word-level

Appears in 14 sentences as: Word-level (1) word-level (13)
In Unsupervised Multilingual Grammar Induction
  1. Word-level alignments are then drawn based on the tree alignment.
    Page 2, “Introduction”
  2. Finally, parallel sentences are assembled from these generated part-of-speech sequences and word-level alignments.
    Page 2, “Introduction”
  3. The model is trained using bilingual data with automatically induced word-level alignments, but is tested on purely monolingual data for each language.
    Page 2, “Introduction”
  4. Assuming that trees induced over parallel sentences have to exhibit certain structural regularities, Kuhn manually specifies a set of rules for determining when parsing decisions in the two languages are inconsistent with GIZA++ word-level alignments.
    Page 2, “Related Work”
  5. word-level alignments, as observed data.
    Page 3, “Model”
  6. We obtain these word-level alignments using GIZA++ (Och and Ney, 2003).
    Page 3, “Model”
  7. Finally, word-level alignments are drawn based on the structure of the alignment tree.
    Page 4, “Model”
  8. Intuitively, aligned nodes should have a high density of word-level alignments between them, and unaligned nodes should have few lexical alignments.
    Page 4, “Model”
  9. We call a word-level alignment good if it aligns a word in y1 with a word in y2.
    Page 4, “Model”
  10. We call a word-level alignment bad if it aligns a word in y1 with a word outside y2, or vice versa.
    Page 4, “Model”
  11. Now we describe the stochastic process whereby the observed parallel sentences and their word-level alignments are generated, according to our model.
    Page 4, “Model”
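
The good/bad distinction in entries 9 and 10 reduces to a count over GIZA++ alignment links relative to a candidate node pair. A minimal sketch of that count, assuming a hypothetical set-of-word-positions representation for the yields y1 and y2 (the paper does not prescribe these data structures):

    def count_good_bad(y1, y2, links):
        """Count good and bad word-level alignments for a candidate node pair.
        y1, y2 -- sets of word positions spanned by the two nodes
        links  -- iterable of (i, j) word-alignment pairs from GIZA++
        (Hypothetical representation, for illustration only.)"""
        good = bad = 0
        for i, j in links:
            in1, in2 = i in y1, j in y2
            if in1 and in2:
                good += 1   # aligns a word in y1 with a word in y2
            elif in1 or in2:
                bad += 1    # crosses out of y1 or out of y2
        return good, bad

    # Example: two links land inside both yields, one link leaves y2.
    print(count_good_bad({0, 1, 2}, {0, 1}, [(0, 0), (1, 1), (2, 3)]))  # (2, 1)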

parallel sentences

Appears in 12 sentences as: parallel sentence (2) parallel sentences (10)
In Unsupervised Multilingual Grammar Induction
  1. Finally, parallel sentences are assembled from these generated part-of-speech sequences and word-level alignments.
    Page 2, “Introduction”
  2. Assuming that trees induced over parallel sentences have to exhibit certain structural regularities, Kuhn manually specifies a set of rules for determining when parsing decisions in the two languages are inconsistent with GIZA++ word-level alignments.
    Page 2, “Related Work”
  3. We treat the part-of-speech tag sequences of parallel sentences, as well as their
    Page 2, “Model”
  4. Now we describe the stochastic process whereby the observed parallel sentences and their word-level alignments are generated, according to our model.
    Page 4, “Model”
  5. Then, each pair of word-aligned parallel sentences is generated through the following process:
    Page 4, “Model”
  6. In the next section we turn to the problem of inference under this model when only the part-of-speech tag sequences of parallel sentences and their word-level alignments are observed.
    Page 5, “Model”
  7. For the ith parallel sentence, we wish to jointly sample the pair of trees (T1, T2)i together with their alignment Ai.
    Page 5, “Model”
  8. Once the triple (T1, T2, A) has been sampled, we move on to the next parallel sentence.
    Page 6, “Model”
  9. We test our model on three corpora of bilingual parallel sentences: English-Korean, English-Urdu, and English-Chinese.
    Page 6, “Experimental setup”
  10. To obtain lexical alignments between the parallel sentences we employ GIZA++ (Och and Ney, 2003).
    Page 6, “Experimental setup”
  11. Then for each pair of parallel sentences we randomly sample an initial alignment tree for the two sampled trees.
    Page 6, “Experimental setup”

part-of-speech

Appears in 10 sentences as: part-of-speech (10)
In Unsupervised Multilingual Grammar Induction
  1. Multilingual learning has been successful for other linguistic induction tasks such as lexicon acquisition, morphological segmentation, and part-of-speech tagging (Genzel, 2005; Snyder and Barzilay, 2008; Snyder et al., 2008; Snyder
    Page 1, “Introduction”
  2. For each monolingual constituent, a sequence of part-of-speech tags is drawn from a language-specific distribution.
    Page 2, “Introduction”
  3. For each pair of coupled bilingual constituents, a pair of part-of-speech sequences are drawn jointly from a cross-lingual distribution.
    Page 2, “Introduction”
  4. Finally, parallel sentences are assembled from these generated part-of-speech sequences and word-level alignments.
    Page 2, “Introduction”
  5. We treat the part-of-speech tag sequences of parallel sentences, as well as their
    Page 2, “Model”
  6. Under this model, the part-of-speech sequence of each span in a sentence is generated either as a constituent yield, if it is dominated by a node in the tree, or otherwise as a distituent yield.
    Page 3, “Model”
  7. While this model is deficient (each observed subsequence of part-of-speech tags is generated many times over), its performance is far higher than that of unsupervised PCFGs.
    Page 3, “Model”
  8. In the next section we turn to the problem of inference under this model when only the part-of-speech tag sequences of parallel sentences and their word-level alignments are observed.
    Page 5, “Model”
  9. Given a corpus of paired part-of-speech tag sequences (s1, s2) and their GIZA++ alignments g, we would ideally like to predict the set of tree pairs (T1, T2) which have highest probability when conditioned on the observed data: P(T1, T2 | s1, s2, g).
    Page 5, “Model”
  10. The English-Urdu parallel corpus consists of 4,325 sentences from the first three sections of the Penn Treebank and their Urdu translations annotated at the part-of-speech level.
    Page 6, “Experimental setup”

dynamic program

Appears in 7 sentences as: dynamic program (5) dynamic programming (2)
In Unsupervised Multilingual Grammar Induction
  1. We perform inference under this model using Markov Chain Monte Carlo and dynamic programming.
    Page 1, “Abstract”
  2. Additionally, a computational advantage of this formalism is that the marginalized probability over all possible alignments for any two trees can be efficiently computed with a dynamic program in linear time.
    Page 1, “Introduction”
  3. We sample pairs of trees and then compute marginalized probabilities over all possible alignments using dynamic programming.
    Page 2, “Introduction”
  4. Fortunately, for any given pair of trees T1 and T2 this marginalization can be computed using a dynamic program in time O(|T1||T2|).
    Page 5, “Model”
  5. A dynamic program builds this table from the bottom up: For each node pair n1, n2, we sum the probabilities of all local alignment configurations, each multiplied by the appro-
    Page 5, “Model”
  6. This algorithm is an adaptation of the dynamic program presented in (Jiang et al., 1995) for finding minimum cost alignment trees (Fig.
    Page 6, “Model”
  7. Once a pair of trees (T1, T2) has been sampled, we can proceed to sample an alignment tree A|T1, T2. We sample individual alignment decisions from the top down, at each step using the alignment marginals for the remaining subtrees (already computed using the aforementioned dynamic program).
    Page 6, “Model”
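
Entries 2-5 describe a bottom-up table over node pairs that marginalizes over all alignments of two trees in O(|T1||T2|) time. The sketch below is only a toy version of that idea: its alignment space is deliberately simplified (below each node pair, either nothing is aligned, or the pair is aligned and its children recurse in straight or inverted order), which is not the paper's actual set of configurations, and the Node class and pair_score function are illustrative assumptions.

    from functools import lru_cache

    class Node:
        """Binary tree node; leaves have no children (hypothetical structure)."""
        def __init__(self, label, left=None, right=None):
            self.label, self.left, self.right = label, left, right
        def is_leaf(self):
            return self.left is None

    def alignment_marginal(root1, root2, pair_score):
        """Fill one table entry per node pair, bottom up via memoization."""
        @lru_cache(maxsize=None)
        def table(n1, n2):
            unaligned = 1.0                  # nothing aligned below this pair
            aligned = pair_score(n1, n2)     # align n1 with n2 ...
            if not n1.is_leaf() and not n2.is_leaf():
                # ... and match their children in straight or inverted order.
                aligned *= (table(n1.left, n2.left) * table(n1.right, n2.right)
                            + table(n1.left, n2.right) * table(n1.right, n2.left))
            return unaligned + aligned
        return table(root1, root2)

    # Tiny usage example with a uniform pair score.
    t1 = Node("S", Node("NP"), Node("VP"))
    t2 = Node("S", Node("NP"), Node("VP"))
    print(alignment_marginal(t1, t2, lambda a, b: 0.5))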

F-measure

Appears in 7 sentences as: F-measure (7)
In Unsupervised Multilingual Grammar Induction
  1. On average, across a variety of testing scenarios, our model achieves an 8.8 absolute gain in F-measure.
    Page 1, “Abstract”
  2. On average, over all the testing scenarios that we studied, our model achieves an absolute increase in F-measure of 8.8 points, and a 19% reduction in error relative to a theoretical upper bound.
    Page 2, “Introduction”
  3. In both cases our implementation achieves F-measure in the range of 69-70% on WSJ10, broadly in line with the performance reported by Klein and Manning (2002).
    Page 6, “Experimental setup”
  4. To evaluate both our model as well as the baseline, we use (unlabeled) bracket precision, recall, and F-measure (Klein and Manning, 2002).
    Page 7, “Experimental setup”
  5. We also report the upper bound on F-measure for binary trees.
    Page 7, “Experimental setup”
  6. On average, the bilingual model gains 10.2 percentage points in precision, 7.7 in recall, and 8.8 in F-measure.
    Page 7, “Results”
  7. The Korean-English pairing results in substantial improvements for Korean and quite large improvements for English, for which the absolute gain reaches 28 points in F-measure.
    Page 7, “Results”
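
The unlabeled bracket scores in entries 4-6 can be computed directly from span sets. A small sketch, assuming each parse has already been reduced to a set of (start, end) brackets; which spans are counted (e.g. whether trivial or full-sentence brackets are kept) is an evaluation choice not shown here:

    def bracket_prf(gold_spans, test_spans):
        """Corpus-level unlabeled bracket precision, recall, and F-measure.
        Both arguments are lists of per-sentence sets of (start, end) spans."""
        matched = sum(len(g & t) for g, t in zip(gold_spans, test_spans))
        n_test = sum(len(t) for t in test_spans)
        n_gold = sum(len(g) for g in gold_spans)
        p = matched / n_test if n_test else 0.0
        r = matched / n_gold if n_gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Example: gold has three brackets, the predicted tree recovers two of them.
    gold = [{(0, 5), (0, 2), (2, 5)}]
    test = [{(0, 5), (0, 2), (1, 3)}]
    print(bracket_prf(gold, test))  # roughly (0.667, 0.667, 0.667)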

parallel data

Appears in 7 sentences as: parallel data (8)
In Unsupervised Multilingual Grammar Induction
  1. We formulate a generative Bayesian model which seeks to explain the observed parallel data through a combination of bilingual and monolingual parameters.
    Page 1, “Abstract”
  2. We formulate a generative Bayesian model which seeks to explain the observed parallel data through a combination of bilingual and monolingual parameters.
    Page 1, “Introduction”
  3. More recently, there has been a body of work attempting to improve parsing performance by exploiting syntactically annotated parallel data.
    Page 2, “Related Work”
  4. In one strand of this work, annotations are assumed only in a resource-rich language and are projected onto a resource-poor language using the parallel data (Hwa et al., 2005; Xi and Hwa, 2005).
    Page 2, “Related Work”
  5. In another strand of work, syntactic annotations are assumed on both sides of the parallel data, and a model is trained to exploit the parallel data at test time as well (Smith and Smith, 2004; Burkett and Klein, 2008).
    Page 2, “Related Work”
  6. Though the model is trained using parallel data, during testing it has access only to monolingual data.
    Page 6, “Experimental setup”
  7. This setup ensures that we are testing our model’s ability to learn better parameters at training time, rather than its ability to exploit parallel data at test time.
    Page 6, “Experimental setup”

grammar induction

Appears in 6 sentences as: grammar induction (6)
In Unsupervised Multilingual Grammar Induction
  1. We test the effectiveness of our bilingual grammar induction model on three corpora of parallel text: English-Korean, English-Urdu and English-Chinese.
    Page 2, “Introduction”
  2. The unsupervised grammar induction task has been studied extensively, mostly in a monolingual setting (Charniak and Carroll, 1992; Stolcke and Omohundro, 1994; Klein and Manning, 2002; Seginer, 2007).
    Page 2, “Related Work”
  3. We know of only one study which evaluates these bilingual grammar formalisms on the task of grammar induction itself (Smith and Smith, 2004).
    Page 2, “Related Work”
  4. In contrast to this work, our goal is to explore the benefits of multilingual grammar induction in a fully unsupervised setting.
    Page 2, “Related Work”
  5. During preprocessing of the corpora we remove all punctuation marks and special symbols, following the setup in previous grammar induction work (Klein and Manning, 2002).
    Page 6, “Experimental setup”
  6. We have presented a probabilistic model for bilingual grammar induction which uses raw parallel text to learn tree pairs and their alignments.
    Page 8, “Conclusion”

part-of-speech tag

Appears in 6 sentences as: part-of-speech tag (3) part-of-speech tagging (1) part-of-speech tags (2)
In Unsupervised Multilingual Grammar Induction
  1. Multilingual learning has been successful for other linguistic induction tasks such as lexicon acquisition, morphological segmentation, and part-of-speech tagging (Genzel, 2005; Snyder and Barzilay, 2008; Snyder et al., 2008; Snyder
    Page 1, “Introduction”
  2. For each monolingual constituent, a sequence of part-of-speech tags is drawn from a language-specific distribution.
    Page 2, “Introduction”
  3. We treat the part-of-speech tag sequences of parallel sentences, as well as their
    Page 2, “Model”
  4. While this model is deficient (each observed subsequence of part-of-speech tags is generated many times over), its performance is far higher than that of unsupervised PCFGs.
    Page 3, “Model”
  5. In the next section we turn to the problem of inference under this model when only the part-of-speech tag sequences of parallel sentences and their word-level alignments are observed.
    Page 5, “Model”
  6. Given a corpus of paired part-of-speech tag sequences (s1, s2) and their GIZA++ alignments g, we would ideally like to predict the set of tree pairs (T1, T2) which have highest probability when conditioned on the observed data: P(T1, T2 | s1, s2, g).
    Page 5, “Model”

generative process

Appears in 5 sentences as: Generative Process (1) generative process (4)
In Unsupervised Multilingual Grammar Induction
  1. Our model seeks to explain this observed data through a generative process whereby two aligned parse trees are produced jointly.
    Page 3, “Model”
  2. In the next two sections, we describe our model in more formal detail by specifying the parameters and generative process by which sentences are formed.
    Page 4, “Model”
  3. 3.4 Generative Process
    Page 4, “Model”
  4. As the first step in the Bayesian generative process, all the multinomial parameters listed in the previous section are drawn from their conjugate priors: Dirichlet distributions of appropriate dimension.
    Page 4, “Model”
  5. Note that in the case of yield pairs produced according to Distribution 1 (in step 4 of the generative process) conjugacy is technically broken, since the yield pairs are no longer produced by a single multinomial distribution.
    Page 6, “Model”
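
Entry 4 describes the standard conjugate setup: multinomial parameters are first drawn from Dirichlet priors and only then used to generate symbols. A minimal illustration of that step; the tag set and concentration value below are made up, not taken from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    tags = ["NN", "VB", "DT", "JJ"]               # illustrative alphabet
    theta = rng.dirichlet([0.5] * len(tags))      # draw multinomial parameters from the prior
    sample = rng.choice(tags, size=10, p=theta)   # then generate symbols from them
    print(theta.round(3), list(sample))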

Gibbs sampling

Appears in 4 sentences as: Gibbs sampler (1) Gibbs sampling (3)
In Unsupervised Multilingual Grammar Induction
  1. We use Gibbs sampling (Hastings, 1970) to draw trees for each sentence conditioned on those drawn for
    Page 5, “Model”
  2. This use of a tractable proposal distribution and acceptance ratio is known as the Metropolis-Hastings algorithm and it preserves the convergence guarantee of the Gibbs sampler (Hastings, 1970).
    Page 5, “Model”
  3. This model uses the same inference procedure as our bilingual model (Gibbs sampling).
    Page 6, “Experimental setup”
  4. We also reimplemented the original EM version of CCM and found virtually no difference in performance when using EM or Gibbs sampling.
    Page 6, “Experimental setup”
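
Entry 2 refers to the standard Metropolis-Hastings correction inside a Gibbs sweep: a sample drawn from a tractable proposal distribution is kept with an acceptance probability that compensates for the mismatch with the true conditional. A generic sketch of that test in log space (function and argument names are illustrative, not the paper's):

    import math, random

    def mh_accept(log_p_new, log_p_old, log_q_new, log_q_old):
        """Accept a proposed state with probability
        min(1, [p(new) * q(old | new)] / [p(old) * q(new | old)])."""
        log_ratio = (log_p_new - log_p_old) + (log_q_old - log_q_new)
        return math.log(random.random()) < min(0.0, log_ratio)

    # Inside a sampler: keep the proposed trees only if mh_accept(...) is True,
    # otherwise retain the current trees for this sentence.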

subtrees

Appears in 4 sentences as: subtrees (5)
In Unsupervised Multilingual Grammar Induction
  1. This algorithm proceeds in top-down fashion by sampling individual split points using the marginal probabilities of all possible subtrees.
    Page 5, “Model”
  2. To do so directly would involve simultaneously marginalizing over all possible subtrees as well as all possible alignments between such subtrees when sampling upper-level split points.
    Page 5, “Model”
  3. For every pair of nodes n1 ∈ T1, n2 ∈ T2, a table stores the marginal probability of the subtrees rooted at n1 and n2, respectively.
    Page 5, “Model”
  4. Once a pair of trees (T1, T2) has been sampled, we can proceed to sample an alignment tree A|T1, T2. We sample individual alignment decisions from the top down, at each step using the alignment marginals for the remaining subtrees (already computed using the aforementioned dynamic program).
    Page 6, “Model”
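
Entries 1 and 4 describe sampling decisions from the top down in proportion to previously computed marginals. The helper below shows only the basic proportional draw; the mapping from candidate decisions (split points or alignment choices) to marginal scores is assumed to come from tables like those described above:

    import random

    def sample_proportional(scores):
        """Draw one key from `scores` with probability proportional to its value.
        `scores` maps candidate decisions (e.g. split points) to nonnegative
        marginal probabilities; a hypothetical interface for illustration."""
        total = sum(scores.values())
        r = random.random() * total
        for choice, weight in scores.items():
            r -= weight
            if r <= 0:
                return choice
        return choice  # guard against floating-point underflow

    print(sample_proportional({2: 0.1, 3: 0.6, 4: 0.3}))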

Treebank

Appears in 4 sentences as: Treebank (3) treebank (1)
In Unsupervised Multilingual Grammar Induction
  1. The Penn Korean Treebank (Han et al., 2002) consists of 5,083 Korean sentences translated into English for the purposes of language training in a military setting.
    Page 6, “Experimental setup”
  2. The English-Urdu parallel corpus consists of 4,325 sentences from the first three sections of the Penn Treebank and their Urdu translations annotated at the part-of-speech level.
    Page 6, “Experimental setup”
  3. We use the remaining sections of the Penn Treebank for English testing.
    Page 6, “Experimental setup”
  4. The English-Chinese treebank (Bies et al., 2007) consists of 3,850 Chinese newswire sentences translated into English.
    Page 6, “Experimental setup”

constituency parsing

Appears in 3 sentences as: constituency parsing (3)
In Unsupervised Multilingual Grammar Induction
  1. We investigate the task of unsupervised constituency parsing from bilingual parallel corpora.
    Page 1, “Abstract”
  2. In this paper we investigate the task of unsupervised constituency parsing when bilingual parallel text is available.
    Page 1, “Introduction”
  3. While PCFGs perform poorly on this task, the CCM model (Klein and Manning, 2002) has achieved large gains in performance and is among the state-of-the-art probabilistic models for unsupervised constituency parsing.
    Page 2, “Related Work”

cross-lingual

Appears in 3 sentences as: cross-lingual (3)
In Unsupervised Multilingual Grammar Induction
  1. One of the main challenges of unsupervised multilingual learning is to exploit cross-lingual patterns discovered in data, while still allowing a wide range of language-specific idiosyncrasies.
    Page 1, “Introduction”
  2. For each pair of coupled bilingual constituents, a pair of part-of-speech sequences are drawn jointly from a cross-lingual distribution.
    Page 2, “Introduction”
  3. Research in this direction was pioneered by (Wu, 1997), who developed Inversion Transduction Grammars to capture cross-lingual grammar variations such as phrase re-orderings.
    Page 2, “Related Work”

parallel corpora

Appears in 3 sentences as: parallel corpora (3)
In Unsupervised Multilingual Grammar Induction
  1. We investigate the task of unsupervised constituency parsing from bilingual parallel corpora.
    Page 1, “Abstract”
  2. Applying this model to three parallel corpora (Korean-English, Urdu-English, and Chinese-English) we find substantial performance gains over the CCM model, a strong monolingual baseline.
    Page 1, “Abstract”
  3. We propose an unsupervised Bayesian model for learning bilingual syntactic structure using parallel corpora.
    Page 2, “Model”
