Unsupervised Multilingual Learning for Morphological Segmentation
Snyder, Benjamin and Barzilay, Regina

Article Structure

Abstract

For centuries, the deep connection between languages has brought about major discoveries about human communication.

Introduction

For centuries, the deep connection between human languages has fascinated linguists, anthropologists and historians (Eco, 1995).

Related Work

Multilingual Language Learning Recently, the availability of parallel corpora has spurred research on multilingual analysis for a variety of tasks ranging from morphology to semantic role labeling (Yarowsky et al., 2000; Diab and Resnik, 2002; Xi and Hwa, 2005; Pado and Lapata, 2006).

Multilingual Morphological Segmentation

The underlying assumption of our work is that structural commonality across different languages is a powerful source of information for morphological analysis.

Model

Overview In order to simultaneously model probabilistic dependencies across languages as well as morpheme distributions within each language, we employ a hierarchical Bayesian model.

Experimental SetUp

Morpheme Definition For the purpose of these experiments, we define morphemes to include conjunctions, prepositional and pronominal affixes, plural and dual suffixes, particles, definite articles, and roots.

Results

Table 1 shows the performance of the various automatic segmentation methods.

Conclusions and Future Work

We started out by posing two questions: (i) Can we exploit cross-lingual patterns to improve unsupervised analysis?

Topics

cross-lingual

Appears in 7 sentences as: cross-lingual (7)
In Unsupervised Multilingual Learning for Morphological Segmentation
  1. We present a nonparametric Bayesian model that jointly induces morpheme segmentations of each language under consideration and at the same time identifies cross-lingual morpheme patterns, or abstract morphemes.
    Page 1, “Abstract”
  2. In this paper we investigate two questions: (i) Can we exploit cross-lingual correspondences to improve unsupervised language
    Page 1, “Introduction”
  3. Our results indicate that cross-lingual patterns can indeed be exploited successfully for the task of unsupervised morphological segmentation.
    Page 2, “Introduction”
  4. In the following section, we describe a model that can model both generic cross-lingual patterns (fy and 19-), as well as cognates between related languages (ktb for Hebrew and Arabic).
    Page 3, “Multilingual Morphological Segmentation”
  5. The model is fully unsupervised and is driven by a preference for stable and high frequency cross-lingual morpheme patterns.
    Page 3, “Model”
  6. Our model utilizes a Dirichlet process prior for each language, as well as for the cross-lingual links (abstract morphemes).
    Page 4, “Model”
  7. We started out by posing two questions: (i) Can we exploit cross-lingual patterns to improve unsupervised analysis?
    Page 8, “Conclusions and Future Work”

See all papers in Proc. ACL 2008 that mention cross-lingual.

See all papers in Proc. ACL that mention cross-lingual.


segmentations

Appears in 7 sentences as: segmentations (7)
In Unsupervised Multilingual Learning for Morphological Segmentation
  1. We present a nonparametric Bayesian model that jointly induces morpheme segmentations of each language under consideration and at the same time identifies cross-lingual morpheme patterns, or abstract morphemes.
    Page 1, “Abstract”
  2. the space of joint segmentations.
    Page 2, “Introduction”
  3. For each language in the pair, the model favors segmentations which yield high frequency morphemes.
    Page 2, “Introduction”
  4. For word w in language E, we consider at once all possible segmentations, and for each segmentation all possible alignments.
    Page 6, “Model”
  5. We are thus considering at once: all possible segmentations of w along with all possible alignments involving morphemes in w with some subset of previously sampled language-F morphemes.
    Page 6, “Model”
  6. We obtained gold standard segmentations of the Arabic translation with a handcrafted Arabic morphological analyzer which utilizes manually constructed word lists and compatibility rules and is further trained on a large corpus of hand-annotated Arabic data (Habash and Rambow, 2005).
    Page 7, “Experimental SetUp”
  7. We don’t have gold standard segmentations for the English and Aramaic portions of the data, and thus restrict our evaluation to Hebrew and Arabic.
    Page 7, “Experimental SetUp”


Gibbs sampling

Appears in 6 sentences as: Gibbs sampler (2) Gibbs sampling (4)
In Unsupervised Multilingual Learning for Morphological Segmentation
  1. In practice, we never deal with such distributions directly, but rather integrate over them during Gibbs sampling.
    Page 4, “Model”
  2. We achieve these aims by performing Gibbs sampling .
    Page 6, “Model”
  3. Sampling We follow (Neal, 1998) in the derivation of our blocked and collapsed Gibbs sampler .
    Page 6, “Model”
  4. Gibbs sampling starts by initializing all random variables to arbitrary starting values.
    Page 6, “Model”
  5. We also collapse our Gibbs sampler in the standard way, by using predictive posteriors marginalized over all possible draws from the Dirichlet processes (resulting in Chinese Restaurant Processes).
    Page 6, “Model”
  6. See (Neal, 1998) for general formulas for Gibbs sampling from distributions with Dirichlet process priors.
    Page 7, “Model”
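The sampler described in these excerpts integrates out each Dirichlet-process draw, leaving a Chinese Restaurant Process predictive posterior over morphemes. A minimal sketch of that predictive rule (the morpheme strings, concentration parameter, and base probability below are illustrative values, not ones from the paper):

```python
from collections import Counter

def crp_predictive(counts: Counter, total: int, item: str,
                   alpha: float, base_prob: float) -> float:
    """Predictive posterior P(item | previous draws) under a Dirichlet
    process prior, with the DP itself marginalized out (a Chinese
    Restaurant Process): seen items are weighted by their counts, and
    unseen items fall back to alpha times the base distribution."""
    return (counts[item] + alpha * base_prob) / (total + alpha)

# Toy morpheme counts from previously sampled segmentations.
counts = Counter({"ktb": 5, "al": 3})
total = sum(counts.values())
p_seen = crp_predictive(counts, total, "ktb", alpha=1.0, base_prob=0.01)
p_new = crp_predictive(counts, total, "xyz", alpha=1.0, base_prob=0.01)
```

The "rich get richer" behavior of this rule is what drives the model's stated preference for stable, high-frequency morphemes: a morpheme already used five times is far more probable than a novel string of the same length.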


language pairs

Appears in 4 sentences as: language pair (2) language pairs (3)
In Unsupervised Multilingual Learning for Morphological Segmentation
  1. When modeled in tandem, gains are observed for all language pairs, reducing relative error by as much as 24%.
    Page 2, “Introduction”
  2. Furthermore, our experiments show that both related and unrelated language pairs benefit from multilingual learning.
    Page 2, “Introduction”
  3. To obtain our corpus of short parallel phrases, we preprocessed each language pair using the Giza++ alignment toolkit. Given word alignments for each language pair, we extract a list of phrase pairs that form independent sets in the bipartite alignment graph.
    Page 7, “Experimental SetUp”
  4. However, once character-to-character phonetic correspondences are added as an abstract morpheme prior (final two rows), we find the performance of related language pairs outstrips English, reducing relative error over MONOLINGUAL by 10% and 24% for the Hebrew/Arabic pair.
    Page 8, “Results”
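The phrase-pair extraction in excerpt 3 can be read as taking connected components of the bipartite word-alignment graph: the words in a component align only to each other, so each component yields a phrase pair that is independent of the rest of the sentence. A small sketch under that reading (the function name and input format are assumptions for illustration, not the Giza++ interface):

```python
from collections import defaultdict

def phrase_pairs(alignment):
    """Split word-alignment links into connected components of the
    bipartite alignment graph; each component gives one phrase pair.
    `alignment` is a set of (source_index, target_index) links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in alignment:
        parent[find(("s", i))] = find(("t", j))  # union both endpoints

    components = defaultdict(lambda: (set(), set()))
    for i, j in alignment:
        src, tgt = components[find(("s", i))]
        src.add(i)
        tgt.add(j)
    return sorted((sorted(s), sorted(t)) for s, t in components.values())

# Source words 0 and 1 both align to target word 0 -> one phrase pair;
# source word 2 aligns only to target word 1 -> a second, independent pair.
pairs = phrase_pairs({(0, 0), (1, 0), (2, 1)})
```

Because no alignment link crosses between components, each extracted pair can be treated as an independent short parallel phrase, which is what the generative model then consumes.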


morphological analysis

Appears in 4 sentences as: morphological analyses (1) morphological analysis (2) morphological analyzer (1)
In Unsupervised Multilingual Learning for Morphological Segmentation
  1. The underlying assumption of our work is that structural commonality across different languages is a powerful source of information for morphological analysis.
    Page 3, “Multilingual Morphological Segmentation”
  2. This Bible edition is augmented by gold standard morphological analysis (including segmentation) performed by biblical scholars.
    Page 7, “Experimental SetUp”
  3. We obtained gold standard segmentations of the Arabic translation with a handcrafted Arabic morphological analyzer which utilizes manually constructed word lists and compatibility rules and is further trained on a large corpus of hand-annotated Arabic data (Habash and Rambow, 2005).
    Page 7, “Experimental SetUp”
  4. The accuracy of this analyzer is reported to be 94% for full morphological analyses, and 98%-99% when part-of-speech tag accuracy is not included.
    Page 7, “Experimental SetUp”


parallel corpus

Appears in 4 sentences as: parallel corpus (4)
In Unsupervised Multilingual Learning for Morphological Segmentation
  1. Given a parallel corpus, the annotations are projected from this source language to its counterpart, and the resulting annotations are used for supervised training in the target language.
    Page 2, “Related Work”
  2. While their approach does not require a parallel corpus, it does assume the availability of annotations in one language.
    Page 2, “Related Work”
  3. High-level Generative Story We have a parallel corpus of several thousand short phrases in the two languages E and F.
    Page 4, “Model”
  4. Once A, E, and F have been drawn, we model our parallel corpus of short phrases as a series of independent draws from a phrase-pair generation model.
    Page 4, “Model”


part-of-speech

Appears in 4 sentences as: part-of-speech (4)
In Unsupervised Multilingual Learning for Morphological Segmentation
  1. An example of such a property is the distribution of part-of-speech bigrams.
    Page 2, “Related Work”
  2. Hana et al., (2004) demonstrate that adding such statistics from an annotated Czech corpus improves the performance of a Russian part-of-speech tagger over a fully unsupervised version.
    Page 2, “Related Work”
  3. The accuracy of this analyzer is reported to be 94% for full morphological analyses, and 98%-99% when part-of-speech tag accuracy is not included.
    Page 7, “Experimental SetUp”
  4. In the future, we hope to apply similar multilingual models to other core unsupervised analysis tasks, including part-of-speech tagging and grammar induction, and to further investigate the role that language relatedness plays in such models.
    Page 8, “Conclusions and Future Work”


gold standard

Appears in 3 sentences as: gold standard (3)
In Unsupervised Multilingual Learning for Morphological Segmentation
  1. This Bible edition is augmented by gold standard morphological analysis (including segmentation) performed by biblical scholars.
    Page 7, “Experimental SetUp”
  2. We obtained gold standard segmentations of the Arabic translation with a handcrafted Arabic morphological analyzer which utilizes manually constructed word lists and compatibility rules and is further trained on a large corpus of hand-annotated Arabic data (Habash and Rambow, 2005).
    Page 7, “Experimental SetUp”
  3. We don’t have gold standard segmentations for the English and Aramaic portions of the data, and thus restrict our evaluation to Hebrew and Arabic.
    Page 7, “Experimental SetUp”


part-of-speech tag

Appears in 3 sentences as: part-of-speech tag (1) part-of-speech tagger (1) part-of-speech tagging (1)
In Unsupervised Multilingual Learning for Morphological Segmentation
  1. Hana et al., (2004) demonstrate that adding such statistics from an annotated Czech corpus improves the performance of a Russian part-of-speech tagger over a fully unsupervised version.
    Page 2, “Related Work”
  2. The accuracy of this analyzer is reported to be 94% for full morphological analyses, and 98%-99% when part-of-speech tag accuracy is not included.
    Page 7, “Experimental SetUp”
  3. In the future, we hope to apply similar multilingual models to other core unsupervised analysis tasks, including part-of-speech tagging and grammar induction, and to further investigate the role that language relatedness plays in such models.
    Page 8, “Conclusions and Future Work”


segmentation model

Appears in 3 sentences as: segmentation model (2) segmentation models (1)
In Unsupervised Multilingual Learning for Morphological Segmentation
  1. Our segmentation model is based on the notion that stable recurring string patterns within words are indicative of morphemes.
    Page 3, “Model”
  2. We note that these single-language morpheme distributions also serve as monolingual segmentation models, and similar models have been successfully applied to the task of word boundary detection (Goldwater et al., 2006).
    Page 5, “Model”
  3. The probabilistic formulation of this model is close to our monolingual segmentation model, but it uses a greedy search specifically designed for the segmentation task.
    Page 8, “Experimental SetUp”
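The "stable recurring string patterns" idea presupposes a search over every way of splitting a word into contiguous morphemes. That segmentation space is easy to enumerate for a single word (a toy helper for intuition only; the paper's model explores this space with Gibbs moves rather than exhaustively):

```python
def segmentations(word):
    """Enumerate every split of `word` into contiguous morphemes."""
    if not word:
        return [[]]
    splits = []
    for i in range(1, len(word) + 1):
        # Take word[:i] as the first morpheme, then recurse on the rest.
        for rest in segmentations(word[i:]):
            splits.append([word[:i]] + rest)
    return splits

segs = segmentations("abc")  # 2**(3-1) = 4 possible segmentations
```

A word of length n has 2^(n-1) segmentations, and each must further be paired with its possible cross-lingual alignments, which is why sampling rather than enumeration is used for whole corpora.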
