A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
Blunsom, Phil and Cohn, Trevor

Article Structure

Abstract

In this work we address the problem of unsupervised part-of-speech induction by bringing together several strands of research into a single model.

Introduction

Unsupervised part-of-speech (PoS) induction has long been a central challenge in computational linguistics, with applications in human language learning and for developing portable language processing systems.

Background

Past research in unsupervised PoS induction has largely been driven by two different motivations: a task-based perspective which has focussed on inducing word classes to improve various applications, and a linguistic perspective where the aim is to induce classes which correspond closely to annotated part-of-speech corpora.

The PYP-HMM

We develop a trigram hidden Markov model which models the joint probability of a sequence of latent tags, $\mathbf{t}$, and words, $\mathbf{w}$, as
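A minimal sketch of the factorisation this sentence introduces, in standard trigram HMM notation (the paper's exact symbols may differ):

$$P(\mathbf{t}, \mathbf{w}) \;=\; \prod_{l=1}^{L} P(t_l \mid t_{l-1}, t_{l-2})\; P(w_l \mid t_l)$$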

Experiments

We perform experiments with a range of corpora, both to investigate the properties of our proposed models and inference algorithms and to establish their robustness across languages and domains.

Discussion

The hidden Markov model, originally developed by Brown et al.

Topics

language model

Appears in 16 sentences as: language model (8) language modelling (5) language models (3)
In A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
  1. Our work brings together several strands of research including Bayesian nonparametric HMMs (Goldwater and Griffiths, 2007), Pitman-Yor language models (Teh, 2006b; Goldwater et al., 2006b), tagging constraints over word types (Brown et al., 1992) and the incorporation of morphological features (Clark, 2003).
    Page 1, “Introduction”
  2. Early work was firmly situated in the task-based setting of improving generalisation in language models.
    Page 2, “Background”
  3. This model has been popular for language modelling and bilingual word alignment, and an implementation with improved inference called mkcls (Och, 1999) has become a standard part of statistical machine translation systems.
    Page 2, “Background”
  4. (1992)'s HMM by incorporating a character language model, allowing the modelling of limited morphology.
    Page 2, “Background”
  5. Our work draws from these models, in that we develop a HMM with a one-class-per-tag restriction and include a character-level language model.
    Page 2, “Background”
  6. Research in language modelling (Teh, 2006b; Goldwater et al., 2006a) and parsing (Cohn et al., 2010) has shown that models employing Pitman-Yor priors can significantly outperform the more frequently used Dirichlet priors, especially where complex hierarchical relationships exist between latent variables.
    Page 2, “Background”
  7. Prior work in unsupervised PoS induction has employed simple smoothing techniques, such as additive smoothing or Dirichlet priors (Goldwater and Griffiths, 2007; Johnson, 2007); however, this body of work has overlooked recent advances in smoothing methods used for language modelling (Teh, 2006b; Goldwater et al., 2006b).
    Page 3, “The PYP-HMM”
  8. The PYP has been shown to generate distributions particularly well suited to modelling language (Teh, 2006a; Goldwater et al., 2006b), and has been shown to be a generalisation of Kneser-Ney smoothing, widely recognised as the best smoothing method for language modelling (Chen and Goodman, 1996).
    Page 3, “The PYP-HMM”
  9. We consider two different settings for the base distribution $C_j$: 1) a simple uniform distribution over the vocabulary (denoted HMM for the experiments in section 4); and 2) a character-level language model (denoted HMM+LM).
    Page 3, “The PYP-HMM”
  10. In many languages morphological regularities correlate strongly with a word's part-of-speech (e.g., suffixes in English), which we hope to capture using a basic character language model.
    Page 3, “The PYP-HMM”
  11. Figure 2: The conditioning structure of the hierarchical PYP with an embedded character language model.
    Page 4, “The PYP-HMM”
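
Items 4, 5, 9 and 10 above describe using a character-level language model as the emission base distribution, so that morphological regularities (e.g., English suffixes) inform the induced tags. The sketch below is a deliberately simplified, hypothetical illustration of that idea: a per-tag character bigram model with add-k smoothing rather than the paper's hierarchical PYP character model; the class name and smoothing scheme are assumptions for illustration only.

```python
from collections import defaultdict

class CharBigramLM:
    """Illustrative per-tag character bigram model (not the paper's PYP version).

    A word is scored as the product of P(c_k | c_{k-1}), with '^' marking the
    start of the word and '$' its end. Add-k smoothing stands in for the
    hierarchical PYP smoothing used in the paper.
    """

    def __init__(self, alphabet="abcdefghijklmnopqrstuvwxyz", k=0.1):
        self.vocab_size = len(alphabet) + 1          # +1 for the '$' end symbol
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(float))

    def observe(self, word):
        # Count character bigrams, including start and end transitions.
        for prev, cur in zip("^" + word, word + "$"):
            self.counts[prev][cur] += 1.0

    def prob(self, word):
        # Probability of the whole character sequence under the bigram model.
        p = 1.0
        for prev, cur in zip("^" + word, word + "$"):
            total = sum(self.counts[prev].values())
            p *= (self.counts[prev][cur] + self.k) / (total + self.k * self.vocab_size)
        return p

# Usage: in an HMM+LM-style model one such distribution would be maintained per
# induced tag, trained on the words currently assigned to that tag.
lm = CharBigramLM()
for w in ["walked", "talked", "jumped"]:
    lm.observe(w)
print(lm.prob("parked"))   # the frequently seen '-ed' ending keeps this score relatively high
```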

bigram

Appears in 13 sentences as: Bigram (2) bigram (9) bigrams (6)
In A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
  1. This work differs from previous Bayesian models in that we explicitly model a complex backoff path using a hierarchical prior, such that our model jointly infers distributions over tag trigrams, bigrams and unigrams, and whole words and their character-level representation.
    Page 2, “Background”
  2. The trigram transition distribution, $T_{ij}$, is drawn from a hierarchical PYP prior which backs off to a bigram $B_j$ and then a unigram $U$ distribution,
    Page 3, “The PYP-HMM”
  3. This allows the modelling of trigram tag sequences, while smoothing these estimates with their corresponding bigram and unigram distributions.
    Page 3, “The PYP-HMM”
  4. We formulate the character-level language model as a bigram model over the character sequence comprising word $w_l$,
    Page 4, “The PYP-HMM”
  5. The HMM+LM is shown in Figure 2, illustrating the decomposition of the tag sequence into n-grams and a word into its component character bigrams .
    Page 4, “The PYP-HMM”
  6. illustrating the trigram transition distribution, where $\mathbf{t}_{-l}$ are all previous tags, $h = (t_{l-2}, t_{l-1})$ is the conditioning bigram, $n^{-}_{hi}$ is the count of the trigram $hi$ in $\mathbf{t}_{-l}$, $n^{-}_{h\cdot}$ the total count over all trigrams beginning with $h$, $K^{-}_{hi}$ the number of tables serving dish $i$, and $P_B$ is the base distribution, in this case the bigram distribution.
    Page 4, “The PYP-HMM”
  7. This calculation is complicated by the fact that these events are not independent; the counts of one trigram can affect the probability of later ones, and moreover, the table assignment for the trigram may also affect the bigram and unigram counts, of particular import when the same tag occurs twice in a row such as in Figure 2.
    Page 5, “The PYP-HMM”
  8. In our model we would need to sum over all possible table assignments that result in the same tagging, at all levels in the hierarchy: tag trigrams, bigrams and unigrams; and also words, character bigrams and character unigrams.
    Page 5, “The PYP-HMM”
  9. The fractional table count from the trigram then results in a fractional customer entering the bigram restaurant, and so on down to unigrams.
    Page 6, “The PYP-HMM”
  10. mkcls (Och, 1999)              73.7   65.6
      MLE 1HMM-LM (Clark, 2003)*     71.2   65.5
      BHMM (GG07)                    63.2   56.2
      PR (Ganchev et al., 2010)*     62.5   54.8
      Trigram PYP-HMM                69.8   62.6
      Trigram PYP-1HMM               76.0   68.0
      Trigram PYP-1HMM-LM            77.5   69.7
      Bigram PYP-HMM                 66.9   59.2
      Bigram PYP-1HMM                72.9   65.9
      Trigram DP-HMM                 68.1   60.0
      Trigram DP-1HMM                76.0   68.0
      Trigram DP-1HMM-LM             76.8   69.8
    Page 7, “Experiments”
  11. If we restrict the model to bigrams we see a considerable drop in performance.
    Page 7, “Experiments”
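
Items 2 and 3 above describe how the trigram transition distribution is smoothed by backing off to bigram and unigram distributions. A sketch of that hierarchy in generic hierarchical Pitman-Yor notation, with per-level discount and concentration parameters $(a, b)$ (the paper's exact parameterisation may differ):

$$t_l \mid t_{l-1}{=}j,\, t_{l-2}{=}i \;\sim\; T_{ij}$$
$$T_{ij} \;\sim\; \mathrm{PYP}(a^{T}, b^{T}, B_j)$$
$$B_j \;\sim\; \mathrm{PYP}(a^{B}, b^{B}, U)$$
$$U \;\sim\; \mathrm{PYP}(a^{U}, b^{U}, \mathrm{Uniform})$$

Each level shrinks towards the distribution below it, so rare trigram contexts fall back on bigram and ultimately unigram tag statistics.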

unigram

Appears in 9 sentences as: unigram (5) unigrams (5)
In A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
  1. This work differs from previous Bayesian models in that we explicitly model a complex backoff path using a hierarchical prior, such that our model jointly infers distributions over tag trigrams, bigrams and unigrams, and whole words and their character-level representation.
    Page 2, “Background”
  2. The trigram transition distribution, $T_{ij}$, is drawn from a hierarchical PYP prior which backs off to a bigram $B_j$ and then a unigram $U$ distribution,
    Page 3, “The PYP-HMM”
  3. This allows the modelling of trigram tag sequences, while smoothing these estimates with their corresponding bigram and unigram distributions.
    Page 3, “The PYP-HMM”
  4. That is, each table at one level is equivalent to a customer at the next deeper level, creating the invariants $K_{hi} = n_{ui}$ and $K_{ui} = n_{i}$, where $u = t_{l-1}$ indicates the unigram backoff context of $h$. The recursion terminates at the lowest level where the base distribution is static.
    Page 5, “The PYP-HMM”
  5. This calculation is complicated by the fact that these events are not independent; the counts of one trigram can affect the probability of later ones, and moreover, the table assignment for the trigram may also affect the bigram and unigram counts, of particular import when the same tag occurs twice in a row such as in Figure 2.
    Page 5, “The PYP-HMM”
  6. In our model we would need to sum over all possible table assignments that result in the same tagging, at all levels in the hierarchy: tag trigrams, bigrams and unigrams; and also words, character bigrams and character unigrams.
    Page 5, “The PYP-HMM”
  7. where $K_i$ is the number of tables for the tag unigram $i$, of which there are $n + 1$ occurrences, and $E_n$ denotes an expectation after observing $n$ items.
    Page 5, “The PYP-HMM”
  8. The fractional table count from the trigram then results in a fractional customer entering the bigram restaurant, and so on down to unigrams.
    Page 6, “The PYP-HMM”
  9. Note that the bigram PYP-HMM outperforms the closely related BHMM (the main difference being that we smooth tag bigrams with unigrams).
    Page 7, “Experiments”
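
The table counts discussed in items 4-8 above enter through the standard Pitman-Yor Chinese restaurant predictive rule. A sketch using the symbols defined in item 6 of the bigram list (discount $a$, concentration $b$); the paper's exact formulation may differ:

$$P(t_l = i \mid \mathbf{t}_{-l}) \;=\; \frac{n^{-}_{hi} - a\,K^{-}_{hi}}{n^{-}_{h\cdot} + b} \;+\; \frac{a\,K^{-}_{h\cdot} + b}{n^{-}_{h\cdot} + b}\; P_B(i)$$

where $K^{-}_{h\cdot}$ is the total number of tables in context $h$. Seating a customer at a new table sends a customer to the bigram restaurant for $P_B$, which links the counts at adjacent levels as described in item 4.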

Gibbs sampler

Appears in 7 sentences as: Gibbs sampler (5) Gibbs samplers (2) Gibbs sampling (1)
In A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
  1. However this work approximated the derivation of the Gibbs sampler (omitting the interdependence between events when sampling from a collapsed model), resulting in a model which underperformed Brown et al.
    Page 2, “Background”
  2. In order to induce a tagging under this model we use Gibbs sampling, a Markov chain Monte Carlo (MCMC) technique for drawing samples from the posterior distribution over the tag sequences given observed word sequences.
    Page 4, “The PYP-HMM”
  3. We present two different sampling strategies: First, a simple Gibbs sampler which randomly samples an update to a single tag given all other tags; and second, a type-level sampler which updates all tags for a given word under a
    Page 4, “The PYP-HMM”
  4. Gibbs samplers: Both our Gibbs samplers perform the same calculation of conditional tag distributions, and involve first decrementing all trigrams and emissions affected by a sampling action, and then reintroducing the trigrams one at a time, conditioning their probabilities on the updated counts and table configurations as we progress.
    Page 5, “The PYP-HMM”
  5. The first local Gibbs sampler (PYP-HMM) updates a single tag assignment at a time, in a similar fashion to Goldwater and Griffiths (2007).
    Page 5, “The PYP-HMM”
  6. This approximation is tight for small n, and therefore it should be effective in the case of the local Gibbs sampler where only three trigrams are being resampled.
    Page 5, “The PYP-HMM”
  7. We have omitted the results for the HMM-LM as experimentation showed that the local Gibbs sampler became hopelessly stuck, failing to
    Page 7, “Experiments”
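
Items 3-5 above outline the local (token-level) Gibbs sampler: decrement the counts touched by a position, score every candidate tag, sample, and reinsert. The sketch below is a hypothetical illustration of that loop; the helper names `decrement`, `predictive_prob` and `increment` stand in for the model's actual restaurant bookkeeping and are assumptions, not the paper's API.

```python
import random

def resample_tag(model, tags, words, l, num_tags):
    """One local Gibbs update for the tag at position l (illustrative sketch)."""
    # Remove the three trigrams that overlap position l, plus the emission of
    # words[l], from the restaurant counts and table configurations.
    model.decrement(tags, words, l)

    # Score each candidate tag under the updated counts; the conditional is
    # proportional to the product of the affected trigram probabilities and
    # the emission probability.
    scores = []
    for k in range(num_tags):
        tags[l] = k
        scores.append(model.predictive_prob(tags, words, l))

    # Sample a new tag in proportion to the scores.
    r, acc = random.random() * sum(scores), 0.0
    for k, s in enumerate(scores):
        acc += s
        if r <= acc:
            tags[l] = k
            break

    # Reintroduce the affected trigrams and emission, conditioning on the
    # updated counts as they are added back.
    model.increment(tags, words, l)
```

The type-level sampler mentioned in item 3 instead updates every occurrence of a word type in one move, a coarser proposal intended to avoid the kind of slow mixing described in item 7.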

treebank

Appears in 6 sentences as: Treebank (2) treebank (3) treebanked (1)
In A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
  1. Though PoS induction was not their aim, this restriction is largely validated by empirical analysis of treebanked data, and moreover conveys the significant advantage that all the tags for a given word type can be updated at the same time, allowing very efficient inference using the exchange algorithm.
    Page 2, “Background”
  2. Recent work on unsupervised PoS induction has focussed on encouraging sparsity in the emission distributions in order to match empirical distributions derived from treebank data (Goldwater and Griffiths, 2007; Johnson, 2007; Gao and Johnson, 2008).
    Page 2, “Background”
  3. Treebank (Marcus et al., 1993), while for other languages we use the corpora made available for the CoNLL-X Shared Task (Buchholz and Marsi, 2006).
    Page 6, “Experiments”
  4. Treebank, along with a number of state-of-the-art results previously reported (Table 1).
    Page 7, “Experiments”
  5. The former shows that both our models and mkcls induce a more uniform distribution over tags than specified by the treebank.
    Page 7, “Experiments”
  6. It is unclear whether it is desirable for models to exhibit behavior closer to the treebank, which dedicates separate tags to very infrequent phenomena while lumping the large range of noun types into a single category.
    Page 7, “Experiments”

part-of-speech

Appears in 4 sentences as: part-of-speech (4)
In A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
  1. Unsupervised part-of-speech (PoS) induction has long been a central challenge in computational linguistics, with applications in human language learning and for developing portable language processing systems.
    Page 1, “Introduction”
  2. Past research in unsupervised PoS induction has largely been driven by two different motivations: a task-based perspective which has focussed on inducing word classes to improve various applications, and a linguistic perspective where the aim is to induce classes which correspond closely to annotated part-of-speech corpora.
    Page 2, “Background”
  3. The HMM ignores orthographic information, which is often highly indicative of a word's part-of-speech, particularly so in morphologically rich languages.
    Page 2, “Background”
  4. In many languages morphological regularities correlate strongly with a word’s part-of-speech (e.g., suffixes in English), which we hope to capture using a basic character language model.
    Page 3, “The PYP-HMM”

hyperparameters

Appears in 3 sentences as: hyperparameter (1) hyperparameters (2)
In A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
  1. The arrangement of customers at tables defines a clustering which exhibits a power-law behavior controlled by the hyperparameters a and b.
    Page 4, “The PYP-HMM”
  2. Sampling hyperparameters: We treat the hyperparameters $\{(a^{x}, b^{x}),\ x \in (U, B, T, E, C)\}$ as random variables in our model and infer their values.
    Page 6, “The PYP-HMM”
  3. The result of this hyperparameter inference is that there are no user tunable parameters in the model, an important feature that we believe helps explain its consistently high performance across test settings.
    Page 6, “The PYP-HMM”
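
Item 2 above treats each level's PYP discount and concentration as random variables to be inferred. The sketch below shows one common, generic way to resample such hyperparameters (a random-walk Metropolis-Hastings step against the model's log posterior); it is an assumption for illustration, not the paper's actual procedure, and `log_posterior` is a hypothetical callback that rescores the current seating arrangements under candidate values.

```python
import math
import random

def mh_resample_pyp_hyperparameters(a, b, log_posterior, step=0.1):
    """One Metropolis-Hastings update for a PYP discount a in (0, 1) and
    concentration b > -a. Illustrative sketch only, not the paper's sampler."""
    cur_lp = log_posterior(a, b)

    # Propose nearby values with a small random walk, clipped to respect the
    # PYP constraints 0 < a < 1 and b > -a.
    a_new = min(max(a + random.uniform(-step, step), 1e-6), 1.0 - 1e-6)
    b_new = max(b + random.uniform(-step, step), -a_new + 1e-6)

    # Accept or reject with the usual MH ratio (the clipped proposal is
    # treated as symmetric here for simplicity).
    if math.log(random.random()) < log_posterior(a_new, b_new) - cur_lp:
        return a_new, b_new
    return a, b
```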
