Minimized Models for Unsupervised Part-of-Speech Tagging
Ravi, Sujith and Knight, Kevin

Article Structure

Abstract

We describe a novel method for the task of unsupervised POS tagging with a dictionary, one that uses integer programming to explicitly search for the smallest model that explains the data, and then uses EM to set parameter values.

Introduction

In recent years, we have seen increased interest in using unsupervised methods for attacking different NLP tasks like part-of-speech (POS) tagging.

What goes wrong with EM?

We analyze the tag sequence output produced by EM and try to see where EM goes wrong.

Small Models

Bayesian sparse priors aim to create small models.

Fitting the Model

Our IP formulation can find us a small model, but it does not attempt to fit the model to the data.

Restarts and More Data

Multiple random restarts for EM, while not often emphasized in the literature, are key in this domain.
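
Since this section singles out random restarts, a minimal sketch of the restart loop follows (train_em is a hypothetical helper standing in for EM training, and the restart count is illustrative, not the paper's setting).

    # Keep the restart whose converged corpus log-likelihood is best.
    best_ll, best_model = float("-inf"), None
    for seed in range(100):  # illustrative restart count
        model, ll = train_em(corpus, dictionary, seed=seed)  # hypothetical helper
        if ll > best_ll:
            best_ll, best_model = ll, model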

Smaller Tagset and Incomplete Dictionaries

Previously, researchers working on this task have also reported results for unsupervised tagging with a smaller tagset (Smith and Eisner, 2005; Goldwater and Griffiths, 2007; Toutanova and Johnson, 2008; Goldberg et al., 2008).

Discussion

The method proposed in this paper is simple—once an integer program is produced, there are solvers available which directly give us the solution.

Conclusion

We presented a novel method for attacking dictionary-based unsupervised part-of-speech tagging.

Topics

bigrams

Appears in 11 sentences as: bigram (2) bigrams (9)
In Minimized Models for Unsupervised Part-of-Speech Tagging
  1. We investigate the Viterbi tag sequence generated by EM training and count how many distinct tag bigrams there are in that sequence.
    Page 2, “What goes wrong with EM?”
  2. That is, there exists a tag sequence that contains 459 distinct tag bigrams, and no other tag sequence contains fewer.
    Page 3, “Small Models”
  3. We also create variables for every possible tag bigram and word/tag dictionary entry.
    Page 3, “Small Models”
  4. We still give EM the full word/tag dictionary, but now we constrain its initial grammar model to the 459 tag bigrams identified by IP.
    Page 4, “Fitting the Model”
  5. In addition to removing many bad tag bigrams from the grammar, IP minimization also removes some of the good ones, leading to lower recall (EM = 0.87, IP+EM = 0.57).
    Page 4, “Fitting the Model”
  6. During EM training, the smaller grammar with fewer bad tag bigrams helps restrict the dictionary model from making many of the bad choices that EM made earlier.
    Page 4, “Fitting the Model”
  7. Note that we used a very small IP-grammar (containing only 459 tag bigrams) during EM training.
    Page 5, “Fitting the Model”
  8. In the process of minimizing the grammar size, IP ends up removing many good tag bigrams from our grammar set (as seen from the low measured recall of 0.57 for the observed grammar).
    Page 5, “Fitting the Model”
  9. Next, we proceed to recover some good tag bigrams and expand the grammar in a restricted fashion by making use of the higher-quality dictionary produced by the IP+EM method.
    Page 5, “Fitting the Model”
  10. We now run EM again on the full grammar (all possible tag bigrams) in combination with this good dictionary (containing fewer entries than the full dictionary).
    Page 5, “Fitting the Model”
  11. Unlike the original training with full grammar, where EM could choose any tag bigram, now the choice of grammar entries is constrained by the good dictionary model that we provide EM with.
    Page 5, “Fitting the Model”
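
The pipeline in the snippets above is described only in prose, so here is a minimal sketch of the grammar-minimization IP from items 2-3, assuming the PuLP solver library (the paper does not prescribe a toolkit, and all names here are illustrative). The per-word-pair linking constraint is a simplification: the paper's full formulation additionally forces the selected bigrams to form consistent tag paths through each sentence.

    # Minimal sketch (assumed PuLP API; names are illustrative).
    import pulp

    def minimize_grammar(corpus, dictionary, tags):
        """corpus: flat list of word tokens; dictionary: word -> set of allowed tags."""
        prob = pulp.LpProblem("smallest_tag_bigram_grammar", pulp.LpMinimize)

        # One binary "grammar" variable per possible tag bigram (item 3).
        gram = {(t1, t2): pulp.LpVariable(f"g_{t1}_{t2}", cat="Binary")
                for t1 in tags for t2 in tags}

        # Objective: minimize how many distinct tag bigrams are switched on.
        prob += pulp.lpSum(gram.values())

        # Simplified linking: every adjacent word pair must be coverable by
        # at least one selected tag bigram licensed by the two words'
        # dictionary entries.
        for w1, w2 in zip(corpus, corpus[1:]):
            prob += pulp.lpSum(gram[(t1, t2)]
                               for t1 in dictionary[w1]
                               for t2 in dictionary[w2]) >= 1

        prob.solve()
        return {bg for bg, v in gram.items() if v.value() == 1}

Solving an IP of this flavor over the 24,115-token corpus is what yields the 459-bigram grammar quoted in items 2, 4, and 7.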

Penn Treebank

Appears in 8 sentences as: Penn Treebank (8)
In Minimized Models for Unsupervised Part-of-Speech Tagging
  1. We use the standard test set for this task, a 24,115-word subset of the Penn Treebank, for which a gold tag sequence is available.
    Page 1, “Introduction”
  2. They show considerable improvements in tagging accuracy when using a coarser-grained version (with 17 tags) of the tag set from the Penn Treebank.
    Page 2, “Introduction”
  3. In contrast, we keep all the original dictionary entries derived from the Penn Treebank data for our experiments.
    Page 2, “Introduction”
  4. Their models are trained on the entire Penn Treebank data (instead of using only the 24,115-token test data), and so are the tagging models used by Goldberg et al.
    Page 6, “Restarts and More Data”
  5. ...training data from the 24,115-token set to the entire Penn Treebank (973k tokens).
    Page 6, “Restarts and More Data”
  6. Their systems were shown to obtain considerable improvements in accuracy when using a 17-tagset (a coarser-grained version of the tag labels from the Penn Treebank) instead of the 45-tagset.
    Page 6, “Smaller Tagset and Incomplete Dictionaries”
  7. The accuracy numbers reported for Init-HMM and LDA+AC are for models that are trained on all the available unlabeled data from the Penn Treebank.
    Page 7, “Smaller Tagset and Incomplete Dictionaries”
  8. The IP+EM models used in the 17-tagset experiments reported here were not trained on the entire Penn Treebank, but instead used a smaller section containing 77,963 tokens for estimating model parameters.
    Page 7, “Smaller Tagset and Incomplete Dictionaries”

objective function

Appears in 4 sentences as: objective function (4)
In Minimized Models for Unsupervised Part-of-Speech Tagging
  1. Finally, we add an objective function that minimizes the number of grammar variables that are assigned a value of 1.
    Page 3, “Small Models”
  2. objective function value of 459.3
    Page 4, “Small Models”
  3. In MDL, there is a single objective function to (1) maximize the likelihood of observing the data, and at the same time (2) minimize the length of the model description (which depends on the model size).
    Page 8, “Discussion”
  4. However, the search procedure for MDL is usually nontrivial, and for our task of unsupervised tagging, we have not found a direct objective function that we can optimize to produce good tagging results.
    Page 8, “Discussion”
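
To make the contrast in snippets 3-4 concrete, here is a schematic rendering in our own notation (not the paper's): MDL folds model size and data fit into one objective, while this paper optimizes them in two separate stages.

    % MDL: a single objective trades data fit against model size
    \hat{M} = \arg\min_{M} \; L(D \mid M) + L(M)

    % This paper: two stages, solved separately
    M^{*} = \arg\min_{\{M \,:\, M \text{ explains } D\}} |M| \qquad \text{(Stage 1: IP)}

    \theta^{*} = \arg\max_{\theta} \; P_{M^{*}}(D \mid \theta) \qquad \text{(Stage 2: EM)}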

POS tagging

Appears in 4 sentences as: POS tag (1) POS taggers (1) POS tagging (2)
In Minimized Models for Unsupervised Part-of-Speech Tagging
  1. We describe a novel method for the task of unsupervised POS tagging with a dictionary, one that uses integer programming to explicitly search for the smallest model that explains the data, and then uses EM to set parameter values.
    Page 1, “Abstract”
  2. The classic Expectation Maximization (EM) algorithm has been shown to perform poorly on POS tagging, when compared to other techniques, such as Bayesian methods.
    Page 1, “Introduction”
  3. Goldberg et al. (2008) depart from the Bayesian framework and show how EM can be used to learn good POS taggers for Hebrew and English, when provided with good initial conditions.
    Page 2, “Introduction”
  4. The overall POS tag distribution learnt by EM is relatively uniform, as noted by Johnson (2007), and it tends to assign an equal number of tokens to each tag label.
    Page 2, “What goes wrong with EM?”
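
Snippet 4's near-uniformity claim is easy to check against any Viterbi output; below is a small illustrative helper (ours, not the paper's code) that scores how evenly tokens are spread across tags.

    # Normalized entropy of the tokens-per-tag distribution: values near
    # 1.0 mean tokens are spread almost uniformly across the tags, the
    # behaviour Johnson (2007) observed for EM. (Illustrative helper.)
    from collections import Counter
    import math

    def tag_uniformity(viterbi_tags):
        counts = Counter(viterbi_tags)
        if len(counts) < 2:
            return 1.0  # a single tag is trivially "uniform"
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        return entropy / math.log2(len(counts))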

precision and recall

Appears in 3 sentences as: precision and recall (3)
In Minimized Models for Unsupervised Part-of-Speech Tagging
  1. We also measure the quality of the two observed grammars/dictionaries by computing their precision and recall against the grammar/dictionary we observe in the gold tagging. We find that precision of the observed grammar increases from 0.73 (EM) to 0.94 (IP+EM).
    Page 4, “Fitting the Model”
  2. Figure 6: Comparison of observed grammars from the model tagging vs. gold tagging in terms of precision and recall measures.
    Page 6, “Fitting the Model”
  3. Figure 7: Comparison of observed dictionaries from the model tagging vs. gold tagging in terms of precision and recall measures.
    Page 6, “Restarts and More Data”
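
The grammar comparison in snippet 1 reduces to set operations over tag bigrams; the sketch below is our reconstruction of that measure from the snippets alone (the paper's footnoted definition may differ in detail).

    # Observed grammar = set of tag bigrams appearing in a tagging.
    # Precision/recall compare the model's set against the gold set.
    def grammar_precision_recall(model_tags, gold_tags):
        observed = set(zip(model_tags, model_tags[1:]))
        gold = set(zip(gold_tags, gold_tags[1:]))
        overlap = observed & gold
        return len(overlap) / len(observed), len(overlap) / len(gold)

Under this reading, the numbers above say IP+EM trades grammar recall (0.87 down to 0.57) for much higher grammar precision (0.73 up to 0.94).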
