Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
Ravi, Sujith and Baldridge, Jason and Knight, Kevin

Article Structure

Abstract

We combine two complementary ideas for learning supertaggers from highly ambiguous lexicons: grammar-informed tag transitions and models minimized via integer programming.

Introduction

Creating accurate part-of-speech (POS) taggers using a tag dictionary and unlabeled data is an interesting task with practical applications.

Data

CCGbank.

Grammar informed initialization for supertagging

Part-of-speech tags are atomic labels that in and of themselves encode no internal structure.

Minimized models for supertagging

The idea of searching for minimized models is related to classic Minimum Description Length (MDL) (Barron et al., 1998), which seeks to select a small model that captures the most regularity in the observed data.

Experiments

We compare the four strategies described in Sections 3 and 4, summarized below:

Conclusion

We have shown how two complementary strategies, grammar-informed tag transitions and IP-minimization, for learning supertaggers from highly ambiguous lexicons can be straightforwardly combined.

Topics

bigram

Appears in 26 sentences as: Bigram (1) bigram (15) bigrams (15)
In Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
  1. Most methods have employed some variant of Expectation Maximization (EM) to learn parameters for a bigram
    Page 1, “Introduction”
  2. Ravi and Knight (2009) achieved the best results thus far (92.3% word token accuracy) via a Minimum Description Length approach using an integer program (IP) that finds a minimal bigram grammar that obeys the tag dictionary constraints and covers the observed data.
    Page 1, “Introduction”
  3. The 1241 distinct supertags in the tagset result in 1.5 million tag bigram entries in the model and the dictionary contains almost 3.5 million word/tag pairs that are relevant to the test data.
    Page 3, “Minimized models for supertagging”
  4. The set of 45 POS tags for the same data yields 2025 tag bigrams and 8910 dictionary entries.
    Page 3, “Minimized models for supertagging”
  5. Our objective is to find the smallest supertag grammar (of tag bigram types) that explains the entire text while obeying the lexicon’s constraints.
    Page 3, “Minimized models for supertagging”
  6. IPoriginal: Find the smallest supertag grammar (i.e., tag bigrams) that can explain the entire text (the test word token sequence).
    Page 3, “Minimized models for supertagging”
  7. This produces an observed grammar of distinct tag bigrams (Gobs) and lexicon of observed lexical assignments (Lobs).
    Page 3, “Minimized models for supertagging”
  8. The first stage (Minimization 1) finds the smallest grammar Gmin1 ⊆ G that explains the set of word bigram types observed in the data rather than the word sequence itself, and the second (Minimization 2) finds the smallest augmentation of Gmin1 that explains the full word sequence.
    Page 3, “Minimized models for supertagging”
  9. IPmin1: Find the smallest set of tag bigrams Gmin1 ⊆ G, such that there is at least one tagging assignment possible for every word bigram type observed in the data.
    Page 3, “Minimized models for supertagging”
  10. We formulate this as an integer program, creating binary variables gvar_i for every tag bigram g_i = t_j t_k in G. Binary link variables connect tag bigrams with word bigrams; these are restricted
    Page 3, “Minimized models for supertagging”
  11. to the set of links that respect the lexicon L provided as input, i.e., there exists a link variable link_{jk,lm} connecting tag bigram t_j t_k with word bigram w_l w_m only if the word/tag pairs (w_l, t_j) and (w_m, t_k) are present in L. The entire integer programming formulation is shown in Figure 2. [An illustrative ILP sketch of the simpler IPmin1 step follows this list.]
    Page 4, “Minimized models for supertagging”
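
For scale (items 3 and 4 above): 1241 distinct supertags allow 1241 × 1241 ≈ 1.5 million possible tag bigrams, whereas the 45 POS tags allow only 45 × 45 = 2025, which is why a naive application of the POS-tagging approach becomes intractable here.

The sketch below is a minimal, illustrative rendering of the IPmin1 step described in items 9 and 10: choose the smallest set of tag bigrams such that every observed word bigram type has at least one lexicon-respecting tagging. It uses the PuLP ILP library; the toy lexicon, word bigrams, and variable names are invented for illustration and are not the authors' data or code.

```python
# Minimal sketch (assumptions noted above) of the IPmin1 set-cover-style ILP:
# keep as few tag bigrams as possible while licensing every word bigram type.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

# Hypothetical lexicon L: word -> set of allowed supertags.
lexicon = {
    "the": {"NP/N"},
    "cat": {"N", "NP"},
    "sleeps": {"S\\NP"},
}
# Observed word bigram types (stand-ins for the real test data).
word_bigrams = [("the", "cat"), ("cat", "sleeps")]

# Candidate tag bigrams G: pairs of supertags licensed by the lexicon for
# some observed word bigram.
tag_bigrams = sorted({(tj, tk)
                      for wl, wm in word_bigrams
                      for tj in lexicon[wl] for tk in lexicon[wm]})

prob = LpProblem("IPmin1_sketch", LpMinimize)

# gvar[(t_j, t_k)] = 1 iff tag bigram (t_j, t_k) is kept in the minimal grammar.
gvar = {g: LpVariable(f"g_{i}", cat=LpBinary) for i, g in enumerate(tag_bigrams)}

# link[(w_l, w_m), (t_j, t_k)] = 1 iff word bigram (w_l, w_m) is explained by
# (t_j, t_k); a link exists only if the lexicon allows (w_l, t_j) and (w_m, t_k).
link = {}
for wl, wm in word_bigrams:
    for tj, tk in tag_bigrams:
        if tj in lexicon[wl] and tk in lexicon[wm]:
            link[(wl, wm), (tj, tk)] = LpVariable(f"link_{len(link)}", cat=LpBinary)

# Objective: minimize the number of tag bigram types in the grammar.
prob += lpSum(gvar.values())

# Every observed word bigram type must be explained by at least one link.
for wb in word_bigrams:
    prob += lpSum(v for (w, _), v in link.items() if w == wb) >= 1

# A link can only be used if its tag bigram is selected.
for (_, g), v in link.items():
    prob += v <= gvar[g]

prob.solve()
chosen = [g for g, v in gvar.items() if v.value() > 0.5]
print("Minimal tag-bigram grammar:", chosen)
```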

CCG

Appears in 15 sentences as: CCG (15)
In Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
  1. A more challenging task is learning supertaggers for lexicalized grammar formalisms such as Combinatory Categorial Grammar (CCG) (Steedman, 2000).
    Page 1, “Introduction”
  2. Yet, this is an important task since creating grammars and resources for CCG parsers for new domains and languages is highly labor- and knowledge-intensive.
    Page 1, “Introduction”
  3. Baldridge (2008) uses grammar-informed initialization for HMM tag transitions based on the universal combinatory rules of the CCG formalism to obtain 56.1% accuracy on ambiguous word tokens, a large improvement over the 33.0% accuracy obtained with uniform initialization for tag transitions.
    Page 1, “Introduction”
  4. Applying the approach of Ravi and Knight (2009) naively to CCG supertagging is intractable due to the high level of ambiguity.
    Page 2, “Introduction”
  5. CCGbank was created by semiautomatically converting the Penn Treebank to CCG derivations (Hockenmaier and Steedman, 2007).
    Page 2, “Data”
  6. CCG-TUT was created by semiautomatically converting dependencies in the Italian Turin University Treebank to CCG derivations (Bos et al., 2009).
    Page 2, “Data”
  7. In contrast, supertags are detailed, structured labels; a universal set of grammatical rules defines how categories may combine with one another to project syntactic structure. Because of this, properties of the CCG formalism itself can be used to constrain learning, prior to considering any particular language, grammar or data set. [A toy combinability sketch follows this list.]
    Page 2, “Grammar informed initialization for supertagging”
  8. Note that supertags can be lexical categories of CCG (Steedman, 2000), elementary trees of Tree-adjoining Grammar (Joshi, 1988), or types in a feature hierarchy as in Head-driven Phrase Structure Grammar (Pollard and Sag, 1994).
    Page 2, “Grammar informed initialization for supertagging”
  9. As such, these supertags are outside of the categorial system: their use in derivations requires phrase structure rules that are not derivable from the CCG combinatory rules.
    Page 6, “Experiments”
  10. EMGI's higher recall and precision indicate the tag transition distributions do capture general patterns of linkage between adjacent CCG categories, while EM ensures that the data filters out combinable, but unnecessary, bitags.
    Page 7, “Experiments”
  11. Because the lexicon is the grammar in CCG, learning new word-category associations is grammar generalization and is of interest for grammar acquisition.
    Page 8, “Conclusion”
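
As a concrete, deliberately toy illustration of the grammar-informed initialization idea in items 3 and 7, the sketch below assigns higher initial HMM transition weight to pairs of CCG categories that can combine by forward or backward application. The category parser, the restriction to function application only (the paper draws on the full set of universal combinatory rules), and the smoothing constant are simplifying assumptions, not the authors' implementation.

```python
# Toy grammar-informed initialization sketch: combinable category pairs get a
# higher unnormalized transition weight than non-combinable ones.

def split_category(cat):
    """Return (result, slash, argument) for a complex category, or None if atomic.
    Scans right to left so the rightmost top-level slash (the outermost functor
    under left-associative CCG notation) is found."""
    depth = 0
    for i in range(len(cat) - 1, -1, -1):
        c = cat[i]
        if c == ")":
            depth += 1
        elif c == "(":
            depth -= 1
        elif depth == 0 and c in "/\\":
            return cat[:i], c, cat[i + 1:]
    return None

def strip_parens(cat):
    """Drop one pair of outer parentheses, e.g. '(S\\NP)' -> 'S\\NP'."""
    return cat[1:-1] if cat.startswith("(") and cat.endswith(")") else cat

def combinable(left, right):
    """True if `left right` can combine by forward application (X/Y Y => X)
    or backward application (Y X\\Y => X)."""
    l, r = split_category(left), split_category(right)
    if l and l[1] == "/" and strip_parens(l[2]) == strip_parens(right):
        return True
    if r and r[1] == "\\" and strip_parens(r[2]) == strip_parens(left):
        return True
    return False

def init_transition_weights(tags, smooth=0.1):
    """Unnormalized initial transition weights: 1.0 for combinable pairs,
    a small assumed smoothing constant for everything else."""
    return {(a, b): (1.0 if combinable(a, b) else smooth)
            for a in tags for b in tags}

tags = ["NP", "N", "NP/N", "S\\NP", "(S\\NP)/NP"]
weights = init_transition_weights(tags)
print(weights[("NP/N", "N")], weights[("N", "NP/N")])   # 1.0 (combinable) vs. 0.1
```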

precision and recall

Appears in 7 sentences as: Precision and recall (1) precision and recall (6)
In Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
  1. Precision and recall of grammar and lexicon.
    Page 6, “Experiments”
  2. Table 3: Comparison of grammar/lexicon observed in the model tagging vs. gold tagging in terms of precision and recall measures for supertagging on CCGbank data.
    Page 7, “Experiments”
  3. We can obtain a more fine-grained understanding of how the models differ by considering the precision and recall values for the grammars and lexicons of the different models, given in Table 3. [A sketch of this set-based computation follows the list.]
    Page 7, “Experiments”
  4. Similar trends are seen for precision and recall on the lexicon.
    Page 7, “Experiments”
  5. EMGI again focuses taggings on combinable contexts, boosting precision and recall similarly to EM+IP, but in greater measure.
    Page 7, “Experiments”
  6. Table 6 gives precision and recall for the grammars and lexicons for CCG-TUT; the values are lower than for CCGbank (in line with the lower baseline), but exhibit the same trends.
    Page 8, “Experiments”
  7. Table 6: Comparison of grammar/lexicon observed in the model tagging vs. gold tagging in terms of precision and recall measures for supertagging on CCG-TUT.
    Page 8, “Conclusion”
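
The grammar/lexicon precision and recall reported in Tables 3 and 6 compare the set of tag-bigram types (for the grammar) or word/category pairs (for the lexicon) observed in the model's tagging against those observed in the gold tagging. Below is a minimal sketch of that set-based computation; the example sets are made up for illustration.

```python
# Set-based precision/recall over observed grammars or lexicons (sketch).

def set_precision_recall(predicted, gold):
    """Precision and recall of the predicted set of types against the gold set."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall

# Example: grammars as sets of tag bigram types observed in each tagging.
model_grammar = {("NP/N", "N"), ("N", "S\\NP"), ("NP", "NP")}
gold_grammar = {("NP/N", "N"), ("N", "S\\NP")}
print(set_precision_recall(model_grammar, gold_grammar))   # precision 2/3, recall 1.0
```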

objective function

Appears in 4 sentences as: objective function (4)
In Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
  1. There are two complementary ways to use grammar-informed initialization with the IP-minimization approach: (1) using EMGI output as the starting grammar/lexicon and (2) using the tag transitions directly in the IP objective function.
    Page 5, “Minimized models for supertagging”
  2. For the second, we modify the objective function used in the two IP-minimization steps to be: [the formula is not reproduced in this excerpt; an illustrative form is sketched after this list]
    Page 5, “Minimized models for supertagging”
  3. In this way, we combine the minimization and GI strategies into a single objective function that finds a minimal grammar set while keeping the more likely tag bigrams in the chosen solution.
    Page 6, “Minimized models for supertagging”
  4. Thus, the better starting point provided by EMGI has more impact than the integer program that includes GI in its objective function.
    Page 7, “Experiments”
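
The formula referenced in item 2 is cut off in this excerpt. Purely as an illustration of the kind of combined objective item 3 describes (a minimal grammar that still prefers likely tag bigrams), one assumed form is a weighted count of selected tag bigrams, where a bigram's cost grows as its grammar-informed transition probability shrinks; this is a sketch, not the paper's actual objective:

```latex
% Illustrative only: an assumed shape for a combined minimization + GI objective.
% gvar_{jk} \in \{0,1\} selects tag bigram (t_j, t_k), P_{GI}(t_k \mid t_j) is the
% grammar-informed transition probability, and \lambda \ge 0 trades off the terms.
\min_{\mathrm{gvar}} \; \sum_{(t_j, t_k) \in G} \bigl(1 - \lambda \log P_{GI}(t_k \mid t_j)\bigr)\, \mathrm{gvar}_{jk}
```

With λ = 0 this reduces to the plain grammar-size objective of the IP-minimization steps.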

part-of-speech

Appears in 4 sentences as: Part-of-speech (1) part-of-speech (3)
In Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
  1. Creating accurate part-of-speech (POS) taggers using a tag dictionary and unlabeled data is an interesting task with practical applications.
    Page 1, “Introduction”
  2. Nonetheless, the methods proposed apply to realistic scenarios in which one has an electronic part-of-speech tag dictionary or a handcrafted grammar with limited coverage.
    Page 1, “Introduction”
  3. Part-of-speech tags are atomic labels that in and of themselves encode no internal structure.
    Page 2, “Grammar informed initialization for supertagging”
  4. A state-of-the-art, fully-supervised maximum entropy tagger (Clark and Curran, 2007) (which also uses part-of-speech labels) obtains 91.4% on the same train/test split.
    Page 7, “Experiments”

POS tagging

Appears in 4 sentences as: POS tag (1) POS tagging (2) POS tags (1)
In Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
  1. The most ambiguous word has 7 different POS tags associated with it.
    Page 1, “Introduction”
  2. We also wish to scale our methods to larger data settings than the 24k word tokens in the test data used in the POS tagging task.
    Page 3, “Minimized models for supertagging”
  3. On the simpler task of unsupervised POS tagging with a dictionary, we compared our method versus directly solving IPoriginal and found that the minimization (in terms of grammar size) achieved by our method is close to the optimal solution for the original objective and yields the same tagging accuracy far more efficiently.
    Page 5, “Minimized models for supertagging”
  4. Ravi and Knight (2009) exploited this to iteratively improve their POS tag model: since the first minimization procedure is seeded with a noisy grammar and tag dictionary, iterating the IP procedure with progressively better grammars further improves the model.
    Page 5, “Minimized models for supertagging”

Treebank

Appears in 4 sentences as: Treebank (4)
In Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
  1. Most work has focused on POS-tagging for English using the Penn Treebank (Marcus et al., 1993), such as (Banko and Moore, 2004; Goldwater and Griffiths, 2007; Toutanova and Johnson, 2008; Goldberg et al., 2008; Ravi and Knight, 2009).
    Page 1, “Introduction”
  2. This generally involves working with the standard set of 45 POS-tags employed in the Penn Treebank.
    Page 1, “Introduction”
  3. CCGbank was created by semiautomatically converting the Penn Treebank to CCG derivations (Hockenmaier and Steedman, 2007).
    Page 2, “Data”
  4. CCG-TUT was created by semiautomatically converting dependencies in the Italian Turin University Treebank to CCG derivations (Bos et al., 2009).
    Page 2, “Data”

Penn Treebank

Appears in 3 sentences as: Penn Treebank (3)
In Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
  1. Most work has focused on POS-tagging for English using the Penn Treebank (Marcus et al., 1993), such as (Banko and Moore, 2004; Goldwater and Griffiths, 2007; Toutanova and Johnson, 2008; Goldberg et al., 2008; Ravi and Knight, 2009).
    Page 1, “Introduction”
  2. This generally involves working with the standard set of 45 POS-tags employed in the Penn Treebank.
    Page 1, “Introduction”
  3. CCGbank was created by semiautomatically converting the Penn Treebank to CCG derivations (Hockenmaier and Steedman, 2007).
    Page 2, “Data”

semi-supervised

Appears in 3 sentences as: semi-supervised (3)
In Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
  1. This provides a much more challenging starting point for the semi-supervised methods typically applied to the task.
    Page 1, “Introduction”
  2. We use the standard splits of the data used in semi-supervised tagging experiments (e.g., Banko and Moore (2004)): sections 0-18 for training, 19-21 for development, and 22-24 for test.
    Page 2, “Data”
  3. The HMM when using full supervision obtains 87.6% accuracy (Baldridge, 2008), so the accuracy of 63.8% achieved by EMGI+IPGI nearly halves the gap between the supervised model and the 45.6% obtained by the basic EM semi-supervised model. [The arithmetic is spelled out after this list.]
    Page 7, “Experiments”
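
To spell out the "nearly halves" claim in item 3 using only the numbers quoted there: the gap between basic EM and the fully supervised HMM is 87.6 - 45.6 = 42.0 points, and EMGI+IPGI closes 63.8 - 45.6 = 18.2 of them, i.e., roughly 43% of the gap.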
