Faster Parsing by Supertagger Adaptation
Kummerfeld, Jonathan K. and Roesner, Jessika and Dawborn, Tim and Haggerty, James and Curran, James R. and Clark, Stephen

Article Structure

Abstract

We propose a novel self-training method for a parser which uses a lexicalised grammar and supertagger, focusing on increasing the speed of the parser rather than its accuracy.

Introduction

In many NLP tasks and applications, e.g.

Background

Many statistical parsers use two stages: a tagging stage that labels each word with its grammatical role, and a parsing stage that uses the tags to form a parse tree.

Adaptive Supertagging

The purpose of the supertagger is to cut down the search space for the parser by reducing the set of categories that must be considered for each word.

Data

In this work, we consider three domains: newswire, Wikipedia text and biomedical text.

Evaluation

We used the hybrid parsing model described in Clark and Curran (2007), and the Viterbi decoder to find the highest-scoring derivation.

Results

We have performed four primary sets of experiments to explore the ability of an adaptive supertagger to improve parsing speed or accuracy.

Conclusion

This work has demonstrated that an adapted supertagger can improve parsing speed and accuracy.

Topics

CCG

Appears in 13 sentences as: CCG (14)
In Faster Parsing by Supertagger Adaptation
  1. We demonstrate the effectiveness of the method using a CCG supertagger and parser, obtaining significant speed increases on newspaper text with no loss in accuracy.
    Page 1, “Abstract”
  2. We also show that the method can be used to adapt the CCG parser to new domains, obtaining accuracy and speed improvements for Wikipedia and biomedical text.
    Page 1, “Abstract”
  3. Parsing with lexicalised grammar formalisms, such as Lexicalised Tree Adjoining Grammar and Combinatory Categorial Grammar (CCG; Steedman, 2000), can be made more efficient using a supertagger.
    Page 1, “Introduction”
  4. In this paper, we focus on the CCG parser and supertagger described in Clark and Curran (2007).
    Page 1, “Introduction”
  5. Since the CCG lexical category set used by the supertagger is much larger than the Penn Treebank POS tag set, the accuracy of supertagging is much lower than POS tagging; hence the CCG supertagger assigns multiple supertags to a word, when the local context does not provide enough information to decide on the correct supertag.
    Page 1, “Introduction”
  6. Figure 1 gives two sentences and their CCG derivations, showing how some of the syntactic ambiguity is transferred to the supertagging component in a lexicalised grammar.
    Page 2, “Background”
  7. Figure 1: Two CCG derivations with PP ambiguity.
    Page 2, “Background”
  8. Clark and Curran (2004) applied supertagging to CCG, using a flexible multi-tagging approach.
    Page 2, “Background”
  9. CCG supertaggers are about 92% accurate when assigning a single lexical category to each word (Clark and Curran, 2004).
    Page 3, “Adaptive Supertagging”
  10. We have used Sections 02-21 of CCGbank (Hockenmaier and Steedman, 2007), the CCG version of the Penn Treebank (Marcus et al., 1993), as training data for the newspaper domain.
    Page 4, “Data”
  11. For supertagger evaluation, one thousand sentences were manually annotated with CCG lexical categories and POS tags.
    Page 4, “Data”
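
Sentences 5 and 8 above describe the supertagger's flexible multi-tagging: when the local context is ambiguous, every category whose probability is within some factor β of the most probable category is kept (Clark and Curran, 2004). A minimal sketch of that cut-off; the category names and the β value are illustrative, not taken from the paper.

```python
def multitag(probs, beta=0.075):
    """Keep every supertag whose probability is within a factor
    beta of the most probable supertag for this word.

    probs: {supertag: probability} for a single word.
    Returns the pruned set of candidates passed to the parser.
    """
    best = max(probs.values())
    return {tag for tag, p in probs.items() if p >= beta * best}

# A confident word keeps one category; an ambiguous one keeps several.
clear = {"N": 0.97, "N/N": 0.02, "S/S": 0.01}
ambiguous = {"(S\\NP)/NP": 0.40, "((S\\NP)/PP)/NP": 0.35, "N": 0.25}
print(multitag(clear))      # {'N'}
print(multitag(ambiguous))  # all three categories survive
```

Lowering β admits more categories per word, trading parsing speed against the chance of keeping the correct category; this is the ambiguity knob that the adaptation experiments tune.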

F-score

Appears in 10 sentences as: F-score (11)
In Faster Parsing by Supertagger Adaptation
  1. Using an adapted supertagger with ambiguity levels tuned to match the baseline system, we were also able to increase F-score on labelled grammatical relations by 0.75%.
    Page 2, “Introduction”
  2. Interestingly, while the decrease in supertag accuracy in the previous experiment did not translate into a decrease in F-score, the increase in tag accuracy here does translate into an increase in F-score.
    Page 7, “Results”
  3. The increase in F-score has two sources.
    Page 7, “Results”
  4. As Table 6 shows, this change translates into an improvement of up to 0.75% in F-score on Section
    Page 7, “Results”
  5. [Table column headers] F-score (%), Speed (sents/sec)
    Page 7, “Results”
  6. All changes in F-score are statistically significant.
    Page 7, “Results”
  7. As Table 8 shows, in all cases the use of supertagger-annotated data led to poorer performance than the baseline system, while the use of parser-annotated data led to an improvement in F-score.
    Page 8, “Results”
  8. Meanwhile, on the target domain they are adapted to, these models achieve a higher F-score and parse sentences at least 45% faster than the baseline.
    Page 8, “Results”
  9. As for the newspaper domain, we observe increased supertag accuracy and F-score.
    Page 8, “Results”
  10. All of the training methods were tried, but only the method with the best newswire results is included here; for F-score when trained on 400,000 sentences, this was GIS.
    Page 8, “Results”
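
The F-score reported throughout is the harmonic mean of precision and recall over labelled grammatical relations. A minimal sketch, assuming each relation can be compared as a (label, head, dependent) tuple; the paper's evaluation scripts may represent relations differently.

```python
def labelled_f_score(predicted, gold):
    """Precision/recall/F1 over sets of labelled grammatical
    relations, each a (label, head, dependent) tuple."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("dobj", "ate", "pizza"), ("nsubj", "ate", "John")}
pred = {("dobj", "ate", "pizza"), ("nsubj", "pizza", "John")}
print(labelled_f_score(pred, gold))  # 0.5, since P = R = 1/2
```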

baseline system

Appears in 6 sentences as: baseline system (6)
In Faster Parsing by Supertagger Adaptation
  1. By increasing the ambiguity level of the adaptive models to match the baseline system, we can also slightly increase supertagging accuracy, which can lead to higher parsing accuracy.
    Page 1, “Introduction”
  2. Using an adapted supertagger with ambiguity levels tuned to match the baseline system, we were also able to increase F-score on labelled grammatical relations by 0.75%.
    Page 2, “Introduction”
  3. Both sets of annotations were produced by manually correcting the output of the baseline system.
    Page 4, “Data”
  4. As Table 8 shows, in all cases the use of supertagger-annotated data led to poorer performance than the baseline system, while the use of parser-annotated data led to an improvement in F-score.
    Page 8, “Results”
  5. However, on the corpus of the extra data, the performance of the adapted models is comparable to the baseline model, which means the parser is probably still receiving the same categories that it used from the sets provided by the baseline system.
    Page 8, “Results”
  6. The fastest model parsed sentences 1.85 times as fast and was as accurate as the baseline system.
    Page 9, “Conclusion”
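
Sentences 1 and 2 above refer to tuning ambiguity levels: an adapted supertagger is more confident, so at a fixed β it assigns fewer categories per word than the baseline, and lowering β restores the baseline's ambiguity (average categories per word). A hedged sketch of that matching step; the distributions, target ambiguity, and candidate β values are invented for illustration.

```python
def ambiguity(word_dists, beta):
    """Average number of supertags kept per word at cut-off beta.
    word_dists: list of per-word {supertag: probability} dicts."""
    kept = sum(
        sum(p >= beta * max(dist.values()) for p in dist.values())
        for dist in word_dists
    )
    return kept / len(word_dists)

def tune_beta(word_dists, target, candidates):
    """Pick the beta whose ambiguity is closest to the baseline's."""
    return min(candidates, key=lambda b: abs(ambiguity(word_dists, b) - target))

# Hypothetical data: two words' distributions, baseline ambiguity 1.5.
words = [{"N": 0.9, "N/N": 0.1}, {"N": 0.6, "N/N": 0.3, "S/S": 0.1}]
print(tune_beta(words, 1.5, [0.15, 0.075, 0.03, 0.01]))  # 0.15 is closest
```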

gold standard

Appears in 6 sentences as: gold standard (6)
In Faster Parsing by Supertagger Adaptation
  1. As gold standard data for supertagger evaluation we have used supertagged GENIA data (Kim et al., 2003), annotated by Rimell and Clark (2008).
    Page 4, “Data”
  2. In the second set of experiments, we train on a mixture of gold standard newswire data and parser-annotated data from the target domain.
    Page 5, “Results”
  3. Table 10: Performance comparison for models using extra gold standard biomedical data.
    Page 9, “Results”
  4. This is because no gold standard biomedical training data was used in our experiments.
    Page 9, “Results”
  5. Table 10 shows the results of adding Rimell and Clark’s gold standard biomedical supertag data and using their biomedical POS-tagger.
    Page 9, “Results”
  6. The table also shows how accuracy can be further improved by adding our parser-annotated data from the biomedical domain as well as the additional gold standard data.
    Page 9, “Results”

models trained

Appears in 6 sentences as: Models trained (1) models trained (5)
In Faster Parsing by Supertagger Adaptation
  1. For all four algorithms the training time is proportional to the amount of data, but the GIS and BFGS models trained on only CCGbank took 4,500 and 4,200 seconds to train, while the equivalent perceptron and MIRA models took 90 and 95 seconds to train.
    Page 7, “Results”
  2. For speed improvement these were MIRA models trained on 4,000,000 parser-annotated
    Page 8, “Results”
  3. In particular, note that models trained on Wikipedia or the biomedical data produce lower F-scores than the baseline on newswire.
    Page 8, “Results”
  4. The ambiguity tuning method used to improve accuracy on the newspaper domain can also be applied to the models trained on other domains.
    Page 8, “Results”
  5. In Table 7, we have tested models trained using GIS and 400,000 sentences of parsed target-domain text, with β levels tuned to match ambiguity with the baseline.
    Page 8, “Results”
  6. Models trained on parser-annotated Wikipedia text and MEDLINE text had improved performance on these target domains, in terms of both speed and accuracy.
    Page 9, “Conclusion”

lexicalised

Appears in 4 sentences as: Lexicalised (2) lexicalised (3)
In Faster Parsing by Supertagger Adaptation
  1. We propose a novel self-training method for a parser which uses a lexicalised grammar and supertagger, focusing on increasing the speed of the parser rather than its accuracy.
    Page 1, “Abstract”
  2. Parsing with lexicalised grammar formalisms, such as Lexicalised Tree Adjoining Grammar and Combinatory Categorial Grammar (CCG; Steedman, 2000), can be made more efficient using a supertagger.
    Page 1, “Introduction”
  3. Lexicalised grammars typically contain a much smaller set of rules than phrase-structure grammars, relying on tags (supertags) that contain a more detailed description of each word’s role in the sentence.
    Page 2, “Background”
  4. Figure 1 gives two sentences and their CCG derivations, showing how some of the syntactic ambiguity is transferred to the supertagging component in a lexicalised grammar.
    Page 2, “Background”

POS tagging

Appears in 4 sentences as: POS tag (1) POS tagging (2) POS tags (2)
In Faster Parsing by Supertagger Adaptation
  1. Since the CCG lexical category set used by the supertagger is much larger than the Penn Treebank POS tag set, the accuracy of supertagging is much lower than POS tagging; hence the CCG supertagger assigns multiple supertags to a word, when the local context does not provide enough information to decide on the correct supertag.
    Page 1, “Introduction”
  2. Clark et al. (2003) were unable to improve the accuracy of POS tagging using self-training.
    Page 1, “Introduction”
  3. The C&C supertagger is similar to the Ratnaparkhi (1996) tagger, using features based on words and POS tags in a five-word window surrounding the target word, and defining a local probability distribution over supertags for each word in the sentence, given the previous two supertags.
    Page 2, “Background”
  4. For supertagger evaluation, one thousand sentences were manually annotated with CCG lexical categories and POS tags.
    Page 4, “Data”
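
Sentence 3 above lists the tagger's feature set: words and POS tags in a five-word window around the target word, plus the previous two supertags. A sketch of how such Ratnaparkhi-style templates might be instantiated; the feature-name format is illustrative, not the C&C implementation's.

```python
def local_features(words, pos_tags, i, prev_supertags):
    """Feature strings for position i: five-word window of words
    and POS tags, plus the previous one and two supertags."""
    feats = []
    for offset in range(-2, 3):  # the five-word window
        j = i + offset
        inside = 0 <= j < len(words)
        feats.append(f"w[{offset}]={words[j] if inside else '<PAD>'}")
        feats.append(f"p[{offset}]={pos_tags[j] if inside else '<PAD>'}")
    feats.append(f"s[-1]={prev_supertags[-1]}")
    feats.append(f"s[-2,-1]={prev_supertags[-2]}|{prev_supertags[-1]}")
    return feats

words = ["John", "ate", "the", "pizza"]
pos = ["NNP", "VBD", "DT", "NN"]
print(local_features(words, pos, 1, ["<S>", "NP"]))
```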

statistically significant

Appears in 4 sentences as: statistically significant (4)
In Faster Parsing by Supertagger Adaptation
  1. To check whether changes were statistically significant we applied the test described by Chinchor (1995).
    Page 5, “Evaluation”
  2. The BFGS, GIS and MIRA models produced mixed results, but no statistically significant decrease in accuracy, and as the amount of parser-annotated data was increased, parsing speed increased by up to 85%.
    Page 6, “Results”
  3. All changes in F-score are statistically significant.
    Page 7, “Results”
  4. All of the new models in the table make a statistically significant improvement over the baseline.
    Page 7, “Results”
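
The Chinchor (1995) test referred to in sentence 1 is significance testing by approximate randomization: shuffle which system each output came from and count how often the shuffled difference is at least as large as the observed one. A minimal sketch over paired per-sentence scores; the paper's statistic is the F-score difference recomputed from shuffled counts, while this sketch uses a difference of means for brevity.

```python
import random

def approx_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided approximate randomization test on paired scores.
    Under the null hypothesis the system labels are exchangeable,
    so each pair is swapped with probability 0.5."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a  # swap the pair
            diff += a - b
        if abs(diff) / n >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # smoothed p-value

a = [1, 1, 0, 1, 1, 0, 1, 1]  # per-sentence scores for system A
b = [1, 0, 0, 1, 0, 0, 1, 0]  # per-sentence scores for system B
print(approx_randomization(a, b))  # large p here: too few sentences
```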

manually annotated

Appears in 3 sentences as: manually annotated (2) manually annotating (1)
In Faster Parsing by Supertagger Adaptation
  1. For supertagger evaluation, one thousand sentences were manually annotated with CCG lexical categories and POS tags.
    Page 4, “Data”
  2. For parser evaluation, three hundred of these sentences were manually annotated with DepBank grammatical relations (King et al., 2003) in the style of Briscoe and Carroll (2006).
    Page 4, “Data”
  3. The result is an accurate and efficient wide-coverage CCG parser that can be easily adapted for NLP applications in new domains without manually annotating data.
    Page 9, “Conclusion”

parsing model

Appears in 3 sentences as: parsing model (3)
In Faster Parsing by Supertagger Adaptation
  1. We used the hybrid parsing model described in Clark and Curran (2007), and the Viterbi decoder to find the highest-scoring derivation.
    Page 4, “Evaluation”
  2. For the biomedical parser evaluation we have used the parsing model and grammatical relation conversion script from Rimell and Clark (2009).
    Page 5, “Evaluation”
  3. In the first two experiments, we explore performance on the newswire domain, which is the source of training data for the parsing model and the baseline supertagging model.
    Page 5, “Results”
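
The Viterbi decoder in sentence 1 finds the highest-scoring CCG derivation in the parser's chart. As a compact illustration of the same max-product principle, here is Viterbi over a first-order tag chain; the score matrices are invented, and the parser's real search is over derivations, not tag sequences.

```python
import numpy as np

def viterbi(local, trans):
    """Highest-scoring tag sequence for a first-order chain.
    local: (n_words, n_tags) local scores; trans: (n_tags, n_tags)
    transition scores. Keeps one best back-pointer per state."""
    n, t = local.shape
    score = np.zeros((n, t))
    back = np.zeros((n, t), dtype=int)
    score[0] = np.log(local[0])
    log_trans = np.log(trans)
    for i in range(1, n):
        cand = score[i - 1][:, None] + log_trans  # (prev, cur)
        back[i] = cand.argmax(axis=0)
        score[i] = cand.max(axis=0) + np.log(local[i])
    path = [int(score[-1].argmax())]
    for i in range(n - 1, 0, -1):  # follow back-pointers
        path.append(int(back[i][path[-1]]))
    return path[::-1]

local = np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]])
trans = np.array([[0.8, 0.2], [0.3, 0.7]])
print(viterbi(local, trans))  # [0, 0, 0]
```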

Penn Treebank

Appears in 3 sentences as: Penn Treebank (3)
In Faster Parsing by Supertagger Adaptation
  1. Since the CCG lexical category set used by the supertagger is much larger than the Penn Treebank POS tag set, the accuracy of supertagging is much lower than POS tagging; hence the CCG supertagger assigns multiple supertags to a word, when the local context does not provide enough information to decide on the correct supertag.
    Page 1, “Introduction”
  2. This method has been used effectively to improve parsing performance on newspaper text (McClosky et al., 2006a), as well as adapting a Penn Treebank parser to a new domain (McClosky et al., 2006b).
    Page 3, “Background”
  3. We have used Sections 02-21 of CCGbank (Hockenmaier and Steedman, 2007), the CCG version of the Penn Treebank (Marcus et al., 1993), as training data for the newspaper domain.
    Page 4, “Data”

perceptron

Appears in 3 sentences as: Perceptron (1) perceptron (2)
In Faster Parsing by Supertagger Adaptation
  1. In our first experiment, we trained supertagger models using Generalised Iterative Scaling (GIS) (Darroch and Ratcliff, 1972), the limited memory BFGS method (BFGS) (Nocedal and Wright, 1999), the averaged perceptron (Collins, 2002), and the margin infused relaxed algorithm (MIRA) (Crammer and Singer, 2003).
    Page 5, “Results”
  2. [Table rows] GIS 96.34 96.43 96.53 96.62 85.3 | Perceptron 95.82 95.99 96.30 - 85.2 | MIRA 96.23 96.29 96.46 96.63 85.4
    Page 7, “Results”
  3. For all four algorithms the training time is proportional to the amount of data, but the GIS and BFGS models trained on only CCGbank took 4,500 and 4,200 seconds to train, while the equivalent perceptron and MIRA models took 90 and 95 seconds to train.
    Page 7, “Results”
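
As a sketch of the averaged perceptron (Collins, 2002) named in sentence 1, here is the update for a single multi-class tagging decision with lazy weight averaging. The real supertagger is trained over sequences with the feature set described earlier, so this shows only the flavour of the algorithm.

```python
from collections import defaultdict

def train_averaged_perceptron(data, tags, epochs=5):
    """data: list of (feature_list, gold_tag) pairs.
    Returns averaged weights as {tag: {feature: weight}}."""
    w = defaultdict(float)       # (tag, feature) -> current weight
    totals = defaultdict(float)  # weight summed over all steps
    last = defaultdict(int)      # step at which a weight last changed
    step = 0

    def bump(key, delta):
        # fold in the steps during which this weight was constant
        totals[key] += (step - last[key]) * w[key]
        last[key] = step
        w[key] += delta

    for _ in range(epochs):
        for feats, gold in data:
            step += 1
            guess = max(tags, key=lambda t: sum(w[(t, f)] for f in feats))
            if guess != gold:  # standard perceptron update
                for f in feats:
                    bump((gold, f), +1.0)
                    bump((guess, f), -1.0)

    averaged = defaultdict(dict)
    for (tag, feat), weight in w.items():
        totals[(tag, feat)] += (step - last[(tag, feat)]) * weight
        averaged[tag][feat] = totals[(tag, feat)] / max(step, 1)
    return averaged
```

Averaging damps the oscillation of the raw perceptron weights and typically improves held-out accuracy over the final weight vector, while keeping the very low training cost noted in sentence 3.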

probability distribution

Appears in 3 sentences as: probability distribution (2) probability distributions (1)
In Faster Parsing by Supertagger Adaptation
  1. The C&C supertagger is similar to the Ratnaparkhi (1996) tagger, using features based on words and POS tags in a five-word window surrounding the target word, and defining a local probability distribution over supertags for each word in the sentence, given the previous two supertags.
    Page 2, “Background”
  2. Alternatively the Forward-Backward algorithm can be used to efficiently sum over all sequences, giving a probability distribution over supertags for each word which is conditional only on the input sentence.
    Page 2, “Background”
  3. Note that these are all alternative methods for estimating the local log-linear probability distributions used by the Ratnaparkhi-style tagger.
    Page 5, “Results”
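
Sentence 2 above describes using the Forward-Backward algorithm to turn local scores into a per-word marginal distribution conditioned only on the input sentence. The C&C tagger conditions on the previous two supertags; to keep the sketch short, this version assumes a first-order chain with given local and transition score matrices (both illustrative).

```python
import numpy as np

def tag_marginals(local, trans):
    """Per-word marginal tag distributions for a first-order chain.
    local: (n_words, n_tags) local scores for each word
    trans: (n_tags, n_tags) transition scores between adjacent tags
    Returns an (n_words, n_tags) array whose rows sum to 1."""
    n, t = local.shape
    fwd = np.zeros((n, t))
    bwd = np.zeros((n, t))
    fwd[0] = local[0]
    for i in range(1, n):           # forward pass: sum over prefixes
        fwd[i] = local[i] * (fwd[i - 1] @ trans)
    bwd[-1] = 1.0
    for i in range(n - 2, -1, -1):  # backward pass: sum over suffixes
        bwd[i] = trans @ (local[i + 1] * bwd[i + 1])
    marg = fwd * bwd
    return marg / marg.sum(axis=1, keepdims=True)

local = np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]])
trans = np.array([[0.8, 0.2], [0.3, 0.7]])
print(tag_marginals(local, trans))  # one distribution per word
```

The per-word distributions that the β cut-off is applied to come from this sum over all tag sequences rather than from any single best sequence.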

Treebank

Appears in 3 sentences as: Treebank (3)
In Faster Parsing by Supertagger Adaptation
  1. Since the CCG lexical category set used by the supertagger is much larger than the Penn Treebank POS tag set, the accuracy of supertagging is much lower than POS tagging; hence the CCG supertagger assigns multiple supertags to a word, when the local context does not provide enough information to decide on the correct supertag.
    Page 1, “Introduction”
  2. This method has been used effectively to improve parsing performance on newspaper text (McClosky et al., 2006a), as well as adapting a Penn Treebank parser to a new domain (McClosky et al., 2006b).
    Page 3, “Background”
  3. We have used Sections 02-21 of CCGbank (Hockenmaier and Steedman, 2007), the CCG version of the Penn Treebank (Marcus et al., 1993), as training data for the newspaper domain.
    Page 4, “Data”
