Rebanking CCGbank for Improved NP Interpretation
Honnibal, Matthew and Curran, James R. and Bos, Johan

Article Structure

Abstract

Once released, treebanks tend to remain unchanged despite any shortcomings in their depth of linguistic analysis or coverage of specific phenomena.

Introduction

Progress in natural language processing relies on direct comparison on shared data, discouraging improvements to the evaluation data.

Background and motivation

Formalisms like HPSG (Pollard and Sag, 1994), LFG (Kaplan and Bresnan, 1982), and CCG (Steedman, 2000) are linguistically motivated in the sense that they attempt to explain and predict the limited variation found in the grammars of natural languages.

Combining CCGbank corrections

There have been a few papers describing corrections to CCGbank.

Noun predicate-argument structure

Many common nouns in English can receive optional complements and adjuncts, realised by prepositional phrases, genitive determiners, compound nouns, relative clauses, and for some nouns, complementised clauses.

Adding restrictivity distinctions

Adnominals can have either a restrictive or a nonrestrictive (appositional) interpretation, determining the potential reference of the noun phrase it modifies.

Reanalysing partitive constructions

True partitive constructions consist of a quantifier (16), a cardinal (17) or demonstrative (18) applied to an NP via of.

Similarity to CCGbank

Table 1 shows the percentage of labelled dependencies (L. Deps), unlabelled dependencies (U. Deps) and lexical categories (Cats) that remained the same after each set of changes.

Lexicon statistics

Our changes make the grammar sensitive to new distinctions, which increases the number of lexical categories required.

Parsing Evaluation

Some of the changes we have made correct problems that have caused the performance of a statistical CCG parser to be overestimated.

Conclusion

Research in natural language understanding is driven by the datasets that we have available.

Topics

Treebank

Appears in 14 sentences as: Treebank (7) treebank (4) Treebanking (1) treebanking (1) treebanks (1) treebank’s (1)
In Rebanking CCGbank for Improved NP Interpretation
  1. Once released, treebanks tend to remain unchanged despite any shortcomings in their depth of linguistic analysis or coverage of specific phenomena.
    Page 1, “Abstract”
  2. In this paper we show how to improve the quality of a treebank, by integrating resources and implementing improved analyses for specific constructions.
    Page 1, “Abstract”
  3. Treebanking is a difficult engineering task: coverage, cost, consistency and granularity are all competing concerns that must be balanced against each other when the annotation scheme is developed.
    Page 1, “Introduction”
  4. The difficulty of the task means that we ought to view treebanking as an ongoing process akin to grammar development, such as the many years of work on the ERG (Flickinger, 2000).
    Page 1, “Introduction”
  5. This paper demonstrates how a treebank can be rebanked to incorporate novel analyses and infor-
    Page 1, “Introduction”
  6. We chose to work on CCGbank (Hockenmaier and Steedman, 2007), a Combinatory Categorial Grammar (Steedman, 2000) treebank acquired from the Penn Treebank (Marcus et al., 1993).
    Page 1, “Introduction”
  7. Statistical parsers induce their grammars from corpora, and the corpora for linguistically motivated formalisms currently do not contain high quality predicate-argument annotation, because they were derived from the Penn Treebank (PTB; Marcus et al., 1993).
    Page 2, “Background and motivation”
  8. What we suggest in this paper is that a treebank’s grammar need not last its lifetime.
    Page 2, “Background and motivation”
  9. The structure of such compound noun phrases is left underspecified in the Penn Treebank (PTB), because the annotation procedure involved stitching together partial parses produced by the Fidditch parser (Hindle, 1983), which produced flat brackets for these constructions.
    Page 3, “Combining CCGbank corrections”
  10. When Hockenmaier and Steedman (2002) went to acquire a CCG treebank from the PTB, this posed a problem.
    Page 3, “Combining CCGbank corrections”
  11. The syntactic analysis of punctuation is notoriously difficult, and punctuation is not always treated consistently in the Penn Treebank (Bies et al., 1995).
    Page 3, “Combining CCGbank corrections”

See all papers in Proc. ACL 2010 that mention Treebank.

CCG

Appears in 12 sentences as: CCG (12)
In Rebanking CCGbank for Improved NP Interpretation
  1. We then describe a novel CCG analysis of NP predicate-argument structure, which we implement using NomBank (Meyers et al., 2004).
    Page 1, “Introduction”
  2. We then train and evaluate a parser for these changes, to investigate their impact on the accuracy of a state-of-the-art statistical CCG parser.
    Page 1, “Introduction”
  3. Formalisms like HPSG (Pollard and Sag, 1994), LFG (Kaplan and Bresnan, 1982), and CCG (Steedman, 2000) are linguistically motivated in the sense that they attempt to explain and predict the limited variation found in the grammars of natural languages.
    Page 2, “Background and motivation”
  4. Combinatory Categorial Grammar (CCG; Steedman, 2000) is a lexicalised grammar, which means that all grammatical dependencies are specified in the lexical entries and that the production of derivations is governed by a small set of rules.
    Page 2, “Background and motivation”
  5. A CCG grammar consists of a small number of schematic rules, called combinators.
    Page 2, “Background and motivation”
  6. CCG extends the basic application rules of pure categorial grammar with (generalised) composition rules and type raising.
    Page 2, “Background and motivation”
  7. When Hockenmaier and Steedman (2002) went to acquire a CCG treebank from the PTB, this posed a problem.
    Page 3, “Combining CCGbank corrections”
  8. There is no equivalent way to leave these structures underspecified in CCG, because derivations must be binary branching.
    Page 3, “Combining CCGbank corrections”
  9. This distinction is represented in the surface syntax in CCG, because the category of a verb must specify its argument structure.
    Page 3, “Combining CCGbank corrections”
  10. 4.1 CCG analysis
    Page 4, “Noun predicate-argument structure”
  11. Some of the changes we have made correct problems that have caused the performance of a statistical CCG parser to be overestimated.
    Page 7, “Parsing Evaluation”
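
Items 4 to 6 above describe CCG as a lexicalised grammar driven by a small set of combinators. The following is a minimal sketch of forward application, backward application and forward composition. The `Functor` class and the function names are hypothetical, chosen for illustration only; they are not taken from the paper or from any CCG toolkit, and type raising and generalised composition are left out.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Functor:
    """A complex category: result/arg (forward) or result\\arg (backward)."""
    result: "Cat"
    slash: str  # "/" seeks its argument to the right, "\\" to the left
    arg: "Cat"

    def __str__(self) -> str:
        return f"({self.result}{self.slash}{self.arg})"


# Atomic categories are plain strings such as "NP" or "S" (an assumption
# made for brevity; real lexicons also carry features).
Cat = Union[str, Functor]


def forward_apply(left: Cat, right: Cat) -> Optional[Cat]:
    """X/Y  Y  =>  X"""
    if isinstance(left, Functor) and left.slash == "/" and left.arg == right:
        return left.result
    return None


def backward_apply(left: Cat, right: Cat) -> Optional[Cat]:
    """Y  X\\Y  =>  X"""
    if isinstance(right, Functor) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


def forward_compose(left: Cat, right: Cat) -> Optional[Cat]:
    """X/Y  Y/Z  =>  X/Z"""
    if (isinstance(left, Functor) and left.slash == "/"
            and isinstance(right, Functor) and right.slash == "/"
            and left.arg == right.result):
        return Functor(left.result, "/", right.arg)
    return None


# A transitive verb, (S\NP)/NP, combines with its object and then its subject.
verb = Functor(Functor("S", "\\", "NP"), "/", "NP")
verb_phrase = forward_apply(verb, "NP")        # the category S\NP
sentence = backward_apply("NP", verb_phrase)   # the category S
```

Because every rule takes exactly two adjacent categories, any derivation built this way is binary branching, which is the property item 8 relies on.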

Penn Treebank

Appears in 7 sentences as: Penn Treebank (7)
In Rebanking CCGbank for Improved NP Interpretation
  1. We chose to work on CCGbank (Hockenmaier and Steedman, 2007), a Combinatory Categorial Grammar (Steedman, 2000) treebank acquired from the Penn Treebank (Marcus et al., 1993).
    Page 1, “Introduction”
  2. Statistical parsers induce their grammars from corpora, and the corpora for linguistically motivated formalisms currently do not contain high quality predicate-argument annotation, because they were derived from the Penn Treebank (PTB; Marcus et al., 1993).
    Page 2, “Background and motivation”
  3. The structure of such compound noun phrases is left underspecified in the Penn Treebank (PTB), because the annotation procedure involved stitching together partial parses produced by the Fidditch parser (Hindle, 1983), which produced flat brackets for these constructions.
    Page 3, “Combining CCGbank corrections”
  4. The syntactic analysis of punctuation is notoriously difficult, and punctuation is not always treated consistently in the Penn Treebank (Bies et al., 1995).
    Page 3, “Combining CCGbank corrections”
  5. Our analysis requires semantic role labels for each argument of the nominal predicates in the Penn Treebank — precisely what NomBank (Meyers et al., 2004) provides.
    Page 5, “Noun predicate-argument structure”
  6. First, we align CCGbank and the Penn Treebank, and produce a version of NomBank that refers to CCGbank nodes.
    Page 5, “Noun predicate-argument structure”
  7. The most cited computational linguistics work to date is the Penn Treebank (Marcus et al., 1993).
    Page 8, “Conclusion”

semantic role

Appears in 5 sentences as: Semantic role (1) semantic role (3) semantic roles (1)
In Rebanking CCGbank for Improved NP Interpretation
  1. Semantic role descriptions generally recognise a distinction between core arguments, whose role comes from a set specific to the predicate, and peripheral arguments, who have a role drawn from a small, generic set.
    Page 3, “Combining CCGbank corrections”
  2. The semantic roles of Rome and Carthage are the same in (7) and (8), but the noun cannot case-mark them directly, so of and the genitive clitic are pressed into service.
    Page 4, “Noun predicate-argument structure”
  3. The semantic role depends on both the predicate and subcategorisation frame:
    Page 4, “Noun predicate-argument structure”
  4. Our analysis requires semantic role labels for each argument of the nominal predicates in the Penn Treebank — precisely what NomBank (Meyers et al., 2004) provides.
    Page 5, “Noun predicate-argument structure”
  5. We then assume that any prepositional phrase or genitive determiner annotated as a core argument in NomBank should be analysed as a complement, while peripheral arguments and adnominals that receive no semantic role label at all are analysed as adjuncts.
    Page 5, “Noun predicate-argument structure”
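
The rule in item 5 can be sketched as a small decision function for a prepositional phrase or genitive determiner. This is an illustration only, not the authors' implementation: the function name is hypothetical, and it assumes PropBank-style label strings in which core arguments are numbered (`ARG0`, `ARG1`, ...) and peripheral arguments start with `ARGM`.

```python
from typing import Optional


def classify_adnominal(nombank_label: Optional[str]) -> str:
    """Return "complement" or "adjunct" for a PP or genitive determiner.

    `nombank_label` is the role label attached to the adnominal, or None
    when it received no semantic role label at all.
    """
    if nombank_label is None:
        # Unlabelled adnominals are analysed as adjuncts.
        return "adjunct"
    label = nombank_label.upper()
    if label.startswith("ARGM"):
        # Peripheral argument (assumed ARGM-* convention): adjunct.
        return "adjunct"
    if label.startswith("ARG") and label[3:4].isdigit():
        # Numbered core argument (assumed ARG0, ARG1, ... convention).
        return "complement"
    raise ValueError(f"unrecognised role label: {nombank_label!r}")
```

For example, `classify_adnominal("ARG1")` gives `"complement"`, while `classify_adnominal("ARGM-LOC")` and `classify_adnominal(None)` both give `"adjunct"`.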

noun phrases

Appears in 4 sentences as: noun phrases (4)
In Rebanking CCGbank for Improved NP Interpretation
  1. Compound noun phrases can nest inside each other, creating bracketing ambiguities:
    Page 3, “Combining CCGbank corrections”
  2. The structure of such compound noun phrases is left underspecified in the Penn Treebank (PTB), because the annotation procedure involved stitching together partial parses produced by the Fidditch parser (Hindle, 1983), which produced flat brackets for these constructions.
    Page 3, “Combining CCGbank corrections”
  3. Vadas and Curran (2007) addressed this by manually annotating all of the ambiguous noun phrases in the PTB, and went on to use this information to correct 20,409 dependencies (1.95%) in CCGbank (Vadas and Curran, 2008).
    Page 3, “Combining CCGbank corrections”
  4. Partitive constructions are not given special treatment in the PTB, and were analysed as noun phrases with a PP modifier in CCGbank:
    Page 6, “Reanalysing partitive constructions”
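
To make the bracketing ambiguity in item 1 concrete, the sketch below enumerates every binary-branching structure over a flat sequence of nouns. The function name and the nested-tuple tree representation are assumptions for illustration, not something the paper defines.

```python
from typing import List, Tuple, Union

# A tree is either a single word or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def binary_bracketings(words: List[str]) -> List[Tree]:
    """Return every binary bracketing of `words`, preserving word order."""
    if len(words) == 1:
        return [words[0]]
    trees: List[Tree] = []
    # Choose each possible split point, then bracket both halves recursively.
    for split in range(1, len(words)):
        for left in binary_bracketings(words[:split]):
            for right in binary_bracketings(words[split:]):
                trees.append((left, right))
    return trees
```

A three-word compound has two bracketings and a four-word compound has five, so a flat bracket leaves a real choice open, and a binary-branching derivation has to commit to one of them.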

Named entities

Appears in 3 sentences as: Named entities (1) named entities (1) named entity (1)
In Rebanking CCGbank for Improved NP Interpretation
  1. The major areas of CCGbank’s grammar left to be improved are the analysis of comparatives, and the analysis of named entities.
    Page 8, “Conclusion”
  2. Named entities are also difficult to analyse, as many entity types obey their own specific grammars.
    Page 8, “Conclusion”
  3. This is another example of a phenomenon that could be analysed much better in CCGbank using an existing resource, the BBN named entity corpus.
    Page 8, “Conclusion”

natural language

Appears in 3 sentences as: natural language (2) natural languages (1)
In Rebanking CCGbank for Improved NP Interpretation
  1. Progress in natural language processing relies on direct comparison on shared data, discouraging improvements to the evaluation data.
    Page 1, “Introduction”
  2. Formalisms like HPSG (Pollard and Sag, 1994), LFG (Kaplan and Bresnan, 1982), and CCG (Steedman, 2000) are linguistically motivated in the sense that they attempt to explain and predict the limited variation found in the grammars of natural languages.
    Page 2, “Background and motivation”
  3. Research in natural language understanding is driven by the datasets that we have available.
    Page 8, “Conclusion”

role labels

Appears in 3 sentences as: role label (1) role labels (2)
In Rebanking CCGbank for Improved NP Interpretation
  1. The only information we are not specifying in the syntactic analysis are the role labels assigned to each of the syntactic arguments.
    Page 5, “Noun predicate-argument structure”
  2. Our analysis requires semantic role labels for each argument of the nominal predicates in the Penn Treebank — precisely what NomBank (Meyers et al., 2004) provides.
    Page 5, “Noun predicate-argument structure”
  3. We then assume that any prepositional phrase or genitive determiner annotated as a core argument in NomBank should be analysed as a complement, while peripheral arguments and adnominals that receive no semantic role label at all are analysed as adjuncts.
    Page 5, “Noun predicate-argument structure”
