Index of papers in Proc. ACL 2010 that mention
  • Penn Treebank
Fowler, Timothy A. D. and Penn, Gerald
A Latent Variable CCG Parser
Unlike the context-free grammars extracted from the Penn treebank, these allow for the categorial semantics that accompanies any categorial parse and for a more elegant analysis of linguistic structures such as extraction and coordination.
A Latent Variable CCG Parser
In Petrov’s experiments on the Penn treebank, the syntactic category NP was refined to the more fine-grained NP1 and NP2, roughly corresponding to NPs in subject and object positions.
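To make the refinement concrete, here is a minimal sketch of latent-variable symbol splitting, assuming each coarse nonterminal is split into k indexed subsymbols; the symbol names and the toy rule are illustrative, not taken from the Petrov parser itself:

```python
# Toy illustration of latent-variable grammar refinement: each
# coarse symbol X is split into subsymbols X-0 .. X-(k-1); rule
# probabilities over the refined symbols would then be
# re-estimated with EM (not shown).
from itertools import product

def split_symbol(symbol, k=2):
    """Refine one coarse nonterminal into k latent subsymbols."""
    return [f"{symbol}-{i}" for i in range(k)]

def split_rule(lhs, rhs, k=2):
    """Expand a coarse rule into all combinations of refined symbols."""
    lhs_splits = split_symbol(lhs, k)
    rhs_splits = [split_symbol(s, k) for s in rhs]
    return [(l, list(r)) for l in lhs_splits for r in product(*rhs_splits)]

# The coarse rule S -> NP VP yields 2 * 2 * 2 = 8 refined rules,
# e.g. S-0 -> NP-0 VP-1; during training, splits like NP-0/NP-1
# can come to specialize for subject vs. object positions.
for lhs, rhs in split_rule("S", ["NP", "VP"]):
    print(lhs, "->", " ".join(rhs))
```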
A Latent Variable CCG Parser
In the supertagging literature, POS tagging and supertagging are distinguished — POS tags are the traditional Penn treebank tags (e.g., NN, VBZ, and DT) and supertags are CCG categories.
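A toy contrast between the two tag types (the tag assignments below are hand-picked for illustration, not the output of any tagger):

```python
# Illustrative contrast between coarse PTB POS tags and CCG
# supertags for the same sentence; the tags are hand-chosen.
sentence = ["The", "company", "buys", "shares"]

# Penn Treebank POS tags: a small, closed tag set.
pos_tags = ["DT", "NN", "VBZ", "NNS"]

# CCG supertags: rich lexical categories encoding subcategorization;
# "(S\NP)/NP" marks a transitive verb seeking an object NP to its
# right and a subject NP to its left.
supertags = ["NP/N", "N", r"(S\NP)/NP", "N"]

for word, pos, cat in zip(sentence, pos_tags, supertags):
    print(f"{word:10} POS={pos:4} supertag={cat}")
```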
Introduction
The Petrov parser (Petrov and Klein, 2007) uses latent variables to refine the grammar extracted from a corpus, improving accuracy; it was originally used to improve parsing results on the Penn treebank (PTB).
Introduction
These results should not be interpreted as proof that grammars extracted from the Penn treebank and from CCGbank are equivalent.
The Language Classes of Combinatory Categorial Grammars
CCGbank (Hockenmaier and Steedman, 2007) is a corpus of CCG derivations that was semiautomatically converted from the Wall Street Journal section of the Penn treebank.
Penn Treebank is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Honnibal, Matthew and Curran, James R. and Bos, Johan
Background and motivation
Statistical parsers induce their grammars from corpora, and the corpora for linguistically motivated formalisms currently do not contain high quality predicate-argument annotation, because they were derived from the Penn Treebank (PTB; Marcus et al., 1993).
Combining CCGbank corrections
The structure of such compound noun phrases is left underspecified in the Penn Treebank (PTB), because the annotation procedure involved stitching together partial parses produced by the Fidditch parser (Hindle, 1983), which produced flat brackets for these constructions.
Combining CCGbank corrections
The syntactic analysis of punctuation is notoriously difficult, and punctuation is not always treated consistently in the Penn Treebank (Bies et al., 1995).
Conclusion
The most cited computational linguistics work to date is the Penn Treebank (Marcus et al., 1993).
Introduction
We chose to work on CCGbank (Hockenmaier and Steedman, 2007), a Combinatory Categorial Grammar (Steedman, 2000) treebank acquired from the Penn Treebank (Marcus et al., 1993).
Noun predicate-argument structure
Our analysis requires semantic role labels for each argument of the nominal predicates in the Penn Treebank — precisely what NomBank (Meyers et al., 2004) provides.
Noun predicate-argument structure
First, we align CCGbank and the Penn Treebank, and produce a version of NomBank that refers to CCGbank nodes.
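A rough sketch of what such a node-level alignment might look like, assuming the two corpora share a tokenization and CCGbank constituents are represented as token spans; all data structures here are invented for illustration:

```python
# Hypothetical alignment sketch: map a NomBank argument span,
# given as PTB token offsets, onto the smallest CCGbank node
# covering the same tokens. CCGbank nodes are modeled here
# simply as (start, end) token spans.
def find_covering_node(ccg_spans, start, end):
    """Return the smallest CCGbank span covering [start, end)."""
    candidates = [(s, e) for (s, e) in ccg_spans if s <= start and end <= e]
    return min(candidates, key=lambda se: se[1] - se[0]) if candidates else None

# Toy derivation spans for a 5-token sentence.
ccg_spans = [(0, 5), (0, 2), (2, 5), (2, 3), (3, 5),
             (0, 1), (1, 2), (3, 4), (4, 5)]

# A NomBank argument covering tokens 3-4 maps to the node (3, 5).
print(find_covering_node(ccg_spans, 3, 5))  # -> (3, 5)
```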
Penn Treebank is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Ide, Nancy and Baker, Collin and Fellbaum, Christiane and Passonneau, Rebecca
Introduction
The most well-known multiply-annotated and validated corpus of English is the one million word Wall Street Journal corpus known as the Penn Treebank (Marcus et al., 1993), which over the years has been fully or partially annotated for several phenomena over and above the original part-of-speech tagging and phrase structure annotation.
Introduction
More recently, the OntoNotes project (Pradhan et al., 2007) released a one million word English corpus of newswire, broadcast news, and broadcast conversation that is annotated for Penn Treebank syntax, PropBank predicate argument structures, coreference, and named entities.
MASC Annotations
Annotation         Method     Texts  Words
Token              Validated  118    222472
Sentence           Validated  118    222472
POS/lemma          Validated  118    222472
Noun chunks        Validated  118    222472
Verb chunks        Validated  118    222472
Named entities     Validated  118    222472
FrameNet frames    Manual     21     17829
HPSG               Validated  40*    30106
Discourse          Manual     40*    30106
Penn Treebank      Validated  97     87383
PropBank           Validated  92     50165
Opinion            Manual     97     47583
TimeBank           Validated  34     5434
Committed belief   Manual     13     4614
Event              Manual     13     4614
Coreference        Manual     2      1877
MASC Annotations
Annotations produced by other projects and the FrameNet and Penn Treebank annotations produced specifically for MASC are semiautomatically and/or manually produced by those projects and subjected to their internal quality controls.
MASC: The Corpus
All of the first 80K increment is annotated for Penn Treebank syntax.
MASC: The Corpus
The second 120K increment includes 5.5K words of Wall Street Journal texts that have been annotated by several projects, including Penn Treebank, PropBank, Penn Discourse Treebank, TimeML, and the Pittsburgh Opinion project.
Penn Treebank is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Wu, Fei and Weld, Daniel S.
Experiments
We used three corpora for experiments: WSJ from Penn Treebank, Wikipedia, and the general Web.
Experiments
In contrast, TextRunner was trained with 91,687 positive examples and 96,795 negative examples generated from the WSJ dataset in Penn Treebank.
Experiments
We used three parsing options on the WSJ dataset: Stanford parsing, CJ50 parsing (Charniak and Johnson, 2005), and the gold parses from the Penn Treebank.
Introduction
For example, TextRunner uses a small set of handwritten rules to heuristically label training examples from sentences in the Penn Treebank .
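A toy sketch of this style of heuristic labeling, with an invented rule standing in for TextRunner's actual handwritten heuristics:

```python
# Made-up heuristic labeling rule: mark a candidate extraction as a
# positive training example if the parse path between its arguments
# is short and passes through a verb; otherwise label it negative.
def is_positive(parse_path):
    """parse_path: list of (word, POS tag) pairs between the arguments."""
    short_enough = len(parse_path) <= 4
    through_verb = any(tag.startswith("VB") for _, tag in parse_path)
    return short_enough and through_verb

# Path between "company" and "shares" in a toy sentence.
path = [("buys", "VBZ")]
print(is_positive(path))  # True -> positive example
```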
Wikipedia-based Open IE
In both cases, however, we generate training data from Wikipedia by matching sentences with infoboxes, while TextRunner used a small set of handwritten rules to label training examples from the Penn Treebank .
Penn Treebank is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Gerber, Matthew and Chai, Joyce
Data annotation and analysis
Implicit arguments have not been annotated within the Penn TreeBank, which is the textual and syntactic basis for NomBank.
Implicit argument identification
Consider the following abridged sentences, which are adjacent in their Penn TreeBank document:
Implicit argument identification
Starting with a wide range of features, we performed floating forward feature selection (Pudil et al., 1994) over held-out development data comprising implicit argument annotations from section 24 of the Penn TreeBank.
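For reference, a minimal sketch of sequential floating forward selection in the spirit of Pudil et al. (1994), with a toy scoring function standing in for evaluation on held-out data:

```python
# Floating forward feature selection: greedily add the single best
# remaining feature, then "float" backwards, dropping features as
# long as dropping improves the held-out score.
def sffs(features, score):
    selected, best = [], float("-inf")
    while True:
        # Forward step: try adding each remaining feature.
        gains = [(score(selected + [f]), f) for f in features if f not in selected]
        if not gains:
            break
        new_score, f = max(gains)
        if new_score <= best:
            break  # no addition helps; stop
        selected.append(f)
        best = new_score
        # Backward (floating) step: drop features while it helps.
        improved = True
        while improved and len(selected) > 1:
            improved = False
            for g in list(selected):
                trial = [x for x in selected if x != g]
                if score(trial) > best:
                    selected, best, improved = trial, score(trial), True
    return selected, best

# Toy score that rewards features "a" and "b" and penalizes size.
feats = ["a", "b", "c"]
score = lambda s: len(set(s) & {"a", "b"}) - 0.1 * len(s)
print(sffs(feats, score))  # (['b', 'a'], 1.8)
```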
Introduction
However, as shown by the following example from the Penn TreeBank (Marcus et al., 1993), this restriction excludes extra-sentential arguments:
Penn Treebank is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Kummerfeld, Jonathan K. and Roesner, Jessika and Dawborn, Tim and Haggerty, James and Curran, James R. and Clark, Stephen
Background
This method has been used effectively to improve parsing performance on newspaper text (McClosky et al., 2006a), as well as adapting a Penn Treebank parser to a new domain (McClosky et al., 2006b).
Data
We have used Sections 02-21 of CCGbank (Hockenmaier and Steedman, 2007), the CCG version of the Penn Treebank (Marcus et al., 1993), as training data for the newspaper domain.
Introduction
Since the CCG lexical category set used by the supertagger is much larger than the Penn Treebank POS tag set, the accuracy of supertagging is much lower than POS tagging; hence the CCG supertagger assigns multiple supertags to a word, when the local context does not provide enough information to decide on the correct supertag.
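A minimal sketch of such multitagging, assuming the common scheme of keeping every category whose probability is within a factor beta of the best one; the categories and probabilities are invented:

```python
# Multitagging sketch: instead of committing to the single best
# supertag, keep all categories within a factor `beta` of the most
# probable one for the word.
def multitag(category_probs, beta=0.1):
    """category_probs: dict mapping CCG category -> probability."""
    best = max(category_probs.values())
    return [c for c, p in category_probs.items() if p >= beta * best]

probs = {"NP": 0.05, "N": 0.55, r"(S\NP)/NP": 0.30, "N/N": 0.10}
print(multitag(probs, beta=0.3))  # ['N', '(S\\NP)/NP']
```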
Penn Treebank is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Baldridge, Jason and Knight, Kevin
Data
CCGbank was created by semiautomatically converting the Penn Treebank to CCG derivations (Hockenmaier and Steedman, 2007).
Introduction
Most work has focused on POS-tagging for English using the Penn Treebank (Marcus et al., 1993), such as (Banko and Moore, 2004; Goldwater and Griffiths, 2007; Toutanova and Johnson, 2008; Goldberg et al., 2008; Ravi and Knight, 2009).
Introduction
This generally involves working with the standard set of 45 POS-tags employed in the Penn Treebank.
Penn Treebank is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Sun, Jun and Zhang, Min and Tan, Chew Lim
Substructure Spaces for BTKs
Compared with the widely used Penn TreeBank annotation, the new criterion uses different grammar tags and can more effectively describe some rare linguistic phenomena in Chinese.
Substructure Spaces for BTKs
The annotator still uses Penn TreeBank annotation on the English side.
Substructure Spaces for BTKs
In addition, the HIT corpus is not applicable for MT experiments due to domain divergence, annotation discrepancy (its Chinese parse trees employ a grammar different from the Penn Treebank annotations), and its degree of tolerance for parsing errors.
Penn Treebank is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: