Index of papers in Proc. ACL 2010 that mention
  • Treebank
Fowler, Timothy A. D. and Penn, Gerald
A Latent Variable CCG Parser
Unlike the context-free grammars extracted from the Penn treebank, these allow for the categorial semantics that accompanies any categorial parse and for a more elegant analysis of linguistic structures such as extraction and coordination.
A Latent Variable CCG Parser
In Petrov’s experiments on the Penn treebank, the syntactic category NP was refined to the more fine-grained NP1 and NP2, roughly corresponding to NPs in subject and object positions.
A Latent Variable CCG Parser
In the supertagging literature, POS tagging and supertagging are distinguished — POS tags are the traditional Penn treebank tags (e.g. NN, VBZ and DT) and supertags are CCG categories.
Introduction
The Petrov parser (Petrov and Klein, 2007) uses latent variables to refine the grammar extracted from a corpus to improve accuracy, originally used to improve parsing results on the Penn treebank (PTB).
Introduction
These results should not be interpreted as proof that grammars extracted from the Penn treebank and from CCGbank are equivalent.
The Language Classes of Combinatory Categorial Grammars
CCGbank (Hockenmaier and Steedman, 2007) is a corpus of CCG derivations that was semiautomatically converted from the Wall Street Journal section of the Penn treebank.
Treebank is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Gómez-Rodríguez, Carlos and Nivre, Joakim
Abstract
In addition, we present an efficient method for determining whether an arbitrary tree is 2-planar and show that 99% or more of the trees in existing treebanks are 2-planar.
Determining Multiplanarity
Several constraints on non-projective dependency structures have been proposed recently that seek a good balance between parsing efficiency and coverage of non-projective phenomena present in natural language treebanks.
Determining Multiplanarity
For example, Kuhlmann and Nivre (2006) and Havelka (2007) have shown that the vast majority of structures present in existing treebanks are well-nested and have a small gap degree (Bodirsky et al., 2005), leading to an interest in parsers for these kinds of structures (Gómez-Rodríguez et al., 2009).
Determining Multiplanarity
No similar analysis has been performed for m-planar structures, although Yli-Jyrä (2003) provides evidence that all except two structures in the Danish dependency treebank are at most 3-planar.
Introduction
Although these proposals seem to have a very good fit with linguistic data, in the sense that they often cover 99% or more of the structures found in existing treebanks, the development of efficient parsing algorithms for these classes has met with more limited success.
Introduction
First, we present a procedure for determining the minimal number m such that a dependency tree is m-planar and use it to show that the overwhelming majority of sentences in dependency treebanks have a tree that is at most 2-planar.
Preliminaries
According to the results by Kuhlmann and Nivre (2006), most non-projective structures in dependency treebanks are also non-planar, so being able to parse planar structures will only give us a modest improvement in coverage with respect to a projective parser.
Treebank is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Honnibal, Matthew and Curran, James R. and Bos, Johan
Abstract
Once released, treebanks tend to remain unchanged despite any shortcomings in their depth of linguistic analysis or coverage of specific phenomena.
Abstract
In this paper we show how to improve the quality of a treebank, by integrating resources and implementing improved analyses for specific constructions.
Background and motivation
Statistical parsers induce their grammars from corpora, and the corpora for linguistically motivated formalisms currently do not contain high quality predicate-argument annotation, because they were derived from the Penn Treebank (PTB; Marcus et al., 1993).
Background and motivation
What we suggest in this paper is that a treebank’s grammar need not last its lifetime.
Combining CCGbank corrections
The structure of such compound noun phrases is left underspecified in the Penn Treebank (PTB), because the annotation procedure involved stitching together partial parses produced by the Fidditch parser (Hindle, 1983), which produced flat brackets for these constructions.
Combining CCGbank corrections
When Hockenmaier and Steedman (2002) went to acquire a CCG treebank from the PTB, this posed a problem.
Combining CCGbank corrections
The syntactic analysis of punctuation is notoriously difficult, and punctuation is not always treated consistently in the Penn Treebank (Bies et al., 1995).
Introduction
Treebanking is a difficult engineering task: coverage, cost, consistency and granularity are all competing concerns that must be balanced against each other when the annotation scheme is developed.
Introduction
The difficulty of the task means that we ought to view treebanking as an ongoing process akin to grammar development, such as the many years of work on the ERG (Flickinger, 2000).
Introduction
This paper demonstrates how a treebank can be rebanked to incorporate novel analyses and information.
Treebank is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Jiang, Wenbin and Liu, Qun
Conclusion and Future Works
In addition, when integrated into a 2nd-ordered MST parser, the projected parser brings significant improvement to the baseline, especially for the baseline trained on smaller treebanks.
Experiments
In this section, we first validate the word-pair classification model by experimenting on human-annotated treebanks .
Experiments
We experiment on two popular treebanks, the Wall Street Journal (WSJ) portion of the Penn English Treebank (Marcus et al., 1993), and the Penn Chinese Treebank (CTB) 5.0 (Xue et al., 2005).
Experiments
The constituent trees in the two treebanks are transformed to dependency trees according to the head-finding rules of Yamada and Matsumoto (2003).
Introduction
Since it is costly and difficult to build human-annotated treebanks, a lot of work has also been devoted to the utilization of unannotated text.
Introduction
For the 2nd-order MST parser trained on Penn Chinese Treebank (CTB) 5.0, the classifier gives a precision increment of 0.5 points.
Related Works
On the training method, however, our model obviously differs from other graph-based models, in that we only need a set of word-pair dependency instances rather than a regular dependency treebank.
Treebank is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Koller, Alexander and Thater, Stefan
Conclusion
Evaluating our algorithm on a subcorpus of the Rondane Treebank, we reduced the mean number of configurations of a sentence from several million to 4.5, in negligible runtime.
Evaluation
In this section, we evaluate the effectiveness and efficiency of our weakest readings algorithm on a treebank .
Evaluation
We compute RTGs for all sentences in the treebank and measure how many weakest readings remain after the intersection, and how much time this computation takes.
Evaluation
For our experiment, we use the Rondane treebank (version of January 2006), a “Redwoods style” (Oepen et al., 2002) treebank containing underspecified representations (USRs) in the MRS formalism (Copestake et al., 2005) for sentences from the tourism domain.
Introduction
While applications should benefit from these very precise semantic representations, their usefulness is limited by the presence of semantic ambiguity: On the Rondane Treebank (Oepen et al., 2002), the ERG computes an average of several million semantic representations for each sentence, even when the syntactic analysis is fixed.
Introduction
However, no such approach has been worked out in sufficient detail to support the disambiguation of treebank sentences.
Introduction
It is of course completely infeasible to compute all readings and compare all pairs for entailment; but even the best known algorithm in the literature (Gabsdil and Striegnitz, 1999) is only an optimization of this basic strategy, and would take months to compute the weakest readings for the sentences in the Rondane Treebank.
Underspecification
always hnc subgraphs of D. In the worst case, GD can be exponentially bigger than D, but in practice it turns out that the grammar size remains manageable: even the RTG for the most ambiguous sentence in the Rondane Treebank, which has about 4.5 × 10^12 scope readings, has only about 75,000 rules and can be computed in a few seconds.
Treebank is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Abney, Steven and Bird, Steven
Building the Corpus
Two models are the volunteers who scan documents and correct OCR output in Project Gutenberg, and the undergraduate volunteers who have constructed Greek and Latin treebanks within Project Perseus (Crane, 2010).
Human Language Project
It is natural to think in terms of replicating the body of resources available for well-documented languages, and the preeminent resource for any language is a treebank .
Human Language Project
Producing a treebank involves a staggering amount of manual effort.
Human Language Project
The idea of producing treebanks for 6,900 languages is quixotic, to put it mildly.
Treebank is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Dickinson, Markus
Approach
Ad hoc rules are CFG productions extracted from a treebank which are “used for specific constructions and unlikely to be used again,” indicating annotation errors and rules for ungrammaticalities (see also Dickinson and Foster, 2009).
Approach
Each method compares a given CFG rule to all the rules in a treebank grammar.
Approach
This procedure is applicable whether the rules in question are from a new data set—as in this paper, where parses are compared to a training data grammar—or drawn from the treebank grammar itself (i.e., an internal consistency check).
Introduction and Motivation
Furthermore, parsing accuracy degrades unless sufficient amounts of labeled training data from the same domain are available (e.g., Gildea, 2001; Sekine, 1997), and thus we need larger and more varied annotated treebanks, covering a wide range of domains.
Introduction and Motivation
However, there is a bottleneck in obtaining annotation, due to the need for manual intervention in annotating a treebank .
Summary and Outlook
We have proposed different methods for flagging the errors in automatically-parsed corpora, by treating the problem as one of looking for anomalous rules with respect to a treebank grammar.
Treebank is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Gerber, Matthew and Chai, Joyce
Conclusions and future work
These predicates are among the most frequent in the TreeBank and are likely to require approaches that differ from the ones we pursued.
Data annotation and analysis
Implicit arguments have not been annotated within the Penn TreeBank , which is the textual and syntactic basis for NomBank.
Data annotation and analysis
Thus, to facilitate our study, we annotated implicit arguments for instances of nominal predicates within the standard training, development, and testing sections of the TreeBank .
Implicit argument identification
Consider the following abridged sentences, which are adjacent in their Penn TreeBank document:
Implicit argument identification
Starting with a wide range of features, we performed floating forward feature selection (Pudil et al., 1994) over held-out development data comprising implicit argument annotations from section 24 of the Penn TreeBank .
Implicit argument identification
Throughout our study, we used gold-standard discourse relations provided by the Penn Discourse TreeBank (Prasad et al., 2008).
Introduction
However, as shown by the following example from the Penn TreeBank (Marcus et al., 1993), this restriction excludes extra-sentential arguments:
Treebank is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Ide, Nancy and Baker, Collin and Fellbaum, Christiane and Passonneau, Rebecca
Introduction
The most well-known multiply-annotated and validated corpus of English is the one million word Wall Street Journal corpus known as the Penn Treebank (Marcus et al., 1993), which over the years has been fully or partially annotated for several phenomena over and above the original part-of-speech tagging and phrase structure annotation.
Introduction
More recently, the OntoNotes project (Pradhan et al., 2007) released a one million word English corpus of newswire, broadcast news, and broadcast conversation that is annotated for Penn Treebank syntax, PropBank predicate argument structures, coreference, and named entities.
MASC Annotations
Annotation type     Method     Texts  Words
Token               Validated  118    222,472
Sentence            Validated  118    222,472
POS/lemma           Validated  118    222,472
Noun chunks         Validated  118    222,472
Verb chunks         Validated  118    222,472
Named entities      Validated  118    222,472
FrameNet frames     Manual     21     17,829
HPSG                Validated  40*    30,106
Discourse           Manual     40*    30,106
Penn Treebank       Validated  97     87,383
PropBank            Validated  92     50,165
Opinion             Manual     97     47,583
TimeBank            Validated  34     5,434
Committed belief    Manual     13     4,614
Event               Manual     13     4,614
Coreference         Manual     2      1,877
MASC Annotations
Annotations produced by other projects and the FrameNet and Penn Treebank annotations produced specifically for MASC are semiautomatically and/or manually produced by those projects and subjected to their internal quality controls.
MASC: The Corpus
All of the first 80K increment is annotated for Penn Treebank syntax.
MASC: The Corpus
The second 120K increment includes 5.5K words of Wall Street Journal texts that have been annotated by several projects, including Penn Treebank, PropBank, Penn Discourse Treebank, TimeML, and the Pittsburgh Opinion project.
Treebank is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Wang, WenTing and Su, Jian and Tan, Chew Lim
Conclusions and Future Works
In Proceedings of the 5th International Workshop on Treebanks and Linguistic Theories.
Conclusions and Future Works
Recognizing Implicit Discourse Relations in the Penn Discourse Treebank .
Conclusions and Future Works
The Penn Discourse TreeBank 2.0.
Experiments and Results
We directly use the gold-standard parse trees in the Penn TreeBank.
Introduction
The experiment shows that the tree kernel is able to effectively incorporate syntactic structural information and produce statistically significant improvements over the flat syntactic path feature for the recognition of both explicit and implicit relations in the Penn Discourse Treebank (PDTB; Prasad et al., 2008).
Penn Discourse Tree Bank
The Penn Discourse Treebank (PDTB) is the largest available annotated corpus of discourse relations (Prasad et al., 2008), covering 2,312 Wall Street Journal articles.
Treebank is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Wu, Fei and Weld, Daniel S.
Experiments
We used three corpora for experiments: WSJ from Penn Treebank, Wikipedia, and the general Web.
Experiments
In contrast, TextRunner was trained with 91,687 positive examples and 96,795 negative examples generated from the WSJ dataset in Penn Treebank .
Experiments
We used three parsing options on the WSJ dataset: Stanford parsing, CJ50 parsing (Charniak and Johnson, 2005), and the gold parses from the Penn Treebank.
Introduction
For example, TextRunner uses a small set of handwritten rules to heuristically label training examples from sentences in the Penn Treebank .
Wikipedia-based Open IE
In both cases, however, we generate training data from Wikipedia by matching sentences with infoboxes, while TextRunner used a small set of handwritten rules to label training examples from the Penn Treebank .
Treebank is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Baldridge, Jason and Knight, Kevin
Data
CCGbank was created by semiautomatically converting the Penn Treebank to CCG derivations (Hockenmaier and Steedman, 2007).
Data
CCG-TUT was created by semiautomatically converting dependencies in the Italian Turin University Treebank to CCG derivations (Bos et al., 2009).
Introduction
Most work has focused on POS-tagging for English using the Penn Treebank (Marcus et al., 1993), such as (Banko and Moore, 2004; Goldwater and Griffiths, 2007; Toutanova and Johnson, 2008; Goldberg et al., 2008; Ravi and Knight, 2009).
Introduction
This generally involves working with the standard set of 45 POS-tags employed in the Penn Treebank.
Treebank is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Chen, Wenliang and Kazama, Jun'ichi and Torisawa, Kentaro
Abstract
Experiments on the translated portion of the Chinese Treebank show that our system outperforms monolingual parsers by 2.93 points for Chinese and 1.64 points for English.
Experiments
All the bilingual data were taken from the translated portion of the Chinese Treebank (CTB) (Xue et al., 2002; Bies et al., 2007), articles 1-325 of CTB, which have English translations with gold-standard parse trees.
Introduction
Experiments on the translated portion of the Chinese Treebank (Xue et al., 2002; Bies et al., 2007) show that our system outperforms state-of-the-art monolingual parsers by 2.93 points for Chinese and 1.64 points for English.
Treebank is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Koo, Terry and Collins, Michael
Abstract
We evaluate our parsers on the Penn Treebank and Prague Dependency Treebank, achieving unlabeled attachment scores of 93.04% and 87.38%, respectively.
Introduction
We evaluate our parsers on the Penn WSJ Treebank (Marcus et al., 1993) and Prague Dependency Treebank (Hajic et al., 2001), achieving unlabeled attachment scores of 93.04% and 87.38%.
Parsing experiments
In order to evaluate the effectiveness of our parsers in practice, we apply them to the Penn WSJ Treebank (Marcus et al., 1993) and the Prague Dependency Treebank (Hajic et al., 2001; Hajic, 1998). We use standard training, validation, and test splits to facilitate comparisons.
Treebank is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Kummerfeld, Jonathan K. and Roesner, Jessika and Dawborn, Tim and Haggerty, James and Curran, James R. and Clark, Stephen
Background
This method has been used effectively to improve parsing performance on newspaper text (McClosky et al., 2006a), as well as adapting a Penn Treebank parser to a new domain (McClosky et al., 2006b).
Data
We have used Sections 02-21 of CCGbank (Hockenmaier and Steedman, 2007), the CCG version of the Penn Treebank (Marcus et al., 1993), as training data for the newspaper domain.
Introduction
Since the CCG lexical category set used by the supertagger is much larger than the Penn Treebank POS tag set, the accuracy of supertagging is much lower than POS tagging; hence the CCG supertagger assigns multiple supertags to a word, when the local context does not provide enough information to decide on the correct supertag.
Treebank is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Sun, Jun and Zhang, Min and Tan, Chew Lim
Substructure Spaces for BTKs
Compared with the widely used Penn TreeBank annotation, the new criterion utilizes some different grammar tags and is able to effectively describe some rare language phenomena in Chinese.
Substructure Spaces for BTKs
The annotator still uses Penn TreeBank annotation on the English side.
Substructure Spaces for BTKs
In addition, HIT corpus is not applicable for MT experiment due to the problems of domain divergence, annotation discrepancy (Chinese parse tree employs a different grammar from Penn Treebank annotations) and degree of tolerance for parsing errors.
Treebank is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: