A Latent Variable CCG Parser | Unlike the context-free grammars extracted from the Penn Treebank, these allow for the categorial semantics that accompanies any categorial parse and for a more elegant analysis of linguistic structures such as extraction and coordination. |
A Latent Variable CCG Parser | In Petrov’s experiments on the Penn Treebank, the syntactic category NP was refined to the more fine-grained NP1 and NP2, roughly corresponding to NPs in subject and object positions. |
A Latent Variable CCG Parser | In the supertagging literature, POS tagging and supertagging are distinguished — POS tags are the traditional Penn Treebank tags (e.g., NN, VBZ and DT) and supertags are CCG categories. |
Introduction | The Petrov parser (Petrov and Klein, 2007) uses latent variables to refine the grammar extracted from a corpus to improve accuracy, and was originally used to improve parsing results on the Penn Treebank (PTB). |
Introduction | These results should not be interpreted as proof that grammars extracted from the Penn Treebank and from CCGbank are equivalent. |
The Language Classes of Combinatory Categorial Grammars | CCGbank (Hockenmaier and Steedman, 2007) is a corpus of CCG derivations that was semiautomatically converted from the Wall Street Journal section of the Penn Treebank. |
Abstract | In addition, we present an efficient method for determining whether an arbitrary tree is 2-planar and show that 99% or more of the trees in existing treebanks are 2-planar. |
Determining Multiplanarity | Several constraints on non-projective dependency structures have been proposed recently that seek a good balance between parsing efficiency and coverage of non-projective phenomena present in natural language treebanks. |
Determining Multiplanarity | For example, Kuhlmann and Nivre (2006) and Havelka (2007) have shown that the vast majority of structures present in existing treebanks are well-nested and have a small gap degree (Bodirsky et al., 2005), leading to an interest in parsers for these kinds of structures (Gómez-Rodríguez et al., 2009). |
Determining Multiplanarity | No similar analysis has been performed for m-planar structures, although Yli-Jyrä (2003) provides evidence that all except two structures in the Danish dependency treebank are at most 3-planar. |
Introduction | Although these proposals seem to have a very good fit with linguistic data, in the sense that they often cover 99% or more of the structures found in existing treebanks, the development of efficient parsing algorithms for these classes has met with more limited success. |
Introduction | First, we present a procedure for determining the minimal number m such that a dependency tree is m-planar and use it to show that the overwhelming majority of sentences in dependency treebanks have a tree that is at most 2-planar. |
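The 2-planarity test referred to above can be sketched as a bipartite check on the crossing graph of a tree's arcs: each arc is a vertex, two arcs are connected when they cross, and the tree is 2-planar exactly when this graph is 2-colorable (each color class is one "plane" with no internal crossings). A minimal sketch, with illustrative function names and arcs encoded as (head, dependent) position pairs; this is not the paper's own code:

```python
from collections import deque

def crosses(e1, e2):
    """Two arcs cross iff their endpoints strictly interleave."""
    (a, b), (c, d) = sorted(e1), sorted(e2)
    return a < c < b < d or c < a < d < b

def is_two_planar(edges):
    """A structure is 2-planar iff its crossing graph is bipartite."""
    n = len(edges)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if crosses(edges[i], edges[j]):
                adj[i].append(j)
                adj[j].append(i)
    # Standard BFS 2-coloring over each connected component.
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False
    return True
```

Since bipartiteness is decidable in time linear in the crossing graph, this also explains why the 2-planarity analysis of whole treebanks is cheap, in contrast to determining minimal m-planarity for general m.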
Preliminaries | According to the results by Kuhlmann and Nivre (2006), most non-projective structures in dependency treebanks are also non-planar, so being able to parse planar structures will only give us a modest improvement in coverage with respect to a projective parser. |
Abstract | Once released, treebanks tend to remain unchanged despite any shortcomings in their depth of linguistic analysis or coverage of specific phenomena. |
Abstract | In this paper we show how to improve the quality of a treebank by integrating resources and implementing improved analyses for specific constructions. |
Background and motivation | Statistical parsers induce their grammars from corpora, and the corpora for linguistically motivated formalisms currently do not contain high-quality predicate-argument annotation, because they were derived from the Penn Treebank (PTB; Marcus et al., 1993). |
Background and motivation | What we suggest in this paper is that a treebank’s grammar need not last its lifetime. |
Combining CCGbank corrections | The structure of such compound noun phrases is left underspecified in the Penn Treebank (PTB), because the annotation procedure involved stitching together partial parses produced by the Fidditch parser (Hindle, 1983), which produced flat brackets for these constructions. |
Combining CCGbank corrections | When Hockenmaier and Steedman (2002) went to acquire a CCG treebank from the PTB, this posed a problem. |
Combining CCGbank corrections | The syntactic analysis of punctuation is notoriously difficult, and punctuation is not always treated consistently in the Penn Treebank (Bies et al., 1995). |
Introduction | Treebanking is a difficult engineering task: coverage, cost, consistency and granularity are all competing concerns that must be balanced against each other when the annotation scheme is developed. |
Introduction | The difficulty of the task means that we ought to view treebanking as an ongoing process akin to grammar development, such as the many years of work on the ERG (Flickinger, 2000). |
Introduction | This paper demonstrates how a treebank can be rebanked to incorporate novel analyses and information. |
Conclusion and Future Works | In addition, when integrated into a second-order MST parser, the projected parser brings significant improvement to the baseline, especially for the baseline trained on smaller treebanks. |
Experiments | In this section, we first validate the word-pair classification model by experimenting on human-annotated treebanks . |
Experiments | We experiment on two popular treebanks, the Wall Street Journal (WSJ) portion of the Penn English Treebank (Marcus et al., 1993), and the Penn Chinese Treebank (CTB) 5.0 (Xue et al., 2005). |
Experiments | The constituent trees in the two treebanks are transformed to dependency trees according to the head-finding rules of Yamada and Matsumoto (2003). |
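The constituency-to-dependency conversion mentioned above can be sketched as recursive head percolation: a head-finding rule picks one child of each constituent as its head, every other child's lexical head becomes a dependent of it, and the chosen head percolates upward. The rule table below is a hypothetical toy, not the actual Yamada and Matsumoto (2003) rules:

```python
# Toy head-finding table (illustrative only): for each constituent label,
# a scan direction and a priority list of child categories.
HEAD_RULES = {
    "S":  ("right", ["VP", "S"]),
    "VP": ("left",  ["VB", "VBZ", "VP"]),
    "NP": ("right", ["NN", "NNS", "NP"]),
}

def find_head(label, children):
    """Return the index of the head child: scan in the rule's direction
    for the first matching category, defaulting to the leftmost child."""
    direction, priorities = HEAD_RULES.get(label, ("left", []))
    order = range(len(children)) if direction == "left" \
        else range(len(children) - 1, -1, -1)
    for cat in priorities:
        for i in order:
            if children[i][0] == cat:
                return i
    return 0

def to_dependencies(tree, deps=None):
    """tree: (label, children) internally, (tag, word) at leaves.
    Returns the lexical head and accumulates (head, dependent) arcs."""
    if deps is None:
        deps = []
    label, children = tree
    if isinstance(children, str):      # leaf: (tag, word)
        return children, deps
    heads = [to_dependencies(c, deps)[0] for c in children]
    h = find_head(label, children)
    for i, w in enumerate(heads):
        if i != h:
            deps.append((heads[h], w))
    return heads[h], deps
```

For example, converting ("S", [("NP", [("NN", "John")]), ("VP", [("VBZ", "runs")])]) makes "runs" the sentence head with "John" as its dependent.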
Introduction | Since it is costly and difficult to build human-annotated treebanks, much work has also been devoted to the utilization of unannotated text. |
Introduction | For the second-order MST parser trained on Penn Chinese Treebank (CTB) 5.0, the classifier gives a precision increment of 0.5 points. |
Related Works | In terms of the training method, however, our model obviously differs from other graph-based models, in that we only need a set of word-pair dependency instances rather than a regular dependency treebank. |
Conclusion | Evaluating our algorithm on a subcorpus of the Rondane Treebank, we reduced the mean number of configurations of a sentence from several million to 4.5, in negligible runtime. |
Evaluation | In this section, we evaluate the effectiveness and efficiency of our weakest readings algorithm on a treebank. |
Evaluation | We compute RTGs for all sentences in the treebank and measure how many weakest readings remain after the intersection, and how much time this computation takes. |
Evaluation | For our experiment, we use the Rondane treebank (version of January 2006), a “Redwoods style” (Oepen et al., 2002) treebank containing underspecified representations (USRs) in the MRS formalism (Copestake et al., 2005) for sentences from the tourism domain. |
Introduction | While applications should benefit from these very precise semantic representations, their usefulness is limited by the presence of semantic ambiguity: On the Rondane Treebank (Oepen et al., 2002), the ERG computes an average of several million semantic representations for each sentence, even when the syntactic analysis is fixed. |
Introduction | However, no such approach has been worked out in sufficient detail to support the disambiguation of treebank sentences. |
Introduction | It is of course completely infeasible to compute all readings and compare all pairs for entailment; but even the best known algorithm in the literature (Gabsdil and Striegnitz, 1999) is only an optimization of this basic strategy, and would take months to compute the weakest readings for the sentences in the Rondane Treebank. |
Underspecification | always hnc subgraphs of D. In the worst case, GD can be exponentially bigger than D, but in practice it turns out that the grammar size remains manageable: even the RTG for the most ambiguous sentence in the Rondane Treebank, which has about 4.5 × 10^12 scope readings, has only about 75,000 rules and can be computed in a few seconds. |
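Counts like the 4.5 × 10^12 readings above can be obtained from a regular tree grammar without enumerating any trees: the number of trees derivable from a nonterminal is the sum, over its rules, of the product of its children's counts. A minimal sketch, assuming an acyclic RTG (as the scope charts here are) and an illustrative dict encoding of the grammar:

```python
from functools import lru_cache

def count_trees(rules, start):
    """rules: dict mapping each nonterminal to a list of right-hand
    sides, where an RHS is a tuple of child nonterminals (the empty
    tuple is a nullary constant). Returns the number of derived trees."""
    @lru_cache(maxsize=None)
    def count(nt):
        total = 0
        for rhs in rules[nt]:
            prod = 1
            for child in rhs:
                prod *= count(child)  # memoized, so each NT is visited once
            total += prod
        return total
    return count(start)
```

Because each nonterminal is evaluated once, the computation is linear in the grammar size, which is why a 75,000-rule RTG is handled in seconds even when it describes trillions of readings.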
Building the Corpus | Two models are the volunteers who scan documents and correct OCR output in Project Gutenberg, and the undergraduate volunteers who have constructed Greek and Latin treebanks within Project Perseus (Crane, 2010). |
Human Language Project | It is natural to think in terms of replicating the body of resources available for well-documented languages, and the preeminent resource for any language is a treebank . |
Human Language Project | Producing a treebank involves a staggering amount of manual effort. |
Human Language Project | The idea of producing treebanks for 6,900 languages is quixotic, to put it mildly. |
Approach | Ad hoc rules are CFG productions extracted from a treebank which are “used for specific constructions and unlikely to be used again,” indicating annotation errors and rules for ungrammaticalities (see also Dickinson and Foster, 2009). |
Approach | Each method compares a given CFG rule to all the rules in a treebank grammar. |
Approach | This procedure is applicable whether the rules in question are from a new data set—as in this paper, where parses are compared to a training data grammar—or drawn from the treebank grammar itself (i.e., an internal consistency check). |
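The comparison of new parses against a training-data grammar can be sketched as follows: extract every CFG production from a parse and flag those absent from the treebank grammar. This is a deliberately crude stand-in; the methods described here actually score rules by similarity to treebank rules rather than by exact absence, and the function names are illustrative:

```python
def extract_rules(tree, rules=None):
    """Collect CFG productions (label, child_labels) from a parse tree
    given as (label, children), where children is a word string at leaves."""
    if rules is None:
        rules = set()
    label, children = tree
    if isinstance(children, str):      # leaf: (tag, word) — no production
        return rules
    rules.add((label, tuple(c[0] for c in children)))
    for c in children:
        extract_rules(c, rules)
    return rules

def flag_anomalous(parse, treebank_grammar):
    """Flag every production of the parse not found in the treebank
    grammar, as candidate annotation or parser errors."""
    return {r for r in extract_rules(parse) if r not in treebank_grammar}
```

The same two functions support both use cases named above: checking a new data set against a training grammar, or checking a treebank against the grammar extracted from itself minus the tree under inspection.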
Introduction and Motivation | Furthermore, parsing accuracy degrades unless sufficient amounts of labeled training data from the same domain are available (e.g., Gildea, 2001; Sekine, 1997), and thus we need larger and more varied annotated treebanks, covering a wide range of domains. |
Introduction and Motivation | However, there is a bottleneck in obtaining annotation, due to the need for manual intervention in annotating a treebank. |
Summary and Outlook | We have proposed different methods for flagging the errors in automatically-parsed corpora, by treating the problem as one of looking for anomalous rules with respect to a treebank grammar. |
Conclusions and future work | These predicates are among the most frequent in the TreeBank and are likely to require approaches that differ from the ones we pursued. |
Data annotation and analysis | Implicit arguments have not been annotated within the Penn TreeBank , which is the textual and syntactic basis for NomBank. |
Data annotation and analysis | Thus, to facilitate our study, we annotated implicit arguments for instances of nominal predicates within the standard training, development, and testing sections of the TreeBank. |
Implicit argument identification | Consider the following abridged sentences, which are adjacent in their Penn TreeBank document: |
Implicit argument identification | Starting with a wide range of features, we performed floating forward feature selection (Pudil et al., 1994) over held-out development data comprising implicit argument annotations from section 24 of the Penn TreeBank . |
Implicit argument identification | Throughout our study, we used gold-standard discourse relations provided by the Penn Discourse TreeBank (Prasad et al., 2008). |
Introduction | However, as shown by the following example from the Penn TreeBank (Marcus et al., 1993), this restriction excludes extra-sentential arguments: |
Introduction | The most well-known multiply-annotated and validated corpus of English is the one million word Wall Street Journal corpus known as the Penn Treebank (Marcus et al., 1993), which over the years has been fully or partially annotated for several phenomena over and above the original part-of-speech tagging and phrase structure annotation. |
Introduction | More recently, the OntoNotes project (Pradhan et al., 2007) released a one million word English corpus of newswire, broadcast news, and broadcast conversation that is annotated for Penn Treebank syntax, PropBank predicate argument structures, coreference, and named entities. |
MASC Annotations | Annotation (method, texts, words): Token (Validated, 118, 222,472); Sentence (Validated, 118, 222,472); POS/lemma (Validated, 118, 222,472); Noun chunks (Validated, 118, 222,472); Verb chunks (Validated, 118, 222,472); Named entities (Validated, 118, 222,472); FrameNet frames (Manual, 21, 17,829); HPSG (Validated, 40*, 30,106); Discourse (Manual, 40*, 30,106); Penn Treebank (Validated, 97, 87,383); PropBank (Validated, 92, 50,165); Opinion (Manual, 97, 47,583); TimeBank (Validated, 34, 5,434); Committed belief (Manual, 13, 4,614); Event (Manual, 13, 4,614); Coreference (Manual, 2, 1,877) |
MASC Annotations | Annotations produced by other projects and the FrameNet and Penn Treebank annotations produced specifically for MASC are semiautomatically and/or manually produced by those projects and subjected to their internal quality controls. |
MASC: The Corpus | All of the first 80K increment is annotated for Penn Treebank syntax. |
MASC: The Corpus | The second 120K increment includes 5.5K words of Wall Street Journal texts that have been annotated by several projects, including Penn Treebank, PropBank, Penn Discourse Treebank, TimeML, and the Pittsburgh Opinion project. |
Conclusions and Future Works | In Proceedings of the 5th International Workshop on Treebanks and Linguistic Theories. |
Conclusions and Future Works | Recognizing Implicit Discourse Relations in the Penn Discourse Treebank. |
Conclusions and Future Works | The Penn Discourse TreeBank 2.0. |
Experiments and Results | We directly use the gold-standard parse trees in the Penn TreeBank. |
Introduction | The experiment shows that the tree kernel is able to effectively incorporate syntactic structural information and produce statistically significant improvements over the flat syntactic path feature for the recognition of both explicit and implicit relations in the Penn Discourse Treebank (PDTB; Prasad et al., 2008). |
Penn Discourse TreeBank | The Penn Discourse Treebank (PDTB) is the largest available annotated corpus of discourse relations (Prasad et al., 2008), covering 2,312 Wall Street Journal articles. |
Experiments | We used three corpora for experiments: WSJ from the Penn Treebank, Wikipedia, and the general Web. |
Experiments | In contrast, TextRunner was trained with 91,687 positive examples and 96,795 negative examples generated from the WSJ dataset in Penn Treebank . |
Experiments | We used three parsing options on the WSJ dataset: Stanford parsing, CJ50 parsing (Charniak and Johnson, 2005), and the gold parses from the Penn Treebank. |
Introduction | For example, TextRunner uses a small set of handwritten rules to heuristically label training examples from sentences in the Penn Treebank . |
Wikipedia-based Open IE | In both cases, however, we generate training data from Wikipedia by matching sentences with infoboxes, while TextRunner used a small set of handwritten rules to label training examples from the Penn Treebank . |
Data | CCGbank was created by semiautomatically converting the Penn Treebank to CCG derivations (Hockenmaier and Steedman, 2007). |
Data | CCG-TUT was created by semiautomatically converting dependencies in the Italian Turin University Treebank to CCG derivations (Bos et al., 2009). |
Introduction | Most work has focused on POS-tagging for English using the Penn Treebank (Marcus et al., 1993), such as (Banko and Moore, 2004; Goldwater and Griffiths, 2007; Toutanova and Johnson, 2008; Goldberg et al., 2008; Ravi and Knight, 2009). |
Introduction | This generally involves working with the standard set of 45 POS-tags employed in the Penn Treebank. |
Abstract | Experiments on the translated portion of the Chinese Treebank show that our system outperforms monolingual parsers by 2.93 points for Chinese and 1.64 points for English. |
Experiments | All the bilingual data were taken from the translated portion of the Chinese Treebank (CTB) (Xue et al., 2002; Bies et al., 2007), articles 1-325 of CTB, which have English translations with gold-standard parse trees. |
Introduction | Experiments on the translated portion of the Chinese Treebank (Xue et al., 2002; Bies et al., 2007) show that our system outperforms state-of-the-art monolingual parsers by 2.93 points for Chinese and 1.64 points for English. |
Abstract | We evaluate our parsers on the Penn Treebank and Prague Dependency Treebank, achieving unlabeled attachment scores of 93.04% and 87.38%, respectively. |
Introduction | We evaluate our parsers on the Penn WSJ Treebank (Marcus et al., 1993) and Prague Dependency Treebank (Hajič et al., 2001), achieving unlabeled attachment scores of 93.04% and 87.38%. |
Parsing experiments | In order to evaluate the effectiveness of our parsers in practice, we apply them to the Penn WSJ Treebank (Marcus et al., 1993) and the Prague Dependency Treebank (Hajič et al., 2001; Hajič, 1998). We use standard training, validation, and test splits to facilitate comparisons. |
Background | This method has been used effectively to improve parsing performance on newspaper text (McClosky et al., 2006a), as well as adapting a Penn Treebank parser to a new domain (McClosky et al., 2006b). |
Data | We have used Sections 02-21 of CCGbank (Hockenmaier and Steedman, 2007), the CCG version of the Penn Treebank (Marcus et al., 1993), as training data for the newspaper domain. |
Introduction | Since the CCG lexical category set used by the supertagger is much larger than the Penn Treebank POS tag set, the accuracy of supertagging is much lower than that of POS tagging; hence the CCG supertagger assigns multiple supertags to a word when the local context does not provide enough information to decide on the correct supertag. |
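Multitagging of the kind just described is commonly implemented with a beta threshold, as in the C&C supertagger: a word keeps every lexical category whose probability is within a factor β of the most likely one, so ambiguity grows only where the tagger is uncertain. A minimal sketch; the function and parameter names are illustrative:

```python
def multitag(category_probs, beta=0.1):
    """Return all supertags whose probability is at least beta times
    that of the most likely supertag for this word."""
    best = max(category_probs.values())
    return sorted(t for t, p in category_probs.items() if p >= beta * best)
```

Raising β prunes more aggressively (faster parsing, lower tag coverage); lowering it keeps more categories per word, which is how supertaggers trade speed against the chance of discarding the correct category.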
Substructure Spaces for BTKs | Compared with the widely used Penn TreeBank annotation, the new criterion uses some different grammar tags and is able to describe some rare linguistic phenomena in Chinese effectively. |
Substructure Spaces for BTKs | The annotator still uses Penn TreeBank annotation on the English side. |
Substructure Spaces for BTKs | In addition, the HIT corpus is not applicable for MT experiments due to the problems of domain divergence, annotation discrepancy (the Chinese parse trees employ a different grammar from Penn Treebank annotations) and degree of tolerance for parsing errors. |