Abstract | We are interested in parsing constituency-based grammars such as HPSG and CCG using a small amount of data specific to the target formalism, and a large quantity of coarse CFG annotations from the Penn Treebank.
Abstract | While all of the target formalisms share a similar basic syntactic structure with Penn Treebank CFG, they also encode additional constraints and semantic features. |
Introduction | The standard solution to this bottleneck has relied on manually crafted transformation rules that map readily available syntactic annotations (e.g., the Penn Treebank) to the desired formalism.
Introduction | In addition, designing these rules frequently requires external resources such as WordNet, and even involves correcting the existing treebank.
Introduction | A natural candidate for such coarse annotations is the context-free grammar (CFG) of the Penn Treebank, while the target formalism can be any constituency-based grammar, such as Combinatory Categorial Grammar (CCG) (Steedman, 2001), Lexical Functional Grammar (LFG) (Bresnan, 1982) or Head-Driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994).
Related Work | For instance, mappings may specify how to convert traces and functional tags in the Penn Treebank to the f-structure in LFG (Cahill, 2004).
Bilingual Projection of Dependency Grammar | Therefore, we can hardly obtain a treebank with complete trees through direct projection. |
Bilingual Projection of Dependency Grammar | Instead, we extract projected discrete dependency arc instances, rather than a treebank, as the training set for the projected grammar induction model.
Bilingually-Guided Dependency Grammar Induction | Then we incorporate the projection model into our iterative unsupervised framework, and jointly optimize the unsupervised and projection objectives with the evolving treebank and the constant projection information, respectively.
Introduction | A randomly-initialized monolingual treebank evolves in a self-training iterative procedure, and the grammar parameters are tuned to simultaneously maximize both the monolingual likelihood and the bilingually-projected likelihood of the evolving treebank.
Unsupervised Dependency Grammar Induction | The framework of our unsupervised model first builds a random treebank on the monolingual corpus for initialization and trains a discriminative parsing model on it.
Unsupervised Dependency Grammar Induction | Then we use the parser to build an evolved treebank from the 1-best results for the next iteration.
Unsupervised Dependency Grammar Induction | In this way, the parser and treebank evolve in an iterative way until convergence. |
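The iterative procedure described in the lines above (random initialization, training a parser, re-annotating with the 1-best parses, repeating until convergence) can be sketched as follows. This is a toy stand-in of ours: the "parser" here just memorizes the most frequent head offset per word, not the discriminative model the authors use.

```python
import random
from collections import Counter

def random_treebank(corpus, seed=0):
    """Initialization: attach every word to a random other position (-1 = root)."""
    rng = random.Random(seed)
    return [[rng.choice([h for h in range(-1, len(sent)) if h != i])
             for i in range(len(sent))] for sent in corpus]

def train(corpus, treebank):
    """Toy 'parser' training: remember the most frequent head offset per word."""
    stats = {}
    for sent, heads in zip(corpus, treebank):
        for i, w in enumerate(sent):
            stats.setdefault(w, Counter())[heads[i] - i] += 1
    return {w: c.most_common(1)[0][0] for w, c in stats.items()}

def parse_1best(corpus, model):
    """Re-annotate the corpus with the 1-best analysis of the current model."""
    evolved = []
    for sent in corpus:
        heads = []
        for i, w in enumerate(sent):
            h = max(-1, min(len(sent) - 1, i + model[w]))
            heads.append(-1 if h == i else h)  # clamping must not create self-loops
        evolved.append(heads)
    return evolved

def induce(corpus, max_iter=20):
    treebank = random_treebank(corpus)
    for _ in range(max_iter):
        evolved = parse_1best(corpus, train(corpus, treebank))
        if evolved == treebank:  # convergence: the treebank stopped changing
            return treebank
        treebank = evolved
    return treebank

corpus = [["the", "dog", "barks"], ["dog", "barks"]]
treebank = induce(corpus)  # list of head indices per sentence, -1 = root
```

A real implementation would substitute an actual parsing model for `train`/`parse_1best`; the skeleton only demonstrates the evolve-until-fixed-point control flow.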
Abstract | By incorporating this knowledge into Dependency Model with Valence, we managed to considerably outperform the state-of-the-art results in terms of average attachment score over 20 treebanks from CoNLL 2006 and 2007 shared tasks. |
Experiments | The first type is the CoNLL treebanks from the years 2006 (Buchholz and Marsi, 2006) and 2007 (Nivre et al., 2007), which we use for inference and for evaluation.
Experiments | The Wikipedia texts were automatically tokenized and segmented into sentences so that their tokenization was similar to that of the CoNLL evaluation treebanks.
Experiments | To evaluate the quality of our estimations, we compare them with P_stop, the stop probabilities computed directly on the evaluation treebanks.
Introduction | This is still far below the supervised approaches, but their indisputable advantage is that no annotated treebanks are needed and the induced structures are not burdened by any linguistic conventions.
Introduction | Supervised parsers only ever simulate the treebanks they were trained on, whereas unsupervised parsers can be fitted to different particular applications.
Model | Finally, we obtain the probability of the whole generated treebank as a product over the trees: |
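A plausible reconstruction of the product this sentence introduces (the tree index i and treebank size n are our notation):

```latex
P_{\text{treebank}} = \prod_{i=1}^{n} P_{\text{tree}}(T_i)
```

Since multiplication is commutative, this form also makes explicit why the ordering of the trees cannot affect the treebank probability.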
Model | No matter how the trees are ordered in the treebank, P_treebank is always the same.
STOP-probability estimation | stop words in the treebank should be 2/3. |
Abstract | We present a new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean. |
Abstract | This ‘universal’ treebank is made freely available in order to facilitate research on multilingual dependency parsing.
Introduction | Research in dependency parsing — computational methods to predict such representations — has increased dramatically, due in large part to the availability of dependency treebanks in a number of languages. |
Introduction | While these data sets are standardized in terms of their formal representation, they are still heterogeneous treebanks.
Introduction | That is to say, although they are all dependency treebanks, each annotating every sentence with a dependency tree, they subscribe to different annotation schemes.
Towards A Universal Treebank | (2004) for multilingual syntactic treebank construction. |
Towards A Universal Treebank | The second, used only for English and Swedish, is to automatically convert existing treebanks, as in Zeman et al.
Towards A Universal Treebank | For English, we used the Stanford parser (v1.6.8) (Klein and Manning, 2003) to convert the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993) to basic dependency trees, including punctuation and with the copula verb as head in copula constructions.
Abstract | We introduce a novel taxonomy of such approaches and apply it to treebanks across a typologically diverse range of 26 languages. |
Introduction | One of the reasons is the increased availability of dependency treebanks, be they results of genuine dependency annotation projects or converted automatically from previously existing phrase-structure treebanks.
Introduction | In both cases, a number of decisions have to be made during the construction or conversion of a dependency treebank.
Introduction | The dominant solution in treebank design is to introduce artificial rules for the encoding of coordination structures within dependency trees using the same means that express dependencies, i.e., by using edges and by labeling nodes or edges.
Related work | PS = Prague Dependency Treebank (PDT) style: all conjuncts are attached under the coordinating conjunction (along with shared modifiers, which are distinguished by a special attribute) (Hajič et al., 2006),
Related work | Moreover, particular treebanks vary in their content even more than in their format, i.e.,
Related work | each treebank has its own way of representing prepositions or a different granularity of syntactic labels.
Variations in representing coordination structures | Our analysis of variations in representing coordination structures is based on observations from a set of dependency treebanks for 26 languages.
Abstract | This paper discusses the construction of a parallel treebank currently involving ten languages from six language families. |
Abstract | The treebank is based on deep LFG (Lexical-Functional Grammar) grammars that were developed within the framework of the ParGram (Parallel Grammar) effort. |
Abstract | This output forms the basis of a parallel treebank covering a diverse set of phenomena. |
Introduction | This paper discusses the construction of a parallel treebank currently involving ten languages that represent several different language families, including non-Indo-European ones.
Introduction | The treebank is based on the output of individual deep LFG (Lexical-Functional Grammar) grammars that were developed independently at different sites but within the overall framework of ParGram (the Parallel Grammar project) (Butt et al., 1999a; Butt et al., 2002). |
Introduction | This output forms the basis of the ParGramBank parallel treebank discussed here. |
Background | Yoshida (2005) proposed methods for extracting a wide-coverage lexicon based on HPSG from a phrase structure treebank of Japanese. |
Background | Their treebanks are annotated with dependencies of words, the conversion of which into phrase structures is not a big concern. |
Conclusion | Our method integrates multiple dependency-based resources to convert them into an integrated phrase structure treebank . |
Conclusion | The obtained treebank is then transformed into CCG derivations. |
Corpus integration and conversion | As we have adopted the method of CCGbank, which relies on a source treebank to be converted into CCG derivations, a critical issue to address is the absence of a Japanese counterpart to PTB. |
Corpus integration and conversion | Our solution is to first integrate multiple dependency-based resources and convert them into a phrase structure treebank that is independent |
Corpus integration and conversion | Next, we translate the treebank into CCG derivations (Step 2). |
Introduction | Our work is basically an extension of a seminal work on CCGbank (Hockenmaier and Steedman, 2007), in which the phrase structure trees of the Penn Treebank (PTB) (Marcus et al., 1993) are converted into CCG derivations and a wide-coverage CCG lexicon is then extracted from these derivations. |
Dataset Creation | Of the 21,938 total examples, 15,330 come from sections 2-21 of the Penn Treebank (Marcus et al., 1993).
Dataset Creation | For the Penn Treebank, we extracted the examples using the provided gold-standard parse trees, whereas, for the latter cases, we used the output of an open-source parser (Tratz and Hovy, 2011).
Experiments | The accuracy figures for the test instances from the Penn Treebank, The Jungle Book, and The History of the Decline and Fall of the Roman Empire were 88.8%, 84.7%, and 80.6%, respectively.
Related Work | The NomBank project (Meyers et al., 2004) provides coarse annotations for some of the possessive constructions in the Penn Treebank, but only those that meet their criteria.
Semantic Relation Inventory | Penn Treebank, respectively.
Semantic Relation Inventory | portion of the Penn Treebank.
Semantic Relation Inventory | The Penn Treebank and The History of the Decline and Fall of the Roman Empire were substantially more similar, although there are notable differences.
Abstract | Empty categories (EC) are artificial elements in Penn Treebanks motivated by the government-binding (GB) theory to explain certain language phenomena such as pro-drop.
Chinese Empty Category Prediction | The empty categories in the Chinese Treebank (CTB) include trace markers for A′- and A-movement, dropped pronouns, big PRO, etc.
Chinese Empty Category Prediction | Our effort of recovering ECs is a two-step process: first, at training time, ECs in the Chinese Treebank are moved and preserved in the portion of the tree structures pertaining to surface words only. |
Experimental Results | We use the Chinese Treebank (CTB) V7.0 to train and test the EC prediction model.
Introduction | In order to account for certain language phenomena such as pro-drop and wh-movement, a set of special tokens, called empty categories (EC), are used in Penn Treebanks (Marcus et al., 1993; Bies and Maamouri, 2003; Xue et al., 2005). |
Architecture of BRAINSUP | These constraints are learned from relation-head-modifier co-occurrence counts estimated from a dependency treebank L.
Architecture of BRAINSUP | Algorithm 1 SentenceGeneration(U, θ, P, L): U is the user specification, θ is a set of meta-parameters; P and L are two dependency treebanks.
Architecture of BRAINSUP | We estimate the probability of a modifier word m and its head h being in the relation r as p_r(h, m) = c_r(h, m) / Σ_{h_i, m_i} c_r(h_i, m_i), where c_r(·) is the number of times that m depends on h in the dependency treebank L, and the h_i, m_i are all the head/modifier pairs observed in L.
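The estimator above is a relative frequency normalized within each relation. A minimal sketch, where the list of (relation, head, modifier) triples stands in for the counts one would extract from a real dependency treebank:

```python
from collections import Counter

def estimate_probs(dependencies):
    """dependencies: iterable of (relation, head, modifier) triples.
    Returns p_r(h, m) = c_r(h, m) / sum over all (h_i, m_i) of c_r(h_i, m_i),
    i.e. counts normalized within each relation r."""
    counts = Counter(dependencies)                   # c_r(h, m)
    totals = Counter(r for r, _, _ in dependencies)  # per-relation denominators
    return {(r, h, m): c / totals[r] for (r, h, m), c in counts.items()}

# Illustrative triples, not drawn from any actual treebank.
deps = [("amod", "car", "red"), ("amod", "car", "fast"),
        ("amod", "car", "red"), ("nsubj", "runs", "car")]
p = estimate_probs(deps)
# p[("amod", "car", "red")] is 2/3: two of the three amod pairs are (car, red).
```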
Conclusion | BRAINSUP makes heavy use of dependency parsed data and statistics collected from dependency treebanks to ensure the grammaticality of the generated sentences, and to trim the search space while seeking the sentences that maximize the user satisfaction. |
Evaluation | As discussed in Section 3, we use two different treebanks to learn the syntactic patterns (P) and the dependency operators (L).
Abstract | We perform parsing experiments on the Penn Treebank and draw comparisons to Tree-Substitution Grammars and between different variations in probabilistic model design.
Experiments | As a proof of concept, we investigate OSTAG in the context of the classic Penn Treebank statistical parsing setup: training on sections 2-21 and testing on section 23.
Experiments | Furthermore, the various parameterizations of adjunction with OSTAG indicate that, at least in the case of the Penn Treebank, the finer-grained modeling of a full table of adjunction probabilities for each Goodman index (OSTAG3) overcomes the danger of sparse data estimates.
Introduction | We evaluate OSTAG on the familiar task of parsing the Penn Treebank . |
TAG and Variants | We propose a simple but empirically effective heuristic for grammar induction for our experiments on Penn Treebank data. |
Experiments | Labeled English data employed in this paper were derived from the Wall Street Journal (WSJ) corpus of the Penn Treebank (Marcus et al., 1993). |
Experiments | For labeled Chinese data, we used version 5.1 of the Penn Chinese Treebank (CTB) (Xue et al., 2005).
Experiments | In addition, we removed from the unlabeled English data the sentences that appear in the WSJ corpus of the Penn Treebank.
Introduction | On standard evaluations using both the Penn Treebank and the Penn Chinese Treebank , our parser gave higher accuracies than the Berkeley parser (Petrov and Klein, 2007), a state-of-the-art chart parser. |
A UCCA-Annotated Corpus | For instance, both the PTB and the Prague Dependency Treebank (Böhmová et al., 2003) employed annotators with extensive linguistic background.
Introduction | In fact, the annotations of (a) and (c) are identical under the most widely-used schemes for English, the Penn Treebank (PTB) (Marcus et al., 1993) and CoNLL-style dependencies (Surdeanu et al., 2008) (see Figure 1).
Related Work | The most prominent annotation scheme in NLP for English syntax is the Penn Treebank . |
Related Work | Examples include the Groningen Meaning Bank (Basile et al., 2012), Treebank Semantics (Butler and Yoshimoto, 2012) and the LinGO Redwoods treebank (Oepen et al., 2004).
Experiments | The first group comes from the English Web Treebank (EWT), also used in the Parsing the Web shared task (Petrov and McDonald, 2012).
Experiments | We train our tagger on sections 2-21 of the WSJ data in the Penn-III Treebank (PTB), OntoNotes 4.0 release.
Experiments | Finally we do experiments with the Danish section of the Copenhagen Dependency Treebank (CDT). |
Algorithm | Lastly, in step (i) of Figure 1, we run the k-means clustering method on the S-CODE sphere and split the word-substitute word pairs into 45 clusters, because the treebank we worked on uses 45 part-of-speech tags.
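The clustering step can be illustrated with a plain k-means (Lloyd's algorithm) implementation. This pure-Python sketch with toy 2-D points is ours; it is not the S-CODE embeddings or the clustering tool the authors used.

```python
def d2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    # Farthest-point initialization: deterministic and avoids starting
    # with two centroids inside the same natural cluster.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(d2(p, c) for c in centroids)))
    assign = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new = [min(range(k), key=lambda j: d2(p, centroids[j])) for p in points]
        if new == assign:  # assignments stable -> converged
            break
        assign = new
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assign

# Two obvious groups; with k=2 each pair should land in its own cluster.
pts = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
labels = kmeans(pts, 2)
```

On the real task the points would be the word-substitute pair embeddings on the S-CODE sphere and k would be 45.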
Experiments | The experiments are conducted on the Penn Treebank Wall Street Journal corpus.
Experiments | Because we are trying to improve on (Yatbaz et al., 2012), we select the experiment on the Penn Treebank Wall Street Journal corpus in that work as our baseline and replicate it.
Introduction | For instance, the gold tag perplexity of the word “offers” in the Penn Treebank Wall Street Journal corpus we worked on equals 1.966.
Experimental Assessment | Performance evaluation is carried out on the Penn Treebank (Marcus et al., 1993) converted to Stanford basic dependencies (de Marneffe et al., 2006).
Introduction | This development is probably due to many factors, such as the increased availability of dependency treebanks and the perceived usefulness of dependency structures as an interface to downstream applications, but a very important reason is also the high efficiency offered by dependency parsers, enabling web-scale parsing with high throughput. |
Introduction | While the classical approach limits training data to parser states that result from oracle predictions (derived from a treebank), these novel approaches allow the classifier to explore states that result from its own (sometimes erroneous) predictions (Choi and Palmer, 2011; Goldberg and Nivre, 2012).
Introduction | For example, Higgins and Sadock (2003) find fewer than 1000 sentences with two or more explicit quantifiers in the Wall Street Journal section of the Penn Treebank.
Introduction | Plurals form 18% of the NPs in our corpus and 20% of the nouns in the Penn Treebank.
Introduction | Explicit universals, on the other hand, form less than 1% of the determiners in the Penn Treebank.
Experiments | We use the Penn Chinese Treebank 5.0 (CTB) (Xue et al., 2005) as the existing annotated corpus for Chinese word segmentation. |
Introduction | Taking Chinese word segmentation as an example, the state-of-the-art models (Xue and Shen, 2003; Ng and Low, 2004; Gao et al., 2005; Nakagawa and Uchimoto, 2007; Zhao and Kit, 2008; Jiang et al., 2009; Zhang and Clark, 2010; Sun, 2011b; Li, 2011) are usually trained on human-annotated corpora such as the Penn Chinese Treebank (CTB) (Xue et al., 2005), and perform quite well on corresponding test sets.
Related Work | In parsing, Pereira and Schabes (1992) proposed an extended inside-outside algorithm that infers the parameters of a stochastic CFG from a partially parsed treebank . |
Abstract | We present a reformulation of the word pair features typically used for the task of disambiguating implicit relations in the Penn Discourse Treebank . |
Other Features | Previous work has relied on features based on the gold parse trees of the Penn Treebank (which overlaps with PDTB) and on contextual information from relations preceding the one being disambiguated. |
Related Work | More recently, implicit relation prediction has been evaluated on annotated implicit relations from the Penn Discourse Treebank (Prasad et al., 2008). |
Results | We trained our model on sections 2-21 of the WSJ part of the Penn Treebank (Marcus et al., 1999).
Results | Unfortunately, marking for argument/modifiers in the Penn Treebank is incomplete, and is limited to certain adverbials, e.g. |
Results | This corpus adds annotations indicating, for each node in the Penn Treebank , whether that node is a modifier. |