Index of papers in Proc. ACL 2012 that mention
  • treebank
Constant, Matthieu and Sigogne, Anthony and Watrin, Patrick
Conclusions and Future Work
The authors are very grateful to Spence Green for his useful help on the treebank, and to Jennifer Thewissen for her careful proofreading.
Introduction
The grammar was trained with a reference treebank where MWEs were annotated with a specific nonterminal node.
Introduction
The experiments were carried out on the French Treebank (Abeillé et al., 2003) where MWEs are annotated.
MWE-dedicated Features
In our collocation resource, each candidate collocation of the French treebank is associated with its internal syntactic structure and its association score (log-likelihood).
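The collocation feature above rests on a log-likelihood association score. As a point of reference, here is a minimal Python sketch of a Dunning-style log-likelihood ratio for a candidate bigram, computed from a 2x2 contingency table; the function name and the toy counts are illustrative, not the authors' actual resource or code:

  import math

  def log_likelihood_ratio(c12, c1, c2, n):
      """Dunning-style log-likelihood ratio (G^2) for a candidate bigram
      collocation, from a 2x2 contingency table. c12: joint count of
      (w1, w2); c1, c2: marginal counts of w1 and w2; n: total bigrams."""
      table = [
          [c12,      c1 - c12],           # w2 after w1 / other word after w1
          [c2 - c12, n - c1 - c2 + c12],  # w2 after other word / neither
      ]
      rows = [sum(r) for r in table]
      cols = [sum(c) for c in zip(*table)]
      g2 = 0.0
      for i in range(2):
          for j in range(2):
              observed = table[i][j]
              expected = rows[i] * cols[j] / n
              if observed > 0:
                  g2 += observed * math.log(observed / expected)
      return 2.0 * g2

  # Illustrative counts only: a frequent candidate pair in a 500k-bigram corpus.
  print(log_likelihood_ratio(c12=150, c1=400, c2=300, n=500_000))

Higher scores indicate that the two words co-occur far more often than chance would predict, which is what makes the score useful as an MWE-dedicated feature.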
Multiword expressions
(2011) confirmed these bad results on the French Treebank.
Multiword expressions
They show a general tagging accuracy of 94% on the French Treebank.
Multiword expressions
To do so, the MWEs in the training treebank were annotated with specific nonterminal nodes.
Resources
The French Treebank is composed of 435,860 lexical units (34,178 types).
Resources
In order to compare compounds in these lexical resources with the ones in the French Treebank, we applied the dictionaries and the lexicon extracted from the training corpus to the development corpus.
Resources
The authors provided us with a list of 17,315 candidate nominal collocations occurring in the French treebank with their log-likelihood and their internal flat structure.
Two strategies, two discriminative models
The parameter vector is estimated during the training stage from a reference treebank and the baseline parser outputs.
treebank is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Li, Zhenghua and Liu, Ting and Che, Wanxiang
Abstract
We present a simple and effective framework for exploiting multiple monolingual treebanks with different annotation guidelines for parsing.
Abstract
Several types of transformation patterns (TP) are designed to capture the systematic annotation inconsistencies among different treebanks .
Abstract
Our approach can significantly advance the state-of-the-art parsing accuracy on two widely used target treebanks (Penn Chinese Treebank 5.1 and 6.0) using the Chinese Dependency Treebank as the source treebank.
Introduction
However, the heavy cost of treebanking typically limits one single treebank in both scale and genre.
Introduction
At present, learning from one single treebank seems inadequate for further boosting parsing accuracy.
Introduction
Treebanks   # of Words     Grammar
CTB5        0.51 million   Phrase structure
CTB6        0.78 million   Phrase structure
treebank is mentioned in 62 sentences in this paper.
Topics mentioned in this paper:
Pauls, Adam and Klein, Dan
Experiments
In Table 1, we show the first four samples of length between 15 and 20 generated from our model and a 5-gram model trained on the Penn Treebank.
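For orientation, a plain 5-gram baseline of the kind mentioned above can be estimated by counting fixed-length token windows over the training sentences. The sketch below shows unsmoothed maximum-likelihood estimation only; the names and toy corpus are illustrative, and a real model of this scale would add smoothing and backoff:

  from collections import Counter

  def count_ngrams(sentences, order=5):
      """Collect n-gram and context counts of the given order from tokenized
      sentences, padding with boundary symbols."""
      grams, contexts = Counter(), Counter()
      for tokens in sentences:
          padded = ["<s>"] * (order - 1) + tokens + ["</s>"]
          for i in range(len(padded) - order + 1):
              gram = tuple(padded[i:i + order])
              grams[gram] += 1
              contexts[gram[:-1]] += 1
      return grams, contexts

  def mle_prob(gram, grams, contexts):
      """Unsmoothed maximum-likelihood probability of the n-gram's last word
      given its (order-1)-word history."""
      history = gram[:-1]
      return grams[gram] / contexts[history] if contexts[history] else 0.0

  # Toy corpus; the paper's model is trained on treebank and Gigaword text.
  corpus = [["the", "parser", "reads", "the", "treebank", "."]]
  grams, contexts = count_ngrams(corpus, order=5)
  print(mle_prob(("<s>", "<s>", "<s>", "<s>", "the"), grams, contexts))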
Experiments
For training data, we constructed a large treebank by concatenating the WSJ and Brown portions of the Penn Treebank, the 50K BLLIP training sentences from Post (2011), and the AFP and APW portions of English Gigaword version 3 (Graff, 2003), totaling about 1.3 billion tokens.
Experiments
We used the human-annotated parses for the sentences in the Penn Treebank , but parsed the Gigaword and BLLIP sentences with the Berkeley Parser.
Tree Transformations
Figure 2: A sample parse from the Penn Treebank after the tree transformations described in Section 3.
Tree Transformations
number of transformations of Treebank constituency parses that allow us to capture such dependencies.
Tree Transformations
Although the Penn Treebank annotates temporal NPs, most off-the-shelf parsers do not retain these tags, and we do not assume their presence.
Treelet Language Modeling
There is one additional hurdle in the estimation of our model: while there exist corpora with human-annotated constituency parses like the Penn Treebank (Marcus et al., 1993), these corpora are quite small (on the order of millions of tokens) and we cannot gather nearly as many counts as we can for n-grams, for which billions or even trillions (Brants et al., 2007) of tokens are available on the Web.
treebank is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Gardent, Claire and Narayan, Shashi
Conclusion
Using the Penn Treebank sentences associated with each SR Task dependency tree, we will create the two tree sets necessary to support error mining by dividing the set of trees output by the surface realiser into a set of trees (FAIL) associated with overgeneration (the generated sentences do not match the original sentences) and a set of trees (SUCCESS) associated with success (the generated sentence matches the original sentences).
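The FAIL/SUCCESS split described above amounts to partitioning the realiser's output trees by whether the generated sentence matches the original. A minimal sketch, assuming exact string match after simple whitespace and case normalisation (the normalisation step itself is an assumption, not something this excerpt specifies):

  def split_for_error_mining(items):
      """Partition (tree, generated_sentence, reference_sentence) triples into
      SUCCESS and FAIL tree sets by whether the generated string matches the
      reference after simple normalisation."""
      def norm(s):
          return " ".join(s.lower().split())
      success, fail = [], []
      for tree, generated, reference in items:
          (success if norm(generated) == norm(reference) else fail).append(tree)
      return success, fail

  # Placeholder trees and strings, for illustration only.
  items = [
      ("tree_1", "the cat sleeps", "The cat sleeps"),
      ("tree_2", "cat the sleeps", "The cat sleeps"),
  ]
  success, fail = split_for_error_mining(items)
  print(len(success), len(fail))  # 1 1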
Experiment and Results
The shallow input data provided by the SR Task was obtained from the Penn Treebank using the LTH Constituent-to-Dependency Conversion Tool for Penn-style Treebanks (Pennconverter; Johansson and Nugues, 2007).
Experiment and Results
The chunking was performed by retrieving from the Penn Treebank (PTB), for each phrase type, the yields of the constituents of that type and by using the alignment between words and dependency tree nodes provided by the organisers of the SR Task.
Experiment and Results
In the Penn Treebank, the POS tag is the category assigned to possessive ’s.
Related Work
(Callaway, 2003) avoids this shortcoming by converting the Penn Treebank to the format expected by his realiser.
treebank is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Shindo, Hiroyuki and Miyao, Yusuke and Fujino, Akinori and Nagata, Masaaki
Abstract
Our SR-TSG parser achieves an F1 score of 92.4% in the Wall Street Journal (WSJ) English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and better than state-of-the-art discriminative reranking parsers.
Experiment
We ran experiments on the Wall Street Journal (WSJ) portion of the English Penn Treebank data set (Marcus et al., 1993), using a standard data split (sections 2-21 for training, 22 for development and 23 for testing).
Experiment
The treebank data is right-binarized (Matsuzaki et al., 2005) to construct grammars with only unary and binary productions.
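Right-binarization as cited above replaces flat n-ary productions with nested binary ones so that only unary and binary rules remain. A small sketch using NLTK's Tree.chomsky_normal_form, which supports right factoring; the Markovization setting and the toy tree are illustrative, not this paper's exact configuration:

  from nltk.tree import Tree

  # A toy ternary production; real treebank trees contain many such flat rules.
  t = Tree.fromstring("(NP (DT the) (JJ old) (NN parser))")

  # Right-binarize in place so every production is unary or binary, which is
  # what grammar extraction with only unary and binary rules requires.
  t.chomsky_normal_form(factor="right", horzMarkov=2)
  print(t)  # every internal node now has at most two children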
Experiment
This result suggests that the conventional TSG model trained from the vanilla treebank is insufficient to resolve
Introduction
Probabilistic context-free grammar (PCFG) underlies many statistical parsers; however, it is well known that the PCFG rules extracted from treebank data via maximum likelihood estimation do not perform well due to unrealistic context freedom assumptions (Klein and Manning, 2003).
Introduction
Symbol refinement is a successful approach for weakening context freedom assumptions by dividing coarse treebank symbols (e.g.
Introduction
Our SR-TSG parser achieves an F1 score of 92.4% in the WSJ English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and superior to state-of-the-art discriminative reranking parsers.
treebank is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Zhao, Qiuye and Marcus, Mitch
Abstract
On the other hand, consider the annotation guideline of English Treebank (Marcus et al., 1993) instead.
Abstract
Following this POS representation, there are as many as 10 possible POS tags that may occur in between the-of, as estimated from the WSJ corpus of Penn Treebank.
Abstract
To explore determinacy in the distribution of POS tags in Penn Treebank , we need to consider that a POS tag marks the basic syntactic category of a word as well as its morphological inflection.
treebank is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Green, Spence and DeNero, John
A Class-based Model of Agreement
More than 25 treebanks (in 22 languages) can be automatically mapped to this tag set, which includes “Noun” (nominals), “Verb” (verbs), “Adj” (adjectives), and “ADP” (pre- and postpositions).
A Class-based Model of Agreement
Many of these treebanks also contain per-token morphological annotations.
A Class-based Model of Agreement
We trained a simple add-1 smoothed bigram language model over gold class sequences in the same treebank training data:
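An add-1 (Laplace) smoothed bigram model of this kind is straightforward to state: P(c_i | c_{i-1}) = (count(c_{i-1}, c_i) + 1) / (count(c_{i-1}) + V), with V the number of distinct classes. A minimal sketch over gold class sequences (the boundary handling and toy sequences are assumptions, not the paper's exact setup):

  from collections import Counter

  def train_add1_bigram(sequences):
      """Add-1 (Laplace) smoothed bigram model over class sequences."""
      unigrams, bigrams = Counter(), Counter()
      vocab = {"<s>", "</s>"}
      for seq in sequences:
          tags = ["<s>"] + list(seq) + ["</s>"]
          vocab.update(seq)
          for prev, cur in zip(tags, tags[1:]):
              unigrams[prev] += 1
              bigrams[(prev, cur)] += 1
      v = len(vocab)

      def prob(prev, cur):
          return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + v)

      return prob

  # Toy gold class sequences (e.g. coarse word classes per target sentence).
  prob = train_add1_bigram([["Noun", "Verb", "Noun"], ["Adj", "Noun", "Verb"]])
  print(prob("Noun", "Verb"))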
Conclusion and Outlook
The model can be implemented with a standard CRF package, trained on existing treebanks for many languages, and integrated easily with many MT feature APIs.
Experiments
Experimental Setup All experiments use the Penn Arabic Treebank (ATB) (Maamouri et al., 2004) parts 1-3 divided into training/dev/test sections according to the canonical split (Rambow et al., 2005).
treebank is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Sun, Weiwei and Wan, Xiaojun
About Heterogeneous Annotations
This paper focuses on two representative popular corpora for Chinese lexical processing: (1) the Penn Chinese Treebank (CTB) and (2) the PKU’s People’s Daily data (PPD).
Abstract
Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD), on manually mapped data, and show that their linguistic annotations are systematically different and highly compatible.
Data-driven Annotation Conversion
A well known work is transforming Penn Treebank into resources for various deep linguistic processing, including LTAG (Xia, 1999), CCG (Hockenmaier and Steedman, 2007), HPSG (Miyao et al., 2004) and LFG (Cahill et al., 2002).
Introduction
For example, the Penn Treebank is popular to train PCFG-based parsers, while the Redwoods Treebank is well known for HPSG research; the Propbank is favored to build general semantic role labeling systems, while the FrameNet is attractive for predicate-specific labeling.
Introduction
Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD).
treebank is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zhou, Yuping and Xue, Nianwen
Abstract
Our scheme, inspired by the Penn Discourse TreeBank (PDTB), adopts the lexically grounded approach; at the same time, it makes adaptations based on the linguistic and statistical characteristics of Chinese text.
Adapted scheme for Chinese
According to a rough count on 20 randomly selected files from Chinese Treebank (Xue et al., 2005), 82% are tokens of implicit relation, compared to 54.5% in the PDTB 2.0.
Annotation experiment
The data set consists of 98 files taken from the Chinese Treebank (Xue et al., 2005).
Introduction
In the realm of discourse annotation, the Penn Discourse TreeBank (PDTB) (Prasad et al., 2008) separates itself by adopting a lexically grounded approach: Discourse relations are lexically anchored by discourse connectives (e.g., because, but, therefore), which are viewed as predicates that take abstract objects such as propositions, events and states as their arguments.
treebank is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Chen, Xiao and Kit, Chunyu
Abstract
Experiments on English and Chinese treebanks confirm its advantage over its first-order version.
Experiment
Our parsing models are evaluated on both English and Chinese treebanks, i.e., the WSJ section of Penn Treebank 3.0 (LDC99T42) and the Chinese Treebank 5.1 (LDC2005T01U01).
Experiment
For parser combination, we follow the setting of Fossum and Knight (2009), using Section 24 instead of Section 22 of WSJ treebank as development set.
Introduction
Evaluated on the PTB WSJ and Chinese Treebank, it achieves its best F1 scores of 91.86% and 85.58%, respectively.
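The F1 figures quoted above are labeled bracketing scores in the PARSEVAL tradition: precision and recall over (label, start, end) constituent spans, combined as F1 = 2PR/(P+R). A simplified sketch, not the evalb tool itself and ignoring its usual exclusions such as punctuation:

  from collections import Counter

  def bracketing_f1(gold_spans, predicted_spans):
      """Labeled bracketing F1 over (label, start, end) constituent spans,
      matching duplicate spans with multiplicity."""
      gold, pred = Counter(gold_spans), Counter(predicted_spans)
      matched = sum((gold & pred).values())
      precision = matched / sum(pred.values()) if pred else 0.0
      recall = matched / sum(gold.values()) if gold else 0.0
      if precision + recall == 0.0:
          return 0.0
      return 2 * precision * recall / (precision + recall)

  gold = [("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)]
  pred = [("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)]
  print(round(bracketing_f1(gold, pred), 3))  # 0.667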
treebank is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Sun, Weiwei and Uszkoreit, Hans
Abstract
Experiments on the Penn Chinese Treebank demonstrate the importance of both paradigmatic and syntagmatic relations.
Introduction
We conduct experiments on the Penn Chinese Treebank and Chinese Gigaword.
State-of-the-Art
Their evaluations on the Chinese Treebank show that Chinese POS tagging obtains an accuracy of about 93-94%.
State-of-the-Art
Penn Chinese Treebank (CTB) (Xue et al., 2005) is a popular data set to evaluate a number of Chinese NLP tasks, including word segmentation (Sun and
treebank is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Feng, Vanessa Wei and Hirst, Graeme
Discourse-annotated corpora
2.1 The RST Discourse Treebank
Discourse-annotated corpora
The RST Discourse Treebank (RST-DT) (Carlson et al., 2001), is a corpus annotated in the framework of RST.
Discourse-annotated corpora
2.2 The Penn Discourse Treebank
treebank is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Lippincott, Thomas and Korhonen, Anna and Ó Séaghdha, Diarmuid
Introduction
However, the treebanks necessary for training a high-accuracy parsing model are expensive to build for new domains.
Methodology
An unlexicalized parser cannot distinguish these based just on POS tags, while a lexicalized parser requires a large treebank.
Previous work
These typically rely on language-specific knowledge, either directly through heuristics, or indirectly through parsing models trained on treebanks .
treebank is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Kummerfeld, Jonathan K. and Klein, Dan and Curran, James R.
Evaluation
Using sections 00-21 of the treebanks, we handcrafted instructions for 527 lexical categories, a process that took under 100 hours, and includes all the categories used by the C&C parser.
Evaluation
Figure 3: For each sentence in the treebank, we plot the converted parser output against gold conversion (left), and the original parser evaluation against gold conversion (right).
Introduction
Converting the Penn Treebank (PTB, Marcus et al., 1993) to other formalisms, such as HPSG (Miyao et al., 2004), LFG (Cahill et al., 2008), LTAG (Xia, 1999), and CCG (Hockenmaier, 2003), is a complex process that renders linguistic phenomena in formalism-specific ways.
treebank is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Hatori, Jun and Matsuzaki, Takuya and Miyao, Yusuke and Tsujii, Jun'ichi
Abstract
In experiments using the Chinese Treebank (CTB), we show that the accuracies of the three tasks can be improved significantly over the baseline models, particularly by 0.6% for POS tagging and 2.4% for dependency parsing.
Introduction
We perform experiments using the Chinese Treebank (CTB) corpora, demonstrating that the accuracies of the three tasks can be improved significantly over the pipeline combination of the state-of-the-art joint segmentation and POS tagging model, and the dependency parser.
Model
We use the Chinese Penn Treebank ver.
treebank is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Yamangil, Elif and Shieber, Stuart
Abstract
We use the Penn treebank for our experiments and find that our proposed Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data.
Evaluation Results
We use the standard Penn treebank methodology of training on sections 2-21 and testing on section 23.
Evaluation Results
carried out a small treebank experiment where we train on Section 2, and a large one where we train on the full training set.
treebank is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Chen, Wenliang and Zhang, Min and Li, Haizhou
Experiments
For English, we used the Penn Treebank (Marcus et al., 1993) in our experiments.
Experiments
For Chinese, we used the Chinese Treebank (CTB) version 4.04 in the experiments.
Experiments
We ensured that the text used for extracting subtrees did not include the sentences of the Penn Treebank.
treebank is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: