Simple Semi-supervised Dependency Parsing
Terry Koo, Xavier Carreras, and Michael Collins

Article Structure

Abstract

We present a simple and effective semi-supervised method for training dependency parsers.

Introduction

In natural language parsing, lexical information is seen as crucial to resolving ambiguous relationships, yet lexicalized statistics are sparse and difficult to estimate directly.

Background 2.1 Dependency parsing

Recent work (Buchholz and Marsi, 2006; Nivre et al., 2007) has focused on dependency parsing.

Feature design

Key to the success of our approach is the use of features which allow word-cluster-based information to assist the parser.

Experiments

In order to evaluate the effectiveness of the cluster-based feature sets, we conducted dependency parsing experiments in English and Czech.

Related Work

As mentioned earlier, our approach was inspired by the success of Miller et al. (2004).

Conclusions

In this paper, we have presented a simple but effective semi-supervised learning approach and demonstrated that it achieves substantial improvement over a competitive baseline in two broad-coverage dependency parsing tasks.

Topics

dependency parsing

Appears in 18 sentences as: dependency parsers (3), Dependency parsing (1), dependency parsing (15)
In Simple Semi-supervised Dependency Parsing
  1. We present a simple and effective semi-supervised method for training dependency parsers.
    Page 1, “Abstract”
  2. We demonstrate the effectiveness of the approach in a series of dependency parsing experiments on the Penn Treebank and Prague Dependency Treebank, and we show that the cluster-based features yield substantial gains in performance across a wide range of conditions.
    Page 1, “Abstract”
  3. To demonstrate the effectiveness of our approach, we conduct experiments in dependency parsing, which has been the focus of much recent research—e.g., see work in the CoNLL shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007).
    Page 1, “Introduction”
  4. However, our target task of dependency parsing involves more complex structured relationships than named-entity tagging; moreover, it is not at all clear that word clusters should have any relevance to syntactic structure.
    Page 1, “Introduction”
  5. Nevertheless, our experiments demonstrate that word clusters can be quite effective in dependency parsing applications.
    Page 1, “Introduction”
  6. Section 2 gives background on dependency parsing and clustering, Section 3 describes the cluster-based features, Section 4 presents our experimental results, Section 5 discusses related work, and Section 6 concludes with ideas for future research.
    Page 2, “Introduction”
  7. Recent work (Buchholz and Marsi, 2006; Nivre et al., 2007) has focused on dependency parsing.
    Page 2, “Background 2.1 Dependency parsing”
  8. Dependency parsing depends critically on predicting head-modifier relationships, which can be difficult due to the statistical sparsity of these word-to-word interactions.
    Page 2, “Background 2.1 Dependency parsing”
  9. In this paper, we take a part-factored structured classification approach to dependency parsing.
    Page 2, “Background 2.1 Dependency parsing”
  10. In the simplest case, these parts are the dependency arcs themselves, yielding a first-order or “edge-factored” dependency parsing model.
    Page 2, “Background 2.1 Dependency parsing”
  11. These kinds of higher-order factorizations allow dependency parsers to obtain a limited form of context-sensitivity.
    Page 2, “Background 2.1 Dependency parsing”
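
Excerpts 9-11 above describe the part-factored approach: in the first-order ("edge-factored") case, the score of a tree is just the sum of the scores of its head-modifier arcs. The following minimal Python sketch illustrates that decomposition; the function names, weight dictionary, and feature extractor are illustrative placeholders, not the paper's implementation.

    # Edge-factored scoring: a tree's score decomposes into a sum over
    # its head-modifier arcs (first-order model).
    def score_tree(words, tags, heads, weights, arc_features):
        """heads[m] is the head index of modifier m; index 0 is the root.
        arc_features returns a list of feature strings for one arc."""
        total = 0.0
        for m in range(1, len(words)):        # skip the artificial root
            h = heads[m]
            for f in arc_features(words, tags, h, m):
                total += weights.get(f, 0.0)
        return total

Higher-order factorizations (excerpt 11) extend the same idea by additionally scoring sibling and grandparent parts, which is how the parsers obtain their limited context-sensitivity.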

feature sets

Appears in 14 sentences as: feature set (7), feature sets (12)
In Simple Semi-supervised Dependency Parsing
  1. The feature sets we used are similar to other feature sets in the literature (McDonald et al., 2005a; Carreras, 2007), so we will not attempt to give an exhaustive description of the features in this section.
    Page 3, “Feature design”
  2. In our experiments, we employed two different feature sets: a baseline feature set which draws upon “normal” information sources such as word forms and parts of speech, and a cluster-based feature set that also uses information derived from the Brown cluster hierarchy.
    Page 3, “Feature design”
  3. Our first-order baseline feature set is similar to the feature set of McDonald et al. (2005a), and consists of indicator functions for combinations of words and parts of speech for the head and modifier of each dependency, as well as certain contextual tokens.
    Page 3, “Feature design”
  4. (2005a) feature set with backed-off versions of the “Surrounding Word POS Features” that include only one neighboring POS tag.
    Page 3, “Feature design”
  5. The first- and second-order cluster-based feature sets are supersets of the baseline feature sets: they include all of the baseline feature templates, and add an additional layer of features that incorporate word clusters.
    Page 3, “Feature design”
  6. For example, the baseline feature set includes indicators for word-to-word and tag-to-tag interactions between the head and modifier of a dependency.
    Page 4, “Feature design”
  7. In the cluster-based feature set, we correspondingly introduce new indicators for interactions between pairs of short bit-string prefixes and pairs of full bit strings.
    Page 4, “Feature design”
  8. When N is between roughly 100 and 1,000, there is little effect on the performance of the cluster-based feature sets. In addition, the vocabulary restriction reduces the size of the feature sets to manageable proportions.
    Page 4, “Feature design”
  9. In order to evaluate the effectiveness of the cluster-based feature sets, we conducted dependency parsing experiments in English and Czech.
    Page 4, “Experiments”
  10. In our English experiments, we tested eight different parsing configurations, representing all possible choices between baseline or cluster-based feature sets, first-order (Eisner, 2000) or second-order (Carreras, 2007) factorizations, and labeled or unlabeled parsing.
    Page 5, “Experiments”
  11. Second, note that the parsers using cluster-based feature sets consistently outperform the models using the baseline features, regardless of model order or label usage.
    Page 5, “Experiments”
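
Excerpts 5-8 above explain that the cluster-based feature sets extend the baseline templates with indicators over short bit-string prefixes (coarse clusters) and full bit strings (fine clusters), restricted to a vocabulary of the N most frequent words. Below is a hedged Python sketch of such an arc feature extractor; the template names and the specific prefix lengths are invented for illustration, and the paper uses several granularities.

    # Cluster-augmented arc features: baseline word/tag indicators plus
    # indicators over bit-string prefixes and full bit strings.
    def arc_features(words, tags, clusters, h, m):
        feats = [
            "ww=%s_%s" % (words[h], words[m]),   # baseline: word-to-word
            "tt=%s_%s" % (tags[h], tags[m]),     # baseline: tag-to-tag
        ]
        ch = clusters.get(words[h], "")          # full Brown bit string
        cm = clusters.get(words[m], "")
        if ch and cm:                            # skip out-of-vocabulary words
            for k in (4, 6):                     # short prefixes: coarse clusters
                feats.append("c%d=%s_%s" % (k, ch[:k], cm[:k]))
            feats.append("cc=%s_%s" % (ch, cm))  # full strings: fine clusters
        return feats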

Treebank

Appears in 8 sentences as: Treebank (11), treebank (1)
In Simple Semi-supervised Dependency Parsing
  1. We demonstrate the effectiveness of the approach in a series of dependency parsing experiments on the Penn Treebank and Prague Dependency Treebank, and we show that the cluster-based features yield substantial gains in performance across a wide range of conditions.
    Page 1, “Abstract”
  2. We show that our semi-supervised approach yields improvements for fixed datasets by performing parsing experiments on the Penn Treebank (Marcus et al., 1993) and Prague Dependency Treebank (Hajic, 1998; Hajic et al., 2001) (see Sections 4.1 and 4.3).
    Page 1, “Introduction”
  3. The English experiments were performed on the Penn Treebank (Marcus et al., 1993), using a standard set of head-selection rules (Yamada and Matsumoto, 2003) to convert the phrase structure syntax of the Treebank to a dependency tree representation. We split the Treebank into a training set (Sections 2-21), a development set (Section 22), and several test sets (Sections 0, 1, 23, and 24).
    Page 4, “Experiments”
  4. The Czech experiments were performed on the Prague Dependency Treebank 1.0 (Hajic, 1998; Hajic et al., 2001), which is directly annotated with dependency structures.
    Page 4, “Experiments”
  5. We ensured that the sentences of the Penn Treebank were excluded from the text used for the clustering.
    Page 4, “Experiments”
  6. Tagger always trained on full Treebank
    Page 6, “Experiments”
  7. Table 3 displays the accuracy of first- and second-order models when trained on smaller portions of the Treebank, in both scenarios described above.
    Page 6, “Experiments”
  8. A natural avenue for further research would be the development of clustering algorithms that reflect the syntactic behavior of words; e.g., an algorithm that attempts to maximize the likelihood of a treebank, according to a probabilistic dependency model.
    Page 8, “Conclusions”
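
Excerpt 3 above fixes the standard Penn Treebank split. Restated as a small Python mapping, purely to summarize the text:

    # Penn Treebank sections per split, as described in the experiments.
    PTB_SPLIT = {
        "train": list(range(2, 22)),  # Sections 2-21
        "dev":   [22],                # Section 22
        "test":  [0, 1, 23, 24],      # Sections 0, 1, 23, and 24
    }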

semi-supervised

Appears in 6 sentences as: Semi-supervised (1), semi-supervised (5)
In Simple Semi-supervised Dependency Parsing
  1. We present a simple and effective semi-supervised method for training dependency parsers.
    Page 1, “Abstract”
  2. In this paper, we introduce lexical intermediaries via a simple two-stage semi-supervised approach.
    Page 1, “Introduction”
  3. In general, semi-supervised learning can be motivated by two concerns: first, given a fixed amount of supervised data, we might wish to leverage additional unlabeled data to facilitate the utilization of the supervised corpus, increasing the performance of the model in absolute terms.
    Page 1, “Introduction”
  4. We show that our semi-supervised approach yields improvements for fixed datasets by performing parsing experiments on the Penn Treebank (Marcus et al., 1993) and Prague Dependency Treebank (Hajic, 1998; Hajic et al., 2001) (see Sections 4.1 and 4.3).
    Page 1, “Introduction”
  5. Semi-supervised phrase structure parsing has been previously explored by McClosky et al. (2006), who applied a reranked parser to a large unsupervised corpus in order to obtain additional training data for the parser.
    Page 8, “Related Work”
  6. In this paper, we have presented a simple but effective semi-supervised learning approach and demonstrated that it achieves substantial improvement over a competitive baseline in two broad-coverage dependency parsing tasks.
    Page 8, “Conclusions”

Penn Treebank

Appears in 4 sentences as: Penn Treebank (4)
In Simple Semi-supervised Dependency Parsing
  1. We demonstrate the effectiveness of the approach in a series of dependency parsing experiments on the Penn Treebank and Prague Dependency Treebank, and we show that the cluster-based features yield substantial gains in performance across a wide range of conditions.
    Page 1, “Abstract”
  2. We show that our semi-supervised approach yields improvements for fixed datasets by performing parsing experiments on the Penn Treebank (Marcus et al., 1993) and Prague Dependency Treebank (Hajic, 1998; Hajic et al., 2001) (see Sections 4.1 and 4.3).
    Page 1, “Introduction”
  3. The English experiments were performed on the Penn Treebank (Marcus et al., 1993), using a standard set of head-selection rules (Yamada and Matsumoto, 2003) to convert the phrase structure syntax of the Treebank to a dependency tree representation. We split the Treebank into a training set (Sections 2-21), a development set (Section 22), and several test sets (Sections 0, 1, 23, and 24).
    Page 4, “Experiments”
  4. We ensured that the sentences of the Penn Treebank were excluded from the text used for the clustering.
    Page 4, “Experiments”

perceptron

Appears in 4 sentences as: perceptron (4)
In Simple Semi-supervised Dependency Parsing
  1. We trained the parsers using the averaged perceptron (Freund and Schapire, 1999; Collins, 2002), which represents a balance between strong performance and fast training times.
    Page 4, “Experiments”
  2. of iterations of perceptron training, we performed up to 30 iterations and chose the iteration which optimized accuracy on the development set.
    Page 5, “Experiments”
  3. Due to the sparsity of the perceptron updates, however, only a small fraction of the possible features were active in our trained models.
    Page 5, “Experiments”
  4. First, the MD1 and MD2 parsers were trained via the MIRA algorithm (Crammer and Singer, 2003; Crammer et al., 2004), while we use the averaged perceptron.
    Page 5, “Experiments”
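
Excerpts 1-2 above specify the training regime: averaged-perceptron updates, up to 30 passes over the data, keeping the pass whose averaged weights score best on the development set. Below is a minimal Python sketch of that loop, assuming caller-supplied decode, feature-extraction, and accuracy functions; all names are placeholders rather than the paper's code.

    # Averaged perceptron with dev-set iteration selection.
    def train_parser(train_data, dev_data, decode, feats, accuracy, max_iters=30):
        w, total, steps = {}, {}, 0              # current and summed weights
        best_acc, best_w = -1.0, {}
        for _ in range(max_iters):
            for x, gold in train_data:
                steps += 1
                pred = decode(x, w)
                if pred != gold:                 # standard perceptron update
                    for f in feats(x, gold):
                        w[f] = w.get(f, 0.0) + 1.0
                    for f in feats(x, pred):
                        w[f] = w.get(f, 0.0) - 1.0
                for f, v in w.items():           # accumulate for averaging
                    total[f] = total.get(f, 0.0) + v
            avg = {f: v / steps for f, v in total.items()}
            acc = accuracy(dev_data, avg)        # evaluate this pass on dev data
            if acc > best_acc:
                best_acc, best_w = acc, avg
        return best_w

As excerpt 3 notes, the sparsity of these updates keeps the set of active features small in practice.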

bigram

Appears in 3 sentences as: bigram (3)
In Simple Semi-supervised Dependency Parsing
  1. The algorithm then repeatedly merges the pair of clusters which causes the smallest decrease in the likelihood of the text corpus, according to a class-based bigram language model defined on the word clusters.
    Page 2, “Background 2.1 Dependency parsing”
  2. (2005a), and consists of indicator functions for combinations of words and parts of speech for the head and modifier of each dependency, as well as certain contextual tokens. Our second-order baseline features are the same as those of Carreras (2007) and include indicators for triples of part of speech tags for sibling interactions and grandparent interactions, as well as additional bigram features based on pairs of words involved in these higher-order interactions.
    Page 3, “Feature design”
  3. To begin, recall that the Brown clustering algorithm is based on a bigram language model.
    Page 8, “Conclusions”
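
Excerpts 1 and 3 above describe the objective behind Brown clustering: a class-based bigram language model in which P(w_i | w_{i-1}) factors as p(w_i | C(w_i)) * p(C(w_i) | C(w_{i-1})), and the greedy algorithm repeatedly merges the cluster pair whose merge costs the least corpus likelihood. The Python sketch below computes that likelihood for a fixed clustering, using maximum-likelihood estimates; it illustrates the objective only and is not the (far more efficient) clustering algorithm itself.

    import math
    from collections import Counter

    # Log-likelihood of a token sequence under a class-based bigram model:
    # P(w | prev) = p(w | C(w)) * p(C(w) | C(prev)).
    def class_bigram_loglik(tokens, cluster_of):
        word_n  = Counter(tokens)
        class_n = Counter(cluster_of[w] for w in tokens)
        class_bi = Counter((cluster_of[a], cluster_of[b])
                           for a, b in zip(tokens, tokens[1:]))
        ll = 0.0
        for prev, w in zip(tokens, tokens[1:]):
            cp, c = cluster_of[prev], cluster_of[w]
            ll += math.log(word_n[w] / class_n[c])           # p(w | C(w))
            ll += math.log(class_bi[(cp, c)] / class_n[cp])  # p(C(w) | C(prev))
        return ll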

clusterings

Appears in 3 sentences as: clusterings (3)
In Simple Semi-supervised Dependency Parsing
  1. By using prefixes of various lengths, we can produce clusterings of different granularities (Miller et al., 2004).
    Page 3, “Background 2.1 Dependency parsing”
  2. (2004), we use prefixes of the Brown cluster hierarchy to produce clusterings of varying granularity.
    Page 3, “Feature design”
  3. One possible explanation is that the clusterings generated by the Brown algorithm can be noisy or only weakly relevant to syntax; thus, the clusters are best exploited when “anchored” to words or parts of speech.
    Page 4, “Feature design”
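
Excerpts 1-2 above note that prefixes of the Brown bit strings act as coarser clusterings. A tiny worked example in Python; the bit strings are invented, since real ones come from running the clustering on unlabeled text:

    # Prefixes of a Brown cluster path give coarser cluster identities.
    clusters = {"apple": "0010", "pear": "0011", "run": "1100"}

    def cluster_at(word, k):
        return clusters[word][:k]   # first k bits of the word's path

    assert cluster_at("apple", 2) == cluster_at("pear", 2) == "00"  # coarse: merged
    assert cluster_at("apple", 4) != cluster_at("pear", 4)          # fine: split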

development set

Appears in 3 sentences as: development set (3)
In Simple Semi-supervised Dependency Parsing
  1. The English experiments were performed on the Penn Treebank (Marcus et al., 1993), using a standard set of head-selection rules (Yamada and Matsumoto, 2003) to convert the phrase structure syntax of the Treebank to a dependency tree representation. We split the Treebank into a training set (Sections 2-21), a development set (Section 22), and several test sets (Sections 0, 1, 23, and 24).
    Page 4, “Experiments”
  2. of iterations of perceptron training, we performed up to 30 iterations and chose the iteration which optimized accuracy on the development set.
    Page 5, “Experiments”
  3. Table 6: Parent-prediction accuracies of unlabeled Czech parsers on the PDT 1.0 development set.
    Page 7, “Experiments”

model trained

Appears in 3 sentences as: model trained (6), models trained (1)
In Simple Semi-supervised Dependency Parsing
  1. For example, the performance of the dep1c and dep2c models trained on 1k sentences is roughly the same as the performance of the dep1 and dep2 models, respectively, trained on 2k sentences.
    Page 6, “Experiments”
  2. For example, in scenario 1 the dep2c model trained on 1k sentences is close in performance to the dep1 model trained on 4k sentences, and the dep2c model trained on 4k sentences is close to the dep1 model trained on the entire training set (roughly 40k sentences).
    Page 6, “Experiments”
  3. For example, the dep1c model trained on 4k sentences is roughly as good as the dep1 model trained on 8k sentences.
    Page 7, “Experiments”

reranker

Appears in 3 sentences as: reranked (1), reranker (2)
In Simple Semi-supervised Dependency Parsing
  1. (2006), who applied a reranked parser to a large unsupervised corpus in order to obtain additional training data for the parser; this self-training approach was shown to be quite effective in practice.
    Page 8, “Related Work”
  2. However, their approach depends on the usage of a high-quality parse reranker, whereas the method described here simply augments the features of an existing parser.
    Page 8, “Related Work”
  3. Note that our two approaches are compatible in that we could also design a reranker and apply self-training techniques on top of the cluster-based features.
    Page 8, “Related Work”

unlabeled data

Appears in 3 sentences as: unlabeled data (3)
In Simple Semi-supervised Dependency Parsing
  1. In general, semi-supervised learning can be motivated by two concerns: first, given a fixed amount of supervised data, we might wish to leverage additional unlabeled data to facilitate the utilization of the supervised corpus, increasing the performance of the model in absolute terms.
    Page 1, “Introduction”
  2. Second, given a fixed target performance level, we might wish to use unlabeled data to reduce the amount of annotated data necessary to reach this target.
    Page 1, “Introduction”
  3. Crucially, however, these methods do not exploit unlabeled data when learning their representations.
    Page 8, “Related Work”
