Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
Zhou, Guangyou and Zhao, Jun and Liu, Kang and Cai, Li

Article Structure

Abstract

In this paper, we present a novel approach which incorporates the web-derived selectional preferences to improve statistical dependency parsing.

Introduction

Dependency parsing is the task of building dependency links between words in a sentence, which has recently gained a wide interest in the natural language processing community.

Dependency Parsing

In dependency parsing, we attempt to build head-modifier (or head-dependent) relations between words in a sentence.

Web-Derived Selectional Preference Features

In this paper, we employ two different feature sets: a baseline feature set, which draws upon “normal” information sources such as word forms and part-of-speech (POS) tags without including the web-derived selectional preference features, and a feature set that conjoins the baseline features with the web-derived selectional preference features.
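
As a concrete illustration of how such web-derived features can be computed (the paper derives them as pointwise mutual information over N-gram counts; the counts below are made-up values for illustration), a minimal PMI sketch:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log( p("x y") / (p(x) * p(y)) ), with probabilities
    estimated directly from raw N-gram counts over a corpus of `total` tokens."""
    return math.log((count_xy / total) / ((count_x / total) * (count_y / total)))

# Made-up counts for a head-modifier pair such as ("hit", "with"):
score = pmi(count_xy=120_000, count_x=9_000_000, count_y=300_000_000, total=10**12)
```

A high PMI score indicates the head and modifier co-occur far more often than chance, which is the signal the web-derived features feed into the parser.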

Experiments

In order to evaluate the effectiveness of our proposed approach, we conducted dependency parsing experiments in English.

Related Work

Our approach is to exploit web-derived selectional preferences to improve dependency parsing.

Conclusion

In this paper, we present a novel method which incorporates the web-derived selectional preferences to improve statistical dependency parsing.

Topics

dependency parsing

Appears in 37 sentences as: dependency parser (2) dependency parsers (4) Dependency parsing (1) dependency parsing (31)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. In this paper, we present a novel approach which incorporates the web-derived selectional preferences to improve statistical dependency parsing.
    Page 1, “Abstract”
  2. Experiments show that web-scale data improves statistical dependency parsing, particularly for long dependency relationships.
    Page 1, “Abstract”
  3. Dependency parsing is the task of building dependency links between words in a sentence, which has recently gained a wide interest in the natural language processing community.
    Page 1, “Introduction”
  4. With the availability of large-scale annotated corpora such as Penn Treebank (Marcus et al., 1993), it is easy to train a high-performance dependency parser using supervised learning methods.
    Page 1, “Introduction”
  5. However, current state-of-the-art statistical dependency parsers (McDonald et al., 2005; McDonald and Pereira, 2006; Hall et al., 2006) tend to have
    Page 1, “Introduction”
  6. Figure 1 shows the F1 score relative to the dependency length on the development set by using the graph-based dependency parsers (McDonald et al., 2005; McDonald and Pereira, 2006).
    Page 1, “Introduction”
  7. These longer dependencies are therefore a major opportunity to improve the overall performance of dependency parsing.
    Page 1, “Introduction”
  8. (2008) proposed semi-supervised dependency parsing by introducing lexical intermediaries at a coarser level than words themselves via a clustering method.
    Page 1, “Introduction”
  9. Our purpose in this paper is to exploit web-derived selectional preferences to improve supervised statistical dependency parsing.
    Page 2, “Introduction”
  10. By leveraging some assistant data, the dependency parsing model can directly utilize the additional information to capture the word-to-word level relationships.
    Page 2, “Introduction”
  11. Question I: Is there a benefit in incorporating web-derived selectional preference features for statistical dependency parsing, especially for longer dependencies?
    Page 2, “Introduction”

See all papers in Proc. ACL 2011 that mention dependency parsing.

See all papers in Proc. ACL that mention dependency parsing.

Back to top.

N-gram

Appears in 27 sentences as: N-gram (30) n-gram (1)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. Another is a web-scale N-gram corpus containing N-grams of length 1-5 (Brants and Franz, 2006); we call it Google V1 in this paper.
    Page 2, “Introduction”
  2. One is the web, where N-gram counts are approximated by Google hits.
    Page 3, “Web-Derived Selectional Preference Features”
  3. This N-gram corpus records how often each unique sequence of words occurs.
    Page 3, “Web-Derived Selectional Preference Features”
  4. times or more (1 in 25 billion) are kept, and appear in the n-gram tables.
    Page 3, “Web-Derived Selectional Preference Features”
  5. 3.2 Web-derived N-gram features 3.2.1 PMI
    Page 3, “Web-Derived Selectional Preference Features”
  6. When using the Google V1 corpus, these probabilities can be calculated directly from the N-gram counts; when using Google hits, we send the queries to the search engine Google, and all the search queries are performed as exact matches by using quotation marks.
    Page 3, “Web-Derived Selectional Preference Features”
  7. In deciding the dependency between the main verb hit and its argument headed by the preposition with, an example of the N-gram PMI features and their conjunction with the baseline features is shown in Table 1.
    Page 4, “Web-Derived Selectional Preference Features”
  8. N-gram feature templates: (hw, mw, PMI(hw,mw)); (hw, ht, mw, PMI(hw,mw)); (hw, mw, mt, PMI(hw,mw)); (hw, ht, mw, mt, PMI(hw,mw))
    Page 4, “Web-Derived Selectional Preference Features”
  9. Table 2: Examples of N-gram feature templates.
    Page 4, “Web-Derived Selectional Preference Features”
  10. 3.3 N-gram feature templates
    Page 4, “Web-Derived Selectional Preference Features”
  11. We generate N-gram features by mimicking the template structure of the original baseline features.
    Page 4, “Web-Derived Selectional Preference Features”
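
The templates in item 8 conjoin the head word and tag (hw, ht), the modifier word and tag (mw, mt), and the PMI score. A minimal sketch of how such conjoined feature strings might be generated (the string naming scheme is illustrative, not the paper's):

```python
def ngram_feature_templates(hw, ht, mw, mt, pmi_bucket):
    """Generate the four conjoined N-gram feature strings: head/modifier
    words, optionally their POS tags, joined with a discretized PMI value."""
    return [
        f"hw={hw}|mw={mw}|PMI={pmi_bucket}",
        f"hw={hw}|ht={ht}|mw={mw}|PMI={pmi_bucket}",
        f"hw={hw}|mw={mw}|mt={mt}|PMI={pmi_bucket}",
        f"hw={hw}|ht={ht}|mw={mw}|mt={mt}|PMI={pmi_bucket}",
    ]

feats = ngram_feature_templates("hit", "VBD", "with", "IN", "BIN_3")
```

Each string acts as an indicator feature in the parser, mimicking the template structure of the baseline features as the paper describes.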


N-grams

Appears in 8 sentences as: N-grams (7) n-grams (1)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. Another is a web-scale N-gram corpus containing N-grams of length 1-5 (Brants and Franz, 2006); we call it Google V1 in this paper.
    Page 2, “Introduction”
  2. N-grams appearing 40
    Page 3, “Web-Derived Selectional Preference Features”
  3. In this paper, the selectional preferences have the same meaning as N-grams, which model the word-to-word relationships, rather than only considering predicate-argument relationships.
    Page 3, “Web-Derived Selectional Preference Features”
  4. All n-grams with lower counts are discarded.
    Page 3, “Web-Derived Selectional Preference Features”
  5. [Figure axis residue: x-axis “Number of Unique N-grams”, ticks 1e4 to 1e9]
    Page 7, “Experiments”
  6. The former uses the web-scale data explicitly to create more data for training the model; while the latter explores the web-scale N-grams data (Lin et al., 2010) for compound bracketing disambiguation.
    Page 8, “Related Work”
  7. However, we explore the web-scale data for dependency parsing; the performance improves log-linearly with the number of parameters (unique N-grams).
    Page 8, “Related Work”
  8. …number of parameters (unique N-grams).
    Page 9, “Conclusion”


UAS

Appears in 6 sentences as: UAS (6)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. We measured the performance of the parsers using the following metrics: unlabeled attachment score (UAS), labeled attachment score (LAS) and complete match (CM), which were defined by Hall et al.
    Page 5, “Experiments”
  2. Type, Systems, UAS, CM: Yamada and Matsumoto (2003): UAS 90.3, CM 38.7; McDonald et al.
    Page 6, “Experiments”
  3. [Figure y-axis label: UAS Score (%)]
    Page 7, “Experiments”
  4. UAS accuracy improves with the number of unique N-grams but is still lower than with the Google hits.
    Page 7, “Experiments”
  5. Figure 5 plots the UAS accuracy as a function of training instances.
    Page 7, “Experiments”
  6. [Figure y-axis label: UAS Score (%)]
    Page 8, “Experiments”
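
The UAS, LAS, and CM metrics in item 1 can be computed as follows. This is a generic sketch; the (head, label) per-word sentence encoding is an assumption for illustration, not the paper's data format:

```python
def attachment_scores(gold, pred):
    """gold, pred: lists of sentences, each a list of (head_index, label)
    pairs, one per word. Returns (UAS, LAS, CM) as percentages:
    UAS counts correct heads, LAS correct heads with correct labels,
    CM the fraction of sentences parsed entirely correctly."""
    total = ua = la = complete = 0
    for g_sent, p_sent in zip(gold, pred):
        sent_ok = True
        for (gh, gl), (ph, pl) in zip(g_sent, p_sent):
            total += 1
            if gh == ph:
                ua += 1
                if gl == pl:
                    la += 1
                else:
                    sent_ok = False
            else:
                sent_ok = False
        complete += sent_ok
    return 100 * ua / total, 100 * la / total, 100 * complete / len(gold)
```

For example, one wrong head out of three words across two sentences yields UAS and LAS of about 66.67 and CM of 50.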


dependency relationships

Appears in 6 sentences as: dependency relationships (6)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. Experiments show that web-scale data improves statistical dependency parsing, particularly for long dependency relationships.
    Page 1, “Abstract”
  2. The results show that web-derived selectional preference can improve the statistical dependency parsing, particularly for long dependency relationships.
    Page 2, “Introduction”
  3. The results here show that the proposed approach improves the dependency parsing performance, particularly for long dependency relationships.
    Page 7, “Experiments”
  4. Our research, however, applies the web-scale data (Google hits and Google V1) to model the word-to-word dependency relationships rather than compound bracketing disambiguation.
    Page 8, “Related Work”
  5. Our approach, however, extends these techniques to dependency parsing, particularly for long dependency relationships, which involves more challenging tasks than the previous work.
    Page 8, “Related Work”
  6. The results show that web-scale data improves the dependency parsing, particularly for long dependency relationships.
    Page 8, “Conclusion”
    Page 8, “Conclusion”


feature set

Appears in 6 sentences as: feature set (4) feature sets (4)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. In this paper, we employ two different feature sets: a baseline feature set, which draws upon “normal” information sources such as word forms and part-of-speech (POS) tags without including the web-derived selectional preference features, and a feature set that conjoins the baseline features with the web-derived selectional preference features.
    Page 3, “Web-Derived Selectional Preference Features”
  2. These feature sets are similar to other feature sets in the literature (McDonald et al., 2005; Carreras, 2007), so we will not attempt to give an exhaustive description.
    Page 3, “Web-Derived Selectional Preference Features”
  3. For example, the baseline feature set includes indicators for word-to-word and tag-to-tag interactions between the head and modifier of a dependency.
    Page 4, “Web-Derived Selectional Preference Features”
  4. In the N-gram feature set, we correspondingly introduce N-gram PMI for word-to-word interactions.
    Page 4, “Web-Derived Selectional Preference Features”
  5. The N-gram feature set for MSTParser is shown in Table 2.
    Page 5, “Web-Derived Selectional Preference Features”
  6. Second, note that the parsers incorporating the N-gram feature sets consistently outperform the models using the baseline features in all test data sets, regardless of model order or label usage.
    Page 5, “Experiments”


co-occurrence

Appears in 5 sentences as: Co-occurrence (1) co-occurrence (4)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. Co-occurrence probabilities can be calculated directly from the N-gram counts.
    Page 3, “Web-Derived Selectional Preference Features”
  2. where p(“x y”) is the co-occurrence probability.
    Page 3, “Web-Derived Selectional Preference Features”
  3. Turney (2007) measured the semantic orientation for sentiment classification using co-occurrence statistics obtained from the search engines.
    Page 8, “Related Work”
  4. In addition, there is some work exploring the word-to-word co-occurrence derived from the web-scale data or a fixed-size corpus (Calvo and Gelbukh, 2004; Calvo and Gelbukh, 2006; Yates et al., 2006; Drabek and Zhou, 2000; van Noord, 2007) for PP attachment ambiguities or shallow parsing.
    Page 8, “Related Work”
  5. Abekawa and Okumura (2006) improved Japanese dependency parsing by using the co-occurrence information derived from the results of automatic dependency parsing of large-scale corpora.
    Page 8, “Related Work”


Treebank

Appears in 5 sentences as: Treebank (6)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. With the availability of large-scale annotated corpora such as Penn Treebank (Marcus et al., 1993), it is easy to train a high-performance dependency parser using supervised learning methods.
    Page 1, “Introduction”
  2. We conduct the experiments on the English Penn Treebank (PTB) (Marcus et al., 1993).
    Page 2, “Introduction”
  3. The experiments were performed on the Penn Treebank (PTB) (Marcus et al., 1993), using a standard set of head-selection rules (Yamada
    Page 5, “Experiments”
  4. and Matsumoto, 2003) to convert the phrase structure syntax of the Treebank into a dependency tree representation; dependency labels were obtained via the “Malt” hard-coded setting. We split the Treebank into a training set (Sections 2-21), a development set (Section 22), and several test sets (Sections 0, 1, 23, and 24).
    Page 5, “Experiments”
  5. The results show that our second-order model incorporating the N-gram features (92.64) performs better than most previously reported discriminative systems trained on the Treebank.
    Page 6, “Experiments”


parsing model

Appears in 5 sentences as: parsing model (5)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. By leveraging some assistant data, the dependency parsing model can directly utilize the additional information to capture the word-to-word level relationships.
    Page 2, “Introduction”
  2. The parsing model can be defined as a conditional distribution p(y|x; w) over each projective parse tree y for a particular sentence x, parameterized by a vector w. The probability of a parse tree is
    Page 2, “Dependency Parsing”
  3. If both PMI features exist and PMI(with|hit, bat) > PMI(with|ball, bat), this indicates to our dependency parsing model that attaching the preposition with to the verb hit is a good choice.
    Page 4, “Web-Derived Selectional Preference Features”
  4. Web-derived selectional preference features based on PMI values are trickier to incorporate into the dependency parsing model because they are continuous rather than discrete.
    Page 5, “Web-Derived Selectional Preference Features”
  5. The log-linear dependency parsing model is sensitive to inappropriately scaled features.
    Page 5, “Web-Derived Selectional Preference Features”
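
The conditional distribution p(y|x; w) in item 2 is the standard log-linear form over candidate parse trees. A minimal sketch (the sparse feature dicts and explicit enumeration over candidates are illustrative simplifications; real parsers sum over the exponentially many trees with dynamic programming):

```python
import math

def tree_probability(candidate_feats, w, k):
    """p(y_k | x; w) = exp(w . f(x, y_k)) / sum_y exp(w . f(x, y)).
    candidate_feats: one sparse {feature_name: value} dict per candidate
    projective parse tree of the sentence; w: sparse weight dict."""
    def dot(fv):
        return sum(w.get(name, 0.0) * val for name, val in fv.items())
    exp_scores = [math.exp(dot(fv)) for fv in candidate_feats]
    return exp_scores[k] / sum(exp_scores)
```

Under this form, any feature (baseline or web-derived) simply contributes its weight to a tree's score before normalization.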


log-linear

Appears in 4 sentences as: Log-linear (1) log-linear (3)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. Given the training set {(x_i, y_i)}_{i=1}^{N}, parameter estimation for log-linear models generally revolves around optimization of a regularized conditional
    Page 3, “Dependency Parsing”
  2. In this paper we use the dual exponentiated gradient (EG) descent, which is a particularly effective optimization algorithm for log-linear models (Collins et al., 2008).
    Page 3, “Dependency Parsing”
  3. The log-linear dependency parsing model is sensitive to inappropriately scaled features.
    Page 5, “Web-Derived Selectional Preference Features”
  4. Some previous studies also found a log-linear relationship between unlabeled data (Suzuki and Isozaki, 2008; Suzuki et al., 2009; Bergsma et al., 2010; Pitler et al., 2010).
    Page 6, “Experiments”
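
One common way to handle the scaling issue in item 3 is to discretize the continuous PMI value into indicator buckets before conjoining it with lexical features, so the log-linear model only sees binary features. A sketch (the bucket edges below are made-up, not the paper's):

```python
def bin_pmi(value, edges=(-2.0, 0.0, 2.0, 4.0)):
    """Map a continuous PMI score onto one of a few indicator buckets,
    so a log-linear model sees well-scaled binary features instead of
    a raw real value. Returns the name of the bucket the value falls in."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return f"PMI_BIN_{i}"
    return f"PMI_BIN_{len(edges)}"
```

The bucket name can then be substituted for the PMI(hw,mw) slot in each feature template.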


feature templates

Appears in 4 sentences as: feature templates (4)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. N-gram feature templates: (hw, mw, PMI(hw,mw)); (hw, ht, mw, PMI(hw,mw)); (hw, mw, mt, PMI(hw,mw)); (hw, ht, mw, mt, PMI(hw,mw))
    Page 4, “Web-Derived Selectional Preference Features”
  2. Table 2: Examples of N-gram feature templates.
    Page 4, “Web-Derived Selectional Preference Features”
  3. 3.3 N-gram feature templates
    Page 4, “Web-Derived Selectional Preference Features”
  4. Besides, we also present the second-order feature templates, including the sibling and grandchild features.
    Page 5, “Web-Derived Selectional Preference Features”


part-of-speech

Appears in 4 sentences as: part-of-speech (4)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. In this paper, we employ two different feature sets: a baseline feature set, which draws upon “normal” information sources such as word forms and part-of-speech (POS) tags without including the web-derived selectional preference features, and a feature set that conjoins the baseline features with the web-derived selectional preference features.
    Page 3, “Web-Derived Selectional Preference Features”
  2. is any token whose part-of-speech is IN
    Page 4, “Web-Derived Selectional Preference Features”
  3. The part-of-speech tags for the development and test sets were automatically assigned by the MXPOST tagger, where the tagger was trained on the entire training corpus.
    Page 5, “Experiments”
  4. (2010) created robust supervised classifiers via web-scale N-gram data for adjective ordering, spelling correction, noun compound bracketing and verb part-of-speech disambiguation.
    Page 8, “Related Work”


dependency tree

Appears in 4 sentences as: dependency tree (2) dependency tree: (2)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. Figure 2: An example of a labeled dependency tree.
    Page 3, “Web-Derived Selectional Preference Features”
  2. In this paper we generalize the adjacency and dependency models by including the pointwise mutual information (Church and Hanks, 1990) between all pairs of words in the dependency tree:
    Page 3, “Web-Derived Selectional Preference Features”
  3. between the three words in the dependency tree:
    Page 4, “Web-Derived Selectional Preference Features”
  4. and Matsumoto, 2003) to convert the phrase structure syntax of the Treebank into a dependency tree representation; dependency labels were obtained via the “Malt” hard-coded setting. We split the Treebank into a training set (Sections 2-21), a development set (Section 22), and several test sets (Sections 0, 1, 23, and 24).
    Page 5, “Experiments”


bigram

Appears in 4 sentences as: bigram (2) bigrams (2)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. Web page hits for word pairs and trigrams are obtained using a simple heuristic query to the search engine Google. Inflected queries are performed by expanding a bigram or trigram into all its morphological forms.
    Page 5, “Experiments”
  2. Although Google hits are noisier, they have much larger coverage of bigrams and trigrams.
    Page 6, “Experiments”
  3. This means that if the number of pages indexed by Google doubles, then so do the bigram and trigram frequencies.
    Page 8, “Experiments”
  4. Keller and Lapata (2003) evaluated the utility of using web search engine statistics for unseen bigrams.
    Page 8, “Related Work”


Penn Treebank

Appears in 3 sentences as: Penn Treebank (3)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. With the availability of large-scale annotated corpora such as Penn Treebank (Marcus et al., 1993), it is easy to train a high-performance dependency parser using supervised learning methods.
    Page 1, “Introduction”
  2. We conduct the experiments on the English Penn Treebank (PTB) (Marcus et al., 1993).
    Page 2, “Introduction”
  3. The experiments were performed on the Penn Treebank (PTB) (Marcus et al., 1993), using a standard set of head-selection rules (Yamada
    Page 5, “Experiments”


semi-supervised

Appears in 3 sentences as: semi-supervised (4)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. (2008) proposed semi-supervised dependency parsing by introducing lexical intermediaries at a coarser level than words themselves via a clustering method.
    Page 1, “Introduction”
  2. Type D, C and S denote discriminative, combined and semi-supervised systems, respectively.
    Page 6, “Experiments”
  3. We also compare our method with the semi-supervised approaches, which achieve very high accuracies by leveraging large unlabeled data directly in the systems for joint learning and decoding; in our method, we only explore the N-gram features to further improve supervised dependency parsing performance.
    Page 6, “Experiments”


unlabeled data

Appears in 3 sentences as: unlabeled data (3)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. All of our selectional preference features described in this paper rely on probabilities derived from unlabeled data.
    Page 3, “Web-Derived Selectional Preference Features”
  2. We also compare our method with the semi-supervised approaches, which achieve very high accuracies by leveraging large unlabeled data directly in the systems for joint learning and decoding; in our method, we only explore the N-gram features to further improve supervised dependency parsing performance.
    Page 6, “Experiments”
  3. Some previous studies also found a log-linear relationship between unlabeled data (Suzuki and Isozaki, 2008; Suzuki et al., 2009; Bergsma et al., 2010; Pitler et al., 2010).
    Page 6, “Experiments”


word pair

Appears in 3 sentences as: word pair (2) word pairs (1)
In Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
  1. The idea is very simple: web-scale data have large coverage for word pair acquisition.
    Page 2, “Introduction”
  2. Web page hits for word pairs and trigrams are obtained using a simple heuristic query to the search engine Google. Inflected queries are performed by expanding a bigram or trigram into all its morphological forms.
    Page 5, “Experiments”
  3. Several previous studies have exploited the web-scale data for word pair acquisition.
    Page 8, “Related Work”
