Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria
Druck, Gregory and Mann, Gideon and McCallum, Andrew

Article Structure

Abstract

In this paper, we propose a novel method for semi-supervised learning of non-projective log-linear dependency parsers using directly expressed linguistic prior knowledge (e.g.

Introduction

Early approaches to parsing assumed a grammar provided by human experts (Quirk et al., 1985).

Related Work

This work is closely related to the prototype-driven grammar induction method of Haghighi and Klein (2006), which uses prototype phrases to guide the EM algorithm in learning a PCFG.

Generalized Expectation Criteria

Generalized expectation criteria (Mann and McCallum, 2008; Druck et al., 2008) are terms in a parameter estimation objective function that express a preference on the value of a model expectation.
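A GE term of this form can be sketched in a few lines. The toy example below (illustrative names and numbers, not the authors' code) computes the squared-difference penalty between a model expectation and a target expectation for a single constraint function G:

```python
import numpy as np

# Toy GE criterion: squared difference between a model expectation
# and a target expectation for one constraint function G.
# All names and numbers here are illustrative.

def ge_penalty(p_y_given_x, g_values, g_target):
    """p_y_given_x: model distribution over labelings of one input;
    g_values: value of constraint function G for each labeling;
    g_target: expectation specified by linguistic prior knowledge."""
    model_expectation = np.dot(p_y_given_x, g_values)
    return (model_expectation - g_target) ** 2

# The model puts 70% of its mass on labelings where G fires, but the
# prior knowledge says G should fire 90% of the time.
p = np.array([0.7, 0.3])
g = np.array([1.0, 0.0])
print(ge_penalty(p, g, 0.9))  # ≈ 0.04
```

Minimizing this penalty over unlabeled data pulls the model expectation toward the target, which is exactly the preference the criterion expresses.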

Linguistic Prior Knowledge

Training parsers using GE with the aid of linguists is an exciting direction for future work.

Experimental Comparison with Unsupervised Learning

In this section we compare GE training with methods for unsupervised parsing.

Topics

CRF

Appears in 30 sentences as: +CRF (2) CRF (27)
In Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria
  1. With GE we may add a term to the objective function that encourages a feature-rich CRF to match this expectation on unlabeled data, and in the process learn about related features.
    Page 1, “Introduction”
  2. In this paper we use a non-projective dependency tree CRF (Smith and Smith, 2007).
    Page 1, “Introduction”
  3. In the following sections we apply GE to non-projective CRF dependency parsing.
    Page 3, “Generalized Expectation Criteria”
  4. We first consider an arbitrarily structured conditional random field (Lafferty et al., 2001) pλ(y|x). We describe the CRF for non-projective dependency parsing in Section 3.2.
    Page 3, “Generalized Expectation Criteria”
  5. We now define a CRF pλ(y|x) for unlabeled, non-projective dependency parsing.
    Page 3, “Generalized Expectation Criteria”
  6. Note that we could instead define a CRF for projective dependency parse trees and use a variant of the inside-outside algorithm for inference.
    Page 3, “Generalized Expectation Criteria”
  7. We compare GE and supervised training of an edge-factored CRF with unsupervised learning of a DMV model (Klein and Manning, 2004) using EM and contrastive estimation (CE) (Smith and Eisner, 2005).
    Page 5, “Experimental Comparison with Unsupervised Learning”
  8. We note that there are considerable differences between the DMV and CRF models.
    Page 5, “Experimental Comparison with Unsupervised Learning”
  9. The DMV model is more expressive than the CRF because it can model the arity of a head as well as sibling relationships.
    Page 5, “Experimental Comparison with Unsupervised Learning”
  10. Because these features consider multiple edges, including them in the CRF model would make exact inference intractable (McDonald and Satta, 2007).
    Page 5, “Experimental Comparison with Unsupervised Learning”
  11. However, the CRF may consider the distance between head and child, whereas DMV does not model distance.
    Page 5, “Experimental Comparison with Unsupervised Learning”
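The non-projective tree CRF in excerpts 5–6 needs a partition function over all spanning trees, which can be computed with the Matrix-Tree theorem as in Smith and Smith (2007). A minimal sketch under illustrative exponentiated edge scores (not the paper's features):

```python
import numpy as np

def partition_function(s, r):
    """Sum of exp-scores over all non-projective dependency trees.
    s[h, m]: exp-score of edge h -> m (diagonal ignored);
    r[m]: exp-score of word m being the sentence root."""
    n = len(r)
    L = -s.astype(float).copy()
    for m in range(n):
        # Diagonal entry: total incoming score for modifier m.
        L[m, m] = s[:, m].sum() - s[m, m]
    L[0, :] = r  # first row replaced by root-attachment scores
    return np.linalg.det(L)

# Two-word example: the only trees are {1 is root, 1->2} and
# {2 is root, 2->1}, so Z = r[0]*s[0,1] + r[1]*s[1,0] = 5*2 + 7*3.
s = np.array([[0.0, 2.0], [3.0, 0.0]])
r = np.array([5.0, 7.0])
print(partition_function(s, r))  # ≈ 31.0
```

The determinant makes marginal edge probabilities (and hence GE gradients) computable in O(n³), which is why exact inference stays tractable for edge-factored scores but not for the multi-edge features mentioned in excerpt 10.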

See all papers in Proc. ACL 2009 that mention CRF.

dependency parsing

Appears in 13 sentences as: dependency parse (1) dependency parser (2) dependency parsers (4) dependency parsing (6)
In Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria
  1. In this paper, we propose a novel method for semi-supervised learning of non-projective log-linear dependency parsers using directly expressed linguistic prior knowledge (e.g.
    Page 1, “Abstract”
  2. This paper proposes a method for directly guiding the learning of dependency parsers with naturally encoded linguistic insights.
    Page 1, “Introduction”
  3. While a complete exploration of linguistic prior knowledge for dependency parsing is beyond the scope of this paper, we provide several promising demonstrations of the proposed method.
    Page 1, “Introduction”
  4. Smith and Eisner (2007) apply entropy regularization to dependency parsing.
    Page 2, “Related Work”
  5. There are also a number of methods for unsupervised learning of dependency parsers.
    Page 2, “Related Work”
  6. Klein and Manning (2004) use a carefully initialized and structured generative model (DMV) in conjunction with the EM algorithm to get the first positive results on unsupervised dependency parsing.
    Page 2, “Related Work”
  7. In the following sections we apply GE to non-projective CRF dependency parsing.
    Page 3, “Generalized Expectation Criteria”
  8. We first consider an arbitrarily structured conditional random field (Lafferty et al., 2001) pλ(y|x). We describe the CRF for non-projective dependency parsing in Section 3.2.
    Page 3, “Generalized Expectation Criteria”
  9. We now define a CRF pλ(y|x) for unlabeled, non-projective dependency parsing.
    Page 3, “Generalized Expectation Criteria”
  10. Note that we could instead define a CRF for projective dependency parse trees and use a variant of the inside-outside algorithm for inference.
    Page 3, “Generalized Expectation Criteria”
  11. This type of constraint was used in the development of a rule-based dependency parser (Debusmann et al., 2004).
    Page 5, “Linguistic Prior Knowledge”


CRFs

Appears in 6 sentences as: CRFs (6)
In Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria
  1. Generalized expectation (GE) (Mann and McCallum, 2008; Druck et al., 2008) is a recently proposed framework for incorporating prior knowledge into the learning of conditional random fields ( CRFs ) (Lafferty et al., 2001).
    Page 1, “Introduction”
  2. GE has been applied to logistic regression models (Mann and McCallum, 2007; Druck et al., 2008) and linear chain CRFs (Mann and McCallum, 2008).
    Page 3, “Generalized Expectation Criteria”
  3. 3.1 GE in General CRFs
    Page 3, “Generalized Expectation Criteria”
  4. 3.2 Non-Projective Dependency Tree CRFs
    Page 3, “Generalized Expectation Criteria”
  5. 3.3 GE for Non-Projective Dependency Tree CRFs
    Page 4, “Generalized Expectation Criteria”
  6. Figure 2: Comparison of GE training of the restricted and full CRFs with unsupervised learning of DMV.
    Page 7, “Experimental Comparison with Unsupervised Learning”


feature set

Appears in 4 sentences as: feature set (5)
In Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria
  1. With this feature set, the CRF model is less expressive than DMV.
    Page 5, “Experimental Comparison with Unsupervised Learning”
  2. The CRF cannot consider valency even with the full feature set, but this is balanced by the ability to use distance.
    Page 5, “Experimental Comparison with Unsupervised Learning”
  3. First we note that GE training using the full feature set substantially outperforms the restricted feature set, despite the fact that the same set of constraints is used for both experiments.
    Page 6, “Experimental Comparison with Unsupervised Learning”
  4. The baseline eventually matches the accuracy of the restricted CRF but this is understandable because GE’s ability to bootstrap is greatly reduced with the restricted feature set.
    Page 6, “Experimental Comparison with Unsupervised Learning”


labeled data

Appears in 4 sentences as: labeled data (4)
In Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria
  1. (2006) both use modified forms of self-training to bootstrap parsers from limited labeled data.
    Page 2, “Related Work”
  2. In general, the objective function could also include the likelihood of available labeled data, but throughout this paper we assume we have no parsed sentences.
    Page 3, “Generalized Expectation Criteria”
  3. If there are constraint functions G for all model feature functions F, and the target expectations G̃ are estimated from labeled data, then the globally optimal parameter setting under the GE objective function is equivalent to the maximum likelihood solution.
    Page 3, “Generalized Expectation Criteria”
  4. For some experiments that follow we use “oracle” constraints that are estimated from labeled data.
    Page 5, “Linguistic Prior Knowledge”


objective function

Appears in 4 sentences as: objective function (4)
In Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria
  1. Model parameters are estimated using a generalized expectation (GE) objective function that penalizes the mismatch between model predictions and linguistic expectation constraints.
    Page 1, “Abstract”
  2. With GE we may add a term to the objective function that encourages a feature-rich CRF to match this expectation on unlabeled data, and in the process learn about related features.
    Page 1, “Introduction”
  3. Generalized expectation criteria (Mann and McCallum, 2008; Druck et al., 2008) are terms in a parameter estimation objective function that express a preference on the value of a model expectation.
    Page 2, “Generalized Expectation Criteria”
  4. In general, the objective function could also include the likelihood of available labeled data, but throughout this paper we assume we have no parsed sentences.
    Page 3, “Generalized Expectation Criteria”
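The combined objective described in excerpt 4 can be made concrete with a one-line sketch; the names and numbers below are illustrative, not from the paper's implementation:

```python
# Sketch of a GE training objective: a labeled-data likelihood term
# (zero in this paper, which assumes no parsed sentences), minus the
# GE penalties on unlabeled data, minus a regularizer.

def ge_objective(labeled_log_likelihood, ge_penalties, l2_penalty):
    return labeled_log_likelihood - sum(ge_penalties) - l2_penalty

# Purely unsupervised setting, as in the paper: no likelihood term.
print(ge_objective(0.0, [0.04, 0.01], 0.005))  # ≈ -0.055
```

Maximizing this objective trades off matching the linguistic expectation constraints against keeping the parameters small.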


Treebank

Appears in 4 sentences as: Treebank (2) treebank (1) treebanks (2)
In Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria
  1. While such supervised approaches have yielded accurate parsers (Charniak, 2001), the syntactic annotation of corpora such as the Penn Treebank is extremely costly, and consequently there are few treebanks of comparable size.
    Page 1, “Introduction”
  2. The above methods can be applied to small seed corpora, but McDonald has criticized such methods as working from an unrealistic premise, as a significant amount of the effort required to build a treebank comes in the first 100 sentences (both because of the time it takes to create an appropriate rubric and to train annotators).
    Page 2, “Related Work”
  3. We use the WSJ 10 corpus (as processed by Smith (2006)), which is comprised of English sentences of ten words or fewer (after stripping punctuation) from the WSJ portion of the Penn Treebank.
    Page 5, “Experimental Comparison with Unsupervised Learning”
  4. It is our hope that this method will permit more effective leveraging of linguistic insight and resources and enable the construction of parsers in languages and domains where treebanks are not available.
    Page 8, “Experimental Comparison with Unsupervised Learning”


Dependency Tree

Appears in 3 sentences as: Dependency Tree (2) dependency tree (1)
In Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria
  1. In this paper we use a non-projective dependency tree CRF (Smith and Smith, 2007).
    Page 1, “Introduction”
  2. 3.2 Non-Projective Dependency Tree CRFs
    Page 3, “Generalized Expectation Criteria”
  3. 3.3 GE for Non-Projective Dependency Tree CRFs
    Page 4, “Generalized Expectation Criteria”


score function

Appears in 3 sentences as: score function (3)
In Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria
  1. unlabeled data), a model distribution pλ(y|x), and a score function S:
    Page 2, “Generalized Expectation Criteria”
  2. In this paper, we use a score function that is the squared difference of the model expectation of G and some target expectation G̃:
    Page 2, “Generalized Expectation Criteria”
  3. The partial derivative of the KL divergence score function includes the same covariance term as above but substitutes a different multiplicative term: G̃/Gλ.
    Page 3, “Generalized Expectation Criteria”
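The two score functions in excerpts 2–3 differ only in the multiplier that scales the shared covariance term in the gradient. A structural sketch with illustrative values (the covariance factor itself is omitted):

```python
# Multiplicative terms in the GE gradient for the two score functions
# above. Both gradients share a covariance term between the constraint
# function G and the model features; only this multiplier differs.
# Variable names and values are illustrative.

def squared_diff_multiplier(g_target, g_model):
    # From the squared-difference score: proportional to the gap.
    return 2.0 * (g_target - g_model)

def kl_multiplier(g_target, g_model):
    # From the KL divergence score: the ratio G-tilde / G-lambda.
    return g_target / g_model

g_target, g_model = 0.9, 0.7
print(squared_diff_multiplier(g_target, g_model))  # ≈ 0.4
print(kl_multiplier(g_target, g_model))            # ≈ 1.2857
```

Either way, the gradient vanishes only when the model expectation reaches the target, so both scores drive learning toward the expressed constraint.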


semi-supervised

Appears in 3 sentences as: semi-supervised (3)
In Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria
  1. In this paper, we propose a novel method for semi-supervised learning of non-projective log-linear dependency parsers using directly expressed linguistic prior knowledge (e.g.
    Page 1, “Abstract”
  2. Conventional semi-supervised learning requires parsed sentences.
    Page 2, “Related Work”
  3. In this paper, we developed a novel method for the semi-supervised learning of a non-projective CRF dependency parser that directly uses linguistic prior knowledge as a training signal.
    Page 8, “Experimental Comparison with Unsupervised Learning”
