Efficient, Feature-based, Conditional Random Field Parsing
Finkel, Jenny Rose and Kleeman, Alex and Manning, Christopher D.

Article Structure

Abstract

Discriminative feature-based methods are widely used in natural language processing, but sentence parsing is still dominated by generative methods.

Introduction

Over the past decade, feature-based discriminative models have become the tool of choice for many natural language processing tasks.

The Model

2.1 A Conditional Random Field Context Free Grammar (CRF-CFG)

Stochastic Optimization Methods

Stochastic optimization methods have proven to be extremely efficient for the training of models involving computationally expensive objective functions like those encountered with our task (Vishwanathan et al., 2006) and, in fact, the online backpropagation learning used in the neural network parser of Henderson (2004) is a form of stochastic gradient descent.
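
A minimal stochastic gradient descent loop, of the kind this section refers to, is sketched below. It is a generic illustration rather than the authors' training code: the parameter and gradient representations, the batch size, and the fixed learning rate are placeholder assumptions.

    import random

    def sgd(params, grad_fn, data, batch_size, learning_rate, epochs):
        """Generic stochastic gradient descent sketch (not the paper's code).

        grad_fn(params, batch) should return a dict of gradients for the
        stochastic objective computed on `batch`; for a CRF-CFG this is where
        the expensive inside/outside computation would live.
        """
        for _ in range(epochs):
            random.shuffle(data)                       # one pass = one epoch
            for start in range(0, len(data), batch_size):
                batch = data[start:start + batch_size]
                for name, g in grad_fn(params, batch).items():
                    params[name] -= learning_rate * g  # step against the gradient
        return params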

Features

As discussed in Section 5 we performed experiments on both sentences of length ≤ 15 and length ≤ 40.

Experiments

For all experiments, we trained and tested on the Penn treebank (PTB) (Marcus et al., 1993).

Comparison With Related Work

The most similar related work is (Johnson, 2001), which did discriminative training of a generative PCFG.

Conclusions

We have presented a new, feature-rich, dynamic programming based discriminative parser which is simpler, more effective, and faster to train and test than previous work, giving us new state-of-the-art performance when training and testing on sentences of length ≤ 15 and the first results for such a parser trained and tested on sentences of length ≤ 40.

Topics

generative model

Appears in 5 sentences as: generative model (3) generative models (2)
In Efficient, Feature-based, Conditional Random Field Parsing
  1. Although they take much longer to train than generative models, they typically produce higher performing systems, in large part due to the ability to incorporate arbitrary, potentially overlapping features.
    Page 1, “Introduction”
  2. For both WSJ 15 and WSJ40, we trained a generative model ; a discriminative model, which used lexicon features, but no grammar features other than the rules themselves; and a feature-based model which had access to all features.
    Page 5, “Experiments”
  3. The discriminatively trained generative model (discriminative in Table 3) took approximately 12 minutes per pass through the data, while the feature-based model (feature-based in Table 3) took 35 minutes per pass through the data.
    Page 6, “Experiments”
  4. In Figure 3 we show for an example from section 22 the parse trees produced by our generative model and our feature-based discriminative model, and the correct parse.
    Page 6, “Experiments”
  5. The training set is very small, and it is a known fact that generative models tend to work better for small datasets and discriminative models tend to work better for larger datasets (Ng and Jordan, 2002).
    Page 7, “Comparison With Related Work”

objective function

Appears in 5 sentences as: Objective Function (1) objective function (2) objective functions (2)
In Efficient, Feature-based, Conditional Random Field Parsing
  1. 2.2 Computing the Objective Function
    Page 3, “The Model”
  2. Stochastic optimization methods have proven to be extremely efficient for the training of models involving computationally expensive objective functions like those encountered with our task (Vishwanathan et al., 2006) and, in fact, the online backpropagation learning used in the neural network parser of Henderson (2004) is a form of stochastic gradient descent.
    Page 4, “Stochastic Optimization Methods”
  3. In our experiments SGD converged to a lower objective function value than L-BFGS, however it required far
    Page 4, “Stochastic Optimization Methods”
  4. Utilization of stochastic optimization routines requires the implementation of a stochastic objective function.
    Page 4, “Stochastic Optimization Methods”
  5. Note that this property is satisfied, without scaling, for objective functions that sum over the training data, as it is in our case, but any priors must be scaled down by a factor of b/|D|.
    Page 4, “Stochastic Optimization Methods”
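
Snippets 4 and 5 can be read as follows: when the full objective J(θ; D) sums a per-example term over the training set D plus a prior R(θ), a stochastic surrogate computed on a uniformly sampled batch D_b of size b keeps the per-example terms and scales the prior by b/|D|, so that its expectation is proportional to the full objective. A sketch of that property, in notation assumed here rather than copied from the paper:

    \tilde{J}(\theta; \mathcal{D}_b) = \sum_{(s,t) \in \mathcal{D}_b} \ell(t, s; \theta) + \frac{b}{|\mathcal{D}|}\, R(\theta),
    \qquad
    \mathbb{E}_{\mathcal{D}_b}\!\left[\tilde{J}(\theta; \mathcal{D}_b)\right] = \frac{b}{|\mathcal{D}|}\, J(\theta; \mathcal{D}).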

binarization

Appears in 3 sentences as: Binarization (1) binarization (2)
In Efficient, Feature-based, Conditional Random Field Parsing
  1. Binarization of rules (Earley, 1970) is necessary to obtain cubic parsing time, and closure of unary chains is required for finding total probability mass (rather than just best parses) (Stolcke, 1995).
    Page 2, “The Model”
  2. This was done by collapsing all allowed unary chains to single unary rules, and disallowing multiple unary rule applications over the same span. We give the details of our binarization scheme in Section 5.
    Page 3, “The Model”
  3. Lastly, for the WSJ40 runs we used a simple, right branching binarization where each active state is annotated with its previous sibling and first child.
    Page 5, “Experiments”
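
The right-branching scheme in snippet 3 can be made concrete with a short sketch: an n-ary rule is split into binary rules whose intermediate ("active") symbols record the rule's first child and the previous sibling. The symbol-naming convention below is illustrative only; the paper does not specify its exact format.

    def binarize(parent, children):
        """Right-branching binarization of an n-ary CFG rule (illustrative).

        Each intermediate "active" symbol is annotated with the rule's first
        child and the previous sibling. Returns a list of (lhs, rhs) rules.
        """
        if len(children) <= 2:
            return [(parent, tuple(children))]
        rules, first, lhs = [], children[0], parent
        for i in range(len(children) - 2):
            active = "@%s[first=%s,prev=%s]" % (parent, first, children[i])
            rules.append((lhs, (children[i], active)))
            lhs = active
        rules.append((lhs, (children[-2], children[-1])))
        return rules

    # Example: binarize("NP", ["DT", "JJ", "NN", "NNS"]) produces
    #   NP -> DT @NP[first=DT,prev=DT]
    #   @NP[first=DT,prev=DT] -> JJ @NP[first=DT,prev=JJ]
    #   @NP[first=DT,prev=JJ] -> NN NNS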

CRF

Appears in 3 sentences as: CRF (3)
In Efficient, Feature-based, Conditional Random Field Parsing
  1. For example, in (Lafferty et al., 2001), when switching from a generatively trained hidden Markov model (HMM) to a discriminatively trained, linear chain, conditional random field (CRF) for part-of-speech tagging, their error drops from 5.7% to 5.6%.
    Page 1, “Introduction”
  2. When they add in only a small set of orthographic features, their CRF error rate drops considerably more to 4.3%, and their out-of-vocabulary error rate drops by more than half.
    Page 1, “Introduction”
  3. We then define a conditional probability distribution over entire trees, using the standard CRF distribution, shown in (1).
    Page 2, “The Model”
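
Snippet 3 refers to the paper's equation (1), which is not reproduced on this page. A standard way to write a globally normalized CRF distribution over trees, assuming the feature functions f_i factor over the rules r used in a tree t for sentence s, is:

    P(t \mid s; \theta) = \frac{1}{Z_s} \exp\!\Big( \sum_{r \in t} \sum_i \theta_i f_i(r, s) \Big),
    \qquad
    Z_s = \sum_{t' \in \tau(s)} \exp\!\Big( \sum_{r \in t'} \sum_i \theta_i f_i(r, s) \Big),

where τ(s) is the set of parse trees the grammar licenses for s (see the parse tree snippets below). The global normalizer Z_s is what distinguishes this from a locally normalized generative PCFG.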

dynamic programming

Appears in 3 sentences as: dynamic programming (3)
In Efficient, Feature-based, Conditional Random Field Parsing
  1. While prior feature-based dynamic programming parsers have restricted training and evaluation to artificially short sentences, we present the first general, feature-rich discriminative parser, based on a conditional random field model, which has been successfully scaled to the full WSJ parsing data.
    Page 1, “Abstract”
  2. This paper extends the third thread of work, where joint inference via dynamic programming algorithms is used to train models and to attempt to find the globally best parse.
    Page 1, “Introduction”
  3. We have presented a new, feature-rich, dynamic programming based discriminative parser which is simpler, more effective, and faster to train and test than previous work, giving us new state-of-the-art performance when training and testing on sentences of length ≤ 15 and the first results for such a parser trained and tested on sentences of length ≤ 40.
    Page 8, “Conclusions”
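
The snippets above describe dynamic programming over a binarized grammar both to find best parses and to compute total probability mass. The sketch below is a minimal inside pass in log space that computes the log partition function over all parses; swapping the log-sum for a max would give the Viterbi best parse. Unary chains, pruning, and the "ROOT" symbol are simplifying assumptions, not the paper's implementation.

    import math
    from collections import defaultdict

    def inside_log_z(words, lexicon, rules):
        """Log partition function log Z(s): log-sum over all parses' scores.

        lexicon: dict mapping (tag, word) -> log score for that tagging
        rules:   dict mapping (A, B, C)  -> log score for binary rule A -> B C
        Minimal sketch of the inside dynamic program a CRF-CFG needs;
        unary chains and pruning are omitted.
        """
        neg_inf = float("-inf")
        n = len(words)
        chart = defaultdict(lambda: neg_inf)  # (i, j, label) -> log inside score

        def logadd(a, b):
            if a == neg_inf:
                return b
            if b == neg_inf:
                return a
            m = max(a, b)
            return m + math.log(math.exp(a - m) + math.exp(b - m))

        # Lexical layer: score each tag over each single-word span.
        for i, w in enumerate(words):
            for (tag, word), score in lexicon.items():
                if word == w:
                    chart[(i, i + 1, tag)] = logadd(chart[(i, i + 1, tag)], score)

        # Binary layer: combine adjacent spans bottom-up, summing over splits.
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                j = i + width
                for (a, b, c), score in rules.items():
                    for k in range(i + 1, j):
                        left, right = chart[(i, k, b)], chart[(k, j, c)]
                        if left > neg_inf and right > neg_inf:
                            chart[(i, j, a)] = logadd(
                                chart[(i, j, a)], score + left + right)

        return chart[(0, n, "ROOT")]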

parse tree

Appears in 3 sentences as: parse tree (1) parse trees (2)
In Efficient, Feature-based, Conditional Random Field Parsing
  1. of the parse tree, given the sentence, not joint likelihood of the tree and sentence; and (b) probabilities are normalized globally instead of locally — the graphical models depiction of our trees is undirected.
    Page 2, “The Model”
  2. We define τ(s) to be the set of all possible parse trees for the given sentence licensed by the grammar G.
    Page 2, “The Model”
  3. In Figure 3 we show for an example from section 22 the parse trees produced by our generative model and our feature-based discriminative model, and the correct parse.
    Page 6, “Experiments”
