Beam-Width Prediction for Efficient Context-Free Parsing
Bodenstab, Nathan and Dunlop, Aaron and Hall, Keith and Roark, Brian

Article Structure

Abstract

Efficient decoding for syntactic parsing has become a necessary research area as statistical grammars grow in accuracy and size and as more NLP applications leverage syntactic analyses.

Introduction

Statistical constituent parsers have gradually increased in accuracy over the past ten years.

Background

2.1 Preliminaries and notation

Open/Closed Cell Classification

3.1 Constituent Closure

Beam-Width Prediction

The cell-closing approaches discussed in Section 3 make binary decisions to either allow or completely block all edges in each cell.

Experimental Setup

We run all experiments on the WSJ treebank (Marcus et al., 1999) using the standard splits: sections 2-21 for training, section 22 for development, and section 23 for testing.

Results

We empirically demonstrate the advantages of our pruning methods by comparing the total parse time of each system, including FOM initialization, chart cell classification, and beam-width prediction.

Conclusion and Future Work

We have introduced three new pruning methods, the best of which unites figure-of-merit estimation from agenda-based parsing, local pruning from beam-search parsing, and unlabeled constituent structure

Topics

development set

Appears in 7 sentences as: development set (7)
In Beam-Width Prediction for Efficient Context-Free Parsing
  1. We found the value λ = 10^2 to give the best performance on our development set, and we use this value in all of our experiments.
    Page 4, “Open/Closed Cell Classification”
  2. Figures 2a and 2b compare the pruned charts of Chart Constraints and Constituent Closure for a single sentence in the development set.
    Page 4, “Open/Closed Cell Classification”
  3. We compare the effectiveness of Constituent Closure, Complete Closure, and Chart Constraints by decreasing the percentage of chart cells closed until accuracy over all sentences in our development set starts to decline.
    Page 5, “Open/Closed Cell Classification”
  4. Constituent Closure improves on this baseline only slightly (36%), but we see our biggest gains with Complete Closure, which prunes 56% of all chart cells in the development set.
    Page 5, “Open/Closed Cell Classification”
  5. Figure 3 is a visual representation of beam-width prediction on a single sentence of the development set using the Berkeley latent-variable grammar and Boundary FOM.
    Page 6, “Beam-Width Prediction”
  6. In Table 1 we present the accuracy and parse time for three baseline parsers on the development set: exhaustive CYK parsing, beam-search parsing using only the inside score β(·), and beam-search parsing using the Boundary FOM.
    Page 8, “Results”
  7. Table 1: Section 22 development set results for CYK and Beam-Search (Beam) parsing using the Berkeley latent-variable grammar.
    Page 9, “Conclusion and Future Work”

constituent parsers

Appears in 5 sentences as: constituent parsers (3) constituent parsing (2)
In Beam-Width Prediction for Efficient Context-Free Parsing
  1. Statistical constituent parsers have gradually increased in accuracy over the past ten years.
    Page 1, “Introduction”
  2. Although syntax is becoming increasingly important for large-scale NLP applications, constituent parsing is slow—too slow to scale to the size of many potential consumer applications.
    Page 1, “Introduction”
  3. Deterministic algorithms for dependency parsing exist that can extract syntactic dependency structure very quickly (Nivre, 2008), but this approach is often undesirable as constituent parsers are more accurate and more adaptable to new domains (Petrov et al., 2010).
    Page 1, “Introduction”
  4. The most accurate constituent parsers, e.g., Charniak (2000), Petrov and Klein (2007a), make use of approximate inference, limiting their search to a fraction of the total search space and achieving speeds of between one and four newspaper sentences per second.
    Page 1, “Introduction”
  5. In this paper, we examine a general approach to approximate inference in constituent parsing that learns cell-specific thresholds for arbitrary grammars.
    Page 2, “Introduction”

Berkeley parser

Appears in 4 sentences as: Berkeley parser (4) Berkeley parsers (1)
In Beam-Width Prediction for Efficient Context-Free Parsing
  1. We demonstrate that our method is faster than coarse-to-fine pruning, exemplified in both the Charniak and Berkeley parsers, by empirically comparing our parser to the Berkeley parser using the same grammar and under identical operating conditions.
    Page 1, “Abstract”
  2. Both our parser and the Berkeley parser are written in Java, both are run with Viterbi decoding, and both parse with the same grammar, so a direct comparison of speed and accuracy is fair.2
    Page 8, “Results”
  3. [Footnote 2] We run the Berkeley parser with the default search parameterization to achieve the fastest possible parsing time.
    Page 8, “Conclusion and Future Work”
  4. Using this framework, we have shown that we can decrease parsing time by 65% over a standard beam-search without any loss in accuracy, and parse significantly faster than both the Berkeley parser and Chart Constraints.
    Page 9, “Conclusion and Future Work”

binary classification

Appears in 4 sentences as: binary classification (2) binary classifier (1) binary classifiers (1)
In Beam-Width Prediction for Efficient Context-Free Parsing
  1. More generally, instead of a binary classification decision, we can also use this method to predict the desired cell population directly and get cell closure for free when the classifier predicts a beam-width of zero.
    Page 2, “Introduction”
  2. We first look at the binary classification of chart cells as either open or closed to full constituents, and predict this value from the input sentence alone.
    Page 4, “Open/Closed Cell Classification”
  3. Given a global maximum beam-width b, we train b different binary classifiers, each using separate mapping functions Φ_k, where the target value y produced by Φ_k is 1 if R_{i,j} > k and 0 otherwise.
    Page 6, “Beam-Width Prediction”
  4. During decoding, we assign the beam-width for the chart cell spanning w_{i+1}...w_j given models θ_0, θ_1, ..., θ_{b-1} by finding the lowest value k such that the binary classifier θ_k classifies R_{i,j} ≤ k. If no such k exists, R_{i,j} is set to the maximum beam-width value b:
    Page 6, “Beam-Width Prediction”
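
Items 3 and 4 above describe the training targets and the per-cell decoding rule in prose. Below is a minimal Java sketch of the decoding rule only, assuming a hypothetical BinaryClassifier interface and plain feature arrays; it illustrates the rule as stated here, not the authors' implementation (their parser is written in Java, per the Results excerpts, but its internals are not shown on this page).

    // Minimal sketch of the per-cell decoding rule in item 4, using a
    // hypothetical BinaryClassifier interface; not the authors' code.
    public final class BeamWidthPredictor {

        /** Hypothetical stand-in for classifier theta_k over a cell's feature vector x. */
        public interface BinaryClassifier {
            /** Returns true if theta_k predicts R_{i,j} > k for this cell. */
            boolean predictsRankGreaterThan(double[] cellFeatures);
        }

        private final BinaryClassifier[] classifiers; // theta_0 ... theta_{b-1}
        private final int maxBeamWidth;               // global maximum beam-width b

        public BeamWidthPredictor(BinaryClassifier[] classifiers, int maxBeamWidth) {
            this.classifiers = classifiers;
            this.maxBeamWidth = maxBeamWidth;
        }

        /** Lowest k with theta_k predicting R_{i,j} <= k; falls back to b if none. */
        public int predictBeamWidth(double[] cellFeatures) {
            for (int k = 0; k < classifiers.length; k++) {
                if (!classifiers[k].predictsRankGreaterThan(cellFeatures)) {
                    return k; // a predicted width of 0 closes the cell entirely
                }
            }
            return maxBeamWidth;
        }
    }

A predicted width of 0 closes the cell outright, which is how cell closure comes "for free" in item 1 of this list.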

latent variable

Appears in 4 sentences as: latent variable (3) latent variables (1)
In Beam-Width Prediction for Efficient Context-Free Parsing
  1. Grammar transformation techniques such as linguistically inspired nonterminal annotations (Johnson, 1998; Klein and Manning, 2003b) and latent variable grammars (Matsuzaki et al., 2005; Petrov et al., 2006) have increased the grammar size |G| from a few thousand rules to several million in an explicitly enumerable grammar, or even more in an implicit grammar.
    Page 1, “Introduction”
  2. Rather, the beam-width prediction model is trained to learn the rank of constituents in the maximum likelihood trees.1 We will illustrate this by presenting results using a latent-variable grammar, for which there is no “true” reference latent variable parse.
    Page 2, “Introduction”
  3. Petrov and Klein (2007a) derive coarse grammars in a more statistically principled way, although the technique is closely tied to their latent variable grammar representation.
    Page 3, “Background”
  4. Alternative decoding methods, such as marginalizing over the latent variables in the grammar or MaxRule decoding (Petrov and Klein, 2007a) are certainly possible in our framework, but it is unknown how effective these methods will be given the heavily pruned na-
    Page 7, “Experimental Setup”

loss function

Appears in 4 sentences as: loss function (4)
In Beam-Width Prediction for Efficient Context-Free Parsing
  1. In addition, we use a non-symmetric loss function during optimization to account for the imbalance between over-predicting and under-predicting the beam-width.
    Page 2, “Introduction”
  2. where H is the unit step function: 1 if the inner product θ · x > 0, and 0 otherwise; and L_λ(·,·) is an asymmetric loss function, defined below.
    Page 4, “Open/Closed Cell Classification”
  3. To deal with this imbalance, we introduce an asymmetric loss function L_λ(·,·) to penalize false-negatives more severely during training.
    Page 4, “Open/Closed Cell Classification”
  4. For Constituent and Complete Closure, we also vary the loss function, adjusting the relative penalty between a false-negative (closing off a chart cell that contains a maximum likelihood edge) and a false-positive.
    Page 5, “Open/Closed Cell Classification”
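
The excerpts above name the asymmetric loss L_λ(·,·) but do not reproduce its definition. One simple form consistent with the description (a false negative costing λ times a false positive) is sketched below; the paper's exact definition may differ.

    % Sketch only: one plausible asymmetric loss over a gold label y and a
    % prediction \hat{y}; the paper's exact definition is not shown on this page.
    L_{\lambda}(y, \hat{y}) =
      \begin{cases}
        \lambda & y = 1,\ \hat{y} = 0 \quad \text{(false negative: a needed cell or edge is pruned)} \\
        1       & y = 0,\ \hat{y} = 1 \quad \text{(false positive: an unneeded cell or edge is kept)} \\
        0       & \text{otherwise.}
      \end{cases}

Under this reading, the λ = 10^2 value reported in the development-set excerpts would penalize a false negative roughly 100 times as heavily as a false positive.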

parse tree

Appears in 4 sentences as: parse tree (3) parse trees (1)
In Beam-Width Prediction for Efficient Context-Free Parsing
  1. Exhaustive search for the maximum likelihood parse tree with a state-of-the-art grammar can require over a minute of processing for a single sentence of 25 words, an unacceptable amount of time for real-time applications or when processing millions of sentences.
    Page 1, “Introduction”
  2. We define an edge’s figure-of-merit (FOM) as an estimate of the product of its inside (β) and outside (α) scores, conceptually the relative merit the edge has to participate in the final parse tree (see Figure 1).
    Page 2, “Background”
  3. predictions about the unlabeled constituent structure of the target parse tree.
    Page 4, “Background”
  4. The optimal point will necessarily be very conservative, allowing outliers (sentences or sub-phrases with above average ambiguity) to stay within the beam and produce valid parse trees.
    Page 6, “Beam-Width Prediction”
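
Item 2 defines the FOM verbally; written as an equation (with N_{i,j} denoting an edge for nonterminal N over the cell spanning words i+1..j, a notational choice of this note rather than a quote from the paper):

    % Item 2 stated as an equation: the FOM for an edge N_{i,j} estimates the
    % inside-outside product. Hatted quantities are the estimates that stand
    % in for scores not yet computable during bottom-up search.
    \mathrm{FOM}(N_{i,j}) \;\approx\; \alpha(N_{i,j}) \cdot \beta(N_{i,j})
      \;\;\text{estimated as}\;\; \hat{\alpha}(N_{i,j}) \cdot \hat{\beta}(N_{i,j}).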

treebank

Appears in 4 sentences as: treebank (4)
In Beam-Width Prediction for Efficient Context-Free Parsing
  1. We simply parse sections 2-21 of the WSJ treebank and train our search models from the output of these trees, with no prior knowledge of the nonterminal set or other grammar characteristics to guide the process.
    Page 2, “Introduction”
  2. We run all experiments on the WSJ treebank (Marcus et al., 1999) using the standard splits: sections 2-21 for training, section 22 for development, and section 23 for testing.
    Page 7, “Experimental Setup”
  3. We preprocess the treebank by removing empty nodes, temporal labels, and spurious unary productions (X→X), as is standard in published works on syntactic parsing.
    Page 7, “Experimental Setup”
  4. To achieve state-of-the-art accuracy levels, we parse with the Berkeley SM6 latent-variable grammar (Petrov and Klein, 2007b) where the original treebank non-terminals are automatically split into subclasses to optimize parsing accuracy.
    Page 7, “Experimental Setup”
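
Item 3 lists the preprocessing steps in one sentence. Below is a rough Java sketch of only the last step, collapsing spurious X→X unary chains, over a hypothetical Tree class; empty-node and temporal-label removal are omitted, and none of this code is from the paper.

    import java.util.ArrayList;
    import java.util.List;

    // Rough sketch of collapsing spurious unary productions (X -> X) during
    // treebank preprocessing; the Tree class is a hypothetical stand-in.
    final class Tree {
        final String label;
        final List<Tree> children = new ArrayList<>();

        Tree(String label) { this.label = label; }

        /** Returns the tree with any X -> X unary chains collapsed to a single node. */
        static Tree collapseSelfUnaries(Tree node) {
            // Skip over a chain of unary children that repeat the parent label.
            while (node.children.size() == 1
                    && node.children.get(0).label.equals(node.label)) {
                node = node.children.get(0);
            }
            // Recurse into the (possibly new) node's children.
            List<Tree> cleaned = new ArrayList<>();
            for (Tree child : node.children) {
                cleaned.add(collapseSelfUnaries(child));
            }
            node.children.clear();
            node.children.addAll(cleaned);
            return node;
        }
    }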

Viterbi

Appears in 4 sentences as: Viterbi (4)
In Beam-Width Prediction for Efficient Context-Free Parsing
  1. Accuracy is computed from the 1-best Viterbi (max) tree extracted from the chart.
    Page 7, “Experimental Setup”
  2. We compute the precision and recall of constituents from the 1-best Viterbi trees using the standard EVALB script (?).
    Page 7, “Experimental Setup”
  3. Both our parser and the Berkeley parser are written in Java, both are run with Viterbi decoding, and both parse with the same grammar, so a direct comparison of speed and accuracy is fair.2
    Page 8, “Results”
  4. Table 1 (excerpt), listing each parser's parse time and F-score:
       CYK                        64.610   88.7
       Berkeley CTF MaxRule        0.213   90.2
       Berkeley CTF Viterbi        0.208   88.8
       Beam + Boundary FOM (BB)    0.334   88.6
       BB + Chart Constraints      0.244   88.7
    Page 9, “Conclusion and Future Work”

part-of-speech

Appears in 3 sentences as: part-of-speech (3)
In Beam-Width Prediction for Efficient Context-Free Parsing
  1. Prior work incorporating parse structure into machine translation (Chiang, 2010) and Semantic Role Labeling (Tsai et al., 2005; Punyakanok et al., 2008) indicates that such hierarchical structure can have great benefit over shallow labeling techniques like chunking and part-of-speech tagging.
    Page 1, “Introduction”
  2. The feature vector x is encoded with the chart cell’s absolute and relative span width, as well as unigram and bigram lexical and part-of-speech tag items from w_{i-1} ...
    Page 4, “Open/Closed Cell Classification”
  3. Figure 4 contains a timing comparison of the three components of our final parser: Boundary FOM initialization (which includes the forward-backward algorithm over ambiguous part-of-speech tags), beam-
    Page 7, “Results”
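
Item 2 above describes the cell feature vector x only at a high level. The sketch below shows the general shape of such an encoding (span-width features plus boundary word and tag unigrams and bigrams); the feature names, the 0-indexed span convention, and the exact token window are assumptions of this note, not the paper's feature set.

    import java.util.ArrayList;
    import java.util.List;

    // Rough sketch of per-cell features in the spirit of item 2: span-width
    // features plus unigram/bigram word and POS-tag features at the left
    // boundary. Feature names and the token window are illustrative only.
    final class CellFeatureEncoder {

        /** Sparse string features for the chart cell covering words[i..j) (0-indexed). */
        static List<String> encode(String[] words, String[] posTags, int i, int j) {
            int n = words.length;
            List<String> feats = new ArrayList<>();

            int width = j - i;
            feats.add("spanWidth=" + width);                           // absolute span width
            feats.add("relSpanWidth=" + Math.round(10.0 * width / n)); // width relative to sentence length, bucketed

            // Word/tag context just outside and inside the left boundary
            // (a mirrored set at the right boundary would be added the same way).
            String leftWord = (i > 0) ? words[i - 1] : "<s>";
            String leftTag  = (i > 0) ? posTags[i - 1] : "<s>";
            feats.add("leftWord=" + leftWord);
            feats.add("leftTag=" + leftTag);
            feats.add("leftWordBigram=" + leftWord + "_" + words[i]);
            feats.add("leftTagBigram=" + leftTag + "_" + posTags[i]);
            return feats;
        }
    }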

part-of-speech tag

Appears in 3 sentences as: part-of-speech tag (1) part-of-speech tagging (1) part-of-speech tags (1)
In Beam-Width Prediction for Efficient Context-Free Parsing
  1. Prior work incorporating parse structure into machine translation (Chiang, 2010) and Semantic Role Labeling (Tsai et al., 2005; Punyakanok et al., 2008) indicates that such hierarchical structure can have great benefit over shallow labeling techniques like chunking and part-of-speech tagging.
    Page 1, “Introduction”
  2. The feature vector x is encoded with the chart cell’s absolute and relative span width, as well as unigram and bigram lexical and part-of-speech tag items from w_{i-1} ...
    Page 4, “Open/Closed Cell Classification”
  3. Figure 4 contains a timing comparison of the three components of our final parser: Boundary FOM initialization (which includes the forward-backward algorithm over ambiguous part-of-speech tags), beam-
    Page 7, “Results”
