Sparser, Better, Faster GPU Parsing

Due to their origin in computer graphics, graphics processing units (GPUs) are highly optimized for dense problems, where the exact same operation is applied repeatedly to all data points.

Because NLP models typically treat sentences independently, NLP problems have long been seen as “embarrassingly parallel” — large corpora can be processed arbitrarily fast by simply sending different sentences to different machines.

We build up our approach incrementally, with experiments interspersed throughout the paper, and summarized in Tables 1 and 2.

One successful approach for speeding up constituency parsers has been to use coarse-to-fine inference (Charniak et al., 2006).
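As an illustrative sketch of the idea (not the paper's implementation), coarse-to-fine pruning can be shown in a few lines: parse once with a small coarse grammar, then admit a fine-grammar item into the chart only if its coarse projection scored above a log-threshold. The names here (`coarse_chart`, `projection`, the toy symbols) are hypothetical.

```python
import math

def coarse_to_fine_allowed(coarse_chart, projection, threshold=-9.0):
    """Build a pruning mask for the fine pass from coarse inside scores.

    coarse_chart: dict mapping (start, end, coarse_symbol) -> log score
    projection:   dict mapping fine_symbol -> coarse_symbol
    A fine item (i, j, X) survives only if the log score of X's coarse
    projection over the same span beats the threshold.
    """
    def allowed(i, j, fine_symbol):
        coarse_sym = projection[fine_symbol]
        score = coarse_chart.get((i, j, coarse_sym), -math.inf)
        return score > threshold
    return allowed

# Hypothetical toy example: NP_0 and NP_1 both project to coarse NP.
chart = {(0, 2, "NP"): -1.5, (1, 3, "VP"): -20.0}
proj = {"NP_0": "NP", "NP_1": "NP", "VP_4": "VP"}
allowed = coarse_to_fine_allowed(chart, proj)
assert allowed(0, 2, "NP_1")       # coarse NP scored well on span (0, 2)
assert not allowed(1, 3, "VP_4")   # coarse VP pruned on span (1, 3)
```

The design point is that pruning decisions depend only on the coarse symbol, so one cheap coarse pass can rule out most fine items before the expensive fine pass runs.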

Unfortunately, the standard coarse-to-fine approach does not naïvely translate to GPU architectures.

This architectural environment places very different constraints on parsing algorithms than a CPU environment does.

Now we turn to the algorithmic and architectural changes in our approach.

Recall that the rules in the grammar are partitioned into a set of clusters, and that these clusters are further divided into subclusters.
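The shape of that two-level partition can be sketched as follows. This is a toy stand-in assuming round-robin assignment; the actual system clusters rules by structure so each GPU kernel stays dense, which this sketch does not attempt.

```python
def partition_rules(rules, num_clusters, num_subclusters):
    """Deal binary rules into `num_clusters` clusters round-robin, then
    split each cluster into up to `num_subclusters` contiguous
    subclusters.  Illustrates only the two-level partition shape."""
    clusters = [[] for _ in range(num_clusters)]
    for idx, rule in enumerate(rules):
        clusters[idx % num_clusters].append(rule)
    partitioned = []
    for cluster in clusters:
        k = max(1, -(-len(cluster) // num_subclusters))  # ceiling division
        subclusters = [cluster[i:i + k] for i in range(0, len(cluster), k)]
        partitioned.append(subclusters)
    return partitioned

# Toy binary rules (parent, left child, right child).
rules = [("S", "NP", "VP"), ("NP", "DT", "NN"), ("VP", "V", "NP"),
         ("NP", "NP", "PP"), ("PP", "P", "NP"), ("VP", "VP", "PP")]
parts = partition_rules(rules, num_clusters=2, num_subclusters=2)
# Every rule lands in exactly one subcluster of one cluster.
assert sum(len(s) for c in parts for s in c) == len(rules)
```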

The coarse to fine pruning approach of Petrov and Klein (2007) employs an X-bar grammar as its first pruning phase, but there is no reason why we cannot begin with a more complex grammar for our initial pass.

The Viterbi algorithm is a reasonably effective method for parsing.
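For reference, the Viterbi inside pass that such systems accelerate is the standard CKY recurrence over a binarized grammar. A minimal CPU sketch (a toy with assumed data structures, not the GPU implementation) looks like this:

```python
import math

def viterbi_inside(words, lexicon, rules):
    """Minimal CKY Viterbi inside pass for a binary grammar.

    lexicon: dict word -> {tag: log prob}
    rules:   dict (left_child, right_child) -> {parent: log prob}
    Returns a chart mapping (start, end) -> {symbol: best log score}.
    """
    n = len(words)
    chart = {}
    for i, w in enumerate(words):
        chart[(i, i + 1)] = dict(lexicon.get(w, {}))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = {}
            for split in range(i + 1, j):
                for lc, ls in chart.get((i, split), {}).items():
                    for rc, rs in chart.get((split, j), {}).items():
                        for parent, rp in rules.get((lc, rc), {}).items():
                            score = ls + rs + rp
                            if score > cell.get(parent, -math.inf):
                                cell[parent] = score
            chart[(i, j)] = cell
    return chart

lexicon = {"dogs": {"NP": math.log(1.0)}, "bark": {"VP": math.log(1.0)}}
rules = {("NP", "VP"): {"S": math.log(1.0)}}
chart = viterbi_inside(["dogs", "bark"], lexicon, rules)
assert abs(chart[(0, 2)]["S"]) < 1e-9  # log prob 0.0, i.e. probability 1
```

The inner loop over (left child, right child, parent) triples is exactly the work that a dense GPU parser unrolls into kernels applied to every chart item.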

In this section we break down exactly how our system spends its time.

Apart from the model of Canny et al.

GPUs represent a challenging opportunity for natural language processing.

Appears in 27 sentences as: Viterbi (28)

In *Sparser, Better, Faster GPU Parsing*

- Canny et al. (2013) presented an approach to GPU parsing that sacrifices traditional sparsity in exchange for raw computational power, obtaining a system that can compute Viterbi parses for a high-quality grammar at about 164 sentences per second on a midrange GPU. (Page 1, “Abstract”)
- The resulting system is capable of computing over 404 Viterbi parses per second—more than a 2x speedup—on the same hardware. (Page 1, “Abstract”)
- Together these kernels implement the Viterbi inside algorithm. (Page 1, “Introduction”)
- On a midrange GPU, their system can compute Viterbi derivations at 164 sentences per second on sentences of length 40 or less (see timing details below). (Page 1, “Introduction”)
- A limitation of the system of Canny et al. (2013) is that it only computes Viterbi parses. (Page 2, “Introduction”)
- As with other grammars with a parse/derivation distinction, the grammars of Petrov and Klein (2007) only achieve their full accuracy using minimum-Bayes-risk parsing, with improvements of over 1.5 F1 over best-derivation Viterbi parsing on the Penn Treebank (Marcus et al., 1993). (Page 2, “Introduction”)
- Table 1: Performance numbers for computing Viterbi inside charts on 20,000 sentences of length ≤ 40 from the Penn Treebank. (Page 4, “Anatomy of a Dense GPU Parser”)
- The Viterbi inside loop for the grammar is inlined into a kernel. (Page 4, “Anatomy of a Dense GPU Parser”)
- Canny et al. (2013)’s system is able to compute Viterbi charts at 164 sentences per second, for sentences up to length 40. (Page 4, “Anatomy of a Dense GPU Parser”)
- The unpruned Viterbi computations in a fine grammar using the clustering method of Canny et al. (Page 6, “Grammar Clustering”)
- The Viterbi algorithm is a reasonably effective method for parsing. (Page 6, “Minimum Bayes risk parsing”)

See all papers in *Proc. ACL 2014* that mention Viterbi.


Appears in 6 sentences as: Treebank (6)

In *Sparser, Better, Faster GPU Parsing*

- As with other grammars with a parse/derivation distinction, the grammars of Petrov and Klein (2007) only achieve their full accuracy using minimum-Bayes-risk parsing, with improvements of over 1.5 F1 over best-derivation Viterbi parsing on the Penn Treebank (Marcus et al., 1993). (Page 2, “Introduction”)
- Table 1: Performance numbers for computing Viterbi inside charts on 20,000 sentences of length ≤ 40 from the Penn Treebank. (Page 4, “Anatomy of a Dense GPU Parser”)
- Table 2: Performance numbers for computing max constituent (Goodman, 1996) trees on 20,000 sentences of length 40 or less from the Penn Treebank. (Page 7, “Minimum Bayes risk parsing”)
- Therefore, in the fine pass, we normalize the inside scores at the leaves to sum to 1.0. Using this slight modification, no sentences from the Treebank under- or overflow. (Page 8, “Minimum Bayes risk parsing”)
- We measured parsing accuracy on sentences of length ≤ 40 from section 22 of the Penn Treebank. (Page 8, “Minimum Bayes risk parsing”)
- We replicated the Treebank for the 100,000-sentence pass. (Page 8, “Analyzing System Performance”)
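The leaf-normalization trick quoted above (page 8) can be sketched as follows; this is a toy illustration under assumed data structures, not the paper's code. Dividing each leaf cell by its own sum rescales every inside score for the sentence by a constant, which keeps scores inside floating-point range without changing which tree is selected.

```python
def normalize_leaves(leaf_scores):
    """Rescale each leaf cell so its (non-log) scores sum to 1.0.

    leaf_scores: list of dicts, one per word, mapping tag -> score.
    Every full-sentence inside score is multiplied by the product of
    one entry per word, so this rescales the chart uniformly and
    cannot change the argmax tree."""
    normalized = []
    for cell in leaf_scores:
        total = sum(cell.values())
        normalized.append({tag: s / total for tag, s in cell.items()})
    return normalized

leaves = [{"NN": 2.0, "VB": 6.0}, {"DT": 0.5}]
norm = normalize_leaves(leaves)
assert abs(sum(norm[0].values()) - 1.0) < 1e-12
assert abs(norm[0]["VB"] - 0.75) < 1e-12
```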


Appears in 4 sentences as: Berkeley parser (2), Berkeley parsers (1), Berkeley parser’s (1)

In *Sparser, Better, Faster GPU Parsing*

- Their system uses a grammar based on the Berkeley parser (Petrov and Klein, 2007) (which is particularly amenable to GPU processing), “compiling” the grammar into a sequence of GPU kernels that are applied densely to every item in the parse chart. (Page 1, “Introduction”)
- It is of course important to verify the correctness of our system; one easy way to do so is to examine parsing accuracy, as compared to the original Berkeley parser. (Page 8, “Minimum Bayes risk parsing”)
- These results are nearly identical to the Berkeley parser’s most comparable numbers: 89.8 for Viterbi, and 90.9 for their “Max-Rule-Sum” MBR algorithm. (Page 8, “Minimum Bayes risk parsing”)
- The Berkeley parser’s grammars—by virtue of being unlexicalized—can be applied uniformly to all parse items. (Page 9, “Conclusion”)


Appears in 4 sentences as: Penn Treebank (4)

In *Sparser, Better, Faster GPU Parsing*

- As with other grammars with a parse/derivation distinction, the grammars of Petrov and Klein (2007) only achieve their full accuracy using minimum-Bayes-risk parsing, with improvements of over 1.5 F1 over best-derivation Viterbi parsing on the Penn Treebank (Marcus et al., 1993). (Page 2, “Introduction”)
- Table 1: Performance numbers for computing Viterbi inside charts on 20,000 sentences of length ≤ 40 from the Penn Treebank. (Page 4, “Anatomy of a Dense GPU Parser”)
- Table 2: Performance numbers for computing max constituent (Goodman, 1996) trees on 20,000 sentences of length 40 or less from the Penn Treebank. (Page 7, “Minimum Bayes risk parsing”)
- We measured parsing accuracy on sentences of length ≤ 40 from section 22 of the Penn Treebank. (Page 8, “Minimum Bayes risk parsing”)


Appears in 3 sentences as: latent variable (3)

In *Sparser, Better, Faster GPU Parsing*

- For instance, in a latent variable parser, the coarse grammar would have symbols like NP, VP, etc., and the fine pass would have refined symbols NP0, NP1, VP4, and so on. (Page 2, “Sparsity and CPUs”)
- MBR parsing has proven especially useful in latent variable grammars. (Page 6, “Minimum Bayes risk parsing”)
- Petrov and Klein (2007) showed that MBR trees substantially improved performance over Viterbi parses for latent variable grammars, gaining up to 1.5 F1. (Page 6, “Minimum Bayes risk parsing”)
