Transition-based Dependency Parsing with Selectional Branching
Choi, Jinho D. and McCallum, Andrew

Article Structure

Abstract

We present a novel approach, called selectional branching, which uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to a greedy transition-based dependency parsing approach.

Introduction

Transition-based dependency parsing has gained considerable interest because it runs fast and performs accurately.

Transition-based dependency parsing

We introduce a transition-based dependency parsing algorithm that is a hybrid between Nivre’s arc-eager and list-based algorithms (Nivre, 2003; Nivre, 2008).

Selectional branching

3.1 Motivation

Experiments

4.1 Corpora

Related work

Our parsing algorithm is most similar to Choi and Palmer (2011) who integrated our LEFT-REDUCE transition into Nivre’s list-based algorithm.

Conclusion

We present selectional branching that uses confidence estimates to decide when to employ a beam.

Topics

beam size

Appears in 21 sentences as: Beam size (1) beam size (15) beam sizes (7)
In Transition-based Dependency Parsing with Selectional Branching
  1. Thus, it is preferred if the beam size is not fixed but proportional to the number of low confidence predictions that a greedy parser makes, in which case, fewer transition sequences need to be explored to produce the same or similar parse output.
    Page 1, “Introduction”
  2. Thus, it is preferred if the beam size is not fixed but proportional to the number of low confidence predictions made for the one-best sequence.
    Page 3, “Selectional branching”
  3. The selectional branching method presented here performs at most d·t − e transitions, where t is the maximum number of transitions performed to generate a transition sequence, d = min(b, |λ|+1), b is the beam size, |λ| is the number of low confidence predictions made for the one-best sequence, and e = d(d−1)/2.
    Page 3, “Selectional branching”
  4. Compared to beam search that always performs b·t transitions, selectional branching is guaranteed to perform fewer transitions given the same beam size because d ≤ b and e > 0 except for d = 1, in which case, no branching happens.
    Page 3, “Selectional branching”
  5. While generating T1, the parser adds tuples (s1j, p2j), …, (s1j, pkj) to a list λ for each low confidence prediction p1j given s1j. Then, new transition sequences are generated by using the b highest scoring predictions in λ, where b is the beam size.
    Page 4, “Selectional branching”
  6. Note that assigning a greater k may increase |λ| but not the total number of transition sequences generated, which is restricted by the beam size, b.
    Page 4, “Selectional branching”
  7. To sum up, selectional branching generates the same or fewer transition sequences than beam search and each sequence Ti>1 performs fewer transitions than T1; thus, it performs faster than beam search in general given the same beam size.
    Page 4, “Selectional branching”
  8. For selectional branching, the margin threshold m and the beam size b need to be tuned (Section 3.3).
    Page 6, “Experiments”
  9. For this development set, beam sizes of 64 and 80 gave exactly the same result, so we kept the one with the larger beam size (b = 80).
    Page 6, “Experiments”
  10. Figure 3: Parsing accuracies with respect to margins and beam sizes on the English development set.
    Page 6, “Experiments”
  11. Figure 3 shows parsing accuracies with respect to different margins and beam sizes on the English development set.
    Page 6, “Experiments”
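
As a worked illustration of the bound quoted in items 3 and 4 above, the following sketch (ours, not the paper's code) compares the worst-case transition counts of beam search, b·t, and selectional branching, d·t − e, with d = min(b, |λ|+1) and e = d(d−1)/2 as reconstructed in item 3; the savings term reflects that the k-th branched sequence reuses at least k transitions of the one-best sequence.

def beam_search_transitions(b, t):
    # Beam search expands b sequences for all t transitions.
    return b * t

def selectional_branching_transitions(b, t, num_low_conf):
    # d = min(b, |lambda|+1): one extra sequence per low-confidence
    # prediction, capped by the beam size. e = d(d-1)/2 transitions are
    # saved because the k-th branched sequence reuses at least k
    # transitions of T1.
    d = min(b, num_low_conf + 1)
    e = d * (d - 1) // 2
    return d * t - e

# Beam size 80, a 40-transition sentence, 3 low-confidence predictions:
# beam search performs 3200 transitions, selectional branching at most 154.
print(beam_search_transitions(80, 40), selectional_branching_transitions(80, 40, 3))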


beam search

Appears in 19 sentences as: beam search (19) beam search, (1)
In Transition-based Dependency Parsing with Selectional Branching
  1. We present a novel approach, called selectional branching, which uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to a greedy transition-based dependency parsing approach.
    Page 1, “Abstract”
  2. Selectional branching is guaranteed to perform fewer transitions than beam search yet performs as accurately.
    Page 1, “Abstract”
  3. With the standard setup, our parser shows an unlabeled attachment score of 92.96% and a parsing speed of 9 milliseconds per sentence, which is faster and more accurate than the current state-of-the-art transition-based parser that uses beam search.
    Page 1, “Abstract”
  4. Greedy transition-based dependency parsing has been widely deployed because of its speed (Cer et al., 2010); however, state-of-the-art accuracies have been achieved by globally optimized parsers using beam search (Zhang and Clark, 2008; Huang and Sagae, 2010; Zhang and Nivre, 2011; Bohnet and Nivre, 2012).
    Page 1, “Introduction”
  5. Coupled with dynamic programming, transition-based dependency parsing with beam search can be done very efficiently and gives significant improvement to parsing accuracy.
    Page 1, “Introduction”
  6. One downside of beam search is that it always uses a fixed beam size even when a smaller beam is sufficient for good results.
    Page 1, “Introduction”
  7. In our experiments, a greedy parser performs as accurately as a parser that uses beam search about 64% of the time.
    Page 1, “Introduction”
  8. With our new approach, we achieve a higher parsing accuracy than the current state-of-the-art transition-based parser that uses beam search and a much faster speed.
    Page 1, “Introduction”
  9. For transition-based parsing, state-of-the-art accuracies have been achieved by parsers optimized on multiple transition sequences using beam search, which can be done very efficiently when coupled with dynamic programming.
    Page 3, “Selectional branching”
  10. Compared to beam search that always performs b·t transitions, selectional branching is guaranteed to perform fewer transitions given the same beam size because d ≤ b and e > 0 except for d = 1, in which case, no branching happens.
    Page 3, “Selectional branching”
  11. …higher parsing accuracy than the current state-of-the-art transition-based parser using beam search, and performs about 3 times faster.
    Page 4, “Selectional branching”
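
For contrast with selectional branching, here is a generic beam-search decoding loop for a transition-based parser, a minimal sketch under assumed interfaces rather than any cited system's implementation; the fixed beam width pruned to at every step is what items 4 and 6 point at, forcing roughly b·t transitions even on sentences a greedy parser would get right.

def beam_search_decode(init_state, next_steps, is_final, b):
    # next_steps(state) yields (score_delta, next_state) pairs and
    # is_final(state) tests for a terminal state; both are assumed
    # interfaces, not the paper's API.
    beam = [(0.0, init_state)]
    while not all(is_final(s) for _, s in beam):
        candidates = []
        for score, state in beam:
            if is_final(state):
                candidates.append((score, state))
            else:
                for delta, nxt in next_steps(state):
                    candidates.append((score + delta, nxt))
        # Always prune back to the b best partial sequences, no matter
        # how confident the model is: the fixed cost of beam search.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:b]
    return max(beam, key=lambda c: c[0])[1]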


dependency parsing

Appears in 14 sentences as: dependency parsers (2) dependency parsing (12)
In Transition-based Dependency Parsing with Selectional Branching
  1. We present a novel approach, called selectional branching, which uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to a greedy transition-based dependency parsing approach.
    Page 1, “Abstract”
  2. We also present a new transition-based dependency parsing algorithm that gives a complexity of O(n) for projective parsing and an expected linear time speed for non-projective parsing.
    Page 1, “Abstract”
  3. Transition-based dependency parsing has gained considerable interest because it runs fast and performs accurately.
    Page 1, “Introduction”
  4. Greedy transition-based dependency parsing has been widely deployed because of its speed (Cer et al., 2010); however, state-of-the-art accuracies have been achieved by globally optimized parsers using beam search (Zhang and Clark, 2008; Huang and Sagae, 2010; Zhang and Nivre, 2011; Bohnet and Nivre, 2012).
    Page 1, “Introduction”
  5. Coupled with dynamic programming, transition-based dependency parsing with beam search can be done very efficiently and gives significant improvement to parsing accuracy.
    Page 1, “Introduction”
  6. We introduce a transition-based dependency parsing algorithm that is a hybrid between Nivre’s arc-eager and list-based algorithms (Nivre, 2003; Nivre, 2008).
    Page 1, “Transition-based dependency parsing”
  7. The parsing complexity of a transition-based dependency parsing algorithm is determined by the number of transitions performed with respect to the number of tokens in a sentence, say n (Kübler et al., 2009).
    Page 2, “Transition-based dependency parsing”
  8. For English, we mostly adapt features from Zhang and Nivre (2011) who have shown state-of-the-art parsing accuracy for transition-based dependency parsing.
    Page 6, “Experiments”
  9. Bohnet and Nivre (2012)’s transition-based system jointly performs POS tagging and dependency parsing, which shows higher accuracy than ours.
    Page 7, “Experiments”
  10. There are other transition-based dependency parsing algorithms that take a similar approach; Nivre (2009) integrated a SWAP transition into Nivre’s arc-standard algorithm (Nivre, 2004) and Fernández-González and Gómez-Rodríguez (2012) integrated a buffer transition into Nivre’s arc-eager algorithm to handle non-projectivity.
    Page 9, “Related work”
  11. Our selectional branching method is most relevant to Zhang and Clark (2008) who introduced a transition-based dependency parsing model that uses beam search.
    Page 9, “Related work”


development set

Appears in 13 sentences as: development set (11) development sets (2)
In Transition-based Dependency Parsing with Selectional Branching
  1. Input: Dt: training set, Dd: development set.
    Page 5, “Selectional branching”
  2. First, an initial model M0 is trained on all data by taking the one-best sequences, and its score is measured by testing on a development set (lines 2-4).
    Page 5, “Selectional branching”
  3. These languages are selected because they contain non-projective trees and are publicly available from the CoNLL-X webpage. Since the CoNLL-X data we have does not come with development sets, the last 10% of each training set is used for development.
    Page 6, “Experiments”
  4. Feature selection is done on the English development set.
    Page 6, “Experiments”
  5. First, all parameters are tuned on the English development set by using grid search on T = [1, …
    Page 6, “Experiments”
  6. For this development set, beam sizes of 64 and 80 gave exactly the same result, so we kept the one with the larger beam size (b = 80).
    Page 6, “Experiments”
  7. Figure 3: Parsing accuracies with respect to margins and beam sizes on the English development set.
    Page 6, “Experiments”
  8. Figure 3 shows parsing accuracies with respect to different margins and beam sizes on the English development set.
    Page 6, “Experiments”
  9. Figure 4: Parsing accuracies with respect to ADAGRAD and bootstrap iterations on the English development set when α = 0.02, ρ = 0.1, m = 0.88, and b = 64|80.
    Page 6, “Experiments”
  10. Figure 4 shows parsing accuracies with respect to ADAGRAD and bootstrap iterations on the English development set.
    Page 6, “Experiments”
  11. Figure 5: The total number of transitions performed during decoding with respect to beam sizes on the English development set.
    Page 7, “Experiments”
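
Items 5 and 6 describe tuning the margin m and the beam size b by grid search on the development set; a minimal sketch (ours, with an assumed dev_accuracy callback and illustrative parameter grids, not the paper's actual ranges):

def tune_margin_and_beam(dev_accuracy, margins, beam_sizes):
    # dev_accuracy(m, b) is assumed to train/parse and return accuracy
    # on the development set. Ties on accuracy are broken toward the
    # larger beam, mirroring the b = 80 choice quoted in item 6.
    best = max((dev_accuracy(m, b), b, m) for m in margins for b in beam_sizes)
    return best[2], best[1]  # (best margin, best beam size)

# e.g. tune_margin_and_beam(dev_accuracy,
#                           margins=[0.80 + 0.01 * i for i in range(13)],
#                           beam_sizes=[16, 32, 64, 80])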


parsing algorithm

Appears in 11 sentences as: parsing algorithm (10) parsing algorithms (2)
In Transition-based Dependency Parsing with Selectional Branching
  1. We also present a new transition-based dependency parsing algorithm that gives a complexity of O(n) for projective parsing and an expected linear time speed for non-projective parsing.
    Page 1, “Abstract”
  2. We first present a new transition-based parsing algorithm that gives a complexity of O(n) for projective parsing and an expected linear time speed for non-projective parsing.
    Page 1, “Introduction”
  3. We introduce a transition-based dependency parsing algorithm that is a hybrid between Nivre’s arc-eager and list-based algorithms (Nivre, 2003; Nivre, 2008).
    Page 1, “Transition-based dependency parsing”
  4. Nivre’s arc-eager is a projective parsing algorithm showing a complexity of O(n); Nivre’s list-based algorithm is a non-projective parsing algorithm showing a complexity of O(n²).
    Page 1, “Transition-based dependency parsing”
  5. The parsing complexity of a transition-based dependency parsing algorithm is determined by the number of transitions performed with respect to the number of tokens in a sentence, say n (Kübler et al., 2009).
    Page 2, “Transition-based dependency parsing”
  6. Even with such a high number of non-projective dependencies, our parsing algorithm still shows a linear growth in transitions.
    Page 3, “Transition-based dependency parsing”
  7. Table 3 shows a transition sequence generated by our parsing algorithm using gold-standard decisions.
    Page 3, “Transition-based dependency parsing”
  8. Our greedy approach outperforms both Nivre (2009) and Fernández-González and Gómez-Rodríguez (2012) who use different non-projective parsing algorithms.
    Page 8, “Experiments”
  9. Our parsing algorithm is most similar to Choi and Palmer (2011) who integrated our LEFT-REDUCE transition into Nivre’s list-based algorithm.
    Page 8, “Related work”
  10. There are other transition-based dependency parsing algorithms that take a similar approach; Nivre (2009) integrated a SWAP transition into Nivre’s arc-standard algorithm (Nivre, 2004) and Fernández-González and Gómez-Rodríguez (2012) integrated a buffer transition into Nivre’s arc-eager algorithm to handle non-projectivity.
    Page 9, “Related work”
  11. Coupled with our new hybrid parsing algorithm, ADAGRAD, rich nonlocal features, and bootstrapping, our parser gives higher parsing accuracy than most other transition-based dependency parsers in multiple languages and shows faster parsing speed.
    Page 9, “Conclusion”
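
The paper's algorithm is a hybrid of arc-eager and list-based parsing; as a rough sketch of the machinery it builds on (plain arc-eager, not the authors' hybrid or their LEFT-REDUCE transition), here is a minimal stack/buffer step function. Since each of the n tokens is shifted once and reduced at most once, the transition count, and hence the parsing complexity, is O(n).

from dataclasses import dataclass, field

@dataclass
class State:
    stack: list                                  # token indices
    buffer: list                                 # remaining token indices
    heads: dict = field(default_factory=dict)    # dependent -> head

def step(state, transition):
    # One arc-eager transition; preconditions are omitted for brevity.
    s, b = state.stack, state.buffer
    if transition == "SHIFT":
        s.append(b.pop(0))
    elif transition == "REDUCE":
        s.pop()
    elif transition == "LEFT-ARC":       # arc from buffer front to stack top
        state.heads[s.pop()] = b[0]
    elif transition == "RIGHT-ARC":      # arc from stack top to buffer front
        state.heads[b[0]] = s[-1]
        s.append(b.pop(0))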


highest score

Appears in 5 sentences as: highest score (3) highest scoring (2)
In Transition-based Dependency Parsing with Selectional Branching
  1. While generating T1, the parser adds tuples (s1j, p2j), …, (s1j, pkj) to a list λ for each low confidence prediction p1j given s1j. Then, new transition sequences are generated by using the b highest scoring predictions in λ, where b is the beam size.
    Page 4, “Selectional branching”
  2. Once all transition sequences are generated, a parse tree is built from a sequence with the highest score .
    Page 4, “Selectional branching”
  3. For each parsing state sij, a prediction is made by generating a feature vector xij ∈ X, feeding it into a classifier C1 that uses a feature map Φ(x, y) and a weight vector w to measure a score for each label y ∈ Y, and choosing a label with the highest score.
    Page 4, “Selectional branching”
  4. When there is a tie between labels with the highest score, the first one is chosen.
    Page 4, “Selectional branching”
  5. ‘k-arg max’ returns a set of k′ labels whose margins to C1(x) are smaller than any other label’s margin to C1(x) and also ≤ m, where k′ ≤ k. When m = 0, it returns a set of the highest scoring labels only, including C1(x). When m = 1, it returns a set of all labels.
    Page 4, “Selectional branching”
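
Item 5's ‘k-arg max’ is easy to state in code; a minimal sketch (ours, not the paper's implementation) over a label-to-score map:

def k_arg_max(scores, k, m):
    # Return up to k labels whose margin to the best score is <= m.
    # With m = 0 only the top-scoring label(s) survive; with m = 1 and
    # scores in [0, 1], every label survives.
    best = max(scores.values())
    close = [y for y, s in scores.items() if best - s <= m]
    close.sort(key=lambda y: -scores[y])
    return close[:k]

# Two labels within margin 0.1 of the best: a low-confidence prediction
# that selectional branching would branch on.
print(k_arg_max({"SHIFT": 0.45, "LEFT-ARC": 0.40, "REDUCE": 0.15}, k=2, m=0.1))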


logistic regression

Appears in 5 sentences as: logistic regression (4) logistic regression: (1)
In Transition-based Dependency Parsing with Selectional Branching
  1. This can be expressed as a logistic regression:
    Page 4, “Selectional branching”
  2. Algorithm 2 shows our adaptation of ADAGRAD with logistic regression for multi-class classification.
    Page 5, “Selectional branching”
  3. Note that when used with logistic regression , ADAGRAD takes a regular gradient instead of a subgradient method for updating weights.
    Page 5, “Selectional branching”
  4. Algorithm 2 ADAGRAD + logistic regression
    Page 5, “Selectional branching”
  5. For each instance, scores for all labels are measured by the logistic regression function f(x, y) in Section 3.3.
    Page 5, “Selectional branching”
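
A compact sketch of one ADAGRAD update with multiclass logistic regression (ours; α and ρ echo the settings quoted elsewhere on this page, but the code is an illustration rather than the paper's Algorithm 2). As item 3 notes, the log-loss objective is differentiable, so the update uses a regular gradient:

import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def adagrad_logreg_step(w, G, x, y, alpha=0.02, rho=0.1):
    # w: (num_labels, num_features) weights; G: accumulated squared
    # gradients (same shape); x: feature vector; y: gold label index.
    p = softmax(w @ x)                   # predicted label distribution
    g = np.outer(p, x)                   # gradient of the log-loss ...
    g[y] -= x                            # ... minus the gold-label indicator
    G += g * g                           # accumulate squared gradients
    w -= alpha * g / (rho + np.sqrt(G))  # per-weight adaptive step
    return w, G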


POS tags

Appears in 5 sentences as: POS tag (1) POS tagging (2) POS tags (3)
In Transition-based Dependency Parsing with Selectional Branching
  1. Moreover, all POS tag features from English are duplicated with coarse-grained POS tags provided by CoNLL-X.
    Page 6, “Experiments”
  2. Before parsing, POS tags were assigned to the training set by using 20-way jackknifing.
    Page 7, “Experiments”
  3. For the automatic generation of POS tags , we used the domain-specific model of Choi and Palmer (2012a)’s tagger, which gave 97.5% accuracy on the English evaluation set (0.2% higher than Collins (2002)’s tagger).
    Page 7, “Experiments”
  4. Bohnet and Nivre (2012)’s transition-based system jointly performs POS tagging and dependency parsing, which shows higher accuracy than ours.
    Page 7, “Experiments”
  5. Bohnet and Nivre (2012) introduced a transition-based system that jointly performed POS tagging and dependency parsing.
    Page 9, “Related work”
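
Item 2's 20-way jackknifing assigns automatic (rather than gold) POS tags to the training data so that parsing features see test-like tagger noise; a minimal sketch (ours, with an assumed train_tagger interface):

def jackknife_pos_tags(sentences, train_tagger, k=20):
    # Split the data into k folds and tag each fold with a tagger
    # trained on the other k-1 folds, so no sentence is tagged by a
    # model that saw it during training. train_tagger(data) is assumed
    # to return a tag(sentence) callable.
    folds = [sentences[i::k] for i in range(k)]
    tagged = []
    for i, fold in enumerate(folds):
        rest = [s for j, f in enumerate(folds) if j != i for s in f]
        tagger = train_tagger(rest)
        tagged.extend(tagger(s) for s in fold)
    return tagged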


UAS

Appears in 4 sentences as: UAS (4)
In Transition-based Dependency Parsing with Selectional Branching
  1. UAS
    Page 6, “Experiments”
  2. UAS: unlabeled attachment score, LAS: labeled attachment score.
    Page 6, “Experiments”
  3. [Table fragment] Approach | UAS | LAS | Time; first row: Zhang and Clark (2008), UAS 92.1
    Page 7, “Experiments”
  4. UAS: unlabeled attachment score, LAS: labeled attachment score, Time: seconds per sentence.
    Page 7, “Experiments”
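
UAS and LAS, as defined in items 2 and 4, reduce to per-token comparisons of predicted (head, label) pairs against gold; a minimal sketch (ours):

def attachment_scores(gold, pred):
    # gold, pred: lists of (head_index, dep_label) pairs, one per token.
    # UAS counts correct heads; LAS also requires the correct label.
    assert len(gold) == len(pred) and gold
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

# Second token: right head, wrong label -> UAS 1.0, LAS 0.5.
print(attachment_scores([(2, "nsubj"), (0, "root")],
                        [(2, "nsubj"), (0, "dep")]))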


dynamic programming

Appears in 3 sentences as: dynamic programming (3)
In Transition-based Dependency Parsing with Selectional Branching
  1. Coupled with dynamic programming , transition-based dependency parsing with beam search can be done very efficiently and gives significant improvement to parsing accuracy.
    Page 1, “Introduction”
  2. …which can be done very efficiently when it is coupled with dynamic programming (Zhang and Clark, 2008; Huang and Sagae, 2010; Zhang and Nivre, 2011; Huang et al., 2012; Bohnet and Nivre, 2012).
    Page 3, “Selectional branching”
  3. Huang and Sagae (2010) later applied dynamic programming to this approach and showed improved efficiency.
    Page 9, “Related work”


feature vector

Appears in 3 sentences as: feature vector (2) feature vectors (1)
In Transition-based Dependency Parsing with Selectional Branching
  1. For each parsing state sij, a prediction is made by generating a feature vector xij ∈ X, feeding it into a classifier C1 that uses a feature map Φ(x, y) and a weight vector w to measure a score for each label y ∈ Y, and choosing a label with the highest score.
    Page 4, “Selectional branching”
  2. During training, a training instance is generated for each parsing state sij by taking a feature vector xij and its true label yij.
    Page 5, “Selectional branching”
  3. Then, a subgradient is measured by taking all feature vectors together weighted by Q (line 6).
    Page 5, “Selectional branching”


gold-standard

Appears in 3 sentences as: gold-standard (3)
In Transition-based Dependency Parsing with Selectional Branching
  1. This decision is guided by gold-standard trees during training and by a classifier during decoding.
    Page 2, “Transition-based dependency parsing”
  2. Table 3 shows a transition sequence generated by our parsing algorithm using gold-standard decisions.
    Page 3, “Transition-based dependency parsing”
  3. Among all transition sequences generated by Mr−1, training instances from only T1 and Tg are used to train Mr, where T1 is the one-best sequence and Tg is a sequence giving the most accurate parse output compared to the gold-standard tree.
    Page 5, “Selectional branching”
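
Item 3 describes which sequences feed the next bootstrapping round; a minimal sketch (ours, with an assumed parse_accuracy function) of selecting T1 and Tg:

def select_bootstrap_sequences(sequences, gold_tree, parse_accuracy):
    # Keep only T1 (the one-best sequence, assumed first) and Tg (the
    # sequence whose parse output is closest to the gold-standard tree);
    # training instances for the next model come from these two alone.
    t1 = sequences[0]
    tg = max(sequences, key=lambda t: parse_accuracy(t, gold_tree))
    return t1, tg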


graph-based

Appears in 3 sentences as: graph-based (3)
In Transition-based Dependency Parsing with Selectional Branching
  1. The second block shows results from other kinds of parsing approaches (e.g., graph-based parsing, ensemble parsing, linear programming, dual decomposition).
    Page 7, “Experiments”
  2. Our parser gives a comparable accuracy to Koo and Collins (2010), which is a 3rd-order graph-based parsing approach.
    Page 7, “Experiments”
  3. Nivre and McDonald (2008) uses an ensemble model between transition-based and graph-based parsing approaches.
    Page 8, “Experiments”
