Efficient Staggered Decoding for Sequence Labeling
Kaji, Nobuhiro and Fujiwara, Yasuhiro and Yoshinaga, Naoki and Kitsuregawa, Masaru

Article Structure

Abstract

The Viterbi algorithm is the conventional decoding algorithm most widely adopted for sequence labeling.

Introduction

In the past decade, sequence labeling algorithms such as HMMs, CRFs, and Collins’ perceptrons have been extensively studied in the field of NLP (Rabiner, 1989; Lafferty et al., 2001; Collins, 2002).

Topics

Viterbi

Appears in 42 sentences as: VITERBI (8) Viterbi (35)
In Efficient Staggered Decoding for Sequence Labeling
  1. The Viterbi algorithm is the conventional decoding algorithm most widely adopted for sequence labeling.
    Page 1, “Abstract”
  2. Viterbi decoding is, however, prohibitively slow when the label set is large, because its time complexity is quadratic in the number of labels.
    Page 1, “Abstract”
  3. Experiments on three tasks (POS tagging, joint POS tagging and chunking, and supertagging) show that the new algorithm is several orders of magnitude faster than the basic Viterbi and a state-of-the-art algorithm, CARPEDIEM (Esposito and Radicioni, 2009).
    Page 1, “Abstract”
  4. This task, referred to as decoding, is usually carried out using the Viterbi algorithm (Viterbi, 1967).
    Page 1, “Introduction”
  5. The Viterbi algorithm has O(NL^2) time complexity, where N is the input size and L is the number of labels.
    Page 1, “Introduction”
  6. Although the Viterbi algorithm is generally efficient,
    Page 1, “Introduction”
  7. The proposed algorithm has three distinguishing properties: (1) It is much more efficient than the Viterbi algorithm when dealing with a large number of labels.
    Page 1, “Introduction”
  8. The results demonstrate that our algorithm is up to several orders of magnitude faster than the basic Viterbi algorithm and a state-of-the-art algorithm (Esposito and Radicioni, 2009); it makes exact decoding practical even in labeling problems with a large label set.
    Page 1, “Introduction”
  9. 2.2 Viterbi decoding
    Page 2, “Introduction”
  10. This is usually performed by applying the Viterbi algorithm.
    Page 2, “Introduction”
  11. The idea of the Viterbi algorithm is to use dynamic programming to compute ω(y_n). In HMMs, ω(y_n) can be defined as
    Page 2, “Introduction”
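
The excerpts above describe the first-order Viterbi recursion and its O(NL^2) cost. As a minimal sketch of that recursion for an HMM-style model, assuming log-score tables log_emit and log_trans (the NumPy layout and names are illustrative, not the paper's implementation):

```python
import numpy as np

def viterbi(log_emit, log_trans):
    """Minimal first-order Viterbi decoder (illustrative sketch).

    log_emit  : (N, L) array, log_emit[n, y]       ~ log p(x_n | y)
    log_trans : (L, L) array, log_trans[y_prev, y] ~ log p(y | y_prev)
    Returns the highest-scoring label sequence of length N.
    """
    N, L = log_emit.shape
    score = np.empty((N, L))            # score[n, y]: best partial path ending in label y at n
    back = np.zeros((N, L), dtype=int)  # backpointers for recovering that path

    score[0] = log_emit[0]
    for n in range(1, N):
        # Every current label is scored against all L previous labels:
        # this L x L work per position is the source of the O(NL^2) bound.
        cand = score[n - 1][:, None] + log_trans   # shape (L, L): previous x current
        back[n] = cand.argmax(axis=0)
        score[n] = cand.max(axis=0) + log_emit[n]

    path = [int(score[-1].argmax())]
    for n in range(N - 1, 0, -1):
        path.append(int(back[n, path[-1]]))
    return path[::-1]
```

The quadratic-in-L maximisation at every one of the NL nodes is exactly the cost that staggered decoding avoids paying on easy inputs while remaining exact.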

perceptron

Appears in 13 sentences as: Perceptron (1) perceptron (7) perceptrons (5)
In Efficient Staggered Decoding for Sequence Labeling
  1. In the past decade, sequence labeling algorithms such as HMMs, CRFs, and Collins’ perceptrons have been extensively studied in the field of NLP (Rabiner, 1989; Lafferty et al., 2001; Collins, 2002).
    Page 1, “Introduction”
  2. Among them, we focus on the perceptron algorithm (Collins, 2002).
    Page 2, “Introduction”
  3. In the perceptron, the score function f(x, y) is given as f(x, y) = w · φ(x, y), where w is the weight vector and φ(x, y) is the feature vector representation of the pair (x, y). By making the first-order Markov assumption, we have
    Page 2, “Introduction”
  4. Parameter w can be estimated in the same way as in the conventional perceptron algorithm.
    Page 2, “Introduction”
  5. the perceptron algorithm is presented in Section 4.
    Page 3, “Introduction”
  6. 4 Extension to the Perceptron
    Page 7, “Introduction”
  7. The discussion we have made so far can be applied to perceptrons.
    Page 7, “Introduction”
  8. In perceptrons, on the other hand, it is given as
    Page 7, “Introduction”
  9. where we explicitly distinguish the unigram feature function φ^u_k and the bigram feature function φ^b_k. Comparing the form of the two functions, we can see that our discussion on HMMs can be extended to perceptrons by substituting Σ_k w^u_k φ^u_k(x, y_n) and Σ_k w^b_k φ^b_k(x, y_{n-1}, y_n) for log p(x_n|y_n) and log p(y_n|y_{n-1}).
    Page 7, “Introduction”
  10. However, implementing the perceptron algorithm is not straightforward.
    Page 7, “Introduction”
  11. Finally, it is worth noting that we can use staggered decoding in training perceptrons as well, although such application lies outside the scope of this paper.
    Page 7, “Introduction”
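
To make excerpt 3 concrete, the sketch below scores a candidate label sequence with a first-order perceptron, f(x, y) = w · φ(x, y), with φ split into unigram and bigram feature functions as in excerpt 9; the feature templates and weight keys are invented for illustration only:

```python
from collections import defaultdict

def unigram_features(x, n, y_n):
    """Illustrative unigram features phi^u(x, y_n): token-level, one label."""
    return [f"w={x[n]}|y={y_n}", f"suf3={x[n][-3:]}|y={y_n}"]

def bigram_features(y_prev, y_n):
    """Illustrative bigram features phi^b(y_{n-1}, y_n): label transition only."""
    return [f"prev={y_prev}|y={y_n}"]

def score(w, x, y):
    """f(x, y) = w . phi(x, y) under the first-order Markov factorisation."""
    s = 0.0
    for n, y_n in enumerate(y):
        s += sum(w[f] for f in unigram_features(x, n, y_n))
        if n > 0:
            s += sum(w[f] for f in bigram_features(y[n - 1], y_n))
    return s

# Toy usage; unknown features default to weight 0.0.
w = defaultdict(float, {"w=dog|y=NN": 1.5, "prev=DT|y=NN": 0.8})
print(score(w, ["the", "dog"], ["DT", "NN"]))  # 2.3
```

Decoding then amounts to maximising this score over all label sequences y, for example with the Viterbi sketch shown earlier.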

sequence labeling

Appears in 12 sentences as: Sequence labeling (1) sequence labeling (11)
In Efficient Staggered Decoding for Sequence Labeling
  1. The Viterbi algorithm is the conventional decoding algorithm most widely adopted for sequence labeling.
    Page 1, “Abstract”
  2. In the past decade, sequence labeling algorithms such as HMMs, CRFs, and Collins’ perceptrons have been extensively studied in the field of NLP (Rabiner, 1989; Lafferty et al., 2001; Collins, 2002).
    Page 1, “Introduction”
  3. One important task in sequence labeling is how to find the most probable label sequence from among all possible ones.
    Page 1, “Introduction”
  4. As we shall see later, we need over 300 labels to reduce joint POS tagging and chunking into a single sequence labeling problem.
    Page 1, “Introduction”
  5. 2 Preliminaries. We first provide a brief overview of sequence labeling and introduce related work.
    Page 1, “Introduction”
  6. Sequence labeling is the problem of predicting labels
    Page 2, “Introduction”
  7. Discriminative models: Recent years have seen the emergence of discriminative training methods for sequence labeling (Lafferty et al., 2001; Taskar et al., 2003; Collins, 2002; Tsochantaridis et al., 2005).
    Page 2, “Introduction”
  8. To reduce joint tagging into a single sequence labeling problem, we produced the labels by concatenating the POS tag and the chunk tag (BIO format), e.g., NN/B-NP.
    Page 7, “Introduction”
  9. Following VITERBI, CARPEDIEM is the most relevant algorithm for sequence labeling in NLP, as discussed in Section 2.3.
    Page 8, “Introduction”
  10. with beam search, which is the approximate algorithm widely adopted for sequence labeling in NLP.
    Page 8, “Introduction”
  11. The sequence labeling algorithm is indispensable to modern statistical NLP.
    Page 9, “Introduction”
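
Excerpt 8 reduces joint POS tagging and chunking to a single labeling problem by concatenating tags into one label per token. A tiny sketch of that label construction (both tag sets truncated for illustration):

```python
# Concatenating a POS tag with a BIO chunk tag gives one joint label, e.g. NN/B-NP.
pos_tags = ["NN", "VB", "DT", "JJ"]          # truncated example tag set
chunk_tags = ["B-NP", "I-NP", "B-VP", "O"]   # truncated example tag set

joint_labels = [f"{pos}/{chunk}" for pos in pos_tags for chunk in chunk_tags]
print(joint_labels[0])    # NN/B-NP
print(len(joint_labels))  # 16 here; over 300 with the full tag sets
```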

POS tagging

Appears in 10 sentences as: POS tag (2) POS tagging (11)
In Efficient Staggered Decoding for Sequence Labeling
  1. Experiments on three tasks (POS tagging, joint POS tagging and chunking, and supertagging) show that the new algorithm is several orders of magnitude faster than the basic Viterbi and a state-of-the-art algorithm, CARPEDIEM (Esposito and Radicioni, 2009).
    Page 1, “Abstract”
  2. Now they are indispensable in a wide range of NLP tasks including chunking, POS tagging, NER and so on (Sha and Pereira, 2003; Tsuruoka and Tsujii, 2005; Lin and Wu, 2009).
    Page 1, “Introduction”
  3. For example, there are more than 40 and 2000 labels in POS tagging and supertagging, respectively (Brants, 2000; Matsuzaki et al., 2007).
    Page 1, “Introduction”
  4. As we shall see later, we need over 300 labels to reduce joint POS tagging and chunking into a single sequence labeling problem.
    Page 1, “Introduction”
  5. Experiments evaluate our algorithm on three tasks: POS tagging, joint POS tagging and chunking, and supertagging.
    Page 1, “Introduction”
  6. For example, as suggested in (Liang et al., 2008), adjacent labels do not provide strong information in POS tagging.
    Page 3, “Introduction”
  7. The proposed algorithm was evaluated with three tasks: POS tagging, joint POS tagging and chunking (called joint tagging for short), and supertagging.
    Page 7, “Introduction”
  8. To reduce joint tagging into a single sequence labeling problem, we produced the labels by concatenating the POS tag and the chunk tag (BIO format), e.g., NN/B-NP.
    Page 7, “Introduction”
  9. In supertagging, the token is the pair of the word and its oracle POS tag.
    Page 7, “Introduction”
  10. In POS tagging, we used unigrams of the current and its neighboring words, word bigrams, prefixes and suffixes of the current word, capitalization, and tag bigrams.
    Page 8, “Introduction”
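
Excerpt 10 lists the feature templates used for POS tagging. The sketch below builds token-level features of that kind; the exact window size, affix lengths, and string encoding are assumptions:

```python
def pos_features(x, n):
    """Feature templates of the kind listed above (illustrative only)."""
    w = x[n]
    prev_w = x[n - 1] if n > 0 else "<BOS>"
    next_w = x[n + 1] if n + 1 < len(x) else "<EOS>"
    return [
        f"w0={w}", f"w-1={prev_w}", f"w+1={next_w}",   # current and neighboring words
        f"bi={prev_w}_{w}",                            # word bigram
        f"pre3={w[:3]}", f"suf3={w[-3:]}",             # prefix / suffix
        f"cap={w[0].isupper()}",                       # capitalization
    ]

print(pos_features(["The", "cat", "sat"], 1))
```

Tag bigrams are label-side features and would be handled by a bigram feature function such as the one sketched under the perceptron topic.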

time complexity

Appears in 8 sentences as: time complexity (10)
In Efficient Staggered Decoding for Sequence Labeling
  1. Viterbi decoding is, however, prohibitively slow when the label set is large, because its time complexity is quadratic in the number of labels.
    Page 1, “Abstract”
  2. The time complexity of the Viterbi algorithm is O(NL^2) since there are NL nodes in the lattice and it takes O(L) time to evaluate the score of each node.
    Page 6, “Introduction”
  3. Staggered decoding has O(N) best case time complexity and O(NL^2) worst case time complexity.
    Page 6, “Introduction”
  4. Therefore, it has O(Σ_{m=1}^{M} N·4^{m-1}) time complexity if it terminates after the M-th search step.
    Page 6, “Introduction”
  5. In the best case, M = 1, the time complexity is O(N). In the worst case, M = log2 L + 1, the time complexity is of the order of O(NL^2) because Σ_{m=1}^{log2 L + 1} N·4^{m-1} < (4/3)NL^2.
    Page 6, “Introduction”
  6. Theorem 2 reveals that staggered decoding has the same order of time complexity as the Viterbi algorithm even in the worst case.
    Page 6, “Introduction”
  7. If l is initialized in this manner, the best case time complexity of our algorithm becomes O(NL).
    Page 6, “Introduction”
  8. As a result, we can no longer derive a reasonable upper bound for the time complexity.
    Page 7, “Introduction”
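
For reference, the geometric sum behind excerpts 4 and 5 works out as follows, assuming the m-th search step costs O(N·4^{m-1}) and at most M = log2 L + 1 steps are run:

```latex
\sum_{m=1}^{M} N\,4^{m-1} = N\,\frac{4^{M}-1}{3},
\qquad
M = \log_2 L + 1 \;\Rightarrow\;
N\,\frac{4^{M}-1}{3} = N\,\frac{4L^{2}-1}{3} < \tfrac{4}{3}\,N L^{2} = O(NL^{2}).
```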

bigram

Appears in 6 sentences as: bigram (4) bigrams (3)
In Efficient Staggered Decoding for Sequence Labeling
  1. where we explicitly distinguish the unigram feature function φ^u_k and the bigram feature function φ^b_k. Comparing the form of the two functions, we can see that our discussion on HMMs can be extended to perceptrons by substituting Σ_k w^u_k φ^u_k(x, y_n) and Σ_k w^b_k φ^b_k(x, y_{n-1}, y_n) for log p(x_n|y_n) and log p(y_n|y_{n-1}).
    Page 7, “Introduction”
  2. For bigram features, we compute their upper bound offline.
    Page 7, “Introduction”
  3. The simplest case is that the bigram features are independent of the token sequence x.
    Page 7, “Introduction”
  4. Even if bigram features are dependent on x, it is still possible to compute better bounds if several features are mutually exclusive, as discussed in (Esposito and Radicioni, 2009).
    Page 7, “Introduction”
  5. In POS tagging, we used unigrams of the current and its neighboring words, word bigrams, prefixes and suffixes of the current word, capitalization, and tag bigrams.
    Page 8, “Introduction”
  6. In supertagging, we used POS unigrams and bigrams in addition to the same features other than capitalization.
    Page 8, “Introduction”
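
Excerpts 2 and 3 compute an upper bound on the bigram-feature contribution offline when those features depend only on the label pair. A minimal sketch under that assumption (the weight-table layout is hypothetical):

```python
def bigram_upper_bounds(w_bigram, labels):
    """Offline upper bound on the bigram score for each current label y.

    w_bigram maps (y_prev, y) -> total weight of the bigram features firing
    for that label pair, assumed independent of the token sequence x.
    """
    return {
        y: max(w_bigram.get((y_prev, y), 0.0) for y_prev in labels)
        for y in labels
    }

w_bigram = {("DT", "NN"): 0.8, ("NN", "VB"): 0.5, ("VB", "DT"): -0.2}
print(bigram_upper_bounds(w_bigram, ["DT", "NN", "VB"]))
# {'DT': 0.0, 'NN': 0.8, 'VB': 0.5}
```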

beam search

Appears in 5 sentences as: beam search (5)
In Efficient Staggered Decoding for Sequence Labeling
  1. Approximate algorithms, such as beam search or island-driven search, have been proposed for speeding up decoding.
    Page 3, “Introduction”
  2. In our experiments, we use the path located by the left-to-right deterministic decoding (i.e., beam search with a beam width of 1).
    Page 6, “Introduction”
  3. with beam search, which is the approximate algorithm widely adopted for sequence labeling in NLP.
    Page 8, “Introduction”
  4. For this experiment, the beam width, B, was exhaustively calibrated: we tried B = {1, 2, 4, 8, ...} until the beam search achieved comparable accuracy to the exact algorithms, i.e., the difference fell below 0.1 in our case.
    Page 8, “Introduction”
  5. Table 4: Comparison with beam search (sent./sec).
    Page 9, “Introduction”
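
For comparison with the exact decoders, here is a minimal left-to-right beam search over the same (N, L) score tables used in the Viterbi sketch above; B = 1 gives the deterministic decoding of excerpt 2 (all implementation details are illustrative):

```python
import numpy as np

def beam_search(log_emit, log_trans, B=4):
    """Left-to-right beam search with beam width B (B=1 is deterministic decoding)."""
    N, L = log_emit.shape
    # Each hypothesis is (score, path); start from the first token's label scores.
    beam = sorted(((log_emit[0, y], [y]) for y in range(L)), reverse=True)[:B]
    for n in range(1, N):
        cands = [
            (s + log_trans[path[-1], y] + log_emit[n, y], path + [y])
            for s, path in beam
            for y in range(L)
        ]
        beam = sorted(cands, reverse=True)[:B]  # keep only the B best hypotheses
    return beam[0][1]                           # best surviving label sequence
```

Unlike the exact algorithms compared in the excerpts, the returned path can differ from the Viterbi path whenever the true best hypothesis falls out of the beam.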

score function

Appears in 5 sentences as: score function (6) score functions (1)
In Efficient Staggered Decoding for Sequence Labeling
  1. defining a score function f(x, y) and locating the
    Page 2, “Introduction”
  2. In HMMs, the score function f(x, y) is the joint probability distribution over (x, y). If we assume a one-to-one correspondence between the hidden states and the labels, the score function can be written as:
    Page 2, “Introduction”
  3. In the perceptron, the score function f(x, y) is given as f(x, y) = w · φ(x, y), where w is the weight vector and φ(x, y) is the feature vector representation of the pair (x, y). By making the first-order Markov assumption, we have
    Page 2, “Introduction”
  4. Given the score function f(x, y), we have to locate the best label sequence.
    Page 2, “Introduction”
  5. This can be clarified by comparing the score functions f(x, y). In HMMs, the score function can be written as
    Page 7, “Introduction”
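
Written out, the HMM score function of excerpt 2 (under a one-to-one correspondence between hidden states and labels) and its perceptron counterpart from excerpt 3 take roughly the following forms, reconstructed from the excerpts:

```latex
f_{\mathrm{HMM}}(x, y) = \prod_{n=1}^{N} p(x_n \mid y_n)\, p(y_n \mid y_{n-1}),
\qquad
f_{\mathrm{perc}}(x, y) = \boldsymbol{w} \cdot \boldsymbol{\phi}(x, y)
  = \sum_{n=1}^{N} \Bigl( \sum_k w^{u}_{k}\phi^{u}_{k}(x, y_n)
  + \sum_k w^{b}_{k}\phi^{b}_{k}(x, y_{n-1}, y_n) \Bigr).
```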

unigram

Appears in 4 sentences as: unigram (2) unigrams (2)
In Efficient Staggered Decoding for Sequence Labeling
  1. where we explicitly distinguish the unigram feature function φ^u_k and the bigram feature function φ^b_k. Comparing the form of the two functions, we can see that our discussion on HMMs can be extended to perceptrons by substituting Σ_k w^u_k φ^u_k(x, y_n) and Σ_k w^b_k φ^b_k(x, y_{n-1}, y_n) for log p(x_n|y_n) and log p(y_n|y_{n-1}).
    Page 7, “Introduction”
  2. For unigram features, we compute the maximum, max_y Σ_k w^u_k φ^u_k(x, y), as a preprocess in
    Page 7, “Introduction”
  3. In POS tagging, we used unigrams of the current and its neighboring words, word bigrams, prefixes and suffixes of the current word, capitalization, and tag bigrams.
    Page 8, “Introduction”
  4. In supertagging, we used POS unigrams and bigrams in addition to the same features other than capitalization.
    Page 8, “Introduction”
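
Excerpt 2 precomputes, for every token, the maximum unigram score over all labels as a preprocessing step. A small sketch of that step, reusing the illustrative unigram features from the perceptron sketch above:

```python
def unigram_score(w, x, n, y):
    """Sum_k w^u_k phi^u_k(x, y) for label y at position n (illustrative features)."""
    feats = [f"w={x[n]}|y={y}", f"suf3={x[n][-3:]}|y={y}"]
    return sum(w.get(f, 0.0) for f in feats)

def unigram_max_per_token(w, x, labels):
    """Preprocess: for each position n, the maximum unigram score over all labels."""
    return [max(unigram_score(w, x, n, y) for y in labels) for n in range(len(x))]

w = {"w=dog|y=NN": 1.5, "w=dog|y=VB": 0.2}
print(unigram_max_per_token(w, ["the", "dog"], ["NN", "VB", "DT"]))  # [0.0, 1.5]
```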
