Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
Zhiheng Huang, Yi Chang, Bo Long, Jean-Francois Crespo, Anlei Dong, Sathiya Keerthi, and Su-Lin Wu

Article Structure

Abstract

Sequential modeling has been widely used in a variety of important applications including named entity recognition and shallow parsing.

Introduction

Sequence tagging algorithms including HMMs (Rabiner, 1989), CRFs (Lafferty et al., 2001), and Collins’s perceptron (Collins, 2002) have been widely employed in NLP applications.

Problem formulation

In this section, we formulate the sequential decoding problem in the context of the perceptron algorithm (Collins, 2002) and CRFs (Lafferty et al., 2001).

Background

We present the existing algorithms for both 1-best and k-best sequential decoding in this section.

Proposed Algorithms

In this section, we propose A* based sequential decoding algorithms that can efficiently handle datasets with a large number of labels.

Experiments

We compare the aforementioned 1-best and k-best sequential decoding algorithms using five datasets in this section.

Related work

The Viterbi algorithm is the only exact algorithm widely adopted in NLP applications.

Conclusions

We have presented and evaluated the A* and iterative A* algorithms for 1-best sequential decoding in this paper.

Topics

Viterbi

Appears in 72 sentences as: Viterb (1) Viterbi (84)
In Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
  1. In this paper we propose 1-best A*, 1-best iterative A*, k-best A* and k-best iterative Viterbi A* algorithms for sequential decoding.
    Page 1, “Abstract”
  2. In particular, we show that iterative Viterbi A* decoding can be several times or orders of magnitude faster than the state-of-the-art algorithm for tagging tasks with a large number of labels.
    Page 1, “Abstract”
  3. Traditionally, the Viterbi algorithm (Viterbi, 1967) is used.
    Page 1, “Introduction”
  4. Unfortunately, due to its O(TL²) time complexity, where T is the input token size and L is the label size, Viterbi decoding can become prohibitively slow when the label size is large (say, larger than 200).
    Page 1, “Introduction”
  5. The Viterbi algorithm cannot find the best sequences in tolerable
    Page 1, “Introduction”
  6. To the best of our knowledge, the state-of-the-art k-best sequential decoding algorithm is Viterbi A*.
    Page 1, “Introduction”
  7. In this paper, we generalize the iterative process from the work of (Kaji et al., 2010) and propose a k-best sequential decoding algorithm, namely iterative Viterbi A*.
    Page 1, “Introduction”
  8. We show that A* with a proper heuristic can outperform the classic Viterbi decoding.
    Page 1, “Introduction”
  9. (3) We propose k-best A* and k-best iterative Viterbi A* algorithms.
    Page 1, “Introduction”
  10. 3.1 1-Best Viterbi
    Page 2, “Background”
  11. The Viterbi algorithm is a classic dynamic programming based decoding algorithm.
    Page 2, “Background”
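
The snippets above describe the Viterbi algorithm as a classic dynamic-programming decoder with O(TL²) cost. The following is a minimal sketch of 1-best Viterbi decoding for a first-order model, assuming per-position label scores and a label-transition matrix (the function and array names are illustrative, not taken from the paper); the pairwise comparison of labels at each position is what makes the cost quadratic in L.

    import numpy as np

    def viterbi_decode(unigram_scores, transition):
        """1-best Viterbi decoding for a first-order sequence model.

        unigram_scores: (T, L) array; score of label l at position t.
        transition:     (L, L) array; score of moving from label l' to label l.
        Runs in O(T * L^2) time, the cost discussed in the paper.
        """
        T, L = unigram_scores.shape
        delta = np.empty((T, L))                 # best score of any path ending at (t, l)
        backptr = np.zeros((T, L), dtype=int)    # best predecessor label for (t, l)

        delta[0] = unigram_scores[0]
        for t in range(1, T):
            # scores[l_prev, l] = delta[t-1, l_prev] + transition[l_prev, l]
            scores = delta[t - 1][:, None] + transition
            backptr[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + unigram_scores[t]

        # Follow back-pointers from the best final label.
        best = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            best.append(int(backptr[t, best[-1]]))
        return list(reversed(best)), float(delta[-1].max())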

CoNLL

Appears in 15 sentences as: CoNLL (19)
In Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
  1. We apply 1-best and k-best sequential decoding algorithms to five NLP tagging tasks: Penn TreeBank (PTB) POS tagging, CoNLL 2000 joint POS tagging and chunking, CoNLL 2003 joint POS tagging, chunking and named entity tagging, HPSG supertagging (Matsuzaki et al., 2007) and a search query named entity recognition (NER) dataset.
    Page 6, “Experiments”
  2. As in (Kaji et al., 2010), we combine the POS tags and chunk tags to form joint tags for CoNLL 2000 dataset, e.g., NN|B-NP.
    Page 7, “Experiments”
  3. Similarly we combine the POS tags, chunk tags, and named entity tags to form joint tags for CoNLL 2003 dataset, e.g., PRP$|I-NP|O.
    Page 7, “Experiments”
  4. POS and supertag datasets assign tags to tokens while CoNLL 2000, CoNLL 2003 and search query datasets assign tags to phrases.
    Page 7, “Experiments”
  5. We use the standard BIO encoding for CoNLL 2000, CoNLL 2003 and search query datasets.
    Page 7, “Experiments”
  6. In particular, we use the unigrams of the current word and its neighboring words, word bigrams, prefixes and suffixes of the current word, capitalization, all-number, punctuation, and tag bigrams for the POS, CoNLL 2000 and CoNLL 2003 datasets.
    Page 7, “Experiments”
  7. They are 97.00, 94.70, 95.80, 90.60 and 88.60 for POS, CoNLL 2000, CoNLL 2003, supertag, and search query respectively.
    Page 7, “Experiments”
  8. Although the max iteration size is bounded to ⌈log2 L⌉ for each position (for example, 9 for the CoNLL 2003 dataset), the total iteration number for the whole lattice may be greater than ⌈log2 L⌉ as different positions may not expand at the same time.
    Page 7, “Experiments”
  9. For example, for CoNLL 2003 dataset, the former can decode 2239 sentences per second while the latter only decodes 225 sentences per second.
    Page 7, “Experiments”
  10. Taking CoNLL 2003 data as an example, the removal of the pruning slows down 1-best iterative Viterbi decoding from 2239 to 604 sentences/second.
    Page 8, “Experiments”
  11. For example, it only reaches 11 sentences per second for CoNLL 2003 dataset.
    Page 8, “Experiments”
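
Several snippets above mention forming joint tags (e.g., NN|B-NP, PRP$|I-NP|O) by combining POS, chunk, and named entity tags. Below is a minimal sketch of one plausible reading of that step, assuming parallel BIO-encoded tag sequences; the helper name join_tags is illustrative, not from the paper.

    def join_tags(*tag_sequences, sep="|"):
        """Combine parallel tag sequences (e.g., POS, chunk, NER) into joint tags.

        join_tags(["NN", "VBZ"], ["B-NP", "B-VP"]) -> ["NN|B-NP", "VBZ|B-VP"]
        The joint label set is whatever combinations occur in the data, which is
        why the label size L grows large for the CoNLL 2000 and 2003 tasks.
        """
        return [sep.join(tags) for tags in zip(*tag_sequences)]

    # Example resembling the CoNLL 2003 case (POS | chunk | named entity):
    print(join_tags(["PRP$", "NN"], ["I-NP", "I-NP"], ["O", "O"]))
    # ['PRP$|I-NP|O', 'NN|I-NP|O']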

beam search

Appears in 9 sentences as: beam search (9)
In Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
  1. The lower bound can be initialized to the best sequence score using a beam search (with beam size being 1).
    Page 3, “Background”
  2. Algorithm listing excerpt:
     1: lb = best score from beam search
     2: init lattice
     3: for i = 0; ; i++ do
          if i % 2 == 0 then
            y = forward()
          else
            y = backward()
          end if
          if y consists of active labels only then
            return y
    Page 3, “Background”
  3. The search is similar to the beam search with beam size being 1.
    Page 5, “Proposed Algorithms”
  4. We initialize the lower bound lb with the k-th best score from beam search (with beam size being k) at line 1.
    Page 6, “Proposed Algorithms”
  5. Note that the beam search is performed on the original lattice which consists of L active labels per position.
    Page 6, “Proposed Algorithms”
  6. The beam search time is negligible compared to the total decoding time.
    Page 6, “Proposed Algorithms”
  7. The beam search decoding is also shown as a baseline.
    Page 7, “Experiments”
  8. The decoding speed of iterative Viterbi A* can even be comparable to that of beam search.
    Page 8, “Experiments”
  9. Apart from the exact decoding, approximate decoding algorithms such as beam search are also related to our work.
    Page 8, “Related work”
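
In the snippets above, the role of beam search is to supply a lower bound lb on the best sequence score before exact decoding starts. A minimal sketch under the same scoring assumptions as the Viterbi sketch earlier: with beam size 1 the search reduces to a greedy left-to-right pass. For k-best decoding the paper instead takes the k-th best score from a beam of size k, which this sketch does not cover.

    import numpy as np

    def greedy_lower_bound(unigram_scores, transition):
        """Beam search with beam size 1: keep only the locally best label at each step.

        Returns the score of the greedy path, usable as the lower bound lb for
        the pruning described in the paper. Runs in O(T * L) time.
        """
        T, L = unigram_scores.shape
        label = int(unigram_scores[0].argmax())
        lb = float(unigram_scores[0, label])
        for t in range(1, T):
            scores = transition[label] + unigram_scores[t]
            label = int(scores.argmax())
            lb += float(scores[label])
        return lb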

CRFs

Appears in 7 sentences as: CRFs (7)
In Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
  1. Sequence tagging algorithms including HMMs (Rabiner, 1989), CRFs (Lafferty et al., 2001), and Collins’s perceptron (Collins, 2002) have been widely employed in NLP applications.
    Page 1, “Introduction”
  2. In this section, we formulate the sequential decoding problem in the context of the perceptron algorithm (Collins, 2002) and CRFs (Lafferty et al., 2001).
    Page 2, “Problem formulation”
  3. and a CRFs model is
    Page 2, “Problem formulation”
  4. For CRFs, Z(x) is an instance-specific normalization function
    Page 2, “Problem formulation”
  5. If x is given, the decoding is to find the best y which maximizes the score f(y, x) for perceptron or the probability p(y|x) for CRFs.
    Page 2, “Problem formulation”
  6. As Z(x) is a constant for any given input sequence x, the decoding for perceptron or CRFs is identical, that is,
    Page 2, “Problem formulation”
  7. A similar idea was applied to CRFs as well (Cohn, 2006; Jeong, 2009).
    Page 8, “Related work”
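
For readability, the formulation these snippets refer to can be written out in a standard linear-chain form (reconstructed from the surrounding definitions, not copied verbatim from the paper):

    f(y, x)  = \sum_{t=1}^{T} \sum_{k} w_k \, f_k(y_t, y_{t-1}, x_t)
    p(y | x) = \exp f(y, x) / Z(x),   where   Z(x) = \sum_{y'} \exp f(y', x)

Since Z(x) does not depend on y, \arg\max_y p(y | x) = \arg\max_y f(y, x), which is why decoding is identical for the perceptron and CRF models.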

perceptron

Appears in 7 sentences as: perceptron (7)
In Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
  1. Sequence tagging algorithms including HMMs (Rabiner, 1989), CRFs (Lafferty et al., 2001), and Collins’s perceptron (Collins, 2002) have been widely employed in NLP applications.
    Page 1, “Introduction”
  2. In this section, we formulate the sequential decoding problem in the context of the perceptron algorithm (Collins, 2002) and CRFs (Lafferty et al., 2001).
    Page 2, “Problem formulation”
  3. Formally, a perceptron model is
    Page 2, “Problem formulation”
  4. If x is given, the decoding is to find the best y which maximizes the score f(y, x) for perceptron or the probability p(y|x) for CRFs.
    Page 2, “Problem formulation”
  5. As Z(x) is a constant for any given input sequence x, the decoding for perceptron or CRFs is identical, that is,
    Page 2, “Problem formulation”
  6. Due to the long CRF training time (days to weeks even with stochastic gradient descent training) on these datasets with large label sizes, we choose the perceptron algorithm for training.
    Page 7, “Experiments”
  7. We note that the selection of training algorithm does not affect the decoding process: the decoding is identical for both CRF and perceptron training algorithms.
    Page 7, “Experiments”
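
Snippet 6 above notes that the models are trained with the perceptron rather than CRFs. Below is a minimal sketch of the structured perceptron update in the spirit of Collins (2002), assuming a feature map phi(x, y) and a 1-best decoder such as the Viterbi sketch earlier; all names here are illustrative. Averaging the weights over updates is a common refinement, but it does not change the decoding step the paper focuses on.

    import numpy as np

    def perceptron_train(data, phi, decode, num_features, epochs=5):
        """Structured perceptron training, sketched.

        data:   iterable of (x, gold_y) pairs.
        phi:    feature function mapping (x, y) to a numpy feature vector.
        decode: 1-best decoder (e.g., Viterbi) returning the best y under weights w.
        """
        w = np.zeros(num_features)
        for _ in range(epochs):
            for x, gold_y in data:
                pred_y = decode(x, w)
                if pred_y != gold_y:
                    # Reward features of the gold sequence, penalize the prediction.
                    w += phi(x, gold_y) - phi(x, pred_y)
        return w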

named entity

Appears in 5 sentences as: named entity (6)
In Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
  1. Sequential modeling has been widely used in a variety of important applications including named entity recognition and shallow parsing.
    Page 1, “Abstract”
  2. For example, to pursue a high recall ratio, a named entity recognition system may have to adopt k-best sequences in case the true entities are not recognized in the 1-best one.
    Page 1, “Introduction”
  3. We apply 1-best and k-best sequential decoding algorithms to five NLP tagging tasks: Penn TreeBank (PTB) POS tagging, CoNLL 2000 joint POS tagging and chunking, CoNLL 2003 joint POS tagging, chunking and named entity tagging, HPSG supertagging (Matsuzaki et al., 2007) and a search query named entity recognition (NER) dataset.
    Page 6, “Experiments”
  4. Similarly we combine the POS tags, chunk tags, and named entity tags to form joint tags for CoNLL 2003 dataset, e.g., PRP$|I-NP|O.
    Page 7, “Experiments”
  5. Note that by such tag joining, we are able to offer different tag decodings (for example, chunking and named entity tagging) simultaneously.
    Page 7, “Experiments”

unigrams

Appears in 4 sentences as: Unigram (1) unigram (1) unigrams (2)
In Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
  1. To simplify the discussion, we divide the features into two groups: unigram label features and bigram label features.
    Page 2, “Problem formulation”
  2. Unigram features are of the form f_k(y_t, x_t), which are concerned with the current label and arbitrary feature patterns from the input sequence.
    Page 2, “Problem formulation”
  3. In particular, we use the unigrams of the current word and its neighboring words, word bigrams, prefixes and suffixes of the current word, capitalization, all-number, punctuation, and tag bigrams for the POS, CoNLL 2000 and CoNLL 2003 datasets.
    Page 7, “Experiments”
  4. For the supertag dataset, we use the same features for the word inputs, and the unigrams and bigrams for the gold POS inputs.
    Page 7, “Experiments”
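
Below is a small sketch of the unigram/bigram label-feature split described in snippets 1 and 2: unigram features pair the current label with input patterns, while bigram features involve the previous and current labels. The concrete patterns are illustrative only; the paper's full feature set also includes prefixes, suffixes, capitalization, and so on.

    def unigram_features(words, t, label):
        """Features of the form f_k(y_t, x_t): current label plus input patterns."""
        feats = [f"w0={words[t]}|y={label}"]
        if t > 0:
            feats.append(f"w-1={words[t-1]}|y={label}")
        if t + 1 < len(words):
            feats.append(f"w+1={words[t+1]}|y={label}")
        return feats

    def bigram_features(prev_label, label):
        """Features involving both the previous and current labels (tag bigrams).

        The paper's bigram features f_k(y_t, y_{t-1}, x_t) may also condition on
        the input; this sketch keeps only the label pair.
        """
        return [f"y-1={prev_label}|y={label}"]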

beam size

Appears in 3 sentences as: beam size (3)
In Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
  1. The lower bound can be initialized to the best sequence score using a beam search (with beam size being 1).
    Page 3, “Background”
  2. The search is similar to the beam search with beam size being 1.
    Page 5, “Proposed Algorithms”
  3. We initialize the lower bound lb with the k-th best score from beam search (with beam size being k) at line 1.
    Page 6, “Proposed Algorithms”

bigrams

Appears in 3 sentences as: Bigram (1) bigrams (3)
In Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
  1. Bigram features are of the form f_k(y_t, y_{t-1}, x_t), which are concerned with both the previous and the current labels.
    Page 2, “Problem formulation”
  2. In particular, we use the unigrams of the current word and its neighboring words, word bigrams, prefixes and suffixes of the current word, capitalization, all-number, punctuation, and tag bigrams for the POS, CoNLL 2000 and CoNLL 2003 datasets.
    Page 7, “Experiments”
  3. For the supertag dataset, we use the same features for the word inputs, and the unigrams and bigrams for the gold POS inputs.
    Page 7, “Experiments”

POS tagging

Appears in 3 sentences as: POS tagging (3) POS tags (2)
In Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
  1. We apply 1-best and k-best sequential decoding algorithms to five NLP tagging tasks: Penn TreeBank (PTB) POS tagging, CoNLL 2000 joint POS tagging and chunking, CoNLL 2003 joint POS tagging, chunking and named entity tagging, HPSG supertagging (Matsuzaki et al., 2007) and a search query named entity recognition (NER) dataset.
    Page 6, “Experiments”
  2. As in (Kaji et al., 2010), we combine the POS tags and chunk tags to form joint tags for CoNLL 2000 dataset, e.g., NN|B-NP.
    Page 7, “Experiments”
  3. Similarly we combine the POS tags, chunk tags, and named entity tags to form joint tags for CoNLL 2003 dataset, e.g., PRP$|I-NP|O.
    Page 7, “Experiments”

time complexity

Appears in 3 sentences as: time complexity (4)
In Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
  1. Unfortunately, due to its O(TL²) time complexity, where T is the input token size and L is the label size, Viterbi decoding can become prohibitively slow when the label size is large (say, larger than 200).
    Page 1, “Introduction”
  2. Therefore, it has ∑_{i=0}^{m} T·4^i time complexity if it terminates at the m-th iteration.
    Page 4, “Background”
  3. In the best case in which m = 0, the time complexity is T. In the worst case in which m = ⌈log2 L⌉ − 1 (⌈·⌉ is the ceiling function which maps a real number to the smallest following integer), the time complexity is ∑_{i=0}^{⌈log2 L⌉−1} T·4^i.
    Page 4, “Background”
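
As a worked expansion of the claim above (our own arithmetic, not quoted from the paper): if iteration i processes 2^i active labels per position, a Viterbi pass over the lattice costs T·4^i, and summing the geometric series up to the final iteration m = ⌈log2 L⌉ − 1 gives

    \sum_{i=0}^{\lceil \log_2 L \rceil - 1} T \, 4^{i}
      = T \cdot \frac{4^{\lceil \log_2 L \rceil} - 1}{3}
      < \frac{4}{3} \, T L^{2},

so the worst case stays within a constant factor of plain O(TL²) Viterbi decoding, while the best case (m = 0) costs only T.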
