Practical Very Large Scale CRFs
Lavergne, Thomas and Cappé, Olivier and Yvon, François

Article Structure

Abstract

Conditional Random Fields (CRFs) are a widely-used approach for supervised sequence labelling, notably due to their ability to handle large description spaces and to integrate structural dependency between labels.

Introduction

Conditional Random Fields (CRFs) (Lafferty et al., 2001; Sutton and McCallum, 2006) constitute a widely-used and effective approach for supervised structure learning tasks involving the mapping between complex objects such as strings and trees.

Conditional Random Fields

In this section, we recall the basics of Conditional Random Fields (CRFs) (Lafferty et al., 2001; Sutton and McCallum, 2006) and introduce the notations that will be used throughout.

Topics

CRFs

Appears in 22 sentences as: CRFs (23)
In Practical Very Large Scale CRFs
  1. Conditional Random Fields (CRFs) are a widely-used approach for supervised sequence labelling, notably due to their ability to handle large description spaces and to integrate structural dependency between labels.
    Page 1, “Abstract”
  2. In this paper, we address the issue of training very large CRFs, containing up to hundreds of output labels and several billion features.
    Page 1, “Abstract”
  3. Our experiments demonstrate that very large CRFs can be trained efficiently and that very large models are able to improve the accuracy, while delivering compact parameter sets.
    Page 1, “Abstract”
  4. Conditional Random Fields (CRFs) (Lafferty et al., 2001; Sutton and McCallum, 2006) constitute a widely-used and effective approach for supervised structure learning tasks involving the mapping between complex objects such as strings and trees (the standard linear-chain form is sketched after this list).
    Page 1, “Introduction”
  5. An important property of CRFs is their ability to handle large and redundant feature sets and to integrate structural dependency between output labels.
    Page 1, “Introduction”
  6. However, even for simple linear chain CRFs, the complexity of learning and inference grows quadratically with the number of output labels.
    Page 1, “Introduction”
  7. Most empirical studies on CRFs thus either consider tasks with a restricted output space (typically in the order of a few dozen output labels), heuristically reduce the use of features, especially of features that test pairs of adjacent labels, and/or propose heuristics to simulate contextual dependencies, via extended tests on the observations (see discussions in, e.g., (Punyakanok et al., 2005; Liang et al., 2008)).
    Page 1, “Introduction”
  8. In this paper, we show that the sparsity that is induced by ℓ1-penalized estimation of CRFs can be used to reduce the total training time, while yielding extremely compact models.
    Page 1, “Introduction”
  9. We study and compare three recently introduced ways to implement the ℓ1 penalty for CRFs: orthant-wise quasi-Newton (Andrew and Gao, 2007), stochastic gradient descent (Tsuruoka et al., 2009) and coordinate descent (Sokolovska et al., 2010), concluding that these methods have complementary strengths and weaknesses.
    Page 1, “Introduction”
  10. Based on an efficient implementation of these algorithms, we were able to train very large CRFs containing more than a hundred output labels and up to several billion features, yielding results that are as good as or better than the best reported results for two NLP benchmarks, text phonetization and part-of-speech tagging.
    Page 2, “Introduction”
  11. The rest of the paper is organized as follows: we first recall the basics of CRFs in Section 2, and discuss three ways to train CRFs with an ℓ1 penalty in Section 3.
    Page 2, “Introduction”
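
For reference, the model these excerpts recall (items 1 and 4 above, and Section 2 of the paper) has the following standard linear-chain form; this is a minimal sketch written with the unigram parameters μ_{y,x} and bigram parameters λ_{y′,y,x} used under the "bigram" and "unigram" topics below, not a verbatim copy of the paper's equations.

    % Sketch of the standard linear-chain CRF with unigram parameters \mu
    % and bigram parameters \lambda (illustrative, not the paper's exact equation).
    p_\theta(\mathbf{y} \mid \mathbf{x})
      = \frac{1}{Z_\theta(\mathbf{x})}
        \exp\Bigl\{ \sum_{t=1}^{T} \bigl( \mu_{y_t, x_t} + \lambda_{y_{t-1}, y_t, x_t} \bigr) \Bigr\},
    \qquad
    Z_\theta(\mathbf{x})
      = \sum_{\mathbf{y} \in Y^T}
        \exp\Bigl\{ \sum_{t=1}^{T} \bigl( \mu_{y_t, x_t} + \lambda_{y_{t-1}, y_t, x_t} \bigr) \Bigr\}.

The ℓ1-penalized estimation mentioned in items 8 and 9 adds a penalty proportional to ‖θ‖1 to the negative log-likelihood of this model; the full elastic net objective is sketched under the "objective function" topic below.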

feature sets

Appears in 20 sentences as: feature set (8) Feature Sets (2) feature sets (10)
In Practical Very Large Scale CRFs
  1. An important property of CRFs is their ability to handle large and redundant feature sets and to integrate structural dependency between output labels.
    Page 1, “Introduction”
  2. Limiting the feature set or the number of output labels is however frustrating for many NLP tasks, where the type and number of potentially relevant features are very large.
    Page 1, “Introduction”
  3. Second, the experimental demonstration that using large output label sets is doable and that very large feature sets actually help improve prediction accuracy.
    Page 2, “Introduction”
  4. In addition, we show how sparsity in structured feature sets can be used in incremental training regimes, where long-range features are progressively incorporated in the model insofar as the shorter range features have proven useful.
    Page 2, “Introduction”
  5. We then detail several implementation issues that need to be addressed when dealing with massive feature sets in Section 4.
    Page 2, “Introduction”
  6. Based on a study of three NLP benchmarks, the authors of (Tsuruoka et al., 2009) claim this approach to be much faster than the orthant-wise approach and yet to yield very comparable performance, while selecting slightly larger feature sets.
    Page 4, “Conditional Random Fields”
  7. The n-grm feature sets (n ∈ {1, 3, 5, 7}) include all features testing embedded windows of k letters, for all 0 ≤ k ≤ n; the n-grm- setting is similar, but only includes the window of length n; in the n-grm+ setting, we add features for odd-size windows; in the n-grm++ setting, we add all sequences of letters up to size n occurring in the current window.
    Page 6, “Conditional Random Fields”
  8. For instance, the active bigram features at position t = 2 in the sequence x='lemma' are as follows: the 3-grm feature set contains f_{y′,y}, f_{y′,y,e} and f_{y′,y,lem}; only the latter appears in the 3-grm- setting.
    Page 6, “Conditional Random Fields”
  9. In the 3-grm+ feature set, we also have f_{y′,y,le} and f_{y′,y,em}.
    Page 6, “Conditional Random Fields”
  10. The 3-grm++ feature set additionally includes f_{y′,y,l} and f_{y′,y,m}.
    Page 6, “Conditional Random Fields”
  11. Our baseline feature set also contains tests on individual words and on pairs of words within a window of 5 words (see the sketch after this list).
    Page 7, “Conditional Random Fields”
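
To make item 11 concrete, the Python sketch below enumerates such baseline observation tests: individual words and pairs of words inside a 5-word window. The exact templates (in particular which word pairs inside the window are tested) are not given in these excerpts, so enumerating all pairs here is an assumption made purely for illustration.

    from itertools import combinations

    def window_features(words, t, size=5):
        """Enumerate baseline observation tests at position t: individual
        words and pairs of words inside a window of `size` words centred
        on t (an illustrative assumption, not the paper's exact templates)."""
        half = size // 2
        lo, hi = max(0, t - half), min(len(words), t + half + 1)
        window = [(i - t, words[i]) for i in range(lo, hi)]   # (offset, word)
        feats = [f"w[{off}]={w}" for off, w in window]        # single-word tests
        feats += [f"w[{o1}]|w[{o2}]={w1}|{w2}"                # word-pair tests
                  for (o1, w1), (o2, w2) in combinations(window, 2)]
        return feats

    # Example: observation tests anchored on "sat".
    print(window_features("the cat sat on the mat".split(), t=2))

In a CRF, each such observation test is then conjoined with either a single label (unigram feature) or a pair of adjacent labels (bigram feature), which is what makes the bigram feature sets so large.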

recursions

Appears in 10 sentences as: Recursions (1) recursions (9)
In Practical Very Large Scale CRFs
  1. (2009), who use approximations to simplify the forward-backward recursions.
    Page 1, “Introduction”
  2. and backward recursions
    Page 3, “Conditional Random Fields”
  3. These recursions require a number of operations that grows quadratically with the number of output labels |Y|.
    Page 3, “Conditional Random Fields”
  4. One advantage of the resulting algorithm, termed BCD in the following, is that the update of θ_k only involves carrying out the forward-backward recursions for the set of sequences that contain symbols x such that at least one {f_k(y′, y, x)}_{(y,y′)∈Y²} is non null, which can be much smaller than the whole training set.
    Page 4, “Conditional Random Fields”
  5. 4.1 Sparse Forward-Backward Recursions
    Page 5, “Conditional Random Fields”
  6. Using M, the forward-backward recursions can be written in matrix form (a sparse version is sketched after this list).
    Page 5, “Conditional Random Fields”
  7. (Sokolovska et al., 2010) explains how computational savings can be obtained using the fact that the vector/matrix products in the recursions above only involve the sparse matrix M_{t+1}(y′, y). They can thus be computed with exactly r(x_{t+1}) multiplications instead of |Y|².
    Page 5, “Conditional Random Fields”
  8. Forward-backward recursions can thus be truncated: in our experiments, this divided the computational cost by 1.8 on average.
    Page 5, “Conditional Random Fields”
  9. CRF++ and CRFsgd perform the forward-backward recursions in the log-domain, which guarantees that numerical over/underflows are avoided no matter the sequence length T^{(i)}.
    Page 5, “Conditional Random Fields”
  10. Throwing in more features has the effect of making M much more dense, mitigating the benefits of the sparse recursions.
    Page 8, “Conditional Random Fields”
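
Items 5 to 8 describe how the sparsity of the bigram parameters is exploited so that each vector/matrix product in the recursions costs r(x_t) multiplications instead of |Y|². The Python sketch below illustrates the idea for the forward pass: since most λ_{y′,y,x_t} are null, most entries of M_t equal 1, and the product splits into a trivial dense part and a sparse correction. The data layout and the placement of the unigram term are assumptions made for illustration, not the paper's implementation.

    import numpy as np

    def sparse_forward(unigram_scores, sparse_bigrams, n_labels):
        """Scaled forward recursion exploiting sparse bigram potentials.

        unigram_scores: list of T arrays of shape (n_labels,), holding
            exp(mu_{y, x_t}) for each position t.
        sparse_bigrams: list of T lists of (y_prev, y, lam) triples giving
            only the non-null bigram parameters lambda_{y', y, x_t}.
        Because M_t(y', y) = exp(lam) differs from 1 only on those triples,
        computing alpha_{t-1} M_t costs about r(x_t) multiplications
        instead of n_labels**2.  Illustrative sketch only.
        """
        T = len(unigram_scores)
        alpha = np.zeros((T, n_labels))
        alpha[0] = unigram_scores[0]
        alpha[0] /= alpha[0].sum()                    # scaling instead of log-domain
        for t in range(1, T):
            alpha[t] = alpha[t - 1].sum()             # "all ones" part of M_t
            for y_prev, y, lam in sparse_bigrams[t]:  # sparse correction, r(x_t) terms
                alpha[t, y] += alpha[t - 1, y_prev] * (np.exp(lam) - 1.0)
            alpha[t] *= unigram_scores[t]
            alpha[t] /= alpha[t].sum()                # renormalise to avoid underflow
        return alpha

The same decomposition applies to the backward recursion.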

bigram

Appears in 7 sentences as: bigram (7)
In Practical Very Large Scale CRFs
  1. In the sequel, we distinguish between two types of feature functions: unigram features f_{y,x}, associated with parameters μ_{y,x}, and bigram features f_{y′,y,x}, associated with parameters λ_{y′,y,x}.
    Page 2, “Conditional Random Fields”
  2. On the other hand, bigram features {f_{y′,y,x}}_{(y′,y,x)∈Y²×X} are helpful in modelling dependencies between successive labels.
    Page 2, “Conditional Random Fields”
  3. Assume the set of bigram features {λ_{y′,y,x_{t+1}}}_{(y′,y)∈Y²} is sparse, with only r(x_{t+1}) ≪ |Y|² non-null values, and define the |Y| × |Y| sparse matrix M_{t+1}(y′, y) (see the sketch after this list).
    Page 5, “Conditional Random Fields”
  4. The features used in the Nettalk experiments take the form f_{y,w} (unigram) and f_{y′,y,w} (bigram), where w is an n-gram of letters.
    Page 6, “Conditional Random Fields”
  5. For instance, the active bigram features at position t = 2 in the sequence x='lemma' are as follows: the 3-grm feature set contains f_{y′,y}, f_{y′,y,e} and f_{y′,y,lem}; only the latter appears in the 3-grm- setting.
    Page 6, “Conditional Random Fields”
  6. The first important issue is to assess the benefits of using large feature sets, notably including features testing both a bigram of labels and an observation.
    Page 7, “Conditional Random Fields”
  7. During the second iteration, we add tests on bigrams of words, and on suffixes and prefixes up to length 4.
    Page 8, “Conditional Random Fields”
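
Item 3 refers to a |Y| × |Y| sparse matrix whose defining equation did not survive extraction. A plausible reconstruction, consistent with the notation above and with the way M_{t+1}(y′, y) is used under the "recursions" topic, is sketched here; the exact form in the paper may differ.

    % Assumed reconstruction of the sparse matrix built from the bigram
    % parameters at position t+1 (illustrative; the defining equation is
    % missing from these excerpts).
    M_{t+1}(y', y) \;=\; \exp\bigl(\lambda_{y', y, x_{t+1}}\bigr),
    \qquad (y', y) \in Y^2 .
    % Since only r(x_{t+1}) of the \lambda_{y', y, x_{t+1}} are non null,
    % all but r(x_{t+1}) entries of M_{t+1} equal 1, which is what the
    % sparse forward-backward recursions of Section 4.1 exploit.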

objective function

Appears in 5 sentences as: objective function (5)
In Practical Very Large Scale CRFs
  1. The objective function is then a smooth convex function to be minimized over an unconstrained parameter space.
    Page 3, “Conditional Random Fields”
  2. In the following, we will jointly use both penalty terms, yielding the so-called elastic net penalty (Zou and Hastie, 2005), which corresponds to the objective function (see the sketch after this list).
    Page 3, “Conditional Random Fields”
  3. However, the introduction of an ℓ1 penalty term makes the optimization of (6) more problematic, as the objective function is no longer differentiable at 0.
    Page 3, “Conditional Random Fields”
  4. In this formulation, the objective function recovers its smoothness and can be optimized with conventional algorithms, subject to domain constraints.
    Page 3, “Conditional Random Fields”
  5. Full convergence, reflected by a stabilization of the objective function, is however not so easily achieved.
    Page 7, “Conditional Random Fields”
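
For concreteness, the elastic net objective mentioned in item 2 combines the negative conditional log-likelihood with both penalty terms. A sketch of its generic form is given below; the regularization weights ρ1 and ρ2 are illustrative names, and the paper's equation (6) may be written slightly differently.

    % Generic elastic-net penalized CRF training objective: negative
    % conditional log-likelihood plus an l1 term and a squared l2 term.
    \ell(\theta) \;=\;
      -\sum_{i=1}^{N} \log p_\theta\bigl(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}\bigr)
      \;+\; \rho_1 \lVert \theta \rVert_1
      \;+\; \frac{\rho_2}{2}\, \lVert \theta \rVert_2^2 .

Setting ρ2 = 0 recovers the pure ℓ1 penalty discussed in items 3 and 4, whose non-differentiability at zero is what motivates the orthant-wise, stochastic gradient and coordinate descent methods compared in the paper.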

error rates

Appears in 4 sentences as: error rate (1) Error rates (1) error rates (3)
In Practical Very Large Scale CRFs
  1. Results are reported in terms of phoneme error rates or tag error rates on the test set.
    Page 6, “Conditional Random Fields”
  2. Table 1: Features jointly testing label pairs and the observation are useful (error rates and feature counts).
    Page 7, “Conditional Random Fields”
  3. Table 3: Error rates of the three regularizers on the Nettalk task.
    Page 7, “Conditional Random Fields”
  4. It achieves an error rate of 2.63% (resp.
    Page 9, “Conditional Random Fields”

part-of-speech

Appears in 4 sentences as: Part-of-Speech (1) part-of-speech (3)
In Practical Very Large Scale CRFs
  1. Based on an efficient implementation of these algorithms, we were able to train very large CRFs containing more than a hundred output labels and up to several billion features, yielding results that are as good as or better than the best reported results for two NLP benchmarks, text phonetization and part-of-speech tagging.
    Page 2, “Introduction”
  2. Our experiments use two standard NLP tasks, phonetization and part-of-speech tagging, chosen here to illustrate two very different situations, and to allow for comparison with results reported elsewhere in the literature.
    Page 6, “Conditional Random Fields”
  3. 5.1.2 Part-of-Speech Tagging
    Page 7, “Conditional Random Fields”
  4. Our second benchmark is a part-of-speech (POS) tagging task using the PennTreeBank corpus (Marcus et al., 1993), which provides us with a quite different condition.
    Page 7, “Conditional Random Fields”

unigram

Appears in 4 sentences as: unigram (4)
In Practical Very Large Scale CRFs
  1. Using only unigram features {f_{y,x}}_{(y,x)∈Y×X} results in a model equivalent to a simple bag-of-tokens position-by-position logistic regression model.
    Page 2, “Conditional Random Fields”
  2. The same idea can be used when the set {μ_{y,x_{t+1}}}_{y∈Y} of unigram features is sparse.
    Page 5, “Conditional Random Fields”
  3. The features used in the Nettalk experiments take the form f_{y,w} (unigram) and f_{y′,y,w} (bigram), where w is an n-gram of letters.
    Page 6, “Conditional Random Fields”
  4. After four iterations, we throw in features testing word trigrams, subject to the corresponding unigram block being active.
    Page 8, “Conditional Random Fields”

CRF++

Appears in 3 sentences as: CRF (1) CRF++ (2)
In Practical Very Large Scale CRFs
  1. Hence, any numerical optimization strategy may be used and practical solutions include limited memory BFGS (L-BFGS) (Liu and Nocedal, 1989), which is used in the popular CRF++ (Kudo, 2005) and CRFsuite (Okazaki, 2007) packages; conjugate gradient (Nocedal and Wright, 2006) and Stochastic Gradient Descent (SGD) (Bottou, 2004; Vishwanathan et al., 2006), used in CRFsgd (Bottou, 2007).
    Page 3, “Conditional Random Fields”
  2. CRF++ and CRFsgd perform the forward-backward recursions in the log-domain, which guarantees that numerical over/underflows are avoided no matter the sequence length T^{(i)} (see the sketch after this list).
    Page 5, “Conditional Random Fields”
  3. The CRF package developed in the course of this study implements many algorithmic optimizations and allows us to design innovative training strategies, such as the one presented in Section 5.4.
    Page 9, “Conditional Random Fields”
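
Item 2 contrasts log-domain recursions with the scaling alternative used elsewhere in the paper. The Python sketch below shows one log-domain forward step, where the log-sum-exp trick keeps every quantity finite regardless of the sequence length; it is an illustration of the technique, not a reproduction of the CRF++ or CRFsgd internals.

    import numpy as np
    from scipy.special import logsumexp

    def log_forward_step(log_alpha_prev, log_M_t, log_unigram_t):
        """One log-domain forward step:
            log alpha_t(y) = logsumexp_{y'}( log alpha_{t-1}(y') + log M_t(y', y) )
                             + log u_t(y)
        Working with log-potentials avoids numerical over/underflow no
        matter how long the sequence is, at the price of a logsumexp per cell.
        """
        # log_alpha_prev: (n,), log_M_t: (n, n), log_unigram_t: (n,)
        return logsumexp(log_alpha_prev[:, None] + log_M_t, axis=0) + log_unigram_t

    # Toy usage with 3 labels and random log-potentials.
    rng = np.random.default_rng(0)
    print(log_forward_step(rng.normal(size=3), rng.normal(size=(3, 3)), rng.normal(size=3)))

The scaling approach mentioned under the "recursions" topic instead renormalises alpha_t at each position, avoiding the per-cell logsumexp.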

part-of-speech tagging

Appears in 3 sentences as: Part-of-Speech Tagging (1) part-of-speech tagging (2)
In Practical Very Large Scale CRFs
  1. Based on an efficient implementation of these algorithms, we were able to train very large CRFs containing more than a hundred output labels and up to several billion features, yielding results that are as good as or better than the best reported results for two NLP benchmarks, text phonetization and part-of-speech tagging.
    Page 2, “Introduction”
  2. Our experiments use two standard NLP tasks, phonetization and part-of-speech tagging, chosen here to illustrate two very different situations, and to allow for comparison with results reported elsewhere in the literature.
    Page 6, “Conditional Random Fields”
  3. 5.1.2 Part-of-Speech Tagging
    Page 7, “Conditional Random Fields”

POS tagging

Appears in 3 sentences as: POS tagging (3)
In Practical Very Large Scale CRFs
  1. [Results-table fragment rather than a sentence: 3-grm 10.74% with 14.3M features and 14.59% with 0.3M features; 5-grm 8.48% with 132.5M features and 11.54% with 2.5M features; followed by the table's "POS tagging" block.]
    Page 7, “Conditional Random Fields”
  2. For the POS tagging task, BCD appears to be impractically slow to train compared to the other approaches (SGD takes about 40 minutes to train, OWL-QN about 1 hour), due to the simultaneous increase in the sequence length and in the number of observations.
    Page 8, “Conditional Random Fields”
  3. Based on this observation, we have designed an incremental training strategy for the POS tagging task, where more specific features are progressively incorporated into the model if the corresponding less specific feature is active.
    Page 8, “Conditional Random Fields”
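
Item 3 describes an incremental training strategy in which more specific feature templates are only instantiated where their less specific counterparts survived the ℓ1 penalty (see also the items on word bigrams, affixes and trigrams under the "feature sets" and "unigram" topics). The Python sketch below captures that control flow; train_l1, active_features and expand are hypothetical helper names standing in for ℓ1-penalized training, listing the non-zero parameters and deriving the licensed refinements, so this is an illustration of the strategy as described, not the authors' implementation.

    def incremental_training(data, initial_templates, schedule,
                             train_l1, active_features, expand):
        """Sketch of the incremental strategy of item 3.

        schedule: successive generations of more specific feature templates
            (e.g. word bigrams and affixes, then word trigrams).
        train_l1, active_features, expand: hypothetical helpers for
            l1-penalized training, listing features with non-zero weight,
            and producing the refinements licensed by an active feature.
        """
        templates = set(initial_templates)
        model = train_l1(data, templates)
        for generation in schedule:
            # Only add a more specific template when its less specific
            # parent received a non-zero weight in the previous round.
            licensed = {t for parent in active_features(model)
                        for t in expand(parent, generation)}
            templates |= licensed
            model = train_l1(data, templates)   # retrain with the enlarged set
        return model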
