Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Yoshimasa Tsuruoka, Jun'ichi Tsujii and Sophia Ananiadou

Article Structure

Abstract

Stochastic gradient descent (SGD) uses approximate gradients estimated from subsets of the training data and updates the parameters in an online fashion.

Introduction

Log-linear models (a.k.a. maximum entropy models) are one of the most widely used probabilistic models in the field of natural language processing (NLP).

Log-Linear Models

In this section, we briefly describe log-linear models used in NLP tasks and L1 regularization.

Topics

log-linear

Appears in 12 sentences as: Log-linear (2) log-linear (10)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularized log-linear models.
    Page 1, “Abstract”
  2. Log-linear models (a.k.a. maximum entropy models) are one of the most widely used probabilistic models in the field of natural language processing (NLP).
    Page 1, “Introduction”
  3. Log-linear models have a major advantage over other …
    Page 1, “Introduction”
  4. Kazama and Tsujii (2003) describe a method for training an L1-regularized log-linear model with a bound-constrained version of the BFGS algorithm (Nocedal, 1980).
    Page 1, “Introduction”
  5. An alternative approach to training a log-linear model is to use stochastic gradient descent (SGD) methods.
    Page 2, “Introduction”
  6. Section 2 provides a general description of log-linear models used in NLP.
    Page 2, “Introduction”
  7. Section 3 describes our stochastic gradient descent method for L1-regularized log-linear models.
    Page 2, “Introduction”
  8. In this section, we briefly describe log-linear models used in NLP tasks and L1 regularization.
    Page 2, “Log-Linear Models”
  9. A log-linear model defines the following probability distribution over possible structures y for input x:
    Page 2, “Log-Linear Models”
  10. The weights of the features in a log-linear model are optimized in such a way that they maximize the regularized conditional log-likelihood of the training data:
    Page 2, “Log-Linear Models”
  11. An alternative approach to producing compact models for log-linear models is to reformulate the …
    Page 7, “Log-Linear Models”
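
Excerpts 9 and 10 refer to equations that are not reproduced on this page. For reference, a log-linear model and the L1-regularized conditional log-likelihood are conventionally written as follows (generic notation; the paper's own symbols and equation numbering may differ):

\[
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i w_i f_i(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_i w_i f_i(x, y') \Big)
\]

\[
L(\mathbf{w}) = \sum_{j=1}^{N} \log p(y_j \mid x_j; \mathbf{w}) \; - \; C \sum_i |w_i|
\]

where the f_i are feature functions, the w_i their weights, N is the number of training samples, and C controls the strength of the L1 penalty (the second sum is the L1-norm of the weight vector).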

log-linear models

Appears in 12 sentences as: log-linear model (4) Log-linear models (2) log-linear models (6)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularized log-linear models.
    Page 1, “Abstract”
  2. Log-linear models (a.k.a. maximum entropy models) are one of the most widely used probabilistic models in the field of natural language processing (NLP).
    Page 1, “Introduction”
  3. Log-linear models have a major advantage over other …
    Page 1, “Introduction”
  4. Kazama and Tsujii (2003) describe a method for training an L1-regularized log-linear model with a bound-constrained version of the BFGS algorithm (Nocedal, 1980).
    Page 1, “Introduction”
  5. An alternative approach to training a log-linear model is to use stochastic gradient descent (SGD) methods.
    Page 2, “Introduction”
  6. Section 2 provides a general description of log-linear models used in NLP.
    Page 2, “Introduction”
  7. Section 3 describes our stochastic gradient descent method for L1-regularized log-linear models.
    Page 2, “Introduction”
  8. In this section, we briefly describe log-linear models used in NLP tasks and L1 regularization.
    Page 2, “Log-Linear Models”
  9. A log-linear model defines the following probability distribution over possible structures y for input x:
    Page 2, “Log-Linear Models”
  10. The weights of the features in a log-linear model are optimized in such a way that they maximize the regularized conditional log-likelihood of the training data:
    Page 2, “Log-Linear Models”
  11. An alternative approach to producing compact models for log-linear models is to reformulate the …
    Page 7, “Log-Linear Models”

named entity

Appears in 8 sentences as: named entities (1) Named Entity (1) named entity (7)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. We evaluate the effectiveness of our method in three applications: text chunking, named entity recognition, and part-of-speech tagging.
    Page 1, “Abstract”
  2. We evaluate the effectiveness of our method by using linear-chain conditional random fields (CRFs) and three traditional NLP tasks, namely, text chunking (shallow parsing), named entity recognition, and POS tagging.
    Page 2, “Introduction”
  3. The model is used for a variety of sequence labeling tasks such as POS tagging, chunking, and named entity recognition.
    Page 2, “Log-Linear Models”
  4. We evaluate the effectiveness of our training algorithm using linear-chain CRF models and three NLP tasks: text chunking, named entity recognition, and POS tagging.
    Page 5, “Log-Linear Models”
  5. 4.2 Named Entity Recognition
    Page 7, “Log-Linear Models”
  6. The second set of experiments used the named entity recognition data set provided for the BioNLP/NLPBA 2004 shared task (Kim et al., 2004). The training data consist of 18,546 sentences in which each token is annotated with the “IOB” tags representing biomedical named entities such as the names of proteins and RNAs.
    Page 7, “Log-Linear Models”
  7. We did not use any information on the named entity tags output by the GENIA tagger.
    Page 7, “Log-Linear Models”
  8. Figure 5: NLPBA 2004 named entity recognition task: Objective.
    Page 7, “Log-Linear Models”

POS tagging

Appears in 8 sentences as: POS tagger (1) POS tagging (4) POS tags (3)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. We evaluate the effectiveness of our method by using linear-chain conditional random fields (CRFs) and three traditional NLP tasks, namely, text chunking (shallow parsing), named entity recognition, and POS tagging.
    Page 2, “Introduction”
  2. The model is used for a variety of sequence labeling tasks such as POS tagging, chunking, and named entity recognition.
    Page 2, “Log-Linear Models”
  3. We evaluate the effectiveness of our training algorithm using linear-chain CRF models and three NLP tasks: text chunking, named entity recognition, and POS tagging.
    Page 5, “Log-Linear Models”
  4. The features used in this experiment were unigrams and bigrams of neighboring words, and unigrams, bigrams and trigrams of neighboring POS tags.
    Page 5, “Log-Linear Models”
  5. The training and test data were preprocessed by the GENIA tagger, which provided POS tags and chunk tags.
    Page 7, “Log-Linear Models”
  6. The third set of experiments used the POS tagging data in the Penn Treebank (Marcus et al., 1994).
    Page 7, “Log-Linear Models”
  7. The POS tags were extracted from the parse trees in the corpus.
    Page 7, “Log-Linear Models”
  8. It should be noted that training a CRF-based POS tagger using the whole WSJ corpus is not a trivial task and was once even deemed impractical in previous studies.
    Page 7, “Log-Linear Models”

objective function

Appears in 6 sentences as: objective function (5) objective functions (1)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. Also, SGD is very easy to implement because it does not need to use the Hessian information on the objective function .
    Page 2, “Introduction”
  2. SGD uses a small randomly-selected subset of the training samples to approximate the gradient of the objective function given by Equation 2.
    Page 3, “Log-Linear Models”
  3. The learning rate parameters for SGD were then tuned in such a way that they maximized the value of the objective function in 30 passes.
    Page 5, “Log-Linear Models”
  4. Figure 3 shows how the value of the objective function changed as the training proceeded.
    Page 6, “Log-Linear Models”
  5. The third column shows the final value of the objective function per sample.
    Page 6, “Log-Linear Models”
  6. Although L1-regularized and L1-constrained learning algorithms are not directly comparable because the objective functions are different, it would be interesting to compare the two approaches in terms of practicality.
    Page 8, “Log-Linear Models”
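
Excerpt 2 states that SGD approximates the gradient of the objective from a small randomly selected subset of the training samples, often a single sample. The sketch below shows such an update for a plain (binary, unstructured) log-linear model without the L1 term; it is illustrative only, since the paper works with structured CRFs, and handling the L1 penalty efficiently is exactly what the cumulative-penalty method addresses (see the sketch under the "weight vector" topic below). All names here (sgd_step, eta, the toy data) are assumptions, not the authors' code.

import math
import random

def sgd_step(w, x, y, eta):
    """One SGD update for a binary log-linear (logistic) model.

    w   : dict, feature name -> weight
    x   : dict, feature name -> value, for a single training sample
    y   : gold label, 0 or 1
    eta : learning rate
    The single-sample gradient (y - p(1|x)) * x is an unbiased estimate
    of the full-batch gradient of the conditional log-likelihood.
    """
    score = sum(w.get(f, 0.0) * v for f, v in x.items())
    p = 1.0 / (1.0 + math.exp(-score))           # p(y = 1 | x)
    for f, v in x.items():
        w[f] = w.get(f, 0.0) + eta * (y - p) * v

# Toy usage: each sample is a (feature dict, label) pair.
data = [({"bias": 1.0, "f1": 1.0}, 1), ({"bias": 1.0, "f2": 1.0}, 0)]
w = {}
for t in range(100):
    x, y = random.choice(data)                   # the random sample approximates the gradient
    sgd_step(w, x, y, eta=0.1)
print(w)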

CRF

Appears in 5 sentences as: CRF (5) CRF++ (1)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. If the structure is a sequence, the model is called a linear-chain CRF model, and the marginal probabilities of the features and the partition function can be efficiently computed by using the forward-backward algorithm.
    Page 2, “Log-Linear Models”
  2. If the structure is a tree, the model is called a tree CRF model, and the marginal probabilities can be computed by using the inside-outside algorithm.
    Page 2, “Log-Linear Models”
  3. We evaluate the effectiveness of our training algorithm using linear-chain CRF models and three NLP tasks: text chunking, named entity recognition, and POS tagging.
    Page 5, “Log-Linear Models”
  4. CRF++ version 0.50, a popular CRF library developed by Taku Kudo, is reported to take 4,021 seconds on Xeon 3.0 GHz processors to train the model using a richer feature set. CRFsuite version 0.4, a much faster library for CRFs, is reported to take 382 seconds on Xeon 3.0 GHz, using the same feature set as ours. Their library uses the OWL-QN algorithm for optimization.
    Page 6, “Log-Linear Models”
  5. … due to the differences in implementation and hardware platforms, these results demonstrate that our algorithm can actually result in a very fast implementation of a CRF trainer.
    Page 7, “Log-Linear Models”
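
Excerpt 1 notes that, for linear-chain CRFs, the partition function (and the feature marginals) can be computed with the forward-backward algorithm. Below is a minimal sketch of the forward pass computing log Z for one sequence; the score matrices are assumed to be given (in a real CRF they would be built from the weighted features), and the function name and array layout are illustrative assumptions.

import numpy as np

def log_partition(node_scores, trans_scores):
    """Forward algorithm in log space for a linear-chain model.

    node_scores  : (T, K) array, score of label k at position t
    trans_scores : (K, K) array, score of moving from label i to label j
    Returns log Z, the log of the sum of exp(score) over all K**T label sequences.
    """
    T, K = node_scores.shape
    alpha = node_scores[0].copy()                 # log forward scores at position 0
    for t in range(1, T):
        # alpha_j(t) = logsumexp_i( alpha_i(t-1) + trans[i, j] ) + node[t, j]
        m = alpha[:, None] + trans_scores         # (K, K): previous label i, next label j
        alpha = node_scores[t] + np.logaddexp.reduce(m, axis=0)
    return float(np.logaddexp.reduce(alpha))

# Toy usage: 3 positions, 2 labels; in a real CRF the scores come from w . f(x, y).
rng = np.random.default_rng(0)
print(log_partition(rng.normal(size=(3, 2)), rng.normal(size=(2, 2))))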

CRFs

Appears in 4 sentences as: CRFs (4)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. We evaluate the effectiveness of our method by using linear-chain conditional random fields (CRFs) and three traditional NLP tasks, namely, text chunking (shallow parsing), named entity recognition, and POS tagging.
    Page 2, “Introduction”
  2. CRF++ version 0.50, a popular CRF library developed by Taku Kudo, is reported to take 4,021 seconds on Xeon 3.0 GHz processors to train the model using a richer feature set. CRFsuite version 0.4, a much faster library for CRFs, is reported to take 382 seconds on Xeon 3.0 GHz, using the same feature set as ours. Their library uses the OWL-QN algorithm for optimization.
    Page 6, “Log-Linear Models”
  3. … (2006) report an f-score of 71.48 on the same data, using semi-Markov CRFs.
    Page 7, “Log-Linear Models”
  4. We have conducted experiments using CRFs and three NLP tasks, and demonstrated empirically that our training algorithm can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularization.
    Page 8, “Log-Linear Models”

weight vector

Appears in 4 sentences as: weight vector (5)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. L1 regularization penalizes the weight vector for its L1-norm (i.e. …
    Page 1, “Introduction”
  2. In effect, it forces the weight to receive the total L1 penalty that would have been applied if the weight had been updated by the true gradients, assuming that the current weight vector resides in the same orthant as the true weight vector.
    Page 4, “Log-Linear Models”
  3. … problem as an L1-constrained problem (Lee et al., 2006), where the conditional log-likelihood of the training data is maximized under a fixed constraint on the L1-norm of the weight vector.
    Page 8, “Log-Linear Models”
  4. … (2008) describe efficient algorithms for projecting a weight vector onto the L1-ball.
    Page 8, “Log-Linear Models”
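
Excerpt 2 describes the cumulative-penalty idea: after each update, a weight should have received the total L1 penalty it would have received under the true gradients, clipped so that it never crosses zero; combined with the lazy application mentioned under the "feature space" topic further down, only the features present in the current sample need to be touched. The sketch below is a reconstruction of that update for illustration; it is not the authors' code, and details such as the learning-rate schedule are omitted.

from collections import defaultdict

w = defaultdict(float)   # feature weights
q = defaultdict(float)   # cumulative L1 penalty already applied to each weight (signed)
u = 0.0                  # total L1 penalty each weight could have received so far

def apply_l1_penalty(i):
    """Let w[i] receive its outstanding cumulative penalty, clipping at zero."""
    z = w[i]
    if w[i] > 0.0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0.0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    q[i] += w[i] - z

def update(active_features, gradient, eta, C, N):
    """One SGD step with cumulative L1 penalty, applied lazily.

    active_features : ids of the features occurring in the current sample
    gradient        : dict, feature id -> gradient of the unregularized term
    eta, C, N       : learning rate, L1 strength, number of training samples
    """
    global u
    u += eta * C / N                  # penalty every weight is entitled to after this step
    for i in active_features:
        w[i] += eta * gradient[i]     # ordinary (unregularized) gradient step
        apply_l1_penalty(i)           # then settle the accumulated L1 penalty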

CoNLL

Appears in 3 sentences as: CoNLL (3)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. The first set of experiments used the text chunking data set provided for the CoNLL 2000 shared task. The training data consists of 8,936 sentences in which each token is annotated with the “IOB” tags representing text chunks such as noun and verb phrases.
    Page 5, “Log-Linear Models”
  2. Figure 3: CoNLL 2000 chunking task: Objective
    Page 6, “Log-Linear Models”
  3. Figure 4: CoNLL 2000 chunking task: Number of active features.
    Page 6, “Log-Linear Models”
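
Excerpt 1 says each token carries an IOB tag encoding chunks such as noun and verb phrases. For readers unfamiliar with the scheme, the short sketch below decodes a tag sequence of that B-X / I-X / O form into labeled spans; the example sequence and function name are illustrative, not taken from the corpus.

def iob_to_spans(tags):
    """Convert IOB tags (e.g. B-NP, I-NP, O) into (start, end, chunk_type) spans."""
    spans, start, chunk_type = [], None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel flushes the final chunk
        starts_new = tag == "O" or tag.startswith("B-") or tag[2:] != chunk_type
        if starts_new:
            if start is not None:
                spans.append((start, i, chunk_type))
            start, chunk_type = (i, tag[2:]) if tag != "O" else (None, None)
        # otherwise a matching I- tag extends the current chunk
    return spans

# Illustrative example, e.g. for tokens like ["He", "reckons", "the", "current", "account"]:
print(iob_to_spans(["B-NP", "B-VP", "B-NP", "I-NP", "I-NP"]))
# -> [(0, 1, 'NP'), (1, 2, 'VP'), (2, 5, 'NP')]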

feature space

Appears in 3 sentences as: feature space (3)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. In NLP applications, the dimension of the feature space tends to be very large; it can easily reach several million, so applying the L1 penalty to all features significantly slows down the weight-updating process.
    Page 2, “Introduction”
  2. Since the dimension of the feature space can be very large, it can significantly slow down the weight update process.
    Page 3, “Log-Linear Models”
  3. Another merit is that it allows us to perform the application of L1 penalty in a lazy fashion, so that we do not need to update the weights of the features that are not used in the current sample, which leads to much faster training when the dimension of the feature space is large.
    Page 4, “Log-Linear Models”

part-of-speech

Appears in 3 sentences as: Part-Of-Speech (1) part-of-speech (2)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. We evaluate the effectiveness of our method in three applications: text chunking, named entity recognition, and part-of-speech tagging.
    Page 1, “Abstract”
  2. The applications range from simple classification tasks such as text classification and history-based tagging (Ratnaparkhi, 1996) to more complex structured prediction tasks such as part-of-speech (POS) tagging (Lafferty et al., 2001), syntactic parsing (Clark and Curran, 2004) and semantic role labeling (Toutanova et al., 2005).
    Page 1, “Introduction”
  3. 4.3 Part-Of-Speech Tagging
    Page 7, “Log-Linear Models”

syntactic parsing

Appears in 3 sentences as: syntactic parsing (3)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. The applications range from simple classification tasks such as text classification and history-based tagging (Ratnaparkhi, 1996) to more complex structured prediction tasks such as part-of-speech (POS) tagging (Lafferty et al., 2001), syntactic parsing (Clark and Curran, 2004) and semantic role labeling (Toutanova et al., 2005).
    Page 1, “Introduction”
  2. SGD was recently used for NLP tasks including machine translation (Tillmann and Zhang, 2006) and syntactic parsing (Smith and Eisner, 2008; Finkel et al., 2008).
    Page 2, “Introduction”
  3. The model can be used for tasks like syntactic parsing (Finkel et al., 2008) and semantic role labeling (Cohn and Blunsom, 2005).
    Page 2, “Log-Linear Models”

See all papers in Proc. ACL 2009 that mention syntactic parsing.

See all papers in Proc. ACL that mention syntactic parsing.

Back to top.

unigrams

Appears in 3 sentences as: unigrams (4)
In Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
  1. The features used in this experiment were unigrams and bigrams of neighboring words, and unigrams, bigrams and trigrams of neighboring POS tags.
    Page 5, “Log-Linear Models”
  2. For the features, we used unigrams of neighboring chunk tags, substrings (shorter than 10 characters) of the current word, and the shape of the word (e.g. “IL-2” is converted into “AA-#”), on top of the features used in the text chunking experiments.
    Page 7, “Log-Linear Models”
  3. For the features, we used unigrams and bigrams of neighboring words, prefixes and suffixes of the current word, and some characteristics of the word.
    Page 7, “Log-Linear Models”
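
The excerpts above list the kinds of features used: unigrams and bigrams of neighboring words, prefixes and suffixes of the current word, and a word-shape feature that maps e.g. “IL-2” to “AA-#”. A minimal sketch of such feature extraction is given below; the exact templates, window sizes, and the full shape mapping used in the paper may differ, so treat the details as illustrative assumptions.

def word_shape(word):
    """Map characters to coarse classes, e.g. 'IL-2' -> 'AA-#' (as in the excerpt above)."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("#")
        else:
            out.append(ch)
    return "".join(out)

def token_features(words, i):
    """Features for position i: word uni/bigrams in a small window, affixes, shape."""
    feats = []
    for d in (-1, 0, 1):                          # unigrams of neighboring words
        if 0 <= i + d < len(words):
            feats.append("w[%d]=%s" % (d, words[i + d]))
    for d in (-1, 0):                             # bigrams of neighboring words
        if 0 <= i + d and i + d + 1 < len(words):
            feats.append("w[%d..%d]=%s_%s" % (d, d + 1, words[i + d], words[i + d + 1]))
    w = words[i]
    for k in range(1, min(len(w), 9) + 1):        # prefixes and suffixes up to 9 characters
        feats.append("pre=" + w[:k])
        feats.append("suf=" + w[-k:])
    feats.append("shape=" + word_shape(w))
    return feats

print(token_features(["expression", "of", "IL-2"], 2))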
