Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Stochastic gradient descent (SGD) uses approximate gradients estimated from subsets of the training data and updates the parameters in an online fashion.

Log-linear models (a.k.a. maximum entropy models) are one of the most widely-used probabilistic models in the field of natural language processing (NLP).

In this section, we briefly describe log-linear models used in NLP tasks and L1 regularization.

Appears in 12 sentences as: Log-linear (2) log-linear (10)

In *Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty*

- Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularized log-linear models. (Page 1, “Abstract”)
- Log-linear models (a.k.a. maximum entropy models) are one of the most widely-used probabilistic models in the field of natural language processing (NLP). (Page 1, “Introduction”)
- Log-linear models have a major advantage over other … (Page 1, “Introduction”)
- Kazama and Tsujii (2003) describe a method for training an L1-regularized log-linear model with a bound-constrained version of the BFGS algorithm (Nocedal, 1980). (Page 1, “Introduction”)
- An alternative approach to training a log-linear model is to use stochastic gradient descent (SGD) methods. (Page 2, “Introduction”)
- Section 2 provides a general description of log-linear models used in NLP. (Page 2, “Introduction”)
- Section 3 describes our stochastic gradient descent method for L1-regularized log-linear models. (Page 2, “Introduction”)
- In this section, we briefly describe log-linear models used in NLP tasks and L1 regularization. (Page 2, “Log-Linear Models”)
- A log-linear model defines the following probabilistic distribution over possible structure y for input x: (Page 2, “Log-Linear Models”)
- The weights of the features in a log-linear model are optimized in such a way that they maximize the regularized conditional log-likelihood of the training data: (Page 2, “Log-Linear Models”)
- An alternative approach to producing compact models for log-linear models is to reformulate the … (Page 7, “Log-Linear Models”)
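
The distribution mentioned in the excerpts above — p(y|x) proportional to exp(w · f(x, y)), normalized by the partition function Z(x) — can be sketched in a few lines. This is an illustrative toy, not the paper's implementation; the feature names and candidate labels are made up.

```python
import math

def log_linear_probs(weights, feature_fn, candidates):
    """p(y|x) = exp(w . f(x, y)) / Z(x), with Z(x) summing over all candidates y."""
    # Score each candidate structure by the dot product of the weight vector
    # with its (sparse) feature vector.
    scores = {y: sum(weights.get(f, 0.0) * v for f, v in feature_fn(y).items())
              for y in candidates}
    m = max(scores.values())                 # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())                   # partition function (up to a factor exp(m))
    return {y: e / z for y, e in exps.items()}

# Hypothetical toy: two candidate labels, one indicator feature per label.
weights = {"label=A": 1.0, "label=B": 0.0}
probs = log_linear_probs(weights, lambda y: {"label=" + y: 1.0}, ["A", "B"])
```

For structured y (sequences, trees), the only change is that the sum defining Z(x) ranges over exponentially many candidates, which is why the dynamic-programming algorithms mentioned later (forward-backward, inside-outside) matter.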

See all papers in *Proc. ACL 2009* that mention log-linear.

See all papers in *Proc. ACL* that mention log-linear.

Back to top.

Appears in 8 sentences as: named entities (1) Named Entity (1) named entity (7)

- We evaluate the effectiveness of our method in three applications: text chunking, named entity recognition, and part-of-speech tagging. (Page 1, “Abstract”)
- We evaluate the effectiveness of our method by using linear-chain conditional random fields (CRFs) and three traditional NLP tasks, namely, text chunking (shallow parsing), named entity recognition, and POS tagging. (Page 2, “Introduction”)
- The model is used for a variety of sequence labeling tasks such as POS tagging, chunking, and named entity recognition. (Page 2, “Log-Linear Models”)
- We evaluate the effectiveness of our training algorithm using linear-chain CRF models and three NLP tasks: text chunking, named entity recognition, and POS tagging. (Page 5, “Log-Linear Models”)
- 4.2 Named Entity Recognition (Page 7, “Log-Linear Models”)
- The second set of experiments used the named entity recognition data set provided for the BioNLP/NLPBA 2004 shared task (Kim et al., 2004). The training data consist of 18,546 sentences in which each token is annotated with the “IOB” tags representing biomedical named entities such as the names of proteins and RNAs. (Page 7, “Log-Linear Models”)
- We did not use any information on the named entity tags output by the GENIA tagger. (Page 7, “Log-Linear Models”)
- Figure 5: NLPBA 2004 named entity recognition task: Objective. (Page 7, “Log-Linear Models”)

Appears in 8 sentences as: POS tagger (1) POS tagging (4) POS tags (3)

- We evaluate the effectiveness of our method by using linear-chain conditional random fields (CRFs) and three traditional NLP tasks, namely, text chunking (shallow parsing), named entity recognition, and POS tagging. (Page 2, “Introduction”)
- The model is used for a variety of sequence labeling tasks such as POS tagging, chunking, and named entity recognition. (Page 2, “Log-Linear Models”)
- We evaluate the effectiveness of our training algorithm using linear-chain CRF models and three NLP tasks: text chunking, named entity recognition, and POS tagging. (Page 5, “Log-Linear Models”)
- The features used in this experiment were unigrams and bigrams of neighboring words, and unigrams, bigrams and trigrams of neighboring POS tags. (Page 5, “Log-Linear Models”)
- The training and test data were preprocessed by the GENIA tagger, which provided POS tags and chunk tags. (Page 7, “Log-Linear Models”)
- The third set of experiments used the POS tagging data in the Penn Treebank (Marcus et al., 1994). (Page 7, “Log-Linear Models”)
- The POS tags were extracted from the parse trees in the corpus. (Page 7, “Log-Linear Models”)
- It should be noted that training a CRF-based POS tagger using the whole WSJ corpus is not a trivial task and was once even deemed impractical in previous studies. (Page 7, “Log-Linear Models”)

Appears in 6 sentences as: objective function (5) objective functions (1)

- Also, SGD is very easy to implement because it does not need Hessian information about the objective function. (Page 2, “Introduction”)
- SGD uses a small randomly-selected subset of the training samples to approximate the gradient of the objective function given by Equation 2. (Page 3, “Log-Linear Models”)
- The learning rate parameters for SGD were then tuned in such a way that they maximized the value of the objective function in 30 passes. (Page 5, “Log-Linear Models”)
- Figure 3 shows how the value of the objective function changed as the training proceeded. (Page 6, “Log-Linear Models”)
- The third column shows the final value of the objective function per sample. (Page 6, “Log-Linear Models”)
- Although L1-regularized and L1-constrained learning algorithms are not directly comparable because the objective functions are different, it would be interesting to compare the two approaches in terms of practicality. (Page 8, “Log-Linear Models”)
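
The excerpts above describe the two ingredients of SGD training: a per-batch approximation of the gradient of the objective, and an online parameter update. A minimal sketch on a toy binary log-linear (logistic) model — the data, batch size, and learning rate are made up for illustration, not taken from the paper:

```python
import math
import random

def sgd_step(w, batch, eta):
    """One SGD update: move w along the average gradient of the
    log-likelihood estimated from a small batch of (x, y) samples."""
    grad = [0.0] * len(w)
    for x, y in batch:                                   # y is 0 or 1
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for i, xi in enumerate(x):
            grad[i] += (y - p) * xi                      # d/dw_i of log p(y|x)
    return [wi + eta * g / len(batch) for wi, g in zip(w, grad)]

# Toy data: the label is 1 exactly when the first feature is positive.
random.seed(0)
xs = [[random.uniform(-1.0, 1.0), 1.0] for _ in range(50)]   # second feature = bias
data = [(x, 1 if x[0] > 0 else 0) for x in xs]

w = [0.0, 0.0]
for _ in range(100):                       # passes over the shuffled data
    random.shuffle(data)
    for i in range(0, len(data), 10):      # minibatches of 10 samples
        w = sgd_step(w, data[i:i + 10], eta=0.5)
```

Note that this sketch has no regularization term; applying the L1 penalty inside such an update loop is exactly the complication the paper's cumulative-penalty method addresses.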

Appears in 5 sentences as: CRF (5) CRF++ (1)

- If the structure is a sequence, the model is called a linear-chain CRF model, and the marginal probabilities of the features and the partition function can be efficiently computed by using the forward-backward algorithm. (Page 2, “Log-Linear Models”)
- If the structure is a tree, the model is called a tree CRF model, and the marginal probabilities can be computed by using the inside-outside algorithm. (Page 2, “Log-Linear Models”)
- We evaluate the effectiveness of our training algorithm using linear-chain CRF models and three NLP tasks: text chunking, named entity recognition, and POS tagging. (Page 5, “Log-Linear Models”)
- CRF++ version 0.50, a popular CRF library developed by Taku Kudo, is reported to take 4,021 seconds on Xeon 3.0 GHz processors to train the model using a richer feature set. CRFsuite version 0.4, a much faster library for CRFs, is reported to take 382 seconds on Xeon 3.0 GHz, using the same feature set as ours. Their library uses the OWL-QN algorithm for optimization. (Page 6, “Log-Linear Models”)
- …tant due to the differences in implementation and hardware platforms, these results demonstrate that our algorithm can actually result in a very fast implementation of a CRF trainer. (Page 7, “Log-Linear Models”)
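
The first excerpt above notes that the partition function of a linear-chain CRF can be computed efficiently by the forward(-backward) algorithm, in time linear in the sequence length rather than by enumerating all label sequences. A log-space sketch (the scores and the tiny instance are made up), checked against brute-force enumeration:

```python
import itertools
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_partition(emit, trans):
    """Forward algorithm: log Z(x) for a linear chain with emission
    scores emit[t][y] and transition scores trans[y_prev][y]."""
    K = len(emit[0])
    alpha = list(emit[0])                    # log-domain forward scores at t = 0
    for t in range(1, len(emit)):
        alpha = [emit[t][y] +
                 logsumexp([alpha[yp] + trans[yp][y] for yp in range(K)])
                 for y in range(K)]
    return logsumexp(alpha)

# Tiny made-up instance: 3 positions, 2 labels.
emit = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
trans = [[0.2, -0.1], [0.0, 0.3]]

# Brute force: log of the sum of exp(score) over all 2^3 label sequences.
brute = logsumexp([
    sum(emit[t][y] for t, y in enumerate(seq)) +
    sum(trans[a][b] for a, b in zip(seq, seq[1:]))
    for seq in itertools.product(range(2), repeat=3)
])
```

The forward recursion does K² work per position instead of summing over Kᵀ sequences, which is what makes training CRFs on corpora like the WSJ feasible at all.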

Appears in 4 sentences as: CRFs (4)

- We evaluate the effectiveness of our method by using linear-chain conditional random fields (CRFs) and three traditional NLP tasks, namely, text chunking (shallow parsing), named entity recognition, and POS tagging. (Page 2, “Introduction”)
- CRF++ version 0.50, a popular CRF library developed by Taku Kudo, is reported to take 4,021 seconds on Xeon 3.0 GHz processors to train the model using a richer feature set. CRFsuite version 0.4, a much faster library for CRFs, is reported to take 382 seconds on Xeon 3.0 GHz, using the same feature set as ours. Their library uses the OWL-QN algorithm for optimization. (Page 6, “Log-Linear Models”)
- … (2006) report an f-score of 71.48 on the same data, using semi-Markov CRFs. (Page 7, “Log-Linear Models”)
- We have conducted experiments using CRFs and three NLP tasks, and demonstrated empirically that our training algorithm can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularization. (Page 8, “Log-Linear Models”)

Appears in 4 sentences as: weight vector (5)

- L1 regularization penalizes the weight vector for its L1-norm (i.e. … (Page 1, “Introduction”)
- In effect, it forces the weight to receive the total L1 penalty that would have been applied if the weight had been updated by the true gradients, assuming that the current weight vector resides in the same orthant as the true weight vector. (Page 4, “Log-Linear Models”)
- … problem as an L1-constrained problem (Lee et al., 2006), where the conditional log-likelihood of the training data is maximized under a fixed constraint on the L1-norm of the weight vector. (Page 8, “Log-Linear Models”)
- … (2008) describe efficient algorithms for projecting a weight vector onto the L1-ball. (Page 8, “Log-Linear Models”)
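
The last excerpt refers to projecting a weight vector onto the L1-ball, the key step of the L1-constrained alternative. A sketch of the standard O(n log n) sort-based Euclidean projection (the efficient algorithms cited refine this same computation; the example vector and radius are made up):

```python
import math

def project_l1_ball(v, z):
    """Euclidean projection of v onto {w : ||w||_1 <= z}, via the standard
    sort-then-soft-threshold construction (O(n log n))."""
    if sum(abs(x) for x in v) <= z:
        return list(v)                     # already inside the ball
    u = sorted((abs(x) for x in v), reverse=True)
    css, theta = 0.0, 0.0
    for j, uj in enumerate(u):             # the condition below holds for a prefix
        css += uj                          # of u; the last theta computed is kept
        if uj - (css - z) / (j + 1) > 0:
            theta = (css - z) / (j + 1)
    # Soft-threshold each coordinate by theta, preserving its sign.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]

w = project_l1_ball([3.0, -1.0, 0.5], 2.0)
```

Like L1 regularization itself, the projection soft-thresholds the coordinates, so it also zeroes out small weights and yields sparse models.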

Appears in 3 sentences as: CoNLL (3)

- The first set of experiments used the text chunking data set provided for the CoNLL 2000 shared task. The training data consist of 8,936 sentences in which each token is annotated with the “IOB” tags representing text chunks such as noun and verb phrases. (Page 5, “Log-Linear Models”)
- Figure 3: CoNLL 2000 chunking task: Objective. (Page 6, “Log-Linear Models”)
- Figure 4: CoNLL 2000 chunking task: Number of active features. (Page 6, “Log-Linear Models”)

Appears in 3 sentences as: feature space (3)

- In NLP applications, the dimension of the feature space tends to be very large: it can easily reach several million, so applying the L1 penalty to all features significantly slows down the weight updating process. (Page 2, “Introduction”)
- Since the dimension of the feature space can be very large, it can significantly slow down the weight update process. (Page 3, “Log-Linear Models”)
- Another merit is that it allows us to apply the L1 penalty in a lazy fashion, so that we do not need to update the weights of the features that are not used in the current sample, which leads to much faster training when the dimension of the feature space is large. (Page 4, “Log-Linear Models”)
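
The last excerpt is the implementation trick behind fast training: keep one running total of the L1 penalty that every weight *should* have absorbed, plus a per-feature record of what each weight *actually* absorbed, and reconcile the two only when a feature fires. A sketch of one plausible clipping-style realization of this lazy cumulative penalty — the class and names are ours for illustration, not the paper's code:

```python
class LazyL1:
    """Lazily applied cumulative L1 penalty. u is the total penalty any
    weight should have received so far; q[i] is the signed penalty that
    weight i has actually received. Features that do not fire in the
    current sample are left untouched and catch up when next used."""

    def __init__(self, n):
        self.w = [0.0] * n
        self.q = [0.0] * n
        self.u = 0.0

    def accumulate(self, amount):
        """Called once per SGD step, e.g. with (learning rate) * C / N."""
        self.u += amount

    def apply_penalty(self, i):
        """Apply the outstanding penalty to weight i, clipping at zero so
        the penalty never pushes the weight across the origin."""
        z = self.w[i]
        if self.w[i] > 0:
            self.w[i] = max(0.0, self.w[i] - (self.u + self.q[i]))
        elif self.w[i] < 0:
            self.w[i] = min(0.0, self.w[i] + (self.u - self.q[i]))
        self.q[i] += self.w[i] - z

opt = LazyL1(2)
opt.w[0] = 1.0              # pretend a gradient step produced this weight
opt.accumulate(0.3)
opt.apply_penalty(0)        # feature 0 fires: w[0] -> 0.7, q[0] -> -0.3
opt.accumulate(0.3)
opt.accumulate(0.3)         # two steps in which feature 0 does not fire
opt.apply_penalty(0)        # catches up on 0.6 of penalty: w[0] -> 0.1
```

The cost of a step is proportional to the number of active features in the current sample, not to the full dimension of the feature space, which is exactly the speedup the excerpt describes.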

Appears in 3 sentences as: Part-Of-Speech (1) part-of-speech (2)

- We evaluate the effectiveness of our method in three applications: text chunking, named entity recognition, and part-of-speech tagging. (Page 1, “Abstract”)
- The applications range from simple classification tasks such as text classification and history-based tagging (Ratnaparkhi, 1996) to more complex structured prediction tasks such as part-of-speech (POS) tagging (Lafferty et al., 2001), syntactic parsing (Clark and Curran, 2004) and semantic role labeling (Toutanova et al., 2005). (Page 1, “Introduction”)
- 4.3 Part-Of-Speech Tagging (Page 7, “Log-Linear Models”)

Appears in 3 sentences as: syntactic parsing (3)

- The applications range from simple classification tasks such as text classification and history-based tagging (Ratnaparkhi, 1996) to more complex structured prediction tasks such as part-of-speech (POS) tagging (Lafferty et al., 2001), syntactic parsing (Clark and Curran, 2004) and semantic role labeling (Toutanova et al., 2005). (Page 1, “Introduction”)
- SGD was recently used for NLP tasks including machine translation (Tillmann and Zhang, 2006) and syntactic parsing (Smith and Eisner, 2008; Finkel et al., 2008). (Page 2, “Introduction”)
- The model can be used for tasks like syntactic parsing (Finkel et al., 2008) and semantic role labeling (Cohn and Blunsom, 2005). (Page 2, “Log-Linear Models”)

Appears in 3 sentences as: unigrams (4)

- The features used in this experiment were unigrams and bigrams of neighboring words, and unigrams, bigrams and trigrams of neighboring POS tags. (Page 5, “Log-Linear Models”)
- For the features, we used unigrams of neighboring chunk tags, substrings (shorter than 10 characters) of the current word, and the shape of the word (e.g. “IL-2” is converted into “AA-#”), on top of the features used in the text chunking experiments. (Page 7, “Log-Linear Models”)
- For the features, we used unigrams and bigrams of neighboring words, prefixes and suffixes of the current word, and some characteristics of the word. (Page 7, “Log-Linear Models”)
