Linguistic Structured Sparsity in Text Categorization
Yogatama, Dani and Smith, Noah A.

Article Structure

Abstract

We introduce three linguistically motivated structured regularizers based on parse trees, topics, and hierarchical word clusters for text categorization.

Introduction

What is the best way to exploit linguistic information in statistical text processing models?

Notation

We represent each document as a feature vector x ∈ ℝ^V, where V is the vocabulary size.
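As a minimal illustration of this representation (not from the paper; scikit-learn's CountVectorizer is assumed here only as one convenient way to build such vectors), a document can be mapped to a count vector of length V:

```python
# Minimal sketch: mapping documents to bag-of-words vectors x in R^V.
# scikit-learn's CountVectorizer is an assumption for illustration only;
# the paper does not prescribe a particular implementation.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["an affecting and charming journey",
        "the committee debated the bill"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)      # shape: (number of documents, V)

V = len(vectorizer.vocabulary_)         # vocabulary size
print(V)
print(X.toarray()[0])                   # first document as a vector in R^V
```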

Group Lasso

Structured regularizers penalize estimates of w by penalizing collections of weights jointly, rather than each weight in isolation.
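For reference, the group lasso penalty used throughout (whose hyperparameters λ_glas and λ_g are quoted under "hyperparameter" below) has the standard form

```latex
\Omega_{\mathrm{glas}}(\mathbf{w}) \;=\; \lambda_{\mathrm{glas}} \sum_{g=1}^{G} \lambda_g \, \lVert \mathbf{w}_g \rVert_2 ,
```

where w_g denotes the subvector of weights belonging to group g; how the groups (and the group-specific weights λ_g) are defined is what distinguishes the parse tree, LDA, and Brown cluster regularizers.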

Structured Regularizers for Text

Past work applying the group lasso to NLP problems has considered four ways of defining the groups.

Learning

There are many optimization methods for learning models with structured regularizers, particularly group lasso (Jacob et al., 2009; Jenatton et al., 2011; Chen et al., 2011; Qin and Goldfarb, 2012; Yuan et al., 2013).
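One common building block of such solvers, sketched here only as a hedged illustration (this is generic proximal-gradient machinery, not necessarily the optimization method used in the paper), is the proximal operator of a single group's penalty term, i.e., block soft-thresholding:

```python
import numpy as np

def prox_group(w_g, threshold):
    """Block soft-thresholding: proximal operator of threshold * ||w_g||_2.
    The entire group is driven to zero when its L2 norm is small, which is
    what produces group-level sparsity."""
    norm = np.linalg.norm(w_g)
    if norm <= threshold:
        return np.zeros_like(w_g)
    return (1.0 - threshold / norm) * w_g

# Illustration: a mild threshold shrinks the group; a large one zeroes it.
w_g = np.array([0.30, -0.20, 0.05])
print(prox_group(w_g, 0.1))
print(prox_group(w_g, 1.0))   # -> [0. 0. 0.]
```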

Experiments

6.1 Datasets

Related and Future Work

Overall, our results demonstrate that linguistic structure in the data can be used to improve bag-of-words models, through structured regularization.

Conclusion

We introduced three data-driven, linguistically informed structured regularizers based on parse trees, topics, and hierarchical word clusters.

Topics

LDA

Appears in 24 sentences as: LDA (25)
  1. 4.3 LDA Regularizer
    Page 4, “Structured Regularizers for Text”
  2. We do this by inferring topics in the training corpus by estimating the latent Dirichlet allocation (LDA) model (Blei et al., 2003).
    Page 4, “Structured Regularizers for Text”
  3. Note that LDA is an unsupervised method, so we can infer topical structures from any collection of documents that are considered related to the target corpus (e.g., training documents, text from the web, etc.).
    Page 4, “Structured Regularizers for Text”
  4. In our experiments, we choose the R most probable words given a topic and create a group for them.
    Page 4, “Structured Regularizers for Text”
  5. The LDA regularizer will construct four groups from these topics.
    Page 4, “Structured Regularizers for Text”
  6. Unlike the parse tree regularizer, the LDA regularizer is not tree structured.
    Page 4, “Structured Regularizers for Text”
  7. Our LDA regularizer is an instance of sparse group lasso (Friedman et al., 2010).
    Page 4, “Structured Regularizers for Text”
  8. We define the Brown cluster regularizer in a similar way to the topical word groups inferred using LDA in §4.3, but here we make use of the hierarchy.
    Page 5, “Structured Regularizers for Text”
  9. Consider a similar toy example to the LDA regularizer (sports vs. science) and the hierarchical clustering of words in Figure 2.
    Page 5, “Structured Regularizers for Text”
  10. LDA and Brown cluster regularizers offer ways to incorporate unlabeled data, if we believe that the unlabeled data can help us infer better topics or clusters.
    Page 5, “Structured Regularizers for Text”
  11. For the LDA regularizer, L = R × K. For the Brown cluster regularizer, L = V - 1.
    Page 6, “Learning”
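The construction of topic groups described in items 2-5 can be sketched as follows (a hypothetical illustration: scikit-learn's LatentDirichletAllocation and its components_ matrix stand in for the LDA model of Blei et al., 2003, and the corpus, K, and R values are made up):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus; in the paper, topics can be inferred from any
# collection of documents related to the target corpus.
docs = ["the striker scored a late goal",
        "the team won the championship game",
        "the theorem follows from the first lemma",
        "the experiment confirmed the main hypothesis"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()

K, R = 2, 3   # number of topics and words per group (illustrative values)
lda = LatentDirichletAllocation(n_components=K, random_state=0).fit(X)

# One group per topic: feature indices of the R most probable words.
groups = [np.argsort(topic)[::-1][:R].tolist() for topic in lda.components_]
for k, g in enumerate(groups):
    print("topic", k, "->", [vocab[i] for i in g])
```

In the paper's notation (item 11), this setup gives L = R × K for the LDA regularizer; the additional lasso-like term on each word type is what makes it an instance of sparse group lasso (item 7).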

parse tree

Appears in 23 sentences as: Parse Tree (1), parse tree (18), parse trees (6)
  1. We introduce three linguistically motivated structured regularizers based on parse trees , topics, and hierarchical word clusters for text categorization.
    Page 1, “Abstract”
  2. Figure 1: An example of a parse tree from the Stanford sentiment treebank, which annotates sentiment at the level of every constituent (indicated here by + and ++; no marking indicates neutral sentiment).
    Page 3, “Structured Regularizers for Text”
  3. 4.2 Parse Tree Regularizer
    Page 3, “Structured Regularizers for Text”
  4. Sentence boundaries are a rather superficial kind of linguistic structure; syntactic parse trees provide more fine-grained information.
    Page 3, “Structured Regularizers for Text”
  5. We introduce a new regularizer, the parse tree regularizer, in which groups are defined for every constituent in every parse of a training data sentence.
    Page 3, “Structured Regularizers for Text”
  6. coefficients and λ) for one sentence with the parse tree shown in Figure 1 is: Ω_tree =
    Page 3, “Structured Regularizers for Text”
  7. Of course, in a corpus there are many parse trees (one per sentence, so the number of parse trees is the number of sentences).
    Page 3, “Structured Regularizers for Text”
  8. Note that, since each word token is itself a constituent, the parse tree regularizer includes terms just like the lasso naturally, penalizing the absolute value of each word’s weight in isolation.
    Page 3, “Structured Regularizers for Text”
  9. Of course, in some sentences, some words will occur more than once, and the parse tree regularizer instantiates groups for constituents in every sentence in the training corpus, and these groups may work against each other.
    Page 3, “Structured Regularizers for Text”
  10. The parse tree regularizer should therefore
    Page 3, “Structured Regularizers for Text”
  11. In sentence level prediction tasks, such as sentence-level sentiment analysis, it is known that most constituents (especially those that correspond to shorter phrases) in a parse tree are uninformative (neutral sentiment).
    Page 4, “Structured Regularizers for Text”

sentiment analysis

Appears in 10 sentences as: Sentiment analysis (2), sentiment analysis (8)
  1. We show that our structured regularizers consistently improve classification accuracies compared to standard regularizers that penalize features in isolation (such as lasso, ridge, and elastic net regularizers) on a range of datasets for various text prediction problems: topic classification, sentiment analysis , and forecasting.
    Page 1, “Abstract”
  2. For tasks like text classification, sentiment analysis , and text-driven forecasting, this is an open question, as cheap “bag-of-words” models often perform well.
    Page 1, “Introduction”
  3. In sentence level prediction tasks, such as sentence-level sentiment analysis , it is known that most constituents (especially those that correspond to shorter phrases) in a parse tree are uninformative (neutral sentiment).
    Page 4, “Structured Regularizers for Text”
  4. Sentiment analysis .
    Page 6, “Experiments”
  5. One task in sentiment analysis is predicting the polarity of a piece of text, i.e., whether the author is favorably inclined toward a (usually known) subject of discussion or proposition (Pang and Lee, 2008).
    Page 6, “Experiments”
  6. Sentiment analysis , even at the coarse level of polarity we consider here, can be confused by negation, stylistic use of irony, and other linguistic phenomena.
    Page 6, “Experiments”
  7. Our sentiment analysis datasets consist of movie reviews from the Stanford sentiment treebank (Socher et al., 2013), and floor speeches by U.S. Congressmen alongside “yea”/“nay” votes on the bill under discussion (Thomas et al., 2006).
    Page 6, “Experiments”
  8. For the Brown cluster regularizers, we ran Brown clustering on training documents with 5,000 clusters for the topic classification and sentiment analysis datasets, and 1,000 for the larger text forecasting datasets (since they are bigger datasets that took more time).
    Page 7, “Experiments”
  9. For example, for the vote sentiment analysis datasets, latent variable models of Yessenalina et al.
    Page 9, “Related and Future Work”
  10. We empirically showed that models regularized using our methods consistently outperformed standard regularizers that penalize features in isolation such as lasso, ridge, and elastic net on a range of datasets for various text prediction problems: topic classification, sentiment analysis , and forecasting.
    Page 9, “Conclusion”

bag-of-words

Appears in 6 sentences as: bag-of-words (5), “bag-of-words” (1)
  1. These regularizers impose linguistic bias in feature weights, enabling us to incorporate prior knowledge into conventional bag-of-words models.
    Page 1, “Abstract”
  2. For tasks like text classification, sentiment analysis, and text-driven forecasting, this is an open question, as cheap “bag-of-words” models often perform well.
    Page 1, “Introduction”
  3. We embrace the conventional bag-of-words representation of text, instead bringing linguistic bias to bear on regularization.
    Page 1, “Introduction”
  4. Our experiments demonstrate that structured regularizers can squeeze higher performance out of conventional bag-of-words models on seven out of eight of the text categorization tasks tested, in six cases with more compact models than the best-performing unstructured-regularized model.
    Page 1, “Introduction”
  5. Overall, our results demonstrate that linguistic structure in the data can be used to improve bag-of-words models, through structured regularization.
    Page 9, “Related and Future Work”
  6. Our experimental focus has been on a controlled comparison between regularizers for a fixed model family (the simplest available, linear with bag-of-words features).
    Page 9, “Related and Future Work”

hyperparameter

Appears in 6 sentences as: hyperparameter (6)
  1. Both methods disprefer weights of large magnitude; smaller (relative) magnitude means a feature (here, a word) has a smaller effect on the prediction, and zero means a feature has no effect. The hyperparameter λ in each case is typically tuned on a development dataset.
    Page 2, “Notation”
  2. where λ_glas is a hyperparameter tuned on development data, and λ_g is a group-specific weight.
    Page 2, “Group Lasso”
  3. As a result, besides λ_glas, we have an additional hyperparameter, denoted by λ_las.
    Page 3, “Structured Regularizers for Text”
  4. Since the lasso-like penalty does not occur naturally in a non-tree-structured regularizer, we add an additional lasso penalty for each word type (with hyperparameter λ_las) to also encourage weights of irrelevant words to go to zero.
    Page 4, “Structured Regularizers for Text”
  5. Similar to the parse tree regularizer, for the lasso-like penalty on each word, we tune one group weight for all word types on development data with a hyperparameter λ_las.
    Page 5, “Structured Regularizers for Text”
  6. Table 6 shows examples of zero and nonzero topics for the dev.-tuned hyperparameter values.
    Page 8, “Experiments”
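A schematic of the development-set tuning described in these excerpts; the grid values and the train_model and accuracy functions below are hypothetical placeholders, not the paper's actual protocol:

```python
# Hypothetical grid search over the two hyperparameters discussed above:
# lambda_glas for the group terms and lambda_las for the lasso-like terms.
def tune(train_data, dev_data, train_model, accuracy):
    best_params, best_score = None, float("-inf")
    for lambda_glas in (0.01, 0.1, 1.0, 10.0):      # illustrative grid
        for lambda_las in (0.01, 0.1, 1.0, 10.0):
            model = train_model(train_data, lambda_glas, lambda_las)
            score = accuracy(model, dev_data)
            if score > best_score:
                best_params, best_score = (lambda_glas, lambda_las), score
    return best_params, best_score
```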

treebank

Appears in 6 sentences as: treebank (6)
  1. Figure 1: An example of a parse tree from the Stanford sentiment treebank, which annotates sentiment at the level of every constituent (indicated here by + and ++; no marking indicates neutral sentiment).
    Page 3, “Structured Regularizers for Text”
  2. The Stanford sentiment treebank has an annotation of sentiments at the constituent level.
    Page 3, “Structured Regularizers for Text”
  3. Figure 1 illustrates the group structures derived from an example sentence from the Stanford sentiment treebank (Socher et al., 2013).
    Page 3, “Structured Regularizers for Text”
  4. Socher et al. (2013) when annotating phrases in a sentence for building the Stanford sentiment treebank.
    Page 4, “Structured Regularizers for Text”
  5. Our sentiment analysis datasets consist of movie reviews from the Stanford sentiment treebank (Socher et al., 2013), and floor speeches by U.S. Congressmen alongside “yea”/“nay” votes on the bill under discussion (Thomas et al., 2006).
    Page 6, “Experiments”
  6. For the Stanford sentiment treebank, we only predict binary classifications (positive or negative) and exclude neutral reviews.
    Page 6, “Experiments”

sentence-level

Appears in 4 sentences as: sentence-level (4)
  1. This regularizer captures the idea that phrases might be selected as relevant or (in most cases) irrelevant to a task, and is expected to be especially useful in sentence-level prediction tasks.
    Page 3, “Structured Regularizers for Text”
  2. In sentence level prediction tasks, such as sentence-level sentiment analysis, it is known that most constituents (especially those that correspond to shorter phrases) in a parse tree are uninformative (neutral sentiment).
    Page 4, “Structured Regularizers for Text”
  3. The task is to predict sentence-level sentiment, so each training example is a sentence.
    Page 7, “Experiments”
  4. It has been shown that syntactic information is helpful for sentence-level predictions (Socher et al., 2013), so the parse tree regularizer is naturally suitable for this task.
    Page 7, “Experiments”

semi-supervised

Appears in 3 sentences as: semi-supervised (4)
  1. This contrasts with typical semi-supervised learning methods for text categorization that combine unlabeled and labeled data within a generative model, such as multinomial naïve Bayes, via expectation-maximization (Nigam et al., 2000) or semi-supervised frequency estimation (Su et al., 2011).
    Page 4, “Structured Regularizers for Text”
  2. We leave comparison with other semi-supervised methods for future work.
    Page 4, “Structured Regularizers for Text”
  3. Note that we ran Brown clustering only on the training documents; running it on a larger collection of (unlabeled) documents relevant to the prediction task (i.e., semi-supervised learning) is worth exploring in future work.
    Page 9, “Experiments”
