Sprinkling Topics for Weakly Supervised Text Classification
Hingmire, Swapnil and Chakraborti, Sutanu

Article Structure

Abstract

Supervised text classification algorithms require a large number of human-labeled documents, which involves a labor-intensive and time-consuming process.

Introduction

In supervised text classification learning algorithms, the learner (a program) takes human-labeled documents as input and learns a decision function that can classify a previously unseen document into one of the predefined classes.

Related Work

Several researchers have proposed semi-supervised text classification algorithms with the aim of reducing the time, effort and cost involved in labeling documents.

Background 3.1 LDA

LDA is an unsupervised probabilistic generative model for collections of discrete data such as text documents.

Topic Sprinkling in LDA

In our text classification algorithm, we first infer a set of topics on the given unlabeled document corpus.
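
The sprinkling step summarized above can be illustrated with a minimal sketch: after an annotator labels topics, each training document receives an artificial class-label token for the class of its most probable topic. All data and names below are illustrative assumptions, and how many tokens the paper sprinkles per document is a detail not reproduced here.

    import numpy as np

    topic_labels = {0: "sports", 1: "politics"}      # annotator's labels for the LDA topics
    docs = [["goal", "match", "team"], ["vote", "bill", "senate"]]
    doc_topic = np.array([[0.9, 0.1], [0.2, 0.8]])   # inferred topic proportions theta_d

    sprinkled = []
    for doc, theta in zip(docs, doc_topic):
        label = topic_labels[int(np.argmax(theta))]  # class of the most probable topic
        sprinkled.append(doc + ["CLASS_" + label])   # append an artificial class-label token
    print(sprinkled)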

Experimental Evaluation

We determine the effectiveness of our algorithm in relation to the ClassifyLDA algorithm proposed in (Hingmire et al., 2013).

Conclusions and Future Work

In this paper we propose a novel algorithm that classifies documents based on class labels assigned to a few topics.

Topics

text classification

Appears in 19 sentences as: Text Classification (1) text classification (17) text classifier (1)
In Sprinkling Topics for Weakly Supervised Text Classification
  1. Supervised text classification algorithms require a large number of human-labeled documents, which involves a labor-intensive and time-consuming process.
    Page 1, “Abstract”
  2. We evaluate this approach to improve performance of text classification on three real world datasets.
    Page 1, “Abstract”
  3. In supervised text classification learning algorithms, the learner (a program) takes human-labeled documents as input and learns a decision function that can classify a previously unseen document into one of the predefined classes.
    Page 1, “Introduction”
  4. In this paper, we propose a text classification algorithm based on Latent Dirichlet Allocation (LDA) (Blei et al., 2003) which does not need labeled documents.
    Page 1, “Introduction”
  5. (Blei et al., 2003) used LDA topics as features in text classification, but they use labeled documents while learning a classifier.
    Page 1, “Introduction”
  6. Supervised Text Classification
    Page 1, “Introduction”
  7. These models can be used for text classification, but they need expensive labeled documents.
    Page 1, “Introduction”
  8. As LDA uses higher order word associations (Lee et al., 2010) while discovering topics, we hypothesize that sprinkling will improve the text classification performance of ClassifyLDA.
    Page 2, “Introduction”
  9. Several researchers have proposed semi-supervised text classification algorithms with the aim of reducing the time, effort and cost involved in labeling documents.
    Page 2, “Related Work”
  10. Semi-supervised text classification algorithms proposed in (Nigam et al., 2000), (Joachims, 1999), (Zhu and Ghahramani, 2002) and (Blum and Mitchell, 1998) are a few examples of this type.
    Page 2, “Related Work”
  11. Also, a human annotator may discard or mislabel a polysemous word, which may affect the performance of a text classifier.
    Page 2, “Related Work”

LDA

Appears in 17 sentences as: LDA (17)
In Sprinkling Topics for Weakly Supervised Text Classification
  1. In this paper, we propose a weakly supervised algorithm in which supervision comes in the form of labeling of Latent Dirichlet Allocation (LDA) topics.
    Page 1, “Abstract”
  2. In this paper, we propose a text classification algorithm based on Latent Dirichlet Allocation (LDA) (Blei et al., 2003) which does not need labeled documents.
    Page 1, “Introduction”
  3. LDA is an unsupervised probabilistic topic model and it is widely used to discover latent semantic structure of a document collection by modeling words in the documents.
    Page 1, “Introduction”
  4. (Blei et al., 2003) used LDA topics as features in text classification, but they use labeled documents while learning a classifier.
    Page 1, “Introduction”
  5. …et al., 2008) and MedLDA (Zhu et al., 2009) are a few extensions of LDA which model both class labels and words in the documents.
    Page 1, “Introduction”
  6. In this approach, a topic model on a given set of unlabeled training documents is constructed using LDA; an annotator then assigns a class label to some topics based on their most probable words.
    Page 1, “Introduction”
  7. As LDA uses higher order word associations (Lee et al., 2010) while discovering topics, we hypothesize that sprinkling will improve the text classification performance of ClassifyLDA.
    Page 2, “Introduction”
  8. As LDA topics are semantically more meaningful than individual words and can be acquired easily, our approach overcomes limitations of the semi-supervised methods discussed above.
    Page 2, “Related Work”
  9. LDA is an unsupervised probabilistic generative model for collections of discrete data such as text documents.
    Page 2, “Background 3.1 LDA”
  10. The generative process of LDA can be described as follows:
    Page 2, “Background 3.1 LDA” (see the code sketch after this list)
  11. The key problem in LDA is posterior inference.
    Page 2, “Background 3.1 LDA”
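
Item 10 above introduces the generative process of LDA; the following minimal sketch samples from that process, with illustrative sizes and symmetric Dirichlet hyperparameters assumed for simplicity.

    import numpy as np

    rng = np.random.default_rng(0)
    T, V, D, N = 4, 50, 3, 20               # topics, vocabulary, documents, words per document
    alpha, beta = 0.1, 0.01                 # symmetric hyperparameters (illustrative)

    phi = rng.dirichlet([beta] * V, T)      # phi_t: word distribution of topic t
    for d in range(D):
        theta_d = rng.dirichlet([alpha] * T)        # theta_d ~ Dirichlet(alpha)
        z = rng.choice(T, size=N, p=theta_d)        # z_{d,n} ~ Multinomial(theta_d)
        w = [rng.choice(V, p=phi[t]) for t in z]    # w_{d,n} ~ Multinomial(phi_{z_{d,n}})
        print(f"doc {d}: topics {z[:5]}... words {w[:5]}...")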

Gibbs sampling

Appears in 10 sentences as: Gibbs sampling (10)
In Sprinkling Topics for Weakly Supervised Text Classification
  1. In this paper we perform approximate posterior inference using collapsed Gibbs sampling (Griffiths and Steyvers, 2004).
    Page 2, “Background 3.1 LDA”
  2. The Gibbs sampling equation used to update the assignment of a topic t to the word w ∈ W at position n in document d, conditioned on α_t and β_w, is:
    Page 2, “Background 3.1 LDA” (see the code sketch after this list)
  3. We use the subscript d,¬n to denote that the current token, z_{d,n}, is ignored in the Gibbs sampling update.
    Page 2, “Background 3.1 LDA”
  4. After performing collapsed Gibbs sampling using Equation 1, we use the word-topic assignments to compute a point estimate of the topic and word distributions.
    Page 2, “Background 3.1 LDA”
  5. We then update the new LDA model using collapsed Gibbs sampling.
    Page 3, “Topic Sprinkling in LDA”
  6. We then infer |C| topics on the sprinkled dataset using collapsed Gibbs sampling, where C is the set of class labels of the training documents.
    Page 3, “Topic Sprinkling in LDA”
  7. We modify the collapsed Gibbs sampling update in Equation 1 to carry class label information while inferring topics.
    Page 3, “Topic Sprinkling in LDA”
  8. 1. Infer T topics on D using LDA with collapsed Gibbs sampling.
    Page 4, “Experimental Evaluation”
  9. Update M_D using the collapsed Gibbs sampling update in Equation 1.
    Page 4, “Experimental Evaluation”
  10. Infer |C| topics on the sprinkled document corpus D using the collapsed Gibbs sampling update.
    Page 4, “Experimental Evaluation”
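
The update in item 2 (Equation 1 of the paper) is the standard collapsed Gibbs sampling update of Griffiths and Steyvers (2004), in which the probability of topic t for the current token w is proportional to (n_dt + α) · (n_tw + β) / (n_t + Vβ), with the current token's counts excluded. The sketch below implements that standard form on a toy corpus with symmetric α and β; the paper's modified update that carries class-label information is not shown.

    import numpy as np

    rng = np.random.default_rng(0)
    docs = [[0, 1, 2, 1, 0], [2, 3, 3, 0, 2]]    # word ids per document (toy corpus)
    T, V, alpha, beta = 2, 4, 0.5, 0.1

    # Random initial topic assignments and the count tables the update needs.
    z = [[int(rng.integers(T)) for _ in doc] for doc in docs]
    n_dt = np.zeros((len(docs), T)); n_tw = np.zeros((T, V)); n_t = np.zeros(T)
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t = z[d][n]; n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1

    for sweep in range(100):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]                      # exclude the current token (the d,¬n subscript)
                n_dt[d, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
                p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
                t = int(rng.choice(T, p=p / p.sum()))    # resample z_{d,n}
                z[d][n] = t
                n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    print(z)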

latent semantic

Appears in 5 sentences as: Latent Semantic (2) latent semantic (3)
In Sprinkling Topics for Weakly Supervised Text Classification
  1. LDA is an unsupervised probabilistic topic model and it is widely used to discover latent semantic structure of a document collection by modeling words in the documents.
    Page 1, “Introduction”
  2. Sprinkling (Chakraborti et al., 2007) integrates class labels of documents into Latent Semantic Indexing (LSI) (Deerwester et al., 1990).
    Page 1, “Introduction”
  3. As LSI uses higher order word associations (Kontostathis and Pottenger, 2006), sprinkling of artificial words gives a better, class-enriched latent semantic structure.
    Page 1, “Introduction” (see the code sketch after this list)
  4. (Chakraborti et al., 2007) empirically show that sprinkled words boost higher order word associations and project documents with the same class labels close to each other in the latent semantic space.
    Page 3, “Background 3.1 LDA”
  5. We have used the idea of sprinkling originally proposed in the context of supervised Latent Semantic Analysis, but the setting here is quite different.
    Page 5, “Conclusions and Future Work”
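
A minimal sketch of sprinkling in the LSI setting described above: artificial class-label terms are appended to labeled documents before the SVD, so the low-rank space becomes class-enriched. The toy corpus, labels, and single sprinkled term per class are illustrative assumptions.

    import numpy as np

    docs   = ["cheap flight deal", "flight delayed again", "great pizza place", "pizza was cold"]
    labels = ["travel", "travel", "food", "food"]
    sprinkled = [d + " CLASS_" + l for d, l in zip(docs, labels)]   # add a class-label term

    vocab = sorted({w for d in sprinkled for w in d.split()})
    X = np.array([[d.split().count(w) for d in sprinkled] for w in vocab], dtype=float)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # LSI: SVD of the term-document matrix
    k = 2
    doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T             # rank-k document vectors
    print(np.round(doc_vecs, 2))  # same-class documents tend to lie closer together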

probability distribution

Appears in 5 sentences as: probability distribution (5)
In Sprinkling Topics for Weakly Supervised Text Classification
  1. We use the labeled topics to find the probability distribution of each training document over the class labels.
    Page 1, “Introduction”
  2. Draw a word: w_{d,n} ∼ Multinomial(φ_{z_{d,n}}), where T is the number of topics, φ_t is the word probability distribution for topic t, θ_d is the topic probability distribution, z_{d,n} is the topic assignment, and w_{d,n} is the word assignment for the nth word position in document d.
    Page 2, “Background 3.1 LDA”
  3. We use this new model to infer the probability distribution of each unlabeled training document over the class labels.
    Page 3, “Topic Sprinkling in LDA”
  4. While classifying a test document, its probability distribution over class labels is inferred using the TS-LDA model, and the document is assigned its most probable class label.
    Page 3, “Topic Sprinkling in LDA” (see the code sketch after this list)
  5. (a) Infer a probability distribution θ_d over class labels from M_D using Equation 3.
    Page 4, “Experimental Evaluation”
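
Equation 3 itself is not reproduced in this summary; the sketch below stands in for it with the standard smoothed point estimate of θ_d, computed from the counts of a document's tokens assigned to each class-label topic, and then classifies the document to its most probable class. The class names and counts are illustrative.

    import numpy as np

    class_names = ["politics", "sports", "tech"]   # one sprinkled class-label topic per class
    alpha = 0.5                                    # Dirichlet hyperparameter (illustrative)
    n_dt = np.array([3.0, 12.0, 1.0])              # tokens of a test document per topic

    # Standard smoothed point estimate of theta_d (stands in for the paper's Equation 3).
    theta_d = (n_dt + alpha) / (n_dt.sum() + len(n_dt) * alpha)
    print(dict(zip(class_names, np.round(theta_d, 3))))
    print("predicted class:", class_names[int(np.argmax(theta_d))])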

human annotator

Appears in 4 sentences as: human annotator (4)
In Sprinkling Topics for Weakly Supervised Text Classification
  1. Also, a human annotator may discard or mislabel a polysemous word, which may affect the performance of a text classifier.
    Page 2, “Related Work”
  2. In active learning, particular unlabeled documents or features are selected and queried to an oracle (e.g., a human annotator).
    Page 2, “Related Work”
  3. We then ask a human annotator to assign one or more class labels to the topics based on their most probable words.
    Page 3, “Topic Sprinkling in LDA”
  4. While labeling a topic, we show its 30 most probable words to the human annotator.
    Page 4, “Experimental Evaluation”

semi-supervised

Appears in 4 sentences as: Semi-supervised (1) semi-supervised (3)
In Sprinkling Topics for Weakly Supervised Text Classification
  1. Several researchers have proposed semi-supervised text classification algorithms with the aim of reducing the time, effort and cost involved in labeling documents.
    Page 2, “Related Work”
  2. Semi-supervised text classification algorithms proposed in (Nigam et al., 2000), (Joachims, 1999), (Zhu and Ghahramani, 2002) and (Blum and Mitchell, 1998) are a few examples of this type.
    Page 2, “Related Work”
  3. The third type of semi-supervised text classification algorithms is based on active learning.
    Page 2, “Related Work”
  4. As LDA topics are semantically more meaningful than individual words and can be acquired easily, our approach overcomes limitations of the semi-supervised methods discussed above.
    Page 2, “Related Work”

classification tasks

Appears in 3 sentences as: classification tasks (3)
In Sprinkling Topics for Weakly Supervised Text Classification
  1. The following are the three classification tasks associated with this dataset.
    Page 4, “Experimental Evaluation”
  2. For the SRAA dataset we infer 8 topics on the training dataset and label these 8 topics for all three classification tasks.
    Page 4, “Experimental Evaluation”
  3. However, ClassifyLDA performs better than TS-LDA on the three classification tasks of the SRAA dataset.
    Page 4, “Experimental Evaluation”

topic model

Appears in 3 sentences as: topic model (3)
In Sprinkling Topics for Weakly Supervised Text Classification
  1. LDA is an unsupervised probabilistic topic model and it is widely used to discover latent semantic structure of a document collection by modeling words in the documents.
    Page 1, “Introduction”
  2. In this approach, a topic model on a given set of unlabeled training documents is constructed using LDA; an annotator then assigns a class label to some topics based on their most probable words.
    Page 1, “Introduction”
  3. These labeled topics are used to create a new topic model such that topics in the new model are better aligned to class labels.
    Page 1, “Introduction”
