Learning Word Vectors for Sentiment Analysis
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts

Article Structure

Abstract

Unsupervised vector-based approaches to semantics can model rich lexical meanings, but they largely fail to capture sentiment information that is central to many word meanings and important for a wide range of NLP tasks.

Introduction

Word representations are a critical component of many natural language processing systems.

Related work

The model we present in the next section draws inspiration from prior work on both probabilistic topic modeling and vector space models for word meanings.

Our Model

To capture semantic similarities among words, we derive a probabilistic model of documents which learns word representations.

Experiments

We evaluate our model with document-level and sentence-level categorization tasks in the domain of online movie reviews.

Topics

LDA

Appears in 14 sentences as: LDA (15)
In Learning Word Vectors for Sentiment Analysis
  1. Latent Dirichlet Allocation (LDA; Blei et al., 2003) is a probabilistic document model that assumes each document is a mixture of latent topics.
    Page 2, “Related work”
  2. However, because the emphasis in LDA is on modeling topics, not word meanings, there is no guarantee that the row (word) vectors are sensible as points in a k-dimensional space.
    Page 2, “Related work”
  3. Indeed, we show in section 4 that using LDA in this way does not deliver robust word vectors.
    Page 2, “Related work”
  4. The semantic component of our model shares its probabilistic foundation with LDA, but is factored in a manner designed to discover word vectors rather than latent topics.
    Page 2, “Related work”
  5. Some recent work introduces extensions of LDA to capture sentiment in addition to topical information (Li et al., 2010; Lin and He, 2009; Boyd-Graber and Resnik, 2010).
    Page 2, “Related work”
  6. This component does not require labeled data, and shares its foundation with probabilistic topic models such as LDA.
    Page 2, “Our Model”
  7. Equation 1 resembles the probabilistic model of LDA (Blei et al., 2003), which models documents as mixtures of latent topics.
    Page 3, “Our Model”
  8. Because of the log-linear formulation of the conditional distribution, θ is a vector in R^β and not restricted to the unit simplex as it is in LDA.
    Page 3, “Our Model”
  9. Latent Dirichlet Allocation (LDA; Blei et al., 2003) We use the method described in section 2 for inducing word representations from the topic matrix.
    Page 6, “Experiments”
  10. To train the 50-topic LDA model we use code released by Blei et al.
    Page 6, “Experiments”
  11. We use the same 5,000 term vocabulary for LDA as is used for training word vector models.
    Page 6, “Experiments”
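
Item 9 above describes inducing word representations from a trained LDA topic matrix: each word's vector is its column of p(w|T) values across the k topics, normalized to unit length (see also item 3 under "vector representation" below). A minimal sketch of that conversion, assuming a k × |V| topic-word probability matrix is already available; the function and variable names are illustrative, not taken from the released LDA code.

    import numpy as np

    def lda_word_vectors(topic_word):
        """Turn a k x |V| matrix of p(w|T) values into unit-length
        k-dimensional word vectors, one column per vocabulary word."""
        vecs = np.asarray(topic_word, dtype=float)           # shape (k, |V|)
        norms = np.linalg.norm(vecs, axis=0, keepdims=True)  # L2 norm of each word's column
        norms[norms == 0] = 1.0                              # guard against all-zero columns
        return vecs / norms                                  # normalize each column to unit length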


word representations

Appears in 14 sentences as: Word Representation (1) word representation (2) Word Representations (1) Word representations (1) word representations (9) words’ representation (1)
In Learning Word Vectors for Sentiment Analysis
  1. Word representations are a critical component of many natural language processing systems.
    Page 1, “Introduction”
  2. To capture semantic similarities among words, we derive a probabilistic model of documents which learns word representations.
    Page 2, “Our Model”
  3. The energy function uses a word representation matrix R ∈ R^(β × |V|) where each word w (represented as a one-hot vector) in the vocabulary V has a β-dimensional vector representation φ_w = R_w corresponding to that word’s column in R. The random variable θ is also a β-dimensional vector, θ ∈ R^β, which weights each of the β dimensions of words’ representation vectors.
    Page 3, “Our Model”
  4. We introduce a Frobenius norm regularization term for the word representation matrix R. The word biases b are not regularized, reflecting the fact that we want the biases to capture whatever overall word frequency statistics are present in the data.
    Page 3, “Our Model”
  5. Following this theme we introduce a second task to utilize labeled documents to improve our model’s word representations.
    Page 4, “Our Model”
  6. word representations (R, b, ψ, and b_c) while leaving the MAP estimates (θ̂) fixed.
    Page 5, “Our Model”
  7. Then we find the new MAP estimate for each document while leaving the word representations fixed, and continue this process until convergence.
    Page 5, “Our Model”
  8. In both tasks we compare our model’s word representations with several bag of words weighting methods, and alternative approaches to word vector induction.
    Page 5, “Experiments”
  9. 4.1 Word Representation Learning
    Page 5, “Experiments”
  10. We induce word representations with our model using 25,000 movie reviews from IMDB.
    Page 5, “Experiments”
  11. As a qualitative assessment of word representations, we visualize the words most similar to a query word using vector similarity of the learned representations.
    Page 5, “Experiments”
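
Items 6 and 7 describe the alternating optimization: first update the word representation parameters (R, b, ψ, b_c) with the per-document MAP estimates θ̂ held fixed, then recompute each document's MAP estimate with the word representations held fixed, and repeat until convergence. A minimal sketch of that outer loop; update_word_parameters and map_estimate_theta are hypothetical helpers standing in for the gradient and MAP steps, not functions from the paper.

    def train(docs, params, n_outer_iters=20):
        """Alternating maximization sketch: params bundles R, b, psi, b_c."""
        # Hypothetical helpers: map_estimate_theta and update_word_parameters.
        thetas = [map_estimate_theta(doc, params) for doc in docs]
        for _ in range(n_outer_iters):
            # Gradient steps on the word representations (R, b, psi, b_c)
            # while the per-document MAP estimates stay fixed.
            params = update_word_parameters(docs, thetas, params)
            # New MAP estimate of theta for each document, with the
            # word representations held fixed.
            thetas = [map_estimate_theta(doc, params) for doc in docs]
        return params, thetas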


vector representation

Appears in 5 sentences as: vector representation (3) vector representations (2)
In Learning Word Vectors for Sentiment Analysis
  1. This component of the model uses the vector representation of words to predict the sentiment annotations on contexts in which the words appear.
    Page 1, “Introduction”
  2. This causes words expressing similar sentiment to have similar vector representations.
    Page 1, “Introduction”
  3. For each latent topic T, the model learns a conditional distribution p(w|T) for the probability that word w occurs in T. One can obtain a k-dimensional vector representation of words by first training a k-topic model and then filling the matrix with the p(w|T) values (normalized to unit length).
    Page 2, “Related work”
  4. The energy function uses a word representation matrix R ∈ R^(β × |V|) where each word w (represented as a one-hot vector) in the vocabulary V has a β-dimensional vector representation φ_w = R_w corresponding to that word’s column in R. The random variable θ is also a β-dimensional vector, θ ∈ R^β, which weights each of the β dimensions of words’ representation vectors.
    Page 3, “Our Model”
  5. Given a query word w and another word w′ we obtain their vector representations φ_w and φ_w′, and evaluate their cosine similarity as S(φ_w, φ_w′) = (φ_w · φ_w′) / (‖φ_w‖ ‖φ_w′‖). By assessing the similarity of w with all other words w′, we can find the words deemed most similar by the model.
    Page 5, “Experiments”
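
Item 5 describes ranking all vocabulary words by cosine similarity to a query word's learned vector. A minimal sketch, assuming the representation matrix R stores one β-dimensional column per vocabulary word; the names below are illustrative.

    import numpy as np

    def most_similar(query_idx, R, top_n=10):
        """Indices of the words whose columns of R are closest to the
        query word's vector by cosine similarity."""
        phi = R[:, query_idx]
        denom = np.linalg.norm(R, axis=0) * np.linalg.norm(phi)
        denom[denom == 0] = 1.0                    # avoid division by zero
        sims = (R.T @ phi) / denom                 # cosine similarity to every word
        order = np.argsort(-sims)                  # most similar first
        return [i for i in order if i != query_idx][:top_n]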


objective function

Appears in 4 sentences as: objective function (4)
In Learning Word Vectors for Sentiment Analysis
  1. The full objective function of the model thus learns semantic vectors that are imbued with nuanced sentiment information.
    Page 1, “Introduction”
  2. We adopt this insight, but we are able to incorporate it directly into our model’s objective function.
    Page 2, “Related work”
  3. We can efficiently learn parameters for the joint objective function using alternating maximization.
    Page 3, “Our Model”
  4. This produces a final objective function of,
    Page 4, “Our Model”
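
Item 4 ends where the paper's displayed equation begins. Assembling the components quoted elsewhere on this page (the unsupervised document model, the sentiment predictor with logistic regression weights ψ and b_c, and the Frobenius-norm regularizer on R), the joint objective has roughly the following shape. This is a schematic reconstruction: the weight ν and the exact arrangement of the sums are assumptions, not the paper's equation verbatim.

    \max_{R,\, b,\, \psi,\, b_c}\;\;
        \sum_{d \in D} \log p(d \mid \hat{\theta}_d;\, R, b)
      \;+\; \sum_{d \in D} \sum_{w \in d} \log p(s_d \mid w;\, R, \psi, b_c)
      \;-\; \nu \lVert R \rVert_F^2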


probabilistic model

Appears in 4 sentences as: probabilistic model (4)
In Learning Word Vectors for Sentiment Analysis
  1. The semantic component of our model learns word vectors via an unsupervised probabilistic model of documents.
    Page 1, “Introduction”
  2. To capture semantic similarities among words, we derive a probabilistic model of documents which learns word representations.
    Page 2, “Our Model”
  3. We build a probabilistic model of a document using a continuous mixture distribution over words indexed by a multidimensional random variable θ.
    Page 3, “Our Model”
  4. Equation 1 resembles the probabilistic model of LDA (Blei et al., 2003), which models documents as mixtures of latent topics.
    Page 3, “Our Model”
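
Items 3 and 4, together with the energy-function sentence quoted under "word representations", describe a log-linear conditional distribution over the vocabulary: each word's probability depends on the inner product of the document variable θ with that word's column of R, plus a word bias. A minimal sketch of such a distribution; the exact energy form (θ · φ_w + b_w inside a softmax) is the standard log-linear reading of the text and is stated here as an assumption.

    import numpy as np

    def word_distribution(theta, R, b):
        """Softmax over the vocabulary: p(w | theta) proportional to
        exp(theta . phi_w + b_w), with phi_w the w-th column of R."""
        scores = R.T @ theta + b        # shape (|V|,)
        scores -= scores.max()          # subtract max for numerical stability
        expw = np.exp(scores)
        return expw / expw.sum()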


vector space

Appears in 4 sentences as: Vector space (1) vector space (3)
In Learning Word Vectors for Sentiment Analysis
  1. modeling sentiment-imbued topics rather than embedding words in a vector space.
    Page 2, “Related work”
  2. Vector space models (VSMs) seek to model words directly (Turney and Pantel, 2010).
    Page 2, “Related work”
  3. The logistic regression weights ψ and b_c define a linear hyperplane in the word vector space where a word vector’s positive sentiment probability depends on where it lies with respect to this hyperplane.
    Page 4, “Our Model”
  4. For comparison, we implemented several alternative vector space models that are conceptually similar to our own, as discussed in section 2:
    Page 5, “Experiments”
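
Item 3 describes the sentiment component as a linear hyperplane in the word vector space: a word vector's probability of expressing positive sentiment is determined by which side of the hyperplane defined by ψ and b_c it falls on, and how far from it. A minimal sketch, assuming the usual sigmoid form implied by "logistic regression weights".

    import numpy as np

    def positive_sentiment_probability(phi_w, psi, b_c):
        """Sigmoid of the score psi . phi_w + b_c: the probability that
        the word vector phi_w expresses positive sentiment."""
        return 1.0 / (1.0 + np.exp(-(psi @ phi_w + b_c)))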


semantic similarities

Appears in 3 sentences as: Semantic Similarities (1) semantic similarities (2)
In Learning Word Vectors for Sentiment Analysis
  1. To capture semantic similarities among words, we derive a probabilistic model of documents which learns word representations.
    Page 2, “Our Model”
  2. 3.1 Capturing Semantic Similarities
    Page 3, “Our Model”
  3. All of these vectors capture broad semantic similarities.
    Page 5, “Experiments”


sentiment classification

Appears in 3 sentences as: sentiment classification (3)
In Learning Word Vectors for Sentiment Analysis
  1. We evaluate the model using small, widely used sentiment and subjectivity corpora and find it outperforms several previously introduced methods for sentiment classification.
    Page 1, “Abstract”
  2. Martineau and Finin (2009) present evidence that this weighting helps with sentiment classification, and Paltoglou and Thelwall (2010) systematically explore a number of weighting schemes in the context of sentiment analysis.
    Page 2, “Related work”
  3. Of course, this is only an impressionistic analysis of a few cases, but it is helpful in understanding why the sentiment-enriched model proves superior at the sentiment classification results we report next.
    Page 5, “Experiments”


SVD

Appears in 3 sentences as: SVD (3)
In Learning Word Vectors for Sentiment Analysis
  1. Latent Semantic Analysis (LSA), perhaps the best known VSM, explicitly learns semantic word vectors by applying singular value decomposition (SVD) to factor a term-document co-occurrence matrix.
    Page 2, “Related work”
  2. It is typical to weight and normalize the matrix values prior to SVD.
    Page 2, “Related work”
  3. Latent Semantic Analysis (LSA; Deerwester et al., 1990) We apply truncated SVD to a tf.idf weighted, cosine normalized count matrix.
    Page 5, “Experiments”
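
Items 1-3 describe the LSA baseline: tf.idf weight and cosine (L2) normalize a term-document count matrix, then factor it with truncated SVD to obtain word vectors. A minimal sketch with scikit-learn; which axis is normalized and the number of dimensions are illustrative choices, not details quoted from the paper.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.decomposition import TruncatedSVD

    def lsa_word_vectors(term_doc_counts, n_dims=50):
        """LSA-style word vectors from a (|V| x n_docs) count matrix:
        tf.idf weighting with L2 (cosine) normalization, then truncated SVD."""
        # TfidfTransformer expects documents as rows, so transpose in and out.
        tfidf = TfidfTransformer(norm="l2").fit_transform(term_doc_counts.T).T
        svd = TruncatedSVD(n_components=n_dims)
        return svd.fit_transform(tfidf)     # one n_dims-dimensional row per term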
