Re-embedding words
Labutov, Igor and Lipson, Hod

Article Structure

Abstract

We present a fast method for re-purposing existing semantic word vectors to improve performance in a supervised task.

Introduction

Incorporating the vector representation of a word as a feature has recently been shown to benefit performance in several standard NLP tasks such as language modeling (Bengio et al., 2003; Mnih and Hinton, 2009), POS-tagging and NER (Collobert et al., 2011), parsing (Socher et al., 2010), as well as in sentiment and subjectivity analysis tasks (Maas et al., 2011; Yessenalina and Cardie, 2011).

Related Work

The work most relevant to our contribution is that of Maas et al. (2011), where word vectors are learned specifically for sentiment classification.

Approach

Let $\Phi_S, \Phi_T \in \mathbb{R}^{|V| \times K}$ be the source and target embedding matrices, respectively, where $K$ is the dimension of the word vector space, identical in the source and target embeddings, and $V$ is the set of embedded words, given by $V_S \cup V_T$.
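
As a concrete illustration of this setup, here is a minimal Python sketch; the toy vocabulary, the dimension K, and the random source vectors are illustrative stand-ins, not values from the paper:

```python
import numpy as np

# Toy vocabulary V = V_S ∪ V_T of |V| embedded words, vectors of dimension K.
vocab = ["good", "bad", "movie", "plot", "acting"]  # illustrative
K = 3

rng = np.random.default_rng(0)

# Source embedding matrix Phi_S in R^{|V| x K}, e.g. rows taken from a
# published embedding; random stand-ins here.
Phi_S = rng.normal(size=(len(vocab), K))

# Target embedding matrix Phi_T has the same shape; a natural starting point
# is a copy of the source, to be adapted by the supervised objective.
Phi_T = Phi_S.copy()
```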

Experiments

Data: For our experiments, we employ a large, recently introduced IMDB movie review dataset (Maas et al., 2011), in place of the smaller dataset introduced by Pang and Lee (2004) that is more commonly used for sentiment analysis.
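
For reference, the public release of this dataset unpacks into an aclImdb/ directory with train/ and test/ splits, each holding pos/ and neg/ review files. A minimal loader along those lines (the root path is an assumption about where the archive was extracted):

```python
from pathlib import Path

def load_split(root, split):
    """Load review texts and binary labels for one split ('train' or 'test'),
    following the aclImdb layout: aclImdb/{train,test}/{pos,neg}/*.txt"""
    texts, labels = [], []
    for label, subdir in ((1, "pos"), (0, "neg")):
        for path in sorted(Path(root, split, subdir).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels

train_texts, train_labels = load_split("aclImdb", "train")
```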

Results and Discussion

The main observation from the results is that our method improves performance for smaller training sets (≤ 5000 examples).

Future Work

While “semantic smoothing” obtained from introducing an external embedding helps to improve performance in the sentiment classification task, the method does not help to re-embed words that do not appear in the training set to begin with.

Conclusion

We presented a novel approach to adapting existing word vectors for improving performance in a text classification task.

Topics

embeddings

Appears in 38 sentences as: Embeddings (1) embeddings (43)
In Re-embedding words
  1. Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data.
    Page 1, “Abstract”
2. However, some methods take days or weeks to learn good embeddings, and some are notoriously difficult to train.
    Page 1, “Abstract”
  3. Moreover, we may already have on our hands embeddings for X and Y obtained from yet another (possibly unsupervised) task (C), in which X and Y are, for example, orthogonal.
    Page 1, “Introduction”
4. If the embeddings for task C happen to be learned from a much larger dataset, it would make sense to reuse task C embeddings, but adapt them for task A and/or task B.
    Page 1, “Introduction”
5. We will refer to task C and its embeddings as the source task and the source embeddings, and task A/B and its embeddings as the target task and the target embeddings.
    Page 1, “Introduction”
  6. Traditionally, we would learn the embeddings for the target task jointly with whatever unlabeled data we may have, in an instance of semi-supervised learning, and/or we may leverage labels from multiple other related tasks in a multitask approach.
    Page 1, “Introduction”
7. Both methods have been applied successfully (Collobert and Weston, 2008) to learn task-specific embeddings.
    Page 1, “Introduction”
8. In the case of deep neural embeddings, for example, training time can number in days.
    Page 1, “Introduction”
  9. On the other hand, learned embeddings are becoming more abundant, as much research and computing effort is being invested in learning word representations using large-scale deep architectures trained on web-scale corpora.
    Page 1, “Introduction”
  10. Many of said embeddings are published and can be harnessed in their raw form as additional features in a number of supervised tasks (Turian et al., 2010).
    Page 1, “Introduction”
11. likelihood under the target embedding and the Frobenius norm of the distortion matrix — a matrix of component-wise differences between the target and the source embeddings.
    Page 2, “Introduction”
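
Sentence 11 above names the two terms of the training criterion: a likelihood under the target embedding plus the Frobenius norm of the distortion matrix Phi_T - Phi_S. Below is a minimal sketch of such a criterion, assuming a logistic likelihood over documents represented as Phi_T^T v_j and an illustrative trade-off weight lam; the paper's exact loss and weighting may differ:

```python
import numpy as np

def re_embedding_objective(w, Phi_T, Phi_S, V_docs, y, lam):
    """Negative log-likelihood of labels y under the target embedding Phi_T,
    plus the squared Frobenius norm of the distortion matrix Phi_T - Phi_S.

    V_docs: (n_docs, |V|) binary bag-of-words matrix; y: labels in {0, 1}.
    """
    z = (V_docs @ Phi_T) @ w                    # scores from Phi_T^T v_j
    nll = np.sum(np.logaddexp(0.0, z) - y * z)  # logistic negative log-likelihood
    distortion = Phi_T - Phi_S                  # component-wise differences
    return nll + lam * np.sum(distortion ** 2)
```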

sentiment classification

Appears in 4 sentences as: sentiment classification (4)
In Re-embedding words
  1. We show improvement on the task of sentiment classification with respect to several baselines, and observe that the approach is most useful when the training set is sufficiently small.
    Page 1, “Abstract”
2. The work most relevant to our contribution is that of Maas et al. (2011), where word vectors are learned specifically for sentiment classification.
    Page 2, “Related Work”
3. Source embeddings: We find C&W embeddings to perform best for the task of sentiment classification.
    Page 4, “Results and Discussion”
  4. While “semantic smoothing” obtained from introducing an external embedding helps to improve performance in the sentiment classification task, the method does not help to re-embed words that do not appear in the training set to begin with.
    Page 4, “Future Work”

vector space

Appears in 4 sentences as: vector space (4)
In Re-embedding words
1. Let $\Phi_S, \Phi_T \in \mathbb{R}^{|V| \times K}$ be the source and target embedding matrices, respectively, where $K$ is the dimension of the word vector space, identical in the source and target embeddings, and $V$ is the set of embedded words, given by $V_S \cup V_T$.
    Page 2, “Approach”
2. There are almost no restrictions on $\Phi_S$, except that it must match the desired target vector space dimension $K$. The objective is convex in $w$ and $\Phi_T$, thus yielding a unique target re-embedding.
    Page 3, “Approach”
3. We use the document’s binary bag-of-words vector $v_j$, and compute the document’s vector space representation through the matrix-vector product $\Phi_T^\top v_j$.
    Page 3, “Approach”
4. While a smaller number of dimensions has been shown to work better in other tasks (Turian et al., 2010), re-embedding words may benefit from a larger initial dimension of the word vector space.
    Page 4, “Results and Discussion”

bag-of-words

Appears in 3 sentences as: Bag-of-words (1) bag-of-words (2)
In Re-embedding words
1. We use the document’s binary bag-of-words vector $v_j$, and compute the document’s vector space representation through the matrix-vector product $\Phi_T^\top v_j$.
    Page 3, “Approach”
2. (Table header: rows are feature sets; columns are number of training examples, .5K / 5K / 20K, shown without and with added bag-of-words features.)
    Page 4, “Results and Discussion”
  3. Additional features: Across all embeddings, appending the document’s binary bag-of-words representation increases classification accuracy.
    Page 4, “Results and Discussion”
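
Sentences 1 and 3 above describe the binary bag-of-words vector v_j and the gain from appending it to the embedding features. A small example of both steps; the vocabulary and document are hypothetical:

```python
import numpy as np

vocab = {"good": 0, "bad": 1, "movie": 2, "plot": 3, "acting": 4}  # hypothetical
doc_tokens = ["good", "movie", "good", "plot"]

# Binary bag-of-words vector v_j: 1 if the word occurs at all, else 0.
v_j = np.zeros(len(vocab))
for tok in doc_tokens:
    if tok in vocab:
        v_j[vocab[tok]] = 1.0

# K-dim embedding representation Phi_T^T v_j, then append v_j itself to form
# the augmented (K + |V|)-dimensional feature vector for the classifier.
K = 3
Phi_T = np.random.default_rng(1).normal(size=(len(vocab), K))
features = np.concatenate([Phi_T.T @ v_j, v_j])
print(features.shape)  # (8,)
```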

semi-supervised

Appears in 3 sentences as: semi-supervised (3)
In Re-embedding words
  1. Traditionally, we would learn the embeddings for the target task jointly with whatever unlabeled data we may have, in an instance of semi-supervised learning, and/or we may leverage labels from multiple other related tasks in a multitask approach.
    Page 1, “Introduction”
  2. Embeddings are learned in a semi-supervised fashion, and the components of the embedding are given an explicit probabilistic interpretation.
    Page 2, “Related Work”
3. In the machine learning literature, joint semi-supervised embedding takes form in methods such as the LaplacianSVM (LapSVM) (Belkin et al., 2006) and Label Propagation (Zhu and Ghahramani, 2002), to which our approach is related.
    Page 2, “Related Work”

unlabeled data

Appears in 3 sentences as: unlabeled data (3)
In Re-embedding words
1. Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data.
    Page 1, “Abstract”
  2. Traditionally, we would learn the embeddings for the target task jointly with whatever unlabeled data we may have, in an instance of semi-supervised learning, and/or we may leverage labels from multiple other related tasks in a multitask approach.
    Page 1, “Introduction”
3. Our method is different in that the (potentially) massive amount of unlabeled data is not required a priori, but only the resultant embedding.
    Page 2, “Related Work”
