Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
Baroni, Marco and Dinu, Georgiana and Kruszewski, Germán

Article Structure

Abstract

Context-predicting models (more commonly known as embeddings or neural language models) are the new kids on the distributional semantics block.

Introduction

A long tradition in computational linguistics has shown that contextual information provides a good approximation to word meaning, since semantically similar words tend to have similar contextual distributions (Miller and Charles, 1991).

Distributional semantic models

Both count and predict models are extracted from a corpus of about 2.8 billion tokens constructed by concatenating ukWaC, the English Wikipedia and the British National Corpus. For both model types, we consider the top 300K most frequent words in the corpus both as target and context elements.
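The count-model recipe above (co-occurrence counts within a context window, with cosine similarity over the resulting vectors) can be sketched on a toy corpus; the sentences and window size below are illustrative stand-ins, not the paper's 2.8-billion-token, 300K-word setup.

```python
from collections import Counter

def count_vectors(sentences, window=2):
    """Build word-by-word co-occurrence count vectors with a symmetric window."""
    counts = {}
    for sent in sentences:
        for i, target in enumerate(sent):
            ctx = counts.setdefault(target, Counter())
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[sent[j]] += 1
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = lambda x: sum(c * c for c in x.values()) ** 0.5
    return dot / (norm(u) * norm(v)) if dot else 0.0

corpus = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the mouse ate the cheese".split(),
]
vecs = count_vectors(corpus)
# Semantically similar words share contexts, so their vectors are closer:
print(cosine(vecs["cat"], vecs["dog"]), cosine(vecs["cat"], vecs["cheese"]))
```

On this toy data, "cat" and "dog" co-occur with the same context words, so their cosine comes out higher than that of "cat" and "cheese", which is the distributional hypothesis in miniature.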

Evaluation materials

We test our models on a variety of benchmarks, most of them already widely used to test and compare DSMs.

Results

Table 2 summarizes the evaluation results.

Conclusion

This paper has presented the first systematic comparative evaluation of count and predict vectors.

Topics

distributional semantic

Appears in 7 sentences as: Distributional Semantic (1) distributional semantic (2) distributional semanticists (2) distributional semantics (2)
In Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
  1. Context-predicting models (more commonly known as embeddings or neural language models) are the new kids on the distributional semantics block.
    Page 1, “Abstract”
  2. Despite the buzz surrounding these models, the literature is still lacking a systematic comparison of the predictive models with classic, count-vector-based distributional semantic approaches.
    Page 1, “Abstract”
  3. Concretely, distributional semantic models (DSMs) use vectors that keep track of the contexts (e.g., co-occurring words) in which target terms appear in a large corpus as proxies for meaning representations, and apply geometric techniques to these vectors to measure the similarity in meaning of the corresponding words (Clark, 2013; Erk, 2012; Turney and Pantel, 2010).
    Page 1, “Introduction”
  4. The ESSLLI 2008 Distributional Semantic Workshop shared-task set (esslli) contains 44 concepts to be clustered into 6 categories (Baroni et al., 2008) (we ignore here the 3- and 2-way higher-level partitions coming with this set).
    Page 4, “Evaluation materials”
  5. As seasoned distributional semanticists with thorough experience in developing and using count vectors, we set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding predict models, despite the almost complete lack of a proper comparison to count vectors.
    Page 7, “Conclusion”
  6. To give just one last example, distributional semanticists have looked at whether certain properties of vectors reflect semantic relations in the expected way: e.g., whether the vectors of hypernyms “distributionally include” the vectors of hyponyms in some mathematically precise sense.
    Page 8, “Conclusion”
  7. Does all of this even matter, or are we on the cusp of discovering radically new ways to tackle the same problems that have been approached as we just sketched in traditional distributional semantics?
    Page 8, “Conclusion”

See all papers in Proc. ACL 2014 that mention distributional semantic.


state of the art

Appears in 4 sentences as: State of the art (1) state of the art (3)
In Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
  1. State of the art performance on this set has been reported by Hassan and Mihalcea (2011) using a technique that exploits the Wikipedia linking structure and word sense disambiguation techniques.
    Page 3, “Evaluation materials”
  2. The current state of the art is reached by Halawi et al.
    Page 4, “Evaluation materials”
  3. Current state of the art was reached by the window-based count model of Baroni and Lenci (2010).
    Page 4, “Evaluation materials”
  4. Indeed, the predictive models achieve an impressive overall performance, beating the current state of the art in several cases, and approaching it in many more.
    Page 5, “Results”


language models

Appears in 3 sentences as: language modeling (1) language models (2)
In Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
  1. Context-predicting models (more commonly known as embeddings or neural language models) are the new kids on the distributional semantics block.
    Page 1, “Abstract”
  2. This is in part due to the fact that context-predicting vectors were first developed as an approach to language modeling and/or as a way to initialize feature vectors in neural-network-based “deep learning” NLP architectures, so their effectiveness as semantic representations was initially seen as little more than an interesting side effect.
    Page 2, “Introduction”
  3. Predictive DSMs are also called neural language models, because their supervised context prediction training is performed with neural networks, or, more cryptically, “embeddings”.
    Page 2, “Introduction”


Latent Semantic

Appears in 3 sentences as: Latent Semantic (1) latent semantic (1) “Latent Semantic” (1) “latent” semantic (1)
In Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
  1. (2013d) compare their predict models to “Latent Semantic Analysis” (LSA) count vectors on syntactic and semantic analogy tasks, finding that the predict models are highly superior.
    Page 2, “Introduction”
  2. For example, the developers of Latent Semantic Analysis (Landauer and Dumais, 1997), Topic Models (Griffiths et al., 2007) and related DSMs have shown that the dimensions of these models can be interpreted as general “latent” semantic domains, which gives the corresponding models some a priori cognitive plausibility while paving the way for interesting applications.
    Page 8, “Conclusion”
  3. Do the dimensions of predict models also encode latent semantic domains?
    Page 8, “Conclusion”


lexical semantics

Appears in 3 sentences as: lexical semantics (3)
In Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
  1. In this paper, we perform such an extensive evaluation, on a wide range of lexical semantics tasks and across many parameter settings.
    Page 1, “Abstract”
  2. In this paper, we overcome the comparison scarcity problem by providing a direct evaluation of count and predict DSMs across many parameter settings and on a large variety of mostly standard lexical semantics benchmarks.
    Page 2, “Introduction”
  3. Add to this that, beyond the standard lexical semantics challenges we tested here, predict models are currently being successfully applied in cutting-edge domains such as representing phrases (Mikolov et al., 2013c; Socher et al., 2012) or fusing language and vision in a common semantic space (Frome et al., 2013; Socher et al., 2013).
    Page 8, “Conclusion”


probability distribution

Appears in 3 sentences as: probability distribution (2) probability distributions (1)
In Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
  1. Latent Dirichlet Allocation (LDA) models (Blei et al., 2003; Griffiths et al., 2007), where parameters are set to optimize the joint probability distribution of words and documents.
    Page 2, “Introduction”
  2. The word2vec toolkit implements two efficient alternatives to the standard computation of the output word probability distributions by a softmax classifier.
    Page 3, “Distributional semantic models”
  3. Hierarchical softmax is a computationally efficient way to estimate the overall probability distribution using an output layer that is proportional to log(unigram.perplexity(W)) instead of W (for W the vocabulary size).
    Page 3, “Distributional semantic models”
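The log-versus-linear cost contrast in excerpt 3 can be illustrated with a toy Huffman tree over an invented unigram distribution: the expected number of binary decisions per prediction tracks the distribution's entropy rather than the vocabulary size. All vocabulary items and counts below are made up for illustration.

```python
import heapq
import math

def huffman_depths(freqs):
    """Depth of each word in a Huffman tree built from unigram frequencies."""
    heap = [(f, i, {w: 0}) for i, (w, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    counter = len(heap)  # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {w: d + 1 for w, d in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Toy Zipf-like unigram counts (invented): frequent words get short codes.
counts = {f"w{i}": 1000 // i for i in range(1, 65)}
total = sum(counts.values())
depths = huffman_depths(counts)

# Expected number of binary decisions per prediction vs. full-softmax cost:
avg_decisions = sum(counts[w] * depths[w] for w in counts) / total
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(f"vocab size {len(counts)}, entropy {entropy:.2f} bits, "
      f"avg decisions {avg_decisions:.2f}")
```

A full softmax would normalize over all 64 vocabulary items for every prediction, whereas the Huffman-tree path length stays within one bit of the unigram entropy, which is the sense in which the output cost is logarithmic rather than linear in the vocabulary.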


semantic relations

Appears in 3 sentences as: Semantic relatedness (1) semantic relations (2)
In Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
  1. Semantic relatedness: A first set of semantic benchmarks was constructed by asking human subjects to rate the degree of semantic similarity or relatedness between two words on a numerical scale.
    Page 3, “Evaluation materials”
  2. To give just one last example, distributional semanticists have looked at whether certain properties of vectors reflect semantic relations in the expected way: e.g., whether the vectors of hypernyms “distributionally include” the vectors of hyponyms in some mathematically precise sense.
    Page 8, “Conclusion”
  3. Does the structure of predict vectors mimic meaningful semantic relations?
    Page 8, “Conclusion”
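Relatedness benchmarks like the one in excerpt 1 are conventionally scored by Spearman-correlating model similarities with the human ratings. A minimal pure-Python sketch; the ratings, scores, and word pairs here are invented for illustration.

```python
def ranks(xs):
    """Average 1-based ranks of a list of numbers, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Invented human relatedness judgments vs. model cosine scores per word pair.
human = [9.8, 7.4, 6.9, 3.1, 1.2]
model = [0.83, 0.70, 0.72, 0.30, 0.05]
print(round(spearman(human, model), 3))
```

Spearman (rather than Pearson) is the usual choice because it only asks whether the model orders the pairs like the human raters do, not whether the similarity scales match.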


semantic space

Appears in 3 sentences as: semantic space (3)
In Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
  1. (2012) with a method that is in the spirit of the predict models, but lets synonymy information from WordNet constrain the learning process (by favoring solutions in which WordNet synonyms are near in semantic space).
    Page 4, “Evaluation materials”
  2. Systems are evaluated in terms of proportion of questions where the nearest neighbour from the whole semantic space is the correct answer (the given example and test vector triples are excluded from the nearest neighbour search).
    Page 5, “Evaluation materials”
  3. Add to this that, beyond the standard lexical semantics challenges we tested here, predict models are currently being successfully applied in cutting-edge domains such as representing phrases (Mikolov et al., 2013c; Socher et al., 2012) or fusing language and vision in a common semantic space (Frome et al., 2013; Socher et al., 2013).
    Page 8, “Conclusion”
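The nearest-neighbour analogy protocol in excerpt 2 can be sketched with hand-made vectors; the toy 3-dimensional "embeddings" below are constructed so the offset works, not learned from a corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(space, a, b, c):
    """Solve a : b :: c : ? by the nearest neighbour to b - a + c,
    excluding the three input words, as in the evaluation protocol."""
    target = [x - y + z for x, y, z in zip(space[b], space[a], space[c])]
    candidates = (w for w in space if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(space[w], target))

# Hand-made vectors: dim 0 ~ royalty, dim 1 ~ gender, dim 2 ~ unrelated noise.
space = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 0.9],
}
print(analogy(space, "man", "woman", "king"))
```

The exclusion of the given triple from the nearest-neighbour search matters in practice: without it, the offset vector often lands closest to one of the inputs themselves.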


SVD

Appears in 3 sentences as: SVD (8)
In Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
  1. However, it is worth pointing out that the evaluated parameter subset encompasses settings (narrow context window, positive PMI, SVD reduction) that have been…
    Page 2, “Distributional semantic models”
  2. For the count models, PMI is clearly the better weighting scheme, and SVD outperforms NMF as a dimensionality reduction technique.
    Page 7, “Results”
  3. PMI SVD 500 42; PMI SVD 400 46; PMI SVD 500 47; PMI SVD 300 50; PMI SVD 400 51; PMI NMF 300 52; PMI NMF 400 53; PMI SVD 300 53 (rows flattened from a results table)
    Page 7, “Results”
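The best count configurations reported in excerpt 2 combine PMI weighting with SVD compression; this corresponds to the standard PPMI-then-truncated-SVD pipeline, sketched below with numpy on an invented count matrix (the words, contexts, and counts are all made up).

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information weighting of a count matrix."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):
        pmi = np.log((counts * total) / (row * col))
    # Clip negatives, and the -inf entries that zero counts produce:
    return np.maximum(pmi, 0.0)

def reduce_svd(matrix, k):
    """Truncated SVD: keep the top-k latent dimensions, scaled by singular values."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# Invented 4 words x 5 contexts co-occurrence counts; the first two rows
# ("cat", "dog") share contexts, as do the last two ("car", "truck").
counts = np.array([
    [10.0, 8.0, 0.0, 1.0, 0.0],
    [9.0,  7.0, 1.0, 0.0, 0.0],
    [0.0,  1.0, 12.0, 9.0, 1.0],
    [1.0,  0.0, 10.0, 11.0, 0.0],
])

vectors = reduce_svd(ppmi(counts), k=2)
print(vectors.shape)
```

Keeping only the top singular dimensions is what gives the 300-to-500-dimension vector sizes seen in the results rows above, and the two word clusters in the toy matrix stay separated in the reduced space.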
