Improving Word Representations via Global Context and Multiple Word Prototypes
Huang, Eric and Socher, Richard and Manning, Christopher and Ng, Andrew

Article Structure

Abstract

Unsupervised word representations are very useful in NLP tasks both as inputs to learning algorithms and as extra word features in NLP systems.

Introduction

Vector-space models (VSM) represent word meanings with vectors that capture semantic and syntactic information of words.

Global Context-Aware Neural Language Model

In this section, we describe the training objective of our model, followed by a description of the neural network architecture, ending with a brief description of our model’s training method.

Multi-Prototype Neural Language Model

Despite distributional similarity models’ successful applications in various NLP tasks, one major limitation common to most of these models is that they assume only one representation for each word.

Experiments

In this section, we first present a qualitative analysis comparing the nearest neighbors of our model’s embeddings with those of others, showing our embeddings better capture the semantics of words, with the use of global context.

Related Work

Neural language models (Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Schwenk and Gauvain, 2002; Emami et al., 2003) have been shown to be very powerful at language modeling, a task where models are asked to accurately predict the next word given previously seen words.

Conclusion

We presented a new neural network architecture that learns more semantic word representations by using both local and global context in learning.

Topics

embeddings

Appears in 19 sentences as: embeddings (21)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. We present a new neural network architecture which 1) learns word embeddings that better capture the semantics of words by incorporating both local and global document context, and 2) accounts for homonymy and polysemy by learning multiple embeddings per word.
    Page 1, “Abstract”
  2. $C_{s,d} = \sum_{w \in V} \max(0,\, 1 - g(s, d) + g(s^w, d)) \quad (1)$. Collobert and Weston (2008) showed that this ranking approach can produce good word embeddings that are useful in several NLP tasks, and allows much faster training of the model compared to optimizing log-likelihood of the next word (a code sketch of this objective appears after this list).
    Page 2, “Global Context-Aware Neural Language Model”
  3. where $[x_1, x_2, \ldots, x_m]$ is the concatenation of the $m$ word embeddings representing sequence $s$, $f$ is an element-wise activation function such as tanh, $a_1 \in \mathbb{R}^{h \times 1}$ is the activation of the hidden layer with $h$ hidden nodes, $W_1 \in \mathbb{R}^{h \times (mn)}$ and $W_2 \in \mathbb{R}^{1 \times h}$ are respectively the first and second layer weights of the neural network, and $b_1$, $b_2$ are the biases of each layer.
    Page 3, “Global Context-Aware Neural Language Model”
  4. For the score of the global context, we represent the document also as an ordered list of word embeddings, $d = (d_1, d_2, \ldots, d_k)$.
    Page 3, “Global Context-Aware Neural Language Model”
  5. We found that word embeddings move to good positions in the vector space faster when using mini-batch L-BFGS (Liu and Nocedal, 1989) with 1000 pairs of good and corrupt examples per batch for training, compared to stochastic gradient descent.
    Page 3, “Global Context-Aware Neural Language Model”
  6. We present a way to use our learned single-prototype embeddings to represent each context window, which can then be used by clustering to perform word sense discrimination (Schutze, 1998).
    Page 4, “Multi-Prototype Neural Language Model”
  7. In this section, we first present a qualitative analysis comparing the nearest neighbors of our model’s embeddings with those of others, showing our embeddings better capture the semantics of words, with the use of global context.
    Page 4, “Experiments”
  8. For all experiments, our models use 50-dimensional embeddings .
    Page 4, “Experiments”
  9. The nearest neighbors of “market” that C&W’s embeddings give are more constrained by the syntactic constraint that words in plural form are only close to other words in plural form, whereas our model captures that the singular and plural forms of a word are similar in meaning.
    Page 4, “Experiments”
  10. Table 2: Nearest neighbors of word embeddings learned by our model using the multi-prototype approach based on cosine similarity.
    Page 5, “Experiments”
  11. We downloaded these embeddings from Turian et al. (2010).
    Page 5, “Experiments”
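
Excerpt 2 above quotes the ranking objective of Eq. (1). The following is a minimal sketch of that objective, not the authors' code: `g` stands for the scoring function g(s, d), `s` is the word window as a list, `d` the document, and `vocab` the dictionary; all names are illustrative.

```python
import random

def ranking_loss(g, s, d, vocab):
    """Eq. (1): sum over all corrupt last words w in the vocabulary of
    max(0, 1 - g(s, d) + g(s_w, d)), where s_w is the window s with its
    last word replaced by w."""
    score_good = g(s, d)
    return sum(max(0.0, 1.0 - score_good + g(s[:-1] + [w], d)) for w in vocab)

def sampled_ranking_loss(g, s, d, vocab):
    """Training approximation described in the paper: draw one random corrupt
    word instead of summing over the whole vocabulary."""
    w = random.choice(vocab)
    return max(0.0, 1.0 - g(s, d) + g(s[:-1] + [w], d))
```

During training (excerpt 10 under “neural network” below), the gradient of the sampled loss is taken with respect to the network weights and the embedding matrix L and applied via backpropagation.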

neural network

Appears in 13 sentences as: Neural Network (1) neural network (9) neural networks (3)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. We present a new neural network architecture which 1) learns word embeddings that better capture the semantics of words by incorporating both local and global document context, and 2) accounts for homonymy and polysemy by learning multiple embeddings per word.
    Page 1, “Abstract”
  2. In this section, we describe the training objective of our model, followed by a description of the neural network architecture, ending with a brief description of our model’s training method.
    Page 2, “Global Context-Aware Neural Language Model”
  3. We compute scores $g(s, d)$ and $g(s^w, d)$, where $s^w$ is $s$ with the last word replaced by word $w$, and $g(\cdot, \cdot)$ is the scoring function that represents the neural networks used.
    Page 2, “Global Context-Aware Neural Language Model”
  4. 2.2 Neural Network Architecture
    Page 2, “Global Context-Aware Neural Language Model”
  5. The scoring components are computed by two neural networks , one capturing local context and the other global context, as shown in Figure 1.
    Page 2, “Global Context-Aware Neural Language Model”
  6. To compute the score of local context, $\mathrm{score}_l$, we use a neural network with one hidden layer:
    Page 3, “Global Context-Aware Neural Language Model”
  7. where $[x_1, x_2, \ldots, x_m]$ is the concatenation of the $m$ word embeddings representing sequence $s$, $f$ is an element-wise activation function such as tanh, $a_1 \in \mathbb{R}^{h \times 1}$ is the activation of the hidden layer with $h$ hidden nodes, $W_1 \in \mathbb{R}^{h \times (mn)}$ and $W_2 \in \mathbb{R}^{1 \times h}$ are respectively the first and second layer weights of the neural network, and $b_1$, $b_2$ are the biases of each layer.
    Page 3, “Global Context-Aware Neural Language Model”
  8. We use a two-layer neural network to compute the global context score, $\mathrm{score}_g$, similar to the above:
    Page 3, “Global Context-Aware Neural Language Model”
  9. the hidden layer with $h^{(g)}$ hidden nodes, $W_1^{(g)} \in \mathbb{R}^{h^{(g)} \times 2n}$ and $W_2^{(g)} \in \mathbb{R}^{1 \times h^{(g)}}$ are respectively the first and second layer weights of the neural network, and $b_1^{(g)}$, $b_2^{(g)}$ are the biases of each layer (a code sketch of both scoring networks appears after this list).
    Page 3, “Global Context-Aware Neural Language Model”
  10. Following Collobert and Weston (2008), we sample the gradient of the objective by randomly choosing a word from the dictionary as a corrupt example for each sequence-document pair, (8, d), and take the derivative of the ranking loss with respect to the parameters: weights of the neural network and the embedding matrix L. These weights are updated via backpropagation.
    Page 3, “Global Context-Aware Neural Language Model”
  11. We use 10-word windows of text as the local context, 100 hidden units, and no weight regularization for both neural networks .
    Page 4, “Experiments”
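
Excerpts 6–9 above describe the two scoring networks. Below is a minimal NumPy sketch of how such scores could be computed. It assumes the document is summarized by a (possibly weighted) average of its word embeddings and stores the second-layer weights as vectors; parameter names and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def score_local(x_window, W1, b1, w2, b2):
    """score_l = w2 . tanh(W1 [x1; ...; xm] + b1) + b2 for an m-word window.
    W1 has shape (h, m*n), w2 has shape (h,); n is the embedding size."""
    a1 = np.tanh(W1 @ np.concatenate(x_window) + b1)   # hidden activation, shape (h,)
    return w2 @ a1 + b2

def score_global(doc_embs, x_last, Wg1, bg1, wg2, bg2, weights=None):
    """Global score: summarize the document d = (d1, ..., dk) by a (weighted)
    average of its word embeddings, concatenate it with the embedding of the
    window's last word, and score with a second two-layer network.
    Wg1 has shape (h_g, 2*n), wg2 has shape (h_g,)."""
    c = np.average(doc_embs, axis=0, weights=weights)  # global semantic vector, shape (n,)
    a1 = np.tanh(Wg1 @ np.concatenate([c, x_last]) + bg1)
    return wg2 @ a1 + bg2

def g(x_window, doc_embs, local_params, global_params):
    """Total score g(s, d) as the sum of the local and global scores (sketch)."""
    return (score_local(x_window, *local_params)
            + score_global(doc_embs, x_window[-1], *global_params))
```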

word embeddings

Appears in 13 sentences as: word embeddings (13)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. We present a new neural network architecture which 1) learns word embeddings that better capture the semantics of words by incorporating both local and global document context, and 2) accounts for homonymy and polysemy by learning multiple embeddings per word.
    Page 1, “Abstract”
  2. $C_{s,d} = \sum_{w \in V} \max(0,\, 1 - g(s, d) + g(s^w, d)) \quad (1)$. Collobert and Weston (2008) showed that this ranking approach can produce good word embeddings that are useful in several NLP tasks, and allows much faster training of the model compared to optimizing log-likelihood of the next word.
    Page 2, “Global Context-Aware Neural Language Model”
  3. where $[x_1, x_2, \ldots, x_m]$ is the concatenation of the $m$ word embeddings representing sequence $s$, $f$ is an element-wise activation function such as tanh, $a_1 \in \mathbb{R}^{h \times 1}$ is the activation of the hidden layer with $h$ hidden nodes, $W_1 \in \mathbb{R}^{h \times (mn)}$ and $W_2 \in \mathbb{R}^{1 \times h}$ are respectively the first and second layer weights of the neural network, and $b_1$, $b_2$ are the biases of each layer.
    Page 3, “Global Context-Aware Neural Language Model”
  4. For the score of the global context, we represent the document also as an ordered list of word embeddings, $d = (d_1, d_2, \ldots, d_k)$.
    Page 3, “Global Context-Aware Neural Language Model”
  5. We found that word embeddings move to good positions in the vector space faster when using mini-batch L-BFGS (Liu and Nocedal, 1989) with 1000 pairs of good and corrupt examples per batch for training, compared to stochastic gradient descent.
    Page 3, “Global Context-Aware Neural Language Model”
  6. Table 2: Nearest neighbors of word embeddings learned by our model using the multi-prototype approach based on cosine similarity.
    Page 5, “Experiments”
  7. Table 3: Spearman's ρ correlation on WordSim-353, showing our model's improvement over previous neural models for learning word embeddings.
    Page 5, “Experiments”
  8. C&W* denotes the word embeddings trained and provided by C&W.
    Page 5, “Experiments”
  9. Our model is able to learn more semantic word embeddings and noticeably improves upon C&W’s model.
    Page 5, “Experiments”
  10. Our model uses a similar neural network architecture as these models and uses the ranking-loss training objective proposed by Collobert and Weston (2008), but introduces a new way to combine local and global context to train word embeddings .
    Page 8, “Related Work”
  11. Besides language modeling, word embeddings induced by neural language models have been useful in chunking, NER (Turian et al., 2010), parsing (Socher et al., 2011b), sentiment analysis (Socher et al., 2011c) and paraphrase detection (Socher et al., 2011a).
    Page 8, “Related Work”

human judgments

Appears in 11 sentences as: human judgments (11)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models.
    Page 1, “Abstract”
  2. However, one limitation of this evaluation is that the human judgments are on pairs of words presented in isolation.
    Page 1, “Introduction”
  3. Since word interpretation in context is important especially for homonymous and polysemous words, we introduce a new dataset with human judgments on similarity between pairs of words in sentential context.
    Page 2, “Introduction”
  4. Our model also improves the correlation with human judgments on a word similarity task.
    Page 4, “Experiments”
  5. Because word meaning in context is important, we introduce a new dataset with human judgments on similarity of pairs of words in sentential context.
    Page 4, “Experiments”
  6. Each pair is presented without context and associated with 13 to 16 human judgments on similarity and relatedness on a scale from 0 to 10.
    Page 5, “Experiments”
  7. The many previous datasets that associate human judgments on similarity between pairs of words, such as WordSim-353, MC (Miller and Charles, 1991) and RG (Rubenstein and Goodenough, 1965), have helped to advance the development of vector-space models.
    Page 5, “Experiments”
  8. It is unclear how this variation in meaning is accounted for in human judgments of words presented without context.
    Page 6, “Experiments”
  9. The dataset has three interesting characteristics: 1) human judgments are on pairs of words presented in sentential context, 2) word pairs and their contexts are chosen to reflect interesting variations in meanings of homonymous and polysemous words, and 3) verbs and adjectives are present in addition to nouns.
    Page 6, “Experiments”
  10. For evaluation, we also compute Spearman correlation between a model's computed similarity scores and human judgments (a code sketch appears after this list).
    Page 7, “Experiments”
  11. We introduced a new dataset with human judgments on similarity between pairs of words in context, so as to evaluate models' abilities to capture homonymy and polysemy of words in context.
    Page 8, “Conclusion”
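
Excerpt 10 above names the evaluation metric. A minimal sketch with SciPy, assuming `model_scores` and `human_scores` are parallel lists of similarity values for the same word pairs; the names are illustrative.

```python
from scipy.stats import spearmanr

def spearman_rho(model_scores, human_scores):
    """Spearman rank correlation between a model's computed similarity scores
    and the averaged human judgments for the same word pairs."""
    rho, _p_value = spearmanr(model_scores, human_scores)
    return rho

# Example: spearman_rho([0.71, 0.12, 0.45], [9.0, 1.5, 6.0]) -> 1.0
```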

language model

Appears in 10 sentences as: language model (5) language modeling (3) language models (4)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models .
    Page 1, “Abstract”
  2. We introduce a new neural-network-based language model that distinguishes and uses both local and global context via a joint training objective.
    Page 1, “Introduction”
  3. We show that our multi-prototype model improves upon the single-prototype version and outperforms other neural language models and baselines on this dataset.
    Page 2, “Introduction”
  4. Note that Collobert and Weston (2008)’s language model corresponds to the network using only local context.
    Page 3, “Global Context-Aware Neural Language Model”
  5. Table 3 shows our results compared to previous methods, including C&W’s language model and the hierarchical log-bilinear (HLBL) model (Mnih and Hinton, 2008), which is a probabilistic, linear neural model.
    Page 5, “Experiments”
  6. Neural language models (Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Schwenk and Gauvain, 2002; Emami et al., 2003) have been shown to be very powerful at language modeling , a task where models are asked to accurately predict the next word given previously seen words.
    Page 7, “Related Work”
  7. Schwenk and Gauvain (2002) tried to incorporate larger context by combining partial parses of past word sequences and a neural language model .
    Page 8, “Related Work”
  8. They used up to 3 previous head words and showed increased performance on language modeling .
    Page 8, “Related Work”
  9. Besides language modeling, word embeddings induced by neural language models have been useful in chunking, NER (Turian et al., 2010), parsing (Socher et al., 2011b), sentiment analysis (Socher et al., 2011c) and paraphrase detection (Socher et al., 2011a).
    Page 8, “Related Work”
  10. Our new multi-prototype neural language model outperforms previous neural models and competitive baselines on this new dataset.
    Page 8, “Conclusion”

word representations

Appears in 9 sentences as: word representation (1) word representations (8)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. Unsupervised word representations are very useful in NLP tasks both as inputs to learning algorithms and as extra word features in NLP systems.
    Page 1, “Abstract”
  2. The model learns word representations that better capture the semantics of words, while still keeping syntactic information.
    Page 1, “Introduction”
  3. Our model jointly learns word representations while learning to discriminate the next word given a short word sequence (local context) and the document (global context) in which the word sequence occurs.
    Page 2, “Global Context-Aware Neural Language Model”
  4. Because our goal is to learn useful word representations and not the probability of the next word given previous words (which prohibits looking ahead), our model can utilize the entire document to provide global context.
    Page 2, “Global Context-Aware Neural Language Model”
  5. The embedding matrix L is the word representations .
    Page 3, “Global Context-Aware Neural Language Model”
  6. Finally, each word occurrence in the corpus is relabeled to its associated cluster and is used to train the word representation for that cluster.
    Page 4, “Multi-Prototype Neural Language Model”
  7. In order to show that our model learns more semantic word representations with global context, we give the nearest neighbors of our single-prototype model versus C&W’s, which only uses local context.
    Page 4, “Experiments”
  8. Two other recent papers (Dhillon et al., 2011; Reddy et al., 2011) present models for constructing word representations that deal with context.
    Page 8, “Related Work”
  9. We presented a new neural network architecture that learns more semantic word representations by using both local and global context in learning.
    Page 8, “Conclusion”

synset

Appears in 7 sentences as: synset (6) synsets (4)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. In step 1, in order to make sure we select a diverse list of words, we consider three attributes of a word: frequency in a corpus, number of parts of speech, and number of synsets according to WordNet.
    Page 6, “Experiments”
  2. We also group words by their number of synsets : [0,5], [6,10], [11, 20], and [20, max].
    Page 6, “Experiments”
  3. (2010), we use WordNet to first randomly select one synset of the first word; we then construct a set of words in various relations to the first word's chosen synset, including hypernyms, hyponyms, holonyms, meronyms and attributes (a code sketch of this step appears after this list).
    Page 6, “Experiments”
  4. In addition, for words with more than five synsets, we allow the second word to be the same as the first, but with different synsets .
    Page 6, “Experiments”
  5. We end up with pairs of words as well as the one chosen synset for each word in the pairs.
    Page 6, “Experiments”
  6. In step 3, we aim to extract a sentence from Wikipedia for each word, which contains the word and corresponds to a usage of the chosen synset .
    Page 6, “Experiments”
  7. To find word usages that correspond to the chosen synset, we first construct a set of related words of the chosen synset, including hypernyms, hyponyms, holonyms, meronyms and attributes.
    Page 7, “Experiments”
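
Excerpts 3 and 7 describe collecting words related to a randomly chosen synset. Here is a rough sketch using NLTK's WordNet interface; the exact relation set and sampling procedure used to build the dataset may differ from this illustration.

```python
import random
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def chosen_synset_and_related(word):
    """Randomly pick one synset of `word`, then gather lemmas of synsets related
    to it: hypernyms, hyponyms, holonyms, meronyms and attributes."""
    synsets = wn.synsets(word)
    if not synsets:
        return None, set()
    chosen = random.choice(synsets)
    related = (chosen.hypernyms() + chosen.hyponyms()
               + chosen.member_holonyms() + chosen.part_holonyms()
               + chosen.member_meronyms() + chosen.part_meronyms()
               + chosen.attributes())
    lemmas = {name for syn in related for name in syn.lemma_names()}
    return chosen, lemmas
```

For example, `chosen_synset_and_related('bank')` picks one sense of “bank” at random and returns the lemma names of the synsets related to that sense.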

word pairs

Appears in 5 sentences as: word pairs (5)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. To capture interesting word pairs , we sample different senses of words using WordNet (Miller, 1995).
    Page 2, “Introduction”
  2. The dataset has three interesting characteristics: 1) human judgments are on pairs of words presented in sentential context, 2) word pairs and their contexts are chosen to reflect interesting variations in meanings of homonymous and polysemous words, and 3) verbs and adjectives are present in addition to nouns.
    Page 6, “Experiments”
  3. We obtained a total of 2,003 word pairs and their sentential contexts.
    Page 7, “Experiments”
  4. The word pairs consist of 1,712 unique words.
    Page 7, “Experiments”
  5. Of the 2,003 word pairs, 1,328 are noun-noun pairs, 399 verb-verb, 140 verb-noun, 97 adjective-adjective, 30 noun-adjective, and 9 verb-adjective.
    Page 7, “Experiments”

cosine similarity

Appears in 3 sentences as: cosine similarity (3)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. The nearest neighbors of a word are computed by comparing the cosine similarity between the center word and all other words in the dictionary (see the code sketch after this list).
    Page 4, “Experiments”
  2. Table 1: Nearest neighbors of words based on cosine similarity .
    Page 5, “Experiments”
  3. Table 2: Nearest neighbors of word embeddings learned by our model using the multi-prototype approach based on cosine similarity .
    Page 5, “Experiments”
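
Excerpt 1 above describes the nearest-neighbor computation. A minimal NumPy sketch, where `L` is the |V| × n embedding matrix (in the multi-prototype case each prototype simply occupies its own row); names are illustrative.

```python
import numpy as np

def nearest_neighbors(L, word_id, k=10):
    """Indices of the k words whose embeddings have the highest cosine
    similarity to the embedding of `word_id`."""
    normed = L / np.linalg.norm(L, axis=1, keepdims=True)
    sims = normed @ normed[word_id]        # cosine similarity to every word
    order = np.argsort(-sims)
    return [int(i) for i in order if i != word_id][:k]
```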

hidden layer

Appears in 3 sentences as: hidden layer (2) hidden layer: (1)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. To compute the score of local context, $\mathrm{score}_l$, we use a neural network with one hidden layer:
    Page 3, “Global Context-Aware Neural Language Model”
  2. where $[x_1, x_2, \ldots, x_m]$ is the concatenation of the $m$ word embeddings representing sequence $s$, $f$ is an element-wise activation function such as tanh, $a_1 \in \mathbb{R}^{h \times 1}$ is the activation of the hidden layer with $h$ hidden nodes, $W_1 \in \mathbb{R}^{h \times (mn)}$ and $W_2 \in \mathbb{R}^{1 \times h}$ are respectively the first and second layer weights of the neural network, and $b_1$, $b_2$ are the biases of each layer.
    Page 3, “Global Context-Aware Neural Language Model”
  3. the hidden layer with $h^{(g)}$ hidden nodes, $W_1^{(g)} \in \mathbb{R}^{h^{(g)} \times 2n}$ and $W_2^{(g)} \in \mathbb{R}^{1 \times h^{(g)}}$ are respectively the first and second layer weights of the neural network, and $b_1^{(g)}$, $b_2^{(g)}$ are the biases of each layer.
    Page 3, “Global Context-Aware Neural Language Model”

similarity scores

Appears in 3 sentences as: similarity score (1) similarity scores (2)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. However, common to all datasets is that similarity scores are given to pairs of words in isolation.
    Page 5, “Experiments”
  2. Single-prototype models would give the max similarity score for those pairs, which can be problematic depending on the words’ contexts.
    Page 7, “Experiments”
  3. For evaluation, we also compute Spearman correlation between a model’s computed similarity scores and human judgments.
    Page 7, “Experiments”

word sense

Appears in 3 sentences as: word sense (3)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. Reisinger and Mooney (2010b) introduced a multi-prototype VSM where word sense discrimination is first applied by clustering contexts, and then prototypes are built using the contexts of the sense-labeled words.
    Page 1, “Introduction”
  2. We present a way to use our learned single-prototype embeddings to represent each context window, which can then be used by clustering to perform word sense discrimination (Schutze, 1998); a code sketch of this step appears after this list.
    Page 4, “Multi-Prototype Neural Language Model”
  3. The multi-prototype approach has been widely studied in models of categorization in psychology (Rosseel, 2002; Griffiths et al., 2009), while Schutze (1998) used clustering of contexts to perform word sense discrimination.
    Page 8, “Related Work”
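
Excerpt 2 above (and excerpt 3's mention of clustering contexts) can be made concrete with a short sketch. It assumes each occurrence of a target word is represented by a plain average of the single-prototype embeddings of its context words, and uses k-means on unit-normalized vectors as a simple way to cluster by cosine similarity; the paper's exact context weighting and clustering algorithm may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def context_vector(window_ids, L):
    """Represent one occurrence of a word by the average of the single-prototype
    embeddings of the words in its context window. L is the |V| x n embedding matrix."""
    return L[window_ids].mean(axis=0)

def relabel_occurrences(all_windows, L, n_clusters):
    """Cluster the context vectors of all occurrences of a word and return a
    cluster id per occurrence; each occurrence is then relabeled with its cluster,
    and one embedding is trained per cluster."""
    X = np.vstack([context_vector(w, L) for w in all_windows])
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # compare by direction (cosine)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
```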

WordNet

Appears in 3 sentences as: WordNet (3)
In Improving Word Representations via Global Context and Multiple Word Prototypes
  1. To capture interesting word pairs, we sample different senses of words using WordNet (Miller, 1995).
    Page 2, “Introduction”
  2. In step 1, in order to make sure we select a diverse list of words, we consider three attributes of a word: frequency in a corpus, number of parts of speech, and number of synsets according to WordNet .
    Page 6, “Experiments”
  3. (2010), we use WordNet to first randomly select one synset of the first word; we then construct a set of words in various relations to the first word's chosen synset, including hypernyms, hyponyms, holonyms, meronyms and attributes.
    Page 6, “Experiments”
