Distributed Representations of Geographically Situated Language
Bamman, David and Dyer, Chris and Smith, Noah A.

Article Structure

Abstract

We introduce a model for incorporating contextual information (such as geography) in learning vector-space representations of situated language.

Introduction

The vast textual resources used in NLP — newswire, web text, parliamentary proceedings — can encourage a view of language as a disembodied phenomenon.

Model

The model we introduce is grounded in the distributional hypothesis (Harris, 1954), that two words are similar to the extent that they appear in the same kinds of contexts (where “context” itself can be variously defined as the bag or sequence of tokens around a target word, either by linear distance or dependency path).

Evaluation

We evaluate our model by confirming its face validity in a qualitative analysis and estimating its accuracy at the quantitative task of judging geographically-informed semantic similarity.

Conclusion

We introduced a model for leveraging situational information in learning vector-space representations of words that are sensitive to the speaker’s social context.

Acknowledgments

The research reported in this article was supported by US NSF grants IIS-1251131 and CAREER IIS-1054319, and by an ARCS scholarship to DB.

Topics

cosine similarity

Appears in 6 sentences as: cosine similarity (6)
In Distributed Representations of Geographically Situated Language
  1. To illustrate how the model described above can learn geographically-informed semantic representations of words, table 1 displays the terms with the highest cosine similarity to wicked in Kansas and Massachusetts after running our joint model on the full 1.1 billion words of Twitter data; while wicked in Kansas is close to other evaluative terms like evil and pure and religious terms like gods and spirit, in Massachusetts it is most similar to other intensifiers like super, ridiculously and insanely.
    Page 3, “Evaluation”
  2. Table 2 likewise presents the terms with the highest cosine similarity to city in both California and New York; while the terms most evoked by city in California include regional locations like Chinatown, Los Angeles’ South Bay and San Francisco’s East Bay, in New York the most similar terms include hamptons, upstate and borough.
    Page 3, “Evaluation”
  3. Table 1: Terms with the highest cosine similarity to wicked in Kansas and Massachusetts.
    Page 4, “Evaluation”
  4. Table 2: Terms with the highest cosine similarity to city in California and New York.
    Page 4, “Evaluation”
  5. For each category, we measure similarity as the average cosine similarity between the vector for the target word for that category (e.g., city) and the corresponding vector for each state-specific answer (e.g., chicago for IL; boston for MA).
    Page 4, “Evaluation”
  6. As one concrete example of these differences between individual data points, the cosine similarity between city and seattle in the –GEO model is 0.728 (seattle is ranked as the 188th most similar term to city overall); in the INDIVIDUAL model using only tweets from Washington state, δ_WA(city, seattle) = 0.780 (rank #32); and in the JOINT model, using information from the entire United States with deviations for Washington, δ_WA(city, seattle) = 0.858 (rank #6).
    Page 5, “Evaluation”
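
The similarities and ranks quoted in sentences 5 and 6 above are plain cosine similarities and similarity ranks computed over state-specific word vectors. Below is a minimal, self-contained sketch of both computations; the dictionary of random toy vectors stands in for the model's actual state-specific output, and all names are illustrative assumptions rather than the authors' code:

    import numpy as np

    def cosine(u, v):
        """Standard cosine similarity between two vectors."""
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def rank_of(target, query, vectors):
        """1-based rank of `target` among all words, ordered by similarity to `query`."""
        sims = {w: cosine(vectors[query], vec) for w, vec in vectors.items() if w != query}
        ordered = sorted(sims, key=sims.get, reverse=True)
        return ordered.index(target) + 1

    # Toy stand-in for the WA-specific vectors of the joint model, which in the
    # paper give cosine(city, seattle) = 0.858 and rank #6 for seattle.
    rng = np.random.default_rng(0)
    vectors = {w: rng.normal(size=100) for w in ["city", "seattle", "boston", "chicago"]}

    print(cosine(vectors["city"], vectors["seattle"]))
    print(rank_of("seattle", "city", vectors))

The per-category score described in sentence 5 would then be the mean of such cosine values across states, pairing each state's target-word vector with that state's answer vector.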

joint model

Appears in 6 sentences as: JOINT model (3) joint model (4)
In Distributed Representations of Geographically Situated Language
  1. In a quantitative evaluation on the task of judging geographically informed semantic similarity between representations learned from 1.1 billion words of geo-located tweets, our joint model outperforms comparable independent models that learn meaning in isolation.
    Page 1, “Abstract”
  2. A joint model has three a priori advantages over independent models: (i) sharing data across variable values encourages representations across those values to be similar; e.g., while city may be closer to Boston in Massachusetts and Chicago in Illinois, in both places it still generally connotes a municipality; (ii) such sharing can mitigate data sparseness for less-witnessed areas; and (iii) with a joint model, all representations are guaranteed to be in the same vector space and can therefore be compared to each other.
    Page 3, “Model”
  3. To illustrate how the model described above can learn geographically-informed semantic representations of words, table 1 displays the terms with the highest cosine similarity to wicked in Kansas and Massachusetts after running our joint model on the full 1.1 billion words of Twitter data; while wicked in Kansas is close to other evaluative terms like evil and pure and religious terms like gods and spirit, in Massachusetts it is most similar to other intensifiers like super, ridiculously and insanely.
    Page 3, “Evaluation”
  4. As one concrete example of these differences between individual data points, the cosine similarity between city and seattle in the –GEO model is 0.728 (seattle is ranked as the 188th most similar term to city overall); in the INDIVIDUAL model using only tweets from Washington state, δ_WA(city, seattle) = 0.780 (rank #32); and in the JOINT model, using information from the entire United States with deviations for Washington, δ_WA(city, seattle) = 0.858 (rank #6).
    Page 5, “Evaluation”
  5. While the two models that include geographical information naturally outperform the model that does not, the JOINT model generally far outperforms the INDIVIDUAL models trained on state-specific subsets of the data.[1] A model that can exploit all of the information in the data, learning core vector-space representations for all words along with deviations for each contextual variable, is able to learn more geographically-informed representations for this task than strict geographical models alone.
    Page 5, “Evaluation”
  6. [1] This result is robust to the choice of distance metric; an evaluation measuring the Euclidean distance between vectors shows the JOINT model to outperform the INDIVIDUAL and –GEO models across all seven categories.
    Page 5, “Evaluation”
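
The three systems compared here differ only in how a word's vector for a given state is formed. The following is a hedged sketch of the three variants using hypothetical NumPy parameter names (W_main for the shared matrix, W_dev for the joint model's per-state deviations, W_indiv for separately trained per-state models); none of these names, shapes, or values come from the authors' released model:

    import numpy as np

    # Hypothetical learned parameters (toy shapes only).
    V, k = 1000, 100
    rng = np.random.default_rng(0)
    W_main = rng.normal(size=(V, k))            # shared across all states
    W_dev = {"WA": rng.normal(size=(V, k))}     # JOINT model's per-state deviations
    W_indiv = {"WA": rng.normal(size=(V, k))}   # separately trained per-state model

    def vec_no_geo(i):
        """-GEO baseline: a single representation per word, ignoring location."""
        return W_main[i]

    def vec_individual(s, i):
        """INDIVIDUAL: a separate model per state; vectors are not comparable across states."""
        return W_indiv[s][i]

    def vec_joint(s, i):
        """JOINT: shared main embedding plus a state-specific deviation (w^T W_main + w^T W_s)."""
        return W_main[i] + W_dev[s][i]

Because vec_joint always starts from the shared W_main, every state's vectors land in one space, which is what makes the cross-state comparisons quoted above well defined; vec_individual offers no such guarantee.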

embeddings

Appears in 4 sentences as: embeddings (4)
In Distributed Representations of Geographically Situated Language
  1. The first is the representation matrix W ∈ ℝ^{|V|×k}, which encodes the real-valued embeddings for each word in the vocabulary.
    Page 2, “Model”
  2. Backpropagation using (input x, output y) word tuples learns the values of W (the embeddings) and X (the output parameter matrix) that maximize the likelihood of y (i.e., the context words) conditioned on x (i.e., the w’s).
    Page 2, “Model”
  3. Given an input word w and set of active variable values A (e.g., A = {state = MA}), we calculate the hidden layer h as the sum of these independent embeddings: h = wᵀW_main + Σ_{a ∈ A} wᵀW_a.
    Page 3, “Model”
  4. The additional W embeddings we add lead to an increase in the number of total parameters by a factor of |C|. To control for the extra degrees of freedom this entails, we add squared ℓ2 regularization to all parameters, using stochastic gradient descent for backpropagation with minibatch updates for the regularization term.
    Page 3, “Model”
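
A minimal NumPy sketch of this composition step, assuming the learned matrices are available as a shared array W_main and a per-state dict W_state, each of shape (|V|, k); the names, toy sizes, and toy vocabulary are illustrative assumptions, not the paper's released implementation:

    import numpy as np

    V, k = 10_000, 100                      # |V| vocabulary words, k-dimensional embeddings
    rng = np.random.default_rng(0)
    W_main = rng.normal(scale=0.01, size=(V, k))                              # shared main embeddings
    W_state = {s: rng.normal(scale=0.01, size=(V, k)) for s in ["MA", "KS"]}  # per-state deviations

    vocab = {"wicked": 17, "city": 42}      # hypothetical word -> row index

    def hidden_layer(word, active_values):
        """h = w^T W_main + sum over a in A of w^T W_a, per the Model section."""
        i = vocab[word]
        h = W_main[i].copy()
        for a in active_values:             # e.g., active_values = ["MA"] for A = {state = MA}
            h += W_state[a][i]
        return h

    h_ma = hidden_layer("wicked", ["MA"])   # situated representation of "wicked" in Massachusetts

The squared ℓ2 penalty from sentence 4 would be added to the training loss over all of these W matrices; it is omitted here because the sketch covers only the forward composition.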

lexical semantics

Appears in 3 sentences as: lexical semantic (1) lexical semantics (2)
In Distributed Representations of Geographically Situated Language
  1. In this paper, we introduce a method that extends vector-space lexical semantic models to learn representations of geographically situated language.
    Page 1, “Introduction”
  2. Vector-space models of lexical semantics have been a popular and effective approach to learning representations of word meaning (Lin, 1998; Turney and Pantel, 2010; Reisinger and Mooney, 2010; Socher et al., 2013; Mikolov et al., 2013, inter alia).
    Page 1, “Introduction”
  3. While our results use geographical information in learning low-dimensional representations, other contextual variables are straightforward to include as well; incorporating effects for time — such as time of day, month of year and absolute year — may be a powerful tool for revealing periodic and historical influences on lexical semantics.
    Page 5, “Conclusion”

named entities

Appears in 3 sentences as: named entities (3)
In Distributed Representations of Geographically Situated Language
  1. This information enables learning models of word meaning that are sensitive to such factors, allowing us to distinguish, for example, between the usage of wicked in Massachusetts and the usage of that word elsewhere, and letting us better associate geographically grounded named entities (e.g., Boston) with their hypernyms (city) in their respective regions.
    Page 2, “Introduction”
  2. In the absence of a sizable number of linguistically interesting terms (like wicked) that are known to be geographically variable, we consider the proxy of estimating the named entities evoked by specific terms in different geographical regions.
    Page 4, “Evaluation”
  3. As noted above, geographic terms like city provide one such example: in Massachusetts we expect the term city to be more strongly connected to grounded named entities like Boston than to other US cities.
    Page 4, “Evaluation”

semantic similarity

Appears in 3 sentences as: semantic similarity (3)
In Distributed Representations of Geographically Situated Language
  1. In a quantitative evaluation on the task of judging geographically informed semantic similarity between representations learned from 1.1 billion words of geo-located tweets, our joint model outperforms comparable independent models that learn meaning in isolation.
    Page 1, “Abstract”
  2. We evaluate our model by confirming its face validity in a qualitative analysis and estimating its accuracy at the quantitative task of judging geographically-informed semantic similarity.
    Page 3, “Evaluation”
  3. As a quantitative measure of our model’s performance, we consider the task of judging semantic similarity among words whose meanings are likely to evoke strong geographical correlations.
    Page 4, “Evaluation”

vector space

Appears in 3 sentences as: vector space (3)
In Distributed Representations of Geographically Situated Language
  1. With a joint model, all representations are guaranteed to be in the same vector space and can therefore be compared to each other; with individual models (each with different initializations), word vectors across different states may not be directly compared.
    Page 3, “Model”
  2. In all experiments, the contextual variable is the observed US state (including DC), so that |C| = 51; the vector space representation of word w in state s is wᵀW_main + wᵀW_s.
    Page 3, “Evaluation”
  3. By allowing all words in different regions (or more generally, with different metadata factors) to exist in the same vector space, we are able to compare different points in that space — for example, to ask what terms used in Chicago are most similar to hot dog in New York, or what word groups shift together in the same region in comparison to the background (indicating the shift of an entire semantic field).
    Page 5, “Conclusion”
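
Because every state's vectors share one space, a cross-region query such as "what terms used in Chicago are most similar to hot dog in New York" reduces to a nearest-neighbor search between vectors composed with different state deviations. A sketch under the same assumptions as the hidden_layer helper under the "embeddings" topic above (an embed(word, [state]) function and an iterable vocab of words; treating "hot_dog" as a single token is an illustrative simplification):

    import numpy as np

    def cross_region_neighbors(query_word, query_state, target_state, embed, vocab, top_n=10):
        """Rank words as used in target_state by cosine similarity to query_word as used
        in query_state; meaningful only because all state vectors share one space."""
        q = embed(query_word, [query_state])
        scored = []
        for w in vocab:
            v = embed(w, [target_state])
            sim = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
            scored.append((sim, w))
        return [w for _, w in sorted(scored, reverse=True)[:top_n]]

    # e.g., cross_region_neighbors("hot_dog", "NY", "IL", hidden_layer, vocab) would
    # approximate "what terms used in Chicago are most similar to hot dog in New York"
    # (hidden_layer and vocab as in the sketch under the "embeddings" topic above).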
