Using Conceptual Class Attributes to Characterize Social Media Users
Bergsma, Shane and Van Durme, Benjamin

Article Structure

Abstract

We describe a novel approach for automatically predicting the hidden demographic properties of social media users.

Introduction

There has been growing interest in characterizing social media users based on the content they generate; that is, automatically labeling users with demographic categories such as age and gender (Burger and Henderson, 2006; Schler et al., 2006; Rao et al., 2010; Mukherjee and Liu, 2010; Pennacchiotti and Popescu, 2011; Burger et al., 2011; Van Durme, 2012).

Supervised User Characterization

The current state-of-the-art in user characterization is to use supervised classifiers trained on annotated data.

Learning Class Attributes

We aim to improve the automated classification of users into various demographic categories by learning and applying the distinguishing attributes of those categories, e.g.

Applying Class Attributes

To classify users using the extracted attributes, we look for cases where users refer to such attributes in their first-person writings.

Twitter Gender Prediction

To test the use of self-distinguishing attributes in user classification, we apply our methods to the task of gender classification on Twitter.

Results

Our main classification results are presented in Table 3.

Related Work

User Characterization: The field of sociolinguistics has long been concerned with how various morphological, phonological and stylistic aspects of language can vary with a person’s age, gender, social class, etc.

Conclusion

We have proposed, developed and successfully evaluated a novel approach to user characterization based on exploiting knowledge of user class attributes.

Topics

gold standard

Appears in 13 sentences as: Gold Standard (1) gold standard (12)
In Using Conceptual Class Attributes to Characterize Social Media Users
  1. (1) ARules: Using Attribute-Based Rules to Override a Classifier. When human-annotated data is available for training and testing a supervised classifier, we refer to it as gold standard data.
    Page 4, “Applying Class Attributes”
  2. (2) Bootstrapped: Automatic Labeling of Training Examples. Even without gold standard training data, we can use our self-distinguishing attributes to automatically bootstrap annotations.
    Page 4, “Applying Class Attributes”
  3. (3) BootStacked: Gold Standard and Bootstrapped Combination. Although we show that an accurate classifier can be trained using auto-annotated Bootstrapped data alone, we also test whether we can combine this data with any gold-standard training examples to achieve even better performance.
    Page 4, “Applying Class Attributes”
  4. We first use the trained Bootstrapped system to make predictions on the entire set of gold standard data (gold train, development, and test sets).
    Page 5, “Applying Class Attributes”
  5. We then use these predictions as features in a classifier trained on the gold standard data.
    Page 5, “Applying Class Attributes”
  6. Classifier Setup: We train logistic-regression classifiers on this gold standard data via the LIBLINEAR package (Fan et al., 2008).
    Page 5, “Twitter Gender Prediction”
  7. We also filter any users that overlap with our gold standard data.
    Page 5, “Twitter Gender Prediction”
  8. The decisions of our bootstrapping process reflect the true gender distribution; the auto-annotated data is 60.5% Female, remarkably close to the 60.9% proportion in our gold standard test set.
    Page 5, “Twitter Gender Prediction”
  9. Table 3: Classification accuracy (%) on gold standard test data for user gender prediction on Twitter
    Page 6, “Twitter Gender Prediction”
  10. This latter result represents the current state-of-the-art: a classifier trained on thousands of gold standard examples, making use of both Usr and BOW features.
    Page 6, “Results”
  11. The Bootstrapped system substantially improves over the state-of-the-art, achieving 86% accuracy and doing so without using any gold standard training data.
    Page 6, “Results”
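
The automatic labeling step described in item 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the attribute sets are made-up placeholders, the "my <noun>" pattern is only one assumed way of detecting first-person references to an attribute, and the majority-vote rule and the names auto_label and bootstrap_training_set are hypothetical.

```python
import re
from collections import Counter

# Placeholder attribute lists (hypothetical; the real lists would be learned
# and filtered as described in "Learning Class Attributes").
FEMALE_ATTRIBUTES = {"husband", "boyfriend", "purse"}
MALE_ATTRIBUTES = {"wife", "girlfriend", "beard"}

# Assumed first-person pattern: the word "my" followed by a single noun.
FIRST_PERSON = re.compile(r"\bmy\s+([a-z]+)\b", re.IGNORECASE)


def auto_label(tweets):
    """Return 'Female', 'Male', or None when there is no evidence or a tie."""
    votes = Counter()
    for tweet in tweets:
        for noun in FIRST_PERSON.findall(tweet):
            noun = noun.lower()
            if noun in FEMALE_ATTRIBUTES:
                votes["Female"] += 1
            elif noun in MALE_ATTRIBUTES:
                votes["Male"] += 1
    if votes["Female"] == votes["Male"]:
        return None
    return "Female" if votes["Female"] > votes["Male"] else "Male"


def bootstrap_training_set(users):
    """Map user id -> auto-assigned label, skipping users with no label.

    users: dict of user id -> list of tweet strings.
    """
    labeled = {}
    for user_id, tweets in users.items():
        label = auto_label(tweets)
        if label is not None:
            labeled[user_id] = label
    return labeled
```

Users that receive a label this way would then serve as training examples for an ordinary supervised classifier; users with no matching attribute, or with conflicting evidence, are simply left out.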

See all papers in Proc. ACL 2013 that mention gold standard.

social media

Appears in 10 sentences as: Social Media (1) social media (9)
In Using Conceptual Class Attributes to Characterize Social Media Users
  1. We describe a novel approach for automatically predicting the hidden demographic properties of social media users.
    Page 1, “Abstract”
  2. There has been growing interest in characterizing social media users based on the content they generate; that is, automatically labeling users with demographic categories such as age and gender (Burger and Henderson, 2006; Schler et al., 2006; Rao et al., 2010; Mukherjee and Liu, 2010; Pennacchiotti and Popescu, 2011; Burger et al., 2011; Van Durme, 2012).
    Page 1, “Introduction”
  3. to Characterize Social Media Users
    Page 1, “Introduction”
  4. Using a combination of content and username features “represents a use case common to many different social media sites, such as chat rooms and news article comment streams” (Burger et al., 2011).
    Page 2, “Supervised User Characterization”
  5. A leg is a relevant and correct part of both a male and a female (and many other living and inanimate objects), but it does not help us distinguish males from females in social media.
    Page 3, “Learning Class Attributes”
  6. While we used an “off the shelf” POS tagger in this work, we note that taggers optimized specifically for social media are now available and would likely have resulted in higher tagging accuracy (e.g. Owoputi et al.
    Page 4, “Applying Class Attributes”
  7. We can therefore benchmark our approach against state-of-the-art supervised systems trained with plentiful gold-standard data, giving us an idea of how well our Bootstrapped system might compare to theoretically top-performing systems on other tasks, domains, and social media platforms where such gold-standard training data is not available.
    Page 5, “Twitter Gender Prediction”
  8. Note that it is possible to achieve even higher performance on gender classification in social media if you have further information about a user, such as their full first and last name (Burger et al., 2011; Bergsma et al., 2013).
    Page 6, “Results”
  9. This is important because having thousands of gold standard annotations for every possible user characterization task, in every domain and social media platform, is not realistic.
    Page 6, “Results”
  10. Many recent papers have analyzed the language of social media users, along dimensions such as ethnicity (Eisenstein et al., 2011; Rao et al., 2011; Pennacchiotti and Popescu, 2011; Fink et al., 2012), time zone (Kiciman, 2010), political orientation (Rao et al., 2010; Pennacchiotti and Popescu, 2011) and gender (Rao et al., 2010; Burger et al., 2011; Van Durme, 2012).
    Page 7, “Related Work”

gold-standard

Appears in 8 sentences as: gold-standard (9)
In Using Conceptual Class Attributes to Characterize Social Media Users
  1. Our bootstrapped system, trained purely from automatically-annotated Twitter data, significantly reduces error over a state-of-the-art system trained on thousands of gold-standard training examples.
    Page 1, “Introduction”
  2. In our gold-standard gender data (Section 5), however, every user has a homepage [by dataset construction]; we might therefore incorrectly classify every user as Male.
    Page 3, “Learning Class Attributes”
  3. Our first technique provides a simple way to use our identified self-distinguishing attributes in conjunction with a classifier trained on gold-standard data.
    Page 4, “Applying Class Attributes”
  4. (3) BootStacked: Gold Standard and Bootstrapped Combination. Although we show that an accurate classifier can be trained using auto-annotated Bootstrapped data alone, we also test whether we can combine this data with any gold-standard training examples to achieve even better performance.
    Page 4, “Applying Class Attributes”
  5. We can therefore benchmark our approach against state-of-the-art supervised systems trained with plentiful gold-standard data, giving us an idea of how well our Bootstrapped system might compare to theoretically top-performing systems on other tasks, domains, and social media platforms where such gold-standard training data is not available.
    Page 5, “Twitter Gender Prediction”
  6. A standard classifier trained on 100 gold-standard training examples improves over this baseline, to 72.0%, while one with 2282 training examples achieves 84.0%.
    Page 6, “Results”
  7. We presented three effective techniques for leveraging this knowledge within the framework of supervised user characterization: rule-based postprocessing, a learning-by-bootstrapping approach, and a stacking approach that integrates the predictions of the bootstrapped system into a system trained on annotated gold-standard training data.
    Page 8, “Conclusion”
  8. While our technique has advanced the state-of-the-art on this important task, our approach may prove even more useful on other tasks where training on thousands of gold-standard examples is not even an option.
    Page 9, “Conclusion”
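
The stacking approach mentioned in item 7 (feeding the bootstrapped system's predictions into a classifier trained on gold-standard data) can be sketched as a feature-augmentation step. This is a rough sketch under assumptions: the feature-dictionary representation, the feature name bootstrapped_score, the scorer interface, and the toy weights are all hypothetical and chosen only for illustration.

```python
def stack_features(gold_examples, bootstrapped_score):
    """Add the bootstrapped system's prediction as one extra feature.

    gold_examples: list of (feature_dict, label) pairs from gold-standard data.
    bootstrapped_score: callable mapping a feature dict to a real-valued score
        (hypothetical interface standing in for the trained Bootstrapped system).
    """
    stacked = []
    for features, label in gold_examples:
        augmented = dict(features)  # copy so the input is left unchanged
        augmented["bootstrapped_score"] = bootstrapped_score(features)
        stacked.append((augmented, label))
    return stacked


def toy_bootstrapped_score(features):
    """Stand-in scorer with made-up weights; not a trained model."""
    weights = {"bow:foo": 1.5, "bow:bar": -1.0}
    return sum(weights.get(name, 0.0) * value for name, value in features.items())
```

The augmented examples would then be passed to whatever learner is trained on the gold-standard data, so that the final classifier can weigh the bootstrapped prediction against its other features.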

WordNet

Appears in 6 sentences as: WordNet (6)
In Using Conceptual Class Attributes to Characterize Social Media Users
  1. For these approaches, lists of instances are typically collected from publicly-available resources such as WordNet or Wikipedia (Pasca and Van Durme, 2007;
    Page 2, “Learning Class Attributes”
  2. Reisinger and Pasca (2009) considered the related problem of finding the most appropriate class for each attribute; they take an existing ontology of concepts (WordNet) as a class hierarchy and use a Bayesian approach to decide “the correct level of abstraction for each attribute.”
    Page 3, “Learning Class Attributes”
  3. These efforts focused exclusively on the meronymy relation as used in WordNet (Miller et al., 1990).
    Page 7, “Related Work”
  4. Experts can manually specify the attributes of entities, as in the WordNet project (Miller et al., 1990).
    Page 8, “Related Work”
  5. In many ways WordNet can be regarded as a collection of commonsense relationships.
    Page 8, “Related Work”
  6. WordNet has been applied in a myriad of NLP applications, including in seminal works on semantic-role labeling (Gildea and Jurafsky, 2002), coreference resolution (Soon et al., 2001) and spelling correction (Budanitsky and Hirst, 2006).
    Page 8, “Related Work”

classification task

Appears in 5 sentences as: classification task (4) classification tasks (1)
In Using Conceptual Class Attributes to Characterize Social Media Users
  1. State-of-the-art approaches cast this problem as a classification task and train classifiers using supervised learning (Section 2).
    Page 1, “Introduction”
  2. Our approach obviates the need for expensive annotation efforts, and allows us to rapidly bootstrap training data for new classification tasks.
    Page 1, “Introduction”
  3. We validate our approach by advancing the state-of-the-art on the most well-studied user classification task: predicting user gender (Section 5).
    Page 1, “Introduction”
  4. Table 1: Example instances used for extraction of class attributes for the gender classification task
    Page 2, “Learning Class Attributes”
  5. For the gender classification task, we manually filtered the entire set of attributes to select around 1000 attributes that were judged to be discriminative (two thirds of which are female).
    Page 3, “Learning Class Attributes”

N-gram

Appears in 3 sentences as: N-gram (3) n-gram (1)
In Using Conceptual Class Attributes to Characterize Social Media Users
  1. We extract prevalent common nouns for males and females by selecting only those nouns that (a) occur more than 200 times in the dataset, (b) mostly occur with male or female pronouns, and (c) occur as lowercase more often than uppercase in a web-scale N-gram corpus (Lin et al., 2010).
    Page 2, “Learning Class Attributes”
  2. We obtain the best of both worlds by matching our precise pattern against a version of the Google N-gram Corpus that includes the part-of-speech tag distributions for every N-gram (Lin et al., 2010).
    Page 3, “Learning Class Attributes”
  3. We include n-gram features with the original capitalization pattern and separate features with the n-grams lower-cased.
    Page 5, “Twitter Gender Prediction”
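
The three noun-selection criteria in item 1 can be sketched as a simple filter over per-noun counts. This is an illustrative sketch only: the input layout (a dictionary of counts per noun), the function name, and the reading of "mostly" as "more than half of all occurrences" are assumptions, not details given in the source.

```python
def select_gendered_nouns(stats, min_count=200, min_share=0.5):
    """Pick common nouns that mostly co-occur with one gender's pronouns.

    stats: dict mapping noun -> dict with the keys
        "count"  : occurrences in the dataset,
        "male"   : occurrences with male pronouns,
        "female" : occurrences with female pronouns,
        "lower"  : lowercase occurrences in the N-gram corpus,
        "upper"  : uppercase occurrences in the N-gram corpus.
    min_share is an assumed threshold for "mostly"; the source gives no value.
    """
    selected = {}
    for noun, c in stats.items():
        if c["count"] <= min_count:      # (a) must occur more than 200 times
            continue
        if c["lower"] <= c["upper"]:     # (c) lowercase more often than uppercase
            continue
        # (b) mostly occurs with male or with female pronouns
        if c["female"] / c["count"] > min_share:
            selected[noun] = "Female"
        elif c["male"] / c["count"] > min_share:
            selected[noun] = "Male"
    return selected
```

Criterion (c) acts as a cheap test for common nouns, since words that are usually capitalized are more likely to be proper names.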
