Weakly Supervised User Profile Extraction from Twitter
Li, Jiwei and Ritter, Alan and Hovy, Eduard

Article Structure

Abstract

While user attribute extraction on social media has received considerable attention, existing approaches, mostly supervised, encounter great difficulty in obtaining gold standard data and are therefore limited to predicting unary predicates (e.g., gender).

Introduction

The overwhelming popularity of online social media creates an opportunity to display given aspects of oneself.

Related Work

While user profile inference from social media has received considerable attention (Al Zamal et al., 2012; Rao and Yarowsky, 2010; Rao et al., 2010; Rao et al., 2011), most previous work has treated this as a classification task where the goal is to predict unary predicates describing attributes of the user.

Dataset Creation

We now describe the generation of our distantly supervised training dataset in detail.

Model

We now describe our approach to predicting user profile attributes.

Experiments

In this section, we present our experimental results in detail.

Conclusion and Future Work

In this paper, we propose a framework for user attribute inference on Twitter.

Acknowledgments

Special thanks are owed to Dr. Julian McAuley and Prof. Jure Leskovec from Stanford University for the Google+ circle/network crawler, without which the network analysis would not have been conducted.

Topics

social media

Appears in 12 sentences as: social media (12)
In Weakly Supervised User Profile Extraction from Twitter
  1. While user attribute extraction on social media has received considerable attention, existing approaches, mostly supervised, encounter great difficulty in obtaining gold standard data and are therefore limited to predicting unary predicates (e.g., gender).
    Page 1, “Abstract”
  2. Users’ profiles from social media websites such as Facebook or Google Plus are used as a distant source of supervision for extraction of their attributes from user-generated text.
    Page 1, “Abstract”
  3. In addition to traditional linguistic features used in distant supervision for information extraction, our approach also takes into account network information, a unique opportunity offered by social media.
    Page 1, “Abstract”
  4. The overwhelming popularity of online social media creates an opportunity to display given aspects of oneself.
    Page 1, “Introduction”
  5. We are optimistic that our approach can easily be applied to further user attributes such as HOBBIES and INTERESTS (MOVIES, BOOKS, SPORTS or STARS), RELIGION, HOMETOWN, LIVING LOCATION, FAMILY MEMBERS and so on, where training data can be obtained by matching ground truth retrieved from multiple types of online social media such as Facebook, Google Plus, or LinkedIn.
    Page 2, “Introduction”
  6. • We present a large-scale dataset for this task gathered from various structured and unstructured social media sources.
    Page 2, “Introduction”
  7. While user profile inference from social media has received considerable attention (Al Zamal et al., 2012; Rao and Yarowsky, 2010; Rao et al., 2010; Rao et al., 2011), most previous work has treated this as a classification task where the goal is to predict unary predicates describing attributes of the user.
    Page 2, “Related Work”
  8. Homophily: Online social media offers a rich source of network information.
    Page 3, “Related Work”
  9. (2001) discovered that people sharing more attributes such as background or hobby have a higher chance of becoming friends in social media.
    Page 3, “Related Work”
  10. This property, known as HOMOPHILY (summarized by the proverb “birds of a feather flock together”) (Al Zamal et al., 2012) has been widely applied to community detection (Yang and Leskovec, 2013) and friend recommendation (Guy et al., 2010) on social media.
    Page 3, “Related Work”
  11. Spouse: Facebook is the only type of social media where spouse information is commonly displayed.
    Page 4, “Dataset Creation”

See all papers in Proc. ACL 2014 that mention social media.

distant supervision

Appears in 7 sentences as: Distant Supervision (1) Distant supervision (1) distant supervision (6)
In Weakly Supervised User Profile Extraction from Twitter
  1. In addition to traditional linguistic features used in distant supervision for information extraction, our approach also takes into account network information, a unique opportunity offered by social media.
    Page 1, “Abstract”
  2. Inspired by the concept of distant supervision, we collect training tweets by matching attribute ground truth from an outside “knowledge base” such as Facebook or Google Plus.
    Page 2, “Introduction”
  3. Distant Supervision: Distant supervision, also known as weak supervision, is a method for learning to extract relations from text using ground truth from an existing database as a source of supervision.
    Page 2, “Related Work”
  4. Rather than relying on mention-level annotations, which are expensive and time consuming to generate, distant supervision leverages readily available structured data sources as a weak source of supervision for relation extraction from related text corpora (Craven et al., 1999).
    Page 2, “Related Work”
  5. In addition to the wide use in text entity relation extraction (Mintz et al., 2009; Ritter et al., 2013; Hoffmann et al., 2011; Surdeanu et al., 2012; Takamatsu et al., 2012), distant supervision has been applied to multiple
    Page 2, “Related Work”
  6. The distant supervision assumes that if entity e corresponds to an attribute for user i, at least one posting from user i’s Twitter stream containing a mention of e might express that attribute.
    Page 4, “Model”
  7. We construct the publicly available dataset based on distant supervision and evaluate our model on three useful user profile attributes, i.e., Education, Job and Spouse.
    Page 8, “Conclusion and Future Work”
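
The assumption in item 6 can be illustrated with a small sketch. This is not the authors' pipeline: the helper name `label_tweets`, the case-insensitive substring matching rule, and the sample tweets are all assumptions made for illustration.

```python
def label_tweets(tweets, attribute_entity):
    """Return the tweets that mention the ground-truth entity (hypothetical helper).

    Under distant supervision these are only candidate positives: the
    assumption is that at least one of them might express the attribute.
    """
    needle = attribute_entity.lower()
    return [t for t in tweets if needle in t.lower()]


# Hypothetical data: the user's profile lists EDUCATION = "Stanford".
tweets = [
    "Back at Stanford for the fall quarter!",
    "Coffee first, then everything else.",
    "stanford game tonight",
]
candidates = label_tweets(tweets, "Stanford")
```

Only the first and third tweets are kept as candidates; whether either one actually expresses the EDUCATION attribute is left to the model.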

ground truth

Appears in 5 sentences as: ground truth (5)
In Weakly Supervised User Profile Extraction from Twitter
  1. Inspired by the concept of distant supervision, we collect training tweets by matching attribute ground truth from an outside “knowledge base” such as Facebook or Google Plus.
    Page 2, “Introduction”
  2. We are optimistic that our approach can easily be applied to further user attributes such as HOBBIES and INTERESTS (MOVIES, BOOKS, SPORTS or STARS), RELIGION, HOMETOWN, LIVING LOCATION, FAMILY MEMBERS and so on, where training data can be obtained by matching ground truth retrieved from multiple types of online social media such as Facebook, Google Plus, or LinkedIn.
    Page 2, “Introduction”
  3. Distant Supervision: Distant supervision, also known as weak supervision, is a method for learning to extract relations from text using ground truth from an existing database as a source of supervision.
    Page 2, “Related Work”
  4. To obtain ground truth for the spouse relation at large scale, we turned to Freebase, a large, open-domain database, and gathered instances of the /PEOPLE/PERSON/SPOUSE relation.
    Page 4, “Dataset Creation”
  5. Facebook would be an ideal ground truth knowledge base.
    Page 9, “Conclusion and Future Work”

name entities

Appears in 5 sentences as: Name entities (1) name entities (2) name entity (1) named entity (1)
In Weakly Supervised User Profile Extraction from Twitter
  1. • Token-level: for each token t ∈ e, word identity, word shape, part of speech tags, name entity tags.
    Page 5, “Model”
  2. We assume that attribute values should be either name entities or terms following @ and #.
    Page 6, “Experiments”
  3. Name entities are extracted using Ritter et al.’s NER system (2011).
    Page 6, “Experiments”
  4. Consecutive tokens with the same named entity tag are chunked (Mintz et al., 2009).
    Page 6, “Experiments”
  5. A deeper look at the result shows that the classifier frequently makes wrong decisions for entities such as userID and name entities.
    Page 8, “Experiments”
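
Items 2 to 4 describe how candidate attribute values are obtained: named entities, with consecutive same-tag tokens chunked together, plus terms following @ and #. A minimal sketch of that step is given below. It assumes the tokens are already tagged (the paper uses Ritter et al.'s NER system, which is not called here), that tags are plain labels with "O" for non-entities, and that @/# terms arrive as single tokens; the function name `candidate_entities` is hypothetical.

```python
from itertools import groupby


def candidate_entities(tagged_tokens):
    """Chunk consecutive tokens sharing a non-"O" tag; keep @/# terms as candidates."""
    candidates = []
    # groupby only merges *adjacent* items with the same key, which is
    # exactly the "consecutive tokens with the same tag" rule.
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag != "O":
            candidates.append(" ".join(words))
        else:
            candidates.extend(
                w for w in words if len(w) > 1 and w[0] in ("@", "#")
            )
    return candidates


# Hypothetical tagged tweet.
tagged = [
    ("I", "O"), ("work", "O"), ("at", "O"),
    ("Carnegie", "ORG"), ("Mellon", "ORG"),
    ("with", "O"), ("@alan", "O"), ("#nlp", "O"),
]
entities = candidate_entities(tagged)
```

Here `entities` contains the chunk "Carnegie Mellon" followed by the two @/# terms.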

knowledge base

Appears in 4 sentences as: knowledge base (2) “knowledge base” (2)
In Weakly Supervised User Profile Extraction from Twitter
  1. Inspired by the concept of distant supervision, we collect training tweets by matching attribute ground truth from an outside “knowledge base” such as Facebook or Google Plus.
    Page 2, “Introduction”
  2. Figure 1: Illustration of Google Plus “knowledge base”.
    Page 3, “Related Work”
  3. Lists of universities and companies are taken from knowledge base NELL.
    Page 5, “Model”
  4. Facebook would be an ideal ground truth knowledge base.
    Page 9, “Conclusion and Future Work”

relation extraction

Appears in 4 sentences as: relation extraction (4)
In Weakly Supervised User Profile Extraction from Twitter
  1. Concretely, we cast user profile prediction as binary relation extraction (Brin, 1999), e.g., SPOUSE(User_i, User_j), EDUCATION(User_i, Entity_j) and EMPLOYER(User_i, Entity_j).
    Page 2, “Introduction”
  2. Rather than relying on mention-level annotations, which are expensive and time consuming to generate, distant supervision leverages readily available structured data sources as a weak source of supervision for relation extraction from related text corpora (Craven et al., 1999).
    Page 2, “Related Work”
  3. In addition to the wide use in text entity relation extraction (Mintz et al., 2009; Ritter et al., 2013; Hoffmann et al., 2011; Surdeanu et al., 2012; Takamatsu et al., 2012), distant supervision has been applied to multiple
    Page 2, “Related Work”
  4. fields such as protein relation extraction (Craven et al., 1999; Ravikumar et al., 2012), event extraction from Twitter (Benson et al., 2011), sentiment analysis (Go et al., 2009) and Wikipedia infobox generation (Wu and Weld, 2007).
    Page 3, “Related Work”

feature space

Appears in 3 sentences as: feature space (3)
In Weakly Supervised User Profile Extraction from Twitter
  1. We evaluate the settings described in Section 4.2, i.e., the GLOBAL setting, where the user-level attribute is predicted directly from the joint feature space, and the LOCAL setting, where the user-level prediction is made based on tweet-level predictions along with the different inference approaches described in Section 4.4, i.e.
    Page 7, “Experiments”
  2. This can be explained by the fact that LOCAL(U) sets z_i^e = 1 once one posting x ∈ L_i^e is identified as attribute related, while GLOBAL tends to be more meticulous by considering the conjunctive feature space from all postings.
    Page 7, “Experiments”
  3. Another direction involves incorporating a richer feature space for better inference performance, such as multimedia sources (i.e.
    Page 9, “Conclusion and Future Work”
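
The LOCAL(U) behaviour described in item 2, where the user-level label becomes 1 as soon as a single posting is judged attribute related, amounts to a logical OR over tweet-level decisions. The sketch below shows only that aggregation rule; the function name `local_union` and the 0/1 encoding of tweet-level predictions are assumptions, and the tweet-level classifier itself is not modelled.

```python
def local_union(tweet_predictions):
    """User-level label under a union rule: 1 if any tweet-level prediction is 1."""
    return int(any(tweet_predictions))


# Hypothetical tweet-level outputs for one (user, entity) pair.
one_hit = local_union([0, 0, 1, 0])
no_hit = local_union([0, 0, 0])
```

A single positive tweet is enough to flip the user-level label, which is consistent with the item's contrast between LOCAL(U) and the more conservative GLOBAL setting.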

feature vector

Appears in 3 sentences as: feature vector (2) feature vectors (1)
In Weakly Supervised User Profile Extraction from Twitter
  1. In the user attribute extraction literature, researchers have considered neighborhood context to boost inference accuracy (Pennacchiotti and Popescu, 2011; Al Zamal et al., 2012), where information about the degree of their connectivity to their pre-labeled users is included in the feature vectors.
    Page 3, “Related Work”
  2. encode a tweet-level feature vector rather than an aggregate one.
    Page 5, “Model”
  3. The feature vector ψ_text(z_i^e, X_i) encodes the following standard general features:
    Page 5, “Model”

iteratively

Appears in 3 sentences as: iteratively (2) Ψ_text (1)
In Weakly Supervised User Profile Extraction from Twitter
  1. Attributes are initialized using only text features, maximizing Ψ_text(e, X_i), and ignoring network information.
    Page 6, “Model”
  2. Then for each user we iteratively reestimate their profile given both their text features and network features (computed based on the current predictions made for their friends) which provide additional evidence.
    Page 6, “Model”
  3. Then we iteratively update z_i^e given
    Page 6, “Model”
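
The procedure in items 1 and 2 (initialize from text only, then repeatedly re-estimate each user using the current predictions for their friends) can be sketched as a simple fixed-point loop. This is an illustration under stated assumptions, not the paper's inference algorithm: the additive scoring rule, the `weight` and `threshold` parameters, the per-user text scores, and the function name `reestimate` are all hypothetical.

```python
def reestimate(text_score, friends, weight=0.5, threshold=0.5, max_iters=20):
    """Iteratively relabel users from text scores plus friend evidence.

    text_score: user -> score in [0, 1] from text features alone (assumed given).
    friends:    user -> list of friend ids.
    """
    # Step 1: initialise from text only, ignoring the network.
    labels = {u: int(s >= threshold) for u, s in text_score.items()}
    for _ in range(max_iters):
        updated = {}
        for u, s in text_score.items():
            nbrs = friends.get(u, [])
            # Network feature: fraction of friends currently labelled 1.
            frac = sum(labels.get(v, 0) for v in nbrs) / len(nbrs) if nbrs else 0.0
            # Step 2: friend predictions act as additional evidence.
            updated[u] = int(s + weight * frac >= threshold)
        if updated == labels:  # no label changed, so stop
            break
        labels = updated
    return labels


# Hypothetical users: "b" is below threshold on text alone but is friends with "a".
text_score = {"a": 0.9, "b": 0.4, "c": 0.1}
friends = {"a": ["b"], "b": ["a"], "c": []}
result = reestimate(text_score, friends)
```

Because friend evidence is only ever added in this sketch, labels can switch from 0 to 1 but not back, so the loop settles after a few passes.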
