Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations
Rosenthal, Sara and McKeown, Kathleen

Article Structure

Abstract

We investigate whether wording, stylistic choices, and online behavior can be used to predict the age category of blog authors.

Introduction

The evolution of the internet has changed the way that people communicate.

Related Work

In previous work, Mackinnon (2006) used LiveJournal data to identify a blogger’s age by examining the mean age of his peer group using his social network and not just his immediate friends.

Data Collection

Our corpus consists of blogs downloaded from the virtual community LiveJournal.

Methods

We preprocessed the data to add Part-of-Speech tags (POS) and dependencies (de Marneffe et al., 2006) between words using the Stanford Parser (Klein and Manning, 2003a; Klein and Manning, 2003b).

Experiments and Results

We ran three separate experiments to determine how well we can predict age: 1. classifying into three distinct age groups (Schler et al.

Conclusion and Future Work

We have shown that it is possible to predict the age group of a person based on style, content, and online behavior features with good accuracy; these are all features that are available

Topics

social media

Appears in 15 sentences as: Social Media (1) social media (15)
In Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations
  1. Through experimentation with a range of years, we found that the birth dates of students in college at the time when social media such as AIM, SMS text messaging, MySpace and Facebook first became popular, enable accurate age prediction.
    Page 1, “Abstract”
  2. The users of these social media platforms have created their own form of unstructured writing that is best characterized as informal.
    Page 1, “Introduction”
  3. social media generation.
    Page 1, “Introduction”
  4. We focus on this generation due to the rise of popular social media technologies such as messaging and online social networks sites that occurred during that time.
    Page 1, “Introduction”
  5. Therefore, we experimented with binary classification into age groups using all birth dates from 1975 through 1988, thus including students from generation Y who were in college during the emergence of social media technologies.
    Page 1, “Introduction”
  6. The appearance of social media technologies such as AOL Instant Messenger (AIM), weblogs, SMS text messaging, Facebook and MySpace occurred when people with these birth dates were in college.
    Page 1, “Introduction”
  7. Their work shows that ease of classification is dependent in part on what division is made between age groups and in turn motivates our decision to study whether the creation of social media technologies can be used to find the dividing line(s).
    Page 2, “Related Work”
  8. 5.2 Social Media and Generation Y
    Page 7, “Experiments and Results”
  9. We were motivated to examine these years due to the emergence of social media technologies during that time.
    Page 7, “Experiments and Results”
  10. Generation Y is considered the social media generation, so we decided to examine how the creation and/or popularity of social media technologies compared to the years that had a change in writing style.
    Page 8, “Experiments and Results”
  11. We looked at many popular social media technologies such as weblogs, messaging, and social networking sites.
    Page 8, “Experiments and Results”

See all papers in Proc. ACL 2011 that mention social media.

binary classification

Appears in 6 sentences as: binary classification (6)
In Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations
  1. Therefore, we experimented with binary classification into age groups using all birth dates from 1975 through 1988, thus including students from generation Y who were in college during the emergence of social media technologies.
    Page 1, “Introduction”
  2. We find five years where binary classification is significantly more accurate than other years: 1977, 1979, and 1982-1984.
    Page 1, “Introduction”
  3. We also use a supervised machine learning approach, but classification by gender is naturally a binary classification task, while our work requires determining a natural dividing point.
    Page 3, “Related Work”
  4. (2006) experiment), 2. binary classification with the split at each birth year from 1975-1988 and 3.
    Page 6, “Experiments and Results”
  5. In contrast to Schler et al.’s experiment, our division does not introduce a gap between age groups, we do binary classification, and we use significantly less data.
    Page 7, “Experiments and Results”
  6. • Perform binary classification between blogs BEFORE X and IN/AFTER X
    Page 7, “Experiments and Results”
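
The BEFORE X versus IN/AFTER X split quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the label strings, and the sample birth years are all hypothetical, and only the 1975-1988 range comes from the quoted sentences.

```python
from typing import Dict, List


def label_by_split(birth_years: List[int], split_year: int) -> List[str]:
    """Label each author as born BEFORE split_year or IN/AFTER it."""
    return ["before" if year < split_year else "in_or_after" for year in birth_years]


def splits_for_range(
    birth_years: List[int], first: int = 1975, last: int = 1988
) -> Dict[int, List[str]]:
    """Build one binary labeling per candidate split year (inclusive range)."""
    return {x: label_by_split(birth_years, x) for x in range(first, last + 1)}


# Hypothetical birth years, for illustration only.
sample_years = [1970, 1979, 1984, 1990]
labels = splits_for_range(sample_years)
```

Each entry of `labels` is one binary classification task, so a classifier would be trained and evaluated once per split year and the accuracies compared across years.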

statistically significant

Appears in 4 sentences as: statistical significance (1) statistically significant (3)
In Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations
  1. the averages of the accuracies from the 10 cross-validation runs and all results were compared for statistical significance using the t-test where applicable.
    Page 6, “Experiments and Results”
  2. Unless otherwise marked, all accuracies are statistically significant at p<=.0005 for both baselines.
    Page 9, “Experiments and Results”
  3. 1 not statistically significant over Online-Behavior and Interests.
    Page 9, “Experiments and Results”
  4. 2 not statistically significant over Interests.
    Page 9, “Experiments and Results”
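
As a rough illustration of comparing two sets of per-run accuracies with a t-test, the sketch below computes a paired t statistic. The quoted sentences do not say which variant of the t-test was used, so the paired design is an assumption, and the function name is hypothetical. The statistic would still need to be compared against a t distribution with n - 1 degrees of freedom to obtain a p-value.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the paired t statistic for two equal-length accuracy lists.

    Assumes a[i] and b[i] come from the same run (a paired design).
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 values")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))
```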

best results

Appears in 3 sentences as: best results (3)
In Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations
  1. Our best results allow for 81.57% accuracy.
    Page 1, “Abstract”
  2. However, as shown in Figure 4, style and content combined provided the best results.
    Page 8, “Experiments and Results”
  3. We found the best results to have an accuracy of 79.96% and 81.57% for 1979 and 1984 respectively using BOW, interests, online behavior, and all lexical-stylistic features.
    Page 8, “Experiments and Results”

logistic regression

Appears in 3 sentences as: logistic regression (4)
In Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations
  1. We ran all of our experiments in Weka (Hall et al., 2009) using logistic regression over 10 runs of 10-fold cross-validation.
    Page 6, “Experiments and Results”
  2. We use logistic regression as our classifier because it has been shown that logistic regression typically has lower asymptotic error than naive Bayes for multiple classification tasks as well as for text classification (Ng and Jordan, 2002).
    Page 6, “Experiments and Results”
  3. We experimented with an SVM classifier and found logistic regression to do slightly better.
    Page 6, “Experiments and Results”
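
The evaluation protocol quoted above, 10 runs of 10-fold cross-validation with accuracies averaged, can be sketched in plain Python. This is not the Weka setup the authors used: the function names are hypothetical, the fold assignment is a simple shuffled round-robin, and the majority-class `fit_majority` is only a placeholder standing in for a real classifier such as logistic regression.

```python
import random
from collections import Counter
from statistics import mean
from typing import Callable, Iterator, List, Sequence, Tuple


def repeated_kfold(
    n: int, k: int = 10, runs: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for `runs` reshuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(runs):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # round-robin fold assignment
        for i in range(k):
            train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
            yield train, folds[i]


def fit_majority(train_x: Sequence, train_y: Sequence) -> Callable:
    """Placeholder classifier: always predict the most common training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return lambda _x: majority


def cross_validated_accuracy(
    features: Sequence, labels: Sequence, fit: Callable,
    k: int = 10, runs: int = 10, seed: int = 0,
) -> float:
    """Average test accuracy over every fold of every run (requires n >= k)."""
    accuracies = []
    for train, test in repeated_kfold(len(labels), k, runs, seed):
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(features[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return mean(accuracies)
```

Swapping `fit_majority` for a function that trains a real model and returns its prediction function leaves the rest of the loop unchanged.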

SVM

Appears in 3 sentences as: SVM (3)
In Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations
  1. They use an SVM classifier with only n-grams as features.
    Page 2, “Related Work”
  2. Nowson et al. (2006) employed dictionary and n-gram based content analysis and achieved 91.5% accuracy using an SVM classifier.
    Page 3, “Related Work”
  3. We experimented with an SVM classifier and found logistic regression to do slightly better.
    Page 6, “Experiments and Results”
