Modeling Latent Biographic Attributes in Conversational Genres
Garera, Nikesh and Yarowsky, David

Article Structure

Abstract

This paper presents and evaluates several original techniques for the latent classification of biographic attributes such as gender, age and native language, in diverse genres (conversation transcripts, email) and languages (Arabic, English).

Introduction

Speaker attributes such as gender, age, dialect, native language and educational level may be (a) stated overtly in metadata, (b) derivable indirectly from metadata such as a speaker’s phone number or userid, or (c) derivable from acoustic properties of the speaker, including pitch and f0 contours (Bocklet et al., 2008).

Related Work

Much attention has been devoted in the sociolinguistics literature to detection of age, gender, social class, religion, education, etc.

Corpus Details

Consistent with Boulis and Ostendorf (2005), we utilized the Fisher telephone conversation corpus (Cieri et al., 2004) and we also evaluated performance on the standard Switchboard conversational corpus (Godfrey et al., 1992), both collected and annotated by the Linguistic Data Consortium.

Topics

SVM

Appears in 9 sentences as: SVM (9)
In Modeling Latent Biographic Attributes in Conversational Genres
  1. As our reference algorithm, we used the current state-of-the-art system developed by Boulis and Ostendorf (2005) using unigram and bigram features in an SVM framework.
    Page 3, “Corpus Details”
  2. Table 1: Top 20 ngram features for gender, ranked by the weights assigned by the linear SVM model
    Page 3, “Corpus Details”
  3. After extracting the ngrams, an SVM model was trained via the SVMlight toolkit (Joachims, 1999) using the linear kernel with the default toolkit settings.
    Page 3, “Corpus Details”
  4. Table 1 shows the most discriminative ngrams for gender based on the weights assigned by the linear SVM model.
    Page 3, “Corpus Details”
  5. The modest differences with their reported results may be due to unreported details such as the exact training/test splits or SVM parameterizations; for the purposes of assessing the relative gain of our subsequent enhancements, we therefore base all reported experiments on the internally-consistent configurations as (re-)implemented here.
    Page 4, “Corpus Details”
  6. The overall accuracy improves to 96.46% on the Fisher corpus using this oracle (from 90.84%), leading us to the experiment where the oracle is replaced with a non-oracle SVM model trained on a subset of training data such that all test conversation sides (of the speaker and the partner) are excluded from the training set.
    Page 4, “Corpus Details”
  7. their scores was used as a feature in a meta SVM classifier:
    Page 5, “Corpus Details”
  8. The above classes resulted in a total of 16 sociolinguistic features, which were added (based on feature ablation studies) as features in the meta SVM classifier, along with the 4 features explained previously in Section 5.3.
    Page 6, “Corpus Details”
  9. Table 8: Top 25 ngram features for Age ranked by weights assigned by the linear SVM model
    Page 8, “Corpus Details”


native language

Appears in 8 sentences as: Native Language (1) native language (7)
In Modeling Latent Biographic Attributes in Conversational Genres
  1. This paper presents and evaluates several original techniques for the latent classification of biographic attributes such as gender, age and native language , in diverse genres (conversation transcripts, email) and languages (Arabic, English).
    Page 1, “Abstract”
  2. Speaker attributes such as gender, age, dialect, native language and educational level may be (a) stated overtly in metadata, (b) derivable indirectly from metadata such as a speaker’s phone number or userid, or (c) derivable from acoustic properties of the speaker, including pitch and f0 contours (Bocklet et al., 2008).
    Page 1, “Introduction”
  3. (including true speaker gender, age, native language , etc.)
    Page 3, “Corpus Details”
  4. Corpus details for Age and Native Language: For age, we used the same training and test speakers from the Fisher corpus as described for gender in Section 3, binarized into greater-than-40 versus less-than-or-equal-to-40 for a parallel binary evaluation.
    Page 7, “Corpus Details”
  5. Based on the prior distribution, always guessing the most likely class for age (age less-than-or-equal-to 40) results in 62.59% accuracy, and always guessing the most likely class for native language (nonnative) yields 50.59% accuracy.
    Page 8, “Corpus Details”
  6. We can see that the ngram-based approach for gender also gives reasonable performance on other speaker attributes, and more importantly, both the partner-model and sociolinguistic features help in reducing the error rate on age and native language substantially, indicating their usefulness not just on gender but also on other diverse latent attributes.
    Page 8, “Corpus Details”
  7. Table 9: Results showing improvement in the accuracy of age and native language classification using partner-model and sociolinguistic features
    Page 8, “Corpus Details”
  8. This paper has presented and evaluated several original techniques for the latent classification of speaker gender, age and native language in diverse genres and languages.
    Page 8, “Corpus Details”


unigram

Appears in 4 sentences as: unigram (3) unigrams (1)
In Modeling Latent Biographic Attributes in Conversational Genres
  1. As our reference algorithm, we used the current state-of-the-art system developed by Boulis and Ostendorf (2005) using unigram and bigram features in an SVM framework.
    Page 3, “Corpus Details”
  2. For each conversation side, a training example was created using unigram and bigram features with tf-idf weighting, as done in standard text classification approaches.
    Page 3, “Corpus Details”
  3. Also, the named entity “Mike” shows up as a discriminative unigram; this may be due to the self-introduction at the beginning of the conversations and “Mike” being a common male name.
    Page 3, “Corpus Details”
  4. processing for names is performed, and they are treated as just any other unigrams or bigrams.
    Page 4, “Corpus Details”


feature set

Appears in 3 sentences as: feature set (3)
In Modeling Latent Biographic Attributes in Conversational Genres
  1. Another relevant line of work has been on the blog domain, using a bag of words feature set to discriminate age and gender (Schler et al., 2006; Burger and Henderson, 2006; Nowson and Oberlander, 2006).
    Page 2, “Related Work”
  2. However, stopwords were retained in the feature set, as various sociolinguistic studies have shown that the use of some stopwords, for instance pronouns and determiners, is correlated with age and gender.
    Page 3, “Corpus Details”
  3. Also, only the ngrams with frequency greater than 5 were retained in the feature set following Boulis and Ostendorf (2005).
    Page 3, “Corpus Details”
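The two feature-set decisions quoted above (retain stopwords; keep only ngrams with corpus frequency greater than 5) can be sketched directly. This is a pure-Python illustration on toy tokenized conversation sides; only the threshold follows the paper.

```python
# Sketch of the feature-set filtering described above: count unigrams
# and bigrams over all conversation sides, keep ngrams with frequency
# greater than 5, and deliberately do NOT remove stopwords.
from collections import Counter
from itertools import chain

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy pre-tokenized conversation sides.
sides = [
    "you know i mean you know".split(),
    "i mean you know right".split(),
    "you know you know i mean".split(),
    "you know you know i mean".split(),
]

# Corpus-wide counts over unigrams and bigrams.
counts = Counter(chain.from_iterable(
    ngrams(toks, n) for toks in sides for n in (1, 2)
))

# Retain only ngrams with frequency > 5; stopwords like "you" and "i"
# stay in, since they correlate with age and gender.
feature_set = {ng for ng, c in counts.items() if c > 5}
print(sorted(feature_set))
```

On the Fisher corpus this filter serves to discard rare, noisy ngrams while keeping the frequent function words that sociolinguistic studies single out.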


joint modeling

Appears in 3 sentences as: joint modeling (3)
In Modeling Latent Biographic Attributes in Conversational Genres
  1. gender/age) based on the prior and joint modeling of the partner speaker’s gender/age in the same discourse.
    Page 4, “Corpus Details”
  2. We employ several varieties of classifier stacking and joint modeling to be effectively sensitive to these differences.
    Page 4, “Corpus Details”
  3. A novel partner-sensitive model shows performance gains from the joint modeling of speaker attributes along with partner speaker attributes, given the differences in lexical usage and discourse style such as those observed between same-gender and mixed-gender conversations.
    Page 8, “Corpus Details”
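The classifier-stacking idea in the excerpts above can be sketched as a meta classifier that takes both the speaker-model score and the partner-model score for the other side of the same conversation as features. This is a minimal sketch, assuming scikit-learn's LinearSVC as a stand-in for the paper's meta SVM; the scores and labels are toy values, and the sign convention (positive leans "male") is an illustrative assumption.

```python
# Sketch of partner-sensitive stacking: a meta SVM over
# [speaker-model score, partner-model score] feature pairs.
# Toy scores; LinearSVC stands in for the paper's meta SVM.
import numpy as np
from sklearn.svm import LinearSVC

# Each row: [speaker-model score, partner-model score on the other
# conversation side].  Positive leans "male" by assumption here.
X_meta = np.array([
    [+1.2, +0.8],   # same-gender (male/male) conversation
    [+0.9, -0.3],   # mixed-gender conversation
    [-1.1, -0.7],   # same-gender (female/female) conversation
    [-0.4, +1.0],   # mixed-gender conversation
])
y = ["male", "male", "female", "female"]

meta = LinearSVC()
meta.fit(X_meta, y)

# The meta model can exploit same- vs mixed-gender style differences
# that a speaker-only model cannot see.
print(meta.predict([[+1.0, +0.9]]))
```

In the paper's setup the partner score comes from a model trained with all test conversation sides (of both speaker and partner) held out, so the stacked features are never derived from test data.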
