Biases in Predicting the Human Language Model
Alex B. Fine, Austin F. Frank, T. Florian Jaeger, and Benjamin Van Durme

Article Structure

Abstract

We consider the prediction of three human behavioral measures (lexical decision, word naming, and picture naming) through the lens of domain bias in language modeling.
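The recipe implied by the abstract can be illustrated with a short sketch: regress a behavioral measure such as lexical decision reaction time on log unigram frequency from a given corpus, then inspect the residuals, where systematic over- or under-prediction signals domain bias. This is an illustrative reconstruction, not the authors' actual pipeline; all words, counts, and reaction times below are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical per-word lexical decision reaction times (ms) and raw
# unigram counts from one corpus; real studies draw on datasets such
# as the English Lexicon Project, but these numbers are made up.
words = ["ward", "duke", "search", "site"]
rt_ms = np.array([640.0, 655.0, 610.0, 605.0])
counts = np.array([1200, 900, 85000, 92000])

log_freq = np.log(counts)

# Simple linear regression: RT ~ log frequency.
slope, intercept, r, p, se = stats.linregress(log_freq, rt_ms)
residuals = rt_ms - (intercept + slope * log_freq)

# A large positive residual means the participant was slower than the
# corpus frequency predicted, i.e., the corpus over-predicted ease.
for word, res in zip(words, residuals):
    print(f"{word}: residual = {res:+.1f} ms")
```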

Introduction

Computational linguists build statistical language models to aid natural language processing (NLP) tasks.
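At the level relevant here, a unigram language model is just normalized word counts. A minimal sketch of such frequency estimation, using a toy in-memory corpus in place of the real resources (all strings below are invented):

```python
from collections import Counter

# Toy stand-in for a real corpus; the paper's estimates come from
# resources such as Switchboard and the BNC, not from data like this.
corpus_lines = [
    "the duke spoke to the ward nurse",
    "search the site for the code",
]

tokens = [tok for line in corpus_lines for tok in line.split()]
counts = Counter(tokens)
total = sum(counts.values())

# Relative unigram frequency: count(w) / N.
rel_freq = {word: c / total for word, c in counts.items()}
print(rel_freq["the"])
```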

Fitting Behavioral Data: Data

Pairwise Pearson correlation coefficients of log frequency were computed across all corpora under consideration.
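One plausible way to compute such pairwise correlations, assuming each corpus has been reduced to a word-to-count table over a shared vocabulary (corpus names and counts below are hypothetical):

```python
import itertools
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-corpus unigram counts; real tables would cover
# thousands of words shared across the six corpora.
corpus_counts = {
    "google_ngrams": {"ward": 51000, "duke": 32000, "tech": 400000},
    "switchboard": {"ward": 12, "duke": 9, "tech": 40},
    "bnc_written": {"ward": 2100, "duke": 1800, "tech": 350},
}

# Restrict to the vocabulary shared by every corpus.
vocab = sorted(set.intersection(*(set(c) for c in corpus_counts.values())))

def log_freq_vector(counts):
    return np.log([counts[w] for w in vocab])

# Pearson r for every pair of corpora.
for a, b in itertools.combinations(corpus_counts, 2):
    r, _ = pearsonr(log_freq_vector(corpus_counts[a]),
                    log_freq_vector(corpus_counts[b]))
    print(f"r({a}, {b}) = {r:.3f}")
```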

Discussion

Researchers in computational linguistics often assume that more data is always better than less data (Banko and Brill, 2001).

Conclusion

We have shown intuitive, domain-specific biases in the prediction of human behavioral measures using corpora of various genres.

Topics

language model

Appears in 9 sentences as: Language Model (1), language model (6), language modeling (1), language models (4)
In Biases in Predicting the Human Language Model
  1. We consider the prediction of three human behavioral measures (lexical decision, word naming, and picture naming) through the lens of domain bias in language modeling.
    Page 1, “Abstract”
  2. This study aims to provoke increased consideration of the human language model by NLP practitioners: biases are not limited to differences between corpora (i.e. …)
    Page 1, “Abstract”
  3. Computational linguists build statistical language models to aid natural language processing (NLP) tasks.
    Page 1, “Introduction”
  4. In the current study, we exploit errors of the latter variety (failure of a language model to predict human performance) to investigate bias across several frequently used corpora in computational linguistics.
    Page 1, “Introduction”
  5. Human Language Model
    Page 1, “Introduction”
  6. Thus, failure of a language model to predict human performance reveals a mismatch between the language model and the human language model, i.e., bias.
    Page 1, “Introduction”
  7. Our analyses reveal that 6 commonly used corpora fail to reflect the human language model in various ways related to dialect, modality, and other properties of each corpus.
    Page 3, “Discussion”
  8. Our results point to a type of bias in commonly used language models that has been previously overlooked.
    Page 3, “Discussion”
  9. Just as language models have been used to predict reading grade-level of documents (Collins-Thompson and Callan, 2004), human language models could be …
    Page 5, “Discussion”


n-grams

Appears in 4 sentences as: n-grams (4)
In Biases in Predicting the Human Language Model
  1. Contrasting the predictive ability of statistics derived from 6 different corpora, we find intuitive results showing that, e.g., a British corpus over-predicts the speed with which an American will react to the words ward and duke, and that the Google n-grams corpus over-predicts familiarity with technology terms.
    Page 1, “Abstract”
  2. Specifically, we predict human data from three widely used psycholinguistic experimental paradigms (lexical decision, word naming, and picture naming) using unigram frequency estimates from Google n-grams (Brants and Franz, 2006), Switchboard (Godfrey et al., 1992), the spoken and written English portions of CELEX (Baayen et al., 1995), and the spoken and written portions of the British National Corpus (BNC Consortium, 2007).
    Page 1, “Introduction”
  3. For example, Google n-grams overestimates the ease with which humans will process words related to the web (tech, code, search, site), while the Switchboard corpus, a collection of informal telephone conversations between strangers, overestimates how quickly humans will react to colloquialisms (heck, dam) and backchannels (wow, right).
    Page 1, “Introduction”
  4. Surprisingly, fife was determined to be one of the words with the largest frequency asymmetry between Switchboard and the Google n-grams corpus. (A sketch of this asymmetry computation follows below.)
    Page 3, “Fitting Behavioral Data: Data”
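The footnote's notion of frequency asymmetry could be computed roughly as follows: compare each word's relative frequency in the two corpora on a log scale and rank by the magnitude of the difference. A sketch under that assumption (the paper's exact normalization is not given here, and all frequencies below are invented):

```python
import numpy as np

# Hypothetical relative frequencies (count / corpus size).
switchboard = {"fife": 3e-6, "heck": 5e-5, "site": 1e-6}
google_ngrams = {"fife": 2e-8, "heck": 4e-7, "site": 9e-5}

shared = set(switchboard) & set(google_ngrams)

# Log-ratio of relative frequencies; large magnitude = large asymmetry.
asymmetry = {w: np.log(switchboard[w] / google_ngrams[w]) for w in shared}

for word in sorted(asymmetry, key=lambda w: abs(asymmetry[w]), reverse=True):
    print(f"{word}: log frequency ratio = {asymmetry[word]:+.2f}")
```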
