A Bayesian Mixed Effects Model of Literary Character

We consider the problem of automatically inferring latent character types in a collection of 15,099 English novels published between 1700 and 1899.

Recent work in NLP has begun to exploit the potential of entity-centric modeling for a variety of tasks: Chambers (2013) places entities at the center of probabilistic frame induction, showing gains over a comparable event-centric model (Cheung et al., 2013); Bamman et al. (2013) explicitly learn character types (or “personas”) in a dataset of Wikipedia movie plot summaries; and entity-centric models form one dominant approach in coreference resolution (Durrett et al., 2013; Haghighi and Klein, 2010).

Inferring character is challenging from a literary perspective partly because scholars have not reached consensus about the meaning of the term.

The dataset for this work consists of 15,099 distinct narratives drawn from the HathiTrust Digital Library. From an initial collection of 469,200 volumes written in English and published between 1700 and 1899 (including poetry, drama, and nonfiction as well as prose narrative), we extract 32,209 volumes of prose fiction, remove duplicates, and fuse multi-volume works to create the final dataset.

In order to separate out the effects that a character’s persona has on the words that are associated with them (as opposed to other factors, such as time period, genre, or author), we adopt a hierarchical Bayesian approach in which the words we observe are generated conditional on a combination of different effects captured in a log-linear (or “maximum entropy”) distribution.
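A minimal sketch (not the paper's implementation) of how such a log-linear combination yields a word distribution: each effect contributes additively in log space, and a softmax normalizes the result. The effect vectors below are toy values.

```python
import math

def loglinear_word_probs(background, author_effect, persona_effect):
    """P(w) proportional to exp(background_w + author_w + persona_w):
    the effects combine additively in log space; softmax normalizes."""
    logits = [b + a + p for b, a, p in
              zip(background, author_effect, persona_effect)]
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 4 word clusters: the author effect boosts cluster 0,
# the persona effect boosts cluster 1 more strongly.
probs = loglinear_word_probs([0.0] * 4, [0.5, 0, 0, 0], [0, 1.0, 0, 0])
```

Because the effects are additive in log space, each factor (author, persona, background) shifts the word distribution independently, which is what lets the model later factor them apart.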

While standard NLP and machine learning practice is to evaluate the performance of an algorithm on a held-out gold standard, articulating what a true “persona” might be for a character is inherently problematic.

Part of the motivation of our mixed effects model is to be able to tackle hypothesis class C—by factoring out the influence of a particular author on the learning of personas, we would like to be able to discriminate between characters that all have a common authorial voice.

The latent personas inferred from this model will support further exploratory analysis of literary history.

Our method establishes the possibility of representing the relationship between character and narrative form in a hierarchical Bayesian model.

We thank the reviewers for their helpful comments.

Appears in 6 sentences as: Coreference (1) coreference (5)

In *A Bayesian Mixed Effects Model of Literary Character*

- (2013) explicitly learn character types (or “personas”) in a dataset of Wikipedia movie plot summaries; and entity-centric models form one dominant approach in coreference resolution (Durrett et al., 2013; Haghighi and Klein, 2010). (Page 1, “Introduction”)
- While previous work uses the Stanford CoreNLP toolkit to identify characters and extract typed dependencies for them, we found this approach to be too slow for the scale of our data (a total of 1.8 billion tokens); in particular, syntactic parsing, with cubic complexity in sentence length, and out-of-the-box coreference resolution (with thousands of potential antecedents) prove to be… (Page 2, “Data”)
- It includes the following components for clustering character name mentions, resolving pronominal coreference, and reducing vocabulary dimensionality. (Page 3, “Data”)
- 3.2 Pronominal Coreference Resolution (Page 3, “Data”)
- While the character clustering stage is essentially performing proper noun coreference resolution, approximately 74% of references to characters in books come in the form of pronouns. To resolve this more difficult class at the scale of an entire book, we train a log-linear discriminative classifier only on the task of resolving pronominal anaphora (i.e., ignoring generic noun phrases such as the paint or the rascal). (Page 3, “Data”)
- For this task, we annotated a set of 832 coreference links in 3 books (Pride and Prejudice, The Turn of the Screw, and Heart of Darkness) and featurized coreference/antecedent pairs with: … (Page 3, “Data”)
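A hedged sketch of the antecedent-selection step described above (the feature names and weight values here are hypothetical stand-ins, not the paper's featurization): each candidate in the preceding window is scored by a binary logistic model, and the top-scoring candidate is chosen as the antecedent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pick_antecedent(candidates, weights):
    """Score each (name, features) candidate with a binary logistic
    model and return the name of the highest-scoring one."""
    scored = [(sigmoid(sum(w * f for w, f in zip(weights, feats))), name)
              for name, feats in candidates]
    return max(scored)[1]

# Hypothetical features: [negated token distance / 100, gender agreement]
weights = [1.5, 2.0]   # stand-in coefficients; the paper learns these
candidates = [("Elizabeth", [-0.02, 1.0]),
              ("Darcy",     [-0.40, 0.0]),
              ("Jane",      [-0.75, 1.0])]
best = pick_antecedent(candidates, weights)
```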

See all papers in *Proc. ACL 2014* that mention coreference.


Appears in 6 sentences as: log-linear (6)

In *A Bayesian Mixed Effects Model of Literary Character*

- While the character clustering stage is essentially performing proper noun coreference resolution, approximately 74% of references to characters in books come in the form of pronouns. To resolve this more difficult class at the scale of an entire book, we train a log-linear discriminative classifier only on the task of resolving pronominal anaphora (i.e., ignoring generic noun phrases such as the paint or the rascal). (Page 3, “Data”)
- To manage the degrees of freedom in the model described in §4, we perform dimensionality reduction on the vocabulary by learning word embeddings with a log-linear continuous skip-gram language model (Mikolov et al., 2013) on the entire collection of 15,099 books. (Page 3, “Data”)
- In order to separate out the effects that a character’s persona has on the words that are associated with them (as opposed to other factors, such as time period, genre, or author), we adopt a hierarchical Bayesian approach in which the words we observe are generated conditional on a combination of different effects captured in a log-linear (or “maximum entropy”) distribution. (Page 4, “Model”)
- This SAGE model can be understood as a log-linear distribution with three kinds of features (metadata, persona, and background). (Page 4, “Model”)
- Notation: P, number of personas (a hyperparameter); D, number of documents; C_d, number of characters in document d; W_{d,c}, number of ⟨cluster, role⟩ tuples for character c; m_d, metadata for document d (ranges over M authors); θ_d, document d’s distribution over personas; p_{d,c}, character c’s persona; j, an index for an ⟨r, w⟩ tuple in the data; w_j, word cluster ID for tuple j; r_j, role for tuple j, ∈ {agent, patient, poss, pred}; η, coefficients for the log-linear language model; μ, λ, Laplace mean and scale (for regularizing η); α, Dirichlet concentration parameter. (Page 6, “Model”)
- A Basic persona model, which ablates author information but retains the same log-linear architecture; here, the η-vector is of size P + 1 and does not model author effects. (Page 7, “Experiments”)
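The excerpts above say the skip-gram embeddings are reduced to a fixed set of cluster IDs, but not how the clustering is done; as one plausible sketch (an assumption, not the paper's method), each word can be mapped to its nearest embedding centroid:

```python
def nearest_cluster(word_vec, centroids):
    """Map a word embedding to the ID of its nearest centroid, replacing
    the open vocabulary with a fixed set of cluster IDs."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(centroids)),
               key=lambda i: sq_dist(word_vec, centroids[i]))

# Toy 2-d embeddings; three centroids stand in for the 1,000 clusters
centroids = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
cluster_id = nearest_cluster([0.9, 1.1], centroids)
```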


Appears in 4 sentences as: Coreference Resolution (1) coreference resolution (3)

In *A Bayesian Mixed Effects Model of Literary Character*

- (2013) explicitly learn character types (or “personas”) in a dataset of Wikipedia movie plot summaries; and entity-centric models form one dominant approach in coreference resolution (Durrett et al., 2013; Haghighi and Klein, 2010). (Page 1, “Introduction”)
- While previous work uses the Stanford CoreNLP toolkit to identify characters and extract typed dependencies for them, we found this approach to be too slow for the scale of our data (a total of 1.8 billion tokens); in particular, syntactic parsing, with cubic complexity in sentence length, and out-of-the-box coreference resolution (with thousands of potential antecedents) prove to be… (Page 2, “Data”)
- 3.2 Pronominal Coreference Resolution (Page 3, “Data”)
- While the character clustering stage is essentially performing proper noun coreference resolution, approximately 74% of references to characters in books come in the form of pronouns. To resolve this more difficult class at the scale of an entire book, we train a log-linear discriminative classifier only on the task of resolving pronominal anaphora (i.e., ignoring generic noun phrases such as the paint or the rascal). (Page 3, “Data”)


Appears in 4 sentences as: logistic regression (3) logistic regressions (1)

In *A Bayesian Mixed Effects Model of Literary Character*

- With this featurization and training data, we train a binary logistic regression classifier with ℓ1 regularization (where negative examples consist of all character entities in the previous 100 words not labeled as the true antecedent). (Page 3, “Data”)
- Since each multiplicand involves a binary prediction, we can avoid partition functions and use classic binary logistic regression. We have converted the V-way multiclass logistic regression problem of Eq. … (Page 5, “Model”)
- Recall that logistic regression sets p_LR(y = 1 | x, β) = logit⁻¹(xᵀβ) = 1/(1 + exp(−xᵀβ)) for binary dependent variable y, independent variables x, and coefficients β. (Page 5, “Model”)
- This equates to solving 4V ℓ1-regularized logistic regressions (see Eq. …). (Page 6, “Model”)
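The footnote's definition can be worked directly as code; the input values below are arbitrary toy numbers.

```python
import math

def logistic_prob(x, beta):
    """p(y = 1 | x, beta) = logit^-1(x . beta) = 1 / (1 + exp(-x . beta))."""
    z = sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))

p_zero = logistic_prob([1.0, 2.0], [0.0, 0.0])   # zero coefficients give p = 0.5
p_pos  = logistic_prob([1.0], [10.0])            # strongly positive score, p near 1
```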


Appears in 3 sentences as: Gibbs sampling (3)

In *A Bayesian Mixed Effects Model of Literary Character*

- Rather than adopting a fully Bayesian approach (e.g., sampling all variables), we infer these values using stochastic EM, alternating between collapsed Gibbs sampling for each p and maximizing with respect to η. (Page 5, “Model”)
- We assume the reader is familiar with collapsed Gibbs sampling as used in latent-variable NLP models. (Page 5, “Model”)
- All experiments are run with 50 iterations of Gibbs sampling to collect samples for the personas p, alternating with maximization steps for η. (Page 7, “Experiments”)
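The Model excerpts describe the collapsed conditional for a character's persona as a Dirichlet-smoothed count of other characters with that persona, times the likelihood of the character's words. A hedged sketch of one such draw (function name and toy values are ours, not the paper's):

```python
import math
import random

def sample_persona(other_counts, alpha, word_logliks, rng=random.random):
    """One collapsed-Gibbs draw of a character's persona z:
    P(z) proportional to (other_counts[z] + alpha) * exp(word_logliks[z]),
    i.e. a Dirichlet-smoothed count times the words' likelihood under z."""
    weights = [(c + alpha) * math.exp(ll)
               for c, ll in zip(other_counts, word_logliks)]
    total = sum(weights)
    r = rng() * total
    acc = 0.0
    for z, w in enumerate(weights):
        acc += w
        if r < acc:
            return z
    return len(weights) - 1

# With a deterministic rng the draw is reproducible: persona 2 dominates here.
z = sample_persona([0, 0, 100], alpha=1.0,
                   word_logliks=[0.0, 0.0, 0.0], rng=lambda: 0.99)
```

In stochastic EM, draws like this for every character alternate with a maximization step over the coefficients η.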


Appears in 3 sentences as: hyperparameter (3)

In *A Bayesian Mixed Effects Model of Literary Character*

- The generative story runs as follows (Figure 2 depicts the full graphical model): Let there be M unique authors in the data, P latent personas (a hyperparameter to be set), and V words in the vocabulary (in the general setting these may be word types; in our data the vocabulary is the set of 1,000 unique cluster IDs). (Page 4, “Model”)
- This is proportional to the number of other characters in document d who also (currently) have that persona (plus the Dirichlet hyperparameter, which acts as a smoother) times the probability (under p_{d,c} = z) of all of the words… (Page 5, “Model”)
- Notation: P, number of personas (a hyperparameter); D, number of documents; C_d, number of characters in document d; W_{d,c}, number of ⟨cluster, role⟩ tuples for character c; m_d, metadata for document d (ranges over M authors); θ_d, document d’s distribution over personas; p_{d,c}, character c’s persona; j, an index for an ⟨r, w⟩ tuple in the data; w_j, word cluster ID for tuple j; r_j, role for tuple j, ∈ {agent, patient, poss, pred}; η, coefficients for the log-linear language model; μ, λ, Laplace mean and scale (for regularizing η); α, Dirichlet concentration parameter. (Page 6, “Model”)


Appears in 3 sentences as: language model (2) language modeling (1) language models (1)

In *A Bayesian Mixed Effects Model of Literary Character*

- To manage the degrees of freedom in the model described in §4, we perform dimensionality reduction on the vocabulary by learning word embeddings with a log-linear continuous skip-gram language model (Mikolov et al., 2013) on the entire collection of 15,099 books. (Page 3, “Data”)
- Maximum entropy approaches to language modeling have been used since Rosenfeld (1996) to incorporate long-distance information, such as previously-mentioned trigger words, into n-gram language models. (Page 4, “Model”)
- Notation: P, number of personas (a hyperparameter); D, number of documents; C_d, number of characters in document d; W_{d,c}, number of ⟨cluster, role⟩ tuples for character c; m_d, metadata for document d (ranges over M authors); θ_d, document d’s distribution over personas; p_{d,c}, character c’s persona; j, an index for an ⟨r, w⟩ tuple in the data; w_j, word cluster ID for tuple j; r_j, role for tuple j, ∈ {agent, patient, poss, pred}; η, coefficients for the log-linear language model; μ, λ, Laplace mean and scale (for regularizing η); α, Dirichlet concentration parameter. (Page 6, “Model”)


Appears in 3 sentences as: latent variables (3)

In *A Bayesian Mixed Effects Model of Literary Character*

- Observed variables are shaded, latent variables are clear, and collapsed variables are dotted. (Page 6, “Model”)
- As a Baseline, we also evaluate all hypotheses on a model with no latent variables whatsoever, which instead measures similarity as the average JS divergence between the empirical word distributions over each role type. (Page 7, “Experiments”)
- Table 1 presents the results of this comparison; for all models with latent variables, we report the average of 5 sampling runs with different random initializations. (Page 7, “Experiments”)
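The Baseline's similarity measure, Jensen-Shannon divergence, is standard and can be sketched in a few lines (toy two-element distributions below):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in nats; 0 * log(0/q) terms are dropped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

d_same = js_divergence([0.5, 0.5], [0.5, 0.5])   # identical distributions: 0
d_far  = js_divergence([1.0, 0.0], [0.0, 1.0])   # disjoint support: log 2 nats
```

Unlike KL, JS is symmetric and always finite, which makes it a reasonable distance for comparing characters' empirical word distributions per role.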


Appears in 3 sentences as: Regression model (3)

In *A Bayesian Mixed Effects Model of Literary Character*

- In contrast, the Persona Regression model of Bamman et al. … (Page 7, “Experiments”)
- The Persona Regression model of Bamman et al. … (Page 7, “Experiments”)
- As expected, the Persona Regression model performs best at hypothesis class B (correctly judging two characters from the same author to be more similar to each other than to a character from a different author); this behavior is encouraged in this model by allowing an author (as an external metadata variable) to directly influence… (Page 8, “Experiments”)
