Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
Wang, William Yang and Mayfield, Elijah and Naidu, Suresh and Dittmar, Jeremiah

Article Structure

Abstract

We propose a latent variable model to enhance historical analysis of large corpora.

Introduction

Many scientific subjects, such as psychology, learning sciences, and biology, have adopted computational approaches to discover latent patterns in large-scale datasets (Chen and Lombardi, 2010; Baker and Yacef, 2009).

Related Work

Natural Language Processing (NLP) methods for automatically understanding and identifying key information in historical data were not explored until recently.

Data

We have collected a corpus of slavery-related United States Supreme Court legal opinions from Lexis Nexis.

The Sparse Mixed-Effects Model

To address the over-parameterization, lack of expressiveness and robustness issues in LDA, the SAGE (Eisenstein et al., 2011a) framework draws a

Prediction Experiments

We perform three quantitative experiments to evaluate the predictive power of the sparse mixed-effects model.

Conclusion and Future Work

In this work, we propose a sparse mixed-effects model for historical analysis of text.

Topics

SVM

Appears in 21 sentences as: +SVM (1) SVM (22)
In Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
  1. Traditional discriminative methods, such as the support vector machine (SVM) and logistic regression, have been very popular in various text categorization tasks (Joachims, 1998; Wang and McKeown, 2010) in past decades.
    Page 2, “Related Work”
  2. For example, SVM does not have latent variables to model the subtle differences and interactions of features from different domains (e.g.
    Page 2, “Related Work”
  3. In the first experiment, we compare the prediction accuracy of our SME model to a widely used discriminative learner in NLP — the linear kernel support vector machine (SVM).
    Page 5, “Prediction Experiments”
  4. In the second experiment, in addition to the linear kernel SVM, we also compare our SME model to a state-of-the-art sparse generative model of text (Eisenstein et al., 2011a), and vary the size of the input vocabulary W exponentially from 2^9 to the full size of our training vocabulary.
    Page 5, “Prediction Experiments”
  5. We use threefold cross-validation to infer the learning rate δ and the cost C hyperpriors in the SME and SVM models, respectively.
    Page 5, “Prediction Experiments”
  6. 5.1.1 Comparing SME to SVM
    Page 5, “Prediction Experiments”
  7. We show in this section the predictive power of our sparse mixed-effects model, compared to a linear kernel SVM learner.
    Page 5, “Prediction Experiments”
  8. In terms of the size of vocabulary W for both the SME and SVM learners, we select three values to represent dense, medium, or sparse feature spaces: W1 = 2^9, W2 = 2^12, and the full vocabulary size of W3 = 2^13.8.
    Page 5, “Prediction Experiments”
  9. Table 1 shows the accuracy of both models, as well as the relative improvement (gain) of SME over SVM.
    Page 5, “Prediction Experiments”
  10. When looking at the experiment results under different settings, we see that the SME model always outperforms the SVM learner.
    Page 5, “Prediction Experiments”
  11. SVM (W1): time 33.2% (gain —), region 69.7% (gain —); SME (W1): time 36.4% (gain 9.6%), region 71.4% (gain 2.4%)
    Page 5, “Prediction Experiments”
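The SVM baseline in the items above, with threefold cross-validation over the cost C as described in item 5, can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data; the corpus, feature count, and parameter grid are stand-ins, not the authors' setup.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for document feature vectors and four-way time labels.
rng = np.random.default_rng(0)
X = rng.random((200, 512))           # 200 documents, W1 = 2**9 features
y = rng.integers(0, 4, size=200)     # four time-period labels

# Threefold cross-validation over the SVM cost C (grid values are illustrative).
grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]}, cv=3)
grid.fit(X, y)
print(grid.best_params_["C"])
```

With random labels the tuned accuracy hovers near chance; on real features the same grid search selects the C that is then used at test time.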

See all papers in Proc. ACL 2012 that mention SVM.

See all papers in Proc. ACL that mention SVM.


latent variables

Appears in 17 sentences as: Latent variable (1) latent variable (4) latent variables (13)
In Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
  1. We propose a latent variable model to enhance historical analysis of large corpora.
    Page 1, “Abstract”
  2. Latent variable models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003) and probabilistic latent semantic analysis (PLSA) (Hofmann, 1999), have been used in the past to facilitate social science research.
    Page 1, “Introduction”
  3. To do this we augment SAGE with two sparse latent variables that model the region and time of a document, as well as a third sparse latent
    Page 1, “Introduction”
  4. variable that captures the interactions among the region, time and topic latent variables.
    Page 2, “Introduction”
  5. We also introduce a multiclass perceptron-style weight estimation method to model the contributions from different sparse latent variables to the word posterior probabilities in this predictive task.
    Page 2, “Introduction”
  6. In the next two sections, we overview work related to qualitative social science analysis using latent variable models, and introduce our slavery-related early United States court opinion data.
    Page 2, “Introduction”
  7. For example, SVM does not have latent variables to model the subtle differences and interactions of features from different domains (e.g.
    Page 2, “Related Work”
  8. (2010) use a latent variable model to predict geolocation information of Twitter users, and investigate geographic variations of language use.
    Page 2, “Related Work”
  9. It also incorporates latent variables τ to model the variance for each sparse deviation η.
    Page 3, “The Sparse Mixed-Effects Model”
  10. The three major sparse deviation latent variables are η^(T), η^(R), and η^(Q).
    Page 3, “The Sparse Mixed-Effects Model”
  11. All of the three latent variables are condi-
    Page 3, “The Sparse Mixed-Effects Model”
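Collecting the notation scattered through the excerpts above (background log frequencies m, sparse deviations η, topic z), the additive word distribution that the SME model builds on can be written as below. This is a reconstruction in standard SAGE-style notation, with r and t denoting a document's region and time; the interaction deviation of items 3-4 would enter as one further additive term.

```latex
P(w \mid z, r, t) \propto \exp\!\left( m_w + \eta^{(T)}_{z,w} + \eta^{(R)}_{r,w} + \eta^{(Q)}_{t,w} \right)
```

Sparsity means most entries of each η vector are exactly zero, so each effect shifts only a few words away from the background distribution.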


topic models

Appears in 10 sentences as: topic modeling (1) topic modelling (4) topic models (5)
In Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
  1. This work extends prior work in topic modelling by incorporating metadata, and the interactions between the components in metadata, in a general way.
    Page 1, “Abstract”
  2. Related research efforts include using the LDA model for topic modeling in historical newspapers (Yang et al., 2011), a rule-based approach to extract verbs in historical Swedish texts (Pettersson and Nivre, 2011), and a system for semantic tagging of historical Dutch archives (Cybulska and Vossen, 2011).
    Page 2, “Related Work”
  3. Despite our historical data domain, our approach is more relevant to text classification and topic modelling.
    Page 2, “Related Work”
  4. semantic information in multifaceted topic models for text categorization.
    Page 2, “Related Work”
  5. Temporally, topic models have been used to show the shift in language use over time in online communities (Nguyen and Rose, 2011) and the evolution of topics over time (Shubhankar et al., 2011).
    Page 2, “Related Work”
  6. When evaluating understandability, however, dense word distributions are a serious issue in many topic models as well as other predictive tasks.
    Page 2, “Related Work”
  7. Such topic models are often dominated by function words and do not always effectively separate topics.
    Page 2, “Related Work”
  8. To compare the two models in different settings, we first empirically set the number of topics K in our SME model to be 25, as this setting was shown to yield a promising result in a previous study (Eisenstein et al., 2011a) on sparse topic models .
    Page 5, “Prediction Experiments”
  9. Most studies on topic modelling have not been able to report results when using different sizes of vocabulary for training.
    Page 5, “Prediction Experiments”
  10. We jointly model those observed labels together with unsupervised topic modelling.
    Page 8, “Conclusion and Future Work”


LDA

Appears in 8 sentences as: LDA (8)
In Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
  1. Latent variable models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003) and probabilistic latent semantic analysis (PLSA) (Hofmann, 1999), have been used in the past to facilitate social science research.
    Page 1, “Introduction”
  2. SAGE (Eisenstein et al., 2011a), a recently proposed sparse additive generative model of language, addresses many of the drawbacks of LDA .
    Page 1, “Introduction”
  3. Another advantage, from a social science perspective, is that SAGE can be derived from a standard logit random-utility model of judicial opinion writing, in contrast to LDA .
    Page 1, “Introduction”
  4. Related research efforts include using the LDA model for topic modeling in historical newspapers (Yang et al., 2011), a rule-based approach to extract verbs in historical Swedish texts (Pettersson and Nivre, 2011), and a system for semantic tagging of historical Dutch archives (Cybulska and Vossen, 2011).
    Page 2, “Related Work”
  5. (2010) study the effect of the context of interaction in blogs using a standard LDA model.
    Page 2, “Related Work”
  6. To address the over-parameterization, lack of expressiveness and robustness issues in LDA, the SAGE (Eisenstein et al., 2011a) framework draws a
    Page 2, “The Sparse Mixed-Effects Model”
  7. In this SME model, we still have the same Dirichlet prior α, the latent topic proportion θ, and the latent topic variable z as in the original LDA model.
    Page 3, “The Sparse Mixed-Effects Model”
  8. In contrast to the traditional multinomial distribution of words in LDA models, we approximate the conditional word distribution in the document d as the
    Page 3, “The Sparse Mixed-Effects Model”
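Item 8's contrast can be made concrete: instead of drawing each word from a per-topic multinomial as in LDA, the additive model exponentiates and renormalizes a sum of background log frequencies and sparse deviations. A toy numpy sketch, with vocabulary size, deviation values, and names chosen purely for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W = 8                                          # toy vocabulary size
m = np.log(np.full(W, 1.0 / W))                # uniform background log frequencies
eta_topic = np.zeros(W);  eta_topic[2] = 2.0   # sparse topic deviation
eta_region = np.zeros(W); eta_region[5] = 1.0  # sparse region deviation

# Additive (SAGE-style) word distribution: sum in log space, then renormalize.
p = softmax(m + eta_topic + eta_region)
print(p.argmax())  # word 2 dominates, driven by the topic deviation
```

Because the deviations are sparse, the remaining words keep equal probability relative to one another, just as in the background distribution.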


feature space

Appears in 5 sentences as: feature space (4) feature spaces (1)
In Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
  1. In terms of the size of vocabulary W for both the SME and SVM learners, we select three values to represent dense, medium, or sparse feature spaces: W1 = 2^9, W2 = 2^12, and the full vocabulary size of W3 = 2^13.8.
    Page 5, “Prediction Experiments”
  2. For example, with a medium density feature space of 2^12, SVM obtained an accuracy of 35.8%, but SME achieved an accuracy of 40.9%, which is a 14.2% relative improvement (p < 0.001) over SVM.
    Page 5, “Prediction Experiments”
  3. When the feature space becomes sparser, the SME obtains an increased relative improvement (p < 0.001) of 16.1%, using the full vocabulary.
    Page 5, “Prediction Experiments”
  4. the vocabulary size W exponentially and make the feature space more sparse, SME obtains its best result at W = 2^13, where the relative improvement over SAGE and SVM is 16.8% and 22.9% respectively (p < 0.001 under all comparisons).
    Page 6, “Prediction Experiments”
  5. In this experiment, the results of SME model are in line with SAGE and SVM when the feature space is dense.
    Page 6, “Prediction Experiments”
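The dense, medium, and sparse settings above come from capping the vocabulary W at 2^9, 2^12, or leaving it at full size. A frequency cutoff of that kind can be sketched as follows; the corpus here is a toy stand-in:

```python
from collections import Counter

docs = ["the court held the appeal",
        "the slave state appealed",
        "the court ruled"]
counts = Counter(w for d in docs for w in d.split())

def top_vocab(counts, cap):
    # Keep the `cap` most frequent word types, as when fixing W = 2**9 or 2**12.
    return [w for w, _ in counts.most_common(cap)]

vocab = top_vocab(counts, 4)
print(vocab)  # most frequent word types first
```

A small cap yields the dense setting; the full vocabulary yields the sparse one, retaining the rare and often most discriminative word types.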


classification task

Appears in 4 sentences as: classification task (3) classification tasks (1)
In Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
  1. The weight update of λ^(R) and λ^(Q) is bounded by the averaged accuracy of the two classification tasks in the training data, which is similar to the notion of minimizing empirical risk (Bahl et al., 1988).
    Page 4, “The Sparse Mixed-Effects Model”
  2. We hypothesize that this might be because SVM, as a strong large-margin learner, is a more natural approach in a binary classification setting, but might not be the best choice in a four-way or multiclass classification task.
    Page 5, “Prediction Experiments”
  3. Figure 2 and Figure 3 show the experiment results in both the time and region classification tasks.
    Page 5, “Prediction Experiments”
  4. Secondly, in the two tasks, we observe that the accuracy of the binary region classification task is much higher than that of the four-way task; thus, while the latter benefits significantly from the joint learning scheme of the SME model, the former might not see an equivalent gain.
    Page 6, “Prediction Experiments”
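Item 1's perceptron-style weight estimation, where weights are updated on misclassified training examples, can be sketched generically. This is a standard multiclass perceptron on toy data, not the authors' exact estimator for the effect weights:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((50, 6))             # toy feature vectors
y = rng.integers(0, 4, size=50)     # four-way labels, as in the time task
W = np.zeros((4, 6))                # one weight vector per class

for _ in range(20):                 # a few training passes
    for x, label in zip(X, y):
        pred = int(np.argmax(W @ x))
        if pred != label:           # mistake-driven update
            W[label] += x
            W[pred] -= x

acc = float(np.mean(np.argmax(X @ W.T, axis=1) == y))
print(round(acc, 2))
```

In the SME model the analogous update is additionally bounded by the averaged training accuracy of the two tasks, as item 1 notes.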


generative model

Appears in 4 sentences as: generative model (3) generative models (1)
In Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
  1. SAGE (Eisenstein et al., 2011a), a recently proposed sparse additive generative model of language, addresses many of the drawbacks of LDA.
    Page 1, “Introduction”
  2. In the second experiment, in addition to the linear kernel SVM, we also compare our SME model to a state-of-the-art sparse generative model of text (Eisenstein et al., 2011a), and vary the size of the input vocabulary W exponentially from 2^9 to the full size of our training vocabulary.
    Page 5, “Prediction Experiments”
  3. In this experiment, we compare SME with a state-of-the-art sparse generative model: SAGE (Eisenstein et al., 2011a).
    Page 5, “Prediction Experiments”
  4. Unlike hierarchical Dirichlet processes (Teh et al., 2006), in parametric Bayesian generative models , the number of topics K is often set manually, and can influence the model’s accuracy significantly.
    Page 6, “Prediction Experiments”


best result

Appears in 3 sentences as: best result (3)
In Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
  1. the vocabulary size W exponentially and make the feature space more sparse, SME obtains its best result at W = 2^13, where the relative improvement over SAGE and SVM is 16.8% and 22.9% respectively (p < 0.001 under all comparisons).
    Page 6, “Prediction Experiments”
  2. After increasing the number of topics K, we can see SAGE consistently increase its accuracy, obtaining its best result when K = 30.
    Page 6, “Prediction Experiments”
  3. Except when the two models tie at K = 10, SME outperforms SAGE for all subsequent variations of K. Similar to the region task, SME achieves its best result at sparser settings of K (p < 0.01 when K = 40 and K = 50).
    Page 7, “Prediction Experiments”


support vector

Appears in 3 sentences as: support vector (3)
In Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
  1. Traditional discriminative methods, such as the support vector machine (SVM) and logistic regression, have been very popular in various text categorization tasks (Joachims, 1998; Wang and McKeown, 2010) in past decades.
    Page 2, “Related Work”
  2. In the first experiment, we compare the prediction accuracy of our SME model to a widely used discriminative learner in NLP — the linear kernel support vector machine (SVM).
    Page 5, “Prediction Experiments”
  3. Table 1: Comparison of the accuracy of the linear kernel support vector machine and our sparse mixed-effects model on the region and time identification tasks (K = 25).
    Page 5, “Prediction Experiments”
