Unsupervised Alignment of Privacy Policies using Hidden Markov Models
Rohan Ramanath, Fei Liu, Norman Sadeh, and Noah A. Smith

Article Structure

Abstract

To support empirical study of online privacy policies, as well as tools for users with privacy concerns, we consider the problem of aligning sections of a thousand policy documents, based on the issues they address.

Introduction

Privacy policy documents are verbose, often esoteric legal documents that many people encounter as clients of companies that provide services on the web.

Data Collection

We collected 1,010 unique privacy policy documents from the top websites ranked by Alexa.com. These policies were collected over a six-week period spanning December 2013 and January 2014.

Approach

Given the corpus of privacy policies described in §2, we designed a model to efficiently infer an alignment of policy sections.
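
The model itself is not reproduced on this page, but the quoted sentences in the topic listings below describe an HMM whose hidden states are issues, each characterized by a distribution over unigram and bigram terms, and whose observations are the terms of a policy's sections. The following is a minimal decoding sketch under those assumptions: it takes already-estimated log start, transition, and emission parameters (the paper learns these without supervision, a step not shown here) and assigns one state per section with the Viterbi algorithm. The function and variable names are illustrative, and the authors may use a different decoding rule.

    import numpy as np

    def viterbi_align(section_counts, log_start, log_trans, log_emit):
        """Most likely hidden-state (issue) sequence for one policy.

        section_counts: [T, V] term-count matrix, one row per section
        log_start:      [K]    log initial-state probabilities
        log_trans:      [K, K] log transition probabilities
        log_emit:       [K, V] log term distribution for each hidden state
        Sections of different policies assigned the same state are aligned.
        """
        ll = section_counts @ log_emit.T          # [T, K] section log-likelihoods
        T, K = ll.shape
        delta = np.empty((T, K))                  # best score ending in each state
        back = np.zeros((T, K), dtype=int)        # backpointers
        delta[0] = log_start + ll[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_trans   # previous state x current state
            back[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + ll[t]
        states = np.empty(T, dtype=int)
        states[-1] = delta[-1].argmax()
        for t in range(T - 2, -1, -1):
            states[t] = back[t + 1, states[t + 1]]
        return states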

Evaluation

Developing a gold-standard alignment of privacy policies would require either an interface that allows each annotator to interact with the entire corpus of previously aligned documents while reading the one she is annotating, or the definition (and likely iterative refinement) of a set of categories for manually labeling policy sections.

Experiment

In this section, we evaluate the three HMM variants described in §3, and two baselines, using the methods in §4.

Conclusion

We considered the task of aligning sections of a collection of roughly similarly-structured legal documents, based on the issues they address.

Topics

unigram

Appears in 5 sentences as: unigram (4), unigrams (1)
In Unsupervised Alignment of Privacy Policies using Hidden Markov Models
  1. Each section's sequence of terms is generated by repeatedly sampling from a distribution over terms that includes all unigrams and bigrams except those that occur in fewer than 5% of the documents and in more than 98% of the documents.
    Page 2, “Approach”
  2. models (e.g., a bigram may be generated by as many as three draws from the emission distribution: once for each unigram it contains and once for the bigram).
    Page 3, “Approach”
  3. We derived unigram tf-idf vectors for each section in each of 50 randomly sampled policies per category.
    Page 4, “Evaluation”
  4. The implementation uses unigram features and cosine similarity.
    Page 5, “Experiment”
  5. Our second baseline is latent Dirichlet allocation (LDA; Blei et al., 2003), with ten topics and online variational Bayes for inference (Hoffman et al., 2010). To more closely match our models, LDA is given access to the same unigram and bigram tokens.
    Page 5, “Experiment”
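
The vocabulary restriction in sentence 1 and the unigram tf-idf plus cosine-similarity representation in sentences 3 and 4 can be reproduced roughly as follows. This is a minimal sketch, assuming scikit-learn and a handful of toy section strings standing in for the documents used for the frequency cutoffs; the variable names and example texts are illustrative, not from the paper's code.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy stand-ins for privacy-policy sections (one string per section).
    sections = [
        "we collect your email address and usage data",
        "we share your information with third party advertisers",
        "we collect usage data to improve the service",
        "you may opt out of marketing emails at any time",
    ]

    # Unigram + bigram vocabulary, dropping terms that occur in fewer than 5%
    # or in more than 98% of the documents (cf. sentence 1 above).
    vocab = CountVectorizer(ngram_range=(1, 2), min_df=0.05, max_df=0.98)
    term_counts = vocab.fit_transform(sections)

    # Unigram tf-idf vectors with pairwise cosine similarity between sections
    # (cf. sentences 3 and 4 above).
    tfidf = TfidfVectorizer()
    vectors = tfidf.fit_transform(sections)
    similarities = cosine_similarity(vectors)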


bigram

Appears in 4 sentences as: bigram (3), bigrams (2)
In Unsupervised Alignment of Privacy Policies using Hidden Markov Models
  1. In our formulation, each hidden state corresponds to an issue or topic, characterized by a distribution over words and bigrams appearing in privacy policy sections addressing that issue.
    Page 2, “Approach”
  2. Each section's sequence of terms is generated by repeatedly sampling from a distribution over terms that includes all unigrams and bigrams except those that occur in fewer than 5% of the documents and in more than 98% of the documents.
    Page 2, “Approach”
  3. models (e.g., a bigram may be generated by as many as three draws from the emission distribution: once for each unigram it contains and once for the bigram).
    Page 3, “Approach”
  4. Our second baseline is latent Dirichlet allocation (LDA; Blei et al., 2003), with ten topics and online variational Bayes for inference (Hoffman et al., 2010). To more closely match our models, LDA is given access to the same unigram and bigram tokens.
    Page 5, “Experiment”
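
The LDA baseline quoted in sentence 4 (ten topics, online variational Bayes, the same unigram and bigram tokens) could look roughly like the sketch below. It uses scikit-learn rather than whatever implementation the authors used, and the toy section texts and variable names are assumptions for illustration only.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    sections = [
        "we collect your email address and usage data",
        "we share your information with third party advertisers",
        "you may opt out of marketing emails at any time",
    ]

    # Same unigram + bigram tokens as the HMM variants.
    counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(sections)

    # Ten topics, fit with online variational Bayes (Hoffman et al., 2010).
    lda = LatentDirichletAllocation(n_components=10, learning_method="online",
                                    random_state=0)
    topic_mixtures = lda.fit_transform(counts)   # one topic distribution per section

    # Assign each section to its most probable topic, a rough analogue of alignment.
    assignments = topic_mixtures.argmax(axis=1)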


gold standard

Appears in 3 sentences as: gold standard (3)
In Unsupervised Alignment of Privacy Policies using Hidden Markov Models
  1. Together, these can be used as a gold standard grouping of policy sections, against which we can compare our system’s output.
    Page 4, “Evaluation”
  2. We created a separate gold standard of judgments of pairs of privacy policy sections.
    Page 4, “Evaluation”
  3. The first two options were considered a “yes” for the majority voting and for defining a gold standard.
    Page 4, “Evaluation”
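
Sentence 3 describes collapsing per-pair annotator choices into a binary gold standard by majority vote, with the first two answer options counted as "yes". A minimal sketch of that step is below; the option labels and data layout are assumptions for illustration, not the paper's annotation scheme.

    from collections import Counter

    # annotations[pair] -> the options chosen by the annotators for that section pair
    annotations = {
        ("policy_A_sec3", "policy_B_sec1"): ["same_issue", "mostly_same", "different"],
        ("policy_A_sec2", "policy_C_sec4"): ["different", "different", "same_issue"],
    }
    YES_OPTIONS = {"same_issue", "mostly_same"}   # hypothetical "first two options"

    gold = {}
    for pair, choices in annotations.items():
        votes = Counter(choice in YES_OPTIONS for choice in choices)
        gold[pair] = votes[True] > votes[False]   # majority "yes" => aligned pair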
