Perplexity on Reduced Corpora
Kobayashi, Hayato

Article Structure

Abstract

This paper studies, from a theoretical standpoint, the common practice of removing low-frequency words from a corpus to reduce computational costs.

Introduction

Removing low-frequency words from a corpus (often called cutoff) is a common practice to save on the computational costs involved in learning language models and topic models.

Preliminaries

Let us consider a corpus w := w_1 ⋯ w_N of corpus size N and vocabulary size W. We use the abridged notation {w} := {w ∈ w} to represent the vocabulary of w. Clearly, N = |w| and W = |{w}| hold.
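As a concrete illustration of this notation, the sketch below computes the corpus size N, the vocabulary {w}, and the vocabulary size W from a tokenized corpus. The function name, tokenization, and toy corpus are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def corpus_stats(tokens):
    """Given a corpus w = w_1 ... w_N as a list of tokens, return
    N = |w|, the vocabulary {w}, its size W, and the frequencies f(w)."""
    n = len(tokens)            # N = |w|
    vocab = set(tokens)        # {w} = {w in w}
    freq = Counter(tokens)     # f(w), used later by the cutoff
    return n, vocab, len(vocab), freq

tokens = "a rose is a rose and this is a sentence".split()
n, vocab, w_size, freq = corpus_stats(tokens)   # n == 10, w_size == 6
```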

Perplexity on Reduced Corpora

Now let us consider what a cutoff is.
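As a minimal sketch of one plausible reading of the cutoff described in the Introduction, the function below removes every occurrence of word types whose frequency falls below a threshold. The threshold-based formulation and the name cutoff_corpus are assumptions; the paper's exact reduction procedure may differ.

```python
from collections import Counter

def cutoff_corpus(tokens, min_count=2):
    """Sketch of a frequency cutoff: build the reduced corpus w' by
    removing all occurrences of word types with frequency < min_count."""
    freq = Counter(tokens)
    kept = {w for w, c in freq.items() if c >= min_count}
    reduced = [w for w in tokens if w in kept]
    return reduced, kept   # the reduced corpus w' and its vocabulary {w'}
```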

Experiments

We performed experiments on three real corpora (Reuters, 20news, and Enwiki) and two synthetic corpora (Zipf1 and ZipfMix) to verify the correctness of our theory and to examine the gap between theory and practice.
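The paper only states that the synthetic corpora follow Zipf's law, so the sketch below shows one way such a corpus (e.g., something like Zipf1) could be generated: sample tokens i.i.d. from a rank-frequency distribution proportional to rank^(−s). The exponent, sizes, and i.i.d. construction are assumptions, not the paper's actual setup.

```python
import numpy as np

def zipf_corpus(n_tokens, vocab_size, s=1.0, seed=0):
    """Sketch: sample a synthetic corpus whose word frequencies roughly
    follow Zipf's law with exponent s (word ids are ranks 0..vocab_size-1)."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, vocab_size + 1, dtype=float)
    probs = ranks ** (-s)
    probs /= probs.sum()
    return rng.choice(vocab_size, size=n_tokens, p=probs)
```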

Conclusion

We studied the relationship between perplexity and vocabulary size of reduced corpora.

Topics

unigram

Appears in 13 sentences as: Unigram (1) unigram (12) unigrams (1)
In Perplexity on Reduced Corpora
  1. In Section 3, we theoretically derive the tradeoff formulae of the cutoff for unigram models, k-gram models, and topic models, each of which represents its perplexity with respect to a reduced vocabulary, under the assumption that the corpus follows Zipf’s law.
    Page 1, “Introduction”
  2. 3.1 Perplexity of Unigram Models
    Page 3, “Perplexity on Reduced Corpora”
  3. Let us consider the perplexity of a unigram model learned from a reduced corpus.
    Page 3, “Perplexity on Reduced Corpora”
  4. In unigram models, a predictive distribution p' on a reduced corpus w' can be simply calculated as p'(w') = f(w')/N'.
    Page 3, “Perplexity on Reduced Corpora”
  5. Let PP1 := (∏_{w∈w} p̂(w))^(−1/N) be the perplexity of a λ-restored distribution p̂ on a unigram model (a minimal sketch follows this list).
    Page 3, “Perplexity on Reduced Corpora”
  6. The next theorem gives the exact formula of the training-set perplexity of a unigram model learned
    Page 3, “Perplexity on Reduced Corpora”
  7. For any distribution p' on a unigram model learned from a corpus w' reduced from the original corpus w following Zipf's law, the perplexity P̂P1 of the λ*-restored distribution p̂ of p' from w' to w is calculated by
    Page 3, “Perplexity on Reduced Corpora”
  8. Considering the special case of W’ = W, we obtain the perplexity PP1 of the unigram model learned from the original corpus w as
    Page 4, “Perplexity on Reduced Corpora”
  9. Here, we will consider the perplexity of a k-gram model learned from a reduced corpus as a standard extension of a unigram model.
    Page 4, “Perplexity on Reduced Corpora”
  10. In the case of unigrams (k = 1), the formula exactly represents Zipf's law.
    Page 5, “Perplexity on Reduced Corpora”
  11. We can regard the former case as a unigram model, since the marginal predictive distribution
    Page 6, “Perplexity on Reduced Corpora”
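To make items 4 and 5 concrete, here is a minimal sketch under the assumption that "constant restoring" means assigning an unnormalized weight λ to each removed word and renormalizing; the paper's exact λ- and λ*-restoring may differ. p'(w) = f(w)/N' is the maximum-likelihood estimate on the reduced corpus, and PP1 is the geometric-mean inverse probability over the original corpus, computed in log space.

```python
from collections import Counter
import math

def restored_unigram(tokens, kept_vocab, lam=1e-6):
    """Sketch of a lambda-restored unigram distribution p-hat: p'(w) = f(w)/N'
    on the reduced corpus, a constant weight lam for removed words, then
    renormalization over the full vocabulary."""
    reduced = [w for w in tokens if w in kept_vocab]
    counts, n_reduced = Counter(reduced), len(reduced)
    raw = {w: (counts[w] / n_reduced if w in kept_vocab else lam)
           for w in set(tokens)}
    z = sum(raw.values())
    return {w: p / z for w, p in raw.items()}

def perplexity(p_hat, tokens):
    """PP = (product over tokens of p-hat(w))^(-1/N), computed in log space."""
    return math.exp(-sum(math.log(p_hat[w]) for w in tokens) / len(tokens))

# Usage: reduce with a cutoff, restore, then score against the original corpus:
# p_hat = restored_unigram(tokens, kept_vocab); pp1 = perplexity(p_hat, tokens)
```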


topic models

Appears in 12 sentences as: topic model (1) Topic Models (1) topic models (10)
In Perplexity on Reduced Corpora
  1. Based on the assumption that a corpus follows Zipf's law, we derive tradeoff formulae of the perplexity of k-gram models and topic models with respect to the size of the reduced vocabulary.
    Page 1, “Abstract”
  2. Removing low-frequency words from a corpus (often called cutoff) is a common practice to save on the computational costs involved in learning language models and topic models.
    Page 1, “Introduction”
  3. In the case of topic models, the intuition is that low-frequency words do not make a large contribution to the statistics of the models.
    Page 1, “Introduction”
  4. Actually, when we try to roughly analyze a corpus with topic models, a reduced corpus is enough for the purpose (Steyvers and Griffiths, 2007).
    Page 1, “Introduction”
  5. To our knowledge, however, there is no theoretical study on the question and no evidence for such a tradeoff relationship, especially for topic models.
    Page 1, “Introduction”
  6. In Section 3, we theoretically derive the tradeoff formulae of the cutoff for unigram models, k-gram models, and topic models, each of which represents its perplexity with respect to a reduced vocabulary, under the assumption that the corpus follows Zipf's law.
    Page 1, “Introduction”
  7. Perplexity is a widely used evaluation measure of k-gram models and topic models.
    Page 2, “Preliminaries”
  8. 3.3 Perplexity of Topic Models
    Page 6, “Perplexity on Reduced Corpora”
  9. In this section, we consider the perplexity of the widely used topic model, Latent Dirichlet Allocation (LDA) (Blei et al., 2003), by using the notation given in (Griffiths and Steyvers, 2004).
    Page 6, “Perplexity on Reduced Corpora”
  10. We used ZipfMix only for the experiments on topic models.
    Page 7, “Experiments”
  11. Therefore, we can use TheoryAve as a heuristic function for estimating the perplexity of topic models.
    Page 9, “Experiments”


LDA

Appears in 10 sentences as: LDA (11)
In Perplexity on Reduced Corpora
  1. In this section, we consider the perplexity of the widely used topic model, Latent Dirichlet Allocation (LDA) (Blei et al., 2003), by using the notation given in (Griffiths and Steyvers, 2004).
    Page 6, “Perplexity on Reduced Corpora”
  2. LDA is a probabilistic language model that generates a corpus as a mixture of hidden topics, and it allows us to infer two parameters: the document-topic distribution θ that represents the mixture rate of topics in each document, and the topic-word distribution φ that represents the occurrence rate of words in each topic (see the sketch after this list).
    Page 6, “Perplexity on Reduced Corpora”
  3. The assumption placed on φ may not be reasonable in the case of θ, because we can easily think of a document with only one topic, and we usually use a small number T of topics for LDA, e.g., T = 20.
    Page 6, “Perplexity on Reduced Corpora”
  4. In the latter case, we can obtain an exact formula for the perplexity of LDA when the topic assigned to each document follows a discrete uniform distribution, as shown in the next theorem.
    Page 6, “Perplexity on Reduced Corpora”
  5. For any distribution p' on the LDA model with T topics learned from a corpus w' reduced from the original corpus w following Zipf's law, assuming that each document only has one topic, which is assigned based on a discrete uniform distribution, the perplexity P̂PMix of the λ*-restored distribution p̂ of p' from w' to w is calculated by
    Page 6, “Perplexity on Reduced Corpora”
  6. This implies that the growth rate of the perplexity of LDA models is larger than that of unigram models, whereas the perplexity of LDA models for the original corpus is smaller than that of unigram models.
    Page 7, “Perplexity on Reduced Corpora”
  7. Figure 1(f) plots the perplexity of LDA models with 20 topics learned from Reuters, 20news, Enwiki, Zipf1, and ZipfMix versus the size of the reduced vocabulary on a log-log graph.
    Page 8, “Experiments”
  8. Table 2: Computational time and memory size for LDA learning on the original corpus, (1/10)-reduced corpus, and (1/20)-reduced corpus of Reuters.
    Page 9, “Experiments”
  9. Finally, let us examine the computational costs for LDA learning.
    Page 9, “Experiments”
  10. Table 2 shows computational time and memory size for LDA learning on the original corpus, (1/10)-reduced corpus, and (1/20)-reduced corpus of Reuters.
    Page 9, “Experiments”
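The sketch below shows how the perplexity of an LDA-style model is typically computed from the two inferred parameters θ and φ, using the standard mixture p(w | d) = Σ_t θ[d, t] · φ[t, w]. The array layout and variable names are assumptions, and the inference of θ and φ (e.g., by collapsed Gibbs sampling as in Griffiths and Steyvers, 2004) is not reproduced here.

```python
import numpy as np

def lda_perplexity(theta, phi, docs):
    """Sketch: perplexity of an LDA-style model.

    theta: (D, T) document-topic distributions
    phi:   (T, W) topic-word distributions
    docs:  list of documents, each a list of word ids
    """
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        p_w = theta[d] @ phi        # p(w | d) = sum_t theta[d, t] * phi[t, w]
        log_lik += float(np.sum(np.log(p_w[doc])))
        n_tokens += len(doc)
    return float(np.exp(-log_lik / n_tokens))
```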


language models

Appears in 6 sentences as: language model (1) language models (5)
In Perplexity on Reduced Corpora
  1. Removing low-frequency words from a corpus (often called cutoff) is a common practice to save on the computational costs involved in learning language models and topic models.
    Page 1, “Introduction”
  2. In the case of language models, we often have to remove low-frequency words because of a lack of computational resources, since the feature space of k-grams tends to be so large that we sometimes need cutoffs even in a distributed environment (Brants et al., 2007).
    Page 1, “Introduction”
  3. Constant restoring is similar to the additive smoothing defined by p̂(w) ∝ p'(w) + λ, which is used to solve the zero-frequency problem of language models (Chen and Goodman, 1996); see the sketch after this list.
    Page 2, “Perplexity on Reduced Corpora”
  4. This means that we can determine the rough sparseness of k-grams and adjust some of the parameters, such as the gram size k, in learning statistical language models.
    Page 6, “Perplexity on Reduced Corpora”
  5. LDA is a probabilistic language model that generates a corpus as a mixture of hidden topics, and it allows us to infer two parameters: the document-topic distribution θ that represents the mixture rate of topics in each document, and the topic-word distribution φ that represents the occurrence rate of words in each topic.
    Page 6, “Perplexity on Reduced Corpora”
  6. Although we did not examine the accuracy of real tasks in this paper, there is an interesting report that the word error rate of language models follows a power law with respect to perplexity (Klakow and Peters, 2002).
    Page 9, “Experiments”
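Item 3 compares constant restoring with additive smoothing. For reference, here is the standard add-λ (additive) smoothing for a unigram model, p(w) = (f(w) + λ) / (N + λ|V|); this is only the comparison target from Chen and Goodman (1996), not the paper's restoring scheme.

```python
from collections import Counter

def additive_smoothing(tokens, vocab, lam=0.01):
    """Sketch of add-lambda smoothing: every word in vocab, including
    unseen ones, gets probability (f(w) + lam) / (N + lam * |vocab|)."""
    counts = Counter(tokens)
    denom = len(tokens) + lam * len(vocab)
    return {w: (counts[w] + lam) / denom for w in vocab}
```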


bigrams

Appears in 4 sentences as: bigram (1) bigrams (4)
In Perplexity on Reduced Corpora
  1. This model seems to be stupid, since we can easily notice that the bigram “is is” is quite frequent, and the two bigrams “is a” and “a is” have the same frequency.
    Page 4, “Perplexity on Reduced Corpora”
  2. For example, the decay function g2 of bigrams is as follows:
    Page 4, “Perplexity on Reduced Corpora”
  3. They pointed out that the exponent of bigrams is about 0.66, and that of 5-grams is about 0.59 in the Wall Street Journal corpus (WSJ87); a rough way to estimate such an exponent is sketched after this list.
    Page 5, “Perplexity on Reduced Corpora”
  4. In the case of bigrams, the perplexities of Theory2 are almost the same as that of Zipf2 when the size of the reduced vocabulary is large.
    Page 8, “Experiments”
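Claims such as "the exponent of bigrams is about 0.66" refer to the slope of the k-gram rank-frequency plot on a log-log scale. The sketch below estimates that exponent with a simple least-squares fit; this is a common rough method shown for illustration, not the procedure used in the cited study.

```python
from collections import Counter
import numpy as np

def zipf_exponent(tokens, k=2):
    """Sketch: estimate the Zipf exponent of k-gram frequencies by fitting
    log-frequency against log-rank with least squares."""
    grams = zip(*(tokens[i:] for i in range(k)))   # overlapping k-grams
    freqs = np.array(sorted(Counter(grams).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope
```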


cross validation

Appears in 3 sentences as: cross validation (3)
In Perplexity on Reduced Corpora
  1. This assumption is natural, considering the situation of an in-domain test or cross validation
    Page 4, “Perplexity on Reduced Corpora”
  2. Each value is the average over different test sets of fivefold cross validation (see the sketch after this list).
    Page 7, “Experiments”
  3. Each value is the average over different test sets of fivefold cross validation.
    Page 9, “Experiments”
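The experimental values are averages over fivefold cross validation. A minimal sketch of such a document-level split is given below; the fold construction, seeding, and the choice to split by documents are assumptions about the setup.

```python
import numpy as np

def five_fold_splits(docs, seed=0):
    """Sketch: yield (train, test) document lists for fivefold cross
    validation; metrics such as perplexity are then averaged over folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(docs)), 5)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield [docs[j] for j in train_idx], [docs[j] for j in test_idx]
```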


error rate

Appears in 3 sentences as: error rate (3)
In Perplexity on Reduced Corpora
  1. Each of these studies experimentally discusses tradeoff relationships between the size of the reduced corpus/model and its performance measured by perplexity, word error rate, and other factors.
    Page 1, “Introduction”
  2. Although we did not examine the accuracy of real tasks in this paper, there is an interesting report that the word error rate of language models follows a power law with respect to perplexity (Klakow and Peters, 2002).
    Page 9, “Experiments”
  3. Thus, we conjecture that the word error rate also has a similar tendency as perplexity with respect to the reduced vocabulary size.
    Page 9, “Experiments”
