Knowing the Unseen: Estimating Vocabulary Size over Unseen Samples
Bhat, Suma and Sproat, Richard

Article Structure

Abstract

Empirical studies on corpora involve measuring several quantities for the purpose of comparing corpora, creating language models, or making generalizations about specific linguistic phenomena in a language.

Introduction

Empirical studies on corpora involve measuring several quantities for the purpose of comparing corpora, creating language models, or making generalizations about specific linguistic phenomena in a language.

Previous Work

Many estimators of vocabulary size are available in the literature and a comparison of several non

Novel Estimator of Vocabulary size

We observe (X1, .

Experiments

4.1 Corpora

Results and Discussion

The performance of the different estimators, expressed as percentage errors relative to the true vocabulary size on each corpus, is tabulated in Tables 1-4.

Conclusion

In this paper, we have proposed a new nonparametric estimator of vocabulary size that takes into account the LNRE property of word frequency distributions and have shown that it is statistically consistent.

Topics

state of the art

Appears in 10 sentences as: state of the art (10)
In Knowing the Unseen: Estimating Vocabulary Size over Unseen Samples
  1. Finally, we compare our proposal with the state of the art estimators (both parametric and nonparametric) on large standard corpora; apart from showing the favorable performance of our estimator, we also see that the classical Good-Turing estimator consistently underestimates the vocabulary size.
    Page 1, “Abstract”
  2. When compared with other vocabulary size estimates, we see that our estimator performs at least as well as some of the state of the art estimators.
    Page 1, “Introduction”
  3. A good survey of the state of the art is available in (Gandolfi and Sastri, 2004).
    Page 2, “Previous Work”
  4. In this study we consider state of the art parametric estimators, as surveyed by (Baroni and Evert, 2005).
    Page 7, “Experiments”
  5. From Figure 1, we see that our estimator compares quite favorably with the best of the state of the art estimators.
    Page 7, “Results and Discussion”
  6. The best of the state of the art estimators is a parametric one (ZM), while ours is a nonparametric estimator.
    Page 7, “Results and Discussion”
  7. Further, it compares very favorably to the state of the art estimators (both parametric and nonparametric).
    Page 7, “Results and Discussion”
  8. Our estimator has theoretical performance guarantees and its empirical performance is comparable to that of the state of the art estimators.
    Page 7, “Results and Discussion”
  9. The state of the art nonparametric Good-Turing estimator wildly underestimates the vocabulary; this is true in each of the four corpora studied and at all sample sizes.
    Page 7, “Results and Discussion”
  10. We then compared the performance of the proposed estimator with that of the state of the art estimators on large corpora.
    Page 7, “Conclusion”
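
The excerpts above name the Good-Turing estimator as a nonparametric baseline but do not give its formula. As a rough illustration only, the sketch below uses one commonly cited coverage-based form, V_obs / (1 - n1/n), where V_obs is the number of distinct words seen, n1 the number of words seen exactly once, and n the sample size. This form is an assumption and may differ from the variant the authors evaluated; the function name good_turing_vocab_size is hypothetical.

```python
from collections import Counter


def good_turing_vocab_size(tokens):
    """Coverage-based Good-Turing style estimate of total vocabulary size.

    Assumed form (not taken from the paper excerpts): V_obs / (1 - n1 / n).
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    counts = Counter(tokens)
    v_obs = len(counts)                               # distinct word types seen
    n1 = sum(1 for c in counts.values() if c == 1)    # types seen exactly once
    if n1 == n:
        # Every token is a singleton: estimated coverage is zero,
        # so this form of the estimate is unbounded.
        return float("inf")
    return v_obs / (1.0 - n1 / n)


# Toy sample for illustration only; real use would pass a tokenized corpus.
sample = "the cat sat on the mat the end".split()
estimate = good_turing_vocab_size(sample)
print(estimate)
```

The estimate is never smaller than the observed type count, and it grows as the share of singletons grows, which is why the quality of the missing-mass estimate n1/n drives its accuracy.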

See all papers in Proc. ACL 2009 that mention state of the art.


probability distributions

Appears in 3 sentences as: probability distribution (1) probability distributions (2)
In Knowing the Unseen: Estimating Vocabulary Size over Unseen Samples
  1. sequence drawn according to a probability distribution P from a large, but finite, vocabulary Ω.
    Page 2, “Novel Estimator of Vocabulary size”
  2. Our main interest is in probability distributions P
    Page 2, “Novel Estimator of Vocabulary size”
  3. In particular, the authors consider a sequence of vocabulary sets and probability distributions, indexed by the observation size n. Specifically, the observation (X1, .
    Page 2, “Novel Estimator of Vocabulary size”
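
The setting in these excerpts, an i.i.d. sequence drawn from a distribution P over a large but finite vocabulary, can be simulated to see why the observed vocabulary falls short of the true one. The sketch below is illustrative only: the Zipf-like weights, the vocabulary size of 1000, the sample size of 500, and the function name observed_vocabulary are all assumptions, not values from the paper.

```python
import random


def observed_vocabulary(probs, n, seed=0):
    """Draw n i.i.d. words from the distribution `probs` and return the set seen."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    words = list(probs)
    weights = [probs[w] for w in words]
    sample = rng.choices(words, weights=weights, k=n)
    return set(sample)


# Hypothetical Zipf-like distribution over 1000 word types (assumption).
true_size = 1000
raw = {f"w{i}": 1.0 / i for i in range(1, true_size + 1)}
total = sum(raw.values())
P = {w: p / total for w, p in raw.items()}

seen = observed_vocabulary(P, n=500)
print(len(seen), "of", true_size, "types observed")
```

Because many word types have very small probability, a sample of this size leaves most of them unseen, which is the gap a vocabulary size estimator tries to fill.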
