Integrating history-length interpolation and classes in language modeling
Schütze, Hinrich

Article Structure

Abstract

Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation.
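
To make this view concrete, one natural way to combine the two mechanisms is a history-dependent interpolation in which the class-based distribution receives high weight only when the history makes a rare continuation likely. The following is a generic sketch of such a mixture, not necessarily the exact parameterization used in the paper; the symbols \lambda and p_class are illustrative:

    p(w_3 \mid w_1 w_2) = \lambda(w_1 w_2)\, p_{\text{class}}(w_3 \mid w_1 w_2) + \bigl(1 - \lambda(w_1 w_2)\bigr)\, p_{\text{KN}}(w_3 \mid w_1 w_2)

Here \lambda(w_1 w_2) plays the same role as the weight given to the lower-order distribution in history-length interpolation: it should be large for sparsely observed histories and small for well-observed ones.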

Introduction

Language models, probability distributions over strings of words, are fundamental to many applications in natural language processing.

Related work

A large number of different class-based models have been proposed in the literature.

Models

In this section, we introduce the three models that we compare in our experiments: the Kneser-Ney model, the Dupont-Rosenfeld model, and the top-level interpolation model.

Experimental Setup

We run experiments on a Wall Street Journal (WSJ) corpus of 50M words, split 8:1:1 into training, validation and test sets.

Results

Table 3 shows the performance of pDR and pTop for a range of base set sizes and for classes trained on all events and on unique events.

Polynomial discounting

Further comparative analysis of pDR and pTop revealed that pDR is not uniformly better than pTop.

Conclusion

Our hypothesis was that classes are a generalization mechanism for rare events that serves the same function as history-length interpolation and that classes should therefore be (i) primarily trained on rare events and (ii) receive high weight only if it is likely that a rare event will follow and be weighted in a way analogous to the weighting of lower-order distributions in history-length interpolation.

Topics

bigram

Appears in 25 sentences as: bigram (18) bigrams (13)
In Integrating history-length interpolation and classes in language modeling
  1. The parameters d', d'', and d''' are the discounts for unigrams, bigrams and trigrams, respectively, as defined by Chen and Goodman (1996, p. 20, (26)).
    Page 3, “Models”
  2. bigram) histories that is covered by the clusters.
    Page 3, “Models”
  3. We cluster bigram histories and unigram histories separately and write pB(w3|w1w2) for the bigram cluster model and pB(w3|w2) for the unigram cluster model.
    Page 3, “Models”
  4. Since we cluster both unigrams and bigrams, we interpolate three models:
    Page 4, “Models”
  5. 256,873 unique unigrams and 4,494,222 unique bigrams.
    Page 4, “Experimental Setup”
  6. We cluster unigrams (i = 1) and bigrams (i = 2).
    Page 4, “Experimental Setup”
  7. SRILM does not directly support bigram clustering.
    Page 4, “Experimental Setup”
  8. We therefore represent a bigram as a hyphenated word in bigram clustering; e.g., Pan Am is represented as Pan-Am.
    Page 4, “Experimental Setup”
  9. For a particular base set size b, the unigram input vocabulary B1 is set to the b most frequent unigrams in the training set and the bigram input vocabulary B2 is set to the b most frequent bigrams in the training set.
    Page 4, “Experimental Setup”
  10. • All-event bigram clustering.
    Page 4, “Experimental Setup”
  11. A sentence of the raw corpus that contains s words is included twice, once as a sequence of the ⌊s/2⌋ bigrams "w1-w2 w3-w4 w5-w6 ..." and once as a sequence of the ⌊(s-1)/2⌋ bigrams "w2-w3 w4-w5 w6-w7 ...".
    Page 4, “Experimental Setup”
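
As a rough illustration of excerpts 8, 10 and 11 above, the following Python sketch turns a whitespace-tokenized sentence into the two offset sequences of hyphenated bigram tokens that an off-the-shelf word-clustering tool (such as SRILM's) could then cluster. The function name and all details are illustrative assumptions, not code from the paper:

    def hyphenated_bigram_sequences(sentence):
        """Return the two offset sequences of hyphenated bigram tokens
        for one sentence, e.g. 'Pan Am ...' -> 'Pan-Am ...'.
        Sketch only; the paper's actual preprocessing may differ."""
        words = sentence.split()
        # sequence starting at the first word: w1-w2 w3-w4 w5-w6 ...
        even = ["-".join(words[i:i + 2]) for i in range(0, len(words) - 1, 2)]
        # sequence starting at the second word: w2-w3 w4-w5 w6-w7 ...
        odd = ["-".join(words[i:i + 2]) for i in range(1, len(words) - 1, 2)]
        return even, odd

    # A 7-word sentence yields floor(7/2) = 3 and floor((7-1)/2) = 3 bigrams:
    print(hyphenated_bigram_sequences("w1 w2 w3 w4 w5 w6 w7"))
    # (['w1-w2', 'w3-w4', 'w5-w6'], ['w2-w3', 'w4-w5', 'w6-w7'])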

unigram

Appears in 23 sentences as: unigram (17) unigrams (11)
In Integrating history-length interpolation and classes in language modeling
  1. symbol | denotation: Σw (sum over all unigrams w)
    Page 3, “Related work”
  2. The parameters d', d'', and d''' are the discounts for unigrams, bigrams and trigrams, respectively, as defined by Chen and Goodman (1996, p. 20, (26)).
    Page 3, “Models”
  3. B2) is the set of unigram (resp.
    Page 3, “Models”
  4. We cluster bigram histories and unigram histories separately and write pB(w3|w1w2) for the bigram cluster model and pB(w3|w2) for the unigram cluster model.
    Page 3, “Models”
  5. The unigram distribution of the Dupont-Rosenfeld model is set to the unigram distribution of the KN model: pDR(w) = pKN(w)
    Page 4, “Models”
  6. Since we cluster both unigrams and bigrams, we interpolate three models:
    Page 4, “Models”
  7. 256,873 unique unigrams and 4,494,222 unique bigrams.
    Page 4, “Experimental Setup”
  8. We cluster unigrams (i = 1) and bigrams (i = 2).
    Page 4, “Experimental Setup”
  9. For all experiments, |B1| = |B2| (except in cases where |B2| exceeds the number of unigrams, see below).
    Page 4, “Experimental Setup”
  10. For a particular base set size b, the unigram input vocabulary B1 is set to the b most frequent unigrams in the training set and the bigram input vocabulary B2 is set to the b most frequent bigrams in the training set.
    Page 4, “Experimental Setup”
  11. • All-event unigram clustering.
    Page 4, “Experimental Setup”
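
A minimal sketch of the vocabulary construction described in excerpts 9 and 10 above, assuming whitespace-tokenized training sentences; apart from the names B1 and B2, everything here is an illustrative assumption rather than the paper's code:

    from collections import Counter

    def base_vocabularies(sentences, b):
        """B1 = the b most frequent unigrams, B2 = the b most frequent
        hyphenated bigrams in the training data. Sketch only."""
        uni, bi = Counter(), Counter()
        for sent in sentences:
            words = sent.split()
            uni.update(words)
            bi.update("-".join(pair) for pair in zip(words, words[1:]))
        B1 = [w for w, _ in uni.most_common(b)]  # capped at the number of unique unigrams
        B2 = [w for w, _ in bi.most_common(b)]
        return B1, B2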

language models

Appears in 17 sentences as: language model (5) language modeling (5) Language models (1) language models (6)
In Integrating history-length interpolation and classes in language modeling
  1. Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation.
    Page 1, “Abstract”
  2. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events.
    Page 1, “Abstract”
  3. We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.
    Page 1, “Abstract”
  4. Language models, probability distributions over strings of words, are fundamental to many applications in natural language processing.
    Page 1, “Introduction”
  5. The main challenge in language modeling is to estimate string probabilities accurately given that even very large training corpora cannot overcome the inherent sparseness of word sequence data.
    Page 1, “Introduction”
  6. Plausible though this line of reasoning is, the language models most commonly used today do not incorporate class-based generalization.
    Page 1, “Introduction”
  7. In this paper, we propose a new type of class-based language model .
    Page 1, “Introduction”
  8. The main contribution of this paper is to propose the same mechanism for class language models .
    Page 2, “Introduction”
  9. We will show that this method of integrating history interpolation and classes significantly increases the performance of a language model .
    Page 2, “Introduction”
  10. However, the importance of rare events for clustering in language modeling has not been investigated before.
    Page 2, “Related work”
  11. Our work is most similar to the lattice-based language models proposed by Dupont and Rosenfeld (1997).
    Page 2, “Related work”

n-grams

Appears in 5 sentences as: n-grams (5)
In Integrating history-length interpolation and classes in language modeling
  1. ability of a class on n-grams of lexical items (as opposed to classes) (Whittaker and Woodland, 2001; Emami and Jelinek, 2005; Uszkoreit and Brants, 2008).
    Page 2, “Related work”
  2. Models that condition classes on lexical n-grams could be extended in a way similar to what we propose here.
    Page 2, “Related work”
  3. Our use of classes of lexical n-grams for n > 1 has several precedents in the literature (Suhm and Waibel, 1994; Kuo and Reichl, 1999; Deligne and Sagisaka, 2000; Justo and Torres, 2009).
    Page 2, “Related work”
  4. An important parameter of the class-based model is the size of the base set, i.e., the total number of n-grams (or rather i-grams) to be clustered.
    Page 4, “Experimental Setup”
  5. We could add a constant to d, but one of the basic premises of the KN model, derived from the assumption that n-gram marginals should be equal to relative frequencies, is that the discount is larger for more frequent n-grams, although in many implementations of KN only the cases c = 1, c = 2, and c ≥ 3 are distinguished.
    Page 8, “Polynomial discounting”
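
For reference, the three-way case distinction mentioned in excerpt 5 is the one used in modified Kneser-Ney smoothing (Chen and Goodman), where the discount applied to an n-gram with training count c is a step function of c. This is the standard formulation, not a quotation from the paper, and the paper's polynomial discounting generalizes the idea that the discount should grow with c:

    D(c) = \begin{cases}
      D_1    & \text{if } c = 1, \\
      D_2    & \text{if } c = 2, \\
      D_{3+} & \text{if } c \ge 3.
    \end{cases}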

clusterings

Appears in 4 sentences as: clusterings (5)
In Integrating history-length interpolation and classes in language modeling
  1. We run four different clusterings for each base set size (except for the large sets, see below).
    Page 4, “Experimental Setup”
  2. The unique-event clusterings are motivated by the fact that in the Dupont-Rosenfeld model, frequent events are handled by discounted ML estimates.
    Page 5, “Experimental Setup”
  3. As we will see below, rare-event clusterings perform better than all-event clusterings.
    Page 5, “Experimental Setup”
  4. When comparing all-event and unique-event clusterings, a clear tendency is apparent.
    Page 6, “Results”
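
A small sketch of the distinction drawn in these excerpts, assuming the clustering step is simply fed a list of events; the function and its names are illustrative, not the paper's code:

    def clustering_events(events, unique_only=False):
        """All-event clustering keeps every occurrence, so frequent events
        dominate the objective; unique-event clustering keeps each distinct
        event once, emphasizing the rare events that are not already handled
        by discounted ML estimates. Sketch only."""
        if not unique_only:
            return list(events)
        return list(dict.fromkeys(events))  # de-duplicate, preserving first-occurrence order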

objective function

Appears in 4 sentences as: objective function (4)
In Integrating history-length interpolation and classes in language modeling
  1. The reason is that the objective function maximizes mutual information.
    Page 5, “Experimental Setup”
  2. Highly differentiated classes for frequent words contribute substantially to this objective function whereas putting all rare words in a few large clusters does not hurt the objective much.
    Page 5, “Experimental Setup”
  3. However, our focus is on using clustering for improving prediction for rare events; this means that the objective function is counterproductive when contexts are frequency-weighted as they occur in the corpus.
    Page 5, “Experimental Setup”
  4. After overweighting rare contexts, the objective function is more in sync with what we use clusters for in our model.
    Page 5, “Experimental Setup”
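
For reference, the mutual-information objective referred to in excerpt 1 is, in standard Brown-style class induction, the average mutual information between the classes of adjacent words; this is the textbook formulation, not a quotation from the paper:

    I(C_1; C_2) = \sum_{c_1, c_2} p(c_1, c_2) \,\log \frac{p(c_1, c_2)}{p(c_1)\, p(c_2)}

Because p(c_1, c_2) is estimated from corpus frequencies, frequent words dominate this sum, which is what excerpts 2 and 3 point out and why rare contexts are overweighted before clustering in excerpt 4.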

probability distributions

Appears in 3 sentences as: probability distributions (3)
In Integrating history-length interpolation and classes in language modeling
  1. Language models, probability distributions over strings of words, are fundamental to many applications in natural language processing.
    Page 1, “Introduction”
  2. Table 2: Key to probability distributions
    Page 5, “Experimental Setup”
  3. Table 2 is a key to the probability distributions we use.
    Page 6, “Experimental Setup”
