Smoothed marginal distribution constraints for language modeling
Roark, Brian and Allauzen, Cyril and Riley, Michael

Article Structure

Abstract

We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of well-known Kneser-Ney (1995) smoothing.

Introduction

Smoothed n-gram language models are the de facto standard statistical models of language for a wide range of natural language applications, including speech recognition and machine translation.

Preliminaries

N-gram language models are typically presented mathematically in terms of words w, the strings (histories) h that precede them, and the suffixes of the histories (backoffs) h' that are used in the smoothing recursion.
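
For reference, the backoff recursion implied by this notation can be written as below; the particular symbols (c for counts, \bar{P} for the smoothed higher-order estimate, \alpha for the backoff weight) are an assumed rendering consistent with the excerpts quoted further down this page, not a verbatim copy of the paper's equation.

    P(w \mid h) =
      \begin{cases}
        \bar{P}(w \mid h)        & \text{if } c(hw) > 0 \\
        \alpha(h)\, P(w \mid h') & \text{otherwise,}
      \end{cases}

where h' is the longest proper suffix of h (its backoff history).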

Marginal distribution constraints

Marginal distribution constraints attempt to match the expected frequency of an n-gram with its observed frequency.
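
Spelled out for a lower-order n-gram h'w, the constraint has the general form below. This rendering follows standard presentations of Kneser-Ney-style marginal constraints and is an assumption, not a reproduction of the paper's own equation.

    \sum_{h \,:\, \mathrm{backoff}(h) = h'} P(h)\, P(w \mid h) \;=\; \hat{P}(h'w),

where P(h) is the marginal probability of history h and \hat{P}(h'w) is the observed relative frequency of the n-gram h'w in the training data.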

Model constraint algorithm

Our algorithm takes a smoothed backoff n-gram language model in an automaton format (see Figure 1) and returns a smoothed backoff n-gram language model with the same topology.
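
The quantity being matched for each retained lower-order n-gram is its expected frequency under the model, computed from the history (state) marginals, with history lengths processed in descending order (see the "unigram" topic below). The following Python sketch illustrates only the expected-frequency computation over a toy dict-based model; the layout, function names, and numbers are illustrative assumptions, not the paper's automaton format or its algorithm.

    # Hypothetical sketch: expected frequency of a lower-order n-gram h'w
    # under a backoff model, the quantity the constraint equates to its
    # observed frequency.

    def prob(h, w, probs, backoff):
        """Backoff lookup: use the explicitly stored n-gram if present,
        otherwise apply the backoff weight and recurse on the suffix history.
        Unigrams are assumed always stored (the paper notes they are never
        pruned)."""
        if (h, w) in probs:
            return probs[(h, w)]
        return backoff[h] * prob(h[1:], w, probs, backoff)

    def expected_frequency(h_prime, w, probs, backoff, state_marginals):
        """Sum P(h) * P(w | h) over the histories h that back off to h_prime."""
        return sum(p_h * prob(h, w, probs, backoff)
                   for h, p_h in state_marginals.items()
                   if len(h) == len(h_prime) + 1 and h[1:] == h_prime)

    if __name__ == "__main__":
        # Toy bigram model over a two-word vocabulary (made-up numbers).
        probs = {((), "a"): 0.6, ((), "b"): 0.4,   # unigram parameters
                 (("a",), "a"): 0.7}               # the only explicit bigram
        backoff = {("a",): 0.75, ("b",): 1.0}      # per-history backoff weights
        state_marginals = {("a",): 0.6, ("b",): 0.4}  # assumed history marginals
        # Expected unigram frequency of "a": 0.6*0.7 + 0.4*1.0*0.6 = 0.66
        print(expected_frequency((), "a", probs, backoff, state_marginals))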

Experimental results

All results presented here are for English Broadcast News.

Summary and Future Directions

The presented method reestimates lower order n-gram model parameters for a given smoothed backoff model, achieving perplexity and WER reductions for many smoothed models.

Topics

n-gram

Appears in 42 sentences as: N-gram (2) n-gram (45)
In Smoothed marginal distribution constraints for language modeling
  1. We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of well-known Kneser-Ney (1995) smoothing.
    Page 1, “Abstract”
  2. We present experimental results for heavily pruned backoff n-gram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods.
    Page 1, “Abstract”
  3. Smoothed n-gram language models are the de facto standard statistical models of language for a wide range of natural language applications, including speech recognition and machine translation.
    Page 1, “Introduction”
  4. Such models are trained on large text corpora, by counting the frequency of n-gram collocations, then normalizing and smoothing (regularizing) the resulting multinomial distributions.
    Page 1, “Introduction”
  5. Briefly, the smoothing method reestimates lower-order n-gram parameters in order to avoid overestimating the likelihood of n-grams that already have ample probability mass allocated as part of higher-order n-grams.
    Page 2, “Introduction”
  6. Lower-order n-gram parameters resulting from Kneser-Ney are not relative frequency estimates, as with other smoothing methods; rather they are parameters
    Page 2, “Introduction”
  7. First, the pruning methods commonly use lower order n-gram probabilities to derive an estimate of state marginals, and, since these parameters are no longer smoothed relative frequency estimates, they do not serve that purpose well.
    Page 2, “Introduction”
  8. In this paper, we present an algorithm that imposes marginal distribution constraints of the sort used in Kneser-Ney modeling on arbitrary smoothed backoff n-gram language models.
    Page 2, “Introduction”
  9. N-gram language models are typically presented mathematically in terms of words w, the strings (histories) h that precede them, and the suffixes of the histories (backoffs) h' that are used in the smoothing recursion.
    Page 3, “Preliminaries”
  10. where c(hw_i) is the count of the n-gram sequence hw_i
    Page 3, “Preliminaries”
  11. N-gram language models allow for a sparse representation, so that only a subset of the possible n-grams must be explicitly stored.
    Page 3, “Preliminaries”
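
Item 4 above summarizes the standard estimation pipeline: count n-gram collocations, normalize the per-history counts into multinomials, then smooth. A minimal, hypothetical Python sketch of just the counting and normalization steps (the smoothing step is omitted; the padding symbols and names are illustrative assumptions):

    from collections import Counter, defaultdict

    def ngram_counts(corpus, n):
        """Count n-gram collocations in a tokenized corpus (list of sentences)."""
        counts = Counter()
        for sentence in corpus:
            padded = ["<s>"] * (n - 1) + sentence + ["</s>"]
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += 1
        return counts

    def relative_frequencies(counts):
        """Normalize counts into per-history multinomials (unsmoothed MLE)."""
        history_totals = defaultdict(int)
        for ngram, c in counts.items():
            history_totals[ngram[:-1]] += c
        return {ng: c / history_totals[ng[:-1]] for ng, c in counts.items()}

    # Example: bigram relative frequencies from a toy two-sentence corpus.
    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    print(relative_frequencies(ngram_counts(corpus, 2))[("the", "cat")])  # 0.5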

n-grams

Appears in 26 sentences as: n-grams (32)
In Smoothed marginal distribution constraints for language modeling
  1. Standard techniques store the observed n-grams and derive probabilities of unobserved n-grams via their longest observed suffix and “backoff” costs associated with the prefix histories of the unobserved suffixes.
    Page 1, “Introduction”
  2. Hence the size of the model grows with the number of observed n-grams, which is very large for typical training corpora.
    Page 1, “Introduction”
  3. These data structures permit efficient querying for specific n-grams in a model that has been stored in a fraction of the space required to store the full, exact model, though with some probability of false positives.
    Page 1, “Introduction”
  4. Another common approach — which we pursue in this paper — is model pruning, whereby some number of the n-grams are removed from explicit storage in the model, so that their probability must be assigned via backoff smoothing.
    Page 1, “Introduction”
  5. One simple pruning method is count thresholding, i.e., discarding n-grams that occur less than k times in the corpus.
    Page 1, “Introduction”
  6. Beyond count thresholding, the most widely used pruning methods (Seymore and Rosenfeld, 1996; Stolcke, 1998) employ greedy algorithms to reduce the number of stored n-grams by comparing the stored probabilities to those that would be assigned via the backoff smoothing mechanism, and removing those with the least impact according to some criterion.
    Page 1, “Introduction”
  7. Briefly, the smoothing method reestimates lower-order n-gram parameters in order to avoid overestimating the likelihood of n-grams that already have ample probability mass allocated as part of higher-order n-grams.
    Page 2, “Introduction”
  8. w_i in the training corpus; P̄ is a regularized probability estimate that provides some probability mass for unobserved n-grams; and α(h)
    Page 3, “Preliminaries”
  9. N-gram language models allow for a sparse representation, so that only a subset of the possible n-grams must be explicitly stored.
    Page 3, “Preliminaries”
  10. Probabilities for the rest of the n-grams are calculated through the “otherwise” semantics in the equation above.
    Page 3, “Preliminaries”
  11. an n-gram language model G, we will say that an n-gram hw ∈ G if it is explicitly represented in the model; otherwise hw ∉ G. In the standard n-gram formulation above, the assumption is that if c(hw) > 0 then the n-gram has a parameter; yet with pruning, we remove many observed n-grams from the model, hence this is no longer the appropriate criterion.
    Page 3, “Preliminaries”
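
Item 5 above mentions count thresholding as the simplest pruning method. A one-function hypothetical sketch, assuming n-grams are kept as a dict from word tuples to counts (unigrams are retained regardless, mirroring the paper's convention that unigrams are never pruned):

    def count_threshold_prune(counts, k):
        """Count thresholding: drop n-grams observed fewer than k times,
        always keeping the unigrams."""
        return {ng: c for ng, c in counts.items() if c >= k or len(ng) == 1}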

language models

Appears in 16 sentences as: Language Model (1) language model (4) language modeling (1) language models (11)
In Smoothed marginal distribution constraints for language modeling
  1. We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of well-known Kneser-Ney (1995) smoothing.
    Page 1, “Abstract”
  2. Smoothed n-gram language models are the de facto standard statistical models of language for a wide range of natural language applications, including speech recognition and machine translation.
    Page 1, “Introduction”
  3. Smoothed marginal distribution constraints for language modeling
    Page 1, “Introduction”
  4. As a result, statistical language models — an important component of many such applications — are often trained on very large corpora, then modified to fit within some pre-specified size bound.
    Page 1, “Introduction”
  5. While these greedy pruning methods are highly effective for models estimated with most common smoothing approaches, they have been shown to be far less effective with Kneser-Ney trained language models (Siivola et al., 2007; Chelba et al., 2010), leading to severe degradation in model quality relative to other standard smoothing methods
    Page 1, “Introduction”
  6. In their experiments (see Table 1 caption for specifics on training/test setup), they trained 4-gram Broadcast News language models using a variety of both backoff and interpolated smoothing methods and measured perplexity before and after Stolcke (1998) relative entropy based pruning.
    Page 2, “Introduction”
  7. The cause of this degradation is Kneser-Ney’s unique method for estimating smoothed language models, which will be presented in more detail in Section 3.
    Page 2, “Introduction”
  8. Goodman (2001) proved that, under certain assumptions, such constraints can only improve language models.
    Page 2, “Introduction”
  9. In this paper, we present an algorithm that imposes marginal distribution constraints of the sort used in Kneser-Ney modeling on arbitrary smoothed backoff n-gram language models.
    Page 2, “Introduction”
  10. N-gram language models are typically presented mathematically in terms of words w, the strings (histories) h that precede them, and the suffixes of the histories (backoffs) h' that are used in the smoothing recursion.
    Page 3, “Preliminaries”
  11. N-gram language models allow for a sparse representation, so that only a subset of the possible n-grams must be explicitly stored.
    Page 3, “Preliminaries”

error rate

Appears in 4 sentences as: error rate (4)
In Smoothed marginal distribution constraints for language modeling
  1. We present experimental results for heavily pruned backoff n-gram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods.
    Page 1, “Abstract”
  2. We now look at the impacts on system performance we can achieve with these new models, and whether the perplexity differences that we observe translate to real error rate reductions.
    Page 8, “Experimental results”
  3. Word error rate (WER)
    Page 9, “Experimental results”
  4. The perplexity reductions that were achieved for these models do translate to real word error rate reductions at both stages of between 0.5 and 0.9 percent absolute.
    Page 9, “Experimental results”

unigram

Appears in 4 sentences as: unigram (4) unigrams (1)
In Smoothed marginal distribution constraints for language modeling
  1. Thus the unigram distribution is with respect to the bigram model, the bigram model is with respect to the trigram model, and so forth.
    Page 5, “Marginal distribution constraints”
  2. Thus we process each history length in descending order, finishing with the unigram state.
    Page 6, “Model constraint algorithm”
  3. This can be particularly clearly seen at the unigram state, which has an arc for every unigram (the size of the vocabulary): for every bigram state (also on the order of the vocabulary in number), in the naive algorithm we must look for every possible arc.
    Page 6, “Model constraint algorithm”
  4. Note that unigrams in the models are never pruned, hence all models assign probabilities over an identical vocabulary and perplexity is comparable across models.
    Page 7, “Experimental results”
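
To make the cost in item 3 concrete: with a vocabulary of size |V|, the unigram state has on the order of |V| arcs and there are on the order of |V| bigram states, so the naive approach of looking for every possible arc at every bigram state already requires on the order of |V| × |V| = |V|^2 lookups for the unigram level alone.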

bigram

Appears in 3 sentences as: bigram (5)
In Smoothed marginal distribution constraints for language modeling
  1. For example, if a bigram parameter is modified due to the presence of some set of trigrams, and then some or all of those trigrams are pruned from the model, the bigram associated with the modified parameter will be unlikely to have an overall expected frequency equal to its observed frequency anymore.
    Page 2, “Introduction”
  2. Thus the unigram distribution is with respect to the bigram model, the bigram model is with respect to the trigram model, and so forth.
    Page 5, “Marginal distribution constraints”
  3. This can be particularly clearly seen at the unigram state, which has an arc for every unigram (the size of the vocabulary): for every bigram state (also on the order of the vocabulary in number), in the naive algorithm we must look for every possible arc.
    Page 6, “Model constraint algorithm”

model training

Appears in 3 sentences as: model training (3)
In Smoothed marginal distribution constraints for language modeling
  1. This is done via a marginal distribution constraint which requires the expected frequency of the lower-order n-grams to match their observed frequency in the training data, much as is commonly done for maximum entropy model training.
    Page 2, “Introduction”
  2. For all results reported here, we use the SRILM toolkit for baseline model training and pruning, then convert from the resulting ARPA format model to an OpenFst format (Allauzen et al., 2007), as used in the OpenGrm n-gram library (Roark et al., 2012).
    Page 7, “Experimental results”
  3. The model was trained on the 1996 and 1997 Hub4 acoustic model training sets (about 150 hours of data) using semi-tied covariance modeling and CMLLR-based speaker adaptive training and 4 iterations of boosted MMI.
    Page 8, “Experimental results”
