Kneser-Ney Smoothing on Expected Counts
Zhang, Hui and Chiang, David

Article Structure

Abstract

Widely used in speech and language processing, Kneser-Ney (KN) smoothing has consistently been shown to be one of the best-performing smoothing methods.

Introduction

In speech and language processing, smoothing is essential to reduce overfitting, and Kneser-Ney (KN) smoothing (Kneser and Ney, 1995; Chen and Goodman, 1999) has consistently proven to be among the best-performing and most widely used methods.

Smoothing on integral counts

Before presenting our method, we review KN smoothing on integer counts as applied to language models, although, as we will demonstrate in Section 7, KN smoothing is applicable to other tasks as well.
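The excerpts under the n-gram and unigram topics below all refer to a discounted, interpolated estimate over these integer counts. For orientation, the standard interpolated absolute-discounting form assembled from the quantities those excerpts name (a reconstruction from common presentations of KN smoothing, not a quotation of the paper's equation (2)) is:

```latex
\[
p(w \mid u) \;=\; \frac{\max\bigl(c(uw) - D,\, 0\bigr)}{c(u)}
\;+\; \frac{D\, n_{1+}(u\bullet)}{c(u)}\, p'(w \mid u'),
\qquad
n_{1+}(u\bullet) = \bigl|\{\, w : c(uw) > 0 \,\}\bigr|,
\]
```

where D is the discount and p'(w | u') is the lower-order distribution discussed in the topic listings below.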

Count distributions

The computation of D and p’ above made use of nr and nr+, which presupposes integer counts.
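The closed-form discount that this kind of leave-one-out derivation yields in standard presentations (Ney et al., 1994; Chen and Goodman, 1999) is stated here from those sources rather than quoted from the paper:

```latex
\[
D \;=\; \frac{n_1}{n_1 + 2 n_2},
\]
```

where n_1 and n_2 are the numbers of n-gram types occurring exactly once and exactly twice.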

Smoothing on count distributions

We are now ready to describe how to apply KN smoothing to count distributions.

Related Work

Witten-Bell (WB) smoothing is somewhat easier than KN to adapt to fractional counts.

Language model adaptation

N-gram language models are widely used in applications like machine translation and speech recognition to select fluent output sentences.

Word Alignment

In this section, we show how to apply expected KN to the IBM word alignment models (Brown et al., 1993).

Conclusion

For a long time, and as noted by many authors, the usage of KN smoothing has been limited by its restriction to integer counts.

Topics

language model

Appears in 17 sentences as: language model (10), language modeling (3), language models (4)
In Kneser-Ney Smoothing on Expected Counts
  1. We rederive all the steps of KN smoothing to operate on count distributions instead of integral counts, and apply it to two tasks where KN smoothing was not applicable before: one in language model adaptation, and the other in word alignment.
    Page 1, “Abstract”
  2. Such cases have been noted for language modeling (Goodman, 2001; Goodman, 2004), domain adaptation (Tam and Schultz, 2008), grapheme-to-phoneme conversion (Bisani and Ney, 2008), and phrase-based translation (Andres-Ferrer, 2010; Wuebker et al., 2012).
    Page 1, “Introduction”
  3. One is language model domain adaptation, and the other is word alignment using the IBM models (Brown et al., 1993).
    Page 1, “Introduction”
  4. Before presenting our method, we review KN smoothing on integer counts as applied to language models, although, as we will demonstrate in Section 7, KN smoothing is applicable to other tasks as well.
    Page 1, “Smoothing on integral counts”
  5. This method subtracts D directly from the fractional counts, zeroing out counts that are smaller than D. The discount D must be set by minimizing an error metric on held-out data using a line search (Tam, p. c.) or Powell’s method (Bisani and Ney, 2008), requiring repeated estimation and evaluation of the language model.
    Page 5, “Related Work”
  6. N-gram language models are widely used in applications like machine translation and speech recognition to select fluent output sentences.
    Page 6, “Language model adaptation”
  7. Here, we propose to assign each sentence a probability to indicate how likely it is to belong to the domain of interest, and train a language model using expected KN smoothing.
    Page 6, “Language model adaptation”
  8. They first train two language models, p_in on a set of in-domain data, and p_out on a set of general-domain data.
    Page 6, “Language model adaptation”
  9. Figure 1: On the language model adaptation task, expected KN outperforms all other methods across all sizes of selected subsets.
    Page 6, “Language model adaptation”
  10. Then we use the weighted subset to train a language model with expected KN smoothing.
    Page 6, “Language model adaptation”
  11. They use the in-domain training data to select a subset of the general-domain data, build a language model on the selected subset, and evaluate its perplexity on the in-domain test data.
    Page 6, “Language model adaptation”
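Items 7 and 10 above summarize the proposed pipeline: give each general-domain sentence a probability of belonging to the domain of interest, then train a language model on the weighted data with expected KN smoothing. The sketch below covers only the weighting and fractional-count collection step; the sentence weights, function names, and toy data are illustrative assumptions, and the expected-KN estimator itself is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def fractional_bigram_counts(
    weighted_sentences: Iterable[Tuple[List[str], float]]
) -> Dict[Tuple[str, str], List[float]]:
    """Collect, for each bigram type, its per-occurrence weights.

    Each sentence carries a weight in [0, 1], interpreted as the probability
    that the sentence belongs to the domain of interest.  The weight lists
    can then be turned into count distributions and handed to expected KN.
    """
    occurrences: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for words, weight in weighted_sentences:
        padded = ["<s>"] + words + ["</s>"]
        for u, w in zip(padded, padded[1:]):
            occurrences[(u, w)].append(weight)
    return occurrences

# Hypothetical example: two sentences judged in-domain with probabilities 0.9 and 0.2.
data = [(["the", "cat"], 0.9), (["the", "dog"], 0.2)]
print(fractional_bigram_counts(data)[("the", "cat")])   # [0.9]
```

How the per-sentence probabilities are obtained is a separate modeling choice; one option based on the Moore and Lewis (2010) score is sketched under the in-domain topic below.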


n-gram

Appears in 7 sentences as: n-gram (7)
In Kneser-Ney Smoothing on Expected Counts
  1. Let uw stand for an n-gram, where u stands for the (n − 1) context words and w, the predicted word.
    Page 1, “Smoothing on integral counts”
  2. where n1+(u•) = |{w | C(uw) > 0}| is the number of word types observed after context u, and qu(w) specifies how to distribute the subtracted discounts among unseen n-gram types.
    Page 2, “Smoothing on integral counts”
  3. The probability of an n-gram token uw using the other tokens as training data is
    Page 2, “Smoothing on integral counts”
  4. where nr = |{uw | C(uw) = r}| is the number of n-gram types appearing r times, and C is a constant not depending on D. Setting the partial derivative with respect to D to zero, we have
    Page 2, “Smoothing on integral counts”
  5. On integral counts, this is simple: we generate, for each n-gram type vu’w, an (n − 1)-gram token u’w, for a total of n1+(•u’w) tokens.
    Page 5, “Smoothing on count distributions”
  6. Analogously, on count distributions, for each n-gram type vu’w, we generate an (n − 1)-gram token u’w with probability p(c(vu’w) > 0).
    Page 5, “Smoothing on count distributions”
  7. Using the dynamic program in Section 3.2, computing the distributions for each r is linear in the number of n-gram types, and we only need to compute the distributions up to r = 2 (or r = 4 for modified KN), and store them for r = 0 (or up to r = 2 for modified KN).
    Page 5, “Smoothing on count distributions”
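Item 7 above refers to the dynamic program of Section 3.2 for computing count distributions. That program is not quoted in these excerpts; the sketch below is one plausible version, under the assumption that each potential occurrence of an n-gram type is an independent event whose success probability is its fractional weight, so p(c = r) can be built up one occurrence at a time and truncated at the largest r needed (r = 2 for KN, r = 4 for modified KN).

```python
from typing import List

def count_distribution(weights: List[float], max_r: int) -> List[float]:
    """Distribution of c = number of occurrences that actually happen,
    where occurrence i happens independently with probability weights[i].

    Returns d with d[r] = p(c = r) for r < max_r and d[max_r] = p(c >= max_r).
    """
    d = [1.0] + [0.0] * max_r                    # before any occurrence, c = 0
    for p in weights:
        new = [0.0] * (max_r + 1)
        for r in range(max_r + 1):
            new[r] += d[r] * (1.0 - p)           # this occurrence does not happen
            new[min(r + 1, max_r)] += d[r] * p   # it happens; cap mass at max_r
        d = new
    return d

# An n-gram type with two potential occurrences of weight 0.9 and 0.5:
print(count_distribution([0.9, 0.5], max_r=2))   # ~[0.05, 0.50, 0.45]
```

Each update is O(max_r), so a full pass is linear in the number of occurrences, consistent with the linear-time claim in item 7.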


unigram

Appears in 7 sentences as: unigram (7)
In Kneser-Ney Smoothing on Expected Counts
  1. Absolute discounting chooses p’(w | u’) to be the maximum-likelihood unigram distribution; under KN smoothing (Kneser and Ney, 1995), it is chosen to make the smoothed distribution p in (2) satisfy the following constraint for all (n − 1)-grams u’w:
    Page 2, “Smoothing on integral counts”
  2. For the example above, the estimates for the unigram model p’(w) are p’(cat) ≈ 0.489 and p’(dog) ≈ 0.511.
    Page 4, “Smoothing on count distributions”
  3. For the example above, the count distributions used for the unigram distribution would be: p(c(cat) = 0) = 0.14, p(c(cat) = 1) = 0.86, p(c(dog) = 0) = 0.1, and p(c(dog) = 1) = 0.9.
    Page 5, “Smoothing on count distributions”
  4. Following common practice in language modeling, we use the unigram distribution p(f) as the lower-order distribution.
    Page 7, “Word Alignment”
  5. As shown in Table 1, for KN smoothing, interpolation with the unigram distribution performs the best, while for WB smoothing, interestingly, interpolation with the uniform distribution performs the best.
    Page 8, “Word Alignment”
  6. In WB smoothing, p’(f) is the empirical unigram distribution.
    Page 8, “Word Alignment”
  7. We used the Moses toolkit (Koehn et al., 2007) to build MT systems using various alignments (for expected KN, we used the one interpolated with the unigram distribution, and for fractional WB, we used the one interpolated with the uniform distribution).
    Page 9, “Word Alignment”
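Items 2 and 3 above fit together numerically: if, following item 6 of the n-gram listing, the lower-order unigram estimate is made proportional to the probability of a nonzero count, then normalizing p(c(cat) = 1) = 0.86 and p(c(dog) = 1) = 0.9 reproduces the 0.489 and 0.511 of item 2. A quick check under that reading:

```python
# Lower-order unigram estimate taken proportional to p(c(w) > 0).
p_nonzero = {"cat": 0.86, "dog": 0.90}
total = sum(p_nonzero.values())                        # 1.76
p_lower = {w: p / total for w, p in p_nonzero.items()}
print(p_lower)   # {'cat': 0.4886..., 'dog': 0.5113...}, i.e. about 0.489 and 0.511
```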


word alignment

Appears in 7 sentences as: word alignment (6), word alignments (1)
In Kneser-Ney Smoothing on Expected Counts
  1. We rederive all the steps of KN smoothing to operate on count distributions instead of integral counts, and apply it to two tasks where KN smoothing was not applicable before: one in language model adaptation, and the other in word alignment.
    Page 1, “Abstract”
  2. One is language model domain adaptation, and the other is word alignment using the IBM models (Brown et al., 1993).
    Page 1, “Introduction”
  3. In this section, we show how to apply expected KN to the IBM word alignment models (Brown et al., 1993).
    Page 7, “Word Alignment”
  4. Of course, expected KN can be applied to other instances of EM besides word alignment.
    Page 7, “Word Alignment”
  5. The IBM models and related models define probability distributions p(a, f | e, θ), which model how likely a French sentence f is to be generated from an English sentence e with word alignment a.
    Page 7, “Word Alignment”
  6. and English-to-foreign directions and then used the grow-diag-final method to symmetrize them (Koehn et al., 2003), and evaluated the alignments using F-measure against gold word alignments.
    Page 8, “Word Alignment”
  7. Figure 2: Alignment F1 vs. D of fractional KN smoothing for word alignment.
    Page 8, “Word Alignment”
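Items 3 to 5 above describe plugging expected KN into the M step of EM for the IBM models. The sketch below is a bare IBM Model 1 loop with a pluggable M step: the default shown is plain MLE normalization, and that hook is where expected KN would go. Per-occurrence posteriors are collected rather than pre-summed, because expected KN needs the full count distributions and not just the expected counts. All names and the toy bitext are illustrative, not the authors' implementation.

```python
from collections import defaultdict

def ibm1_em(bitext, iterations=5, m_step=None):
    """Bare IBM Model 1 EM. bitext is a list of (english, french) word lists.

    m_step maps {(e, f): [per-occurrence posteriors]} to t(f | e); the
    default sums the posteriors and normalizes (MLE), which is the spot
    where a smoother such as expected KN would be substituted.
    """
    f_vocab = {f for _, fs in bitext for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)              # t(f | e), uniform initialization

    def mle(occurrences):
        expected = {ef: sum(ps) for ef, ps in occurrences.items()}
        totals = defaultdict(float)
        for (e, _), c in expected.items():
            totals[e] += c
        return {(e, f): c / totals[e] for (e, f), c in expected.items()}

    m_step = m_step or mle
    for _ in range(iterations):
        occurrences = defaultdict(list)           # E step: collect posteriors
        for es, fs in bitext:
            es = ["<null>"] + es
            for f in fs:
                z = sum(t[(e, f)] for e in es)    # normalizer over alignments of f
                for e in es:
                    occurrences[(e, f)].append(t[(e, f)] / z)
        t = defaultdict(lambda: uniform, m_step(occurrences))  # M step
    return t

# Tiny illustrative bitext (hypothetical data).
bitext = [(["the", "cat"], ["le", "chat"]), (["the", "dog"], ["le", "chien"])]
print(ibm1_em(bitext)[("cat", "chat")])
```

The `<null>` source word and uniform initialization follow the usual Model 1 conventions; everything specific to expected KN lives behind the `m_step` argument.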


in-domain

Appears in 6 sentences as: in-domain (8)
In Kneser-Ney Smoothing on Expected Counts
  1. Many methods (Lin et al., 1997; Gao et al., 2002; Klakow, 2000; Moore and Lewis, 2010; Axelrod et al., 2011) rank sentences in the general-domain data according to their similarity to the in-domain data and select only those with score higher than some threshold.
    Page 6, “Language model adaptation”
  2. However, sometimes it is hard to say whether a sentence is totally in-domain or out-of-domain; for example, quoted speech in a news report might be partly in-domain if the domain of interest is broadcast conversation.
    Page 6, “Language model adaptation”
  3. They first train two language models, p_in on a set of in-domain data, and p_out on a set of general-domain data.
    Page 6, “Language model adaptation”
  4. Moore and Lewis (2010) test their method by partitioning the in-domain data into training data and test data, both of which are disjoint from the general-domain data.
    Page 6, “Language model adaptation”
  5. They use the in-domain training data to select a subset of the general-domain data, build a language model on the selected subset, and evaluate its perplexity on the in-domain test data.
    Page 6, “Language model adaptation”
  6. We took 60k sentences (1.7M words) of web forum data as in-domain data, further subdividing it into 54k sentences (1.5M words) for training, 3k sentences (100k words) for testing, and 3k sentences (100k words) for future use.
    Page 6, “Language model adaptation”
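Items 1 and 3 to 5 above summarize the Moore and Lewis (2010) setup: score general-domain sentences with an in-domain model p_in and a general-domain model p_out, and keep the best-scoring ones. Below is a minimal sketch of that cross-entropy-difference scoring, assuming each model exposes a total log-probability; the interface and threshold are placeholders, not the paper's code.

```python
from typing import Callable, Iterable, List

Sentence = List[str]
LogProb = Callable[[Sentence], float]   # total log-probability of a sentence

def cross_entropy_diff(sentence: Sentence,
                       logprob_in: LogProb, logprob_out: LogProb) -> float:
    """Per-word cross-entropy under p_in minus that under p_out
    (Moore and Lewis, 2010); lower scores look more in-domain."""
    return (logprob_out(sentence) - logprob_in(sentence)) / len(sentence)

def select_subset(sentences: Iterable[Sentence], logprob_in: LogProb,
                  logprob_out: LogProb, threshold: float = 0.0) -> List[Sentence]:
    """Hard selection by thresholding the score, as in the baseline above."""
    return [s for s in sentences
            if cross_entropy_diff(s, logprob_in, logprob_out) < threshold]
```

Item 2 is the motivation for the paper's soft variant: instead of a hard threshold, each sentence gets a probability of being in-domain, which becomes a fractional weight for expected KN (see the weighting sketch under the language model topic above); how the score is mapped to that probability is not shown in these excerpts.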


bigrams

Appears in 4 sentences as: bigram (2), bigrams (3)
In Kneser-Ney Smoothing on Expected Counts
  1. For example, suppose that our data consists of the following bigrams, with their weights:
    Page 3, “Count distributions”
  2. That is, during the E step, we calculate the distribution of C(e, f) for each e and f, and during the M step, we train a language model on bigrams e f using expected KN smoothing (that is, with u = e and w = f).
    Page 7, “Word Alignment”
  3. (The latter case is equivalent to a backoff language model, where, since all bigrams are known, the lower-order model is never used.)
    Page 7, “Word Alignment”
  4. This is much less of a problem in KN smoothing, where p’ is estimated from bigram types rather than bigram tokens.
    Page 8, “Word Alignment”


overfitting

Appears in 4 sentences as: overfitting (4)
In Kneser-Ney Smoothing on Expected Counts
  1. In speech and language processing, smoothing is essential to reduce overfitting, and Kneser-Ney (KN) smoothing (Kneser and Ney, 1995; Chen and Goodman, 1999) has consistently proven to be among the best-performing and most widely used methods.
    Page 1, “Introduction”
  2. It also contains most of the model’s parameters and is where overfitting occurs most.
    Page 7, “Word Alignment”
  3. However, MLE is prone to overfitting, one symptom of which is the “garbage collection” phenomenon where a rare English word is wrongly aligned to many French words.
    Page 7, “Word Alignment”
  4. To reduce overfitting , we use expected KN smoothing during the M step.
    Page 7, “Word Alignment”


language pairs

Appears in 3 sentences as: language pairs (4)
In Kneser-Ney Smoothing on Expected Counts
  1. We carried out experiments on two language pairs: Arabic to English and Czech to English.
    Page 7, “Word Alignment”
  2. Variational Bayes is not consistent across different language pairs.
    Page 8, “Word Alignment”
  3. While fractional KN does beat the baseline for both language pairs, the value of D, which we optimized to maximize F1, is not consistent across language pairs: as shown in Figure 2, on Arabic-English, a smaller D is better, while for Czech-English, a larger D is better.
    Page 8, “Word Alignment”


model parameters

Appears in 3 sentences as: model parameters (1), models parameterize (1), model’s parameters (1)
In Kneser-Ney Smoothing on Expected Counts
  1. For example, in Expectation-Maximization (Dempster et al., 1977), the Expectation (E) step computes the posterior distribution over possible completions of the data, and the Maximization (M) step reestimates the model parameters as
    Page 1, “Introduction”
  2. Different models parameterize this probability distribution in different ways.
    Page 7, “Word Alignment”
  3. It also contains most of the model’s parameters and is where overfitting occurs most.
    Page 7, “Word Alignment”
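Item 1 above breaks off just before the reestimation formula. For multinomial parameters such as the conditional distributions used in this paper, the generic EM M-step reestimate (standard EM, not necessarily the paper's notation) is:

```latex
\[
\theta_{w \mid u} \;=\; \frac{\mathrm{E}\bigl[c(uw)\bigr]}{\sum_{w'} \mathrm{E}\bigl[c(uw')\bigr]},
\]
```

where the expectations are taken under the posterior distribution computed in the E step; replacing this relative-frequency estimate with expected KN is the substitution described under the overfitting and word alignment topics.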


probability distribution

Appears in 3 sentences as: probability distribution (2), probability distributions (1)
In Kneser-Ney Smoothing on Expected Counts
  1. In the E step of EM, we compute a probability distribution (according to the current model) over all possible completions of the observed data, and the expected counts of all types, which may be fractional.
    Page 3, “Count distributions”
  2. The IBM models and related models define probability distributions p(a, f | e, θ), which model how likely a French sentence f is to be generated from an English sentence e with word alignment a.
    Page 7, “Word Alignment”
  3. Different models parameterize this probability distribution in different ways.
    Page 7, “Word Alignment”
