Randomized Language Models via Perfect Hash Functions
Talbot, David and Brants, Thorsten

Article Structure

Abstract

We propose a succinct randomized language model which employs a perfect hash function to encode fingerprints of n-grams and their associated probabilities, backoff weights, or other parameters.

Introduction

Language models (LMs) are a core component in statistical machine translation, speech recognition, optical character recognition and many other areas.

Scaling Language Models

In statistical machine translation (SMT), LMs are used to score candidate translations in the target language.

Perfect Hash-based Language Models

Our randomized LM is based on the Bloomier filter (Chazelle et al., 2004).

Experimental Setup

4.1 Distributed LM Framework

Experiments

This section describes three sets of experiments: first, we encode the Web 1T 5-gram corpus as a randomized language model and compare the resulting size with other representations; then we measure false positive rates when requesting n-grams for a held-out data set; finally we compare translation quality when using conventional (lossless) language models and our randomized language model.

Conclusions

We have presented a novel randomized language model based on perfect hashing.

Topics

n-grams

Appears in 46 sentences as: n-grams (44) n-gram’s (4)
In Randomized Language Models via Perfect Hash Functions
  1. We propose a succinct randomized language model which employs a perfect hash function to encode fingerprints of n-grams and their associated probabilities, backoff weights, or other parameters.
    Page 1, “Abstract”
  2. Parameters that are stored in the model are retrieved without error; however, false positives may occur whereby n-grams not in the model are incorrectly ‘found’ when requested.
    Page 1, “Introduction”
  3. We encode fingerprints (random hashes) of n-grams together with their associated probabilities using a perfect hash function generated at random (Majewski et al., 1996).
    Page 1, “Introduction”
  4. In language modeling the universe under consideration is the set of all possible n-grams of length n for a given vocabulary.
    Page 2, “Scaling Language Models”
  5. Although n-grams observed in natural language corpora are not randomly distributed within this universe, no lossless data structure that we are aware of can circumvent this space-dependency on both the n-gram order and the vocabulary size.
    Page 2, “Scaling Language Models”
  6. However, if we are willing to accept that occasionally our model will be unable to distinguish between distinct n-grams, then it is possible to store
    Page 2, “Scaling Language Models”
  7. The space required in such a lossy encoding depends only on the range of values associated with the n-grams and the desired error rate, i.e.
    Page 2, “Scaling Language Models”
  8. the probability with which two distinct n-grams are assigned the same fingerprint.
    Page 2, “Scaling Language Models”
  9. We assume the n-grams and their associated parameter values have been precomputed and stored on disk.
    Page 2, “Perfect Hash-based Language Models”
  10. We then encode the model in an array such that each n-gram’s value can be retrieved.
    Page 2, “Perfect Hash-based Language Models”
  11. The model uses randomization to map n-grams to fingerprints and to generate a perfect hash function that associates n-grams with their values.
    Page 2, “Perfect Hash-based Language Models”
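
The fingerprinting described in items 3 and 8 above amounts to hashing each n-gram down to a short random bit string. The sketch below is illustrative only (the helper name, the use of SHA-1, and the parameter choices are assumptions, not the authors' implementation); all the scheme needs is a well-mixed random hash, so that two distinct n-grams receive the same b-bit fingerprint with probability about 2^-b.

    import hashlib

    def fingerprint(ngram, num_bits=12, seed=0):
        # Map an n-gram (a tuple of tokens) to a num_bits-bit fingerprint.
        # Hypothetical helper: any well-mixed hash will do; two distinct
        # n-grams collide with probability roughly 2 ** -num_bits.
        data = ("%d|%s" % (seed, " ".join(ngram))).encode("utf-8")
        digest = hashlib.sha1(data).digest()
        return int.from_bytes(digest[:8], "big") & ((1 << num_bits) - 1)

    print(fingerprint(("the", "perfect", "hash")))   # some integer in [0, 4096)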

n-gram

Appears in 36 sentences as: n-gram (38)
In Randomized Language Models via Perfect Hash Functions
  1. The scheme can represent any standard n-gram model and is easily combined with existing model reduction techniques such as entropy-pruning.
    Page 1, “Abstract”
  2. LMs are usually implemented as n-gram models parameterized for each distinct sequence of up to n words observed in the training corpus.
    Page 1, “Introduction”
  3. Recent work (Talbot and Osborne, 2007b) has demonstrated that randomized encodings can be used to represent n-gram counts for LMs with significant space-savings, circumventing information-theoretic constraints on lossless data structures by allowing errors with some small probability.
    Page 1, “Introduction”
  4. Lookup is very efficient: the values of 3 cells in a large array are combined with the fingerprint of an n-gram.
    Page 1, “Introduction”
  5. These are typically n-gram models that approximate the probability of a word sequence by assuming each token to be independent of all but n − 1 preceding tokens.
    Page 2, “Scaling Language Models”
  6. Although n-grams observed in natural language corpora are not randomly distributed within this universe, no lossless data structure that we are aware of can circumvent this space-dependency on both the n-gram order and the vocabulary size.
    Page 2, “Scaling Language Models”
  7. While the approach results in significant space savings, working with corpus statistics, rather than n-gram probabilities directly, is computationally less efficient (particularly in a distributed setting) and introduces a dependency on the smoothing scheme used.
    Page 2, “Scaling Language Models”
  8. This scheme can be used to encode any standard n-gram model which may first be processed using any conventional model reduction technique.
    Page 2, “Scaling Language Models”
  9. The model can erroneously return a value for an n-gram that was never actually stored, but will always return the correct value for an n-gram that is in the model.
    Page 2, “Perfect Hash-based Language Models”
  10. Each n-gram x_i is drawn from some set of possible n-grams U and its associated value from a corresponding set of possible values V.
    Page 3, “Perfect Hash-based Language Models”
  11. We do not store the n-grams and their probabilities directly but rather encode a fingerprint of each n-gram f(x_i) together with its associated value in such a way that the value can be retrieved when the model is queried with the n-gram x_i.
    Page 3, “Perfect Hash-based Language Models”
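
Items 4, 9, 10 and 11 above describe retrieval: the fingerprint of the queried n-gram is combined with the contents of three cells of a large array A, and the result is read as the rank of the stored value whenever it lies in the valid range. The sketch below is a minimal reading of that description, assuming XOR as the combination operator; the hash functions, helper names and parameters are illustrative assumptions, not the authors' exact implementation.

    import hashlib

    def _hash(ngram, seed):
        # Illustrative random hash over n-grams (tuples of tokens).
        data = ("%d|%s" % (seed, " ".join(ngram))).encode("utf-8")
        return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

    def lookup(ngram, A, num_values, fp_bits=10):
        # A          : list of fp_bits-bit cells filled at construction time
        # num_values : size |V| of the (quantized) value set
        # Returns the rank of the retrieved value, or None if the n-gram is
        # judged not to be in the model.
        f = _hash(ngram, seed=0) & ((1 << fp_bits) - 1)              # fingerprint
        h1, h2, h3 = (_hash(ngram, seed=s) % len(A) for s in (1, 2, 3))
        g = f ^ A[h1] ^ A[h2] ^ A[h3]        # combine 3 cells with the fingerprint
        # In this sketch the (unshown) construction step would fill A so that g
        # equals the stored value's rank for every n-gram in the model; for an
        # unseen n-gram g is roughly uniform over [0, 2**fp_bits), so it slips
        # through with probability about num_values / 2**fp_bits.
        return g if g < num_values else None

With fp_bits = 10 and num_values = 128, this gives the 12.5% false-positive rate worked through under "error rate" below.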

language model

Appears in 25 sentences as: language model (19) language modeling (6) Language models (1) language models (2)
In Randomized Language Models via Perfect Hash Functions
  1. We propose a succinct randomized language model which employs a perfect hash function to encode fingerprints of n-grams and their associated probabilities, backoff weights, or other parameters.
    Page 1, “Abstract”
  2. We demonstrate the space-savings of the scheme via machine translation experiments within a distributed language modeling framework.
    Page 1, “Abstract”
  3. Language models (LMs) are a core component in statistical machine translation, speech recognition, optical character recognition and many other areas.
    Page 1, “Introduction”
  4. With large monolingual corpora available in major languages, making use of all the available data is now a fundamental challenge in language modeling.
    Page 1, “Introduction”
  5. have considered alternative parameterizations such as class-based models (Brown et al., 1992), model reduction techniques such as entropy-based pruning (Stolcke, 1998), novel representation schemes such as suffix arrays (Emami et al., 2007), Golomb Coding (Church et al., 2007) and distributed language models that scale more readily (Brants et al., 2007).
    Page 1, “Introduction”
  6. In this paper we propose a novel randomized language model.
    Page 1, “Introduction”
  7. Our randomized language model is based on the Bloomier filter (Chazelle et al., 2004).
    Page 1, “Introduction”
  8. However, many of our findings should transfer to other applications of language modeling .
    Page 1, “Introduction”
  9. In language modeling the universe under consideration is the set of all possible n-grams of length n for a given vocabulary.
    Page 2, “Scaling Language Models”
  10. Recent work (Talbot and Osborne, 2007b) has used lossy encodings based on Bloom filters (Bloom, 1970) to represent logarithmically quantized corpus statistics for language modeling.
    Page 2, “Scaling Language Models”
  11. We deploy the randomized LM in a distributed framework which allows it to scale more easily by distributing it across multiple language model servers.
    Page 6, “Experimental Setup”

LM

Appears in 9 sentences as: LM (9)
In Randomized Language Models via Perfect Hash Functions
  1. Using higher-order models and larger amounts of training data can significantly improve performance in applications; however, the size of the resulting LM can become prohibitive.
    Page 1, “Introduction”
  2. Efficiency is paramount in applications such as machine translation which make huge numbers of LM requests per sentence.
    Page 1, “Introduction”
  3. In the next section we describe our randomized LM scheme based on perfect hash functions.
    Page 2, “Scaling Language Models”
  4. Our randomized LM is based on the Bloomier filter (Chazelle et al., 2004).
    Page 2, “Perfect Hash-based Language Models”
  5. 4.1 Distributed LM Framework
    Page 6, “Experimental Setup”
  6. We deploy the randomized LM in a distributed framework which allows it to scale more easily by distributing it across multiple language model servers.
    Page 6, “Experimental Setup”
  7. The proposed randomized LM can encode parameters estimated using any smoothing scheme (e.g.
    Page 6, “Experimental Setup”
  8. 4.2 LM Data Sets
    Page 6, “Experimental Setup”
  9. LM               size (GB)   dev MT04   test MT05   test MT06
     unpruned block      116       0.5304     0.5697      0.4663
     unpruned rand        69       0.5299     0.5692      0.4659
     pruned block         42       0.5294     0.5683      0.4665
     pruned rand          27       0.5289     0.5679      0.4656
    Page 8, “Experiments”

BLEU

Appears in 7 sentences as: BLEU (7)
In Randomized Language Models via Perfect Hash Functions
  1. Table 5 shows baseline translation BLEU scores for a lossless (non-randomized) language model with parameter values quantized into 5 to 8 bits.
    Page 7, “Experiments”
  2. Table 5: Baseline BLEU scores with lossless n-gram model and different quantization levels (bits).
    Page 8, “Experiments”
  3. Figure 3: BLEU scores on the MT05 data set.
    Page 8, “Experiments”
  4. Figure 3 shows BLEU scores for the MT05 evaluation set with parameter values quantized into 5 to 8 bits and 8 to 16 additional ‘error’ bits.
    Page 8, “Experiments”
  5. The BLEU score drops by between 0.0007 and 0.0018 while the
    Page 8, “Experiments”
  6. MT06 (NIST) BLEU
    Page 8, “Experiments”
  7. Figure 4: BLEU scores on MT06 data (“NIST” subset).
    Page 8, “Experiments”

error rate

Appears in 7 sentences as: error rate (5) error rates (3)
In Randomized Language Models via Perfect Hash Functions
  1. The space required in such a lossy encoding depends only on the range of values associated with the n-grams and the desired error rate, i.e.
    Page 2, “Scaling Language Models”
  2. There is a tradeoff between space and error rate since the larger B is, the lower the probability of a false positive.
    Page 3, “Perfect Hash-based Language Models”
  3. For example, if |V| is 128 then taking B = 1024 gives an error rate of e = 128/1024 = 0.125 with each entry in A using ⌈log2 1024⌉ = 10 bits.
    Page 5, “Perfect Hash-based Language Models”
  4. Querying each of these arrays for each n-gram requested would be inefficient and inflate the error rate since a false positive could occur on each individual array.
    Page 5, “Perfect Hash-based Language Models”
  5. Section 3 analyzed the theoretical error rate; here, we measure error rates in practice when retrieving n-grams for approx.
    Page 7, “Experiments”
  6. The error rates for bigrams are close to their expected values.
    Page 7, “Experiments”
  7. Experiments have shown that this randomized language model can be combined with entropy pruning to achieve further memory reductions; that error rates occurring in practice are much lower than those predicted by theoretical analysis due to the use of runtime sanity checks; and that the same translation quality as a lossless language model representation can be achieved when using 12 ‘error’ bits, resulting in approx.
    Page 8, “Conclusions”
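
The arithmetic in item 3 generalizes directly: with fingerprints and cells drawn from [0, B) and |V| possible parameter values, a requested n-gram that is not in the model is wrongly accepted with probability |V|/B, while each cell of A costs ⌈log2 B⌉ bits. A small sketch of the tradeoff (the second call simply extrapolates the same formula to a larger B):

    import math

    def cell_bits_and_error(num_values, B):
        # Space/error tradeoff from item 3: bits per cell of A and the
        # probability that a non-stored n-gram is wrongly accepted.
        return math.ceil(math.log2(B)), num_values / B

    print(cell_bits_and_error(128, 1024))   # (10, 0.125)   -- the example from item 3
    print(cell_bits_and_error(128, 4096))   # (12, 0.03125) -- two more bits per cell, 4x fewer errors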

machine translation

Appears in 7 sentences as: Machine Translation (2) machine translation (5)
In Randomized Language Models via Perfect Hash Functions
  1. We demonstrate the space-savings of the scheme via machine translation experiments within a distributed language modeling framework.
    Page 1, “Abstract”
  2. Language models (LMs) are a core component in statistical machine translation, speech recognition, optical character recognition and many other areas.
    Page 1, “Introduction”
  3. Efficiency is paramount in applications such as machine translation which make huge numbers of LM requests per sentence.
    Page 1, “Introduction”
  4. This paper focuses on machine translation.
    Page 1, “Introduction”
  5. In statistical machine translation (SMT), LMs are used to score candidate translations in the target language.
    Page 2, “Scaling Language Models”
  6. 4.3 Machine Translation
    Page 6, “Experimental Setup”
  7. 5.3 Machine Translation
    Page 7, “Experiments”

BLEU scores

Appears in 6 sentences as: BLEU score (1) BLEU scores (5)
In Randomized Language Models via Perfect Hash Functions
  1. Table 5 shows baseline translation BLEU scores for a lossless (non-randomized) language model with parameter values quantized into 5 to 8 bits.
    Page 7, “Experiments”
  2. Table 5: Baseline BLEU scores with lossless n-gram model and different quantization levels (bits).
    Page 8, “Experiments”
  3. Figure 3: BLEU scores on the MT05 data set.
    Page 8, “Experiments”
  4. Figure 3 shows BLEU scores for the MT05 evaluation set with parameter values quantized into 5 to 8 bits and 8 to 16 additional ‘error’ bits.
    Page 8, “Experiments”
  5. The BLEU score drops by between 0.0007 and 0.0018 while the
    Page 8, “Experiments”
  6. Figure 4: BLEU scores on MT06 data (“NIST” subset).
    Page 8, “Experiments”

NIST

Appears in 4 sentences as: NIST (3) “NIST” (1)
In Randomized Language Models via Perfect Hash Functions
  1. We run an improved version of our 2006 NIST MT Evaluation entry for the Arabic-English “Unlimited” data track. The language model is the same one as in the previous section.
    Page 7, “Experiments”
  2. We use MT04 data for system development, with MT05 data and MT06 (“NIST” subset) data for blind testing.
    Page 7, “Experiments”
  3. Overall, our baseline results compare favorably to those reported on the NIST MT06 web site.
    Page 8, “Experiments”
  4. MT06 (NIST) BLEU
    Page 8, “Experiments”
