Faster and Smaller N-Gram Language Models
Pauls, Adam and Klein, Dan

Article Structure

Abstract

N-gram language models are a major resource bottleneck in machine translation.

Introduction

For modern statistical machine translation systems, language models must be both fast and compact.

Preliminaries

Our goal in this paper is to provide data structures that map n-gram keys to values, i.e. probabilities or counts.

Language Model Implementations

In the previous section, we reviewed well-known techniques in language model implementation.

Speeding up Decoding

In the previous section, we provided compact and efficient implementations of associative arrays that allow us to query a value for an arbitrary n-gram.

Experiments

5.1 Data

Conclusion

We have presented several language model implementations which are state-of-the-art in both size and speed.

Topics

language model

Appears in 62 sentences as: language model (51), language modeling (5), Language Models (1), language models (14)
In Faster and Smaller N-Gram Language Models
  1. N-gram language models are a major resource bottleneck in machine translation.
    Page 1, “Abstract”
  2. In this paper, we present several language model implementations that are both highly compact and fast to query.
    Page 1, “Abstract”
  3. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
    Page 1, “Abstract” (a cache sketch follows at the end of this list)
  4. For modern statistical machine translation systems, language models must be both fast and compact.
    Page 1, “Introduction”
  5. The largest language models (LMs) can contain as many as several hundred billion n-grams (Brants et al., 2007), so storage is a challenge.
    Page 1, “Introduction”
  6. At the same time, decoding a single sentence can trigger hundreds of thousands of queries to the language model, so speed is also critical.
    Page 1, “Introduction”
  7. To speed up our language models, we present two approaches.
    Page 1, “Introduction”
  8. Caching itself is certainly not new to language modeling, but because well-tuned LMs are essentially lookup tables to begin with, naive cache designs only speed up slower systems.
    Page 1, “Introduction”
  9. Our second speedup comes from a more fundamental change to the language modeling interface.
    Page 1, “Introduction”
  10. This setup substantially accelerates the scrolling queries issued by decoders, and also exploits language model state equivalence (Li and Khudanpur, 2008).
    Page 1, “Introduction”
  11. Overall, we are able to store the 4 billion n-grams of the Google Web1T (Brants and Franz, 2006) corpus, with associated counts, in 10 GB of memory, which is smaller than state-of-the-art lossy language model implementations (Guthrie and Hepple, 2010), and significantly smaller than the best published lossless implementation (Germann et al., 2009).
    Page 2, “Introduction”
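
The caching technique in item 3 is deliberately lightweight: item 8 notes that a well-tuned LM is already close to a single table lookup, so anything heavier than a plain overwrite-on-collision table costs almost as much as it saves. The Java sketch below (Java being the language the toolkit itself is implemented in, per the "LM" topic further down) illustrates one such direct-mapped cache under the assumption that each query has already been reduced to a single nonzero 64-bit key; the class name and layout are illustrative, not the toolkit's actual API.

```java
// Sketch of a simple n-gram score cache with no collision handling: a new
// entry simply overwrites whatever occupies its slot. Assumes queries are
// reduced to nonzero 64-bit keys (e.g. hashed word IDs); names are illustrative.
public final class NgramScoreCache {
    private final long[] keys;    // slot -> cached key (0 means empty)
    private final float[] values; // slot -> cached log-probability
    private final int mask;

    public NgramScoreCache(int capacity) { // capacity must be a power of two
        keys = new long[capacity];
        values = new float[capacity];
        mask = capacity - 1;
    }

    /** Returns the cached score, or Float.NaN on a miss. */
    public float get(long key) {
        int slot = (int) (key & mask);
        return keys[slot] == key ? values[slot] : Float.NaN;
    }

    /** Caches a score, overwriting any previous occupant of the slot. */
    public void put(long key, float score) {
        int slot = (int) (key & mask);
        keys[slot] = key;
        values[slot] = score;
    }
}
```

The payoff of this simplicity is reported under the "LM" topic below: a linear-probing cache gave only about 40% speedup, while the simpler overwrite cache gave about 300%.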


n-gram

Appears in 25 sentences as: n-gram (28)
In Faster and Smaller N-Gram Language Models
  1. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques.
    Page 1, “Abstract”
  2. [...] (Whittaker and Raj, 2001; Hsu and Glass, 2008), our methods are conceptually based on tabular trie encodings wherein each n-gram key is stored as the concatenation of one word (here, the last) and an offset encoding the remaining words (here, the context).
    Page 1, “Introduction”
  3. Our goal in this paper is to provide data structures that map n-gram keys to values, i.e. probabilities or counts.
    Page 2, “Preliminaries”
  4. However, because of the sheer number of keys and values needed for n-gram language modeling, generic implementations do not work efficiently “out of the box.” In this section, we will review existing techniques for encoding the keys and values of an n-gram language model, taking care to account for every bit of memory required by each implementation.
    Page 2, “Preliminaries”
  5. In the Web1T corpus, the most frequent n-gram occurs about 95 billion times.
    Page 2, “Preliminaries”
  6. The additional array is small, requiring only about 3MB, but we save 17 bits per n-gram, reducing value storage from around 16GB to about 9GB for Web1T.
    Page 2, “Preliminaries” (a rank-encoding sketch follows at the end of this list)
  7. Tries ensure that each n-gram prefix is represented only once, and are very efficient when n-grams share common prefixes.
    Page 2, “Preliminaries”
  8. In total, the most compact implementation in SRILM uses 33 bytes per n-gram of storage, which would require around 116 GB of memory to store Web1T.
    Page 2, “Preliminaries”
  9. This representation has the property that we can refer to each n-gram w_1^n by its last word w_n together with the offset of its context w_1^(n-1), i.e. by its context encoding.
    Page 3, “Preliminaries”
  10. For an n-gram language model, we can apply this implementation with a slight modification: we need n sorted arrays, one for each n-gram order.
    Page 3, “Language Model Implementations”
  11. Because our keys are sorted according to their context-encoded representation, we cannot straightforwardly answer queries about an n-gram w without first determining its context encoding.
    Page 3, “Language Model Implementations”


n-grams

Appears in 20 sentences as: n-grams (21)
In Faster and Smaller N-Gram Language Models
  1. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques.
    Page 1, “Abstract”
  2. The largest language models (LMs) can contain as many as several hundred billion n-grams (Brants et al., 2007), so storage is a challenge.
    Page 1, “Introduction”
  3. Overall, we are able to store the 4 billion n-grams of the Google Web1T (Brants and Franz, 2006) corpus, with associated counts, in 10 GB of memory, which is smaller than state-of-the-art lossy language model implementations (Guthrie and Hepple, 2010), and significantly smaller than the best published lossless implementation (Germann et al., 2009).
    Page 1, “Introduction”
  4. This corpus, which is on the large end of corpora typically employed in language modeling, is a collection of nearly 4 billion n-grams extracted from over a trillion tokens of English text, and has a vocabulary of about 13.5 million words.
    Page 2, “Preliminaries”
  5. Tries represent collections of n-grams using a tree.
    Page 2, “Preliminaries”
  6. Each node in the tree encodes a word, and paths in the tree correspond to n-grams in the collection.
    Page 2, “Preliminaries”
  7. Tries ensure that each n-gram prefix is represented only once, and are very efficient when n-grams share common prefixes.
    Page 2, “Preliminaries”
  8. Note that 32 bits is sufficient to index all n-grams in Web1T; for larger corpora, we can always increase the size of the offset.
    Page 3, “Preliminaries” (a lookup sketch follows at the end of this list)
  9. 2.4 Encoding n-grams
    Page 3, “Preliminaries”
  10. Note that typically, n-grams are encoded in tries in the reverse direction (first-rest instead of last-rest), which enables a more efficient computation of back-offs.
    Page 3, “Preliminaries”
  11. Lookup is linear in the length of the key and logarithmic in the number of n-grams.
    Page 4, “Language Model Implementations”
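
Items 5-11 describe the trie layout as one sorted array per n-gram order, with each key combining the last word and a 32-bit offset pointing at the n-gram's context in the next-lower order. The Java sketch below renders that lookup under an assumed key packing (context offset in the high bits, a 24-bit word ID in the low bits); the actual packing, and the back-off handling a real LM needs, are omitted for brevity.

```java
import java.util.Arrays;

// Sketch of lookup over context-encoded sorted arrays, one per n-gram order.
// Each key packs (offset of the context in the next-lower order, last word).
// The key layout and the missing back-off handling are simplifying assumptions.
public final class ContextEncodedLookupSketch {
    private static final int WORD_BITS = 24;   // enough for a 13.5M-word vocabulary
    private final long[][] keysByOrder;        // keysByOrder[k]: sorted keys of order k+1
    private final float[][] valuesByOrder;     // parallel values (probabilities or counts)

    public ContextEncodedLookupSketch(long[][] keysByOrder, float[][] valuesByOrder) {
        this.keysByOrder = keysByOrder;
        this.valuesByOrder = valuesByOrder;
    }

    private static long pack(long contextOffset, int word) {
        return (contextOffset << WORD_BITS) | word;
    }

    /** Returns the stored value for the n-gram, or Float.NaN if it is absent. */
    public float lookup(int[] ngram) {
        long contextOffset = 0;                // empty context at the unigram level
        for (int k = 0; k < ngram.length; k++) {
            int pos = Arrays.binarySearch(keysByOrder[k], pack(contextOffset, ngram[k]));
            if (pos < 0) return Float.NaN;     // a real LM would back off here
            if (k == ngram.length - 1) return valuesByOrder[k][pos];
            contextOffset = pos;               // offset of the prefix in its order
        }
        return Float.NaN;                      // unreachable for non-empty input
    }
}
```

This is why item 11 says lookup is linear in the length of the key (one search per order) and logarithmic in the number of n-grams (each search is a binary search).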


LM

Appears in 18 sentences as: LM (19)
In Faster and Smaller N-Gram Language Models
  1. Where classic LMs take word tuples and produce counts or probabilities, we propose an LM that takes a word-and-context encoding (so the context need not be re-looked up) and returns both the probability and also the context encoding for the suffix of the original query.
    Page 1, “Introduction” (an interface sketch follows at the end of this list)
  2. Our LM toolkit, which is implemented in Java and compatible with the standard ARPA file formats, is available on the web.
    Page 2, “Introduction”
  3. Tries or variants thereof are implemented in many LM toolkits, including SRILM (Stolcke, 2002), IRSTLM (Federico and Cettolo, 2007), CMU SLM (Whittaker and Raj, 2001), and MIT LM (Hsu and Glass, 2008).
    Page 2, “Preliminaries”
  4. Figure 3: Queries issued when scoring trigrams that are created when a state with LM context “the cat” combines with “fell down”.
    Page 7, “Speeding up Decoding”
  5. We found in our experiments that a cache using linear probing provided marginal performance increases of about 40%, largely because of cached back-off computation, while our simpler cache increases performance by about 300% even over our HASH LM implementation.
    Page 7, “Speeding up Decoding”
  6. As discussed in Section 3, our LM implementations can answer queries about context-encoded n-grams faster than explicitly encoded n-grams.
    Page 7, “Speeding up Decoding”
  7. Then, rather than represent the LM context in the decoder as an explicit list of words, we can simply store context offsets.
    Page 7, “Speeding up Decoding”
  8. In addition to speeding up language model queries, this approach also automatically supports an equivalence of LM states (Li and Khudanpur, 2008): in standard back-off schemes, the query for an n-gram (w_n, c(w_1^(n-1))) returns the same result as the query over the shorter context (w_n, c(w_2^(n-1))) whenever the longer context does not occur in the model.
    Page 7, “Speeding up Decoding”
  9. If a decoder maintains LM states using the context offsets returned by our language model, then the decoder will automatically exploit this equivalence and the size of the search space will be reduced.
    Page 7, “Speeding up Decoding”
  10. Table header: LM Type | bytes/key | bytes/value | bytes/n-gram | Total Size
    Page 8, “Speeding up Decoding”
  11. To test our LM implementations, we performed experiments with two different language models.
    Page 8, “Experiments”
