Modeling of term-distance and term-occurrence information for improving n-gram language model performance
Chong, Tze Yuang and Banchs, Rafael E. and Chng, Eng Siong and Li, Haizhou

Article Structure

Abstract

In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling.

Introduction

Language models have been extensively studied in natural language processing.

Related Work

The distant bigram model (Huang et al. 1993, Simon et al.

Motivation of the Proposed Approach

The attributes of distance and co-occurrence are exploited and modeled differently in each language modeling approach.

Language Modeling with TD and TO

A language model estimates word probabilities given their history, i.e.

Perplexity Evaluation

A perplexity test was run on the BLLIP WSJ corpus (Charniak 2000) with the standard 5K vocabulary.

Conclusions

We have proposed a new approach to compute the n-gram probabilities, based on the TD and TO model components.

Topics

n-gram

Appears in 20 sentences as: n-gram (22)
In Modeling of term-distance and term-occurrence information for improving n-gram language model performance
  1. We attempt to extract this information from history-contexts of up to ten words in size, and found it complements well the n-gram model, which inherently suffers from data scarcity in learning long history-contexts.
    Page 1, “Abstract”
  2. The commonly used n-gram model (Bahl et al.
    Page 1, “Introduction”
  3. Although n-gram models are simple and effective, modeling long history-contexts leads to severe data scarcity problems.
    Page 1, “Introduction”
  4. 2007) disassembles the n-gram into (n-1) word-pairs, such that each pair is modeled by a distance-k bigram model, where 1 ≤ k ≤ n-1.
    Page 1, “Related Work”
  5. In the n-gram model, for example, these two attributes are jointly taken into account in the ordered word-sequence.
    Page 2, “Motivation of the Proposed Approach”
  6. Consequently, the n-gram model can only be effectively implemented within a short history-context (e.g., of size three or four).
    Page 2, “Motivation of the Proposed Approach”
  7. However, intermediate distances beyond the n-gram model limits can be very useful and should not be discarded.
    Page 2, “Motivation of the Proposed Approach”
  8. However, similarly to n-gram models, distance and co-occurrence information are implicitly tied within the word-pairs.
    Page 2, “Motivation of the Proposed Approach”
  9. In our proposed approach, we attempt to exploit the TD and TO attributes, separately, to incorporate distant context information into the n-gram, as a remedy to the data scarcity problem when learning the far context.
    Page 2, “Motivation of the Proposed Approach”
  10. The prior, which is usually implemented as a unigram model, can also be replaced with a higher order n-gram model as, for instance, the bigram model:
    Page 3, “Language Modeling with TD and TO”
  11. Replacing the unigram model with a higher order n-gram model is important to compensate for the damage incurred by the conditional independence assumption made earlier.
    Page 3, “Language Modeling with TD and TO”
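
Items 9 to 11 above describe layering the TD and TO components over an n-gram prior. The LaTeX fragment below is a minimal sketch of that combination, reconstructed only from these excerpts; the symbols (target word w_t, history-context h, distance operator Δ, window size K of up to ten words) and the choice to start the product at k = 2 when a bigram prior is used are assumptions, not the paper's exact equation.

```latex
% Hedged reconstruction: an n-gram (here bigram) prior scaled by per-distance
% TD and TO likelihoods, assuming conditional independence of history-words
% and a window of K words (K up to ten, per the abstract excerpt).
\[
P(w_t \mid h) \;\propto\;
\underbrace{P(w_t \mid w_{t-1})}_{\text{n-gram prior (e.g.\ bigram)}}
\prod_{k=2}^{K}
\underbrace{P(\Delta = k \mid w_{t-k} \in h,\, w_t)}_{\text{TD}}\,
\underbrace{P(w_{t-k} \in h \mid w_t)}_{\text{TO}}
\]
```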

bigram

Appears in 17 sentences as: bigram (16) bigrams (2) bigram’s (1)
In Modeling of term-distance and term-occurrence information for improving n-gram language model performance
  1. Evaluated on the WSJ corpus, bigram and trigram model perplexity were reduced by up to 23.5% and 14.0%, respectively.
    Page 1, “Abstract”
  2. Compared to the distant bigram , we show that word-pairs can be more effectively modeled in terms of both distance and occurrence.
    Page 1, “Abstract”
  3. The distant bigram model (Huang et al. 1993, Simon et al.
    Page 1, “Related Work”
  4. 2007) disassembles the n-gram into (n-1) word-pairs, such that each pair is modeled by a distance-k bigram model, where 1 ≤ k ≤ n-1.
    Page 1, “Related Work”
  5. Each distance-k bigram model predicts the target-word based on the occurrence of a history-word located k positions behind.
    Page 1, “Related Work”
  6. Zhou & Lua (1998) enhanced the effectiveness of the model by filtering out those word-pairs exhibiting low correlation, so that only the well associated distant bigrams are retained.
    Page 1, “Related Work”
  7. 1993, Rosenfeld 1996) that relies on the bigrams of arbitrary distance, i.e.
    Page 1, “Related Work”
  8. The prior, which is usually implemented as a unigram model, can also be replaced with a higher order n-gram model as, for instance, the bigram model:
    Page 3, “Language Modeling with TD and TO”
  9. As seen from the table, for lower order n-gram models, the complementary information captured by the TD and TO components reduced the perplexity by up to 23.5% and 14.0%, for bigram and trigram models, respectively.
    Page 4, “Perplexity Evaluation”
  10. …better modeling of word-pairs compared to the distant bigram model.
    Page 4, “Perplexity Evaluation”
  11. Here we compare the perplexity of both the distance-k bigram model and the distance-k TD model (for values of k ranging from two to ten), when combined with a standard bigram model.
    Page 4, “Perplexity Evaluation”
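
Items 4, 5 and 11 above refer to distance-k bigram models and to combining them with a standard bigram for a perplexity comparison. The Python sketch below illustrates the general idea only; the function names, the add-one smoothing, and the linear-interpolation combination are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of a distance-k bigram model: each model predicts the
# target word from the single history-word located exactly k positions
# behind it. All names, smoothing, and the combination scheme are
# illustrative assumptions, not the paper's.
from collections import Counter

def train_distance_k_bigram(sentences, k, vocab):
    """MLE estimate of P_k(target | history-word k positions back), add-one smoothed."""
    pair_counts, hist_counts = Counter(), Counter()
    for sent in sentences:
        for i in range(k, len(sent)):
            hist, tgt = sent[i - k], sent[i]
            pair_counts[(hist, tgt)] += 1
            hist_counts[hist] += 1
    V = len(vocab)

    def prob(tgt, hist):
        return (pair_counts[(hist, tgt)] + 1) / (hist_counts[hist] + V)

    return prob

def combine(p_bigram, p_dist_k, lam=0.8):
    """Linear interpolation of a standard bigram with one distance-k model."""
    return lambda tgt, hist_1, hist_k: (
        lam * p_bigram(tgt, hist_1) + (1.0 - lam) * p_dist_k(tgt, hist_k)
    )

# Toy usage on a one-sentence corpus.
sents = [["the", "cat", "sat", "on", "the", "mat"]]
vocab = {w for s in sents for w in s}
p1 = train_distance_k_bigram(sents, k=1, vocab=vocab)  # standard bigram
p2 = train_distance_k_bigram(sents, k=2, vocab=vocab)  # distance-2 bigram
p = combine(p1, p2)
print(p("sat", "cat", "the"))  # P(sat | cat at distance 1, the at distance 2)
```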

language model

Appears in 12 sentences as: language model (6) language modeling (4) Language models (1) language models (1)
In Modeling of term-distance and term-occurrence information for improving n-gram language model performance
  1. In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling.
    Page 1, “Abstract”
  2. Language models have been extensively studied in natural language processing.
    Page 1, “Introduction”
  3. The role of a language model is to measure how likely a (target) word is to occur based on some given evidence extracted from the history-context.
    Page 1, “Introduction”
  4. Latent-semantic language model approaches (Bellegarda 1998, Coccaro 2005) weight word counts with TFIDF to highlight their semantic importance towards the prediction.
    Page 1, “Related Work”
  5. Other approaches such as the class-based language model (Brown 1992, Kneser & Ney 1993)
    Page 1, “Related Work”
  6. The structured language model (Chelba & Jelinek 2000) determines the “heads” in the history-context by using a parsing tree.
    Page 2, “Related Work”
  7. Cache language models exploit temporal word frequencies in the history (Kuhn & Mori 1990, Clarkson & Robinson 1997).
    Page 2, “Related Work”
  8. The attributes of distance and co-occurrence are exploited and modeled differently in each language modeling approach.
    Page 2, “Motivation of the Proposed Approach”
  9. A language model estimates word probabilities given their history, i.e.
    Page 2, “Language Modeling with TD and TO”
  10. In order to define the TD and TO components for language modeling, we express the observation of an arbitrary history-word, w_{i-k}, at the kth position behind the target-word, as the joint of two events: i) the word w_{i-k} occurs within the history-context: w_{i-k} ∈ h, and ii) it occurs at distance k from the target-word: Δ(w_{i-k}) = k (Δ = k for brevity); i.e.
    Page 2, “Language Modeling with TD and TO”
  11. In fact, the TO model is closely related to the trigger language model (Rosenfeld 1996), as the prediction of the target-word (the triggered word) is based on the presence of a history-word (the trigger).
    Page 3, “Language Modeling with TD and TO”
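
Item 10 above defines the observation of a history-word as the joint of a co-occurrence event and a distance event. The fragment below sketches that decoupling in the excerpt's notation (history-word w_{i-k}, target-word w_t, history-context h, distance operator Δ); it is a reconstruction from the excerpt, not the paper's numbered equation.

```latex
% Chain-rule decoupling of one history-word observation into its
% TD (distance) and TO (occurrence) factors, as described in item 10.
\[
P(w_{i-k} \mid w_t)
  = P\big(w_{i-k} \in h,\ \Delta = k \mid w_t\big)
  = \underbrace{P\big(\Delta = k \mid w_{i-k} \in h,\ w_t\big)}_{\text{TD}}\;
    \underbrace{P\big(w_{i-k} \in h \mid w_t\big)}_{\text{TO}}
\]
```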

co-occurrence

Appears in 11 sentences as: co-occurrence (11)
In Modeling of term-distance and term-occurrence information for improving n-gram language model performance
  1. In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling.
    Page 1, “Abstract”
  2. the distance is described regardless of the actual frequency of the history-word, while the co-occurrence is described regardless of the actual position of the history-word.
    Page 1, “Introduction”
  3. The attributes of distance and co-occurrence are exploited and modeled differently in each language modeling approach.
    Page 2, “Motivation of the Proposed Approach”
  4. Both the conventional trigger model and the latent-semantic model capture the co-occurrence information while ignoring the distance information.
    Page 2, “Motivation of the Proposed Approach”
  5. On the other hand, distant-bigram models and distance-dependent trigger models make use of both distance and co-occurrence information up to window sizes of ten to twenty.
    Page 2, “Motivation of the Proposed Approach”
  6. However, similarly to n-gram models, distance and co-occurrence information are implicitly tied within the word-pairs.
    Page 2, “Motivation of the Proposed Approach”
  7. In Eq. 3, we have decoupled the observation of a word-pair into the events of distance and co-occurrence.
    Page 2, “Language Modeling with TD and TO”
  8. The TD likelihood for a distance k given the co-occurrence of the word-pair (w_{i-k}, w_t) can be estimated from counts as follows:
    Page 3, “Language Modeling with TD and TO”
  9. zero co-occurrence count C(w_{i-k} ∈ h, w_t) = 0, which results in a division by zero.
    Page 3, “Language Modeling with TD and TO”
  10. As a complement to the TD model, the TO model focuses on co-occurrence , and holds only count information.
    Page 3, “Language Modeling with TD and TO”
  11. As the distance information is captured by the TD model, the co-occurrence count captured by the TO model is independent from the given word-pair distance.
    Page 3, “Language Modeling with TD and TO”
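
Items 8 to 11 above describe estimating the TD likelihood from counts, the division-by-zero problem for zero co-occurrence counts, and the TO model holding only count information. The Python sketch below shows one way such count-based estimates could look; the data structures, the add-one smoothing used to avoid the zero-count division, and the presence-based TO estimate are assumptions, not the paper's estimators.

```python
# A sketch of count-based TD (term-distance) and TO (term-occurrence)
# estimates, reconstructed loosely from the excerpts. The smoothing scheme
# and the presence-based TO estimate are assumptions.
from collections import Counter

def collect_counts(sentences, max_k=10):
    """Gather pair-distance, pair, presence, and target counts for a window of max_k words."""
    dist_counts = Counter()  # ((hist, tgt), k) -> count of hist at distance k behind tgt
    pair_counts = Counter()  # (hist, tgt)      -> count over all distances 1..max_k
    pres_counts = Counter()  # (hist, tgt)      -> tgt instances whose window contains hist
    tgt_counts = Counter()   # tgt              -> count
    for sent in sentences:
        for i, tgt in enumerate(sent):
            tgt_counts[tgt] += 1
            for k in range(1, min(i, max_k) + 1):
                hist = sent[i - k]
                dist_counts[((hist, tgt), k)] += 1
                pair_counts[(hist, tgt)] += 1
            for hist in set(sent[max(0, i - max_k):i]):
                pres_counts[(hist, tgt)] += 1
    return dist_counts, pair_counts, pres_counts, tgt_counts

def td_likelihood(dist_counts, pair_counts, hist, tgt, k, max_k=10):
    """P(distance = k | hist co-occurs with tgt), add-one smoothed over max_k distances
    so a zero co-occurrence count never produces a division by zero."""
    return (dist_counts[((hist, tgt), k)] + 1) / (pair_counts[(hist, tgt)] + max_k)

def to_likelihood(pres_counts, tgt_counts, hist, tgt):
    """P(hist appears in the history window | tgt), estimated from presence counts."""
    return pres_counts[(hist, tgt)] / max(tgt_counts[tgt], 1)

# Toy usage.
sents = [["the", "cat", "sat", "on", "the", "mat"]]
d, pc, pres, t = collect_counts(sents)
print(td_likelihood(d, pc, "the", "sat", k=2), to_likelihood(pres, t, "the", "sat"))
```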

n-grams

Appears in 3 sentences as: n-grams (2) n-gram’s (1)
In Modeling of term-distance and term-occurrence information for improving n-gram language model performance
  1. There are also works on skipping irrelevant history-words in order to reveal more informative n-grams (Siu & Ostendorf 2000, Guthrie et al.
    Page 2, “Related Work”
  2. In this experiment, we assessed the effectiveness of the TD and TO components in reducing the n-gram’s perplexity.
    Page 4, “Perplexity Evaluation”
  3. Due to the inability of n-grams to model long history-contexts, the TD and TO components are still effective in helping to enhance the prediction.
    Page 4, “Perplexity Evaluation”
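
Items 2 and 3 above report perplexity comparisons. For reference, the Python sketch below shows a standard perplexity computation; the model interface (a callable returning P(word | history)) and the uniform toy model over a 5K vocabulary (mirroring the standard 5K-vocabulary setup mentioned for the WSJ corpus) are illustrative assumptions.

```python
# A minimal sketch of a standard perplexity computation over a test set.
# The model interface and the uniform toy model are illustrative assumptions.
import math

def perplexity(model_prob, sentences):
    """exp of the average negative log-probability of each predicted word."""
    log_sum, n_words = 0.0, 0
    for sent in sentences:
        for i in range(1, len(sent)):
            p = model_prob(sent[i], sent[:i])  # P(word | history)
            log_sum -= math.log(p)
            n_words += 1
    return math.exp(log_sum / n_words)

# A uniform model over a 5K vocabulary gives perplexity 5000 by construction.
uniform = lambda w, h: 1.0 / 5000
print(perplexity(uniform, [["<s>", "the", "cat", "sat", "on", "the", "mat"]]))
```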
