Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
Uszkoreit, Jakob and Brants, Thorsten

Article Structure

Abstract

In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes.

Introduction

A statistical language model assigns a probability P(w) to any given string of words w_1^m = w_1, ..., w_m.
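
For an n-gram model this probability is factored with the chain rule and each history is truncated to the preceding n - 1 words; the display below is the textbook form of this factorization, not copied from the paper:

    P(w_1^m) = \prod_{i=1}^{m} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}^{i-1})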

Class-Based Language Modeling

By partitioning all N_V words of the vocabulary into N_C sets, with C(w) mapping a word onto its equivalence class and C(w_1^i) mapping a sequence of words onto the sequence of their respective equivalence classes, a typical class-based n-gram model approximates P(w_i | w_{i-n+1}^{i-1}) with the two following component probabilities:
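
In the usual formulation of such a model (written out here from the general definition rather than quoted from the paper), these two components are a class n-gram probability and a per-class word emission probability, with c_i = C(w_i):

    P(w_i \mid w_{i-n+1}^{i-1}) \approx P(c_i \mid c_{i-n+1}^{i-1}) \cdot P(w_i \mid c_i)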

Exchange Clustering

One of the frequently used algorithms for automatically obtaining partitions of the vocabulary is the exchange algorithm (Kneser and Ney, 1993; Martin et al., 1998).

Predictive Exchange Clustering

Modifying the exchange algorithm in order to optimize the log likelihood of a predictive class bigram model leads to substantial performance improvements, similar to those previously reported for another type of one-sided class model in (Whittaker and Woodland, 2001).
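
The sketch below illustrates the exchange procedure for this one-sided (predictive) objective. It is a minimal, self-contained illustration, not the authors' implementation: the function name, the dictionary-based counts, and the round-robin initialization are invented for this example, and the objective used is the standard predictive-exchange criterion F(C) = sum_{v,c} N(v,c) log N(v,c) - sum_c N(c) log N(c), i.e. the clustering-dependent part of the predictive class bigram log likelihood.

    import math
    from collections import defaultdict

    def predictive_exchange(bigram_counts, num_clusters, iterations=20):
        """Greedy predictive exchange clustering (illustrative sketch).

        bigram_counts maps word pairs (v, w) to their counts N(v, w).
        The move gain below is the change in
            F(C) = sum_{v,c} N(v, c) log N(v, c) - sum_c N(c) log N(c),
        the clustering-dependent part of the predictive class bigram
        log likelihood (constant terms dropped).
        """
        vocab = sorted({w for _, w in bigram_counts} | {v for v, _ in bigram_counts})
        word_count = defaultdict(int)                          # N(w), w as predicted word
        predecessors = defaultdict(lambda: defaultdict(int))   # w -> {v: N(v, w)}
        for (v, w), n in bigram_counts.items():
            word_count[w] += n
            predecessors[w][v] += n

        cluster = {w: i % num_clusters for i, w in enumerate(vocab)}  # initial clustering
        cluster_count = defaultdict(int)                       # N(c)
        pred_cluster = defaultdict(int)                        # N(v, c)
        for w in vocab:
            cluster_count[cluster[w]] += word_count[w]
            for v, n in predecessors[w].items():
                pred_cluster[(v, cluster[w])] += n

        def plogp(x):
            return x * math.log(x) if x > 0 else 0.0

        def move_gain(w, c_from, c_to):
            # Change in -sum_c N(c) log N(c): only c_from and c_to are affected.
            gain = (plogp(cluster_count[c_from]) + plogp(cluster_count[c_to])
                    - plogp(cluster_count[c_from] - word_count[w])
                    - plogp(cluster_count[c_to] + word_count[w]))
            # Change in +sum_{v,c} N(v,c) log N(v,c): only bigrams ending in w change,
            # so it suffices to iterate over the predecessors of w.
            for v, n in predecessors[w].items():
                gain += (plogp(pred_cluster[(v, c_from)] - n)
                         + plogp(pred_cluster[(v, c_to)] + n)
                         - plogp(pred_cluster[(v, c_from)])
                         - plogp(pred_cluster[(v, c_to)]))
            return gain

        for _ in range(iterations):
            moved = False
            for w in vocab:
                c_old = cluster[w]
                best_c, best_gain = c_old, 0.0
                for c in range(num_clusters):
                    if c != c_old:
                        g = move_gain(w, c_old, c)
                        if g > best_gain:
                            best_c, best_gain = c, g
                if best_c != c_old:                    # apply the best exchange
                    cluster_count[c_old] -= word_count[w]
                    cluster_count[best_c] += word_count[w]
                    for v, n in predecessors[w].items():
                        pred_cluster[(v, c_old)] -= n
                        pred_cluster[(v, best_c)] += n
                    cluster[w] = best_c
                    moved = True
            if not moved:
                break
        return cluster

Calling predictive_exchange({('the', 'dog'): 3, ('a', 'dog'): 2, ('the', 'cat'): 4, ('a', 'cat'): 1}, num_clusters=2), for example, returns a dict mapping each vocabulary word to a cluster id.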

Distributed Clustering

When training on large corpora, even the modified exchange algorithm would still require several days if not weeks of CPU time for a sufficient number of iterations.
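
A compressed sketch of the kind of distributed variant this section describes follows: the vocabulary is split into shards, each worker proposes exchanges for its shard against the clustering produced by the previous iteration, and the proposals are then merged. All names here (vocab_shards, propose_move) are illustrative assumptions, and the helper that scores a move is passed in rather than defined, since the point is only the control flow; because moves are scored against a stale clustering, convergence is no longer guaranteed, as the paper notes.

    from concurrent.futures import ThreadPoolExecutor

    def distributed_exchange_iteration(vocab_shards, cluster, propose_move, workers=4):
        """One iteration of a distributed exchange step (illustrative sketch).

        propose_move(w, clustering) is assumed to return the best target
        cluster for word w under the exchange criterion.  Every worker
        scores its words against the same snapshot of the previous
        clustering, so the merged result may contain moves that are no
        longer individually optimal.
        """
        frozen = dict(cluster)  # snapshot of the clustering from the previous iteration

        def work(shard):
            return {w: propose_move(w, frozen) for w in shard}

        with ThreadPoolExecutor(max_workers=workers) as pool:
            proposals = list(pool.map(work, vocab_shards))

        for shard_moves in proposals:   # merge the per-shard proposals
            cluster.update(shard_moves)
        return cluster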

Experiments

We trained a number of predictive class-based language models on different Arabic and English corpora using clusterings trained on the complete data of the same corpus.

Conclusion

In this paper, we have introduced an efficient, distributed clustering algorithm for obtaining word classifications for predictive class-based language models with which we were able to use billions of tokens of training data to obtain classifications for millions of words in relatively short amounts of time.

Topics

language models

Appears in 15 sentences as: language model (6) language modeling (1) language models (9)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes.
    Page 1, “Abstract”
  2. The resulting clusterings are then used in training partially class-based language models.
    Page 1, “Abstract”
  3. A statistical language model assigns a probability P(w) to any given string of words w_1^m = w_1, ..., w_m.
    Page 1, “Introduction”
  4. In the case of n-gram language models this is done by factoring the probability:
    Page 1, “Introduction”
  5. do not differ in the last n - 1 words, one problem n-gram language models suffer from is that the training data is too sparse to reliably estimate all conditional probabilities P(w_i | w_{i-n+1}^{i-1}).
    Page 1, “Introduction”
  6. They have often been shown to improve the performance of speech recognition systems when combined with word-based language models (Martin et al., 1998; Whittaker and Woodland, 2001).
    Page 1, “Introduction”
  7. Class-based n-gram models have also been shown to benefit from their reduced number of parameters when scaling to higher-order n-grams (Goodman and Gao, 2000), and even despite the increasing size and decreasing sparsity of language model training corpora (Brants et al., 2007), class-based n-gram models might lead to improvements when increasing the n-gram order.
    Page 1, “Introduction”
  8. We then show that using partially class-based language models trained using the resulting classifications together with word-based language models in a state-of-the-art statistical machine translation system yields improvements despite the very large size of the word-based models used.
    Page 2, “Introduction”
  9. We trained a number of predictive class-based language models on different Arabic and English corpora using clusterings trained on the complete data of the same corpus.
    Page 5, “Experiments”
  10. We use each predictive class-based language model as well as a word-based model as separate feature functions in the log-linear combination in Eq.
    Page 6, “Experiments”
  11. The word-based language model used by the system in these experiments is a 5-gram model also trained on the en_target data set.
    Page 6, “Experiments”

machine translation

Appears in 14 sentences as: Machine Translation (2) machine translation (12)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
    Page 1, “Abstract”
  2. Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
    Page 1, “Introduction”
  3. However, in the area of statistical machine translation, especially in the context of large training corpora, fewer experiments with class-based n-gram models have been performed with mixed success (Raab, 2006).
    Page 1, “Introduction”
  4. We then show that using partially class-based language models trained using the resulting classifications together with word-based language models in a state-of-the-art statistical machine translation system yields improvements despite the very large size of the word-based models used.
    Page 2, “Introduction”
  5. We use the distributed training and application infrastructure described in (Brants et al., 2007) with modifications to allow the training of predictive class-based models and their application in the decoder of the machine translation system.
    Page 5, “Experiments”
  6. Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English.
    Page 6, “Experiments”
  7. A fourth data set, en_web, was used together with the other three data sets to train the large word-based model used in the second machine translation experiment.
    Page 6, “Experiments”
  8. 6.2 Machine Translation Results
    Page 6, “Experiments”
  9. Given a sentence f in the source language, the machine translation problem is to automatically produce a translation e in the target language.
    Page 6, “Experiments”
  10. In the subsequent experiments, we use a phrase-based statistical machine translation system based on the log-linear formulation of the problem described in (Och and Ney, 2002):
    Page 6, “Experiments”
  11. We then added these models as additional features to the log-linear model of the Arabic-English machine translation system.
    Page 6, “Experiments”

models trained

Appears in 14 sentences as: model trained (4) model training (1) models trained (9)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. In this paper we investigate the effects of applying such a technique to higher-order n-gram models trained on large corpora.
    Page 1, “Abstract”
  2. Class-based n-gram models have also been shown to benefit from their reduced number of parameters when scaling to higher-order n-grams (Goodman and Gao, 2000), and even despite the increasing size and decreasing sparsity of language model training corpora (Brants et al., 2007), class-based n-gram models might lead to improvements when increasing the n-gram order.
    Page 1, “Introduction”
  3. We then show that using partially class-based language models trained using the resulting classifications together with word-based language models in a state-of-the-art statistical machine translation system yields improvements despite the very large size of the word-based models used.
    Page 2, “Introduction”
  4. The quality of class-based models trained using the resulting clusterings did not differ noticeably from those trained using clusterings for which the full vocabulary was considered in each iteration.
    Page 5, “Distributed Clustering”
  5. Table 1: BLEU scores of the Arabic-English system using models trained on the English en_target data set
    Page 6, “Experiments”
  6. We used these models in addition to a word-based 6-gram model created by combining models trained on all four English data sets.
    Page 6, “Experiments”
  7. Table 2 shows the BLEU scores of the machine translation system using only this word-based model, the scores after adding the class-based model trained on the en_target data set and when using all three models.
    Page 6, “Experiments”
  8. Table 2: BLEU scores of the Arabic-English system using models trained on various data sets
    Page 7, “Experiments”
  9. The word-based Arabic 5-gram model we used was created by combining models trained on the Arabic side of the parallel training data (347 million tokens), the ar_gigaword and ar_webnews data sets, and additional Arabic web data.
    Page 7, “Experiments”
  10. Table 3: BLEU scores of the English-Arabic system using models trained on various data sets
    Page 7, “Experiments”
  11. As shown in Table 3, adding the predictive class-based model trained on the ar_webnews data set leads to small improvements in dev and nist06 scores but causes the test score to decrease.
    Page 7, “Experiments”

bigram

Appears in 12 sentences as: bigram (7) bigrams (5)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. Beginning with an initial clustering, the algorithm greedily maximizes the log likelihood of a two-sided class bigram or trigram model as described in Eq.
    Page 2, “Exchange Clustering”
  2. With N_pre^c and N_suc^c denoting the average number of clusters preceding and succeeding another cluster, B denoting the number of distinct bigrams in the training corpus, and I denoting the number of iterations, the worst case complexity of the algorithm is in:
    Page 3, “Exchange Clustering”
  3. When using large corpora with large numbers of bigrams, the number of required updates can increase towards the quadratic upper bound as N_pre^c and N_suc^c approach N_C.
    Page 3, “Exchange Clustering”
  4. Modifying the exchange algorithm in order to optimize the log likelihood of a predictive class bigram model leads to substantial performance improvements, similar to those previously reported for another type of one-sided class model in (Whittaker and Woodland, 2001).
    Page 3, “Predictive Exchange Clustering”
  5. We use a predictive class bigram model as given in Eq.
    Page 3, “Predictive Exchange Clustering”
  6. Then the following optimization criterion can be derived, with F(C) being the log likelihood function of the predictive class bigram model given a clustering C (a standard form of this criterion is written out after this list):
    Page 3, “Predictive Exchange Clustering”
  7. The first summation can be updated by iterating over all bigrams ending in the exchanged word.
    Page 3, “Predictive Exchange Clustering”
  8. amounts to the number of distinct bigrams in the training corpus B, times the number of clusters N_C.
    Page 4, “Predictive Exchange Clustering”
  9. The first difference is that in contrast to the exchange algorithm using a two-sided class-based bigram model in its optimization criterion, only two clusters are affected by moving a word.
    Page 4, “Predictive Exchange Clustering”
  10. With each word w in one of these sets, all words v preceding w in the corpus are stored with the respective bigram count N(v, w).
    Page 4, “Distributed Clustering”
  11. While the greedy non-distributed exchange algorithm is guaranteed to converge as each exchange increases the log likelihood of the assumed bigram model, this is not necessarily true for the distributed exchange algorithm.
    Page 4, “Distributed Clustering”
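
The criterion referred to in item 6 is, for the predictive class bigram model P(w_i | w_{i-1}) = P(C(w_i) | w_{i-1}) P(w_i | C(w_i)), usually written as follows (derived here from the general model, not quoted from the paper, with clustering-independent terms collected into a constant):

    F(C) = \sum_{v, c} N(v, c) \log N(v, c) - \sum_{c} N(c) \log N(c) + \text{const}

where N(v, c) is the number of times word v is followed by a word of class c in the training corpus, and N(c) is the number of occurrences of class c in the predicted position.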

BLEU

Appears in 10 sentences as: BLEU (11)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
    Page 1, “Abstract”
  2. Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English.
    Page 6, “Experiments”
  3. minimum error rate training (Och, 2003) with BLEU score as the objective function.
    Page 6, “Experiments”
  4. Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task.
    Page 6, “Experiments”
  5. Table 1: BLEU scores of the Arabic-English system using models trained on the English en_target data set
    Page 6, “Experiments”
  6. Adding the class-based models leads to small improvements in BLEU score, with the highest improvements for both dev and nist06 being statistically significant².
    Page 6, “Experiments”
  7. Table 2 shows the BLEU scores of the machine translation system using only this word-based model, the scores after adding the class-based model trained on the en_target data set and when using all three models.
    Page 6, “Experiments”
  8. Table 2: BLEU scores of the Arabic-English system using models trained on various data sets
    Page 7, “Experiments”
  9. Table 3: BLEU scores of the English-Arabic system using models trained on various data sets
    Page 7, “Experiments”
  10. The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks.
    Page 8, “Conclusion”

BLEU scores

Appears in 10 sentences as: BLEU score (4) BLEU scores (7)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
    Page 1, “Abstract”
  2. Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English.
    Page 6, “Experiments”
  3. minimum error rate training (Och, 2003) with BLEU score as the objective function.
    Page 6, “Experiments”
  4. Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task.
    Page 6, “Experiments”
  5. Table 1: BLEU scores of the Arabic-English system using models trained on the English en_target data set
    Page 6, “Experiments”
  6. Adding the class-based models leads to small improvements in BLEU score, with the highest improvements for both dev and nist06 being statistically significant².
    Page 6, “Experiments”
  7. Table 2 shows the BLEU scores of the machine translation system using only this word-based model, the scores after adding the class-based model trained on the en_target data set and when using all three models.
    Page 6, “Experiments”
  8. Table 2: BLEU scores of the Arabic-English system using models trained on various data sets
    Page 7, “Experiments”
  9. Table 3: BLEU scores of the English-Arabic system using models trained on various data sets
    Page 7, “Experiments”
  10. The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks.
    Page 8, “Conclusion”

translation system

Appears in 9 sentences as: translation system (9)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
    Page 1, “Abstract”
  2. We then show that using partially class-based language models trained using the resulting classifications together with word-based language models in a state-of-the-art statistical machine translation system yields improvements despite the very large size of the word-based models used.
    Page 2, “Introduction”
  3. We use the distributed training and application infrastructure described in (Brants et al., 2007) with modifications to allow the training of predictive class-based models and their application in the decoder of the machine translation system.
    Page 5, “Experiments”
  4. Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English.
    Page 6, “Experiments”
  5. In the subsequent experiments, we use a phrase-based statistical machine translation system based on the log-linear formulation of the problem described in (Och and Ney, 2002):
    Page 6, “Experiments”
  6. We then added these models as additional features to the log-linear model of the Arabic-English machine translation system.
    Page 6, “Experiments”
  7. Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task.
    Page 6, “Experiments”
  8. Table 2 shows the BLEU scores of the machine translation system using only this word-based model, the scores after adding the class-based model trained on the en_target data set and when using all three models.
    Page 6, “Experiments”
  9. The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks.
    Page 8, “Conclusion”

n-gram

Appears in 6 sentences as: n-gram (6)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. In the case of n-gram language models this is done by factoring the probability:
    Page 1, “Introduction”
  2. do not differ in the last n - 1 words, one problem n-gram language models suffer from is that the training data is too sparse to reliably estimate all conditional probabilities P(w_i | w_{i-n+1}^{i-1}).
    Page 1, “Introduction”
  3. However, in the area of statistical machine translation, especially in the context of large training corpora, fewer experiments with class-based n-gram models have been performed with mixed success (Raab, 2006).
    Page 1, “Introduction”
  4. Class-based n-gram models have also been shown to benefit from their reduced number of parameters when scaling to higher-order n-grams (Goodman and Gao, 2000), and even despite the increasing size and decreasing sparsity of language model training corpora (Brants et al., 2007), class-based n-gram models might lead to improvements when increasing the n-gram order.
    Page 1, “Introduction”
  5. When training class-based n-gram models on large corpora and large vocabularies, one of the problems arising is the scalability of the typical clustering algorithms used for obtaining the word classification.
    Page 1, “Introduction”
  6. Generalizing this leads to arbitrary order class-based n-gram models of the form:
    Page 2, “Class-Based Language Modeling”

clusterings

Appears in 5 sentences as: clusterings (6)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. The resulting clusterings are then used in training partially class-based language models.
    Page 1, “Abstract”
  2. The clusterings generated in each iteration as well as the initial clustering are stored as the set of words in each cluster, the total number of occurrences of each cluster in the training corpus, and the list of words preceding each cluster.
    Page 4, “Distributed Clustering”
  3. The quality of class-based models trained using the resulting clusterings did not differ noticeably from those trained using clusterings for which the full vocabulary was considered in each iteration.
    Page 5, “Distributed Clustering”
  4. We trained a number of predictive class-based language models on different Arabic and English corpora using clusterings trained on the complete data of the same corpus.
    Page 5, “Experiments”
  5. For the first experiment we trained predictive class-based 5-gram models using clusterings with 64, 128, 256 and 512 clusters¹ on the en_target data.
    Page 6, “Experiments”

statistical machine translation

Appears in 4 sentences as: statistical machine translation (4)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
    Page 1, “Abstract”
  2. However, in the area of statistical machine translation, especially in the context of large training corpora, fewer experiments with class-based n-gram models have been performed with mixed success (Raab, 2006).
    Page 1, “Introduction”
  3. We then show that using partially class-based language models trained using the resulting classifications together with word-based language models in a state-of-the-art statistical machine translation system yields improvements despite the very large size of the word-based models used.
    Page 2, “Introduction”
  4. In the subsequent experiments, we use a phrase-based statistical machine translation system based on the log-linear formulation of the problem described in (Och and Ney, 2002):
    Page 6, “Experiments”

translation task

Appears in 4 sentences as: translation task (2) translation tasks (2)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English.
    Page 6, “Experiments”
  2. Table 1 shows the BLEU scores reached by the translation system when combining the different class-based models with the word-based model in comparison to the BLEU scores by a system using only the word-based model on the Arabic-English translation task.
    Page 6, “Experiments”
  3. For our experiment with the English-Arabic translation task we trained two 5-gram predictive class-based models with 512 clusters on the Arabic ar_gigaword and ar_webnews data sets.
    Page 7, “Experiments”
  4. The experiments presented show that predictive class-based models trained using the obtained word classifications can improve the quality of a state-of-the-art machine translation system as indicated by the BLEU score in both translation tasks.
    Page 8, “Conclusion”

data sparsity

Appears in 3 sentences as: data sparsity (3)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes.
    Page 1, “Abstract”
  2. Class-based n-gram models are intended to help overcome this data sparsity problem by grouping words into equivalence classes rather than treating them as distinct words and thus reducing the number of parameters of the model (Brown et al., 1990).
    Page 1, “Introduction”
  3. We conclude that even despite the large amounts of data used to train the large word-based model in our second experiment, class-based language models are still an effective tool to ease the effects of data sparsity.
    Page 8, “Conclusion”

statistically significant

Appears in 3 sentences as: statistically significant (3)
In Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
  1. Adding the class-based models leads to small improvements in BLEU score, with the highest improvements for both dev and nist06 being statistically significant².
    Page 6, “Experiments”
  2. ²Differences of more than 0.0051 are statistically significant at the 0.05 level using bootstrap resampling (Noreen, 1989; Koehn, 2004)
    Page 6, “Experiments”
  3. When using predictive class-based models in combination with a word-based language model trained on very large amounts of data, the improvements continue to be statistically significant on the test and nist06 sets.
    Page 8, “Conclusion”
