A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
Tan, Ming and Zhou, Wenli and Zheng, Lei and Wang, Shaojun

Article Structure

Abstract

This paper presents an attempt at building a large scale distributed composite language model that simultaneously accounts for local word lexical information, midrange sentence syntactic structure, and long-span document semantic content under a directed Markov random field paradigm.

Introduction

The Markov chain (n-gram) source models, which predict each word on the basis of the previous n-1 words, have been the workhorses of state-of-the-art speech recognizers and machine translators that help to resolve acoustic or foreign language ambiguities by placing higher probability on more likely original underlying word strings.

Composite language model

The n-gram language model is essentially a word predictor that, given its entire document history, predicts the next word $w_{k+1}$ based on the last $n-1$ words with probability $p(w_{k+1} \mid w_{k-n+2}^{k})$, where $w_{k-n+2}^{k} = w_{k-n+2}, \cdots, w_k$.
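
Since this predictor recurs throughout the excerpts below, a minimal count-based sketch may help fix the idea. This is not the authors' implementation; the toy corpus, sentence padding, and function names are illustrative assumptions, and the unsmoothed relative-frequency estimate is exactly what the smoothing discussed under "recursive" below is meant to repair.

```python
from collections import defaultdict

def collect_counts(sentences, n=3):
    """Count n-grams and their (n-1)-word histories from tokenized sentences."""
    ngram, hist = defaultdict(int), defaultdict(int)
    for sent in sentences:
        padded = ["<s>"] * (n - 1) + list(sent) + ["</s>"]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            ngram[history + (padded[i],)] += 1
            hist[history] += 1
    return ngram, hist

def predict(ngram, hist, history, word, n=3):
    """Relative-frequency estimate of p(word | last n-1 words); returns 0.0 for unseen
    histories, which is why smoothing is needed in practice (see "recursive" below)."""
    h = tuple(history)[-(n - 1):]
    return ngram[h + (word,)] / hist[h] if hist[h] > 0 else 0.0

# Toy usage: p(w_{k+1} | w_{k-1} w_k) from two three-word sentences.
ngram, hist = collect_counts([["the", "cat", "sat"], ["the", "cat", "ran"]])
print(predict(ngram, hist, ["the", "cat"], "sat"))   # 0.5
```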

Training algorithm

Under the composite n-gram/m-SLM/PLSA language model, the likelihood of a training corpus D, a collection of documents, can be written as
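
The excerpt cuts off before the formula itself. Based on the hidden parse trees $T^l$ and semantic annotation strings $G^l$ referenced in the "parse trees" excerpts below, the likelihood plausibly takes the following general shape (a hedged sketch, not the paper's exact equation; $L_d$, the number of sentences in document $d$, is my notation):

```latex
\mathcal{L}(D, p) \;=\; \prod_{d \in D} \prod_{l=1}^{L_d}
  \sum_{T^l} \sum_{G^l} P_p\!\left(W^l, T^l, G^l \mid d\right)
```

where $W^l$ is the $l$th sentence of document $d$; the N-best list approximate EM described under "Training algorithm" below restricts the sum over $T^l$ to the $N$ most likely parse trees.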

Experimental results

We have trained our language models using three different training sets: one with 44 million tokens, another with 230 million tokens, and a third with 1.3 billion tokens.

Conclusion

As far as we know, this is the first work to build a complex large scale distributed language model with a principled approach that is more powerful than n-grams when both are trained on a very large corpus of up to a billion tokens.

Topics

language model

Appears in 36 sentences as: language model (30) language modeling.” (1) language models (7)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. This paper presents an attempt at building a large scale distributed composite language model that simultaneously accounts for local word lexical information, midrange sentence syntactic structure, and long-span document semantic content under a directed Markov random field paradigm.
    Page 1, “Abstract”
  2. The composite language model has been trained by performing a convergent N-best list approximate EM algorithm that has linear time complexity and a follow-up EM algorithm to improve word prediction power on corpora with up to a billion tokens and stored on a supercomputer.
    Page 1, “Abstract”
  3. The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
    Page 1, “Abstract”
  4. There is a dire need for developing novel approaches to language modeling.”
    Page 1, “Introduction”
  5. Wang et al. (2006) integrated n-gram, structured language model (SLM) (Chelba and Jelinek, 2000) and probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) under the directed MRF framework (Wang et al., 2005) and studied the stochastic properties for the composite language model.
    Page 1, “Introduction”
  6. They derived a generalized inside-outside algorithm to train the composite language model from a general EM (Dempster et al., 1977) by following Jelinek's ingenious definition of the inside and outside probabilities for SLM (Jelinek, 2004), with time complexity that is sixth order in sentence length.
    Page 1, “Introduction”
  7. In this paper, we study the same composite language model.
    Page 1, “Introduction”
  8. When we apply our language models to the task of re-ranking the N-best list from Hiero (Chiang, 2005; Chiang, 2007), a state-of-the-art parsing-based MT system, we achieve significantly better translation quality measured by the BLEU score and “readability”.
    Page 2, “Introduction”
  9. The n-gram language model is essentially a word predictor that, given its entire document history, predicts the next word $w_{k+1}$ based on the last $n-1$ words with probability $p(w_{k+1} \mid w_{k-n+2}^{k})$, where $w_{k-n+2}^{k} = w_{k-n+2}, \cdots, w_k$.
    Page 2, “Composite language model”
  10. PLSA models together to build a composite generative language model under the directed MRF paradigm (Wang et al., 2005; Wang et al., 2006), the TAGGER and CONSTRUCTOR in SLM and the SEMANTIZER in PLSA remain unchanged; however, the WORD-PREDICTORs in n-gram, m-SLM and PLSA are combined to form a stronger WORD-PREDICTOR that generates the next word, $w_{k+1}$, depending not only on the m leftmost exposed headwords $h_{-m}^{-1}$ in the word-parse k-prefix but also on its n-gram history $w_{k-n+2}^{k}$ and its semantic content $g_{k+1}$.
    Page 2, “Composite language model”
  11. The parameter for the WORD-PREDICTOR in the composite n-gram/m-SLM/PLSA language model becomes $p(w_{k+1} \mid w_{k-n+2}^{k}\, h_{-m}^{-1}\, g_{k+1})$ (a sketch of this conditional appears after this list).
    Page 2, “Composite language model”
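
To make the composite WORD-PREDICTOR of items 10-11 concrete, here is a minimal sketch of a predictor keyed on all three contexts at once. The class, its dictionaries, and the crude fallback argument are illustrative assumptions only; the paper instead smooths this conditional with a generalized recursive linear interpolation over a lattice of contexts (see the "recursive" excerpts below).

```python
from collections import defaultdict

class CompositePredictor:
    """Accumulates events (n-gram history, m exposed headwords, topic, next word),
    as produced by the E-step over N-best parses and PLSA topics (sketch only)."""
    def __init__(self):
        self.joint = defaultdict(float)   # (history, heads, topic, word) -> expected count
        self.ctx = defaultdict(float)     # (history, heads, topic)       -> expected count

    def add_expected_count(self, history, heads, topic, word, count):
        self.joint[(history, heads, topic, word)] += count
        self.ctx[(history, heads, topic)] += count

    def prob(self, history, heads, topic, word, fallback):
        """p(w_{k+1} | w_{k-n+2}^k, h_{-m}^{-1}, g_{k+1}); falls back to a simpler
        model (e.g. the plain n-gram) when the full context was never observed."""
        denom = self.ctx[(history, heads, topic)]
        if denom == 0.0:
            return fallback(history, word)
        return self.joint[(history, heads, topic, word)] / denom
```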


BLEU

Appears in 15 sentences as: BLEU (18)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
    Page 1, “Abstract”
  2. When we apply our language models to the task of re-ranking the N-best list from Hiero (Chiang, 2005; Chiang, 2007), a state-of-the-art parsing-based MT system, we achieve significantly better translation quality measured by the BLEU score and “readability”.
    Page 2, “Introduction”
  3. We substitute our language model and use MERT (Och, 2003) to optimize the BLEU score (Papineni et al., 2002).
    Page 8, “Experimental results”
  4. We partition the data into ten pieces; 9 pieces are used as training data to optimize the BLEU score (Papineni et al., 2002) by MERT (Och,
    Page 8, “Experimental results”
  5. 2003), and the remaining single piece is used to re-rank the 1000-best list and obtain the BLEU score.
    Page 9, “Experimental results”
  6. The 10 results from the folds can then be averaged (or otherwise combined) to produce a single estimate of the BLEU score (a sketch of this procedure appears after this list).
    Page 9, “Experimental results”
  7. Table 4 shows the BLEU scores through 10-fold cross-validation.
    Page 9, “Experimental results”
  8. The composite 5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA language model gives 1.57% BLEU score improvement over the baseline and 0.79% BLEU score improvement over the 5-gram.
    Page 9, “Experimental results”
  9. Chiang (2007) studied the performance of machine translation on Hiero: the BLEU score is 33.31% when the n-gram is used to re-rank the N-best list, but it becomes a significantly higher 37.09% when the n-gram is embedded directly into Hiero's one-pass decoder, because there is not much diversity in the N-best list.
    Page 9, “Experimental results”
  10. It is expected that putting our composite language model into a one-pass decoder of both phrase-based (Koehn et al., 2003) and parsing-based (Chiang, 2005; Chiang, 2007) MT systems should result in much improved BLEU scores.
    Page 9, “Experimental results”
  11. Table 4: 10-fold cross-validation BLEU score results for the task of re-ranking the N-best list.
    Page 9, “Experimental results”
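
A minimal sketch of the 10-fold procedure in items 4-6. The callables `mert_tune` and `rerank_and_bleu` are hypothetical placeholders for MERT weight optimization and 1000-best re-ranking plus BLEU scoring; they are not interfaces from the paper.

```python
def ten_fold_bleu(pieces, mert_tune, rerank_and_bleu):
    """pieces: the data split into 10 parts; each fold tunes feature weights by MERT
    on 9 parts and evaluates BLEU by re-ranking the 1000-best lists of the held-out part."""
    scores = []
    for i in range(10):
        held_out = pieces[i]
        train = [p for j, p in enumerate(pieces) if j != i]
        weights = mert_tune(train)                 # optimize BLEU on the 9 training pieces
        scores.append(rerank_and_bleu(held_out, weights))
    return sum(scores) / len(scores)               # average the 10 fold-level BLEU scores
```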


n-gram

Appears in 14 sentences as: n-gram (18)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. The Markov chain (n-gram) source models, which predict each word on the basis of the previous n-1 words, have been the workhorses of state-of-the-art speech recognizers and machine translators that help to resolve acoustic or foreign language ambiguities by placing higher probability on more likely original underlying word strings.
    Page 1, “Introduction”
  2. are efficient at encoding local word interactions, the n-gram model clearly ignores the rich syntactic and semantic structures that constrain natural languages.
    Page 1, “Introduction”
  3. Wang et al. (2006) integrated n-gram, structured language model (SLM) (Chelba and Jelinek, 2000) and probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) under the directed MRF framework (Wang et al., 2005) and studied the stochastic properties for the composite language model.
    Page 1, “Introduction”
  4. The n-gram language model is essentially a word predictor that, given its entire document history, predicts the next word $w_{k+1}$ based on the last $n-1$ words with probability $p(w_{k+1} \mid w_{k-n+2}^{k})$, where $w_{k-n+2}^{k} = w_{k-n+2}, \cdots, w_k$.
    Page 2, “Composite language model”
  5. The SLM (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000) uses syntactic information beyond the regular n-gram models to capture sentence-level long-range dependencies.
    Page 2, “Composite language model”
  6. When combining n-gram, m order SLM and
    Page 2, “Composite language model”
  7. PLSA models together to build a composite generative language model under the directed MRF paradigm (Wang et al., 2005; Wang et al., 2006), the TAGGER and CONSTRUCTOR in SLM and the SEMANTIZER in PLSA remain unchanged; however, the WORD-PREDICTORs in n-gram, m-SLM and PLSA are combined to form a stronger WORD-PREDICTOR that generates the next word, $w_{k+1}$, depending not only on the m leftmost exposed headwords $h_{-m}^{-1}$ in the word-parse k-prefix but also on its n-gram history $w_{k-n+2}^{k}$ and its semantic content $g_{k+1}$.
    Page 2, “Composite language model”
  8. Similar to SLM (Chelba and Jelinek, 2000), we adopt an N-best list approximate EM re-estimation with modular modifications to seamlessly incorporate the effect of n-gram and PLSA components.
    Page 3, “Training algorithm”
  9. This implies that when computing the language model probability of a sentence in a client, all servers need to be contacted for each n-gram request.
    Page 5, “Training algorithm”
  10. Brants et al. (2007) follow a standard MapReduce paradigm (Dean and Ghemawat, 2004): the corpus is first divided and loaded into a number of clients, and n-gram counts are collected at each client; the n-gram counts are then mapped and stored in a number of servers, resulting in exactly one server being contacted per n-gram when computing the language model probability of a sentence (see the sketch after this list).
    Page 5, “Training algorithm”
  11. The m-SLM performs competitively with its counterpart n-gram (n=m+1) on a large scale corpus.
    Page 7, “Experimental results”
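
Items 9 and 10 above contrast two distributed designs: one where every server must be contacted per n-gram request, and the MapReduce-style sharding where counts are partitioned so exactly one server owns each n-gram. A minimal sketch of the latter routing follows; the hash key and the client/server classes are illustrative assumptions, not the systems cited.

```python
class CountServer:
    """Holds the shard of n-gram counts that hashes to this server."""
    def __init__(self):
        self.counts = {}

    def get(self, ngram):
        return self.counts.get(ngram, 0)

def owner(ngram, num_servers):
    """Route an n-gram to exactly one server, e.g. by hashing its first word
    (sharding by the last word or the whole tuple would work the same way)."""
    return hash(ngram[0]) % num_servers

def lm_request(servers, ngram):
    """A client asks only the owning server for this n-gram's count, so scoring a
    sentence touches one server per n-gram instead of broadcasting to all of them."""
    return servers[owner(ngram, len(servers))].get(ngram)
```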


parse trees

Appears in 14 sentences as: parse tree (7) parse trees (8) parse trees, (1)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. Figure 1: A composite n-gram/m-SLM/PLSA language model where the hidden information is the parse tree $T$ and semantic content $g$.
    Page 3, “Composite language model”
  2. the $l$th sentence $W^l$ with its parse tree structure $T^l$
    Page 3, “Training algorithm”
  3. of tag $t$ predicted by word $w$ and the tags of the $m$ most recent exposed headwords in parse tree $T^l$ of the $l$th sentence $W^l$ in document $d$, and finally $\#(a\, h_{-m}^{-1}, W^l, T^l, d)$ is the count of constructor move $a$ conditioning on the $m$ exposed headwords $h_{-m}^{-1}$ in parse tree $T^l$ of the $l$th sentence $W^l$ in document $d$.
    Page 3, “Training algorithm”
  4. For a given sentence, its parse tree and semantic content are hidden, and the number of parse trees grows faster than exponentially with sentence length; Wang et al.
    Page 3, “Training algorithm”
  5. where $T_N^l$ is a set of $N$ parse trees for sentence $W^l$ in document $d$, $|\cdot|$ denotes the cardinality, and $\mathcal{T}_N$ is a collection of $T_N^l$ for sentences over the entire corpus $D$.
    Page 3, “Training algorithm”
  6. N-best list search: For each sentence $W$ in document $d$, find the N-best parse trees,
    Page 3, “Training algorithm”
  7. and denote $\mathcal{T}_N$ as the collection of N-best list parse trees for sentences over the entire corpus $D$ under model parameter $p$.
    Page 3, “Training algorithm”
  8. N-best list search strategy: To extract the N-best parse trees, we adopt a synchronous, multi-stack search strategy that is similar to the one in (Chelba and Jelinek, 2000), which involves a set of stacks storing partial parses of the most likely ones for a given prefix $W_k$, and the less probable parses are purged.
    Page 4, “Training algorithm”
  9. EM update: Once we have the N-best parse trees for each sentence in document d and the N-best topics for document d, we derive the EM algorithm to estimate model parameters (a sketch of the overall loop appears after this list).
    Page 4, “Training algorithm”
  10. that can be computed in a backward manner; here $\bar{W}^l_{k+1}$ is the subsequence after the $(k{+}1)$th word in sentence $W^l$, $\bar{T}^l_{k+1}$ is the incremental parse structure after the parse structure $T^l_{k+1}$ of the word $(k{+}1)$-prefix $W^l_{k+1}$ that generates parse tree $T^l$, and $\bar{G}^l_{k+1}$ is the semantic subsequence in $G^l$ relevant to $\bar{W}^l_{k+1}$. Then, the expected count of $w_{-n+1}^{-1} w\, h_{-m}^{-1}\, g$ for the WORD-PREDICTOR on sentence $W^l$ in document $d$ is
    Page 4, “Training algorithm”
  11. document $d$ is the real count that appears in parse tree $T^l$ of sentence $W^l$ in document $d$ times the conditional distribution $P_p(T^l \mid W^l, d) = P_p(T^l, W^l \mid d) / \sum_{T^l} P_p(T^l, W^l \mid d)$, respectively.
    Page 5, “Training algorithm”
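
The items above outline the N-best list approximate EM loop: search for the N best parse trees per sentence, compute expected counts weighted by $P_p(T^l \mid W^l, d)$, and re-estimate. A compressed sketch of that control flow follows; `nbest_parse`, `nbest_topics`, `expected_counts`, and `reestimate` are placeholder callables standing in for the multi-stack search, the PLSA topic selection, the forward-backward count collection, and the smoothed M-step, and the document/parse attributes are assumptions.

```python
def nbest_approx_em(corpus, params, nbest_topics, nbest_parse,
                    expected_counts, reestimate, iterations=5, N=10):
    """Alternate N-best search with expected-count collection and re-estimation.
    Each sentence contributes counts from its N best parses, each weighted by its
    posterior renormalized over the N-best list (sketch only; documents are assumed
    to expose `.sentences`, and parses to carry a positive `.score`)."""
    for _ in range(iterations):
        counts = {}
        for doc in corpus:                               # E-step (the Map side)
            topics = nbest_topics(doc, params, N)        # N-best topics for the document
            for sent in doc.sentences:
                parses = nbest_parse(sent, params, N)    # synchronous multi-stack search
                total = sum(p.score for p in parses)
                for parse in parses:
                    weight = parse.score / total         # P(T | W, d) over the N-best list
                    for key, c in expected_counts(sent, parse, topics, params).items():
                        counts[key] = counts.get(key, 0.0) + weight * c
        params = reestimate(counts)                      # M-step with smoothed estimates
    return params
```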


BLEU score

Appears in 13 sentences as: BLEU score (12) BLEU scores (4)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
    Page 1, “Abstract”
  2. When we apply our language models to the task of re-ranking the N-best list from Hiero (Chiang, 2005; Chiang, 2007), a state-of-the-art parsing-based MT system, we achieve significantly better translation quality measured by the BLEU score and “readability”.
    Page 2, “Introduction”
  3. We substitute our language model and use MERT (Och, 2003) to optimize the BLEU score (Papineni et al., 2002).
    Page 8, “Experimental results”
  4. We partition the data into ten pieces; 9 pieces are used as training data to optimize the BLEU score (Papineni et al., 2002) by MERT (Och,
    Page 8, “Experimental results”
  5. 2003), and the remaining single piece is used to re-rank the 1000-best list and obtain the BLEU score.
    Page 9, “Experimental results”
  6. The 10 results from the folds can then be averaged (or otherwise combined) to produce a single estimate of the BLEU score.
    Page 9, “Experimental results”
  7. Table 4 shows the BLEU scores through 10-fold cross-validation.
    Page 9, “Experimental results”
  8. The composite 5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA language model gives 1.57% BLEU score improvement over the baseline and 0.79% BLEU score improvement over the 5-gram.
    Page 9, “Experimental results”
  9. Chiang (2007) studied the performance of machine translation on Hiero: the BLEU score is 33.31% when the n-gram is used to re-rank the N-best list, but it becomes a significantly higher 37.09% when the n-gram is embedded directly into Hiero's one-pass decoder, because there is not much diversity in the N-best list.
    Page 9, “Experimental results”
  10. It is expected that putting our composite language model into a one-pass decoder of both phrase-based (Koehn et al., 2003) and parsing-based (Chiang, 2005; Chiang, 2007) MT systems should result in much improved BLEU scores.
    Page 9, “Experimental results”
  11. Table 4: 10-fold cross-validation BLEU score results for the task of re-ranking the N-best list.
    Page 9, “Experimental results”


model parameters

Appears in 8 sentences as: model parameter (2) model parameters (6)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. The objective of maximum likelihood estimation is to maximize the likelihood $\mathcal{L}(D, p)$ with respect to model parameters.
    Page 3, “Training algorithm”
  2. and denote $\mathcal{T}_N$ as the collection of N-best list parse trees for sentences over the entire corpus $D$ under model parameter $p$.
    Page 3, “Training algorithm”
  3. mate model parameters.
    Page 4, “Training algorithm”
  4. each model parameter over sentence $W^l$ in document $d$ in the training corpus $D$. For the WORD-PREDICTOR and the SEMANTIZER, the number of possible semantic annotation sequences is exponential, so we use forward-backward recursive formulas that are similar to those in hidden Markov models to compute the expected counts.
    Page 4, “Training algorithm”
  5. Figure 2: The distributed architecture is essentially a MapReduce paradigm: clients store partitioned data and perform the E-step, i.e., compute expected counts (this is Map); servers store parameters (counts) for the M-step, where counts of $w_{-n+1}^{-1} w\, h_{-m}^{-1}\, g$ are hashed by word $w_{-1}$ (or $h_{-1}$) and its topic $g$ to distribute these model parameters across the servers as evenly as possible (this is Reduce; see the sketch after this list).
    Page 6, “Training algorithm”
  6. For the 44 and 230 million tokens corpora, all sentences are automatically parsed and used to initialize model parameters, while for the 1.3 billion tokens corpus, we parse the sentences from a portion of the corpus that
    Page 6, “Experimental results”
  7. contains 230 million tokens, and then use them to initialize model parameters.
    Page 7, “Experimental results”
  8. Nevertheless, experimental results show that this approach is effective in providing initial values of model parameters.
    Page 7, “Experimental results”
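
Item 5 above (Figure 2) describes the training-time split: clients hold partitions of the corpus and compute expected counts (Map), and servers own disjoint shards of the parameters, keyed by hashing the last context word $w_{-1}$ (or headword $h_{-1}$) and the topic $g$, where counts are aggregated for the M-step (Reduce). A minimal sketch of that keying and aggregation follows; the dictionaries and function names are illustrative assumptions, not the paper's implementation.

```python
def shard_of(event_key, num_servers):
    """event_key ~ (ngram_history, exposed_heads, topic, word); hash the last context
    word (or headword) together with the topic so each parameter lives on exactly one
    server, mirroring the keying described in Figure 2 (sketch only)."""
    history, heads, topic, _word = event_key
    anchor = history[-1] if history else heads[-1]
    return hash((anchor, topic)) % num_servers

def reduce_counts(client_counts, num_servers):
    """Map: each client emits expected counts for its partition of the corpus.
    Reduce: counts for the same key are summed on the server that owns that key,
    which then re-estimates its shard of the parameters in the M-step."""
    servers = [{} for _ in range(num_servers)]
    for counts in client_counts:                 # one {event_key: expected_count} per client
        for key, value in counts.items():
            shard = servers[shard_of(key, num_servers)]
            shard[key] = shard.get(key, 0.0) + value
    return servers
```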


n-grams

Appears in 7 sentences as: n-grams (7)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
    Page 1, “Abstract”
  2. We conduct comprehensive experiments on corpora with 44 million tokens, 230 million tokens, and 1.3 billion tokens and compare perplexity results with n-grams (n=3,4,5 respectively) on these three corpora; we obtain drastic perplexity reductions.
    Page 1, “Introduction”
  3. Where $\#(g, W^l, G^l, d)$ is the count of semantic content $g$ in the semantic annotation string $G^l$ of the $l$th sentence $W^l$ in document $d$, $\#(w_{-n+1}^{-1} w\, h_{-m}^{-1}\, g, W^l, T^l, G^l, d)$ is the count of the n-gram, its $m$ most recent exposed headwords and semantic content $g$ in parse $T^l$ and semantic annotation string $G^l$ of the $l$th sentence $W^l$ in document $d$, $\#(t\, w\, h_{-m}^{-1}.tag, W^l, T^l, d)$ is the count
    Page 3, “Training algorithm”
  4. The topic of large scale distributed language models is relatively new, and existing works are restricted to n-grams only (Brants et al., 2007; Emami et al., 2007; Zhang et al., 2006).
    Page 5, “Training algorithm”
  5. The composite n-gram/m-SLM/PLSA model gives significant perplexity reductions over baseline n-grams (n = 3, 4, 5) and m-SLMs (m = 2, 3, 4).
    Page 8, “Experimental results”
  6. Also, in the same study in (Charniak, 2003), they found that the outputs produced using the n-grams received higher scores from BLEU; ours did not.
    Page 9, “Experimental results”
  7. As far as we know, this is the first work to build a complex large scale distributed language model with a principled approach that is more powerful than n-grams when both are trained on a very large corpus of up to a billion tokens.
    Page 9, “Conclusion”


machine translation

Appears in 5 sentences as: machine translation (4) machine translators (1)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
    Page 1, “Abstract”
  2. The Markov chain (n-gram) source models, which predict each word on the basis of the previous n-1 words, have been the workhorses of state-of-the-art speech recognizers and machine translators that help to resolve acoustic or foreign language ambiguities by placing higher probability on more likely original underlying word strings.
    Page 1, “Introduction”
  3. As the machine translation (MT) working groups stated on page 3 of their final report (Lavie et al., 2006), “These approaches have resulted in small improvements in MT quality, but have not fundamentally solved the problem.
    Page 1, “Introduction”
  4. We have applied our composite 5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA language model, trained on the 1.3 billion word corpus, to the task of re-ranking the N-best list in statistical machine translation.
    Page 8, “Experimental results”
  5. Chiang (2007) studied the performance of machine translation on Hiero: the BLEU score is 33.31% when the n-gram is used to re-rank the N-best list, but it becomes a significantly higher 37.09% when the n-gram is embedded directly into Hiero's one-pass decoder, because there is not much diversity in the N-best list.
    Page 9, “Experimental results”


POS tags

Appears in 4 sentences as: POS tag (2) POS tags (3)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. The SLM is based on statistical parsing techniques that allow syntactic analysis of sentences; it assigns a probability $p(W, T)$ to every sentence $W$ and every possible binary parse $T$. The terminals of $T$ are the words of $W$ with POS tags, and the nodes of $T$ are annotated with phrase headwords and nonterminal labels.
    Page 2, “Composite language model”
  2. A word-parse $k$-prefix has a set of exposed heads $h_{-m}, \cdots, h_{-1}$, with each head being a pair (headword, nonterminal label), or in the case of a root-only tree, (word, POS tag).
    Page 2, “Composite language model”
  3. An m-th order SLM (m-SLM) has three operators to generate a sentence: the WORD-PREDICTOR predicts the next word $w_{k+1}$ based on the $m$ leftmost exposed headwords $h_{-m}^{-1} = h_{-m}, \cdots, h_{-1}$ in the word-parse $k$-prefix with probability $p(w_{k+1} \mid h_{-m}^{-1})$, and then passes control to the TAGGER; the TAGGER predicts the POS tag $t_{k+1}$ of the next word $w_{k+1}$ based on the next word $w_{k+1}$ and the POS tags of the $m$ leftmost exposed headwords $h_{-m}^{-1}$ in the word-parse $k$-prefix with probability $p(t_{k+1} \mid w_{k+1}, h_{-m}.tag, \cdots, h_{-1}.tag)$; the CONSTRUCTOR builds the partial parse $T_k$ from $T_{k-1}$, $w_k$, and $t_k$ in a series of moves ending with NULL, where a parse move $a$ is made with probability $p(a \mid h_{-m}^{-1})$; $a \in \mathcal{A} = \{$(unary, NTlabel), (adjoin-left, NTlabel), (adjoin-right, NTlabel), null$\}$ (a sketch of this generative loop appears after this list).
    Page 2, “Composite language model”
  4. The TAGGER and CONSTRUCTOR are conditional probabilistic models of the type $p(u \mid z_1, \cdots, z_n)$ where $u, z_1, \cdots, z_n$ belong to a mixed set of words, POS tags, NT tags, CONSTRUCTOR actions ($u$ only), and $z_1, \cdots, z_n$ form a linear Markov chain.
    Page 5, “Training algorithm”
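
A compressed sketch of the m-SLM generative hand-off described in item 3. The three distributions (`p_word`, `p_tag`, `p_move`) are placeholder callables returning lists of (outcome, probability) pairs, with CONSTRUCTOR outcomes as (move, label) tuples; exposed heads are kept as a flat list rather than a real binary parse, so this only illustrates the operator sequence, not the paper's exact bookkeeping.

```python
import random

def sample(dist):
    """dist: list of (outcome, probability) pairs; draw one outcome."""
    outcomes, probs = zip(*dist)
    return random.choices(outcomes, weights=probs, k=1)[0]

def generate_sentence(p_word, p_tag, p_move, m=2, max_len=30):
    """WORD-PREDICTOR -> TAGGER -> CONSTRUCTOR loop of an m-SLM (illustrative only)."""
    heads = [("<s>", "SB")] * m                   # sentinel exposed heads
    words = []
    while len(words) < max_len:
        w = sample(p_word(heads[-m:]))            # predict next word from m exposed heads
        if w == "</s>":
            break
        t = sample(p_tag(w, [h[1] for h in heads[-m:]]))
        words.append((w, t))
        heads.append((w, t))                      # new word enters as a root-only head
        for _ in range(len(words) + m):           # bounded CONSTRUCTOR moves until NULL
            move, label = sample(p_move(heads[-m:]))
            if move == "null":
                break
            if move.startswith("adjoin") and len(heads) > m + 1:
                right, left = heads.pop(), heads.pop()
                keep = left if move == "adjoin-left" else right
                heads.append((keep[0], label))    # the merged constituent's headword
            elif move == "unary":
                heads[-1] = (heads[-1][0], label)
    return words
```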


probabilistic model

Appears in 4 sentences as: probabilistic model (2) probabilistic models (1) probability model (1)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. A PLSA model (Hofmann, 2001) is a generative probabilistic model of word-document co-occurrences using the bag-of-words assumption described as follows: (i) choose a document $d$ with probability $p(d)$; (ii) SEMANTIZER: select a semantic class $g$ with probability $p(g \mid d)$; and (iii) WORD-PREDICTOR: pick a word $w$ with probability $p(w \mid g)$.
    Page 2, “Composite language model”
  2. Since only the pair $(d, w)$ is observed, the joint probability model is a mixture of log-linear models with the expression $p(d, w) = p(d) \sum_g p(w \mid g)\, p(g \mid d)$ (see the sketch after this list). Typically, the number of documents and the vocabulary size are much larger than the number of latent semantic class variables.
    Page 2, “Composite language model”
  3. The TAGGER and CONSTRUCTOR are conditional probabilistic models of the type $p(u \mid z_1, \cdots, z_n)$ where $u, z_1, \cdots, z_n$ belong to a mixed set of words, POS tags, NT tags, CONSTRUCTOR actions ($u$ only), and $z_1, \cdots, z_n$ form a linear Markov chain.
    Page 5, “Training algorithm”
  4. The WORD-PREDICTOR is, however, a conditional probabilistic model $p(w \mid w_{-n+1}^{-1}\, h_{-m}^{-1}\, g)$ where there are three kinds of context, $w_{-n+1}^{-1}$, $h_{-m}^{-1}$ and $g$, each of which forms a linear Markov chain.
    Page 5, “Training algorithm”
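
A minimal sketch of the PLSA quantities in items 1-2 above: the generative sampling steps and the mixture $p(d, w) = p(d) \sum_g p(w \mid g)\, p(g \mid d)$. The dictionary-based parameterization and function names are illustrative assumptions only.

```python
import random

def draw(dist):
    """dist: mapping outcome -> probability; sample one outcome."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def plsa_joint(p_d, p_g_given_d, p_w_given_g, d, w):
    """p(d, w) = p(d) * sum_g p(w | g) * p(g | d), the bag-of-words PLSA mixture."""
    return p_d[d] * sum(p_w_given_g[g].get(w, 0.0) * pg for g, pg in p_g_given_d[d].items())

def plsa_generate(p_d, p_g_given_d, p_w_given_g):
    """One draw from the generative story: choose d, the SEMANTIZER picks a topic g,
    then the WORD-PREDICTOR picks a word w from that topic."""
    d = draw(p_d)
    g = draw(p_g_given_d[d])
    w = draw(p_w_given_g[g])
    return d, w
```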


recursive

Appears in 4 sentences as: recursive (4)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. each model parameter over sentence $W^l$ in document $d$ in the training corpus $D$. For the WORD-PREDICTOR and the SEMANTIZER, the number of possible semantic annotation sequences is exponential, so we use forward-backward recursive formulas that are similar to those in hidden Markov models to compute the expected counts.
    Page 4, “Training algorithm”
  2. In the M-step, the recursive linear interpolation scheme (Jelinek and Mercer, 1981) is used to obtain a smooth probability estimate for each model component: WORD-PREDICTOR, TAGGER, and CONSTRUCTOR.
    Page 5, “Training algorithm”
  3. The recursive mixing scheme is the standard one among relative frequency estimates of different orders $k = 0, \cdots, n$, as explained in (Chelba and Jelinek, 2000).
    Page 5, “Training algorithm”
  4. We generalize Jelinek and Mercer's original recursive mixing scheme (Jelinek and Mercer, 1981) and form a lattice to handle the situation where the context is a mixture of Markov chains (see the sketch after this list).
    Page 5, “Training algorithm”
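
Items 2-4 above describe the smoothing: relative-frequency estimates of orders $k = 0, \cdots, n$ are mixed recursively, and the generalization forms a lattice when the context mixes several Markov chains. The sketch below shows only the single-chain recursion; the fixed interpolation weight and the counts layout are illustrative assumptions (in (Jelinek and Mercer, 1981) the weights are estimated rather than fixed).

```python
def jm_prob(counts, context, word, lam=0.5, vocab_size=100000):
    """Recursive Jelinek-Mercer mixing of relative-frequency estimates:
        p_k(w | context) = lam * f(w | context) + (1 - lam) * p_{k-1}(w | shorter context),
    bottoming out in a uniform distribution over the vocabulary. `counts` maps tuples
    (context..., word) and (context...,) to raw counts, with the empty tuple () holding
    the total token count (sketch only; the paper generalizes this recursion to a
    lattice of mixed contexts)."""
    context = tuple(context)
    if not context:
        total = counts.get((), 0)
        rel = counts.get((word,), 0) / total if total > 0 else 0.0
        return lam * rel + (1 - lam) / vocab_size
    denom = counts.get(context, 0)
    rel = counts.get(context + (word,), 0) / denom if denom > 0 else 0.0
    return lam * rel + (1 - lam) * jm_prob(counts, context[1:], word, lam, vocab_size)
```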


latent semantic

Appears in 3 sentences as: latent semantic (3)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. Wang et al. (2006) integrated n-gram, structured language model (SLM) (Chelba and Jelinek, 2000) and probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) under the directed MRF framework (Wang et al., 2005) and studied the stochastic properties for the composite language model.
    Page 1, “Introduction”
  2. Since only the pair $(d, w)$ is observed, the joint probability model is a mixture of log-linear models with the expression $p(d, w) = p(d) \sum_g p(w \mid g)\, p(g \mid d)$. Typically, the number of documents and the vocabulary size are much larger than the number of latent semantic class variables.
    Page 2, “Composite language model”
  3. Thus, latent semantic class variables function as bottleneck variables to constrain word occurrences in
    Page 2, “Composite language model”


model trained

Appears in 3 sentences as: model trained (3)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. For the composite 5-gram/PLSA model trained on the 1.3 billion tokens corpus, 400 cores have to be used to keep the top 5 most likely topics.
    Page 7, “Experimental results”
  2. gram/PLSA model trained on the 44M tokens corpus, the computation time increases drastically with less than 5% perplexity improvement.
    Page 7, “Experimental results”
  3. Its decoder uses a trigram language model trained with modified Kneser-Ney smoothing (Kneser and Ney, 1995) on a 200 million tokens corpus.
    Page 8, “Experimental results”


time complexity

Appears in 3 sentences as: time complexity (3)
In A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
  1. The composite language model has been trained by performing a convergent N-best list approximate EM algorithm that has linear time complexity and a follow-up EM algorithm to improve word prediction power on corpora with up to a billion tokens and stored on a supercomputer.
    Page 1, “Abstract”
  2. They derived a generalized inside-outside algorithm to train the composite language model from a general EM (Dempster et al., 1977) by following Jelinek's ingenious definition of the inside and outside probabilities for SLM (Jelinek, 2004), with time complexity that is sixth order in sentence length.
    Page 1, “Introduction”
  3. Instead of using the 6th order generalized inside-outside algorithm proposed in (Wang et al., 2006), we train this composite model by a convergent N-best list approximate EM algorithm that has linear time complexity and a follow-up EM algorithm to improve word prediction power.
    Page 1, “Introduction”
