Index of papers in Proc. ACL 2011 that mention
  • language model
Blunsom, Phil and Cohn, Trevor
Background
Early work was firmly situated in the task-based setting of improving generalisation in language models.
Background
This model has been popular for language modelling and bilingual word alignment, and an implementation with improved inference called mkcls (Och, 1999) has become a standard part of statistical machine translation systems.
Background
(1992)’s HMM by incorporating a character language model, allowing the modelling of limited morphology.
Introduction
Our work brings together several strands of research including Bayesian nonparametric HMMs (Goldwater and Griffiths, 2007), Pitman-Yor language models (Teh, 2006b; Goldwater et al., 2006b), tagging constraints over word types (Brown et al., 1992) and the incorporation of morphological features (Clark, 2003).
The PYP-HMM
Prior work in unsupervised PoS induction has employed simple smoothing techniques, such as additive smoothing or Dirichlet priors (Goldwater and Griffiths, 2007; Johnson, 2007); however, this body of work has overlooked recent advances in smoothing methods used for language modelling (Teh, 2006b; Goldwater et al., 2006b).
The PYP-HMM
The PYP has been shown to generate distributions particularly well suited to modelling language (Teh, 2006a; Goldwater et al., 2006b), and has been shown to be a generalisation of Kneser-Ney smoothing, widely recognised as the best smoothing method for language modelling (Chen and Goodman, 1996).
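To make the Kneser-Ney connection concrete, the following is a minimal Python sketch of interpolated Kneser-Ney bigram estimation from raw counts; the function names, the single fixed discount, and the Counter-based interface are illustrative assumptions, not the formulation used in any of the papers indexed here.

    from collections import Counter, defaultdict

    def kneser_ney_bigram(bigram_counts, discount=0.75):
        # Interpolated Kneser-Ney bigram estimates from raw bigram counts.
        # bigram_counts: Counter over (previous_word, word) pairs.
        history_totals = Counter()      # c(v): total count of history v
        followers = defaultdict(set)    # N1+(v, .): distinct words following v
        histories = defaultdict(set)    # N1+(., w): distinct histories preceding w
        for (v, w), c in bigram_counts.items():
            history_totals[v] += c
            followers[v].add(w)
            histories[w].add(v)
        bigram_types = sum(len(s) for s in histories.values())

        def prob(w, v):
            p_cont = len(histories[w]) / bigram_types   # continuation probability
            c_v = history_totals[v]
            if c_v == 0:
                return p_cont                           # unseen history: back off fully
            discounted = max(bigram_counts[(v, w)] - discount, 0) / c_v
            backoff_mass = discount * len(followers[v]) / c_v
            return discounted + backoff_mass * p_cont

        return prob

Teh (2006) shows that the hierarchical PYP language model yields estimators roughly of this interpolated, discounted shape, which is the sense in which it generalises Kneser-Ney.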
The PYP-HMM
We consider two different settings for the base distribution Cj: 1) a simple uniform distribution over the vocabulary (denoted HMM for the experiments in section 4); and 2) a character-level language model (denoted HMM+LM).
language model is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Pauls, Adam and Klein, Dan
Abstract
N-gram language models are a major resource bottleneck in machine translation.
Abstract
In this paper, we present several language model implementations that are both highly compact and fast to query.
Abstract
We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
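As a rough illustration of why such caching pays off (decoders issue many duplicate n-gram queries), here is a minimal Python sketch of memoising language model lookups; the log_prob interface and the cache size are hypothetical and not the authors' implementation.

    from functools import lru_cache

    class CachedLM:
        # Wraps any language model exposing log_prob(ngram_tuple) -> float
        # (an assumed interface) and answers repeated queries from a cache.
        def __init__(self, base_lm, cache_size=1_000_000):
            self.base_lm = base_lm
            self._cached_query = lru_cache(maxsize=cache_size)(self._query)

        def _query(self, ngram):
            return self.base_lm.log_prob(ngram)

        def log_prob(self, ngram):
            # Decoding a single sentence can repeat the same n-gram query many
            # times; a cache hit avoids the slower lookup in the compact LM.
            return self._cached_query(tuple(ngram))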
Introduction
For modern statistical machine translation systems, language models must be both fast and compact.
Introduction
The largest language models (LMs) can contain as many as several hundred billion n-grams (Brants et al., 2007), so storage is a challenge.
Introduction
At the same time, decoding a single sentence can trigger hundreds of thousands of queries to the language model, so speed is also critical.
language model is mentioned in 62 sentences in this paper.
Topics mentioned in this paper:
Rush, Alexander M. and Collins, Michael
Background: Hypergraphs
The second step is to integrate an n-gram language model with this hypergraph.
Background: Hypergraphs
The labels for leaves will be words, and will be important in defining strings and language model scores for those strings.
Background: Hypergraphs
The focus of this paper will be to solve problems involving the integration of a k’th order language model with a hypergraph.
Introduction
Decoding with these models is challenging, largely because of the cost of integrating an n-gram language model into the search process.
Introduction
E.g., with a trigram language model they run in O(|E|w^6) time, where |E| is the number of edges in the hypergraph, and w is the number of distinct lexical items in the hypergraph.
Introduction
This step does not require language model integration, and hence is highly efficient.
language model is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Schütze, Hinrich
Abstract
Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation.
Abstract
This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events.
Abstract
We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.
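For reference, the "simple direct interpolation" baseline mentioned above can be sketched in a few lines of Python; p_word and p_class stand for the word-based and class-based model scores and lam is an assumed interpolation weight, so this is only an illustration of the baseline, not of the rare-event model the paper proposes.

    def interpolated_prob(word, history, p_word, p_class, lam=0.7):
        # Direct linear interpolation of a standard word n-gram model with a
        # class-based model; p_word and p_class are assumed callables that
        # return probabilities for (word, history).
        return lam * p_word(word, history) + (1.0 - lam) * p_class(word, history)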
Introduction
Language models, probability distributions over strings of words, are fundamental to many applications in natural language processing.
Introduction
The main challenge in language modeling is to estimate string probabilities accurately given that even very large training corpora cannot overcome the inherent sparseness of word sequence data.
Introduction
Plausible though this line of reasoning is, the language models most commonly used today do not incorporate class-based generalization.
Related work
However, the importance of rare events for clustering in language modeling has not been investigated before.
Related work
Our work is most similar to the lattice-based language models proposed by Dupont and Rosenfeld (1997).
language model is mentioned in 17 sentences in this paper.
Topics mentioned in this paper:
Schwartz, Lane and Callison-Burch, Chris and Schuler, William and Wu, Stephen
Abstract
Incremental syntactic language models score sentences in a similar left-to-right fashion, and are therefore a good mechanism for incorporating syntax into phrase-based translation.
Abstract
We give a formal definition of one such linear-time syntactic language model, detail its relation to phrase-based decoding, and integrate the model with the Moses phrase-based translation system.
Introduction
Early work in statistical machine translation viewed translation as a noisy channel process comprised of a translation model, which functioned to posit adequate translations of source language words, and a target language model, which guided the fluency of generated target language strings (Brown et al.,
Introduction
Drawing on earlier successes in speech recognition, research in statistical machine translation has effectively used n-gram word sequence models as language models.
Introduction
Modern phrase-based translation using large scale n-gram language models generally performs well in terms of lexical choice, but still often produces ungrammatical output.
Related Work
Instead, we incorporate syntax into the language model.
Related Work
Traditional approaches to language models in
Related Work
Chelba and Jelinek (1998) proposed that syntactic structure could be used as an alternative technique in language modeling.
language model is mentioned in 47 sentences in this paper.
Topics mentioned in this paper:
Tan, Ming and Zhou, Wenli and Zheng, Lei and Wang, Shaojun
Abstract
This paper presents an attempt at building a large scale distributed composite language model that simultaneously accounts for local word lexical information, midrange sentence syntactic structure, and long-span document semantic content under a directed Markov random field paradigm.
Abstract
The composite language model has been trained by performing a convergent N-best list approximate EM algorithm that has linear time complexity and a follow-up EM algorithm to improve word prediction power on corpora with up to a billion tokens and stored on a supercomputer.
Abstract
The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
Composite language model
The n-gram language model is essentially a word predictor: given its entire document history, it predicts the next word w_{k+1} based on the last n-1 words with probability p(w_{k+1} | w_{k-n+2}^{k}), where w_{k-n+2}^{k} = w_{k-n+2}, ..., w_{k}.
Composite language model
PLSA models together to build a composite generative language model under the directed MRF paradigm (Wang et al., 2005; Wang et al., 2006), the TAGGER and CONSTRUCTOR in SLM and SEMANTIZER in PLSA remain unchanged; however, the WORD-PREDICTORs in n-gram, m-SLM and PLSA are combined to form a stronger WORD-PREDICTOR that generates the next word, w_{k+1}, depending not only on the m leftmost exposed headwords h_{-m}^{-1} in the word-parse k-prefix but also on its n-gram history w_{k-n+2}^{k} and its semantic content g_{k+1}.
Composite language model
The parameter for WORD-PREDICTOR in the composite n-gram/m-SLM/PLSA language model becomes p(w_{k+1} | w_{k-n+2}^{k} h_{-m}^{-1} g_{k+1}).
Introduction
There is a dire need for developing novel approaches to language modeling.”
Introduction
(2006) integrated n-gram, structured language model (SLM) (Chelba and Jelinek, 2000) and probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) under the directed MRF framework (Wang et al., 2005) and studied the stochastic properties for the composite language model.
Introduction
They derived a generalized inside-outside algorithm to train the composite language model from a general EM (Dempster et al., 1977) by following Jelinek’s ingenious definition of the inside and outside probabilities for SLM (Jelinek, 2004), with time complexity that is 6th order in the sentence length.
language model is mentioned in 36 sentences in this paper.
Topics mentioned in this paper:
Clifton, Ann and Sarkar, Anoop
Experimental Results
We trained all of the Moses systems herein using the standard features: language model, reordering model, translation model, and word penalty; in addition to these, the factored experiments called for additional translation and generation features for the added factors as noted above.
Experimental Results
For the language models, we used SRILM 5-gram language models (Stolcke, 2002) for all factors.
Experimental Results
Segmented Finnish example for language model disambiguation: koske+ +va+ +A mietinto+ +A kasi+ +te+ +11a+ +a+ +n
Models 2.1 Baseline Models
Morphology generation models can use a variety of bilingual and contextual information to capture dependencies between morphemes, often more long-distance than what is possible using n-gram language models over morphemes in the segmented model.
Models 2.1 Baseline Models
is to take the abstract suffix tag sequence 31* and then map it into fully inflected word forms, and rank those outputs using a morphemic language model.
Models 2.1 Baseline Models
After CRF-based recovery of the suffix tag sequence, we use a bigram language model trained on a fully segmented version of the training data to recover the original vowels.
Related Work
They use a segmented phrase table and language model along with the word-based versions in the decoder and in tuning a Finnish target.
Related Work
In their work a segmented language model can score a translation, but cannot insert morphology that does not show source-side reflexes.
language model is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Yannakoudakis, Helen and Briscoe, Ted and Medlock, Ben
Approach
In order to estimate the error-rate, we build a trigram language model (LM) using ukWaC (ukWaC LM) (Ferraresi et al., 2008), a large corpus of English containing more than 2 billion tokens.
Approach
Next, we extend our language model with trigrams extracted from a subset of the texts contained in the
Approach
As the CLC contains texts produced by second language learners, we only extract frequently occurring trigrams from highly ranked scripts to avoid introducing erroneous ones to our language model.
Evaluation
Extending our language model with frequent trigrams extracted from the CLC improves Pearson’s and Spearman’s correlation by 0.006 and 0.015 respectively.
Evaluation
This suggests that there is room for improvement in the language models we developed to estimate the error-rate.
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Han, Bo and Baldwin, Timothy
Lexical normalisation
The confusion candidates are then filtered for each token occurrence of a given OOV word, based on their local context fit with a language model.
Lexical normalisation
In addition to generating the confusion set, we rank the candidates based on a trigram language model trained over 1.5GB of clean Twitter data, i.e.
Lexical normalisation
To train the language model, we used SRILM (Stolcke, 2002) with the -<unk> option.
Related work
Suppose the ill-formed text is T and its corresponding standard form is S; the approach aims to find arg max P(S|T) by computing arg max P(T|S)P(S), in which P(S) is usually a language model and P(T|S) is an error model.
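A minimal Python sketch of this noisy-channel selection, scored in log space; the error_logprob and lm_logprob callables and the left_context argument are assumed interfaces, not the candidate generation or ranking actually used in the paper.

    def best_normalisation(token, candidates, error_logprob, lm_logprob, left_context):
        # arg max over candidate standard forms S of P(T|S) * P(S),
        # with P(S) supplied by a language model over the local context.
        def score(cand):
            return error_logprob(token, cand) + lm_logprob(cand, left_context)
        return max(candidates, key=score)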
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Mylonakis, Markos and Sima'an, Khalil
Experiments
The final feature is the language model score for the target sentence, mounting up to the following model used at decoding time, with the feature weights A trained by Minimum Error Rate Training (MERT) (Och, 2003) on a development corpus.
Experiments
with a 3-gram language model smoothed with modified Kneser-Ney discounting (Chen and Goodman, 1998), trained on around 1M sentences per target language.
Experiments
Table 2: Additional experiments for English to Chinese translation examining (a) the impact of the linguistic annotations in the LTS system (lts), when compared with an instance not employing such annotations (lts-nolabels) and (b) decoding with a 4th-order language model (-lm4).
Joint Translation Model
While in a decoder this is somehow mitigated by the use of a language model, we believe that the weakness of straightforward applications of SCFGs to model reordering structure at the sentence level misses a chance to learn this crucial part of the translation process during grammar induction.
Joint Translation Model
As (Mylonakis and Sima’an, 2010) note, ‘plain’ SCFGs seem to perform worse than the grammars described next, mainly due to wrong long-range reordering decisions for which the language model can hardly help.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Knight, Kevin
Introduction
The variable e ranges over all possible English strings, and P(e) is a language model built from large amounts of English text that is unrelated to the foreign strings.
Introduction
A language model P(e) is typically used in SMT decoding (Koehn, 2009), but here P(e) actually plays a central role in training translation model parameters.
Machine Translation as a Decipherment Task
Whole-segment Language Models: When using word n-gram models of English for decipherment, we find that some of the foreign sentences are decoded into sequences (such as “THANK YOU TALKING ABOUT ‘2”) that are not good English.
Machine Translation as a Decipherment Task
For Bayesian MT decipherment, we set a high prior value on the language model (10^4) and use sparse priors for the IBM 3 model parameters t, n, d, p (0.01, 0.01, 0.01, 0.01).
Word Substitution Decipherment
We model P(e) using a statistical word n-gram English language model (LM).
Word Substitution Decipherment
For word substitution decipherment, we want to keep the language model probabilities fixed during training, and hence we set the prior on that model to be high (alpha = 10^4).
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Knight, Kevin
Abstract
Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries.
Conclusion
Unlike previous approaches, our method combines information from letter n-gram language models and word dictionaries and provides a robust decipherment model.
Decipherment
We build a statistical English language model (LM) for the plaintext source model P (p), which assigns a probability to any English letter sequence.
Decipherment
For the plaintext source model, we use probabilities from an English language model and for the channel model, we specify a uniform distribution (i.e., a plaintext letter can be substituted with any given cipher type with equal probability).
Decipherment
Combining letter n-gram language models with word dictionaries: Many existing probabilistic approaches use statistical letter n-gram language models of English to assign P (p) probabilities to plaintext hypotheses during decipherment.
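As an illustration of how a letter n-gram model assigns P(p) to a plaintext hypothesis, here is a small Python sketch using a bigram letter model; the letter_bigram_logprob lookup and the start symbol are assumptions, not the authors' model.

    def plaintext_logprob(plaintext, letter_bigram_logprob, start="^"):
        # Sum of log P(letter_i | letter_{i-1}) under a pretrained English
        # letter bigram model; higher scores mean more English-like plaintext.
        padded = start + plaintext
        return sum(letter_bigram_logprob(padded[i - 1], padded[i])
                   for i in range(1, len(padded)))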
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Habash, Nizar and Roth, Ryan
Experimental Settings
The models are built using the SRI Language Modeling Toolkit (Stolcke, 2002).
Problem Zones in Handwriting Recognition
Digits, on the other hand, are a hard class to language model since the vocabulary (of multi-digit numbers) is infinite.
Problem Zones in Handwriting Recognition
The HR system output does not contain any illegal non-words since its vocabulary is restricted by its training data and language models.
Related Work
Alternatively, morphological information can be used to construct supplemental lexicons or language models (Sari and Sellami, 2002; Magdy and Darwish, 2006).
Related Work
Their hypothesis that their large language model (16M words) may be responsible for why the word-based models outperformed stem-based (morphological) models is challenged by the fact that our language model data (220M words) is an order of magnitude larger, but we are still able to show benefit for using morphology.
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Subotin, Michael
Corpora and baselines
A 5-gram language model with modified interpolated Kneser-Ney smoothing (Chen and Goodman, 1998) was trained by the SRILM toolkit (Stolcke, 2002) on a set of 208 million running words of text obtained by combining the monolingual Czech text distributed by the 2010
Corpora and baselines
The baselines consisted of the language model, two phrase translation models, two lexical models, and a brevity penalty.
Decoding with target-side model dependencies
language model, as described in Chiang (2007).
Decoding with target-side model dependencies
In the case of the language model, these aspects include any of its target-side words that are part of still incomplete n-grams.
Hierarchical phrase-based translation
As shown by Chiang (2007), a weighted grammar of this form can be collected and scored by simple extensions of standard methods for phrase-based translation and efficiently combined with a language model in a CKY decoder to achieve large improvements over a state-of-the-art phrase-based system.
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Ott, Myle and Choi, Yejin and Cardie, Claire and Hancock, Jeffrey T.
Automated Approaches to Deceptive Opinion Spam Detection
Under (2), both the NB classifier used by Mihalcea and Strapparava (2009) and the language model classifier used by Zhou et al.
Automated Approaches to Deceptive Opinion Spam Detection
(2008), we use the SRI Language Modeling Toolkit (Stolcke, 2002) to estimate individual language models, Pr(x | y = c), for truthful and deceptive opinions.
Automated Approaches to Deceptive Opinion Spam Detection
We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zollmann, Andreas and Vogel, Stephan
Experiments
Apart from the language model, the lexical, phrasal, and (for the syntax grammar) label-conditioned features, and the rule, target word, and glue operation counters, Venugopal and Zollmann (2009) also provide both the hierarchical and syntax-augmented grammars with a rareness penalty 1/cnt(r), where cnt(r) is the occurrence count of rule r in the training corpus, allowing the system to learn penalization of low-frequency rules, as well as three indicator features firing if the rule has one, two unswapped, and two swapped nonterminal pairs, respectively. Further, to mitigate badly estimated PSCFG derivations based on low-frequency rules of the much sparser syntax model, the syntax grammar also contains the hierarchical grammar as a backbone (cf.
Experiments
Each system is trained separately to adapt the parameters to its specific properties (size of nonterminal set, grammar complexity, feature sparseness, reliance on the language model, etc.).
Related work
The supertags are also injected into the language model.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: