Introduction | Further, decoding with nonlocal (or state-dependent) features, such as a language model, is also a problem. |
Introduction | Actually, even for the (log-)linear model, efficient decoding with the language model is not trivial (Chiang, 2007).
Introduction | For the nonlocal features such as the language model, Chiang (2007) proposed a cube-pruning method for efficient decoding.
Abstract | Our model is a nested hierarchical Pitman-Yor language model, where a Pitman-Yor spelling model is embedded in the word model.
Abstract | Our model is also considered as a way to construct an accurate word n-gram language model directly from characters of arbitrary language, without any “word” indications.
Introduction | Bayesian Kneser-Ney) language model, with an accurate character ∞-gram Pitman-Yor spelling model embedded in word models.
Introduction | Furthermore, it can be viewed as a method for building a high-performance n-gram language model directly from character strings of arbitrary language. |
Introduction | we briefly describe a language model based on the Pitman-Yor process (Teh, 2006b), which is a generalization of the Dirichlet process used in previous research. |
Nested Pitman-Yor Language Model | In contrast, in this paper we use a simple but more elaborate model, that is, a character n-gram language model that also employs HPYLM. |
Nested Pitman-Yor Language Model | Figure 2: Chinese restaurant representation of our Nested Pitman-Yor Language Model (NPYLM). |
Pitman-Yor process and n-gram models | To compute a probability p(w|s) in (1), we adopt a Bayesian language model recently proposed by (Teh, 2006b; Goldwater et al., 2005) based on the Pitman-Yor process, a generalization of the Dirichlet process.
Pitman-Yor process and n-gram models | As a result, the n-gram probability of this hierarchical Pitman-Yor language model (HPYLM) is recursively computed as
Pitman-Yor process and n-gram models | When we set t_hw ≡ 1, (4) recovers a Kneser-Ney smoothing: thus an HPYLM is a Bayesian Kneser-Ney language model as well as an extension of the hierarchical Dirichlet process (HDP) used in Goldwater et al.
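The Kneser-Ney recursion that HPYLM generalizes can be sketched directly from counts. Below is a minimal interpolated Kneser-Ney bigram estimator; the fixed discount stands in for the Pitman-Yor discount parameter, and the function name and toy corpus are illustrative assumptions, not taken from the paper.

```python
from collections import Counter, defaultdict

def train_kn_bigram(tokens, discount=0.75):
    """Interpolated Kneser-Ney bigram estimates from raw counts.
    The lower-order ("continuation") distribution is built from type
    counts rather than token counts, as in the Kneser-Ney recursion."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter(tokens[:-1])   # c(h): times h is seen as a context
    followers = defaultdict(set)            # distinct words following h
    predecessors = defaultdict(set)         # distinct contexts preceding w
    for h, w in bigrams:
        followers[h].add(w)
        predecessors[w].add(h)
    n_bigram_types = len(bigrams)

    def prob(w, h):
        # continuation probability: how many distinct contexts w completes
        p_cont = len(predecessors[w]) / n_bigram_types
        c_h = context_totals[h]
        if c_h == 0:                         # unseen context: back off fully
            return p_cont
        discounted = max(bigrams[(h, w)] - discount, 0) / c_h
        # mass freed by discounting is redistributed via p_cont
        backoff_weight = discount * len(followers[h]) / c_h
        return discounted + backoff_weight * p_cont

    return prob

tokens = "the cat sat on the mat the cat ran".split()
p = train_kn_bigram(tokens)
```

Setting a per-context discount and drawing it from the Pitman-Yor posterior, rather than fixing it, is the step that turns this into the Bayesian HPYLM view.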
Abstract | Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation. |
Abstract | We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion. |
Expected BLEU Training | We integrate the recurrent neural network language model as an additional feature into the standard log-linear framework of translation (Och, 2003). |
Expected BLEU Training | We summarize the weights of the recurrent neural network language model as θ = {U, W, V} and add the model as an additional feature to the log-linear translation model using the simplified notation s_θ(w_t) = s(w_t|w_1 ... w_{t-1}, h_{t-1}):
Expected BLEU Training | which computes a sentence-level language model score as the sum of individual word scores. |
Introduction | In this paper we focus on recurrent neural network architectures which have recently advanced the state of the art in language modeling (Mikolov et al., 2010; Mikolov et al., 2011; Sundermeyer et al., 2013) with several subsequent applications in machine translation (Auli et al., 2013; Kalchbrenner and Blunsom, 2013; Hu et al., 2014). |
Introduction | (2013), who demonstrated that feed-forward network-based language models are more accurate in first-pass decoding than in rescoring.
Introduction | Decoding with feed-forward architectures is straightforward, since predictions are based on a fixed size input, similar to n-gram language models.
Recurrent Neural Network LMs | Our model has a similar structure to the recurrent neural network language model of Mikolov et al. |
Experiments | First we consider a bigram language model and the algorithms try to find the reordering that maximizes the LM score.
Experiments | Then we consider a trigram-based language model and the algorithms again try to maximize the LM score.
Experiments | This means that, when using a bigram language model, it is often possible to reorder the words of a randomly permuted reference sentence in such a way that the LM score of the reordered sentence is larger than the LM score of the reference.
Introduction | Typical nonlocal features include one or more n-gram language models as well as a distortion feature, measuring by how much the order of biphrases in the candidate translation deviates from their order in the source sentence. |
Phrase-based Decoding as TSP | The language model cost of producing the target words of b′ right after the target words of b; with a bigram language model, this cost can be precomputed directly from b and b′.
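With a bigram LM, the cost of emitting b′'s target words right after b depends only on b's final target word, so each (b, b′) pair can be scored once before search begins. A sketch under assumed toy data structures (the dict of bigram probabilities and the function name are ours, not the paper's):

```python
import math

def bigram_concat_cost(bigram_probs, prev_phrase, next_phrase):
    """Negative log-probability contributed by emitting next_phrase
    immediately after prev_phrase under a bigram LM. Only the last
    word of prev_phrase matters, so this cost can be tabulated once
    per biphrase pair before decoding."""
    words = [prev_phrase[-1]] + list(next_phrase)
    return -sum(math.log(bigram_probs[(w1, w2)])
                for w1, w2 in zip(words, words[1:]))

probs = {("a", "b"): 0.5, ("b", "c"): 0.25}
cost = bigram_concat_cost(probs, ["x", "a"], ["b", "c"])   # -log(0.5 * 0.25)
```

With a trigram or higher-order LM the cost also depends on earlier words of b, which is exactly why the extension discussed next is nontrivial.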
Phrase-based Decoding as TSP | Successful phrase-based systems typically employ language models of order higher than two. |
Phrase-based Decoding as TSP | If we want to extend the power of the model to general n-gram language models , and in particular to the 3-gram |
Abstract | Our results show that summaries biased by dependency pattern models lead to significantly higher ROUGE scores than both n-gram language models reported in previous work and also Wikipedia baseline summaries. |
Introduction | They also experimented with representing such conceptual models using n-gram language models derived from corpora consisting of collections of descriptions of instances of specific object types (e.g.
Introduction | a corpus of descriptions of churches, a corpus of bridge descriptions, and so on) and reported results showing that incorporating such n-gram language models as a feature in a feature-based extractive summarizer improves the quality of automatically generated summaries. |
Introduction | The main weakness of n-gram language models is that they only capture very local information about short term sequences and cannot model long-distance dependencies between terms.
Representing conceptual models 2.1 Object type corpora | 2.2 N-gram language models |
Representing conceptual models 2.1 Object type corpora | Aker and Gaizauskas (2009) experimented with uni-gram and bi-gram language models to capture the features commonly used when describing an object type and used these to bias the sentence selection of the summarizer towards the sentences that contain these features. |
Representing conceptual models 2.1 Object type corpora | As in Song and Croft (1999) they used their language models in a gener- |
Explaining between-word regressions | This simple example just illustrates the point that if a reader is combining noisy visual information with a language model, then confidence in previous regions will sometimes fall.
Models of eye movements in reading | Unfortunately, however, the Mr. Chips model simplifies the problem of reading in a number of ways: First, it uses a unigram model as its language model, and thus fails to use any information in the linguistic context to help with word identification.
Models of eye movements in reading | Specifically, our model identifies the words in a sentence by performing Bayesian inference combining noisy input from a realistic visual model with a language model that takes context into account. |
Reading as Bayesian inference | Specifically, the model begins reading with a prior distribution over possible identities of a sentence given by its language model.
Reading as Bayesian inference | model’s prior distribution over the identity of the sentence given the language model is updated to a posterior distribution taking into account both the language model and the visual input obtained thus far. |
Reading as Bayesian inference | Given the visual input and a language model, inferences about the identity of the sentence w can be made by standard Bayesian inference, where the prior is given by the language model and the likelihood is a function of the total visual input obtained from the first to the i-th timestep, I_i,
Simulation 1 | 5.1.2 Language model |
Simulation 1 | Our reader’s language model was an unsmoothed bigram model created using a vocabulary set con- |
Simulation 1 | Specifically, we constructed the model’s initial belief state (i.e., the distribution over sentences given by its language model) by directly translating the bigram model into a wFSA in the log semiring.
Simulation 2 | 6.1.3 Language model |
Experiment | In the event that a trigram or bigram would be found in the plaintext that was not counted in the language model, add-one smoothing was used.
Experiment | Our character-level language model was developed from the first 1.5 million characters of the Wall Street Journal section of the Penn Tree-bank corpus.
Introduction | If the text from which a language model is trained is of a different genre than the plaintext of a cipher, the unigraph letter frequencies may differ substantially from those of the language model, and so frequency counting will be misleading.
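Add-one smoothing here just means every possible continuation receives a pseudocount of one, so plaintext n-grams never seen in training still get nonzero probability. A minimal character-trigram sketch (the training text and function name are illustrative, not the paper's setup):

```python
from collections import Counter

def add_one_trigram_lm(text):
    """Character-trigram probabilities with add-one (Laplace) smoothing:
    trigrams absent from the training text still receive nonzero mass."""
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    vocab_size = len(set(text))

    def prob(c, history):  # p(c | history), where len(history) == 2
        return (trigrams[history + c] + 1) / (bigrams[history] + vocab_size)

    return prob

prob = add_one_trigram_lm("abcabd")
```

Adding `vocab_size` to the denominator keeps each conditional distribution normalized over the training alphabet.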
Introduction | Such inefficiency indicates that integer programming may simply be the wrong tool for the job, possibly because language model probabilities computed from empirical data are not smoothly distributed enough over the space in which a cutting-plane method would attempt to compute a linear relaxation of this problem. |
Introduction | This difference in difficulty, while real, is not inherent, but rather an artefact of the character-level n-gram language models that they (and we) use, in which preponderant evidence of differences in short character sequences is necessary for the model to clearly favour one letter-substitution mapping over another. |
Terminology | Every possible full solution to a cipher C will produce a plaintext string with some associated language model probability, and we will consider the best possible solution to be the one that gives the highest probability. |
Terminology | For the sake of concreteness, we will assume here that the language model is a character-level trigram model. |
The Algorithm | Backpointers are necessary to reference one of the two language model probabilities. |
The Algorithm | Cells that would produce inconsistencies are left at zero, and these as well as cells that the language model assigns zero to can only produce zero entries in later columns. |
The Algorithm | The n_p × n_p cells of every column i do not depend on each other, but only on the cells of the previous two columns i-1 and i-2, as well as the language model.
Conclusion | Removing the power of higher-order language models and longer maximum phrase length, which are inherent in pseudo-words, shows that pseudo-words still significantly improve translation performance over unary words.
Experiments and Results | The pipeline uses GIZA++ model 4 (Brown et al., 1993; Och and Ney, 2003) for pseudo-word alignment, uses Moses (Koehn et al., 2007) as the phrase-based decoder, and uses the SRI Language Modeling Toolkit to train the language model with modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998).
Experiments and Results | A 5-gram language model is trained on the English side of the parallel corpus.
Experiments and Results | The Xinhua portion of the English Gigaword3 corpus is used together with the English side of the large corpus to train a 4-gram language model.
Introduction | Further experiments removing the power of higher-order language models and longer maximum phrase length, which are inherent in pseudo-words, show that pseudo-words still significantly improve translation performance over unary words.
Our Approach | 3.1.1 Language Model
Our Approach | The language model (LM) p(·)
Our Approach | The parameters of the language model are learned from a monolingual Urdu corpus. |
Abstract | We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of well-known Kneser-Ney (1995) smoothing. |
Introduction | Smoothed n-gram language models are the de facto standard statistical models of language for a wide range of natural language applications, including speech recognition and machine translation.
Introduction | Constraints for language modeling
Introduction | As a result, statistical language models — an important component of many such applications — are often trained on very large corpora, then modified to fit within some pre-specified size bound. |
Preliminaries | N-gram language models are typically presented mathematically in terms of words w, the strings (histories) h that precede them, and the suffixes of the histories (backoffs) h′ that are used in the smoothing recursion.
Preliminaries | N-gram language models allow for a sparse representation, so that only a subset of the possible n-grams must be explicitly stored. |
Decoding Experiments | We add an English syntax language model L to the cascade of transducers just described to better simulate an actual machine translation decoding task.
Decoding Experiments | The language model is cast as an identity WTT and thus fits naturally into the experimental framework. |
Decoding Experiments | In our experiments we try several different language models to demonstrate varying performance of the application algorithms. |
Abstract | We thus propose to combine the advantages of both, and present a novel constituency-to-dependency translation model, which uses constituency forests on the source side to direct the translation, and dependency trees on the target side (as a language model) to ensure grammaticality.
Decoding | where the first two terms are the translation and language model probabilities, e(o) is the target string (English sentence) for derivation o, the third and fourth items are the dependency language model probabilities on the target side computed with words and POS tags separately, De(o) is the target dependency tree of o, the fifth one is the parsing probability of the source-side tree TC(o) ∈ FC, ill(o) is the penalty for the number of ill-formed dependency structures in o, and the last two terms are derivation and translation length penalties, respectively.
Decoding | For each node, we use the cube pruning technique (Chiang, 2007; Huang and Chiang, 2007) to produce partial hypotheses and compute all the feature scores including the dependency language model score (Section 4.1). |
Decoding | 4.1 Dependency Language Model Computing |
Experiments | We also store the POS tag information for each word in dependency trees, and compute two different dependency language models for words and POS tags in dependency tree separately. |
Experiments | We use SRI Language Modeling Toolkit (Stolcke, 2002) to train a 4-gram language model with Kneser-Ney smoothing on the first 1/3 of the Xinhua portion of Giga-word corpus. |
Experiments | This suggests that using a dependency language model really improves the translation quality by less than 1 BLEU point.
Indicators of linguistic quality | 3.1 Word choice: language models |
Indicators of linguistic quality | Language models (LM) are a way of computing how familiar a text is to readers using the distribution of words from a large background corpus. |
Indicators of linguistic quality | We built unigram, bigram, and trigram language models with Good-Turing smoothing over the New York Times (NYT) section of the English Gigaword corpus (over 900 million words). |
Results and discussion | Coh-Metrix, which has been proposed as a comprehensive characterization of text, does not perform as well as the language model and the entity coherence classes, which contain considerably fewer features related to only one aspect of text. |
Results and discussion | It is apparent from the results that continuity, entity coherence, sentence fluency and language models are the most powerful classes of features that should be used in automation of evaluation and against which novel predictors of text quality should be compared. |
Results and discussion | For example, the language model features, which are the second best class for the system-level, do not fare as well at the input-level. |
Abstract | In this paper we show that even for the case of 1:1 substitution ciphers—which encipher plaintext symbols by exchanging them with a unique substitute—finding the optimal decipherment with respect to a bigram language model is NP-hard. |
Definitions | denotes the language model.
Definitions | Depending on the structure of the language model, Equation 2 can be further simplified.
Definitions | Similarly, we define language model matrices S for the unigram and the bigram case. |
Introduction | The general idea is to find those translation model parameters that maximize the probability of the translations of a given source text in a given language model of the target language. |
Introduction | This might be related to the fact that a statistical formulation of the decipherment problem has not been analyzed with respect to n-gram language models: This paper shows the close relationship of the decipherment problem to the quadratic assignment problem.
Introduction | In Section 4 we show that decipherment using a unigram language model corresponds to solving a linear sum assignment problem (LSAP). |
Related Work | gram language model.
Approach | In his method, a variety of languages are modeled by their spelling systems (i.e., character-based n-gram language models).
Approach | Then, agglomerative hierarchical clustering is applied to the language models to reconstruct a language family tree. |
Approach | The similarity used for clustering is based on a divergence-like distance between two language models that was originally proposed by Juang and Rabiner (1985). |
Methods | Similarly, let M_i be a language model trained using D_i.
Methods | 2, we use an n-gram language model based on a mixture of word and POS tokens instead of a simple word-based language model.
Methods | In this language model , content words in n-grams are replaced with their corresponding POS tags. |
Abstract | As the algorithm generates dependency trees for partial translations left-to-right in decoding, it allows for efficient integration of both n-gram and dependency language models . |
Introduction | In addition, it is straightforward to integrate n-gram language models into phrase-based decoders in which translation always grows left-to-right. |
Introduction | As a result, phrase-based decoders only need to maintain the boundary words on one end to calculate language model probabilities. |
Introduction | Unfortunately, as syntax-based decoders often generate target-language words in a bottom-up way using the CKY algorithm, integrating n-gram language models becomes more expensive because they have to maintain target boundary words at both ends of a partial translation (Chiang, 2007; Huang and Chiang, 2007). |
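The bookkeeping difference can be made concrete: under an n-gram LM, a left-to-right decoder's hypothesis state is just its last n−1 words, while a bottom-up decoder must keep n−1 boundary words at both ends. A sketch with our own function and variable names (not from any of the cited systems):

```python
def phrase_based_state(words, n):
    """Left-to-right decoding: only the rightmost n-1 words of a partial
    translation can affect future LM queries."""
    return tuple(words[-(n - 1):])

def cky_state(words, n):
    """Bottom-up (CKY) decoding: a partial translation may later be
    extended on either side, so n-1 boundary words are kept at both ends."""
    k = n - 1
    return tuple(words[:k]), tuple(words[-k:])

left_right = phrase_based_state(["we", "must", "also", "act"], 3)
both_ends = cky_state(["we", "must", "also", "act"], 3)
```

The larger CKY state space is what makes n-gram integration more expensive in syntax-based decoders: the number of distinct states per chart cell grows with the vocabulary raised to twice the boundary length.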
Abstract | We rederive all the steps of KN smoothing to operate on count distributions instead of integral counts, and apply it to two tasks where KN smoothing was not applicable before: one in language model adaptation, and the other in word alignment. |
Introduction | Such cases have been noted for language modeling (Goodman, 2001; Goodman, 2004), domain adaptation (Tam and Schultz, 2008), grapheme-to-phoneme conversion (Bisani and Ney, 2008), and phrase-based translation (Andres-Ferrer, 2010; Wuebker et al., 2012). |
Introduction | One is language model domain adaptation, and the other is word alignment using the IBM models (Brown et al., 1993). |
Language model adaptation | N-gram language models are widely used in applications like machine translation and speech recognition to select fluent output sentences.
Language model adaptation | Here, we propose to assign each sentence a probability to indicate how likely it is to belong to the domain of interest, and train a language model using expected KN smoothing. |
Language model adaptation | They first train two language models, p_in on a set of in-domain data, and p_out on a set of general-domain data.
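One simple way to turn the two LMs into a per-sentence domain probability is a Bayes-rule log-odds combination; this is a sketch of the idea, not necessarily the paper's exact weighting scheme, and the function name and prior are our assumptions.

```python
import math

def in_domain_prob(logp_in, logp_out, prior_in=0.5):
    """P(in-domain | sentence) from sentence log-probabilities under
    p_in and p_out, assuming the two LMs and a class prior fully
    describe the data (a Bayes-rule sketch)."""
    log_odds = (logp_in - logp_out) + math.log(prior_in / (1.0 - prior_in))
    return 1.0 / (1.0 + math.exp(-log_odds))

w_neutral = in_domain_prob(-42.0, -42.0)   # equal scores -> falls back to prior
w_in = in_domain_prob(-30.0, -40.0)        # much likelier under p_in
```

These per-sentence weights can then serve as the fractional counts that expected KN smoothing consumes.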
Related Work | This method subtracts D directly from the fractional counts, zeroing out counts that are smaller than D. The discount D must be set by minimizing an error metric on held-out data using a line search (Tam, p. c.) or Powell’s method (Bisani and Ney, 2008), requiring repeated estimation and evaluation of the language model . |
Smoothing on integral counts | Before presenting our method, we review KN smoothing on integer counts as applied to language models, although, as we will demonstrate in Section 7, KN smoothing is applicable to other tasks as well.
Abstract | In this paper we examine language modeling for text simplification. |
Abstract | Unlike some text-to-text translation tasks, text simplification is a monolingual translation task allowing for text in both the input and output domain to be used for training the language model.
Abstract | We explore the relationship between normal English and simplified English and compare language models trained on varying amounts of text from each. |
Introduction | An important component of many text-to-text translation systems is the language model which predicts the likelihood of a text sequence being produced in the output language. |
Introduction | In some problem domains, such as machine translation, the translation is between two distinct languages and the language model can only be trained on data in the output language. |
Introduction | In these monolingual problems, text could be used from both the input and output domain to train a language model.
Related Work | If we view the normal data as out-of-domain data, then the problem of combining simple and normal data is similar to the language model domain adaption problem (Suzuki and Gao, 2005), in particular cross-domain adaptation (Bellegarda, 2004) where a domain-specific model is improved by incorporating additional general data. |
Background | Early work was firmly situated in the task-based setting of improving generalisation in language models.
Background | This model has been popular for language modelling and bilingual word alignment, and an implementation with improved inference called mkcls (Och, 1999) has become a standard part of statistical machine translation systems.
Background | (1992)’s HMM by incorporating a character language model, allowing the modelling of limited morphology.
Introduction | Our work brings together several strands of research including Bayesian nonparametric HMMs (Goldwater and Griffiths, 2007), Pitman-Yor language models (Teh, 2006b; Goldwater et al., 2006b), tagging constraints over word types (Brown et al., 1992) and the incorporation of morphological features (Clark, 2003). |
The PYP-HMM | Prior work in unsupervised PoS induction has employed simple smoothing techniques, such as additive smoothing or Dirichlet priors (Goldwater and Griffiths, 2007; Johnson, 2007), however this body of work has overlooked recent advances in smoothing methods used for language modelling (Teh, 2006b; Goldwater et al., 2006b). |
The PYP-HMM | The PYP has been shown to generate distributions particularly well suited to modelling language (Teh, 2006a; Goldwater et al., 2006b), and has been shown to be a generalisation of Kneser-Ney smoothing, widely recognised as the best smoothing method for language modelling (Chen and Goodman, 1996).
The PYP-HMM | We consider two different settings for the base distribution C_j: 1) a simple uniform distribution over the vocabulary (denoted HMM for the experiments in section 4); and 2) a character-level language model (denoted HMM+LM).
Abstract | In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling.
Introduction | Language models have been extensively studied in natural language processing. |
Introduction | The role of a language model is to measure how probable a (target) word is, based on some given evidence extracted from the history-context.
Language Modeling with TD and TO | A language model estimates word probabilities given their history, i.e. |
Language Modeling with TD and TO | In order to define the TD and TO components for language modeling, we express the observation of an arbitrary history-word, w_{i-k}, at the k-th position behind the target-word, as the joint of two events: i) the word w_{i-k} occurs within the history-context: w_{i-k} ∈ h, and ii) it occurs at distance k from the target-word: Δ(w_{i-k}) = k (Δ = k for brevity); i.e.
Language Modeling with TD and TO | In fact, the TO model is closely related to the trigger language model (Rosenfeld 1996), as the prediction of the target-word (the triggered word) is based on the presence of a history-word (the trigger). |
Motivation of the Proposed Approach | The attributes of distance and co-occurrence are exploited and modeled differently in each language modeling approach. |
Related Work | Latent-semantic language model approaches (Bellegarda 1998, Coccaro 2005) weight word counts with TFIDF to highlight their semantic importance towards the prediction. |
Related Work | Other approaches such as the class-based language model (Brown 1992, Kneser & Ney 1993) |
Related Work | The structured language model (Chelba & Jelinek 2000) determines the “heads” in the history-context by using a parsing tree.
Experiments | Then, Tesseract uses a classifier, aided by a word-unigram language model , to recognize whole words. |
Experiments | 6.3 Language Model |
Learning | The number of states in the dynamic programming lattice grows exponentially with the order of the language model (Jelinek, 1998; Koehn, 2004). |
Learning | As a result, inference can become slow when the language model order n is large. |
Learning | On each iteration of EM, we perform two passes: a coarse pass using a low-order language model, and a fine pass using a high-order language model (Petrov et al., 2008; Zhang and Gildea, 2008). |
Model | P(E, T, R, X) = P(E) [Language model] · P(T|E) [Typesetting model] · P(R) [Inking model] · P(X|E, T, R) [Noise model]
Model | 3.1 Language Model P(E) |
Model | Our language model, P(E), is a Kneser-Ney smoothed character n-gram model (Kneser and Ney, 1995).
Related Work | Work that has directly addressed historical documents has done so using a pipelined approach, and without fully integrating a strong language model (Vamvakas et al., 2008; Kluzner et al., 2009; Kae et al., 2010; Kluzner et al., 2011). |
Related Work | They integrated typesetting models with language models , but did not model noise. |
Related Work | Our approach is also similar in that we use a strong language model (in conjunction with the constraint that the correspondence be regular) to learn the correct mapping. |
Abstract | N-gram language models are a major resource bottleneck in machine translation.
Abstract | In this paper, we present several language model implementations that are both highly compact and fast to query. |
Abstract | We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%. |
Introduction | For modern statistical machine translation systems, language models must be both fast and compact. |
Introduction | The largest language models (LMs) can contain as many as several hundred billion n-grams (Brants et al., 2007), so storage is a challenge. |
Introduction | At the same time, decoding a single sentence can trigger hundreds of thousands of queries to the language model, so speed is also critical.
Abstract | We tackle the problem with two approaches: methods that use local lexical information, such as the n-grams of a classical language model; and methods that evaluate global coherence, such as latent semantic analysis.
Introduction | To investigate the usefulness of local information, we evaluated n-gram language model scores, from both a conventional model with Good-Turing smoothing and a recently proposed maximum-entropy class-based n-gram model (Chen, 2009a; Chen, 2009b).
Introduction | Also in the language modeling vein, but with potentially global context, we evaluate the use of a recurrent neural network language model.
Introduction | In all the language modeling approaches, a model is used to compute a sentence probability with each of the potential completions. |
Related Work | The KU system uses just an N-gram language model to do this ranking.
Related Work | The UNT system uses a large variety of information sources, and a language model score receives the highest weight. |
Sentence Completion via Language Modeling | Perhaps the most straightforward approach to solving the sentence completion task is to form the complete sentence with each option in turn, and to evaluate its likelihood under a language model.
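Concretely, the straightforward approach scores each candidate sentence and keeps the argmax. A sketch with a stand-in scorer; the `___` blank convention, the toy scorer, and the names are our assumptions, not the paper's:

```python
def best_completion(sentence_logprob, template, options):
    """Fill the blank with each option, score the full sentence under
    the LM, and return the highest-scoring option."""
    return max(options,
               key=lambda opt: sentence_logprob(template.replace("___", opt).split()))

# stand-in scorer that rewards the word "cat"; a real system would
# compute the sentence log-probability under an n-gram or RNN LM
toy_scorer = lambda words: sum(0.0 if w == "cat" else -1.0 for w in words)
choice = best_completion(toy_scorer, "the ___ sat on the mat", ["dog", "cat", "tree"])
```

Any of the LMs surveyed in this section can be dropped in as `sentence_logprob`, which is what makes the task a clean model-comparison benchmark.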
Sentence Completion via Language Modeling | In this section, we describe the suite of state-of-the-art language modeling techniques for which we will present results.
Sentence Completion via Language Modeling | 3.1 Backoff N-gram Language Model |
Background: Hypergraphs | The second step is to integrate an n-gram language model with this hypergraph. |
Background: Hypergraphs | The labels for leaves will be words, and will be important in defining strings and language model scores for those strings. |
Background: Hypergraphs | The focus of this paper will be to solve problems involving the integration of a k’th order language model with a hypergraph. |
Introduction | Decoding with these models is challenging, largely because of the cost of integrating an n-gram language model into the search process.
Introduction | E.g., with a trigram language model they run in O(|E|w^6) time, where |E| is the number of edges in the hypergraph, and w is the number of distinct lexical items in the hypergraph.
Introduction | This step does not require language model integration, and hence is highly efficient. |
Abstract | Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation.
Abstract | This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. |
Abstract | We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.
Introduction | Language models , probability distributions over strings of words, are fundamental to many applications in natural language processing. |
Introduction | The main challenge in language modeling is to estimate string probabilities accurately given that even very large training corpora cannot overcome the inherent sparseness of word sequence data. |
Introduction | Plausible though this line of reasoning is, the language models most commonly used today do not incorporate class-based generalization. |
Related work | However, the importance of rare events for clustering in language modeling has not been investigated before. |
Related work | Our work is most similar to the lattice-based language models proposed by Dupont and Rosenfeld (1997). |
Abstract | Incremental syntactic language models score sentences in a similar left-to-right fashion, and are therefore a good mechanism for incorporating syntax into phrase-based translation. |
Abstract | We give a formal definition of one such linear-time syntactic language model, detail its relation to phrase-based decoding, and integrate the model with the Moses phrase-based translation system.
Introduction | Early work in statistical machine translation viewed translation as a noisy channel process comprised of a translation model, which functioned to posit adequate translations of source language words, and a target language model, which guided the fluency of generated target language strings (Brown et al.,
Introduction | Drawing on earlier successes in speech recognition, research in statistical machine translation has effectively used n-gram word sequence models as language models.
Introduction | Modern phrase-based translation using large scale n-gram language models generally performs well in terms of lexical choice, but still often produces ungrammatical output. |
Related Work | Instead, we incorporate syntax into the language model.
Related Work | Traditional approaches to language models in |
Related Work | Chelba and Jelinek (1998) proposed that syntactic structure could be used as an alternative technique in language modeling.
Abstract | This paper presents an attempt at building a large scale distributed composite language model that simultaneously accounts for local word lexical information, midrange sentence syntactic structure, and long-span document semantic content under a directed Markov random field paradigm. |
Abstract | The composite language model has been trained by performing a convergent N-best list approximate EM algorithm that has linear time complexity and a follow-up EM algorithm to improve word prediction power on corpora with up to a billion tokens and stored on a supercomputer.
Abstract | The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
Composite language model | The n-gram language model is essentially a word predictor that, given its entire document history, predicts the next word w_{k+1} based on the last n-1 words with probability p(w_{k+1} | w_{k-n+2}^{k}), where w_{k-n+2}^{k} = w_{k-n+2}, ..., w_k.
Composite language model | PLSA models together to build a composite generative language model under the directed MRF paradigm (Wang et al., 2005; Wang et al., 2006), the TAGGER and CONSTRUCTOR in SLM and the SEMANTIZER in PLSA remain unchanged; however, the WORD-PREDICTORs in n-gram, m-SLM and PLSA are combined to form a stronger WORD-PREDICTOR that generates the next word, w_{k+1}, depending not only on the m leftmost exposed headwords h_{-m}^{-1} in the word-parse k-prefix but also on its n-gram history w_{k-n+2}^{k} and its semantic content g_{k+1}.
Composite language model | The parameter for the WORD-PREDICTOR in the composite n-gram/m-SLM/PLSA language model becomes p(w_{k+1} | w_{k-n+2}^{k} h_{-m}^{-1} g_{k+1}).
Introduction | There is a dire need for developing novel approaches to language modeling.
Introduction | (2006) integrated n-gram, structured language model (SLM) (Chelba and Jelinek, 2000) and probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) under the directed MRF framework (Wang et al., 2005) and studied the stochastic properties for the composite language model.
Introduction | They derived a generalized inside-outside algorithm to train the composite language model from a general EM (Dempster et al., 1977) by following Jelinek's ingenious definition of the inside and outside probabilities for SLM (Jelinek, 2004), with time complexity that is sixth order in sentence length.
Abstract | In this paper, we present an approach to enriching high-order feature representations for graph-based dependency parsing models using a dependency language model and beam search.
Abstract | The dependency language model is built on a large amount of additional auto-parsed data that is processed by a baseline parser.
Abstract | Based on the dependency language model, we represent a set of features for the parsing model.
Dependency language model | Language models play a very important role for statistical machine translation (SMT). |
Dependency language model | The standard N-gram based language model predicts the next word based on the N − 1 immediately preceding words.
Dependency language model | However, the traditional N-gram language model cannot capture long-distance word relations.
Introduction | In this paper, we solve this issue by enriching the feature representations for a graph-based model using a dependency language model (DLM) (Shen et al., 2008). |
Introduction | We utilize the dependency language model to enhance the graph-based parsing model.
Parsing with dependency language model | In this section, we propose a parsing model which includes the dependency language model by extending the model of McDonald et al. |
Hello. My name is Inigo Montoya. | First, we show a concrete sense in which memorable quotes are indeed distinctive: with respect to lexical language models trained on the newswire portions of the Brown corpus [21], memorable quotes have significantly lower likelihood than their non-memorable counterparts. |
Hello. My name is Inigo Montoya. | In particular, we analyze a corpus of advertising slogans, and we show that these slogans have significantly greater likelihood at both the word level and the part-of-speech level with respect to a language model trained on memorable movie quotes, compared to a corresponding language model trained on non-memorable movie quotes. |
Never send a human to do a machine’s job. | In order to assess different levels of lexical and syntactic distinctiveness, we employ a total of six Laplace-smoothed language models: 1-gram, 2-gram, and 3-gram word LMs and 1-gram, 2-gram, and 3-gram part-of-speech LMs.
Never send a human to do a machine’s job. | As indicated in Table 3, for each of our lexical “common language” models, in about 60% of the quote pairs, the memorable quote is more distinctive.
Never send a human to do a machine’s job. | The language models’ vocabulary was that of the entire training corpus. |
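The Laplace-smoothed models mentioned above can be sketched in a few lines. This is a toy illustration with hypothetical names and data, not the authors' implementation: a unigram LM that adds one to every count so that unseen words still get nonzero probability.

```python
import math
from collections import Counter

def laplace_unigram_logprob(tokens, train_counts, vocab_size):
    # Add-one (Laplace) smoothed unigram log-likelihood of a token sequence:
    # P(t) = (count(t) + 1) / (total + |V|), so unseen words score > 0.
    total = sum(train_counts.values())
    return sum(math.log((train_counts.get(t, 0) + 1) / (total + vocab_size))
               for t in tokens)

train = Counter("the cat sat on the mat".split())
vocab_size = len(set(train) | {"dog"})   # 6 types, incl. the unseen "dog"
score = laplace_unigram_logprob("the dog".split(), train, vocab_size)
```

Lower likelihood under such a "common language" model is what the study above uses as a proxy for distinctiveness.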
Abstract | We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features. |
Experiments | Nonetheless, it represents phonetic variability more realistically than the Bernstein-Ratner-Brent corpus, while still maintaining the lexical characteristics of infant-directed speech (as compared to the Buckeye corpus, with its much larger vocabulary and more complex language model).
Inference | The language modeling term relating to the intended string again factors into multiple components. |
Inference | Because neither the transducer nor the language model are perfect models of the true distribution, they can have incompatible dynamic ranges. |
Inference | The transducer scores can be cached since they depend only on surface forms, but the language model scores cannot.
Introduction | Previous models with similar goals have learned from an artificial corpus with a small vocabulary (Driesen et al., 2009; Rasanen, 2011) or have modeled variability only in vowels (Feldman et al., 2009); to our knowledge, this paper is the first to use a naturalistic infant-directed corpus while modeling variability in all segments, and to incorporate word-level context (a bigram language model).
Introduction | Our model is conceptually similar to those used in speech recognition and other applications: we assume the intended tokens are generated from a bigram language model and then distorted by a noisy channel, in particular a log-linear model of phonetic variability. |
Introduction | But unlike speech recognition, we have no (intended-form, surface-form) training pairs to train the phonetic model, nor even a dictionary of intended-form strings to train the language model.
Lexical-phonetic model | Our lexical-phonetic model is defined using the standard noisy channel framework: first a sequence of intended word tokens is generated using a language model, and then each token is transformed by a probabilistic finite-state transducer to produce the observed surface sequence.
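The noisy-channel framework described here can be sketched minimally. This is a toy example with made-up words and probabilities, not the paper's transducer-based model: decoding picks the intended word that maximizes the product of the language-model prior and the channel probability of the observed surface form.

```python
def decode(surface, prior, channel):
    # Noisy-channel decoding: argmax over intended words of
    # P(intended) * P(surface | intended).
    return max(prior, key=lambda w: prior[w] * channel.get((surface, w), 0.0))

prior = {"water": 0.6, "wader": 0.4}     # language model over intended words
channel = {("wata", "water"): 0.5,       # P(surface | intended)
           ("wata", "wader"): 0.2}
best = decode("wata", prior, channel)    # 0.6*0.5 = 0.30 beats 0.4*0.2 = 0.08
```

In the actual model, the channel is a finite-state transducer over articulatory features rather than a lookup table, but the decision rule has the same shape.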
Related work | In contrast, our model uses a symbolic representation for sounds, but models variability in all segment types and incorporates a bigram word-level language model . |
A Probabilistic Formulation for HVR | where P(W) can be modelled by the word-based n-gram language model (Chen and Goodman, 1996) commonly used in automatic speech recognition.
A Probabilistic Formulation for HVR | Language model score: P(W)
A Probabilistic Formulation for HVR | Note that the acoustic model and language model scores are already used in the conventional ASR. |
Abstract | In addition to the acoustic and language models used in automatic speech recognition systems, HVR uses the haptic and partial lexical models as additional knowledge sources to reduce the recognition search space and suppress confusions. |
Experimental Results | These sentences contain a variety of given names, surnames and city names so that confusions cannot be easily resolved using a language model . |
Experimental Results | The ASR system used in all the experiments reported in this paper consists of a set of HMM-based triphone acoustic models and an n-gram language model . |
Experimental Results | A bigram language model with a vocabulary size of 200 words was used for testing. |
Haptic Voice Recognition (HVR) | In conventional ASR, acoustically similar word sequences are typically resolved implicitly using a language model where contexts of neighboring words are used for disambiguation. |
Integration of Knowledge Sources | where the four transducers denote the WFST representations of the acoustic model, language model, PLI model and haptic model, respectively.
Integration of Knowledge Sources | (2002) has shown that Hidden Markov Models (HMMs) and n-gram language models can be viewed as WFSTs. |
Introduction | In addition to the acoustic model and language model used in ASR, haptic model and partial lexical model are also introduced to facilitate the integration of more sophisticated haptic events, such as the keystrokes, into HVR. |
Abstract | We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context. |
Abstract | We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours. |
Introduction | N-gram language models are a central component of all speech recognition and machine translation systems, and a great deal of research centers around refining models (Chen and Goodman, 1998), efficient storage (Pauls and Klein, 2011; Heafield, 2011), and integration into decoders (Koehn, 2004; Chiang, 2005).
Introduction | At the same time, because n-gram language models only condition on a local window of linear word-level context, they are poor models of long-range syntactic dependencies. |
Introduction | Although several lines of work have proposed generative syntactic language models that improve on n-gram models for moderate amounts of data (Chelba, 1997; Xu et al., 2002; Charniak, 2001; Hall, 2004; Roark, |
Treelet Language Modeling | The common denominator of most n-gram language models is that they assign probabilities roughly according to empirical frequencies for observed n-grams, but fall back to distributions conditioned on smaller contexts for unobserved n-grams, as shown in Figure 1(a).
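The fall-back behavior described above can be sketched as a simple back-off scorer. This is a toy, stupid-backoff-style illustration with hypothetical names, not the smoothing used by any of the cited systems: use the relative frequency of the full n-gram when it was observed, otherwise shorten the context and apply a fixed penalty.

```python
def backoff_score(word, context, counts, alpha=0.4):
    # Use the relative frequency of the full n-gram when it was observed;
    # otherwise drop the earliest context word and multiply in a penalty.
    penalty = 1.0
    while context:
        if counts.get(context + (word,), 0) > 0:
            return penalty * counts[context + (word,)] / counts[context]
        context, penalty = context[1:], penalty * alpha
    unigram_total = sum(c for ng, c in counts.items() if len(ng) == 1)
    return penalty * counts.get((word,), 0) / unigram_total

counts = {("the",): 1, ("cat",): 1, ("sat",): 1,
          ("the", "cat"): 1, ("cat", "sat"): 1}
seen = backoff_score("cat", ("the",), counts)    # bigram observed
unseen = backoff_score("sat", ("the",), counts)  # backs off to the unigram
```

Proper smoothing schemes (Kneser-Ney, Katz) normalize the back-off weights so the distribution sums to one; this sketch only shows the control flow.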
Treelet Language Modeling | to use back-off-based smoothing for syntactic language modeling — such techniques have been applied to models that condition on headword contexts (Charniak, 2001; Roark, 2004; Zhang, 2009). |
Abstract | Most current data selection methods solely use language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus.
Abstract | By contrast, we argue that the relevance between a sentence pair and target domain can be better evaluated by the combination of language model and translation model. |
Introduction | Current data selection methods mostly use language models trained on small-scale in-domain data to measure domain relevance and select domain-relevant parallel sentence pairs to expand training corpora.
Introduction | To overcome the problem, we first propose a method that combines the translation model with the language model in data selection.
Introduction | The language model measures the domain-specific generation probability of sentences, and is used to select domain-relevant sentences on both the source and target sides.
Related Work | The existing data selection methods are mostly based on language models.
Related Work | (2010) ranked the sentence pairs in the general-domain corpus according to the perplexity scores of sentences, which are computed with respect to in-domain language models.
Related Work | (2011) improved the perplexity-based approach and proposed bilingual cross-entropy difference as a ranking function with in-domain and general-domain language models.
Evaluation | In §3.3, we then examined the effect of using a very large 5-gram language model trained on 7.5 billion English tokens to understand the nature of the improvements in §3.2.
Evaluation | The Urdu to English evaluation in §3.4 focuses on how noisy parallel data and completely monolingual (i.e., not even comparable) text can be used for a realistic low-resource language pair, and is evaluated with the larger language model only. |
Evaluation | The 13 baseline features (2 lexical, 2 phrasal, 5 HRM, and 1 language model, word penalty, phrase length feature and distortion penalty feature) were tuned using MERT (Och, 2003), which is also used to tune the 4 feature weights introduced by the secondary phrase table (2 lexical and 2 phrasal, other features being shared between the two tables).
Generation & Propagation | These candidates are scored using stem-level translation probabilities, morpheme-level lexical weighting probabilities, and a language model , and only the top 30 candidates are included. |
Introduction | We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models.
Abstract | In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes.
Abstract | The resulting clusterings are then used in training partially class-based language models.
Experiments | We trained a number of predictive class-based language models on different Arabic and English corpora using clusterings trained on the complete data of the same corpus. |
Experiments | We use each predictive class-based language model as well as a word-based model as separate feature functions in the log-linear combination in Eq. |
Experiments | The word-based language model used by the system in these experiments is a 5-gram model also trained on the en_target data set.
Introduction | A statistical language model assigns a probability P(w) to any given string of words w_1^m = w_1, ..., w_m.
Introduction | In the case of n-gram language models this is done by factoring the probability: |
Introduction | do not differ in the last n − 1 words, one problem n-gram language models suffer from is that the training data is too sparse to reliably estimate all conditional probabilities P(w_i | w_{i−n+1}^{i−1}).
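The factorization referred to above can be made concrete. This is a minimal sketch with a hypothetical probability table, not any cited system: the chain rule is truncated to an (n−1)-order Markov assumption, so the sentence probability is the product of short conditional probabilities.

```python
import math

def ngram_logprob(sentence, cond_logprob, n=2):
    # Chain rule under an (n-1)-order Markov assumption:
    # log P(w_1..w_m) = sum_i log P(w_i | w_{i-n+1}, ..., w_{i-1}).
    toks = ["<s>"] * (n - 1) + sentence.split()
    return sum(cond_logprob[(tuple(toks[i - n + 1:i]), toks[i])]
               for i in range(n - 1, len(toks)))

table = {(("<s>",), "the"): math.log(0.5),   # toy bigram table
         (("the",), "cat"): math.log(0.2)}
prob = math.exp(ngram_logprob("the cat", table))   # 0.5 * 0.2
```

The sparsity problem is visible even here: any (history, word) pair missing from the table has no estimate, which is exactly what smoothing and class-based generalization address.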
Abstract | We propose a succinct randomized language model which employs a perfect hash function to encode fingerprints of n-grams and their associated probabilities, backoff weights, or other parameters.
Abstract | We demonstrate the space-savings of the scheme via machine translation experiments within a distributed language modeling framework. |
Experimental Setup | We deploy the randomized LM in a distributed framework which allows it to scale more easily by distributing it across multiple language model servers. |
Introduction | Language models (LMs) are a core component in statistical machine translation, speech recognition, optical character recognition and many other areas. |
Introduction | With large monolingual corpora available in major languages, making use of all the available data is now a fundamental challenge in language modeling.
Introduction | have considered alternative parameterizations such as class-based models (Brown et al., 1992), model reduction techniques such as entropy-based pruning (Stolcke, 1998), novel representation schemes such as suffix arrays (Emami et al., 2007), Golomb Coding (Church et al., 2007) and distributed language models that scale more readily (Brants et al., 2007).
Scaling Language Models | In language modeling the universe under consideration is the set of all possible n-grams of length n for a given vocabulary.
Scaling Language Models | Recent work (Talbot and Osborne, 2007b) has used lossy encodings based on Bloom filters (Bloom, 1970) to represent logarithmically quantized corpus statistics for language modeling.
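The fingerprinting idea behind these lossy encodings can be sketched simply. This is a toy illustration, not the perfect-hash scheme of the paper: each n-gram string is replaced by a short hash-based fingerprint, so only the fingerprint and its quantized parameter are stored, at the cost of a small false-positive probability on lookups.

```python
import hashlib

def fingerprint(ngram, bits=16):
    # Replace the n-gram string with a short hash-based fingerprint;
    # two distinct n-grams collide with probability about 2**-bits.
    digest = hashlib.md5(" ".join(ngram).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)

table = {fingerprint(("the", "cat")): -1.9}   # fingerprint -> quantized log prob
logprob = table.get(fingerprint(("the", "cat")))
```

A true perfect hash function removes collisions among the stored keys entirely; unseen n-grams can still alias onto a stored fingerprint, which is the "lossy" part of the trade-off.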
Abstract | We take a multi-pass approach to machine translation decoding when using synchronous context-free grammars as the translation model and n-gram language models: the first pass uses a bigram language model, and the resulting parse forest is used in the second pass to guide search with a trigram language model.
Introduction | This complexity arises from the interaction of the tree-based translation model with an n-gram language model.
Introduction | First, we present a two-pass decoding algorithm, in which the first pass explores states resulting from an integrated bigram language model, and the second pass expands these states into trigram-based
Introduction | The general bigram-to-trigram technique is common in speech recognition (Murveit et al., 1993), where lattices from a bigram-based decoder are re-scored with a trigram language model . |
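The rescoring step in this bigram-to-trigram pipeline can be sketched on an N-best list instead of a lattice. This is a toy example with made-up scores, not the cited decoders: the cheap first-pass LM score is subtracted out and the stronger second-pass LM score is added in before re-ranking.

```python
def rescore(nbest, second_lm):
    # Second-pass rescoring: remove the first-pass LM score from each
    # hypothesis and add the score from a larger (e.g. trigram) LM.
    return max(nbest, key=lambda h: h[1] - h[2] + second_lm[h[0]])[0]

second_lm = {"a b c": -1.0, "a c b": -5.0}          # trigram LM log scores
nbest = [("a c b", -2.0, -0.5),   # (hypothesis, total score, first-pass LM score)
         ("a b c", -2.2, -0.7)]   # loses under the bigram LM, wins after rescoring
best = rescore(nbest, second_lm)
```

Lattice or forest rescoring applies the same substitution per edge rather than per hypothesis, which is what keeps the search space compact.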
Language Model Integrated Decoding for SCFG | We begin by introducing Synchronous Context Free Grammars and their decoding algorithms when an n-gram language model is integrated into the grammatical search space. |
Language Model Integrated Decoding for SCFG | Without an n-gram language model , decoding using SCFG is not much different from CFG parsing. |
Language Model Integrated Decoding for SCFG | However, when we want to integrate an n-gram language model into the search, our goal is to find the derivation that maximizes the total sum of production weights and n-gram log probabilities.
Multi-pass LM-Integrated Decoding | very good estimate of the outside cost using a trigram model since a bigram language model and a trigram language model must be strongly correlated. |
Multi-pass LM-Integrated Decoding | We propagate the outside cost of the parent to its children by combining with the inside cost of the other children and the interaction cost, i.e., the language model cost between the focused child and the other children. |
Multi-pass LM-Integrated Decoding | (2007) also take a two-pass decoding approach, with the first pass leaving the language model boundary words out of the dynamic programming state, such that only one hypothesis is retained for each span and grammar symbol. |
Abstract | Our novel lattice desegmentation algorithm effectively combines both segmented and desegmented views of the target language for a large subspace of possible translation outputs, which allows for inclusion of features related to the desegmentation process, as well as an unsegmented language model (LM).
Methods | This trivially allows for an unsegmented language model and never makes desegmentation errors. |
Methods | Doing so enables the inclusion of an unsegmented target language model, and with a small amount of bookkeeping, it also allows the inclusion of features related to the operations performed during desegmentation (see Section 3.4).
Methods | We now have a desegmented lattice, but it has not been annotated with an unsegmented (word-level) language model.
Related Work | Bojar (2007) incorporates such analyses into a factored model, to either include a language model over target morphological tags, or model the generation of morphological features. |
Related Work | They introduce an additional desegmentation technique that augments the table-based approach with an unsegmented language model.
Related Work | Oflazer and Durgar El-Kahlout (2007) desegment 1000-best lists for English-to-Turkish translation to enable scoring with an unsegmented language model.
Abstract | This paper applies MST parsing to MT, and describes how it can be integrated into a phrase-based decoder to compute dependency language model scores. |
Abstract | Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets. |
Dependency parsing for machine translation | While it seems that loopy graphs are undesirable when the goal is to obtain a syntactic analysis, that is not necessarily the case when one just needs a language modeling score. |
Introduction | Hierarchical approaches to machine translation have proven increasingly successful in recent years (Chiang, 2005; Marcu et al., 2006; Shen et al., 2008), and often outperform phrase-based systems (Och and Ney, 2004; Koehn et al., 2003) on target language fluency and adequacy. However, their benefits generally come with high computational costs, particularly when chart parsing, such as CKY, is integrated with language models of high orders (Wu, 1996).
Introduction | Indeed, researchers have shown that gigantic language models are key to state-of-the-art performance (Brants et al., 2007), and the ability of phrase-based decoders to handle large-size, high-order language models with no consequence on asymptotic running time during decoding presents a compelling advantage over CKY decoders, whose time complexity grows prohibitively large with higher-order language models.
Introduction | Most interestingly, the time complexity of non-projective dependency parsing remains quadratic as the order of the language model increases. |
Machine translation experiments | We use the standard features implemented almost exactly as in Moses: four translation features (phrase-based translation probabilities and lexically-weighted probabilities), word penalty, phrase penalty, linear distortion, and language model score. |
Machine translation experiments | In order to train a competitive baseline given our computational resources, we built a large 5-gram language model using the Xinhua and AFP sections of the Gigaword corpus (LDC2007T40) in addition to the target side of the parallel data. |
Machine translation experiments | The language model was smoothed with the modified Kneser-Ney algorithm as implemented in (Stolcke, 2002), and we only kept 4-grams and 5-grams that occurred at least three times in the training data.
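The count cutoff described above is straightforward to sketch. This is a toy illustration with hypothetical counts, not the SRILM implementation: higher-order n-grams below the threshold are dropped, while lower orders are kept regardless.

```python
def prune(counts, min_count=3, orders=(4, 5)):
    # Keep only those 4-grams and 5-grams seen at least min_count times,
    # mirroring the count cutoff described above; lower orders are kept as-is.
    return {ng: c for ng, c in counts.items()
            if len(ng) not in orders or c >= min_count}

counts = {("a",): 10,
          ("a", "b", "c", "d"): 2,    # pruned: below the threshold
          ("b", "c", "d", "e"): 5}
kept = prune(counts)
```

Such cutoffs trade a small amount of coverage for a large reduction in model size, which matters when the training data runs to billions of tokens.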
Abstract | With this new framework, we employ a target dependency language model during decoding to exploit long-distance word relations, which are unavailable with a traditional n-gram language model.
Dependency Language Model | w_h-as-head represents w_h used as the head, and it is different from w_h in the dependency language model.
Dependency Language Model | In order to calculate the dependency language model score, or depLM score for short, on the fly for |
Discussion | (2003) described a two-step string-to-CFG-tree translation model which employed a syntax-based language model to select the best translation from a target parse forest built in the first step.
Implementation Details | Language model score.
Implementation Details | Dependency language model score.
Introduction | language model during decoding, in order to exploit long-distance word relations which are unavailable with a traditional n-gram language model on target strings. |
Introduction | Section 3 illustrates the use of dependency language models.
String-to-Dependency Translation | Formal definitions also allow us to easily extend the framework to incorporate a dependency language model in decoding. |
String-to-Dependency Translation | Supposing we use a traditional trigram language model in decoding, we need to specify the leftmost two words and the rightmost two words in a state. |
String-to-Dependency Translation | In the next section, we will explain how to extend categories and states to exploit a dependency language model during decoding. |
Abstract | We use translation models and language models to exploit lexical correlations and the character of solution posts, respectively.
Introduction | The cornerstone of our technique is the usage of a hitherto unexplored textual feature, lexical correlations between problems and solutions, that is exploited along with language model based characterization of solution posts. |
Introduction | We model the lexical correlation and solution post character using regularized translation models and unigram language models respectively. |
Our Approach | Consider a unigram language model 83 that models the lexical characteristics of solution posts, and a translation model 73 that models the lexical correlation between problems and solutions. |
Our Approach | In short, each solution word is assumed to be generated from the language model or the translation model (conditioned on the problem words) with a probability of A and l — A respectively, thus accounting for the correlation assumption. |
Our Approach | Of the solution words above, generic words such as try and should could probably be explained by (i.e., sampled from) the solution language model , whereas disconnect and rejoin could be correlated well with surf and wifi and hence are more likely to be supported better by the translation model. |
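The λ-mixture described here can be sketched as a two-component word probability. This is a toy illustration with made-up probabilities and hypothetical names, not the regularized models of the paper: each solution word is scored as a convex combination of the solution language model and the translation model conditioned on the problem words.

```python
def mixture_prob(word, problem_words, lm, trans, lam=0.5):
    # P(word) = lam * P_lm(word) + (1 - lam) * mean over problem words p
    # of P_trans(word | p), mixing the solution LM with the translation model.
    p_lm = lm.get(word, 0.0)
    p_tm = sum(trans.get((word, p), 0.0) for p in problem_words) / len(problem_words)
    return lam * p_lm + (1 - lam) * p_tm

lm = {"try": 0.3, "disconnect": 0.01}          # solution language model
trans = {("disconnect", "wifi"): 0.4}          # P(solution word | problem word)
p = mixture_prob("disconnect", ["wifi", "surf"], lm, trans)
```

As in the passage above, a generic word like "try" draws its support from the LM component, while a problem-correlated word like "disconnect" is rescued by the translation component.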
Related Work | We will use translation and language models in our method for solution identification. |
A Syntax Free Sequence-oriented Sentence Compression Method | As an alternative to syntactic parsing, we propose two novel features, intra-sentence positional term weighting (IPTW) and the patched language model (PLM) for our syntax-free sentence compressor. |
A Syntax Free Sequence-oriented Sentence Compression Method | 3.2.2 Patched Language Model |
A Syntax Free Sequence-oriented Sentence Compression Method | Many studies on sentence compression employ the n-gram language model to evaluate the linguistic likelihood of a compressed sentence. |
Abstract | As an alternative to syntactic parsing, we propose a novel term weighting technique based on the positional information within the original sentence and a novel language model that combines statistics from the original sentence and a general corpus. |
Conclusions | As an alternative to the syntactic parser, we proposed two novel features, intra-sentence positional term weighting (IPTW) and the patched language model (PLM), and showed their effectiveness by conducting automatic and human evaluations.
Experimental Evaluation | We developed the n-gram language model from a 9-year set of Mainichi Newspaper articles.
Introduction | To maintain the subject-predicate relationship in the compressed sentence and retain fluency without using syntactic parsers, we propose two novel features: intra-sentence positional term weighting (IPTW) and the patched language model (PLM). |
Introduction | PLM is a form of summarization-oriented fluency statistics derived from the original sentence and the general language model . |
Results and Discussion | Replacing PLM with the bigram language model (w/o PLM) degrades the performance significantly. |
Results and Discussion | This result shows that the n-gram language model is ill-suited to sentence compression because the n-gram probability is computed by using a corpus that includes both short and long sentences.
Abstract | We propose language modeling methods for solving this problem, and study how to incorporate features such as authority and proximity to accurately estimate the impact language model.
Impact Summarization | To solve these challenges, in the next section, we propose to model impact with unigram language models and score sentences using
Impact Summarization | We further propose methods for estimating the impact language model based on several features including the authority of citations, and the citation proximity. |
Introduction | We propose language models to exploit both the citation context and original content of a paper to generate an impact-based summary. |
Introduction | We study how to incorporate features such as authority and proximity into the estimation of language models.
Introduction | We propose and evaluate several different strategies for estimating the impact language model, which is key to impact summarization.
Language Models for Impact Summarization | 3.1 Impact language models |
Language Models for Impact Summarization | We thus propose to represent such a virtual impact query with a unigram language model . |
Language Models for Impact Summarization | Such a model is expected to assign high probabilities to those words that can describe the impact of paper d, just as we expect a query language model in ad hoc retrieval to assign high probabilities to words that tend to occur in relevant documents (Ponte and Croft, 1998). |
Abstract | We study the feasibility of exploiting cross-lingual context to obtain high-quality translation suggestions that improve over statistical language modelling and word-sense disambiguation baselines. |
Baselines | A second baseline was constructed by weighing the probabilities from the translation table directly with the L2 language model described earlier. |
Baselines | target language modelling) which is also cus-
Introduction | The main research question in this research is how to disambiguate an L1 word or phrase to its L2 translation based on an L2 context, and whether such cross-lingual contextual approaches provide added value compared to baseline models that are not context informed or compared to standard language models.
System | 3.1 Language Model |
System | We also implement a statistical language model as an optional component of our classifier-based system and also as a baseline to compare our system to. |
System | The language model is a trigram-based back-off language model with Kneser-Ney smoothing, computed using SRILM (Stolcke, 2002) and trained on the same training data as the translation model. |
Abstract | R2NN is a combination of recursive neural network and recurrent neural network, and in turn integrates their respective capabilities: (1) new information can be used to generate the next hidden state, like recurrent neural networks, so that language model and translation model can be integrated naturally; (2) a tree structure can be built, like recursive neural networks, so as to generate the translation candidates in a bottom-up manner.
Experiments and Results | The language model is a 5-gram language model trained with the target sentences in the training data. |
Introduction | Recurrent neural networks are leveraged to learn language models, and they keep the history information circularly inside the network for an arbitrarily long time (Mikolov et al., 2010).
Introduction | DNN is also introduced to Statistical Machine Translation (SMT) to learn several components or features of the conventional framework, including word alignment, language modelling, translation modelling and distortion modelling.
Introduction | In recursive neural networks, all the representations of nodes are generated based on their child nodes, and it is difficult to integrate additional global information, such as a language model and a distortion model.
Our Model | A recurrent neural network is usually used for sequence processing, such as language modelling (Mikolov et al., 2010).
Our Model | Commonly used sequence processing methods, such as the Hidden Markov Model (HMM) and the n-gram language model, use only a limited history for prediction.
Our Model | In an HMM, the previous state is used as the history, and for an n-gram language model (for example, n = 3), the history is the previous two words.
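The limited-history point can be made concrete with a toy MLE trigram (a sketch for illustration, not the model described in the paper):

```python
from collections import Counter

def trigram_mle(tokens):
    """MLE trigram model: the history is only the previous two words."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))

    def prob(w, history):
        u, v = history[-2:]  # everything older than two words is ignored
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

    return prob
```

Extending the history beyond two words leaves the prediction unchanged, which is exactly the limitation the recurrent model is meant to overcome.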
Related Work | (2013) extend the recurrent neural network language model in order to use both source- and target-side information to score translation candidates.
Abstract | We propose a language model based on a precise, linguistically motivated grammar (a handcrafted Head-driven Phrase Structure Grammar) and a statistical model estimating the probability of a parse tree. |
Abstract | The language model is applied by means of an N-best rescoring step, which allows us to directly measure the performance gains relative to the baseline system without rescoring.
Introduction | Other linguistically inspired language models, like Chelba and Jelinek (2000) and Roark (2001), have been applied to continuous speech recognition.
Introduction | In the first place, we want our language model to reliably distinguish between grammatical and ungrammatical phrases. |
Introduction | However, their grammar-based language model did not make use of a probabilistic component, and it was applied to a rather simple recognition task (dictation texts for pupils read and recorded under good acoustic conditions, no out-of-vocabulary words). |
Language Model 2.1 The General Approach | The language model weight λ and the word insertion penalty ip lead to better performance in practice, but they have no theoretical justification.
Language Model 2.1 The General Approach | Our grammar-based language model is incorporated into the above expression as an additional probability Pgram(W), weighted by a parameter µ:
Language Model 2.1 The General Approach | A major problem of grammar-based approaches to language modeling is how to deal with out-of-grammar utterances. |
Abstract | Grounded language models represent the relationship between words and the nonlinguistic context in which they are said. |
Abstract | Results show that grounded language models improve perplexity and word error rate over text-based language models and, further, support video information retrieval better than human-generated speech transcriptions.
Introduction | The method is based on the use of grounded language models to represent
Introduction | Grounded language models are based on research from cognitive science on grounded models of meaning. |
Introduction | This paper extends previous work on grounded models of meaning by learning a grounded language model from naturalistic data collected from broadcast video of Major League Baseball games. |
Linguistic Mapping | We model this relationship, much like traditional language models , using conditional probability distributions. |
Linguistic Mapping | Unlike traditional language models, however, our grounded language models condition the probability of a word not only on the word(s) uttered before it, but also on the temporal pattern features that describe the nonlinguistic context in which it was uttered. |
Abstract | Then we model question topic and question focus in a language modeling framework for search. |
Abstract | Experimental results indicate that our approach of identifying question topic and question focus for search significantly outperforms the baseline methods such as Vector Space Model (VSM) and Language Model for Information Retrieval (LMIR). |
Introduction | vector space model, Okapi, language model, and translation-based model, within the setting of question search (Jeon et al., 2005b).
Introduction | On the basis of this, we then propose to model question topic and question focus in a language modeling framework for search. |
Our Approach to Question Search | model question topic and question focus in a language modeling framework for search. |
Our Approach to Question Search | We employ the framework of language modeling (for information retrieval) to develop our approach to question search. |
Our Approach to Question Search | In the language modeling approach to information retrieval, the relevance of a targeted question q̂ to a queried question q is given by the probability p(q|q̂) of generating the queried question q
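The generation probability described here is the standard query-likelihood score; a minimal unigram version with Jelinek-Mercer smoothing might look like the following (parameter names and data are illustrative, not taken from the paper):

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log p(q | d) under a Jelinek-Mercer-smoothed unigram document model."""
    d, c = Counter(doc), Counter(collection)
    score = 0.0
    for t in query:
        # interpolate document model with collection background model
        p = lam * d[t] / len(doc) + (1 - lam) * c[t] / len(collection)
        score += math.log(p)
    return score
```

Documents (here: candidate questions) are then ranked by this score.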
A Skeleton-based Approach to MT 2.1 Skeleton Identification | In this work both the skeleton translation model g_skel(d) and the full translation model g_full(d) resemble the usual forms used in phrase-based MT, i.e., the model score is computed by a linear combination of a group of phrase-based features and language models.
A Skeleton-based Approach to MT 2.1 Skeleton Identification | Given a translation model m, a language model lm and a vector of feature weights w, the model score of a derivation d is computed by |
A Skeleton-based Approach to MT 2.1 Skeleton Identification | lm(d) and w_lm are the score and weight of the language model, respectively.
Evaluation | A 5-gram language model was trained on the Xinhua portion of the English Gigaword corpus in addition to the target side of the bilingual data.
Introduction | We develop a skeletal language model to describe the possibility of a translation skeleton and handle some of the long-distance word dependencies.
Conclusion and perspectives | It would also be interesting to test the impact of another lexical language model, learned on non-SMS sentences.
Evaluation | The language model used in the evaluation is a 3-gram model.
Evaluation | (2008a), who showed on a French corpus comparable to ours that, while using a larger language model is always rewarded, the improvement quickly decreases with every higher level and is already quite small between 2-gram and 3-gram.
Overview of the system | In our system, all lexicons, language models and sets of rules are compiled into finite-state machines (FSMs) and combined with the input text by composition (◦).
Overview of the system | Third, a combination of the lattice of solutions with a language model, and the choice of the best sequence of lexical units.
Related work | A language model is then applied on the word lattice, and the most probable word sequence is finally chosen by applying a best-path algorithm on the lattice. |
The normalization models | All tokens Tj of S are concatenated together and composed with the lexical language model LM. |
The normalization models | 4.6 The language model |
The normalization models | Our language model is an n-gram of lexical forms, smoothed by linear interpolation (Chen and Goodman, 1998), estimated on the normalized part of our training corpus and compiled into a weighted FST LMw. |
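Linear interpolation smoothing of the kind cited (Chen and Goodman, 1998) can be sketched for a bigram over lexical forms; the weights below are placeholders, not estimated values as they would be in the real system:

```python
from collections import Counter

def interpolated_bigram(tokens, l2=0.7, l1=0.2, l0=0.1):
    """Bigram smoothed by linear interpolation with unigram and uniform terms."""
    bi, uni = Counter(zip(tokens, tokens[1:])), Counter(tokens)
    vocab = len(uni)

    def prob(u, w):
        p_bi = bi[(u, w)] / uni[u] if uni[u] else 0.0
        p_uni = uni[w] / len(tokens)
        # weights l2 + l1 + l0 sum to 1, so the mixture stays a distribution
        return l2 * p_bi + l1 * p_uni + l0 / vocab

    return prob
```

In practice the weights would be estimated on held-out data (e.g., by EM) before compiling the model into the weighted FST.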
Abstract | Together with the senses predicted for words in documents, we propose a novel approach to incorporate word senses into the language modeling approach to IR and also exploit the integration of synonym relations. |
Incorporating Senses into Language Modeling Approaches | The next problem is to incorporate the sense information into the language modeling approach. |
Incorporating Senses into Language Modeling Approaches | Given a query q and a document d in text collection C, we want to reestimate the language models by making use of the sense information assigned to them.
Incorporating Senses into Language Modeling Approaches | With this language model, the probability of a query term in a document is enlarged by the synonyms of its senses; the more of its synonym senses appear in a document, the higher the probability.
Introduction | We incorporate word senses into the language modeling (LM) approach to IR (Ponte and Croft, 1998), and utilize sense synonym relations to further improve the performance. |
The Language Modeling Approach to IR | 3.1 The language modeling approach |
The Language Modeling Approach to IR | In the language modeling approach to IR, language models are constructed for each query q and each document d in a text collection C. The documents in C are ranked by their distance to a given query q according to the language models.
The Language Modeling Approach to IR | The most commonly used language model in IR is the unigram model, in which terms are assumed to be independent of each other. |
Abstract | We aim to improve spoken term detection performance by incorporating contextual information beyond traditional N-gram language models . |
Introduction | ASR systems traditionally use N-gram language models to incorporate prior knowledge of word occurrence patterns into prediction of the next word in the token stream. |
Introduction | Yet, though many language models more sophisticated than N-grams have been proposed, N-grams are empirically hard to beat in terms of WER.
Introduction | The strength of this phenomenon suggests it may be more viable for improving term-detection than, say, topic-sensitive language models . |
Motivation | The re-scoring approach we present is closely related to adaptive or cache language models (Jelinek, 1997; Kuhn and De Mori, 1990; Kneser and Steinbiss, 1993).
Motivation | The primary difference between this and previous work on similar language models is the narrower focus here on the term detection task, in which we consider each search term in isolation, rather than all words in the vocabulary. |
Results | We train ASR acoustic and language models from the training corpus using the Kaldi speech recognition toolkit (Povey et al., 2011) following the default BABEL training and search recipe which is described in detail by Chen et al. |
Term and Document Frequency Statistics | A similar phenomenon is observed concerning adaptive language models (Church, 2000). |
Term and Document Frequency Statistics | In general, we can think of using word repetitions to re-score term detection as applying a limited form of adaptive or cache language model (Jelinek, 1997).
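A cache language model in the sense cited above can be sketched as a background unigram interpolated with a cache of recently seen words (a hypothetical minimal version; the interpolation weight is illustrative):

```python
def cache_lm(background, lam=0.8):
    """Unigram cache LM: mix a fixed background model with a word cache.

    `background` maps words to probabilities; unseen words get a tiny floor.
    Returns (prob, observe): score a word, and add a word to the cache.
    """
    cache = []

    def prob(w):
        p_bg = background.get(w, 1e-6)
        p_cache = cache.count(w) / len(cache) if cache else 0.0
        return lam * p_bg + (1 - lam) * p_cache

    def observe(w):
        cache.append(w)

    return prob, observe
```

Once a term has been observed, its probability rises, which is the burstiness effect exploited for re-scoring term detection.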
Term and Document Frequency Statistics | In applying the burstiness quantity to term detection, we recall that the task requires us to locate a particular instance of a term, not estimate a count, hence the utility of N-gram language models predicting words in sequence. |
Abstract | We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models . |
Conclusion | Our new multi-prototype neural language model outperforms previous neural models and competitive baselines on this new dataset. |
Experiments | Table 3 shows our results compared to previous methods, including C&W’s language model and the hierarchical log-bilinear (HLBL) model (Mnih and Hinton, 2008), which is a probabilistic, linear neural model. |
Global Context-Aware Neural Language Model | Note that Collobert and Weston (2008)’s language model corresponds to the network using only local context. |
Introduction | We introduce a new neural-network-based language model that distinguishes and uses both local and global context via a joint training objective. |
Introduction | We show that our multi-prototype model improves upon the single-prototype version and outperforms other neural language models and baselines on this dataset. |
Related Work | Neural language models (Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Schwenk and Gauvain, 2002; Emami et al., 2003) have been shown to be very powerful at language modeling, a task where models are asked to accurately predict the next word given previously seen words.
Related Work | Schwenk and Gauvain (2002) tried to incorporate larger context by combining partial parses of past word sequences and a neural language model . |
Related Work | They used up to 3 previous head words and showed increased performance on language modeling . |
Conclusion & Future Work | Future work includes extending this approach to use multiple translation models with multiple language models in ensemble decoding. |
Experiments & Results 4.1 Experimental Setup | For the mixture baselines, we used a standard one-pass phrase-based system (Koehn et al., 2003), Portage (Sadat et al., 2005), with the following 7 features: relative-frequency and lexical translation model (TM) probabilities in both directions; word-displacement distortion model; language model (LM) and word count. |
Experiments & Results 4.1 Experimental Setup | Fixing the language model allows us to compare various translation model combination techniques. |
Introduction | Common techniques for model adaptation adapt two main components of contemporary state-of-the-art SMT systems: the language model and the translation model. |
Introduction | However, language model adaptation is a more straightforward problem compared to |
Introduction | translation model adaptation, because various measures such as perplexity of adapted language models can be easily computed on data in the target domain. |
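That perplexity check on target-domain data can be sketched with a toy add-alpha unigram model (all names and data are illustrative; lower perplexity indicates the better-adapted model):

```python
import math
from collections import Counter

def unigram_perplexity(model_counts, text, alpha=1.0):
    """Perplexity of an add-alpha-smoothed unigram model on target-domain text."""
    total = sum(model_counts.values())
    vocab = len(model_counts) + 1  # +1 slot for unseen words
    ll = 0.0
    for w in text:
        p = (model_counts.get(w, 0) + alpha) / (total + alpha * vocab)
        ll += math.log2(p)
    return 2 ** (-ll / len(text))
```

Comparing perplexities of candidate adapted models on a target-domain sample is exactly the easy-to-compute measure the text refers to.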
Related Work 5.1 Domain Adaptation | They use language model perplexities from IN to select relevant sentences from OUT.
Discussion | This is particularly true when the sentence structure is defined in a language model that is psycholinguistically plausible (here, bounded-memory right-corner form). |
Discussion | This accords with an understated result of Boston et al.’s eye-tracking study (2008a): a richer language model predicts eye movements during reading better than an oversimplified one. |
Discussion | Frank (2009) similarly reports improvements in the reading-time predictiveness of unlexicalized surprisal when using a language model that is more plausible than PCFGs.
Introduction | Ideally, a psychologically-plausible language model would produce a surprisal that would correlate better with linguistic complexity. |
Introduction | Therefore, the specification of how to encode a syntactic language model is of utmost importance to the quality of the metric. |
Introduction | The purpose of this paper is to determine whether the language model defined by the HHMM parser can also predict reading times — it would be strange if a psychologically plausible model did not also produce viable complexity metrics.
Parsing Model | Both of these metrics fall out naturally from the time-series representation of the language model.
Parsing Model | With the understanding of what operations need to occur, a formal definition of the language model is in order. |
Abstract | Long-span features, such as syntax, can improve language models for tasks such as speech recognition and machine translation. |
Abstract | However, these language models can be difficult to use in practice because of the time required to generate features for rescoring a large hypothesis set. |
Abstract | When using these improved tools in a language model for speech recognition, we obtain significant speed improvements with both N-best and hill-climbing rescoring, and show that up-training leads to WER reduction.
Conclusion | The computational complexity of accurate syntactic processing can make structured language models impractical for applications such as ASR that require scoring hundreds of hypotheses per input. |
Incorporating Syntactic Structures | These are then passed to the language model along with the word sequence for scoring. |
Introduction | Language models (LM) are crucial components in tasks that require the generation of coherent natural language text, such as automatic speech recognition (ASR) and machine translation (MT). |
Related Work | The lattice parser therefore, is itself a language model . |
Syntactic Language Models | There have been several approaches to include syntactic information in both generative and discriminative language models . |
Syntactic Language Models | Structured language modeling incorporates syntactic parse trees to identify the head words in a hypothesis for modeling dependencies beyond n-grams. |
Syntactic Language Models | Our Language Model . |
Abstract | We consider the prediction of three human behavioral measures — lexical decision, word naming, and picture naming —through the lens of domain bias in language modeling . |
Abstract | This study aims to provoke increased consideration of the human language model by NLP practitioners: biases are not limited to differences between corpora (i.e. |
Discussion | Our analyses reveal that 6 commonly used corpora fail to reflect the human language model in various ways related to dialect, modality, and other properties of each corpus. |
Discussion | Our results point to a type of bias in commonly used language models that has been previously overlooked. |
Discussion | Just as language models have been used to predict reading grade-level of documents (Collins-Thompson and Callan, 2004), human language models could be |
Introduction | Computational linguists build statistical language models for aiding in natural language processing (NLP) tasks. |
Introduction | In the current study, we exploit errors of the latter variety—failure of a language model to predict human performance—to investigate bias across several frequently used corpora in computational linguistics. |
Introduction | : Human Language Model |
Experimental Results | We trained all of the Moses systems herein using the standard features: language model , reordering model, translation model, and word penalty; in addition to these, the factored experiments called for additional translation and generation features for the added factors as noted above. |
Experimental Results | For the language models, we used SRILM 5-gram language models (Stolcke, 2002) for all factors.
Experimental Results | koske+ +va+ +A mietinto+ +A kasi+ +te+ +lla+ +a+ +n language model disambiguation:
Models 2.1 Baseline Models | Morphology generation models can use a variety of bilingual and contextual information to capture dependencies between morphemes, often more long-distance than what is possible using n-gram language models over morphemes in the segmented model. |
Models 2.1 Baseline Models | is to take the abstract suffix tag sequence and then map it into fully inflected word forms, and rank those outputs using a morphemic language model.
Models 2.1 Baseline Models | After CRF-based recovery of the suffix tag sequence, we use a bigram language model trained on a fully segmented version of the training data to recover the original vowels.
Related Work | They use a segmented phrase table and language model along with the word-based versions in the decoder and in tuning a Finnish target.
Related Work | In their work a segmented language model can score a translation, but cannot insert morphology that does not show source-side reflexes. |
Calculation of Cross-Entropy | Various methods for computing cross-entropy have been proposed, and these can be roughly classified into two types, based on different methods of universal coding and language modeling.
Calculation of Cross-Entropy | For example, (Benedetto et al., 2002) and (Cilibrasi and Vitanyi, 2005) used the universal coding approach, whereas (Teahan and Harper, 2001) and (Sibun and Reynar, 1996) were based on language modeling using PPM and Kullback-Leibler divergence, respectively.
Calculation of Cross-Entropy | As a representative method for calculating the cross-entropy through statistical language modeling, we adopt prediction by partial matching (PPM), a language-based encoding method devised by Cleary and Witten (1984).
In the experiments reported here, n is set to 5 throughout. | gives the description length of the remaining characters under the language model for L.
Introduction | They used statistical language modeling and heuristics to detect foreign words and tested the case of English embedded in German texts. |
Problem Formulation | In our setting, we assume that a small amount (up to kilobytes) of monolingual plain text sample data is available for every language, e.g., the Universal Declaration of Human Rights, which serves to generate the language model used for language identification. |
Problem Formulation | calculates the description length of a text segment X_i through the use of a language model for L_i.
Problem Formulation | Here, the first term corresponds to the code length of the text chunk X_i given a language model for L_i, which in fact corresponds to the cross-entropy of X_i for L_i multiplied by |X_i|. The remaining terms give the code lengths of the parameters needed to describe the first term: the second term corresponds to the segment location; the third term, to the identified language; and the fourth term, to the language model of language L_i.
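The two-part code just described can be sketched with an add-one character model (a simplified, hypothetical version of the criterion; `model_bits` stands in for the cost of encoding the language model itself, and the location cost here is a crude placeholder):

```python
import math
from collections import Counter

def description_length(segment, sample, n_languages, model_bits=0.0):
    """Code length for labeling `segment` with the language of `sample`."""
    counts = Counter(sample)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen characters
    # cross-entropy of the segment under the sample's character model
    cross_entropy = -sum(
        math.log2((counts.get(c, 0) + 1) / (total + vocab)) for c in segment
    ) / len(segment)
    data_bits = len(segment) * cross_entropy       # first term: |X| * H(X; L)
    location_bits = math.log2(len(segment) + 1)    # where the segment ends
    language_bits = math.log2(n_languages)         # which language was chosen
    return data_bits + location_bits + language_bits + model_bits
```

The language whose model yields the smallest total description length is chosen for the segment.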
Abstract | In this paper we analyze reading times in terms of a single predictive measure which integrates a model of semantic composition with an incremental parser and a language model . |
Integrating Semantic Constraint into Surprisal | While surprisal is a theoretically well-motivated measure, formalizing the idea of linguistic processing being highly predictive in terms of probabilistic language models, the measurement of semantic constraint in terms of vector similarities lacks a clear motivation.
Integrating Semantic Constraint into Surprisal | This can be achieved by turning a vector model of semantic similarity into a probabilistic language model . |
Integrating Semantic Constraint into Surprisal | There are in fact a number of approaches to deriving language models from distributional models of semantics (e.g., Bellegarda 2000; Coccaro and Jurafsky 1998; Gildea and Hofmann 1999). |
Models of Processing Difficulty | The basic idea is that the processing costs relating to the expectations of the language processor can be expressed in terms of the probabilities assigned by some form of language model to the input. |
Models of Processing Difficulty | Surprisal could also be defined using a vanilla language model that does not take any structural or grammatical information into account (Frank 2009).
Translation Model Architecture | We train a language model on the source language side of each of the n component bitexts, and compute an n-dimensional vector for each sentence by computing its entropy with each language model . |
Translation Model Architecture | Our aim is not to discriminate between sentences that are more likely and unlikely in general, but to cluster on the basis of relative differences between the language model entropies. |
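The per-sentence entropy vector described here can be sketched with unigram component models (add-one smoothing; the actual system would use full n-gram language models, so treat this as a minimal illustration):

```python
import math
from collections import Counter

def entropy_vector(sentence, component_models):
    """Cross-entropy of `sentence` under the unigram model of each component corpus."""
    vec = []
    for counts in component_models:
        total = sum(counts.values())
        vocab = len(counts) + 1  # +1 slot for unseen words
        h = -sum(
            math.log2((counts.get(w, 0) + 1) / (total + vocab)) for w in sentence
        ) / len(sentence)
        vec.append(h)
    return vec
```

Sentences can then be clustered on the relative differences between these entropy coordinates rather than on their absolute likelihoods.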
Translation Model Architecture | While it is not the focus of this paper, we also evaluate language model adaptation. |
A Generic Phrase Training Procedure | Each normalized feature score derived from word alignment models or language models will be log-linearly combined to generate the final score. |
Discussions | We propose several information metrics derived from posterior distribution, language model and word alignments as feature functions. |
Experimental Results | Like other log-linear model based decoders, active features in our translation engine include translation models in two directions, lexicon weights in two directions, language model, lexicalized distortion models, sentence length penalty and other heuristics.
Experimental Results | The language model is a statistical trigram model estimated with Modified Kneser-Ney smoothing (Chen and Goodman, 1996) using only English sentences in the parallel training data.
Features | All these features are data-driven and defined based on models, such as statistical word alignment model or language model . |
Features | We apply a language model (LM) to describe the predictive uncertainty (PU) between words in two directions. |
Features | Given a history w_1^{n-1}, a language model specifies a conditional distribution of the future word being predicted to follow the history.
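Predictive uncertainty in one direction can be sketched as the entropy of a bigram next-word distribution (illustrative only; the feature in the paper is computed in both directions and from full alignment-aware models):

```python
import math
from collections import Counter

def predictive_uncertainty(tokens, history_word):
    """Entropy of the next-word distribution after `history_word` (MLE bigram).

    Higher entropy means the continuation is less predictable.
    """
    following = Counter(w for u, w in zip(tokens, tokens[1:]) if u == history_word)
    total = sum(following.values())
    return -sum((c / total) * math.log2(c / total) for c in following.values())
```

A word that is always followed by the same successor has zero uncertainty; a word with several possible successors has positive entropy.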
Clustering-based word representations | So it is a class-based bigram language model . |
Clustering-based word representations | Deschacht and Moens (2009) use a latent-variable language model to improve semantic role labeling. |
Distributed representations | Word embeddings are typically induced using neural language models , which use neural networks as the underlying predictive model (Bengio, 2008). |
Distributed representations | Historically, training and testing of neural language models has been slow, scaling as the size of the vocabulary for each model computation (Bengio et al., 2001; Bengio et al., 2003). |
Distributed representations | Collobert and Weston (2008) presented a neural language model that could be trained over billions of words, because the gradient of the loss was computed stochastically over a small sample of possible outputs, in a spirit similar to Bengio and Sénecal (2003). |
Introduction | Neural language models (Bengio et al., 2001; Schwenk & Gauvain, 2002; Mnih & Hinton, 2007; Collobert & Weston, 2008), on the other hand, induce dense real-valued low-dimensional |
Introduction | (See Bengio (2008) for a more complete list of references on neural language models .) |
Unlabled Data | These auxiliary tasks are sometimes specific to the supervised task, and sometimes general language modeling tasks like “predict the missing word”. |
Inflection prediction models | We stemmed the reference translations, predicted the inflection for each stem, and measured the accuracy of prediction, using a set of sentences that were not part of the training data (1K sentences were used for Arabic and 5K for Russian). Our model performs significantly better than both the random and trigram language model baselines, and achieves an accuracy of over 91%, which suggests that the model is effective when its input is clean in its stem choice and order.
Integration of inflection models with MT systems | Given such a list of candidate stem sequences, the base MT model together with the inflection model and a language model choose a translation Y* as follows: |
Integration of inflection models with MT systems | P_LM is the joint probability of the sequence of inflected words according to a trigram language model (LM).
Integration of inflection models with MT systems | In addition, stemming the target sentences reduces the sparsity in the translation tables and language model , and is likely to impact positively the performance of an MT system in terms of its ability to recover correct sequences of stems in the target. |
Introduction | (Goldwater and McClosky, 2005), while the application of a target language model has almost solely been responsible for addressing the second aspect. |
Machine translation systems and data | (2003), a trigram target language model , two order models, word count, phrase count, and average phrase size functions. |
Machine translation systems and data | The features include log-probabilities according to inverted and direct channel models estimated by relative frequency, lexical weighting channel models, a trigram target language model , distortion, word count and phrase count. |
Machine translation systems and data | For each language pair, we used a set of parallel sentences (train) for training the MT system sub-models (e.g., phrase tables, language model ), a set of parallel sentences (lambda) for training the combination weights with max-BLEU training, a set of parallel sentences (dev) for training a small number of combination parameters for our integration methods (see Section 5), and a set of parallel sentences (test) for final evaluation. |
Background | OpenCCG implements a symbolic-statistical chart realization algorithm (Kay, 1996; Carroll et al., 1999; White, 2006b) combining (1) a theoretically grounded approach to syntax and semantic composition with (2) factored language models (Bilmes and Kirchhoff, 2003) for making choices among the options left open by the grammar.
Background | makes use of n-gram language models over words represented as vectors of factors, including surface form, part of speech, supertag and semantic class. |
Background | 2.3 Factored Language Models |
Introduction | Assigned categories are instantiated in OpenCCG’s chart realizer where, together with a treebank-derived syntactic grammar (Hockenmaier and Steedman, 2007) and a factored language model (Bilmes and Kirchhoff, 2003), they constrain the English word-strings that are chosen to express the LF. |
The Approach | Table 1: Percentage of complete realizations using an oracle n-gram model versus the best performing factored language model . |
The Approach | As shown in Table 1, with the large grammar derived from the training sections, many fewer complete realizations are found (before timing out) using the factored language model than are possible, as indicated by the results of using the oracle model.
Experiments | As described in Section 3.2, the weight of each variable is a linear combination of the language model score, three classifier confidence scores, and three classifier disagreement scores. |
Experiments | We use the Web 1T 5-gram corpus (Brants and Franz, 2006) to compute the language model score for a sentence.
Experiments | Finally, the language model score, classifier confidence scores, and classifier disagreement scores are normalized to take values in [0, 1], based on the H00 2011 development data. |
Inference with First Order Variables | The language model score h(s', LM) of s', based on a large web corpus;
Inference with First Order Variables | Next, to compute the weights, we collect the language model score and confidence scores from the article (ART), preposition (PREP), and noun number (NOUN) classifiers, i.e., E = {ART, PREP, NOUN}.
Inference with Second Order Variables | When measuring the gain due to setting the second-order variable to 1 (change cat to cats), the associated noun number weight is likely to be small, since A cats will get a low language model score, a low article classifier confidence score, and a low noun number classifier confidence score.
Related Work | Features used in classification include surrounding words, part-of—speech tags, language model scores (Gamon, 2010), and parse tree structures (Tetreault et al., 2010). |
Abstract | The selection is made according to the appropriateness of the alteration to the query context (using a bigram language model ), or according to its expected impact on the retrieval effectiveness (using a regression model). |
Bigram Expansion Model for Alteration Selection | The query context is modeled by a bigram language model as in (Peng et al. |
Bigram Expansion Model for Alteration Selection | In this work, we used bigram language model to calculate the probability of each path. |
Bigram Expansion Model for Alteration Selection | P(e1, e2, ..., ei, ..., en) = P(e1) ∏_{k=2}^{n} P(ek | ek-1)   (2). P(ek | ek-1) is estimated with a back-off bigram language model (Goodman, 2001).
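Equation (2) can be sketched with a crude backed-off bigram estimate for scoring alteration paths (this uses a stupid-backoff-style fallback rather than the Katz-style back-off model cited; all names and data are illustrative):

```python
import math
from collections import Counter

def path_logprob(path, tokens, lam=0.4):
    """log P(e_1..e_n) = log P(e_1) + sum_k log P(e_k | e_{k-1})."""
    bi, uni = Counter(zip(tokens, tokens[1:])), Counter(tokens)
    n = sum(uni.values())

    def p(w, u=None):
        if u is not None and bi[(u, w)]:
            return bi[(u, w)] / uni[u]          # observed bigram
        return lam * (uni[w] + 1) / (n + len(uni) + 1)  # backed-off unigram

    return math.log(p(path[0])) + sum(
        math.log(p(w, u)) for u, w in zip(path, path[1:]))
```

The alteration path with the highest score under (2) is the one selected as fitting the query context best.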
Conclusion | In the first method proposed — the Bigram Expansion model, query context is modeled by a bigram language model . |
Introduction | The query context is modeled by a bigram language model . |
Related Work | 2007), a bigram language model is used to determine the alteration of the head word that best fits the query. |
Related Work | In this paper, one of the proposed methods will also use a bigram language model of the query to determine the appropriate alteration candidates. |
Bayesian MT Decipherment via Hash Sampling | Secondly, for Bayesian inference we need to sample from a distribution that involves computing probabilities for all the components ( language model , translation model, fertility, etc.) |
Bayesian MT Decipherment via Hash Sampling | Note that the (translation) model in our case consists of multiple exponential families components—a multinomial pertaining to the language model (which remains fixed5), and other components pertaining to translation probabilities P9(fi|ei), fertility ngert, etc. |
Bayesian MT Decipherment via Hash Sampling | where pold(·), pnew(·) are the true conditional likelihood probabilities according to our model (including the language model component) for the old and new sample, respectively.
Decipherment Model for Machine Translation | For P(e), we use a word n-gram language model (LM) trained on monolingual target text. |
Decipherment Model for Machine Translation | Generate a target (e.g., English) string e = e1...en, with probability P(e) according to an n-gram language model.
Experiments and Results | The latter is used to construct a target language model used for decipherment training. |
Experiments and Results | Overall, using a 3-gram language model (instead of 2-gram) for decipherment training improves the performance for all methods. |
Experiment | The SRILM Toolkit (Stolcke, 2002) is employed to train 4-gram language models on the Xinhua portion of the Gigaword corpus, while for the IWSLT2012 data set, only its training set is used.
Experiment | The similarity between the data from each domain and the test data is calculated using the perplexity measure with a 5-gram language model.
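Perplexity, as used here for domain similarity, is just the exponential of the negative average per-token log-probability; a minimal sketch (in practice the token log-probabilities would come from the 5-gram model):

```python
import math

def perplexity(token_log_probs):
    """exp of the negative average per-token log-probability.
    Lower perplexity means the language model fits the data better."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that gives every token probability 1/4 has perplexity 4.
uniform4 = [math.log(0.25)] * 10
```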
Hierarchical Phrase Table Combination | The Pitman-Yor process is also employed in n-gram language models that are hierarchically represented through the hierarchical Pitman-Yor process with switch priors to integrate different domains at all levels (Wood and Teh, 2009).
Phrase Pair Extraction with Unsupervised Phrasal ITGs | Pbase is a base measure defined as a combination of the IBM Models in two directions and the unigram language models on both sides.
Related Work | The translation model and language model are primary components in SMT. |
Related Work | Previous work proved successful in the use of large-scale data for language models from diverse domains (Brants et al., 2007; Schwenk and Koehn, 2008). |
Related Work | Alternatively, the language model is incrementally updated by using a succinct data structure with an interpolation technique (Levenberg and Osborne, 2009; Levenberg et al., 2011).
Approach | In order to estimate the error-rate, we build a trigram language model (LM) using ukWaC (ukWaC LM) (Ferraresi et al., 2008), a large corpus of English containing more than 2 billion tokens. |
Approach | Next, we extend our language model with trigrams extracted from a subset of the texts contained in the |
Approach | As the CLC contains texts produced by second language learners, we only extract frequently occurring trigrams from highly ranked scripts to avoid introducing erroneous ones to our language model.
Evaluation | Extending our language model with frequent trigrams extracted from the CLC improves Pearson’s and Spearman’s correlation by 0.006 and 0.015 respectively. |
Evaluation | This suggests that there is room for improvement in the language models we developed to estimate the error-rate. |
Semi-supervised Parsing with Large Data | These relations are captured by word clustering, lexical dependencies, and a dependency language model , respectively. |
Semi-supervised Parsing with Large Data | 4.3 Structural Relations: Dependency Language Model |
Semi-supervised Parsing with Large Data | The dependency language model was proposed by Shen et al.
Experiments | For the out-of-domain data, we build the phrase table and reordering table using the 2.08 million Chinese-to-English sentence pairs, and we use the SRILM toolkit (Stolcke, 2002) to train the 5-gram English language model with the target part of the parallel sentences and the Xinhua portion of the English Gigaword. |
Experiments | An in-domain 5-gram English language model is trained on the 1 million sentences of target-side monolingual data.
Experiments | (2008) regards the in-domain lexicon with corpus translation probability as another phrase table and further uses the in-domain language model besides the out-of-domain language model.
Probabilistic Bilingual Lexicon Acquisition | In order to assign probabilities to each entry, we apply the Corpus Translation Probability which is used in (Wu et al., 2008): given in-domain source-language monolingual data, we translate this data with the phrase-based model trained on the out-of-domain News data, the in-domain lexicon, and the in-domain target language monolingual data (for language model estimation).
Related Work | For the target-side monolingual data, they simply use it to train the language model; for the source-side monolingual data, they employ a baseline (word-based SMT or phrase-based SMT trained with a small-scale bitext) to first translate the source sentences, combine each source sentence with its target translation as a bilingual sentence pair, and then train a new phrase-based SMT system with these pseudo sentence pairs.
A Class-based Model of Agreement | However, in MT, we seek a measure of sentence quality q(e) that is comparable across different hypotheses on the beam (much like the n-gram language model score).
A Class-based Model of Agreement | We trained a simple add-1 smoothed bigram language model over gold class sequences in the same treebank training data: |
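An add-1 (Laplace) smoothed bigram model over class sequences can be sketched as follows; the class sequences below are invented stand-ins for the treebank annotations:

```python
from collections import Counter

# Hypothetical gold class sequences (stand-ins for the treebank data).
sequences = [["Det", "Noun", "Verb"], ["Det", "Adj", "Noun"], ["Noun", "Verb"]]

classes = sorted({c for seq in sequences for c in seq} | {"<s>"})
V = len(classes)
bigram_counts = Counter()
context_counts = Counter()
for seq in sequences:
    padded = ["<s>"] + seq
    for prev, cur in zip(padded, padded[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def p_add1(cur, prev):
    """Add-1 smoothed bigram probability: (c(prev,cur) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + V)
```

Adding one to every count keeps unseen class transitions at a small non-zero probability while the distribution over next classes still sums to one.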
Experiments | Our distributed 4-gram language model was trained on 600 million words of Arabic text, also collected from many sources including the Web (Brants et al., 2007).
Inference during Translation Decoding | With a trigram language model , the state might be the last two words of the translation prefix. |
Introduction | Intuition might suggest that the standard n-gram language model (LM) is sufficient to handle agreement phenomena.
Related Work | Monz (2011) recently investigated parameter estimation for POS-based language models, but his classes did not include inflectional features.
Related Work | One exception was the quadratic-time dependency language model presented by Galley and Manning (2009). |
Experimental Setup | The language model is trained using a 9 GB English corpus. |
Statistical Paraphrase Generation | Our SPG model contains three sub-models: a paraphrase model, a language model, and a usability model, which control the adequacy, fluency,
Statistical Paraphrase Generation | Language Model: We use a trigram language model in this work. |
Statistical Paraphrase Generation | The language model based score for the paraphrase t is computed as: |
Related Work | (2012) adopt the tweets with emoticons to smooth the language model and Hu et al. |
Related Work | With the revival of interest in deep learning (Bengio et al., 2013), incorporating the continuous representation of a word as features has been proving effective in a variety of NLP tasks, such as parsing (Socher et al., 2013a), language modeling (Bengio et al., 2003; Mnih and Hinton, 2009) and NER (Turian et al., 2010). |
Related Work | The training objective is that the original ngram is expected to obtain a higher language model score than the corrupted ngram by a margin of 1. |
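That training objective is a standard margin (hinge) ranking loss; a minimal sketch, with the n-gram scoring function itself assumed to exist elsewhere:

```python
def margin_ranking_loss(score_original, score_corrupted, margin=1.0):
    """Zero loss once the original n-gram outscores the corrupted
    n-gram by at least the margin; linear penalty otherwise."""
    return max(0.0, margin - score_original + score_corrupted)
```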
Eliciting Addressee’s Emotion | We use GIZA++ and SRILM for learning the translation model and 5-gram language model, respectively.
Eliciting Addressee’s Emotion | We use the emotion-tagged dialogue corpus to learn eight translation models and language models, each of which is specialized in generating the response that elicits one of the eight emotions (Plutchik, 1980).
Eliciting Addressee’s Emotion | In this case, the first two utterances are used to learn the translation model, while only the second utterance is used to learn the language model.
Experiments | Table 6: The number of utterance pairs used for training classifiers in emotion prediction and learning the translation models and language models in response generation. |
Experiments | We use the utterance pairs summarized in Table 6 to learn the translation models and language models for eliciting each emotional category. |
Related Work | The linear interpolation of translation and/or language models is a widely-used technique for adapting machine translation systems to new domains (Sennrich, 2012). |
Abstract | Experiments on parsing and a language modeling problem show that the algorithm is efficient and effective in practice. |
Experiments on Parsing | 8 Experiments on the Saul and Pereira (1997) Model for Language Modeling |
Experiments on Parsing | We now describe a second set of experiments, on the Saul and Pereira (1997) model for language modeling.
Experiments on Parsing | We performed the language modeling experiments for a number of reasons. |
Introduction | We describe experiments on learning of L-PCFGs, and also on learning of the latent-variable language model of Saul and Pereira (1997). |
Conclusions | The problem is essentially one of generating multiple candidate sentences with the unattached function words ambiguously positioned (say in a lattice) and then using a second language model to rerank these sentences to select the target sentence.
Experimental Setup and Results | Furthermore, in factored models, we can employ different language models for different factors. |
Experimental Setup and Results | We believe that the use of multiple language models (some much less sparse than the surface LM) in the factored baseline is the main reason for the improvement. |
Experimental Setup and Results | 3.2.3 Experiments with higher-order language models |
Introduction | The main reason given for these problems was that the same statistical translation, reordering and language modeling mechanisms were being employed to both determine the morphological structure of the words and, at the same time, get the global order of the words correct. |
Abstract | We leverage recently-developed techniques for learning representations of text using latent-variable language models, and extend these techniques to ones that provide the kinds of features that are useful for semantic role labeling.
Introduction | Using latent-variable language models, we learn representations of texts that provide novel kinds of features to our supervised learning algorithms.
Introduction | The next section provides background information on learning representations for NLP tasks using latent-variable language models . |
Introduction | 2 Open-Domain Representations Using Latent-Variable Language Models |
Experiments | In the experiments, the language model is a Chinese 5-gram language model trained with the Chinese part of the LDC parallel corpus and the Xinhua part of the Chinese Gigaword corpus with about 27 million words.
Experiments | In the tables, Lm denotes the n-gram language model feature, Tmh denotes the feature of collocation between target head words and the candidate measure word, Smh denotes the feature of collocation between source head words and the candidate measure word, HS denotes the feature of source head word selection, Punc denotes the feature of target punctuation position, Tlex denotes surrounding word features in translation, Slex denotes surrounding word features in the source sentence, and Pos denotes the Part-Of-Speech feature.
Introduction | Moreover, Chinese measure words often have a long-distance dependency to their head words, which makes the language model ineffective in selecting the correct measure words from the measure word candidate set.
Introduction | In this case, an n-gram language model with n<15 cannot capture the MW-HW collocation. |
Model Training and Application 3.1 Training | We used the SRI Language Modeling Toolkit (Stolcke, 2002) to train a five-gram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998). |
Our Method | For target features, n-gram language model score is defined as the sum of log n-gram probabilities within the target window after the measure |
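That windowed feature can be sketched as a sum of log bigram probabilities; the probability table and the floor for unseen bigrams are hypothetical:

```python
import math

# Hypothetical bigram probabilities around a candidate measure word.
bigram_p = {("one", "cup"): 0.3, ("cup", "tea"): 0.2, ("one", "sheet"): 0.05}

def window_score(tokens, start, width):
    """Sum of log bigram probabilities for the `width` positions
    following `start` (unseen bigrams get a small floor probability)."""
    score = 0.0
    for i in range(start, min(start + width, len(tokens) - 1)):
        score += math.log(bigram_p.get((tokens[i], tokens[i + 1]), 1e-4))
    return score
```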
Our Method | Target features: n-gram language model score, MW-HW collocation, surrounding words, punctuation position. Source features: MW-HW collocation, surrounding words, source head word, POS tags.
Experiments | Illustrated by the highlighted states in Figure 6, the LM-HMM model conflates interactions that commonly occur at the beginning and end of a dialogue, i.e., “acknowledge agent” and “resolve problem”, since their underlying language models are likely to produce similar probability distributions over words.
Experiments | By incorporating topic information, our proposed models (e.g., TM-HMMSS in Figure 5) are able to enforce the state transitions towards more frequent flow patterns, which further helps to overcome the weakness of the language model.
Latent Structure in Dialogues | The simplest formulation we consider is an HMM where each state contains a unigram language model (LM), proposed by Chotimongkol (2008) for task-oriented dialogue and originally |
Latent Structure in Dialogues | 3: For each word in utterance n, first choose a word source r according to τ, and then depending on r, generate a word w either from the session-wide topic distribution θ or the language model specified by the state sn.
Latent Structure in Dialogues | Note that a TM-HMMS model with state-specific topic models (instead of state-specific language models) would be subsumed by TM-HMM, since one topic could be used as the background topic in TM-HMMS.
Abstract | In this work, we construct a statistical model of grammaticality using various linguistic features (e.g., misspelling counts, parser outputs, n-gram language model scores). |
Discussion and Conclusions | While Post found that such a system can effectively distinguish grammatical news text sentences from sentences generated by a language model, measuring the grammaticality of real sentences from language learners seems to require a wider variety of features, including n-gram counts, language model scores, etc.
Experiments | To create further baselines for comparison, we selected the following features that represent ways one might approximate grammaticality if a comprehensive model was unavailable: whether the link parser can fully parse the sentence (complete_link), the Gigaword language model score (gigaword_avglogprob), and the number of misspelled tokens (nummisspelled).
System Description | 3.2.2 n-gram Count and Language Model Features |
System Description | The model computes the following features from a 5-gram language model trained on the same three sections of English Gigaword using the SRILM toolkit (Stolcke, 2002): |
System Description | Finally, the system computes the average log-probability and number of out-of-vocabulary words from a language model trained on a collection of essays written by nonnative English speakers (“nonnative LM”).
Abstract | Recent work has shown success in using neural network language models (NNLMs) as features in MT systems. |
Introduction | Initially, these models were primarily used to create n-gram neural network language models (NNLMs) for speech recognition and machine translation (Bengio et al., 2003; Schwenk, 2010). |
Introduction | Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window.
Model Variations | In particular, we can reverse the translation direction of the languages, as well as the direction of the language model . |
Model Variations | • 5-gram Kneser-Ney LM • Recurrent neural network language model (RNNLM) (Mikolov et al., 2010)
Neural Network Joint Model (NNJM) | Fortunately, neural network language models are able to elegantly scale up and take advantage of arbitrarily large context sizes.
Discriminative Synchronous Transduction | ilar to the methods for decoding with a SCFG intersected with an n-gram language model, which require language model contexts to be stored in each chart cell. |
Discussion and Further Work | To do so would require integrating a language model feature into the max-translation decoding algorithm. |
Evaluation | The feature set includes: a trigram language model (lm) trained |
Evaluation | To compare our model directly with these systems we would need to incorporate additional features and a language model, work which we have left for a later date.
Evaluation | The relative scores confirm that our model, with its minimalist feature set, achieves comparable performance to the standard feature set without the language model . |
Experiments | 3-gram (news-commentary) and 5-gram (Europarl) language models are trained on the data described in Table 1, using the SRILM toolkit (Stolcke, 2002) and binarized for efficient querying using kenlm (Heafield, 2011).
Experiments | For the 5-gram language models, we replaced every word in the LM training data that did not appear in the English part of the parallel training data with <unk> to build an open-vocabulary language model.
Experiments | Absolute improvements would be possible, e.g., by using larger language models or by adding news data to the ep training set when evaluating on crawl test sets (see, e.g., Dyer et al.
Introduction | The standard SMT training pipeline combines scores from large count-based translation models and language models with a few other features and tunes these using the well-understood line-search technique for error minimization of Och (2003). |
Introduction | The modeler’s goals might be to identify complex properties of translations, or to counter errors of pre-trained translation models and language models by explicitly down-weighting translations that exhibit certain undesired properties. |
Abstract | In this work we present two extensions to the well-known dynamic programming beam search in phrase-based statistical machine translation (SMT), aiming at increased efficiency of decoding by minimizing the number of language model computations and hypothesis expansions. |
Abstract | Our results show that language model based pre-sorting yields a small improvement in translation quality and a speedup by a factor of 2. |
Experimental Evaluation | The English language model is a 4-gram LM created with the SRILM toolkit (Stolcke, 2002) on all bilingual and parts of the provided monolingual data. |
Introduction | Research efforts to increase search efficiency for phrase-based MT (Koehn et al., 2003) have explored several directions, ranging from generalizing the stack decoding algorithm (Ortiz et al., 2006) to additional early pruning techniques (Delaney et al., 2006), (Moore and Quirk, 2007) and more efficient language model (LM) querying (Heafield, 2011). |
Introduction | with Language Model LookAhead
Search Algorithm Extensions | 2.2 Language Model LookAhead |
Experiments | Although we did not examine the accuracy of real tasks in this paper, there is an interesting report that the word error rate of language models follows a power law with respect to perplexity (Klakow and Peters, 2002). |
Introduction | Removing low-frequency words from a corpus (often called cutoff) is a common practice to save on the computational costs involved in learning language models and topic models.
Introduction | In the case of language models, we often have to remove low-frequency words because of a lack of computational resources, since the feature space of k-grams tends to be so large that we sometimes need cutoffs even in a distributed environment (Brants et al., 2007).
Perplexity on Reduced Corpora | Constant restoring is similar to the additive smoothing defined by p(w) ∝ p′ + λ, which is used to solve the zero-frequency problem of language models (Chen and Goodman, 1996).
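Additive (Lidstone) smoothing of this form can be sketched as follows; the counts, vocabulary, and λ value are illustrative, not taken from the paper:

```python
def additive_smooth(counts, vocab, lam=0.5):
    """p(w) proportional to count(w) + lambda, so unseen words in the
    vocabulary still receive non-zero probability."""
    denom = sum(counts.get(w, 0) for w in vocab) + lam * len(vocab)
    return {w: (counts.get(w, 0) + lam) / denom for w in vocab}

probs = additive_smooth({"a": 3, "b": 1}, ["a", "b", "c"], lam=1.0)
```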
Perplexity on Reduced Corpora | This means that we can determine the rough sparseness of k-grams and adjust some of the parameters such as the gram size k in learning statistical language models.
Perplexity on Reduced Corpora | LDA is a probabilistic language model that generates a corpus as a mixture of hidden topics, and it allows us to infer two parameters: the document-topic distribution θ that represents the mixture rate of topics in each document, and the topic-word distribution φ that represents the occurrence rate of words in each topic.
Experimental Setup | Beam size is fixed at 2000. Sentence compressions are evaluated by a 5-gram language model trained on Gigaword (Graff, 2003) by SRILM (Stolcke, 2002).
Sentence Compression | As the space of possible compressions is exponential in the number of leaves in the parse tree, instead of looking for the globally optimal solution, we use beam search to find a set of highly likely compressions and employ a language model trained on a large corpus for evaluation. |
Sentence Compression | Given the N-best compressions from the decoder, we evaluate the yield of the trimmed trees using a language model trained on the Gigaword (Graff, 2003) corpus and return the compression with the highest probability.
Sentence Compression | Thus, the decoder is quite flexible — its learned scoring function allows us to incorporate features salient for sentence compression while its language model guarantees the linguistic quality of the compressed string. |
Discussion | The first is the incorporation of a language model (or comparable long-distance structure-scoring model) to assign scores to predicted parses independent of the transformation model. |
Experimental setup | The best symmetrization algorithm, translation and language model weights for each language are selected using cross-validation on the development set. |
MT—based semantic parsing | In order to learn a semantic parser using MT we linearize the MRs, learn alignments between the MRL and the NL, extract translation rules, and learn a language model for the MRL. |
MT—based semantic parsing | Language modeling In addition to translation rules learned from a parallel corpus, MT systems also rely on an n-gram language model for the target language, estimated from a (typically larger) monolingual corpus. |
MT—based semantic parsing | In the case of SP, such a monolingual corpus is rarely available, and we instead use the MRs available in the training data to learn a language model of the MRL. |
Related Work | It is combined with a language model to improve grammaticality and the decoder translates sentences into sim- |
Simplification Framework | In addition, the language model we integrate in the SMT module helps ensure better fluency and grammaticality.
Simplification Framework | Finally, the translation and language models ensure that published, describing and boson are simplified to wrote, explaining and elementary particle respectively; and that the phrase “In 1964” is moved from the beginning of the sentence to its end.
Simplification Framework | Our simplification framework consists of a probabilistic model for splitting and dropping which we call DRS simplification model (DRS-SM); a phrase based translation model for substitution and reordering (PBMT); and a language model learned on Simple English Wikipedia (LM) for fluency and grammaticality. |
Conclusion | Also, we believe that improving English language modeling to match the genre of the translated sentences can have significant positive impact on translation quality. |
Previous Work | They used two language models built from the English GigaWord corpus and from a large web crawl. |
Previous Work | For language modeling, we used either EGen or the English side of the AR corpus plus the English side of NIST12 training data and English GigaWord v5.
Previous Work | — B2-B4 systems used identical training data, namely EG, with the GW, EGen, or both for B2, B3, and B4 respectively for language modeling.
Proposed Methods 3.1 Egyptian to EG’ Conversion | Using both language models (S2) led to a slight improvement.
Introduction | This crawling process also yielded 632K TAC pairs whose only difference was spacing, and an additional 558M “unpaired” tweets; as shown later in this paper, we used these extra corpora for computing language models and other auxiliary information. |
Introduction | Table 5: Conformity to the community and one’s own past, measured via scores assigned by various language models.
Introduction | We measure a tweet’s similarity to expectations by its score according to the relevant language model, (1/|T|) Σ_{w∈T} log(p(w)), where T refers to either all the unigrams (unigram model) or all and only bi-grams (bigram model). We trained a Twitter-community language model from our 558M unpaired tweets, and personal language models from each author’s tweet history.
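The score above is an average unigram log-probability; a minimal sketch with an invented community model and an assumed floor probability for out-of-vocabulary words:

```python
import math

def conformity(tokens, model_p, floor=1e-6):
    """(1/|T|) * sum over w in T of log p(w): a tweet's average
    log-probability under the relevant language model."""
    logs = [math.log(model_p.get(w, floor)) for w in tokens]
    return sum(logs) / len(logs)

community = {"lol": 0.1, "the": 0.2}  # hypothetical unigram model
```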
Collocational Lexicon Induction | It has been used as word similarity measure in language modeling (Dagan et al., 1999). |
Experiments & Results 4.1 Experimental Setup | For the end-to-end MT pipeline, we used Moses (Koehn et al., 2007) with these standard features: relative-frequency and lexical translation model (TM) probabilities in both directions; distortion model; language model (LM) and word count. |
Experiments & Results 4.1 Experimental Setup | For the language model, we used the KenLM toolkit (Heafield, 2011) to create a 5-gram language model on the target side of the Europarl corpus (V7) with approximately 54M tokens with Kneser-Ney smoothing. |
Experiments & Results 4.1 Experimental Setup | However, in an MT pipeline, the language model is supposed to rerank the hypotheses and move more appropriate translations (in terms of fluency) to the top of the list. |
Introduction | Even noisy translation of oovs can aid the language model to better |
Related Work | In the setting of language modeling approaches to query expansion, the local analysis idea has been instantiated by estimating additional query language models (Lafferty and Zhai, 2003; Tao and Zhai, 2006) or relevance models (Lavrenko and Croft, 2001) from a set of feedback documents. |
Related Work | (2005) also try to uncover multiple aspects of a query, and to that end they provide an iterative “pseudo-query” generation technique, using cluster-based language models.
Related Work | Diaz and Metzler (2006) were the first to give a systematic account of query expansion using an external corpus in a language modeling setting, to improve the estimation of relevance models. |
Retrieval Framework | We work in the setting of generative language models.
Retrieval Framework | Within the language modeling approach, one builds a language model from each document, and ranks documents based on the probability of the document model generating the query. |
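Query-likelihood ranking of this kind can be sketched as follows; the toy documents and the additive smoothing with its mu parameter are illustrative choices, not the estimator any particular paper used:

```python
import math
from collections import Counter

def doc_lm(tokens, vocab, mu=0.1):
    """Per-document unigram model with simple additive smoothing."""
    counts = Counter(tokens)
    denom = len(tokens) + mu * len(vocab)
    return {w: (counts[w] + mu) / denom for w in vocab}

def query_loglik(query, model):
    """Score: log probability of the document model generating the query."""
    return sum(math.log(model[w]) for w in query)

docs = {"d1": "cat sat on the mat".split(), "d2": "dogs chase cats".split()}
vocab = {w for toks in docs.values() for w in toks}
query = ["cat", "mat"]
ranked = sorted(docs, key=lambda d: query_loglik(query, doc_lm(docs[d], vocab)),
                reverse=True)
```

Documents whose models assign the query higher probability rank first; here the document actually containing the query terms wins.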
Retrieval Framework | The particulars of the language modeling approach have been discussed extensively in the literature (see, e.g., Balog et al. |
Lexical normalisation | The confusion candidates are then filtered for each token occurrence of a given OOV word, based on their local context fit with a language model.
Lexical normalisation | In addition to generating the confusion set, we rank the candidates based on a trigram language model trained over 1.5GB of clean Twitter data, i.e. |
Lexical normalisation | To train the language model, we used SRILM (Stolcke, 2002) with the -unk option.
Related work | Suppose the ill-formed text is T and its corresponding standard form is S; the approach aims to find arg max P(S|T) by computing arg max P(T|S)P(S), in which P(S) is usually a language model and P(T|S) is an error model.
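The noisy-channel argmax can be computed in log space; the ill-formed token, candidate standard forms, and their scores below are all invented for illustration:

```python
# Hypothetical log-scores for candidate standard forms S of "2moro" (T).
lm_logp = {"tomorrow": -4.0, "tomato": -7.5}    # log P(S), language model
err_logp = {"tomorrow": -1.0, "tomato": -3.0}   # log P(T|S), error model

def best_correction(candidates):
    """argmax over S of P(T|S) * P(S), done as a sum of logs."""
    return max(candidates, key=lambda s: err_logp[s] + lm_logp[s])
```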
Experiments | The final feature is the language model score for the target sentence, mounting up to the following model used at decoding time, with the feature weights A trained by Minimum Error Rate Training (MERT) (Och, 2003) on a development corpus. |
Experiments | with a 3-gram language model smoothed with modified Kneser-Ney discounting (Chen and Goodman, 1998), trained on around 1M sentences per target language.
Experiments | Table 2: Additional experiments for English to Chinese translation examining (a) the impact of the linguistic annotations in the LTS system (lts), when compared with an instance not employing such annotations (lts-nolabels) and (b) decoding with a 4th-order language model (-lm4).
Joint Translation Model | While in a decoder this is somewhat mitigated by the use of a language model, we believe that the weakness of straightforward applications of SCFGs to model reordering structure at the sentence level misses a chance to learn this crucial part of the translation process during grammar induction.
Joint Translation Model | As (Mylonakis and Sima’an, 2010) note, ‘plain’ SCFGs seem to perform worse than the grammars described next, mainly due to wrong long-range reordering decisions for which the language model can hardly help. |
Introduction | The variable 6 ranges over all possible English strings, and P(e) is a language model built from large amounts of English text that is unrelated to the foreign strings. |
Introduction | A language model P(e) is typically used in SMT decoding (Koehn, 2009), but here P(e) actually plays a central role in training translation model parameters.
Machine Translation as a Decipherment Task | Whole-segment Language Models: When using word n-gram models of English for decipherment, we find that some of the foreign sentences are decoded into sequences (such as “THANK YOU TALKING ABOUT ?”) that are not good English.
Machine Translation as a Decipherment Task | For Bayesian MT decipherment, we set a high prior value on the language model (10^4) and use sparse priors for the IBM 3 model parameters t, n, d, p (0.01, 0.01, 0.01, 0.01).
Word Substitution Decipherment | We model P(e) using a statistical word n-gram English language model (LM). |
Word Substitution Decipherment | For word substitution decipherment, we want to keep the language model probabilities fixed during training, and hence we set the prior on that model to be high (α = 10^4).
Abstract | Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries.
Conclusion | Unlike previous approaches, our method combines information from letter n-gram language models and word dictionaries and provides a robust decipherment model. |
Decipherment | We build a statistical English language model (LM) for the plaintext source model P(p), which assigns a probability to any English letter sequence.
Decipherment | For the plaintext source model, we use probabilities from an English language model and for the channel model, we specify a uniform distribution (i.e., a plaintext letter can be substituted with any given cipher type with equal probability). |
Decipherment | Combining letter n-gram language models with word dictionaries: Many existing probabilistic approaches use statistical letter n-gram language models of English to assign P(p) probabilities to plaintext hypotheses during decipherment.
Experimental Results | We also used a 5-gram language model with modified Kneser-Ney smoothing (Chen and Goodman, 1998), trained on a data set consisting of 130M words in English Gigaword (LDC2007T07) and the English side of the parallel corpora.
Experimental Results | We use GIZA++ (Och and Ney, 2000), a suffix-array (Lopez, 2007), SRILM (Stolcke, 2002), and risk-based deterministic annealing (Smith and Eisner, 2006) to obtain word alignments, translation models, language models, and the optimal weights for combining these models, respectively.
Variational Approximate Decoding | Of course, this last point also means that our computation becomes intractable as n → ∞. However, if p(y | x) is defined by a hypergraph HG(x) whose structure explicitly incorporates an m-gram language model, both training and decoding will be efficient when m ≥ n. We will give algorithms for this case that are linear in the size of HG(x).
Variational Approximate Decoding | A reviewer asks about the interaction with backed-off language models.
Variational Approximate Decoding | We sketch a method that works for any language model given by a weighted FSA, L. The variational family Q can be specified by any deterministic weighted FSA, Q, with weights parameterized by δ.
Background | Since all the member systems share the same data resources, such as the language model and translation table, we only need to keep one copy of the required resources in memory.
Background | Another method to speed up the system is to accelerate the n-gram language model with n-gram caching techniques.
Background | If the required n-gram hits the cache, the corresponding n-gram probability is returned by the cached copy rather than re-fetched from the original data in the language model.
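The caching scheme described here can be sketched in a few lines (the class and method names are hypothetical; any base model exposing a log-probability lookup would work the same way):

```python
# Sketch of n-gram probability caching (all names hypothetical):
# the cache maps an n-gram tuple to its log-probability so that
# repeated queries skip the slower lookup in the full model.

class CachedNgramLM:
    def __init__(self, base_lm):
        self.base_lm = base_lm   # object with a logprob(ngram) method
        self.cache = {}          # n-gram tuple -> cached log-probability
        self.hits = 0
        self.misses = 0

    def logprob(self, ngram):
        ngram = tuple(ngram)
        if ngram in self.cache:  # cache hit: return the stored copy
            self.hits += 1
            return self.cache[ngram]
        self.misses += 1         # cache miss: fetch from the base model
        p = self.base_lm.logprob(ngram)
        self.cache[ngram] = p
        return p
```

Since decoders re-query the same n-grams many times per sentence, even this naive dictionary cache avoids most base-model lookups.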
Alignment | A language model is not used in this case, as the system is constrained to the given target sentence and thus the language model score has no effect on the alignment. |
Alignment | To deal with this problem, instead of simple phrase length restriction, we propose to apply the leaving-one-out method, which is also used for language modeling techniques (Kneser and Ney, 1995). |
Experimental Evaluation | The baseline system is a standard phrase-based SMT system with eight features: phrase translation and word lexicon probabilities in both translation directions, phrase penalty, word penalty, language model score and a simple distance-based reordering model. |
Experimental Evaluation | We used a 4-gram language model with modified Kneser-Ney discounting for all experiments. |
Introduction | The phrase model is combined with a language model, word lexicon models, word and phrase penalty, and many others.
Related Work | They report improvements over a phrase-based model that uses an inverse phrase model and a language model.
Related Work | First, a pre-defined confusion set is used to generate candidate corrections; then a scoring model, such as a trigram language model or naïve Bayes classifier, is used to rank the candidates according to their context (e.g., Golding and Roth, 1996; Mangu and Brill, 1997; Church et al., 2007).
Related Work | (2009) present a query speller system in which both the error model and the language model are trained using Web data. |
Related Work | Typically, a language model (source model) is used to capture contextual information, while an error model (channel model) is considered to be context free in that it does not take into account any contextual information in modeling word transformation probabilities. |
The Baseline Speller System | where the error model P(Q|C) models the transformation probability from C to Q, and the language model P(C) models how likely C is a correctly spelled query.
The Baseline Speller System | The language model (the second factor) is a backoff bigram model trained on the tokenized form of one year of query logs, using maximum likelihood estimation with absolute discounting smoothing. |
The Baseline Speller System | Since we define the logarithm of the probabilities of the language model and the error model (i.e., the edit distance function) as features, the ranker can be viewed as a more general framework, subsuming the source channel model as a special case. |
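The source-channel decision rule described above can be sketched as follows; the probability tables are toy values invented for illustration, not taken from the paper:

```python
import math

# Toy noisy-channel spelling correction (all probabilities hypothetical):
# rank each candidate correction C for query Q by
#   log P(Q|C)  (error model)  +  log P(C)  (language model)
# and keep the highest-scoring candidate.

error_model = {("langage", "language"): 0.3,   # P(Q|C): edit likelihood
               ("langage", "langage"): 0.6}    # leaving the query unchanged
source_model = {"language": 0.01,              # P(C): LM prior on candidates
                "langage": 0.000001}

def correct(query, candidates):
    def score(c):
        return (math.log(error_model.get((query, c), 1e-12))
                + math.log(source_model.get(c, 1e-12)))
    return max(candidates, key=score)
```

Here the strong LM prior on "language" overcomes the error model's preference for leaving the query as typed, which is exactly the trade-off the two factors encode.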
Introduction | Each property indexes a language model, thus allowing documents that incorporate the same
Model Description | Keyphrases are drawn from a set of clusters; words in the documents are drawn from language models indexed by a set of topics, where the topics correspond to the keyphrase clusters. |
Model Description | — language models of each topic |
Model Description | In the LDA framework, each word is generated from a language model that is indexed by the word’s topic assignment. |
Classification Results | The language model features were completely useless for distinguishing contingencies from |
Features for sense prediction of implicit discourse relations | For each sense, we created unigram and bigram language models over the implicit examples in the training set.
Features for sense prediction of implicit discourse relations | We compute each example’s probability according to each of these language models.
Features for sense prediction of implicit discourse relations | of the spans’ likelihoods according to the various language models.
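A rough sketch of this per-sense likelihood feature, using hypothetical training data; add-one smoothing here stands in for whatever smoothing the authors actually used:

```python
import math
from collections import Counter

# Sketch (hypothetical data): train one add-one-smoothed unigram LM per
# relation sense, then score a span under each model; the per-model
# log-likelihoods become features for sense prediction.

def train_unigram(sentences, vocab):
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    # add-one smoothing over the shared vocabulary
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def loglik(model, span):
    return sum(math.log(model[w]) for w in span)
```

A span would then get one `loglik` feature per sense-specific model, letting the classifier compare which sense's language model fits it best.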
Conclusion | Further improvement is possible by incorporating topic models deeper in the decoding process and adding domain knowledge to the language model.
Discussion | 6.3 Improving Language Models |
Discussion | Topic models capture document-level properties of language, but a critical component of machine translation systems is the language model, which provides local constraints and preferences.
Discussion | Domain adaptation for language models (Bellegarda, 2004; Wood and Teh, 2009) is an important avenue for improving machine translation. |
Experiments | We train a modified Kneser-Ney trigram language model on English (Chen and Goodman, 1996).
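As an illustration of Kneser-Ney-style smoothing at the bigram level (a simplification: "modified" KN uses three count-dependent discounts, whereas this sketch uses a single fixed discount D = 0.75):

```python
from collections import Counter, defaultdict

# Sketch of interpolated Kneser-Ney for bigrams. The discount D is assumed
# fixed; the lower-order distribution is the continuation probability
# (fraction of distinct bigram types ending in w), not the raw unigram.

def kneser_ney_bigram(tokens, D=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])          # count of each history u
    continuations = defaultdict(set)          # w -> set of left contexts
    for (u, w) in bigrams:
        continuations[w].add(u)
    n_bigram_types = len(bigrams)

    def prob(u, w):
        # continuation probability: how many contexts does w follow?
        p_cont = len(continuations[w]) / n_bigram_types
        # mass freed by discounting all bigrams that start with u
        distinct_after_u = sum(1 for (a, _) in bigrams if a == u)
        lam = D * distinct_after_u / histories[u]
        return max(bigrams[(u, w)] - D, 0) / histories[u] + lam * p_cont

    return prob
```

The discounted mass from seen bigrams is redistributed via the continuation distribution, which is what distinguishes Kneser-Ney from plain absolute discounting.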
Experimental Settings | The models are built using the SRI Language Modeling Toolkit (Stolcke, 2002). |
Problem Zones in Handwriting Recognition | Digits on the other hand are a hard class to language model since the vocabulary (of multi-digit numbers) is infinite. |
Problem Zones in Handwriting Recognition | The HR system output does not contain any illegal non-words since its vocabulary is restricted by its training data and language models . |
Related Work | Alternatively, morphological information can be used to construct supplemental lexicons or language models (Sari and Sellami, 2002; Magdy and Darwish, 2006). |
Related Work | Their hypothesis that their large language model (16M words) may be responsible for why the word-based models outperformed stem-based (morphological) models is challenged by the fact that our language model data (220M words) is an order of magnitude larger, but we are still able to show benefit for using morphology. |
Related Work | Sparsity for low-order contexts has recently spurred interest in using latent variables to represent distributions over contexts in language models.
Related Work | While n-gram models have traditionally dominated in language modeling, two recent efforts de-
Related Work | Several authors investigate neural network models that learn not just one latent state, but rather a vector of latent variables, to represent each word in a language model (Bengio et al., 2003; Emami et al., 2003; Morin and Bengio, 2005). |
Smoothing Natural Language Sequences | 2.3 Latent Variable Language Model Representation |
Smoothing Natural Language Sequences | Latent variable language models (LVLMs) can be used to produce just such a distributional representation. |
Introduction | The best-performing systems for these applications today rely on training on large amounts of data: in the case of ASR, the data is aligned audio and transcription, plus large unannotated data for the language modeling; in the case of OCR, it is transcribed optical data; in the case of MT, it is aligned bitexts.
Introduction | For ASR and OCR, which can compose words from smaller units (phones or graphically recognized letters), an expanded target language vocabulary can be directly exploited without the need for changing the technology at all: the new words need to be inserted into the relevant resources (lexicon, language model) etc., with appropriately estimated probabilities.
Introduction | The expanded word combinations can be used to extend the language models used for MT to bias against incoherent hypothesized new sequences of segmented words. |
Morphology-based Vocabulary Expansion | In the Bigram Affix model, we do the same for the stem as in the Fixed Affix model, but for prefixes and suffixes, we create a bigram language model in the finite state machine. |
Morphology-based Vocabulary Expansion | We reweight the weights in the WFST model (Fixed or Bigram) by composing it with a letter trigraph language model (WoTr). |
Computing Feature Expectations | The nodes are states in the decoding process that include the span (i, j) of the sentence to be translated, the grammar symbol s over that span, and the left and right context words of the translation relevant for computing n-gram language model scores. Each hyper-edge h represents the application of a synchronous rule r that combines nodes corresponding to non-terminals in
Computing Feature Expectations | Decoder states can include additional information as well, such as local configurations for dependency language model scoring.
Computing Feature Expectations | The weight of h is the incremental score contributed to all translations containing the rule application, including translation model features on r and language model features that depend on both r and the English contexts of the child nodes.
Experimental Results | All four systems used two language models: one trained from the combined English sides of both parallel texts, and another, larger, language model trained on 2 billion words of English text (1 billion for Chinese-English SBMT). |
MT System Selection | These features rely on language models, MSA and Egyptian morphological analyzers and a Highly Dialectal Egyptian lexicon to decide whether each word is MSA, Egyptian, Both, or Out of Vocabulary.
MT System Selection | two language models: MSA and Egyptian.
MT System Selection | The second set of features uses perplexity against language models built from the source-side of the training data of each of the four |
Machine Translation Experiments | The language model for our systems is trained on English Gigaword (Graff and Cieri, 2003). |
Machine Translation Experiments | We use SRILM Toolkit (Stolcke, 2002) to build a 5-gram language model with modified |
Abstract | On the task shown in (Ravi and Knight, 2011) we obtain better results with only 5% of the computational effort when running our method with an n-gram language model.
Introduction | Combining Language Models and |
Training Algorithm and Implementation | As described in Section 4, the overall procedure is divided into two alternating steps: After initialization we first perform EM training of the translation model for 20-30 iterations using a 2-gram or 3-gram language model in the target language.
Training Algorithm and Implementation | The generative story described in Section 3 is implemented as a cascade of a permutation, insertion, lexicon, deletion and language model finite state transducers using OpenFST (Allauzen et al., 2007). |
Translation Model | Stochastically generate the target sentence according to an n-gram language model . |
Decoding | The language model (LM) scoring is directly integrated into the cube pruning algorithm. |
Decoding | Naturally, we also had to adjust hypothesis expansion and, most importantly, language model scoring inside the cube pruning algorithm. |
Experiments | Our German 4-gram language model was trained on the German sentences in the training data augmented by the Stuttgart SdeWaC corpus (Web-as-Corpus Consortium, 2008), whose generation is detailed in (Baroni et al., 2009). |
Translation Model | (1) The forward translation weight using the rule weights as described in Section 2 (2) The indirect translation weight using the rule weights as described in Section 2 (3) Lexical translation weight source → target (4) Lexical translation weight target → source (5) Target side language model (6) Number of words in the target sentences (7) Number of rules used in the pre-translation (8) Number of target side sequences; here k times the number of sequences used in the pre-translations that constructed τ (gap penalty) The rule weights required for (1) are relative frequencies normalized over all rules with the same left-hand side.
Translation Model | The computation of the language model estimates for (6) is adapted to score partial translations consisting of discontiguous units. |
Experiments | The resulting interview transcripts have a reported mean word error rate (WER) of approximately 25% on held out data, which was obtained by priming the language model with meta-data available from preinterview questionnaires. |
Experiments | We use a mixture of the training transcripts and various newswire sources for our language model training. |
Experiments | We did not attempt to prime the language model for particular interviewees or otherwise utilize any interview metadata. |
Introduction | Limitations in signal processing, acoustic modeling, pronunciation, vocabulary, and language modeling can be accommodated in several ways, each of which make different tradeoffs and thus induce different |
Previous Work | In the extreme case, the term may simply be out of vocabulary, although this may occur for various other reasons (e. g., poor language modeling or pronunciation dictionaries). |
Corpora and baselines | A 5-gram language model with modified interpolated Kneser-Ney smoothing (Chen and Goodman, 1998) was trained by the SRILM toolkit (Stolcke, 2002) on a set of 208 million running words of text obtained by combining the monolingual Czech text distributed by the 2010
Corpora and baselines | The baselines consisted of the language model, two phrase translation models, two lexical models, and a brevity penalty.
Decoding with target-side model dependencies | language model, as described in Chiang (2007).
Decoding with target-side model dependencies | In the case of the language model these aspects include any of its target-side words that are part of still incomplete n-grams. |
Hierarchical phrase-based translation | As shown by Chiang (2007), a weighted grammar of this form can be collected and scored by simple extensions of standard methods for phrase-based translation and efficiently combined with a language model in a CKY decoder to achieve large improvements over a state-of-the-art phrase-based system. |
Experiments | For testing the factored translation systems, we used Moses (Koehn et al., 2007), along with a 5-gram SRILM language model (Stolcke, 2002). |
Factored Model | The factored statistical machine translation model uses a log-linear approach in order to combine the several components, including the language model, the reordering model, the translation models and the generation models.
Introduction | The basic SMT approach uses the target language model as a feature in the argument maximisation function.
Introduction | This language model is trained on grammatically correct text, and would therefore give a good probability for word sequences that are likely to occur in a sentence, while it would penalise ungrammatical or badly ordered formations. |
Introduction | Thus, with respect to these methods, there is a problem when agreement needs to be applied on part of a sentence whose length exceeds the order of the target n-gram language model and the size of the chunks that are translated (see Figure 1 for an example).
A semantic span can include one or more eus. | Most translation systems adopt the features from a translation model, a language model, and sometimes a reordering model.
A semantic span can include one or more eus. | The process of training this transfer model and smoothing is similar to the process of training a language model.
A semantic span can include one or more eus. | formula (6) are estimated in the same way as a factored language model, which has the advantage of easily incorporating various linguistic information.
Experiments | A 5-gram language model is trained with SRILM5 on the combination of the Xinhua portion of the English Giga-word corpus combined with the English part of FBIS. |
Experiments | probabilities, the BTG reordering features, and the language model feature. |
Experiments and Results | Unigram NLLR and Filtered NLLR are the language model implementations of previous work as described in Section 3.1. |
Previous Work | They learned unigram language models (LMs) for specific time periods and scored articles with log-likelihood ratio scores. |
Timestamp Classifiers | 3.1 Language Models |
Timestamp Classifiers | We apply Dirichlet-smoothing to the language models (as in de Jong et al.
Timestamp Classifiers | The above language modeling and MaxEnt approaches are token-based classifiers that one could apply to any topic classification domain. |
Previous work | Language modelling methods build word n-gram models, like those used in speech recognition.
Previous work | 3.2 Language modelling methods |
Previous work | So far, language modelling methods have been more effective. |
The new approach | This corresponds roughly to a unigram language model.
Treebank Translation and Dependency Transformation | In detail, word-based decoding is used, which adopts a log-linear framework as in (Och and Ney, 2002) with only two features, translation model and language model,
Treebank Translation and Dependency Transformation | is the language model, a word trigram model trained from the CTB.
Treebank Translation and Dependency Transformation | Thus the decoding process is actually only determined by the language model.
Statistical Transliteration Model | The language model P(e) is trained from English texts. |
Statistical Transliteration Model | generative probability of an English syllable language model.
Statistical Transliteration Model | 2) The language model in backward transliteration describes the relationship of syllables in words. |
Evaluation | Therefore, we implemented a ranking mechanism which used a hybrid scoring method by giving equal weights to the language model and the normalized phonetic similarity. |
System Description | To check the likelihood and well-formedness of the new string after the replacement, we learn a 3-gram language model with absolute smoothing.
System Description | For learning the language model, we only consider the words in the CMU pronunciation dictionary which also exist in WordNet.
System Description | We remove the words containing at least one trigram which is very unlikely according to the language model . |
Previous Work | These approaches have focused on modeling statistical or syntactic phrasal relations under the language modeling method for information retrieval.
Previous Work | (Srikanth and Srihari, 2003; Maisonnasse et al., 2005) examined the effectiveness of syntactic relations in a query by using a language modeling framework.
Previous Work | (Song and Croft, 1999; Miller et al., 1999; Gao et al., 2004; Metzler and Croft, 2005) investigated the effectiveness of the language modeling approach in modeling statistical phrases such as n-grams or proximity-based phrases.
Proposed Method | We start out by presenting a simple phrase-based language modeling retrieval model that assumes uniform contribution of words and phrases. |
Experiments | An in-house language modeling toolkit is used to train the 5-gram language model with modified Kneser-Ney smoothing (Kneser and Ney, 1995). |
Experiments | The English monolingual data used for language modeling is the same as in Table 1. |
Related Work | They incorporated the bilingual topic information into language model adaptation and lexicon translation model adaptation, achieving significant improvements in the large-scale evaluation. |
Topic Similarity Model with Neural Network | Standard features: Translation model, including translation probabilities and lexical weights for both directions (4 features), 5-gram language model (1 feature), word count (1 feature), phrase count (1 feature), NULL penalty (1 feature), number of hierarchical rules used (1 feature). |
Introduction | And Knight and Hatzivassiloglou (1995) use a language model for selecting a fluent sentence among the vast number of surface realizations corresponding to a single semantic representation. |
Introduction | The top-ranked candidate is selected for presentation and verbalized using a language model interfaced with RealPro (Lavoie and Rambow, 1997), a text generation engine. |
The Story Generator | Since we do not know a priori which of these parameters will result in a grammatical sentence, we generate all possible combinations and select the most likely one according to a language model.
The Story Generator | We used the SRI toolkit to train a trigram language model on the British National Corpus, with interpolated Kneser-Ney smoothing and perplexity as the scoring metric for the generated sentences.
Conclusion | Additionally, we also want to induce sense clusters for words in the target language so that we can build a sense-based language model and integrate it into SMT.
Decoding with Sense-Based Translation Model | error rate training (MERT) (Och, 2003) together with other models such as the language model.
Experiments | We trained a 5-gram language model on the Xinhua section of the English Gigaword corpus (306 million words) using the SRILM toolkit (Stolcke, 2002) with the modified Kneser-Ney smoothing (Chen and Goodman, 1996).
Related Work | (2007) also explore a bilingual topic model for translation and language model adaptation. |
Decoding | 1 to a log-linear model (Och and Ney, 2002) that uses the following eight features: relative frequencies in two directions, lexical weights in two directions, number of rules used, language model score, number of target words produced, and the probability of matched source tree (Mi et al., 2008). |
Decoding | We use the cube pruning method (Chiang, 2007) to approximately intersect the translation forest with the language model . |
Experiments | A trigram language model was trained on the English sentences of the training corpus. |
Related Work | In machine translation, the concept of packed forest is first used by Huang and Chiang (2007) to characterize the search space of decoding with language models . |
Keyphrase Extraction Approaches | 3.3.4 Language Modeling |
Keyphrase Extraction Approaches | These feature values are estimated using language models (LMs) trained on a foreground corpus and a background corpus. |
Keyphrase Extraction Approaches | In sum, LMA uses a language model rather than heuristics to identify phrases, and relies on the language model trained on the background corpus to determine how “unique” a candidate keyphrase is to the domain represented by the foreground corpus. |
Experiments | SRILM (Stolcke, 2002) is adopted for language model training and KenLM (Heafield, 2011; Heafield et al., 2013) for language model query. |
Pinyin Input Method Model | The edge weight is the negative logarithm of the conditional probability P(Sj+1,k | Si,j) that a syllable Si,j is followed by Sj+1,k, which is given by a bigram language model of pinyin syllables:
Related Works | They solved the typo correction problem by decomposing the conditional probability P(H|P) of a Chinese character sequence H given pinyin sequence P into a language model P(wi|wi−1) and a typing model. The typing model, which was estimated on real user input data, was used for typo correction.
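A toy sketch of the lattice idea from the excerpt above: edge weights are negative log bigram probabilities over pinyin syllables, and decoding picks the lowest-cost path. The syllable probabilities here are invented for illustration:

```python
import math

# Toy pinyin syllable lattice (probabilities hypothetical): each edge
# costs -log P(next_syllable | prev_syllable); the best segmentation
# is the candidate path with the lowest total cost.

bigram = {("<s>", "xi"): 0.5, ("xi", "an"): 0.2,
          ("<s>", "xian"): 0.5, ("xian", "</s>"): 0.6,
          ("an", "</s>"): 0.3}

def path_cost(syllables):
    path = ["<s>"] + syllables + ["</s>"]
    return sum(-math.log(bigram.get(pair, 1e-9))   # unseen pairs: floor prob
               for pair in zip(path, path[1:]))

def best_segmentation(candidates):
    return min(candidates, key=path_cost)
```

With these toy numbers the single-syllable reading "xian" beats the two-syllable "xi an", because summed negative log probabilities penalize each extra low-probability transition.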
Related Works | Various approaches were made for the task including language model (LM) based methods (Chen et al., 2013), ME model (Han and Chang, 2013), CRF (Wang et al., 2013d; Wang et al., 2013a), SMT (Chiu et al., 2013; Liu et al., 2013), and graph model (Jia et al., 2013), etc. |
Experimental Results | To handle different directions of translation between Chinese and English, we built two trigram language models with modified Kneser-Ney smoothing (Chen and Goodman, 1998) using the SRILM toolkit (Stolcke, 2002). |
Experimental Results | Feature weights (Baseline vs. AAMT): language model 0.137 vs. 0.133; phrase translation 0.066 vs. 0.023; lexical translation 0.061 vs. 0.078; reverse phrase translation 0.059 vs. 0.103; reverse lexical translation 0.
Unsupervised Translation Induction for Chinese Abbreviations | Moreover, our approach utilizes both Chinese and English monolingual data to help MT, while most SMT systems utilize only the English monolingual data to build a language model.
Unsupervised Translation Induction for Chinese Abbreviations | However, since most of statistical translation models (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006) are symmetrical, it is relatively easy to train a translation system to translate from English to Chinese, except that we need to train a Chinese language model from the Chinese monolingual data. |
Features | We look at the language model (LM) score and the number of alternate pronunciations of the first query, predicting that a misrecognized query will have a lower LM score and more alternate pronunciations. |
Prediction task | In addition, the language model likelihood for the first query was, as expected, significantly lower for retries. |
Related Work | Retry cases are identified with joint language modeling across multiple transcripts, with the intuition that retry pairs tend to be closely related or exact duplicates. |
Related Work | While we follow this work in our usage of joint language modeling, our application encompasses open domain voice searches and voice actions (such as placing calls), so we cannot use simplifying domain assumptions.
Collaborative Decoding | Similar to a language model score, n-gram consensus-based feature values cannot be summed up from smaller hypotheses.
Discussion | They also empirically show that n-gram agreement is the most important factor for improvement apart from language models.
Experiments | The language model used for all models (including decoding models and system combination models described in Section 2.6) is a 5-gram model trained with the English part of the bilingual data and the Xinhua portion of the LDC English Gigaword corpus, version 3.
Experiments | We parsed the language model training data with Berkeley parser, and then trained a dependency language model based on the parsing output. |
Experiment | For the relevance retrieval model, we faithfully reproduce the passage-based language model with pseudo-relevance feedback (Lee et al., 2008). |
Term Weighting and Sentiment Analysis | IR models, such as Vector Space (VS), probabilistic models such as BM25, and Language Modeling (LM), albeit in different forms of approach and measure, employ heuristics and formal modeling approaches to effectively evaluate the relevance of a term to a document (Fang et al., 2004). |
Term Weighting and Sentiment Analysis | In our experiments, we use the Vector Space model with Pivoted Normalization (VS), Probabilistic model (BM25), and Language modeling with Dirichlet Smoothing (LM). |
Term Weighting and Sentiment Analysis | With proper assumptions and derivations, p(w | d) can be derived to language modeling approaches.
Experiments | language model and (93'8' is the length penalty term |
Experiments | We also use the SRI Language Modeling Toolkit (Stolcke, 2002) to train a trigram language model with Kneser-Ney smoothing on the English side of the bitext. |
Experiments | Besides the trigram language model trained on the English side of these bitext, we also use another trigram model trained on the first 1/3 of the Xinhua portion of Gigaword corpus. |
Forest-based translation | The decoder performs two tasks on the translation forest: 1-best search with integrated language model (LM), and k-best search with LM to be used in minimum error rate training.
Experiments and Results | The features we used are commonly used features as in a standard BTG decoder, such as translation probabilities, lexical weights, language model, word penalty and distortion probabilities.
Experiments and Results | The language model is a 5-gram language model trained with the target sentences in the training data.
Experiments and Results | The language model is a 5-gram language model trained with the Gigaword corpus plus the English sentences in the training data.
Features and Training | We also use other fundamental features, such as translation probabilities, lexical weights, distortion probability, word penalty, and language model probability. |
Experiment Results | The language model is the interpolation of 5-gram language models built from news corpora of the NIST 2012 evaluation. |
Experiment Results | The language model is the trigram SRI language model built from the Xinhua corpus of 180 million words.
Experiment Results | The language model is a trigram SRILM model trained from the target side of the training corpora.
Introduction | Many features are shared between phrase-based and tree-based systems including language model, word count, and translation model features.
Introduction | The approach by Wei and Croft (2006) was the first to leverage LDA topics to improve the estimate of document language models and achieved good empirical results. |
Topic-Driven Relevance Models | where Θ is a set of pseudo-relevant feedback documents and θD is the language model of document D. This notion of estimating a query model is
Topic-Driven Relevance Models | We tackle the null probabilities problem by smoothing the document language model using the well-known Dirichlet smoothing (Zhai and Lafferty, 2004). |
Topic-Driven Relevance Models | Instead of viewing Θ as a set of document language models that are likely to contain topical information about the query, we take a probabilistic topic modeling approach.
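Dirichlet smoothing of a document language model, as cited above (Zhai and Lafferty, 2004), interpolates document counts with a collection model: p(w|d) = (c(w,d) + μ·p(w|C)) / (|d| + μ). A minimal sketch, with a toy μ (the literature typically uses μ around 2000):

```python
from collections import Counter

# Sketch of a Dirichlet-smoothed document language model:
#   p(w | d) = (c(w, d) + mu * p(w | C)) / (|d| + mu)
# where p(w | C) is the collection (background) model. This removes the
# zero probabilities that break query-likelihood scoring.

def dirichlet_lm(doc_tokens, collection_model, mu=2000.0):
    counts = Counter(doc_tokens)
    dlen = len(doc_tokens)
    def p(w):
        return (counts[w] + mu * collection_model.get(w, 0.0)) / (dlen + mu)
    return p
```

Short documents are pulled strongly toward the collection model (mu dominates |d|), while long documents stay close to their maximum-likelihood estimates.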
Inference | After randomly initializing all ηk,s,r,t, inference is performed by a blocked Gibbs sampler, alternating resamplings for three major groups of variables: the language model (z, φ), context model (α, γ, β, ρ), and the η, θ variables, which bottleneck between the submodels.
Inference | The language model sampler sequentially updates every z(i) (and implicitly φ via collapsing) in the manner of Griffiths and Steyvers (2004): p(z(i) | θ, w(i), b) ∝ θs,r,t,z (nw,z + b/V)/(nz + b), where counts n are for all event tuples besides i.
Model | Language model:
Model | Thus the language model is very similar to a topic model’s generation of token topics and wordtypes. |
Experiments | In the end-to-end MT pipeline we use a standard set of features: relative-frequency and lexical translation model probabilities in both directions; distance-based distortion model; language model and word count. |
Experiments | We train 3-gram language models using modified Kneser-Ney smoothing.
Experiments | For AR-EN experiments the language model is trained on English data as (Blunsom et al., 2009a), and for FA-EN and UR-EN the English data are the target sides of the bilingual training data. |
Introduction | We develop a Bayesian approach using a Pitman-Yor process prior, which is capable of modelling a diverse range of geometrically decaying distributions over infinite event spaces (here translation phrase-pairs), an approach shown to be state of the art for language modelling (Teh, 2006). |
Conceptual Framework | In language modeling practice, one finds the likelihood P(w | θ) of a word sequence w of length N under a model θ to be an inconvenient measure for comparison.
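The usual remedy is perplexity, which normalizes the likelihood by sequence length: PP(w) = P(w)^(−1/N) = exp(−(1/N) Σ log p(wi | history)), making sequences of different lengths comparable. A minimal sketch:

```python
import math

# Perplexity from per-token log-probabilities:
#   PP = exp(-(1/N) * sum_i log p(w_i | history))
# A uniform model over V word types yields perplexity V for any length.

def perplexity(token_log_probs):
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)
```

Because of the 1/N exponent, a 7-word sequence and a 70-word sequence under the same uniform 10-word model both score perplexity 10, whereas their raw likelihoods differ by many orders of magnitude.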
Discussion | This makes it suitable for comparison of conversational genres, in much the same way as are general language models of words. |
Discussion | Accordingly, as for language models, density estimation in future turn-taking models may be im-
Introduction | The current work attempts to address this problem by proposing a simple framework, which, at least conceptually, borrows quite heavily from the standard language modeling paradigm. |
Experiments | The monolingual data for training the English language model includes the Xinhua portion of the GIGAWORD corpus, which contains 238M English words.
Experiments | A 4-gram language model was trained on the monolingual data by the SRILM toolkit (Stolcke, 2002).
Related Work | Researchers also introduce topic models for cross-lingual language model adaptation (Tam et al., 2007; Ruiz and Federico, 2011).
Related Work | Based on the bilingual topic model, they apply the source-side topic weights into the target-side topic model, and adapt the n-gram language model of target side. |
Experiments | We used SRILM for the training of language models (5-gram in all the experiments).
Experiments | We trained a Chinese language model for the EC translation on the Chinese part of the bi-text. |
Experiments | For the English language model of CE translation, an extra corpus named Tanaka was used besides the English part of the bilingual corpora. |
Abstract | In all experiments we include the target side of the mined parallel data in the language model , in order to distinguish whether results are due to influences from parallel or monolingual data. |
Abstract | In these experiments, we use 5-gram language models when the target language is English or German, and 4-gram language models for French and Spanish.
Abstract | The baseline system was trained using only the Europarl corpus (Koehn, 2005) as parallel data, and all experiments use the same language model trained on the target sides of Europarl, the English side of all linked Spanish-English Wikipedia articles, and the English side of the mined CommonCrawl data.
Model | Broadly, as the learner progresses from one sentence to the next, exposing herself to more novel words, the updated parameters of the language model in turn guide the selection of new “switch-points” for replacing source words with the target foreign words. |
Model | Generally, this value may come directly from the surprisal quantity given by a language model, or may incorporate additional features that are found informative in predicting the constraint on the word.
Related Work | Building on their work, Adel et al. (2012) employ additional features and a recurrent network language model for modeling code-switching in conversational speech.
Experiments | Apart from the language model, the lexical, phrasal, and (for the syntax grammar) label-conditioned features, and the rule, target word, and glue operation counters, Venugopal and Zollmann (2009) also provide both the hierarchical and syntax-augmented grammars with a rareness penalty 1/cnt(r), where cnt(r) is the occurrence count of rule r in the training corpus, allowing the system to learn penalization of low-frequency rules, as well as three indicator features firing if the rule has one, two unswapped, and two swapped nonterminal pairs, respectively. Further, to mitigate badly estimated PSCFG derivations based on low-frequency rules of the much sparser syntax model, the syntax grammar also contains the hierarchical grammar as a backbone (cf.
Experiments | Each system is trained separately to adapt the parameters to its specific properties (size of nonterminal set, grammar complexity, feature sparseness, reliance on the language model, etc.).
Related work | The supertags are also injected into the language model.
Features | To some extent, these two features have a similar function to a target language model or POS-based target language model.
Related Work | (2009) study several confidence features, based on mutual information between words and on n-gram and backward n-gram language models, for word-level and sentence-level CE.
SMT System | We build a four-gram language model using the SRILM toolkit (Stolcke, 2002), which is trained |
Introduction | Our approach is based on semi-Markov discriminative structure prediction, and it incorporates English back-transliteration and English language models (LMs) into WS in a seamless way. |
Use of Language Model | Language Model Augmentation Analogous to Koehn and Knight (2003), we can exploit the fact that reddo (red) in the example is such a common word that one can expect it appears frequently in the training corpus.
Use of Language Model | 4.1 Language Model Projection |
Experiment | We used 5-gram language models that were trained using the English side of each set of bilingual training data. |
Experiment | The common SMT feature set consists of: four translation model features, phrase penalty, word penalty, and a language model feature. |
Introduction | A language model also supports the estimation.
Experiments | Here, the first item is the language model (LM) probability, where τ(d) is the target string of derivation d; the second item is the translation length penalty; and the third item is the translation score, which is decomposed into a product of feature values of rules:
Experiments | SRI Language Modeling Toolkit (Stolcke, 2002) was employed to train 5-gram English and Japanese LMs on the training set. |
Related Work | By introducing supertags into the target language side, i.e., the target language model and the target side of the phrase table, significant improvement was achieved for Arabic-to-English translation. |
Experiments | The language model is a 3-gram language model trained using the SRILM toolkit (Stolcke, 2002) on the English side of the training data. |
Experiments | The language model is a 3-gram LM trained on the Xinhua portion of the Gigaword corpus using the SRILM toolkit with modified Kneser-Ney smoothing.
Related Work | (2011) develop a bilingual language model which incorporates words in the source and target languages to predict the next unit, which they use as a feature in a translation system. |
Introduction | One is an n-gram model over different units, such as word-level bigram/trigram models (Bangalore and Rambow, 2000; Langkilde, 2000), or factored language models integrated with syntactic tags (White et al.
Introduction | (2009) present a dependency-spanning tree algorithm for word ordering, which first builds dependency trees to decide linear precedence between heads and modifiers, then uses an n-gram language model to order siblings.
Log-linear Models | We linearize the dependency relations by computing n-gram models, similar to traditional word-based language models, except using the names of dependency relations instead of words.
Approach to Sentence-Level Dialect Identification | The aforementioned approach relies on language models (LMs) and an MSA and EDA Morphological Analyzer to decide whether each word is (a) MSA, (b) EDA, (c) both (MSA & EDA), or (d) OOV.
Approach to Sentence-Level Dialect Identification | The perplexity of a language model on a given test sentence S(w1, ..., wn) is defined as:
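The standard definition behind this snippet, PPL(S) = P(w1, ..., wn)^(-1/n), can be sketched as follows; this is a minimal illustration, and the `prob` callback is a stand-in for whatever n-gram model a given paper uses:

```python
import math

def perplexity(sentence, prob):
    """Perplexity of a sentence under a language model:
    PPL = exp(-(1/n) * sum_i log p(w_i | history))."""
    log_sum, history = 0.0, "<s>"
    for word in sentence:
        log_sum += math.log(prob(word, history))
        history = word
    return math.exp(-log_sum / len(sentence))

# Toy uniform model over a 10-word vocabulary: every word has p = 0.1,
# so the perplexity is 10 regardless of the sentence.
uniform = lambda w, h: 0.1
ppl = perplexity(["the", "cat", "sat"], uniform)  # 10.0 up to rounding
```

Lower perplexity means the model finds the sentence less surprising, which is why several of the papers above use it directly as a classification or selection criterion.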
Related Work | Amazon Mechanical Turk and try a language modeling (LM) approach to solve the problem. |
Experimental Design | approximately via cube pruning (Chiang, 2007), by integrating a trigram language model extracted from the training set (see Konstas and Lapata (2012) for details).
Experimental Design | Lexical Features These features encourage grammatical coherence and inform lexical selection over and above the limited horizon of the language model captured by Rules (6)-(9).
Problem Formulation | In machine translation, a decoder that implements forest rescoring (Huang and Chiang, 2007) uses the language model as an external criterion of the goodness of sub-translations on account of their grammaticality. |
Inferring a learning curve from mostly monolingual data | In this section we address scenario S1: we have access to a source-language monolingual collection (from which portions to be manually translated could be sampled) and a target-language in-domain monolingual corpus, to supplement the target side of a parallel corpus while training a language model.
Inferring a learning curve from mostly monolingual data | (b) perplexity of language models of order 2 to 5 derived from the monolingual source corpus computed on the source side of the test corpus. |
Inferring a learning curve from mostly monolingual data | The Lasso regression model selected four features from the entire feature set: i) Size of the test set (sentences & tokens) ii) Perplexity of language model (order 5) on the test set iii) Type-token ratio of the target monolingual corpus.
Discussion | So long as the vocabulary present in our phrase table and language model supports a literal translation, cohesion tends to produce an improvement. |
Discussion | In the baseline translation, the language model encourages the system to move the negation away from “exist” and toward “reduce.” The result is a tragic reversal of meaning in the translation. |
Introduction | order, forcing the decoder to rely heavily on its language model . |
Markov Topic Regression - MTR | (19) Language Model Prior (ηw): Probabilities on word transitions denoted as ηw = p(wi = v | wi-1).
Markov Topic Regression - MTR | We built a language model using SRILM (Stolcke, 2002) on domain-specific sources such as top wiki pages and blogs on online movie reviews, etc., to obtain the probabilities of domain-specific n-grams, up to 3-grams.
Markov Topic Regression - MTR | (1), we assume that the prior on the semantic tags, ηs, is more indicative of the decision for sampling a wi from a new tag compared to language model posteriors on word sequences, ηw.
Automated Approaches to Deceptive Opinion Spam Detection | Under (2), both the NB classifier used by Mihalcea and Strapparava (2009) and the language model classifier used by Zhou et al. |
Automated Approaches to Deceptive Opinion Spam Detection | (2008), we use the SRI Language Modeling Toolkit (Stolcke, 2002) to estimate individual language models, Pr(x | y = c), for truthful and deceptive opinions.
Automated Approaches to Deceptive Opinion Spam Detection | We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996). |
Experiment settings | TopicSum: we use TopicSum (Haghighi and Vanderwende, 2009), a 3-layer hierarchical topic model, to infer the language model that is most central for the collection.
Experiment settings | divergence with respect to the collection language model is the one chosen.
Related work | (2007) generate novel utterances by combining Prim’s maximum-spanning-tree algorithm with an n-gram language model to enforce fluency. |
Abstractive Caption Generation | Specifically, we use an adaptive language model (Kneser et al., 1997) that modifies an |
Abstractive Caption Generation | where P(wi ∈ C | wi ∈ D) is the probability of wi appearing in the caption given that it appears in the document D, and Padap(wi|wi-1,wi-2) the language model adapted with probabilities from our image annotation model:
Experimental Setup | The scaling parameter β for the adaptive language model was also tuned on the development set using a range of [0.5, 0.9].
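The adaptation step referenced in these snippets rescales background n-gram probabilities by a document-conditioned factor raised to the scaling parameter β and renormalizes. The sketch below is illustrative, not the authors' implementation; the function name and toy dictionaries are assumptions:

```python
def adapt(background, boost, beta):
    """Rescale background LM probabilities by an adaptation factor
    raised to beta, then renormalize over the vocabulary.

    background: dict word -> p(w | h) under the base n-gram model
    boost:      dict word -> adaptation weight, e.g. P(w in C | w in D);
                words absent from `boost` keep weight 1.0
    """
    scaled = {w: p * boost.get(w, 1.0) ** beta for w, p in background.items()}
    z = sum(scaled.values())
    return {w: s / z for w, s in scaled.items()}

base = {"red": 0.2, "blue": 0.3, "green": 0.5}
adapted = adapt(base, {"red": 2.0}, beta=1.0)
# "red" rises from 0.2 to 0.4/1.2 = 1/3 after renormalization
```

Raising β toward 1 trusts the adaptation signal more; β = 0 recovers the background model exactly, which is why a range such as [0.5, 0.9] is tuned on held-out data.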
The Features | It is built into a unigram language model without smoothing for each term. |
The Features | This feature function measures the Kullback-Leibler divergence (KL divergence) between the language models associated with the two inputs.
The Features | Similarly, the local context is built into a unigram language model without smoothing for each term; the feature function outputs the KL divergence between the models.
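A minimal sketch of this feature, assuming unsmoothed maximum-likelihood unigram models whose vocabularies overlap (real systems typically smooth the second model so the divergence stays finite):

```python
import math
from collections import Counter

def unigram_lm(terms):
    """Maximum-likelihood unigram model (no smoothing) from a term list."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q):
    """KL(p || q) in nats; q must cover the support of p, which for
    unsmoothed models means a shared vocabulary."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items())

p = unigram_lm(["bank", "river", "bank", "water"])
q = unigram_lm(["bank", "water", "river", "river"])
kl = kl_divergence(p, q)  # 0.25 * ln 2, about 0.173 nats
```

KL divergence is asymmetric (KL(p||q) generally differs from KL(q||p)), so which input serves as p is a design choice the feature definition must fix.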
Related Work | Bengio et al. (2006) proposed to use a multilayer neural network for the language modeling task.
Related Work | Niehues and Waibel (2012) show that machine translation results can be improved by combining a neural language model with a traditional n-gram language model.
Related Work | Son et al. (2012) improve the translation quality of an n-gram translation model by using a bilingual neural language model.
Experiment | We train a 5-gram language model with the Xinhua portion of the English Gigaword corpus and the target part of the training data.
Integrating into the PAS-based Translation Framework | The weights of the MEPD feature can be tuned by MERT (Och, 2003) together with other translation features, such as the language model.
PAS-based Translation Framework | The target-side-like PAS is selected only according to the language model and translation probabilities, without considering any context information of PAS. |
Experiments | Row 1 and row 2 are two baseline systems, which model the relevance score using VSM (Cao et al., 2010) and language model (LM) (Zhai and Lafferty, 2001; Cao et al., 2010) in the term space.
Experiments | Row 3 is the word-based translation model (Jeon et al., 2005), and row 4 is the word-based translation language model, which linearly combines the word-based translation model and language model into a unified framework (Xue et al., 2008). |
Experiments | (2009) in Table 3 because previous work (Ming et al., 2010) demonstrated that the word-based translation language model (Xue et al., 2008) obtained superior performance to the syntactic tree matching (Wang et al., 2009).
Experiments | In the experiments, we train the translation model on the FBIS corpus (7.2M (Chinese) + 9.2M (English) words) and train a 4-gram language model on the Xinhua portion of the English Gigaword corpus (181M words) using the SRILM Toolkit (Stolcke,
NonContiguous Tree Sequence Alignment-based Model | 2) the bi-lexical translation probabilities; 3) the target language model
The Pisces decoder | On the other hand, to simplify the computation of the language model, we only compute it for source-side contiguous translation hypotheses, while neglecting gaps in the target side if any.
Experiments | We trained two language models: the first one is a 4-gram LM which is estimated on the target side of the texts used in the large data condition.
Experiments | Both language models are used for both tasks. |
Experiments | Only the target-language half of the parallel training data are used to train the language model in this task. |
Experimental Setup | For the language model, we used a 5-gram model with modified Kneser-Ney smoothing (Kneser and Ney, 1995) trained on the English side of our training data as well as portions of the Gigaword v2 English corpus.
Experimental Setup | For the language model, we used a 5-gram model trained on the English portion of the whole training data plus portions of the Gigaword v2 corpus.
Hierarchical Phrase-based System | Given e and f as the source and target phrases associated with the rule, typical features used are the rule's translation probability Ptrans(f|e) and its inverse Ptrans(e|f), and the lexical probability Plex(f|e) and its inverse Plex(e|f). Systems generally also employ a word penalty, a phrase penalty, and a target language model feature.
Data | To manage the degrees of freedom in the model described in §4, we perform dimensionality reduction on the vocabulary by learning word embeddings with a log-linear continuous skip-gram language model (Mikolov et al., 2013) on the entire collection of 15,099 books.
Model | Maximum entropy approaches to language modeling have been used since Rosenfeld (1996) to incorporate long-distance information, such as previously-mentioned trigger words, into n-gram language models . |
Model | Notation: P, number of personas (hyperparameter); D, number of documents; Cd, number of characters in document d; Wd,c, number of (cluster, role) tuples for character c; md, metadata for document d (ranges over M authors); θd, document d's distribution over personas; pd,c, character c's persona; j, an index for a ⟨r, w⟩ tuple in the data; wj, word cluster ID for tuple j; rj, role for tuple j ∈ {agent, patient, poss, pred}; η, coefficients for the log-linear language model; μ, λ, Laplace mean and scale (for regularizing η); α, Dirichlet concentration parameter.
Abstract | Context-predicting models (more commonly known as embeddings or neural language models) are the new kids on the distributional semantics block.
Introduction | This is in part due to the fact that context-predicting vectors were first developed as an approach to language modeling and/or as a way to initialize feature vectors in neural-network-based “deep learning” NLP architectures, so their effectiveness as semantic representations was initially seen as little more than an interesting side effect. |
Introduction | Predictive DSMs are also called neural language models , because their supervised context prediction training is performed with neural networks, or, more cryptically, “embeddings”. |
Experiments | For this test set, we used 8 million sentences from the full NIST parallel dataset as the language model training data. |
Experiments | If either the source or the target side of a training instance had an edit distance of less than 10%, we removed it. As for the language models, we collected a further 10M tweets from Twitter for the English language model and another 10M tweets from Weibo for the Chinese language model.
Experiments | As the language model, we use a 5-gram model with Kneser-Ney smoothing.
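Kneser-Ney smoothing, which recurs throughout these system descriptions, can be sketched at the bigram level as below. This is a minimal illustration with a single fixed discount d, not the "modified" multi-discount variant toolkits like SRILM actually implement:

```python
from collections import Counter

def kn_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney bigram probabilities (minimal sketch):
    p(w|v) = max(c(v,w) - d, 0)/c(v) + d * N1+(v,.)/c(v) * p_cont(w),
    where p_cont(w) is the fraction of distinct bigram types ending in w."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])                 # c(v)
    followers = Counter(v for v, _ in bigrams)      # N1+(v, .)
    continuations = Counter(w for _, w in bigrams)  # N1+(., w)
    n_types = len(bigrams)

    def prob(w, v):
        p_cont = continuations[w] / n_types
        if contexts[v] == 0:                        # unseen context: back off fully
            return p_cont
        discounted = max(bigrams[(v, w)] - d, 0) / contexts[v]
        return discounted + d * followers[v] / contexts[v] * p_cont

    return prob

p = kn_bigram("a b a b a c".split())
# for a fixed context, the probabilities sum to one over the vocabulary
```

The key idea is the continuation probability: a word's backoff weight reflects how many distinct contexts it follows, not its raw frequency, which is what distinguishes Kneser-Ney from simple absolute discounting.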
Evaluation Tasks and Results | To produce baseline cognate identification predictions, we calculate the probability of each latent Hebrew letter sequence predicted by the HMM, and compare it to a uniform character-level Ugaritic language model (as done by our model, to avoid automatically assigning higher cognate probability to shorter Ugaritic words). |
Inference | We also calculate this probability using a uniform unigram character-level language model (and it thus depends only on the number of characters in ui).
Model | Otherwise, a lone word is generated according to a uniform character-level language model.
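The uniform character-level model described in these snippets assigns a probability that depends only on string length; a one-line sketch, assuming independent and uniformly distributed characters (the function name is illustrative):

```python
def uniform_char_lm(word, alphabet_size):
    """Probability of a string under a uniform character-level model:
    each character is independent with probability 1/|alphabet|, so the
    score depends only on the string's length, not its content."""
    return (1.0 / alphabet_size) ** len(word)

# e.g. a 3-character string over a binary alphabet: (1/2) ** 3 = 0.125
```

Because every string of the same length gets the same score, such a model serves as a neutral length-sensitive baseline when comparing hypotheses, as in the cognate-identification setup above.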
Conclusion | In the current version of the generator, the output is ranked using a simple language model trained on the GENIA corpus. |
Generating from the KBGen Knowledge-Base | To rank the generator output, we train a language model on the GENIA corpus, a corpus of 2000 MEDLINE abstracts about biology containing more than 400,000 words (Kim et al., 2003), and use this model to rank the generated sentences by decreasing probability.
Related Work | They intersect the grammar with a language model to improve fluency, use a weighted hypergraph to pack the derivations, and find the best derivation tree using the Viterbi algorithm.
Conclusion and Future Work | By combining such information with traditional statistical language models , it is capable of suggesting relevant articles that meet the dynamic nature of a discussion in social media. |
Experimental Evaluation | The second one, LM, is based on statistical language models for relevant information retrieval (Ponte and Croft, 1998). |
Experimental Evaluation | probabilistic language model for each article, and ranks them on query likelihood, i.e.
Experiments | For the language model, we used the SRI Language Modeling Toolkit (Stolcke, 2002) to train a 4-gram model on the Xinhua portion of the GIGAWORD corpus.
Joint Decoding | There are also features independent of derivations, such as the language model and word penalty.
Joint Decoding | Although left-to-right decoding might enable a more efficient use of language models and hopefully produce better translations, we adopt bottom-up decoding in this paper just for convenience. |
Introduction | Successful applications of such models include language modelling (Bengio et al., 2003), paraphrase detection (Erk and Pado, 2008), and dialogue analysis (Kalchbrenner and Blunsom, 2013). |
Related Work | Neural language models are another popular approach for inducing distributed word representations (Bengio et al., 2003). |
Related Work | They have received a lot of attention in recent years (Collobert and Weston, 2008; Mnih and Hinton, 2009; Mikolov et al., 2010, inter alia) and have achieved state-of-the-art performance in language modelling.
Models and Features | In particular, we use the recurrent neural network language model (RNNLM) of Mikolov et al. |
Models and Features | Like any language model, an RNNLM estimates the probability of observing a word given the preceding context, but, in this process, it learns word embeddings into a latent, conceptual space with a fixed number of dimensions.
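The recurrence behind such a model can be sketched in a few lines of numpy. This toy uses random, untrained weights of illustrative sizes; it shows the hidden-state update and next-word distribution, not Mikolov et al.'s actual trained RNNLM:

```python
import numpy as np

V, H = 5, 8                        # toy vocabulary and hidden-layer sizes
rng = np.random.default_rng(0)
Wx = rng.normal(0.0, 0.1, (H, V))  # input (one-hot word) -> hidden
Wh = rng.normal(0.0, 0.1, (H, H))  # previous hidden -> hidden
Wo = rng.normal(0.0, 0.1, (V, H))  # hidden -> output logits

def rnnlm_step(word_id, h_prev):
    """One step of a simple recurrent LM: fold the current word into the
    hidden state, then return a distribution over the next word."""
    x = np.zeros(V)
    x[word_id] = 1.0
    h = np.tanh(Wx @ x + Wh @ h_prev)   # hidden state summarizes the context
    logits = Wo @ h
    p = np.exp(logits - logits.max())   # numerically stable softmax
    return h, p / p.sum()

h = np.zeros(H)
for w in [0, 3, 1]:                     # feed a short word-id sequence
    h, p_next = rnnlm_step(w, h)
# p_next is a proper distribution over the 5-word vocabulary
```

The columns of Wx act as the learned word embeddings once the network is trained, which is the property the quoted papers exploit as features.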
Related Work | (2013) recently addressed the problem of answer sentence selection and demonstrated that LS models, including recurrent neural network language models (RNNLM), have a higher contribution to overall performance than exploiting syntactic analysis. |
Abstract | Optical Character Recognition (OCR) systems for Arabic rely on information contained in the scanned images to recognize sequences of characters and on language models to emphasize fluency. |
Discriminative Reranking for OCR | The LMs are built using the SRI Language Modeling Toolkit (Stolcke, 2002).
Introduction | The BBN Byblos OCR system (Natarajan et al., 2002; Prasad et al., 2008; Saleem et al., 2009), which we use in this paper, relies on a hidden Markov model (HMM) to recover the sequence of characters from the image, and uses an n-gram language model (LM) to emphasize the fluency of the output.
Background | The RNN is primarily used as a language model , but may also be viewed as a sentence model with a linear structure. |
Introduction | Besides comprising powerful classifiers as part of their architecture, neural sentence models can be used to condition a neural language model to generate sentences word by word (Schwenk, 2012; Mikolov and Zweig, 2012; Kalchbrenner and Blunsom, 2013a). |
Properties of the Sentence Model | This gives the RNN excellent performance at language modelling, but it is suboptimal for remembering at once the n-grams further back in the input sentence.