Index of papers in Proc. ACL that mention
  • language model
Liu, Lemao and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Introduction
Further, decoding with nonlocal (or state-dependent) features, such as a language model , is also a problem.
Introduction
Actually, even for the (log-) linear model, efficient decoding with the language model is not trivial (Chiang, 2007).
Introduction
For the nonlocal features such as the language model , Chiang (2007) proposed a cube-pruning method for efficient decoding.
language model is mentioned in 17 sentences in this paper.
Topics mentioned in this paper:
Mochihashi, Daichi and Yamada, Takeshi and Ueda, Naonori
Abstract
Our model is a nested hierarchical Pitman-Yor language model , where Pitman-Yor spelling model is embedded in the word model.
Abstract
Our model is also considered as a way to construct an accurate word n-gram language model directly from characters of arbitrary language, without any “word” indications.
Introduction
Bayesian Kneser-Ney) language model , with an accurate character ∞-gram Pitman-Yor spelling model embedded in word models.
Introduction
Furthermore, it can be viewed as a method for building a high-performance n-gram language model directly from character strings of arbitrary language.
Introduction
we briefly describe a language model based on the Pitman-Yor process (Teh, 2006b), which is a generalization of the Dirichlet process used in previous research.
Nested Pitman-Yor Language Model
In contrast, in this paper we use a simple but more elaborate model, that is, a character n-gram language model that also employs HPYLM.
Nested Pitman-Yor Language Model
Figure 2: Chinese restaurant representation of our Nested Pitman-Yor Language Model (NPYLM).
Pitman-Yor process and n-gram models
To compute a probability p(w|s) in (1), we adopt a Bayesian language model lately proposed by (Teh, 2006b; Goldwater et al., 2005) based on the Pitman-Yor process, a generalization of the Dirichlet process.
Pitman-Yor process and n-gram models
As a result, the n-gram probability of this hierarchical Pitman—Yor language model (HPYLM) is recursively computed as
Pitman-Yor process and n-gram models
When we set t_hw ≡ 1, (4) recovers Kneser-Ney smoothing: thus a HPYLM is a Bayesian Kneser-Ney language model as well as an extension of the hierarchical Dirichlet Process (HDP) used in Goldwater et al.
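For readers who want the recursion made concrete, below is a sketch of the standard hierarchical Pitman-Yor predictive probability in Teh's (2006) notation (c, t, d, θ); the paper's own equation (4) may use different symbols.

```latex
% HPYLM predictive probability (standard form, after Teh 2006).
% c(hw): count of word w after context h; c(h.): total count for h;
% t_{hw}: number of tables for (h,w); d: discount; \theta: strength;
% h': context h with its oldest word removed.
p(w \mid h) \;=\; \frac{c(hw) - d\, t_{hw}}{\theta + c(h\cdot)}
            \;+\; \frac{\theta + d\, t_{h\cdot}}{\theta + c(h\cdot)}\; p(w \mid h')
```

Fixing t_{hw} ≡ 1 turns the first term into an absolute-discounted count and the second into a Kneser-Ney style backoff weight, which is the connection the sentence above draws.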
language model is mentioned in 18 sentences in this paper.
Topics mentioned in this paper:
Auli, Michael and Gao, Jianfeng
Abstract
Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation.
Abstract
We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion.
Expected BLEU Training
We integrate the recurrent neural network language model as an additional feature into the standard log-linear framework of translation (Och, 2003).
Expected BLEU Training
We summarize the weights of the recurrent neural network language model as θ = {U, W, V} and add the model as an additional feature to the log-linear translation model using the simplified notation s_θ(w_t) ≡ s(w_t | w_1 ... w_{t−1}, h_{t−1}):
Expected BLEU Training
which computes a sentence-level language model score as the sum of individual word scores.
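As a minimal illustration of the decomposition described above (a sentence-level LM score as a sum of per-word scores), here is a hedged sketch; `score_word` is a stand-in callback for the RNN LM's per-word log-score s_θ(w_t), not the authors' implementation.

```python
import math

def sentence_lm_feature(words, score_word):
    """Sentence-level LM feature: the sum of per-word log-scores.

    score_word(history, word) is assumed to return the model's
    log-probability of `word` given the words generated so far (for an
    RNN LM this would come from the hidden state; here it is an opaque
    callback so the decomposition itself is explicit).
    """
    history = []
    total = 0.0
    for w in words:
        total += score_word(tuple(history), w)
        history.append(w)
    return total

# Toy usage with a uniform "model" over a 10,000-word vocabulary.
uniform = lambda history, word: math.log(1.0 / 10000)
print(sentence_lm_feature("this is a test".split(), uniform))
```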
Introduction
In this paper we focus on recurrent neural network architectures which have recently advanced the state of the art in language modeling (Mikolov et al., 2010; Mikolov et al., 2011; Sundermeyer et al., 2013) with several subsequent applications in machine translation (Auli et al., 2013; Kalchbrenner and Blunsom, 2013; Hu et al., 2014).
Introduction
(2013) who demonstrated that feed-forward network-based language models are more accurate in first-pass decoding than in rescoring.
Introduction
Decoding with feed-forward architectures is straightforward, since predictions are based on a fixed size input, similar to n-gram language models .
Recurrent Neural Network LMs
Our model has a similar structure to the recurrent neural network language model of Mikolov et al.
language model is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Zaslavskiy, Mikhail and Dymetman, Marc and Cancedda, Nicola
Experiments
First we consider a bigram Language Model and the algorithms try to find the reordering that maximizes the LM score.
Experiments
Then we consider a trigram based Language Model and the algorithms again try to maximize the LM score.
Experiments
This means that, when using a bigram language model , it is often possible to reorder the words of a randomly permuted reference sentence in such a way that the LM score of the reordered sentence is larger than the LM score of the reference.
Introduction
Typical nonlocal features include one or more n-gram language models as well as a distortion feature, measuring by how much the order of biphrases in the candidate translation deviates from their order in the source sentence.
Phrase-based Decoding as TSP
• The language model cost of producing the target words of b' right after the target words of b; with a bigram language model , this cost can be precomputed directly from b and b'.
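The bigram case is special because the only LM term that crosses the biphrase boundary involves the last target word of b and the first target word of b'. A minimal sketch of that precomputation (the function name and the log-probability callback are illustrative, not the paper's code):

```python
def bigram_transition_cost(b_target, b_prime_target, bigram_logprob):
    """LM cost of emitting biphrase b' directly after biphrase b.

    b_target / b_prime_target: tuples of target words for b and b'.
    bigram_logprob(w1, w2): log p(w2 | w1) under the bigram LM.
    The boundary term p(first word of b' | last word of b) plus the
    internal bigrams of b' depend only on the pair (b, b'), so this
    cost can be tabulated once for every ordered pair before decoding.
    """
    cost = -bigram_logprob(b_target[-1], b_prime_target[0])
    for w1, w2 in zip(b_prime_target, b_prime_target[1:]):
        cost += -bigram_logprob(w1, w2)
    return cost
```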
Phrase-based Decoding as TSP
Successful phrase-based systems typically employ language models of order higher than two.
Phrase-based Decoding as TSP
If we want to extend the power of the model to general n-gram language models , and in particular to the 3-gram
language model is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Aker, Ahmet and Gaizauskas, Robert
Abstract
Our results show that summaries biased by dependency pattern models lead to significantly higher ROUGE scores than both n-gram language models reported in previous work and also Wikipedia baseline summaries.
Introduction
They also experimented with representing such conceptual models using n- gram language models derived from corpora consisting of collections of descriptions of instances of specific object types (e.g.
Introduction
a corpus of descriptions of churches, a corpus of bridge descriptions, and so on) and reported results showing that incorporating such n-gram language models as a feature in a feature-based extractive summarizer improves the quality of automatically generated summaries.
Introduction
The main weakness of n-gram language models is that they only capture very local information about short term sequences and cannot model long distance dependencies between terms.
Representing conceptual models 2.1 Object type corpora
2.2 N-gram language models
Representing conceptual models 2.1 Object type corpora
Aker and Gaizauskas (2009) experimented with uni-gram and bi-gram language models to capture the features commonly used when describing an object type and used these to bias the sentence selection of the summarizer towards the sentences that contain these features.
Representing conceptual models 2.1 Object type corpora
As in Song and Croft (1999) they used their language models in a gener-
language model is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Bicknell, Klinton and Levy, Roger
Explaining between-word regressions
This simple example just illustrates the point that if a reader is combining noisy visual information with a language model , then confidence in previous regions will sometimes fall.
Models of eye movements in reading
Unfortunately, however, the Mr. Chips model simplifies the problem of reading in a number of ways: First, it uses a unigram model as its language model , and thus fails to use any information in the linguistic context to help with word identification.
Models of eye movements in reading
Specifically, our model identifies the words in a sentence by performing Bayesian inference combining noisy input from a realistic visual model with a language model that takes context into account.
Reading as Bayesian inference
Specifically, the model begins reading with a prior distribution over possible identities of a sentence given by its language model .
Reading as Bayesian inference
model’s prior distribution over the identity of the sentence given the language model is updated to a posterior distribution taking into account both the language model and the visual input obtained thus far.
Reading as Bayesian inference
Given the visual input and a language model, inferences about the identity of the sentence w can be made by standard Bayesian inference, where the prior is given by the language model and the likelihood is a function of the total visual input obtained from the first to the ith timestep Ii ,
Simulation 1
5.1.2 Language model
Simulation 1
Our reader’s language model was an unsmoothed bigram model created using a vocabulary set con-
Simulation 1
Specifically, we constructed the model’s initial belief state (i.e., the distribution over sentences given by its language model ) by directly translating the bigram model into a wFSA in the log semiring.
Simulation 2
6.1.3 Language model
language model is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Corlett, Eric and Penn, Gerald
Experiment
In the event that a trigram or bigram would be found in the plaintext that was not counted in the language model , add one smoothing was used.
Experiment
The character-level language model we used was developed from the first 1.5 million characters of the Wall Street Journal section of the Penn Treebank corpus.
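A minimal sketch of a character-level trigram model with add-one smoothing of the kind described in these two sentences; the toy training string and alphabet are placeholders (the paper trains on roughly 1.5 million characters of WSJ text).

```python
from collections import Counter
import math

def train_char_trigram_lm(text, alphabet):
    """Return log p(c | c1 c2) under an add-one smoothed character trigram model."""
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    vocab_size = len(alphabet)

    def logprob(c, context):  # context is the two preceding characters
        return math.log((trigrams[context + c] + 1) /
                        (bigrams[context] + vocab_size))

    return logprob

# Toy usage; trigrams unseen in training still receive nonzero mass.
lm = train_char_trigram_lm("the cat sat on the mat", "abcdefghijklmnopqrstuvwxyz ")
print(lm("e", "th"))
```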
Introduction
If the text from which a language model is trained is of a different genre than the plaintext of a cipher, the unigraph letter frequencies may differ substantially from those of the language model , and so frequency counting will be misleading.
Introduction
Such inefficiency indicates that integer programming may simply be the wrong tool for the job, possibly because language model probabilities computed from empirical data are not smoothly distributed enough over the space in which a cutting-plane method would attempt to compute a linear relaxation of this problem.
Introduction
This difference in difficulty, while real, is not inherent, but rather an artefact of the character-level n-gram language models that they (and we) use, in which preponderant evidence of differences in short character sequences is necessary for the model to clearly favour one letter-substitution mapping over another.
Terminology
Every possible full solution to a cipher C will produce a plaintext string with some associated language model probability, and we will consider the best possible solution to be the one that gives the highest probability.
Terminology
For the sake of concreteness, we will assume here that the language model is a character-level trigram model.
The Algorithm
Backpointers are necessary to reference one of the two language model probabilities.
The Algorithm
Cells that would produce inconsistencies are left at zero, and these as well as cells that the language model assigns zero to can only produce zero entries in later columns.
The Algorithm
The n_p × n_p cells of every column i do not depend on each other, only on the cells of the previous two columns i−1 and i−2, as well as the language model .
language model is mentioned in 17 sentences in this paper.
Topics mentioned in this paper:
Duan, Xiangyu and Zhang, Min and Li, Haizhou
Conclusion
Removing the power of higher order language model and longer max phrase length, which are inherent in pseudo-words, shows that pseudo-words still improve translational performance significantly over unary words.
Experiments and Results
The pipeline uses GIZA++ model 4 (Brown et al., 1993; Och and Ney, 2003) for pseudo-word alignment, uses Moses (Koehn et al., 2007) as phrase-based decoder, uses the SRI Language Modeling Toolkit to train language model with modified Kneser-Ney smoothing (Kneser and Ney 1995; Chen and Goodman 1998).
Experiments and Results
A 5-gram language model is trained on English side of parallel corpus.
Experiments and Results
Xinhua portion of the English Gigaword3 corpus is used together with English side of large corpus to train a 4-gram language model .
Introduction
Further experiments of removing the power of higher order language model and longer max phrase length, which are inherent in pseudo-words, show that pseudo-words still improve translational performance significantly over unary words.
language model is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Durrani, Nadir and Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut
Our Approach
3.1.1 Language Model
Our Approach
The language model (LM) p_LM(·)
Our Approach
The parameters of the language model are learned from a monolingual Urdu corpus.
language model is mentioned in 18 sentences in this paper.
Topics mentioned in this paper:
Roark, Brian and Allauzen, Cyril and Riley, Michael
Abstract
We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of well-known Kneser-Ney (1995) smoothing.
Introduction
Smoothed n-gram language models are the defacto standard statistical models of language for a wide range of natural language applications, including speech recognition and machine translation.
Introduction
constraints for language modeling
Introduction
As a result, statistical language models — an important component of many such applications — are often trained on very large corpora, then modified to fit within some pre-specified size bound.
Preliminaries
N-gram language models are typically presented mathematically in terms of words w, the strings (histories) h that precede them, and the suffixes of the histories (backoffs) h’ that are used in the smoothing recursion.
Preliminaries
N-gram language models allow for a sparse representation, so that only a subset of the possible n-grams must be explicitly stored.
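The sparse representation described above relies on the usual backoff recursion; a generic sketch (not this paper's exact notation) is:

```latex
% Backoff n-gram model: explicit smoothed probabilities \hat{p} are stored
% only for n-grams hw seen in training; all other probabilities back off to
% the shortened history h', scaled by \alpha(h) so that p(\cdot \mid h)
% still sums to one.
p(w \mid h) =
\begin{cases}
  \hat{p}(w \mid h)          & \text{if } hw \text{ is explicitly stored},\\[2pt]
  \alpha(h)\, p(w \mid h')   & \text{otherwise}.
\end{cases}
```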
language model is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
May, Jonathan and Knight, Kevin and Vogler, Heiko
Decoding Experiments
We add an English syntax language model £ to the cascade of transducers just described to better simulate an actual machine translation decoding task.
Decoding Experiments
The language model is cast as an identity WTT and thus fits naturally into the experimental framework.
Decoding Experiments
In our experiments we try several different language models to demonstrate varying performance of the application algorithms.
language model is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Mi, Haitao and Liu, Qun
Abstract
We thus propose to combine the advantages of both, and present a novel constituency-to-dependency translation model, which uses constituency forests on the source side to direct the translation, and dependency trees on the target side (as a language model ) to ensure grammaticality.
Decoding
where the first two terms are translation and language model probabilities, e(D) is the target string (English sentence) for derivation D, the third and fourth items are the dependency language model probabilities on the target side computed with words and POS tags separately, D_e(D) is the target dependency tree of D, the fifth one is the parsing probability of the source side tree T_c(D) ∈ F_c, ill(D) is the penalty for the number of ill-formed dependency structures in D, and the last two terms are derivation and translation length penalties, respectively.
Decoding
For each node, we use the cube pruning technique (Chiang, 2007; Huang and Chiang, 2007) to produce partial hypotheses and compute all the feature scores including the dependency language model score (Section 4.1).
Decoding
4.1 Dependency Language Model Computing
Experiments
We also store the POS tag information for each word in dependency trees, and compute two different dependency language models for words and POS tags in dependency tree separately.
Experiments
We use SRI Language Modeling Toolkit (Stolcke, 2002) to train a 4-gram language model with Kneser-Ney smoothing on the first 1/3 of the Xinhua portion of Giga-word corpus.
Experiments
This suggests that using dependency language model really improves the translation quality by less than 1 BLEU point.
language model is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Pitler, Emily and Louis, Annie and Nenkova, Ani
Indicators of linguistic quality
3.1 Word choice: language models
Indicators of linguistic quality
Language models (LM) are a way of computing how familiar a text is to readers using the distribution of words from a large background corpus.
Indicators of linguistic quality
We built unigram, bigram, and trigram language models with Good-Turing smoothing over the New York Times (NYT) section of the English Gigaword corpus (over 900 million words).
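To make the "familiarity" idea above concrete, here is a hedged sketch of the average per-word log-probability of a text under a background LM; the `background_logprob` callback stands in for the Gigaword-trained models and is not the authors' code.

```python
def familiarity(words, background_logprob, order=3):
    """Average per-word log-probability of a text under a background LM.

    background_logprob(context, word) is assumed to return
    log p(word | context) under the background model; higher (less
    negative) averages indicate more familiar word sequences.
    """
    history = []
    total = 0.0
    for w in words:
        total += background_logprob(tuple(history[-(order - 1):]), w)
        history.append(w)
    return total / max(len(words), 1)
```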
Results and discussion
Coh-Metrix, which has been proposed as a comprehensive characterization of text, does not perform as well as the language model and the entity coherence classes, which contain considerably fewer features related to only one aspect of text.
Results and discussion
It is apparent from the results that continuity, entity coherence, sentence fluency and language models are the most powerful classes of features that should be used in automation of evaluation and against which novel predictors of text quality should be compared.
Results and discussion
For example, the language model features, which are the second best class for the system-level, do not fare as well at the input-level.
language model is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Nuhn, Malte and Ney, Hermann
Abstract
In this paper we show that even for the case of 1:1 substitution ciphers—which encipher plaintext symbols by exchanging them with a unique substitute—finding the optimal decipherment with respect to a bigram language model is NP-hard.
Definitions
denotes the language model .
Definitions
Depending on the structure of the language model Equation 2 can be further simplified.
Definitions
Similarly, we define language model matrices S for the unigram and the bigram case.
Introduction
The general idea is to find those translation model parameters that maximize the probability of the translations of a given source text in a given language model of the target language.
Introduction
This might be related to the fact that a statistical formulation of the decipherment problem has not been analyzed with respect to n-gram language models : This paper shows the close relationship of the decipherment problem to the quadratic assignment problem.
Introduction
In Section 4 we show that decipherment using a unigram language model corresponds to solving a linear sum assignment problem (LSAP).
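A hedged sketch of that reduction for the unigram case: under a unigram LM the log-probability of a decipherment is the sum over cipher symbols of count(c)·log p(σ(c)), so the best 1:1 mapping σ is a linear sum assignment, solvable here with SciPy. The data layout and function are illustrative, not the paper's.

```python
import math
import numpy as np
from scipy.optimize import linear_sum_assignment

def unigram_decipher(cipher_counts, plain_unigram):
    """Optimal 1:1 substitution under a unigram LM, solved as an LSAP.

    cipher_counts: {cipher_symbol: frequency in the ciphertext}
    plain_unigram: {plaintext_symbol: unigram probability}
    log P(decipherment) = sum_c count(c) * log p(sigma(c)), so maximizing
    over bijections sigma is exactly a linear sum assignment problem.
    """
    cipher_syms = sorted(cipher_counts)
    plain_syms = sorted(plain_unigram)
    profit = np.array([[cipher_counts[c] * math.log(plain_unigram[p])
                        for p in plain_syms] for c in cipher_syms])
    rows, cols = linear_sum_assignment(profit, maximize=True)
    return {cipher_syms[r]: plain_syms[c] for r, c in zip(rows, cols)}
```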
Related Work
gram language model .
language model is mentioned in 20 sentences in this paper.
Topics mentioned in this paper:
Nagata, Ryo and Whittaker, Edward
Approach
In his method, a variety of languages are modeled by their spelling systems (i.e., character-based n-gram language models ).
Approach
Then, agglomerative hierarchical clustering is applied to the language models to reconstruct a language family tree.
Approach
The similarity used for clustering is based on a divergence-like distance between two language models that was originally proposed by Juang and Rabiner (1985).
Methods
Similarly, let M_i be a language model trained using D_i.
Methods
2, we use an n-gram language model based on a mixture of word and POS tokens instead of a simple word-based language model .
Methods
In this language model , content words in n-grams are replaced with their corresponding POS tags.
language model is mentioned in 20 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang
Abstract
As the algorithm generates dependency trees for partial translations left-to-right in decoding, it allows for efficient integration of both n-gram and dependency language models .
Introduction
In addition, it is straightforward to integrate n-gram language models into phrase-based decoders in which translation always grows left-to-right.
Introduction
As a result, phrase-based decoders only need to maintain the boundary words on one end to calculate language model probabilities.
Introduction
Unfortunately, as syntax-based decoders often generate target-language words in a bottom-up way using the CKY algorithm, integrating n-gram language models becomes more expensive because they have to maintain target boundary words at both ends of a partial translation (Chiang, 2007; Huang and Chiang, 2007).
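A minimal sketch of the point being contrasted here: in left-to-right decoding the LM state is just the last n−1 target words, so extending a hypothesis is cheap. The names and the log-probability callback are illustrative, not taken from any particular decoder.

```python
def extend_hypothesis(state, new_words, ngram_logprob, order=3):
    """Score appending `new_words` to a partial left-to-right translation.

    `state` holds the last (order-1) words of the partial translation,
    which is all the decoder must remember to score future extensions.
    `ngram_logprob(context, word)` returns log p(word | context).
    Returns the LM score increment and the new boundary-word state.
    """
    context = list(state)
    score = 0.0
    for w in new_words:
        score += ngram_logprob(tuple(context[-(order - 1):]), w)
        context.append(w)
    new_state = tuple(context[-(order - 1):])
    return score, new_state
```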
language model is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hui and Chiang, David
Abstract
We rederive all the steps of KN smoothing to operate on count distributions instead of integral counts, and apply it to two tasks where KN smoothing was not applicable before: one in language model adaptation, and the other in word alignment.
Introduction
Such cases have been noted for language modeling (Goodman, 2001; Goodman, 2004), domain adaptation (Tam and Schultz, 2008), grapheme-to-phoneme conversion (Bisani and Ney, 2008), and phrase-based translation (Andres-Ferrer, 2010; Wuebker et al., 2012).
Introduction
One is language model domain adaptation, and the other is word alignment using the IBM models (Brown et al., 1993).
Language model adaptation
N-gram language models are widely used in applications like machine translation and speech recognition to select fluent output sentences.
Language model adaptation
Here, we propose to assign each sentence a probability to indicate how likely it is to belong to the domain of interest, and train a language model using expected KN smoothing.
Language model adaptation
They first train two language models , pin on a set of in-domain data, and pout on a set of general-domain data.
Related Work
This method subtracts D directly from the fractional counts, zeroing out counts that are smaller than D. The discount D must be set by minimizing an error metric on held-out data using a line search (Tam, p. c.) or Powell’s method (Bisani and Ney, 2008), requiring repeated estimation and evaluation of the language model .
Smoothing on integral counts
Before presenting our method, we review KN smoothing on integer counts as applied to language models , although, as we will demonstrate in Section 7, KN smoothing is applicable to other tasks as well.
language model is mentioned in 17 sentences in this paper.
Topics mentioned in this paper:
Kauchak, David
Abstract
In this paper we examine language modeling for text simplification.
Abstract
Unlike some text-to-text translation tasks, text simplification is a monolingual translation task allowing for text in both the input and output domain to be used for training the language model .
Abstract
We explore the relationship between normal English and simplified English and compare language models trained on varying amounts of text from each.
Introduction
An important component of many text-to-text translation systems is the language model which predicts the likelihood of a text sequence being produced in the output language.
Introduction
In some problem domains, such as machine translation, the translation is between two distinct languages and the language model can only be trained on data in the output language.
Introduction
In these monolingual problems, text could be used from both the input and output domain to train a language model .
Related Work
If we view the normal data as out-of-domain data, then the problem of combining simple and normal data is similar to the language model domain adaption problem (Suzuki and Gao, 2005), in particular cross-domain adaptation (Bellegarda, 2004) where a domain-specific model is improved by incorporating additional general data.
language model is mentioned in 53 sentences in this paper.
Topics mentioned in this paper:
Blunsom, Phil and Cohn, Trevor
Background
Early work was firmly situated in the task-based setting of improving generalisation in language models .
Background
This model has been popular for language modelling and bilingual word alignment, and an implementation with improved inference called mkcls (Och, 1999) has become a standard part of statistical machine translation systems.
Background
(1992)’s HMM by incorporating a character language model , allowing the modelling of limited morphology.
Introduction
Our work brings together several strands of research including Bayesian nonparametric HMMs (Goldwater and Griffiths, 2007), Pitman-Yor language models (Teh, 2006b; Goldwater et al., 2006b), tagging constraints over word types (Brown et al., 1992) and the incorporation of morphological features (Clark, 2003).
The PYP-HMM
Prior work in unsupervised PoS induction has employed simple smoothing techniques, such as additive smoothing or Dirichlet priors (Goldwater and Griffiths, 2007; Johnson, 2007), however this body of work has overlooked recent advances in smoothing methods used for language modelling (Teh, 2006b; Goldwater et al., 2006b).
The PYP-HMM
The PYP has been shown to generate distributions particularly well suited to modelling language (Teh, 2006a; Goldwater et al., 2006b), and has been shown to be a generalisation of Kneser—Ney smoothing, widely recognised as the best smoothing method for language modelling (Chen and Goodman, 1996).
The PYP-HMM
We consider two different settings for the base distribution Cj: 1) a simple uniform distribution over the vocabulary (denoted HMM for the experiments in section 4); and 2) a character-level language model (denoted HMM+LM).
language model is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Chong, Tze Yuang and E. Banchs, Rafael and Chng, Eng Siong and Li, Haizhou
Abstract
In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling .
Introduction
Language models have been extensively studied in natural language processing.
Introduction
The role of a language model is to measure how probably a (target) word would occur based on some given evidence extracted from the history-context.
Language Modeling with TD and TO
A language model estimates word probabilities given their history, i.e.
Language Modeling with TD and TO
In order to define the TD and TO components for language modeling , we express the observation of an arbitrary history-word, w_{i−k}, at the kth position behind the target-word, as the joint of two events: i) the word w_{i−k} occurs within the history-context: w_{i−k} ∈ h, and ii) it occurs at distance k from the target-word: Δ(w_{i−k}) = k, (Δ = k for brevity); i.e.
Language Modeling with TD and TO
In fact, the TO model is closely related to the trigger language model (Rosenfeld 1996), as the prediction of the target-word (the triggered word) is based on the presence of a history-word (the trigger).
Motivation of the Proposed Approach
The attributes of distance and co-occurrence are exploited and modeled differently in each language modeling approach.
Related Work
Latent-semantic language model approaches (Bellegarda 1998, Coccaro 2005) weight word counts with TFIDF to highlight their semantic importance towards the prediction.
Related Work
Other approaches such as the class-based language model (Brown 1992, Kneser & Ney 1993)
Related Work
The structured language model (Chelba & Jelinek 2000) determines the “heads” in the history-context by using a parsing tree.
language model is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Berg-Kirkpatrick, Taylor and Durrett, Greg and Klein, Dan
Experiments
Then, Tesseract uses a classifier, aided by a word-unigram language model , to recognize whole words.
Experiments
6.3 Language Model
Learning
The number of states in the dynamic programming lattice grows exponentially with the order of the language model (Jelinek, 1998; Koehn, 2004).
Learning
As a result, inference can become slow when the language model order n is large.
Learning
On each iteration of EM, we perform two passes: a coarse pass using a low-order language model, and a fine pass using a high-order language model (Petrov et al., 2008; Zhang and Gildea, 2008).
Model
P(E, T, R, X) = P(E) [Language model] · P(T|E) [Typesetting model] · P(R) [Inking model] · P(X|E, T, R) [Noise model]
Model
3.1 Language Model P(E)
Model
Our language model , P(E), is a Kneser-Ney smoothed character n-gram model (Kneser and Ney, 1995).
Related Work
Work that has directly addressed historical documents has done so using a pipelined approach, and without fully integrating a strong language model (Vamvakas et al., 2008; Kluzner et al., 2009; Kae et al., 2010; Kluzner et al., 2011).
Related Work
They integrated typesetting models with language models , but did not model noise.
Related Work
Our approach is also similar in that we use a strong language model (in conjunction with the constraint that the correspondence be regular) to learn the correct mapping.
language model is mentioned in 27 sentences in this paper.
Topics mentioned in this paper:
Pauls, Adam and Klein, Dan
Abstract
N-gram language models are a major resource bottleneck in machine translation.
Abstract
In this paper, we present several language model implementations that are both highly compact and fast to query.
Abstract
We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
Introduction
For modern statistical machine translation systems, language models must be both fast and compact.
Introduction
The largest language models (LMs) can contain as many as several hundred billion n-grams (Brants et al., 2007), so storage is a challenge.
Introduction
At the same time, decoding a single sentence can trigger hundreds of thousands of queries to the language model , so speed is also critical.
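As a rough illustration of why caching helps at this query volume, here is a hedged sketch of memoizing repeated n-gram lookups with functools.lru_cache; the paper's own caching scheme is more involved than this.

```python
from functools import lru_cache

def cached_lm(raw_logprob, max_size=1_000_000):
    """Wrap an n-gram log-probability function with a query cache.

    Decoding a single sentence can issue hundreds of thousands of LM
    queries, many of them repeats, so memoizing (ngram -> score) can
    recover a large share of the lookup cost. This is only a sketch of
    the general idea, not the paper's implementation.
    """
    @lru_cache(maxsize=max_size)
    def logprob(ngram):          # ngram is a tuple of words
        return raw_logprob(ngram)
    return logprob
```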
language model is mentioned in 62 sentences in this paper.
Topics mentioned in this paper:
Zweig, Geoffrey and Platt, John C. and Meek, Christopher and Burges, Christopher J.C. and Yessenalina, Ainur and Liu, Qiang
Abstract
We tackle the problem with two approaches: methods that use local lexical information, such as the n-grams of a classical language model ; and methods that evaluate global coherence, such as latent semantic analysis.
Introduction
To investigate the usefulness of local information, we evaluated n-gram language model scores, from both a conventional model with Good-Turing smoothing, and with a recently proposed maximum-entropy class-based n-gram model (Chen, 2009a; Chen, 2009b).
Introduction
Also in the language modeling vein, but with potentially global context, we evaluate the use of a recurrent neural network language model .
Introduction
In all the language modeling approaches, a model is used to compute a sentence probability with each of the potential completions.
Related Work
The KU system uses just an N-gram language model to do this ranking.
Related Work
The UNT system uses a large variety of information sources, and a language model score receives the highest weight.
Sentence Completion via Language Modeling
Perhaps the most straightforward approach to solving the sentence completion task is to form the complete sentence with each option in turn, and to evaluate its likelihood under a language model .
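A minimal sketch of that straightforward strategy (the blank marker, helper names, and scoring callback are illustrative, not the authors' system):

```python
def complete_sentence(template, options, sentence_logprob):
    """Pick the completion whose full sentence scores best under the LM.

    `template` contains a single "___" blank, `options` are candidate
    words, and `sentence_logprob` returns the LM log-probability of a
    word list.
    """
    def fill(option):
        return [option if tok == "___" else tok for tok in template.split()]
    return max(options, key=lambda o: sentence_logprob(fill(o)))
```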
Sentence Completion via Language Modeling
In this section, we describe the suite of state-of-the-art language modeling techniques for which we will present results.
Sentence Completion via Language Modeling
3.1 Backoff N-gram Language Model
language model is mentioned in 27 sentences in this paper.
Topics mentioned in this paper:
Rush, Alexander M. and Collins, Michael
Background: Hypergraphs
The second step is to integrate an n-gram language model with this hypergraph.
Background: Hypergraphs
The labels for leaves will be words, and will be important in defining strings and language model scores for those strings.
Background: Hypergraphs
The focus of this paper will be to solve problems involving the integration of a k’th order language model with a hypergraph.
Introduction
The language model is then integrated with the translation model during decoding. Decoding with these models is challenging, largely because of the cost of integrating an n-gram language model into the search process.
Introduction
E.g., with a trigram language model they run in O(|E|w^6) time, where |E| is the number of edges in the hypergraph, and w is the number of distinct lexical items in the hypergraph.
Introduction
This step does not require language model integration, and hence is highly efficient.
language model is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Schütze, Hinrich
Abstract
Building on earlier work that integrates different factors in language modeling , we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation.
Abstract
This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events.
Abstract
We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models .
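For reference, the "simple direct interpolation" baseline mentioned above is usually written as below (a generic Brown-style class-based mixture; the paper's own model, with classes learned from rare events, differs from this).

```latex
% Linear interpolation of a word n-gram model with a class-based model;
% c(w) is the class of word w and \lambda is tuned on held-out data.
p(w_i \mid w_{i-n+1}^{i-1}) =
  \lambda\, p_{\mathrm{word}}(w_i \mid w_{i-n+1}^{i-1})
  + (1-\lambda)\, p\bigl(c(w_i) \mid c(w_{i-n+1}), \ldots, c(w_{i-1})\bigr)\,
    p\bigl(w_i \mid c(w_i)\bigr)
```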
Introduction
Language models , probability distributions over strings of words, are fundamental to many applications in natural language processing.
Introduction
The main challenge in language modeling is to estimate string probabilities accurately given that even very large training corpora cannot overcome the inherent sparseness of word sequence data.
Introduction
Plausible though this line of reasoning is, the language models most commonly used today do not incorporate class-based generalization.
Related work
However, the importance of rare events for clustering in language modeling has not been investigated before.
Related work
Our work is most similar to the lattice-based language models proposed by Dupont and Rosenfeld (1997).
language model is mentioned in 17 sentences in this paper.
Topics mentioned in this paper:
Schwartz, Lane and Callison-Burch, Chris and Schuler, William and Wu, Stephen
Abstract
Incremental syntactic language models score sentences in a similar left-to-right fashion, and are therefore a good mechanism for incorporating syntax into phrase-based translation.
Abstract
We give a formal definition of one such linear-time syntactic language model , detail its relation to phrase-based decoding, and integrate the model with the Moses phrase-based translation system.
Introduction
Early work in statistical machine translation viewed translation as a noisy channel process comprised of a translation model, which functioned to posit adequate translations of source language words, and a target language model , which guided the fluency of generated target language strings (Brown et al.,
Introduction
Drawing on earlier successes in speech recognition, research in statistical machine translation has effectively used n-gram word sequence models as language models .
Introduction
Modern phrase-based translation using large scale n-gram language models generally performs well in terms of lexical choice, but still often produces ungrammatical output.
Related Work
Instead, we incorporate syntax into the language model .
Related Work
Traditional approaches to language models in
Related Work
Chelba and Jelinek (1998) proposed that syntactic structure could be used as an alternative technique in language modeling .
language model is mentioned in 47 sentences in this paper.
Topics mentioned in this paper:
Tan, Ming and Zhou, Wenli and Zheng, Lei and Wang, Shaojun
Abstract
This paper presents an attempt at building a large scale distributed composite language model that simultaneously accounts for local word lexical information, midrange sentence syntactic structure, and long-span document semantic content under a directed Markov random field paradigm.
Abstract
The composite language model has been trained by performing a convergent N-best list approximate EM algorithm that has linear time complexity and a followup EM algorithm to improve word prediction power on corpora with up to a billion tokens and stored on a supercomputer.
Abstract
The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
Composite language model
The n-gram language model is essentially a word predictor that, given its entire document history, predicts the next word w_{k+1} based on the last n−1 words with probability p(w_{k+1} | w_{k−n+2}^{k}), where w_{k−n+2}^{k} = w_{k−n+2}, ..., w_k.
Composite language model
PLSA models together to build a composite generative language model under the directed MRF paradigm (Wang et al., 2005; Wang et al., 2006), the TAGGER and CONSTRUCTOR in SLM and SEMANTIZER in PLSA remain unchanged; however the WORD-PREDICTORs in n-gram, m-SLM and PLSA are combined to form a stronger WORD-PREDICTOR that generates the next word, w_{k+1}, not only depending on the m leftmost exposed headwords in the word-parse k-prefix but also on its n-gram history w_{k−n+2}^{k} and its semantic content g_{k+1}.
Composite language model
The parameter for the WORD-PREDICTOR in the composite n-gram/m-SLM/PLSA language model becomes p(w_{k+1} | w_{k−n+2}^{k}, h_{−m}^{−1}, g_{k+1}).
Introduction
There is a dire need for developing novel approaches to language modeling.”
Introduction
(2006) integrated n-gram, structured language model (SLM) (Chelba and Jelinek, 2000) and probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) under the directed MRF framework (Wang et al., 2005) and studied the stochastic properties for the composite language model .
Introduction
They derived a generalized inside-outside algorithm to train the composite language model from a general EM (Dempster et al., 1977) by following Jelinek’s ingenious definition of the inside and outside probabilities for SLM (Jelinek, 2004) with 6th order of sentence length time complexity.
language model is mentioned in 36 sentences in this paper.
Topics mentioned in this paper:
Chen, Wenliang and Zhang, Min and Li, Haizhou
Abstract
In this paper, we present an approach to enriching high—order feature representations for graph-based dependency parsing models using a dependency language model and beam search.
Abstract
The dependency language model is built on a large-amount of additional auto-parsed data that is processed by a baseline parser.
Abstract
Based on the dependency language model , we represent a set of features for the parsing model.
Dependency language model
Language models play a very important role for statistical machine translation (SMT).
Dependency language model
The standard N-gram based language model predicts the next word based on the N — 1 immediate previous words.
Dependency language model
However, the traditional N-gram language model can not capture long-distance word relations.
Introduction
In this paper, we solve this issue by enriching the feature representations for a graph-based model using a dependency language model (DLM) (Shen et al., 2008).
Introduction
0 We utilize the dependency language model to enhance the graph-based parsing model.
Parsing with dependency language model
In this section, we propose a parsing model which includes the dependency language model by extending the model of McDonald et al.
language model is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Danescu-Niculescu-Mizil, Cristian and Cheng, Justin and Kleinberg, Jon and Lee, Lillian
Hello. My name is Inigo Montoya.
First, we show a concrete sense in which memorable quotes are indeed distinctive: with respect to lexical language models trained on the newswire portions of the Brown corpus [21], memorable quotes have significantly lower likelihood than their non-memorable counterparts.
Hello. My name is Inigo Montoya.
In particular, we analyze a corpus of advertising slogans, and we show that these slogans have significantly greater likelihood at both the word level and the part-of-speech level with respect to a language model trained on memorable movie quotes, compared to a corresponding language model trained on non-memorable movie quotes.
Never send a human to do a machine’s job.
In order to assess different levels of lexical and syntactic distinctiveness, we employ a total of six Laplace-smoothed language models : 1-gram, 2-gram, and 3-gram word LMs and 1-gram, 2-gram and 3-gram part-of-speech LMs.
Never send a human to do a machine’s job.
As indicated in Table 3, for each of our lexical “common language” models , in about 60% of the quote pairs, the memorable quote is more distinctive.
Never send a human to do a machine’s job.
The language models’ vocabulary was that of the entire training corpus.
language model is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Elsner, Micha and Goldwater, Sharon and Eisenstein, Jacob
Abstract
We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features.
Experiments
Nonetheless, it represents phonetic variability more realistically than the Bernstein-Ratner—Brent corpus, while still maintaining the lexical characteristics of infant-directed speech (as compared to the Buckeye corpus, with its much larger vocabulary and more complex language model ).
Inference
The language modeling term relating to the intended string again factors into multiple components.
Inference
Because neither the transducer nor the language model are perfect models of the true distribution, they can have incompatible dynamic ranges.
Inference
3The transducer scores can be cached since they depend only on surface forms, but the language model scores cannot.
Introduction
Previous models with similar goals have learned from an artificial corpus with a small vocabulary (Driesen et al., 2009; Rasanen, 2011) or have modeled variability only in vowels (Feldman et al., 2009); to our knowledge, this paper is the first to use a naturalistic infant-directed corpus while modeling variability in all segments, and to incorporate word-level context (a bigram language model ).
Introduction
Our model is conceptually similar to those used in speech recognition and other applications: we assume the intended tokens are generated from a bigram language model and then distorted by a noisy channel, in particular a log-linear model of phonetic variability.
Introduction
But unlike speech recognition, we have no (intended-form, surface-form) training pairs to train the phonetic model, nor even a dictionary of intended-form strings to train the language model .
Lexical-phonetic model
Our lexical-phonetic model is defined using the standard noisy channel framework: first a sequence of intended word tokens is generated using a language model , and then each token is transformed by a probabilistic finite-state transducer to produce the observed surface sequence.
Related work
In contrast, our model uses a symbolic representation for sounds, but models variability in all segment types and incorporates a bigram word-level language model .
language model is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Sim, Khe Chai
A Probabilistic Formulation for HVR
where P(W) can be modelled by the word-based n-gram language model (Chen and Goodman, 1996) commonly used in automatic speech recognition.
A Probabilistic Formulation for HVR
• Language model score: P(W)
A Probabilistic Formulation for HVR
Note that the acoustic model and language model scores are already used in the conventional ASR.
Abstract
In addition to the acoustic and language models used in automatic speech recognition systems, HVR uses the haptic and partial lexical models as additional knowledge sources to reduce the recognition search space and suppress confusions.
Experimental Results
These sentences contain a variety of given names, surnames and city names so that confusions cannot be easily resolved using a language model .
Experimental Results
The ASR system used in all the experiments reported in this paper consists of a set of HMM-based triphone acoustic models and an n-gram language model .
Experimental Results
A bigram language model with a vocabulary size of 200 words was used for testing.
Haptic Voice Recognition (HVR)
In conventional ASR, acoustically similar word sequences are typically resolved implicitly using a language model where contexts of neighboring words are used for disambiguation.
Integration of Knowledge Sources
where the four terms denote the WFST representations of the acoustic model, language model , PLI model and haptic model respectively.
Integration of Knowledge Sources
(2002) has shown that Hidden Markov Models (HMMs) and n-gram language models can be viewed as WFSTs.
Introduction
In addition to the acoustic model and language model used in ASR, haptic model and partial lexical model are also introduced to facilitate the integration of more sophisticated haptic events, such as the keystrokes, into HVR.
language model is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Pauls, Adam and Klein, Dan
Abstract
We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context.
Abstract
We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours.
Introduction
N-gram language models are a central component of all speech recognition and machine translation systems, and a great deal of research centers around refining models (Chen and Goodman, 1998), efficient storage (Pauls and Klein, 2011; Heafield, 2011), and integration into decoders (Koehn, 2004; Chiang, 2005).
Introduction
At the same time, because n-gram language models only condition on a local window of linear word-level context, they are poor models of long-range syntactic dependencies.
Introduction
Although several lines of work have proposed generative syntactic language models that improve on n-gram models for moderate amounts of data (Chelba, 1997; Xu et al., 2002; Charniak, 2001; Hall, 2004; Roark,
Treelet Language Modeling
The common denominator of most n-gram language models is that they assign probabilities roughly according to empirical frequencies for observed n-grams, but fall back to distributions conditioned on smaller contexts for unobserved n-grams, as shown in Figure 1(a).
Treelet Language Modeling
to use back-off-based smoothing for syntactic language modeling — such techniques have been applied to models that condition on headword contexts (Charniak, 2001; Roark, 2004; Zhang, 2009).
language model is mentioned in 31 sentences in this paper.
Topics mentioned in this paper:
Liu, Le and Hong, Yu and Liu, Hao and Wang, Xing and Yao, Jianmin
Abstract
Most current data selection methods solely use language models trained on a small scale in-domain data to select domain-relevant sentence pairs from general-domain parallel corpus.
Abstract
By contrast, we argue that the relevance between a sentence pair and target domain can be better evaluated by the combination of language model and translation model.
Introduction
Current data selection methods mostly use language models trained on small scale in-domain data to measure domain relevance and select domain-relevant parallel sentence pairs to expand training corpora.
Introduction
To overcome the problem, we first propose the method combining translation model with language model in data selection.
Introduction
The language model measures the domain-specific generation probability of sentences, being used to select domain-relevant sentences at both sides of source and target language.
Related Work
The existing data selection methods are mostly based on language model .
Related Work
(2010) ranked the sentence pairs in the general-domain corpus according to the perplexity scores of sentences, which are computed with respect to in-domain language models .
Related Work
(2011) improved the perplexity-based approach and proposed bilingual cross-entropy difference as a ranking function with in- and general-domain language models .
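A hedged sketch of the cross-entropy difference ranking referred to above; the two log-probability callbacks stand in for the in-domain and general-domain LMs, and the code is illustrative rather than the cited implementation.

```python
def cross_entropy_diff(sentence, in_logprob, gen_logprob, order=4):
    """Per-word cross-entropy difference used to rank candidate sentences.

    in_logprob / gen_logprob return log p(word | context) under the
    in-domain and general-domain LMs. Lower scores mean the sentence
    looks more in-domain than general-domain; the bilingual variant sums
    this score over the source and target sides of a sentence pair.
    """
    def cross_entropy(logprob):
        history, total = [], 0.0
        for w in sentence:
            total -= logprob(tuple(history[-(order - 1):]), w)
            history.append(w)
        return total / max(len(sentence), 1)

    return cross_entropy(in_logprob) - cross_entropy(gen_logprob)
```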
language model is mentioned in 31 sentences in this paper.
Topics mentioned in this paper:
Saluja, Avneesh and Hassan, Hany and Toutanova, Kristina and Quirk, Chris
Evaluation
In §3.3, we then examined the effect of using a very large 5-gram language model trained on 7.5 billion English tokens to understand the nature of the improvements in §3.2.
Evaluation
The Urdu to English evaluation in §3.4 focuses on how noisy parallel data and completely monolingual (i.e., not even comparable) text can be used for a realistic low-resource language pair, and is evaluated with the larger language model only.
Evaluation
The 13 baseline features (2 lexical, 2 phrasal, 5 HRM, and 1 language model , word penalty, phrase length feature and distortion penalty feature) were tuned using MERT (Och, 2003), which is also used to tune the 4 feature weights introduced by the secondary phrase table (2 lexical and 2 phrasal, other features being shared between the two tables).
Generation & Propagation
These candidates are scored using stem-level translation probabilities, morpheme-level lexical weighting probabilities, and a language model , and only the top 30 candidates are included.
Introduction
We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models .
language model is mentioned in 18 sentences in this paper.
Topics mentioned in this paper:
Uszkoreit, Jakob and Brants, Thorsten
Abstract
In statistical language modeling , one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes.
Abstract
The resulting clusterings are then used in training partially class—based language models .
Experiments
We trained a number of predictive class-based language models on different Arabic and English corpora using clusterings trained on the complete data of the same corpus.
Experiments
We use each predictive class-based language model as well as a word-based model as separate feature functions in the log-linear combination in Eq.
Experiments
The word-based language model used by the system in these experiments is a 5-gram model also trained on the enlarget data set.
Introduction
A statistical language model assigns a probability P(w_1^m) to any given string of words w_1^m = w_1, ..., w_m.
Introduction
In the case of n-gram language models this is done by factoring the probability:
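The factorization the sentence above introduces is the usual chain rule with each factor truncated to an (n−1)-word history (a standard sketch, not necessarily the paper's exact typography):

```latex
P(w_1^m) \;=\; \prod_{i=1}^{m} P\!\left(w_i \mid w_1^{i-1}\right)
         \;\approx\; \prod_{i=1}^{m} P\!\left(w_i \mid w_{i-n+1}^{i-1}\right)
```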
Introduction
do not differ in the last n−1 words, one problem n-gram language models suffer from is that the training data is too sparse to reliably estimate all conditional probabilities P(w_i | w_{i−n+1}^{i−1}).
language model is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Talbot, David and Brants, Thorsten
Abstract
We propose a succinct randomized language model which employs a perfect hash function to encode fingerprints of n-grams and their associated probabilities, backoff weights, or other parameters.
Abstract
We demonstrate the space-savings of the scheme via machine translation experiments within a distributed language modeling framework.
Experimental Setup
We deploy the randomized LM in a distributed framework which allows it to scale more easily by distributing it across multiple language model servers.
Introduction
Language models (LMs) are a core component in statistical machine translation, speech recognition, optical character recognition and many other areas.
Introduction
With large monolingual corpora available in major languages, making use of all the available data is now a fundamental challenge in language modeling .
Introduction
have considered alternative parameterizations such as class-based models (Brown et al., 1992), model reduction techniques such as entropy-based pruning (Stolcke, 1998), novel represention schemes such as suffix arrays (Emami et al., 2007), Golomb Coding (Church et al., 2007) and distributed language models that scale more readily (Brants et al., 2007).
Scaling Language Models
In language modeling the universe under consideration is the set of all possible n-grams of length n for given vocabulary.
Scaling Language Models
Recent work (Talbot and Osborne, 2007b) has used lossy encodings based on Bloom filters (Bloom, 1970) to represent logarithmically quantized corpus statistics for language modeling .
language model is mentioned in 25 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Gildea, Daniel
Abstract
We take a multi-pass approach to machine translation decoding when using synchronous context-free grammars as the translation model and n-gram language models: the first pass uses a bigram language model, and the resulting parse forest is used in the second pass to guide search with a trigram language model .
Introduction
This complexity arises from the interaction of the tree-based translation model with an n-gram language model .
Introduction
First, we present a two-pass decoding algorithm, in which the first pass explores states resulting from an integrated bigram language model , and the second pass expands these states into trigram-based
Introduction
The general bigram-to-trigram technique is common in speech recognition (Murveit et al., 1993), where lattices from a bigram-based decoder are re-scored with a trigram language model .
Language Model Integrated Decoding for SCFG
We begin by introducing Synchronous Context Free Grammars and their decoding algorithms when an n-gram language model is integrated into the grammatical search space.
Language Model Integrated Decoding for SCFG
Without an n-gram language model , decoding using SCFG is not much different from CFG parsing.
Language Model Integrated Decoding for SCFG
However, when we want to integrate an n-gram language model into the search, our goal is searching for the derivation whose total sum of weights of productions and n-gram log probabilities is maximized.
Multi-pass LM-Integrated Decoding
very good estimate of the outside cost using a trigram model since a bigram language model and a trigram language model must be strongly correlated.
Multi-pass LM-Integrated Decoding
We propagate the outside cost of the parent to its children by combining with the inside cost of the other children and the interaction cost, i.e., the language model cost between the focused child and the other children.
Multi-pass LM-Integrated Decoding
(2007) also take a two-pass decoding approach, with the first pass leaving the language model boundary words out of the dynamic programming state, such that only one hypothesis is retained for each span and grammar symbol.
language model is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Salameh, Mohammad and Cherry, Colin and Kondrak, Grzegorz
Abstract
Our novel lattice desegmentation algorithm effectively combines both segmented and desegmented views of the target language for a large subspace of possible translation outputs, which allows for inclusion of features related to the desegmentation process, as well as an unsegmented language model (LM).
Methods
This trivially allows for an unsegmented language model and never makes desegmentation errors.
Methods
Doing so enables the inclusion of an unsegmented target language model , and with a small amount of bookkeeping, it also allows the inclusion of features related to the operations performed during desegmentation (see Section 3.4).
Methods
We now have a desegmented lattice, but it has not been annotated with an unsegmented (word-level) language model .
Related Work
Bojar (2007) incorporates such analyses into a factored model, to either include a language model over target morphological tags, or model the generation of morphological features.
Related Work
They introduce an additional desegmentation technique that augments the table-based approach with an unsegmented language model .
Related Work
Oflazer and Durgar El-Kahlout (2007) desegment 1000-best lists for English-to-Turkish translation to enable scoring with an unsegmented language model .
language model is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Galley, Michel and Manning, Christopher D.
Abstract
This paper applies MST parsing to MT, and describes how it can be integrated into a phrase-based decoder to compute dependency language model scores.
Abstract
Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets.
Dependency parsing for machine translation
While it seems that loopy graphs are undesirable when the goal is to obtain a syntactic analysis, that is not necessarily the case when one just needs a language modeling score.
Introduction
Hierarchical approaches to machine translation have proven increasingly successful in recent years (Chiang, 2005; Marcu et al., 2006; Shen et al., 2008), and often outperform phrase-based systems (Och and Ney, 2004; Koehn et al., 2003) on target language fluency and adequacy; however, their benefits generally come with high computational costs, particularly when chart parsing, such as CKY, is integrated with language models of high orders (Wu, 1996).
Introduction
Indeed, researchers have shown that gigantic language models are key to state-of-the-art performance (Brants et al., 2007), and the ability of phrase-based decoders to handle large-size, high-order language models with no consequence on asymptotic running time during decoding presents a compelling advantage over CKY decoders, whose time complexity grows prohibitively large with higher-order language models.
Introduction
Most interestingly, the time complexity of non-projective dependency parsing remains quadratic as the order of the language model increases.
Machine translation experiments
We use the standard features implemented almost exactly as in Moses: four translation features (phrase-based translation probabilities and lexically-weighted probabilities), word penalty, phrase penalty, linear distortion, and language model score.
Machine translation experiments
In order to train a competitive baseline given our computational resources, we built a large 5-gram language model using the Xinhua and AFP sections of the Gigaword corpus (LDC2007T40) in addition to the target side of the parallel data.
Machine translation experiments
The language model was smoothed with the modified Kneser-Ney algorithm as implemented in (Stolcke, 2002), and we only kept 4-grams and 5-grams that occurred at least three times in the training data.6
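As a rough illustration of the count cutoff described above (a sketch, not the actual SRILM pipeline used in the paper), the following Python snippet counts n-grams up to order 5 and keeps only 4-grams and 5-grams that occur at least three times; the input format and helper names are hypothetical.

    from collections import Counter

    def collect_pruned_ngrams(sentences, order=5, min_count=3):
        # Count n-grams up to `order`; drop 4-grams and 5-grams seen fewer
        # than `min_count` times, mirroring the cutoff described above.
        counts = {n: Counter() for n in range(1, order + 1)}
        for tokens in sentences:
            padded = ["<s>"] * (order - 1) + tokens + ["</s>"]
            for n in range(1, order + 1):
                for i in range(len(padded) - n + 1):
                    counts[n][tuple(padded[i:i + n])] += 1
        for n in (4, 5):
            counts[n] = Counter({g: c for g, c in counts[n].items() if c >= min_count})
        return counts

    # toy usage: the 5-grams survive because each occurs three times
    corpus = [["the", "language", "model", "was", "smoothed"]] * 3
    print(len(collect_pruned_ngrams(corpus)[5]))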
language model is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Shen, Libin and Xu, Jinxi and Weischedel, Ralph
Abstract
With this new framework, we employ a target dependency language model during decoding to exploit long distance word relations, which are unavailable with a traditional n—gram language model .
Dependency Language Model
w_h-as-head represents w_h used as the head, and it is different from w_h in the dependency language model .
Dependency Language Model
In order to calculate the dependency language model score, or depLM score for short, on the fly for
Discussion
(2003) described a two-step string-to-CFG—tree translation model which employed a syntax-based language model to select the best translation from a target parse forest built in the first step.
Implementation Details
Language model score .
Implementation Details
Dependency language model score 8.
Introduction
language model during decoding, in order to exploit long-distance word relations which are unavailable with a traditional n-gram language model on target strings.
Introduction
Section 3 illustrates of the use of dependency language models .
String-to-Dependency Translation
Formal definitions also allow us to easily extend the framework to incorporate a dependency language model in decoding.
String-to-Dependency Translation
Supposing we use a traditional trigram language model in decoding, we need to specify the leftmost two words and the rightmost two words in a state.
String-to-Dependency Translation
In the next section, we will explain how to extend categories and states to exploit a dependency language model during decoding.
language model is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
P, Deepak and Visweswariah, Karthik
Abstract
We use translation models and language models to exploit lexical correlations and solution post character respectively.
Introduction
The cornerstone of our technique is the usage of a hitherto unexplored textual feature, lexical correlations between problems and solutions, that is exploited along with language model based characterization of solution posts.
Introduction
We model the lexical correlation and solution post character using regularized translation models and unigram language models respectively.
Our Approach
Consider a unigram language model 83 that models the lexical characteristics of solution posts, and a translation model 73 that models the lexical correlation between problems and solutions.
Our Approach
In short, each solution word is assumed to be generated from the language model or the translation model (conditioned on the problem words) with a probability of λ and 1 - λ respectively, thus accounting for the correlation assumption.
Our Approach
Of the solution words above, generic words such as try and should could probably be explained by (i.e., sampled from) the solution language model , whereas disconnect and rejoin could be correlated well with surf and wifi and hence are more likely to be supported better by the translation model.
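A minimal sketch of the mixture described above, assuming λ = 0.6 and toy probability tables (all names and numbers here are illustrative, not taken from the paper): each solution word is scored as λ times its language model probability plus (1 - λ) times its translation probability given the problem words.

    def word_likelihood(word, problem_words, lm_prob, trans_prob, lam=0.6):
        # Mixture of the solution language model (weight lam) and the
        # translation model conditioned on the problem words (weight 1 - lam).
        p_lm = lm_prob(word)
        p_tm = sum(trans_prob(word, pw) for pw in problem_words) / max(len(problem_words), 1)
        return lam * p_lm + (1 - lam) * p_tm

    # toy example with made-up probabilities
    lm_table = {"try": 0.05, "disconnect": 0.001}
    tm_table = {("disconnect", "wifi"): 0.2}
    score = word_likelihood("disconnect", ["wifi", "surf"],
                            lambda w: lm_table.get(w, 1e-4),
                            lambda w, p: tm_table.get((w, p), 1e-6))
    print(score)  # the translation component dominates for "disconnect"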
Related Work
We will use translation and language models in our method for solution identification.
language model is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Hirao, Tsutomu and Suzuki, Jun and Isozaki, Hideki
A Syntax Free Sequence-oriented Sentence Compression Method
As an alternative to syntactic parsing, we propose two novel features, intra-sentence positional term weighting (IPTW) and the patched language model (PLM) for our syntax-free sentence compressor.
A Syntax Free Sequence-oriented Sentence Compression Method
3.2.2 Patched Language Model
A Syntax Free Sequence-oriented Sentence Compression Method
Many studies on sentence compression employ the n-gram language model to evaluate the linguistic likelihood of a compressed sentence.
Abstract
As an alternative to syntactic parsing, we propose a novel term weighting technique based on the positional information within the original sentence and a novel language model that combines statistics from the original sentence and a general corpus.
Conclusions
As an alternative to the syntactic parser, we proposed two novel features, Intra-sentence positional term weighting (IPTW) and the Patched language model (PLM), and showed their effectiveness by conducting automatic and human evaluations,
Experimental Evaluation
We developed the n-gram language model from a 9 year set of Mainichi Newspaper articles.
Introduction
To maintain the subject-predicate relationship in the compressed sentence and retain fluency without using syntactic parsers, we propose two novel features: intra-sentence positional term weighting (IPTW) and the patched language model (PLM).
Introduction
PLM is a form of summarization-oriented fluency statistics derived from the original sentence and the general language model .
Results and Discussion
Replacing PLM with the bigram language model (w/o PLM) degrades the performance significantly.
Results and Discussion
This result shows that the n-gram language model is improper for sentence compression because the n-gram probability is computed by using a corpus that includes both short and long sentences.
language model is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Mei, Qiaozhu and Zhai, ChengXiang
Abstract
We propose language modeling methods for solving this problem, and study how to incorporate features such as authority and proximity to accurately estimate the impact language model .
Impact Summarization
To solve these challenges, in the next section, we propose to model impact with un-igram language models and score sentences using
Impact Summarization
We further propose methods for estimating the impact language model based on several features including the authority of citations, and the citation proximity.
Introduction
We propose language models to exploit both the citation context and original content of a paper to generate an impact-based summary.
Introduction
We study how to incorporate features such as authority and proximity into the estimation of language models .
Introduction
We propose and evaluate several different strategies for estimating the impact language model , which is key to impact summarization.
Language Models for Impact Summarization
3.1 Impact language models
Language Models for Impact Summarization
We thus propose to represent such a virtual impact query with a unigram language model .
Language Models for Impact Summarization
Such a model is expected to assign high probabilities to those words that can describe the impact of paper d, just as we expect a query language model in ad hoc retrieval to assign high probabilities to words that tend to occur in relevant documents (Ponte and Croft, 1998).
language model is mentioned in 23 sentences in this paper.
Topics mentioned in this paper:
van Gompel, Maarten and van den Bosch, Antal
Abstract
We study the feasibility of exploiting cross-lingual context to obtain high-quality translation suggestions that improve over statistical language modelling and word-sense disambiguation baselines.
Baselines
A second baseline was constructed by weighing the probabilities from the translation table directly with the L2 language model described earlier.
Baselines
target language modelling ) which is also cus-
Introduction
The main research question in this research is how to disambiguate an L1 word or phrase to its L2 translation based on an L2 context, and whether such cross-lingual contextual approaches provide added value compared to baseline models that are not context informed or compared to standard language models .
System
3.1 Language Model
System
We also implement a statistical language model as an optional component of our classifier-based system and also as a baseline to compare our system to.
System
The language model is a trigram-based back-off language model with Kneser-Ney smoothing, computed using SRILM (Stolcke, 2002) and trained on the same training data as the translation model.
language model is mentioned in 25 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Yang, Nan and Li, Mu and Zhou, Ming
Abstract
RZNN is a combination of recursive neural network and recurrent neural network, and in turn integrates their respective capabilities: (1) new information can be used to generate the next hidden state, like recurrent neural networks, so that language model and translation model can be integrated naturally; (2) a tree structure can be built, as recursive neural networks, so as to generate the translation candidates in a bottom up manner.
Experiments and Results
The language model is a 5-gram language model trained with the target sentences in the training data.
Introduction
Recurrent neural networks are leveraged to learn language model , and they keep the history information circularly inside the network for arbitrarily long time (Mikolov et al., 2010).
Introduction
DNN is also introduced to Statistical Machine Translation (SMT) to learn several components or features of conventional framework, including word alignment, language modelling , translation modelling and distortion modelling.
Introduction
In recursive neural networks, all the representations of nodes are generated based on their child nodes, and it is difficult to integrate additional global information, such as language model and distortion model.
Our Model
Recurrent neural network is usually used for sequence processing, such as language model (Mikolov et al., 2010).
Our Model
Commonly used sequence processing methods, such as Hidden Markov Model (HMM) and n-gram language model , only use a limited history for the prediction.
Our Model
In HMM, the previous state is used as the history, and for n-gram language model (for example n equals to 3), the history is the previous two words.
Related Work
(2013) extend the recurrent neural network language model , in order to use both the source and target side information to scoring translation candidates.
language model is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Kaufmann, Tobias and Pfister, Beat
Abstract
We propose a language model based on a precise, linguistically motivated grammar (a handcrafted Head-driven Phrase Structure Grammar) and a statistical model estimating the probability of a parse tree.
Abstract
The language model is applied by means of an N -best rescoring step, which allows to directly measure the performance gains relative to the baseline system without rescoring.
Introduction
Other linguistically inspired language models like Chelba and J elinek (2000) and Roark (2001) have been applied to continuous speech recognition.
Introduction
In the first place, we want our language model to reliably distinguish between grammatical and ungrammatical phrases.
Introduction
However, their grammar-based language model did not make use of a probabilistic component, and it was applied to a rather simple recognition task (dictation texts for pupils read and recorded under good acoustic conditions, no out-of-vocabulary words).
Language Model 2.1 The General Approach
The language model weight λ and the word insertion penalty ip lead to a better performance in practice, but they have no theoretical justification.
Language Model 2.1 The General Approach
Our grammar-based language model is incorporated into the above expression as an additional probability P_gram(W), weighted by a parameter μ:
Language Model 2.1 The General Approach
A major problem of grammar-based approaches to language modeling is how to deal with out-of-grammar utterances.
language model is mentioned in 23 sentences in this paper.
Topics mentioned in this paper:
Fleischman, Michael and Roy, Deb
Abstract
Grounded language models represent the relationship between words and the nonlinguistic context in which they are said.
Abstract
Results show that grounded language models improve perplexity and word error rate over text based language models , and further, support video information retrieval better than human generated speech transcriptions.
Introduction
The method is based on the use of grounded language models to repre-
Introduction
Grounded language models are based on research from cognitive science on grounded models of meaning.
Introduction
This paper extends previous work on grounded models of meaning by learning a grounded language model from naturalistic data collected from broadcast video of Major League Baseball games.
Linguistic Mapping
We model this relationship, much like traditional language models , using conditional probability distributions.
Linguistic Mapping
Unlike traditional language models, however, our grounded language models condition the probability of a word not only on the word(s) uttered before it, but also on the temporal pattern features that describe the nonlinguistic context in which it was uttered.
language model is mentioned in 46 sentences in this paper.
Topics mentioned in this paper:
Duan, Huizhong and Cao, Yunbo and Lin, Chin-Yew and Yu, Yong
Abstract
Then we model question topic and question focus in a language modeling framework for search.
Abstract
Experimental results indicate that our approach of identifying question topic and question focus for search significantly outperforms the baseline methods such as Vector Space Model (VSM) and Language Model for Information Retrieval (LMIR).
Introduction
vector space model, Okapi, language model , and translation-based model, within the setting of question search (Jeon et al., 2005b).
Introduction
On the basis of this, we then propose to model question topic and question focus in a language modeling framework for search.
Our Approach to Question Search
model question topic and question focus in a language modeling framework for search.
Our Approach to Question Search
We employ the framework of language modeling (for information retrieval) to develop our approach to question search.
Our Approach to Question Search
In the language modeling approach to information retrieval, the relevance of a targeted question q̃ to a queried question q is given by the probability p(q|q̃) of generating the queried question q
language model is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhang, Chunliang
A Skeleton-based Approach to MT 2.1 Skeleton Identification
In this work both the skeleton translation model gskel (d) and full translation model gfuu (d) resemble the usual forms used in phrase-based MT, i.e., the model score is computed by a linear combination of a group of phrase-based features and language models .
A Skeleton-based Approach to MT 2.1 Skeleton Identification
Given a translation model m, a language model lm and a vector of feature weights w, the model score of a derivation d is computed by
A Skeleton-based Approach to MT 2.1 Skeleton Identification
lm(d) and wlm are the score and weight of the language model , respectively.
Evaluation
A 5-gram language model was trained on the Xinhua portion of the English Gi-gaword corpus in addition to the target-side of the bilingual data.
Introduction
We develop a skeletal language model to describe the possibility of translation skeleton and handle some of the long-distance word dependencies.
language model is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Beaufort, Richard and Roekhaut, Sophie and Cougnon, Louise-Amélie and Fairon, Cédrick
Conclusion and perspectives
It would also be interesting to test the impact of another lexical language model , learned on non-SMS sentences.
Evaluation
The language model of the evaluation is a 3-gram.
Evaluation
(2008a), who showed on a French corpus comparable to ours that, if using a larger language model is always rewarded, the improvement quickly decreases with every higher level and is already quite small between 2-gram and 3-gram.
Overview of the system
In our system, all lexicons, language models and sets of rules are compiled into finite-state machines (FSMs) and combined with the input text by composition (0).
Overview of the system
Third, a combination of the lattice of solutions with a language model , and the choice of the best sequence of lexical units.
Related work
A language model is then applied on the word lattice, and the most probable word sequence is finally chosen by applying a best-path algorithm on the lattice.
The normalization models
All tokens Tj of S are concatenated together and composed with the lexical language model LM.
The normalization models
4.6 The language model
The normalization models
Our language model is an n-gram of lexical forms, smoothed by linear interpolation (Chen and Goodman, 1998), estimated on the normalized part of our training corpus and compiled into a weighted FST LMw.
language model is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Zhong, Zhi and Ng, Hwee Tou
Abstract
Together with the senses predicted for words in documents, we propose a novel approach to incorporate word senses into the language modeling approach to IR and also exploit the integration of synonym relations.
Incorporating Senses into Language Modeling Approaches
The next problem is to incorporate the sense information into the language modeling approach.
Incorporating Senses into Language Modeling Approaches
Given a query q and a document d in text collection C, we want to reestimate the language models by making use of the sense information assigned to them.
Incorporating Senses into Language Modeling Approaches
With this language model , the probability of a query term in a document is enlarged by the synonyms of its senses; the more of its synonym senses appear in a document, the higher the probability.
Introduction
We incorporate word senses into the language modeling (LM) approach to IR (Ponte and Croft, 1998), and utilize sense synonym relations to further improve the performance.
The Language Modeling Approach to IR
3.1 The language modeling approach
The Language Modeling Approach to IR
In the language modeling approach to IR, language models are constructed for each query q and each document d in a text collection C. The documents in C are ranked by the distance to a given query q according to the language models .
The Language Modeling Approach to IR
The most commonly used language model in IR is the unigram model, in which terms are assumed to be independent of each other.
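As a generic illustration of this query-likelihood view (a sketch of the plain unigram approach, not the paper's sense-based reestimation), the snippet below scores a document by the probability of generating the query under a smoothed unigram document model; the Jelinek-Mercer smoothing and the toy data are assumptions made for the example.

    import math
    from collections import Counter

    def query_likelihood(query_terms, doc_terms, collection_counts, collection_len, lam=0.8):
        # Score = log probability of generating the query from the document's
        # unigram model, interpolated with the collection model for smoothing.
        doc_counts = Counter(doc_terms)
        doc_len = len(doc_terms)
        score = 0.0
        for t in query_terms:
            p_doc = doc_counts[t] / doc_len if doc_len else 0.0
            p_col = collection_counts[t] / collection_len
            score += math.log(lam * p_doc + (1 - lam) * p_col + 1e-12)
        return score

    docs = [["word", "sense", "disambiguation"], ["language", "model", "retrieval"]]
    col = Counter(t for d in docs for t in d)
    ranked = sorted(range(len(docs)),
                    key=lambda i: query_likelihood(["language", "model"], docs[i], col, sum(col.values())),
                    reverse=True)
    print(ranked)  # document 1 generates the query with higher probability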
language model is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Wintrode, Jonathan and Khudanpur, Sanjeev
Abstract
We aim to improve spoken term detection performance by incorporating contextual information beyond traditional N-gram language models .
Introduction
ASR systems traditionally use N-gram language models to incorporate prior knowledge of word occurrence patterns into prediction of the next word in the token stream.
Introduction
Yet, though many language models more sophisticated than N- grams have been proposed, N-grams are empirically hard to beat in terms of WER.
Introduction
The strength of this phenomenon suggests it may be more viable for improving term-detection than, say, topic-sensitive language models .
Motivation
The re-scoring approach we present is closely related to adaptive or cache language models (Je-linek, 1997; Kuhn and De Mori, 1990; Kneser and Steinbiss, 1993).
Motivation
The primary difference between this and previous work on similar language models is the narrower focus here on the term detection task, in which we consider each search term in isolation, rather than all words in the vocabulary.
Results
We train ASR acoustic and language models from the training corpus using the Kaldi speech recognition toolkit (Povey et al., 2011) following the default BABEL training and search recipe which is described in detail by Chen et al.
Term and Document Frequency Statistics
A similar phenomenon is observed concerning adaptive language models (Church, 2000).
Term and Document Frequency Statistics
In general, we can think of using word repetitions to re-score term detection as applying a limited form of adaptive or cache language model (Je-linek, 1997).
Term and Document Frequency Statistics
In applying the burstiness quantity to term detection, we recall that the task requires us to locate a particular instance of a term, not estimate a count, hence the utility of N-gram language models predicting words in sequence.
language model is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Huang, Eric and Socher, Richard and Manning, Christopher and Ng, Andrew
Abstract
We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models .
Conclusion
Our new multi-prototype neural language model outperforms previous neural models and competitive baselines on this new dataset.
Experiments
Table 3 shows our results compared to previous methods, including C&W’s language model and the hierarchical log-bilinear (HLBL) model (Mnih and Hinton, 2008), which is a probabilistic, linear neural model.
Global Context-Aware Neural Language Model
Note that Collobert and Weston (2008)’s language model corresponds to the network using only local context.
Introduction
We introduce a new neural-network-based language model that distinguishes and uses both local and global context via a joint training objective.
Introduction
We show that our multi-prototype model improves upon the single-prototype version and outperforms other neural language models and baselines on this dataset.
Related Work
Neural language models (Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Schwenk and Gauvain, 2002; Emami et al., 2003) have been shown to be very powerful at language modeling , a task where models are asked to accurately predict the next word given previously seen words.
Related Work
Schwenk and Gauvain (2002) tried to incorporate larger context by combining partial parses of past word sequences and a neural language model .
Related Work
They used up to 3 previous head words and showed increased performance on language modeling .
language model is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Foster, George and Sankaran, Baskaran and Sarkar, Anoop
Conclusion & Future Work
Future work includes extending this approach to use multiple translation models with multiple language models in ensemble decoding.
Experiments & Results 4.1 Experimental Setup
For the mixture baselines, we used a standard one-pass phrase-based system (Koehn et al., 2003), Portage (Sadat et al., 2005), with the following 7 features: relative-frequency and lexical translation model (TM) probabilities in both directions; word-displacement distortion model; language model (LM) and word count.
Experiments & Results 4.1 Experimental Setup
Fixing the language model allows us to compare various translation model combination techniques.
Introduction
Common techniques for model adaptation adapt two main components of contemporary state-of-the-art SMT systems: the language model and the translation model.
Introduction
However, language model adaptation is a more straightforward problem compared to
Introduction
translation model adaptation, because various measures such as perplexity of adapted language models can be easily computed on data in the target domain.
Related Work 5.1 Domain Adaptation
They use language model perplexities from IN to select relevant sentences from OUT.
language model is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Wu, Stephen and Bachrach, Asaf and Cardenas, Carlos and Schuler, William
Discussion
This is particularly true when the sentence structure is defined in a language model that is psycholinguistically plausible (here, bounded-memory right-corner form).
Discussion
This accords with an understated result of Boston et al.’s eye-tracking study (2008a): a richer language model predicts eye movements during reading better than an oversimplified one.
Discussion
Frank (2009) similarly reports improvements in the reading-time predictiveness of unlexi-calized surprisal when using a language model that is more plausible than PCFGs.
Introduction
Ideally, a psychologically-plausible language model would produce a surprisal that would correlate better with linguistic complexity.
Introduction
Therefore, the specification of how to encode a syntactic language model is of utmost importance to the quality of the metric.
Introduction
The purpose of this paper is to determine whether the language model defined by the HHMM parser can also predict reading times —it would be strange if a psychologically plausible model did not also produce viable complexity metrics.
Parsing Model
Both of these metrics fall out naturally from the time-series representation of the language model .
Parsing Model
With the understanding of what operations need to occur, a formal definition of the language model is in order.
language model is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Rastrow, Ariya and Dredze, Mark and Khudanpur, Sanjeev
Abstract
Long-span features, such as syntax, can improve language models for tasks such as speech recognition and machine translation.
Abstract
However, these language models can be difficult to use in practice because of the time required to generate features for rescoring a large hypothesis set.
Abstract
When using these improved tools in a language model for speech recognition, we obtain significant speed improvements with both N -best and hill climbing rescoring, and show that up-training leads to WER reduction.
Conclusion
The computational complexity of accurate syntactic processing can make structured language models impractical for applications such as ASR that require scoring hundreds of hypotheses per input.
Incorporating Syntactic Structures
These are then passed to the language model along with the word sequence for scoring.
Introduction
Language models (LM) are crucial components in tasks that require the generation of coherent natural language text, such as automatic speech recognition (ASR) and machine translation (MT).
Related Work
The lattice parser therefore, is itself a language model .
Syntactic Language Models
There have been several approaches to include syntactic information in both generative and discriminative language models .
Syntactic Language Models
Structured language modeling incorporates syntactic parse trees to identify the head words in a hypothesis for modeling dependencies beyond n-grams.
Syntactic Language Models
Our Language Model .
language model is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Fine, Alex B. and Frank, Austin F. and Jaeger, T. Florian and Van Durme, Benjamin
Abstract
We consider the prediction of three human behavioral measures — lexical decision, word naming, and picture naming —through the lens of domain bias in language modeling .
Abstract
This study aims to provoke increased consideration of the human language model by NLP practitioners: biases are not limited to differences between corpora (i.e.
Discussion
Our analyses reveal that 6 commonly used corpora fail to reflect the human language model in various ways related to dialect, modality, and other properties of each corpus.
Discussion
Our results point to a type of bias in commonly used language models that has been previously overlooked.
Discussion
Just as language models have been used to predict reading grade-level of documents (Collins-Thompson and Callan, 2004), human language models could be
Introduction
Computational linguists build statistical language models for aiding in natural language processing (NLP) tasks.
Introduction
In the current study, we exploit errors of the latter variety—failure of a language model to predict human performance—to investigate bias across several frequently used corpora in computational linguistics.
Introduction
: Human Language Model
language model is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Clifton, Ann and Sarkar, Anoop
Experimental Results
We trained all of the Moses systems herein using the standard features: language model , reordering model, translation model, and word penalty; in addition to these, the factored experiments called for additional translation and generation features for the added factors as noted above.
Experimental Results
For the language models, we used SRILM 5-gram language models (Stol-cke, 2002) for all factors.
Experimental Results
koske+ +va+ +A mietintö+ +A käsi+ +te+ +llä+ +ä+ +n language model disambiguation:
Models 2.1 Baseline Models
Morphology generation models can use a variety of bilingual and contextual information to capture dependencies between morphemes, often more long-distance than what is possible using n-gram language models over morphemes in the segmented model.
Models 2.1 Baseline Models
is to take the abstract suffix tag sequence 31* and then map it into fully inflected word forms, and rank those outputs using a morphemic language model .
Models 2.1 Baseline Models
After CRF based recovery of the suffix tag sequence, we use a bigram language model trained on a full segmented version on the training data to recover the original vowels.
Related Work
They use a segmented phrase table and language model along with the word-based versions in the decoder and in tuning a Finnish target.
Related Work
In their work a segmented language model can score a translation, but cannot insert morphology that does not show source-side reflexes.
language model is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Yamaguchi, Hiroshi and Tanaka-Ishii, Kumiko
Calculation of Cross-Entropy
(X_i), is the cross-entropy of X_i for L_i multiplied by |X_i|. Various methods for computing cross-entropy have been proposed, and these can be roughly classified into two types based on different methods of universal coding and the language model .
Calculation of Cross-Entropy
For example, (Benedetto et al., 2002) and (Cilibrasi and Vitanyi, 2005) used the universal coding approach, whereas (Teahan and Harper, 2001) and (Sibun and Reynar, 1996) were based on language modeling using PPM and Kullback—Leibler divergence, respectively.
Calculation of Cross-Entropy
As a representative method for calculating the cross—entropy through statistical language modeling , we adopt prediction by partial matching (PPM), a language—based encoding method devised by (Cleary and Witten, 1984).
In the experiments reported here, n is set to 5 throughout.
lel ), gives the description length of the remaining characters under the language model for L.
Introduction
They used statistical language modeling and heuristics to detect foreign words and tested the case of English embedded in German texts.
Problem Formulation
In our setting, we assume that a small amount (up to kilobytes) of monolingual plain text sample data is available for every language, e.g., the Universal Declaration of Human Rights, which serves to generate the language model used for language identification.
Problem Formulation
calculates the description length of a text segment X_i through the use of a language model for L_i.
Problem Formulation
Here, the first term corresponds to the code length of the text chunk X_i given a language model for L_i, which in fact corresponds to the cross-entropy of X_i for L_i multiplied by |X_i|. The remaining terms give the code lengths of the parameters used to describe the length of the first term: the second term corresponds to the segment location; the third term, to the identified language; and the fourth term, to the language model of language L_i.
language model is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Mitchell, Jeff and Lapata, Mirella and Demberg, Vera and Keller, Frank
Abstract
In this paper we analyze reading times in terms of a single predictive measure which integrates a model of semantic composition with an incremental parser and a language model .
Integrating Semantic Constraint into Surprisal
While surprisal is a theoretically well-motivated measure, formalizing the idea of linguistic processing being highly predictive in terms of probabilistic language models , the measurement of semantic constraint in terms of vector similarities lacks a clear motivation.
Integrating Semantic Constraint into Surprisal
This can be achieved by turning a vector model of semantic similarity into a probabilistic language model .
Integrating Semantic Constraint into Surprisal
There are in fact a number of approaches to deriving language models from distributional models of semantics (e.g., Bellegarda 2000; Coccaro and Jurafsky 1998; Gildea and Hofmann 1999).
Models of Processing Difficulty
The basic idea is that the processing costs relating to the expectations of the language processor can be expressed in terms of the probabilities assigned by some form of language model to the input.
Models of Processing Difficulty
Surprisal could be also defined using a vanilla language model that does not take any structural or grammatical information into account (Frank 2009).
language model is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Sennrich, Rico and Schwenk, Holger and Aransa, Walid
Translation Model Architecture
We train a language model on the source language side of each of the n component bitexts, and compute an n-dimensional vector for each sentence by computing its entropy with each language model .
Translation Model Architecture
Our aim is not to discriminate between sentences that are more likely and unlikely in general, but to cluster on the basis of relative differences between the language model entropies.
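The sentence-level vector can be pictured with the toy sketch below, which computes per-word cross-entropy under each component model; for simplicity the component models here are unigram probability tables, whereas the paper's are full language models trained on each component bitext.

    import math

    def entropy_vector(tokens, component_lms):
        # One cross-entropy value per component LM gives the n-dimensional
        # vector used for clustering sentences by relative domain fit.
        vec = []
        for lm in component_lms:  # each lm maps token -> probability
            logprob = sum(math.log2(lm.get(tok, 1e-6)) for tok in tokens)
            vec.append(-logprob / len(tokens))  # per-word cross-entropy
        return vec

    lm_news = {"parliament": 1e-4, "vote": 1e-3}
    lm_medical = {"patient": 1e-3, "dose": 1e-3}
    print(entropy_vector(["parliament", "vote"], [lm_news, lm_medical]))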
Translation Model Architecture
While it is not the focus of this paper, we also evaluate language model adaptation.
language model is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Deng, Yonggang and Xu, Jia and Gao, Yuqing
A Generic Phrase Training Procedure
Each normalized feature score derived from word alignment models or language models will be log-linearly combined to generate the final score.
Discussions
We propose several information metrics derived from posterior distribution, language model and word alignments as feature functions.
Experimental Results
Like other log-linear model based decoders, active features in our translation engine include translation models in two directions, lexicon weights in two directions, language model , lexicalized distortion models, sentence length penalty and other heuristics.
Experimental Results
The language model is a statistical trigram model estimated with Modified Kneser—Ney smoothing (Chen and Goodman, 1996) using only English sentences in the parallel training data.
Features
All these features are data-driven and defined based on models, such as statistical word alignment model or language model .
Features
We apply a language model (LM) to describe the predictive uncertainty (PU) between words in two directions.
Features
Given a history w_1^{i-1}, a language model specifies a conditional distribution of the future word being predicted to follow the history.
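One way to picture predictive uncertainty is as the entropy of the language model's next-word distribution given the history; the sketch below is an interpretation offered for illustration, not the paper's exact definition of the PU feature, and the toy distribution is invented.

    import math

    def predictive_uncertainty(history, next_word_dist):
        # Entropy of the next-word distribution given the history: low entropy
        # means the history strongly predicts the following word.
        dist = next_word_dist(history)
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def toy_dist(history):
        return {"kong": 0.9, "bank": 0.05, "island": 0.05} if history[-1] == "hong" else {"the": 1.0}

    print(predictive_uncertainty(("in", "hong"), toy_dist))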
language model is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Turian, Joseph and Ratinov, Lev-Arie and Bengio, Yoshua
Clustering-based word representations
So it is a class-based bigram language model .
Clustering-based word representations
Deschacht and Moens (2009) use a latent-variable language model to improve semantic role labeling.
Distributed representations
Word embeddings are typically induced using neural language models , which use neural networks as the underlying predictive model (Bengio, 2008).
Distributed representations
Historically, training and testing of neural language models has been slow, scaling as the size of the vocabulary for each model computation (Bengio et al., 2001; Bengio et al., 2003).
Distributed representations
Collobert and Weston (2008) presented a neural language model that could be trained over billions of words, because the gradient of the loss was computed stochastically over a small sample of possible outputs, in a spirit similar to Bengio and Sénecal (2003).
Introduction
Neural language models (Bengio et al., 2001; Schwenk & Gauvain, 2002; Mnih & Hinton, 2007; Collobert & Weston, 2008), on the other hand, induce dense real-valued low-dimensional
Introduction
(See Bengio (2008) for a more complete list of references on neural language models .)
Unlabled Data
These auxiliary tasks are sometimes specific to the supervised task, and sometimes general language modeling tasks like “predict the missing word”.
language model is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Toutanova, Kristina and Suzuki, Hisami and Ruopp, Achim
Inflection prediction models
We stemmed the reference translations, predicted the inflection for each stem, and measured the accuracy of prediction, using a set of sentences that were not part of the training data (1K sentences were used for Arabic and 5K for Russian).2 Our model performs significantly better than both the random and trigram language model baselines, and achieves an accuracy of over 91%, which suggests that the model is effective when its input is clean in its stem choice and order.
Integration of inflection models with MT systems
Given such a list of candidate stem sequences, the base MT model together with the inflection model and a language model choose a translation Y* as follows:
Integration of inflection models with MT systems
P_LM(Y) is the joint probability of the sequence of inflected words according to a trigram language model (LM).
Integration of inflection models with MT systems
In addition, stemming the target sentences reduces the sparsity in the translation tables and language model , and is likely to impact positively the performance of an MT system in terms of its ability to recover correct sequences of stems in the target.
Introduction
(Goldwater and McClosky, 2005), while the application of a target language model has almost solely been responsible for addressing the second aspect.
Machine translation systems and data
(2003), a trigram target language model , two order models, word count, phrase count, and average phrase size functions.
Machine translation systems and data
The features include log-probabilities according to inverted and direct channel models estimated by relative frequency, lexical weighting channel models, a trigram target language model , distortion, word count and phrase count.
Machine translation systems and data
For each language pair, we used a set of parallel sentences (train) for training the MT system sub-models (e.g., phrase tables, language model ), a set of parallel sentences (lambda) for training the combination weights with max-BLEU training, a set of parallel sentences (dev) for training a small number of combination parameters for our integration methods (see Section 5), and a set of parallel sentences (test) for final evaluation.
language model is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Espinosa, Dominic and White, Michael and Mehay, Dennis
Background
OpenCCG implements a symbolic-statistical chart realization algorithm (Kay, 1996; Carroll et al., 1999; White, 2006b) combining (1) a theoretically grounded approach to syntax and semantic composition with (2) factored language models (Bilmes and Kirchhoff, 2003) for making choices among the options left open by the grammar.
Background
makes use of n-gram language models over words represented as vectors of factors, including surface form, part of speech, supertag and semantic class.
Background
2.3 Factored Language Models
Introduction
Assigned categories are instantiated in OpenCCG’s chart realizer where, together with a treebank-derived syntactic grammar (Hockenmaier and Steedman, 2007) and a factored language model (Bilmes and Kirchhoff, 2003), they constrain the English word-strings that are chosen to express the LF.
The Approach
Table 1: Percentage of complete realizations using an oracle n-gram model versus the best performing factored language model .
The Approach
As shown in Table 1, with the large grammar derived from the training sections, many fewer complete realizations are found (before timing out) using the factored language model than are possible, as indicated by the results of using the oracle model.
language model is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Wu, Yuanbin and Ng, Hwee Tou
Experiments
As described in Section 3.2, the weight of each variable is a linear combination of the language model score, three classifier confidence scores, and three classifier disagreement scores.
Experiments
We use the Web 1T 5—gram corpus (Brants and Franz, 2006) to compute the language model score for a sentence.
Experiments
Finally, the language model score, classifier confidence scores, and classifier disagreement scores are normalized to take values in [0, 1], based on the H00 2011 development data.
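Putting those pieces together, the weight of a variable is simply a linear combination of normalized feature values and learned weights, roughly as in this sketch; the feature names mirror the classifiers mentioned below, but the values and weights are toy numbers.

    def variable_weight(features, weights):
        # Linear combination of normalized ([0, 1]) feature values.
        return sum(weights[name] * value for name, value in features.items())

    features = {"lm": 0.42, "conf_ART": 0.90, "conf_PREP": 0.70, "conf_NOUN": 0.80,
                "dis_ART": 0.10, "dis_PREP": 0.20, "dis_NOUN": 0.05}
    weights = {name: 1.0 / len(features) for name in features}  # toy uniform weights
    print(variable_weight(features, weights))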
Inference with First Order Variables
The language model score h(s’, LM) of 8’ based on a large web corpus;
Inference with First Order Variables
Next, to compute whpyg, we collect language model score and confidence scores from the article (ART), preposition (PREP), and noun number (NOUN) classifier, i.e., E = {ART, PREP, NOUN}.
Inference with Second Order Variables
When measuring the gain due to the second order variable changing cat to cats being set to 1, the corresponding weight for the noun number change is likely to be small since A cats will get a low language model score, a low article classifier confidence score, and a low noun number classifier confidence score.
Related Work
Features used in classification include surrounding words, part-of—speech tags, language model scores (Gamon, 2010), and parse tree structures (Tetreault et al., 2010).
language model is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Cao, Guihong and Robertson, Stephen and Nie, Jian-Yun
Abstract
The selection is made according to the appropriateness of the alteration to the query context (using a bigram language model ), or according to its expected impact on the retrieval effectiveness (using a regression model).
Bigram Expansion Model for Alteration Selection
The query context is modeled by a bigram language model as in (Peng et al.
Bigram Expansion Model for Alteration Selection
In this work, we used bigram language model to calculate the probability of each path.
Bigram Expansion Model for Alteration Selection
P(e_1, e_2, ..., e_i, ..., e_n) = P(e_1) ∏_{k=2}^{n} P(e_k | e_{k-1})   (2). P(e_k | e_{k-1}) is estimated with a back-off bigram language model (Goodman, 2001).
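A small worked example of equation (2) in log space, using toy unigram and bigram tables; the fallback to a discounted unigram when a bigram is unseen is a simplification standing in for the back-off estimate cited above.

    import math

    def bigram_path_logprob(path, bigram_p, unigram_p, backoff=0.4):
        # log P(e1) + sum_k log P(e_k | e_{k-1}), falling back to a discounted
        # unigram when the bigram is unseen (a simplification of true back-off).
        logp = math.log(unigram_p.get(path[0], 1e-9))
        for prev, cur in zip(path, path[1:]):
            p = bigram_p.get((prev, cur)) or backoff * unigram_p.get(cur, 1e-9)
            logp += math.log(p)
        return logp

    uni = {"camp": 1e-4, "camping": 5e-5, "site": 2e-4}
    bi = {("camping", "site"): 0.02}
    print(bigram_path_logprob(["camping", "site"], bi, uni))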
Conclusion
In the first method proposed — the Bigram Expansion model, query context is modeled by a bigram language model .
Introduction
The query context is modeled by a bigram language model .
Related Work
2007), a bigram language model is used to determine the alteration of the head word that best fits the query.
Related Work
In this paper, one of the proposed methods will also use a bigram language model of the query to determine the appropriate alteration candidates.
language model is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith
Bayesian MT Decipherment via Hash Sampling
Secondly, for Bayesian inference we need to sample from a distribution that involves computing probabilities for all the components ( language model , translation model, fertility, etc.)
Bayesian MT Decipherment via Hash Sampling
Note that the (translation) model in our case consists of multiple exponential families components—a multinomial pertaining to the language model (which remains fixed5), and other components pertaining to translation probabilities P9(fi|ei), fertility ngert, etc.
Bayesian MT Decipherment via Hash Sampling
where, pold(-), pnew(-) are the true conditional likelihood probabilities according to our model (including the language model component) for the old, new sample respectively.
Decipherment Model for Machine Translation
For P(e), we use a word n-gram language model (LM) trained on monolingual target text.
Decipherment Model for Machine Translation
Generate a target (e.g., English) string e = e_1 ... e_l, with probability P(e) according to an n-gram language model .
Experiments and Results
The latter is used to construct a target language model used for decipherment training.
Experiments and Results
Overall, using a 3-gram language model (instead of 2-gram) for decipherment training improves the performance for all methods.
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Zhu, Conghui and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Experiment
SRILM Toolkit (Stol-cke, 2002) is employed to train 4-gram language models on the Xinhua portion of Gigaword corpus, while for the IWLST2012 data set, only its training set is used.
Experiment
The similarity between the data from each domain and the test data is calculated using the perplexity measure with 5-gram language model .
Hierarchical Phrase Table Combination
Pitman-Yor process is also employed in n-gram language models which are hierarchically represented through the hierarchical Pitman-Yor process with switch priors to integrate different domains in all the levels (Wood and Teh, 2009).
Phrase Pair Extraction with Unsupervised Phrasal ITGs
Pbase is a base measure defined as a combination of the IBM Models in two directions and the unigram language models in both sides.
Related Work
The translation model and language model are primary components in SMT.
Related Work
Previous work proved successful in the use of large-scale data for language models from diverse domains (Brants et al., 2007; Schwenk and Koehn, 2008).
Related Work
Alternatively, the language model is incrementally updated by using a succinct data structure with a interpolation technique (Levenberg and Osborne, 2009; Levenberg et al., 2011).
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Yannakoudakis, Helen and Briscoe, Ted and Medlock, Ben
Approach
In order to estimate the error-rate, we build a trigram language model (LM) using ukWaC (ukWaC LM) (Ferraresi et al., 2008), a large corpus of English containing more than 2 billion tokens.
Approach
Next, we extend our language model with trigrams extracted from a subset of the texts contained in the
Approach
As the CLC contains texts produced by second language learners, we only extract frequently occurring trigrams from highly ranked scripts to avoid introducing erroneous ones to our language model .
Evaluation
Extending our language model with frequent trigrams extracted from the CLC improves Pearson’s and Spearman’s correlation by 0.006 and 0.015 respectively.
Evaluation
This suggests that there is room for improvement in the language models we developed to estimate the error-rate.
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Zhu, Muhua and Zhang, Yue and Chen, Wenliang and Zhang, Min and Zhu, Jingbo
Semi-supervised Parsing with Large Data
These relations are captured by word clustering, lexical dependencies, and a dependency language model , respectively.
Semi-supervised Parsing with Large Data
4.3 Structural Relations: Dependency Language Model
Semi-supervised Parsing with Large Data
The dependency language model is proposed by Shen et al.
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Zhang, Jiajun and Zong, Chengqing
Experiments
For the out-of-domain data, we build the phrase table and reordering table using the 2.08 million Chinese-to-English sentence pairs, and we use the SRILM toolkit (Stolcke, 2002) to train the 5-gram English language model with the target part of the parallel sentences and the Xinhua portion of the English Gigaword.
Experiments
An in-domain 5-gram English language model is trained with the target 1 million monolingual data.
Experiments
(2008) regards the in-domain lexicon with corpus translation probability as another phrase table and further use the in-domain language model besides the out-of-domain language model .
Probabilistic Bilingual Lexicon Acquisition
In order to assign probabilities to each entry, we apply the Corpus Translation Probability which used in (Wu et al., 2008): given an in-domain source language monolingual data, we translate this data with the phrase-based model trained on the out-of-domain News data, the in-domain lexicon and the in-domain target language monolingual data (for language model estimation).
Related Work
For the target-side monolingual data, they just use it to train language model , and for the source-side monolingual data, they employ a baseline (word-based SMT or phrase-based SMT trained with small-scale bitext) to first translate the source sentences, combining the source sentence and its target translation as a bilingual sentence pair, and then train a new phrase-base SMT with these pseudo sentence pairs.
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Green, Spence and DeNero, John
A Class-based Model of Agreement
However, in MT, we seek a measure of sentence quality q(e) that is comparable across different hypotheses on the beam (much like the n-gram language model score).
A Class-based Model of Agreement
We trained a simple add-1 smoothed bigram language model over gold class sequences in the same treebank training data:
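Concretely, add-1 smoothing over class bigrams looks like the sketch below; the class labels are hypothetical stand-ins for the gold agreement classes used in the paper.

    from collections import Counter

    def add_one_bigram(class_sequences):
        # P(c2 | c1) = (count(c1, c2) + 1) / (count(c1) + V), V = class vocabulary size.
        bigrams, unigrams, vocab = Counter(), Counter(), set()
        for seq in class_sequences:
            seq = ["<s>"] + seq + ["</s>"]
            vocab.update(seq)
            unigrams.update(seq[:-1])
            bigrams.update(zip(seq, seq[1:]))
        V = len(vocab)
        return lambda c1, c2: (bigrams[(c1, c2)] + 1) / (unigrams[c1] + V)

    p = add_one_bigram([["NOUN_FEM_SG", "ADJ_FEM_SG"], ["NOUN_MASC_PL", "ADJ_MASC_PL"]])
    print(p("NOUN_FEM_SG", "ADJ_FEM_SG"))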
Experiments
Our distributed 4—gram language model was trained on 600 million words of Arabic text, also collected from many sources including the Web (Brants et al., 2007).
Inference during Translation Decoding
With a trigram language model , the state might be the last two words of the translation prefix.
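In other words, hypotheses that agree on the last n-1 words are indistinguishable to the n-gram LM and can be recombined; a minimal sketch of that state function follows (the example words are arbitrary).

    def lm_state(prefix_words, order=3):
        # Only the last (order - 1) words of the prefix affect future n-gram
        # scores, so hypotheses sharing this tuple can be recombined.
        return tuple(prefix_words[-(order - 1):])

    print(lm_state(["the", "committee", "approved", "the", "report"]))
    # ('the', 'report')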
Introduction
Intuition might suggest that the standard n-gram language model (LM) is sufficient to handle agreement phenomena.
Related Work
Monz (2011) recently investigated parameter estimation for POS-based language models , but his classes did not include inflectional features.
Related Work
One exception was the quadratic-time dependency language model presented by Galley and Manning (2009).
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Zhao, Shiqi and Lan, Xiang and Liu, Ting and Li, Sheng
Experimental Setup
The language model is trained using a 9 GB English corpus.
Statistical Paraphrase Generation
Our SPG model contains three sub-models: a paraphrase model, a language model , and a usability model, which control the adequacy, fluency,
Statistical Paraphrase Generation
Language Model: We use a trigram language model in this work.
Statistical Paraphrase Generation
The language model based score for the paraphrase t is computed as:
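A hedged sketch of such a score, summing log trigram probabilities over a padded candidate; the floor value for unseen trigrams is an assumption, since the smoothing is not specified in the excerpt.

    import math

    def trigram_lm_score(tokens, trigram_p, floor=1e-7):
        # Sum of log trigram probabilities over the padded candidate sentence.
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        return sum(math.log(trigram_p.get(tuple(padded[i:i + 3]), floor))
                   for i in range(len(padded) - 2))

    tri = {("<s>", "<s>", "the"): 0.2, ("<s>", "the", "plan"): 0.05}
    print(trigram_lm_score(["the", "plan", "works"], tri))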
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Tang, Duyu and Wei, Furu and Yang, Nan and Zhou, Ming and Liu, Ting and Qin, Bing
Related Work
(2012) adopt the tweets with emoticons to smooth the language model and Hu et al.
Related Work
With the revival of interest in deep learning (Bengio et al., 2013), incorporating the continuous representation of a word as features has been proving effective in a variety of NLP tasks, such as parsing (Socher et al., 2013a), language modeling (Bengio et al., 2003; Mnih and Hinton, 2009) and NER (Turian et al., 2010).
Related Work
The training objective is that the original ngram is expected to obtain a higher language model score than the corrupted ngram by a margin of 1.
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Hasegawa, Takayuki and Kaji, Nobuhiro and Yoshinaga, Naoki and Toyoda, Masashi
Eliciting Addressee’s Emotion
We use GIZA++8 and SRILM9 for learning translation model and 5-gram language model , re-
Eliciting Addressee’s Emotion
We use the emotion-tagged dialogue corpus to learn eight translation models and language models , each of which is specialized in generating the response that elicits one of the eight emotions (Plutchik, 1980).
Eliciting Addressee’s Emotion
In this case, the first two utterances are used to learn the translation model, while only the second utterance is used to learn the language model .
Experiments
Table 6: The number of utterance pairs used for training classifiers in emotion prediction and learning the translation models and language models in response generation.
Experiments
We use the utterance pairs summarized in Table 6 to learn the translation models and language models for eliciting each emotional category.
Related Work
The linear interpolation of translation and/or language models is a widely-used technique for adapting machine translation systems to new domains (Sennrich, 2012).
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Cohen, Shay B. and Collins, Michael
Abstract
Experiments on parsing and a language modeling problem show that the algorithm is efficient and effective in practice.
Experiments on Parsing
8 Experiments on the Saul and Pereira (1997) Model for Language Modeling
Experiments on Parsing
We now describe a second set of experiments, on the Saul and Pereira (1997) model for language modeling .
Experiments on Parsing
We performed the language modeling experiments for a number of reasons.
Introduction
We describe experiments on learning of L-PCFGs, and also on learning of the latent-variable language model of Saul and Pereira (1997).
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Yeniterzi, Reyyan and Oflazer, Kemal
Conclusions
The problem is essentially one of generating multiple candidate sentences with the unattached function words ambiguously positioned (say in a lattice) and then use a second language model to rerank these sentences to select the target sentence.
Experimental Setup and Results
Furthermore, in factored models, we can employ different language models for different factors.
Experimental Setup and Results
We believe that the use of multiple language models (some much less sparse than the surface LM) in the factored baseline is the main reason for the improvement.
Experimental Setup and Results
3.2.3 Experiments with higher-order language models
Introduction
The main reason given for these problems was that the same statistical translation, reordering and language modeling mechanisms were being employed to both determine the morphological structure of the words and, at the same time, get the global order of the words correct.
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Huang, Fei and Yates, Alexander
Abstract
We leverage recently-developed techniques for learning representations of text using latent-variable language models , and extend these techniques to ones that provide the kinds of features that are useful for semantic role labeling.
Introduction
Using latent-variable language models , we learn representations of texts that provide novel kinds of features to our supervised learning algorithms.
Introduction
The next section provides background information on learning representations for NLP tasks using latent-variable language models .
Introduction
2 Open-Domain Representations Using Latent-Variable Language Models
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Zhang, Dongdong and Li, Mu and Duan, Nan and Li, Chi-Ho and Zhou, Ming
Experiments
In the experiments, the language model is a Chinese 5-gram language model trained with the Chinese part of the LDC parallel corpus and the Xin-hua part of the Chinese Gigaword corpus with about 27 million words.
Experiments
In the tables, Lm denotes the n-gram language model feature, Tmh denotes the feature of collocation between target head words and the candidate measure word, Smh denotes the feature of collocation between source head words and the candidate measure word, HS denotes the feature of source head word selection, Punc denotes the feature of target punctuation position, Tlex denotes surrounding word features in translation, Slex denotes surrounding word features in the source sentence, and Pos denotes the Part-Of-Speech feature.
Introduction
Moreover, Chinese measure words often have a long distance dependency to their head words, which makes the language model ineffective in selecting the correct measure words from the measure word candidate set.
Introduction
In this case, an n-gram language model with n<15 cannot capture the MW-HW collocation.
Model Training and Application 3.1 Training
We used the SRI Language Modeling Toolkit (Stolcke, 2002) to train a five-gram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998).
Our Method
For target features, the n-gram language model score is defined as the sum of log n-gram probabilities within the target window after the measure word.
Our Method
Target features: n-gram language model score, MW-HW collocation, surrounding words, punctuation position. Source features: MW-HW collocation, surrounding words, source head word, POS tags.
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Zhai, Ke and Williams, Jason D
Experiments
Illustrated by the highlighted states in Figure 6, the LM-HMM model conflates interactions that commonly occur at the beginning and end of a dialogue, i.e., “acknowledge agent” and “resolve problem”, since their underlying language models are likely to produce similar probability distributions over words.
Experiments
By incorporating topic information, our proposed models (e.g., TM-HMMSS in Figure 5) are able to enforce the state transitions towards more frequent flow patterns, which further helps to overcome the weakness of the language model .
Latent Structure in Dialogues
The simplest formulation we consider is an HMM where each state contains a unigram language model (LM), proposed by Chotimongkol (2008) for task-oriented dialogue and originally
Latent Structure in Dialogues
3: For each word in utterance n, first choose a word source r according to τ, and then depending on r, generate a word w either from the session-wide topic distribution θ or the language model specified by the state s_n.
Latent Structure in Dialogues
Note that a TM-HMMS model with state-specific topic models (instead of state-specific language models ) would be subsumed by TM-HMM, since one topic could be used as the background topic in TM-HMMS.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Heilman, Michael and Cahill, Aoife and Madnani, Nitin and Lopez, Melissa and Mulholland, Matthew and Tetreault, Joel
Abstract
In this work, we construct a statistical model of grammaticality using various linguistic features (e.g., misspelling counts, parser outputs, n-gram language model scores).
Discussion and Conclusions
While Post found that such a system can effectively distinguish grammatical news text sentences from sentences generated by a language model, measuring the grammaticality of real sentences from language learners seems to require a wider variety of features, including n-gram counts, language model scores, etc.
Experiments
To create further baselines for comparison, we selected the following features that represent ways one might approximate grammaticality if a comprehensive model was unavailable: whether the link parser can fully parse the sentence (complete_link), the Gigaword language model score (gigaword_avglogprob), and the number of misspelled tokens (nummisspelled).
System Description
3.2.2 n-gram Count and Language Model Features
System Description
The model computes the following features from a 5-gram language model trained on the same three sections of English Gigaword using the SRILM toolkit (Stolcke, 2002):
System Description
Finally, the system computes the average log-probability and number of out-of-vocabulary words from a language model trained on a collection of essays written by nonnative English speakers (“nonnative LM”).
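A minimal sketch of the two feature types mentioned here, average log-probability and out-of-vocabulary count, computed from a plain unigram probability table rather than the toolkit-trained models used in the paper:

import math

def lm_features(tokens, unigram_prob, oov_floor=1e-9):
    # Average log-probability and OOV count for one sentence;
    # OOV tokens receive a small probability floor.
    oov = sum(1 for t in tokens if t not in unigram_prob)
    avg_logprob = sum(math.log(unigram_prob.get(t, oov_floor)) for t in tokens) / max(len(tokens), 1)
    return avg_logprob, oov

print(lm_features(["the", "dog", "barkz"], {"the": 0.05, "dog": 0.001}))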
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Devlin, Jacob and Zbib, Rabih and Huang, Zhongqiang and Lamar, Thomas and Schwartz, Richard and Makhoul, John
Abstract
Recent work has shown success in using neural network language models (NNLMs) as features in MT systems.
Introduction
Initially, these models were primarily used to create n-gram neural network language models (NNLMs) for speech recognition and machine translation (Bengio et al., 2003; Schwenk, 2010).
Introduction
Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window.
Model Variations
In particular, we can reverse the translation direction of the languages, as well as the direction of the language model .
Model Variations
• 5-gram Kneser-Ney LM • Recurrent neural network language model (RNNLM) (Mikolov et al., 2010)
Neural Network Joint Model (NNJM)
Fortunately, neural network language models are able to elegantly scale up and take advantage of arbitrarily large context sizes.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Blunsom, Phil and Cohn, Trevor and Osborne, Miles
Discriminative Synchronous Transduction
similar to the methods for decoding with a SCFG intersected with an n-gram language model, which require language model contexts to be stored in each chart cell.
Discussion and Further Work
To do so would require integrating a language model feature into the max-translation decoding algorithm.
Evaluation
The feature set includes: a trigram language model (lm) trained
Evaluation
To compare our model directly with these systems we would need to incorporate additional features and a language model , work which we have left for a later date.
Evaluation
The relative scores confirm that our model, with its minimalist feature set, achieves comparable performance to the standard feature set without the language model .
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Simianer, Patrick and Riezler, Stefan and Dyer, Chris
Experiments
3-gram (news-commentary) and 5-gram (Europarl) language models are trained on the data described in Table 1, using the SRILM toolkit (Stolcke, 2002) and binarized for efficient querying using kenlm (Heafield, 2011).
Experiments
For the 5-gram language models, we replaced every word in the lm training data with <unk> that did not appear in the English part of the parallel training data to build an open vocabulary language model .
Experiments
Absolute improvements would be possible, e.g., by using larger language models or by adding news data to the ep training set when evaluating on crawl test sets (see, e.g., Dyer et al.
Introduction
The standard SMT training pipeline combines scores from large count-based translation models and language models with a few other features and tunes these using the well-understood line-search technique for error minimization of Och (2003).
Introduction
The modeler’s goals might be to identify complex properties of translations, or to counter errors of pre-trained translation models and language models by explicitly down-weighting translations that exhibit certain undesired properties.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Wuebker, Joern and Ney, Hermann and Zens, Richard
Abstract
In this work we present two extensions to the well-known dynamic programming beam search in phrase-based statistical machine translation (SMT), aiming at increased efficiency of decoding by minimizing the number of language model computations and hypothesis expansions.
Abstract
Our results show that language model based pre-sorting yields a small improvement in translation quality and a speedup by a factor of 2.
Experimental Evaluation
The English language model is a 4-gram LM created with the SRILM toolkit (Stolcke, 2002) on all bilingual and parts of the provided monolingual data.
Introduction
Research efforts to increase search efficiency for phrase-based MT (Koehn et al., 2003) have explored several directions, ranging from generalizing the stack decoding algorithm (Ortiz et al., 2006) to additional early pruning techniques (Delaney et al., 2006), (Moore and Quirk, 2007) and more efficient language model (LM) querying (Heafield, 2011).
Introduction
with Language Model LookAhead
Search Algorithm Extensions
2.2 Language Model LookAhead
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Kobayashi, Hayato
Experiments
Although we did not examine the accuracy of real tasks in this paper, there is an interesting report that the word error rate of language models follows a power law with respect to perplexity (Klakow and Peters, 2002).
Introduction
Removing low-frequency words from a corpus (often called cutoff) is a common practice to save on the computational costs involved in learning language models and topic models.
Introduction
In the case of language models , we often have to remove low-frequency words because of a lack of computational resources, since the feature space of k-grams tends to be so large that we sometimes need cutoffs even in a distributed environment (Brants et al., 2007).
Perplexity on Reduced Corpora
Constant restoring is similar to the additive smoothing defined by p(w) ∝ p′ + λ, which is used to solve the zero-frequency problem of language models (Chen and Goodman, 1996).
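For reference, the count-based form of additive (add-λ) smoothing that this excerpt alludes to is usually written as

\[ p_\lambda(w) = \frac{c(w) + \lambda}{N + \lambda V} \]

where c(w) is the count of w, N the total number of tokens, and V the vocabulary size; the formulation quoted above instead adds the constant at the probability level, p(w) ∝ p′ + λ.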
Perplexity on Reduced Corpora
This means that we can determine the rough sparseness of k-grams and adjust some of the parameters such as the gram size k in learning statistical language models .
Perplexity on Reduced Corpora
LDA is a probabilistic language model that generates a corpus as a mixture of hidden topics, and it allows us to infer two parameters: the document-topic distribution θ that represents the mixture rate of topics in each document, and the topic-word distribution φ that represents the occurrence rate of words in each topic.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Wang, Lu and Raghavan, Hema and Castelli, Vittorio and Florian, Radu and Cardie, Claire
Experimental Setup
Beam size is fixed at 2000. Sentence compressions are evaluated by a 5-gram language model trained on Gigaword (Graff, 2003) by SRILM (Stolcke, 2002).
Sentence Compression
As the space of possible compressions is exponential in the number of leaves in the parse tree, instead of looking for the globally optimal solution, we use beam search to find a set of highly likely compressions and employ a language model trained on a large corpus for evaluation.
Sentence Compression
Given the N -best compressions from the decoder, we evaluate the yield of the trimmed trees using a language model trained on the Gigaword (Graff, 2003) corpus and return the compression with the highest probability.
Sentence Compression
Thus, the decoder is quite flexible — its learned scoring function allows us to incorporate features salient for sentence compression while its language model guarantees the linguistic quality of the compressed string.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Andreas, Jacob and Vlachos, Andreas and Clark, Stephen
Discussion
The first is the incorporation of a language model (or comparable long-distance structure-scoring model) to assign scores to predicted parses independent of the transformation model.
Experimental setup
The best symmetrization algorithm, translation and language model weights for each language are selected using cross-validation on the development set.
MT—based semantic parsing
In order to learn a semantic parser using MT we linearize the MRs, learn alignments between the MRL and the NL, extract translation rules, and learn a language model for the MRL.
MT—based semantic parsing
Language modeling In addition to translation rules learned from a parallel corpus, MT systems also rely on an n-gram language model for the target language, estimated from a (typically larger) monolingual corpus.
MT—based semantic parsing
In the case of SP, such a monolingual corpus is rarely available, and we instead use the MRs available in the training data to learn a language model of the MRL.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Narayan, Shashi and Gardent, Claire
Related Work
It is combined with a language model to improve grammaticality and the decoder translates sentences into simplified sentences.
Simplification Framework
In addition, the language model we integrate in the SMT module helps ensuring better fluency and grammaticality.
Simplification Framework
Finally the translation and language model ensures that published, describing and boson are simplified to wrote, explaining and elementary particle respectively; and that the phrase “In 1964” is moved from the beginning of the sentence to its end.
Simplification Framework
Our simplification framework consists of a probabilistic model for splitting and dropping which we call DRS simplification model (DRS-SM); a phrase based translation model for substitution and reordering (PBMT); and a language model learned on Simple English Wikipedia (LM) for fluency and grammaticality.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Sajjad, Hassan and Darwish, Kareem and Belinkov, Yonatan
Conclusion
Also, we believe that improving English language modeling to match the genre of the translated sentences can have significant positive impact on translation quality.
Previous Work
They used two language models built from the English GigaWord corpus and from a large web crawl.
Previous Work
For language modeling , we used either EGen or the English side of the AR corpus plus the English side of NIST12 training data and English GigaWord v5.
Previous Work
— B2-B4 systems used identical training data, namely EG, with the GW, EGen, or both for B2, B3, and B4 respectively for language modeling .
Proposed Methods 3.1 Egyptian to EG’ Conversion
Using both language models (S2) led to slight improvement.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Tan, Chenhao and Lee, Lillian and Pang, Bo
Introduction
This crawling process also yielded 632K TAC pairs whose only difference was spacing, and an additional 558M “unpaired” tweets; as shown later in this paper, we used these extra corpora for computing language models and other auxiliary information.
Introduction
Table 5: Conformity to the community and one’s own past, measured via scores assigned by various language models .
Introduction
We measure a tweet’s similarity to expectations by its score according to the relevant language model, (1/|T|) Σ_{w∈T} log p(w), where T refers to either all the unigrams (unigram model) or all and only bigrams (bigram model). We trained a Twitter-community language model from our 558M unpaired tweets, and personal language models from each author’s tweet history.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Siahbani, Maryam and Haffari, Reza and Sarkar, Anoop
Collocational Lexicon Induction
It has been used as word similarity measure in language modeling (Dagan et al., 1999).
Experiments & Results 4.1 Experimental Setup
For the end-to-end MT pipeline, we used Moses (Koehn et al., 2007) with these standard features: relative-frequency and lexical translation model (TM) probabilities in both directions; distortion model; language model (LM) and word count.
Experiments & Results 4.1 Experimental Setup
For the language model, we used the KenLM toolkit (Heafield, 2011) to create a 5-gram language model on the target side of the Europarl corpus (V7) with approximately 54M tokens with Kneser-Ney smoothing.
Experiments & Results 4.1 Experimental Setup
However, in an MT pipeline, the language model is supposed to rerank the hypotheses and move more appropriate translations (in terms of fluency) to the top of the list.
Introduction
Even noisy translation of oovs can aid the language model to better
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Weerkamp, Wouter and Balog, Krisztian and de Rijke, Maarten
Related Work
In the setting of language modeling approaches to query expansion, the local analysis idea has been instantiated by estimating additional query language models (Lafferty and Zhai, 2003; Tao and Zhai, 2006) or relevance models (Lavrenko and Croft, 2001) from a set of feedback documents.
Related Work
(2005) also try to uncover multiple aspects of a query, and to that they provide an iterative “pseudo-query” generation technique, using cluster-based language models .
Related Work
Diaz and Metzler (2006) were the first to give a systematic account of query expansion using an external corpus in a language modeling setting, to improve the estimation of relevance models.
Retrieval Framework
We work in the setting of generative language models .
Retrieval Framework
Within the language modeling approach, one builds a language model from each document, and ranks documents based on the probability of the document model generating the query.
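A minimal sketch of this query-likelihood idea: build a unigram model from each document, interpolate it with a collection model (Jelinek-Mercer smoothing, used here purely for illustration), and rank documents by the log-probability of generating the query. Names and values are illustrative, not from the paper.

import math
from collections import Counter

def query_loglikelihood(query, doc_tokens, collection_prob, lam=0.1):
    # log P(query | document model), with the document's maximum-likelihood
    # unigram estimate linearly interpolated with a collection model.
    tf = Counter(doc_tokens)
    dlen = len(doc_tokens)
    score = 0.0
    for q in query:
        p_doc = tf[q] / dlen if dlen else 0.0
        score += math.log((1.0 - lam) * p_doc + lam * collection_prob.get(q, 1e-9))
    return score

docs = {"d1": "the black cat sat".split(), "d2": "a dog ran home".split()}
coll = {"the": 0.05, "a": 0.05, "cat": 0.01, "dog": 0.01, "black": 0.002}
print(sorted(docs, key=lambda d: query_loglikelihood(["black", "cat"], docs[d], coll), reverse=True))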
Retrieval Framework
The particulars of the language modeling approach have been discussed extensively in the literature (see, e.g., Balog et al.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Han, Bo and Baldwin, Timothy
Lexical normalisation
The confusion candidates are then filtered for each token occurrence of a given OOV word, based on their local context fit with a language model .
Lexical normalisation
In addition to generating the confusion set, we rank the candidates based on a trigram language model trained over 1.5GB of clean Twitter data, i.e.
Lexical normalisation
To train the language model , we used SRILM (Stolcke, 2002) with the —<unk> option.
Related work
Suppose the ill-formed text is T and its corresponding standard form is S; the approach aims to find arg max P(S|T) by computing arg max P(T|S)P(S), in which P(S) is usually a language model and P(T|S) is an error model.
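A sketch of this noisy-channel ranking with stand-in error and language models (both reduced to lookup tables for illustration); it picks the standard form S maximizing P(T|S)P(S) in log space:

def best_standard_form(ill_formed, candidates, error_logprob, lm_logprob, floor=-20.0):
    # error_logprob[(T, S)] ~ log P(T|S); lm_logprob[S] ~ log P(S).
    def score(s):
        return error_logprob.get((ill_formed, s), floor) + lm_logprob.get(s, floor)
    return max(candidates, key=score)

errs = {("2morrow", "tomorrow"): -1.0, ("2morrow", "to morrow"): -3.0}
lm = {"tomorrow": -4.0, "to morrow": -9.0}
print(best_standard_form("2morrow", ["tomorrow", "to morrow"], errs, lm))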
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Mylonakis, Markos and Sima'an, Khalil
Experiments
The final feature is the language model score for the target sentence, mounting up to the following model used at decoding time, with the feature weights A trained by Minimum Error Rate Training (MERT) (Och, 2003) on a development corpus.
Experiments
with a 3-gram language model smoothed with modified Kneser-Ney discounting (Chen and Goodman, 1998), trained on around 1M sentences per target language.
Experiments
Table 2: Additional experiments for English to Chinese translation examining (a) the impact of the linguistic annotations in the LTS system (lts), when compared with an instance not employing such annotations (lts—nolabels) and (b) decoding with a 4th-order language model (—lm4).
Joint Translation Model
While in a decoder this is somehow mitigated by the use of a language model , we believe that the weakness of straightforward applications of SCFGs to model reordering structure at the sentence level misses a chance to learn this crucial part of the translation process during grammar induction.
Joint Translation Model
As (Mylonakis and Sima’an, 2010) note, ‘plain’ SCFGs seem to perform worse than the grammars described next, mainly due to wrong long-range reordering decisions for which the language model can hardly help.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Knight, Kevin
Introduction
The variable 6 ranges over all possible English strings, and P(e) is a language model built from large amounts of English text that is unrelated to the foreign strings.
Introduction
A language model P (e) is typically used in SMT decoding (Koehn, 2009), but here P (6) actually plays a central role in training translation model parameters.
Machine Translation as a Decipherment Task
Whole-segment Language Models : When using word n-gram models of English for decipherment, we find that some of the foreign sentences are decoded into sequences (such as “THANK YOU TALKING ABOUT ‘2”) that are not good English.
Machine Translation as a Decipherment Task
For Bayesian MT decipherment, we set a high prior value on the language model (10^4) and use sparse priors for the IBM 3 model parameters t, n, d, p (0.01, 0.01, 0.01, 0.01).
Word Substitution Decipherment
We model P(e) using a statistical word n-gram English language model (LM).
Word Substitution Decipherment
For word substitution decipherment, we want to keep the language model probabilities fixed during training, and hence we set the prior on that model to be high (α = 10^4).
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Knight, Kevin
Abstract
Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries.
Conclusion
Unlike previous approaches, our method combines information from letter n-gram language models and word dictionaries and provides a robust decipherment model.
Decipherment
We build a statistical English language model (LM) for the plaintext source model P(p), which assigns a probability to any English letter sequence.
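A minimal letter-bigram version of such a plaintext source model, assigning a log-probability to any letter sequence; the add-one smoothing and tiny training text are illustrative assumptions, not the paper's setup:

import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def train_letter_bigram_lm(text):
    # Add-one-smoothed letter bigram model P(b | a) over a fixed alphabet.
    text = "".join(c for c in text.lower() if c in ALPHABET)
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    V = len(ALPHABET)
    return lambda a, b: math.log((bigrams[(a, b)] + 1) / (contexts[a] + V))

def letter_sequence_logprob(seq, bigram_logprob):
    seq = "".join(c for c in seq.lower() if c in ALPHABET)
    return sum(bigram_logprob(a, b) for a, b in zip(seq, seq[1:]))

lp = train_letter_bigram_lm("the quick brown fox jumps over the lazy dog")
print(letter_sequence_logprob("the dog", lp), letter_sequence_logprob("xqzkvj", lp))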
Decipherment
For the plaintext source model, we use probabilities from an English language model and for the channel model, we specify a uniform distribution (i.e., a plaintext letter can be substituted with any given cipher type with equal probability).
Decipherment
Combining letter n-gram language models with word dictionaries: Many existing probabilistic approaches use statistical letter n-gram language models of English to assign P (p) probabilities to plaintext hypotheses during decipherment.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Li, Zhifei and Eisner, Jason and Khudanpur, Sanjeev
Experimental Results
We also used a 5-gram language model with modified Kneser-Ney smoothing (Chen and Goodman, 1998), trained on a data set consisting of 130M words in the English Gigaword corpus (LDC2007T07) and the English side of the parallel corpora.
Experimental Results
We use GIZA++ (Och and Ney, 2000), a suffix-array (Lopez, 2007), SRILM (Stolcke, 2002), and risk-based deterministic annealing (Smith and Eisner, 2006) to obtain word alignments, translation models, language models , and the optimal weights for combining these models, respectively.
Variational Approximate Decoding
Of course, this last point also means that our computation becomes intractable as n → ∞. However, if p(y | x) is defined by a hypergraph HG(x) whose structure explicitly incorporates an m-gram language model , both training and decoding will be efficient when m ≥ n. We will give algorithms for this case that are linear in the size of HG(x).
Variational Approximate Decoding
A reviewer asks about the interaction with backed-off language models .
Variational Approximate Decoding
We sketch a method that works for any language model given by a weighted FSA, L. The variational family Q can be specified by any deterministic weighted FSA, Q, with weights parameterized by δ.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhu, Muhua and Wang, Huizhen
Background
Since all the member systems share the same data resources, such as language model and translation table, we only need to keep one copy of the required resources in memory.
Background
Another method to speed up the system is to accelerate n-gram language model with n-gram caching techniques.
Background
If the required n-gram hits the cache, the corresponding n-gram probability is returned by the cached copy rather than re-fetching the original data in the language model .
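The caching idea can be sketched as a simple memo layer in front of the (comparatively expensive) language model query; slow_ngram_logprob below is a placeholder for the real back-end lookup:

class NgramCache:
    # Return cached n-gram log-probabilities when available,
    # falling back to the underlying language model otherwise.
    def __init__(self, lookup):
        self.lookup = lookup  # expensive LM query function
        self.cache = {}
        self.hits = 0

    def logprob(self, ngram):
        if ngram in self.cache:
            self.hits += 1
            return self.cache[ngram]
        value = self.lookup(ngram)
        self.cache[ngram] = value
        return value

def slow_ngram_logprob(ngram):  # stand-in for the real LM back-end
    return -0.1 * len(ngram)

cache = NgramCache(slow_ngram_logprob)
cache.logprob(("the", "cat")); cache.logprob(("the", "cat"))
print(cache.hits)  # 1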
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Wuebker, Joern and Mauser, Arne and Ney, Hermann
Alignment
A language model is not used in this case, as the system is constrained to the given target sentence and thus the language model score has no effect on the alignment.
Alignment
To deal with this problem, instead of simple phrase length restriction, we propose to apply the leaving-one-out method, which is also used for language modeling techniques (Kneser and Ney, 1995).
Experimental Evaluation
The baseline system is a standard phrase-based SMT system with eight features: phrase translation and word lexicon probabilities in both translation directions, phrase penalty, word penalty, language model score and a simple distance-based reordering model.
Experimental Evaluation
We used a 4-gram language model with modified Kneser-Ney discounting for all experiments.
Introduction
The phrase model is combined with a language model , word lexicon models, word and phrase penalty, and many others.
Related Work
They report improvements over a phrase-based model that uses an inverse phrase model and a language model .
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Sun, Xu and Gao, Jianfeng and Micol, Daniel and Quirk, Chris
Related Work
First, a pre-defined confusion set is used to generate candidate corrections, then a scoring model, such as a trigram language model or naïve Bayes classifier, is used to rank the candidates according to their context (e.g., Golding and Roth, 1996; Mangu and Brill, 1997; Church et al., 2007).
Related Work
(2009) present a query speller system in which both the error model and the language model are trained using Web data.
Related Work
Typically, a language model (source model) is used to capture contextual information, while an error model (channel model) is considered to be context free in that it does not take into account any contextual information in modeling word transformation probabilities.
The Baseline Speller System
where the error model P(Q|C) models the transformation probability from C to Q, and the language model P(C) models how likely C is a correctly spelled query.
The Baseline Speller System
The language model (the second factor) is a backoff bigram model trained on the tokenized form of one year of query logs, using maximum likelihood estimation with absolute discounting smoothing.
The Baseline Speller System
Since we define the logarithm of the probabilities of the language model and the error model (i.e., the edit distance function) as features, the ranker can be viewed as a more general framework, subsuming the source channel model as a special case.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Branavan, S.R.K. and Chen, Harr and Eisenstein, Jacob and Barzilay, Regina
Introduction
Each property indexes a language model , thus allowing documents that incorporate the same
Model Description
Keyphrases are drawn from a set of clusters; words in the documents are drawn from language models indexed by a set of topics, where the topics correspond to the keyphrase clusters.
Model Description
language models of each topic
Model Description
In the LDA framework, each word is generated from a language model that is indexed by the word’s topic assignment.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Pitler, Emily and Louis, Annie and Nenkova, Ani
Classification Results
The language model features were completely useless for distinguishing contingencies from
Features for sense prediction of implicit discourse relations
For each sense, we created unigram and bigram language models over the implicit examples in the training set.
Features for sense prediction of implicit discourse relations
We compute each example’s probability according to each of these language models .
Features for sense prediction of implicit discourse relations
of the spans’ likelihoods according to the various language models .
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Hu, Yuening and Zhai, Ke and Eidelman, Vladimir and Boyd-Graber, Jordan
Conclusion
Further improvement is possible by incorporating topic models deeper in the decoding process and adding domain knowledge to the language model .
Discussion
6.3 Improving Language Models
Discussion
Topic models capture document-level properties of language, but a critical component of machine translation systems is the language model , which provides local constraints and preferences.
Discussion
Domain adaptation for language models (Bellegarda, 2004; Wood and Teh, 2009) is an important avenue for improving machine translation.
Experiments
We train a modified Kneser-Ney trigram language model on English (Chen and Goodman, 1996).
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Habash, Nizar and Roth, Ryan
Experimental Settings
The models are built using the SRI Language Modeling Toolkit (Stolcke, 2002).
Problem Zones in Handwriting Recognition
Digits on the other hand are a hard class to language model since the vocabulary (of multi-digit numbers) is infinite.
Problem Zones in Handwriting Recognition
The HR system output does not contain any illegal non-words since its vocabulary is restricted by its training data and language models .
Related Work
Alternatively, morphological information can be used to construct supplemental lexicons or language models (Sari and Sellami, 2002; Magdy and Darwish, 2006).
Related Work
Their hypothesis that their large language model (16M words) may be responsible for why the word-based models outperformed stem-based (morphological) models is challenged by the fact that our language model data (220M words) is an order of magnitude larger, but we are still able to show benefit for using morphology.
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Huang, Fei and Yates, Alexander
Related Work
Sparsity for low-order contexts has recently spurred interest in using latent variables to represent distributions over contexts in language models .
Related Work
While n-gram models have traditionally dominated in language modeling , two recent efforts de-
Related Work
Several authors investigate neural network models that learn not just one latent state, but rather a vector of latent variables, to represent each word in a language model (Bengio et al., 2003; Emami et al., 2003; Morin and Bengio, 2005).
Smoothing Natural Language Sequences
2.3 Latent Variable Language Model Representation
Smoothing Natural Language Sequences
Latent variable language models (LVLMs) can be used to produce just such a distributional representation.
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Rasooli, Mohammad Sadegh and Lippincott, Thomas and Habash, Nizar and Rambow, Owen
Introduction
The best-performing systems for these applications today rely on training on large amounts of data: in the case of ASR, the data is aligned audio and transcription, plus large unannotated data for the language modeling ; in the case of OCR, it is transcribed optical data; in the case of MT, it is aligned bitexts.
Introduction
For ASR and OCR, which can compose words from smaller units (phones or graphically recognized letters), an expanded target language vocabulary can be directly exploited without the need for changing the technology at all: the new words need to be inserted into the relevant resources (lexicon, language model ) etc, with appropriately estimated probabilities.
Introduction
The expanded word combinations can be used to extend the language models used for MT to bias against incoherent hypothesized new sequences of segmented words.
Morphology-based Vocabulary Expansion
In the Bigram Affix model, we do the same for the stem as in the Fixed Affix model, but for prefixes and suffixes, we create a bigram language model in the finite state machine.
Morphology-based Vocabulary Expansion
We reweight the weights in the WFST model (Fixed or Bigram) by composing it with a letter trigraph language model (WoTr).
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
DeNero, John and Chiang, David and Knight, Kevin
Computing Feature Expectations
The nodes are states in the decoding process that include the span (i, j) of the sentence to be translated, the grammar symbol s over that span, and the left and right context words of the translation relevant for computing n-gram language model scores. Each hyperedge h represents the application of a synchronous rule r that combines nodes corresponding to non-terminals in
Computing Feature Expectations
Decoder states can include additional information as well, such as local configurations for dependency language model scoring.
Computing Feature Expectations
The weight of h is the incremental score contributed to all translations containing the rule application, including translation model features on r and language model features that depend on both r and the English contexts of the child nodes.
Experimental Results
All four systems used two language models: one trained from the combined English sides of both parallel texts, and another, larger, language model trained on 2 billion words of English text (1 billion for Chinese-English SBMT).
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Salloum, Wael and Elfardy, Heba and Alamir-Salloum, Linda and Habash, Nizar and Diab, Mona
MT System Selection
These features rely on language models , MSA and Egyptian morphological analyzers and a Highly Dialectal Egyptian lexicon to decide whether each word is MSA, Egyptian, Both, or Out of Vocabulary.
MT System Selection
two language models : MSA and Egyptian.
MT System Selection
The second set of features uses perplexity against language models built from the source-side of the training data of each of the four
Machine Translation Experiments
The language model for our systems is trained on English Gigaword (Graff and Cieri, 2003).
Machine Translation Experiments
We use SRILM Toolkit (Stolcke, 2002) to build a 5-gram language model with modified
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Nuhn, Malte and Mauser, Arne and Ney, Hermann
Abstract
On the task shown in (Ravi and Knight, 2011) we obtain better results with only 5% of the computational effort when running our method with an n-gram language model .
Introduction
Combining Language Models and
Training Algorithm and Implementation
As described in Section 4, the overall procedure is divided into two alternating steps: after initialization we first perform EM training of the translation model for 20-30 iterations using a 2-gram or 3-gram language model in the target language.
Training Algorithm and Implementation
The generative story described in Section 3 is implemented as a cascade of a permutation, insertion, lexicon, deletion and language model finite state transducers using OpenFST (Allauzen et al., 2007).
Translation Model
Stochastically generate the target sentence according to an n-gram language model .
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Braune, Fabienne and Seemann, Nina and Quernheim, Daniel and Maletti, Andreas
Decoding
The language model (LM) scoring is directly integrated into the cube pruning algorithm.
Decoding
Naturally, we also had to adjust hypothesis expansion and, most importantly, language model scoring inside the cube pruning algorithm.
Experiments
Our German 4-gram language model was trained on the German sentences in the training data augmented by the Stuttgart SdeWaC corpus (Web-as-Corpus Consortium, 2008), whose generation is detailed in (Baroni et al., 2009).
Translation Model
(1) The forward translation weight using the rule weights as described in Section 2 (2) The indirect translation weight using the rule weights as described in Section 2 (3) Lexical translation weight source → target (4) Lexical translation weight target → source (5) Target side language model (6) Number of words in the target sentences (7) Number of rules used in the pre-translation (8) Number of target side sequences; here k times the number of sequences used in the pre-translations that constructed τ (gap penalty) The rule weights required for (1) are relative frequencies normalized over all rules with the same left-hand side.
Translation Model
The computation of the language model estimates for (6) is adapted to score partial translations consisting of discontiguous units.
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Olsson, J. Scott and Oard, Douglas W.
Experiments
The resulting interview transcripts have a reported mean word error rate (WER) of approximately 25% on held out data, which was obtained by priming the language model with meta-data available from preinterview questionnaires.
Experiments
We use a mixture of the training transcripts and various newswire sources for our language model training.
Experiments
We did not attempt to prime the language model for particular interviewees or otherwise utilize any interview metadata.
Introduction
Limitations in signal processing, acoustic modeling, pronunciation, vocabulary, and language modeling can be accommodated in several ways, each of which make different tradeoffs and thus induce different
Previous Work
In the extreme case, the term may simply be out of vocabulary, although this may occur for various other reasons (e. g., poor language modeling or pronunciation dictionaries).
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Subotin, Michael
Corpora and baselines
A 5-gram language model with modified interpolated Kneser-Ney smoothing (Chen and Goodman, 1998) was trained by the SRILM toolkit (Stolcke, 2002) on a set of 208 million running words of text obtained by combining the monolingual Czech text distributed by the 2010
Corpora and baselines
The baselines consisted of the language model , two phrase translation models, two lexical models, and a brevity penalty.
Decoding with target-side model dependencies
language model , as described in Chiang (2007).
Decoding with target-side model dependencies
In the case of the language model these aspects include any of its target-side words that are part of still incomplete n-grams.
Hierarchical phrase-based translation
As shown by Chiang (2007), a weighted grammar of this form can be collected and scored by simple extensions of standard methods for phrase-based translation and efficiently combined with a language model in a CKY decoder to achieve large improvements over a state-of-the-art phrase-based system.
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Avramidis, Eleftherios and Koehn, Philipp
Experiments
For testing the factored translation systems, we used Moses (Koehn et al., 2007), along with a 5-gram SRILM language model (Stolcke, 2002).
Factored Model
The factored statistical machine translation model uses a log-linear approach, in order to combine the several components, including the language model , the reordering model, the translation models and the generation models.
Introduction
• The basic SMT approach uses the target language model as a feature in the argument maximisation function.
Introduction
This language model is trained on grammatically correct text, and would therefore give a good probability for word sequences that are likely to occur in a sentence, while it would penalise ungrammatical or badly ordered formations.
Introduction
Thus, with respect to these methods, there is a problem when agreement needs to be applied on part of a sentence whose length exceeds the order of the target n-gram language model and the size of the chunks that are translated (see Figure 1 for an example).
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Tu, Mei and Zhou, Yu and Zong, Chengqing
A semantic span can include one or more eus.
Most translation systems adopt the features from a translation model, a language model , and sometimes a reordering model.
A semantic span can include one or more eus.
The process of training this transfer model and smoothing is similar to the process of training a language model .
A semantic span can include one or more eus.
formula (6) are estimated in the same way as a factored language model , which has the advantage of easily incorporating various linguistic information.
Experiments
A 5-gram language model is trained with SRILM on the combination of the Xinhua portion of the English Gigaword corpus and the English part of FBIS.
Experiments
probabilities, the BTG reordering features, and the language model feature.
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Chambers, Nathanael
Experiments and Results
Unigram NLLR and Filtered NLLR are the language model implementations of previous work as described in Section 3.1.
Previous Work
They learned unigram language models (LMs) for specific time periods and scored articles with log-likelihood ratio scores.
Timestamp Classifiers
3.1 Language Models
Timestamp Classifiers
We apply Dirichlet-smoothing to the language models (as in de Jong et al.
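A sketch of this style of timestamp scoring: each period's unigram model is Dirichlet-smoothed against a background model and a document is assigned the period with the highest normalized log-likelihood ratio. The exact formulation here is a plausible reading of the setup, not copied from the paper.

import math
from collections import Counter

def nllr(doc_tokens, period_counts, background_prob, mu=1000.0):
    # Normalized log-likelihood ratio of a document against one period LM,
    # with the period model Dirichlet-smoothed toward the background model.
    total = sum(period_counts.values())
    doc_tf = Counter(doc_tokens)
    n = len(doc_tokens)
    score = 0.0
    for w, c in doc_tf.items():
        p_bg = background_prob.get(w, 1e-9)
        p_period = (period_counts.get(w, 0) + mu * p_bg) / (total + mu)
        score += (c / n) * math.log(p_period / p_bg)
    return score

periods = {"1990s": Counter("gulf war sanctions".split()), "2000s": Counter("iraq war surge".split())}
bg = {"war": 0.01, "gulf": 0.001, "iraq": 0.001, "sanctions": 0.001, "surge": 0.001}
print(max(periods, key=lambda t: nllr("gulf war sanctions debate".split(), periods[t], bg)))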
Timestamp Classifiers
The above language modeling and MaxEnt approaches are token-based classifiers that one could apply to any topic classification domain.
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Fleck, Margaret M.
Previous work
Language modelling methods build word ngram models, like those used in speech recognition.
Previous work
3.2 Language modelling methods
Previous work
So far, language modelling methods have been more effective.
The new approach
This corresponds roughly to a unigram language model .
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zhao, Hai and Song, Yan and Kit, Chunyu and Zhou, Guodong
Treebank Translation and Dependency Transformation
In detail, a word-based decoding is used, which adopts a log-linear framework as in (Och and Ney, 2002) with only two features, translation model and language model,
Treebank Translation and Dependency Transformation
is the language model , a word trigram model trained from the CTB.
Treebank Translation and Dependency Transformation
Thus the decoding process is actually only determined by the language model .
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Yang, Fan and Zhao, Jun and Zou, Bo and Liu, Kang and Liu, Feifan
Statistical Transliteration Model
The language model P(e) is trained from English texts.
Statistical Transliteration Model
generative probability of an English syllable language model .
Statistical Transliteration Model
2) The language model in backward transliteration describes the relationship of syllables in words.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Ozbal, Gozde and Strapparava, Carlo
Evaluation
Therefore, we implemented a ranking mechanism which used a hybrid scoring method by giving equal weights to the language model and the normalized phonetic similarity.
System Description
To check the likelihood and well-formedness of the new string after the replacement, we learn a 3-gram language model with absolute smoothing.
System Description
For learning the language model , we only consider the words in the CMU pronunciation dictionary which also exist in WordNet.
System Description
We remove the words containing at least one trigram which is very unlikely according to the language model .
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Song, Young-In and Lee, Jung-Tae and Rim, Hae-Chang
Previous Work
These approaches have focused on modeling statistical or syntactic phrasal relations under the language modeling method for information retrieval.
Previous Work
(Srikanth and Srihari, 2003; Maisonnasse et al., 2005) examined the effectiveness of syntactic relations in a query by using language modeling framework.
Previous Work
(Song and Croft, 1999; Miller et al., 1999; Gao et al., 2004; Metzler and Croft, 2005) investigated the effectiveness of language modeling approach in modeling statistical phrases such as n-grams or proximity-based phrases.
Proposed Method
We start out by presenting a simple phrase-based language modeling retrieval model that assumes uniform contribution of words and phrases.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Cui, Lei and Zhang, Dongdong and Liu, Shujie and Chen, Qiming and Li, Mu and Zhou, Ming and Yang, Muyun
Experiments
An in-house language modeling toolkit is used to train the 5-gram language model with modified Kneser-Ney smoothing (Kneser and Ney, 1995).
Experiments
The English monolingual data used for language modeling is the same as in Table 1.
Related Work
They incorporated the bilingual topic information into language model adaptation and lexicon translation model adaptation, achieving significant improvements in the large-scale evaluation.
Topic Similarity Model with Neural Network
Standard features: Translation model, including translation probabilities and lexical weights for both directions (4 features), 5-gram language model (1 feature), word count (1 feature), phrase count (1 feature), NULL penalty (1 feature), number of hierarchical rules used (1 feature).
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
McIntyre, Neil and Lapata, Mirella
Introduction
And Knight and Hatzivassiloglou (1995) use a language model for selecting a fluent sentence among the vast number of surface realizations corresponding to a single semantic representation.
Introduction
The top-ranked candidate is selected for presentation and verbalized using a language model interfaced with RealPro (Lavoie and Rambow, 1997), a text generation engine.
The Story Generator
Since we do not know a priori which of these parameters will result in a grammatical sentence, we generate all possible combinations and select the most likely one according to a language model .
The Story Generator
We used the SRI toolkit to train a trigram language model on the British National Corpus, with interpolated Kneser-Ney smoothing and perplexity as the scoring metric for the generated sentences.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min
Conclusion
Additionally, we also want to induce sense clusters for words in the target language so that we can build a sense-based language model and integrate it into SMT.
Decoding with Sense-Based Translation Model
error rate training (MERT) (Och, 2003) together with other models such as the language model .
Experiments
We trained a 5-gram language model on the Xinhua section of the English Gigaword corpus (306 million words) using the SRILM toolkit (Stolcke, 2002) with the modified Kneser-Ney smoothing (Chen and Goodman, 1996).
Related Work
(2007) also explore a bilingual topic model for translation and language model adaptation.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang and Lü, Yajuan and Liu, Qun
Decoding
1 to a log-linear model (Och and Ney, 2002) that uses the following eight features: relative frequencies in two directions, lexical weights in two directions, number of rules used, language model score, number of target words produced, and the probability of matched source tree (Mi et al., 2008).
Decoding
We use the cube pruning method (Chiang, 2007) to approximately intersect the translation forest with the language model .
Experiments
A trigram language model was trained on the English sentences of the training corpus.
Related Work
In machine translation, the concept of packed forest is first used by Huang and Chiang (2007) to characterize the search space of decoding with language models .
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Hasan, Kazi Saidul and Ng, Vincent
Keyphrase Extraction Approaches
3.3.4 Language Modeling
Keyphrase Extraction Approaches
These feature values are estimated using language models (LMs) trained on a foreground corpus and a background corpus.
Keyphrase Extraction Approaches
In sum, LMA uses a language model rather than heuristics to identify phrases, and relies on the language model trained on the background corpus to determine how “unique” a candidate keyphrase is to the domain represented by the foreground corpus.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Jia, Zhongye and Zhao, Hai
Experiments
SRILM (Stolcke, 2002) is adopted for language model training and KenLM (Heafield, 2011; Heafield et al., 2013) for language model query.
Pinyin Input Method Model
The edge weight is the negative logarithm of the conditional probability P(S_{j+1,k} | S_{j,i}) that a syllable S_{j,i} is followed by S_{j+1,k}, which is given by a bigram language model of pinyin syllables:
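A sketch of such edge weights: the cost of moving from one syllable candidate to the next is the negative log of the bigram probability, so a shortest-path search over the graph prefers the most probable syllable sequence. The bigram table here is a stand-in for the trained model.

import math

def edge_weight(prev_syllable, next_syllable, bigram_prob, floor=1e-8):
    # Edge weight = -log P(next | prev) under a pinyin-syllable bigram LM.
    return -math.log(bigram_prob.get((prev_syllable, next_syllable), floor))

bigrams = {("ni", "hao"): 0.2, ("ni", "hai"): 0.05}
print(edge_weight("ni", "hao", bigrams), edge_weight("ni", "hai", bigrams))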
Related Works
They solved the typo correction problem by decomposing the conditional probability P(H|P) of Chinese character sequence H given pinyin sequence P into a language model P(w_i|w_{i-1}) and a typing model. The typing model, estimated on real user input data, was used for typo correction.
Related Works
Various approaches were made for the task including language model (LM) based methods (Chen et al., 2013), ME model (Han and Chang, 2013), CRF (Wang et al., 2013d; Wang et al., 2013a), SMT (Chiu et al., 2013; Liu et al., 2013), and graph model (Jia et al., 2013), etc.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Li, Zhifei and Yarowsky, David
Experimental Results
To handle different directions of translation between Chinese and English, we built two trigram language models with modified Kneser-Ney smoothing (Chen and Goodman, 1998) using the SRILM toolkit (Stolcke, 2002).
Experimental Results
Feature (Baseline / AAMT): language model 0.137 / 0.133; phrase translation 0.066 / 0.023; lexical translation 0.061 / 0.078; reverse phrase translation 0.059 / 0.103; reverse lexical translation 0.
Unsupervised Translation Induction for Chinese Abbreviations
Moreover, our approach utilizes both Chinese and English monolingual data to help MT, while most SMT systems utilize only the English monolingual data to build a language model .
Unsupervised Translation Induction for Chinese Abbreviations
However, since most of statistical translation models (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006) are symmetrical, it is relatively easy to train a translation system to translate from English to Chinese, except that we need to train a Chinese language model from the Chinese monolingual data.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Levitan, Rivka and Elson, David
Features
We look at the language model (LM) score and the number of alternate pronunciations of the first query, predicting that a misrecognized query will have a lower LM score and more alternate pronunciations.
Prediction task
In addition, the language model likelihood for the first query was, as expected, significantly lower for retries.
Related Work
Retry cases are identified with joint language modeling across multiple transcripts, with the intuition that retry pairs tend to be closely related or exact duplicates.
Related Work
While we follow this work in our usage of joint language modeling , our application encompasses open domain voice searches and voice actions (such as placing calls), so we cannot use simplifying domain assumptions.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Li, Mu and Duan, Nan and Zhang, Dongdong and Li, Chi-Ho and Zhou, Ming
Collaborative Decoding
Similar to a language model score, n-gram consensus -based feature values cannot be summed up from smaller hypotheses.
Discussion
They also empirically show that n-gram agreement is the most important factor for improvement apart from language models .
Experiments
The language model used for all models (including decoding models and system combination models described in Section 2.6) is a 5-gram model trained with the English part of the bilingual data and the Xinhua portion of the LDC English Gigaword corpus version 3.
Experiments
We parsed the language model training data with Berkeley parser, and then trained a dependency language model based on the parsing output.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Kim, Jungi and Li, Jin-Ji and Lee, Jong-Hyeok
Experiment
For the relevance retrieval model, we faithfully reproduce the passage-based language model with pseudo-relevance feedback (Lee et al., 2008).
Term Weighting and Sentiment Analysis
IR models, such as Vector Space (VS), probabilistic models such as BM25, and Language Modeling (LM), albeit in different forms of approach and measure, employ heuristics and formal modeling approaches to effectively evaluate the relevance of a term to a document (Fang et al., 2004).
Term Weighting and Sentiment Analysis
In our experiments, we use the Vector Space model with Pivoted Normalization (VS), Probabilistic model (BM25), and Language modeling with Dirichlet Smoothing (LM).
Term Weighting and Sentiment Analysis
With proper assumptions and derivations, p(w | d) can be reduced to language modeling approaches.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Mi, Haitao and Huang, Liang and Liu, Qun
Experiments
language model and the remaining factor is the length penalty term
Experiments
We also use the SRI Language Modeling Toolkit (Stolcke, 2002) to train a trigram language model with Kneser-Ney smoothing on the English side of the bitext.
Experiments
Besides the trigram language model trained on the English side of these bitext, we also use another trigram model trained on the first 1/3 of the Xinhua portion of Gigaword corpus.
Forest-based translation
The decoder performs two tasks on the translation forest: l-best search with integrated language model (LM), and k-best search with LM to be used in minimum error rate training.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Li, Chi-Ho and Li, Mu and Zhou, Ming
Experiments and Results
The features we used are those commonly used in a standard BTG decoder, such as translation probabilities, lexical weights, language model, word penalty and distortion probabilities.
Experiments and Results
The language model is a 5-gram language model trained with the target sentences in the training data.
Experiments and Results
The language model is a 5-gram language model trained with the Giga-Word corpus plus the English sentences in the training data.
Features and Training
We also use other fundamental features, such as translation probabilities, lexical weights, distortion probability, word penalty, and language model probability.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Nguyen, ThuyLinh and Vogel, Stephan
Experiment Results
The language model is the interpolation of 5-gram language models built from news corpora of the NIST 2012 evaluation.
Experiment Results
The language model is the trigram SRI language model built from the Xinhua corpus of 180 million words.
Experiment Results
The language model is three-gram SRILM trained from the target side of the training corpora.
Introduction
Many features are shared between phrase-based and tree-based systems including language model , word count, and translation model features.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Deveaud, Romain and SanJuan, Eric and Bellot, Patrice
Introduction
The approach by Wei and Croft (2006) was the first to leverage LDA topics to improve the estimate of document language models and achieved good empirical results.
Topic-Driven Relevance Models
where Θ is a set of pseudo-relevant feedback documents and θ_D is the language model of document D. This notion of estimating a query model is
Topic-Driven Relevance Models
We tackle the null probabilities problem by smoothing the document language model using the well-known Dirichlet smoothing (Zhai and Lafferty, 2004).
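For reference, the Dirichlet-smoothed document language model of Zhai and Lafferty (2004) has the standard form

    p(w \mid \theta_D) = \frac{c(w, D) + \mu \, p(w \mid C)}{|D| + \mu}

where c(w, D) is the count of w in document D, |D| is the document length, p(w | C) is the collection language model, and μ is the Dirichlet prior parameter.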
Topic-Driven Relevance Models
Instead of viewing Θ as a set of document language models that are likely to contain topical information about the query, we take a probabilistic topic modeling approach.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
O'Connor, Brendan and Stewart, Brandon M. and Smith, Noah A.
Inference
After randomly initializing all η_{k,s,r,t}, inference is performed by a blocked Gibbs sampler, alternating resamplings for three major groups of variables: the language model (z, φ), the context model (α, γ, β, ρ), and the η, θ variables, which bottleneck between the submodels.
Inference
The language model sampler sequentially updates every z^(i) (and implicitly φ via collapsing) in the manner of Griffiths and Steyvers (2004): p(z^(i) | θ, w^(i), b) ∝ θ_{s,r,t,z} (n_{w,z} + b/V)/(n_z + b), where the counts n are over all event tuples besides i.
Model
Language model:
Model
Thus the language model is very similar to a topic model’s generation of token topics and wordtypes.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Cohn, Trevor and Haffari, Gholamreza
Experiments
In the end-to-end MT pipeline we use a standard set of features: relative-frequency and lexical translation model probabilities in both directions; distance-based distortion model; language model and word count.
Experiments
We train 3-gram language models using modified Kneser-Ney smoothing.
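As background for the smoothing named here, interpolated Kneser-Ney has the standard recursive form

    P_{KN}(w \mid h) = \frac{\max(c(h, w) - D, 0)}{c(h)} + \frac{D \cdot N_{1+}(h\,\cdot)}{c(h)} \, P_{KN}(w \mid h')

where c(h, w) is the count of w after history h, N_{1+}(h ·) is the number of distinct words seen after h, and h' is the shortened history; the modified variant replaces the single discount D with three count-dependent discounts D_1, D_2, D_{3+}.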
Experiments
For AR-EN experiments the language model is trained on English data as (Blunsom et al., 2009a), and for FA-EN and UR-EN the English data are the target sides of the bilingual training data.
Introduction
We develop a Bayesian approach using a Pitman-Yor process prior, which is capable of modelling a diverse range of geometrically decaying distributions over infinite event spaces (here translation phrase-pairs), an approach shown to be state of the art for language modelling (Teh, 2006).
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Laskowski, Kornel
Conceptual Framework
In language modeling practice, one finds the likelihood P(w | Θ) of a word sequence w of a given length under a model Θ to be an inconvenient measure for comparison.
Discussion
This makes it suitable for comparison of conversational genres, in much the same way as are general language models of words.
Discussion
Accordingly, as for language models, density estimation in future turn-taking models may be improved.
Introduction
The current work attempts to address this problem by proposing a simple framework, which, at least conceptually, borrows quite heavily from the standard language modeling paradigm.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Xiao, Xinyan and Xiong, Deyi and Zhang, Min and Liu, Qun and Lin, Shouxun
Experiments
The monolingual data for training English language model includes the Xinhua portion of the GIGAWORD corpus, which contains 238M English words.
Experiments
A 4-gram language model was trained on the monolingual data by the SRILM toolkit (Stolcke, 2002).
Related Work
Researchers have also introduced topic models for cross-lingual language model adaptation (Tam et al., 2007; Ruiz and Federico, 2011).
Related Work
Based on the bilingual topic model, they apply the source-side topic weights into the target-side topic model, and adapt the n-gram language model of target side.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wu, Hua and Wang, Haifeng and Liu, Ting
Experiments
We used SRILM for the training of language models (5-gram in all the experiments).
Experiments
We trained a Chinese language model for the EC translation on the Chinese part of the bi-text.
Experiments
For the English language model of CE translation, an extra corpus named Tanaka was used besides the English part of the bilingual corpora.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Smith, Jason R. and Saint-Amand, Herve and Plamada, Magdalena and Koehn, Philipp and Callison-Burch, Chris and Lopez, Adam
Abstract
In all experiments we include the target side of the mined parallel data in the language model , in order to distinguish whether results are due to influences from parallel or monolingual data.
Abstract
In these experiments, we use 5-gram language models when the target language is English or German, and 4-gram language models for French and Spanish.
Abstract
The baseline system was trained using only the Europarl corpus (Koehn, 2005) as parallel data, and all experiments use the same language model trained on the target sides of Europarl, the English side of all linked Spanish-English Wikipedia articles, and the English side of the mined CommonCrawl data.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Labutov, Igor and Lipson, Hod
Model
Broadly, as the learner progresses from one sentence to the next, exposing herself to more novel words, the updated parameters of the language model in turn guide the selection of new “switch-points” for replacing source words with the target foreign words.
Model
Generally, this value may come directly from the surprisal quantity given by a language model , or may incorporate additional features that are found informative in predicting the constraint on the word.
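The surprisal quantity referred to here is conventionally the negative log-probability the language model assigns to a word in its context:

    s(w_i) = -\log p(w_i \mid w_1, \dots, w_{i-1})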
Related Work
Building on their work, (Adel et al., 2012) employ additional features and a recurrent network language model for modeling code-switching in conversational speech.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zollmann, Andreas and Vogel, Stephan
Experiments
Apart from the language model, the lexical, phrasal, and (for the syntax grammar) label-conditioned features, and the rule, target word, and glue operation counters, Venugopal and Zollmann (2009) also provide both the hierarchical and syntax-augmented grammars with a rareness penalty 1/cnt(r), where cnt(r) is the occurrence count of rule r in the training corpus, allowing the system to learn penalization of low-frequency rules, as well as three indicator features firing if the rule has one, two unswapped, and two swapped nonterminal pairs, respectively. Further, to mitigate badly estimated PSCFG derivations based on low-frequency rules of the much sparser syntax model, the syntax grammar also contains the hierarchical grammar as a backbone (cf.
Experiments
Each system is trained separately to adapt the parameters to its specific properties (size of the nonterminal set, grammar complexity, feature sparseness, reliance on the language model, etc.).
Related work
The supertags are also injected into the language model .
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min and Li, Haizhou
Features
To some extent, these two features have a similar function to a target language model or a POS-based target language model.
Related Work
(2009) study several confidence features based on mutual information between words and n-gram and backward n-gram language model for word-level and sentence-level CE.
SMT System
We build a four-gram language model using the SRILM toolkit (Stolcke, 2002), which is trained
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Hagiwara, Masato and Sekine, Satoshi
Introduction
Our approach is based on semi-Markov discriminative structure prediction, and it incorporates English back-transliteration and English language models (LMs) into WS in a seamless way.
Use of Language Model
Language Model Augmentation Analogous to Koehn and Knight (2003), we can exploit the fact that レッド reddo (red) in the example is such a common word that one can expect it appears frequently in the training corpus.
Use of Language Model
4.1 Language Model Projection
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Goto, Isao and Utiyama, Masao and Sumita, Eiichiro and Tamura, Akihiro and Kurohashi, Sadao
Experiment
We used 5-gram language models that were trained using the English side of each set of bilingual training data.
Experiment
The common SMT feature set consists of: four translation model features, phrase penalty, word penalty, and a language model feature.
Introduction
A language model also supports the estimation.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Wu, Xianchao and Matsuzaki, Takuya and Tsujii, Jun'ichi
Experiments
Here, the first item is the language model (LM) probability, where τ(d) is the target string of derivation d; the second item is the translation length penalty; and the third item is the translation score, which is decomposed into a product of feature values of rules:
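The equation itself is not reproduced in this extract; a common shape for a score with these three parts (the paper's exact feature functions and weights may differ) is

    score(d) = P_{lm}(\tau(d))^{\lambda_{lm}} \cdot \exp(\lambda_{len} \, |\tau(d)|) \cdot \prod_{r \in d} \prod_{i} \phi_i(r)^{\lambda_i}

where the φ_i(r) are the feature values of rule r and the λ are tuned weights.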
Experiments
SRI Language Modeling Toolkit (Stolcke, 2002) was employed to train 5-gram English and Japanese LMs on the training set.
Related Work
By introducing supertags into the target language side, i.e., the target language model and the target side of the phrase table, significant improvement was achieved for Arabic-to-English translation.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Feng, Yang and Cohn, Trevor
Experiments
The language model is a 3-gram language model trained using the SRILM toolkit (Stolcke, 2002) on the English side of the training data.
Experiments
The language model is a 3-gram LM trained on Xinhua portion of the Gigaword corpus using the SRILM toolkit with modified Kneser—Ney smoothing.
Related Work
(2011) develop a bilingual language model which incorporates words in the source and target languages to predict the next unit, which they use as a feature in a translation system.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wang, Haifeng and Guo, Yuqing and Liu, Ting
Introduction
One is n-gram model over different units, such as word-level bigram/trigram models (Bangalore and Rambow, 2000; Langkilde, 2000), or factored language models integrated with syntactic tags (White et al.
Introduction
(2009) present a dependency-spanning tree algorithm for word ordering, which first builds dependency trees to decide linear precedence between heads and modifiers, and then uses an n-gram language model to order siblings.
Log-linear Models
We linearize the dependency relations by computing n-gram models, similar to traditional word-based language models , except using the names of dependency relations instead of words.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Elfardy, Heba and Diab, Mona
Approach to Sentence-Level Dialect Identification
The aforementioned approach relies on language models (LM) and an MSA and EDA morphological analyzer to decide whether each word is (a) MSA, (b) EDA, (c) both (MSA & EDA), or (d) OOV.
Approach to Sentence-Level Dialect Identification
The perplexity of a language model on a given test sentence S(w_1, ..., w_n) is defined as:
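The formula is not reproduced in the extract; the standard definition for a sentence of n words is

    PP(S) = p(w_1, \dots, w_n)^{-1/n} = \exp\left( -\frac{1}{n} \sum_{i=1}^{n} \log p(w_i \mid w_1, \dots, w_{i-1}) \right)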
Related Work
Amazon Mechanical Turk and try a language modeling (LM) approach to solve the problem.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Konstas, Ioannis and Lapata, Mirella
Experimental Design
approximately via cube pruning (Chiang, 2007), by integrating a trigram language model extracted from the training set (see Konstas and Lapata (2012) for details).
Experimental Design
Lexical Features These features encourage grammatical coherence and inform lexical selection over and above the limited horizon of the language model captured by Rules (6)-(9).
Problem Formulation
In machine translation, a decoder that implements forest rescoring (Huang and Chiang, 2007) uses the language model as an external criterion of the goodness of sub-translations on account of their grammaticality.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Kolachina, Prasanth and Cancedda, Nicola and Dymetman, Marc and Venkatapathy, Sriram
Inferring a learning curve from mostly monolingual data
In this section we address scenario S1: we have access to a source-language monolingual collection (from which portions to be manually translated could be sampled) and a target-language in-domain monolingual corpus, to supplement the target side of a parallel corpus while training a language model.
Inferring a learning curve from mostly monolingual data
(b) perplexity of language models of order 2 to 5 derived from the monolingual source corpus computed on the source side of the test corpus.
Inferring a learning curve from mostly monolingual data
The Lasso regression model selected four features from the entire feature set: i) Size of the test set (sentences & tokens), ii) Perplexity of the language model (order 5) on the test set, and iii) Type-token ratio of the target monolingual corpus.
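For illustration only, a minimal sketch of this kind of feature selection with L1-regularized regression (scikit-learn's Lasso; the feature columns and the numbers below are invented, not the paper's data):

    import numpy as np
    from sklearn.linear_model import Lasso

    # Rows: configurations for which a learning-curve point is known.
    # Columns (illustrative): test-set size, LM perplexity on the test set, type-token ratio.
    # (In practice the features would be standardized before fitting.)
    X = np.array([[1000, 250.3, 0.12],
                  [2000, 180.7, 0.10],
                  [1500, 210.5, 0.11],
                  [3000, 150.2, 0.09]])
    y = np.array([22.1, 25.4, 23.8, 27.0])  # quantity to predict, e.g. a BLEU score

    model = Lasso(alpha=0.1).fit(X, y)
    print(model.coef_)  # features with zero coefficients were not selected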
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Cherry, Colin
Discussion
So long as the vocabulary present in our phrase table and language model supports a literal translation, cohesion tends to produce an improvement.
Discussion
In the baseline translation, the language model encourages the system to move the negation away from “exist” and toward “reduce.” The result is a tragic reversal of meaning in the translation.
Introduction
order, forcing the decoder to rely heavily on its language model .
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Celikyilmaz, Asli and Hakkani-Tur, Dilek and Tur, Gokhan and Sarikaya, Ruhi
Markov Topic Regression - MTR
Language Model Prior (η_w): Probabilities on word transitions, denoted as η_w = p(w_i = v | w_{i-1}).
Markov Topic Regression - MTR
We built a language model using SRILM (Stolcke, 2002) on domain-specific sources such as top wiki pages and blogs on online movie reviews, etc., to obtain the probabilities of domain-specific n-grams, up to 3-grams.
Markov Topic Regression - MTR
In (1), we assume that the prior on the semantic tags, η_s, is more indicative of the decision for sampling a w_i from a new tag, compared to the language model posteriors on word sequences, η_w.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Ott, Myle and Choi, Yejin and Cardie, Claire and Hancock, Jeffrey T.
Automated Approaches to Deceptive Opinion Spam Detection
Under (2), both the NB classifier used by Mihalcea and Strapparava (2009) and the language model classifier used by Zhou et al.
Automated Approaches to Deceptive Opinion Spam Detection
(2008), we use the SRI Language Modeling Toolkit (Stolcke, 2002) to estimate individual language models, Pr(x | y = c), for truthful and deceptive opinions.
Automated Approaches to Deceptive Opinion Spam Detection
We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
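A minimal sketch of classification by class-conditional language model likelihood in the spirit of this setup (add-one-smoothed unigram models stand in for the SRILM-estimated n-gram models; the example documents are invented):

    import math
    from collections import Counter

    class UnigramLM:
        """Add-one-smoothed unigram model, standing in for an n-gram LM Pr(x | y = c)."""
        def __init__(self, docs):
            self.counts = Counter(w for doc in docs for w in doc)
            self.total = sum(self.counts.values())

        def logprob(self, doc, vocab_size):
            return sum(math.log((self.counts[w] + 1) / (self.total + vocab_size))
                       for w in doc)

    def classify(doc, lms, vocab_size):
        # Assign the class whose language model gives the document the higher likelihood.
        return max(lms, key=lambda c: lms[c].logprob(doc, vocab_size))

    truthful = [["great", "location", "clean", "rooms"],
                ["helpful", "staff", "and", "good", "value"]]
    deceptive = [["an", "amazing", "luxury", "experience"],
                 ["my", "husband", "and", "i", "loved", "it"]]
    lms = {"truthful": UnigramLM(truthful), "deceptive": UnigramLM(deceptive)}
    vocab = {w for doc in truthful + deceptive for w in doc}
    print(classify(["clean", "rooms", "and", "good", "location"], lms, len(vocab)))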
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Alfonseca, Enrique and Pighin, Daniele and Garrido, Guillermo
Experiment settings
TopicSum: we use TopicSum (Haghighi and Vanderwende, 2009), a 3-layer hierarchical topic model, to infer the language model that is most central for the collection.
Experiment settings
divergence with respect to the collection language model is the one chosen.
Related work
(2007) generate novel utterances by combining Prim’s maximum-spanning-tree algorithm with an n-gram language model to enforce fluency.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Feng, Yansong and Lapata, Mirella
Abstractive Caption Generation
Specifically, we use an adaptive language model (Kneser et al., 1997) that modifies an
Abstractive Caption Generation
where P(w_i ∈ C | w_i ∈ D) is the probability of w_i appearing in the caption given that it appears in the document D, and P_adap(w_i | w_{i-1}, w_{i-2}) is the language model adapted with probabilities from our image annotation model:
Experimental Setup
The scaling parameter β for the adaptive language model was also tuned on the development set using a range of [0.5, 0.9].
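As a rough sketch of the kind of adaptation described (the paper's exact formulation is not reproduced in this extract), unigram-marginal adaptation in the style of Kneser et al. (1997) rescales a background n-gram model by an exponentiated ratio of adapted to background unigram probabilities, with β controlling the adaptation strength; P_annot below is a placeholder name for the distribution supplied by the image annotation model:

    P_{adap}(w_i \mid w_{i-1}, w_{i-2}) \propto \left( \frac{P_{annot}(w_i)}{P(w_i)} \right)^{\beta} P(w_i \mid w_{i-1}, w_{i-2})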
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Yang, Hui and Callan, Jamie
The Features
It is built into a unigram language model without smoothing for each term.
The Features
This feature function measures the Kullback—Leibler divergence (KL divergence) between the language models associated with the two inputs.
The Features
Similarly, the local context is built into a unigram language model without smoothing for each term; the feature function outputs KL divergence between the models.
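A minimal sketch of the feature computation described above, assuming unsmoothed maximum-likelihood unigram models over the two inputs (terms missing from the second model are skipped, since no smoothing is applied; the helper names are illustrative):

    import math
    from collections import Counter

    def unigram_lm(tokens):
        """Unsmoothed unigram language model: relative frequencies."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def kl_divergence(p, q):
        """KL(p || q), restricted to words with nonzero probability under both models."""
        return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if w in q)

    # usage: compare the local contexts of two candidate terms
    p = unigram_lm("gene expression of the target gene".split())
    q = unigram_lm("expression profile of a regulatory gene".split())
    print(kl_divergence(p, q))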
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Yang, Nan and Liu, Shujie and Li, Mu and Zhou, Ming and Yu, Nenghai
Related Work
(Bengio et al., 2006) proposed to use a multilayer neural network for the language modeling task.
Related Work
(Niehues and Waibel, 2012) show that machine translation results can be improved by combining a neural language model with a traditional n-gram language model.
Related Work
(Son et al., 2012) improve the translation quality of an n-gram translation model by using a bilingual neural language model.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zhai, Feifei and Zhang, Jiajun and Zhou, Yu and Zong, Chengqing
Experiment
We train a 5-gram language model with the Xinhua portion of English Gigaword corpus and target part of the training data.
Integrating into the PAS-based Translation Framework
The weights of the MEPD feature can be tuned by MERT (Och, 2003) together with other translation features, such as language model .
PAS-based Translation Framework
The target-side-like PAS is selected only according to the language model and translation probabilities, without considering any context information of PAS.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zhou, Guangyou and Liu, Fang and Liu, Yang and He, Shizhu and Zhao, Jun
Experiments
Row 1 and row 2 are two baseline systems, which model the relevance score using VSM (Cao et al., 2010) and language model (LM) (Zhai and Lafferty, 2001; Cao et al., 2010) in the term space.
Experiments
Row 3 is the word-based translation model (Jeon et al., 2005), and row 4 is the word-based translation language model, which linearly combines the word-based translation model and language model into a unified framework (Xue et al., 2008).
Experiments
(2009) in Table 3 because previous work (Ming et al., 2010) demonstrated that the word-based translation language model (Xue et al., 2008) obtained superior performance compared to the syntactic tree matching approach (Wang et al., 2009).
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Sun, Jun and Zhang, Min and Tan, Chew Lim
Experiments
In the experiments, we train the translation model on the FBIS corpus (7.2M (Chinese) + 9.2M (English) words) and train a 4-gram language model on the Xinhua portion of the English Gigaword corpus (181M words) using the SRILM Toolkit (Stolcke, 2002).
NonContiguous Tree sequence Align-ment-based Model
2) The bi-lexical translation probabilities; 3) The target language model.
The Pisces decoder
On the other hand, to simplify the computation of the language model, we only compute it for source-side contiguous translation hypotheses, while neglecting any gaps in the target side.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Foster, George and Kuhn, Roland
Experiments
We trained two language models : the first one is a 4-gram LM which is estimated on the target side of the texts used in the large data condition.
Experiments
Both language models are used for both tasks.
Experiments
Only the target-language half of the parallel training data are used to train the language model in this task.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Setiawan, Hendra and Kan, Min Yen and Li, Haizhou and Resnik, Philip
Experimental Setup
For the language model , we used a 5-gram model with modified Kneser-Ney smoothing (Kneser and Ney, 1995) trained on the English side of our training data as well as portions of the Giga-word v2 English corpus.
Experimental Setup
For the language model , we used a 5-gram model trained on the English portion of the whole training data plus portions of the Gigaword v2 corpus.
Hierarchical Phrase-based System
Given e and f as the source and target phrases associated with the rule, typical features used are the rule's translation probability P_trans(f|e) and its inverse P_trans(e|f), and the lexical probability P_lex(f|e) and its inverse P_lex(e|f). Systems generally also employ a word penalty, a phrase penalty, and a target language model feature.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Bamman, David and Underwood, Ted and Smith, Noah A.
Data
To manage the degrees of freedom in the model described in §4, we perform dimensionality reduction on the vocabulary by learning word embeddings with a log-linear continuous skip-gram language model (Mikolov et al., 2013) on the entire collection of 15,099 books.
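A minimal sketch of learning skip-gram embeddings of this kind, here with gensim's Word2Vec implementation (gensim is assumed only for illustration; the paper's corpus, preprocessing, and hyperparameters are not reproduced):

    from gensim.models import Word2Vec

    # sentences: an iterable of token lists, e.g. tokenized text from the book collection
    sentences = [["the", "count", "rode", "into", "the", "night"],
                 ["she", "wrote", "a", "long", "letter", "home"]]

    # sg=1 selects the skip-gram objective; vector_size is the embedding dimensionality.
    model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1, epochs=20)

    vec = model.wv["letter"]                       # learned embedding for one word type
    print(model.wv.most_similar("letter", topn=3))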
Model
Maximum entropy approaches to language modeling have been used since Rosenfeld (1996) to incorporate long-distance information, such as previously-mentioned trigger words, into n-gram language models .
Model
Number of personas (hyperparameter); D: Number of documents; C_d: Number of characters in document d; W_{d,c}: Number of (cluster, role) tuples for character c; m_d: Metadata for document d (ranges over M authors); θ_d: Document d's distribution over personas; p_{d,c}: Character c's persona; j: An index for a <r, w> tuple in the data; w_j: Word cluster ID for tuple j; r_j: Role for tuple j ∈ {agent, patient, poss, pred}; η: Coefficients for the log-linear language model; μ, λ: Laplace mean and scale (for regularizing η); α: Dirichlet concentration parameter.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Baroni, Marco and Dinu, Georgiana and Kruszewski, Germán
Abstract
Context-predicting models (more commonly known as embeddings or neural language models ) are the new kids on the distributional semantics block.
Introduction
This is in part due to the fact that context-predicting vectors were first developed as an approach to language modeling and/or as a way to initialize feature vectors in neural-network-based “deep learning” NLP architectures, so their effectiveness as semantic representations was initially seen as little more than an interesting side effect.
Introduction
Predictive DSMs are also called neural language models , because their supervised context prediction training is performed with neural networks, or, more cryptically, “embeddings”.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Ling, Wang and Xiang, Guang and Dyer, Chris and Black, Alan and Trancoso, Isabel
Experiments
For this test set, we used 8 million sentences from the full NIST parallel dataset as the language model training data.
Experiments
If either the source or the target side of a training instance had an edit distance of less than 10%, we removed it. As for the language models, we collected a further 10M tweets from Twitter for the English language model and another 10M tweets from Weibo for the Chinese language model.
Experiments
As the language model , we use a 5-gram model with Kneser—Ney smoothing.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Snyder, Benjamin and Barzilay, Regina and Knight, Kevin
Evaluation Tasks and Results
To produce baseline cognate identification predictions, we calculate the probability of each latent Hebrew letter sequence predicted by the HMM, and compare it to a uniform character-level Ugaritic language model (as done by our model, to avoid automatically assigning higher cognate probability to shorter Ugaritic words).
Inference
We also calculate this probability using a uniform unigram character-level language model (and thus it depends only on the number of characters in u_i).
Model
Otherwise, a lone word is generated according to a uniform character-level language model.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Gyawali, Bikash and Gardent, Claire
Conclusion
In the current version of the generator, the output is ranked using a simple language model trained on the GENIA corpus.
Generating from the KBGen Knowledge-Base
To rank the generator output, we train a language model on the GENIA corpus, a corpus of 2,000 MEDLINE abstracts about biology containing more than 400,000 words (Kim et al., 2003), and use this model to rank the generated sentences by decreasing probability.
Related Work
They intersect the grammar with a language model to improve fluency; use a weighted hypergraph to pack the derivations; and find the best derivation tree using the Viterbi algorithm.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Wang, Jia and Li, Qing and Chen, Yuanzhu Peter and Lin, Zhangxi
Conclusion and Future Work
By combining such information with traditional statistical language models , it is capable of suggesting relevant articles that meet the dynamic nature of a discussion in social media.
Experimental Evaluation
The second one, LM, is based on statistical language models for relevant information retrieval (Ponte and Croft, 1998).
Experimental Evaluation
It builds a probabilistic language model for each article, and ranks them on query likelihood.
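A minimal sketch of query-likelihood ranking in the spirit of Ponte and Croft (1998): each article gets a unigram language model, smoothed here with a linear (Jelinek-Mercer) mixture against a collection model (the smoothing choice and the toy articles are illustrative, not the paper's):

    import math
    from collections import Counter

    def query_log_likelihood(query, doc, collection, lam=0.5):
        """log P(query | article LM), with Jelinek-Mercer smoothing against the collection."""
        d, c = Counter(doc), Counter(collection)
        d_len, c_len = len(doc), len(collection)
        # The +1 on the collection counts guards against query terms unseen anywhere.
        return sum(math.log(lam * d[w] / d_len + (1 - lam) * (c[w] + 1) / (c_len + 1))
                   for w in query)

    articles = {"a1": "the new phone ships with a much better camera".split(),
                "a2": "stock markets fell sharply on monday morning".split()}
    collection = [w for doc in articles.values() for w in doc]
    query = "phone camera".split()

    ranked = sorted(articles,
                    key=lambda a: query_log_likelihood(query, articles[a], collection),
                    reverse=True)
    print(ranked)  # article ids ordered by query likelihood, highest first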
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang and Mi, Haitao and Feng, Yang and Liu, Qun
Experiments
For the language model, we used the SRI Language Modeling Toolkit (Stolcke, 2002) to train a 4-gram model on the Xinhua portion of the GIGAWORD corpus.
Joint Decoding
There are also features independent of derivations, such as the language model and word penalty.
Joint Decoding
Although left-to-right decoding might enable a more efficient use of language models and hopefully produce better translations, we adopt bottom-up decoding in this paper just for convenience.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Hermann, Karl Moritz and Blunsom, Phil
Introduction
Successful applications of such models include language modelling (Bengio et al., 2003), paraphrase detection (Erk and Pado, 2008), and dialogue analysis (Kalchbrenner and Blunsom, 2013).
Related Work
Neural language models are another popular approach for inducing distributed word representations (Bengio et al., 2003).
Related Work
They have received a lot of attention in recent years (Collobert and Weston, 2008; Mnih and Hinton, 2009; Mikolov et al., 2010, inter alia) and have achieved state of the art performance in language modelling .
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Jansen, Peter and Surdeanu, Mihai and Clark, Peter
Models and Features
In particular, we use the recurrent neural network language model (RNNLM) of Mikolov et al.
Models and Features
Like any language model, an RNNLM estimates the probability of observing a word given the preceding context, but, in this process, it learns word embeddings into a latent, conceptual space with a fixed number of dimensions.
Related Work
(2013) recently addressed the problem of answer sentence selection and demonstrated that LS models, including recurrent neural network language models (RNNLM), have a higher contribution to overall performance than exploiting syntactic analysis.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Tomeh, Nadi and Habash, Nizar and Roth, Ryan and Farra, Noura and Dasigi, Pradeep and Diab, Mona
Abstract
Optical Character Recognition (OCR) systems for Arabic rely on information contained in the scanned images to recognize sequences of characters and on language models to emphasize fluency.
Discriminative Reranking for OCR
The LM models are built using the SRI Language Modeling Toolkit (Stolcke, 2002).
Introduction
The BBN Byblos OCR system (Natarajan et al., 2002; Prasad et al., 2008; Saleem et al., 2009), which we use in this paper, relies on a hidden Markov model (HMM) to recover the sequence of characters from the image, and uses an n-gram language model (LM) to emphasize the fluency of the output.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Kalchbrenner, Nal and Grefenstette, Edward and Blunsom, Phil
Background
The RNN is primarily used as a language model , but may also be viewed as a sentence model with a linear structure.
Introduction
Besides comprising powerful classifiers as part of their architecture, neural sentence models can be used to condition a neural language model to generate sentences word by word (Schwenk, 2012; Mikolov and Zweig, 2012; Kalchbrenner and Blunsom, 2013a).
Properties of the Sentence Model
This gives the RNN excellent performance at language modelling , but it is suboptimal for remembering at once the n-grams further back in the input sentence.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: