Introduction | Further, decoding with nonlocal (or state-dependent) features, such as a language model, is also a problem. |
Introduction | Actually, even for the (log-)linear model, efficient decoding with the language model is not trivial (Chiang, 2007).
Introduction | For the nonlocal features such as the language model, Chiang (2007) proposed a cube-pruning method for efficient decoding.
Abstract | Our model is a nested hierarchical Pitman-Yor language model, where a Pitman-Yor spelling model is embedded in the word model.
Abstract | Our model is also considered as a way to construct an accurate word n-gram language model directly from characters of arbitrary language, without any “word” indications.
Introduction | Bayesian Kneser-Ney) language model, with an accurate character ∞-gram Pitman-Yor spelling model embedded in word models.
Introduction | Furthermore, it can be viewed as a method for building a high-performance n-gram language model directly from character strings of arbitrary language. |
Introduction | we briefly describe a language model based on the Pitman-Yor process (Teh, 2006b), which is a generalization of the Dirichlet process used in previous research. |
Nested Pitman-Yor Language Model | In contrast, in this paper we use a simple but more elaborate model, that is, a character n-gram language model that also employs HPYLM. |
Nested Pitman-Yor Language Model | Figure 2: Chinese restaurant representation of our Nested Pitman-Yor Language Model (NPYLM). |
Pitman-Yor process and n-gram models | To compute a probability p(w|s) in (1), we adopt a Bayesian language model recently proposed by (Teh, 2006b; Goldwater et al., 2005) based on the Pitman-Yor process, a generalization of the Dirichlet process.
Pitman-Yor process and n-gram models | As a result, the n-gram probability of this hierarchical Pitman-Yor language model (HPYLM) is recursively computed as
Pitman-Yor process and n-gram models | When we set t_hw ≡ 1, (4) recovers a Kneser-Ney smoothing: thus an HPYLM is a Bayesian Kneser-Ney language model as well as an extension of the hierarchical Dirichlet process (HDP) used in Goldwater et al.
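The Kneser-Ney recursion that HPYLM generalizes can be sketched directly from counts. Below is a minimal interpolated Kneser-Ney bigram estimator; the fixed discount stands in for the Pitman-Yor discount parameter, and the function name and toy corpus are illustrative assumptions, not taken from the paper.

```python
from collections import Counter, defaultdict

def train_kn_bigram(tokens, discount=0.75):
    """Interpolated Kneser-Ney bigram estimates from raw counts.
    The lower-order ("continuation") distribution is built from type
    counts rather than token counts, as in the Kneser-Ney recursion."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter(tokens[:-1])   # c(h): times h is seen as a context
    followers = defaultdict(set)            # distinct words following h
    predecessors = defaultdict(set)         # distinct contexts preceding w
    for h, w in bigrams:
        followers[h].add(w)
        predecessors[w].add(h)
    n_bigram_types = len(bigrams)

    def prob(w, h):
        # continuation probability: how many distinct contexts w completes
        p_cont = len(predecessors[w]) / n_bigram_types
        c_h = context_totals[h]
        if c_h == 0:                         # unseen context: back off fully
            return p_cont
        discounted = max(bigrams[(h, w)] - discount, 0) / c_h
        # mass freed by discounting is redistributed via p_cont
        backoff_weight = discount * len(followers[h]) / c_h
        return discounted + backoff_weight * p_cont

    return prob

tokens = "the cat sat on the mat the cat ran".split()
p = train_kn_bigram(tokens)
```

Setting a per-context discount and drawing it from the Pitman-Yor posterior, rather than fixing it, is the step that turns this into the Bayesian HPYLM view.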
Abstract | Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation. |
Abstract | We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion. |
Expected BLEU Training | We integrate the recurrent neural network language model as an additional feature into the standard log-linear framework of translation (Och, 2003). |
Expected BLEU Training | We summarize the weights of the recurrent neural network language model as θ = {U, W, V} and add the model as an additional feature to the log-linear translation model using the simplified notation s_θ(w_t) = s(w_t|w_1 ... w_{t-1}, h_{t-1}):
Expected BLEU Training | which computes a sentence-level language model score as the sum of individual word scores. |
Introduction | In this paper we focus on recurrent neural network architectures which have recently advanced the state of the art in language modeling (Mikolov et al., 2010; Mikolov et al., 2011; Sundermeyer et al., 2013) with several subsequent applications in machine translation (Auli et al., 2013; Kalchbrenner and Blunsom, 2013; Hu et al., 2014). |
Introduction | (2013), who demonstrated that feed-forward network-based language models are more accurate in first-pass decoding than in rescoring.
Introduction | Decoding with feed-forward architectures is straightforward, since predictions are based on a fixed size input, similar to n-gram language models.
Recurrent Neural Network LMs | Our model has a similar structure to the recurrent neural network language model of Mikolov et al. |
Experiments | First we consider a bigram language model and the algorithms try to find the reordering that maximizes the LM score.
Experiments | Then we consider a trigram-based language model and the algorithms again try to maximize the LM score.
Experiments | This means that, when using a bigram language model, it is often possible to reorder the words of a randomly permuted reference sentence in such a way that the LM score of the reordered sentence is larger than the LM score of the reference.
Introduction | Typical nonlocal features include one or more n-gram language models as well as a distortion feature, measuring by how much the order of biphrases in the candidate translation deviates from their order in the source sentence. |
Phrase-based Decoding as TSP | The language model cost of producing the target words of b′ right after the target words of b; with a bigram language model, this cost can be precomputed directly from b and b′.
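With a bigram LM, the cost of emitting b′'s target words right after b depends only on b's final target word, so each (b, b′) pair can be scored once before search begins. A sketch under assumed toy data structures (the dict of bigram probabilities and the function name are ours, not the paper's):

```python
import math

def bigram_concat_cost(bigram_probs, prev_phrase, next_phrase):
    """Negative log-probability contributed by emitting next_phrase
    immediately after prev_phrase under a bigram LM. Only the last
    word of prev_phrase matters, so this cost can be tabulated once
    per biphrase pair before decoding."""
    words = [prev_phrase[-1]] + list(next_phrase)
    return -sum(math.log(bigram_probs[(w1, w2)])
                for w1, w2 in zip(words, words[1:]))

probs = {("a", "b"): 0.5, ("b", "c"): 0.25}
cost = bigram_concat_cost(probs, ["x", "a"], ["b", "c"])   # -log(0.5 * 0.25)
```

With a trigram or higher-order LM the cost also depends on earlier words of b, which is exactly why the extension discussed next is nontrivial.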
Phrase-based Decoding as TSP | Successful phrase-based systems typically employ language models of order higher than two. |
Phrase-based Decoding as TSP | If we want to extend the power of the model to general n-gram language models , and in particular to the 3-gram |
Abstract | Our results show that summaries biased by dependency pattern models lead to significantly higher ROUGE scores than both n-gram language models reported in previous work and also Wikipedia baseline summaries. |
Introduction | They also experimented with representing such conceptual models using n-gram language models derived from corpora consisting of collections of descriptions of instances of specific object types (e.g.
Introduction | a corpus of descriptions of churches, a corpus of bridge descriptions, and so on) and reported results showing that incorporating such n-gram language models as a feature in a feature-based extractive summarizer improves the quality of automatically generated summaries. |
Introduction | The main weakness of n-gram language models is that they only capture very local information about short term sequences and cannot model long-distance dependencies between terms.
Representing conceptual models 2.1 Object type corpora | 2.2 N-gram language models |
Representing conceptual models 2.1 Object type corpora | Aker and Gaizauskas (2009) experimented with uni-gram and bi-gram language models to capture the features commonly used when describing an object type and used these to bias the sentence selection of the summarizer towards the sentences that contain these features. |
Representing conceptual models 2.1 Object type corpora | As in Song and Croft (1999) they used their language models in a gener- |
Explaining between-word regressions | This simple example just illustrates the point that if a reader is combining noisy visual information with a language model, then confidence in previous regions will sometimes fall.
Models of eye movements in reading | Unfortunately, however, the Mr. Chips model simplifies the problem of reading in a number of ways: First, it uses a unigram model as its language model, and thus fails to use any information in the linguistic context to help with word identification.
Models of eye movements in reading | Specifically, our model identifies the words in a sentence by performing Bayesian inference combining noisy input from a realistic visual model with a language model that takes context into account. |
Reading as Bayesian inference | Specifically, the model begins reading with a prior distribution over possible identities of a sentence given by its language model.
Reading as Bayesian inference | model’s prior distribution over the identity of the sentence given the language model is updated to a posterior distribution taking into account both the language model and the visual input obtained thus far. |
Reading as Bayesian inference | Given the visual input and a language model, inferences about the identity of the sentence w can be made by standard Bayesian inference, where the prior is given by the language model and the likelihood is a function of the total visual input obtained from the first to the i-th timestep, I_i,
Simulation 1 | 5.1.2 Language model |
Simulation 1 | Our reader’s language model was an unsmoothed bigram model created using a vocabulary set con- |
Simulation 1 | Specifically, we constructed the model’s initial belief state (i.e., the distribution over sentences given by its language model) by directly translating the bigram model into a wFSA in the log semiring.
Simulation 2 | 6.1.3 Language model |
Experiment | In the event that a trigram or bigram would be found in the plaintext that was not counted in the language model, add-one smoothing was used.
Experiment | Our character-level language model was developed from the first 1.5 million characters of the Wall Street Journal section of the Penn Tree-bank corpus.
Introduction | If the text from which a language model is trained is of a different genre than the plaintext of a cipher, the unigraph letter frequencies may differ substantially from those of the language model, and so frequency counting will be misleading.
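Add-one smoothing here just means every possible continuation receives a pseudocount of one, so plaintext n-grams never seen in training still get nonzero probability. A minimal character-trigram sketch (the training text and function name are illustrative, not the paper's setup):

```python
from collections import Counter

def add_one_trigram_lm(text):
    """Character-trigram probabilities with add-one (Laplace) smoothing:
    trigrams absent from the training text still receive nonzero mass."""
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    vocab_size = len(set(text))

    def prob(c, history):  # p(c | history), where len(history) == 2
        return (trigrams[history + c] + 1) / (bigrams[history] + vocab_size)

    return prob

prob = add_one_trigram_lm("abcabd")
```

Adding `vocab_size` to the denominator keeps each conditional distribution normalized over the training alphabet.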
Introduction | Such inefficiency indicates that integer programming may simply be the wrong tool for the job, possibly because language model probabilities computed from empirical data are not smoothly distributed enough over the space in which a cutting-plane method would attempt to compute a linear relaxation of this problem. |
Introduction | This difference in difficulty, while real, is not inherent, but rather an artefact of the character-level n-gram language models that they (and we) use, in which preponderant evidence of differences in short character sequences is necessary for the model to clearly favour one letter-substitution mapping over another. |
Terminology | Every possible full solution to a cipher C will produce a plaintext string with some associated language model probability, and we will consider the best possible solution to be the one that gives the highest probability. |
Terminology | For the sake of concreteness, we will assume here that the language model is a character-level trigram model. |
The Algorithm | Backpointers are necessary to reference one of the two language model probabilities. |
The Algorithm | Cells that would produce inconsistencies are left at zero, and these as well as cells that the language model assigns zero to can only produce zero entries in later columns. |
The Algorithm | The n_p × n_p cells of every column i do not depend on each other, but only on the cells of the previous two columns i-1 and i-2, as well as the language model.
Conclusion | Removing the power of higher-order language models and longer maximum phrase length, which are inherent in pseudo-words, shows that pseudo-words still significantly improve translation performance over unary words.
Experiments and Results | The pipeline uses GIZA++ model 4 (Brown et al., 1993; Och and Ney, 2003) for pseudo-word alignment, uses Moses (Koehn et al., 2007) as the phrase-based decoder, and uses the SRI Language Modeling Toolkit to train the language model with modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998).
Experiments and Results | A 5-gram language model is trained on the English side of the parallel corpus.
Experiments and Results | The Xinhua portion of the English Gigaword3 corpus is used together with the English side of the large corpus to train a 4-gram language model.
Introduction | Further experiments removing the power of higher-order language models and longer maximum phrase length, which are inherent in pseudo-words, show that pseudo-words still significantly improve translation performance over unary words.
Our Approach | 3.1.1 Language Model
Our Approach | The language model (LM) p(·)
Our Approach | The parameters of the language model are learned from a monolingual Urdu corpus. |
Abstract | We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of well-known Kneser-Ney (1995) smoothing. |
Introduction | Smoothed n-gram language models are the de facto standard statistical models of language for a wide range of natural language applications, including speech recognition and machine translation.
Introduction | Constraints for language modeling
Introduction | As a result, statistical language models — an important component of many such applications — are often trained on very large corpora, then modified to fit within some pre-specified size bound. |
Preliminaries | N-gram language models are typically presented mathematically in terms of words w, the strings (histories) h that precede them, and the suffixes of the histories (backoffs) h′ that are used in the smoothing recursion.
Preliminaries | N-gram language models allow for a sparse representation, so that only a subset of the possible n-grams must be explicitly stored. |
Decoding Experiments | We add an English syntax language model L to the cascade of transducers just described to better simulate an actual machine translation decoding task.
Decoding Experiments | The language model is cast as an identity WTT and thus fits naturally into the experimental framework. |
Decoding Experiments | In our experiments we try several different language models to demonstrate varying performance of the application algorithms. |
Abstract | We thus propose to combine the advantages of both, and present a novel constituency-to-dependency translation model, which uses constituency forests on the source side to direct the translation, and dependency trees on the target side (as a language model) to ensure grammaticality.
Decoding | where the first two terms are the translation and language model probabilities, e(o) is the target string (English sentence) for derivation o, the third and fourth items are the dependency language model probabilities on the target side computed with words and POS tags separately, De(o) is the target dependency tree of o, the fifth one is the parsing probability of the source-side tree TC(o) ∈ FC, ill(o) is the penalty for the number of ill-formed dependency structures in o, and the last two terms are derivation and translation length penalties, respectively.
Decoding | For each node, we use the cube pruning technique (Chiang, 2007; Huang and Chiang, 2007) to produce partial hypotheses and compute all the feature scores including the dependency language model score (Section 4.1). |
Decoding | 4.1 Dependency Language Model Computing |
Experiments | We also store the POS tag information for each word in dependency trees, and compute two different dependency language models for words and POS tags in dependency tree separately. |
Experiments | We use SRI Language Modeling Toolkit (Stolcke, 2002) to train a 4-gram language model with Kneser-Ney smoothing on the first 1/3 of the Xinhua portion of Giga-word corpus. |
Experiments | This suggests that using a dependency language model really improves the translation quality by less than 1 BLEU point.
Indicators of linguistic quality | 3.1 Word choice: language models |
Indicators of linguistic quality | Language models (LM) are a way of computing how familiar a text is to readers using the distribution of words from a large background corpus. |
Indicators of linguistic quality | We built unigram, bigram, and trigram language models with Good-Turing smoothing over the New York Times (NYT) section of the English Gigaword corpus (over 900 million words). |
Results and discussion | Coh-Metrix, which has been proposed as a comprehensive characterization of text, does not perform as well as the language model and the entity coherence classes, which contain considerably fewer features related to only one aspect of text. |
Results and discussion | It is apparent from the results that continuity, entity coherence, sentence fluency and language models are the most powerful classes of features that should be used in automation of evaluation and against which novel predictors of text quality should be compared. |
Results and discussion | For example, the language model features, which are the second best class for the system-level, do not fare as well at the input-level. |
Abstract | In this paper we show that even for the case of 1:1 substitution ciphers—which encipher plaintext symbols by exchanging them with a unique substitute—finding the optimal decipherment with respect to a bigram language model is NP-hard. |
Definitions | denotes the language model.
Definitions | Depending on the structure of the language model, Equation 2 can be further simplified.
Definitions | Similarly, we define language model matrices S for the unigram and the bigram case. |
Introduction | The general idea is to find those translation model parameters that maximize the probability of the translations of a given source text in a given language model of the target language. |
Introduction | This might be related to the fact that a statistical formulation of the decipherment problem has not been analyzed with respect to n-gram language models: This paper shows the close relationship of the decipherment problem to the quadratic assignment problem.
Introduction | In Section 4 we show that decipherment using a unigram language model corresponds to solving a linear sum assignment problem (LSAP). |
Related Work | gram language model.
Approach | In his method, a variety of languages are modeled by their spelling systems (i.e., character-based n-gram language models).
Approach | Then, agglomerative hierarchical clustering is applied to the language models to reconstruct a language family tree. |
Approach | The similarity used for clustering is based on a divergence-like distance between two language models that was originally proposed by Juang and Rabiner (1985). |
Methods | Similarly, let M_i be a language model trained using D_i.
Methods | 2, we use an n-gram language model based on a mixture of word and POS tokens instead of a simple word-based language model.
Methods | In this language model , content words in n-grams are replaced with their corresponding POS tags. |
Abstract | As the algorithm generates dependency trees for partial translations left-to-right in decoding, it allows for efficient integration of both n-gram and dependency language models . |
Introduction | In addition, it is straightforward to integrate n-gram language models into phrase-based decoders in which translation always grows left-to-right. |
Introduction | As a result, phrase-based decoders only need to maintain the boundary words on one end to calculate language model probabilities. |
Introduction | Unfortunately, as syntax-based decoders often generate target-language words in a bottom-up way using the CKY algorithm, integrating n-gram language models becomes more expensive because they have to maintain target boundary words at both ends of a partial translation (Chiang, 2007; Huang and Chiang, 2007). |
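The bookkeeping difference can be made concrete: under an n-gram LM, a left-to-right decoder's hypothesis state is just its last n−1 words, while a bottom-up decoder must keep n−1 boundary words at both ends. A sketch with our own function and variable names (not from any of the cited systems):

```python
def phrase_based_state(words, n):
    """Left-to-right decoding: only the rightmost n-1 words of a partial
    translation can affect future LM queries."""
    return tuple(words[-(n - 1):])

def cky_state(words, n):
    """Bottom-up (CKY) decoding: a partial translation may later be
    extended on either side, so n-1 boundary words are kept at both ends."""
    k = n - 1
    return tuple(words[:k]), tuple(words[-k:])

left_right = phrase_based_state(["we", "must", "also", "act"], 3)
both_ends = cky_state(["we", "must", "also", "act"], 3)
```

The larger CKY state space is what makes n-gram integration more expensive in syntax-based decoders: the number of distinct states per chart cell grows with the vocabulary raised to twice the boundary length.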
Abstract | We rederive all the steps of KN smoothing to operate on count distributions instead of integral counts, and apply it to two tasks where KN smoothing was not applicable before: one in language model adaptation, and the other in word alignment. |
Introduction | Such cases have been noted for language modeling (Goodman, 2001; Goodman, 2004), domain adaptation (Tam and Schultz, 2008), grapheme-to-phoneme conversion (Bisani and Ney, 2008), and phrase-based translation (Andres-Ferrer, 2010; Wuebker et al., 2012). |
Introduction | One is language model domain adaptation, and the other is word alignment using the IBM models (Brown et al., 1993). |
Language model adaptation | N-gram language models are widely used in applications like machine translation and speech recognition to select fluent output sentences.
Language model adaptation | Here, we propose to assign each sentence a probability to indicate how likely it is to belong to the domain of interest, and train a language model using expected KN smoothing. |
Language model adaptation | They first train two language models, p_in on a set of in-domain data, and p_out on a set of general-domain data.
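One simple way to turn the two LMs into a per-sentence domain probability is a Bayes-rule log-odds combination; this is a sketch of the idea, not necessarily the paper's exact weighting scheme, and the function name and prior are our assumptions.

```python
import math

def in_domain_prob(logp_in, logp_out, prior_in=0.5):
    """P(in-domain | sentence) from sentence log-probabilities under
    p_in and p_out, assuming the two LMs and a class prior fully
    describe the data (a Bayes-rule sketch)."""
    log_odds = (logp_in - logp_out) + math.log(prior_in / (1.0 - prior_in))
    return 1.0 / (1.0 + math.exp(-log_odds))

w_neutral = in_domain_prob(-42.0, -42.0)   # equal scores -> falls back to prior
w_in = in_domain_prob(-30.0, -40.0)        # much likelier under p_in
```

These per-sentence weights can then serve as the fractional counts that expected KN smoothing consumes.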
Related Work | This method subtracts D directly from the fractional counts, zeroing out counts that are smaller than D. The discount D must be set by minimizing an error metric on held-out data using a line search (Tam, p. c.) or Powell’s method (Bisani and Ney, 2008), requiring repeated estimation and evaluation of the language model . |
Smoothing on integral counts | Before presenting our method, we review KN smoothing on integer counts as applied to language models, although, as we will demonstrate in Section 7, KN smoothing is applicable to other tasks as well.
Abstract | In this paper we examine language modeling for text simplification. |
Abstract | Unlike some text-to-text translation tasks, text simplification is a monolingual translation task allowing for text in both the input and output domain to be used for training the language model.
Abstract | We explore the relationship between normal English and simplified English and compare language models trained on varying amounts of text from each. |
Introduction | An important component of many text-to-text translation systems is the language model which predicts the likelihood of a text sequence being produced in the output language. |
Introduction | In some problem domains, such as machine translation, the translation is between two distinct languages and the language model can only be trained on data in the output language. |
Introduction | In these monolingual problems, text could be used from both the input and output domain to train a language model.
Related Work | If we view the normal data as out-of-domain data, then the problem of combining simple and normal data is similar to the language model domain adaption problem (Suzuki and Gao, 2005), in particular cross-domain adaptation (Bellegarda, 2004) where a domain-specific model is improved by incorporating additional general data. |
Background | Early work was firmly situated in the task-based setting of improving generalisation in language models.
Background | This model has been popular for language modelling and bilingual word alignment, and an implementation with improved inference called mkcls (Och, 1999) has become a standard part of statistical machine translation systems.
Background | (1992)’s HMM by incorporating a character language model, allowing the modelling of limited morphology.
Introduction | Our work brings together several strands of research including Bayesian nonparametric HMMs (Goldwater and Griffiths, 2007), Pitman-Yor language models (Teh, 2006b; Goldwater et al., 2006b), tagging constraints over word types (Brown et al., 1992) and the incorporation of morphological features (Clark, 2003). |
The PYP-HMM | Prior work in unsupervised PoS induction has employed simple smoothing techniques, such as additive smoothing or Dirichlet priors (Goldwater and Griffiths, 2007; Johnson, 2007), however this body of work has overlooked recent advances in smoothing methods used for language modelling (Teh, 2006b; Goldwater et al., 2006b). |
The PYP-HMM | The PYP has been shown to generate distributions particularly well suited to modelling language (Teh, 2006a; Goldwater et al., 2006b), and has been shown to be a generalisation of Kneser-Ney smoothing, widely recognised as the best smoothing method for language modelling (Chen and Goodman, 1996).
The PYP-HMM | We consider two different settings for the base distribution C_j: 1) a simple uniform distribution over the vocabulary (denoted HMM for the experiments in section 4); and 2) a character-level language model (denoted HMM+LM).
Abstract | In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling.
Introduction | Language models have been extensively studied in natural language processing. |
Introduction | The role of a language model is to measure how probable a (target) word is, based on some given evidence extracted from the history-context.
Language Modeling with TD and TO | A language model estimates word probabilities given their history, i.e. |
Language Modeling with TD and TO | In order to define the TD and TO components for language modeling, we express the observation of an arbitrary history-word, w_{i-k}, at the k-th position behind the target-word, as the joint of two events: i) the word w_{i-k} occurs within the history-context: w_{i-k} ∈ h, and ii) it occurs at distance k from the target-word: Δ(w_{i-k}) = k (Δ = k for brevity); i.e.
Language Modeling with TD and TO | In fact, the TO model is closely related to the trigger language model (Rosenfeld 1996), as the prediction of the target-word (the triggered word) is based on the presence of a history-word (the trigger). |
Motivation of the Proposed Approach | The attributes of distance and co-occurrence are exploited and modeled differently in each language modeling approach. |
Related Work | Latent-semantic language model approaches (Bellegarda 1998, Coccaro 2005) weight word counts with TFIDF to highlight their semantic importance towards the prediction. |
Related Work | Other approaches such as the class-based language model (Brown 1992, Kneser & Ney 1993) |
Related Work | The structured language model (Chelba & Jelinek 2000) determines the “heads” in the history-context by using a parsing tree.
Experiments | Then, Tesseract uses a classifier, aided by a word-unigram language model , to recognize whole words. |
Experiments | 6.3 Language Model |
Learning | The number of states in the dynamic programming lattice grows exponentially with the order of the language model (Jelinek, 1998; Koehn, 2004). |
Learning | As a result, inference can become slow when the language model order n is large. |
Learning | On each iteration of EM, we perform two passes: a coarse pass using a low-order language model, and a fine pass using a high-order language model (Petrov et al., 2008; Zhang and Gildea, 2008). |
Model | P(E, T, R, X) = P(E) [Language model] · P(T|E) [Typesetting model] · P(R) [Inking model] · P(X|E, T, R) [Noise model]
Model | 3.1 Language Model P(E) |
Model | Our language model, P(E), is a Kneser-Ney smoothed character n-gram model (Kneser and Ney, 1995).
Related Work | Work that has directly addressed historical documents has done so using a pipelined approach, and without fully integrating a strong language model (Vamvakas et al., 2008; Kluzner et al., 2009; Kae et al., 2010; Kluzner et al., 2011). |
Related Work | They integrated typesetting models with language models , but did not model noise. |
Related Work | Our approach is also similar in that we use a strong language model (in conjunction with the constraint that the correspondence be regular) to learn the correct mapping. |
Abstract | N-gram language models are a major resource bottleneck in machine translation.
Abstract | In this paper, we present several language model implementations that are both highly compact and fast to query. |
Abstract | We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%. |
Introduction | For modern statistical machine translation systems, language models must be both fast and compact. |
Introduction | The largest language models (LMs) can contain as many as several hundred billion n-grams (Brants et al., 2007), so storage is a challenge. |
Introduction | At the same time, decoding a single sentence can trigger hundreds of thousands of queries to the language model, so speed is also critical.
Abstract | We tackle the problem with two approaches: methods that use local lexical information, such as the n-grams of a classical language model; and methods that evaluate global coherence, such as latent semantic analysis.
Introduction | To investigate the usefulness of local information, we evaluated n-gram language model scores, from both a conventional model with Good-Turing smoothing and a recently proposed maximum-entropy class-based n-gram model (Chen, 2009a; Chen, 2009b).
Introduction | Also in the language modeling vein, but with potentially global context, we evaluate the use of a recurrent neural network language model.
Introduction | In all the language modeling approaches, a model is used to compute a sentence probability with each of the potential completions. |
Related Work | The KU system uses just an N-gram language model to do this ranking.
Related Work | The UNT system uses a large variety of information sources, and a language model score receives the highest weight. |
Sentence Completion via Language Modeling | Perhaps the most straightforward approach to solving the sentence completion task is to form the complete sentence with each option in turn, and to evaluate its likelihood under a language model.
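Concretely, the straightforward approach scores each candidate sentence and keeps the argmax. A sketch with a stand-in scorer; the `___` blank convention, the toy scorer, and the names are our assumptions, not the paper's:

```python
def best_completion(sentence_logprob, template, options):
    """Fill the blank with each option, score the full sentence under
    the LM, and return the highest-scoring option."""
    return max(options,
               key=lambda opt: sentence_logprob(template.replace("___", opt).split()))

# stand-in scorer that rewards the word "cat"; a real system would
# compute the sentence log-probability under an n-gram or RNN LM
toy_scorer = lambda words: sum(0.0 if w == "cat" else -1.0 for w in words)
choice = best_completion(toy_scorer, "the ___ sat on the mat", ["dog", "cat", "tree"])
```

Any of the LMs surveyed in this section can be dropped in as `sentence_logprob`, which is what makes the task a clean model-comparison benchmark.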
Sentence Completion via Language Modeling | In this section, we describe the suite of state-of-the-art language modeling techniques for which we will present results.
Sentence Completion via Language Modeling | 3.1 Backoff N-gram Language Model |
Background: Hypergraphs | The second step is to integrate an n-gram language model with this hypergraph. |
Background: Hypergraphs | The labels for leaves will be words, and will be important in defining strings and language model scores for those strings. |
Background: Hypergraphs | The focus of this paper will be to solve problems involving the integration of a k’th order language model with a hypergraph. |
Introduction | Decoding with these models is challenging, largely because of the cost of integrating an n-gram language model into the search process.
Introduction | E.g., with a trigram language model they run in O(|E|w^6) time, where |E| is the number of edges in the hypergraph, and w is the number of distinct lexical items in the hypergraph.
Introduction | This step does not require language model integration, and hence is highly efficient. |
Abstract | Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation.
Abstract | This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. |
Abstract | We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.
Introduction | Language models , probability distributions over strings of words, are fundamental to many applications in natural language processing. |
Introduction | The main challenge in language modeling is to estimate string probabilities accurately given that even very large training corpora cannot overcome the inherent sparseness of word sequence data. |
Introduction | Plausible though this line of reasoning is, the language models most commonly used today do not incorporate class-based generalization. |
Related work | However, the importance of rare events for clustering in language modeling has not been investigated before. |
Related work | Our work is most similar to the lattice-based language models proposed by Dupont and Rosenfeld (1997). |
Abstract | Incremental syntactic language models score sentences in a similar left-to-right fashion, and are therefore a good mechanism for incorporating syntax into phrase-based translation. |
Abstract | We give a formal definition of one such linear-time syntactic language model, detail its relation to phrase-based decoding, and integrate the model with the Moses phrase-based translation system.
Introduction | Early work in statistical machine translation viewed translation as a noisy channel process comprised of a translation model, which functioned to posit adequate translations of source language words, and a target language model, which guided the fluency of generated target language strings (Brown et al.,
Introduction | Drawing on earlier successes in speech recognition, research in statistical machine translation has effectively used n-gram word sequence models as language models.
Introduction | Modern phrase-based translation using large scale n-gram language models generally performs well in terms of lexical choice, but still often produces ungrammatical output. |
Related Work | Instead, we incorporate syntax into the language model.
Related Work | Traditional approaches to language models in |
Related Work | Chelba and Jelinek (1998) proposed that syntactic structure could be used as an alternative technique in language modeling.
Abstract | This paper presents an attempt at building a large scale distributed composite language model that simultaneously accounts for local word lexical information, midrange sentence syntactic structure, and long-span document semantic content under a directed Markov random field paradigm. |
Abstract | The composite language model has been trained by performing a convergent N-best list approximate EM algorithm that has linear time complexity and a follow-up EM algorithm to improve word prediction power on corpora with up to a billion tokens and stored on a supercomputer.
Abstract | The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
Composite language model | The n-gram language model is essentially a word predictor that, given its entire document history, predicts the next word w_{k+1} based on the last n-1 words with probability p(w_{k+1} | w_{k-n+2}^{k}), where w_{k-n+2}^{k} = w_{k-n+2}, ..., w_k.
Composite language model | PLSA models together to build a composite generative language model under the directed MRF paradigm (Wang et al., 2005; Wang et al., 2006), the TAGGER and CONSTRUCTOR in SLM and the SEMANTIZER in PLSA remain unchanged; however, the WORD-PREDICTORs in n-gram, m-SLM and PLSA are combined to form a stronger WORD-PREDICTOR that generates the next word, w_{k+1}, depending not only on the m leftmost exposed headwords h_{-m}^{-1} in the word-parse k-prefix but also on its n-gram history w_{k-n+2}^{k} and its semantic content g_{k+1}.
Composite language model | The parameter for the WORD-PREDICTOR in the composite n-gram/m-SLM/PLSA language model becomes p(w_{k+1} | w_{k-n+2}^{k} h_{-m}^{-1} g_{k+1}).
Introduction | There is a dire need for developing novel approaches to language modeling.
Introduction | (2006) integrated n-gram, structured language model (SLM) (Chelba and Jelinek, 2000) and probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) under the directed MRF framework (Wang et al., 2005) and studied the stochastic properties for the composite language model.
Introduction | They derived a generalized inside-outside algorithm to train the composite language model from a general EM (Dempster et al., 1977) by following Jelinek's ingenious definition of the inside and outside probabilities for SLM (Jelinek, 2004), with time complexity that is sixth order in sentence length.
Abstract | In this paper, we present an approach to enriching high-order feature representations for graph-based dependency parsing models using a dependency language model and beam search.
Abstract | The dependency language model is built on a large amount of additional auto-parsed data that is processed by a baseline parser.
Abstract | Based on the dependency language model, we represent a set of features for the parsing model.
Dependency language model | Language models play a very important role for statistical machine translation (SMT). |
Dependency language model | The standard N-gram based language model predicts the next word based on the N − 1 immediately preceding words.
Dependency language model | However, the traditional N-gram language model cannot capture long-distance word relations.
Introduction | In this paper, we solve this issue by enriching the feature representations for a graph-based model using a dependency language model (DLM) (Shen et al., 2008). |
Introduction | We utilize the dependency language model to enhance the graph-based parsing model.
Parsing with dependency language model | In this section, we propose a parsing model which includes the dependency language model by extending the model of McDonald et al. |
Hello. My name is Inigo Montoya. | First, we show a concrete sense in which memorable quotes are indeed distinctive: with respect to lexical language models trained on the newswire portions of the Brown corpus [21], memorable quotes have significantly lower likelihood than their non-memorable counterparts. |
Hello. My name is Inigo Montoya. | In particular, we analyze a corpus of advertising slogans, and we show that these slogans have significantly greater likelihood at both the word level and the part-of-speech level with respect to a language model trained on memorable movie quotes, compared to a corresponding language model trained on non-memorable movie quotes. |
Never send a human to do a machine’s job. | In order to assess different levels of lexical and syntactic distinctiveness, we employ a total of six Laplace-smoothed language models: 1-gram, 2-gram, and 3-gram word LMs and 1-gram, 2-gram, and 3-gram part-of-speech LMs.
Never send a human to do a machine’s job. | As indicated in Table 3, for each of our lexical “common language” models, in about 60% of the quote pairs, the memorable quote is more distinctive.
Never send a human to do a machine’s job. | The language models’ vocabulary was that of the entire training corpus. |
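The Laplace-smoothed models mentioned above can be sketched in a few lines. This is a toy illustration with hypothetical names and data, not the authors' implementation: a unigram LM that adds one to every count so that unseen words still get nonzero probability.

```python
import math
from collections import Counter

def laplace_unigram_logprob(tokens, train_counts, vocab_size):
    # Add-one (Laplace) smoothed unigram log-likelihood of a token sequence:
    # P(t) = (count(t) + 1) / (total + |V|), so unseen words score > 0.
    total = sum(train_counts.values())
    return sum(math.log((train_counts.get(t, 0) + 1) / (total + vocab_size))
               for t in tokens)

train = Counter("the cat sat on the mat".split())
vocab_size = len(set(train) | {"dog"})   # 6 types, incl. the unseen "dog"
score = laplace_unigram_logprob("the dog".split(), train, vocab_size)
```

Lower likelihood under such a "common language" model is what the study above uses as a proxy for distinctiveness.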
Abstract | We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features. |
Experiments | Nonetheless, it represents phonetic variability more realistically than the Bernstein-Ratner-Brent corpus, while still maintaining the lexical characteristics of infant-directed speech (as compared to the Buckeye corpus, with its much larger vocabulary and more complex language model).
Inference | The language modeling term relating to the intended string again factors into multiple components. |
Inference | Because neither the transducer nor the language model are perfect models of the true distribution, they can have incompatible dynamic ranges. |
Inference | The transducer scores can be cached since they depend only on surface forms, but the language model scores cannot.
Introduction | Previous models with similar goals have learned from an artificial corpus with a small vocabulary (Driesen et al., 2009; Rasanen, 2011) or have modeled variability only in vowels (Feldman et al., 2009); to our knowledge, this paper is the first to use a naturalistic infant-directed corpus while modeling variability in all segments, and to incorporate word-level context (a bigram language model).
Introduction | Our model is conceptually similar to those used in speech recognition and other applications: we assume the intended tokens are generated from a bigram language model and then distorted by a noisy channel, in particular a log-linear model of phonetic variability. |
Introduction | But unlike speech recognition, we have no (intended-form, surface-form) training pairs to train the phonetic model, nor even a dictionary of intended-form strings to train the language model.
Lexical-phonetic model | Our lexical-phonetic model is defined using the standard noisy channel framework: first a sequence of intended word tokens is generated using a language model, and then each token is transformed by a probabilistic finite-state transducer to produce the observed surface sequence.
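The noisy-channel framework described here can be sketched minimally. This is a toy example with made-up words and probabilities, not the paper's transducer-based model: decoding picks the intended word that maximizes the product of the language-model prior and the channel probability of the observed surface form.

```python
def decode(surface, prior, channel):
    # Noisy-channel decoding: argmax over intended words of
    # P(intended) * P(surface | intended).
    return max(prior, key=lambda w: prior[w] * channel.get((surface, w), 0.0))

prior = {"water": 0.6, "wader": 0.4}     # language model over intended words
channel = {("wata", "water"): 0.5,       # P(surface | intended)
           ("wata", "wader"): 0.2}
best = decode("wata", prior, channel)    # 0.6*0.5 = 0.30 beats 0.4*0.2 = 0.08
```

In the actual model, the channel is a finite-state transducer over articulatory features rather than a lookup table, but the decision rule has the same shape.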
Related work | In contrast, our model uses a symbolic representation for sounds, but models variability in all segment types and incorporates a bigram word-level language model . |
A Probabilistic Formulation for HVR | where P(W) can be modelled by the word-based n-gram language model (Chen and Goodman, 1996) commonly used in automatic speech recognition.
A Probabilistic Formulation for HVR | Language model score: P(W)
A Probabilistic Formulation for HVR | Note that the acoustic model and language model scores are already used in the conventional ASR. |
Abstract | In addition to the acoustic and language models used in automatic speech recognition systems, HVR uses the haptic and partial lexical models as additional knowledge sources to reduce the recognition search space and suppress confusions. |
Experimental Results | These sentences contain a variety of given names, surnames and city names so that confusions cannot be easily resolved using a language model . |
Experimental Results | The ASR system used in all the experiments reported in this paper consists of a set of HMM-based triphone acoustic models and an n-gram language model . |
Experimental Results | A bigram language model with a vocabulary size of 200 words was used for testing. |
Haptic Voice Recognition (HVR) | In conventional ASR, acoustically similar word sequences are typically resolved implicitly using a language model where contexts of neighboring words are used for disambiguation. |
Integration of Knowledge Sources | where the four transducers denote the WFST representations of the acoustic model, language model, PLI model and haptic model, respectively.
Integration of Knowledge Sources | (2002) has shown that Hidden Markov Models (HMMs) and n-gram language models can be viewed as WFSTs. |
Introduction | In addition to the acoustic model and language model used in ASR, haptic model and partial lexical model are also introduced to facilitate the integration of more sophisticated haptic events, such as the keystrokes, into HVR. |
Abstract | We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context. |
Abstract | We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours. |
Introduction | N-gram language models are a central component of all speech recognition and machine translation systems, and a great deal of research centers around refining models (Chen and Goodman, 1998), efficient storage (Pauls and Klein, 2011; Heafield, 2011), and integration into decoders (Koehn, 2004; Chiang, 2005).
Introduction | At the same time, because n-gram language models only condition on a local window of linear word-level context, they are poor models of long-range syntactic dependencies. |
Introduction | Although several lines of work have proposed generative syntactic language models that improve on n-gram models for moderate amounts of data (Chelba, 1997; Xu et al., 2002; Charniak, 2001; Hall, 2004; Roark, |
Treelet Language Modeling | The common denominator of most n-gram language models is that they assign probabilities roughly according to empirical frequencies for observed n-grams, but fall back to distributions conditioned on smaller contexts for unobserved n-grams, as shown in Figure 1(a).
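The fall-back behavior described above can be sketched as a simple back-off scorer. This is a toy, stupid-backoff-style illustration with hypothetical names, not the smoothing used by any of the cited systems: use the relative frequency of the full n-gram when it was observed, otherwise shorten the context and apply a fixed penalty.

```python
def backoff_score(word, context, counts, alpha=0.4):
    # Use the relative frequency of the full n-gram when it was observed;
    # otherwise drop the earliest context word and multiply in a penalty.
    penalty = 1.0
    while context:
        if counts.get(context + (word,), 0) > 0:
            return penalty * counts[context + (word,)] / counts[context]
        context, penalty = context[1:], penalty * alpha
    unigram_total = sum(c for ng, c in counts.items() if len(ng) == 1)
    return penalty * counts.get((word,), 0) / unigram_total

counts = {("the",): 1, ("cat",): 1, ("sat",): 1,
          ("the", "cat"): 1, ("cat", "sat"): 1}
seen = backoff_score("cat", ("the",), counts)    # bigram observed
unseen = backoff_score("sat", ("the",), counts)  # backs off to the unigram
```

Proper smoothing schemes (Kneser-Ney, Katz) normalize the back-off weights so the distribution sums to one; this sketch only shows the control flow.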
Treelet Language Modeling | to use back-off-based smoothing for syntactic language modeling — such techniques have been applied to models that condition on headword contexts (Charniak, 2001; Roark, 2004; Zhang, 2009). |
Abstract | Most current data selection methods solely use language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus.
Abstract | By contrast, we argue that the relevance between a sentence pair and target domain can be better evaluated by the combination of language model and translation model. |
Introduction | Current data selection methods mostly use language models trained on small-scale in-domain data to measure domain relevance and select domain-relevant parallel sentence pairs to expand training corpora.
Introduction | To overcome the problem, we first propose a method that combines the translation model with the language model in data selection.
Introduction | The language model measures the domain-specific generation probability of sentences, and is used to select domain-relevant sentences on both the source and target sides.
Related Work | The existing data selection methods are mostly based on language models.
Related Work | (2010) ranked the sentence pairs in the general-domain corpus according to the perplexity scores of sentences, which are computed with respect to in-domain language models.
Related Work | (2011) improved the perplexity-based approach and proposed bilingual cross-entropy difference as a ranking function with in-domain and general-domain language models.
Evaluation | In §3.3, we then examined the effect of using a very large 5-gram language model trained on 7.5 billion English tokens to understand the nature of the improvements in §3.2.
Evaluation | The Urdu to English evaluation in §3.4 focuses on how noisy parallel data and completely monolingual (i.e., not even comparable) text can be used for a realistic low-resource language pair, and is evaluated with the larger language model only. |
Evaluation | The 13 baseline features (2 lexical, 2 phrasal, 5 HRM, and 1 language model, word penalty, phrase length feature and distortion penalty feature) were tuned using MERT (Och, 2003), which is also used to tune the 4 feature weights introduced by the secondary phrase table (2 lexical and 2 phrasal, other features being shared between the two tables).
Generation & Propagation | These candidates are scored using stem-level translation probabilities, morpheme-level lexical weighting probabilities, and a language model , and only the top 30 candidates are included. |
Introduction | We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models.
Abstract | In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes.
Abstract | The resulting clusterings are then used in training partially class-based language models.
Experiments | We trained a number of predictive class-based language models on different Arabic and English corpora using clusterings trained on the complete data of the same corpus. |
Experiments | We use each predictive class-based language model as well as a word-based model as separate feature functions in the log-linear combination in Eq. |
Experiments | The word-based language model used by the system in these experiments is a 5-gram model also trained on the en_target data set.
Introduction | A statistical language model assigns a probability P(w) to any given string of words w_1^m = w_1, ..., w_m.
Introduction | In the case of n-gram language models this is done by factoring the probability: |
Introduction | do not differ in the last n − 1 words, one problem n-gram language models suffer from is that the training data is too sparse to reliably estimate all conditional probabilities P(w_i | w_{i−n+1}^{i−1}).
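The factorization referred to above can be made concrete. This is a minimal sketch with a hypothetical probability table, not any cited system: the chain rule is truncated to an (n−1)-order Markov assumption, so the sentence probability is the product of short conditional probabilities.

```python
import math

def ngram_logprob(sentence, cond_logprob, n=2):
    # Chain rule under an (n-1)-order Markov assumption:
    # log P(w_1..w_m) = sum_i log P(w_i | w_{i-n+1}, ..., w_{i-1}).
    toks = ["<s>"] * (n - 1) + sentence.split()
    return sum(cond_logprob[(tuple(toks[i - n + 1:i]), toks[i])]
               for i in range(n - 1, len(toks)))

table = {(("<s>",), "the"): math.log(0.5),   # toy bigram table
         (("the",), "cat"): math.log(0.2)}
prob = math.exp(ngram_logprob("the cat", table))   # 0.5 * 0.2
```

The sparsity problem is visible even here: any (history, word) pair missing from the table has no estimate, which is exactly what smoothing and class-based generalization address.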
Abstract | We propose a succinct randomized language model which employs a perfect hash function to encode fingerprints of n-grams and their associated probabilities, backoff weights, or other parameters.
Abstract | We demonstrate the space-savings of the scheme via machine translation experiments within a distributed language modeling framework. |
Experimental Setup | We deploy the randomized LM in a distributed framework which allows it to scale more easily by distributing it across multiple language model servers. |
Introduction | Language models (LMs) are a core component in statistical machine translation, speech recognition, optical character recognition and many other areas. |
Introduction | With large monolingual corpora available in major languages, making use of all the available data is now a fundamental challenge in language modeling.
Introduction | have considered alternative parameterizations such as class-based models (Brown et al., 1992), model reduction techniques such as entropy-based pruning (Stolcke, 1998), novel representation schemes such as suffix arrays (Emami et al., 2007), Golomb Coding (Church et al., 2007) and distributed language models that scale more readily (Brants et al., 2007).
Scaling Language Models | In language modeling the universe under consideration is the set of all possible n-grams of length n for a given vocabulary.
Scaling Language Models | Recent work (Talbot and Osborne, 2007b) has used lossy encodings based on Bloom filters (Bloom, 1970) to represent logarithmically quantized corpus statistics for language modeling.
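The fingerprinting idea behind these lossy encodings can be sketched simply. This is a toy illustration, not the perfect-hash scheme of the paper: each n-gram string is replaced by a short hash-based fingerprint, so only the fingerprint and its quantized parameter are stored, at the cost of a small false-positive probability on lookups.

```python
import hashlib

def fingerprint(ngram, bits=16):
    # Replace the n-gram string with a short hash-based fingerprint;
    # two distinct n-grams collide with probability about 2**-bits.
    digest = hashlib.md5(" ".join(ngram).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)

table = {fingerprint(("the", "cat")): -1.9}   # fingerprint -> quantized log prob
logprob = table.get(fingerprint(("the", "cat")))
```

A true perfect hash function removes collisions among the stored keys entirely; unseen n-grams can still alias onto a stored fingerprint, which is the "lossy" part of the trade-off.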
Abstract | We take a multi-pass approach to machine translation decoding when using synchronous context-free grammars as the translation model and n-gram language models: the first pass uses a bigram language model, and the resulting parse forest is used in the second pass to guide search with a trigram language model.
Introduction | This complexity arises from the interaction of the tree-based translation model with an n-gram language model.
Introduction | First, we present a two-pass decoding algorithm, in which the first pass explores states resulting from an integrated bigram language model, and the second pass expands these states into trigram-based
Introduction | The general bigram-to-trigram technique is common in speech recognition (Murveit et al., 1993), where lattices from a bigram-based decoder are re-scored with a trigram language model . |
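The rescoring step in this bigram-to-trigram pipeline can be sketched on an N-best list instead of a lattice. This is a toy example with made-up scores, not the cited decoders: the cheap first-pass LM score is subtracted out and the stronger second-pass LM score is added in before re-ranking.

```python
def rescore(nbest, second_lm):
    # Second-pass rescoring: remove the first-pass LM score from each
    # hypothesis and add the score from a larger (e.g. trigram) LM.
    return max(nbest, key=lambda h: h[1] - h[2] + second_lm[h[0]])[0]

second_lm = {"a b c": -1.0, "a c b": -5.0}          # trigram LM log scores
nbest = [("a c b", -2.0, -0.5),   # (hypothesis, total score, first-pass LM score)
         ("a b c", -2.2, -0.7)]   # loses under the bigram LM, wins after rescoring
best = rescore(nbest, second_lm)
```

Lattice or forest rescoring applies the same substitution per edge rather than per hypothesis, which is what keeps the search space compact.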
Language Model Integrated Decoding for SCFG | We begin by introducing Synchronous Context Free Grammars and their decoding algorithms when an n-gram language model is integrated into the grammatical search space. |
Language Model Integrated Decoding for SCFG | Without an n-gram language model , decoding using SCFG is not much different from CFG parsing. |
Language Model Integrated Decoding for SCFG | However, when we want to integrate an n-gram language model into the search, our goal is to find the derivation that maximizes the total sum of production weights and n-gram log probabilities.
Multi-pass LM-Integrated Decoding | very good estimate of the outside cost using a trigram model since a bigram language model and a trigram language model must be strongly correlated. |
Multi-pass LM-Integrated Decoding | We propagate the outside cost of the parent to its children by combining with the inside cost of the other children and the interaction cost, i.e., the language model cost between the focused child and the other children. |
Multi-pass LM-Integrated Decoding | (2007) also take a two-pass decoding approach, with the first pass leaving the language model boundary words out of the dynamic programming state, such that only one hypothesis is retained for each span and grammar symbol. |
Abstract | Our novel lattice desegmentation algorithm effectively combines both segmented and desegmented views of the target language for a large subspace of possible translation outputs, which allows for inclusion of features related to the desegmentation process, as well as an unsegmented language model (LM).
Methods | This trivially allows for an unsegmented language model and never makes desegmentation errors. |
Methods | Doing so enables the inclusion of an unsegmented target language model, and with a small amount of bookkeeping, it also allows the inclusion of features related to the operations performed during desegmentation (see Section 3.4).
Methods | We now have a desegmented lattice, but it has not been annotated with an unsegmented (word-level) language model.
Related Work | Bojar (2007) incorporates such analyses into a factored model, to either include a language model over target morphological tags, or model the generation of morphological features. |
Related Work | They introduce an additional desegmentation technique that augments the table-based approach with an unsegmented language model.
Related Work | Oflazer and Durgar El-Kahlout (2007) desegment 1000-best lists for English-to-Turkish translation to enable scoring with an unsegmented language model.
Abstract | This paper applies MST parsing to MT, and describes how it can be integrated into a phrase-based decoder to compute dependency language model scores. |
Abstract | Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets. |
Dependency parsing for machine translation | While it seems that loopy graphs are undesirable when the goal is to obtain a syntactic analysis, that is not necessarily the case when one just needs a language modeling score. |
Introduction | Hierarchical approaches to machine translation have proven increasingly successful in recent years (Chiang, 2005; Marcu et al., 2006; Shen et al., 2008), and often outperform phrase-based systems (Och and Ney, 2004; Koehn et al., 2003) on target language fluency and adequacy. However, their benefits generally come with high computational costs, particularly when chart parsing, such as CKY, is integrated with language models of high orders (Wu, 1996).
Introduction | Indeed, researchers have shown that gigantic language models are key to state-of-the-art performance (Brants et al., 2007), and the ability of phrase-based decoders to handle large-size, high-order language models with no consequence on asymptotic running time during decoding presents a compelling advantage over CKY decoders, whose time complexity grows prohibitively large with higher-order language models.
Introduction | Most interestingly, the time complexity of non-projective dependency parsing remains quadratic as the order of the language model increases. |
Machine translation experiments | We use the standard features implemented almost exactly as in Moses: four translation features (phrase-based translation probabilities and lexically-weighted probabilities), word penalty, phrase penalty, linear distortion, and language model score. |
Machine translation experiments | In order to train a competitive baseline given our computational resources, we built a large 5-gram language model using the Xinhua and AFP sections of the Gigaword corpus (LDC2007T40) in addition to the target side of the parallel data. |
Machine translation experiments | The language model was smoothed with the modified Kneser-Ney algorithm as implemented in (Stolcke, 2002), and we only kept 4-grams and 5-grams that occurred at least three times in the training data.
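The count cutoff described above is straightforward to sketch. This is a toy illustration with hypothetical counts, not the SRILM implementation: higher-order n-grams below the threshold are dropped, while lower orders are kept regardless.

```python
def prune(counts, min_count=3, orders=(4, 5)):
    # Keep only those 4-grams and 5-grams seen at least min_count times,
    # mirroring the count cutoff described above; lower orders are kept as-is.
    return {ng: c for ng, c in counts.items()
            if len(ng) not in orders or c >= min_count}

counts = {("a",): 10,
          ("a", "b", "c", "d"): 2,    # pruned: below the threshold
          ("b", "c", "d", "e"): 5}
kept = prune(counts)
```

Such cutoffs trade a small amount of coverage for a large reduction in model size, which matters when the training data runs to billions of tokens.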
Abstract | With this new framework, we employ a target dependency language model during decoding to exploit long-distance word relations, which are unavailable with a traditional n-gram language model.
Dependency Language Model | w_h-as-head represents w_h used as the head, and it is different from w_h in the dependency language model.
Dependency Language Model | In order to calculate the dependency language model score, or depLM score for short, on the fly for |
Discussion | (2003) described a two-step string-to-CFG-tree translation model which employed a syntax-based language model to select the best translation from a target parse forest built in the first step.
Implementation Details | Language model score.
Implementation Details | Dependency language model score.
Introduction | language model during decoding, in order to exploit long-distance word relations which are unavailable with a traditional n-gram language model on target strings. |
Introduction | Section 3 illustrates the use of dependency language models.
String-to-Dependency Translation | Formal definitions also allow us to easily extend the framework to incorporate a dependency language model in decoding. |
String-to-Dependency Translation | Supposing we use a traditional trigram language model in decoding, we need to specify the leftmost two words and the rightmost two words in a state. |
String-to-Dependency Translation | In the next section, we will explain how to extend categories and states to exploit a dependency language model during decoding. |
Abstract | We use translation models and language models to exploit lexical correlations and the character of solution posts, respectively.
Introduction | The cornerstone of our technique is the usage of a hitherto unexplored textual feature, lexical correlations between problems and solutions, that is exploited along with language model based characterization of solution posts. |
Introduction | We model the lexical correlation and solution post character using regularized translation models and unigram language models respectively. |
Our Approach | Consider a unigram language model 83 that models the lexical characteristics of solution posts, and a translation model 73 that models the lexical correlation between problems and solutions. |
Our Approach | In short, each solution word is assumed to be generated from the language model or the translation model (conditioned on the problem words) with a probability of A and l — A respectively, thus accounting for the correlation assumption. |
Our Approach | Of the solution words above, generic words such as try and should could probably be explained by (i.e., sampled from) the solution language model , whereas disconnect and rejoin could be correlated well with surf and wifi and hence are more likely to be supported better by the translation model. |
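The λ-mixture described here can be sketched as a two-component word probability. This is a toy illustration with made-up probabilities and hypothetical names, not the regularized models of the paper: each solution word is scored as a convex combination of the solution language model and the translation model conditioned on the problem words.

```python
def mixture_prob(word, problem_words, lm, trans, lam=0.5):
    # P(word) = lam * P_lm(word) + (1 - lam) * mean over problem words p
    # of P_trans(word | p), mixing the solution LM with the translation model.
    p_lm = lm.get(word, 0.0)
    p_tm = sum(trans.get((word, p), 0.0) for p in problem_words) / len(problem_words)
    return lam * p_lm + (1 - lam) * p_tm

lm = {"try": 0.3, "disconnect": 0.01}          # solution language model
trans = {("disconnect", "wifi"): 0.4}          # P(solution word | problem word)
p = mixture_prob("disconnect", ["wifi", "surf"], lm, trans)
```

As in the passage above, a generic word like "try" draws its support from the LM component, while a problem-correlated word like "disconnect" is rescued by the translation component.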
Related Work | We will use translation and language models in our method for solution identification. |
A Syntax Free Sequence-oriented Sentence Compression Method | As an alternative to syntactic parsing, we propose two novel features, intra-sentence positional term weighting (IPTW) and the patched language model (PLM) for our syntax-free sentence compressor. |
A Syntax Free Sequence-oriented Sentence Compression Method | 3.2.2 Patched Language Model |
A Syntax Free Sequence-oriented Sentence Compression Method | Many studies on sentence compression employ the n-gram language model to evaluate the linguistic likelihood of a compressed sentence. |
Abstract | As an alternative to syntactic parsing, we propose a novel term weighting technique based on the positional information within the original sentence and a novel language model that combines statistics from the original sentence and a general corpus. |
Conclusions | As an alternative to the syntactic parser, we proposed two novel features, intra-sentence positional term weighting (IPTW) and the patched language model (PLM), and showed their effectiveness by conducting automatic and human evaluations.
Experimental Evaluation | We developed the n-gram language model from a 9-year set of Mainichi Newspaper articles.
Introduction | To maintain the subject-predicate relationship in the compressed sentence and retain fluency without using syntactic parsers, we propose two novel features: intra-sentence positional term weighting (IPTW) and the patched language model (PLM). |
Introduction | PLM is a form of summarization-oriented fluency statistics derived from the original sentence and the general language model . |
Results and Discussion | Replacing PLM with the bigram language model (w/o PLM) degrades the performance significantly. |
Results and Discussion | This result shows that the n-gram language model is ill-suited to sentence compression because the n-gram probability is computed by using a corpus that includes both short and long sentences.
Abstract | We propose language modeling methods for solving this problem, and study how to incorporate features such as authority and proximity to accurately estimate the impact language model.
Impact Summarization | To solve these challenges, in the next section, we propose to model impact with unigram language models and score sentences using
Impact Summarization | We further propose methods for estimating the impact language model based on several features including the authority of citations, and the citation proximity. |
Introduction | We propose language models to exploit both the citation context and original content of a paper to generate an impact-based summary. |
Introduction | We study how to incorporate features such as authority and proximity into the estimation of language models.
Introduction | We propose and evaluate several different strategies for estimating the impact language model, which is key to impact summarization.
Language Models for Impact Summarization | 3.1 Impact language models |
Language Models for Impact Summarization | We thus propose to represent such a virtual impact query with a unigram language model . |
Language Models for Impact Summarization | Such a model is expected to assign high probabilities to those words that can describe the impact of paper d, just as we expect a query language model in ad hoc retrieval to assign high probabilities to words that tend to occur in relevant documents (Ponte and Croft, 1998). |
Abstract | We study the feasibility of exploiting cross-lingual context to obtain high-quality translation suggestions that improve over statistical language modelling and word-sense disambiguation baselines. |
Baselines | A second baseline was constructed by weighing the probabilities from the translation table directly with the L2 language model described earlier. |
Baselines | target language modelling) which is also cus-
Introduction | The main research question in this research is how to disambiguate an L1 word or phrase to its L2 translation based on an L2 context, and whether such cross-lingual contextual approaches provide added value compared to baseline models that are not context informed or compared to standard language models.
System | 3.1 Language Model |
System | We also implement a statistical language model as an optional component of our classifier-based system and also as a baseline to compare our system to. |
System | The language model is a trigram-based back-off language model with Kneser-Ney smoothing, computed using SRILM (Stolcke, 2002) and trained on the same training data as the translation model. |
Abstract | R2NN is a combination of recursive neural network and recurrent neural network, and in turn integrates their respective capabilities: (1) new information can be used to generate the next hidden state, like recurrent neural networks, so that language model and translation model can be integrated naturally; (2) a tree structure can be built, like recursive neural networks, so as to generate the translation candidates in a bottom-up manner.
Experiments and Results | The language model is a 5-gram language model trained with the target sentences in the training data. |
Introduction | Recurrent neural networks are leveraged to learn language models, and they keep the history information circularly inside the network for an arbitrarily long time (Mikolov et al., 2010).
Introduction | DNN is also introduced to Statistical Machine Translation (SMT) to learn several components or features of the conventional framework, including word alignment, language modelling, translation modelling and distortion modelling.
Introduction | In recursive neural networks, all the representations of nodes are generated based on their child nodes, and it is difficult to integrate additional global information, such as a language model and a distortion model.
Our Model | A recurrent neural network is usually used for sequence processing, such as language modelling (Mikolov et al., 2010).
Our Model | Commonly used sequence processing methods, such as the Hidden Markov Model (HMM) and the n-gram language model, use only a limited history for prediction.
Our Model | In an HMM, the previous state is used as the history, and for an n-gram language model (for example, n = 3), the history is the previous two words.
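The limited-history point can be made concrete with a toy MLE trigram (a sketch for illustration, not the model described in the paper):

```python
from collections import Counter

def trigram_mle(tokens):
    """MLE trigram model: the history is only the previous two words."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))

    def prob(w, history):
        u, v = history[-2:]  # everything older than two words is ignored
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

    return prob
```

Extending the history beyond two words leaves the prediction unchanged, which is exactly the limitation the recurrent model is meant to overcome.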
Related Work | (2013) extend the recurrent neural network language model in order to use both source- and target-side information to score translation candidates.
Abstract | We propose a language model based on a precise, linguistically motivated grammar (a handcrafted Head-driven Phrase Structure Grammar) and a statistical model estimating the probability of a parse tree. |
Abstract | The language model is applied by means of an N-best rescoring step, which allows us to directly measure the performance gains relative to the baseline system without rescoring.
Introduction | Other linguistically inspired language models, like Chelba and Jelinek (2000) and Roark (2001), have been applied to continuous speech recognition.
Introduction | In the first place, we want our language model to reliably distinguish between grammatical and ungrammatical phrases. |
Introduction | However, their grammar-based language model did not make use of a probabilistic component, and it was applied to a rather simple recognition task (dictation texts for pupils read and recorded under good acoustic conditions, no out-of-vocabulary words). |
Language Model 2.1 The General Approach | The language model weight λ and the word insertion penalty ip lead to better performance in practice, but they have no theoretical justification.
Language Model 2.1 The General Approach | Our grammar-based language model is incorporated into the above expression as an additional probability Pgram(W), weighted by a parameter µ:
Language Model 2.1 The General Approach | A major problem of grammar-based approaches to language modeling is how to deal with out-of-grammar utterances. |
Abstract | Grounded language models represent the relationship between words and the nonlinguistic context in which they are said. |
Abstract | Results show that grounded language models improve perplexity and word error rate over text-based language models and, further, support video information retrieval better than human-generated speech transcriptions.
Introduction | The method is based on the use of grounded language models to represent
Introduction | Grounded language models are based on research from cognitive science on grounded models of meaning. |
Introduction | This paper extends previous work on grounded models of meaning by learning a grounded language model from naturalistic data collected from broadcast video of Major League Baseball games. |
Linguistic Mapping | We model this relationship, much like traditional language models , using conditional probability distributions. |
Linguistic Mapping | Unlike traditional language models, however, our grounded language models condition the probability of a word not only on the word(s) uttered before it, but also on the temporal pattern features that describe the nonlinguistic context in which it was uttered. |
Abstract | Then we model question topic and question focus in a language modeling framework for search. |
Abstract | Experimental results indicate that our approach of identifying question topic and question focus for search significantly outperforms the baseline methods such as Vector Space Model (VSM) and Language Model for Information Retrieval (LMIR). |
Introduction | vector space model, Okapi, language model, and translation-based model, within the setting of question search (Jeon et al., 2005b).
Introduction | On the basis of this, we then propose to model question topic and question focus in a language modeling framework for search. |
Our Approach to Question Search | model question topic and question focus in a language modeling framework for search. |
Our Approach to Question Search | We employ the framework of language modeling (for information retrieval) to develop our approach to question search. |
Our Approach to Question Search | In the language modeling approach to information retrieval, the relevance of a targeted question q̂ to a queried question q is given by the probability p(q|q̂) of generating the queried question q
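The generation probability described here is the standard query-likelihood score; a minimal unigram version with Jelinek-Mercer smoothing might look like the following (parameter names and data are illustrative, not taken from the paper):

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log p(q | d) under a Jelinek-Mercer-smoothed unigram document model."""
    d, c = Counter(doc), Counter(collection)
    score = 0.0
    for t in query:
        # interpolate document model with collection background model
        p = lam * d[t] / len(doc) + (1 - lam) * c[t] / len(collection)
        score += math.log(p)
    return score
```

Documents (here: candidate questions) are then ranked by this score.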
A Skeleton-based Approach to MT 2.1 Skeleton Identification | In this work both the skeleton translation model g_skel(d) and the full translation model g_full(d) resemble the usual forms used in phrase-based MT, i.e., the model score is computed by a linear combination of a group of phrase-based features and language models.
A Skeleton-based Approach to MT 2.1 Skeleton Identification | Given a translation model m, a language model lm and a vector of feature weights w, the model score of a derivation d is computed by |
A Skeleton-based Approach to MT 2.1 Skeleton Identification | lm(d) and w_lm are the score and weight of the language model, respectively.
Evaluation | A 5-gram language model was trained on the Xinhua portion of the English Gigaword corpus in addition to the target side of the bilingual data.
Introduction | We develop a skeletal language model to describe the possibility of a translation skeleton and handle some of the long-distance word dependencies.
Conclusion and perspectives | It would also be interesting to test the impact of another lexical language model, learned on non-SMS sentences.
Evaluation | The language model used in the evaluation is a 3-gram model.
Evaluation | (2008a), who showed on a French corpus comparable to ours that, while using a larger language model is always rewarded, the improvement quickly decreases with every higher level and is already quite small between 2-gram and 3-gram.
Overview of the system | In our system, all lexicons, language models and sets of rules are compiled into finite-state machines (FSMs) and combined with the input text by composition (◦).
Overview of the system | Third, a combination of the lattice of solutions with a language model, and the choice of the best sequence of lexical units.
Related work | A language model is then applied on the word lattice, and the most probable word sequence is finally chosen by applying a best-path algorithm on the lattice. |
The normalization models | All tokens Tj of S are concatenated together and composed with the lexical language model LM. |
The normalization models | 4.6 The language model |
The normalization models | Our language model is an n-gram of lexical forms, smoothed by linear interpolation (Chen and Goodman, 1998), estimated on the normalized part of our training corpus and compiled into a weighted FST LMw. |
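Linear interpolation smoothing of the kind cited (Chen and Goodman, 1998) can be sketched for a bigram over lexical forms; the weights below are placeholders, not estimated values as they would be in the real system:

```python
from collections import Counter

def interpolated_bigram(tokens, l2=0.7, l1=0.2, l0=0.1):
    """Bigram smoothed by linear interpolation with unigram and uniform terms."""
    bi, uni = Counter(zip(tokens, tokens[1:])), Counter(tokens)
    vocab = len(uni)

    def prob(u, w):
        p_bi = bi[(u, w)] / uni[u] if uni[u] else 0.0
        p_uni = uni[w] / len(tokens)
        # weights l2 + l1 + l0 sum to 1, so the mixture stays a distribution
        return l2 * p_bi + l1 * p_uni + l0 / vocab

    return prob
```

In practice the weights would be estimated on held-out data (e.g., by EM) before compiling the model into the weighted FST.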
Abstract | Together with the senses predicted for words in documents, we propose a novel approach to incorporate word senses into the language modeling approach to IR and also exploit the integration of synonym relations. |
Incorporating Senses into Language Modeling Approaches | The next problem is to incorporate the sense information into the language modeling approach. |
Incorporating Senses into Language Modeling Approaches | Given a query q and a document d in text collection C, we want to reestimate the language models by making use of the sense information assigned to them.
Incorporating Senses into Language Modeling Approaches | With this language model, the probability of a query term in a document is enlarged by the synonyms of its senses; the more of its synonym senses appear in a document, the higher the probability.
Introduction | We incorporate word senses into the language modeling (LM) approach to IR (Ponte and Croft, 1998), and utilize sense synonym relations to further improve the performance. |
The Language Modeling Approach to IR | 3.1 The language modeling approach |
The Language Modeling Approach to IR | In the language modeling approach to IR, language models are constructed for each query q and each document d in a text collection C. The documents in C are ranked by their distance to a given query q according to the language models.
The Language Modeling Approach to IR | The most commonly used language model in IR is the unigram model, in which terms are assumed to be independent of each other. |
Abstract | We aim to improve spoken term detection performance by incorporating contextual information beyond traditional N-gram language models . |
Introduction | ASR systems traditionally use N-gram language models to incorporate prior knowledge of word occurrence patterns into prediction of the next word in the token stream. |
Introduction | Yet, though many language models more sophisticated than N-grams have been proposed, N-grams are empirically hard to beat in terms of WER.
Introduction | The strength of this phenomenon suggests it may be more viable for improving term-detection than, say, topic-sensitive language models . |
Motivation | The re-scoring approach we present is closely related to adaptive or cache language models (Jelinek, 1997; Kuhn and De Mori, 1990; Kneser and Steinbiss, 1993).
Motivation | The primary difference between this and previous work on similar language models is the narrower focus here on the term detection task, in which we consider each search term in isolation, rather than all words in the vocabulary. |
Results | We train ASR acoustic and language models from the training corpus using the Kaldi speech recognition toolkit (Povey et al., 2011) following the default BABEL training and search recipe which is described in detail by Chen et al. |
Term and Document Frequency Statistics | A similar phenomenon is observed concerning adaptive language models (Church, 2000). |
Term and Document Frequency Statistics | In general, we can think of using word repetitions to re-score term detection as applying a limited form of adaptive or cache language model (Jelinek, 1997).
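A cache language model in the sense cited above can be sketched as a background unigram interpolated with a cache of recently seen words (a hypothetical minimal version; the interpolation weight is illustrative):

```python
def cache_lm(background, lam=0.8):
    """Unigram cache LM: mix a fixed background model with a word cache.

    `background` maps words to probabilities; unseen words get a tiny floor.
    Returns (prob, observe): score a word, and add a word to the cache.
    """
    cache = []

    def prob(w):
        p_bg = background.get(w, 1e-6)
        p_cache = cache.count(w) / len(cache) if cache else 0.0
        return lam * p_bg + (1 - lam) * p_cache

    def observe(w):
        cache.append(w)

    return prob, observe
```

Once a term has been observed, its probability rises, which is the burstiness effect exploited for re-scoring term detection.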
Term and Document Frequency Statistics | In applying the burstiness quantity to term detection, we recall that the task requires us to locate a particular instance of a term, not estimate a count, hence the utility of N-gram language models predicting words in sequence. |
Abstract | We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models . |
Conclusion | Our new multi-prototype neural language model outperforms previous neural models and competitive baselines on this new dataset. |
Experiments | Table 3 shows our results compared to previous methods, including C&W’s language model and the hierarchical log-bilinear (HLBL) model (Mnih and Hinton, 2008), which is a probabilistic, linear neural model. |
Global Context-Aware Neural Language Model | Note that Collobert and Weston (2008)’s language model corresponds to the network using only local context. |
Introduction | We introduce a new neural-network-based language model that distinguishes and uses both local and global context via a joint training objective. |
Introduction | We show that our multi-prototype model improves upon the single-prototype version and outperforms other neural language models and baselines on this dataset. |
Related Work | Neural language models (Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Schwenk and Gauvain, 2002; Emami et al., 2003) have been shown to be very powerful at language modeling, a task where models are asked to accurately predict the next word given previously seen words.
Related Work | Schwenk and Gauvain (2002) tried to incorporate larger context by combining partial parses of past word sequences and a neural language model . |
Related Work | They used up to 3 previous head words and showed increased performance on language modeling . |
Conclusion & Future Work | Future work includes extending this approach to use multiple translation models with multiple language models in ensemble decoding. |
Experiments & Results 4.1 Experimental Setup | For the mixture baselines, we used a standard one-pass phrase-based system (Koehn et al., 2003), Portage (Sadat et al., 2005), with the following 7 features: relative-frequency and lexical translation model (TM) probabilities in both directions; word-displacement distortion model; language model (LM) and word count. |
Experiments & Results 4.1 Experimental Setup | Fixing the language model allows us to compare various translation model combination techniques. |
Introduction | Common techniques for model adaptation adapt two main components of contemporary state-of-the-art SMT systems: the language model and the translation model. |
Introduction | However, language model adaptation is a more straightforward problem compared to |
Introduction | translation model adaptation, because various measures such as perplexity of adapted language models can be easily computed on data in the target domain. |
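That perplexity check on target-domain data can be sketched with a toy add-alpha unigram model (all names and data are illustrative; lower perplexity indicates the better-adapted model):

```python
import math
from collections import Counter

def unigram_perplexity(model_counts, text, alpha=1.0):
    """Perplexity of an add-alpha-smoothed unigram model on target-domain text."""
    total = sum(model_counts.values())
    vocab = len(model_counts) + 1  # +1 slot for unseen words
    ll = 0.0
    for w in text:
        p = (model_counts.get(w, 0) + alpha) / (total + alpha * vocab)
        ll += math.log2(p)
    return 2 ** (-ll / len(text))
```

Comparing perplexities of candidate adapted models on a target-domain sample is exactly the easy-to-compute measure the text refers to.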
Related Work 5.1 Domain Adaptation | They use language model perplexities from IN to select relevant sentences from OUT.
Discussion | This is particularly true when the sentence structure is defined in a language model that is psycholinguistically plausible (here, bounded-memory right-corner form). |
Discussion | This accords with an understated result of Boston et al.’s eye-tracking study (2008a): a richer language model predicts eye movements during reading better than an oversimplified one. |
Discussion | Frank (2009) similarly reports improvements in the reading-time predictiveness of unlexicalized surprisal when using a language model that is more plausible than PCFGs.
Introduction | Ideally, a psychologically-plausible language model would produce a surprisal that would correlate better with linguistic complexity. |
Introduction | Therefore, the specification of how to encode a syntactic language model is of utmost importance to the quality of the metric. |
Introduction | The purpose of this paper is to determine whether the language model defined by the HHMM parser can also predict reading times — it would be strange if a psychologically plausible model did not also produce viable complexity metrics.
Parsing Model | Both of these metrics fall out naturally from the time-series representation of the language model.
Parsing Model | With the understanding of what operations need to occur, a formal definition of the language model is in order. |
Abstract | Long-span features, such as syntax, can improve language models for tasks such as speech recognition and machine translation. |
Abstract | However, these language models can be difficult to use in practice because of the time required to generate features for rescoring a large hypothesis set. |
Abstract | When using these improved tools in a language model for speech recognition, we obtain significant speed improvements with both N-best and hill-climbing rescoring, and show that up-training leads to WER reduction.
Conclusion | The computational complexity of accurate syntactic processing can make structured language models impractical for applications such as ASR that require scoring hundreds of hypotheses per input. |
Incorporating Syntactic Structures | These are then passed to the language model along with the word sequence for scoring. |
Introduction | Language models (LM) are crucial components in tasks that require the generation of coherent natural language text, such as automatic speech recognition (ASR) and machine translation (MT). |
Related Work | The lattice parser therefore, is itself a language model . |
Syntactic Language Models | There have been several approaches to include syntactic information in both generative and discriminative language models . |
Syntactic Language Models | Structured language modeling incorporates syntactic parse trees to identify the head words in a hypothesis for modeling dependencies beyond n-grams. |
Syntactic Language Models | Our Language Model . |
Abstract | We consider the prediction of three human behavioral measures — lexical decision, word naming, and picture naming —through the lens of domain bias in language modeling . |
Abstract | This study aims to provoke increased consideration of the human language model by NLP practitioners: biases are not limited to differences between corpora (i.e. |
Discussion | Our analyses reveal that 6 commonly used corpora fail to reflect the human language model in various ways related to dialect, modality, and other properties of each corpus. |
Discussion | Our results point to a type of bias in commonly used language models that has been previously overlooked. |
Discussion | Just as language models have been used to predict reading grade-level of documents (Collins-Thompson and Callan, 2004), human language models could be |
Introduction | Computational linguists build statistical language models for aiding in natural language processing (NLP) tasks. |
Introduction | In the current study, we exploit errors of the latter variety—failure of a language model to predict human performance—to investigate bias across several frequently used corpora in computational linguistics. |
Introduction | : Human Language Model |
Experimental Results | We trained all of the Moses systems herein using the standard features: language model , reordering model, translation model, and word penalty; in addition to these, the factored experiments called for additional translation and generation features for the added factors as noted above. |
Experimental Results | For the language models, we used SRILM 5-gram language models (Stolcke, 2002) for all factors.
Experimental Results | koske+ +va+ +A mietinto+ +A kasi+ +te+ +lla+ +a+ +n language model disambiguation:
Models 2.1 Baseline Models | Morphology generation models can use a variety of bilingual and contextual information to capture dependencies between morphemes, often more long-distance than what is possible using n-gram language models over morphemes in the segmented model. |
Models 2.1 Baseline Models | is to take the abstract suffix tag sequence and then map it into fully inflected word forms, and rank those outputs using a morphemic language model.
Models 2.1 Baseline Models | After CRF-based recovery of the suffix tag sequence, we use a bigram language model trained on a fully segmented version of the training data to recover the original vowels.
Related Work | They use a segmented phrase table and language model along with the word-based versions in the decoder and in tuning a Finnish target.
Related Work | In their work a segmented language model can score a translation, but cannot insert morphology that does not show source-side reflexes. |
Calculation of Cross-Entropy | Various methods for computing cross-entropy have been proposed, and these can be roughly classified into two types, based on different methods of universal coding and language modeling.
Calculation of Cross-Entropy | For example, (Benedetto et al., 2002) and (Cilibrasi and Vitanyi, 2005) used the universal coding approach, whereas (Teahan and Harper, 2001) and (Sibun and Reynar, 1996) were based on language modeling using PPM and Kullback-Leibler divergence, respectively.
Calculation of Cross-Entropy | As a representative method for calculating the cross-entropy through statistical language modeling, we adopt prediction by partial matching (PPM), a language-based encoding method devised by Cleary and Witten (1984).
In the experiments reported here, n is set to 5 throughout. | gives the description length of the remaining characters under the language model for L.
Introduction | They used statistical language modeling and heuristics to detect foreign words and tested the case of English embedded in German texts. |
Problem Formulation | In our setting, we assume that a small amount (up to kilobytes) of monolingual plain text sample data is available for every language, e.g., the Universal Declaration of Human Rights, which serves to generate the language model used for language identification. |
Problem Formulation | calculates the description length of a text segment X_i through the use of a language model for L_i.
Problem Formulation | Here, the first term corresponds to the code length of the text chunk X_i given a language model for L_i, which in fact corresponds to the cross-entropy of X_i for L_i multiplied by |X_i|. The remaining terms give the code lengths of the parameters needed to describe the first term: the second term corresponds to the segment location; the third term, to the identified language; and the fourth term, to the language model of language L_i.
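The two-part code just described can be sketched with an add-one character model (a simplified, hypothetical version of the criterion; `model_bits` stands in for the cost of encoding the language model itself, and the location cost here is a crude placeholder):

```python
import math
from collections import Counter

def description_length(segment, sample, n_languages, model_bits=0.0):
    """Code length for labeling `segment` with the language of `sample`."""
    counts = Counter(sample)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen characters
    # cross-entropy of the segment under the sample's character model
    cross_entropy = -sum(
        math.log2((counts.get(c, 0) + 1) / (total + vocab)) for c in segment
    ) / len(segment)
    data_bits = len(segment) * cross_entropy       # first term: |X| * H(X; L)
    location_bits = math.log2(len(segment) + 1)    # where the segment ends
    language_bits = math.log2(n_languages)         # which language was chosen
    return data_bits + location_bits + language_bits + model_bits
```

The language whose model yields the smallest total description length is chosen for the segment.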
Abstract | In this paper we analyze reading times in terms of a single predictive measure which integrates a model of semantic composition with an incremental parser and a language model . |
Integrating Semantic Constraint into Surprisal | While surprisal is a theoretically well-motivated measure, formalizing the idea of linguistic processing being highly predictive in terms of probabilistic language models, the measurement of semantic constraint in terms of vector similarities lacks a clear motivation.
Integrating Semantic Constraint into Surprisal | This can be achieved by turning a vector model of semantic similarity into a probabilistic language model . |
Integrating Semantic Constraint into Surprisal | There are in fact a number of approaches to deriving language models from distributional models of semantics (e.g., Bellegarda 2000; Coccaro and Jurafsky 1998; Gildea and Hofmann 1999). |
Models of Processing Difficulty | The basic idea is that the processing costs relating to the expectations of the language processor can be expressed in terms of the probabilities assigned by some form of language model to the input. |
Models of Processing Difficulty | Surprisal could also be defined using a vanilla language model that does not take any structural or grammatical information into account (Frank 2009).
Translation Model Architecture | We train a language model on the source language side of each of the n component bitexts, and compute an n-dimensional vector for each sentence by computing its entropy with each language model . |
Translation Model Architecture | Our aim is not to discriminate between sentences that are more likely and unlikely in general, but to cluster on the basis of relative differences between the language model entropies. |
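The per-sentence entropy vector described here can be sketched with unigram component models (add-one smoothing; the actual system would use full n-gram language models, so treat this as a minimal illustration):

```python
import math
from collections import Counter

def entropy_vector(sentence, component_models):
    """Cross-entropy of `sentence` under the unigram model of each component corpus."""
    vec = []
    for counts in component_models:
        total = sum(counts.values())
        vocab = len(counts) + 1  # +1 slot for unseen words
        h = -sum(
            math.log2((counts.get(w, 0) + 1) / (total + vocab)) for w in sentence
        ) / len(sentence)
        vec.append(h)
    return vec
```

Sentences can then be clustered on the relative differences between these entropy coordinates rather than on their absolute likelihoods.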
Translation Model Architecture | While it is not the focus of this paper, we also evaluate language model adaptation. |
A Generic Phrase Training Procedure | Each normalized feature score derived from word alignment models or language models will be log-linearly combined to generate the final score. |
Discussions | We propose several information metrics derived from posterior distribution, language model and word alignments as feature functions. |
Experimental Results | Like other log-linear model based decoders, active features in our translation engine include translation models in two directions, lexicon weights in two directions, language model, lexicalized distortion models, sentence length penalty and other heuristics.
Experimental Results | The language model is a statistical trigram model estimated with Modified Kneser-Ney smoothing (Chen and Goodman, 1996) using only English sentences in the parallel training data.
Features | All these features are data-driven and defined based on models, such as statistical word alignment model or language model . |
Features | We apply a language model (LM) to describe the predictive uncertainty (PU) between words in two directions. |
Features | Given a history w_1^{n-1}, a language model specifies a conditional distribution of the future word being predicted to follow the history.
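Predictive uncertainty in one direction can be sketched as the entropy of a bigram next-word distribution (illustrative only; the feature in the paper is computed in both directions and from full alignment-aware models):

```python
import math
from collections import Counter

def predictive_uncertainty(tokens, history_word):
    """Entropy of the next-word distribution after `history_word` (MLE bigram).

    Higher entropy means the continuation is less predictable.
    """
    following = Counter(w for u, w in zip(tokens, tokens[1:]) if u == history_word)
    total = sum(following.values())
    return -sum((c / total) * math.log2(c / total) for c in following.values())
```

A word that is always followed by the same successor has zero uncertainty; a word with several possible successors has positive entropy.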
Clustering-based word representations | So it is a class-based bigram language model . |
Clustering-based word representations | Deschacht and Moens (2009) use a latent-variable language model to improve semantic role labeling. |
Distributed representations | Word embeddings are typically induced using neural language models , which use neural networks as the underlying predictive model (Bengio, 2008). |
Distributed representations | Historically, training and testing of neural language models has been slow, scaling as the size of the vocabulary for each model computation (Bengio et al., 2001; Bengio et al., 2003). |
Distributed representations | Collobert and Weston (2008) presented a neural language model that could be trained over billions of words, because the gradient of the loss was computed stochastically over a small sample of possible outputs, in a spirit similar to Bengio and Sénecal (2003). |
Introduction | Neural language models (Bengio et al., 2001; Schwenk & Gauvain, 2002; Mnih & Hinton, 2007; Collobert & Weston, 2008), on the other hand, induce dense real-valued low-dimensional |
Introduction | (See Bengio (2008) for a more complete list of references on neural language models .) |
Unlabled Data | These auxiliary tasks are sometimes specific to the supervised task, and sometimes general language modeling tasks like “predict the missing word”. |
Inflection prediction models | We stemmed the reference translations, predicted the inflection for each stem, and measured the accuracy of prediction, using a set of sentences that were not part of the training data (1K sentences were used for Arabic and 5K for Russian). Our model performs significantly better than both the random and trigram language model baselines, and achieves an accuracy of over 91%, which suggests that the model is effective when its input is clean in its stem choice and order.
Integration of inflection models with MT systems | Given such a list of candidate stem sequences, the base MT model together with the inflection model and a language model choose a translation Y* as follows: |
Integration of inflection models with MT systems | P_LM is the joint probability of the sequence of inflected words according to a trigram language model (LM).
Integration of inflection models with MT systems | In addition, stemming the target sentences reduces the sparsity in the translation tables and language model , and is likely to impact positively the performance of an MT system in terms of its ability to recover correct sequences of stems in the target. |
Introduction | (Goldwater and McClosky, 2005), while the application of a target language model has almost solely been responsible for addressing the second aspect. |
Machine translation systems and data | (2003), a trigram target language model , two order models, word count, phrase count, and average phrase size functions. |
Machine translation systems and data | The features include log-probabilities according to inverted and direct channel models estimated by relative frequency, lexical weighting channel models, a trigram target language model , distortion, word count and phrase count. |
Machine translation systems and data | For each language pair, we used a set of parallel sentences (train) for training the MT system sub-models (e.g., phrase tables, language model ), a set of parallel sentences (lambda) for training the combination weights with max-BLEU training, a set of parallel sentences (dev) for training a small number of combination parameters for our integration methods (see Section 5), and a set of parallel sentences (test) for final evaluation. |
Background | OpenCCG implements a symbolic-statistical chart realization algorithm (Kay, 1996; Carroll et al., 1999; White, 2006b) combining (1) a theoretically grounded approach to syntax and semantic composition with (2) factored language models (Bilmes and Kirchhoff, 2003) for making choices among the options left open by the grammar.
Background | makes use of n-gram language models over words represented as vectors of factors, including surface form, part of speech, supertag and semantic class. |
Background | 2.3 Factored Language Models |
Introduction | Assigned categories are instantiated in OpenCCG’s chart realizer where, together with a treebank-derived syntactic grammar (Hockenmaier and Steedman, 2007) and a factored language model (Bilmes and Kirchhoff, 2003), they constrain the English word-strings that are chosen to express the LF. |
The Approach | Table 1: Percentage of complete realizations using an oracle n-gram model versus the best performing factored language model . |
The Approach | As shown in Table 1, with the large grammar derived from the training sections, many fewer complete realizations are found (before timing out) using the factored language model than are possible, as indicated by the results of using the oracle model.
Experiments | As described in Section 3.2, the weight of each variable is a linear combination of the language model score, three classifier confidence scores, and three classifier disagreement scores. |
Experiments | We use the Web 1T 5-gram corpus (Brants and Franz, 2006) to compute the language model score for a sentence.
Experiments | Finally, the language model score, classifier confidence scores, and classifier disagreement scores are normalized to take values in [0, 1], based on the H00 2011 development data. |
Inference with First Order Variables | The language model score h(s', LM) of s', based on a large web corpus;
Inference with First Order Variables | Next, to compute the weights, we collect the language model score and confidence scores from the article (ART), preposition (PREP), and noun number (NOUN) classifiers, i.e., E = {ART, PREP, NOUN}.
Inference with Second Order Variables | When measuring the gain due to setting the second-order variable to 1 (change cat to cats), the associated noun number weight is likely to be small, since A cats will get a low language model score, a low article classifier confidence score, and a low noun number classifier confidence score.
Related Work | Features used in classification include surrounding words, part-of—speech tags, language model scores (Gamon, 2010), and parse tree structures (Tetreault et al., 2010). |
Abstract | The selection is made according to the appropriateness of the alteration to the query context (using a bigram language model ), or according to its expected impact on the retrieval effectiveness (using a regression model). |
Bigram Expansion Model for Alteration Selection | The query context is modeled by a bigram language model as in (Peng et al. |
Bigram Expansion Model for Alteration Selection | In this work, we used bigram language model to calculate the probability of each path. |
Bigram Expansion Model for Alteration Selection | P(e1, e2, ..., ei, ..., en) = P(e1) ∏_{k=2}^{n} P(ek | ek-1)   (2). P(ek | ek-1) is estimated with a back-off bigram language model (Goodman, 2001).
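Equation (2) can be sketched with a crude backed-off bigram estimate for scoring alteration paths (this uses a stupid-backoff-style fallback rather than the Katz-style back-off model cited; all names and data are illustrative):

```python
import math
from collections import Counter

def path_logprob(path, tokens, lam=0.4):
    """log P(e_1..e_n) = log P(e_1) + sum_k log P(e_k | e_{k-1})."""
    bi, uni = Counter(zip(tokens, tokens[1:])), Counter(tokens)
    n = sum(uni.values())

    def p(w, u=None):
        if u is not None and bi[(u, w)]:
            return bi[(u, w)] / uni[u]          # observed bigram
        return lam * (uni[w] + 1) / (n + len(uni) + 1)  # backed-off unigram

    return math.log(p(path[0])) + sum(
        math.log(p(w, u)) for u, w in zip(path, path[1:]))
```

The alteration path with the highest score under (2) is the one selected as fitting the query context best.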
Conclusion | In the first method proposed — the Bigram Expansion model, query context is modeled by a bigram language model . |
Introduction | The query context is modeled by a bigram language model . |
Related Work | 2007), a bigram language model is used to determine the alteration of the head word that best fits the query. |
Related Work | In this paper, one of the proposed methods will also use a bigram language model of the query to determine the appropriate alteration candidates. |
Bayesian MT Decipherment via Hash Sampling | Secondly, for Bayesian inference we need to sample from a distribution that involves computing probabilities for all the components ( language model , translation model, fertility, etc.) |
Bayesian MT Decipherment via Hash Sampling | Note that the (translation) model in our case consists of multiple exponential families components—a multinomial pertaining to the language model (which remains fixed5), and other components pertaining to translation probabilities P9(fi|ei), fertility ngert, etc. |
Bayesian MT Decipherment via Hash Sampling | where pold(·), pnew(·) are the true conditional likelihood probabilities according to our model (including the language model component) for the old and new sample, respectively.
Decipherment Model for Machine Translation | For P(e), we use a word n-gram language model (LM) trained on monolingual target text. |
Decipherment Model for Machine Translation | Generate a target (e.g., English) string e = e1...en, with probability P(e) according to an n-gram language model.
Experiments and Results | The latter is used to construct a target language model used for decipherment training. |
Experiments and Results | Overall, using a 3-gram language model (instead of 2-gram) for decipherment training improves the performance for all methods. |
Experiment | The SRILM Toolkit (Stolcke, 2002) is employed to train 4-gram language models on the Xinhua portion of the Gigaword corpus, while for the IWSLT2012 data set, only its training set is used.
Experiment | The similarity between the data from each domain and the test data is calculated using the perplexity measure with a 5-gram language model.
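Perplexity, as used here for domain similarity, is just the exponential of the negative average per-token log-probability; a minimal sketch (in practice the token log-probabilities would come from the 5-gram model):

```python
import math

def perplexity(token_log_probs):
    """exp of the negative average per-token log-probability.
    Lower perplexity means the language model fits the data better."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that gives every token probability 1/4 has perplexity 4.
uniform4 = [math.log(0.25)] * 10
```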
Hierarchical Phrase Table Combination | The Pitman-Yor process is also employed in n-gram language models that are hierarchically represented through the hierarchical Pitman-Yor process with switch priors to integrate different domains at all levels (Wood and Teh, 2009).
Phrase Pair Extraction with Unsupervised Phrasal ITGs | Pbase is a base measure defined as a combination of the IBM Models in two directions and the unigram language models on both sides.
Related Work | The translation model and language model are primary components in SMT. |
Related Work | Previous work proved successful in the use of large-scale data for language models from diverse domains (Brants et al., 2007; Schwenk and Koehn, 2008). |
Related Work | Alternatively, the language model is incrementally updated by using a succinct data structure with an interpolation technique (Levenberg and Osborne, 2009; Levenberg et al., 2011).
Approach | In order to estimate the error-rate, we build a trigram language model (LM) using ukWaC (ukWaC LM) (Ferraresi et al., 2008), a large corpus of English containing more than 2 billion tokens. |
Approach | Next, we extend our language model with trigrams extracted from a subset of the texts contained in the |
Approach | As the CLC contains texts produced by second language learners, we only extract frequently occurring trigrams from highly ranked scripts to avoid introducing erroneous ones to our language model.
Evaluation | Extending our language model with frequent trigrams extracted from the CLC improves Pearson’s and Spearman’s correlation by 0.006 and 0.015 respectively. |
Evaluation | This suggests that there is room for improvement in the language models we developed to estimate the error-rate. |
Semi-supervised Parsing with Large Data | These relations are captured by word clustering, lexical dependencies, and a dependency language model , respectively. |
Semi-supervised Parsing with Large Data | 4.3 Structural Relations: Dependency Language Model |
Semi-supervised Parsing with Large Data | The dependency language model was proposed by Shen et al.
Experiments | For the out-of-domain data, we build the phrase table and reordering table using the 2.08 million Chinese-to-English sentence pairs, and we use the SRILM toolkit (Stolcke, 2002) to train the 5-gram English language model with the target part of the parallel sentences and the Xinhua portion of the English Gigaword. |
Experiments | An in-domain 5-gram English language model is trained on the 1 million sentences of target-side monolingual data.
Experiments | (2008) regards the in-domain lexicon with corpus translation probability as another phrase table and further uses the in-domain language model besides the out-of-domain language model.
Probabilistic Bilingual Lexicon Acquisition | In order to assign probabilities to each entry, we apply the Corpus Translation Probability which is used in (Wu et al., 2008): given in-domain source-language monolingual data, we translate this data with the phrase-based model trained on the out-of-domain News data, the in-domain lexicon, and the in-domain target language monolingual data (for language model estimation).
Related Work | For the target-side monolingual data, they simply use it to train the language model; for the source-side monolingual data, they employ a baseline (word-based SMT or phrase-based SMT trained with a small-scale bitext) to first translate the source sentences, combine each source sentence with its target translation as a bilingual sentence pair, and then train a new phrase-based SMT system with these pseudo sentence pairs.
A Class-based Model of Agreement | However, in MT, we seek a measure of sentence quality q(e) that is comparable across different hypotheses on the beam (much like the n-gram language model score).
A Class-based Model of Agreement | We trained a simple add-1 smoothed bigram language model over gold class sequences in the same treebank training data: |
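An add-1 (Laplace) smoothed bigram model over class sequences can be sketched as follows; the class sequences below are invented stand-ins for the treebank annotations:

```python
from collections import Counter

# Hypothetical gold class sequences (stand-ins for the treebank data).
sequences = [["Det", "Noun", "Verb"], ["Det", "Adj", "Noun"], ["Noun", "Verb"]]

classes = sorted({c for seq in sequences for c in seq} | {"<s>"})
V = len(classes)
bigram_counts = Counter()
context_counts = Counter()
for seq in sequences:
    padded = ["<s>"] + seq
    for prev, cur in zip(padded, padded[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def p_add1(cur, prev):
    """Add-1 smoothed bigram probability: (c(prev,cur) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + V)
```

Adding one to every count keeps unseen class transitions at a small non-zero probability while the distribution over next classes still sums to one.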
Experiments | Our distributed 4-gram language model was trained on 600 million words of Arabic text, also collected from many sources including the Web (Brants et al., 2007).
Inference during Translation Decoding | With a trigram language model , the state might be the last two words of the translation prefix. |
Introduction | Intuition might suggest that the standard n-gram language model (LM) is sufficient to handle agreement phenomena.
Related Work | Monz (2011) recently investigated parameter estimation for POS-based language models, but his classes did not include inflectional features.
Related Work | One exception was the quadratic-time dependency language model presented by Galley and Manning (2009). |
Experimental Setup | The language model is trained using a 9 GB English corpus. |
Statistical Paraphrase Generation | Our SPG model contains three sub-models: a paraphrase model, a language model, and a usability model, which control the adequacy, fluency,
Statistical Paraphrase Generation | Language Model: We use a trigram language model in this work. |
Statistical Paraphrase Generation | The language model based score for the paraphrase t is computed as: |
Related Work | (2012) adopt the tweets with emoticons to smooth the language model and Hu et al. |
Related Work | With the revival of interest in deep learning (Bengio et al., 2013), incorporating the continuous representation of a word as features has been proving effective in a variety of NLP tasks, such as parsing (Socher et al., 2013a), language modeling (Bengio et al., 2003; Mnih and Hinton, 2009) and NER (Turian et al., 2010). |
Related Work | The training objective is that the original ngram is expected to obtain a higher language model score than the corrupted ngram by a margin of 1. |
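That training objective is a standard margin (hinge) ranking loss; a minimal sketch, with the n-gram scoring function itself assumed to exist elsewhere:

```python
def margin_ranking_loss(score_original, score_corrupted, margin=1.0):
    """Zero loss once the original n-gram outscores the corrupted
    n-gram by at least the margin; linear penalty otherwise."""
    return max(0.0, margin - score_original + score_corrupted)
```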
Eliciting Addressee’s Emotion | We use GIZA++ and SRILM for learning the translation model and 5-gram language model, respectively.
Eliciting Addressee’s Emotion | We use the emotion-tagged dialogue corpus to learn eight translation models and language models, each of which is specialized in generating the response that elicits one of the eight emotions (Plutchik, 1980).
Eliciting Addressee’s Emotion | In this case, the first two utterances are used to learn the translation model, while only the second utterance is used to learn the language model.
Experiments | Table 6: The number of utterance pairs used for training classifiers in emotion prediction and learning the translation models and language models in response generation. |
Experiments | We use the utterance pairs summarized in Table 6 to learn the translation models and language models for eliciting each emotional category. |
Related Work | The linear interpolation of translation and/or language models is a widely-used technique for adapting machine translation systems to new domains (Sennrich, 2012). |
Abstract | Experiments on parsing and a language modeling problem show that the algorithm is efficient and effective in practice. |
Experiments on Parsing | 8 Experiments on the Saul and Pereira (1997) Model for Language Modeling |
Experiments on Parsing | We now describe a second set of experiments, on the Saul and Pereira (1997) model for language modeling.
Experiments on Parsing | We performed the language modeling experiments for a number of reasons. |
Introduction | We describe experiments on learning of L-PCFGs, and also on learning of the latent-variable language model of Saul and Pereira (1997). |
Conclusions | The problem is essentially one of generating multiple candidate sentences with the unattached function words ambiguously positioned (say in a lattice) and then using a second language model to rerank these sentences to select the target sentence.
Experimental Setup and Results | Furthermore, in factored models, we can employ different language models for different factors. |
Experimental Setup and Results | We believe that the use of multiple language models (some much less sparse than the surface LM) in the factored baseline is the main reason for the improvement. |
Experimental Setup and Results | 3.2.3 Experiments with higher-order language models |
Introduction | The main reason given for these problems was that the same statistical translation, reordering and language modeling mechanisms were being employed to both determine the morphological structure of the words and, at the same time, get the global order of the words correct. |
Abstract | We leverage recently-developed techniques for learning representations of text using latent-variable language models, and extend these techniques to ones that provide the kinds of features that are useful for semantic role labeling.
Introduction | Using latent-variable language models, we learn representations of texts that provide novel kinds of features to our supervised learning algorithms.
Introduction | The next section provides background information on learning representations for NLP tasks using latent-variable language models . |
Introduction | 2 Open-Domain Representations Using Latent-Variable Language Models |
Experiments | In the experiments, the language model is a Chinese 5-gram language model trained with the Chinese part of the LDC parallel corpus and the Xinhua part of the Chinese Gigaword corpus with about 27 million words.
Experiments | In the tables, Lm denotes the n-gram language model feature, Tmh denotes the feature of collocation between target head words and the candidate measure word, Smh denotes the feature of collocation between source head words and the candidate measure word, HS denotes the feature of source head word selection, Punc denotes the feature of target punctuation position, Tlex denotes surrounding word features in translation, Slex denotes surrounding word features in the source sentence, and Pos denotes the Part-Of-Speech feature.
Introduction | Moreover, Chinese measure words often have a long-distance dependency to their head words, which makes the language model ineffective in selecting the correct measure words from the measure word candidate set.
Introduction | In this case, an n-gram language model with n<15 cannot capture the MW-HW collocation. |
Model Training and Application 3.1 Training | We used the SRI Language Modeling Toolkit (Stolcke, 2002) to train a five-gram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998). |
Our Method | For target features, n-gram language model score is defined as the sum of log n-gram probabilities within the target window after the measure |
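That windowed feature can be sketched as a sum of log bigram probabilities; the probability table and the floor for unseen bigrams are hypothetical:

```python
import math

# Hypothetical bigram probabilities around a candidate measure word.
bigram_p = {("one", "cup"): 0.3, ("cup", "tea"): 0.2, ("one", "sheet"): 0.05}

def window_score(tokens, start, width):
    """Sum of log bigram probabilities for the `width` positions
    following `start` (unseen bigrams get a small floor probability)."""
    score = 0.0
    for i in range(start, min(start + width, len(tokens) - 1)):
        score += math.log(bigram_p.get((tokens[i], tokens[i + 1]), 1e-4))
    return score
```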
Our Method | Target features: n-gram language model score, MW-HW collocation, surrounding words, punctuation position. Source features: MW-HW collocation, surrounding words, source head word, POS tags.
Experiments | Illustrated by the highlighted states in Figure 6, the LM-HMM model conflates interactions that commonly occur at the beginning and end of a dialogue, i.e., “acknowledge agent” and “resolve problem”, since their underlying language models are likely to produce similar probability distributions over words.
Experiments | By incorporating topic information, our proposed models (e.g., TM-HMMSS in Figure 5) are able to enforce the state transitions towards more frequent flow patterns, which further helps to overcome the weakness of the language model.
Latent Structure in Dialogues | The simplest formulation we consider is an HMM where each state contains a unigram language model (LM), proposed by Chotimongkol (2008) for task-oriented dialogue and originally |
Latent Structure in Dialogues | 3: For each word in utterance n, first choose a word source r according to τ, and then depending on r, generate a word w either from the session-wide topic distribution θ or the language model specified by the state sn.
Latent Structure in Dialogues | Note that a TM-HMMS model with state-specific topic models (instead of state-specific language models) would be subsumed by TM-HMM, since one topic could be used as the background topic in TM-HMMS.
Abstract | In this work, we construct a statistical model of grammaticality using various linguistic features (e.g., misspelling counts, parser outputs, n-gram language model scores). |
Discussion and Conclusions | While Post found that such a system can effectively distinguish grammatical news text sentences from sentences generated by a language model, measuring the grammaticality of real sentences from language learners seems to require a wider variety of features, including n-gram counts, language model scores, etc.
Experiments | To create further baselines for comparison, we selected the following features that represent ways one might approximate grammaticality if a comprehensive model was unavailable: whether the link parser can fully parse the sentence (complete_link), the Gigaword language model score (gigaword_avglogprob), and the number of misspelled tokens (nummisspelled).
System Description | 3.2.2 n-gram Count and Language Model Features |
System Description | The model computes the following features from a 5-gram language model trained on the same three sections of English Gigaword using the SRILM toolkit (Stolcke, 2002): |
System Description | Finally, the system computes the average log-probability and number of out-of-vocabulary words from a language model trained on a collection of essays written by nonnative English speakers (“nonnative LM”).
Abstract | Recent work has shown success in using neural network language models (NNLMs) as features in MT systems. |
Introduction | Initially, these models were primarily used to create n-gram neural network language models (NNLMs) for speech recognition and machine translation (Bengio et al., 2003; Schwenk, 2010). |
Introduction | Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window.
Model Variations | In particular, we can reverse the translation direction of the languages, as well as the direction of the language model . |
Model Variations | • 5-gram Kneser-Ney LM • Recurrent neural network language model (RNNLM) (Mikolov et al., 2010)
Neural Network Joint Model (NNJM) | Fortunately, neural network language models are able to elegantly scale up and take advantage of arbitrarily large context sizes.
Discriminative Synchronous Transduction | ilar to the methods for decoding with a SCFG intersected with an n-gram language model, which require language model contexts to be stored in each chart cell. |
Discussion and Further Work | To do so would require integrating a language model feature into the max-translation decoding algorithm. |
Evaluation | The feature set includes: a trigram language model (lm) trained |
Evaluation | To compare our model directly with these systems we would need to incorporate additional features and a language model, work which we have left for a later date.
Evaluation | The relative scores confirm that our model, with its minimalist feature set, achieves comparable performance to the standard feature set without the language model . |
Experiments | 3-gram (news-commentary) and 5-gram (Europarl) language models are trained on the data described in Table 1, using the SRILM toolkit (Stolcke, 2002) and binarized for efficient querying using kenlm (Heafield, 2011).
Experiments | For the 5-gram language models, we replaced every word in the LM training data that did not appear in the English part of the parallel training data with <unk> to build an open-vocabulary language model.
Experiments | Absolute improvements would be possible, e.g., by using larger language models or by adding news data to the ep training set when evaluating on crawl test sets (see, e.g., Dyer et al.
Introduction | The standard SMT training pipeline combines scores from large count-based translation models and language models with a few other features and tunes these using the well-understood line-search technique for error minimization of Och (2003). |
Introduction | The modeler’s goals might be to identify complex properties of translations, or to counter errors of pre-trained translation models and language models by explicitly down-weighting translations that exhibit certain undesired properties. |
Abstract | In this work we present two extensions to the well-known dynamic programming beam search in phrase-based statistical machine translation (SMT), aiming at increased efficiency of decoding by minimizing the number of language model computations and hypothesis expansions. |
Abstract | Our results show that language model based pre-sorting yields a small improvement in translation quality and a speedup by a factor of 2. |
Experimental Evaluation | The English language model is a 4-gram LM created with the SRILM toolkit (Stolcke, 2002) on all bilingual and parts of the provided monolingual data. |
Introduction | Research efforts to increase search efficiency for phrase-based MT (Koehn et al., 2003) have explored several directions, ranging from generalizing the stack decoding algorithm (Ortiz et al., 2006) to additional early pruning techniques (Delaney et al., 2006), (Moore and Quirk, 2007) and more efficient language model (LM) querying (Heafield, 2011). |
Introduction | with Language Model LookAhead
Search Algorithm Extensions | 2.2 Language Model LookAhead |
Experiments | Although we did not examine the accuracy of real tasks in this paper, there is an interesting report that the word error rate of language models follows a power law with respect to perplexity (Klakow and Peters, 2002). |
Introduction | Removing low-frequency words from a corpus (often called cutoff) is a common practice to save on the computational costs involved in learning language models and topic models.
Introduction | In the case of language models, we often have to remove low-frequency words because of a lack of computational resources, since the feature space of k-grams tends to be so large that we sometimes need cutoffs even in a distributed environment (Brants et al., 2007).
Perplexity on Reduced Corpora | Constant restoring is similar to the additive smoothing defined by p(w) ∝ p′ + λ, which is used to solve the zero-frequency problem of language models (Chen and Goodman, 1996).
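Additive (Lidstone) smoothing of this form can be sketched as follows; the counts, vocabulary, and λ value are illustrative, not taken from the paper:

```python
def additive_smooth(counts, vocab, lam=0.5):
    """p(w) proportional to count(w) + lambda, so unseen words in the
    vocabulary still receive non-zero probability."""
    denom = sum(counts.get(w, 0) for w in vocab) + lam * len(vocab)
    return {w: (counts.get(w, 0) + lam) / denom for w in vocab}

probs = additive_smooth({"a": 3, "b": 1}, ["a", "b", "c"], lam=1.0)
```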
Perplexity on Reduced Corpora | This means that we can determine the rough sparseness of k-grams and adjust some of the parameters such as the gram size k in learning statistical language models.
Perplexity on Reduced Corpora | LDA is a probabilistic language model that generates a corpus as a mixture of hidden topics, and it allows us to infer two parameters: the document-topic distribution θ that represents the mixture rate of topics in each document, and the topic-word distribution φ that represents the occurrence rate of words in each topic.
Experimental Setup | Beam size is fixed at 2000. Sentence compressions are evaluated by a 5-gram language model trained on Gigaword (Graff, 2003) by SRILM (Stolcke, 2002).
Sentence Compression | As the space of possible compressions is exponential in the number of leaves in the parse tree, instead of looking for the globally optimal solution, we use beam search to find a set of highly likely compressions and employ a language model trained on a large corpus for evaluation. |
Sentence Compression | Given the N-best compressions from the decoder, we evaluate the yield of the trimmed trees using a language model trained on the Gigaword (Graff, 2003) corpus and return the compression with the highest probability.
Sentence Compression | Thus, the decoder is quite flexible — its learned scoring function allows us to incorporate features salient for sentence compression while its language model guarantees the linguistic quality of the compressed string. |
Discussion | The first is the incorporation of a language model (or comparable long-distance structure-scoring model) to assign scores to predicted parses independent of the transformation model. |
Experimental setup | The best symmetrization algorithm, translation and language model weights for each language are selected using cross-validation on the development set. |
MT—based semantic parsing | In order to learn a semantic parser using MT we linearize the MRs, learn alignments between the MRL and the NL, extract translation rules, and learn a language model for the MRL. |
MT—based semantic parsing | Language modeling In addition to translation rules learned from a parallel corpus, MT systems also rely on an n-gram language model for the target language, estimated from a (typically larger) monolingual corpus. |
MT—based semantic parsing | In the case of SP, such a monolingual corpus is rarely available, and we instead use the MRs available in the training data to learn a language model of the MRL. |
Related Work | It is combined with a language model to improve grammaticality and the decoder translates sentences into sim- |
Simplification Framework | In addition, the language model we integrate in the SMT module helps ensure better fluency and grammaticality.
Simplification Framework | Finally, the translation and language models ensure that published, describing and boson are simplified to wrote, explaining and elementary particle respectively; and that the phrase “In 1964” is moved from the beginning of the sentence to its end.
Simplification Framework | Our simplification framework consists of a probabilistic model for splitting and dropping which we call DRS simplification model (DRS-SM); a phrase based translation model for substitution and reordering (PBMT); and a language model learned on Simple English Wikipedia (LM) for fluency and grammaticality. |
Conclusion | Also, we believe that improving English language modeling to match the genre of the translated sentences can have significant positive impact on translation quality. |
Previous Work | They used two language models built from the English GigaWord corpus and from a large web crawl. |
Previous Work | For language modeling, we used either EGen or the English side of the AR corpus plus the English side of NIST12 training data and English GigaWord v5.
Previous Work | — B2-B4 systems used identical training data, namely EG, with the GW, EGen, or both for B2, B3, and B4 respectively for language modeling.
Proposed Methods 3.1 Egyptian to EG’ Conversion | Using both language models (S2) led to a slight improvement.
Introduction | This crawling process also yielded 632K TAC pairs whose only difference was spacing, and an additional 558M “unpaired” tweets; as shown later in this paper, we used these extra corpora for computing language models and other auxiliary information. |
Introduction | Table 5: Conformity to the community and one’s own past, measured via scores assigned by various language models.
Introduction | We measure a tweet’s similarity to expectations by its score according to the relevant language model, (1/|T|) Σ_{w∈T} log(p(w)), where T refers to either all the unigrams (unigram model) or all and only bi-grams (bigram model). We trained a Twitter-community language model from our 558M unpaired tweets, and personal language models from each author’s tweet history.
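The score above is an average unigram log-probability; a minimal sketch with an invented community model and an assumed floor probability for out-of-vocabulary words:

```python
import math

def conformity(tokens, model_p, floor=1e-6):
    """(1/|T|) * sum over w in T of log p(w): a tweet's average
    log-probability under the relevant language model."""
    logs = [math.log(model_p.get(w, floor)) for w in tokens]
    return sum(logs) / len(logs)

community = {"lol": 0.1, "the": 0.2}  # hypothetical unigram model
```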
Collocational Lexicon Induction | It has been used as word similarity measure in language modeling (Dagan et al., 1999). |
Experiments & Results 4.1 Experimental Setup | For the end-to-end MT pipeline, we used Moses (Koehn et al., 2007) with these standard features: relative-frequency and lexical translation model (TM) probabilities in both directions; distortion model; language model (LM) and word count. |
Experiments & Results 4.1 Experimental Setup | For the language model, we used the KenLM toolkit (Heafield, 2011) to create a 5-gram language model on the target side of the Europarl corpus (V7) with approximately 54M tokens with Kneser-Ney smoothing. |
Experiments & Results 4.1 Experimental Setup | However, in an MT pipeline, the language model is supposed to rerank the hypotheses and move more appropriate translations (in terms of fluency) to the top of the list. |
Introduction | Even noisy translation of oovs can aid the language model to better |
Related Work | In the setting of language modeling approaches to query expansion, the local analysis idea has been instantiated by estimating additional query language models (Lafferty and Zhai, 2003; Tao and Zhai, 2006) or relevance models (Lavrenko and Croft, 2001) from a set of feedback documents. |
Related Work | (2005) also try to uncover multiple aspects of a query, and to that end they provide an iterative “pseudo-query” generation technique, using cluster-based language models.
Related Work | Diaz and Metzler (2006) were the first to give a systematic account of query expansion using an external corpus in a language modeling setting, to improve the estimation of relevance models. |
Retrieval Framework | We work in the setting of generative language models.
Retrieval Framework | Within the language modeling approach, one builds a language model from each document, and ranks documents based on the probability of the document model generating the query. |
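Query-likelihood ranking of this kind can be sketched as follows; the toy documents and the additive smoothing with its mu parameter are illustrative choices, not the estimator any particular paper used:

```python
import math
from collections import Counter

def doc_lm(tokens, vocab, mu=0.1):
    """Per-document unigram model with simple additive smoothing."""
    counts = Counter(tokens)
    denom = len(tokens) + mu * len(vocab)
    return {w: (counts[w] + mu) / denom for w in vocab}

def query_loglik(query, model):
    """Score: log probability of the document model generating the query."""
    return sum(math.log(model[w]) for w in query)

docs = {"d1": "cat sat on the mat".split(), "d2": "dogs chase cats".split()}
vocab = {w for toks in docs.values() for w in toks}
query = ["cat", "mat"]
ranked = sorted(docs, key=lambda d: query_loglik(query, doc_lm(docs[d], vocab)),
                reverse=True)
```

Documents whose models assign the query higher probability rank first; here the document actually containing the query terms wins.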
Retrieval Framework | The particulars of the language modeling approach have been discussed extensively in the literature (see, e.g., Balog et al. |
Lexical normalisation | The confusion candidates are then filtered for each token occurrence of a given OOV word, based on their local context fit with a language model.
Lexical normalisation | In addition to generating the confusion set, we rank the candidates based on a trigram language model trained over 1.5GB of clean Twitter data, i.e. |
Lexical normalisation | To train the language model, we used SRILM (Stolcke, 2002) with the -unk option.
Related work | Suppose the ill-formed text is T and its corresponding standard form is S; the approach aims to find arg max P(S|T) by computing arg max P(T|S)P(S), in which P(S) is usually a language model and P(T|S) is an error model.
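The noisy-channel argmax can be computed in log space; the ill-formed token, candidate standard forms, and their scores below are all invented for illustration:

```python
# Hypothetical log-scores for candidate standard forms S of "2moro" (T).
lm_logp = {"tomorrow": -4.0, "tomato": -7.5}    # log P(S), language model
err_logp = {"tomorrow": -1.0, "tomato": -3.0}   # log P(T|S), error model

def best_correction(candidates):
    """argmax over S of P(T|S) * P(S), done as a sum of logs."""
    return max(candidates, key=lambda s: err_logp[s] + lm_logp[s])
```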
Experiments | The final feature is the language model score for the target sentence, mounting up to the following model used at decoding time, with the feature weights A trained by Minimum Error Rate Training (MERT) (Och, 2003) on a development corpus. |
Experiments | with a 3-gram language model smoothed with modified Kneser-Ney discounting (Chen and Goodman, 1998), trained on around 1M sentences per target language.
Experiments | Table 2: Additional experiments for English to Chinese translation examining (a) the impact of the linguistic annotations in the LTS system (lts), when compared with an instance not employing such annotations (lts-nolabels) and (b) decoding with a 4th-order language model (-lm4).
Joint Translation Model | While in a decoder this is somewhat mitigated by the use of a language model, we believe that the weakness of straightforward applications of SCFGs to model reordering structure at the sentence level misses a chance to learn this crucial part of the translation process during grammar induction.
Joint Translation Model | As (Mylonakis and Sima’an, 2010) note, ‘plain’ SCFGs seem to perform worse than the grammars described next, mainly due to wrong long-range reordering decisions for which the language model can hardly help. |
Introduction | The variable 6 ranges over all possible English strings, and P(e) is a language model built from large amounts of English text that is unrelated to the foreign strings. |
Introduction | A language model P(e) is typically used in SMT decoding (Koehn, 2009), but here P(e) actually plays a central role in training translation model parameters.
Machine Translation as a Decipherment Task | Whole-segment Language Models: When using word n-gram models of English for decipherment, we find that some of the foreign sentences are decoded into sequences (such as “THANK YOU TALKING ABOUT ?”) that are not good English.
Machine Translation as a Decipherment Task | For Bayesian MT decipherment, we set a high prior value on the language model (10^4) and use sparse priors for the IBM 3 model parameters t, n, d, p (0.01, 0.01, 0.01, 0.01).
Word Substitution Decipherment | We model P(e) using a statistical word n-gram English language model (LM). |
Word Substitution Decipherment | For word substitution decipherment, we want to keep the language model probabilities fixed during training, and hence we set the prior on that model to be high (α = 10^4).
Abstract | Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries.
Conclusion | Unlike previous approaches, our method combines information from letter n-gram language models and word dictionaries and provides a robust decipherment model. |
Decipherment | We build a statistical English language model (LM) for the plaintext source model P(p), which assigns a probability to any English letter sequence.
Decipherment | For the plaintext source model, we use probabilities from an English language model and for the channel model, we specify a uniform distribution (i.e., a plaintext letter can be substituted with any given cipher type with equal probability). |
Decipherment | Combining letter n-gram language models with word dictionaries: Many existing probabilistic approaches use statistical letter n-gram language models of English to assign P(p) probabilities to plaintext hypotheses during decipherment.
Experimental Results | We also used a 5-gram language model with modified Kneser-Ney smoothing (Chen and Goodman, 1998), trained on a data set consisting of 130M words in English Gigaword (LDC2007T07) and the English side of the parallel corpora.
Experimental Results | We use GIZA++ (Och and Ney, 2000), a suffix-array (Lopez, 2007), SRILM (Stolcke, 2002), and risk-based deterministic annealing (Smith and Eisner, 2006) to obtain word alignments, translation models, language models, and the optimal weights for combining these models, respectively.
Variational Approximate Decoding | Of course, this last point also means that our computation becomes intractable as n → ∞. However, if p(y | x) is defined by a hypergraph HG(x) whose structure explicitly incorporates an m-gram language model, both training and decoding will be efficient when m ≥ n. We will give algorithms for this case that are linear in the size of HG(x).
Variational Approximate Decoding | A reviewer asks about the interaction with backed-off language models.
Variational Approximate Decoding | We sketch a method that works for any language model given by a weighted FSA, L. The variational family Q can be specified by any deterministic weighted FSA, Q, with weights parameterized by δ.
Background | Since all the member systems share the same data resources, such as the language model and translation table, we only need to keep one copy of the required resources in memory.
Background | Another method to speed up the system is to accelerate the n-gram language model with n-gram caching techniques.
Background | If the required n-gram hits the cache, the corresponding n-gram probability is returned by the cached copy rather than re-fetched from the original data in the language model.
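The caching scheme described here can be sketched in a few lines (the class and method names are hypothetical; any base model exposing a log-probability lookup would work the same way):

```python
# Sketch of n-gram probability caching (all names hypothetical):
# the cache maps an n-gram tuple to its log-probability so that
# repeated queries skip the slower lookup in the full model.

class CachedNgramLM:
    def __init__(self, base_lm):
        self.base_lm = base_lm   # object with a logprob(ngram) method
        self.cache = {}          # n-gram tuple -> cached log-probability
        self.hits = 0
        self.misses = 0

    def logprob(self, ngram):
        ngram = tuple(ngram)
        if ngram in self.cache:  # cache hit: return the stored copy
            self.hits += 1
            return self.cache[ngram]
        self.misses += 1         # cache miss: fetch from the base model
        p = self.base_lm.logprob(ngram)
        self.cache[ngram] = p
        return p
```

Since decoders re-query the same n-grams many times per sentence, even this naive dictionary cache avoids most base-model lookups.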
Alignment | A language model is not used in this case, as the system is constrained to the given target sentence and thus the language model score has no effect on the alignment. |
Alignment | To deal with this problem, instead of simple phrase length restriction, we propose to apply the leaving-one-out method, which is also used for language modeling techniques (Kneser and Ney, 1995). |
Experimental Evaluation | The baseline system is a standard phrase-based SMT system with eight features: phrase translation and word lexicon probabilities in both translation directions, phrase penalty, word penalty, language model score and a simple distance-based reordering model. |
Experimental Evaluation | We used a 4-gram language model with modified Kneser-Ney discounting for all experiments. |
Introduction | The phrase model is combined with a language model, word lexicon models, word and phrase penalty, and many others.
Related Work | They report improvements over a phrase-based model that uses an inverse phrase model and a language model.
Related Work | First, a pre-defined confusion set is used to generate candidate corrections; then a scoring model, such as a trigram language model or naïve Bayes classifier, is used to rank the candidates according to their context (e.g., Golding and Roth, 1996; Mangu and Brill, 1997; Church et al., 2007).
Related Work | (2009) present a query speller system in which both the error model and the language model are trained using Web data. |
Related Work | Typically, a language model (source model) is used to capture contextual information, while an error model (channel model) is considered to be context free in that it does not take into account any contextual information in modeling word transformation probabilities. |
The Baseline Speller System | where the error model P(Q|C) models the transformation probability from C to Q, and the language model P(C) models how likely C is a correctly spelled query.
The Baseline Speller System | The language model (the second factor) is a backoff bigram model trained on the tokenized form of one year of query logs, using maximum likelihood estimation with absolute discounting smoothing. |
The Baseline Speller System | Since we define the logarithm of the probabilities of the language model and the error model (i.e., the edit distance function) as features, the ranker can be viewed as a more general framework, subsuming the source channel model as a special case. |
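The source-channel decision rule described above can be sketched as follows; the probability tables are toy values invented for illustration, not taken from the paper:

```python
import math

# Toy noisy-channel spelling correction (all probabilities hypothetical):
# rank each candidate correction C for query Q by
#   log P(Q|C)  (error model)  +  log P(C)  (language model)
# and keep the highest-scoring candidate.

error_model = {("langage", "language"): 0.3,   # P(Q|C): edit likelihood
               ("langage", "langage"): 0.6}    # leaving the query unchanged
source_model = {"language": 0.01,              # P(C): LM prior on candidates
                "langage": 0.000001}

def correct(query, candidates):
    def score(c):
        return (math.log(error_model.get((query, c), 1e-12))
                + math.log(source_model.get(c, 1e-12)))
    return max(candidates, key=score)
```

Here the strong LM prior on "language" overcomes the error model's preference for leaving the query as typed, which is exactly the trade-off the two factors encode.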
Introduction | Each property indexes a language model, thus allowing documents that incorporate the same
Model Description | Keyphrases are drawn from a set of clusters; words in the documents are drawn from language models indexed by a set of topics, where the topics correspond to the keyphrase clusters. |
Model Description | — language models of each topic |
Model Description | In the LDA framework, each word is generated from a language model that is indexed by the word’s topic assignment. |
Classification Results | The language model features were completely useless for distinguishing contingencies from |
Features for sense prediction of implicit discourse relations | For each sense, we created unigram and bigram language models over the implicit examples in the training set.
Features for sense prediction of implicit discourse relations | We compute each example’s probability according to each of these language models.
Features for sense prediction of implicit discourse relations | of the spans’ likelihoods according to the various language models.
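A rough sketch of this per-sense likelihood feature, using hypothetical training data; add-one smoothing here stands in for whatever smoothing the authors actually used:

```python
import math
from collections import Counter

# Sketch (hypothetical data): train one add-one-smoothed unigram LM per
# relation sense, then score a span under each model; the per-model
# log-likelihoods become features for sense prediction.

def train_unigram(sentences, vocab):
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    # add-one smoothing over the shared vocabulary
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def loglik(model, span):
    return sum(math.log(model[w]) for w in span)
```

A span would then get one `loglik` feature per sense-specific model, letting the classifier compare which sense's language model fits it best.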
Conclusion | Further improvement is possible by incorporating topic models deeper in the decoding process and adding domain knowledge to the language model.
Discussion | 6.3 Improving Language Models |
Discussion | Topic models capture document-level properties of language, but a critical component of machine translation systems is the language model, which provides local constraints and preferences.
Discussion | Domain adaptation for language models (Bellegarda, 2004; Wood and Teh, 2009) is an important avenue for improving machine translation. |
Experiments | We train a modified Kneser-Ney trigram language model on English (Chen and Goodman, 1996).
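As an illustration of Kneser-Ney-style smoothing at the bigram level (a simplification: "modified" KN uses three count-dependent discounts, whereas this sketch uses a single fixed discount D = 0.75):

```python
from collections import Counter, defaultdict

# Sketch of interpolated Kneser-Ney for bigrams. The discount D is assumed
# fixed; the lower-order distribution is the continuation probability
# (fraction of distinct bigram types ending in w), not the raw unigram.

def kneser_ney_bigram(tokens, D=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])          # count of each history u
    continuations = defaultdict(set)          # w -> set of left contexts
    for (u, w) in bigrams:
        continuations[w].add(u)
    n_bigram_types = len(bigrams)

    def prob(u, w):
        # continuation probability: how many contexts does w follow?
        p_cont = len(continuations[w]) / n_bigram_types
        # mass freed by discounting all bigrams that start with u
        distinct_after_u = sum(1 for (a, _) in bigrams if a == u)
        lam = D * distinct_after_u / histories[u]
        return max(bigrams[(u, w)] - D, 0) / histories[u] + lam * p_cont

    return prob
```

The discounted mass from seen bigrams is redistributed via the continuation distribution, which is what distinguishes Kneser-Ney from plain absolute discounting.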
Experimental Settings | The models are built using the SRI Language Modeling Toolkit (Stolcke, 2002). |
Problem Zones in Handwriting Recognition | Digits on the other hand are a hard class to language model since the vocabulary (of multi-digit numbers) is infinite. |
Problem Zones in Handwriting Recognition | The HR system output does not contain any illegal non-words since its vocabulary is restricted by its training data and language models . |
Related Work | Alternatively, morphological information can be used to construct supplemental lexicons or language models (Sari and Sellami, 2002; Magdy and Darwish, 2006). |
Related Work | Their hypothesis that their large language model (16M words) may be responsible for why the word-based models outperformed stem-based (morphological) models is challenged by the fact that our language model data (220M words) is an order of magnitude larger, but we are still able to show benefit for using morphology. |
Related Work | Sparsity for low-order contexts has recently spurred interest in using latent variables to represent distributions over contexts in language models.
Related Work | While n-gram models have traditionally dominated in language modeling, two recent efforts de-
Related Work | Several authors investigate neural network models that learn not just one latent state, but rather a vector of latent variables, to represent each word in a language model (Bengio et al., 2003; Emami et al., 2003; Morin and Bengio, 2005). |
Smoothing Natural Language Sequences | 2.3 Latent Variable Language Model Representation |
Smoothing Natural Language Sequences | Latent variable language models (LVLMs) can be used to produce just such a distributional representation. |
Introduction | The best-performing systems for these applications today rely on training on large amounts of data: in the case of ASR, the data is aligned audio and transcription, plus large unannotated data for the language modeling; in the case of OCR, it is transcribed optical data; in the case of MT, it is aligned bitexts.
Introduction | For ASR and OCR, which can compose words from smaller units (phones or graphically recognized letters), an expanded target language vocabulary can be directly exploited without the need for changing the technology at all: the new words need to be inserted into the relevant resources (lexicon, language model) etc., with appropriately estimated probabilities.
Introduction | The expanded word combinations can be used to extend the language models used for MT to bias against incoherent hypothesized new sequences of segmented words. |
Morphology-based Vocabulary Expansion | In the Bigram Affix model, we do the same for the stem as in the Fixed Affix model, but for prefixes and suffixes, we create a bigram language model in the finite state machine. |
Morphology-based Vocabulary Expansion | We reweight the weights in the WFST model (Fixed or Bigram) by composing it with a letter trigraph language model (WoTr). |
Computing Feature Expectations | The nodes are states in the decoding process that include the span (i, j) of the sentence to be translated, the grammar symbol s over that span, and the left and right context words of the translation relevant for computing n-gram language model scores. Each hyper-edge h represents the application of a synchronous rule r that combines nodes corresponding to non-terminals in
Computing Feature Expectations | Decoder states can include additional information as well, such as local configurations for dependency language model scoring.
Computing Feature Expectations | The weight of h is the incremental score contributed to all translations containing the rule application, including translation model features on r and language model features that depend on both r and the English contexts of the child nodes.
Experimental Results | All four systems used two language models: one trained from the combined English sides of both parallel texts, and another, larger, language model trained on 2 billion words of English text (1 billion for Chinese-English SBMT). |
MT System Selection | These features rely on language models, MSA and Egyptian morphological analyzers and a Highly Dialectal Egyptian lexicon to decide whether each word is MSA, Egyptian, Both, or Out of Vocabulary.
MT System Selection | two language models: MSA and Egyptian.
MT System Selection | The second set of features uses perplexity against language models built from the source-side of the training data of each of the four |
Machine Translation Experiments | The language model for our systems is trained on English Gigaword (Graff and Cieri, 2003). |
Machine Translation Experiments | We use SRILM Toolkit (Stolcke, 2002) to build a 5-gram language model with modified |
Abstract | On the task shown in (Ravi and Knight, 2011) we obtain better results with only 5% of the computational effort when running our method with an n-gram language model.
Introduction | Combining Language Models and |
Training Algorithm and Implementation | As described in Section 4, the overall procedure is divided into two alternating steps: After initialization we first perform EM training of the translation model for 20-30 iterations using a 2-gram or 3-gram language model in the target language.
Training Algorithm and Implementation | The generative story described in Section 3 is implemented as a cascade of a permutation, insertion, lexicon, deletion and language model finite state transducers using OpenFST (Allauzen et al., 2007). |
Translation Model | Stochastically generate the target sentence according to an n-gram language model . |
Decoding | The language model (LM) scoring is directly integrated into the cube pruning algorithm. |
Decoding | Naturally, we also had to adjust hypothesis expansion and, most importantly, language model scoring inside the cube pruning algorithm. |
Experiments | Our German 4-gram language model was trained on the German sentences in the training data augmented by the Stuttgart SdeWaC corpus (Web-as-Corpus Consortium, 2008), whose generation is detailed in (Baroni et al., 2009). |
Translation Model | (1) The forward translation weight using the rule weights as described in Section 2 (2) The indirect translation weight using the rule weights as described in Section 2 (3) Lexical translation weight source → target (4) Lexical translation weight target → source (5) Target side language model (6) Number of words in the target sentences (7) Number of rules used in the pre-translation (8) Number of target side sequences; here k times the number of sequences used in the pre-translations that constructed τ (gap penalty) The rule weights required for (1) are relative frequencies normalized over all rules with the same left-hand side.
Translation Model | The computation of the language model estimates for (6) is adapted to score partial translations consisting of discontiguous units. |
Experiments | The resulting interview transcripts have a reported mean word error rate (WER) of approximately 25% on held out data, which was obtained by priming the language model with meta-data available from preinterview questionnaires. |
Experiments | We use a mixture of the training transcripts and various newswire sources for our language model training. |
Experiments | We did not attempt to prime the language model for particular interviewees or otherwise utilize any interview metadata. |
Introduction | Limitations in signal processing, acoustic modeling, pronunciation, vocabulary, and language modeling can be accommodated in several ways, each of which make different tradeoffs and thus induce different |
Previous Work | In the extreme case, the term may simply be out of vocabulary, although this may occur for various other reasons (e. g., poor language modeling or pronunciation dictionaries). |
Corpora and baselines | A 5-gram language model with modified interpolated Kneser-Ney smoothing (Chen and Goodman, 1998) was trained by the SRILM toolkit (Stolcke, 2002) on a set of 208 million running words of text obtained by combining the monolingual Czech text distributed by the 2010
Corpora and baselines | The baselines consisted of the language model, two phrase translation models, two lexical models, and a brevity penalty.
Decoding with target-side model dependencies | language model, as described in Chiang (2007).
Decoding with target-side model dependencies | In the case of the language model these aspects include any of its target-side words that are part of still incomplete n-grams. |
Hierarchical phrase-based translation | As shown by Chiang (2007), a weighted grammar of this form can be collected and scored by simple extensions of standard methods for phrase-based translation and efficiently combined with a language model in a CKY decoder to achieve large improvements over a state-of-the-art phrase-based system. |
Experiments | For testing the factored translation systems, we used Moses (Koehn et al., 2007), along with a 5-gram SRILM language model (Stolcke, 2002). |
Factored Model | The factored statistical machine translation model uses a log-linear approach in order to combine the several components, including the language model, the reordering model, the translation models and the generation models.
Introduction | The basic SMT approach uses the target language model as a feature in the argument maximisation function.
Introduction | This language model is trained on grammatically correct text, and would therefore give a good probability for word sequences that are likely to occur in a sentence, while it would penalise ungrammatical or badly ordered formations. |
Introduction | Thus, with respect to these methods, there is a problem when agreement needs to be applied on part of a sentence whose length exceeds the order of the target n-gram language model and the size of the chunks that are translated (see Figure 1 for an example).
A semantic span can include one or more eus. | Most translation systems adopt the features from a translation model, a language model, and sometimes a reordering model.
A semantic span can include one or more eus. | The process of training this transfer model and smoothing is similar to the process of training a language model.
A semantic span can include one or more eus. | formula (6) are estimated in the same way as a factored language model, which has the advantage of easily incorporating various linguistic information.
Experiments | A 5-gram language model is trained with SRILM5 on the combination of the Xinhua portion of the English Giga-word corpus combined with the English part of FBIS. |
Experiments | probabilities, the BTG reordering features, and the language model feature. |
Experiments and Results | Unigram NLLR and Filtered NLLR are the language model implementations of previous work as described in Section 3.1. |
Previous Work | They learned unigram language models (LMs) for specific time periods and scored articles with log-likelihood ratio scores. |
Timestamp Classifiers | 3.1 Language Models |
Timestamp Classifiers | We apply Dirichlet-smoothing to the language models (as in de Jong et al.
Timestamp Classifiers | The above language modeling and MaxEnt approaches are token-based classifiers that one could apply to any topic classification domain. |
Previous work | Language modelling methods build word n-gram models, like those used in speech recognition.
Previous work | 3.2 Language modelling methods |
Previous work | So far, language modelling methods have been more effective. |
The new approach | This corresponds roughly to a unigram language model.
Treebank Translation and Dependency Transformation | In detail, word-based decoding is used, which adopts a log-linear framework as in (Och and Ney, 2002) with only two features, translation model and language model,
Treebank Translation and Dependency Transformation | is the language model, a word trigram model trained from the CTB.
Treebank Translation and Dependency Transformation | Thus the decoding process is actually only determined by the language model.
Statistical Transliteration Model | The language model P(e) is trained from English texts. |
Statistical Transliteration Model | generative probability of an English syllable language model.
Statistical Transliteration Model | 2) The language model in backward transliteration describes the relationship of syllables in words. |
Evaluation | Therefore, we implemented a ranking mechanism which used a hybrid scoring method by giving equal weights to the language model and the normalized phonetic similarity. |
System Description | To check the likelihood and well-formedness of the new string after the replacement, we learn a 3-gram language model with absolute smoothing.
System Description | For learning the language model, we only consider the words in the CMU pronunciation dictionary which also exist in WordNet.
System Description | We remove the words containing at least one trigram which is very unlikely according to the language model . |
Previous Work | These approaches have focused on modeling statistical or syntactic phrasal relations under the language modeling method for information retrieval.
Previous Work | (Srikanth and Srihari, 2003; Maisonnasse et al., 2005) examined the effectiveness of syntactic relations in a query by using a language modeling framework.
Previous Work | (Song and Croft, 1999; Miller et al., 1999; Gao et al., 2004; Metzler and Croft, 2005) investigated the effectiveness of the language modeling approach in modeling statistical phrases such as n-grams or proximity-based phrases.
Proposed Method | We start out by presenting a simple phrase-based language modeling retrieval model that assumes uniform contribution of words and phrases. |
Experiments | An in-house language modeling toolkit is used to train the 5-gram language model with modified Kneser-Ney smoothing (Kneser and Ney, 1995). |
Experiments | The English monolingual data used for language modeling is the same as in Table 1. |
Related Work | They incorporated the bilingual topic information into language model adaptation and lexicon translation model adaptation, achieving significant improvements in the large-scale evaluation. |
Topic Similarity Model with Neural Network | Standard features: Translation model, including translation probabilities and lexical weights for both directions (4 features), 5-gram language model (1 feature), word count (1 feature), phrase count (1 feature), NULL penalty (1 feature), number of hierarchical rules used (1 feature). |
Introduction | And Knight and Hatzivassiloglou (1995) use a language model for selecting a fluent sentence among the vast number of surface realizations corresponding to a single semantic representation. |
Introduction | The top-ranked candidate is selected for presentation and verbalized using a language model interfaced with RealPro (Lavoie and Rambow, 1997), a text generation engine. |
The Story Generator | Since we do not know a priori which of these parameters will result in a grammatical sentence, we generate all possible combinations and select the most likely one according to a language model.
The Story Generator | We used the SRI toolkit to train a trigram language model on the British National Corpus, with interpolated Kneser-Ney smoothing and perplexity as the scoring metric for the generated sentences.
Conclusion | Additionally, we also want to induce sense clusters for words in the target language so that we can build a sense-based language model and integrate it into SMT.
Decoding with Sense-Based Translation Model | error rate training (MERT) (Och, 2003) together with other models such as the language model.
Experiments | We trained a 5-gram language model on the Xinhua section of the English Gigaword corpus (306 million words) using the SRILM toolkit (Stolcke, 2002) with the modified Kneser-Ney smoothing (Chen and Goodman, 1996).
Related Work | (2007) also explore a bilingual topic model for translation and language model adaptation. |
Decoding | 1 to a log-linear model (Och and Ney, 2002) that uses the following eight features: relative frequencies in two directions, lexical weights in two directions, number of rules used, language model score, number of target words produced, and the probability of matched source tree (Mi et al., 2008). |
Decoding | We use the cube pruning method (Chiang, 2007) to approximately intersect the translation forest with the language model . |
Experiments | A trigram language model was trained on the English sentences of the training corpus. |
Related Work | In machine translation, the concept of packed forest is first used by Huang and Chiang (2007) to characterize the search space of decoding with language models . |
Keyphrase Extraction Approaches | 3.3.4 Language Modeling |
Keyphrase Extraction Approaches | These feature values are estimated using language models (LMs) trained on a foreground corpus and a background corpus. |
Keyphrase Extraction Approaches | In sum, LMA uses a language model rather than heuristics to identify phrases, and relies on the language model trained on the background corpus to determine how “unique” a candidate keyphrase is to the domain represented by the foreground corpus. |
Experiments | SRILM (Stolcke, 2002) is adopted for language model training and KenLM (Heafield, 2011; Heafield et al., 2013) for language model query. |
Pinyin Input Method Model | The edge weight is the negative logarithm of the conditional probability P(Sj+1,k | Si,j) that a syllable Si,j is followed by Sj+1,k, which is given by a bigram language model of pinyin syllables:
Related Works | They solved the typo correction problem by decomposing the conditional probability P(H|P) of a Chinese character sequence H given pinyin sequence P into a language model P(wi|wi−1) and a typing model. The typing model, which was estimated on real user input data, was used for typo correction.
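A toy sketch of the lattice idea from the excerpt above: edge weights are negative log bigram probabilities over pinyin syllables, and decoding picks the lowest-cost path. The syllable probabilities here are invented for illustration:

```python
import math

# Toy pinyin syllable lattice (probabilities hypothetical): each edge
# costs -log P(next_syllable | prev_syllable); the best segmentation
# is the candidate path with the lowest total cost.

bigram = {("<s>", "xi"): 0.5, ("xi", "an"): 0.2,
          ("<s>", "xian"): 0.5, ("xian", "</s>"): 0.6,
          ("an", "</s>"): 0.3}

def path_cost(syllables):
    path = ["<s>"] + syllables + ["</s>"]
    return sum(-math.log(bigram.get(pair, 1e-9))   # unseen pairs: floor prob
               for pair in zip(path, path[1:]))

def best_segmentation(candidates):
    return min(candidates, key=path_cost)
```

With these toy numbers the single-syllable reading "xian" beats the two-syllable "xi an", because summed negative log probabilities penalize each extra low-probability transition.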
Related Works | Various approaches were made for the task including language model (LM) based methods (Chen et al., 2013), ME model (Han and Chang, 2013), CRF (Wang et al., 2013d; Wang et al., 2013a), SMT (Chiu et al., 2013; Liu et al., 2013), and graph model (Jia et al., 2013), etc. |
Experimental Results | To handle different directions of translation between Chinese and English, we built two trigram language models with modified Kneser-Ney smoothing (Chen and Goodman, 1998) using the SRILM toolkit (Stolcke, 2002). |
Experimental Results | Feature weights (Baseline vs. AAMT): language model 0.137 vs. 0.133; phrase translation 0.066 vs. 0.023; lexical translation 0.061 vs. 0.078; reverse phrase translation 0.059 vs. 0.103; reverse lexical translation 0.
Unsupervised Translation Induction for Chinese Abbreviations | Moreover, our approach utilizes both Chinese and English monolingual data to help MT, while most SMT systems utilize only the English monolingual data to build a language model.
Unsupervised Translation Induction for Chinese Abbreviations | However, since most of statistical translation models (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006) are symmetrical, it is relatively easy to train a translation system to translate from English to Chinese, except that we need to train a Chinese language model from the Chinese monolingual data. |
Features | We look at the language model (LM) score and the number of alternate pronunciations of the first query, predicting that a misrecognized query will have a lower LM score and more alternate pronunciations. |
Prediction task | In addition, the language model likelihood for the first query was, as expected, significantly lower for retries. |
Related Work | Retry cases are identified with joint language modeling across multiple transcripts, with the intuition that retry pairs tend to be closely related or exact duplicates. |
Related Work | While we follow this work in our usage of joint language modeling, our application encompasses open domain voice searches and voice actions (such as placing calls), so we cannot use simplifying domain assumptions.
Collaborative Decoding | Similar to a language model score, n-gram consensus-based feature values cannot be summed up from smaller hypotheses.
Discussion | They also empirically show that n-gram agreement is the most important factor for improvement apart from language models.
Experiments | The language model used for all models (including decoding models and system combination models described in Section 2.6) is a 5-gram model trained with the English part of the bilingual data and the Xinhua portion of the LDC English Gigaword corpus, version 3.
Experiments | We parsed the language model training data with Berkeley parser, and then trained a dependency language model based on the parsing output. |
Experiment | For the relevance retrieval model, we faithfully reproduce the passage-based language model with pseudo-relevance feedback (Lee et al., 2008). |
Term Weighting and Sentiment Analysis | IR models, such as Vector Space (VS), probabilistic models such as BM25, and Language Modeling (LM), albeit in different forms of approach and measure, employ heuristics and formal modeling approaches to effectively evaluate the relevance of a term to a document (Fang et al., 2004). |
Term Weighting and Sentiment Analysis | In our experiments, we use the Vector Space model with Pivoted Normalization (VS), Probabilistic model (BM25), and Language modeling with Dirichlet Smoothing (LM). |
Term Weighting and Sentiment Analysis | With proper assumptions and derivations, p(w | d) can be derived to language modeling approaches.
Experiments | language model and (93'8' is the length penalty term |
Experiments | We also use the SRI Language Modeling Toolkit (Stolcke, 2002) to train a trigram language model with Kneser-Ney smoothing on the English side of the bitext. |
Experiments | Besides the trigram language model trained on the English side of these bitext, we also use another trigram model trained on the first 1/3 of the Xinhua portion of Gigaword corpus. |
Forest-based translation | The decoder performs two tasks on the translation forest: 1-best search with integrated language model (LM), and k-best search with LM to be used in minimum error rate training.
Experiments and Results | The features we used are commonly used features as in a standard BTG decoder, such as translation probabilities, lexical weights, language model, word penalty and distortion probabilities.
Experiments and Results | The language model is a 5-gram language model trained with the target sentences in the training data.
Experiments and Results | The language model is a 5-gram language model trained with the Gigaword corpus plus the English sentences in the training data.
Features and Training | We also use other fundamental features, such as translation probabilities, lexical weights, distortion probability, word penalty, and language model probability. |
Experiment Results | The language model is the interpolation of 5-gram language models built from news corpora of the NIST 2012 evaluation. |
Experiment Results | The language model is the trigram SRI language model built from the Xinhua corpus of 180 million words.
Experiment Results | The language model is a trigram SRILM model trained from the target side of the training corpora.
Introduction | Many features are shared between phrase-based and tree-based systems including language model, word count, and translation model features.
Introduction | The approach by Wei and Croft (2006) was the first to leverage LDA topics to improve the estimate of document language models and achieved good empirical results. |
Topic-Driven Relevance Models | where Θ is a set of pseudo-relevant feedback documents and θD is the language model of document D. This notion of estimating a query model is
Topic-Driven Relevance Models | We tackle the null probabilities problem by smoothing the document language model using the well-known Dirichlet smoothing (Zhai and Lafferty, 2004). |
Topic-Driven Relevance Models | Instead of viewing Θ as a set of document language models that are likely to contain topical information about the query, we take a probabilistic topic modeling approach.
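Dirichlet smoothing of a document language model, as cited above (Zhai and Lafferty, 2004), interpolates document counts with a collection model: p(w|d) = (c(w,d) + μ·p(w|C)) / (|d| + μ). A minimal sketch, with a toy μ (the literature typically uses μ around 2000):

```python
from collections import Counter

# Sketch of a Dirichlet-smoothed document language model:
#   p(w | d) = (c(w, d) + mu * p(w | C)) / (|d| + mu)
# where p(w | C) is the collection (background) model. This removes the
# zero probabilities that break query-likelihood scoring.

def dirichlet_lm(doc_tokens, collection_model, mu=2000.0):
    counts = Counter(doc_tokens)
    dlen = len(doc_tokens)
    def p(w):
        return (counts[w] + mu * collection_model.get(w, 0.0)) / (dlen + mu)
    return p
```

Short documents are pulled strongly toward the collection model (mu dominates |d|), while long documents stay close to their maximum-likelihood estimates.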
Inference | After randomly initializing all ηk,s,r,t, inference is performed by a blocked Gibbs sampler, alternating resamplings for three major groups of variables: the language model (z, φ), context model (α, γ, β, ρ), and the η, θ variables, which bottleneck between the submodels.
Inference | The language model sampler sequentially updates every z(i) (and implicitly φ via collapsing) in the manner of Griffiths and Steyvers (2004): p(z(i) | θ, w(i), b) ∝ θs,r,t,z (nw,z + b/V)/(nz + b), where counts n are for all event tuples besides i.
Model | Language model:
Model | Thus the language model is very similar to a topic model’s generation of token topics and wordtypes. |
Experiments | In the end-to-end MT pipeline we use a standard set of features: relative-frequency and lexical translation model probabilities in both directions; distance-based distortion model; language model and word count. |
Experiments | We train 3-gram language models using modified Kneser-Ney smoothing.
Experiments | For AR-EN experiments the language model is trained on English data as (Blunsom et al., 2009a), and for FA-EN and UR-EN the English data are the target sides of the bilingual training data. |
Introduction | We develop a Bayesian approach using a Pitman-Yor process prior, which is capable of modelling a diverse range of geometrically decaying distributions over infinite event spaces (here translation phrase-pairs), an approach shown to be state of the art for language modelling (Teh, 2006). |
Conceptual Framework | In language modeling practice, one finds the likelihood P(w | θ) of a word sequence w of length N under a model θ to be an inconvenient measure for comparison.
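The usual remedy is perplexity, which normalizes the likelihood by sequence length: PP(w) = P(w)^(−1/N) = exp(−(1/N) Σ log p(wi | history)), making sequences of different lengths comparable. A minimal sketch:

```python
import math

# Perplexity from per-token log-probabilities:
#   PP = exp(-(1/N) * sum_i log p(w_i | history))
# A uniform model over V word types yields perplexity V for any length.

def perplexity(token_log_probs):
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)
```

Because of the 1/N exponent, a 7-word sequence and a 70-word sequence under the same uniform 10-word model both score perplexity 10, whereas their raw likelihoods differ by many orders of magnitude.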
Discussion | This makes it suitable for comparison of conversational genres, in much the same way as are general language models of words. |
Discussion | Accordingly, as for language models, density estimation in future turn-taking models may be im-
Introduction | The current work attempts to address this problem by proposing a simple framework, which, at least conceptually, borrows quite heavily from the standard language modeling paradigm. |
Experiments | The monolingual data for training the English language model includes the Xinhua portion of the GIGAWORD corpus, which contains 238M English words.
Experiments | A 4-gram language model was trained on the monolingual data by the SRILM toolkit (Stolcke, 2002).
Related Work | Researchers also introduce topic models for cross-lingual language model adaptation (Tam et al., 2007; Ruiz and Federico, 2011).
Related Work | Based on the bilingual topic model, they apply the source-side topic weights into the target-side topic model, and adapt the n-gram language model of target side. |
Experiments | We used SRILM for the training of language models (5-gram in all the experiments).
Experiments | We trained a Chinese language model for the EC translation on the Chinese part of the bi-text. |
Experiments | For the English language model of CE translation, an extra corpus named Tanaka was used besides the English part of the bilingual corpora. |
Abstract | In all experiments we include the target side of the mined parallel data in the language model , in order to distinguish whether results are due to influences from parallel or monolingual data. |
Abstract | In these experiments, we use 5-gram language models when the target language is English or German, and 4-gram language models for French and Spanish.
Abstract | The baseline system was trained using only the Europarl corpus (Koehn, 2005) as parallel data, and all experiments use the same language model trained on the target sides of Europarl, the English side of all linked Spanish-English Wikipedia articles, and the English side of the mined CommonCrawl data.
Model | Broadly, as the learner progresses from one sentence to the next, exposing herself to more novel words, the updated parameters of the language model in turn guide the selection of new “switch-points” for replacing source words with the target foreign words. |
Model | Generally, this value may come directly from the surprisal quantity given by a language model, or may incorporate additional features that are found informative in predicting the constraint on the word.
Related Work | Building on their work, Adel et al. (2012) employ additional features and a recurrent network language model for modeling code-switching in conversational speech.
Experiments | Apart from the language model, the lexical, phrasal, and (for the syntax grammar) label-conditioned features, and the rule, target word, and glue operation counters, Venugopal and Zollmann (2009) also provide both the hierarchical and syntax-augmented grammars with a rareness penalty 1/cnt(r), where cnt(r) is the occurrence count of rule r in the training corpus, allowing the system to learn penalization of low-frequency rules, as well as three indicator features firing if the rule has one, two unswapped, and two swapped nonterminal pairs, respectively. Further, to mitigate badly estimated PSCFG derivations based on low-frequency rules of the much sparser syntax model, the syntax grammar also contains the hierarchical grammar as a backbone (cf.
Experiments | Each system is trained separately to adapt the parameters to its specific properties (size of nonterminal set, grammar complexity, feature sparseness, reliance on the language model, etc.).
Related work | The supertags are also injected into the language model.
Features | To some extent, these two features have a similar function to a target language model or POS-based target language model.
Related Work | (2009) study several confidence features, based on mutual information between words and on n-gram and backward n-gram language models, for word-level and sentence-level CE.
SMT System | We build a four-gram language model using the SRILM toolkit (Stolcke, 2002), which is trained |
Introduction | Our approach is based on semi-Markov discriminative structure prediction, and it incorporates English back-transliteration and English language models (LMs) into WS in a seamless way. |
Use of Language Model | Language Model Augmentation Analogous to Koehn and Knight (2003), we can exploit the fact that reddo (red) in the example is such a common word that one can expect it appears frequently in the training corpus.
Use of Language Model | 4.1 Language Model Projection |
Experiment | We used 5-gram language models that were trained using the English side of each set of bilingual training data. |
Experiment | The common SMT feature set consists of: four translation model features, phrase penalty, word penalty, and a language model feature. |
Introduction | A language model also supports the estimation.
Experiments | Here, the first item is the language model (LM) probability, where τ(d) is the target string of derivation d; the second item is the translation length penalty; and the third item is the translation score, which is decomposed into a product of feature values of rules:
Experiments | SRI Language Modeling Toolkit (Stolcke, 2002) was employed to train 5-gram English and Japanese LMs on the training set. |
Related Work | By introducing supertags into the target language side, i.e., the target language model and the target side of the phrase table, significant improvement was achieved for Arabic-to-English translation. |
Experiments | The language model is a 3-gram language model trained using the SRILM toolkit (Stolcke, 2002) on the English side of the training data. |
Experiments | The language model is a 3-gram LM trained on the Xinhua portion of the Gigaword corpus using the SRILM toolkit with modified Kneser-Ney smoothing.
Related Work | (2011) develop a bilingual language model which incorporates words in the source and target languages to predict the next unit, which they use as a feature in a translation system. |
Introduction | One is an n-gram model over different units, such as word-level bigram/trigram models (Bangalore and Rambow, 2000; Langkilde, 2000), or factored language models integrated with syntactic tags (White et al.
Introduction | (2009) present a dependency-spanning tree algorithm for word ordering, which first builds dependency trees to decide linear precedence between heads and modifiers, then uses an n-gram language model to order siblings.
Log-linear Models | We linearize the dependency relations by computing n-gram models, similar to traditional word-based language models, except using the names of dependency relations instead of words.
Approach to Sentence-Level Dialect Identification | The aforementioned approach relies on language models (LMs) and an MSA and EDA Morphological Analyzer to decide whether each word is (a) MSA, (b) EDA, (c) both (MSA & EDA), or (d) OOV.
Approach to Sentence-Level Dialect Identification | The perplexity of a language model on a given test sentence S(w1, ..., wn) is defined as:
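The standard definition behind this snippet, PPL(S) = P(w1, ..., wn)^(-1/n), can be sketched as follows; this is a minimal illustration, and the `prob` callback is a stand-in for whatever n-gram model a given paper uses:

```python
import math

def perplexity(sentence, prob):
    """Perplexity of a sentence under a language model:
    PPL = exp(-(1/n) * sum_i log p(w_i | history))."""
    log_sum, history = 0.0, "<s>"
    for word in sentence:
        log_sum += math.log(prob(word, history))
        history = word
    return math.exp(-log_sum / len(sentence))

# Toy uniform model over a 10-word vocabulary: every word has p = 0.1,
# so the perplexity is 10 regardless of the sentence.
uniform = lambda w, h: 0.1
ppl = perplexity(["the", "cat", "sat"], uniform)  # 10.0 up to rounding
```

Lower perplexity means the model finds the sentence less surprising, which is why several of the papers above use it directly as a classification or selection criterion.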
Related Work | Amazon Mechanical Turk and try a language modeling (LM) approach to solve the problem. |
Experimental Design | approximately via cube pruning (Chiang, 2007), by integrating a trigram language model extracted from the training set (see Konstas and Lapata (2012) for details).
Experimental Design | Lexical Features These features encourage grammatical coherence and inform lexical selection over and above the limited horizon of the language model captured by Rules (6)-(9).
Problem Formulation | In machine translation, a decoder that implements forest rescoring (Huang and Chiang, 2007) uses the language model as an external criterion of the goodness of sub-translations on account of their grammaticality. |
Inferring a learning curve from mostly monolingual data | In this section we address scenario S1: we have access to a source-language monolingual collection (from which portions to be manually translated could be sampled) and a target-language in-domain monolingual corpus, to supplement the target side of a parallel corpus while training a language model.
Inferring a learning curve from mostly monolingual data | (b) perplexity of language models of order 2 to 5 derived from the monolingual source corpus computed on the source side of the test corpus. |
Inferring a learning curve from mostly monolingual data | The Lasso regression model selected four features from the entire feature set: i) Size of the test set (sentences & tokens) ii) Perplexity of language model (order 5) on the test set iii) Type-token ratio of the target monolingual corpus.
Discussion | So long as the vocabulary present in our phrase table and language model supports a literal translation, cohesion tends to produce an improvement. |
Discussion | In the baseline translation, the language model encourages the system to move the negation away from “exist” and toward “reduce.” The result is a tragic reversal of meaning in the translation. |
Introduction | order, forcing the decoder to rely heavily on its language model . |
Markov Topic Regression - MTR | (19) Language Model Prior (ηw): Probabilities on word transitions denoted as ηw = p(wi = v | wi-1).
Markov Topic Regression - MTR | We built a language model using SRILM (Stolcke, 2002) on domain-specific sources such as top wiki pages and blogs on online movie reviews, etc., to obtain the probabilities of domain-specific n-grams, up to 3-grams.
Markov Topic Regression - MTR | (1), we assume that the prior on the semantic tags, ηs, is more indicative of the decision for sampling a wi from a new tag compared to language model posteriors on word sequences, ηw.
Automated Approaches to Deceptive Opinion Spam Detection | Under (2), both the NB classifier used by Mihalcea and Strapparava (2009) and the language model classifier used by Zhou et al. |
Automated Approaches to Deceptive Opinion Spam Detection | (2008), we use the SRI Language Modeling Toolkit (Stolcke, 2002) to estimate individual language models, Pr(x | y = c), for truthful and deceptive opinions.
Automated Approaches to Deceptive Opinion Spam Detection | We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996). |
Experiment settings | TopicSum: we use TopicSum (Haghighi and Vanderwende, 2009), a 3-layer hierarchical topic model, to infer the language model that is most central for the collection.
Experiment settings | divergence with respect to the collection language model is the one chosen.
Related work | (2007) generate novel utterances by combining Prim’s maximum-spanning-tree algorithm with an n-gram language model to enforce fluency. |
Abstractive Caption Generation | Specifically, we use an adaptive language model (Kneser et al., 1997) that modifies an |
Abstractive Caption Generation | where P(wi ∈ C | wi ∈ D) is the probability of wi appearing in the caption given that it appears in the document D, and Padap(wi|wi-1,wi-2) the language model adapted with probabilities from our image annotation model:
Experimental Setup | The scaling parameter β for the adaptive language model was also tuned on the development set using a range of [0.5, 0.9].
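The adaptation step referenced in these snippets rescales background n-gram probabilities by a document-conditioned factor raised to the scaling parameter β and renormalizes. The sketch below is illustrative, not the authors' implementation; the function name and toy dictionaries are assumptions:

```python
def adapt(background, boost, beta):
    """Rescale background LM probabilities by an adaptation factor
    raised to beta, then renormalize over the vocabulary.

    background: dict word -> p(w | h) under the base n-gram model
    boost:      dict word -> adaptation weight, e.g. P(w in C | w in D);
                words absent from `boost` keep weight 1.0
    """
    scaled = {w: p * boost.get(w, 1.0) ** beta for w, p in background.items()}
    z = sum(scaled.values())
    return {w: s / z for w, s in scaled.items()}

base = {"red": 0.2, "blue": 0.3, "green": 0.5}
adapted = adapt(base, {"red": 2.0}, beta=1.0)
# "red" rises from 0.2 to 0.4/1.2 = 1/3 after renormalization
```

Raising β toward 1 trusts the adaptation signal more; β = 0 recovers the background model exactly, which is why a range such as [0.5, 0.9] is tuned on held-out data.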
The Features | It is built into a unigram language model without smoothing for each term. |
The Features | This feature function measures the Kullback-Leibler divergence (KL divergence) between the language models associated with the two inputs.
The Features | Similarly, the local context is built into a unigram language model without smoothing for each term; the feature function outputs the KL divergence between the models.
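A minimal sketch of this feature, assuming unsmoothed maximum-likelihood unigram models whose vocabularies overlap (real systems typically smooth the second model so the divergence stays finite):

```python
import math
from collections import Counter

def unigram_lm(terms):
    """Maximum-likelihood unigram model (no smoothing) from a term list."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q):
    """KL(p || q) in nats; q must cover the support of p, which for
    unsmoothed models means a shared vocabulary."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items())

p = unigram_lm(["bank", "river", "bank", "water"])
q = unigram_lm(["bank", "water", "river", "river"])
kl = kl_divergence(p, q)  # 0.25 * ln 2, about 0.173 nats
```

KL divergence is asymmetric (KL(p||q) generally differs from KL(q||p)), so which input serves as p is a design choice the feature definition must fix.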
Related Work | Bengio et al. (2006) proposed to use a multilayer neural network for the language modeling task.
Related Work | Niehues and Waibel (2012) show that machine translation results can be improved by combining a neural language model with a traditional n-gram language model.
Related Work | Son et al. (2012) improve the translation quality of an n-gram translation model by using a bilingual neural language model.
Experiment | We train a 5-gram language model with the Xinhua portion of the English Gigaword corpus and the target part of the training data.
Integrating into the PAS-based Translation Framework | The weights of the MEPD feature can be tuned by MERT (Och, 2003) together with other translation features, such as the language model.
PAS-based Translation Framework | The target-side-like PAS is selected only according to the language model and translation probabilities, without considering any context information of PAS. |
Experiments | Row 1 and row 2 are two baseline systems, which model the relevance score using VSM (Cao et al., 2010) and language model (LM) (Zhai and Lafferty, 2001; Cao et al., 2010) in the term space.
Experiments | Row 3 is the word-based translation model (Jeon et al., 2005), and row 4 is the word-based translation language model, which linearly combines the word-based translation model and language model into a unified framework (Xue et al., 2008). |
Experiments | (2009) in Table 3 because previous work (Ming et al., 2010) demonstrated that the word-based translation language model (Xue et al., 2008) obtained superior performance to the syntactic tree matching (Wang et al., 2009).
Experiments | In the experiments, we train the translation model on the FBIS corpus (7.2M (Chinese) + 9.2M (English) words) and train a 4-gram language model on the Xinhua portion of the English Gigaword corpus (181M words) using the SRILM Toolkit (Stolcke,
NonContiguous Tree Sequence Alignment-based Model | 2) the bi-lexical translation probabilities; 3) the target language model
The Pisces decoder | On the other hand, to simplify the computation of the language model, we only compute it for source-side contiguous translation hypotheses, while neglecting gaps in the target side if any.
Experiments | We trained two language models: the first one is a 4-gram LM which is estimated on the target side of the texts used in the large data condition.
Experiments | Both language models are used for both tasks. |
Experiments | Only the target-language half of the parallel training data are used to train the language model in this task. |
Experimental Setup | For the language model, we used a 5-gram model with modified Kneser-Ney smoothing (Kneser and Ney, 1995) trained on the English side of our training data as well as portions of the Gigaword v2 English corpus.
Experimental Setup | For the language model, we used a 5-gram model trained on the English portion of the whole training data plus portions of the Gigaword v2 corpus.
Hierarchical Phrase-based System | Given e and f as the source and target phrases associated with the rule, typical features used are the rule's translation probability Ptrans(f|e) and its inverse Ptrans(e|f), and the lexical probability Plex(f|e) and its inverse Plex(e|f). Systems generally also employ a word penalty, a phrase penalty, and a target language model feature.
Data | To manage the degrees of freedom in the model described in §4, we perform dimensionality reduction on the vocabulary by learning word embeddings with a log-linear continuous skip-gram language model (Mikolov et al., 2013) on the entire collection of 15,099 books.
Model | Maximum entropy approaches to language modeling have been used since Rosenfeld (1996) to incorporate long-distance information, such as previously-mentioned trigger words, into n-gram language models . |
Model | Notation: P, number of personas (hyperparameter); D, number of documents; Cd, number of characters in document d; Wd,c, number of (cluster, role) tuples for character c; md, metadata for document d (ranges over M authors); θd, document d's distribution over personas; pd,c, character c's persona; j, an index for a ⟨r, w⟩ tuple in the data; wj, word cluster ID for tuple j; rj, role for tuple j ∈ {agent, patient, poss, pred}; η, coefficients for the log-linear language model; μ, λ, Laplace mean and scale (for regularizing η); α, Dirichlet concentration parameter.
Abstract | Context-predicting models (more commonly known as embeddings or neural language models) are the new kids on the distributional semantics block.
Introduction | This is in part due to the fact that context-predicting vectors were first developed as an approach to language modeling and/or as a way to initialize feature vectors in neural-network-based “deep learning” NLP architectures, so their effectiveness as semantic representations was initially seen as little more than an interesting side effect. |
Introduction | Predictive DSMs are also called neural language models , because their supervised context prediction training is performed with neural networks, or, more cryptically, “embeddings”. |
Experiments | For this test set, we used 8 million sentences from the full NIST parallel dataset as the language model training data. |
Experiments | If either the source or the target side of a training instance had an edit distance of less than 10%, we removed it. As for the language models, we collected a further 10M tweets from Twitter for the English language model and another 10M tweets from Weibo for the Chinese language model.
Experiments | As the language model, we use a 5-gram model with Kneser-Ney smoothing.
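Kneser-Ney smoothing, which recurs throughout these system descriptions, can be sketched at the bigram level as below. This is a minimal illustration with a single fixed discount d, not the "modified" multi-discount variant toolkits like SRILM actually implement:

```python
from collections import Counter

def kn_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney bigram probabilities (minimal sketch):
    p(w|v) = max(c(v,w) - d, 0)/c(v) + d * N1+(v,.)/c(v) * p_cont(w),
    where p_cont(w) is the fraction of distinct bigram types ending in w."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])                 # c(v)
    followers = Counter(v for v, _ in bigrams)      # N1+(v, .)
    continuations = Counter(w for _, w in bigrams)  # N1+(., w)
    n_types = len(bigrams)

    def prob(w, v):
        p_cont = continuations[w] / n_types
        if contexts[v] == 0:                        # unseen context: back off fully
            return p_cont
        discounted = max(bigrams[(v, w)] - d, 0) / contexts[v]
        return discounted + d * followers[v] / contexts[v] * p_cont

    return prob

p = kn_bigram("a b a b a c".split())
# for a fixed context, the probabilities sum to one over the vocabulary
```

The key idea is the continuation probability: a word's backoff weight reflects how many distinct contexts it follows, not its raw frequency, which is what distinguishes Kneser-Ney from simple absolute discounting.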
Evaluation Tasks and Results | To produce baseline cognate identification predictions, we calculate the probability of each latent Hebrew letter sequence predicted by the HMM, and compare it to a uniform character-level Ugaritic language model (as done by our model, to avoid automatically assigning higher cognate probability to shorter Ugaritic words). |
Inference | We also calculate this probability using a uniform unigram character-level language model (and it thus depends only on the number of characters in ui).
Model | Otherwise, a lone word is generated according to a uniform character-level language model.
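The uniform character-level model described in these snippets assigns a probability that depends only on string length; a one-line sketch, assuming independent and uniformly distributed characters (the function name is illustrative):

```python
def uniform_char_lm(word, alphabet_size):
    """Probability of a string under a uniform character-level model:
    each character is independent with probability 1/|alphabet|, so the
    score depends only on the string's length, not its content."""
    return (1.0 / alphabet_size) ** len(word)

# e.g. a 3-character string over a binary alphabet: (1/2) ** 3 = 0.125
```

Because every string of the same length gets the same score, such a model serves as a neutral length-sensitive baseline when comparing hypotheses, as in the cognate-identification setup above.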
Conclusion | In the current version of the generator, the output is ranked using a simple language model trained on the GENIA corpus. |
Generating from the KBGen Knowledge-Base | To rank the generator output, we train a language model on the GENIA corpus, a corpus of 2000 MEDLINE abstracts about biology containing more than 400,000 words (Kim et al., 2003), and use this model to rank the generated sentences by decreasing probability.
Related Work | They intersect the grammar with a language model to improve fluency, use a weighted hypergraph to pack the derivations, and find the best derivation tree using the Viterbi algorithm.
Conclusion and Future Work | By combining such information with traditional statistical language models , it is capable of suggesting relevant articles that meet the dynamic nature of a discussion in social media. |
Experimental Evaluation | The second one, LM, is based on statistical language models for relevant information retrieval (Ponte and Croft, 1998). |
Experimental Evaluation | probabilistic language model for each article, and ranks them on query likelihood, i.e.
Experiments | For the language model, we used the SRI Language Modeling Toolkit (Stolcke, 2002) to train a 4-gram model on the Xinhua portion of the GIGAWORD corpus.
Joint Decoding | There are also features independent of derivations, such as the language model and word penalty.
Joint Decoding | Although left-to-right decoding might enable a more efficient use of language models and hopefully produce better translations, we adopt bottom-up decoding in this paper just for convenience. |
Introduction | Successful applications of such models include language modelling (Bengio et al., 2003), paraphrase detection (Erk and Pado, 2008), and dialogue analysis (Kalchbrenner and Blunsom, 2013). |
Related Work | Neural language models are another popular approach for inducing distributed word representations (Bengio et al., 2003). |
Related Work | They have received a lot of attention in recent years (Collobert and Weston, 2008; Mnih and Hinton, 2009; Mikolov et al., 2010, inter alia) and have achieved state-of-the-art performance in language modelling.
Models and Features | In particular, we use the recurrent neural network language model (RNNLM) of Mikolov et al. |
Models and Features | Like any language model, an RNNLM estimates the probability of observing a word given the preceding context, but, in this process, it learns word embeddings into a latent, conceptual space with a fixed number of dimensions.
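The recurrence behind such a model can be sketched in a few lines of numpy. This toy uses random, untrained weights of illustrative sizes; it shows the hidden-state update and next-word distribution, not Mikolov et al.'s actual trained RNNLM:

```python
import numpy as np

V, H = 5, 8                        # toy vocabulary and hidden-layer sizes
rng = np.random.default_rng(0)
Wx = rng.normal(0.0, 0.1, (H, V))  # input (one-hot word) -> hidden
Wh = rng.normal(0.0, 0.1, (H, H))  # previous hidden -> hidden
Wo = rng.normal(0.0, 0.1, (V, H))  # hidden -> output logits

def rnnlm_step(word_id, h_prev):
    """One step of a simple recurrent LM: fold the current word into the
    hidden state, then return a distribution over the next word."""
    x = np.zeros(V)
    x[word_id] = 1.0
    h = np.tanh(Wx @ x + Wh @ h_prev)   # hidden state summarizes the context
    logits = Wo @ h
    p = np.exp(logits - logits.max())   # numerically stable softmax
    return h, p / p.sum()

h = np.zeros(H)
for w in [0, 3, 1]:                     # feed a short word-id sequence
    h, p_next = rnnlm_step(w, h)
# p_next is a proper distribution over the 5-word vocabulary
```

The columns of Wx act as the learned word embeddings once the network is trained, which is the property the quoted papers exploit as features.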
Related Work | (2013) recently addressed the problem of answer sentence selection and demonstrated that LS models, including recurrent neural network language models (RNNLM), have a higher contribution to overall performance than exploiting syntactic analysis. |
Abstract | Optical Character Recognition (OCR) systems for Arabic rely on information contained in the scanned images to recognize sequences of characters and on language models to emphasize fluency. |
Discriminative Reranking for OCR | The LMs are built using the SRI Language Modeling Toolkit (Stolcke, 2002).
Introduction | The BBN Byblos OCR system (Natarajan et al., 2002; Prasad et al., 2008; Saleem et al., 2009), which we use in this paper, relies on a hidden Markov model (HMM) to recover the sequence of characters from the image, and uses an n-gram language model (LM) to emphasize the fluency of the output.
Background | The RNN is primarily used as a language model , but may also be viewed as a sentence model with a linear structure. |
Introduction | Besides comprising powerful classifiers as part of their architecture, neural sentence models can be used to condition a neural language model to generate sentences word by word (Schwenk, 2012; Mikolov and Zweig, 2012; Kalchbrenner and Blunsom, 2013a). |
Properties of the Sentence Model | This gives the RNN excellent performance at language modelling, but it is suboptimal for remembering at once the n-grams further back in the input sentence.