Abstract | In this paper we show that even for the case of 1:1 substitution ciphers—which encipher plaintext symbols by exchanging them with a unique substitute—finding the optimal decipherment with respect to a bigram language model is NP-hard. |
Definitions | denotes the language model. |
Definitions | Depending on the structure of the language model, Equation 2 can be further simplified. |
Definitions | Similarly, we define language model matrices S for the unigram and the bigram case. |
Introduction | The general idea is to find those translation model parameters that maximize the probability of the translations of a given source text in a given language model of the target language. |
Introduction | This might be related to the fact that a statistical formulation of the decipherment problem has not been analyzed with respect to n-gram language models: This paper shows the close relationship of the decipherment problem to the quadratic assignment problem. |
Introduction | In Section 4 we show that decipherment using a unigram language model corresponds to solving a linear sum assignment problem (LSAP). |
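A minimal sketch of why the unigram case reduces to an assignment problem (an illustration, not the paper's implementation): under a unigram language model the log-likelihood of a 1:1 key is a sum of independent terms, count(cipher symbol) times log p(plaintext symbol), so choosing the key means choosing one cell per row and column of a score matrix. The toy counts, probabilities, and the function name `best_unigram_key` are assumptions, and brute force over permutations stands in for a proper LSAP solver such as the Hungarian algorithm.

```python
import itertools
import math

def best_unigram_key(cipher_counts, plain_probs):
    """Return the 1:1 cipher->plaintext key with the highest unigram log-likelihood."""
    cipher_syms = sorted(cipher_counts)
    plain_syms = sorted(plain_probs)
    best_score, best_key = -math.inf, None
    # Each permutation is one feasible assignment of plaintext symbols to cipher symbols.
    for perm in itertools.permutations(plain_syms, len(cipher_syms)):
        # The objective is linear in the assignment: one term per (cipher, plain) pair.
        score = sum(cipher_counts[f] * math.log(plain_probs[e])
                    for f, e in zip(cipher_syms, perm))
        if score > best_score:
            best_score, best_key = score, dict(zip(cipher_syms, perm))
    return best_key, best_score

# Hypothetical toy data: ciphertext symbol counts and plaintext unigram probabilities.
cipher_counts = {"x": 5, "y": 3, "z": 1}
plain_probs = {"a": 0.3, "e": 0.6, "t": 0.1}
key, score = best_unigram_key(cipher_counts, plain_probs)
```

Because the objective is a plain sum over assigned pairs, a polynomial-time assignment solver can replace the brute-force loop for realistic alphabet sizes.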
Related Work | gram language model. |
Experiments | Then, Tesseract uses a classifier, aided by a word-unigram language model, to recognize whole words. |
Experiments | 6.3 Language Model |
Learning | The number of states in the dynamic programming lattice grows exponentially with the order of the language model (Jelinek, 1998; Koehn, 2004). |
Learning | As a result, inference can become slow when the language model order n is large. |
Learning | On each iteration of EM, we perform two passes: a coarse pass using a low-order language model, and a fine pass using a high-order language model (Petrov et al., 2008; Zhang and Gildea, 2008). |
Model | $P(E, T, R, X) = P(E) \cdot P(T \mid E) \cdot P(R) \cdot P(X \mid E, T, R)$, where $P(E)$ is the language model, $P(T \mid E)$ the typesetting model, $P(R)$ the inking model, and $P(X \mid E, T, R)$ the noise model. |
Model | 3.1 Language Model P(E) |
Model | Our language model, P(E), is a Kneser-Ney smoothed character n-gram model (Kneser and Ney, 1995). |
Related Work | Work that has directly addressed historical documents has done so using a pipelined approach, and without fully integrating a strong language model (Vamvakas et al., 2008; Kluzner et al., 2009; Kae et al., 2010; Kluzner et al., 2011). |
Related Work | They integrated typesetting models with language models, but did not model noise. |
Related Work | Our approach is also similar in that we use a strong language model (in conjunction with the constraint that the correspondence be regular) to learn the correct mapping. |
Approach | In his method, a variety of languages are modeled by their spelling systems (i.e., character-based n-gram language models). |
Approach | Then, agglomerative hierarchical clustering is applied to the language models to reconstruct a language family tree. |
Approach | The similarity used for clustering is based on a divergence-like distance between two language models that was originally proposed by Juang and Rabiner (1985). |
Methods | Similarly, let $M_i$ be a language model trained using $D_i$. |
Methods | 2, we use an n-gram language model based on a mixture of word and POS tokens instead of a simple word-based language model. |
Methods | In this language model, content words in n-grams are replaced with their corresponding POS tags. |
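The mixed word/POS representation described above can be sketched as follows. This is only an illustration under stated assumptions: the set of tags treated as content words (`CONTENT_TAGS`), the lowercasing of function words, and the helper names are hypothetical, and the tagged sentence is made up rather than taken from the source.

```python
# Assumption: these coarse tags mark content words; the source does not list its tag set.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}

def mix_tokens(tagged_sentence):
    """Replace content words with their POS tag and keep function words as (lowercased) words."""
    return [tag if tag in CONTENT_TAGS else word.lower()
            for word, tag in tagged_sentence]

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical pre-tagged input sentence.
tagged = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
          ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")]
mixed = mix_tokens(tagged)
trigrams = ngrams(mixed, 3)
```

Counting n-grams over `mixed` instead of the raw words yields the statistics an n-gram model of this kind would be estimated from.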
Abstract | As the algorithm generates dependency trees for partial translations left-to-right in decoding, it allows for efficient integration of both n-gram and dependency language models. |
Introduction | In addition, it is straightforward to integrate n-gram language models into phrase-based decoders in which translation always grows left-to-right. |
Introduction | As a result, phrase-based decoders only need to maintain the boundary words on one end to calculate language model probabilities. |
Introduction | Unfortunately, as syntax-based decoders often generate target-language words in a bottom-up way using the CKY algorithm, integrating n-gram language models becomes more expensive because they have to maintain target boundary words at both ends of a partial translation (Chiang, 2007; Huang and Chiang, 2007). |
Abstract | In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling. |
Introduction | Language models have been extensively studied in natural language processing. |
Introduction | The role of a language model is to measure how probably a (target) word would occur based on some given evidence extracted from the history-context. |
Language Modeling with TD and TO | A language model estimates word probabilities given their history, i.e. |
Language Modeling with TD and TO | In order to define the TD and TO components for language modeling, we express the observation of an arbitrary history-word $w_{i-k}$ at the $k$th position behind the target-word as the joint of two events: i) the word $w_{i-k}$ occurs within the history-context: $w_{i-k} \in h$, and ii) it occurs at distance $k$ from the target-word: $\Delta(w_{i-k}) = k$ ($\Delta = k$ for brevity); i.e. |
Language Modeling with TD and TO | In fact, the TO model is closely related to the trigger language model (Rosenfeld 1996), as the prediction of the target-word (the triggered word) is based on the presence of a history-word (the trigger). |
Motivation of the Proposed Approach | The attributes of distance and co-occurrence are exploited and modeled differently in each language modeling approach. |
Related Work | Latent-semantic language model approaches (Bellegarda 1998, Coccaro 2005) weight word counts with TFIDF to highlight their semantic importance towards the prediction. |
Related Work | Other approaches such as the class-based language model (Brown 1992, Kneser & Ney 1993) |
Related Work | The structured language model (Chelba & Jelinek 2000) determines the “heads” in the history-context by using a parsing tree. |
Introduction | Further, decoding with nonlocal (or state-dependent) features, such as a language model, is also a problem. |
Introduction | Actually, even for the (log-)linear model, efficient decoding with the language model is not trivial (Chiang, 2007). |
Introduction | For the nonlocal features such as the language model, Chiang (2007) proposed a cube-pruning method for efficient decoding. |
Abstract | We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of well-known Kneser-Ney (1995) smoothing. |
Introduction | Smoothed n-gram language models are the de facto standard statistical models of language for a wide range of natural language applications, including speech recognition and machine translation. |
Introduction | Constraints for language modeling |
Introduction | As a result, statistical language models — an important component of many such applications — are often trained on very large corpora, then modified to fit within some pre-specified size bound. |
Preliminaries | N-gram language models are typically presented mathematically in terms of words $w$, the strings (histories) $h$ that precede them, and the suffixes of the histories (backoffs) $h'$ that are used in the smoothing recursion. |
Preliminaries | N-gram language models allow for a sparse representation, so that only a subset of the possible n-grams must be explicitly stored. |
Abstract | In this paper we examine language modeling for text simplification. |
Abstract | Unlike some text-to-text translation tasks, text simplification is a monolingual translation task allowing for text in both the input and output domain to be used for training the language model. |
Abstract | We explore the relationship between normal English and simplified English and compare language models trained on varying amounts of text from each. |
Introduction | An important component of many text-to-text translation systems is the language model which predicts the likelihood of a text sequence being produced in the output language. |
Introduction | In some problem domains, such as machine translation, the translation is between two distinct languages and the language model can only be trained on data in the output language. |
Introduction | In these monolingual problems, text could be used from both the input and output domain to train a language model. |
Related Work | If we view the normal data as out-of-domain data, then the problem of combining simple and normal data is similar to the language model domain adaptation problem (Suzuki and Gao, 2005), in particular cross-domain adaptation (Bellegarda, 2004) where a domain-specific model is improved by incorporating additional general data. |
Translation Model Architecture | We train a language model on the source language side of each of the n component bitexts, and compute an n-dimensional vector for each sentence by computing its entropy with each language model. |
Translation Model Architecture | Our aim is not to discriminate between sentences that are more likely and unlikely in general, but to cluster on the basis of relative differences between the language model entropies. |
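A rough sketch of the per-sentence entropy vector described above, under simplifying assumptions that are not from the source: add-one smoothed unigram models stand in for whatever language models the authors trained, the toy corpora are invented, and subtracting the mean is just one possible way to expose relative differences between the entropies before clustering.

```python
import math
from collections import Counter

def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary (a stand-in for a real LM)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def cross_entropy(sentence, lm):
    """Average negative log2 probability per word of the sentence under one model."""
    words = sentence.split()
    return -sum(math.log2(lm[w]) for w in words) / len(words)

def entropy_vector(sentence, lms):
    """One entropy value per component language model."""
    return [cross_entropy(sentence, lm) for lm in lms]

# Hypothetical source sides of two component bitexts.
bitext_sources = [
    ["the court ruled today", "the judge ruled"],
    ["the patient took the drug", "the drug helped"],
]
vocab = {w for corpus in bitext_sources for s in corpus for w in s.split()}
lms = [train_unigram(corpus, vocab) for corpus in bitext_sources]

vec = entropy_vector("the judge ruled today", lms)
mean = sum(vec) / len(vec)
relative = [h - mean for h in vec]  # keeps differences between models, drops overall likelihood
```

Clustering on `relative` rather than `vec` groups sentences by which component model fits them comparatively better, not by how likely they are overall.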
Translation Model Architecture | While it is not the focus of this paper, we also evaluate language model adaptation. |
Experiments | As described in Section 3.2, the weight of each variable is a linear combination of the language model score, three classifier confidence scores, and three classifier disagreement scores. |
Experiments | We use the Web 1T 5-gram corpus (Brants and Franz, 2006) to compute the language model score for a sentence. |
Experiments | Finally, the language model score, classifier confidence scores, and classifier disagreement scores are normalized to take values in [0, 1], based on the HOO 2011 development data. |
Inference with First Order Variables | The language model score h(s’, LM) of s’ based on a large web corpus; |
Inference with First Order Variables | Next, to compute whpyg, we collect the language model score and confidence scores from the article (ART), preposition (PREP), and noun number (NOUN) classifiers, i.e., E = {ART, PREP, NOUN}. |
Inference with Second Order Variables | When measuring the gain due to 21131213312 2 1 (change cat to cats), the weight wNoungmluml is likely to be small since A cats will get a low language model score, a low article classifier confidence score, and a low noun number classifier confidence score. |
Related Work | Features used in classification include surrounding words, part-of-speech tags, language model scores (Gamon, 2010), and parse tree structures (Tetreault et al., 2010). |
Experiment | The SRILM Toolkit (Stolcke, 2002) is employed to train 4-gram language models on the Xinhua portion of the Gigaword corpus, while for the IWSLT2012 data set, only its training set is used. |
Experiment | The similarity between the data from each domain and the test data is calculated using the perplexity measure with a 5-gram language model. |
Hierarchical Phrase Table Combination | The Pitman-Yor process is also employed in n-gram language models, which are hierarchically represented through the hierarchical Pitman-Yor process with switch priors to integrate different domains at all levels (Wood and Teh, 2009). |
Phrase Pair Extraction with Unsupervised Phrasal ITGs | Pbase is a base measure defined as a combination of the IBM Models in two directions and the unigram language models on both sides. |
Related Work | The translation model and language model are primary components in SMT. |
Related Work | Previous work proved successful in the use of large-scale data for language models from diverse domains (Brants et al., 2007; Schwenk and Koehn, 2008). |
Related Work | Alternatively, the language model is incrementally updated by using a succinct data structure with an interpolation technique (Levenberg and Osborne, 2009; Levenberg et al., 2011). |
Bayesian MT Decipherment via Hash Sampling | Secondly, for Bayesian inference we need to sample from a distribution that involves computing probabilities for all the components (language model, translation model, fertility, etc.) |
Bayesian MT Decipherment via Hash Sampling | Note that the (translation) model in our case consists of multiple exponential family components: a multinomial pertaining to the language model (which remains fixed), and other components pertaining to translation probabilities $P_\theta(f_i \mid e_i)$, fertility $P_\theta^{\text{fert}}$, etc. |
Bayesian MT Decipherment via Hash Sampling | where $p_{\text{old}}(\cdot)$, $p_{\text{new}}(\cdot)$ are the true conditional likelihood probabilities according to our model (including the language model component) for the old and new samples, respectively. |
Decipherment Model for Machine Translation | For P(e), we use a word n-gram language model (LM) trained on monolingual target text. |
Decipherment Model for Machine Translation | Generate a target (e.g., English) string $e = e_1 \ldots e_l$, with probability $P(e)$ according to an n-gram language model. |
Experiments and Results | The latter is used to construct a target language model used for decipherment training. |
Experiments and Results | Overall, using a 3-gram language model (instead of 2-gram) for decipherment training improves the performance for all methods. |
Eliciting Addressee’s Emotion | We use GIZA++ and SRILM for learning the translation model and 5-gram language model, re- |
Eliciting Addressee’s Emotion | We use the emotion-tagged dialogue corpus to learn eight translation models and language models, each of which is specialized in generating the response that elicits one of the eight emotions (Plutchik, 1980). |
Eliciting Addressee’s Emotion | In this case, the first two utterances are used to learn the translation model, while only the second utterance is used to learn the language model. |
Experiments | Table 6: The number of utterance pairs used for training classifiers in emotion prediction and learning the translation models and language models in response generation. |
Experiments | We use the utterance pairs summarized in Table 6 to learn the translation models and language models for eliciting each emotional category. |
Related Work | The linear interpolation of translation and/or language models is a widely-used technique for adapting machine translation systems to new domains (Sennrich, 2012). |
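Linear interpolation of language models can be illustrated with a small sketch. The unigram distributions, the weights, and the function name `interpolate` are assumptions made for the example; a real system would interpolate full n-gram models and tune the weights on held-out in-domain data.

```python
def interpolate(models, weights):
    """Mix several word distributions: p(w) = sum_k lambda_k * p_k(w)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    vocab = set().union(*models)  # union of the keys of all models
    return {w: sum(lam * m.get(w, 0.0) for lam, m in zip(weights, models))
            for w in vocab}

# Hypothetical in-domain and general-domain unigram models.
in_domain = {"hello": 0.5, "thanks": 0.5}
general = {"hello": 0.2, "report": 0.8}
mixed = interpolate([in_domain, general], [0.75, 0.25])
```

Since each component sums to one and the weights sum to one, the mixture is again a proper distribution, which is what makes this form of adaptation convenient.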
Experiments | For the out-of-domain data, we build the phrase table and reordering table using the 2.08 million Chinese-to-English sentence pairs, and we use the SRILM toolkit (Stolcke, 2002) to train the 5-gram English language model with the target part of the parallel sentences and the Xinhua portion of the English Gigaword. |
Experiments | An in-domain 5-gram English language model is trained with the target 1 million monolingual data. |
Experiments | (2008) regards the in-domain lexicon with corpus translation probability as another phrase table and further uses the in-domain language model besides the out-of-domain language model. |
Probabilistic Bilingual Lexicon Acquisition | In order to assign probabilities to each entry, we apply the Corpus Translation Probability which is used in (Wu et al., 2008): given an in-domain source language monolingual data, we translate this data with the phrase-based model trained on the out-of-domain News data, the in-domain lexicon and the in-domain target language monolingual data (for language model estimation). |
Related Work | For the target-side monolingual data, they just use it to train a language model, and for the source-side monolingual data, they employ a baseline (word-based SMT or phrase-based SMT trained with small-scale bitext) to first translate the source sentences, combining the source sentence and its target translation as a bilingual sentence pair, and then train a new phrase-based SMT with these pseudo sentence pairs. |
Semi-supervised Parsing with Large Data | These relations are captured by word clustering, lexical dependencies, and a dependency language model , respectively. |
Semi-supervised Parsing with Large Data | 4.3 Structural Relations: Dependency Language Model |
Semi-supervised Parsing with Large Data | The dependency language model is proposed by Shen et al. |
Collocational Lexicon Induction | It has been used as a word similarity measure in language modeling (Dagan et al., 1999). |
Experiments & Results 4.1 Experimental Setup | For the end-to-end MT pipeline, we used Moses (Koehn et al., 2007) with these standard features: relative-frequency and lexical translation model (TM) probabilities in both directions; distortion model; language model (LM) and word count. |
Experiments & Results 4.1 Experimental Setup | For the language model, we used the KenLM toolkit (Heafield, 2011) to create a 5-gram language model on the target side of the Europarl corpus (V7) with approximately 54M tokens with Kneser-Ney smoothing. |
Experiments & Results 4.1 Experimental Setup | However, in an MT pipeline, the language model is supposed to rerank the hypotheses and move more appropriate translations (in terms of fluency) to the top of the list. |
Introduction | Even noisy translation of oovs can aid the language model to better |
Conclusion | Also, we believe that improving English language modeling to match the genre of the translated sentences can have significant positive impact on translation quality. |
Previous Work | They used two language models built from the English GigaWord corpus and from a large web crawl. |
Previous Work | For language modeling, we used either EGen or the English side of the AR corpus plus the English side of NIST12 training data and English GigaWord v5. |
Previous Work | — B2-B4 systems used identical training data, namely EG, with the GW, EGen, or both for B2, B3, and B4 respectively for language modeling . |
Proposed Methods 3.1 Egyptian to EG’ Conversion | Using both language models (52) led to slight improvement. |
Experimental Setup | Beam size is fixed at 2000. Sentence compressions are evaluated by a 5-gram language model trained on Gigaword (Graff, 2003) by SRILM (Stolcke, 2002). |
Sentence Compression | As the space of possible compressions is exponential in the number of leaves in the parse tree, instead of looking for the globally optimal solution, we use beam search to find a set of highly likely compressions and employ a language model trained on a large corpus for evaluation. |
Sentence Compression | Given the N-best compressions from the decoder, we evaluate the yield of the trimmed trees using a language model trained on the Gigaword (Graff, 2003) corpus and return the compression with the highest probability. |
Sentence Compression | Thus, the decoder is quite flexible — its learned scoring function allows us to incorporate features salient for sentence compression while its language model guarantees the linguistic quality of the compressed string. |
Discussion | The first is the incorporation of a language model (or comparable long-distance structure-scoring model) to assign scores to predicted parses independent of the transformation model. |
Experimental setup | The best symmetrization algorithm, translation and language model weights for each language are selected using cross-validation on the development set. |
MT-based semantic parsing | In order to learn a semantic parser using MT we linearize the MRs, learn alignments between the MRL and the NL, extract translation rules, and learn a language model for the MRL. |
MT-based semantic parsing | Language modeling: In addition to translation rules learned from a parallel corpus, MT systems also rely on an n-gram language model for the target language, estimated from a (typically larger) monolingual corpus. |
MT-based semantic parsing | In the case of SP, such a monolingual corpus is rarely available, and we instead use the MRs available in the training data to learn a language model of the MRL. |
Decoding | The language model (LM) scoring is directly integrated into the cube pruning algorithm. |
Decoding | Naturally, we also had to adjust hypothesis expansion and, most importantly, language model scoring inside the cube pruning algorithm. |
Experiments | Our German 4-gram language model was trained on the German sentences in the training data augmented by the Stuttgart SdeWaC corpus (Web-as-Corpus Consortium, 2008), whose generation is detailed in (Baroni et al., 2009). |
Translation Model | (1) The forward translation weight using the rule weights as described in Section 2; (2) The indirect translation weight using the rule weights as described in Section 2; (3) Lexical translation weight source → target; (4) Lexical translation weight target → source; (5) Target side language model; (6) Number of words in the target sentences; (7) Number of rules used in the pre-translation; (8) Number of target side sequences; here k times the number of sequences used in the pre-translations that constructed $\tau$ (gap penalty). The rule weights required for (1) are relative frequencies normalized over all rules with the same left-hand side. |
Translation Model | The computation of the language model estimates for (6) is adapted to score partial translations consisting of discontiguous units. |
Inference | After randomly initializing all $\eta_{k,s,r,t}$, inference is performed by a blocked Gibbs sampler, alternating resamplings for three major groups of variables: the language model ($z$, $\phi$), context model ($\alpha$, $\gamma$, $\beta$, $p$), and the $\eta$, $\theta$ variables, which bottleneck between the submodels. |
Inference | The language model sampler sequentially updates every $z^{(i)}$ (and implicitly $\phi$ via collapsing) in the manner of Griffiths and Steyvers (2004): $p(z^{(i)} \mid \theta, w^{(i)}, b) \propto \theta_{s,r,t,z} (n_{w,z} + b/V) / (n_z + b)$, where counts $n$ are for all event tuples besides $i$. |
Model | Language model: |
Model | Thus the language model is very similar to a topic model’s generation of token topics and word types. |
Experiment Results | The language model is the interpolation of 5-gram language models built from news corpora of the NIST 2012 evaluation. |
Experiment Results | The language model is the trigram SRI language model built from the Xinhua corpus of 180 million words. |
Experiment Results | The language model is three-gram SRILM trained from the target side of the training corpora. |
Introduction | Many features are shared between phrase-based and tree-based systems including language model, word count, and translation model features. |
Abstract | In all experiments we include the target side of the mined parallel data in the language model, in order to distinguish whether results are due to influences from parallel or monolingual data. |
Abstract | In these experiments, we use 5-gram language models when the target language is English or German, and 4-gram language models for French and Spanish. |
Abstract | The baseline system was trained using only the Europarl corpus (Koehn, 2005) as parallel data, and all experiments use the same language model trained on the target sides of Europarl, the English side of all linked Spanish-English Wikipedia articles, and the English side of the mined CommonCrawl data. |
Introduction | The approach by Wei and Croft (2006) was the first to leverage LDA topics to improve the estimate of document language models and achieved good empirical results. |
Topic-Driven Relevance Models | where $\Theta$ is a set of pseudo-relevant feedback documents and $\theta_D$ is the language model of document $D$. This notion of estimating a query model is |
Topic-Driven Relevance Models | We tackle the null probabilities problem by smoothing the document language model using the well-known Dirichlet smoothing (Zhai and Lafferty, 2004). |
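Dirichlet smoothing in its standard form, $p(w \mid D) = (c(w; D) + \mu \, p(w \mid C)) / (|D| + \mu)$, can be sketched as below. This is a generic illustration rather than the authors' code: the collection model, the toy document, the value of `mu`, and the function name `dirichlet_lm` are all assumptions.

```python
from collections import Counter

def dirichlet_lm(doc_tokens, collection_probs, mu=2000.0):
    """Dirichlet-smoothed document model: (c(w; D) + mu * p(w | C)) / (|D| + mu)."""
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    return {w: (counts[w] + mu * p_c) / (doc_len + mu)
            for w, p_c in collection_probs.items()}

# Hypothetical collection model and a short document that never mentions "query".
collection_probs = {"topic": 0.1, "model": 0.2, "query": 0.7}
doc = ["topic", "model", "model", "topic"]
lm = dirichlet_lm(doc, collection_probs, mu=4.0)
```

A word absent from the document still receives probability mass proportional to its collection probability, which is how the smoothing removes null probabilities; larger `mu` pulls the document model closer to the collection model.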
Topic-Driven Relevance Models | Instead of viewing $\Theta$ as a set of document language models that are likely to contain topical information about the query, we take a probabilistic topic modeling approach. |
Experiments | In the end-to-end MT pipeline we use a standard set of features: relative-frequency and lexical translation model probabilities in both directions; distance-based distortion model; language model and word count. |
Experiments | We train 3-gram language models using modified Kneser-Ney smoothing. |
Experiments | For AR-EN experiments the language model is trained on English data as in (Blunsom et al., 2009a), and for FA-EN and UR-EN the English data are the target sides of the bilingual training data. |
Introduction | We develop a Bayesian approach using a Pitman-Yor process prior, which is capable of modelling a diverse range of geometrically decaying distributions over infinite event spaces (here translation phrase-pairs), an approach shown to be state of the art for language modelling (Teh, 2006). |
Experiment settings | TopicSum: we use TopicSum (Haghighi and Vanderwende, 2009), a 3-layer hierarchical topic model, to infer the language model that is most central for the collection. |
Experiment settings | divergence with respect to the collection language model is the one chosen. |
Related work | (2007) generate novel utterances by combining Prim’s maximum-spanning-tree algorithm with an n-gram language model to enforce fluency. |
Experiments | Row 1 and row 2 are two baseline systems, which model the relevance score using VSM (Cao et al., 2010) and a language model (LM) (Zhai and Lafferty, 2001; Cao et al., 2010) in the term space. |
Experiments | Row 3 is the word-based translation model (Jeon et al., 2005), and row 4 is the word-based translation language model, which linearly combines the word-based translation model and language model into a unified framework (Xue et al., 2008). |
Experiments | (2009) in Table 3 because previous work (Ming et al., 2010) demonstrated that the word-based translation language model (Xue et al., 2008) obtained superior performance to syntactic tree matching (Wang et al., 2009). |
Markov Topic Regression - MTR | Language Model Prior ($\eta_W$): Probabilities on word transitions denoted as $\eta_W = p(w_i = v \mid w_{i-1})$. |
Markov Topic Regression - MTR | We built a language model using SRILM (Stolcke, 2002) on the domain specific sources such as top wiki pages and blogs on online movie reviews, etc., to obtain the probabilities of domain-specific n-grams, up to 3-grams. |
Markov Topic Regression - MTR | (l), we assume that the prior on the semantic tags, $\eta_S$, is more indicative of the decision for sampling a $w_i$ from a new tag compared to language model posteriors on word sequences, $\eta_W$. |
Experiment | We train a 5-gram language model with the Xinhua portion of English Gigaword corpus and target part of the training data. |
Integrating into the PAS-based Translation Framework | The weights of the MEPD feature can be tuned by MERT (Och, 2003) together with other translation features, such as the language model. |
PAS-based Translation Framework | The target-side-like PAS is selected only according to the language model and translation probabilities, without considering any context information of PAS. |
Related Work | Bengio et al. (2006) proposed using a multilayer neural network for the language modeling task. |
Related Work | Niehues and Waibel (2012) show that machine translation results can be improved by combining a neural language model with a traditional n-gram language model. |
Related Work | Son et al. (2012) improve the translation quality of an n-gram translation model by using a bilingual neural language model. |
Abstract | Optical Character Recognition (OCR) systems for Arabic rely on information contained in the scanned images to recognize sequences of characters and on language models to emphasize fluency. |
Discriminative Reranking for OCR | The LM models are built using the SRI Language Modeling Toolkit (Stolcke, 2002). |
Introduction | The BBN Byblos OCR system (Natarajan et al., 2002; Prasad et al., 2008; Saleem et al., 2009), which we use in this paper, relies on a hidden Markov model (HMM) to recover the sequence of characters from the image, and uses an n-gram language model (LM) to emphasize the fluency of the output. |
Approach to Sentence-Level Dialect Identification | The aforementioned approach relies on language models (LM) and MSA and EDA Morphological Analyzer to decide whether each word is (a) MSA, (b) EDA, (c) Both (MSA & EDA) or (d) OOV. |
Approach to Sentence-Level Dialect Identification | The perplexity of a language model on a given test sentence $S(w_1, \ldots, w_n)$ is defined as: |
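For orientation, the textbook definition of sentence perplexity is $2^{-\frac{1}{n} \sum_{i=1}^{n} \log_2 p(w_i \mid \text{history})}$; the source's own formula is not reproduced in this excerpt and may differ in detail. The sketch below assumes the per-word conditional probabilities have already been obtained from some language model; the function name and the toy probabilities are hypothetical.

```python
import math

def perplexity(word_probs):
    """Perplexity of a sentence given the model probability of each of its words."""
    n = len(word_probs)
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / n)

# Hypothetical probabilities p(w_i | history) for a four-word sentence.
probs = [0.25, 0.25, 0.25, 0.25]
pp = perplexity(probs)
```

A model that assigns every word probability 1/4 is as uncertain as a uniform choice among four words, so lower perplexity means the model fits the sentence better.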
Related Work | Amazon Mechanical Turk and try a language modeling (LM) approach to solve the problem. |
Experiments | The language model is a 3-gram language model trained using the SRILM toolkit (Stolcke, 2002) on the English side of the training data. |
Experiments | The language model is a 3-gram LM trained on the Xinhua portion of the Gigaword corpus using the SRILM toolkit with modified Kneser-Ney smoothing. |
Related Work | (2011) develop a bilingual language model which incorporates words in the source and target languages to predict the next unit, which they use as a feature in a translation system. |
Experiment | We used 5-gram language models that were trained using the English side of each set of bilingual training data. |
Experiment | The common SMT feature set consists of: four translation model features, phrase penalty, word penalty, and a language model feature. |
Introduction | A language model also supports the estimation. |
Introduction | Our approach is based on semi-Markov discriminative structure prediction, and it incorporates English back-transliteration and English language models (LMs) into WS in a seamless way. |
Use of Language Model | Language Model Augmentation Analogous to Koehn and Knight (2003), we can exploit the fact that l/‘yF‘ reddo (red) in the example ffiayvnI/W‘ is such a common word that one can expect it appears frequently in the training corpus. |
Use of Language Model | 4.1 Language Model Projection |
Experiments | For this test set, we used 8 million sentences from the full NIST parallel dataset as the language model training data. |
Experiments | If either the source or the target side of a training instance had an edit distance of less than 10%, we removed it. As for the language models, we collected a further 10M tweets from Twitter for the English language model and another 10M tweets from Weibo for the Chinese language model. |
Experiments | As the language model, we use a 5-gram model with Kneser-Ney smoothing. |