Abstract | This paper applies MST parsing to MT, and describes how it can be integrated into a phrase-based decoder to compute dependency language model scores. |
Abstract | Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets. |
Dependency parsing for machine translation | While it seems that loopy graphs are undesirable when the goal is to obtain a syntactic analysis, that is not necessarily the case when one just needs a language modeling score. |
Introduction | Hierarchical approaches to machine translation have proven increasingly successful in recent years (Chiang, 2005; Marcu et al., 2006; Shen et al., 2008), and often outperform phrase-based systems (Och and Ney, 2004; Koehn et al., 2003) on target-language fluency and adequacy; however, their benefits generally come with high computational costs, particularly when chart parsing, such as CKY, is integrated with language models of high orders (Wu, 1996). |
Introduction | Indeed, researchers have shown that gigantic language models are key to state-of-the-art performance (Brants et al., 2007), and the ability of phrase-based decoders to handle large-size, high-order language models with no consequence on asymptotic running time during decoding presents a compelling advantage over CKY decoders, whose time complexity grows prohibitively large with higher-order language models. |
Introduction | Most interestingly, the time complexity of non-projective dependency parsing remains quadratic as the order of the language model increases. |
Machine translation experiments | We use the standard features implemented almost exactly as in Moses: four translation features (phrase-based translation probabilities and lexically-weighted probabilities), word penalty, phrase penalty, linear distortion, and language model score. |
Machine translation experiments | In order to train a competitive baseline given our computational resources, we built a large 5-gram language model using the Xinhua and AFP sections of the Gigaword corpus (LDC2007T40) in addition to the target side of the parallel data. |
Machine translation experiments | The language model was smoothed with the modified Kneser-Ney algorithm as implemented in (Stolcke, 2002), and we only kept 4-grams and 5-grams that occurred at least three times in the training data. |
Experiments | First, we consider a bigram language model, and the algorithms try to find the reordering that maximizes the LM score. |
Experiments | Then, we consider a trigram-based language model, and the algorithms again try to maximize the LM score. |
Experiments | This means that, when using a bigram language model, it is often possible to reorder the words of a randomly permuted reference sentence in such a way that the LM score of the reordered sentence is larger than the LM score of the reference. |
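Experiments | To make the observation above concrete, here is a minimal sketch (toy, hand-specified bigram probabilities and hypothetical helper names; not the experimental setup used in the paper) of scoring every permutation of a short sentence under a bigram language model and keeping the best-scoring one:

```python
import itertools

# Toy bigram log-probabilities; in a real experiment these come from a trained LM.
BIGRAM_LOGPROB = {
    ("<s>", "the"): -0.5, ("the", "dog"): -1.0, ("dog", "barked"): -1.2,
    ("barked", "</s>"): -0.7, ("the", "barked"): -3.0, ("dog", "the"): -2.5,
}
UNSEEN = -6.0  # crude constant for bigrams not listed above

def bigram_score(words):
    """Sum of bigram log-probabilities with sentence-boundary symbols."""
    padded = ["<s>"] + list(words) + ["</s>"]
    return sum(BIGRAM_LOGPROB.get(pair, UNSEEN) for pair in zip(padded, padded[1:]))

def best_reordering(words):
    """Permutation of `words` with the highest bigram LM score."""
    return max(itertools.permutations(words), key=bigram_score)

reference = ["the", "dog", "barked"]
print(best_reordering(reference), bigram_score(reference))
```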
Introduction | Typical nonlocal features include one or more n-gram language models as well as a distortion feature, measuring by how much the order of biphrases in the candidate translation deviates from their order in the source sentence. |
Phrase-based Decoding as TSP | The language model cost of producing the target words of b’ right after the target words of b; with a bigram language model, this cost can be precomputed directly from b and b’. |
Phrase-based Decoding as TSP | Successful phrase-based systems typically employ language models of order higher than two. |
Phrase-based Decoding as TSP | If we want to extend the power of the model to general n-gram language models, and in particular to the 3-gram
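Phrase-based Decoding as TSP | A minimal sketch (hypothetical names, not the authors' implementation) of the precomputation mentioned above: with a bigram language model, the LM cost of emitting the target words of b' right after b depends only on the last target word of b, so it can be attached to the edge between the two biphrases before search starts:

```python
def edge_lm_cost(b_target, b_prime_target, bigram_logprob, unseen=-6.0):
    """Negative log-probability of the target words of b' emitted right after
    the target words of b, under a bigram model given as a dict of log-probs."""
    cost, prev = 0.0, b_target[-1]   # only the last word of b matters
    for word in b_prime_target:
        cost -= bigram_logprob.get((prev, word), unseen)
        prev = word
    return cost

def precompute_edges(biphrases, bigram_logprob):
    """TSP-style edge weights: one precomputed LM cost per ordered biphrase pair.
    biphrases is a list of target-word tuples."""
    return {(i, j): edge_lm_cost(bi, bj, bigram_logprob)
            for i, bi in enumerate(biphrases)
            for j, bj in enumerate(biphrases) if i != j}
```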
A Syntax Free Sequence-oriented Sentence Compression Method | As an alternative to syntactic parsing, we propose two novel features, intra-sentence positional term weighting (IPTW) and the patched language model (PLM) for our syntax-free sentence compressor. |
A Syntax Free Sequence-oriented Sentence Compression Method | 3.2.2 Patched Language Model |
A Syntax Free Sequence-oriented Sentence Compression Method | Many studies on sentence compression employ the n-gram language model to evaluate the linguistic likelihood of a compressed sentence. |
Abstract | As an alternative to syntactic parsing, we propose a novel term weighting technique based on the positional information within the original sentence and a novel language model that combines statistics from the original sentence and a general corpus. |
Conclusions | As an alternative to the syntactic parser, we proposed two novel features, intra-sentence positional term weighting (IPTW) and the patched language model (PLM), and showed their effectiveness by conducting automatic and human evaluations. |
Experimental Evaluation | We developed the n-gram language model from a 9-year set of Mainichi Newspaper articles. |
Introduction | To maintain the subject-predicate relationship in the compressed sentence and retain fluency without using syntactic parsers, we propose two novel features: intra-sentence positional term weighting (IPTW) and the patched language model (PLM). |
Introduction | PLM is a form of summarization-oriented fluency statistics derived from the original sentence and the general language model. |
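Introduction | The exact PLM formulation is not reproduced in this excerpt; the sketch below (hypothetical names, and a simple linear interpolation weight `lam` that is an assumption, not the authors' formula) only illustrates the general idea of patching a general-corpus bigram model with statistics drawn from the original sentence:

```python
from collections import Counter

def sentence_bigram_prob(prev, word, sentence_tokens):
    """Relative-frequency bigram estimate taken from the original sentence alone."""
    bigrams = Counter(zip(sentence_tokens, sentence_tokens[1:]))
    contexts = Counter(sentence_tokens[:-1])
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0

def patched_prob(prev, word, sentence_tokens, corpus_bigram_prob, lam=0.5):
    """Hypothetical 'patched' estimate: mix the sentence-level statistics with a
    general-corpus bigram model (corpus_bigram_prob is a callable p(word | prev))."""
    return (lam * sentence_bigram_prob(prev, word, sentence_tokens)
            + (1.0 - lam) * corpus_bigram_prob(prev, word))
```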
Results and Discussion | Replacing PLM with the bigram language model (w/o PLM) degrades the performance significantly. |
Results and Discussion | This result shows that the n-gram language model is ill-suited to sentence compression because the n-gram probability is computed from a corpus that includes both short and long sentences. |
Abstract | Our model is a nested hierarchical Pitman-Yor language model, in which a Pitman-Yor spelling model is embedded in the word model. |
Abstract | Our model is also considered as a way to construct an accurate word n-gram language model directly from characters of arbitrary language, without any “word” indications. |
Introduction | Bayesian Kneser-Ney language model, with an accurate character ∞-gram Pitman-Yor spelling model embedded in word models. |
Introduction | Furthermore, it can be viewed as a method for building a high-performance n-gram language model directly from character strings of arbitrary language. |
Introduction | we briefly describe a language model based on the Pitman-Yor process (Teh, 2006b), which is a generalization of the Dirichlet process used in previous research. |
Nested Pitman-Yor Language Model | In contrast, in this paper we use a simple but more elaborate model, that is, a character n-gram language model that also employs HPYLM. |
Nested Pitman-Yor Language Model | Figure 2: Chinese restaurant representation of our Nested Pitman-Yor Language Model (NPYLM). |
Pitman-Yor process and n-gram models | To compute the probability p(w|s) in (1), we adopt a Bayesian language model recently proposed by Teh (2006b) and Goldwater et al. (2005), based on the Pitman-Yor process, a generalization of the Dirichlet process. |
Pitman-Yor process and n-gram models | As a result, the n-gram probability of this hierarchical Pitman-Yor language model (HPYLM) is recursively computed as
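Pitman-Yor process and n-gram models | The recursion itself is cut off in this excerpt; the standard HPYLM predictive probability (Teh, 2006b), presumably the equation (4) referred to in the next sentence, has the form below, where h is the context, h' the context shortened by one word, c(w|h) the customer count, t_hw the table count, and d, θ the discount and strength parameters:

```latex
p(w \mid h) \;=\; \frac{c(w \mid h) - d\, t_{hw}}{\theta + c(h)}
\;+\; \frac{\theta + d\, t_{h}}{\theta + c(h)}\; p(w \mid h'),
\qquad c(h) = \sum_{w} c(w \mid h), \quad t_{h} = \sum_{w} t_{hw}.
```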
Pitman-Yor process and n-gram models | When we set t_hw ≡ 1, (4) recovers Kneser-Ney smoothing: thus the HPYLM is a Bayesian Kneser-Ney language model as well as an extension of the hierarchical Dirichlet process (HDP) used in Goldwater et al. |
Experimental Setup | The language model is trained using a 9 GB English corpus. |
Statistical Paraphrase Generation | Our SPG model contains three sub-models: a paraphrase model, a language model, and a usability model, which control the adequacy, fluency, and usability of the paraphrases, respectively. |
Statistical Paraphrase Generation | Language Model: We use a trigram language model in this work. |
Statistical Paraphrase Generation | The language model based score for the paraphrase t is computed as: |
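Statistical Paraphrase Generation | The equation is cut off in this excerpt; for a trigram language model the score takes the usual form (writing t = t_1 … t_J for the target paraphrase, with boundary padding implied):

```latex
p_{lm}(t) \;=\; \prod_{j=1}^{J} p\bigl(t_j \mid t_{j-2},\, t_{j-1}\bigr)
```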
Experimental Results | We also used a 5-gram language model with modified Kneser-Ney smoothing (Chen and Goodman, 1998), trained on a data set consisting of 130M words from the English Gigaword corpus (LDC2007T07) and the English side of the parallel corpora. |
Experimental Results | We use GIZA++ (Och and Ney, 2000), a suffix-array (Lopez, 2007), SRILM (Stolcke, 2002), and risk-based deterministic annealing (Smith and Eisner, 2006) to obtain word alignments, translation models, language models, and the optimal weights for combining these models, respectively. |
Variational Approximate Decoding | Of course, this last point also means that our computation becomes intractable as n → ∞. However, if p(y | x) is defined by a hypergraph HG(x) whose structure explicitly incorporates an m-gram language model, both training and decoding will be efficient when m ≥ n. We will give algorithms for this case that are linear in the size of HG(x). |
Variational Approximate Decoding | A reviewer asks about the interaction with backed-off language models. |
Variational Approximate Decoding | We sketch a method that works for any language model given by a weighted FSA, L. The variational family Q can be specified by any deterministic weighted FSA, Q, with parameterized weights. |
Classification Results | The language model features were completely useless for distinguishing contingencies from |
Features for sense prediction of implicit discourse relations | For each sense, we created unigram and bigram language models over the implicit examples in the training set. |
Features for sense prediction of implicit discourse relations | We compute each example’s probability according to each of these language models. |
Features for sense prediction of implicit discourse relations | of the spans’ likelihoods according to the various language models. |
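Features for sense prediction of implicit discourse relations | A minimal sketch (hypothetical names; add-one smoothing is an assumption made here, not necessarily the authors' choice) of how such per-sense bigram language models can be turned into features, with one log-likelihood feature per sense:

```python
import math
from collections import Counter

def train_bigram_lm(token_lists):
    """Return a scoring function for an add-one-smoothed bigram LM trained on
    the spans of a single sense."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for tokens in token_lists:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    V = len(vocab)
    def logprob(tokens):
        padded = ["<s>"] + tokens + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
                   for a, b in zip(padded, padded[1:]))
    return logprob

def lm_features(span_tokens, sense_lms):
    """One feature per sense: the span's log-likelihood under that sense's LM."""
    return {"lm_" + sense: lm(span_tokens) for sense, lm in sense_lms.items()}
```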
Related Work | In the setting of language modeling approaches to query expansion, the local analysis idea has been instantiated by estimating additional query language models (Lafferty and Zhai, 2003; Tao and Zhai, 2006) or relevance models (Lavrenko and Croft, 2001) from a set of feedback documents. |
Related Work | (2005) also try to uncover multiple aspects of a query, and to that end they provide an iterative “pseudo-query” generation technique, using cluster-based language models. |
Related Work | Diaz and Metzler (2006) were the first to give a systematic account of query expansion using an external corpus in a language modeling setting, to improve the estimation of relevance models. |
Retrieval Framework | We work in the setting of generative language models. |
Retrieval Framework | Within the language modeling approach, one builds a language model from each document, and ranks documents based on the probability of the document model generating the query. |
Retrieval Framework | The particulars of the language modeling approach have been discussed extensively in the literature (see, e.g., Balog et al. |
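Retrieval Framework | A minimal sketch (hypothetical names; Jelinek-Mercer interpolation is used only for brevity) of the query-likelihood idea just described: estimate a smoothed unigram model from each document and rank documents by how probable their model makes the query:

```python
import math
from collections import Counter

def doc_language_model(doc_tokens, collection_counts, collection_len, lam=0.8):
    """Unigram document model interpolated with the collection model."""
    counts, dlen = Counter(doc_tokens), len(doc_tokens)
    def prob(word):
        p_doc = counts[word] / dlen if dlen else 0.0
        p_col = collection_counts[word] / collection_len
        return max(lam * p_doc + (1.0 - lam) * p_col, 1e-12)  # floor for unseen words
    return prob

def query_likelihood(query_tokens, doc_model):
    """Log-probability of the query under one document's language model."""
    return sum(math.log(doc_model(w)) for w in query_tokens)

def rank(query_tokens, docs, collection_counts, collection_len):
    """docs: {doc_id: token list}. Returns doc ids sorted best-first."""
    scores = {d: query_likelihood(query_tokens,
                                  doc_language_model(toks, collection_counts, collection_len))
              for d, toks in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```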
Related Work | Sparsity for low-order contexts has recently spurred interest in using latent variables to represent distributions over contexts in language models. |
Related Work | While n-gram models have traditionally dominated in language modeling, two recent efforts de-
Related Work | Several authors investigate neural network models that learn not just one latent state, but rather a vector of latent variables, to represent each word in a language model (Bengio et al., 2003; Emami et al., 2003; Morin and Bengio, 2005). |
Smoothing Natural Language Sequences | 2.3 Latent Variable Language Model Representation |
Smoothing Natural Language Sequences | Latent variable language models (LVLMs) can be used to produce just such a distributional representation. |
Computing Feature Expectations | The nodes are states in the decoding process that include the span (i, j) of the sentence to be translated, the grammar symbol s over that span, and the left and right context words of the translation relevant for computing n-gram language model scores. Each hyper-edge h represents the application of a synchronous rule r that combines nodes corresponding to non-terminals in r. |
Computing Feature Expectations | Decoder states can include additional information as well, such as local configurations for dependency language model scoring. |
Computing Feature Expectations | The weight of h is the incremental score contributed to all translations containing the rule application, including translation model features on r and language model features that depend on both r and the English contexts of the child nodes. |
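Computing Feature Expectations | A minimal sketch (hypothetical field names) of the node representation described above: each state records the source span, the grammar symbol, and the boundary target words needed for n-gram language model scoring, so states that agree on these fields can be merged:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class DecoderState:
    """One node of the translation hypergraph."""
    span: Tuple[int, int]            # (i, j): source positions covered
    symbol: str                      # grammar non-terminal over that span
    left_context: Tuple[str, ...]    # leftmost n-1 target words of the partial translation
    right_context: Tuple[str, ...]   # rightmost n-1 target words of the partial translation
```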
Experimental Results | All four systems used two language models: one trained from the combined English sides of both parallel texts, and another, larger, language model trained on 2 billion words of English text (1 billion for Chinese-English SBMT). |
Experiment | For the relevance retrieval model, we faithfully reproduce the passage-based language model with pseudo-relevance feedback (Lee et al., 2008). |
Term Weighting and Sentiment Analysis | IR models such as the Vector Space model (VS), probabilistic models such as BM25, and Language Modeling (LM), albeit differing in approach and measure, employ heuristics and formal modeling to evaluate the relevance of a term to a document (Fang et al., 2004). |
Term Weighting and Sentiment Analysis | In our experiments, we use the Vector Space model with pivoted normalization (VS), the probabilistic model (BM25), and language modeling with Dirichlet smoothing (LM). |
Term Weighting and Sentiment Analysis | With proper assumptions and derivations, p(w | d) can be connected to language modeling approaches. |
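Term Weighting and Sentiment Analysis | For reference, the usual Dirichlet-smoothed estimate behind the LM variant above is shown below, with c(w, d) the count of w in document d, |d| the document length, p(w | C) the collection model, and μ the smoothing parameter:

```latex
p(w \mid d) \;=\; \frac{c(w, d) + \mu\, p(w \mid C)}{|d| + \mu}
```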
Collaborative Decoding | Similar to a language model score, n-gram consensus-based feature values cannot be summed up from smaller hypotheses. |
Discussion | They also empirically show that n-gram agreement is the most important factor for improvement apart from language models. |
Experiments | The language model used for all models (including the decoding models and system combination models described in Section 2.6) is a 5-gram model trained on the English part of the bilingual data and the Xinhua portion of the LDC English Gigaword corpus, version 3. |
Experiments | We parsed the language model training data with the Berkeley parser, and then trained a dependency language model based on the parsing output. |
Decoding | 1 to a log-linear model (Och and Ney, 2002) that uses the following eight features: relative frequencies in two directions, lexical weights in two directions, number of rules used, language model score, number of target words produced, and the probability of matched source tree (Mi et al., 2008). |
Decoding | We use the cube pruning method (Chiang, 2007) to approximately intersect the translation forest with the language model. |
Experiments | A trigram language model was trained on the English sentences of the training corpus. |
Related Work | In machine translation, the concept of packed forest is first used by Huang and Chiang (2007) to characterize the search space of decoding with language models . |
Introduction | And Knight and Hatzivassiloglou (1995) use a language model for selecting a fluent sentence among the vast number of surface realizations corresponding to a single semantic representation. |
Introduction | The top-ranked candidate is selected for presentation and verbalized using a language model interfaced with RealPro (Lavoie and Rambow, 1997), a text generation engine. |
The Story Generator | Since we do not know a priori which of these parameters will result in a grammatical sentence, we generate all possible combinations and select the most likely one according to a language model. |
The Story Generator | We used the SRI toolkit to train a trigram language model on the British National Corpus, with interpolated Kneser-Ney smoothing and perplexity as the scoring metric for the generated sentences. |
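The Story Generator | A minimal sketch (hypothetical names; the scorer stands in for an externally trained trigram model such as one built with the SRI toolkit, and choosing the highest log-probability plays the same role as choosing the lowest perplexity) of the selection step described above:

```python
import itertools

def best_realization(slot_options, lm_logprob):
    """Enumerate every combination of surface choices (one option per slot) and
    return the combination the language model scores highest.

    slot_options: list of lists of candidate strings, one list per slot.
    lm_logprob:   callable mapping a token list to a log-probability.
    """
    candidates = (list(combo) for combo in itertools.product(*slot_options))
    return max(candidates, key=lm_logprob)
```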
Previous Work | These approaches have focused on modeling statistical or syntactic phrasal relations within the language modeling framework for information retrieval. |
Previous Work | Srikanth and Srihari (2003) and Maisonnasse et al. (2005) examined the effectiveness of syntactic relations in a query using the language modeling framework. |
Previous Work | Several studies (Song and Croft, 1999; Miller et al., 1999; Gao et al., 2004; Metzler and Croft, 2005) investigated the effectiveness of the language modeling approach in modeling statistical phrases such as n-grams or proximity-based phrases. |
Proposed Method | We start out by presenting a simple phrase-based language modeling retrieval model that assumes uniform contribution of words and phrases. |
Treebank Translation and Dependency Transformation | In detail, word-based decoding is used, which adopts a log-linear framework as in (Och and Ney, 2002) with only two features, a translation model and a language model. |
Treebank Translation and Dependency Transformation | The latter is the language model, a word trigram model trained from the CTB. |
Treebank Translation and Dependency Transformation | Thus the decoding process is actually only determined by the language model. |
Experiments | For the language model, we used the SRI Language Modeling Toolkit (Stolcke, 2002) to train a 4-gram model on the Xinhua portion of the Gigaword corpus. |
Joint Decoding | There are also features independent of derivations, such as the language model and word penalty. |
Joint Decoding | Although left-to-right decoding might enable a more efficient use of language models and hopefully produce better translations, we adopt bottom-up decoding in this paper just for convenience. |
Experimental Setup | For the language model, we used a 5-gram model with modified Kneser-Ney smoothing (Kneser and Ney, 1995) trained on the English side of our training data as well as portions of the Gigaword v2 English corpus. |
Experimental Setup | For the language model, we used a 5-gram model trained on the English portion of the whole training data plus portions of the Gigaword v2 corpus. |
Hierarchical Phrase-based System | Given e and f as the source and target phrases associated with the rule, typical features used are the rule’s translation probability P_trans(f|e) and its inverse P_trans(e|f), and the lexical probability P_lex(f|e) and its inverse P_lex(e|f). Systems generally also employ a word penalty, a phrase penalty, and a target language model feature. |
Experiments | In the experiments, we train the translation model on the FBIS corpus (7.2M Chinese + 9.2M English words) and train a 4-gram language model on the Xinhua portion of the English Gigaword corpus (181M words) using the SRILM toolkit (Stolcke, 2002). |
NonContiguous Tree sequence Alignment-based Model | 2) The bi-lexical translation probabilities; 3) The target language model
The Pisces decoder | On the other hand, to simplify the language model computation, we only compute it for source-side contiguous translational hypotheses, neglecting any gaps on the target side. |
The Features | It is built into a unigram language model without smoothing for each term. |
The Features | This feature function measures the Kullback-Leibler divergence (KL divergence) between the language models associated with the two inputs. |
The Features | Similarly, the local context is built into a unigram language model without smoothing for each term; the feature function outputs KL divergence between the models. |
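The Features | A minimal sketch (hypothetical names) of the two feature functions described above: build an unsmoothed unigram model for each input and report the KL divergence between them; skipping terms that the second model assigns zero probability is an assumption made here to keep the unsmoothed case finite:

```python
import math
from collections import Counter

def unigram_lm(tokens):
    """Unsmoothed unigram distribution over the given tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q):
    """D(P || Q) between two unigram models; terms with q(w) == 0 are skipped."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if q.get(w, 0.0) > 0.0)

def kl_feature(tokens_a, tokens_b):
    """Feature value: KL divergence between the two inputs' unigram models."""
    return kl_divergence(unigram_lm(tokens_a), unigram_lm(tokens_b))
```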
Introduction | One is the n-gram model over different units, such as word-level bigram/trigram models (Bangalore and Rambow, 2000; Langkilde, 2000), or factored language models integrated with syntactic tags (White et al.
Introduction | (2009) present a dependency-spanning tree algorithm for word ordering, which first builds dependency trees to decide linear precedence between heads and modifiers then uses an n-gram language model to order siblings. |
Log-linear Models | We linearize the dependency relations by computing n-gram models, similar to traditional word-based language models, except using the names of dependency relations instead of words. |
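Log-linear Models | A minimal sketch (hypothetical names; bigrams only, with the dependents taken in surface order) of the linearization just described: each word is replaced by the name of its dependency relation, and n-gram counts are collected over those relation names exactly as for a word-based language model:

```python
from collections import Counter

def relation_bigrams(parse):
    """parse: list of (word, relation_name) pairs in linear (surface) order.
    Returns bigram counts over relation names rather than words."""
    relations = ["<s>"] + [rel for _, rel in parse] + ["</s>"]
    return Counter(zip(relations, relations[1:]))

def train_relation_lm(parsed_corpus):
    """Accumulate relation-name bigram counts over a corpus of parses."""
    counts = Counter()
    for parse in parsed_corpus:
        counts.update(relation_bigrams(parse))
    return counts

# Example: train_relation_lm([[("John", "nsubj"), ("eats", "root"), ("apples", "dobj")]])
```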