Index of papers in Proc. ACL 2009 that mention
  • language model
Galley, Michel and Manning, Christopher D.
Abstract
This paper applies MST parsing to MT, and describes how it can be integrated into a phrase-based decoder to compute dependency language model scores.
Abstract
Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets.
Dependency parsing for machine translation
While it seems that loopy graphs are undesirable when the goal is to obtain a syntactic analysis, that is not necessarily the case when one just needs a language modeling score.
Introduction
Hierarchical approaches to machine translation have proven increasingly successful in recent years (Chiang, 2005; Marcu et al., 2006; Shen et al., 2008), and often outperform phrase-based systems (Och and Ney, 2004; Koehn et al., 2003) on target-language fluency and adequacy; however, their benefits generally come with high computational costs, particularly when chart parsing, such as CKY, is integrated with language models of high orders (Wu, 1996).
Introduction
Indeed, researchers have shown that gigantic language models are key to state-of-the-art performance (Brants et al., 2007), and the ability of phrase-based decoders to handle large-size, high-order language models with no consequence on asymptotic running time during decoding presents a compelling advantage over CKY decoders, whose time complexity grows prohibitively large with higher-order language models.
Introduction
Most interestingly, the time complexity of non-projective dependency parsing remains quadratic as the order of the language model increases.
Machine translation experiments
We use the standard features implemented almost exactly as in Moses: four translation features (phrase-based translation probabilities and lexically-weighted probabilities), word penalty, phrase penalty, linear distortion, and language model score.
Machine translation experiments
In order to train a competitive baseline given our computational resources, we built a large 5-gram language model using the Xinhua and AFP sections of the Gigaword corpus (LDC2007T40) in addition to the target side of the parallel data.
Machine translation experiments
The language model was smoothed with the modified Kneser-Ney algorithm as implemented in (Stolcke, 2002), and we only kept 4-grams and 5-grams that occurred at least three times in the training data.
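For reference, interpolated modified Kneser-Ney (the smoothing named here) discounts each observed n-gram count and backs off to the next-shorter history h'; a schematic form of the recursion, following Chen and Goodman's formulation rather than anything specific to this paper, is

$$p_{\mathrm{KN}}(w \mid h) \,=\, \frac{\max\!\bigl(c(hw) - D(c(hw)),\, 0\bigr)}{c(h)} \,+\, \gamma(h)\, p_{\mathrm{KN}}(w \mid h'),$$

where D(·) is the count-dependent discount and the back-off weight γ(h) is chosen so that the distribution sums to one.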
language model is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Zaslavskiy, Mikhail and Dymetman, Marc and Cancedda, Nicola
Experiments
First we consider a bigram Language Model and the algorithms try to find the reordering that maximizes the LM score.
Experiments
Then we consider a trigram based Language Model and the algorithms again try to maximize the LM score.
Experiments
This means that, when using a bigram language model, it is often possible to reorder the words of a randomly permuted reference sentence in such a way that the LM score of the reordered sentence is larger than the LM score of the reference.
Introduction
Typical nonlocal features include one or more n-gram language models as well as a distortion feature, measuring by how much the order of biphrases in the candidate translation deviates from their order in the source sentence.
Phrase-based Decoding as TSP
The language model cost of producing the target words of b' right after the target words of b; with a bigram language model, this cost can be precomputed directly from b and b'.
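Under a bigram model, this boundary cost depends only on the last target word of b and the target words of b', which is why it can be tabulated before decoding. A minimal sketch of that precomputation (toy probabilities and hypothetical names, not code from the paper):

```python
import math

# Toy bigram log-probabilities; in practice these come from a trained LM.
BIGRAM_LOGPROB = {
    ("this", "machine"): math.log(0.2),
    ("machine", "translates"): math.log(0.4),
}
UNSEEN = math.log(1e-6)  # crude floor for unseen bigrams

def lm_transition_cost(b_target, b_prime_target):
    """Log-probability of producing the target words of b' right after
    the target words of b; only the last word of b matters for a bigram LM,
    so this value can be precomputed for every pair of biphrases."""
    cost, prev = 0.0, b_target[-1]
    for word in b_prime_target:
        cost += BIGRAM_LOGPROB.get((prev, word), UNSEEN)
        prev = word
    return cost

print(lm_transition_cost(["this"], ["machine", "translates"]))
```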
Phrase-based Decoding as TSP
Successful phrase-based systems typically employ language models of order higher than two.
Phrase-based Decoding as TSP
If we want to extend the power of the model to general n-gram language models, and in particular to the 3-gram
language model is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Hirao, Tsutomu and Suzuki, Jun and Isozaki, Hideki
A Syntax Free Sequence-oriented Sentence Compression Method
As an alternative to syntactic parsing, we propose two novel features, intra-sentence positional term weighting (IPTW) and the patched language model (PLM) for our syntax-free sentence compressor.
A Syntax Free Sequence-oriented Sentence Compression Method
3.2.2 Patched Language Model
A Syntax Free Sequence-oriented Sentence Compression Method
Many studies on sentence compression employ the n-gram language model to evaluate the linguistic likelihood of a compressed sentence.
Abstract
As an alternative to syntactic parsing, we propose a novel term weighting technique based on the positional information within the original sentence and a novel language model that combines statistics from the original sentence and a general corpus.
Conclusions
As an alternative to the syntactic parser, we proposed two novel features, Intra-sentence positional term weighting (IPTW) and the Patched language model (PLM), and showed their effectiveness by conducting automatic and human evaluations.
Experimental Evaluation
We developed the n-gram language model from a 9-year set of Mainichi Newspaper articles.
Introduction
To maintain the subject-predicate relationship in the compressed sentence and retain fluency without using syntactic parsers, we propose two novel features: intra-sentence positional term weighting (IPTW) and the patched language model (PLM).
Introduction
PLM is a form of summarization-oriented fluency statistics derived from the original sentence and the general language model.
Results and Discussion
Replacing PLM with the bigram language model (w/o PLM) degrades the performance significantly.
Results and Discussion
This result shows that the n-gram language model is improper for sentence compression because the n-gram probability is computed by using a corpus that includes both short and long sentences.
language model is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Mochihashi, Daichi and Yamada, Takeshi and Ueda, Naonori
Abstract
Our model is a nested hierarchical Pitman-Yor language model, where a Pitman-Yor spelling model is embedded in the word model.
Abstract
Our model is also considered as a way to construct an accurate word n-gram language model directly from characters of arbitrary language, without any “word” indications.
Introduction
Bayesian Kneser-Ney) language model, with an accurate character ∞-gram Pitman-Yor spelling model embedded in word models.
Introduction
Furthermore, it can be viewed as a method for building a high-performance n-gram language model directly from character strings of arbitrary language.
Introduction
we briefly describe a language model based on the Pitman-Yor process (Teh, 2006b), which is a generalization of the Dirichlet process used in previous research.
Nested Pitman-Yor Language Model
In contrast, in this paper we use a simple but more elaborate model, that is, a character n-gram language model that also employs HPYLM.
Nested Pitman-Yor Language Model
Figure 2: Chinese restaurant representation of our Nested Pitman-Yor Language Model (NPYLM).
Pitman-Yor process and n-gram models
To compute a probability p(w|s) in (1), we adopt a Bayesian language model lately proposed by (Teh, 2006b; Goldwater et al., 2005) based on the Pitman-Yor process, a generalization of the Dirichlet process.
Pitman-Yor process and n-gram models
As a result, the n-gram probability of this hierarchical Pitman-Yor language model (HPYLM) is recursively computed as
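The recursion alluded to here is the standard hierarchical Pitman-Yor predictive probability (Teh, 2006b), which interpolates the counts observed after context h with the model for the shortened context h':

$$p(w \mid h) \,=\, \frac{c(w \mid h) - d_{|h|}\, t_{hw}}{\theta_{|h|} + c(h)} \,+\, \frac{\theta_{|h|} + d_{|h|}\, t_{h\cdot}}{\theta_{|h|} + c(h)}\; p(w \mid h'),$$

where d and θ are the discount and strength parameters at depth |h|, and t_{hw} is the number of tables serving w in the Chinese restaurant representation.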
Pitman-Yor process and n-gram models
When we set t_hw ≡ 1, (4) recovers Kneser-Ney smoothing: thus an HPYLM is a Bayesian Kneser-Ney language model as well as an extension of the hierarchical Dirichlet process (HDP) used in Goldwater et al.
language model is mentioned in 18 sentences in this paper.
Topics mentioned in this paper:
Zhao, Shiqi and Lan, Xiang and Liu, Ting and Li, Sheng
Experimental Setup
The language model is trained using a 9 GB English corpus.
Statistical Paraphrase Generation
Our SPG model contains three sub-models: a paraphrase model, a language model, and a usability model, which control the adequacy, fluency,
Statistical Paraphrase Generation
Language Model: We use a trigram language model in this work.
Statistical Paraphrase Generation
The language model-based score for the paraphrase t is computed as:
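The formula itself is cut off in this snippet; for a trigram model the score is presumably the usual chain of conditional probabilities over the paraphrase t = t_1 … t_n,

$$p_{lm}(t) \,=\, \prod_{i=1}^{n} p(t_i \mid t_{i-2},\, t_{i-1}),$$

typically used in log form.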
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Li, Zhifei and Eisner, Jason and Khudanpur, Sanjeev
Experimental Results
We also used a 5-gram language model with modified Kneser-Ney smoothing (Chen and Goodman, 1998), trained on a data set consisting of 130M words from the English Gigaword corpus (LDC2007T07) and the English side of the parallel corpora.
Experimental Results
We use GIZA++ (Och and Ney, 2000), a suffix-array (Lopez, 2007), SRILM (Stolcke, 2002), and risk-based deterministic annealing (Smith and Eisner, 2006) to obtain word alignments, translation models, language models, and the optimal weights for combining these models, respectively.
Variational Approximate Decoding
Of course, this last point also means that our computation becomes intractable as n → ∞. However, if p(y | x) is defined by a hypergraph HG(x) whose structure explicitly incorporates an m-gram language model, both training and decoding will be efficient when m ≥ n. We will give algorithms for this case that are linear in the size of HG(x).
Variational Approximate Decoding
A reviewer asks about the interaction with backed-off language models.
Variational Approximate Decoding
We sketch a method that works for any language model given by a weighted FSA, L. The variational family Q can be specified by any deterministic weighted FSA, Q, with weights parameterized by δ.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Pitler, Emily and Louis, Annie and Nenkova, Ani
Classification Results
The language model features were completely useless for distinguishing contingencies from
Features for sense prediction of implicit discourse relations
For each sense, we created unigram and bigram language models over the implicit examples in the training set.
Features for sense prediction of implicit discourse relations
We compute each example’s probability according to each of these language models.
Features for sense prediction of implicit discourse relations
of the spans’ likelihoods according to the various language models.
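A minimal sketch of this kind of per-sense language model feature, assuming simple add-one-smoothed bigram models (names are illustrative, not from the paper):

```python
from collections import Counter
from math import log

def train_bigram_lm(spans):
    """Add-one-smoothed bigram model over the training spans of one sense."""
    unigrams, bigrams = Counter(), Counter()
    for toks in spans:
        toks = ["<s>"] + toks
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)

    def logprob(toks):
        toks = ["<s>"] + toks
        return sum(log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(toks, toks[1:]))

    return logprob

# One model per sense; each example's likelihood under each model is a feature.
sense_lms = {
    "Contingency": train_bigram_lm([["because", "it", "rained"]]),
    "Comparison":  train_bigram_lm([["but", "it", "rained"]]),
}
features = {sense: lm(["it", "rained"]) for sense, lm in sense_lms.items()}
print(features)
```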
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Weerkamp, Wouter and Balog, Krisztian and de Rijke, Maarten
Related Work
In the setting of language modeling approaches to query expansion, the local analysis idea has been instantiated by estimating additional query language models (Lafferty and Zhai, 2003; Tao and Zhai, 2006) or relevance models (Lavrenko and Croft, 2001) from a set of feedback documents.
Related Work
(2005) also try to uncover multiple aspects of a query, and to that end they provide an iterative “pseudo-query” generation technique, using cluster-based language models.
Related Work
Diaz and Metzler (2006) were the first to give a systematic account of query expansion using an external corpus in a language modeling setting, to improve the estimation of relevance models.
Retrieval Framework
We work in the setting of generative language models.
Retrieval Framework
Within the language modeling approach, one builds a language model from each document, and ranks documents based on the probability of the document model generating the query.
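In its simplest form, this query-likelihood ranking scores a document d for a query q by the probability that the (smoothed) document language model θ_d generates the query terms:

$$score(d, q) \,=\, \prod_{w \in q} p(w \mid \theta_d),$$

computed in log space in practice, with smoothing ensuring that query terms absent from the document do not zero out the score.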
Retrieval Framework
The particulars of the language modeling approach have been discussed extensively in the literature (see, e.g., Balog et al.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Huang, Fei and Yates, Alexander
Related Work
Sparsity for low-order contexts has recently spurred interest in using latent variables to represent distributions over contexts in language models.
Related Work
While n-gram models have traditionally dominated in language modeling, two recent efforts de-
Related Work
Several authors investigate neural network models that learn not just one latent state, but rather a vector of latent variables, to represent each word in a language model (Bengio et al., 2003; Emami et al., 2003; Morin and Bengio, 2005).
Smoothing Natural Language Sequences
2.3 Latent Variable Language Model Representation
Smoothing Natural Language Sequences
Latent variable language models (LVLMs) can be used to produce just such a distributional representation.
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
DeNero, John and Chiang, David and Knight, Kevin
Computing Feature Expectations
The nodes are states in the decoding process that include the span (i, j) of the sentence to be translated, the grammar symbol s over that span, and the left and right context words of the translation relevant for computing n-gram language model scores. Each hyperedge h represents the application of a synchronous rule r that combines nodes corresponding to non-terminals in
Computing Feature Expectations
Decoder states can include additional information as well, such as local configurations for dependency language model scoring.
Computing Feature Expectations
The weight of h is the incremental score contributed to all translations containing the rule application, including translation model features on r and language model features that depend on both r and the English contexts of the child nodes.
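A minimal sketch of the decoder state these snippets describe, i.e. the information a node must carry so that n-gram language model scores can be computed when a hyperedge combines nodes (field names are illustrative, not taken from the paper):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Node:
    """A state in the translation hypergraph."""
    span: Tuple[int, int]            # (i, j): source span covered by this node
    symbol: str                      # grammar symbol over that span
    left_context: Tuple[str, ...]    # leftmost target words, for LM scoring on the left
    right_context: Tuple[str, ...]   # rightmost target words, for LM scoring on the right

# With an n-gram LM, n-1 boundary words on each side suffice; translations that
# agree on span, symbol, and boundary words share a single node.
left = Node((0, 3), "X", ("the",), ("house",))
right = Node((3, 5), "X", ("is",), ("green",))
print(left, right)
```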
Experimental Results
All four systems used two language models: one trained from the combined English sides of both parallel texts, and another, larger, language model trained on 2 billion words of English text (1 billion for Chinese-English SBMT).
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Kim, Jungi and Li, Jin-Ji and Lee, Jong-Hyeok
Experiment
For the relevance retrieval model, we faithfully reproduce the passage-based language model with pseudo-relevance feedback (Lee et al., 2008).
Term Weighting and Sentiment Analysis
IR models, such as Vector Space (VS), probabilistic models such as BM25, and Language Modeling (LM), albeit in different forms of approach and measure, employ heuristics and formal modeling approaches to effectively evaluate the relevance of a term to a document (Fang et al., 2004).
Term Weighting and Sentiment Analysis
In our experiments, we use the Vector Space model with Pivoted Normalization (VS), Probabilistic model (BM25), and Language modeling with Dirichlet Smoothing (LM).
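The Dirichlet-smoothed document model used in such language modeling retrieval is standardly computed by interpolating the document's term counts with the collection model, with document length controlling how much smoothing is applied:

$$p(w \mid d) \,=\, \frac{c(w, d) + \mu\, p(w \mid C)}{|d| + \mu},$$

where c(w, d) is the count of w in d, |d| the document length, p(w | C) the collection language model, and μ the smoothing parameter.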
Term Weighting and Sentiment Analysis
With proper assumptions and derivations, p(w | d) can be related to language modeling approaches.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Li, Mu and Duan, Nan and Zhang, Dongdong and Li, Chi-Ho and Zhou, Ming
Collaborative Decoding
Similar to a language model score, n-gram consensus-based feature values cannot be summed up from smaller hypotheses.
Discussion
They also empirically show that n-gram agreement is the most important factor for improvement apart from language models.
Experiments
The language model used for all models (including decoding models and system combination models described in Section 2.6) is a 5-gram model trained with the English part of the bilingual data and the Xinhua portion of the LDC English Gigaword corpus version 3.
Experiments
We parsed the language model training data with Berkeley parser, and then trained a dependency language model based on the parsing output.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang and Lü, Yajuan and Liu, Qun
Decoding
1 to a log-linear model (Och and Ney, 2002) that uses the following eight features: relative frequencies in two directions, lexical weights in two directions, number of rules used, language model score, number of target words produced, and the probability of matched source tree (Mi et al., 2008).
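The log-linear framework mentioned here (Och and Ney, 2002) combines these feature functions h_i under tuned weights λ_i,

$$p(e \mid f) \,\propto\, \exp\Bigl(\sum_{i} \lambda_i\, h_i(e, f)\Bigr),$$

so the decoder searches for the translation maximizing the weighted feature sum.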
Decoding
We use the cube pruning method (Chiang, 2007) to approximately intersect the translation forest with the language model.
Experiments
A trigram language model was trained on the English sentences of the training corpus.
Related Work
In machine translation, the concept of packed forest was first used by Huang and Chiang (2007) to characterize the search space of decoding with language models.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
McIntyre, Neil and Lapata, Mirella
Introduction
And Knight and Hatzivassiloglou (1995) use a language model for selecting a fluent sentence among the vast number of surface realizations corresponding to a single semantic representation.
Introduction
The top-ranked candidate is selected for presentation and verbalized using a language model interfaced with RealPro (Lavoie and Rambow, 1997), a text generation engine.
The Story Generator
Since we do not know a priori which of these parameters will result in a grammatical sentence, we generate all possible combinations and select the most likely one according to a language model.
The Story Generator
We used the SRI toolkit to train a trigram language model on the British National Corpus, with interpolated Kneser-Ney smoothing and perplexity as the scoring metric for the generated sentences.
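A minimal sketch of this selection step, assuming tokenized candidate realizations and a trigram table are already available (toy data and hypothetical names; the actual system uses SRILM together with RealPro):

```python
from math import log

def trigram_logprob(tokens, lm, floor=log(1e-7)):
    """Score a candidate sentence with a trigram LM stored as a dict of
    log-probabilities keyed by (w1, w2, w3); unseen trigrams get a floor."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    return sum(lm.get(tuple(padded[i:i + 3]), floor) for i in range(len(padded) - 2))

def best_realization(candidates, lm):
    """Pick the surface realization the language model finds most likely."""
    return max(candidates, key=lambda toks: trigram_logprob(toks, lm))

toy_lm = {("<s>", "<s>", "the"): log(0.3), ("<s>", "the", "dog"): log(0.2)}
print(best_realization([["the", "dog", "barked"], ["dog", "the", "barked"]], toy_lm))
```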
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Song, Young-In and Lee, Jung-Tae and Rim, Hae-Chang
Previous Work
These approaches have focused on modeling statistical or syntactic phrasal relations under the language modeling method for information retrieval.
Previous Work
(Srikanth and Srihari, 2003; Maisonnasse et al., 2005) examined the effectiveness of syntactic relations in a query by using a language modeling framework.
Previous Work
(Song and Croft, 1999; Miller et al., 1999; Gao et al., 2004; Metzler and Croft, 2005) investigated the effectiveness of language modeling approach in modeling statistical phrases such as n-grams or proximity-based phrases.
Proposed Method
We start out by presenting a simple phrase-based language modeling retrieval model that assumes uniform contribution of words and phrases.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zhao, Hai and Song, Yan and Kit, Chunyu and Zhou, Guodong
Treebank Translation and Dependency Transformation
In detail, word-based decoding is used, which adopts a log-linear framework as in (Och and Ney, 2002) with only two features, a translation model and a language model,
Treebank Translation and Dependency Transformation
is the language model, a word trigram model trained from the CTB.
Treebank Translation and Dependency Transformation
Thus the decoding process is actually only determined by the language model.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang and Mi, Haitao and Feng, Yang and Liu, Qun
Experiments
For language model, we used the SRI Language Modeling Toolkit (Stolcke, 2002) to train a 4-gram model on the Xinhua portion of GIGAWORD corpus.
Joint Decoding
There are also features independent of derivations, such as language model and word penalty.
Joint Decoding
Although left-to-right decoding might enable a more efficient use of language models and hopefully produce better translations, we adopt bottom-up decoding in this paper just for convenience.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Setiawan, Hendra and Kan, Min Yen and Li, Haizhou and Resnik, Philip
Experimental Setup
For the language model, we used a 5-gram model with modified Kneser-Ney smoothing (Kneser and Ney, 1995) trained on the English side of our training data as well as portions of the Gigaword v2 English corpus.
Experimental Setup
For the language model, we used a 5-gram model trained on the English portion of the whole training data plus portions of the Gigaword v2 corpus.
Hierarchical Phrase-based System
Given e and f as the source and target phrases associated with the rule, typical features used are the rule’s translation probability P_trans(f|e) and its inverse P_trans(e|f), and the lexical probability P_lex(f|e) and its inverse P_lex(e|f). Systems generally also employ a word penalty, a phrase penalty, and a target language model feature.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Sun, Jun and Zhang, Min and Tan, Chew Lim
Experiments
In the experiments, we train the translation model on the FBIS corpus (7.2M (Chinese) + 9.2M (English) words) and train a 4-gram language model on the Xinhua portion of the English Gigaword corpus (181M words) using the SRILM Toolkit (Stolcke,
NonContiguous Tree sequence Align-ment-based Model
2) The bi-lexical translation probabilities 3) The target language model
The Pisces decoder
On the other hand, to simplify the computation of the language model, we only compute it for source-side contiguous translational hypotheses, while neglecting gaps in the target side, if any.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Yang, Hui and Callan, Jamie
The Features
It is built into a unigram language model without smoothing for each term.
The Features
This feature function measures the Kullback-Leibler divergence (KL divergence) between the language models associated with the two inputs.
The Features
Similarly, the local context is built into a unigram language model without smoothing for each term; the feature function outputs KL divergence between the models.
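A minimal sketch of such a KL-divergence feature over two unsmoothed unigram context models (names and data are illustrative, not from the paper):

```python
from collections import Counter
from math import log

def unigram_lm(tokens):
    """Unsmoothed unigram distribution over a term's (local or global) context."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q); q is floored because the models are unsmoothed."""
    return sum(pw * log(pw / q.get(w, eps)) for w, pw in p.items())

ctx_a = unigram_lm("fruit apple tree apple orchard".split())
ctx_b = unigram_lm("apple computer software company".split())
print(kl_divergence(ctx_a, ctx_b))
```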
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wang, Haifeng and Guo, Yuqing and Liu, Ting
Introduction
One is n-gram model over different units, such as word-level bigram/trigram models (Bangalore and Rambow, 2000; Langkilde, 2000), or factored language models integrated with syntactic tags (White et al.
Introduction
(2009) present a dependency-spanning tree algorithm for word ordering, which first builds dependency trees to decide linear precedence between heads and modifiers then uses an n-gram language model to order siblings.
Log-linear Models
We linearize the dependency relations by computing n-gram models, similar to traditional word-based language models, except using the names of dependency relations instead of words.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: