Index of papers in Proc. ACL 2012 that mention
  • language model
Chen, Wenliang and Zhang, Min and Li, Haizhou
Abstract
In this paper, we present an approach to enriching high-order feature representations for graph-based dependency parsing models using a dependency language model and beam search.
Abstract
The dependency language model is built on a large amount of additional auto-parsed data that is processed by a baseline parser.
Abstract
Based on the dependency language model, we represent a set of features for the parsing model.
Dependency language model
Language models play a very important role in statistical machine translation (SMT).
Dependency language model
The standard N-gram based language model predicts the next word based on the N-1 immediate previous words.
Dependency language model
However, the traditional N-gram language model cannot capture long-distance word relations.
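For reference, the approximation these two excerpts describe, in standard textbook notation rather than the paper's own: an N-gram model conditions only on the N-1 preceding words,

    P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1}, \dots, w_{i-1}),

so any dependency on a word more than N-1 positions back is invisible to it, which is the long-distance limitation the dependency language model is meant to address.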
Introduction
In this paper, we solve this issue by enriching the feature representations for a graph-based model using a dependency language model (DLM) (Shen et al., 2008).
Introduction
• We utilize the dependency language model to enhance the graph-based parsing model.
Parsing with dependency language model
In this section, we propose a parsing model which includes the dependency language model by extending the model of McDonald et al.
language model is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Zweig, Geoffrey and Platt, John C. and Meek, Christopher and Burges, Christopher J.C. and Yessenalina, Ainur and Liu, Qiang
Abstract
We tackle the problem with two approaches: methods that use local lexical information, such as the n-grams of a classical language model; and methods that evaluate global coherence, such as latent semantic analysis.
Introduction
To investigate the usefulness of local information, we evaluated n-gram language model scores, from both a conventional model with Good-Turing smoothing, and with a recently proposed maximum-entropy class-based n-gram model (Chen, 2009a; Chen, 2009b).
Introduction
Also in the language modeling vein, but with potentially global context, we evaluate the use of a recurrent neural network language model.
Introduction
In all the language modeling approaches, a model is used to compute a sentence probability with each of the potential completions.
Related Work
The KU system uses just an N-gram language model to do this ranking.
Related Work
The UNT system uses a large variety of information sources, and a language model score receives the highest weight.
Sentence Completion via Language Modeling
Perhaps the most straightforward approach to solving the sentence completion task is to form the complete sentence with each option in turn, and to evaluate its likelihood under a language model.
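A minimal sketch of that scoring loop, assuming a generic log-probability function supplied by whichever language model is being evaluated; the function names and the toy scorer are illustrative, not from the paper:

    def complete(stem, options, log_prob):
        """Pick the completion whose full sentence the language model scores highest.

        stem     -- sentence with a blank marker, e.g. "I ___ to the store"
        options  -- candidate words for the blank
        log_prob -- callable returning the log-probability of a full sentence
        """
        candidates = [stem.replace("___", w) for w in options]
        scores = [log_prob(s) for s in candidates]
        return options[max(range(len(options)), key=lambda i: scores[i])]

    # Toy usage: a scorer that merely counts known bigrams stands in for a real LM.
    known = {("went", "to"), ("to", "the"), ("the", "store")}
    toy_score = lambda s: float(sum(b in known for b in zip(s.split(), s.split()[1:])))
    print(complete("I ___ to the store", ["went", "banana"], toy_score))  # -> went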
Sentence Completion via Language Modeling
In this section, we describe the suite of state-of-the-art language modeling techniques for which we will present results.
Sentence Completion via Language Modeling
3.1 Backoff N-gram Language Model
language model is mentioned in 27 sentences in this paper.
Topics mentioned in this paper:
Sim, Khe Chai
A Probabilistic Formulation for HVR
where P(W) can be modelled by the word-based n-gram language model (Chen and Goodman, 1996) commonly used in automatic speech recognition.
A Probabilistic Formulation for HVR
• Language model score: P(W)
A Probabilistic Formulation for HVR
Note that the acoustic model and language model scores are already used in the conventional ASR.
Abstract
In addition to the acoustic and language models used in automatic speech recognition systems, HVR uses the haptic and partial lexical models as additional knowledge sources to reduce the recognition search space and suppress confusions.
Experimental Results
These sentences contain a variety of given names, surnames and city names so that confusions cannot be easily resolved using a language model.
Experimental Results
The ASR system used in all the experiments reported in this paper consists of a set of HMM-based triphone acoustic models and an n-gram language model.
Experimental Results
A bigram language model with a vocabulary size of 200 words was used for testing.
Haptic Voice Recognition (HVR)
In conventional ASR, acoustically similar word sequences are typically resolved implicitly using a language model where contexts of neighboring words are used for disambiguation.
Integration of Knowledge Sources
where the four terms denote the WFST representations of the acoustic model, language model, PLI model and haptic model, respectively.
Integration of Knowledge Sources
(2002) has shown that Hidden Markov Models (HMMs) and n-gram language models can be viewed as WFSTs.
Introduction
In addition to the acoustic model and language model used in ASR, haptic model and partial lexical model are also introduced to facilitate the integration of more sophisticated haptic events, such as the keystrokes, into HVR.
language model is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Pauls, Adam and Klein, Dan
Abstract
We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context.
Abstract
We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours.
Introduction
N-gram language models are a central component of all speech recognition and machine translation systems, and a great deal of research centers around refining models (Chen and Goodman, 1998), efficient storage (Pauls and Klein, 2011; Heafield, 2011), and integration into decoders (Koehn, 2004; Chiang, 2005).
Introduction
At the same time, because n-gram language models only condition on a local window of linear word-level context, they are poor models of long-range syntactic dependencies.
Introduction
Although several lines of work have proposed generative syntactic language models that improve on n-gram models for moderate amounts of data (Chelba, 1997; Xu et al., 2002; Charniak, 2001; Hall, 2004; Roark,
Treelet Language Modeling
The common denominator of most n-gram language models is that they assign probabilities roughly according to empirical frequencies for observed n-grams, but fall back to distributions conditioned on smaller contexts for unobserved n-grams, as shown in Figure 1(a).
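A hedged sketch of that fall-back behaviour, using a stupid-backoff-style penalty purely for illustration; the 0.4 factor, the toy counts and the function name are assumptions, not the smoothing used in the paper:

    from collections import Counter

    def backoff_score(ngram, counts, alpha=0.4):
        """Score an n-gram from empirical counts, backing off to shorter contexts.

        counts -- Counter over word tuples of every order (unigrams included)
        alpha  -- penalty applied each time the context is shortened
        """
        if len(ngram) == 1:
            total = sum(c for g, c in counts.items() if len(g) == 1)
            return counts[ngram] / total if total else 0.0
        context = ngram[:-1]
        if counts[ngram] > 0 and counts[context] > 0:
            return counts[ngram] / counts[context]      # observed: empirical frequency
        return alpha * backoff_score(ngram[1:], counts)  # unobserved: shorter context

    corpus = "the cat sat on the mat".split()
    counts = Counter(tuple(corpus[i:i + n]) for n in (1, 2, 3)
                     for i in range(len(corpus) - n + 1))
    print(backoff_score(("the", "cat", "sat"), counts))  # observed trigram
    print(backoff_score(("sat", "the", "mat"), counts))  # unseen: backs off to ("the", "mat")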
Treelet Language Modeling
to use back-off-based smoothing for syntactic language modeling — such techniques have been applied to models that condition on headword contexts (Charniak, 2001; Roark, 2004; Zhang, 2009).
language model is mentioned in 31 sentences in this paper.
Topics mentioned in this paper:
Danescu-Niculescu-Mizil, Cristian and Cheng, Justin and Kleinberg, Jon and Lee, Lillian
Hello. My name is Inigo Montoya.
First, we show a concrete sense in which memorable quotes are indeed distinctive: with respect to lexical language models trained on the newswire portions of the Brown corpus [21], memorable quotes have significantly lower likelihood than their non-memorable counterparts.
Hello. My name is Inigo Montoya.
In particular, we analyze a corpus of advertising slogans, and we show that these slogans have significantly greater likelihood at both the word level and the part-of-speech level with respect to a language model trained on memorable movie quotes, compared to a corresponding language model trained on non-memorable movie quotes.
Never send a human to do a machine’s job.
In order to assess different levels of lexical and syntactic distinctiveness, we employ a total of six Laplace-smoothed language models: 1-gram, 2-gram, and 3-gram word LMs and 1-gram, 2-gram and 3-gram part-of-speech LMs.
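For reference, add-one (Laplace) smoothing in its usual bigram form; the notation is generic rather than the paper's:

    P_{\mathrm{Laplace}}(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i) + 1}{\mathrm{count}(w_{i-1}) + |V|}

where |V| is the vocabulary size, here the vocabulary of the entire training corpus, as one of the excerpts below notes.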
Never send a human to do a machine’s job.
As indicated in Table 3, for each of our lexical “common language” models, in about 60% of the quote pairs, the memorable quote is more distinctive.
Never send a human to do a machine’s job.
The language models’ vocabulary was that of the entire training corpus.
language model is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Elsner, Micha and Goldwater, Sharon and Eisenstein, Jacob
Abstract
We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features.
Experiments
Nonetheless, it represents phonetic variability more realistically than the Bernstein-Ratner-Brent corpus, while still maintaining the lexical characteristics of infant-directed speech (as compared to the Buckeye corpus, with its much larger vocabulary and more complex language model).
Inference
The language modeling term relating to the intended string again factors into multiple components.
Inference
Because neither the transducer nor the language model are perfect models of the true distribution, they can have incompatible dynamic ranges.
Inference
The transducer scores can be cached since they depend only on surface forms, but the language model scores cannot.
Introduction
Previous models with similar goals have learned from an artificial corpus with a small vocabulary (Driesen et al., 2009; Rasanen, 2011) or have modeled variability only in vowels (Feldman et al., 2009); to our knowledge, this paper is the first to use a naturalistic infant-directed corpus while modeling variability in all segments, and to incorporate word-level context (a bigram language model).
Introduction
Our model is conceptually similar to those used in speech recognition and other applications: we assume the intended tokens are generated from a bigram language model and then distorted by a noisy channel, in particular a log-linear model of phonetic variability.
Introduction
But unlike speech recognition, we have no (intended-form, surface-form) training pairs to train the phonetic model, nor even a dictionary of intended-form strings to train the language model.
Lexical-phonetic model
Our lexical-phonetic model is defined using the standard noisy channel framework: first a sequence of intended word tokens is generated using a language model, and then each token is transformed by a probabilistic finite-state transducer to produce the observed surface sequence.
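Written out generically (the symbols are illustrative, not the paper's notation), the noisy-channel decomposition these excerpts refer to is:

    \hat{x} = \arg\max_x P(x \mid s) = \arg\max_x \underbrace{P(x)}_{\text{bigram language model}} \; \underbrace{P(s \mid x)}_{\text{log-linear phonetic channel}}

where x is the intended token sequence and s the observed surface sequence.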
Related work
In contrast, our model uses a symbolic representation for sounds, but models variability in all segment types and incorporates a bigram word-level language model .
language model is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Huang, Eric and Socher, Richard and Manning, Christopher and Ng, Andrew
Abstract
We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models.
Conclusion
Our new multi-prototype neural language model outperforms previous neural models and competitive baselines on this new dataset.
Experiments
Table 3 shows our results compared to previous methods, including C&W’s language model and the hierarchical log-bilinear (HLBL) model (Mnih and Hinton, 2008), which is a probabilistic, linear neural model.
Global Context-Aware Neural Language Model
Note that Collobert and Weston (2008)’s language model corresponds to the network using only local context.
Introduction
We introduce a new neural-network-based language model that distinguishes and uses both local and global context via a joint training objective.
Introduction
We show that our multi-prototype model improves upon the single-prototype version and outperforms other neural language models and baselines on this dataset.
Related Work
Neural language models (Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Schwenk and Gauvain, 2002; Emami et al., 2003) have been shown to be very powerful at language modeling, a task where models are asked to accurately predict the next word given previously seen words.
Related Work
Schwenk and Gauvain (2002) tried to incorporate larger context by combining partial parses of past word sequences and a neural language model.
Related Work
They used up to 3 previous head words and showed increased performance on language modeling.
language model is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Zhong, Zhi and Ng, Hwee Tou
Abstract
Together with the senses predicted for words in documents, we propose a novel approach to incorporate word senses into the language modeling approach to IR and also exploit the integration of synonym relations.
Incorporating Senses into Language Modeling Approaches
The next problem is to incorporate the sense information into the language modeling approach.
Incorporating Senses into Language Modeling Approaches
Given a query q and a document d in text collection C, we want to reestimate the language models by making use of the sense information assigned to them.
Incorporating Senses into Language Modeling Approaches
With this language model, the probability of a query term in a document is enlarged by the synonyms of its senses; the more of its synonym senses occur in a document, the higher the probability.
Introduction
We incorporate word senses into the language modeling (LM) approach to IR (Ponte and Croft, 1998), and utilize sense synonym relations to further improve the performance.
The Language Modeling Approach to IR
3.1 The language modeling approach
The Language Modeling Approach to IR
In the language modeling approach to IR, language models are constructed for each query q and each document d in a text collection C. The documents in C are ranked by the distance to a given query q according to the language models.
The Language Modeling Approach to IR
The most commonly used language model in IR is the unigram model, in which terms are assumed to be independent of each other.
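Under this unigram query-likelihood view (a standard formulation, not quoted from the paper), a document d is scored by the probability its language model assigns to the query q:

    P(q \mid d) = \prod_{t \in q} P(t \mid \theta_d)

with \theta_d a smoothed unigram model estimated from d; the sense-based variant above reestimates P(t \mid \theta_d) using the senses and synonyms assigned to the terms.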
language model is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Rastrow, Ariya and Dredze, Mark and Khudanpur, Sanjeev
Abstract
Long-span features, such as syntax, can improve language models for tasks such as speech recognition and machine translation.
Abstract
However, these language models can be difficult to use in practice because of the time required to generate features for rescoring a large hypothesis set.
Abstract
When using these improved tools in a language model for speech recognition, we obtain significant speed improvements with both N-best and hill climbing rescoring, and show that up-training leads to WER reduction.
Conclusion
The computational complexity of accurate syntactic processing can make structured language models impractical for applications such as ASR that require scoring hundreds of hypotheses per input.
Incorporating Syntactic Structures
These are then passed to the language model along with the word sequence for scoring.
Introduction
Language models (LM) are crucial components in tasks that require the generation of coherent natural language text, such as automatic speech recognition (ASR) and machine translation (MT).
Related Work
The lattice parser, therefore, is itself a language model.
Syntactic Language Models
There have been several approaches to include syntactic information in both generative and discriminative language models.
Syntactic Language Models
Structured language modeling incorporates syntactic parse trees to identify the head words in a hypothesis for modeling dependencies beyond n-grams.
Syntactic Language Models
Our Language Model.
language model is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Foster, George and Sankaran, Baskaran and Sarkar, Anoop
Conclusion & Future Work
Future work includes extending this approach to use multiple translation models with multiple language models in ensemble decoding.
Experiments & Results 4.1 Experimental Setup
For the mixture baselines, we used a standard one-pass phrase-based system (Koehn et al., 2003), Portage (Sadat et al., 2005), with the following 7 features: relative-frequency and lexical translation model (TM) probabilities in both directions; word-displacement distortion model; language model (LM) and word count.
Experiments & Results 4.1 Experimental Setup
Fixing the language model allows us to compare various translation model combination techniques.
Introduction
Common techniques for model adaptation adapt two main components of contemporary state-of-the-art SMT systems: the language model and the translation model.
Introduction
However, language model adaptation is a more straightforward problem compared to translation model adaptation, because various measures such as perplexity of adapted language models can be easily computed on data in the target domain.
Related Work 5.1 Domain Adaptation
They use language model perplexities from IN to select relevant sentences from OUT.
language model is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Yamaguchi, Hiroshi and Tanaka-Ishii, Kumiko
Calculation of Cross-Entropy
…, is the cross-entropy of X_i for L_i multiplied by |X_i|. Various methods for computing cross-entropy have been proposed, and these can be roughly classified into two types based on different methods: universal coding and the language model.
Calculation of Cross-Entropy
For example, (Benedetto et al., 2002) and (Cilibrasi and Vitanyi, 2005) used the universal coding approach, whereas (Teahan and Harper, 2001) and (Sibun and Reynar, 1996) were based on language modeling using PPM and Kullback-Leibler divergence, respectively.
Calculation of Cross-Entropy
As a representative method for calculating the cross-entropy through statistical language modeling, we adopt prediction by partial matching (PPM), a language-based encoding method devised by (Cleary and Witten, 1984).
In the experiments reported here, n is set to 5 throughout.
… gives the description length of the remaining characters under the language model for L.
Introduction
They used statistical language modeling and heuristics to detect foreign words and tested the case of English embedded in German texts.
Problem Formulation
In our setting, we assume that a small amount (up to kilobytes) of monolingual plain text sample data is available for every language, e.g., the Universal Declaration of Human Rights, which serves to generate the language model used for language identification.
Problem Formulation
calculates the description length of a text segment X_i through the use of a language model for L_i.
Problem Formulation
Here, the first term corresponds to the code length of the text chunk X_i given a language model for L_i, which in fact corresponds to the cross-entropy of X_i for L_i multiplied by |X_i|. The remaining terms give the code lengths of the parameters used to describe the first term: the second term corresponds to the segment location; the third term, to the identified language; and the fourth term, to the language model of language L_i.
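A schematic rendering of the four-term description length walked through above; the notation is reconstructed from the prose and is an assumption, not the paper's exact formula:

    \mathrm{dl}(X_i, L_i) \approx \underbrace{|X_i|\, H(X_i, L_i)}_{\text{code length of the chunk}} + \underbrace{\log |X|}_{\text{segment location}} + \underbrace{\log N_{\mathrm{lang}}}_{\text{identified language}} + \underbrace{\mathrm{dl}(\theta_{L_i})}_{\text{language model of } L_i}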
language model is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Green, Spence and DeNero, John
A Class-based Model of Agreement
However, in MT, we seek a measure of sentence quality q(e) that is comparable across different hypotheses on the beam (much like the n-gram language model score).
A Class-based Model of Agreement
We trained a simple add-1 smoothed bigram language model over gold class sequences in the same treebank training data:
Experiments
Our distributed 4-gram language model was trained on 600 million words of Arabic text, also collected from many sources including the Web (Brants et al., 2007).
Inference during Translation Decoding
With a trigram language model, the state might be the last two words of the translation prefix.
Introduction
Intuition might suggest that the standard n-gram language model (LM) is sufficient to handle agreement phenomena.
Related Work
Monz (2011) recently investigated parameter estimation for POS-based language models, but his classes did not include inflectional features.
Related Work
One exception was the quadratic-time dependency language model presented by Galley and Manning (2009).
language model is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Wuebker, Joern and Ney, Hermann and Zens, Richard
Abstract
In this work we present two extensions to the well-known dynamic programming beam search in phrase-based statistical machine translation (SMT), aiming at increased efficiency of decoding by minimizing the number of language model computations and hypothesis expansions.
Abstract
Our results show that language model based pre-sorting yields a small improvement in translation quality and a speedup by a factor of 2.
Experimental Evaluation
The English language model is a 4-gram LM created with the SRILM toolkit (Stolcke, 2002) on all bilingual and parts of the provided monolingual data.
Introduction
Research efforts to increase search efficiency for phrase-based MT (Koehn et al., 2003) have explored several directions, ranging from generalizing the stack decoding algorithm (Ortiz et al., 2006) to additional early pruning techniques (Delaney et al., 2006), (Moore and Quirk, 2007) and more efficient language model (LM) querying (Heafield, 2011).
Introduction
… with Language Model LookAhead
Search Algorithm Extensions
2.2 Language Model LookAhead
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Simianer, Patrick and Riezler, Stefan and Dyer, Chris
Experiments
3-gram (news-commentary) and 5-gram (Europarl) language models are trained on the data described in Table 1, using the SRILM toolkit (Stolcke, 2002) and binarized for efficient querying using kenlm (Heafield, 2011).
Experiments
For the 5-gram language models, we replaced every word in the LM training data that did not appear in the English part of the parallel training data with <unk>, to build an open-vocabulary language model.
Experiments
Absolute improvements would be possible, e.g., by using larger language models or by adding news data to the ep training set when evaluating on crawl test sets (see, e.g., Dyer et al.
Introduction
The standard SMT training pipeline combines scores from large count-based translation models and language models with a few other features and tunes these using the well-understood line-search technique for error minimization of Och (2003).
Introduction
The modeler’s goals might be to identify complex properties of translations, or to counter errors of pre-trained translation models and language models by explicitly down-weighting translations that exhibit certain undesired properties.
language model is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Chambers, Nathanael
Experiments and Results
Unigram NLLR and Filtered NLLR are the language model implementations of previous work as described in Section 3.1.
Previous Work
They learned unigram language models (LMs) for specific time periods and scored articles with log-likelihood ratio scores.
Timestamp Classifiers
3.1 Language Models
Timestamp Classifiers
We apply Dirichlet smoothing to the language models (as in de Jong et al.
Timestamp Classifiers
The above language modeling and MaxEnt approaches are token-based classifiers that one could apply to any topic classification domain.
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Nuhn, Malte and Mauser, Arne and Ney, Hermann
Abstract
On the task shown in (Ravi and Knight, 2011) we obtain better results with only 5% of the computational effort when running our method with an n-gram language model.
Introduction
Combining Language Models and
Training Algorithm and Implementation
As described in Section 4, the overall procedure is divided into two alternating steps: After initialization we first perform EM training of the translation model for 20-30 iterations using a 2-gram or 3-gram language model in the target language.
Training Algorithm and Implementation
The generative story described in Section 3 is implemented as a cascade of a permutation, insertion, lexicon, deletion and language model finite state transducers using OpenFST (Allauzen et al., 2007).
Translation Model
Stochastically generate the target sentence according to an n-gram language model.
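A minimal sketch of "stochastically generate according to an n-gram language model", assuming the conditional distributions are already available as dictionaries; the function name and toy bigram table are illustrative only, not the paper's generative story in full:

    import random

    def sample_sentence(cond, order=2, max_len=20):
        """Sample words left to right from an n-gram model (order >= 2).

        cond -- maps a context tuple of length order-1 to {next_word: prob};
                "<s>" pads the start, "</s>" ends the sentence.
        """
        context, words = ("<s>",) * (order - 1), []
        for _ in range(max_len):
            dist = cond[context]
            word = random.choices(list(dist), weights=list(dist.values()))[0]
            if word == "</s>":
                break
            words.append(word)
            context = (context + (word,))[-(order - 1):]
        return words

    toy = {("<s>",): {"hello": 1.0},
           ("hello",): {"world": 0.5, "</s>": 0.5},
           ("world",): {"</s>": 1.0}}
    print(sample_sentence(toy))  # e.g. ['hello', 'world'] or ['hello']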
language model is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Ozbal, Gozde and Strapparava, Carlo
Evaluation
Therefore, we implemented a ranking mechanism which used a hybrid scoring method by giving equal weights to the language model and the normalized phonetic similarity.
System Description
To check the likelihood and well-formedness of the new string after the replacement, we learn a 3-gram language model with absolute smoothing.
System Description
For learning the language model, we only consider the words in the CMU pronunciation dictionary which also exist in WordNet.
System Description
We remove the words containing at least one trigram which is very unlikely according to the language model.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Li, Chi-Ho and Li, Mu and Zhou, Ming
Experiments and Results
The features we used are those commonly used in a standard BTG decoder, such as translation probabilities, lexical weights, language model, word penalty and distortion probabilities.
Experiments and Results
The language model is a 5-gram language model trained with the target sentences in the training data.
Experiments and Results
The language model is a 5-gram language model trained with the Giga-Word corpus plus the English sentences in the training data.
Features and Training
We also use other fundamental features, such as translation probabilities, lexical weights, distortion probability, word penalty, and language model probability.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Xiao, Xinyan and Xiong, Deyi and Zhang, Min and Liu, Qun and Lin, Shouxun
Experiments
The monolingual data for training the English language model includes the Xinhua portion of the GIGAWORD corpus, which contains 238M English words.
Experiments
A 4-gram language model was trained on the monolingual data by the SRILM toolkit (Stolcke, 2002).
Related Work
Researchers also introduce topic models for cross-lingual language model adaptation (Tam et al., 2007; Ruiz and Federico, 2011).
Related Work
Based on the bilingual topic model, they apply the source-side topic weights to the target-side topic model, and adapt the n-gram language model of the target side.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wu, Hua and Wang, Haifeng and Liu, Ting
Experiments
We used SRILM for the training of language models (5-gram in all the experiments).
Experiments
We trained a Chinese language model for the EC translation on the Chinese part of the bi-text.
Experiments
For the English language model of CE translation, an extra corpus named Tanaka was used besides the English part of the bilingual corpora.
language model is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Konstas, Ioannis and Lapata, Mirella
Experimental Design
…approximately via cube pruning (Chiang, 2007), by integrating a trigram language model extracted from the training set (see Konstas and Lapata (2012) for details).
Experimental Design
Lexical Features These features encourage grammatical coherence and inform lexical selection over and above the limited horizon of the language model captured by Rules (6)-(9).
Problem Formulation
In machine translation, a decoder that implements forest rescoring (Huang and Chiang, 2007) uses the language model as an external criterion of the goodness of sub-translations on account of their grammaticality.
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Kolachina, Prasanth and Cancedda, Nicola and Dymetman, Marc and Venkatapathy, Sriram
Inferring a learning curve from mostly monolingual data
In this section we address scenario S1: we have access to a source-language monolingual collection (from which portions to be manually translated could be sampled) and a target-language in-domain monolingual corpus, to supplement the target side of a parallel corpus while training a language model.
Inferring a learning curve from mostly monolingual data
(b) perplexity of language models of order 2 to 5 derived from the monolingual source corpus computed on the source side of the test corpus.
Inferring a learning curve from mostly monolingual data
The Lasso regression model selected four features from the entire feature set: i) Size of the test set (sentences & tokens), ii) Perplexity of the language model (order 5) on the test set, and iii) Type-token ratio of the target monolingual corpus.
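For reference, the per-token perplexity feature referred to here, in its standard definition (not the paper's notation):

    \mathrm{PPL}(w_1, \dots, w_m) = P(w_1, \dots, w_m)^{-1/m} = \exp\!\Big(-\frac{1}{m} \sum_{i=1}^{m} \log P(w_i \mid w_{<i})\Big)

Lower perplexity of the order-5 model on the test-set source side indicates a closer match between the monolingual corpus and the test domain, which is presumably why it is useful for inferring the learning curve.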
language model is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: