Abstract | We rederive all the steps of KN smoothing to operate on count distributions instead of integral counts, and apply it to two tasks where KN smoothing was not applicable before: one in language model adaptation, and the other in word alignment. |
Introduction | Such cases have been noted for language modeling (Goodman, 2001; Goodman, 2004), domain adaptation (Tam and Schultz, 2008), grapheme-to-phoneme conversion (Bisani and Ney, 2008), and phrase-based translation (Andres-Ferrer, 2010; Wuebker et al., 2012). |
Introduction | One is language model domain adaptation, and the other is word alignment using the IBM models (Brown et al., 1993). |
Language model adaptation | N-gram language models are widely used in applications like machine translation and speech recognition to select fluent output sentences. |
Language model adaptation | Here, we propose to assign each sentence a probability to indicate how likely it is to belong to the domain of interest, and train a language model using expected KN smoothing. |
Language model adaptation | They first train two language models, $p_{\text{in}}$ on a set of in-domain data and $p_{\text{out}}$ on a set of general-domain data. |
Related Work | This method subtracts D directly from the fractional counts, zeroing out counts that are smaller than D. The discount D must be set by minimizing an error metric on held-out data using a line search (Tam, p. c.) or Powell’s method (Bisani and Ney, 2008), requiring repeated estimation and evaluation of the language model . |
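The discounting step described in this excerpt can be illustrated with a minimal sketch. The function name, the dictionary layout of the counts, and the toy values are assumptions made for illustration. The search for D on held-out data is not shown.

```python
def discount_fractional_counts(counts, D):
    """Subtract D from each fractional count, zeroing out counts smaller than D."""
    return {ngram: max(c - D, 0.0) for ngram, c in counts.items()}

# Toy fractional bigram counts (hypothetical values).
counts = {("the", "cat"): 1.7, ("a", "cat"): 0.3, ("the", "dog"): 0.9}
discounted = discount_fractional_counts(counts, D=0.5)
```

In this toy example the count 0.3 is smaller than D and is therefore zeroed out, which is the behaviour the excerpt describes.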
Smoothing on integral counts | Before presenting our method, we review KN smoothing on integer counts as applied to language models , although, as we will demonstrate in Section 7, KN smoothing is applicable to other tasks as well. |
Abstract | Most current data selection methods use only language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus. |
Abstract | By contrast, we argue that the relevance between a sentence pair and the target domain can be better evaluated by combining a language model with a translation model. |
Introduction | Current data selection methods mostly use language models trained on small scale in-domain data to measure domain relevance and select domain-relevant parallel sentence pairs to expand training corpora. |
Introduction | To overcome the problem, we first propose a method that combines a translation model with a language model in data selection. |
Introduction | The language model measures the domain-specific generation probability of sentences and is used to select domain-relevant sentences on both the source and target sides. |
Related Work | Existing data selection methods are mostly based on language models. |
Related Work | (2010) ranked the sentence pairs in the general-domain corpus according to the perplexity scores of sentences, which are computed with respect to in-domain language models . |
Related Work | (2011) improved the perplexity-based approach and proposed bilingual cross-entropy difference as a ranking function with in- and general-domain language models. |
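As a rough illustration of a bilingual cross-entropy difference score, the sketch below sums, over the source and target sides, the cross-entropy under an in-domain model minus the cross-entropy under a general-domain model. The add-one-smoothed unigram models, all function names, and the toy sentences are assumptions made to keep the example self-contained. The cited work uses n-gram language models.

```python
import math
from collections import Counter

def train_unigram(sentences, vocab):
    """Add-one-smoothed unigram model over a fixed vocabulary (a stand-in LM)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def cross_entropy(model, sentence):
    """Per-word cross-entropy (in nats) of a sentence under a unigram model."""
    words = sentence.split()
    return -sum(math.log(model[w]) for w in words) / len(words)

def make_models(in_data, gen_data):
    """Train (in-domain, general-domain) models over a shared vocabulary."""
    vocab = {w for s in in_data + gen_data for w in s.split()}
    return train_unigram(in_data, vocab), train_unigram(gen_data, vocab)

def bilingual_ce_difference(src, tgt, src_models, tgt_models):
    """Sum of (in-domain minus general-domain) cross-entropies on both sides."""
    in_src, gen_src = src_models
    in_tgt, gen_tgt = tgt_models
    return ((cross_entropy(in_src, src) - cross_entropy(gen_src, src))
            + (cross_entropy(in_tgt, tgt) - cross_entropy(gen_tgt, tgt)))

# Toy data (hypothetical): a "medical" in-domain set and a mixed general set.
src_models = make_models(
    ["doctor treats patient", "patient takes medicine"],
    ["stocks fell today", "doctor treats patient", "rain fell today"])
tgt_models = make_models(
    ["medico trata paciente", "paciente toma medicina"],
    ["acciones cayeron hoy", "medico trata paciente", "lluvia cayeron hoy"])

in_score = bilingual_ce_difference(
    "doctor treats patient", "medico trata paciente", src_models, tgt_models)
gen_score = bilingual_ce_difference(
    "stocks fell today", "acciones cayeron hoy", src_models, tgt_models)
```

Under this scoring, lower values indicate sentence pairs that look more like the in-domain data, so a selection step would keep the lowest-scoring pairs.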
Abstract | R2NN is a combination of a recursive neural network and a recurrent neural network, and in turn integrates their respective capabilities: (1) new information can be used to generate the next hidden state, as in recurrent neural networks, so that the language model and translation model can be integrated naturally; (2) a tree structure can be built, as in recursive neural networks, so as to generate the translation candidates in a bottom-up manner. |
Experiments and Results | The language model is a 5-gram language model trained with the target sentences in the training data. |
Introduction | Recurrent neural networks are leveraged to learn language models, and they keep the history information circularly inside the network for an arbitrarily long time (Mikolov et al., 2010). |
Introduction | DNN is also introduced to Statistical Machine Translation (SMT) to learn several components or features of conventional framework, including word alignment, language modelling , translation modelling and distortion modelling. |
Introduction | In recursive neural networks, all the representations of nodes are generated based on their child nodes, and it is difficult to integrate additional global information, such as language model and distortion model. |
Our Model | Recurrent neural networks are usually used for sequence processing, such as language modeling (Mikolov et al., 2010). |
Our Model | Commonly used sequence processing methods, such as the Hidden Markov Model (HMM) and the n-gram language model, only use a limited history for the prediction. |
Our Model | In an HMM, the previous state is used as the history, and for an n-gram language model (for example, with n equal to 3), the history is the previous two words. |
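To make the limited history concrete, the sketch below extracts the n-1 words an n-gram model would condition on when predicting the word at a given position. The function name and the example sentence are hypothetical.

```python
def ngram_history(words, i, n=3):
    """Return the (at most) n-1 words preceding position i."""
    return tuple(words[max(0, i - (n - 1)):i])

words = "the cat sat on the mat".split()
# History a trigram model would use to predict words[4].
history = ngram_history(words, 4, n=3)
```

With n = 3 the history is the two preceding words. Near the start of the sentence it is shorter, which real toolkits usually handle with sentence-boundary padding symbols.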
Related Work | (2013) extend the recurrent neural network language model in order to use both the source and target side information to score translation candidates. |
Abstract | Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task specific metric, such as BLEU in machine translation. |
Abstract | We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion. |
Expected BLEU Training | We integrate the recurrent neural network language model as an additional feature into the standard log-linear framework of translation (Och, 2003). |
Expected BLEU Training | We summarize the weights of the recurrent neural network language model as $\theta = \{U, W, V\}$ and add the model as an additional feature to the log-linear translation model using the simplified notation $s_\theta(w_t) = s(w_t \mid w_1 \ldots w_{t-1}, h_{t-1})$: |
Expected BLEU Training | which computes a sentence-level language model score as the sum of individual word scores. |
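A sentence-level score built as a sum of per-word scores can be sketched as follows. The scorer passed in is a hypothetical stand-in that sees only the preceding words. The recurrent model in the excerpt also conditions on a hidden state, which this sketch does not model.

```python
import math

def sentence_score(words, word_log_score):
    """Sum per-word log scores, each conditioned on the preceding words."""
    total = 0.0
    for t, w in enumerate(words):
        total += word_log_score(w, words[:t])
    return total

def toy_log_score(word, history):
    # Hypothetical scorer: uniform probability over a 10-word vocabulary.
    return math.log(0.1)

score = sentence_score(["a", "b", "c"], toy_log_score)
```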
Introduction | In this paper we focus on recurrent neural network architectures which have recently advanced the state of the art in language modeling (Mikolov et al., 2010; Mikolov et al., 2011; Sundermeyer et al., 2013) with several subsequent applications in machine translation (Auli et al., 2013; Kalchbrenner and Blunsom, 2013; Hu et al., 2014). |
Introduction | (2013) who demonstrated that feed-forward network-based language models are more accurate in first-pass decoding than in rescoring. |
Introduction | Decoding with feed-forward architectures is straightforward, since predictions are based on a fixed size input, similar to n-gram language models . |
Recurrent Neural Network LMs | Our model has a similar structure to the recurrent neural network language model of Mikolov et al. |
Evaluation | In §3.3, we then examined the effect of using a very large 5-gram language model trained on 7.5 billion English tokens to understand the nature of the improvements in §3.2. |
Evaluation | The Urdu to English evaluation in §3.4 focuses on how noisy parallel data and completely monolingual (i.e., not even comparable) text can be used for a realistic low-resource language pair, and is evaluated with the larger language model only. |
Evaluation | The 13 baseline features (2 lexical, 2 phrasal, 5 HRM, and 1 language model , word penalty, phrase length feature and distortion penalty feature) were tuned using MERT (Och, 2003), which is also used to tune the 4 feature weights introduced by the secondary phrase table (2 lexical and 2 phrasal, other features being shared between the two tables). |
Generation & Propagation | These candidates are scored using stem-level translation probabilities, morpheme-level lexical weighting probabilities, and a language model , and only the top 30 candidates are included. |
Introduction | We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models . |
Abstract | We use translation models and language models to exploit lexical correlations and solution post character respectively. |
Introduction | The cornerstone of our technique is the usage of a hitherto unexplored textual feature, lexical correlations between problems and solutions, that is exploited along with language model based characterization of solution posts. |
Introduction | We model the lexical correlation and solution post character using regularized translation models and unigram language models respectively. |
Our Approach | Consider a unigram language model $\mathcal{S}_S$ that models the lexical characteristics of solution posts, and a translation model $\mathcal{T}_S$ that models the lexical correlation between problems and solutions. |
Our Approach | In short, each solution word is assumed to be generated from the language model or the translation model (conditioned on the problem words) with a probability of $\lambda$ and $1 - \lambda$ respectively, thus accounting for the correlation assumption. |
Our Approach | Of the solution words above, generic words such as try and should could probably be explained by (i.e., sampled from) the solution language model , whereas disconnect and rejoin could be correlated well with surf and wifi and hence are more likely to be supported better by the translation model. |
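The two-source generation assumption can be sketched as a simple mixture. How the translation model aggregates over problem words is not stated in these excerpts, so the sketch assumes a plain average. The dictionaries, probabilities, and function name are all hypothetical.

```python
def word_probability(word, problem_words, lm, tm, lam):
    """Mixture: lam * P_LM(word) + (1 - lam) * P_TM(word | problem words)."""
    p_lm = lm.get(word, 0.0)
    # Assumption: average the word-to-word translation probabilities.
    p_tm = sum(tm.get((p, word), 0.0) for p in problem_words) / len(problem_words)
    return lam * p_lm + (1.0 - lam) * p_tm

# Toy parameters (hypothetical values).
lm = {"try": 0.3, "should": 0.2, "disconnect": 0.01}
tm = {("wifi", "disconnect"): 0.4, ("surf", "disconnect"): 0.2}
problem = ["surf", "wifi"]

p_try = word_probability("try", problem, lm, tm, lam=0.5)
p_disconnect = word_probability("disconnect", problem, lm, tm, lam=0.5)
```

In this toy setting the generic word "try" gets its probability mass from the language model term, while "disconnect" is supported almost entirely by the translation model term, mirroring the intuition in the excerpt.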
Related Work | We will use translation and language models in our method for solution identification. |
Abstract | We study the feasibility of exploiting cross-lingual context to obtain high-quality translation suggestions that improve over statistical language modelling and word-sense disambiguation baselines. |
Baselines | A second baseline was constructed by weighing the probabilities from the translation table directly with the L2 language model described earlier. |
Baselines | target language modelling ) which is also cus- |
Introduction | The main research question in this research is how to disambiguate an L1 word or phrase to its L2 translation based on an L2 context, and whether such cross-lingual contextual approaches provide added value compared to baseline models that are not context informed or compared to standard language models . |
System | 3.1 Language Model |
System | We also implement a statistical language model as an optional component of our classifier-based system and also as a baseline to compare our system to. |
System | The language model is a trigram-based back-off language model with Kneser-Ney smoothing, computed using SRILM (Stolcke, 2002) and trained on the same training data as the translation model. |
A Skeleton-based Approach to MT 2.1 Skeleton Identification | In this work both the skeleton translation model $g_{skel}(d)$ and full translation model $g_{full}(d)$ resemble the usual forms used in phrase-based MT, i.e., the model score is computed by a linear combination of a group of phrase-based features and language models. |
A Skeleton-based Approach to MT 2.1 Skeleton Identification | Given a translation model m, a language model lm and a vector of feature weights w, the model score of a derivation d is computed by |
A Skeleton-based Approach to MT 2.1 Skeleton Identification | $lm(d)$ and $w_{lm}$ are the score and weight of the language model, respectively. |
Evaluation | A 5-gram language model was trained on the Xinhua portion of the English Gigaword corpus in addition to the target-side of the bilingual data. |
Introduction | • We develop a skeletal language model to describe the possibility of translation skeleton and handle some of the long-distance word dependencies. |
Abstract | Our novel lattice desegmentation algorithm effectively combines both segmented and desegmented views of the target language for a large subspace of possible translation outputs, which allows for inclusion of features related to the desegmentation process, as well as an unsegmented language model (LM). |
Methods | This trivially allows for an unsegmented language model and never makes desegmentation errors. |
Methods | Doing so enables the inclusion of an unsegmented target language model , and with a small amount of bookkeeping, it also allows the inclusion of features related to the operations performed during desegmentation (see Section 3.4). |
Methods | We now have a desegmented lattice, but it has not been annotated with an unsegmented (word-level) language model . |
Related Work | Bojar (2007) incorporates such analyses into a factored model, to either include a language model over target morphological tags, or model the generation of morphological features. |
Related Work | They introduce an additional desegmentation technique that augments the table-based approach with an unsegmented language model . |
Related Work | Oflazer and Durgar El-Kahlout (2007) desegment 1000-best lists for English-to-Turkish translation to enable scoring with an unsegmented language model . |
Abstract | We aim to improve spoken term detection performance by incorporating contextual information beyond traditional N-gram language models . |
Introduction | ASR systems traditionally use N-gram language models to incorporate prior knowledge of word occurrence patterns into prediction of the next word in the token stream. |
Introduction | Yet, though many language models more sophisticated than N-grams have been proposed, N-grams are empirically hard to beat in terms of WER. |
Introduction | The strength of this phenomenon suggests it may be more viable for improving term-detection than, say, topic-sensitive language models . |
Motivation | The re-scoring approach we present is closely related to adaptive or cache language models (Jelinek, 1997; Kuhn and De Mori, 1990; Kneser and Steinbiss, 1993). |
Motivation | The primary difference between this and previous work on similar language models is the narrower focus here on the term detection task, in which we consider each search term in isolation, rather than all words in the vocabulary. |
Results | We train ASR acoustic and language models from the training corpus using the Kaldi speech recognition toolkit (Povey et al., 2011) following the default BABEL training and search recipe which is described in detail by Chen et al. |
Term and Document Frequency Statistics | A similar phenomenon is observed concerning adaptive language models (Church, 2000). |
Term and Document Frequency Statistics | In general, we can think of using word repetitions to re-score term detection as applying a limited form of adaptive or cache language model (Jelinek, 1997). |
Term and Document Frequency Statistics | In applying the burstiness quantity to term detection, we recall that the task requires us to locate a particular instance of a term, not estimate a count, hence the utility of N-gram language models predicting words in sequence. |
Abstract | We consider the prediction of three human behavioral measures — lexical decision, word naming, and picture naming —through the lens of domain bias in language modeling . |
Abstract | This study aims to provoke increased consideration of the human language model by NLP practitioners: biases are not limited to differences between corpora (i.e. |
Discussion | Our analyses reveal that 6 commonly used corpora fail to reflect the human language model in various ways related to dialect, modality, and other properties of each corpus. |
Discussion | Our results point to a type of bias in commonly used language models that has been previously overlooked. |
Discussion | Just as language models have been used to predict reading grade-level of documents (Collins-Thompson and Callan, 2004), human language models could be |
Introduction | Computational linguists build statistical language models for aiding in natural language processing (NLP) tasks. |
Introduction | In the current study, we exploit errors of the latter variety—failure of a language model to predict human performance—to investigate bias across several frequently used corpora in computational linguistics. |
Introduction | : Human Language Model |
Related Work | (2012) adopt the tweets with emoticons to smooth the language model and Hu et al. |
Related Work | With the revival of interest in deep learning (Bengio et al., 2013), incorporating the continuous representation of a word as features has been proving effective in a variety of NLP tasks, such as parsing (Socher et al., 2013a), language modeling (Bengio et al., 2003; Mnih and Hinton, 2009) and NER (Turian et al., 2010). |
Related Work | The training objective is that the original ngram is expected to obtain a higher language model score than the corrupted ngram by a margin of 1. |
Abstract | Experiments on parsing and a language modeling problem show that the algorithm is efficient and effective in practice. |
Experiments on Parsing | 8 Experiments on the Saul and Pereira (1997) Model for Language Modeling |
Experiments on Parsing | We now describe a second set of experiments, on the Saul and Pereira (1997) model for language modeling . |
Experiments on Parsing | We performed the language modeling experiments for a number of reasons. |
Introduction | We describe experiments on learning of L-PCFGs, and also on learning of the latent-variable language model of Saul and Pereira (1997). |
Abstract | In this work, we construct a statistical model of grammaticality using various linguistic features (e.g., misspelling counts, parser outputs, n-gram language model scores). |
Discussion and Conclusions | While Post found that such a system can effectively distinguish grammatical news text sentences from sentences generated by a language model, measuring the grammaticality of real sentences from language learners seems to require a wider variety of features, including n-gram counts, language model scores, etc. |
Experiments | To create further baselines for comparison, we selected the following features that represent ways one might approximate grammaticality if a comprehensive model was unavailable: whether the link parser can fully parse the sentence (complete_link), the Gigaword language model score (gigaword_avglogprob), and the number of misspelled tokens (nummisspelled). |
System Description | 3.2.2 n-gram Count and Language Model Features |
System Description | The model computes the following features from a 5-gram language model trained on the same three sections of English Gigaword using the SRILM toolkit (Stolcke, 2002): |
System Description | Finally, the system computes the average log-probability and number of out-of-vocabulary words from a language model trained on a collection of essays written by nonnative English speakers (“nonnative LM”). |
Abstract | Recent work has shown success in using neural network language models (NNLMs) as features in MT systems. |
Introduction | Initially, these models were primarily used to create n-gram neural network language models (NNLMs) for speech recognition and machine translation (Bengio et al., 2003; Schwenk, 2010). |
Introduction | Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window. |
Model Variations | In particular, we can reverse the translation direction of the languages, as well as the direction of the language model . |
Model Variations | • 5-gram Kneser-Ney LM • Recurrent neural network language model (RNNLM) (Mikolov et al., 2010) |
Neural Network Joint Model (NNJM) | Fortunately, neural network language models are able to elegantly scale up and take advantage of arbitrarily large context sizes. |
Introduction | This crawling process also yielded 632K TAC pairs whose only difference was spacing, and an additional 558M “unpaired” tweets; as shown later in this paper, we used these extra corpora for computing language models and other auxiliary information. |
Introduction | Table 5: Conformity to the community and one’s own past, measured via scores assigned by various language models . |
Introduction | We measure a tweet’s similarity to expectations by its score according to the relevant language model, $\frac{1}{|T|}\sum_{x \in T} \log(p(x))$, where T refers to either all the unigrams (unigram model) or all and only bigrams (bigram model). We trained a Twitter-community language model from our 558M unpaired tweets, and personal language models from each author’s tweet history. |
Experiments | Although we did not examine the accuracy of real tasks in this paper, there is an interesting report that the word error rate of language models follows a power law with respect to perplexity (Klakow and Peters, 2002). |
Introduction | Removing low-frequency words from a corpus (often called cutoff) is a common practice to save on the computational costs involved in learning language models and topic models. |
Introduction | In the case of language models, we often have to remove low-frequency words because of a lack of computational resources, since the feature space of k-grams tends to be so large that we sometimes need cutoffs even in a distributed environment (Brants et al., 2007). |
Perplexity on Reduced Corpora | Constant restoring is similar to the additive smoothing defined by $\hat{p}(w) \propto p'(w) + \lambda$, which is used to solve the zero-frequency problem of language models (Chen and Goodman, 1996). |
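Additive smoothing of the form "smoothed probability proportional to the reduced-corpus probability plus a constant" can be sketched as below. The function name, the constant `lam`, and the toy distribution are assumptions for illustration. The normalisation over a fixed vocabulary is the usual way to turn the proportionality into a distribution.

```python
def additive_smoothing(p_reduced, lam):
    """Return p_hat with p_hat(w) proportional to p_reduced(w) + lam."""
    z = sum(p_reduced.values()) + lam * len(p_reduced)
    return {w: (p + lam) / z for w, p in p_reduced.items()}

# Toy distribution in which one word was removed by a cutoff (probability 0).
p_reduced = {"the": 0.6, "cat": 0.4, "zyzzyva": 0.0}
p_hat = additive_smoothing(p_reduced, lam=0.1)
```

After smoothing, the removed word receives a small non-zero probability, which is what avoids the zero-frequency problem mentioned in the excerpt.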
Perplexity on Reduced Corpora | This means that we can determine the rough sparseness of k-grams and adjust some of the parameters such as the gram size k in learning statistical language models. |
Perplexity on Reduced Corpora | LDA is a probabilistic language model that generates a corpus as a mixture of hidden topics, and it allows us to infer two parameters: the document-topic distribution $\theta$ that represents the mixture rate of topics in each document, and the topic-word distribution $\phi$ that represents the occurrence rate of words in each topic. |
Experiments | Illustrated by the highlighted states in 6, the LM-HMM model conflates interactions that commonly occur at the beginning and end of a dialogue—i.e., “acknowledge agent” and “resolve problem”, since their underlying language models are likely to produce similar probability distributions over words. |
Experiments | By incorporating topic information, our proposed models (e.g., TM-HMMSS in Figure 5) are able to enforce the state transitions towards more frequent flow patterns, which further helps to overcome the weakness of the language model. |
Latent Structure in Dialogues | The simplest formulation we consider is an HMM where each state contains a unigram language model (LM), proposed by Chotimongkol (2008) for task-oriented dialogue and originally |
Latent Structure in Dialogues | 3: For each word in utterance n, first choose a word source $r$ according to $\tau$, and then depending on $r$, generate a word $w$ either from the session-wide topic distribution $\theta$ or the language model specified by the state $s_n$. |
Latent Structure in Dialogues | Note that a TM-HMMS model with state-specific topic models (instead of state-specific language models) would be subsumed by TM-HMM, since one topic could be used as the background topic in TM-HMMS. |
Related Work | It is combined with a language model to improve grammaticality and the decoder translates sentences into sim- |
Simplification Framework | In addition, the language model we integrate in the SMT module helps ensure better fluency and grammaticality. |
Simplification Framework | Finally, the translation and language models ensure that published, describing and boson are simplified to wrote, explaining and elementary particle respectively; and that the phrase “In 1964” is moved from the beginning of the sentence to its end. |
Simplification Framework | Our simplification framework consists of a probabilistic model for splitting and dropping which we call DRS simplification model (DRS-SM); a phrase based translation model for substitution and reordering (PBMT); and a language model learned on Simple English Wikipedia (LM) for fluency and grammaticality. |
Introduction | The best-performing systems for these applications today rely on training on large amounts of data: in the case of ASR, the data is aligned audio and transcription, plus large unannotated data for the language modeling ; in the case of OCR, it is transcribed optical data; in the case of MT, it is aligned bitexts. |
Introduction | For ASR and OCR, which can compose words from smaller units (phones or graphically recognized letters), an expanded target language vocabulary can be directly exploited without the need for changing the technology at all: the new words need to be inserted into the relevant resources (lexicon, language model, etc.) with appropriately estimated probabilities. |
Introduction | The expanded word combinations can be used to extend the language models used for MT to bias against incoherent hypothesized new sequences of segmented words. |
Morphology-based Vocabulary Expansion | In the Bigram Affix model, we do the same for the stem as in the Fixed Affix model, but for prefixes and suffixes, we create a bigram language model in the finite state machine. |
Morphology-based Vocabulary Expansion | We reweight the weights in the WFST model (Fixed or Bigram) by composing it with a letter trigraph language model (WoTr). |
MT System Selection | These features rely on language models , MSA and Egyptian morphological analyzers and a Highly Dialectal Egyptian lexicon to decide whether each word is MSA, Egyptian, Both, or Out of Vocabulary. |
MT System Selection | two language models : MSA and Egyptian. |
MT System Selection | The second set of features uses perplexity against language models built from the source-side of the training data of each of the four |
Machine Translation Experiments | The language model for our systems is trained on English Gigaword (Graff and Cieri, 2003). |
Machine Translation Experiments | We use SRILM Toolkit (Stolcke, 2002) to build a 5-gram language model with modified |
A semantic span can include one or more eus. | Most translation systems adopt the features from a translation model, a language model , and sometimes a reordering model. |
A semantic span can include one or more eus. | The process of training this transfer model and smoothing is similar to the process of training a language model . |
A semantic span can include one or more eus. | formula (6) are estimated in the same way as a factored language model , which has the advantage of easily incorporating various linguistic information. |
Experiments | A 5-gram language model is trained with SRILM on the Xinhua portion of the English Gigaword corpus combined with the English part of FBIS. |
Experiments | probabilities, the BTG reordering features, and the language model feature. |
Conclusion | Further improvement is possible by incorporating topic models deeper in the decoding process and adding domain knowledge to the language model . |
Discussion | 6.3 Improving Language Models |
Discussion | Topic models capture document-level properties of language, but a critical component of machine translation systems is the language model , which provides local constraints and preferences. |
Discussion | Domain adaptation for language models (Bellegarda, 2004; Wood and Teh, 2009) is an important avenue for improving machine translation. |
Experiments | We train a modified Kneser-Ney trigram language model on English (Chen and Goodman, 1996). |
Conclusion | Additionally, we also want to induce sense clusters for words in the target language so that we can build a sense-based language model and integrate it into SMT. |
Decoding with Sense-Based Translation Model | error rate training (MERT) (Och, 2003) together with other models such as the language model . |
Experiments | We trained a 5-gram language model on the Xinhua section of the English Gigaword corpus (306 million words) using the SRILM toolkit (Stolcke, 2002) with modified Kneser-Ney smoothing (Chen and Goodman, 1996). |
Related Work | (2007) also explore a bilingual topic model for translation and language model adaptation. |
Features | We look at the language model (LM) score and the number of alternate pronunciations of the first query, predicting that a misrecognized query will have a lower LM score and more alternate pronunciations. |
Prediction task | In addition, the language model likelihood for the first query was, as expected, significantly lower for retries. |
Related Work | Retry cases are identified with joint language modeling across multiple transcripts, with the intuition that retry pairs tend to be closely related or exact duplicates. |
Related Work | While we follow this work in our usage of joint language modeling , our application encompasses open domain voice searches and voice actions (such as placing calls), so we cannot use simplifying domain assumptions. |
Experiments | SRILM (Stolcke, 2002) is adopted for language model training and KenLM (Heafield, 2011; Heafield et al., 2013) for language model query. |
Pinyin Input Method Model | The edge weight is the negative logarithm of the conditional probability $P(S_{j+1,k} \mid S_{j,i})$ that a syllable $S_{j,i}$ is followed by $S_{j+1,k}$, which is given by a bigram language model of pinyin syllables: |
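An edge weight of this kind, the negative log of a bigram conditional probability between adjacent syllables, can be sketched as follows. The unsmoothed maximum-likelihood estimate, the function name, and the toy syllable sequences are assumptions for illustration. A practical system would use a smoothed bigram model.

```python
import math
from collections import Counter

def bigram_edge_weights(syllable_sequences):
    """Map each observed syllable bigram (a, b) to -log P(b | a)."""
    context, bigrams = Counter(), Counter()
    for seq in syllable_sequences:
        for a, b in zip(seq, seq[1:]):
            context[a] += 1          # times a appears with a successor
            bigrams[(a, b)] += 1     # times a is followed by b
    return {(a, b): -math.log(c / context[a]) for (a, b), c in bigrams.items()}

# Toy pinyin syllable sequences (hypothetical data).
weights = bigram_edge_weights([["ni", "hao"], ["ni", "men"], ["ni", "hao"]])
```

Because the weights are negative log probabilities, a more likely continuation gets a smaller weight, so a shortest-path search over such a graph prefers likely syllable sequences.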
Related Works | They solved the typo correction problem by decomposing the conditional probability $P(H \mid P)$ of Chinese character sequence H given pinyin sequence P into a language model $P(w_i \mid w_{i-1})$ and a typing model. The typing model, which was estimated on real user input data, was used for typo correction. |
Related Works | Various approaches were made for the task including language model (LM) based methods (Chen et al., 2013), ME model (Han and Chang, 2013), CRF (Wang et al., 2013d; Wang et al., 2013a), SMT (Chiu et al., 2013; Liu et al., 2013), and graph model (Jia et al., 2013), etc. |
Keyphrase Extraction Approaches | 3.3.4 Language Modeling |
Keyphrase Extraction Approaches | These feature values are estimated using language models (LMs) trained on a foreground corpus and a background corpus. |
Keyphrase Extraction Approaches | In sum, LMA uses a language model rather than heuristics to identify phrases, and relies on the language model trained on the background corpus to determine how “unique” a candidate keyphrase is to the domain represented by the foreground corpus. |
Experiments | An in-house language modeling toolkit is used to train the 5-gram language model with modified Kneser-Ney smoothing (Kneser and Ney, 1995). |
Experiments | The English monolingual data used for language modeling is the same as in Table 1. |
Related Work | They incorporated the bilingual topic information into language model adaptation and lexicon translation model adaptation, achieving significant improvements in the large-scale evaluation. |
Topic Similarity Model with Neural Network | Standard features: Translation model, including translation probabilities and lexical weights for both directions (4 features), 5-gram language model (1 feature), word count (1 feature), phrase count (1 feature), NULL penalty (1 feature), number of hierarchical rules used (1 feature). |
Model | Broadly, as the learner progresses from one sentence to the next, exposing herself to more novel words, the updated parameters of the language model in turn guide the selection of new “switch-points” for replacing source words with the target foreign words. |
Model | Generally, this value may come directly from the surprisal quantity given by a language model , or may incorporate additional features that are found informative in predicting the constraint on the word. |
Related Work | Building on their work, (Adel et al., 2012) employ additional features and a recurrent network language model for modeling code-switching in conversational speech. |
Background | The RNN is primarily used as a language model , but may also be viewed as a sentence model with a linear structure. |
Introduction | Besides comprising powerful classifiers as part of their architecture, neural sentence models can be used to condition a neural language model to generate sentences word by word (Schwenk, 2012; Mikolov and Zweig, 2012; Kalchbrenner and Blunsom, 2013a). |
Properties of the Sentence Model | This gives the RNN excellent performance at language modelling , but it is suboptimal for remembering at once the n-grams further back in the input sentence. |
Models and Features | In particular, we use the recurrent neural network language model (RNNLM) of Mikolov et al. |
Models and Features | Like any language model , a RNNLM estimates the probability of observing a word given the preceding context, but, in this process, it learns word embeddings into a latent, conceptual space with a fixed number of dimensions. |
Related Work | (2013) recently addressed the problem of answer sentence selection and demonstrated that LS models, including recurrent neural network language models (RNNLM), have a higher contribution to overall performance than exploiting syntactic analysis. |
Introduction | Successful applications of such models include language modelling (Bengio et al., 2003), paraphrase detection (Erk and Pado, 2008), and dialogue analysis (Kalchbrenner and Blunsom, 2013). |
Related Work | Neural language models are another popular approach for inducing distributed word representations (Bengio et al., 2003). |
Related Work | They have received a lot of attention in recent years (Collobert and Weston, 2008; Mnih and Hinton, 2009; Mikolov et al., 2010, inter alia) and have achieved state of the art performance in language modelling . |
Conclusion | In the current version of the generator, the output is ranked using a simple language model trained on the GENIA corpus. |
Generating from the KBGen Knowledge-Base | To rank the generator output, we train a language model on the GENIA corpus, a corpus of 2000 MEDLINE abstracts about biology containing more than 400000 words (Kim et al., 2003), and use this model to rank the generated sentences by decreasing probability. |
Related Work | They intersect the grammar with a language model to improve fluency; use a weighted hypergraph to pack the derivations; and find the best derivation tree using the Viterbi algorithm. |
Abstract | Context-predicting models (more commonly known as embeddings or neural language models ) are the new kids on the distributional semantics block. |
Introduction | This is in part due to the fact that context-predicting vectors were first developed as an approach to language modeling and/or as a way to initialize feature vectors in neural-network-based “deep learning” NLP architectures, so their effectiveness as semantic representations was initially seen as little more than an interesting side effect. |
Introduction | Predictive DSMs are also called neural language models , because their supervised context prediction training is performed with neural networks, or, more cryptically, “embeddings”. |
Data | To manage the degrees of freedom in the model described in §4, we perform dimensionality reduction on the vocabulary by learning word embeddings with a log-linear continuous skip-gram language model (Mikolov et al., 2013) on the entire collection of 15,099 books. |
Model | Maximum entropy approaches to language modeling have been used since Rosenfeld (1996) to incorporate long-distance information, such as previously-mentioned trigger words, into n-gram language models . |
Model | Number of personas (hyperparameter); $D$: number of documents; $C_d$: number of characters in document $d$; $W_{d,c}$: number of (cluster, role) tuples for character $c$; $m_d$: metadata for document $d$ (ranges over $M$ authors); $\theta_d$: document $d$’s distribution over personas; $p_{d,c}$: character $c$’s persona; $j$: an index for a $\langle r, w \rangle$ tuple in the data; $w_j$: word cluster ID for tuple $j$; $r_j$: role for tuple $j$, $r_j \in \{\text{agent}, \text{patient}, \text{poss}, \text{pred}\}$; $\eta$: coefficients for the log-linear language model; $\mu, \lambda$: Laplace mean and scale (for regularizing $\eta$); $\alpha$: Dirichlet concentration parameter |