Abstract | In this paper, we present an approach to enriching high-order feature representations for graph-based dependency parsing models using a dependency language model and beam search.
Abstract | The dependency language model is built on a large amount of additional auto-parsed data produced by a baseline parser.
Abstract | Based on the dependency language model, we define a set of features for the parsing model.
Dependency language model | Language models play a very important role in statistical machine translation (SMT).
Dependency language model | The standard N-gram language model predicts the next word based on the N-1 immediately preceding words.
Dependency language model | However, the traditional N-gram language model cannot capture long-distance word relations.
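To make the fixed window concrete, here is a minimal sketch (ours, not the papers') of a count-based N-gram model; note that prob() can only ever see the last N-1 words, which is exactly why longer-distance relations are invisible to it:

    from collections import defaultdict

    class NGramLM:
        """Maximum-likelihood N-gram model; names are illustrative."""
        def __init__(self, n):
            self.n = n
            self.ctx_counts = defaultdict(int)    # counts of (n-1)-word contexts
            self.ngram_counts = defaultdict(int)  # counts of full n-grams

        def train(self, sentences):
            for words in sentences:
                padded = ["<s>"] * (self.n - 1) + words + ["</s>"]
                for i in range(self.n - 1, len(padded)):
                    ctx = tuple(padded[i - self.n + 1:i])
                    self.ctx_counts[ctx] += 1
                    self.ngram_counts[ctx + (padded[i],)] += 1

        def prob(self, word, context):
            # Only the last n-1 words matter: anything farther back is ignored.
            ctx = tuple(context[-(self.n - 1):])
            if self.ctx_counts[ctx] == 0:
                return 0.0
            return self.ngram_counts[ctx + (word,)] / self.ctx_counts[ctx]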
Introduction | In this paper, we solve this issue by enriching the feature representations for a graph-based model using a dependency language model (DLM) (Shen et al., 2008). |
Introduction | We utilize the dependency language model to enhance the graph-based parsing model.
Parsing with dependency language model | In this section, we propose a parsing model which includes the dependency language model by extending the model of McDonald et al. |
Abstract | We tackle the problem with two approaches: methods that use local lexical information, such as the n-grams of a classical language model; and methods that evaluate global coherence, such as latent semantic analysis.
Introduction | To investigate the usefulness of local information, we evaluated n-gram language model scores, from both a conventional model with Good-Turing smoothing and a recently proposed maximum-entropy class-based n-gram model (Chen, 2009a; Chen, 2009b).
Introduction | Also in the language modeling vein, but with potentially global context, we evaluate the use of a recurrent neural network language model.
Introduction | In all the language modeling approaches, a model is used to compute a sentence probability for each of the potential completions.
Related Work | The KU system uses just an N-gram language model to do this ranking.
Related Work | The UNT system uses a large variety of information sources, and a language model score receives the highest weight. |
Sentence Completion via Language Modeling | Perhaps the most straightforward approach to solving the sentence completion task is to form the complete sentence with each option in turn, and to evaluate its likelihood under a language model.
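A hedged sketch of that scoring loop (the placeholder convention and the lm.logprob interface are illustrative assumptions, not the authors' code):

    def complete(sentence_template, options, lm):
        """Pick the option whose completed sentence the LM scores highest."""
        best, best_logp = None, float("-inf")
        for option in options:
            sentence = sentence_template.replace("___", option)
            logp = lm.logprob(sentence.split())  # assumed LM interface
            if logp > best_logp:
                best, best_logp = option, logp
        return best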
Sentence Completion via Language Modeling | In this section, we describe the suite of state-of-the-art language modeling techniques for which we will present results.
Sentence Completion via Language Modeling | 3.1 Backoff N-gram Language Model |
A Probabilistic Formulation for HVR | where P(W) can be modelled by the word-based n-gram language model (Chen and Goodman, 1996) commonly used in automatic speech recognition.
A Probabilistic Formulation for HVR | - Language model score: P(W)
A Probabilistic Formulation for HVR | Note that the acoustic model and language model scores are already used in conventional ASR.
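In the standard formulation (general ASR background, not specific to this paper), these two scores enter the decoder through the Bayes decision rule, with P(W) factored by an n-gram model:

$$ W^{*} = \arg\max_{W} P(O \mid W)\, P(W), \qquad P(W) \approx \prod_{i} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) $$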
Abstract | In addition to the acoustic and language models used in automatic speech recognition systems, HVR uses the haptic and partial lexical models as additional knowledge sources to reduce the recognition search space and suppress confusions. |
Experimental Results | These sentences contain a variety of given names, surnames and city names so that confusions cannot be easily resolved using a language model.
Experimental Results | The ASR system used in all the experiments reported in this paper consists of a set of HMM-based triphone acoustic models and an n-gram language model.
Experimental Results | A bigram language model with a vocabulary size of 200 words was used for testing. |
Haptic Voice Recognition (HVR) | In conventional ASR, acoustically similar word sequences are typically resolved implicitly using a language model where contexts of neighboring words are used for disambiguation. |
Integration of Knowledge Sources | where A, G, P and H denote the WFST representations of the acoustic model, language model, PLI model and haptic model, respectively.
Integration of Knowledge Sources | (2002) has shown that Hidden Markov Models (HMMs) and n-gram language models can be viewed as WFSTs. |
Introduction | In addition to the acoustic model and language model used in ASR, a haptic model and a partial lexical model are also introduced to facilitate the integration of more sophisticated haptic events, such as keystrokes, into HVR.
Abstract | We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context. |
Abstract | We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours. |
Introduction | N-gram language models are a central component of all speech recognition and machine translation systems, and a great deal of research centers around refining models (Chen and Goodman, 1998), efficient storage (Pauls and Klein, 2011; Heafield, 2011), and integration into decoders (Koehn, 2004; Chiang, 2005).
Introduction | At the same time, because n-gram language models only condition on a local window of linear word-level context, they are poor models of long-range syntactic dependencies. |
Introduction | Although several lines of work have proposed generative syntactic language models that improve on n-gram models for moderate amounts of data (Chelba, 1997; Xu et al., 2002; Charniak, 2001; Hall, 2004; Roark, |
Treelet Language Modeling | The common denominator of most n-gram language models is that they assign probabilities roughly according to empirical frequencies for observed n-grams, but fall back to distributions conditioned on smaller contexts for unobserved n-grams, as shown in Figure 1(a).
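A minimal sketch of the fall-back idea, using the simple stupid-backoff scheme (Brants et al., 2007) for concreteness; the models discussed here use more refined discounting:

    def stupid_backoff(ngram, counts, alpha=0.4):
        """Relative frequency of the observed n-gram; otherwise back off
        to the (n-1)-gram suffix, discounted by alpha.
        `counts` maps word tuples of every order to integer counts."""
        if len(ngram) == 1:
            total = sum(c for ng, c in counts.items() if len(ng) == 1)
            return counts.get(ngram, 0) / max(total, 1)
        ctx = ngram[:-1]
        if counts.get(ngram, 0) > 0 and counts.get(ctx, 0) > 0:
            return counts[ngram] / counts[ctx]
        return alpha * stupid_backoff(ngram[1:], counts, alpha)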
Treelet Language Modeling | We are not the first to use back-off-based smoothing for syntactic language modeling; such techniques have been applied to models that condition on headword contexts (Charniak, 2001; Roark, 2004; Zhang, 2009).
Hello. My name is Inigo Montoya. | First, we show a concrete sense in which memorable quotes are indeed distinctive: with respect to lexical language models trained on the newswire portions of the Brown corpus [21], memorable quotes have significantly lower likelihood than their non-memorable counterparts. |
Hello. My name is Inigo Montoya. | In particular, we analyze a corpus of advertising slogans, and we show that these slogans have significantly greater likelihood at both the word level and the part-of-speech level with respect to a language model trained on memorable movie quotes, compared to a corresponding language model trained on non-memorable movie quotes. |
Never send a human to do a machine’s job. | In order to assess different levels of lexical and syntactic distinctiveness, we employ a total of six Laplace-smoothed language models: 1-gram, 2-gram, and 3-gram word LMs and 1-gram, 2-gram and 3-gram part-of-speech LMs.
Never send a human to do a machine’s job. | As indicated in Table 3, for each of our lexical “common language” models, in about 60% of the quote pairs, the memorable quote is more distinctive.
Never send a human to do a machine’s job. | The language models’ vocabulary was that of the entire training corpus. |
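A hedged sketch of the comparison implied here (the LM interface is an illustrative assumption): the quote that the common-language model finds less likely, per token, counts as the more distinctive one.

    def more_distinctive(quote_a, quote_b, common_lm):
        """Return the quote with lower average log-likelihood under the
        'common language' model, i.e., the more distinctive one."""
        def avg_logp(tokens):
            return common_lm.logprob(tokens) / len(tokens)  # assumed interface
        return quote_a if avg_logp(quote_a) < avg_logp(quote_b) else quote_b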
Abstract | We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features. |
Experiments | Nonetheless, it represents phonetic variability more realistically than the Bernstein-Ratner-Brent corpus, while still maintaining the lexical characteristics of infant-directed speech (as compared to the Buckeye corpus, with its much larger vocabulary and more complex language model).
Inference | The language modeling term relating to the intended string again factors into multiple components. |
Inference | Because neither the transducer nor the language model is a perfect model of the true distribution, they can have incompatible dynamic ranges.
Inference | The transducer scores can be cached since they depend only on surface forms, but the language model scores cannot.
Introduction | Previous models with similar goals have learned from an artificial corpus with a small vocabulary (Driesen et al., 2009; Rasanen, 2011) or have modeled variability only in vowels (Feldman et al., 2009); to our knowledge, this paper is the first to use a naturalistic infant-directed corpus while modeling variability in all segments, and to incorporate word-level context (a bigram language model ). |
Introduction | Our model is conceptually similar to those used in speech recognition and other applications: we assume the intended tokens are generated from a bigram language model and then distorted by a noisy channel, in particular a log-linear model of phonetic variability. |
Introduction | But unlike speech recognition, we have no (intended-form, surface-form) training pairs to train the phonetic model, nor even a dictionary of intended-form strings to train the language model . |
Lexical-phonetic model | Our lexical-phonetic model is defined using the standard noisy channel framework: first a sequence of intended word tokens is generated using a language model , and then each token is transformed by a probabilistic finite-state transducer to produce the observed surface sequence. |
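Schematically (our notation, and gliding over segmentation details), this noisy-channel story factors as a bigram language model times a per-token channel term:

$$ P(s, w) \;=\; \prod_i P_{\mathrm{LM}}(w_i \mid w_{i-1})\; P_{\mathrm{chan}}(s_i \mid w_i) $$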
Related work | In contrast, our model uses a symbolic representation for sounds, but models variability in all segment types and incorporates a bigram word-level language model . |
Abstract | We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models.
Conclusion | Our new multi-prototype neural language model outperforms previous neural models and competitive baselines on this new dataset. |
Experiments | Table 3 shows our results compared to previous methods, including C&W’s language model and the hierarchical log-bilinear (HLBL) model (Mnih and Hinton, 2008), which is a probabilistic, linear neural model. |
Global Context-Aware Neural Language Model | Note that Collobert and Weston (2008)’s language model corresponds to the network using only local context. |
Introduction | We introduce a new neural-network-based language model that distinguishes and uses both local and global context via a joint training objective. |
Introduction | We show that our multi-prototype model improves upon the single-prototype version and outperforms other neural language models and baselines on this dataset. |
Related Work | Neural language models (Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Schwenk and Gauvain, 2002; Emami et al., 2003) have been shown to be very powerful at language modeling, a task where models are asked to accurately predict the next word given previously seen words.
Related Work | Schwenk and Gauvain (2002) tried to incorporate larger context by combining partial parses of past word sequences and a neural language model.
Related Work | They used up to 3 previous head words and showed increased performance on language modeling.
Abstract | Using the senses predicted for words in documents, we propose a novel approach that incorporates word senses into the language modeling approach to IR and also exploits the integration of synonym relations.
Incorporating Senses into Language Modeling Approaches | The next problem is to incorporate the sense information into the language modeling approach. |
Incorporating Senses into Language Modeling Approaches | Given a query q and a document d in text collection C, we want to re-estimate the language models by making use of the sense information assigned to them.
Incorporating Senses into Language Modeling Approaches | With this language model, the probability of a query term in a document is increased by the synonyms of its senses; the more of its synonym senses appear in a document, the higher the probability.
Introduction | We incorporate word senses into the language modeling (LM) approach to IR (Ponte and Croft, 1998), and utilize sense synonym relations to further improve the performance. |
The Language Modeling Approach to IR | 3.1 The language modeling approach |
The Language Modeling Approach to IR | In the language modeling approach to IR, language models are constructed for each query q and each document d in a text collection C. The documents in C are ranked by their distance to a given query q according to the language models.
The Language Modeling Approach to IR | The most commonly used language model in IR is the unigram model, in which terms are assumed to be independent of each other. |
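A minimal sketch of unigram query-likelihood ranking; Jelinek-Mercer smoothing is shown as one common choice, not necessarily this paper's estimator (query, doc and collection are token lists):

    import math

    def query_likelihood(query, doc, collection, lam=0.5):
        """Score log P(q|d) under a unigram model, mixing the document's
        relative frequency with the collection's (Jelinek-Mercer)."""
        score = 0.0
        for term in query:
            p_doc = doc.count(term) / max(len(doc), 1)
            p_col = collection.count(term) / max(len(collection), 1)
            score += math.log(lam * p_doc + (1 - lam) * p_col + 1e-12)
        return score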
Abstract | Long-span features, such as syntax, can improve language models for tasks such as speech recognition and machine translation. |
Abstract | However, these language models can be difficult to use in practice because of the time required to generate features for rescoring a large hypothesis set. |
Abstract | When using these improved tools in a language model for speech recognition, we obtain significant speed improvements with both N-best and hill climbing rescoring, and show that up-training leads to WER reduction.
Conclusion | The computational complexity of accurate syntactic processing can make structured language models impractical for applications such as ASR that require scoring hundreds of hypotheses per input. |
Incorporating Syntactic Structures | These are then passed to the language model along with the word sequence for scoring. |
Introduction | Language models (LMs) are crucial components in tasks that require the generation of coherent natural language text, such as automatic speech recognition (ASR) and machine translation (MT).
Related Work | The lattice parser, therefore, is itself a language model.
Syntactic Language Models | There have been several approaches to including syntactic information in both generative and discriminative language models.
Syntactic Language Models | Structured language modeling incorporates syntactic parse trees to identify the head words in a hypothesis for modeling dependencies beyond n-grams. |
Syntactic Language Models | Our Language Model.
Conclusion & Future Work | Future work includes extending this approach to use multiple translation models with multiple language models in ensemble decoding. |
Experiments & Results 4.1 Experimental Setup | For the mixture baselines, we used a standard one-pass phrase-based system (Koehn et al., 2003), Portage (Sadat et al., 2005), with the following 7 features: relative-frequency and lexical translation model (TM) probabilities in both directions; word-displacement distortion model; language model (LM) and word count. |
Experiments & Results 4.1 Experimental Setup | Fixing the language model allows us to compare various translation model combination techniques. |
Introduction | Common techniques for model adaptation adapt two main components of contemporary state-of-the-art SMT systems: the language model and the translation model. |
Introduction | However, language model adaptation is a more straightforward problem compared to translation model adaptation, because various measures such as perplexity of adapted language models can be easily computed on data in the target domain.
Related Work 5.1 Domain Adaptation | They use language model perplexities from IN to select relevant sentences from OUT.
Calculation of Cross-Entropy | The first term is the cross-entropy of X_i for L_i multiplied by |X_i|. Various methods for computing cross-entropy have been proposed, and these can be roughly classified into two types, based on universal coding and on language modeling.
Calculation of Cross-Entropy | For example, Benedetto et al. (2002) and Cilibrasi and Vitanyi (2005) used the universal coding approach, whereas Teahan and Harper (2001) and Sibun and Reynar (1996) were based on language modeling, using PPM and Kullback-Leibler divergence, respectively.
Calculation of Cross-Entropy | As a representative method for calculating the cross-entropy through statistical language modeling, we adopt prediction by partial matching (PPM), a language-based encoding method devised by Cleary and Witten (1984).
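Concretely, the cross-entropy estimated by such a model is the average per-character negative log-probability it assigns to the text (standard definition, our notation):

$$ H(X, L) \;=\; -\frac{1}{|X|} \sum_{i=1}^{|X|} \log_2 P_L(x_i \mid x_1, \ldots, x_{i-1}) $$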
In the experiments reported here, n is set to 5 throughout. | The term |X| H(X, L) gives the description length of the remaining characters under the language model for L.
Introduction | They used statistical language modeling and heuristics to detect foreign words and tested the case of English embedded in German texts. |
Problem Formulation | In our setting, we assume that a small amount (up to kilobytes) of monolingual plain text sample data is available for every language, e.g., the Universal Declaration of Human Rights, which serves to generate the language model used for language identification. |
Problem Formulation | The description length function calculates the description length of a text segment X_i through the use of a language model for L_i.
Problem Formulation | Here, the first term corresponds to the code length of the text chunk X_i given a language model for L_i, which in fact corresponds to the cross-entropy of X_i for L_i multiplied by |X_i|. The remaining terms give the code lengths of the parameters used to describe the length of the first term: the second term corresponds to the segment location; the third term, to the identified language; and the fourth term, to the language model of language L_i.
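A plausible rendering of this four-term decomposition (our notation, not necessarily the paper's exact formula):

$$ \mathrm{dl}(X_i, L_i) \;=\; |X_i|\, H(X_i, L_i) \;+\; \log |X| \;+\; \log |\mathcal{L}| \;+\; \mathrm{dl}(L_i) $$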
A Class-based Model of Agreement | However, in MT, we seek a measure of sentence quality q(e) that is comparable across different hypotheses on the beam (much like the n-gram language model score).
A Class-based Model of Agreement | We trained a simple add-1 smoothed bigram language model over gold class sequences in the same treebank training data: |
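The standard add-1 (Laplace) smoothed bigram estimate over a class vocabulary C has the form:

$$ P(c_i \mid c_{i-1}) \;=\; \frac{\mathrm{count}(c_{i-1}, c_i) + 1}{\mathrm{count}(c_{i-1}) + |C|} $$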
Experiments | Our distributed 4-gram language model was trained on 600 million words of Arabic text, also collected from many sources including the Web (Brants et al., 2007).
Inference during Translation Decoding | With a trigram language model , the state might be the last two words of the translation prefix. |
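A hedged sketch (ours) of the recombination this state definition enables: hypotheses that agree on their last two words share a trigram-LM state, so only the best-scoring one per state needs to stay on the beam.

    def recombine(hypotheses):
        """Keep the best-scoring hypothesis per trigram-LM state
        (the last two words of the translation prefix).
        Each hypothesis is a (prefix_words, score) pair."""
        best = {}
        for words, score in hypotheses:
            state = tuple(words[-2:])  # trigram LM state
            if state not in best or score > best[state][1]:
                best[state] = (words, score)
        return list(best.values())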
Introduction | Intuition might suggest that the standard n-gram language model (LM) is sufficient to handle agreement phenomena.
Related Work | Monz (2011) recently investigated parameter estimation for POS-based language models , but his classes did not include inflectional features. |
Related Work | One exception was the quadratic-time dependency language model presented by Galley and Manning (2009). |
Abstract | In this work we present two extensions to the well-known dynamic programming beam search in phrase-based statistical machine translation (SMT), aiming at increased efficiency of decoding by minimizing the number of language model computations and hypothesis expansions. |
Abstract | Our results show that language model based pre-sorting yields a small improvement in translation quality and a speedup by a factor of 2. |
Experimental Evaluation | The English language model is a 4-gram LM created with the SRILM toolkit (Stolcke, 2002) on all bilingual and parts of the provided monolingual data. |
Introduction | Research efforts to increase search efficiency for phrase-based MT (Koehn et al., 2003) have explored several directions, ranging from generalizing the stack decoding algorithm (Ortiz et al., 2006) to additional early pruning techniques (Delaney et al., 2006), (Moore and Quirk, 2007) and more efficient language model (LM) querying (Heafield, 2011). |
Introduction | with Language Model LookAhead
Search Algorithm Extensions | 2.2 Language Model LookAhead |
Experiments | 3-gram (news-commentary) and 5-gram (Europarl) language models are trained on the data described in Table 1, using the SRILM toolkit (Stolcke, 2002) and binarized for efficient querying using kenlm (Heafield, 2011).
Experiments | For the 5-gram language models, we replaced every word in the LM training data that did not appear in the English part of the parallel training data with <unk>, to build an open-vocabulary language model.
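A minimal sketch of that replacement (file and variable names are illustrative):

    def map_oov_to_unk(lm_tokens, parallel_vocab):
        """Replace tokens unseen on the parallel data's English side with
        <unk>, so the LM reserves mass for out-of-vocabulary words."""
        return [tok if tok in parallel_vocab else "<unk>" for tok in lm_tokens]

    # vocab = set of words from the English side of the parallel training data
    # mapped = [map_oov_to_unk(line.split(), vocab) for line in open("lm_train.txt")]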
Experiments | Absolute improvements would be possible, e.g., by using larger language models or by adding news data to the ep training set when evaluating on crawl test sets (see, e.g., Dyer et al.
Introduction | The standard SMT training pipeline combines scores from large count-based translation models and language models with a few other features and tunes these using the well-understood line-search technique for error minimization of Och (2003). |
Introduction | The modeler’s goals might be to identify complex properties of translations, or to counter errors of pre-trained translation models and language models by explicitly down-weighting translations that exhibit certain undesired properties. |
Experiments and Results | Unigram NLLR and Filtered NLLR are the language model implementations of previous work as described in Section 3.1. |
Previous Work | They learned unigram language models (LMs) for specific time periods and scored articles with log-likelihood ratio scores. |
Timestamp Classifiers | 3.1 Language Models |
Timestamp Classifiers | We apply Dirichlet-smoothing to the language models (as in de Jong et al.
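For reference, Dirichlet smoothing interpolates the document estimate with the collection model in proportion to document length (standard form, our notation):

$$ P(w \mid d) \;=\; \frac{c(w, d) + \mu\, P(w \mid C)}{|d| + \mu} $$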
Timestamp Classifiers | The above language modeling and MaxEnt approaches are token-based classifiers that one could apply to any topic classification domain. |
Abstract | On the task shown in (Ravi and Knight, 2011) we obtain better results with only 5% of the computational effort when running our method with an n-gram language model.
Introduction | Combining Language Models and |
Training Algorithm and Implementation | As described in Section 4, the overall procedure is divided into two alternating steps: after initialization we first perform EM training of the translation model for 20-30 iterations using a 2-gram or 3-gram language model in the target language.
Training Algorithm and Implementation | The generative story described in Section 3 is implemented as a cascade of permutation, insertion, lexicon, deletion and language model finite-state transducers using OpenFST (Allauzen et al., 2007).
Translation Model | Stochastically generate the target sentence according to an n-gram language model . |
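Schematically (our notation, not the paper's), the cascade composes one machine per step of this generative story:

$$ T \;=\; \Pi \circ I \circ L \circ D \circ G $$

where \Pi is the permutation transducer, I the insertion transducer, L the lexicon, D the deletion transducer, and G an acceptor encoding the n-gram language model.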
Evaluation | Therefore, we implemented a ranking mechanism that used a hybrid scoring method, giving equal weight to the language model and the normalized phonetic similarity.
System Description | To check the likelihood and well-formedness of the new string after the replacement, we learn a 3-gram language model with absolute smoothing.
System Description | For learning the language model, we only consider the words in the CMU pronunciation dictionary that also exist in WordNet.
System Description | We remove the words containing at least one trigram that is very unlikely according to the language model.
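A hedged sketch of this filter (the threshold value and the LM interface are illustrative assumptions):

    def plausible(tokens, lm, threshold=-8.0):
        """Keep a candidate only if none of its trigrams is extremely
        unlikely under the smoothed 3-gram model."""
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(2, len(padded)):
            logp = lm.logprob(padded[i], padded[i - 2:i])  # assumed interface
            if logp < threshold:
                return False
        return True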
Experiments and Results | The features we used are those commonly used in a standard BTG decoder, such as translation probabilities, lexical weights, language model, word penalty and distortion probabilities.
Experiments and Results | The language model is a 5-gram language model trained on the target sentences in the training data.
Experiments and Results | The language model is a 5-gram language model trained on the Giga-Word corpus plus the English sentences in the training data.
Features and Training | We also use other fundamental features, such as translation probabilities, lexical weights, distortion probability, word penalty, and language model probability. |
Experiments | The monolingual data for training the English language model includes the Xinhua portion of the GIGAWORD corpus, which contains 238M English words.
Experiments | A 4-gram language model was trained on the monolingual data by the SRILM toolkit (Stolcke, 2002).
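For illustration, a 4-gram model of this kind can be built with SRILM's ngram-count; the smoothing flags and file names below are common choices, not stated in the paper:

    import subprocess

    subprocess.run([
        "ngram-count", "-order", "4",
        "-text", "mono.tok.txt",       # tokenized monolingual text (illustrative name)
        "-lm", "lm.4gram.arpa",        # output ARPA-format model (illustrative name)
        "-kndiscount", "-interpolate"  # modified Kneser-Ney smoothing (assumption)
    ], check=True)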
Related Work | Researchers have also introduced topic models for cross-lingual language model adaptation (Tam et al., 2007; Ruiz and Federico, 2011).
Related Work | Based on the bilingual topic model, they apply the source-side topic weights to the target-side topic model, and adapt the n-gram language model of the target side.
Experiments | We used SRILM for the training of language models (5-gram in all the experiments).
Experiments | We trained a Chinese language model for the EC translation on the Chinese part of the bi-text. |
Experiments | For the English language model of CE translation, an extra corpus named Tanaka was used in addition to the English part of the bilingual corpora.
Experimental Design | Decoding is performed approximately via cube pruning (Chiang, 2007), by integrating a trigram language model extracted from the training set (see Konstas and Lapata (2012) for details).
Experimental Design | Lexical Features These features encourage grammatical coherence and inform lexical selection over and above the limited horizon of the language model captured by Rules (6)-(9).
Problem Formulation | In machine translation, a decoder that implements forest rescoring (Huang and Chiang, 2007) uses the language model as an external criterion of the goodness of sub-translations with respect to their grammaticality.
Inferring a learning curve from mostly monolingual data | In this section we address scenario S1: we have access to a source-language monolingual collection (from which portions to be manually translated could be sampled) and a target-language in-domain monolingual corpus, to supplement the target side of a parallel corpus while training a language model.
Inferring a learning curve from mostly monolingual data | (b) perplexity of language models of order 2 to 5 derived from the monolingual source corpus computed on the source side of the test corpus. |
Inferring a learning curve from mostly monolingual data | The Lasso regression model selected four features from the entire feature set: i) size of the test set (sentences & tokens); ii) perplexity of the language model (order 5) on the test set; iii) type-token ratio of the target monolingual corpus.
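A hedged sketch of fitting such a predictor with scikit-learn's Lasso; the feature values below are made-up placeholders and the target metric is an assumption, shown only to make the L1-based feature selection concrete:

    import numpy as np
    from sklearn.linear_model import Lasso

    # One row per configuration; columns such as test-set size in sentences
    # and tokens, order-5 LM perplexity on the test set, and the target
    # corpus type-token ratio (all values here are illustrative).
    X = np.array([[2000, 48000, 310.5, 0.071],
                  [1500, 36500, 290.2, 0.068],
                  [2500, 61000, 355.0, 0.075]])
    y = np.array([21.4, 22.9, 20.1])  # e.g., BLEU of the trained system (assumed)

    model = Lasso(alpha=0.1).fit(X, y)  # L1 penalty drives some weights to zero
    print(model.coef_)                  # non-zero weights mark the selected features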