Abstract | These techniques speed up NNJM computation by a factor of 10,000x, making the model as fast as a standard back-off LM.
Decoding with the NNJM | When performing hierarchical decoding with an n-gram LM, the leftmost and rightmost n−1 words from each constituent must be stored in the state space.
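As an illustration of the bookkeeping this implies, here is a minimal sketch (illustrative only, not the decoder's actual data structures) of the LM state kept for a constituent: only its leftmost and rightmost n−1 words are retained, since they are all that future n-gram scoring can see.

```python
def lm_state(words, n):
    """LM state for a constituent in hierarchical decoding (a sketch).

    Only the leftmost and rightmost n-1 words can participate in future
    n-gram scoring, so only they are stored; two hypotheses with equal
    states can be recombined.
    """
    k = n - 1
    if len(words) <= 2 * k:  # too short to elide anything: keep all words
        return tuple(words), tuple(words)
    return tuple(words[:k]), tuple(words[-k:])
```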
Model Variations | 4-gram Kneser-Ney LM
Model Variations | Dependency LM (Shen et al., 2010)
Model Variations | Contextual lexical smoothing (Devlin, 2009)
Model Variations | Length distribution (Shen et al., 2010)
Model Variations | LM adaptation (Snover et al., 2008)
Neural Network Joint Model (NNJM) | Formally, our model approximates the probability of target hypothesis T conditioned on source sentence S. We follow the standard n-gram LM decomposition of the target, where each target word t_i is conditioned on the previous n−1 target words.
Neural Network Joint Model (NNJM) | It is clear that this model is effectively an (n+m)-gram LM, and a 15-gram LM would be far too sparse to estimate with standard back-off techniques.
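Putting the two conditioning contexts together gives the factorization below; this is a sketch of the decomposition the text describes, assuming (as is standard for this model family) that each target word t_i is affiliated with a source position a_i and conditioned on a symmetric m-word source window around it:

```latex
P(T \mid S) \;\approx\; \prod_{i=1}^{|T|}
  P\!\left(t_i \;\middle|\; t_{i-n+1}, \ldots, t_{i-1},\;
           s_{a_i - \frac{m-1}{2}}, \ldots, s_{a_i + \frac{m-1}{2}}\right)
```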
Neural Network Joint Model (NNJM) | Although self-normalization significantly improves the speed of NNJM lookups, the model is still several orders of magnitude slower than a back-off LM.
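For context, self-normalization trains the network so that the softmax normalizer stays close to 1, letting the decoder skip the expensive sum over the vocabulary. A minimal sketch of such a training loss, with an assumed penalty weight alpha:

```python
import numpy as np

def self_normalized_loss(scores, target, alpha=0.1):
    """Per-example training loss for a self-normalized network (a sketch).

    scores: unnormalized output-layer scores over the vocabulary.
    The alpha * (log Z)^2 penalty drives log Z toward 0, so at decode
    time scores[target] can be used directly as a log-probability
    without computing the normalizer Z.
    """
    log_z = np.log(np.sum(np.exp(scores)))  # normalizer (use a stable log-sum-exp in practice)
    nll = -(scores[target] - log_z)         # standard cross-entropy term
    return nll + alpha * log_z ** 2
```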
Abstract | Our novel lattice desegmentation algorithm effectively combines both segmented and desegmented views of the target language for a large subspace of possible translation outputs, which allows for inclusion of features related to the desegmentation process, as well as an unsegmented language model (LM).
Methods | In order to annotate lattice edges with an n-gram LM, every path coming into a node must end with the same sequence of (n−1) tokens.
Methods | If this property does not hold, then nodes must be split until it does. This property is maintained by the decoder’s recombination rules for the segmented LM, but it is not guaranteed for the desegmented LM.
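A minimal sketch of this splitting step (not the paper's algorithm verbatim), assuming the lattice is given as an adjacency dict from node to (token, next-node) edges; pairing each node with its incoming (n−1)-token history is equivalent to splitting it until every incoming path agrees on that history:

```python
def split_for_lm(edges, start, n):
    """Split lattice nodes for n-gram LM annotation (a sketch).

    edges: dict mapping node -> list of (token, next_node) pairs.
    Each output node is a (node, history) pair, where history holds the
    last n-1 tokens on the path in; this guarantees every path entering
    a node ends with the same (n-1)-token sequence.
    """
    h = n - 1
    new_edges = {}
    stack, seen = [(start, ())], {(start, ())}
    while stack:
        state = stack.pop()
        node, hist = state
        for token, nxt in edges.get(node, []):
            new_hist = (hist + (token,))[-h:] if h else ()
            nxt_state = (nxt, new_hist)
            new_edges.setdefault(state, []).append((token, nxt_state))
            if nxt_state not in seen:
                seen.add(nxt_state)
                stack.append(nxt_state)
    return new_edges
```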
Methods | Indeed, the expanded word-level context is one of the main benefits of incorporating a word-level LM.
Baselines | It adds an LM component to the MLF baseline.
Baselines | This LM baseline allows comparing classification through L1 fragments in an L2 context with more traditional L2 context modelling (i.e.
Discussion and conclusion | [Table residue: rows for the LM baseline and the l1r1, l1r1+LM, auto, and auto+LM systems; scores not recoverable]
Experiments & Results | As expected, the LM baseline substantially outperforms the context-insensitive MLF baseline. |
Experiments & Results | Second, our classifier approach attains a substantially higher accuracy than the LM baseline. |
Experiments & Results | The same significance level was found when comparing l1r1+LM against l1r1, auto+LM against auto, as well as the LM baseline against the MLF baseline.
Experiments | The Chinese part of the corpus is segmented into words before LM training. |
Experiments | The pinyin syllable segmentation already achieves very high accuracy (over 98%) with a trigram LM using improved Kneser-Ney smoothing.
Experiments | We consider different LM smoothing methods including Kneser-Ney (KN), improved Kneser-Ney (IKN), and Witten-Bell (WB). |
Pinyin Input Method Model | W_E(v_{i,j} → v_{j+1,k}) = −log P(v_{j+1,k} | v_{i,j}). Although the model is formulated as a first-order HMM, i.e., the LM used for the transition probability is a bigram, it is easy to extend the model to take advantage of a higher-order n-gram LM by tracking a longer history while traversing the graph.
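The extension amounts to keying the dynamic-programming state on the last n−1 words instead of the last word alone. A minimal sketch under that reading, where spans and lm_logprob are assumed inputs (the lattice as candidate word spans over the pinyin sequence, and an n-gram scoring function):

```python
def decode_ime(spans, lm_logprob, n=2):
    """Best-path decoding over a pinyin lattice (a sketch).

    spans: dict mapping start position j -> list of (end position k, word)
           candidates covering pinyin syllables j..k (k > j).
    lm_logprob(history, word): log P(word | history), where history is a
           tuple of up to n-1 previous words.  Carrying this history in
           the search state is what lifts the first-order (bigram)
           formulation to a higher-order n-gram LM.
    """
    length = max(k for cands in spans.values() for k, _ in cands)
    best = {(0, ()): (0.0, [])}  # (position, history) -> (score, words)
    for j in range(length):
        for (pos, hist), (score, words) in list(best.items()):
            if pos != j:
                continue
            for k, w in spans.get(j, []):
                new_hist = (hist + (w,))[-(n - 1):] if n > 1 else ()
                cand = (score + lm_logprob(hist, w), words + [w])
                key = (k, new_hist)
                if key not in best or cand[0] > best[key][0]:
                    best[key] = cand
    finals = [v for (pos, _), v in best.items() if pos == length]
    return max(finals)[1] if finals else None
```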
Related Works | Various approaches have been proposed for the task, including language model (LM) based methods (Chen et al., 2013), ME models (Han and Chang, 2013), CRFs (Wang et al., 2013d; Wang et al., 2013a), SMT (Chiu et al., 2013; Liu et al., 2013), and graph models (Jia et al., 2013).
A Skeleton-based Approach to MT 2.1 Skeleton Identification | Given a translation model m, a language model lm and a vector of feature weights w, the model score of a derivation d is computed by |
A Skeleton-based Approach to MT 2.1 Skeleton Identification | g(d; w, m, lm) = w_m · f_m(d) + w_lm · lm(d)   (4)
A Skeleton-based Approach to MT 2.1 Skeleton Identification | lm(d) and w_lm are the score and weight of the language model, respectively.
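As a concrete reading of Eq. 4, a minimal sketch with plain-list feature vectors (the argument names are illustrative, not from the paper):

```python
def derivation_score(f_m, w_m, lm_score, w_lm):
    """Model score of a derivation d, following Eq. 4: the dot product
    of translation-model features f_m(d) with their weights w_m, plus
    the weighted language-model score lm(d)."""
    return sum(w * f for w, f in zip(w_m, f_m)) + w_lm * lm_score
```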
Experiments and Results | The LM corpus is the English side of the parallel data (BTEC, CJK and CWMT08) (1.34M sentences).
Experiments and Results | The LM corpus is the English side of the parallel data as well as the English Gigaword corpus (LDC2007T07) (11.3M sentences). |
Input Features for DNN Feature Learning | The backward LM was introduced by Xiong et al.
Input Features for DNN Feature Learning | (2011); together with a standard forward LM, it captures both the preceding and succeeding contexts of the current word. We estimate the backward LM by inverting the word order of each sentence in the training data, from the original order to the reverse order.
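The estimation step is just a corpus transformation; a minimal sketch, with hypothetical input/output paths, that reverses each sentence so that any standard n-gram toolkit trained on the output yields a backward LM:

```python
def reverse_corpus(path_in, path_out):
    """Write a word-reversed copy of a tokenized, one-sentence-per-line
    corpus.  An n-gram LM trained on the output estimates the backward
    probability P(w_i | w_{i+1}, ..., w_{i+n-1})."""
    with open(path_in, encoding="utf-8") as f_in, \
         open(path_out, "w", encoding="utf-8") as f_out:
        for line in f_in:
            f_out.write(" ".join(reversed(line.split())) + "\n")
```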
Related Work | (2012) improved the translation quality of an n-gram translation model by using a bilingual neural LM, where translation probabilities are estimated using a continuous representation of translation units in lieu of standard discrete representations.
Latent Structure in Dialogues | The simplest formulation we consider is an HMM where each state contains a unigram language model (LM), proposed by Chotimongkol (2008) for task-oriented dialogue and originally
Latent Structure in Dialogues | words are generated (independently) according to the LM.
Latent Structure in Dialogues | (2010) extends the LM-HMM to allow words to be emitted from two additional sources: the topic of the current dialogue φ, or a background LM shared across all dialogues.
Keyphrase Extraction Approaches | A unigram LM and an n-gram LM are constructed for each of these two corpora. |
Keyphrase Extraction Approaches | Phraseness, defined using the foreground LM, is calculated as the loss of information incurred by assuming a unigram LM (i.e., conditional independence among the words of the phrase) instead of an n-gram LM (i.e., the phrase is drawn from an n-gram LM).
Keyphrase Extraction Approaches | Informativeness is computed as the loss that results from assuming the candidate is sampled from the background LM rather than the foreground LM.
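One common reading of these two definitions is as pointwise KL-divergence terms; a minimal sketch under that assumption, where fg_ngram, fg_unigram, and bg_ngram are hypothetical functions returning a phrase's (nonzero) probability under the respective LMs:

```python
import math

def phraseness(phrase, fg_ngram, fg_unigram):
    """Information lost by modelling the phrase with a unigram LM
    (independent words) instead of the foreground n-gram LM."""
    p_ngram = fg_ngram(phrase)
    p_indep = math.prod(fg_unigram(w) for w in phrase)
    return p_ngram * math.log(p_ngram / p_indep)

def informativeness(phrase, fg_ngram, bg_ngram):
    """Information lost by assuming the candidate is drawn from the
    background LM rather than the foreground LM."""
    return fg_ngram(phrase) * math.log(fg_ngram(phrase) / bg_ngram(phrase))
```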
System Description | Finally, the system computes the average log-probability and number of out-of-vocabulary words from a language model trained on a collection of essays written by nonnative English speakers (“nonnative LM”).
System Description | Feature ablation: our system 0.668; − nonnative LM (§3.2.2) 0.665; − HPSG parse (§3.2.3) 0.664; − PCFG parse (§3.2.4) 0.662; − spelling (§3.2.1) 0.643; − gigaword LM (§3.2.2) 0.638; − link parse (§3.2.3) 0.632
Pattern extraction by sentence compression | The method of Clarke and Lapata (2008) uses a trigram language model (LM) to score compressions.
Pattern extraction by sentence compression | Since we are interested in very short outputs, an LM trained on standard, uncompressed text would not be suitable.
Pattern extraction by sentence compression | Instead, we chose to modify the method of Filippova and Altun (2013) because it relies on dependency parse trees and does not use any LM scoring. |