Mention Resolution Approach | We define a mention m as a tuple ⟨l_m, e_m⟩, where l_m is the “literal” string of characters that represents m and e_m is the email where m is observed.1 We assume that m can be resolved to a distinguishable participant for whom at least one email address is present in the collection.2 |
Mention Resolution Approach | Select a specific lexical reference l_m to refer to a given identity in the context e_m.
Mention Resolution Approach | 1The exact position in e_m where l_m is observed should also be included in the definition, but we ignore it, assuming that all matched literal mentions in one email refer to the same identity.
Baselines | It adds a LM component to the MLF baseline. |
Baselines | This LM baseline allows the comparison of classification through L1 fragments in an L2 context, with a more traditional L2 context modelling (i.e. |
Discussion and conclusion | LM baseline l1r1
Discussion and conclusion | l1r1+LM auto
Discussion and conclusion | LM baseline l1r1
Experiments & Results | As expected, the LM baseline substantially outperforms the context-insensitive MLF baseline. |
Experiments & Results | Second, our classifier approach attains a substantially higher accuracy than the LM baseline. |
Experiments & Results | The same significance level was found when comparing l1r1+LM against l1r1, auto+LM against auto, as well as the LM baseline against the MLF baseline.
Abstract | Our novel lattice desegmentation algorithm effectively combines both segmented and desegmented views of the target language for a large subspace of possible translation outputs, which allows for inclusion of features related to the desegmentation process, as well as an unsegmented language model (LM).
Methods | In order to annotate lattice edges with an n-gram LM, every path coming into a node must end with the same sequence of (n − 1) tokens.
Methods | If this property does not hold, then nodes must be split until it does.4 This property is maintained by the decoder’s recombination rules for the segmented LM, but it is not guaranteed for the desegmented LM . |
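As a toy illustration of the node-splitting property above (not the authors' decoder), the bigram case n = 2 can be sketched: every node is cloned once per distinct incoming token, so all paths entering a clone agree on the single (n − 1 = 1) word of context the LM needs. The edge encoding and names here are invented for the sketch.

```python
from collections import defaultdict

def split_for_bigram(edges, start):
    """Clone lattice nodes so that every edge entering a clone carries the
    same token, i.e. all incoming paths agree on the 1-word context a
    bigram LM needs. Edges are (src, dst, token) triples."""
    outgoing = defaultdict(list)
    for src, dst, tok in edges:
        outgoing[src].append((dst, tok))
    new_edges, seen = [], set()
    frontier = [(start, (start, None))]   # (original node, its clone id)
    while frontier:
        node, clone = frontier.pop()
        if clone in seen:
            continue
        seen.add(clone)
        for dst, tok in outgoing[node]:
            dst_clone = (dst, tok)        # one clone per distinct incoming token
            new_edges.append((clone, dst_clone, tok))
            frontier.append((dst, dst_clone))
    return new_edges
```

A node reached by two different tokens (e.g. "a" and "b") is split into two clones, and its outgoing edges are duplicated onto each clone, which is exactly why desegmentation can blow up the lattice for a word-level LM.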
Methods | Indeed, the expanded word-level context is one of the main benefits of incorporating a word-level LM . |
Abstract | These techniques speed up NNJM computation by a factor of 10,000x, making the model as fast as a standard back-off LM.
Decoding with the NNJ M | When performing hierarchical decoding with an n-gram LM, the leftmost and rightmost n − 1 words from each constituent must be stored in the state space.
Model Variations | 4-gram Kneser-Ney LM
Model Variations | Dependency LM (Shen et al., 2010) Contextual lexical smoothing (Devlin, 2009) Length distribution (Shen et al., 2010) |
Model Variations | • LM adaptation (Snover et al., 2008)
Neural Network Joint Model (NNJ M) | Formally, our model approximates the probability of target hypothesis T conditioned on source sentence S. We follow the standard n-gram LM decomposition of the target, where each target word ti is conditioned on the previous n − 1 target words.
Neural Network Joint Model (NNJ M) | It is clear that this model is effectively an (n+m)-gram LM, and a 15-gram LM would be
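A minimal sketch of that decomposition (a hypothetical helper, not the NNJM implementation): for each target word, the conditioning context is the n − 1 previous target words plus an m-word source window centred on the affiliated source word, giving (n+m)-gram-style events.

```python
def nnjm_contexts(source, target, affiliation, n=4, m=3):
    """For each target word t_i, build the conditioning context of an
    (n+m)-gram style joint model: the n-1 previous target words plus an
    m-word source window centred on the source word affiliated with t_i.
    `affiliation[i]` gives the source index aligned to target word i.
    Toy re-implementation for illustration only."""
    pad = ["<s>"]
    contexts = []
    for i, t in enumerate(target):
        hist = (pad * (n - 1) + target)[i : i + n - 1]   # n-1 target history words
        a, half = affiliation[i], m // 2
        win = [source[j] if 0 <= j < len(source) else "<pad>"
               for j in range(a - half, a + half + 1)]   # m-word source window
        contexts.append((t, tuple(hist), tuple(win)))
    return contexts
```

Each `(word, history, window)` triple is one training/scoring event; a neural model then estimates P(t_i | history, window) for it.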
Neural Network Joint Model (NNJ M) | Although self-normalization significantly improves the speed of NNJM lookups, the model is still several orders of magnitude slower than a back-off LM . |
Baseline MT | The LM used for decoding is a log-linear combination of four word n-gram LMs which are built on different English corpora (details described in section 5.1), with the LM weights optimized on a development set and determined by minimum error rate training (MERT), to estimate the probability of a word given the preceding words.
Experiments | LM1 is a 7-gram LM trained on the tar- |
Experiments | LM2 is a 7-gram LM trained only on the English monolingual discussion forums data listed above. |
Experiments | LM3 is a 4-gram LM trained on the web genre among the target side of all parallel text (i.e., web text from pre-BOLT parallel text and BOLT released discussion forum parallel text). |
Introduction | Optimize name translation and context translation simultaneously and conduct name translation driven decoding with language model ( LM ) based selection (Section 3.2). |
Related Work | enable the LM to decide which translations to choose when encountering the names in the texts (Ji et al., 2009). |
Related Work | The LM selection method often assigns an inappropriate weight to the additional name translation table because it is constructed independently from translation of context words; therefore after weighted voting most correct name translations are not used in the final translation output. |
Related Work | More importantly, in these approaches the MT model was still mostly treated as a “black-box” because neither the translation model nor the LM was updated or adapted specifically for names. |
Abstract | We propose an online approach, integrating a source LM and/or back-transliteration and an English LM.
Experiments | We observed slight improvement by incorporating the source LM, and observed a 0.48 point F-value increase over baseline, which translates to a 4.65 point Katakana F-value change and 16.0% (3.56% to 2.99%) WER reduction, mainly due to its higher Katakana word rate (11.2%).
Experiments | H. This type of error is reduced by +LM-P, e.g., *purasu chikku “*plus tick” to purasuchikku “plastic”, due to LM projection.
Experiments | performance, which may be because one cannot limit where the source LM features are applied. |
Introduction | We refer to this process of transliterating unknown words into another language and using the target LM as LM projection. |
Introduction | Since the model employs a general transliteration model and a general English LM , it achieves robust WS for unknown words. |
Use of Language Model | As we mentioned in Section 2, English LM knowledge helps split transliterated compounds. |
Use of Language Model | We use LM projection, which is a combination of back-transliteration and an English model, by extending the normal lattice building process as follows:
Use of Language Model | Then, edges are spanned between these extended English nodes, instead of between the original nodes, by additionally taking into consideration English LM features (ID 21 and 22 in Table 1): φ_1^LMP(w_i) = log p(w_i) and φ_2^LMP(w_{i−1}, w_i) = log p(w_{i−1}, w_i).
Word Segmentation Model | Figure 1: Example lattice with LM projection |
Comparative Study | Figure 4: bilingual LM illustration. |
Comparative Study | 4.3 Bilingual LM |
Comparative Study | We build a 9-gram LM using the SRILM toolkit (Stolcke, 2002) with modified Kneser-Ney smoothing.
Experiments | Our baseline is a phrase-based decoder, which includes the following models: an n-gram target-side language model ( LM ), a phrase translation model and a word-based lexicon model. |
Experiments | • lowercased training data from the GALE task (Table 1, UN corpus not included); alignment trained with GIZA++
Experiments | • tuning corpus: NIST06; test corpora: NIST02, 03, 04, 05 and 08
Experiments | • 5-gram LM (1,694,412,027 running words) trained by the SRILM toolkit (Stolcke, 2002) with modified Kneser-Ney smoothing; training data: target side of bilingual data.
Approach to Sentence-Level Dialect Identification | The aforementioned approach relies on language models (LMs) and an MSA and EDA morphological analyzer to decide whether each word is (a) MSA, (b) EDA, (c) both (MSA & EDA), or (d) OOV.
Approach to Sentence-Level Dialect Identification | The following variants of the underlying token-level system are built to assess the effect of varying the level of preprocessing on the underlying LM on the performance of the overall sentence level dialect identification process: (1) Surface, (2) Tokenized, (3) CODAfied, and (4) Tokenized-CODA. |
Approach to Sentence-Level Dialect Identification | Orthography Normalized (CODAfied) LM : since DA is not originally a written form of Arabic, no standard orthography exists for it. |
Related Work | Amazon Mechanical Turk and try a language modeling ( LM ) approach to solve the problem. |
Decoding | The language model ( LM ) scoring is directly integrated into the cube pruning algorithm. |
Decoding | Thus, LM estimates are available for all considered hypotheses. |
Decoding | Because the output trees can remain discontiguous after hypothesis creation, LM scoring has to be done individually over all output trees. |
Conclusion | We proposed a method for annotating senses to terms in short queries, and also described an approach to integrate senses into an LM approach for IR. |
Conclusion | Our experimental results showed that the incorporation of senses improved a state-of-the-art baseline, a stem-based LM approach with PRF method. |
Conclusion | We also proposed a method to further integrate the synonym relations to the LM approaches. |
Experiments | We use the Lemur toolkit (Ogilvie and Callan, 2001) version 4.11 as the basic retrieval tool, and select the default unigram LM approach based on KL-divergence and Dirichlet-prior smoothing method in Lemur as our basic retrieval approach. |
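For orientation, query likelihood with Dirichlet-prior smoothing (which KL-divergence ranking reduces to for a maximum-likelihood query model) can be sketched as follows. This is a toy re-implementation, not Lemur's code; `mu` is the usual Dirichlet prior parameter.

```python
import math
from collections import Counter

def dirichlet_score(query, doc, collection, mu=2000):
    """Query-likelihood retrieval score under a unigram LM with
    Dirichlet-prior smoothing: p(w|d) = (c(w,d) + mu*p(w|C)) / (|d| + mu).
    `doc` and `collection` are token lists; toy sketch only."""
    doc_tf = Counter(doc)
    col_tf = Counter(collection)
    col_len = len(collection)
    score = 0.0
    for w in query:
        p_c = col_tf[w] / col_len                        # collection LM
        p_d = (doc_tf[w] + mu * p_c) / (len(doc) + mu)   # smoothed document LM
        score += math.log(p_d)
    return score
```

Documents are then ranked by this score for each query; a larger `mu` pulls every document model closer to the collection model.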
Incorporating Senses into Language Modeling Approaches | In this section, we propose to incorporate senses into the LM approach to IR. |
Incorporating Senses into Language Modeling Approaches | In this part, we further integrate the synonym relations of senses into the LM approach. |
Introduction | We incorporate word senses into the language modeling ( LM ) approach to IR (Ponte and Croft, 1998), and utilize sense synonym relations to further improve the performance. |
Introduction | Section 3 introduces the LM approach to IR, including the pseudo relevance feedback method. |
Introduction | generating word senses for query terms in Section 4, followed by presenting our novel method of incorporating word senses and their synonyms into the LM approach in Section 5. |
The Language Modeling Approach to IR | This section describes the LM approach to IR and the pseudo relevance feedback approach. |
Word Sense Disambiguation | Suppose we already have a basic IR method which does not require any sense information, such as the stem-based LM approach. |
Introduction | Research efforts to increase search efficiency for phrase-based MT (Koehn et al., 2003) have explored several directions, ranging from generalizing the stack decoding algorithm (Ortiz et al., 2006) to additional early pruning techniques (Delaney et al., 2006), (Moore and Quirk, 2007) and more efficient language model ( LM ) querying (Heafield, 2011). |
Introduction | We show that taking a heuristic LM score estimate for pre-sorting the |
Introduction | Further, we introduce two novel LM lookahead methods. |
Search Algorithm Extensions | In addition to sorting according to the purely phrase-internal scores, which is common practice, we compute an estimate q_LME(ẽ) for the LM score of each target phrase ẽ. q_LME(ẽ) is the weighted LM score we receive by assuming ẽ to be a complete sentence without using sentence start and end markers.
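That pre-sorting heuristic can be sketched with invented toy tables: the phrase is scored as a standalone sentence with no boundary markers, the first word falling back to its unigram log-probability.

```python
def lm_estimate(phrase, bigram_lp, unigram_lp):
    """Heuristic LM score estimate q_LME for a target phrase: score it as if
    it were a complete sentence, without sentence start/end markers, so the
    first word gets its unigram log-probability. Toy log-prob tables."""
    words = phrase.split()
    score = unigram_lp[words[0]]
    for prev, w in zip(words, words[1:]):
        score += bigram_lp.get((prev, w), unigram_lp[w])  # crude backoff
    return score

def presort(phrases, bigram_lp, unigram_lp):
    # Sort candidate target phrases so stronger LM candidates are tried first.
    return sorted(phrases,
                  key=lambda p: lm_estimate(p, bigram_lp, unigram_lp),
                  reverse=True)
```

The estimate is cheap because it needs no decoder context, which is what makes it usable for pre-sorting phrase candidates before search.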
Search Algorithm Extensions | LM score computations are among the most expensive in decoding. |
Search Algorithm Extensions | (2006) report significant improvements in runtime by removing unnecessary LM lookups via early pruning. |
Experiments | We use a modified Kneser-Ney (KN) backoff 4-gram baseline LM.
Incorporating Syntactic Structures | [Diagram: each word w_k is mapped by external tool(s) to syntactic structures s_k^1, …, s_k^n, from which the LM computes features F(w_k, s_k^1, …, s_k^n).] Here, {w1, .
Introduction | Language models ( LM ) are crucial components in tasks that require the generation of coherent natural language text, such as automatic speech recognition (ASR) and machine translation (MT). |
Introduction | For example, incorporating long-distance dependencies and syntactic structure can help the LM better predict words by complementing the predictive power of n-grams (Chelba and Jelinek, 2000; Collins et al., 2005; Filimonov and Harper, 2009; Kuo et al., 2009). |
Introduction | Even so, the reliance on auxiliary tools slows LM application to the point of being impractical for real-time systems.
Syntactic Language Models | (2009) integrate syntactic features into a neural network LM for Arabic speech recognition. |
Syntactic Language Models | We work with a discriminative LM with long-span dependencies. |
Syntactic Language Models | The LM score S(w, a) for each hypothesis w of a speech utterance with acoustic sequence a is based on the baseline ASR system score b(w, a) (initial n-gram LM score and the acoustic score) and c_m, the weight assigned to the baseline score.1 The score is defined as:
Experimental Evaluation | Using a 2-gram LM they obtain 15.3 BLEU and with a whole segment LM , they achieve 19.3 BLEU. |
Experimental Evaluation | In comparison to this baseline we run our algorithm with N0 = 50 candidates per source word for both a 2-gram and a 3-gram LM.
Experimental Evaluation | Figure 3 and Figure 4 show the evolution of BLEU and TER scores for applying our method using a 2-gram and a 3-gram LM.
Training Algorithm and Implementation | Our FST representation of the LM makes use of failure transitions as described in (Allauzen et al., 2003). |
Machine Translation as a Decipherment Task | For P (e), we use a word n-gram LM trained on monolingual English data. |
Machine Translation as a Decipherment Task | Better LMs yield better MT results for both parallel and decipherment training—for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment. |
Word Substitution Decipherment | We model P(e) using a statistical word n-gram English language model ( LM ). |
Word Substitution Decipherment | We use an English word bi-gram LM as the base distribution (P0) for the source model and specify a uniform P0 distribution for the |
Word Substitution Decipherment | In order to sample at position i, we choose the top K English words Y ranked by P(X Y Z), which can be computed offline from a statistical word bigram LM.
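The offline top-K computation can be sketched as below (toy bigram table; the real model's probabilities and smoothing are not shown): candidates Y between fixed neighbours X and Z are ranked by P(Y | X) · P(Z | Y).

```python
def top_k_choices(x, z, vocab, bigram_p, k=3):
    """Rank candidate words Y for the position between known neighbours X
    and Z by P(X Y Z) ∝ P(Y|X)·P(Z|Y) under a bigram LM; the top K per
    (X, Z) pair can be precomputed offline. Toy probabilities, with a
    small floor for unseen bigrams."""
    scored = [(bigram_p.get((x, y), 1e-9) * bigram_p.get((y, z), 1e-9), y)
              for y in vocab]
    scored.sort(reverse=True)
    return [y for _, y in scored[:k]]
```

Restricting the sampler to these K words is what keeps each sampling step fast even with a large English vocabulary.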
Experiments | To test our LM implementations, we performed experiments with two different language models. |
Introduction | Where classic LMs take word tuples and produce counts or probabilities, we propose an LM that takes a word-and-context encoding (so the context need not be re-looked up) and returns both the probability and also the context encoding for the suffix of the original query.
Introduction | Our LM toolkit, which is implemented in Java and compatible with the standard ARPA file formats, is available on the web.1 |
Preliminaries | Tries or variants thereof are implemented in many LM tool kits, including SRILM (Stolcke, 2002), IRSTLM (Federico and Cettolo, 2007), CMU SLM (Whittaker and Raj, 2001), and MIT LM (Hsu and Glass, 2008). |
Speeding up Decoding | Figure 3: Queries issued when scoring trigrams that are created when a state with LM context “the cat” combines with “fell down”. |
Speeding up Decoding | We found in our experiments that a cache using linear probing provided marginal performance increases of about 40%, largely because of cached back-off computation, while our simpler cache increases performance by about 300% even over our HASH LM implementation. |
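The "simpler cache" idea can be illustrated with a direct-mapped table in front of an arbitrary scoring function (a hypothetical stand-in, not the toolkit's implementation): each query hashes to exactly one slot, and a collision simply overwrites the slot, so there is no probing.

```python
class DirectMappedCache:
    """Fixed-size, direct-mapped cache in front of an LM scoring function.
    Repeated queries (very common during decoding) hit the cached value;
    a colliding query just overwrites the slot. Toy sketch."""
    def __init__(self, lm_score, size=1024):
        self.lm_score = lm_score
        self.keys = [None] * size
        self.vals = [None] * size
        self.size = size
        self.hits = self.misses = 0

    def score(self, ngram):
        slot = hash(ngram) % self.size
        if self.keys[slot] == ngram:
            self.hits += 1
            return self.vals[slot]
        self.misses += 1
        val = self.lm_score(ngram)
        self.keys[slot] = ngram
        self.vals[slot] = val
        return val
```

The appeal is constant-time lookup with no eviction bookkeeping: decoding query streams are so repetitive that even this crude policy absorbs most lookups.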
Speeding up Decoding | As discussed in Section 3, our LM implementations can answer queries about context-encoded n-grams faster than explicitly encoded n-grams. |
Experiments | imizing the LM score over all possible permutations. |
Experiments | Usually the LM score is used as one component of a more complex decoder score which also includes biphrase and distortion scores. |
Experiments | But in this particular “translation task” from bad to good English, we consider that all “biphrases” are of the form e–e, where e is an English word, and we do not take into account any distortion: we only consider the quality of the permutation as it is measured by the LM component.
Phrase-based Decoding as TSP | 4.1 From Bigram to N-gram LM |
Decipherment | We build a statistical English language model ( LM ) for the plaintext source model P (p), which assigns a probability to any English letter sequence. |
Decipherment | We build an interpolated word+n-gram LM and use it to assign P(p) probabilities to any plaintext letter sequence p1...pn.3 The advantage is that it helps direct the sampler towards plaintext hypotheses that resemble natural language: high-probability letter sequences which form valid words such as “H E L L O” instead of sequences like “T X H R T”.
Decipherment | 3We set the interpolation weights for the word and n-gram LM as (0.9, 0.1). |
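A toy version of that interpolation (hypothetical helpers, with the weights (0.9, 0.1) from the footnote): a uniform word model over an English vocabulary is mixed with an add-one-smoothed letter bigram model, so sequences that form valid words score far higher than junk sequences.

```python
from collections import Counter

def make_letter_bigram(corpus_words):
    """Train a toy add-one-smoothed letter-bigram model from a word list;
    '#' marks the word start."""
    counts, totals = Counter(), Counter()
    alphabet_size = 27                       # a-z plus the '#' marker
    for w in corpus_words:
        seq = "#" + w
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    def prob(word):
        seq = "#" + word
        p = 1.0
        for a, b in zip(seq, seq[1:]):
            p *= (counts[(a, b)] + 1) / (totals[a] + alphabet_size)
        return p
    return prob

def interp_prob(word, vocab, letter_prob, lam=0.9):
    """word+n-gram interpolation: lam * P_word + (1 - lam) * P_letters,
    with a uniform word model over `vocab`. Toy sketch."""
    p_word = 1.0 / len(vocab) if word in vocab else 0.0
    return lam * p_word + (1 - lam) * letter_prob(word)
```

Because the word model dominates the mixture, hypotheses that segment into dictionary words get most of the probability mass, which is exactly the pressure the sampler needs.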
Experiments and Results | Even with a 3-gram letter LM , our method yields a +63% improvement in decipherment accuracy over EM on the homophonic cipher with spaces. |
Experiments and Results | We observe that the word+3-gram LM proves highly effective when tackling more complex ciphers and cracks the Zodiac-408 cipher.
Experiments and Results | • Letter n-gram versus word+n-gram LMs: Figure 2 shows that using a word+3-gram LM instead of a 3-gram LM results in a +75% improvement in decipherment accuracy.
Dependency Language Model | Suppose we use a trigram dependency LM, |
Discussion | Only translation probability P was employed in the construction of the target forest due to the complexity of the syntax-based LM . |
Discussion | Since our dependency LM models structures over target words directly based on dependency trees, we can build a single-step system. |
Discussion | This dependency LM can also be used in hierarchical MT systems using lexical-ized CFG trees. |
Experiments | 0 str-dep: a string-to-dependency system with a dependency LM . |
Experiments | The English side of this subset was also used to train a 3-gram dependency LM . |
Experiments | Table: BLEU% (lower / mixed) and TER% (lower / mixed):
Experiments | Decoding (3-gram LM): baseline 38.18 / 35.77, 58.91 / 56.60; filtered 37.92 / 35.48, 57.80 / 55.43; str-dep 39.52 / 37.25, 56.27 / 54.07
Experiments | Rescoring (5-gram LM): baseline 40.53 / 38.26, 56.35 / 54.15; filtered 40.49 / 38.26, 55.57 / 53.47; str-dep 41.60 / 39.47, 55.06 / 52.96
Implementation Details | We rescore 1000-best translations (Huang and Chiang, 2005) by replacing the 3-gram LM score with the 5-gram LM score computed offline. |
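Mechanically, offline rescoring of an n-best list amounts to subtracting the weighted 3-gram component from each hypothesis total and adding the weighted 5-gram score before re-ranking; a minimal sketch with invented scores:

```python
def rescore_nbest(nbest, lm_weight, lm5):
    """Rescore an n-best list by swapping the decoder's 3-gram LM component
    for an offline 5-gram score, then re-rank (higher total is better).
    Each entry is (hypothesis, total_score, lm3_score); `lm5` maps a
    hypothesis to its 5-gram log score. Toy sketch."""
    rescored = []
    for hyp, total, lm3 in nbest:
        new_total = total - lm_weight * lm3 + lm_weight * lm5(hyp)
        rescored.append((new_total, hyp))
    rescored.sort(reverse=True)
    return [hyp for _, hyp in rescored]
```

Because only the LM term changes, the expensive 5-gram lookups can be batched offline exactly as described above.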
Discriminative Reranking for OCR | Base features include the HMM and LM scores produced by the OCR system. |
Discriminative Reranking for OCR | Word LM features (“LM-word”) include the log probabilities of the hypothesis obtained using n-gram LMs with n ∈ {1, .
Discriminative Reranking for OCR | The LMs are built using the SRI Language Modeling Toolkit (Stolcke, 2002).
Experiments | Our baseline is based on the sum of the logs of the HMM and LM scores. |
Experiments | For LM training we used 220M words from Arabic Gigaword 3, and 2.4M words from each “print” and “hand” ground truth annotations. |
Introduction | The BBN Byblos OCR system (Natarajan et al., 2002; Prasad et al., 2008; Saleem et al., 2009), which we use in this paper, relies on a hidden Markov model (HMM) to recover the sequence of characters from the image, and uses an n-gram language model (LM) to emphasize the fluency of the output.
Introduction | For an input image, the OCR decoder generates an n-best list of hypotheses each of which is associated with HMM and LM scores. |
Related Work | Like (Galley and Manning, 2009) our work implements an incremental syntactic language model; our approach differs by calculating syntactic LM scores over all available phrase-structure parses at each hypothesis instead of the 1-best dependency parse.
Related Work | Table: LM, in-domain (WSJ §23 ppl) / out-of-domain (ur-en dev ppl):
Related Work | WSJ 1-gram 1973.57 / 3581.72; WSJ 2-gram 349.18 / 1312.61; WSJ 3-gram 262.04 / 1264.47; WSJ 4-gram 244.12 / 1261.37; WSJ 5-gram 232.08 / 1261.90; WSJ HHMM 384.66 / 529.41; Interpolated WSJ 5-gram + HHMM 209.13 / 225.48; Giga 5-gram 258.35 / 312.28; Interp.
Related Work | HHMM parser beam sizes are indicated for the syntactic LM . |
Experimental Setup | 4.1 Distributed LM Framework |
Experimental Setup | We deploy the randomized LM in a distributed framework which allows it to scale more easily by distributing it across multiple language model servers. |
Experimental Setup | The proposed randomized LM can encode parameters estimated using any smoothing scheme (e.g. |
Experiments | Table: LM, size (GB) / dev MT04 / test MT05 / test MT06:
Experiments | unpruned block 116 / 0.5304 / 0.5697 / 0.4663; unpruned rand 69 / 0.5299 / 0.5692 / 0.4659; pruned block 42 / 0.5294 / 0.5683 / 0.4665; pruned rand 27 / 0.5289 / 0.5679 / 0.4656
Introduction | Using higher-order models and larger amounts of training data can significantly improve performance in applications, however the size of the resulting LM can become prohibitive. |
Introduction | Efficiency is paramount in applications such as machine translation which make huge numbers of LM requests per sentence. |
Perfect Hash-based Language Models | Our randomized LM is based on the Bloomier filter (Chazelle et al., 2004). |
Scaling Language Models | In the next section we describe our randomized LM scheme based on perfect hash functions. |
Introduction | This provides a compelling advantage over previous dependency language models for MT (Shen et al., 2008), which use a 5-gram LM only during reranking.
Introduction | In our experiments, we build a competitive baseline (Koehn et al., 2007) incorporating a 5-gram LM trained on a large part of Gigaword and show that our dependency language model provides improvements on five different test sets, with an overall gain of 0.92 in TER and 0.45 in BLEU scores. |
Machine translation experiments | This LM required 16GB of RAM during training. |
Machine translation experiments | Regarding the small difference in BLEU scores on MT08, we would like to point out that tuning on MT05 and testing on MT08 had a rather adverse effect with respect to translation length: while the two systems are relatively close in terms of BLEU scores (24.83 and 24.91, respectively), the dependency LM provides a much bigger gain when evaluated with BLEU precision (27.73 vs. 28.79), i.e., by ignoring the brevity penalty.
Machine translation experiments | [Table columns: LM, newswire, web, speech, all]
Related work | In the latter paper, Huang and Chiang introduce rescoring methods named “cube pruning” and “cube growing”, which first use a baseline decoder (either a synchronous CFG or a phrase-based system) and no LM to generate a hypergraph, and then rescore this hypergraph with a language model.
A Class-based Model of Agreement | For contexts in which the LM is guaranteed to back off (for instance, after an unseen bigram), our decoder maintains only the minimal state needed (perhaps only a single word). |
Inference during Translation Decoding | initialize η to −∞; set η(t) = 0; compute τ* from parameters ⟨S, s̃, π, is_goal⟩; compute q(e_{l+1}) = p(τ*) under the generative LM; set model state a_new = ⟨s̃, τ̃⟩ for prefix e_1^l; Output: q(e_{l+1})
Introduction | Intuition might suggest that the standard n-gram language model (LM) is sufficient to handle agreement phenomena.
Introduction | However, LM statistics are sparse, and they are made sparser by morphological variation. |
Introduction | For English-to-Arabic translation, we achieve a +1.04 BLEU average improvement by tiling our model on top of a large LM . |
Related Work | They used a target-side LM over Combinatorial Categorial Grammar (CCG) supertags, along with a penalty for the number of operator violations, and also modified the phrase probabilities based on the tags. |
Related Work | Then they mixed the classes into a word-based LM . |
Related Work | Target-Side Syntactic LMs Our agreement model is a form of syntactic LM , of which there is a long history of research, especially in speech processing.5 Syntactic LMs have traditionally been too slow for scoring during MT decoding. |
Previous Work | Our LM experiments also affirmed the importance of in-domain English LMs. |
Previous Work | [Table 1 columns: Train, LM, BLEU, OOV]
Previous Work | Table 1: Baseline results using the EG and AR training sets with GW and EGen corpora for LM training
Proposed Methods 3.1 Egyptian to EG’ Conversion | Then we used a trigram LM that we built from the aforementioned Aljazeera articles to pick the most likely candidate in context. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | We simply multiplied the character-level transformation probability with the LM probability — giving them equal weight. |
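The combination can be sketched as follows (toy tables, not the actual Aljazeera-trained model): each candidate's score is, in log space with equal weight, the sum of its character-level transformation log-probability and its trigram LM log-probability in context.

```python
import math

def pick_candidate(prev2, prev1, candidates, trigram_lp):
    """Pick the most likely conversion candidate in context by combining
    the character-level transformation probability with a trigram LM
    probability at equal weight. `candidates` is a list of
    (candidate_word, transform_prob); `trigram_lp` maps
    (w-2, w-1, w) to a log-probability, with a floor for unseen trigrams."""
    best, best_score = None, -math.inf
    for cand, p_trans in candidates:
        score = math.log(p_trans) + trigram_lp.get((prev2, prev1, cand), -10.0)
        if score > best_score:
            best, best_score = cand, score
    return best
```

Either component can override the other: a strong LM preference flips the choice even when the transformation model prefers a different candidate.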
Proposed Methods 3.1 Egyptian to EG’ Conversion | - S1 and S2 trained on the EG’ with EGen and both EGen and GW for LM training respectively.
Experimental results and discussions 6.1 Baseline experiments | In the first set of experiments, we evaluate the baseline performance of the LM and BC summarizers (cf. |
Experimental results and discussions 6.1 Baseline experiments | Second, the supervised summarizer (i.e., BC) outperforms the unsupervised summarizer (i.e., LM ). |
Experimental results and discussions 6.1 Baseline experiments | One is that BC is trained with the handcrafted document-summary sentence labels in the development set while LM is instead conducted in a purely unsupervised manner. |
Proposed Methods | In order to estimate the sentence generative probability, we explore the language modeling ( LM ) approach, which has been introduced to a wide spectrum of IR tasks and demonstrated with good empirical success, to predict the sentence generative probability. |
Proposed Methods | In the LM approach, each sentence S in a document D can be simply regarded as a probabilistic generative model consisting of a unigram distribution (the so-called “bag-of-words” assumption) for generating the document (Chen et al., 2009): P(D|S) = ∏_{w∈D} P(w|S)^{c(w,D)}, where c(w,D) is the count of w in D.
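Under that view, a sentence's generative score for its document can be sketched as below. The additive smoothing constant `alpha` is an assumption added here to keep zero-count words finite, not part of the cited model.

```python
import math
from collections import Counter

def sentence_generative_logprob(sentence, document, alpha=0.5):
    """Log of the sentence generative probability under the bag-of-words
    unigram LM view: the sentence is treated as a unigram distribution
    that generates the document, log P(D|S) = sum_w c(w,D) * log P(w|S).
    `alpha` is an assumed additive-smoothing constant."""
    s_tf = Counter(sentence)
    vocab = set(sentence) | set(document)
    denom = len(sentence) + alpha * len(vocab)
    logp = 0.0
    for w, c in Counter(document).items():
        logp += c * math.log((s_tf[w] + alpha) / denom)
    return logp
```

Ranking a document's sentences by this score gives the unsupervised extractive summarizer referred to as LM above.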
Experiments | The Chinese part of the corpus is segmented into words before LM training. |
Experiments | The pinyin syllable segmentation already has very high (over 98%) accuracy with a trigram LM using improved Kneser-Ney smoothing. |
Experiments | We consider different LM smoothing methods including Kneser-Ney (KN), improved Kneser-Ney (IKN), and Witten-Bell (WB). |
Pinyin Input Method Model | w_E(v_{i,j} → v_{j+1,k}) = −log P(v_{j+1,k} | v_{i,j}). Although the model is formulated on a first-order HMM, i.e., the LM used for the transition probability is a bigram one, it is easy to extend the model to take advantage of a higher-order n-gram LM, by tracking a longer history while traversing the graph.
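A first-order Viterbi search over such a lattice can be sketched as follows (simplified: the segmentation is fixed so the lattice is a list of candidate sets, probabilities are toys, and unseen bigrams get a small floor):

```python
import math

def viterbi_ime(lattice, bigram_p):
    """Bigram Viterbi over a word lattice with edge weight
    w_E = -log P(next | prev), as in the formulation above. `lattice` is
    a list of candidate-word lists, one per (fixed) segment. Toy sketch."""
    prev_states = {"<s>": 0.0}       # word -> best accumulated -log prob
    back = {}
    for i, cands in enumerate(lattice):
        cur = {}
        for w in cands:
            cost, p = min((c - math.log(bigram_p.get((p, w), 1e-6)), p)
                          for p, c in prev_states.items())
            cur[w] = cost
            back[(i, w)] = p         # best predecessor of w at segment i
        prev_states = cur
    w = min(prev_states, key=prev_states.get)
    out = [w]
    for i in range(len(lattice) - 1, 0, -1):
        w = back[(i, w)]
        out.append(w)
    return out[::-1]
```

Extending this to a trigram LM only changes the state from a single word to a word pair, exactly the "longer history" mentioned above.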
Related Works | Various approaches were made for the task including language model ( LM ) based methods (Chen et al., 2013), ME model (Han and Chang, 2013), CRF (Wang et al., 2013d; Wang et al., 2013a), SMT (Chiu et al., 2013; Liu et al., 2013), and graph model (Jia et al., 2013), etc. |
Bayesian MT Decipherment via Hash Sampling | 5A high value for the LM concentration parameter α ensures that the LM probabilities do not deviate too far from the original fixed base distribution during sampling.
Decipherment Model for Machine Translation | For P(e), we use a word n-gram language model ( LM ) trained on monolingual target text. |
Experiments and Results | We observe that our method produces much better results than the others even with a 2-gram LM . |
Experiments and Results | With a 3-gram LM , the new method achieves the best performance; the highest BLEU score reported on this task. |
Experiments and Results | Bayesian Hash Sampling with 2-gram LM: vocab=full (Ve), add_fertility=no: 4.2; vocab=pruned*, add_fertility=yes: 5.3
Abstract | Features used in a phrase-based system usually include LM , reordering model, word and phrase counts, and phrase and lexicon translation models. |
Abstract | Other models used in the baseline system include lexicalized ordering model, word count and phrase count, and a 3-gram LM trained on the English side of the parallel training corpus. |
Abstract | In our system, a primary phrase table is trained from the 110K TED parallel training data, and a 3-gram LM is trained on the English side of the parallel data. |
Experimental Setup | [Table rows: Match 15.45 / 15.04 / 11.89; LM BLEU 0.68 / 0.68 / 0.65]
Experiments | In Table 2, we compare the performance of the linguistically informed model described in Section 4 on the candidates sets against a random choice and a language model ( LM ) baseline. |
Experiments | statistically significant.5 In general, the linguistic model largely outperforms the LM and is less sensitive to the additional confusion introduced by the SEMh input. |
Experiments | Table 5 also reports the performance of an unlabelled model that additionally integrates LM scores. |
Human Evaluation | 6(Nakanishi et al., 2005) also note a negative effect of including LM scores in their model, pointing out that the LM was not trained on enough data. |
Human Evaluation | The corpus used for training our LM might also have been too small or distinct in genre. |
Experimental Setup and Results | As a baseline system, we built a standard phrase-based system, using the surface forms of the words without any transformations, and with a 3-gram LM in the decoder.
Experimental Setup and Results | We believe that the use of multiple language models (some much less sparse than the surface LM ) in the factored baseline is the main reason for the improvement. |
Experimental Setup and Results | Using a 4-gram root LM , considerably less sparse than word forms but more sparse that tags, we get a BLEU score of 22.80 (max: 24.07, min: 21.57, std: 0.85). |
Experiments with Constituent Reordering | 16These experiments were done on top of the model in 3.2.3 with 3-gram word and root LMs and an 8-gram tag LM.
Forest-based translation | The decoder performs two tasks on the translation forest: 1-best search with integrated language model (LM), and k-best search with LM to be used in minimum error rate training.
Forest-based translation | For 1-best search, we use the cube pruning technique (Chiang, 2007; Huang and Chiang, 2007) which approximately intersects the translation forest with the LM.
Forest-based translation | Basically, cube pruning works bottom up in a forest, keeping at most k +LM items at each node, and uses the best-first expansion idea from the Algorithm 2 of Huang and Chiang (2005) to speed |
A Skeleton-based Approach to MT 2.1 Skeleton Identification | Given a translation model m, a language model lm and a vector of feature weights w, the model score of a derivation d is computed by |
A Skeleton-based Approach to MT 2.1 Skeleton Identification | g(d; w, m, lm) = w_m · f_m(d) + w_lm · lm(d) (4)
A Skeleton-based Approach to MT 2.1 Skeleton Identification | lm(d) and w_lm are the score and weight of the language model, respectively.
Inflection prediction models | [Table rows: LM 81.0 / 69.4; Model 91.6 / 91.0; Avg |I| 13.9 / 24.1]
Integration of inflection models with MT systems | PLM) is the joint probability of the sequence of inflected words according to a trigram language model ( LM ). |
Integration of inflection models with MT systems | The LM used for the integration is the same LM used in the base MT system that is trained on fully inflected word forms (the base MT system trained on stems uses an LM trained on a stem sequence). |
Integration of inflection models with MT systems | Equation (1) shows that the model first selects the best sequence of inflected forms for each MT hypothesis Si according to the LM and the inflection model. |
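That selection step can be sketched greedily with a bigram LM stand-in (the description above uses a trigram LM; all names and tables here are hypothetical toys):

```python
def best_inflection_sequence(stem_hyp, inflections, lm_lp, infl_lp):
    """Greedy left-to-right choice of inflected forms for one MT hypothesis:
    each candidate inflection of a stem is scored by LM log-prob (bigram
    here, for brevity) plus the inflection model's log-prob, with a floor
    for unseen events. Toy sketch, not the paper's exact Equation (1)."""
    out, prev = [], "<s>"
    for stem in stem_hyp:
        best = max(inflections[stem],
                   key=lambda c: lm_lp.get((prev, c), -10.0)
                                 + infl_lp.get((stem, c), -10.0))
        out.append(best)
        prev = best
    return out
```

The LM term ties each inflection choice to its left context, so agreement with the preceding word can overrule the inflection model's context-free preference.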
Experimental Evaluation | [Table rows: TM (en) 1.80M / 1.62M / 1.35M / 2.38M; TM (other) 1.85M / 1.82M / 1.56M / 2.78M; LM (en) 52.7M / 52.7M / 52.7M / 44.7M]
Experimental Evaluation | Table 1: The number of words in each corpus for TM and LM training, tuning, and testing. |
Experimental Evaluation | We use the news commentary corpus for training the TM, and the news commentary and Europarl corpora for training the LM . |
Inference with First Order Variables | The language model score h(s′, LM) of s′ based on a large web corpus;
Inference with First Order Variables | w_{lm,c} = ν_LM h(s′, LM) + Σ_t λ_t f(s′,
Inference with First Order Variables | w_{Noun,2,singular} = ν_LM h(s′, LM) + λ_ART f(s′, ART) + λ_PREP f(s′, PREP) + λ_NOUN f(s′, NOUN) + μ_ART g(s′, ART) + μ_PREP g(s′, PREP) + μ_NOUN g(s′, NOUN). |
Inference with Second Order Variables | w_… = ν_LM h(s′, LM) + Σ_t λ_t f(s′, |
Experiments and Results | The LM corpus is the English side of the parallel data (BTEC, CJK and CWMT08) (1.34M sentences). |
Experiments and Results | The LM corpus is the English side of the parallel data as well as the English Gigaword corpus (LDC2007T07) (11.3M sentences). |
Input Features for DNN Feature Learning | 1 The backward LM was introduced by Xiong et al. |
Input Features for DNN Feature Learning | (2011), which successfully captures both the preceding and succeeding contexts of the current word; we estimate the backward LM by reversing the token order of each sentence in the training data. |
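Preparing training data for such a backward LM is just a matter of reversing each sentence before estimating a standard n-gram model; a sketch (the helper name is ours):

```python
def backward_lm_corpus(sentences):
    """Build the training corpus for a backward LM by reversing the token
    order of every sentence; the LM itself is then estimated with any
    standard n-gram toolkit on the reversed text."""
    return [list(reversed(s)) for s in sentences]
```

An n-gram over the reversed corpus then conditions each word on its *following* context in the original order.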
Related Work | (2012) improved the translation quality of an n-gram translation model by using a bilingual neural LM, where translation probabilities are estimated using a continuous representation of translation units in lieu of standard discrete representations. |
Latent Structure in Dialogues | The simplest formulation we consider is an HMM where each state contains a unigram language model (LM), proposed by Chotimongkol (2008) for task-oriented dialogue and originally |
Latent Structure in Dialogues | w_1, …, w_n are generated (independently) according to the LM. |
Latent Structure in Dialogues | (2010) extends the LM-HMM to allow words to be emitted from two additional sources: the topic of the current dialogue φ, or a background LM shared across all dialogues. |
Approach | In order to estimate the error-rate, we build a trigram language model (LM) using ukWaC (ukWaC LM) (Ferraresi et al., 2008), a large corpus of English containing more than 2 billion tokens. |
Approach | (correlation / correlation) word ngrams: 0.601 / 0.598; +PoS ngrams: 0.682 / 0.687; +script length: 0.692 / 0.689; +PS rules: 0.707 / 0.708; +complexity: 0.714 / 0.712; Error-rate features: +ukWaC LM: 0.735 / 0.758; +CLC LM: 0.741 / 0.773; +true CLC error-rate: 0.751 / 0.789 |
Approach | CLC (CLC LM). |
Evaluation | feature (correlation / correlation) — none: 0.741 / 0.773; word ngrams: 0.713 / 0.762; PoS ngrams: 0.724 / 0.737; script length: 0.734 / 0.772; PS rules: 0.712 / 0.731; complexity: 0.738 / 0.760; ukWaC+CLC LM: 0.714 / 0.712 |
Evaluation | In the experiments reported hereafter, we use the ukWaC+CLC LM to calculate the error-rate. |
Experiments | Other features include lexical weighting in both directions, word count, a distance-based RM, a 4-gram LM trained on the target side of the parallel data, and a 6-gram English Gigaword LM. |
Experiments | Some of the results reported above involved linear TM mixtures, but none of them involved linear LM mixtures. |
Experiments | For instance, with an initial Chinese system that employs linear mixture LM adaptation (lin-lm) and has a BLEU of 32.1, adding 1-feature VSM adaptation (+vsm, joint) improves performance to 33.1 (improvement significant at p < 0.01), while adding 3-feature VSM instead (+vsm, 3 feat.) |
Introduction | Both were studied in (Foster and Kuhn, 2007), which concluded that the best approach was to combine sub-models of the same type (for instance, several different TMs or several different LMs) linearly, while combining models of different types (for instance, a mixture TM with a mixture LM) log-linearly. |
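A linear mixture of LMs, as described here, is a weighted sum of component probabilities; a minimal sketch with hypothetical component interfaces (a log-linear combination would instead multiply weighted scores):

```python
def linear_mixture_prob(word, history, models, lambdas):
    """Linear LM mixture: P(w|h) = sum_i lambda_i * P_i(w|h).
    models: list of callables P_i(word, history) returning probabilities;
    lambdas: mixture weights, assumed to sum to 1."""
    return sum(l * m(word, history) for l, m in zip(lambdas, models))
```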
Experiments | The first is a 4-gram LM which is estimated on the target side of the texts used in the large data condition (below). |
Experiments | The second is a 5-gram LM estimated on English Gigaword. |
Experiments | We used two LMs in log-linear combination: a 4-gram LM trained on the target side of the parallel |
Experiment | VS: 0.4196 / 0.4542 / 0.6600; BM25: 0.4235† / 0.4579 / 0.6600; LM: 0.4158 / 0.4520 / 0.6560; PMI: 0.4177 / 0.4538 / 0.6620; LSA: 0.4155 / 0.4526 / 0.6480; WP: 0.4165 / 0.4533 / 0.6640 |
Experiment | Model: Precision / Recall / F-Measure — BASELINE: 0.305 / 0.866 / 0.451; VS: 0.331 / 0.807 / 0.470; BM25: 0.327 / 0.795 / 0.464; LM: 0.325 / 0.794 / 0.461; LSA: 0.315 / 0.806 / 0.453; PMI: 0.342 / 0.603 / 0.436; DTP: 0.322 / 0.778 / 0.455; VS-LSA: 0.335 / 0.769 / 0.466; VS-PMI: 0.311 / 0.833 / 0.453; VS-DTP: 0.342 / 0.745 / 0.469 |
Experiment | Differences in effectiveness of VS, BM25, and LM come from parameter tuning and corpus differences. |
Term Weighting and Sentiment Analysis | IR models, such as Vector Space (VS), probabilistic models such as BM25, and Language Modeling (LM), albeit with different modelling approaches and measures, employ heuristics and formal models to effectively evaluate the relevance of a term to a document (Fang et al., 2004). |
Term Weighting and Sentiment Analysis | In our experiments, we use the Vector Space model with Pivoted Normalization (VS), Probabilistic model (BM25), and Language modeling with Dirichlet Smoothing (LM). |
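The BM25 weighting mentioned above follows the standard Okapi formula; a minimal sketch of the per-term weight (the parameter defaults k1=1.2, b=0.75 are conventional, not taken from this paper):

```python
import math

def bm25(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 weight for one term in one document: an IDF component
    times a saturated, length-normalized term frequency."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm
```

The k1 parameter controls how quickly repeated occurrences saturate, and b controls how strongly long documents are penalized.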
Experiments | In the tables, Lm denotes the n-gram language model feature, Tmh denotes the feature of collocation between target head words and the candidate measure word, Smh denotes the feature of collocation between source head words and the candidate measure word, Hs denotes the feature of source head word selection, Punc denotes the feature of target punctuation position, Tlex denotes surrounding word features in the translation, Slex denotes surrounding word features in the source sentence, and Pos denotes the Part-Of-Speech feature. |
Experiments | Feature setting: Precision / Recall — Baseline: 54.82% / 45.61%; Lm: 51.11% / 41.24%; +Tmh: 61.43% / 49.22%; +Punc: 62.54% / 50.08%; +Tlex: 64.80% / 51.87% |
Experiments | Feature setting: Precision / Recall — Baseline: 54.82% / 45.61%; Lm: 51.11% / 41.24%; +Tmh+Smh: 64.50% / 51.64%; +Hs: 65.32% / 52.26%; +Punc: 66.29% / 53.10%; +Pos: 66.53% / 53.25%; +Tlex: 67.50% / 54.02%; +Slex: 69.52% / 55.54% |
Joint Model | • LINK: O(n²) boolean variables L_{ij} corresponding to a possible link between each pair |
Joint Model | L_{ij} = true when there is a dependency link from the word w_i to the word w_j. |
Joint Model | • POS-LINK: There are O(n²m²) such ternary factors, each connected to the variables L_{ij}, POS_{w_i}, and POS_{w_j}. |
Experiments | # / Methods / MAP / P@10 — 1: VSM, 0.242 / 0.226; 2: LM, 0.385 / 0.242; 3: Jeon et al. |
Experiments | Row 1 and row 2 are two baseline systems, which model the relevance score using VSM (Cao et al., 2010) and language model (LM) (Zhai and Lafferty, 2001; Cao et al., 2010) in the term space. |
Experiments | (1) Monolingual translation models significantly outperform the VSM and LM (row 1 and |
Introduction | As a principled approach to capture semantic word relations, word-based translation models are built by using the IBM model 1 (Brown et al., 1993) and have been shown to outperform traditional models (e.g., VSM, BM25, LM) for question retrieval. |
The normalization models | All tokens Tj of S are concatenated together and composed with the lexical language model LM. |
The normalization models | S′ = BestPath((T_1 ⋯ T_n) ∘ LM) (6) |
The normalization models | LM = FirstProjection(L ∘ LM_w) (13) |
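The BestPath-over-composition idea in Equations (6) and (13) can be emulated for a toy case by Viterbi search over per-token normalization candidates scored with a bigram LM. All interfaces here are ours, not the paper's FST machinery:

```python
def best_path(candidate_cols, bigram_logp):
    """Toy stand-in for BestPath((T1 ... Tn) o LM): Viterbi over columns of
    normalization candidates, scored by bigram_logp(prev, word), a
    hypothetical bigram LM log-probability function."""
    prev = {'<s>': (0.0, None)}  # word -> (best score, predecessor)
    history = [prev]
    for col in candidate_cols:
        cur = {}
        for w in col:
            cur[w] = max(((s + bigram_logp(p, w), p)
                          for p, (s, _) in prev.items()),
                         key=lambda x: x[0])
        history.append(cur)
        prev = cur
    # backtrace from the best word in the final column
    word = max(prev, key=lambda w: prev[w][0])
    path = []
    for col_states in reversed(history[1:]):
        path.append(word)
        word = col_states[word][1]
    return list(reversed(path))
```

Composing the token lattice with the LM transducer performs the same search, with the LM states threaded through the composition instead of tracked explicitly.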
Translation Model Architecture | We find that an adaptation of the TM and LM to the full development set (system “1 cluster”) yields the smallest improvements over the unadapted baseline. |
Translation Model Architecture | For the IT test set, the system with gold labels and TM adaptation yields an improvement of 0.7 BLEU (21.1 → 21.8), LM adaptation yields 1.3 BLEU (21.1 → 22.4), and adapting both models outperforms the baseline by 2.1 BLEU (21.1 → 23.2). |
Translation Model Architecture | TM adaptation with 8 clusters (21.1 → 21.8 → 22.1), or LM adaptation with 4 or 8 clusters (21.1 → 22.4 → 23.1). |
Discussion | 6 In our preliminary experiments with the smaller trigram LM, MERT did better on MT05 in the smaller feature set, and MIRA had a small advantage in two cases. |
Experiments | We trained a 4-gram LM on the |
Experiments | In preliminary experiments with a smaller trigram LM, our RM method consistently yielded the highest scores in all Chinese-English tests — up to 1.6 BLEU and 6.4 TER from MIRA, the second best performer. |
The Relative Margin Machine in SMT | 4-gram LM: 24M / 600M / — |
Experiments | In detail, a paraphrase pattern e′ of e was reranked based on a language model (LM): |
Experiments | score_LM(e′|S_E) is the LM-based score: score_LM(e′|S_E) = (1/|S′_E|) log P_LM(S′_E), where S′_E is the sentence generated by replacing e in S_E with e′. |
Experiments | To investigate the contribution of the LM-based score, we ran the experiment again with λ = 1 (ignoring the LM-based score) and found that the precision is 57.09%. |
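The length-normalized LM score and the reranking step can be sketched with a toy unigram LM standing in for the real one. The function names and the single-token substitution below are simplifications of ours, not the paper's implementation:

```python
import math

def score_lm(sentence, unigram_p):
    """Length-normalized LM score: (1/|S|) * sum_w log P(w), with a toy
    unigram LM (dict of word -> probability) as the scorer."""
    return sum(math.log(unigram_p[w]) for w in sentence) / len(sentence)

def rerank(s_e, e, paraphrases, unigram_p, lam=0.5, base_score=None):
    """Pick the paraphrase candidate (a token list) maximizing
    lam * base score + (1 - lam) * LM score of the substituted sentence."""
    base_score = base_score or (lambda c: 0.0)
    def substituted(c):
        i = s_e.index(e)            # replace the single token e with c
        return s_e[:i] + c + s_e[i + 1:]
    return max(paraphrases,
               key=lambda c: lam * base_score(c)
                             + (1 - lam) * score_lm(substituted(c), unigram_p))
```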
Experiments & Results 4.1 Experimental Setup | For the mixture baselines, we used a standard one-pass phrase-based system (Koehn et al., 2003), Portage (Sadat et al., 2005), with the following 7 features: relative-frequency and lexical translation model (TM) probabilities in both directions; word-displacement distortion model; language model (LM) and word count. |
Experiments & Results 4.1 Experimental Setup | For ensemble decoding, we modified an in-house implementation of a hierarchical phrase-based system, Kriya (Sankaran et al., 2012), which uses the same features mentioned in (Chiang, 2005): forward and backward relative-frequency and lexical TM probabilities; LM; word, phrase and glue-rules penalty. |
Experiments & Results 4.1 Experimental Setup | We suspect this is because a single LM is shared between both models. |
Experiments | TM (en): 2.80M / 3.10M / 2.77M / 2.13M; TM (other): 2.56M / 2.23M / 3.05M / 2.34M; LM (en): 16.0M / 15.5M / 13.8M / 11.5M |
Experiments | LM (other): 15.3M / 11.3M / 15.6M / 11.9M |
Experiments | Table 1: The number of words in each corpus for TM and LM training, tuning, and testing. |
Keyphrase Extraction Approaches | A unigram LM and an n-gram LM are constructed for each of these two corpora. |
Keyphrase Extraction Approaches | Phraseness, defined using the foreground LM, is calculated as the loss of information incurred as a result of assuming a unigram LM (i.e., conditional independence among the words of the phrase) instead of an n-gram LM (i.e., the phrase is drawn from an n-gram LM). |
Keyphrase Extraction Approaches | Informativeness is computed as the loss that results because of the assumption that the candidate is sampled from the background LM rather than the foreground LM. |
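Under this reading, phraseness and informativeness each reduce to a pointwise KL term for the candidate phrase; a sketch with interfaces of ours (probabilities are assumed already estimated from the two corpora):

```python
import math

def phraseness(p_fg_phrase, p_fg_unigrams):
    """Pointwise KL between the foreground n-gram LM and the unigram
    (independence) model: p(phrase) * log(p(phrase) / prod_i p(w_i))."""
    indep = math.prod(p_fg_unigrams)
    return p_fg_phrase * math.log(p_fg_phrase / indep)

def informativeness(p_fg_phrase, p_bg_phrase):
    """Pointwise KL between foreground and background LMs for the phrase."""
    return p_fg_phrase * math.log(p_fg_phrase / p_bg_phrase)
```

A candidate scores high on phraseness when its words co-occur far more often than independence predicts, and high on informativeness when it is much more probable in the foreground corpus than in the background.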
System Description | Finally, the system computes the average log-probability and number of out-of-vocabulary words from a language model trained on a collection of essays written by nonnative English speakers7 (“nonnative LM”). |
System Description | our system: 0.668; − nonnative LM (§3.2.2): 0.665; − HPSG parse (§3.2.3): 0.664; − PCFG parse (§3.2.4): 0.662 |
System Description | − spelling (§3.2.1): 0.643; − gigaword LM (§3.2.2): 0.638; − link parse (§3.2.3): 0.632 |
Discussion | Method: SAT / Holmes — Chance: 20% / 20%; GT n-gram LM: 42 / 39; RNN: 42 / 45 |
Experimental Results 5.1 Data Resources | These data sources were evaluated using the baseline n-gram LM approach of Section 3.1. |
Experimental Results 5.1 Data Resources | Method: Test — 3-input LSA: 46%; LSA + Good-Turing LM: 53%; LSA + Good-Turing LM + RNN: 52% |
Pattern extraction by sentence compression | The method of Clarke and Lapata (2008) uses a trigram language model (LM) to score compressions. |
Pattern extraction by sentence compression | Since we are interested in very short outputs, an LM trained on standard, uncompressed text would not be suitable. |
Pattern extraction by sentence compression | Instead, we chose to modify the method of Filippova and Altun (2013) because it relies on dependency parse trees and does not use any LM scoring. |
Models | Secondly, it is easy to use a large language model (LM) with Moses. |
Models | We build the LM on the target word types in the data to be filtered. |
Models | The LM is implemented as a five-gram model using the SRILM-Toolkit (Stolcke, 2002), with Add-1 smoothing for unigrams and Kneser-Ney smoothing for higher n-grams. |
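The Add-1 smoothing applied at the unigram level can be sketched as follows (Kneser-Ney for the higher orders is omitted, and the helper is a hypothetical illustration, not SRILM's API):

```python
from collections import Counter

def add_one_unigram(tokens, vocab_size=None):
    """Add-1 (Laplace) smoothed unigram probabilities. Returns a dict of
    probabilities for seen words plus the probability assigned to any
    unseen word. If vocab_size is not given, one extra slot is reserved
    for unseen words."""
    counts = Counter(tokens)
    v = vocab_size or len(counts) + 1
    denom = len(tokens) + v
    probs = {w: (c + 1) / denom for w, c in counts.items()}
    return probs, 1 / denom
```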
Evaluation | The feature set includes: a trigram language model (lm) trained |
Evaluation | Discriminative max-derivation: 25.78; Hiero (pd, gr, rc, wc): 26.48; Discriminative max-translation: 27.72; Hiero (pd, pr, p_lex_r, p_lex_d, gr, rc, wc): 28.14; Hiero (pd, pr, p_lex_r, p_lex_d, gr, rc, wc, lm): 32.00 |
Evaluation | 8 Hiero (pd, pr, p_lex_r, p_lex_d, gr, rc, wc, lm) represents state- |
Experimental Evaluation | The second one, LM, is based on statistical language models for relevant information retrieval (Ponte and Croft, 1998). |
Experimental Evaluation | Okapi: 0.827 / 0.833 / 0.807 / 0.751; Forum LM: 0.804 / 0.833 / 0.807 / 0.731; Our: 0.967 / 0.967 / 0.9 / 0.85 |
Experimental Evaluation | Okapi: 0.733 / 0.651 / 0.667 / 0.466; Blog LM: 0.767 / 0.718 / 0.70 / 0.524; Our: 0.933 / 0.894 / 0.867 / 0.756 |