Index of papers in Proc. ACL that mention
  • LM
Elsayed, Tamer and Oard, Douglas W. and Namata, Galileo
Mention Resolution Approach
We define a mention m as a tuple ⟨l_m, e_m⟩, where l_m is the “literal” string of characters that represents m and e_m is the email where m is observed. We assume that m can be resolved to a distinguishable participant for whom at least one email address is present in the collection.
Mention Resolution Approach
Select a specific lexical reference l_m to refer to c given the context.
Mention Resolution Approach
The exact position in e_m where l_m is observed should also be included in the definition, but we ignore it, assuming that all matched literal mentions in one email refer to the same identity.
LM is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
van Gompel, Maarten and van den Bosch, Antal
Baselines
It adds an LM component to the MLF baseline.
Baselines
This LM baseline allows the comparison of classification through L1 fragments in an L2 context, with a more traditional L2 context modelling (i.e.
Discussion and conclusion
LM baseline l1r1
Discussion and conclusion
l1r1+LM auto
Discussion and conclusion
LM baseline l1r1
Experiments & Results
As expected, the LM baseline substantially outperforms the context-insensitive MLF baseline.
Experiments & Results
Second, our classifier approach attains a substantially higher accuracy than the LM baseline.
Experiments & Results
The same significance level was found when comparing l1r1+LM against l1r1 and auto+LM against auto, as well as the LM baseline against the MLF baseline.
LM is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Salameh, Mohammad and Cherry, Colin and Kondrak, Grzegorz
Abstract
Our novel lattice desegmentation algorithm effectively combines both segmented and desegmented views of the target language for a large subspace of possible translation outputs, which allows for inclusion of features related to the desegmentation process, as well as an unsegmented language model ( LM ).
Methods
In order to annotate lattice edges with an n-gram LM , every path coming into a node must end with the same sequence of (n − 1) tokens.
Methods
If this property does not hold, then nodes must be split until it does. This property is maintained by the decoder’s recombination rules for the segmented LM, but it is not guaranteed for the desegmented LM .
Methods
Indeed, the expanded word-level context is one of the main benefits of incorporating a word-level LM .
LM is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Devlin, Jacob and Zbib, Rabih and Huang, Zhongqiang and Lamar, Thomas and Schwartz, Richard and Makhoul, John
Abstract
These techniques speed up NNJM computation by a factor of 10,000x, making the model as fast as a standard back-off LM .
Decoding with the NNJM
When performing hierarchical decoding with an n-gram LM , the leftmost and rightmost n − 1 words from each constituent must be stored in the state space.
Model Variations
4-gram Kneser-Ney LM
Model Variations
Dependency LM (Shen et al., 2010); contextual lexical smoothing (Devlin, 2009); length distribution (Shen et al., 2010)
Model Variations
• LM adaptation (Snover et al., 2008)
Neural Network Joint Model (NNJM)
Formally, our model approximates the probability of target hypothesis T conditioned on source sentence S. We follow the standard n-gram LM decomposition of the target, where each target word t_i is conditioned on the previous n − 1 target words.
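Read concretely, the decomposition described above factors the target as an n-gram chain, with each factor additionally conditioned on a window of source words; the notation below is our sketch of the excerpt, not the paper's exact equation (C_i^(m) stands for an m-word source-side context window):

```latex
P(T \mid S) \;\approx\; \prod_{i=1}^{|T|} P\bigl(t_i \,\big|\, t_{i-n+1}, \ldots, t_{i-1},\; C^{(m)}_i\bigr)
```

Conditioning on n − 1 target history words plus m source words is what makes the model behave like the (n+m)-gram LM mentioned in the next excerpt.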
Neural Network Joint Model (NNJM)
It is clear that this model is effectively an (n+m)-gram LM, and a 15-gram LM would be
Neural Network Joint Model (NNJM)
Although self-normalization significantly improves the speed of NNJM lookups, the model is still several orders of magnitude slower than a back-off LM .
LM is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Baseline MT
The LM used for decoding is a log-linear combination of four word n-gram LMs which are built on different English
Baseline MT
corpora (details described in section 5.1), with the LM weights optimized on a development set and determined by minimum error rate training (MERT), to estimate the probability of a word given the preceding words.
Experiments
LM1 is a 7-gram LM trained on the tar-
Experiments
LM2 is a 7-gram LM trained only on the English monolingual discussion forums data listed above.
Experiments
LM3 is a 4-gram LM trained on the web genre among the target side of all parallel text (i.e., web text from pre-BOLT parallel text and BOLT released discussion forum parallel text).
Introduction
Optimize name translation and context translation simultaneously and conduct name translation driven decoding with language model ( LM ) based selection (Section 3.2).
Related Work
enable the LM to decide which translations to choose when encountering the names in the texts (Ji et al., 2009).
Related Work
The LM selection method often assigns an inappropriate weight to the additional name translation table because it is constructed independently from translation of context words; therefore after weighted voting most correct name translations are not used in the final translation output.
Related Work
More importantly, in these approaches the MT model was still mostly treated as a “black-box” because neither the translation model nor the LM was updated or adapted specifically for names.
LM is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Hagiwara, Masato and Sekine, Satoshi
Abstract
We propose an online approach, integrating source LM, and/or, back-transliteration and English LM .
Experiments
We observed slight improvement by incorporating the source LM , and observed a 0.48 point F-value increase over baseline, which translates to 4.65 point Katakana F-value change and 16.0% (3.56% to 2.99%) WER reduction, mainly due to its higher Katakana word rate (11.2%).
Experiments
This type of error is reduced by +LM-P, e.g., *プラス チック purasu chikku “*plus tick” to プラスチック purasuchikku “plastic”, due to LM projection.
Experiments
performance, which may be because one cannot limit where the source LM features are applied.
Introduction
We refer to this process of transliterating unknown words into another language and using the target LM as LM projection.
Introduction
Since the model employs a general transliteration model and a general English LM , it achieves robust WS for unknown words.
Use of Language Model
As we mentioned in Section 2, English LM knowledge helps split transliterated compounds.
Use of Language Model
We use ( LM ) projection, which is a combination of back-transliteration and an English model, by extending the normal lattice building process as follows:
Use of Language Model
Then, edges are spanned between these extended English nodes, instead of between the original nodes, by additionally taking into consideration English LM features (ID 21 and 22 in Table 1): φ_LMP(w_i) = log p(w_i) and φ_LMP(w_{i−1}, w_i) = log p(w_{i−1}, w_i).
Word Segmentation Model
Figure 1: Example lattice with LM projection
LM is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Feng, Minwei and Peter, Jan-Thorsten and Ney, Hermann
Comparative Study
Figure 4: bilingual LM illustration.
Comparative Study
4.3 Bilingual LM
Comparative Study
We build a 9-gram LM using the SRILM toolkit (Stolcke, 2002) with modified Kneser-Ney smoothing.
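For reference, the interpolated Kneser-Ney estimate that this family of smoothers builds on has the standard textbook form below (not quoted from the paper); the modified variant replaces the single discount D with three count-dependent discounts, and lower-order distributions are estimated from continuation counts:

```latex
P_{\mathrm{KN}}(w \mid h) \;=\; \frac{\max\{c(hw) - D,\, 0\}}{\sum_{w'} c(hw')} \;+\; \lambda(h)\, P_{\mathrm{KN}}(w \mid h'),
\qquad
\lambda(h) \;=\; \frac{D \cdot N_{1+}(h\,\bullet)}{\sum_{w'} c(hw')}
```

Here h' drops the oldest word of the history h, and N_{1+}(h •) counts the distinct words seen after h.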
Experiments
Our baseline is a phrase-based decoder, which includes the following models: an n-gram target-side language model ( LM ), a phrase translation model and a word-based lexicon model.
Experiments
• lowercased training data from the GALE task (Table 1, UN corpus not included)
• alignment trained with GIZA++
• tuning corpus: NIST06; test corpora: NIST02, 03, 04, 05 and 08
• 5-gram LM (1,694,412,027 running words) trained by the SRILM toolkit (Stolcke, 2002) with modified Kneser-Ney smoothing; training data: target side of bilingual data.
LM is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Elfardy, Heba and Diab, Mona
Approach to Sentence-Level Dialect Identification
The aforementioned approach relies on language models ( LM ) and MSA and EDA Morphological Analyzer to decide whether each word is (a) MSA, (b) EDA, (c) Both (MSA & EDA) or (d) OOV.
Approach to Sentence-Level Dialect Identification
The following variants of the underlying token-level system are built to assess the effect of varying the level of preprocessing on the underlying LM on the performance of the overall sentence level dialect identification process: (1) Surface, (2) Tokenized, (3) CODAfied, and (4) Tokenized-CODA.
Approach to Sentence-Level Dialect Identification
Orthography Normalized (CODAfied) LM : since DA is not originally a written form of Arabic, no standard orthography exists for it.
Related Work
Amazon Mechanical Turk and try a language modeling ( LM ) approach to solve the problem.
LM is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Braune, Fabienne and Seemann, Nina and Quernheim, Daniel and Maletti, Andreas
Decoding
The language model ( LM ) scoring is directly integrated into the cube pruning algorithm.
Decoding
Thus, LM estimates are available for all considered hypotheses.
Decoding
Because the output trees can remain discontiguous after hypothesis creation, LM scoring has to be done individually over all output trees.
LM is mentioned in 17 sentences in this paper.
Topics mentioned in this paper:
Zhong, Zhi and Ng, Hwee Tou
Conclusion
We proposed a method for annotating senses to terms in short queries, and also described an approach to integrate senses into an LM approach for IR.
Conclusion
Our experimental results showed that the incorporation of senses improved a state-of-the-art baseline, a stem-based LM approach with PRF method.
Conclusion
We also proposed a method to further integrate the synonym relations to the LM approaches.
Experiments
We use the Lemur toolkit (Ogilvie and Callan, 2001) version 4.11 as the basic retrieval tool, and select the default unigram LM approach based on KL-divergence and Dirichlet-prior smoothing method in Lemur as our basic retrieval approach.
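For orientation, Dirichlet-prior smoothing estimates the document language model as below (standard IR formulas, not specific to this paper); with a maximum-likelihood query model, ranking by negative KL-divergence is then rank-equivalent to query likelihood under the smoothed document model:

```latex
p_{\mu}(w \mid d) \;=\; \frac{c(w; d) + \mu\, p(w \mid \mathcal{C})}{|d| + \mu},
\qquad
\mathrm{score}(q, d) \;\overset{\mathrm{rank}}{=}\; \sum_{w \in q} c(w; q)\, \log p_{\mu}(w \mid d)
```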
Incorporating Senses into Language Modeling Approaches
In this section, we propose to incorporate senses into the LM approach to IR.
Incorporating Senses into Language Modeling Approaches
In this part, we further integrate the synonym relations of senses into the LM approach.
Introduction
We incorporate word senses into the language modeling ( LM ) approach to IR (Ponte and Croft, 1998), and utilize sense synonym relations to further improve the performance.
Introduction
Section 3 introduces the LM approach to IR, including the pseudo relevance feedback method.
Introduction
generating word senses for query terms in Section 4, followed by presenting our novel method of incorporating word senses and their synonyms into the LM approach in Section 5.
The Language Modeling Approach to IR
This section describes the LM approach to IR and the pseudo relevance feedback approach.
Word Sense Disambiguation
Suppose we already have a basic IR method which does not require any sense information, such as the stem-based LM approach.
LM is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Wuebker, Joern and Ney, Hermann and Zens, Richard
Introduction
Research efforts to increase search efficiency for phrase-based MT (Koehn et al., 2003) have explored several directions, ranging from generalizing the stack decoding algorithm (Ortiz et al., 2006) to additional early pruning techniques (Delaney et al., 2006), (Moore and Quirk, 2007) and more efficient language model ( LM ) querying (Heafield, 2011).
Introduction
We show that taking a heuristic LM score estimate for pre-sorting the
Introduction
Further, we introduce two novel LM lookahead methods.
Search Algorithm Extensions
In addition to sorting according to the purely phrase-internal scores, which is common practice, we compute an estimate q_LME(ẽ) for the LM score of each target phrase ẽ. q_LME(ẽ) is the weighted LM score we receive by assuming ẽ to be a complete sentence without using sentence start and end markers.
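A minimal sketch of such a pre-sorting estimate, assuming some externally supplied n-gram scorer lm_logprob (the names and weighting here are illustrative, not the decoder's actual implementation): the target phrase is scored as if it were a complete sentence, with no sentence start/end markers, and the result is scaled by the LM feature weight.

```python
def lm_score_estimate(phrase_tokens, lm_logprob, lm_weight, order=4):
    """Heuristic LM estimate q_LME for a target phrase: score it as a
    stand-alone 'sentence' without <s>/</s> markers (illustrative sketch)."""
    total = 0.0
    for i, word in enumerate(phrase_tokens):
        history = tuple(phrase_tokens[max(0, i - order + 1):i])
        total += lm_logprob(word, history)  # log P(word | truncated history)
    return lm_weight * total

# Target phrases for a source span could then be pre-sorted by
# phrase-internal score + lm_score_estimate(...).
```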
Search Algorithm Extensions
LM score computations are among the most expensive in decoding.
Search Algorithm Extensions
(2006) report significant improvements in runtime by removing unnecessary LM lookups via early pruning.
LM is mentioned in 42 sentences in this paper.
Topics mentioned in this paper:
Rastrow, Ariya and Dredze, Mark and Khudanpur, Sanjeev
Experiments
We use a modified Kneser-Ney (KN) backoff 4-gram baseline LM .
Incorporating Syntactic Structures
w_1 --tool(s)--> s_1^1, …, s_1^n --LM--> F(w_1, s_1^1, …, s_1^n); w_2 --tool(s)--> s_2^1, …, s_2^n --LM--> F(w_2, s_2^1, …, s_2^n); … ; w_k --tool(s)--> s_k^1, …, s_k^n --LM--> F(w_k, s_k^1, …, s_k^n). Here, {w_1, …
Introduction
Language models ( LM ) are crucial components in tasks that require the generation of coherent natural language text, such as automatic speech recognition (ASR) and machine translation (MT).
Introduction
For example, incorporating long-distance dependencies and syntactic structure can help the LM better predict words by complementing the predictive power of n-grams (Chelba and Jelinek, 2000; Collins et al., 2005; Filimonov and Harper, 2009; Kuo et al., 2009).
Introduction
Even so, the reliance on auxiliary tools slows LM application to the point of being impractical for real-time systems.
Syntactic Language Models
(2009) integrate syntactic features into a neural network LM for Arabic speech recognition.
Syntactic Language Models
We work with a discriminative LM with long-span dependencies.
Syntactic Language Models
The LM score s(w, a) for each hypothesis w of a speech utterance with acoustic sequence a is based on the baseline ASR system score b(w, a) (the initial n-gram LM score and the acoustic score) and the weight assigned to the baseline score. The score is defined as:
LM is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Nuhn, Malte and Mauser, Arne and Ney, Hermann
Experimental Evaluation
Using a 2-gram LM they obtain 15.3 BLEU and with a whole segment LM , they achieve 19.3 BLEU.
Experimental Evaluation
In comparison to this baseline we run our algorithm with N0 = 50 candidates per source word for both a 2-gram and a 3-gram LM .
Experimental Evaluation
Figure 3 and Figure 4 show the evolution of BLEU and TER scores for applying our method using a 2-gram and a 3-gram LM .
Training Algorithm and Implementation
Our FST representation of the LM makes use of failure transitions as described in (Allauzen et al., 2003).
LM is mentioned in 27 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Knight, Kevin
Machine Translation as a Decipherment Task
For P (e), we use a word n-gram LM trained on monolingual English data.
Machine Translation as a Decipherment Task
Better LMs yield better MT results for both parallel and decipherment training; for example, using a segment-based English LM instead of a 2-gram LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for EM decipherment.
Word Substitution Decipherment
We model P(e) using a statistical word n-gram English language model ( LM ).
Word Substitution Decipherment
We use an English word bi-gram LM as the base distribution (P0) for the source model and specify a uniform P0 distribution for the
Word Substitution Decipherment
In order to sample at position i, we choose the top K English words Y ranked by P(X Y Z), which can be computed offline from a statistical word bigram LM .
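A toy sketch of that offline pre-computation under a bigram LM (the function and argument names are assumptions for illustration, not the authors' code): for a fixed left/right context pair (x, z), candidate slot fillers y are ranked by log P(y | x) + log P(z | y).

```python
def topk_slot_candidates(x, z, vocab, bigram_logprob, k=100):
    """Rank candidate words y for the middle slot of 'x y z' by their
    bigram-LM score log P(y|x) + log P(z|y); simplified offline sketch."""
    scored = [(bigram_logprob(y, x) + bigram_logprob(z, y), y) for y in vocab]
    scored.sort(reverse=True)
    return [y for _, y in scored[:k]]
```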
LM is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Pauls, Adam and Klein, Dan
Experiments
To test our LM implementations, we performed experiments with two different language models.
Introduction
Where classic LMs take word tuples and produce counts or probabilities, we propose an LM that takes a word-and-context encoding (so the context need not be re-looked up) and returns both the probability and also the context encoding for the suffix of the original query.
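The query interface described above can be pictured roughly as follows; the class and method names here are ours, not the toolkit's actual API, and back-off handling is omitted.

```python
class ContextEncodedLM:
    """Sketch of a context-encoded n-gram LM: callers pass an integer
    context id instead of word strings and get back both the log-probability
    and the id of the query's suffix, which serves as the next context."""

    def __init__(self, ngram_table):
        # ngram_table: (context_id, word_id) -> (logprob, suffix_context_id)
        self.ngram_table = ngram_table

    def score(self, context_id, word_id):
        logprob, suffix_context_id = self.ngram_table[(context_id, word_id)]
        return logprob, suffix_context_id
```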
Introduction
Our LM toolkit, which is implemented in Java and compatible with the standard ARPA file formats, is available on the web.
Preliminaries
Tries or variants thereof are implemented in many LM toolkits, including SRILM (Stolcke, 2002), IRSTLM (Federico and Cettolo, 2007), CMU SLM (Whittaker and Raj, 2001), and MIT LM (Hsu and Glass, 2008).
Speeding up Decoding
Figure 3: Queries issued when scoring trigrams that are created when a state with LM context “the cat” combines with “fell down”.
Speeding up Decoding
We found in our experiments that a cache using linear probing provided marginal performance increases of about 40%, largely because of cached back-off computation, while our simpler cache increases performance by about 300% even over our HASH LM implementation.
Speeding up Decoding
As discussed in Section 3, our LM implementations can answer queries about context-encoded n-grams faster than explicitly encoded n-grams.
LM is mentioned in 18 sentences in this paper.
Topics mentioned in this paper:
Zaslavskiy, Mikhail and Dymetman, Marc and Cancedda, Nicola
Experiments
imizing the LM score over all possible permutations.
Experiments
Usually the LM score is used as one component of a more complex decoder score which also includes biphrase and distortion scores.
Experiments
But in this particular “translation task” from bad to good English, we consider that all “biphrases” are of the form e - e, where e is an English word, and we do not take into account any distortion: we only consider the quality of the permutation as it is measured by the LM component.
Phrase-based Decoding as TSP
4.1 From Bigram to N-gram LM
LM is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Knight, Kevin
Decipherment
We build a statistical English language model ( LM ) for the plaintext source model P (p), which assigns a probability to any English letter sequence.
Decipherment
We build an interpolated word+n-gram LM and use it to assign P(p) probabilities to any plaintext letter sequence p_1...p_n. The advantage is that it helps direct the sampler towards plaintext hypotheses that resemble natural language: high-probability letter sequences which form valid words such as “H E L L O” instead of sequences like “T X H R T”.
Decipherment
We set the interpolation weights for the word and n-gram LM as (0.9, 0.1).
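Written out, the interpolation described above combines the two components with those fixed weights (our rendering of the stated setup, with P_word and P_ngram standing for the word-based and letter n-gram models):

```latex
P(p_1 \ldots p_n) \;=\; 0.9 \cdot P_{\mathrm{word}}(p_1 \ldots p_n) \;+\; 0.1 \cdot P_{n\text{-gram}}(p_1 \ldots p_n)
```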
Experiments and Results
Even with a 3-gram letter LM , our method yields a +63% improvement in decipherment accuracy over EM on the homophonic cipher with spaces.
Experiments and Results
We observe that the word+3-gram LM proves highly effective when tackling more complex ciphers and cracks the Zodiac-408 cipher.
Experiments and Results
• Letter n-gram versus word+n-gram LMs: Figure 2 shows that using a word+3-gram LM instead of a 3-gram LM results in a +75% improvement in decipherment accuracy.
LM is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Shen, Libin and Xu, Jinxi and Weischedel, Ralph
Dependency Language Model
Suppose we use a trigram dependency LM,
Discussion
Only translation probability P was employed in the construction of the target forest due to the complexity of the syntax-based LM .
Discussion
Since our dependency LM models structures over target words directly based on dependency trees, we can build a single-step system.
Discussion
This dependency LM can also be used in hierarchical MT systems using lexicalized CFG trees.
Experiments
• str-dep: a string-to-dependency system with a dependency LM .
Experiments
The English side of this subset was also used to train a 3-gram dependency LM .
Experiments
                      BLEU% (lower / mixed)   TER% (lower / mixed)
Decoding (3-gram LM)
  baseline            38.18 / 35.77           58.91 / 56.60
  filtered            37.92 / 35.48           57.80 / 55.43
  str-dep             39.52 / 37.25           56.27 / 54.07
Rescoring (5-gram LM)
  baseline            40.53 / 38.26           56.35 / 54.15
  filtered            40.49 / 38.26           55.57 / 53.47
  str-dep             41.60 / 39.47           55.06 / 52.96
Implementation Details
We rescore 1000-best translations (Huang and Chiang, 2005) by replacing the 3-gram LM score with the 5-gram LM score computed offline.
LM is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Tomeh, Nadi and Habash, Nizar and Roth, Ryan and Farra, Noura and Dasigi, Pradeep and Diab, Mona
Discriminative Reranking for OCR
Base features include the HMM and LM scores produced by the OCR system.
Discriminative Reranking for OCR
Word LM features (“LM-word”) include the log probabilities of the hypothesis obtained using n-gram LMs with n ∈ {1, …
Discriminative Reranking for OCR
The LM models are built using the SRI Language Modeling Toolkit (Stolcke, 2002).
Experiments
Our baseline is based on the sum of the logs of the HMM and LM scores.
Experiments
For LM training we used 220M words from Arabic Gigaword 3, and 2.4M words from each “print” and “hand” ground truth annotations.
Introduction
The BBN Byblos OCR system (Natarajan et al., 2002; Prasad et al., 2008; Saleem et al., 2009), which we use in this paper, relies on a hidden Markov model (HMM) to recover the sequence of characters from the image, and uses an n-gram language model ( LM ) to emphasize the fluency of the output.
Introduction
For an input image, the OCR decoder generates an n-best list of hypotheses each of which is associated with HMM and LM scores.
LM is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Schwartz, Lane and Callison-Burch, Chris and Schuler, William and Wu, Stephen
Related Work
Like (Galley and Manning, 2009) our work implements an incremental syntactic language model; our approach differs by calculating syntactic LM scores over all available phrase-structure parses at each hypothesis instead of the 1-best dependency parse.
Related Work
LM                               In-domain (WSJ 23 ppl)   Out-of-domain (ur-en dev ppl)
WSJ 1-gram                       1973.57                  3581.72
WSJ 2-gram                        349.18                  1312.61
WSJ 3-gram                        262.04                  1264.47
WSJ 4-gram                        244.12                  1261.37
WSJ 5-gram                        232.08                  1261.90
WSJ HHMM                          384.66                   529.41
Interpolated WSJ 5-gram + HHMM    209.13                   225.48
Giga 5-gram                       258.35                   312.28
Interp. …
Related Work
HHMM parser beam sizes are indicated for the syntactic LM .
LM is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Talbot, David and Brants, Thorsten
Experimental Setup
4.1 Distributed LM Framework
Experimental Setup
We deploy the randomized LM in a distributed framework which allows it to scale more easily by distributing it across multiple language model servers.
Experimental Setup
The proposed randomized LM can encode parameters estimated using any smoothing scheme (e.g.
Experiments
LM              size (GB)   dev MT04   test MT05   test MT06
unpruned block   116        0.5304     0.5697      0.4663
unpruned rand     69        0.5299     0.5692      0.4659
pruned block      42        0.5294     0.5683      0.4665
pruned rand       27        0.5289     0.5679      0.4656
Introduction
Using higher-order models and larger amounts of training data can significantly improve performance in applications, however the size of the resulting LM can become prohibitive.
Introduction
Efficiency is paramount in applications such as machine translation which make huge numbers of LM requests per sentence.
Perfect Hash-based Language Models
Our randomized LM is based on the Bloomier filter (Chazelle et al., 2004).
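To convey the flavor of a randomized, lossy LM store, here is a much simpler fingerprint table, not the Bloomier-filter construction the paper actually uses: each n-gram is reduced to a bucket index plus a short fingerprint, values are stored without the keys, and a small false-positive rate is accepted in exchange for memory.

```python
import hashlib

class FingerprintNgramStore:
    """Lossy n-gram -> value store: keeps only a short fingerprint per key,
    so unseen n-grams may occasionally return a wrong value (false positive).
    A simplified sketch, not the Bloomier-filter scheme of the paper."""

    def __init__(self, num_buckets=1 << 20, fp_bits=12):
        self.num_buckets = num_buckets
        self.fp_mask = (1 << fp_bits) - 1
        self.table = {}  # bucket -> (fingerprint, value)

    def _hash(self, ngram):
        digest = hashlib.md5(" ".join(ngram).encode("utf-8")).hexdigest()
        h = int(digest, 16)
        return h % self.num_buckets, (h >> 64) & self.fp_mask

    def put(self, ngram, value):
        bucket, fp = self._hash(ngram)
        self.table[bucket] = (fp, value)  # collisions overwrite in this toy version

    def get(self, ngram, default=None):
        bucket, fp = self._hash(ngram)
        entry = self.table.get(bucket)
        if entry is not None and entry[0] == fp:
            return entry[1]
        return default
```

Increasing fp_bits lowers the false-positive rate at the cost of memory, the same trade-off the randomized LM exposes.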
Scaling Language Models
In the next section we describe our randomized LM scheme based on perfect hash functions.
LM is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Galley, Michel and Manning, Christopher D.
Introduction
This provides a compelling advantage over previous dependency language models for MT (Shen et al., 2008), which use a 5-gram LM only during reranking.
Introduction
In our experiments, we build a competitive baseline (Koehn et al., 2007) incorporating a 5-gram LM trained on a large part of Gigaword and show that our dependency language model provides improvements on five different test sets, with an overall gain of 0.92 in TER and 0.45 in BLEU scores.
Machine translation experiments
This LM required 16GB of RAM during training.
Machine translation experiments
Regarding the small difference in BLEU scores on MT08, we would like to point out that tuning on MT05 and testing on MT08 had a rather adverse effect with respect to translation length: while the two systems are relatively close in terms of BLEU scores (24.83 and 24.91, respectively), the dependency LM provides a much bigger gain when evaluated with BLEU precision (27.73 vs. 28.79), i.e., by ignoring the brevity penalty.
Machine translation experiments
LM newswire web speech all
Related work
In the latter paper, Huang and Chiang introduce rescoring methods named “cube pruning” and “cube growing”, which first use a baseline decoder (either synchronous CFG or a phrase-based system) and no LM to generate a hypergraph, and then rescoring this hypergraph with a language model.
LM is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Green, Spence and DeNero, John
A Class-based Model of Agreement
For contexts in which the LM is guaranteed to back off (for instance, after an unseen bigram), our decoder maintains only the minimal state needed (perhaps only a single word).
Inference during Translation Decoding
Algorithm excerpt: initialize η to −∞ and set η(t) = 0; compute τ* from the model parameters; compute q(e_{l+1}) = p(τ*) under the generative LM; set the new model state for the prefix e_1^l; output q(e_{l+1}).
Introduction
Intuition might suggest that the standard n-gram language model ( LM ) is sufficient to handle agreement phenomena.
Introduction
However, LM statistics are sparse, and they are made sparser by morphological variation.
Introduction
For English-to-Arabic translation, we achieve a +1.04 BLEU average improvement by tiling our model on top of a large LM .
Related Work
They used a target-side LM over Combinatorial Categorial Grammar (CCG) supertags, along with a penalty for the number of operator violations, and also modified the phrase probabilities based on the tags.
Related Work
Then they mixed the classes into a word-based LM .
Related Work
Target-Side Syntactic LMs. Our agreement model is a form of syntactic LM , of which there is a long history of research, especially in speech processing. Syntactic LMs have traditionally been too slow for scoring during MT decoding.
LM is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Sajjad, Hassan and Darwish, Kareem and Belinkov, Yonatan
Previous Work
Our LM experiments also affirmed the importance of in-domain English LMs.
Previous Work
Train  LM  BLEU  OOV
Previous Work
Table 1: Baseline results using the EG and AR training sets with GW and EGen corpora for LM training
Proposed Methods 3.1 Egyptian to EG’ Conversion
Then we used a trigram LM that we built from the aforementioned Aljazeera articles to pick the most likely candidate in context.
Proposed Methods 3.1 Egyptian to EG’ Conversion
We simply multiplied the character-level transformation probability with the LM probability, giving them equal weight.
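A minimal sketch of that selection step (the function and variable names are illustrative assumptions): each candidate's character-level transformation probability is multiplied by its LM probability in context, i.e. the two log-scores are added with equal weight, and the highest-scoring candidate is kept.

```python
import math

def pick_candidate(candidates, left_context, trigram_logprob):
    """candidates: list of (word, translit_prob) pairs for one token.
    Picks argmax of translit_prob * P_LM(word | left_context), i.e. the
    two log-scores are simply added with equal weight. Illustrative sketch."""
    best_word, best_score = None, float("-inf")
    for word, translit_prob in candidates:
        score = math.log(translit_prob) + trigram_logprob(word, left_context)
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```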
Proposed Methods 3.1 Egyptian to EG’ Conversion
• S1 and S2 trained on the EG’ with EGen and both EGen and GW for LM training, respectively.
LM is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Lin, Shih-Hsiang and Chen, Berlin
Experimental results and discussions 6.1 Baseline experiments
In the first set of experiments, we evaluate the baseline performance of the LM and BC summarizers (cf.
Experimental results and discussions 6.1 Baseline experiments
Second, the supervised summarizer (i.e., BC) outperforms the unsupervised summarizer (i.e., LM ).
Experimental results and discussions 6.1 Baseline experiments
One is that BC is trained with the handcrafted document-summary sentence labels in the development set while LM is instead conducted in a purely unsupervised manner.
Proposed Methods
In order to estimate the sentence generative probability, we explore the language modeling ( LM ) approach, which has been introduced to a wide spectrum of IR tasks and demonstrated with good empirical success, to predict the sentence generative probability.
Proposed Methods
In the LM approach, each sentence in a document can be simply regarded as a probabilistic generative model consisting of a unigram distribution (the so-called “bag-of-words” assumption) for generating the document (Chen et al., 2009):
LM is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Jia, Zhongye and Zhao, Hai
Experiments
The Chinese part of the corpus is segmented into words before LM training.
Experiments
The pinyin syllable segmentation already has very high (over 98%) accuracy with a trigram LM using improved Kneser-Ney smoothing.
Experiments
We consider different LM smoothing methods including Kneser-Ney (KN), improved Kneser-Ney (IKN), and Witten-Bell (WB).
Pinyin Input Method Model
W_E(V_{j,i} → V_{j+1,k}) = −log P(V_{j+1,k} | V_{j,i}). Although the model is formulated on a first-order HMM, i.e., the LM used for the transition probability is a bigram one, it is easy to extend the model to take advantage of a higher-order n-gram LM , by tracking longer history while traversing the graph.
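The higher-order extension mentioned above can be written by letting each vertex carry the last n − 1 items of its path, so that an edge weight becomes the negative log n-gram probability (our notation for the idea stated in the excerpt, not the paper's exact formulation):

```latex
W_E\bigl(\langle v_{j-n+2}, \ldots, v_j\rangle \rightarrow \langle v_{j-n+3}, \ldots, v_{j+1}\rangle\bigr)
\;=\; -\log P\bigl(v_{j+1} \mid v_{j-n+2}, \ldots, v_j\bigr)
```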
Related Works
Various approaches were made for the task including language model ( LM ) based methods (Chen et al., 2013), ME model (Han and Chang, 2013), CRF (Wang et al., 2013d; Wang et al., 2013a), SMT (Chiu et al., 2013; Liu et al., 2013), and graph model (Jia et al., 2013), etc.
LM is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith
Bayesian MT Decipherment via Hash Sampling
A high value for the LM concentration parameter α ensures that the LM probabilities do not deviate too far from the original fixed base distribution during sampling.
Decipherment Model for Machine Translation
For P(e), we use a word n-gram language model ( LM ) trained on monolingual target text.
Experiments and Results
We observe that our method produces much better results than the others even with a 2-gram LM .
Experiments and Results
With a 3-gram LM , the new method achieves the best performance; the highest BLEU score reported on this task.
Experiments and Results
Bayesian Hash Sampling with 2-gram LM: vocab=full (Ve), add_fertility=no 4.2; vocab=pruned*, add_fertility=yes 5.3
LM is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
He, Xiaodong and Deng, Li
Abstract
Features used in a phrase-based system usually include LM , reordering model, word and phrase counts, and phrase and lexicon translation models.
Abstract
Other models used in the baseline system include lexicalized ordering model, word count and phrase count, and a 3-gram LM trained on the English side of the parallel training corpus.
Abstract
In our system, a primary phrase table is trained from the 110K TED parallel training data, and a 3-gram LM is trained on the English side of the parallel data.
LM is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Zarriess, Sina and Cahill, Aoife and Kuhn, Jonas
Experimental Setup
Match    15.45  15.04  11.89
LM BLEU   0.68   0.68   0.65
Experiments
In Table 2, we compare the performance of the linguistically informed model described in Section 4 on the candidates sets against a random choice and a language model ( LM ) baseline.
Experiments
statistically significant. In general, the linguistic model largely outperforms the LM and is less sensitive to the additional confusion introduced by the SEMh input.
Experiments
Table 5 also reports the performance of an unlabelled model that additionally integrates LM scores.
Human Evaluation
(Nakanishi et al., 2005) also note a negative effect of including LM scores in their model, pointing out that the LM was not trained on enough data.
Human Evaluation
The corpus used for training our LM might also have been too small or distinct in genre.
LM is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Yeniterzi, Reyyan and Oflazer, Kemal
Experimental Setup and Results
As a baseline system, we built a standard phrase-based system, using the surface forms of the words without any transformations, and with a 3-gram LM in the decoder.
Experimental Setup and Results
We believe that the use of multiple language models (some much less sparse than the surface LM ) in the factored baseline is the main reason for the improvement.
Experimental Setup and Results
Using a 4-gram root LM , considerably less sparse than word forms but more sparse than tags, we get a BLEU score of 22.80 (max: 24.07, min: 21.57, std: 0.85).
Experiments with Constituent Reordering
These experiments were done on top of the model in Section 3.2.3 with 3-gram word and root LMs and an 8-gram tag LM.
LM is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Mi, Haitao and Huang, Liang and Liu, Qun
Forest-based translation
The decoder performs two tasks on the translation forest: 1-best search with integrated language model (LM), and k-best search with LM to be used in minimum error rate training.
Forest-based translation
For 1-best search, we use the cube pruning technique (Chiang, 2007; Huang and Chiang, 2007) which approximately intersects the translation forest with the LM .
Forest-based translation
Basically, cube pruning works bottom up in a forest, keeping at most k +LM items at each node, and uses the best-first expansion idea from the Algorithm 2 of Huang and Chiang (2005) to speed
LM is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhang, Chunliang
A Skeleton-based Approach to MT 2.1 Skeleton Identification
Given a translation model m, a language model lm and a vector of feature weights w, the model score of a derivation d is computed by
A Skeleton-based Approach to MT 2.1 Skeleton Identification
g(d; w, m, lm) = w_m · f_m(d) + w_lm · lm(d)   (4)
A Skeleton-based Approach to MT 2.1 Skeleton Identification
lm(d) and w_lm are the score and weight of the language model, respectively.
LM is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Toutanova, Kristina and Suzuki, Hisami and Ruopp, Achim
Inflection prediction models
LM 81.0 / 69.4; Model 91.6 / 91.0; Avg |I| 13.9 / 24.1
Integration of inflection models with MT systems
P_LM is the joint probability of the sequence of inflected words according to a trigram language model ( LM ).
Integration of inflection models with MT systems
The LM used for the integration is the same LM used in the base MT system that is trained on fully inflected word forms (the base MT system trained on stems uses an LM trained on a stem sequence).
Integration of inflection models with MT systems
Equation (1) shows that the model first selects the best sequence of inflected forms for each MT hypothesis Si according to the LM and the inflection model.
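Schematically, the selection step described here can be read as follows; this is a paraphrase of the excerpt (y ranges over inflections of the stems in hypothesis S_i), not the paper's Equation (1) itself:

```latex
\hat{y}(S_i) \;=\; \arg\max_{y}\; P_{\mathrm{LM}}(y_1 \ldots y_k)\;\cdot\; P_{\mathrm{infl}}(y \mid S_i)
```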
LM is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Neubig, Graham and Watanabe, Taro and Sumita, Eiichiro and Mori, Shinsuke and Kawahara, Tatsuya
Experimental Evaluation
TM (en) 1.80M 1.62M 1.35M 2.38M TM (other) 1.85M 1.82M 1.56M 2.78M LM (en) 52.7M 52.7M 52.7M 44.7M
Experimental Evaluation
Table 1: The number of words in each corpus for TM and LM training, tuning, and testing.
Experimental Evaluation
We use the news commentary corpus for training the TM, and the news commentary and Europarl corpora for training the LM .
LM is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Wu, Yuanbin and Ng, Hwee Tou
Inference with First Order Variables
The language model score h(s′, LM ) of s′ based on a large web corpus;
Inference with First Order Variables
w_l,m,k = ν_LM · h(s′, LM ) + Σ_t λ_t · f(s′, …
Inference with First Order Variables
w_Noun,2,singular = ν_LM · h(s′, LM ) + λ_ART · f(s′, ART) + λ_PREP · f(s′, PREP) + λ_NOUN · f(s′, NOUN) + μ_ART · g(s′, ART) + μ_PREP · g(s′, PREP) + μ_NOUN · g(s′, NOUN).
Inference with Second Order Variables
w_u,v = ν_LM · h(s′, LM ) + Σ_t λ_t · f(s′, …
LM is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Lu, Shixiang and Chen, Zhenbiao and Xu, Bo
Experiments and Results
The LM corpus is the English side of the parallel data (BTEC, CJK and CWMT083) (1.34M sentences).
Experiments and Results
The LM corpus is the English side of the parallel data as well as the English Gigaword corpus (LDC2007T07) (11.3M sentences).
Input Features for DNN Feature Learning
Backward LM has been introduced by Xiong et al.
Input Features for DNN Feature Learning
(2011), which successfully capture both the preceding and succeeding contexts of the current word, and we estimate the backward LM by inverting the order in each sentence in the training data from the original order to the reverse order.
Related Work
(2012) improved translation quality of n-gram translation model by using a bilingual neural LM , where translation probabilities are estimated using a continuous representation of translation units in lieu of standard discrete representations.
LM is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zhai, Ke and Williams, Jason D
Latent Structure in Dialogues
The simplest formulation we consider is an HMM where each state contains a unigram language model ( LM ), proposed by Chotimongkol (2008) for task-oriented dialogue and originally
Latent Structure in Dialogues
… the words are generated (independently) according to the LM.
Latent Structure in Dialogues
(2010) extends LM-HMM to allow words to be emitted from two additional sources: the topic of the current dialogue φ, or a background LM shared across all dialogues.
LM is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Yannakoudakis, Helen and Briscoe, Ted and Medlock, Ben
Approach
In order to estimate the error-rate, we build a trigram language model (LM) using ukWaC (ukWaC LM ) (Ferraresi et al., 2008), a large corpus of English containing more than 2 billion tokens.
Approach
                        correlation  correlation
word ngrams                 0.601        0.598
+PoS ngrams                 0.682        0.687
+script length              0.692        0.689
+PS rules                   0.707        0.708
+complexity                 0.714        0.712
Error-rate features:
+ukWaC LM                   0.735        0.758
+CLC LM                     0.741        0.773
+true CLC error-rate        0.751        0.789
Approach
CLC (CLC LM ).
Evaluation
feature           correlation  correlation
none                 0.741        0.773
word ngrams          0.713        0.762
PoS ngrams           0.724        0.737
script length        0.734        0.772
PS rules             0.712        0.731
complexity           0.738        0.760
ukWaC+CLC LM         0.714        0.712
Evaluation
In the experiments reported hereafter, we use the ukWaC+CLC LM to calculate the error-rate.
LM is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Kuhn, Roland and Foster, George
Experiments
Other features include lexical weighting in both directions, word count, a distance-based RM, a 4-gram LM trained on the target side of the parallel data, and a 6-gram English Gigaword LM .
Experiments
Some of the results reported above involved linear TM mixtures, but none of them involved linear LM mixtures.
Experiments
For instance, with an initial Chinese system that employs linear mixture LM adaptation (lin-lm) and has a BLEU of 32.1, adding 1-feature VSM adaptation (+vsm, joint) improves performance to 33.1 (improvement significant at p < 0.01), while adding 3-feature VSM instead (+vsm, 3 feat.)
Introduction
Both were studied in (Foster and Kuhn, 2007), which concluded that the best approach was to combine sub-models of the same type (for instance, several different TMs or several different LMs) linearly, while combining models of different types (for instance, a mixture TM with a mixture LM ) log-linearly.
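For concreteness, the two combination schemes contrasted here take the familiar forms (standard definitions, not quoted from the paper):

```latex
P_{\mathrm{lin}}(w \mid h) \;=\; \sum_{i} \lambda_i\, P_i(w \mid h),
\qquad
P_{\mathrm{loglin}}(w \mid h) \;\propto\; \prod_{i} P_i(w \mid h)^{\lambda_i}
```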
LM is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Kuhn, Roland and Larkin, Samuel
Experiments
The first is a 4-gram LM which is estimated on the target side of the texts used in the large data condition (below).
Experiments
The second is a 5-gram LM estimated on English Gigaword.
Experiments
We used two LMs in loglinear combination: a 4-gram LM trained on the target side of the parallel
LM is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Kim, Jungi and Li, Jin-Ji and Lee, Jong-Hyeok
Experiment
VS    0.4196   0.4542  0.6600
BM25  0.4235†  0.4579  0.6600
LM    0.4158   0.4520  0.6560
PMI   0.4177   0.4538  0.6620
LSA   0.4155   0.4526  0.6480
WP    0.4165   0.4533  0.6640
Experiment
Model     Precision  Recall  F-Measure
BASELINE    0.305     0.866    0.451
VS          0.331     0.807    0.470
BM25        0.327     0.795    0.464
LM          0.325     0.794    0.461
LSA         0.315     0.806    0.453
PMI         0.342     0.603    0.436
DTP         0.322     0.778    0.455
VS-LSA      0.335     0.769    0.466
VS-PMI      0.311     0.833    0.453
VS-DTP      0.342     0.745    0.469
Experiment
Differences in effectiveness of VS, BM25, and LM come from parameter tuning and corpus differences.
Term Weighting and Sentiment Analysis
IR models, such as Vector Space (VS), probabilistic models such as BM25, and Language Modeling ( LM ), albeit in different forms of approach and measure, employ heuristics and formal modeling approaches to effectively evaluate the relevance of a term to a document (Fang et al., 2004).
Term Weighting and Sentiment Analysis
In our experiments, we use the Vector Space model with Pivoted Normalization (VS), Probabilistic model (BM25), and Language modeling with Dirichlet Smoothing ( LM ).
LM is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zhang, Dongdong and Li, Mu and Duan, Nan and Li, Chi-Ho and Zhou, Ming
Experiments
In the tables, Lm denotes the n-gram language model feature, T mh denotes the feature of collocation between target head words and the candidate measure word, Smh denotes the feature of collocation between source head words and the candidate measure word, HS denotes the feature of source head word selection, Punc denotes the feature of target punctuation position, T [ex denotes surrounding word features in translation, Slex denotes surrounding word features in source sentence, and Pas denotes Part-Of-Speech feature.
Experiments
Feature setting  Precision  Recall
Baseline          54.82%    45.61%
Lm                51.11%    41.24%
+Tmh              61.43%    49.22%
+Punc             62.54%    50.08%
+Tlex             64.80%    51.87%
Experiments
Feature setting  Precision  Recall
Baseline          54.82%    45.61%
Lm                51.11%    41.24%
+Tmh+Smh          64.50%    51.64%
+Hs               65.32%    52.26%
+Punc             66.29%    53.10%
+Pos              66.53%    53.25%
+Tlex             67.50%    54.02%
+Slex             69.52%    55.54%
LM is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Lee, John and Naradowsky, Jason and Smith, David A.
Joint Model
• LINK: O(n²) boolean variables L_ij corresponding to a possible link between each pair
Joint Model
L_ij = true when there is a dependency link from the word w_i to the word w_j.
Joint Model
• POS-LINK: There are O(n²m²) such ternary factors, each connected to the variables L_ij, T^pos_wi, and T^pos_wj.
LM is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zhou, Guangyou and Liu, Fang and Liu, Yang and He, Shizhu and Zhao, Jun
Experiments
#  Methods        MAP    P@10
1  VSM            0.242  0.226
2  LM             0.385  0.242
3  Jeon et al. …
Experiments
Row 1 and row 2 are two baseline systems, which model the relevance score using VSM (Cao et al., 2010) and language model ( LM ) (Zhai and Lafferty, 2001; Cao et al., 2010) in the term space.
Experiments
(1) Monolingual translation models significantly outperform the VSM and LM (row 1 and
Introduction
As a principle approach to capture semantic word relations, word-based translation models are built by using the IBM model 1 (Brown et al., 1993) and have been shown to outperform traditional models (e.g., VSM, BM25, LM ) for question retrieval.
LM is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Beaufort, Richard and Roekhaut, Sophie and Cougnon, Louise-Amélie and Fairon, Cédrick
The normalization models
All tokens Tj of S are concatenated together and composed with the lexical language model LM .
The normalization models
S′ = BestPath( (T_1 · … · T_n) ∘ LM )   (6)
The normalization models
LM = FirstProjection( L ∘ LM_w )   (13)
LM is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Sennrich, Rico and Schwenk, Holger and Aransa, Walid
Translation Model Architecture
We find that an adaptation of the TM and LM to the full development set (system “1 cluster”) yields the smallest improvements over the unadapted baseline.
Translation Model Architecture
For the IT test set, the system with gold labels and TM adaptation yields an improvement of 0.7 BLEU (21.1 → 21.8), LM adaptation yields 1.3 BLEU (21.1 → 22.4), and adapting both models outperforms the baseline by 2.1 BLEU (21.1 → 23.2).
Translation Model Architecture
TM adaptation with 8 clusters (21.1 → 21.8 → 22.1), or LM adaptation with 4 or 8 clusters (21.1 → 22.4 → 23.1).
LM is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Eidelman, Vladimir and Marton, Yuval and Resnik, Philip
Discussion
In our preliminary experiments with the smaller trigram LM , MERT did better on MT05 in the smaller feature set, and MIRA had a small advantage in two cases.
Experiments
We trained a 4-gram LM on the
Experiments
In preliminary experiments with a smaller trigram LM , our RM method consistently yielded the highest scores in all Chinese-English tests, up to 1.6 BLEU and 6.4 TER from MIRA, the second-best performer.
The Relative Margin Machine in SMT
4-gram LM 24M 600M —
LM is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zhao, Shiqi and Wang, Haifeng and Liu, Ting and Li, Sheng
Experiments
In detail, a paraphrase pattern e′ of e was reranked based on a language model ( LM ):
Experiments
score_LM(e′|S_E) is the LM-based score: score_LM(e′|S_E) = (1/|S′_E|) · log P_LM(S′_E), where S′_E is the sentence generated by replacing e in S_E with e′.
Experiments
To investigate the contribution of the LM-based score, we ran the experiment again with λ = 1 (ignoring the LM-based score) and found that the precision is 57.09%.
LM is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Foster, George and Sankaran, Baskaran and Sarkar, Anoop
Experiments & Results 4.1 Experimental Setup
For the mixture baselines, we used a standard one-pass phrase-based system (Koehn et al., 2003), Portage (Sadat et al., 2005), with the following 7 features: relative-frequency and lexical translation model (TM) probabilities in both directions; word-displacement distortion model; language model ( LM ) and word count.
Experiments & Results 4.1 Experimental Setup
For ensemble decoding, we modified an in-house implementation of hierarchical phrase-based system, Kriya (Sankaran et al., 2012) which uses the same features mentioned in (Chiang, 2005): forward and backward relative-frequency and lexical TM probabilities; LM ; word, phrase and glue-rules penalty.
Experiments & Results 4.1 Experimental Setup
We suspect this is because a single LM is shared between both models.
LM is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Neubig, Graham and Watanabe, Taro and Mori, Shinsuke and Kawahara, Tatsuya
Experiments
TM (en) 2.80M 3.10M 2.77M 2.13M TM (other) 2.56M 2.23M 3.05M 2.34M LM (en) 16.0M 15.5M 13.8M 11.5M
Experiments
LM (other) 15.3M 11.3M 15.6M 11.9M
Experiments
Table 1: The number of words in each corpus for TM and LM training, tuning, and testing.
LM is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Hasan, Kazi Saidul and Ng, Vincent
Keyphrase Extraction Approaches
A unigram LM and an n-gram LM are constructed for each of these two corpora.
Keyphrase Extraction Approaches
Phraseness, defined using the foreground LM, is calculated as the loss of information incurred as a result of assuming a unigram LM (i.e., conditional independence among the words of the phrase) instead of an n-gram LM (i.e., the phrase is drawn from an n-gram LM ).
Keyphrase Extraction Approaches
Informativeness is computed as the loss that results because of the assumption that the candidate is sampled from the background LM rather than the foreground LM .
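In pointwise-KL form, the two scores described above are commonly written as follows (this follows the formulation usually attributed to Tomokiyo and Hurst (2003), on which the surveyed approach is based; the excerpts themselves state the scores only in prose):

```latex
\mathrm{phraseness}(w_1 \ldots w_k) \;=\; p_{\mathrm{fg}}(w_1 \ldots w_k)\,
  \log \frac{p^{\,n\text{-gram}}_{\mathrm{fg}}(w_1 \ldots w_k)}{\prod_{i=1}^{k} p^{\,\mathrm{uni}}_{\mathrm{fg}}(w_i)},
\qquad
\mathrm{informativeness}(w_1 \ldots w_k) \;=\; p_{\mathrm{fg}}(w_1 \ldots w_k)\,
  \log \frac{p_{\mathrm{fg}}(w_1 \ldots w_k)}{p_{\mathrm{bg}}(w_1 \ldots w_k)}
```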
LM is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Heilman, Michael and Cahill, Aoife and Madnani, Nitin and Lopez, Melissa and Mulholland, Matthew and Tetreault, Joel
System Description
Finally, the system computes the average log-probability and number of out-of-vocabulary words from a language model trained on a collection of essays written by nonnative English speakers (“nonnative LM”).
System Description
our system                 0.668
− nonnative LM (§3.2.2)    0.665
− HPSG parse (§3.2.3)      0.664
− PCFG parse (§3.2.4)      0.662
System Description
− spelling (§3.2.1)        0.643
− gigaword LM (§3.2.2)     0.638
− link parse (§3.2.3)      0.632
LM is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zweig, Geoffrey and Platt, John C. and Meek, Christopher and Burges, Christopher J.C. and Yessenalina, Ainur and Liu, Qiang
Discussion
Method        SAT   Holmes
Chance        20%   20%
GT n-gram LM  42    39
RNN           42    45
Experimental Results 5.1 Data Resources
These data sources were evaluated using the baseline n-gram LM approach of Section 3.1.
Experimental Results 5.1 Data Resources
Method                       Test
3-input LSA                  46%
LSA + Good-Turing LM         53
LSA + Good-Turing LM + RNN   52
LM is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Pighin, Daniele and Cornolti, Marco and Alfonseca, Enrique and Filippova, Katja
Pattern extraction by sentence compression
The method of Clarke and Lapata (2008) uses a trigram language model ( LM ) to score compressions.
Pattern extraction by sentence compression
Since we are interested in very short outputs, an LM trained on standard, uncompressed text would not be suitable.
Pattern extraction by sentence compression
Instead, we chose to modify the method of Filippova and Altun (2013) because it relies on dependency parse trees and does not use any LM scoring.
LM is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut
Models
Secondly, it is easy to use a large language model ( LM ) with Moses.
Models
We build the LM on the target word types in the data to be filtered.
Models
The LM is implemented as a five-gram model using the SRILM-Toolkit (Stolcke, 2002), with Add-1 smoothing for unigrams and Kneser-Ney smoothing for higher n-grams.
LM is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Blunsom, Phil and Cohn, Trevor and Osborne, Miles
Evaluation
The feature set includes: a trigram language model ( lm ) trained
Evaluation
Discriminative max-derivation       25.78
Hiero (pd, gr, re, we)              26.48
Discriminative max-translation      27.72
Hiero (pd, …, gr, re, we)           28.14
Hiero (pd, …, gr, re, we, lm)       32.00
Evaluation
Hiero (pd, …, gr, re, we, lm ) represents state-
LM is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Wang, Jia and Li, Qing and Chen, Yuanzhu Peter and Lin, Zhangxi
Experimental Evaluation
The second one, LM , is based on statistical language models for relevant information retrieval (Ponte and Croft, 1998).
Experimental Evaluation
Forum
Okapi  0.827  0.833  0.807  0.751
LM     0.804  0.833  0.807  0.731
Our    0.967  0.967  0.9    0.85
Experimental Evaluation
Blog
Okapi  0.733  0.651  0.667  0.466
LM     0.767  0.718  0.70   0.524
Our    0.933  0.894  0.867  0.756
LM is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: