Abstract | We attempt to extract this information from history-contexts of up to ten words in size, and find that it complements the n-gram model well, since the latter inherently suffers from data scarcity when learning long history-contexts.
Introduction | The commonly used n-gram model (Bahl et al.
Introduction | Although n-gram models are simple and effective, modeling long history-contexts leads to severe data scarcity problems.
Language Modeling with TD and TO | The prior, which is usually implemented as a unigram model, can also be replaced with a higher-order n-gram model, for instance the bigram model:
Language Modeling with TD and TO | Replacing the unigram model with a higher-order n-gram model is important to compensate for the damage incurred by the conditional independence assumption made earlier.
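As a sketch of this replacement, the prior over a word can interpolate a bigram estimate with a unigram fallback. The interpolation weight `lam` and the helper names below are illustrative assumptions, not the paper's formulation:

```python
from collections import Counter

def train_counts(corpus):
    """Collect unigram and bigram counts from tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
    return uni, bi

def bigram_prior(w, prev, uni, bi, total, lam=0.7):
    """Bigram prior interpolated with a unigram fallback.

    `lam` is a hypothetical interpolation weight; the paper does not
    specify how the higher-order prior is smoothed.
    """
    p_uni = uni[w] / total
    p_bi = bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni
```

Interpolation (rather than pure backoff) is used here only to keep the sketch short.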
Motivation of the Proposed Approach | In the n-gram model, for example, these two attributes are jointly taken into account in the ordered word-sequence. |
Motivation of the Proposed Approach | Consequently, the n-gram model can only be effectively implemented within a short history-context (e.g., of size three or four).
Motivation of the Proposed Approach | However, intermediate distances beyond the limits of the n-gram model can be very useful and should not be discarded.
Related Work | 2007) disassembles the n-gram into (n−1) word-pairs, such that each pair is modeled by a distance-k bigram model, where 1 ≤ k ≤ n − 1.
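A minimal sketch of this decomposition, collecting the pair counts from which the distance-k bigram models would be estimated (the count-collection scheme is an assumption for illustration):

```python
from collections import Counter

def distance_k_pairs(sentence, n):
    """Decompose each n-gram window into (n-1) word pairs, one per
    distance k = 1 .. n-1, each feeding a distance-k bigram model."""
    pairs = {k: Counter() for k in range(1, n)}
    for i, w in enumerate(sentence):
        for k in range(1, n):
            if i + k < len(sentence):
                pairs[k][(w, sentence[i + k])] += 1
    return pairs
```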
Error Classification | While we employ seven types of features (see Sections 4.2 and 4.3), only the word n-gram features are subject to feature selection.2 Specifically, we employ |
Error Classification | the top n_i n-gram features, as selected according to information gain computed over the training data (see Yang and Pedersen (1997) for details).
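The selection step can be sketched with the standard information-gain criterion of Yang and Pedersen (1997); representing documents as sets of n-grams is an assumption made for brevity:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(present, labels):
    """Information gain of a binary feature over class labels."""
    n = len(labels)
    on = [y for f, y in zip(present, labels) if f]
    off = [y for f, y in zip(present, labels) if not f]
    cond = sum(len(p) / n * entropy(p) for p in (on, off) if p)
    return entropy(labels) - cond

def top_ngram_features(docs, labels, k):
    """Keep the top-k features by information gain; `docs` are sets of
    n-grams (an assumed representation)."""
    vocab = sorted({g for d in docs for g in d})
    return sorted(vocab,
                  key=lambda g: -info_gain([g in d for d in docs], labels))[:k]
```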
Error Classification | Aggregated word n-gram features. |
Evaluation | Our Baseline system, which only uses word n-gram and random indexing features, performs uniformly poorly across both micro and macro F-scores (see row 1).
Score Prediction | 6Before tuning the feature selection parameter, we have to sort the list of n-gram features occurring in the training set.
Abstract | We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of well-known Kneser-Ney (1995) smoothing. |
Abstract | We present experimental results for heavily pruned backoff n-gram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods. |
Introduction | Smoothed n-gram language models are the de facto standard statistical models of language for a wide range of natural language applications, including speech recognition and machine translation.
Introduction | Such models are trained on large text corpora, by counting the frequency of n-gram collocations, then normalizing and smoothing (regularizing) the resulting multinomial distributions. |
Introduction | Briefly, the smoothing method re-estimates lower-order n-gram parameters in order to avoid overestimating the likelihood of n-grams that already have ample probability mass allocated as part of higher-order n-grams.
Preliminaries | N-gram language models are typically presented mathematically in terms of words w, the strings (histories) h that precede them, and the suffixes of the histories (backoffs) h' that are used in the smoothing recursion.
Preliminaries | where c(·) is the count of the n-gram sequence.
Preliminaries | N-gram language models allow for a sparse representation, so that only a subset of the possible n-grams must be explicitly stored. |
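A minimal sketch of such a sparse representation, assuming ARPA-style log10 probabilities and backoff weights stored only for observed entries (the class layout is illustrative, not a toolkit's actual API):

```python
class BackoffLM:
    """Sparse backoff n-gram model: only observed n-grams and backoff
    weights are stored; unseen n-grams recurse to the lower order."""

    def __init__(self, logprob, backoff):
        self.logprob = logprob   # {n-gram tuple: log10 probability}
        self.backoff = backoff   # {history tuple: log10 backoff weight}

    def score(self, ngram):
        if ngram in self.logprob:
            return self.logprob[ngram]
        if len(ngram) == 1:
            return float("-inf")  # a real toolkit would map this to <unk>
        # back off: pay the history's backoff weight, drop the oldest word
        return self.backoff.get(ngram[:-1], 0.0) + self.score(ngram[1:])
```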
Task A: Polarity Classification | 4.4 N-gram Evaluation and Results |
Task A: Polarity Classification | N-gram features are widely used in a variety of classification tasks, therefore we also use them in our polarity classification task. |
Task A: Polarity Classification | Figure 2 shows a study of the influence of the different information sources and their combination with n-gram features for English. |
Task B: Valence Prediction | The Farsi and Russian regression models are based only on n-gram features, while the English and Spanish regression models have both n-gram and LIWC features. |
Introduction | Table 1 shows the n-gram overlap proportions in a sentence aligned data set of 137K sentence pairs from aligning Simple English Wikipedia and English Wikipedia articles (Coster and Kauchak, 2011a).1 The data highlights two conflicting views: does the benefit of additional data outweigh the problem of the source of the data? |
Introduction | n-gram size:      1     2     3     4     5
simple in normal    0.96  0.80  0.68  0.61  0.55
normal in simple    0.87  0.68  0.58  0.51  0.46
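Overlap proportions of this kind can be computed with a small script; the type-level (set-based) definition below is an assumption, since the paper does not give the exact counting procedure:

```python
def ngram_overlap(source, target, n):
    """Proportion of n-gram types in `source` sentences that also appear
    in `target` sentences."""
    def grams(sents):
        return {tuple(s[i:i + n]) for s in sents for i in range(len(s) - n + 1)}
    src = grams(source)
    return len(src & grams(target)) / len(src) if src else 0.0
```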
Why Does Unsimplified Data Help? | trained using a smoothed version of the maximum likelihood estimate for an n-gram . |
Why Does Unsimplified Data Help? | p(c | b) = count(bc) / count(b), where count(·) is the number of times the n-gram occurs in the training corpus.
Why Does Unsimplified Data Help? | For interpolated and backoff n-gram models, these counts are smoothed based on the probabilities of lower-order n-gram models, which are in turn calculated based on counts from the corpus.
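The interpolated case can be sketched recursively; the weight `lam` and the add-one floor at the unigram level are illustrative choices, not the smoothing the paper actually uses:

```python
def interpolated_prob(ngram, counts, vocab_size, lam=0.8):
    """Interpolated n-gram estimate: the maximum-likelihood estimate at
    order n is mixed with the recursively smoothed order-(n-1) estimate."""
    if len(ngram) == 1:
        # unigram floor with add-one smoothing (illustrative choice)
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return (counts.get(ngram, 0) + 1) / (total + vocab_size)
    hist_count = counts.get(ngram[:-1], 0)
    ml = counts.get(ngram, 0) / hist_count if hist_count else 0.0
    return lam * ml + (1 - lam) * interpolated_prob(ngram[1:], counts,
                                                    vocab_size, lam)
```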
Experimental Evaluation | N-gram suggests no non-referential instances
Experimental Evaluation | Effectiveness To understand the contribution of the n-gram (NG), ontology (ON), and clustering (CL) based modules, we ran each separately, as well as every possible combination. |
Experimental Evaluation | Of the three individual modules, the n-gram and clustering methods achieve an F-measure of around 0.9, while the ontology-based module performs only modestly above the baseline.
Term Ambiguity Detection (TAD) | This module examines n-gram data from a large text collection. |
Term Ambiguity Detection (TAD) | The rationale behind the n-gram module is based on the understanding that terms appearing in non-named entity contexts are likely to be non-referential, and terms that can be non-referential are ambiguous. |
Term Ambiguity Detection (TAD) | Since we wish for the ambiguity detection determination to be fast, we develop our method to make this judgment solely on the n-gram probability, without the need to examine each individual usage context. |
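A hypothetical version of that judgment, reduced to a single ratio test over aggregate counts; the names and the threshold are assumptions for illustration, and the paper's actual criterion may differ:

```python
def is_ambiguous(term_count, non_ne_count, threshold=0.1):
    """Flag a term as ambiguous when the share of its n-gram occurrences
    in non-named-entity contexts (e.g. lowercase contexts) exceeds a
    threshold, without inspecting individual usage contexts."""
    if term_count == 0:
        return False
    return non_ne_count / term_count >= threshold
```

The point of the design is speed: only two precomputed counts per term are consulted at query time.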
Baseline MT | The LM used for decoding is a log-linear combination of four word n-gram LMs which are built on different English |
Name-aware MT Evaluation | where wn is a set of positive weights summing to one, usually set uniformly as wn = 1/N; c is the length of the system translation and r is the length of the reference translation; and pn is the modified n-gram precision, defined as: pn = Σ_C Σ_{n-gram ∈ C} Count_clip(n-gram) / Σ_C Σ_{n-gram ∈ C} Count(n-gram)
Name-aware MT Evaluation | As in the BLEU metric, we first count the maximum number of times an n-gram occurs in any single reference translation.
Name-aware MT Evaluation | The weight of an n-gram in reference translation is the sum of weights of all tokens it contains. |
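The clipped counting just described is the core of BLEU's modified n-gram precision, which can be sketched as:

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """BLEU modified n-gram precision: each candidate n-gram count is
    clipped by its maximum count in any single reference."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand = grams(candidate)
    max_ref = Counter()
    for ref in references:
        for g, c in grams(ref).items():
            max_ref[g] = max(max_ref[g], c)
    total = sum(cand.values())
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / total if total else 0.0
```

The name-aware variant additionally weights n-grams (by the token weights mentioned above) rather than counting them uniformly.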
New Sense Indicators | N-gram Probability Features The goal of the Type:NgramProb feature is to capture the fact that “unusual contexts” might imply new senses. |
New Sense Indicators | To capture this, we can look at the log probability of the word under consideration given its n-gram context, both according to an old-domain language model (call this ℓ_old) and a new-domain language model (call this ℓ_new).
New Sense Indicators | From these four values, we compute corpus-level (and therefore type-based) statistics of the new-domain n-gram log probability (ℓ_ng^new), the difference between the n-gram probabilities in each domain (ℓ_ng^new − ℓ_ng^old), the difference between the n-gram and unigram probabilities in the new domain (ℓ_ng^new − ℓ_ug^new), and finally the combined difference: ℓ_ng^new − ℓ_ug^new + ℓ_ug^old − ℓ_ng^old.
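The four statistics can be assembled as follows, under the assumption that the four token-level values are the n-gram and unigram log probabilities in the new and old domains; the function and key names are illustrative:

```python
def new_sense_stats(lp_ng_new, lp_ng_old, lp_ug_new, lp_ug_old):
    """Type-level feature values from n-gram/unigram log probabilities
    in the new and old domains (a reconstruction of the notation)."""
    return {
        "ng_new": lp_ng_new,                              # new-domain n-gram
        "domain_diff": lp_ng_new - lp_ng_old,             # across domains
        "unigram_diff": lp_ng_new - lp_ug_new,            # n-gram vs unigram
        "combined": lp_ng_new - lp_ug_new + lp_ug_old - lp_ng_old,
    }
```

The combined term is a difference-in-differences: how much more (or less) the context helps in the new domain than it did in the old one.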
Abstract | As the algorithm generates dependency trees for partial translations left-to-right in decoding, it allows for efficient integration of both n-gram and dependency language models. |
Introduction | In addition, it is straightforward to integrate n-gram language models into phrase-based decoders in which translation always grows left-to-right. |
Introduction | Unfortunately, as syntax-based decoders often generate target-language words in a bottom-up way using the CKY algorithm, integrating n-gram language models becomes more expensive because they have to maintain target boundary words at both ends of a partial translation (Chiang, 2007; Huang and Chiang, 2007). |
Introduction | 3. efficient integration of n-gram language model: as translation grows left-to-right in our algorithm, integrating n-gram language models is straightforward. |
Conclusion | A novel technique was also proposed to rank n-gram phrases, in which relevance-based ranking was used in conjunction with a semi-supervised generative model.
Empirical Evaluation | The reduced dataset consists of 1095586 tokens (after n-gram preprocessing in §4), 40102 posts with an average of 27 posts or interactions per pair. |
Model | Like most generative models for text, a post (document) is viewed as a bag of n-grams and each n-gram (word/phrase) takes one value from a predefined vocabulary. |
Phrase Ranking based on Relevance | While this is reasonable, a significant n-gram with high likelihood score may not necessarily be relevant to the problem domain. |
Phrase Ranking based on Relevance | There is nothing wrong with this per se, because the statistical tests only judge the significance of an n-gram, but a significant n-gram may not necessarily be relevant in a given problem domain.
Experiments | RESPONSE The n-gram and emotion features induced from the response. |
Experiments | The n-gram and emotion features induced from the response and the addressee’s utterance. |
Predicting Addressee’s Emotion | We extract all the n-grams (n ≤ 3) in the response to induce (binary) n-gram features.
Predicting Addressee’s Emotion | The extracted n-grams activate another set of binary n-gram features. |
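Extracting the binary features for n ≤ 3 can be sketched as follows (representing features as n-gram tuples in a set is an assumed encoding):

```python
def binary_ngram_features(tokens, max_n=3):
    """All n-grams with n <= max_n from an utterance, as a set of binary
    features (presence/absence only, no counts)."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}
```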
Architecture of BRAINSUP | N-gram likelihood. |
Architecture of BRAINSUP | This is simply the likelihood of a sentence estimated by an n-gram language model, to enforce the generation of well-formed word sequences. |
Architecture of BRAINSUP | When a solution is not complete, we include in the computation only the sequences of contiguous words (i.e., not interrupted by empty slots) whose length is greater than or equal to the order of the n-gram model.
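A sketch of that restriction, where `None` marks an empty slot and `score_ngram` stands in for an n-gram LM lookup (an assumed callback):

```python
def partial_logprob(slots, score_ngram, order):
    """Score only contiguous filled spans whose length is at least the
    model order; empty slots (None) break the sequence."""
    total, run = 0.0, []
    for w in list(slots) + [None]:        # sentinel flushes the last run
        if w is None:
            if len(run) >= order:
                for i in range(len(run) - order + 1):
                    total += score_ngram(tuple(run[i:i + order]))
            run = []
        else:
            run.append(w)
    return total
```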
Evaluation | The four combinations of features are:
base: Target-word scorer + N-gram likelihood + Dependency likelihood + Variety scorer + Unusual-words scorer + Semantic cohesion;
base+D: all the scorers in base + Domain relatedness;
base+D+C: all the scorers in base+D + Chromatic connotation;
base+D+E: all the scorers in base+D + Emotional connotation;
base+D+P: all the scorers in base+D + Phonetic features.
Decipherment Model for Machine Translation | For P(e), we use a word n-gram language model (LM) trained on monolingual target text. |
Decipherment Model for Machine Translation | Generate a target (e.g., English) string e = e1...en with probability P(e) according to an n-gram language model.
Feature-based representation for Source and Target | For instance, context features for word w may include other words (or phrases) that appear in the immediate context (n-gram window) surrounding w in the monolingual corpus.
Feature-based representation for Source and Target | The feature construction process is described in more detail below: Target Language: We represent each word (or phrase) ei with the following contextual features along with their counts: (a) f−context: every (word n-gram, position) pair immediately preceding ei in the monolingual corpus (n=1, position=−1); (b) similar features f+context to model the context following ei; and (c) we also throw in generic context features fcontext without position information: every word that co-occurs with ei in the same sentence.
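A sketch of the three feature families for a single position; the tuple encodings and function name are illustrative assumptions:

```python
def context_features(tokens, i, n=1):
    """Position-marked preceding/following n-gram features plus
    position-free co-occurrence features for tokens[i]."""
    feats = []
    if i - n >= 0:                                   # f-context
        feats.append(("prev", tuple(tokens[i - n:i]), -1))
    if i + 1 + n <= len(tokens):                     # f+context
        feats.append(("next", tuple(tokens[i + 1:i + 1 + n]), 1))
    # generic co-occurrence features, no position information
    feats += [("cooc", w) for j, w in enumerate(tokens) if j != i]
    return feats
```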
Learning Class Attributes | We extract prevalent common nouns for males and females by selecting only those nouns that (a) occur more than 200 times in the dataset, (b) mostly occur with male or female pronouns, and (c) occur as lowercase more often than uppercase in a web-scale N-gram corpus (Lin et al., 2010). |
Learning Class Attributes | We obtain the best of both worlds by matching our precise pattern against a version of the Google N-gram Corpus that includes the part-of-speech tag distributions for every N-gram (Lin et al., 2010). |
Twitter Gender Prediction | We include n-gram features with the original capitalization pattern and separate features with the n-grams lower-cased.
MBR-based Answering Re-ranking | • answer-level n-gram correlation feature:
MBR-based Answering Re-ranking | where w denotes an n-gram in A, and #w(Ak) denotes the number of times that w occurs in
MBR-based Answering Re-ranking | • passage-level n-gram correlation feature:
Discriminative Reranking for OCR | Word LM features (“LM-word”) include the log probabilities of the hypothesis obtained using n-gram LMs with n ∈ {1, .
Discriminative Reranking for OCR | Semantic coherence feature (“SemCoh”) is motivated by the fact that semantic information can be very useful in modeling the fluency of phrases, and can augment the information provided by n-gram LMs. |
Introduction | The BBN Byblos OCR system (Natarajan et al., 2002; Prasad et al., 2008; Saleem et al., 2009), which we use in this paper, relies on a hidden Markov model (HMM) to recover the sequence of characters from the image, and uses an n-gram language model (LM) to emphasize the fluency of the output.