Introduction | Standard techniques store the observed n-grams and derive probabilities of unobserved n-grams via their longest observed suffix and “backoff” costs associated with the prefix histories of the unobserved suffixes. |
Introduction | Hence the size of the model grows with the number of observed n-grams, which is very large for typical training corpora. |
Introduction | These data structures permit efficient querying for specific n-grams in a model that has been stored in a fraction of the space required to store the full, exact model, though with some probability of false positives. |
Preliminaries | w_i in the training corpus; F is a regularized probability estimate that provides some probability mass for unobserved n-grams; and 04h?) |
Preliminaries | N-gram language models allow for a sparse representation, so that only a subset of the possible n-grams must be explicitly stored. |
Preliminaries | Probabilities for the rest of the n-grams are calculated through the “otherwise” semantics in the equation above. |
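The backoff lookup described in the lines above can be sketched as follows. This is a minimal illustration, not the excerpted paper's implementation: the function name `backoff_logprob` and the two dictionaries `logprobs` and `backoffs` are hypothetical stand-ins for a stored model, and a history with no stored backoff cost is assumed to contribute a cost of 0.0 in log space.

```python
def backoff_logprob(ngram, logprobs, backoffs):
    """Return log P(w | h) for ngram = h + (w,) under a backoff model.

    logprobs: dict mapping observed n-gram tuples to log-probabilities.
    backoffs: dict mapping history tuples to backoff costs (log-weights).
    Both tables are hypothetical; a real model would load them from disk.
    """
    cost = 0.0
    while ngram:
        if ngram in logprobs:
            # Explicitly stored n-gram: use its estimate plus accumulated costs.
            return cost + logprobs[ngram]
        # "Otherwise" case: charge the backoff cost of the current history,
        # then retry with the next-longest suffix.
        cost += backoffs.get(ngram[:-1], 0.0)
        ngram = ngram[1:]
    # Even the unigram was not stored.
    return float("-inf")
```

With `logprobs = {("sat",): -2.0, ("cat", "sat"): -1.0}` and `backoffs = {("the", "cat"): -0.5}`, the unobserved trigram `("the", "cat", "sat")` is scored through its longest observed suffix `("cat", "sat")` plus the backoff cost of the history `("the", "cat")`.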
Abstract | Post-hoc analysis shows that the additional unsimplified data provides better coverage for unseen and rare n-grams. |
Introduction | At the word level, 96% of the simple words are found in the normal corpus and even for n-grams as large as 5, more than half of the n-grams can be found in the normal text. |
Introduction | This extra information may help with data sparsity, providing better estimates for rare and unseen n-grams. |
Introduction | On the other hand, there is still only modest overlap between the sentences for longer n-grams, particularly given that the corpus is sentence-aligned and that 27% of the sentence pairs in this aligned data set are identical. |
Why Does Unsimplified Data Help? | 6.1 More n-grams |
Why Does Unsimplified Data Help? | Table 3: Proportion of n-grams in the test sets that occur in the simple and normal training data sets. |
Why Does Unsimplified Data Help? | We hypothesize that the key benefit of additional normal data is access to more n-gram counts and therefore better probability estimation, particularly for n-grams in the simple corpus that are unseen or have low frequency. |
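The coverage statistic discussed above (the proportion of test-set n-grams that also occur in a training set) can be sketched as below. This is an illustrative sketch under stated assumptions: the helper names `ngrams` and `coverage` are hypothetical, sentences are assumed to be pre-tokenized lists of strings, and the proportion is computed over n-gram occurrences rather than distinct types, which the excerpt does not specify.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def coverage(test_sents, train_sents, n):
    """Fraction of test n-gram occurrences that appear anywhere in training."""
    train = {g for sent in train_sents for g in ngrams(sent, n)}
    test = [g for sent in test_sents for g in ngrams(sent, n)]
    if not test:
        return 0.0  # no n-grams of this order in the test set
    return sum(g in train for g in test) / len(test)
```

Running `coverage` once against the simple training data and once against the normal training data, for each n, would give the kind of comparison the table caption describes.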
Model | Like most generative models for text, a post (document) is viewed as a bag of n-grams and each n-gram (word/phrase) takes one value from a predefined vocabulary. |
Model | Instead of using all n-grams, a relevance based ranking method is proposed to select a subset of highly relevant n-grams for model building (details in §4). |
Model | For notational convenience, we use terms to denote both words (unigrams) and phrases (n-grams). |
Phrase Ranking based on Relevance | We now detail our method of preprocessing n-grams (phrases) based on relevance to select a subset of highly relevant n-grams for model building. |
Phrase Ranking based on Relevance | A large number of irrelevant n-grams slow inference. |
Phrase Ranking based on Relevance | This method, however, is computationally expensive and has a limitation for arbitrary-length n-grams. |
Error Classification | Additionally, since there are numerous word n-grams, some infrequent ones may just by chance only occur in positive training set instances, causing the learner to think they indicate the positive class when they do not. |
Error Classification | For each essay, Aw+i counts the number of word n-grams we believe indicate that an essay is a positive example of e_i, and Aw-i counts the number of word n-grams we believe indicate an essay is not an example of e_i. |
Error Classification | Aw+ n-grams for the Missing Details error tend to include phrases like “there is something” or “this statement”, while Aw- n-grams are often words taken directly from an essay’s prompt. |
Evaluation | We see that the thesis clarity score predicting variation of the Baseline system, which employs as features only word n-grams and random indexing features, predicts the wrong score 65.8% of the time. |
Eliciting Addressee’s Emotion | We have excluded n-grams that matched the emotional expressions used in Section 2 to avoid overfitting. |
Predicting Addressee’s Emotion | We extract all the n-grams (n ≤ 3) in the response to induce (binary) n-gram features. |
Predicting Addressee’s Emotion | The extracted n-grams could indicate a certain action that elicits a specific emotion (e.g., ‘have a fever’ in Table 2), or a style or tone of speaking (e.g., ‘Sorry’). |
Predicting Addressee’s Emotion | Likewise, we extract word n-grams from the addressee’s utterance. |
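The binary n-gram feature extraction described above can be sketched as follows. This is only an illustration: the function name `binary_ngram_features` is hypothetical, the utterance is assumed to be tokenized already, and each feature is represented simply by the presence of its n-gram string in a set.

```python
def binary_ngram_features(tokens, max_n=3):
    """Set of all word n-grams with n <= max_n; membership is the binary feature."""
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats
```

For the tokens `["have", "a", "fever"]` this yields the three unigrams, the bigrams "have a" and "a fever", and the trigram "have a fever". The same function could be applied separately to the response and to the addressee's utterance so that the two feature sets stay distinct.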
Causal Relations for Why-QA | Table 4: Causal relation features: n in n-grams is n = {2, 3} and n-grams in an effect part are distinguished from those in a cause part. |
Causal Relations for Why-QA | The n-grams of tf1 and tf2 are restricted to those containing at least one content word in a question. |
Causal Relations for Why-QA | For example, word 3-gram “this/cause/QW” is extracted from This causes tsunamis in A2 for “Why is a tsunami generated?” Further, we create a word class version of word n-grams by converting the words in these word n-grams into their corresponding word class using the semantic word classes (500 classes for 5.5 million nouns) from our previous work (Oh et al., 2012). |
Related Work | These previous studies took basically bag-of-words approaches and used the semantic knowledge to identify certain semantic associations using terms and n-grams. |
System Architecture | employed three types of features for training the re-ranker: morphosyntactic features (n-grams of morphemes and syntactic dependency chains), semantic word class features (semantic word classes obtained by automatic word clustering (Kazama and Torisawa, 2008)) and sentiment polarity features (word and phrase polarities). |
Experiments | We removed n-grams that appeared fewer than five times in each subcorpus in the language models. |
Implications for Work in Related Domains | The experimental results show that n-grams containing articles are predictive for identifying native languages. |
Implications for Work in Related Domains | Importantly, all n-grams containing articles should be used in the classifier unlike the previous methods that are based only on n-grams containing article errors. |
Implications for Work in Related Domains | Besides, no articles should be explicitly coded in n-grams for taking the overuse/underuse of articles into consideration. |
Methods | In this language model, content words in n-grams are replaced with their corresponding POS tags. |
Divergent (Re)Categorization | To tap into a richer source of concept properties than WordNet’s glosses, we can use web n-grams. |
Divergent (Re)Categorization | Consider these descriptions of a cowboy from the Google n-grams (Brants & Franz, 2006). |
Divergent (Re)Categorization | So for each property P suggested by Google n-grams for a lexical concept C, we generate a like-simile for verbal behaviors such as swaggering and an as-as-simile for adjectives such as lonesome. |
Summary and Conclusions | Using the Google n-grams as a source of tacit grouping constructions, we have created a comprehensive lookup table that provides Rex similarity scores for the most common (if often implicit) comparisons. |
Problem Report and Aid Message Recognizers | MSA1 Morpheme n-grams, syntactic dependency n-grams in the tweet and morpheme n-grams before and after the nucleus template. |
Problem Report and Aid Message Recognizers | MSA2 Character n-grams of the nucleus template to capture conjugation and modality variations. |
Problem Report and Aid Message Recognizers | MSA3 Morpheme and part-of-speech n-grams within the bunsetsu containing the nucleus template to capture conjugation and modality variations. |
Conclusion | However, OOVs can be considered as n-grams (phrases) instead of unigrams. |
Experiments & Results 4.1 Experimental Setup | We did not use trigrams or larger n-grams in our experiments. |
Graph-based Lexicon Induction | However, constructing such a graph and performing graph propagation on it is computationally very expensive for large n-grams. |
Graph-based Lexicon Induction | These phrases are n-grams up to a certain value, which can result in millions of nodes. |
Experiments | Table 1: Question classification precision for both levels of the hierarchy (features = word n-grams, classifier = libsvm) |
Experiments | Using word n-grams, monolingual English classification obtains .798 correct classification for the fine-grained classes, and .90 for the coarse-grained classes, results which are very close to those obtained by Zhang and Lee (2003). |
Experiments | Table 2: Question classification precision for both levels of the hierarchy (features = word n-grams with abbreviations, classifier = libsvm) |
Evaluation | n-grams represents a simple 5-gram baseline that is similar to Oh and Rudnicky (2000)’s system. |
Evaluation | CRF global: 3.65, 3.64, 3.65; CRF local: 3.10*, 3.19*, 3.13*; CLASSiC: 3.53*, 3.59, 3.48*; n-grams: 3.01*, 3.09*, 3.32* |
Evaluation | This difference is significant for all categories compared with CRF (local) and n-grams (using a 1-sided Mann Whitney U-test, p < 0.001). |
Introduction | In addition, we compare our system with alternative surface realisation methods from the literature, namely, a rank and boost approach and n-grams . |
Experiment 1: Textual Similarity | Character n-grams, which were also used as one of our additional features. |
Experiment 1: Textual Similarity | Another interesting point is the high scores achieved by the Character n-grams |
Experiment 1: Textual Similarity | Datasets Mpar, Mvid, SMTe: DW 0.448, 0.820, 0.660; ADW-MF 0.485, 0.842, 0.721; Explicit Semantic Analysis 0.427, 0.781, 0.619; Pairwise Word Similarity 0.564, 0.835, 0.527; Distributional Thesaurus 0.494, 0.481, 0.365; Character n-grams 0.658, 0.771, 0.554 |
Introduction | In many natural language systems, single words and n-grams are usefully described by their distributional similarities (Brown et al., 1992), among many others. |
Introduction | n-grams will never be seen during training, especially when n is large. |
Introduction | In this work, we present a new solution to learn features and phrase representations even for very long, unseen n-grams. |
Conclusions | In particular, we plan to extend the use of n-grams to larger contexts and consider more fine-grained tuning of other constraints, too. |
Lexical Constraints for Humorous Word Substitution | Implementation Local coherence is implemented using n-grams . |
Lexical Constraints for Humorous Word Substitution | To estimate the level of expectation triggered by a left-context, we rely on a vast collection of n-grams, the 2012 Google Books n-grams collection (Michel et al., 2011), and compute the cohesion of each n-gram by comparing its expected frequency (assuming word independence) to its observed number of occurrences. |
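One way to compare an n-gram's observed count with its expected count under word independence is a log ratio, sketched below. The excerpt does not say how the comparison is made, so the log-ratio form, the function name `cohesion`, and its parameters are all assumptions for illustration; all counts are assumed to be positive.

```python
import math


def cohesion(ngram, ngram_count, unigram_counts, total_tokens):
    """Log ratio of observed to expected n-gram count (assumed scoring form).

    The expected count assumes the words occur independently:
    total_tokens * product over words of (count(word) / total_tokens).
    """
    expected = total_tokens
    for word in ngram:
        expected *= unigram_counts[word] / total_tokens
    return math.log(ngram_count / expected)
```

Under this sketch a score near zero means the words co-occur about as often as chance predicts, and a large positive score marks a strongly cohesive n-gram.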
Perplexity Evaluation | In this experiment, we assessed the effectiveness of the TD and TO components in reducing the n-gram’s perplexity. |
Perplexity Evaluation | Due to the incapability of n-grams to model long history-contexts, the TD and TO components are still effective in helping to enhance the prediction. |
Related Work | There are also works on skipping irrelevant history-words in order to reveal more informative n-grams (Siu & Ostendorf 2000, Guthrie et al. |
Experiments | In addition, we extracted web n-grams and entity lists (see §3) from movie related web sites, and online blogs and reviews. |
Experiments | We extract prior distributions for entities and n-grams to calculate entity list η and word-tag β priors (see §3.1). |
Markov Topic Regression - MTR | We built a language model using SRILM (Stolcke, 2002) on the domain specific sources such as top wiki pages and blogs on online movie reviews, etc., to obtain the probabilities of domain-specific n-grams, up to 3-grams. |