Error Classification | Additionally, since there are numerous word n-grams , some infrequent ones may just by chance only occur in positive training set instances, causing the learner to think they indicate the positive class when they do not. |
Error Classification | For each essay, Aw+_i counts the number of word n-grams we believe indicate that an essay is a positive example of error e_i, and Aw−_i counts the number of word n-grams we believe indicate an essay is not an example of e_i. |
Error Classification | Aw+ n-grams for the Missing Details error tend to include phrases like “there is something” or “this statement”, while Aw− n-grams are often words taken directly from an essay’s prompt.
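The Aw+/Aw− counts described above can be sketched as follows (a minimal illustration; the function name and the indicative word lists are hypothetical, not the authors' code):

```python
def ngrams(tokens, n):
    """All word n-grams of order n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_indicative(essay_tokens, positive_set, negative_set, max_n=3):
    """Count n-grams believed to indicate (Aw+) or contra-indicate (Aw-)
    a given error class, over orders 1..max_n."""
    aw_pos = aw_neg = 0
    for n in range(1, max_n + 1):
        for g in ngrams(essay_tokens, n):
            if g in positive_set:
                aw_pos += 1
            elif g in negative_set:
                aw_neg += 1
    return aw_pos, aw_neg
```

The two counts then serve directly as per-essay feature values.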
Evaluation | We see that the thesis clarity score-predicting variation of the Baseline system, which employs as features only word n-grams and random indexing features, predicts the wrong score 65.8% of the time.
Model | Like most generative models for text, a post (document) is viewed as a bag of n-grams and each n-gram (word/phrase) takes one value from a predefined vocabulary. |
Model | Instead of using all n-grams, a relevance based ranking method is proposed to select a subset of highly relevant n-grams for model building (details in §4). |
Model | For notational convenience, we use terms to denote both words (unigrams) and phrases ( n-grams ). |
Phrase Ranking based on Relevance | We now detail our method of preprocessing n-grams (phrases) based on relevance to select a subset of highly relevant n-grams for model building. |
Phrase Ranking based on Relevance | A large number of irrelevant n-grams slow inference. |
Phrase Ranking based on Relevance | This method, however, is computationally expensive and is limited when applied to arbitrary-length n-grams.
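The paper's own relevance measure is detailed in its §4; as a stand-in, the selection step can be sketched with a simple log frequency-ratio score against a background collection (the smoothing constant here is an arbitrary assumption):

```python
import math

def rank_ngrams_by_relevance(ngram_counts_domain, ngram_counts_background, top_k):
    """Rank candidate n-grams by a simple relevance score: the log ratio of
    their relative frequency in the target collection vs. a background
    collection, keeping only the top_k most relevant."""
    total_d = sum(ngram_counts_domain.values())
    total_b = sum(ngram_counts_background.values())

    def score(g):
        p_d = ngram_counts_domain[g] / total_d
        # add-one smoothing so unseen background n-grams do not divide by zero
        p_b = (ngram_counts_background.get(g, 0) + 1) / (total_b + len(ngram_counts_background) + 1)
        return math.log(p_d / p_b)

    return sorted(ngram_counts_domain, key=score, reverse=True)[:top_k]
```

Only the selected subset is then used for model building, which keeps inference fast.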
Abstract | Post-hoc analysis shows that the additional unsimplified data provides better coverage for unseen and rare n-grams . |
Introduction | At the word level, 96% of the simple words are found in the normal corpus, and even for n-grams as large as 5, more than half of the n-grams can be found in the normal text.
Introduction | This extra information may help with data sparsity, providing better estimates for rare and unseen n-grams . |
Introduction | On the other hand, there is still only modest overlap between the sentences for longer n-grams , particularly given that the corpus is sentence-aligned and that 27% of the sentence pairs in this aligned data set are identical. |
Why Does Unsimplified Data Help? | 6.1 More n-grams |
Why Does Unsimplified Data Help? | Table 3: Proportion of n-grams in the test sets that occur in the simple and normal training data sets. |
Why Does Unsimplified Data Help? | We hypothesize that the key benefit of additional normal data is access to more n-gram counts and therefore better probability estimation, particularly for n-grams in the simple corpus that are unseen or have low frequency. |
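The coverage proportions referred to in Table 3 can be computed with a short routine like the following (an illustrative sketch, not the authors' evaluation code):

```python
def ngram_coverage(test_sents, train_sents, n):
    """Proportion of n-gram tokens in the test sentences that were seen
    at least once in the training sentences."""
    def grams(sents):
        out = []
        for s in sents:
            toks = s.split()
            out += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        return out

    seen = set(grams(train_sents))
    test = grams(test_sents)
    if not test:
        return 0.0
    return sum(g in seen for g in test) / len(test)
```

Running this for n = 1..5 against both the simple and the normal training sets reproduces the kind of comparison the table reports.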
Abstract | We propose a succinct randomized language model which employs a perfect hash function to encode fingerprints of n-grams and their associated probabilities, backoff weights, or other parameters.
Introduction | Parameters that are stored in the model are retrieved without error; however, false positives may occur whereby n-grams not in the model are incorrectly ‘found’ when requested. |
Introduction | We encode fingerprints (random hashes) of n-grams together with their associated probabilities using a perfect hash function generated at random (Majewski et al., 1996). |
Perfect Hash-based Language Models | We assume the n-grams and their associated parameter values have been precomputed and stored on disk. |
Perfect Hash-based Language Models | We then encode the model in an array such that each n-gram’s value can be retrieved. |
Perfect Hash-based Language Models | The model uses randomization to map n-grams to fingerprints and to generate a perfect hash function that associates n-grams with their values. |
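The behavior described above, where stored n-grams are retrieved exactly but unseen n-grams may be falsely 'found', can be sketched as follows. This is illustrative only: a Python dict keyed by fingerprint stands in for the space-efficient perfect hash function the paper actually generates.

```python
import hashlib

class FingerprintLM:
    """Sketch of a fingerprint-based store: each n-gram is reduced to a
    small random fingerprint, and only (fingerprint -> value) is kept.
    Stored n-grams are retrieved without error; an unseen n-gram may
    collide with a stored fingerprint and yield a false positive."""

    def __init__(self, bits=16):
        self.bits = bits
        self.table = {}

    def _fp(self, ngram):
        h = hashlib.md5(" ".join(ngram).encode()).digest()
        return int.from_bytes(h[:4], "big") % (1 << self.bits)

    def put(self, ngram, value):
        self.table[self._fp(ngram)] = value

    def get(self, ngram):
        return self.table.get(self._fp(ngram))  # may be a false positive
```

Shrinking `bits` trades space against a higher false-positive rate, which is exactly the trade-off the randomized model exposes.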
Scaling Language Models | In language modeling, the universe under consideration is the set of all possible n-grams of length n for a given vocabulary.
Scaling Language Models | Although n-grams observed in natural language corpora are not randomly distributed within this universe, no lossless data structure that we are aware of can circumvent this space dependency on both the n-gram order and the vocabulary size.
Scaling Language Models | However, if we are willing to accept that occasionally our model will be unable to distinguish between distinct n-grams , then it is possible to store |
Introduction | Standard techniques store the observed n-grams and derive probabilities of unobserved n-grams via their longest observed suffix and “backoff” costs associated with the prefix histories of the unobserved suffixes. |
Introduction | Hence the size of the model grows with the number of observed n-grams , which is very large for typical training corpora. |
Introduction | These data structures permit efficient querying for specific n-grams in a model that has been stored in a fraction of the space required to store the full, exact model, though with some probability of false positives. |
Preliminaries | w_i in the training corpus; F is a regularized probability estimate that provides some probability mass for unobserved n-grams; and α(h)
Preliminaries | N-gram language models allow for a sparse representation, so that only a subset of the possible n-grams must be explicitly stored. |
Preliminaries | Probabilities for the rest of the n-grams are calculated through the “otherwise” semantics in the equation above. |
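The "otherwise" backoff semantics can be sketched as a recursive lookup (a minimal illustration over toy dictionaries, not a full ARPA-style implementation):

```python
def backoff_logprob(ngram, logprobs, backoffs):
    """Standard backoff semantics: if the full n-gram is stored, return its
    log-probability; otherwise back off to the shortened n-gram, adding the
    stored backoff weight of the dropped history (0 if that history is
    itself unstored)."""
    if ngram in logprobs:
        return logprobs[ngram]
    if len(ngram) == 1:
        return float("-inf")  # out-of-vocabulary word
    history, rest = ngram[:-1], ngram[1:]
    return backoffs.get(history, 0.0) + backoff_logprob(rest, logprobs, backoffs)
```

Only observed n-grams and backoff weights need explicit storage; everything else is derived by this recursion.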
Discussion and Future Work | In the current formulation of TESLA-CELAB, two n-grams X and Y are either synonyms which completely match each other, or are completely unrelated. |
Experiments | Compared to BLEU, TESLA allows more sophisticated weighting of n-grams and measures of word similarity including synonym relations. |
Experiments | The covered n-gram matching rule is then able to award tricky n-grams such as TE, Ti, /1\ [E], 1/13 [IE5 and i9}. |
Motivation | For example, between ¥_l—?fi_$ and ¥_5l?, higher-order n-grams such as and still have no match, and will be penalized accordingly, even though ¥_l—?fi_5lk and ¥_5l? |
Motivation | N-grams such as which cross natural word boundaries and are meaningless by themselves can be particularly tricky. |
The Algorithm | Two n-grams are connected if they are identical, or if they are identified as synonyms by Cilin. |
The Algorithm | Notice that all n-grams are put in the same matching problem regardless of n, unlike in translation evaluation metrics designed for European languages. |
The Algorithm | This enables us to designate n-grams with different values of n as synonyms, such as (n = 2) and 5!k (n = 1). |
Abstract | Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. |
Introduction | The largest language models (LMs) can contain as many as several hundred billion n-grams (Brants et al., 2007), so storage is a challenge. |
Introduction | Overall, we are able to store the 4 billion n-grams of the Google Web 1T (Brants and Franz, 2006) corpus
Language Model Implementations | Lookup is linear in the length of the key and logarithmic in the number of n-grams . |
Preliminaries | This corpus, which is on the large end of corpora typically employed in language modeling, is a collection of nearly 4 billion n-grams extracted from over a trillion tokens of English text, and has a vocabulary of about 13.5 million words. |
Preliminaries | Tries represent collections of n-grams using a tree. |
Preliminaries | Each node in the tree encodes a word, and paths in the tree correspond to n-grams in the collection. |
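A minimal trie of this kind, with one word per node and values on terminal nodes, looks like the following sketch (prefix sharing is what gives the representation its compactness):

```python
class NgramTrie:
    """Each node encodes a word; a path from the root spells an n-gram.
    A value (e.g. a count or log-probability) sits on the terminal node."""

    def __init__(self):
        self.children = {}
        self.value = None

    def insert(self, ngram, value):
        node = self
        for w in ngram:
            node = node.children.setdefault(w, NgramTrie())
        node.value = value

    def lookup(self, ngram):
        node = self
        for w in ngram:
            node = node.children.get(w)
            if node is None:
                return None
        return node.value
```

N-grams sharing a prefix, such as ("the", "cat") and ("the", "dog"), share the node for "the", so the tree stores each common history only once.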
Computing Feature Expectations | Unigrams are produced by lexical rules, while higher-order n-grams can be produced either directly by lexical rules, or by combining constituents.
Computing Feature Expectations | The n-gram language model score of e similarly decomposes over the h in e that produce n-grams . |
Computing Feature Expectations | The linear similarity measure takes the following form, where Tn is the set of n-grams: |
Experimental Results | Above, we compare the precision, relative to reference translations, of sets of n-grams chosen in two ways. |
Experimental Results | The left bar is the precision of the n-grams in e*.
Experimental Results | The right bar is the precision of n-grams with E[c(e, t)] > ρ. To justify this comparison, we chose ρ so that both methods of choosing n-grams gave the same n-gram recall: the fraction of n-grams in reference translations that also appeared in e* or had E[c(e, t)] > ρ.
Abstract | This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). |
Abstract | In this work we explore the suitability of LHs over n-grams at the character-level for AA. |
Abstract | We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state of the art approaches. |
Introduction | In particular, we consider local histograms over n-grams at the character-level obtained via the locally-weighted bag of words (LOWBOW) framework (Lebanon et al., 2007). |
Introduction | Results confirm that local histograms of character n-grams are more helpful for AA than the usual global histograms of words or character n-grams (Luyckx and Daelemans, 2010); our results are superior to those reported in related works. |
Introduction | We also show that local histograms over character n-grams are more helpful than local histograms over words, as originally proposed by (Lebanon et al., 2007). |
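The contrast between global and local histograms can be sketched as follows. This is a simplification: LOWBOW uses smooth positional kernels, whereas the sketch below uses hard document segments.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def local_histograms(text, n=3, segments=4):
    """Return the usual global histogram of character n-grams together
    with one histogram per document segment, preserving where in the
    document each n-gram occurs (a crude stand-in for LOWBOW's
    kernel-smoothed local histograms)."""
    step = max(1, len(text) // segments)
    locals_ = [Counter(char_ngrams(text[i:i + step], n))
               for i in range(0, len(text), step)]
    return Counter(char_ngrams(text, n)), locals_
```

Concatenating the per-segment histograms yields a feature vector that, unlike the global histogram, is sensitive to an author's positional habits.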
Related Work | Some researchers have gone a step further and have attempted to capture sequential information by using n-grams at the word-level (Peng et al., 2004) or by discovering maximal frequent word sequences (Coyotl-Morales et al., 2006). |
Related Work | Unfortunately, because of computational limitations, the latter methods cannot discover enough sequential information from documents (e.g., word n-grams are often restricted to n ∈ {1, 2, 3}, while full sequential information would be obtained with n ∈ {1, .
Related Work | Stamatatos and coworkers have studied the impact of feature selection, with character n-grams, in AA (Houvardas and Stamatatos, 2006; Stamatatos, 2006a), ensemble learning with character n-grams (Stamatatos, 2006b) and novel classification techniques based |
Abstract | Our approach first identifies statistically salient phrases of words and parts of speech — known as n-grams — in training texts generated in conditions where the social power |
Abstract | Then, we apply machine learning to train classifiers with groups of these n-grams as features. |
Abstract | Unlike their works, our text classification techniques take into account the frequency of occurrence of word n-grams and part-of-speech (POS) tag sequences, and other measures of statistical salience in training data. |
Introduction | Instead, we consider how to efficiently mine the Google n-grams corpus. |
Introduction | This work uses a large web-scale corpus (Google n-grams ) to compute features for the full parsing task. |
Web-count Features | The approach of Lauer (1995), for example, would be to take an ambiguous noun sequence like hydrogen ion exchange and compare the various counts (or associated conditional probabilities) of n-grams like hydrogen ion and hydrogen exchange. |
Web-count Features | (2010) use web-scale n-grams to compute similar association statistics for longer sequences of nouns. |
Web-count Features | These paraphrase features hint at the correct attachment decision by looking for web n-grams with special contexts that reveal syntax superficially. |
Working with Web n-Grams | Rather than working through a search API (or scraper), we use an offline web corpus — the Google n-gram corpus (Brants and Franz, 2006) — which contains English n-grams (n = 1 to 5) and their observed frequency counts, generated from nearly 1 trillion word tokens and 95 billion sentences.
Working with Web n-Grams | In particular, we only use counts of n-grams of the form x ⋆ y, where the gap length is ≤ 3.
Working with Web n-Grams | Next, we exploit a simple but efficient trie-based hashing algorithm to efficiently answer all of them in one pass over the n-grams corpus. |
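The batch strategy of answering all queries in a single pass over the corpus can be sketched as below; a plain set stands in for the trie-based hashing, and the tab-separated `<ngram>\t<count>` line format is an assumption about how the corpus is stored.

```python
def count_queries_one_pass(queries, corpus_lines):
    """Answer many n-gram count queries in one streaming pass over a huge
    '<ngram>\t<count>' corpus, instead of one corpus lookup per query."""
    wanted = set(queries)
    counts = {q: 0 for q in wanted}
    for line in corpus_lines:          # stream; nothing else is kept in memory
        ngram, _, cnt = line.rpartition("\t")
        if ngram in wanted:
            counts[ngram] += int(cnt)
    return counts
```

Collecting every query up front amortizes the single expensive scan over all of them, which is what makes web-scale counts practical offline.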
Background | These generally consist of a projection layer that maps words, sub-word units or n-grams to high dimensional embeddings; the latter are then combined component-wise with an operation such as summation. |
Background | The trained weights in the filter m correspond to a linguistic feature detector that learns to recognise a specific class of n-grams . |
Background | These n-grams have size n ≤ m, where m is the width of the filter.
Experiments | tures based on long n-grams and to hierarchically combine these features is highly beneficial. |
Experiments | In the first layer, the sequence is a continuous n-gram from the input sentence; in higher layers, sequences can be made of multiple separate n-grams . |
Experiments | The feature detectors learn to recognise not just single n-grams, but patterns within n-grams that have syntactic, semantic or structural significance. |
Introduction | Since individual sentences are rarely observed or not observed at all, one must represent a sentence in terms of features that depend on the words and short n-grams in the sentence that are frequently observed. |
Introduction | by which the features of the sentence are extracted from the features of the words or n-grams . |
Properties of the Sentence Model | The filters m of the wide convolution in the first layer can learn to recognise specific n-grams that have size less or equal to the filter width m; as we see in the experiments, m in the first layer is often set to a relatively large value |
Properties of the Sentence Model | The subsequence of n-grams extracted by the generalised pooling operation induces in-variance to absolute positions, but maintains their order and relative positions. |
Properties of the Sentence Model | This gives the RNN excellent performance at language modelling, but it is suboptimal for remembering at once the n-grams further back in the input sentence. |
Abstract | This general framework allows us to use arbitrary similarity functions between items, and to incorporate different information in our comparison, such as n-grams , dependency relations, etc. |
Introduction | In this paper, we propose a new automatic MT evaluation metric, MAXSIM, that compares a pair of system-reference sentences by extracting n-grams and dependency relations. |
Introduction | Recognizing that different concepts can be expressed in a variety of ways, we allow matching across synonyms and also compute a score between two matching items (such as between two n-grams or between two dependency relations), which indicates their degree of similarity with each other. |
Metric Design Considerations | We note, however, that matches between items (such as words, n-grams , etc.) |
Metric Design Considerations | In this subsection, we describe in detail how we match the n-grams of a system-reference sentence pair. |
Metric Design Considerations | Lemma match For the remaining set of n-grams that are not yet matched, we now relax our matching criteria by allowing a match if their corresponding lemmas match. |
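The staged relaxation, exact matches first and lemma matches for whatever remains, can be sketched as follows (the `lemma` function is a hypothetical lemmatizer passed in by the caller; this is not the metric's actual code):

```python
def match_ngrams(sys_ngrams, ref_ngrams, lemma):
    """Two-stage one-to-one matching: exact n-gram matches first, then a
    relaxed pass over the still-unmatched n-grams comparing lemma
    sequences. Returns the total number of matches."""
    ref_left = list(ref_ngrams)
    matched = 0
    for stage in (lambda g: g, lambda g: tuple(lemma(w) for w in g)):
        still = []
        for g in sys_ngrams:
            key = stage(g)
            hit = next((r for r in ref_left if stage(r) == key), None)
            if hit is not None:
                ref_left.remove(hit)   # each reference n-gram matches once
                matched += 1
            else:
                still.append(g)
        sys_ngrams = still
    return matched
```

Ordering the stages from strict to relaxed ensures an exact match is never stolen by a looser one.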
Experimental Settings | nw N-grams: Norm-word 1/2/3-gram probabilities; lem N-grams: Lemma 1/2/3-gram probabilities; pos N-grams: POS 1/2/3-gram probabilities
Experimental Settings | served for word N-grams which did not appear in the models). |
Experimental Settings | Like the N-grams , this number is binned; in this case there are 11 bins, with 10 spread evenly over the [0,1) range, and an extra bin for values of exactly 1 (i.e., when the word appears in every hypothesis in the set). |
Results | First, Table 4 shows the effect of adding nw N-grams of successively higher orders to the word baseline. |
Results | word+nw 1-gram 49.51 12.9; word+nw 1-gram+nw 2-gram 59.26 35.2; word+nw N-grams 59.33 35.3; +pos 58.50 33.4; +pos N-grams 57.35 30.8; +lem+lem N-grams 59.63 36.0; +lem+lem N-grams+na 59.93 36.7; +lem+lem N-grams+na+nw 59.77 36.3; +lem 60.92 38.9; +lem+na 60.47 37.9; +lem+lem N-grams 60.44 37.9
Results | Here, the best performer is the model which utilizes the word, nw N-grams, |
Abstract | Our model incorporates heterogeneous relational evidence about both hypernymy and siblinghood, captured by semantic features based on patterns and statistics from Web n-grams and Wikipedia abstracts. |
Analysis | Here, our Web n-grams dataset (which only contains frequent n-grams) and Wikipedia abstracts do not suffice and we would need to add richer Web data for such world knowledge to be reflected in the features.
Experiments | Feature sources: The n-gram semantic features are extracted from the Google n-grams corpus (Brants and Franz, 2006), a large collection of English n-grams (for n = 1 to 5) and their frequencies computed from almost 1 trillion tokens (95 billion sentences) of Web text. |
Experiments | For this, we use a hash-trie on term pairs (similar to that of Bansal and Klein (2011)), and scan once through the n-gram (or abstract) set, skipping many n-grams (or abstracts) based on fast checks of missing unigrams, exceeding length, suffix mismatches, etc.
Features | The Web n-grams corpus has broad coverage but is limited to up to 5-grams, so it may not contain pattern-based evidence for various longer multi-word terms and pairs.
Features | Similar to the Web n-grams case, we also fire Wikipedia-based pattern order features. |
Introduction | The belief propagation approach allows us to efficiently and effectively incorporate heterogeneous relational evidence via hypernymy and siblinghood (e.g., coordination) cues, which we capture by semantic features based on simple surface patterns and statistics from Web n-grams and Wikipedia abstracts. |
Related Work | 8All the patterns and counts for our Web and Wikipedia edge and sibling features described above are extracted after stemming the words in the terms, the n-grams , and the abstracts (using the Porter stemmer). |
Conclusion | number of parameters (unique N-grams).
Experiments | Figure: x-axis shows the number of unique N-grams (1e4 to 1e9).
Introduction | Another is a web-scale N-gram corpus, an N-gram corpus with N-grams of length 1-5 (Brants and Franz, 2006); we call it Google V1 in this paper.
Related Work | The former uses the web-scale data explicitly to create more data for training the model; while the latter explores the web-scale N-grams data (Lin et al., 2010) for compound bracketing disambiguation. |
Related Work | However, we explore the web-scale data for dependency parsing; the performance improves log-linearly with the number of parameters (unique N-grams).
Web-Derived Selectional Preference Features | N-grams appearing 40 |
Web-Derived Selectional Preference Features | In this paper, the selectional preferences have the same meaning as N-grams, which model word-to-word relationships, rather than only considering predicate-argument relationships.
Web-Derived Selectional Preference Features | All n-grams with lower counts are discarded. |
Extensions of SemPOS | Surprisingly, BLEU-2 performed better than any other n-gram order, for reasons that have yet to be examined.
Problems of BLEU | Total n-grams 35,531 33,891 32,251 30,611 |
Problems of BLEU | Table 1: n-grams confirmed by the reference and containing error flags. |
Problems of BLEU | The suspicious cases are n-grams confirmed by the reference but still containing a flag (false positives) and n-grams not confirmed despite containing no error flag (false negatives). |
Abstract | The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
Conclusion | As far as we know, this is the first work of building a complex large scale distributed language model with a principled approach that is more powerful than n-grams when both trained on a very large corpus with up to a billion tokens. |
Experimental results | The composite n-gram/m-SLM/PLSA model gives significant perplexity reductions over baseline n-grams, n = 3, 4, 5, and m-SLMs, m = 2, 3, 4.
Experimental results | Also, in the same study in (Charniak, 2003), they found that the outputs produced using the n-grams received higher scores from BLEU; ours did not.
Introduction | We conduct comprehensive experiments on corpora with 44 million tokens, 230 million tokens, and 1.3 billion tokens, and compare perplexity results with n-grams (n = 3, 4, 5 respectively) on these three corpora; we obtain drastic perplexity reductions.
Training algorithm | where #(g, W_l, G_l, d) is the count of semantic content g in the semantic annotation string G_l of the l-th sentence W_l in document d, #(w_{i−n+1}^{i} h_{−m}^{−1} g, W_l, T_l, G_l, d) is the count of an n-gram together with its m most recent exposed headwords and semantic content g in the parse T_l and semantic annotation string G_l of the l-th sentence W_l in document d, and #(t h_tag, W_l, T_l, d) is the count
Training algorithm | The topic of large scale distributed language models is relatively new, and existing works are restricted to n-grams only (Brants et al., 2007; Emami et al., 2007; Zhang et al., 2006). |
Eliciting Addressee’s Emotion | 7We have excluded n-grams that matched the emotional expressions used in Section 2 to avoid overfitting. |
Predicting Addressee’s Emotion | We extract all the n-grams (n ≤ 3) in the response to induce (binary) n-gram features.
Predicting Addressee’s Emotion | The extracted n-grams could indicate a certain action that elicits a specific emotion (e.g., ‘have a fever’ in Table 2), or a style or tone of speaking (e.g., ‘Sorry’).
Predicting Addressee’s Emotion | Likewise, we extract word n-grams from the addressee’s utterance. |
Data 5.1 Development Data | For disambiguation with n-grams (see §3.3), we made use of the WEB 1T 5-GRAM corpus.
Data 5.1 Development Data | Prepared by Google Inc., it contains English n-grams , up to 5-grams, with their observed frequency counts from a large number of web pages. |
Experiments | Table 6: The n-grams used for filtering, with examples of sentences which they are intended to differentiate. |
Experiments | 6.2 Disambiguation with N-grams |
Experiments | For those categories with a high rate of false positives (all except BASEmd, BASEdO and FINITE), we utilized n-grams as filters, allowing a correction only when its n-gram count in the WEB 1T 5-GRAM |
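A count-based filter of this kind can be sketched as follows. The margin parameter and the string-keyed count table are assumptions for illustration; the actual thresholds come from the WEB 1T 5-GRAM counts and development data.

```python
def accept_correction(original, corrected, ngram_counts, margin=1.0):
    """Allow a proposed correction only when its n-gram is attested in a
    large web count table (a stand-in for the Web 1T 5-gram corpus) and
    is at least `margin` times as frequent as the original n-gram."""
    c_new = ngram_counts.get(corrected, 0)
    c_old = ngram_counts.get(original, 0)
    return c_new > 0 and c_new >= margin * c_old
```

Raising `margin` trades recall for precision, which is the usual way to tame categories with many false positives.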
Collaborative Decoding | Here we do not discriminate among different lexical n-grams and are only concerned with statistics aggregation of all n-grams of the same order. |
Collaborative Decoding | Gn+(e, e′) is the n-gram agreement measure function, which counts the number of occurrences in e′ of n-grams in e.
Collaborative Decoding | So the corresponding feature value will be the expected number of occurrences in H_k(f) of all n-grams in e:
Discussion | Our method uses agreement information of n-grams , and consensus features are integrated into decoding models. |
Experiments | In Table 5 we show in another dimension the impact of consensus-based features by restricting the maximum order of n-grams used to compute agreement statistics. |
Experiments | One reason could be that the data sparsity for high-order n-grams leads to overfitting on development data.
MERT for MBR Parameter Optimization | This linear function contains N + 1 parameters θ_0, θ_1, ..., θ_N, where N is the maximum order of the n-grams involved.
Minimum Bayes-Risk Decoding | First, the set of n-grams is extracted from the lattice. |
Minimum Bayes-Risk Decoding | For a moderately large lattice, there can be several thousands of n-grams and the procedure becomes expensive. |
Minimum Bayes-Risk Decoding | For each node t in the lattice, we maintain a quantity Score(w, t) for each n-gram w that lies on a path from the source node to t. Score(w, t) is the highest posterior probability among all edges on the paths that terminate on t and contain n-gram w. The forward pass requires computing the n-grams introduced by each edge; to do this, we propagate n-grams (up to maximum order − 1) terminating on each node.
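The idea of scoring hypotheses by posterior-expected n-gram counts can be sketched over an N-best list (the paper operates on lattices; an N-best list is a simplification, and the posteriors below are assumed given):

```python
from collections import Counter

def ngram_counts(tokens, max_n=4):
    """Counts of all n-grams up to order max_n."""
    out = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out[tuple(tokens[i:i + n])] += 1
    return out

def mbr_decode(nbest):
    """MBR over (hypothesis, posterior) pairs: compute posterior-expected
    n-gram counts, then pick the hypothesis whose own n-grams accumulate
    the most expected count."""
    expected = Counter()
    for hyp, post in nbest:
        for g, c in ngram_counts(hyp.split()).items():
            expected[g] += post * c

    def gain(hyp):
        return sum(expected[g] * c for g, c in ngram_counts(hyp.split()).items())

    return max(nbest, key=lambda hp: gain(hp[0]))[0]
```

The expensive part at lattice scale is exactly the n-gram extraction this sketch does naively, which is what the forward-pass propagation above avoids.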
Causal Relations for Why-QA | Table 4: Causal relation features: n in n-grams is n ∈ {2, 3}, and n-grams in an effect part are distinguished from those in a cause part.
Causal Relations for Why-QA | The n-grams of tf_1 and tf_2 are restricted to those containing at least one content word in a question.
Causal Relations for Why-QA | For example, word 3-gram “this/cause/QW” is extracted from This causes tsunamis in A2 for “Why is a tsunami generated?” Further, we create a word class version of word n-grams by converting the words in these word n-grams into their corresponding word class using the semantic word classes (500 classes for 5.5 million nouns) from our previous work (Oh et al., 2012). |
Related Work | These previous studies took basically bag-of-words approaches and used the semantic knowledge to identify certain semantic associations using terms and n-grams . |
System Architecture | employed three types of features for training the re-ranker: morphosyntactic features ( n-grams of morphemes and syntactic dependency chains), semantic word class features (semantic word classes obtained by automatic word clustering (Kazama and Torisawa, 2008)) and sentiment polarity features (word and phrase polarities). |
Data and Approach Overview | We represent each utterance u as a vector w_u of N_u word n-grams (segments), w_uj, each of which is chosen from a vocabulary W of fixed size V. We use entity lists obtained from web sources (explained next) to identify segments in the corpus.
Data and Approach Overview | Web n-Grams (G). |
Experiments | Our vocabulary consists of n-grams and segments (phrases) in utterances that are extracted using web n-grams and entity lists of §3.
MultiLayer Context Model - MCM | * Web n-Gram Context Base Measure (2%): As explained in §3, we use the web n-grams as additional information for calculating the base measures of the Dirichlet topic distributions. |
MultiLayer Context Model - MCM | In (1) we assume that entities (E) are more indicative of the domain compared to other n-grams (G) and should be more dominant in sampling decision for domain topics. |
MultiLayer Context Model - MCM | During Gibbs sampling, we keep track of the frequency of draws of domain, dialog act and slot indicating n-grams wj, in M D, M A and MS matrices, respectively. |
Methodology | The maximum size of a context pattern depends on the size of n-grams available in the data. |
Methodology | In our n-gram collection (Section 3.4), the lengths of the n-grams range from unigrams to 5-grams, so our maximum pattern size is five.
Methodology | Shorter n-grams were not found to improve performance on development data and hence are not extracted. |
Experiments | The classifiers use a feature set designed to mimic our LT-HMM as closely as possible, including n-grams , dictionary matches, ConText output, and symptom/disease se- |
Method | The transition features are modeled as simple indicators over n-grams of present codes, for values of n up to 10, the largest number of codes proposed by |
Method | We found it useful to pad our n-grams with “beginning of document” tokens for sequences when fewer than n codes have been labelled as present, but found it harmful to include an end-of-document tag once labelling is complete. |
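The padded transition features can be sketched as follows (the `<s>` padding token and the `|` join are illustrative choices, not the authors' exact encoding):

```python
def transition_features(history, max_n=3, bos="<s>"):
    """Indicator features over n-grams of the codes labelled so far,
    padded with beginning-of-document tokens when the history is shorter
    than n.  Per the finding above, no end-of-document tag is added."""
    padded = [bos] * (max_n - 1) + list(history)
    feats = []
    for n in range(1, max_n + 1):
        feats.append("|".join(padded[len(padded) - n:]))
    return feats
```

Each returned string is one binary indicator feature over the suffix of present codes.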
Method | Note that n-gram features are only code-specific, as they are not connected to any specific trigger term.
Related work | Statistical systems trained on only text-derived features (such as n-grams ) did not show good performance due to a wide variety of medical language and a relatively small training set (Goldstein et al., 2007). |
A Generic Phrase Training Procedure | Usually all n-grams up to a predefined length limit are considered as candidate phrases. |
Features | First, due to data sparsity and/or the alignment model’s capability, there would exist n-grams that cannot be aligned well, for instance, n-grams that are part of a paraphrase translation or metaphorical expression.
Features | Extracting candidate translations for such kind of n-grams for the sake of improving coverage (recall) might hurt translation quality (precision). |
Experimental Setup | An important parameter of the class-based model is size of the base set, i.e., the total number of n-grams (or rather i-grams) to be clustered. |
Polynomial discounting | We could add a constant to d, but one of the basic premises of the KN model, derived from the assumption that n-gram marginals should be equal to relative frequencies, is that the discount is larger for more frequent n-grams, although in many implementations of KN only the cases c = 1, c = 2, and c ≥ 3 are distinguished.
Related work | probability of a class on n-grams of lexical items (as opposed to classes) (Whittaker and Woodland, 2001; Emami and Jelinek, 2005; Uszkoreit and Brants, 2008).
Related work | Models that condition classes on lexical n-grams could be extended in a way similar to what we propose here. |
Related work | Our use of classes of lexical n-grams for n > 1 has several precedents in the literature (Suhm and Waibel, 1994; Kuo and Reichl, 1999; Deligne and Sagisaka, 2000; Justo and Torres, 2009). |
Experiments | We removed n-grams that appeared less than five times8 in each subcorpus in the language models. |
Implications for Work in Related Domains | The experimental results show that n-grams containing articles are predictive for identifying native languages. |
Implications for Work in Related Domains | Importantly, all n-grams containing articles should be used in the classifier unlike the previous methods that are based only on n-grams containing article errors. |
Implications for Work in Related Domains | Besides, no articles should be explicitly coded in n-grams for taking the overuse/underuse of articles into consideration. |
Methods | In this language model, content words in n-grams are replaced with their corresponding POS tags. |
Experimental Evaluation | CNG is a profile-based method which represents the author as the N most frequent character n-grams of all his/her training texts.
Proposed Tri-Training Algorithm | The features in the character view are the character n-grams of a document. |
Proposed Tri-Training Algorithm | Character n-grams are simple and easily available for any natural language. |
Proposed Tri-Training Algorithm | We use four content-independent structures including n-grams of POS tags (n = 1..3) and rewrite rules (Kim et al., 2011). |
Related Work | Example features include function words (Argamon et al., 2007), richness features (Gamon 2004), punctuation frequencies (Graham et al., 2005), character (Grieve, 2007), word (Burrows, 1992) and POS n-grams (Gamon, 2004; Hirst and Feiguina, 2007), rewrite rules (Halteren et al., 1996), and similarities (Qian and Liu, 2013). |
Experiments | Table 1: Question classification precision for both levels of the hierarchy (features = word n-grams , classifier = libsvm) |
Experiments | Using word n-grams , monolingual English classification obtains .798 correct classification for the fine grained classes, and .90 for the coarse grained classes, results which are very close to those obtained by (Zhang and Lee, 2003). |
Experiments | Table 2: Question classification precision for both levels of the hierarchy (features = word n-grams with abbreviations, classifier = libsvm) |
Evaluation | n-grams represents a simple 5-gram baseline that is similar to Oh and Rudnicky (2000)’s system.
Evaluation | CRF global 3.65 / 3.64 / 3.65; CRF local 3.10* / 3.19* / 3.13*; CLASSiC 3.53* / 3.59 / 3.48*; n-grams 3.01* / 3.09* / 3.32*
Evaluation | This difference is significant for all categories compared with CRF (local) and n-grams (using a 1-sided Mann Whitney U-test, p < 0.001). |
Introduction | In addition, we compare our system with alternative surface realisation methods from the literature, namely, a rank and boost approach and n-grams . |
Conclusion | However, OOVs can be considered as n-grams (phrases) instead of unigrams.
Experiments & Results 4.1 Experimental Setup | We did not use trigrams or larger n-grams in our experiments. |
Graph-based Lexicon Induction | However, constructing such a graph and doing graph propagation on it is computationally very expensive for large n-grams.
Graph-based Lexicon Induction | These phrases are n-grams up to a certain order, which can result in millions of nodes.
BLEU and PORT | translation hypothesis to compute the numbers of the reference n-grams . |
Experiments | Both BLEU and PORT perform matching of n-grams up to n = 4. |
Experiments | In all tuning experiments, both BLEU and PORT performed lower case matching of n-grams up to n = 4. |
Experiments | The BLEU-tuned and Qmean-tuned systems generate similar numbers of matching n-grams, but Qmean-tuned systems produce fewer n-grams (thus, shorter translations). |
Problem Report and Aid Message Recognizers | MSA1 Morpheme n-grams, syntactic dependency n-grams in the tweet and morpheme n-grams before and after the nucleus template.
Problem Report and Aid Message Recognizers | MSA2 Character n-grams of the nucleus template to capture conjugation and modality variations. |
Problem Report and Aid Message Recognizers | MSA3 Morpheme and part-of-speech n-grams within the bunsetsu containing the nucleus template to capture conjugation and modality variations. |
Divergent (Re)Categorization | To tap into a richer source of concept properties than WordNet’s glosses, we can use web n-grams . |
Divergent (Re)Categorization | Consider these descriptions of a cowboy from the Google n-grams (Brants & Franz, 2006). |
Divergent (Re)Categorization | So for each property P suggested by Google n-grams for a lexical concept C, we generate a like-simile for verbal behaviors such as swaggering and an as-as-simile for adjectives such as lonesome. |
Summary and Conclusions | Using the Google n-grams as a source of tacit grouping constructions, we have created a comprehensive lookup table that provides Rex similarity scores for the most common (if often implicit) comparisons. |
Experiments | This is mainly due to the additional minimum support constraint we added which discards many noisy lexical entries from infrequently seen n-grams . |
Online Lexicon Learning Algorithm | and the corresponding navigation plan, we first segment the instruction into word tokens and construct n-grams from them. |
Online Lexicon Learning Algorithm | From the corresponding navigation plan, we find all connected subgraphs of size less than or equal to m. We then update the co-occurrence counts between all the n-grams w and all the connected subgraphs g.
Online Lexicon Learning Algorithm | We also update the counts of how many examples we have encountered so far and counts of the n-grams w and subgraphs g.
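The counting steps described in the two sentences above can be sketched as follows. The instruction tokens and the subgraph identifier here are hypothetical stand-ins for real navigation-plan subgraphs:

```python
from collections import Counter

def ngrams_up_to(tokens, max_n):
    """All word n-grams of the instruction, n = 1..max_n."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

# Counters mirroring the description above (illustrative names).
cooc = Counter()          # (n-gram, subgraph) co-occurrence counts
ngram_count = Counter()   # n-gram occurrence counts
graph_count = Counter()   # subgraph occurrence counts
examples_seen = 0

def update(tokens, subgraphs, max_n=2):
    """Process one (instruction, navigation plan) training example."""
    global examples_seen
    examples_seen += 1
    grams = set(ngrams_up_to(tokens, max_n))
    for w in grams:
        ngram_count[w] += 1
        for g in subgraphs:
            cooc[(w, g)] += 1
    for g in subgraphs:
        graph_count[g] += 1

update(["turn", "left"], ["TURN(LEFT)"])
print(cooc[(("turn", "left"), "TURN(LEFT)")])
```

From such counts one can later score how strongly each n-gram is associated with each subgraph.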
Semantics via Web Features | As the source of Web information, we use the Google n-grams corpus (Brants and Franz, 2006) which contains English n-grams (n = 1 to 5) and their Web frequency counts, derived from nearly 1 trillion word tokens and 95 billion sentences. |
Semantics via Web Features | Using the n-grams corpus (for n = 1 to 5), we collect co-occurrence Web-counts by allowing a varying number of wildcards between h1 and h2 in the query.
Semantics via Web Features | 2These clusters are derived from the V2 Google n-grams corpus.
Experiments | Following evaluations in machine translation as well as previous work in sentence compression (Unno et al., 2006; Clarke and Lapata, 2008; Martins and Smith, 2009; Napoles et al., 2011b; Thadani and McKeown, 2013), we evaluate system performance using F1 metrics over n-grams and dependency edges produced by parsing system output with RASP (Briscoe et al., 2006) and the Stanford parser. |
Experiments | We report results over the following systems grouped into three categories of models: tokens + n-grams , tokens + dependencies, and joint models. |
Introduction | Joint methods have also been proposed that invoke integer linear programming (ILP) formulations to simultaneously consider multiple structural inference problems—both over n-grams and input dependencies (Martins and Smith, 2009) or n-grams and all possible dependencies (Thadani and McKeown, 2013). |
Multi-Structure Sentence Compression | C. In addition, we define bigram indicator variables y_ij ∈ {0, 1} to represent whether a particular order-preserving bigram ⟨t_i, t_j⟩ from S is present as a contiguous bigram in C, as well as dependency indicator variables z_ij ∈ {0, 1} corresponding to whether the dependency arc t_i → t_j is present in the dependency parse of C. The score for a given compression C can now be defined to factor over its tokens, n-grams and dependencies as follows.
Our Approach | The features we use are character n-grams around mismatches. |
Our Approach | i) n-grams around gaps, i.e., we account only for insertions and deletions; |
Our Approach | ii) n-grams around any type of mismatch, i.e., we account for all three types of mismatches. |
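A rough sketch of the two feature variants listed above, assuming the two strings have already been aligned with '-' marking gaps. The window placement around each mismatch is one plausible choice, not necessarily the authors' exact one:

```python
def ngrams_around_mismatches(a, b, n=3, gaps_only=False):
    """Character n-grams centered on mismatch positions of two
    pre-aligned strings ('-' marks an insertion/deletion gap).
    gaps_only=True gives variant (i): insertions/deletions only;
    gaps_only=False gives variant (ii): all mismatch types."""
    assert len(a) == len(b), "inputs must be pre-aligned"
    feats = []
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y:
            continue
        if gaps_only and '-' not in (x, y):
            continue  # skip substitutions in variant (i)
        lo = max(0, i - n // 2)
        feats.append((a[lo:lo + n], b[lo:lo + n]))
    return feats

# "color" vs "colour", aligned with one gap:
print(ngrams_around_mismatches("colo-r", "colour"))
```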
Paraphrase Evaluation Metrics | We introduce a new scoring metric PINC that measures how many n-grams differ between the two sentences. |
Paraphrase Evaluation Metrics | where n-gram_S and n-gram_C are the lists of n-grams in the source and candidate sentences, respectively.
Paraphrase Evaluation Metrics | The PINC score computes the percentage of n-grams that appear in the candidate sentence but not in the source sentence. |
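A small self-contained sketch of the PINC computation as just described; whitespace tokenization and a maximum n of 4 are assumptions here:

```python
def ngrams(tokens, n):
    """Set of word n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """PINC: average over n of the fraction of candidate n-grams
    that do NOT appear in the source (higher = more novel wording)."""
    src, cand = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            break  # candidate shorter than n
        overlap = len(cand_ngrams & ngrams(src, n))
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0

print(pinc("the cat sat", "the cat sat"))   # identical wording
print(pinc("the cat sat", "a dog ran by"))  # fully novel wording
```

An identical sentence scores 0 and a lexically disjoint one scores 1, which matches the intent of rewarding paraphrases that change the surface form.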
Variational Approximate Decoding | whose edges correspond to n-grams (weighted with negative log-probabilities) and whose vertices correspond to (n − 1)-grams.
Variational Approximate Decoding | This may be regarded as favoring n-grams that are likely to appear in the reference translation (because they are likely in the derivation forest). |
Variational Approximate Decoding | However, in order to score well on the BLEU metric for MT evaluation (Papineni et al., 2001), which gives partial credit, we would also like to favor lower-order n-grams that are likely to appear in the reference, even if this means picking some less-likely high-order n-grams . |
Variational vs. Min-Risk Decoding | Now, let us divide N, which contains n-gram types of different n, into several subsets Wn, each of which contains only the n-grams with a given length n. We can now rewrite (19) as follows, |
Generation & Propagation | For the unlabeled phrases, the set of possible target translations could be extremely large (e.g., all target language n-grams ). |
Generation & Propagation | A naïve way to achieve this goal would be to extract all n-grams, from n = 1 to a maximum n-gram order, from the monolingual data, but this strategy would lead to a combinatorial explosion in the number of target phrases.
Generation & Propagation | This set of candidate phrases is filtered to include only n-grams occurring in the target monolingual corpus, and helps to prune passed-through OOV words and invalid translations. |
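The filtering step just described can be sketched as a set-membership test against the n-grams of the target monolingual corpus; the toy German sentences and whitespace tokenization are assumptions for illustration:

```python
def ngrams_in_corpus(corpus_sentences, max_n=3):
    """Set of all n-grams (n <= max_n) seen in the monolingual corpus."""
    seen = set()
    for sent in corpus_sentences:
        toks = sent.split()
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                seen.add(tuple(toks[i:i + n]))
    return seen

def filter_candidates(candidates, corpus_sentences, max_n=3):
    """Keep only candidate phrases attested in the monolingual corpus."""
    seen = ngrams_in_corpus(corpus_sentences, max_n)
    return [c for c in candidates if tuple(c.split()) in seen]

mono = ["das ist gut", "es ist sehr gut"]
print(filter_candidates(["ist gut", "gut das", "sehr gut"], mono))
```

Unattested candidates such as "gut das" are pruned, which is how passed-through OOV words and invalid translations get discarded.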
Introduction | Unlike previous work (Irvine and Callison-Burch, 2013a; Razmara et al., 2013), we use higher order n-grams instead of restricting to unigrams, since our approach goes beyond OOV mitigation and can enrich the entire translation model by using evidence from monolingual text. |
Abstract | Contrasting the predictive ability of statistics derived from 6 different corpora, we find intuitive results showing that, e.g., a British corpus over-predicts the speed with which an American will react to the words ward and duke, and that the Google n-grams over-predicts familiarity with technology terms. |
Fitting Behavioral Data 2.1 Data | 2Surprisingly, fife was determined to be one of the words with the largest frequency asymmetry between Switchboard and the Google n-grams corpus. |
Introduction | Specifically, we predict human data from three widely used psycholinguistic experimental paradigms—lexical decision, word naming, and picture naming—using unigram frequency estimates from Google n-grams (Brants and Franz, 2006), Switchboard (Godfrey et al., 1992), spoken and written English portions of CELEX (Baayen et al., 1995), and spoken and written portions of the British National Corpus (BNC Consortium, 2007). |
Introduction | For example, Google n-grams overestimates the ease with which humans will process words related to the web (tech, code, search, site), while the Switchboard corpus—a collection of informal telephone conversations between strangers—overestimates how quickly humans will react to colloquialisms (heck, dam) and backchannels (wow, right). |
Experiments | 4.4 Analysis of Different Effects of Different N-grams |
Experiments | To evaluate the effects of different n-grams for our proposed transfer model, we compared the uni-/bi-/tri-gram transfer models in SMT, and illustrate the results in Fig- |
Experiments | Translation quality obtained with different n-gram orders for the transfer model.
Conclusions | In particular, we plan to extend the use of n-grams to larger contexts and consider more fine-grained tuning of other constraints, too. |
Lexical Constraints for Humorous Word Substitution | Implementation Local coherence is implemented using n-grams . |
Lexical Constraints for Humorous Word Substitution | To estimate the level of expectation triggered by a left-context, we rely on a vast collection of n-grams, the 2012 Google Books n-grams collection (Michel et al., 2011), and compute the cohesion of each n-gram by comparing their expected frequency (assuming word independence) to their observed number of occurrences.
Introduction | In many natural language systems, single words and n-grams are usefully described by their distributional similarities (Brown et al., 1992), among many others. |
Introduction | Many n-grams will never be seen during training, especially when n is large.
Introduction | In this work, we present a new solution to learn features and phrase representations even for very long, unseen n-grams . |
Introduction | Recent approaches to this task have been based on slot-filling (Yang et al., 2011; Elliott and Keller, 2013), combining web-scale n-grams (Li et al., 2011), syntactic tree substitution (Mitchell et al., 2012), and description-by-retrieval (Farhadi et al., 2010; Ordonez et al., 2011; Hodosh et al., 2013). |
Methodology | p_n measures the effective overlap by calculating the proportion between the maximum number of n-grams co-occurring between a candidate and a reference and the total number of n-grams in the candidate text.
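A minimal sketch of the clipped n-gram precision p_n described above, for a single whitespace-tokenized candidate; this is only the per-order precision, not the full BLEU score with its brevity penalty:

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """p_n: clipped count of candidate n-grams co-occurring with any
    reference, divided by the total number of candidate n-grams."""
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand = counts(candidate.split())
    if not cand:
        return 0.0
    # For each n-gram, the maximum count observed in any one reference.
    max_ref = Counter()
    for ref in references:
        for g, c in counts(ref.split()).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

print(modified_precision("the cat the cat", ["the cat sat"], 1))
```

Clipping prevents a candidate from earning credit by repeating a matched n-gram more often than any reference contains it.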
Methodology | (2012); to the best of our knowledge, the only image description work to use higher-order n-grams with BLEU is Elliott and Keller (2013). |
Introduction | Yet, though many language models more sophisticated than N-grams have been proposed, N-grams are empirically hard to beat in terms of WER.
Motivation | Given the rise of unsupervised latent topic modeling with Latent Dirichlet Allocation (Blei et al., 2003) and similar latent variable approaches for discovering meaningful word co-occurrence patterns in large text corpora, we ought to be able to leverage these topic contexts instead of merely N-grams.
Motivation | information retrieval, and again, interpolate latent topic models with N-grams to improve retrieval performance. |
Experiments | To collect the statistics, we take each NP in the training data and consider all possible 2-grams through 5-grams that are present in the NP’s modifier sequence, allowing for nonconsecutive n-grams.
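Enumerating order-preserving but possibly nonconsecutive n-grams of a modifier sequence is exactly what itertools.combinations yields; a short sketch (the modifier words are illustrative):

```python
from itertools import combinations

def nonconsecutive_ngrams(modifiers, min_n=2, max_n=5):
    """All order-preserving (possibly non-contiguous) n-grams,
    min_n <= n <= max_n, drawn from a modifier sequence."""
    out = []
    for n in range(min_n, min(max_n, len(modifiers)) + 1):
        out.extend(combinations(modifiers, n))  # preserves input order
    return out

print(nonconsecutive_ngrams(["big", "old", "red"]))
```

Note that ("big", "red") is produced even though "old" intervenes, which is what "nonconsecutive" buys over ordinary contiguous n-grams.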
Model | Previous counting approaches can be expressed as a real-valued feature that, given all n-grams generated by a permutation of modifiers, returns the count of all these n-grams in the original training data. |
Model | We might also expect permutations that contain n-grams previously seen in the training data to be more natural sounding than other permutations that generate n-grams that have not been seen before. |
The Model | The first score is computed on the basis of all the n-grams , but using a common set of weights independent of the aspect a. |
The Model | Another score is computed only using n-grams associated with the related topic, but an aspect-specific set of weights is used in this computation. |
The Model | b_y is the bias term which regulates the prior distribution P(y_a = y); f iterates through all the n-grams; J_{f,y} and J^a_{f,y} are common weights and aspect-specific weights for n-gram feature f; p^a_f is equal to the fraction of words in n-gram feature f assigned to the aspect topic (r = loc, z = a).
Introduction | Use of longer n-grams improves translation results, but exacerbates this interaction. |
Introduction | We examine the question of whether, given the reordering inherent in the machine translation problem, lower order n-grams will provide as valuable a search heuristic as they do for speech recognition. |
Multi-pass LM-Integrated Decoding | where S(i, j) is the set of candidate target-language words outside the span of (i, j), and B is the product of the upper bounds for the two on-the-border n-grams.
Introduction | [Figure legend: Rank; curves for Monte Carlo using N-grams, Monte Carlo using words, Relative Entropy using N-grams, and Relative Entropy using words]
Related Work | The n-gram based approaches are based on the counts of character or byte n-grams , which are sequences of n characters or bytes, extracted from a corpus for each reference language. |
Related Work | (Dunning, 1994) proposed a system that uses Markov Chains of byte n-grams with Bayesian Decision Rules to minimize the probability error. |
Introduction | (Miller et al., 1999; Song and Croft, 1999) explore the use of n-grams in retrieval models.
Previous Work | Subsequently, various types of phrases, such as sequential n-grams (Mitra et al., 1997), head-modifier pairs extracted from syntactic structures (Lewis and Croft, 1990; Zhai, 1997; Dillon and Gray, 1983; Strzalkowski et al., 1994), and proximity-based phrases (Turpin and Moffat, 1999), were examined with conventional retrieval models (e.g., the vector space model).
Previous Work | (Song and Croft, 1999; Miller et al., 1999; Gao et al., 2004; Metzler and Croft, 2005) investigated the effectiveness of language modeling approach in modeling statistical phrases such as n-grams or proximity-based phrases. |
Experiments | For the web baseline (reported as Google), we stemmed all words in the Google n-grams and counted every verb v and noun n that appear in Gigaword.
How Frequent is Unseen Data? | The dotted line uses Google n-grams as training. |
Models | We also avoided over-counting co-occurrences in lower-order n-grams that appear again in 4- or 5-grams.
Abstract | We adopt a representation of concepts alternative to n-grams and propose two concept-scoring functions based on semantic overlap.
The summarization framework | To represent sentences and answers we adopted an alternative approach to classical n-grams that could be defined as bag-of-BEs.
The summarization framework | Different from n-grams, they vary in length and depend on parsing techniques, named entity detection, part-of-speech tagging and resolution of syntactic forms such as hyponyms, pronouns, pertainyms, abbreviations and synonyms.
Automated Classification | To provide information related to term usage to the classifier, we extracted trigram and 4-gram features from the Web 1T Corpus (Brants and Franz, 2006), a large collection of n-grams and their counts created from approximately one trillion words of Web text.
Automated Classification | Only n-grams containing lowercase words were used. |
Automated Classification | Only n-grams containing both terms (including plural forms) were extracted. |
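The two filtering constraints stated above (lowercase words only; both terms present, including plural forms) can be sketched as below. Treating a plural as simply term + "s" is a deliberate simplification for illustration, not necessarily the authors' rule:

```python
def keep_ngram(ngram_tokens, term_a, term_b):
    """Keep an n-gram only if (i) every word is lowercase, and
    (ii) both terms occur in it, counting naive plural forms."""
    if not all(w.islower() for w in ngram_tokens):
        return False
    def present(term):
        return any(w == term or w == term + "s" for w in ngram_tokens)
    return present(term_a) and present(term_b)

print(keep_ngram(["dogs", "chase", "cats"], "dog", "cat"))  # kept
print(keep_ngram(["Dogs", "chase", "cats"], "dog", "cat"))  # rejected: uppercase
```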
Background | where g_n(s) is the multi-set of all n-grams in a string s. In this definition, n-grams in e_i and {r_ij} are weighted by D_t(i).
Background | If the i-th training sample has a larger weight, the corresponding n-grams will contribute more to the overall score WBLEU(E, R).
Background | In this method, an n-gram cache is used to store the most frequently and recently accessed n-grams.
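One plausible realization of such a cache is a small LRU store; this sketch is illustrative, with the compute callback standing in for whatever n-gram statistic is being cached:

```python
from collections import OrderedDict

class NgramCache:
    """Small cache keeping the most recently accessed n-grams
    (a simple LRU stand-in for the cache described above)."""
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, ngram, compute):
        if ngram in self.store:
            self.store.move_to_end(ngram)   # mark as recently used
            return self.store[ngram]
        value = compute(ngram)              # e.g., an n-gram match count
        self.store[ngram] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
        return value

cache = NgramCache(capacity=2)
cache.get(("the", "cat"), len)
cache.get(("a", "dog"), len)
cache.get(("the", "cat"), len)   # hit: ("the", "cat") becomes most recent
cache.get(("new", "one"), len)   # evicts ("a", "dog")
print(list(cache.store))
```

Frequently re-scored n-grams thus stay resident while rarely touched ones are evicted, which is the point of the cache in repeated BLEU-style scoring.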
Experiment 1: Textual Similarity | Character n-grams, which were also used as one of our additional features.
Experiment 1: Textual Similarity | Another interesting point is the high scores achieved by the Character n-grams feature.
Experiment 1: Textual Similarity | Dataset (Mpar / Mvid / SMTe): DW 0.448 / 0.820 / 0.660; ADW-MF 0.485 / 0.842 / 0.721; Explicit Semantic Analysis 0.427 / 0.781 / 0.619; Pairwise Word Similarity 0.564 / 0.835 / 0.527; Distributional Thesaurus 0.494 / 0.481 / 0.365; Character n-grams 0.658 / 0.771 / 0.554
Evaluation | Table 3: MWE identification with CRF: base are the features corresponding to token properties and word n-grams . |
MWE-dedicated Features | Word n-grams . |
MWE-dedicated Features | POS n-grams . |
Experimental Design | Field bigrams/trigrams Analogously to the lexical features mentioned above, we introduce a series of nonlocal features that capture field n-grams , given a specific record. |
Related Work | Local and nonlocal information (e.g., word n-grams , long- |
Results | The 1-BEST system has some grammaticality issues, which we avoid by defining features over lexical n-grams and repeated words.
Conclusion and Future Work | In this paper, we only tried Dice coefficient of n-grams and symmetrical sentence level BLEU as similarity measures. |
Features and Training | Tl(e,e') is the propagating probability in equation (8), with the similarity measure Sim(e,e') defined as the Dice coefficient over the set of all n-grams in e and those in e'. |
Features and Training | where N Grn(x) is the set of n-grams in string x, and Dice (A, B) is the Dice coefficient over sets A and B: |
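With NGr_n(x) pooled over n = 1..2 for brevity, the Dice coefficient over n-gram sets described in the two lines above can be sketched as:

```python
def ngram_set(tokens, max_n=2):
    """NGr_n(x) for n = 1..max_n, pooled into one set."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

def dice(a_tokens, b_tokens, max_n=2):
    """Dice(A, B) = 2|A ∩ B| / (|A| + |B|) over the n-gram sets."""
    A, B = ngram_set(a_tokens, max_n), ngram_set(b_tokens, max_n)
    if not A and not B:
        return 0.0
    return 2 * len(A & B) / (len(A) + len(B))

print(dice("the cat sat".split(), "the cat ran".split()))
```

Two identical strings score 1 and disjoint strings score 0, making this a convenient symmetric similarity Sim(e, e') for the propagation step above.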
Introduction | While traditional LMs use word n-grams , where the n — 1 previous words predict the next word, newer models integrate long-span information in making decisions. |
Introduction | For example, incorporating long-distance dependencies and syntactic structure can help the LM better predict words by complementing the predictive power of n-grams (Chelba and Jelinek, 2000; Collins et al., 2005; Filimonov and Harper, 2009; Kuo et al., 2009). |
Syntactic Language Models | Structured language modeling incorporates syntactic parse trees to identify the head words in a hypothesis for modeling dependencies beyond n-grams . |
Experiments | Feature templates such as rule n-grams and rule shapes only work if iterative mixing (algorithm 3) or feature selection (algorithm 4) are used. |
Introduction | Such features include rule ids, rule-local n-grams , or types of rule shapes. |
Local Features for Synchronous CFGs | Rule n-grams: These features identify n-grams of consecutive items in a rule. |
Experiments | In addition, we extracted web n-grams and entity lists (see §3) from movie related web sites, and online blogs and reviews. |
Experiments | We extract prior distributions for entities and n-grams to calculate entity list 77 and word-tag [3 priors (see §3.1). |
Markov Topic Regression - MTR | We built a language model using SRILM (Stolcke, 2002) on domain-specific sources such as top wiki pages and blogs on online movie reviews, etc., to obtain the probabilities of domain-specific n-grams, up to 3-grams.
Perplexity Evaluation | In this experiment, we assessed the effectiveness of the TD and TO components in reducing the n-gram model’s perplexity.
Perplexity Evaluation | Given the inability of n-grams to model long history contexts, the TD and TO components are still effective in helping to enhance the prediction.
Related Work | There is also work on skipping irrelevant history words in order to reveal more informative n-grams (Siu & Ostendorf 2000, Guthrie et al.