Introduction | Instead, we consider how to efficiently mine the Google n-grams corpus. |
Introduction | This work uses a large web-scale corpus (Google n-grams) to compute features for the full parsing task. |
Web-count Features | The approach of Lauer (1995), for example, would be to take an ambiguous noun sequence like hydrogen ion exchange and compare the various counts (or associated conditional probabilities) of n-grams like hydrogen ion and hydrogen exchange. |
Web-count Features | (2010) use web-scale n-grams to compute similar association statistics for longer sequences of nouns. |
Web-count Features | These paraphrase features hint at the correct attachment decision by looking for web n-grams in special contexts whose surface form reveals the underlying syntax. |
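To make the count comparison concrete, here is a minimal sketch, assuming a hypothetical `count` lookup into an n-gram table, of the Lauer-style test from the example above: bracket "hydrogen ion exchange" by comparing the counts of "hydrogen ion" and "hydrogen exchange".

```python
def bracket(w1, w2, w3, count):
    """Bracket the compound (w1 w2 w3): 'left' = ((w1 w2) w3),
    'right' = (w1 (w2 w3))."""
    left_evidence = count((w1, w2))    # e.g. count of "hydrogen ion"
    right_evidence = count((w1, w3))   # e.g. count of "hydrogen exchange"
    return "left" if left_evidence >= right_evidence else "right"

counts = {("hydrogen", "ion"): 255, ("hydrogen", "exchange"): 90}  # toy counts
print(bracket("hydrogen", "ion", "exchange", lambda g: counts.get(g, 0)))  # left
```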
Working with Web n-Grams | Rather than working through a search API (or scraper), we use an offline web corpus, the Google n-gram corpus (Brants and Franz, 2006), which contains English n-grams (n = 1 to 5) and their observed frequency counts, generated from nearly 1 trillion word tokens and 95 billion sentences. |
Working with Web n-Grams | In particular, we only use counts of n-grams of the form x * y, where * is a wildcard gap whose length g satisfies g <= 3. |
Working with Web n-Grams | Next, we exploit a simple trie-based hashing algorithm to answer all of these queries efficiently in one pass over the n-grams corpus. |
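A minimal sketch of the one-pass batch lookup, assuming a Web1T-style file with one "w1 ... wn<TAB>count" record per line; a plain hash table stands in here for the trie described above, but the single-pass structure is the same.

```python
def batch_counts(queries, ngram_file):
    """queries: iterable of token tuples; returns {query: count}."""
    wanted = {tuple(q): 0 for q in queries}   # all queries hashed up front
    with open(ngram_file, encoding="utf-8") as f:
        for line in f:                        # single pass over the corpus
            ngram, sep, count = line.rpartition("\t")
            if not sep:
                continue                      # skip malformed lines
            key = tuple(ngram.split())
            if key in wanted:                 # O(1) membership test per record
                wanted[key] = int(count)
    return wanted
```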
Abstract | Our approach first identifies statistically salient phrases of words and parts of speech (known as n-grams) in training texts generated in conditions where the social power relationship between the participants is known. |
Abstract | Then, we apply machine learning to train classifiers with groups of these n-grams as features. |
Abstract | Unlike those works, our text classification techniques take into account the frequency of occurrence of word n-grams and part-of-speech (POS) tag sequences, as well as other measures of statistical salience in the training data. |
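As an illustration only (not the authors' pipeline), here is a minimal scikit-learn sketch of training a classifier on word n-gram frequency features; POS-tag n-grams could be added by running a second vectorizer over tag sequences. The toy texts and labels are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["please send me the report today", "would you mind sending it over"]
labels = [1, 0]  # hypothetical social-power labels for the toy examples

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),  # word 1/2/3-gram frequency features
    LogisticRegression(),
)
model.fit(texts, labels)
print(model.predict(["send the report now"]))
```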
Abstract | This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). |
Abstract | In this work we explore the suitability of LHs over n-grams at the character-level for AA. |
Abstract | We report experimental results on AA data sets confirming that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state-of-the-art approaches. |
Introduction | In particular, we consider local histograms over n-grams at the character-level obtained via the locally-weighted bag of words (LOWBOW) framework (Lebanon et al., 2007). |
Introduction | Results confirm that local histograms of character n-grams are more helpful for AA than the usual global histograms of words or character n-grams (Luyckx and Daelemans, 2010); our results are superior to those reported in related works. |
Introduction | We also show that local histograms over character n-grams are more helpful than local histograms over words, as originally proposed by Lebanon et al. (2007). |
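A minimal sketch of the local-histogram idea, assuming Gaussian position weighting in the spirit of the LOWBOW framework; the parameters k, sigma, and the kernel choice here are illustrative assumptions, not the paper's settings.

```python
import math
from collections import Counter

def local_histograms(text, n=3, k=4, sigma=0.2):
    """Return k histograms of character n-grams, each weighted by a
    Gaussian kernel centered at a different normalized document position."""
    num = max(len(text) - n + 1, 1)
    centers = [(i + 0.5) / k for i in range(k)]
    hists = []
    for mu in centers:
        h = Counter()
        for pos in range(num):
            t = pos / num                                    # position in [0, 1)
            w = math.exp(-((t - mu) ** 2) / (2 * sigma**2))  # kernel weight
            h[text[pos:pos + n]] += w
        hists.append(h)
    return hists  # k position-sensitive histograms instead of one global one

hists = local_histograms("the quick brown fox jumps over the lazy dog")
```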
Related Work | Some researchers have gone a step further and have attempted to capture sequential information by using n-grams at the word-level (Peng et al., 2004) or by discovering maximal frequent word sequences (Coyotl-Morales et al., 2006). |
Related Work | Unfortunately, because of computational limitations, the latter methods cannot discover enough sequential information from documents (e.g., word n-grams are often restricted to n ∈ {1, 2, 3}, while full sequential information would require n ∈ {1, ..., L}, with L the length of the document). |
Related Work | Stamatatos and coworkers have studied the impact of feature selection with character n-grams in AA (Houvardas and Stamatatos, 2006; Stamatatos, 2006a), ensemble learning with character n-grams (Stamatatos, 2006b), and novel classification techniques based on character n-grams. |
Abstract | Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. |
Introduction | The largest language models (LMs) can contain as many as several hundred billion n-grams (Brants et al., 2007), so storage is a challenge. |
Introduction | Overall, we are able to store the 4 billion n-grams of the Google Web1T (Brants and Franz, 2006) corpus in 23 bits per n-gram. |
Language Model Implementations | Lookup is linear in the length of the key and logarithmic in the number of n-grams. |
Preliminaries | This corpus, which is on the large end of corpora typically employed in language modeling, is a collection of nearly 4 billion n-grams extracted from over a trillion tokens of English text, and has a vocabulary of about 13.5 million words. |
Preliminaries | Tries represent collections of n-grams using a tree. |
Preliminaries | Each node in the tree encodes a word, and paths in the tree correspond to n-grams in the collection. |
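A minimal sketch of one level of such a trie, stored as a sorted array and searched with binary search; the record layout and names are illustrative assumptions, not the paper's actual encoding.

```python
import bisect

class TrieLevel:
    """One level of an n-gram trie stored as a sorted array of
    (context_id, word_id) keys. Looking up a full n-gram walks one level
    per word, so lookup is linear in key length and logarithmic (per
    level) in the number of stored n-grams."""

    def __init__(self, records):
        # records must be pre-sorted by (context_id, word_id)
        self.keys = [(ctx, w) for ctx, w, _ in records]
        self.counts = [c for _, _, c in records]

    def lookup(self, context_id, word_id):
        i = bisect.bisect_left(self.keys, (context_id, word_id))  # O(log N)
        if i < len(self.keys) and self.keys[i] == (context_id, word_id):
            return i, self.counts[i]  # i acts as the node id for the next level
        return None
```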
Experimental Settings | nw N-grams: Normword (normalized word) 1/2/3-gram probabilities; lem N-grams: lemma 1/2/3-gram probabilities; pos N-grams: POS 1/2/3-gram probabilities |
Experimental Settings | (an extra bin is reserved for word N-grams which did not appear in the models). |
Experimental Settings | Like the N-grams, this number is binned; in this case there are 11 bins, with 10 spread evenly over the [0, 1) range, and an extra bin for values of exactly 1 (i.e., when the word appears in every hypothesis in the set). |
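A direct sketch of the binning rule just described: 10 even bins over [0, 1) plus an 11th bin reserved for values of exactly 1.

```python
def bin_value(x):
    """Map a value in [0, 1] to one of 11 bins."""
    if x == 1.0:
        return 10        # extra bin: the word appears in every hypothesis
    return int(x * 10)   # bins 0-9, each covering a 0.1-wide slice of [0, 1)
```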
Results | First, Table 4 shows the effect of adding nw N-grams of successively higher orders to the word baseline. |
Results | word+nw 1-gram: 49.51, 12.9; word+nw 1-gram+nw 2-gram: 59.26, 35.2; word+nw N-grams: 59.33, 35.3; +pos: 58.50, 33.4; +pos N-grams: 57.35, 30.8; +lem+lem N-grams: 59.63, 36.0; +lem+lem N-grams+na: 59.93, 36.7; +lem+lem N-grams+na+nw: 59.77, 36.3; +lem: 60.92, 38.9; +lem+na: 60.47, 37.9; +lem+lem N-grams: 60.44, 37.9 |
Results | Here, the best performer is the model which utilizes the word, nw N-gram, and lemma features. |
Conclusion | The performance improves log-linearly with the number of parameters (unique N-grams). |
Experiments | [Figure: results as a function of the Number of Unique N-grams; x-axis ranges from 1e4 to 1e9 (log scale).] |
Introduction | Another is a web-scale N-gram corpus containing N-grams of length 1-5 (Brants and Franz, 2006), which we call Google V1 in this paper. |
Related Work | The former uses the web-scale data explicitly to create more data for training the model; while the latter explores the web-scale N-grams data (Lin et al., 2010) for compound bracketing disambiguation. |
Related Work | However, we explore the web-scale data for dependency parsing, where performance improves log-linearly with the number of parameters (unique N-grams). |
Web-Derived Selectional Preference Features | N-grams appearing at least 40 times are kept. |
Web-Derived Selectional Preference Features | In this paper, selectional preferences carry the same meaning as N-grams: they model word-to-word relationships, rather than only predicate-argument relationships. |
Web-Derived Selectional Preference Features | All n-grams with lower counts are discarded. |
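As an illustration of how such web counts can feed a parser, here is a hedged sketch in which the selectional preference feature for a candidate head-modifier pair is its binned log-count; `web_count` is a hypothetical lookup into the threshold-filtered n-gram table.

```python
import math

def preference_feature(head, modifier, web_count):
    """Return an indicator-feature name for a candidate dependency arc."""
    c = web_count((head, modifier))
    if c == 0:
        return "pair-unseen"
    return "pair-logcount=%d" % int(math.log10(c))  # bin by order of magnitude
```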
Abstract | The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality, measured by the BLEU score and "readability", when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system. |
Conclusion | As far as we know, this is the first work to build a complex large-scale distributed language model, using a principled approach, that is more powerful than n-grams when both are trained on a very large corpus of up to a billion tokens. |
Experimental results | The composite n-gram/m-SLM/PLSA model gives significant perplexity reductions over baseline n-grams (n = 3, 4, 5) and m-SLMs (m = 2, 3, 4). |
Experimental results | Also, in the same study (Charniak, 2003), they found that the outputs produced using the n-grams received higher BLEU scores; ours did not. |
Introduction | We conduct comprehensive experiments on corpora with 44 million, 230 million, and 1.3 billion tokens, and compare perplexity results with n-grams (n = 3, 4, 5 respectively) on these three corpora, obtaining drastic perplexity reductions. |
Training algorithm | where $\#(g, W^l, G^l, d)$ is the count of semantic content $g$ in the semantic annotation string $G^l$ of the $l$th sentence $W^l$ in document $d$, $\#(w_{-n+1}^{-1} w\, h_{-m}^{-1}\, g, W^l, T^l, G^l, d)$ is the count of the n-gram $w_{-n+1}^{-1} w$ together with its $m$ most recent exposed headwords $h_{-m}^{-1}$ and semantic content $g$ in the parse $T^l$ and semantic annotation string $G^l$ of the $l$th sentence $W^l$ in document $d$, and $\#(t\, h\, \mathrm{tag}, W^l, T^l, d)$ is the corresponding count of a tag with its headword context in the parse $T^l$. |
Training algorithm | The topic of large scale distributed language models is relatively new, and existing works are restricted to n-grams only (Brants et al., 2007; Emami et al., 2007; Zhang et al., 2006). |
Experiments | The classifiers use a feature set designed to mimic our LT-HMM as closely as possible, including n-grams , dictionary matches, ConText output, and symptom/disease se- |
Method | The transition features are modeled as simple indicators over n-grams of present codes, for values of n up to 10, the largest number of codes proposed by |
Method | We found it useful to pad our n-grams with “beginning of document” tokens for sequences when fewer than n codes have been labelled as present, but found it harmful to include an end-of-document tag once labelling is complete. |
Method | Note that the n-gram features are only code-specific, as they are not connected to any specific trigger term. |
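A small sketch of these transition features under the scheme described above: indicators over the n most recently labelled present codes, padded with beginning-of-document tokens when fewer than n codes exist (and, per the text, with no end-of-document tag). The feature-name format is an assumption.

```python
def transition_features(present_codes, max_n=10):
    """Indicator feature strings over n-grams of already-present codes."""
    feats = []
    for n in range(1, max_n + 1):
        hist = present_codes[-n:]
        if len(hist) < n:                       # pad short histories
            hist = ["<BOD>"] * (n - len(hist)) + hist
        feats.append("codes=" + "_".join(hist))
    return feats

# e.g. transition_features(["428.0", "518.81"], max_n=3)
# -> ['codes=518.81', 'codes=428.0_518.81', 'codes=<BOD>_428.0_518.81']
```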
Related work | Statistical systems trained on only text-derived features (such as n-grams ) did not show good performance due to a wide variety of medical language and a relatively small training set (Goldstein et al., 2007). |
Experimental Setup | An important parameter of the class-based model is the size of the base set, i.e., the total number of n-grams (or rather i-grams) to be clustered. |
Polynomial discounting | We could add a constant to d, but one of the basic premises of the KN model, derived from the assumption that n-gram marginals should be equal to relative frequencies, is that the discount is larger for more frequent n-grams, although in many implementations of KN only the cases c = 1, c = 2, and c ≥ 3 are distinguished. |
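For reference, a sketch of the standard modified Kneser-Ney discounts (Chen and Goodman, 1998) that distinguish exactly these three cases; this is the usual baseline, not the polynomial discounting proposed here. Below, $n_k$ denotes the number of n-grams occurring exactly $k$ times.

```latex
% Modified Kneser-Ney discounts with the cases c = 1, c = 2, c >= 3:
\[
D(c) =
\begin{cases}
0      & c = 0 \\
D_1    & c = 1 \\
D_2    & c = 2 \\
D_{3+} & c \ge 3
\end{cases}
\qquad
Y = \frac{n_1}{n_1 + 2 n_2}, \quad
D_1 = 1 - 2Y\frac{n_2}{n_1}, \quad
D_2 = 2 - 3Y\frac{n_3}{n_2}, \quad
D_{3+} = 3 - 4Y\frac{n_4}{n_3}
\]
```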
Related work | There is prior work conditioning the probability of a class on n-grams of lexical items (as opposed to classes) (Whittaker and Woodland, 2001; Emami and Jelinek, 2005; Uszkoreit and Brants, 2008). |
Related work | Models that condition classes on lexical n-grams could be extended in a way similar to what we propose here. |
Related work | Our use of classes of lexical n-grams for n > 1 has several precedents in the literature (Suhm and Waibel, 1994; Kuo and Reichl, 1999; Deligne and Sagisaka, 2000; Justo and Torres, 2009). |
Paraphrase Evaluation Metrics | We introduce a new scoring metric, PINC, which measures how many n-grams differ between the two sentences. |
Paraphrase Evaluation Metrics | $\mathrm{PINC}(s, c) = \frac{1}{N}\sum_{n=1}^{N}\left(1 - \frac{|\text{n-gram}_s \cap \text{n-gram}_c|}{|\text{n-gram}_c|}\right)$, where $N$ is the maximum n-gram size considered and n-gram$_s$ and n-gram$_c$ are the lists of n-grams in the source and candidate sentences, respectively. |
Paraphrase Evaluation Metrics | The PINC score computes the percentage of n-grams that appear in the candidate sentence but not in the source sentence. |
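A direct implementation sketch of PINC as just described; setting the maximum n-gram size to 4 here is an assumption.

```python
def pinc(source_tokens, candidate_tokens, max_n=4):
    """Average, over n = 1..max_n, of the fraction of candidate n-grams
    that do NOT appear in the source."""
    scores = []
    for n in range(1, max_n + 1):
        src = {tuple(source_tokens[i:i + n])
               for i in range(len(source_tokens) - n + 1)}
        cand = [tuple(candidate_tokens[i:i + n])
                for i in range(len(candidate_tokens) - n + 1)]
        if not cand:
            continue  # candidate shorter than n
        kept = sum(1 for g in cand if g in src) / len(cand)
        scores.append(1.0 - kept)
    return sum(scores) / len(scores) if scores else 0.0

# Identical sentences score 0; a fully rewritten sentence scores close to 1.
```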
Experiments | To collect the statistics, we take each NP in the training data and consider all possible 2-grams through 5-grams present in the NP's modifier sequence, allowing for non-consecutive n-grams. |
Model | Previous counting approaches can be expressed as a real-valued feature that, given all n-grams generated by a permutation of modifiers, returns the count of all these n-grams in the original training data. |
Model | We might also expect permutations that contain n-grams previously seen in the training data to be more natural sounding than other permutations that generate n-grams that have not been seen before. |
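A sketch of this counting feature, assuming the statistics described above (all 2- to 5-token subsequences, consecutive or not, of each training NP's modifier sequence); the toy training data is illustrative.

```python
from collections import Counter
from itertools import combinations

def subsequences(mods, lo=2, hi=5):
    """All order-preserving 2- to 5-token subsequences (gaps allowed)."""
    for n in range(lo, min(hi, len(mods)) + 1):
        for idx in combinations(range(len(mods)), n):
            yield tuple(mods[i] for i in idx)

train_counts = Counter()
for mods in [["big", "old", "red"]]:   # toy training modifier sequences
    train_counts.update(subsequences(mods))

def count_feature(permutation):
    """Real-valued feature: total training count of the permutation's
    (possibly non-consecutive) n-grams."""
    return sum(train_counts[g] for g in subsequences(list(permutation)))

print(count_feature(["big", "old", "red"]))  # seen ordering scores higher
print(count_feature(["red", "old", "big"]))  # unseen ordering scores 0
```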