Language Identification | For each language, we collect the n-gram counts (for n = 1 to n = 7, also using the word-initial and word-final boundary spaces) from the vocabulary of the training corpus, and then generate a probability distribution from these counts. |
Language Identification | From these counts, we obtained a probability distribution for all the words in our vocabulary. |
Language Identification | In Table 3, we present the top 10 results of the probability distributions obtained from the vocabulary of English, Finnish, and German corpora. |
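A minimal sketch of the counting-and-normalization step described above, assuming the vocabulary is available as a plain list of word types; the function name, the toy vocabulary, and the top-10 printout (echoing Table 3) are our own illustration, not the papers' implementation.

```python
from collections import Counter

def char_ngram_distribution(vocab_words, max_n=7):
    """Collect character n-gram counts (n = 1..max_n) from a vocabulary,
    padding each word with a boundary space on both sides, then normalize
    the counts into a probability distribution."""
    counts = Counter()
    for word in vocab_words:
        padded = f" {word} "  # spaces mark word beginning and ending
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    total = sum(counts.values())
    return {ngram: c / total for ngram, c in counts.items()}

# Toy usage: top 10 entries of the distribution for a tiny vocabulary
dist = char_ngram_distribution(["the", "of", "and"])
for ngram, p in sorted(dist.items(), key=lambda kv: -kv[1])[:10]:
    print(repr(ngram), round(p, 4))
```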
Related Work | Sibun and Reynar (1996) used relative entropy, first generating n-gram probability distributions for both training and test data, and then measuring the distance between the two distributions with the Kullback-Leibler distance. |
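A minimal sketch of that relative-entropy comparison, assuming both training and test data have already been reduced to n-gram probability distributions (e.g., by a function like char_ngram_distribution above); the epsilon floor for n-grams unseen in training is our own assumption, added to keep the divergence finite.

```python
import math

def kl_divergence(test_dist, train_dist, epsilon=1e-9):
    """Kullback-Leibler divergence D(test || train) between two n-gram
    probability distributions given as {ngram: probability} dicts.
    epsilon is an assumed floor for n-grams missing from training."""
    return sum(p * math.log(p / train_dist.get(ngram, epsilon))
               for ngram, p in test_dist.items() if p > 0)

def identify_language(test_dist, train_dists):
    """Return the training language whose n-gram distribution is
    closest (smallest KL divergence) to the test distribution."""
    return min(train_dists,
               key=lambda lang: kl_divergence(test_dist, train_dists[lang]))
```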
Novel Estimator of Vocabulary size | sequence drawn according to a probability distribution $P$ from a large, but finite, vocabulary $\Omega$. |
Novel Estimator of Vocabulary size | Our main interest is in probability distributions $P$. |
Novel Estimator of Vocabulary size | In particular, the authors consider a sequence of vocabulary sets $\Omega_n$ and probability distributions $P_n$, indexed by the observation size $n$. Specifically, the observation $(X_1, \ldots, X_n)$ |
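Written out under our assumed notation (the symbols $\Omega_n$ and $P_n$ are reconstructions from the garbled fragments above, not taken verbatim from the paper), the setup reads:

```latex
% An i.i.d. observation of size n, drawn from a finite-vocabulary
% distribution; both the vocabulary and the distribution may vary with n.
(X_1, \ldots, X_n) \overset{\text{i.i.d.}}{\sim} P_n,
\qquad \operatorname{supp}(P_n) \subseteq \Omega_n,
\qquad |\Omega_n| < \infty .
```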
Introduction | We investigate the use of distributional representations, which model the probability distribution of a word’s context, as a technique for finding smoothed representations of word sequences. |
Smoothing Natural Language Sequences | If $V$ is the vocabulary, or the set of word types, and $X$ is a sequence of random variables over $V$, the left and right context of $X_i = v$ may each be represented as a probability distribution over $V$: $P(X_{i-1} \mid X_i = v)$ and $P(X_{i+1} \mid X_i = v)$, respectively. |
Smoothing Natural Language Sequences | We then normalize each vector to form a probability distribution. |
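A minimal sketch of these left- and right-context distributions, assuming tokenized sentences as input; the function name and the toy corpus are our own illustration, not the paper's implementation.

```python
from collections import Counter, defaultdict

def context_distributions(sentences):
    """For each word type v, estimate P(X_{i-1} | X_i = v) and
    P(X_{i+1} | X_i = v) by counting left/right neighbors and
    normalizing each count vector into a probability distribution."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for tokens in sentences:
        for i, v in enumerate(tokens):
            if i > 0:
                left[v][tokens[i - 1]] += 1
            if i < len(tokens) - 1:
                right[v][tokens[i + 1]] += 1

    def normalize(counters):
        dists = {}
        for v, ctr in counters.items():
            total = sum(ctr.values())
            dists[v] = {w: c / total for w, c in ctr.items()}
        return dists

    return normalize(left), normalize(right)

# Toy usage
left_dist, right_dist = context_distributions([["the", "cat", "sat"],
                                               ["the", "dog", "sat"]])
print(right_dist["the"])  # {'cat': 0.5, 'dog': 0.5}
```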