Previous Work | They learned unigram language models (LMs) for specific time periods and scored articles against them with log-likelihood ratios.
Previous Work | Kanhabua and Nørvåg (2008; 2009) extended this approach with the same model, but expanded its unigrams with POS tags, collocations, and tf-idf scores.
Previous Work | As above, they learned unigram LMs, but instead measured the KL-divergence between a document and a time period’s LM. |
Timestamp Classifiers | The unigrams w are lowercased tokens. |
Timestamp Classifiers | We refer to this model as the Unigram NLLR.
Timestamp Classifiers | Follow-up work by Kanhabua and Nørvåg (2008) applied two filtering techniques to the unigrams in the model.
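Timestamp Classifiers | A minimal sketch of this scoring scheme, using one common form of the normalized log-likelihood ratio; the helper names and the add-one smoothing choice are ours, not the original implementation:

```python
from collections import Counter
import math

def unigram_lm(tokens, vocab):
    """Add-one-smoothed unigram LM (illustrative smoothing choice)."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def nllr(doc_tokens, period_lm, corpus_lm):
    """Normalized log-likelihood ratio of a document against one time
    period's LM, measured relative to the general corpus LM."""
    doc_counts = Counter(doc_tokens)
    n = len(doc_tokens)
    return sum((c / n) * math.log(period_lm[w] / corpus_lm[w])
               for w, c in doc_counts.items()
               if w in period_lm and w in corpus_lm)

# To timestamp a document, score it against every period's LM and
# take the period with the highest NLLR:
# best = max(periods, key=lambda t: nllr(doc, lms[t], corpus_lm))
```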
Abstract | We show how to model the task of inferring which objects are being talked about (and which words refer to which objects) as standard grammatical inference, and describe PCFG-based unigram models and adaptor grammar-based collocation models for the task. |
Introduction | The unigram model we describe below corresponds most closely to the Frank et al. (2009) model.
Introduction | 2.1 Topic models and the unigram PCFG |
Introduction | This leads to our first model, the unigram grammar, which is a PCFG.
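Introduction | As a concrete illustration of casting the problem as grammatical inference, here is a toy unigram PCFG in NLTK: the choice of topic nonterminal conditions a bag of words, and the Viterbi parse reveals the inferred topic. The grammar, topics, and probabilities are invented for the example and are not the grammar from the paper.

```python
import nltk

# Toy unigram grammar: pick a topic, then emit words one at a time
# from that topic's unigram distribution (rules/probabilities invented).
toy = nltk.PCFG.fromstring("""
    S    -> Dog  [0.5]
    S    -> Pig  [0.5]
    Dog  -> DogW      [0.4]
    Dog  -> DogW Dog  [0.6]
    Pig  -> PigW      [0.4]
    Pig  -> PigW Pig  [0.6]
    DogW -> 'dog' [0.7]
    DogW -> 'the' [0.3]
    PigW -> 'pig' [0.7]
    PigW -> 'the' [0.3]
""")

parser = nltk.ViterbiParser(toy)
for tree in parser.parse("the dog".split()):
    print(tree)  # the most probable parse identifies the topic
```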
System Architecture | To derive word features, our system first automatically collects a list of word unigrams and bigrams from the training data.
System Architecture | To avoid overfitting, we only collect the word unigrams and bigrams whose frequency is greater than 2 in the training set.
System Architecture | This list of word unigrams and bigrams is then used as a unigram dictionary and a bigram dictionary to generate word-based unigram and bigram features.
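System Architecture | A minimal sketch of this dictionary construction and feature generation, assuming whitespace-tokenized training sentences; the function and feature names are illustrative:

```python
from collections import Counter

def build_ngram_dicts(train_sents, min_freq=3):
    """Collect word unigrams/bigrams whose training frequency is
    greater than 2, to serve as feature dictionaries."""
    uni, bi = Counter(), Counter()
    for sent in train_sents:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
    uni_dict = {w for w, c in uni.items() if c >= min_freq}
    bi_dict = {b for b, c in bi.items() if c >= min_freq}
    return uni_dict, bi_dict

def ngram_features(sent, uni_dict, bi_dict):
    """Binary features: which dictionary n-grams occur in the sentence."""
    feats = {f"uni={w}" for w in sent if w in uni_dict}
    feats |= {f"bi={a}_{b}" for a, b in zip(sent, sent[1:])
              if (a, b) in bi_dict}
    return feats
```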
Experiments | In particular, we use the unigrams of the current word and its neighboring words, word bigrams, prefixes and suffixes of the current word, capitalization, all-number, punctuation, and tag bigrams for the POS, CoNLL 2000, and CoNLL 2003 datasets.
Experiments | For the supertagging dataset, we use the same features for the word inputs, and unigrams and bigrams for the gold POS inputs.
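Experiments | A sketch of such feature templates for a single position; the window size of ±2, the affix lengths, and the feature naming are assumptions, and label-side tag bigrams are left to the sequence model:

```python
def token_features(words, i):
    """Feature templates at position i: word unigrams in a window,
    a word bigram, affixes, and shape cues (window size assumed)."""
    feats = []
    for d in range(-2, 3):                       # unigrams of neighbors
        if 0 <= i + d < len(words):
            feats.append(f"w[{d}]={words[i + d]}")
    if i > 0:                                    # word bigram
        feats.append(f"w[-1:0]={words[i - 1]}_{words[i]}")
    w = words[i]
    feats += [f"pre{k}={w[:k]}" for k in (1, 2, 3) if len(w) >= k]
    feats += [f"suf{k}={w[-k:]}" for k in (1, 2, 3) if len(w) >= k]
    if w[0].isupper():
        feats.append("cap")
    if w.isdigit():
        feats.append("all-number")
    if all(not ch.isalnum() for ch in w):
        feats.append("punctuation")
    return feats
```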
Problem formulation | To simplify the discussion, we divide the features into two groups: unigram label features and bigram label features.
Problem formulation | Unigram features are of the form f_k(y_t, x_t); they are concerned with the current label and arbitrary feature patterns from the input sequence.
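Problem formulation | Writing g_j for the bigram label features and μ_j for their weights (notation ours), the two groups enter the standard linear-chain CRF as:

```latex
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
\exp\Bigl( \sum_{t}\sum_{k} \lambda_k f_k(y_t, x_t)
         + \sum_{t}\sum_{j} \mu_j g_j(y_{t-1}, y_t, x_t) \Bigr)
```

Problem formulation | Here Z(x) is the partition function that normalizes over all label sequences.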
Experiments | We use the Lemur toolkit (Ogilvie and Callan, 2001), version 4.11, as the basic retrieval tool, and select Lemur's default unigram LM approach, based on KL-divergence with Dirichlet-prior smoothing, as our basic retrieval method.
The Language Modeling Approach to IR | The most commonly used language model in IR is the unigram model, in which terms are assumed to be independent of each other. |
The Language Modeling Approach to IR | In the rest of this paper, language model will refer to the unigram language model. |
The Language Modeling Approach to IR | With the unigram model, the negative KL-divergence between the model θ_q of query q and the model θ_d of document d is calculated as follows:
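The Language Modeling Approach to IR | In the standard formulation this is:

```latex
-D(\theta_q \,\|\, \theta_d)
  = \sum_{w \in V} p(w \mid \theta_q) \log p(w \mid \theta_d)
  - \sum_{w \in V} p(w \mid \theta_q) \log p(w \mid \theta_q)
```

The Language Modeling Approach to IR | The second sum depends only on the query, so ranking documents by this score reduces to ranking by the first (cross-entropy) term alone.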
Experiments | We used a simple measure for isolating the syntactic likelihood of a sentence: we take the log-probability under our model and subtract the log-probability under a unigram model, then normalize by the length of the sentence. This measure, which we call the syntactic log-odds ratio (SLR), is a crude way of “subtracting out” the semantic component of the generative probability, so that sentences that use rare words are not penalized for doing so.
Experiments | (2004) also report using a parser probability normalized by the unigram probability (but not length), and did not find it effective. |
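Experiments | A minimal sketch of the SLR computation as defined above; the unsmoothed MLE unigram baseline is an assumption of the sketch:

```python
import math

def unigram_logprob(tokens, unigram_counts, total_count):
    """Log-probability of a sentence under an MLE unigram model
    (smoothing omitted; assumes every token was seen in training)."""
    return sum(math.log(unigram_counts[w] / total_count) for w in tokens)

def slr(tokens, model_logprob, unigram_counts, total_count):
    """Syntactic log-odds ratio: subtract the unigram log-probability
    from the syntactic model's, then normalize by sentence length."""
    base = unigram_logprob(tokens, unigram_counts, total_count)
    return (model_logprob - base) / len(tokens)
```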
Treelet Language Modeling | We back off from p(w|P, R, r', w_1, w_2) to p(w|P, R, r', w_1) and then p(w|P, R, r'). From there, we back off to p(w|P, R), where R is the sibling immediately to the right of P, then to a raw PCFG p(w|P), and finally to a unigram distribution.
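Treelet Language Modeling | A generic sketch of a backoff chain like this one, with contexts ordered richest to coarsest; discounting and interpolation weights are omitted, so this illustrates the control flow rather than the paper's estimator:

```python
def backoff_prob(w, contexts, tables, unigram):
    """Walk the conditioning contexts from richest to coarsest and
    return the first available estimate for w; `tables` maps each
    context tuple to a dict of word probabilities. Fall back to the
    unigram distribution if no context has an estimate."""
    for ctx in contexts:
        probs = tables.get(ctx)
        if probs and w in probs:
            return probs[w]
    return unigram.get(w, 0.0)

# e.g. contexts = [(P, R, rp, w1, w2), (P, R, rp, w1),
#                  (P, R, rp), (P, R), (P,)]
```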
Structure-based Stacking | • Character unigrams: c_k (i − l ≤ k ≤ i + l)
Structure-based Stacking | • Character bigrams: c_k c_{k+1} (i − l ≤ k < i + l)
Structure-based Stacking | • Character label unigrams: c_k^ppd (i − l^ppd ≤ k ≤ i + l^ppd)
Structure-based Stacking | • Unigram features: C(s_k) (i − l_0 ≤ k ≤ i + l_0), T_ctb(s_k) (i − l_1 ≤ k ≤ i + l_1)