A Motivating Example | (unigrams) …opment, civilization
Feature Expansion | …, w_N}, where the elements w_i are either unigrams or bigrams that appear in the review d. We then represent a review d by a real-valued term-frequency vector d ∈ ℝ^N, where the value of the j-th element d_j is set to the total number of occurrences of the unigram or bigram w_j in the review d. To find suitable candidates to expand a vector d for the review d, we define a ranking score score(u_i, d) for each base entry in the thesaurus as follows:
Feature Expansion | Moreover, we weight the relatedness scores for each word w_j by its normalized term frequency to emphasize the salient unigrams and bigrams in a review.
Feature Expansion | This is particularly important because we would like to score base entries u_i considering all the unigrams and bigrams that appear in a review d, instead of considering each unigram or bigram individually.
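Feature Expansion | A minimal sketch of this weighting, assuming a relatedness function supplied by the thesaurus (the function name and its arguments are illustrative, not the paper's notation):

```python
from collections import Counter

def expansion_score(base_entry, review_elements, relatedness):
    """Score a thesaurus base entry u_i against a review d.

    `review_elements` is the list of unigrams and bigrams extracted from d;
    `relatedness(u, w)` is an assumed callable returning the thesaurus
    relatedness between base entry u and lexical element w.
    """
    counts = Counter(review_elements)          # term frequencies d_j
    total = sum(counts.values())
    # Weight each element's relatedness by its normalized term frequency,
    # so salient unigrams/bigrams in the review dominate the score.
    return sum((freq / total) * relatedness(base_entry, w)
               for w, freq in counts.items())
```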
Introduction | a unigram or a bigram of word lemma) in a review using a feature vector. |
Sentiment Sensitive Thesaurus | We select unigrams and bigrams from each sentence. |
Sentiment Sensitive Thesaurus | For the remainder of this paper, we will refer to unigrams and bigrams collectively as lexical elements. |
Sentiment Sensitive Thesaurus | Previous work on sentiment classification has shown that both unigrams and bigrams are useful for training a sentiment classifier (Blitzer et al., 2007). |
Experimental Setup | 256,873 unique unigrams and 4,494,222 unique bigrams. |
Experimental Setup | We cluster unigrams (i = 1) and bigrams (i = 2).
Experimental Setup | For all experiments, |B_1| = |B_2| (except in cases where |B_2| exceeds the number of unigrams; see below).
Models | The parameters d′, d″, and d‴ are the discounts for unigrams, bigrams, and trigrams, respectively, as defined by Chen and Goodman (1996, p. 20, (26)).
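Models | As a sketch of how such order-specific discounts enter an interpolated absolute-discounting model (notation assumed; this is the general form, not necessarily the paper's exact equation):

```latex
% Interpolated absolute discounting with the per-order discount d''' at the
% trigram level; the bigram and unigram levels use d'' and d' analogously.
p(w_3 \mid w_1 w_2) =
  \frac{\max\{c(w_1 w_2 w_3) - d''',\, 0\}}{c(w_1 w_2)}
  + \gamma(w_1 w_2)\, p(w_3 \mid w_2),
\qquad
\gamma(w_1 w_2) = \frac{d'''\, N_{1+}(w_1 w_2\,\bullet)}{c(w_1 w_2)}
```

Models | Here N_{1+}(w_1 w_2 •) counts the distinct word types observed after the history w_1 w_2, so the discounted probability mass is redistributed through the backoff term.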
Models | B_2) is the set of unigram (resp.
Models | We cluster bigram histories and unigram histories separately and write p_B(w_3 | w_1 w_2) for the bigram cluster model and p_B(w_3 | w_2) for the unigram cluster model.
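Models | One way to read these cluster models (a sketch; the history-clustering functions C_1 and C_2 are assumed names, not the paper's notation) is that each history is mapped to its cluster before conditioning:

```latex
% Bigram histories w_1 w_2 and unigram histories w_2 are mapped to clusters
% before conditioning; C_2 clusters bigram histories, C_1 clusters unigram histories.
p_B(w_3 \mid w_1 w_2) = p\bigl(w_3 \mid C_2(w_1 w_2)\bigr),
\qquad
p_B(w_3 \mid w_2) = p\bigl(w_3 \mid C_1(w_2)\bigr)
```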
Related work | symbol: Σ_w; denotation: sum over all unigrams w
Background | This work differs from previous Bayesian models in that we explicitly model a complex backoff path using a hierarchical prior, such that our model jointly infers distributions over tag trigrams, bigrams, and unigrams, as well as over whole words and their character-level representation.
Experiments | Note that the bigram PYP-HMM outperforms the closely related BHMM (the main difference being that we smooth tag bigrams with unigrams).
The PYP-HMM | The trigram transition distribution, T_{ij}, is drawn from a hierarchical PYP prior which backs off to a bigram B_j and then a unigram U distribution,
The PYP-HMM | This allows the modelling of trigram tag sequences, while smoothing these estimates with their corresponding bigram and unigram distributions. |
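The PYP-HMM | A sketch of the backoff structure just described (the PYP discount/concentration names a and b and the uniform base distribution are assumptions, not necessarily the paper's exact parameterisation):

```latex
% Trigram transition rows back off to a shared bigram distribution, which in
% turn backs off to a single unigram distribution over tags.
T_{ij} \sim \mathrm{PYP}(a_T, b_T, B_j), \qquad
B_j    \sim \mathrm{PYP}(a_B, b_B, U),   \qquad
U      \sim \mathrm{PYP}(a_U, b_U, \mathrm{Uniform})
```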
The PYP-HMM | That is, each table at one level is equivalent to a customer at the next deeper level, creating the invariants that the number of tables serving a tag in context h equals the number of customers for that tag in its backoff context u, and similarly one level further down, where u = t_{l-1} indicates the unigram backoff context of h. The recursion terminates at the lowest level, where the base distribution is static.
Abstract | Previous work in traditional text classification and its variants, such as sentiment analysis, has achieved successful results by using the bag-of-words representation; that is, by treating text as a collection of words with no interdependencies, and training a classifier on a large feature set of word unigrams that appear in the corpus.
Abstract | Few of these tactics would be effectively encapsulated by word unigrams.
Abstract | Many would be better modeled by POS tag unigrams (with no word information) or by longer n-grams consisting of either words, POS tags, or a combination of the two. |
Approach | (a) Word unigrams (b) Word bigrams |
Approach | (a) PoS unigrams (b) PoS bigrams (c) PoS trigrams |
Approach | Word unigrams and bigrams are lower-cased and used in their inflected forms. |
Previous work | The Bayesian Essay Test Scoring sYstem (BETSY) (Rudner and Liang, 2002) uses multinomial or Bernoulli Naive Bayes models to classify texts into different classes (e.g. pass/fail, grades A-F) based on content and style features such as word unigrams and bigrams, sentence length, number of verbs, noun-verb pairs, etc.
Validity tests | (a) word unigrams within a sentence (b) word bigrams within a sentence (c) word trigrams within a sentence |
Automated Approaches to Deceptive Opinion Spam Detection | Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set.
Automated Approaches to Deceptive Opinion Spam Detection | We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
Automated Approaches to Deceptive Opinion Spam Detection | We use SVMlight (Joachims, 1999) to train our linear SVM models on all three approaches and feature sets described above, namely POS, LIWC, UNIGRAMS, BIGRAMS+, and TRIGRAMS+.
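Automated Approaches to Deceptive Opinion Spam Detection | A minimal sketch of building the subsuming n-gram feature sets and training a linear SVM; scikit-learn's LinearSVC stands in for SVMlight here, and all names are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# The "+" sets subsume the lower-order n-grams; features are lowercased and unstemmed.
FEATURE_SETS = {
    "UNIGRAMS":  (1, 1),
    "BIGRAMS+":  (1, 2),   # unigrams + bigrams
    "TRIGRAMS+": (1, 3),   # unigrams + bigrams + trigrams
}

def build_model(feature_set):
    lo, hi = FEATURE_SETS[feature_set]
    vectorizer = CountVectorizer(ngram_range=(lo, hi), lowercase=True)
    return make_pipeline(vectorizer, LinearSVC())

# Usage (train_texts and train_labels are placeholders):
# model = build_model("BIGRAMS+")
# model.fit(train_texts, train_labels)
```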
Experiments | The DISCRIMINATIVE baseline for this task is a standard maximum entropy discriminative binary classifier over unigrams.
Model | Global Distributions: At the global level, we draw several unigram distributions: a global background distribution θ_B and attribute distributions θ_a for each attribute.
Model | Product Level: For the ith product, we draw property unigram distributions θ_{i,1}, …
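Model | A small sketch of drawing such unigram distributions from symmetric Dirichlet priors (all names, sizes, and the Dirichlet choice itself are assumptions for illustration, not the paper's specification):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5000          # vocabulary size (assumed)
N_ATTRS = 3       # number of attributes (assumed)
N_PROPS = 4       # property distributions per product (assumed)
ALPHA = 0.1       # symmetric Dirichlet concentration (assumed)

# Global level: background distribution and one unigram distribution per attribute.
theta_bg = rng.dirichlet(ALPHA * np.ones(V))
theta_attr = [rng.dirichlet(ALPHA * np.ones(V)) for _ in range(N_ATTRS)]

# Product level: property unigram distributions for a single product.
def draw_product_properties():
    return [rng.dirichlet(ALPHA * np.ones(V)) for _ in range(N_PROPS)]

product_props = {i: draw_product_properties() for i in range(10)}
```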