Finding Deceptive Opinion Spam by Any Stretch of the Imagination
Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T. Hancock

Article Structure

Abstract

Consumers increasingly rate, review and research products online (Jansen, 2010; Litvin et al., 2008).

Introduction

With the ever-increasing popularity of review websites that feature user-generated opinions (e.g., TripAdvisor and Yelp), there comes an increasing potential for monetary gain through opinion spam—inappropriate or fraudulent reviews.

Related Work

Spam has historically been studied in the contexts of email (Drucker et al., 2002), and the Web (Gyöngyi et al., 2004; Ntoulas et al., 2006).

Dataset Construction and Human Performance

While truthful opinions are ubiquitous online, deceptive opinions are difficult to obtain without resorting to heuristic methods (Jindal and Liu, 2008; Wu et al., 2010).

Automated Approaches to Deceptive Opinion Spam Detection

We consider three automated approaches to detecting deceptive opinion spam, each of which utilizes classifiers (described in Section 4.4) trained on the dataset of Section 3.

Results and Discussion

The deception detection strategies described in Section 4 are evaluated using a 5-fold nested cross-validation (CV) procedure (Quadrianto et al., 2009), where model parameters are selected for each test fold based on standard CV experiments on the training folds.
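
A minimal sketch of this nested cross-validation protocol, assuming scikit-learn (the paper's own experiments use SVMlight, so this only illustrates the evaluation procedure; the names texts and labels are hypothetical stand-ins for the 800 reviews and their truthful/deceptive labels):

```python
# Nested 5-fold cross-validation: the inner loop picks model parameters on the
# training folds only; the outer loop reports accuracy on held-out test folds.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy stand-ins for the 800 reviews and their labels (1 = deceptive).
texts = ["the room was spotless and the staff were helpful",
         "my husband and i absolutely loved this luxurious hotel"] * 10
labels = [0, 1] * 10

pipeline = Pipeline([
    ("ngrams", CountVectorizer(lowercase=True, ngram_range=(1, 2))),  # BIGRAMS+-style features
    ("svm", LinearSVC()),
])

# Inner loop: choose the SVM cost parameter C using CV on the training folds only.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = GridSearchCV(pipeline, {"svm__C": [0.01, 0.1, 1, 10]}, cv=inner_cv)

# Outer loop: each of the 5 test folds is scored with parameters chosen on the other folds.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, texts, labels, cv=outer_cv)
print(scores.mean())
```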

Conclusion and Future Work

In this work we have developed the first large-scale dataset containing gold-standard deceptive opinion spam.

Topics

human judges

Appears in 12 sentences as: human judge (1) human judges (10) human judgments (2)
In Finding Deceptive Opinion Spam by Any Stretch of the Imagination
  1. In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges, who perform roughly at-chance—a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).
    Page 2, “Introduction”
  2. However, while these studies compare n-gram-based deception classifiers to a random guess baseline of 50%, we additionally evaluate and compare two other computational approaches (described in Section 4), as well as the performance of human judges (described in Section 3.3).
    Page 3, “Related Work”
  3. Unfortunately, most measures of quality employed in those works are based exclusively on human judgments, which we find in Section 3 to be poorly calibrated to detecting deceptive opinion spam.
    Page 3, “Related Work”
  4. In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions.
    Page 3, “Dataset Construction and Human Performance”
  5. Additionally, to test the extent to which the individual human judges are biased, we evaluate the performance of two virtual meta-judges.
    Page 5, “Dataset Construction and Human Performance”
  6. Specifically, the MAJORITY meta-judge predicts “deceptive” when at least two out of three human judges believe the review to be deceptive, and the SKEPTIC meta-judge predicts “deceptive” when any human judge believes the review to be deceptive.
    Page 5, “Dataset Construction and Human Performance” (a small sketch of these two meta-judges follows this list)
  7. It is clear from the results that human judges are not particularly effective at this task.
    Page 5, “Dataset Construction and Human Performance”
  8. Furthermore, all three judges suffer from truth-bias (Vrij, 2008), a common finding in deception detection research in which human judges are more likely to classify an opinion as truthful than deceptive.
    Page 5, “Dataset Construction and Human Performance”
  9. We suspect that agreement among our human judges is so low precisely because humans are poor judges of deception (Vrij, 2008), and therefore they perform nearly at-chance respective to one another.
    Page 5, “Dataset Construction and Human Performance”
  10. We observe that automated classifiers outperform human judges for every metric, except truthful recall where JUDGE 2 performs best. However, this is expected given that untrained humans often focus on unreliable cues to deception (Vrij, 2008).
    Page 7, “Results and Discussion”
  11. automated classifier outperforms most human judges (one-tailed sign test p = 0.06, 0.01, 0.001 for the three judges, respectively, on the first fold).
    Page 7, “Results and Discussion”
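
A small sketch of the two virtual meta-judges quoted in item 6 above; the judge_votes layout is hypothetical, with each tuple holding the three human judges' labels for one review and True meaning "deceptive":

```python
# Minimal sketch of the MAJORITY and SKEPTIC meta-judges (hypothetical data layout).
judge_votes = [
    (True, False, False),
    (True, True, False),
    (False, False, False),
]

# MAJORITY: predict "deceptive" when at least two of the three judges say deceptive.
majority = ["deceptive" if sum(votes) >= 2 else "truthful" for votes in judge_votes]

# SKEPTIC: predict "deceptive" when any judge says deceptive.
skeptic = ["deceptive" if any(votes) else "truthful" for votes in judge_votes]

print(majority)  # ['truthful', 'deceptive', 'truthful']
print(skeptic)   # ['deceptive', 'deceptive', 'truthful']
```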

gold-standard

Appears in 9 sentences as: gold-standard (9)
In Finding Deceptive Opinion Spam by Any Stretch of the Imagination
  1. Integrating work from psychology and computational linguistics, we develop and compare three approaches to detecting deceptive opinion spam, and ultimately develop a classifier that is nearly 90% accurate on our gold-standard opinion spam dataset.
    Page 1, “Abstract”
  2. Indeed, in the absence of gold-standard data, related studies (see Section 2) have been forced to utilize ad hoc procedures for evaluation.
    Page 2, “Introduction”
  3. In contrast, one contribution of the work presented here is the creation of the first large-scale, publicly available dataset for deceptive opinion spam research, containing 400 truthful and 400 gold-standard deceptive reviews.
    Page 2, “Introduction”
  4. Using product review data, and in the absence of gold-standard deceptive opinions, they train models using features based on the review text, reviewer, and product, to distinguish between duplicate opinions (considered deceptive spam) and non-duplicate opinions (considered truthful).
    Page 2, “Related Work”
  5. of gold-standard data, based on the distortion of popularity rankings.
    Page 3, “Related Work”
  6. Both of these heuristic evaluation approaches are unnecessary in our work, since we compare gold-standard deceptive and truthful opinions.
    Page 3, “Related Work”
  7. In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions.
    Page 3, “Dataset Construction and Human Performance”
  8. To solicit gold-standard deceptive opinion spam using AMT, we create a pool of 400 Human-Intelligence Tasks (HITs) and allocate them evenly across our 20 chosen hotels.
    Page 3, “Dataset Construction and Human Performance”
  9. In this work we have developed the first large-scale dataset containing gold-standard deceptive opinion spam.
    Page 9, “Conclusion and Future Work”

Turkers

Appears in 8 sentences as: Turker (3) Turkers (5)
In Finding Deceptive Opinion Spam by Any Stretch of the Imagination
  1. Crowdsourcing services such as AMT have made large-scale data annotation and collection efforts financially affordable by granting anyone with basic programming skills access to a marketplace of anonymous online workers (known as Turkers ) willing to complete small tasks.
    Page 3, “Dataset Construction and Human Performance”
  2. To ensure that opinions are written by unique authors, we allow only a single submission per Turker.
    Page 3, “Dataset Construction and Human Performance”
  3. We also restrict our task to Turkers who are located in the United States, and who maintain an approval rating of at least 90%.
    Page 3, “Dataset Construction and Human Performance”
  4. Turkers are allowed a maximum of 30 minutes to work on the HIT, and are paid one US dollar for an accepted submission.
    Page 3, “Dataset Construction and Human Performance”
  5. Each HIT presents the Turker with the name and website of a hotel.
    Page 3, “Dataset Construction and Human Performance”
  6. The HIT instructions ask the Turker to assume that they work for the hotel’s marketing department, and to pretend that their boss wants them to write a fake review (as if they were a customer) to be posted on a travel review website; additionally, the review needs to sound realistic and portray the hotel in a positive light.
    Page 3, “Dataset Construction and Human Performance”
  7. Unfortunately, we found that some Turkers selected among the choices seemingly at random, presumably to maximize their hourly earnings by obviating the need to read the review.
    Page 4, “Dataset Construction and Human Performance”
  8. Unlike the Turkers, our student volunteers are not offered a monetary reward.
    Page 5, “Dataset Construction and Human Performance”

BIGRAMS+

Appears in 7 sentences as: BIGRAMS+ (7)
In Finding Deceptive Opinion Spam by Any Stretch of the Imagination
  1. Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set.
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
  2. We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
  3. We use SVMlight (Joachims, 1999) to train our linear SVM models on all three approaches and feature sets described above, namely POS, LIWC, UNIGRAMS, BIGRAMS+, and TRIGRAMS+.
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
  4. For LIWC+BIGRAMS+, we unit-length normalize LIWC and BIGRAMS+ features individually before combining them.
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection” (a sketch of this feature combination follows this list)
  5. This suggests that a universal set of keyword-based deception cues (e.g., LIWC) is not the best approach to detecting deception, and a context-sensitive approach (e.g., BIGRAMS+) might be necessary to achieve state-of-the-art deception detection performance.
    Page 8, “Results and Discussion”
  6. Additional work is required, but these findings further suggest the importance of moving beyond a universal set of deceptive language features (e.g., LIWC) by considering both the contextual (e.g., BIGRAMS+) and motivational parameters underlying a deception as well.
    Page 9, “Results and Discussion”
  7. Specifically, our findings suggest the importance of considering both the context (e.g., BIGRAMS+) and motivations underlying a deception, rather than strictly adhering to a universal set of deception cues (e.g., LIWC).
    Page 9, “Conclusion and Future Work”
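
A sketch of the subsuming n-gram feature sets and of the LIWC+BIGRAMS+ combination described in items 1, 3, and 4 above. It assumes scikit-learn rather than SVMlight, and liwc_features is a hypothetical function returning a review's LIWC category counts; it illustrates the setup rather than reproducing the authors' exact configuration:

```python
# Subsuming n-gram feature sets (lowercased, unstemmed) and the LIWC+BIGRAMS+
# combination: each feature block is unit-length normalized before concatenation.
# Assumes scikit-learn; `liwc_features(text)` is a hypothetical stand-in for LIWC.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

unigrams      = CountVectorizer(lowercase=True, ngram_range=(1, 1))  # UNIGRAMS
bigrams_plus  = CountVectorizer(lowercase=True, ngram_range=(1, 2))  # BIGRAMS+  (unigrams + bigrams)
trigrams_plus = CountVectorizer(lowercase=True, ngram_range=(1, 3))  # TRIGRAMS+ (adds trigrams)

def liwc_bigrams_plus(texts, liwc_features):
    """Unit-length normalize the LIWC and BIGRAMS+ blocks separately, then concatenate."""
    liwc_block  = normalize(np.array([liwc_features(t) for t in texts]))
    ngram_block = normalize(bigrams_plus.fit_transform(texts)).toarray()
    return np.hstack([liwc_block, ngram_block])

# Hypothetical usage with a linear SVM (the paper trains its SVMs with SVMlight):
# clf = LinearSVC().fit(liwc_bigrams_plus(train_texts, liwc_features), train_labels)
```

Switching between the UNIGRAMS, BIGRAMS+, and TRIGRAMS+ models then amounts to swapping the vectorizer passed to the classifier.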

n-gram

Appears in 5 sentences as: n-gram (5)
In Finding Deceptive Opinion Spam by Any Stretch of the Imagination
  1. Notably, a combined classifier with both n-gram and psychological deception features achieves nearly 90% cross-validated accuracy on this task.
    Page 2, “Introduction”
  2. In contrast to the other strategies just discussed, our text categorization approach to deception detection allows us to model both content and context with n-gram features.
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
  3. Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set.
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
  4. We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
  5. Surprisingly, models trained only on UNIGRAMS—the simplest n-gram feature set—outperform all non-text-categorization approaches, and models trained on BIGRAMS+ perform even better (one-tailed sign test p = 0.07).
    Page 8, “Results and Discussion” (a minimal sketch of this sign test follows this list)
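
A minimal sketch of the one-tailed sign test used in item 5 to compare two classifiers on the same reviews, assuming scipy; correct_a and correct_b are hypothetical boolean lists marking which test reviews each classifier labeled correctly:

```python
# One-tailed sign test on paired predictions (a sketch; assumes scipy >= 1.7).
from scipy.stats import binomtest

def one_tailed_sign_test(correct_a, correct_b):
    """p-value that classifier A beats B, counting only examples where they disagree."""
    wins_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    wins_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    # Under the null hypothesis, the disagreements split 50/50 between A and B.
    return binomtest(wins_a, wins_a + wins_b, p=0.5, alternative="greater").pvalue

# Hypothetical usage: A correct on 6 disagreements, B correct on 1.
print(one_tailed_sign_test([True] * 6 + [False] * 1 + [True] * 3,
                           [False] * 6 + [True] * 1 + [True] * 3))  # ~0.0625
```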

feature sets

Appears in 3 sentences as: feature set (2) feature sets (3)
In Finding Deceptive Opinion Spam by Any Stretch of the Imagination
  1. Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set.
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
  2. We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
  3. We use SVMlight (Joachims, 1999) to train our linear SVM models on all three approaches and feature sets described above, namely POS, LIWC, UNIGRAMS, BIGRAMS+, and TRIGRAMS+.
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”

language models

Appears in 3 sentences as: language model (1) Language Modeling (1) language models (2)
In Finding Deceptive Opinion Spam by Any Stretch of the Imagination
  1. Under (2), both the NB classifier used by Mihalcea and Strapparava (2009) and the language model classifier used by Zhou et al.
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
  2. (2008), we use the SRI Language Modeling Toolkit (Stolcke, 2002) to estimate individual language models, Pr(x | y = c), for truthful and deceptive opinions.
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection” (a self-contained sketch of this classification rule follows this list)
  3. We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
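
A self-contained sketch of the classification rule behind this approach: fit one language model per class and assign a review to the class under which it is more probable. The paper estimates its models with the SRI toolkit and interpolated Kneser-Ney smoothing; this sketch substitutes an add-one-smoothed unigram model and hypothetical toy corpora purely to keep the decision rule visible:

```python
import math
from collections import Counter

def train_unigram_lm(docs):
    """Return (token counts, total token count) for one class of tokenized reviews."""
    counts = Counter(tok for doc in docs for tok in doc)
    return counts, sum(counts.values())

def log_likelihood(doc, counts, total, vocab_size):
    """log Pr(doc | class) under an add-one-smoothed unigram model (a stand-in for
    the interpolated Kneser-Ney models estimated with SRILM in the paper)."""
    return sum(math.log((counts[tok] + 1) / (total + vocab_size)) for tok in doc)

def classify(doc, truthful_lm, deceptive_lm, vocab_size):
    """Assign the class whose language model gives the review higher probability."""
    ll_t = log_likelihood(doc, *truthful_lm, vocab_size)
    ll_d = log_likelihood(doc, *deceptive_lm, vocab_size)
    return "deceptive" if ll_d > ll_t else "truthful"

# Hypothetical toy corpora of tokenized reviews.
truthful_docs = [["great", "location", "friendly", "staff"]]
deceptive_docs = [["my", "husband", "and", "i", "loved", "the", "luxury"]]
vocab = {tok for doc in truthful_docs + deceptive_docs for tok in doc}
t_lm, d_lm = train_unigram_lm(truthful_docs), train_unigram_lm(deceptive_docs)
print(classify(["friendly", "staff"], t_lm, d_lm, len(vocab)))  # -> "truthful"
```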

UNIGRAMS

Appears in 3 sentences as: UNIGRAMS (3)
In Finding Deceptive Opinion Spam by Any Stretch of the Imagination
  1. Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set.
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
  2. We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
  3. We use SVMlight (Joachims, 1999) to train our linear SVM models on all three approaches and feature sets described above, namely POS, LIWC, UNIGRAMS, BIGRAMS+, and TRIGRAMS+.
    Page 6, “Automated Approaches to Deceptive Opinion Spam Detection”
