Experiments | Our experimental procedure was as follows: 162 turkers were partitioned into four groups, each corresponding to a treatment condition: OPT (N=34), HF (N=41), RANDOM (N=43), MAN (N=44). |
Experiments | Font size correlates with the score given by judge turkers in evaluating guesses of other turkers who were presented with the same text, but with the word replaced by a blank.
Experiments | Turkers were solicited to participate in a study that involved “reading a short story with a twist” (title of HIT). |
Model | For collecting data about which words are likely to be “predicted” given their context, we developed an Amazon Mechanical Turk task that presented turkers with excerpts of a short story (English translation of “The Man who Repented” by |
Model | Turkers were required to type in their best guess, and the number of semantically similar guesses was counted by, on average, 6 other turkers.
Model | Turkers that judged the semantic similarity of the guesses of other turkers achieved an average Cohen’s kappa agreement of 0.44, indicating fair to poor agreement. |
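The kappa statistic reported above can be made concrete. Below is a minimal, stdlib-only sketch of Cohen's kappa for two judges labeling guesses as semantically similar (1) or not (0); the ratings shown are invented for illustration, not the study's data:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters' label sequences a and b."""
    assert len(a) == len(b)
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items both raters labeled identically.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under chance, from each rater's label distribution.
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)

judge1 = [1, 1, 0, 1, 0, 0, 1, 0]
judge2 = [1, 0, 0, 1, 0, 1, 1, 1]
print(cohens_kappa(judge1, judge2))  # → 0.25
```

Kappa discounts the agreement two raters would reach by chance alone, which is why it is preferred over raw percent agreement for judgments like these.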
Abstract | customer-generated truthful reviews, Turker-generated deceptive reviews, and employee (domain-expert) generated deceptive reviews.
Dataset Construction | 3.1 Turker set, using Mechanical Turk |
Dataset Construction | Anyone with basic programming skills can create Human Intelligence Tasks (HITs) and access a marketplace of anonymous online workers (Turkers) willing to complete the tasks.
Dataset Construction | to create their dataset, such as restricting the task to Turkers located in the United States who maintain an approval rating of at least 90%.
Experiments | Specifically, we reframe it as an intra-domain multi-class classification task, where given the labeled training data from one domain, we learn a classifier to classify reviews according to their source, i.e., Employee, Turker, and Customer.
Feature-based Additive Model | If we instead use SVM, for example, we would have to train classifiers one by one (due to the distinct features from different sources) to draw conclusions regarding the differences between Turker vs Expert vs truthful reviews, positive expert vs negative expert reviews, or reviews from different domains. |
Feature-based Additive Model | y_source ∈ {employee, turker, customer}
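The additive model itself is not specified in these excerpts, so as a hedged illustration of the three-way source classification over y_source ∈ {employee, turker, customer}, here is a stdlib-only multinomial Naive Bayes baseline; all documents, vocabulary, and the toy query are invented assumptions, not the paper's data or method:

```python
from collections import Counter
import math

def train_nb(docs, labels):
    """Fit class priors and per-class word counts from labeled documents."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    vocab = set()
    for doc, lab in zip(docs, labels):
        words = doc.lower().split()
        word_counts[lab].update(words)
        vocab.update(words)
    return priors, word_counts, vocab

def predict_nb(model, doc):
    """Return the most probable source label under the fitted model."""
    priors, word_counts, vocab = model
    V = len(vocab)
    best, best_lp = None, -math.inf
    for c in priors:
        total = sum(word_counts[c].values())
        lp = math.log(priors[c])
        for w in doc.lower().split():
            # Laplace (add-one) smoothing for unseen words.
            lp += math.log((word_counts[c][w] + 1) / (total + V))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy reviews (invented) from the three sources.
docs = [
    "our flagship hotel offers premier amenities",
    "the flagship property features premier dining",
    "fantastic stay absolutely fantastic experience",
    "fantastic rooms and fantastic service",
    "we checked in late and checked out early",
    "checked the bathroom which was clean",
]
labels = ["employee", "employee", "turker", "turker", "customer", "customer"]
model = train_nb(docs, labels)
print(predict_nb(model, "a fantastic fantastic hotel"))  # → turker
```

Unlike the per-pair SVM setup discussed above, a single generative model of this shape yields one classifier over all three sources at once, which is the kind of joint comparison the additive model is motivated by.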
Introduction | Despite the advantages of soliciting deceptive gold-standard material from Turkers (it is easy, large-scale, and affordable), it is unclear whether Turkers are representative of the general population that generates fake reviews; in other words, Ott et al.’s dataset may correspond to only one type of online deceptive opinion spam — fake reviews written by people who have never visited or experienced the entities in question.
Introduction | In contrast to existing work (Ott et al., 2011; Li et al., 2013b), our new gold standard includes three types of reviews: domain-expert deceptive opinion spam (Employee), crowdsourced deceptive opinion spam (Turker), and truthful customer reviews (Customer).
Related Work | created a gold-standard collection by employing Turkers to write fake reviews, and followup research was based on their data (Ott et al., 2012; Ott et al., 2013; Li et al., 2013b; Feng and Hirst, 2013). |
Crowdsourcing Translation | 52 different Turkers took part in the translation task, each translating 138 sentences on average. |
Crowdsourcing Translation | In the editing task, 320 Turkers participated, averaging 56 sentences each. |
Problem Formulation | We form two graphs: the first graph (GT) represents Turkers (translator/editor pairs) as nodes; the second graph (G0) represents candidate translated and |
Problem Formulation | GT = (VT, ET) is a weighted undirected graph representing collaborations between Turkers.
Problem Formulation | The mutual reinforcement framework couples the two random walks on GT and G0 that rank candidates and Turkers in isolation. |
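As a hedged sketch of this mutual-reinforcement idea — the matrices, damping factor, and update rule below are illustrative assumptions, not the paper's exact formulation — two random walks can be coupled through an authorship matrix so that Turker scores and candidate scores reinforce each other:

```python
def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def matvec(M, v):
    # M is a list of rows; returns M·v.
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def coupled_rank(WT, W0, A, alpha=0.5, iters=50):
    """Couple a walk on the Turker graph GT (weights WT) with a walk on the
    candidate graph G0 (weights W0) via an authorship matrix A, where
    A[i][j] = 1 if Turker i produced candidate j."""
    nT, n0 = len(WT), len(W0)
    At = [[A[i][j] for i in range(nT)] for j in range(n0)]  # transpose of A
    rT = [1.0 / nT] * nT   # Turker scores
    r0 = [1.0 / n0] * n0   # candidate scores
    for _ in range(iters):
        # Each score vector mixes its own within-graph walk with the
        # scores propagated from the other graph through authorship.
        rT = normalize([alpha * w + (1 - alpha) * c
                        for w, c in zip(matvec(WT, rT), matvec(A, r0))])
        r0 = normalize([alpha * w + (1 - alpha) * t
                        for w, t in zip(matvec(W0, r0), matvec(At, rT))])
    return rT, r0

# Toy example: Turker 0 produced both candidates, Turker 1 only candidate 0.
WT = [[0.0, 1.0], [1.0, 0.0]]
W0 = [[0.0, 1.0], [1.0, 0.0]]
A  = [[1.0, 1.0], [1.0, 0.0]]
rT, r0 = coupled_rank(WT, W0, A)
```

In this toy setup the more prolific Turker and the candidate with more authors end up with the higher scores, which is the qualitative behavior the coupled-walk framework is designed to produce.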
Related work | ferent Turkers for a collection of Urdu sentences that had been previously professionally translated by the Linguistics Data Consortium. |
Related work | They also hired US-based Turkers to edit the translations, since the translators were largely based in Pakistan and exhibited errors characteristic of speakers of English as a second language.
Annotation | To ensure quality control, we required the Turkers to have at least an 85% HIT approval rating and to reside in the United States, because the Twitter messages in our dataset were related to American politics.
Annotation | we obtained five independent ratings from Turkers satisfying the above qualifications. |
Annotation | We also allowed a “Not Applicable” option to capture ratings where the Turkers did not have sufficient knowledge about the statement or where the statement was not really a claim.
Corpus | It was important to find a reasonable number of corresponding edit-turn-pairs before the actual annotation could take place, as we needed a certain number of positive seeds to keep turkers from simply labeling pairs as non-corresponding all the time.
Corpus | The resulting 750 pairs were each annotated by five turkers.
Corpus | The turkers were presented with the turn text, the turn topic name, the edit in its context, and the edit comment (if present).
Evaluation 11: Human Evaluation on ConnotationWordNet | We first describe the labeling process of sense-level connotation: we selected 350 polysemous words and one of their senses, and each Turker was asked to rate the connotative polarity of a given word (or of a given sense) from -5 to 5, with 0 being neutral.7 For each word, we asked 5 Turkers to rate it and took the average of the 5 ratings as the connotative intensity score of the word.
Evaluation 11: Human Evaluation on ConnotationWordNet | 7Because senses in WordNet can be tricky to understand, care should be taken in designing the task so that the Turkers will focus only on the corresponding sense of a word. |
Evaluation 11: Human Evaluation on ConnotationWordNet | As an incentive, each Turker was rewarded $0.07 per HIT, which consisted of 10 words to label.