Abstract | Our experimental results closely match the Turkers’ response data, demonstrating that meanings can be learned from Web data and that such meanings can drive pragmatic inference. |
Analysis and discussion | Figure 5: Correlation between agreement among Turkers and whether the system gets the correct answer. |
Analysis and discussion | For each dialogue, we plot a circle at Turker response entropy and either 1 = correct inference or 0 = incorrect inference, except the points are jittered a little vertically to show where the mass of data lies. |
Analysis and discussion | …correlate almost perfectly with the Turkers’ responses.
Corpus description | Given a written dialogue between speakers A and B, Turkers were asked to judge what B’s answer conveys: ‘definite yes’, ‘probable yes’, ‘uncertain’, ‘probable no’, or ‘definite no’.
Corpus description | For each dialogue, we got answers from 30 Turkers, and we took the dominant response as the correct one, though we make extensive use of the full response distributions in evaluating our approach.2 We also computed entropy values for the distribution of answers for each item.
Corpus description | 2120 Turkers were involved (the median number of items done was 28 and the mean 56.5). |
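Corpus description | For concreteness, the dominant response and the response entropy can be computed directly from the raw label counts; the snippet below is a minimal sketch with invented responses, not the authors’ code.

```python
from collections import Counter
from math import log2

# Hypothetical labels from 30 Turkers for one dialogue.
responses = ["definite yes"] * 18 + ["probable yes"] * 9 + ["uncertain"] * 3

counts = Counter(responses)
dominant = counts.most_common(1)[0][0]  # taken as the correct answer

# Shannon entropy (in bits) of the response distribution:
# 0 means perfect agreement; higher values mean more disagreement.
total = sum(counts.values())
entropy = -sum((c / total) * log2(c / total) for c in counts.values())
print(dominant, round(entropy, 3))  # definite yes 1.295
```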
Evaluation and results | In the case of the scalar modifiers experiment, there were just two examples whose dominant response from the Turkers was ‘Uncertain’, so we have left that category out of the results. |
Evaluation and results | We count an inference as successful if it matches the dominant Turker response category. |
Evaluation | Due to the relatively high speed and low cost of Amazon’s Mechanical Turk service, we chose to use Mechanical Turkers as our annotators.
Evaluation | The first and most significant drawback is that it is impossible to force each Turker to label every data point without putting all the terms onto a single web page, which is highly impractical for a large taxonomy. |
Evaluation | Some Turkers may label every compound, but most do not. |
Taxonomy | We then embarked on a series of changes, testing each generation by annotation using Amazon’s Mechanical Turk service, a relatively quick and inexpensive online platform where requesters may publish tasks for anonymous online workers (Turkers) to perform.
Taxonomy | Turkers were asked to select one or, if they deemed it appropriate, two categories for each noun pair. |
Taxonomy | In addition to influencing the category definitions, some taxonomy groupings were altered with the hope that this would improve inter-annotator agreement for cases where Turker disagreement was systematic. |
Experiments | Our experimental procedure was as follows: 162 turkers were partitioned into four groups, each corresponding to a treatment condition: OPT (N=34), HF (N=41), RANDOM (N=43), MAN (N=44).
Experiments | Font size correlates with the score given by judge turkers when evaluating the guesses of other turkers who were presented with the same text, but with the word replaced by a blank.
Experiments | Turkers were solicited to participate in a study that involved “reading a short story with a twist” (title of HIT). |
Model | For collecting data about which words are likely to be “predicted” given their context, we developed an Amazon Mechanical Turk task that presented turkers with excerpts of a short story (English translation of “The Man who Repented” by …)
Model | Turkers were required to type in their best guess, and the number of semantically similar guesses was counted by an average of 6 other turkers.
Model | Turkers who judged the semantic similarity of the guesses of other turkers achieved an average Cohen’s kappa agreement of 0.44, indicating fair to poor agreement.
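Model | For reference, Cohen’s kappa corrects the raw agreement rate for agreement expected by chance; this gloss and notation are ours, not the paper’s:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Model | Here $p_o$ is the observed proportion of agreement and $p_e$ is the proportion expected if the judges rated independently at their marginal rates, so $\kappa = 0.44$ means the judges closed 44% of the gap between chance agreement and perfect agreement.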
Abstract | …customer-generated truthful reviews, Turker-generated deceptive reviews, and employee (domain-expert) generated deceptive reviews.
Dataset Construction | 3.1 Turker set, using Mechanical Turk |
Dataset Construction | Anyone with basic programming skills can create Human Intelligence Tasks (HITs) and access a marketplace of anonymous online workers (Turkers) willing to complete the tasks.
Dataset Construction | …to create their dataset, such as restricting the task to Turkers who are located in the United States and who maintain an approval rating of at least 90%.
Experiments | Specifically, we reframe it as an intra-domain multi-class classification task, where, given the labeled training data from one domain, we learn a classifier to classify reviews according to their source, i.e., Employee, Turker, or Customer.
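Experiments | As an illustrative sketch of such a three-way source classifier (the bag-of-words features and logistic-regression learner here are generic stand-ins, not the paper’s actual model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled reviews from a single domain (hotel).
reviews = [
    "Check-in was slow but the room was clean and quiet.",        # Customer
    "An unforgettable stay, pure luxury at every turn!",          # Turker
    "Our award-winning concierge team anticipates every need.",   # Employee
    "Decent breakfast, though the pool was closed for repairs.",  # Customer
    "This hotel is simply the best I have ever experienced!",     # Turker
    "Guests consistently praise our spa and signature dining.",   # Employee
]
labels = ["Customer", "Turker", "Employee"] * 2

# Unigram+bigram tf-idf features feeding a multinomial logistic regression.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(reviews, labels)
print(clf.predict(["The staff was friendly and the view was stunning."]))
```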
Feature-based Additive Model | If we instead used an SVM, for example, we would have to train classifiers one by one (due to the distinct features from different sources) to draw conclusions about the differences between Turker vs. Expert vs. truthful reviews, positive expert vs. negative expert reviews, or reviews from different domains.
Feature-based Additive Model | $y_{\text{source}} \in \{\text{employee}, \text{turker}, \text{customer}\}$
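Feature-based Additive Model | One way to write down such an additive formulation (a SAGE-style sketch in our own notation, not necessarily the paper’s exact parameterization) is to let each word’s log-probability decompose into a shared background term plus additive offsets for the review’s domain and source:

$$p(w \mid d, s) \propto \exp\left( m_w + \eta^{(d)}_w + \eta^{(s)}_w \right), \qquad s \in \{\text{employee}, \text{turker}, \text{customer}\}$$

Feature-based Additive Model | Because the source effect $\eta^{(s)}$ is shared across domains, a single fitted model supports direct comparisons such as Turker vs. Expert without training separate classifiers.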
Introduction | Despite the advantages of soliciting deceptive gold-standard material from Turkers (it is easy, large-scale, and affordable), it is unclear whether Turkers are representative of the general population that generates fake reviews; in other words, Ott et al.’s dataset may correspond to only one type of online deceptive opinion spam: fake reviews generated by people who have never visited the offerings or experienced the entities.
Introduction | In contrast to existing work (Ott et al., 2011; Li et al., 2013b), our new gold standard includes three types of reviews: domain expert deceptive opinion spam (Employee), crowdsourced deceptive opinion spam (Turker), and truthful Customer reviews (Customer).
Related Work | …created a gold-standard collection by employing Turkers to write fake reviews, and follow-up research was based on their data (Ott et al., 2012; Ott et al., 2013; Li et al., 2013b; Feng and Hirst, 2013).
Crowdsourcing Translation | 52 different Turkers took part in the translation task, each translating 138 sentences on average. |
Crowdsourcing Translation | In the editing task, 320 Turkers participated, averaging 56 sentences each. |
Problem Formulation | We form two graphs: the first graph ($G_T$) represents Turkers (translator/editor pairs) as nodes; the second graph ($G_O$) represents candidate translated and edited sentences as nodes.
Problem Formulation | $G_T = (V_T, E_T)$ is a weighted undirected graph representing collaborations between Turkers.
Problem Formulation | The mutual reinforcement framework couples the two random walks on $G_T$ and $G_O$, which in isolation rank Turkers and candidates separately.
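Problem Formulation | A minimal sketch of one such coupled iteration (the matrices, damping factor, and update form below are our own illustration of the mutual-reinforcement idea, not the authors’ exact equations):

```python
import numpy as np

def normalize(M):
    """Row-normalize a nonnegative matrix into transition probabilities."""
    s = M.sum(axis=1, keepdims=True)
    return np.divide(M, s, out=np.zeros_like(M, dtype=float), where=s > 0)

def co_rank(W_T, W_O, A, lam=0.85, iters=100):
    """Couple random walks over G_T (Turkers) and G_O (candidates).

    W_T: Turker-Turker collaboration weights, shape (n_turkers, n_turkers)
    W_O: candidate-candidate similarity weights, shape (n_cands, n_cands)
    A:   Turker-candidate authorship weights, shape (n_turkers, n_cands)
    """
    P_T, P_O = normalize(W_T), normalize(W_O)
    T2C, C2T = normalize(A), normalize(A.T)  # cross-graph transitions
    r_t = np.full(W_T.shape[0], 1.0 / W_T.shape[0])  # Turker scores
    r_o = np.full(W_O.shape[0], 1.0 / W_O.shape[0])  # candidate scores
    for _ in range(iters):
        # Each walk mixes its own graph with evidence from the other side.
        r_t = lam * (P_T.T @ r_t) + (1 - lam) * (C2T.T @ r_o)
        r_o = lam * (P_O.T @ r_o) + (1 - lam) * (T2C.T @ r_t)
        r_t, r_o = r_t / r_t.sum(), r_o / r_o.sum()
    return r_t, r_o
```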
Related work | …different Turkers for a collection of Urdu sentences that had been previously professionally translated by the Linguistic Data Consortium.
Related work | They also hired US-based Turkers to edit the translations, since the translators were largely based in Pakistan and exhibited errors that are characteristic of speakers of English as a second language.
Experimental Results 11 | About 300 unique Turkers participated in the evaluation tasks.
Experimental Results 11 | Otherwise we treat them as ambiguous cases.17 Figure 3 shows a part of the AMT task, where Turkers are presented with questions that help them determine the subtle connotative polarity of each word, and are then asked to rate the degree of connotation on a scale from -5 (most negative) to 5 (most positive).
Experimental Results 11 | 17 We allow Turkers to mark words that can be used with both positive and negative connotations, which results in about 7% of words being excluded from the gold-standard set.
Dataset Construction and Human Performance | Crowdsourcing services such as AMT have made large-scale data annotation and collection efforts financially affordable by granting anyone with basic programming skills access to a marketplace of anonymous online workers (known as Turkers) willing to complete small tasks.
Dataset Construction and Human Performance | To ensure that opinions are written by unique authors, we allow only a single submission per Turker . |
Dataset Construction and Human Performance | We also restrict our task to Turkers who are located in the United States, and who maintain an approval rating of at least 90%. |
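Dataset Construction and Human Performance | As a concrete illustration, both restrictions map onto MTurk’s built-in qualification requirements; the sketch below uses today’s boto3 client (an API that postdates the paper), and the task metadata and question file are hypothetical:

```python
import boto3

# MTurk system qualification type IDs for worker locale and approval rate.
LOCALE_QUAL = "00000000000000000071"
APPROVAL_QUAL = "000000000000000000L0"

mturk = boto3.client("mturk", region_name="us-east-1")
mturk.create_hit(
    Title="Write a hotel review",                 # hypothetical task metadata
    Description="Write one review of the hotel named in the instructions.",
    Reward="1.00",
    MaxAssignments=20,   # a worker may complete a given HIT only once, so one
                         # HIT with many assignments yields unique authors
    AssignmentDurationInSeconds=1800,
    LifetimeInSeconds=86400,
    Question=open("review_task.xml").read(),      # hypothetical question XML
    QualificationRequirements=[
        {"QualificationTypeId": LOCALE_QUAL, "Comparator": "EqualTo",
         "LocaleValues": [{"Country": "US"}]},
        {"QualificationTypeId": APPROVAL_QUAL,
         "Comparator": "GreaterThanOrEqualTo", "IntegerValues": [90]},
    ],
)
```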
Human Perception of Biased Language | Turkers were shown Wikipedia’s definition of a “biased statement” and two example sentences that illustrated the two types of bias, framing and epistemological. |
Human Perception of Biased Language | Before seeing the 10 sentences, turkers were asked to list the languages they spoke, as well as their primary language in primary school.
Human Perception of Biased Language | On average, it took turkers about four minutes to complete each HIT. |
Evaluation | To evaluate the quality of types assigned to emerging entities, we presented turkers with sentences from the news tagged with out-of-KB entities and the types inferred by the methods under test. |
Evaluation | The turkers’ task was to assess the correctness of types assigned to an entity mention.
Evaluation | To make the task easy for the turkers to understand, we combined the extracted entity and type into a sentence.
Annotation | To ensure quality control, we required the Turkers to have at least an 85% HIT approval rating and to reside in the United States, because the Twitter messages in our dataset were related to American politics.
Annotation | We obtained five independent ratings from Turkers satisfying the above qualifications.
Annotation | We also allowed a “Not Applicable” option to capture ratings where the Turkers did not have sufficient knowledge about the statement or where the statement was not really a claim.
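Annotation | A minimal sketch of how five ratings with a “Not Applicable” option might be aggregated (the label names, abstention handling, and majority threshold are our own choices):

```python
from collections import Counter

def aggregate(ratings, min_valid=3):
    """Majority label over Turker ratings, treating 'Not Applicable' as abstention."""
    valid = [r for r in ratings if r != "Not Applicable"]
    if len(valid) < min_valid:  # too few informed ratings to decide
        return None
    label, count = Counter(valid).most_common(1)[0]
    return label if count > len(valid) / 2 else None  # require a strict majority

# Hypothetical ratings for one claim from five Turkers.
print(aggregate(["Support", "Support", "Deny", "Support", "Not Applicable"]))
# -> Support
```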
Corpus | It was important to find a reasonable number of corresponding edit-turn-pairs before the actual annotation could take place, as we needed a certain number of positive seeds to keep turkers from simply labeling pairs as non-corresponding all the time.
Corpus | The resulting 750 pairs have each been annotated by five turkers . |
Corpus | The turkers were presented with the turn text, the turn topic name, the edit in its context, and the edit comment (if present).
Evaluation 11: Human Evaluation on ConnotationWordNet | We first describe the labeling process for sense-level connotation: we selected 350 polysemous words and one of their senses, and each Turker was asked to rate the connotative polarity of a given word (or of a given sense) from -5 to 5, with 0 being neutral.7 For each word, we asked 5 Turkers to rate it, and we took the average of the 5 ratings as the connotative intensity score of the word.
Evaluation 11: Human Evaluation on ConnotationWordNet | 7 Because senses in WordNet can be tricky to understand, care should be taken in designing the task so that the Turkers will focus only on the corresponding sense of a word.
Evaluation 11: Human Evaluation on ConnotationWordNet | As an incentive, each Turker was rewarded $0.07 per HIT, which consisted of 10 words to label.