To evaluate the methods, we collected examples of question-answer pairs involving scalar modifiers from CNN transcripts and the Dialog Act corpus and used response distributions from Mechanical Turk workers to assess the degree to which each answer conveys ‘yes’ or ‘no’.
Table 2: Mean entropy values and standard deviation obtained in the Mechanical Turk experiment for each question-answer pair category.
To assess the degree to which each answer conveys ‘yes’ or ‘no’ in context, we use response distributions from Mechanical Turk workers.
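As an illustration of how a response distribution can be summarized by its entropy (the quantity reported in Table 2), the following is a minimal sketch. It assumes each item's responses are available as a list of category labels and that entropy is measured in bits; the function name `response_entropy` is hypothetical, and the source does not state the logarithm base actually used.

```python
import math
from collections import Counter


def response_entropy(responses):
    """Shannon entropy (in bits) of the empirical label distribution.

    `responses` is a list of category labels given by workers for one
    question-answer pair. The base-2 logarithm is an assumption.
    """
    counts = Counter(responses)
    total = sum(counts.values())
    # Sum -p * log2(p) over the labels that were actually observed.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Full agreement among workers gives zero entropy; an even split between
# two labels gives one bit.
unanimous = response_entropy(["definite yes"] * 4)
split = response_entropy(["definite yes", "definite no"])
```

Lower entropy indicates stronger agreement among workers on a given item.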
Despite variant individual judgments, aggregate annotations done with Mechanical Turk have been shown to be reliable (Snow et al., 2008; Sheng et al., 2008; Munro et al., 2010).
|Evaluation and results|
To evaluate the techniques, we pool the Mechanical Turk ‘definite yes’ and ‘probable yes’ categories into a single category ‘Yes’, and we do the same for ‘definite no’ and ‘probable no’.
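The pooling step can be sketched as follows. This is only an illustration under stated assumptions: the four label strings are taken from the sentence above, any other label is passed through unchanged, and the helper names `pool_responses` and `dominant_label` are hypothetical. Choosing the most frequent pooled category as the item's label is an assumption made for the example, not something the source specifies.

```python
from collections import Counter

# Mapping from the fine-grained Mechanical Turk categories to pooled ones.
POOL = {
    "definite yes": "Yes",
    "probable yes": "Yes",
    "definite no": "No",
    "probable no": "No",
}


def pool_responses(responses):
    """Count responses after merging the definite/probable categories.

    Labels not listed in POOL are kept as they are (an assumption).
    """
    return Counter(POOL.get(r, r) for r in responses)


def dominant_label(responses):
    """Return the most frequent pooled category (hypothetical decision rule)."""
    label, _count = pool_responses(responses).most_common(1)[0]
    return label


example = ["definite yes", "probable yes", "probable no"]
pooled = pool_responses(example)   # two 'Yes' responses, one 'No'
label = dominant_label(example)    # 'Yes'
```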
Due to the relatively high speed and low cost of Amazon’s Mechanical Turk service, we chose to use Mechanical Turkers as our annotators.
Using Mechanical Turk to obtain inter-annotator agreement figures has several drawbacks.
We then embarked on a series of changes, testing each generation through annotation on Amazon’s Mechanical Turk service, a relatively quick and inexpensive online platform where requesters may publish tasks for anonymous online workers (Turkers) to perform.
Mechanical Turk has been previously used in a variety of NLP research, including recent work on noun compounds by Nakov (2008) to collect short phrases for linking the nouns within noun compounds.
For the Mechanical Turk annotation tests, we created five sets of 100 noun compounds from noun compounds automatically extracted from a random subset of New York Times articles written between 1987 and 2007 (Sandhaus, 2008).
We want to thank Dustin Smith for the OMICS data, Alexis Palmer for her support with Amazon Mechanical Turk, Nils Bendfeldt for the creation of all web forms and Ines Rehbein for her effort
We presented each pair to 5 non-experts, all US residents, via Mechanical Turk.
In particular, the use of the Amazon Mechanical Turk, which we use here, has been evaluated and shown to be useful for language processing tasks (Snow et al., 2008).