Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
Endriss, Ulle and Fernández, Raquel

Article Structure

Abstract

Crowdsourcing, which offers new ways of cheaply and quickly gathering large amounts of information contributed by volunteers online, has revolutionised the collection of labelled data.

Introduction

In recent years, the possibility to undertake large-scale annotation projects with hundreds or thousands of annotators has become a reality thanks to online crowdsourcing methods such as Amazon’s Mechanical Turk and Games with a Purpose.

Four Types of Collective Annotation

An annotation task consists of a set of items, each of which is associated with a set of possible categories (Artstein and Poesio, 2008).

Formal Model

Next we present our model for general aggregation of plain annotations into a collective annotation.
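
The formal model itself is not reproduced in this summary, but the quoted definitions suggest a simple picture: an annotation task assigns to each item a set of admissible categories, each coder supplies a plain annotation (a chosen category per item), and an aggregator maps a profile of such annotations to a single collective annotation. Below is a minimal Python sketch of that picture; the type names and the dictionary-based layout are illustrative assumptions, not the paper's formal notation.

```python
from typing import Callable, Dict, List, Set

# Illustrative type aliases, not the paper's notation.
Item = str
Category = str

# An annotation task: each item comes with its set of admissible categories.
AnnotationTask = Dict[Item, Set[Category]]

# A plain annotation by one coder: a chosen category for (some of) the items.
Annotation = Dict[Item, Category]

# An aggregator turns a profile of individual annotations into a collective one.
Aggregator = Callable[[AnnotationTask, List[Annotation]], Annotation]

# A toy RTE-style task with binary categories, as in the case study below.
toy_task: AnnotationTask = {"pair-1": {"1", "0"}, "pair-2": {"1", "0"}}
```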

Three Families of Aggregators

In this section we instantiate our formal model by proposing three families of methods for aggregation.

A Case Study

In this section, we report on a case study in which we have tested our bias-correcting majority and greedy consensus rules. We have used the dataset created by Snow et al.

Related Work

There is an increasing number of projects using crowdsourcing methods for labelling data.

Conclusions

We have presented a framework for combining the expertise of speakers taking part in large-scale

Topics

gold standard

Appears in 13 sentences as: gold standard (13), gold standards (1)
In Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
  1. Those who have looked into this increasingly important issue have mostly concentrated on validating the quality of multiple non-expert annotations in terms of how they compare to expert gold standards; but they have only used simple aggregation methods based on majority voting to combine the judgments of individual annotators (Snow et al., 2008; Venhuizen et al., 2013).
    Page 1, “Introduction”
  2. The original RTE1 Challenge test set consists of 800 text-hypothesis pairs (such as T: “Chrétien visited Peugeot’s newly renovated car factory”, H: “Peugeot manufactures cars”) with a gold standard annotation that classifies each item as either true (1), in case H can be inferred from T, or false (0).
    Page 7, “A Case Study”
  3. this test set, obtaining 95% agreement between the RTE1 gold standard and their own annotation.
    Page 7, “A Case Study”
  4. We have applied our aggregators to this data and compared the outcomes with each other and to the gold standard.
    Page 7, “A Case Study”
  5. Snow et al. (2008) work with a majority rule where ties are broken uniformly at random and report an observed agreement (accuracy) between the majority rule and the gold standard of 89.7%.
    Page 7, “A Case Study”
  6. If we break ties in the optimal way in view of approximating the gold standard (which of course would not actually be possible without having access to that gold standard), then we obtain an observed agreement of 93.8%, but if we are unlucky and ties happen to get broken in the worst possible way, we obtain an observed agreement of only 85.6%.
    Page 7, “A Case Study”
  7. Observe that all of the bias-correcting majority rules approximate the gold standard better than the majority rule with uniformly random tie-breaking.
    Page 7, “A Case Study”
  8. These parameters yield neither the best nor the worst approximations of the gold standard.
    Page 8, “A Case Study”
  9. While GreedyCR0 appears to perform rather poorly, GreedyCR15 approximates the gold standard particularly well.
    Page 8, “A Case Study”
  10. Creating a gold standard often involves adjudication of disagreements by experts, or even the removal of cases with disagreement from the dataset.
    Page 8, “Related Work”
  11. Although in our case study we have tested our aggregators by comparing their outcomes to a gold standard, our approach to collective annotation itself does not assume that there is in fact a ground truth.
    Page 9, “Related Work”
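
Items 5 and 6 above describe a plain majority rule whose reported accuracy against the gold standard depends heavily on how ties are broken (89.7% with uniformly random tie-breaking, versus 93.8% and 85.6% under optimal and worst-case tie-breaking). The sketch below shows one way such a majority aggregator and the observed-agreement figure could be computed; it reuses the illustrative data layout from the earlier sketch and is not the paper's bias-correcting or greedy rule.

```python
import random
from collections import Counter
from typing import Dict, List

def majority_label(votes: List[str], rng: random.Random) -> str:
    """Most frequent label among the votes; ties broken uniformly at random."""
    counts = Counter(votes)
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    return rng.choice(tied)

def majority_annotation(profile: List[Dict[str, str]],
                        rng: random.Random) -> Dict[str, str]:
    """Item-by-item majority over a profile of individual annotations."""
    items = set().union(*profile)
    return {item: majority_label([a[item] for a in profile if item in a], rng)
            for item in items}

def observed_agreement(collective: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of gold-standard items on which the collective annotation agrees."""
    return sum(collective[item] == gold[item] for item in gold) / len(gold)
```

Re-running such an aggregator with different tie-breaking policies is what produces the spread between 85.6% and 93.8% quoted above; the bias-correcting majority rules in item 7 are reported to do better than this randomly tie-broken baseline.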

Amazon’s Mechanical Turk

Appears in 4 sentences as: Amazon’s Mechanical Turk (4)
In Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
  1. In recent years, the possibility to undertake large-scale annotation projects with hundreds or thousands of annotators has become a reality thanks to online crowdsourcing methods such as Amazon’s Mechanical Turk and Games with a Purpose.
    Page 1, “Introduction”
  2. The dataset created by Snow et al. (2008) includes 10 non-expert annotations for each of the 800 items in the RTE1 test set, collected with Amazon’s Mechanical Turk.
    Page 7, “A Case Study”
  3. Similarly, crowdsourcing via microworking sites like Amazon’s Mechanical Turk has been used in several annotation experiments related to tasks such as affect analysis, event annotation, sense definition and word sense disambiguation (Snow et al., 2008; Rumshisky, 2011; Rumshisky et al., 2012), amongst others.
    Page 8, “Related Work”
  4. See also the papers presented at the NAACL 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk (tinyurl.
    Page 8, “Related Work”

ground truth

Appears in 4 sentences as: ground truth (6)
In Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
  1. Although in our case study we have tested our aggregators by comparing their outcomes to a gold standard, our approach to collective annotation itself does not assume that there is in fact a ground truth.
    Page 9, “Related Work”
  2. In application domains where it is reasonable to assume the existence of a ground truth and where we are able to model the manner in which individual judgments are being distorted relative to this ground truth, social choice theory provides tools (using again maximum-likelihood estimators) for the design of aggregators that maximise chances of recovering the ground truth for a given model of distortion (Young, 1995; Conitzer and Sandholm, 2005).
    Page 9, “Related Work”
  3. Specifically, they have designed an experiment in which the ground truth is defined unambiguously and known to the experiment designer, so as to be able to extract realistic models of distortion from the data collected in a crowdsourcing exercise.
    Page 9, “Related Work”
  4. In some domains, such as medical diagnosis, it makes perfect sense to assume that there is a ground truth.
    Page 9, “Conclusions”
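
Sentence 2 above points to maximum-likelihood estimators that recover a ground truth under an explicit model of how individual judgments are distorted. A standard textbook instance of this idea (not one of the paper's own aggregators) assumes each annotator reports the true binary label independently with some accuracy p; the maximum-likelihood collective label is then a weighted majority with log-odds weights, as sketched below. The accuracy values are an assumed input here; in practice they would have to be estimated, e.g. from a calibration set.

```python
import math
from typing import Dict, Tuple

def mle_label(votes: Dict[str, str],
              accuracy: Dict[str, float],
              labels: Tuple[str, str] = ("1", "0")) -> str:
    """Maximum-likelihood label under an assumed independent, symmetric-noise
    model: each annotator is correct with probability accuracy[annotator].
    This reduces to a weighted majority with weights log(p / (1 - p))."""
    score = 0.0
    for annotator, label in votes.items():
        p = accuracy[annotator]
        weight = math.log(p / (1.0 - p))  # more reliable annotators count more
        score += weight if label == labels[0] else -weight
    return labels[0] if score >= 0 else labels[1]

# Example: three annotators with assumed accuracies; the reliable one prevails.
print(mle_label({"a": "1", "b": "0", "c": "0"},
                {"a": 0.95, "b": 0.6, "c": 0.6}))  # -> "1"
```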

word sense

Appears in 4 sentences as: Word Sense (1), word sense (2), word senses (1)
In Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
  1. In contrast, in tasks such as word sense labelling (Kilgarriff and Palmer, 2000; Palmer et al., 2007; Venhuizen et al., 2013) and PP-attachment annotation (Rosenthal et al., 2010; Jha et al., 2010) coders need to choose a category amongst a set of options specific to each item: the possible senses of each word or the possible attachment points in each sentence with a prepositional phrase.
    Page 2, “Four Types of Collective Annotation”
  2. Some authors have combined qualitative and quantitative ratings; e.g., for the Graded Word Sense dataset of Erk et al.
    Page 2, “Four Types of Collective Annotation”
  3. Venhuizen et al. (2013) have developed the Wordrobe set of games for annotating named entities, word senses, homographs, and pronouns.
    Page 8, “Related Work”
  4. Similarly, crowdsourcing via microworking sites like Amazon’s Mechanical Turk has been used in several annotation experiments related to tasks such as affect analysis, event annotation, sense definition and word sense disambiguation (Snow et al., 2008; Rumshisky, 2011; Rumshisky et al., 2012), amongst others.
    Page 8, “Related Work”
