A chance-corrected measure of inter-annotator agreement for syntax
Skjaerholt, Arne

Article Structure

Abstract

Following the works of Carletta (1996) and Artstein and Poesio (2008), there is an increasing consensus within the field that in order to properly gauge the reliability of an annotation effort, chance-corrected measures of inter-annotator agreement should be used.

Introduction

It is a truth universally acknowledged that an annotation task in good standing be in possession of a measure of inter-annotator agreement (IAA).

The metric

The most common metrics used in computational linguistics are the metrics κ (Cohen, 1960, introduced to computational linguistics by Carletta, 1996) and π (Scott, 1955).
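
Both κ and π correct raw agreement for the agreement expected by chance; they differ only in how the expected term is estimated (κ from each annotator's own label distribution, π from a pooled one). As a point of reference, here is a minimal sketch of κ for two annotators over nominal labels; the toy label lists are illustrative and not data from the paper:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement for two annotators on nominal labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    a_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement, kappa-style: each annotator's own label distribution.
    # (pi would instead pool both annotators' labels into one distribution.)
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    a_e = sum((dist_a[l] / n) * (dist_b[l] / n) for l in dist_a)
    return (a_o - a_e) / (1 - a_e)

print(cohen_kappa(["N", "V", "N", "A"], ["N", "V", "V", "A"]))
```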

Synthetic experiments

In the previous section, we proposed three different agreement metrics, α_plain, α_diff and α_norm, each involving different tradeoffs.
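
All three variants instantiate the same α template, α = 1 − D_o / D_e, and differ only in the tree distance δ used for the observed and expected disagreement terms. A rough two-annotator sketch with a pluggable distance, following the usual α recipe (the pooling of annotations for D_e is a standard construction assumed here, not quoted from the paper):

```python
from itertools import combinations
from statistics import mean

def alpha(annotations_a, annotations_b, delta):
    """Chance-corrected agreement alpha = 1 - D_o / D_e for two annotators.

    annotations_a[i] and annotations_b[i] are the two annotations of item i;
    delta is a symmetric distance function over annotations.
    """
    # Observed disagreement: average distance within each doubly-annotated item.
    d_o = mean(delta(a, b) for a, b in zip(annotations_a, annotations_b))
    # Expected disagreement: average distance over all pairs drawn from the
    # pooled annotations, regardless of which item they belong to.
    pooled = list(annotations_a) + list(annotations_b)
    d_e = mean(delta(x, y) for x, y in combinations(pooled, 2))
    return 1 - d_o / d_e

# Example with a trivial nominal distance; the paper's deltas are tree distances.
nominal = lambda x, y: 0.0 if x == y else 1.0
print(alpha(["N", "V", "N"], ["N", "V", "V"], nominal))
```

Substituting one of the tree distances described under "edit distance" below for the nominal distance gives the general shape of the proposed metrics.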

Real-world corpora

Synthetic experiments do not always fully reflect real-world behaviour, however.

Conclusion

The most important conclusion we draw from this work concerns which agreement metric is the most appropriate for syntactic annotation.

Topics

Treebank

Appears in 15 sentences as: Treebank (8) treebank (3) treebanked (1) treebanking (3) Treebanks (1) treebanks (7)
In A chance-corrected measure of inter-annotator agreement for syntax
  1. However, most evaluations of syntactic treebanks use simple accuracy measures such as bracket F1 scores for constituent trees (NEGRA, Brants, 2000; TIGER, Brants and Hansen, 2002; Cat3LB, Civit et al., 2003; The Arabic Treebank, Maamouri et al., 2008) or labelled or unlabelled attachment scores for dependency syntax (PDT, Hajič, 2004; PCEDT, Mikulová and Štěpánek, 2010; Norwegian Dependency Treebank, Skjaerholt, 2013).
    Page 1, “Introduction”
  2. In grammar-driven treebanking (or parsebanking), the problems encountered are slightly different.
    Page 2, “Introduction”
  3. In HPSG and LFG treebanking annotators do not annotate structure directly.
    Page 2, “Introduction”
  4. This is different from our approach in that agreement is computed on annotator decisions rather than on the treebanked analyses, and is only applicable to grammar-based approaches such as HPSG and LFG treebanking.
    Page 2, “Introduction”
  5. An already annotated corpus, in our case 100 randomly selected sentences from the Norwegian Dependency Treebank (Solberg et al., 2014), is taken as correct and then permuted to produce “annotations” of different quality.
    Page 4, “Synthetic experiments”
  6. Three of the data sets are dependency treebanks
    Page 6, “Real-world corpora”
  7. We contacted a number of treebank projects, among them the Penn Treebank and the Prague Dependency Treebank, but not all of them had data available.
    Page 6, “Real-world corpora”
  8. (NDT, CDT, PCEDT) and one phrase structure treebank (SSD), and of the dependency treebanks the PCEDT contains semantic dependencies, while the other two have traditional syntactic dependencies.
    Page 6, “Real-world corpora”
  9. NDT The Norwegian Dependency Treebank (Solberg et al., 2014) is a dependency treebank constructed at the National Library of Norway.
    Page 6, “Real-world corpora”
  10. CDT The Copenhagen Dependency Treebanks (Buch-Kromann et al., 2009; Buch-Kromann and Korzen, 2010) is a collection of parallel dependency treebanks, containing data from the Danish PAROLE corpus (Keson, 1998b; Keson, 1998a) in the original Danish and translated into English, Italian and Spanish.
    Page 6, “Real-world corpora”
  11. PCEDT The Prague Czech-English Dependency Treebank 2.0 Hajic et al.
    Page 6, “Real-world corpora”
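
Item 1 above contrasts the proposed measures with uncorrected scores such as bracket F1. For readers unfamiliar with the latter, a minimal sketch of labelled bracket F1, with each constituent analysis reduced to its (label, start, end) spans; the representation and function name are illustrative assumptions, not the paper's code:

```python
from collections import Counter

def bracket_f1(gold_brackets, test_brackets):
    """Labelled bracket F1 between two constituent analyses.

    Each analysis is a list of (label, start, end) spans; matching is done
    on the multiset of identical spans, as in PARSEVAL-style evaluation.
    """
    gold, test = Counter(gold_brackets), Counter(test_brackets)
    matched = sum((gold & test).values())  # multiset intersection
    precision = matched / sum(test.values())
    recall = matched / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)]
test = [("S", 0, 5), ("NP", 0, 2), ("NP", 2, 5)]
print(bracket_f1(gold, test))  # 2 of 3 brackets match -> F1 = 2/3
```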

edit distance

Appears in 12 sentences as: edit distance (15)
In A chance-corrected measure of inter-annotator agreement for syntax
  1. In this article we propose a family of chance-corrected measures of agreement, applicable to both dependency- and constituency-based syntactic annotation, based on Krippendorff’s α and tree edit distance.
    Page 1, “Introduction”
  2. The idea of using edit distance as the basis for an inter-annotator agreement metric has previously been explored by Fournier (2013).
    Page 2, “Introduction”
  3. However that work used a boundary edit distance as the basis of a metric for the task of text segmentation.
    Page 2, “Introduction”
  4. Instead, we propose to use an agreement measure based on Krippendorff’s α (Krippendorff, 1970; Krippendorff, 2004) and tree edit distance.
    Page 3, “The metric”
  5. Instead, we base our work on tree edit distance.
    Page 3, “The metric”
  6. The tree edit distance (TED) problem is defined analogously to the more familiar problem of string edit distance: what is the minimum number of edit operations required to transform one tree into the other?
    Page 3, “The metric”
  7. See Bille (2005) for a thorough introduction to the tree edit distance problem and other related problems.
    Page 3, “The metric”
  8. Tree edit distance has previously been used in the TEDEVAL software (Tsarfaty et al., 2011; Tsarfaty et al., 2012) for parser evaluation agnostic to both annotation scheme and theoretical framework, but this by itself is still an
    Page 3, “The metric”
  9. We propose three different distance functions for the agreement computation: the unmodified tree edit distance function, denoted δ_plain; a second function δ_diff(x, y) = TED(x, y) − abs(|x| − |y|), the edit distance minus the difference in length between the two sentences; and finally δ_norm(x, y) = TED(x, y) / (|x| + |y|), the edit distance normalised to the range [0, 1].
    Page 4, “The metric”
  10. In future work, we would like to investigate the use of other distance functions, in particular the use of approximate tree edit distance functions such as the pq-gram algorithm (Augsten et al., 2005).
    Page 9, “Conclusion”
  11. For large data sets such as the PCEDT set used in this work, computing α with tree edit distance as the distance measure can take a very long time. This is due to the fact that α requires O(n²) comparisons to be made, each of which is O(n²) using our current approach.
    Page 9, “Conclusion”
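
Items 6 and 9 above define the tree edit distance and the three distance functions built on top of it. A direct transcription of those three functions, assuming the caller supplies a tree edit distance routine ted(x, y) and a node-count function size(x) (neither is specified by the paper; they are placeholders here):

```python
def make_deltas(ted, size):
    """Build the three distance functions from a tree edit distance routine
    ted(x, y) and a node-count function size(x), both supplied by the caller."""

    def delta_plain(x, y):
        # Unmodified tree edit distance.
        return ted(x, y)

    def delta_diff(x, y):
        # Edit distance minus the difference in tree size, discounting
        # edits forced purely by a length mismatch.
        return ted(x, y) - abs(size(x) - size(y))

    def delta_norm(x, y):
        # Edit distance normalised to [0, 1] by the summed tree sizes.
        return ted(x, y) / (size(x) + size(y))

    return delta_plain, delta_diff, delta_norm
```

Any of the three returned functions can then be plugged into an α computation of the kind sketched under “Synthetic experiments” above.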

dependency trees

Appears in 5 sentences as: dependency trees (5)
In A chance-corrected measure of inter-annotator agreement for syntax
  1. Figure 1: Transformation of dependency trees before comparison
    Page 3, “The metric”
  2. Therefore we remove the leaf nodes in the case of phrase structure trees, and in the case of dependency trees we compare trees whose edges are unlabelled and nodes are labelled with the dependency relation between that word and its head; the root node receives the label 6.
    Page 4, “The metric”
  3. (2012), adapted to dependency trees.
    Page 4, “Synthetic experiments”
  4. For dependency trees, the input corpus is permuted as follows:
    Page 4, “Synthetic experiments”
  5. For example in the trees in figure 2, assigning any other head than the root to the PRED nodes directly dominated by the root will result in invalid (cyclic and unconnected) dependency trees.
    Page 5, “Synthetic experiments”
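
Item 2 above describes how annotations are preprocessed before comparison: word forms are dropped and each node keeps only its dependency relation. A rough sketch of that conversion from a (head, relation) encoding into nested label tuples suitable for a tree edit distance routine; the tuple encoding and the "ROOT" label for the artificial top node are assumptions for illustration:

```python
def deprels_to_tree(heads, deprels, root_label="ROOT"):
    """Convert a single-rooted dependency analysis into a nested-tuple tree.

    heads[i] is the 1-based head of token i+1 (0 for the sentence root);
    deprels[i] is its dependency relation. Word forms are dropped: each
    node is labelled only with its relation, as the metric requires.
    """
    children = {i: [] for i in range(len(heads) + 1)}
    for idx, head in enumerate(heads, start=1):
        children[head].append(idx)

    def build(idx, label):
        return (label, [build(c, deprels[c - 1]) for c in children[idx]])

    return build(0, root_label)

# "The dog barks": the <-DET- dog <-SUBJ- barks, barks attached to the root.
heads = [2, 3, 0]
deprels = ["DET", "SUBJ", "PRED"]
print(deprels_to_tree(heads, deprels))
```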

evaluation metric

Appears in 5 sentences as: evaluation metric (3) evaluation metrics (2)
In A chance-corrected measure of inter-annotator agreement for syntax
  1. With this in mind, it is striking that virtually all evaluations of syntactic annotation efforts use uncorrected parser evaluation metrics such as bracket F1 (for phrase structure) and accuracy scores (for dependencies).
    Page 1, “Abstract”
  2. To evaluate our metric we first present a number of synthetic experiments to better control the sources of noise and gauge the metric’s responses, before finally contrasting the behaviour of our chance-corrected metric with that of uncorrected parser evaluation metrics on real
    Page 1, “Abstract”
  3. The de facto standard parser evaluation metric in dependency parsing: the percentage of tokens that receive the correct head and dependency relation.
    Page 5, “Synthetic experiments”
  4. In our evaluation, we will contrast labelled accuracy, the standard parser evaluation metric, and our three α metrics.
    Page 6, “Real-world corpora”
  5. In this task inserting and deleting nodes is an integral part of the annotation, and if two annotators insert or delete different nodes the all-or-nothing requirement of identical yield of the LAS metric makes it impossible as an evaluation metric in this setting.
    Page 9, “Conclusion”
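
Item 4 above contrasts the α metrics with labelled accuracy, which item 3 defines as the percentage of tokens receiving the correct head and dependency relation. A minimal sketch of that uncorrected baseline, assuming each annotation is a flat list of (head, relation) pairs (an assumed input format, not the paper's code):

```python
def labelled_accuracy(gold, test):
    """Labelled accuracy (LAS): share of tokens whose head and dependency
    relation both match the reference annotation.

    gold and test are equal-length lists of (head, deprel) pairs, one per token.
    """
    assert len(gold) == len(test)
    correct = sum(g == t for g, t in zip(gold, test))
    return correct / len(gold)

gold = [(2, "DET"), (3, "SUBJ"), (0, "PRED")]
test = [(2, "DET"), (3, "OBJ"), (0, "PRED")]
print(labelled_accuracy(gold, test))  # 2 of 3 tokens fully correct -> 0.667
```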

dependency relations

Appears in 4 sentences as: dependency relation (2) dependency relations (3)
In A chance-corrected measure of inter-annotator agreement for syntax
  1. is Ragheb and Dickinson (2013), who use MASI (Passonneau, 2006) to measure agreement on dependency relations and head selection in multi-headed dependency syntax, and Bhat and Sharma (2012), who compute Cohen’s κ (Cohen, 1960) on dependency relations in single-headed dependency syntax.
    Page 2, “Introduction”
  2. When comparing syntactic trees, we only want to compare dependency relations or nonterminal categories.
    Page 4, “The metric”
  3. Therefore we remove the leaf nodes in the case of phrase structure trees, and in the case of dependency trees we compare trees whose edges are unlabelled and nodes are labelled with the dependency relation between that word and its head; the root node receives the label 6.
    Page 4, “The metric”
  4. The de facto standard parser evaluation metric in dependency parsing: the percentage of tokens that receive the correct head and dependency relation.
    Page 5, “Synthetic experiments”
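
Item 1 above mentions MASI (Passonneau, 2006), a set-valued agreement measure used by Ragheb and Dickinson (2013) for multi-headed dependency annotation. A sketch of MASI as a distance (1 minus the MASI similarity), following its usual definition of Jaccard overlap weighted by a monotonicity factor; this is background on the cited measure, not code from the paper:

```python
def masi_distance(a, b):
    """1 - MASI similarity between two label sets, as commonly defined:
    Jaccard overlap weighted by a monotonicity factor (1 for identical sets,
    2/3 for a proper subset, 1/3 for overlap without subsumption, 0 if disjoint).
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    jaccard = len(a & b) / len(a | b)
    if a == b:
        m = 1.0
    elif a <= b or b <= a:
        m = 2 / 3
    elif a & b:
        m = 1 / 3
    else:
        m = 0.0
    return 1.0 - jaccard * m

print(masi_distance({"SUBJ"}, {"SUBJ", "OBJ"}))  # partial overlap -> 2/3
```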
