Universal Dependency Annotation for Multilingual Parsing
McDonald, Ryan and Nivre, Joakim and Quirmbach-Brundage, Yvonne and Goldberg, Yoav and Das, Dipanjan and Ganchev, Kuzman and Hall, Keith and Petrov, Slav and Zhang, Hao and Täckström, Oscar and Bedini, Claudia and Bertomeu Castelló, Núria and Lee, Jungmee

Article Structure

Abstract

We present a new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean.

Introduction

In recent years, syntactic representations based on head-modifier dependency relations between words have attracted a lot of interest (Kübler et al., 2009).

Towards A Universal Treebank

The Stanford typed dependencies for English (De Marneffe et al., 2006; de Marneffe and Manning, 2008) serve as the point of departure for our ‘universal’ dependency representation, together with the tag set of Petrov et al.

Experiments

One of the motivating factors in creating such a data set was improved cross-lingual transfer evaluation.

Conclusion

We have released data sets for six languages with consistent dependency annotation.

Topics

treebanks

Appears in 19 sentences as: Treebank (6) treebank (4) treebanks (10)
In Universal Dependency Annotation for Multilingual Parsing
  1. We present a new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean.
    Page 1, “Abstract”
  2. This ‘universal’ treebank is made freely available in order to facilitate research on multilingual dependency parsing.
    Page 1, “Abstract”
  3. Research in dependency parsing — computational methods to predict such representations — has increased dramatically, due in large part to the availability of dependency treebanks in a number of languages.
    Page 1, “Introduction”
  4. While these data sets are standardized in terms of their formal representation, they are still heterogeneous treebanks.
    Page 1, “Introduction”
  5. That is to say, despite them all being dependency treebanks, which annotate each sentence with a dependency tree, they subscribe to different annotation schemes.
    Page 1, “Introduction”
  6. These data sets can be sufficient if one’s goal is to build monolingual parsers and evaluate their quality without reference to other languages, as in the original CoNLL shared tasks, but there are many cases where heterogeneous treebanks are less than adequate.
    Page 1, “Introduction”
  7. In order to overcome these difficulties, some cross-lingual studies have resorted to heuristics to homogenize treebanks (Hwa et al., 2005; Smith and Eisner, 2009; Ganchev et al., 2009), but we are only aware of a few systematic attempts to create homogeneous syntactic dependency annotation in multiple languages.
    Page 1, “Introduction”
  8. (2012) attempt to harmonize a large number of dependency treebanks by mapping their annotation to a version of the Prague Dependency Treebank scheme (Hajič et al., 2001; Böhmová et al., 2003).
    Page 1, “Introduction”
  9. (2004) for multilingual syntactic treebank construction.
    Page 2, “Towards A Universal Treebank”
  10. The second, used only for English and Swedish, is to automatically convert existing treebanks, as in Zeman et al.
    Page 2, “Towards A Universal Treebank”
  11. For English, we used the Stanford parser (v1.6.8) (Klein and Manning, 2003) to convert the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993) to basic dependency trees, including punctuation and with the copula verb as head in copula constructions.
    Page 2, “Towards A Universal Treebank”


cross-lingual

Appears in 16 sentences as: cross-lingual (16)
In Universal Dependency Annotation for Multilingual Parsing
  1. To show the usefulness of such a resource, we present a case study of cross-lingual transfer parsing with more reliable evaluation than has been possible before.
    Page 1, “Abstract”
  2. First, a homogeneous representation is critical for multilingual language technologies that require consistent cross-lingual analysis for downstream components.
    Page 1, “Introduction”
  3. Second, consistent syntactic representations are desirable in the evaluation of unsupervised (Klein and Manning, 2004) or cross-lingual syntactic parsers (Hwa et al., 2005).
    Page 1, “Introduction”
  4. In the cross-lingual study of McDonald et al.
    Page 1, “Introduction”
  5. In order to overcome these difficulties, some cross-lingual studies have resorted to heuristics to homogenize treebanks (Hwa et al., 2005; Smith and Eisner, 2009; Ganchev et al., 2009), but we are only aware of a few systematic attempts to create homogeneous syntactic dependency annotation in multiple languages.
    Page 1, “Introduction”
  6. (2012), have already spurred numerous examples of improved empirical cross-lingual systems (Zhang et al., 2012; Gelling et al., 2012; Täckström et al., 2013).
    Page 2, “Introduction”
  7. We aim to do the same for syntactic dependencies and present cross-lingual parsing experiments to highlight some of the benefits of cross-lingually consistent annotation.
    Page 2, “Introduction”
  8. Second, the evaluation scores in general are significantly higher than previous cross-lingual studies, suggesting that most of these studies underestimate true accuracy.
    Page 2, “Introduction”
  9. Finally, unlike all previous cross-lingual studies, we can report full labeled accuracies and not just unlabeled structural accuracies.
    Page 2, “Introduction”
  10. The selected sentences were preprocessed using cross-lingual taggers (Das and Petrov, 2011) and parsers (McDonald et al., 2011).
    Page 3, “Towards A Universal Treebank”
  11. One of the motivating factors in creating such a data set was improved cross-lingual transfer evaluation.
    Page 4, “Experiments”
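
The distinction drawn in sentence 9 between labeled and unlabeled accuracy can be illustrated with a minimal sketch. It assumes the usual token-level definitions of unlabeled and labeled attachment score (UAS and LAS); the function name `attachment_scores` and the toy sentence are hypothetical, and the source does not say how its scores were computed (for example, whether punctuation was counted).

```python
from typing import List, Tuple

# One arc per token: (index of the head token, dependency label).
# Head index 0 stands for the artificial root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions of tokens in one sentence."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # Unlabeled: only the head has to match.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # Labeled: head and label both have to match.
    both_ok = sum(g == p for g, p in zip(gold, pred))
    return head_ok / len(gold), both_ok / len(gold)


# Toy example: token 3 has the right head but the wrong label,
# token 4 has the wrong head.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "dobj")]
pred = [(2, "nsubj"), (0, "root"), (4, "amod"), (3, "dobj")]
uas, las = attachment_scores(gold, pred)
```

A labeled score is only meaningful when source and target treebanks share one label inventory, which is why heterogeneous annotation limited earlier studies to unlabeled scores.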


part-of-speech

Appears in 5 sentences as: part-of-speech (5)
In Universal Dependency Annotation for Multilingual Parsing
  1. In the context of part-of-speech tagging, universal representations, such as that of Petrov et al.
    Page 2, “Introduction”
  2. (2012) as the underlying part-of-speech representation.
    Page 2, “Towards A Universal Treebank”
  3. For both English and Swedish, we mapped the language-specific part-of-speech tags to universal tags using the mappings of Petrov et al.
    Page 2, “Towards A Universal Treebank”
  4. Note that relative to the universal part-of-speech tagset of Petrov et al.
    Page 3, “Towards A Universal Treebank”
  5. We use the features of Zhang and Nivre (2011), except that all lexical identities are dropped from the templates during training and testing, hence inducing a ‘delexicalized’ model that employs only ‘universal’ properties from source-side treebanks, such as part-of-speech tags, labels, head-modifier distance, etc.
    Page 4, “Experiments”
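
Mapping language-specific tags to universal tags, as described in sentence 3, amounts to a many-to-one table lookup. The sketch below uses a small illustrative excerpt of Penn Treebank tags, not the full published mapping; the name `to_universal` and the fallback to `X` for unlisted tags are assumptions made for this example.

```python
from typing import Dict, List

# Illustrative excerpt only: a few Penn Treebank tags and the
# universal categories they fall under. Not the complete mapping.
PTB_TO_UNIVERSAL: Dict[str, str] = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET",
    "IN": "ADP", "CD": "NUM", "CC": "CONJ", "PRP": "PRON",
    ".": ".",
}


def to_universal(tags: List[str],
                 mapping: Dict[str, str] = PTB_TO_UNIVERSAL,
                 unknown: str = "X") -> List[str]:
    """Map each language-specific tag to its universal tag.

    Tags missing from the table fall back to `unknown`
    (an assumption of this sketch).
    """
    return [mapping.get(tag, unknown) for tag in tags]


universal = to_universal(["DT", "NN", "VBZ", "JJ", "."])
```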


CoNLL

Appears in 3 sentences as: CoNLL (3)
In Universal Dependency Annotation for Multilingual Parsing
  1. In particular, the CoNLL shared tasks on dependency parsing have provided over twenty data sets in a standardized format (Buchholz and Marsi, 2006; Nivre et al., 2007).
    Page 1, “Introduction”
  2. These data sets can be sufficient if one’s goal is to build monolingual parsers and evaluate their quality without reference to other languages, as in the original CoNLL shared tasks, but there are many cases where heterogeneous treebanks are less than adequate.
    Page 1, “Introduction”
  3. (2011), who observe that this is rarely the case with the heterogeneous CoNLL treebanks.
    Page 4, “Experiments”
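
For readers unfamiliar with the standardized format mentioned in sentence 1, here is a minimal reader sketch. It assumes the ten-column, tab-separated layout of the CoNLL-X shared task with blank lines between sentences; the function name `read_conllx` and the sample sentences are hypothetical, and the source does not state which file layout the released treebanks use.

```python
from typing import Dict, List

# Column names of the CoNLL-X shared task format, in order.
CONLLX_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel"]


def read_conllx(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-X text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(CONLLX_FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "1\tShe\t_\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tsleeps\t_\tVERB\tVBZ\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tYes\t_\tX\tUH\t_\t0\troot\t_\t_\n"
)
sentences = read_conllx(sample)
```

A shared file format like this fixes only the columns; the head and label choices inside them can still follow different annotation schemes, which is the heterogeneity the sentences above describe.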


dependency parsing

Appears in 3 sentences as: dependency parsers (1) dependency parsing (2)
In Universal Dependency Annotation for Multilingual Parsing
  1. Research in dependency parsing — computational methods to predict such representations — has increased dramatically, due in large part to the availability of dependency treebanks in a number of languages.
    Page 1, “Introduction”
  2. In particular, the CoNLL shared tasks on dependency parsing have provided over twenty data sets in a standardized format (Buchholz and Marsi, 2006; Nivre et al., 2007).
    Page 1, “Introduction”
  3. We use the so-called basic dependencies (with punctuation included), where every dependency structure is a tree spanning all the input tokens, because this is the kind of representation that most available dependency parsers require.
    Page 2, “Towards A Universal Treebank”
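
The requirement in sentence 3, that every dependency structure is a tree spanning all input tokens, can be checked mechanically. The sketch below is one possible check, assuming heads are given as 1-based token indices with 0 for an artificial root and that exactly one token attaches to the root; the single-root condition and the name `is_spanning_tree` are assumptions of this example, not statements from the source.

```python
from typing import List


def is_spanning_tree(heads: List[int]) -> bool:
    """Check that a head list encodes a tree over all tokens.

    heads[i] is the head of token i + 1; 0 marks the artificial root.
    """
    n = len(heads)
    if n == 0:
        return False
    # Every head must be the root or an existing token.
    if any(h < 0 or h > n for h in heads):
        return False
    # Assumption of this sketch: exactly one token attaches to the root.
    if sum(h == 0 for h in heads) != 1:
        return False
    # Every token must reach the root without revisiting a token.
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False  # cycle: this token never reaches the root
            seen.add(node)
            node = heads[node - 1]
    return True


ok = is_spanning_tree([2, 0, 4, 2])   # tokens 1 and 4 depend on 2, 3 on 4
cyclic = is_spanning_tree([2, 1, 0])  # tokens 1 and 2 point at each other
```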


dependency relations

Appears in 3 sentences as: dependency relations (3)
In Universal Dependency Annotation for Multilingual Parsing
  1. In recent years, syntactic representations based on head-modifier dependency relations between words have attracted a lot of interest (Kübler et al., 2009).
    Page 1, “Introduction”
  2. This mainly consisted in relabeling dependency relations and, due to the fine-grained label set used in the Swedish Treebank (Teleman, 1974), this could be done with high precision.
    Page 2, “Towards A Universal Treebank”
  3. Such a reduction may ultimately be necessary also in the case of dependency relations, but since most of our data sets were created through manual annotation, we could afford to retain a fine-grained analysis, knowing that it is always possible to map from finer to coarser distinctions, but not vice versa.
    Page 3, “Towards A Universal Treebank”


dependency tree

Appears in 3 sentences as: dependency tree (2) dependency trees (1)
In Universal Dependency Annotation for Multilingual Parsing
  1. That is to say, despite them all being dependency treebanks, which annotate each sentence with a dependency tree, they subscribe to different annotation schemes.
    Page 1, “Introduction”
  2. A sample dependency tree from the French data set is shown in Figure 1.
    Page 2, “Towards A Universal Treebank”
  3. For English, we used the Stanford parser (v1.6.8) (Klein and Manning, 2003) to convert the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993) to basic dependency trees, including punctuation and with the copula verb as head in copula constructions.
    Page 2, “Towards A Universal Treebank”


fine-grained

Appears in 3 sentences as: fine-grained (3)
In Universal Dependency Annotation for Multilingual Parsing
  1. This mainly consisted in relabeling dependency relations and, due to the fine-grained label set used in the Swedish Treebank (Teleman, 1974), this could be done with high precision.
    Page 2, “Towards A Universal Treebank”
  2. Making fine-grained label distinctions was discouraged.
    Page 3, “Towards A Universal Treebank”
  3. Such a reduction may ultimately be necessary also in the case of dependency relations, but since most of our data sets were created through manual annotation, we could afford to retain a fine-grained analysis, knowing that it is always possible to map from finer to coarser distinctions, but not vice versa.
    Page 3, “Towards A Universal Treebank”
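
The point in sentence 3, that finer distinctions can be mapped to coarser ones but not back, follows from the mapping being many-to-one. A small sketch makes this concrete. The fine labels are taken from the Stanford typed dependencies; the coarse labels (`subj`, `obj`, `mod`) and the helper names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical coarse inventory: several fine labels share one coarse label.
FINE_TO_COARSE: Dict[str, str] = {
    "nsubj": "subj", "csubj": "subj",
    "dobj": "obj", "iobj": "obj",
    "amod": "mod", "advmod": "mod",
}


def coarsen(labels: List[str],
            mapping: Dict[str, str] = FINE_TO_COARSE) -> List[str]:
    """Replace each fine label by its coarse label; unlisted labels pass through."""
    return [mapping.get(label, label) for label in labels]


def preimages(mapping: Dict[str, str]) -> Dict[str, Set[str]]:
    """Collect, for each coarse label, the fine labels that map to it."""
    inverse: Dict[str, Set[str]] = defaultdict(set)
    for fine, coarse_label in mapping.items():
        inverse[coarse_label].add(fine)
    return dict(inverse)


coarse = coarsen(["nsubj", "dobj", "amod", "root"])
# Coarse labels with more than one source cannot be mapped back uniquely.
ambiguous = {c: f for c, f in preimages(FINE_TO_COARSE).items() if len(f) > 1}
```

Going from `subj` back to a fine label would require choosing between `nsubj` and `csubj`, information the coarse annotation no longer carries.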


manual annotation

Appears in 3 sentences as: Manual Annotation (1) manual annotation (2)
In Universal Dependency Annotation for Multilingual Parsing
  1. The first is traditional manual annotation, as previously used by Helmreich et al.
    Page 2, “Towards A Universal Treebank”
  2. 2.2 Manual Annotation
    Page 2, “Towards A Universal Treebank”
  3. Such a reduction may ultimately be necessary also in the case of dependency relations, but since most of our data sets were created through manual annotation, we could afford to retain a fine-grained analysis, knowing that it is always possible to map from finer to coarser distinctions, but not vice versa.
    Page 3, “Towards A Universal Treebank”


part-of-speech tags

Appears in 3 sentences as: part-of-speech tagging (1) part-of-speech tags (2)
In Universal Dependency Annotation for Multilingual Parsing
  1. In the context of part-of-speech tagging, universal representations, such as that of Petrov et al.
    Page 2, “Introduction”
  2. For both English and Swedish, we mapped the language-specific part-of-speech tags to universal tags using the mappings of Petrov et al.
    Page 2, “Towards A Universal Treebank”
  3. We use the features of Zhang and Nivre (2011), except that all lexical identities are dropped from the templates during training and testing, hence inducing a ‘delexicalized’ model that employs only ‘universal’ properties from source-side treebanks, such as part-of-speech tags, labels, head-modifier distance, etc.
    Page 4, “Experiments”
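
Sentence 3 describes dropping lexical identities from the parser's feature templates. A simpler way to see the effect, sketched below, is to blank the lexical fields in the data so that only tags, heads and labels remain. This is a stand-in for illustration, not the template-level procedure the sentence describes; the field names (`form`, `lemma`, `upos`) and the function `delexicalize` are hypothetical.

```python
from typing import Dict, List

# Fields that carry word identity; everything else is kept.
LEXICAL_FIELDS = ("form", "lemma")


def delexicalize(sentence: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return a copy of the sentence with lexical fields replaced by "_"."""
    return [
        {key: ("_" if key in LEXICAL_FIELDS else value)
         for key, value in token.items()}
        for token in sentence
    ]


sentence = [
    {"id": "1", "form": "She", "lemma": "she", "upos": "PRON",
     "head": "2", "deprel": "nsubj"},
    {"id": "2", "form": "sleeps", "lemma": "sleep", "upos": "VERB",
     "head": "0", "deprel": "root"},
]
delex = delexicalize(sentence)
```

Because nothing language-specific remains in the input, a parser trained on such data from one language can be applied to another language that shares the tag and label inventories.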
