The Manually Annotated Sub-Corpus: A Community Resource for and by the People
Ide, Nancy and Baker, Collin and Fellbaum, Christiane and Passonneau, Rebecca

Article Structure

Abstract

The Manually Annotated Sub-Corpus (MASC) project provides data and annotations to serve as the base for a community-wide annotation effort of a subset of the American National Corpus.

Introduction

The need for corpora annotated for multiple phenomena across a variety of linguistic layers is keenly recognized in the computational linguistics community.

MASC: The Corpus

MASC is a balanced subset of 500K words of written texts and transcribed speech drawn primarily from the Open American National Corpus (OANC).

MASC Annotations

Annotations for a variety of linguistic phenomena, either manually produced or corrected from output of automatic annotation systems, are being added

MASC Availability and Distribution

Like the OANC, MASC is distributed without license or other restrictions from the American National Corpus website.

Tools

The ANC project provides an API for GrAF annotations that can be used to access and manipulate GrAF annotations directly from Java programs and render GrAF annotations in a format suitable for input to the open source GraphViz graph visualization application. Beyond this, the ANC project does not provide specific tools for use of the corpus, but rather provides the data in formats suitable for use with a variety of available applications, as described in Section 4, together with means to import GrAF annotations into major annotation software platforms.
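
To illustrate what rendering an annotation graph for GraphViz can look like, here is a minimal sketch that serializes a tiny graph as DOT text, the plain-text input format GraphViz reads. It does not use the ANC GrAF API, which is written in Java and not shown in this summary; the function `to_dot`, the node identifiers, and the labels are all hypothetical and chosen only for illustration.

```python
# Hypothetical, simplified stand-in for an annotation graph.
# This is NOT the ANC GrAF API; every name below is an illustrative assumption.

def to_dot(nodes, edges, name="annotations"):
    """Serialize labeled nodes and directed edges as GraphViz DOT text."""
    lines = [f"digraph {name} {{"]
    for node_id, label in nodes.items():
        # Escape backslashes and double quotes so the label stays valid DOT.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  {node_id} [label="{escaped}"];')
    for source, target in edges:
        lines.append(f"  {source} -> {target};")
    lines.append("}")
    return "\n".join(lines)


# A toy noun-phrase annotation over the words "the corpus".
nodes = {"n0": "NP", "n1": "DT: the", "n2": "NN: corpus"}
edges = [("n0", "n1"), ("n0", "n2")]

dot_text = to_dot(nodes, edges)
print(dot_text)
```

The printed text could be saved to a file and passed to a GraphViz layout program such as `dot` to draw the graph.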

Community Contributions

The ANC project solicits contributions of annotations of any kind, applied to any part or all of the MASC data.

Conclusion

MASC is already the most richly annotated corpus of English available for widespread use.

Topics

WordNet

Appears in 11 sentences as: WordNet (13) WordNets (1)
In The Manually Annotated Sub-Corpus: A Community Resource for and by the People
  1. The MASC project is itself producing annotations for portions of the corpus for WordNet senses and FrameNet frames and frame elements.
    Page 3, “MASC Annotations”
  2. This includes WordNet sense annotations, which are not listed in Table 2 because they are not applied to full texts; see Section 3.1 for a description of the WordNet sense annotations in MASC.
    Page 3, “MASC Annotations”
  3. 3.1 WordNet Sense Annotations
    Page 3, “MASC Annotations”
  4. A focus of the MASC project is to provide corpus evidence to support an effort to harmonize sense distinctions in WordNet and FrameNet (Baker and Fellbaum, 2009), (Fellbaum and Baker, to appear).
    Page 3, “MASC Annotations”
  5. The WordNet and FrameNet teams have selected for this purpose 100 common polysemous words whose senses they will study in detail, and the MASC team is annotating occurrences of these words in the MASC.
    Page 3, “MASC Annotations”
  6. As a first step, fifty occurrences of each word are annotated using the WordNet 3.0 inventory and analyzed for problems in sense assignment, after which the WordNet team may make modifications to the inventory if needed.
    Page 3, “MASC Annotations”
  7. The revised inventory (which will be released as part of WordNet 3.1) is then used to annotate 1000 occurrences.
    Page 3, “MASC Annotations”
  8. Furthermore, the FrameNet team is also annotating one hundred of the 1000 sentences for each word with FrameNet frames and frame elements, providing direct comparisons of WordNet and FrameNet sense assignments in
    Page 3, “MASC Annotations”
  9. For convenience, the annotated sentences are provided as a standalone corpus, with the WordNet and FrameNet annotations represented in standoff files.
    Page 4, “MASC Annotations”
  10. The WordNet sense annotations are being used as a base for an extensive inter-annotator agreement study, which is described in detail in (Passonneau et al., 2009), (Passonneau et al., 2010).
    Page 4, “MASC Annotations”
  11. In addition, by virtue of its WordNet and FrameNet annotations, MASC will be linked to parallel WordNets and FrameNets in languages other than English, thus creating a global resource for multilingual technologies, including machine translation.
    Page 5, “Conclusion”

See all papers in Proc. ACL 2010 that mention WordNet.


Penn Treebank

Appears in 6 sentences as: Penn Treebank (6)
In The Manually Annotated Sub-Corpus: A Community Resource for and by the People
  1. The most well-known multiply-annotated and validated corpus of English is the one million word Wall Street Journal corpus known as the Penn Treebank (Marcus et al., 1993), which over the years has been fully or partially annotated for several phenomena over and above the original part-of-speech tagging and phrase structure annotation.
    Page 1, “Introduction”
  2. More recently, the OntoNotes project (Pradhan et al., 2007) released a one million word English corpus of newswire, broadcast news, and broadcast conversation that is annotated for Penn Treebank syntax, PropBank predicate argument structures, coreference, and named entities.
    Page 1, “Introduction”
  3. All of the first 80K increment is annotated for Penn Treebank syntax.
    Page 2, “MASC: The Corpus”
  4. The second 120K increment includes 5.5K words of Wall Street Journal texts that have been annotated by several projects, including Penn Treebank, PropBank, Penn Discourse Treebank, TimeML, and the Pittsburgh Opinion project.
    Page 2, “MASC: The Corpus”
  5. Annotation type    Method     No. texts  No. words
     Token              Validated  118        222472
     Sentence           Validated  118        222472
     POS/lemma          Validated  118        222472
     Noun chunks        Validated  118        222472
     Verb chunks        Validated  118        222472
     Named entities     Validated  118        222472
     FrameNet frames    Manual     21         17829
     HSPG               Validated  40*        30106
     Discourse          Manual     40*        30106
     Penn Treebank      Validated  97         87383
     PropBank           Validated  92         50165
     Opinion            Manual     97         47583
     TimeBank           Validated  34         5434
     Committed belief   Manual     13         4614
     Event              Manual     13         4614
     Coreference        Manual     2          1877
    Page 3, “MASC Annotations”
  6. Annotations produced by other projects and the FrameNet and Penn Treebank annotations produced specifically for MASC are semiautomatically and/or manually produced by those projects and subjected to their internal quality controls.
    Page 4, “MASC Annotations”


Treebank

Appears in 6 sentences as: Treebank (7)
In The Manually Annotated Sub-Corpus: A Community Resource for and by the People
  1. The most well-known multiply-annotated and validated corpus of English is the one million word Wall Street Journal corpus known as the Penn Treebank (Marcus et al., 1993), which over the years has been fully or partially annotated for several phenomena over and above the original part-of-speech tagging and phrase structure annotation.
    Page 1, “Introduction”
  2. More recently, the OntoNotes project (Pradhan et al., 2007) released a one million word English corpus of newswire, broadcast news, and broadcast conversation that is annotated for Penn Treebank syntax, PropBank predicate argument structures, coreference, and named entities.
    Page 1, “Introduction”
  3. All of the first 80K increment is annotated for Penn Treebank syntax.
    Page 2, “MASC: The Corpus”
  4. The second 120K increment includes 5.5K words of Wall Street Journal texts that have been annotated by several projects, including Penn Treebank, PropBank, Penn Discourse Treebank, TimeML, and the Pittsburgh Opinion project.
    Page 2, “MASC: The Corpus”
  5. Annotation type    Method     No. texts  No. words
     Token              Validated  118        222472
     Sentence           Validated  118        222472
     POS/lemma          Validated  118        222472
     Noun chunks        Validated  118        222472
     Verb chunks        Validated  118        222472
     Named entities     Validated  118        222472
     FrameNet frames    Manual     21         17829
     HSPG               Validated  40*        30106
     Discourse          Manual     40*        30106
     Penn Treebank      Validated  97         87383
     PropBank           Validated  92         50165
     Opinion            Manual     97         47583
     TimeBank           Validated  34         5434
     Committed belief   Manual     13         4614
     Event              Manual     13         4614
     Coreference        Manual     2          1877
    Page 3, “MASC Annotations”
  6. Annotations produced by other projects and the FrameNet and Penn Treebank annotations produced specifically for MASC are semiautomatically and/or manually produced by those projects and subjected to their internal quality controls.
    Page 4, “MASC Annotations”


named entities

Appears in 4 sentences as: Named entities (1) named entities (3)
In The Manually Annotated Sub-Corpus: A Community Resource for and by the People
  1. More recently, the OntoNotes project (Pradhan et al., 2007) released a one million word English corpus of newswire, broadcast news, and broadcast conversation that is annotated for Penn Treebank syntax, PropBank predicate argument structures, coreference, and named entities.
    Page 1, “Introduction”
  2. Annotation type    Method     No. texts  No. words
     Token              Validated  118        222472
     Sentence           Validated  118        222472
     POS/lemma          Validated  118        222472
     Noun chunks        Validated  118        222472
     Verb chunks        Validated  118        222472
     Named entities     Validated  118        222472
     FrameNet frames    Manual     21         17829
     HSPG               Validated  40*        30106
     Discourse          Manual     40*        30106
     Penn Treebank      Validated  97         87383
     PropBank           Validated  92         50165
     Opinion            Manual     97         47583
     TimeBank           Validated  34         5434
     Committed belief   Manual     13         4614
     Event              Manual     13         4614
     Coreference        Manual     2          1877
    Page 3, “MASC Annotations”
  3. To derive maximal benefit from the semantic information provided by these resources, the entire corpus is also annotated and manually validated for shallow parses (noun and verb chunks) and named entities (person, location, organization, date and time).
    Page 3, “MASC Annotations”
  4. Automatically-produced annotations for sentence, token, part of speech, shallow parses (noun and verb chunks), and named entities (person, location, organization, date and time) are hand-validated by a team of students.
    Page 4, “MASC Annotations”
