Creating a manually error-tagged and shallow-parsed learner corpus
Ryo Nagata, Edward Whittaker, and Vera Sheinman

Article Structure

Abstract

The availability of learner corpora, especially those which have been manually error-tagged or shallow-parsed, is still limited.

Introduction

The availability of learner corpora is still somewhat limited despite the obvious usefulness of such data in conducting research on natural language processing of learner English in recent years.

Difficulties in Learner Corpus Creation

In addition to the common difficulties in creating any corpus, learner corpus creation has its own difficulties.

Method

3.1 How to Collect and Maintain Texts Written by Learners

The Corpus

We carried out a learner corpus creation project using the described method.

UK and XP stand for unknown and X phrase, respectively.

tion were resolved by consulting the first two.

Conclusions

In this paper, we discussed the difficulties inherent in learner corpus creation and a method for efficiently creating a learner corpus.

Topics

POS tagger

Appears in 18 sentences as: POS tag (2) POS Tagger (1) POS tagger (7) POS taggers (6) POS Tagging (1) POS tags (1)
In Creating a manually error-tagged and shallow-parsed learner corpus
  1. Such a comparison brings up another crucial question: “Do existing POS taggers and chun-
    Page 1, “Introduction”
  2. Nevertheless, a great number of researchers have used existing POS taggers and chunkers to analyze the writing of learners of English.
    Page 2, “Introduction”
  3. For instance, error detection methods normally use a POS tagger and/or a chunker in the error detection process.
    Page 2, “Introduction”
  4. Considering this, we determined a basic rule as follows: “Use the Penn Treebank tag set and preserve the original texts as much as possible.” To handle such errors, we made several modifications and added two new POS tags (CE and UK) and another two for chunking (XP and PH), which are described below.
    Page 5, “Method”
  5. Note that each POS tag is hyphenated.
    Page 5, “Method”
  6. 5.1 POS Tagging
    Page 6, “UK and XP stand for unknown and X phrase, respectively.”
  7. HMM-based and CRF-based POS taggers were tested on the shallow-parsed corpus.
    Page 6, “UK and XP stand for unknown and X phrase, respectively.”
  8. Both use the Penn Treebank POS tag set.
    Page 6, “UK and XP stand for unknown and X phrase, respectively.”
  9. As a result, 19 and 126 sentences (215 and 1,352 tokens) were excluded from the evaluation in the HMM-based and CRF-based POS taggers, respectively.
    Page 6, “UK and XP stand for unknown and X phrase, respectively.”
  10. As shown in Table 4, the CRF-based POS tagger suffers a decrease in accuracy as expected.
    Page 6, “UK and XP stand for unknown and X phrase, respectively.”
  11. Interestingly, the HMM-based POS tagger performed better on the learner corpus.
    Page 6, “UK and XP stand for unknown and X phrase, respectively.”
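
The sentences above describe testing existing HMM- and CRF-based POS taggers on the shallow-parsed learner corpus. The sketch below only illustrates that general workflow, using NLTK's HMM trainer and the small Penn Treebank sample bundled with NLTK rather than the taggers and corpus evaluated in the paper; the example learner sentence and the smoothing choice are assumptions for demonstration.

```python
# Minimal sketch, assuming NLTK is installed: train an HMM POS tagger on the
# Penn Treebank sample and apply it to a learner-style sentence. This is NOT
# the taggers evaluated in the paper; it only shows the general workflow.
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm
from nltk.probability import LidstoneProbDist

nltk.download("treebank", quiet=True)  # fetch the Penn Treebank sample shipped with NLTK

train_sents = treebank.tagged_sents()  # sequences of (word, Penn Treebank tag)

# Lidstone smoothing keeps unseen learner vocabulary from zeroing out probabilities.
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(
    train_sents, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))

# A learner sentence with a subject-verb agreement error; a tagger trained on
# native text must choose among tags such as NN, VBP, or VBZ for "sing".
print(tagger.tag(["everyone", "sing", "together"]))
```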

Penn Treebank

Appears in 9 sentences as: Penn Treebank (9)
In Creating a manually error-tagged and shallow-parsed learner corpus
  1. For similar reasons, to the best of our knowledge, there exists no such learner corpus that is manually shallow-parsed and which is also publicly available, unlike, say, native-speaker corpora such as the Penn Treebank.
    Page 1, “Introduction”
  2. For POS/parsing annotation, there are also a number of annotation schemes including the Brown tag set, the Claws tag set, and the Penn Treebank tag set.
    Page 3, “Difficulties in Learner Corpus Creation”
  3. For instance, there are at least three possibilities for POS-tagging the word sing in the sentence everyone sing together using the Penn Treebank tag set: sing/NN, sing/VBP, or sing/VBZ.
    Page 3, “Difficulties in Learner Corpus Creation”
  4. We selected the Penn Treebank tag set, which is one of the most widely used tag sets, for our
    Page 5, “Method”
  5. Similar to the error annotation scheme, we conducted a pilot study to determine what modifications we needed to make to the Penn Treebank scheme.
    Page 5, “Method”
  6. As a result of the pilot study, we found that the Penn Treebank tag set sufficed in most cases except for errors which learners made.
    Page 5, “Method”
  7. Considering this, we determined a basic rule as follows: “Use the Penn Treebank tag set and preserve the original texts as much as possible.” To handle such errors, we made several modifications and added two new POS tags (CE and UK) and another two for chunking (XP and PH), which are described below.
    Page 5, “Method”
  8. Both use the Penn Treebank POS tag set.
    Page 6, “UK and XP stand for unknown and X phrase, respectively.”
  9. An obvious cause of mistakes in both taggers is that they inevitably make errors in the POSs that are not defined in the Penn Treebank tag set, that is, UK and CE.
    Page 7, “UK and XP stand for unknown and X phrase, respectively.”
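
The items above describe extending the Penn Treebank scheme with the corpus-specific POS tags CE and UK and the chunk labels XP and PH. Below is a minimal sketch of how such an extended tag inventory could be checked when annotating; the abbreviated Penn tag list and the validator function are assumptions for illustration, not the authors' annotation tooling.

```python
# Hypothetical checker for the extended tag set: Penn Treebank POS tags plus
# the corpus-specific tags CE and UK (the paper expands UK as "unknown"), and
# the extra chunk labels XP ("X phrase") and PH. The Penn list is abbreviated.
PENN_POS_TAGS = {
    "CC", "CD", "DT", "EX", "IN", "JJ", "JJR", "JJS", "MD", "NN", "NNS",
    "NNP", "NNPS", "PRP", "PRP$", "RB", "RBR", "RBS", "TO", "UH",
    "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WRB",
    # ... abbreviated; the full Penn Treebank set has more POS and punctuation tags
}
EXTRA_POS_TAGS = {"CE", "UK"}    # corpus-specific POS tags added for learner errors
EXTRA_CHUNK_TAGS = {"XP", "PH"}  # corpus-specific chunk labels (checking would be analogous)

def unknown_pos_tags(tagged_sentence):
    """Return (token, tag) pairs whose tag is outside the extended POS tag set."""
    allowed = PENN_POS_TAGS | EXTRA_POS_TAGS
    return [(tok, tag) for tok, tag in tagged_sentence if tag not in allowed]

# Example: the corpus-specific UK tag is accepted, while a mistyped tag is flagged.
print(unknown_pos_tags([("informations", "UK"), ("sing", "VBP"), ("to", "T0")]))
```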

CRF

Appears in 4 sentences as: CRF (4)
In Creating a manually error-tagged and shallow-parsed learner corpus
  1. “CRFTagger: CRF English POS Tagger,” Xuan-Hieu Phan, http://crftagger...
    Page 6, “UK and XP stand for unknown and X phrase, respectively.”
  2. Table: POS tagging accuracy by method and corpus. CRF: 0.970 (native), 0.932 (learner); HMM: 0.887 (native), 0.926 (learner).
    Page 7, “UK and XP stand for unknown and X phrase, respectively.”
  3. Table header fragment: HMM, CRF, POS, Freq.
    Page 7, “UK and XP stand for unknown and X phrase, respectively.”
  4. Table: accuracy before and after improvement. CRF: 0.932 (original), 0.934 (improved); HMM: 0.926 (original), 0.933 (improved).
    Page 8, “UK and XP stand for unknown and X phrase, respectively.”
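
The tables above compare a CRF-based and an HMM-based tagger. For readers who want to reproduce the general setup, the sketch below trains a CRF-based POS tagger with NLTK's CRFTagger wrapper (which requires the python-crfsuite package) on the Penn Treebank sample and applies it to a learner-style sentence. This is not the CRFTagger tool cited above, and the model file name is hypothetical.

```python
# Illustrative sketch only: a CRF POS tagger via NLTK's CRFTagger wrapper,
# trained on native text and applied to learner text. Requires python-crfsuite.
import nltk
from nltk.corpus import treebank
from nltk.tag import CRFTagger

nltk.download("treebank", quiet=True)

train_sents = list(treebank.tagged_sents())[:2000]  # small native-text subset to keep training quick

crf = CRFTagger()
crf.train(train_sents, "learner_pos.crf.model")  # hypothetical model file name

# Tagging learner writing with a model trained only on native text is the
# setting in which the paper reports a drop in CRF accuracy.
print(crf.tag_sents([["everyone", "sing", "together"]]))
```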

human annotation

Appears in 3 sentences as: human annotation (2) human annotators (1)
In Creating a manually error-tagged and shallow-parsed learner corpus
  1. Accuracy was computed as the number of correctly tagged tokens divided by the total number of tokens (1). If the number of tokens in a sentence was different in the human annotation and the system output, the sentence was excluded from the calculation.
    Page 6, “UK and XP stand for unknown and X phrase, respectively.”
  2. This discrepancy occurred because the tokenization of the system sometimes differed from that of the human annotators.
    Page 6, “UK and XP stand for unknown and X phrase, respectively.”
  3. In the technique, transformation rules are obtained by comparing the output of a POS tagger and the human annotation so that the differences between the two are reduced.
    Page 7, “UK and XP stand for unknown and X phrase, respectively.”
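
The sentences above describe the evaluation protocol: sentences whose token counts differ between the human annotation and the tagger output are excluded, and token-level accuracy is computed over the rest. The sketch below shows one way this could be computed; the data layout, variable names, and example tags are assumptions, not the authors' evaluation script.

```python
# Minimal sketch of token-level POS accuracy with the exclusion rule described
# above: skip any sentence whose token count differs between gold and system.
def pos_accuracy(gold_sents, system_sents):
    """gold_sents / system_sents: parallel lists of [(token, tag), ...] sentences."""
    correct = total = excluded = 0
    for gold, system in zip(gold_sents, system_sents):
        if len(gold) != len(system):   # tokenization mismatch -> exclude sentence
            excluded += 1
            continue
        for (_, g_tag), (_, s_tag) in zip(gold, system):
            correct += (g_tag == s_tag)
            total += 1
    return (correct / total if total else 0.0), excluded

# Example with one sentence: two of three tags agree, nothing is excluded.
gold = [[("everyone", "NN"), ("sing", "VBP"), ("together", "RB")]]
system = [[("everyone", "NN"), ("sing", "VBZ"), ("together", "RB")]]
print(pos_accuracy(gold, system))  # -> (0.666..., 0)
```

The same comparison of gold and system tags is what a transformation-based post-processing step, as mentioned in the last item above, would mine for correction rules.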

natural language

Appears in 3 sentences as: natural language (3)
In Creating a manually error-tagged and shallow-parsed learner corpus
  1. This means that researchers do not have a common development and test set for natural language processing of learner English such as for grammatical error detection.
    Page 1, “Abstract”
  2. The availability of learner corpora is still somewhat limited despite the obvious usefulness of such data in conducting research on natural language processing of learner English in recent years.
    Page 1, “Introduction”
  3. This is one of the most active research areas in natural language processing of learner English.
    Page 1, “Introduction”
