Mining Wiki Resources for Multilingual Named Entity Recognition
Richman, Alexander E. and Schone, Patrick

Article Structure

Abstract

In this paper, we describe a system by which the multilingual characteristics of Wikipedia can be utilized to annotate a large corpus of text with Named Entity Recognition (NER) tags requiring minimal human intervention and no linguistic expertise.

Introduction

Named Entity Recognition (NER) has long been a major task of natural language processing.

Wikipedia 2.1 Structure

Wikipedia is a multilingual, collaborative encyclopedia on the Web which is freely available for research purposes.

Training Data Generation

3.1 Initial Setup and Overview

Evaluation and Results

After each data set was generated, we used the text as a training set for input to PhoenixIDF.

Conclusions

In conclusion, we have demonstrated that Wikipedia can be used to create a Named Entity Recognition system with performance comparable to one developed from 15-40,000 words of human-annotated newswire, while not requiring any linguistic expertise on the part of the user.

Topics

Named Entity

Appears in 8 sentences as: named entities (2) Named Entity (5) named entity (2)
In Mining Wiki Resources for Multilingual Named Entity Recognition
  1. In this paper, we describe a system by which the multilingual characteristics of Wikipedia can be utilized to annotate a large corpus of text with Named Entity Recognition (NER) tags requiring minimal human intervention and no linguistic expertise.
    Page 1, “Abstract”
  2. We show how the Wikipedia format can be used to identify possible named entities and discuss in detail the process by which we use the Category structure inherent to Wikipedia to determine the named entity type of a proposed entity.
    Page 1, “Abstract”
  3. Named Entity Recognition (NER) has long been a major task of natural language processing.
    Page 1, “Introduction”
  4. Toral and Munoz (2006) used Wikipedia to create lists of named entities.
    Page 2, “Wikipedia 2.1 Structure”
  5. Cucerzan (2007), by contrast to the above, used Wikipedia primarily for Named Entity Disambiguation, following the path of Bunescu and Pasca (2006).
    Page 3, “Wikipedia 2.1 Structure”
  6. We elected to use the ACE Named Entity types PERSON, GPE (GeoPolitical Entities), ORGANIZATION, VEHICLE, WEAPON, LOCATION, FACILITY, DATE, TIME, MONEY, and PERCENT.
    Page 3, “Training Data Generation”
  7. Other categories can reliably be used to determine that the article does not refer to a named entity, such as “Category:Endangered species.” We manually derived a relatively small set of key phrases, the most important of which are shown in Table 1.
    Page 3, “Training Data Generation”
  8. In conclusion, we have demonstrated that Wikipedia can be used to create a Named Entity Recognition system with performance comparable to one developed from 15-40,000 words of human-annotated newswire, while not requiring any linguistic expertise on the part of the user.
    Page 8, “Conclusions”
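
The category-based typing described in sentences 2 and 7 can be illustrated with a minimal sketch. This is not the authors' implementation: the key phrases below are hypothetical placeholders (the paper's actual phrases are in its Table 1, which is not reproduced on this page), and the function name, the simple vote count, and the "NONE" marker are assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional

# Hypothetical key phrases mapped to ACE-style types (assumed, not from Table 1).
KEY_PHRASES: Dict[str, str] = {
    "births": "PERSON",
    "deaths": "PERSON",
    "countries": "GPE",
    "cities": "GPE",
    "companies": "ORGANIZATION",
}

# Hypothetical phrases signalling that the article is not a named entity.
NON_ENTITY_PHRASES = ("endangered species",)


def classify_by_categories(categories: Iterable[str]) -> Optional[str]:
    """Guess an entity type from an article's category names.

    Returns "NONE" if a category marks the article as a non-entity,
    None if no key phrase matches, otherwise the type with the most matches.
    """
    votes: Dict[str, int] = {}
    for category in categories:
        name = category.lower()
        if name.startswith("category:"):
            name = name[len("category:"):]
        if any(phrase in name for phrase in NON_ENTITY_PHRASES):
            return "NONE"
        for phrase, tag in KEY_PHRASES.items():
            if phrase in name:
                votes[tag] = votes.get(tag, 0) + 1
    if not votes:
        return None
    # Ties resolve to the type that was matched first.
    return max(votes, key=lambda tag: votes[tag])
```

For instance, an article carrying a hypothetical "Category:1946 births" would be typed PERSON under these placeholder phrases, while one in "Category:Endangered species" would be rejected as a non-entity.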

See all papers in Proc. ACL 2008 that mention Named Entity.


NER

Appears in 8 sentences as: NER (9)
In Mining Wiki Resources for Multilingual Named Entity Recognition
  1. In this paper, we describe a system by which the multilingual characteristics of Wikipedia can be utilized to annotate a large corpus of text with Named Entity Recognition (NER) tags requiring minimal human intervention and no linguistic expertise.
    Page 1, “Abstract”
  2. language data can be used to bootstrap the NER process in other languages.
    Page 1, “Abstract”
  3. Named Entity Recognition (NER) has long been a major task of natural language processing.
    Page 1, “Introduction”
  4. The authors noted that their results would need to pass a manual supervision step before being useful for the NER task, and thus did not evaluate their results in the context of a full NER system.
    Page 2, “Wikipedia 2.1 Structure”
  5. phrases to the classical NER tags (PERSON, LOCATION, etc.)
    Page 3, “Wikipedia 2.1 Structure”
  6. For example, they used the sentence “Franz Fischler is an Austrian politician” to associate the label “politician” to the surface form “Franz Fischler.” They proceeded to show that the dictionaries generated by their method are useful when integrated into an NER system.
    Page 3, “Wikipedia 2.1 Structure”
  7. We also note that the NER component was not the focus of the research, and was specific to the English language.
    Page 3, “Wikipedia 2.1 Structure”
  8. Our approach to multilingual NER is to pull back the decision-making process to English whenever possible, so that we could apply some level of linguistic expertise.
    Page 3, “Training Data Generation”
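
The idea in sentence 8 of pulling the decision back to English can be sketched as a lookup through cross-language links. This is an illustrative sketch only: the function name, the two dictionaries, and their contents are assumptions, and the paper's actual procedure is not given on this page.

```python
from typing import Dict, Optional, Tuple


def resolve_type_via_english(
    title: str,
    lang: str,
    english_links: Dict[Tuple[str, str], str],
    english_types: Dict[str, str],
) -> Optional[str]:
    """Type a non-English article by following its link to the English article.

    english_links maps (language code, article title) to an English title
    (assumed to come from Wikipedia's cross-language links).
    english_types maps an English title to an entity type that was already
    decided on the English side.
    Returns None when no English counterpart or no type is known.
    """
    english_title = title if lang == "en" else english_links.get((lang, title))
    if english_title is None:
        return None
    return english_types.get(english_title)


# Hypothetical data for illustration.
links = {("es", "Francia"): "France"}
types = {"France": "GPE"}
spanish_result = resolve_type_via_english("Francia", "es", links, types)
```

With this toy data, the Spanish article "Francia" inherits the type assigned to the English article "France"; an article with no English counterpart is left untyped here, and a fallback for that case is outside the scope of the sketch.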


human annotated

Appears in 5 sentences as: human annotated (5)
In Mining Wiki Resources for Multilingual Named Entity Recognition
  1. We had three human annotated test sets, Spanish, French and Ukrainian, consisting of newswire.
    Page 6, “Evaluation and Results”
  2. When human annotated sets were not available, we held out more than 100,000 words of text generated by our wiki-mining process to use as a test set.
    Page 6, “Evaluation and Results”
  3. The first consists of 25,000 words of human annotated newswire derived from the ACE 2007 test set, manually modified to conform to our extended MUC-style standards.
    Page 7, “Evaluation and Results”
  4. For this evaluation, we have 25,000 words of human annotated newswire (Agence France Presse, 30 April and 1 May 1997) covering diverse topics.
    Page 7, “Evaluation and Results”
  5. For Portuguese, Russian, and Polish, we did not have human annotated corpora available for test-
    Page 8, “Evaluation and Results”
