Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
Sungchul Kim, Kristina Toutanova, and Hwanjo Yu

Article Structure

Abstract

In this paper we propose a method to automatically label multilingual data with named entity tags.

Introduction

Named Entity Recognition (NER) is a frequently needed technology in NLP applications.

Data and task

As a case study, we focus on two very different foreign languages: Korean and Bulgarian.

Topics

named entity

Appears in 11 sentences as: named entities (5), Named Entity (1), named entity (6)
In Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
  1. In this paper we propose a method to automatically label multilingual data with named entity tags.
    Page 1, “Abstract”
  2. Named Entity Recognition (NER) is a frequently needed technology in NLP applications.
    Page 1, “Introduction”
  3. Of these, we manually annotated 91 English-Bulgarian and 79 English-Korean sentence pairs with source and target named entities as well as word-alignment links among named entities in the two languages.
    Page 2, “Data and task”
  4. The named entity annotation scheme followed has the labels GPE (Geopolitical entity), PER (Person), ORG (Organization), and DATE.
    Page 2, “Data and task”
  5. The other is that the same information might be expressed using a named entity in one language, and using a non-entity phrase in the other language (e.g. “He is from Bulgaria” versus “He is Bulgarian”).
    Page 2, “Data and task”
  6. We followed the approach of Richman and Schone (2008) to derive named entity annotations of both English and foreign phrases in Wikipedia, using Wikipedia metadata.
    Page 2, “Data and task”
  7. To tag English language phrases, we first derived named entity categorizations of English article titles, by assigning a tag based on the article’s category information.
    Page 3, “Data and task”
  8. The semi-CRF defines a distribution over foreign sentence labeled segmentations (where the segments are named entities with their labels, or segments of length one with label “NONE”).
    Page 6, “Data and task”
  9. As discussed throughout the paper, our model builds upon prior work on Wikipedia metadata-based NE tagging (Richman and Schone, 2008) and cross-lingual projection for named entities (Feng et al., 2004).
    Page 8, “Data and task”
  10. Other interesting work on aligning named entities in two languages is reported in (Huang and Vogel, 2002; Moore, 2003).
    Page 8, “Data and task”
  11. In this paper we showed that using resources from Wikipedia, it is possible to combine metadata-based approaches and projection-based approaches for inducing named entity annotations for foreign languages.
    Page 9, “Data and task”
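Item 7 above describes assigning a named entity tag to an English article title from the article's Wikipedia category information, in the style of Richman and Schone (2008). A minimal sketch of that idea, where the keyword lists and the `tag_from_categories` helper are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch of metadata-based NE tagging: map an article's
# Wikipedia categories to one of the paper's four labels (GPE, PER,
# ORG, DATE) by keyword matching. Keyword lists are illustrative only.
CATEGORY_KEYWORDS = {
    "PER": ["births", "deaths", "people"],
    "ORG": ["companies", "organizations"],
    "GPE": ["countries", "cities", "capitals"],
    "DATE": ["years", "centuries", "decades"],
}

def tag_from_categories(categories):
    """Return the first label whose keywords match any category, else None."""
    for label, keywords in CATEGORY_KEYWORDS.items():
        for cat in categories:
            if any(kw in cat.lower() for kw in keywords):
                return label
    return None

print(tag_from_categories(["1952 births", "American physicists"]))  # PER
print(tag_from_categories(["Capitals in Europe"]))                  # GPE
```

The real system uses much richer category and interwiki-link heuristics; this only illustrates the keyword-on-categories core of the approach.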

sentence pairs

Appears in 9 sentences as: sentence pair (3), sentence pairs (6)
  1. Our results show that the semi-CRF model improves on the performance of projection models by more than 10 points in F-measure, and that we can achieve tagging F-measure of over 91 using a very small number of annotated sentence pairs.
    Page 1, “Introduction”
  2. A total of 13,410 English-Bulgarian and 8,832 English-Korean sentence pairs were extracted.
    Page 2, “Data and task”
  3. Of these, we manually annotated 91 English-Bulgarian and 79 English-Korean sentence pairs with source and target named entities as well as word-alignment links among named entities in the two languages.
    Page 2, “Data and task”
  4. Figure 1 illustrates a Bulgarian-English sentence pair with alignment.
    Page 2, “Data and task”
  5. They can be applied to tag foreign sentences in English-foreign sentence pairs extracted from Wikipedia.
    Page 4, “Data and task”
  6. The features on segments can also use information from the corresponding English sentence e along with external annotations on the sentence pair A.
    Page 6, “Data and task”
  7. The features are the ones that fire on the segment of length 1 containing the Bulgarian equivalent of the word “Split” and labeled with label GPE (tj=13, uj=13, yj=GPE), from the English-Bulgarian sentence pair in Figure 1.
    Page 6, “Data and task”
  8. We should note that the proposed method can only tag foreign sentences in English-foreign sentence pairs.
    Page 8, “Data and task”
  9. We presented a direct semi-CRF tagging model for labeling foreign sentences in parallel sentence pairs, which outperformed projection by more than 10 F-measure points for Bulgarian and Korean.
    Page 9, “Data and task”

word alignments

Appears in 7 sentences as: Word alignment (1), word alignment (2), word alignments (4)
  1. Word alignment features
    Page 4, “Data and task”
  2. We exploit a feature set based on HMM word alignments in both directions (Och and Ney, 2000).
    Page 4, “Data and task”
  3. The first oracle ORACLE1 has access to the gold-standard English entities and gold-standard word alignments among English and foreign words.
    Page 5, “Data and task”
  4. Note that the word alignments do not uniquely identify the corresponding foreign phrase for each English phrase and some error is possible due to this.
    Page 5, “Data and task”
  5. The performance of ORACLE2 is determined by the error in automatic word alignment and in determining phonetic correspondence.
    Page 5, “Data and task”
  6. Another annotation type is derived from HMM-based word alignments and the transliteration model described in Section 4.
    Page 7, “Data and task”
  7. In addition to segment-level comparisons, they also look at tag assignments for individual source tokens linked to the individual target tokens (by word alignment and transliteration links).
    Page 7, “Data and task”
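Several of the sentences above rely on HMM word alignments run in both directions (Och and Ney, 2000). Before use, the two directional link sets are typically symmetrized; the sketch below shows the simplest such combination, intersection. The function name and link pairs are illustrative, not taken from the paper:

```python
# Toy symmetrization of bidirectional word alignments by intersection:
# keep only links that both directional aligners propose. Links are
# represented as (source_index, target_index) pairs.
def intersect_alignments(src_to_tgt, tgt_to_src):
    """Return the sorted set of links proposed in both directions."""
    forward = set(src_to_tgt)
    backward = {(s, t) for (t, s) in tgt_to_src}  # flip to source-major order
    return sorted(forward & backward)

e2f = [(0, 0), (1, 2), (2, 1)]   # English -> foreign links
f2e = [(0, 0), (2, 1), (3, 3)]   # foreign -> English links (target, source)
print(intersect_alignments(e2f, f2e))  # [(0, 0), (1, 2)]
```

Intersection gives high-precision, lower-recall links; union or grow-diag heuristics trade that off the other way.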

NER

Appears in 6 sentences as: NER (6)
  1. Named Entity Recognition (NER) is a frequently needed technology in NLP applications.
    Page 1, “Introduction”
  2. State-of-the-art statistical models for NER typically require a large amount of training data and linguistic expertise to be sufficiently accurate, which makes it nearly impossible to build high-accuracy models for a large number of languages.
    Page 1, “Introduction”
  3. Recently, there have been two lines of work which have offered hope for creating NER analyzers in many languages.
    Page 1, “Introduction”
  4. The second has been to use parallel English-foreign language data, a high-quality NER tagger for English, and projected annotations for the foreign language (Yarowsky et al., 2001; Das and Petrov, 2011).
    Page 1, “Introduction”
  5. The goal of this work is to create high-accuracy NER annotated data for foreign languages.
    Page 1, “Introduction”
  6. The Figure also shows the results of the Stanford NER tagger for English (Finkel et al., 2005) (we used the MUC-7 classifier).
    Page 3, “Data and task”

F-measure

Appears in 4 sentences as: F-measure (4)
  1. Our results show that the semi-CRF model improves on the performance of projection models by more than 10 points in F-measure, and that we can achieve tagging F-measure of over 91 using a very small number of annotated sentence pairs.
    Page 1, “Introduction”
  2. For Bulgarian, the F-measure of the full model is 92.8 compared to the best baseline result of 83.2.
    Page 8, “Data and task”
  3. Within the semi-CRF model, the contribution of English sentence context was substantial, leading to a 2.5 point increase in F-measure for Bulgarian (92.8 versus 90.3 F-measure), and a 4.0 point increase for Korean (91.2 versus 87.2).
    Page 8, “Data and task”
  4. Preliminary results show performance of over 80 F-measure for such monolingual models.
    Page 8, “Data and task”
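All of the scores quoted above are F-measures, the harmonic mean of precision and recall. As a quick reference, the general F-beta formula in code; the precision/recall pair in the example is illustrative (not reported in the paper), chosen only so the result lands near the 92.8 figure quoted above:

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta: weighted harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative P/R values yielding an F1 close to the 92.8 quoted
# for Bulgarian above (the paper does not report this P/R pair).
print(round(f_measure(94.0, 91.6), 1))  # 92.8
```

With beta=1 (the usual F1 reported in NER work), precision and recall are weighted equally; beta>1 favors recall.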

parallel sentence

Appears in 4 sentences as: parallel sentence (2), parallel sentences (2)
  1. Here we combine elements of both Wikipedia metadata-based approaches and projection-based approaches, making use of parallel sentences extracted from Wikipedia.
    Page 1, “Introduction”
  2. The approach uses a small amount of manually annotated article-pairs to train a document-level CRF model for parallel sentence extraction.
    Page 2, “Data and task”
  3. This is due to two phenomena: one is that the parallel sentences sometimes contain different amounts of information and one language might use more detail than the other.
    Page 2, “Data and task”
  4. We presented a direct semi-CRF tagging model for labeling foreign sentences in parallel sentence pairs, which outperformed projection by more than 10 F-measure points for Bulgarian and Korean.
    Page 9, “Data and task”

log-linear

Appears in 3 sentences as: log-linear (3)
  1. (2010a), our model can incorporate both monolingual and bilingual features in a log-linear framework.
    Page 1, “Introduction”
  2. (2004) to train a log-linear model for projection.
    Page 4, “Data and task”
  3. Compared to the joint log-linear model of Burkett et al.
    Page 9, “Data and task”

manually annotated

Appears in 3 sentences as: manually annotated (3)
  1. The approach uses a small amount of manually annotated article-pairs to train a document-level CRF model for parallel sentence extraction.
    Page 2, “Data and task”
  2. Of these, we manually annotated 91 English-Bulgarian and 79 English-Korean sentence pairs with source and target named entities as well as word-alignment links among named entities in the two languages.
    Page 2, “Data and task”
  3. At test time we use the local+global Wiki-based tagger to define the English entities and we don’t use the manually annotated alignments.
    Page 4, “Data and task”
