Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
Darwish, Kareem

Article Structure

Abstract

Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities.

Introduction

Named Entity Recognition (NER) is essential for a variety of Natural Language Processing (NLP) applications such as information extraction.

Related Work

2.1 Using Cross-lingual Features

Baseline Arabic NER System

For the baseline system, we used the CRF++ implementation of CRF sequence labeling with default parameters.
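
To make this concrete, here is a minimal sketch of driving CRF++ with default parameters from Python. The feature template and the file names train.data/test.data are illustrative assumptions, not the paper's exact configuration; only the crf_learn/crf_test invocations follow CRF++'s documented usage.

# Minimal sketch: train and apply a CRF++ model with default parameters.
# Assumes the crf_learn/crf_test binaries are installed and that
# train.data/test.data are in CRF++'s token-per-line format
# (token<TAB>...features...<TAB>label, blank line between sentences).
import subprocess

# Illustrative unigram template over a +/-1 token window, plus the label
# bigram feature; the paper's actual template is not specified here.
TEMPLATE = """U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
B
"""

with open("template", "w", encoding="utf-8") as f:
    f.write(TEMPLATE)

# Train with default parameters, as in the baseline system.
subprocess.run(["crf_learn", "template", "train.data", "model"], check=True)
# Label the test set; CRF++ prints tokens with a predicted label column.
subprocess.run(["crf_test", "-m", "model", "test.data"], check=True)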

Cross-lingual Features

We experimented with three different cross-lingual features that used Arabic and English Wikipedia cross-language links and a true-cased phrase table that was generated using Moses (Koehn et al., 2007).
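
As a concrete illustration of the bridge these features rely on, the sketch below maps an Arabic phrase to its English Wikipedia counterpart and checks capitalization. The ar2en dictionary is a hypothetical stand-in for entries harvested from Wikipedia's cross-language (langlinks) data, not the paper's pipeline.

# Sketch: Wikipedia cross-language links as an Arabic-to-English bridge.
ar2en = {
    "القاهرة": "Cairo",   # Arabic article linked to English "Cairo"
    "نهر النيل": "Nile",  # Arabic article linked to English "Nile"
}

def capitalized_english_link(arabic_phrase):
    """Weak named-entity evidence: the linked English title is capitalized."""
    en = ar2en.get(arabic_phrase)
    return en is not None and en[:1].isupper()

print(capitalized_english_link("القاهرة"))  # True
print(capitalized_english_link("كتاب"))    # False: no cross-language link here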

Conclusion

In this paper, we presented different cross-lingual features that can make use of linguistic properties and knowledge bases of other languages for NER.

Topics

cross-lingual

Appears in 35 sentences as: Cross-lingual (2) cross-lingual (34)
In Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
  1. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links.
    Page 1, “Abstract”
  2. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs.
    Page 1, “Abstract”
  3. To address this problem, we introduce the use of cross-lingual links between a disadvantaged language, Arabic, and a language with good discriminative features and large resources, English, to improve Arabic NER.
    Page 2, “Introduction”
  4. Cross-lingual links are obtained using Wikipedia cross-language links and a large Machine Translation (MT) phrase table that is true cased, where word casing is preserved during training.
    Page 2, “Introduction”
  5. - Using cross-lingual links to exploit orthographic features in other languages.
    Page 2, “Introduction”
  6. - Using cross-lingual links to exploit a large knowledge base, namely English DBpedia, to benefit NER.
    Page 2, “Introduction”
  7. - Improving over the best reported results in the literature by 4.1% (Abdul-Hamid and Darwish, 2010) by simply adding cross-lingual features.
    Page 2, “Introduction”
  8. The remainder of the paper is organized as follows: Section 2 provides related work; Section 3 describes the baseline system; Section 4 introduces the cross-lingual features and reports on their effectiveness; and Section 5 concludes the paper.
    Page 2, “Introduction”
  9. 2.1 Using Cross-lingual Features
    Page 2, “Related Work”
  10. If cross-lingual resources, such as parallel data, are available, then increased training data, better resources, or superior features can be used to improve processing.
    Page 2, “Related Work”
  11. To overcome these two problems, we use cross-lingual features to improve NER using large bilingual resources, and we incorporate confidences to avoid having a binary feature.
    Page 2, “Related Work”
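
Because these confidences are real-valued while common CRF toolkits such as CRF++ accept only nominal features (a constraint noted later in the paper), the confidences have to be discretized. A minimal binning sketch, with an assumed bin count of 10:

# Sketch: convert a real-valued confidence into a nominal feature by binning.
def bin_confidence(p, n_bins=10):
    """Map a probability in [0, 1] to a bin label such as 'B7'."""
    if p is None:
        return "NULL"  # the paper assigns null when a word is unseen
    return "B" + str(min(int(p * n_bins), n_bins - 1))

print(bin_confidence(0.73))  # B7
print(bin_confidence(None))  # NULL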

NER

Appears in 31 sentences as: NER (31)
In Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
  1. Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities.
    Page 1, “Abstract”
  2. Named Entity Recognition (NER) is essential for a variety of Natural Language Processing (NLP) applications such as information extraction.
    Page 1, “Introduction”
  3. There has been a fair amount of work on NER for a variety of languages including Arabic.
    Page 1, “Introduction”
  4. To train an NER system, some of the following feature types are typically used (Benajiba and Rosso, 2008; Nadeau and Sekine, 2009):
    Page 1, “Introduction”
  5. One of the most effective orthographic features is capitalization in English, which helps NER to generalize to new text of different genres.
    Page 1, “Introduction”
  6. For example, morphological, contextual, and character-level features have been shown to be effective for Arabic NER (Benajiba and Rosso, 2008).
    Page 1, “Introduction”
  7. Since the Arabic gazetteers that were used for NER were small (Benajiba and Rosso, 2008), there have been efforts to build larger Arabic gazetteers (Attia et al., 2010).
    Page 2, “Introduction”
  8. Since training and test parts of standard datasets for Arabic NER are drawn from the same genre in relatively close temporal proximity, a named entity recognizer that simply memorizes named entities in the training set generally performs well on such test sets.
    Page 2, “Introduction”
  9. To address this problem, we introduce the use of cross-lingual links between a disadvantaged language, Arabic, and a language with good discriminative features and large resources, English, to improve Arabic NER.
    Page 2, “Introduction”
  10. We also show how to use transliteration mining to improve NER, even when neither language has a capitalization (or similar) feature.
    Page 2, “Introduction”
  11. - Employing transliteration mining to improve NER.
    Page 2, “Introduction”

named entities

Appears in 26 sentences as: Name Entity (1) named entities (17) Named Entity (1) named entity (10)
In Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
  1. Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities.
    Page 1, “Abstract”
  2. Named Entity Recognition (NER) is essential for a variety of Natural Language Processing (NLP) applications such as information extraction.
    Page 1, “Introduction”
  3. - Contextual features: Certain words are indicative of the existence of named entities.
    Page 1, “Introduction”
  4. For example, the word “said” is often preceded by a named entity of type “person” or “organization”.
    Page 1, “Introduction”
  5. Such features can be indicative or counter-indicative of the existence of named entities.
    Page 1, “Introduction”
  6. For example, a word ending with “ing” is typically not a named entity, while a word ending in “berg” is often a named entity.
    Page 1, “Introduction”
  7. - Part-of-speech (POS) tags and morphological features: POS tags indicate (or counter-indicate) the possible presence of a named entity at word level or at word sequence level.
    Page 1, “Introduction”
  8. Morphological features can mostly indicate the absence of named entities.
    Page 1, “Introduction”
  9. However, pronouns are rarely, if ever, attached to named entities.
    Page 1, “Introduction”
  10. - Gazetteers: This feature checks the presence of a word or a sequence of words in large lists of named entities.
    Page 1, “Introduction”
  11. However, Arabic lacks indicative orthographic features that generalize to previously unseen named entities.
    Page 1, “Introduction”
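
Since several of the sentences above appeal to character-level evidence (the “ing”/“berg” examples), a minimal sketch of leading/trailing character features may help; the feature lengths used here are an assumption, not the paper's exact settings.

# Sketch: leading/trailing character features of the kind used for NER.
def char_features(word, max_len=3):
    """Extract prefix/suffix features up to max_len characters."""
    feats = {}
    for n in range(1, max_len + 1):
        feats["prefix" + str(n)] = word[:n]
        feats["suffix" + str(n)] = word[-n:]
    return feats

# A word ending in "berg"/"burg" is often a named entity:
print(char_features("Hamburg"))
# {'prefix1': 'H', 'suffix1': 'g', 'prefix2': 'Ha', 'suffix2': 'rg',
#  'prefix3': 'Ham', 'suffix3': 'urg'}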

F-measure

Appears in 16 sentences as: F-measure (17)
In Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
  1. They reported 80%, 37%, and 47% F-measure for locations, organizations, and persons respectively on the ANERCorp dataset that they created and publicly released.
    Page 3, “Related Work”
  2. They reported 87%, 46%, and 52% F-measure for locations, organizations, and persons respectively.
    Page 3, “Related Work”
  3. Using POS tagging generally improved recall at the expense of precision, leading to overall improvements in F-measure.
    Page 3, “Related Work”
  4. Using all their suggested features, they reported 90%, 66%, and 73% F-measure for locations, organizations, and persons respectively.
    Page 3, “Related Work”
  5. They did not report per-category F-measure, but they reported overall 81%, 75%, and 78% macro-average F-measure for broadcast news and newswire on the ACE 2003, 2004, and 2005 datasets respectively.
    Page 3, “Related Work”
  6. Huang (2005) used an HMM-based NE recognizer for Arabic and reported 77% F-measure on the ACE 2003 dataset.
    Page 3, “Related Work”
  7. They reported 70% F-measure on the ACE 2005 dataset.
    Page 3, “Related Work”
  8. They reported upwards of 93% F-measure, but they conducted their experiments on nonstandard datasets, making comparison difficult.
    Page 3, “Related Work”
  9. They reported an F-measure of 76% and 81% on the ACE 2005 and ANERCorp datasets respectively.
    Page 3, “Related Work”
  10. This led to an overall improvement in F-measure of 1.8 to 3.4 points (absolute) or 4.2% to 5.7% (relative).
    Page 5, “Cross-lingual Features”
  11. Transliteration mining slightly lowered precision (except for the TWEETS test set, where the drop in precision was significant) and increased recall, leading to an overall improvement in F-measure for all test sets.
    Page 6, “Cross-lingual Features”
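
For reference, the F-measure cited in these sentences is the balanced harmonic mean of precision P and recall R:

\( F_1 = \frac{2PR}{P + R} \)

This is why several results above show an overall F-measure gain despite a precision drop: a large enough gain in recall outweighs a small loss in precision.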

phrase table

Appears in 15 sentences as: phrase table (14) phrase tables (1)
In Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
  1. Cross-lingual links are obtained using Wikipedia cross-language links and a large Machine Translation (MT) phrase table that is true cased, where word casing is preserved during training.
    Page 2, “Introduction”
  2. Transliteration Mining (TM) has been used to enrich MT phrase tables or to improve cross-language search (Udupa et al., 2009).
    Page 3, “Related Work”
  3. We experimented with three different cross-lingual features that used Arabic and English Wikipedia cross-language links and a true-cased phrase table that was generated using Moses (Koehn et al., 2007).
    Page 5, “Cross-lingual Features”
  4. The phrase table was trained on a set of 3.69 million parallel sentences containing 123.4 million English tokens.
    Page 5, “Cross-lingual Features”
  5. To capture cross-lingual capitalization, we used the aforementioned true-cased phrase table at word and sequence levels.
    Page 5, “Cross-lingual Features”
  6. Input: true-cased phrase table PT, sentence S containing n words w_{0..n}, max sequence length l, translations T_{k,0..m} of w_{i..j}
     for i = 0 to n do
       j = min(i + l - 1, n)
       if PT contains w_{i..j} and ∃ T_k that isCaps then …
    Page 5, “Cross-lingual Features”
  7. Where: PT was the aforementioned phrase table; l = 4; P(T_k) equaled the product of p(source|target) and p(target|source) for a word sequence; and isCaps and notCaps denoted whether the translation was capitalized or not, respectively.
    Page 5, “Cross-lingual Features”
  8. We performed transliteration mining (aka cognate matching) at word level for each Arabic word against all its possible translations in the phrase table.
    Page 6, “Cross-lingual Features”
  9. We retained valid target sequences that produced translations in the phrase table.
    Page 6, “Cross-lingual Features”
  10. \( \frac{\sum_{T_k \in \text{Transliteration}} P(T_k)}{\sum_{T_k \in \text{Transliteration}} P(T_k) + \sum_{T_k \notin \text{Transliteration}} P(T_k)} \) (1), where \( P(T_k) \) is the probability of the kth translation of a word in the phrase table.
    Page 6, “Cross-lingual Features”
  11. If a word was not found in the phrase table, the feature value was assigned null.
    Page 6, “Cross-lingual Features”
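
A runnable rendering of the capitalization scan in sentence 6 above may be useful. The toy phrase table and the way translation scores are stored are assumptions for illustration; only the scan over word sequences of up to l words and the capitalization test follow the pseudocode.

# Sketch: flag word sequences whose true-cased translations are capitalized.
PT = {
    # toy true-cased phrase table: source phrase -> [(translation, score)]
    "نيويورك": [("New York", 0.8), ("new york", 0.1)],
}

def is_caps(translation):
    return translation[:1].isupper()

def capitalized_spans(words, l=4):
    """Yield (i, j) spans that have at least one capitalized translation."""
    n = len(words)
    for i in range(n):
        for j in range(i, min(i + l, n)):
            phrase = " ".join(words[i:j + 1])
            if any(is_caps(t) for t, _ in PT.get(phrase, [])):
                yield (i, j)

print(list(capitalized_spans(["نيويورك"])))  # [(0, 0)]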
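Similarly, a sketch of the transliteration feature of equation (1) in sentence 10: the fraction of a word's translation probability mass contributed by transliterations, with a null value for words outside the phrase table. The input representation is an assumption.

# Sketch of equation (1): share of translation probability mass that comes
# from transliterations of the word.
def transliteration_fraction(translations):
    """translations: list of (P(T_k), is_transliteration) pairs."""
    if not translations:
        return None  # word not in the phrase table -> null feature value
    translit = sum(p for p, is_tr in translations if is_tr)
    total = sum(p for p, _ in translations)
    return translit / total if total > 0 else None

# Two transliterated translations plus one ordinary translation:
print(transliteration_fraction([(0.6, True), (0.1, True), (0.3, False)]))  # 0.7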

baseline system

Appears in 7 sentences as: baseline system (7)
In Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
  1. The remainder of the paper is organized as follows: Section 2 provides related work; Section 3 describes the baseline system; Section 4 introduces the cross-lingual features and reports on their effectiveness; and Section 5 concludes the paper.
    Page 2, “Introduction”
  2. We used their simplified features in our baseline system .
    Page 3, “Related Work”
  3. For the baseline system, we used the CRF++ implementation of CRF sequence labeling with default parameters.
    Page 3, “Baseline Arabic NER System”
  4. Table 4 reports on the results of the baseline system with the capitalization feature on the three datasets.
    Page 5, “Cross-lingual Features”
  5. Table 5 reports on the results using the baseline system with the transliteration mining feature.
    Page 6, “Cross-lingual Features”
  6. Table 6 reports on the results of using the baseline system with the two DBpedia features.
    Page 7, “Cross-lingual Features”
  7. For Arabic NER, the new features yielded an improvement of 5.5% over a strong baseline system on a standard dataset, with 10.7% gain in recall and negligible change in precision.
    Page 9, “Conclusion”

knowledge bases

Appears in 7 sentences as: knowledge base (3) knowledge bases (4)
In Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
  1. Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities.
    Page 1, “Abstract”
  2. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia.
    Page 1, “Abstract”
  3. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links.
    Page 1, “Abstract”
  4. - Using cross-lingual links to exploit a large knowledge base, namely English DBpedia, to benefit NER.
    Page 2, “Introduction”
  5. DBpedia is a large collaboratively-built knowledge base in which structured information is extracted from Wikipedia (Bizer et al., 2009).
    Page 6, “Cross-lingual Features”
  6. In this paper, we presented different cross-lingual features that can make use of linguistic properties and knowledge bases of other languages for NER.
    Page 9, “Conclusion”
  7. We used English as the “helper” language and we exploited the English capitalization feature and an English knowledge base , DBpedia.
    Page 9, “Conclusion”
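
As an illustration of the kind of type information DBpedia contributes, the sketch below queries the public SPARQL endpoint for the types of an English resource reached via a cross-language link. The paper presumably worked from DBpedia dumps rather than live queries, so treat this purely as a demonstration; it requires the SPARQLWrapper package.

# Sketch: fetch DBpedia types for the English resource "Cairo".
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?type WHERE { <http://dbpedia.org/resource/Cairo> a ?type }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["type"]["value"])  # e.g. http://dbpedia.org/ontology/City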

CRF

Appears in 6 sentences as: CRF (5) CRF++ (1)
In Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
  1. Sequence labeling algorithms (e.g., Conditional Random Fields (CRF)) can often identify such indicative words.
    Page 1, “Introduction”
  2. Benajiba and Rosso (2008) used CRF sequence labeling and incorporated many language-specific features, namely POS tagging, base-phrase chunking, Arabic tokenization, and adjectives indicating nationality.
    Page 3, “Related Work”
  3. In later work (2008), they examined the same feature set on the Automatic Content Extraction (ACE) datasets using CRF sequence labeling and a Support Vector Machine (SVM) classifier.
    Page 3, “Related Work”
  4. The use of CRF sequence labeling for NER has shown success (McCallum and Li, 2003; Nadeau and Sekine, 2009; Benajiba and Rosso, 2008).
    Page 3, “Related Work”
  5. For the baseline system, we used the CRF++ implementation of CRF sequence labeling with default parameters.
    Page 3, “Baseline Arabic NER System”
  6. isCaps and notCaps denoted whether the translation was capitalized or not, respectively; and the weights were binned because CRF++ only takes nominal features.
    Page 5, “Cross-lingual Features”

POS tags

Appears in 6 sentences as: POS tagging (2) POS tags (4)
In Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
  1. - Part-of-speech (POS) tags and morphological features: POS tags indicate (or counter-indicate) the possible presence of a named entity at word level or at word sequence level.
    Page 1, “Introduction”
  2. Benajiba and Rosso (2007) improved their system by incorporating POS tags to improve NE boundary detection.
    Page 3, “Related Work”
  3. Benajiba and Rosso (2008) used CRF sequence labeling and incorporated many language-specific features, namely POS tagging, base-phrase chunking, Arabic tokenization, and adjectives indicating nationality.
    Page 3, “Related Work”
  4. Using POS tagging generally improved recall at the expense of precision, leading to overall improvements in F-measure.
    Page 3, “Related Work”
  5. Other work (2008) used POS tags obtained from an Arabic tagger to enhance NER.
    Page 3, “Related Work”
  6. Another system (2003) used thousands of language-independent features such as character n-grams, capitalization, word length, and position in a sentence, along with language-dependent features such as POS tags and BP chunking.
    Page 3, “Related Work”

sequence labeling

Appears in 5 sentences as: Sequence labeling (1) sequence labeling (4)
In Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
  1. Sequence labeling algorithms (e.g., Conditional Random Fields (CRF)) can often identify such indicative words.
    Page 1, “Introduction”
  2. Benajiba and Rosso (2008) used CRF sequence labeling and incorporated many language-specific features, namely POS tagging, base-phrase chunking, Arabic tokenization, and adjectives indicating nationality.
    Page 3, “Related Work”
  3. In later work (2008), they examined the same feature set on the Automatic Content Extraction (ACE) datasets using CRF sequence labeling and a Support Vector Machine (SVM) classifier.
    Page 3, “Related Work”
  4. The use of CRF sequence labeling for NER has shown success (McCallum and Li, 2003; Nadeau and Sekine, 2009; Benajiba and Rosso, 2008).
    Page 3, “Related Work”
  5. For the baseline system, we used the CRF++ implementation of CRF sequence labeling with default parameters.
    Page 3, “Baseline Arabic NER System”

feature set

Appears in 3 sentences as: feature set (3)
In Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
  1. Another system (2007) used a maximum entropy classifier trained on a feature set that included the use of gazetteers and a stop-word list, the appearance of an NE in the training set, leading and trailing word bigrams, and the tag of the previous word.
    Page 3, “Related Work”
  2. In later work (2008), they examined the same feature set on the Automatic Content Extraction (ACE) datasets using CRF sequence labeling and a Support Vector Machine (SVM) classifier.
    Page 3, “Related Work”
  3. Abdul-Hamid and Darwish (2010) used a simplified feature set that relied primarily on character level features, namely leading and trailing letters in a word.
    Page 3, “Related Work”

parallel data

Appears in 3 sentences as: parallel data (3)
In Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
  1. If cross-lingual resources, such as parallel data, are available, then increased training data, better resources, or superior features can be used to improve processing.
    Page 2, “Related Work”
  2. They did so by training a bilingual model and then generating more training data from unlabeled parallel data.
    Page 2, “Related Work”
  3. The sentences were drawn from the UN parallel data along with a variety of parallel news data from LDC and the GALE project.
    Page 5, “Cross-lingual Features”
