Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER
Saha, Sujan Kumar and Mitra, Pabitra and Sarkar, Sudeshna

Article Structure

Abstract

Statistical machine learning methods are employed to train a Named Entity Recognizer from annotated data.

Introduction

Named Entity Recognition (NER) involves locating and classifying the names in a text.

Maximum Entropy Based Model for Hindi NER

The Maximum Entropy (MaxEnt) principle is a commonly used technique which provides the probability that a token belongs to a class.

Word Clustering

Clustering is the process of grouping together objects based on their similarity.

Important Word Selection

It is noted that not all words are equally important in determining the NE category.
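
One way such a corpus-based importance measure could look is sketched below; the NE-window criterion, the thresholds, and all names here are illustrative assumptions, not the paper's actual measure (which is defined in its Section 4):

    from collections import Counter

    def importance_scores(tagged_corpus, window=2, min_count=5):
        """Hypothetical importance score: the fraction of a word's
        occurrences that fall within `window` tokens of a Named Entity."""
        total, near_ne = Counter(), Counter()
        for sentence in tagged_corpus:  # each sentence: [(word, is_ne), ...]
            ne_pos = {i for i, (_, is_ne) in enumerate(sentence) if is_ne}
            for i, (w, _) in enumerate(sentence):
                total[w] += 1
                if any(i != j and abs(i - j) <= window for j in ne_pos):
                    near_ne[w] += 1
        # Keep only words frequent enough for the ratio to be reliable.
        return {w: near_ne[w] / c for w, c in total.items() if c >= min_count}

Words that score high under such a measure would be retained as context features; the rest would be dropped, shrinking the feature space.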

Evaluation of NE Recognition

The following subsections contain the experimental results using word clustering and important word selection.

Conclusion

A hierarchical word clustering technique, where clusters are derived automatically from a large unannotated corpus, is explored for feature reduction.

Acknowledgement

The work is partially funded by Microsoft Research India.

Topics

NER

Appears in 21 sentences as: NER (21)
In Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER
  1. Named Entity Recognition (NER) involves locating and classifying the names in a text.
    Page 1, “Introduction”
  2. NER is an important task, having applications in information extraction, question answering, machine translation and in most other Natural Language Processing (NLP) applications.
    Page 1, “Introduction”
  3. NER systems have been developed for English and a few other languages with high accuracy.
    Page 1, “Introduction”
  4. Absence of capitalization makes the Hindi NER task difficult.
    Page 1, “Introduction”
  5. A pioneering work on Hindi NER is by Li and McCallum (2003) where they used Conditional Random Fields (CRF) and feature induction to automatically construct only the features that are important for recognition.
    Page 1, “Introduction”
  6. In their Maximum Entropy (MaxEnt) based approach for Hindi NER development, Saha et al.
    Page 1, “Introduction”
  7. This paper is a study on the effectiveness of word clustering and selection as feature reduction techniques for MaxEnt based NER.
    Page 1, “Introduction”
  8. For important word selection we use corpus-based statistical measures to find the importance of the words in the NER task.
    Page 1, “Introduction”
  9. based NER system is described in Section 2.
    Page 2, “Introduction”
  10. MaxEnt computes the probability p(o|h) for any o from the space of all possible outcomes O, and for every h from the space of all possible histories H. In NER, history can be viewed as all information derivable from the training corpus relative to the current token.
    Page 2, “Maximum Entropy Based Model for Hindi NER”
  11. The training data for the Hindi NER task is composed of about 243K words, collected from the popular daily Hindi newspaper “Dainik Jagaran”.
    Page 2, “Maximum Entropy Based Model for Hindi NER”

MaxEnt

Appears in 16 sentences as: MaxEnt (16)
In Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER
  1. based Feature Reduction for MaxEnt
    Page 1, “Introduction”
  2. In their Maximum Entropy (MaxEnt) based approach for Hindi NER development, Saha et al.
    Page 1, “Introduction”
  3. (2008) also observed that the performance of the MaxEnt based model often decreases when a huge number of features is used in the model.
    Page 1, “Introduction”
  4. This paper is a study on the effectiveness of word clustering and selection as feature reduction techniques for MaxEnt based NER.
    Page 1, “Introduction”
  5. A significant performance improvement over baseline MaxEnt was observed after using the above feature reduction techniques.
    Page 1, “Introduction”
  6. The MaxEnt
    Page 1, “Introduction”
  7. The Maximum Entropy (MaxEnt) principle is a commonly used technique which provides the probability that a token belongs to a class.
    Page 2, “Maximum Entropy Based Model for Hindi NER”
  8. MaxEnt computes the probability p(o|h) for any o from the space of all possible outcomes O, and for every h from the space of all possible histories H. In NER, history can be viewed as all information derivable from the training corpus relative to the current token.
    Page 2, “Maximum Entropy Based Model for Hindi NER”
  9. The computation of the probability p(o|h) of an outcome for a token in MaxEnt depends on a set of features that are helpful in making predictions about the outcome.
    Page 2, “Maximum Entropy Based Model for Hindi NER”
  10. Given a set of features and a training corpus, the MaxEnt estimation process produces a model in which every feature f_i has a weight α_i (the resulting model form is reconstructed after this list).
    Page 2, “Maximum Entropy Based Model for Hindi NER”
  11. For our development we have used a Java-based open-nlp MaxEnt toolkit.
    Page 2, “Maximum Entropy Based Model for Hindi NER”
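
Putting excerpts 8-10 together: with binary features f_i(h, o) and weights α_i, the standard exponential form of a MaxEnt model is as below. This is a reconstruction consistent with the excerpts, not a verbatim quote of the paper's equation:

    p(o \mid h) = \frac{1}{Z(h)} \prod_i \alpha_i^{f_i(h,o)},
    \qquad Z(h) = \sum_{o' \in O} \prod_i \alpha_i^{f_i(h,o')}

The normalizer Z(h) sums over all outcomes, so the probabilities for a given history sum to one.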

cosine similarity

Appears in 6 sentences as: Cosine Similarity (2) cosine similarity (4)
In Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER
  1. For clustering we use a number of word similarity measures, such as cosine similarity among words and co-occurrence, along with the k-means clustering algorithm.
    Page 1, “Introduction”
  2. 3.1 Cosine Similarity based on Sentence Level Co-occurrence
    Page 5, “Word Clustering”
  3. Then we measure cosine similarity between the word vectors.
    Page 5, “Word Clustering”
  4. The cosine similarity between two word vectors A and B with dimension d is measured as shown in the formula after this list.
    Page 5, “Word Clustering”
  5. 3.2 Cosine Similarity based on Proximal Words
    Page 5, “Word Clustering”
  6. Then the cosine similarity is measured between the word vectors.
    Page 5, “Word Clustering”
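
The formula referenced in excerpt 4, in its standard form (the vector names A and B are placeholders, since the originals were lost in extraction):

    \mathrm{cosine}(\vec{A}, \vec{B}) =
    \frac{\sum_{k=1}^{d} A_k B_k}
         {\sqrt{\sum_{k=1}^{d} A_k^{2}} \; \sqrt{\sum_{k=1}^{d} B_k^{2}}}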

similarity measures

Appears in 5 sentences as: similarity measure (1) similarity measurement (1) similarity measures (4)
In Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER
  1. A number of word similarity measures are proposed for clustering words for the Named Entity Recognition task.
    Page 1, “Abstract”
  2. The Euclidean distance between the above word vectors is used as a similarity measure.
    Page 5, “Word Clustering”
  3. Using the above similarity measures we have applied the k-means algorithm (a minimal sketch follows this list).
    Page 5, “Word Clustering”
  4. Among the various similarity measures of clustering, improved results are obtained using the clusters which use the similarity measurement based on proximity of the words to NE categories (defined in Section 3.3).
    Pages 6-7, “Evaluation of NE Recognition”
  5. A number of word similarity measures are used for clustering.
    Page 8, “Conclusion”
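
A minimal sketch of the k-means step referenced in excerpt 3, using Euclidean distance as in excerpt 2; the toy word vectors and all names here are illustrative assumptions, not the paper's code:

    import random

    def kmeans(vectors, k, iters=100, seed=0):
        """Lloyd's k-means algorithm over d-dimensional word vectors."""
        rng = random.Random(seed)
        centroids = [list(v) for v in rng.sample(vectors, k)]
        assign = [-1] * len(vectors)
        for _ in range(iters):
            # Assignment step: nearest centroid by squared Euclidean distance.
            new_assign = [
                min(range(k),
                    key=lambda c: sum((x - y) ** 2
                                      for x, y in zip(v, centroids[c])))
                for v in vectors
            ]
            if new_assign == assign:
                break  # converged
            assign = new_assign
            # Update step: move each centroid to the mean of its members.
            for c in range(k):
                members = [v for v, a in zip(vectors, assign) if a == c]
                if members:
                    centroids[c] = [sum(dim) / len(members)
                                    for dim in zip(*members)]
        return assign, centroids

As the excerpts under "best result" note, k itself was tuned empirically (e.g. k = 100 for the NE-context vocabulary).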

best result

Appears in 5 sentences as: best result (4) best results (1)
In Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER
  1. While experimenting with static word features, we have observed that a window of the previous and next two words (w_{i-2}...w_{i+2}) gives the best result (69.09) using the word features only.
    Page 4, “Maximum Entropy Based Model for Hindi NER”
  2. The value of k (number of clusters) was varied until the best result was obtained.
    Page 5, “Word Clustering”
  3. From the table we observe that the best result is obtained when k is 100.
    Page 6, “Evaluation of NE Recognition”
  4. Similarly, when we deal with all the words in the corpus (17,465 words), the best results are obtained when the words are clustered into 1100 clusters.
    Page 6, “Evaluation of NE Recognition”
  5. The best result is obtained when important words for two preceding and two following positions (defined in Section 4.3) are selected.
    Page 7, “Evaluation of NE Recognition”

overfitting

Appears in 5 sentences as: overfit (1) overfitting (4)
In Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER
  1. These methods tend to overfit when the available training corpus is limited, especially if the number of features is large or the number of values for a feature is large.
    Page 1, “Abstract”
  2. In an effort to reduce overfitting , they use a combination of a Gaussian prior and early-stopping.
    Page 1, “Introduction”
  3. This is due to overfitting, which is a serious problem in most NLP tasks in resource-poor languages where annotated data is scarce.
    Page 1, “Introduction”
  4. From the above discussion it is clear that the system suffers from overfitting if a large number of features are used to train the system.
    Page 4, “Maximum Entropy Based Model for Hindi NER”
  5. This is probably due to the reduction of overfitting.
    Page 8, “Conclusion”

feature set

Appears in 4 sentences as: feature set (3) feature sets (1)
In Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER
  1. In Table 2 we have shown the accuracy values for a few feature sets.
    Page 4, “Maximum Entropy Based Model for Hindi NER”
  2. Again, when w_{i-2} and w_{i+2} are removed from the feature set (i.e.
    Page 4, “Maximum Entropy Based Model for Hindi NER”
  3. When suffix, prefix and digit information are added to the feature set, the f-value increases to 74.26.
    Page 4, “Maximum Entropy Based Model for Hindi NER”
  4. The value is obtained using the feature set F8 [w_i, w_{i-1}, w_{i+1}, t_{i-1}, Suffix, Digit] (an illustrative extraction of such features follows this list).
    Page 4, “Maximum Entropy Based Model for Hindi NER”
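
A sketch of how binary context features of the F8 kind could be generated per token; the function name, the sentence-boundary markers, and the suffix length are illustrative assumptions, not the paper's implementation:

    def extract_features(words, i, prev_tag):
        """Binary features for token i: word window, previous NE tag,
        suffix, and digit information."""
        first, last = i == 0, i == len(words) - 1
        return {
            "w_i=" + words[i],
            "w_i-1=" + ("<S>" if first else words[i - 1]),
            "w_i+1=" + ("</S>" if last else words[i + 1]),
            "t_i-1=" + prev_tag,
            "suffix=" + words[i][-2:],
            "digit=" + str(words[i].isdigit()),
        }

Each string in the returned set names one feature that is active (f_i(h, o) = 1) for the current history, matching the binary-feature view in the MaxEnt excerpts above.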

Maximum Entropy

Appears in 4 sentences as: Maximum Entropy (4)
In Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER
  1. Methods like Maximum Entropy and Conditional Random Fields make use of features for the training purpose.
    Page 1, “Abstract”
  2. The feature reduction techniques lead to a substantial performance improvement over the baseline Maximum Entropy technique.
    Page 1, “Abstract”
  3. In their Maximum Entropy (MaxEnt) based approach for Hindi NER development, Saha et al.
    Page 1, “Introduction”
  4. The Maximum Entropy (MaxEnt) principle is a commonly used technique which provides the probability that a token belongs to a class.
    Page 2, “Maximum Entropy Based Model for Hindi NER”

Named Entity

Appears in 4 sentences as: Named Entities (1) Named Entity (3)
In Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER
  1. Statistical machine learning methods are employed to train a Named Entity Recognizer from annotated data.
    Page 1, “Abstract”
  2. A number of word similarity measures are proposed for clustering words for the Named Entity Recognition task.
    Page 1, “Abstract”
  3. Named Entity Recognition (NER) involves locating and classifying the names in a text.
    Page 1, “Introduction”
  4. This corpus has been manually annotated and contains about 16,491 Named Entities (NEs).
    Page 2, “Maximum Entropy Based Model for Hindi NER”

baseline system

Appears in 3 sentences as: baseline system (3)
In Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER
  1. The best accuracy (75.6 f-value) of the baseline system is obtained using the binary NomPSP feature along with the word features (w_{i-1}, w_{i+1}), suffix and digit information.
    Page 4, “Maximum Entropy Based Model for Hindi NER”
  2. But in the baseline system, the addition of word features (w_{i-2} and w_{i+2}) over the same feature set decreases the f-value from 75.6 to 72.65.
    Page 7, “Evaluation of NE Recognition”
  3. A significant enhancement in accuracy is observed over the baseline system, which uses word features.
    Page 8, “Conclusion”
