Dictionary Definitions based Homograph Identification using a Generative Hierarchical Model
Kulkarni, Anagha and Callan, Jamie

Article Structure

Abstract

A solution to the problem of homograph (words with multiple distinct meanings) identification is proposed and evaluated in this paper.

Introduction

Lexical ambiguity resolution is an important research problem for the fields of information retrieval and machine translation (Sanderson, 2000; Chan et al., 2007).

Finding the Homographs in a Lexicon

Our goal is to identify the homographs in a large lexicon.

Data

In this study, we concentrate on recognizing homographic nouns, because homographic ambiguity is much more common in nouns than in verbs, adverbs or adjectives.

Experiments and Results

A stratified division of the gold standard data in the proportion of 0.75 and 0.25 was done in the first step.

Conclusions

We have demonstrated in this paper that the problem of homograph identification can be approached using dictionary definitions as the source of information about the word.

Topics

cosine similarity

Appears in 4 sentences as: cosine similarities (1) cosine similarity (3)
In Dictionary Definitions based Homograph Identification using a Generative Hierarchical Model
  1. Cohesiveness Score: Mean of the cosine similarities between each pair of definitions of w.
    Page 2, “Finding the Homographs in a Lexicon”
  2. Average Number of Null Similarities: The number of definition pairs that have zero cosine similarity score (no word overlap).
    Page 2, “Finding the Homographs in a Lexicon”
  3. The last feature sorts the pairwise cosine similarity scores in ascending order, prunes the top n% of the scores, and uses the maximum remaining score as the feature value.
    Page 2, “Finding the Homographs in a Lexicon”
  4. The set of definitions is formed from eight dictionaries, so almost identical definitions are a frequent phenomenon, which makes the maximum cosine similarity a useless feature.
    Page 2, “Finding the Homographs in a Lexicon”
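The three similarity-based features described in the excerpts above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenization, the raw term-count vectors, the function names `cosine` and `definition_features`, and the default pruning percentage are all assumptions made for the example. The excerpts do not say how the null-similarity count is normalized, so the sketch returns the raw count.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def definition_features(definitions, prune_percent=20):
    """Compute similarity features over all pairs of definitions of one word.

    prune_percent is a hypothetical default; the excerpt only says "top n%".
    """
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    vectors = [Counter(d.lower().split()) for d in definitions]
    sims = sorted(cosine(u, v) for u, v in combinations(vectors, 2))
    if not sims:
        return {"cohesiveness": 0.0, "null_pairs": 0, "pruned_max": 0.0}

    # Feature 1: mean pairwise cosine similarity.
    cohesiveness = sum(sims) / len(sims)
    # Feature 2: number of pairs with no word overlap at all.
    null_pairs = sum(1 for s in sims if s == 0.0)
    # Feature 3: drop the highest n% of scores (near-duplicate definitions
    # from different dictionaries), then take the largest remaining score.
    n_pruned = int(len(sims) * prune_percent / 100)
    kept = sims[: len(sims) - n_pruned]
    pruned_max = kept[-1] if kept else 0.0

    return {
        "cohesiveness": cohesiveness,
        "null_pairs": null_pairs,
        "pruned_max": pruned_max,
    }
```

Pruning before taking the maximum reflects the point made in the fourth excerpt: when definitions come from several dictionaries, the unpruned maximum is dominated by near-identical definitions and carries little information.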

See all papers in Proc. ACL 2008 that mention cosine similarity.


gold standard

Appears in 4 sentences as: Gold Standard (1) gold standard (2) “gold standard” (1)
In Dictionary Definitions based Homograph Identification using a Generative Hierarchical Model
  1. 3.1 Gold Standard Data
    Page 3, “Data”
  2. 23 words on which annotators disagreed (2/2 vote) were discarded, leaving a set of 202 words (the “gold standard”) on which at least 3 of the 4 annotators agreed.
    Page 3, “Data”
  3. The best agreement between the gold standard and a human annotator was 0.87 kappa, and the worst was 0.78.
    Page 3, “Data”
  4. A stratified division of the gold standard data in the proportion of 0.75 and 0.25 was done in the first step.
    Page 3, “Experiments and Results”
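A stratified 0.75/0.25 division like the one mentioned in the last excerpt can be sketched as below. The function name `stratified_split`, the fixed seed, and the rounding rule for the per-class cut are assumptions for illustration; the excerpts do not describe how the authors performed the split.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, train_fraction=0.75, seed=0):
    """Split items into train/test so each label keeps roughly the same share."""
    rng = random.Random(seed)  # assumption: fixed seed for reproducibility
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        # Cut each class separately so the class proportions are preserved.
        cut = round(len(group) * train_fraction)
        train.extend((item, label) for item in group[:cut])
        test.extend((item, label) for item in group[cut:])
    return train, test
```

Splitting each class on its own keeps the ratio of homographs to non-homographs about the same in both parts, which matters when one class is much smaller than the other.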


semi-supervised

Appears in 4 sentences as: semi-supervised (4)
In Dictionary Definitions based Homograph Identification using a Generative Hierarchical Model
  1. We experiment with three model setups: supervised, semi-supervised, and unsupervised.
    Page 3, “Finding the Homographs in a Lexicon”
  2. In Model II, the semi-supervised setup, the training data is used to initialize the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) and the unlabeled data, described in Section 3.1, updates the initial estimates.
    Page 3, “Finding the Homographs in a Lexicon”
  3. The unsupervised setup, Model III, is similar to the semi-supervised setup except that the EM algorithm is initialized using an informed guess by the authors.
    Page 3, “Finding the Homographs in a Lexicon”
  4. The results for the semi-supervised models are inconclusive.
    Page 4, “Experiments and Results”
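The semi-supervised setup (labeled data initializes EM, unlabeled data then updates the estimates) can be illustrated with a deliberately simplified stand-in. The sketch below is not the paper's generative hierarchical model: it assumes a single real-valued feature, a two-component Gaussian mixture, and an M-step that uses only the unlabeled data. The names `gaussian_pdf`, `weighted_estimate`, and `semi_supervised_em` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def weighted_estimate(values, weights):
    """Weighted mean and variance; the variance is floored to stay positive."""
    total = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total
    var = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / total
    return mean, max(var, 1e-6)


def semi_supervised_em(labeled, unlabeled, iterations=20):
    """labeled: list of (x, y) with y in {0, 1}; unlabeled: list of x.

    Returns [[mean, var, prior], [mean, var, prior]] for classes 0 and 1.
    """
    # Initialization: estimate each class from the labeled data only.
    params = []
    for c in (0, 1):
        xs = [x for x, y in labeled if y == c]
        mean, var = weighted_estimate(xs, [1.0] * len(xs))
        params.append([mean, var, len(xs) / len(labeled)])

    for _ in range(iterations):
        # E-step: posterior probability of class 1 for each unlabeled point.
        resp = []
        for x in unlabeled:
            p0 = params[0][2] * gaussian_pdf(x, params[0][0], params[0][1])
            p1 = params[1][2] * gaussian_pdf(x, params[1][0], params[1][1])
            z = p0 + p1
            resp.append(p1 / z if z > 0 else 0.5)

        # M-step (assumption: re-estimated from the unlabeled data alone).
        for c in (0, 1):
            weights = [r if c == 1 else 1.0 - r for r in resp]
            total = sum(weights)
            if total == 0:
                continue  # keep the previous estimate for an empty class
            mean, var = weighted_estimate(unlabeled, weights)
            params[c] = [mean, var, total / len(unlabeled)]
    return params
```

Under the same simplification, the unsupervised setup described in the third excerpt would differ only in the initialization step, which would use hand-chosen parameter values instead of estimates from labeled data.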


unlabeled data

Appears in 3 sentences as: unlabeled data (3)
In Dictionary Definitions based Homograph Identification using a Generative Hierarchical Model
  1. In Model II, the semi-supervised setup, the training data is used to initialize the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) and the unlabeled data, described in Section 3.1, updates the initial estimates.
    Page 3, “Finding the Homographs in a Lexicon”
  2. The set of 3,123 words that were not annotated was the unlabeled data for the EM algorithm.
    Page 3, “Data”
  3. Our post-experimental analysis reveals that the parameter update process using the unlabeled data has the effect of overly separating the two overlapping distributions.
    Page 4, “Experiments and Results”
