Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses Using Topic Models
Lau, Jey Han and Cook, Paul and McCarthy, Diana and Gella, Spandana and Baldwin, Timothy

Article Structure

Abstract

Unsupervised word sense disambiguation (WSD) methods are an attractive approach to all-words WSD due to their non-reliance on expensive annotated data.

Introduction

The automatic determination of word sense information has been a long-term pursuit of the NLP community (Agirre and Edmonds, 2006; Navigli, 2009).

Background and Related Work

There has been a considerable amount of research on representing word senses and on disambiguating usages of words in context (WSD): in order to produce computational systems that understand and produce natural language, it is essential to have a means of representing and disambiguating word sense.

Methodology

Our methodology is based on the WSI system described in Lau et al.

WordNet Experiments

We first test the proposed method over the tasks of predominant sense learning and sense distribution induction, using the WordNet-tagged dataset of Koeling et al.

Macmillan Experiments

In our second set of experiments, we move to a new dataset (Gella et al., to appear) based on text from ukWaC (Ferraresi et al., 2008) and Twitter, and annotated using the Macmillan English Dictionary (henceforth “Macmillan”).

Discussion

Our methodologies for the two proposed tasks of identifying unused and novel senses are simple.

Topics

word sense

Appears in 23 sentences as: Word sense (1) word sense (20) word senses (6)
In Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses Using Topic Models
  1. Unsupervised word sense disambiguation (WSD) methods are an attractive approach to all-words WSD due to their non-reliance on expensive annotated data.
    Page 1, “Abstract”
  2. Unsupervised estimates of sense frequency have been shown to be very useful for WSD due to the skewed nature of word sense distributions.
    Page 1, “Abstract”
  3. The automatic determination of word sense information has been a long-term pursuit of the NLP community (Agirre and Edmonds, 2006; Navigli, 2009).
    Page 1, “Introduction”
  4. Word sense distributions tend to be Zipfian, and as such, a simple but surprisingly high-accuracy back-off heuristic for word sense disambiguation (WSD) is to tag each instance of a given word with its predominant sense (McCarthy et al., 2007).
    Page 1, “Introduction”
  5. Such an approach requires knowledge of predominant senses; however, word sense distributions — and predominant senses too — vary from corpus to corpus.
    Page 1, “Introduction”
  6. In this paper, we propose a method which uses topic models to estimate word sense distributions.
    Page 1, “Introduction”
  7. Topic models have been used for WSD in a number of studies (Boyd-Graber et al., 2007; Li et al., 2010; Lau et al., 2012; Preiss and Stevenson, 2013; Cai et al., 2007; Knopp et al., 2013), but our work extends significantly on this earlier work in focusing on the acquisition of prior word sense distributions (and predominant senses).
    Page 1, “Introduction”
  8. Because of domain differences and the skewed nature of word sense distributions, it is often the case that some senses in a sense inventory will not be attested in a given corpus.
    Page 1, “Introduction”
  9. There has been a considerable amount of research on representing word senses and on disambiguating usages of words in context (WSD): in order to produce computational systems that understand and produce natural language, it is essential to have a means of representing and disambiguating word sense.
    Page 2, “Background and Related Work”
  10. WSD algorithms require word sense information to disambiguate token instances of a given ambiguous word, e.g.
    Page 2, “Background and Related Work”
  11. One extremely useful piece of information is the word sense prior or expected word sense frequency distribution.
    Page 2, “Background and Related Work”
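The back-off heuristic in sentence 4 above is straightforward to sketch in code. The following is a minimal Python illustration using NLTK's WordNet interface; it relies on the fact that NLTK returns synsets in WordNet sense-number order, which reflects SemCor frequency counts, so the first synset approximates the predominant sense. The example words are purely illustrative.

```python
# Minimal sketch of the most-frequent-sense (MFS) back-off for WSD.
# NLTK orders synsets by WordNet sense number, which reflects SemCor
# frequency counts, so synsets()[0] approximates the predominant sense.
from nltk.corpus import wordnet as wn

def predominant_sense(word, pos=wn.NOUN):
    """Return the most frequent WordNet synset for `word`, or None."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

def mfs_tag(tokens, pos=wn.NOUN):
    """Tag every token with its predominant sense (the MFS back-off)."""
    return [(tok, predominant_sense(tok, pos)) for tok in tokens]

# Toy usage: every instance of each word gets the same (first) sense.
print(mfs_tag(["bank", "interest", "star"]))
```

Note that this back-off encodes the SemCor sense distribution; the point of the paper is that such priors vary from corpus to corpus and should be re-estimated for each target corpus.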


WordNet

Appears in 12 sentences as: WordNet (14)
In Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses Using Topic Models
  1. In contrast to the proposal of McCarthy et al. (2004b) to remove low-frequency senses from WordNet, we focus on finding senses that are unattested in the corpus, on the premise that, given accurate disambiguation, rare senses in a corpus contribute to correct interpretation.
    Page 1, “Introduction”
  2. Typically, word frequency distributions are estimated with respect to a sense-tagged corpus such as SemCor (Miller et al., 1993), a 220,000 word corpus tagged with WordNet (Fellbaum, 1998) senses.
    Page 2, “Background and Related Work”
  3. The distributional similarity scores of the nearest neighbours are associated with the respective target word senses using a WordNet similarity measure, such as those proposed by Jiang and Conrath (1997) and Banerjee and Pedersen (2002).
    Page 2, “Background and Related Work”
  4. the WordNet hierarchy).
    Page 3, “Methodology”
  5. For each domain, annotators were asked to sense-annotate a random selection of sentences for each of 40 target nouns, based on WordNet v1.7.
    Page 4, “WordNet Experiments”
  6. For each dataset, we use HDP to induce topics for each target lemma, compute the similarity between the topics and the WordNet senses (Equation (1)), and rank the senses based on the prevalence scores (Equation (2)).
    Page 4, “WordNet Experiments”
  7. It is important to bear in mind that MKWC in these experiments makes use of full-text parsing in calculating the distributional similarity thesaurus, and the WordNet graph structure in calculating the similarity between associated words and different senses.
    Page 5, “WordNet Experiments”
  8. Our method, on the other hand, uses no parsing, and only the synset definitions (and not the graph structure) of WordNet. The non-reliance on parsing is significant in terms of portability to text sources which are less amenable to parsing (such as Twitter; Baldwin et al., 2013), and the non-reliance on the graph structure of WordNet is significant in terms of portability to conventional “flat” sense inventories.
    Page 5, “WordNet Experiments”
  9. For the purposes of this research, the choice of Macmillan is significant in that it is a conventional dictionary with sense definitions and examples, but no linking between senses. In terms of the original research which gave rise to the sense-tagged dataset, Macmillan was chosen over WordNet for reasons including: (1) the well-documented difficulties of sense tagging with fine-grained WordNet senses (Palmer et al., 2004; Navigli et al., 2007); (2) the regular update cycle of Macmillan (meaning it contains many recently-emerged senses); and (3) the finding in a preliminary sense-tagging task that it better captured Twitter usages than WordNet (and also OntoNotes: Hovy et al. (2006)).
    Page 6, “Macmillan Experiments”
  10. The average sense ambiguity of the 20 target nouns in Macmillan is 5.6 (but 12.3 in WordNet).
    Page 6, “Macmillan Experiments”
  11. We first notice that, despite the coarser-grained senses of Macmillan as compared to WordNet, the upper bound WSD accuracy using Macmillan is comparable to that of the WordNet-based datasets over the balanced BNC, and quite a bit lower than that of the two domain corpora of Koeling et al.
    Page 6, “Macmillan Experiments”


topic models

Appears in 6 sentences as: topic model (2) Topic models (1) topic models (3)
In Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses Using Topic Models
  1. In this paper, we propose a method which uses topic models to estimate word sense distributions.
    Page 1, “Introduction”
  2. Topic models have been used for WSD in a number of studies (Boyd-Graber et al., 2007; Li et al., 2010; Lau et al., 2012; Preiss and Stevenson, 2013; Cai et al., 2007; Knopp et al., 2013), but our work extends significantly on this earlier work in focusing on the acquisition of prior word sense distributions (and predominant senses).
    Page 1, “Introduction”
  3. (2004b), the use of topic models makes this possible, using topics as a proxy for sense (Brody and Lapata, 2009; Yao and Van Durme, 2011; Lau et al., 2012).
    Page 2, “Introduction”
  4. Recent work on finding novel senses has tended to focus on comparing diachronic corpora (Sagi et al., 2009; Cook and Stevenson, 2010; Gulordava and Baroni, 2011) and has also considered topic models (Lau et al., 2012).
    Page 3, “Background and Related Work”
  5. We use a Hierarchical Dirichlet Process (HDP; Teh et al. (2006)), a nonparametric variant of the Latent Dirichlet Allocation topic model (Blei et al., 2003), where the model automatically optimises the number of topics in a fully-unsupervised fashion over the training data.
    Page 3, “Methodology”
  6. To learn the senses of a target lemma, we train a single topic model per target lemma.
    Page 3, “Methodology”
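Sentences 5 and 6 describe the core modelling step: one nonparametric topic model per target lemma, with the number of topics inferred from the data. Below is a minimal sketch using gensim's online HDP implementation as a stand-in for the sampler used in the paper; the toy usages (bags of context words around occurrences of the target lemma "bank") are invented for illustration.

```python
# Sketch: train one nonparametric topic model per target lemma.
# gensim's online HDP stands in for the HDP implementation in the paper;
# it likewise infers the number of topics rather than taking a fixed K.
from gensim.corpora import Dictionary
from gensim.models import HdpModel

# Toy usages of the target lemma "bank": one bag of context words per usage.
usages = [
    ["deposit", "money", "account", "bank"],
    ["river", "bank", "water", "flood"],
    ["bank", "loan", "interest", "rate"],
]

dictionary = Dictionary(usages)
corpus = [dictionary.doc2bow(u) for u in usages]

hdp = HdpModel(corpus=corpus, id2word=dictionary)

# Each induced topic is a distribution over words, later matched
# against the sense glosses.
for topic_id, words in hdp.show_topics(num_topics=5, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])
```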


similarity scores

Appears in 4 sentences as: similarity score (1) similarity scores (3)
In Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses Using Topic Models
  1. The distributional similarity scores of the nearest neighbours are associated with the respective target word senses using a WordNet similarity measure, such as those proposed by Jiang and Conrath (1997) and Banerjee and Pedersen (2002).
    Page 2, “Background and Related Work”
  2. The word senses are ranked based on these similarity scores , and the most frequent sense is selected for the corpus that the distributional similarity thesaurus was trained over.
    Page 2, “Background and Related Work”
  3. To compute the similarity between a sense and a topic, we first convert the words in the gloss/definition into a multinomial distribution over words, based on simple maximum likelihood estimation. We then calculate the Jensen–Shannon divergence between the multinomial distribution (over words) of the gloss and that of the topic, and convert the divergence value into a similarity score by subtracting it from 1.
    Page 3, “Methodology”
  4. The prevalence score for a sense is computed by summing the product of its similarity scores with each topic (i.e.
    Page 4, “Methodology”
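Sentences 3 and 4 describe Equations (1) and (2) closely enough to sketch. The Python below renders them directly, under two assumptions: each topic arrives as a word distribution (a dict of probabilities), and Equation (2) weights each sense-topic similarity by the topic's probability mass (sentence 4 above is truncated at "i.e.", so the exact weighting is an assumption).

```python
# Sketch of sense-topic similarity (Equation (1)) and sense prevalence
# (Equation (2)) as described in the two sentences above.
from collections import Counter
from math import log2

def mle_distribution(words):
    """Maximum likelihood estimate of a word distribution from a gloss."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded by 1)."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a, b):
        return sum(pa * log2(pa / b[w]) for w, pa in a.items() if pa > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def similarity(gloss_words, topic):
    """Equation (1): subtract the JS divergence from 1."""
    return 1.0 - js_divergence(mle_distribution(gloss_words), topic)

def prevalence(gloss_words, topics, topic_weights):
    """Equation (2): sum of sense-topic similarities, here weighted by
    each topic's (assumed) probability mass."""
    return sum(similarity(gloss_words, t) * w
               for t, w in zip(topics, topic_weights))
```

Because the JS divergence with base-2 logarithms lies in [0, 1], subtracting it from 1 yields a similarity in the same range.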


distributional similarity

Appears in 3 sentences as: distributional similarity (3)
In Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses Using Topic Models
  1. The distributional similarity scores of the nearest neighbours are associated with the respective target word senses using a WordNet similarity measure, such as those proposed by Jiang and Conrath (1997) and Banerjee and Pedersen (2002).
    Page 2, “Background and Related Work”
  2. The word senses are ranked based on these similarity scores, and the most frequent sense is selected for the corpus that the distributional similarity thesaurus was trained over.
    Page 2, “Background and Related Work”
  3. It is important to bear in mind that MKWC in these experiments makes use of full-text parsing in calculating the distributional similarity thesaurus, and the WordNet graph structure in calculating the similarity between associated words and different senses.
    Page 5, “WordNet Experiments”
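Sentences 1 and 2 summarise the MKWC ranking method. The sketch below gives its general shape under one assumption: each neighbour's distributional similarity score is split across the target's senses in proportion to the WordNet similarity between sense and neighbour, a normalisation along the lines of McCarthy et al.'s method. `wn_similarity` stands in for a measure such as Jiang-Conrath or extended Lesk.

```python
# Rough sketch of the MKWC-style prevalence ranking: each distributional
# nearest neighbour of the target word votes for the target's senses in
# proportion to a WordNet similarity between sense and neighbour.

def mkwc_scores(senses, neighbours, wn_similarity):
    """Return {sense: score}. `neighbours` pairs each nearest neighbour
    with its distributional similarity score; the top-scoring sense is
    taken as predominant for the thesaurus's training corpus."""
    scores = {s: 0.0 for s in senses}
    for neighbour, dss in neighbours:
        sims = {s: wn_similarity(s, neighbour) for s in senses}
        norm = sum(sims.values())
        if norm == 0.0:
            continue  # this neighbour says nothing about these senses
        for s in senses:
            # split the neighbour's distributional mass across senses
            scores[s] += dss * sims[s] / norm
    return scores
```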


statistical significance

Appears in 3 sentences as: statistical significance (2) statistically significant (1)
In Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses Using Topic Models
  1. Based on McNemar's test with Yates' correction for continuity, MKWC is significantly better over BNC and HDP-WSI is significantly better over FINANCE (p < 0.0001 in both cases), but the difference over SPORTS is not statistically significant (p > 0.1).
    Page 4, “WordNet Experiments”
  2. Testing for statistical significance over the paired JS divergence values for each lemma using the Wilcoxon signed-rank test, the result for FINANCE is significant (p < 0.05) but the results for the other two datasets are not (p > 0.1 in each case).
    Page 5, “WordNet Experiments”
  3. To summarise, the results for MKWC and HDP-WSI are fairly even for predominant sense learning (each outperforms the other at a level of statistical significance over one dataset), but HDP-WSI is better at inducing the overall sense distribution.
    Page 5, “WordNet Experiments”
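Both tests mentioned above are available off the shelf in the Python scientific stack. The sketch below shows the shape of each comparison; the 2x2 outcome table and the per-lemma JS divergence values are invented, not taken from the paper.

```python
# Sketch of the two significance tests used in the comparison above.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Per-instance WSD outcomes as a 2x2 table (toy counts):
# rows = MKWC correct/incorrect, columns = HDP-WSI correct/incorrect.
table = np.array([[300, 40],
                  [25, 135]])
# exact=False with correction=True gives the Yates-corrected chi-square.
print(mcnemar(table, exact=False, correction=True))

# Paired per-lemma JS divergences for the two systems (toy values);
# lower is better, and the test is over the paired differences.
js_mkwc = [0.31, 0.28, 0.40, 0.22, 0.35]
js_hdp = [0.27, 0.29, 0.33, 0.20, 0.30]
print(wilcoxon(js_mkwc, js_hdp))
```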
