Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
Li, Linlin and Roth, Benjamin and Sporleder, Caroline

Article Structure

Abstract

This paper presents a probabilistic model for sense disambiguation which chooses the best sense based on the conditional probability of sense paraphrases given a context.

Introduction

Word sense disambiguation (WSD) is the task of automatically determining the correct sense for a target word given the context in which it occurs.

Related Work

There is a large body of work on WSD, covering supervised, unsupervised (word sense induction) and knowledge-based approaches (see McCarthy (2009) for an overview).

The Sense Disambiguation Model

3.1 Topic Model

Experimental Setup

We evaluate our models on three different tasks: coarse-grained WSD, fine-grained WSD and literal vs. nonliteral sense detection.

Experiments

As mentioned above, we test our proposed sense disambiguation framework on three tasks.

Conclusion

We propose three models for sense disambiguation on words and multi-word expressions.

Topics

sense disambiguation

Appears in 21 sentences as: Sense Disambiguation (2) sense disambiguation (20)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. This paper presents a probabilistic model for sense disambiguation which chooses the best sense based on the conditional probability of sense paraphrases given a context.
    Page 1, “Abstract”
  2. We propose three different instantiations of the model for solving sense disambiguation problems with different degrees of resource availability.
    Page 1, “Abstract”
  3. The proposed models are tested on three different tasks: coarse-grained word sense disambiguation, fine-grained word sense disambiguation, and detection of literal vs. nonliteral usages of potentially idiomatic expressions.
    Page 1, “Abstract”
  4. Word sense disambiguation (WSD) is the task of automatically determining the correct sense for a target word given the context in which it occurs.
    Page 1, “Introduction”
  5. Recently, several researchers have experimented with topic models (Brody and Lapata, 2009; Boyd-Graber et al., 2007; Boyd-Graber and Blei, 2007; Cai et al., 2007) for sense disambiguation and induction.
    Page 1, “Introduction”
  6. Previous approaches using topic models for sense disambiguation either embed topic features in a supervised model (Cai et al., 2007) or rely heavily on the structure of hierarchical lexicons such as WordNet (Boyd-Graber et al., 2007).
    Page 1, “Introduction”
  7. We approach the sense disambiguation task by choosing the best sense based on the conditional probability of sense paraphrases given a context.
    Page 1, “Introduction”
  8. Recently, a number of systems have been proposed that make use of topic models for sense disambiguation.
    Page 2, “Related Work”
  9. 3.2 The Sense Disambiguation Model
    Page 3, “The Sense Disambiguation Model”
  10. Finally, we test our model on the related sense disambiguation task of distinguishing literal and nonliteral usages of potentially ambiguous expressions such as break the ice.
    Page 5, “Experimental Setup”
  11. Sense Paraphrases For word sense disambiguation tasks, the paraphrases of the sense keys are represented by information from WordNet 2.1.
    Page 5, “Experimental Setup”

WordNet

Appears in 16 sentences as: WordNet (16)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. Previous approaches using topic models for sense disambiguation either embed topic features in a supervised model (Cai et al., 2007) or rely heavily on the structure of hierarchical lexicons such as WordNet (Boyd-Graber et al., 2007).
    Page 1, “Introduction”
  2. (2007) enhance the basic LDA algorithm by incorporating WordNet senses as an additional latent variable.
    Page 2, “Related Work”
  3. Instead of generating words directly from a topic, each topic is associated with a random walk through the WordNet hierarchy which generates the observed word.
    Page 2, “Related Work”
  4. …idiosyncrasies in the hierarchical structure of WordNet can harm performance.
    Page 2, “Related Work”
  5. In our approach, we circumvent this problem by exploiting paraphrase information for the target senses rather than relying on the structure of WordNet as a whole.
    Page 2, “Related Work”
  6. These paraphrases can be taken from an existing resource such as WordNet (Miller, 1995) or supplied by the user (see Section 4).
    Page 3, “The Sense Disambiguation Model”
  7. In Model I and Model II, the sense paraphrases are obtained from WordNet, and both the context and the sense paraphrases are treated as documents, c = dc and s = ds.
    Page 3, “The Sense Disambiguation Model”
  8. WordNet is a fairly rich resource which provides detailed information about word senses (glosses, example sentences, synsets, semantic relations between senses, etc.).
    Page 3, “The Sense Disambiguation Model”
  9. …not very well covered in WordNet, such as idioms.
    Page 3, “The Sense Disambiguation Model”
  10. We use sense frequency information from WordNet to estimate the prior sense distribution, although it must be kept in mind that, depending on the genre of the texts, it is possible that the distribution of senses in the testing corpus may diverge greatly from the WordNet-based estimation.
    Page 4, “The Sense Disambiguation Model”
  11. The data were annotated with coarse-grained senses which were obtained by clustering senses from the WordNet 2.1 sense inventory based on the procedure proposed by Navigli (2006).
    Page 5, “Experimental Setup”
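
Item 10 above mentions estimating the prior sense distribution from WordNet sense-frequency counts. Below is a minimal sketch of one way to do this, assuming NLTK's WordNet interface (which ships WordNet 3.0 rather than the paper's 2.1); the function name and the add-one smoothing are illustrative assumptions, not the paper's choices:

    # Hypothetical sketch: a prior over senses from WordNet's SemCor-derived
    # lemma counts. The smoothing constant is an assumption, not from the paper.
    from nltk.corpus import wordnet as wn

    def sense_prior(word, pos=wn.NOUN, smooth=1.0):
        lemmas = [l for s in wn.synsets(word, pos) for l in s.lemmas()
                  if l.name().lower() == word.lower()]
        counts = [l.count() + smooth for l in lemmas]
        total = sum(counts)
        return {l.synset().name(): c / total for l, c in zip(lemmas, counts)}

    print(sense_prior("bank"))  # heavily skewed towards the frequent senses

As item 10 warns, such a prior can diverge greatly from the sense distribution of the test genre.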

topic models

Appears in 15 sentences as: Topic Model (1) topic model (4) topic modelling (1) Topic models (3) topic models (6)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. We use a topic model to decompose this conditional probability into two conditional probabilities with latent variables.
    Page 1, “Abstract”
  2. Recently, several researchers have experimented with topic models (Brody and Lapata, 2009; Boyd-Graber et al., 2007; Boyd-Graber and Blei, 2007; Cai et al., 2007) for sense disambiguation and induction.
    Page 1, “Introduction”
  3. Topic models are generative probabilistic models of text corpora in which each document is modelled as a mixture over (latent) topics, which are in turn represented by a distribution over words.
    Page 1, “Introduction” (a code sketch of this generative process follows this list)
  4. Previous approaches using topic models for sense disambiguation either embed topic features in a supervised model (Cai et al., 2007) or rely heavily on the structure of hierarchical lexicons such as WordNet (Boyd-Graber et al., 2007).
    Page 1, “Introduction”
  5. Recently, a number of systems have been proposed that make use of topic models for sense disambiguation.
    Page 2, “Related Work”
  6. They compute topic models from a large unlabelled corpus and include them as features in a supervised system.
    Page 2, “Related Work”
  7. Boyd-Graber and Blei (2007) propose an unsupervised approach that integrates McCarthy et al.’s (2004) method for finding predominant word senses into a topic modelling framework.
    Page 2, “Related Work”
  8. Topic models have also been applied to the related task of word sense induction.
    Page 2, “Related Work”
  9. Topic models have been previously considered for metaphor extraction and estimating the frequency of metaphors (Klebanov et al., 2009; Bethard et al., 2009).
    Page 2, “Related Work”
  10. 3.1 Topic Model
    Page 2, “The Sense Disambiguation Model”
  11. As pointed out by Hofmann (1999), the starting point of topic models is to decompose the conditional word-document probability distribution p(w|d) into two different distributions: the word-topic distribution p(w|z), and the topic-document distribution p(z|d) (see Equation 1).
    Page 2, “The Sense Disambiguation Model”
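
Item 11 above paraphrases what the paper calls Equation 1. Reconstructed in standard notation, the word-document distribution factors through the latent topic z:

    \[
    p(w \mid d) \;=\; \sum_{z} p(w \mid z)\, p(z \mid d) \tag{1}
    \]

In PLSA these two factors are estimated directly; LDA (see the LDA section below) is the Bayesian version of this framework, placing Dirichlet priors on both distributions (Blei et al., 2003).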
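
Item 3 above summarizes the generative story behind topic models. Below is a minimal sketch of that story in Python/numpy; the vocabulary size, topic count, document length and hyper-parameter values are made-up toy settings, not the paper's:

    # Toy generative process for one document under an LDA-style topic model.
    import numpy as np

    rng = np.random.default_rng(0)
    V, K, doc_len = 1000, 20, 50          # vocab size, topics, tokens (assumed)
    alpha, beta = 0.1, 0.01               # Dirichlet hyper-parameters (assumed)

    phi = rng.dirichlet([beta] * V, size=K)   # word-topic distributions p(w|z)
    theta = rng.dirichlet([alpha] * K)        # topic mixture p(z|d) for this document

    z = rng.choice(K, size=doc_len, p=theta)        # draw a topic per token
    w = [int(rng.choice(V, p=phi[t])) for t in z]   # draw each word from its topic

Inference runs this story in reverse: given the observed words, estimate phi and theta, e.g. by Gibbs sampling, which is what yields the sampled document-topic frequencies that Model II compares (see 'propose Model' below).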

word sense

Appears in 13 sentences as: Word sense (1) word sense (9) word senses (4)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. The proposed models are tested on three different tasks: coarse-grained word sense disambiguation, fine-grained word sense disambiguation, and detection of literal vs. nonliteral usages of potentially idiomatic expressions.
    Page 1, “Abstract”
  2. Word sense disambiguation (WSD) is the task of automatically determining the correct sense for a target word given the context in which it occurs.
    Page 1, “Introduction”
  3. There is a large body of work on WSD, covering supervised, unsupervised (word sense induction) and knowledge-based approaches (see McCarthy (2009) for an overview).
    Page 2, “Related Work”
  4. Boyd-Graber and Blei (2007) propose an unsupervised approach that integrates McCarthy et al.’s (2004) method for finding predominant word senses into a topic modelling framework.
    Page 2, “Related Work”
  5. Topic models have also been applied to the related task of word sense induction.
    Page 2, “Related Work”
  6. WordNet is a fairly rich resource which provides detailed information about word senses (glosses, example sentences, synsets, semantic relations between senses, etc.).
    Page 3, “The Sense Disambiguation Model”
  7. However, this assumption does not hold, as the true distribution of word senses is often highly skewed (McCarthy, 2009).
    Page 4, “The Sense Disambiguation Model”
  8. Sense Paraphrases For word sense disambiguation tasks, the paraphrases of the sense keys are represented by information from WordNet 2.1.
    Page 5, “Experimental Setup”
  9. McCarthy (2009) also addresses the issue of performance and cost by comparing supervised word sense disambiguation systems with unsupervised ones.
    Page 6, “Experiments”
  10. The reason is that although this system is claimed to be unsupervised, and it performs better than all the participating systems (including the supervised systems) in the SemEval-2007 shared task, it still needs to incorporate a lot of prior knowledge, specifically information about co-occurrences between different word senses, which was obtained from a number of resources (SSI+LKB) including: (i) SemCor (manually annotated); (ii) LDC-DSO (partly manually annotated); (iii) collocation dictionaries which are then disambiguated semi-automatically.
    Page 6, “Experiments”
  11. Table 4: Model performance (F-score) for the fine-grained word sense disambiguation task.
    Page 8, “Experiments”

synsets

Appears in 12 sentences as: synset (5) synsets (12) synsets’ (1)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. Topics and synsets are then inferred together.
    Page 2, “Related Work”
  2. WordNet is a fairly rich resource which provides detailed information about word senses (glosses, example sentences, synsets, semantic relations between senses, etc.).
    Page 3, “The Sense Disambiguation Model”
  3. To obtain the paraphrases, we use the word forms, glosses and example sentences of the synset itself and a set of selected reference synsets (i.e., synsets linked to the target synset by specific semantic relations, see Table 1).
    Page 5, “Experimental Setup” (a code sketch of this paraphrase construction follows this list)
  4. We excluded the ‘hypernym reference synsets’, since information common to all of the child synsets may confuse the disambiguation process.
    Page 5, “Experimental Setup”
  5. In the latter case, each sense can be represented by its synset as well as its reference synsets.
    Page 5, “Experimental Setup”
  6. We think that there are three reasons for this: first, adjectives and adverbs have fewer reference synsets for paraphrases compared with nouns and verbs (see Table 1); second, adjectives and adverbs tend to convey less key semantic content in the document, so they are more difficult to capture by the topic model; and third, adjectives and adverbs are a small portion of the test set, so their performances are statistically unstable.
    Page 7, “Experiments”
  7. MII+ref is the result of including the reference synsets, while MII-ref excludes the reference synsets.
    Page 7, “Experiments”
  8. As can be seen from the table, including all reference synsets in sense paraphrases increases performance.
    Page 7, “Experiments”
  9. We find that nouns get the greatest performance boost from including reference synsets, as they have the largest number of different types of synsets.
    Page 7, “Experiments”
  10. We also find the ‘similar’ reference synset for adjectives to be very useful.
    Page 7, “Experiments”
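
Item 3 above describes how a sense paraphrase is assembled from the word forms, glosses and example sentences of the target synset plus selected reference synsets. A minimal sketch with NLTK's WordNet (3.0, not the paper's 2.1); the three relations used here are illustrative stand-ins for the paper's Table 1, with hypernyms left out in line with item 4:

    # Hypothetical sketch: build a bag-of-words 'paraphrase document' for a synset.
    from nltk.corpus import wordnet as wn

    def paraphrase_document(synset):
        # illustrative reference relations; the paper's Table 1 lists the real set
        refs = synset.hyponyms() + synset.member_meronyms() + synset.similar_tos()
        words = []
        for s in [synset] + refs:                  # target plus reference synsets
            words.extend(s.lemma_names())          # word forms
            words.extend(s.definition().split())   # gloss
            for ex in s.examples():                # example sentences
                words.extend(ex.split())
        return words

    print(paraphrase_document(wn.synset("bank.n.01"))[:15])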

conditional probability

Appears in 11 sentences as: conditional probabilities (3) conditional probability (11)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. This paper presents a probabilistic model for sense disambiguation which chooses the best sense based on the conditional probability of sense paraphrases given a context.
    Page 1, “Abstract”
  2. We use a topic model to decompose this conditional probability into two conditional probabilities with latent variables.
    Page 1, “Abstract”
  3. We approach the sense disambiguation task by choosing the best sense based on the conditional probability of sense paraphrases given a context.
    Page 1, “Introduction”
  4. We propose three models which are suitable for different situations: Model I requires knowledge of the prior distribution over senses and directly maximizes the conditional probability of a sense given the context; Model II maximizes this conditional probability by maximizing the cosine value of two topic-document vectors (one for the sense and one for the context).
    Page 1, “Introduction”
  5. Assigning the correct sense s to a target word w occurring in a context c involves finding the sense which maximizes the conditional probability of senses given a context:
    Page 3, “The Sense Disambiguation Model” (a formula reconstruction follows this list)
  6. This conditional probability is decomposed by incorporating a hidden variable, topic z, introduced by the topic model.
    Page 3, “The Sense Disambiguation Model”
  7. Model I directly maximizes the conditional probability of the sense given the context, where the sense is modeled as a ‘paraphrase document’ ds and the context as a ‘context document’ dc.
    Page 3, “The Sense Disambiguation Model”
  8. The conditional probability of sense given context p(ds|dc) can be rewritten as a joint probability divided by a normalization factor:
    Page 3, “The Sense Disambiguation Model”
  9. We apply the same process to the conditional probability p(dc|z).
    Page 3, “The Sense Disambiguation Model”
  10. One model from information retrieval takes the conditional probability of the query given the document as a product of all the conditional probabilities of words in the query given the document.
    Page 4, “The Sense Disambiguation Model”
  11. However, instead of taking the product of all the conditional probabilities of words given the document, we take the maximum.
    Page 4, “The Sense Disambiguation Model”
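
Items 5 through 8 above can be stitched into one plausible reconstruction of Model I's objective; this assumes, as the topic decomposition in item 6 suggests, that the paraphrase document d_s and the context document d_c are conditionally independent given the topic z (the paper's exact derivation may differ in detail):

    \[
    s^{*} \;=\; \arg\max_{s}\; p(d_s \mid d_c),
    \qquad
    p(d_s \mid d_c) \;=\; \frac{p(d_s, d_c)}{p(d_c)}
    \;=\; \frac{\sum_{z} p(d_s \mid z)\, p(d_c \mid z)\, p(z)}{p(d_c)}
    \]

Since p(d_c) is constant across candidate senses, only the joint term matters for the argmax.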
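
Items 10 and 11 contrast the standard query-likelihood product with the maximum that the paper takes instead. A minimal sketch of that contrast, assuming a distribution p(w|d_c) already obtained from the topic model via p(w|d_c) = sum_z p(w|z) p(z|d_c); the function names and the small flooring constant are hypothetical:

    import math

    def score_product(paraphrase_words, p_w_given_dc):
        # IR-style query likelihood: product over all paraphrase words
        return math.prod(p_w_given_dc.get(w, 1e-9) for w in paraphrase_words)

    def score_max(paraphrase_words, p_w_given_dc):
        # the paper's variant: keep only the single best-explained paraphrase word
        return max(p_w_given_dc.get(w, 0.0) for w in paraphrase_words)

    def best_sense(senses, p_w_given_dc, score=score_max):
        # senses: {sense_key: [paraphrase words]}, e.g. built from WordNet as above
        return max(senses, key=lambda s: score(senses[s], p_w_given_dc))

Taking the maximum makes the score robust to long paraphrases full of incidental words: a single strongly matching paraphrase word is enough to support a sense.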

fine-grained

Appears in 9 sentences as: Fine-grained (2) fine-grained (7)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. The proposed models are tested on three different tasks: coarse-grained word sense disambiguation, fine-grained word sense disambiguation, and detection of literal vs. nonliteral usages of potentially idiomatic expressions.
    Page 1, “Abstract”
  2. We apply these models to coarse- and fine-grained WSD and find that they outperform comparable systems for both tasks.
    Page 1, “Introduction”
  3. We evaluate our models on three different tasks: coarse-grained WSD, fine-grained WSD and literal vs. nonliteral sense detection.
    Page 4, “Experimental Setup”
  4. To determine whether our model is also suitable for fine-grained WSD, we test on the data provided by Pradhan et al. (2009) for the SemEval-2007 Task-17 (English fine-grained all-words task).
    Page 5, “Experimental Setup”
  5. Table 4: Model performance (F-score) for the fine-grained word sense disambiguation task.
    Page 8, “Experiments”
  6. 5.2 Fine-grained WSD
    Page 8, “Experiments”
  7. Fine-grained WSD, however, is a more difficult task.
    Page 8, “Experiments”
  8. In the previous section, we provided the results of applying our framework to coarse- and fine-grained word sense disambiguation tasks.
    Page 8, “Experiments”

topic distribution

Appears in 7 sentences as: topic distribution (4) topic distributions (2) topics distributions (1)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. In this paper, we propose a novel framework which is fairly resource-poor in that it requires only 1) a large unlabelled corpus from which to estimate the topics distributions , and 2) paraphrases for the possible target senses.
    Page 1, “Introduction”
  2. In addition to generating a topic from the document’s topic distribution and sampling a word from that topic, the enhanced model also generates a distributional neighbour for the chosen word and then assigns a sense based on the word, its neighbour and the topic.
    Page 2, “Related Work”
  3. A similar topic distribution to that of the individual words ‘norm’ or ‘trouble’ would be strong supporting evidence of the corresponding idiomatic reading.
    Page 4, “The Sense Disambiguation Model”
  4. We then compare the topic distributions of literal and nonliteral senses.
    Page 8, “Experiments”
  5. As the topic distributions of nouns and verbs exhibit different properties, topic comparisons across parts-of-speech do not make sense.
    Page 8, “Experiments”
  6. …make the topic distributions comparable by making sure each type of paraphrase contains the same sets of parts-of-speech.
    Page 9, “Experiments”
  7. The basic idea of these models is to compare the topic distribution of a target instance with the candidate sense paraphrases and choose the most probable one.
    Page 9, “Conclusion”

F-score

Appears in 5 sentences as: F-score (5)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. We only compare the F-score, since all the compared systems have an attempted rate of 1.0.
    Page 6, “Experiments”
  2. …F-score (Fl).
    Page 7, “Experiments”
  3. System F-score
    Page 8, “Experiments”
  4. Table 4: Model performance (F-score) for the fine-grained word sense disambiguation task.
    Page 8, “Experiments”
  5. …literal precision (Precl), literal recall (Recl), literal F-score (Fl), accuracy (Acc.).
    Page 8, “Experiments”

propose Model

Appears in 5 sentences as: propose Model (2) proposed model (1) proposed models (2)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. The proposed models are tested on three different tasks: coarse-grained word sense disambiguation, fine-grained word sense disambiguation, and detection of literal vs. nonliteral usages of potentially idiomatic expressions.
    Page 1, “Abstract”
  2. To overcome this problem, we propose Model II, which indirectly maximizes the sense-context probability by maximizing the cosine value of two document vectors that encode the document-topic frequencies from sampling, v(z|dc) and v(z|ds).
    Page 4, “The Sense Disambiguation Model” (a code sketch of this comparison follows this list)
  3. We propose Model III:
    Page 4, “The Sense Disambiguation Model”
  4. Table 5 shows the results of our proposed model compared with state-of-the-art systems.
    Page 8, “Experiments”
  5. We test the proposed models on three tasks.
    Page 9, “Conclusion”
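
Item 2 above describes Model II's indirect maximization: compare the sampled topic-count vectors of the context document and the sense-paraphrase document by cosine. A minimal sketch; the two example vectors are made-up sampled topic counts, not data from the paper:

    import numpy as np

    def cosine(v_ctx, v_sense):
        # cosine similarity between two topic-count vectors from Gibbs sampling
        v_ctx, v_sense = np.asarray(v_ctx, float), np.asarray(v_sense, float)
        return float(v_ctx @ v_sense /
                     (np.linalg.norm(v_ctx) * np.linalg.norm(v_sense)))

    v_dc = [12, 0, 3, 25, 1]    # hypothetical topic counts v(z|dc) for the context
    v_ds = [10, 1, 0, 30, 2]    # hypothetical topic counts v(z|ds) for one sense
    print(cosine(v_dc, v_ds))   # the sense with the highest cosine wins

Model I, by contrast, requires knowledge of the prior sense distribution (see ‘conditional probability’, item 4), which matches the paper's framing of the three models as suited to different degrees of resource availability.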

statistically significantly

Appears in 5 sentences as: statistically significant (1) statistically significantly (4)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. In all three cases, we outperform state-of-the-art systems either quantitatively or statistically significantly.
    Page 1, “Abstract”
  2. We find that Model I performs better than both the best unsupervised system, RACAI (Ion and Tufis, 2007), and the most frequent sense baseline (BMFS), although these differences are not statistically significant due to the small size of the available test data (465).
    Page 8, “Experiments”
  3. For both tasks, our models outperform the state-of-the-art systems of the same type either quantitatively or statistically significantly.
    Page 8, “Experiments”
  4. …system by Li and Sporleder (2009), although not statistically significantly.
    Page 8, “Experiments”
  5. We find that all models outperform comparable state-of-the-art systems either quantitatively or statistically significantly.
    Page 9, “Conclusion”

manually annotated

Appears in 4 sentences as: manually annotated (5)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. One major factor that makes WSD difficult is a relative lack of manually annotated corpora, which hampers the performance of supervised systems.
    Page 1, “Introduction”
  2. This dataset consists of 3964 instances of 17 potential English idioms which were manually annotated as literal or nonliteral.
    Page 5, “Experimental Setup”
  3. The reason is that although this system is claimed to be unsupervised, and it performs better than all the participating systems (including the supervised systems) in the SemEval-2007 shared task, it still needs to incorporate a lot of prior knowledge, specifically information about co-occurrences between different word senses, which was obtained from a number of resources (SSI+LKB) including: (i) SemCor (manually annotated); (ii) LDC-DSO (partly manually annotated); (iii) collocation dictionaries which are then disambiguated semi-automatically.
    Page 6, “Experiments”
  4. Even though the system is not ‘trained’, it needs a lot of information which is largely dependent on manually annotated data, so it does not fit neatly into the categories Type II or Type III either.
    Page 6, “Experiments”

LDA

Appears in 3 sentences as: LDA (3)
In Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
  1. Cai et al. (2007), for example, use LDA to capture global context.
    Page 2, “Related Work”
  2. (2007) enhance the basic LDA algorithm by incorporating WordNet senses as an additional latent variable.
    Page 2, “Related Work”
  3. LDA is a Bayesian version of this framework with Dirichlet hyper-parameters (Blei et al., 2003).
    Page 3, “The Sense Disambiguation Model”
