A Bayesian Method for Robust Estimation of Distributional Similarities
Jun'ichi Kazama, Stijn De Saeger, Kow Kuroda, Masaki Murata, and Kentaro Torisawa

Article Structure

Abstract

Existing word similarity measures are not robust to data sparseness since they rely only on the point estimation of words’ context profiles obtained from a limited amount of data.

Introduction

The semantic similarity of words is a longstanding topic in computational linguistics because it is theoretically intriguing and has many applications in the field.

Background

2.1 Bayesian estimation with Dirichlet prior

Method

In this section, we show that if our base similarity measure is BC and the distributions under which we take the expectation are Dirichlet distributions, then Eq.
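
To make the quoted claim concrete, the following is a sketch of how such an expectation comes out in closed form. It is reconstructed from the notation quoted elsewhere on this page (α and β as the Dirichlet prior hyperparameters for the two words, C(w, fk) as observed counts) and standard Dirichlet moments, so the paper's actual equation and its numbering may differ in presentation.

```latex
% Expected Bhattacharyya coefficient under independent Dirichlet posteriors.
% alpha'_k = alpha_k + C(w1, f_k) and beta'_k = beta_k + C(w2, f_k) are the
% posterior hyperparameters; alpha'_0 and beta'_0 are their sums over k.
\mathrm{BC}_{b}(w_1, w_2)
  = \mathbb{E}_{\phi \sim \mathrm{Dir}(\alpha')}\,
    \mathbb{E}_{\psi \sim \mathrm{Dir}(\beta')}
    \Big[\textstyle\sum_k \sqrt{\phi_k\,\psi_k}\Big]
  = \sum_k
    \frac{\Gamma(\alpha'_k + \frac{1}{2})}{\Gamma(\alpha'_k)}\,
    \frac{\Gamma(\alpha'_0)}{\Gamma(\alpha'_0 + \frac{1}{2})}\,
    \frac{\Gamma(\beta'_k + \frac{1}{2})}{\Gamma(\beta'_k)}\,
    \frac{\Gamma(\beta'_0)}{\Gamma(\beta'_0 + \frac{1}{2})}
```

The sum over k factorizes because the two posteriors are independent, and each E[sqrt(φk)] is the 1/2-th fractional moment of the Beta(α'k, α'0 - α'k) marginal of a Dirichlet.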

Implementation Issues

Although we have derived the analytical form (Eq.
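
As a hedged illustration of how such a closed form might be implemented (a minimal sketch under my assumptions, not the paper's code), the gamma-function ratios are best evaluated in log space; the function name bayes_bc and the single symmetric hyperparameter alpha below are illustrative only.

```python
# Sketch: expected Bhattacharyya coefficient under Dirichlet posteriors,
# evaluated in log space with log-gamma for numerical stability.
# Assumes a symmetric prior (all hyperparameters equal to `alpha`).
import numpy as np
from scipy.special import gammaln

def bayes_bc(counts1, counts2, alpha=1.0):
    """counts1, counts2: arrays of C(w1, fk) and C(w2, fk) over a shared feature set."""
    a = np.asarray(counts1, dtype=float) + alpha   # posterior hyperparameters for w1
    b = np.asarray(counts2, dtype=float) + alpha   # posterior hyperparameters for w2
    a0, b0 = a.sum(), b.sum()
    # log E[sqrt(phi_k)] = gammaln(a_k + 1/2) - gammaln(a_k) + gammaln(a0) - gammaln(a0 + 1/2)
    log_ea = gammaln(a + 0.5) - gammaln(a) + gammaln(a0) - gammaln(a0 + 0.5)
    log_eb = gammaln(b + 0.5) - gammaln(b) + gammaln(b0) - gammaln(b0 + 0.5)
    return float(np.exp(np.logaddexp.reduce(log_ea + log_eb)))  # = sum_k E[sqrt(phi_k)] E[sqrt(psi_k)]

# Toy usage: two words observed with counts over three shared context features
print(bayes_bc([3, 0, 1], [2, 1, 0], alpha=0.5))
```

Since each log term depends on only one word's counts, the per-word factors can be precomputed once per word, which matters when scoring pairs over a vocabulary on the order of one million nouns as mentioned in the Introduction.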

Experiments

5.1 Evaluation setting

Discussion

We should note that the improvement from our method holds only “on average,” as in many other NLP tasks, and a clear qualitative change is relatively difficult to observe, for example, simply by listing examples of similar words here.

Conclusion

We proposed a Bayesian method for robust distributional word similarities.

Topics

similarity measure

Appears in 25 sentences as: similarity measure (13) similarity measures (13)
In A Bayesian Method for Robust Estimation of Distributional Similarities
  1. Existing word similarity measures are not robust to data sparseness since they rely only on the point estimation of words’ context profiles obtained from a limited amount of data.
    Page 1, “Abstract”
  2. The method uses a distribution of context profiles obtained by Bayesian estimation and takes the expectation of a base similarity measure under that distribution.
    Page 1, “Abstract”
  3. For the task of word similarity estimation using a large amount of Web data in Japanese, we show that the proposed measure gives better accuracies than other well-known similarity measures.
    Page 1, “Abstract”
  4. A number of semantic similarity measures have been proposed based on this hypothesis (Hindle, 1990; Grefenstette, 1994; Dagan et al., 1994; Dagan et al., 1995; Lin, 1998; Dagan et al., 1999).
    Page 1, “Introduction”
  5. In general, most semantic similarity measures have the following form:
    Page 1, “Introduction”
  6. Our technical contribution in this paper is to show that in the case where the context profiles are multinomial distributions, the priors are Dirichlet, and the base similarity measure is the Bhattacharyya coefficient (Bhattacharyya, 1943), we can derive an analytical form for Eq.
    Page 2, “Introduction”
  7. In experiments, we estimate semantic similarities using a large amount of Web data in Japanese and show that the proposed measure gives better word similarities than a non-Bayesian Bhattacharyya coefficient or other well-known similarity measures such as Jensen-Shannon divergence and the cosine with PMI weights.
    Page 2, “Introduction”
  8. The BC is also a similarity measure on probability distributions and is suitable for our purposes as we describe in the next section.
    Page 3, “Background”
  9. Although BC has not been explored well in the literature on distributional word similarities, it is also a good similarity measure as the experiments show.
    Page 3, “Background”
  10. In this section, we show that if our base similarity measure is BC and the distributions under which we take the expectation are Dirichlet distributions, then Eq.
    Page 3, “Method”
  11. To put it all together, we can obtain a new Bayesian similarity measure on words, which can be calculated only from the hyperparameters for the Dirichlet prior, α and β, and the observed counts C(wi, fk).
    Page 3, “Method”


hyperparameters

Appears in 7 sentences as: hyperparameters (7)
In A Bayesian Method for Robust Estimation of Distributional Similarities
  1. The Dirichlet distribution is parametrized by hyperparameters αk (> 0).
    Page 2, “Background”
  2. where C(k) is the frequency of choice k in data D. For example, C(k) = C(wi, fk) in the estimation of p(fk|wi). This is very simple: we just need to add the observed counts to the hyperparameters.
    Page 2, “Background”
  3. Note that with the Dirichlet prior, α'k = αk + C(w1, fk) and β'k = βk + C(w2, fk), where αk and βk are the hyperparameters of the priors of w1 and w2, respectively.
    Page 3, “Method”
  4. To put it all together, we can obtain a new Bayesian similarity measure on words, which can be calculated only from the hyperparameters for the Dirichlet prior, α and β, and the observed counts C(wi, fk).
    Page 3, “Method”
  5. We randomly chose 200 sets each for sets “A” and “B.” Set “A” is a development set to tune the value of the hyperparameters and
    Page 5, “Experiments”
  6. As for BCb, we assumed that all of the hyperparameters had the same value, i.e., αk = α.
    Page 6, “Experiments”
  7. Because tuning hyperparameters involves the possibility of overfitting, its robustness should be assessed.
    Page 6, “Experiments”
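
The update quoted above (adding the observed counts to the hyperparameters) follows from the standard Dirichlet-multinomial conjugacy; a minimal worked derivation, using the generic notation C(k) from the quote above rather than the paper's exact presentation:

```latex
% Dirichlet prior times multinomial likelihood yields a Dirichlet posterior
% whose hyperparameters are the prior ones plus the observed counts C(k).
p(\phi \mid D)
  \;\propto\; p(D \mid \phi)\, p(\phi)
  \;\propto\; \prod_k \phi_k^{C(k)} \cdot \prod_k \phi_k^{\alpha_k - 1}
  \;=\; \prod_k \phi_k^{\alpha_k + C(k) - 1}
  \;\;\Longrightarrow\;\;
  \phi \mid D \;\sim\; \mathrm{Dir}\big(\alpha_1 + C(1), \ldots, \alpha_K + C(K)\big)
```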


semantic similarity

Appears in 7 sentences as: semantic similarities (3) semantic similarity (4)
In A Bayesian Method for Robust Estimation of Distributional Similarities
  1. The semantic similarity of words is a longstanding topic in computational linguistics because it is theoretically intriguing and has many applications in the field.
    Page 1, “Introduction”
  2. A number of semantic similarity measures have been proposed based on this hypothesis (Hindle, 1990; Grefenstette, 1994; Dagan et al., 1994; Dagan et al., 1995; Lin, 1998; Dagan et al., 1999).
    Page 1, “Introduction”
  3. In general, most semantic similarity measures have the following form:
    Page 1, “Introduction”
  4. Previous studies have focused on how to devise good contexts and a good function g for semantic similarities.
    Page 1, “Introduction”
  5. There has been no study that seriously dealt with data sparseness in the context of semantic similarity calculation.
    Page 2, “Introduction”
  6. Since our motivation for this research is to calculate good semantic similarities for a large set of words (e.g., one million nouns) and apply them to a wide range of NLP tasks, such costs must be minimized.
    Page 2, “Introduction”
  7. In experiments, we estimate semantic similarities using a large amount of Web data in Japanese and show that the proposed measure gives better word similarities than a non-Bayesian Bhattacharyya coefficient or other well-known similarity measures such as Jensen-Shannon divergence and the cosine with PMI weights.
    Page 2, “Introduction”


data sparseness

Appears in 6 sentences as: data sparseness (6)
In A Bayesian Method for Robust Estimation of Distributional Similarities
  1. Existing word similarity measures are not robust to data sparseness since they rely only on the point estimation of words’ context profiles obtained from a limited amount of data.
    Page 1, “Abstract”
  2. In the NLP field, data sparseness has been recognized as a serious problem and tackled in the context of language modeling and supervised machine learning.
    Page 1, “Introduction”
  3. There has been no study that seriously dealt with data sparseness in the context of semantic similarity calculation.
    Page 2, “Introduction”
  4. The data sparseness problem is usually solved by smoothing, regularization, margin maximization and so on (Chen and Goodman, 1998; Chen and Rosenfeld, 2000; Cortes and Vapnik, 1995).
    Page 2, “Introduction”
  5. The uncertainty due to data sparseness is represented by the posterior distribution of the context profile, and taking the expectation enables us to take this into account.
    Page 2, “Introduction”
  6. In this study, we combined two clustering results (denoted as “s1+s2” in the results), each of which (“s1” and “s2”) has 2,000 hidden classes. We included this method since clustering can be regarded as another way of treating data sparseness.
    Page 6, “Experiments”


dependency relations

Appears in 4 sentences as: dependency relation (1) Dependency relations (1) dependency relations (2)
In A Bayesian Method for Robust Estimation of Distributional Similarities
  1. Each dimension of the vector corresponds to a context, fk, which is typically a neighboring word or a word having dependency relations with wi in a corpus.
    Page 1, “Introduction”
  2. Dependency relations are used as context profiles as in Kazama and Torisawa (2008) and Kazama et al.
    Page 4, “Experiments”
  3. For example, we extract a dependency relation from the sentence below, where the postposition “を (wo)” is used to mark the verb object.
    Page 5, “Experiments”
  4. Kazama et al. (2009) proposed using the Jensen-Shannon divergence between hidden class distributions, p(c|w1) and p(c|w2), which are obtained by using an EM-based clustering of dependency relations with a model p(wi, fk) = Σc p(wi|c) p(fk|c) p(c) (Kazama and Torisawa, 2008).
    Page 5, “Experiments”


probability distributions

Appears in 3 sentences as: probability distribution (1) probability distributions (3)
In A Bayesian Method for Robust Estimation of Distributional Similarities
  1. Estimating a conditional probability distribution φk = p(fk|wi) as a context profile for each wi falls into this case.
    Page 2, “Background”
  2. When the context profiles are probability distributions, we usually utilize the measures on probability distributions such as the Jensen-Shannon (JS) divergence to calculate similarities (Dagan et al., 1994; Dagan et al., 1997).
    Page 3, “Background”
  3. The BC is also a similarity measure on probability distributions and is suitable for our purposes as we describe in the next section.
    Page 3, “Background”
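
For contrast with the Bayesian measure sketched earlier, here is a minimal sketch of the two point-estimate measures mentioned above, the plain Bhattacharyya coefficient and the Jensen-Shannon divergence, applied to two context profiles; the function names are illustrative only.

```python
# Sketch: point-estimate similarity measures on two probability distributions
# p and q (e.g., maximum-likelihood context profiles p(fk | w)).
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient: 1.0 for identical distributions, 0.0 for disjoint support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(np.sqrt(p * q)))

def _kl(a, b):
    """KL divergence restricted to the support of a (b is positive there when b is the mixture m)."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: 0.0 for identical distributions (lower means more similar)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

# Toy usage with two small context profiles
print(bhattacharyya([0.6, 0.3, 0.1], [0.5, 0.4, 0.1]))
print(jensen_shannon([0.6, 0.3, 0.1], [0.5, 0.4, 0.1]))
```

Note that BC is a similarity (higher means more similar) while JS is a divergence (lower means more similar), so the two scores are not directly comparable in scale.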
