Abstract | We propose an integrated distributional similarity filter to identify and censor potential semantic drifts, ensuring over 10% higher precision when extracting large semantic lexicons.
Background | 2.2 Distributional Similarity |
Background | Distributional similarity has been used to extract semantic lexicons (Grefenstette, 1994), based on the distributional hypothesis that semantically similar words appear in similar contexts (Harris, 1954). |
Background | (2006) used 11 patterns, and the distributional similarity score of each pair of terms, to construct features for lexical entailment. |
Conclusion | In this paper, we have proposed unsupervised bagging and integrated distributional similarity to minimise the problem of semantic drift in iterative bootstrapping algorithms, particularly when extracting large semantic lexicons. |
Detecting semantic drift | In this section, we propose distributional similarity measurements over the extracted lexicon to detect semantic drift during the bootstrapping process. |
Detecting semantic drift | We calculate the average distributional similarity (sim) of t with all terms in L_{1..n}, and those in L_{(N-m)..N}, and call the ratio the drift for term t:
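The ratio just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the pairwise similarity function `sim` and the ordering of the lexicon by extraction iteration are assumptions.

```python
# Sketch of the drift ratio: average similarity with the first n
# extracted terms divided by average similarity with the last m.
# `sim` and the lexicon ordering are illustrative assumptions.

def avg_sim(term, terms, sim):
    """Average distributional similarity of `term` with a list of terms."""
    return sum(sim(term, t) for t in terms) / len(terms)

def drift(term, lexicon, n, m, sim):
    """Ratio of `term`'s average similarity with the first n terms
    (the most reliable) to that with the last m terms added."""
    head = lexicon[:n]    # L_{1..n}
    tail = lexicon[-m:]   # L_{(N-m)..N}
    return avg_sim(term, head, sim) / avg_sim(term, tail, sim)
```

A drift value well below 1 suggests the candidate term is closer to the recently added (possibly drifted) terms than to the trusted seed-like terms.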
Detecting semantic drift | For calculating drift we use the distributional similarity approach described in Curran (2004). |
Introduction | We integrate a distributional similarity filter directly into WMEB (McIntosh and Curran, 2008). |
Introduction | Our distributional similarity filter gives a similar performance improvement. |
Abstract | The best results are obtained with a novel second-order distributional similarity measure, and the positive effect is especially relevant for out-of-domain data.
Related Work | Distributional similarity has also been used to tackle syntactic ambiguity. |
Related Work | Pantel and Lin (2000) obtained very good results using the distributional similarity measure defined by Lin (1998). |
Related Work | The results over 100 frame-specific roles showed that distributional similarities get smaller error rates than Resnik and EM, with Lin’s formula having the smallest error rate. |
Results and Discussion | Regarding the selectional preference variants, WordNet-based and first-order distributional similarity models attain similar levels of precision, but the former are clearly worse on recall and F1.
Results and Discussion | The second-order distributional similarity measures perform best overall, both in precision and recall. |
Results and Discussion | Regarding the similarity metrics, the cosine seems to perform consistently better for first-order distributional similarity, while Jaccard provided slightly better results for second-order similarity.
Selectional Preference Models | Distributional SP models: Given the availability of publicly available resources for distributional similarity, we used 1) a ready-made thesaurus (Lin, 1998), and 2) software (Padó and Lapata, 2007) which we ran on the British National Corpus (BNC).
Background | Entailment learning Two information types have primarily been utilized to learn entailment rules between predicates: lexicographic resources and distributional similarity resources. |
Background | Therefore, distributional similarity is used to learn broad-scale resources. |
Background | Distributional similarity algorithms predict a semantic relation between two predicates by comparing the arguments with which they occur. |
Experimental Evaluation | When computing distributional similarity scores, a template is represented as a feature vector of the CUIs that instantiate its arguments. |
Experimental Evaluation | Local algorithms We described 12 distributional similarity measures computed over our corpus (Section 5.1). |
Experimental Evaluation | For each distributional similarity measure (altogether 16 measures), we learned a graph by inserting any edge (u, v) when u is in the top K templates most similar to v.
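The edge-insertion rule above can be sketched as follows. This is a toy reconstruction under stated assumptions: a `score(u, v)` function standing in for one of the distributional similarity measures, and small template identifiers in place of real propositional templates.

```python
# Illustrative graph construction: add a directed edge (u, v) when u
# is among the top K templates most similar to v. The score function
# and template names are toy assumptions, not the paper's data.

def build_graph(templates, score, K):
    """Return the set of edges (u, v) per the top-K rule."""
    edges = set()
    for v in templates:
        ranked = sorted((u for u in templates if u != v),
                        key=lambda u: score(u, v), reverse=True)
        for u in ranked[:K]:
            edges.add((u, v))
    return edges
```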
Learning Entailment Graph Edges | Next, we represent each pair of propositional templates with a feature vector of various distributional similarity scores. |
Learning Entailment Graph Edges | Distributional similarity representation We aim to train a classifier that for an input template pair (t1, t2) determines whether t1 entails t2.
Learning Entailment Graph Edges | A template pair is represented by a feature vector where each coordinate is a different distributional similarity score. |
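The feature representation described above can be sketched very simply. The measure functions below are placeholders for the actual distributional similarity scores, which are not specified here.

```python
# Sketch: a template pair is mapped to a vector whose coordinates are
# different distributional similarity scores. The measures are
# placeholder functions, assumptions for illustration only.

def pair_features(t1, t2, measures):
    """measures: list of functions (t1, t2) -> similarity score."""
    return [m(t1, t2) for m in measures]
```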
Evaluation | The background documents consist of 2.7M running words, which were used to compute distributional similarity.
Evaluation | The context metric space was composed of the k-nearest neighbor words by distributional similarity (Lin, 1998), as described in Section 4.
Evaluation | (2004), which determines the word sense based on sense similarity and distributional similarity to the k-nearest neighbor words of a target word, selected by distributional similarity.
Introduction | (2004) proposed a method to combine sense similarity with distributional similarity to compute a predominant sense score.
Introduction | Distributional similarity was used to weight the influence of context words, based on large-scale statistics. |
Introduction | (2009) used the k-nearest words by distributional similarity as context words.
Metric Space Implementation | Distributional similarity (Lin, 1998) was computed among target words, based on the statistics of the test set and the background text provided as the official dataset of the SemEval-2 English all-words task (Agirre et al., 2010). |
Metric Space Implementation | Those texts were parsed using the RASP parser (Briscoe et al., 2006) version 3.1, to obtain grammatical relations for the distributional similarity, as well as to obtain lemmata and part-of-speech (POS) tags which are required to look up the sense inventory of WordNet.
Metric Space Implementation | Based on the distributional similarity, we simply used the k-nearest neighbor words as the context of each target word.
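The neighbor selection just described can be sketched as below. The vocabulary and the similarity function are invented for illustration; the papers above use Lin (1998) similarity computed from parsed corpora.

```python
# Toy sketch: take the k most similar words to a target under some
# similarity function and use them as its context. The similarity
# function here is an assumption, not Lin's actual measure.

def knn_context(target, vocab, sim, k):
    """Return the k nearest neighbour words of `target` in `vocab`."""
    return sorted((w for w in vocab if w != target),
                  key=lambda w: sim(target, w), reverse=True)[:k]
```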
Abstract | Automatic acquisition of inference rules for predicates has been commonly addressed by computing distributional similarity between vectors of argument words, operating at the word space level. |
Background and Model Setting | learning, based on distributional similarity at the word level, and then context-sensitive scoring for rule applications, based on topic-level similarity. |
Background and Model Setting | The DIRT algorithm (Lin and Pantel, 2001) follows the distributional similarity paradigm to learn predicate inference rules. |
Discussion and Future Work | In particular, we proposed a novel scheme that applies over any base distributional similarity measure which operates at the word level, and computes a single context-insensitive score for a rule. |
Discussion and Future Work | We therefore focused on comparing the performance of our two-level scheme with state-of-the-art prior topic-level and word-level models of distributional similarity, over a random sample of inference rule applications.
Experimental Settings | Since our model can contextualize various distributional similarity measures, we evaluated the performance of all the above methods on several base similarity measures and their learned rule-sets.
Experimental Settings | Whenever we evaluated a distributional similarity measure (namely Lin, BInc, or Cosine), we discarded instances from Zeichner et al.’s dataset in which the assessed rule is not in the context-insensitive rule-set learned for this measure or the argument instantiation of the rule is not in the LDA lexicon. |
Results | Specifically, topics are leveraged for high-level domain disambiguation, while fine-grained word-level distributional similarity is computed for each rule under each such domain.
Results | Indeed, on test-setvc, in which context mismatches are rare, our algorithm is still better than the original measure, indicating that WT can be safely applied to distributional similarity measures without concerns of reduced performance in different context scenarios. |
Two-level Context-sensitive Inference | On the other hand, the topic-biased similarity for t1 is substantially lower, since prominent words in this topic are likely to occur with ‘acquire’ but not with ‘learn’, yielding low distributional similarity.
Related Work | similarity: 1) thesaurus-based word similarity, 2) distributional similarity, and 3) confusion set derived from a learner corpus.
Related Work | Distributional Similarity: Thesaurus-based methods produce weak recall since many words, phrases and semantic connections are not covered by hand-built thesauri, especially for verbs and adjectives.
Related Work | As an alternative, distributional similarity models are often used since they give higher recall.
Background and Related Work | Most approaches to the task used distributional similarity as a major component within their system. |
Background and Related Work | (2006) presented a system for learning inference rules between nouns, using distributional similarity and pattern-based features. |
Background and Related Work | (2011) used distributional similarity between predicates to weight the edges of an entailment graph. |
Discussion | The distributional similarity between p_L and p_R under this model is Sim(p_L, p_R) = \sum_i sim(w_i, w'_i), where sim(w_i, w'_i) is the dot product between v_i and v'_i.
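The aggregation described above can be sketched as a sum of word-level dot products. The vectors below are invented for illustration; how the word pairs are aligned between the two predicates is an assumption.

```python
# Sketch: predicate-level similarity as the sum of word-level
# similarities, each computed as a dot product of word vectors.
# The vectors and the pairing of words are toy assumptions.

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def predicate_sim(left_vectors, right_vectors):
    """Sim(pL, pR) = sum over aligned word pairs of their dot product."""
    return sum(dot(u, v) for u, v in zip(left_vectors, right_vectors))
```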
Introduction | Most work on this task uses distributional similarity, either as its main component (Szpektor and Dagan, 2008; Melamud et al., 2013b), or as part of a more comprehensive system (Berant et al., 2011; Lewis and Steedman, 2013).
Our Proposal: A Latent LC Approach | Distributional Similarity Features. |
Our Proposal: A Latent LC Approach | The distributional similarity features are based on the DIRT system (Lin and Pantel, 2001). |
Background | Most work on learning entailment rules between predicates considered each rule independently of others, using two sources of information: lexicographic resources and distributional similarity . |
Background | Distributional similarity algorithms use large corpora to learn broader resources by assuming that semantically similar predicates appear with similar arguments. |
Background | Distributional similarity algorithms differ in their feature representation: Some use a binary representation: each predicate is represented by one feature vector where each feature is a pair of arguments (Szpektor et al., 2004; Yates and Etzioni, 2009). |
Experimental Evaluation | Second, to distributional similarity algorithms: (a) SR: the score used by Schoenmackers et al. |
Experimental Evaluation | Third, we compared to the entailment classifier with no transitivity constraints (clsf) to see if combining distributional similarity scores improves performance over single measures. |
Learning Typed Entailment Graphs | We compute 11 distributional similarity scores for each pair of predicates based on the arguments appearing in the extracted arguments. |
Abstract | Finally, we present a ranker that employs distributional similarities to build a network of words, and captures the diversity of perspectives by detecting communities in this network. |
Conclusion and Future Work | Finally, we proposed a ranking system that employs word distributional similarities to identify semantically equivalent words, and compared it with a wide |
Diversity-based Ranking | 5.1 Distributional Similarity |
Diversity-based Ranking | In order to capture the nuggets of equivalent semantic classes, we use a distributional similarity of |
Diversity-based Ranking | The method based on the distributional similarities of words outperforms other methods in the citations category. |
Background | To date, most distributional similarity research concentrated on symmetric measures, such as the widely cited and competitive (as shown in (Weeds and Weir, 2003)) LIN measure (Lin, 1998): |
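The LIN measure referenced above compares two words through their weighted feature vectors: shared features contribute the sum of their weights in both vectors, normalized by the total weight mass of each word. The feature dictionaries below are toy assumptions; in Lin (1998) the weights are mutual-information values over grammatical-relation features.

```python
# Sketch of the LIN similarity over weighted feature vectors.
# feats: dict mapping a feature to its positive weight (e.g. PMI).
# The example weights are invented for illustration.

def lin_sim(feats1, feats2):
    shared = set(feats1) & set(feats2)
    num = sum(feats1[f] + feats2[f] for f in shared)
    den = sum(feats1.values()) + sum(feats2.values())
    return num / den if den else 0.0
```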
Evaluation and Results | In this setting, category names were taken as seeds and expanded by distributional similarity , further measuring cosine similarity with categorized documents similarly to IR query expansion. |
Introduction | Much work on automatic identification of semantically similar terms exploits Distributional Similarity, assuming that such terms appear in similar contexts.
Introduction | This paper is motivated by one of the prominent applications of distributional similarity , namely identifying lexical expansions. |
Introduction | Often, distributional similarity measures are used to identify expanding terms (e.g.
Acquiring Paraphrases | 3.1 Distributional Similarity |
Conclusion | We have shown that high precision surface paraphrases can be obtained by using distributional similarity on a large corpus. |
Introduction | A popular method, the so-called distributional similarity , is based on the dictum of Zelig Harris “you shall know the words by the company they keep”: given highly discriminating left and right contexts, only words with very similar meaning will be found to fit in between them. |
Related Work | Our method however, pre-computes paraphrases for a large set of surface patterns using distributional similarity over a large corpus and then obtains patterns for a relation by simply finding paraphrases (offline) for a few seed patterns. |
Related Work | Using distributional similarity avoids the problem of obtaining overly general patterns and the pre-computation of paraphrases means that we can obtain the set of patterns for any relation instantaneously. |
Abstract | The focus of this paper is drawing nuanced, connotative sentiments from even those words that are objective on the surface, such as “intelligence”, “human”, and “cheesecake”. We propose induction algorithms encoding a diverse set of linguistic insights (semantic prosody, distributional similarity, semantic parallelism of coordination) and prior knowledge drawn from lexical resources, resulting in the first broad-coverage connotation lexicon.
Connotation Induction Algorithms | The second subgraph is based on the distributional similarities among the arguments. |
Connotation Induction Algorithms | One possible way of constructing such a graph is simply connecting all nodes and assigning edge weights proportionate to the word association scores, such as PMI, or distributional similarity.
Connotation Induction Algorithms | where Φ^{prosody} is the score based on semantic prosody, Φ^{coord} captures the distributional similarity over coordination, and Φ^{neutral} controls the sensitivity of connotation detection between positive (negative) and neutral.
Introduction | Therefore, in order to attain a broad coverage lexicon while maintaining good precision, we guide the induction algorithm with multiple, carefully selected linguistic insights: [1] distributional similarity, [2] semantic parallelism of coordination, [3] selectional preference, and [4] semantic prosody (e.g., Sinclair (1991), Louw (1993), Stubbs (1995), Stefanowitsch and Gries (2003)), and also exploit existing lexical resources as an additional inductive bias.
Evaluation | The fifth Arabic-English example demonstrates the pitfalls of over-reliance on the distributional hypothesis: the source bigram corresponding to the name “abd almahmood” is distributionally similar to another named entity “mahmood” and the English equivalent is offered as a translation.
Generation & Propagation | Co-occurrence counts for each feature (context word) are accumulated over the monolingual corpus, and these counts are converted to pointwise mutual information (PMI) values, as is standard practice when computing distributional similarities.
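The count-to-PMI conversion mentioned above follows the standard definition PMI(w, c) = log( p(w, c) / (p(w) p(c)) ). Below is a minimal sketch over a toy co-occurrence table; the counts are assumptions for illustration.

```python
# Minimal PMI conversion for a word-by-context co-occurrence table.
# The toy counts are invented; real systems accumulate them over a
# large monolingual corpus.
import math

def pmi_matrix(counts):
    """counts[(word, ctx)] -> co-occurrence count; returns PMI values."""
    total = sum(counts.values())
    word_tot, ctx_tot = {}, {}
    for (w, c), n in counts.items():
        word_tot[w] = word_tot.get(w, 0) + n
        ctx_tot[c] = ctx_tot.get(c, 0) + n
    return {
        (w, c): math.log((n / total) /
                         ((word_tot[w] / total) * (ctx_tot[c] / total)))
        for (w, c), n in counts.items()
    }
```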
Related Work | The idea presented in this paper is similar in spirit to bilingual lexicon induction (BLI), where a seed lexicon in two different languages is expanded with the help of monolingual corpora, primarily by extracting distributional similarities from the data using word context. |
Related Work | Paraphrases extracted by “pivoting” via a third language (Callison-Burch et al., 2006) can be derived solely from monolingual corpora using distributional similarity (Marton et al., 2009). |
Background | Distributional similarity between pairs of words is converted into weighted inference rules that are added to the logical representation, and Markov Logic Networks are used to perform probabilistic logical inference. |
Introduction | deep representation of sentence meaning, expressed in first-order logic, to capture sentence structure, but combine it with distributional similarity ratings at the word and phrase level. |
Introduction | This approach is interesting in that it uses a very deep and precise representation of meaning, which can then be relaxed in a controlled fashion using distributional similarity . |
PSL for STS | where vs_sim is a similarity function that calculates the distributional similarity score between the two lexical predicates. |
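A plausible stand-in for the `vs_sim` function above is cosine similarity between the distributional vectors of the two lexical predicates; the actual function used in that system is not specified here, and the vectors below are invented.

```python
# Hedged sketch of a distributional similarity score between two
# lexical predicates: cosine similarity of their context vectors.
# Whether the original vs_sim is cosine is an assumption.
import math

def vs_sim(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0
```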
Introduction and related work | In this paper, we propose a novel unsupervised approach that compares the major senses of a MWE and its semantic head using distributional similarity measures to test the compositionality of the MWE. |
Proposed approach | We used two techniques to measure the distributional similarity of major uses of the MWE and its semantic head, both based on the Jaccard coefficient (J).
Proposed approach | Given the major uses of a MWE and its semantic head, the MWE is considered compositional when the corresponding distributional similarity measure (Jc or Jn) value is above a parameter threshold, sim.
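The thresholded Jaccard decision described above can be sketched as follows. The context sets and the threshold value are illustrative assumptions, not the paper's actual data or tuned parameter.

```python
# Sketch of a Jaccard-based compositionality test: compare the context
# sets of an MWE and its semantic head; call the MWE compositional when
# the Jaccard coefficient exceeds a threshold `sim` (value assumed).

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def is_compositional(mwe_contexts, head_contexts, sim=0.5):
    return jaccard(mwe_contexts, head_contexts) > sim
```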
Unsupervised parameter tuning | The best performing distributional similarity measure is an. |
Experiments: predicting relevance in context | For each pair neighboura/neighbourb, we computed a set of features from Wikipedia (the corpus used to derive the distributional similarity): We first computed the frequencies of each item in the corpus, freqa and freqb, from which we derive
Introduction | They are not suitable for the evaluation of the whole range of semantic relatedness that is exhibited by distributional similarities, which exceeds the limits of classical lexical relations, even though researchers have tried to collect equivalent resources manually, to be used as a gold standard (Weeds, 2003; Bordag, 2008; Anguiano et al., 2011).
Introduction | One advantage of distributional similarities is to exhibit a lot of different semantic relations, not necessarily standard lexical relations. |
Background | Most methods utilized the distributional similarity hypothesis that states that semantically similar predicates occur with similar arguments (Lin and Pantel, 2001; Szpektor et al., 2004; Yates and Etzioni, 2009; Schoenmackers et al., 2010). |
Background | For every pair of predicates i, j, an entailment score wij was learned by training a classifier over distributional similarity features. |
Experiments and Results | The data set also contains, for every pair of predicates i, j in every graph, a local score sij, which is the output of a classifier trained over distributional similarity features.
Background and Related Work | The distributional similarity scores of the nearest neighbours are associated with the respective target word senses using a WordNet similarity measure, such as those proposed by Jiang and Conrath (1997) and Banerjee and Pedersen (2002).
Background and Related Work | The word senses are ranked based on these similarity scores, and the most frequent sense is selected for the corpus that the distributional similarity thesaurus was trained over. |
WordNet Experiments | It is important to bear in mind that MKWC in these experiments makes use of full-text parsing in calculating the distributional similarity thesaurus, and the WordNet graph structure in calculating the similarity between associated words and different senses. |
Background | For computing distributional similarity, each word is represented as a semantic vector composed of the pointwise mutual information (PMI) values with its contexts.
Introduction | Besides, distributional similarity methods (Kotlerman et al., 2010; Lenci and Benotto, 2012) are based on the assumption that a term can only be used in contexts where its hypernyms can be used and that a term might be used in any contexts where its hyponyms are used.
Related Work | (2010) and Lenci and Benotto (2012), other researchers also propose directional distributional similarity methods (Weeds et al., 2004; Geffet and Dagan, 2005; Bhagat et al., 2007; Szpektor et al., 2007; Clarke, 2009). |
Set Expansion | We consider three similarity data sources: the Moby thesaurus, WordNet (Fellbaum, 1998), and distributional similarity based on a large corpus of text (Lin, 1998).
Set Expansion | Distributional similarity.
Set Expansion | Second, the data sources used: each source separately (M for Moby, W for WordNet, D for distributional similarity), and all three in combination (MWD).
Abstract | We study the global topology of the syntactic and semantic distributional similarity networks for English through the technique of spectral analysis. |
Introduction | An alternative, but equally popular, visualization of distributional similarity is through graphs or networks, where each word is represented as a node and weighted edges indicate the extent of distributional similarity between them.
Introduction | intriguing question, whereby we construct the syntactic and semantic distributional similarity network (DSN) and analyze their spectrum to understand their global topology. |
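A tiny illustration of spectral analysis on such a network: power iteration approximates the largest eigenvalue of the adjacency matrix, one of the quantities a spectral study of global topology examines. The three-node graph is a toy assumption, not the DSNs studied above.

```python
# Power iteration for the dominant adjacency eigenvalue of a small
# undirected similarity network. The graph is a toy assumption.

def largest_eigenvalue(adj, iters=200):
    """adj: square adjacency matrix as nested lists of weights."""
    n = len(adj)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w)   # infinity-norm estimate
        v = [x / lam for x in w]
    return lam
```

For the complete graph on three nodes the dominant eigenvalue is 2, which this sketch recovers.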