Lexical Inference over Multi-Word Predicates: A Distributional Approach

Representing predicates in terms of their argument distribution is common practice in NLP.

Multi-word expressions (MWEs) constitute a large part of the lexicon and account for much of its growth (Jackendoff, 2002; Seaton and Macaulay, 2002).

Inference Relations.

This section details our approach for distributionally representing MWPs by leveraging their

Corpora and Preprocessing.

Table 2 presents the results of our experiments.

Relation to CDS.

We have presented a novel approach to the distributional representation of multi-word predicates.

Appears in 25 sentences as: Feature Set (1) feature set (17) feature sets (9)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- (2012) and compare our methods with analogous ones that select a fixed LC, using state-of-the-art feature sets. (Page 2, “Introduction”)
- Section 3.1 describes our general approach, Section 3.2 presents our model and Section 3.3 details the feature set. (Page 3, “Our Proposal: A Latent LC Approach”)
- We choose this model for its generality, conceptual simplicity, and because it allows various feature sets and sets of latent variables to be incorporated easily. (Page 4, “Our Proposal: A Latent LC Approach”)
- 3.3 Feature Set (Page 4, “Our Proposal: A Latent LC Approach”)
- We intentionally select a feature set that relies on either completely unsupervised or shallow processing tools that are available for a wide variety of languages and domains. (Page 4, “Our Proposal: A Latent LC Approach”)
- Within the LHS feature set, we distinguish between two subtypes of features: word features that encode the individual properties of $h_L^1$ and $h_L^2$ (table’s upper middle part), and pair features that only apply to LCs of size 2 and reflect the relation between $h_L^1$ and $h_L^2$ (table’s lower middle part). (Page 5, “Our Proposal: A Latent LC Approach”)
- The feature set is composed of four analogous sets corresponding to $h_L^1$, $h_L^2$, $h_R^1$ and $h_R^2$, as well as two sets of features that capture relations between $h_L^1$, $h_L^2$ and $h_R^1$, $h_R^2$ (in cases where $h$ is of size 2). (Page 5, “Our Proposal: A Latent LC Approach”)
- We conjoin some of the feature sets by multiplying their values. (Page 6, “Our Proposal: A Latent LC Approach”)
- Instead, we use several baselines to demonstrate the usefulness of integrating multiple LCs, as well as the relative usefulness of our feature sets. (Page 7, “Experimental Setup”)
- The other evaluated systems are formed by taking various subsets of our feature set. (Page 7, “Experimental Setup”)
- We experiment with 4 feature sets. (Page 7, “Experimental Setup”)

Appears in 11 sentences as: distributional representation (7) distributional representations (5)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- We propose a novel approach that integrates the distributional representation of multiple subsets of the MWP’s words. (Page 1, “Abstract”)
- This heterogeneity of “take” is likely to have a negative effect on downstream systems that use its distributional representation. (Page 1, “Introduction”)
- For instance, while “take” and “accept” are often considered lexically similar, the high frequency with which “take” participates in non-compositional MWPs is likely to push the two verbs’ distributional representations apart. (Page 1, “Introduction”)
- This approach allows the classifier that uses the distributional representations to take into account the most relevant LCs in order to make the prediction. (Page 2, “Introduction”)
- While previous work focused either on improving the quality of the distributional representations themselves or on their incorporation into more elaborate systems, we focus on the integration of the distributional representation of multiple LCs to improve the identification of inference relations between MWPs. (Page 3, “Background and Related Work”)
- Much work in recent years has concentrated on the relation between the distributional representations of composite phrases and the representations of their component subparts (Widdows, 2008; Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Coecke et al., 2010). (Page 3, “Background and Related Work”)
- Despite significant advances, previous work has mostly been concerned with highly compositional cases and does not address the distributional representation of predicates of varying degrees of compositionality. (Page 3, “Background and Related Work”)
- We propose a method for addressing MWPs of varying degrees of compositionality through the integration of the distributional representation of multiple subsets of the predicate’s words (LCs). (Page 3, “Our Proposal: A Latent LC Approach”)
- Much recent work subsumed under the title Compositional Distributional Semantics addressed the distributional representation of multi-word phrases (see Section 2). (Page 8, “Discussion”)
- A standard approach in CDS is to compose distributional representations by taking their vector sum $v_L = v_1 + v_2 + \dots + v_n$ and $v_R = v'_1 + \dots + v'_m$ (Mitchell and Lapata, 2010). (Page 8, “Discussion”)
- We have presented a novel approach to the distributional representation of multi-word predicates. (Page 9, “Conclusion”)

Appears in 8 sentences as: POS tag (4) POS Tagger (1) POS tagger (2) POS tags (2)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- We use a POS tagger to identify content words. (Page 4, “Our Proposal: A Latent LC Approach”)
- In addition, we use POS-based features that encode the most frequent POS tag for the word lemma and the second most frequent POS tag (according to R). (Page 5, “Our Proposal: A Latent LC Approach”)
- Information about the second most frequent POS tag can be important in identifying light verb constructions, such as “take a swim” or “give a smile”, where the object is derived from a verb. (Page 5, “Our Proposal: A Latent LC Approach”)
- Relations were extracted using regular expressions over the output of a POS tagger and an NP chunker. (Page 6, “Experimental Setup”)
- We use a Maximum Entropy POS Tagger, trained on the Penn Treebank, and the WordNet lemmatizer, both implemented within the NLTK package (Loper and Bird, 2002). (Page 7, “Experimental Setup”)
- To obtain a coarse-grained set of POS tags, we collapse the tag set to 7 categories: nouns, verbs, adjectives, adverbs, prepositions, the word “to” and a category that includes all other words. (Page 7, “Experimental Setup”)
- Function words are defined according to their POS tags and include determiners, possessive pronouns, existential “there”, numbers and coordinating conjunctions. (Page 7, “Experimental Setup”)
- The next feature set BASIC includes the features found to be most useful during the development of the model: the most frequent POS tag, the frequency features and the feature Common. (Page 7, “Experimental Setup”)
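
The tag collapsing and the function-word definition quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes Penn Treebank tag names, and the helper names `coarse_pos` and `is_function_word` are hypothetical. How borderline tags (for example modals or wh-adverbs) were treated is not stated in the excerpts, so here they simply fall into the catch-all category.

```python
# Hypothetical mapping from Penn Treebank tags to the 7 coarse categories
# listed above: noun, verb, adjective, adverb, preposition, "to", other.
def coarse_pos(tag: str) -> str:
    if tag == "TO":
        return "to"
    if tag == "IN":
        return "preposition"
    if tag.startswith("NN"):   # NN, NNS, NNP, NNPS
        return "noun"
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "verb"
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return "adjective"
    if tag.startswith("RB"):   # RB, RBR, RBS
        return "adverb"
    return "other"             # assumption: everything else is "other"


# Function words by tag: determiners, possessive pronouns, existential
# "there", numbers and coordinating conjunctions (Penn Treebank names).
FUNCTION_WORD_TAGS = {"DT", "PRP$", "EX", "CD", "CC"}


def is_function_word(tag: str) -> bool:
    return tag in FUNCTION_WORD_TAGS


# Toy, hand-tagged example (no tagger is run here).
tagged = [("take", "VB"), ("a", "DT"), ("swim", "NN")]
coarse = [(word, coarse_pos(tag)) for word, tag in tagged]
```

In this toy input, “a” maps to the catch-all category and is also flagged as a function word, while “take” and “swim” map to verb and noun.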

Appears in 7 sentences as: content word (1) content words (7)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- In our experiments we attempt to keep the approach maximally general, and define $H_p$ to be the set of all subsets of size 1 or 2 of content words in $W_p$. (Page 4, “Our Proposal: A Latent LC Approach”)
- We use a POS tagger to identify content words. (Page 4, “Our Proposal: A Latent LC Approach”)
- Prepositions are considered content words under this definition. (Page 4, “Our Proposal: A Latent LC Approach”)
- number of content words in p, and as the number of content words is usually small, inference can be carried out by directly summing over H (z). (Page 4, “Our Proposal: A Latent LC Approach”)
- Instead, we initialize our model with a simplified convex model that fixes the LCs to be the pair of leftmost content words comprising each of the predicates. (Page 4, “Our Proposal: A Latent LC Approach”)
- A Reverb argument is represented as the conjunction of its content words that appear more than 10 times in the corpus. (Page 7, “Experimental Setup”)
- The first, LEFTMOST, selects the leftmost content word for each predicate. (Page 7, “Experimental Setup”)
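
The candidate set described above (all subsets of size 1 or 2 of a predicate's content words) can be enumerated directly. The sketch below is illustrative only: the function names are hypothetical, and it assumes the content words have already been identified and are distinct.

```python
from itertools import combinations


def candidate_lcs(content_words):
    """All subsets of size 1 or 2 of the content words, as tuples."""
    words = list(content_words)
    singles = [(w,) for w in words]
    pairs = list(combinations(words, 2))
    return singles + pairs


def leftmost(content_words):
    """The LEFTMOST-style choice: only the first content word."""
    return (list(content_words)[0],)


# Prepositions count as content words under the definition quoted above.
lcs = candidate_lcs(["take", "into", "account"])
```

For n content words this yields n + n(n-1)/2 candidates, which grows quadratically and stays small for short predicates, so summing over all of them is cheap.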

Appears in 7 sentences as: Distributional Similarity (1) distributional similarity (6)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- Most work on this task uses distributional similarity, either as its main component (Szpektor and Dagan, 2008; Melamud et al., 2013b), or as part of a more comprehensive system (Berant et al., 2011; Lewis and Steedman, 2013). (Page 2, “Introduction”)
- Most approaches to the task used distributional similarity as a major component within their system. (Page 2, “Background and Related Work”)
- (2006) presented a system for learning inference rules between nouns, using distributional similarity and pattern-based features. (Page 2, “Background and Related Work”)
- (2011) used distributional similarity between predicates to weight the edges of an entailment graph. (Page 2, “Background and Related Work”)
- Distributional Similarity Features. (Page 5, “Our Proposal: A Latent LC Approach”)
- The distributional similarity features are based on the DIRT system (Lin and Pantel, 2001). (Page 5, “Our Proposal: A Latent LC Approach”)
- The distributional similarity between $p_L$ and $p_R$ under this model is $\mathrm{Sim}(p_L, p_R) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{sim}(w_i, w'_j)$, where $\mathrm{sim}(w_i, w'_j)$ is the dot product between $v_i$ and $v'_j$. (Page 8, “Discussion”)
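
The last statement rests on a simple identity: the dot product of two summed vectors equals the sum of the dot products of all word pairs. The following sketch checks this on small integer vectors. The vectors and function names are made up for illustration and have no connection to the paper's data.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def vector_sum(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def additive_similarity(left, right):
    """Compose each side by vector addition, then take the dot product."""
    return dot(vector_sum(left), vector_sum(right))


def pairwise_similarity(left, right):
    """Sum of dot products over every (left word, right word) pair."""
    return sum(dot(u, v) for u in left for v in right)


# Toy word vectors for a two-word and a three-word predicate.
left_words = [[1, 0, 2], [0, 3, 1]]
right_words = [[2, 1, 0], [1, 1, 1], [0, 0, 4]]

composed = additive_similarity(left_words, right_words)
pairwise = pairwise_similarity(left_words, right_words)
```

Both quantities come out equal, which is why additive composition with an unnormalized dot product gives every word pair the same weight.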

Appears in 6 sentences as: Distributional Semantics (2) distributional semantics (4)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- To our knowledge, this is the first work to address lexical relations between MWPs of varying degrees of compositionality within distributional semantics. (Page 1, “Abstract”)
- This work addresses the modelling of MWPs within the context of distributional semantics (Turney and Pantel, 2010), in which predicates are represented through the distribution of arguments they may take. (Page 1, “Introduction”)
- To our knowledge, this is the first work to address lexical relations between MWPs of varying degrees of compositionality within distributional semantics. (Page 2, “Introduction”)
- Compositional Distributional Semantics. (Page 3, “Background and Related Work”)
- Several works have used compositional distributional semantics (CDS) representations to assess the compositionality of MWEs, such as noun compounds (Reddy et al., 2011) or verb-noun combinations (Kiela and Clark, 2013). (Page 3, “Background and Related Work”)
- Much recent work subsumed under the title Compositional Distributional Semantics addressed the distributional representation of multi-word phrases (see Section 2). (Page 8, “Discussion”)

Appears in 6 sentences as: latent variable (2) latent variables (4)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- We present a novel approach to the task that models the selection and relative weighting of the predicate’s LCs using latent variables. (Page 2, “Introduction”)
- Our work proposes a uniform treatment of MWPs of varying degrees of compositionality, and avoids defining MWPs explicitly by modelling their LCs as latent variables. (Page 3, “Background and Related Work”)
- We address the task with a latent variable log-linear model, representing the LCs of the predicates. (Page 4, “Our Proposal: A Latent LC Approach”)
- We choose this model for its generality, conceptual simplicity, and because it allows various feature sets and sets of latent variables to be incorporated easily. (Page 4, “Our Proposal: A Latent LC Approach”)
- The introduction of latent variables into the log-linear model leads to a non-convex objective function. (Page 4, “Our Proposal: A Latent LC Approach”)
- The optimal w is then taken as an initialization point for the latent variable model. (Page 4, “Our Proposal: A Latent LC Approach”)
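
To make the role of the latent variables concrete, the sketch below shows a generic latent-variable log-linear classifier, in which the probability of a label sums exponentiated scores over all latent choices. It is an illustration under assumptions rather than the paper's exact model: the feature function, the toy similarity table, and all names are hypothetical. Restricting the latent set to a single fixed choice reduces it to an ordinary log-linear (logistic) model, which matches the convex initialization described above.

```python
import math


def score(weights, features):
    """Linear score w . f for a sparse feature dictionary."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def label_probabilities(weights, feature_fn, latent_choices, labels=(0, 1)):
    """P(y) proportional to the sum over latent h of exp(w . f(h, y))."""
    totals = {
        y: sum(math.exp(score(weights, feature_fn(h, y))) for h in latent_choices)
        for y in labels
    }
    normalizer = sum(totals.values())
    return {y: total / normalizer for y, total in totals.items()}


# Toy setup (assumption): each latent choice is one LC with a precomputed
# similarity value, and features fire only for the positive label.
TOY_SIMILARITY = {("take",): 0.1, ("take", "account"): 0.9}


def toy_features(h, y):
    return {"cosine": TOY_SIMILARITY[h]} if y == 1 else {}


weights = {"cosine": 2.0}
# Marginalizing over all LCs (the latent-variable model).
marginal = label_probabilities(weights, toy_features, list(TOY_SIMILARITY))
# Fixing the LC in advance (an ordinary log-linear model).
fixed = label_probabilities(weights, toy_features, [("take",)])
```

With the LC fixed to the uninformative single word, the positive label gets a lower probability than when the model can also draw on the more similar two-word LC.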

Appears in 5 sentences as: LDA (5)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- We further incorporate features based on a Latent Dirichlet Allocation (LDA) topic model (Blei et al., 2003). (Page 6, “Our Proposal: A Latent LC Approach”)
- We populate the pseudo-documents of an LC with its arguments according to R. We then train an LDA model with 25 topics over these documents. (Page 6, “Our Proposal: A Latent LC Approach”)
- To compute the LDA features, we use the online variational Bayes algorithm of Hoffman et al. (2010) as implemented in the Gensim software package (Rehurek and Sojka, 2010). (Page 7, “Experimental Setup”)
- More inclusive is the feature set NO-LDA, which includes all features except the LDA features. (Page 7, “Experimental Setup”)
- Experiments with this set were performed in order to isolate the effect of the LDA features. (Page 7, “Experimental Setup”)

Appears in 5 sentences as: Similar measures (1) similarity measure (3) similarity measures (2)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- where sim is some vector similarity measure. (Page 5, “Our Proposal: A Latent LC Approach”)
- We use two common similarity measures: the vector cosine metric, and the BInc (Szpektor and Dagan, 2008) similarity measure. (Page 5, “Our Proposal: A Latent LC Approach”)
- To do so, we use point-wise mutual information, and the conditional probabilities $P(h_L^1 \mid h_L^2)$ and $P(h_L^2 \mid h_L^1)$. Similar measures have often been used for the unsupervised detection of MWEs (Villavicencio et al., 2007; Fazly and Stevenson, 2006). (Page 6, “Our Proposal: A Latent LC Approach”)
- One of the most effective similarity measures is the cosine similarity, which is a normalized dot product. (Page 8, “Discussion”)
- In order to appreciate the effect of these advantages, we perform an experiment that takes H to be the set of all LCs of size 1, and uses a single similarity measure. (Page 9, “Discussion”)
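
The contrast between a symmetric and a directional similarity measure can be illustrated on sparse argument-count vectors. The sketch below implements cosine similarity and, as a stand-in for a directional measure, a simple weighted coverage score. The coverage score is an assumption chosen for brevity and is not the BInc measure itself. The argument counts are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts of weights)."""
    shared = set(u) & set(v)
    numerator = sum(u[f] * v[f] for f in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    denominator = norm_u * norm_v
    return numerator / denominator if denominator else 0.0


def coverage(u, v):
    """Share of u's total weight on features that also occur with v.

    Directional stand-in (not BInc): coverage(u, v) != coverage(v, u).
    """
    total = sum(u.values())
    covered = sum(weight for feature, weight in u.items() if feature in v)
    return covered / total if total else 0.0


# Invented argument counts for a narrow and a broad predicate.
narrow = {"medicine": 2.0, "pill": 1.0}
broad = {"medicine": 2.0, "pill": 1.0, "bus": 3.0, "photo": 2.0}

symmetric = (cosine(narrow, broad), cosine(broad, narrow))
directional = (coverage(narrow, broad), coverage(broad, narrow))
```

Cosine returns the same value in both directions, whereas the coverage score is high from the narrow predicate to the broad one and low in the reverse direction, which is the kind of asymmetry a directional inference relation needs.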

Appears in 4 sentences as: log-linear (4)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- We address the task with a latent variable log-linear model, representing the LCs of the predicates. (Page 4, “Our Proposal: A Latent LC Approach”)
- The introduction of latent variables into the log-linear model leads to a non-convex objective function. (Page 4, “Our Proposal: A Latent LC Approach”)
- Once h has been fixed, the model collapses to a convex log-linear model. (Page 4, “Our Proposal: A Latent LC Approach”)
- Determining h for each predicate yields a regular log-linear binary classification model. (Page 7, “Experimental Setup”)

Appears in 3 sentences as: cosine similarity (3)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- These measures give complementary perspectives on the similarity between the predicates, as the cosine similarity is symmetric between the LHS and RHS predicates, while BInc takes into account the directionality of the inference relation. (Page 5, “Our Proposal: A Latent LC Approach”)
- One of the most effective similarity measures is the cosine similarity, which is a normalized dot product. (Page 8, “Discussion”)
- Indeed, taking $H_p$ as above, and cosine similarity as the only feature (i.e., $w \in \mathbb{R}$), yields the distribution (Page 8, “Discussion”)

Appears in 3 sentences as: log-linear model (3)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- We address the task with a latent variable log-linear model, representing the LCs of the predicates. (Page 4, “Our Proposal: A Latent LC Approach”)
- The introduction of latent variables into the log-linear model leads to a non-convex objective function. (Page 4, “Our Proposal: A Latent LC Approach”)
- Once h has been fixed, the model collapses to a convex log-linear model. (Page 4, “Our Proposal: A Latent LC Approach”)

Appears in 3 sentences as: topic model (1) topic models (2)

In *Lexical Inference over Multi-Word Predicates: A Distributional Approach*

- (2013a) used topic models to combine type-level predicate inference rules with token-level information from their arguments in a specific context. (Page 2, “Background and Related Work”)
- We further incorporate features based on a Latent Dirichlet Allocation (LDA) topic model (Blei et al., 2003). (Page 6, “Our Proposal: A Latent LC Approach”)
- Several recent works have underscored the usefulness of using topic models to model a predicate’s selectional preferences (Ritter et al., 2010; Dinu and Lapata, 2010; Ó Séaghdha, 2010; Lewis and Steedman, 2013; Melamud et al., 2013a). (Page 6, “Our Proposal: A Latent LC Approach”)
