A Two Level Model for Context Sensitive Inference Rules
Oren Melamud, Jonathan Berant, Ido Dagan, Jacob Goldberger and Idan Szpektor

Article Structure

Abstract

Automatic acquisition of inference rules for predicates has been commonly addressed by computing distributional similarity between vectors of argument words, operating at the word space level.

Introduction

Inference rules for predicates have been identified as an important component in semantic applications, such as Question Answering (QA) (Ravichandran and Hovy, 2002) and Information Extraction (IE) (Shinyama and Sekine, 2006).

Background and Model Setting

This section presents components of prior work which are included in our model and experiments, setting the technical preliminaries for the rest of the paper.

Two-level Context-sensitive Inference

Our model follows the general DIRT scheme while extending it to handle context-sensitive scoring of rule applications, addressing the scenario dealt with by context-sensitive topic models.

Experimental Settings

To evaluate our model, we compare it both to context-insensitive similarity measures and to prior context-sensitive methods.

Results

We evaluated the performance of each tested method by measuring the Mean Average Precision (MAP) (Manning et al., 2008) of the rule application ranking computed by that method.

Discussion and Future Work

This paper addressed the problem of computing context-sensitive reliability scores for predicate inference rules.

Topics

similarity measure

Appears in 30 sentences as: similarity measure (18), similarity measure: (1), similarity measures (12)
In A Two Level Model for Context Sensitive Inference Rules
  1. Our scheme can be applied on top of any context-insensitive “base” similarity measure for rule learning, which operates at the word level, such as Cosine or Lin (Lin, 1998).
    Page 2, “Introduction”
  2. We apply our two-level scheme over three state-of-the-art context-insensitive similarity measures.
    Page 2, “Introduction”
  3. …where sim(v, v’) is a vector similarity measure.
    Page 2, “Background and Model Setting”
  4. We note that the general DIRT scheme may be used while employing other “base” vector similarity measures.
    Page 3, “Background and Model Setting”
  5. This issue has been addressed in a separate line of research which introduced directional similarity measures suitable for inference relations (Bhagat et al., 2007; Szpektor and Dagan, 2008; Kotlerman et al., 2010).
    Page 3, “Background and Model Setting”
  6. In our experiments we apply our proposed context-sensitive similarity scheme over three different base similarity measures.
    Page 3, “Background and Model Setting”
  7. …where sim(d, d’, w) is a topic-distribution similarity measure conditioned on a given context word.
    Page 4, “Background and Model Setting”
  8. (2010) utilized the dot product form for their similarity measure (see the reconstruction after this list).
    Page 4, “Background and Model Setting”
  9. Dinu and Lapata (2010b) presented a slightly different similarity measure for topic distributions that performed better in their setting as well as in a related later paper on context-sensitive scoring of lexical similarity (Dinu and Lapata, 2010a).
    Page 4, “Background and Model Setting”
  10. They also experimented with a few variants of the similarity measure's structure and found that the best results are obtained with the dot product form.
    Page 4, “Background and Model Setting”
  11. In our experiments, we employ these two similarity measures for topic distributions as baselines representing topic-level models.
    Page 4, “Background and Model Setting”
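
Items 7–10 describe the two topic-level similarity measures only in prose; what follows is our reconstruction of their form from those descriptions (the paper's exact notation may differ), using the posterior p(t|d, w) defined under "LDA" below and summing over the K topics:

    dot product form (item 8; both rule sides conditioned on the context word w):
      sim(d, d’, w) = Σ_t p(t|d, w) · p(t|d’, w)

    Dinu and Lapata (2010b) variant (items 9–10; see also "topic distribution" below):
      sim(d, d’, w) = Σ_t p(t|d, w) · p(t|d’)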


LDA

Appears in 25 sentences as: LDA (27)
In A Two Level Model for Context Sensitive Inference Rules
  1. …Latent Dirichlet Allocation (LDA) model.
    Page 2, “Introduction”
  2. Rather than computing a single context-insensitive rule score, we compute a distinct word-level similarity score for each topic in an LDA model.
    Page 2, “Introduction”
  3. Several more recent works utilize a Latent Dirichlet Allocation (LDA) (Blei et al., 2003) framework.
    Page 3, “Background and Model Setting”
  4. We note that a similar LDA model construction was also employed in (Ó Séaghdha, 2010), for estimating predicate-argument likelihood.
    Page 3, “Background and Model Setting”
  5. First, an LDA model is constructed, as follows.
    Page 3, “Background and Model Setting”
  6. Next, an LDA model is learned from the set of all pseudo-documents, extracted for all predicates.[2] The learning process results in the construction of K latent topics, where each topic t specifies a distribution over all words, denoted by p(w|t), and a topic distribution for each pseudo-document d, denoted by p(t|d).
    Page 3, “Background and Model Setting”
  7. Within the LDA model we can derive the a-posteriori topic distribution conditioned on a particular word within a document, denoted by p(t|d, w) ∝ p(w|t) · p(t|d) (see the sketch after this list).
    Page 3, “Background and Model Setting”
  8. [2] We note that there are variants in the type of LDA model and the way the pseudo-documents are constructed in the referenced prior work.
    Page 3, “Background and Model Setting”
  9. In order to focus on the inference methods rather than on the underlying LDA model, we use the LDA framework described in this paper for all compared methods.
    Page 3, “Background and Model Setting”
  10. Based on all pseudo-documents we learn an LDA model and obtain its associated probability distributions.
    Page 4, “Two-level Context-sensitive Inference”
  11. At learning time, we compute for each candidate rule a separate, topic-biased similarity score for each of the topics in the LDA model.
    Page 4, “Two-level Context-sensitive Inference”
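
As a concrete illustration of item 7, here is a minimal numpy sketch (ours, not the authors' code; the array names and shapes are assumptions) of deriving the a-posteriori topic distribution p(t|d, w) from a learned LDA model's p(w|t) and p(t|d) tables:

    import numpy as np

    def posterior_topic_dist(p_w_given_t, p_t_given_d, word_id):
        # p(t|d, w) ∝ p(w|t) · p(t|d), renormalized over the K topics.
        #   p_w_given_t: (K, V) matrix of per-topic word distributions p(w|t)
        #   p_t_given_d: (K,) topic distribution p(t|d) of pseudo-document d
        #   word_id:     lexicon index of the context word w
        unnorm = p_w_given_t[:, word_id] * p_t_given_d
        return unnorm / unnorm.sum()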


similarity score

Appears in 15 sentences as: similarities scores (1), similarity score (7), similarity scores (7)
In A Two Level Model for Context Sensitive Inference Rules
  1. Rather than computing a single context-insensitive rule score, we compute a distinct word-level similarity score for each topic in an LDA model.
    Page 2, “Introduction”
  2. At learning time, we compute for each candidate rule a separate, topic-biased similarity score for each of the topics in the LDA model.
    Page 4, “Two-level Context-sensitive Inference”
  3. Then, at rule application time, we compute an overall reliability score for the rule by combining the per-topic similarity scores, while biasing the score combination according to the given context word w.
    Page 4, “Two-level Context-sensitive Inference”
  4. Given the base similarity measure sim(v, v’), we compute a topic-biased similarity score for each LDA topic t, denoted by sim_t(v, v’); sim_t(v, v’) is computed by applying the base measure to topic-biased versions of the two argument vectors (see the sketch after this list).
    Page 4, “Two-level Context-sensitive Inference”
  5. This learning process results in K different topic-biased similarity scores for each candidate rule, where K is the number of LDA topics.
    Page 5, “Two-level Context-sensitive Inference”
  6. When applying an inference rule, we compute for each slot its context-sensitive similarity score sim_WT(v, v’, w), where v and v’ are the slot’s argument vectors for the two rule sides and w is the word instantiating the slot in the given rule application.
    Page 5, “Two-level Context-sensitive Inference”
  7. This score is computed as a weighted average of the rule’s K topic-biased similarity scores sim_t.
    Page 5, “Two-level Context-sensitive Inference”
  8. Table 1: Two characteristic topics for the Y slot of ‘acquire’, along with their topic-biased Lin similarity scores Lin_t, compared with the original Lin similarity, for two rules.
    Page 5, “Two-level Context-sensitive Inference”
  9. Table 2 illustrates the calculation of context-sensitive similarity scores in four rule applications, involving the Y slot of the predicate ‘acquire’.
    Page 5, “Two-level Context-sensitive Inference”
  10. The opposite behavior is observed for ‘acquire → purchase’, altogether demonstrating how our model successfully biases the similarity score according to rule validity in context.
    Page 5, “Two-level Context-sensitive Inference”
  11. Table 2: Context-sensitive similarity scores (in bold) for the Y slots of four rule applications.
    Page 6, “Experimental Settings”
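
Putting items 1–7 together, the following sketch shows our reading of the two-level computation. It is not the authors' code: it assumes Cosine as the base measure (the paper also uses Lin and BInc, which would slot in the same way), and all array names and shapes are our own.

    import numpy as np

    def cosine(u, v):
        # A context-insensitive "base" vector similarity measure.
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v) / denom if denom else 0.0

    def topic_biased_scores(v, v_prime, bias_v, bias_vp):
        # Learning time: one topic-biased score sim_t(v, v') per LDA topic t.
        # Each word-level entry v(w') is multiplied by p(t|d, w') for that
        # side's pseudo-document d (it biases, rather than replaces, v(w')).
        #   bias_v, bias_vp: (K, V) matrices of p(t|d_v, w') and p(t|d_v', w')
        return np.array([cosine(v * bias_v[t], v_prime * bias_vp[t])
                         for t in range(bias_v.shape[0])])

    def context_sensitive_score(sim_t, p_t_given_dv_w):
        # Rule-application time: weighted average of the K topic-biased scores,
        # weighted by the topic distribution inferred for the instantiating word w.
        return float(np.dot(sim_t, p_t_given_dv_w))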


word-level

Appears in 15 sentences as: word-level (16)
In A Two Level Model for Context Sensitive Inference Rules
  1. We propose a novel two-level model, which computes similarities between word-level vectors that are biased by topic-level context representations.
    Page 1, “Abstract”
  2. Evaluations on a naturally-distributed dataset show that our model significantly outperforms prior word-level and topic-level models.
    Page 1, “Abstract”
  3. To address this hypothesized caveat of prior context-sensitive rule scoring methods, we propose a novel generic scheme that integrates word-level and topic-level representations.
    Page 2, “Introduction”
  4. Rather than computing a single context-insensitive rule score, we compute a distinct word-level similarity score for each topic in an LDA model.
    Page 2, “Introduction”
  5. However, while DIRT computes sim(v, v’) over vectors in the original word-level space, topic-level models compute sim(d, d’, w) by measuring similarity of vectors in a reduced-dimensionality latent space.
    Page 4, “Background and Model Setting”
  6. …slots in the original word-level space while biasing the similarity measure through topic-level context models.
    Page 4, “Background and Model Setting”
  7. Thus, our model computes similarity over word-level (rather than topic-level) argument vectors, while biasing it according to the specific argument words in the given rule application context.
    Page 4, “Two-level Context-sensitive Inference”
  8. The core of our contribution is thus defining the context-sensitive word-level vector similarity measure sim(v, v’, w), as described in the remainder of this section.
    Page 4, “Two-level Context-sensitive Inference”
  9. This way, rather than replacing altogether the word-level values v(w) by the topic probabilities p(t|d_v, w), as done in the topic-level models, we use the latter to only bias the former while preserving fine-grained word-level representations (see the formulation after this list).
    Page 5, “Two-level Context-sensitive Inference”
  10. Specifically, topics are leveraged for high-level domain disambiguation, while fine-grained word-level distributional similarity is computed for each rule under each such domain.
    Page 8, “Results”
  11. This result more explicitly shows the advantages of integrating word-level and context-sensitive topic-level similarities for differentiating valid and invalid contexts for rule applications.
    Page 8, “Results”
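
In symbols (our reconstruction of item 9, using the posterior p(t|d_v, w) defined under "LDA" above): a topic-level model represents a rule side by the K-dimensional latent vector of topic probabilities, whereas the two-level model keeps the full word-level vector and merely reweights each entry:

    topic-level model:  d_v ↦ ( p(t|d_v, w) )_{t=1..K}     (reduced-dimensionality latent vector)
    two-level model:    v_t(w’) = v(w’) · p(t|d_v, w’)      (word-level vector, biased per topic t)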


distributional similarity

Appears in 10 sentences as: distributional similarity (10)
In A Two Level Model for Context Sensitive Inference Rules
  1. Automatic acquisition of inference rules for predicates has been commonly addressed by computing distributional similarity between vectors of argument words, operating at the word space level.
    Page 1, “Abstract”
  2. …learning, based on distributional similarity at the word level, and then context-sensitive scoring for rule applications, based on topic-level similarity.
    Page 2, “Background and Model Setting”
  3. The DIRT algorithm (Lin and Pantel, 2001) follows the distributional similarity paradigm to learn predicate inference rules.
    Page 2, “Background and Model Setting”
  4. On the other hand, the topic-biased similarity for t1 is substantially lower, since prominent words in this topic are likely to occur with ‘acquire’ but not with ‘learn’, yielding low distributional similarity.
    Page 5, “Two-level Context-sensitive Inference”
  5. Since our model can contextualize various distributional similarity measures, we evaluated the performance of all the above methods on several base similarity measures and their learned rule-sets.
    Page 6, “Experimental Settings”
  6. Whenever we evaluated a distributional similarity measure (namely Lin, BInc, or Cosine), we discarded instances from Zeichner et al.’s dataset in which the assessed rule is not in the context-insensitive rule-set learned for this measure or the argument instantiation of the rule is not in the LDA lexicon.
    Page 7, “Experimental Settings”
  7. Specifically, topics are leveraged for high-level domain disambiguation, while fine-grained word-level distributional similarity is computed for each rule under each such domain.
    Page 8, “Results”
  8. Indeed, on test-set_VC, in which context mismatches are rare, our algorithm is still better than the original measure, indicating that WT can be safely applied to distributional similarity measures without concerns of reduced performance in different context scenarios.
    Page 8, “Results”
  9. In particular, we proposed a novel scheme that applies over any base distributional similarity measure which operates at the word level and computes a single context-insensitive score for a rule.
    Page 8, “Discussion and Future Work”
  10. We therefore focused on comparing the performance of our two-level scheme with state-of-the-art prior topic-level and word-level models of distributional similarity , over a random sample of inference rule applications.
    Page 8, “Discussion and Future Work”


topic distribution

Appears in 10 sentences as: topic distribution (8), topic distributions (2)
In A Two Level Model for Context Sensitive Inference Rules
  1. Then, similarity is measured between the two topic distribution vectors corresponding to the two sides of the rule in the given context, yielding a context-sensitive score for each particular rule application.
    Page 2, “Introduction”
  2. Then, when applying a rule in a given context, these different scores are weighted together based on the specific topic distribution under the given context.
    Page 2, “Introduction”
  3. Next, an LDA model is learned from the set of all pseudo-documents, extracted for all predicates.[2] The learning process results in the construction of K latent topics, where each topic t specifies a distribution over all words, denoted by p(w|t), and a topic distribution for each pseudo-document d, denoted by p(t|d).
    Page 3, “Background and Model Setting”
  4. Within the LDA model we can derive the a-posteriori topic distribution conditioned on a particular word within a document, denoted by p(t|d, w) ∝ p(w|t) · p(t|d).
    Page 3, “Background and Model Setting”
  5. Dinu and Lapata (2010b) presented a slightly different similarity measure for topic distributions that performed better in their setting as well as in a related later paper on context-sensitive scoring of lexical similarity (Dinu and Lapata, 2010a).
    Page 4, “Background and Model Setting”
  6. In this measure, the topic distribution for the right-hand side of the rule is not conditioned on w (see the reconstruction under “similarity measure” above).
    Page 4, “Background and Model Setting”
  7. In our experiments, we employ these two similarity measures for topic distributions as baselines representing topic-level models.
    Page 4, “Background and Model Setting”
  8. Then, given a specific candidate rule application, the LDA model is used to infer the topic distribution relevant to the context specified by the given arguments.
    Page 8, “Discussion and Future Work”
  9. Finally, the context-sensitive rule application score is computed as a weighted average of the per-topic word-level similarity scores, which are weighted according to the inferred topic distribution (written out after this list).
    Page 8, “Discussion and Future Work”
  10. Finally, they train a classifier to translate a given target word based on these tables and the inferred topic distribution of the given document in which the target word appears.
    Page 9, “Discussion and Future Work”
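
Items 8–9 give the final combination in prose; written out (our reconstruction, with sim_t the per-topic word-level scores defined under "similarity score" above, and w the instantiating argument word):

    sim_WT(v, v’, w) = Σ_t p(t|d_v, w) · sim_t(v, v’)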


random sample

Appears in 4 sentences as: random sample (3), randomly sampling (1)
In A Two Level Model for Context Sensitive Inference Rules
  1. In order to promote replicability and equal-term comparison with our results, we based our experiments on publicly available datasets, both for unsupervised learning of the evaluated models and for testing them over a random sample of rule applications.
    Page 2, “Introduction”
  2. Rule applications were generated by randomly sampling extractions from ReVerb, such as (‘Jack’, ‘agree with’, ‘Jill’), and then sampling possible rules for each, such as ‘agree with → feel sorry for’ (sketched after this list).
    Page 7, “Experimental Settings”
  3. However, our result suggests that topic-level models might not be robust enough when applied to a random sample of inferences.
    Page 7, “Results”
  4. We therefore focused on comparing the performance of our two-level scheme with state-of-the-art prior topic-level and word-level models of distributional similarity, over a random sample of inference rule applications.
    Page 8, “Discussion and Future Work”
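
A hypothetical sketch of the sampling protocol described in item 2; the extraction list, rule table, and all names are illustrative only, not the dataset's actual generation code:

    import random

    # Illustrative data in the format of item 2's example.
    extractions = [("Jack", "agree with", "Jill")]                      # sampled from ReVerb
    candidate_rules = {"agree with": ["agree with -> feel sorry for"]}  # rules per predicate

    subj, pred, obj = random.choice(extractions)   # sample an extraction
    rule = random.choice(candidate_rules[pred])    # then sample a rule for it
    print(f"rule application: ({subj}, {pred}, {obj}) with rule '{rule}'")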


topic model

Appears in 4 sentences as: topic model (2), topic models (2)
In A Two Level Model for Context Sensitive Inference Rules
  1. This way, we calculate similarity over vectors in the original word space, while biasing them towards the given context via a topic model.
    Page 2, “Introduction”
  2. Our model follows the general DIRT scheme while extending it to handle context-sensitive scoring of rule applications, addressing the scenario dealt with by context-sensitive topic models.
    Page 4, “Two-level Context-sensitive Inference”
  3. While most works on context-insensitive predicate inference rules, such as DIRT (Lin and Pantel, 2001), are based on word-level similarity measures, almost all prior models addressing context-sensitive predicate inference rules are based on topic models (except for Pantel et al. (2007), which was outperformed by later models).
    Page 8, “Discussion and Future Work”
  4. In addition, Dinu and Lapata (2010a) adapted the predicate inference topic model from Dinu and Lapata (2010b) to compute lexical similarity in context.
    Page 9, “Discussion and Future Work”


statistically significant

Appears in 3 sentences as: statistical significance (1), statistically significant (2)
In A Two Level Model for Context Sensitive Inference Rules
  1. …to compute MAP values and corresponding statistical significance, we randomly split each test set into 30 subsets.
    Page 7, “Results”
  2. This improvement is statistically significant at p < 0.01 for BInc and Lin, and at p < 0.015 for Cosine, using a paired t-test (sketched after this list).
    Page 7, “Results”
  3. On test-set_IVC, where context mismatches are abundant, our model outperformed all other baselines (statistically significant at p < 0.01).
    Page 8, “Results”
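
A minimal sketch of the significance test described in items 1–2, assuming per-subset MAP scores for two methods over the 30 random subsets; the scores below are synthetic placeholders, not the paper's results:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    map_ours = rng.uniform(0.60, 0.70, size=30)   # per-subset MAP, our method (synthetic)
    map_base = rng.uniform(0.55, 0.65, size=30)   # per-subset MAP, a baseline (synthetic)

    t_stat, p_value = stats.ttest_rel(map_ours, map_base)  # paired t-test over the subsets
    print(f"MAP ours={map_ours.mean():.3f} base={map_base.mean():.3f}, p={p_value:.4f}")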
