Global Learning of Focused Entailment Graphs
Berant, Jonathan and Dagan, Ido and Goldberger, Jacob

Article Structure

Abstract

We propose a global algorithm for learning entailment relations between predicates.

Introduction

The Textual Entailment (TE) paradigm (Dagan et al., 2009) is a generic framework for applied semantic inference.

Background

Entailment learning Two information types have primarily been utilized to learn entailment rules between predicates: lexicographic resources and distributional similarity resources.

Entailment Graph

This section presents an entailment graph structure, which resembles the graph in (Szpektor and Dagan, 2009).

Motivating Application

In this section we propose an application that provides a hierarchical view of propositions extracted from a corpus, based on an entailment graph.

Learning Entailment Graph Edges

In this section we present an algorithm for learning the edges of an entailment graph given its set of nodes.

Experimental Evaluation

This section presents our evaluation, which is geared for the application proposed in Section 4.

Conclusion

This paper presented a global optimization algorithm for learning entailment relations between predicates represented as propositional templates.

Topics

distributional similarity

Appears in 15 sentences as: Distributional similarity (2) distributional similarity (13)
In Global Learning of Focused Entailment Graphs
  1. Entailment learning Two information types have primarily been utilized to learn entailment rules between predicates: lexicographic resources and distributional similarity resources.
    Page 1, “Background”
  2. Therefore, distributional similarity is used to learn broad-scale resources.
    Page 2, “Background”
  3. Distributional similarity algorithms predict a semantic relation between two predicates by comparing the arguments with which they occur.
    Page 2, “Background”
  4. Next, we represent each pair of propositional templates with a feature vector of various distributional similarity scores.
    Page 4, “Learning Entailment Graph Edges”
Distributional similarity representation We aim to train a classifier that for an input template pair (t1, t2) determines whether t1 entails t2.
    Page 4, “Learning Entailment Graph Edges”
  6. A template pair is represented by a feature vector where each coordinate is a different distributional similarity score.
    Page 4, “Learning Entailment Graph Edges”
  7. There are a myriad of distributional similarity algorithms.
    Page 4, “Learning Entailment Graph Edges”
  8. We then generate for any (t1, t2) features that are the 12 distributional similarity scores using all combinations of the dimensions.
    Page 4, “Learning Entailment Graph Edges”
  9. When computing distributional similarity scores, a template is represented as a feature vector of the CUIs that instantiate its arguments.
    Page 6, “Experimental Evaluation”
  10. Local algorithms We described 12 distributional similarity measures computed over our corpus (Section 5.1).
    Page 7, “Experimental Evaluation”
For each distributional similarity measure (altogether 16 measures), we learned a graph by inserting any edge (u, v), when u is in the top K templates most similar to v.
    Page 7, “Experimental Evaluation”
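
The sentences above describe how distributional similarity scores between templates are computed from the arguments they share. Below is a minimal sketch (not the authors' code) of a Lin-style similarity between two templates, each represented as a weighted feature vector over the arguments that instantiate it; the function name lin_similarity and the toy weights are illustrative assumptions.

    from collections import Counter

    def lin_similarity(feat_u, feat_v):
        """Lin-style distributional similarity between two predicate templates.

        feat_u, feat_v map argument features (e.g. the CUIs filling an argument
        slot) to association weights; the score is the weight of shared features
        divided by the total weight of both feature sets.
        """
        shared = set(feat_u) & set(feat_v)
        numerator = sum(feat_u[f] + feat_v[f] for f in shared)
        denominator = sum(feat_u.values()) + sum(feat_v.values())
        return numerator / denominator if denominator else 0.0

    # Toy templates represented by the (weighted) arguments observed with them.
    x_reduce_y = Counter({"headache": 2.1, "nausea": 1.3, "fever": 0.8})
    x_affect_y = Counter({"headache": 1.7, "fever": 1.1, "rash": 0.4})
    print(round(lin_similarity(x_reduce_y, x_affect_y), 3))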

WordNet

Appears in 15 sentences as: WordNet (17) WordNet’s (1)
In Global Learning of Focused Entailment Graphs
  1. WordNet (Fellbaum, 1998), by far the most widely used resource, specifies relations such as hyponymy, derivation, and entailment that can be used for semantic inference (Budanitsky and Hirst, 2006).
    Page 2, “Background”
  2. WordNet has also been exploited to automatically generate a training set for a hyponym classifier (Snow et al., 2005), and we make a similar use of WordNet in Section 5.1.
    Page 2, “Background”
Recently, Szpektor and Dagan (2009) presented the resource Argument-mapped WordNet, providing entailment relations for predicates in WordNet.
    Page 2, “Background”
  4. Their resource was built on top of WordNet, and makes simple use of WordNet’s global graph structure: new rules are suggested by transitively chaining graph edges, and verified against corpus statistics.
    Page 2, “Background”
Their algorithm incrementally adds hyponyms to an existing taxonomy (WordNet), using a greedy search algorithm that adds at each step the set of hyponyms that maximize the probability of the evidence while respecting the transitivity constraint.
    Page 2, “Background”
  6. The first step is preprocessing: We use a large corpus and WordNet to train an entailment classifier that estimates the likelihood that one propositional template entails another.
    Page 4, “Learning Entailment Graph Edges”
We describe a procedure for learning an entailment classifier, given a corpus and a lexicographic resource (WordNet).
    Page 4, “Learning Entailment Graph Edges”
  8. Last, we use WordNet to automatically generate a training set and train a classifier.
    Page 4, “Learning Entailment Graph Edges”
  9. (2005), WordNet is used to automatically generate a training set of positive (entailing) and negative (non-entailing) template pairs.
    Page 5, “Learning Entailment Graph Edges”
For each t ∈ T with two variables and a single predicate word w, we extract from WordNet the set H of direct hypernyms and synonyms of w.
    Page 5, “Learning Entailment Graph Edges”
large taxonomy (WordNet) and therefore utilize a greedy algorithm, while we simultaneously learn all edges of a rather small graph and employ integer linear programming, which is more sound theoretically, and as shown in Section 6, leads to an optimal solution.
    Page 6, “Learning Entailment Graph Edges”
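
The excerpts above describe how WordNet is used, in the spirit of Snow et al. (2005), to generate training pairs automatically: direct hypernyms and synonyms give positive (entailing) examples, co-hyponyms give negative ones. The sketch below approximates that procedure with NLTK's WordNet interface, restricted to verbs; using NLTK is an assumption, not something the paper specifies.

    # Requires the WordNet data: import nltk; nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    def training_pairs(predicate):
        """Rough approximation of the automatic training-set generation:
        positives pair the predicate with its synonyms and direct hypernyms,
        negatives pair it with co-hyponyms (siblings under a shared hypernym)."""
        positives, negatives = set(), set()
        for syn in wn.synsets(predicate, pos=wn.VERB):
            positives.update(l for l in syn.lemma_names() if l != predicate)
            for hyper in syn.hypernyms():
                positives.update(hyper.lemma_names())
                for sibling in hyper.hyponyms():
                    if sibling != syn:
                        negatives.update(sibling.lemma_names())
        return positives, negatives - positives

    pos, neg = training_pairs("reduce")
    print(sorted(pos)[:5], sorted(neg)[:5])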

Linear Program

Appears in 14 sentences as: Linear Program (5) linear program (1) Linear Programming (1) Linear programming (1) linear programming (3) linear programs (3)
In Global Learning of Focused Entailment Graphs
We define a graph structure over predicates that represents entailment relations as directed edges, and use a global transitivity constraint on the graph to learn the optimal set of edges, by formulating the optimization problem as an Integer Linear Program.
    Page 1, “Abstract”
  2. The optimization problem is formulated as an Integer Linear Program (ILP) and solved with an ILP solver.
    Page 1, “Introduction”
In this paper we tackle a similar problem of learning a transitive relation, but we use linear programming.
    Page 2, “Background”
  4. A Linear Program (LP) is an optimization problem, where a linear function is minimized (or maximized) under linear constraints.
    Page 2, “Background”
  5. variables are integers, the problem is termed an Integer Linear Program (ILP).
    Page 2, “Background”
  6. Linear programming has attracted attention recently in several fields of NLP, such as semantic role labeling, summarization and parsing (Roth and tau Yih, 2005; Clarke and Lapata, 2008; Martins et al., 2009).
    Page 2, “Background”
In this paper we formulate the entailment graph learning problem as an Integer Linear Program, and find that this leads to an optimal solution with respect to the target function in our experiment.
    Page 2, “Background”
  8. Since we seek a global solution under transitivity and other constraints, linear programming is a natural choice, enabling the use of state of the art optimization packages.
    Page 5, “Learning Entailment Graph Edges”
  9. We describe two formulations of integer linear programs that learn the edges: one maximizing a global score function, and another maximizing a global probability function.
    Page 5, “Learning Entailment Graph Edges”
Since the variables are binary, both formulations are integer linear programs with O(|V|²) variables and O(|V|³) transitivity constraints that can be solved using standard ILP packages.
    Page 6, “Learning Entailment Graph Edges”
large taxonomy (WordNet) and therefore utilize a greedy algorithm, while we simultaneously learn all edges of a rather small graph and employ integer linear programming, which is more sound theoretically, and as shown in Section 6, leads to an optimal solution.
    Page 6, “Learning Entailment Graph Edges”
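
The quoted sentences cast edge learning as an Integer Linear Program: one binary variable per directed edge, a linear objective built from classifier scores, and a transitivity constraint for every node triple. The sketch below expresses the score-based variant with the PuLP modelling library; PuLP, the function name learn_edges, and the toy weight function are assumptions made for illustration, not the authors' implementation.

    # pip install pulp
    import itertools
    import pulp

    def learn_edges(nodes, weight):
        """Score-based ILP sketch: x[(u, v)] = 1 iff the edge u -> v is kept."""
        prob = pulp.LpProblem("entailment_graph", pulp.LpMaximize)
        x = {(u, v): pulp.LpVariable(f"x_{u}_{v}", cat="Binary")
             for u, v in itertools.permutations(nodes, 2)}
        # Objective: total weight of the selected edges.
        prob += pulp.lpSum(weight(u, v) * x[(u, v)] for (u, v) in x)
        # Transitivity: if u -> v and v -> w are selected, u -> w must be too.
        for u, v, w in itertools.permutations(nodes, 3):
            prob += x[(u, v)] + x[(v, w)] - x[(u, w)] <= 1
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return sorted(e for e, var in x.items() if var.value() == 1)

    # Toy weights: edges into "affect" look entailing, everything else does not.
    print(learn_edges(["reduce", "decrease", "affect"],
                      lambda u, v: 0.5 if v == "affect" else -0.3))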

development set

Appears in 8 sentences as: development set (9)
In Global Learning of Focused Entailment Graphs
Note that this constant needs to be optimized on a development set.
    Page 5, “Learning Entailment Graph Edges”
Importantly, while the score-based formulation contains a parameter λ that requires optimization, this probabilistic formulation is parameter free and does not utilize a development set at all.
    Page 6, “Learning Entailment Graph Edges”
The graphs were randomly split into a development set (11 graphs) and a test set (12 graphs).
    Page 7, “Experimental Evaluation”
  4. The left half depicts methods where the development set was needed to tune parameters, and the right half depicts methods that do not require a (manually created) development set at all.
    Page 8, “Experimental Evaluation”
Results on the left were achieved by optimizing the top-K parameter on the development set, and on the right by optimizing on the training set automatically generated from WordNet.
    Page 8, “Experimental Evaluation”
  6. The global methods clearly outperform local methods: Tuned-LP outperforms significantly all local methods that require a development set both on the edges F1 measure (p<.
    Page 8, “Experimental Evaluation”
The untuned-LP algorithm also significantly outperforms all local methods that do not require a development set on the edges F1 measure (p<.
    Page 8, “Experimental Evaluation”
methods are sensitive to parameter tuning and in the absence of a development set their performance dramatically deteriorates.
    Page 8, “Experimental Evaluation”
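
As these excerpts note, the local top-K methods (and the score-based formulation) have a parameter that must be tuned on a development set. A small illustrative sketch of that kind of tuning, choosing the top-K cutoff that maximizes mean edge F1 over toy development graphs; topk_edges, edge_f1 and the toy data are assumed names, not the paper's code.

    def topk_edges(sim, k):
        """sim maps each node v to a list of (u, score): add (u, v) for the top-K u."""
        return {(u, v) for v, lst in sim.items()
                for u, _ in sorted(lst, key=lambda p: -p[1])[:k]}

    def edge_f1(pred, gold):
        tp = len(pred & gold)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    # One toy development graph: its similarity lists and its gold edges.
    dev_set = [({"affect": [("reduce", 0.9), ("cause", 0.4)]},
                {("reduce", "affect")})]
    best_k = max(range(1, 11),
                 key=lambda k: sum(edge_f1(topk_edges(s, k), g)
                                   for s, g in dev_set) / len(dev_set))
    print(best_k)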

ILP

Appears in 8 sentences as: ILP (9)
In Global Learning of Focused Entailment Graphs
  1. The optimization problem is formulated as an Integer Linear Program (ILP) and solved with an ILP solver.
    Page 1, “Introduction”
  2. We show that this leads to an optimal solution with respect to the global function, and demonstrate that the algorithm outperforms methods that utilize only local information by more than 10%, as well as methods that employ a greedy optimization algorithm rather than an ILP solver (Section 6).
    Page 1, “Introduction”
variables are integers, the problem is termed an Integer Linear Program (ILP).
    Page 2, “Background”
Since the variables are binary, both formulations are integer linear programs with O(|V|²) variables and O(|V|³) transitivity constraints that can be solved using standard ILP packages.
    Page 6, “Learning Entailment Graph Edges”
  5. Global algorithms We experimented with all 6 combinations of the following two dimensions: (1) Target functions: score-based, probabilistic and Snow et al.’s (2) Optimization algorithms: Snow et al.’s greedy algorithm and a standard ILP solver.
    Page 7, “Experimental Evaluation”
  6. This is the type of global consideration that is addressed in an ILP formulation, but is ignored in a local approach and often overlooked when employing a greedy algorithm.
    Page 8, “Experimental Evaluation”
Comparing our use of an ILP algorithm to the greedy one reveals that tuned-LP significantly outperforms its greedy counterpart on both measures (p < .01).
    Page 9, “Experimental Evaluation”
  8. The results clearly demonstrate that a global approach improves performance on the entailment graph learning task, and the overall advantage of employing an ILP solver rather than a greedy algorithm.
    Page 9, “Experimental Evaluation”
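
Several of the excerpts contrast solving the ILP exactly with Snow et al.'s greedy optimization. Below is a rough sketch of such a greedy alternative, under the assumption that each step adds the edge whose insertion, together with all edges forced by transitivity, yields the largest gain in the global score; it illustrates the contrast rather than faithfully reimplementing either algorithm.

    import itertools

    def transitive_closure(edges):
        closure = set(edges)
        changed = True
        while changed:
            changed = False
            for (a, b), (c, d) in itertools.product(list(closure), repeat=2):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
        return closure

    def greedy_graph(nodes, weight):
        edges, candidates = set(), set(itertools.permutations(nodes, 2))
        while True:
            best_gain, best_closure = 0.0, None
            for e in candidates - edges:
                new_closure = transitive_closure(edges | {e})
                gain = sum(weight(u, v) for (u, v) in new_closure - edges)
                if gain > best_gain:
                    best_gain, best_closure = gain, new_closure
            if best_closure is None:   # no single addition improves the score
                return edges
            edges = best_closure

    print(sorted(greedy_graph(["reduce", "decrease", "affect"],
                              lambda u, v: 0.5 if v == "affect" else -0.3)))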

gold standard

Appears in 7 sentences as: gold standard (7)
In Global Learning of Focused Entailment Graphs
  1. To evaluate the performance of our algorithm, we constructed 23 gold standard entailment graphs.
    Page 6, “Experimental Evaluation”
  2. Ten medical students constructed the gold standard of graph edges.
    Page 7, “Experimental Evaluation”
The entailment graph fragment in Figure 1 is from the gold standard.
    Page 7, “Experimental Evaluation”
The graphs learned by our algorithm were evaluated by two measures, one evaluating the graph directly, and the other motivated by our application: (1) F1 of the learned edges compared to the gold standard edges; (2) Our application provides a summary of propositions extracted from the corpus.
    Page 7, “Experimental Evaluation”
  5. Thus, we compute F1 for the set of propositions inferred from the learned graph, compared to the set inferred based on the gold standard graph.
    Page 7, “Experimental Evaluation”
Table 2: Comparing disagreements between the best local and global algorithms against the gold standard
    Page 8, “Experimental Evaluation”
  7. The table considers all edges where the two algorithms disagree, and counts how many are in the gold standard and how many are not.
    Page 8, “Experimental Evaluation”
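
The two measures quoted above compare the learned graph to the gold standard directly (edge F1) and through the propositions the application can infer from it. A toy sketch of the second, application-oriented measure; inferred_propositions, set_f1 and the toy data are illustrative assumptions, not the paper's evaluation code.

    def inferred_propositions(extracted, edges):
        """Propagate each extracted proposition along entailment edges:
        if t1 -> t2 and (t1, args) holds, then (t2, args) is inferred too."""
        inferred = {(t, args) for t, arg_set in extracted.items() for args in arg_set}
        changed = True
        while changed:
            changed = False
            for t1, t2 in edges:
                for t, args in list(inferred):
                    if t == t1 and (t2, args) not in inferred:
                        inferred.add((t2, args))
                        changed = True
        return inferred

    def set_f1(pred, gold):
        tp = len(pred & gold)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    extracted = {"X reduce Y": {("aspirin", "fever")}}
    gold_graph = {("X reduce Y", "X affect Y")}
    learned_graph = set()   # a learned graph that misses the gold edge
    print(round(set_f1(inferred_propositions(extracted, learned_graph),
                       inferred_propositions(extracted, gold_graph)), 2))  # 0.67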

feature vector

Appears in 5 sentences as: feature vector (4) feature vectors (1)
In Global Learning of Focused Entailment Graphs
  1. Quite a few methods have been suggested (Lin and Pantel, 2001; Bhagat et al., 2007; Yates and Etzioni, 2009), which differ in terms of the specifics of the ways in which predicates are represented, the features that are extracted, and the function used to compute feature vector similarity.
    Page 2, “Background”
  2. Next, we represent each pair of propositional templates with a feature vector of various distributional similarity scores.
    Page 4, “Learning Entailment Graph Edges”
  3. A template pair is represented by a feature vector where each coordinate is a different distributional similarity score.
    Page 4, “Learning Entailment Graph Edges”
Another variant occurs when using binary templates: a template may be represented by a pair of feature vectors, one for each variable (Lin and Pantel, 2001), or by a single vector, where features represent pairs of instantiations (Szpektor et al., 2004; Yates and Etzioni, 2009).
    Page 4, “Learning Entailment Graph Edges”
  5. When computing distributional similarity scores, a template is represented as a feature vector of the CUIs that instantiate its arguments.
    Page 6, “Experimental Evaluation”
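
Item 4 above mentions two distributional representations for binary templates: a separate feature vector per argument slot, or a single vector over pairs of instantiations. A toy illustration of both variants over a few made-up argument instantiations:

    from collections import Counter

    # Toy (X, Y) instantiations observed with the binary template "X reduce Y".
    occurrences = [("aspirin", "headache"), ("aspirin", "fever"), ("ibuprofen", "fever")]

    # (a) A pair of feature vectors, one per variable slot (Lin and Pantel, 2001).
    x_slot = Counter(x for x, _ in occurrences)
    y_slot = Counter(y for _, y in occurrences)

    # (b) A single vector whose features are instantiation pairs
    #     (Szpektor et al., 2004; Yates and Etzioni, 2009).
    xy_pairs = Counter(occurrences)

    print(x_slot, y_slot, xy_pairs)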

similarity measure

Appears in 5 sentences as: similarity measure (3) similarity measures (3)
In Global Learning of Focused Entailment Graphs
  1. Similarity function We consider two similarity functions: The Lin (2001) similarity measure, and the Balanced Inclusion (BInc) similarity measure (Szpektor and Dagan, 2008).
    Page 4, “Learning Entailment Graph Edges”
  2. Local algorithms We described 12 distributional similarity measures computed over our corpus (Section 5.1).
    Page 7, “Experimental Evaluation”
In addition, we obtained similarity lists learned by Lin and Pantel (2001), and replicated 3 similarity measures learned by Szpektor and Dagan (2008), over the RCV1 corpus.
    Page 7, “Experimental Evaluation”
For each distributional similarity measure (altogether 16 measures), we learned a graph by inserting any edge (u, v), when u is in the top K templates most similar to v.
    Page 7, “Experimental Evaluation”
  5. A training set of 20,144 examples was automatically generated, each example represented by 16 features using the distributional similarity measures mentioned above.
    Page 7, “Experimental Evaluation”

hypernym

Appears in 4 sentences as: hypernym (2) hypernyms (2)
In Global Learning of Focused Entailment Graphs
For each t ∈ T with two variables and a single predicate word w, we extract from WordNet the set H of direct hypernyms and synonyms of w.
    Page 5, “Learning Entailment Graph Edges”
Negative examples are generated analogously, by looking at direct co-hyponyms of w instead of hypernyms and synonyms.
    Page 5, “Learning Entailment Graph Edges”
Combined with the constraint of transitivity this implies that there must be no path from u to v. This is done in the following two scenarios: (1) When two nodes u and v are identical except for a pair of words w_u and w_v, and w_u is an antonym of w_v, or a hypernym of w_v at distance ≥ 2.
    Page 5, “Learning Entailment Graph Edges”
Another local resource was WordNet, where we inserted an edge (u, v) when v was a direct hypernym or synonym of u.
    Page 7, “Experimental Evaluation”
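
Items 3 and 4 above use WordNet in two ways: as a local resource that proposes edges to direct hypernyms and synonyms, and as a source of constraints forbidding any path between antonyms or between a word and a hypernym at distance ≥ 2. A rough NLTK-based sketch of those lexical checks; NLTK and the function names are assumptions, not the paper's implementation.

    # Requires the WordNet data: import nltk; nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    def is_antonym(wu, wv):
        return any(wv in {a.name() for a in lemma.antonyms()}
                   for syn in wn.synsets(wu) for lemma in syn.lemmas())

    def hypernym_distance(wu, wv):
        """Fewest hypernym steps from a synset of wv up to a synset of wu,
        or None if wu is not an ancestor of wv."""
        wu_syns, best = set(wn.synsets(wu)), None
        for syn in wn.synsets(wv):
            for path in syn.hypernym_paths():        # path runs root -> ... -> syn
                for depth, ancestor in enumerate(reversed(path)):
                    if depth > 0 and ancestor in wu_syns:
                        best = depth if best is None else min(best, depth)
        return best

    def forbid_path(wu, wv):
        """True if no entailment path from wu's node to wv's node should be allowed."""
        d = hypernym_distance(wu, wv)
        return is_antonym(wu, wv) or (d is not None and d >= 2)

    print(forbid_path("increase", "decrease"))   # antonyms in WordNet -> True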

I_uv

Appears in 4 sentences as: I_uv (4)
In Global Learning of Focused Entailment Graphs
Thus, P(F_uv | G) = P(F_uv | I_uv).
    Page 5, “Learning Entailment Graph Edges”
P(G) = ∏_{u≠v} P(I_uv).
    Page 6, “Learning Entailment Graph Edges”
First, Snow et al.’s model attempts to determine the graph that maximizes the likelihood P(F|G) and not the posterior P(G|F). Therefore, their model contains an edge prior P(I_uv) that has to be estimated, whereas in our model it cancels out.
    Page 6, “Learning Entailment Graph Edges”
P(I_uv = 1)
    Page 6, “Learning Entailment Graph Edges”
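
The four fragments above belong to the probabilistic formulation. A hedged LaTeX reconstruction of how they fit together, where F_uv are the similarity features observed for the pair (u, v) and I_uv indicates whether the edge u → v is in the graph G; the final step, showing why the edge prior cancels, is an assumption consistent with the claim quoted above rather than a verbatim reproduction of the paper's derivation.

    % I_uv = 1 iff the edge u -> v is in G; F_uv are the features for the pair (u, v).
    \begin{align*}
    P(F \mid G) &= \prod_{u \neq v} P(F_{uv} \mid G) = \prod_{u \neq v} P(F_{uv} \mid I_{uv}) \\
    P(G)        &= \prod_{u \neq v} P(I_{uv}) \\
    P(G \mid F) &\propto \prod_{u \neq v} P(F_{uv} \mid I_{uv})\, P(I_{uv})
                 = \prod_{u \neq v} P(I_{uv} \mid F_{uv})\, P(F_{uv})
    \end{align*}
    % P(F_uv) does not depend on G, so maximizing the posterior amounts to maximizing
    % the product of classifier posteriors P(I_uv | F_uv): the edge prior P(I_uv)
    % cancels, whereas a likelihood-based objective must estimate it.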

optimization problem

Appears in 4 sentences as: optimization problem (4)
In Global Learning of Focused Entailment Graphs
  1. We define a graph structure over predicates that represents entailment relations as directed edges, and use a global transitivity constraint on the graph to learn the optimal set of edges, by formulating the optimization problem as an Integer Linear Program.
    Page 1, “Abstract”
  2. The optimization problem is formulated as an Integer Linear Program (ILP) and solved with an ILP solver.
    Page 1, “Introduction”
A Linear Program (LP) is an optimization problem, where a linear function is minimized (or maximized) under linear constraints.
    Page 2, “Background”
We used Integer Linear Programming to solve the optimization problem, which is theoretically sound, and demonstrated empirically that this method outperforms local algorithms as well as a greedy optimization algorithm on the graph learning task.
    Page 9, “Conclusion”

similarity scores

Appears in 4 sentences as: similarity score (1) similarity scores (3)
In Global Learning of Focused Entailment Graphs
Next, we represent each pair of propositional templates with a feature vector of various distributional similarity scores.
    Page 4, “Learning Entailment Graph Edges”
A template pair is represented by a feature vector where each coordinate is a different distributional similarity score.
    Page 4, “Learning Entailment Graph Edges”
  3. We then generate for any (t1, t2) features that are the 12 distributional similarity scores using all combinations of the dimensions.
    Page 4, “Learning Entailment Graph Edges”
When computing distributional similarity scores, a template is represented as a feature vector of the CUIs that instantiate its arguments.
    Page 6, “Experimental Evaluation”
