Open Information Extraction Using Wikipedia
Wu, Fei and Weld, Daniel S.

Article Structure

Abstract

Information-extraction (IE) systems seek to distill semantic relations from natural-language text, but most systems use supervised learning of relation-specific examples and are thus limited by the availability of training data.

Introduction

The problem of information-extraction (IE), generating relational data from natural-language text, has received increasing attention in recent years.

Problem Definition

An open information extractor is a function from a document, d, to a set of triples, {⟨arg1, rel, arg2⟩}, where the args are noun phrases and rel is a textual fragment indicating an implicit, semantic relation between the two noun phrases.
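
To make that function signature concrete, here is a minimal sketch in Python; the regex relation pattern is a toy stand-in for illustration only, not WOE's extraction method.

    import re
    from typing import List, Tuple

    # An open extractor maps a document to (arg1, rel, arg2) triples.
    Triple = Tuple[str, str, str]

    def extract(document: str) -> List[Triple]:
        """Toy open extractor: NP, verb-ish relation phrase, NP."""
        pattern = re.compile(r"([A-Z][\w ]+?) (was \w+ in|is the \w+ of) ([\w ]+)")
        return [(m.group(1), m.group(2), m.group(3))
                for m in pattern.finditer(document)]

    print(extract("Stanford University was founded in 1891"))
    # [('Stanford University', 'was founded in', '1891')]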

Wikipedia-based Open IE

The key idea underlying WOE is the automatic construction of training examples by heuristically matching Wikipedia infobox values and corresponding text; these examples are used to generate an unlexicalized, relation-independent (open) extractor.
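
A minimal sketch of that heuristic matching, assuming an infobox given as a dict and naive exact-string matching (the paper's matcher additionally handles subject synonyms, pronouns, and noun-phrase heads):

    def match_infobox_to_sentences(subject, infobox, sentences):
        """Yield (sentence, attribute, value) self-supervised training matches."""
        for attribute, value in infobox.items():
            # Candidate sentences mention both the article subject and the value.
            hits = [s for s in sentences if subject in s and value in s]
            if len(hits) == 1:          # the paper requires a *unique* match
                yield hits[0], attribute, value

    matches = list(match_infobox_to_sentences(
        "Stanford University",
        {"established": "1891"},
        ["Stanford University was founded in 1891 by Leland Stanford."]))
    # -> one training example labeling the arg1/arg2 noun phrases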

Experiments

We used three corpora for experiments: WSJ from Penn Treebank, Wikipedia, and the general Web.

Related Work

Open or Traditional Information Extraction: Most existing work on IE is relation-specific.

Conclusion

This paper introduces WOE, a new approach to open IE that uses self-supervised learning over unlexicalized features, based on a heuristic match between Wikipedia infoboxes and corresponding text.

Topics

CRF

Appears in 9 sentences as: CRF (9)
In Open Information Extraction Using Wikipedia
  1. [Figure: WOE architecture. The Matcher aligns sentences with infobox values; the Learner produces a Pattern Classifier over Parser Features and a CRF Extractor over Shallow Features.]
    Page 2, “Wikipedia-based Open IE”
  2. In contrast, WOEPOS (like TextRunner) trains a conditional random field (CRF) to output certain text between noun phrases when the text denotes such a relation.
    Page 3, “Wikipedia-based Open IE”
  3. Since high speed can be crucial when processing Web-scale corpora, we additionally learn a CRF extractor WOEPOS based on shallow features like POS-tags.
    Page 5, “Wikipedia-based Open IE”
  4. Specifically, for each matching sentence, we label the subject and infobox attribute value as arg1 and arg2 to serve as the ends of a linear CRF chain.
    Page 5, “Wikipedia-based Open IE”
  5. WOEPOS uses the same learning algorithm and selection of features as TextRunner: a second-order CRF chain model is trained with the Mallet package (McCallum, 2002). [see the sketch after this list]
    Page 5, “Wikipedia-based Open IE”
  6. To compare with TextRunner, we tested four different ways to generate training examples from Wikipedia for learning a CRF extractor.
    Page 7, “Experiments”
  7. The CRF extractors are trained using the same learning algorithm and feature selection as TextRunner.
    Page 7, “Experiments”
  8. Wu and Weld proposed the KYLIN system (Wu and Weld, 2007; Wu et al., 2008) which has the same spirit of matching Wikipedia sentences with infoboxes to learn CRF extractors.
    Page 9, “Related Work”
  9. WOE can run in two modes: a CRF extractor (WOEPOS) trained with shallow features like POS tags; a pattern classifier (WOEparse) learned from dependency path patterns.
    Page 9, “Conclusion”
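
Item 5 above says the CRF is trained with the Java-based Mallet toolkit; as an illustrative stand-in (an assumption, not the authors' setup), the sketch below trains an analogous linear-chain CRF over shallow POS features with the sklearn-crfsuite library, labeling the ends of the chain arg1/arg2 as in item 4 and the text between them as the relation.

    import sklearn_crfsuite

    def shallow_features(pos_tags, i):
        # Unlexicalized shallow features: POS of the current and neighbor tokens.
        feats = {"pos": pos_tags[i], "first": i == 0,
                 "last": i == len(pos_tags) - 1}
        if i > 0:
            feats["pos-1"] = pos_tags[i - 1]
        if i < len(pos_tags) - 1:
            feats["pos+1"] = pos_tags[i + 1]
        return feats

    # One training sentence with the relation span "was founded in".
    pos    = ["NNP", "VBD", "VBN", "IN", "CD"]   # Stanford was founded in 1891
    labels = ["ARG1", "REL", "REL", "REL", "ARG2"]

    X = [[shallow_features(pos, i) for i in range(len(pos))]]
    y = [labels]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X))   # recovers the label sequence on the training data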

dependency path

Appears in 7 sentences as: Dependency Path (1) dependency path (4) dependency paths (2)
In Open Information Extraction Using Wikipedia
  1. We show that abstract dependency paths are a highly informative feature when performing unlexicalized extraction.
    Page 2, “Introduction”
  2. WOEparse uses a pattern learner to classify whether the shortest dependency path between two noun phrases indicates a semantic relation. [see the sketch after this list]
    Page 3, “Wikipedia-based Open IE”
  3. Despite some evidence that parser-based features have limited utility in IE (Jiang and Zhai, 2007), we hoped dependency paths would improve precision on long sentences.
    Page 3, “Wikipedia-based Open IE”
  4. Shortest Dependency Path as Relation: Unless otherwise noted, WOE uses the Stanford Parser to create dependencies in the “collapsedDependency” format.
    Page 3, “Wikipedia-based Open IE”
  5. (Snow et al., 2005) utilize WordNet to learn dependency path patterns for extracting the hypernym relation from text.
    Page 7, “Related Work”
  6. However, our results imply that abstracted dependency path features are highly informative for open IE.
    Page 9, “Related Work”
  7. WOE can run in two modes: a CRF extractor (WOEPOS) trained with shallow features like POS tags; a pattern classifier (WOEparse) learned from dependency path patterns.
    Page 9, “Conclusion”
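
To make “shortest dependency path between two noun phrases” concrete, here is a small sketch that assumes dependencies arrive as (head, relation, dependent) triples, e.g. from the Stanford Parser's collapsed output, and uses networkx for the path search; it is an illustration, not the authors' implementation.

    import networkx as nx

    def shortest_dep_path(dependencies, arg1_head, arg2_head):
        g = nx.Graph()
        for head, rel, dep in dependencies:
            g.add_edge(head, dep, rel=rel)
        nodes = nx.shortest_path(g, arg1_head, arg2_head)
        # Interleave words with the dependency labels along the path.
        path = [nodes[0]]
        for a, b in zip(nodes, nodes[1:]):
            path += [g.edges[a, b]["rel"], b]
        return path

    deps = [("founded", "nsubjpass", "university"),
            ("founded", "prep_in", "1891")]
    print(shortest_dep_path(deps, "university", "1891"))
    # ['university', 'nsubjpass', 'founded', 'prep_in', '1891']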

POS tags

Appears in 7 sentences as: POS tags (10)
In Open Information Extraction Using Wikipedia
  1. NLP Annotation: As we discuss fully in Section 4 (Experiments), we consider several variations of our system; one version, WOEparse, uses parser-based features, while another, WOEPOS, uses shallow features like POS tags, which may be more quickly computed.
    Page 2, “Wikipedia-based Open IE”
  2. Depending on which version is being trained, the preprocessor uses OpenNLP to supply POS tags and NP-chunk annotations — or uses the Stanford Parser to create a dependency parse.
    Page 2, “Wikipedia-based Open IE”
  3. We learn two kinds of extractors, one (WOEparse) using features from dependency-parse trees and the other (WOEPOS) limited to shallow features like POS tags.
    Page 3, “Wikipedia-based Open IE”
  4. Lexical words in corePaths are replaced with their POS tags.
    Page 4, “Wikipedia-based Open IE”
  5. Further, all Noun POS tags and “PRP” are abstracted to “N”, all Verb POS tags to “V”, all Adverb POS tags to “RB” and all Adjective POS tags to “J”. [see the sketch after this list]
    Page 4, “Wikipedia-based Open IE”
  6. Shallow or Deep Parsing: Shallow features, like POS tags, enable fast extraction over large-scale corpora (Davidov et al., 2007; Banko et al., 2007).
    Page 9, “Related Work”
  7. WOE can run in two modes: a CRF extractor (WOEPOS) trained with shallow features like POS tags; a pattern classifier (WOEparse) learned from dependency path patterns.
    Page 9, “Conclusion”
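
Item 5 above specifies the POS abstraction exactly; a small executable restatement over Penn Treebank tags:

    def abstract_pos(tag: str) -> str:
        if tag.startswith("NN") or tag == "PRP":   # nouns and PRP -> N
            return "N"
        if tag.startswith("VB"):                   # verbs -> V
            return "V"
        if tag.startswith("RB"):                   # adverbs -> RB
            return "RB"
        if tag.startswith("JJ"):                   # adjectives -> J
            return "J"
        return tag                                 # everything else unchanged

    print([abstract_pos(t) for t in ["NNP", "VBD", "IN", "JJ", "PRP"]])
    # ['N', 'V', 'IN', 'J', 'N']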

F-measure

Appears in 6 sentences as: F-measure (7)
In Open Information Extraction Using Wikipedia
  1. Compared with TextRunner (the state of the art) on three corpora, WOE yields between 72% and 91% improved F-measure — generalizing well beyond Wikipedia.
    Page 2, “Introduction”
  2. As shown in the experiments on three corpora, WOEparse achieves an F-measure which is between 72% and 91% greater than TextRunner’s.
    Page 4, “Wikipedia-based Open IE”
  3. As shown in the experiments, WOEPOS achieves an improved F-measure over TextRunner between 18% and 34% on three corpora, and this is mainly due to the increase in precision.
    Page 5, “Wikipedia-based Open IE”
  4. Figure 3: WOEPOS achieves an F-measure, which is between 18% and 34% better than TextRunner’s.
    Page 6, “Experiments”
  5. Figure 4: WOEparse’s F-measure decreases more slowly with sentence length than WOEPOS and TextRunner, due to its better handling of difficult sentences using parser features.
    Page 6, “Experiments”
  6. Compared with TextRunner, WOEPOS runs at the same speed, but achieves an F-measure which is between 18% and 34% greater on three corpora; WOEparse achieves an F-measure which is between 72% and 91% higher than that of TextRunner, but runs about 30 times slower due to the time required for parsing.
    Page 9, “Conclusion”
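
For reference, the F-measure quoted in these comparisons is the balanced harmonic mean of precision and recall:

    def f_measure(precision: float, recall: float) -> float:
        # F1 = 2PR / (P + R)
        return 2 * precision * recall / (precision + recall)

    print(round(f_measure(0.8, 0.5), 3))   # 0.615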

noun phrases

Appears in 5 sentences as: noun phrases (6)
In Open Information Extraction Using Wikipedia
  1. An open information extractor is a function from a document, d, to a set of triples, {⟨arg1, rel, arg2⟩}, where the args are noun phrases and rel is a textual fragment indicating an implicit, semantic relation between the two noun phrases.
    Page 2, “Problem Definition”
  2. Given the article on “Stanford University,” for example, the matcher should associate ⟨established, 1891⟩ with the sentence “The university was founded in 1891 by …” Given a Wikipedia page with an infobox, the matcher iterates through all its attributes looking for a unique sentence that contains references to both the subject of the article and the attribute value; these noun phrases will be annotated arg1 and arg2 in the training set.
    Page 3, “Wikipedia-based Open IE”
  3. Second, it rejects the sentence if the subject and/or attribute value are not heads of the noun phrases containing them. [see the sketch after this list]
    Page 3, “Wikipedia-based Open IE”
  4. WOEparse uses a pattern learner to classify whether the shortest dependency path between two noun phrases indicates a semantic relation.
    Page 3, “Wikipedia-based Open IE”
  5. In contrast, WOEPOS (like TextRunner) trains a conditional random field (CRF) to output certain text between noun phrases when the text denotes such a relation.
    Page 3, “Wikipedia-based Open IE”
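
Item 3 above rejects a match unless the subject and attribute value head their containing noun phrases. A hedged sketch of that check for the value side, using spaCy noun chunks as a stand-in for the paper's OpenNLP NP annotations:

    import spacy

    nlp = spacy.load("en_core_web_sm")   # any English pipeline with a parser

    def value_is_np_head(sentence: str, value: str) -> bool:
        doc = nlp(sentence)
        for chunk in doc.noun_chunks:
            if value in chunk.text:
                # Accept only if the value token heads the containing NP.
                return chunk.root.text == value
        return False

    print(value_is_np_head("The university was founded by Leland Stanford.",
                           "Stanford"))   # True with typical English models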

Penn Treebank

Appears in 5 sentences as: Penn Treebank (5)
In Open Information Extraction Using Wikipedia
  1. For example, TextRunner uses a small set of handwritten rules to heuristically label training examples from sentences in the Penn Treebank.
    Page 1, “Introduction”
  2. In both cases, however, we generate training data from Wikipedia by matching sentences with infoboxes, while TextRunner used a small set of handwritten rules to label training examples from the Penn Treebank.
    Page 5, “Wikipedia-based Open IE”
  3. We used three corpora for experiments: WSJ from Penn Treebank, Wikipedia, and the general Web.
    Page 5, “Experiments”
  4. In contrast, TextRunner was trained with 91,687 positive examples and 96,795 negative examples generated from the WSJ dataset in Penn Treebank.
    Page 7, “Experiments”
  5. We used three parsing options on the WSJ dataset: Stanford parsing, CJ50 parsing (Charniak and Johnson, 2005), and the gold parses from the Penn Treebank.
    Page 7, “Experiments”

relation extraction

Appears in 5 sentences as: relation extraction (4) relation extractors (1)
In Open Information Extraction Using Wikipedia
  1. As noted in (de Marneffe and Manning, 2008), this collapsed format often yields simplified patterns which are useful for relation extraction. [see the sketch after this list]
    Page 4, “Wikipedia-based Open IE”
  2. (Mintz et al., 2009) uses Freebase to provide distant supervision for relation extraction.
    Page 8, “Related Work”
  3. They applied a similar heuristic by matching Freebase tuples with unstructured sentences (Wikipedia articles in their experiments) to create features for learning relation extractors.
    Page 8, “Related Work”
  4. (Akbik and Broß, 2009) annotated 10,000 sentences parsed with LinkGrammar and selected 46 general linkpaths as patterns for relation extraction.
    Page 8, “Related Work”
  5. Jiang and Zhai (Jiang and Zhai, 2007) did a systematic exploration of the feature space for relation extraction on the ACE corpus.
    Page 9, “Related Work”
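
Item 1 above credits the collapsed format with simpler patterns: a prep edge plus its pobj (founded -prep-> in -pobj-> 1891) becomes one edge (founded -prep_in-> 1891). Below is a minimal sketch of that collapsing transform over (head, rel, dep) triples; the real Stanford converter covers many more constructions:

    def collapse_preps(deps):
        # Map each preposition introduced by a "prep" edge to its governor.
        prep_edges = {dep: head for head, rel, dep in deps if rel == "prep"}
        collapsed = []
        for head, rel, dep in deps:
            if rel == "prep":
                continue                         # absorbed into prep_* below
            if rel == "pobj" and head in prep_edges:
                collapsed.append((prep_edges[head], f"prep_{head}", dep))
            else:
                collapsed.append((head, rel, dep))
        return collapsed

    deps = [("founded", "nsubjpass", "university"),
            ("founded", "prep", "in"),
            ("in", "pobj", "1891")]
    print(collapse_preps(deps))
    # [('founded', 'nsubjpass', 'university'), ('founded', 'prep_in', '1891')]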

Treebank

Appears in 5 sentences as: Treebank (5)
In Open Information Extraction Using Wikipedia
  1. For example, TextRunner uses a small set of handwritten rules to heuristically label training examples from sentences in the Penn Treebank.
    Page 1, “Introduction”
  2. In both cases, however, we generate training data from Wikipedia by matching sentences with infoboxes, while TextRunner used a small set of handwritten rules to label training examples from the Penn Treebank.
    Page 5, “Wikipedia-based Open IE”
  3. We used three corpora for experiments: WSJ from Penn Treebank, Wikipedia, and the general Web.
    Page 5, “Experiments”
  4. In contrast, TextRunner was trained with 91,687 positive examples and 96,795 negative examples generated from the WSJ dataset in Penn Treebank.
    Page 7, “Experiments”
  5. We used three parsing options on the WSJ dataset: Stanford parsing, CJ50 parsing (Charniak and Johnson, 2005), and the gold parses from the Penn Treebank.
    Page 7, “Experiments”

parse trees

Appears in 4 sentences as: parse tree (1) parse trees (3)
In Open Information Extraction Using Wikipedia
  1. Third, it discards the sentence if the subject and the attribute value do not appear in the same clause (or in parent/child clauses) in the parse tree. [see the sketch after this list]
    Page 3, “Wikipedia-based Open IE”
  2. Most likely, this is because TextRunner’s heuristics rely on parse trees to label training examples …
    Page 7, “Experiments”
  3. The Stanford Parser is used to derive dependencies from CJ50 and gold parse trees.
    Page 7, “Experiments”
  4. Deep features are derived from parse trees with the hope of training better extractors (Zhang et al., 2006; Zhao and Grishman, 2005; Bunescu and Mooney, 2005; Wang, 2008).
    Page 9, “Related Work”
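
Item 1 above filters matches by clause structure. Below is a sketch of such a check over an nltk constituency tree, under the assumption that clause nodes carry S-style labels; it is an illustration, not the authors' code.

    from nltk import Tree

    CLAUSE_LABELS = {"S", "SBAR", "SINV", "SQ"}

    def clause_chain(tree, leaf_index):
        """Positions of the clause ancestors of a leaf, outermost first."""
        pos = tree.leaf_treeposition(leaf_index)
        return [pos[:i] for i in range(len(pos))
                if tree[pos[:i]].label() in CLAUSE_LABELS]

    def same_or_parent_child_clause(tree, i, j):
        a, b = clause_chain(tree, i), clause_chain(tree, j)
        if not a or not b:
            return False
        return (a[-1] == b[-1]                        # same lowest clause
                or (len(b) > 1 and a[-1] == b[-2])    # a's clause is b's parent
                or (len(a) > 1 and b[-1] == a[-2]))   # b's clause is a's parent

    t = Tree.fromstring(
        "(S (NP (DT The) (NN university)) (VP (VBD was) "
        "(VP (VBN founded) (PP (IN in) (NP (CD 1891))))))")
    print(same_or_parent_child_clause(t, 1, 5))   # True: "university", "1891"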

precision and recall

Appears in 4 sentences as: precision and recall (4)
In Open Information Extraction Using Wikipedia
  1. This paper presents WOE, an open IE system which improves dramatically on TextRunner’s precision and recall.
    Page 1, “Abstract”
  2. WOE can operate in two modes: when restricted to POS tag features, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.
    Page 1, “Abstract”
  3. … high precision and recall, they are limited by the availability of training data and are unlikely to scale to the thousands of relations found in text on the Web.
    Page 1, “Introduction”
  4. WOE can operate in two modes: when restricted to shallow features like part-of-speech (POS) tags, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.
    Page 1, “Introduction”

semantic relation

Appears in 4 sentences as: semantic relation (2) semantic relations (2)
In Open Information Extraction Using Wikipedia
  1. Information-extraction (IE) systems seek to distill semantic relations from natural-language text, but most systems use supervised learning of relation-specific examples and are thus limited by the availability of training data.
    Page 1, “Abstract”
  2. Like TextRunner, WOE’s extractor eschews lexicalized features and handles an unbounded set of semantic relations.
    Page 1, “Abstract”
  3. An open information extractor is a function from a document, d, to a set of triples, {⟨arg1, rel, arg2⟩}, where the args are noun phrases and rel is a textual fragment indicating an implicit, semantic relation between the two noun phrases.
    Page 2, “Problem Definition”
  4. WOEparse uses a pattern learner to classify whether the shortest dependency path between two noun phrases indicates a semantic relation.
    Page 3, “Wikipedia-based Open IE”

learning algorithm

Appears in 3 sentences as: learning algorithm (3)
In Open Information Extraction Using Wikipedia
  1. Using the same learning algorithm and features as TextRunner, we compare four different ways to generate positive and negative training data with TextRunner’s method, concluding that our Wikipedia heuristic is responsible for the bulk of WOE’s improved accuracy.
    Page 2, “Introduction”
  2. WOEPOS uses the same learning algorithm and selection of features as TextRunner: a second-order CRF chain model is trained with the Mallet package (McCallum, 2002).
    Page 5, “Wikipedia-based Open IE”
  3. The CRF extractors are trained using the same learning algorithm and feature selection as TextRunner.
    Page 7, “Experiments”

lexicalized

Appears in 3 sentences as: lexicalized (3)
In Open Information Extraction Using Wikipedia
  1. Like TextRunner, WOE’s extractor eschews lexicalized features and handles an unbounded set of semantic relations. [see the sketch after this list]
    Page 1, “Abstract”
  2. First, Jiang and Zhai’s results are tested for traditional IE where local lexicalized tokens might contain sufficient information to trigger a correct classification.
    Page 9, “Related Work”
  3. We are also interested in merging lexicalized and open extraction methods; the use of some domain-specific lexical features might help to improve WOE’s practical performance, but the best way to do this is unclear.
    Page 9, “Conclusion”
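
To illustrate the distinction, a small hypothetical example contrasting a lexicalized dependency-path feature (the literal words) with WOE's unlexicalized pattern (dependency labels plus abstracted POS classes, per the abstraction sketch earlier):

    def join_path(nodes, rels):
        # Interleave node symbols with the relation labels between them.
        out = [nodes[0]]
        for node, rel in zip(nodes[1:], rels):
            out += [rel, node]
        return "-".join(out)

    rels  = ["nsubjpass", "prep_in"]
    words = ["university", "founded", "1891"]   # lexicalized: tied to one relation
    pos   = ["N", "V", "N"]                     # unlexicalized: generalizes

    print(join_path(words, rels))   # university-nsubjpass-founded-prep_in-1891
    print(join_path(pos, rels))     # N-nsubjpass-V-prep_in-N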
