Learning 5000 Relational Extractors
Raphael Hoffmann, Congle Zhang, and Daniel S. Weld

Article Structure

Abstract

Many researchers are trying to use information extraction (IE) to create large-scale knowledge bases from natural language text on the Web.

Introduction

Information extraction (IE), the process of generating relational data from natural-language text, has gained popularity for its potential applications in Web search, question answering and other tasks.

Heuristic Generation of Training Data

Wikipedia is an ideal starting point for our long-term goal of creating a massive knowledge base of extracted facts for two reasons.

Learning Extractors

We first assume that each Wikipedia infobox attribute corresponds to a unique relation (but see Section 5.6) for which we would like to learn a specific extractor.

Extraction with Lexicons

It is often possible to group words that are likely to be assigned similar labels, even if many of these words do not appear in our training set.

Experiments

We start by evaluating end-to-end performance of LUCHS when applied to Wikipedia text, then analyze the characteristics of its components.

Related Work

Large-scale extraction: A popular approach to IE is supervised learning of relation-specific extractors (Freitag, 1998).

Future Work

We envision a Web-scale machine reading system which simultaneously learns ontologies and extractors, and we believe that LUCHS's approach of leveraging noisy semi-structured information (such as lists or formatting templates) is a key step towards this goal.

Conclusion

Many researchers are trying to use IE to create large-scale knowledge bases from natural language text on the Web, but existing relation-specific techniques do not scale to the thousands of relations encoded in Web text, while relation-independent techniques suffer from lower precision and recall and do not canonicalize the relations.

Topics

CRF

Appears in 10 sentences as: CRF (11)
In Learning 5000 Relational Extractors
  1. These lexicons form Boolean features which, along with lexical and dependency parser-based features, are used to produce a CRF extractor for each relation — one which performs much better than lexicon-free extraction on sparse training data.
    Page 2, “Introduction”
  2. We use a linear-chain conditional random field (CRF) — an undirected graphical model connecting a sequence of input and output random variables, x = (x_0, ..., x_T) and y = (y_0, ..., y_T).
    Page 3, “Learning Extractors”
  3. The CRF models are represented with a log-linear distribution (a standard form is sketched after this list).
    Page 3, “Learning Extractors”
  4. Domain-independence requires access to an extremely large number of lists, but our tight integration of lexicon acquisition and CRF learning requires that relevant lists be accessed instantaneously.
    Page 4, “Extraction with Lexicons”
  5. While training a CRF extractor for a given relation, LUCHS uses its corpus of lists to automatically generate a set of semantic lexicons — specific to that relation.
    Page 4, “Extraction with Lexicons”
  6. The semantic lexicons are added as features to the CRF learning algorithm.
    Page 4, “Extraction with Lexicons”
  7. Finally, we integrate the acquired semantic lexicons as features into the CRF.
    Page 5, “Extraction with Lexicons”
  8. Although Section 3 discussed how to use lexicons as CRF features, there are some subtleties.
    Page 5, “Extraction with Lexicons”
  9. If we now train the CRF on the same examples that generated the lexicon features, then the CRF will likely overfit, and weight the lexicon features too highly!
    Page 5, “Extraction with Lexicons”
  10. When we apply the CRF to our test set, we use the lexicons based on all k partitions.
    Page 5, “Extraction with Lexicons”
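
Items 2 and 3 above quote the paper's linear-chain CRF and its log-linear distribution, but the equation itself is not reproduced in this index. For reference, the standard log-linear form of a linear-chain CRF (feature functions f_j, learned weights \theta_j, and partition function Z(x) are the usual textbook definitions, not notation taken from the paper) is:

    p(y | x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{j} \theta_j \, f_j(y_{t-1}, y_t, x, t) \Big)

where Z(x) sums the same exponential over all possible label sequences y. The lexicon features mentioned in items 1 and 6 enter this model simply as additional feature functions f_j.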

F1 score

Appears in 8 sentences as: F1 score (5) F1 scores (3)
In Learning 5000 Relational Extractors
  1. We evaluate the overall end-to-end performance of LUCHS, showing an F1 score of 61% when extracting relations from randomly selected Wikipedia pages.
    Page 2, “Introduction”
  2. Table 2: Lexicon and Gaussian features greatly improve F1 score (F1-LUCHS) over the baseline (F1-B), in particular for attributes with few training examples.
    Page 7, “Experiments”
  3. Figure 3 shows the distribution of obtained F1 scores.
    Page 7, “Experiments”
  4. Averaging across all attributes we obtain F1 scores of 0.56 and 0.60 for textual and numeric values respectively.
    Page 7, “Experiments”
  5. Figure 3: F1 scores among attributes, ranked by score.
    Page 7, “Experiments”
  6. Figure 4: Average F1 score by number of training examples.
    Page 7, “Experiments”
  7. Figure 5 shows the confusion matrix between attributes in the biggest clusters; the shade of the (i, j)-th pixel indicates the F1 score achieved by training on instances of attribute i and testing on attribute j.
    Page 7, “Experiments”
  8. We show an overall performance of 61% F1 score, and present experiments evaluating LUCHS's individual components.
    Page 9, “Conclusion”
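
For reference, the F1 scores quoted above are the harmonic mean of precision P and recall R:

    F_1 = \frac{2 P R}{P + R}

An F1 of 61% arises, for example, when precision and recall are both 0.61; unequal combinations such as P = 0.55 and R = 0.68 give roughly the same score.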

overfitting

Appears in 6 sentences as: overfit (1) Overfitting (1) overfitting (4)
In Learning 5000 Relational Extractors
  1. However, there is a danger of overfitting, which we discuss in Section 4.2.4.
    Page 3, “Extraction with Lexicons”
  2. 4.2.4 Preventing Lexicon Overfitting
    Page 5, “Extraction with Lexicons”
  3. If we now train the CRF on the same examples that generated the lexicon features, then the CRF will likely overfit, and weight the lexicon features too highly!
    Page 5, “Extraction with Lexicons”
  4. This avoids overfitting and ensures that we will not perform much worse than without lexicon features.
    Page 5, “Extraction with Lexicons”
  5. Without cross-training we observe a reduction in performance due to overfitting (see the sketch after this list).
    Page 6, “Experiments”
  6. Crucial to LUCHS's different setting is also the need to avoid overfitting.
    Page 8, “Related Work”
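
Items 3-5 describe the cross-training scheme that prevents lexicon overfitting: each training example must receive lexicon features computed from lexicons built without that example. A minimal sketch of such a k-fold scheme follows; build_lexicons and featurize are hypothetical stand-ins for LUCHS's lexicon induction and feature extraction, not the paper's actual interfaces.

    import random

    def cross_train_lexicon_features(examples, build_lexicons, featurize, k=10):
        """Attach lexicon features via k-fold cross-training (sketch).

        Each example is featurized using only lexicons derived from the
        other k-1 folds, so the CRF never trains on lexicon features that
        memorize its own sentences and cannot overweight them.
        """
        examples = list(examples)
        random.shuffle(examples)
        folds = [examples[i::k] for i in range(k)]
        featurized = []
        for i, held_out in enumerate(folds):
            # Build lexicons from every fold except the held-out one.
            rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            lexicons = build_lexicons(rest)
            featurized.extend(featurize(ex, lexicons) for ex in held_out)
        # At test time (item 10 of the CRF list above), lexicons built from
        # all k partitions are used instead.
        return featurized

This matches the guarantee in item 4: in the worst case the cross-trained lexicon features carry no signal, and performance falls back to the lexicon-free baseline rather than below it.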

precision and recall

Appears in 5 sentences as: precision and recall (5)
In Learning 5000 Relational Extractors
  1. Open extraction is more scalable, but has lower precision and recall.
    Page 1, “Introduction”
  2. We expect that lists with higher similarity are more likely to contain phrases which are related to our seeds; hence, by varying the similarity threshold one may produce lexicons representing different compromises between lexicon precision and recall (see the sketch after this list).
    Page 4, “Extraction with Lexicons”
  3. Open IE, self-supervised learning of unlexicalized, relation-independent extractors (Banko et al., 2007), is a more scalable approach, but suffers from lower precision and recall, and doesn't canonicalize the relations.
    Page 8, “Related Work”
  4. The goal of set expansion techniques is to generate high precision sets of related items; hence, these techniques are evaluated based on lexicon precision and recall.
    Page 8, “Related Work”
  5. Many researchers are trying to use IE to create large-scale knowledge bases from natural language text on the Web, but existing relation-specific techniques do not scale to the thousands of relations encoded in Web text, while relation-independent techniques suffer from lower precision and recall and do not canonicalize the relations.
    Page 9, “Conclusion”
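
Item 2 describes producing several lexicons per relation by sweeping a similarity threshold over the list corpus: a high threshold keeps only lists very similar to the seeds (higher precision), while a low threshold admits more phrases (higher recall). A minimal sketch of that sweep, assuming the similarity scores are precomputed (see the scoring sketch under "semi-supervised" below) and with illustrative threshold values:

    def lexicons_at_thresholds(scored_lists, seeds, thresholds=(0.9, 0.7, 0.5, 0.3)):
        """Build one lexicon per similarity threshold (sketch).

        scored_lists: (similarity, phrases) pairs, one per Web list.
        Returns lexicons ordered from high precision to high recall.
        """
        lexicons = []
        for t in thresholds:
            lexicon = set(seeds)
            for similarity, phrases in scored_lists:
                if similarity >= t:
                    lexicon.update(phrases)
            lexicons.append(lexicon)
        return lexicons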

relation extractors

Appears in 5 sentences as: relation extractors (2) relational extractor (1) relational extractors (2)
In Learning 5000 Relational Extractors
  1. This paper presents LUCHS, an autonomous, self-supervised system, which learns 5025 relational extractors — an order of magnitude greater than any previous effort.
    Page 1, “Introduction”
  2. In order to handle sparsity in its heuristically-generated training data, LUCHS generates custom lexicon features when learning each relational extractor.
    Page 2, “Introduction”
  3. Our experiments demonstrate a high F1 score, 61%, across the 5025 relational extractors learned.
    Page 2, “Introduction”
  4. We therefore choose a hierarchical approach that combines both article classifiers and relation extractors (see the sketch after this list).
    Page 2, “Learning Extractors”
  5. Only if an article is likely to contain a schema does LUCHS run that schema's relation extractors.
    Page 3, “Learning Extractors”
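
Items 4 and 5 describe the hierarchy: an article classifier first decides which infobox schemas a page is likely to contain, and only those schemas' relation extractors run. A minimal sketch of that gating logic, in which the classifier, the extractors, and the 0.5 threshold are all illustrative placeholders rather than the paper's actual components:

    def extract_relations(article, classify_schemas, extractors_by_schema,
                          threshold=0.5):
        """Hierarchical extraction (sketch): classify first, then run only
        the matching schemas' extractors rather than all 5025 of them."""
        tuples = []
        for schema, confidence in classify_schemas(article):
            if confidence < threshold:
                continue  # article unlikely to contain this schema
            for relation, extractor in extractors_by_schema[schema].items():
                tuples.extend((relation, value) for value in extractor(article))
        return tuples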

end-to-end

Appears in 4 sentences as: end-to-end (4)
In Learning 5000 Relational Extractors
  1. We evaluate the overall end-to-end performance of LUCHS, showing an F1 score of 61% when extracting relations from randomly selected Wikipedia pages.
    Page 2, “Introduction”
  2. We start by evaluating end-to-end performance of LUCHS when applied to Wikipedia text, then analyze the characteristics of its components.
    Page 5, “Experiments”
  3. Figure 2: Precision/recall curve for end-to-end system performance on 100 random articles.
    Page 5, “Experiments”
  4. To evaluate the end-to-end performance of LUCHS, we test the pipeline which first classifies incoming pages, activating a small set of extractors on the text.
    Page 5, “Experiments”

knowledge bases

Appears in 3 sentences as: knowledge base (1) knowledge bases (2)
In Learning 5000 Relational Extractors
  1. Many researchers are trying to use information extraction (IE) to create large-scale knowledge bases from natural language text on the Web.
    Page 1, “Abstract”
  2. Wikipedia is an ideal starting point for our long-term goal of creating a massive knowledge base of extracted facts for two reasons.
    Page 2, “Heuristic Generation of Training Data”
  3. Many researchers are trying to use IE to create large-scale knowledge bases from natural language text on the Web, but existing relation-specific techniques do not scale to the thousands of relations encoded in Web text, while relation-independent techniques suffer from lower precision and recall and do not canonicalize the relations.
    Page 9, “Conclusion”

learning algorithm

Appears in 3 sentences as: learning algorithm (3)
In Learning 5000 Relational Extractors
  1. When learning an extractor for relation R, LUCHS extracts seed phrases from R’s training data and uses a semi-supervised learning algorithm to create several relation-specific lexicons at different points on a precision-recall spectrum.
    Page 2, “Introduction”
  2. A learning algorithm expands the seed phrases into a set of lexicons.
    Page 4, “Extraction with Lexicons”
  3. The semantic lexicons are added as features to the CRF learning algorithm.
    Page 4, “Extraction with Lexicons”

semi-supervised

Appears in 3 sentences as: Semi-Supervised (1) semi-supervised (2)
In Learning 5000 Relational Extractors
  1. When learning an extractor for relation R, LUCHS extracts seed phrases from R’s training data and uses a semi-supervised learning algorithm to create several relation-specific lexicons at different points on a precision-recall spectrum.
    Page 2, “Introduction”
  2. Then Section 4.2 presents our semi-supervised algorithm for learning semantic lexicons from these lists (a scoring sketch follows this list).
    Page 3, “Extraction with Lexicons”
  3. 4.2 Semi-Supervised Learning of Lexicons
    Page 4, “Extraction with Lexicons”
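
This list and items 1-2 of "learning algorithm" above outline the same pipeline: seed phrases are pulled from the relation's heuristic training data, each list in the corpus is scored by its similarity to those seeds, and the scored lists feed the threshold sweep sketched under "precision and recall". The scoring step might look like the following, where Jaccard overlap is an illustrative stand-in since the paper's similarity measure is not quoted in this index:

    def score_lists(seeds, web_lists):
        """Score each Web list by its overlap with the relation's seed
        phrases (sketch; Jaccard similarity is illustrative only)."""
        seeds = set(seeds)
        scored = []
        for phrases in web_lists:
            phrases = set(phrases)
            if not phrases:
                continue
            similarity = len(phrases & seeds) / len(phrases | seeds)
            scored.append((similarity, phrases))
        return scored

Combined with lexicons_at_thresholds above, this yields the relation-specific semantic lexicons that enter the CRF as Boolean features.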
