Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs
Paşca, Marius and Van Durme, Benjamin

Article Structure

Abstract

A new approach to large-scale information extraction exploits both Web documents and query logs to acquire thousands of open-domain classes of instances, along with relevant sets of open-domain class attributes at precision levels previously obtained only on small-scale, manually-assembled classes.

Introduction

Current methods for large-scale information extraction take advantage of unstructured text available from either Web documents (Banko et al., 2007; Snow et al., 2006) or, more recently, logs of Web search queries (Pasca, 2007) to acquire useful knowledge with minimal supervision.

Extraction from Documents and Queries

2.1 Open-Domain Labeled Classes of Instances

Evaluation

3.1 Textual Data Sources

Related Work

4.1 Acquisition of Classes of Instances

Conclusion

In a departure from previous approaches to large-scale information extraction from unstructured text on the Web, this paper introduces a weakly-supervised extraction framework for mining useful knowledge from a combination of both documents and search query logs.

Topics

WordNet

Appears in 13 sentences as: WordNet (16)
In Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs
  1. Table 1: Class labels found in WordNet in original form, or found in WordNet after removal of leading words, or not found in WordNet at all
    Page 4, “Evaluation”
  2. Accuracy of Class Labels: Built over many years of manual construction efforts, lexical gold standards such as WordNet (Fellbaum, 1998) provide wide-coverage upper ontologies of the English language.
    Page 4, “Evaluation”
  3. Built-in morphological normalization routines make it straightforward to verify whether a class label (e.g., faculty members) exists as a concept in WordNet (e.g., faculty member).
    Page 4, “Evaluation”
  4. When an extracted label (e.g., central nervous system disorders) is not found in WordNet, it is looked up again after iteratively removing its leading words (e.g., nervous system disorders).
    Page 4, “Evaluation”
  5. american composers = {aaron copland, ...}, found in WordNet as composers, judged Y (correct); example row from Table 2
    Page 4, “Evaluation”
  6. Table 2: Correctness judgments for extracted classes whose class labels are found in WordNet only after removal of their leading words (C=Correctness, Y=correct, S=subjectively correct, N=incorrect)
    Page 4, “Evaluation”
  7. As shown in Table 1, less than half of the 4,583 extracted class labels (e.g., baseball players) are found in their original forms in WordNet.
    Page 4, “Evaluation”
  8. The majority of the class labels (2,614 out of 4,583) can be found in WordNet only after removal of one or more leading words (e.g., caribbean countries), which suggests that many of the class labels correspond to finer-grained, automatically-extracted concepts that are not available in the manually-built WordNet.
    Page 4, “Evaluation”
  9. A class label is: correct, if it captures a relevant concept although it could not be found in WordNet; subjectively correct, if it is relevant not in general but only in a particular context, either from a subjective viewpoint (e.g., modern appliances), or relative to a particular temporal anchor (e.g., current players), or in connection to a particular geographical area (e.g., area hospitals); or incorrect, if it does not capture any useful concept (e.g., multiple languages).
    Page 4, “Evaluation”
  10. It is worth emphasizing the importance of automatically-collected classes judged as relevant and not present in WordNet: caribbean countries, computer manufacturers, entertainment companies, market research firms are arguably very useful and should probably be considered as part of such manually-built resources.
    Page 4, “Evaluation”
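
Sentences 3 and 4 above describe the lookup used in the evaluation: a class label is checked against WordNet (relying on morphological normalization), and, if absent, it is looked up again after iteratively stripping its leading words. A minimal sketch of that check, assuming NLTK's WordNet interface rather than the authors' own tooling, could look like this:

    # Minimal sketch (not the authors' code): look up an extracted class label in
    # WordNet, retrying after iteratively removing its leading words.
    # Assumes NLTK with the WordNet corpus downloaded (nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn

    def wordnet_lookup(label):
        """Return (suffix, synsets) for the longest suffix of `label` that names a
        WordNet noun concept, or (None, []) if no suffix is found."""
        words = label.lower().split()
        while words:
            candidate = "_".join(words)                   # WordNet joins collocations with "_"
            synsets = wn.synsets(candidate, pos=wn.NOUN)  # morphy normalizes plurals
            if synsets:
                return " ".join(words), synsets
            words = words[1:]                             # drop the leading word and retry
        return None, []

    # "faculty members" should resolve directly (to faculty_member); the paper's
    # example "central nervous system disorders" is retried as "nervous system
    # disorders", "system disorders", and so on.
    print(wordnet_lookup("faculty members"))
    print(wordnet_lookup("central nervous system disorders"))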

gold-standard

Appears in 7 sentences as: gold-standard (7)
In Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs
  1. Rather than inspecting a random sample of classes, the evaluation validates the results against a reference set of 40 gold-standard classes that were manually assembled as part of previous work (Pasca, 2007).
    Page 5, “Evaluation”
  2. To evaluate the precision of the extracted instances, the manual label of each gold-standard class (e.g., SearchEngine) is mapped into a class label extracted from text (e.g., search engines).
    Page 5, “Evaluation”
  3. As shown in the first two columns of Table 3, the mapping into extracted class labels succeeds for 37 of the 40 gold-standard classes.
    Page 5, “Evaluation”
  4. For example, the gold-standard class SearchEngine contains 25 manually-collected instances, while the parallel class label search engines contains 133 automatically-extracted instances.
    Page 5, “Evaluation”
  5. Overall, the relative coverage of automatically-extracted instance sets with respect to manually-collected instance sets is 26.89%, as an average over the 37 gold-standard classes.
    Page 6, “Evaluation”
  6. Indeed, the manual inspection of the automatically-extracted instance sets indicates an average accuracy of 79.3% over the 37 gold-standard classes retained in the experiments.
    Page 6, “Evaluation”
  7. Precision at some rank N in the list is thus measured as the sum of the assigned values of the first N candidate attributes, divided by N. Accuracy of Class Attributes: Figure 3 plots precision values for ranks 1 through 50 of the lists of attributes extracted through several runs over the 37 gold-standard classes described in the previous section.
    Page 6, “Evaluation”
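
The precision measure quoted in sentence 7 above has a simple closed form; writing v(a_i) for the correctness value assigned to the candidate attribute at rank i, precision at rank N is

    \mathrm{Prec}@N = \frac{1}{N} \sum_{i=1}^{N} v(a_i)

that is, the mean of the values assigned to the top N candidates, exactly as the sentence states.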

gold standard

Appears in 4 sentences as: gold standard (3) gold standards (1)
In Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs
  1. Accuracy of Class Labels: Built over many years of manual construction efforts, lexical gold standards such as WordNet (Fellbaum, 1998) provide wide-coverage upper ontologies of the English language.
    Page 4, “Evaluation”
  2. A class from the gold standard consists of a manually-created class label (e.g., AircraftModel) associated with a manually-assembled, and therefore high-precision, set of representative instances of the class.
    Page 5, “Evaluation”
  3. The sizes of the instance sets available for each class in the gold standard are compared in the third through fifth columns of Table 3.
    Page 5, “Evaluation”
  4. Figure 3: Accuracy of attributes extracted based on manually assembled, gold standard (M) vs. automatically extracted (E) instance sets, for a few target classes (leftmost graphs) and as an average over all (37) target classes (rightmost graphs).
    Page 6, “Evaluation”
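
Sentence 2 above characterizes a gold-standard class as a manually-created label paired with a manually-assembled instance set, and sentence 3 compares those instance sets against the automatically extracted ones (Table 3). The sketch below illustrates that comparison; the class names, instances, and the coverage definition (share of gold instances recovered) are illustrative assumptions, not data or code from the paper.

    # Hypothetical sketch: compare a manually-assembled gold-standard class with the
    # instance set extracted for its mapped class label. All values are made up.
    gold_standard = {
        "SearchEngine": {"google", "yahoo", "altavista"},      # manually collected
    }
    extracted = {
        "search engines": {"google", "yahoo", "msn search"},   # automatically extracted
    }
    label_mapping = {"SearchEngine": "search engines"}          # manual label -> extracted label

    for gold_label, gold_instances in gold_standard.items():
        ext_instances = extracted.get(label_mapping.get(gold_label, ""), set())
        # Assumed definition of relative coverage: fraction of gold instances recovered.
        coverage = len(ext_instances & gold_instances) / len(gold_instances)
        print(gold_label, "->", label_mapping[gold_label],
              "| sizes:", len(gold_instances), "vs", len(ext_instances),
              "| relative coverage:", round(coverage, 2))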

random sample

Appears in 4 sentences as: random sample (4)
In Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs
  1. The collection of queries is a random sample of fully-anonymized queries in English submitted by Web users in 2006.
    Page 3, “Evaluation”
  2. To test whether that is the case, a random sample of 200 class labels, out of the 2,614 labels found to be potentially-useful specific concepts, is manually annotated as correct, subjectively correct, or incorrect, as shown in Table 2.
    Page 4, “Evaluation”
  3. Rather than inspecting a random sample of classes, the evaluation validates the results against a reference set of 40 gold-standard classes that were manually assembled as part of previous work (Pasca, 2007).
    Page 5, “Evaluation”
  4. Table 5 offers an alternative view on the quality of the attributes extracted for a random sample of 25 classes out of the larger set of 4,583 classes acquired from text.
    Page 7, “Evaluation”
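
Sentence 2 above describes drawing 200 of the 2,614 potentially-useful class labels for manual annotation; drawing such a sample without replacement is a one-liner (the label list below is a hypothetical stand-in):

    import random

    # Hypothetical stand-in for the 2,614 extracted class labels.
    class_labels = [f"label_{i}" for i in range(2614)]

    # 200 labels drawn uniformly at random, without replacement, for manual annotation.
    sample = random.sample(class_labels, 200)
    print(len(sample), sample[:3])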

vector representations

Appears in 4 sentences as: vector representation (1) vector representations (3)
In Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs
  1. 2) construction of internal search-signature vector representations for each candidate attribute, based
    Page 3, “Extraction from Documents and Queries”
  2. 3) construction of a reference internal search-signature vector representation for a small set of seed attributes provided as input.
    Page 3, “Extraction from Documents and Queries”
  3. 4) ranking of candidate attributes with respect to each class (e.g., movies), by computing similarity scores between their individual vector representations and the reference vector of the seed attributes.
    Page 3, “Extraction from Documents and Queries”
  4. To this effect, the extraction includes modifications such that only one reference vector is constructed internally from the seed attributes during the third stage, rather than one such vector for each class as in (Pasca, 2007); and similarity scores are computed cross-class by comparing vector representations of individual candidate attributes against the only reference vector available during the fourth stage, rather than with respect to the reference vector of each class as in (Pasca, 2007).
    Page 3, “Extraction from Documents and Queries”
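
Stages 2 through 4, quoted above, amount to: build a search-signature vector for every candidate attribute, build a single reference vector from the seed attributes, and rank candidates by how similar their vectors are to that reference. The sketch below shows the shape of that computation with made-up query-context counts and Jensen-Shannon divergence as the dissimilarity measure (the measure named under the similarity scores topic); the attribute names, counts, and vector construction are placeholders, not the paper's extraction code.

    # Illustrative sketch of stages 2-4: rank candidate attributes by comparing their
    # search-signature vectors (here, toy distributions over query-context words)
    # against a single reference vector built from the seed attributes.
    from collections import Counter
    from math import log2

    def normalize(counts):
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def jensen_shannon(p, q):
        """Jensen-Shannon divergence between two discrete distributions (dicts)."""
        m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}
        def kl(a):
            return sum(av * log2(av / m[w]) for w, av in a.items() if av > 0)
        return 0.5 * (kl(p) + kl(q))

    # Stage 2: toy search-signature counts (words co-occurring with each attribute in queries).
    seed_attributes = {"cast": Counter({"of": 10, "the": 6, "movie": 4}),
                       "director": Counter({"of": 8, "the": 5, "movie": 3})}
    candidates = {"soundtrack": Counter({"of": 9, "the": 5, "movie": 2}),
                  "zip code": Counter({"for": 7, "lookup": 3})}

    # Stage 3: one reference vector built from all seed attributes combined.
    reference = normalize(sum(seed_attributes.values(), Counter()))

    # Stage 4: rank candidates; smaller divergence from the reference means a better attribute.
    ranked = sorted(candidates, key=lambda a: jensen_shannon(normalize(candidates[a]), reference))
    print(ranked)  # "soundtrack" should precede "zip code"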

similarity scores

Appears in 3 sentences as: similarity scores (3)
In Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs
  1. 4) ranking of candidate attributes with respect to each class (e.g., movies), by computing similarity scores between their individual vector representations and the reference vector of the seed attributes.
    Page 3, “Extraction from Documents and Queries”
  2. To this effect, the extraction includes modifications such that only one reference vector is constructed internally from the seed attributes during the third stage, rather than one such vector for each class as in (Pasca, 2007); and similarity scores are computed cross-class by comparing vector representations of individual candidate attributes against the only reference vector available during the fourth stage, rather than with respect to the reference vector of each class as in (Pasca, 2007).
    Page 3, “Extraction from Documents and Queries”
  3. Internally, the ranking uses the Jensen-Shannon divergence (Lee, 1999) to compute similarity scores between internal representations of the seed attributes, on one hand, and each of the candidate attributes, on the other hand.
    Page 6, “Evaluation”
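
Sentence 3 above names the measure behind these scores; for reference, the Jensen-Shannon divergence between two distributions P and Q, as used in distributional similarity work such as Lee (1999), is

    \mathrm{JS}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
    \qquad M = \tfrac{1}{2}(P + Q)

where KL is the Kullback-Leibler divergence; lower values mean the candidate attribute's search-signature distribution is closer to that of the seed attributes.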
