Zero-shot Entity Extraction from Web Pages
Pasupat, Panupong and Liang, Percy

Article Structure

Abstract

In order to extract entities of a fine- grained category from semistructured data in web pages, existing information extraction systems rely on seed examples or redundancy across multiple web pages.

Introduction

We consider the task of extracting entities of a given category (e.g., hiking trails) from web pages.

Problem statement

We define the zero-shot entity extraction task as follows: let at be a natural language query (e.g., hiking trails near Baltimore), and to be a web page.

Approach

Figure 3 shows the framework of our system.

Experiments

In this section we evaluate our system on the OPENWEB dataset.

Discussion

Our work shares a base with the wrapper induction literature (Kushmerick, 1997) in that it leverages regularities of web page structures.

Topics

structural features

Appears in 13 sentences as: Structural feature (1) Structural features (3) structural features (9)
In Zero-shot Entity Extraction from Web Pages
  1. To generalize across different inputs, we rely on two types of features: structural features , which look at the layout and placement of the entities being extracted; and denotation fea-
    Page 1, “Introduction”
  2. The final feature vector is the concatenation of structural features gbs(w,z), which consider the selected nodes in the DOM tree, and denotation features gbd(:c, y), which look at the extracted entities.
    Page 4, “Approach”
  3. One main focus of our work is finding good feature representations for a list of objects (DOM tree nodes for structural features and entity strings for denotation features).
    Page 4, “Approach”
  4. Structural feature Value Features on selected nodes:
    Page 5, “Approach”
  5. 3.3.2 Structural features
    Page 5, “Approach”
  6. To capture this, we define structural features gbs(w, 2), which consider the properties of the selected nodes in the DOM tree, as follows:
    Page 5, “Approach”
  7. Structural features are not powerful enough to distinguish between entity lists appearing in similar structures such as columns of the same table or fields of the same record.
    Page 6, “Approach”
  8. Setting Acc A@5 All features 41.1 :I: 3.4 58.4 :I: 2.7 Oracle 68.7 :I: 2.4 68.7 :I: 2.4 (Section 4.5) Structural features only 36.2 :I: 1.9 54.5 :I: 2.5 Denotation features only 19.8 :I: 2.5 41.7 :I: 2.7 (Section 4.6) Structural + query-denotation 41.7 :I: 2.5 58.1 :I: 2.4 Query-denotation features only 25.0 :I: 2.3 48.0 :I: 2.7 Concat.
    Page 7, “Experiments”
  9. We observe that denotation features improves accuracy on top of structural features .
    Page 7, “Experiments”
  10. On the other hand, structural features prevent the system from selecting random entities outside the main part of the page.
    Page 7, “Experiments”
  11. And without structural features , the system selects the hidden navigation links from the top of the page.
    Page 8, “Experiments”

See all papers in Proc. ACL 2014 that mention structural features.

See all papers in Proc. ACL that mention structural features.

Back to top.

natural language

Appears in 9 sentences as: Natural Language (1) natural language (8)
In Zero-shot Entity Extraction from Web Pages
  1. In this paper, we consider a new zero-shot learning task of extracting entities specified by a natural language query (in place of seeds) given only a single web page.
    Page 1, “Abstract”
  2. In this paper, we propose a novel task, zero-shot entity extraction, where the specification of the desired entities is provided as a natural language query.
    Page 1, “Introduction”
  3. In our setting, we take as input a natural language query and extract entities from a single web page.
    Page 1, “Introduction”
  4. For evaluation, we created the OPENWEB dataset comprising natural language queries from the Google Suggest API and diverse web pages returned from web search.
    Page 2, “Introduction”
  5. We define the zero-shot entity extraction task as follows: let at be a natural language query (e.g., hiking trails near Baltimore), and to be a web page.
    Page 2, “Problem statement”
  6. In our case, we only have the natural language query, which presents the more difficult problem of associating the entity class in the query (e.g., hiking trails) to concrete entities (e.g., Avalon Super Loop).
    Page 9, “Discussion”
  7. Another related line of work is information extraction from text, which relies on natural language patterns to extract categories and relations of entities.
    Page 9, “Discussion”
  8. In future work, we would like to explore the issue of compositionality in queries by aligning linguistic structures in natural language with the relative position of entities on web pages.
    Page 9, “Discussion”
  9. We gratefully acknowledge the support of the Google Natural Language Understanding Focused Program.
    Page 9, “Discussion”

See all papers in Proc. ACL 2014 that mention natural language.

See all papers in Proc. ACL that mention natural language.

Back to top.

part-of-speech

Appears in 5 sentences as: part-of-speech (5)
In Zero-shot Entity Extraction from Web Pages
  1. For example, we can map each DOM tree node onto an integer equal to the number of children, or map each entity string onto its part-of-speech tag sequence.
    Page 5, “Approach”
  2. For abstract tokens with finitely many possible values (e. g., part-of-speech ), we also use the normalized
    Page 5, “Approach”
  3. On the other hand, random words on the web page tend to have more diverse lengths and part-of-speech tags.
    Page 6, “Approach”
  4. o PHRASEPOS and WORDPOS ( part-of-speech tags for whole phrases and individual words)
    Page 6, “Approach”
  5. For example, queries mayors of Chicago and universities in Chicago will produce entities of different lengths, part-of-speech sequences, and word distributions.
    Page 8, “Experiments”

See all papers in Proc. ACL 2014 that mention part-of-speech.

See all papers in Proc. ACL that mention part-of-speech.

Back to top.

extraction systems

Appears in 4 sentences as: extraction systems (4)
In Zero-shot Entity Extraction from Web Pages
  1. In order to extract entities of a fine- grained category from semistructured data in web pages, existing information extraction systems rely on seed examples or redundancy across multiple web pages.
    Page 1, “Abstract”
  2. We represent each web page 212 as a DOM tree, a common representation among wrapper induction and web information extraction systems (Sahuguet and Azavant, 1999; Liu et al., 2000; Crescenzi et al., 2001).
    Page 3, “Approach”
  3. In the literature, many information extraction systems employ more versatile extraction predicates (Wang and Cohen, 2009; Fumarola et al., 2011).
    Page 3, “Approach”
  4. In contrast to information extraction systems that extract homogeneous records from web pages (Liu et al., 2003; Zheng et al., 2009), our system must choose the correct field from each record and also identify the relevant part of the page based on the query.
    Page 9, “Discussion”

See all papers in Proc. ACL 2014 that mention extraction systems.

See all papers in Proc. ACL that mention extraction systems.

Back to top.

feature vector

Appears in 4 sentences as: feature vector (4) feature vectors (1)
In Zero-shot Entity Extraction from Web Pages
  1. where 6 6 Rd is the parameter vector and gb(:c, w, z) is the feature vector , which will be defined in Section 3.3.
    Page 4, “Approach”
  2. To construct the log-linear model, we define a feature vector gb(:c, w, z) for each query at, web page 212, and extraction predicate z.
    Page 4, “Approach”
  3. The final feature vector is the concatenation of structural features gbs(w,z), which consider the selected nodes in the DOM tree, and denotation features gbd(:c, y), which look at the extracted entities.
    Page 4, “Approach”
  4. One approach is to define the feature vector of a list to be the sum of the feature vectors of individual elements.
    Page 4, “Approach”

See all papers in Proc. ACL 2014 that mention feature vector.

See all papers in Proc. ACL that mention feature vector.

Back to top.

log-linear

Appears in 3 sentences as: log-linear (3)
In Zero-shot Entity Extraction from Web Pages
  1. Our approach defines a log-linear model over latent extraction predicates, which select lists of entities from the web page.
    Page 1, “Abstract”
  2. Given a query cc and a web page 212, we define a log-linear distribution over all extraction predicates z E Z(w) as
    Page 4, “Approach”
  3. To construct the log-linear model, we define a feature vector gb(:c, w, z) for each query at, web page 212, and extraction predicate z.
    Page 4, “Approach”

See all papers in Proc. ACL 2014 that mention log-linear.

See all papers in Proc. ACL that mention log-linear.

Back to top.

part-of-speech tags

Appears in 3 sentences as: part-of-speech tag (1) part-of-speech tags (2)
In Zero-shot Entity Extraction from Web Pages
  1. For example, we can map each DOM tree node onto an integer equal to the number of children, or map each entity string onto its part-of-speech tag sequence.
    Page 5, “Approach”
  2. On the other hand, random words on the web page tend to have more diverse lengths and part-of-speech tags .
    Page 6, “Approach”
  3. o PHRASEPOS and WORDPOS ( part-of-speech tags for whole phrases and individual words)
    Page 6, “Approach”

See all papers in Proc. ACL 2014 that mention part-of-speech tags.

See all papers in Proc. ACL that mention part-of-speech tags.

Back to top.