Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project
Flati, Tiziano and Vannella, Daniele and Pasini, Tommaso and Navigli, Roberto

Article Structure

Abstract

We present WiBi, an approach to the automatic creation of a bitaxonomy for Wikipedia, that is, an integrated taxonomy of Wikipedia pages and categories.

Introduction

Knowledge has unquestionably become a key component of current intelligent systems in many fields of Artificial Intelligence.

WiBi: A Wikipedia Bitaxonomy

We induce a Wikipedia bitaxonomy, i.e., a taxonomy of pages and categories, in three phases:

Phase 1: Inducing the Page Taxonomy

The goal of the first phase is to induce a taxonomy of Wikipedia pages.

Phase 2: Inducing the Bitaxonomy

The page taxonomy built in Section 3 will serve as a stable, pivotal input to the second phase, the aim of which is to build our bitaxonomy, that is, a taxonomy of pages and categories.

Phase 3: Category taxonomy refinement

As the final phase, we refine and enrich the category taxonomy.

Related Work

Although the extraction of taxonomies from machine-readable dictionaries was already being studied in the early 1970s (Calzolari et al., 1973), pioneering work on large amounts of data only appeared in the 1990s (Hearst, 1992; Ide and Veronis, 1993).

Comparative Evaluation

7.1 Experimental Setup

Conclusions

In this paper we have presented WiBi, an automatic 3-phase approach to the construction of a bitaxonomy for the English Wikipedia, i.e., a full-fledged, integrated page and category taxonomy: first, using a set of high-precision linkers, the page taxonomy is populated; next, a fixed point algorithm populates the category taxonomy while enriching the page taxonomy iteratively; finally, the category taxonomy undergoes structural refinements.
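
Read as a whole, the conclusion describes a strict pipeline. The skeleton below restates that control flow in Python; all three helpers are hypothetical stubs standing in for the paper's components (the linkers, the fixed-point loop, the refinements), so the sketch is runnable but deliberately minimal and is not the authors' implementation.

```python
# Skeleton of WiBi's three phases. Every helper is a hypothetical stub for
# the corresponding component of the paper, not the authors' code.

def induce_page_taxonomy(pages):
    # Phase 1: high-precision linkers parse each page's first sentence,
    # extract hypernym lemmas and disambiguate them against the page
    # inventory. Stub: returns empty hypernym sets.
    return {p: set() for p in pages}

def run_fixed_point(page_tax, categories):
    # Phase 2: a fixed point algorithm populates the category taxonomy
    # while iteratively enriching the page taxonomy (sketched in more
    # detail under the "iteratively" topic below). Stub output.
    return page_tax, {c: set() for c in categories}

def refine_category_taxonomy(cat_tax):
    # Phase 3: structural refinement of the category taxonomy. Stub.
    return cat_tax

def build_wibi(pages, categories):
    page_tax = induce_page_taxonomy(pages)
    page_tax, cat_tax = run_fixed_point(page_tax, categories)
    return page_tax, refine_category_taxonomy(cat_tax)
```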

Topics

hypernym

Appears in 79 sentences as: Hypernym (3) hypernym (58) hypernyms (37)
In Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project
  1. However, unlike the case with smaller manually-curated resources such as WordNet (Fellbaum, 1998), in many large automatically-created resources the taxonomical information is either missing, mixed across resources, e.g., linking Wikipedia categories to WordNet synsets as in YAGO, or coarse-grained, as in DBpedia whose hypernyms link to a small upper taxonomy.
    Page 1, “Introduction”
  2. Creation of the initial page taxonomy: we first create a taxonomy for the Wikipedia pages by parsing textual definitions, extracting the hypernym (s) and disambiguating them according to the page inventory.
    Page 2, “WiBi: A Wikipedia Bitaxonomy”
  3. At each iteration, the links in the page taxonomy are used to identify category hypernyms and, conversely, the new category hypernyms are used to identify more page hypernyms.
    Page 2, “WiBi: A Wikipedia Bitaxonomy”
  4. For each p ∈ P our aim is to identify the most suitable generalization ph ∈ P so that we can create the edge (p, ph) and add it to E. For instance, given the page APPLE, which represents the fruit meaning of apple, we want to determine that its hypernym is FRUIT and add the hypernym edge connecting the two pages (i.e., E := E ∪ {(APPLE, FRUIT)}).
    Page 2, “Phase 1: Inducing the Page Taxonomy”
  5. 3.1 Syntactic step: hypernym extraction
    Page 2, “Phase 1: Inducing the Page Taxonomy”
  6. In the syntactic step, for each page p ∈ P, we extract zero, one or more hypernym lemmas, that is, we output potentially ambiguous hypernyms for the page.
    Page 2, “Phase 1: Inducing the Page Taxonomy”
  7. the Wikipedia guidelines and is validated in the literature (Navigli and Velardi, 2010; Navigli and Ponzetto, 2012), is that the first sentence of each Wikipedia page p provides a textual definition for the concept represented by p. The second assumption we build upon is the idea that a lexical taxonomy can be obtained by extracting hypernyms from textual definitions.
    Page 2, “Phase 1: Inducing the Page Taxonomy”
  8. To extract hypernym lemmas, we draw on the notion of copula, that is, the relation between the complement of a copular verb and the copular verb itself.
    Page 2, “Phase 1: Inducing the Page Taxonomy”
  9. The noun involved in the copula relation is actress and thus it is taken as the page’s hypernym lemma.
    Page 2, “Phase 1: Inducing the Page Taxonomy”
  10. However, the extracted hypernym is sometimes overgeneral (one, kind, type, etc.).
    Page 2, “Phase 1: Inducing the Page Taxonomy”
  11. To cope with this problem we use a list of stop-words. When such a term is extracted as hypernym, we replace it with the rightmost noun of the first following noun sequence (e.g., deity in the above example).
    Page 2, “Phase 1: Inducing the Page Taxonomy”
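
Items 5–11 above spell out the syntactic step end to end: take the page's first sentence as a definition, extract the complement of the copular verb as the hypernym lemma, back off from overgeneral terms via a stop list, and follow conjunctions to collect multiple hypernyms. Below is a minimal sketch of that recipe using spaCy's dependency labels; the paper itself uses Stanford-style collapsed dependencies (conj_and, conj_or), and the stop list here is an illustrative subset, so this is an approximation rather than the authors' extractor.

```python
# Sketch of copula-based hypernym-lemma extraction from a definition
# sentence. spaCy's labels ("attr" for the copular complement, conjuncts
# for coordination) stand in for the paper's Stanford-style dependencies;
# STOP_HYPERNYMS is an illustrative subset of the paper's stop-word list.
import spacy

nlp = spacy.load("en_core_web_sm")

STOP_HYPERNYMS = {"one", "kind", "type", "sort", "form"}

def rightmost_of_first_noun_seq(doc, start):
    """Rightmost noun of the first noun sequence after position `start`."""
    i = start
    while i < len(doc) and doc[i].pos_ not in ("NOUN", "PROPN"):
        i += 1
    last = None
    while i < len(doc) and doc[i].pos_ in ("NOUN", "PROPN"):
        last, i = doc[i], i + 1
    return last

def extract_hypernym_lemmas(definition):
    """Return zero, one or more (still ambiguous) hypernym lemmas."""
    doc = nlp(definition)
    lemmas = []
    for tok in doc:
        # The complement of a copular "be" is the candidate hypernym:
        # "Whoopi Goldberg is an American actress ..." -> actress
        if tok.dep_ == "attr" and tok.head.lemma_ == "be":
            head = tok
            if head.lemma_ in STOP_HYPERNYMS:
                # Overgeneral term: back off to the rightmost noun of the
                # first following noun sequence ("one of the Hindu
                # deities" -> deities).
                head = rightmost_of_first_noun_seq(doc, head.i + 1) or head
            lemmas.append(head.lemma_)
            # Multiple hypernyms via coordination (the paper follows the
            # conj_and and conj_or relations).
            lemmas.extend(c.lemma_ for c in head.conjuncts)
    return lemmas

print(extract_hypernym_lemmas(
    "Whoopi Goldberg is an American actress, comedian and television host."))
# e.g. ['actress', 'comedian', 'host']
```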

WordNet

Appears in 13 sentences as: WordNet (15) WordNet’s (1)
In Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project
  1. However, unlike the case with smaller manually-curated resources such as WordNet (Fellbaum, 1998), in many large automatically-created resources the taxonomical information is either missing, mixed across resources, e.g., linking Wikipedia categories to WordNet synsets as in YAGO, or coarse-grained, as in DBpedia whose hypernyms link to a small upper taxonomy.
    Page 1, “Introduction”
  2. Ruiz-Casado et al. (2005) provide a general vector-based method which, however, is incapable of linking pages which do not have a WordNet counterpart.
    Page 1, “Introduction”
  3. Higher coverage is provided by de Melo and Weikum (2010) thanks to the use of a set of effective heuristics, however, the approach also draws on WordNet and sense frequency information.
    Page 1, “Introduction”
  4. In this paper we address the task of taxonomizing Wikipedia in a way that is fully independent of other existing resources such as WordNet.
    Page 1, “Introduction”
  5. However, these methods do not link terms to existing knowledge resources such as WordNet, whereas those that explicitly link do so by adding new leaves to the existing taxonomy instead of acquiring wide-coverage taxonomies from scratch (Pantel and Ravichandran, 2004; Snow et al., 2006).
    Page 7, “Related Work”
  6. Other approaches, such as YAGO (Suchanek et al., 2008; Hoffart et al., 2013), yield a taxonomical backbone by linking Wikipedia categories to WordNet.
    Page 7, “Related Work”
  7. However, the categories are linked to the first, i.e., most frequent, sense of the category head in WordNet, involving only leaf categories in the linking.
    Page 7, “Related Work”
  8. Our work differs from the others in at least three respects: first, in marked contrast to most other resources, but similarly to WikiNet and WikiTaxonomy, our resource is self-contained and does not depend on other resources such as WordNet; second, we address the taxonomization task on both sides, i.e., pages and categories, by providing an algorithm which mutually and iteratively transfers knowledge from one side of the bitaxonomy to the other; third, we provide a wide-coverage bitaxonomy closer in structure and granularity to a manual WordNet-like taxonomy, in contrast, for example, to DBpedia's flat entity-focused hierarchy.
    Page 7, “Related Work”
  9. Since WordNet’s average height is 8.07, we deem WiBi to be the resource structurally closest to WordNet.
    Page 7, “Related Work”
  10. As regards recall, we note that in two cases (i.e., DBpedia returning page super-types from its upper taxonomy, YAGO linking categories to WordNet synsets) the generalizations are neither pages nor categories and that MENTA returns heterogeneous hypernyms as mixed sets of WordNet synsets, Wikipedia pages and categories.
    Page 8, “Comparative Evaluation”
  11. MENTA seems to be the closest resource to ours, however, we remark that the hypernyms output by MENTA are very heterogeneous: 48% of answers are represented by a WordNet synset, 37% by Wikipedia categories and 15% are Wikipedia pages.
    Page 8, “Comparative Evaluation”

iteratively

Appears in 5 sentences as: iteratively (5)
In Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project
  1. Finally, to capture multiple hypernyms, we iteratively follow the conj_and and conj_or relations starting from the initially extracted hypernym.
    Page 2, “Phase 1: Inducing the Page Taxonomy”
  2. In the following we describe the core algorithm of our approach, which iteratively and mutually populates and refines the edge sets E(Tp) and E(Tc).
    Page 4, “Phase 2: Inducing the Bitaxonomy”
  3. Figure 4b shows the performance trend as the algorithm iteratively covers more and more categories.
    Page 6, “Phase 3: Category taxonomy refinement”
  4. Our work differs from the others in at least three respects: first, in marked contrast to most other resources, but similarly to WikiNet and WikiTaxonomy, our resource is self-contained and does not depend on other resources such as WordNet; second, we address the taxonomization task on both sides, i.e., pages and categories, by providing an algorithm which mutually and iteratively transfers knowledge from one side of the bitaxonomy to the other; third, we provide a wide-coverage bitaxonomy closer in structure and granularity to a manual WordNet-like taxonomy, in contrast, for example, to DBpedia's flat entity-focused hierarchy.
    Page 7, “Related Work”
  5. In this paper we have presented WiBi, an automatic 3-phase approach to the construction of a bitaxonomy for the English Wikipedia, i.e., a full-fledged, integrated page and category taxonomy: first, using a set of high-precision linkers, the page taxonomy is populated; next, a fixed point algorithm populates the category taxonomy while enriching the page taxonomy iteratively; finally, the category taxonomy undergoes structural refinements.
    Page 9, “Conclusions”
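
The five sentences above all refer to the same mechanism: hypernym edges on one side of the bitaxonomy license new edges on the other, and the process repeats until a full pass adds nothing. The sketch below shows that fixed point coarsely, assuming dict-based edge sets and page-to-category membership links; only the page-to-category direction is spelled out, the reverse transfer is a placeholder, and nothing here reproduces the paper's actual high-precision linkers.

```python
# Coarse sketch of the Phase-2 fixed point. Edge sets are dicts mapping a
# node to the set of its hypernyms; page_cats maps each page to the
# categories it belongs to. Only the page -> category direction is spelled
# out; the reverse direction is a no-op placeholder.

def pages_to_categories(page_hyp, cat_hyp, page_cats):
    """If page p has hypernym ph, propose c -> ch for each category c of p
    and ch of ph (a crude stand-in for the paper's page-driven linker)."""
    added = False
    for p, hypers in page_hyp.items():
        for ph in hypers:
            for c in page_cats.get(p, ()):
                for ch in page_cats.get(ph, ()):
                    if ch != c and ch not in cat_hyp.setdefault(c, set()):
                        cat_hyp[c].add(ch)
                        added = True
    return added

def categories_to_pages(cat_hyp, page_hyp, cat_pages):
    """Placeholder for the symmetric transfer, in which new category
    hypernyms are used to identify more page hypernyms."""
    return False  # no-op in this sketch

def bitaxonomy_fixed_point(page_hyp, cat_hyp, page_cats, cat_pages):
    changed = True
    while changed:  # a full pass that adds no edge means convergence
        changed = pages_to_categories(page_hyp, cat_hyp, page_cats)
        changed = categories_to_pages(cat_hyp, page_hyp, cat_pages) or changed
    return page_hyp, cat_hyp
```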

knowledge bases

Appears in 3 sentences as: knowledge bases (3)
In Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project
  1. The creation and use of machine-readable knowledge has not only entailed researchers (Mitchell, 2005; Mirkin et al., 2009; Poon et al., 2010) developing huge, broad-coverage knowledge bases (Hovy et al., 2013; Suchanek and Weikum, 2013), but it has also hit big industry players such as Google (Singhal, 2012) and IBM (Ferrucci, 2012), which are moving fast towards large-scale knowledge-oriented systems.
    Page 1, “Introduction”
  2. The creation of very large knowledge bases has been made possible by the availability of collaboratively-curated online resources such as Wikipedia and Wiktionary.
    Page 1, “Introduction”
  3. A second project, MENTA (de Melo and Weikum, 2010), creates one of the largest multilingual lexical knowledge bases by interconnecting more than 13M articles in 271 languages.
    Page 7, “Related Work”

precision and recall

Appears in 3 sentences as: precision and recall (3)
In Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project
  1. Not only does our taxonomy show high precision and recall in extracting ambiguous hypernyms, it also disambiguates more than 3/4 of the hypernyms with high precision.
    Page 4, “Phase 1: Inducing the Page Taxonomy”
  2. MENTA is the closest system to ours, obtaining slightly higher precision and recall.
    Page 8, “Comparative Evaluation”
  3. Notably, however, MENTA outputs the first WordNet sense of entity for 13.17% of all the given answers, which, despite being correct and accounted for in precision and recall, is uninformative.
    Page 8, “Comparative Evaluation”

randomly sampled

Appears in 3 sentences as: randomly sampled (3)
In Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project
  1. Taxonomy quality: To evaluate the quality of our page taxonomy we randomly sampled 1,000 Wikipedia pages.
    Page 4, “Phase 1: Inducing the Page Taxonomy”
  2. It was established by selecting the combination, among all possible permutations, which maximized precision on a tuning set of 100 randomly sampled pages, disjoint from our page dataset.
    Page 4, “Phase 1: Inducing the Page Taxonomy”
  3. Category taxonomy quality: To estimate the quality of the category taxonomy, we randomly sampled 1,000 categories and, for each of them, we manually associated the super-categories which were deemed to be appropriate hypernyms.
    Page 6, “Phase 3: Category taxonomy refinement”
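
All three uses of sampling above follow the same recipe: draw a fixed-size random sample, annotate it manually, and estimate quality from the annotated pairs. A minimal sketch of that estimate follows, assuming a gold dict of manually associated hypernyms for the sampled nodes; the names and the scoring convention are illustrative, not the paper's evaluation code.

```python
# Minimal sketch of the sample-based quality estimate. `taxonomy` maps a
# node to its predicted hypernyms; `gold` maps each manually annotated
# node to the hypernyms deemed appropriate. A prediction counts as correct
# here if it overlaps the gold set (an assumed convention).
import random

def sample_nodes(nodes, k=1000, seed=42):
    # Fixed seed so the manually annotated sample stays reproducible.
    return random.Random(seed).sample(sorted(nodes), k)

def precision_recall(taxonomy, gold, sample):
    answered = [n for n in sample if taxonomy.get(n)]
    correct = sum(1 for n in answered if taxonomy[n] & gold[n])
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(sample)
    return precision, recall
```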

synsets

Appears in 3 sentences as: synset (1) synsets (3)
In Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project
  1. However, unlike the case with smaller manually-curated resources such as WordNet (Fellbaum, 1998), in many large automatically-created resources the taxonomical information is either missing, mixed across resources, e.g., linking Wikipedia categories to WordNet synsets as in YAGO, or coarse-grained, as in DBpedia whose hypernyms link to a small upper taxonomy.
    Page 1, “Introduction”
  2. As regards recall, we note that in two cases (i.e., DBpedia returning page super-types from its upper taxonomy, YAGO linking categories to WordNet synsets) the generalizations are neither pages nor categories and that MENTA returns heterogeneous hypernyms as mixed sets of WordNet synsets, Wikipedia pages and categories.
    Page 8, “Comparative Evaluation”
  3. MENTA seems to be the closest resource to ours, however, we remark that the hypernyms output by MENTA are very heterogeneous: 48% of answers are represented by a WordNet synset, 37% by Wikipedia categories and 15% are Wikipedia pages.
    Page 8, “Comparative Evaluation”
