BabelNet: Building a Very Large Multilingual Semantic Network
Navigli, Roberto and Ponzetto, Simone Paolo

Article Structure

Abstract

In this paper we present BabelNet — a very large, wide-coverage multilingual semantic network.

Introduction

In many research areas of Natural Language Processing (NLP) lexical knowledge is exploited to perform tasks effectively.

BabelNet

We encode knowledge as a labeled directed graph G = (V, E) where V is the set of vertices — i.e.
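The labeled directed graph described here can be sketched as a small adjacency structure; all class and relation names below are illustrative stand-ins, not the paper's implementation:

```python
# Minimal sketch of a labeled directed graph G = (V, E): vertices are
# concepts, and each edge carries a semantic-relation label from R.
# Names ("SemanticGraph", the relation labels) are hypothetical.
from collections import defaultdict

class SemanticGraph:
    def __init__(self):
        # vertex -> list of (relation_label, target_vertex) pairs
        self.edges = defaultdict(list)

    def add_edge(self, source, relation, target):
        self.edges[source].append((relation, target))

    def neighbors(self, vertex, relation=None):
        # Optionally filter outgoing edges by their relation label.
        return [t for r, t in self.edges[vertex]
                if relation is None or r == relation]

g = SemanticGraph()
g.add_edge("balloon", "is-a", "aircraft")
g.add_edge("balloon", "related-to", "Montgolfier brothers")
print(g.neighbors("balloon"))
print(g.neighbors("balloon", "is-a"))
```

Storing the relation label on the edge (rather than typing the vertices) mirrors the paper's design, where labeled WordNet pointers and unlabeled Wikipedia links coexist in one graph.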

Methodology

3.1 Knowledge Resources

Experiment 1: Mapping Evaluation

Experimental setting.

Experiment 2: Translation Evaluation

We perform a second set of experiments concerning the quality of the acquired concepts.

Related Work

Previous attempts to manually build multilingual resources have led to the creation of a multitude of wordnets such as EuroWordNet (Vossen, 1998), MultiWordNet (Pianta et al., 2002), BalkaNet (Tufiş et al., 2004), Arabic WordNet (Black et al., 2006), the Multilingual Central Repository (Atserias et al., 2004), bilingual electronic dictionaries such as EDR (Yokoi, 1995), and fully-fledged frameworks for the development of multilingual lexicons (Lenci et al., 2000).

Conclusions

In this paper we have presented a novel methodology for the automatic construction of a large multilingual lexical knowledge resource.

Topics

WordNet

Appears in 73 sentences as: WOrdnet (1) WordNet (71) wordnet (1) wordnets (8)
In BabelNet: Building a Very Large Multilingual Semantic Network
  1. The resource is automatically constructed by means of a methodology that integrates lexicographic and encyclopedic knowledge from WordNet and Wikipedia.
    Page 1, “Abstract”
  2. A pioneering endeavor was WordNet (Fellbaum, 1998), a computational lexicon of English based on psycholinguistic theories.
    Page 1, “Introduction”
Wikipedia represents the perfect complement to WordNet, as it provides multilingual lexical knowledge of a mostly encyclopedic nature.
    Page 1, “Introduction”
But while a great deal of work has been recently devoted to the automatic extraction of structured information from Wikipedia (Wu and Weld, 2007; Ponzetto and Strube, 2007; Suchanek et al., 2008; Medelyan et al., 2009, inter alia), the knowledge extracted is organized in a looser way than in a computational lexicon such as WordNet.
    Page 1, “Introduction”
  5. This resource is created by linking Wikipedia to WordNet via an automatic mapping and by integrating lexical gaps in resource-
    Page 1, “Introduction”
Concepts and relations in BabelNet are harvested from the largest available semantic lexicon of English, WordNet, and a wide-coverage collaboratively edited encyclopedia, the English Wikipedia (Section 3.1).
    Page 2, “BabelNet”
We collect (a) from WordNet, all available word senses (as concepts) and all the semantic pointers between synsets (as relations); (b) from Wikipedia, all encyclopedic entries (i.e.
    Page 2, “BabelNet”
  8. their concepts in common) by establishing a mapping between Wikipedia pages and WordNet senses (Section 3.2).
    Page 2, “BabelNet”
  9. using (a) the human-generated translations provided in Wikipedia (the so-called inter-language links), as well as (b) a machine translation system to translate occurrences of the concepts within sense-tagged corpora, namely SemCor (Miller et al., 1993) — a corpus annotated with WordNet senses — and Wikipedia itself (Section 3.3).
    Page 2, “BabelNet”
WordNet.
    Page 2, “Methodology”
The most popular lexical knowledge resource in the field of NLP is certainly WordNet, a computational lexicon of the English language.
    Page 2, “Methodology”


synset

Appears in 50 sentences as: Synset (1) synset (28) synset: (1) SYNSETS (1) Synsets (2) synsets (27)
In BabelNet: Building a Very Large Multilingual Semantic Network
  1. We collect (a) from WordNet, all available word senses (as concepts) and all the semantic pointers between synsets (as relations); (b) from Wikipedia, all encyclopedic entries (i.e.
    Page 2, “BabelNet”
We call the resulting set of multilingual lexicalizations of a given concept a babel synset.
    Page 2, “BabelNet”
A concept in WordNet is represented as a synonym set (called synset), i.e.
    Page 2, “Methodology”
  4. For instance, the concept wind is expressed by the following synset:
    Page 2, “Methodology”
We denote with w^i_p the i-th sense of a word w with part of speech p. We use word senses to unambiguously denote the corresponding synsets (e.g.
    Page 2, “Methodology”
  6. Hereafter, we use word sense and synset interchangeably.
    Page 2, “Methodology”
For each synset, WordNet provides a textual definition, or gloss.
    Page 3, “Methodology”
  8. For example, the gloss of the above synset is: “air moving from an area of high pressure to an area of low pressure”.
    Page 3, “Methodology”
  9. Given a WordNet sense 3 and its synset S, we collect the following information:
    Page 3, “Methodology”
• Synonymy: all synonyms of s in S. For instance, given the sense airplane^1_n and its corresponding synset { airplane^1_n, aeroplane^1_n, plane^1_n }, the words contained therein are included in the context.
    Page 3, “Methodology”
• Hypernymy/Hyponymy: all synonyms in the synsets H such that H is either a hypernym (i.e., a generalization) or a hyponym (i.e., a specialization) of S. For example, given balloon^1_n, we include the words from its hypernym { lighter-than-air craft^1_n } and all its hyponyms (e.g.
    Page 3, “Methodology”
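The context sources listed in the snippets above (synonyms, hypernym and hyponym synsets, gloss words) can be pooled into a simple bag of words. The `Synset` class below is an illustrative stand-in, not a real WordNet API:

```python
# Sketch: build a disambiguation context for a sense by pooling synonyms,
# hypernym/hyponym words, and gloss words, as the snippets describe.
# The Synset class and its fields are hypothetical stand-ins.
class Synset:
    def __init__(self, words, gloss="", hypernyms=None, hyponyms=None):
        self.words = words
        self.gloss = gloss
        self.hypernyms = hypernyms or []
        self.hyponyms = hyponyms or []

def context(synset):
    bag = set(synset.words)                    # synonymy
    for related in synset.hypernyms + synset.hyponyms:
        bag.update(related.words)              # hypernymy / hyponymy
    bag.update(synset.gloss.lower().split())   # gloss words
    return bag

craft = Synset(["lighter-than-air craft"])
balloon = Synset(["balloon"],
                 gloss="large tough nonrigid bag filled with gas",
                 hypernyms=[craft])
print(context(balloon))
```

A fuller version would also add sister synsets (sharing a direct hypernym), as the paper's Sisterhood source does.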


word senses

Appears in 21 sentences as: Word Sense (1) word sense (5) Word senses (1) word senses (14)
In BabelNet: Building a Very Large Multilingual Semantic Network
  1. Recent studies in the difficult task of Word Sense Disambiguation (Navigli, 2009b, WSD) have shown the impact of the amount and quality of lexical knowledge (Cuadros and Rigau, 2006): richer knowledge sources can be of great benefit to both knowledge-lean systems (Navigli and Lapata, 2010) and supervised classifiers (Ng and Lee, 1996; Yarowsky and Florian, 2002).
    Page 1, “Introduction”
  2. We collect (a) from WordNet, all available word senses (as concepts) and all the semantic pointers between synsets (as relations); (b) from Wikipedia, all encyclopedic entries (i.e.
    Page 2, “BabelNet”
We denote with w^i_p the i-th sense of a word w with part of speech p. We use word senses to unambiguously denote the corresponding synsets (e.g.
    Page 2, “Methodology”
  4. Hereafter, we use word sense and synset interchangeably.
    Page 2, “Methodology”
  5. Given a WordNet word sense in our babel synset of interest (e.g.
    Page 4, “Methodology”
  6. In the second phase (see Section 3.3), we collect all the sentences in SemCor and Wikipedia in which the above English word sense occurs.
    Page 5, “Methodology”
  7. The final mapping contains 81,533 pairs of Wikipages and word senses they map to, covering 55.7% of the noun senses in WordNet.
    Page 5, “Experiment 1: Mapping Evaluation”
  8. Language Word senses Synsets
    Page 6, “Experiment 2: Translation Evaluation”
  9. In Table 2 we report the number of synsets and word senses available in the gold-standard resources for the 5 languages.
    Page 6, “Experiment 2: Translation Evaluation”
We assess the coverage of BabelNet against our gold-standard wordnets both in terms of synsets and word senses.
    Page 6, “Experiment 2: Translation Evaluation”
  11. For word senses we calculate a similar measure of coverage:
    Page 6, “Experiment 2: Translation Evaluation”
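The word-sense coverage measure the snippets describe (the ratio of gold-standard senses that also occur in the corresponding babel synset) can be sketched as follows; the dict layout and names are illustrative, not the paper's data format:

```python
# Sketch: word-sense coverage of BabelNet against a gold-standard wordnet.
# Both resources are modeled as dicts mapping an English WordNet synset id
# to the set of non-English senses (words) it contains. Names are hypothetical.
def sense_coverage(babelnet, gold):
    covered = total = 0
    for synset_id, gold_senses in gold.items():
        total += len(gold_senses)
        # senses in the gold resource that BabelNet also lexicalizes
        covered += len(gold_senses & babelnet.get(synset_id, set()))
    return covered / total if total else 0.0

gold = {"wind.n.01": {"vento"},
        "balloon.n.01": {"pallone", "aerostato"}}
babel = {"wind.n.01": {"vento", "aria"},
         "balloon.n.01": {"pallone"}}
print(sense_coverage(babel, gold))  # 2 of 3 gold senses covered
```

Synset coverage is the analogous ratio counted over synsets rather than individual senses.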


gold-standard

Appears in 15 sentences as: gold-standard (15)
In BabelNet: Building a Very Large Multilingual Semantic Network
  1. We conduct experiments on new and existing gold-standard datasets to show the high quality and coverage of the resource.
    Page 1, “Abstract”
  2. The gold-standard dataset includes 505 nonempty mappings, i.e.
    Page 5, “Experiment 1: Mapping Evaluation”
  3. This is assessed in terms of coverage against gold-standard resources (Section 5.1) and against a manually-validated dataset of translations (Section 5.2).
    Page 5, “Experiment 2: Translation Evaluation”
  4. Table 2: Size of the gold-standard wordnets.
    Page 6, “Experiment 2: Translation Evaluation”
We compare BabelNet against gold-standard resources for 5 languages, namely: the subset of GermaNet (Lemnitzer and Kunze, 2002) included in EuroWordNet for German, MultiWordNet (Pianta et al., 2002) for Italian, the Multilingual Central Repository for Spanish and Catalan (Atserias et al., 2004), and the WordNet Libre du Français (Sagot and Fišer, 2008, WOLF) for French.
    Page 6, “Experiment 2: Translation Evaluation”
  6. In Table 2 we report the number of synsets and word senses available in the gold-standard resources for the 5 languages.
    Page 6, “Experiment 2: Translation Evaluation”
Let B be BabelNet, F our gold-standard non-English wordnet (e.g.
    Page 6, “Experiment 2: Translation Evaluation”
All the gold-standard non-English resources, as well as BabelNet, are linked to the English WordNet: given a synset S_F ∈ F, we denote its corresponding babel synset as S_B and its synset in the English WordNet as S_E.
    Page 6, “Experiment 2: Translation Evaluation”
  9. We assess the coverage of BabelNet against our gold-standard wordnets both in terms of synsets and word senses.
    Page 6, “Experiment 2: Translation Evaluation”
That is, we calculate the ratio of word senses in our gold-standard resource F that also occur in the corresponding synset S_B to the overall number of senses in F.
    Page 6, “Experiment 2: Translation Evaluation”
  11. However, our gold-standard resources cover only a portion of the English WordNet, whereas the overall coverage of BabelNet is much higher.
    Page 6, “Experiment 2: Translation Evaluation”


lexicalizations

Appears in 9 sentences as: lexicalization (1) lexicalizations (7) lexicalized (1)
In BabelNet: Building a Very Large Multilingual Semantic Network
Importantly, each vertex v ∈ V contains a set of lexicalizations of the concept for different languages, e.g.
    Page 2, “BabelNet”
  2. We call the resulting set of multilingual lexicalizations of a given concept a babel synset.
    Page 2, “BabelNet”
An overview of BabelNet is given in Figure 1 (we label vertices with English lexicalizations): unlabeled edges are obtained from links in the Wikipedia pages (e.g.
    Page 2, “BabelNet”
  4. In this paper we restrict ourselves to concepts lexicalized as nouns.
    Page 2, “BabelNet”
  5. By repeating this step for each English lexicalization in a babel synset, we obtain a collection of sentences for the babel synset (see left part of Figure 1).
    Page 4, “Methodology”
  6. Note that we had no translation for Catalan and French in the first phase, because the inter-language link was not available, and we also obtain new lexicalizations for the Spanish and Italian languages.
    Page 5, “Methodology”
  7. However, it does not say anything about the precision of the additional lexicalizations provided by BabelNet.
    Page 7, “Experiment 2: Translation Evaluation”
  8. those mapped with our method illustrated in Section 3.2), 200 synsets whose lexicalizations exist in Wikipedia only.
    Page 7, “Experiment 2: Translation Evaluation”
lexicalizations) were appropriate given the corresponding WordNet gloss and/or Wikipage.
    Page 7, “Experiment 2: Translation Evaluation”
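A babel synset, as the set of multilingual lexicalizations of one concept, can be sketched as a per-language dictionary; the class and method names are hypothetical, not from the paper:

```python
# Sketch: a babel synset pools lexicalizations of one concept per language.
# Entries may come from Wikipedia inter-language links or from machine
# translation; the structure below is illustrative only.
from collections import defaultdict

class BabelSynset:
    def __init__(self, concept_id):
        self.concept_id = concept_id
        self.lexicalizations = defaultdict(set)  # language -> set of words

    def add(self, language, word):
        self.lexicalizations[language].add(word)

    def languages(self):
        return sorted(self.lexicalizations)

bs = BabelSynset("balloon.n.01")
bs.add("en", "balloon")
bs.add("it", "pallone aerostatico")
bs.add("fr", "ballon")
print(bs.languages())  # ['en', 'fr', 'it']
```

Using a set per language lets translations from the two harvesting phases (inter-language links and MT output) merge without duplicates, matching the snippet about new lexicalizations being added for Spanish and Italian.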


machine translation

Appears in 7 sentences as: Machine Translation (2) machine translation (5)
In BabelNet: Building a Very Large Multilingual Semantic Network
In addition, Machine Translation is also applied to enrich the resource with lexical information for all languages.
    Page 1, “Abstract”
poor languages with the aid of Machine Translation.
    Page 2, “Introduction”
  3. using (a) the human-generated translations provided in Wikipedia (the so-called inter-language links), as well as (b) a machine translation system to translate occurrences of the concepts within sense-tagged corpora, namely SemCor (Miller et al., 1993) — a corpus annotated with WordNet senses — and Wikipedia itself (Section 3.3).
    Page 2, “BabelNet”
  4. An initial prototype used a statistical machine translation system based on Moses (Koehn et al., 2007) and trained on Europarl (Koehn, 2005).
    Page 4, “Methodology”
  5. both from Wikipedia and the machine translation system.
    Page 6, “Experiment 2: Translation Evaluation”
  6. In contrast, good translations were produced using our machine translation method when enough sentences were available.
    Page 8, “Experiment 2: Translation Evaluation”
  7. Further, we contribute a large set of sense occurrences harvested from Wikipedia and SemCor, a corpus that we input to a state-of-the-art machine translation system to fill in the gap between resource-rich languages — such as English — and resource-poorer ones.
    Page 9, “Conclusions”


named entities

Appears in 7 sentences as: named entities (4) Named Entity (1) named entity (2)
In BabelNet: Building a Very Large Multilingual Semantic Network
  1. These include, among others, text summarization (Nastase, 2008), Named Entity Recognition (Bunescu and Pasca, 2006), Question Answering (Harabagiu et al., 2000) and text categorization (Gabrilovich and Markovitch, 2006).
    Page 1, “Introduction”
Second, such resources are typically lexicographic, and thus contain mainly concepts and only a few named entities.
    Page 1, “Introduction”
The result is an “encyclopedic dictionary” that provides concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations.
    Page 2, “Introduction”
the general term concept to denote either a concept or a named entity.
    Page 2, “BabelNet”
A Wikipedia page (henceforth, Wikipage) presents the knowledge about a specific concept (e.g. BALLOON (AIRCRAFT)) or named entity (e.g.
    Page 3, “Methodology”
Firstly, the two resources contribute different kinds of lexical knowledge: one is concerned mostly with named entities, the other with concepts.
    Page 8, “Conclusions”
  7. Thus, even when they overlap, the two resources provide complementary information about the same named entities or concepts.
    Page 9, “Conclusions”


semantic relations

Appears in 6 sentences as: semantic relation (2) semantic relations (4)
In BabelNet: Building a Very Large Multilingual Semantic Network
The result is an “encyclopedic dictionary” that provides concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations.
    Page 2, “Introduction”
  2. Each edge is labeled with a semantic relation from R, e.g.
    Page 2, “BabelNet”
, ε}, where ε denotes an unspecified semantic relation.
    Page 2, “BabelNet”
  4. However, while providing lexical resources on a very large scale for hundreds of thousands of language pairs, these do not encode semantic relations between concepts denoted by their lexical entries.
    Page 8, “Related Work”
  5. amounts of semantic relations and can be leveraged to enable multilinguality.
    Page 9, “Conclusions”
The resource includes millions of semantic relations, mainly from Wikipedia (however, WordNet relations are labeled), and contains almost 3 million concepts (6.7 labels per concept on average).
    Page 9, “Conclusions”


translation system

Appears in 5 sentences as: translation system (5)
In BabelNet: Building a Very Large Multilingual Semantic Network
  1. using (a) the human-generated translations provided in Wikipedia (the so-called inter-language links), as well as (b) a machine translation system to translate occurrences of the concepts within sense-tagged corpora, namely SemCor (Miller et al., 1993) — a corpus annotated with WordNet senses — and Wikipedia itself (Section 3.3).
    Page 2, “BabelNet”
Note that translations are sense-specific, as the context in which a term occurs is provided to the translation system.
    Page 4, “Methodology”
  3. An initial prototype used a statistical machine translation system based on Moses (Koehn et al., 2007) and trained on Europarl (Koehn, 2005).
    Page 4, “Methodology”
both from Wikipedia and the machine translation system.
    Page 6, “Experiment 2: Translation Evaluation”
  5. Further, we contribute a large set of sense occurrences harvested from Wikipedia and SemCor, a corpus that we input to a state-of-the-art machine translation system to fill in the gap between resource-rich languages — such as English — and resource-poorer ones.
    Page 9, “Conclusions”


hypernym

Appears in 4 sentences as: hypernym (3) hypernyms (2)
In BabelNet: Building a Very Large Multilingual Semantic Network
• Hypernymy/Hyponymy: all synonyms in the synsets H such that H is either a hypernym (i.e., a generalization) or a hyponym (i.e., a specialization) of S. For example, given balloon^1_n, we include the words from its hypernym { lighter-than-air craft^1_n } and all its hyponyms (e.g.
    Page 3, “Methodology”
• Sisterhood: words from the sisters of S. A sister synset S' is such that S and S' have a common direct hypernym.
    Page 3, “Methodology”
To do so, we include words from their synsets, hypernyms, hyponyms, sisters, and glosses.
    Page 5, “Methodology”
  4. We also intend to link missing concepts in WordNet, by establishing their most likely hypernyms — e.g., a la Snow et al.
    Page 9, “Conclusions”
