GlossBoot: Bootstrapping Multilingual Domain Glossaries from the Web
Flavio De Benedictis, Stefano Faralli and Roberto Navigli

Article Structure

Abstract

We present GlossBoot, an effective minimally-supervised approach to acquiring wide-coverage domain glossaries for many languages.

Introduction

Much textual content, such as that available on the Web, contains a great deal of information focused on specific areas of knowledge.

GlossBoot

Our objective is to harvest a domain glossary G containing pairs of terms/glosses in a given language.

Experimental Setup

3.1 Domains and Gold Standards

Results and Discussion

4.1 Terms

Comparative Evaluation

5.1 Comparison with Google Define

Related Work

There are several techniques in the literature for the automated acquisition of definitional knowledge.

Conclusions

In this paper we have presented GlossBoot, a new, minimally-supervised approach to multilingual glossary learning.

Topics

hypernym

Appears in 12 sentences as: Hypernym (2), hypernym (13)
In GlossBoot: Bootstrapping Multilingual Domain Glossaries from the Web
  1. Given a domain and a language of interest, we bootstrap the glossary learning process with just a few hypernymy relations (such as computer is-a device), with the only condition that the (term, hypernym) pairs must be specific enough to implicitly identify the domain in the target language.
    Page 2, “Introduction”
  2. (a) Hypernym extraction: for each newly-acquired term/gloss pair (t, g) ∈ Gk, we automatically extract a candidate hypernym h from the textual gloss g.
    Page 4, “GlossBoot”
  3. To do this we use a simple unsupervised heuristic which just selects the first term in the gloss.5 We show an example of hypernym extraction for some terms in Table 2 (we report the term in column 1, the gloss in column 2 and the hypernyms extracted by the first-term hypernym extraction heuristic in column 3).
    Page 4, “GlossBoot”
  4. (b) (Term, Hypernym)-ranking: we sort all the glosses in Gk by the number of seed terms found in each gloss.
    Page 4, “GlossBoot”
  5. While more complex strategies could be used, such as supervised classifiers (Navigli and Velardi, 2010), we found that this heuristic works well because, even when it is not a hypernym, the first term plays the role of a cue word for the defined term.
    Page 4, “GlossBoot”
  6. (c) New seed selection: we select the (term, hypernym) pairs corresponding to the K top-ranking glosses.
    Page 4, “GlossBoot”
  7. Now, an obvious question arises: what if we bootstrapped GlossBoot with fewer hypernym seeds, e.g., just one seed?
    Page 6, “Results and Discussion”
  8. To answer this question we replicated our English experiments on each single (term, hypernym) pair in our seed set.
    Page 6, “Results and Discussion”
  9. To avoid the use of a large domain corpus, terminologies can be obtained from the Web by using Doubly-Anchored Patterns (DAPs) which, given a (term, hypernym) pair, harvest sentences matching manually-defined patterns like “<hypernym> such as <term>, and *” (Kozareva et al., 2008).
    Page 8, “Related Work”
  10. Similarly to our approach, they drop the requirement of a domain corpus and start from a small number of (term, hypernym) seeds.
    Page 8, “Related Work”
  11. In contrast, GlossBoot performs the novel task of multilingual glossary learning from the Web by bootstrapping the extraction process with a few (term, hypernym) seeds.
    Page 8, “Related Work”
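
The three pipeline steps quoted above, (a) hypernym extraction via the first-term heuristic, (b) (term, hypernym)-ranking by seed-term counts, and (c) new seed selection over the K top-ranking glosses, translate almost directly into code. Below is a minimal Python sketch of one such iteration; the names (glossary_k, seed_terms, K) and the substring matching are our own illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of GlossBoot's per-iteration steps (a)-(c) as
# described in the excerpts above; names and matching strategy are ours.

def extract_hypernym(gloss: str) -> str:
    # Step (a): unsupervised first-term heuristic -- the first word of
    # the gloss is taken as the candidate hypernym (per footnote 5, even
    # when it is not a true hypernym it acts as a cue word).
    tokens = gloss.split()
    return tokens[0].lower() if tokens else ""

def rank_glosses(glossary_k, seed_terms):
    # Step (b): sort (term, gloss) pairs by the number of current seed
    # terms found in each gloss, most seed-rich glosses first.
    def seed_count(item):
        _, gloss = item
        text = gloss.lower()
        return sum(1 for seed in seed_terms if seed.lower() in text)
    return sorted(glossary_k.items(), key=seed_count, reverse=True)

def select_new_seeds(glossary_k, seed_terms, K):
    # Step (c): the (term, hypernym) pairs of the K top-ranking glosses
    # become the seeds for the next bootstrapping iteration.
    ranked = rank_glosses(glossary_k, seed_terms)
    return {(term, extract_hypernym(gloss)) for term, gloss in ranked[:K]}
```

Substring matching is used here so that multi-word seed terms still count; a real implementation would need tokenization, lemmatization and term-boundary handling appropriate to the target language.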


gold standard

Appears in 11 sentences as: gold standard (11), Gold Standards (1), gold standards (3)
In GlossBoot: Bootstrapping Multilingual Domain Glossaries from the Web
  1. 3.1 Domains and Gold Standards
    Page 4, “Experimental Setup”
  2. For each domain and language we selected as gold standards well-reputed glossaries on the Web, such as: the Utah computing glossary, the Wikipedia glossary of botanical terms, a set of Wikipedia glossaries about environment, and the Reuters glossary for Finance [full list at http://lcl...].
    Page 4, “Experimental Setup”
  3. For each domain and language we calculated coverage, extra-coverage and precision of the acquired terms T̂. Coverage is the ratio of extracted terms in T̂ also contained in the gold standard T to the size of T. Extra-coverage is calculated as the ratio of the additional extracted terms in T̂ \ T over the number of gold standard terms T. Finally, precision is the ratio of extracted terms in T̂ deemed to be within the domain.
    Page 5, “Experimental Setup”
  4. Note that by sampling on the entire set T̂, we calculate the precision of both terms in T̂ ∩ T, i.e., in the gold standard, and terms in T̂ \ T, i.e., not in the gold standard, which are not necessarily outside the domain.
    Page 5, “Experimental Setup”
  5. Note that, since precision also concerns terms not in the gold standard, we had to manually validate a sample of the extracted terms for each of the 21 tested values of δ ∈ {0, 0.05, 0.1, ..., 1}.
    Page 5, “Experimental Setup”
  6. which are also found in the gold standard (column 2), in-domain extracted terms but not in the gold standard (column 3), out-of-domain extracted terms (column 4), and domain terms in the gold standard but not extracted by our approach (column 5).
    Page 6, “Results and Discussion”
  7. In all three languages GlossBoot provides very high extra coverage of domain terms, i.e., additional terms which are not in the gold standard but are returned by our system.
    Page 6, “Results and Discussion”
  8. These results, together with the generally high precision values, indicate the larger extent of our bootstrapped glossaries compared to our gold standards.
    Page 6, “Results and Discussion”
  9. First, we randomly sampled 100 terms from our gold standard for each domain and each of the three languages.
    Page 7, “Comparative Evaluation”
  10. Table 9: Number of domain glosses (from a random sample of 100 gold standard terms per domain) retrieved using Google Define and GlossBoot.
    Page 7, “Comparative Evaluation”
  11. All the glossaries and gold standards created for our experiments are available from the authors’ Web site http://lcl...
    Page 9, “Conclusions”
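
The evaluation measures defined in excerpts 3 and 4 above are plain set ratios between the extracted term set (written T̂ there) and the gold standard T, plus a human-judged precision over a sample. A small sketch under those definitions, with function names of our choosing:

```python
# Coverage, extra-coverage and (sampled) precision as defined in the
# excerpts above; t_hat = extracted terms, t_gold = gold-standard terms.

def coverage(t_hat: set, t_gold: set) -> float:
    # |T̂ ∩ T| / |T|: fraction of gold-standard terms that were extracted.
    return len(t_hat & t_gold) / len(t_gold)

def extra_coverage(t_hat: set, t_gold: set) -> float:
    # |T̂ \ T| / |T|: extracted terms beyond the gold standard, measured
    # against the gold standard's size (so this ratio can exceed 1).
    return len(t_hat - t_gold) / len(t_gold)

def sampled_precision(sample: set, judged_in_domain: set) -> float:
    # Precision is judged manually on a sample of T̂: the fraction of
    # sampled terms the annotators deemed within the domain.
    return len(sample & judged_in_domain) / len(sample)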


randomly sampled

Appears in 6 sentences as: random sample (2), randomly sampled (3), randomly sampling (1)
In GlossBoot: Bootstrapping Multilingual Domain Glossaries from the Web
  1. To calculate precision we randomly sampled 5% of the retrieved terms and asked two human annotators to manually tag their domain pertinence (with adjudication in case of disagreement; κ = .62, indicating substantial agreement).
    Page 5, “Experimental Setup”
  2. Precision was determined on a random sample of 5% of the acquired glosses for each domain and language.
    Page 5, “Experimental Setup”
  3. First, we randomly sampled 100 terms from our gold standard for each domain and each of the three languages.
    Page 7, “Comparative Evaluation”
  4. Table 9: Number of domain glosses (from a random sample of 100 gold standard terms per domain) retrieved using Google Define and GlossBoot.
    Page 7, “Comparative Evaluation”
  5. As for the precision of the extracted terms, we randomly sampled 50% of them for each system.
    Page 7, “Comparative Evaluation”
  6. We performed a similar evaluation for the precision of the acquired glosses, by randomly sampling 50% of them for each system.
    Page 8, “Comparative Evaluation”
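
Excerpt 1 above reports inter-annotator agreement of κ = .62 on the domain-pertinence tags, which falls in the "substantial" band (.61 to .80) of the Landis and Koch scale. For reference, Cohen's kappa for two annotators with binary labels can be computed as below; this is a generic textbook formula, not code from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    # Cohen's kappa: (observed agreement - chance agreement) / (1 - chance),
    # for two annotators labeling the same items True/False (in-domain or not).
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label probabilities.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1.0 - p_a) * (1.0 - p_b)
    return (observed - expected) / (1.0 - expected)  # undefined if expected == 1
```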


in-domain

Appears in 4 sentences as: in-domain (4)
In GlossBoot: Bootstrapping Multilingual Domain Glossaries from the Web
  1. In Table 5 we show examples of the possible scenarios for terms: in-domain extracted terms
    Page 5, “Results and Discussion”
  2. which are also found in the gold standard (column 2), in-domain extracted terms but not in the gold standard (column 3), out-of-domain extracted terms (column 4), and domain terms in the gold standard but not extracted by our approach (column 5).
    Page 6, “Results and Discussion”
  3. Table 6), because the retrieved glosses of domain terms are usually in-domain too, and follow a definitional style because they come from glossaries.
    Page 7, “Results and Discussion”
  4. Next, for each domain and language, we manually calculated the fraction of terms for which an in-domain definition was provided by Google Define and GlossBoot.
    Page 7, “Comparative Evaluation”


Sense Disambiguation

Appears in 3 sentences as: Sense Disambiguation (3)
In GlossBoot: Bootstrapping Multilingual Domain Glossaries from the Web
  1. Interestingly, electronic glossaries have been shown to be key resources not only for humans, but also in Natural Language Processing (NLP) tasks such as Question Answering (Cui et al., 2007), Word Sense Disambiguation (Duan and Yates, 2010; Faralli and Navigli, 2012) and ontology learning (Navigli et al., 2011; Velardi et al., 2013).
    Page 1, “Introduction”
  2. and Curran, 2008; McIntosh and Curran, 2009), learning semantic relations (Pantel and Pennacchiotti, 2006), extracting surface text patterns for open-domain question answering (Ravichandran and Hovy, 2002), semantic tagging (Huang and Riloff, 2010) and unsupervised Word Sense Disambiguation (Yarowsky, 1995).
    Page 9, “Related Work”
  3. Beyond the immediate usability of its output and its effective use for domain Word Sense Disambiguation (Faralli and Navigli, 2012), we wish to show the benefit of GlossBoot in gloss-driven approaches to ontology learning (Navigli et al., 2011; Velardi et al., 2013) and semantic network enrichment (Navigli and Ponzetto, 2012).
    Page 9, “Conclusions”


Word Sense

Appears in 3 sentences as: Word Sense (3)
In GlossBoot: Bootstrapping Multilingual Domain Glossaries from the Web
  1. Interestingly, electronic glossaries have been shown to be key resources not only for humans, but also in Natural Language Processing (NLP) tasks such as Question Answering (Cui et al., 2007), Word Sense Disambiguation (Duan and Yates, 2010; Faralli and Navigli, 2012) and ontology learning (Navigli et al., 2011; Velardi et al., 2013).
    Page 1, “Introduction”
  2. and Curran, 2008; McIntosh and Curran, 2009), learning semantic relations (Pantel and Pennacchiotti, 2006), extracting surface text patterns for open-domain question answering (Ravichandran and Hovy, 2002), semantic tagging (Huang and Riloff, 2010) and unsupervised Word Sense Disambiguation (Yarowsky, 1995).
    Page 9, “Related Work”
  3. Beyond the immediate usability of its output and its effective use for domain Word Sense Disambiguation (Faralli and Navigli, 2012), we wish to show the benefit of GlossBoot in gloss-driven approaches to ontology learning (Navigli et al., 2011; Velardi et al., 2013) and semantic network enrichment (Navigli and Ponzetto, 2012).
    Page 9, “Conclusions”
