Unsupervised Discovery of Generic Relationships Using Pattern Clusters and its Evaluation by Automatically Generated SAT Analogy Questions
Davidov, Dmitry and Rappoport, Ari

Article Structure

Abstract

We present a novel framework for the discovery and representation of general semantic relationships that hold between lexical items.

Introduction

Semantic resources can be very useful in many NLP tasks.

Related Work

Extraction of relation information from text is a large subfield in NLP.

Pattern Clustering Algorithm

Our algorithm first discovers and clusters patterns in which a single (‘hook’) word participates, and then merges the resulting clusters to form the final structure.

Corpora and Parameters

In this section we describe our experimental setup, and discuss in detail the effect of each of the algorithms’ parameters.

SAT-based Evaluation

As discussed in Section 2, the evaluation of semantic relationship structures is nontrivial.

Evaluation Using Known Information

We also evaluated our pattern clusters using relevant information reported in related work.

Conclusion

We have proposed a novel way to define and identify generic lexical relationships as clusters of patterns.

Topics

word pairs

Appears in 12 sentences as: word pair (4) word pairs (10)
In Unsupervised Discovery of Generic Relationships Using Pattern Clusters and its Evaluation by Automatically Generated SAT Analogy Questions
  1. The standard process for pattern-based relation extraction is to start with hand-selected patterns or word pairs expressing a particular relationship, and iteratively scan the corpus for co-appearances of word pairs in patterns and for patterns that contain known word pairs.
    Page 1, “Introduction”
  2. We also propose a way to label each cluster by word pairs that represent it best.
    Page 2, “Introduction”
  3. Several recent papers discovered relations on the web using seed patterns (Pantel et al., 2004), rules (Etzioni et al., 2004), and word pairs (Pasca et al., 2006; Alfonseca et al., 2006).
    Page 3, “Related Work”
  4. (Alfonseca et al., 2006) for extracting general relations starting from given seed word pairs.
    Page 3, “Pattern Clustering Algorithm”
  5. Unlike most previous work, our hook words are not provided in advance but selected randomly; the goal in those papers is to discover relationships between given word pairs, while we use hook words in order to discover relationships that generally occur in the corpus.
    Page 3, “Pattern Clustering Algorithm”
  6. To label pattern clusters we define a HITS measure that reflects the affinity of a given word pair to a given cluster.
    Page 5, “Pattern Clustering Algorithm”
  7. For a given word pair $(w_1, w_2)$ and cluster $C$ with $n$ core patterns $P_{core}$ and $m$ unconfirmed patterns $P_{unconf}$,
    Page 5, “Pattern Clustering Algorithm”
  8. $\alpha \times |\{p;\ (w_1, w_2)\ \text{appears in}\ p \in P_{unconf}\}|$ In this formula, ‘appears in’ means that the word pair appears in instances of this pattern extracted from the original corpus or retrieved from the web during evaluation (see Section 5.2).
    Page 5, “Pattern Clustering Algorithm”
  9. We addressed the evaluation questions above using a SAT-like analogy test automatically generated from word pairs captured by our clusters (see below in this section).
    Page 6, “SAT-based Evaluation”
  10. The header of the question is a word pair that is one of the label pairs of the cluster.
    Page 6, “SAT-based Evaluation”
  11. In our sample there were no word pairs assigned as labels to more than one cluster4.
    Page 7, “SAT-based Evaluation”
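
The HITS measure quoted in items 6 to 8 can be illustrated with a minimal sketch. The excerpt preserves only the trailing, α-weighted term over the unconfirmed patterns, so the overall shape used here (a core term plus α times an unconfirmed term, each normalized by the size of its pattern set) is an assumption, not the paper's confirmed definition. The `PatternInstances` mapping, the example patterns, and the example word pairs are hypothetical.

```python
from typing import Dict, Set, Tuple

WordPair = Tuple[str, str]
# Hypothetical representation: pattern text -> word pairs seen in its instances.
PatternInstances = Dict[str, Set[WordPair]]


def hits(pair: WordPair,
         core: PatternInstances,
         unconf: PatternInstances,
         alpha: float = 0.1) -> float:
    """Affinity of a word pair to a cluster, given the cluster's patterns."""
    n, m = len(core), len(unconf)
    # Count the patterns of each kind in which the pair appears.
    core_hits = sum(1 for pairs in core.values() if pair in pairs)
    unconf_hits = sum(1 for pairs in unconf.values() if pair in pairs)
    # Assumption: each count is normalized by the size of its pattern set.
    core_term = core_hits / n if n else 0.0
    unconf_term = unconf_hits / m if m else 0.0
    return core_term + alpha * unconf_term


# Toy cluster with two core patterns and two unconfirmed patterns.
core_patterns: PatternInstances = {
    "X is part of Y": {("wheel", "car"), ("page", "book")},
    "X of the Y": {("wheel", "car")},
}
unconf_patterns: PatternInstances = {
    "Y has a X": {("wheel", "car")},
    "Y with X": set(),
}

score_wheel = hits(("wheel", "car"), core_patterns, unconf_patterns)
score_page = hits(("page", "book"), core_patterns, unconf_patterns)
```

Under these assumptions, a pair that appears in every core pattern scores higher than one that appears in only some, and unconfirmed patterns contribute at a discount set by α (0.1 in the reported parameter values).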

See all papers in Proc. ACL 2008 that mention word pairs.

semantic relationship

Appears in 7 sentences as: semantic relationship (4) semantic relationships (3)
In Unsupervised Discovery of Generic Relationships Using Pattern Clusters and its Evaluation by Automatically Generated SAT Analogy Questions
  1. We present a novel framework for the discovery and representation of general semantic relationships that hold between lexical items.
    Page 1, “Abstract”
  2. They aim to find relationship instances rather than identify generic semantic relationships.
    Page 2, “Related Work”
  3. As discussed in Section 2, the evaluation of semantic relationship structures is nontrivial.
    Page 6, “SAT-based Evaluation”
  4. The first is the quality (precision/recall) of individual pattern clusters: does each pattern cluster capture lexical item pairs of the same semantic relationship?
    Page 6, “SAT-based Evaluation”
  5. does it recognize many pairs of the same semantic relationship?
    Page 6, “SAT-based Evaluation”
  6. The second is the quality of the cluster set as whole: does the pattern clusters set allow identification of important known semantic relationships?
    Page 6, “SAT-based Evaluation”
  7. Each such cluster is set of patterns that can be used to identify, classify or capture new instances of some unspecified semantic relationship.
    Page 8, “Conclusion”

content words

Appears in 4 sentences as: content word (1) content words (3)
In Unsupervised Discovery of Generic Relationships Using Pattern Clusters and its Evaluation by Automatically Generated SAT Analogy Questions
  1. Following (Davidov and Rappoport, 2006), we classified words into high-frequency words (HFWs) and content words (CWs).
    Page 4, “Pattern Clustering Algorithm”
  2. FC (upper bound for content word frequency in patterns) influences which words are considered as hook and target words.
    Page 5, “Corpora and Parameters”
  3. Since content words determine the joining of patterns into clusters, the more ambiguous a word is, the noisier the resulting clusters.
    Page 5, “Corpora and Parameters”
  4. The value we use for FH is lower than that used for FC, in order to allow as HFWs function words of relatively low frequency (e.g., ‘through’), while allowing as content words some frequent words that participate in meaningful relationships (e.g., ‘game’).
    Page 5, “Corpora and Parameters”
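
The interplay of the two frequency thresholds described above can be sketched as follows. This is a minimal illustration, assuming that FH acts as a lower bound on the frequency of high-frequency words (the excerpts state only that FC is an upper bound for content words and that FH is lower than FC) and that both comparisons are strict. The threshold values are the ones reported in the paper's parameter list; the function name and the example frequencies are hypothetical.

```python
from typing import Set

F_C = 1000.0  # upper bound for content-word frequency, in words per million
F_H = 100.0   # assumed lower bound for high-frequency words, in words per million


def word_roles(freq_wpm: float, f_c: float = F_C, f_h: float = F_H) -> Set[str]:
    """Return the roles a word may take in a pattern, given its frequency."""
    roles: Set[str] = set()
    if freq_wpm > f_h:
        roles.add("HFW")  # frequent enough to act as a high-frequency word
    if freq_wpm < f_c:
        roles.add("CW")   # rare enough to act as a content word
    return roles


# Hypothetical frequencies, chosen only to show the three possible outcomes.
very_frequent = word_roles(5000.0)  # only an HFW
mid_frequent = word_roles(500.0)    # both HFW and CW, because F_H < F_C
rare = word_roles(10.0)             # only a CW
```

Because FH is lower than FC, words whose frequency falls between the two thresholds can serve in either role, which matches the stated intent of admitting relatively infrequent function words as HFWs while still admitting some frequent words as content words.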

WordNet

Appears in 4 sentences as: WordNet (4)
In Unsupervised Discovery of Generic Relationships Using Pattern Clusters and its Evaluation by Automatically Generated SAT Analogy Questions
  1. Most established resources (e.g., WordNet) represent only the main and widely accepted relationships such as hypernymy and meronymy.
    Page 1, “Introduction”
  2. There is a large body of related work that deals with discovery of basic relationship types represented in useful resources such as WordNet, including hypernymy (Hearst, 1992; Pantel et al., 2004; Snow et al., 2006), synonymy (Davidov and Rappoport, 2006; Widdows and Dorow, 2002) and meronymy (Berland and Charniak, 1999; Girju et al., 2006).
    Page 2, “Related Work”
  3. Several algorithms use manually-prepared resources, including WordNet (Moldovan et al., 2004; Costello et al., 2006) and Wikipedia (Strube and Ponzetto, 2006).
    Page 3, “Related Work”
  4. Evaluation for hypernymy and synonymy usually uses WordNet (Lin and Pantel, 2002; Widdows and Dorow, 2002; Davidov and Rappoport, 2006).
    Page 3, “Related Work”

development set

Appears in 3 sentences as: development set (3)
In Unsupervised Discovery of Generic Relationships Using Pattern Clusters and its Evaluation by Automatically Generated SAT Analogy Questions
  1. We used part of the Russian corpus as a development set for determining the parameters.
    Page 5, “Corpora and Parameters”
  2. On our development set we have tested various parameter settings.
    Page 5, “Corpora and Parameters”
  3. In our experiments we have used the following values (again, determined using a development set) for these parameters: FC: 1,000 words per million (wpm); FH: 100 wpm; FB: 1.2 wpm; N: 500 words; W: 5 words; L: 30%; S: 2/3; α: 0.1.
    Page 6, “Corpora and Parameters”
