Semi-supervised Relation Extraction with Large-scale Word Clustering
Sun, Ang and Grishman, Ralph and Sekine, Satoshi

Article Structure

Abstract

We present a simple semi-supervised relation extraction system with large-scale word clustering.

Introduction

Relation extraction is an important information extraction task in natural language processing (NLP), with many practical applications.

Related Work

The idea of using word clusters as features in discriminative learning was pioneered by Miller et al.

Background

3.1 Relation Extraction

Feature Based Relation Extraction

Given a pair of entity mentions <m_i, m_j> and the sentence containing the pair, a feature based system extracts a feature vector v which contains diverse lexical, syntactic and semantic features.

Cluster Feature Selection

The selection of cluster features aims to answer the following two questions: which lexical features should be augmented with word clusters to improve generalization accuracy?

Experiments

In this section, we first present details of our unsupervised word clusters, the relation extraction data set and its preprocessing.

Conclusion and Future Work

We have described a semi-supervised relation extraction system with large-scale word clustering.

Topics

relation extraction

Appears in 22 sentences as: Relation Extraction (1) Relation extraction (1) relation extraction (19) relation extraction: (1)
In Semi-supervised Relation Extraction with Large-scale Word Clustering
  1. We present a simple semi-supervised relation extraction system with large-scale word clustering.
    Page 1, “Abstract”
  2. Relation extraction is an important information extraction task in natural language processing (NLP), with many practical applications.
    Page 1, “Introduction”
  3. The goal of relation extraction is to detect and characterize semantic relations between pairs of entities in text.
    Page 1, “Introduction”
  4. For example, a relation extraction system needs to be able to extract an Employment relation between the entities US soldier and US in the phrase US soldier.
    Page 1, “Introduction”
  5. The performance of a supervised relation extraction system is usually degraded by the sparsity of lexical features.
    Page 1, “Introduction”
This motivates our work to use word clusters as additional features for relation extraction.
    Page 1, “Introduction”
  7. The rest of this paper is organized as follows: Section 2 presents related work and Section 3 provides the background of the relation extraction task and the word clustering algorithm.
    Page 2, “Introduction”
  8. A second difference between this work and the above ones is that we utilize word clusters in the task of relation extraction which is very different from sequence labeling tasks such as name tagging and chunking.
    Page 2, “Related Work”
(2005) and Chan and Roth (2010) used word clusters in relation extraction, they shared the same limitation as the above approaches in choosing clusters.
    Page 2, “Related Work”
  10. 3.1 Relation Extraction
    Page 2, “Background”
One of the well defined relation extraction tasks is the Automatic Content Extraction (ACE) program sponsored by the U.S. government.
    Page 2, “Background”

relation instances

Appears in 11 sentences as: relation instance (3) relation instances (8)
In Semi-supervised Relation Extraction with Large-scale Word Clustering
  1. In contrast, the kernel based method does not explicitly extract features; it designs kernel functions over the structured sentence representations (sequence, dependency or parse tree) to capture the similarities between different relation instances (Zelenko et al., 2003; Bunescu and Mooney, 2005a; Bunescu and Mooney, 2005b; Zhao and Grishman, 2005; Zhang et al., 2006; Zhou et al., 2007; Qian et al., 2008).
    Page 1, “Introduction”
The assumption is that even if the word soldier may never have been seen in the annotated Employment relation instances, other words which share the same cluster membership with soldier such as president and ambassador may have been observed in the Employment instances.
    Page 1, “Introduction”
A binary classifier is trained first to distinguish between relation instances and non-relation instances.
    Page 3, “Background”
At the lexical level, a relation instance can be seen as a sequence of tokens which form a five tuple <Before, M1, Between, M2, After>.
    Page 3, “Feature Based Relation Extraction”
  5. Specifically, we first train a binary classifier to distinguish between relation instances and non-relation instances.
    Page 4, “Feature Based Relation Extraction”
  6. Then rather than using the thresholded output of this binary classifier as training data, we use only the annotated relation instances to train a multi-class classifier for the 7 relation types.
    Page 4, “Feature Based Relation Extraction”
Given a test instance x, we first apply the binary classifier to it for relation detection; if it is detected as a relation instance we then apply the multi-class relation classifier to classify it.
    Page 4, “Feature Based Relation Extraction”
Table 4 simplifies a relation instance as a three tuple <Context, M1, M2> where the Context includes the Before, Between and After from the five tuple.
    Page 4, “Cluster Feature Selection”
Following previous research, we used in experiments the nwire (newswire) and bnews (broadcast news) genres of the data containing 348 documents and 4374 relation instances.
    Page 6, “Experiments”
The non-relation instances generated were about 8 times more than the relation instances.
    Page 6, “Experiments”
The unbalanced distribution of relation instances and non-relation instances remains as an obstacle for pushing the performance of relation extraction to the next level.
    Page 8, “Experiments”
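The five-tuple view of a relation instance (occurrence 4 above) can be sketched as follows; the function name and the (start, end) span convention are illustrative assumptions, not the authors' code.

```python
from typing import List, Tuple

def five_tuple(tokens: List[str], m1: Tuple[int, int], m2: Tuple[int, int]):
    """Split a tokenized sentence into <Before, M1, Between, M2, After>.

    m1 and m2 are (start, end) token offsets for the two entity mentions,
    end-exclusive, with m1 occurring before m2 in the sentence.
    """
    before = tokens[:m1[0]]
    mention1 = tokens[m1[0]:m1[1]]
    between = tokens[m1[1]:m2[0]]
    mention2 = tokens[m2[0]:m2[1]]
    after = tokens[m2[1]:]
    return before, mention1, between, mention2, after

# The "US soldier" Employment example from the paper, with assumed spans:
tokens = "the US soldier went home".split()
print(five_tuple(tokens, (1, 2), (2, 3)))
# (['the'], ['US'], [], ['soldier'], ['went', 'home'])
```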

development set

Appears in 7 sentences as: development set (7)
In Semi-supervised Relation Extraction with Large-scale Word Clustering
  1. Our main idea is to learn the best set of prefix lengths, perhaps through the validation of their effectiveness on a development set of data.
    Page 5, “Cluster Feature Selection”
Because this method does not need validation on the development set, it is the laziest but the fastest method for selecting clusters.
    Page 6, “Cluster Feature Selection”
Exhaustive Search (ES): ES works by trying every possible combination of the set I and picking the one that works the best for the development set.
    Page 6, “Cluster Feature Selection”
  4. The set of prefix lengths that worked the best for the development set was chosen to select clusters.
    Page 7, “Experiments”
It was interesting that ES did not always outperform the two statistical methods, which might be because of its overfitting to the development set.
    Page 7, “Experiments”
  6. For the semi-supervised system, each test fold was the same one used in the baseline and the other 4 folds were further split into a training set and a development set in a ratio of 7:3 for selecting clusters.
    Page 7, “Experiments”
  7. The training documents for each size setup were split into a real training set and a development set in a ratio of 7:3 for selecting clusters.
    Page 8, “Experiments”
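The Exhaustive Search strategy (occurrence 3 above) amounts to scoring every non-empty subset of candidate prefix lengths on the development set and keeping the best. A minimal sketch, assuming a hypothetical `score_on_dev` callback that trains with the given prefix lengths and returns a development-set score:

```python
from itertools import combinations

def exhaustive_search(candidate_lengths, score_on_dev):
    """Try every non-empty combination of prefix lengths; keep the dev-set best.

    score_on_dev(lengths) is assumed to train a model using cluster features
    at exactly those prefix lengths and return its development-set score.
    """
    best_set, best_score = None, float("-inf")
    for r in range(1, len(candidate_lengths) + 1):
        for subset in combinations(candidate_lengths, r):
            score = score_on_dev(subset)
            if score > best_score:
                best_set, best_score = subset, score
    return best_set, best_score
```

This is exponential in the number of candidate lengths, which matches the paper's observation that ES is the slowest of the proposed selection methods.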

entity mentions

Appears in 7 sentences as: entity mentions (7)
In Semi-supervised Relation Extraction with Large-scale Word Clustering
  1. A relation was defined over a pair of entity mentions within a single sentence.
    Page 2, “Background”
  2. The heads of the two entity mentions are marked.
    Page 3, “Background”
Given a pair of entity mentions <m_i, m_j> and the sentence containing the pair, a feature based system extracts a feature vector v which contains diverse lexical, syntactic and semantic features.
    Page 3, “Feature Based Relation Extraction”
  4. As a relation in ACE is usually short, the words of the two entity mentions can provide more critical indications for relation classification than the words from the context.
    Page 5, “Cluster Feature Selection”
Within the two entity mentions, the head word of each mention is usually more important than other words of the mention; the conjunction of the two heads can provide an additional clue.
    Page 5, “Cluster Feature Selection”
And in general words other than the chunk head in the context do not contribute to establishing a relationship between the two entity mentions.
    Page 5, “Cluster Feature Selection”
Following previous work, we did 5-fold cross-validation on the 348 documents with hand-annotated entity mentions.
    Page 6, “Experiments”
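The heuristic in occurrences 4 and 5 above (augment the two mention heads and their conjunction with clusters, rather than every context word) can be sketched as follows; the feature names and the 8-bit prefix length are illustrative assumptions, not the authors' exact feature set.

```python
def head_cluster_features(head1, head2, cluster_path, prefix_len=8):
    """Augment the two mention-head lexical features, and their conjunction,
    with word-cluster prefix features (hypothetical feature names).

    cluster_path: dict mapping a word to its Brown-cluster bit string.
    """
    feats = ["HM1=" + head1, "HM2=" + head2, "HM12=" + head1 + "_" + head2]
    c1 = cluster_path.get(head1, "")[:prefix_len]
    c2 = cluster_path.get(head2, "")[:prefix_len]
    if c1:
        feats.append("HM1_C=" + c1)
    if c2:
        feats.append("HM2_C=" + c2)
    if c1 and c2:
        feats.append("HM12_C=" + c1 + "_" + c2)
    return feats

# Illustrative (made-up) cluster paths for the paper's "US soldier" example:
print(head_cluster_features("soldier", "US", {"soldier": "0010110111", "US": "110"}))
```

The point of the sketch is that cluster features are layered on top of the existing lexical features instead of replacing them, so a word unseen in training can still match through its cluster prefix.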

semi-supervised

Appears in 6 sentences as: semi-supervised (6)
In Semi-supervised Relation Extraction with Large-scale Word Clustering
  1. We present a simple semi-supervised relation extraction system with large-scale word clustering.
    Page 1, “Abstract”
  2. When training on different sizes of data, our semi-supervised approach consistently outperformed a state-of-the-art supervised baseline system.
    Page 1, “Abstract”
  3. The cluster based semi-supervised system works by adding an additional layer of lexical features that incorporate word clusters as shown in column 4 of Table 4.
    Page 5, “Cluster Feature Selection”
  4. For the semi-supervised system, 70 percent of the rest of the documents were randomly selected as training data and 30 percent as development data.
    Page 7, “Experiments”
  5. For the semi-supervised system, each test fold was the same one used in the baseline and the other 4 folds were further split into a training set and a development set in a ratio of 7:3 for selecting clusters.
    Page 7, “Experiments”
  6. We have described a semi-supervised relation extraction system with large-scale word clustering.
    Page 8, “Conclusion and Future Work”

binary classifier

Appears in 5 sentences as: binary classifier (4) binary classifier’s (1)
In Semi-supervised Relation Extraction with Large-scale Word Clustering
  1. Then the thresholded output of this binary classifier is used as training data for learning a multi-class classifier for the 7 relation types (Bunescu and Mooney, 2005b).
    Page 3, “Background”
  2. Specifically, we first train a binary classifier to distinguish between relation instances and non-relation instances.
    Page 4, “Feature Based Relation Extraction”
  3. Then rather than using the thresholded output of this binary classifier as training data, we use only the annotated relation instances to train a multi-class classifier for the 7 relation types.
    Page 4, “Feature Based Relation Extraction”
Given a test instance x, we first apply the binary classifier to it for relation detection; if it is detected as a relation instance we then apply the multi-class relation classifier to classify it.
    Page 4, “Feature Based Relation Extraction”
  5. When the binary classifier’s prediction probability is greater than 0.5, we take the prediction with the highest probability of the multi-class classifier as the final class label.
    Page 4, “Cluster Feature Selection”
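The detection-then-classification cascade described in occurrences 2-5 above can be sketched as follows; `binary_clf` and `multi_clf` are hypothetical scikit-learn-style classifiers exposing `predict_proba`, and 0.5 is the detection threshold from the text.

```python
def classify_relation(x, binary_clf, multi_clf, relation_types):
    """Two-stage cascade: detect a relation first, then assign one of the
    7 relation types. A sketch, not the authors' implementation.
    """
    # Stage 1: relation detection; probability of the "relation" class.
    p_relation = binary_clf.predict_proba([x])[0][1]
    if p_relation <= 0.5:
        return None  # judged a non-relation instance
    # Stage 2: take the relation type with the highest probability.
    type_probs = multi_clf.predict_proba([x])[0]
    best = max(range(len(relation_types)), key=lambda i: type_probs[i])
    return relation_types[best]
```

Note the design choice described in the paper: the multi-class classifier is trained only on annotated relation instances, not on the thresholded output of the binary stage.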

extraction system

Appears in 5 sentences as: extraction system (5)
In Semi-supervised Relation Extraction with Large-scale Word Clustering
  1. We present a simple semi-supervised relation extraction system with large-scale word clustering.
    Page 1, “Abstract”
  2. For example, a relation extraction system needs to be able to extract an Employment relation between the entities US soldier and US in the phrase US soldier.
    Page 1, “Introduction”
  3. The performance of a supervised relation extraction system is usually degraded by the sparsity of lexical features.
    Page 1, “Introduction”
(2005), a state-of-the-art feature based relation extraction system.
    Page 3, “Feature Based Relation Extraction”
  5. We have described a semi-supervised relation extraction system with large-scale word clustering.
    Page 8, “Conclusion and Future Work”

baseline system

Appears in 4 sentences as: baseline system (4)
In Semi-supervised Relation Extraction with Large-scale Word Clustering
When training on different sizes of data, our semi-supervised approach consistently outperformed a state-of-the-art supervised baseline system.
    Page 1, “Abstract”
Section 4 describes in detail a state-of-the-art supervised baseline system.
    Page 2, “Introduction”
  3. We now describe a supervised baseline system with a very large set of features and its learning strategy.
    Page 3, “Feature Based Relation Extraction”
  4. Nonetheless, we believe our baseline system has achieved very competitive performance.
    Page 7, “Experiments”

UA

Appears in 4 sentences as: UA (5)
In Semi-supervised Relation Extraction with Large-scale Word Clustering
  1. Use All Prefixes (UA): UA produces a cluster feature at every available bit length with the hope that the underlying supervised system can learn proper weights of different cluster features during training.
    Page 6, “Cluster Feature Selection”
For example, if the full bit representation of “Apple” is “000”, UA would produce three cluster features: prefix1=0, prefix2=00 and prefix3=000.
    Page 6, “Cluster Feature Selection”
  3. UA 71.19 +0.49 1.5
    Page 7, “Experiments”
  4. Table 6 shows that all the 4 proposed methods improved baseline performance, with UA as the fastest and ES as the slowest.
    Page 7, “Experiments”
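The UA strategy in occurrences 1 and 2 above can be sketched directly; `bit_path` stands for a word's full Brown-cluster bit string, and the `prefixK=` feature naming follows the “Apple” example in the text.

```python
def ua_cluster_features(bit_path):
    """Use All Prefixes: emit one cluster feature per available bit length,
    leaving it to the supervised learner to weight each prefix length.

    bit_path: the full bit-string cluster representation of a word, e.g. "000".
    """
    return ["prefix%d=%s" % (k, bit_path[:k]) for k in range(1, len(bit_path) + 1)]

# The "Apple" example from the text:
print(ua_cluster_features("000"))
# ['prefix1=0', 'prefix2=00', 'prefix3=000']
```

Because no development-set validation is involved, this matches the paper's finding that UA is the fastest of the four selection methods.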

dependency parsing

Appears in 3 sentences as: dependency parsing (3)
In Semi-supervised Relation Extraction with Large-scale Word Clustering
Given an entity pair and a sentence containing the pair, both approaches usually start with multiple level analyses of the sentence such as tokenization, partial or full syntactic parsing, and dependency parsing.
    Page 1, “Introduction”
Preprocessing of the ACE documents: We used the Stanford parser for syntactic and dependency parsing.
    Page 6, “Experiments”
(2008) for dependency parsing.
    Page 8, “Experiments”

See all papers in Proc. ACL 2011 that mention dependency parsing.

See all papers in Proc. ACL that mention dependency parsing.

Back to top.