Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering
Huang, Jian and Taylor, Sarah M. and Smith, Jonathan L. and Fotiadis, Konstantinos A. and Giles, C. Lee

Article Structure

Abstract

Coreferencing entities across documents in a large corpus enables advanced document understanding tasks such as question answering.

Introduction

A named entity that represents a person, an organization or a geo-location may appear within and across documents in different forms.

Methods 2.1 Document Level and Profile Based CDC

We make distinctions between document level and profile based cross document coreference.

Experiments

In this section, we first formally define the evaluation metrics, followed by an introduction to the benchmark test sets and the system’s performance.

Related Work

The original work in (Bagga and Baldwin, 1998) proposed a CDC system by first performing WDC and then disambiguating based on the summary sentences of the chains.

Conclusions

We have presented a profile-based Cross Document Coreference (CDC) approach based on a novel fuzzy relational clustering algorithm KARC.

Topics

coreference

Appears in 30 sentences as: Coreference (1) coreference (24) Coreferencing (1) coreferent (4) coreferents (1)
In Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering
  1. Coreferencing entities across documents in a large corpus enables advanced document understanding tasks such as question answering.
    Page 1, “Abstract”
  2. This paper presents a novel cross document coreference approach that leverages the profiles of entities which are constructed by using information extraction tools and reconciled by using a within-document coreference module.
    Page 1, “Abstract”
  3. We compare the kernelized clustering method with a popular fuzzy relational clustering algorithm (FRC) and show 5% improvement in coreference performance.
    Page 1, “Abstract”
  4. Cross document coreference (CDC) is the task of consolidating named entities that appear in multiple documents according to their real referents.
    Page 1, “Introduction”
  5. within-document coreference (WDC), which limits the scope of disambiguation to within the boundary of a document.
    Page 1, “Introduction”
  6. Cross document coreference , on the other hand, is a more challenging task because these linguistics cues and sentence structures no longer apply, given the wide variety of context and styles in different documents.
    Page 1, “Introduction”
  7. Cross document coreference research has recently become more popular due to the increasing interests in the web person search task (Artiles et al., 2007).
    Page 1, “Introduction”
  8. We review related work in cross document coreference and conclude in Section 5.
    Page 2, “Introduction”
  9. We make distinctions between document level and profile based cross document coreference .
    Page 2, “Methods 2.1 Document Level and Profile Based CDC”
  10. persons in this work), a within-document coreference (WDC) module then links the entities deemed as referring to the same underlying identity into a WDC chain.
    Page 2, “Methods 2.1 Document Level and Profile Based CDC”
  11. Therefore, the chained entities placed in a name cluster are deemed as coreferent .
    Page 2, “Methods 2.1 Document Level and Profile Based CDC”

See all papers in Proc. ACL 2009 that mention coreference.


named entities

Appears in 8 sentences as: named entities (4) named entity (4)
In Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering
  1. A named entity that represents a person, an organization or a geo-location may appear within and across documents in different forms.
    Page 1, “Introduction”
  2. Cross document coreference (CDC) is the task of consolidating named entities that appear in multiple documents according to their real referents.
    Page 1, “Introduction”
  3. Document level CDC makes a simplifying assumption that a named entity (and its variants) in a document has one underlying real identity.
    Page 2, “Methods 2.1 Document Level and Profile Based CDC”
  4. The named entity President Bush is extracted from the sentence “President Bush addressed the nation from the Oval Office Monday.”
    Page 2, “Methods 2.1 Document Level and Profile Based CDC”
  5. First, the attribute information about the person named entity includes first/middle/last names, gender, mention, etc.
    Page 6, “Experiments”
  6. In addition, AeroText extracts relationship information between named entities , such as Family, List, Employment, Ownership, Citizen-Resident-Religion-Ethnicity and so on, as specified in the ACE evaluation.
    Page 6, “Experiments”
  7. This permits inexact matching of named entities due to name
    Page 6, “Experiments”
  8. KARC partitions named entities based on their profiles constructed by an information extraction tool.
    Page 8, “Conclusions”


SEG

Appears in 5 sentences as: SEG (5)
In Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering
  1. The chained entities are first objectified into the relation strength matrix R using SEG , the details of which are described in the following section.
    Page 4, “Methods 2.1 Document Level and Profile Based CDC”
  2. Algorithm 2 SEG (Freund et al., 1997) Input: initial weight distribution p^1; learning rate η > 0; training set {⟨x^t, y^t⟩} 1: for t = 1 to T do 2: Predict using:
    Page 5, “Methods 2.1 Document Level and Profile Based CDC”
  3. We adopt the Specialist Exponentiated Gradient ( SEG ) (Freund et al., 1997) algorithm to learn the mixing weights of the specialists’ prediction (Algorithm 2) in an online manner.
    Page 5, “Methods 2.1 Document Level and Profile Based CDC”
  4. The SEG algorithm first predicts the value ŷ^t
    Page 5, “Methods 2.1 Document Level and Profile Based CDC”
  5. We then used the SEG algorithm to learn the weight distribution model.
    Page 7, “Experiments”
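
The online weight update at the heart of SEG can be illustrated with a minimal sketch. This is a simplified exponentiated-gradient expert update, not the exact specialist variant of Freund et al. (1997), which additionally renormalizes over only the "awake" specialists in each round; the learning rate and the toy predictions below are illustrative assumptions.

```python
import math

def eg_update(weights, preds, y, eta=0.5):
    """One round of an exponentiated-gradient update over experts
    (a simplified sketch of the algorithm family SEG belongs to)."""
    # Mixture prediction from the current weight distribution.
    y_hat = sum(w * p for w, p in zip(weights, preds))
    # Multiplicatively reward experts with low squared loss, then renormalize.
    new = [w * math.exp(-eta * (p - y) ** 2) for w, p in zip(weights, preds)]
    z = sum(new)
    return y_hat, [w / z for w in new]

# Four hypothetical specialists predicting whether two mentions corefer (y = 1).
weights = [0.25, 0.25, 0.25, 0.25]
for preds, y in [([1.0, 0.0, 1.0, 0.5], 1.0), ([0.9, 0.1, 0.8, 0.4], 1.0)]:
    y_hat, weights = eg_update(weights, preds, y)
```

After a few rounds the distribution shifts toward the specialists whose predictions track the labels, which is the behavior the mixing-weight learning described above relies on.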


co-occurrence

Appears in 3 sentences as: co-occurrence (3)
In Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering
  1. For instance, the similarity between the occupations ‘President’ and ‘Commander in Chief’ can be computed using the JC semantic distance (Jiang and Conrath, 1997) with WordNet; the similarity of co-occurrence with other people can be measured by the Jaccard coefficient.
    Page 3, “Methods 2.1 Document Level and Profile Based CDC”
  2. a match in a family relationship is considered more important than in a co-occurrence relationship.
    Page 5, “Methods 2.1 Document Level and Profile Based CDC”
  3. To decide whether two names in the co-occurrence or family relationship match, we use the SoftTFIDF measure (Cohen et al., 2003), which is a hybrid matching scheme that combines the token-based TFIDF with the Jaro-Winkler string distance metric.
    Page 6, “Experiments”
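
The Jaccard coefficient mentioned above for co-occurrence similarity is straightforward to sketch. The co-occurrence sets below are invented for illustration; the paper's actual sets come from its information extraction pipeline.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| between two sets,
    e.g. the sets of people co-occurring with two entity mentions."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical co-occurring persons for two mentions of the same surname.
sim = jaccard({"Cheney", "Rice", "Powell"}, {"Cheney", "Rice", "Clinton"})
# 2 shared names out of 4 distinct names
```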


feature sets

Appears in 3 sentences as: feature sets (3)
In Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering
  1. Since different feature sets , NLP tools, etc. are used in different benchmarked systems, we are also interested in comparing the proposed algorithm with different soft relational clustering variants.
    Page 7, “Experiments”
  2. With the same feature sets and distance function, KARC-S outperforms FRC in F score by about 5%.
    Page 7, “Experiments”
  3. Future research directions include developing rich feature sets and using corpus level or external information.
    Page 8, “Conclusions”


feature space

Appears in 3 sentences as: feature space (3)
In Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering
  1. Kernelization (Scholkopf and Smola, 2002) is a machine learning technique to transform patterns in the data space to a high-dimensional feature space so that the structure of the data can be more easily and adequately discovered.
    Page 3, “Methods 2.1 Document Level and Profile Based CDC”
  2. Using the kernel trick, the squared distance between Φ(r_j) and Φ(w_i) in the feature space H can be computed as:
    Page 3, “Methods 2.1 Document Level and Profile Based CDC”
  3. We measure the kernelized XBI (KXBI) in the feature space as,
    Page 4, “Methods 2.1 Document Level and Profile Based CDC”
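
The kernel-trick identity referenced above, ||Φ(x) − Φ(y)||² = K(x,x) − 2K(x,y) + K(y,y), can be sketched for the simple point-to-point case (the paper's w_i are cluster prototypes in feature space, which require weighted kernel sums; the RBF kernel here is an assumed choice for illustration).

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2); stands in for
    any positive-definite kernel."""
    return math.exp(-gamma * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def feature_space_sq_dist(x, y, k=rbf):
    """||Φ(x) - Φ(y)||^2 = K(x,x) - 2 K(x,y) + K(y,y),
    computed without ever forming Φ explicitly."""
    return k(x, x) - 2.0 * k(x, y) + k(y, y)

d2 = feature_space_sq_dist((0.0, 0.0), (1.0, 0.0))
```

The point is that the (possibly infinite-dimensional) feature map Φ never has to be materialized; only kernel evaluations on the original data are needed.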
