Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
Singh, Sameer and Subramanya, Amarnag and Pereira, Fernando and McCallum, Andrew

Article Structure

Abstract

Cross-document coreference, the task of grouping all the mentions of each entity in a document collection, arises in information extraction and automated knowledge base construction.

Introduction

Given a collection of mentions of entities extracted from a body of text, coreference or entity resolution consists of clustering the mentions such that two mentions belong to the same cluster if and only if they refer to the same entity.
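As a minimal illustration of this clustering view, the sketch below (in Python, with invented mention ids, mention strings, and entity ids) represents a coreference solution as a partition of the mentions: two mentions corefer if and only if they are assigned the same entity.

# A minimal sketch of the clustering view of coreference; the mention texts
# and entity ids are invented for illustration.
from collections import defaultdict

mentions = {
    "m1": "Barack Obama", "m2": "Obama", "m3": "President Obama",
    "m4": "Michelle Obama", "m5": "the First Lady",
}

# A coreference solution assigns every mention to exactly one entity; two
# mentions corefer if and only if they share an entity id.
assignment = {"m1": "e1", "m2": "e1", "m3": "e1", "m4": "e2", "m5": "e2"}

entities = defaultdict(list)
for mention_id, entity_id in assignment.items():
    entities[entity_id].append(mentions[mention_id])

for entity_id, cluster in sorted(entities.items()):
    print(entity_id, cluster)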

Cross-document Coreference

The problem of coreference is to identify the sets of mention strings that refer to the same underlying entity.

Distributed MAP Inference

The key observation that enables distribution is that the acceptance probability computation of a proposal only examines a few factors that are not common to the previous and next configurations (Eq.
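A hedged sketch of the locality behind this observation: when a proposal moves one mention between entities in a pairwise factor model, only the factors touching that mention differ between the current and proposed configurations, so the Metropolis-Hastings acceptance probability can be computed from those factors alone. The similarity function, data layout, and temperature below are illustrative assumptions, not the paper's actual feature set.

import math

def similarity(m1, m2):
    # Toy pairwise affinity: word overlap, shifted so dissimilar pairs score negatively.
    a, b = set(m1.lower().split()), set(m2.lower().split())
    return len(a & b) / len(a | b) - 0.5

def local_score(mention, entity):
    # Sum of the pairwise factors between `mention` and the mentions of one entity.
    return sum(similarity(mention, other) for other in entity)

def acceptance_prob(mention, source, target, temperature=1.0):
    # Moving `mention` from `source` to `target` only changes the factors that
    # touch `mention`, so the acceptance ratio needs just these two local sums.
    delta = local_score(mention, target) - local_score(mention, source)
    return min(1.0, math.exp(delta / temperature))

# Example: should "Obama" leave its current entity and join the other one?
e1 = ["Barack Obama", "President Obama"]   # the other mentions currently grouped with "Obama"
e2 = ["Michelle Obama", "the First Lady"]  # the proposed destination entity
print(acceptance_prob("Obama", source=e1, target=e2))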

Hierarchical Coreference Model

The proposal function for MCMC-based MAP inference presents changes to the current entities.
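One way to picture such a proposal function, assuming a two-level hierarchy in which mentions are grouped into sub-entities and sub-entities into entities: a proposal detaches a whole sub-entity and re-attaches it under another entity, moving many mentions in a single step. The data layout and move below are illustrative assumptions, not the paper's exact operators.

import random

def propose(entities):
    # `entities` is a list of entities, each a list of sub-entities, each a
    # list of mention strings. The move re-attaches one whole sub-entity.
    proposal = [[list(sub) for sub in entity] for entity in entities]
    src = random.randrange(len(proposal))
    sub_entity = proposal[src].pop(random.randrange(len(proposal[src])))
    dst = random.randrange(len(proposal))
    proposal[dst].append(sub_entity)                 # many mentions move at once
    return [entity for entity in proposal if entity]  # drop emptied entities

# Example configuration: two entities, three sub-entities.
config = [[["Barack Obama", "Obama"], ["President Obama"]],
          [["Michelle Obama"]]]
print(propose(config))

Because a whole block of mentions moves in one accepted proposal, the hierarchy lets inference make jumps that would otherwise require many single-mention moves.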

Experiments

We evaluate our models and algorithms on a number of datasets.

Related Work

Although the cross-document coreference problem is challenging and lacks large labeled datasets, its ubiquitous role as a key component of many knowledge discovery tasks has inspired several efforts.

Conclusions

Motivated by the problem of solving coreference on billions of mentions from all of the newswire documents from the past few decades, we make the following contributions.

Topics

coreference

Appears in 34 sentences as: Coreference (2) coreference (31) coreferent (3)
In Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
  1. Cross-document coreference, the task of grouping all the mentions of each entity in a document collection, arises in information extraction and automated knowledge base construction.
    Page 1, “Abstract”
  2. To solve the problem we propose two ideas: (a) a distributed inference technique that uses parallelism to enable large scale processing, and (b) a hierarchical model of coreference that represents uncertainty over multiple granularities of entities to facilitate more effective approximate inference.
    Page 1, “Abstract”
  3. Given a collection of mentions of entities extracted from a body of text, coreference or entity resolution consists of clustering the mentions such that two mentions belong to the same cluster if and only if they refer to the same entity.
    Page 1, “Introduction”
  4. While significant progress has been made in within-document coreference (Ng, 2005; Culotta et al., 2007; Haghighi and Klein, 2007; Bengston and Roth, 2008; Haghighi and Klein, 2009; Haghighi and Klein, 2010), the larger problem of cross-document coreference has not received as much attention.
    Page 1, “Introduction”
  5. Unlike inference in other language processing tasks that scales linearly in the size of the corpus, the hypothesis space for coreference grows super-exponentially with the number of mentions.
    Page 1, “Introduction”
  6. We believe that cross-document coreference resolution is most useful when applied to a very large set of documents, such as all the news articles published during the last 20 years.
    Page 1, “Introduction”
  7. In this paper we propose a model and inference algorithms that can scale the cross-document coreference problem to corpora of that size.
    Page 1, “Introduction”
  8. Much of the previous work in cross-document coreference (Bagga and Baldwin, 1998; Ravin and Kazi, 1999; Gooi and Allan, 2004; Pedersen et al., 2006; Rao et al., 2010) groups mentions into entities with some form of greedy clustering using a pairwise mention similarity or distance function based on mention text, context, and document-level statistics. (A sketch of this greedy pairwise approach appears after this list.)
    Page 1, “Introduction”
  9. Other previous work attempts to address some of the above concerns by mapping coreference to inference on an undirected graphical model (Culotta et al., 2007; Poon et al., 2008; Wellner et al., 2004; Wick et al., 2009a).
    Page 1, “Introduction”
  10. Figure 1: Cross-Document Coreference Problem: I
    Page 2, “Introduction”
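The greedy pairwise-clustering style of prior work quoted a few entries above can be sketched roughly as follows; the single-link scoring, threshold, and word-overlap similarity are illustrative assumptions rather than any particular cited system.

def greedy_cluster(mentions, sim, threshold=0.5):
    # Attach each mention to the existing entity with the highest single-link
    # similarity, or start a new entity when nothing clears the threshold.
    entities = []
    for mention in mentions:
        best, best_score = None, threshold
        for entity in entities:
            score = max(sim(mention, other) for other in entity)
            if score >= best_score:
                best, best_score = entity, score
        if best is None:
            entities.append([mention])
        else:
            best.append(mention)
    return entities

def overlap(a, b):
    # Toy pairwise similarity on mention text only.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

print(greedy_cluster(["Barack Obama", "President Barack Obama", "Michelle Obama"], overlap))
# -> [['Barack Obama', 'President Barack Obama'], ['Michelle Obama']]

Because each decision is local and never revisited, an early bad merge propagates, which is one of the concerns the graphical-model formulation is meant to address.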


graphical model

Appears in 4 sentences as: graphical model (3) graphical models (1)
In Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
  1. Other previous work attempts to address some of the above concerns by mapping coreference to inference on an undirected graphical model (Culotta et al., 2007; Poon et al., 2008; Wellner et al., 2004; Wick et al., 2009a).
    Page 1, “Introduction”
  2. In this work we first distribute MCMC-based inference for the graphical model representation of coreference.
    Page 2, “Introduction”
  3. Our representation of the problem as an undirected graphical model, and performing distributed inference on it, provides a combination of advantages not available in any of these approaches.
    Page 9, “Related Work”
  4. In addition to representing features from all of the related work, graphical models can also use more complex entity-wide features (Culotta et al., 2007; Wick et al., 2009a), and parameters can be learned using supervised (Collins, 2002) or semi-supervised techniques (Mann and McCallum, 2008).
    Page 9, “Related Work”


F1 score

Appears in 3 sentences as: F1 score (2) F1 Scores (1)
In Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
  1. On this dataset, our proposed model yields a B3 (Bagga and Baldwin, 1998) F1 score of 73.7%, improving over the baseline by 16% absolute (corresponding to 38% error reduction).
    Page 2, “Introduction”
  2. Table 2: F1 Scores on the Wikipedia Link Data.
    Page 8, “Experiments”
  3. We use N = 100, 500 and the B3 F1 scores obtained for each case are shown in Figure 7.
    Page 8, “Experiments”
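The B3 measure (Bagga and Baldwin, 1998) cited in the excerpts above scores each mention by comparing the predicted cluster containing it with its gold cluster, then averages: per-mention precision is the overlap divided by the predicted cluster size, recall is the overlap divided by the gold cluster size, and F1 is their harmonic mean. A small self-contained sketch, with invented mention ids:

def b_cubed(predicted, gold):
    # `predicted` and `gold` are lists of sets of mention ids; both clusterings
    # are assumed to cover the same mentions.
    pred_of = {m: cluster for cluster in predicted for m in cluster}
    gold_of = {m: cluster for cluster in gold for m in cluster}
    mentions = list(gold_of)
    precision = sum(len(pred_of[m] & gold_of[m]) / len(pred_of[m]) for m in mentions) / len(mentions)
    recall = sum(len(pred_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: one wrongly merged mention (m3) lowers both precision and recall.
pred = [{"m1", "m2", "m3"}, {"m4"}]
gold = [{"m1", "m2"}, {"m3", "m4"}]
print(b_cubed(pred, gold))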
