Robust Entity Clustering via Phylogenetic Inference

Entity clustering must determine when two named-entity mentions refer to the same entity.

Variation poses a serious challenge for determining who or what a name refers to.

Cross-document coreference resolution (CDCR) was first introduced by Bagga and Baldwin (1998b).

Let x = (x1, .

Given a few constants that are referenced in the main text, we assume that the corpus d was generated as follows.

We use a block Gibbs sampler, which from an initial state (p0, z0) repeats these steps: 1.

Evaluating the likelihood and its partial derivatives with respect to the parameters of the model requires marginalizing over our latent variables.

From a single phylogeny p, we deterministically obtain a clustering e by removing the root ◊.
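
The step from a phylogeny to a clustering can be illustrated with a minimal sketch. It assumes the phylogeny is stored as a parent array (an illustrative encoding, not necessarily the authors' data structure), with a hypothetical sentinel `ROOT` standing in for the root. Removing the root leaves one subtree per entity, so each mention is grouped with the root's child that it descends from.

```python
from collections import defaultdict

ROOT = -1  # hypothetical sentinel standing in for the root of the phylogeny


def clusters_from_phylogeny(parent):
    """Group mentions by the subtree they fall into once the root is removed.

    parent[i] is the index of mention i's parent, or ROOT if mention i
    hangs directly off the root. The parent array is assumed to be acyclic.
    """
    def founder(i):
        # Walk upward until reaching a mention whose parent is the root.
        while parent[i] != ROOT:
            i = parent[i]
        return i

    groups = defaultdict(list)
    for i in range(len(parent)):
        groups[founder(i)].append(i)
    return sorted(groups.values())


# Mentions 0 and 3 are children of the root; 1 and 2 descend from 0; 4 from 3.
parent = [ROOT, 0, 1, ROOT, 3]
clusters = clusters_from_phylogeny(parent)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

Two mentions land in the same cluster exactly when they share a non-root ancestor, which matches the notion of coreference as common ancestry.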

In this section, we describe experiments on three different datasets.

Our primary contribution consists of new modeling ideas, and associated inference techniques, for the problem of cross-document coreference resolution.

Appears in 21 sentences as: coref (1) corefer (1) coreference (14) coreferent (4) coreferents (1)

In *Robust Entity Clustering via Phylogenetic Inference*

- In this paper, we propose a model for cross-document coreference resolution that achieves robustness by learning similarity from unlabeled data. Page 1, “Abstract”
- even identical, do not necessarily corefer. Page 1, “Introduction”
- In this paper, we propose a method for jointly (1) learning similarity between names and (2) clustering name mentions into entities, the two major components of cross-document coreference resolution systems (Baron and Freedman, 2008; Finin et al., 2009; Rao et al., 2010; Singh et al., 2011; Lee et al., 2012; Green et al., 2012). Page 1, “Introduction”
- Such creative spellings are especially common on Twitter and other social media; we give more examples of coreferents learned by our model in Section 8.4. Page 1, “Introduction”
- The procedure is applicable to any model capable of producing a posterior over coreference decisions. Page 2, “Introduction”
- Cross-document coreference resolution (CDCR) was first introduced by Bagga and Baldwin (1998b). Page 2, “Overview and Related Work”
- Most approaches since then are based on the intuitions that coreferent names tend to have “similar” spellings and tend to appear in “similar” contexts. Page 2, “Overview and Related Work”
- We adopt a “phylogenetic” generative model of coreference. Page 2, “Overview and Related Work”
- The basic insight is that coreference is created when an author thinks of an entity that was mentioned earlier in a similar context, and mentions it again in a similar way. Page 2, “Overview and Related Work”
- To apply our model to the CDCR task, we observe that the probability that two name mentions are coreferent is the probability that they arose from a common ancestor in the phylogeny. Page 2, “Overview and Related Work”
- Thus, our sampled phylogenies tend to make similar names coreferent, especially long or unusual names that would be expensive to generate repeatedly, and especially in contexts that are topically similar and therefore have a higher prior probability of coreference. Page 2, “Overview and Related Work”

See all papers in *Proc. ACL 2014* that mention coreference.

Appears in 6 sentences as: Gibbs sampler (3) Gibbs sampling (3)

- We present a block Gibbs sampler for posterior inference and an empirical evaluation on several datasets. Page 1, “Abstract”
- We use a block Gibbs sampler, which from an initial state (p0, z0) repeats these steps: 1. Page 5, “Inference by Block Gibbs Sampling”
- The topics of context words are assumed exchangeable, and so we re-sample them using Gibbs sampling (Griffiths and Steyvers, 2004). Page 5, “Inference by Block Gibbs Sampling”
- Unfortunately, this is prohibitively expensive for the (nonexchangeable) topics of the named mentions x. A Gibbs sampler would have to choose a new value for x.z with probability proportional to the resulting joint probability of the full sample. Page 5, “Inference by Block Gibbs Sampling”
- The \1136 factors in (5) approximate the topic model’s prior distribution over z. is proportional to the probability that a Gibbs sampling step for an ordinary topic model would choose this value of x.z. Page 5, “Inference by Block Gibbs Sampling”
- (1) We fix topics z0 via collapsed Gibbs sampling (Griffiths and Steyvers, 2004). Page 6, “Inference by Block Gibbs Sampling”
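
For readers unfamiliar with the collapsed Gibbs sampling cited above (Griffiths and Steyvers, 2004), the following is a minimal sketch for an ordinary topic model in which every token is treated as exchangeable. It is not the block sampler of the paper: it ignores phylogenies and named mentions entirely. The function name, the hyperparameter values `alpha` and `beta`, and the toy corpus are assumptions made for illustration.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, sweeps=20, seed=0):
    """Resample a topic for every token, one token at a time.

    docs is a list of documents, each a list of integer word ids.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                               # n(k)
    z = []

    def update(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Random initial assignment of a topic to each token.
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            update(d, w, k, +1)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                update(d, w, z[d][i], -1)
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z[d][i] = rng.choices(range(num_topics), weights=weights)[0]
                update(d, w, z[d][i], +1)
    return z, doc_topic, topic_word


docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
z, doc_topic, topic_word = collapsed_gibbs_lda(docs, num_topics=2, vocab_size=4)
```

Each update needs only the count tables, which is what makes this sampler cheap for exchangeable tokens. The quoted sentences explain why the same shortcut does not carry over to the nonexchangeable topics of named mentions.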

Appears in 5 sentences as: coreference resolution (5)

- In this paper, we propose a model for cross-document coreference resolution that achieves robustness by learning similarity from unlabeled data. Page 1, “Abstract”
- In this paper, we propose a method for jointly (1) learning similarity between names and (2) clustering name mentions into entities, the two major components of cross-document coreference resolution systems (Baron and Freedman, 2008; Finin et al., 2009; Rao et al., 2010; Singh et al., 2011; Lee et al., 2012; Green et al., 2012). Page 1, “Introduction”
- Cross-document coreference resolution (CDCR) was first introduced by Bagga and Baldwin (1998b). Page 2, “Overview and Related Work”
- Name similarity is also an important component of within-document coreference resolution, and efforts in that area bear resemblance to our approach. Page 2, “Overview and Related Work”
- Our primary contribution consists of new modeling ideas, and associated inference techniques, for the problem of cross-document coreference resolution. Page 9, “Conclusions”

Appears in 4 sentences as: clusterings (4)

- Our model gives a distribution over phylogenies p (given observations x and learned parameters Φ), and thus gives a posterior distribution over clusterings e, which can be used to answer various queries. Page 7, “Consensus Clustering”
- More similar clusterings achieve larger R, with R(e′, e) = 1 iff e′ = e. In all cases, 0 ≤ R(e′, e) = R(e, e′) ≤ 1. Page 7, “Consensus Clustering”
- As explained above, the s_ij are coreference probabilities that can be estimated from a sample of clusterings e. Page 7, “Consensus Clustering”
- For PHYLO, the entity clustering is the result of (1) training the model using EM, (2) sampling from the posterior to obtain a distribution over clusterings, and (3) finding a consensus clustering. Page 8, “Experiments”
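
Estimating pairwise coreference probabilities from a sample of clusterings amounts to counting how often two mentions share a cluster. The sketch below shows only that estimate, not the consensus-clustering step that consumes it. The function name and the four toy samples are assumptions made for illustration.

```python
from itertools import combinations


def coref_probabilities(samples, num_mentions):
    """Return s, where s[i][j] is the fraction of samples with i and j together.

    samples is a list of clusterings; each clustering is a list of clusters,
    and each cluster is a list of mention indices.
    """
    counts = [[0] * num_mentions for _ in range(num_mentions)]
    for clustering in samples:
        for cluster in clustering:
            for i, j in combinations(sorted(cluster), 2):
                counts[i][j] += 1
                counts[j][i] += 1
    total = len(samples)
    s = [[c / total for c in row] for row in counts]
    for i in range(num_mentions):
        s[i][i] = 1.0  # a mention is trivially coreferent with itself
    return s


# Four hypothetical sampled clusterings over mentions 0..3.
samples = [
    [[0, 1, 2], [3]],
    [[0, 1], [2, 3]],
    [[0, 1], [2], [3]],
    [[0], [1], [2, 3]],
]
s = coref_probabilities(samples, 4)
print(s[0][1], s[2][3], s[0][3])  # 0.75 0.5 0.0
```

Because the estimate is a simple average over samples, it can be computed for any model that can produce sampled clusterings.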

Appears in 4 sentences as: entity type (4)

- However, any topic may generate an entity type, e.g. PERSON, which is then replaced by a specific name: when PERSON is generated, the model chooses a previous mention of any person and copies it, perhaps mutating its name. Alternatively, the model may manufacture Page 2, “Generative Model of Coreference”
- (c) If w_dk is a named entity type (PERSON, PLACE, ORG, . Page 3, “Detailed generative story”
- One could also make more specific versions of any feature by conjoining it with the entity type t. Page 4, “Detailed generative story”
- More generally, the probability (2) may also be conditioned on other variables such as on the languages p.ℓ and x.ℓ (this leaves room for a transliteration model when x.ℓ ≠ p.ℓ) and on the entity type x.t. Page 4, “Detailed generative story”

Appears in 4 sentences as: model parameterized (1) model parameters (2) model’s parameters (1)

- For learning, we iteratively adjust our model’s parameters to better explain our samples. Page 2, “Overview and Related Work”
- (2012) we use topics as the contexts, but learn mention topics jointly with other model parameters. Page 2, “Overview and Related Work”
- This is a conditional log-linear model parameterized by φ, where φ_k ~ N(0, σ_k^2). Page 4, “Detailed generative story”
- E-step: Collect samples by MCMC simulation as in §5, given current model parameters θ and φ. Page 6, “Parameter Estimation”

Appears in 3 sentences as: generative process (2) generative process: (1)

- The generative process assumes that each entity mention arises from copying and optionally mutating an earlier name from a similar context. Page 1, “Abstract”
- Our model is an evolutionary generative process based on the name variation model of Andrews et al. Page 1, “Introduction”
- This can also relate seemingly dissimilar names via multiple steps in the generative process: Page 1, “Introduction”

Appears in 3 sentences as: log-linear (3)

- This is a conditional log-linear model parameterized by φ, where φ_k ~ N(0, σ_k^2). Page 4, “Detailed generative story”
- When a is the special end-of-string symbol #, the only allowed edits are the insertion (g) and the substitution We define the edit probability using a locally normalized log-linear model: Page 4, “Detailed generative story”
- We leave other hyperparameters fixed: 16 latent topics, and Gaussian priors N(0, 1) on all log-linear parameters. Page 8, “Experiments”
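
A locally normalized log-linear model scores each candidate edit by a weighted sum of its features and normalizes over the edits allowed at that step. The sketch below shows that computation in general form. The edit labels, feature names, and weights are hypothetical and are not taken from the paper.

```python
import math


def edit_distribution(weights, features_by_edit):
    """Return p(edit) proportional to exp(sum of the weights of its features).

    weights maps a feature name to its weight; features_by_edit maps each
    allowed edit to the list of features that fire for it.
    """
    scores = {
        edit: sum(weights.get(f, 0.0) for f in feats)
        for edit, feats in features_by_edit.items()
    }
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores.values())
    unnormalized = {edit: math.exp(s - top) for edit, s in scores.items()}
    norm = sum(unnormalized.values())
    return {edit: v / norm for edit, v in unnormalized.items()}


# Hypothetical weights and edits available at one input character.
weights = {"copy": 2.0, "substitute": -1.0, "insert": -1.5, "delete": -1.5}
features_by_edit = {
    "copy a": ["copy"],
    "substitute a->e": ["substitute"],
    "insert e": ["insert"],
    "delete a": ["delete"],
}
probs = edit_distribution(weights, features_by_edit)
```

Normalizing over only the edits available at the current step is what makes the model locally normalized, so the probability of a whole mutation is a product of such per-step distributions.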

Appears in 3 sentences as: named entities (1) named entity (2)

- (c) If w_dk is a named entity type (PERSON, PLACE, ORG, . Page 3, “Detailed generative story”
- Each context word and each named entity is associated with a latent topic. Page 5, “Inference by Block Gibbs Sampling”
- This process treats all topics as exchangeable, including those associated with named entities. Page 6, “Inference by Block Gibbs Sampling”

Appears in 3 sentences as: topic model (2) topic model’s (1) topical model (1)

- Our novel approach features: §4.1 A topical model of which entities from previ- Page 1, “Introduction”
- The entire corpus, including these entities, is generated according to standard topic model assumptions; we first generate a topic distribution for a document, then sample topics and words for the document (Blei et al., 2003). Page 2, “Generative Model of Coreference”
- The \1136 factors in (5) approximate the topic model’s prior distribution over z. is proportional to the probability that a Gibbs sampling step for an ordinary topic model would choose this value of x.z. Page 5, “Inference by Block Gibbs Sampling”
