Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-Art
Veselin Stoyanov, Nathan Gilbert, Claire Cardie, and Ellen Riloff

Article Structure

Introduction

As is common for many natural language processing problems, the state-of-the-art in noun phrase (NP) coreference resolution is typically quantified based on system performance on manually annotated text corpora.

Coreference Task Definitions

This paper studies the six most commonly used coreference resolution data sets.

Coreference Subtask Analysis

Coreference resolution is a complex task that requires solving numerous nontrivial subtasks, such as syntactic analysis, semantic class tagging, pleonastic pronoun identification, and antecedent identification, to name a few.

Resolution Complexity

Different types of anaphora that have to be handled by coreference resolution systems exhibit different properties.

Related Work

The bulk of the relevant related work is described in earlier sections, as appropriate.

Conclusions

We examine the state-of-the-art in NP coreference resolution.

Topics

coreference

Appears in 80 sentences as: corefer (1) Coreference (7) coreference (82) coreferent (5)
In Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-Art
  1. As is common for many natural language processing problems, the state-of-the-art in noun phrase (NP) coreference resolution is typically quantified based on system performance on manually annotated text corpora.
    Page 1, “Introduction”
  2. MUC-6 (1995), ACE NIST (2004)) and their use in many formal evaluations, as a field we can make surprisingly few conclusive statements about the state-of-the-art in NP coreference resolution.
    Page 1, “Introduction”
  3. In particular, it remains difficult to assess the effectiveness of different coreference resolution approaches, even in relative terms.
    Page 1, “Introduction”
  4. Coreference resolution scores range from 85-90% on the ACE 2004 and 2005 data sets to a much lower 60-70% on the MUC 6 and 7 data sets (e.g.
    Page 1, “Introduction”
  5. Or do differences in the coreference task definitions account for the differences in performance?
    Page 1, “Introduction”
  6. We have little understanding of which aspects of the coreference resolution problem are handled well or poorly by state-of-the-art systems.
    Page 1, “Introduction”
  7. The goal of this paper is to take initial steps toward making sense of the disparate performance results reported for NP coreference resolution.
    Page 1, “Introduction”
  8. For our investigations, we employ a state-of-the-art classification-based NP coreference resolver and focus on the widely used MUC and ACE coreference resolution data sets.
    Page 1, “Introduction”
  9. We hypothesize that performance variation within and across coreference resolvers is, at least in part, a function of (1) the (sometimes unstated) assumptions in evaluation methodologies, and (2) the relative difficulty of the benchmark text corpora.
    Page 1, “Introduction”
  10. With these in mind, Section 3 first examines three subproblems that play an important role in coreference resolution: named entity recognition, anaphoricity determination, and coreference element detection.
    Page 1, “Introduction”
  11. We quantitatively measure the impact of each of these subproblems on coreference resolution performance as a whole.
    Page 1, “Introduction”

coreference resolution

Appears in 48 sentences as: Coreference resolution (2) coreference resolution (33) Coreference Resolver (1) coreference resolver (8) coreference resolvers (7)
In Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-Art
  1. As is common for many natural language processing problems, the state-of-the-art in noun phrase (NP) coreference resolution is typically quantified based on system performance on manually annotated text corpora.
    Page 1, “Introduction”
  2. MUC-6 (1995), ACE NIST (2004)) and their use in many formal evaluations, as a field we can make surprisingly few conclusive statements about the state-of-the-art in NP coreference resolution.
    Page 1, “Introduction”
  3. In particular, it remains difficult to assess the effectiveness of different coreference resolution approaches, even in relative terms.
    Page 1, “Introduction”
  4. Coreference resolution scores range from 85-90% on the ACE 2004 and 2005 data sets to a much lower 60-70% on the MUC 6 and 7 data sets (e.g.
    Page 1, “Introduction”
  5. We have little understanding of which aspects of the coreference resolution problem are handled well or poorly by state-of-the-art systems.
    Page 1, “Introduction”
  6. The goal of this paper is to take initial steps toward making sense of the disparate performance results reported for NP coreference resolution.
    Page 1, “Introduction”
  7. For our investigations, we employ a state-of-the-art classification-based NP coreference resolver and focus on the widely used MUC and ACE coreference resolution data sets.
    Page 1, “Introduction”
  8. We hypothesize that performance variation within and across coreference resolvers is, at least in part, a function of (1) the (sometimes unstated) assumptions in evaluation methodologies, and (2) the relative difficulty of the benchmark text corpora.
    Page 1, “Introduction”
  9. With these in mind, Section 3 first examines three subproblems that play an important role in coreference resolution: named entity recognition, anaphoricity determination, and coreference element detection.
    Page 1, “Introduction”
  10. We quantitatively measure the impact of each of these subproblems on coreference resolution performance as a whole.
    Page 1, “Introduction”
  11. In Section 4, we quantify the difficulty of a text corpus with respect to coreference resolution by analyzing performance on different resolution classes.
    Page 2, “Introduction”

gold standard

Appears in 9 sentences as: gold standard (9)
In Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-Art
  1. To avoid ambiguity, we will use the term coreference element (CE) to refer to the set of linguistic expressions that participate in the coreference relation, as defined for each of the MUC and ACE tasks. At times, it will be important to distinguish between the CEs that are included in the gold standard — the annotated CEs — from those that are generated by the coreference resolution system — the extracted CEs.
    Page 2, “Coreference Task Definitions”
  2. the gold standard).
    Page 4, “Coreference Subtask Analysis”
  3. Unlike the MUC score, which counts links between CEs, B3 presumes that the gold standard and the system response are clusterings over the same set of CEs.
    Page 4, “Coreference Subtask Analysis”
  4. (No gold standard NEs are available for MUC7.)
    Page 5, “Coreference Subtask Analysis”
  5. Comparison to the BASELINE system (box 2) shows that using gold standard NEs leads to improvements on all data sets with the exception of ACE2 and ACE05, on which performance is virtually unchanged.
    Page 5, “Coreference Subtask Analysis”
  6. (1) PN-e: a proper name is assigned to this exact string match class if there is at least one preceding CE in its gold standard coreference chain that exactly matches it.
    Page 6, “Resolution Complexity” (a hedged class-assignment sketch follows this list)
  7. (2) PN-p: a proper name is assigned to this partial string match class if there is at least one preceding CE in its gold standard chain that has some content words in common.
    Page 6, “Resolution Complexity”
  8. class if no preceding CE in its gold standard chain has any content words in common with it.
    Page 7, “Resolution Complexity”
  9. We compute a MUC-RC score (for MUC Resolution Class) for class C as follows: we assume that all CEs that do not belong to class C are resolved correctly by taking the correct clustering for them from the gold standard.
    Page 7, “Resolution Complexity”
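The string-match classes quoted in sentences 6-8 are mechanical enough to sketch directly. The following is a minimal illustration, not the authors' code, of how a proper-name CE might be assigned to an exact-match, partial-match, or no-match class given the texts of the CEs that precede it in its gold standard chain; the function names, the stopword list, and the "PN-n" label for the third class are assumptions made for the example.

# Hedged sketch: assigning a proper-name CE to a string-match resolution class.
# `content_words` and the stopword list are stand-ins for the paper's notion
# of content words; "PN-n" is an assumed label for the no-match class.

STOPWORDS = {"the", "a", "an", "of", "and", "in", "for"}

def content_words(mention_text):
    # Lowercased tokens minus a small stopword list.
    return {tok for tok in mention_text.lower().split() if tok not in STOPWORDS}

def proper_name_class(ce_text, preceding_chain_texts):
    # preceding_chain_texts: texts of CEs earlier in this CE's gold standard chain.
    if any(prev == ce_text for prev in preceding_chain_texts):
        return "PN-e"   # exact string match with some preceding CE
    words = content_words(ce_text)
    if any(words & content_words(prev) for prev in preceding_chain_texts):
        return "PN-p"   # shares at least one content word
    return "PN-n"       # no content words in common (assumed label)

print(proper_name_class("Clinton", ["President Bill Clinton"]))   # -> PN-p

Under this reading, "Clinton" preceded by "President Bill Clinton" lands in the partial-match class, while a repeated "Bill Clinton" would be exact-match.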

named entity

Appears in 8 sentences as: Named Entities (1) named entities (2) Named Entity (2) named entity (4)
In Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-Art
  1. With these in mind, Section 3 first examines three subproblems that play an important role in coreference resolution: named entity recognition, anaphoricity determination, and coreference element detection.
    Page 1, “Introduction”
  2. Our results suggest that the availability of accurate detectors for anaphoricity or coreference elements could substantially improve the performance of state-of-the-art resolvers, while improvements to named entity recognition likely offer little gains.
    Page 1, “Introduction”
  3. However, the MUC definition excludes (1) “nested” named entities (NEs) (e.g.
    Page 2, “Coreference Task Definitions”
  4. This section examines the role of three such subtasks — named entity recognition, anaphoricity determination, and coreference element detection — in the performance of an end-to-end coreference resolution system.
    Page 3, “Coreference Subtask Analysis”
  5. We employ the Stanford CRF-based Named Entity Recognizer (Finkel et al., 2004) for named entity tagging.
    Page 3, “Coreference Subtask Analysis” (a hedged NE-tagging sketch follows this list)
  6. 3.3 Named Entities
    Page 4, “Coreference Subtask Analysis”
  7. Thus, we would expect a coreference resolution system to depend critically on its Named Entity (NE) extractor.
    Page 4, “Coreference Subtask Analysis”
  8. Proper Names: Three resolution classes cover CEs that are named entities (e.g.
    Page 6, “Resolution Complexity”
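Sentence 5 notes that named entity tagging is done with the Stanford CRF-based recognizer (Finkel et al., 2004). Purely as an illustration of what that preprocessing step yields, here is a hedged sketch using NLTK's wrapper for the Stanford tagger; the model and jar file paths are assumptions and must point at a locally downloaded Stanford NER distribution.

# Hedged sketch of NE tagging with the Stanford CRF-based NER via NLTK.
# The paths below are assumptions; point them at a local Stanford NER download.
from nltk.tag import StanfordNERTagger

tagger = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",   # assumed CRF model path
    "stanford-ner.jar",                        # assumed jar path
)

tokens = "Bill Clinton spoke at the University of Utah in Salt Lake City .".split()
for token, label in tagger.tag(tokens):
    # The 3-class model emits PERSON / ORGANIZATION / LOCATION / O;
    # runs of identically labeled tokens form one NE mention.
    print(token, label)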

precision and recall

Appears in 4 sentences as: Precision and recall (1) precision and recall (4)
In Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-Art
  1. The MUC scoring algorithm (Vilain et al., 1995) computes the F1 score (harmonic mean) of precision and recall based on the identification of unique coreference links.
    Page 4, “Coreference Subtask Analysis”
  2. The B3 algorithm (Bagga and Baldwin, 1998) computes a precision and recall score for each CE:
    Page 4, “Coreference Subtask Analysis” (a minimal B3 sketch follows this list)
  3. Precision and recall for a set of documents are computed as the mean over all CEs in the documents and the F1 score of precision and recall is reported.
    Page 4, “Coreference Subtask Analysis”
  4. The classification threshold, however, can be gainfully employed to control the tradeoff between precision and recall.
    Page 4, “Coreference Subtask Analysis”
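The B3 description in sentences 2-3 translates almost directly into code. Below is a minimal sketch, not the authors' implementation: gold and response clusterings are given over the same set of CEs, per-CE precision and recall are the overlap of the two clusters containing that CE divided by the response and gold cluster sizes respectively, and the document-set score is the mean over CEs, combined into F1.

# Minimal B3 sketch.  `gold` and `response` map each CE id to the frozenset of
# CE ids in its cluster; both clusterings must cover the same set of CEs.

def b_cubed(gold, response):
    assert gold.keys() == response.keys(), "B3 assumes identical CE sets"
    precisions, recalls = [], []
    for ce in gold:
        overlap = len(gold[ce] & response[ce])
        precisions.append(overlap / len(response[ce]))   # per-CE precision
        recalls.append(overlap / len(gold[ce]))           # per-CE recall
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example: gold clusters {a, b} and {c}; response clusters {a} and {b, c}.
gold = {"a": frozenset("ab"), "b": frozenset("ab"), "c": frozenset("c")}
resp = {"a": frozenset("a"), "b": frozenset("bc"), "c": frozenset("bc")}
print(b_cubed(gold, resp))   # -> (0.666..., 0.666..., 0.666...)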

BASELINE system

Appears in 3 sentences as: BASELINE system (2) Baseline System (1)
In Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-Art
  1. 3.2 Baseline System Results
    Page 3, “Coreference Subtask Analysis”
  2. In all remaining experiments, we learn the threshold from the training set as in the BASELINE system.
    Page 4, “Coreference Subtask Analysis” (a hedged threshold-selection sketch follows this list)
  3. Comparison to the BASELINE system (box 2) shows that using gold standard NEs leads to improvements on all data sets with the exception of ACE2 and ACE05, on which performance is virtually unchanged.
    Page 5, “Coreference Subtask Analysis”
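Sentence 2 says the classification threshold is learned from the training set, and the precision-and-recall excerpts note that the threshold controls the precision/recall tradeoff. One natural realization, offered only as an assumed sketch rather than the authors' exact procedure, is a sweep over candidate thresholds that keeps the value maximizing coreference F1 on the training documents; `score_with_threshold` is a hypothetical callback that clusters the training data at a given threshold and returns its F1.

# Hedged sketch: pick a pairwise-classifier threshold by sweeping candidates
# and keeping the one with the best training-set coreference F1.
# `score_with_threshold` is a hypothetical function supplied by the resolver.

def learn_threshold(score_with_threshold, candidates=None):
    candidates = candidates or [i / 20 for i in range(1, 20)]   # 0.05 .. 0.95
    best_t, best_f1 = None, -1.0
    for t in candidates:
        f1 = score_with_threshold(t)    # cluster training docs at t, score F1
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t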

end-to-end

Appears in 3 sentences as: end-to-end (3)
In Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-Art
  1. (2003) represents a fully automatic end-to-end resolver.
    Page 1, “Introduction”
  2. This section examines the role of three such subtasks — named entity recognition, anaphoricity determination, and coreference element detection — in the performance of an end-to-end coreference resolution system.
    Page 3, “Coreference Subtask Analysis”
  3. We expect CE detection to be an important subproblem for an end-to-end coreference system.
    Page 5, “Coreference Subtask Analysis”

F1 score

Appears in 3 sentences as: F1 score (3)
In Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-Art
  1. The MUC scoring algorithm (Vilain et al., 1995) computes the F1 score (harmonic mean) of precision and recall based on the identification of unique coreference links.
    Page 4, “Coreference Subtask Analysis” (the standard link-based formulation is given after this list)
  2. Precision and recall for a set of documents are computed as the mean over all CEs in the documents and the F1 score of precision and recall is reported.
    Page 4, “Coreference Subtask Analysis”
  3. We then count the number of unique correct/incorrect links that the system introduced on top of the correct partial clustering and compute precision, recall, and F1 score.
    Page 7, “Resolution Complexity”
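For reference, the link-based MUC score mentioned in sentence 1 is usually written as follows; this is the standard Vilain et al. (1995) formulation, stated here from general knowledge rather than reproduced from the excerpts above. With gold chains S_i and p(S_i) the partition of S_i induced by the system response,

\text{Recall} = \frac{\sum_i \left( |S_i| - |p(S_i)| \right)}{\sum_i \left( |S_i| - 1 \right)}

Precision is obtained by exchanging the roles of the gold standard and the response, and the reported score is their harmonic mean, F_1 = 2PR / (P + R).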
