Coreference Semantics from Web Features
Bansal, Mohit and Klein, Dan

Article Structure

Abstract

To address semantic ambiguities in coreference resolution, we use Web n-gram features that capture a range of world knowledge in a diffuse but robust way.

Introduction

Many of the most difficult ambiguities in coreference resolution are semantic in nature.

Baseline System

Before describing our semantic Web features, we first describe our baseline.

Semantics via Web Features

Our Web features for coreference resolution are simple and capture a range of diffuse world knowledge.

Experiments

4.1 Data

Analysis

In this section, we briefly discuss errors (in the DT baseline) corrected by our Web features, and analyze the decision tree classifier built during training (based on the ACE04 development experiments).

Conclusion

We have presented a collection of simple Web-count features for coreference resolution that capture a range of world knowledge via statistics of general lexical co-occurrence, hypernymy, semantic compatibility, and semantic context.

Topics

coreference

Appears in 31 sentences as: coreference (27) coreferent (5)
In Coreference Semantics from Web Features
  1. To address semantic ambiguities in coreference resolution, we use Web n-gram features that capture a range of world knowledge in a diffuse but robust way.
    Page 1, “Abstract”
  2. When added to a state-of-the-art coreference baseline, our Web features give significant gains on multiple datasets (ACE 2004 and ACE 2005) and metrics (MUC and B3), resulting in the best results reported to date for the end-to-end task of coreference resolution.
    Page 1, “Abstract”
  3. Many of the most difficult ambiguities in coreference resolution are semantic in nature.
    Page 1, “Introduction”
  4. For resolving coreference in this example, a system would benefit from the world knowledge that Obama is the president.
    Page 1, “Introduction”
  5. There have been multiple previous systems that incorporate some form of world knowledge in coreference resolution tasks.
    Page 1, “Introduction”
  6. There is also work on end-to-end coreference resolution that uses large noun-similarity lists (Daumé III and Marcu, 2005) or structured knowledge bases such as Wikipedia (Yang and Su, 2007; Haghighi and Klein, 2009; Kobdani et al., 2011) and YAGO (Rahman and Ng, 2011).
    Page 1, “Introduction”
  7. However, such structured knowledge bases are of limited scope, and, while Haghighi and Klein (2010) self-acquires knowledge about coreference, it does so only via reference constructions and on a limited scale.
    Page 1, “Introduction”
  8. Altogether, our final system produces the best numbers reported to date on end-to-end coreference resolution (with automatically detected system mentions) on multiple data sets (ACE 2004 and ACE 2005) and metrics (MUC and B3), achieving significant improvements over the Reconcile DT baseline and over the state-of-the-art results of Haghighi and Klein (2010).
    Page 2, “Introduction”
  9. Reconcile is one of the best implementations of the mention-pair model (Soon et al., 2001) of coreference resolution.
    Page 2, “Baseline System”
  10. The mention-pair model relies on a pairwise function to determine whether or not two mentions are coreferent.
    Page 2, “Baseline System”
  11. Pairwise predictions are then consolidated by transitive closure (or some other clustering method) to form the final set of coreference clusters (chains).
    Page 2, “Baseline System”
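
To make the consolidation step concrete, here is a minimal sketch (not Reconcile's actual code) of turning pairwise decisions into chains via transitive closure with union-find; `mentions` and `is_coreferent` are hypothetical stand-ins for the baseline's mention list and trained pairwise classifier.

```python
def transitive_closure_clusters(mentions, is_coreferent):
    """Consolidate pairwise coreference decisions into chains.

    mentions: list of hashable mention objects (hypothetical).
    is_coreferent: pairwise predicate, e.g. a trained classifier's
    positive/negative decision (hypothetical stand-in).
    """
    parent = {m: m for m in mentions}

    def find(m):  # root lookup with path compression
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    def union(a, b):  # merge the clusters containing a and b
        parent[find(a)] = find(b)

    # Every positive pairwise prediction links the two mentions.
    for i, m1 in enumerate(mentions):
        for m2 in mentions[i + 1:]:
            if is_coreferent(m1, m2):
                union(m1, m2)

    # Group mentions by their root to obtain the final chains.
    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())
```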


co-occurrence

Appears in 17 sentences as: co-occurrence (18)
In Coreference Semantics from Web Features
  1. Specifically, we exploit short-distance cues to hypernymy, semantic compatibility, and semantic context, as well as general lexical co-occurrence.
    Page 1, “Abstract”
  2. For example, we can collect the co-occurrence statistics of an anaphor with various candidate antecedents to judge relative surface affinities (i.e., (Obama, president) versus (Jobs, president)).
    Page 1, “Introduction”
  3. We can also count co-occurrence statistics of competing antecedents when placed in the context of an anaphoric pronoun (i.e., Obama ’s election campaign versus Jobs’ election campaign).
    Page 1, “Introduction”
  4. We explore five major categories of semantically informative Web features, based on (1) general lexical affinities (via generic co-occurrence statistics), (2) lexical relations (via Hearst-style hypernymy patterns), (3) similarity of entity-based context (e.g., common values of y for …
    Page 1, “Introduction”
  5. The first four types are most intuitive for mention pairs where both members are non-pronominal, but, aside from the general co-occurrence group, helped for all mention pair types.
    Page 3, “Semantics via Web Features”
  6. 3.1 General co-occurrence
    Page 3, “Semantics via Web Features”
  7. These features capture co-occurrence statistics of the two headwords, i.e., how often h1 and h2 are seen adjacent or nearly adjacent on the Web.
    Page 3, “Semantics via Web Features”
  8. Using the n-grams corpus (for n = 1 to 5), we collect co-occurrence Web-counts by allowing a varying number of wildcards between h1 and h2 in the query.
    Page 3, “Semantics via Web Features”
  9. The co-occurrence value is: c12 = count(h1 h2) + count(h1 ⋄ h2) + count(h1 ⋄ ⋄ h2) + count(h1 ⋄ ⋄ ⋄ h2), where ⋄ is a wildcard word (three wildcards are the most that fit alongside the two headwords in a 5-gram).
    Page 3, “Semantics via Web Features”
  10. We normalize the overall co-occurrence count of the headword pair c12 by the unigram counts of the individual headwords c1 and c2, so that high-frequency headwords do not unfairly get a high feature value (this is similar to computing scaled mutual information MI (Church and Hanks, 1989)). This normalized value is quantized by taking its log10 and binning.
    Page 3, “Semantics via Web Features”
  11. As a real example from our development set, the co-occurrence count c12 for the headword pair (leader, president) is 11383, while it is only 95 for the headword pair (voter, president); after normalization and log10, the values are -10.9 and -12.0, respectively.
    Page 3, “Semantics via Web Features”
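
Putting items 8-11 together, here is a minimal sketch of the general co-occurrence feature, assuming hypothetical lookups `wildcard_count` (the Web count of h1 and h2 with a given number of wildcard words between them) and `unigram_count` over the Google n-grams corpus; the default bin size is illustrative, not the paper's tuned value.

```python
import math

def cooccurrence_feature(h1, h2, wildcard_count, unigram_count, bin_size=1.0):
    # c12: Web counts of h1 ... h2 with 0 to 3 wildcards in between
    # (three is the most that fit alongside two headwords in a 5-gram).
    c12 = sum(wildcard_count(h1, h2, j) for j in range(4))
    c1, c2 = unigram_count(h1), unigram_count(h2)
    if min(c12, c1, c2) == 0:
        return None                        # no evidence: fire no feature
    value = math.log10(c12 / (c1 * c2))    # scaled-MI-style normalization
    return math.floor(value / bin_size)    # quantize the log value by binning
```

On the paper's example, c12 = 11383 for (leader, president) normalizes to a log10 value of -10.9, so with a bin size of 1.0 this sketch would emit bin -11.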


coreference resolution

Appears in 13 sentences as: coreference resolution (13)
In Coreference Semantics from Web Features
  1. To address semantic ambiguities in coreference resolution, we use Web n-gram features that capture a range of world knowledge in a diffuse but robust way.
    Page 1, “Abstract”
  2. When added to a state-of-the-art coreference baseline, our Web features give significant gains on multiple datasets (ACE 2004 and ACE 2005) and metrics (MUC and B3), resulting in the best results reported to date for the end-to-end task of coreference resolution.
    Page 1, “Abstract”
  3. Many of the most difficult ambiguities in coreference resolution are semantic in nature.
    Page 1, “Introduction”
  4. There have been multiple previous systems that incorporate some form of world knowledge in coreference resolution tasks.
    Page 1, “Introduction”
  5. There is also work on end-to-end coreference resolution that uses large noun-similarity lists (Daumé III and Marcu, 2005) or structured knowledge bases such as Wikipedia (Yang and Su, 2007; Haghighi and Klein, 2009; Kobdani et al., 2011) and YAGO (Rahman and Ng, 2011).
    Page 1, “Introduction”
  6. Altogether, our final system produces the best numbers reported to date on end-to-end coreference resolution (with automatically detected system mentions) on multiple data sets (ACE 2004 and ACE 2005) and metrics (MUC and B3), achieving significant improvements over the Reconcile DT baseline and over the state-of-the-art results of Haghighi and Klein (2010).
    Page 2, “Introduction”
  7. Reconcile is one of the best implementations of the mention-pair model (Soon et al., 2001) of coreference resolution.
    Page 2, “Baseline System”
  8. Our Web features for coreference resolution are simple and capture a range of diffuse world knowledge.
    Page 2, “Semantics via Web Features”
  9. datasets for end-to-end coreference resolution (see Section 4.3).
    Page 4, “Semantics via Web Features”
  10. This keeps the total number of features small, which is important for the relatively small datasets used for coreference resolution.
    Page 5, “Semantics via Web Features”
  11. We show results on three popular and comparatively larger coreference resolution data sets — the ACE04, ACE05, and ACE05-ALL datasets from the ACE Program (NIST, 2004).
    Page 6, “Experiments”


development set

Appears in 8 sentences as: development set (8) development set: (1)
In Coreference Semantics from Web Features
  1. As a real example from our development set, the co-occurrence count c12 for the headword pair (leader, president) is 11383, while it is only 95 for the headword pair (voter, president); after normalization and log10, the values are -10.9 and -12.0, respectively.
    Page 3, “Semantics via Web Features”
  2. Also, we do not constrain the order of h1 and h2 because these patterns can hold for either direction of coreference. As a real example from our development set, the c12 count for the headword pair (leader, president) is 752, while for (voter, president), it is 0. (A sketch of this Hearst-pattern count follows this list.)
    Page 4, “Semantics via Web Features”
  3. We chose the following three context types, based on performance on a development set:
    Page 5, “Semantics via Web Features”
  4. Note that most previous work does not report (or need) a standard development set; hence, for tuning our features and their hyperparameters, we randomly split the original training data into a training and development set with a 70/30 ratio (and then use the full original training set during testing).
    Page 6, “Experiments”
  5. Note that the development set is used only for ACE04, because for ACE05 and ACE05-ALL, we directly test using the features tuned on ACE04.
    Page 6, “Experiments”
  6. Table 2: Incremental results for the Web features on the ACE04 development set.
    Page 6, “Experiments”
  7. Table 2 compares the baseline perceptron results to the DT results and then shows the incremental addition of the Web features to the DT baseline (on the ACE04 development set).
    Page 6, “Experiments”
  8. We develop our features and tune their hyperparameter values on the ACE04 development set and then use these on the ACE04 test set. On the ACE05 and ACE05-ALL datasets, we directly transfer our Web features and their hyperparameter values from the ACE04 dev-set, without any retuning.
    Page 8, “Experiments”
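
Item 2 above refers to the Hearst-style hypernymy counts. As a rough sketch (the pattern inventory below is a hypothetical subset, not the paper's exact list), both headword orders are inserted into each pattern and the resulting Web counts are summed, reusing the same hypothetical `ngram_count` lookup as in the co-occurrence sketch.

```python
# Hypothetical subset of Hearst-style patterns short enough for a 5-gram.
PATTERNS = ["{a} such as {b}", "{a} including {b}",
            "{a} and other {b}", "{a} is a {b}"]

def hearst_count(h1, h2, ngram_count):
    """Sum Web counts over patterns and both headword orders, since the
    hypernymy cue can hold in either direction of coreference."""
    total = 0
    for pattern in PATTERNS:
        for a, b in ((h1, h2), (h2, h1)):
            tokens = tuple(pattern.format(a=a, b=b).split())
            total += ngram_count(tokens)  # hypothetical n-gram lookup
    return total
```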


perceptron

Appears in 8 sentences as: perceptron (8)
In Coreference Semantics from Web Features
  1. … (2009) in using a decision tree classifier rather than an averaged linear perceptron.
    Page 2, “Baseline System”
  2. We find the decision tree classifier to work better than the default averaged perceptron (used by Stoyanov et al. (2009)).
    Page 2, “Baseline System”
  3. AvgPerc is the averaged perceptron baseline, DecTree is the decision tree baseline, and the +Feature rows show the effect of adding a particular feature incrementally (not in isolation) to the DecTree baseline.
    Page 6, “Experiments”
  4. We start with the Reconcile baseline but employ the decision tree (DT) classifier, because it has significantly better performance than the default averaged perceptron classifier used in Stoyanov et al. (2009).
    Page 6, “Experiments”
  5. Table 2 compares the baseline perceptron results to the DT results and then shows the incremental addition of the Web features to the DT baseline (on the ACE04 development set).
    Page 6, “Experiments”
  6. Moreover, a DT classifier takes roughly the same amount of time and memory as a perceptron on our ACE04 development experiments.
    Page 6, “Experiments”
  7. We report our results (the 3 rows marked ‘This Work’) on the perceptron baseline, the DT baseline, and the Web features added to the DT baseline.
    Page 7, “Experiments”
  8. We also initially experimented with smaller datasets (MUC6 and MUC7) and an averaged perceptron baseline, and we did see similar improvements, arguing that these features are useful independently of the learning algorithm and dataset.
    Page 7, “Experiments”


end-to-end

Appears in 4 sentences as: end-to-end (4)
In Coreference Semantics from Web Features
  1. When added to a state-of-the-art coreference baseline, our Web features give significant gains on multiple datasets (ACE 2004 and ACE 2005) and metrics (MUC and B3), resulting in the best results reported to date for the end-to-end task of coreference resolution.
    Page 1, “Abstract”
  2. There is also work on end-to-end coreference resolution that uses large noun-similarity lists (Daumé III and Marcu, 2005) or structured knowledge bases such as Wikipedia (Yang and Su, 2007; Haghighi and Klein, 2009; Kobdani et al., 2011) and YAGO (Rahman and Ng, 2011).
    Page 1, “Introduction”
  3. Altogether, our final system produces the best numbers reported to date on end-to-end coreference resolution (with automatically detected system mentions) on multiple data sets (ACE 2004 and ACE 2005) and metrics (MUC and B3), achieving significant improvements over the Reconcile DT baseline and over the state-of-the-art results of Haghighi and Klein (2010).
    Page 2, “Introduction”
  4. datasets for end-to-end coreference resolution (see Section 4.3).
    Page 4, “Semantics via Web Features”


hyperparameter

Appears in 4 sentences as: hyperparameter (5)
In Coreference Semantics from Web Features
  1. To capture this effect, we create a feature that indicates whether there is a match in the top k seeds of the two headwords (where k is a hyperparameter to tune; see the sketch after this list).
    Page 4, “Semantics via Web Features”
  2. We first collect the POS tags (using length-2 character prefixes to indicate coarse parts of speech) of the seeds matched in the top h′ seed lists of the two headwords, where h′ is another hyperparameter to tune.
    Page 4, “Semantics via Web Features”
  3. We tune a separate bin-size hyperparameter for each of these three features.
    Page 5, “Semantics via Web Features”
  4. We develop our features and tune their hyperparameter values on the ACE04 development set and then use these on the ACE04 test set. On the ACE05 and ACE05-ALL datasets, we directly transfer our Web features and their hyperparameter values from the ACE04 dev-set, without any retuning.
    Page 8, “Experiments”
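
A minimal sketch of the seed-match features from items 1 and 2 above, assuming hypothetical inputs: `seed_lists` maps a headword to its ranked seed list and `coarse_pos` maps a word to a length-2 POS prefix; the default values for the tunable cutoffs are illustrative only, not the paper's tuned settings.

```python
def seed_match_features(h1, h2, seed_lists, coarse_pos, k=20, k_pos=50):
    # Match in the top-k seeds of the two headwords (k is tuned).
    top1 = set(seed_lists.get(h1, [])[:k])
    top2 = set(seed_lists.get(h2, [])[:k])
    features = {"seed-match": bool(top1 & top2)}

    # Coarse-POS variant: compare length-2 POS prefixes of the seeds
    # matched in the top k_pos seed lists (a second tuned cutoff).
    pos1 = {coarse_pos(s) for s in seed_lists.get(h1, [])[:k_pos]}
    pos2 = {coarse_pos(s) for s in seed_lists.get(h2, [])[:k_pos]}
    features["seed-pos-match"] = bool(pos1 & pos2)
    return features
```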


n-grams

Appears in 4 sentences as: n-grams (5)
In Coreference Semantics from Web Features
  1. As the source of Web information, we use the Google n-grams corpus (Brants and Franz, 2006) which contains English n-grams (n = 1 to 5) and their Web frequency counts, derived from nearly 1 trillion word tokens and 95 billion sentences.
    Page 3, “Semantics via Web Features”
  2. Using the n-grams corpus (for n = 1 to 5), we collect co-occurrence Web-counts by allowing a varying number of wildcards between h1 and h2 in the query.
    Page 3, “Semantics via Web Features”
  3. These clusters are derived from the V2 Google n-grams corpus.
    Page 3, “Semantics via Web Features”
  4. (2010), which were created using the Google n-grams V2 corpus.
    Page 4, “Semantics via Web Features”


n-gram

Appears in 3 sentences as: n-gram (3)
In Coreference Semantics from Web Features
  1. To address semantic ambiguities in coreference resolution, we use Web n-gram features that capture a range of world knowledge in a diffuse but robust way.
    Page 1, “Abstract”
  2. In order to harness the information on the Web without presupposing a deep understanding of all Web text, we instead turn to a diverse collection of Web n-gram counts (Brants and Franz, 2006) which, in aggregate, contain diffuse and indirect, but often robust, cues to reference.
    Page 1, “Introduction”
  3. These clusters come from distributional K-Means clustering (with K = 1000) on phrases, using the n-gram context as features.
    Page 4, “Semantics via Web Features”
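
A sketch of how such cluster assignments might become a pairwise feature, assuming (for simplicity) one id per phrase even though the actual clusters may be soft; `phrase_to_cluster` is a hypothetical precomputed map from the K-Means run.

```python
def cluster_match_feature(h1, h2, phrase_to_cluster):
    # phrase_to_cluster: hypothetical map phrase -> cluster id (0..K-1),
    # built offline from distributional K-Means (K = 1000) over phrases.
    c1 = phrase_to_cluster.get(h1)
    c2 = phrase_to_cluster.get(h2)
    if c1 is None or c2 is None:
        return None        # headword unseen in the clustering
    return c1 == c2        # fire a match feature when the ids agree
```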
