Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
Arnold, Andrew and Nallapati, Ramesh and Cohen, William W.

Article Structure

Abstract

We present a novel hierarchical prior structure for supervised transfer learning in named entity recognition, motivated by the common structure of feature spaces for this task across natural language data sets.

Introduction

1.1 Problem definition

Models considered

2.1 Basic Conditional Random Fields

In this work, we build on Conditional Random Fields (CRFs) (Lafferty et al., 2001), which are now among the most widely used sequential models for natural language processing tasks.

Investigation

3.1 Data, domains and tasks

Conclusions, related & future work

In this work we have introduced hierarchical feature tree priors for use in transfer learning on named entity extraction tasks.

Topics

CRF

Appears in 11 sentences as: CRF (14)
In Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
  1. §2 introduces the maximum entropy (maxent) and conditional random field (CRF) learning techniques employed, along with specifications for the design and training of our hierarchical prior.
    Page 1, “Introduction”
  2. The parametric form of the CRF for a sentence of length n is given as follows:
    Page 3, “Models considered 2.1 Basic Conditional Random Fields”
  3. CRF learns a model consisting of a set of weights Λ = {λ1 ... λF} over the features so as to maximize the conditional likelihood of the training data, p(Y_train | X_train), given the model p_Λ.
    Page 3, “Models considered 2.1 Basic Conditional Random Fields”
  4. 2.2 CRF with Gaussian priors
    Page 3, “Models considered 2.1 Basic Conditional Random Fields”
  5. The method under discussion can also be extended to CRF directly.
    Page 3, “Models considered 2.1 Basic Conditional Random Fields”
  6. Train CRF using D_source to obtain feature weights Λ_source. For each feature f ∈ F_target:
    Page 5, “Models considered 2.1 Basic Conditional Random Fields”
  7. {λ_f^source : f ∈ Leaves(H_source)}. Train Gaussian prior CRF using D_target as data and {μ_f} and {σ_f} as Gaussian prior parameters.
    Page 5, “Models considered 2.1 Basic Conditional Random Fields”
  8. Output: parameters of the new CRF, Λ_target.
    Page 5, “Models considered 2.1 Basic Conditional Random Fields”
  9. Specifically, we compared our approximate hierarchical prior model (HIER), implemented as a CRF, against three baselines: GAUSS, a CRF model tuned on a single domain’s data, using a standard N(0, 1) prior; CAT, a CRF model tuned on a concatenation of multiple domains’ data, using a N(0, 1) prior; and CHELBA, a CRF model tuned on one domain’s data, using a prior trained on a different, related domain’s data (cf. Chelba and Acero, 2004).
    Page 6, “Investigation”
  10. Line a shows the F1 performance of a CRF model tuned only on the target MUC6 domain (GAUSS) across a range of tuning data sizes.
    Page 6, “Investigation”
  11. Line b shows the same experiment, but this time the CRF model has been tuned on a dataset composed of a simple concatenation of the training MUC6 data from (a), along with a different training set from MUC7 (CAT).
    Page 6, “Investigation”
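
The excerpts above sketch both the base model and the transfer recipe. Item 2 refers to the CRF's parametric form, which the excerpt truncates; the standard form is p_Λ(Y | X) = (1/Z(X)) exp(Σ_{i=1..n} Σ_k λ_k f_k(y_{i-1}, y_i, X, i)), with Z(X) the per-sentence normalizer (the paper's own notation may differ). Items 6-8 describe the transfer step: train on the source domain, then centre a per-feature Gaussian prior on the source weights when training on the target domain. The snippet below is a minimal, hypothetical sketch of that recipe, using a per-token maxent classifier as a stand-in for the CRF (item 5 notes the method "can also be extended to CRF directly"); the data, dimensions and helper names are invented for illustration.

import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(w, X, y, mu, sigma2):
    """Binary maxent negative log-likelihood plus a Gaussian prior
    N(mu_f, sigma2_f) on each weight (instead of the usual N(0, 1))."""
    z = X @ w
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    log_prior = -0.5 * np.sum((w - mu) ** 2 / sigma2)
    return -(log_lik + log_prior)

def train(X, y, mu=None, sigma2=None):
    """MAP-train weights; defaults to a standard N(0, 1) prior (the GAUSS baseline)."""
    n_feats = X.shape[1]
    mu = np.zeros(n_feats) if mu is None else mu
    sigma2 = np.ones(n_feats) if sigma2 is None else sigma2
    result = minimize(neg_log_posterior, np.zeros(n_feats),
                      args=(X, y, mu, sigma2), method="L-BFGS-B")
    return result.x

rng = np.random.default_rng(0)  # toy, randomly generated "domains"

# Step 1: train on the (large) source domain with a standard N(0, 1) prior.
X_source = rng.integers(0, 2, (200, 50)).astype(float)
y_source = rng.integers(0, 2, 200)
lambda_source = train(X_source, y_source)

# Step 2: centre each target feature's Gaussian prior on the corresponding
# source weight (here the two toy domains share one feature space, so we copy).
mu_target = lambda_source.copy()
sigma2_target = np.ones_like(mu_target)

# Step 3: train on the (small) target domain with the transferred prior.
X_target = rng.integers(0, 2, (40, 50)).astype(float)
y_target = rng.integers(0, 2, 40)
lambda_target = train(X_target, y_target, mu_target, sigma2_target)
print(lambda_target[:5])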


news articles

Appears in 9 sentences as: news article (1) news articles (9)
In Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
  1. Specifically, you are given a corpus of news articles in which all tokens have been labeled as either belonging to personal name mentions or not.
    Page 1, “Introduction”
  2. Clearly the problems of identifying names in news articles and e-mails are closely related, and learning to do well on one should help your performance on the other.
    Page 1, “Introduction”
  3. When only the type of data being examined is allowed to vary (from news articles to e-mails, for example), the problem is called domain adaptation (Daumé III and Marcu, 2006).
    Page 2, “Introduction”
  4. For example, in this work we group data collected in the same medium (e.g., all annotated e-mails or all annotated news articles) as belonging to the same genre.
    Page 3, “Introduction”
  5. In the example from §1.1, our belief that capitalization is less strict in e-mails than in news articles could be encoded in a prior that biased the importance of the capitalization feature to be lower for e-mails than news articles.
    Page 3, “Introduction”
  6. These are: abstracts from biological journals [UT (Bunescu et al., 2004), Yapex (Franzen et al., 2002)]; news articles [MUC6 (Fisher et al., 1995), MUC7 (Borthwick et al., 1998)]; and personal e-mails [CSPACE (Kraut et al., 2004)].
    Page 6, “Investigation”
  7. person names in news articles and e-mails. We chose this array of corpora so that we could evaluate our hierarchical prior’s ability to generalize across and incorporate information from a variety of domains, genres and tasks.
    Page 6, “Investigation”
  8. Figure 3 shows the results of an experiment in learning to recognize person names in MUC6 news articles.
    Page 6, “Investigation”
  9. Here again we are trying to learn to recognize person names in MUC6 e-mails, but this time, instead of adding only other datasets similarly labeled with person names, we are additionally adding biological corpora (UT & YAPEX), labeled not with person names but with protein names instead, along with the CSPACE email and MUC7 news article corpora.
    Page 7, “Investigation”


named entity

Appears in 8 sentences as: named entity (8)
In Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
  1. We present a novel hierarchical prior structure for supervised transfer learning in named entity recognition, motivated by the common structure of feature spaces for this task across natural language data sets.
    Page 1, “Abstract”
  2. Consider the task of named entity recognition (NER).
    Page 1, “Introduction”
  3. Having successfully trained a named entity classifier on this news data, now consider the problem of learning to classify tokens as names in email data.
    Page 1, “Introduction”
  4. In particular, we develop a novel prior for named entity recognition that exploits the hierarchical feature space often found in natural language domains (§1.2) and allows for the transfer of information from labeled datasets in other domains (§1.3).
    Page 1, “Introduction”
  5. to the named entity status of the current word.
    Page 2, “Introduction”
  6. In the next section we address the problem of how to come up with a suitable prior for transfer learning across named entity recognition problems.
    Page 3, “Introduction”
  7. The goal of our experiments was to see to what degree named entity recognition problems naturally conformed to hierarchical methods, and not just to achieve the highest performance possible.
    Page 6, “Investigation”
  8. In this work we have introduced hierarchical feature tree priors for use in transfer learning on named entity extraction tasks.
    Page 8, “Conclusions, related & future work”


domain adaptation

Appears in 5 sentences as: Domain adaptation (1) domain adaptation (4)
In Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
  1. In the subproblem of domain adaptation , a model trained over a source domain is generalized to perform well on a related target domain, where the two domains’ data are distributed similarly, but not identically.
    Page 1, “Abstract”
  2. We introduce the concept of groups of closely-related domains, called genres, and show how inter-genre adaptation is related to domain adaptation.
    Page 1, “Abstract”
  3. When only the type of data being examined is allowed to vary (from news articles to e-mails, for example), the problem is called domain adaptation (Daumé III and Marcu, 2006).
    Page 2, “Introduction”
  4. domain adaptation, where we assume Y (the set of possible labels) is the same for both D_source and D_target, while D_source and D_target themselves are allowed to vary between domains.
    Page 3, “Introduction”
  5. Domain adaptation can be further distinguished by the degree of relatedness between the source and target domains.
    Page 3, “Introduction”


natural language

Appears in 5 sentences as: natural language (5)
In Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
  1. We present a novel hierarchical prior structure for supervised transfer learning in named entity recognition, motivated by the common structure of feature spaces for this task across natural language data sets.
    Page 1, “Abstract”
  2. In particular, we develop a novel prior for named entity recognition that exploits the hierarchical feature space often found in natural language domains (§1.2) and allows for the transfer of information from labeled datasets in other domains (§1.3).
    Page 1, “Introduction”
  3. Representing feature spaces with this kind of tree, besides often coinciding with the explicit language used by common natural language toolkits (Cohen, 2004), has the added benefit of allowing a model to easily back-off, or smooth, to decreasing levels of specificity.
    Page 2, “Introduction”
  4. In this work, we build on Conditional Random Fields (CRFs) (Lafferty et al., 2001), which are now among the most widely used sequential models for natural language processing tasks.
    Page 3, “Models considered 2.1 Basic Conditional Random Fields”
  5. We used a standard natural language toolkit (Cohen, 2004) to compute tens of thousands of binary features on each of these tokens, encoding such information as capitalization patterns and contextual information from surrounding words.
    Page 6, “Investigation”
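
Item 5 above mentions computing tens of thousands of binary features per token, encoding capitalization patterns and contextual information from surrounding words. The fragment below is a small, hypothetical sketch of that kind of binary feature extraction; the feature names and the helper function are invented for illustration and do not reproduce the toolkit cited (Cohen, 2004).

import re

def token_features(tokens, i):
    """Return a set of binary feature names for the token at position i."""
    tok = tokens[i]
    feats = {f"token.word={tok.lower()}"}
    if tok[:1].isupper():
        feats.add("token.capitalized")
    if tok.isupper():
        feats.add("token.allCaps")
    if re.fullmatch(r"\d+", tok):
        feats.add("token.isDigits")
    if i > 0:                              # left context word
        feats.add(f"left.word={tokens[i-1].lower()}")
    if i + 1 < len(tokens):                # right context word
        feats.add(f"right.word={tokens[i+1].lower()}")
    return feats

print(sorted(token_features(["Mr", ".", "Smith", "said"], 2)))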


model parameters

Appears in 4 sentences as: model parameter (2) model parameters (3)
In Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
  1. The model parameters λ_f^(d), then, form the parameters of the leaves of this hierarchy.
    Page 4, “Models considered 2.1 Basic Conditional Random Fields”
  2. The terms in the first line of Eq. (3) represent the likelihood of data in each domain given their corresponding model parameters; the second line represents the likelihood of each model parameter in each domain given the hyper-parameter of its parent in the tree hierarchy of features; and the last term goes over the entire tree T except the leaf nodes.
    Page 4, “Models considered 2.1 Basic Conditional Random Fields”
  3. We perform a MAP estimation for each model parameter as well as the hyper-parameters.
    Page 4, “Models considered 2.1 Basic Conditional Random Fields”
  4. Essentially, in this model, the weights of the leaf nodes (model parameters) depend on the log-likelihood as well as on the prior weight of their parents.
    Page 4, “Models considered 2.1 Basic Conditional Random Fields”
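
Taken together, the four excerpts describe a two-level model: each domain d has leaf weights λ_f^(d) that are tied, through Gaussian terms, to hyper-parameters z_n sitting at the internal nodes of the feature tree, and the whole structure is fit by MAP estimation. A hedged LaTeX reconstruction of the kind of objective being paraphrased (the paper's own Eq. (3) and notation may differ) is:

  p(\Lambda, z \mid \mathcal{D}) \;\propto\;
      \prod_{d} p\!\left(\mathcal{D}^{(d)} \mid \Lambda^{(d)}\right)          % data likelihood per domain
      \prod_{d} \prod_{f \in \mathrm{Leaves}(\mathcal{T})}
          \mathcal{N}\!\left(\lambda_f^{(d)};\, z_{\mathrm{pa}(f)}, \sigma_f^{2}\right)   % leaf weights given their parent hyper-parameters
      \prod_{n \in \mathcal{T} \setminus \mathrm{Leaves}(\mathcal{T})}
          \mathcal{N}\!\left(z_n;\, z_{\mathrm{pa}(n)}, \sigma_n^{2}\right)               % internal nodes: the tree minus its leaves

Maximizing this jointly in the λ's and the z's is what makes each leaf weight depend both on its domain's log-likelihood and on the prior weight of its parent, as item 4 states.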


feature spaces

Appears in 3 sentences as: feature space (1) feature spaces (2)
In Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
  1. We present a novel hierarchical prior structure for supervised transfer learning in named entity recognition, motivated by the common structure of feature spaces for this task across natural language data sets.
    Page 1, “Abstract”
  2. In particular, we develop a novel prior for named entity recognition that exploits the hierarchical feature space often found in natural language domains (§1.2) and allows for the transfer of information from labeled datasets in other domains (§1.3).
    Page 1, “Introduction”
  3. Representing feature spaces with this kind of tree, besides often coinciding with the explicit language used by common natural language toolkits (Cohen, 2004), has the added benefit of allowing a model to easily back-off, or smooth, to decreasing levels of specificity.
    Page 2, “Introduction”
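
Items 2 and 3 describe feature spaces laid out as trees whose node names mirror the transformations that produced each feature, so that a model can back off from a very specific feature to progressively more general ancestors. The fragment below is a tiny, hypothetical illustration of such a tree over dotted feature names; the names are invented and are not taken from the paper.

from collections import defaultdict

def ancestors(feature):
    """Yield progressively less specific prefixes of a dotted feature name."""
    parts = feature.split(".")
    for i in range(len(parts), 0, -1):
        yield ".".join(parts[:i])

features = ["token.lowercase.word=mr", "token.lowercase.word=smith",
            "token.capitalized", "left_token.lowercase.word=mr"]

# Index every node of the implied tree by the leaf features beneath it,
# so a prior (or a smoothed model) can be read off at any level of the tree.
tree = defaultdict(set)
for f in features:
    for node in ancestors(f):
        tree[node].add(f)

print(list(ancestors("token.lowercase.word=mr")))
# -> ['token.lowercase.word=mr', 'token.lowercase', 'token']
print(sorted(tree["token.lowercase"]))
# -> ['token.lowercase.word=mr', 'token.lowercase.word=smith']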


NER

Appears in 3 sentences as: NER (3)
In Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
  1. Consider the task of named entity recognition (NER).
    Page 1, “Introduction”
  2. In many NER problems, features are often constructed as a series of transformations of the input training data, performed in sequence.
    Page 1, “Introduction”
  3. Thus hierarchical priors seem a natural, effective and robust choice for transferring learning across NER datasets and tasks.
    Page 8, “Conclusions, related & future work”
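
Item 2 describes features that are built by applying a series of transformations to the input in sequence, with the feature name recording that sequence (which is also what induces the hierarchy used for the prior). Below is a brief, hypothetical sketch of such a pipeline; the transformation names are invented for illustration.

def lowercase(t):
    return t.lower()

def shape(t):
    # Collapse characters into a capitalization/digit pattern, e.g. "Smith" -> "Xxxxx".
    return "".join("X" if c.isupper() else "x" if c.islower() else
                   "9" if c.isdigit() else "-" for c in t)

# Each feature is produced by composing transformations in order; the name
# records that order, so "token.lowercase" is a child of "token" in the tree.
pipelines = {
    "token": [],
    "token.lowercase": [lowercase],
    "token.shape": [shape],
}

def features_for(token):
    feats = []
    for name, steps in pipelines.items():
        value = token
        for step in steps:  # apply the transformation series in sequence
            value = step(value)
        feats.append(f"{name}={value}")
    return feats

print(features_for("Smith"))  # ['token=Smith', 'token.lowercase=smith', 'token.shape=Xxxxx']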
