Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
Finkel, Jenny Rose and Manning, Christopher D.

Article Structure

Abstract

One of the main obstacles to producing high quality joint models is the lack of jointly annotated data.

Introduction

Joint learning of multiple types of linguistic structure results in models which produce more consistent outputs, and for which performance improves across all aspects of the joint structure.

Related Work

Our task can be viewed as an instance of multitask learning, a machine learning paradigm in which the objective is to simultaneously solve multiple, related tasks for which you have separate labeled training data.

Hierarchical Joint Learning

In this section we will discuss the main contribution of this paper, our hierarchical joint model which improves joint modeling performance through the use of single-task models which can be trained on singly-annotated data.

Base Models

Our hierarchical joint model is composed of three separate models, one for just named entity recognition, one for just parsing, and one for joint parsing and named entity recognition.

Experiments and Discussion

We compared our hierarchical joint model to a regular (non-hierarchical) joint model, and to parse-only and NER-only models.

Conclusion

In this paper we presented a novel method for improving joint modeling using additional data which has not been labeled with the entire joint structure.

Topics

joint model

Appears in 49 sentences as: Joint Model (1) joint model (43) Joint modeling (1) joint modeling (6) Joint models (1) joint models (7) joint model’s (1) jointly model (1)
In Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
  1. One of the main obstacles to producing high quality joint models is the lack of jointly annotated data.
    Page 1, “Abstract”
  2. Joint modeling of multiple natural language processing tasks outperforms single-task models learned from the same data, but still under-performs compared to single-task models learned on the more abundant quantities of available single-task annotated data.
    Page 1, “Abstract”
  3. In this paper we present a novel model which makes use of additional single-task annotated data to improve the performance of a joint model.
    Page 1, “Abstract”
  4. Our model utilizes a hierarchical prior to link the feature weights for shared features in several single-task models and the joint model.
    Page 1, “Abstract”
  5. Experiments on joint parsing and named entity recognition, using the OntoNotes corpus, show that our hierarchical joint model can produce substantial gains over a joint model trained on only the jointly annotated data.
    Page 1, “Abstract”
  6. Joint models can be particularly useful for producing analyses of sentences which are used as input for higher-level, more semantically-oriented systems, such as question answering and machine translation.
    Page 1, “Introduction”
  7. However, designing joint models which actually improve performance has proven challenging.
    Page 1, “Introduction”
  8. There have been some recent successes with joint modeling.
    Page 1, “Introduction”
  9. Zhang and Clark (2008) built a perceptron-based joint segmenter and part-of-speech (POS) tagger for Chinese, and Toutanova and Cherry (2009) learned a joint model of lemmatization and POS tagging which outperformed a pipelined model.
    Page 1, “Introduction”
  10. Adler and Elhadad (2006) presented an HMM-based approach for unsupervised joint morphological segmentation and tagging of Hebrew, and Goldberg and Tsarfaty (2008) developed a joint model of segmentation, tagging and parsing of Hebrew, based on lattice parsing.
    Page 1, “Introduction”
  11. No discussion of joint modeling would be complete without mention of Miller et al. (2000), who trained a Collins-style generative parser (Collins, 1997) over a syntactic structure augmented with the template entity and template relations annotations for the MUC-7 shared task.
    Page 1, “Introduction”


named entity

Appears in 26 sentences as: named entities (2) Named Entity (2) named entity (27)
In Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
  1. Experiments on joint parsing and named entity recognition, using the OntoNotes corpus, show that our hierarchical joint model can produce substantial gains over a joint model trained on only the jointly annotated data.
    Page 1, “Abstract”
  2. These high-level systems typically combine the outputs from many low-level systems, such as parsing, named entity recognition (NER) and coreference resolution.
    Page 1, “Introduction”
  3. When trained separately, these single-task models can produce outputs which are inconsistent with one another, such as named entities which do not correspond to any nodes in the parse tree (see Figure 1 for an example).
    Page 1, “Introduction”
  4. Because a named entity should correspond to a node in the parse tree, strong evidence about either aspect of the model should positively impact the other aspect.
    Page 1, “Introduction”
  5. We built a joint model of parsing and named entity recognition (Finkel and Manning, 2009b), which had small gains on parse performance and moderate gains on named entity performance, when compared with single-task models trained on the same data.
    Page 1, “Introduction”
  6. We applied our hierarchical joint model to parsing and named entity recognition, and it reduced errors by over 20% on both tasks when compared to a joint model trained on only the jointly annotated data.
    Page 2, “Introduction”
  7. Our experiments are on a joint parsing and named entity task, but the technique is more general and only requires that the base models (the joint model and single-task models) share some features.
    Page 2, “Hierarchical Joint Learning”
  8. This section covers the general technique, and we will cover the details of the parsing, named entity, and joint models that we use in Section 4.
    Page 2, “Hierarchical Joint Learning”
  9. Features which don’t apply to a particular model type (e.g., parse features in the named entity model) will always be zero, so their weights have no impact on that model’s likelihood function.
    Page 3, “Hierarchical Joint Learning” (see the sketch after this list)
  10. Our hierarchical joint model is composed of three separate models, one for just named entity recognition, one for just parsing, and one for joint parsing and named entity recognition.
    Page 5, “Base Models”
  11. 4.1 Semi-CRF for Named Entity Recognition
    Page 5, “Base Models”
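
A minimal sketch of the shared-feature point in item 9 above: all base models score structures against one shared feature space, and a feature that never fires in a model contributes nothing to that model's likelihood or gradient. The feature names and weights below are hypothetical, not the paper's templates.

```python
# Hypothetical shared weight vector; the parse-only feature is always inactive
# (value zero) in the NER model, so its weight cannot affect NER scores.
shared_weights = {
    "word=Clinton&label=PER": 1.2,  # NER-style feature, shared with the joint model
    "rule=NP->NNP_NNP": 0.7,        # parse-only feature, never fires for NER
}

def log_linear_score(active_features, weights):
    """Sum the weights of the features that fire; absent features count as zero."""
    return sum(weights.get(f, 0.0) for f in active_features)

# The NER model's score (and hence its likelihood and partial derivatives) is
# unchanged by the parse feature's weight, because that feature is never active.
ner_active = ["word=Clinton&label=PER"]
print(log_linear_score(ner_active, shared_weights))  # 1.2
```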


NER

Appears in 14 sentences as: NER (15)
In Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
  1. These high-level systems typically combine the outputs from many low-level systems, such as parsing, named entity recognition (NER) and coreference resolution.
    Page 1, “Introduction”
  2. PARSE JOINT NER
    Page 3, “Hierarchical Joint Learning”
  3. There are separate base models for just parsing, just NER, and joint parsing and NER.
    Page 3, “Hierarchical Joint Learning”
  4. Because we use a tree representation, it is easy to ensure that the features used in the NER model are identical to those in the joint parsing and named entity model, because the joint model (which we will discuss in Section 4.3) is also based on a tree representation where each entity corresponds to a single node in the tree.
    Page 6, “Base Models”
  5. The joint model shares the NER and parse features with the respective single-task models.
    Page 6, “Base Models”
  6. We did not run this experiment on the CNN portion of the data, because the CNN data was already being used as the extra NER data.
    Page 7, “Experiments and Discussion”
  7. Looking at the smaller corpora (NBC and MNB) we see the largest gains, with both parse and NER performance improving by about 8% F1.
    Page 7, “Experiments and Discussion”
  8. Our one negative result is in the PRI portion: parsing improves slightly, but NER performance decreases by almost 2%.
    Page 7, “Experiments and Discussion”
  9. We found it interesting that the gains tended to be similar on both tasks for all datasets, and believe this fact is due to our use of roughly the same amount of singly-annotated data for both parsing and NER.
    Page 7, “Experiments and Discussion”
  10. Just NER
    Page 8, “Experiments and Discussion”
  11. Just NER
    Page 8, “Experiments and Discussion”


CRF

Appears in 10 sentences as: CRF (11)
In Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
  1. Figure 3: A linear-chain CRF (a) labels each word, whereas a semi-CRF (b) labels entire entities.
    Page 5, “Hierarchical Joint Learning”
  2. Semi-CRFs segment and label the text simultaneously, whereas a linear-chain CRF will only label each word, and segmentation is implied by the labels assigned to the words.
    Page 5, “Base Models”
  3. doing named entity recognition, a semi-CRF will have one node for each entity, unlike a regular CRF which will have one node for each word. See Figure 3(a–b) for an example of a semi-CRF and a linear-chain CRF over the same sentence.
    Page 5, “Base Models”
  4. Note that the entity Hilary Clinton has one node in the semi-CRF representation, but two nodes in the linear-chain CRF.
    Page 5, “Base Models”
  5. While a linear-chain CRF allows features over adjacent words, a semi-CRF allows them over adjacent segments.
    Page 5, “Base Models”
  6. This means that a semi-CRF can utilize all features used by a linear-chain CRF, and can also utilize features over entire segments, such as First National Bank of New York City, instead of just adjacent words like First National and Bank of.
    Page 5, “Base Models” (see the sketch after this list)
  7. While converting a semi-CRF into a parser results in much slower inference than a linear-chain CRF, it is still significantly faster than a treebank parser due to the reduced number of labels.
    Page 5, “Base Models”
  8. The relationship between a CRF-CFG and a PCFG is analogous to the relationship between a linear-chain CRF and a hidden Markov model (HMM) for modeling sequence data.
    Page 6, “Base Models”
  9. Just like with a linear-chain CRF, this equation will be zero when the feature expectations in the model equal the feature values in the training data.
    Page 6, “Base Models”
  10. For each section of the data (ABC, MNB, NBC, PRI, VOA) we ran experiments training a linear-chain CRF on only the named entity information, a CRF-CFG parser on only the parse information, a joint parser and named entity recognizer, and our hierarchical model.
    Page 7, “Experiments and Discussion”
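
The sentences above contrast per-word labeling (linear-chain CRF) with per-segment labeling (semi-CRF). The sketch below shows the two output representations side by side; it is illustrative only and does not reproduce the paper's data structures or feature templates.

```python
# Linear-chain CRF: one label per word; entity segmentation is implied by BIO tags.
tokens = ["Hilary", "Clinton", "visited", "First", "National", "Bank"]
bio_labels = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG"]

# Semi-CRF: one node per segment, given as (start, end, label) over half-open
# token spans, so features can look at a whole segment ("First National Bank")
# rather than only adjacent word pairs.
segments = [(0, 2, "PER"), (2, 3, "O"), (3, 6, "ORG")]

for start, end, label in segments:
    print(" ".join(tokens[start:end]), "->", label)
```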


feature weights

Appears in 9 sentences as: feature weight (2) feature weights (7)
In Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
  1. Our model utilizes a hierarchical prior to link the feature weights for shared features in several single-task models and the joint model.
    Page 1, “Abstract”
  2. Then, the singly-annotated data can be used to influence the feature weights for the shared features in the joint model.
    Page 2, “Introduction”
  3. Each model has its own set of parameters (feature weights).
    Page 3, “Hierarchical Joint Learning”
  4. These have corresponding log-likelihood functions $\mathcal{L}_p(\mathcal{D}_p; \theta_p)$, $\mathcal{L}_n(\mathcal{D}_n; \theta_n)$, and $\mathcal{L}_j(\mathcal{D}_j; \theta_j)$, where the $\mathcal{D}$s are the training data for each model, and the $\theta$s are the model-specific parameter (feature weight) vectors.
    Page 3, “Hierarchical Joint Learning”
  5. These three models are linked by a hierarchical prior, and their feature weight vectors are all drawn from this prior.
    Page 3, “Hierarchical Joint Learning”
  6. This formulation encourages each base model to have feature weights similar to the top-level parameters (and hence one another).
    Page 3, “Hierarchical Joint Learning” (see the sketch after this list)
  7. $\sigma_*$ has the familiar interpretation of dictating how much the model “cares” about feature weights diverging from zero (or $\mu$).
    Page 3, “Hierarchical Joint Learning”
  8. Let $\theta$ be the feature weights, and $f(s, y_i, y_{i-1})$ the feature function over adjacent segments $y_i$ and $y_{i-1}$ in sentence $s$. The log likelihood of a semi-CRF for a single sentence $s$ is given by:
    Page 5, “Base Models”
  9. Let $\theta$ be the vector of feature weights.
    Page 6, “Base Models”
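
Items 5–7 above describe the hierarchical prior only in words, and item 8 stops at “given by:”. The block below is a hedged sketch reconstructed from those sentences alone; the paper's exact notation, variances, and equation numbering may differ.

```latex
% Hierarchical objective (sketch): each base model m in {p, n, j} has weights
% \theta_m tied to shared top-level weights \theta_* by a Gaussian with
% model-specific variance \sigma_m; \theta_* is in turn tied to \mu (typically
% zero) with variance \sigma_*.
\[
\mathcal{L}_{\text{hier}}(\mathcal{D};\theta) =
  \sum_{m \in \{p,\,n,\,j\}} \left[ \mathcal{L}_m(\mathcal{D}_m;\theta_m)
    - \sum_i \frac{(\theta_{m,i} - \theta_{*,i})^2}{2\sigma_m^2} \right]
  - \sum_i \frac{(\theta_{*,i} - \mu_i)^2}{2\sigma_*^2}
\]
% Item 8's truncated equation, sketched: the semi-CRF log likelihood of a
% segmentation y of sentence s, normalized by the partition function Z_s.
\[
\log P(y \mid s;\theta) = \sum_i \theta \cdot f(s, y_i, y_{i-1}) - \log Z_s
\]
```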


model trained

Appears in 8 sentences as: model trained (4) models trained (3) model’s training (2)
In Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
  1. Experiments on joint parsing and named entity recognition, using the OntoNotes corpus, show that our hierarchical joint model can produce substantial gains over a joint model trained on only the jointly annotated data.
    Page 1, “Abstract”
  2. We built a joint model of parsing and named entity recognition (Finkel and Manning, 2009b), which had small gains on parse performance and moderate gains on named entity performance, when compared with single-task models trained on the same data.
    Page 1, “Introduction”
  3. entity models trained on larger corpora, annotated with only one type of information.
    Page 2, “Introduction”
  4. We use a hierarchical prior to link a joint model trained on jointly-annotated data with other single-task models trained on single-task annotated data.
    Page 2, “Introduction”
  5. We applied our hierarchical joint model to parsing and named entity recognition, and it reduced errors by over 20% on both tasks when compared to a joint model trained on only the jointly annotated data.
    Page 2, “Introduction”
  6. Our resulting joint model is of higher quality than a comparable joint model trained on only the jointly-annotated data, due to all of the evidence provided by the additional single-task data.
    Page 3, “Hierarchical Joint Learning”
  7. When we rescale the model-specific prior, we rescale based on the number of data in that model’s training set, not the total number of data in all the models combined.
    Page 4, “Hierarchical Joint Learning” (see the sketch after this list)
  8. Having uniformly randomly drawn datum $d \in \bigcup_{m \in \mathcal{M}} \mathcal{D}_m$, let $m(d) \in \mathcal{M}$ tell us to which model’s training data the datum belongs.
    Page 4, “Hierarchical Joint Learning”
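
Items 7 and 8 above describe how a stochastic update samples one datum from the pooled training sets and rescales the model-specific prior by that model's own training-set size. The sketch below illustrates that bookkeeping; the dataset names and the exact rescaling constant are assumptions for illustration, not the paper's code.

```python
import random

# Hypothetical training sets for the three base models.
datasets = {"parse": ["p1", "p2", "p3"], "ner": ["n1", "n2"], "joint": ["j1"]}

def draw_datum(datasets):
    """Uniform draw over the pooled data; returns (m(d), d)."""
    pooled = [(m, d) for m, data in datasets.items() for d in data]
    return random.choice(pooled)

def prior_fraction(model, datasets):
    """Fraction of the model-specific prior charged to one datum: 1 / |D_m|,
    based on that model's training-set size, not the pooled size."""
    return 1.0 / len(datasets[model])

model, datum = draw_datum(datasets)
print(model, datum, prior_fraction(model, datasets))
```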


objective function

Appears in 7 sentences as: objective function (8)
In Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
  1. L-BFGS and gradient descent, two frequently used numerical optimization algorithms, require computing the value and partial derivatives of the objective function using the entire training set.
    Page 4, “Hierarchical Joint Learning”
  2. It requires a stochastic objective function, which is meant to be a low computational cost estimate of the real objective function.
    Page 4, “Hierarchical Joint Learning”
  3. In most NLP models, such as logistic regression with a Gaussian prior, computing the stochastic objective function is fairly straightforward: you compute the model likelihood and partial derivatives for a randomly sampled subset of the training data.
    Page 4, “Hierarchical Joint Learning”
  4. The stochastic objective function, where $\hat{\mathcal{D}} \subseteq \mathcal{D}$ is a randomly drawn subset of the full training set, is given by
    Page 4, “Hierarchical Joint Learning” (see the sketch after this list)
  5. When designing a stochastic objective function, the critical fact to keep in mind is that the summed values and partial derivatives for any split of the data need to be equal to that of the full dataset.
    Page 4, “Hierarchical Joint Learning”
  6. We now describe the more complicated case of stochastic optimization with a hierarchical objective function .
    Page 4, “Hierarchical Joint Learning”
  7. We are also interested in ways to modify the objective function to place more emphasis on learning a good joint model, instead of equally weighting the learning of the joint and single-task models.
    Page 8, “Conclusion”
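
Item 4 above ends at “is given by” without the formula. For the simple case item 3 mentions (logistic regression with a Gaussian prior), a hedged sketch consistent with item 5's requirement that the pieces sum back to the full objective is:

```latex
% Stochastic estimate over a random subset \hat{D} of the training data D: the
% Gaussian prior term is rescaled by |\hat{D}|/|D| so that summing the estimate
% over any partition of D recovers the full regularized log likelihood.
\[
\mathcal{L}_{\text{stoch}}(\hat{\mathcal{D}};\theta) =
  \sum_{d \in \hat{\mathcal{D}}} \log P(d \mid \theta)
  - \frac{|\hat{\mathcal{D}}|}{|\mathcal{D}|} \sum_i \frac{(\theta_i - \mu_i)^2}{2\sigma^2}
\]
```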


parse tree

Appears in 7 sentences as: parse tree (5) parse trees (2)
In Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
  1. When trained separately, these single-task models can produce outputs which are inconsistent with one another, such as named entities which do not correspond to any nodes in the parse tree (see Figure 1 for an example).
    Page 1, “Introduction”
  2. Because a named entity should correspond to a node in the parse tree, strong evidence about either aspect of the model should positively impact the other aspect.
    Page 1, “Introduction”
  3. Figure 3c shows a parse tree representation of a semi-CRF.
    Page 5, “Base Models”
  4. Let $t$ be a complete parse tree for sentence $s$, and each local subtree $r \in t$ encodes both the rule from the grammar, and the span and split information (e.g., $\mathrm{NP}(7,9) \rightarrow \mathrm{JJ}(7,8)\,\mathrm{NN}(8,9)$, which covers the last two words in Figure 1).
    Page 6, “Base Models”
  5. To compute the partition function $Z_s$, which serves to normalize the function, we must sum over $\tau(s)$, the set of all possible parse trees for sentence $s$.
    Page 6, “Base Models” (see the sketch after this list)
  6. The parse tree structure is augmented with named entity information; see Figure 4 for an example.
    Page 6, “Base Models”
  7. For the hierarchical model, we used the CNN portion of the data (5093 sentences) for the extra named entity data (and ignored the parse trees) and the remaining portions combined for the extra parse data (and ignored the named entity annotations).
    Page 7, “Experiments and Discussion”
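
Items 4 and 5 above describe the CRF-CFG likelihood in words: local subtrees are scored by their features and the result is normalized by the partition function. A hedged sketch of the tree likelihood those sentences describe (the paper's exact equation may differ):

```latex
% CRF-CFG tree likelihood (sketch): score each local subtree r of tree t with its
% feature vector f(r, s), and normalize by Z_s, a sum over all trees \tau(s) for
% sentence s.
\[
P(t \mid s;\theta) = \frac{1}{Z_s} \prod_{r \in t} \exp\{\theta \cdot f(r,s)\},
\qquad
Z_s = \sum_{t' \in \tau(s)} \prod_{r \in t'} \exp\{\theta \cdot f(r,s)\}
\]
```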


model parameters

Appears in 5 sentences as: model parameters (3) model’s parameters (2)
In Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
  1. After training has been completed, we retain only the joint model’s parameters.
    Page 3, “Hierarchical Joint Learning”
  2. The first summation in this equation computes the log-likelihood of each model, using the data and parameters which correspond to that model, and the prior likelihood of that model’s parameters, based on a Gaussian prior centered around the top-level, non-model-specific parameters $\theta_*$, and with model-specific variance $\sigma_m$.
    Page 3, “Hierarchical Joint Learning”
  3. We need to compute partial derivatives in order to optimize the model parameters.
    Page 3, “Hierarchical Joint Learning”
  4. The stochastic partial derivatives will equal zero for all model parameters $\theta_m$ such that $m \neq m(d)$, and for $\theta_{m(d)}$ it becomes:
    Page 4, “Hierarchical Joint Learning” (see the sketch after this list)
  5. Let $f_i(r, s)$ be the value of feature $i$ for subtree $r$ over sentence $s$, and let $E_\theta[f_i \mid s]$ be the expected value of feature $i$ in sentence $s$, based on the current model parameters $\theta$.
    Page 6, “Base Models”
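
Items 2 and 4 above describe the gradient of the hierarchical objective only verbally, and item 4 stops at “it becomes:”. A hedged sketch of the partial derivatives implied by a Gaussian prior centered on the top-level parameters, reconstructed from those sentences:

```latex
% Model-specific weights are pulled toward the top-level weights; top-level
% weights are pulled toward each base model and back toward \mu.
\[
\frac{\partial \mathcal{L}_{\text{hier}}}{\partial \theta_{m,i}} =
  \frac{\partial \mathcal{L}_m(\mathcal{D}_m;\theta_m)}{\partial \theta_{m,i}}
  - \frac{\theta_{m,i} - \theta_{*,i}}{\sigma_m^2},
\qquad
\frac{\partial \mathcal{L}_{\text{hier}}}{\partial \theta_{*,i}} =
  \sum_m \frac{\theta_{m,i} - \theta_{*,i}}{\sigma_m^2}
  - \frac{\theta_{*,i} - \mu_i}{\sigma_*^2}
\]
% In the stochastic version (item 4), a drawn datum d touches only the likelihood
% term of its own model m(d), so the derivative is zero for every \theta_m with
% m != m(d).
```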


parsing model

Appears in 3 sentences as: parsing model (3)
In Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
  1. When segment length is not restricted, the inference procedure is the same as that used in parsing (Finkel and Manning, 2009c). In this work we do not enforce a length restriction, and directly utilize the fact that the model can be transformed into a parsing model.
    Page 5, “Base Models”
  2. Our parsing model is the discriminatively trained, conditional random field-based context-free grammar parser (CRF-CFG) of Finkel et al. (2008).
    Page 6, “Base Models”
  3. In the parsing model , the grammar consists of only the rules observed in the training data.
    Page 6, “Base Models”
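
Item 3 above states that the grammar contains only the rules observed in the training data. The following is a minimal sketch of that idea; the nested-tuple tree encoding is a hypothetical stand-in for the treebank format, not the paper's implementation.

```python
def extract_rules(tree, rules=None):
    """Record label -> children rules from a tree given as (label, child, ...)
    tuples, where bare strings are terminal words."""
    if rules is None:
        rules = set()
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rules.add((label, rhs))
    for child in children:
        if not isinstance(child, str):
            extract_rules(child, rules)
    return rules

# The observed grammar is just the union of rules over all training trees.
train_tree = ("S",
              ("NP", ("NNP", "Hilary"), ("NNP", "Clinton")),
              ("VP", ("VBD", "spoke")))
for lhs, rhs in sorted(extract_rules(train_tree)):
    print(lhs, "->", " ".join(rhs))
```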
