Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition
Wang, Mengqiu and Che, Wanxiang and Manning, Christopher D.

Article Structure

Abstract

Translated bi-texts contain complementary language cues, and previous work on Named Entity Recognition (NER) has demonstrated improvements in performance over monolingual taggers by promoting agreement of tagging decisions between the two languages.

Introduction

We study the problem of Named Entity Recognition (NER) in a bilingual context, where the goal is to annotate parallel bi-texts with named entity tags.

Bilingual NER by Agreement

The inputs to our models are parallel sentence pairs (see Figure 1 for an example in English and

Joint Alignment and NER Decoding

In this section we develop an extended model in which NER information can in turn be used to improve alignment accuracy.

Experimental Setup

We evaluate on the large OntoNotes (v4.0) corpus (Hovy et al., 2006) which contains manually

Bilingual NER Results

The main results on bilingual NER over the test portion of full-set are shown in Table 1.

Joint NER and Alignment Results

We present results for the BI-NER-WA model in Table 2.

Error Analysis and Discussion

We can examine the example in Figure 3 to gain an understanding of the model’s performance.

Related Work

The idea of employing bilingual resources to improve over monolingual systems has been explored by much previous work.

Conclusion

We introduced a graphical model that combines two HMM word aligners and two CRF NER taggers into a joint model, and presented a dual decomposition inference method for performing efficient decoding over this model.

Topics

NER

Appears in 41 sentences as: NER (46)
In Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition
  1. Translated bi-texts contain complementary language cues, and previous work on Named Entity Recognition ( NER ) has demonstrated improvements in performance over monolingual taggers by promoting agreement of tagging decisions between the two languages.
    Page 1, “Abstract”
  2. We observe that NER label information can be used to correct alignment mistakes, and present a graphical model that performs bilingual NER tagging jointly with word alignment, by combining two monolingual tagging models with two unidirectional alignment models.
    Page 1, “Abstract”
  3. We design a dual decomposition inference algorithm to perform joint decoding over the combined alignment and NER output space.
    Page 1, “Abstract”
  4. Experiments on the OntoNotes dataset demonstrate that our method yields significant improvements in both NER and word alignment over state-of-the-art monolingual baselines.
    Page 1, “Abstract”
  5. We study the problem of Named Entity Recognition ( NER ) in a bilingual context, where the goal is to annotate parallel bi-texts with named entity tags.
    Page 1, “Introduction”
  6. (2012) have also demonstrated that bi-texts annotated with NER tags can provide useful additional training sources for improving the performance of standalone monolingual taggers.
    Page 1, “Introduction”
  7. In this work, we first develop a bilingual NER model (denoted as BI-NER) by embedding two monolingual CRF-based NER models into a larger undirected graphical model, and introduce additional edge factors based on word alignment (WA).
    Page 1, “Introduction”
  8. But the entity span and type predictions given by the NER models contain complementary information for correcting alignment errors.
    Page 2, “Introduction”
  9. To capture this source of information, we present a novel extension that combines the BI-NER model with two unidirectional HMM-based alignment models, and perform joint decoding of NER and word alignments.
    Page 2, “Introduction”
  10. The new model (denoted as BI-NER-WA) factors over five components: one NER model and one word alignment model for each language, plus a joint NER-alignment model which not only enforces NER label agreements but also facilitates message passing among the other four components.
    Page 2, “Introduction”
  11. We assume access to two monolingual linear-chain CRF-based NER models that are already trained.
    Page 2, “Bilingual NER by Agreement”
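The agreement scheme excerpted above can be sketched as a dual decomposition loop over two already-trained taggers. The tagger interface, toy tag set, and step size below are illustrative assumptions rather than the paper's implementation; the point is the subgradient penalty update that nudges the two decoders toward agreeing on aligned words.

```python
def dd_agree(tag_en, tag_zh, aligned, tags, step=0.5, max_iter=50):
    """Dual decomposition sketch: two taggers decode independently while
    dual penalties accumulate on aligned word pairs until they agree.
    Assumes one-to-one alignments for simplicity."""
    u = {p: {t: 0.0 for t in tags} for p in aligned}  # dual variables
    for _ in range(max_iter):
        # each side decodes with the dual adjustment added to its own scores
        ye = tag_en({i: {t: +u[(i, j)][t] for t in tags} for i, j in aligned})
        yf = tag_zh({j: {t: -u[(i, j)][t] for t in tags} for i, j in aligned})
        if all(ye[i] == yf[j] for i, j in aligned):
            return ye, yf  # agreement reached: certificate of optimality
        for i, j in aligned:  # subgradient step on disagreeing pairs
            if ye[i] != yf[j]:
                u[(i, j)][ye[i]] -= step
                u[(i, j)][yf[j]] += step
    return ye, yf  # no agreement within the iteration budget

def make_tagger(base, tags):
    """Toy stand-in for a trained CRF: per-position argmax of the base
    score plus the dual adjustment passed in by dd_agree."""
    def tag(adjust):
        return {pos: max(tags, key=lambda t: scores[t]
                         + adjust.get(pos, {}).get(t, 0.0))
                for pos, scores in base.items()}
    return tag
```

For example, with an English tagger confident in PER and a Chinese tagger mildly preferring ORG on one aligned pair, a single penalty update is enough for both decoders to settle on PER.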


word alignment

Appears in 20 sentences as: word aligner (2) word aligners (1) Word alignment (1) word alignment (15) word alignments (4)
  1. However, most previous approaches to bilingual tagging assume word alignments are given as fixed input, which can cause cascading errors.
    Page 1, “Abstract”
  2. We observe that NER label information can be used to correct alignment mistakes, and present a graphical model that performs bilingual NER tagging jointly with word alignment, by combining two monolingual tagging models with two unidirectional alignment models.
    Page 1, “Abstract”
  3. Experiments on the OntoNotes dataset demonstrate that our method yields significant improvements in both NER and word alignment over state-of-the-art monolingual baselines.
    Page 1, “Abstract”
  4. In this work, we first develop a bilingual NER model (denoted as BI-NER) by embedding two monolingual CRF-based NER models into a larger undirected graphical model, and introduce additional edge factors based on word alignment (WA).
    Page 1, “Introduction”
  5. Our method does not require any manual annotation of word alignments or named entities over the bilingual training data.
    Page 2, “Introduction”
  6. The aforementioned BI-NER model assumes fixed alignment input given by an underlying word aligner.
    Page 2, “Introduction”
  7. To capture this source of information, we present a novel extension that combines the BI-NER model with two unidirectional HMM-based alignment models, and perform joint decoding of NER and word alignments.
    Page 2, “Introduction”
  8. The new model (denoted as BI-NER-WA) factors over five components: one NER model and one word alignment model for each language, plus a joint NER-alignment model which not only enforces NER label agreements but also facilitates message passing among the other four components.
    Page 2, “Introduction”
  9. We also assume that a set of word alignments (A = {(i, j) : e_i ↔ f_j}) is given by a word aligner and remains fixed in our model.
    Page 2, “Bilingual NER by Agreement”
  10. The assumption in the hard agreement model can also be violated if there are word alignment errors.
    Page 3, “Bilingual NER by Agreement”
  11. To capture this intuition, we extend the BI-NER model to jointly perform word alignment and NER decoding, and call the resulting model BI-NER-WA.
    Page 4, “Joint Alignment and NER Decoding”
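The excerpts above note that entity span and type predictions carry information for correcting alignment errors. A toy illustration of that signal, phrased here as a post-hoc filter (the paper instead models it jointly); the tag scheme and the "entity vs. O" heuristic are assumptions for illustration only:

```python
def suspect_edges(A, tags_e, tags_f):
    """Flag alignment edges (i, j) in A = {(i, j) : e_i <-> f_j} whose
    endpoints carry incompatible NER tags: an entity word aligned to a
    non-entity ('O') word is a crude proxy for a likely alignment error."""
    return {(i, j) for i, j in A
            if tags_e[i] != tags_f[j] and 'O' in (tags_e[i], tags_f[j])}
```

An edge linking a PER-tagged English word to an O-tagged Chinese word would be flagged, while edges whose endpoints agree pass through untouched.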


CRF

Appears in 15 sentences as: CRF (16)
  1. The English-side CRF model assigns the following probability for a tag sequence y^e:
    Page 2, “Bilingual NER by Agreement”
  2. where V_e is the set of vertices in the CRF and D_e is the set of edges.
    Page 2, “Bilingual NER by Agreement”
  3. ψ(v_i) and ψ(v_i, v_j) are the node and edge clique potentials, and Z_e(e) is the partition function for input sequence e under the English CRF model.
    Page 2, “Bilingual NER by Agreement”
  4. In order to model this uncertainty, we extend the two previously independent CRF models into a larger undirected graphical model, by introducing a cross-lingual edge factor φ(i, j) for every pair of word positions (i, j) ∈ A.
    Page 3, “Bilingual NER by Agreement”
  5. Since we assume no bilingually annotated NER corpus is available, in order to get an estimate of the PMI scores, we first tag a collection of unannotated bilingual sentence pairs using the monolingual CRF taggers, and collect counts of aligned entity pairs from this auto-generated tagged data.
    Page 4, “Bilingual NER by Agreement”
  6. Each of the φ(i, j) edge factors (e.g., the edge between node f_3 and e_4 in Figure 1) overlaps with each of the two CRF models over one vertex (e.g., f_3 on the Chinese side and e_4 on the English side), and we seek agreement with the Chinese CRF model over the tag assignment of f_j, and similarly for e_i on the English side.
    Page 4, “Bilingual NER by Agreement”
  7. In other words, no direct agreement between the two CRF models is enforced, but they both need to agree with the bilingual edge factors.
    Page 4, “Bilingual NER by Agreement”
  8. where the notation y_i^(e) denotes the tag assignment to word e_i by the English CRF and y_j^(f) denotes the assignment to word f_j by the Chinese CRF,
    Page 4, “Bilingual NER by Agreement”
  9. and the corresponding variable denotes the assignment to word f_j by the bilingual factor.
    Page 4, “Bilingual NER by Agreement”
  10. We train the two CRF models on all portions of the OntoNotes corpus that are annotated with named entity tags, except the parallel-aligned portion which we reserve for development and test purposes.
    Page 6, “Experimental Setup”
  11. 1The exact feature set and the CRF implementation
    Page 6, “Experimental Setup”
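Item 5 above describes bootstrapping PMI scores by tagging unannotated bi-text with the monolingual CRF taggers and counting aligned entity pairs. A minimal sketch of the PMI estimation step, using unsmoothed maximum-likelihood counts (the paper's exact estimator is not specified in these excerpts):

```python
import math
from collections import Counter

def pmi_table(tag_pairs):
    """PMI of aligned tag pairs (t_e, t_f) from co-occurrence counts:
    PMI(t_e, t_f) = log p(t_e, t_f) / (p(t_e) p(t_f)),
    where each probability is a relative frequency over the pair list."""
    joint = Counter(tag_pairs)
    left = Counter(te for te, _ in tag_pairs)
    right = Counter(tf for _, tf in tag_pairs)
    n = len(tag_pairs)
    # joint/n divided by (left/n)(right/n) simplifies to joint*n/(left*right)
    return {(te, tf): math.log(joint[(te, tf)] * n / (left[te] * right[tf]))
            for te, tf in joint}
```

Tag pairs that co-occur more often than chance (e.g., PER aligned to PER) get positive scores; mismatched pairs get negative ones.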


alignment models

Appears in 10 sentences as: aligner model (1) alignment model (2) alignment models (7)
  1. We observe that NER label information can be used to correct alignment mistakes, and present a graphical model that performs bilingual NER tagging jointly with word alignment, by combining two monolingual tagging models with two unidirectional alignment models.
    Page 1, “Abstract”
  2. To capture this source of information, we present a novel extension that combines the BI-NER model with two unidirectional HMM-based alignment models, and perform joint decoding of NER and word alignments.
    Page 2, “Introduction”
  3. The new model (denoted as BI-NER-WA) factors over five components: one NER model and one word alignment model for each language, plus a joint NER-alignment model which not only enforces NER label agreements but also facilitates message passing among the other four components.
    Page 2, “Introduction”
  4. Most commonly used alignment models, such as the IBM models and the HMM-based aligner, are unsupervised learners, and can only capture simple distortion features and lexical translational features due to the high complexity of the structure prediction space.
    Page 4, “Joint Alignment and NER Decoding”
  5. We name the Chinese-to-English aligner model as η(B^e) and the reverse directional model η(B^f). B^e is a matrix that holds the output of the Chinese-to-English aligner.
    Page 4, “Joint Alignment and NER Decoding”
  6. In our experiments, we used two HMM-based alignment models.
    Page 4, “Joint Alignment and NER Decoding”
  7. But in principle we can adopt any alignment model as long as we can perform efficient inference over it.
    Page 4, “Joint Alignment and NER Decoding”
  8. But the real power of these cross-language edge cliques is that they act as a liaison between the NER and alignment models on each language side, and encourage these models to indirectly agree with each other by having them all agree with the edge cliques.
    Page 5, “Joint Alignment and NER Decoding”
  9. It is also worth noting that since we decode the alignment models with Viterbi inference, additional constraints such as the neighborhood constraint proposed by DeNero and Macherey (2011) can be easily integrated into our model.
    Page 5, “Joint Alignment and NER Decoding”
  10. directional HMM models as our baseline and monolingual alignment models.
    Page 6, “Experimental Setup”
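The HMM aligners referenced in these excerpts score an alignment with lexical translation probabilities and a distortion term over jumps, and are decoded with Viterbi inference. A toy sketch of that model family (the probability tables and the floor constant for unseen events are illustrative assumptions, not the paper's aligner):

```python
import math

def viterbi_align(src, tgt, t_prob, d_prob, floor=1e-9):
    """Viterbi decoding for a toy HMM word aligner: target word f_j is
    aligned to source position a_j; each step scores the translation
    probability t_prob[(f_j, e_{a_j})] times the distortion probability
    d_prob[a_j - a_{j-1}] over the jump from the previous alignment."""
    I, J = len(src), len(tgt)
    best = [[float('-inf')] * I for _ in range(J)]
    back = [[0] * I for _ in range(J)]
    for i in range(I):  # uniform start: only emission matters at j = 0
        best[0][i] = math.log(t_prob.get((tgt[0], src[i]), floor))
    for j in range(1, J):
        for i in range(I):
            emit = math.log(t_prob.get((tgt[j], src[i]), floor))
            for k in range(I):
                s = best[j - 1][k] + math.log(d_prob.get(i - k, floor)) + emit
                if s > best[j][i]:
                    best[j][i], back[j][i] = s, k
    i = max(range(I), key=lambda k: best[J - 1][k])
    path = [i]
    for j in range(J - 1, 0, -1):  # follow backpointers to recover a_1..a_J
        i = back[j][i]
        path.append(i)
    return path[::-1]
```

On a monotone two-word pair with strong lexical probabilities and a forward-jump preference, the decoder recovers the diagonal alignment.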


cross-lingual

Appears in 8 sentences as: cross-lingual (8)
  1. We introduce additional cross-lingual edge factors that encourage agreements between tagging and alignment decisions.
    Page 1, “Abstract”
  2. In order to model this uncertainty, we extend the two previously independent CRF models into a larger undirected graphical model, by introducing a cross-lingual edge factor φ(i, j) for every pair of word positions (i, j) ∈ A.
    Page 3, “Bilingual NER by Agreement”
  3. Initially, each of the cross-lingual edge factors will attempt to assign a pair of tags that has the highest PMI score, but if the monolingual taggers do not agree, a penalty will start accumulating over this pair, until some other pair that agrees better with the monolingual models takes the top spot.
    Page 3, “Bilingual NER by Agreement”
  4. Simultaneously, the monolingual models will also be encouraged to agree with the cross-lingual edge factors.
    Page 4, “Bilingual NER by Agreement”
  5. This way, the various components effectively trade penalties indirectly through the cross-lingual edges, until a tag sequence that maximizes the joint probability is achieved.
    Page 4, “Bilingual NER by Agreement”
  6. We introduce a cross-lingual edge factor C(i, j) in the undirected graphical model for every pair of word indices (i, j), which predicts a binary variable.
    Page 4, “Joint Alignment and NER Decoding”
  7. One special note is that after each iteration when we consider updates to the dual constraint for entity tags, we only check tag agreements for cross-lingual edge factors that have an alignment assignment value of 1.
    Page 5, “Joint Alignment and NER Decoding”
  8. In other words, cross-lingual edges that are not aligned do not affect bilingual NER tagging.
    Page 5, “Joint Alignment and NER Decoding”
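Items 3 and 4 in this list describe each cross-lingual edge factor initially proposing the tag pair with the highest PMI score, then accumulating penalties when the monolingual taggers disagree until a better-agreeing pair takes the top spot. A sketch of that bookkeeping (the scores and the penalty increment are illustrative assumptions):

```python
def edge_best_pair(pmi, penalty):
    """A cross-lingual edge factor's current decision: the tag pair
    (t_e, t_f) maximizing its PMI score minus the accumulated penalty."""
    return max(pmi, key=lambda p: pmi[p] - penalty.get(p, 0.0))

def penalize(penalty, pair, step=0.2):
    """Accumulate penalty on a pair the monolingual taggers rejected."""
    penalty[pair] = penalty.get(pair, 0.0) + step
```

Starting from PMI scores {('PER', 'PER'): 1.2, ('O', 'O'): 0.9}, two penalty steps of 0.2 against ('PER', 'PER') are enough for ('O', 'O') to take the top spot, mirroring how penalties trade off against PMI in the excerpts.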


named entity

Appears in 8 sentences as: named entities (1) Named Entity (2) named entity (8)
  1. Translated bi-texts contain complementary language cues, and previous work on Named Entity Recognition (NER) has demonstrated improvements in performance over monolingual taggers by promoting agreement of tagging decisions between the two languages.
    Page 1, “Abstract”
  2. We study the problem of Named Entity Recognition (NER) in a bilingual context, where the goal is to annotate parallel bi-texts with named entity tags.
    Page 1, “Introduction”
  3. We can also automatically construct a named entity translation lexicon by annotating and extracting entities from bi-texts, and use it to improve MT performance (Huang and Vogel, 2002; Al-Onaizan and Knight, 2002).
    Page 1, “Introduction”
  4. As a result, we can find complementary cues in the two languages that help to disambiguate named entity mentions (Brown et al., 1991).
    Page 1, “Introduction”
  5. Our method does not require any manual annotation of word alignments or named entities over the bilingual training data.
    Page 2, “Introduction”
  6. We train the two CRF models on all portions of the OntoNotes corpus that are annotated with named entity tags, except the parallel-aligned portion which we reserve for development and test purposes.
    Page 6, “Experimental Setup”
  7. Out of the 18 named entity types that are annotated in OntoNotes, which include person, location, date, money, and so on, we select the four most commonly seen named entity types for evaluation.
    Page 6, “Experimental Setup”
  8. set of heuristic rules to expand a candidate named entity set generated by monolingual taggers, and then rank those candidates using a bilingual named entity dictionary.
    Page 9, “Related Work”


graphical model

Appears in 7 sentences as: graphical model (6) graphical models (1)
  1. We observe that NER label information can be used to correct alignment mistakes, and present a graphical model that performs bilingual NER tagging jointly with word alignment, by combining two monolingual tagging models with two unidirectional alignment models.
    Page 1, “Abstract”
  2. In this work, we first develop a bilingual NER model (denoted as BI-NER) by embedding two monolingual CRF-based NER models into a larger undirected graphical model, and introduce additional edge factors based on word alignment (WA).
    Page 1, “Introduction”
  3. Unlike previous applications of the DD method in NLP, where the model typically factors over two components and agreement is to be sought between the two (Rush et al., 2010; Koo et al., 2010; DeNero and Macherey, 2011; Chieu and Teow, 2012), our method decomposes the larger graphical model into many overlapping components where each alignment edge forms a separate factor.
    Page 2, “Introduction”
  4. In order to model this uncertainty, we extend the two previously independent CRF models into a larger undirected graphical model, by introducing a cross-lingual edge factor φ(i, j) for every pair of word positions (i, j) ∈ A.
    Page 3, “Bilingual NER by Agreement”
  5. The way DD algorithms work in decomposing undirected graphical models is analogous to other message passing algorithms such as loopy belief propagation, but DD gives a stronger optimality guarantee upon convergence (Rush et al., 2010).
    Page 4, “Bilingual NER by Agreement”
  6. We introduce a cross-lingual edge factor C(i, j) in the undirected graphical model for every pair of word indices (i, j), which predicts a binary variable.
    Page 4, “Joint Alignment and NER Decoding”
  7. We introduced a graphical model that combines two HMM word aligners and two CRF NER taggers into a joint model, and presented a dual decomposition inference method for performing efficient decoding over this model.
    Page 9, “Conclusion”


sentence pairs

Appears in 6 sentences as: sentence pair (1) sentence pairs (5)
  1. The inputs to our models are parallel sentence pairs (see Figure 1 for an example in English and
    Page 2, “Bilingual NER by Agreement”
  2. Since we assume no bilingually annotated NER corpus is available, in order to get an estimate of the PMI scores, we first tag a collection of unannotated bilingual sentence pairs using the monolingual CRF taggers, and collect counts of aligned entity pairs from this auto-generated tagged data.
    Page 4, “Bilingual NER by Agreement”
  3. After discarding sentences with no aligned counterpart, a total of 402 documents and 8,249 parallel sentence pairs were used for evaluation.
    Page 6, “Experimental Setup”
  4. Word alignment evaluation is done over the sections of OntoNotes that have matching gold-standard word alignment annotations from the GALE Y1Q4 dataset. This subset contains 288 documents and 3,391 sentence pairs.
    Page 6, “Experimental Setup”
  5. An extra set of 5,000 unannotated parallel sentence pairs are used for
    Page 6, “Experimental Setup”
  6. In this example, a snippet of a longer sentence pair is shown with NER and word alignment results.
    Page 7, “Error Analysis and Discussion”


manually annotated

Appears in 5 sentences as: manual annotation (1) manually annotated (4)
  1. Our method does not require any manual annotation of word alignments or named entities over the bilingual training data.
    Page 2, “Introduction”
  2. On the other hand, the CRF-based NER models are trained on manually annotated data, and admit richer sequence and lexical features.
    Page 4, “Joint Alignment and NER Decoding”
  3. If a model only gives good performance with well-tuned hyper-parameters, then we must have manually annotated data for tuning, which would significantly reduce the applicability and portability of this method to other language pairs and tasks.
    Page 7, “Bilingual NER Results”
  4. proach is that training such a feature-rich model requires manually annotated bilingual NER data, which can be prohibitively expensive to generate.
    Page 8, “Related Work”
  5. The model demonstrates performance improvements in both parsing and alignment, but shares the common limitations of other supervised work in that it requires manually annotated bilingual joint parsing and word alignment data.
    Page 9, “Related Work”


log-linear

Appears in 3 sentences as: log-linear (3)
  1. But instead of using just the PMI scores of bilingual NE pairs, as in our work, they employed a feature-rich log-linear model to capture bilingual correlations.
    Page 6, “Experimental Setup”
  2. Parameters in their log-linear model require training with bilingually annotated data, which is not readily available.
    Page 6, “Experimental Setup”
  3. (2010a) presented a supervised learning method for performing joint parsing and word alignment using log-linear models over parse trees and an ITG model over alignment.
    Page 9, “Related Work”


log-linear model

Appears in 3 sentences as: log-linear model (2) log-linear models (1)
  1. But instead of using just the PMI scores of bilingual NE pairs, as in our work, they employed a feature-rich log-linear model to capture bilingual correlations.
    Page 6, “Experimental Setup”
  2. Parameters in their log-linear model require training with bilingually annotated data, which is not readily available.
    Page 6, “Experimental Setup”
  3. (2010a) presented a supervised learning method for performing joint parsing and word alignment using log-linear models over parse trees and an ITG model over alignment.
    Page 9, “Related Work”


parallel sentence

Appears in 3 sentences as: parallel sentence (3)
  1. The inputs to our models are parallel sentence pairs (see Figure 1 for an example in English and
    Page 2, “Bilingual NER by Agreement”
  2. After discarding sentences with no aligned counterpart, a total of 402 documents and 8,249 parallel sentence pairs were used for evaluation.
    Page 6, “Experimental Setup”
  3. An extra set of 5,000 unannotated parallel sentence pairs are used for
    Page 6, “Experimental Setup”


significant improvements

Appears in 3 sentences as: significant improvements (4)
  1. Experiments on the OntoNotes dataset demonstrate that our method yields significant improvements in both NER and word alignment over state-of-the-art monolingual baselines.
    Page 1, “Abstract”
  2. By jointly decoding NER with word alignment, our model not only maintains significant improvements in NER performance, but also yields significant improvements to alignment performance.
    Page 7, “Joint NER and Alignment Results”
  3. Results from NER and word alignment experiments suggest that our method gives significant improvements in both NER and word alignment.
    Page 9, “Conclusion”


Viterbi

Appears in 3 sentences as: Viterbi (3)
  1. It is also worth noting that since we decode the alignment models with Viterbi inference, additional constraints such as the neighborhood constraint proposed by DeNero and Macherey (2011) can be easily integrated into our model.
    Page 5, “Joint Alignment and NER Decoding”
  2. Instead of enforcing agreement in the alignment space based on best sequences found by Viterbi, we could opt to encourage agreement between posterior probability distributions, which is related to the posterior regularization work by Graca et al.
    Page 9, “Related Work”
  3. We also differ in the implementation details, where in their case belief propagation is used in both training and Viterbi inference.
    Page 9, “Related Work”
