Word Alignment Modeling with Context Dependent Deep Neural Network
Yang, Nan and Liu, Shujie and Li, Mu and Zhou, Ming and Yu, Nenghai

Article Structure

Abstract

In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011).

Introduction

In recent years, research communities have seen a strong resurgence of interest in modeling with deep (multilayer) neural networks.

Related Work

DNN with unsupervised pre-training was first introduced by (Hinton et al., 2006) for the MNIST digit image classification problem, in which an RBM was introduced as the layer-wise pre-trainer.

DNN structures for NLP

The most important and prevalent features available in NLP are the words themselves.

DNN for word alignment

Our DNN word alignment model extends the classic HMM word alignment model (Vogel et al., 1996).

Training

Although unsupervised training techniques such as Contrastive Estimation, as in (Smith and Eisner, 2005) and (Dyer et al., 2011), can be adapted to train …

Experiments and Results

We conduct our experiment on Chinese-to-English word alignment task.

Conclusion

In this paper, we explore applying deep neural networks to the word alignment task.

Topics

neural network

Appears in 33 sentences as: Neural Network (2) neural network (26) Neural Networks (2) neural networks (6)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011).
    Page 1, “Abstract”
  2. In recent years, research communities have seen a strong resurgence of interest in modeling with deep (multilayer) neural networks.
    Page 1, “Introduction”
  3. For speech recognition, (Dahl et al., 2012) proposed a context-dependent neural network with a large vocabulary, which achieved a 16.0% relative error reduction.
    Page 1, “Introduction”
  4. (Collobert et al., 2011) and (Socher et al., 2011) further apply Recursive Neural Networks to address structural prediction tasks such as tagging and parsing, and (Socher et al., 2012) explores the compositional aspect of word representations.
    Page 1, “Introduction”
  5. Based on the above analysis, in this paper, the words on both the source and target sides are first mapped to vectors via discriminatively trained word embeddings, and word pairs are scored by a multilayer neural network which takes rich contexts (surrounding words on both the source and target sides) into consideration; an HMM-like distortion model is applied on top of the neural network to characterize the structural aspect of bilingual sentences.
    Page 2, “Introduction”
  6. (Seide et al., 2011) and (Dahl et al., 2012) apply the Context-Dependent Deep Neural Network with HMM (CD-DNN-HMM) to the speech recognition task, which significantly outperforms traditional models.
    Page 2, “Related Work”
  7. (Bengio et al., 2006) proposed to use a multilayer neural network for the language modeling task.
    Page 2, “Related Work”
  8. The lookup process is called a lookup layer LT, which is usually the first layer after the input layer in a neural network.
    Page 3, “DNN structures for NLP”
  9. Multilayer neural networks are trained with the standard back propagation algorithm (LeCun, 1985).
    Page 3, “DNN structures for NLP”
  10. Techniques such as layer-wise pre-training (Bengio et al., 2007) and many tricks (LeCun et al., 1998) have been developed to train better neural networks.
    Page 3, “DNN structures for NLP”
  11. Besides that, neural network training also involves some hyperparameters such as the learning rate and the number of hidden layers.
    Page 3, “DNN structures for NLP”
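
The excerpts above describe the generic building blocks: a lookup layer LT that maps words to embeddings, followed by classical fully connected layers trained with back propagation. As a sketch in LaTeX notation, using the W^l, b^l names from the "Tunable parameters" sentence quoted below (the choice of nonlinearity f is not fixed by these excerpts):

    \mathrm{LT}(w) = W^{V}_{\cdot,\, w}                                    % column lookup in the embedding matrix
    h^{l} = f\left( W^{l} h^{l-1} + b^{l} \right), \quad l = 1, \dots, n   % classical hidden layers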

word alignment

Appears in 28 sentences as: word aligned (1) Word alignment (1) word alignment (28) word alignments (1)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011).
    Page 1, “Abstract”
  2. We describe in detail how we adapt and extend the CD-DNN-HMM (Dahl et al., 2012) method introduced in speech recognition to the HMM-based word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences.
    Page 1, “Abstract”
  3. Experiments on a large-scale English-Chinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.
    Page 1, “Abstract”
  4. Inspired by successful previous works, we propose a new DNN-based word alignment method, which exploits contextual and semantic similarities between words.
    Page 1, “Introduction”
  5. Figure 1: Two examples of word alignment
    Page 2, “Introduction”
  6. In the rest of this paper, related work about DNN and word alignment is first reviewed in Section 2, followed by a brief introduction to DNN in Section 3.
    Page 2, “Introduction”
  7. We then introduce the details of leveraging DNN for word alignment, including the details of our network structure in Section 4
    Page 2, “Introduction”
  8. For related work on word alignment, the most popular methods are based on generative models such as the IBM Models (Brown et al., 1993) and HMM (Vogel et al., 1996).
    Page 2, “Related Work”
  9. Discriminative approaches have also been proposed that use hand-crafted features to improve word alignment.
    Page 2, “Related Work”
  10. Our DNN word alignment model extends the classic HMM word alignment model (Vogel et al., 1996).
    Page 3, “DNN for word alignment”
  11. Given a sentence pair (e, f), HMM word alignment takes the following form:
    Page 3, “DNN for word alignment”
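
The quoted sentence above is cut off before the formula it introduces. For reference, the classic HMM factorization of (Vogel et al., 1996) that the model extends, written with the P_lex and P_d notation quoted under "translation probability" below, is, as a sketch (the paper's exact notation may differ):

    P(\mathbf{a}, \mathbf{e} \mid \mathbf{f}) = \prod_{i=1}^{|\mathbf{e}|} P_{lex}(e_i \mid f_{a_i}) \cdot P_{d}(a_i - a_{i-1})

where a_i is the position in f to which e_i is aligned.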

word embeddings

Appears in 23 sentences as: Word Embedding (1) Word embedding (1) word embedding (7) Word embeddings (2) word embeddings (12)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. We describe in detail how we adapt and extend the CD-DNN-HMM (Dahl et al., 2012) method introduced in speech recognition to the HMM-based word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences.
    Page 1, “Abstract”
  2. Most works convert atomic lexical entries into a dense, low-dimensional, real-valued representation, called a word embedding; each dimension represents a latent aspect of a word, capturing its semantic and syntactic properties (Bengio et al., 2006).
    Page 1, “Introduction”
  3. Word embeddings are usually first learned from a huge amount of monolingual text, and then fine-tuned with task-specific objectives.
    Page 1, “Introduction”
  4. As we mentioned in the last paragraph, word embeddings (trained on huge monolingual texts) have the ability to map a word into a vector space in which similar words are near each other.
    Page 2, “Introduction”
  5. Based on the above analysis, in this paper, the words on both the source and target sides are first mapped to vectors via discriminatively trained word embeddings, and word pairs are scored by a multilayer neural network which takes rich contexts (surrounding words on both the source and target sides) into consideration; an HMM-like distortion model is applied on top of the neural network to characterize the structural aspect of bilingual sentences.
    Page 2, “Introduction”
  6. Most methods using DNN in NLP start with a word embedding phase, which maps words into fixed-length, real-valued vectors.
    Page 2, “Related Work”
  7. (Titov et al., 2012) learns context-free cross-lingual word embeddings to facilitate cross-lingual information retrieval.
    Page 2, “Related Work”
  8. To apply DNN to an NLP task, the first step is to transform a discrete word into its word embedding, a low-dimensional, dense, real-valued vector (Bengio et al., 2006).
    Page 3, “DNN structures for NLP”
  9. Word embeddings often implicitly encode syntactic or semantic knowledge of the words.
    Page 3, “DNN structures for NLP”
  10. Assuming a finite-sized vocabulary V, word embeddings form an (L × |V|)-dimensional embedding matrix W^V, where L is a predetermined embedding length; mapping words to embeddings is done by simply looking up their respective columns in the embedding matrix W^V.
    Page 3, “DNN structures for NLP”
  11. Tunable parameters in the neural network alignment model include: word embeddings in the lookup table LT, parameters W^l, b^l for the linear transformations in the hidden layers of the neural network, and distortion parameters sd of jump distance.
    Page 5, “Training”
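
A minimal numpy sketch of the column-lookup step just described; the variable names and the random initialization are illustrative, and the values 20 and 100,000 are simply the embedding length and vocabulary size quoted elsewhere in these excerpts:

    import numpy as np

    L = 20                               # embedding length quoted in the excerpts
    V = 100000                           # vocabulary size quoted in the excerpts
    W_V = 0.01 * np.random.randn(L, V)   # the (L x |V|) embedding matrix W^V

    def lookup(word_ids):
        """Lookup layer LT: map word indices to embeddings by selecting
        the corresponding columns of W^V."""
        return W_V[:, word_ids]          # shape (L, len(word_ids))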

embeddings

Appears in 21 sentences as: embeddings (23)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. Based on the above analysis, in this paper, the words on both the source and target sides are first mapped to vectors via discriminatively trained word embeddings, and word pairs are scored by a multilayer neural network which takes rich contexts (surrounding words on both the source and target sides) into consideration; an HMM-like distortion model is applied on top of the neural network to characterize the structural aspect of bilingual sentences.
    Page 2, “Introduction”
  2. (Titov et al., 2012) learns context-free cross-lingual word embeddings to facilitate cross-lingual information retrieval.
    Page 2, “Related Work”
  3. Word embeddings often implicitly encode syntactic or semantic knowledge of the words.
    Page 3, “DNN structures for NLP”
  4. Assuming a finite-sized vocabulary V, word embeddings form an (L × |V|)-dimensional embedding matrix W^V, where L is a predetermined embedding length; mapping words to embeddings is done by simply looking up their respective columns in the embedding matrix W^V.
    Page 3, “DNN structures for NLP”
  5. After words have been transformed to their embeddings, they can be fed into subsequent classical network layers to model highly nonlinear relations:
    Page 3, “DNN structures for NLP”
  6. Words are converted to embeddings using the lookup table LT, and the concatenation of the embeddings is fed to a classic neural network with two hidden layers; the output of the network is our lexical translation score:
    Page 4, “DNN for word alignment”
  7. Tunable parameters in the neural network alignment model include: word embeddings in the lookup table LT, parameters W^l, b^l for the linear transformations in the hidden layers of the neural network, and distortion parameters sd of jump distance.
    Page 5, “Training”
  8. Most parameters reside in the word embeddings.
    Page 5, “Training”
  9. To get a good initial value, the usual approach is to pre-train the embeddings on a large monolingual corpus.
    Page 5, “Training”
  10. et al., 2011) and train word embeddings for source and target languages from their monolingual corpus respectively.
    Page 5, “Training”
  11. Word embeddings from a monolingual corpus learn strong syntactic knowledge of each word, which is not always desirable for word alignment between some language pairs like English and Chinese.
    Page 5, “Training”
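
A minimal numpy sketch of the scoring path just described (lookup, concatenation, two hidden layers, scalar output). The tanh nonlinearity, the parameter shapes, and all names are assumptions made for illustration rather than details taken from the paper:

    import numpy as np

    def lexical_score(src_window_ids, tgt_window_ids, params):
        """Score one source-target word pair given its surrounding windows."""
        W_V, (W1, b1), (W2, b2), (w_out, b_out) = params
        # lookup layer LT: embeddings of both windows, concatenated into one vector
        x = np.concatenate([W_V[:, src_window_ids].ravel(order="F"),
                            W_V[:, tgt_window_ids].ravel(order="F")])
        h1 = np.tanh(W1 @ x + b1)            # first hidden layer
        h2 = np.tanh(W2 @ h1 + b2)           # second hidden layer
        return float(w_out @ h2 + b_out)     # scalar lexical translation score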

hidden layers

Appears in 11 sentences as: hidden layer (6) hidden layers (8)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. Besides that, neural network training also involves some hyperparameters such as the learning rate and the number of hidden layers.
    Page 3, “DNN structures for NLP”
  2. Tunable parameters in the neural network alignment model include: word embeddings in the lookup table LT, parameters W^l, b^l for the linear transformations in the hidden layers of the neural network, and distortion parameters sd of jump distance.
    Page 5, “Training”
  3. We set word embedding length to 20, window size to 5, and the length of the only hidden layer to 40.
    Page 5, “Training”
  4. To make our model concrete, there are still hyper-parameters to be determined: the window sizes sw and tw, and the length of each hidden layer L^l.
    Page 6, “Training”
  5. Table 3: Effect of different numbers of hidden layers.
    Page 7, “Experiments and Results”
  6. Two hidden layers outperform one hidden layer, while three hidden layers do not bring further improvement.
    Page 7, “Experiments and Results”
  7. 6.4.3 Effect of number of hidden layers
    Page 7, “Experiments and Results”
  8. Our neural network contains two hidden layers besides the lookup layer.
    Page 7, “Experiments and Results”
  9. For the 1-hidden-layer setting, we set the hidden layer length to 120; and for the 3-hidden-layer setting, we set the hidden layer lengths to 120, 100, 10 respectively.
    Page 7, “Experiments and Results”
  10. As can be seen from Table 3, the 2-hidden-layer setting outperforms the 1-hidden-layer setting, while another hidden layer does not bring
    Page 7, “Experiments and Results”
  11. Due to time constraints, we have not tuned the hyper-parameters such as the lengths of the hidden layers in the 1- and 3-hidden-layer settings, nor have we tested settings with more hidden layers.
    Page 8, “Experiments and Results”
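
Collecting the hyper-parameter values quoted in these excerpts in one place (an illustrative grouping only; values not quoted above, such as the layer sizes of the default 2-hidden-layer setting, are deliberately left out rather than guessed):

    # Hyper-parameters quoted in the excerpts above.
    hyperparams = {
        "embedding_length": 20,                    # "word embedding length to 20"
        "window_size": 5,                          # "window size to 5"
        "pretraining_hidden_length": 40,           # "the length of the only hidden layer to 40"
        "hidden_lengths_1_layer": [120],           # 1-hidden-layer setting
        "hidden_lengths_3_layer": [120, 100, 10],  # 3-hidden-layer setting
    }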

word pair

Appears in 11 sentences as: word pair (11) word pairs (2)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. As shown in example (a) of Figure 1, in word pair {“juda” => “mammoth”}, the Chinese word “juda” is a common word, but
    Page 1, “Introduction”
  2. For example (b) in Figure 1, for the word pair {“yibula” => “Yibula”}, both the Chinese word “yibula” and the English word “Yibula” are rare named entities, but the words around them are very common, which are {“nongmin”, “shuo”} for the Chinese side and {“farmer”, “said”} for the English side.
    Page 2, “Introduction”
  3. The pattern of the context {“nongmin X shuo” => “farmer X said”} may help to align the word pair which fills the variable X, and also the pattern {“yixiang X gongcheng” => “a X job”} is helpful to align the word pair {“juda” => “mammoth”} for example (a).
    Page 2, “Introduction”
  4. Based on the above analysis, in this paper, the words on both the source and target sides are first mapped to vectors via discriminatively trained word embeddings, and word pairs are scored by a multilayer neural network which takes rich contexts (surrounding words on both the source and target sides) into consideration; an HMM-like distortion model is applied on top of the neural network to characterize the structural aspect of bilingual sentences.
    Page 2, “Introduction”
  5. In contrast, our model does not maintain separate translation score parameters for every source-target word pair, but computes t_lex through a multilayer network, which naturally handles contexts on both sides without explosive growth in the number of parameters.
    Page 3, “DNN for word alignment”
  6. The example computes translation score for word pair (yibula, yibulayin) given its surrounding context.
    Page 4, “DNN for word alignment”
  7. For word pair (e_i, f_j), we take fixed length windows surrounding both e_i and f_j as input: (e_{i−sw/2}, …
    Page 4, “DNN for word alignment”
  8. To decode our model, the lexical translation scores are computed for each source-target word pair in the sentence pair, which requires going through the neural network |e| × |f| times; after that, the forward-backward algorithm can be used to find the Viterbi path as in the classic HMM model.
    Page 4, “DNN for word alignment”
  9. loss = max{0, 1 − t_θ((e, f)^+|e, f) + t_θ((e, f)^−|e, f)} (10), where (e, f)^+ is a correct word pair, (e, f)^− is a wrong word pair in the same sentence, and t_θ is as defined in Eq.
    Page 5, “Training”
  10. This training criterion essentially means our model suffers a loss unless it gives correct word pairs a higher score than random pairs from the same sentence pair by some margin.
    Page 5, “Training”
  11. We randomly cycle through all sentence pairs in the training data; for each correct word pair (including null alignment), we generate a positive example, and generate two negative examples by randomly corrupting either
    Page 5, “Training”
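
A minimal numpy sketch of the margin-based criterion and the negative-example generation described above. The function names, the margin of 1, and the index-based representation of word pairs are illustrative assumptions, not the paper's exact implementation:

    import numpy as np

    def ranking_loss(score_pos, score_neg, margin=1.0):
        """Hinge loss of Eq. (10): zero unless the correct pair outscores
        the corrupted pair by at least the margin."""
        return max(0.0, margin - score_pos + score_neg)

    def make_negatives(src_sent, tgt_sent, i, j, rng=np.random):
        """For the correct pair (src_sent[i], tgt_sent[j]), build two negative
        examples by corrupting either side with another word drawn from the
        same sentence pair."""
        neg_src = (rng.randint(len(src_sent)), j)   # corrupt the source side
        neg_tgt = (i, rng.randint(len(tgt_sent)))   # corrupt the target side
        return [neg_src, neg_tgt]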

sentence pairs

Appears in 10 sentences as: sentence pair (4) sentence pairs (6)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. Given a sentence pair (e, f), HMM word alignment takes the following form:
    Page 3, “DNN for word alignment”
  2. To decode our model, the lexical translation scores are computed for each source-target word pair in the sentence pair, which requires going through the neural network |e| × |f| times; after that, the forward-backward algorithm can be used to find the Viterbi path as in the classic HMM model.
    Page 4, “DNN for word alignment”
  3. In practice, the number of nonzero parameters in the classic HMM model would be much smaller, as many words do not co-occur in bilingual sentence pairs.
    Page 4, “Training”
  4. our model from raw sentence pairs, they are too computationally demanding, as the lexical translation probabilities must be computed from neural networks.
    Page 5, “Training”
  5. Hence, we opt for a simpler supervised approach, which learns the model from sentence pairs with word alignment.
    Page 5, “Training”
  6. This training criterion essentially means our model suffers a loss unless it gives correct word pairs a higher score than random pairs from the same sentence pair by some margin.
    Page 5, “Training”
  7. We randomly cycle through all sentence pairs in the training data; for each correct word pair (including null alignment), we generate a positive example, and generate two negative examples by randomly corrupting either
    Page 5, “Training”
  8. side of the pair with another word in the sentence pair.
    Page 6, “Training”
  9. We use the manually aligned Chinese-English alignment corpus (Haghighi et al., 2009), which contains 491 sentence pairs, as the test set.
    Page 6, “Experiments and Results”
  10. Our parallel corpus contains about 26 million unique sentence pairs in total, which are mined from the web.
    Page 6, “Experiments and Results”
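
A minimal numpy sketch of the decoding step described in item 2: the |e| × |f| lexical scores are precomputed with the network, then the best alignment path is recovered over the HMM. Treating the scores as quantities that add along the path, and ignoring NULL alignments, are simplifying assumptions made for brevity:

    import numpy as np

    def viterbi_align(lex_scores, jump_score):
        """lex_scores[i, j]: network score for aligning e_i to f_j (an |e| x |f| array).
        jump_score(d): distortion score for a jump of distance d.
        Returns a[i], the aligned target position for each source position i."""
        I, J = lex_scores.shape
        delta = np.full((I, J), -np.inf)      # best score of any path ending at (i, j)
        back = np.zeros((I, J), dtype=int)    # backpointers
        delta[0] = lex_scores[0]
        for i in range(1, I):
            for j in range(J):
                trans = delta[i - 1] + np.array([jump_score(j - jp) for jp in range(J)])
                back[i, j] = int(np.argmax(trans))
                delta[i, j] = trans[back[i, j]] + lex_scores[i, j]
        a = np.zeros(I, dtype=int)
        a[-1] = int(np.argmax(delta[-1]))
        for i in range(I - 2, -1, -1):        # trace back the best path
            a[i] = back[i + 1, a[i + 1]]
        return a

The jump_score argument can be the jump-distance table sketched under "lexicalized" below.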

alignment model

Appears in 9 sentences as: alignment model (8) alignment models (2)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. We describe in detail how we adapt and extend the CD-DNN-HMM (Dahl et al., 2012) method introduced in speech recognition to the HMM-based word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences.
    Page 1, “Abstract”
  2. Our DNN word alignment model extends the classic HMM word alignment model (Vogel et al., 1996).
    Page 3, “DNN for word alignment”
  3. In the classic HMM word alignment model, context is not considered in the lexical translation probability.
    Page 3, “DNN for word alignment”
  4. Vocabulary V of our alignment model consists of a source vocabulary Ve and a target vocabulary Vf.
    Page 4, “DNN for word alignment”
  5. As we do not have a large manually word-aligned corpus, we use traditional word alignment models such as HMM and IBM model 4 to generate word alignment on a large parallel corpus.
    Page 5, “Training”
  6. Tunable parameters in the neural network alignment model include: word embeddings in the lookup table LT, parameters W^l, b^l for the linear transformations in the hidden layers of the neural network, and distortion parameters sd of jump distance.
    Page 5, “Training”
  7. In the future, we would like to explore whether our method can improve other word alignment models.
    Page 6, “Experiments and Results”
  8. embeddings trained by our word alignment model.
    Page 8, “Experiments and Results”
  9. Secondly, we want to explore the possibility of unsupervised training of our neural word alignment model, without reliance on the alignment results of other models.
    Page 8, “Conclusion”

parallel corpus

Appears in 6 sentences as: parallel corpus (6)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. As we do not have a large manually word-aligned corpus, we use traditional word alignment models such as HMM and IBM model 4 to generate word alignment on a large parallel corpus.
    Page 5, “Training”
  2. Our vocabularies Vs and Vt contain the most frequent 100,000 words from each side of the parallel corpus, and all other words are treated as unknown words.
    Page 5, “Training”
  3. As there is no clear stopping criterion, we simply run the stochastic optimizer through the parallel corpus for N iterations.
    Page 6, “Training”
  4. As there are only 17 parameters in sd, we only need to run the optimizer over a small portion of the parallel corpus.
    Page 6, “Training”
  5. Our parallel corpus contains about 26 million unique sentence pairs in total, which are mined from the web.
    Page 6, “Experiments and Results”
  6. The result is not surprising considering our parallel corpus is quite large, and similar observations have been made in previous work such as (DeNero and Macherey, 2011): better alignment quality does not necessarily lead to a better end-to-end result.
    Page 7, “Experiments and Results”

Chinese word

Appears in 5 sentences as: Chinese word (3) Chinese words (2)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. As shown in example (a) of Figure 1, in word pair {“juda” => “mammoth”}, the Chinese word “juda” is a common word, but
    Page 1, “Introduction”
  2. For example (b) in Figure 1, for the word pair {“yibula” => “Yibula”}, both the Chinese word “yibula” and the English word “Yibula” are rare named entities, but the words around them are very common, which are {“nongmin”, “shuo”} for the Chinese side and {“farmer”, “said”} for the English side.
    Page 2, “Introduction”
  3. For example, many Chinese words can act as a verb, noun and adjective without any change, while their English counterparts are distinct words with quite different word embeddings due to their different syntactic roles.
    Page 5, “Training”
  4. By analyzing the results, we found that for both the baseline and our model, a large part of the missing alignment links involves stop words like the English words “the”, “a”, “it” and the Chinese word “de”.
    Page 7, “Experiments and Results”
  5. As the Chinese language lacks morphology, the singular and plural forms of a noun in English often correspond to the same Chinese word; thus it is desirable that the two English words have similar word embeddings.
    Page 8, “Experiments and Results”

language model

Appears in 3 sentences as: language model (2) language modeling (1)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. (Bengio et al., 2006) proposed to use a multilayer neural network for the language modeling task.
    Page 2, “Related Work”
  2. (Niehues and Waibel, 2012) shows that machine translation results can be improved by combining a neural language model with a traditional n-gram language model.
    Page 2, “Related Work”
  3. (Son et al., 2012) improves the translation quality of an n-gram translation model by using a bilingual neural language model.
    Page 2, “Related Work”

lexicalized

Appears in 3 sentences as: lexicalized (3)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. For the distortion t_d, we could use a lexicalized distortion model:
    Page 4, “DNN for word alignment”
  2. But we found in our initial experiments on small-scale data that lexicalized distortion does not produce better alignment than the simple jump-distance based model.
    Page 4, “DNN for word alignment”
  3. So we drop the lexicalized
    Page 4, “DNN for word alignment”
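
A minimal numpy sketch of the jump-distance based distortion that is kept in place of the lexicalized model; it is the jump_score assumed by the decoding sketch under "sentence pairs" above. Clipping jumps to [-8, 8] is an assumption chosen only so that the table has the 17 parameters mentioned under "parallel corpus"; the paper's actual bucketing may differ:

    import numpy as np

    MAX_JUMP = 8                     # assumption: clip jumps to [-8, 8], i.e. 17 buckets
    sd = np.zeros(2 * MAX_JUMP + 1)  # one distortion parameter per clipped jump distance

    def jump_score(d):
        """Distortion score depends only on the jump distance d = a_i - a_{i-1}."""
        d = int(np.clip(d, -MAX_JUMP, MAX_JUMP))
        return sd[d + MAX_JUMP]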

proposed model

Appears in 3 sentences as: proposed model (3)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. We train our proposed model from the results of the classic HMM and IBM model 4 separately.
    Page 6, “Experiments and Results”
  2. It can be seen from Table 1 that the proposed model consistently outperforms its corresponding baseline, whether it is trained from the alignment of the classic HMM or IBM model 4.
    Page 6, “Experiments and Results”
  3. The second row and the fourth row show results of the proposed model trained from HMM and IBM4, respectively.
    Page 6, “Experiments and Results”

translation probability

Appears in 3 sentences as: translation probabilities (1) translation probability (2)
In Word Alignment Modeling with Context Dependent Deep Neural Network
  1. where P_lex is the lexical translation probability and P_d is the jump-distance distortion probability.
    Page 3, “DNN for word alignment”
  2. In the classic HMM word alignment model, context is not considered in the lexical translation probability.
    Page 3, “DNN for word alignment”
  3. our model from raw sentence pairs, they are too computationally demanding, as the lexical translation probabilities must be computed from neural networks.
    Page 5, “Training”
