Recurrent Neural Networks for Word Alignment Model
Tamura, Akihiro and Watanabe, Taro and Sumita, Eiichiro

Article Structure

Abstract

This study proposes a word alignment model based on a recurrent neural network (RNN), in which an unlimited alignment history is represented by recurrently connected hidden layers.

Introduction

Automatic word alignment is an important task for statistical machine translation.

Related Work

Various word alignment models have been proposed.

RNN-based Alignment Model

This section proposes an RNN-based alignment model, which computes a score for an alignment sequence a_1^J using an RNN.
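
The scoring equation itself is not reproduced in this summary, but a minimal Python sketch can illustrate how a recurrent hidden state scores an alignment sequence. The lookup matrix L, the jump-distance-dependent weights H_d, R_d, B_H^d, and the output weights O, B_O follow the names used elsewhere on this page; the clipping range d_max and the additive combination of per-position scores are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def htanh(x):
    # hard hyperbolic tangent: clip activations to [-1, 1]
    return np.clip(x, -1.0, 1.0)

def rnn_alignment_score(f_ids, e_ids, alignment, L, H, R, B_H, O, B_O, d_max=8):
    """Score an alignment sequence a_1..a_J with a recurrent hidden state (sketch).

    f_ids, e_ids : source / target word indices (0-based here)
    alignment    : alignment[j] = index of the target word aligned to f_j
    L            : embedding matrix of the lookup layer (emb_dim x vocab)
    H, R, B_H    : dicts keyed by the jump distance d = a_j - a_{j-1}
    O, B_O       : output-layer weight vector and bias
    d_max        : assumed clipping range so every jump maps to an existing matrix
    """
    hidden_dim = next(iter(R.values())).shape[0]
    y_prev = np.zeros(hidden_dim)                   # y_0: initial hidden state
    a_prev, total = 0, 0.0
    for j, a_j in enumerate(alignment):
        d = max(-d_max, min(d_max, a_j - a_prev))   # jump distance selects H_d, R_d, B_H^d
        x_j = np.concatenate([L[:, f_ids[j]], L[:, e_ids[a_j]]])  # lookup layer
        y_j = htanh(H[d] @ x_j + R[d] @ y_prev + B_H[d])          # recurrent hidden layer
        total += float(O @ y_j + B_O)               # accumulate per-position scores (assumption)
        y_prev, a_prev = y_j, a_j
    return total
```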

Training

During training, we optimize the weight matrices of each layer (i.e., L, H_d, R_d, B_H^d, O, and B_O) following a given objective using mini-batch SGD with batch size D, which converges faster than plain SGD (D = 1).
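
As a hedged illustration of the training loop described above, the sketch below applies one mini-batch SGD update per batch of D examples. The gradient function grad_fn is assumed to be given (e.g., by backpropagation through time), and the learning-rate schedule is omitted.

```python
import random

def minibatch_sgd(params, grad_fn, data, batch_size, lr=0.01, epochs=10):
    """Generic mini-batch SGD sketch (batch_size = D; D = 1 reduces to plain SGD).

    params  : dict mapping names such as "L", "H_d", "R_d", "B_H_d", "O", "B_O"
              to numpy arrays
    grad_fn : assumed function(params, batch) -> dict of gradients with the same keys
    data    : list of training examples (e.g., sentence pairs with alignments)
    """
    for _ in range(epochs):
        random.shuffle(data)                      # new mini-batch order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grads = grad_fn(params, batch)
            for name in params:                   # one update per mini-batch
                params[name] -= lr * grads[name]
    return params
```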

Topics

word alignment

Appears in 28 sentences as: Word Alignment (1) Word alignment (4) word alignment (21) word alignments (5)
In Recurrent Neural Networks for Word Alignment Model
  1. This study proposes a word alignment model based on a recurrent neural network (RNN), in which an unlimited alignment history is represented by recurrently connected hidden layers.
    Page 1, “Abstract”
  2. The RNN-based model outperforms the feed-forward neural network-based model (Yang et al., 2013) as well as the IBM Model 4 under Japanese-English and French-English word alignment tasks, and achieves comparable translation performance to those baselines for Japanese-English and Chinese-English translation tasks.
    Page 1, “Abstract”
  3. Automatic word alignment is an important task for statistical machine translation.
    Page 1, “Introduction”
  4. We assume that this property would fit with a word alignment task, and we propose an RNN-based word alignment model.
    Page 1, “Introduction”
  5. (2013) trained their model from word alignments produced by traditional unsupervised probabilistic models.
    Page 1, “Introduction”
  6. This paper presents evaluations of Japanese-English and French-English word alignment tasks and Japanese-to-English and Chinese-to-English translation tasks.
    Page 2, “Introduction”
  7. The results illustrate that our RNN-based model outperforms the FFNN-based model (up to +0.0792 F1-measure) and the IBM Model 4 (up to +0.0703 F1-measure) for the word alignment tasks.
    Page 2, “Introduction”
  8. Various word alignment models have been proposed.
    Page 2, “Related Work”
  9. As an instance of discriminative models, we describe an FFNN-based word alignment model (Yang et al., 2013), which is our baseline.
    Page 2, “Related Work”
  10. GEN is a subset of all possible word alignments (Φ), which is generated by beam search.
    Page 5, “Training”
  11. We evaluated the alignment performance of the proposed models with two tasks: Japanese-English word alignment with the Basic Travel Expression Corpus (BTEC) (Takezawa et al., 2002) and French-English word alignment with the Hansard dataset (Hansards) from the 2003 NAACL shared task (Mihalcea and Pedersen, 2003).
    Page 6, “Training”

alignment model

Appears in 25 sentences as: Alignment Model (3) alignment model (17) alignment models (5)
In Recurrent Neural Networks for Word Alignment Model
  1. This study proposes a word alignment model based on a recurrent neural network (RNN), in which an unlimited alignment history is represented by recurrently connected hidden layers.
    Page 1, “Abstract”
  2. Our alignment model is directional, similar to the generative IBM models (Brown et al., 1993).
    Page 1, “Abstract”
  3. the HMM alignment model and achieved state-of-the-art performance.
    Page 1, “Introduction”
  4. We assume that this property would fit with a word alignment task, and we propose an RNN-based word alignment model.
    Page 1, “Introduction”
  5. The NN-based alignment models are supervised models.
    Page 1, “Introduction”
  6. Our RNN-based alignment model has a direction,
    Page 1, “Introduction”
  7. such as other alignment models, i.e., from f (source language) to e (target language) and from e to f. It has been proven that the limitation may be overcome by encouraging two directional models to agree by training them concurrently (Matusov et al., 2004; Liang et al., 2006; Graca et al., 2008; Ganchev et al., 2008).
    Page 2, “Introduction”
  8. Various word alignment models have been proposed.
    Page 2, “Related Work”
  9. 2.1 Generative Alignment Model
    Page 2, “Related Work”
  10. 2.2 FFNN-based Alignment Model
    Page 2, “Related Work”
  11. As an instance of discriminative models, we describe an FFNN-based word alignment model (Yang et al., 2013), which is our baseline.
    Page 2, “Related Work”

hidden layer

Appears in 25 sentences as: Hidden Layer (1) hidden layer (23) hidden layers (3)
In Recurrent Neural Networks for Word Alignment Model
  1. This study proposes a word alignment model based on a recurrent neural network (RNN), in which an unlimited alignment history is represented by recurrently connected hidden layers.
    Page 1, “Abstract”
  2. An RNN has a hidden layer with recurrent connections that propagates its own previous signals.
    Page 1, “Introduction”
  3. z_1 Hidden Layer: htanh(H × z_0 + B_H) (layer label from Figure 1)
    Page 3, “Related Work”
  4. Figure 1 shows the network structure with one hidden layer for computing a lexical translation probability t_lex(f_j, e_aj).
    Page 3, “Related Work”
  5. The model consists of a lookup layer, a hidden layer, and an output layer, which have weight matrices.
    Page 3, “Related Work”
  6. The concatenation (z_0) is then fed to the hidden layer to capture nonlinear relations.
    Page 3, “Related Work”
  7. Finally, the output layer receives the output of the hidden layer (z_1) and computes a lexical translation score.
    Page 3, “Related Work”
  8. The model consists of a lookup layer, a hidden layer, and an output layer, which have weight matrices.
    Page 3, “RNN-based Alignment Model”
  9. Consecutive l hidden layers can be used: z_l = f(H_l × z_{l-1} + B_{H_l}).
    Page 3, “RNN-based Alignment Model”
  10. For simplicity, this paper describes the model with 1 hidden layer.
    Page 3, “RNN-based Alignment Model”
  11. Each matrix in the hidden layer (H_d, R_d, and B_H^d) depends on alignment, where d denotes the jump distance from a_{j-1} to a_j: d = a_j - a_{j-1}.
    Page 4, “RNN-based Alignment Model”
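
The excerpts above describe the layer structure (lookup, hidden layer with htanh, output). A minimal sketch of that forward pass, assuming a shared embedding matrix L and only simple concatenation of the context words, might look as follows; the variable names mirror the z_0/z_1 notation above and are not taken from the paper's code.

```python
import numpy as np

def htanh(x):
    # hard tanh, as in the hidden-layer label above
    return np.clip(x, -1.0, 1.0)

def ffnn_lexical_score(src_window_ids, tgt_window_ids, L, H, B_H, O, B_O):
    """Forward pass for a lexical translation score t_lex(f_j, e_aj) (sketch).

    src_window_ids, tgt_window_ids : indices of f_j / e_aj and their context words
    L : embedding matrix of the lookup layer (emb_dim x vocab)
    H, B_H : hidden-layer weights and bias; O, B_O : output-layer weights and bias
    """
    # Lookup layer: embed each input word and concatenate the embeddings -> z_0
    z_0 = np.concatenate([L[:, i] for i in list(src_window_ids) + list(tgt_window_ids)])
    # Hidden layer: z_1 = htanh(H x z_0 + B_H) captures nonlinear relations
    z_1 = htanh(H @ z_0 + B_H)
    # Output layer: a single lexical translation score
    return float(O @ z_1 + B_O)
```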

word embeddings

Appears in 13 sentences as: word embedding (4) Word embeddings (1) word embeddings (7) word embeddings: (1)
In Recurrent Neural Networks for Word Alignment Model
  1. To overcome this limitation, we encourage agreement between the two directional models by introducing a penalty function that ensures word embedding consistency across two directional models during training.
    Page 1, “Abstract”
  2. Specifically, our training encourages word embeddings to be consistent across alignment directions by introducing a penalty term that expresses the difference between embedding of words into an objective function.
    Page 2, “Introduction”
  3. First, the lookup layer converts each input word into its word embedding by looking up its corresponding column in the embedding matrix (L), and then concatenates them.
    Page 3, “Related Work”
  4. Word embeddings are dense, low dimensional, and real-valued vectors that can capture syntactic and semantic properties of the words (Bengio et al., 2003).
    Page 3, “Related Work”
  5. In the lookup layer, each of these words is converted to its word embedding, and then the concatenation of the two embeddings is fed to the hidden layer in the same manner as the FFNN-based model.
    Page 4, “RNN-based Alignment Model”
  6. The constraint concretely enforces agreement in word embeddings of both directions.
    Page 5, “Training”
  7. The proposed method trains two directional models concurrently based on the following objective by incorporating a penalty term that expresses the difference between word embeddings:
    Page 5, “Training”
  8. where θ_FE (or θ_EF) denotes the weights of layers in a source-to-target (or target-to-source) alignment model, θ_L denotes weights of a lookup layer, i.e., word embeddings, and α is a parameter that controls the strength of the agreement constraint.
    Page 5, “Training”
  9. Note that θ_FE and θ_EF are concurrently updated in each iteration, and θ_EF (or θ_FE) is employed to enforce agreement between word embeddings when updating θ_FE (or θ_EF).
    Page 6, “Training”
  10. For the FFNN-based model, we set the word embedding length M to 30, the number of units of a hidden layer |zl| to 100, and the window size of contexts to 5.
    Page 6, “Training”
  11. For the weights of a lookup layer L, we preliminarily trained word embeddings for the source and target language from each side of the training data.
    Page 6, “Training”
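
A rough sketch of the agreement penalty described in these excerpts is given below, assuming the difference between the two directional models' embedding matrices is measured with a squared Frobenius norm (the exact norm used in the paper may differ).

```python
import numpy as np

def agreement_penalty(L_fe, L_ef, alpha):
    """Penalty on the difference between the two directional models' embeddings.

    L_fe, L_ef : embedding (lookup) matrices of the source-to-target and
                 target-to-source models (theta_L in the excerpt above)
    alpha      : strength of the agreement constraint
    The squared Frobenius norm below is an assumption made for this sketch.
    """
    return alpha * float(np.sum((L_fe - L_ef) ** 2))

def joint_objective(loss_fe, loss_ef, L_fe, L_ef, alpha):
    # Both directional training losses plus the embedding-agreement penalty
    # (sign convention assumes losses are minimized)
    return loss_fe + loss_ef + agreement_penalty(L_fe, L_ef, alpha)
```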

embeddings

Appears in 10 sentences as: embeddings (9) embeddings: (1)
In Recurrent Neural Networks for Word Alignment Model
  1. Specifically, our training encourages word embeddings to be consistent across alignment directions by introducing a penalty term that expresses the difference between embedding of words into an objective function.
    Page 2, “Introduction”
  2. Word embeddings are dense, low dimensional, and real-valued vectors that can capture syntactic and semantic properties of the words (Bengio et al., 2003).
    Page 3, “Related Work”
  3. In the lookup layer, each of these words is converted to its word embedding, and then the concatenation of the two embeddings is fed to the hidden layer in the same manner as the FFNN-based model.
    Page 4, “RNN-based Alignment Model”
  4. The constraint concretely enforces agreement in word embeddings of both directions.
    Page 5, “Training”
  5. The proposed method trains two directional models concurrently based on the following objective by incorporating a penalty term that expresses the difference between word embeddings:
    Page 5, “Training”
  6. where θ_FE (or θ_EF) denotes the weights of layers in a source-to-target (or target-to-source) alignment model, θ_L denotes weights of a lookup layer, i.e., word embeddings, and α is a parameter that controls the strength of the agreement constraint.
    Page 5, “Training”
  7. Note that θ_FE and θ_EF are concurrently updated in each iteration, and θ_EF (or θ_FE) is employed to enforce agreement between word embeddings when updating θ_FE (or θ_EF).
    Page 6, “Training”
  8. For the weights of a lookup layer L, we preliminarily trained word embeddings for the source and target language from each side of the training data.
    Page 6, “Training”
  9. We then set the word embeddings to L to avoid falling into local minima.
    Page 6, “Training”
  10. Furthermore, we proposed an unsupervised method for training our model using NCE and introduced an agreement constraint that encourages word embeddings to be consistent across alignment directions.
    Page 9, “Training”

gold standard

Appears in 6 sentences as: gold standard (6)
In Recurrent Neural Networks for Word Alignment Model
  1. To solve this problem, we apply noise-contrastive estimation (NCE) (Gutmann and Hyvarinen, 2010; Mnih and Teh, 2012) for unsupervised training of our RNN-based model without gold standard alignments or pseudo-oracle alignments.
    Page 1, “Introduction”
  2. where θ denotes the weights of layers in the model, T is a set of training data, a+ is the gold standard alignment, a- is the incorrect alignment with the highest score under θ, and s_θ denotes the score defined by Eq.
    Page 3, “Related Work”
  3. However, this approach requires gold standard alignments.
    Page 4, “Training”
  4. Hereafter, MODEL(R) and MODEL(I) denote the MODEL trained from gold standard alignments and word alignments found by the IBM Model 4, respectively.
    Page 7, “Training”
  5. Figure 3 shows word alignment examples from the FFNN-based and RNN-based models, where solid squares indicate the gold standard alignments.
    Page 8, “Training”
  6. Note that RNN_s+c(R) cannot be trained from the 40K data because the 40K data does not have gold standard alignments.
    Page 8, “Training”
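
Excerpt 2 above defines supervised training in terms of a gold alignment a+ and the highest-scoring incorrect alignment a-. One common margin-based instantiation of such an objective is sketched below; the margin value and the exact functional form are assumptions, not the paper's verbatim equation.

```python
def margin_loss(score_fn, a_plus, a_minus, margin=1.0):
    """Margin-based ranking loss over a gold alignment a+ and the highest-scoring
    incorrect alignment a- (one common instantiation; the paper's exact form may differ).

    score_fn : assumed function(alignment) -> model score s_theta(alignment)
    """
    return max(0.0, margin - score_fn(a_plus) + score_fn(a_minus))
```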

Viterbi

Appears in 6 sentences as: Viterbi (9)
In Recurrent Neural Networks for Word Alignment Model
  1. Given a specific model, the best alignment (Viterbi alignment) of the sentence pair (f_1^J, e_1^I) can be found as
    Page 2, “Related Work”
  2. For example, the HMM model identifies the Viterbi alignment using the Viterbi algorithm.
    Page 2, “Related Work”
  3. The model finds the Viterbi alignment using the Viterbi algorithm, similar to the classic HMM model.
    Page 3, “Related Work”
  4. The Viterbi alignment is determined using the Viterbi algorithm, similar to the FFNN-based model, where the model is sequentially applied from f_1 to f_J.
    Page 4, “RNN-based Alignment Model”
  5. Strictly speaking, we cannot apply the dynamic programming forward-backward algorithm (i.e., the Viterbi algorithm) due to the long alignment history of y_j.
    Page 4, “RNN-based Alignment Model”
  6. Thus, the Viterbi alignment is computed approximately using heuristic beam search.
    Page 4, “RNN-based Alignment Model”
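
Because the recurrent hidden state depends on the entire alignment history, the excerpts above note that the Viterbi alignment is only approximated with heuristic beam search. A generic beam-search sketch under that setting is shown below; the incremental scoring function score_step is assumed to wrap the RNN state.

```python
import heapq

def beam_viterbi(J, I, score_step, W):
    """Approximate the Viterbi alignment by heuristic beam search (illustrative sketch).

    J, I       : source and target sentence lengths
    score_step : assumed function(prefix_alignment, j, a_j) -> incremental score;
                 it is expected to encapsulate the RNN state, so the whole history matters
    W          : beam width
    """
    beam = [(0.0, [])]                       # (score, partial alignment a_1..a_j)
    for j in range(J):
        candidates = []
        for score, prefix in beam:
            for a_j in range(I):             # extend each hypothesis with every target position
                candidates.append((score + score_step(prefix, j, a_j), prefix + [a_j]))
        beam = heapq.nlargest(W, candidates, key=lambda c: c[0])   # keep the top-W hypotheses
    return max(beam, key=lambda c: c[0])[1]  # best complete alignment found
```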

proposed model

Appears in 5 sentences as: proposed model (3) proposed models (3)
In Recurrent Neural Networks for Word Alignment Model
  1. Under the recurrence, the proposed model compactly encodes the entire history of previous alignments in the hidden layer configuration y_j.
    Page 4, “RNN-based Alignment Model”
  2. Therefore, the proposed model can find alignments by taking advantage of the long alignment history, while the FFNN-based model considers only the last alignment.
    Page 4, “RNN-based Alignment Model”
  3. We evaluated the alignment performance of the proposed models with two tasks: Japanese-English word alignment with the Basic Travel Expression Corpus (BTEC) (Takezawa et al., 2002) and French-English word alignment with the Hansard dataset (Hansards) from the 2003 NAACL shared task (Mihalcea and Pedersen, 2003).
    Page 6, “Training”
  4. In addition, Table 3 shows that these proposed models are comparable to IBM4_all in NTCIR and FBIS even though the proposed models are trained from only a small part of the training data.
    Page 8, “Training”
  5. Our experiments have shown that the proposed model outperforms the FFNN-based model (Yang et al., 2013) for word alignment and machine translation, and that the agreement constraint improves alignment performance.
    Page 9, “Training”

randomly sampled

Appears in 5 sentences as: random sampling (1) randomly sampled (3) randomly samples (1)
In Recurrent Neural Networks for Word Alignment Model
  1. To reduce computation, we employ NCE, which uses randomly sampled sentences from all target language sentences in Q as e-, and calculate the expected values by a beam search with beam width W to truncate alignments with low scores.
    Page 5, “Training”
  2. where e+ is a target language sentence aligned to f+ in the training data, i.e., (f+, e+) ∈ T, e- is a randomly sampled pseudo-target language sentence with length |e+|, and N denotes the number of pseudo-target language sentences per source sentence f+.
    Page 5, “Training”
  3. In a simple implementation, each e- is generated by repeating a random sampling from a set of target words (V_e) |e+| times and lining them up sequentially.
    Page 5, “Training”
  4. In Algorithm 1, line 2 randomly samples D bilingual sentences (f+, e+) from training data T. Lines 3-1 and 3-2 generate N pseudo-negative samples for each f+ and e+ based on the translation candidates of f+ and e+ found by the IBM Model 1 with l0 prior,
    Page 5, “Training”
  5. Table 4 shows the alignment performance on BTEC with various training data sizes, i.e., training data for IWSLT (40K), training data for BTEC (9K), and the randomly sampled 1K data from the BTEC training data.
    Page 8, “Training”
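
The "simple implementation" in excerpt 3 can be sketched directly: each pseudo-target sentence e- is built by sampling |e+| words uniformly from the target vocabulary. The IBM Model 1 based candidate sampling mentioned in excerpt 4 is a refinement that is not shown here.

```python
import random

def sample_pseudo_targets(e_plus, target_vocab, N):
    """Generate N pseudo-target sentences e- for NCE, as in the simple implementation above.

    Each e- has length |e+| and is built by sampling words uniformly from the
    target vocabulary V_e; the IBM Model 1 based sampling in Algorithm 1 is a
    refinement not shown here.
    """
    return [[random.choice(target_vocab) for _ in e_plus] for _ in range(N)]

# Hypothetical usage (names are illustrative only):
# negatives = sample_pseudo_targets(e_plus=["the", "hotel", "is", "near"],
#                                   target_vocab=vocab_en, N=5)
```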

translation tasks

Appears in 5 sentences as: translation task (3) translation tasks (4)
In Recurrent Neural Networks for Word Alignment Model
  1. The RNN-based model outperforms the feed-forward neural network-based model (Yang et al., 2013) as well as the IBM Model 4 under Japanese-English and French-English word alignment tasks, and achieves comparable translation performance to those baselines for Japanese-English and Chinese-English translation tasks.
    Page 1, “Abstract”
  2. This paper presents evaluations of Japanese-English and French-English word alignment tasks and Japanese-to-English and Chinese-to-English translation tasks.
    Page 2, “Introduction”
  3. For the translation tasks , our model achieves up to 0.74% gain in BLEU as compared to the FFNN-based model, which matches the translation qualities of the IBM Model 4.
    Page 2, “Introduction”
  4. In addition, we evaluated the end-to-end translation performance of three tasks: a Chinese-to-English translation task with the FBIS corpus (FBIS), the IWSLT 2007 Japanese-to-English translation task (IWSLT) (Fordyce, 2007), and the NTCIR-9 Japanese-to-English patent translation task (NTCIR) (Goto et al., 2011).
    Page 6, “Training”
  5. In the translation tasks, we used the Moses phrase-based SMT systems (Koehn et al., 2007).
    Page 7, “Training”

machine translation

Appears in 4 sentences as: Machine Translation (1) machine translation (3)
In Recurrent Neural Networks for Word Alignment Model
  1. Automatic word alignment is an important task for statistical machine translation.
    Page 1, “Introduction”
  2. Recently, FFNNs have been applied successfully to several tasks, such as speech recognition (Dahl et al., 2012), statistical machine translation (Le et al., 2012; Vaswani et al., 2013), and other popular natural language processing tasks (Collobert and Weston, 2008; Collobert et al., 2011).
    Page 2, “Related Work”
  3. 5.4 Machine Translation Results
    Page 7, “Training”
  4. Our experiments have shown that the proposed model outperforms the FFNN-based model (Yang et al., 2013) for word alignment and machine translation, and that the agreement constraint improves alignment performance.
    Page 9, “Training”

beam search

Appears in 3 sentences as: beam search (3)
In Recurrent Neural Networks for Word Alignment Model
  1. Thus, the Viterbi alignment is computed approximately using heuristic beam search.
    Page 4, “RNN-based Alignment Model”
  2. To reduce computation, we employ NCE, which uses randomly sampled sentences from all target language sentences in Q as e-, and calculate the expected values by a beam search with beam width W to truncate alignments with low scores.
    Page 5, “Training”
  3. GEN is a subset of all possible word alignments (Φ), which is generated by beam search.
    Page 5, “Training”

neural network

Appears in 3 sentences as: Neural Network (1) neural network (3)
In Recurrent Neural Networks for Word Alignment Model
  1. This study proposes a word alignment model based on a recurrent neural network (RNN), in which an unlimited alignment history is represented by recurrently connected hidden layers.
    Page 1, “Abstract”
  2. (2013) adapted the Context-Dependent Deep Neural Network for HMM (CD-DNN-HMM) (Dahl et al., 2012), a type of feed-forward neural network (FFNN)-based model, to
    Page 1, “Introduction”
  3. Recurrent neural network (RNN)-based models have recently demonstrated state-of-the-art performance that outperformed FFNN-based models for various tasks (Mikolov et al., 2010; Mikolov and Zweig, 2012; Auli et al., 2013; Kalchbrenner and Blunsom, 2013; Sundermeyer et al., 2013).
    Page 1, “Introduction”

overfitting

Appears in 3 sentences as: overfitting (3)
In Recurrent Neural Networks for Word Alignment Model
  1. This constraint prevents each model from overfitting to a particular direction and leads to global optimization across alignment directions.
    Page 2, “Introduction”
  2. In addition, an l2 regularization term is added to the objective to prevent the model from overfitting the training data.
    Page 4, “Training”
  3. The proposed constraint penalizes overfitting to a particular direction and enables two directional models to optimize across alignment directions globally.
    Page 5, “Training”

SMT system

Appears in 3 sentences as: SMT system (2) SMT systems (1)
In Recurrent Neural Networks for Word Alignment Model
  1. In the translation tasks, we used the Moses phrase-based SMT systems (Koehn et al., 2007).
    Page 7, “Training”
  2. In addition, for a detailed comparison, we evaluated the SMT system where the IBM Model 4 was trained from all the training data (IBM4_all).
    Page 8, “Training”
  3. Consequently, the SMT system using RNN_u+c trained from a small part of training data can achieve comparable performance to that using IBM4 trained from all training data, which is shown in Table 3.
    Page 9, “Training”
