Additive Neural Networks for Statistical Machine Translation
Liu, Lemao and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun

Article Structure

Abstract

Most statistical machine translation (SMT) systems are modeled using a log-linear framework.

Introduction

Recently, great progress has been achieved in SMT, especially since Och and Ney (2002) proposed the log-linear model: almost all the state-of-the-art SMT systems are based on the log-linear model.

Topics

log-linear

Appears in 34 sentences as: Log-linear (1) log-linear (36)
In Additive Neural Networks for Statistical Machine Translation
  1. Most statistical machine translation (SMT) systems are modeled using a log-linear framework.
    Page 1, “Abstract”
  2. Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential.
    Page 1, “Abstract”
  3. additive neural networks, for SMT to go beyond the log-linear translation model.
    Page 1, “Abstract”
  4. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.
    Page 1, “Abstract”
  5. Recently, great progress has been achieved in SMT, especially since Och and Ney (2002) proposed the log-linear model: almost all the state-of-the-art SMT systems are based on the log-linear model.
    Page 1, “Introduction”
  6. Regardless of how successful the log-linear model is in SMT, it still has some shortcomings.
    Page 1, “Introduction”
  7. Compared with the log-linear model, it has more powerful expressive abilities and can deeply interpret and represent features with hidden units in neural networks.
    Page 2, “Introduction”
  8. Moreover, our method is simple to implement and its decoding efficiency is comparable to that of the log-linear model.
    Page 2, “Introduction”
  9. The biggest contribution of this paper is that it goes beyond the log-linear model and proposes a nonlinear translation model instead of a re-ranking model (Duh and Kirchhoff, 2008; Sokolov et al., 2012).
    Page 2, “Introduction”
  10. On both Chinese-to-English and Japanese-to-English translation tasks, experimental results show that our model can alleviate the shortcomings of the log-linear model, and thus achieves significant improvements over the log-linear translation baseline.
    Page 2, “Introduction”
  11. Different from Brown’s generative model (Brown et al., 1993), the log-linear model does not assume that strong independence holds, and allows arbitrary features to be integrated into the model easily.
    Page 2, “Introduction”
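
The excerpts above all lean on the same scoring rule from Och and Ney (2002): a weighted linear combination of feature functions, exponentiated and normalized. The sketch below (Python, with purely hypothetical weights and feature values rather than the paper's actual feature set) illustrates that computation:

    import math

    def linear_score(weights, features):
        # Model score: sum_k w_k * h_k(f, e, d), where the h_k are feature
        # values computed for source sentence f, candidate e, derivation d.
        return sum(w * h for w, h in zip(weights, features))

    def log_linear_probs(weights, candidate_features):
        # Log-linear model: P(e | f) is proportional to exp(W . h(f, e, d)).
        scores = [linear_score(weights, h) for h in candidate_features]
        m = max(scores)                          # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        return [x / z for x in exps]

    # Hypothetical example: three candidate translations, two features
    # (say, a translation log-probability and a language-model log-probability).
    W = [0.6, 0.4]
    H = [[-1.2, -3.0], [-0.8, -3.5], [-2.0, -2.5]]
    print(log_linear_probs(W, H))

The first limitation quoted from the abstract is visible here: every feature can influence the score only through the fixed linear combination W · h.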

neural network

Appears in 34 sentences as: neural network (26) neural network: (1) Neural Networks (2) Neural networks (1) neural networks (11) neural networks” (1)
In Additive Neural Networks for Statistical Machine Translation
  1. A neural network is a reasonable method to address these pitfalls.
    Page 1, “Abstract”
  2. However, modeling SMT with a neural network is not trivial, especially when taking the decoding efficiency into consideration.
    Page 1, “Abstract”
  3. In this paper, we propose a variant of a neural network, i.e.
    Page 1, “Abstract”
  4. additive neural networks, for SMT to go beyond the log-linear translation model.
    Page 1, “Abstract”
  5. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector.
    Page 1, “Abstract”
  6. A neural network (Bishop, 1995) is a reasonable method to overcome the above shortcomings.
    Page 1, “Introduction”
  7. In the search procedure, frequent computation of the model score is needed for the search heuristic function, which challenges the decoding efficiency of a neural network based translation model.
    Page 1, “Introduction”
  8. In this paper, we propose a variant of neural networks, i.e.
    Page 1, “Introduction”
  9. additive neural networks (see Section 3 for details), for SMT.
    Page 1, “Introduction”
  10. Compared with the log-linear model, it has more powerful expressive abilities and can deeply interpret and represent features with hidden units in neural networks.
    Page 2, “Introduction”
  11. 3 Additive Neural Networks
    Page 3, “Introduction”
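
Reading the excerpts together, the additive neural network keeps the usual linear score over global features and adds, for every rule in the derivation, the output of a small single-layer network over that rule's local features. The following is a minimal sketch of that additive form, assuming a decomposition of the shape S(f, e, d) = W·h(f, e, d) + Σ_{r in d} W'·σ(M·h'(r) + B); the dimensions and values are illustrative, not the paper's.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def adnn_score(h, rule_features, W, W_prime, M, B):
        # Linear part over the global feature vector h (which may include
        # nonlocal features such as the language model), as in a log-linear model.
        linear_part = float(W.dot(h))
        # Additive nonlinear part: a single-layer network scored per rule,
        # using only local (state-independent) features h'(r); because it is
        # a sum over rules, partial scores can be accumulated during decoding.
        nonlinear_part = sum(float(W_prime.dot(sigmoid(M.dot(h_r) + B)))
                             for h_r in rule_features)
        return linear_part + nonlinear_part

    # Illustrative sizes: K = 8 global features, K' = 20 local features per
    # rule, U = 5 hidden units, and a derivation with two rules.
    rng = np.random.default_rng(0)
    K, K_prime, U = 8, 20, 5
    h = rng.normal(size=K)
    rules = [rng.normal(size=K_prime) for _ in range(2)]
    W, W_prime = rng.normal(size=K), rng.normal(size=U)
    M, B = rng.normal(size=(U, K_prime)), rng.normal(size=U)
    print(adnn_score(h, rules, W, W_prime, M, B))

Because the nonlinear term is a per-rule sum, its contribution to a partial hypothesis can be accumulated during decoding, which is how the excerpts reconcile the neural component with decoding efficiency.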

log-linear model

Appears in 27 sentences as: log-linear model (29)
In Additive Neural Networks for Statistical Machine Translation
  1. Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential.
    Page 1, “Abstract”
  2. Recently, great progress has been achieved in SMT, especially since Och and Ney (2002) proposed the log-linear model: almost all the state-of-the-art SMT systems are based on the log-linear model.
    Page 1, “Introduction”
  3. Regardless of how successful the log-linear model is in SMT, it still has some shortcomings.
    Page 1, “Introduction”
  4. Compared with the log-linear model, it has more powerful expressive abilities and can deeply interpret and represent features with hidden units in neural networks.
    Page 2, “Introduction”
  5. Moreover, our method is simple to implement and its decoding efficiency is comparable to that of the log-linear model.
    Page 2, “Introduction”
  6. The biggest contribution of this paper is that it goes beyond the log-linear model and proposes a nonlinear translation model instead of a re-ranking model (Duh and Kirchhoff, 2008; Sokolov et al., 2012).
    Page 2, “Introduction”
  7. On both Chinese-to-English and Japanese-to-English translation tasks, experimental results show that our model can alleviate the shortcomings of the log-linear model, and thus achieves significant improvements over the log-linear translation baseline.
    Page 2, “Introduction”
  8. Different from Brown’s generative model (Brown et al., 1993), the log-linear model does not assume that strong independence holds, and allows arbitrary features to be integrated into the model easily.
    Page 2, “Introduction”
  9. In the log-linear model, if h_i(f, e, d) is a local feature, the calculation of its score w_i · h_i(f, e, d) has a substructure, and thus it can be calculated with dynamic programming, which accelerates its decoding.
    Page 3, “Introduction”
  10. Although the log-linear model has achieved great progress for SMT, it still suffers from some pitfalls: it requires features to be linear with respect to the model, and it cannot interpret and represent features deeply.
    Page 3, “Introduction”
  11. There are also strong relationships between AdNN and the log-linear model.
    Page 4, “Introduction”

language model

Appears in 17 sentences as: language model (17) language models (1)
In Additive Neural Networks for Statistical Machine Translation
  1. Further, decoding with nonlocal (or state-dependent) features, such as a language model, is also a problem.
    Page 1, “Introduction”
  2. Actually, even for the (log-)linear model, efficient decoding with the language model is not trivial (Chiang, 2007).
    Page 1, “Introduction”
  3. For the nonlocal features such as the language model, Chiang (2007) proposed a cube-pruning method for efficient decoding.
    Page 3, “Introduction”
  4. The main reason why cube-pruning works is that the translation model is linear and the model score for the language model is approximately monotonic (Chiang, 2007).
    Page 3, “Introduction”
  5. Actually, existing works empirically show that some nonlocal features, especially the language model, contribute greatly to machine translation.
    Page 3, “Introduction”
  6. Scoring for nonlocal features such as an n-gram language model is not easily done.
    Page 3, “Introduction”
  7. In the log-linear translation model, Chiang (2007) proposed a cube-pruning method for scoring the language model.
    Page 3, “Introduction”
  8. The premise of cube-pruning is that the language model score is approximately monotonic (Chiang, 2007).
    Page 3, “Introduction”
  9. However, if the language model is scored with a neural network, this premise is difficult to maintain.
    Page 3, “Introduction”
  10. Therefore, one of the solutions is to preserve a linear model for scoring the language model directly.
    Page 3, “Introduction”
  11. Eq. (5) includes 8 default features, which consist of translation probabilities, lexical translation probabilities, word penalty, glue rule penalty, synchronous rule penalty and the language model.
    Page 4, “Introduction”

translation model

Appears in 15 sentences as: translation model (13) translation models (2)
In Additive Neural Networks for Statistical Machine Translation
  1. additive neural networks, for SMT to go beyond the log-linear translation model.
    Page 1, “Abstract”
  2. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.
    Page 1, “Abstract”
  3. On the one hand, features are required to be linear with respect to the objective of the translation model (Nguyen et al., 2007), but it is not guaranteed that the potential features are linear with respect to the model.
    Page 1, “Introduction”
  4. In the search procedure, frequent computation of the model score is needed for the search heuristic function, which challenges the decoding efficiency of a neural network based translation model.
    Page 1, “Introduction”
  5. The biggest contribution of this paper is that it goes beyond the log-linear model and proposes a nonlinear translation model instead of a re-ranking model (Duh and Kirchhoff, 2008; Sokolov et al., 2012).
    Page 2, “Introduction”
  6. The main reason why cube-pruning works is that the translation model is linear and the model score for the language model is approximately monotonic (Chiang, 2007).
    Page 3, “Introduction”
  7. Firstly, let us consider a simple case in neural network based translation where all the features in the translation model are independent of the translation state, i.e.
    Page 3, “Introduction”
  8. In this way, we can easily define the following translation model with a single-layer neural network:
    Page 3, “Introduction”
  9. In order to keep the substructure property, S(f, e2, d2; W, M, B) should be represented as F(S(f, e1, d1; W, M, B), S(h(r2); M, B)) by a function F. For simplicity, we suppose that the additive property holds in F, and then we can obtain a new translation model via the following recursive equation:
    Page 3, “Introduction”
  10. In the log-linear translation model, Chiang (2007) proposed a cube-pruning method for scoring the language model.
    Page 3, “Introduction”
  11. Formally, the AdNN based translation model is discriminative but non-probabilistic, and it can be defined as follows:
    Page 3, “Introduction”

word embedding

Appears in 15 sentences as: Word Embedding (1) Word embedding (2) word embedding (13) word embeddings (1)
In Additive Neural Networks for Statistical Machine Translation
  1. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector.
    Page 1, “Abstract”
  2. We also integrate word embedding into the model by representing each word as a feature vector (Collobert and Weston, 2008).
    Page 2, “Introduction”
  3. For the local feature vector h’ in Eq. (5), we employ word embedding features as described in the following subsection.
    Page 4, “Introduction”
  4. 3.3 Word Embedding features for AdNN
    Page 4, “Introduction”
  5. Word embedding can relax the sparsity introduced by lexicalization in NLP, and it improves systems for many tasks such as language modeling, named entity recognition, and parsing (Collobert and Weston, 2008; Turian et al., 2010; Collobert, 2011).
    Page 4, “Introduction”
  6. Here, we propose embedding features for rules in SMT by combining word embeddings.
    Page 4, “Introduction”
  7. Let Vs be the vocabulary in the source language with size |Vs|; R^(n×|Vs|) be the word embedding matrix, each column of which is the word embedding (an n-dimensional vector) for the corresponding word in Vs; and maxSource be the maximal length of α for all rules.
    Page 4, “Introduction”
  8. We define the embedding of α as the concatenation of the word embedding of each word in α.
    Page 4, “Introduction”
  9. In particular, for the nonterminal in α, we define its word embedding as the vector whose components are 0.1; and we define the word embedding of “NULL” as 0.
    Page 4, “Introduction”
  10. In this paper, we apply the word embedding matrices from the RNNLM toolkit (Mikolov et al., 2010) with the default settings: we train two RNN language models on the source and target sides of the training corpus, respectively, and then we obtain two matrices as their by-products.
    Page 4, “Introduction”
  11. It would be potentially better to train the word embedding matrix from a much larger corpus as in (Collobert and Weston, 2008), and we will leave this as a future task.
    Page 4, “Introduction”
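
The excerpts above describe how a rule's source side α is embedded: concatenate the word embeddings of its symbols, use a constant 0.1 vector for a nonterminal, a zero vector for “NULL”, and pad up to maxSource. A small sketch of that construction follows; the vocabulary, embedding matrix, and dimensions are hypothetical stand-ins for the RNNLM-derived matrices mentioned above.

    import numpy as np

    def rule_source_embedding(alpha, embed_matrix, vocab, n_dim, max_source):
        # alpha: the source-side symbols of a rule, e.g. ["wo", "de", "X"].
        # embed_matrix: n_dim x |Vs| matrix, one column per source word.
        parts = []
        for symbol in alpha[:max_source]:
            if symbol == "X":                  # nonterminal: all components 0.1
                parts.append(np.full(n_dim, 0.1))
            elif symbol == "NULL":             # NULL symbol: zero vector
                parts.append(np.zeros(n_dim))
            else:                              # ordinary word: its embedding column
                parts.append(embed_matrix[:, vocab[symbol]])
        while len(parts) < max_source:         # pad short rules with zero vectors
            parts.append(np.zeros(n_dim))
        return np.concatenate(parts)           # length n_dim * max_source

    # Hypothetical 3-word source vocabulary with 4-dimensional embeddings.
    vocab = {"wo": 0, "de": 1, "shu": 2}
    E = np.random.default_rng(0).normal(size=(4, len(vocab)))
    print(rule_source_embedding(["wo", "de", "X"], E, vocab, n_dim=4, max_source=5).shape)

In the paper the embedding matrices themselves come from RNN language models trained on the two sides of the corpus, as excerpt 10 notes; the random matrix E above is only for illustration.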

feature vector

Appears in 9 sentences as: feature vector (8) feature vectors (1)
In Additive Neural Networks for Statistical Machine Translation
  1. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector.
    Page 1, “Abstract”
  2. We also integrate word embedding into the model by representing each word as a feature vector (Collobert and Weston, 2008).
    Page 2, “Introduction”
  3. h = (h_1(f, e, d), ..., h_K(f, e, d))^T is a K-dimensional feature vector defined on the tuple (f, e, d); W = (w_1, w_2, ..., w_K)^T is a K-dimensional weight vector of h, i.e., the parameters of the model, and it can be tuned by the toolkit MERT (Och, 2003).
    Page 2, “Introduction”
  4. Eq. (3) as a function of a feature vector h, i.e.
    Page 3, “Introduction”
  5. where h and h’ are feature vectors with dimensions K and K’ respectively, and each component of h’ is a local feature which can be defined on a rule r: X → (α, γ); θ = (W, W’, M, B) denotes the model parameters, with M ∈ R^(U×K’).
    Page 4, “Introduction”
  6. If we consider the parameters (M, B) as constant and σ(M · h’(r) + B) as a new feature vector, then AdNN is reduced to a log-linear model.
    Page 4, “Introduction”
  7. Similar to Hiero (Chiang, 2005), the feature vector h in Eq.
    Page 4, “Introduction”
  8. For the local feature vector h’ in Eq. (5), we employ word embedding features as described in the following subsection.
    Page 4, “Introduction”
  9. Given the model parameter θ = (W, W’, M, B), if we consider (M, B) as constant and σ(M · h’(r) + B) as an additional feature vector besides h, then Eq.
    Page 5, “Introduction”

translation tasks

Appears in 9 sentences as: translation task (2) translation tasks (8)
In Additive Neural Networks for Statistical Machine Translation
  1. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.
    Page 1, “Abstract”
  2. On both Chinese-to-English and Japanese-to-English translation tasks, experimental results show that our model can alleviate the shortcomings of the log-linear model, and thus achieves significant improvements over the log-linear translation baseline.
    Page 2, “Introduction”
  3. We conduct our experiments on the Chinese-to-English and Japanese-to-English translation tasks.
    Page 6, “Introduction”
  4. Although there are serious overlaps between h and h’ for AdNN-Hiero-D, which may limit its generalization abilities, as shown in Table 3, it is still comparable to L-Hiero on the Japanese-to-English task, and significantly outperforms L-Hiero on the Chinese-to-English translation task.
    Page 8, “Introduction”
  5. To investigate the reason why the gains for AdNN-Hiero-D on the two different translation tasks differ, we calculate the perplexities between the target side of the training data and the test datasets on both translation tasks.
    Page 8, “Introduction”
  6. Based on these similarity statistics, we conjecture that the log-linear model does not fit well for difficult translation tasks (e.g.
    Page 8, “Introduction”
  7. translation task on the news domain).
    Page 8, “Introduction”
  8. For Chinese-to-English and Japanese-to-English translation tasks, our model significantly outperforms the log-linear model, with the help of word embedding.
    Page 9, “Introduction”
  9. For example, we will train word embedding matrices for the source and target languages from a larger corpus, and take into consideration bilingual information, for instance, word alignment; the multilayer neural network within the additive neural networks will also be investigated in addition to the single-layer neural network; and we will test our method on other translation tasks with larger training data as well.
    Page 9, “Introduction”

machine translation

Appears in 7 sentences as: machine translation (7)
In Additive Neural Networks for Statistical Machine Translation
  1. Most statistical machine translation (SMT) systems are modeled using a log-linear framework.
    Page 1, “Abstract”
  2. However, neural network based machine translation is far from easy.
    Page 3, “Introduction”
  3. Actually, existing works empirically show that some nonlocal features, especially the language model, contribute greatly to machine translation.
    Page 3, “Introduction”
  4. According to the above analysis, we propose a variant of a neural network model for machine translation, and we call it Additive Neural Networks or AdNN for short.
    Page 3, “Introduction”
  5. Unlike additive models and generalized additive neural networks, our model is decomposable with respect to translation rules rather than its component variables, in consideration of the decoding efficiency of machine translation; and it allows its additive terms of neural networks to share the same parameters for a compact structure that avoids sparsity.
    Page 8, “Introduction”
  6. The idea of using neural networks in machine translation has already been pioneered in previous works.
    Page 8, “Introduction”
  7. (1997) introduced a neural network for example-based machine translation.
    Page 8, “Introduction”

BLEU

Appears in 6 sentences as: BLEU (6)
In Additive Neural Networks for Statistical Machine Translation
  1. In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e’, d’)), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e’, then the model score of (e*, d*) with this weight will also be greater than that of (e’, d’). In this paper, a pair (e*, e’) for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
    Page 5, “Introduction”
  2. to that of Moses: on the NIST05 test set, L-Hiero achieves a BLEU score of 25.1 and Moses achieves 24.8.
    Page 7, “Introduction”
  3. Since both MERT and PRO tuning toolkits involve randomness in their implementations, all BLEU scores reported in the experiments are the average of five tuning runs, as suggested by Clark et al.
    Page 7, “Introduction”
  4. Table 2: The BLEU comparisons between AdNN-Hiero-E and Log-linear translation models on the Chinese-to-English and Japanese-to-English tasks.
    Page 7, “Introduction”
  5. In detail, for the Chinese-to-English task, AdNN-Hiero-E improves more than 0.6 BLEU scores over L-Hiero on both test sets: the gains over L-Hiero tuned with PRO are 0.66 and 1.09 on NIST06 and NIST08, respectively, and the gains over L-Hiero tuned with MERT are even more.
    Page 7, “Introduction”
  6. AdNN-Hiero-E gains about 0.7 BLEU scores on
    Page 7, “Introduction”

BLEU scores

Appears in 5 sentences as: BLEU score (1) BLEU scores (4)
In Additive Neural Networks for Statistical Machine Translation
  1. In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e’, d’)), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e’, then the model score of (e*, d*) with this weight will also be greater than that of (e’, d’). In this paper, a pair (e*, e’) for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
    Page 5, “Introduction”
  2. to that of Moses: on the NIST05 test set, L-Hiero achieves a BLEU score of 25.1 and Moses achieves 24.8.
    Page 7, “Introduction”
  3. Since both MERT and PRO tuning toolkits involve randomness in their implementations, all BLEU scores reported in the experiments are the average of five tuning runs, as suggested by Clark et al.
    Page 7, “Introduction”
  4. In detail, for the Chinese-to-English task, AdNN-Hiero-E improves more than 0.6 BLEU scores over L-Hiero on both test sets: the gains over L-Hiero tuned with PRO are 0.66 and 1.09 on NIST06 and NIST08, respectively, and the gains over L-Hiero tuned with MERT are even more.
    Page 7, “Introduction”
  5. AdNN-Hiero-E gains about 0.7 BLEU scores on
    Page 7, “Introduction”

development set

Appears in 5 sentences as: development set (6)
In Additive Neural Networks for Statistical Machine Translation
  1. MERT (Och, 2003), MIRA (Watanabe et al., 2007; Chiang et al., 2008), PRO (Hopkins and May, 2011) and so on, which iteratively optimize a weight such that, after re-ranking a k-best list of a given development set with this weight, the loss of the resulting 1-best list is minimal.
    Page 5, “Introduction”
  2. where f is a source sentence in a given development set, and ((e*, d*), (e’, d’)) is a preference pair for f; N is the number of all preference pairs; λ > 0 is a regularizer.
    Page 5, “Introduction”
  3. Given a development set, we first run pre-training to obtain an initial parameter θ1 for Algorithm 1 in line 1.
    Page 6, “Introduction”
  4. For the Chinese-to-English task, the training data is the FBIS corpus (news domain) with about 240k sentence pairs; the development set is the NIST02 evaluation data; the development test set is NIST05; and the test datasets are NIST06 and NIST08.
    Page 6, “Introduction”
  5. For the Japanese-to-English task, the training data with 300k sentence pairs is from the NTCIR-patent task (Fujii et al., 2010); the development set, development test set, and two test sets are evenly extracted from a given development set with 4000 sentences, and these four datasets are called test1, test2, test3 and test4, respectively.
    Page 6, “Introduction”

model score

Appears in 5 sentences as: model score (5)
In Additive Neural Networks for Statistical Machine Translation
  1. In the search procedure, frequent computation of the model score is needed for the search heuristic function, which challenges the decoding efficiency of a neural network based translation model.
    Page 1, “Introduction”
  2. The main reason why cube-pruning works is that the translation model is linear and the model score for the language model is approximately monotonic (Chiang, 2007).
    Page 3, “Introduction”
  3. The premise of cube-pruning is that the language model score is approximately monotonic (Chiang, 2007).
    Page 3, “Introduction”
  4. Again for the example shown in Figure 1, the model score defined in Eq.
    Page 4, “Introduction”
  5. In the extreme, if the k-best list consists only of a pair of translations ((e*, d*), (e’, d’)), the desirable weight should satisfy the assertion: if the BLEU score of e* is greater than that of e’, then the model score of (e*, d*) with this weight will also be greater than that of (e’, d’). In this paper, a pair (e*, e’) for a source sentence f is called a preference pair for f. Following PRO, we define the following objective function under the max-margin framework to optimize the AdNN model:
    Page 5, “Introduction”
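
The assertion quoted above, that the higher-BLEU member of each preference pair should also receive the higher model score, is naturally expressed as a margin constraint. The sketch below shows one standard way to turn that into a trainable objective, a hinge loss over preference pairs with an L2 regularizer; it is a simplified stand-in for the paper's actual objective, and score_fn and theta are placeholders for the AdNN scoring function and its parameters (W, W', M, B).

    def preference_pair_loss(score_fn, preference_pairs, theta, lam=1e-3):
        # preference_pairs: list of (better, worse) hypotheses for some source
        # sentence, where `better` has the higher BLEU score.
        # score_fn(hypothesis, theta): returns the model score of a hypothesis.
        loss = 0.0
        for better, worse in preference_pairs:
            margin = score_fn(better, theta) - score_fn(worse, theta)
            loss += max(0.0, 1.0 - margin)       # hinge: demand a margin of 1
        loss /= max(len(preference_pairs), 1)
        loss += lam * sum(t * t for t in theta)  # L2 regularization (lambda > 0)
        return loss

    # Toy usage with a linear stand-in score: hypotheses are feature lists.
    score = lambda hyp, th: sum(w * x for w, x in zip(th, hyp))
    pairs = [([1.0, 0.2], [0.4, 0.9]), ([0.8, 0.1], [0.7, 0.5])]
    print(preference_pair_loss(score, pairs, theta=[0.5, -0.3]))

A pair contributes zero loss only when the better hypothesis outscores the worse one by at least the margin, which is exactly the ordering condition stated in the excerpt.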

sentence pairs

Appears in 3 sentences as: sentence pair (1) sentence pairs (2)
In Additive Neural Networks for Statistical Machine Translation
  1. For the Chinese-to-English task, the training data is the FBIS corpus (news domain) with about 240k sentence pairs; the development set is the NIST02 evaluation data; the development test set is NIST05; and the test datasets are NIST06 and NIST08.
    Page 6, “Introduction”
  2. For the Japanese-to-English task, the training data with 300k sentence pairs is from the NTCIR-patent task (Fujii et al., 2010); the development set, development test set, and two test sets are evenly extracted from a given development set with 4000 sentences, and these four datasets are called test1, test2, test3 and test4, respectively.
    Page 6, “Introduction”
  3. We run GIZA++ (Och and Ney, 2000) on the training corpus in both directions (Koehn et al., 2003) to obtain the word alignment for each sentence pair.
    Page 6, “Introduction”
