A Recursive Recurrent Neural Network for Statistical Machine Translation

In this paper, we propose a novel recursive recurrent neural network (R2NN) to model the end-to-end decoding process for statistical machine translation.

Deep Neural Network (DNN), which essentially is a multilayer neural network, has regained more and more attentions these years.

Yang et al.

In this section, we leverage DNN to model the end-to-end SMT decoding process, using a novel recursive recurrent neural network (R2NN), which is different from the above mentioned work applying DNN to components of conventional SMT framework.

In this section, we propose a three-step training method to train the parameters of our proposed R2NN, which includes unsupervised pre-training using recursive auto-encoding, supervised local training on the derivation tree of forced decoding, and supervised global training using early update training strategy.

The next question is how to initialize the phrase pair embedding in the translation table, so as to generate the leaf nodes of the derivation tree.

In this section, we conduct experiments to test our method on a Chinese-to-English translation task.

In this paper, we propose a Recursive Recurrent Neural Network (R2NN) to combine the recurrent neural network and recursive neural network.

Appears in 48 sentences as: Neural Network (6) neural network (40) neural networks (17)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- In this paper, we propose a novel recursive recurrent neural network (R2NN) to model the end-to-end decoding process for statistical machine translation.Page 1, “Abstract”
- R2NN is a combination of recursive neural network and recurrent neural network, and in turn integrates their respective capabilities: (1) new information can be used to generate the next hidden state, like recurrent neural networks, so that language model and translation model can be integrated naturally; (2) a tree structure can be built, as recursive neural networks, so as to generate the translation candidates in a bottom up manner.Page 1, “Abstract”
- Deep Neural Network (DNN), which essentially is a multilayer neural network, has regained more and more attentions these years.Page 1, “Introduction”
- Recurrent neural networks are leveraged to learn language model, and they keep the history information circularly inside the network for arbitrarily long time (Mikolov et al., 2010).Page 1, “Introduction”
- Recursive neural networks, which have the ability to generate a tree structured output, are applied to natural language parsing (Socher et al., 2011), and they are extended to recursive neural tensor networks to explore the compositional aspect of semantics (Socher et al., 2013).Page 1, “Introduction”
- (2013) propose a joint language and translation model, based on a recurrent neural network.Page 1, “Introduction”
- (2013) propose an additive neural network for SMT decoding.Page 1, “Introduction”
- R2NN is a combination of recursive neural network and recurrent neural network.Page 2, “Introduction”
- In R2NN, new information can be used to generate the next hidden state, like recurrent neural networks, and a tree structure can be built, as recursive neural networks.Page 2, “Introduction”
- To generate the translation candidates in a commonly used bottom-up manner, recursive neural networks are naturally adopted to build the tree structure.Page 2, “Introduction”
- In recursive neural networks, all the representations of nodes are generated based on their child nodes, and it is difficult to integrate additional global information, such as language model and distortion model.Page 2, “Introduction”

See all papers in *Proc. ACL 2014* that mention neural network.

See all papers in *Proc. ACL* that mention neural network.


Appears in 37 sentences as: Recursive (8) recursive (33)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- In this paper, we propose a novel recursive recurrent neural network (R2NN) to model the end-to-end decoding process for statistical machine translation.Page 1, “Abstract”
- R2NN is a combination of recursive neural network and recurrent neural network, and in turn integrates their respective capabilities: (1) new information can be used to generate the next hidden state, like recurrent neural networks, so that language model and translation model can be integrated naturally; (2) a tree structure can be built, as recursive neural networks, so as to generate the translation candidates in a bottom up manner.Page 1, “Abstract”
- Recursive neural networks, which have the ability to generate a tree structured output, are applied to natural language parsing (Socher et al., 2011), and they are extended to recursive neural tensor networks to explore the compositional aspect of semantics (Socher et al., 2013).Page 1, “Introduction”
- (2013) use recursive auto encoders to make full use of the entire merging phrase pairs, going beyond the boundary words with a maximum entropy classifier (Xiong et al., 2006).Page 1, “Introduction”
- R2NN is a combination of recursive neural network and recurrent neural network.Page 2, “Introduction”
- In R2NN, new information can be used to generate the next hidden state, like recurrent neural networks, and a tree structure can be built, as recursive neural networks.Page 2, “Introduction”
- To generate the translation candidates in a commonly used bottom-up manner, recursive neural networks are naturally adopted to build the tree structure.Page 2, “Introduction”
- In recursive neural networks, all the representations of nodes are generated based on their child nodes, and it is difficult to integrate additional global information, such as language model and distortion model.Page 2, “Introduction”
- In order to integrate these crucial information for better translation prediction, we combine recurrent neural networks into the recursive neural networks, so that we can use global information to generate the next hidden state, and select the better translation candidate.Page 2, “Introduction”
- We propose a three-step semi-supervised training approach to optimizing the parameters of R2NN, which includes recursive auto-encoding for unsupervised pre-training, supervised local training based on the derivation trees of forced decoding, and supervised global training using early update strategy.Page 2, “Introduction”
- (2013) propose to apply recursive auto-encoder to make full use of the entire merged blocks.Page 2, “Related Work”

See all papers in *Proc. ACL 2014* that mention recursive.

See all papers in *Proc. ACL* that mention recursive.


Appears in 36 sentences as: Phrase Pair (1) Phrase pair (1) phrase pair (27) phrase pairs (12)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- A semi-supervised training approach is proposed to train the parameters, and the phrase pair embedding is explored to model translation confidence directly.Page 1, “Abstract”
- (2013) use recursive auto encoders to make full use of the entire merging phrase pairs, going beyond the boundary words with a maximum entropy classifier (Xiong et al., 2006).Page 1, “Introduction”
- So as to model the translation confidence for a translation phrase pair, we initialize the phrase pair embedding by leveraging the sparse features and recurrent neural network.Page 2, “Introduction”
- The sparse features are phrase pairs in translation table, and recurrent neural network is utilized to learn a smoothed translation score with the source and target side information.Page 2, “Introduction”
- Phrase pair embedding method using translation confidence is elaborated in Section 5.Page 2, “Introduction”
- Given the representations of the smaller phrase pairs, recursive auto-encoder can generate the representation of the parent phrase pair with a reordering confidence score.Page 2, “Related Work”
- We then check whether translation candidates can be found in the translation table for each span, together with the phrase pair embedding and recurrent input vector (global features).Page 4, “Our Model”
- We extract phrase pairs using the conventional method (Och and Ney, 2004).Page 4, “Our Model”
- Representations of phrase pairs are automatically learnt to optimize the translation performance, while features used in conventional model are handcrafted.Page 4, “Our Model”
- A feature is learnt via a one-hidden-layer neural network, and the embedding of words in the phrase pairs are used as the input vector.Page 5, “Our Model”
- (2013) also generate the representation of phrase pairs in a recursive way.Page 5, “Our Model”

See all papers in *Proc. ACL 2014* that mention phrase pair.

See all papers in *Proc. ACL* that mention phrase pair.


Appears in 26 sentences as: Word embedding (6) word embedding (17) Word embeddings (1) word embeddings (6)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- Word embedding is a dense, low dimensional, real-valued vector.Page 1, “Introduction”
- Word embedding is usually learnt from large amount of monolingual corpus at first, and then fine tuned for special distinct tasks.Page 1, “Introduction”
- In their work, bilingual word embedding is trained to capture lexical translation information, and surrounding words are utilized to model context information.Page 1, “Introduction”
- Word embedding is used as the input to learn translation confidence score, which is combined with commonly used features in the conventional log-linear model.Page 1, “Introduction”
- In their work, initial word embedding is firstly trained with a huge monolingual corpus, then the word embedding is adapted and fine tuned bilingually in a context-depended DNN HMM framework.Page 2, “Related Work”
- Word embeddings capturing lexical translation information and surrounding words modeling context information are leveraged to improve the word alignment performance.Page 2, “Related Work”
- In their work, not only the target word embedding is used as the input of the network, but also the embedding of the source word, which is aligned to the current target word.Page 2, “Related Work”
- RNNLM (Mikolov et al., 2010) is firstly used to generate the source and target word embeddings, which are fed into a one-hidden-layer neural network to get a translation confidence score.Page 2, “Related Work”
- Word embedding $x_t$ is integrated withPage 3, “Our Model”
- The new history $h_t$ is used for the future prediction, and updated with new information from word embedding $x_t$ recurrently.Page 3, “Our Model”
- Word embedding $x_t$ is integrated as new input information in recurrent neural networks for each prediction, but in recursive neural networks, no additional input information is used except the two representation vectors of the child nodes.Page 3, “Our Model”
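
The recurrent update described above, where the new history is computed from the current word embedding and the previous history, can be sketched in a few lines of Python. This is a minimal illustration only: the `tanh` activation, the weight names `W` and `U`, and the toy values are assumptions, not details taken from the paper.

```python
import math


def rnn_step(x_t, h_prev, W, U):
    """One recurrent update: h_t = tanh(W x_t + U h_{t-1}).

    W maps the current input, U maps the previous history.
    Both are lists of rows (hypothetical toy weights).
    """
    h_t = []
    for w_row, u_row in zip(W, U):
        total = sum(w * x for w, x in zip(w_row, x_t))
        total += sum(u * h for u, h in zip(u_row, h_prev))
        h_t.append(math.tanh(total))
    return h_t


# Toy 2-dimensional example with made-up weights and inputs.
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[0.1, 0.0], [0.0, 0.1]]
h = [0.0, 0.0]  # empty history before the first word
for x_t in ([1.0, 0.0], [0.0, 1.0]):
    h = rnn_step(x_t, h, W, U)  # history is carried forward step by step
```

Because `h` is fed back in at every step, information from the first input still influences the state after the second one, which is the property the excerpts contrast with fixed-history models.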

See all papers in *Proc. ACL 2014* that mention word embedding.

See all papers in *Proc. ACL* that mention word embedding.


Appears in 19 sentences as: Recursive Neural Network (1) Recursive neural network (1) recursive neural network (8) Recursive neural networks (1) recursive neural networks (9)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- R2NN is a combination of recursive neural network and recurrent neural network, and in turn integrates their respective capabilities: (1) new information can be used to generate the next hidden state, like recurrent neural networks, so that language model and translation model can be integrated naturally; (2) a tree structure can be built, as recursive neural networks, so as to generate the translation candidates in a bottom up manner.Page 1, “Abstract”
- Recursive neural networks, which have the ability to generate a tree structured output, are applied to natural language parsing (Socher et al., 2011), and they are extended to recursive neural tensor networks to explore the compositional aspect of semantics (Socher et al., 2013).Page 1, “Introduction”
- R2NN is a combination of recursive neural network and recurrent neural network.Page 2, “Introduction”
- In R2NN, new information can be used to generate the next hidden state, like recurrent neural networks, and a tree structure can be built, as recursive neural networks.Page 2, “Introduction”
- To generate the translation candidates in a commonly used bottom-up manner, recursive neural networks are naturally adopted to build the tree structure.Page 2, “Introduction”
- In recursive neural networks, all the representations of nodes are generated based on their child nodes, and it is difficult to integrate additional global information, such as language model and distortion model.Page 2, “Introduction”
- In order to integrate these crucial information for better translation prediction, we combine recurrent neural networks into the recursive neural networks, so that we can use global information to generate the next hidden state, and select the better translation candidate.Page 2, “Introduction”
- R2NN is a combination of recursive neural network and recurrent neural network, which not only integrates the conventional global features as input information for each combination, but also generates the representation of the parent node for the future candidate generation.Page 3, “Our Model”
- In this section, we briefly recall the recurrent neural network and recursive neural network in Section 3.1 and 3.2, and then we elaborate our R2NN in detail in Section 3.3.Page 3, “Our Model”
- 3.2 Recursive Neural NetworkPage 3, “Our Model”
- To generate a tree structure, recursive neural networks are introduced for natural language parsing (Socher et al., 2011).Page 3, “Our Model”
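
As a rough sketch of the recursive composition described above, the snippet below builds a parent representation from two child vectors and scores it. The function names `compose` and `score`, the `tanh` activation, and all weights are hypothetical and chosen only for illustration.

```python
import math


def compose(left, right, W):
    """Parent representation from two children: s = tanh(W [left; right])."""
    children = left + right  # concatenate the two child vectors
    return [math.tanh(sum(w * c for w, c in zip(row, children))) for row in W]


def score(parent, v):
    """Plausibility score of a parent node: a dot product with weights v."""
    return sum(vi * pi for vi, pi in zip(v, parent))


# Made-up 2-dimensional leaves and weights.
W = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5]]
v = [1.0, -1.0]
leaf_a, leaf_b, leaf_c = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

ab = compose(leaf_a, leaf_b, W)   # merge the first two leaves
root = compose(ab, leaf_c, W)     # merge the result with the third leaf
root_score = score(root, v)
```

The same `compose` function is reused at every level, so the tree is built bottom-up and each parent has the same dimensionality as its children. Note that nothing other than the two child vectors enters a merge.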

See all papers in *Proc. ACL 2014* that mention recursive neural networks.

See all papers in *Proc. ACL* that mention recursive neural networks.


Appears in 19 sentences as: Recursive Neural (1) Recursive neural (2) recursive neural (18)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- R2NN is a combination of recursive neural network and recurrent neural network, and in turn integrates their respective capabilities: (1) new information can be used to generate the next hidden state, like recurrent neural networks, so that language model and translation model can be integrated naturally; (2) a tree structure can be built, as recursive neural networks, so as to generate the translation candidates in a bottom up manner.Page 1, “Abstract”
- Recursive neural networks, which have the ability to generate a tree structured output, are applied to natural language parsing (Socher et al., 2011), and they are extended to recursive neural tensor networks to explore the compositional aspect of semantics (Socher et al., 2013).Page 1, “Introduction”
- R2NN is a combination of recursive neural network and recurrent neural network.Page 2, “Introduction”
- In R2NN, new information can be used to generate the next hidden state, like recurrent neural networks, and a tree structure can be built, as recursive neural networks.Page 2, “Introduction”
- To generate the translation candidates in a commonly used bottom-up manner, recursive neural networks are naturally adopted to build the tree structure.Page 2, “Introduction”
- In recursive neural networks, all the representations of nodes are generated based on their child nodes, and it is difficult to integrate additional global information, such as language model and distortion model.Page 2, “Introduction”
- In order to integrate these crucial information for better translation prediction, we combine recurrent neural networks into the recursive neural networks, so that we can use global information to generate the next hidden state, and select the better translation candidate.Page 2, “Introduction”
- R2NN is a combination of recursive neural network and recurrent neural network, which not only integrates the conventional global features as input information for each combination, but also generates the representation of the parent node for the future candidate generation.Page 3, “Our Model”
- In this section, we briefly recall the recurrent neural network and recursive neural network in Section 3.1 and 3.2, and then we elaborate our R2NN in detail in Section 3.3.Page 3, “Our Model”
- 3.2 Recursive Neural NetworkPage 3, “Our Model”
- To generate a tree structure, recursive neural networks are introduced for natural language parsing (Socher et al., 2011).Page 3, “Our Model”
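
The combination described above, in which global features enter each merge as a recurrent input alongside the child representations, might look like the following sketch. The exact concatenation order, the `tanh` activation, and the names `r2nn_node`, `W`, and `v` are assumptions made for illustration; the paper's own equations are not reproduced in these excerpts.

```python
import math


def r2nn_node(x_left, s_left, x_right, s_right, x_parent, W, v):
    """Merge two child nodes into a parent node.

    x_* are recurrent input vectors (global features such as a language
    model score); s_* are node representations. Returns the parent
    representation and its confidence score.
    """
    inputs = x_left + s_left + x_right + s_right
    s_parent = [math.tanh(sum(w * a for w, a in zip(row, inputs))) for row in W]
    # The parent's own recurrent input is concatenated before scoring.
    y_parent = sum(vi * a for vi, a in zip(v, s_parent + x_parent))
    return s_parent, y_parent


# Toy sizes: 1-dimensional recurrent inputs, 2-dimensional representations.
W = [[0.1] * 6, [-0.1] * 6]
v = [1.0, 1.0, 0.5]
s_parent, y_parent = r2nn_node(
    x_left=[0.2], s_left=[1.0, 0.0],
    x_right=[0.4], s_right=[0.0, 1.0],
    x_parent=[-1.0], W=W, v=v,
)
```

Compared with a plain recursive network, the only structural change in this sketch is the extra `x_*` vectors, which is how global information reaches every merge.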

See all papers in *Proc. ACL 2014* that mention recursive neural.

See all papers in *Proc. ACL* that mention recursive neural.


Appears in 13 sentences as: language model (13) language modelling (1)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- R2NN is a combination of recursive neural network and recurrent neural network, and in turn integrates their respective capabilities: (1) new information can be used to generate the next hidden state, like recurrent neural networks, so that language model and translation model can be integrated naturally; (2) a tree structure can be built, as recursive neural networks, so as to generate the translation candidates in a bottom up manner.Page 1, “Abstract”
- Recurrent neural networks are leveraged to learn language model, and they keep the history information circularly inside the network for arbitrarily long time (Mikolov et al., 2010).Page 1, “Introduction”
- DNN is also introduced to Statistical Machine Translation (SMT) to learn several components or features of conventional framework, including word alignment, language modelling, translation modelling and distortion modelling.Page 1, “Introduction”
- In recursive neural networks, all the representations of nodes are generated based on their child nodes, and it is difficult to integrate additional global information, such as language model and distortion model.Page 2, “Introduction”
- (2013) extend the recurrent neural network language model, in order to use both the source and target side information to scoring translation candidates.Page 2, “Related Work”
- Recurrent neural network is usually used for sequence processing, such as language model (Mikolov et al., 2010).Page 3, “Our Model”
- Commonly used sequence processing methods, such as Hidden Markov Model (HMM) and n-gram language model, only use a limited history for the prediction.Page 3, “Our Model”
- In HMM, the previous state is used as the history, and for n-gram language model (for example n equals to 3), the history is the previous two words.Page 3, “Our Model”
- for SMT performance, such as language model score and distortion model score.Page 4, “Our Model”
- The commonly used features, such as translation score, language model score and distortion score, are used as the recurrent input vector $x$.Page 4, “Our Model”
- The language model is a 5-gram language model trained with the target sentences in the training data.Page 7, “Experiments and Results”
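
To make the limited-history point concrete, here is a small sketch of the history an n-gram model sees: for n = 3, only the previous two words. The helper `ngram_history` is a hypothetical name used for illustration.

```python
def ngram_history(words, position, n=3):
    """Return the (at most) n-1 words preceding words[position]."""
    start = max(0, position - (n - 1))
    return words[start:position]


sentence = ["we", "propose", "a", "novel", "model"]
# History available when predicting "novel" with a trigram model.
history = ngram_history(sentence, 3)
```

Here `history` holds only `"propose"` and `"a"`; the word `"we"` is already out of reach, whereas a recurrent network would still carry it in its hidden state.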

See all papers in *Proc. ACL 2014* that mention language model.

See all papers in *Proc. ACL* that mention language model.


Appears in 11 sentences as: confidence score (8) confidence scores (3)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- Word embedding is used as the input to learn translation confidence score, which is combined with commonly used features in the conventional log-linear model.Page 1, “Introduction”
- RNNLM (Mikolov et al., 2010) is firstly used to generate the source and target word embeddings, which are fed into a one-hidden-layer neural network to get a translation confidence score.Page 2, “Related Work”
- Together with other commonly used features, the translation confidence score is integrated into a conventional log-linear model.Page 2, “Related Work”
- Given the representations of the smaller phrase pairs, recursive auto-encoder can generate the representation of the parent phrase pair with a reordering confidence score.Page 2, “Related Work”
- $y^{[l,n]}$ is the confidence score of how plausible the parent node should be created.Page 3, “Our Model”
- The recurrent input vector $x^{[l,n]}$ is concatenated with parent node representation $s^{[l,n]}$ to compute the confidence score $y^{[l,n]}$.Page 4, “Our Model”
- The one-hot representation vector is used as the input, and a one-hidden-layer network generates a confidence score.Page 7, “Phrase Pair Embedding”
- To train the neural network, we add the confidence scores to the conventional log-linear model as features.Page 7, “Phrase Pair Embedding”
- We use recurrent neural network to generate two smoothed translation confidence scores based on source and target word embeddings.Page 7, “Phrase Pair Embedding”
- One is source to target translation confidence score and the other is target to source.Page 7, “Phrase Pair Embedding”
- These two confidence scores are defined as:Page 7, “Phrase Pair Embedding”

See all papers in *Proc. ACL 2014* that mention confidence score.

See all papers in *Proc. ACL* that mention confidence score.


Appears in 8 sentences as: embeddings (10)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- Word embeddings capturing lexical translation information and surrounding words modeling context information are leveraged to improve the word alignment performance.Page 2, “Related Work”
- RNNLM (Mikolov et al., 2010) is firstly used to generate the source and target word embeddings, which are fed into a one-hidden-layer neural network to get a translation confidence score.Page 2, “Related Work”
- Back propagation is performed along the tree structure, and the phrase pair embeddings of the leaf nodes are updated.Page 6, “Model Training”
- A simple approach to construct phrase pair embedding is to use the average of the embeddings of the words in the phrase pair.Page 6, “Phrase Pair Embedding”
- We use recurrent neural network to generate two smoothed translation confidence scores based on source and target word embeddings.Page 7, “Phrase Pair Embedding”
- As we mentioned in Section 5, constructing phrase pair embeddings from word embeddings may be not suitable.Page 7, “Experiments and Results”
- We first train the source and target word embeddings separately using large monolingual data, following (Collobert et al., 2011).Page 8, “Experiments and Results”
- $E_{wm_s}(s_i)$ and $E_{wm_t}(t_j)$ are the monolingual word embeddings, and $E_{wb_s}(s_i)$ and $E_{wb_t}(t_j)$ are the bilingual word embeddings.Page 8, “Experiments and Results”
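
The simple averaging approach mentioned above could be sketched as follows. Concatenating the source-side and target-side averages is an assumption of this sketch, and the lookup tables and tokens (`s1`, `s2`, `t1`) are made-up placeholders.

```python
def average_embedding(words, table):
    """Element-wise mean of the embeddings of the given words."""
    vectors = [table[w] for w in words]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def phrase_pair_embedding(src_words, tgt_words, src_table, tgt_table):
    """Assumed layout: [mean of source embeddings ; mean of target embeddings]."""
    return (average_embedding(src_words, src_table)
            + average_embedding(tgt_words, tgt_table))


# Hypothetical 2-dimensional word embeddings.
src_table = {"s1": [1.0, 3.0], "s2": [3.0, 1.0]}
tgt_table = {"t1": [0.5, -0.5]}
pair = phrase_pair_embedding(["s1", "s2"], ["t1"], src_table, tgt_table)
```

Averaging discards word order and treats every phrase pair built from the same words alike, which is consistent with the excerpts' remark that constructing phrase pair embeddings from word embeddings may not be suitable.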

See all papers in *Proc. ACL 2014* that mention embeddings.

See all papers in *Proc. ACL* that mention embeddings.


Appears in 7 sentences as: word aligned (2) word alignment (5)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- DNN is also introduced to Statistical Machine Translation (SMT) to learn several components or features of conventional framework, including word alignment, language modelling, translation modelling and distortion modelling.Page 1, “Introduction”
- (2013) adapt and extend the CD-DNN-HMM (Dahl et al., 2012) method to HMM-based word alignment model.Page 1, “Introduction”
- (2013) adapt and extend CD-DNN-HMM (Dahl et al., 2012) to word alignment.Page 2, “Related Work”
- Word embeddings capturing lexical translation information and surrounding words modeling context information are leveraged to improve the word alignment performance.Page 2, “Related Work”
- Unfortunately, the better word alignment result generated by this model, cannot bring significant performance improvement on a end-to-end SMT evaluation task.Page 2, “Related Work”
- where, $f_{a_i}$ is the corresponding target word aligned to $e_i$, and it is similar for $e_{a_j}$.Page 7, “Phrase Pair Embedding”
- The recurrent neural network is trained with word aligned bilingual corpus, similar as (Auli et al., 2013).Page 7, “Phrase Pair Embedding”

See all papers in *Proc. ACL 2014* that mention word alignment.

See all papers in *Proc. ACL* that mention word alignment.


Appears in 5 sentences as: BLEU (6)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- Experiments on a Chinese to English translation task show that our proposed R2NN can outperform the state-of-the-art baseline by about 1.5 points in BLEU.Page 1, “Abstract”
- We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we get about 1.5 BLEU points improvement, compared with a state-of-the-art baseline system.Page 2, “Introduction”
- When we remove it from R2NN, WEPPE based method drops about 10 BLEU points on development data and more than 6 BLEU points on test data.Page 8, “Experiments and Results”
- TCBPPE based method drops about 3 BLEU points on both development and test data sets.Page 8, “Experiments and Results”
- We conduct experiments on a Chinese-to-English translation task, and our method outperforms a state-of-the-art baseline about 1.5 points BLEU.Page 9, “Conclusion and Future Work”

See all papers in *Proc. ACL 2014* that mention BLEU.

See all papers in *Proc. ACL* that mention BLEU.


Appears in 5 sentences as: translation task (5)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- Experiments on a Chinese to English translation task show that our proposed R2NN can outperform the state-of-the-art baseline by about 1.5 points in BLEU.Page 1, “Abstract”
- We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we get about 1.5 BLEU points improvement, compared with a state-of-the-art baseline system.Page 2, “Introduction”
- In this section, we conduct experiments to test our method on a Chinese-to-English translation task.Page 7, “Experiments and Results”
- And also, translation task is difference from other NLP tasks, that, it is more important to model the translation confidence directly (the confidence of onePage 8, “Experiments and Results”
- We conduct experiments on a Chinese-to-English translation task, and our method outperforms a state-of-the-art baseline about 1.5 points BLEU.Page 9, “Conclusion and Future Work”

See all papers in *Proc. ACL 2014* that mention translation task.

See all papers in *Proc. ACL* that mention translation task.


Appears in 5 sentences as: natural language (4) nature language (1)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- Applying DNN to natural language processing (NLP), representation or embedding of words is usually learnt first.Page 1, “Introduction”
- Recursive neural networks, which have the ability to generate a tree structured output, are applied to natural language parsing (Socher et al., 2011), and they are extended to recursive neural tensor networks to explore the compositional aspect of semantics (Socher et al., 2013).Page 1, “Introduction”
- To generate a tree structure, recursive neural networks are introduced for natural language parsing (Socher et al., 2011).Page 3, “Our Model”
- For example, for nature language parsing, $s^{[l,n]}$ is the representation of the parent node, which could be a NP or VP node, and it is also the representation of the whole subtree covering from $l$ to $n$.Page 3, “Our Model”
- We will apply our proposed R2NN to other tree structure learning tasks, such as natural language parsing.Page 9, “Conclusion and Future Work”

See all papers in *Proc. ACL 2014* that mention natural language.

See all papers in *Proc. ACL* that mention natural language.


Appears in 5 sentences as: model trained (2) model training (3)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- The combination of reconstruction error and reordering error is used to be the objective function for the model training.Page 2, “Related Work”
- Due to the inexact search nature of SMT decoding, search errors may inevitably break theoretical properties, and the final translation results may be not suitable for model training.Page 6, “Model Training”
- Forced decoding is utilized to get positive samples, and contrastive divergence is used for model training.Page 7, “Phrase Pair Embedding”
- The language model is a 5-gram language model trained with the target sentences in the training data.Page 7, “Experiments and Results”
- Our baseline decoder is an in-house implementation of Bracketing Transduction Grammar (BTG) (Wu, 1997) in CKY-style decoding with a lexical reordering model trained with maximum entropy (Xiong et al., 2006).Page 7, “Experiments and Results”

See all papers in *Proc. ACL 2014* that mention model training.

See all papers in *Proc. ACL* that mention model training.


Appears in 5 sentences as: log-linear model (5)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- Word embedding is used as the input to learn translation confidence score, which is combined with commonly used features in the conventional log-linear model.Page 1, “Introduction”
- Together with other commonly used features, the translation confidence score is integrated into a conventional log-linear model.Page 2, “Related Work”
- The difference between our model and the conventional log-linear model includes:Page 4, “Our Model”
- Instead of integrating the sparse features directly into the log-linear model, we use them as the input to learn a phrase pair embedding.Page 7, “Phrase Pair Embedding”
- To train the neural network, we add the confidence scores to the conventional log-linear model as features.Page 7, “Phrase Pair Embedding”

See all papers in *Proc. ACL 2014* that mention log-linear model.

See all papers in *Proc. ACL* that mention log-linear model.


Appears in 5 sentences as: log-linear (5)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- Word embedding is used as the input to learn translation confidence score, which is combined with commonly used features in the conventional log-linear model.Page 1, “Introduction”
- Together with other commonly used features, the translation confidence score is integrated into a conventional log-linear model.Page 2, “Related Work”
- The difference between our model and the conventional log-linear model includes:Page 4, “Our Model”
- Instead of integrating the sparse features directly into the log-linear model, we use them as the input to learn a phrase pair embedding.Page 7, “Phrase Pair Embedding”
- To train the neural network, we add the confidence scores to the conventional log-linear model as features.Page 7, “Phrase Pair Embedding”

See all papers in *Proc. ACL 2014* that mention log-linear.

See all papers in *Proc. ACL* that mention log-linear.


Appears in 5 sentences as: end-to-end (5)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- In this paper, we propose a novel recursive recurrent neural network (R2NN) to model the end-to-end decoding process for statistical machine translation.Page 1, “Abstract”
- Different from the work mentioned above, which applies DNN to components of conventional SMT framework, in this paper, we propose a novel R2NN to model the end-to-end decoding process.Page 2, “Introduction”
- Unfortunately, the better word alignment result generated by this model, cannot bring significant performance improvement on a end-to-end SMT evaluation task.Page 2, “Related Work”
- In this section, we leverage DNN to model the end-to-end SMT decoding process, using a novel recursive recurrent neural network (R2NN), which is different from the above mentioned work applying DNN to components of conventional SMT framework.Page 3, “Our Model”
- Our R2NN is used to model the end-to-end translation process, with recurrent global information added.Page 5, “Our Model”

Appears in 4 sentences as: loss function (4)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- The loss function is defined as follows so as to minimize the information lost:Page 5, “Model Training”
- The loss function is the commonly used ranking loss with a margin, and it is defined as follows:Page 5, “Model Training”
- The loss function aims to learn a model which assigns the good translation candidate (the oracle candidate) higher score than the bad ones, with a margin 1.Page 5, “Model Training”
- The loss function for supervised global training is defined as follows:Page 6, “Model Training”
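
The ranking loss described in these sentences can be sketched as follows. This is a minimal illustration, assuming the common hinge form `max(0, margin - oracle + candidate)`; the quoted sentences only say that it is a ranking loss with margin 1, so the exact form and the function name `margin_ranking_loss` are assumptions rather than the paper's definition.

```python
def margin_ranking_loss(oracle_score, candidate_score, margin=1.0):
    """Hinge-style ranking loss (assumed form, not taken from the paper).

    The loss is zero once the oracle candidate outscores the competing
    candidate by at least `margin`, and grows linearly otherwise.
    """
    return max(0.0, margin - oracle_score + candidate_score)


# Oracle wins by more than the margin: no penalty.
loss_separated = margin_ranking_loss(oracle_score=2.0, candidate_score=0.5)

# Oracle and candidate tie: the full margin is charged.
loss_tied = margin_ranking_loss(oracle_score=0.5, candidate_score=0.5)
```

Under this form, training pushes the oracle score above the competing score only until the margin is satisfied, after which the example no longer contributes a gradient.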

Appears in 4 sentences as: hidden layer (4)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- As shown in Figure 1, the network contains three layers, an input layer, a hidden layer, and an output layer.Page 3, “Our Model”
- previous history $h_{t-1}$ to generate the current hidden layer, which is a new history vector $h_t$.Page 3, “Our Model”
- The neural network is used to reduce the space dimension of sparse features, and the hidden layer of the network is used as the phrase pair embedding.Page 7, “Phrase Pair Embedding”
- The length of the hidden layer is empirically set to 20.Page 7, “Phrase Pair Embedding”

Appears in 4 sentences as: semi-supervised (4)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- A semi-supervised training approach is proposed to train the parameters, and the phrase pair embedding is explored to model translation confidence directly.Page 1, “Abstract”
- We propose a three-step semi-supervised training approach to optimizing the parameters of R2NN, which includes recursive auto-encoding for unsupervised pre-training, supervised local training based on the derivation trees of forced decoding, and supervised global training using early update strategy.Page 2, “Introduction”
- Our R2NN framework is introduced in detail in Section 3, followed by our three-step semi-supervised training approach in Section 4.Page 2, “Introduction”
- We apply our model to SMT decoding, and propose a three-step semi-supervised training method.Page 9, “Conclusion and Future Work”

Appears in 4 sentences as: sentence pair (3) sentence pairs (1)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- The training samples for RAE are phrase pairs $\{s_1, s_2\}$ in translation table, where $s_1$ and $s_2$ can form a continuous partial sentence pair in the training data.Page 5, “Model Training”
- Forced decoding performs sentence pair segmentation using the same translation system as decoding.Page 5, “Model Training”
- For each sentence pair in the training data, SMT decoder is applied to the source side, and any candidate which is not the partial substring of the target sentence is removed from the n-best list during decoding.Page 5, “Model Training”
- The training data contains 81k sentence pairs, 655K Chinese words and 806K English words.Page 7, “Experiments and Results”

Appears in 3 sentences as: model score (4)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- for SMT performance, such as language model score and distortion model score.Page 4, “Our Model”
- The commonly used features, such as translation score, language model score and distortion score, are used as the recurrent input vector $x$.Page 4, “Our Model”
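- $L_{SGT}(W, V, s^{[l,n]}) = -\log\left(\frac{\exp\left(y_{oracle}^{[l,n]}\right)}{\sum_{t \in nbest} \exp\left(y_t^{[l,n]}\right)}\right)$ (7) where $y_{oracle}^{[l,n]}$ is the model score of an oracle translation candidate for the span $[l, n]$.Page 6, “Model Training”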
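
The global training loss quoted in the last sentence reads as a negative log-likelihood of the oracle candidate over the n-best list. A minimal sketch follows, assuming the numerator is the exponentiated oracle score and the denominator sums over all n-best scores; the function name `global_training_loss` and the log-sum-exp stabilisation are illustrative choices, not details from the paper.

```python
import math


def global_training_loss(oracle_score, nbest_scores):
    """Negative log of the softmax probability of the oracle candidate.

    Assumed form: -log(exp(oracle) / sum_t exp(score_t)), computed with the
    log-sum-exp trick so that large scores do not overflow.
    """
    max_score = max(nbest_scores)
    log_normalizer = max_score + math.log(
        sum(math.exp(score - max_score) for score in nbest_scores)
    )
    return log_normalizer - oracle_score


# Two equally scored candidates, one of which is the oracle:
# the oracle has probability 1/2, so the loss is log(2).
loss_two_way_tie = global_training_loss(1.0, [1.0, 1.0])
```

The loss shrinks toward zero as the oracle score dominates the other n-best scores.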

Appears in 3 sentences as: model parameters (3)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- $y_t^{[l,n]}$ is the plausible score for the best translation candidate given the model parameters $W$ and $V$.Page 5, “Model Training”
- Table 1: The relationship between the size of training data and the number of model parameters.Page 6, “Phrase Pair Embedding”
- Table 1 shows the relationship between the size of training data and the number of model parameters.Page 6, “Phrase Pair Embedding”

Appears in 3 sentences as: translation model (2) translation modelling (1)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- R2NN is a combination of recursive neural network and recurrent neural network, and in turn integrates their respective capabilities: (1) new information can be used to generate the next hidden state, like recurrent neural networks, so that language model and translation model can be integrated naturally; (2) a tree structure can be built, as recursive neural networks, so as to generate the translation candidates in a bottom up manner.Page 1, “Abstract”
- DNN is also introduced to Statistical Machine Translation (SMT) to learn several components or features of conventional framework, including word alignment, language modelling, translation modelling and distortion modelling.Page 1, “Introduction”
- (2013) propose a joint language and translation model, based on a recurrent neural network.Page 1, “Introduction”

Appears in 3 sentences as: BLEU points (4)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we get about 1.5 BLEU points improvement, compared with a state-of-the-art baseline system.Page 2, “Introduction”
- When we remove it from R2NN, WEPPE based method drops about 10 BLEU points on development data and more than 6 BLEU points on test data.Page 8, “Experiments and Results”
- TCBPPE based method drops about 3 BLEU points on both development and test data sets.Page 8, “Experiments and Results”

Appears in 3 sentences as: baseline system (3)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we get about 1.5 BLEU points improvement, compared with a state-of-the-art baseline system.Page 2, “Introduction”
- We compare our phrase pair embedding methods and our proposed R2NN with baseline system, in Table 2.Page 8, “Experiments and Results”
- We can see that, our R2NN models with WEPPE and TCBPPE are both better than the baseline system.Page 8, “Experiments and Results”

Appears in 3 sentences as: Word Pair (1) word pair (2)

In *A Recursive Recurrent Neural Network for Statistical Machine Translation*

- Word: 1G, 500K, $20 \times 500K$; Word Pair: 7M, $(500K)^2$, $20 \times (500K)^2$; Phrase Pair: 7M, $(500K)^4$, $20 \times (500K)^4$.Page 6, “Phrase Pair Embedding”
- For word pair and phrase pair embedding, the numbers are calculated on IWSLT 2009 dialog training set.Page 6, “Phrase Pair Embedding”
- But for source-target word pair, we may only have 7M bilingual corpus for training (taking IWSLT data set as an example), and there are $20 \times (500K)^2$ parameters to be tuned.Page 6, “Phrase Pair Embedding”
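
The arithmetic behind this data-sparsity argument can be checked with a short sketch. It assumes a 20-dimensional embedding and a 500K-entry vocabulary, as the quoted figures suggest, and treats "1G" and "7M" as counts of training tokens. That reading of the units, and the helper name `embedding_parameters`, are assumptions made for illustration.

```python
EMBEDDING_DIM = 20      # embedding length implied by the "20 x ..." figures
VOCAB_SIZE = 500_000    # "500K" entries for single words


def embedding_parameters(num_entries, dim=EMBEDDING_DIM):
    """Number of parameters in an embedding table with `num_entries` rows."""
    return dim * num_entries


# Single words: one row per vocabulary entry.
word_params = embedding_parameters(VOCAB_SIZE)

# Source-target word pairs: one row per (source word, target word) pair.
word_pair_params = embedding_parameters(VOCAB_SIZE ** 2)

# Training tokens available per parameter (units assumed, see lead-in).
word_ratio = 1_000_000_000 / word_params
word_pair_ratio = 7_000_000 / word_pair_params
```

With these assumptions, word embedding has about 100 training tokens per parameter, while word pair embedding has roughly one token per 700,000 parameters, which is the imbalance the quoted sentence points to.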
