Multilingual Models for Compositional Distributed Semantics
Hermann, Karl Moritz and Blunsom, Phil

Article Structure

Abstract

We present a novel technique for learning semantic representations, which extends the distributional hypothesis to multilingual data and joint-space embeddings.

Introduction

Distributed representations of words provide the basis for many state-of-the-art approaches to various problems in natural language processing today.

Overview

Distributed representation learning describes the task of learning continuous representations for discrete objects.

Approach

Most prior work on learning compositional semantic representations employs parse trees on their training data to structure their composition functions (Socher et al., 2012; Hermann and Blunsom, 2013, inter alia).

Corpora

We use two corpora for learning semantic representations and performing the experiments described in this paper.

Experiments

We report results on two experiments.

Related Work

Distributed Representations Distributed representations can be learned through a number of approaches.

Conclusion

To summarize, we have presented a novel method for learning multilingual word embeddings using parallel data in conjunction with a multilingual objective function for compositional vector models.

Topics

embeddings

Appears in 28 sentences as: embeddings (32)
In Multilingual Models for Compositional Distributed Semantics
1. We present a novel technique for learning semantic representations, which extends the distributional hypothesis to multilingual data and joint-space embeddings.
    Page 1, “Abstract”
  2. Our models leverage parallel data and learn to strongly align the embeddings of semantically equivalent sentences, while maintaining sufficient distance between those of dissimilar sentences.
    Page 1, “Abstract”
  3. Such word embeddings are naturally richer representations than those of symbolic or discrete models, and have been shown to be able to capture both syntactic and semantic information.
    Page 1, “Introduction”
4. In this work, we extend this hypothesis to multilingual data and joint-space embeddings.
    Page 1, “Introduction”
5. We describe a multilingual objective function that uses a noise-contrastive update between semantic representations of different languages to learn these word embeddings.
    Page 2, “Overview”
  6. We also investigate the learned embeddings from a qualitative perspective in §5.4.
    Page 4, “Experiments”
7. All our embeddings have dimensionality d=128, with the margin set to m=d. Further, we use L2 regularization with λ=1 and step-size in {0.01, 0.05}. [A configuration sketch follows this list.]
    Page 4, “Experiments”
  8. This task involves learning language independent embeddings which are then used for document classification across the English-German language pair.
    Page 4, “Experiments”
9. (2012), with the exception that we learn our embeddings using solely the Europarl data and use the Reuters corpora only for classifier training and testing.
    Page 4, “Experiments”
  10. The motivation behind ADD+ and BI+ is to investigate whether we can learn better embeddings by introducing additional data from other languages.
    Page 5, “Experiments”
  11. This suggests that the joint mode improves the quality of the English embeddings more than it affects the L2-embeddings.
    Page 7, “Experiments”
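
Item 7 above lists the main hyperparameters (d = 128, margin m = d, L2 regularization, step-size of 0.01 or 0.05). The fragment below merely records those values as a configuration and applies a corresponding L2-regularized gradient step. It is a minimal sketch: the value of l2_lambda is an assumed reading of the garbled excerpt, the parameter names are placeholders, and the plain SGD update is illustrative rather than the paper's actual optimiser.

```python
import numpy as np

# Hyperparameters from item 7 above. l2_lambda is an assumed reading of the
# garbled excerpt; the plain SGD update below is illustrative only.
config = {"dim": 128, "margin": 128.0, "l2_lambda": 1.0, "step_size": 0.01}

def sgd_l2_step(theta, grad, cfg=config):
    """One gradient step with the L2 penalty folded into the update."""
    return theta - cfg["step_size"] * (grad + cfg["l2_lambda"] * theta)

# Toy usage on a randomly initialised parameter vector.
theta = np.zeros(config["dim"])
theta = sgd_l2_step(theta, np.random.randn(config["dim"]))
```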


semantic representations

Appears in 17 sentences as: semantic representation (3) semantic representations (16)
In Multilingual Models for Compositional Distributed Semantics
1. We present a novel technique for learning semantic representations, which extends the distributional hypothesis to multilingual data and joint-space embeddings.
    Page 1, “Abstract”
  2. We extend our approach to learn semantic representations at the document level, too.
    Page 1, “Abstract”
  3. We present a novel unsupervised technique for learning semantic representations that leverages parallel corpora and employs semantic transfer through compositional representations.
    Page 1, “Introduction”
  4. The results on this task, in comparison with a number of strong baselines, further demonstrate the relevance of our approach and the success of our method in learning multilingual semantic representations over a wide range of languages.
    Page 1, “Introduction”
  5. Here, we focus on learning semantic representations and investigate how the use of multilingual data can improve learning such representations at the word and higher level.
    Page 2, “Overview”
  6. We describe a multilingual objective function that uses a noise-contrastive update between semantic representations of different languages to learn these word embeddings.
    Page 2, “Overview”
  7. As part of this, we use a compositional vector model (CVM, henceforth) to compute semantic representations of sentences and documents.
    Page 2, “Overview”
  8. A CVM learns semantic representations of larger syntactic units given the semantic representations of their constituents (Clark and Pulman, 2007; Mitchell and Lapata, 2008; Baroni and Zamparelli, 2010; Grefenstette and Sadrzadeh, 2011; Socher et al., 2012; Hermann and Blunsom, 2013, inter alia).
    Page 2, “Overview”
  9. Most prior work on learning compositional semantic representations employs parse trees on their training data to structure their composition functions (Socher et al., 2012; Hermann and Blunsom, 2013, inter alia).
    Page 2, “Approach”
10. We utilise this diversity to abstract further from monolingual surface realisations to deeper semantic representations.
    Page 2, “Approach”
11. Assume two functions f : X → R^d and g : Y → R^d, which map sentences from languages x and y onto distributed semantic representations in R^d. [A toy composition of this form is sketched after this list.]
    Page 2, “Approach”
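
Item 11 defines two composition functions f and g that map sentences from languages x and y into a shared space R^d. The simplest compositional vector model discussed in the paper (ADD) is a distributed bag-of-words approach; a natural reading is a sum of word vectors, which the sketch below assumes. All vocabularies, names, and initialisations here are hypothetical illustrations, not the authors' code.

```python
import numpy as np

d = 128  # dimensionality used in the paper's experiments

# Hypothetical toy vocabularies and randomly initialised embedding matrices
# for a source language x and a target language y.
vocab_x = {"the": 0, "house": 1, "is": 2, "red": 3}
vocab_y = {"das": 0, "haus": 1, "ist": 2, "rot": 3}
E_x = np.random.uniform(-0.1, 0.1, (len(vocab_x), d))
E_y = np.random.uniform(-0.1, 0.1, (len(vocab_y), d))

def compose_add(sentence, vocab, E):
    """Additive (bag-of-words) composition: sum the word vectors of a sentence."""
    return sum(E[vocab[w]] for w in sentence)

# f : X -> R^d and g : Y -> R^d as in the excerpt above.
f = lambda s: compose_add(s, vocab_x, E_x)
g = lambda s: compose_add(s, vocab_y, E_y)

a_vec = f(["the", "house", "is", "red"])
b_vec = g(["das", "haus", "ist", "rot"])
```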


distributed representations

Appears in 12 sentences as: Distributed representation (1) distributed representation (2) Distributed Representations (1) Distributed representations (2) distributed representations (7)
In Multilingual Models for Compositional Distributed Semantics
  1. Distributed representations of words provide the basis for many state-of-the-art approaches to various problems in natural language processing today.
    Page 1, “Introduction”
  2. Distributed representation learning describes the task of learning continuous representations for discrete objects.
    Page 2, “Overview”
  3. Such distributed representations allow a model to share meaning between similar words, and have been used to capture semantic, syntactic and morphological content (Collobert and Weston, 2008; Turian et al., 2010, inter alia).
    Page 2, “Overview”
  4. Some work has exploited this idea for transferring linguistic knowledge into low-resource languages or to learn distributed representations at the word level (Klementiev et al., 2012; Zou et al., 2013; Lauly et al., 2013, inter alia).
    Page 2, “Overview”
  5. As distributed representations of larger expressions have been shown to be highly useful for a number of tasks, it seems to be a natural next step to attempt to induce these, too, cross-lingually.
    Page 2, “Overview”
6. (2012), learning distributed representations on the Europarl corpus and evaluating on documents from the Reuters RCV1/RCV2 corpora.
    Page 4, “Experiments”
  7. We use the training data of the corpus to learn distributed representations across 12 languages.
    Page 5, “Experiments”
8. In a third evaluation (Table 4), we apply the embeddings learnt with our models to a monolingual classification task, enabling us to compare with prior work on distributed representation learning.
    Page 7, “Experiments”
  9. Distributed Representations Distributed representations can be learned through a number of approaches.
    Page 8, “Related Work”
  10. Tasks, where the use of distributed representations has resulted in improvements include topic modelling (Blei et al., 2003) or named entity recognition (Turian et al., 2010; Collobert et al., 2011).
    Page 9, “Related Work”
  11. Multilingual Representation Learning Most research on distributed representation induction has focused on single languages.
    Page 9, “Related Work”


cross-lingual

Appears in 8 sentences as: Cross-Lingual (1) Cross-lingual (1) cross-lingual (6)
In Multilingual Models for Compositional Distributed Semantics
  1. We evaluate these models on two cross-lingual document classification tasks, outperforming the prior state of the art.
    Page 1, “Abstract”
2. First, we show that for cross-lingual document classification on the Reuters RCV1/RCV2 corpora (Lewis et al., 2004), we outperform the prior state of the art (Klementiev et al., 2012).
    Page 1, “Introduction”
3. The Europarl corpus v7 (Koehn, 2005) was used during initial development and testing of our approach, as well as to learn the representations used for the Cross-Lingual Document Classification task described in §5.2.
    Page 4, “Corpora”
  4. First, we replicate the cross-lingual document classification task of Klementiev et al.
    Page 4, “Experiments”
  5. We evaluate our models on the cross-lingual document classification (CLDC, henceforth) task first described in Klementiev et al.
    Page 4, “Experiments”
6. Cross-lingual compositional representations (ADD, BI and their multilingual extensions), I-Matrix (Klementiev et al., 2012), translated (MT) and glossed (Glossed) word baselines, and the majority class baseline.
    Page 5, “Experiments”
  7. (2013) train a cross-lingual encoder, where an autoencoder is used to recreate words in two languages in parallel.
    Page 9, “Related Work”
  8. Coupled with very simple composition functions, vectors learned with this method outperform the state of the art on the task of cross-lingual document classification.
    Page 9, “Conclusion”
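
Items 2 and 5 describe the cross-lingual document classification (CLDC) setup: document vectors from one language are used to train a classifier that is then tested on documents from another language, which is only possible because both are embedded in the same shared space. The sketch below illustrates that protocol with a toy multi-class perceptron; the classifier choice and all variable names are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def train_perceptron(X_train, y_train, epochs=10):
    """Toy multi-class perceptron over document vectors (illustrative only)."""
    n_classes = int(y_train.max()) + 1
    W = np.zeros((n_classes, X_train.shape[1]))
    for _ in range(epochs):
        for x, y in zip(X_train, y_train):
            pred = int(np.argmax(W @ x))
            if pred != y:
                W[y] += x
                W[pred] -= x
    return W

def accuracy(W, X_test, y_test):
    """Evaluate on documents from the other language, embedded in the same space."""
    preds = np.argmax(X_test @ W.T, axis=1)
    return float((preds == y_test).mean())

# e.g. W = train_perceptron(X_en, y_en); accuracy(W, X_de, y_de)
# where X_en / X_de are English / German document vectors and y_* integer labels.
```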


parallel data

Appears in 8 sentences as: Parallel data (1) parallel data (7)
In Multilingual Models for Compositional Distributed Semantics
  1. Our models leverage parallel data and learn to strongly align the embeddings of semantically equivalent sentences, while maintaining sufficient distance between those of dissimilar sentences.
    Page 1, “Abstract”
2. Through qualitative analysis and the study of pivoting effects we demonstrate that our representations are semantically plausible and can capture semantic relationships across languages without parallel data.
    Page 1, “Abstract”
  3. A key difference between our approach and those listed above is that we only require sentence-aligned parallel data in our otherwise unsupervised learning function.
    Page 2, “Overview”
  4. Parallel data in multiple languages provides an
    Page 2, “Overview”
5. The idea is that, given enough parallel data, a shared representation of two parallel sentences would be forced to capture the common elements between these two sentences.
    Page 2, “Approach”
  6. However, there exists a corpus of prior work on learning multilingual embeddings or on using parallel data to transfer linguistic information across languages.
    Page 9, “Related Work”
  7. (2012), our baseline in §5.2, use a form of multi-agent learning on word-aligned parallel data to transfer embeddings from one language to another.
    Page 9, “Related Work”
  8. To summarize, we have presented a novel method for learning multilingual word embeddings using parallel data in conjunction with a multilingual objective function for compositional vector models.
    Page 9, “Conclusion”


classification task

Appears in 7 sentences as: Classification task (1) classification task (5) classification tasks (1)
In Multilingual Models for Compositional Distributed Semantics
1. We evaluate these models on two cross-lingual document classification tasks, outperforming the prior state of the art.
    Page 1, “Abstract”
2. The Europarl corpus v7 (Koehn, 2005) was used during initial development and testing of our approach, as well as to learn the representations used for the Cross-Lingual Document Classification task described in §5.2.
    Page 4, “Corpora”
  3. First, we replicate the cross-lingual document classification task of Klementiev et al.
    Page 4, “Experiments”
  4. multi-label classification task using the TED corpus, both for training and evaluating.
    Page 4, “Experiments”
  5. Each document in the classification task is represented by the average of the d-dimensional representations of all its sentences.
    Page 4, “Experiments”
6. As described in §4, we use 15 keywords for the classification task.
    Page 5, “Experiments”
7. In a third evaluation (Table 4), we apply the embeddings learnt with our models to a monolingual classification task, enabling us to compare with prior work on distributed representation learning.
    Page 7, “Experiments”
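
Item 5 states that each document is represented by the average of the d-dimensional representations of its sentences. A minimal sketch of that step, with a hypothetical helper name:

```python
import numpy as np

def document_vector(sentence_vectors):
    """Document representation: the average of its sentence vectors (item 5 above)."""
    return np.mean(np.stack(sentence_vectors), axis=0)

# e.g. doc_vec = document_vector([compose(s) for s in document_sentences])
# for whichever composition function `compose` produced the sentence vectors.
```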


word embeddings

Appears in 7 sentences as: word embeddings (6) word’s embedding (1)
In Multilingual Models for Compositional Distributed Semantics
  1. Such word embeddings are naturally richer representations than those of symbolic or discrete models, and have been shown to be able to capture both syntactic and semantic information.
    Page 1, “Introduction”
2. We describe a multilingual objective function that uses a noise-contrastive update between semantic representations of different languages to learn these word embeddings.
    Page 2, “Overview”
  3. (2013), who published word embeddings across 100 languages, including all languages considered in this paper.
    Page 7, “Experiments”
4. While the classification experiments focused on establishing the semantic content of the sentence level representations, we also want to briefly investigate the induced word embeddings.
    Page 8, “Experiments”
5. In their simplest form, distributional information from large corpora can be used to learn embeddings, where the words appearing within a certain window of the target word are used to compute that word’s embedding.
    Page 8, “Related Work”
  6. (2011) further popularised using neural network architectures for learning word embeddings from large amounts of largely unlabelled data by showing the embeddings can then be used to improve standard supervised tasks.
    Page 8, “Related Work”
  7. To summarize, we have presented a novel method for learning multilingual word embeddings using parallel data in conjunction with a multilingual objective function for compositional vector models.
    Page 9, “Conclusion”


objective function

Appears in 6 sentences as: objective function (4) objective function: (1) objective functions (1)
In Multilingual Models for Compositional Distributed Semantics
  1. We describe a multilingual objective function that uses a noise-contrastive update between semantic representations of different languages to learn these word embeddings.
    Page 2, “Overview”
2. Further, these approaches typically depend on specific semantic signals such as sentiment- or topic-labels for their objective functions.
    Page 2, “Approach”
3. This results in the following objective function: [the equation is not reproduced here; a hedged sketch follows this list]
    Page 3, “Approach”
  4. The objective function in Equation 2 could be coupled with any two given vector composition functions f, g from the literature.
    Page 3, “Approach”
  5. level, our model extends to document-level learning quite naturally, by recursively applying the composition and objective function (Equation 2) to compose sentences into documents.
    Page 3, “Approach”
  6. To summarize, we have presented a novel method for learning multilingual word embeddings using parallel data in conjunction with a multilingual objective function for compositional vector models.
    Page 9, “Conclusion”
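
Items 1, 3 and 4 refer to a noise-contrastive, margin-based objective that can be coupled with any pair of composition functions f and g. As a hedged reconstruction of what such an objective can look like: take the bilingual energy to be the squared distance between the composed representations of a sentence pair, and penalise cases where a sampled noise sentence comes within a margin m of the true translation. Function names below are illustrative, not taken from the paper's code.

```python
import numpy as np

def energy(a_vec, b_vec):
    """Bilingual energy: squared Euclidean distance between two sentence vectors."""
    diff = a_vec - b_vec
    return float(diff @ diff)

def noise_contrastive_loss(a_vec, b_vec, noise_vecs, margin=128.0):
    """Hinge loss: the true pair (a, b) should beat every noise pair (a, n) by `margin`."""
    pos = energy(a_vec, b_vec)
    return sum(max(0.0, margin + pos - energy(a_vec, n)) for n in noise_vecs)
```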


state of the art

Appears in 6 sentences as: state of the art (6)
In Multilingual Models for Compositional Distributed Semantics
1. We evaluate these models on two cross-lingual document classification tasks, outperforming the prior state of the art.
    Page 1, “Abstract”
2. First, we show that for cross-lingual document classification on the Reuters RCV1/RCV2 corpora (Lewis et al., 2004), we outperform the prior state of the art (Klementiev et al., 2012).
    Page 1, “Introduction”
3. Our models outperform the prior state of the art, with the BI models performing slightly better than the ADD models.
    Page 5, “Experiments”
  4. We compare our embeddings with the SENNA embeddings, which achieve state of the art performance on a number of tasks (Collobert et al., 2011).
    Page 7, “Experiments”
  5. They have received a lot of attention in recent years (Collobert and Weston, 2008; Mnih and Hinton, 2009; Mikolov et al., 2010, inter alia) and have achieved state of the art performance in language modelling.
    Page 8, “Related Work”
  6. Coupled with very simple composition functions, vectors learned with this method outperform the state of the art on the task of cross-lingual document classification.
    Page 9, “Conclusion”


machine translation

Appears in 6 sentences as: machine translation (6)
In Multilingual Models for Compositional Distributed Semantics
1. While the corpus is aimed at machine translation tasks, we use the keywords associated with each talk to build a subsidiary corpus for multilingual document classification as follows.
    Page 4, “Corpora”
  2. A similar idea exists in machine translation where English is frequently used to pivot between other languages (Cohn and Lapata, 2007).
    Page 5, “Experiments”
  3. MT System We develop a machine translation baseline as follows.
    Page 5, “Experiments”
  4. We train a machine translation tool on the parallel training data, using the development data of each language pair to optimize the translation system.
    Page 5, “Experiments”
5. It was demonstrated that this approach can be applied to improve tasks related to machine translation.
    Page 9, “Related Work”
6. (2013), also learned bilingual embeddings for machine translation.
    Page 9, “Related Work”


language pair

Appears in 6 sentences as: language pair (4) language pairs (2)
In Multilingual Models for Compositional Distributed Semantics
  1. We considered the English-German and English-French language pairs from this corpus.
    Page 4, “Corpora”
2. This task involves learning language independent embeddings which are then used for document classification across the English-German language pair.
    Page 4, “Experiments”
  3. In the single mode, vectors are learnt from a single language pair (en-X), while in the joint mode vector-learning is performed on all parallel sub-corpora simultaneously.
    Page 5, “Experiments”
  4. In the English case we train twelve individual classifiers, each using the training data of a single language pair only.
    Page 5, “Experiments”
  5. We train a machine translation tool on the parallel training data, using the development data of each language pair to optimize the translation system.
    Page 5, “Experiments”
6. As expected, the MT system slightly outperforms our models on most language pairs.
    Page 7, “Experiments”


semantic space

Appears in 5 sentences as: semantic space (5)
In Multilingual Models for Compositional Distributed Semantics
1. Unlike most methods for learning word representations, which are restricted to a single language, our approach learns to represent meaning across languages in a shared multilingual semantic space.
    Page 1, “Introduction”
  2. This setting causes words from all languages to be embedded in a single semantic space .
    Page 5, “Experiments”
  3. These results further support our hypothesis that the bilingual contrastive error function can learn semantically plausible embeddings and furthermore, that it can abstract away from monolingual surface realisations into a shared semantic space across languages.
    Page 8, “Experiments”
4. (2013), that learn embeddings across a large variety of languages, and models such as ours, that learn joint embeddings, that is, a projection into a shared semantic space across multiple languages.
    Page 9, “Related Work”
5. Further experiments and analysis support our hypothesis that bilingual signals are a useful tool for learning distributed representations by enabling models to abstract away from monolingual surface realisations into a deeper semantic space.
    Page 9, “Conclusion”


word representations

Appears in 5 sentences as: word representations (5)
In Multilingual Models for Compositional Distributed Semantics
1. Within a monolingual context, the distributional hypothesis (Firth, 1957) forms the basis of most approaches for learning word representations.
    Page 1, “Introduction”
2. Unlike most methods for learning word representations, which are restricted to a single language, our approach learns to represent meaning across languages in a shared multilingual semantic space.
    Page 1, “Introduction”
  3. Neural language models are another popular approach for inducing distributed word representations (Bengio et al., 2003).
    Page 8, “Related Work”
  4. Unsupervised word representations can easily be plugged into a variety of NLP related tasks.
    Page 9, “Related Work”
5. Hermann and Blunsom (2014) propose a large-margin learner for multilingual word representations, similar to the basic additive model proposed here, which, like the approaches above, relies on a bag-of-words model for sentence representations.
    Page 9, “Related Work”


bag-of-words

Appears in 4 sentences as: bag-of-words (5)
In Multilingual Models for Compositional Distributed Semantics
  1. This is a distributed bag-of-words approach as sentence ordering is not taken into account by the model.
    Page 3, “Approach”
  2. The use of a nonlinearity enables the model to learn interesting interactions between words in a document, which the bag-of-words approach of ADD is not capable of learning.
    Page 3, “Approach”
  3. (2013) proposed a bag-of-words autoencoder model, where the bag-of-words representation in one language is used to train the embeddings in another.
    Page 9, “Related Work”
  4. Hermann and Blunsom (2014) propose a large-margin learner for multilingual word representations, similar to the basic additive model proposed here, which, like the approaches above, relies on a bag-of-words model for sentence representations.
    Page 9, “Related Work”
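
Items 1 and 2 contrast the purely additive bag-of-words composition with a model that uses a nonlinearity so that neighbouring words can interact (the paper's BI model is of this kind). Below is a hedged sketch of a bigram-plus-tanh composition in that spirit; it is an illustration of the idea under that assumption, not the paper's exact definition.

```python
import numpy as np

def compose_bi(word_vectors):
    """Bigram composition: apply tanh to each pair of adjacent word vectors and sum."""
    vs = list(word_vectors)
    return sum(np.tanh(vs[i - 1] + vs[i]) for i in range(1, len(vs)))
```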


MT System

Appears in 4 sentences as: MT System (2) MT system (2)
In Multilingual Models for Compositional Distributed Semantics
  1. MT System We develop a machine translation baseline as follows.
    Page 5, “Experiments”
  2. MT System ADD single
    Page 6, “Experiments”
  3. As expected, the MT system slightly outperforms our models on most language pairs.
    Page 7, “Experiments”
4. However, the overall performance of the models is comparable to that of the MT system.
    Page 7, “Experiments”


parallel sentences

Appears in 4 sentences as: parallel sentences (4)
In Multilingual Models for Compositional Distributed Semantics
  1. The idea is that, given enough parallel data, a shared representation of two parallel sentences would be forced to capture the common elements between these two sentences.
    Page 2, “Approach”
  2. What parallel sentences share, of course, are their semantics.
    Page 2, “Approach”
3. For every pair of parallel sentences (a, b) we sample a number of additional sentence pairs (·, n) ∈ C, where n, with high probability, is not semantically equivalent to a. [This sampling step is sketched after the list.]
    Page 3, “Approach”
  4. The ADD+ model uses an additional 500k parallel sentences from the English-French corpus, resulting in one million English sentences, each paired up with either a German or a French sentence, with BI and BI+ trained accordingly.
    Page 5, “Experiments”
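
Item 3 describes sampling, for every parallel pair (a, b), a number of additional sentences n from the corpus that are, with high probability, not semantically equivalent to a. A minimal sketch of that sampling step, assuming uniform draws from the target side of the corpus; the function name and the number of samples are placeholders:

```python
import random

def sample_noise(target_corpus, b, k):
    """Draw k target-side sentences other than b to serve as contrastive noise."""
    noise = []
    while len(noise) < k:
        n = random.choice(target_corpus)
        if n is not b:  # with high probability n is not a translation of the source
            noise.append(n)
    return noise
```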


semantic similarity

Appears in 4 sentences as: semantic similarity (4)
In Multilingual Models for Compositional Distributed Semantics
  1. We exploit this semantic similarity across languages by defining a bilingual (and trivially multilingual) energy as follows.
    Page 2, “Approach”
  2. Even though the model did not use any parallel French-German data during training, it learns semantic similarity between these two languages using English as a pivot, and semantically clusters words across all languages.
    Page 8, “Experiments”
  3. Very simple composition functions have been shown to suffice for tasks such as judging bi-gram semantic similarity (Mitchell and Lapata, 2008).
    Page 9, “Related Work”
4. Their architecture optimises the cosine similarity of documents, using relative semantic similarity scores during learning.
    Page 9, “Related Work”
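
Item 2 observes that, even without any French-German parallel data, words from both languages end up clustered together in the shared space via English as a pivot. One way to inspect this is to query cosine nearest neighbours of a word's embedding against another language's embedding matrix; the helper below is a generic sketch for that, with all names hypothetical.

```python
import numpy as np

def nearest_neighbours(query_vec, E_other, vocab_other, k=5):
    """Return the k words in another language closest to query_vec by cosine similarity."""
    E_norm = E_other / np.linalg.norm(E_other, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = E_norm @ q
    top = np.argsort(-sims)[:k]
    inv = {i: w for w, i in vocab_other.items()}
    return [(inv[int(i)], float(sims[i])) for i in top]
```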


parse trees

Appears in 3 sentences as: parse trees (3)
In Multilingual Models for Compositional Distributed Semantics
1. This removes a number of constraints that normally come with CVM models, such as the need for syntactic parse trees, word alignment or annotated data as a training signal.
    Page 2, “Overview”
  2. Most prior work on learning compositional semantic representations employs parse trees on their training data to structure their composition functions (Socher et al., 2012; Hermann and Blunsom, 2013, inter alia).
    Page 2, “Approach”
  3. While these methods have been shown to work in some cases, the need for parse trees and annotated data limits such approaches to resource-fortunate languages.
    Page 2, “Approach”


language modelling

Appears in 3 sentences as: language modelling (2) language models (1)
In Multilingual Models for Compositional Distributed Semantics
  1. Successful applications of such models include language modelling (Bengio et al., 2003), paraphrase detection (Erk and Pado, 2008), and dialogue analysis (Kalchbrenner and Blunsom, 2013).
    Page 1, “Introduction”
  2. Neural language models are another popular approach for inducing distributed word representations (Bengio et al., 2003).
    Page 8, “Related Work”
3. They have received a lot of attention in recent years (Collobert and Weston, 2008; Mnih and Hinton, 2009; Mikolov et al., 2010, inter alia) and have achieved state of the art performance in language modelling.
    Page 8, “Related Work”
