Learning Grounded Meaning Representations with Autoencoders
Silberer, Carina and Lapata, Mirella

Article Structure

Abstract

In this paper we address the problem of grounding distributional representations of lexical meaning.

Introduction

Recent years have seen a surge of interest in single word vector spaces (Turney and Pantel, 2010; Collobert et al., 2011; Mikolov et al., 2013) and their successful use in many natural language applications.

Related Work

The presented model has connections to several lines of work in NLP, computer vision research, and more generally multimodal learning.

Autoencoders for Grounded Semantics

3.1 Background

Experimental Setup

In this section we present our experimental setup for assessing the performance of our model.

Results

Table 3 presents our results on the word similarity task.

Conclusions

In this paper, we presented a model that uses stacked autoencoders to learn grounded meaning representations by simultaneously combining textual and visual modalities.

Topics

meaning representations

Appears in 11 sentences as: meaning representation (4) meaning representations (7)
In Learning Grounded Meaning Representations with Autoencoders
  1. Despite differences in formulation, most existing models conceptualize the problem of meaning representation as one of learning from multiple views corresponding to different modalities.
    Page 1, “Introduction”
  2. In this work, we introduce a model, illustrated in Figure 1, which learns grounded meaning representations by mapping words and images into a common embedding space.
    Page 1, “Introduction”
  3. Unlike most previous work, our model is defined at a finer level of granularity — it computes meaning representations for individual words and is unique in its use of attributes as a means of representing the textual and visual modalities.
    Page 1, “Introduction”
  4. Unlike previous efforts such as the widely used WordSim353 collection (Finkelstein et al., 2002), our dataset contains ratings for visual and textual similarity, thus allowing us to study the two modalities (and their contribution to meaning representation) together and in isolation.
    Page 2, “Introduction”
  5. The use of stacked autoencoders to extract a shared lexical meaning representation is new to our knowledge, although, as we explain below, it is related to a large body of work on deep learning.
    Page 2, “Related Work”
  6. Our model learns higher-level meaning representations for single words from textual and visual input in a joint fashion.
    Page 3, “Autoencoders for Grounded Semantics”
  7. To learn meaning representations of single words from textual and visual input, we employ stacked (denoising) autoencoders (SAEs).
    Page 3, “Autoencoders for Grounded Semantics”
  8. Then, we join these two SAEs by feeding their respective second coding simultaneously to another autoencoder, whose hidden layer thus yields the fused meaning representation.
    Page 4, “Autoencoders for Grounded Semantics”
  9. We learn meaning representations for the nouns contained in McRae et al.’s (2005) feature norms.
    Page 5, “Experimental Setup”
  10. We used the model described above and the meaning representations obtained from the output of the bimodal latent layer for all the evaluation tasks detailed below.
    Page 6, “Experimental Setup”
  11. In this paper, we presented a model that uses stacked autoencoders to learn grounded meaning representations by simultaneously combining textual and visual modalities.
    Page 9, “Conclusions”
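
The representations described in items 7, 8 and 10 are read off the bimodal latent layer of the stacked network. As a minimal illustration (not the authors' code; function and weight names are hypothetical, and only the forward pass is shown), the fused meaning representation can be obtained as follows, assuming pre-trained tanh encoder layers for each modality as described in "Autoencoders for Grounded Semantics":

    import numpy as np

    def encode(x, weights, biases):
        # Pass an attribute vector through a stack of pre-trained encoder
        # layers (tanh activations, as used for the unimodal autoencoders).
        h = x
        for W, b in zip(weights, biases):
            h = np.tanh(W @ h + b)
        return h

    def fused_representation(x_txt, x_img, txt_params, img_params, W_joint, b_joint):
        h_txt = encode(x_txt, *txt_params)   # second textual coding
        h_img = encode(x_img, *img_params)   # second visual coding
        # The two second codings are concatenated and fed to the bimodal
        # autoencoder; its hidden activation is the fused meaning representation.
        return np.tanh(W_joint @ np.concatenate([h_txt, h_img]) + b_joint)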

SVD

Appears in 8 sentences as: SVD (8)
In Learning Grounded Meaning Representations with Autoencoders
  1. Specifically, we concatenate the textual and visual vectors and project them onto a lower dimensional latent space using SVD (Golub and Reinsch, 1970).
    Page 7, “Experimental Setup”
  2. We furthermore report results obtained with Bruni et al.’s (2014) bimodal distributional model, which employs SVD to integrate co-occurrence-based textual representations with visual representations.
    Page 7, “Experimental Setup”
  3. Results on the word similarity task (cf. Table 3); the six figures per row are correlations for the textual (T), visual (V), and bimodal (T+V) variants, first on semantic and then on visual similarity:
     McRae       0.71  0.49  0.68  |  0.58  0.52  0.62
     Attributes  0.58  0.61  0.68  |  0.46  0.56  0.58
     SAE         0.65  0.60  0.70  |  0.52  0.60  0.64
     SVD           —     —   0.67  |    —     —   0.57
     kCCA          —     —   0.57  |    —     —   0.55
     Bruni         —     —   0.52  |    —     —   0.46
     RNN-640     0.41    —     —   |  0.34    —     —
    Page 8, “Experimental Setup”
  4. The automatically obtained textual and visual attribute vectors serve as input to SVD, kCCA, and our stacked autoencoder (SAE).
    Page 8, “Results”
  5. Results on the categorization task; the three figures per row correspond to the textual (T), visual (V), and bimodal (T+V) variants:
     McRae       0.52  0.31  0.42
     Attributes  0.35  0.37  0.33
     SAE         0.36  0.35  0.43
     SVD           —     —   0.39
     kCCA          —     —   0.37
     Bruni         —     —   0.34
     RNN-640     0.32    —     —
    Page 9, “Results”
  6. We also observe that simply concatenating textual and visual attributes (Attributes, T+V) performs competitively with SVD and better than kCCA.
    Page 9, “Results”
  7. In contrast, all bimodal models (SVD, kCCA, and SAE) are better than their unimodal equivalents and RNN-640.
    Page 9, “Results”
  8. The SAE outperforms both kCCA and SVD by a large margin, delivering clustering performance similar to that of the McRae et al. (2005) norms.
    Page 9, “Results”
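
Item 1 describes the SVD baseline: the textual and visual attribute vectors are concatenated and projected onto a lower-dimensional latent space. A minimal NumPy sketch of such a baseline (matrix names and the dimensionality k are illustrative, not taken from the paper):

    import numpy as np

    def svd_baseline(T, V, k):
        # T and V hold one textual and one visual attribute vector per word,
        # with rows aligned across the two matrices.
        X = np.hstack([T, V])                          # (n_words, d_t + d_v)
        U, S, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :k] * S[:k]                        # k-dimensional latent vectors

    # Example: latent = svd_baseline(T, V, k=100)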

semantic similarity

Appears in 7 sentences as: semantic similarity (7)
In Learning Grounded Meaning Representations with Autoencoders
  1. Next, for each word we randomly selected 30 pairs under the assumption that they are representative of the full variation of semantic similarity.
    Page 6, “Experimental Setup”
  2. Participants were asked to rate a pair on two dimensions, visual and semantic similarity, using a Likert scale of 1 (highly dissimilar) to 5 (highly similar).
    Page 6, “Experimental Setup”
  3. For semantic similarity, the mean correlation was 0.76 (Min = 0.34, Max …)
    Page 6, “Experimental Setup”
  4. For comparison, Patwardhan and Pedersen’s (2006) measure achieved a coefficient of 0.56 on the dataset for semantic similarity and 0.48 for visual similarity.
    Page 7, “Experimental Setup”
  5. We would expect the textual modality to be more dominant when modeling semantic similarity and conversely the perceptual modality to be stronger with respect to visual similarity.
    Page 8, “Results”
  6. The textual SAE correlates better with semantic similarity judgments (ρ = 0.65) than its visual equivalent (ρ = 0.60).
    Page 8, “Results”
  7. It yields a correlation coefficient of ρ = 0.70 on semantic similarity and ρ = 0.64 on visual similarity.
    Page 9, “Results”
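
The correlations quoted in items 3-7 compare model similarities against the elicited ratings. A small sketch of this evaluation step (not the authors' scripts; it assumes cosine similarity between word vectors and Spearman's ρ, which the ρ values above suggest):

    import numpy as np
    from scipy.stats import spearmanr

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def correlation_with_ratings(vectors, rated_pairs):
        # vectors: word -> meaning representation
        # rated_pairs: (word1, word2, human_rating) triples, e.g. the AMT
        # judgments for semantic or for visual similarity.
        model = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in rated_pairs]
        human = [r for _, _, r in rated_pairs]
        rho, _ = spearmanr(model, human)
        return rho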

hidden layer

Appears in 6 sentences as: hidden layer (5) hidden layers (1)
In Learning Grounded Meaning Representations with Autoencoders
  1. To further optimize the parameters of the network, a supervised criterion can be imposed on top of the last hidden layer such as the minimization of a prediction error on a supervised task (Bengio, 2009).
    Page 3, “Autoencoders for Grounded Semantics”
  2. We first train SAEs with two hidden layers (codings) for each modality separately.
    Page 4, “Autoencoders for Grounded Semantics”
  3. Then, we join these two SAEs by feeding their respective second coding simultaneously to another autoencoder, whose hidden layer thus yields the fused meaning representation.
    Page 4, “Autoencoders for Grounded Semantics”
  4. Bimodal Autoencoder The bimodal autoencoder is fed with the concatenated final hidden codings of the visual and textual modalities as input and maps these inputs to a joint hidden layer with B units.
    Page 4, “Autoencoders for Grounded Semantics”
  5. We also encourage the autoencoder to detect dependencies between the two modalities while learning the mapping to the bimodal hidden layer.
    Page 4, “Autoencoders for Grounded Semantics”
  6. This model has the following architecture: the textual autoencoder (see Figure 1, left-hand side) consists of 700 hidden units which are then mapped to the second hidden layer with 500 units (the corruption parameter was set to v = 0.1); the visual autoencoder (see Figure 1, right-hand side) has 170 and 100 hidden units, in the first and second layer, respectively.
    Page 6, “Experimental Setup”
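
Item 6 gives the layer sizes and the corruption parameter of the denoising autoencoders. The sketch below shows one denoising layer with tied weights and tanh units in the spirit of the model description; the masking-noise scheme and variable names are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    def corrupt(x, nu=0.1):
        # Zero out each input dimension with probability nu (corruption
        # parameter; 0.1 for the textual autoencoder described above).
        return x * (rng.random(x.shape) >= nu)

    def denoising_layer(x, W, b, b_prime):
        # Encode a corrupted copy of the input and reconstruct the clean
        # input from the hidden coding; weights are tied (W' = W^T).
        h = np.tanh(W @ corrupt(x) + b)      # e.g. 700 -> 500 units (textual)
        x_hat = np.tanh(W.T @ h + b_prime)   # reconstruction of x
        return h, x_hat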

neural network

Appears in 6 sentences as: neural network (4) neural networks (2)
In Learning Grounded Meaning Representations with Autoencoders
  1. A large body of work has focused on projecting words and images into a common space using a variety of deep learning methods ranging from deep and restricted Boltzmann machines (Srivastava and Salakhutdinov, 2012; Feng et al., 2013), to autoencoders (Wu et al., 2013), and recursive neural networks (Socher et al., 2013b).
    Page 2, “Related Work”
  2. Secondly, our problem setting is different from the former studies, which usually deal with classification tasks and fine-tune the deep neural networks using training data with explicit class labels; in contrast we fine-tune our autoencoders using a semi-supervised criterion.
    Page 3, “Related Work”
  3. Autoencoders An autoencoder is an unsupervised neural network which is trained to reconstruct a given input from its latent representation (Bengio, 2009).
    Page 3, “Autoencoders for Grounded Semantics”
  4. Stacked Autoencoders Several (denoising) autoencoders can be used as building blocks to form a deep neural network (Bengio et al., 2007; Vincent et al., 2010).
    Page 3, “Autoencoders for Grounded Semantics”
  5. Finally, we also compare to the word embeddings obtained using Mikolov et al.’s (2011) recurrent neural network based language model.
    Page 8, “Experimental Setup”
  6. To the best of our knowledge, our model is novel in its use of attribute-based input in a deep neural network.
    Page 9, “Conclusions”

word pairs

Appears in 6 sentences as: Word Pairs (1) Word pairs (1) word pairs (4)
In Learning Grounded Meaning Representations with Autoencoders
  1. We performed a large-scale evaluation on a new dataset consisting of human similarity judgments for 7,576 word pairs.
    Page 2, “Introduction”
  2. 4,435 word pairs constitute the overlap between Nelson et al.’s (1998) norms and McRae et al.’s (2005) nouns.
    Page 6, “Experimental Setup”
  3. This resulted in 7,576 word pairs for which we obtained similarity ratings using Amazon Mechanical Turk (AMT).
    Page 6, “Experimental Setup”
  4. Word Pairs | Semantic | Visual (column headers of Table 4)
    Page 7, “Experimental Setup”
  5. Table 4: Word pairs with highest semantic and visual similarity according to SAE model.
    Page 8, “Results”
  6. Table 4 shows examples of word pairs with highest semantic and visual similarity according to the SAE model.
    Page 9, “Results”

embeddings

Appears in 5 sentences as: embeddings (5)
In Learning Grounded Meaning Representations with Autoencoders
  1. We introduce a new model which uses stacked autoencoders to learn higher-level embeddings from textual and visual input.
    Page 1, “Abstract”
  2. We evaluate the embeddings it produces on two tasks, namely word similarity and categorization.
    Page 2, “Introduction”
  3. Finally, we also compare to the word embeddings obtained using Mikolov et al.’s (2011) recurrent neural network based language model.
    Page 8, “Experimental Setup”
  4. These were pre-trained on Broadcast News data (400M words) using the word2vec tool. We report results with the 640-dimensional embeddings as they performed best.
    Page 8, “Experimental Setup”
  5. This indicates that higher-level embeddings may be beneficial to NLP tasks in general, not only to those requiring multimodal information.
    Page 9, “Results”

deep learning

Appears in 4 sentences as: Deep Learning (1) deep learning (4)
In Learning Grounded Meaning Representations with Autoencoders
  1. The use of stacked autoencoders to extract a shared lexical meaning representation is new to our knowledge, although, as we explain below, it is related to a large body of work on deep learning.
    Page 2, “Related Work”
  2. Multimodal Deep Learning Our work employs deep learning (a.k.a. deep networks) to project linguistic and visual information onto a unified representation that fuses the two modalities together.
    Page 2, “Related Work”
  3. The goal of deep learning is to learn multiple levels of representations through a hierarchy of network architectures, where higher-level representations are expected to help define higher-level concepts.
    Page 2, “Related Work”
  4. A large body of work has focused on projecting words and images into a common space using a variety of deep learning methods ranging from deep and restricted Boltzmann machines (Srivastava and Salakhutdinov, 2012; Feng et al., 2013), to autoencoders (Wu et al., 2013), and recursive neural networks (Socher et al., 2013b).
    Page 2, “Related Work”

loss function

Appears in 4 sentences as: loss function (3) loss functions (1)
In Learning Grounded Meaning Representations with Autoencoders
  1. where L is a loss function, such as cross-entropy.
    Page 3, “Autoencoders for Grounded Semantics”
  2. The reconstruction error for an input x^(i) with loss function L then is:
    Page 3, “Autoencoders for Grounded Semantics”
  3. Unimodal Autoencoders For both modalities, we use the hyperbolic tangent function as activation function for encoder f_θ and decoder g_θ′ and an entropic loss function for L. The weights of each autoencoder are tied, i.e., W′ = W^T.
    Page 4, “Autoencoders for Grounded Semantics”
  4. L_c and L are entropic loss functions, and R is a regularization term over the weight matrices, R = Σ_j ||W^(j)||^2 + ||W^(b)||^2.
    Page 5, “Autoencoders for Grounded Semantics”
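
As a worked illustration of the entropic loss and the regularization term mentioned above (a sketch only: the rescaling of inputs to [0, 1], the weight lam, and the omission of the supervised term L_c are simplifications, not the authors' exact objective):

    import numpy as np

    def entropic_loss(x, x_hat, eps=1e-7):
        # Cross-entropy reconstruction loss L(x, x_hat); assumes both vectors
        # have been rescaled to [0, 1].
        x_hat = np.clip(x_hat, eps, 1.0 - eps)
        return float(-np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat)))

    def regularized_objective(x, x_hat, weights, lam=1e-4):
        # Reconstruction loss plus an L2 penalty R over the weight matrices.
        R = sum(np.sum(W ** 2) for W in weights)
        return entropic_loss(x, x_hat) + lam * R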

natural language

Appears in 4 sentences as: natural language (4)
In Learning Grounded Meaning Representations with Autoencoders
  1. Recent years have seen a surge of interest in single word vector spaces (Turney and Pantel, 2010; Collobert et al., 2011; Mikolov et al., 2013) and their successful use in many natural language applications.
    Page 1, “Introduction”
  2. The visual and textual modalities on which our model is trained are decoupled in that they are not derived from the same corpus (we would expect co-occurring images and text to correlate to some extent) but unified in their representation by natural language attributes.
    Page 2, “Related Work”
  3. As our input consists of natural language attributes, the model would infer textual attributes given visual attributes and vice versa.
    Page 5, “Autoencoders for Grounded Semantics”
  4. The two modalities are encoded as vectors of natural language attributes and are obtained automatically from decoupled text and image data.
    Page 9, “Conclusions”
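
Item 3 mentions inferring the attributes of one modality from the other. One common way to do this with a bimodal autoencoder (stated here as an assumption for illustration, not as the authors' exact procedure) is to encode the observed modality together with a zeroed placeholder for the missing one and decode the missing attributes from the joint coding:

    import numpy as np

    def infer_textual_attributes(h_img, d_txt, W_joint, b_joint, W_dec_txt, b_dec_txt):
        # h_img: visual coding of an object; the textual coding is unknown and
        # replaced by zeros (an illustrative choice). The joint coding is then
        # decoded into the textual attribute space; in the full stacked model
        # this would pass back through the textual decoder layers.
        joint_in = np.concatenate([np.zeros(d_txt), h_img])
        h_joint = np.tanh(W_joint @ joint_in + b_joint)
        return np.tanh(W_dec_txt @ h_joint + b_dec_txt)   # predicted textual attributes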

semi-supervised

Appears in 4 sentences as: semi-supervised (4)
In Learning Grounded Meaning Representations with Autoencoders
  1. Secondly, our problem setting is different from the former studies, which usually deal with classification tasks and fine-tune the deep neural networks using training data with explicit class labels; in contrast we fine-tune our autoencoders using a semi-supervised criterion.
    Page 3, “Related Work”
  2. Alternatively, a semi-supervised criterion can be used (Ranzato and Szummer, 2008; Socher et al., 2011) through combination of the unsupervised training criterion (global reconstruction) with a supervised criterion (prediction of some target given the latent representation).
    Page 3, “Autoencoders for Grounded Semantics”
  3. Stacked Bimodal Autoencoder We finally build a stacked bimodal autoencoder (SAE) with all pre-trained layers and fine-tune them with respect to a semi-supervised criterion.
    Page 4, “Autoencoders for Grounded Semantics”
  4. Furthermore, the semi-supervised setting affords flexibility, allowing the architecture to be adapted to specific tasks.
    Page 5, “Autoencoders for Grounded Semantics”
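
The semi-supervised criterion combines the unsupervised reconstruction loss with a supervised prediction term on top of the latent representation. The following sketch shows that combination with a hypothetical softmax classifier and mixing weight alpha; these specifics (classifier, labels, alpha) are illustrative and not taken from the paper:

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def semi_supervised_objective(recon_loss, h_latent, label_idx, W_cls, b_cls, alpha=0.5):
        # Unsupervised term (global reconstruction) plus a supervised term:
        # negative log-likelihood of a target label predicted from the
        # latent representation by an illustrative softmax layer.
        probs = softmax(W_cls @ h_latent + b_cls)
        supervised = -np.log(probs[label_idx] + 1e-12)
        return recon_loss + alpha * supervised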

vectors representing

Appears in 4 sentences as: vector representations (1) vectors representing (3)
In Learning Grounded Meaning Representations with Autoencoders
  1. The target vector is the sum of x^(i) and the centroid of the remaining attribute vectors representing object o.
    Page 4, “Autoencoders for Grounded Semantics”
  2. As shown in Figure 1, our model takes as input two (real-valued) vectors representing the visual and textual modalities.
    Page 5, “Experimental Setup”
  3. …nodes correspond to words and edges to cosine similarity scores between vectors representing their meaning.
    Page 7, “Experimental Setup”
  4. Table 6 shows examples of clusters produced by Chinese Whispers when using vector representations provided by the SAE model.
    Page 9, “Results”
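
Items 2 and 3 describe a graph whose nodes are words and whose edges carry cosine similarities between their vectors; the clusters in Table 6 are produced by running Chinese Whispers over such a graph. Below is an illustrative NumPy implementation of Chinese Whispers (Biemann, 2006), not the authors' code; the graph construction is reduced to a dense similarity matrix with a simple positive-weight filter:

    import numpy as np

    def chinese_whispers(S, iterations=20, seed=0):
        # S: n x n matrix of cosine similarities (edge weights) between words.
        # Every node starts in its own class and repeatedly adopts the class
        # with the highest total edge weight among its neighbours.
        rng = np.random.default_rng(seed)
        n = S.shape[0]
        labels = np.arange(n)
        for _ in range(iterations):
            for i in rng.permutation(n):
                votes = {}
                for j in range(n):
                    if j != i and S[i, j] > 0:
                        votes[labels[j]] = votes.get(labels[j], 0.0) + S[i, j]
                if votes:
                    labels[i] = max(votes, key=votes.get)
        return labels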

semantic representations

Appears in 3 sentences as: Semantic Representations (1) semantic representations (2)
In Learning Grounded Meaning Representations with Autoencoders
  1. In general, these models specify mechanisms for constructing semantic representations from text corpora based on the distributional hypothesis (Harris, 1970): words that appear in similar linguistic contexts are likely to have related meanings.
    Page 1, “Introduction”
  2. Our model uses stacked autoencoders (Bengio et al., 2007) to induce semantic representations integrating visual and textual information.
    Page 1, “Introduction”
  3. 3.2 Semantic Representations
    Page 3, “Autoencoders for Grounded Semantics”
