Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
Lazaridou, Angeliki and Bruni, Elia and Baroni, Marco

Article Structure

Abstract

Following up on recent work on establishing a mapping between vector-based semantic embeddings of words and the visual representations of the corresponding objects from natural images, we first present a simple approach to cross-modal vector-based semantics for the task of zero-shot learning, in which an image of a previously unseen object is mapped to a linguistic representation denoting its word.

Introduction

Computational models of meaning that rely on corpus-extracted context vectors, such as LSA (Landauer and Dumais, 1997), HAL (Lund and Burgess, 1996), Topic Models (Griffiths et al., 2007) and more recent neural-network approaches (Collobert and Weston, 2008; Mikolov et al., 2013b) have successfully tackled a number of lexical semantics tasks, where context vector similarity highly correlates with various indices of semantic relatedness (Turney and Pantel, 2010).

Related Work

The problem of establishing word reference has been extensively explored in computational simulations of cross-situational learning (see Fazly et al., 2010).

Zero-shot learning and fast mapping

“We found a cute, hairy wampimuk sleeping behind the tree.” Even though this is certainly the first time one hears about wampimuks, the linguistic context already creates some visual expectations: Wampimuks probably resemble small animals (Figure 1a).

Experimental Setup

4.1 Visual Datasets

Results

Our experiments focus on the tasks of zero-shot learning (Sections 5.1 and 5.2) and fast mapping (Section 5.3).

Conclusion

At the outset of this work, we considered the problem of linking purely language-based distributional semantic spaces to the visual world.

Topics

semantic space

Appears in 18 sentences as: semantic space (15), Semantic Spaces (2), semantic spaces (1)
In Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
  1. This is achieved by means of a simple neural network trained to project image-extracted feature vectors to text-based vectors through a hidden layer that can be interpreted as a cross-modal semantic space.
    Page 2, “Introduction”
  2. We first test the effectiveness of our cross-modal semantic space on the so-called zero-shot learning task (Palatucci et al., 2009), which has recently been explored in the machine learning community (Frome et al., 2013; Socher et al., 2013).
    Page 2, “Introduction” (a sketch of the retrieval step follows this list)
  3. We show that the induced cross-modal semantic space is powerful enough that sensible guesses about the correct word denoting an object can be made, even when the linguistic context vector representing the word has been created from as little as 1 sentence containing it.
    Page 2, “Introduction”
  4. Most importantly, by projecting visual representations of objects into a shared semantic space, we do not limit ourselves to establishing a link between objects…
    Page 2, “Related Work”
  5. (2013) focus on zero-shot learning in the vision-language domain by exploiting a shared visual-linguistic semantic space.
    Page 3, “Related Work”
  6. (2013) learn to project unsupervised vector-based image representations onto a word-based semantic space using a neural network architecture.
    Page 3, “Related Work”
  7. (2013) use linear regression to transform vector-based image representations onto vectors representing the same concepts in linguistic semantic space.
    Page 3, “Related Work”
  8. Concretely, we assume that concepts, denoted for convenience by word labels, are represented in linguistic terms by vectors in a text-based distributional semantic space (see Section 4.3).
    Page 3, “Zero-shot learning and fast mapping”
  9. Objects corresponding to concepts are represented in visual terms by vectors in an image-based semantic space (Section 4.2).
    Page 3, “Zero-shot learning and fast mapping”
  10. 4.2 Visual Semantic Spaces
    Page 4, “Experimental Setup”
  11. 4.3 Linguistic Semantic Spaces
    Page 5, “Experimental Setup”
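
Taken together, the excerpts above describe the zero-shot pipeline: an image vector is projected into the linguistic space (by any of the models discussed under the other topics) and the candidate word vectors are ranked by similarity. Below is a minimal NumPy sketch of the retrieval step, not the authors' code; cosine is used as the similarity, consistent with the excerpts elsewhere on this page.

    import numpy as np

    def zero_shot_rank(w_hat, W_vocab):
        # Rank vocabulary words by cosine similarity to the projected
        # image vector w_hat; W_vocab holds one text-based vector per row.
        sims = (W_vocab @ w_hat) / (
            np.linalg.norm(W_vocab, axis=1) * np.linalg.norm(w_hat))
        return np.argsort(-sims)  # best-first indices into the vocabulary

A prediction counts as correct at level k if the gold word appears among the first k ranked indices, as in the results table quoted under the SVD topic.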


neural network

Appears in 7 sentences as: Neural Network (1), neural network (7)
In Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
  1. This is achieved by means of a simple neural network trained to project image-extracted feature vectors to text-based vectors through a hidden layer that can be interpreted as a cross-modal semantic space.
    Page 2, “Introduction”
  2. (2013) learn to project unsupervised vector-based image representations onto a word-based semantic space using a neural network architecture.
    Page 3, “Related Work”
  3. (2013) rely on a supervised state-of-the-art method: They feed low-level features to a deep neural network trained on a supervised object recognition task (Krizhevsky et al., 2012).
    Page 3, “Related Work”
  4. Neural Network (NNet) The last model that we introduce is a neural network with one hidden layer.
    Page 6, “Experimental Setup”
  5. For the neural network NN, we use prior knowledge
    Page 6, “Results”
  6. The neural network architecture emerged as the best performing approach, and our qualitative analysis revealed that it induced a categorical organization of concepts.
    Page 9, “Conclusion”
  7. Given the success of NN, we plan to experiment in the future with more sophisticated neural network architectures inspired by recent work in machine translation (Gao et al., 2013) and multimodal deep learning (Srivastava and Salakhutdinov, 2012).
    Page 9, “Conclusion”


SVD

Appears in 7 sentences as: SVD (8)
In Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
  1. Singular Value Decomposition (SVD) SVD is the most widely used dimensionality reduction technique in distributional semantics (Turney and Pantel, 2010), and it has recently been exploited to combine visual and linguistic dimensions in the multimodal distributional semantic model of Bruni et al.
    Page 6, “Experimental Setup”
  2. SVD smoothing is also a way to infer values of unseen dimensions in partially incomplete matrices, a technique that has been applied to the task of inferring word tags of unannotated images (Hare et al., 2008).
    Page 6, “Experimental Setup”
  3. Assuming that the concept-representing rows of V_S and W_S are ordered in the same way, we apply the (k-truncated) SVD to the concatenated matrix [V_S W_S], such that [V_S W_S] = U_k Σ_k Z_k^T is a k-rank approximation of the original matrix. The projection function is then:
    Page 6, “Experimental Setup” (a NumPy sketch of this decomposition follows this list)
  4. Model     k=1    k=2    k=3    k=5    k=10   k=20
     Chance    1.1    2.2    3.3    5.5    11.0   22.0
     SVD       1.9    5.0    8.1    14.5   29.0   48.6
     CCA       3.0    6.9    10.7   17.9   31.7   51.7
     lin       2.4    6.4    10.5   18.7   33.0   55.0
     NN        3.9    6.6    10.6   21.9   37.9   58.2
    Page 7, “Results”
  5. For the SVD model, we set the number of dimensions to 300, a common choice in distributional semantics, consistent with the settings we used for the visual and linguistic spaces.
    Page 7, “Results”
  6. Surprisingly, the very simple lin method outperforms both CCA and SVD.
    Page 7, “Results”
  7. The derived vectors were reduced with the same SVD projection induced from the complete corpus.
    Page 8, “Results”
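
The SVD model quoted in item 3 admits a compact sketch. The NumPy fragment below is a minimal illustration, not the authors' code: it assumes the standard SVD-smoothing recipe, in which the textual block of an unseen concept is treated as missing and reconstructed from its visual block via the truncated right singular vectors; the names V_s, W_s and the default k=300 simply mirror the excerpts above.

    import numpy as np

    def fit_svd_projection(V_s, W_s, k=300):
        # Truncated SVD of the concatenated seen-concept matrix [V_s W_s];
        # rows of V_s (image-based) and W_s (text-based) are aligned by concept.
        M = np.hstack([V_s, W_s])
        _, _, Zt = np.linalg.svd(M, full_matrices=False)
        Z_k = Zt[:k].T                      # (d_v + d_w) x k right singular vectors
        d_v = V_s.shape[1]
        return Z_k[:d_v], Z_k[d_v:]         # visual block, textual block

    def project_visual(v, Z_vis, Z_txt):
        # SVD smoothing: with the textual dimensions of an unseen concept set
        # to zero, the k-dim latent code depends only on the visual block; the
        # textual block is then reconstructed from that code.
        latent = v @ Z_vis
        return latent @ Z_txt.T             # estimated text-based vector

The estimated vector can then be compared (e.g., by cosine) to the text-based vectors of candidate words, as in the zero-shot setup.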


distributional semantic

Appears in 5 sentences as: distributional semantic (3), distributional semantics (3)
In Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
  1. Concretely, we assume that concepts, denoted for convenience by word labels, are represented in linguistic terms by vectors in a text-based distributional semantic space (see Section 4.3).
    Page 3, “Zero-shot learning and fast mapping”
  2. For constructing the text-based vectors, we follow a standard pipeline in distributional semantics (Turney and Pantel, 2010) without tuning its parameters and collect co-occurrence statistics from the concatenation of ukWaC and Wikipedia, amounting to 2.7 billion tokens in total.
    Page 5, “Experimental Setup”
  3. Singular Value Decomposition (SVD) SVD is the most widely used dimensionality reduction technique in distributional semantics (Turney and Pantel, 2010), and it has recently been exploited to combine visual and linguistic dimensions in the multimodal distributional semantic model of Bruni et al.
    Page 6, “Experimental Setup”
  4. The cosine has been widely used in the distributional semantic literature, and it has been shown to outperform Euclidean distance (Bullinaria and Levy, 2007). Parameters were estimated with standard backpropagation and L-BFGS.
    Page 6, “Experimental Setup”
  5. For the SVD model, we set the number of dimensions to 300, a common choice in distributional semantics, consistent with the settings we used for the visual and linguistic spaces.
    Page 7, “Results”


hidden layer

Appears in 5 sentences as: hidden layer (5)
In Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
  1. This is achieved by means of a simple neural network trained to project image-extracted feature vectors to text-based vectors through a hidden layer that can be interpreted as a cross-modal semantic space.
    Page 2, “Introduction”
  2. Neural Network (NNet) The last model that we introduce is a neural network with one hidden layer.
    Page 6, “Experimental Setup”
  3. where θ_NN consists of the model weights θ^(1) ∈ R^(d_v×d_h) and θ^(2) ∈ R^(d_h×d_w) that map the input image-based vectors V_S first to the hidden layer and then to the output layer in order to obtain text-based vectors, i.e., Ŵ_S = σ^(2)(σ^(1)(V_S θ^(1)) θ^(2)), where σ^(1) and σ^(2) are
    Page 6, “Experimental Setup” (a NumPy sketch of this forward pass follows this list)
  4. In order to gain qualitative insights into the performance of the projection process of NN, we attempt to investigate the role and interpretability of the hidden layer.
    Page 7, “Results”
  5. The hidden layer acts as a cross-modal concept categorization/organization system.
    Page 7, “Results”
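
Item 3 above spells out the forward pass of the one-hidden-layer network, so it can be transcribed almost directly; only the concrete non-linearities are an assumption, since the excerpt names σ^(1) and σ^(2) without defining them. A minimal NumPy sketch:

    import numpy as np

    def nn_project(V, theta1, theta2):
        # Forward pass W_hat = sigma2(sigma1(V @ theta1) @ theta2):
        # theta1 (d_v x d_h) maps image-based vectors to the hidden layer,
        # theta2 (d_h x d_w) maps hidden activations to text-based vectors.
        hidden = 1.0 / (1.0 + np.exp(-(V @ theta1)))  # sigma1: sigmoid (assumed)
        return np.tanh(hidden @ theta2)               # sigma2: tanh (assumed)

The intermediate hidden activations are the layer that the excerpts interpret as a cross-modal semantic space: each concept lands there on the basis of both its visual input and its textual target.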


objective function

Appears in 4 sentences as: objective function (4)
In Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
  1. The weights are estimated by minimizing the objective function
    Page 6, “Experimental Setup” (a hedged sketch of one plausible such objective follows this list)
  2. (2013), however, our objective function yielded consistently better results in all experimental settings.
    Page 6, “Results”
  3. For this post-hoc analysis, we include a sparsity parameter in the objective function of Equation 5 in order to get more interpretable results; hidden units are therefore maximally activated by only a few concepts.
    Page 7, “Results”
  4. The adaptation of NN is straightforward; the new objective function is derived as
    Page 8, “Results”
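
The excerpts never show the objective itself (Equation 5 in the paper), only that it differs from prior work, that cosine outperforms Euclidean distance, and that training used backpropagation and L-BFGS. Under the assumption that the loss sums cosine distances between projected and gold text-based vectors, a sketch could look as follows; the sparsity term mentioned in item 3 is omitted.

    import numpy as np

    def cosine_loss(W_hat, W):
        # Sum over concepts of (1 - cosine similarity) between projected
        # vectors W_hat and gold text-based vectors W, one row per concept.
        # This form is an assumption, not the paper's Equation 5.
        num = np.sum(W_hat * W, axis=1)
        den = np.linalg.norm(W_hat, axis=1) * np.linalg.norm(W, axis=1)
        return np.sum(1.0 - num / den)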


vectors representing

Appears in 4 sentences as: vector representations (1), vector representing (1), vectors representing (2)
In Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
  1. In this paper, we rely on the same image analysis techniques but instead focus on the reference problem: We do not aim at enriching word representations with visual information, although this might be a side effect of our approach, but we address the issue of automatically mapping objects, as depicted in images, to the context vectors representing the corresponding words.
    Page 2, “Introduction”
  2. We show that the induced cross-modal semantic space is powerful enough that sensible guesses about the correct word denoting an object can be made, even when the linguistic context vector representing the word has been created from as little as 1 sentence containing it.
    Page 2, “Introduction”
  3. First, we conduct experiments with simple image- and text-based vector representations and compare alternative methods to perform cross-modal mapping.
    Page 2, “Introduction”
  4. (2013) use linear regression to transform vector-based image representations onto vectors representing the same concepts in linguistic semantic space.
    Page 3, “Related Work” (a minimal regression sketch follows this list)
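
The linear-regression mapping of item 4 (the lin model in the results table quoted under SVD) reduces to a least-squares problem on the seen concepts. The sketch below adds a small ridge term for numerical stability, which is an assumption; the excerpt says only “linear regression”.

    import numpy as np

    def fit_linear_map(V_s, W_s, lam=1.0):
        # Solve for A minimising ||V_s A - W_s||^2 + lam ||A||^2 over the
        # seen concepts; a new image vector v is mapped as w_hat = v @ A.
        d_v = V_s.shape[1]
        return np.linalg.solve(V_s.T @ V_s + lam * np.eye(d_v), V_s.T @ W_s)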


co-occurrence

Appears in 3 sentences as: co-occurrence (3)
In Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
  1. However, the models induce the meaning of words entirely from their co-occurrence with other words, without links to the external world.
    Page 1, “Introduction”
  2. We apply Local Mutual Information (LMI; Evert, 2005) as a weighting scheme and reduce the full co-occurrence space to 300 dimensions using the Singular Value Decomposition.
    Page 4, “Experimental Setup” (a sketch of the LMI weighting follows this list)
  3. For constructing the text-based vectors, we follow a standard pipeline in distributional semantics (Turney and Pantel, 2010) without tuning its parameters and collect co-occurrence statistics from the concatenation of ukWaC and Wikipedia, amounting to 2.7 billion tokens in total.
    Page 5, “Experimental Setup”
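
Local Mutual Information (item 2) weights each raw co-occurrence count by its pointwise mutual information: LMI(w, c) = count(w, c) · log(p(w, c) / (p(w) p(c))) (Evert, 2005). The sketch below applies that weighting and the 300-dimensional SVD reduction mentioned in the excerpts; the dense-matrix treatment is a simplification, since a space built from 2.7 billion tokens would in practice be sparse.

    import numpy as np

    def lmi_weight(C):
        # C: raw word-by-context co-occurrence counts.
        total = C.sum()
        p_wc = C / total
        p_w = C.sum(axis=1, keepdims=True) / total
        p_c = C.sum(axis=0, keepdims=True) / total
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_wc / (p_w * p_c))
            lmi = np.where(C > 0, C * pmi, 0.0)  # LMI = count * PMI, zero elsewhere
        return lmi

    def reduce_svd(M, k=300):
        # k-dimensional vectors from the truncated SVD, as in the excerpts.
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        return U[:, :k] * s[:k]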
