Grounded Language Learning from Video Described with Sentences
Yu, Haonan and Siskind, Jeffrey Mark

Article Structure

Abstract

We present a method that learns representations for word meanings from short video clips paired with sentences.

Introduction

People learn language through exposure to a rich perceptual context.

General Problem Formulation

Throughout this paper, lowercase letters are used for variables or hidden quantities while uppercase ones are used for constants or observed quantities.

The Sentence Tracker

Barbu et al.

Detailed Problem Formulation

We adapt the sentence tracker to training on a corpus of R video clips, each paired with a sentence.

The Learning Algorithm

Instantiating the above approach requires a definition for what it means to best explain the R training samples.

Experiment

We filmed 61 video clips (each 3-5 seconds at 640×480 resolution and 40 fps) that depict a variety of different compound events.

Conclusion

We presented a method that learns word meanings from video paired with sentences.

Topics

feature vector

Appears in 13 sentences as: feature vector (7) feature vectors (6)
In Grounded Language Learning from Video Described with Sentences
  1. We associate a feature vector with each frame (detection) of each such track.
    Page 2, “Introduction”
  2. This feature vector can encode image features (including the identity of the particular detector that produced that detection) that correlate with object class; region color, shape, and size features that correlate with object properties; and motion features, such as linear and angular object position, velocity, and acceleration, that correlate with event properties.
    Page 2, “Introduction”
  3. involves computing the associated feature vector for that HMM over the detections in the tracks chosen to fill its arguments.
    Page 3, “Introduction”
  4. This whole sentence is then grounded in a particular video by mapping these participants to particular tracks and instantiating the associated HMMs over those tracks, by computing the feature vectors for each HMM from the tracks chosen to fill its arguments.
    Page 3, “Introduction”
  5. First, we assume that we know the part of speech c_m associated with each lexical entry m, along with the part-of-speech dependent number of states I_c in the HMMs used to represent word meanings in that part of speech, the part-of-speech dependent number of features N_c in the feature vectors used by HMMs to represent word meanings in that part of speech, and the part-of-speech dependent feature-vector computation Φ_c used to compute the features used by HMMs to represent word meanings in that part of speech.
    Page 3, “Introduction”
  6. (q^1, …, q^T) denote the sequence of states q^t that leads to an observed track, B(D^t, j^t, q^t, λ) denote the conditional log probability of observing the feature vector associated with the detection selected by j^t among the detections D^t in frame t, given that the HMM is in state q^t, and A(q^{t-1}, q^t, λ) denote the log transition probability of the HMM.
    Page 5, “The Sentence Tracker”
  7. We further need to generalize F so that it computes the joint score of a sequence of detections, one for each track, G so that it computes the joint measure of coherence between a sequence of pairs of detections in two adjacent frames, and B so that it computes the joint conditional log probability of observing the feature vectors associated with the sequence of detections selected by j^t.
    Page 5, “The Sentence Tracker”
  8. We further need to generalize B so that it computes the joint conditional log probability of observing the feature vectors for the detections in the tracks that are assigned to the arguments of the HMM for each word in the sentence and A so that it computes the joint log transition probability for the HMMs for all words in the sentence.
    Page 5, “The Sentence Tracker”
  9. We use discrete features, namely natural numbers, in our feature vectors, quantized by a binning process.
    Page 6, “Detailed Problem Formulation”
  10. The length of the feature vector may vary across parts of speech.
    Page 6, “Detailed Problem Formulation”
  11. Let N_c denote the length of the feature vector for part of speech c, x_{r,l} denote the time-series (x^1_{r,l}, …
    Page 6, “Detailed Problem Formulation”
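
Items 2 and 9 above describe what each per-detection feature vector encodes (detector identity, region properties, motion) and how it is quantized into natural numbers by binning. The sketch below illustrates one plausible way to compute and quantize such a vector; the specific features, detector IDs, frame rate, and bin boundaries are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def detection_features(det, prev_det, fps=40.0):
    """Illustrative per-detection feature vector: detector identity,
    region size/shape, and simple motion (velocity) features.
    `det` and `prev_det` are assumed to be dicts with keys
    'detector_id', 'x', 'y', 'w', 'h' (box center and size in pixels)."""
    dt = 1.0 / fps
    vx = (det['x'] - prev_det['x']) / dt
    vy = (det['y'] - prev_det['y']) / dt
    return np.array([
        det['detector_id'],            # correlates with object class
        det['w'] * det['h'],           # region size
        det['w'] / det['h'],           # aspect ratio (shape)
        np.hypot(vx, vy),              # speed
        np.arctan2(vy, vx),            # direction of motion
    ])

def quantize(features, bin_edges):
    """Map each continuous feature to a natural number by binning,
    giving the discrete observations consumed by the HMMs."""
    return np.array([int(np.digitize(f, edges)) for f, edges in zip(features, bin_edges)])

# Hypothetical usage: five features, each with its own (assumed) bin boundaries.
bin_edges = [
    np.array([0.5, 1.5, 2.5]),             # detector id (already discrete)
    np.array([500, 2000, 8000]),           # area bins (pixels^2)
    np.array([0.5, 1.0, 2.0]),             # aspect-ratio bins
    np.array([5, 50, 200]),                # speed bins (pixels/s)
    np.array([-np.pi / 2, 0, np.pi / 2]),  # direction bins (radians)
]
det = {'detector_id': 1, 'x': 320, 'y': 240, 'w': 80, 'h': 40}
prev = {'detector_id': 1, 'x': 315, 'y': 240, 'w': 80, 'h': 40}
print(quantize(detection_features(det, prev), bin_edges))
```

In the paper the feature-vector length and computation vary by part of speech (N_c and Φ_c in item 5), so a faithful implementation would dispatch on the part of speech rather than use a single function as above.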

meaning representations

Appears in 5 sentences as: meaning representation (1) meaning representations (4)
In Grounded Language Learning from Video Described with Sentences
  1. Language is grounded by mapping words, phrases, and sentences to meaning representations referring to the world.
    Page 1, “Introduction”
  2. Dominey and Boucher (2005) paired narrated sentences with symbolic representations of their meanings, automatically extracted from video, to learn object names, spatial-relation terms, and event names as a mapping from the grammatical structure of a sentence to the semantic structure of the associated meaning representation .
    Page 1, “Introduction”
  3. Chen and Mooney (2008) learned the language of sportscasting by determining the mapping between game commentaries and the meaning representations output by a rule-based simulation of the game.
    Page 1, “Introduction”
  4. Although most of these methods succeed in learning word meanings from sentential descriptions, they do so only for symbolic or simple visual input (often synthesized); they fail to bridge the gap between language and computer vision, i.e., they do not attempt to extract meaning representations from complex visual scenes.
    Page 1, “Introduction”
  5. The experiment shows that it can correctly learn the meaning representations in terms of HMM parameters for our lexical entries, from highly ambiguous training data.
    Page 9, “Conclusion”

manually annotated

Appears in 3 sentences as: manual annotations (1) manually annotated (2)
In Grounded Language Learning from Video Described with Sentences
  1. We manually annotated each video with several sentences that describe what occurs in that video.
    Page 7, “Experiment”
  2. To evaluate our results, we manually annotated the correctness of each such pair.
    Page 8, “Experiment”
  3. The scores are thresholded to decide hits, which, together with the manual annotations, can generate TP, TN, FP, and FN counts.
    Page 9, “Experiment”
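
The last snippet above thresholds video-sentence scores into hits and compares them against the manual correctness annotations to produce TP, TN, FP, and FN counts. A minimal sketch of that bookkeeping, assuming the scores and boolean annotations come as aligned arrays and using an arbitrary illustrative threshold:

```python
import numpy as np

def confusion_counts(scores, annotations, threshold=0.0):
    """Threshold scores into hits, then compare against manual annotations
    (True = the video-sentence pair was annotated as correct)."""
    hits = np.asarray(scores) >= threshold
    truth = np.asarray(annotations, dtype=bool)
    tp = int(np.sum(hits & truth))    # hit and annotated correct
    tn = int(np.sum(~hits & ~truth))  # miss and annotated incorrect
    fp = int(np.sum(hits & ~truth))   # hit but annotated incorrect
    fn = int(np.sum(~hits & truth))   # miss but annotated correct
    return tp, tn, fp, fn

# Hypothetical example with three video-sentence pairs.
print(confusion_counts([-1.2, 0.7, 0.3], [False, True, False], threshold=0.5))
# -> (1, 2, 0, 0)
```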

Viterbi

Appears in 3 sentences as: Viterbi (4)
In Grounded Language Learning from Video Described with Sentences
  1. This can be determined with the Viterbi (1967) algorithm and is known as detection-based tracking (Viterbi, 1971).
    Page 5, “The Sentence Tracker”
  2. max_{q^1,…,q^T} Σ_t B(D^t, j^t, q^t, λ) + Σ_t A(q^{t-1}, q^t, λ), which can also be found by the Viterbi algorithm.
    Page 5, “The Sentence Tracker”
  3. Eq. 3, which can also be determined using the Viterbi algorithm.
    Page 5, “The Sentence Tracker”
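
The snippets above rely on the Viterbi algorithm both for detection-based tracking and for recovering the best HMM state sequence under the observation scores B and log transition scores A defined earlier. Below is a generic log-space Viterbi sketch over a single track; the matrix layout and names are assumptions for illustration, not the paper's joint sentence-tracker formulation, which optimizes tracks and state sequences for all words in the sentence simultaneously.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely state sequence for a single HMM.
    log_obs:   T x K matrix; log_obs[t, q] is the log observation score
               of the (quantized) feature vector at frame t in state q.
    log_trans: K x K matrix of log transition scores (previous -> current state).
    log_init:  length-K vector of initial log probabilities."""
    T, K = log_obs.shape
    delta = np.full((T, K), -np.inf)    # best score of any path ending in state q at frame t
    psi = np.zeros((T, K), dtype=int)   # back-pointers
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # K x K: previous state -> current state
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(K)] + log_obs[t]
    # Backtrack from the best final state.
    q = np.zeros(T, dtype=int)
    q[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q, float(delta[-1].max())

# Hypothetical usage: 4 frames, 2 states.
rng = np.random.default_rng(0)
path, score = viterbi(rng.normal(size=(4, 2)),
                      np.log([[0.7, 0.3], [0.4, 0.6]]),
                      np.log([0.5, 0.5]))
print(path, score)
```

The same dynamic-programming structure is what makes the joint tracking-and-recognition lattice tractable: enlarging each node to a combination of detection index and HMM state lets detection-based tracking and word recognition share one Viterbi pass.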
