Collective Generation of Natural Image Descriptions
Kuznetsova, Polina and Ordonez, Vicente and Berg, Alexander and Berg, Tamara and Choi, Yejin

Article Structure

Abstract

We present a holistic data-driven approach to image description generation, exploiting the vast amount of (noisy) parallel image data and associated natural language descriptions available on the web.

Introduction

Automatically describing images in natural language is an intriguing, but complex AI task, requiring accurate computational visual recognition, comprehensive world knowledge, and natural language generation.

Vision & Phrase Retrieval

For a query image, we retrieve relevant candidate natural language phrases by visually comparing the query image to database images from the SBU Captioned Photo Collection (Ordonez et al., 2011) (1 million photographs with associated human-composed descriptions).
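The retrieval step can be sketched as a nearest-neighbour search over image features. The feature representation and similarity measure below (cosine over generic feature vectors) are illustrative assumptions; the paper combines several visual similarity measures over the SBU collection.

```python
import numpy as np

def retrieve_candidate_phrases(query_feat, db_feats, db_phrases, k=5):
    """Return phrases from the k visually most similar database images.

    query_feat: (d,) feature vector for the query image
    db_feats:   (n, d) feature matrix for the captioned database images
    db_phrases: list of n phrase lists parsed from the captions
    """
    # Cosine similarity between the query and every database image.
    q = query_feat / (np.linalg.norm(query_feat) + 1e-12)
    db = db_feats / (np.linalg.norm(db_feats, axis=1, keepdims=True) + 1e-12)
    sims = db @ q
    top = np.argsort(-sims)[:k]
    # Pool the phrases of the nearest neighbours as candidates.
    return [p for i in top for p in db_phrases[i]]
```

With toy two-dimensional features, a query close to the first database image pools that image's phrases first, then those of the next-nearest neighbour.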

Overview of ILP Formulation

For each image, we aim to generate multiple sentences, each sentence corresponding to a single distinct object detected in the given image.

Image-level Content Planning

First we describe image-level content planning, i.e., abstract generation.

Surface Realization

Recall that for each image, the computer vision system identifies phrases from descriptions of images that are similar in a variety of aspects.

Topics

ILP

Appears in 31 sentences as: ILP (34)
In Collective Generation of Natural Image Descriptions
  1. We employ Integer Linear Programming (ILP) as an optimization framework that has been used successfully in other generation tasks (e.g., Clarke and Lapata (2006), Martins and Smith (2009), Woodsend and Lapata (2010)).
    Page 2, “Introduction”
  2. Our ILP formulation encodes a rich set of linguistically motivated constraints and weights that incorporate multiple aspects of the generation process.
    Page 2, “Introduction”
  3. For a query image, we first retrieve candidate descriptive phrases from a large image-caption database using measures of visual similarity. We then generate a coherent description from these candidates using ILP formulations for content planning (§4) and surface realization (§5).
    Page 2, “Introduction”
  4. The ILP formulation of §4 addresses T1 & T2, i.e., content planning, and the ILP of §5 addresses T3 & T4, i.e., surface realization.¹
    Page 3, “Overview of ILP Formulation”
  5. ¹It is possible to create one conjoined ILP formulation to address all four operations T1–T4 at once.
    Page 3, “Overview of ILP Formulation”
  6. This trick helps the ILP solver to generate sentences with varying numbers of phrases, rather than always selecting the maximum number of phrases allowed.
    Page 6, “Surface Realization”
  7. Baselines: We compare our ILP approaches with two nontrivial baselines: the first is an HMM approach (comparable to Yang et al.
    Page 6, “Surface Realization”
  8. [Table excerpt] Scores with vs. without cognitive phrases: HMM with 0.111, HMM w/o 0.114, ILP with 0.114, ILP w/o 0.116.
    Page 7, “Surface Realization”
  9. [Table excerpt] ILP selection rate, ILP vs. HMM (w/o cognitive phrases): 67.2%.
    Page 7, “Surface Realization”

See all papers in Proc. ACL 2012 that mention ILP.


objective function

Appears in 9 sentences as: Objective Function (2) objective function (7)
In Collective Generation of Natural Image Descriptions
  1. 4.1 Variables and Objective Function The following set of indicator variables encodes the selection of objects and ordering:
    Page 3, “Image-level Content Planning”
  2. The objective function, F, that we will maximize is a weighted linear combination of these indicator variables and can be optimized using integer linear programming:
    Page 3, “Image-level Content Planning”
  3. We use IBM CPLEX to optimize this objective function subject to the constraints introduced next in §4.2.
    Page 3, “Image-level Content Planning”
  4. Note that in the objective function, we subtract this quantity from the function to be maximized.
    Page 4, “Image-level Content Planning”
  5. 5.1 Variables and Objective Function The following set of variables encodes the selection of phrases and their ordering in constructing S sentences.
    Page 4, “Surface Realization”
  6. Finally, we define the objective function F as:
    Page 4, “Surface Realization”
  7. the objective function (Eq.
    Page 6, “Surface Realization”
  8. Note that we subtract this cohesion score from the objective function.
    Page 6, “Surface Realization”
  9. Perhaps the most important difference in our approach is the use of negative weights in the objective function to create the necessary tension between selection (salience) and compatibility, which makes it possible for ILP to generate variable length descriptions, effectively correcting some of the erroneous vision detections.
    Page 8, “Surface Realization”
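Schematically, the objective described above is a weighted linear combination of 0/1 indicator variables maximized subject to linear constraints; the notation below is generic rather than the paper's exact formulation:

```latex
\max_{x \in \{0,1\}^n} \; F \;=\; \sum_{i} w_i\, x_i \;-\; \sum_{i,j} c_{ij}\, x_{ij}
\qquad \text{s.t.} \qquad A x \le b
```

Here each $w_i$ rewards selecting unit $i$ (salience), the subtracted $c_{ij}$ penalizes incompatible or incohesive pairings, and these negative contributions are what allow the solver to return fewer than the maximum number of units (item 9 above).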


natural language

Appears in 4 sentences as: natural language (5)
In Collective Generation of Natural Image Descriptions
  1. We present a holistic data-driven approach to image description generation, exploiting the vast amount of (noisy) parallel image data and associated natural language descriptions available on the web.
    Page 1, “Abstract”
  2. Automatically describing images in natural language is an intriguing, but complex AI task, requiring accurate computational visual recognition, comprehensive world knowledge, and natural language generation.
    Page 1, “Introduction”
  3. By judiciously exploiting the correspondence between image content elements and phrases, it is possible to generate natural language descriptions that are substantially richer in content and more linguistically interesting than previous work.
    Page 2, “Introduction”
  4. For a query image, we retrieve relevant candidate natural language phrases by visually comparing the query image to database images from the SBU Captioned Photo Collection (Ordonez et al., 2011) (1 million photographs with associated human-composed descriptions).
    Page 2, “Vision & Phrase Retrieval”


co-occurrence

Appears in 3 sentences as: co-occurrence (3)
In Collective Generation of Natural Image Descriptions
  1. NPMI(ngr) is the normalized point-wise mutual information (Eq. 21).⁴ Co-occurrence Cohesion Score: To capture long-distance cohesion, we introduce a co-occurrence-based score, which measures order-preserved co-occurrence statistics between the head words h_si and h_qu.⁵
    Page 6, “Surface Realization”
  2. co-occurrence cohesion is computed as:
    Page 6, “Surface Realization”
  3. Final Cohesion Score: Finally, the pairwise phrase cohesion score F_siqu is a weighted sum of the n-gram and co-occurrence cohesion scores: F_siqu = (α · F_siqu^NGRAM + β · F_siqu^CO) / (α + β)  (23)
    Page 6, “Surface Realization”
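As a concrete reference for the NPMI quantity in Eq. 21, here is the standard normalized point-wise mutual information computed from ordered bigram counts; the paper applies it to n-grams over a large corpus, so the bigram restriction and the toy counts are simplifying assumptions:

```python
import math

def npmi(bigram_counts, w1, w2):
    """Normalized point-wise mutual information for an ordered word pair.

    bigram_counts: dict mapping (first_word, second_word) -> count.
    NPMI = PMI / (-log p(w1, w2)), which lies in [-1, 1].
    """
    total = sum(bigram_counts.values())
    p_xy = bigram_counts[(w1, w2)] / total
    # Marginals over the same bigram table: first-slot and second-slot counts.
    p_x = sum(c for (a, _), c in bigram_counts.items() if a == w1) / total
    p_y = sum(c for (_, b), c in bigram_counts.items() if b == w2) / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

The normalization caps the score at 1 (reached when the two words occur only with each other), which keeps cohesion scores comparable across frequency ranges.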


generation process

Appears in 3 sentences as: generation process (3)
In Collective Generation of Natural Image Descriptions
  1. We cast the generation process as constraint optimization problems, collectively incorporating multiple interconnected aspects of language composition for content planning, surface realization and discourse structure.
    Page 1, “Abstract”
  2. Because the generation process sticks relatively closely to the recognized content, the resulting descriptions often lack the kind of coverage, creativity, and complexity typically found in human-written text.
    Page 1, “Introduction”
  3. Our ILP formulation encodes a rich set of linguistically motivated constraints and weights that incorporate multiple aspects of the generation process.
    Page 2, “Introduction”
