How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
Feng, Yansong and Lapata, Mirella

Article Structure

Abstract

In this paper we tackle the problem of automatic caption generation for news images.

Introduction

Recent years have witnessed an unprecedented growth in the amount of digital information available on the Internet.

Related Work

Although image understanding is a popular topic within computer vision, relatively little work has focused on the interplay between visual and linguistic information.

Problem Formulation

We formulate image caption generation as follows.

Image Annotation

As mentioned earlier, our approach relies on an image annotation model to provide description keywords for the picture.

Extractive Caption Generation

Much work in summarization to date focuses on sentence extraction where a summary is created simply by identifying and subsequently concatenating the most important sentences in a document.

Abstractive Caption Generation

Although extractive methods yield grammatical captions and require relatively little linguistic analysis, there are a few caveats to consider.

Experimental Setup

In this section we discuss our experimental design for assessing the performance of the caption generation models presented above.

Results

Table 2 reports our results on the test set using TER.

Conclusions

We have presented extractive and abstractive models that generate image captions for news articles.

Topics

phrase-based

Appears in 15 sentences as: Phrase-based (1) phrase-based (14)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. Phrase-based Model: The model outlined in equation (8) will generate captions with function words.
    Page 6, “Abstractive Caption Generation”
  2. Search: To generate a caption it is necessary to find the sequence of words that maximizes $P(w_1, w_2, \ldots, w_n)$ for the word-based model (equation (8)) and $P(p_1, p_2, \ldots, p_m)$ for the phrase-based model (equation (15)); a search sketch follows this list.
    Page 7, “Abstractive Caption Generation”
  3. Documents and captions were parsed with the Stanford parser (Klein and Manning, 2003) in order to obtain dependencies for the phrase-based abstractive model.
    Page 7, “Experimental Setup”
  4. We tuned the caption length parameter on the development set using a range of [5, 14] tokens for the word-based model and [2, 5] phrases for the phrase-based model.
    Page 7, “Experimental Setup”
  5. For the phrase-based model, we also experimented with reducing the search scope, either by considering only the n most similar sentences to the keywords (range [2,10]), or simply the single most similar sentence and its neighbors (range [2, 5]).
    Page 7, “Experimental Setup”
  6. We randomly selected 12 document-image pairs from the test set and generated captions for them using the best extractive system, and two abstractive systems (word-based and phrase-based).
    Page 8, “Experimental Setup”
  7. different from phrase-based abstractive system.
    Page 8, “Results”
  8. Table 4: Captions written by humans (G) and generated by extractive (KL), word-based abstractive (AW), and phrase-based abstractive (AP) systems.
    Page 9, “Results”
  9. It is significantly worse than the phrase-based abstractive system (α < 0.01), the extractive system (α < 0.01), and the gold standard (α < 0.01).
    Page 9, “Results”
  10. Unsurprisingly, the phrase-based system is significantly less grammatical than the gold standard and the extractive system, whereas the latter is perceived as equally grammatical as the gold standard (the difference in the means is not significant).
    Page 9, “Results”
  11. With regard to relevance, the word-based system is significantly worse than the phrase-based system, the extractive system, and the gold standard.
    Page 9, “Results”
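
Item 2's search problem can be made concrete with a short sketch. The beam search below looks for a high-scoring word sequence under a generic left-to-right scoring function; `score_word` is a hypothetical stand-in for the paper's word-based model (equation (8)), not its actual implementation, and the same idea extends to phrases for equation (15).

```python
import heapq

def beam_search_caption(vocab, score_word, length, beam_width=10):
    """Find a high-scoring word sequence w1..wn, left to right.

    score_word(history, w) -> log P(w | history); a hypothetical
    stand-in for the paper's word-based model (equation (8)).
    """
    beam = [(0.0, [])]  # (cumulative log-probability, partial caption)
    for _ in range(length):
        candidates = [
            (logp + score_word(seq, w), seq + [w])
            for logp, seq in beam
            for w in vocab
        ]
        # Keep only the beam_width best partial captions.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]
```

In practice the candidate vocabulary would be restricted (for example, to the annotation keywords and document words), in the spirit of the reduced search scope described in item 5, since scoring the full vocabulary at every position is wasteful.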

TER

Appears in 10 sentences as: TER (11)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. Our automatic evaluation was based on Translation Edit Rate (TER; Snover et al., 2006).
    Page 7, “Experimental Setup”
  2. TER is defined as the minimum number of edits a human would have to perform to change the system output so that it exactly matches a reference translation.
    Page 7, “Experimental Setup”
  3. $\mathrm{TER}(E, E_r) = \dfrac{\mathrm{Ins} + \mathrm{Del} + \mathrm{Sub} + \mathrm{Shft}}{M}$ (16), where $M$ is the number of words in the reference.
    Page 7, “Experimental Setup”
  4. TER is similar to word error rate, the only difference being that it allows shifts.
    Page 7, “Experimental Setup”
  5. A perfect TER score is 0; note, however, that TER can exceed 1 due to insertions (a simplified TER sketch follows this list).
    Page 7, “Experimental Setup”
  6. Model | TER | AvgLen
    Page 8, “Experimental Setup”
  7. Table 2: TER results for extractive, abstractive models, and lead sentence baseline; *: sig.
    Page 8, “Experimental Setup”
  8. We used TER to compare the output of our extractive and abstractive models and also for parameter tuning (see the discussion above).
    Page 8, “Experimental Setup”
  9. Table 2 reports our results on the test set using TER.
    Page 8, “Results”
  10. The abstractive models obtain the best TER scores overall; however, they generate shorter captions than the other models (closer to the length of the gold standard), and as a result TER treats them favorably simply because fewer edits are needed.
    Page 8, “Results”
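
As a rough illustration of equation (16), the sketch below computes a simplified TER: word-level edit distance (Ins + Del + Sub) divided by reference length. True TER also allows block shifts (Shft); omitting them, as here, reduces the measure to the word error rate mentioned in item 4.

```python
def simple_ter(hyp, ref):
    """Word-level edit distance over reference length (TER without shifts)."""
    h, r = hyp.split(), ref.split()
    # Standard dynamic-programming edit distance (Ins + Del + Sub).
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(h)][len(r)] / len(r)

# 0 is a perfect score; values above 1 are possible, e.g., for a
# hypothesis much longer than the reference (extra insertions).
print(simple_ter("a man walks", "a man walks home"))  # 0.25
```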

extractive system

Appears in 6 sentences as: extractive system (6)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. We randomly selected 12 document-image pairs from the test set and generated captions for them using the best extractive system, and two abstractive systems (word-based and phrase-based).
    Page 8, “Experimental Setup”
  2. Table 3 reports mean ratings for the output of the extractive system (based on the KL divergence), the two abstractive systems, and the human-authored gold standard caption.
    Page 8, “Results”
  3. It is significantly worse than the phrase-based abstractive system (α < 0.01), the extractive system (α < 0.01), and the gold standard (α < 0.01).
    Page 9, “Results”
  4. Unsurprisingly, the phrase-based system is significantly less grammatical than the gold standard and the extractive system, whereas the latter is perceived as equally grammatical as the gold standard (the difference in the means is not significant).
    Page 9, “Results”
  5. With regard to relevance, the word-based system is significantly worse than the phrase-based system, the extractive system, and the gold standard.
    Page 9, “Results”
  6. Interestingly, the phrase-based system performs on the same level with the human gold standard (the difference in the means is not significant) and significantly better than the extractive system.
    Page 9, “Results”

generation models

Appears in 6 sentences as: general model (1) generation model (2) generation models (3)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. Inspired by recent work in summarization, we propose extractive and abstractive caption generation models.
    Page 1, “Abstract”
  2. It is important to note that the caption generation models we propose are not especially tied
    Page 4, “Image Annotation”
  3. Despite its simplicity, the caption generation model in (7) has a major drawback.
    Page 5, “Abstractive Caption Generation”
  4. After integrating the attachment probabilities into equation (12), the caption generation model becomes:
    Page 6, “Abstractive Caption Generation”
  5. In this section we discuss our experimental design for assessing the performance of the caption generation models presented above.
    Page 7, “Experimental Setup”
  6. Rather than adopting a two-stage approach, where the image processing and caption generation are carried out sequentially, a more general model should integrate the two steps in a unified framework.
    Page 9, “Conclusions”

gold standard

Appears in 6 sentences as: Gold Standard (1) gold standard (6)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. Mean ratings (Grammaticality / Relevance): KL Divergence 6.42*† / 4.10*†; Abstract Words 2.08† / 3.20†; Abstract Phrases 4.80* / 4.96*; Gold Standard 6.39*† / 5.55*
    Page 8, “Results”
  2. The abstractive models obtain the best TER scores overall; however, they generate shorter captions than the other models (closer to the length of the gold standard), and as a result TER treats them favorably simply because fewer edits are needed.
    Page 8, “Results”
  3. Table 3 reports mean ratings for the output of the extractive system (based on the KL divergence), the two abstractive systems, and the human-authored gold standard caption.
    Page 8, “Results”
  4. It is significantly worse than the phrase-based abstractive system (α < 0.01), the extractive system (α < 0.01), and the gold standard (α < 0.01).
    Page 9, “Results”
  5. Unsurprisingly, the phrase-based system is significantly less grammatical than the gold standard and the extractive system, whereas the latter is perceived as equally grammatical as the gold standard (the difference in the means is not significant).
    Page 9, “Results”
  6. Interestingly, the phrase-based system performs on the same level with the human gold standard (the difference in the means is not significant) and significantly better than the extractive system.
    Page 9, “Results”

natural language

Appears in 4 sentences as: natural language (4)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. The picture is first analyzed using image processing techniques into an abstract representation, which is then rendered into a natural language description with a text generation engine.
    Page 2, “Related Work”
  2. They extract features of human motion and interleave them with a concept hierarchy of actions to create a case frame from which a natural language sentence is generated.
    Page 2, “Related Work”
  3. Within natural language processing most previous efforts have focused on generating captions to accompany complex graphical presentations (Mittal et al., 1998; Corio and Lapalme, 1999; Fasciano and Lapalme, 2000; Feiner and McKeown, 1990) or on using the captions accompanying information graphics to infer their intended message, e.g., the author’s goal to convey ostensible increase or decrease of a quantity of interest (Elzer et al., 2005).
    Page 2, “Related Work”
  4. Given an image I and a related knowledge database K, create a natural language description C which captures the main content of the image under K. Specifically, in the news story scenario, we will generate a caption C for an image I and its accompanying document D. The training data thus consists of document-image-caption tuples.
    Page 2, “Problem Formulation”

probabilistic model

Appears in 4 sentences as: probabilistic model (3) probabilistic models (1)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. Our experiments made use of the probabilistic model presented in Feng and Lapata (2010).
    Page 3, “Image Annotation”
  2. Any probabilistic model with broadly similar properties could serve our purpose.
    Page 4, “Image Annotation”
  3. Word-based Model: Our first abstractive model builds on and extends a well-known probabilistic model of headline generation (Banko et al., 2000).
    Page 5, “Abstractive Caption Generation”
  4. As can be seen, the probabilistic models (KL and JS divergence) outperform word overlap and cosine similarity (all differences are statistically significant, p < 0.01). They make use of the same topic model as the image annotation model, and are thus able to select sentences that cover common content.
    Page 8, “Results”

topic model

Appears in 4 sentences as: topic model (3) topic models (1)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. The basic idea underlying LDA, and topic models in general, is that each document is composed of a probability distribution over topics, where each topic represents a probability distribution over words.
    Page 4, “Image Annotation”
  2. Probabilistic Similarity: Recall that the backbone of our image annotation model is a topic model with images and documents represented as a probability distribution over latent topics.
    Page 4, “Extractive Caption Generation”
  3. The underlying topic model was trained with 1,000 topics using only content words (i.e., nouns, verbs, and adjectives) that appeared
    Page 7, “Experimental Setup”
  4. As can be seen, the probabilistic models (KL and JS divergence) outperform word overlap and cosine similarity (all differences are statistically significant, p < 0.01). They make use of the same topic model as the image annotation model, and are thus able to select sentences that cover common content.
    Page 8, “Results”

cosine similarity

Appears in 3 sentences as: Cosine Similarity (1) cosine similarity (2)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. Cosine Similarity: Word overlap is admittedly a naive measure of similarity, based on lexical identity (both measures are sketched after this list).
    Page 4, “Extractive Caption Generation”
  2. We compare four extractive models based on word overlap, cosine similarity, and two probabilistic similarity measures, namely KL and JS divergence, and two abstractive models based on words (see equation (8)) and phrases (see equation (15)).
    Page 8, “Results”
  3. As can be seen, the probabilistic models (KL and JS divergence) outperform word overlap and cosine similarity (all differences are statistically significant, p < 0.01). They make use of the same topic model as the image annotation model, and are thus able to select sentences that cover common content.
    Page 8, “Results”
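
A minimal sketch of the two simpler extractive rankers discussed above: sentences are scored against the annotation keywords by cosine similarity over bag-of-words counts (word overlap would simply count shared terms). The keywords and sentences below are invented for illustration.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words counts, punctuation stripped."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * b[w] for w, c in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical annotation keywords and candidate caption sentences:
keywords = bow("striker goal celebrates")
sentences = ["The striker celebrates his late goal.",
             "Ticket sales rose sharply last year."]

best = max(sentences, key=lambda s: cosine(bow(s), keywords))
print(best)  # the sentence sharing the most content with the keywords
```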

language model

Appears in 3 sentences as: language model (3)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. Specifically, we use an adaptive language model (Kneser et al., 1997) that modifies an
    Page 5, “Abstractive Caption Generation”
  2. where $P(w_i \in C \mid w_i \in D)$ is the probability of $w_i$ appearing in the caption given that it appears in the document $D$, and $P_{adap}(w_i \mid w_{i-1}, w_{i-2})$ the language model adapted with probabilities from our image annotation model:
    Page 6, “Abstractive Caption Generation”
  3. The scaling parameter β for the adaptive language model was also tuned on the development set using a range of [0.5, 0.9] (an adaptation sketch follows this list).
    Page 7, “Experimental Setup”
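
One plausible reading of the adaptation in item 2 is unigram rescaling: a background trigram probability is boosted by how much more likely a word is under the image annotation model than under the background unigram, raised to the power β. The sketch below follows that reading of Kneser et al. (1997); the callable interfaces (`base_trigram`, `base_unigram`, `annot_prob`) are assumptions for illustration, not the paper's API.

```python
def adapted_prob(w, history, base_trigram, base_unigram, annot_prob,
                 beta=0.7, vocab=None):
    """Unigram-rescaled trigram probability, in the spirit of
    Kneser et al. (1997).

    base_trigram(w, history) -> background P(w | w_{i-2}, w_{i-1})
    base_unigram(w)          -> background P(w)
    annot_prob(w)            -> P(w) under the image annotation model
    beta                     -> scaling parameter, tuned in [0.5, 0.9]
    """
    def scale(v):
        # Boost words the annotation model finds salient for this image.
        return (annot_prob(v) / base_unigram(v)) ** beta

    score = scale(w) * base_trigram(w, history)
    if vocab is None:
        return score  # unnormalised; sufficient for ranking candidates
    # Renormalise so the probabilities sum to one over the vocabulary.
    z = sum(scale(v) * base_trigram(v, history) for v in vocab)
    return score / z
```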

LDA

Appears in 3 sentences as: LDA (3)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. Latent Dirichlet Allocation (LDA; Blei et al., 2003).
    Page 4, “Image Annotation”
  2. The basic idea underlying LDA , and topic models in general, is that each document is composed of a probability distribution over topics, where each topic represents a probability distribution over words.
    Page 4, “Image Annotation”
  3. Examples include PLSA-based approaches to image annotation (e.g., Monay and Gatica-Perez 2007) and correspondence LDA (Blei and Jordan, 2003).
    Page 4, “Image Annotation”
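
The per-document topic distributions described in item 2 can be recovered with any off-the-shelf LDA implementation. Below is a minimal sketch using scikit-learn's LatentDirichletAllocation (the paper's annotation model is a multimodal extension over image and text features, which this toy example does not attempt to reproduce).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the striker scored a late goal",
        "parliament debated the new budget",
        "the goalkeeper saved a penalty"]

# Bag-of-words counts; the paper restricts the vocabulary to content
# words (nouns, verbs, adjectives), approximated here with a stoplist.
X = CountVectorizer(stop_words="english").fit_transform(docs)

# The paper trains 1,000 topics on a large corpus; 2 suffice for a toy run.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each row is one document's probability distribution over topics.
print(lda.transform(X))
```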

manual annotation

Appears in 3 sentences as: manual annotation (2) manually annotated (1)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. Obtaining training data in this setting does not require expensive manual annotation as many articles are published together with captioned images.
    Page 1, “Introduction”
  2. The image parser is trained on a corpus, manually annotated with graphs representing image structure.
    Page 2, “Related Work”
  3. Instead of relying on manual annotation or background ontological information we exploit a multimodal database of news articles, images, and their captions.
    Page 2, “Related Work”

news articles

Appears in 3 sentences as: news articles (3)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. Instead of relying on manual annotation or background ontological information we exploit a multimodal database of news articles, images, and their captions.
    Page 2, “Related Work”
  2. It is well known that news articles are written so that the lead contains the most important information in a story. This is an encouraging result as it highlights the importance of the visual information for the caption generation task.
    Page 8, “Results”
  3. We have presented extractive and abstractive models that generate image captions for news articles.
    Page 9, “Conclusions”

topic distribution

Appears in 3 sentences as: topic distribution (2) topic distributions (2)
In How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
  1. The image annotation model takes the topic distributions into account when finding the most likely keywords for an image and its associated document.
    Page 4, “Image Annotation”
  2. The similarity between an image and a sentence can be broadly measured by the extent to which they share the same topic distributions (Steyvers and Griffiths, 2007).
    Page 5, “Extractive Caption Generation”
  3. $D(p, q) = \sum_{j=1}^{K} p_j \log_2 \dfrac{p_j}{q_j}$ (4), where $p$ and $q$ are shorthand for the image topic distribution $P_{dMix}$ and the sentence topic distribution $P_{S_d}$, respectively.
    Page 5, “Extractive Caption Generation”
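
Equation (4) above translates directly into code. The sketch below adds a small smoothing constant, since the KL divergence is undefined whenever some $q_j = 0$; that smoothing is an implementation convenience assumed here, not a detail from the paper. The symmetric JS divergence used by the other probabilistic ranker is included for comparison.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D(p, q) = sum_j p_j * log2(p_j / q_j), cf. equation (4)."""
    return sum(pj * math.log2(pj / (qj + eps))
               for pj, qj in zip(p, q) if pj > 0)

def js_divergence(p, q):
    """Symmetrised variant: average KL divergence to the midpoint."""
    m = [(pj + qj) / 2 for pj, qj in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy image and sentence topic distributions over K = 3 latent topics:
p_image = [0.7, 0.2, 0.1]
p_sentence = [0.6, 0.3, 0.1]
print(kl_divergence(p_image, p_sentence))
print(js_divergence(p_image, p_sentence))
```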
