Every Picture Tells a Story: Generating Sentences from Images
Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth
University of Illinois at Urbana-Champaign
Most images are from Farhadi et al. (2010)

Goal

Auto-annotation: find text annotations for images

- This is a lot of technology.
- Somebody's screensaver of a pumpkin
- A black laptop is connected to a black Dell monitor
- This is a dual monitor setup
- Old school Computer monitor with way to many stickers on it

Goal

Auto-illustration: find pictures suggested by given text

Yellow train on the tracks.

Overview

- Evaluate the similarity between a sentence and an image
- Build around an intermediate representation

Meaning Space

- A triplet of ⟨object, action, scene⟩
- Predicting a triplet involves solving a multi-label Markov Random Field (MRF)
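
A minimal sketch of triplet prediction as inference in a three-node MRF, assuming label sets small enough for brute-force enumeration. The labels, the potential values, and the function `best_triplet` are hypothetical and only illustrate the structure; they are not the paper's potentials or its inference procedure.

```python
from itertools import product

def best_triplet(node_pot, edge_pot):
    """Return the (object, action, scene) triplet with the highest total score.

    node_pot: dict with keys "object", "action", "scene", each mapping label -> score.
    edge_pot: dict mapping a pair of labels (a, b) -> score; missing pairs score 0.
    """
    best, best_score = None, float("-inf")
    for o, a, s in product(node_pot["object"], node_pot["action"], node_pot["scene"]):
        # Node terms: one score per chosen label.
        score = node_pot["object"][o] + node_pot["action"][a] + node_pot["scene"][s]
        # Edge terms: one score per pair of chosen labels.
        score += edge_pot.get((o, a), 0.0) + edge_pot.get((a, s), 0.0) + edge_pot.get((o, s), 0.0)
        if score > best_score:
            best, best_score = (o, a, s), score
    return best, best_score

# Toy potentials (made-up numbers).
node_pot = {
    "object": {"horse": 0.9, "train": 0.2},
    "action": {"ride": 0.6, "move": 0.5},
    "scene": {"field": 0.7, "track": 0.4},
}
edge_pot = {("horse", "ride"): 0.5, ("train", "move"): 0.8, ("train", "track"): 0.9}

triplet, score = best_triplet(node_pot, edge_pot)
print(triplet, round(score, 2))
```

In this toy example the edge terms outweigh the stronger "horse" node score, so the selected triplet is the one whose labels agree with each other.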

Node Potentials

- Computed as a linear combination of scores from detectors/classifiers
- Image features
  - DPM response: max detection confidence for each class, with the center location, aspect ratio, and scale of that detection
  - Image classification scores: based on geometry, HOG features, and detection response
  - GIST-based scene classification: scores for each scene

Deformable Part-based Model (DPM)

- Uses a sliding-window approach to search over all possible locations
- Adopts Histogram of Oriented Gradients (HOG) features and linear SVM classifiers

Images from Felzenszwalb et al. (2008)

Deformable Part-based Model (DPM)

- Build a HOG pyramid so that fixed-size filters can be used at every scale
- Sum the scores from the root and part filters, minus the deformation costs
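
The scoring rule above can be sketched for a single root placement. This is a simplified 1-D illustration, assuming a quadratic deformation cost; the function `dpm_score`, its parameters, and all numbers are hypothetical and not taken from the DPM implementation.

```python
def dpm_score(root_response, part_responses, anchors, deform_weight, max_shift=2):
    """Score one root placement in a simplified 1-D deformable part model.

    root_response: score of the root filter at this placement.
    part_responses: part_responses[p][x] is part p's filter score at position x.
    anchors: anchors[p] is the ideal position of part p.
    deform_weight: cost per squared unit of displacement from the anchor.
    """
    total = root_response
    for responses, anchor in zip(part_responses, anchors):
        best = float("-inf")
        # Each part picks the displacement that best trades filter score
        # against deformation cost.
        for shift in range(-max_shift, max_shift + 1):
            x = anchor + shift
            if 0 <= x < len(responses):
                best = max(best, responses[x] - deform_weight * shift ** 2)
        total += best
    return total

score = dpm_score(
    root_response=1.0,
    part_responses=[[0.1, 0.2, 0.9, 0.3], [0.5, 0.4, 0.1, 0.0]],
    anchors=[1, 1],
    deform_weight=0.25,
)
print(round(score, 2))
```

Here the first part moves one step away from its anchor because the gain in filter score (0.9 versus 0.2) exceeds the deformation cost of 0.25, while the second part stays at its anchor.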

GIST

- Uses a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) to represent a scene
- Estimates these dimensions from the discrete Fourier transform (DFT) and the windowed DFT

Images from Oliva and Torralba (2001)

Node Potentials

- Node features, similarity features
- Node features
  - A #-of-nodes-dimensional vector
  - Obtained by feeding image features into a linear SVM
- Similarity features
  - Average of the node features over the K nearest neighbors (KNN) of the test image in the training set, found by matching image features
  - Average of the node features over the KNN of the test image in the training set, found by matching those node features
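
The two similarity features differ only in the space used to find neighbors. A minimal sketch, assuming squared Euclidean distance as the matching criterion (the slides do not state the metric); `knn_average` and the toy vectors are hypothetical.

```python
def knn_average(query, train_keys, train_node_feats, k):
    """Average the node-feature vectors of the k training items closest to `query`.

    query: vector for the test image (image features or node features).
    train_keys: vectors in the same space as `query`, one per training image.
    train_node_feats: node-feature vectors, one per training image.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    order = sorted(range(len(train_keys)), key=lambda i: sq_dist(query, train_keys[i]))
    nearest = order[:k]
    dim = len(train_node_feats[0])
    return [sum(train_node_feats[i][d] for i in nearest) / len(nearest) for d in range(dim)]

train_image_feats = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
train_node_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Variant 1: neighbors found by matching image features.
sim_by_image = knn_average([0.2, 0.1], train_image_feats, train_node_feats, k=2)
# Variant 2: neighbors found by matching the node features themselves.
sim_by_node = knn_average([0.1, 0.1, 0.9], train_node_feats, train_node_feats, k=1)
print(sim_by_image, sim_by_node)
```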

Edge Potentials

- One parameter per edge results in a large number of parameters
- Linear combination of multiple initial estimates
- The weights of the linear combination can be learnt
- Initial estimates for an edge between A and B:
  - The normalized frequency of the word A in our corpus, $f(A)$
  - The normalized frequency of the word B in our corpus, $f(B)$
  - The normalized frequency of A and B at the same time, $f(A,B)$
  - $\frac{f(A,B)}{f(A)\,f(B)}$
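
The four estimates can be computed from a caption corpus along these lines. This is a sketch under stated assumptions: frequencies are normalized by the number of captions and a word counts once per caption, which the slides do not specify. The function `edge_estimates` and the toy captions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def edge_estimates(captions, a, b):
    """Return (f(A), f(B), f(A,B), f(A,B) / (f(A) f(B))) for words a and b."""
    n = len(captions)
    word_count, pair_count = Counter(), Counter()
    for caption in captions:
        words = set(caption.lower().split())
        word_count.update(words)
        # Count each unordered pair of distinct words once per caption.
        pair_count.update(frozenset(p) for p in combinations(sorted(words), 2))
    f_a = word_count[a] / n
    f_b = word_count[b] / n
    f_ab = pair_count[frozenset((a, b))] / n
    ratio = f_ab / (f_a * f_b) if f_a > 0 and f_b > 0 else 0.0
    return f_a, f_b, f_ab, ratio

captions = [
    "horse ride field",
    "horse ride track",
    "train move track",
    "horse stand field",
]
f_a, f_b, f_ab, ratio = edge_estimates(captions, "horse", "ride")
print(f_a, f_b, f_ab, round(ratio, 3))
```

A ratio above 1 indicates that the two words co-occur more often than they would if they were independent.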

Sentence Potentials

- Extract (object, action) pairs with the Curran & Clark parser
- Extract head nouns of prepositional phrases, etc., for the scene
- Use Lin similarity to determine the semantic distance between two words
- Determine which actions commonly co-occur, using 8,000 image captions
- Compute sentence node potentials from these measures
- Edge potentials are estimated in the same way as for images

Learning & Inference

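- Learn mapping from image space to meaning space
- Learn mapping from sentence space to meaning space

$$\min_{\omega}\ \frac{\lambda}{2}\|\omega\|^{2} + \frac{1}{n}\sum_{i \in \text{examples}} \xi_i$$

$$\text{s.t.}\ \forall i \in \text{examples}:\quad \omega^{T}\Phi(x_i, y_i) + \xi_i \ \ge\ \max_{y \in \text{meaning space}} \left[\omega^{T}\Phi(x_i, y) + L(y_i, y)\right], \qquad \xi_i \ge 0$$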
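
To make the constraint concrete, the objective can be evaluated for a fixed weight vector by taking each slack at its smallest feasible value. This is a sketch only: the meaning space is enumerated directly, and the feature map `phi`, the Hamming-style loss, and the toy data are assumptions for illustration, not the paper's definitions.

```python
def objective(w, examples, meaning_space, phi, loss, lam):
    """Value of the regularized structured-hinge objective for weights w.

    examples: list of (x, y_true) pairs.
    meaning_space: list of candidate labels y (here, triplets).
    phi(x, y): feature vector as a list of floats.
    loss(y_true, y): non-negative task loss L.
    """
    def score(x, y):
        return sum(wi * fi for wi, fi in zip(w, phi(x, y)))

    slack_sum = 0.0
    for x, y_true in examples:
        # Loss-augmented inference: max over y of w.phi(x, y) + L(y_true, y).
        augmented = max(score(x, y) + loss(y_true, y) for y in meaning_space)
        # Smallest slack that satisfies both constraints.
        slack_sum += max(0.0, augmented - score(x, y_true))
    reg = 0.5 * lam * sum(wi * wi for wi in w)
    return reg + slack_sum / len(examples)

# Hypothetical feature map: look up one detector score per triplet slot.
def phi(x, y):
    return [x[y[0]], x[y[1]], x[y[2]]]

# Hypothetical loss: number of slots that differ.
def loss(y_true, y):
    return sum(1 for a, b in zip(y_true, y) if a != b)

meaning_space = [("horse", "ride", "field"), ("train", "move", "track")]
x = {"horse": 0.9, "ride": 0.6, "field": 0.7, "train": 0.2, "move": 0.5, "track": 0.4}
examples = [(x, ("horse", "ride", "field"))]

value = objective([1.0, 1.0, 1.0], examples, meaning_space, phi, loss, lam=0.1)
print(round(value, 2))
```

With these weights the correct triplet outscores the wrong one by 1.1, but the margin required by the loss is 3, so the slack is 1.9 and the objective is 0.15 + 1.9.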

Learning & Inference

- Search for the best triplet, the one that maximizes

$$\arg\max_{y}\ \omega^{T}\Phi(x_i, y)$$

- A multiplicative model prefers all responses to be good

$$\arg\max_{y}\ \prod \omega^{T}\Phi(x_i, y)$$

- Greedily relax an edge, solve for the best path, and re-score
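
A small illustration of why a multiplicative model prefers all responses to be good. It assumes each candidate triplet comes with a list of positive per-term responses; the numbers and function names are hypothetical, and the greedy edge-relaxation step is not covered here.

```python
from math import prod

def pick_additive(candidates):
    """Choose the candidate with the largest sum of responses."""
    return max(candidates, key=lambda y: sum(candidates[y]))

def pick_multiplicative(candidates):
    """Choose the candidate with the largest product of responses."""
    return max(candidates, key=lambda y: prod(candidates[y]))

# Per-term responses for two candidate triplets (made-up numbers).
candidates = {
    ("horse", "ride", "field"): [0.95, 0.95, 0.05],
    ("person", "stand", "field"): [0.6, 0.6, 0.6],
}

additive_choice = pick_additive(candidates)
multiplicative_choice = pick_multiplicative(candidates)
print(additive_choice, multiplicative_choice)
```

The additive score lets two strong responses compensate for one very weak response, whereas the product is dominated by the weakest term, so the uniformly moderate candidate wins.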

Matching

- Match sentence triplets and image triplets
- Obtain the top-k ranked triplets from the sentence and compute their ranks as image triplets
- Obtain the top-k ranked triplets from the image and compute their ranks as sentence triplets
- Sum the ranks of all these sets

Text information and a similarity measure are used to handle out-of-vocabulary words, i.e. words that occur in sentences but are not learnt by any detector/classifier.
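
One possible reading of the rank-sum matching above, sketched under the assumption that both the image and the sentence assign a score to every triplet in a shared candidate set and that a lower summed rank means a closer match. The functions `ranking` and `match_cost` and the toy scores are hypothetical.

```python
def ranking(scores):
    """Map each triplet to its 1-based rank under `scores` (highest score is rank 1)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {t: r for r, t in enumerate(ordered, start=1)}

def match_cost(image_scores, sentence_scores, k):
    """Sum of cross-ranks of the top-k triplets from each side."""
    image_rank = ranking(image_scores)
    sentence_rank = ranking(sentence_scores)
    top_sentence = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:k]
    top_image = sorted(image_scores, key=image_scores.get, reverse=True)[:k]
    # Rank the sentence's best triplets under the image, and vice versa.
    return sum(image_rank[t] for t in top_sentence) + sum(sentence_rank[t] for t in top_image)

image_scores = {"A": 0.9, "B": 0.5, "C": 0.1}
sentence_1 = {"A": 0.8, "B": 0.3, "C": 0.2}  # agrees with the image ordering
sentence_2 = {"A": 0.1, "B": 0.2, "C": 0.9}  # reversed ordering

cost_1 = match_cost(image_scores, sentence_1, k=2)
cost_2 = match_cost(image_scores, sentence_2, k=2)
print(cost_1, cost_2)
```

The sentence whose top triplets also rank highly for the image receives the lower cost, so it would be preferred as an annotation for that image.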

Evaluation

- Build a dataset of images and sentences from PASCAL 2008 images
- Randomly select 50 images per class (20 classes in total)
- Collect 5 sentences per image on Amazon Mechanical Turk (AMT)
- Manually add labels for triplets of ⟨object, action, scene⟩
- Select 600 images for training and 400 for testing

Measures:
- Tree-F1 measure
  - Build a taxonomy tree for objects, actions, and scenes
  - Calculate the F1 score from precision and recall
  - The Tree-F1 score is the mean of the F1 scores for objects, actions, and scenes
- BLEU score
  - Measures whether the generated triplet appears in the corpus or not
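
A sketch of a Tree-F1 style measure, assuming precision and recall are computed from the edges shared by the predicted and ground-truth root paths in a taxonomy. That definition, the toy taxonomy, and the helper names are assumptions for illustration; the slides only state that F1 is computed per slot and then averaged.

```python
def path_to_root(node, parent):
    """Return the edges (child, parent) on the path from `node` up to the root."""
    edges = []
    while node in parent:
        edges.append((node, parent[node]))
        node = parent[node]
    return edges

def tree_f1(predicted, truth, parent):
    """F1 over taxonomy edges shared by the predicted and true root paths."""
    pred_edges = set(path_to_root(predicted, parent))
    true_edges = set(path_to_root(truth, parent))
    shared = len(pred_edges & true_edges)
    if shared == 0:
        return 0.0
    precision = shared / len(pred_edges)
    recall = shared / len(true_edges)
    return 2 * precision * recall / (precision + recall)

# Toy taxonomy for objects: child -> parent.
parent = {
    "horse": "animal", "dog": "animal", "animal": "entity",
    "train": "vehicle", "vehicle": "entity",
}

exact = tree_f1("horse", "horse", parent)
sibling = tree_f1("dog", "horse", parent)
unrelated = tree_f1("train", "horse", parent)
# Mean over slots; the action and scene values here are placeholders.
mean_score = (sibling + 1.0 + 0.0) / 3
print(exact, sibling, unrelated, mean_score)
```

Unlike exact-match accuracy, this gives partial credit to a prediction that is close in the taxonomy, such as "dog" for "horse".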

Results

Mapping images to meaning space

Results: Auto-annotation

Results: Auto-illustration

A two girls in the store.

A horse being ridden within a fenced area.

Failure Case

Discussion

- Sentences are not generated; they are retrieved from a pool of candidate sentences
- Using a triplet limits what the meaning space can represent
- The proposed dataset is small
- Using Recall@K and median rank as performance measures
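
For reference, the two retrieval measures named in the last point can be computed from the rank of the correct item for each query. A minimal sketch with made-up ranks; `recall_at_k` is a hypothetical helper name.

```python
from statistics import median

def recall_at_k(ranks, k):
    """Fraction of queries whose correct item is ranked within the top k (1-based ranks)."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Rank of the ground-truth sentence (or image) for five queries.
ranks = [1, 3, 7, 2, 15]

r_at_5 = recall_at_k(ranks, 5)
med_rank = median(ranks)
print(r_at_5, med_rank)
```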

Summary

- Proposes a system that computes a score linking an image to a sentence, and vice versa
- Evaluates the methodology on a novel dataset of human-annotated images (the PASCAL Sentence Dataset)
- Provides a quantitative evaluation of the quality of the predictions

A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In Computer Vision – ECCV 2010, pages 15–29. Springer, 2010.

P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pages 1–8. IEEE, 2008.

A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.