Learning Visual Representations using Images with Captions
Ariadna Quattoni, Michael Collins, Trevor Darrell
Training visual classifiers when a few examples are available
Problem: Image classification from a few examples can be hard. A good representation of images is crucial.
Solution: Use available resources to learn a good image representation.
Semi-supervised learning

Available resource: a large dataset of unlabeled data.

Pipeline:
• Unsupervised learning on the unlabeled data yields a visual representation $F: I \rightarrow \mathbb{R}^h$.
• $F$ maps the small training set of labeled images to an h-dimensional training set.
• Train a classifier on the h-dimensional training set.
Learning visual representations: Using unlabeled data

Unsupervised learning in data space.

Good thing:
• A lower-dimensional representation preserves relevant statistics of the data sample.

Bad things:
• The representation might not capture the information relevant to an image classification problem.
• The appearance of a visual component may be heterogeneous and hard to discover.
Learning visual components
Want a representation based on visual components.
Visual features that are useful to group together for classification, e.g.:
• teams have: people, medals, …
• crowds have: people, pavement, …
Appearance of a visual component may be heterogeneous
Learning visual representations
Unlabeled data + metadata:
• Images with associated natural language captions.
• Video sequences with associated speech.

How could the metadata help? It provides a hint for discovering important features.

Feature selection paradigm:
• Use the metadata to define "auxiliary tasks".
• Discover feature groupings that are useful for these tasks.
Overview

Available resource: a large dataset of images and captions.

Pipeline:
• Create auxiliary problems from the captions.
• Perform structural learning to obtain a visual representation $F: I \rightarrow \mathbb{R}^h$.
• $F$ maps the small training set of labeled images to an h-dimensional training set.
• Train a classifier on the h-dimensional training set.
Learning visual representations: Learning from auxiliary tasks
[Ando & Zhang, JMLR 2005]

Classification problem, core task: learn $F: X \rightarrow Y$, where $X \subseteq \mathbb{R}^d$ and $Y = \{-1, +1\}$.

• Core training set: $D_{core} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
• Unlabeled set: $U = \{x_1, x_2, \ldots, x_u\}$
• A set of auxiliary problems related to the core task, and a method for generating a collection of auxiliary training sets $\{D_1, D_2, \ldots, D_m\}$ for the auxiliary problems.
Learning visual representations: Using images with captions

Core task: news topic prediction. Learn a function $F_j: X \rightarrow \{-1, +1\}$, where $x \in X$ is an image and $y = 1$ if $x \in topic_j$.
News domain
[Example images: figure skating, ice hockey, Golden Globes, Grammys.]

Dataset: news images from the Reuters web site. Problem: predicting news topics from images.
Learning visual representations: Using images with captions
The Italian team celebrate their gold medal win during the flower ceremony after the final round of the men's team pursuit speedskating at Oval Lingotto during the 2006 Winter Olympics.
Former U.S. President Bill Clinton speaks during a joint news conference with Pakistan's Prime Minister Shaukat Aziz at Prime Minister house in Islamabad.
Diana and Marshall Reed leave the funeral of miner David Lewis in Philippi, West Virginia on January 8, 2006. Lewis was one of 12 miners who died in the Sago Mine.
Senior Hamas leader Khaled Meshaal (2nd-R) is surrounded by his bodyguards after a news conference in Cairo February 8, 2006.
Jim Scherr, the US Olympic Committee's chief executive officer, seen here in 2004, said his group is watching the growing scandal and keeping informed about the NHL's investigation into Rick Tocchet…
U.S. director Stephen Gaghan and his girlfriend Daniela Unruh arrive on the red carpet for the screening of his film 'Syriana' which runs out of competition at the 56th Berlinale International Film Festival.
Auxiliary task: predict "team" from image content.
Learning visual representations: Using images with captions

Image-caption pairs: $U = \{(x_1, c_1), \ldots, (x_u, c_u)\}$

m content words: $(w_1, w_2, \ldots, w_m)$

m auxiliary problems: $p_j(x, c) = 1$ if word $w_j$ appears in caption $c$, and $-1$ otherwise.

Auxiliary training sets: $D_j = \{(x_1, p_j(x_1, c_1)), \ldots, (x_u, p_j(x_u, c_u))\}$
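This construction is easy to state in code. A minimal sketch, assuming `captions` holds the u caption strings and `content_words` the m content words; the helper name and whitespace tokenization are our assumptions, not the paper's:

```python
import numpy as np

def auxiliary_labels(captions, content_words):
    """labels[j, i] is p_j(x_i, c_i): +1 if word w_j appears in
    caption c_i, and -1 otherwise."""
    labels = -np.ones((len(content_words), len(captions)), dtype=int)
    for j, word in enumerate(content_words):
        for i, caption in enumerate(captions):
            if word.lower() in caption.lower().split():
                labels[j, i] = 1
    return labels  # labels[j] is the target vector for auxiliary task j
```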
Structural Learning: Train classifiers for auxiliary tasks

Step 1: Train m linear classifiers, setting the optimal parameters for the j-th auxiliary training set $D_j$ to be

$w_j^* = \arg\min_w \sum_i l(f(w, x_i), y_i) + \frac{C}{2}\|w\|^2$, with $(x_i, y_i) \in D_j$.

We use logistic loss in our experiments.
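Step 1 amounts to fitting one L2-regularized logistic regression per auxiliary problem. A sketch with scikit-learn (note that sklearn's `C` is the inverse of the regularization constant in the objective above, and that each task needs both labels present):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_auxiliary_classifiers(X, labels, C=1.0):
    """X: (u, d) image feature matrix; labels: (m, u) from the previous
    sketch. Returns W = [w_1*, ..., w_m*] as a (d, m) matrix."""
    weights = []
    for y_j in labels:                          # auxiliary problem D_j
        # Skip degenerate tasks: words in all or no captions.
        if len(np.unique(y_j)) < 2:
            weights.append(np.zeros(X.shape[1]))
            continue
        clf = LogisticRegression(C=C, fit_intercept=False)
        clf.fit(X, y_j)                         # logistic loss + L2
        weights.append(clf.coef_.ravel())       # w_j* in R^d
    return np.column_stack(weights)
```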
Structural Learning: Manifold learning in classifier space

Step 2: Perform SVD on the parameter-vector matrix $W = [w_1^*, w_2^*, \ldots, w_m^*]$.

Compute $\theta \in \mathbb{R}^{h \times d}$ by taking the first h eigenvectors of $W W^t$.

$\theta$ defines a linear subspace of dimension h: a good approximation to the space of weights.

Structural learning is manifold learning in classifier weight space.
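Step 2 in numpy: the first h eigenvectors of $W W^t$ are exactly the top h left singular vectors of $W$, so a single SVD suffices (a sketch under the same assumed shapes as above):

```python
import numpy as np

def structural_parameters(W, h):
    """W: (d, m) matrix of auxiliary weight vectors. Returns theta in
    R^{h x d}; its rows are the top-h left singular vectors of W,
    i.e. the first h eigenvectors of W W^t."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :h].T
```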
Structural Learning: Training on the core task

Step 3: Train the core classifier.

Project the data: $q(x) = \theta x$ for $(x_i, y_i) \in D_{core}$, then solve

$v^* = \arg\min_v \sum_i l(f(v, q(x_i)), y_i) + \frac{C}{2}\|v\|^2$.

This is equivalent to training the core task in the original d-dimensional space with the parameter constraint $w_{core} = \theta^t v^*$.
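Step 3, sketched with the same tools: project the small core training set through $\theta$ and fit a linear classifier in the h-dimensional space; the learned $v^*$ can be pulled back to d dimensions via $w_{core} = \theta^t v^*$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_core_task(theta, X_core, y_core, C=1.0):
    Q = X_core @ theta.T                 # q(x) = theta x for each row
    clf = LogisticRegression(C=C, fit_intercept=False)
    clf.fit(Q, y_core)                   # logistic loss + L2, as in Step 1
    v_star = clf.coef_.ravel()           # v* in R^h
    w_core = theta.T @ v_star            # equivalent d-dimensional weights
    return clf, w_core
```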
Structural Learning: Minimizing a joint loss

Class of linear predictors: $f_\theta(v, x) = v^t \theta x$, where $\theta$ is an h-by-d matrix of structural parameters.

Goal: Find the problem-specific parameters $v_j$ and the shared parameters $\theta$ that minimize the joint loss

$L(\{v_j\}, \theta) = \sum_{j=1}^{m} \left[ \text{loss on } D_j + reg(v_j) + reg(\theta) \right]$.
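For concreteness, the joint objective can be written out directly. A numpy sketch with logistic loss (which the slides use) and squared-norm regularizers; the regularization weights and pulling $reg(\theta)$ out of the sum are our choices:

```python
import numpy as np

def joint_loss(theta, vs, datasets, lam_v=1.0, lam_theta=1.0):
    """vs: list of m vectors v_j in R^h; datasets: list of (X_j, y_j)
    pairs with y_j in {-1, +1}. Predictor: f(theta, v, x) = v' theta x."""
    total = lam_theta * np.sum(theta ** 2)            # shared-parameter reg
    for v_j, (X_j, y_j) in zip(vs, datasets):
        margins = y_j * (X_j @ (theta.T @ v_j))       # y * v' theta x
        total += np.sum(np.logaddexp(0.0, -margins))  # logistic loss on D_j
        total += lam_v * np.sum(v_j ** 2)             # problem-specific reg
    return total
```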
Toy Example

Object = { letter, letter, letter }

An object: AbC
The same object: Abc
The same object: ABc
The same object: abC
Toy example: Discovering visual components

[Illustration: the object "acJ" mapped to a binary feature vector indicating which letter appearances are present.]

10 visual components, 5 appearances each.

Task: recognize objects, e.g. the "ABC" object, the "ADE" object, the "BCF" object, and the "ABD" object.
Toy example: Discovering components

[Plots show 50 points corresponding to the appearances in the model: PCA in data space vs. structural learning (PCA on the weights of object classifiers).]
Experiments: Baseline image representation

Bag-of-words SIFT representation, the 'raw' representation: $g: I \rightarrow X$, $X = \mathbb{R}^d$, $g(I) = [\#c_1, \#c_2, \ldots, \#c_d]$, where $I$ is an image.

Preprocessing:
• Find interest points (SIFT detector).
• Map each point to a feature vector (SIFT descriptor).
• Do vector quantization, giving d clusters.

Compute $g(I)$:
• Find interest points in $I$.
• Map each point to its closest cluster.
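A sketch of this preprocessing using OpenCV's SIFT and scikit-learn's k-means; the specific libraries and the codebook size are our assumptions, since the slides only specify SIFT interest points, SIFT descriptors, and vector quantization into d clusters:

```python
import cv2                                # opencv-python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(all_descriptors, d=1000):
    """Vector-quantize the corpus's stacked SIFT descriptors into d clusters."""
    return KMeans(n_clusters=d, n_init=10).fit(all_descriptors)

def g(image_gray, codebook):
    """g(I) = [#c_1, ..., #c_d]: count of interest points per cluster."""
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(image_gray, None)   # interest points
    counts = np.zeros(codebook.n_clusters)
    if desc is not None:
        for c in codebook.predict(desc.astype(np.float64)):
            counts[c] += 1
    return counts
```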
Experiments: Dataset

Reuters dataset: 10,576 news images, 130 topics. Task: predict the 15 most frequent topics.

Data partitions:
• 8,000 unlabeled images with captions but no topic labels.
• Labeled training sets $L_n$ of sizes n = 5, 10, 20, …, 320.
• 1,000 images as development data, labeled for a single topic.
• 1,756 images as testing data.
Experiments: Models

Three models, all logistic regression classifiers:
• Baseline model: uses the raw representation; regularization constant C optimized on the development set.
• PCA model: uses the PCA-on-unlabeled representation; C and h optimized on the development set of a single selected topic.
• Structural learning model: uses the structural learning representation; C and h optimized as for the PCA model.
Experiments: Results

[Plot: average equal error rate vs. number of training examples (5 to 320) on the Reuters dataset, 14 topics, comparing the Baseline, PCA, and Structural Learning models.]
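The reported metric, average equal error rate, is the point on each topic's ROC curve where the false positive and false negative rates coincide, averaged over topics. A sketch with scikit-learn (the function name is ours):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, scores):
    """EER: operating point where the false positive rate equals
    the false negative rate (1 - TPR)."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[i] + fnr[i]) / 2.0

# Average over topics:
# avg_eer = np.mean([equal_error_rate(y_t, s_t) for y_t, s_t in per_topic])
```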
Conclusion
Summary:
• We described a method for learning visual representations from images with captions.
• Captions define auxiliary tasks that induce a representation capturing visual components.
• The induced representation enables learning from fewer examples.

Future work:
• Exploring ways of determining the relatedness between core and auxiliary tasks.
• Performing structural learning from more weakly related auxiliary tasks.
Thanks!