
Learning and Inference in Vision: from Features to Scene Understanding

Jonathan Huang, Tomasz Malisiewicz

MLD Student Research Symposium, 2009

[Figure labels: Road, Sky, Trees, Bridge, Sign, Car]

Huge datasets

PASCAL Visual Object Classes (VOC) Challenge dataset

~15,000 annotated images, ~35,000 annotated object instances, 20 object classes with segmentations, bounding boxes

Huge datasets

LabelMe dataset

~11,845 static images, >100,000 labeled polygons

Outline

I. Recognizing single object classes (Jon)

II. Scene understanding with multiple classes (Tomasz)

Recognition task #1: Find all markers

Geometric Variability

Recognition task #2: Find all cats

Object recognition is often hard due to:

Variation within an object class

Viewpoint/Scale/Illumination Variability

Images from Flickr

From Pixels to Visual features

[Diagram labels: Scene, Imaging, Pixels, Features, Inference, "car"]

Low level features

Higher level inference

Local Visual Features

Images are high dimensional!

Compute image statistics in a region (e.g., estimate the distribution of image gradient orientations)

640 (width) × 480 (height) = 307,200 pixels
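A minimal sketch of the "image statistics in a region" idea: it summarizes a patch by a short histogram of gradient orientations instead of its raw pixels. The function name, the 8-bin choice, the central-difference gradients and the magnitude weighting are all assumptions made for illustration, not details taken from the slides.

```python
import math

def orientation_histogram(region, n_bins=8):
    """Histogram of gradient orientations, weighted by gradient magnitude.

    `region` is a list of rows of grayscale intensities. Gradients are
    central differences on interior pixels; orientations are unsigned
    (0 to 180 degrees). The result is normalized to sum to 1.
    """
    hist = [0.0] * n_bins
    for r in range(1, len(region) - 1):
        for c in range(1, len(region[0]) - 1):
            gx = region[r][c + 1] - region[r][c - 1]
            gy = region[r + 1][c] - region[r - 1][c]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue  # flat pixel, no orientation to vote for
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / 180.0 * n_bins) % n_bins] += mag
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist

# Toy patch with a vertical edge: intensity changes only along x.
patch = [[0, 0, 10, 10]] * 4
hist = orientation_histogram(patch)
```

However large the region is, the descriptor here has only n_bins numbers, which is the point of moving from pixels to features.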

Key ideas in feature design

Be invariant to stuff you don’t care about…

while not being too invariant

Object classification

Inference: What object class is this?

Learning: What does each object class look like?

Cow or Horse??

Let’s look at a simpler example first…

Document classification analogy

John Terry scored on a header to lift Chelsea to a 1-0 victory over Manchester United and extend the Blues’ Premier League lead to 5 points. Chelsea had been frustrated by Manchester United for 76 minutes, but took advantage of a free kick awarded when Darren Fletcher fouled Ashley Cole.

Brian Ching scored six minutes into overtime and the Houston Dynamo advanced to Major League Soccer’s Western ...

In the Senate, where proposals differ substantially from the House-passed measure on issues like a government-run plan and how to pay for coverage, the bill is stalled while budget analysts assess its overall costs. The slim margin in the House — the bill passed with just two votes to spare, and 39 Democrats opposed it — suggests even greater challenges in the Senate, where the majority leader, ...

??? ???

Classify each document as sports or politics

Bag-of-words models for text classification

“Much of the meaning behind written language is preserved even when the ordering of the individual words is lost.” [El-Arini et al.,’09]

[Image: a bag of words (Sue Ann)]


but to on Darren awarded Fletcher advanced Ashley lift over to 1-0 scored advantage Major for lead 76 Chelsea Premier to Terry League John Houston the kick Chelsea took United points. free minutes fouled United been frustrated overtime Manchester six a when League a extend victory Ching 5 and to and Western Manchester Brian Cole. Dynamo Soccer’s by a minutes, Blues’ the had header into of scored ...

the margin how In on majority 39 costs. with measure slim overall — to like opposed suggests challenges pay even substantially stalled government run where the issues votes it the where bill for spare, from bill and a Senate, analysts coverage, in — the Democrats greater differ two proposals budget its House assess while Senate, to in just the leader and the plan passed the is House passed The ...

??? ???
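A small sketch of why the shuffled documents above are still classifiable: a bag-of-words representation keeps only word counts, so shuffling the words does not change it. The sentence is adapted from the sports example; the tokenization rule (split on whitespace, strip punctuation, lowercase) is an assumption chosen for brevity.

```python
from collections import Counter
import random

def bag_of_words(text):
    """Lowercase, split on whitespace, strip punctuation at word ends, count."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return Counter(w for w in words if w)

sentence = ("Chelsea took advantage of a free kick awarded when "
            "Darren Fletcher fouled Ashley Cole.")

# Shuffle the word order with a fixed seed so the example is repeatable.
shuffled = sentence.split()
random.Random(0).shuffle(shuffled)

original_bag = bag_of_words(sentence)
shuffled_bag = bag_of_words(" ".join(shuffled))
same_representation = original_bag == shuffled_bag
```

A classifier that sees only these counts (for example, how often "kick" or "Senate" occurs) receives identical input for the ordered and the shuffled document.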



Visual words (discretization)

Words are discrete, but visual features are typically continuous…

Discretization via clustering/vector quantization

Visual words

[Sivic et al., ‘05]
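A minimal sketch of discretization by clustering, assuming plain k-means as the vector quantizer: cluster centers play the role of visual words, and an image becomes a histogram of nearest-center assignments. The toy 2-D "descriptors", the helper names and the choice k=2 are illustrative assumptions; real descriptors would be much higher dimensional.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centers, x):
    """Index of the center (visual word) closest to descriptor x."""
    return min(range(len(centers)), key=lambda k: sq_dist(centers[k], x))

def kmeans(points, k, n_iters=20, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(n_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(centers, p)].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centers

def bag_of_visual_words(descriptors, centers):
    """Count how many descriptors fall on each visual word."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest(centers, d)] += 1
    return hist

# Toy descriptors drawn around two well-separated locations.
rng = random.Random(1)
descriptors = ([(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(30)] +
               [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(20)])
vocabulary = kmeans(descriptors, k=2)
histogram = bag_of_visual_words(descriptors, vocabulary)
```

The resulting histogram is the visual analogue of the word counts in the document example.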

Object classification with bag of words

[Sivic et al., ‘05]

Object classification with bag of words

Performance on Caltech 101 dataset with linear SVM on bag-of-words vectors:

Faces

Airplanes Cars

[Csurka et al., ‘04]

Object Detection problem

Detection: Locate all the faces in this image.

Classification: Is this a face, or not a face?

Face detection via a series of classifications (a.k.a. sliding window brain damage)

False Detection

Missed Faces

Sliding window detection results
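A sketch of detection as a series of classifications: a fixed-size window is moved across the image and each crop is scored independently. The scoring function here (mean brightness) is a hypothetical stand-in for a trained face/non-face classifier, and the single-scale search is a simplification.

```python
def sliding_window_detect(image, window, stride, classifier, threshold):
    """Score every window position and keep those at or above `threshold`.

    Returns a list of (row, col, score) for the top-left corner of each
    accepted window. `classifier` maps a window (list of rows) to a score.
    """
    h, w = len(image), len(image[0])
    detections = []
    for r in range(0, h - window + 1, stride):
        for c in range(0, w - window + 1, stride):
            patch = [row[c:c + window] for row in image[r:r + window]]
            score = classifier(patch)
            if score >= threshold:
                detections.append((r, c, score))
    return detections

def mean_brightness(patch):
    # Hypothetical stand-in for a trained face/non-face classifier.
    return sum(map(sum, patch)) / (len(patch) * len(patch[0]))

# 6x6 image with a bright 2x2 blob whose top-left corner is at (2, 3).
image = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (3, 4):
        image[r][c] = 1

detections = sliding_window_detect(image, window=2, stride=1,
                                   classifier=mean_brightness, threshold=1.0)
```

Each window is judged in isolation, so lowering the threshold trades missed faces for false detections, and nothing in the loop captures how windows relate to each other.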

The need for… capturing spatial relationships

One Approach

Create a more descriptive (complicated) feature

Histograms of Oriented Gradients (HOG) features

Original Image

Subdivided Image cells

Histogrammed gradients in each cell

Estimated Image Gradients: gradient magnitudes, gradient orientations

[Dalal & Triggs, ‘06]
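A simplified sketch of the steps named on this slide: estimate gradients, subdivide the image into cells, and histogram gradient orientations within each cell. It assumes 4x4-pixel cells, 9 unsigned orientation bins and per-cell L2 normalization; the full HOG descriptor normalizes over overlapping blocks of cells, which is omitted here.

```python
import math

def hog_cells(image, cell=4, n_bins=9):
    """Per-cell histograms of unsigned gradient orientation.

    Each interior pixel votes for its orientation bin with a weight equal
    to its gradient magnitude. Returns {(cell_row, cell_col): [votes]}.
    """
    h, w = len(image), len(image[0])
    hists = {(i, j): [0.0] * n_bins
             for i in range(h // cell) for j in range(w // cell)}
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            key = (r // cell, c // cell)
            if key not in hists:
                continue  # pixel lies in a partial cell at the border
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = int(angle / 180.0 * n_bins) % n_bins
            hists[key][b] += math.hypot(gx, gy)
    return hists

def descriptor(hists):
    """Concatenate L2-normalized cell histograms in row-major cell order."""
    out = []
    for key in sorted(hists):
        norm = math.sqrt(sum(v * v for v in hists[key])) or 1.0
        out.extend(v / norm for v in hists[key])
    return out

# 8x8 toy image: intensity ramps along x on the left, along y on the right.
image = [[c if c < 4 else r for c in range(8)] for r in range(8)]
cells = hog_cells(image)
feature = descriptor(cells)
dominant = {k: max(range(9), key=v.__getitem__) for k, v in cells.items()}
```

Because the histograms are kept per cell, the descriptor records where each orientation occurs, which a single bag of features for the whole window does not.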

People Tracking with HOG features


Modeling Spatial Relationships with Deformable Part Based Models

Spring-based models: Parts prefer low-energy configurations

[Fischler & Elschlager, ’73], [Ramanan et al., ’07], [Felzenszwalb et al., ’05, ’09], [Kumar et al., ’09]

Parts Based Model

Vertices: Local Appearance

Edges: Spatial Relationship

Goal: Assign model parts to image regions, preserving both local appearance and spatial relationships

Parts based models - Inference Problem

Inference problem: What is the best scoring assignment f?

[Equation: score of f = Local Appearance term + Pairwise Spatial Relationship term]

Inference is NP-hard for general graphs

For trees, belief propagation gives the exact solution in polynomial time
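A sketch of exact inference on a tree-structured parts model by max-sum message passing (dynamic programming from the leaves to the root). It assumes the score of an assignment is the sum of a local appearance score per part and a spatial score per tree edge; the part names, the 1-D locations, the score tables and the quadratic "spring" penalty are hypothetical values chosen for illustration.

```python
def best_assignment(children, root, locations, appearance, spatial):
    """Exact max-sum inference on a tree-structured parts model.

    children[p]           -> list of child parts of part p
    appearance(p, l)      -> local appearance score of part p at location l
    spatial(p, lp, c, lc) -> pairwise score for parent p at lp, child c at lc
    Returns (best_score, {part: location}).
    """
    best_child_loc = {}  # (child, parent location) -> best child location

    def message(p):
        # score[l]: best total score of the subtree rooted at p, with p at l
        score = {l: appearance(p, l) for l in locations}
        for c in children.get(p, []):
            child_score = message(c)
            for lp in locations:
                lc = max(locations,
                         key=lambda l: child_score[l] + spatial(p, lp, c, l))
                best_child_loc[(c, lp)] = lc
                score[lp] += child_score[lc] + spatial(p, lp, c, lc)
        return score

    root_score = message(root)
    assignment = {root: max(locations, key=root_score.get)}
    stack = [root]
    while stack:  # backtrack from the root to recover each part's location
        p = stack.pop()
        for c in children.get(p, []):
            assignment[c] = best_child_loc[(c, assignment[p])]
            stack.append(c)
    return root_score[assignment[root]], assignment

# Hypothetical three-part model on a 1-D strip of five locations.
locations = list(range(5))
children = {"torso": ["head", "legs"]}
table = {"head": [0, 3, 0, 0, 0], "torso": [0, 0, 1, 2, 0],
         "legs": [0, 0, 0, 0, 3]}
offset = {"head": -1, "legs": 1}  # preferred child position relative to parent

def appearance(p, l):
    return table[p][l]

def spatial(p, lp, c, lc):
    return -(lc - lp - offset[c]) ** 2  # quadratic spring penalty

best_score, assignment = best_assignment(children, "torso", locations,
                                         appearance, spatial)
```

With P parts and L candidate locations this costs on the order of P·L² score evaluations, compared with L^P for exhaustive search over all assignments.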

Parts based models - Learning Problem

Linear models:

[Equation: score = Local Appearance term + Pairwise Spatial Relationship term]

Convex max-margin objective, s.t.

Positive examples on one side

Negative examples on the other

[Kumar et al,’09]
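A sketch of the max-margin idea for a plain linear classifier: minimize a convex objective (L2 regularizer plus hinge loss) so that positive examples score above +1 and negative examples below -1. This is a generic linear SVM trained by subgradient descent, not the structured learner of the cited work; the data points and hyperparameters are hypothetical.

```python
def train_linear_svm(examples, labels, reg=0.01, lr=0.1, epochs=200):
    """Subgradient descent on  reg/2 * ||w||^2 + mean(max(0, 1 - y(w.x + b))).

    `labels` are +1 or -1. Returns the weight vector w and the bias b.
    """
    dim, n = len(examples[0]), len(examples)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(examples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # example violates the margin
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical 2-D feature vectors for positive and negative examples.
positives = [(2, 2), (3, 2), (2, 3)]
negatives = [(-2, -2), (-3, -2), (-2, -3)]
examples = positives + negatives
labels = [1] * len(positives) + [-1] * len(negatives)

w, b = train_linear_svm(examples, labels)
predictions = [predict(w, b, x) for x in examples]
```

In a parts-based model the feature vector would stack the appearance and spatial features of an assignment, so the same objective learns the weights of both terms jointly.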

Learning linear models: Find weight vectors that best separate positive and negative examples. E.g.,

Person deformable part model

Root filter (8x8 resolution)

Part filter (4x4 resolution)

Quadratic spatial configuration model

[Felzenszwalb et al., ’09]

[Ramanan et al., ’09]

Outline

I. Recognizing single object classes (Jon)

II. Scene understanding with multiple classes (Tomasz)

Part II: Scene Understanding with Multiple Classes

Goal: Predict Many Different Objects in a Single Image

[Figure labels: Car, Fire Hydrant, Building, Fence, Sidewalk, Tree]

Wait...

• What’s wrong with just learning a different sliding window classifier for each object type in the world?

The image as seen from an object detector’s point of view

Relationships between objects make recognition possible

Antonio Torralba. The Context Challenge. http://web.mit.edu/torralba/www/carsAndFacesInContext.html

Objects as the “Parts” of a Scene

Key Challenge in Scene Understanding: Modeling relationships between objects from different categories

[Figure panels: Deformable Part Model, Scene Model]

Fixed Extent “Things” vs Free-form “Stuff”

[Figure labels: Building, Fence, Sidewalk, Car, Fire Hydrant, Tree]

Things have a well-defined shape. A part of a car is not a car.

Stuff is free-form and mostly defined by color/texture. A part of a building is still a building.

3 Types of Scene Models

Pixel-based, Window-based, Segment-based

Pixel-based Scene Understanding

Unable to reason about instances

Only limited notion of context

TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. Shotton et al. ECCV 2006

Produces Segmentation

Works well on “stuff”


Pixel-wise Conditional Random Fields (TextonBoost)

• Inference: y^* = argmax_y p(y|x)

• Training: Use boosting to learn unary potential

• Future Direction: Higher-Order Cliques

TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. Shotton et al. ECCV 2006
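A sketch of approximate inference for y^* = argmax_y p(y|x) on a pixel grid. It assumes per-pixel unary costs plus a Potts penalty between 4-connected neighbours, and uses iterated conditional modes (ICM) because it fits in a few lines; ICM is a swapped-in simple technique and is not claimed to be the inference method of the cited paper. All cost values are hypothetical.

```python
def icm(unary, pairwise_weight=1.0, n_sweeps=5):
    """Iterated conditional modes on a 4-connected grid CRF.

    unary[r][c][k] is the cost (negative log-potential) of label k at pixel
    (r, c); neighbouring pixels pay `pairwise_weight` when their labels
    differ (Potts model). Maximizing p(y|x) means minimizing total cost.
    """
    h, w, n_labels = len(unary), len(unary[0]), len(unary[0][0])
    # Start from the labeling that is best under the unary costs alone.
    labels = [[min(range(n_labels), key=unary[r][c].__getitem__)
               for c in range(w)] for r in range(h)]
    for _ in range(n_sweeps):
        changed = False
        for r in range(h):
            for c in range(w):
                nbrs = [labels[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c),
                                       (r, c - 1), (r, c + 1))
                        if 0 <= rr < h and 0 <= cc < w]

                def cost(k):
                    disagreements = sum(k != n for n in nbrs)
                    return unary[r][c][k] + pairwise_weight * disagreements

                best = min(range(n_labels), key=cost)
                if best != labels[r][c]:
                    labels[r][c] = best
                    changed = True
        if not changed:
            break  # local optimum reached
    return labels

# 3x3 grid: every pixel prefers label 0 except a noisy centre pixel.
unary = [[[0.0, 1.0] for _ in range(3)] for _ in range(3)]
unary[1][1] = [1.5, 0.0]

smoothed = icm(unary, pairwise_weight=1.0)
independent = icm(unary, pairwise_weight=0.0)
```

With the pairwise term switched off each pixel is labeled independently and the noisy centre keeps its own label; with it on, the neighbours outvote it.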

Window-based Scene Understanding

Often not possible to model “stuff” using windows.

Window assumption also questionable for some “things.”

Possible to model interactions between object instances.

Discriminative models for multi-class object layout. Desai et al. ICCV 2009

Object Recognition by Scene Alignment. Russell et al. NIPS 2007

Discriminative models for multi-class object layout

• Inference via Greedy Forward Search

• Training
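A sketch of greedy forward search over candidate windows: start with nothing selected, repeatedly add the window that increases the total score the most, and stop when no addition helps. It assumes a score made of per-window detector scores plus symmetric pairwise interaction scores; the three candidate windows and all numbers are hypothetical.

```python
def greedy_forward_search(unary, pairwise):
    """Greedily select windows to maximize unary plus pairwise scores.

    unary[i]       -> local detector score of candidate window i
    pairwise[i][j] -> score added when windows i and j are both selected
                      (assumed symmetric)
    Returns the selected window indices in the order they were added.
    """
    selected = []
    remaining = set(range(len(unary)))
    while remaining:
        def gain(i):
            # Change in total score if window i is added to the selection.
            return unary[i] + sum(pairwise[i][j] for j in selected)

        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break  # no remaining window improves the score
        selected.append(best)
        remaining.remove(best)
    return selected

# Windows 0 and 1: two overlapping "car" detections (mutually exclusive).
# Window 2: a weak "person" detection that a nearby car makes more plausible.
unary = [1.0, 0.8, -0.2]
pairwise = [[0.0, -2.0, 0.5],
            [-2.0, 0.0, 0.0],
            [0.5, 0.0, 0.0]]

selected = greedy_forward_search(unary, pairwise)
```

The negative interaction suppresses the duplicate car detection, while the positive one rescues a window that the person detector alone would have rejected.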


Window-based results


Region-Based Scene Understanding

Use Segmentation algorithm to extract stable regions

Use CRF to label those segments

Problem: Hard to get object-segments.

Problem: Inference difficult for fully connected models.

Region-Based CRF

• Training: Bag of Words with Nearest Neighbor classifier

• Maximum Likelihood training of pairwise potentials


Object Categorization using Co-Occurrence, Location and Appearance. Galleguillos et al. CVPR 2008.
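A sketch of the appearance step named above: each segment is described by a bag-of-words histogram and receives the label of its nearest training histogram. The 3-bin histograms, the labels and the squared Euclidean distance are assumptions for illustration; the pairwise (context) potentials are not modeled here.

```python
def nearest_neighbor_label(query, training):
    """Return the label of the training histogram closest to `query`.

    `training` is a list of (histogram, label) pairs; distance is squared
    Euclidean distance between histograms of equal length.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training, key=lambda item: sq_dist(query, item[0]))
    return label

# Hypothetical 3-bin visual-word histograms for labeled training segments.
training = [([8, 1, 0], "sky"), ([1, 7, 1], "tree"), ([0, 2, 9], "road")]

segments = [[7, 2, 0], [1, 1, 8]]
segment_labels = [nearest_neighbor_label(h, training) for h in segments]
```

In a region-based CRF these per-segment decisions would act as the unary evidence that the pairwise potentials then revise using context.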

Spatial Relations


Segmentation-Based Results

[Figure columns: Input image, No context, w/ context]

Object Categorization using Co-Occurrence, Location and Appearance. Galleguillos et al. CVPR 2008.


Model Granularity vs. Object Type

                             Pixels   Windows   Regions
Things (car, cow, person)     :-(      :-)       :-/
Stuff (road, sky, tree)       :-)      :-(       :-)

Columns: Granularity. Rows: Object Type.

Scene Understanding Recap

• Rich object-object interactions are important for scene understanding.

• Different underlying assumptions (pixel vs. window vs. region) are better suited for different types of objects (“stuff” vs. “things”)

• Many of the techniques for single class object recognition (e.g., part based models) are relevant for scene understanding

Thanks!

Image Classification

Sliding Window based Object Detection

Modeling Spatial Relationships between parts

Modeling Spatial Relationships between objects
