Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik...

45
Visual Grouping and Recognition Jitendra Malik University of California at Berkeley

Transcript of Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik...

Page 1: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Visual Grouping and RecognitionVisual Grouping and Recognition

Jitendra Malik

University of California at Berkeley

Jitendra Malik

University of California at Berkeley

Page 2: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

CollaboratorsCollaborators

• Grouping: Jianbo Shi (CMU), Serge Belongie, Thomas Leung (Compaq CRL)

• Ecological Statistics: Charless Fowlkes, David Martin, Xiaofeng Ren

• Recognition: Serge Belongie, Jan Puzicha

• Grouping: Jianbo Shi (CMU), Serge Belongie, Thomas Leung (Compaq CRL)

• Ecological Statistics: Charless Fowlkes, David Martin, Xiaofeng Ren

• Recognition: Serge Belongie, Jan Puzicha

Page 3: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

From images to objectsFrom images to objects

Labeled sets: tiger, grass etc

Page 4: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

What enables us to parse a scene?What enables us to parse a scene?

– Low level cues• Color/texture• Contours• Motion

– Mid level cues• T-junctions• Convexity

– High level Cues• Familiar Object• Familiar Motion

– Low level cues• Color/texture• Contours• Motion

– Mid level cues• T-junctions• Convexity

– High level Cues• Familiar Object• Familiar Motion

Page 5: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Grouping factors Grouping factors

Page 6: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

But is segmentation a meaningful problem?

But is segmentation a meaningful problem?

• Difficult to define formally, but humans are remarkably consistent…

• Difficult to define formally, but humans are remarkably consistent…

Page 7: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Human Segmentations (1)Human Segmentations (1)

Page 8: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Human Segmentations (2)Human Segmentations (2)

Page 9: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

ConsistencyConsistencyA

B C

• A,C are refinements of B• A,C are mutual refinements • A,B,C represent the same percept

• Attention accounts for differences

Image

BG L-bird R-bird

grass bush

headeye

beakfar body

headeye

beak body

Perceptual organization forms a tree:

Two segmentations are consistent when they can beexplained by the samesegmentation tree (i.e. theycould be derived from a single perceptual organization).

Page 10: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Ecological Statistics of image segmentation

Ecological Statistics of image segmentation

• Measure the conditional probability distribution of various grouping cues in human segmented images (Brunswik 1950)

• Design algorithm for incorporating multiple cues for image segmentation

• Measure the conditional probability distribution of various grouping cues in human segmented images (Brunswik 1950)

• Design algorithm for incorporating multiple cues for image segmentation

Page 11: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

ProximityProximity

Page 12: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Similarity of brightness(cf. Coughlan & Yuille, Geman & Jedynek)

Similarity of brightness(cf. Coughlan & Yuille, Geman & Jedynek)

Page 13: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

ConvexityConvexity

Page 14: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Region AreaRegion Area

• Compare to Alvarez,Gousseau,Morel• Compare to Alvarez,Gousseau,Morel

y = Kx-

= 1.008

Page 15: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Lengths of curvesLengths of curves

100 120 140 160 180 200 220 240140

150

160

170

180

190

200

210

220

230

240

-1 0 1 2 3 4 5 6 70

1

2

3

4

5

6

7

8

9

log(contour length)

log(c

ount)

Page 16: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Image Segmentation as Graph PartitioningImage Segmentation as Graph PartitioningBuild a weighted graph G=(V,E) from image

V: image pixels

E: connections between pairs of nearby pixels

region

same the tobelong

j& iy that probabilit :ijW

Partition graph so that similarity within group is large and similarity between groups is small -- Normalized Cuts [Shi&Malik 97]

Page 17: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Normalized Cut, A measure of dissimilarity

Normalized Cut, A measure of dissimilarity

• Minimum cut is not appropriate since it favors cutting small pieces.

• Normalized Cut, Ncut:

• Minimum cut is not appropriate since it favors cutting small pieces.

• Normalized Cut, Ncut:

V),(

B)A,(

V)A,(

B)A,( B)A,(

Bassoc

cut

assoc

cutNcut

Page 18: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Normalized Cut As Generalized Eigenvalue problem

Normalized Cut As Generalized Eigenvalue problem

• after simplification, we get• after simplification, we get

...

),(

),( ;

11)1(

)1)(()1(

11

)1)(()1(

)VB,(

)BA,(

)VA,(

B)A,(B)A,(

0

i

x

T

T

T

T

iiD

iiDk

Dk

xWDx

Dk

xWDx

assoc

cut

assoc

cutNcut

i

.01},,1{ with ,)(

),(

DybyDyy

yWDyBANcut T

iT

T

Page 19: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Cue-Integration for Image SegmentationCue-Integration for Image Segmentation

[Malik, Belongie, Shi, Leung 1999]

Page 20: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

On image segmentation..On image segmentation..

• Humans are quite consistent, so model the goal as emulating their behavior.

• Ecological statistics of grouping cues can be learned from image data.

• We now have a generic image segmentation algorithm (code available) which can be applied for MPEG-4/7 compression and object recognition.

• Humans are quite consistent, so model the goal as emulating their behavior.

• Ecological statistics of grouping cues can be learned from image data.

• We now have a generic image segmentation algorithm (code available) which can be applied for MPEG-4/7 compression and object recognition.

Page 21: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Framework for RecognitionFramework for Recognition(1) Segmentation

PixelsSegments(2) Association

SegmentsRegions(3) Matching

RegionsPrototypes

Over-segmentation necessary; Under-segmentation fatal

Enumerate: # of size k regions in image with n segments is ~(4**k)*n/k

~10 views/object. Matching tolerant to pose/illumination changes, intra-category variation, error in previous steps

Page 22: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Matching regions to viewsMatching regions to views

• GOAL: obtain small misclassification error using few views

• Matching allowing deformations of prototype views makes this possible

• GOAL: obtain small misclassification error using few views

• Matching allowing deformations of prototype views makes this possible

Page 23: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Matching with original and deformed prototypesPrototype Test Error

Page 24: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Deforming Biological ShapesDeforming Biological Shapes

• D’Arcy Thompson: On Growth and Form, 1917– studied transformations between shapes of organisms

Page 25: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

• Find correspondences between points on shape

• Estimate transformation

• Measure similarity

model target

...

Page 26: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Finding correspondences between shapes

Finding correspondences between shapes

• Each shape is represented by a set of sample points

• Each sample point has a descriptor – the shape context

• Define cost Wij for matching point i on first shape with point j on second shape.

• Solve for correspondence as optimum assignment.

• Each shape is represented by a set of sample points

• Each sample point has a descriptor – the shape context

• Define cost Wij for matching point i on first shape with point j on second shape.

• Solve for correspondence as optimum assignment.

Page 27: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Shape ContextShape ContextCount the number of points inside each bin, e.g.:

Count = 4

Count = 10

...

Compact representation of distribution of points relative to each point

Page 28: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.
Page 29: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Comparing Shape ContextsComparing Shape ContextsCompute matching costs using Chi Squared Test:

Recover correspondences by solving linear assignment problem with costs Cij

[Jonker & Volgenant 1987]

Page 30: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

MatchingExampleMatchingExample

model target

Page 31: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Synthetic Test ResultsSynthetic Test ResultsFish - deformation + noise Fish - deformation + outliers

ICP Shape Context RPM

Page 32: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Measuring Shape SimilarityMeasuring Shape Similarity• Image appearance around matched points

– color or gray-level window– orientation

• Shape context differences at matched points

• Bending Energy

• Image appearance around matched points– color or gray-level window– orientation

• Shape context differences at matched points

• Bending Energy

Page 33: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

COIL Object Database

Page 34: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Editing: PrototypesEditing: Prototypes

• Human Shape Perception• Computational Needs for K-NN

• Human Shape Perception• Computational Needs for K-NN

Page 35: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Prototype Selection: Coil-20Prototype Selection: Coil-20

Page 36: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

MNIST Handwritten Digits

Page 37: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Handwritten Digit RecognitionHandwritten Digit Recognition• MNIST 60 000:

– linear: 12.0%

– 40 PCA+ quad: 3.3%

– 1000 RBF +linear: 3.6%

– K-NN: 5%

– K-NN (deskewed): 2.4%

– K-NN (tangent dist.): 1.1%

– SVM: 1.1%

– LeNet 5: 0.95%

• MNIST 60 000: – linear: 12.0%

– 40 PCA+ quad: 3.3%

– 1000 RBF +linear: 3.6%

– K-NN: 5%

– K-NN (deskewed): 2.4%

– K-NN (tangent dist.): 1.1%

– SVM: 1.1%

– LeNet 5: 0.95%

• MNIST 600 000 (distortions): – LeNet 5: 0.8%– SVM: 0.8%– Boosted LeNet 4: 0.7%

• MNIST 20 000: – K-NN, Shape Context

matching: 0.63%

Page 38: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Hand-written Digit RecognitionHand-written Digit Recognition• MNIST 60 000:

– linear: 12.0%

– 40 PCA+ quad: 3.3%

– 1000 RBF +linear: 3.6%

– K-NN: 5%

– K-NN (deskewed): 2.4%

– K-NN (tangent dist.): 1.1%

– SVM: 1.1%

– LeNet 5: 0.95%

• MNIST 60 000: – linear: 12.0%

– 40 PCA+ quad: 3.3%

– 1000 RBF +linear: 3.6%

– K-NN: 5%

– K-NN (deskewed): 2.4%

– K-NN (tangent dist.): 1.1%

– SVM: 1.1%

– LeNet 5: 0.95%

• MNIST 600 000 (distortions): – LeNet 5: 0.8%

– SVM: 0.8%

– Boosted LeNet 4: 0.7%

• MNIST 20 000– K-NN, Shape context

matching: 0.63 %

Page 39: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Results: Digit RecognitionResults: Digit Recognition

1-NN classifier using:Shape context + 0.3 * bending + 1.6 * image appearance

1-NN classifier using:Shape context + 0.3 * bending + 1.6 * image appearance

Page 40: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Results: Digit Recognition (Detail)

Results: Digit Recognition (Detail)

Page 41: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.
Page 42: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Trademark SimilarityTrademark Similarity

Page 43: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Future work..Future work..

• Indexing based on color/texture/shape features before correspondence matching

• Integrate segmentation and recognition

• Indexing based on color/texture/shape features before correspondence matching

• Integrate segmentation and recognition

Page 44: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Computing cost on a Pentium PCComputing cost on a Pentium PC

• Segmentation: 2 minutes /image (200x100)

• Matching : 0.2 sec / match (100 points)

• Segmentation: 2 minutes /image (200x100)

• Matching : 0.2 sec / match (100 points)

Page 45: Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley.

Given a 104 speedup..Given a 104 speedup..

• 5K object categories/sec

• Humans can recognize 10K -100K objects, so we could be in the ballpark of human level vision by 2020.

• 5K object categories/sec

• Humans can recognize 10K -100K objects, so we could be in the ballpark of human level vision by 2020.