Visual Grouping and Recognition
Jitendra Malik
University of California at Berkeley
Collaborators
• Grouping: Jianbo Shi (CMU), Serge Belongie, Thomas Leung (Compaq CRL)
• Ecological Statistics: Charless Fowlkes, David Martin, Xiaofeng Ren
• Recognition: Serge Belongie, Jan Puzicha
From images to objects
Labeled sets: tiger, grass, etc.
What enables us to parse a scene?
– Low-level cues
  • Color/texture
  • Contours
  • Motion
– Mid-level cues
  • T-junctions
  • Convexity
– High-level cues
  • Familiar Object
  • Familiar Motion
Grouping factors
But is segmentation a meaningful problem?
• Difficult to define formally, but humans are remarkably consistent…
Human Segmentations (1)
Human Segmentations (2)
Consistency
[Figure: three human segmentations of the same image, labeled A, B, and C]
• A, C are refinements of B
• A, C are mutual refinements
• A, B, C represent the same percept
• Attention accounts for differences
Perceptual organization forms a tree:
[Figure: segmentation tree rooted at Image, with children BG, L-bird, and R-bird; lower-level labels include grass, bush, far, head, eye, beak, and body]
Two segmentations are consistent when they can be explained by the same segmentation tree (i.e. they could be derived from a single perceptual organization).
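The refinement relation used on this slide can be checked mechanically. The sketch below assumes each segmentation is given as a flat list of per-pixel segment labels; the function name and the six-pixel toy labelings are hypothetical and only illustrate the idea.

```python
def is_refinement(fine, coarse):
    """Return True if every segment of `fine` lies inside one segment of `coarse`."""
    parent = {}
    for f, c in zip(fine, coarse):
        # The first pixel of a fine segment fixes its coarse parent;
        # any later pixel that disagrees breaks the refinement.
        if parent.setdefault(f, c) != c:
            return False
    return True

# Toy one-dimensional "image" of six pixels (labels are made up).
B = [0, 0, 0, 1, 1, 1]  # coarse: background vs. bird
A = [0, 0, 0, 1, 1, 2]  # one subject splits the bird into body and head
C = [0, 1, 1, 2, 2, 2]  # another subject splits the background instead

print(is_refinement(A, B), is_refinement(C, B), is_refinement(A, C))
```

Here A and C both refine B, and neither refines the other, yet all three can be read off the same tree, which is the sense of consistency described above.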
Ecological Statistics of image segmentation
• Measure the conditional probability distribution of various grouping cues in human segmented images (Brunswik 1950)
• Design algorithm for incorporating multiple cues for image segmentation
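As a rough illustration of measuring such a statistic, the sketch below estimates the probability that two pixels fall in the same human-marked segment as a function of their distance (the proximity cue). The label map, the rounding of distances to integers, and the function name are assumptions made for the example, not the procedure used in the talk.

```python
import math
from collections import defaultdict
from itertools import combinations

def same_segment_prob_by_distance(labels):
    """Estimate P(same segment | rounded pixel distance) from a 2-D label map."""
    counts = defaultdict(lambda: [0, 0])  # distance -> [same-segment pairs, all pairs]
    pixels = [(r, c) for r in range(len(labels)) for c in range(len(labels[0]))]
    for (r1, c1), (r2, c2) in combinations(pixels, 2):
        d = round(math.hypot(r1 - r2, c1 - c2))
        counts[d][0] += labels[r1][c1] == labels[r2][c2]
        counts[d][1] += 1
    return {d: same / total for d, (same, total) in sorted(counts.items())}

# Hypothetical 1x6 human segmentation with two segments.
probs = same_segment_prob_by_distance([[0, 0, 0, 1, 1, 1]])
print(probs)
```

On this toy map the estimated probability falls as distance grows, which is the qualitative behaviour a proximity cue is expected to show.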
Proximity
Similarity of brightness (cf. Coughlan & Yuille, Geman & Jedynak)
Convexity
Region Area
• Compare to Alvarez, Gousseau, Morel
[Plot: power-law fit to the region-area distribution, $y = K x^{-\alpha}$ with $\alpha = 1.008$]
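A power law of this form is a straight line in log-log coordinates, so the exponent can be estimated by a least-squares line fit on the logs. The sketch below is a generic fit on synthetic data; the function name, the constant K = 50, and the sample points are assumptions, and it does not reproduce how the fit on the slide was obtained.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = K * x**(-alpha) by least squares on (log x, log y); return (K, alpha)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
    intercept = my - slope * mx
    return math.exp(intercept), -slope

# Synthetic, noise-free data generated with the exponent quoted on the slide.
areas = [10, 20, 50, 100, 200, 500]
counts = [50.0 * a ** -1.008 for a in areas]
K, alpha = fit_power_law(areas, counts)
print(K, alpha)
```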
Lengths of curves
[Plot: log(count) versus log(contour length)]
Image Segmentation as Graph Partitioning
Build a weighted graph G=(V,E) from image
V: image pixels
E: connections between pairs of nearby pixels
W_ij: probability that i & j belong to the same region
Partition graph so that similarity within group is large and similarity between groups is small -- Normalized Cuts [Shi & Malik 97]
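A minimal sketch of the graph construction follows. The slide defines W_ij as the probability that pixels i and j belong to the same region; as a stand-in, this example uses a Gaussian of the brightness difference between pixels within a small radius. The function name, the radius, and sigma are all assumptions chosen for illustration.

```python
import numpy as np

def build_affinity(image, radius=1.5, sigma=0.1):
    """Return a dense affinity matrix W over the pixels of a small grayscale image."""
    img = np.asarray(image, dtype=float)
    rows, cols = img.shape
    # Pixel coordinates in row-major order, matching img.ravel().
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    values = img.ravel()
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    diff = values[:, None] - values[None, :]
    W = np.exp(-(diff ** 2) / sigma ** 2)  # similar brightness -> weight near 1
    W[dist > radius] = 0.0                 # E: only nearby pixels are connected
    np.fill_diagonal(W, 0.0)               # no self-loops
    return W

# Hypothetical 1x4 image: two dark pixels followed by two bright ones.
W = build_affinity([[0.0, 0.0, 1.0, 1.0]])
print(np.round(W, 3))
```

A dense matrix is only practical for tiny images; a real implementation would store W sparsely, since each pixel connects to a handful of neighbours.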
Normalized Cut, A measure of dissimilarity
• Minimum cut is not appropriate since it favors cutting small pieces.
• Normalized Cut, Ncut:
\[ Ncut(A,B) = \frac{cut(A,B)}{assoc(A,V)} + \frac{cut(A,B)}{assoc(B,V)} \]
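To make the contrast with minimum cut concrete, the sketch below evaluates cut(A,B) and Ncut(A,B) = cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V) on a small hand-built graph. The five-node weight matrix is hypothetical: two tight pairs joined by weak links, plus one weakly attached node.

```python
import numpy as np

def cut_and_ncut(W, in_A):
    """Return (cut(A,B), Ncut(A,B)) for the partition given by the boolean mask in_A."""
    in_A = np.asarray(in_A, dtype=bool)
    cut = W[in_A][:, ~in_A].sum()   # total weight crossing the partition
    assoc_A = W[in_A].sum()         # total weight from A to all nodes
    assoc_B = W[~in_A].sum()        # total weight from B to all nodes
    return cut, cut / assoc_A + cut / assoc_B

# Hypothetical graph: pairs {0,1} and {2,3}, with node 4 hanging off node 3.
W = np.zeros((5, 5))
for i, j, w in [(0, 1, 1.0), (2, 3, 1.0), (0, 2, 0.1), (1, 3, 0.1), (3, 4, 0.15)]:
    W[i, j] = W[j, i] = w

small_cut, small_ncut = cut_and_ncut(W, [False, False, False, False, True])
even_cut, even_ncut = cut_and_ncut(W, [True, True, False, False, False])
print(small_cut, small_ncut)
print(even_cut, even_ncut)
```

Splitting off the single node has the smaller cut value, so minimum cut would pick it, while its Ncut is much larger than that of the balanced split.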
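Normalized Cut As Generalized Eigenvalue problem
• after simplification, we get
\[ Ncut(A,B) = \frac{cut(A,B)}{assoc(A,V)} + \frac{cut(A,B)}{assoc(B,V)} \]
\[ 4\,Ncut(A,B) = \frac{(1+x)^T (D-W)(1+x)}{k\,1^T D 1} + \frac{(1-x)^T (D-W)(1-x)}{(1-k)\,1^T D 1}, \qquad k = \frac{\sum_{x_i>0} D(i,i)}{\sum_i D(i,i)} \]
\[ \ldots \]
\[ Ncut(A,B) = \frac{y^T (D-W)\, y}{y^T D\, y}, \qquad \text{with } y_i \in \{1,-b\},\; y^T D 1 = 0. \]
Here x_i = 1 if node i is in A and x_i = -1 if it is in B, D is the diagonal matrix with D(i,i) = \sum_j W_{ij}, and b = k/(1-k). The factor 4 in the second line arises because (1+x)/2 is the indicator vector of A.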
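Relaxing y to real values turns the minimization into the generalized eigenvalue problem (D - W) y = λ D y, whose second-smallest eigenvector gives the split. The sketch below solves it through the equivalent symmetric problem on D^{-1/2} (D - W) D^{-1/2} using NumPy. Thresholding the eigenvector at zero is a simplifying assumption, and the four-node graph is hypothetical.

```python
import numpy as np

def ncut_bipartition(W):
    """Return (second-smallest generalized eigenvalue, boolean partition mask)."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian D^{-1/2} (D - W) D^{-1/2}.
    L_sym = (np.diag(d) - W) * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    y = d_inv_sqrt * vecs[:, 1]         # map back to the generalized eigenvector
    return vals[1], y > 0

# Hypothetical graph: tight pairs {0,1} and {2,3} joined by two weak edges.
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
lam, mask = ncut_bipartition(W)
print(lam, mask)
```

On this symmetric toy graph the relaxed eigenvector is already two-valued, so the eigenvalue coincides with the Ncut of the recovered partition; in general the eigenvalue is only a lower bound and the choice of splitting point matters.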
Cue-Integration for Image Segmentation
[Malik, Belongie, Shi, Leung 1999]
On image segmentation..
• Humans are quite consistent, so model the goal as emulating their behavior.
• Ecological statistics of grouping cues can be learned from image data.
• We now have a generic image segmentation algorithm (code available) which can be applied for MPEG-4/7 compression and object recognition.
Framework for Recognition
(1) Segmentation: Pixels → Segments
– Over-segmentation necessary; under-segmentation fatal
(2) Association: Segments → Regions
– Enumerate: # of size-k regions in an image with n segments is ~(4^k)·n/k
(3) Matching: Regions → Prototypes
– ~10 views/object. Matching tolerant to pose/illumination changes, intra-category variation, error in previous steps
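The enumeration estimate for the association step is easy to evaluate for concrete sizes. The sketch below simply computes 4^k · n / k; the function name and the choice of n = 100 segments are assumptions for illustration.

```python
def approx_region_count(n_segments, k):
    """Approximate number of candidate regions made of k adjacent segments."""
    return (4 ** k) * n_segments / k

# Hypothetical image with 100 segments: candidate counts for small k.
for k in range(1, 5):
    print(k, approx_region_count(100, k))
```

The count grows geometrically in k, which suggests why candidate regions have to stay small in segment count for enumeration to remain tractable.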
Matching regions to views
• GOAL: obtain small misclassification error using few views
• Matching allowing deformations of prototype views makes this possible
Matching with original and deformed prototypes
[Figure columns: Prototype, Test, Error]
Deforming Biological Shapes
• D'Arcy Thompson: On Growth and Form, 1917
– studied transformations between shapes of organisms
• Find correspondences between points on shape
• Estimate transformation
• Measure similarity
model target
...
Finding correspondences between shapes
• Each shape is represented by a set of sample points
• Each sample point has a descriptor – the shape context
• Define cost Wij for matching point i on first shape with point j on second shape.
• Solve for correspondence as optimum assignment.
Shape Context
Count the number of points inside each bin, e.g.:
Count = 4
Count = 10
...
Compact representation of distribution of points relative to each point
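A shape context can be sketched as a log-polar histogram of the positions of all other sample points relative to a reference point. The bin counts (5 radial, 12 angular), the radial range, the normalization by mean pairwise distance, and the clipping of out-of-range points into the end bins are assumptions of this sketch rather than details given on the slide.

```python
import numpy as np

def shape_context(points, i, n_r=5, n_theta=12, r_min=0.125, r_max=2.0):
    """Log-polar histogram of the other points as seen from points[i]."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Normalize radii by the mean pairwise distance for scale invariance.
    pairwise = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    mean_dist = pairwise.sum() / (n * (n - 1))
    diff = np.delete(pts, i, axis=0) - pts[i]
    r = np.hypot(diff[:, 0], diff[:, 1]) / mean_dist
    theta = np.arctan2(diff[:, 1], diff[:, 0]) % (2 * np.pi)
    # Log-spaced radial bin edges; out-of-range radii are clipped into the end bins.
    r_edges = np.logspace(np.log10(r_min), np.log10(r_max), n_r + 1)
    r_bin = np.clip(np.searchsorted(r_edges, r, side="right") - 1, 0, n_r - 1)
    t_bin = np.minimum((theta / (2 * np.pi) * n_theta).astype(int), n_theta - 1)
    hist = np.zeros((n_r, n_theta), dtype=int)
    np.add.at(hist, (r_bin, t_bin), 1)  # count points per (radius, angle) bin
    return hist

# Hypothetical sample points on a small shape.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]
hist = shape_context(points, 0)
print(hist)
```

Each entry of the histogram is the count for one bin, in the sense of the "Count = 4" and "Count = 10" examples above.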
Comparing Shape Contexts
Compute matching costs using Chi Squared Test:
Recover correspondences by solving linear assignment problem with costs Cij
[Jonker & Volgenant 1987]
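The two steps on this slide can be sketched as follows. The cost is the chi-squared statistic between two histograms, C_ij = ½ Σ_k (h_i(k) − h_j(k))² / (h_i(k) + h_j(k)). For the assignment step, this example uses a brute-force search over permutations as a stand-in for the Jonker & Volgenant solver, which is only workable for a handful of points; for realistic sizes a dedicated solver would be needed (SciPy's `scipy.optimize.linear_sum_assignment` is one option). The toy histograms are made up.

```python
from itertools import permutations

def chi2_cost(g, h):
    """0.5 * sum_k (g_k - h_k)^2 / (g_k + h_k), skipping bins empty in both."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(g, h) if a + b > 0)

def best_assignment(cost):
    """Brute-force minimum-cost one-to-one assignment (tiny inputs only)."""
    n = len(cost)
    return min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))

# Hypothetical flattened shape contexts for three points on each shape;
# the second shape lists the same descriptors in a different order.
shape_a = [[4, 0, 1], [0, 5, 0], [1, 1, 3]]
shape_b = [[0, 5, 0], [1, 1, 3], [4, 0, 1]]
C = [[chi2_cost(g, h) for h in shape_b] for g in shape_a]
match = best_assignment(C)
print(match)
```

In `match`, entry i is the index of the point on the second shape assigned to point i on the first.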
Matching Example
model target
Synthetic Test Results
[Figure panels: Fish - deformation + noise; Fish - deformation + outliers. Methods compared: ICP, Shape Context, RPM]
Measuring Shape Similarity
• Image appearance around matched points
– color or gray-level window
– orientation
• Shape context differences at matched points
• Bending Energy
COIL Object Database
Editing: Prototypes
• Human Shape Perception
• Computational Needs for K-NN
Prototype Selection: COIL-20
MNIST Handwritten Digits
Handwritten Digit Recognition (test error rates)
• MNIST 60 000:
– linear: 12.0%
– 40 PCA + quad: 3.3%
– 1000 RBF + linear: 3.6%
– K-NN: 5%
– K-NN (deskewed): 2.4%
– K-NN (tangent dist.): 1.1%
– SVM: 1.1%
– LeNet 5: 0.95%
• MNIST 600 000 (distortions):
– LeNet 5: 0.8%
– SVM: 0.8%
– Boosted LeNet 4: 0.7%
• MNIST 20 000:
– K-NN, Shape Context matching: 0.63%
Results: Digit Recognition
1-NN classifier using: Shape context + 0.3 * bending + 1.6 * image appearance
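The combined distance and the 1-NN rule can be sketched in a few lines. The weights 0.3 and 1.6 come from the slide; the function names and the per-prototype distance terms are hypothetical, and computing those three terms (shape context cost, bending energy, appearance cost) is outside the scope of this example.

```python
def combined_distance(shape_context, bending, appearance, w_bend=0.3, w_app=1.6):
    """Weighted sum of the three dissimilarity terms named on the slide."""
    return shape_context + w_bend * bending + w_app * appearance

def classify_1nn(terms_per_prototype, labels):
    """Return the label of the prototype with the smallest combined distance."""
    distances = [combined_distance(*terms) for terms in terms_per_prototype]
    return labels[distances.index(min(distances))]

# Made-up (shape context, bending, appearance) terms between one test digit
# and three stored prototypes.
terms = [(1.0, 2.0, 0.5), (0.4, 0.5, 0.2), (0.9, 0.1, 0.9)]
predicted = classify_1nn(terms, labels=["3", "8", "5"])
print(predicted)
```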
Results: Digit Recognition (Detail)
Trademark Similarity
Future work..
• Indexing based on color/texture/shape features before correspondence matching
• Integrate segmentation and recognition
Computing cost on a Pentium PC
• Segmentation: 2 minutes / image (200x100)
• Matching: 0.2 sec / match (100 points)
Given a 10^4 speedup..
• 5K object categories/sec
• Humans can recognize 10K-100K objects, so we could be in the ballpark of human level vision by 2020.
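The 5K figure appears to follow from the matching cost and the ~10 views per object quoted earlier in the talk. The sketch below spells out that arithmetic; reading the number this way is an interpretation, and the function and parameter names are made up.

```python
def categories_per_second(sec_per_match=0.2, speedup=10**4, views_per_object=10):
    """Object categories that can be matched per second after the speedup."""
    matches_per_sec = speedup / sec_per_match  # 0.2 s/match is 5 matches/s today
    return matches_per_sec / views_per_object  # each category needs ~10 views

rate = categories_per_second()
print(rate)
```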