Levels of supervision for training object category models
![Page 1: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/1.jpg)
Unsupervised discovery of visual object class hierarchies
Josef Sivic (INRIA / ENS), Bryan Russell (MIT), Andrew Zisserman (Oxford), Alyosha Efros (CMU) and Bill Freeman (MIT)
![Page 2: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/2.jpg)
Levels of supervision for training object category models
• None? Images only
[Agarwal & Roth, Leibe & Schiele, Torralba et al., Shotton et al.]
[Barnard et al.][Csurka et al, Dorko & Schmid, Fergus et al., Opelt et al]
• Object label + segmentation
• Object label only
[Viola & Jones]
Can we learn about objects just by looking at images?
![Page 3: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/3.jpg)
Goal: Given a collection of unlabelled images, discover a hierarchy of visual object categories
Which images contain the same object(s)?
Where is the object in the image?
Organize objects into a visual hierarchy (tree).
![Page 4: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/4.jpg)
Review: Object discovery in the visual domain
[Sivic, Russell, Efros, Freeman, Zisserman, ICCV’05]
I. Represent an image as a bag-of-visual-words
II. Apply topic discovery methods to find objects in the corpus of images
Decompose image collection into objects common to all images and mixture coefficients specific to each image
Hofmann: Probabilistic latent semantic analysis
Blei et al.: Latent Dirichlet Allocation
![Page 5: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/5.jpg)
Topic discovery models
‘Flat’ topic structure – all topics are ‘available’ to all documents
d … documents (images)
w … visual words
z … topics (‘objects’)
Probabilistic Latent Semantic Analysis (pLSA) [Hofmann’99]
M documents
N words per document
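The pLSA decomposition can be illustrated with a minimal sketch. It assumes the standard pLSA form P(w|d) = Σ_z P(w|z) P(z|d); the function name and the toy numbers are hypothetical and not taken from the slides.

```python
import numpy as np

def plsa_word_distribution(p_w_given_z, p_z_given_d):
    """Return P(w|d) = sum_z P(w|z) P(z|d) for every document.

    p_w_given_z: array of shape (num_topics, vocab_size), rows sum to 1.
    p_z_given_d: array of shape (num_docs, num_topics), rows sum to 1.
    """
    return p_z_given_d @ p_w_given_z

# Toy example: 2 topics ('objects') over a 3-word vocabulary, 2 documents (images).
p_w_given_z = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.1, 0.8]])
p_z_given_d = np.array([[1.0, 0.0],   # image 0 contains only topic 0
                        [0.5, 0.5]])  # image 1 mixes both topics equally
p_w_given_d = plsa_word_distribution(p_w_given_z, p_z_given_d)
```

Every topic is available to every document here, which is the 'flat' structure the slide refers to.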
![Page 6: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/6.jpg)
Hierarchical topic models
• Topics organized in a tree
• Document is a superposition of topics along a single path
• Topics at internal nodes are shared by two or more paths
• The hope is that more specialized topics emerge as we descend the tree
c … paths
z … levels
[Hofmann’99, Blei et al. ’2004, Barnard et al.’01]
![Page 7: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/7.jpg)
![Page 8: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/8.jpg)
![Page 9: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/9.jpg)
![Page 10: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/10.jpg)
Hierarchical topic models
d … documents (images)
w … words
z … levels of the tree
c … paths in the tree
For each document:
Introduce a hidden variable c indicating the path in the tree
[Hofmann’99, Blei et al. ’2004, Barnard et al.’01]
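A rough sketch of "a document is a superposition of topics along a single path": once the path c is fixed, the word distribution of a document is a mixture, weighted by P(z|d), of the topics P(w|z,c) at the nodes on that path. The node names, the two-level tree and the numbers below are hypothetical.

```python
import numpy as np

def path_word_distribution(path, level_weights, node_topics):
    """Mix the topics of the nodes on one root-to-leaf path.

    path: list of node ids, one per level (root first).
    level_weights: P(z|d), one weight per level, summing to 1.
    node_topics: dict node id -> word distribution P(w|z,c) at that node.
    """
    topics = np.stack([node_topics[node] for node in path])
    return np.asarray(level_weights) @ topics

# Hypothetical tree: a shared root topic and two more specialized leaf topics.
node_topics = {
    "root": np.array([0.25, 0.25, 0.25, 0.25]),
    "left": np.array([0.70, 0.10, 0.10, 0.10]),
    "right": np.array([0.10, 0.10, 0.10, 0.70]),
}
p_left = path_word_distribution(["root", "left"], [0.5, 0.5], node_topics)
p_right = path_word_distribution(["root", "right"], [0.5, 0.5], node_topics)
```

Both documents share the root topic; only the leaf topic differs, which is what lets internal nodes capture structure common to several paths.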
![Page 11: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/11.jpg)
Hierarchical Latent Dirichlet Allocation (hLDA)
d … documents (images)
w … words
z … levels of the tree
c … paths in the tree
Treat P(z|d) and P(w|z,c) as random variables sampled from Dirichlet priors
[Blei et al. ’2004]
![Page 12: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/12.jpg)
Hierarchical Latent Dirichlet Allocation (hLDA)
d … documents (images)
w … words
z … levels of the tree
c … paths in the tree
[Blei et al. ’2004]
Tree structure is not fixed: assignments of documents to paths, c_j, are sampled from the nested Chinese restaurant process (nCRP) prior
![Page 13: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/13.jpg)
Nested Chinese restaurant process (nCRP)
[Blei et al.’04]
CRP: customers sit in a restaurant with an unlimited number of tables
Nested CRP: extension of CRP to tree structures
• Prior on assignments of documents to paths in the tree (of fixed depth L)
• Each internal node corresponds to a CRP, each table points to a child node
Example: Tree of depth 3 with 4 documents
[Figure: tree with nodes A, B, C, D, E, F; customer labels at the nodes: 1,2,3,4; 1,2,3; 1,2; 3; 4; 4]
Sample path for the 5th document (5th customer arriving)
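Sampling a path for a newly arriving document can be sketched as follows. The sketch assumes the usual CRP rule at every internal node (an existing child is chosen in proportion to the number of documents already passing through it, a new child in proportion to a concentration parameter, called `gamma` here). The data structures and node naming are hypothetical.

```python
import random

def sample_ncrp_path(children, counts, depth, gamma, rng):
    """Sample a root-to-leaf path of length `depth` and update the tree.

    children: dict node -> list of child nodes (mutated when a branch is created).
    counts: dict node -> number of documents whose path passes through it.
    gamma: CRP concentration; larger values open new branches more often.
    """
    node = "root"
    path = [node]
    counts[node] = counts.get(node, 0) + 1
    for _ in range(depth - 1):
        kids = children.setdefault(node, [])
        # Existing child k has weight n_k, a brand-new child has weight gamma.
        weights = [counts[k] for k in kids] + [gamma]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(kids):
            # The customer sits at a new table: create a new child node.
            new_child = f"{node}/{len(kids)}"
            kids.append(new_child)
            counts[new_child] = 0
        node = kids[choice]
        counts[node] += 1
        path.append(node)
    return path

rng = random.Random(0)
children, counts = {}, {}
# Five documents arrive one after another; the tree grows as needed.
paths = [sample_ncrp_path(children, counts, depth=3, gamma=1.0, rng=rng)
         for _ in range(5)]
```

Because new children can be created at any level, the number of branches is not fixed in advance, only the depth.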
![Page 14: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/14.jpg)
hLDA model fitting
Use a Gibbs sampler to generate samples from P(z,c,T|w)
c … paths
z … levels
For a given document j:
• sample zj while keeping cj fixed (LDA along one path)
• sample cj while keeping zj fixed (can delete/create branches)
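The first step (sampling z_j with c_j fixed) can be sketched as a collapsed Gibbs sweep over the level assignments of one document. This is an illustration under assumptions, not the authors' implementation: it uses the standard collapsed LDA conditional with symmetric Dirichlet hyperparameters, here named `alpha` and `eta`, and hypothetical count arrays. The path-resampling step is not shown.

```python
import numpy as np

def resample_levels(words, levels, path, doc_level_counts, node_word_counts,
                    alpha, eta, rng):
    """One Gibbs sweep over the level assignments of a single document.

    words: word ids of the document; levels: current level of each word (mutated).
    path: node index at each level for this document (kept fixed).
    doc_level_counts: (num_levels,) counts of this document's words per level.
    node_word_counts: (num_nodes, vocab_size) word counts per tree node.
    """
    vocab_size = node_word_counts.shape[1]
    path = np.asarray(path)
    for n, w in enumerate(words):
        old = levels[n]
        # Remove the word from the counts before computing its conditional.
        doc_level_counts[old] -= 1
        node_word_counts[path[old], w] -= 1
        word_term = (node_word_counts[path, w] + eta) / (
            node_word_counts[path].sum(axis=1) + vocab_size * eta)
        probs = (doc_level_counts + alpha) * word_term
        probs /= probs.sum()
        new = rng.choice(len(path), p=probs)
        # Add the word back under its newly sampled level.
        levels[n] = new
        doc_level_counts[new] += 1
        node_word_counts[path[new], w] += 1
    return levels

rng = np.random.default_rng(0)
words = np.array([0, 0, 1, 2, 2, 2])
levels = np.array([0, 1, 0, 1, 0, 1])
path = [0, 2]  # root is node 0, the document's leaf is node 2 (node 1 is unused)
doc_level_counts = np.bincount(levels, minlength=2).astype(float)
node_word_counts = np.zeros((3, 3))
for w, l in zip(words, levels):
    node_word_counts[path[l], w] += 1
resample_levels(words, levels, path, doc_level_counts, node_word_counts,
                alpha=1.0, eta=0.1, rng=rng)
```

Only nodes on the document's own path are touched, which is why the slide describes this step as LDA along one path.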
![Page 15: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/15.jpg)
Image representation – ‘dense’ visual words
Extract circular regions on a regular grid, at multiple scales
Represent each region by a SIFT descriptor
Cf. [Agarwal and Triggs’05, Bosch and Zisserman’06]
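A small sketch of the dense sampling step, assuming regions are kept only if they lie fully inside the image. The grid spacing and radii are hypothetical values chosen for illustration; computing the SIFT descriptor of each region is not shown.

```python
def dense_circular_regions(width, height, spacing, radii):
    """Return (x, y, radius) for circles on a regular grid at several scales.

    Only circles that lie fully inside the image are kept.
    """
    regions = []
    for radius in radii:
        for y in range(radius, height - radius + 1, spacing):
            for x in range(radius, width - radius + 1, spacing):
                regions.append((x, y, radius))
    return regions

# Hypothetical 64x48 image, grid step of 8 pixels, two scales.
regions = dense_circular_regions(width=64, height=48, spacing=8, radii=(4, 8))
```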
![Page 16: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/16.jpg)
Build a visual vocabulary
Quantize descriptors using k-means
K = 10 + 1; K = 100 + 1
Visualization by ‘average’ words from the training set (single scale)
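Vocabulary building and quantization can be sketched with plain Lloyd's k-means in NumPy. Random vectors stand in for real SIFT descriptors, and the function names, iteration count and initialization are assumptions made for illustration.

```python
import numpy as np

def quantize(descriptors, centres):
    """Map each descriptor to the index of its nearest centre (visual word)."""
    dists = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(descriptors, k, num_iters, rng):
    """Plain Lloyd's k-means; returns the k cluster centres."""
    start = rng.choice(len(descriptors), size=k, replace=False)
    centres = descriptors[start].copy()
    for _ in range(num_iters):
        labels = quantize(descriptors, centres)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster empties
                centres[j] = members.mean(axis=0)
    return centres

def bag_of_words(descriptors, centres):
    """Histogram of visual-word counts for one image."""
    return np.bincount(quantize(descriptors, centres), minlength=len(centres))

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 128))  # stand-in for 128-D SIFT descriptors
centres = kmeans(descriptors, k=10, num_iters=5, rng=rng)
histogram = bag_of_words(descriptors[:50], centres)  # one image's 50 regions
```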
![Page 17: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/17.jpg)
Vocabulary with varying degree of spatial and appearance granularity
Vocabulary: appearance granularity, spatial granularity
V1: K1 = 11, bag of words
V2: K2 = 101, bag of words
V3: K3 = 101, 3x3 grid
V4: K4 = 101, 5x5 grid
Combined vocabulary: K = 11 + 101 + 909 + 2,525 = 3,546 visual words
Cf. Fergus et al.’05, Lazebnik et al.’06
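The combined vocabulary size follows from replicating each appearance vocabulary over its spatial grid cells (101 × 9 = 909 and 101 × 25 = 2,525). The sketch below reproduces that arithmetic; the cell-major indexing scheme in `combined_word_index` is an assumption for illustration, not something the slides specify.

```python
# (appearance words K, grid cells per side) for V1..V4, as listed on the slide.
VOCABULARIES = [(11, 1), (101, 1), (101, 3), (101, 5)]

def vocabulary_sizes(vocabularies):
    """Each vocabulary has K appearance words replicated over grid*grid cells."""
    return [k * grid * grid for k, grid in vocabularies]

def combined_word_index(vocab, appearance_word, row, col,
                        vocabularies=VOCABULARIES):
    """Index of a visual word in the concatenated vocabulary.

    vocab is 0-based (0 for V1); row and col index the spatial grid cell.
    """
    sizes = vocabulary_sizes(vocabularies)
    k, grid = vocabularies[vocab]
    offset = sum(sizes[:vocab])
    return offset + (row * grid + col) * k + appearance_word

sizes = vocabulary_sizes(VOCABULARIES)
total = sum(sizes)
last_index = combined_word_index(3, 100, 4, 4)
```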
![Page 18: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/18.jpg)
Example I. – cropped LabelMe images
• 125 images, 5 object classes:
cars side, cars rear, switches, traffic lights, computer screens
• Images cropped to contain mostly the object, and normalized for scale
![Page 19: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/19.jpg)
Example I. – cropped LabelMe images
• Learn 4-level tree hierarchy
• Initialization:
• c with a random tree (125 documents) sampled from nCRP (γ = 1)
• z based on vocabulary granularity
c … paths
z … levels
V1: K1 = 11, bag of words
V2: K2 = 101, bag of words
V3: K3 = 101, 3x3 grid
V4: K4 = 101, 5x5 grid
![Page 20: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/20.jpg)
Example I. – cropped LabelMe images
Learnt object hierarchy
Nodes visualized by average images
Example images assigned to different paths
![Page 21: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/21.jpg)
Quality of the tree?
For each node t and class i measure the classification score:
$$\text{score}(t, i) = \frac{|GT_i \cap N_t|}{|GT_i \cup N_t|}$$
GT_i … ground truth images in class i
N_t … images assigned to a path passing through t
Good score:
- All images of class i assigned to node t (high recall)
- No images of other classes assigned to t (high precision)
Score for class i:
![Page 22: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/22.jpg)
Quality of the tree?
Score for class i:
Example: traffic lights, node 2
![Page 23: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/23.jpg)
Quality of the tree?
Score for class i:
Example: switches, node 9
![Page 24: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/24.jpg)
Quality of the tree?
Overall score:
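The scoring can be sketched as follows. The node/class score is the intersection over union of the ground truth images of class i and the images assigned to a path through node t, as the slides state. The formulas for the per-class and overall scores did not survive in this transcript, so the sketch assumes the per-class score is the best score over all nodes and the overall score is the mean over classes. All names and the toy data are hypothetical.

```python
def node_class_score(ground_truth, assigned):
    """Intersection over union of two sets of image ids."""
    union = ground_truth | assigned
    return len(ground_truth & assigned) / len(union) if union else 0.0

def tree_scores(class_images, node_images):
    """Per-class best-node score and their mean (both are assumptions).

    class_images: dict class -> set of ground-truth image ids.
    node_images: dict node -> set of image ids whose path passes through it.
    """
    per_class = {
        cls: max(node_class_score(gt, imgs) for imgs in node_images.values())
        for cls, gt in class_images.items()
    }
    overall = sum(per_class.values()) / len(per_class)
    return per_class, overall

# Toy tree: the root sees all images, two children split them imperfectly.
class_images = {"cars": {1, 2, 3}, "switches": {4, 5}}
node_images = {"root": {1, 2, 3, 4, 5}, "a": {1, 2}, "b": {3, 4, 5}}
per_class, overall = tree_scores(class_images, node_images)
```

In the toy data, image 3 is a car assigned to the 'switches' branch, so both classes lose recall or precision and neither reaches a perfect score.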
![Page 25: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/25.jpg)
Example II. – MSRC b1 dataset
240 images, 9 object classes, pixel-wise labelled
Cars
Airplanes
Cows
Buildings
Faces
Grass
Trees
Bicycles
Sky
![Page 26: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/26.jpg)
Example II. – MSRC b1 dataset
Experiment 1: Known object mask (manual), unknown class labels
- More objects and images (than Ex. I)
- Measure classification performance
- Compare with the standard ‘flat’ LDA
Experiment 2: Both segmentation and class labels unknown (just images)
- ‘Unsupervised discovery’ scenario
- Employ the ‘multiple segmentations’ framework of [Russell et al.,’06]
- Measure segmentation accuracy
![Page 27: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/27.jpg)
MSRC b1 dataset – known object mask
Learnt tree visualized by average images; node size indicates the number of images
Some nodes visualized by top 3 images (sorted by KL divergence)
![Page 28: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/28.jpg)
MSRC b1 dataset – known object mask
Classification performance: comparison with ‘flat’ LDA
Flat LDA:
Estimate mixing weights for each topic i
Assign each image to a single topic:
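The assignment rule itself is missing from this transcript. A plausible reading, shown here only as an assumption, is that each image goes to the topic with the largest estimated mixing weight. The weights below are made up.

```python
import numpy as np

def assign_images_to_topics(mixing_weights):
    """Hard-assign each image to the topic with the largest mixing weight.

    mixing_weights: array of shape (num_images, num_topics).
    """
    return np.asarray(mixing_weights).argmax(axis=1)

# Hypothetical mixing weights for 3 images over 3 topics.
mixing_weights = np.array([[0.8, 0.1, 0.1],
                           [0.2, 0.3, 0.5],
                           [0.4, 0.5, 0.1]])
assignments = assign_images_to_topics(mixing_weights)
```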
![Page 29: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/29.jpg)
MSRC b1 dataset – unknown object mask and image labels
![Page 30: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/30.jpg)
Multiple segmentation approach [Russell et al.’06] (review)
1) Produce multiple segmentations of each image
2) Discover clusters of similar segments (here use hLDA)
3) Score segments by how well they fit object cluster
[Figure: images, their multiple segmentations, and discovered clusters (cars, buildings)]
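Step 3 can be sketched by ranking segments by KL divergence between a segment's visual-word histogram and a cluster's word distribution, in line with the "sorted by KL divergence" note on an earlier slide. The direction of the divergence, the smoothing constant and the toy data are assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) for two discrete distributions, smoothed to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def rank_segments(segment_histograms, topic_word_dist):
    """Return segment indices sorted from best to worst fit (lowest KL first)."""
    scores = [kl_divergence(h, topic_word_dist) for h in segment_histograms]
    return sorted(range(len(scores)), key=scores.__getitem__)

topic = [0.6, 0.3, 0.1]                       # word distribution of one cluster
segments = [[1, 1, 8], [6, 3, 1], [4, 4, 2]]  # visual-word counts per segment
order = rank_segments(segments, topic)
```

The second segment has exactly the cluster's word proportions, so it ranks first.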
![Page 31: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/31.jpg)
Road/asphalt
![Page 32: Levels of supervision for training object category models](https://reader035.fdocuments.us/reader035/viewer/2022070411/568148c6550346895db5e537/html5/thumbnails/32.jpg)
Conclusions
• Investigated learning visual object hierarchies using hLDA
• The number of topics/objects and the structure of the tree are estimated automatically from the data
• Topic/object hierarchy may improve classification performance