A probabilistic model for recursive factorized image
features.
Sergey Karayev, Mario Fritz, Sanja Fidler, Trevor Darrell
Outline
• Motivation:
✴Distributed coding of local image features
✴Hierarchical models
✴Bayesian inference
• Our model: Recursive LDA
• Evaluation
Local Features
• Gradient energy histograms by orientation and grid position in local patches.
• Coded and used in bag-of-words or spatial model classifiers.
Feature Coding
• Traditionally vector quantized as visual words.
• But coding as a mixture of components, such as in sparse coding, is empirically better (Yang et al. 2009).
Local codes vs. Dense codes (picture from Bruno Olshausen):
• Dense codes (ASCII): + high combinatorial capacity (2^N); - difficult to read out
• Sparse, distributed codes: + decent combinatorial capacity (~N^K); + still easy to read out
• Local codes (grandmother cells): - low combinatorial capacity (N); + easy to read out
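A tiny numeric illustration of these capacity figures (a sketch; N and K below are illustrative choices, not numbers from the talk):

```python
from math import comb

N, K = 100, 5  # illustrative: 100 coding units, a sparse code keeps K of them active

dense_capacity = 2 ** N        # any subset of units may be active
sparse_capacity = comb(N, K)   # choose which K of N units are active (grows roughly like N^K for small K)
local_capacity = N             # one dedicated "grandmother" unit per pattern

print(f"dense:  {dense_capacity:.2e}")
print(f"sparse: {sparse_capacity:.2e}")
print(f"local:  {local_capacity}")
```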
Another motivation: additive image formation
Fritz et al. An Additive Latent Feature Model for Transparent Object Recognition. NIPS (2009).
Outline
• Motivation:
✴Distributed coding of local image features
✴Hierarchical models
✴Bayesian inference
• Our model: Recursive LDA
• Evaluation
Hierarchies
• Biological evidence for increasing spatial support and complexity along the visual pathway.
• Local features not robust to ambiguities. Higher layers can help resolve.
• Efficient parametrization possible due to sharing of lower-layer components.
Past Work
• HMAX models (Riesenhuber and Poggio 1999, Mutch and Lowe 2008)
• Convolutional networks (Ranzato et al. 2007, Ahmed et al. 2009)
• Deep Belief Nets (Hinton 2007, Lee et al. 2009)
• Hyperfeatures (Agarwal and Triggs 2008)
• Fragment-based hierarchies (Ullman 2007)
• Stochastic grammars (Zhu and Mumford 2006)
• Compositional object representations (Fidler and Leonardis 2007, Zhu et al. 2008)
HMAX
[Excerpt from Riesenhuber and Poggio (1999): pooling with a nonlinear MAX operation, in which a cell's response is determined by its best-matching afferent, yields invariance (e.g., to stimulus size and position) where linear summation would mix signals from different stimuli, and is consistent with IT responses to paired stimuli. Figure 2 of that paper sketches the model: a hierarchy alternating linear template-matching 'S' units (Simple cells S1, Composite feature cells S2) with nonlinear 'C' pooling units (Complex cells C1, Complex composite cells C2) that perform the MAX operation, feeding view-tuned cells through a weighted sum. Figure 3 compares MAX and SUM pooling against the highly nonlinear shape tuning of an IT neuron.]
Riesenhuber and Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience (1999).
Convolutional Deep Belief Nets
• Stacked layers, each consisting of feature extraction, transformation, and pooling.
[Excerpt from "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations": a convolutional RBM has K groups of hidden (detection) units that share filter weights W^k, a per-group bias b_k, and a shared visible bias c, with P(v,h) = exp(-E(v,h))/Z and energy

E(v,h) = - Σ_k h^k • (W̃^k ∗ v) - Σ_k b_k Σ_{i,j} h^k_{i,j} - c Σ_{i,j} v_{i,j}.

Block Gibbs sampling uses P(h^k_{ij} = 1 | v) = σ((W̃^k ∗ v)_{ij} + b_k) and P(v_{ij} = 1 | h) = σ((Σ_k W^k ∗ h^k)_{ij} + c), where σ is the sigmoid function. To stack CRBMs into a deep network, probabilistic max-pooling ties each C × C block of detection units to a single pooling unit: at most one detection unit in the block may be on, and the pooling unit is on if and only if one of them is. Each block is therefore a single multinomial variable over C² + 1 outcomes, sampled from the bottom-up signal I(h^k_{ij}) = b_k + (W̃^k ∗ v)_{ij}. Figure 1 of that paper shows a CRBM with visible, detection, and pooling layers.]
Lee et al. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings of the 26th Annual International Conference on Machine Learning (2009)
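As a concrete reading of the probabilistic max-pooling step summarized above, here is a small numpy sketch of sampling one detection group; the array shapes, the pooling factor C, and the use of scipy's correlate2d for the flipped-filter convolution are our assumptions, not details from the slides.

```python
import numpy as np
from scipy.signal import correlate2d

def sample_detection_and_pooling(v, W, b, C=2, rng=np.random.default_rng(0)):
    """Sample one CRBM group's detection and pooling units with probabilistic
    max-pooling: within each C x C block at most one detection unit fires,
    and the pooling unit is on iff some detection unit in its block fires."""
    I = correlate2d(v, W, mode="valid") + b      # bottom-up signal I(h_ij)
    NH = I.shape[0]                              # assumes a square layer, NH divisible by C
    h = np.zeros_like(I)
    p = np.zeros((NH // C, NH // C))
    for bi in range(0, NH, C):
        for bj in range(0, NH, C):
            # C^2 + 1 outcomes: each detection unit on, or the whole block off (logit 0)
            logits = np.append(I[bi:bi + C, bj:bj + C].ravel(), 0.0)
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            choice = rng.choice(len(probs), p=probs)
            if choice < C * C:                   # one detection unit turns on
                h[bi + choice // C, bj + choice % C] = 1
                p[bi // C, bj // C] = 1          # the pooling unit follows
    return h, p
```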
Hyperfeatures
[Excerpt from Agarwal and Triggs (2008), Fig. 1: a hyperfeature stack is built by computing local descriptors over a multiscale pyramid of overlapping patches, vector quantizing them against a level-0 codebook, and accumulating local histograms of codebook memberships over position-scale neighborhoods to form the level-1 features; the process repeats at higher levels, and per-level global histograms are fed to a classifier. Unlike the convolutional neural models (neocognitron, CNNs, HMAX), hyperfeature training is unsupervised and relies on nonlinear descriptor coding followed by an accumulate-and-normalize step; unlike top-down fragment or pyramid-matching schemes, lower-level features feed into the computation of higher-level ones.]
Agarwal and Triggs. Multilevel Image Coding with Hyperfeatures. International Journal of Computer Vision (2008).
Compositional Representations
[Figures from Fidler and Leonardis (2007): mean reconstructions of the learned parts at layers L2 through L5, including L4 and L5 parts for faces, cars, and mugs, and detection of L4 and L5 parts, which proceeds bottom-up with active top-layer parts traced down to the image.]
Fidler and Leonardis. Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts. CVPR (2007)
Outline
• Motivation:
✴Distributed coding of local image features
✴Hierarchical models
✴Bayesian inference
• Our model: Recursive LDA
• Evaluation
Bayesian inference
• The human visual cortex deals with inherently ambiguous data.
• Role of priors and inference (Lee and Mumford 2003).
Kersten et al. Object perception as Bayesian inference. Annual Review of Psychology (2004)
• But most hierarchical approaches do both learning and inference only from the bottom-up.
[Diagrams: in the typical pipeline, an L0 representation is learned from patches and an L1 representation is then learned from L0 activations, with separate inference at each layer; what we want instead is a joint representation with joint inference across L0 and L1.]
What we would like
• Distributed coding of local features in a hierarchical model that would allow full inference.
Outline
• Motivation:
✴Distributed coding of local image features
✴Hierarchical models
✴Bayesian inference
• Our model: Recursive LDA
• Evaluation
Our model: rLDA
• Based on Latent Dirichlet Allocation (LDA).
• Multiple layers, with increasing spatial support.
• Learns representation jointly across layers.
[Diagram: the two layers form a joint representation with joint inference, rather than separate per-layer learning.]
Latent Dirichlet Allocation
• Bayesian multinomial mixture model originally formulated for text analysis.
Blei et al. Latent Dirichlet allocation. Journal of Machine Learning Research (2003)
[Graphical model of LDA (after Griffiths): hyperparameters α and β, per-document topic proportions θ, topic assignments z and words w for the Nd words in each of D documents, and topic-word multinomials φ for the T topics.]
Latent Dirichlet Allocation
Corpus-wide, the multinomial distributions over words (the topics) are sampled:
• φ ∼ Dir(β)
For each document d ∈ {1, . . . , D}, mixing proportions θ(d) are sampled according to:
• θ(d) ∼ Dir(α)
And Nd words w are sampled according to:
• z ∼ Mult(θ(d)): sample a topic given the document-topic mixing proportions
• w ∼ Mult(φ(z)): sample a word given the topic and the topic-word multinomials
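A minimal numpy sketch of this generative process (vocabulary size, topic count, number of documents, and document length below are illustrative, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, D, N_d = 1000, 20, 5, 50    # vocabulary size, topics, documents, words per document
alpha, beta = 0.5, 0.1            # symmetric Dirichlet hyperparameters (illustrative)

# Corpus-wide: topic-word multinomials phi ~ Dir(beta)
phi = rng.dirichlet(np.full(V, beta), size=T)      # shape (T, V)

docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(T, alpha))     # theta(d) ~ Dir(alpha)
    words = []
    for _ in range(N_d):
        z = rng.choice(T, p=theta_d)               # z ~ Mult(theta(d))
        w = rng.choice(V, p=phi[z])                # w ~ Mult(phi(z))
        words.append(w)
    docs.append(words)
```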
LDA in vision
• Past work has applied LDA to visual words, with topics being distributions over them.
[Figure 5 plots: per-category theme and codeword distributions for the 13 scene categories (highway, inside of city, tall buildings, street, suburb, forest, coast, mountain, open country, bedroom, kitchen, living room, office).]
Figure 5. Internal structure of the models learnt for each category. Each row represents one category. The left panel shows the distribution of the 40 intermediate themes. The right panel shows the distribution of codewords as well as the appearance of 10 codewords selected from the top 20 most likely codewords for this category model.
Figure 6. Examples of testing images for each category. Each row is for one category. The first 3 columns on the left show 3 examples of correctly recognized images; the last column on the right shows an example of an incorrectly recognized image. Superimposed on each image, we show samples of patches that belong to the most significant set of codewords given the category model. Note that for the incorrectly categorized images, the significant codewords of the model tend to occur less often. (This figure is best viewed in color.)
Fei-Fei and Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR (2005)
LDA-SIFT
[Excerpt from Lowe (2004), "Distinctive Image Features from Scale-Invariant Keypoints": the SIFT descriptor is built by sampling gradient magnitudes and orientations around a keypoint, weighting them with a Gaussian window, and accumulating them (with trilinear interpolation) into orientation histograms over 4 × 4 subregions with 8 orientation bins each, giving a 4 × 4 × 8 = 128-element vector that is normalized to unit length to reduce the effects of illumination change.]
[Illustration: a SIFT descriptor represented as token counts over words.]
How training works
• (quantization, extracting patches, inference illustration)
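A sketch of the quantization step alluded to here: a SIFT descriptor (4 × 4 spatial bins × 8 orientations) is turned into integer counts of (position, orientation) word tokens so that a patch can be treated as a small document. The exact scaling is our assumption, loosely following the "max value normalized to 100 tokens" note in the evaluation section.

```python
import numpy as np

def sift_to_tokens(descriptor, max_tokens=100):
    """Convert a 128-dim SIFT descriptor into discrete (position, orientation)
    word tokens with integer counts (assumed normalization scheme)."""
    hist = np.asarray(descriptor, dtype=float).reshape(4, 4, 8)
    # Scale so the largest bin corresponds to max_tokens tokens.
    scale = max_tokens / hist.max() if hist.max() > 0 else 0.0
    counts = np.rint(hist * scale).astype(int)
    tokens = []
    for x in range(4):
        for y in range(4):
            for o in range(8):
                # each token: (spatial position on the L0 grid, orientation word)
                tokens.extend([((x, y), o)] * counts[x, y, o])
    return tokens
```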
Topics (subset of 1024 topics)
[Each topic is visualized as a SIFT average image.]
[Pipeline diagram: the L0 representation is learned from patches and its activations are inferred; the L1 representation is then learned from the L0 activations.]
Stacking two layers of LDA
[Graphical model for the stacked two-layer model: hyperparameters α, β0, β1; per-document proportions θ; assignments z1, z0 with positions x1, x0; observed words w; multinomials φ1, φ0 and spatial distributions χ; plates over D documents, N(d) words, and T1, T0 components.]
[(a) Recursive LDA model concept: an L1 patch covers a spatial grid X1 of L0 components (T0 of them); each L0 patch covers a spatial grid X0 over a vocabulary of V words; each layer has spatial distributions χ and mixture distributions φ.]
[(b) Graphical model for RLDA: hyperparameters α, β0, β1; per-document proportions θ; assignments z1, z0; positions x1, x0; words w; multinomials φ1, φ0; spatial distributions χ; plates over D documents, N(d) words, and T1, T0 components.]
Figure 2: Recursive LDA model.
3. Recursive LDA Model
In contrast to previous approaches to latent factor modeling, we formulate a layered approach that derives progressively higher-level spatial distributions based on the underlying latent activations of the lower layer.

For clarity, we will first derive this model for two layers (L0 and L1) as shown in Figure 2a, but will show how it generalizes to an arbitrary number of layers. For illustration purposes, the reader may visualize a particular instance of our model that observes SIFT descriptors (such as the L0 patch in Figure 2a) as discrete spatial distributions on the L0 layer. In this particular case, the L0 layer models the distribution of words from a vocabulary of size V = 8 over a spatial grid X0 of size 4 × 4. The vocabulary represents the 8 gradient orientations used by the SIFT descriptor, which we interpret as words. The frequency of these words is then the histogram energy in the corresponding bin of the SIFT descriptor.

The mixture model of T0 components is parameterized by multinomial parameters φ0 ∈ R^(T0×X0×V); in our particular example φ0 ∈ R^(T0×(4×4)×8). The L1 layer aggregates the mixing proportions obtained at layer L0 over a spatial grid X1 into an L1 patch. In contrast to the L0 layer, L1 models a spatial distribution over L0 components. The mixture model of T1 components at layer L1 is parameterized by multinomial parameters φ1 ∈ R^(T1×X1×T0).

The spatial grid is considered to be deterministic at each layer, and position variables for each word x are observed. However, the distribution of words/topics over the grid is not uniform and may vary across different components. We thus have to introduce a spatial (multinomial) distribution χ at each layer, which is computed from the mixture distribution φ. This is needed to define a full generative model.

The base model for a single layer (with T1 = 0) is equivalent to LDA, which is therefore a special case of our nested approach.
3.1. Generative Process

Given symmetric Dirichlet priors α, β0, β1 and a fixed choice of the number of mixture components T0 and T1 for layers L0 and L1 respectively, we define the following generative process, which is also illustrated in Figure 2b. Mixture distributions are sampled globally according to:
• φ1 ∼ Dir(β1) and φ0 ∼ Dir(β0): sample L1 and L0 multinomial parameters
• χ1 ← φ1 and χ0 ← φ0: compute spatial distributions from mixture distributions
For each document d ∈ {1, . . . , D}, top-level mixing proportions θ(d) are sampled according to:
• θ(d) ∼ Dir(α): sample top-level mixing proportions
For each document d, N(d) words w are sampled according to:
• z1 ∼ Mult(θ(d)): sample an L1 mixture component
• x1 ∼ Mult(χ1^(z1,·)): sample a spatial position on L1 given z1
• z0 ∼ Mult(φ1^(z1,x1,·)): sample an L0 mixture component given z1 and x1 from L1
• x0 ∼ Mult(χ0^(z0,·)): sample a spatial position on L0 given z0
• w ∼ Mult(φ0^(z0,x0,·)): sample a word given z0 and x0
According to the proposed generative process, the joint distribution of the model variables given the hyperparameters can be marginalized over θ, φ, and χ; the resulting conditional probabilities are given in Figure 3.
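A minimal numpy sketch of sampling one document (one L1 patch) from the two-layer generative process above. The L1 grid size, the hyperparameter values, and the way χ is derived from φ (here, by factoring a joint position-word distribution into a spatial marginal and a conditional) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8                  # L0 vocabulary: 8 gradient orientations (SIFT)
X0, X1 = 16, 16        # 4x4 spatial grids at each layer (X1 size assumed)
T0, T1 = 128, 128      # number of components at L0 and L1
alpha, beta0, beta1 = 0.5, 0.1, 0.1
N_d = 200              # number of word tokens in one document

# Each L0 topic: joint distribution over (position, word), factored into a
# spatial part chi0 and a per-position word distribution phi0.
joint0 = rng.dirichlet(np.full(X0 * V, beta0), size=T0).reshape(T0, X0, V)
chi0 = joint0.sum(axis=2)                           # T0 x X0
phi0 = joint0 / joint0.sum(axis=2, keepdims=True)   # T0 x X0 x V

# Same construction one level up: L1 topics are distributions over (position, L0 topic).
joint1 = rng.dirichlet(np.full(X1 * T0, beta1), size=T1).reshape(T1, X1, T0)
chi1 = joint1.sum(axis=2)                           # T1 x X1
phi1 = joint1 / joint1.sum(axis=2, keepdims=True)   # T1 x X1 x T0

theta_d = rng.dirichlet(np.full(T1, alpha))         # document-level mixing proportions

tokens = []
for _ in range(N_d):
    z1 = rng.choice(T1, p=theta_d)        # L1 component
    x1 = rng.choice(X1, p=chi1[z1])       # spatial position on the L1 grid
    z0 = rng.choice(T0, p=phi1[z1, x1])   # L0 component given (z1, x1)
    x0 = rng.choice(X0, p=chi0[z0])       # spatial position on the L0 grid
    w = rng.choice(V, p=phi0[z0, x0])     # observed word (gradient orientation)
    tokens.append((x1, x0, w))
```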
Recursive LDA
Inference scheme
• Gibbs sampling: sequential updates of random variables with all others held constant.
• Linear topic response for initialization.
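The slides do not spell out the rLDA sampling equations; as a reference point, below is the standard collapsed Gibbs update for plain LDA, which the recursive model extends by resampling the assignment pair (z1, z0) for each token. This is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def gibbs_sweep(docs, z, n_dt, n_tw, n_t, alpha, beta, V, rng):
    """One collapsed Gibbs sweep for plain LDA: resample each token's topic with
    all other assignments fixed. Count tables: n_dt[d,t] document-topic,
    n_tw[t,w] topic-word, n_t[t] topic totals."""
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            t = z[d][i]
            # remove this token's current assignment from the counts
            n_dt[d, t] -= 1
            n_tw[t, w] -= 1
            n_t[t] -= 1
            # p(z_i = t | rest) is proportional to (n_dt + alpha) * (n_tw + beta) / (n_t + V*beta)
            p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
            t = rng.choice(len(p), p=p / p.sum())
            z[d][i] = t
            n_dt[d, t] += 1
            n_tw[t, w] += 1
            n_t[t] += 1
```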
Outline
• Motivation:
✴Distributed coding of local image features
✴Hierarchical models
✴Value of Bayesian inference
• Our model: Recursive LDA
• Evaluation
Evaluation
• 16px SIFT, extracted densely every 6px; max value normalized to 100 tokens
• Three conditions:
✴Single-layer LDA
✴Feed-forward two-layer LDA (FLDA)
✴Recursive two-layer LDA (RLDA)
RLDA > FLDA > LDA
Table 2: Results for different implementations of our model with 128 components at L0 and 128 components at L1 (Caltech-101 accuracy with 15 and 30 training images per class).

Model   Basis size   Layer(s) used   15             30
LDA     128          "bottom"        52.3 ± 0.5%    58.7 ± 1.1%
RLDA    128t/128b    bottom          55.2 ± 0.3%    62.6 ± 0.9%
LDA     128          "top"           53.7 ± 0.4%    60.5 ± 1.0%
FLDA    128t/128b    top             55.4 ± 0.5%    61.3 ± 1.3%
RLDA    128t/128b    top             59.3 ± 0.3%    66.0 ± 1.2%
FLDA    128t/128b    both            57.8 ± 0.8%    64.2 ± 1.0%
RLDA    128t/128b    both            61.9 ± 0.3%    68.3 ± 0.7%
[Figure 4 plots: classification accuracy vs. number of training examples on Caltech-101.]
Figure 4: Comparison of classification rates on Caltech-101 of the one-layer model (LDA-SIFT) with the feed-forward (FLDA) and full generative (RLDA) two-layer models for different numbers of training examples, all trained with 128 topics at L0 and 128 at L1. We also compare to the stacked L0 and L1 layers for both FLDA and RLDA, which performed the best.
Both two-layer models improved on the one-layer LDA-SIFT model trained with 128 topics, obtaining respective classification rates of 61.3% and 66.0%, compared to the LDA-SIFT result of 58.7%. Detailed results across different numbers of training examples are shown in Figure 4a. The curves show that the RLDA model consistently outperforms the feed-forward FLDA model, which in turn outperforms the one-layer LDA-SIFT model. This is due to the fact that RLDA has learned more complex spatial structures (the learned vocabularies are presented in Figure 5).

To show that the increased performance of the second layer is not only due to the larger spatial support of the features, we also ran the baseline LDA on the bigger, 16 × 16 grid, which is the size of the spatial support of the RLDA and FLDA models. The performance, reported in Table 2 under LDA "top", is 60.6%, which is less than FLDA and RLDA and further demonstrates the necessity of multi-layered representations.

We also tested the classification performance of stacking L0 and L1 features, which contains more information (spatially larger as well as smaller features are taken into account this way) and is standard practice for hierarchical models [17]. The results are presented in Table 2 (under "both") and Figure 4b, showing that using information from both layers facilitates discrimination.

We further tested the performance of the first layer (L0) obtained within the RLDA model and compared it with the single-layer LDA-SIFT model. Interestingly, the L0 layer learned with the RLDA model outperforms the purely bottom-up single-layer model (62.6% vs 58.7%), demonstrating that learning the layers jointly benefits both the bottom and the top layers. However, for the RLDA model with 1024 L1 and 128 L0 components, the bottom (L0) layer performed only slightly better than the L0 layer of the 128 L1 / 128 L0 RLDA model (62.7% vs 62.6%), which seems to be the saturation point for a 128-topic bottom layer.
Following this evaluation, we learned two-layer models with 1024 topics in the top layer and 128 in the bottom layer; results are shown in Table 1.
• additional layer increases performance
• full inference increases performance
RLDA > FLDA > LDA
[Figure 4 plots repeated: accuracy vs. number of training examples for LDA, FLDA, and RLDA, and for top-layer vs. stacked-layer features.]
• additional layer increases performance
• full inference increases performance
• using both layers increases performance
RLDA vs. other hierarchies
p(w, z0,1 | x0,1, α, β0,1)
  = ∫θ ∫φ1 ∫φ0 ∫χ1 ∫χ0 p(w, z1, z0, φ1, φ0, θ, χ1, χ0 | x1, x0, α, β1, β0) dχ0 dχ1 dφ0 dφ1 dθ   (2)
  = ∫θ ∫φ1 ∫φ0 p(w, z1, z0, φ1, φ0, θ | x1, x0, α, β1, β0) dφ0 dφ1 dθ   (3)
  = [∫θ p(θ|α) p(z1|θ) dθ] · [∫φ1 p(φ1|β1) p(z0|φ1, z1, x1) dφ1] · [∫φ0 p(w|φ0, z0, x0) p(φ0|β0) dφ0]   (4)
      top layer                 intermediate layer                    evidence layer

Generalized to L layers:

p(w, z0,...,L | x0,...,L, α, β0,...,L)
  = [∫θ p(θ|α) p(zL|θ) dθ] · ∏_{l=1}^{L-1} [∫φl p(φl|βl) p(z(l-1)|φl, zl, xl) dφl] · [∫φ0 p(w|φ0, z0, x0) p(φ0|β0) dφ0]   (5)
      top layer L                 layer l                                             evidence layer
Figure 3: RLDA conditional probabilities for a two-layer model, which is then generalized to an L-layer model.
Table 1: Comparison of the top performing RLDA, with 128 components at L0 and 1024 components at layer L1, to other hierarchical approaches. "bottom" denotes the L0 features of the RLDA model, "top" the L1 features, and "both" the stacked features from the L0 and L1 layers. Caltech-101 accuracy with 15 and 30 training images per class.

Our model:
RLDA (1024t/128b)         bottom   56.6 ± 0.8%   62.7 ± 0.5%
RLDA (1024t/128b)         top      66.7 ± 0.9%   72.6 ± 1.2%
RLDA (1024t/128b)         both     67.4 ± 0.5%   73.7 ± 0.8%

Hierarchical models:
Sparse-HMAX [21]          top      51.0%         56.0%
CNN [15]                  bottom   –             57.6 ± 0.4%
CNN [15]                  top      –             66.3 ± 1.5%
CNN + Transfer [2]        top      58.1%         67.2%
CDBN [17]                 bottom   53.2 ± 1.2%   60.5 ± 1.1%
CDBN [17]                 both     57.7 ± 1.5%   65.4 ± 0.4%
Hierarchy-of-parts [8]    both     60.5%         66.5%
Ommer and Buhmann [23]    top      –             61.3 ± 0.9%
Caltech-101 evaluations. A spatial pyramid with 3 layers of square grids of 1, 2, and 4 cells each was always constructed on top of our features. Guided by the best practices outlined in a recent comparison of different pooling and factorization functions [5], we used max pooling for the spatial pyramid aggregation. For classification, we used a linear SVM, following the state-of-the-art results of Yang et al. [31]. Caltech-101 is a dataset comprised of 101 object categories, with differing numbers of images per category [7]. Following standard procedure, we use 30 images per category for training and the rest for testing. Mean accuracy and standard deviation are reported for 8 runs over different splits of the data, normalized per class.
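A sketch of the pooling stage described in this paragraph: per-location topic activations are max-pooled over a 3-level spatial pyramid (1 × 1, 2 × 2, and 4 × 4 cells) and concatenated before being fed to the linear SVM. The array layout is an assumption.

```python
import numpy as np

def spatial_pyramid_max_pool(activations, levels=(1, 2, 4)):
    """activations: array of shape (H, W, T) with per-location topic activations.
    Returns concatenated max-pooled features over 1x1, 2x2, and 4x4 grids
    of cells (21 cells x T dimensions)."""
    H, W, T = activations.shape
    pooled = []
    for g in levels:
        ys = np.linspace(0, H, g + 1, dtype=int)
        xs = np.linspace(0, W, g + 1, dtype=int)
        for i in range(g):
            for j in range(g):
                cell = activations[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(cell.max(axis=(0, 1)))   # max over locations in the cell
    return np.concatenate(pooled)
```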
We first tested a one-layer LDA over SIFT features. For 128 components we obtain a classification accuracy of 58.7%. Performance increases as we increase the number of components, demonstrating the lack of discrimination of the baseline LDA-SIFT model: for 1024 topics the performance is 68.8%, and 70.4% if we use 2048 topics.

The main emphasis of our experiments is to evaluate the contribution of additional layers to classification performance. We tested the FLDA and RLDA models (described in 4.1) in the same regime as described above.

As an initial experiment, we constrained the number of topics to 128 and trained three models: single-layer, FLDA, and RLDA. The two-layer models were trained with 128 topics on the bottom layer, corresponding to factorized SIFT descriptors, and 128 on the top. Only the top-layer topics were used for classification, so the effective total number of topics in the two-layer models was the same as in the single-layer model. We present the results in Table 2. Both two-layer models improved on the one-layer LDA-SIFT model.
RLDA vs. single-feature state-of-the-art
• RLDA: 73.7%
• Sparse-Coding Spatial Pyramid Matching: 73.2% (Yang et al. CVPR 2009)
• SCSPM with "macrofeatures" and denser sampling: 75.7% (Boureau et al. CVPR 2010)
• Locality-constrained Linear Coding: 73.4% (Wang et al. CVPR 2010)
• Saliency sampling + NBNN: 78.5% (Kanan and Cottrell, CVPR 2010)
Bottom and top layers
[Visualizations of the learned bottom-layer and top-layer topics for the FLDA 128t/128b and RLDA 128t/128b models.]
Conclusions
• Presented Bayesian hierarchical approach to modeling sparsely coded visual features of increasing complexity and spatial support.
• Showed value of full inference.
Future directions
• Extend hierarchy to object level.
• Direct discriminative component
• Non-parametrics
• Sparse Coding + LDA