Unsupervised learning of visual representations and their use in object & face recognition
Gary Cottrell, Chris Kanan, Honghao Shan, Lingyun Zhang, Matthew Tong, Tim Marks


Page 1:

Unsupervised learning of visual representations and their use in object & face recognition

Gary Cottrell
Chris Kanan, Honghao Shan
Lingyun Zhang, Matthew Tong
Tim Marks

Page 2:

Collaborators

Honghao Shan, Chris Kanan

Page 3:

Collaborators

Tim Marks, Matt Tong, Lingyun Zhang

Page 4:

Efficient Encoding of the World

Sparse Principal Components Analysis: a model of unsupervised learning for early perceptual processing (Honghao Shan).

The model embodies three constraints:

1. Keep as much information as possible,

2. while trying to equalize the neural responses,

3. and minimizing the connections.
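As a rough illustration of constraints 1 and 3 (this is a toy sketch, not Shan's actual SPCA objective, and the response-equalization constraint 2 is omitted for brevity), one can minimize linear reconstruction error plus an L1 penalty on the connection weights by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))      # toy "patch" data, one patch per row
X -= X.mean(axis=0)
n, d = X.shape
k, lam, lr = 4, 0.01, 0.05              # code size, sparsity weight, step size
W = 0.1 * rng.standard_normal((k, d))   # encoding weights (responses are X @ W.T)

def loss(W):
    E = X - X @ W.T @ W                 # keep information: reconstruction error
    return (E ** 2).mean() + lam * np.abs(W).mean()  # minimize connections: L1

start = loss(W)
for _ in range(500):
    E = X - X @ W.T @ W
    # gradient of the mean squared reconstruction error plus L1 subgradient
    grad = -2.0 * W @ (X.T @ E + E.T @ X) / (n * d) + lam * np.sign(W) / W.size
    W -= lr * grad

assert loss(W) < start                  # descent reduced the objective
```

With the L1 term active, many entries of W are pushed toward zero, which is the "fewer connections" pressure the slide describes.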

Page 5:

Efficient encoding of the world leads to magno- and parvo-cellular response properties…

[Figure: filters learned from grayscale images, color images, and video cubes, plotted by spatial extent vs. temporal extent. Labels: "Midget?" (persistent, small) and "Parasol?" (transient, large).]

This suggests that these cell types exist because they are useful for efficiently encoding the temporal dynamics of the world.

Page 6:

Efficient encoding of the world leads to gammatone filters as in auditory nerves: using exactly the same algorithm, applied to speech, environmental sounds, etc.:

Page 7:

Efficient Encoding of the World

A single unsupervised learning algorithm leads to:
- Model cells with properties similar to those found in the retina, when applied to natural videos.
- Model cells with properties similar to those found in the auditory nerve, when applied to natural sounds.

One small step towards a unified theory of temporal processing.

Page 8:

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

Recursive ICA (RICA 1.0; Shan et al., 2008): alternately compress and expand the representation using PCA and ICA.
- ICA was modified by a component-wise nonlinearity.
- Receptive fields expanded at each ICA layer.

Page 9:

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

ICA was modified by a component-wise nonlinearity:
- Think of ICA as a generative model: the pixels are the sum of many independent random variables, and hence approximately Gaussian.
- Hence ICA prefers its inputs to be Gaussian-distributed.
- We apply an inverse cumulative Gaussian to the absolute value of the ICA components to "gaussianize" them.
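A minimal sketch of that "gaussianizing" step, assuming an empirical CDF over the absolute responses (the original model uses the CDF implied by the ICA source prior; the empirical version here is a simplification):

```python
from statistics import NormalDist

def gaussianize(responses):
    """Map |response| through an empirical CDF, then through the inverse
    Gaussian CDF: strong responses of either sign land in the positive
    tail, weak responses in the negative tail."""
    nd = NormalDist()
    mags = [abs(r) for r in responses]
    order = sorted(range(len(mags)), key=lambda i: mags[i])
    out = [0.0] * len(mags)
    for rank, i in enumerate(order, start=1):
        out[i] = nd.inv_cdf(rank / (len(mags) + 1))
    return out

g = gaussianize([0.05, -3.0, 0.4, 2.0, -0.01])
# the largest-magnitude response (-3.0) maps farthest into the positive tail
assert g[1] == max(g) and g[1] > 0
# the weakest response (-0.01) maps farthest into the negative tail
assert g[4] == min(g) and g[4] < 0
```

Note that the sign of the original response is discarded; only response strength survives, which is what the next slide describes.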

Page 10:

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

Strong responses, either positive or negative, are mapped to the positive tail of the Gaussian; weak ones, to the negative tail; ambiguous ones, to the center.

Page 11:

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

RICA 2.0: replace PCA by SPCA.

Page 12:

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

RICA 2.0 results: a multiple-layer system with
- center-surround receptive fields at the first layer,
- simple edge filters at the second (ICA) layer,
- spatial pooling of orientations at the third (SPCA) layer,
- V2-like response properties at the fourth (ICA) layer.

Page 13:

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

V2-like response properties at the fourth (ICA) layer.

These maps show strengths of connections to layer-1 ICA filters. Warm and cold colors are strong + and - connections, gray is weak connections, and orientation corresponds to layer-1 orientation.

The left-most column displays two model neurons that show uniform orientation preference to layer-1 ICA features.

The middle column displays model neurons that have non-uniform/varying orientation preference to layer-1 ICA features.

The right column displays two model neurons that have location preference, but no orientation preference, to layer-1 ICA features.

The left two columns are consistent with Anzai, Peng, & Van Essen (2007). The right-hand column is a prediction.

Page 14:

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

Dimensionality reduction & expansion might be a general strategy of information processing in the brain: the first step removes noise and reduces complexity; the second step captures the statistical structure.

We showed that retinal ganglion cells and V1 complex cells may be derived from the same learning algorithm, applied to pixels in one case, and to V1 simple cell outputs in the second.

This highly simplified model of early vision is the first one that learns the RFs of all early visual layers using a consistent theory: the efficient coding theory.

We believe it could serve as a basis for more sophisticated models of early vision.

An obvious next step is to train and thus make predictions about higher layers.

Page 15:

Nice, but is it useful?

We showed in Shan & Cottrell (CVPR 2008) that we could achieve state-of-the-art face recognition with the non-linear ICA features and a simple softmax output.

We showed in Kanan & Cottrell (CVPR 2010) that we could achieve state-of-the-art face and object recognition with a system that used an ICA-based salience map, simulated fixations, non-linear ICA features, and a kernel-density memory.

Here I briefly describe the latter.

Page 16:

One reason why this might be a good idea…

Our attention is automatically drawn to interesting regions in images.

Our salience algorithm is automatically drawn to interesting regions in images.

These are useful locations for discriminating one object (face, butterfly) from another.

Page 17:

Main Idea

Training phase (learning object appearances):
- Use the salience map to decide where to look (we use the ICA salience map).
- Memorize these samples of the image, with labels (Bob, Carol, Ted, or Alice); we store the (compressed) ICA feature values.

Page 18:

Main Idea

Testing phase (recognizing objects we have learned):
- Now, given a new face, use the salience map to decide where to look.
- Compare new image samples to stored ones; the closest ones in memory get to vote for their label.

Page 19:

[Figure: stored memories of Bob, stored memories of Alice, and new fragments.]

Result: 7 votes for Alice, only 3 for Bob. It's Alice!
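The nearest-neighbor voting sketched above can be written out as follows; the 2-D "fragments" here are hypothetical stand-ins for the stored ICA feature vectors:

```python
from math import dist
from collections import Counter

# stored (feature, label) memories -- toy 2-D stand-ins for ICA features
memory = [
    ((0.0, 0.0), "Alice"), ((1.0, 1.0), "Alice"),
    ((5.0, 5.0), "Bob"),   ((6.0, 5.0), "Bob"),
]

def classify(fragments, memory):
    """Each new fragment votes for the label of its nearest stored sample;
    the label with the most votes wins."""
    votes = Counter()
    for f in fragments:
        _, label = min(memory, key=lambda m: dist(f, m[0]))
        votes[label] += 1
    return votes.most_common(1)[0][0]

# two fragments near Alice's memories, one near Bob's: Alice wins 2-1
assert classify([(0.2, 0.1), (0.9, 1.2), (5.1, 4.9)], memory) == "Alice"
```

In the actual system each vote is weighted by distance through a kernel density estimate, as the next slide explains; the unweighted version here just shows the voting skeleton.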

Page 20:

Voting

The voting process is based on Bayesian updating (with naïve Bayes).

The size of the vote depends on the distance from the stored sample, using kernel density estimation.

Hence NIMBLE: NIM with Bayesian Likelihood Estimation.

Page 21:

Overview of the system

The ICA features do double duty:
- They are combined to make the salience map, which is used to decide where to look.
- They are stored to represent the object at that location.
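One way the same ICA responses could be combined into a salience score (a rarity-based formulation in the spirit of the group's SUN salience model; the independent Laplacian priors and the `scales` values here are assumptions, not the published model) is to score each location by the improbability, -log p, of its filter responses:

```python
from math import log

def salience(responses, scales):
    """-log p of the filter responses under independent Laplacian priors
    p(s) = exp(-|s| / b) / (2 b): rarer (stronger) responses score higher."""
    return sum(abs(r) / b + log(2 * b) for r, b in zip(responses, scales))

scales = [1.0, 1.0, 1.0]          # hypothetical fitted scale per ICA filter
bland   = salience([0.1, -0.2, 0.0], scales)   # typical responses
unusual = salience([3.0, -2.5, 1.8], scales)   # unusually strong responses
assert unusual > bland            # rare responses are more salient
```

The appeal of this arrangement is economy: the features that drive where to look are the very features that get stored once the model looks there.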

Page 22:

NIMBLE vs. Computer Vision

Compare this to (most, not all!) computer vision systems: one pass over the image, and global features.

[Diagram: Image → Global Features → Global Classifier → Decision]

Page 23:

[Video]

Page 24:

[Figure: belief after 1 fixation vs. belief after 10 fixations.]

Page 25:

Robust Vision

Human vision works in multiple environments; our basic features (neurons!) don't change from one problem to the next.

We tune our parameters so that the system works well on the Bird and Butterfly datasets, and then apply the system unchanged to faces, flowers, and objects.

This is very different from standard computer vision systems, which are (usually) tuned to a particular domain.

Page 26:

Caltech 101: 101 different categories.

AR dataset: 120 different people, with different lighting, expression, and accessories.

Page 27:

Flowers: 102 different flower species.

Page 28:

~7 fixations are required to achieve at least 90% of maximum performance.

Page 29:

So, we created a simple cognitive model that uses simulated fixations to recognize things. But it isn't that complicated.

How does it compare to approaches in computer vision?

Page 30:

Caveats:
- As of mid-2010.
- Only comparing to single-feature-type approaches (no "Multiple Kernel Learning" (MKL) approaches).
- Still superior to MKL with very few training examples per category.

Page 31: 1 Unsupervised learning of visual representations and their use in object & face recognition Gary Cottrell Chris Kanan Honghao Shan Lingyun Zhang Matthew.

311 5 15

30NUMBER OF TRAINING EXAMPLES

Page 32: 1 Unsupervised learning of visual representations and their use in object & face recognition Gary Cottrell Chris Kanan Honghao Shan Lingyun Zhang Matthew.

321 2 3 6

8 NUMBER OF TRAINING EXAMPLES

Page 33:

[Figure]

Page 34:

Again, best among single-feature-type systems; and for 1 training instance, better than all systems.

Page 35:

- More neurally and behaviorally relevant gaze control and fixation integration: people don't randomly sample images.
- A foveated retina.
- Comparison with human eye movement data during recognition/classification of faces, objects, etc.

Page 36:

A biologically-inspired, fixation-based approach can work well for image classification.

Fixation-based models can achieve, and even exceed, some of the best models in computer vision…

…especially when you don't have a lot of training images.

Page 37:

Software and paper available at www.chriskanan.com

For more details, email: [email protected]@ucsd.edu

This work was supported by the NSF (grant #SBE-0542013) to the Temporal Dynamics of Learning Center.

Page 38:

Thanks!

Page 39:

Sparse Principal Components Analysis

We minimize: [equation]

Subject to the following constraint: [equation]

Page 40:

The SPCA model as a neural net…

It is A^T that is mostly 0…

Page 41:

Results

…suggesting the 1/f power spectrum of images is where this is coming from…

Page 42:

Results

The role of the sparsity parameter: recall that this reduces the number of connections…

Page 43:

Results

The role of the sparsity parameter: a higher value means fewer connections, which alters the contrast sensitivity function (CSF).

This matches recent data on malnourished kids and their CSFs: lower sensitivity at low spatial frequencies, but slightly better at high frequencies than normal controls…

Page 44:

NIMBLE represents its beliefs using probability distributions.

Simple nearest-neighbor density estimation is used to estimate P(fixation_t | Category = k).

Fixations are combined over time using Bayesian updating.
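That Bayesian updating step can be sketched as below; in NIMBLE the per-fixation likelihoods come from the kernel density estimates, but here they are hypothetical numbers:

```python
from math import log, exp

def update_beliefs(prior, fixation_likelihoods):
    """Accumulate log P(fixation_t | Category = k) over fixations
    (naive-Bayes independence across fixations), then normalize
    the result into a posterior over categories."""
    logp = {k: log(p) for k, p in prior.items()}
    for like in fixation_likelihoods:
        for k in logp:
            logp[k] += log(like[k])
    z = sum(exp(v) for v in logp.values())
    return {k: exp(v) / z for k, v in logp.items()}

prior = {"Alice": 0.5, "Bob": 0.5}
# three simulated fixations, each slightly favoring Alice
fixations = [{"Alice": 0.6, "Bob": 0.4}] * 3
post = update_beliefs(prior, fixations)
assert post["Alice"] > post["Bob"]
assert abs(sum(post.values()) - 1.0) < 1e-9
```

Working in log space keeps the accumulation numerically stable when many fixations are combined, which matters once each likelihood is a small density value.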
