Using associative memory principles to enhance perceptual ability of vision systems (Giving a meaning to what you see)

CVPR Workshop on Face Processing in Video June 28, 2004, Washington, DC, USA

Dr. Dmitry Gorodnichy

Computational Video Group, Institute for Information Technology, National Research Council Canada

www.cv.iit.nrc.ca/~dmitry/pinn

Designing visual memory using attractor-based neural networks with application to perceptual vision systems


www.cv.iit.nrc.ca/~dmitry/pinn www.perceptual-vision.com/memory

3. Associative memory for video (Dr. Dmitry Gorodnichy)

The unique place of this research

You are here

Computer vision

Pattern recognition

Your eyes, your brain


Talk overview

1. On the neurobiology side
- How it works in the brain: from the eye retina, to the primary visual cortex, to neurons, to synapses

2. Memories as attractors of an associative neural network
- Finding the best learning rule to tune the synapses

3. On the computer vision path
- Evolution of Perceptual Vision User Interface Systems: from Face Detection to Face Tracking to Face Localization to Face Recognition

4. Putting it all together: visual memory for analyzing faces
- What makes processing in video special
- Canonical face representation
- Memories of faces as attractors of the network


How we (humans) do it: see & memorize what we see?

Dorsal (“where”) stream: V1, V2, V3, … deals with object localization

Ventral (“what”) stream: V1, V2, V4, inferior temporal cortex (TE/IT) deals with object recognition

Refs: Perus, Ungerleider, Haxby, Riesenhuber, Poggio …

•Seeing


How we (humans) do it: see & memorize what we see? (cont'd)

• In the brain: $10^{10}$ to $10^{13}$ interconnected neurons

• Neurons are either at rest or activated (modelled as units taking values $Y_i \in \{+1, -1\}$), depending on the values of other neurons $Y_j$ and the strength of the synaptic connections $C_{ij}$

• The brain is thus modelled as a network of binary neurons evolving in time from an initial state (e.g. a stimulus coming from the retina) until it reaches a stable state, an attractor

• The attractors of the network are what we actually remember: associative memory.

•Recognizing / Memorizing

Refs: Hebb’49, Little’74,’78, Willshaw’71, …
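The evolution of the network towards an attractor can be sketched in a few lines. This is a minimal illustration, not the author's implementation: it assumes a synchronous update $Y \leftarrow \mathrm{sign}(CY)$, takes sign(0) = +1, and uses a toy weight matrix built from a single pattern by an outer product. The names `sign`, `evolve` and `max_iters` are hypothetical.

```python
def sign(x):
    """Threshold function; sign(0) = +1 is an assumption of this sketch."""
    return 1 if x >= 0 else -1


def evolve(C, y0, max_iters=100):
    """Synchronously update Y <- sign(C Y) until the state stops changing."""
    y = list(y0)
    n = len(y)
    for _ in range(max_iters):
        nxt = [sign(sum(C[i][j] * y[j] for j in range(n))) for i in range(n)]
        if nxt == y:          # stable state reached: an attractor
            return y, True
        y = nxt
    return y, False           # no stable state within max_iters


# Toy memory holding one pattern (outer-product weights, for illustration only).
pattern = [1, -1, 1, -1, 1, -1, 1, -1]
N = len(pattern)
C = [[pattern[i] * pattern[j] / N for j in range(N)] for i in range(N)]

noisy = list(pattern)
noisy[0] = -noisy[0]          # corrupt one neuron of the stimulus

state, converged = evolve(C, noisy)
print(state == pattern, converged)   # True True
```

Starting from the corrupted stimulus, the state falls back onto the stored pattern, which is the sense in which the memory is associative.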


Recognition / memorization: formally

• Main question:

How to compute $C_{ij}$ so that

a) the desired patterns $V^m$ become attractors, i.e. $V^m \sim CV^m$

and

b) the network exhibits the best associative (error-correction) properties, i.e.
- largest attraction radius (tolerated noise)
- largest number of prototypes M stored?

Refs: Hebb’49, McCulloch-Pitts’43, Amari’71,’77, Hopfield’82, Sejnowski’89, Willshaw’71

•Attractor-based neural networks


Learning rules: From biologically plausible to mathematically justifiable

Neurophysiological postulate: “If two neurons on either side of a synapse are activated, then the strength of the synapse is strengthened.”

“When a child is born, she knows nothing. As she repeatedly observes, she learns.” (Postulate from the Montessori approach to infant development.)

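How to update weights:

$C_{ij}^m = C_{ij}^{m-1} + \Delta C_{ij}^m$

Models

Hebb: $\Delta C_{ij}^m = \frac{1}{N} V_i^m V_j^m$ (i.e. $C = \frac{1}{N} VV^T$)

Generalized Hebb: $\Delta C_{ij}^m = F(V_i^m, V_j^m)$

Better, however: $\Delta C_{ij}^m = F(C_{ij}^{m-1}, V_i^m, V_j^m)$

or even: $\Delta C_{ij}^m = F(C^{m-1}, V^m)$

Refs: Hebb’49, Hopfield’82, Sejnowski’77, Willshaw’71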


Pseudo-inverse as the best learning rule

$C = VV^+$

• Obtained mathematically from the stability condition: $V^m = CV^m$

• With reduced self-connection ($C_{ii} := 0.15\,C_{ii}$), it is guaranteed [Gorodnichy’97] to retrieve
M=0.5N patterns from 8% noise
M=0.7N patterns from 2% noise
(for comparison: the Hebb rule stops retrieving when M=0.14N)

• Widrow-Hoff’s (delta) rule is an iterative approximation of it.

• The Hebb rule is a special case of it for orthogonal prototypes.

Refs: Amari’71,’77, Kohonen’72, Personnaz’85, Kanter-Sompolinsky’86, Gorodnichy’95-’99
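A minimal sketch of the projection rule $C = VV^+$ follows, assuming the weights are built one prototype at a time with the standard incremental formula for an orthogonal projector: the residual $r = V^m - CV^m$ is added as $rr^T / (r^T r)$. This is one common way to obtain $VV^+$ without an explicit matrix inversion; it is not claimed to match the code at the PINN website. The function names and the `eps` threshold are hypothetical; the value 0.15 for the self-connection reduction is taken from the slide.

```python
def matvec(C, v):
    """Multiply matrix C (list of rows) by vector v."""
    return [sum(c * x for c, x in zip(row, v)) for row in C]


def learn_projection(patterns, eps=1e-9):
    """Build C = V V^+ incrementally, one prototype at a time."""
    n = len(patterns[0])
    C = [[0.0] * n for _ in range(n)]
    for v in patterns:
        cv = matvec(C, v)
        r = [a - b for a, b in zip(v, cv)]      # part of v not yet stored
        e = sum(x * x for x in r)               # squared residual norm
        if e < eps:                             # v is already in the stored subspace
            continue
        for i in range(n):
            for j in range(n):
                C[i][j] += r[i] * r[j] / e
    return C


def reduce_self_connection(C, D=0.15):
    """Scale the diagonal weights: C_ii := D * C_ii."""
    return [[D * c if i == j else c for j, c in enumerate(row)]
            for i, row in enumerate(C)]


# Three prototypes that are linearly independent but not orthogonal.
prototypes = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
    [1, 1, 1, -1, 1, -1, 1, -1],
]
C = learn_projection(prototypes)
C_reduced = reduce_self_connection(C, D=0.15)

for p in prototypes:
    # Stability condition V^m = C V^m holds up to rounding error.
    assert all(abs(a - b) < 1e-9 for a, b in zip(matvec(C, p), p))
```

Because the diagonal of a projection matrix never exceeds 1, scaling it by 0.15 leaves every stored prototype a stable state of `C_reduced` under the sign update.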


What else is good about the PI rule

… besides that it yields the best retrieval for this type of network:

• It is non-iterative, which is good for fast (real-time) learning.
• It is also fast in retrieval.
• The performance of the network can be examined and improved analytically. Guaranteed to converge.
• It can deal with a continuous stream of data without ever being saturated, if dynamic desaturation is used:
-> maintaining the capacity of 0.2N (with complete retrieval)
-> providing means for forgetting obsolete data
-> setting the basis for designing adaptive filters
• All this makes the network very suitable for real-time memorization and recognition, as needed for video processing tasks.
• Finally, there is free C++ code which you can compile and try yourself, at the PINN website: www.cv.iit.nrc.ca/~dmitry/pinn

http://www.cv.iit.nrc.ca/~dmitry/pinn/pinn.cpp


These neural networks are known as…

• pseudo-inverse networks - for using the Moore-Penrose pseudoinverse $V^+$ in computing the synapses

• projection networks - for the synaptic (weight) matrix $C = VV^+$ being the projection matrix onto the space of prototypes

• Hopfield-like networks - for being binary and fully connected in the stage of learning

• recurrent networks - for evolving in time, based on external input and internal memory

• attractor-based networks - for storing patterns as attractors (i.e. stable states of the network)

• dynamic systems - for allowing dynamic systems theory to be applied

• associative memory - for being able to memorize, recall and forget patterns, much as humans do.


Analytical examination

By looking at the synaptic weights $C_{ij}$, one can say a lot about the properties of memory:

- how many main attractors (stored memories) it has.

- how good the retrieval is.


Attraction Radius as function of weights

Theoretical result:

(for direct attraction radius)


Dynamics of the network

The behaviour of the network is governed by the energy functions

• The network always converges, as long as $C_{ij} = C_{ji}$

• Cycles are possible when D<1

• However:
-> They are few when D>0.1 [Gorodnichy&Reznik’97]
-> They are detected automatically


Update flow neuro-processing

[Gorodnichy&Reznik’94]: “Process only those neurons which change during the evolution”, i.e. instead of N multiplications, do only a few of them.

-> is very fast (as only a few neurons actually change in one iteration)

-> detects cycles automatically

-> suitable for parallel implementation
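The update flow idea can be sketched as follows, under stated assumptions rather than as the published algorithm: the postsynaptic potentials $S = CY$ are computed in full once, and afterwards only the columns of $C$ that belong to flipped neurons are used to correct $S$ (each flip changes $Y_j$ by $\pm 2$). Cycle detection here assumes synchronous updates, where a cycle shows up as the same set of neurons flipping twice in a row. The function name `update_flow` and the status strings are hypothetical.

```python
def update_flow(C, y0, max_iters=100):
    """Synchronous retrieval that only processes neurons which changed."""
    n = len(y0)
    y = list(y0)
    # Full product S = C Y is computed only once, at the start.
    s = [sum(C[i][j] * y[j] for j in range(n)) for i in range(n)]
    prev_flipped = None
    for _ in range(max_iters):
        # Neurons whose current state disagrees with the sign of their potential.
        flipped = [i for i in range(n) if (1 if s[i] >= 0 else -1) != y[i]]
        if not flipped:
            return y, "attractor"
        if flipped == prev_flipped:      # same neurons flip back and forth
            return y, "cycle"
        for j in flipped:
            y[j] = -y[j]
            delta = 2 * y[j]             # new value minus old value
            for i in range(n):
                s[i] += C[i][j] * delta  # correct S using column j only
        prev_flipped = flipped
    return y, "max_iters"


# Toy memory with one stored pattern (outer-product weights, illustration only).
pattern = [1, -1, 1, -1, 1, -1, 1, -1]
N = len(pattern)
C = [[pattern[i] * pattern[j] / N for j in range(N)] for i in range(N)]
noisy = [-pattern[0]] + pattern[1:]

print(update_flow(C, noisy))                        # stored pattern, "attractor"
print(update_flow([[0, -1], [-1, 0]], [1, 1])[1])   # "cycle"
```

In the first call only one neuron flips, so a single column of `C` is touched after the initial product; the second call uses a two-neuron network with negative coupling to show a cycle being reported.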


How is video information processed?

• As we know how to memorize, the question is: what should be memorized?

What type of video information needs to be processed?

Let's see what mother nature (neurobiology) tells us


Visual Processing mechanisms

• Images are of very low resolution except at the fixation point.

• The eyes look at points which attract visual attention.

• Saliency is in: a) motion, b) colour, c) disparity, d) intensity.

• These channels are processed independently in the brain. Intensity means: frequencies, orientation, gradient.

• The brain processes sequences of images rather than one image. - Bad quality of images is compensated by the abundance of images.

• Colour & motion are used for segmentation.

• Intensity is used for recognition.

• Bottom-up (image-driven) visual attention is very fast and precedes top-down (goal-driven) attention: 25 ms vs 1 sec.

Refs: Itti, …


Visual recognition mechanism

- What to learn: generality vs specifics, invariance vs selectivity

- Affine transformations in 2D (rotation in image plane, scale) are easily dealt with.

- No 3D model stored. Instead, several view-based 2D models stored

- One neural network per view.

In the context of face recognition:

- Faces are stored in canonical representation

- 2D transformations are easy in image/video processing!

- Video allows one to wait (until a face is in a position in which it was stored)

Refs: Poggio,…


Orientation selectivity, top-down vs bottom-up detection

From [Riesenhuber-Poggio, Nature Neuroscience,2000]


On computer vision side


Perceptual Vision System

Goal: To detect, track and recognize face and facial movements of the user.

[Figure: PUI output shown on the monitor: position (x, y, z); binary event (ON / OFF); recognition / memorization (“Unknown User!”)]

Setup:

+ face close to camera (within hand distance)
+ approximately front-faced orientation
+ limited number of users and motions
- off-the-shelf camera (low quality, low resolution)
- desktop computer (with limited processing power)


What can be “perceived”: Face processing tasks

“I look and see…”

- Face Segmentation: “Something yellow moves”
- Face Detection: “It’s a face”
- Face Tracking (crude): “Let’s follow it!”
- Face Localization (precise): “It’s at (x,y,z,”
- Facial Event Recognition: “S/he smiles, blinks”
- Face Classification: “It’s the face of a child”
- Face Memorization: “Face unknown. Store it!”
- Face Identification: “It’s Mila!”

…


Computer Vision results achieved

• 1998. Proof-of-concept PUI: colour-based tracking [Bradski]
– Unlikely to be used for precise tracking…

• 1999-2002. Several good skin colour models developed (HSV, UCS, YCrCb*)
– Unlikely to get better than that…

• 2002. Subpixel-accuracy convex-shape nose tracking [Nouse]

• 1999-2001. Motion-based segmentation & localization
– 2001. Non-linear change detection
– 2003. Second-order change detection [Double-blink]

• 2001. Viola-Jones face detection using Haar-like wavelets

• 2004. Stereotracking using nose and projective vision [Gorodnichy, Roth-IVC]


Face Detection and Tracking


Face Detection and Tracking (lights off)


Face Detection and Tracking (lights on)


Demand and applications

Internet, trends & technology

The nose used as a mouse

At the Institute for Information Technology in Canada, a system called Nouse was developed that lets users operate software with movements of the face. The creator of this program, Dmitry Gorodnichy, explained by e-mail to LA NACION LINE how it works and what its uses are.

Source: http://www.lanacion.com.ar/03/05/21/dg_497588.asp (Copyright S. A. LA NACION 2003)


On importance of nose

Test: The user rotates his head only! (the shoulders do not move)

Precision and convenience are such that one can use the nose as a mouse (or a joystick handle), i.e. to Nouse


Nouse™: range and speed of tracking


Stereotracking with nose feature


Second-order change detection

• Detecting change in a change [Gorodnichy’03]

• Non-linear change detection deals with changes due to illumination changes [Durucan’02]


Eye Blink Detection

• Previously very difficult with moving heads

• Became possible with second-order change detection

• Is currently used to enable face-to-face communication for people with brain injury [AAATE’03]


Something (special) about video

Importance:

- Video is becoming ubiquitous. Cameras are everywhere.

- For security, computer–human interaction, video-conferencing, entertainment …

Constraints:

- Real-time processing is required.

- Low resolution: 160x120 images or MPEG-decoded.

- Low quality: weak exposure, blurriness, cheap lenses

Essence:

- It is inherently dynamic! -> temporal information makes up for bad quality

- It has parallels with biological vision! -> it can be processed efficiently


Applicability of 160x120 video

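• According to face anthropometrics (studied on the BioID database)

• Tested with Perceptual User Interfaces

Face size            1/2 image   1/4 image   1/8 image   1/16 image
In pixels            80x80       40x40       20x20       10x10
Between eyes (IOD)   40          20          10          5
Eye size             20          10          5           2
Nose size            10          5           -           -
FS                   good        good        good        b
FD                   good        good        b           -
FT                   good        good        b           -
FL                   good        b           -           -
FER                  good        good        b           -
FC                   good        good        b           -
FM / FI              good        good        -           -

Legend: good; b = barely applicable; - = not good

Tasks: FS = Face Segmentation, FD = Face Detection, FT = Face Tracking, FL = Face Localization, FER = Facial Event Recognition, FC = Face Classification, FM / FI = Face Memorization / Identification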


Choosing the face model.

On importance of eyes:

• Eyes are the most salient features on a face.

• Besides, there are two of them, which makes them an excellent reference frame.

• They are also the best (and the only) stable landmarks on a face which can be used as a reference.

Intra-ocular distance (IOD) makes a very convenient unit of measurement!

Eye-centered face model

On resolution:

• Lowest resolution possible, so as not to cause overfitting due to the noise present (and there’s a lot of noise in video!)


Eye-centered face representations

[Figure: two eye-centered face representations, one suitable for face analysis from video and one suitable for face recognition in travel documents [ICAO’02]; dimension labels: d, 24, 2·IOD]

Size 24 x 24 is sufficient for face memorization & recognition and is optimal for low-quality video and for fast processing.
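Mapping a detected face to the eye-centered 24x24 frame amounts to a rotation, scale and shift fixed by the two eye positions. The sketch below is an illustration under assumptions, not the author's procedure: it takes the face side to be 2·IOD (so the canonical IOD is 12 pixels), and it places the eyes on row 8 of the canonical frame, a value chosen here only for the example. The names `eye_alignment`, `size` and `eye_row` are hypothetical.

```python
import math


def eye_alignment(left_eye, right_eye, size=24, eye_row=8.0):
    """Return a function mapping image points to canonical face coordinates.

    Assumes the canonical face side equals 2 * IOD; eye_row is an
    illustrative choice, not a value given in the slides.
    """
    (x1, y1), (x2, y2) = left_eye, right_eye
    dx, dy = x2 - x1, y2 - y1
    iod = math.hypot(dx, dy)                 # intra-ocular distance in the image
    if iod == 0:
        raise ValueError("eye positions must differ")
    scale = (size / 2.0) / iod               # canonical IOD is size / 2
    angle = math.atan2(dy, dx)               # in-plane rotation of the eye line
    a = math.cos(-angle) * scale             # rotate by -angle, then scale
    b = math.sin(-angle) * scale
    cx, cy = size / 4.0, eye_row             # canonical position of the left eye

    def to_canonical(point):
        px, py = point[0] - x1, point[1] - y1
        return (cx + a * px - b * py, cy + b * px + a * py)

    return to_canonical


# A frontal face with IOD = 40 pixels, as for an 80x80 face in a 160x120 image.
to_canonical = eye_alignment((50, 60), (90, 60))
print(to_canonical((50, 60)), to_canonical((90, 60)))
```

With this mapping the left eye always lands at column 6 and the right eye at column 18 of the 24x24 frame, whatever the scale and in-plane rotation of the face in the image.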


From image pixels to feature vectors

• When the eyes are detected and a face is converted to a canonical representation, it is easy to memorize and to recognize.

• Using (orientational, frequency) features: Gabor filters?

• As faces are already rectified (to the same scale and orientation), there is no need for complex transformations. Just deal with illumination changes.

• Converting a 24x24 face to a binary feature vector:

A) $V_i = I_{xy} - I_{ave}$, $N = 24 \times 24 = 576$

B) $V_{i,j} = \mathrm{sign}(I_i - I_j)$, $N = 24^4$

C) $V_{i,j} = \text{Haar-like}(i,j,k,l)$, N much larger

• Some pixels may be ignored (corners, eye location)
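Options A and B above can be sketched as follows. This is a hedged illustration: for option A the sign of the difference is taken so that the vector is binary (the slide writes only the difference), ties are mapped to +1, and the function names are hypothetical. The image is a flat list of pixel intensities.

```python
def binarize_mean(pixels):
    """Option A: +1 where a pixel is at or above the average intensity, else -1."""
    ave = sum(pixels) / len(pixels)
    return [1 if p - ave >= 0 else -1 for p in pixels]


def binarize_pairs(pixels):
    """Option B: sign of the intensity difference for every ordered pixel pair."""
    return [1 if pi - pj >= 0 else -1 for pi in pixels for pj in pixels]


# A 24x24 face gives N = 576 features with option A
# and 576 * 576 = 24**4 features with option B.
face = [(7 * k) % 256 for k in range(24 * 24)]      # synthetic intensities
assert len(binarize_mean(face)) == 576
assert 576 * 576 == 24 ** 4

# Both encodings ignore a uniform brightness offset.
patch = [10, 200, 30, 40]
brighter = [p + 10 for p in patch]
assert binarize_mean(patch) == binarize_mean(brighter)
assert binarize_pairs(patch) == binarize_pairs(brighter)
```

Both encodings depend only on intensity differences, which is why a uniform change in brightness leaves the feature vector unchanged.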


Closer to experiments

• Network size of N=576 stores
– M=N/2 states with …
– M=N/4 states with 25%N error correction

• Faces are extracted using the OpenCV Viola-Jones function

• Another way: from blinking, as in [avbpa03]:


Visual memory for user perception

• What can be retrieved:
– user identity
– face orientation
– facial expression


Retrieving orientation


A few more demos: taped and live…

… as time allows…

• Watch how memory is being filled out, as you learn new prototypes


Conclusions ?

Computer vision

Pattern recognition

Neuro-biology


Conclusions

• A lot has been done in PR, CV, NB.
– How to know all of these?…
– How to use all of these?…
Or which way would you prefer?

• The attractor-based network is a great tool:
– very easy to understand what it is doing
– very suitable for live real-time video processing
– very much along the lines of biological vision
– You are invited to try it yourself, from our website!

• Other contributions:
– Canonical face representation for FPIV

• Is it possible to work while on parental leave with two kids?


Dealing with a stream of data

Dynamic desaturation:

-> maintains the capacity of 0.2N (with complete retrieval)
-> allows storing data in real time (no need for iterative learning methods!)
-> provides means for forgetting obsolete data
-> is the basis for the design of adaptive filters
