Transcript of Jul09 Hinton Deeplearn

UCL Tutorial on: Deep Belief Nets
(An updated and extended version of my 2007 NIPS tutorial)
Geoffrey Hinton
Canadian Institute for Advanced Research
&
Department of Computer Science
University of Toronto

Schedule for the Tutorial
• 2.00 – 3.30 Tutorial part 1
• 3.30 – 3.45 Questions
• 3.45 – 4.15 Tea Break
• 4.15 – 5.45 Tutorial part 2
• 5.45 – 6.00 Questions

Some things you will learn in this tutorial
• How to learn multi-layer generative models of unlabelled data by learning one layer of features at a time.
  – How to add Markov Random Fields in each hidden layer.
• How to use generative models to make discriminative training methods work much better for classification and regression.
  – How to extend this approach to Gaussian Processes and how to learn complex, domain-specific kernels for a Gaussian Process.
• How to perform non-linear dimensionality reduction on very large datasets.
  – How to learn binary, low-dimensional codes and how to use them for very fast document retrieval.
• How to learn multilayer generative models of high-dimensional sequential data.

A spectrum of machine learning tasks

Typical Statistics:
• Low-dimensional data (e.g. less than 100 dimensions).
• Lots of noise in the data.
• There is not much structure in the data, and what structure there is can be represented by a fairly simple model.
• The main problem is distinguishing true structure from noise.

Artificial Intelligence:
• High-dimensional data (e.g. more than 100 dimensions).
• The noise is not sufficient to obscure the structure in the data if we process it right.
• There is a huge amount of structure in the data, but the structure is too complicated to be represented by a simple model.
• The main problem is figuring out a way to represent the complicated structure so that it can be learned.

Historical background: First generation neural networks
• Perceptrons (~1960) used a layer of hand-coded features and tried to recognize objects by learning how to weight these features.
  – There was a neat learning algorithm for adjusting the weights.
  – But perceptrons are fundamentally limited in what they can learn to do.
[Sketch of a typical perceptron from the 1960's: input units (e.g. pixels) feed non-adaptive hand-coded features, which feed output units (e.g. class labels such as "bomb" vs. "toy").]

Second generation neural networks (~1985)
[Diagram: back-propagate the error signal to get derivatives for learning the weights of adaptive hidden layers.]

A temporary digression
• Vapnik and his co-workers developed a very clever type of perceptron called a Support Vector Machine.
  – Instead of hand-coding the layer of non-adaptive features, each training example is used to create a new feature using a fixed recipe.
    • The feature computes how similar a test example is to that training example.
  – Then a clever optimization technique is used to select the best subset of the features and to decide how to weight each feature when classifying a test case.
    • But it's just a perceptron and has all the same limitations.
• In the 1990's, many researchers abandoned neural networks with multiple adaptive hidden layers because Support Vector Machines worked better.


Overcoming the limitations of back-propagation
• Keep the efficiency and simplicity of using a gradient method for adjusting the weights, but use it for modeling the structure of the sensory input.
  – Adjust the weights to maximize the probability that a generative model would have produced the sensory input.
  – Learn p(image), not p(label | image).
• If you want to do computer vision, first learn computer graphics.
• What kind of generative model should we learn?

Belief Nets
• A belief net is a directed acyclic graph composed of stochastic variables.
• We get to observe some of the variables and we would like to solve two problems:
• The inference problem: Infer the states of the unobserved variables.
• The learning problem: Adjust the interactions between variables to make the network more likely to generate the observed data.
[Diagram: stochastic hidden causes with directed connections down to visible effects.]
We will use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.

Stochastic binary units (Bernoulli variables)
• These have a state of 1 or 0.
• The probability of turning on is determined by the weighted input from other units (plus a bias):

    p(s_i = 1) = 1 / (1 + \exp(-b_i - \sum_j s_j w_{ji}))

[Graph: p(s_i = 1) as a logistic function of the total input, rising from 0 to 1.]
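A minimal NumPy sketch of this sampling rule (the function and argument names are illustrative, not from the tutorial):

    import numpy as np

    def sample_binary_units(s, W, b, rng=np.random.default_rng(0)):
        # p(on) = logistic(bias + weighted input from the other units)
        p_on = 1.0 / (1.0 + np.exp(-(b + s @ W)))
        return (rng.random(p_on.shape) < p_on).astype(float)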

Learning Deep Belief Nets
• It is easy to generate an unbiased example at the leaf nodes, so we can see what kinds of data the network believes in.
• It is hard to infer the posterior distribution over all possible configurations of hidden causes.
• It is hard to even get a sample from the posterior.
• So how can we learn deep belief nets that have millions of parameters?
[Diagram: stochastic hidden causes above visible effects.]

The learning rule for sigmoid belief nets
• Learning is easy if we can get an unbiased sample from the posterior distribution over hidden states given the observed data.
• For each unit, maximize the log probability that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents:

    p_i = p(s_i = 1) = 1 / (1 + \exp(-\sum_j s_j w_{ji}))

    \Delta w_{ji} = \epsilon \, s_j (s_i - p_i)

where \epsilon is a learning rate.
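A short NumPy sketch of this delta rule, assuming we are handed a sampled parent vector and child states from the posterior (names are illustrative):

    import numpy as np

    def sbn_update(s_parents, s_children, W, eps=0.01):
        # delta w_ji = eps * s_j * (s_i - p_i), one entry per weight
        p_children = 1.0 / (1.0 + np.exp(-(s_parents @ W)))
        W += eps * np.outer(s_parents, s_children - p_children)
        return W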

Explaining away (Judea Pearl)
• Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
  – If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck.
[Diagram: "truck hits house" and "earthquake" each have bias -10 and a weight of +20 to "house jumps", which has bias -20.]
posterior: p(1,1) = .0001, p(1,0) = .4999, p(0,1) = .4999, p(0,0) = .0001

Why it is usually very hard to learn sigmoid belief nets one layer at a time
• To learn W, we need the posterior distribution in the first hidden layer.
• Problem 1: The posterior is typically complicated because of "explaining away".
• Problem 2: The posterior depends on the prior as well as the likelihood.
  – So to learn W, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact.
• Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!
[Diagram: data below several layers of hidden variables; W is the likelihood between the data and the first hidden layer, and the layers above supply the prior.]

Some methods of learning deep belief nets
• Monte Carlo methods can be used to sample from the posterior.
  – But it's painfully slow for large, deep models.
• In the 1990's people developed variational methods for learning deep belief nets.
  – These only get approximate samples from the posterior.
  – Nevertheless, the learning is still guaranteed to improve a variational bound on the log probability of generating the observed data.

The breakthrough that makes deep learning efficient
• To learn deep nets efficiently, we need to learn one layer of features at a time. This does not work well if we assume that the latent variables are independent in the prior:
  – The latent variables are not independent in the posterior, so inference is hard for non-linear models.
  – The learning tries to find independent causes using one hidden layer, which is not usually possible.
• We need a way of learning one layer at a time that takes into account the fact that we will be learning more hidden layers later.
  – We solve this problem by using an undirected model.

Two types of generative neural network
• If we connect binary stochastic neurons in a directed acyclic graph we get a Sigmoid Belief Net (Radford Neal, 1992).
• If we connect binary stochastic neurons using symmetric connections we get a Boltzmann Machine (Hinton & Sejnowski, 1983).
  – If we restrict the connectivity in a special way, it is easy to learn a Boltzmann machine.

Restricted Boltzmann Machines (Smolensky, 1986, called them "harmoniums")
• We restrict the connectivity to make learning easier.
  – Only one layer of hidden units.
    • We will deal with more layers later.
  – No connections between hidden units.
• In an RBM, the hidden units are conditionally independent given the visible states.
  – So we can quickly get an unbiased sample from the posterior distribution when given a data-vector.
  – This is a big advantage over directed belief nets.
[Diagram: a layer of hidden units j fully connected to a layer of visible units i, with no within-layer connections.]

The Energy of a joint configuration (ignoring terms to do with biases)

    E(v,h) = - \sum_{i,j} v_i h_j w_{ij}

where v_i is the binary state of visible unit i, h_j is the binary state of hidden unit j, and w_{ij} is the weight between units i and j. E(v,h) is the energy with configuration v on the visible units and h on the hidden units, and

    -\partial E(v,h) / \partial w_{ij} = v_i h_j
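As a sketch, the energy and its weight derivative are one line each in NumPy (names are illustrative):

    import numpy as np

    def rbm_energy(v, h, W):
        # E(v,h) = -sum_ij v_i h_j w_ij (bias terms ignored, as on the slide)
        return -float(v @ W @ h)

    def energy_weight_gradient(v, h):
        # -dE/dw_ij = v_i h_j, one entry per weight
        return np.outer(v, h)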

Weights, Energies, Probabilities
• Each possible joint configuration of the visible and hidden units has an energy.
  – The energy is determined by the weights and biases (as in a Hopfield net).
• The energy of a joint configuration of the visible and hidden units determines its probability:

    p(v,h) \propto e^{-E(v,h)}

• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.

Using energies to define probabilities
• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:

    p(v,h) = e^{-E(v,h)} / \sum_{u,g} e^{-E(u,g)}

  where the denominator is the partition function.
• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:

    p(v) = \sum_h e^{-E(v,h)} / \sum_{u,g} e^{-E(u,g)}
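For an RBM small enough to enumerate, the two formulas can be checked directly; a brute-force sketch (illustrative, exponential in the number of units):

    import numpy as np
    from itertools import product

    def exact_rbm_distribution(W):
        # Enumerate every joint configuration of a tiny RBM and normalize,
        # exactly as the two formulas above specify.
        n_v, n_h = W.shape
        vs = [np.array(v) for v in product([0, 1], repeat=n_v)]
        hs = [np.array(h) for h in product([0, 1], repeat=n_h)]
        unnorm = np.array([[np.exp(v @ W @ h) for h in hs] for v in vs])
        Z = unnorm.sum()              # the partition function
        p_joint = unnorm / Z          # p(v,h)
        p_v = p_joint.sum(axis=1)     # sum over hidden configurations
        return p_joint, p_v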

A picture of the maximum likelihood learning algorithm for an RBM
[Diagram: alternating Gibbs sampling between the visible units i and the hidden units j, for t = 0, t = 1, t = 2, ..., t = infinity.]

    \Delta w_{ij} = \epsilon ( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty )

A quick way to learn an RBM
[Diagram: start with a training vector on the visible units at t = 0, update the hidden units, then update the visible units to get a "reconstruction" at t = 1.]

    \Delta w_{ij} = \epsilon ( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 )
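A minimal sketch of this one-step contrastive divergence update for a binary RBM (a common practical variant that uses probabilities rather than samples for the reconstruction statistics; names are illustrative):

    import numpy as np

    def cd1_update(v0, W, b_v, b_h, eps=0.1, rng=np.random.default_rng(0)):
        # delta w_ij = eps * (<v_i h_j>^0 - <v_i h_j>^1) on a minibatch v0
        sigm = lambda x: 1.0 / (1.0 + np.exp(-x))
        p_h0 = sigm(v0 @ W + b_h)                    # hidden probabilities at t=0
        h0 = (rng.random(p_h0.shape) < p_h0) * 1.0   # sampled hidden states
        p_v1 = sigm(h0 @ W.T + b_v)                  # the "reconstruction"
        p_h1 = sigm(p_v1 @ W + b_h)                  # hidden probabilities at t=1
        n = len(v0)
        W += eps * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
        b_v += eps * (v0 - p_v1).mean(axis=0)
        b_h += eps * (p_h0 - p_h1).mean(axis=0)
        return W, b_v, b_h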

How to learn a set of features that are good for reconstructing images of the digit 2
[Diagram: 50 binary feature neurons connected to a 16 x 16 pixel image. On the data (reality), increment the weights between an active pixel and an active feature; on the reconstruction (better than reality), decrement the weights between an active pixel and an active feature.]

The final 50 x 256 weights
Each neuron grabs a different feature.

How well can we reconstruct the digit images from the binary feature activations?
[Figure: data vs. reconstruction from activated binary features, for new test images from the digit class that the model was trained on, and for images from an unfamiliar digit class (the network tries to see every image as a 2).]

Three ways to combine probability density models (an underlying theme of the tutorial)
• Mixture: Take a weighted average of the distributions.
  – It can never be sharper than the individual distributions. It's a very weak way to combine models.
• Product: Multiply the distributions at each point and then renormalize (this is how an RBM combines the distributions defined by each hidden unit).
  – Exponentially more powerful than a mixture. The normalization makes maximum likelihood learning difficult, but approximations allow us to learn anyway.
• Composition: Use the values of the latent variables of one model as the data for the next model.
  – Works well for learning multiple layers of representation, but only if the individual models are undirected.

Training a deep network (the main reason RBM's are interesting)
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer (a sketch follows below).
• It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of the training data.
  – The proof is slightly complicated.
  – But it is based on a neat equivalence between an RBM and a deep directed model (described later).
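A self-contained sketch of this greedy layer-wise recipe, reusing the CD-1 update from earlier (all names are illustrative):

    import numpy as np

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_rbm_cd1(x, n_hid, epochs=10, eps=0.1, rng=np.random.default_rng(0)):
        # Fit one binary RBM with CD-1 and return (W, b_v, b_h).
        W = 0.01 * rng.standard_normal((x.shape[1], n_hid))
        b_v, b_h = np.zeros(x.shape[1]), np.zeros(n_hid)
        for _ in range(epochs):
            p_h0 = sigm(x @ W + b_h)
            h0 = (rng.random(p_h0.shape) < p_h0) * 1.0
            p_v1 = sigm(h0 @ W.T + b_v)
            p_h1 = sigm(p_v1 @ W + b_h)
            W += eps * (x.T @ p_h0 - p_v1.T @ p_h1) / len(x)
            b_v += eps * (x - p_v1).mean(axis=0)
            b_h += eps * (p_h0 - p_h1).mean(axis=0)
        return W, b_v, b_h

    def train_dbn_stack(data, layer_sizes):
        # Greedy stacking: each RBM's hidden activations become the
        # "pixels" that the next RBM is trained on.
        stack, x = [], data
        for n_hid in layer_sizes:
            W, b_v, b_h = train_rbm_cd1(x, n_hid)
            stack.append((W, b_v, b_h))
            x = sigm(x @ W + b_h)     # features of features
        return stack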

The generative model after learning 3 layers
• To generate data:
  1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
  2. Perform a top-down pass to get states for all the other layers.
So the lower-level bottom-up connections are not part of the generative model. They are just used for inference.
[Diagram: hidden layers h3, h2, h1 above the data; W_3 is the undirected top-level RBM, and W_2, W_1 are directed top-down weights.]

Why does greedy learning work? An aside: Averaging factorial distributions
• If you average some factorial distributions, you do NOT get a factorial distribution.
  – In an RBM, the posterior over the hidden units is factorial for each visible vector.
  – But the aggregated posterior over all training cases is not factorial (even if the data was generated by the RBM itself).

Why does greedy learning work?
• Each RBM converts its data distribution into an aggregated posterior distribution over its hidden units.
• This divides the task of modeling its data into two tasks:
  – Task 1: Learn generative weights that can convert the aggregated posterior distribution over the hidden units back into the data distribution, i.e. p(v | h, W).
  – Task 2: Learn to model the aggregated posterior distribution over the hidden units, i.e. p(h | W).
  – The RBM does a good job of task 1 and a moderately good job of task 2.
• Task 2 is easier (for the next RBM) than modeling the original data because the aggregated posterior distribution is closer to a distribution that an RBM can model perfectly.
[Diagram: data distribution on the visible units (Task 1, p(v | h, W)) below the aggregated posterior distribution on the hidden units (Task 2, p(h | W)).]

Why does greedy learning work?
The weights, W, in the bottom-level RBM define p(v|h) and they also indirectly define p(h). So we can express the RBM model as

    p(v) = \sum_h p(h) \, p(v|h)

If we leave p(v|h) alone and improve p(h), we will improve p(v). To improve p(h), we need it to be a better model of the aggregated posterior distribution over hidden vectors produced by applying W to the data.

Which distributions are factorial in a directed belief net?
• In a directed belief net with one hidden layer, the posterior over the hidden units p(h|v) is non-factorial (due to explaining away).
  – The aggregated posterior is factorial if the data was generated by the directed model.
• It's the opposite way round from an undirected model, which has factorial posteriors and a non-factorial prior p(h) over the hiddens.
• The intuitions that people have from using directed models are very misleading for undirected models.

Why does greedy learning fail in a directed module?
• A directed module also converts its data distribution into an aggregated posterior.
  – Task 1: The learning is now harder because the posterior for each training case is non-factorial.
  – Task 2 is performed using an independent prior. This is a very bad approximation unless the aggregated posterior is close to factorial.
• A directed module attempts to make the aggregated posterior factorial in one step.
  – This is too difficult and leads to a bad compromise. There is also no guarantee that the aggregated posterior is easier to model than the data distribution.
[Diagram: data distribution on the visible units (Task 1, p(v | h, W_1)) below the aggregated posterior distribution on the hidden units (Task 2, p(h | W_2)).]

A model of digit recognition
[Architecture: 2000 top-level neurons connected to 10 label neurons and to 500 neurons; below that, 500 neurons; below that, a 28 x 28 pixel image.]
The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names.
The model learns to generate combinations of labels and images.
To perform recognition we start with a neutral state of the label units and do an up-pass from the image, followed by a few iterations of the top-level associative memory.

Fine-tuning with a contrastive version of the "wake-sleep" algorithm
After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass.
   – Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM.
   – Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass.
   – Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

Show the movie of the network generating digits
(available at www.cs.toronto…)

Samples generated by letting the associative memory run with one label clamped. There are 1000 iterations of alternating Gibbs sampling between samples.

Examples of correctly recognized handwritten digits that the neural network had never seen before.
It's very good!

How well does it discriminate on the MNIST test set with no extra information about geometric distortions?
• Generative model based on RBM's: 1.25%
• Support Vector Machine (Decoste et al.): 1.4%
• Backprop with 1000 hiddens (Platt)

Unsupervised "pre-training" also helps for models that have more data and better priors
• Ranzato et al. (NIPS 2006) used an additional 600,000 distorted digits.
• They also used convolutional multilayer neural networks that have some built-in local translational invariance.
Back-propagation alone: 0.49%
Unsupervised layer-by-layer pre-training followed by backprop: 0.39% (record)

Another view of why layer-by-layer learning works (Hinton, Osindero & Teh, 2006)
• There is an unexpected equivalence between RBM's and directed networks with many layers that all use the same weights.
  – This equivalence also gives insight into why contrastive divergence learning works.

An infinite sigmoid belief net that is equivalent to an RBM
• The distribution generated by this infinite directed net with replicated weights is the equilibrium distribution for a compatible pair of conditional distributions, p(v|h) and p(h|v), that are both defined by W.
  – A top-down pass of the directed net is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium.
  – So this infinite directed net defines the same distribution as an RBM.
[Diagram: an infinite stack of layers ..., v2, h1, v1, h0, v0, connected by alternating weight matrices W and W^T, etc.]

Inference in a directed net with replicated weights
• The variables in h0 are conditionally independent given v0.
  – Inference is trivial. We just multiply v0 by W transpose.
  – The model above h0 implements a complementary prior.
  – Multiplying v0 by W transpose gives the product of the likelihood term and the prior term.
• Inference in the directed net is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data.
[Diagram: the same infinite stack ..., v2, h1, v1, h0, v0, with alternating W and W^T; inference fills in each layer from the one below.]

• The learning rule for a sigmoid belief net is:

    \Delta w_{ij} \propto s_j (s_i - p_i)

• With replicated weights this becomes:

    s_j^0 (s_i^0 - s_i^1)
    + s_i^1 (s_j^0 - s_j^1)
    + s_j^1 (s_i^1 - s_i^2)
    + ...
    = s_j^0 s_i^0 - s_j^\infty s_i^\infty

  – The intermediate terms cancel in pairs, leaving the Boltzmann machine learning rule.
[Diagram: the infinite stack with states s_i^0, s_j^0 at (v0, h0), s_i^1, s_j^1 at (v1, h1), s_i^2, s_j^2 at (v2, h2), etc.]

Learning a deep directed network
• First learn with all the weights tied.
  – This is exactly equivalent to learning an RBM.
  – Contrastive divergence learning is equivalent to ignoring the small derivatives contributed by the tied weights between deeper layers.
[Diagram: the infinite stack with all weights tied, which collapses to a single RBM with weights W between v0 and h0.]

• Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together).
  – This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data.
[Diagram: W_frozen between v0 and h0 (in both directions), with a new tied weight matrix W being learned above, between h0 and v1.]

How many layers should we use and how wide should they be?
• There is no simple answer.
  – Extensive experiments by Yoshua Bengio's group (described later) suggest that several hidden layers is better than one.
  – Results are fairly robust against changes in the size of a layer, but the top layer should be big.
• Deep belief nets give their creator a lot of freedom.
  – The best way to use that freedom depends on the task.
  – With enough narrow layers we can model any distribution over binary vectors (Sutskever & Hinton, 2007).

What happens when the weights in higher layers become different from the weights in the first layer?
• The higher layers no longer implement a complementary prior.
  – So performing inference using the frozen weights in the first layer is no longer correct. But it's still pretty good.
  – Using this incorrect inference procedure gives a variational lower bound on the log probability of the data.
• The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.
  – This improves the network's model of the data.
• Hinton, Osindero and Teh (2006) prove that this improvement is always bigger than the loss in the variational bound caused by using less accurate inference.

An improved version of Contrastive Divergence learning (if time permits)
• The main worry with CD is that there will be deep minima of the energy function far away from the data.
  – To find these we need to run the Markov chain for a long time (maybe thousands of steps).
  – But we cannot afford to run the chain for too long for each update of the weights.
• Maybe we can run the same Markov chain over many weight updates (Neal, 1992).
  – If the learning rate is very small, this should be equivalent to running the chain for many steps and then doing a bigger weight update.

Persistent CD (Tijmen Tieleman, ICML 2008 & 2009)
• Use minibatches of 100 cases to estimate the first term in the gradient. Use a single batch of 100 fantasies to estimate the second term in the gradient.
• After each weight update, generate the new fantasies from the previous fantasies by using one alternating Gibbs update.
  – So the fantasies can get far from the data.
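A sketch of one persistent-CD update, assuming binary units and a persistent array of fantasy particles (names are illustrative):

    import numpy as np

    def pcd_update(v_data, fantasies, W, b_v, b_h, eps=0.01,
                   rng=np.random.default_rng(0)):
        # Positive statistics from a minibatch; negative statistics from
        # persistent fantasies advanced by one alternating Gibbs update.
        sigm = lambda x: 1.0 / (1.0 + np.exp(-x))
        p_h_data = sigm(v_data @ W + b_h)
        h = (rng.random((len(fantasies), W.shape[1]))
             < sigm(fantasies @ W + b_h)) * 1.0
        fantasies = (rng.random(fantasies.shape)
                     < sigm(h @ W.T + b_v)) * 1.0
        p_h_fant = sigm(fantasies @ W + b_h)
        W += eps * (v_data.T @ p_h_data / len(v_data)
                    - fantasies.T @ p_h_fant / len(fantasies))
        b_v += eps * (v_data.mean(axis=0) - fantasies.mean(axis=0))
        b_h += eps * (p_h_data.mean(axis=0) - p_h_fant.mean(axis=0))
        return fantasies, W, b_v, b_h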

Contrastive divergence as an adversarial game
• Why does persistent CD work so well with only 100 negative examples to characterize the whole partition function?
  – For all interesting problems the partition function is highly multi-modal.
  – How does it manage to find all the modes without starting at the data?

The learning causes very fast mixing
• The learning interacts with the Markov chain.
• Persistent Contrastive Divergence cannot be analysed by viewing the learning as an outer loop.
  – Wherever the fantasies outnumber the positive data, the free-energy surface is raised. This makes the fantasies rush around hyperactively.

How persistent CD moves between the modes of the model's distribution
• If a mode has more fantasy particles than data, the free-energy surface is raised until the fantasy particles escape.
  – This can overcome free-energy barriers that would be too high for the Markov chain to jump.
• The free-energy surface is being changed to help mixing in addition to defining the model.

Summary so far
• Restricted Boltzmann Machines provide a simple way to learn a layer of features without any supervision.
  – Maximum likelihood learning is computationally expensive because of the normalization term, but contrastive divergence learning is fast and usually works well.
• Many layers of representation can be learned by treating the hidden states of one RBM as the visible data for training the next RBM (a composition of experts).
• This creates good generative models that can then be fine-tuned.
  – Contrastive wake-sleep can fine-tune generation.

BREAK

Overview of the rest of the tutorial
• How to fine-tune a greedily trained generative model to be better at discrimination.
• How to learn a kernel for a Gaussian process.
• How to use deep belief nets for non-linear dimensionality reduction and document retrieval.
• How to learn a generative hierarchy of conditional random fields.
• A more advanced learning module for deep belief nets that contains multiplicative interactions.
• How to learn deep models of sequential data.

Fine-tuning for discrimination
• First learn one layer at a time greedily.
• Then treat this as "pre-training" that finds a good initial set of weights which can be fine-tuned by a local search procedure.
  – Contrastive wake-sleep is one way of fine-tuning the model to be better at generation.
• Backpropagation can be used to fine-tune the model for better discrimination.
  – This overcomes many of the limitations of standard backpropagation.

Why backpropagation works better with greedy pre-training: The optimization view
• Greedily learning one layer at a time scales well to really big networks, especially if we have locality in each layer.
• We do not start backpropagation until we already have sensible feature detectors that should already be very helpful for the discrimination task.
  – So the initial gradients are sensible and backprop only needs to perform a local search from a sensible starting point.

Why backpropagation works better with greedy pre-training: The overfitting view
• Most of the information in the final weights comes from modeling the distribution of input vectors.
  – The input vectors generally contain a lot more information than the labels.
  – The precious information in the labels is only used for the final fine-tuning.
  – The fine-tuning only modifies the features slightly to get the category boundaries right. It does not need to discover features.
• This type of backpropagation works well even if most of the training data is unlabeled.
  – The unlabeled data is still very useful for discovering good features.

First, model the distribution of digit images
[Architecture: 2000 units; 500 units; 500 units; 28 x 28 pixel image.]
The network learns a density model for unlabeled digit images. When we generate from the model we get things that look like real digits of all classes.
But do the hidden features really help with digit discrimination?
Add 10 softmaxed units to the top and do backpropagation.
The top two layers form a restricted Boltzmann machine whose free energy landscape should model the low-dimensional manifolds of the digits.

Results on the permutation-invariant MNIST task
• Very carefully trained backprop net with one or two hidden layers (Platt; Hinton): 1.6%
• SVM (Decoste & Schoelkopf, 2002): 1.4%
• Generative model of joint density of images and labels (+ generative fine-tuning): 1.25%
• Generative model of unlabelled digits followed by gentle backpropagation (Hinton & Salakhutdinov, Science 2006): 1.15%

Learning Dynamics of Deep Nets
The next 4 slides describe work by Yoshua Bengio's group.
[Figure: before fine-tuning | after fine-tuning.]

Effect of Unsupervised Pre-training
[Figure. Erhan et al., AISTATS 2009.]

Effect of Depth
[Figure: error vs. depth, with pre-training and without pre-training.]

Learning Trajectories in Function Space (a 2-D visualization produced with t-SNE)
• Each point is a model in function space.
• Color = epoch.
• Top: trajectories without pre-training. Each trajectory converges to a different local min.
• Bottom: trajectories with pre-training.
• No overlap!
Erhan et al., AISTATS 2009

Why unsupervised pre-training makes sense
[Two diagrams relating "stuff", image, and label. In the first, the label comes directly from the image; in the second, hidden "stuff" causes the image through a high-bandwidth pathway and the label through a low-bandwidth pathway.]
If image-label pairs were generated in the first way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?
If image-label pairs are generated in the second way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.

Modeling real-valued data
• For images of digits it is possible to represent intermediate intensities as if they were probabilities by using "mean-field" logistic units.
  – We can treat intermediate values as the probability that the pixel is inked.
• This will not work for real images.
  – In a real image, the intensity of a pixel is almost always almost exactly the average of the neighboring pixels.
  – Mean-field logistic units cannot represent precise intermediate values.

Replacing binary variables by integer-valued variables (Teh and Hinton, 2001)
• One way to model an integer-valued variable is to make N identical copies of a binary unit.
• All copies have the same probability of being "on": p = logistic(x).
  – The total number of "on" copies is like the firing rate of a neuron.
  – It has a binomial distribution with mean N p and variance N p (1 - p).

A better way to implement integer values
• Make many copies of a binary unit.
• All copies have the same weights and the same adaptive bias, b, but they have different fixed offsets to the bias:

    b - 0.5,  b - 1.5,  b - 2.5,  b - 3.5,  ...

[Graph: the resulting response as a function of the total input x.]

A fast approximation
• Contrastive divergence learning works well for the sum of binary units with offset biases.
• It also works for rectified linear units. These are much faster to compute than the sum of many logistic units:

    \sum_{n=1}^{\infty} \text{logistic}(x + 0.5 - n) \approx \log(1 + e^x)

    output = max(0, x + randn * sqrt(logistic(x)))
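The noisy rectified-linear output rule is one line of NumPy (a sketch; the name is illustrative):

    import numpy as np

    def noisy_relu(x, rng=np.random.default_rng(0)):
        # output = max(0, x + randn * sqrt(logistic(x))), as on the slide
        logistic = 1.0 / (1.0 + np.exp(-x))
        return np.maximum(0.0, x + rng.standard_normal(x.shape) * np.sqrt(logistic))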

How to train a bipartite network of rectified linear units
• Just use contrastive divergence to lower the energy of data and raise the energy of nearby configurations that the model prefers to the data:

    \Delta w_{ij} = \epsilon ( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} )

3-D Object Recognition: The NORB dataset
Stereo-pairs of grayscale images of toy objects.
– 6 lighting conditions, 162 viewpoints
– Five object instances per class in the training set; a different set of five instances per class in the test set
– 24,300 training cases, 24,300 test cases
[Figure: the five classes (animals, humans, planes, trucks, cars) in the normalized-uniform version of NORB.]

Simplifying the data
• Each training case is a stereo-pair of 96x96 images.
  – The object is centered.
  – The edges of the image are mainly blank.
  – The background is uniform and bright.
• To make learning faster, I simplified the data:
  – Throw away one image.
  – Only use the middle 64x64 pixels of the other image.
  – Downsample to 32x32 by averaging 4 pixels.

Simplifying the data even more so that it can be modeled by rectified linear units
• The intensity histogram for each 32x32 image has a sharp peak for the bright background.
• Find this peak and call it zero.
• Call all intensities brighter than the background zero.
• Measure intensities downwards from the background intensity.

Test set error rates on NORB after greedy learning of one or two hidden layers using rectified linear units
Full NORB (2 images of 96x96)
• Logistic regression on the raw pixels: 20.5%
• Gaussian SVM (trained by Leon Bottou): 11.6%
• Convolutional neural net (Le Cun's group): 6.0% (convolutional nets have knowledge of translations built in)
Reduced NORB (1 image, 32x32)
• Logistic regression on the raw pixels: 30.2%
• Logistic regression on the first hidden layer: 14.9%
• Logistic regression on the second hidden layer: 10.2%

The receptive fields of some rectified linear hidden units.
[Figure.]

A standard type of real-valued visible unit
• We can model pixels as Gaussian variables. Alternating Gibbs sampling is still easy, though learning needs to be much slower.

    E(v,h) = \sum_{i \in vis} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{j \in hid} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij}

  – The first term is a parabolic containment function; the last term is the energy-gradient produced by the total input to a visible unit.
Welling et al. (2005) show how to extend RBM's to the exponential family. See also Bengio et al. (2007).
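A sketch of this Gaussian-visible energy in NumPy (names are illustrative; sigma is the per-pixel standard deviation):

    import numpy as np

    def gaussian_rbm_energy(v, h, W, b_v, b_h, sigma):
        # E(v,h) = sum_i (v_i - b_i)^2 / (2 sigma_i^2) - sum_j b_j h_j
        #          - sum_ij (v_i / sigma_i) h_j w_ij
        containment = np.sum((v - b_v) ** 2 / (2.0 * sigma ** 2))
        return containment - b_h @ h - (v / sigma) @ W @ h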

A random sample of 10,000 binary filters learned by Alex Krizhevsky on a million 32x32 color images.
[Figure.]

Combining deep belief nets with Gaussian processes
• Deep belief nets can benefit a lot from unlabeled data when labeled data is scarce.
  – They just use the labeled data for fine-tuning.
• Kernel methods, like Gaussian processes, work well on small labeled training sets but are slow for large training sets.
• So when there is a lot of unlabeled data and only a little labeled data, combine the two approaches:
  – First learn a deep belief net without using the labels.
  – Then apply a Gaussian process model to the deepest layer of features. This works better than using the raw data.
  – Then use GP's to get the derivatives that are back-propagated through the deep belief net. This is a further win. It allows GP's to fine-tune complicated domain-specific kernels.

Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)

The training and test sets for predicting face orientation
11,000 unlabeled cases; 100, 500, or 1000 labeled cases
The test cases are face patches from new people.

The root mean squared error in the orientation when combining GP's with deep belief nets

                  GP on the pixels | GP on top-level features | GP on top-level features with fine-tuning
    100 labels:        22.2        |          17.9            |          15.2
    500 labels:        17.2        |          12.7            |           7.2
    1000 labels:       16.3        |          11.2            |           6.4

Conclusion: The deep features are much better than the pixels. Fine-tuning helps a lot.

Deep Autoencoders (Hinton & Salakhutdinov, 2006)
• They always looked like a really nice way to do non-linear dimensionality reduction:
  – But it is very difficult to optimize deep autoencoders using backpropagation.
• We now have a much better way to optimize them:
  – First train a stack of 4 RBM's.
  – Then "unroll" them.
  – Then fine-tune with backprop.
[Architecture: a 28x28 image feeds 1000 neurons (W_1), then 500 neurons (W_2), then 250 neurons (W_3), then a code of 30 linear units (W_4); the decoder applies W_4^T, W_3^T, W_2^T, W_1^T to reconstruct the image.]

A comparison of methods for compressing digit images to 30 real numbers
[Figure: real data; 30-D deep autoencoder; 30-D logistic PCA; 30-D PCA.]

Retrieving documents that are similar to a query document
• We can use an autoencoder to find low-dimensional codes for documents that allow fast and accurate retrieval of similar documents from a large set.
• We start by converting each document into a "bag of words". This is a 2000-dimensional vector that contains the counts for each of the 2000 commonest words.

How to compress the count vector
• We train the neural network to reproduce its input vector as its output.
• This forces it to compress as much information as possible into the 10 numbers in the central bottleneck.
• These 10 numbers are then a good way to compare documents.
[Architecture: 2000 word counts (input vector) → 500 neurons → 250 neurons → 10 → 250 neurons → 500 neurons → 2000 reconstructed counts (output vector).]

Performance of the autoencoder at document retrieval
• Train on bags of 2000 words for 400,000 training cases of business documents.
  – First train a stack of RBM's. Then fine-tune with backprop.
• Test on a separate 400,000 documents.
  – Pick one test document as a query. Rank-order all the other test documents by using the cosine of the angle between codes.
  – Repeat this using each of the 400,000 test documents as the query (requires 0.16 trillion comparisons).
• Plot the number of retrieved documents against the proportion that are in the same hand-labeled class as the query document.
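The ranking step is a small piece of NumPy (a sketch; names are illustrative):

    import numpy as np

    def rank_by_code_similarity(query_code, codes):
        # Rank documents by the cosine of the angle between their codes
        # and the query's code; returns indices, most similar first.
        norms = np.linalg.norm(codes, axis=1) * np.linalg.norm(query_code)
        cosines = (codes @ query_code) / np.maximum(norms, 1e-12)
        return np.argsort(-cosines)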

[Figure: proportion of retrieved documents in the same class as the query, plotted against the number of documents retrieved.]

First compress all documents to 2 numbers using a type of PCA. Then use different colors for different document categories.
[Figure.]

First compress all documents to 2 numbers. Then use different colors for different document categories.
[Figure.]

Finding binary codes for documents
• Train an autoencoder using 30 logistic units for the code layer.
• During the fine-tuning stage, add noise to the inputs to the code units.
  – The "noise" vector for each training case is fixed, so we still get a deterministic gradient.
  – The noise forces their activities to become bimodal in order to resist the effects of the noise.
  – Then we simply round the activities of the 30 code units to 1 or 0.
[Architecture: 2000 word counts → 500 neurons → 250 neurons → 30 code units (+ noise) → 250 neurons → 500 neurons → 2000 reconstructed counts.]

Semantic hashing: Using a deep autoencoder as a hash-function for finding approximate matches (Salakhutdinov & Hinton, 2007)
[Diagram: a document is mapped by the hash function to a memory address; nearby addresses hold similar documents ("supermarket search").]

How good is a shortlist found this way?
• We have only implemented it for a million documents with 20-bit codes, but what could possibly go wrong?
  – A 20-D hypercube allows us to capture enough of the similarity structure of our document set.
• The shortlist found using binary codes actually improves the precision-recall curves of TF-IDF.
  – Locality-sensitive hashing (the fastest other method) is 50 times slower and has worse precision-recall curves.
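A sketch of the lookup side of semantic hashing: treat each 20-bit code as a memory address and collect documents within a small Hamming ball of the query code (illustrative names; the radius is a free choice):

    from collections import defaultdict
    from itertools import combinations

    def build_table(codes):
        # Index document ids by their binary codes (tuples of 0/1).
        table = defaultdict(list)
        for doc_id, code in enumerate(codes):
            table[tuple(code)].append(doc_id)
        return table

    def shortlist(query_code, table, radius=2):
        # Collect documents whose codes are within `radius` bit-flips of
        # the query code; no search through the corpus is needed.
        hits = []
        for r in range(radius + 1):
            for bits in combinations(range(len(query_code)), r):
                probe = list(query_code)
                for b in bits:
                    probe[b] = 1 - probe[b]
                hits.extend(table.get(tuple(probe), []))
        return hits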

Generating the parts of an object
• One way to maintain the constraints between the parts is to generate each part very accurately.
  – But this would require a lot of communication bandwidth.
• Sloppy top-down specification of the parts is less demanding
  – but it messes up relationships between features,
  – so use redundant features and use lateral interactions to clean up the mess.
• Each transformed feature helps to locate the others.
  – This allows a noisy channel.
[Diagram: pose parameters for a "square" lead to sloppy top-down activation of parts; features with top-down support are then cleaned up using known lateral interactions. It's like soldiers on a parade ground.]

Semi-restricted Boltzmann Machines
• We restrict the connectivity to make learning easier.
• Contrastive divergence learning requires the hidden units to be in conditional equilibrium with the visibles.
  – But it does not require the visible units to be in conditional equilibrium with the hiddens.
  – All we require is that the visible units are closer to equilibrium in the reconstructions than in the data.
• So we can allow connections between the visibles.
[Diagram: hidden units j above visible units i, with lateral connections among the visibles.]

Learning a semi-restricted Boltzmann Machine
[Diagram: data at t = 0, reconstruction at t = 1, as in the CD-1 picture.]

    \Delta w_{ij} = \epsilon ( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 )

The lateral connections between visibles are learned with the same kind of rule, using the pairwise visible statistics \langle v_i v_k \rangle at t = 0 and t = 1.

Learning in Semi-restricted Boltzmann Machines
• Method 1: To form a reconstruction, cycle through the visible units, updating each in turn using the top-down input from the hiddens plus the lateral input from the other visibles.
• Method 2: Use "mean field" visible units that have real values. Update them all in parallel.
  – Use damping to prevent oscillations:

    p_i^{t+1} = \lambda p_i^t + (1 - \lambda) \, \sigma(x_i^t)

  where x_i is the total input to unit i and \lambda is the damping.
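The damped update is a one-liner (a sketch; lam plays the role of λ):

    import numpy as np

    def damped_mean_field_step(p, total_input, lam=0.5):
        # p_i(t+1) = lam * p_i(t) + (1 - lam) * logistic(x_i(t))
        return lam * p + (1.0 - lam) / (1.0 + np.exp(-total_input))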

Results on modeling natural image patches using a stack of RBM's (Osindero and Hinton)
• Stack of RBM's learned one at a time.
• 400 Gaussian visible units that see whitened image patches.
  – Derived from 100,000 Van Hateren image patches, each 20x20.
• The hidden units are all binary.
  – The lateral connections are learned when they are the visible units of their RBM.
• Reconstruction involves letting the visible units of each RBM settle using mean-field dynamics.
  – The already decided states in the level above determine the effective biases during mean-field settling.
[Architecture: 1000 top-level units (no MRF); hidden MRF with 500 units; hidden MRF with 2000 units; 400 Gaussian units. Directed connections between layers, undirected lateral connections within layers.]

Without lateral connections
[Figure: real data | samples from model.]

With lateral connections
[Figure: real data | samples from model.]

A funny way to use an MRF
• The lateral connections form an MRF.
• The MRF is used during learning and generation.
• The MRF is not used for inference.
  – This is a novel idea, so vision researchers don't like it.
• The MRF enforces constraints. During inference, constraints do not need to be enforced because the data obeys them.
  – The constraints only need to be enforced during generation.
• Unobserved hidden units cannot enforce constraints.
  – To enforce constraints requires lateral connections or observed descendants.

Why do we whiten data?
• Images typically have strong pairwise correlations.
• Learning higher-order statistics is difficult when there are strong pairwise correlations.
  – Small changes in parameter values that improve the modeling of higher-order statistics may be rejected because they form a slightly worse model of the much stronger pairwise statistics.
• So we often remove the second-order statistics before trying to learn the higher-order statistics.

Whitening the learning signal instead of the data
• Contrastive divergence learning can remove the effects of the second-order statistics on the learning without actually changing the data.
  – The lateral connections model the second-order statistics.
  – If a pixel can be reconstructed correctly using second-order statistics, it will be the same in the reconstruction as in the data.
  – The hidden units can then focus on modeling higher-order structure that cannot be predicted by the lateral connections.
    • For example, a pixel close to an edge, where interpolation from nearby pixels causes incorrect smoothing.

Towards a more powerful, multi-linear stackable learning module
• So far, the states of the units in one layer have only been used to determine the effective biases of the units in the layer below.
• It would be much more powerful to modulate the pair-wise interactions in the layer below.
  – A good way to design a hierarchical system is to allow each level to determine the objective function of the level below.
• To modulate pair-wise interactions we need higher-order Boltzmann machines.

Higher-order Boltzmann machines (Sejnowski, ~1986)

Using higher-order Boltzmann machines to model image transformations (the unfactored version)
• A global transformation specifies which pixel goes to which other pixel.
• Conversely, each pair of similar-intensity pixels, one in each image, votes for a particular global transformation.
[Diagram: image(t) and image(t+1) gated by "image transformation" units.]

Factoring three-way multiplicative interactions

    E = - \sum_{i,j,h} s_i s_j s_h w_{ijh}        (unfactored, with cubically many parameters)

    E = - \sum_f \sum_{i,j,h} s_i s_j s_h w_{if} w_{jf} w_{hf}        (factored, with linearly many parameters per factor)

A picture of the low-rank tensor contributed by factor f
• Each layer is a scaled version of the same matrix.
• The basis matrix is specified as an outer product with typical term w_{if} w_{jf}.
• So each active hidden unit, h, contributes a scalar, w_{hf}, times the matrix specified by factor f.
[Diagram: the three weight vectors w_{if}, w_{jf}, w_{hf} of factor f.]

Inference with factored three-way multiplicative interactions

    E_f = - ( \sum_i s_i w_{if} ) ( \sum_j s_j w_{jf} ) ( \sum_h s_h w_{hf} )

is the energy contributed by factor f, and

    E_f(s_h = 1) - E_f(s_h = 0) = - w_{hf} ( \sum_i s_i w_{if} ) ( \sum_j s_j w_{jf} )

is how changing the binary state of unit h changes the energy contributed by factor f, which is all that unit h needs to know in order to do Gibbs sampling.
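A sketch of these two quantities in NumPy (illustrative shapes: W_if is n_i x n_f, W_jf is n_j x n_f, W_hf is n_h x n_f):

    import numpy as np

    def factored_energy(s_i, s_j, s_h, W_if, W_jf, W_hf):
        # E = -sum_f (sum_i s_i w_if)(sum_j s_j w_jf)(sum_h s_h w_hf)
        return -np.sum((s_i @ W_if) * (s_j @ W_jf) * (s_h @ W_hf))

    def gibbs_input_to_hiddens(s_i, s_j, W_if, W_jf, W_hf):
        # Minus the energy gap for each hidden unit:
        # sum_f w_hf (sum_i s_i w_if)(sum_j s_j w_jf), one value per unit,
        # which is the total input that determines p(s_h = 1).
        factor_messages = (s_i @ W_if) * (s_j @ W_jf)   # one product per factor
        return W_hf @ factor_messages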

Belief propagation
[Diagram: factor f with three vertices i, j, h and weight vectors w_{if}, w_{jf}, w_{hf}.]
The outgoing message at each vertex of the factor is the product of the weighted sums at the other two vertices.

Learning with factored three-way multiplicative interactions

    m_f^h = ( \sum_i s_i w_{if} ) ( \sum_j s_j w_{jf} )

is the message from factor f to unit h, and

    \Delta w_{hf} = \epsilon ( \langle s_h m_f^h \rangle_{data} - \langle s_h m_f^h \rangle_{model} )

[Figure: results on the "Roland" data.]

Modeling the correlational structure of a static image by using two copies of the image
• Each factor sends the squared output of a linear filter to the hidden units.
  – It is exactly the standard model of simple and complex cells. It allows complex cells to extract oriented energy.
  – The standard model drops out of doing belief propagation for a factored third-order energy function.
[Diagram: Copy 1 and Copy 2 of the image feed factor f through weight vectors w_{if} and w_{jf}; the factor sends w_{hf} to the hidden units.]

An advantage of modeling correlations between pixels rather than pixels
• During generation, a "vertical edge" unit can turn off the horizontal interpolation in a region without worrying about exactly where the intensity discontinuity will be.
  – This gives some translational invariance.
  – It also gives a lot of invariance to brightness and contrast.
  – So the "vertical edge" unit is like a complex cell.
• By modulating the correlations between pixels rather than the pixel intensities, the generative model can still allow interpolation parallel to the edge.

A principle of hierarchical systems
• Each level in the hierarchy should not try to micro-manage the level below.
• Instead, it should create an objective function for the level below and leave the level below to optimize it.
  – This allows the fine details of the solution to be decided locally, where the detailed information is available.
• Objective functions are a good way to do abstraction.

Time series models
• Inference is difficult in directed models of time series if we use non-linear distributed representations in the hidden units.
  – It is hard to fit Dynamic Bayes Nets to high-dimensional sequences (e.g. motion capture data).
• So people tend to avoid distributed representations and use much weaker methods (e.g. HMM's).

Time series models
• If we really need distributed representations (which we nearly always do), we can make inference much simpler by using three tricks:
  – Use an RBM for the interactions between hidden and visible variables. This ensures that the main source of information wants the posterior to be factorial.
  – Model short-range temporal information by allowing several previous frames to provide input to the hidden units and to the visible units.
    • This leads to a temporal module that can be stacked.
  – So we can use greedy learning to learn deep models of temporal structure.

An application to modeling motion capture data (Taylor, Roweis & Hinton, 2007)
• Human motion can be captured by placing reflective markers on the joints and then using lots of infrared cameras to track the 3-D positions of the markers.
• Given a skeletal model, the 3-D positions of the markers can be converted into the joint angles plus 6 parameters that describe the 3-D position and the roll, pitch and yaw of the pelvis.
  – We only represent changes in yaw because physics doesn't care about its value and we want to avoid circular variables.

The conditional RBM model (a partially observed CRF)
• Start with a generic RBM.
• Add two types of conditioning connections.
• Given the data, the hidden units at time t are conditionally independent.
• The autoregressive weights can model most short-term temporal structure very well, leaving the hidden units to model nonlinear irregularities (such as when the foot hits the ground).
[Diagram: visible frames at t-2, t-1, t; the hidden units j receive conditioning input from the past frames, and the current visibles i receive autoregressive input.]
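A sketch of the conditioning computation, assuming two past frames and illustrative names (A for visible-to-visible autoregressive weights, B for visible-to-hidden conditioning weights; neither name is from the tutorial):

    import numpy as np

    def crbm_dynamic_biases(v_past, A, B, b_v, b_h):
        # The past frames contribute a time-dependent bias to the current
        # visibles (via A) and to the hidden units (via B); the RBM between
        # the current visibles and hiddens is otherwise unchanged.
        past = np.concatenate(v_past)     # e.g. frames t-2 and t-1
        return b_v + A @ past, b_h + B @ past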

Causal generation from a learned model
• Keep the previous visible states fixed.
  – They provide a time-dependent bias for the hidden units.
• Perform alternating Gibbs sampling for a few iterations between the hidden units and the most recent visible units.
  – This picks new hidden and visible states that are compatible with each other and with the recent history.

Higher-level models
• Once we have trained the model, we can add layers like in a Deep Belief Network.
• The previous layer CRBM is kept, and its output, while driven by the data, is treated as a new kind of "fully observed" data.
• The next level CRBM has the same architecture as the first (though we can alter the number of units it uses) and is trained the same way.
• Upper levels of the network model more "abstract" concepts.
• This greedy learning procedure can be justified using a variational bound.
[Diagram: units i, j, k over frames t-2, t-1, t.]

Learning with "style" labels
• As in the generative model of handwritten digits (Hinton et al., 2006), style labels can be provided as part of the input to the top layer.
• The labels are represented by turning on one unit in a group of units, but they can also be blended.
[Diagram: units i, j, k, l over frames t-2, t-1, t, with label units feeding the top layer.]

Show demo's of multiple styles of walking
These can be found at www.cs.toronto.edu/~gwtaylor/

Readings on deep belief nets
A reading list (that is still being updated) can be found at www.cs.toronto.edu…