NEURAL NETWORK 2
Topics: multilayer neural network
• Could have L hidden layers:
‣ layer pre-activation (for k > 0): $a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k-1)}(x)$
‣ hidden layer activation (k from 1 to L): $h^{(k)}(x) = g(a^{(k)}(x))$
‣ output layer activation (k = L+1): $h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x)$
Feedforward neural network
Hugo Larochelle
Département d'informatique
Université de Sherbrooke
hugo.larochelle@usherbrooke.ca
September 6, 2012

Abstract
Math for my slides "Feedforward neural network".

• $a(x) = b + \sum_i w_i x_i = b + w^\top x$
• $h(x) = g(a(x)) = g(b + \sum_i w_i x_i)$
• $x_1, \dots, x_d$  $b$  $w_1, \dots, w_d$
• $w$
• $g(a) = a$
• $g(a) = \mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}$
• $g(a) = \tanh(a) = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)} = \frac{\exp(2a) - 1}{\exp(2a) + 1}$
• $g(a) = \max(0, a)$
• $g(a) = \mathrm{reclin}(a) = \max(0, a)$
• $g(\cdot)$  $b$
• $W^{(1)}_{i,j}$  $b^{(1)}_i$  $x_j$  $h(x)_i$
• $h(x) = g(a(x))$
• $a(x) = b^{(1)} + W^{(1)} x$  $\left( a(x)_i = b^{(1)}_i + \sum_j W^{(1)}_{i,j} x_j \right)$
• $o(x) = g^{(\mathrm{out})}(b^{(2)} + w^{(2)\top} x)$
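As a quick illustration (not part of the original notes), here is a minimal NumPy sketch of the single-neuron computation above; the helper names `pre_activation`, `sigm`, and `reclin` are my own:

```python
import numpy as np

def pre_activation(x, w, b):
    # a(x) = b + w^T x
    return b + w @ x

def sigm(a):
    # logistic sigmoid: 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def reclin(a):
    # rectified linear: max(0, a)
    return np.maximum(0.0, a)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2

a = pre_activation(x, w, b)            # scalar pre-activation
print(sigm(a), np.tanh(a), reclin(a))  # h(x) = g(a(x)) for three choices of g
```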
• $p(y = c \mid x)$
• $o(a) = \mathrm{softmax}(a) = \left[ \frac{\exp(a_1)}{\sum_c \exp(a_c)}, \dots, \frac{\exp(a_C)}{\sum_c \exp(a_c)} \right]^\top$
• $f(x)$
• $h^{(1)}(x)$  $h^{(2)}(x)$  $W^{(1)}$  $W^{(2)}$  $W^{(3)}$  $b^{(1)}$  $b^{(2)}$  $b^{(3)}$
• $a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k-1)}(x)$  $\left( h^{(0)}(x) = x \right)$
• $h^{(k)}(x) = g(a^{(k)}(x))$
• $h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x)$
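To make the layer recursion concrete, here is a hedged sketch of the forward pass ending in a softmax output; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def softmax(a):
    # softmax(a)_c = exp(a_c) / sum_c' exp(a_c'); subtract max for numerical stability
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x, weights, biases, g=np.tanh):
    # weights = [W1, ..., W(L+1)], biases = [b1, ..., b(L+1)]
    h = x                               # h^(0)(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h                   # a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)
        h = g(a)                        # h^(k)(x) = g(a^(k)(x))
    a_out = biases[-1] + weights[-1] @ h
    return softmax(a_out)               # f(x) = o(a^(L+1)(x))

# toy setup: one hidden layer of 4 units, 3 classes, random parameters
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(3, 4))]
bs = [np.zeros(4), np.zeros(3)]
print(forward(np.array([0.5, -1.0, 2.0]), Ws, bs))  # probabilities summing to 1
```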
Feedforward neural network
Hugo Larochelle
Département d'informatique
Université de Sherbrooke
hugo.larochelle@usherbrooke.ca
September 13, 2012

Abstract
Math for my slides "Feedforward neural network".

• $f(x)$
• $\theta \equiv \{W^{(1)}, b^{(1)}, \dots, W^{(L+1)}, b^{(L+1)}\}$
• $l(f(x^{(t)}; \theta), y^{(t)})$
• $\nabla_\theta \, l(f(x^{(t)}; \theta), y^{(t)})$
• $\Omega(\theta)$
• $\nabla_\theta \, \Omega(\theta)$
• $f(x)_c = p(y = c \mid x)$
• $x^{(t)}$  $y^{(t)}$
• $l(f(x), y) = -\sum_c 1_{(y=c)} \log f(x)_c = -\log f(x)_y$
• $\frac{\partial}{\partial f(x)_c} \left( -\log f(x)_y \right) = \frac{-1_{(y=c)}}{f(x)_y}$, so
  $\nabla_{f(x)} \left( -\log f(x)_y \right) = \frac{-1}{f(x)_y} \left[ 1_{(y=0)}, \dots, 1_{(y=C-1)} \right]^\top = \frac{-e(y)}{f(x)_y}$
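A small sketch (my own, with illustrative names) of the negative log-likelihood loss and its gradient with respect to the network output, matching the two formulas above:

```python
import numpy as np

def nll_loss(f_x, y):
    # l(f(x), y) = -log f(x)_y
    return -np.log(f_x[y])

def nll_grad(f_x, y):
    # grad wrt f(x) of -log f(x)_y = -e(y) / f(x)_y, with e(y) the one-hot vector at y
    grad = np.zeros_like(f_x)
    grad[y] = -1.0 / f_x[y]
    return grad

f_x = np.array([0.2, 0.7, 0.1])  # softmax output for C = 3 classes
print(nll_loss(f_x, 1))          # -log(0.7) ~ 0.357
print(nll_grad(f_x, 1))          # [0, -1/0.7, 0]
```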
MACHINE LEARNING 3
Topics: empirical risk minimization, regularization
• Empirical risk minimization
‣ framework to design learning algorithms
‣ $l(f(x^{(t)}; \theta), y^{(t)})$ is a loss function
‣ $\Omega(\theta)$ is a regularizer (penalizes certain values of $\theta$)
• Learning is cast as optimization
‣ ideally, we'd optimize classification error, but it's not smooth
‣ the loss function is a surrogate for what we truly should optimize (e.g. an upper bound)
• $\hat{\mu} = \frac{1}{T} \sum_t x^{(t)}$
• $\hat{\sigma}^2 = \frac{1}{T-1} \sum_t (x^{(t)} - \hat{\mu})^2$
• $\hat{\Sigma} = \frac{1}{T-1} \sum_t (x^{(t)} - \hat{\mu})(x^{(t)} - \hat{\mu})^\top$
• $\mathrm{E}[\hat{\mu}] = \mu$  $\mathrm{E}[\hat{\sigma}^2] = \sigma^2$  $\mathrm{E}[\hat{\Sigma}] = \Sigma$
• $\frac{\hat{\mu} - \mu}{\sqrt{\hat{\sigma}^2 / T}}$
• $\mu \in \hat{\mu} \pm 1.96 \sqrt{\hat{\sigma}^2 / T}$
• $\hat{\theta} = \operatorname*{argmax}_\theta \, p(x^{(1)}, \dots, x^{(T)})$
• $p(x^{(1)}, \dots, x^{(T)}) = \prod_t p(x^{(t)})$
• $\frac{T-1}{T} \hat{\Sigma} = \frac{1}{T} \sum_t (x^{(t)} - \hat{\mu})(x^{(t)} - \hat{\mu})^\top$
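As a quick numerical check of these estimators (not from the notes; the variable names are mine), a NumPy sketch computing the sample mean, the unbiased variance, and the 95% confidence interval $\hat{\mu} \pm 1.96\sqrt{\hat{\sigma}^2/T}$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # T = 1000 samples, true mu = 5
T = x.size

mu_hat = x.mean()                  # (1/T) sum_t x^(t)
var_hat = x.var(ddof=1)            # unbiased: 1/(T-1) sum_t (x^(t) - mu_hat)^2
half_width = 1.96 * np.sqrt(var_hat / T)

print(mu_hat, var_hat)
print(f"95% CI: [{mu_hat - half_width:.3f}, {mu_hat + half_width:.3f}]")  # should usually cover 5.0
```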
Machine learning
• Supervised learning example: $(x, y)$  ($x$: input, $y$: target)
• Training set: $\mathcal{D}^{\mathrm{train}} = \{(x^{(t)}, y^{(t)})\}$
• $f(x; \theta)$
• $\mathcal{D}^{\mathrm{valid}}$  $\mathcal{D}^{\mathrm{test}}$
• $\operatorname*{argmin}_\theta \frac{1}{T} \sum_t l(f(x^{(t)}; \theta), y^{(t)}) + \lambda \Omega(\theta)$
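A hedged sketch of evaluating the regularized empirical risk above, assuming a generic predictor and loss and an L2 regularizer; none of these names come from the slides:

```python
import numpy as np

def l2_regularizer(theta):
    # Omega(theta) = sum of squared parameters (one common choice of regularizer)
    return sum(np.sum(W ** 2) for W in theta)

def empirical_risk(theta, X, y, predict, loss, lam):
    # (1/T) sum_t l(f(x^(t); theta), y^(t)) + lambda * Omega(theta)
    T = len(X)
    data_term = sum(loss(predict(x, theta), t) for x, t in zip(X, y)) / T
    return data_term + lam * l2_regularizer(theta)

# toy usage: linear scores with squared loss
theta = [np.array([0.1, -0.2])]
X = [np.array([1.0, 2.0]), np.array([-1.0, 0.5])]
y = [1.0, 0.0]
predict = lambda x, th: float(th[0] @ x)
loss = lambda f, t: (f - t) ** 2
print(empirical_risk(theta, X, y, predict, loss, lam=0.01))
```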
MACHINE LEARNING 4
Topics: stochastic gradient descent (SGD)
• Algorithm that performs updates after each example (see the sketch after this list)
‣ initialize $\theta$  ($\theta \equiv \{W^{(1)}, b^{(1)}, \dots, W^{(L+1)}, b^{(L+1)}\}$)
‣ for N iterations
- for each training example $(x^{(t)}, y^{(t)})$:
  $\Delta = -\nabla_\theta l(f(x^{(t)}; \theta), y^{(t)}) - \lambda \nabla_\theta \Omega(\theta)$
  $\theta \leftarrow \theta + \alpha \Delta$
• To apply this algorithm to neural network training, we need
‣ the loss function $l(f(x^{(t)}; \theta), y^{(t)})$
‣ a procedure to compute the parameter gradients $\nabla_\theta l(f(x^{(t)}; \theta), y^{(t)})$
‣ the regularizer $\Omega(\theta)$ (and the gradient $\nabla_\theta \Omega(\theta)$)
‣ initialization method
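A minimal SGD sketch matching the algorithm above, assuming gradient routines for the loss and regularizer are available; all names here are illustrative, not from the notes:

```python
import numpy as np

def sgd(theta, data, grad_loss, grad_reg, alpha=0.01, lam=0.001, n_iterations=10):
    # theta: list of parameter arrays; data: list of (x, y) training pairs
    for _ in range(n_iterations):            # N iterations over the training set
        for x, y in data:                    # update after each example
            g_l = grad_loss(theta, x, y)     # grad_theta l(f(x; theta), y)
            g_r = grad_reg(theta)            # grad_theta Omega(theta)
            for p, gl, gr in zip(theta, g_l, g_r):
                p += alpha * (-gl - lam * gr)  # theta <- theta + alpha * Delta
    return theta
```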
• $\operatorname*{argmin}_\theta \frac{1}{T} \sum_t l(f(x^{(t)}; \theta), y^{(t)}) + \lambda \Omega(\theta)$
• $l(f(x^{(t)}; \theta), y^{(t)})$
• $\Omega(\theta)$
• $\Delta = -\frac{1}{T} \sum_t \nabla_\theta l(f(x^{(t)}; \theta), y^{(t)}) - \lambda \nabla_\theta \Omega(\theta)$
• $\theta \leftarrow \theta + \alpha \Delta$
• $\{x \in \mathbb{R}^d \mid \nabla_x f(x) = 0\}$
• $v^\top \nabla^2_x f(x) \, v > 0 \;\; \forall v$
• $v^\top \nabla^2_x f(x) \, v < 0 \;\; \forall v$
• $\Delta = -\nabla_\theta l(f(x^{(t)}; \theta), y^{(t)}) - \lambda \nabla_\theta \Omega(\theta)$
• $(x^{(t)}, y^{(t)})$
• $f^*$  $f$
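The two quadratic-form conditions classify a stationary point (where $\nabla_x f(x) = 0$) as a local minimum or maximum. A hedged sketch checking them numerically via the Hessian's eigenvalues; the example matrices and names are mine:

```python
import numpy as np

def classify_stationary_point(hessian):
    # v^T H v > 0 for all v  <=>  all eigenvalues positive  (local minimum)
    # v^T H v < 0 for all v  <=>  all eigenvalues negative  (local maximum)
    eig = np.linalg.eigvalsh(hessian)  # eigvalsh: for symmetric matrices
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    return "saddle point or inconclusive"

# f(x) = x1^2 + 2*x2^2 has a stationary point at 0 with Hessian diag(2, 4)
print(classify_stationary_point(np.diag([2.0, 4.0])))   # local minimum
print(classify_stationary_point(np.diag([2.0, -4.0])))  # saddle point or inconclusive
```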
• $X^{-1} X = I$
• $(X^\top)_{i,j} = X_{j,i}$
• $\det(X) = \prod_i X_{i,i}$  (for triangular $X$)
• $\det(X^\top) = \det(X)$
• $\det(X^{-1}) = \det(X)^{-1}$
• $\det(X^{(1)} X^{(2)}) = \det(X^{(1)}) \det(X^{(2)})$
• $X^\top = X^{-1}$
• $v^\top X v > 0$
• $\lambda$
• $\{x^{(t)}\}$
• $\exists \, w, t^* \;\; x^{(t^*)} = \sum_{t \neq t^*} w_t x^{(t)}$
• $R(X) = \{x \in \mathbb{R}^h \mid \exists \, w \;\; x = \sum_j w_j X_{\cdot,j}\}$
• $\{x \in \mathbb{R}^h \mid x \notin R(X)\}$
• $\{\lambda_i, u_i \mid X u_i = \lambda_i u_i \text{ and } u_i^\top u_j = 1_{i=j}\}$
• $X = \sum_i \lambda_i u_i u_i^\top$
• $\det(X) = \prod_i \lambda_i$
• $\lambda_i > 0$
• $\frac{\partial}{\partial x} f(x, y) = \lim_{\Delta \to 0} \frac{f(x + \Delta, y) - f(x, y)}{\Delta}$
• $\frac{\partial}{\partial y} f(x, y) = \lim_{\Delta \to 0} \frac{f(x, y + \Delta) - f(x, y)}{\Delta}$
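A quick NumPy sketch (illustrative, not from the notes) verifying the eigendecomposition identities $X = \sum_i \lambda_i u_i u_i^\top$ and $\det(X) = \prod_i \lambda_i$ on a symmetric positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
X = A @ A.T + 3 * np.eye(3)          # symmetric positive definite

lam, U = np.linalg.eigh(X)           # X u_i = lambda_i u_i, with orthonormal u_i (columns of U)
X_rebuilt = sum(l * np.outer(u, u) for l, u in zip(lam, U.T))

print(np.allclose(X, X_rebuilt))                  # True: X = sum_i lambda_i u_i u_i^T
print(np.allclose(np.linalg.det(X), lam.prod()))  # True: det(X) = prod_i lambda_i
print(np.all(lam > 0))                            # True: positive definite
```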
Machine learning
• Supervised learning example: $(x, y)$
• Training set: $\mathcal{D}^{\mathrm{train}} = \{(x^{(t)}, y^{(t)})\}$
• training epoch = iteration over all examples
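To make the epoch/iteration distinction concrete, a tiny sketch (names mine): one epoch is a full pass over $\mathcal{D}^{\mathrm{train}}$, so N epochs means N·T single-example updates:

```python
# D_train: list of (x, y) pairs; update(theta, x, y) performs one parameter update
def train(theta, D_train, update, n_epochs):
    for epoch in range(n_epochs):   # one epoch = one iteration over all examples
        for x, y in D_train:        # T single-example updates per epoch
            theta = update(theta, x, y)
    return theta
```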