CSCE 970 Lecture 2: Artificial Neural Networks and Deep Learning


Transcript of CSCE 970 Lecture 2: Artificial Neural Networks and Deep Learning

Page 1


CSCE 970 Lecture 2: Artificial Neural Networks and Deep Learning

Stephen Scott and Vinod Variyam

(Adapted from Ethem Alpaydin and Tom Mitchell)

[email protected]


Introduction

Deep learning is based on artificial neural networks

Consider humans:

- Total number of neurons ≈ 10^10
- Neuron switching time ≈ 10^-3 second (vs. 10^-10)
- Connections per neuron ≈ 10^4–10^5
- Scene recognition time ≈ 0.1 second
- 100 inference steps doesn't seem like enough ⇒ massive parallel computation


Properties

Properties of artificial neural nets (ANNs):

- Many "neuron-like" switching units
- Many weighted interconnections among units
- Highly parallel, distributed process
- Emphasis on tuning weights automatically

Strong differences between ANNs for ML and ANNs for biological modeling


History of ANNs

The Beginning: linear units and the Perceptron algorithm (1940s–1950s)
- Spoiler alert: stagnated because of the inability to handle data that are not linearly separable
- Researchers were aware of the usefulness of multi-layer networks, but could not train them

The Comeback: training of multi-layer networks with Backpropagation (1980s)
- Many applications, but in the 1990s replaced by large-margin approaches such as support vector machines and boosting


History of ANNs (cont'd)

The Resurgence: deep architectures (2000s)
- Better hardware and software support allow for deep (> 5–8 layers) networks
- Still use Backpropagation, but larger datasets, algorithmic improvements (new loss and activation functions), and deeper networks improve performance considerably
- Very impressive applications, e.g., captioning images

The Problem: Skynet (TBD)
- Sorry.


When to Consider ANNs

- Input is high-dimensional discrete- or real-valued (e.g., raw sensor input)
- Output is discrete- or real-valued
- Output is a vector of values
- Possibly noisy data
- Form of target function is unknown
- Human readability of result is unimportant
- Long training times acceptable


Page 2


Outline

- Basic units: linear unit, linear threshold units, Perceptron training rule
- Nonlinearly separable problems and multilayer networks
- Backpropagation
- Putting everything together


Linear Unit

[Diagram: a single unit with inputs x_1, ..., x_n weighted by w_1, ..., w_n, plus a bias input x_0 = 1 weighted by w_0, feeding the sum $\sum_{i=0}^{n} w_i x_i$]

$y = f(x; w, b) = x^\top w + b = w_1 x_1 + \cdots + w_n x_n + b$

- If we set $w_0 = b$, can simplify the above
- Forms the basis for many other activation functions
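A minimal sketch of this unit in NumPy (illustrative code, not from the slides):

    import numpy as np

    def linear_unit(x, w, b):
        # y = f(x; w, b) = x.w + b
        return float(x @ w + b)

    x = np.array([1.0, 2.0])
    w = np.array([0.5, -0.25])
    print(linear_unit(x, w, b=0.1))  # 0.5*1 - 0.25*2 + 0.1 = 0.1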


Linear Threshold Unit

[Diagram: the same unit, now with the weighted sum $\sum_{i=0}^{n} w_i x_i$ passed through a threshold that outputs 1 if the sum is > 0 and -1 otherwise]

$y = o(x; w, b) = \begin{cases} +1 & \text{if } f(x; w, b) > 0 \\ -1 & \text{otherwise} \end{cases}$

(sometimes use 0 instead of $-1$)


Linear Threshold Unit: Decision Surface

[Figure: (a) a linearly separable arrangement of + and - points in the (x_1, x_2) plane, split by a single line; (b) an XOR-like arrangement of + and - points that no single line can separate]

Represents some useful functions
- What parameters $(w, b)$ represent $g(x_1, x_2; w, b) = \mathrm{AND}(x_1, x_2)$?

But some functions are not representable
- I.e., those not linearly separable
- Therefore, we'll want networks of units
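One standard answer to the AND question (ours; the slide leaves it as an exercise) is $w_1 = w_2 = 1$, $b = -3/2$. A quick check in NumPy:

    import numpy as np

    def ltu(x, w, b):
        # Linear threshold unit: +1 if x.w + b > 0, else -1
        return 1 if x @ w + b > 0 else -1

    w, b = np.array([1.0, 1.0]), -1.5  # only (1, 1) clears the threshold
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, ltu(np.array(x, dtype=float), w, b))  # -1, -1, -1, +1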


Perceptron Training Rule

$w_j^{t+1} \leftarrow w_j^t + \Delta w_j^t, \quad \text{where} \quad \Delta w_j^t = \eta \left(y^t - \hat y^t\right) x_j^t$

and
- $y^t$ is the label of training instance $t$
- $\hat y^t$ is the Perceptron output on training instance $t$
- $\eta$ is a small constant (e.g., 0.1) called the learning rate

I.e., if $(y^t - \hat y^t) > 0$ then increase $w_j^t$ w.r.t. $x_j^t$, else decrease

Can prove the rule will converge if the training data is linearly separable and $\eta$ is sufficiently small
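A sketch of the rule as a training loop (NumPy; the AND data and the names are illustrative). On linearly separable data like this it converges:

    import numpy as np

    def train_perceptron(X, y, eta=0.1, epochs=100):
        X = np.hstack([np.ones((len(X), 1)), X])  # prepend x_0 = 1, so w[0] is the bias
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xt, yt in zip(X, y):
                yhat = 1 if xt @ w > 0 else -1    # linear threshold output
                w += eta * (yt - yhat) * xt       # Perceptron training rule
        return w

    # AND with +1/-1 labels
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([-1, -1, -1, 1])
    print(train_perceptron(X, y))  # e.g., [-0.4  0.4  0.2], which classifies all four points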


Where Does the Training Rule Come From?

Recall the initial linear unit, where the output is

$\hat y^t = f(x; w, b) = x^\top w + b$

(i.e., no threshold)

For each training example, compromise between correctiveness and conservativeness
- Correctiveness: tendency to improve on $x^t$ (reduce loss)
- Conservativeness: tendency to keep $w^{t+1}$ close to $w^t$ (minimize distance)

Use a cost function that measures both (let $w_0 = b$):

$J(w) = \mathrm{dist}\left(w^{t+1}, w^t\right) + \eta \, \mathrm{loss}\Big(y^t, \overbrace{w^{t+1} \cdot x^t}^{\text{curr ex, new wts}}\Big)$


Page 3


Gradient Descent

[Figure: gradient descent in one dimension: from $w^t$ with cost $J(w^t)$, a step of size $\eta$ against the gradient reaches $w^{t+1}$ with lower cost $J(w^{t+1})$. From the Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © 2010 The MIT Press]

$\Delta w_i = -\eta \frac{\partial J}{\partial w_i}, \qquad w_i \leftarrow w_i + \Delta w_i$


Gradient Descent (cont’d)

[Figure: the error surface $J(w)$ plotted over weights $(w_0, w_1)$: a bowl-shaped surface that gradient descent walks down]

$\frac{\partial J}{\partial w} = \left[\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \cdots, \frac{\partial J}{\partial w_n}\right]$


Gradient Descent (cont’d)

$J(w) = \overbrace{\left\|w^{t+1} - w^t\right\|_2^2}^{\text{conserv.}} + \overbrace{\eta}^{\text{coef.}} \overbrace{\left(y^t - w^{t+1} \cdot x^t\right)^2}^{\text{corrective}} = \sum_{j=1}^{n} \left(w_j^{t+1} - w_j^t\right)^2 + \eta \left(y^t - \sum_{j=1}^{n} w_j^{t+1} x_j^t\right)^2$

Take the gradient w.r.t. $w^{t+1}$ (i.e., $\partial J / \partial w_i^{t+1}$) and set it to 0:

$0 = 2\left(w_i^{t+1} - w_i^t\right) - 2\eta \left(y^t - \sum_{j=1}^{n} w_j^{t+1} x_j^t\right) x_i^t$


Gradient Descent (cont’d)

Approximate with

$0 = 2\left(w_i^{t+1} - w_i^t\right) - 2\eta \left(y^t - \sum_{j=1}^{n} w_j^t x_j^t\right) x_i^t,$

which yields

$w_i^{t+1} = w_i^t + \overbrace{\eta \left(y^t - \hat y^t\right) x_i^t}^{\Delta w_i^t}$

Given another loss function and unit activation function, we can derive a weight update rule the same way
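Note the derived update has the same form as the Perceptron rule, but with the unthresholded output $\hat y^t = w^t \cdot x^t$ (often called the delta or LMS rule). A one-step sketch (illustrative names and numbers):

    import numpy as np

    def delta_rule_step(w, x, y, eta=0.1):
        # w_i <- w_i + eta * (y - w.x) * x_i  (linear unit, squared loss)
        yhat = w @ x
        return w + eta * (y - yhat) * x

    w = np.zeros(3)
    x = np.array([1.0, 0.5, -1.0])  # x[0] = 1 folds in the bias
    print(delta_rule_step(w, x, y=1.0))  # [ 0.1   0.05 -0.1 ]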


Handling Nonlinearly Separable Problems: The XOR Problem

Using linear threshold units (book uses rectified linear units)

[Figure: the four XOR points A: (0,0), B: (0,1), C: (1,0), D: (1,1) in the (x_1, x_2) plane, with the lines $g_1(x) = 0$ and $g_2(x) = 0$ drawn between them; B and C are pos, A and D are neg]

Represent with intersection of two linear separators

$g_1(x) = 1 \cdot x_1 + 1 \cdot x_2 - 1/2$

$g_2(x) = 1 \cdot x_1 + 1 \cdot x_2 - 3/2$

$\text{pos} = \left\{x \in \mathbb{R}^2 : g_1(x) > 0 \text{ AND } g_2(x) < 0\right\}$

$\text{neg} = \left\{x \in \mathbb{R}^2 : g_1(x), g_2(x) < 0 \text{ OR } g_1(x), g_2(x) > 0\right\}$


Handling Nonlinearly Separable Problems: The XOR Problem (cont'd)

Let $z_i = \begin{cases} 0 & \text{if } g_i(x) < 0 \\ 1 & \text{otherwise} \end{cases}$

Class | (x_1, x_2) | g_1(x) | z_1 | g_2(x) | z_2
pos   | B: (0, 1)  |  1/2   |  1  | -1/2   |  0
pos   | C: (1, 0)  |  1/2   |  1  | -1/2   |  0
neg   | A: (0, 0)  | -1/2   |  0  | -3/2   |  0
neg   | D: (1, 1)  |  3/2   |  1  |  1/2   |  1

Now feed $z_1, z_2$ into $g(z) = 1 \cdot z_1 - 2 \cdot z_2 - 1/2$

[Figure: in the (z_1, z_2) plane, A: (0,0) and D: (1,1) are neg, while B and C both map to (1,0), which is pos; the line $g(z) = 0$ separates them]
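A quick check of the whole construction (the step thresholds follow the definitions above; the code itself is illustrative):

    def step(v):
        # z = 0 if g(x) < 0, else 1
        return 0 if v < 0 else 1

    def xor_net(x1, x2):
        z1 = step(x1 + x2 - 0.5)        # g_1(x) = x1 + x2 - 1/2
        z2 = step(x1 + x2 - 1.5)        # g_2(x) = x1 + x2 - 3/2
        return step(z1 - 2 * z2 - 0.5)  # g(z) = z1 - 2*z2 - 1/2

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print((x1, x2), xor_net(x1, x2))  # 0, 1, 1, 0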


Page 4


Handling Nonlinearly Separable Problems: The XOR Problem (cont'd)

In other words, we remapped all vectors x to z such that the classes are linearly separable in the new vector space

[Figure: the two-layer XOR network. Input layer: x_1, x_2 (plus a constant-1 input). Hidden layer: units 3 and 4 with weights $w_{31} = w_{32} = w_{41} = w_{42} = 1$ and bias weights $w_{30} = -1/2$, $w_{40} = -3/2$, outputting $z_1, z_2$. Output layer: unit 5 with weights $w_{53} = 1$, $w_{54} = -2$, and bias weight $w_{50} = -1/2$]

This is a two-layer perceptron or two-layer feedforward neural network

Can use many nonlinear activation functions in the hidden layer


Handling Nonlinearly Separable Problems: General Nonlinearly Separable Problems

By adding up to 2 hidden layers of linear threshold units, can represent any union of intersections of halfspaces

[Figure: a decision region formed as the union of two intersections of halfspaces, with pos inside the polygons and neg outside]

First hidden layer defines halfspaces, second hidden layer takes intersection (AND), output layer takes union (OR)


The Sigmoid Unit

[Rarely used in deep ANNs, but continuous and differentiable]

[Diagram: the same unit, now computing $net = \sum_{i=0}^{n} w_i x_i$ and output $o = \sigma(net) = \frac{1}{1 + e^{-net}} = f(x; w, b)$]

$\sigma(net)$ is the logistic function

$\sigma(net) = \frac{1}{1 + e^{-net}}$

(a type of sigmoid function)

Squashes $net$ into the $[0, 1]$ range

Nice property: $\frac{d\sigma(x)}{dx} = \sigma(x)\left(1 - \sigma(x)\right)$
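A numeric sanity check of that derivative identity (illustrative), comparing it against a finite difference:

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))

    x, h = 0.7, 1e-6
    analytic = sigmoid(x) * (1 - sigmoid(x))
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    print(analytic, numeric)  # both ~= 0.2217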


Sigmoid Unit: Gradient Descent

Again, use squared loss for correctiveness:

$E(w^t) = \frac{1}{2}\left(y^t - \hat y^t\right)^2$

(folding the 1/2 of correctiveness into the loss function)

Thus

$\frac{\partial E}{\partial w_j^t} = \frac{\partial}{\partial w_j^t} \frac{1}{2}\left(y^t - \hat y^t\right)^2 = \frac{1}{2} \cdot 2\left(y^t - \hat y^t\right) \frac{\partial}{\partial w_j^t}\left(y^t - \hat y^t\right) = \left(y^t - \hat y^t\right)\left(-\frac{\partial \hat y^t}{\partial w_j^t}\right)$


Sigmoid Unit: Gradient Descent (cont'd)

Since $\hat y^t$ is a function of $f(x; w, b) = net^t = w^t \cdot x^t$,

$\frac{\partial E}{\partial w_j^t} = -\left(y^t - \hat y^t\right) \frac{\partial \hat y^t}{\partial net^t} \frac{\partial net^t}{\partial w_j^t} = -\left(y^t - \hat y^t\right) \frac{\partial \sigma(net^t)}{\partial net^t} \frac{\partial net^t}{\partial w_j^t} = -\left(y^t - \hat y^t\right) \hat y^t \left(1 - \hat y^t\right) x_j^t$

Update rule:

$w_j^{t+1} = w_j^t + \eta\, \hat y^t \left(1 - \hat y^t\right)\left(y^t - \hat y^t\right) x_j^t$
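One training step for a single sigmoid unit under this rule (a sketch; the numbers and names are illustrative):

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))

    def sigmoid_unit_step(w, x, y, eta=0.1):
        # w_j <- w_j + eta * yhat*(1 - yhat)*(y - yhat) * x_j
        yhat = sigmoid(w @ x)
        return w + eta * yhat * (1 - yhat) * (y - yhat) * x

    w = np.array([0.1, 0.1, 0.1])  # w[0] is the bias weight (x[0] = 1)
    x = np.array([1.0, 1.0, 0.0])
    print(sigmoid_unit_step(w, x, y=1.0))  # weights on active inputs increase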


Multilayer Networks

[Figure: a feedforward network. Input layer: $x_0 = 1, x_1, \ldots, x_n$. Hidden layer: sigmoid units $n{+}1$ and $n{+}2$ computing $net_{n+1}$, $net_{n+2}$. Output layer: sigmoid units $n{+}3$ and $n{+}4$ computing $net_{n+3}$, $net_{n+4}$ and outputs $\hat y_{n+3}$, $\hat y_{n+4}$. Notation: $x_{ji}$ = input from unit $i$ to unit $j$; $w_{ji}$ = weight from $i$ to $j$]

For now, using sigmoid units

$E^t = E(w^t) = \frac{1}{2} \sum_{k \in \text{outputs}} \left(y_k^t - \hat y_k^t\right)^2$


Page 5


Training Multilayer Networks: Output Units

Adjust weight $w_{ji}^t$ according to $E^t$ as before

For output units, this is easy, since the contribution of $w_{ji}^t$ to $E^t$ when $j$ is an output unit is the same as in the single-neuron case¹, i.e.,

$\frac{\partial E^t}{\partial w_{ji}^t} = -\left(y_j^t - \hat y_j^t\right) \hat y_j^t \left(1 - \hat y_j^t\right) x_{ji}^t = -\delta_j^t\, x_{ji}^t$

where $\delta_j^t = -\dfrac{\partial E^t}{\partial net_j^t}$ = error term of unit $j$

¹This is because all other outputs are constants w.r.t. $w_{ji}^t$


Training Multilayer Networks: Hidden Units

- How can we compute the error term for hidden layers when there is no target output $r^t$ for these layers?
- Instead, propagate back error values from the output layer toward the input layers, scaling with the weights
- Scaling with the weights characterizes how much of the error term each hidden unit is "responsible for"


Training Multilayer Networks: Hidden Units (cont'd)

The impact that $w_{ji}^t$ has on $E^t$ is only through $net_j^t$ and the units immediately "downstream" of $j$:

$\frac{\partial E^t}{\partial w_{ji}^t} = \frac{\partial E^t}{\partial net_j^t} \frac{\partial net_j^t}{\partial w_{ji}^t} = x_{ji}^t \sum_{k \in down(j)} \frac{\partial E^t}{\partial net_k^t} \frac{\partial net_k^t}{\partial net_j^t} = x_{ji}^t \sum_{k \in down(j)} -\delta_k^t \frac{\partial net_k^t}{\partial net_j^t}$

$= x_{ji}^t \sum_{k \in down(j)} -\delta_k^t \frac{\partial net_k^t}{\partial y_j} \frac{\partial y_j}{\partial net_j^t} = x_{ji}^t \sum_{k \in down(j)} -\delta_k^t\, w_{kj} \frac{\partial y_j}{\partial net_j^t} = x_{ji}^t \sum_{k \in down(j)} -\delta_k^t\, w_{kj}\, y_j \left(1 - y_j\right)$

Works for an arbitrary number of hidden layers


Backpropagation Algorithm

Initialize all weights to small random numbers

Until the termination condition is satisfied, do
  For each training example $(y^t, x^t)$, do
    1. Input $x^t$ to the network and compute the outputs $\hat y^t$
    2. For each output unit $k$: $\delta_k^t \leftarrow \hat y_k^t \left(1 - \hat y_k^t\right)\left(y_k^t - \hat y_k^t\right)$
    3. For each hidden unit $h$: $\delta_h^t \leftarrow \hat y_h^t \left(1 - \hat y_h^t\right) \sum_{k \in down(h)} w_{k,h}^t\, \delta_k^t$
    4. Update each network weight: $w_{j,i}^t \leftarrow w_{j,i}^t + \Delta w_{j,i}^t$, where $\Delta w_{j,i}^t = \eta\, \delta_j^t\, x_{j,i}^t$
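A compact sketch of this algorithm for one hidden layer of sigmoid units with squared loss (our code, under those assumptions; the seed, learning rate, and hidden width are arbitrary). Trained on XOR it usually converges:

    import numpy as np

    rng = np.random.default_rng(0)
    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR inputs
    Y = np.array([0.0, 1.0, 1.0, 0.0])                           # 0/1 targets

    n_hidden, eta = 4, 0.5
    W1 = rng.uniform(-0.5, 0.5, (n_hidden, 3))  # hidden weights; column 0 is the bias
    W2 = rng.uniform(-0.5, 0.5, n_hidden + 1)   # output weights; index 0 is the bias

    def forward(xt):
        x = np.append(1.0, xt)                  # x_0 = 1
        h = sigmoid(W1 @ x)
        return x, h, sigmoid(W2 @ np.append(1.0, h))

    for _ in range(10000):
        for xt, yt in zip(X, Y):
            x, h, yhat = forward(xt)
            delta_out = yhat * (1 - yhat) * (yt - yhat)    # step 2
            delta_hid = h * (1 - h) * W2[1:] * delta_out   # step 3
            W2 += eta * delta_out * np.append(1.0, h)      # step 4
            W1 += eta * np.outer(delta_hid, x)

    print([round(float(forward(xt)[2]), 2) for xt in X])   # usually ~[0, 1, 1, 0]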


Backpropagation Algorithm: Example

Network: inputs a and b (plus a constant-1 input) feed sigmoid hidden unit c; y_c (plus a constant-1 input) feeds sigmoid output unit d. eta = 0.3.

Weights            initial     after trial 1   after trial 2
w_ca               0.1         0.1008513       0.1008513
w_cb               0.1         0.1             0.0987985
w_c0               0.1         0.1008513       0.0996498
w_dc               0.1         0.1189104       0.0964548
w_d0               0.1         0.1343929       0.0935679

Per-trial values   trial 1     trial 2
a                  1           0
b                  0           1
const              1           1
sum_c              0.2         0.2008513
y_c                0.5498340   0.5500447
sum_d              0.1549834   0.1997990
y_d                0.5386685   0.5497842
target             1           0
delta_d            0.1146431   -0.136083
delta_c            0.0028376   -0.004005

delta_d(t) = y_d(t) * (y(t) - y_d(t)) * (1 - y_d(t))
delta_c(t) = y_c(t) * (1 - y_c(t)) * delta_d(t) * w_dc(t)
w_dc(t+1) = w_dc(t) + eta * y_c(t) * delta_d(t)
w_ca(t+1) = w_ca(t) + eta * a * delta_c(t)
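The table can be reproduced directly from these formulas; a sketch assuming the topology implied by the weight names (checked against the numbers above):

    import math

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    eta = 0.3
    w_ca = w_cb = w_c0 = w_dc = w_d0 = 0.1

    for a, b, target in [(1, 0, 1), (0, 1, 0)]:       # trial 1, trial 2
        y_c = sigmoid(w_ca * a + w_cb * b + w_c0)
        y_d = sigmoid(w_dc * y_c + w_d0)
        delta_d = y_d * (target - y_d) * (1 - y_d)
        delta_c = y_c * (1 - y_c) * delta_d * w_dc    # uses w_dc(t), before the update
        w_dc += eta * y_c * delta_d
        w_d0 += eta * delta_d
        w_ca += eta * a * delta_c
        w_cb += eta * b * delta_c
        w_c0 += eta * delta_c
        print(round(y_c, 7), round(y_d, 7), round(delta_d, 7), round(delta_c, 7))
    # trial 1: 0.549834  0.5386685  0.1146431  0.0028376
    # trial 2: 0.5500447 0.5497842 -0.136083  -0.004005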


Types of Output Units

Given hidden layer outputs $h$:

Linear unit (Sec 6.2.2.1): $y = w^\top h + b$
- Minimizing squared loss with this output unit maximizes log likelihood when the labels come from a normal distribution
- I.e., find a set of parameters $\theta$ that is most likely to generate the labels of the training data
- Works well with GD training

Sigmoid (Sec 6.2.2.2): $y = \sigma\left(w^\top h + b\right)$
- Approximates the non-differentiable threshold function
- More common in older, shallower networks
- Can be used to predict probabilities

Softmax unit (Sec 6.2.2.3): start with $z = W^\top h + b$
- Predict the probability of label $i$ to be $\mathrm{softmax}(z)_i = \exp(z_i) \Big/ \sum_j \exp(z_j)$
- Continuous, differentiable approximation to argmax (see the sketch below)
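A sketch of softmax (subtracting the max is a standard numerical-stability trick, not something from the slide):

    import numpy as np

    def softmax(z):
        # softmax(z)_i = exp(z_i) / sum_j exp(z_j); shifting by max(z)
        # leaves the result unchanged but avoids overflow
        e = np.exp(z - np.max(z))
        return e / e.sum()

    z = np.array([2.0, 1.0, 0.1])
    print(softmax(z), softmax(z).sum())  # ~[0.659 0.242 0.099], sums to 1.0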

Page 6


Types of Hidden Units

Rectified linear unit (ReLU) (Sec 6.3.1): $\max\{0, W^\top x + b\}$
- Good default choice
- Second derivative is 0 almost everywhere, and derivatives are large
- In general, GD works well when functions are nearly linear

Logistic sigmoid (done already) and tanh (Sec 6.3.2)
- Nice approximation to threshold, but don't train well in deep networks
- Still potentially useful when piecewise functions are inappropriate

Softmax (occasionally used as hidden)
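For concreteness, a small sketch of these hidden activations (illustrative):

    import numpy as np

    def relu(v):
        return np.maximum(0.0, v)  # max{0, v}, elementwise

    v = np.array([-1.5, 0.0, 2.0])
    print(relu(v))      # [0. 0. 2.]  -- piecewise linear, sparse
    print(np.tanh(v))   # [-0.905 0. 0.964]  -- smooth, saturating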


Putting Everything Together: Hidden Layers

How many layers to use?
- Deep networks tend to build potentially useful representations of the data via composition of simple functions
- Performance improvement is not simply from a more complex network (number of parameters)
- Increasing the number of layers still increases the chance of overfitting, so a deep network needs a significant amount of training data; training time increases as well

[Figures: accuracy vs. depth; accuracy vs. complexity]


Putting Everything Together: Universal Approximation Theorem

- Any boolean function can be represented with two layers
- Any bounded, continuous function can be represented with arbitrarily small error with two layers
- Any function can be represented with arbitrarily small error with three layers

Only an EXISTENCE PROOF
- Could need exponentially many nodes in a layer
- May not be able to find the right weights
