Transcript of CSE 5526: Introduction to Neural Networks
Part I 1
Instructor: DeLiang Wang
What is this course about?
• AI (artificial intelligence) in the broad sense, in particular learning
• The human brain and its amazing abilities, e.g. vision
Human brain
[Figure: the human brain, with the lateral fissure and central sulcus labeled]
Brain versus computer
• Brain
• Computer
• Brain-like computation
  – Neural networks (NN or ANN)
  – Neural computation
• Discuss syllabus
A single neuron
Real neurons, real synapses
• Properties
  – Action potential (impulse) generation
  – Impulse propagation
  – Synaptic transmission & plasticity
  – Spatial summation
• Terminology
  – Neurons – units – nodes
  – Synapses – connections – architecture
  – Synaptic weight – connection strength (either positive or negative)
Model of a single neuron
Neuronal model
u_k = \sum_{j=1}^{m} w_{kj} x_j    (adder, weighted sum, linear combiner)

v_k = u_k + b_k    (activation potential; b_k: bias)

y_k = \varphi(v_k)    (output; \varphi: activation function)
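The three equations above translate directly into code. A minimal sketch, assuming NumPy; the function name and the choice of tanh as a sample activation are illustrative, not from the slides:

```python
import numpy as np

def neuron(x, w, b, phi=np.tanh):
    """Single neuron: linear combiner, bias, then activation."""
    u = np.dot(w, x)   # u_k = sum_j w_kj * x_j  (weighted sum)
    v = u + b          # activation potential v_k = u_k + b_k
    return phi(v)      # output y_k = phi(v_k)
```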
Another way of including bias: set x_0 = +1 and w_{k0} = b_k. So we have

v_k = \sum_{j=0}^{m} w_{kj} x_j
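This bias-absorbing trick is easy to check numerically (a sketch; the function name is mine):

```python
import numpy as np

def v_augmented(x, w):
    """Bias folded into the weights: x0 = +1 and w[0] plays the role of b_k."""
    x_aug = np.concatenate(([1.0], x))   # prepend x0 = +1
    return np.dot(w, x_aug)              # v_k = sum_{j=0}^m w_kj x_j
```

With w = [b, w1, w2], this equals w1*x1 + w2*x2 + b, the original form with an explicit bias.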
McCulloch-Pitts model
x_i \in \{+1, -1\}    (bipolar input)

y = \varphi\Big(\sum_{i=1}^{m} w_i x_i + b\Big)

\varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ -1 & \text{if } v < 0 \end{cases}    (a form of signum (sign) function)

Note difference from textbook; also number of neurons in the brain
McCulloch-Pitts model (cont.)
• Example logic gates (see blackboard)
• McCulloch-Pitts networks (introduced in 1943) are the first class of abstract computing machines: finite-state automata
• Finite-state automata can compute any logic (Boolean) function
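For instance, with bipolar inputs the AND and OR gates can each be realized by a single McCulloch-Pitts neuron. The gates are the blackboard examples; the specific weight choices below are one possibility, my illustration:

```python
def phi(v):
    # signum-style activation: 1 if v >= 0, else -1
    return 1 if v >= 0 else -1

def mp_neuron(x, w, b):
    return phi(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Bipolar AND: fires only when both inputs are +1
AND = lambda x1, x2: mp_neuron((x1, x2), (1, 1), -1)
# Bipolar OR: fires unless both inputs are -1
OR = lambda x1, x2: mp_neuron((x1, x2), (1, 1), 1)
```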
Network architecture
• View an NN as a connected, directed graph, which defines its architecture
• Feedforward nets: loop-free graph
• Recurrent nets: with loops
Feedforward net
• Since the input layer consists of source nodes, it is typically not counted when we talk about the number of layers in a feedforward net
• For example, the architecture of 10-4-2 counts as a two-layer net
A one-layer recurrent net
In this net, the input typically sets the initial condition of the output layer
Network components
• Three components characterize a neural net
  – Architecture
  – Activation function
  – Learning rule (algorithm)
Perceptrons
Perceptrons
• Architecture: one-layer feedforward net
• Without loss of generality, consider a single-neuron perceptron
[Figure: one-layer feedforward net with inputs x1, x2, ..., xm and outputs y1, y2]
Definition
y = \varphi(v)

\varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ -1 & \text{otherwise} \end{cases}

v = \sum_{i=1}^{m} w_i x_i + b

Hence a McCulloch-Pitts neuron, but with real-valued inputs
Pattern recognition
• With a bipolar output, the perceptron performs 2-class classification
  – Apples vs. oranges
• How do we learn to perform a classification problem?
  – Task: The perceptron is given pairs of input x_p and desired output d_p. How to find w (with b incorporated) so that y_p = d_p for all p?
Decision boundary
• The decision boundary for a given w:

g(\mathbf{x}) = \sum_{i=1}^{m} w_i x_i + b = 0

• g is also called the discriminant function for the perceptron, and it is a linear function of x. Hence it is a linear discriminant function
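As a quick sketch (function names are mine), the discriminant and the classifier it induces:

```python
import numpy as np

def g(x, w, b):
    """Linear discriminant g(x) = sum_i w_i x_i + b; g(x) = 0 is the boundary."""
    return np.dot(w, x) + b

def classify(x, w, b):
    # sign of the discriminant gives the class label
    return 1 if g(x, w, b) >= 0 else -1
```

For example, with w = (1, 1) and b = -1 the decision boundary is the line x1 + x2 = 1.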
Example
• See blackboard
Decision boundary (cont.)
• For an m-dimensional input space, the decision boundary is an (m − 1)-dimensional hyperplane perpendicular to w. The hyperplane separates the input space into two halves, with one half having y = +1 and the other half having y = −1
• When b = 0, the hyperplane goes through the origin

[Figure: decision boundary g = 0 in the (x1, x2) plane, with w pointing into the y = +1 half-space; x marks one class, o the other]
Linear separability
• For a set of input patterns x_p, if there exists a w that separates the d = 1 patterns from the d = −1 patterns, then the classification problem is linearly separable
  – In other words, there exists a linear discriminant function that produces no classification error
• Examples: AND, OR, XOR (see blackboard)
• A very important concept
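The AND vs. XOR contrast can be demonstrated with a brute-force search over a small integer weight grid (the grid is only illustrative; the non-separability of XOR of course holds for all real-valued weights):

```python
import itertools

patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
AND = {(-1, -1): -1, (-1, 1): -1, (1, -1): -1, (1, 1): 1}
XOR = {(-1, -1): -1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}

def separable_on_grid(labels, grid=range(-3, 4)):
    """Search for a (w1, w2, b) that classifies every pattern correctly."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all((1 if w1 * x1 + w2 * x2 + b >= 0 else -1) == labels[(x1, x2)]
               for x1, x2 in patterns):
            return True
    return False
```

The search finds a separating line for AND (e.g. w = (1, 1), b = −1) but none exists for XOR.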
Linear separability: a more general illustration
Perceptron learning rule
• Strengthen an active synapse if the postsynaptic neuron fails to fire when it should have fired; weaken an active synapse if the neuron fires when it should not have fired
• Formulated by Rosenblatt based on biological intuition
Quantitatively
w(n+1) = w(n) + \Delta w(n)

\Delta w(n) = \eta\,[d(n) - y(n)]\,x(n)

n: iteration number; η: step size or learning rate

In vector form:

\Delta\mathbf{w} = \eta\,[d - y]\,\mathbf{x}
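The rule above translates directly into a training loop. A minimal sketch with the bias absorbed via x0 = +1; names are mine:

```python
import numpy as np

def train_perceptron(X, d, eta=0.5, max_epochs=100):
    """Perceptron rule: w(n+1) = w(n) + eta * [d(n) - y(n)] * x(n)."""
    Xa = np.hstack([np.ones((len(X), 1)), X])  # x0 = +1 absorbs the bias
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(Xa, d):
            y = 1 if np.dot(w, x) >= 0 else -1
            if y != t:                   # d - y = 0 on a correct output,
                w += eta * (t - y) * x   # so only mistakes move w
                mistakes += 1
        if mistakes == 0:                # converged: all patterns correct
            break
    return w

# Bipolar AND is linearly separable, so training converges
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1])
w = train_perceptron(X, d)
```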
Geometric interpretation
• Assume η = 1/2

[Figure sequence over several slides: successive weight vectors w(0), w(1), w(2), w(3) in the (x1, x2) plane, with x marking d = 1 patterns and o marking d = −1 patterns]

Each weight update moves w closer to the d = 1 patterns, or away from the d = −1 patterns. w(3) solves the classification problem
Perceptron convergence theorem
• Theorem: If a classification problem is linearly separable, a perceptron will reach a solution in a finite number of iterations
• [Proof] Given a finite number of training patterns, because of linear separability there exists a weight vector w_o so that

d_p \mathbf{w}_o^T \mathbf{x}_p \ge \alpha > 0    (1)

where

\alpha = \min_p \; d_p(\mathbf{w}_o^T \mathbf{x}_p)
Proof (cont.)
• We assume that the initial weights are all zero. Let N_p denote the number of times x_p has been used for actually updating the weight vector at some point in learning. At that time:

\mathbf{w} = \sum_p N_p\,\eta\,[d_p - y_p]\,\mathbf{x}_p = 2\eta \sum_p N_p d_p \mathbf{x}_p

since y_p = −d_p whenever x_p triggers an update, so d_p − y_p = 2d_p
Proof (cont.)
• Consider first

\mathbf{w}_o^T \mathbf{w} = 2\eta \sum_p N_p d_p\,\mathbf{w}_o^T \mathbf{x}_p \ge 2\eta\alpha \sum_p N_p = 2\eta\alpha P    (2)

due to (1), where

P = \sum_p N_p
Proof (cont.)
• Now consider the change in squared length after a single update by x:

\Delta\|\mathbf{w}\|^2 = \|\mathbf{w} + \Delta\mathbf{w}\|^2 - \|\mathbf{w}\|^2 = 4\eta d\,\mathbf{w}^T\mathbf{x} + 4\eta^2\|\mathbf{x}\|^2 \le 4\eta^2\beta

since upon an update d(\mathbf{w}^T\mathbf{x}) \le 0, where

\beta = \max_p \|\mathbf{x}_p\|^2
Proof (cont.)
• By summing over P update steps we have the bound:

\|\mathbf{w}\|^2 \le 4\eta^2\beta P    (3)

• Now square the cosine of the angle between w_o and w; by the Cauchy-Schwarz inequality and (2), (3):

1 \ge \frac{(\mathbf{w}_o^T\mathbf{w})^2}{\|\mathbf{w}_o\|^2\|\mathbf{w}\|^2} \ge \frac{(2\eta\alpha P)^2}{\|\mathbf{w}_o\|^2\,4\eta^2\beta P} = \frac{\alpha^2 P}{\beta\|\mathbf{w}_o\|^2}
Proof (cont.)
• Thus, P must be finite to satisfy the above inequality. This completes the proof
• Remarks
  – In the case of w(0) = 0, the learning rate has no effect on the proof. That is, the theorem holds no matter what η (η > 0) is
  – The solution weight vector is not unique
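The remark about η can be checked empirically: starting from w(0) = 0, the weight vector is simply scaled by η, so the sequence of mistakes, and hence the number of updates, is identical for any η > 0 (a sketch; names are mine):

```python
import numpy as np

def count_updates(X, d, eta, max_epochs=100):
    """Train from w(0) = 0 and count weight updates until convergence."""
    Xa = np.hstack([np.ones((len(X), 1)), X])  # x0 = +1 absorbs the bias
    w = np.zeros(Xa.shape[1])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(Xa, d):
            if (1 if np.dot(w, x) >= 0 else -1) != t:
                w += eta * 2 * t * x   # d - y = 2d on a mistake
                updates += 1
                mistakes += 1
        if mistakes == 0:
            break
    return updates

# Bipolar OR: linearly separable, so both runs converge
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1, 1, 1, 1])
```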
Generalization
• Performance of a learning machine on test patterns not used during training
• Example: Class 1: handwritten "m"; Class 2: handwritten "n" (see blackboard)
• Perceptrons generalize by deriving a decision boundary in the input space. Selection of training patterns is thus important for generalization