Lecture 2 Math
Transcript of Lecture 2 Math
Neural Networks and Learning Systems
TBMI 26
Lecture 2
Supervised learning: Linear models
Recap - Supervised learning
Task: Learn to predict/classify new data from labeled examples.
Input: Training data examples $\{x_i, y_i\}$, $i = 1 \dots N$, where $x_i$ is a feature vector and $y_i$ is a class label in the set $\Omega$. Today we'll mostly assume two classes: $\Omega = \{-1, 1\}$.
Output: A function $f(x; w_1, \dots, w_k) \in \Omega$.
Find a function $f$ and adjust the parameters $w_1, \dots, w_k$ so that new feature vectors are classified correctly. Generalization!
How does the brain make decisions? (At the low level!)
Basic unit: the neuron
The human brain has approximately 100 billion ($10^{11}$) neurons.
Each neuron is connected to about 7,000 other neurons.
Approx. $10^{14}$ to $10^{15}$ synapses (connections).
[Figure: a neuron, with the cell body, dendrites, and axon labeled.]
Model of a neuron
[Figure: model of a neuron. Input signals $x_1, x_2, \dots, x_n$ are multiplied by weights $w_1, w_2, \dots, w_n$, summed ($\Sigma$), and passed through an activation function.]
The Perceptron
(McCulloch & Pitts 1943, Rosenblatt 1962)
$$f(x_1, \dots, x_n; w_0, w_1, \dots, w_n) = \sigma\Big(w_0 + \sum_{i=1}^{n} w_i x_i\Big) = \sigma(w^T x)$$
Extra reading on the history of the perceptron:
http://www.csulb.edu/~cwallis/artificialn/History.htm
Step function: $\sigma(x) = \mathrm{sign}(x)$ (not differentiable!)
Linear function: $\sigma(x) = x$ (not biologically plausible!)
Sigmoid function: $\sigma(x) = \tanh(x)$
[Figure: plots of the step, linear, and sigmoid (tanh) activation functions.]
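As a quick illustration, the three activation functions could be written in NumPy roughly as follows (a minimal sketch; the function names are my own):

```python
import numpy as np

# Step activation: outputs -1 or +1, not differentiable at 0.
def step(x):
    return np.sign(x)

# Linear (identity) activation: unbounded output.
def linear(x):
    return x

# Sigmoid activation as used on the slide: smooth, bounded in (-1, 1).
def sigmoid(x):
    return np.tanh(x)
```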
Notational simplification: Bias weight
$$f(x_1, \dots, x_n; w_0, \dots, w_n) = \sigma\Big(\sum_{i=0}^{n} w_i x_i\Big) = \sigma(w^T x)$$
Add a constant 1 to the feature vector so that we don't have to treat $w_0$ separately.
Instead of $x = [x_1, \dots, x_n]^T$, we have $x = [1, x_1, \dots, x_n]^T$.
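A small NumPy sketch of this augmentation (the sample values are made up for illustration):

```python
import numpy as np

# Two samples with n = 2 features each.
X = np.array([[0.5, 2.0],
              [1.5, -1.0]])

# Prepend a constant 1 so the bias weight w0 becomes an ordinary
# component of w: each row is now [1, x1, x2].
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
```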
Activation functions in 2D
$\sigma(w^T x) = \mathrm{sign}(w^T x)$
$\sigma(w^T x) = w^T x$
$\sigma(w^T x) = \tanh(w^T x)$
Geometry of linear classifiers
[Figure: in the $(x_1, x_2)$ plane, the decision boundary is the line $y = w_0 + w_1 x_1 + w_2 x_2 = 0$. The vector $(w_1, w_2)^T$ is normal to the line; $y > 0$ on one side and $y < 0$ on the other.]
Another view: the super feature
$$w^T x = \sum_{i=0}^{n} w_i x_i$$
Reduces the original $n$ features into one single feature. If seen as vectors, $w^T x$ is the projection of $x$ onto $w$.
[Figure: the projection $w^T x$ of a point $x$ onto the direction $w = (w_1, w_2)^T$.]
Advantages of a parametric model
Only stores a few parameters $(w_0, w_1, \dots, w_n)$ instead of all the training samples, as in k-NN.
Very fast to evaluate which side of the line a new sample is on: $w^T x \gtrless 0$. The quantity $w^T x$ is also known as a discriminant function.
$$f(x; w) = w^T x$$
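A minimal sketch of evaluating the discriminant function for a new sample (weight and feature values are made up):

```python
import numpy as np

w = np.array([-0.5, 1.0, 2.0])      # [w0, w1, w2]
x_new = np.array([1.0, 0.3, 0.8])   # bias-augmented sample [1, x1, x2]

# Classify by the sign of the discriminant function w^T x.
label = 1 if w @ x_new > 0 else -1
```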
Which linear classifier to choose?
[Figure: a two-class dataset in the $(x_1, x_2)$ plane with a separating line $y = w_0 + w_1 x_1 + w_2 x_2 = 0$; many such lines are possible.]
Find the best separator
Optimization!
Min/max of a cost function $\varepsilon(w_0, w_1, \dots, w_n)$ with the weights $w_0, w_1, \dots, w_n$ as parameters.
Two ways to optimize:
Algebraic: Set the derivative to zero, $\partial\varepsilon/\partial w_i = 0$, and solve!
Iterative numeric: Follow the gradient direction until a minimum/maximum of the cost function is reached. This is called gradient descent/ascent!
Gradient descent/ascent
How to get to the lowest point?
$$\nabla f = \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \end{bmatrix}$$
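A minimal sketch of gradient descent in NumPy; the cost function, starting point, and step length are chosen only for illustration:

```python
import numpy as np

def f(x):
    # Simple 2D cost function.
    return x[0]**2 + 3 * x[1]**2

def grad_f(x):
    # Its gradient [df/dx1, df/dx2].
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([2.0, -1.0])    # starting point
eta = 0.1                    # step length
for _ in range(100):
    x = x - eta * grad_f(x)  # descent: step against the gradient
```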
Gradient descent
[Figure: gradient descent with a small step length $\eta$, a large $\eta$, and a too large $\eta$, illustrating the effect of choosing the step length.]
Choosing the step length
The size of the gradient is not relevant!
[Figure: a cost function $\varepsilon(w)$. Where the derivative is small, the steps are small; where the derivative is large, the steps are large.]
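One way to make the step length independent of the gradient's magnitude is to normalize the gradient before stepping; this is an assumed scheme shown only for illustration, not something the slides prescribe:

```python
import numpy as np

def descent_step(x, grad, eta=0.05):
    # Move a fixed distance eta in the (negative) gradient direction,
    # regardless of how large or small the gradient is.
    g = grad(x)
    return x - eta * g / (np.linalg.norm(g) + 1e-12)
```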
Local optima
Gradient search is not guaranteed to find the
global minimum/maximum.
With a sufficiently small step length, the
closest local optimum will be found.
[Figure: a cost function with several local minima and one global minimum.]
Online learning vs. batch learning
Batch learning: Use all training examples to update the classifier. Most often used.
Online learning: Update the classifier using one training example at a time. Can be used when training samples arrive sequentially, e.g., adaptive learning. Also known as stochastic ascent/descent.
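A sketch of the difference between the two update styles, written for a generic per-sample gradient grad_i; the names and details are illustrative, not from the slides:

```python
import numpy as np

def batch_update(w, X, y, grad_i, eta=0.01):
    # Batch learning: average the gradient over all training examples.
    g = np.mean([grad_i(w, x, t) for x, t in zip(X, y)], axis=0)
    return w - eta * g

def online_update(w, x, t, grad_i, eta=0.01):
    # Online (stochastic) learning: use a single training example.
    return w - eta * grad_i(w, x, t)
```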
Three different cost functions
Perceptron algorithm
Linear Discriminant Analysis (a.k.a. Fisher Linear Discriminant)
Support Vector Machines
Perceptron algorithm
Maximize the following cost function:
$$\varepsilon(w_0, w_1, \dots, w_n) = \sum_{i \in I} y_i\, w^T x_i$$
where $n$ = number of features, $I$ = the set of misclassified training samples, and $y_i \in \{-1, 1\}$ depending on the class of training sample $i$.
Historically (1960s) one of the first machine learning algorithms.
Considers only misclassified samples. Each term $y_i\, w^T x_i$ is negative for a misclassified sample, so $\varepsilon(w_0, w_1, \dots, w_n)$ is always negative, or 0 for a perfect classification!
Linear activation function.
Perceptron algorithm, cont.
$$\varepsilon(w_0, w_1, \dots, w_n) = \sum_{i \in I} y_i\, w^T x_i
\qquad
\frac{\partial \varepsilon}{\partial w_k} = \sum_{i \in I} y_i\, x_{ik}
\qquad
\frac{\partial \varepsilon}{\partial w} = \sum_{i \in I} y_i\, x_i$$
Gradient ascent:
$$w_{t+1} = w_t + \eta \frac{\partial \varepsilon}{\partial w} = w_t + \eta \sum_{i \in I} y_i\, x_i \qquad \text{(Eq. 1)}$$
Algorithm:
1. Start with a random $w$
2. Iterate Eq. 1 until convergence
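A minimal NumPy sketch of this algorithm; X is assumed to be bias-augmented (a leading column of ones) and y contains labels in {-1, +1}:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_iter=1000):
    rng = np.random.default_rng(0)
    w = rng.standard_normal(X.shape[1])   # 1. start with a random w
    for _ in range(max_iter):
        I = np.sign(X @ w) != y           # I: misclassified samples
        if not I.any():
            break                         # perfect classification
        w = w + eta * (y[I] @ X[I])       # 2. Eq. 1: gradient ascent step
    return w
```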
Perceptron example
Perceptron algorithm, cont.
If the classes are linearly separable, the
perceptron algorithm will converge to a
solution that separates the classes.
If the classes are not linearly separable, the
algorithm will fail.
Which solution it arrives at depends on the starting value of $w$.
Linear Discriminant Analysis (LDA)
Consider the projected feature vectors.
[Figure: two classes in the $(x_1, x_2)$ plane projected onto the direction $(w_1, w_2)^T$.]
LDA: Different projections
[Figure: two different projection directions in the $(x_1, x_2)$ plane: an arbitrary projection and the optimal projection, together with the corresponding separating plane.]
LDA: Separation
[Figure: a good projection gives small variance within each class and a large distance between the class means; a bad projection gives large variance and a small distance.]
Goal: minimize variance and maximize distance.
Maximize:
$$\varepsilon = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$$
LDA: Cost function
$$\varepsilon = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$$
The mean of the projected features of one class is
$$\mu = \frac{1}{n}\sum_{i=1}^{n} w^T x_i = w^T \bar{x}$$
Variance:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \big(w^T x_i - w^T \bar{x}\big)^2 = w^T \Big(\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T\Big) w = w^T C w$$
$$\sigma_1^2 + \sigma_2^2 = w^T C_1 w + w^T C_2 w = w^T C_{tot} w$$
Distance (between the projected class means of $\bar{x}_1$ and $\bar{x}_2$):
$$(\mu_1 - \mu_2)^2 = \big(w^T \bar{x}_1 - w^T \bar{x}_2\big)^2 = w^T (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^T w = w^T M w$$
LDA: Cost function, cont.
$$\varepsilon = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} = \frac{w^T M w}{w^T C_{tot} w}$$
This form is called a (generalized) Rayleigh quotient, which is maximized by the eigenvector with the largest eigenvalue of the generalized eigenvalue problem $M w = \lambda C_{tot} w$!
Simplification:
$$M w = (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^T w = K (\bar{x}_1 - \bar{x}_2) \quad \text{for some scalar } K$$
$$\Rightarrow \quad w \sim C_{tot}^{-1} (\bar{x}_1 - \bar{x}_2)$$
Scaling of $w$ is not important!
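A minimal sketch of computing the LDA direction $w \sim C_{tot}^{-1}(\bar{x}_1 - \bar{x}_2)$ in NumPy (np.cov uses the $1/(n-1)$ normalization, slightly different from the $1/n$ used above):

```python
import numpy as np

def lda_direction(X1, X2):
    # X1, X2: samples of class 1 and class 2, one row per sample.
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    C_tot = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
    w = np.linalg.solve(C_tot, m1 - m2)   # w ~ C_tot^{-1} (m1 - m2)
    return w / np.linalg.norm(w)          # scaling of w is not important
```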
LDA - Examples
[Figures: two 2D example datasets illustrating LDA.]
Support Vector Machines (SVM): Idea!
The optimal separation line remains the same; feature points close to the class boundary are more important! These are called support vectors!
[Figure: a two-class dataset where the samples closest to the separating line are marked as support vectors.]
SVM: Maximum margin
[Figure: a separating line between the two classes with a margin of width $2/\lVert w \rVert$.]
Choose the $w$ that gives maximum margin!
SVM: Cost function
[Figure: the decision boundary $w^T x + w_0 = 0$ with margin lines $w^T x + w_0 = 1$ and $w^T x + w_0 = -1$; one class lies in the region $w^T x + w_0 > 1$, the other in $w^T x + w_0 < -1$.]
Scaling of $w$ is free, so choose a specific scaling!
Let $x_p$ be a point on the decision boundary and $x_s = x_p + \delta$ a support vector on the margin, where $\delta$ is parallel to $w$:
$$w^T x_s + w_0 = 1 \;\Rightarrow\; w^T (x_p + \delta) + w_0 = 1 \;\Rightarrow\; w^T \delta + \underbrace{w^T x_p + w_0}_{=\,0} = 1$$
Since $\delta \parallel w$, we have $w^T \delta = \lVert w \rVert\, \lVert \delta \rVert$, so for the support vector $\lVert \delta \rVert = 1 / \lVert w \rVert$.
SVM: Cost function, cont.
Maximizing the margin $\lVert \delta \rVert = 1 / \lVert w \rVert$ is the same as minimizing $\lVert w \rVert$!
$$\min_{w} \lVert w \rVert \quad \text{subject to} \quad y_i \big(w^T x_i + w_0\big) \geq 1 \ \text{for all } i$$
No training sample may reside inside the margin region!
The optimization procedure is outside the scope of this course.
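In practice one of the many existing software packages is used to solve this optimization. A hedged sketch using scikit-learn with made-up data (any comparable SVM package would do):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 1.0], [1.0, 1.5], [3.0, 4.0], [4.0, 3.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e3)   # large C approximates a hard margin
clf.fit(X, y)
w, w0 = clf.coef_[0], clf.intercept_[0]   # the learned w and w0
print(w, w0, clf.support_vectors_)        # support vectors of the fit
```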
Summary: Linear classifiers
Perceptron algorithm:
Of historical interest
Linear Discriminant Analysis
Simple to implement
Very useful as a first classifier to try
Support Vector Machines
Considered by many to be a state-of-the-art classifier principle, especially in its nonlinear forms (see next lecture)
Many software packages exist on the internet
What about more than 2 classes?
Common solution: Combine several binary classifiers and evaluate the most likely class, as in the sketch below.
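A hedged sketch of one common combination scheme (one-vs-rest); the slides do not prescribe this particular scheme:

```python
import numpy as np

def one_vs_rest_predict(ws, x):
    # ws[k]: weight vector of a binary classifier trained to separate
    # class k (label +1) from all other classes (label -1).
    scores = np.array([w @ x for w in ws])   # discriminant values w^T x
    return int(np.argmax(scores))            # most likely class
```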