Learning with linear neurons
![Page 1: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/1.jpg)
Learning with linear neurons
Adapted from lectures by Geoffrey Hinton and others. Updated by N. Intrator, May 2007.
![Page 2: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/2.jpg)
Prehistory
W.S. McCulloch & W. Pitts (1943). “A logical calculus of the ideas immanent in nervous activity”, Bulletin of Mathematical Biophysics, 5, 115-137.
[Figure: a McCulloch-Pitts unit computing logical AND. Inputs x and y each have weight +1 and a constant input has weight −2, so sum = x + y − 2; the output is 0 if sum < 0, else 1.]

Truth table for logical AND:

| x | y | x & y |
|---|---|-------|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
• This seminal paper pointed out that simple artificial “neurons” could be made to perform basic logical operations such as AND, OR and NOT.
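As an illustration (not part of the original slides), such a unit is only a few lines of Python; the function name and the 0/1 encoding are my own choices:

```python
def mcculloch_pitts(inputs, weights, bias):
    """McCulloch-Pitts unit: weighted sum plus bias, thresholded at zero."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 0 if total < 0 else 1

# Logical AND: weights +1, +1 and bias -2, as in the diagram above.
for x in (0, 1):
    for y in (0, 1):
        print(x, y, mcculloch_pitts([x, y], [1, 1], -2))  # 1 only when x = y = 1

# Logical OR uses the same weights with bias -1 instead.
```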
![Page 3: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/3.jpg)
Nervous Systems as Logical Circuits
Groups of these “neuronal” logic gates could carry out any computation, even though each neuron was very limited.
[Figure: a McCulloch-Pitts unit computing logical OR. Inputs x and y each have weight +1 and a constant input has weight −1, so sum = x + y − 1; the output is 0 if sum < 0, else 1.]

Truth table for logical OR:

| x | y | x \| y |
|---|---|--------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
• Could computers built from these simple units reproduce the computational power of biological brains?
• Were biological neurons performing logical operations?
![Page 4: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/4.jpg)
The Perceptron
It obeyed the following rule:
If the sum of the weighted inputs exceeds a threshold, output 1, else output -1.
[Figure: a perceptron. The inputs, multiplied by their weights, are summed (Σᵢ xᵢwᵢ) and thresholded to give the output.]
*Frank Rosenblatt (1962). Principles of Neurodynamics, Spartan, New York, NY.
Subsequent progress was driven by the invention of learning rules inspired by ideas from neuroscience…
Rosenblatt’s Perceptron could automatically learn to categorise or classify input vectors into types.
output = +1 if Σᵢ inputᵢ · weightᵢ > threshold, and −1 otherwise.
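As a hedged illustration (again not from the original lecture), Rosenblatt's decision rule in Python; the names and numbers are invented:

```python
def perceptron_output(inputs, weights, threshold):
    """Rosenblatt perceptron: +1 if the weighted sum exceeds the threshold, else -1."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else -1

print(perceptron_output([1.0, 0.5], [0.4, 0.8], threshold=0.7))  # prints 1, since 0.8 > 0.7
```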
![Page 5: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/5.jpg)
Linear neurons
• The neuron has a real-valued output which is a weighted sum of its inputs
• The aim of learning is to minimize the discrepancy between the desired output and the actual output
– How do we measure the discrepancies?
– Do we update the weights after every training case?
– Why don’t we solve it analytically?
ŷ = Σᵢ wᵢ xᵢ = wᵀx

where ŷ is the neuron's estimate of the desired output, x is the input vector, and w is the weight vector.
![Page 6: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/6.jpg)
A motivating example
• Each day you get lunch at the cafeteria.
  – Your diet consists of fish, chips, and beer.
  – You get several portions of each.
• The cashier only tells you the total price of the meal.
  – After several days, you should be able to figure out the price of each portion.
• Each meal price gives a linear constraint on the prices of the portions:

price = x_fish·w_fish + x_chips·w_chips + x_beer·w_beer
![Page 7: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/7.jpg)
Two ways to solve the equations
• The obvious approach is just to solve a set of simultaneous linear equations, one per meal.
• But we want a method that could be implemented in a neural network.
• The prices of the portions are like the weights of a linear neuron.
• We will start with guesses for the weights and then adjust the guesses to give a better fit to the prices given by the cashier.
w = (w_fish, w_chips, w_beer)
![Page 8: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/8.jpg)
The cashier’s brain
[Figure: a linear neuron modeling the cashier. Inputs: 2 portions of fish, 5 portions of chips, 3 portions of beer; weights (the true prices): 150, 50, 100; price of meal = 850.]
![Page 9: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/9.jpg)
A model of the cashier's brain with arbitrary initial weights

[Figure: the same linear neuron with arbitrary initial weights of 50, 50, 50; with 2 portions of fish, 5 of chips, and 3 of beer, the predicted price of the meal is 500.]

• Residual error = 350
• The learning rule is:  Δwᵢ = ε xᵢ (y − ŷ)
• With a learning rate ε of 1/35, the weight changes are +20, +50, +30
• This gives new weights of 70, 100, 80
• Notice that the weight for chips got worse!
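These numbers can be checked with a short script (a sketch of the single update above; the variable names are mine):

```python
x = [2, 5, 3]                 # portions of fish, chips, beer
w = [50, 50, 50]              # arbitrary initial guesses for the prices
target = 850                  # price the cashier actually charged
epsilon = 1 / 35              # learning rate

y_hat = sum(xi * wi for xi, wi in zip(x, w))    # predicted price: 500
residual = target - y_hat                       # 350
deltas = [epsilon * xi * residual for xi in x]  # [20.0, 50.0, 30.0]
w = [wi + d for wi, d in zip(w, deltas)]        # [70.0, 100.0, 80.0]

print(y_hat, residual, deltas, w)
# The chips weight moves from 50 to 100, further from its true value of 50.
```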
![Page 10: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/10.jpg)
Behavior of the iterative learning procedure
• Do the updates to the weights always make them get closer to their correct values? No!
• Does the online version of the learning procedure eventually get the right answer? Yes, if the learning rate gradually decreases in the appropriate way.
• How quickly do the weights converge to their correct values? It can be very slow if two input dimensions are highly correlated (e.g. ketchup and chips).
• Can the iterative procedure be generalized to much more complicated, multi-layer, non-linear nets? YES!
![Page 11: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/11.jpg)
Deriving the delta rule
• Define the error as the squared residuals summed over all training cases:
• Now differentiate to get error derivatives for weights
• The batch delta rule changes the weights in proportion to their error derivatives summed over all training cases
E = ½ Σₙ (yⁿ − ŷⁿ)²

∂E/∂wᵢ = Σₙ (∂ŷⁿ/∂wᵢ)(∂E/∂ŷⁿ) = −Σₙ xᵢⁿ (yⁿ − ŷⁿ)

Δwᵢ = −ε ∂E/∂wᵢ = ε Σₙ xᵢⁿ (yⁿ − ŷⁿ)
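A minimal sketch of the resulting batch delta rule in plain Python (the meal data and the learning-rate/epoch settings are invented for illustration):

```python
def batch_delta_rule(X, y, epsilon=0.02, epochs=20000):
    """Gradient descent on E = 1/2 * sum_n (y_n - w.x_n)^2, summing over all training cases."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        # Accumulate the (negative) error derivatives over all training cases.
        delta = [0.0] * len(w)
        for x_n, y_n in zip(X, y):
            y_hat = sum(wi * xi for wi, xi in zip(w, x_n))
            for i, xi in enumerate(x_n):
                delta[i] += xi * (y_n - y_hat)
        # Change each weight in proportion to the summed derivative.
        w = [wi + epsilon * di for wi, di in zip(w, delta)]
    return w

# The cafeteria example: three meals are enough to pin down the three prices.
X = [[2, 5, 3], [1, 2, 1], [3, 1, 2]]
y = [850, 350, 700]
print(batch_delta_rule(X, y))   # converges to the true prices [150, 50, 100]
```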
![Page 12: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/12.jpg)
The error surface
• The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error. – For a linear neuron, it is a quadratic bowl. – Vertical cross-sections are parabolas. – Horizontal cross-sections are ellipses.
[Figure: the quadratic-bowl error surface E over two weights w1 and w2.]
![Page 13: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/13.jpg)
Online versus batch learning

• Batch learning does steepest descent on the error surface.
• Online learning zig-zags around the direction of steepest descent.

[Figure: left, batch learning moves perpendicular to the elliptical contours of the error surface in (w1, w2); right, online learning zig-zags between the constraint lines from training case 1 and training case 2.]
![Page 14: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/14.jpg)
Adding biases
• A linear neuron is a more flexible model if we include a bias.
• We can avoid having to figure out a separate learning rule for the bias by using a trick:
  – A bias is exactly equivalent to a weight on an extra input line that always has an activity of 1.

ŷ = b + Σᵢ xᵢ wᵢ

[Figure: a linear neuron with inputs 1, x1, x2 and weights b, w1, w2.]
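A tiny illustration of the trick (not from the slides): appending a constant 1 to the input turns the bias into an ordinary weight.

```python
def predict(x, w):
    """Linear neuron without an explicit bias term."""
    return sum(xi * wi for xi, wi in zip(x, w))

b, w1, w2 = 3.0, 0.5, -1.0
x = [2.0, 4.0]

with_bias = b + predict(x, [w1, w2])
with_trick = predict([1.0] + x, [b, w1, w2])   # bias becomes the weight on a constant input of 1
print(with_bias, with_trick)                   # both print 0.0
```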
![Page 15: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/15.jpg)
Preprocessing the input vectors
• Instead of trying to predict the answer directly from the raw inputs we could start by extracting a layer of “features”.
  – Sensible if we already know that certain combinations of input values would be useful.
  – The features are equivalent to a layer of hand-coded non-linear neurons.
• So far as the learning algorithm is concerned, the hand-coded features are the input.
![Page 16: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/16.jpg)
Is preprocessing cheating?

• It seems like cheating if the aim is to show how powerful learning is. The really hard bit is done by the preprocessing.
• It's not cheating if we learn the non-linear preprocessing.
  – This makes learning much more difficult and much more interesting.
• It's not cheating if we use a very big set of non-linear features that is task-independent.
  – Support Vector Machines make it possible to use a huge number of features without much computation or data.
![Page 17: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/17.jpg)
Statistical and ANN Terminology
• A perceptron model with a linear transfer function is equivalent to a possibly multiple or multivariate linear regression model [Weisberg 1985; Myers 1986].
• A perceptron model with a logistic transfer function is a logistic regression model [Hosmer and Lemeshow 1989].
• A perceptron model with a threshold transfer function is a linear discriminant function [Hand 1981; McLachlan 1992; Weiss and Kulikowski 1991]. An ADALINE is a linear two-group discriminant.
![Page 18: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/18.jpg)
Transfer functions
• The transfer function determines the output from the summed, weighted inputs of a neuron:  Oⱼ = f(Σᵢ wᵢⱼ xᵢ)
• It maps any real number into a range normally bounded by 0 to 1 or −1 to 1, i.e. a squashing function. The most common choices are sigmoid functions:

logistic:  f(x) = 1 / (1 + e⁻ˣ)

hyperbolic tangent:  f(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
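A short sketch of the two squashing functions in Python (illustrative only; math.tanh would do the same job as the explicit formula):

```python
import math

def logistic(x):
    """Logistic squashing function, range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def hyperbolic_tangent(x):
    """Hyperbolic tangent, range (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

for x in (-2.0, 0.0, 2.0):
    print(x, round(logistic(x), 3), round(hyperbolic_tangent(x), 3))
# logistic(0) = 0.5, tanh(0) = 0; both saturate for large |x|
```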
![Page 19: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/19.jpg)
Healthcare Applications of ANNs

• Predicting/confirming myocardial infarction (heart attack) from EKG output waves
  – Physicians had a diagnostic sensitivity and specificity of 73.3% and 81.1%, while ANNs achieved 96.0% and 96.0%
• Identifying dementia from EEG patterns: performed better than both Z statistics and discriminant analysis, and better than LDA (91.1% vs. 71.9%) in classifying Alzheimer's disease
• Papnet: a Pap smear screening system by Neuromedical Systems, in use and cleared by the US FDA
• Predicting mortality risk of preterm infants, screening tool in urology, etc.
![Page 20: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/20.jpg)
Classification Applications of ANNs
• Credit Card Fraud Detection: AMEX, Mellon Bank, Eurocard Nederland
• Optical Character Recognition (OCR): fax software
• Cursive Handwriting Recognition: Lexicus
• Petroleum Exploration: Arco & Texaco
• Loan Assessment: Chase Manhattan, for vetting commercial loans
• Bomb detection: SAIC
![Page 21: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/21.jpg)
Time Series Applications of ANNs
• Trading systems: Citibank London (FX).
• Portfolio selection and Management: LBS Capital Management (>US$1b), Deere & Co. pension fund (US$100m).
• Forecasting weather patterns & earthquakes.
• Speech technology: verification and generation.
• Medical: Predicting heart attacks from EKGs and mental illness from EEGs.
![Page 22: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/22.jpg)
Advantages of Using ANNs
• Works well with large sets of noisy data, in domains where experts are unavailable or there are no known rules.
• Simplicity of using it as a tool
• Universal approximator.
• Does not impose a structure on the data.
• Possible to extract rules.
• Ability to learn and adapt.
• Does not require an expert or a knowledge engineer.
• Well suited to non-linear type of problems.
• Fault tolerant
![Page 23: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/23.jpg)
Problem with the Perceptron
• Can only learn linearly separable tasks.
• Cannot solve any ‘interesting’ (linearly nonseparable) problems, e.g. the exclusive-or function (XOR), the simplest nonseparable function.

| X1 | X2 | Output |
|----|----|--------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
![Page 24: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/24.jpg)
The Fall of the Perceptron
Marvin Minsky & Seymour Papert (1969). Perceptrons, MIT Press, Cambridge, MA.
• Before long researchers had begun to discover the Perceptron’s limitations.
• Unless input categories were “linearly separable”, a perceptron could not learn to discriminate between them.
• Unfortunately, it appeared that many important categories were not linearly separable.
• E.g., those inputs to an XOR gate that give an output of 1 (namely 10 & 01) are not linearly separable from those that do not (00 & 11).
![Page 25: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/25.jpg)
The Fall of the Perceptron
[Figure: a 2-D plot with ‘Successful’ versus ‘Unsuccessful’ on one axis and ‘Many’ versus ‘Few’ hours in the gym per week on the other; footballers and academics sit in diagonally opposite corners.]
In this example, a perceptron would not be able to discriminate between the footballers and the academics…
…despite the simplicity of their relationship:
Academics = Successful XOR Gym
This failure caused the majority of researchers to walk away.
![Page 26: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/26.jpg)
The simple XOR example masks a deeper problem ...
[Figure: four shapes, numbered 1-4, differing only at their ends, which fall inside two dashed circles.]

Consider a perceptron classifying shapes as connected or disconnected, taking inputs from the dashed circles in panel 1.

In going from 1 to 2, the change at the right-hand end alone must be sufficient to change the classification (to push the linear sum up or down through 0).

Similarly, the change at the left-hand end alone must be sufficient to change the classification.

Therefore changing both ends must push the sum even further across the threshold.

The problem is that with a single layer of processing, local knowledge cannot be combined into global knowledge. So add more layers ...
![Page 27: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/27.jpg)
THE PERCEPTRON CONTROVERSY
There is no doubt that Minsky and Papert's book was a block to the funding of research in neural networks for more than ten years. The book was widely interpreted as showing that neural networks are basically limited and fatally flawed.
What IS controversial is whether Minsky and Papert shared and/or promoted this belief. Following the rebirth of interest in artificial neural networks, Minsky and Papert claimed that they had not intended such a broad interpretation of the conclusions they reached in the book Perceptrons.
However, Jianfeng was present at MIT in 1974, and reached a different conclusion on the basis of the internal reports circulating at MIT. What were Minsky and Papert actually saying to their colleagues in the period after the publication of their book?
![Page 28: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/28.jpg)
Minsky and Papert describe a neural network with a hidden layer as follows:
GAMBA PERCEPTRON: A number of linear threshold systems have their outputs connected to the inputs of a linear threshold system. Thus we have a linear threshold function of many linear threshold functions.
Minsky and Papert then state:
Virtually nothing is known about the computational capabilities of this latter kind of machine. We believe that it can do little more than can a low order perceptron. (This, in turn, would mean, roughly, that although they could recognize (sp) some relations between the points of a picture, they could not handle relations between such relations to any significant extent.) That we cannot understand mathematically the Gamba perceptron very well is, we feel, symptomatic of the early state of development of elementary computational theories.
![Page 29: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/29.jpg)
The connectivity of a perceptron
The input is recoded using hand-picked features that do not adapt.
Only the last layer of weights is learned.
The output units are binary threshold neurons and are learned independently.
[Figure: input units feed a layer of non-adaptive hand-coded features, which feed the output units.]
![Page 30: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/30.jpg)
Binary threshold neurons
• McCulloch-Pitts (1943)
  – First compute a weighted sum of the inputs from other neurons
  – Then output a 1 if the weighted sum exceeds the threshold

z = Σᵢ xᵢ wᵢ
y = 1 if z ≥ threshold, 0 otherwise

[Figure: the step function y versus z, jumping from 0 to 1 at the threshold.]
![Page 31: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/31.jpg)
The perceptron convergence procedure
• Add an extra component with value 1 to each input vector. The “bias” weight on this component is minus the threshold. Now we can forget the threshold.
• Pick training cases using any policy that ensures that every training case will keep getting picked.
  – If the output unit is correct, leave its weights alone.
  – If the output unit incorrectly outputs a zero, add the input vector to the weight vector.
  – If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.
• This is guaranteed to find a suitable set of weights if any such set exists.
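A minimal sketch of this procedure in Python (the cyclic presentation order, the epoch count, and the AND training cases are my own choices; the guarantee above applies because AND is linearly separable):

```python
def train_perceptron(cases, epochs=100):
    """Perceptron convergence procedure with 0/1 outputs and the bias trick."""
    n_weights = len(cases[0][0]) + 1              # extra input component with value 1 for the bias
    w = [0.0] * n_weights
    for _ in range(epochs):
        for x, target in cases:
            x = [1.0] + list(x)                   # the bias weight rides on this constant input
            output = 1 if sum(xi * wi for xi, wi in zip(x, w)) >= 0 else 0
            if output == target:
                continue                          # correct: leave the weights alone
            if target == 1:                       # incorrectly output a zero: add the input vector
                w = [wi + xi for wi, xi in zip(w, x)]
            else:                                 # incorrectly output a one: subtract the input vector
                w = [wi - xi for wi, xi in zip(w, x)]
    return w

# Logical AND is linearly separable, so a suitable weight vector exists and is found.
and_cases = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(and_cases))                # bias weight first, then the two input weights
```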
![Page 32: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/32.jpg)
Weight space
• Imagine a space in which each axis corresponds to a weight.
– A point in this space is a weight vector.
• Each training case defines a plane.
– On one side of the plane the output is wrong.
• To get all training cases right we need to find a point on the right side of all the planes.
[Figure: weight space, with the origin marked. An input vector defines a plane; weight vectors on one side (‘right’) give the correct output and those on the other side (‘wrong’) do not. Good weights lie on the right side of all such planes, bad weights do not.]
![Page 33: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/33.jpg)
Why the learning procedure works
• Consider the squared distance between any satisfactory weight vector and the current weight vector.
  – Every time the perceptron makes a mistake, the learning algorithm moves the current weight vector towards all satisfactory weight vectors (unless it crosses the constraint plane).
• So consider “generously satisfactory” weight vectors that lie within the feasible region by a margin at least as great as the largest update.
– Every time the perceptron makes a mistake, the squared distance to all of these weight vectors is always decreased by at least the squared length of the smallest update vector.
[Figure: the cone of satisfactory (‘right’) weight vectors, with the ‘generously satisfactory’ vectors lying inside it by at least a margin.]
![Page 34: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/34.jpg)
What perceptrons cannot do
• The binary threshold output units cannot even tell if two single bit numbers are the same!
Same: (1,1) → 1; (0,0) → 1
Different: (1,0) → 0; (0,1) → 0
• The following set of inequalities is impossible:
w₁ + w₂ ≥ θ,  0 ≥ θ   (from the “same” cases)
w₁ < θ,  w₂ < θ        (from the “different” cases)

(The first pair gives w₁ + w₂ ≥ θ with θ ≤ 0, while the second pair gives w₁ + w₂ < 2θ ≤ θ, a contradiction.)

Data space:

[Figure: the four points (0,0), (0,1), (1,0), (1,1) in data space, with the weight plane separating output = 1 from output = 0. The positive and negative cases cannot be separated by a plane.]
![Page 35: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/35.jpg)
What can perceptrons do?
• They can only solve tasks if the hand-coded features convert the original task into a linearly separable one. How difficult is this?
• The N-bit parity task:
  – Requires N features of the form: are at least m bits on?
  – Each feature must look at all the components of the input.
• The 2-D connectedness task:
  – Requires an exponential number of features!
![Page 36: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/36.jpg)
The N-bit even parity task

• There is a simple solution that requires N hidden units.
  – Each hidden unit computes whether more than M of the inputs are on.
  – This is a linearly separable problem.
• There are many variants of this solution.
  – It can be learned.
  – It generalizes well if …

[Figure: a network for 4-bit parity. The hidden units compute whether more than 0, 1, 2, or 3 input bits are on; their outputs feed the output unit with alternating weights −2, +2, −2, +2 and a bias of +1. The example input shown is 1 0 1 0.]
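A sketch in Python of this construction for any N (the alternating −2/+2 output weights and the +1 bias follow the 4-bit figure described above: for an odd count of ‘on’ bits the total drops to −1, for an even count it returns to +1):

```python
from itertools import product

def even_parity_net(bits):
    """Even-parity network: hidden unit k fires if more than k input bits are on."""
    n = len(bits)
    m = sum(bits)                                      # number of input bits that are on
    hidden = [1 if m > k else 0 for k in range(n)]     # hidden units: >0, >1, >2, ...
    out_weights = [-2 if k % 2 == 0 else 2 for k in range(n)]    # alternating -2, +2, -2, +2
    total = 1 + sum(c * h for c, h in zip(out_weights, hidden))  # +1 is the output unit's bias
    return 1 if total > 0 else 0                       # total is +1 for even counts, -1 for odd

for bits in product([0, 1], repeat=4):
    assert even_parity_net(bits) == 1 - sum(bits) % 2  # 1 exactly when the number of ones is even
print("even parity reproduced for all 4-bit inputs")
```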
![Page 37: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/37.jpg)
Why connectedness is hard to compute
• Even for simple line drawings, there are exponentially many cases.
• Removing one segment can break connectedness
– But this depends on the precise arrangement of the other pieces.
– Unlike parity, there are no simple summaries of the other pieces that tell us what will happen.
• Connectedness is easy to compute with an iterative algorithm.
– Start anywhere in the ink
– Propagate a marker
– See if all the ink gets marked.
![Page 38: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/38.jpg)
Distinguishing T from C in any orientation and position
• What kind of features are required to distinguish two different patterns of 5 pixels independent of position and orientation?
  – Do we need to replicate T and C templates across all positions and orientations?
  – Looking at pairs of pixels will not work.
  – Looking at triples will work if we assume that each input image only contains one object.
• Replicate the following two feature detectors in all positions:

[Figure: two small feature detectors built from + and − pixel weights.]

• If any of these equal their threshold of 2, it's a C. If not, it's a T.
![Page 39: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/39.jpg)
Beyond perceptrons
• Need to learn the features, not just how to weight them to make a decision. This is a much harder task.
– We may need to abandon guarantees of finding optimal solutions.
• Need to make use of recurrent connections, especially for modeling sequences.
– The network needs a memory (in the activities) for events that happened some time ago, and we cannot easily put an upper bound on this time.
• Engineers call this an “Infinite Impulse Response” system.
– Long-term temporal regularities are hard to learn.
• Need to learn representations without a teacher.
– This makes it much harder to define what the goal of learning is.
![Page 40: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/40.jpg)
Beyond perceptrons
• Need to learn complex hierarchical representations for structures like: “John was annoyed that Mary disliked Bill.”
  – We need to apply the same computational apparatus to the embedded sentence as to the whole sentence.
• This is hard if we are using special purpose hardware in which activities of hardware units are the representations and connections between hardware units are the program.
• We must somehow traverse deep hierarchies using fixed hardware and sharing knowledge between levels.
![Page 41: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/41.jpg)
Sequential Perception
• We need to attend to one part of the sensory input at a time.
– We only have high resolution in a tiny region.
– Vision is a very sequential process (but the scale varies)
– We do not do high-level processing of most of the visual input (lack of motion tells us nothing has changed).
– Segmentation and the sequential organization of sensory processing are often ignored by neural models.
• Segmentation is a very difficult problem
– Segmenting a figure from its background seems very easy because we are so good at it, but it's actually very hard.
– Contours sometimes have imperceptible contrast, but we still perceive them.
– Segmentation often requires a lot of top-down knowledge.
![Page 42: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/42.jpg)
Fisher Linear Discrimination
• Reduce the problem from multi-dimensional to one-dimensional:
  – Let ‘v’ be a vector in our space
  – Project the data on the vector ‘v’
  – Estimate the ‘scatter’ of the data as projected on ‘v’
  – Use this ‘v’ to create a classifier
![Page 43: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/43.jpg)
Fisher Linear Discrimination
• Suppose we are in a 2-D space.
• Which of the three vectors is an optimal ‘v’?
![Page 44: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/44.jpg)
Fisher Linear Discrimination
• The optimal vector maximizes the ratio of the between-group sum of squares to the within-group sum of squares, denoted

J(v) = vᵀBv / vᵀWv

where B is the between-group scatter matrix and W is the within-group scatter matrix.
![Page 45: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/45.jpg)
Fisher Linear Discrimination
Suppose the case of two classes, with samples Xᵢ (i = 1, 2) of size nᵢ, projected onto ‘v’ to give Yᵢ:

• Mean of each class's samples:  mᵢ = (1/nᵢ) Σ_{x∈Xᵢ} x
• Mean of the projected samples:  m̃ᵢ = (1/nᵢ) Σ_{y∈Yᵢ} y = (1/nᵢ) Σ_{x∈Xᵢ} vᵀx = vᵀmᵢ
• ‘Scatter’ of the projected samples:  s̃ᵢ² = Σ_{y∈Yᵢ} (y − m̃ᵢ)²
• Criterion function:  J(v) = (m̃₁ − m̃₂)² / (s̃₁² + s̃₂²)
![Page 46: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/46.jpg)
Fisher Linear Discrimination
• The criterion function should be maximized.
• Express J as a function of the vector ‘v’:

Wᵢ = Σ_{x∈Xᵢ} (x − mᵢ)(x − mᵢ)ᵀ,   W = W₁ + W₂

s̃ᵢ² = Σ_{x∈Xᵢ} (vᵀx − vᵀmᵢ)² = Σ_{x∈Xᵢ} vᵀ(x − mᵢ)(x − mᵢ)ᵀ v = vᵀWᵢv,  so  s̃₁² + s̃₂² = vᵀWv

B = (m₁ − m₂)(m₁ − m₂)ᵀ

(m̃₁ − m̃₂)² = (vᵀm₁ − vᵀm₂)² = vᵀ(m₁ − m₂)(m₁ − m₂)ᵀ v = vᵀBv

J(v) = vᵀBv / vᵀWv
![Page 47: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/47.jpg)
Fisher Linear Discrimination
• The matrix version of the criterion works the same for more than two classes
• J(v) is maximized when Bv = λWv, i.e. when v is a generalized eigenvector of (B, W) with the largest eigenvalue λ.
![Page 48: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/48.jpg)
Fisher Linear Discrimination
Classification of a new observation ‘x’:
• Let the class of ‘x’ be the class whose mean vector is closest to ‘x’ in terms of the discriminant variables.
• In other words, the class whose mean vector's projection on ‘v’ is closest to the projection of ‘x’ on ‘v’.
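A numerical sketch of the whole two-class procedure with NumPy (the data, seed, and function names are invented; for two classes the solution of Bv = λWv reduces to the closed form v ∝ W⁻¹(m₁ − m₂) used here):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher discriminant direction, v proportional to W^{-1}(m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter W = W1 + W2
    v = np.linalg.solve(W, m1 - m2)                          # solves W v = (m1 - m2)
    return v / np.linalg.norm(v)

def classify(x, v, m1, m2):
    """Assign x to the class whose projected mean is closest to the projection of x on v."""
    p = v @ x
    return 1 if abs(p - v @ m1) <= abs(p - v @ m2) else 2

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))     # invented class-1 samples
X2 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))     # invented class-2 samples
v = fisher_direction(X1, X2)
print(v, classify(np.array([2.5, 3.0]), v, X1.mean(axis=0), X2.mean(axis=0)))   # expect class 2
```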
![Page 49: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/49.jpg)
A Regularized Fisher LDA
• Sw can be singular (inaccurate inversion)
• Regularization: Sw_reg = Sw + λ·I
  – (as in ridge regression)
• Add to the diagonal (the variances) a compensation for the noise, λ
• Choosing λ as a percentile of the eigenvalues of Sw
• “λ-FLDA”
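Continuing the sketch above, the regularization is a one-line change (λ = 0.1 is an arbitrary illustrative value; the slide's percentile-of-eigenvalues rule is one way to choose it):

```python
import numpy as np

def regularized_fisher_direction(X1, X2, lam=0.1):
    """λ-FLDA: shift the within-class scatter by λI so that it is safely invertible."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    W_reg = W + lam * np.eye(W.shape[0])      # Sw_reg = Sw + λI, the ridge-style compensation
    v = np.linalg.solve(W_reg, m1 - m2)
    return v / np.linalg.norm(v)
```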
![Page 50: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/50.jpg)
Linear regression
[Figure: example regression data, temperature plotted against one input variable and against two.]

Given examples (xᵢ, yᵢ), i = 1, …, n, predict the output ŷ(x) for a new point x.
![Page 51: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/51.jpg)
Linear regression

[Figure: the same data with a fitted line (one input) and a fitted plane (two inputs); the value of the fitted function at a new point is the prediction.]
![Page 52: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/52.jpg)
Ordinary Least Squares (OLS)
[Figure: a fitted line; the vertical gap between each observation and its prediction is the error or “residual”.]

Sum squared error:  E(w) = Σᵢ (yᵢ − ŷ(xᵢ))²
![Page 53: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/53.jpg)
Minimize the sum squared error
• Sum squared error:  E(w) = Σᵢ (yᵢ − wᵀxᵢ)²
• Setting the derivative with respect to w to zero gives a linear equation in w.
• Stacking all the examples gives a linear system, the normal equations:  XᵀX w = Xᵀy
![Page 54: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/54.jpg)
Alternative derivation
• Write the n examples, each of dimension d, as an n×d matrix X and a target vector y; minimizing ‖Xw − y‖² leads to the same normal equations XᵀX w = Xᵀy.
• Solve the system (it's better not to invert the matrix).
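A short NumPy sketch of solving the system (the data are invented; lstsq solves the least-squares problem without explicitly inverting XᵀX):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])   # first column of ones: the bias trick
true_w = np.array([2.0, -1.0, 0.5])                               # invented "true" weights
y = X @ true_w + 0.1 * rng.normal(size=n)                         # noisy linear data

w, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes ||Xw - y||^2 without forming an explicit inverse
print(w)                                    # close to [2.0, -1.0, 0.5]
```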
![Page 55: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/55.jpg)
LMS Algorithm (Least Mean Squares)

• Online algorithm: after seeing each example (xₜ, yₜ), update w ← w + η (yₜ − wᵀxₜ) xₜ, where η is a small learning rate.
• This is the delta rule from earlier, applied one training case at a time.
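And a sketch of the online update, under the same invented data-generating assumptions (with a fixed η the weights hover around the solution; a decaying η would converge exactly, as noted earlier):

```python
import numpy as np

rng = np.random.default_rng(2)
true_w = np.array([2.0, -1.0, 0.5])          # same invented weights as above
w = np.zeros(3)
eta = 0.05                                   # small fixed learning rate

for _ in range(5000):
    x = np.append(1.0, rng.normal(size=2))   # bias input plus two features
    y = true_w @ x + 0.1 * rng.normal()      # one freshly generated training case
    w += eta * (y - w @ x) * x               # LMS update: error times input
print(w)                                     # hovers near [2.0, -1.0, 0.5]
```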
![Page 56: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/56.jpg)
Beyond lines and planes
• Everything is the same if we replace the raw input by a vector of fixed basis functions, e.g. φ(x) = (1, x, x², …).
• The model ŷ = wᵀφ(x) can fit curves, yet it is still linear in the weights w.

[Figure: a curve fitted to the one-dimensional data.]
![Page 57: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/57.jpg)
Geometric interpretation
[Matlab demo]
[Figure: 3-D plot illustrating the geometric interpretation of least squares.]
![Page 58: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/58.jpg)
Ordinary Least Squares [summary]
• Given examples (xᵢ, yᵢ), i = 1, …, n, with xᵢ ∈ ℝᵈ.
• Let X be the n×d matrix whose rows are the xᵢ, and y the vector of targets.
• Minimize ‖Xw − y‖² by solving the normal equations XᵀX w = Xᵀy.
• Predict ŷ = wᵀx for a new point x.
![Page 59: Learning with linear neurons](https://reader036.fdocuments.us/reader036/viewer/2022062805/56814de0550346895dbb4d60/html5/thumbnails/59.jpg)
Probabilistic interpretation
[Figure: the observations scattered around the fitted line; the likelihood of the data under the model.]