Perceptron: from Minsky & Papert (1969)
[Page 1]

Perceptron: from Minsky & Papert (1969)
Retina with input pattern.
Non-adaptive local feature detectors (preprocessing).
Trainable evidence weigher.
Strictly bottom-up feature processing.
[Page 2]

Teacher: correct output for trial 1 is 1; correct output for trial 2 is 1; correct output for trial 3 is 0; ... ; correct output for trial n is 0.
"Supervised" learning training paradigm.
Call the correct answers the desired outputs and represent them by the symbol "d".
Call the outputs from the network the actual outputs and represent them by the symbol "a".
Adjust weights to reduce error.
[Page 3]

From a neuron to a linear threshold unit.
[Page 4]

Trivial Data Set
No generalization needed.
Not linearly separable.
Odd-bit parity.
[Page 5]

Perceptron Learning Rule
[Page 6]

More Compact Form
[Page 7]

An Alternative Representation
[Page 8]

Learning the AND function
Perceptron learning algorithm
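The perceptron learning algorithm on the AND data can be sketched as follows. This is a minimal version, assuming a step unit with the threshold folded into a bias weight on a constant input, zero initial weights, and a learning rate of 0.1; these constants are illustrative, not from the slides.

```python
# Perceptron learning rule: w += eta * (d - a) * x, where d is the desired
# output and a the actual output, as defined earlier in the deck.

def step(z):
    return 1 if z > 0 else 0

def train_perceptron(examples, epochs=20, eta=0.1):
    w = [0.0, 0.0, 0.0]               # bias, w1, w2
    for _ in range(epochs):
        for x, d in examples:
            a = step(w[0] + w[1] * x[0] + w[2] * x[1])
            for i, xi in enumerate((1, x[0], x[1])):
                w[i] += eta * (d - a) * xi
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND)
print([step(w[0] + w[1] * x[0] + w[2] * x[1]) for x, _ in AND])  # [0, 0, 0, 1]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees the loop settles on a correct weight vector.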
[Page 9]

Linear Separability
Perceptron with 2 inputs and a threshold.
Solve for y: the result is the equation of a straight line. The adjustable weights control the slope and intercept.
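The "solve for y" step can be reconstructed from the two-input unit: on the decision boundary the weighted sum equals the threshold, and solving for the second input gives a straight line.

```latex
w_1 x + w_2 y = \theta
\quad\Longrightarrow\quad
y = -\frac{w_1}{w_2}\,x + \frac{\theta}{w_2} \qquad (w_2 \neq 0)
```

This is a line whose slope $-w_1/w_2$ and intercept $\theta/w_2$ are controlled by the adjustable weights.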
[Page 10]

AND vs XOR
Try to draw a straight line separating the positive from the negative instances.
Geometric illustration of linear separability (or the lack of it).
[Page 11]

Algebraic proof that XOR is not linearly separable
[Page 12]

Algebraic proof that XOR is not linearly separable
Greater than zero.
Impossible, given above.
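The argument can be reconstructed as follows, assuming a unit that outputs 1 when $w_1 x_1 + w_2 x_2 > \theta$. XOR requires:

```latex
\begin{aligned}
(0,0)\mapsto 0 &\;\Rightarrow\; 0 \le \theta\\
(1,0)\mapsto 1 &\;\Rightarrow\; w_1 > \theta\\
(0,1)\mapsto 1 &\;\Rightarrow\; w_2 > \theta\\
(1,1)\mapsto 0 &\;\Rightarrow\; w_1 + w_2 \le \theta
\end{aligned}
```

Adding the middle two inequalities gives $w_1 + w_2 > 2\theta$, and since $\theta \ge 0$ this exceeds $\theta$, contradicting the last constraint. So no weights and threshold exist.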
[Page 13]

Thresholds can be represented as weights.
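The trick on this slide can be written out explicitly:

```latex
\sum_{i=1}^{n} w_i x_i > \theta
\quad\Longleftrightarrow\quad
\sum_{i=0}^{n} w_i x_i > 0,
\qquad \text{where } w_0 = -\theta,\; x_0 = 1
```

The threshold becomes an ordinary weight $w_0$ on a constant input $x_0 = 1$, so learning can adjust it like any other weight.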
[Page 14]

Linear component of the LTU: y(x) = wᵀx + w₀.
At the decision boundary: y(x) = 0.
The separating plane is also called the decision boundary.
Linear discriminant function.
[Page 15]

Properties of the linear decision boundary
1. The decision boundary is orthogonal to the weight vector.
2. We will give a formula for the distance of the decision boundary to the origin.
[Page 16]

Geometry of the linear decision boundary (points in x space)
Let x_A and x_B be two points on the hyperplane.
Then y(x_A) = 0 = y(x_B).
Therefore: wᵀ(x_B − x_A) = 0.
Bishop
[Page 17]

x_A − x_B is parallel to the decision boundary.
wᵀ(x_B − x_A) = 0 implies the weight vector is perpendicular to the decision boundary.
Interpretation of vector subtraction.
[Page 18]

Distance to the hyperplane:
l = ‖x‖ cos θ = ‖x‖ · wᵀx / (‖x‖‖w‖) = −w₀ / ‖w‖
(using wᵀx = −w₀ for x on the boundary).
[Page 19]

SVM Kernel
Russell and Norvig
Projecting input features to a higher-dimensional space.
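A tiny sketch of the projection idea, in the spirit of the SVM kernel: XOR is not linearly separable in (x1, x2), but adding an illustrative product feature x1·x2 (an assumption for this example, not from the slide) makes it linear, since f = x1 + x2 − 2·x1·x2 computes XOR exactly.

```python
# Lift 2-D inputs into a 3-D feature space where XOR becomes linear.

def phi(x1, x2):
    return (x1, x2, x1 * x2)          # higher-dimensional feature map

def linear_xor(x1, x2):
    w = (1, 1, -2)                    # linear weights in the lifted space
    z = sum(wi * fi for wi, fi in zip(w, phi(x1, x2)))
    return 1 if z > 0.5 else 0        # threshold at 0.5

print([linear_xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```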
[Page 20]

Perceptron: from Minsky & Papert (1969)
Retina with input pattern.
Non-adaptive local feature detectors (preprocessing).
Trainable evidence weigher.
Strictly bottom-up feature processing.
[Page 21]

Diameter-limited Perceptron
[Page 22]

Digits made of seven line segments.
The positive target is the exact digit pattern.
The negative target is all other possible segment configurations.
Winston
[Page 23]

Digit Segment Perceptron
Ten perceptrons are required, one per digit.
Winston
[Page 24]

Perform a feedforward sweep to compute the output values below.
The input is constrained so that exactly 2 of the 6 inputs are 1.
The output units indicate whether the inputs are acquaintances or siblings.
Perceptron learning does not generalize to multilayer networks.
Winston
[Page 25]

An early model of an error signal: 1960s
[Page 26]

Linear combiner element.
Descriptive complexity is low.
[Page 27]

A very simple learning task
• Consider a neural network with two layers of neurons.
  – Neurons in the top layer represent known shapes.
  – Neurons in the bottom layer represent pixel intensities.
• A pixel gets to vote if it has ink on it.
  – Each inked pixel can vote for several different shapes.
• The shape that gets the most votes wins.
0 1 2 3 4 5 6 7 8 9
Hinton
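The voting scheme above can be sketched in a few lines. The 3×3 bitmap templates and images here are invented purely for illustration; real digit templates would of course be larger.

```python
# Each inked pixel votes for every shape whose template also inks that
# pixel; the shape with the most votes wins (a rigid template match).

TEMPLATES = {
    "1": [0, 1, 0,
          0, 1, 0,
          0, 1, 0],
    "7": [1, 1, 1,
          0, 0, 1,
          0, 0, 1],
}

def classify(image):
    votes = {shape: sum(i & t for i, t in zip(image, tmpl))
             for shape, tmpl in TEMPLATES.items()}
    return max(votes, key=votes.get)   # single winner in the top layer

print(classify([0, 1, 0, 0, 1, 0, 0, 1, 0]))  # 1
```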
[Page 28]

Why the simple system does not work
• A two-layer network with a single winner in the top layer is equivalent to having a rigid template for each shape. The winner is the template that has the biggest overlap with the ink.
• The ways in which shapes vary are much too complicated to be captured by simple template matches of whole shapes. To capture all the allowable variations of a shape we need to learn the features that it is composed of.
Hinton
[Page 29]

Examples of handwritten digits from a test set.
[Page 30]

General layered feedforward network
There can be varying numbers of units in the different layers.
A feedforward network does not have to be layered: any connection topology without recurrent connections is a feedforward network.
[Page 31]

Summary of Architecture for Backprop
Input vector → hidden layers → outputs.
Compare the outputs with the correct answer to get an error signal.
Back-propagate the error signal to get derivatives for learning.
Hinton
[Page 32]

Matrix representations of networks
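A layered feedforward sweep in matrix form can be sketched as below. The layer sizes (2 inputs, 3 tanh hidden units, 1 linear output) and random small weights are assumptions for illustration only.

```python
# Each layer is one matrix-vector product followed by a nonlinearity.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 2))   # hidden-by-input weight matrix
b1 = np.zeros(3)
W2 = rng.normal(scale=0.1, size=(1, 3))   # output-by-hidden weight matrix
b2 = np.zeros(1)

def forward(x):
    h = np.tanh(W1 @ x + b1)              # hidden activations
    return W2 @ h + b2                    # linear output unit

print(forward(np.array([1.0, 0.0])).shape)  # (1,)
```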
[Page 33]

Need for supervised training
We cannot define the hand-printed character "A"; we can only present examples and hope that the network generalizes.
The fact that this happens was a breakthrough in pattern recognition.
[Page 34]

Sigmoid plots: the sigmoid and its derivative.
Logistic sigmoid equation: σ(x) = 1 / (1 + e^(−x)).
It gives the neuron a graded output value between 0 and 1.
[Page 35]

Simple transformations on functions
[Page 36]

Odd and Even Functions
Even: f(x) = f(−x). Flip about the y-axis. Examples: cos, cosh.
Odd: f(x) = −f(−x). Flip about the y-axis and flip about the x-axis. Examples: sin, sinh.
[Page 37]

sinh, cosh, and tanh
[Page 38]

Relation between logsig and tanh
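The standard relation, checked numerically: tanh(x/2) = 2·logsig(x) − 1, or equivalently logsig(x) = (1 + tanh(x/2)) / 2, so the two squashing functions differ only by scaling and shifting.

```python
# Verify the logsig/tanh identity at several points.
import math

def logsig(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    assert abs(math.tanh(x / 2) - (2 * logsig(x) - 1)) < 1e-12
print("identity holds")
```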
[Page 39]

Exact formula for setting weights, assuming:
1. Two classes.
2. Gaussian distributed.
3. Equal variances.
4. Naïve Bayes.
[Page 40]

Gaussian Distribution
[Page 41]

Fitting a bivariate Gaussian to 2D data
The data are presented as a scatter plot.
From: Florian Hahne
[Page 42]

Bivariate Gaussian
Optimal decision boundary.
[Page 43]

Exact formula for setting weights
Bishop, 1995. However, in most applications, we must train the weights.
[Page 44]

Derivatives of sigmoidal functions:
tanh′(x) = 1 − tanh²(x)
logsig′(x) = logsig(x) · (1 − logsig(x))
If you know the value of a function at a particular point, you can quickly compute the derivative of the function at that point.
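The shortcut can be checked against a finite difference: once g(x) is known, g′(x) costs no extra function evaluations. The test point x = 0.7 and step 1e-6 are arbitrary choices.

```python
# Derivative-from-value shortcuts for logsig and tanh, verified numerically.
import math

def logsig(x):
    return 1.0 / (1.0 + math.exp(-x))

x, eps = 0.7, 1e-6
s = logsig(x)
analytic = s * (1 - s)                              # logsig'(x) from logsig(x)
numeric = (logsig(x + eps) - logsig(x - eps)) / (2 * eps)
assert abs(analytic - numeric) < 1e-8

t = math.tanh(x)                                    # tanh'(x) from tanh(x)
assert abs((1 - t * t) - (math.tanh(x + eps) - math.tanh(x - eps)) / (2 * eps)) < 1e-8
print("derivative shortcuts verified")
```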
[Page 45]

Error Signal (Cost function)
Sum of squares: for regression.
Cross entropy: for classification:
E = −Σ_{t=1}^{S} Σ_{k=1}^{c} d_k(t) ln o_k(t)
where c is the number of classes, S the number of cases, d the desired outputs, and o the actual outputs.
[Page 46]

Error Measure for Linear Unit
This will be the cost function.
[Page 47]

Instantaneous slope (i.e., the derivative)
Slope: how fast a line is increasing.
f′(x) ≈ [f(x + Δx) − f(x)] / Δx
(Figure: the x-axis marks x and x + Δx; the y-axis marks f(x) and f(x + Δx).)
[Page 48]

Principle of gradient descent
The goal is to find the minimum of the cost function. Suppose the estimate is x = 1. Then the derivative is > 0, so update the estimate in the negative direction.
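The principle as a one-variable sketch: cost f(x) = x², starting at x = 1 where the derivative 2x is positive, so each step moves in the negative direction. The step size 0.1 is an assumed value.

```python
# 1-D gradient descent on f(x) = x**2, minimum at x = 0.

def f_prime(x):
    return 2.0 * x          # derivative of f(x) = x**2

x, eta = 1.0, 0.1
for _ in range(100):
    x -= eta * f_prime(x)   # step against the gradient

print(abs(x) < 1e-6)        # True: converged to the minimum
```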
[Page 49]

How the derivative varies with the error function
Minimum of the cost function.
Positive gradient.
Negative gradient.
[Page 50]

Deriving a learning rule that trains the weights of a single linear unit
Weight update rule from gradient descent.
Derivative of the cost function with respect to weight w_1.
t denotes the test case or trial; S denotes the number of cases in a batch.
[Page 51]

Simple example: learn AND with a linear unit.
Since there are only two weights, it is possible to visualize the error surface.
Number of cases in a batch = 4.
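Batch gradient descent for a single linear unit (two weights, no bias) learning AND can be sketched as below. With sum-of-squares error over the 4 cases, the optimum is w1 = w2 = 1/3. The learning rate 0.05 and iteration count are assumptions.

```python
# Batch delta rule for one linear unit: accumulate (d - o) * x over the
# batch, then update the weights.

cases = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1 = w2 = 0.0
eta = 0.05
for _ in range(2000):
    g1 = g2 = 0.0
    for (x1, x2), d in cases:
        o = w1 * x1 + w2 * x2     # actual output of the linear unit
        g1 += (d - o) * x1        # negative gradient of the batch error
        g2 += (d - o) * x2
    w1 += eta * g1
    w2 += eta * g2

print(round(w1, 3), round(w2, 3))  # 0.333 0.333
```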
[Page 52]

Composite error surface for a linear unit learning AND
The optimal weight values are at the top of the surface (w1 = w2 = 1/3).
In this case the surface is quadratic; it is flipped upside down so that it is easier to see.
[Page 53]

Error surface for each of the 4 cases.
The previous error surface is obtained by adding these together.
[Page 54]

Adding a Logistic Sigmoid Unit
Formula for the sigmoidal unit g(x), and the derivative of the sigmoidal unit.
[Page 55]

Derivative of the cost function for a linear unit with sigmoid
Apply the chain rule, using the derivative of the sigmoid.
[Page 56]

Error surface with a sigmoidal unit
Note the plateau.
Update rule for a weight in the output layer
New form: backpropagation of errors.
A function of h, which is defined on the next slide.
[Page 58]

New form (adapted from Bishop): backpropagation of errors.
For the output layer: δ_n = (d_n − o_n) g′(h_n)
For hidden layers: δ_j = g′(h_j) Σ_k w_kj δ_k
Weight update, at an arbitrary location in the network: Δw_ji = η o_i δ_j
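A minimal numerical check of these recursions. The layer sizes (2 inputs, 2 sigmoid hidden units, 1 sigmoid output), input, target, and weights are invented for illustration: for cost E = ½(d − oₙ)², the δ quantities reproduce the gradient obtained by a finite difference, since dE/dw_ji = −δ_j o_i.

```python
# Gradient check for the backprop deltas on a tiny 2-2-1 sigmoid network.
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.8, -0.3]                       # input
W = [[0.2, -0.1], [0.4, 0.3]]         # hidden weights w_ji
V = [0.5, -0.2]                       # output weights
d = 1.0                               # desired output

def forward(W, V):
    h = [sum(W[j][i] * x[i] for i in range(2)) for j in range(2)]
    o = [sig(hj) for hj in h]
    hn = sum(V[j] * o[j] for j in range(2))
    return h, o, hn, sig(hn)

h, o, hn, on = forward(W, V)

# delta_n = (d - o_n) g'(h_n), with g'(h) = o(1 - o) for the sigmoid
delta_n = (d - on) * on * (1 - on)
# delta_j = g'(h_j) * sum_k w_kj delta_k (one output unit here)
delta = [o[j] * (1 - o[j]) * V[j] * delta_n for j in range(2)]

# Compare -delta_1 * x_0 with a central finite difference of E wrt W[1][0].
eps = 1e-6
W[1][0] += eps; on_p = forward(W, V)[3]
W[1][0] -= 2 * eps; on_m = forward(W, V)[3]
W[1][0] += eps
numeric = (0.5 * (d - on_p) ** 2 - 0.5 * (d - on_m) ** 2) / (2 * eps)
print(abs(numeric - (-delta[1] * x[0])) < 1e-8)  # True
```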
[Page 59]

δ_n: output layer. The factor g′(h_n) depends on the output unit:
1. If sigmoid, then o(1 − o).
2. If tanh, then 1 − o².
3. If linear, then 1.
[Page 60]

Initializing weights
1. Most methods set the weights to randomly chosen small values.
2. Random values are chosen to avoid symmetries in the network.
3. Small values are chosen to avoid the saturated regions of the sigmoid, where the slope is near zero.
[Page 61]

Time complexity of backprop
The feedforward sweep is O(w), and the backprop sweep is also O(w), where w is the number of weights.
[Page 62]

Basic Autoencoder
Ashish Gupta
[Page 63]

Sparse Autoencoder
The number of hidden units is greater than the number of input units.
The cost function is chosen to make most of the hidden units inactive.
Cost function for the sparse AE: sum-of-squares error, plus a term minimizing weight magnitude, plus a term minimizing the number of active units.
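One common way to write the three terms just listed (the symbols λ, β, ρ and the KL form of the sparsity penalty are conventional choices, not taken from the slide):

```latex
J(W) \;=\; \sum_{t} \bigl\lVert x(t) - \hat{x}(t) \bigr\rVert^2
\;+\; \lambda \lVert W \rVert^2
\;+\; \beta \sum_{j} \mathrm{KL}\!\left(\rho \,\middle\Vert\, \hat{\rho}_j\right)
```

Here $\hat{\rho}_j$ is the average activation of hidden unit $j$ over the training set and $\rho$ is a small target activation, so the last term pushes most hidden units toward inactivity.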
[Page 64]

Limited-memory BFGS
A limitation of gradient descent is that it assumes the direction of maximum change in the gradient is a good direction for the update. A better choice maximizes the ratio of the gradient to the curvature of the error surface. Newton's method takes curvature into account: the curvature is given by the Hessian matrix H, but Newton's method requires inverting H, which is infeasible for large problems. BFGS approximates the inverse of H without actually inverting the Hessian; thus it is a quasi-Newton method. (Broyden, Fletcher, Goldfarb, Shanno.)
[Page 65]

Purpose of the sparse autoencoder
Discover an overcomplete basis set.
Use the basis set as a set of input features to a learning algorithm.
One can think of this as a form of highly sophisticated preprocessing.
[Page 66]

Summary of Architecture for Backprop
Input vector → hidden layers → outputs.
Compare the outputs with the correct answer to get an error signal.
Back-propagate the error signal to get derivatives for learning.
Hinton
[Page 67]

Training Procedure: make sure lab results apply to the application domain.
If you have enough data, split it into: 1. a training set; 2. a testing set; 3. a validation set.
Otherwise: ten-fold cross-validation, which allows use of all the data for training.
[Page 68]

Curse of Dimensionality
(Figure: spaces of dimension D = 1 with axis x1; D = 2 with axes x1, x2; D = 3 with axes x1, x2, x3.)
The amount of data needed to sample a high-dimensional space grows exponentially with D.
[Page 69]

A feature preprocessing example: distinguishing mines from rocks in sonar returns.
There exists a set of weights that solves the problem; gradient descent searches for the desired weights.
[Page 70]

NETtalk – Sejnowski and Rosenberg (1987)
26 outputs coding phonemes.
203-bit binary input vector: 29 bits for each of 7 characters (including punctuation), in a 1-out-of-29 code.
80 hidden units.
Trained on 1024 words; capable of intelligible speech after 10 training epochs, 95% training accuracy after 50 epochs, and 78% generalization accuracy.
[Page 71]

Limitations of back-propagation
• It requires labeled training data. Almost all data in an applied setting is unlabeled.
• The learning time does not scale well: it is very slow in networks with multiple hidden layers.
• It can get stuck in poor local optima. These are often quite good, but for deep networks they are far from optimal.
Hinton
[Page 72]

Synaptic noise sources:
1. Probability of vesicular release.
2. Magnitude of the response in case of vesicular release.
[Page 73]

Possible biological source of a quasi-global difference-of-reward signal: the mesolimbic system or the mesocortical system.
The error is broadcast to every weight in the network.
[Page 74]

A spectrum of machine learning tasks: typical statistics vs. artificial intelligence.
Typical statistics:
• Low-dimensional data (e.g., fewer than 100 dimensions).
• Lots of noise in the data.
• There is not much structure in the data, and what structure there is can be represented by a fairly simple model.
• The main problem is distinguishing true structure from noise.
Artificial intelligence:
• High-dimensional data (e.g., more than 100 dimensions).
• The noise is not sufficient to obscure the structure in the data if we process it right.
• There is a huge amount of structure in the data, but the structure is too complicated to be represented by a simple model.
• The main problem is figuring out how to represent the complicated structure in a way that can be learned.
Hinton
[Page 75]

Generative Model for Classification
Definition:
1. Model the class-conditional densities p(x | C_k).
2. Model the class priors p(C_k).
3. Compute the posterior p(C_k | x) using Bayes' theorem.
Motivation:
1. It links ML to probability theory, the universal language for modeling uncertainty and evaluating evidence.
2. It links neural networks to ML.
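The three steps can be sketched for two 1-D Gaussian classes with equal variance (the setting of the earlier "exact formula" slide). The means, variance, and priors are invented for illustration; with equal priors and variances, the decision boundary sits midway between the means, where the posterior is exactly 1/2.

```python
# Generative classification: model p(x|C_k) and p(C_k), then apply Bayes.
import math

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu = {"C1": 0.0, "C2": 2.0}           # class-conditional means
var = 1.0                             # shared (equal) variance
prior = {"C1": 0.5, "C2": 0.5}        # class priors

def posterior(x):
    # Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x)
    joint = {k: gauss(x, mu[k], var) * prior[k] for k in mu}
    z = sum(joint.values())
    return {k: v / z for k, v in joint.items()}

p = posterior(1.0)                    # midpoint between the two means
print(round(p["C1"], 6), round(p["C2"], 6))  # 0.5 0.5
```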
[Page 76]

Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by its slef, but the wrod as a wlohe.
(The jumbled spelling is the point of the example: word interiors are scrambled, yet the text stays readable.)
Knowledge structuring issues and information sources
[Page 77]

Top-down or Context Effects
Semantics of a structured pattern.
Kanizsa figures.