The Perceptron Algorithm
Supervised Learning: The Setup
Machine Learning, Fall 2017 / Spring 2018
The slides are mainly from Vivek Srikumar
Outline
• The Perceptron Algorithm
• Perceptron Mistake Bound
• Variants of Perceptron
Where are we?
• The Perceptron Algorithm
• Perceptron Mistake Bound
• Variants of Perceptron
Recall: Linear Classifiers
• Input is an n-dimensional vector x
• Output is a label y ∈ {-1, 1}
• Linear Threshold Units (LTUs) classify an example x using the following classification rule:
– Output = sgn(wTx + b) = sgn(b + ∑i wi xi)
– wTx + b ≥ 0 → Predict y = 1
– wTx + b < 0 → Predict y = -1
b is called the bias term
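The classification rule above can be written directly in code. A minimal sketch; the weights, bias, and inputs are illustrative, not from the slides:

```python
import numpy as np

def ltu_predict(w, b, x):
    """Linear Threshold Unit: predict +1 if w.x + b >= 0, else -1."""
    return 1 if w @ x + b >= 0 else -1

w = np.array([2.0, -1.0])   # illustrative weight vector
b = 0.5                     # the bias term
ltu_predict(w, b, np.array([1.0, 1.0]))   # 2 - 1 + 0.5 = 1.5 >= 0, so +1
ltu_predict(w, b, np.array([-1.0, 2.0]))  # -2 - 2 + 0.5 < 0, so -1
```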
[Figure: an LTU drawn as a network — input features x1 … xn feed weighted connections w1 … wn and a constant input 1 (the bias) into a summation unit ∑, followed by sgn]
The geometry of a linear classifier
sgn(b + w1x1 + w2x2)
In n dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces.
[Figure: points labeled + and - in the (x1, x2) plane, separated by the line b + w1x1 + w2x2 = 0, with the weight vector [w1 w2] normal to it]
We only care about the sign, not the magnitude.
The Perceptron
The hype
The New Yorker, December 6, 1958, p. 44
The New York Times, July 8, 1958
[Photo: the IBM 704 computer]
The Perceptron algorithm
• Rosenblatt, 1958
• The goal is to find a separating hyperplane
– For separable data, it is guaranteed to find one
• An online algorithm
– Processes one example at a time
• Several variants exist (discussed briefly towards the end)
The Perceptron algorithm

Input: A sequence of training examples (x1, y1), (x2, y2), …, where all xi ∈ ℝn, yi ∈ {-1, 1}

• Initialize w0 = 0 ∈ ℝn
• For each training example (xi, yi):
– Predict y’ = sgn(wtTxi)
– If yi ≠ y’:
• Update wt+1 ← wt + r (yi xi)
• Return final weight vector

Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi

Remember: Prediction = sgn(wTx). There is typically a bias term also (wTx + b), but the bias may be treated as a constant feature and folded into w.

Footnote: For some algorithms it is mathematically easier to represent False as -1, and at other times as 0. For the Perceptron algorithm, treat -1 as false and +1 as true.

r is the learning rate, a small positive number less than 1.

Update only on error: a mistake-driven algorithm.

This is the simplest version. We will see more robust versions at the end.

A mistake can be written as yi wtTxi ≤ 0.
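The pseudocode above can be sketched in Python. A minimal sketch; the toy dataset, the epoch cap, and the choice r = 1 are illustrative assumptions:

```python
import numpy as np

def perceptron(X, y, r=1.0, epochs=10):
    """Mistake-driven Perceptron: update w only when a mistake is made."""
    w = np.zeros(X.shape[1])            # w0 = 0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:      # mistake: yi * w^T xi <= 0
                w = w + r * yi * xi     # w_{t+1} <- w_t + r yi xi
                mistakes += 1
        if mistakes == 0:               # a full clean pass: converged
            break
    return w

# Bias folded in as a constant feature of 1, as the slide notes.
X = np.array([[1.0, 2.0, 1.0], [2.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
preds = np.sign(X @ w)                  # all four examples classified correctly
```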
Intuition behind the update
Suppose we have made a mistake on a positive example.
That is, y = +1 and wtTx < 0.
Call the new weight vector wt+1 = wt + x (say r = 1).
The new dot product will be wt+1Tx = (wt + x)Tx = wtTx + xTx > wtTx, since xTx = ||x||2 > 0.
For a positive example, the Perceptron update will increase the score assigned to the same input.
Similar reasoning holds for negative examples.
Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi
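A quick numeric check of this argument; the toy vectors are illustrative:

```python
import numpy as np

w = np.array([0.5, -1.0])
x = np.array([1.0, 1.5])   # positive example (y = +1), mispredicted: w.x < 0
before = w @ x             # score before the update
w_new = w + x              # Perceptron update with r = 1
after = w_new @ x          # score after the update
# after - before equals x.x = ||x||^2 > 0, so the score always increases
```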
Geometry of the perceptron update

Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi

For a mistake on a positive example (x, +1):
[Figure, three panels — Predict: the example (x, +1) falls on the negative side of the hyperplane defined by wold. Update: x is added to the weight vector (w ← w + x). After: the new weight vector wnew = wold + x points closer to x, so (x, +1) is now classified correctly.]
Geometry of the perceptron update

For a mistake on a negative example (x, -1):
[Figure, three panels — Predict: the example (x, -1) falls on the positive side of the hyperplane defined by wold. Update: x is subtracted from the weight vector (w ← w - x). After: the new weight vector wnew = wold - x points away from x, so (x, -1) is now classified correctly.]
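The rotation in these pictures can be checked numerically: after an update on a misclassified positive example, w turns toward x. A sketch with illustrative vectors:

```python
import numpy as np

def cos_angle(a, b):
    """Cosine of the angle between two vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

w_old = np.array([1.0, -1.0])
x = np.array([0.0, 2.0])   # positive example, misclassified: w_old.x < 0
w_new = w_old + x          # update for a mistake on a positive example
# the angle between w and x shrinks: w_new points closer to x than w_old did
```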
Perceptron Learnability

• Obviously, Perceptron cannot learn what it cannot represent
– Only linearly separable functions
• Minsky and Papert (1969) wrote an influential book demonstrating Perceptron’s representational limitations
– Parity functions cannot be learned (e.g., XOR)
• But we already know that XOR is not linearly separable
– The feature transformation trick can work around this
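The feature transformation trick can be illustrated on XOR. A sketch; the added product feature x1·x2 is one standard choice, an assumption not spelled out in the slides:

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Perceptron with r = 1; returns (w, converged)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w = w + yi * xi
                mistakes += 1
        if mistakes == 0:
            return w, True    # found a separating hyperplane
    return w, False           # no clean pass: not separated

# XOR with {-1, +1} inputs is not linearly separable
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
bias = np.ones((4, 1))
_, ok_raw = perceptron(np.hstack([X, bias]), y)       # fails to converge
X2 = np.hstack([X, X[:, :1] * X[:, 1:2], bias])       # add feature x1*x2
_, ok_xform = perceptron(X2, y)                       # now linearly separable
```

With the product feature, XOR becomes y = -x1·x2, which a single threshold unit can represent.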
What you need to know
• The Perceptron algorithm
• The geometry of the update
• What it can represent
Where are we?
• The Perceptron Algorithm
• Perceptron Mistake Bound
• Variants of Perceptron
Convergence

Convergence theorem
– If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge.

Cycling theorem
– If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop.
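The cycling theorem can be observed on XOR. A sketch; recording the weight vector once per pass over the data is an illustrative simplification:

```python
import numpy as np

# Perceptron on non-separable XOR data (bias folded in as a constant feature).
X = np.array([[-1, -1, 1], [-1, 1, 1], [1, -1, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
w = np.zeros(3)
seen, cycled = set(), False
for _ in range(100):
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:
            w = w + yi * xi
    key = tuple(w)
    if key in seen:        # same weights revisited: the algorithm is looping
        cycled = True
        break
    seen.add(key)
```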
Margin

• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
• The margin of a dataset (γ) is the maximum margin possible for that dataset using any weight vector.

[Figure: positive and negative points with a separating hyperplane; the margin is the gap between the hyperplane and the nearest data points.]
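For a hyperplane through the origin (bias folded in), the margin can be computed as min over i of yi (wTxi) / ||w||. A small sketch; the toy data and weight vector are illustrative:

```python
import numpy as np

def margin(w, X, y):
    """Distance from the hyperplane w.x = 0 to the nearest data point;
    a negative value means some point is on the wrong side of w."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
gamma = margin(w, X, y)   # distance of the nearest point to the line x1 + x2 = 0
```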
Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℝn, ||xi|| ≤ R, and the label yi ∈ {-1, +1}.

Suppose there exists a unit vector u ∈ ℝn (i.e., ||u|| = 1) such that for some γ ∈ ℝ with γ > 0 we have yi (uTxi) ≥ γ.

Then, the perceptron algorithm will make at most (R/γ)2 mistakes on the training sequence.

Notes:
• We can always find such an R: just look for the farthest data point from the origin.
• The data and u have a margin γ. Importantly, the data is separable; γ is the complexity parameter that defines the separability of the data.
• If u were not a unit vector, we could scale γ in the mistake bound. This changes the final mistake bound to (||u||R/γ)2.
• In other words: given a binary classification dataset with n-dimensional inputs, if the data is separable, then the Perceptron algorithm will find a separating hyperplane after making a finite number of mistakes.
Proof (preliminaries)

The setting:
• Initial weight vector w is all zeros
• Learning rate = 1
– Effectively scales inputs, but does not change the behavior
• All training examples are contained in a ball of size R: ||xi|| ≤ R
• The training data is separable by margin γ using a unit vector u: yi (uTxi) ≥ γ

The algorithm:
• Receive an input (xi, yi)
• If sgn(wtTxi) ≠ yi: Update wt+1 ← wt + yi xi
Proof (1/3)

1. Claim: After t mistakes, uTwt ≥ tγ

If the t-th mistake is made on example (xi, yi), the update gives
uTwt+1 = uT(wt + yi xi) = uTwt + yi (uTxi) ≥ uTwt + γ
because the data is separable by a margin γ.

Because w0 = 0 (i.e., uTw0 = 0), straightforward induction gives us uTwt ≥ tγ.
Proof (2/3)

2. Claim: After t mistakes, ||wt||2 ≤ tR2

If the t-th mistake is made on example (xi, yi), the update gives
||wt+1||2 = ||wt + yi xi||2 = ||wt||2 + 2 yi (wtTxi) + ||xi||2 ≤ ||wt||2 + R2

The weight is updated only when there is a mistake, i.e., when yi wtTxi ≤ 0, so the middle term is non-positive; and ||xi|| ≤ R by the definition of R.

Because w0 = 0, straightforward induction gives us ||wt||2 ≤ tR2.
Proof (3/3)
What we know:1. After t mistakes, uTwt ≥ t"2. After t mistakes, ||wt||2 ≤ tR2
53
![Page 54: Supervised Learning: The Setup The Perceptron Algorithmzhe/pdf/lec-10-perceptron-upload.pdf · The Perceptron algorithm 12 Footnote: For some algorithms it is mathematically easier](https://reader030.fdocuments.us/reader030/viewer/2022040206/5d388ca188c9931d5c8c321b/html5/thumbnails/54.jpg)
Proof (3/3)
What we know:1. After t mistakes, uTwt ≥ t"2. After t mistakes, ||wt||2 ≤ tR2
54
From (2)
![Page 55: Supervised Learning: The Setup The Perceptron Algorithmzhe/pdf/lec-10-perceptron-upload.pdf · The Perceptron algorithm 12 Footnote: For some algorithms it is mathematically easier](https://reader030.fdocuments.us/reader030/viewer/2022040206/5d388ca188c9931d5c8c321b/html5/thumbnails/55.jpg)
Proof (3/3)
What we know:1. After t mistakes, uTwt ≥ t"2. After t mistakes, ||wt||2 ≤ tR2
55
From (2)
uT wt =||u|| ||wt|| cos(<angle between them>)
But ||u|| = 1 and cosine is less than 1
So uT wt · ||wt||
![Page 56: Supervised Learning: The Setup The Perceptron Algorithmzhe/pdf/lec-10-perceptron-upload.pdf · The Perceptron algorithm 12 Footnote: For some algorithms it is mathematically easier](https://reader030.fdocuments.us/reader030/viewer/2022040206/5d388ca188c9931d5c8c321b/html5/thumbnails/56.jpg)
Proof (3/3)
What we know:1. After t mistakes, uTwt ≥ t"2. After t mistakes, ||wt||2 ≤ tR2
56
From (2)
uT wt =||u|| ||wt|| cos(<angle between them>)
But ||u|| = 1 and cosine is less than 1
So uT wt · ||wt|| (Cauchy-Schwarz inequality)
Proof (3/3)
57
From (2)
uTwt = ||u|| ||wt|| cos(angle between them)
But ||u|| = 1 and the cosine is at most 1
So uTwt ≤ ||wt||
From (1)
What we know:
1. After t mistakes, uTwt ≥ tγ
2. After t mistakes, ||wt||2 ≤ tR2
Proof (3/3)
58
Number of mistakes
What we know:
1. After t mistakes, uTwt ≥ tγ
2. After t mistakes, ||wt||2 ≤ tR2
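Chaining the two facts bounds the number of mistakes; writing the algebra out in one line:

```latex
t\gamma \;\le\; u^\top w_t \;\le\; \|w_t\| \;\le\; \sqrt{t}\,R
\quad\Longrightarrow\quad
\sqrt{t}\,\gamma \;\le\; R
\quad\Longrightarrow\quad
t \;\le\; \left(\frac{R}{\gamma}\right)^{2}
```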
Proof (3/3)
59
Bounds the total number of mistakes!
Number of mistakes
What we know:
1. After t mistakes, uTwt ≥ tγ
2. After t mistakes, ||wt||2 ≤ tR2
Mistake Bound Theorem [Novikoff 1962, Block 1962]
Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜn, ||xi|| ≤ R, and the label yi ∈ {-1, +1}.
Suppose there exists a unit vector u ∈ ℜn (i.e., ||u|| = 1) such that for some γ ∈ ℜ with γ > 0 we have yi (uTxi) ≥ γ.
Then the perceptron algorithm will make at most (R/γ)2 mistakes on the training sequence.
60
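A quick empirical check of the theorem, sketched with NumPy on a tiny dataset where u, γ, and R are known by construction:

```python
# Sanity-check the Novikoff bound: count perceptron mistakes on a separable
# dataset and compare against (R/gamma)^2. The data and u are made up here.
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

u = np.array([1.0, 1.0]) / np.sqrt(2)   # a separating unit vector
gamma = np.min(y * (X @ u))             # margin achieved by u
R = np.max(np.linalg.norm(X, axis=1))   # radius of the data

w = np.zeros(2)
mistakes = 0
for _ in range(100):                    # cycle until a clean pass
    made_mistake = False
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:          # error: update and count it
            w += yi * xi
            mistakes += 1
            made_mistake = True
    if not made_mistake:
        break
```

On this data the bound (R/γ)2 ≈ 1.1, and the algorithm indeed makes a single mistake.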
Mistake Bound Theorem [Novikoff 1962, Block 1962]
61
Mistake Bound Theorem [Novikoff 1962, Block 1962]
62
Number of mistakes ≤ (R/γ)2
Mistake Bound Theorem [Novikoff 1962, Block 1962]
63
Number of mistakes ≤ (R/γ)2
Mistake Bound Theorem [Novikoff 1962, Block 1962]
64
Number of mistakes
≤ (R/γ)2
The Perceptron Mistake bound
• Exercises:
– How many mistakes will the Perceptron algorithm make for disjunctions with n attributes?
• What are R and γ?
– How many mistakes will the Perceptron algorithm make for k-disjunctions with n attributes?
65
Number of mistakes
What you need to know
• What is the perceptron mistake bound?
• How to prove it
66
Where are we?
• The Perceptron Algorithm
• Perceptron Mistake Bound
• Variants of Perceptron
67
Practical use of the Perceptron algorithm
1. Using the Perceptron algorithm with a finite dataset
2. Margin Perceptron
3. Voting and Averaging
68
1. The “standard” algorithm
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn
2. For epoch = 1 … T:
   1. Shuffle the data
   2. For each training example (xi, yi) ∈ D:
      • If yi wTxi ≤ 0, update w ← w + r yi xi
3. Return w
Prediction: sgn(wTx)
69
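A minimal sketch of the loop above, assuming NumPy (`perceptron` and `predict` are illustrative names; r is the learning rate, T the number of epochs):

```python
import numpy as np

def perceptron(X, y, T=10, r=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])               # 1. initialize w = 0
    for _ in range(T):                     # 2. for each epoch
        for i in rng.permutation(len(X)):  #    shuffle the data
            if y[i] * (w @ X[i]) <= 0:     #    error (or exactly on the boundary)
                w = w + r * y[i] * X[i]    #    additive update
    return w                               # 3. return w

def predict(w, X):
    # sgn(w^T x); break the tie at 0 toward +1
    return np.where(X @ w >= 0, 1, -1)
```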
1. The “standard” algorithm
70
T is a hyper-parameter to the algorithm
Another way of writing that there is an error
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn
2. For epoch = 1 … T:
   1. Shuffle the data
   2. For each training example (xi, yi) ∈ D:
      • If yi wTxi ≤ 0, update w ← w + r yi xi
3. Return w
Prediction: sgn(wTx)
2. Margin Perceptron
71
Which hyperplane is better? h1 or h2?
2. Margin Perceptron
72
Which hyperplane is better? h1 or h2?
2. Margin Perceptron
73
Which hyperplane is better? h1 or h2?
The farther the hyperplane is from the data points, the less chance of making a wrong prediction
2. Margin Perceptron
• Perceptron makes updates only when the prediction is incorrect:
yi wTxi < 0
• What if the prediction is close to being incorrect? That is, pick a positive η and update when
yi wTxi / ||w|| < η
• Can generalize better, but need to choose η
– Why is this a good idea?
74
2. Margin Perceptron
• Perceptron makes updates only when the prediction is incorrect:
yi wTxi < 0
• What if the prediction is close to being incorrect? That is, pick a positive η and update when
yi wTxi / ||w|| < η
• Can generalize better, but need to choose η
– Why is this a good idea? Intentionally set a large margin
75
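A sketch of this variant, assuming NumPy; `eta` (η) and `r` are hyper-parameters, and the w = 0 case is treated as an error so the first update always fires:

```python
import numpy as np

def margin_perceptron(X, y, eta=0.5, r=1.0, T=10):
    # Update not only on mistakes, but whenever the normalized score
    # y_i w^T x_i / ||w|| falls below the chosen margin eta > 0.
    w = np.zeros(X.shape[1])
    for _ in range(T):
        for xi, yi in zip(X, y):
            norm = np.linalg.norm(w)
            if norm == 0 or yi * (w @ xi) / norm < eta:
                w = w + r * yi * xi
    return w
```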
3. Voting and Averaging
76
What if the data is not linearly separable?
Finding a hyperplane with the minimum number of mistakes is NP-hard
Voted Perceptron
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w1 = 0 ∈ ℜn, m = 1, C1 = 0
2. For epoch = 1 … T:
– For each training example (xi, yi) ∈ D:
• If yi wmTxi ≤ 0
– update wm+1 ← wm + r yi xi
– m ← m + 1
– Cm ← 1
• Else
– Cm ← Cm + 1
3. Return (w1, C1), (w2, C2), …, (wm, Cm)
Prediction: sgn(Σm Cm sgn(wmTx))
77
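The procedure above can be sketched with NumPy (`voted_perceptron` and `voted_predict` are illustrative names): every intermediate weight vector w_m is kept with a survival count C_m, and prediction is a weighted vote of their signs.

```python
import numpy as np

def voted_perceptron(X, y, T=10, r=1.0):
    ws = [np.zeros(X.shape[1])]  # w_1 = 0
    cs = [0]                     # C_1 = 0
    for _ in range(T):
        for xi, yi in zip(X, y):
            if yi * (ws[-1] @ xi) <= 0:
                ws.append(ws[-1] + r * yi * xi)  # new weight vector w_{m+1}
                cs.append(1)                     # C_{m+1} = 1
            else:
                cs[-1] += 1                      # current w survives one more example
    return ws, cs

def voted_predict(ws, cs, x):
    # sgn( sum_m C_m * sgn(w_m^T x) ), ties broken toward +1
    vote = sum(c * (1 if w @ x >= 0 else -1) for w, c in zip(ws, cs))
    return 1 if vote >= 0 else -1
```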
Averaged Perceptron
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn and a = 0 ∈ ℜn
2. For epoch = 1 … T:
– For each training example (xi, yi) ∈ D:
• If yi wTxi ≤ 0, update w ← w + r yi xi
• a ← a + w
3. Return a
Prediction: sgn(aTx)
78
Averaged Perceptron
79
This is the simplest version of the averaged perceptron
There are some easy programming tricks to make sure that a is also updated only when there is an error
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn and a = 0 ∈ ℜn
2. For epoch = 1 … T:
– For each training example (xi, yi) ∈ D:
• If yi wTxi ≤ 0, update w ← w + r yi xi
• a ← a + w
3. Return a
Prediction: sgn(aTx)
Averaged Perceptron
80
This is the simplest version of the averaged perceptron
There are some easy programming tricks to make sure that a is also updated only when there is an error
Extremely popular
If you want to use the Perceptron algorithm, use averaging
Given a training set D = {(xi, yi)}, xi ∈ ℜn, yi ∈ {-1,1}
1. Initialize w = 0 ∈ ℜn and a = 0 ∈ ℜn
2. For epoch = 1 … T:
– For each training example (xi, yi) ∈ D:
• If yi wTxi ≤ 0, update w ← w + r yi xi
• a ← a + w
3. Return a
Prediction: sgn(aTx)
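One such programming trick can be sketched as follows (a NumPy sketch, not necessarily the version intended by the slides): keep a second vector u that is touched only on errors, together with an example counter c; the identity a = c·w − u then recovers the running sum of w.

```python
import numpy as np

def averaged_perceptron(X, y, T=10, r=1.0):
    w = np.zeros(X.shape[1])
    u = np.zeros(X.shape[1])  # error-weighted accumulator, updated only on errors
    c = 1                     # counts examples seen (starts at 1)
    for _ in range(T):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w = w + r * yi * xi
                u = u + c * r * yi * xi  # the only place u is touched
            c += 1
    return c * w - u  # equals the naive sum of w over all examples
```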
Question: What is the difference?
81
Voted: sgn(Σi ci · sgn(wiTx))
Averaged: sgn(Σi ci wiTx)
Example: three weight vectors with c1 = c2 = c3 = 1 and scores w1Tx = s1, w2Tx = s2, w3Tx = s3
Voted: predicts +1 when any two of s1, s2, s3 are positive
Averaged: predicts +1 when s1 + s2 + s3 ≥ 0
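A toy numeric check of the difference, with made-up scores: voting looks only at the signs si, while averaging sums the raw scores, so a single large negative score can flip the averaged prediction.

```python
import numpy as np

# Hypothetical stored weight vectors, each with count c_i = 1.
x = np.array([1.0])
ws = [np.array([0.5]), np.array([0.5]), np.array([-10.0])]

voted = np.sign(sum(np.sign(w @ x) for w in ws))  # two of three signs are positive
averaged = np.sign(sum(w @ x for w in ws))        # 0.5 + 0.5 - 10 < 0
```

Here the voted prediction is +1 but the averaged prediction is -1.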
Summary: Perceptron
• Online learning algorithm, very widely used, easy to implement
• Additive updates to weights
• Geometric interpretation
• Mistake bound
• Practical variants abound
• You should be able to implement the Perceptron algorithm
82