Transcript of CHAPTER 2: Supervised Learning and CHAPTER 3: Bayesian Decision Theory (akfletcher/stat261/Lec2Slides_UnSupBayes.pdf)
CHAPTER 2:
Supervised Learning
Learning a Class from Examples

Class C of a “family car”

Prediction: Is car x a family car?

Knowledge extraction: What do people expect from a family car?

Output: positive (+) and negative (–) examples

Input representation: $x_1$: price, $x_2$: engine power
Training set X

$\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$

$r = \begin{cases} 1 & \text{if } x \text{ is positive} \\ 0 & \text{if } x \text{ is negative} \end{cases}$

$x = [x_1, x_2]^T$
Class C

$(p_1 \le \text{price} \le p_2)$ AND $(e_1 \le \text{engine power} \le e_2)$
Hypothesis class H

$h(x) = \begin{cases} 1 & \text{if } h \text{ says } x \text{ is positive} \\ 0 & \text{if } h \text{ says } x \text{ is negative} \end{cases}$

Error of $h \in H$:

$E(h|\mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\big(h(x^t) \ne r^t\big)$
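The empirical error above is just a count of misclassified training examples. A minimal sketch, in which the `rect` hypothesis builder and the (price, engine power) sample are hypothetical and chosen only to illustrate $E(h|\mathcal{X})$:

```python
def empirical_error(h, X, r):
    """E(h|X) = number of training examples t with h(x^t) != r^t."""
    return sum(1 for x_t, r_t in zip(X, r) if h(x_t) != r_t)

def rect(p1, p2, e1, e2):
    """Hypothetical axis-aligned rectangle hypothesis for the family-car class:
    positive iff p1 <= price <= p2 and e1 <= engine power <= e2."""
    return lambda x: 1 if p1 <= x[0] <= p2 and e1 <= x[1] <= e2 else 0

X = [(5, 100), (7, 150), (2, 60), (9, 300)]   # (price, engine power), made up
r = [1, 1, 0, 0]                              # positive / negative labels
h = rect(4, 8, 90, 200)
print(empirical_error(h, X, r))  # 0: this h is consistent with the sample
```

Any rectangle that puts a negative example inside it, or a positive one outside, pays one unit of error per mistake.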
S, G, and the Version Space

most specific hypothesis, S

most general hypothesis, G

Any h ∈ H between S and G is consistent with the training set; together these hypotheses make up the version space (Mitchell, 1997).
Margin

Choose h with the largest margin.
VC Dimension

N points can be labeled in $2^N$ ways as +/–.

H shatters N points if there exists h ∈ H consistent with every one of these labelings: VC(H) = N.

An axis-aligned rectangle shatters at most 4 points.
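The rectangle claim can be checked by brute force. This sketch relies on the observation that the bounding box of the positive points is the most specific consistent rectangle, so a labeling is realizable iff that box contains no negative point; the `diamond` point set is a hypothetical example of 4 points that axis-aligned rectangles shatter:

```python
from itertools import product

def rectangles_shatter(points):
    """Check every +/- labeling of `points` against axis-aligned rectangles.
    A labeling is realizable iff the bounding box of its positive points
    contains no negative point (the box is the tightest consistent rectangle)."""
    for labels in product([0, 1], repeat=len(points)):
        pos = [p for p, l in zip(points, labels) if l == 1]
        neg = [p for p, l in zip(points, labels) if l == 0]
        if not pos:
            continue  # an empty rectangle realizes the all-negative labeling
        x_lo = min(p[0] for p in pos); x_hi = max(p[0] for p in pos)
        y_lo = min(p[1] for p in pos); y_hi = max(p[1] for p in pos)
        if any(x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi for p in neg):
            return False
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(rectangles_shatter(diamond))             # True: VC(H) >= 4
print(rectangles_shatter(diamond + [(0, 0)]))  # False: this 5-point set fails
```

The second call only shows that this particular 5-point set is not shattered; the stronger fact, that no set of 5 points is, is what pins VC(H) at exactly 4.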
Probably Approximately Correct (PAC) Learning

How many training examples N should we have, such that with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)

Each strip is at most ε/4.

Pr that a random instance misses a strip: ≤ 1 − ε/4

Pr that N instances all miss a strip: ≤ (1 − ε/4)^N

Pr that N instances miss any of the 4 strips: ≤ 4(1 − ε/4)^N

We require $4(1 - \varepsilon/4)^N \le \delta$. Using $(1 - x) \le \exp(-x)$:

$4\exp(-\varepsilon N/4) \le \delta \quad \Rightarrow \quad N \ge (4/\varepsilon)\log(4/\delta)$
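The bound is easy to evaluate numerically; a small sketch of the final inequality from the slide:

```python
import math

def pac_sample_size(eps, delta):
    """N >= (4/eps) * log(4/delta): with probability at least 1 - delta,
    the tightest consistent rectangle has error at most eps."""
    return math.ceil((4 / eps) * math.log(4 / delta))

print(pac_sample_size(0.1, 0.05))  # 176
```

So about 176 examples suffice for ε = 0.1 and δ = 0.05; note the bound grows only logarithmically in 1/δ but linearly in 1/ε.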
Noise and Model Complexity

Use the simpler one because it is:

- Simpler to use (lower computational complexity)
- Easier to train (lower space complexity)
- Easier to explain (more interpretable)
- Generalizes better (lower variance; Occam’s razor)
Multiple Classes, $C_i$, $i = 1, \ldots, K$

$\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$

$r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \ne i \end{cases}$

Train hypotheses $h_i(x)$, $i = 1, \ldots, K$:

$h_i(x^t) = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \ne i \end{cases}$
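The $r_i^t$ encoding is what is usually called one-hot encoding; a minimal sketch:

```python
def one_hot(labels, K):
    """Encode class indices as the r^t vectors above:
    r_i^t = 1 if x^t belongs to C_i, else 0."""
    return [[1 if i == c else 0 for i in range(K)] for c in labels]

print(one_hot([0, 2, 1], K=3))  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Each h_i is then trained as a two-class problem against column i of this encoding (one-vs-all).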
Regression

$\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}, \quad r^t \in \mathbb{R}, \quad r^t = f(x^t) + \varepsilon$

Linear model: $g(x) = w_1 x + w_0$

Quadratic model: $g(x) = w_2 x^2 + w_1 x + w_0$

$E(g|\mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \big(r^t - g(x^t)\big)^2$

$E(w_1, w_0|\mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \big(r^t - (w_1 x^t + w_0)\big)^2$
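For the linear model, $E(w_1, w_0|\mathcal{X})$ has a closed-form minimizer (ordinary least squares). A minimal sketch, where the `xs`/`rs` sample is hypothetical, roughly $r = 2x + 1$ plus noise:

```python
def fit_line(xs, rs):
    """Minimize E(w1, w0 | X) = (1/N) sum (r^t - (w1 x^t + w0))^2
    via the closed-form least-squares solution."""
    N = len(xs)
    mx = sum(xs) / N
    mr = sum(rs) / N
    w1 = (sum((x - mx) * (r - mr) for x, r in zip(xs, rs))
          / sum((x - mx) ** 2 for x in xs))
    w0 = mr - w1 * mx
    return w1, w0

xs = [0.0, 1.0, 2.0, 3.0]
rs = [1.1, 2.9, 5.2, 6.8]
w1, w0 = fit_line(xs, rs)
print(round(w1, 2), round(w0, 2))  # 1.94 1.09
```

The recovered slope and intercept are close to the generating values 2 and 1; the residual error that remains is the noise ε the model cannot explain.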
Model Selection & Generalization

Learning is an ill-posed problem; data is not sufficient to find a unique solution.

The need for inductive bias: assumptions about H.

Generalization: how well a model performs on new data.

Overfitting: H more complex than C or f

Underfitting: H less complex than C or f
Triple Trade-Off

There is a trade-off between three factors (Dietterich, 2003):

1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data

As N increases, E decreases.

As c(H) increases, E first decreases and then increases.
Cross-Validation

To estimate generalization error, we need data unseen during training. We split the data as:

- Training set (50%)
- Validation set (25%)
- Test (publication) set (25%)

Resampling when there is little data.
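The 50/25/25 split above can be sketched in a few lines; the fixed seed is an assumption added for reproducibility:

```python
import random

def split_data(data, seed=0):
    """Shuffle and split into 50% training, 25% validation, 25% test."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n = len(data)
    return data[: n // 2], data[n // 2 : 3 * n // 4], data[3 * n // 4 :]

train, val, test = split_data(range(100))
print(len(train), len(val), len(test))  # 50 25 25
```

The validation set is used to choose among models; the test set is touched only once, to report the final (publication) error.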
Dimensions of a Supervised Learner

1. Model: $g(x|\theta)$
2. Loss function: $E(\theta|\mathcal{X}) = \sum_t L\big(r^t, g(x^t|\theta)\big)$
3. Optimization procedure: $\theta^* = \arg\min_\theta E(\theta|\mathcal{X})$
CHAPTER 3:
Bayesian Decision Theory
Probability and Inference

Result of tossing a coin is {Heads, Tails}.

Random variable $X \in \{1, 0\}$

Bernoulli: $P\{X = 1\} = p_o^X (1 - p_o)^{1 - X}$

Sample: $\mathcal{X} = \{x^t\}_{t=1}^{N}$

Estimation: $\hat{p}_o = \#\{\text{Heads}\} / \#\{\text{Tosses}\} = \sum_t x^t / N$

Prediction of next toss: Heads if $\hat{p}_o > 1/2$, Tails otherwise
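The estimator and the prediction rule fit in a few lines; the toss sequence below is a made-up sample:

```python
tosses = [1, 0, 1, 1, 0, 1, 1, 0]        # 1 = Heads, 0 = Tails (hypothetical)

p_hat = sum(tosses) / len(tosses)        # p_o estimate: #{Heads} / #{Tosses}
prediction = "Heads" if p_hat > 0.5 else "Tails"
print(p_hat, prediction)  # 0.625 Heads
```

This is the maximum-likelihood estimate of the Bernoulli parameter; with more tosses it converges to the true p_o.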
Classification

Credit scoring: inputs are income and savings; output is low-risk vs high-risk.

Input: $x = [x_1, x_2]^T$, Output: $C \in \{0, 1\}$

Prediction:

choose $C = 1$ if $P(C = 1 \mid x_1, x_2) > 0.5$, otherwise choose $C = 0$

or equivalently:

choose $C = 1$ if $P(C = 1 \mid x_1, x_2) > P(C = 0 \mid x_1, x_2)$, otherwise choose $C = 0$
Bayes’ Rule

$P(C|x) = \dfrac{p(x|C)\,P(C)}{p(x)}$

posterior = likelihood × prior / evidence

$P(C = 0) + P(C = 1) = 1$

$p(x) = p(x|C = 1)P(C = 1) + p(x|C = 0)P(C = 0)$

$P(C = 0|x) + P(C = 1|x) = 1$
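A numeric sketch of the rule, expanding the evidence exactly as above; the prior and likelihood values are made up for illustration:

```python
def posterior(prior_c1, lik_c1, lik_c0):
    """P(C=1|x) = p(x|C=1) P(C=1) / p(x), with
    p(x) = p(x|C=1) P(C=1) + p(x|C=0) P(C=0)."""
    evidence = lik_c1 * prior_c1 + lik_c0 * (1 - prior_c1)
    return lik_c1 * prior_c1 / evidence

print(posterior(prior_c1=0.4, lik_c1=0.5, lik_c0=0.2))  # ~0.625
```

Even though the prior favors C = 0, the likelihood ratio 0.5/0.2 is large enough to push the posterior for C = 1 above 0.5.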
Bayes’ Rule: K > 2 Classes

$P(C_i|x) = \dfrac{p(x|C_i)\,P(C_i)}{p(x)} = \dfrac{p(x|C_i)\,P(C_i)}{\sum_{k=1}^{K} p(x|C_k)\,P(C_k)}$

$P(C_i) \ge 0$ and $\sum_{i=1}^{K} P(C_i) = 1$

choose $C_i$ if $P(C_i|x) = \max_k P(C_k|x)$
Losses and Risks

Actions: $\alpha_i$

Loss of $\alpha_i$ when the state is $C_k$: $\lambda_{ik}$

Expected risk (Duda and Hart, 1973):

$R(\alpha_i|x) = \sum_{k=1}^{K} \lambda_{ik}\,P(C_k|x)$

choose $\alpha_i$ if $R(\alpha_i|x) = \min_k R(\alpha_k|x)$
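The expected risk is a matrix-vector product of the loss matrix with the posterior vector; a minimal sketch with a hypothetical 0/1 loss matrix and posteriors:

```python
def bayes_action(loss, posteriors):
    """R(a_i|x) = sum_k lambda_ik P(C_k|x); pick the action minimizing it."""
    risks = [sum(l_ik * p_k for l_ik, p_k in zip(row, posteriors))
             for row in loss]
    return min(range(len(risks)), key=risks.__getitem__), risks

# 0/1 loss for two classes: minimum risk reduces to the most probable class
loss_01 = [[0, 1],
           [1, 0]]
action, risks = bayes_action(loss_01, [0.3, 0.7])
print(action, risks)  # 1 [0.7, 0.3]
```

With an asymmetric loss matrix (e.g. a false "low-risk" costing more than a false "high-risk" in credit scoring), the same code shifts the decision boundary away from 0.5.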
Losses and Risks: 0/1 Loss

$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ 1 & \text{if } i \ne k \end{cases}$

$R(\alpha_i|x) = \sum_{k=1}^{K} \lambda_{ik}\,P(C_k|x) = \sum_{k \ne i} P(C_k|x) = 1 - P(C_i|x)$

For minimum risk, choose the most probable class.
Losses and Risks: Reject

$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ \lambda & \text{if } i = K + 1 \\ 1 & \text{otherwise} \end{cases}, \quad 0 < \lambda < 1$

$R(\alpha_{K+1}|x) = \sum_{k=1}^{K} \lambda\,P(C_k|x) = \lambda$

$R(\alpha_i|x) = \sum_{k \ne i} P(C_k|x) = 1 - P(C_i|x)$

choose $C_i$ if $P(C_i|x) > P(C_k|x)$ for all $k \ne i$ and $P(C_i|x) > 1 - \lambda$; otherwise reject
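The reject rule is a one-line threshold on the winning posterior; a minimal sketch with hypothetical posteriors:

```python
def classify_with_reject(posteriors, lam):
    """Return the index of the most probable class if its posterior exceeds
    1 - lambda, otherwise 'reject' (0 < lambda < 1)."""
    i = max(range(len(posteriors)), key=posteriors.__getitem__)
    return i if posteriors[i] > 1 - lam else "reject"

p = [0.5, 0.3, 0.2]
print(classify_with_reject(p, lam=0.4))  # reject: 0.5 does not exceed 1 - 0.4
print(classify_with_reject(p, lam=0.6))  # 0: 0.5 exceeds 1 - 0.6
```

A small λ makes rejection cheap, so the classifier only commits when it is very confident; as λ approaches 1 it never rejects.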
Discriminant Functions

$g_i(x), \quad i = 1, \ldots, K$

choose $C_i$ if $g_i(x) = \max_k g_k(x)$

$\mathcal{R}_i = \{x \mid g_i(x) = \max_k g_k(x)\}$

$g_i(x) = \begin{cases} -R(\alpha_i|x) \\ P(C_i|x) \\ p(x|C_i)\,P(C_i) \end{cases}$

K decision regions $\mathcal{R}_1, \ldots, \mathcal{R}_K$
K = 2 Classes

Dichotomizer (K = 2) vs polychotomizer (K > 2)

$g(x) = g_1(x) - g_2(x)$

choose $C_1$ if $g(x) > 0$, otherwise choose $C_2$

Log odds: $\log \dfrac{P(C_1|x)}{P(C_2|x)}$
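For two classes the log odds is a single number whose sign decides; a minimal sketch (the posterior value is hypothetical):

```python
import math

def log_odds(p_c1_given_x):
    """g(x) = log P(C1|x)/P(C2|x) with P(C2|x) = 1 - P(C1|x);
    choose C1 when g(x) > 0."""
    return math.log(p_c1_given_x / (1 - p_c1_given_x))

print(log_odds(0.8) > 0)  # True: choose C1
```

g(x) > 0 is exactly P(C1|x) > 0.5, so the dichotomizer and the posterior threshold rule agree.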
Utility Theory

Probability of state $S_k$ given evidence $x$: $P(S_k|x)$

Utility of $\alpha_i$ when the state is $S_k$: $U_{ik}$

Expected utility:

$EU(\alpha_i|x) = \sum_k U_{ik}\,P(S_k|x)$

Choose $\alpha_i$ if $EU(\alpha_i|x) = \max_j EU(\alpha_j|x)$
Association Rules

Association rule: X → Y

People who buy/click/visit/enjoy X are also likely to buy/click/visit/enjoy Y.

A rule implies association, not necessarily causation.
Association measures

Support (X → Y):

$P(X, Y) = \dfrac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers}\}}$

Confidence (X → Y):

$P(Y|X) = \dfrac{P(X, Y)}{P(X)} = \dfrac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers who bought } X\}}$

Lift (X → Y):

$\dfrac{P(X, Y)}{P(X)\,P(Y)} = \dfrac{P(Y|X)}{P(Y)}$
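The three measures are simple ratios of transaction counts; a minimal sketch over a hypothetical list of baskets:

```python
def rule_measures(transactions, x, y):
    """Support, confidence, and lift of the rule x -> y over a list of
    transactions (each transaction a set of items)."""
    n = len(transactions)
    n_x = sum(1 for t in transactions if x in t)
    n_y = sum(1 for t in transactions if y in t)
    n_xy = sum(1 for t in transactions if x in t and y in t)
    support = n_xy / n                 # P(X, Y)
    confidence = n_xy / n_x            # P(Y|X)
    lift = confidence / (n_y / n)      # P(Y|X) / P(Y)
    return support, confidence, lift

baskets = [{"milk", "bread"}, {"milk"}, {"milk", "bread", "eggs"}, {"bread"}]
s, c, l = rule_measures(baskets, "milk", "bread")
print(s, round(c, 3), round(l, 3))  # 0.5 0.667 0.889
```

Here lift < 1: buying milk actually makes bread slightly less likely than its base rate, so support and confidence alone can be misleading.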
Apriori algorithm (Agrawal et al., 1996)

For (X, Y, Z), a 3-item set, to be frequent (have enough support), (X, Y), (X, Z), and (Y, Z) should be frequent.

If (X, Y) is not frequent, none of its supersets can be frequent.

Once we find the frequent k-item sets, we convert them to rules: X, Y → Z, ... and X → Y, Z, ...
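The pruning rule above can be sketched as a level-wise search; this is a simplified illustration of the idea, not the full algorithm of the paper, and the `baskets` data is made up:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining: a (k+1)-item set is a candidate
    only if all of its k-item subsets are frequent (Apriori pruning)."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support]
    result = list(frequent)
    k = 1
    while frequent:
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # prune candidates with an infrequent subset, then check support
        frequent = [c for c in candidates
                    if all(frozenset(s) in result for s in combinations(c, k))
                    and support(c) >= min_support]
        result += frequent
        k += 1
    return result

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
freq = apriori(baskets, min_support=0.6)
print(sorted("".join(sorted(s)) for s in freq))
# ['a', 'ab', 'ac', 'b', 'bc', 'c']
```

The 3-item set {a, b, c} is generated as a candidate because all three of its pairs are frequent, but it is then discarded since its own support (0.4) falls below the threshold.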