7/25/2019 Mulitvariate Random Trees
Model Building Training
Max Kuhn
Kjell Johnson
Global Nonclinical Statistics
http://home.pfizer.com/
Overview
Typical data scenarios
- Examples we'll be using
General approaches to model building
Data pre-processing
Regression-type models
Classification-type models
Other considerations
Typical Data
Response may be continuous or categorical
Predictors may be
- continuous, count, and/or binary
- dense or sparse
- observed and/or calculated
Predictive Models
What is a "predictive model"?
A model whose primary purpose is prediction (as opposed to inference)
We would like to know why the model works, as well as the relationship between predictors and the outcome, but these are secondary
Examples: blood-glucose monitoring, spam detection, computational chemistry, etc.
What Are They Not Good
The Big Picture
"In the end, [predictive modeling] is not a substitute for intuition, but a complement"
Ian Ayres, in Supercrunchers
References
"Statistical Modeling: The Two Cultures" by Leo Breiman (Statistical Science, Vol 16, #3 (2001), 199-231)
The Elements of Statistical Learning by Hastie, Tibshirani and
Regression Methods
Multiple linear regression
Partial least squares
Neural networks
Multivariate adaptive regression splines
Support vector machines
Regression trees
Ensembles of trees:
- Bagging, boosting, and random forests
Classification Methods
Discriminant analysis framework
- Linear, quadratic, regularized, flexible, and partial least squares discriminant analysis
Modern classification methods
- Classification trees
- Ensembles of trees
  Boosting and random forests
- Neural networks
- Support vector machines
- k-nearest neighbors
- Naive Bayes
Interesting Models We Don't Have Time
Example Data Sets
Boston Housing Data
This is a classic benchmark data set for regression. It includes housing data for 506 census tracts of Boston from the 1970 census.
crim: per capita crime rate
Indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitric oxides concentration
rm: average number of rooms per dwelling
Age: proportion of owner-occupied units built prior to 1940
dis: weighted distances to five Boston employment centers
rad: index of accessibility to radial highways
tax: full-value property-tax rate
ptratio: pupil-teacher ratio by town
b: proportion of minorities
Medv: median value homes (outcome)
Toy Classification Example
A simulated data set will be used to demonstrate classification models
- two predictors with a correlation coefficient of 0.5 were simulated
- two classes were simulated ("active" and "inactive")
A probability model was used to assign a probability of being active to each sample
- the 25%, 50% and 75% probability lines are shown on the right
Toy Classification Example
The classes were randomly assigned based on the probability
The training data had 250 compounds (plot on right)
- the test set also contained 250 compounds
With two predictors, the class boundaries can be shown for each model
- this can be a significant aid in understanding how the models work
- …but we acknowledge how unrealistic this situation is
Model Building Training
General Strategies
Objective
To construct a model of predictors that can be used to predict a response
Data -> Model -> Prediction
Model Building Steps
Common steps during model building are:
- estimating model parameters (i.e. training models)
- determining the values of tuning parameters that cannot be directly calculated from the data
- calculating the performance of the final model that will generalize to new data
The modeler has a finite amount of data, which they must "spend" to accomplish these steps
- How do we "spend" the data to find an optimal model?
"Spending" Data
We typically "spend" data on training and test data sets
- Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model.
- Test Set (aka validation set): these data can be used to get an independent assessment of model efficacy. They should not be used during model training.
The more data we spend, the better estimates we'll get (provided the data is accurate). Given a fixed amount of data,
- too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (overfitting)
- too much spent in testing won't allow us to get a good assessment of model parameters
Methods for Creating a Test Set
How should we split the data into a training and test set?
Often, there will be a scientific rationale for the split and in other cases, the splits can be made empirically.
Several empirical splitting options:
- completely random
- stratified random
- maximum dissimilarity in predictor space
Creating a Test Set: Completely Random Splits
A completely random (CR) split randomly partitions the data into a training and test set
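A completely random split can be sketched in a few lines. This is only an illustration: the function name `random_split`, the 30% test fraction, and the seed are assumptions made for the example, not part of the original slides.

```python
import random

def random_split(n_samples, test_fraction=0.2, seed=0):
    """Randomly partition row indices 0..n_samples-1 into (train, test)."""
    rng = random.Random(seed)          # seeded so the split is reproducible
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    # The first n_test shuffled indices form the test set; the rest train.
    return sorted(indices[n_test:]), sorted(indices[:n_test])

# Hypothetical usage: 10 samples, 30% held out for testing.
train_idx, test_idx = random_split(10, test_fraction=0.3)
```

Every sample lands in exactly one of the two sets, and nothing about the outcome is used when choosing the split.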
Creating a Test Set: Stratified Random Splits
A stratified random (SR) split makes a random split within stratification groups
- in classification, the classes are used as strata
- in regression, groups based on the quantiles of the response are used as strata
Stratification attempts to preserve the distribution of the outcome between the training and test sets
- An SR split is more appropriate for unbalanced data
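For the classification case, a stratified split might look like the sketch below. The helper name `stratified_split` and the 10/40 class imbalance are assumptions chosen for illustration; the regression case (strata from response quantiles) is not shown.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class contributes the same fraction to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)      # group row indices by class (the strata)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)           # random split *within* each stratum
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)

# Hypothetical unbalanced data: 10 "active" and 40 "inactive" compounds.
labels = ["active"] * 10 + ["inactive"] * 40
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
```

With these inputs the rare class keeps its 20% share of the test set, which a completely random split does not guarantee.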
Over-Fitting
How Do We Estimate Over-Fitting
K-fold Cross Validation
Here, we randomly split the data into K blocks of roughly equal size
We leave out the first block of data and fit a model.
This model is used to predict the held-out block
We continue this process until we've predicted all K hold-out blocks
The final performance is based on the hold-out predictions
K-fold Cross Validation
The schematic below shows the process for K = 3 groups.
- K is usually taken to be 5 or 10
- leave one out cross-validation has each sample as a block
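The K-fold procedure can be sketched as an index generator. The name `k_fold_indices` and the way blocks are formed (every k-th shuffled index) are illustrative assumptions; any scheme that yields K roughly equal random blocks would serve.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, held_out) index lists, one pair per block."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k blocks of roughly equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, held_out

# Hypothetical usage: K = 3 blocks over 10 samples.
splits = list(k_fold_indices(10, k=3))
```

Each sample is held out exactly once, so the hold-out predictions cover the whole data set. Setting `k` equal to the number of samples gives leave-one-out cross-validation.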
Leave Group Out Cross Validation
A random proportion of data (say 80%) are used to train a model
The remainder is used to predict performance
This process is repeated many times and the average performance is used
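A minimal sketch of leave-group-out resampling follows. The function name, the 80% training fraction, and the 30 repeats are assumptions chosen for the example; model fitting and scoring are left out.

```python
import random

def leave_group_out(n_samples, train_fraction=0.8, n_repeats=30, seed=0):
    """Yield (train, held_out) index lists for repeated random splits."""
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    for _ in range(n_repeats):
        # A fresh random permutation for every repeat.
        shuffled = rng.sample(range(n_samples), n_samples)
        yield shuffled[:n_train], shuffled[n_train:]

# Hypothetical usage: 250 samples, so 200 train and 50 are held out each time.
splits = list(leave_group_out(250))
```

Unlike K-fold, a given sample may be held out several times or not at all; performance is averaged over the repeats.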
Bootstrapping
Bootstrapping takes a random sample with replacement
- the random sample is the same size as the original data set
- compounds may be selected more than once
- each compound has a 63.2% chance of showing up at least once
Some samples won't be selected
- these samples will be used to predict performance
The process is repeated multiple times (say 30)
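The 63.2% figure comes from 1 - (1 - 1/n)^n, which approaches 1 - 1/e as n grows. The sketch below draws one bootstrap sample and evaluates that probability; the function names and the sample size of 250 are assumptions for illustration.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    # Samples never drawn are held out and used to estimate performance.
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

def prob_selected_at_least_once(n_samples):
    """Chance that a given sample appears in a bootstrap sample of the same size."""
    return 1.0 - (1.0 - 1.0 / n_samples) ** n_samples

rng = random.Random(0)
in_bag, out_of_bag = bootstrap_sample(250, rng)
p = prob_selected_at_least_once(250)
```

The size of `out_of_bag` changes from one draw to the next, which is why the number of held-out samples is random.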
The Bootstrap
With bootstrapping, the number of held-out samples is random
Some models, such as random forest, use bootstrapping within the modeling process to reduce over-fitting
Training Models with Tuning Parameters
A single training/test split is often not enough for models with tuning parameters
We must use resampling techniques to get good estimates of model performance over multiple values of these parameters
We pick the complexity parameter(s) with the best performance and re-fit the model using all of the data
Simulated Data Example
Let's fit a nearest neighbors model to the simulated classification data.
The optimal number of neighbors must be chosen
If we use leave group out cross-validation and set aside 20%, we will fit models to a random 200 samples and predict 50 samples
- 30 iterations were used
We'll train over 11 odd values for the number of neighbors
- we also have a 250 point test set
Toy Data Example
The plot on the right shows the classification accuracy for each value of the tuning parameter
- The grey points are the 30 resampled estimates
- The black line shows the average accuracy
- The blue line is the 250 sample test set
It looks like 7 or more neighbors is optimal with an estimated accuracy of [illegible]%
Toy Data Example
What if we didn't resample and used the whole data set?
The plot on the right shows the accuracy across the tuning parameters
This would pick a model that over-fits and has optimistic performance
Model Building Training
Data Pre-Processing
Why Pre-Process?
In order to get effective and stable results, many models require certain assumptions about the data
- this is model dependent
We will list each model's pre-processing requirements at the end
In general, pre-processing rarely hurts model performance, but could make model interpretation more difficult
Common Pre-Processing Steps
Zero Variance Predictors
Most models require that each predictor have at least two unique values
Why?
- A predictor with only one unique value has a variance of zero and contains no information about the response.
It is generally a good idea to remove them.
"Near Zero Variance" Predictors
Additionally, if the distributions of the predictors are very sparse,
- this can have a drastic effect on the stability of the model solution
- zero variance descriptors could be induced during resampling
But what does a "near zero variance" predictor look like?
"Near Zero Variance" Predictor
There are two conditions for an "NZV" predictor
- a low number of possible values, and
- a high imbalance in the frequency of the values
NZV Example
In computational chemistry we created predictors based on structural characteristics of compounds.
As an example, the descriptor "nR11" is the number of 11-member rings
The table to the right is the distribution of nR11 from a training set
- the distinct value percentage is 5/535 = 0.93%
- the frequency ratio is 501/23 = 21.8

# 11-Member Rings
Value   Frequency
0       501
1       4
2       23
3       5
4       2
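Both quantities can be recomputed from the frequency table. In this sketch the frequency ratio is taken as the count of the most common value divided by the count of the second most common value, which matches the 501/23 shown above; the function name `nzv_metrics` is an assumption for illustration.

```python
from collections import Counter

def nzv_metrics(values):
    """Return (percent of distinct values, frequency ratio) for one predictor."""
    counts = Counter(values)
    percent_unique = 100.0 * len(counts) / len(values)
    ranked = counts.most_common(2)     # two most frequent values with counts
    if len(ranked) > 1:
        freq_ratio = ranked[0][1] / ranked[1][1]
    else:
        freq_ratio = float("inf")      # a single distinct value: zero variance
    return percent_unique, freq_ratio

# The nR11 distribution from the table: 535 compounds, 5 distinct values.
nr11 = [0] * 501 + [1] * 4 + [2] * 23 + [3] * 5 + [4] * 2
percent_unique, freq_ratio = nzv_metrics(nr11)
```

Here `percent_unique` is about 0.93 and `freq_ratio` about 21.8, in line with the values on the slide.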
Detecting NZVs
Two criteria for detecting NZVs are the
- Discrete value percentage
  Defined as the number of unique values divided by the number of observations
  Rule-of-thumb: discrete value percentage < 20% could indicate a problem
- Frequency ratio
Highly Correlated Predictors
Some models can be negatively affected by highly correlated predictors
- certain calculations (e.g. matrix inversion) can become severely unstable
How can we detect these predictors?
- Variance inflation factor (VIF)
Highly Correlated Predictors and Resampling
Recall that resampling slightly perturbs the training data set to increase variation
If a model is adversely affected by high correlations between predictors, the resampling performance estimates can be poor in comparison to the test set
- In this case, resampling does a better job at predicting how the model works on future samples
Centering and Scaling
Standardizing the predictors can greatly improve the stability of model calculations.
More importantly, there are several models (e.g. partial least squares) that implicitly assume that all of the predictors are on the same scale
Apart from the loss of the original units, there is no real downside of centering and scaling
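Centering and scaling a single predictor amounts to subtracting its mean and dividing by its standard deviation. The sketch below uses the sample standard deviation; the function name and the toy values are assumptions for illustration.

```python
from statistics import mean, stdev

def center_and_scale(column):
    """Return the column shifted to mean 0 and rescaled to standard deviation 1."""
    m, s = mean(column), stdev(column)   # stdev is the sample (n - 1) version
    return [(x - m) / s for x in column]

# Hypothetical predictor values in their original units.
scaled = center_and_scale([2.0, 4.0, 6.0, 8.0])
```

After the transformation every predictor is unitless and on a common scale, which is the property that methods such as partial least squares rely on.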
Model Building Training
Regression-type Models
Setting
Response is continuous
Objective
To construct a model of predictors that can be used to predict a response
Data -> Model -> Prediction
Regression Methods
Multiple linear regression
Partial least squares
Neural networks
Multivariate adaptive regression splines
Support vector machines
Regression trees
Ensembles of trees:
- Bagging, boosting, and random forests
Each of these methods seeks to find a relationship between the predictors and response that minimizes error between the observed and predicted response
Additive Models
In the beginning there were linear models:
$$E(Y) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
And Hastie and Tibshirani (1990) said, "Let there be Generalized Additive Models":
$$E(Y) = f_0 + f_1(X_1) + \cdots + f_p(X_p)$$
and scatterplot smoothers and backfitting algorithms appeared.
And Nelder and Wedderburn (1972) said, "Let there be Generalized Linear Models":
$$g(E(Y)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
and link functions appeared.
Assessing Model Performance
How well does a regression model perform? Answering this question depends on how we want to use the model. Possible goals are:
- To understand the relationship between the predictor and the response.
- To use the model to predict future observations' response.
In either case, we can use several different measures to evaluate model performance. We will focus on two:
- Coefficient of determination (R2)
- Root mean square error (RMSE)
However, the set of data that we use to evaluate performance will change depending on our purpose.
Which Set of Data to Use to Evaluate Performance?
If we are only interested in understanding the underlying relationship between the predictor and the response, then we can compute R2 and RMSE on the data for which the model was built (i.e. the training data).
- However, these values will be overly optimistic of the model's ability to predict future observations.
If we are interested in understanding the model's ability to predict future observations, then we need to compute R2 and RMSE on data for which the model was not built (i.e. a test set or cross-validation set).
Root Mean Squared Prediction Error (RMSPE)
RMSE measures the average deviation of an observation to the best-fit plane
$$\mathrm{RMSE} = \sqrt{\frac{\mathrm{SSE}}{n - (p + 1)}}$$
RMSPE measures the average deviation of an observation to its predicted value for the test or cross-validation set
$$\mathrm{RMSPE} = \sqrt{\frac{\sum_{i=1}^{n^*} (y_i - \hat{y}_i)^2}{n^*}}$$
n* = the number of observations in the test or cross-validation set
Computing Q2
Process:
- Partition the data into
  a training and testing set, or
  blocks to be used for training and testing
- Build the model on the training data and predict the testing data
Q2 = R2 of the relationship between the observed and predicted values for the testing data.
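Given held-out observations and their predictions, RMSPE and Q2 can be computed directly. In this sketch Q2 is taken to be the squared Pearson correlation between observed and predicted values, which is one reading of "R2 of the relationship"; that choice, the function names, and the toy numbers are assumptions.

```python
from math import sqrt
from statistics import mean

def rmspe(observed, predicted):
    """Root mean squared prediction error over the held-out observations."""
    n_star = len(observed)
    squared = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))
    return sqrt(squared / n_star)

def q2(observed, predicted):
    """Squared correlation between observed and predicted held-out values."""
    mo, mp = mean(observed), mean(predicted)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    return sxy ** 2 / (sxx * syy)

# Hypothetical test-set values and model predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
error = rmspe(observed, predicted)
fit = q2(observed, predicted)
```

Note that RMSPE divides by n*, not by n - (p + 1), because no parameters are estimated from the held-out data.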
Multiple Linear Regression: A Quick Review
Multiple Linear Regression
Objective:
The Best Plane
To find the best plane, we solve:
$$\min_{\beta} \lVert Y - X\beta \rVert^2$$
- where $Y$ is $n \times 1$, $X$ is $n \times (p+1)$, and $\beta$ is $(p+1) \times 1$
The best $\beta$ is:
$$\hat{\beta} = \left(X^T X\right)^{-1} X^T Y$$
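For one predictor plus an intercept, the least-squares solution can be written out by hand, since X^T X is only 2 x 2. The sketch below does exactly that; the function name `fit_line` and the toy data are assumptions, and a real analysis would use a linear algebra library.

```python
def fit_line(x, y):
    """Solve (X^T X) beta = X^T Y for an intercept and one slope."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    # X^T X = [[n, sx], [sx, sxx]]; its determinant must be non-zero to invert.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular: no unique least-squares solution")
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Hypothetical data lying exactly on the line y = 1 + 2x.
intercept, slope = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

The explicit determinant check shows where invertibility of X^T X enters the calculation.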
Aside: A Bit More About (X^T X)
(X^T X) is a critical matrix for many statistical modeling techniques
A few fun facts…
- (X^T X) is proportional to the covariance matrix, S, when the predictors are centered
  S contains the variances and covariances of all predictors
- Techniques that depend on (X^T X) also require that it is invertible
Assumptions: Diagnostic Plots
When Does Regression
A (Trivial) Example of Multicollinearity
Suppose that we have one observation (3,5), and we wish to find the 'best' line for the data. In this example, the number of observations (1) is less than the number of parameters (2: slope and intercept). When the number of parameters is greater than the number of observations, we can find an infinite number of 'best' solutions.
[Figure: several candidate lines ("solutions") passing through the single observation]
In the presence of multicollinearity, the best solution will be unstable
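The point of the trivial example can be checked numerically: any line through the single observation has zero error, so no slope is preferred. The candidate slopes below are arbitrary values chosen for illustration.

```python
def sse(intercept, slope, points):
    """Sum of squared residuals of the line over the given (x, y) points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)

points = [(3.0, 5.0)]                      # the single observation
candidate_slopes = [-1.0, 0.0, 2.0]        # arbitrary, for illustration only
# For each slope, pick the intercept that makes the line pass through (3, 5).
solutions = [(5.0 - s * 3.0, s) for s in candidate_slopes]
errors = [sse(b0, b1, points) for b0, b1 in solutions]
```

All three lines fit perfectly, and so would any other slope, which is what "an infinite number of best solutions" means here.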
Boston Housing Data
Let's use a linear regression model to predict the median house price in Boston.
Process:
- Split the data into a training set (n = 337) and testing set (n = 169)
Results
The results are fairly similar, at least within the variation of resampling
One reason you may see differences: multicollinearity
- Multicollinearity in the predictors can produce somewhat unstable solutions for each resample
- When the data are slightly changed, the model can drastically change
The test set is a single, static set of data for verification
- The bootstrap estimate of performance may be better with collinearity

              Training Data (bootstrap)    Test Data
              RMSE     Q2                  RMSE     R2
Linear Reg    5.23     [illegible]         4.53     0.742
Partial Least Squares Regression
Solutions for Overdetermined Covariance Matrices
Variable reduction
- Try to accomplish this through the pre-processing steps
Partial least squares (PLS)
Other methods
- Apply a generalized inverse
- Ridge regression: Adjusts the variance/covariance matrix so that we can find a unique inverse.
- Principal component regression (PCR)
  not recommended, but it's a good way to understand PLS
Understanding Partial Least Squares: Principal Components Analysis
PCA seeks to find linear combinations of the original variables that summarize the maximum amount of variability in the original data
- These linear combinations are often called principal components or scores.
- A principal direction is a vector that points in the direction of maximum variance.
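One way to see a principal direction concretely is power iteration on the sample covariance matrix. This is a sketch under stated assumptions, not the method used in the slides: the function name, the fixed iteration count, and the starting vector of ones are all illustrative, and the starting vector must not be orthogonal to the leading direction.

```python
from math import sqrt

def first_principal_direction(rows, n_iter=200):
    """Return (unit direction of maximum variance, scores along it)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the predictors.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]          # keep the direction at unit length
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# Hypothetical data with two perfectly correlated predictors.
direction, scores = first_principal_direction([(1, 1), (2, 2), (3, 3), (4, 4)])
```

For this toy data the direction points along the diagonal, and the scores are the centered positions of the points along that diagonal.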
Principal Components Analysis
PCA is inherently an optimization problem, which is subject to two constraints
1. The principal directions have unit length
2. Either
   a. Successively derived scores are uncorrelated to previously derived scores, OR
   b. Successively derived directions are required to be orthogonal to previously derived directions
In the mathematical formulation, either constraint implies the other constraint
Principal Components Analysis
[Figure: scatter of Predictor 2 against Predictor 1, annotated with Direction 1 and a Score]
http://pfizerpedia/index.php/Image:PCAmovie.gif
Mathematically Speaking…
The optimization problem defined by PCA can be solved through the following formulation:
subject to constraints 2a or 2b
PCA Benefits and Drawbacks
Benefits
- Dimension reduction
  We can often summarize a large percentage of original variability with only a few directions
- Uncorrelated scores
  The new scores are not linearly related to each other
Drawbacks
- PCA "chases" variability
  PCA directions will be drawn to predictors with the most variability
  Outliers may have significant influence on the directions and resulting scores.
Principal Component Regression
Procedure:
1. Reduce dimension of predictors using PCA
2. Regress the response on the scores
Notice: The procedure is sequential
Principal Component Regression
Dimension reduction is independent of the objective
[Diagram: Predictor Variables -> (PCA) -> PC Scores -> (MLR) -> Response Variable]
[Figure: Scatter of First PCA Scores with Response; x-axis: first PCA scores, y-axis: response]
PLS History
H. Wold (1966, 1975)
S. Wold and H. Martens (1983)
Stone and Brooks (1990)
Latent Variable Model
[Diagram: Predictors 1-6 -> Latent Variables 1 and 2 -> Responses 1-3]
Note: PLS can handle multiple response variables
Comparison with Regression
[Diagram: Predictors 1-5 -> Response 1]
PLS Optimization (many predictors, one response)
PLS seeks to find linear combinations of the independent variables that summarize the maximum amount of co-variability with the response.
- These linear combinations are often called PLS components or PLS scores.
- A PLS direction is a vector that points in the direction of maximum co-variance.
PLS Optimization (many predictors, one response)
PLS is inherently an optimization problem, which is subject to two constraints
1. The PLS directions have unit length
2. Either
   a. Successively derived scores are uncorrelated to previously derived scores, OR
   b. Successively derived directions are orthogonal to previously derived directions
Unlike PCA, either constraint does NOT imply the other constraint
Constraint 2a is most commonly implemented
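Mathematically Speaking…
The optimization problem defined by PLS can be solved through the following formulation:
$$\arg\max_{a} \frac{\operatorname{Cov}^2(a^T X, Y)}{a^T a}$$
subject to constraints 2a or 2b
Because
$$\operatorname{Cov}^2(a^T X, Y) = \operatorname{Var}(a^T X)\,\operatorname{Corr}^2(a^T X, Y)\,\operatorname{Var}(Y)$$
and $\operatorname{Var}(Y)$ does not depend on $a$, this is the same as
$$\arg\max_{a} \frac{\operatorname{Var}(a^T X)\,\operatorname{Corr}^2(a^T X, Y)}{a^T a} = \arg\max_{a} \frac{\operatorname{Var}(\text{scores})\,\operatorname{Corr}^2(\text{scores}, \text{response})}{a^T a}$$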
PLS is Simultaneous Dimension Reduction and Regression
$$\max \; \operatorname{Var}(\text{scores}) \cdot \operatorname{Corr}^2(\text{response}, \text{scores})$$
- Var(scores): dimension reduction (PCA)
- Corr2(response, scores): regression
PLS Benefits and Drawbacks
Benefit
- Simultaneous dimension reduction and regression
Drawbacks
- Similar to PCA, PLS "chases" co-variability
  PLS directions will be drawn to independent variables with the most variability (although this will be tempered by the need to also be related to the response)
  Outliers may have significant influence on the directions, resulting scores, and relationship with the response. Specifically, outliers can
  - make it appear that there is no relationship between the predictors and response when there truly is a relationship, or
  - make it appear that there is a relationship between the predictors and response when there truly is no relationship
Partial Least Squares
Simultaneous dimension reduction and regression
[Diagram: Predictor Variables -> (PLS) -> Response Variable]
[Figure: Scatter of First PLS Scores with Response; x-axis: first PLS scores, y-axis: response]
PLS in Practice
PLS seeks to find latent variables (LVs) that summarize variability and are highly predictive of the response.
How do we determine the number of LVs to compute?
- Evaluate RMSPE (or Q2)
The optimal number of components is the number of components that minimizes RMSPE
PLS for the Boston housing data:
Training the PLS Model
Since PLS can handle highly correlated variables, we fit the model using all 12 predictors
The model was trained with up to 6 components
RMSE drops noticeably from 1 to 2 components and some for 2 to 3 components.
- Models with 3 or more components might be sufficient for these data
Training the PLS Model
Roughly the same profile is seen when the models are judged on R2
Boston Housing Results
Using the two component model, we can predict the test set
PLS training statistics are similar to those from linear regression
Both methods perform about the same in the test set

              Training Data (bootstrap)    Test Data
              RMSE     Q2                  RMSE     R2
Linear Reg    5.23     [illegible]         4.53     0.742
PLS           5.25     [illegible]         4.56     [illegible]
PLS Model
PLS Optimization (2) (many predictors, many responses)
PLS seeks to find linear combinations of the independent variables and a linear combination of the dependent variables that summarize the maximum amount of co-variability between the combinations.
- These linear combinations are often called PLS X-space and Y-space components or PLS X-space and Y-space scores.
- Likewise, X-space and Y-space PLS directions point in the direction of maximum co-variance between the spaces.
PLS Optimization (2) (many predictors, many responses)
PLS is inherently an optimization problem, which is subject to two constraints
1. The X-space and Y-space PLS directions have unit length
2. Either
   a. Successively derived scores in each space are uncorrelated to previously derived scores, OR
   b. Successively derived directions in each space are orthogonal to previously derived directions
Constraint 2a is most commonly implemented
Mathematically Speaking…
The optimization problem defined by PLS can be solved through the following formulation:
$$\arg\max_{a,b} \frac{\operatorname{Cov}^2(a^T X, b^T Y)}{(a^T a)(b^T b)} = \arg\max_{a,b} \frac{\operatorname{Var}(a^T X)\,\operatorname{Corr}^2(a^T X, b^T Y)\,\operatorname{Var}(b^T Y)}{(a^T a)(b^T b)}$$
subject to constraints 2a or 2b
PLS is Simultaneous Dimension Reduction and Regression
$$\max \; \operatorname{Var}(X\text{-scores}) \cdot \operatorname{Corr}^2(X\text{-scores}, Y\text{-scores}) \cdot \operatorname{Var}(Y\text{-scores})$$
- Var(X-scores): X-space dimension reduction (PCA)
- Corr2(X-scores, Y-scores): regression
- Var(Y-scores): Y-space dimension reduction (PCA)
Neural Networks
Like PLS or PCR, these models create intermediary latent variables that are used to predict the outcome
Neural networks differ from PLS or PCR in a few ways
- the objective function used to derive the new variables is different
- the latent variables are created using flexible, highly nonlinear functions
- the latent variables usually do not have any meaning
Network Structures
There are many types of neural network structures
- we will concentrate on the single layer, feed-forward network
[Diagram: Predictors 1-5 -> Hidden Units 1, 2, ..., k -> Response 1]
7/25/2019 Mulitvariate Random Trees
106/279
1@
The transition from this sub-model to the hidden units is nonlinear
- sigmoidal functions, such as the logistic function, are typically used
The hidden units are then used to predict the outcome using simple linear combinations
Clearly, the parameters are not identifiable and the hidden units have no real meaning (unlike PCA)
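The forward pass described above can be sketched in a few lines. This is a minimal illustration, not the implementation used for the slides: the function names and all weight values are hypothetical, and the logistic function is assumed as the sigmoidal transform.

```python
import math

def logistic(z):
    """Sigmoidal (logistic) transform applied at each hidden unit."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, hidden_weights, hidden_bias, out_weights, out_bias):
    """Forward pass of a single-layer, feed-forward regression network."""
    # Each hidden unit is a nonlinear function of a linear combination of predictors
    hidden = [
        logistic(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(hidden_weights, hidden_bias)
    ]
    # The outcome is a simple linear combination of the hidden units
    return out_bias + sum(v * h for v, h in zip(out_weights, hidden))

# Hypothetical network: 2 predictors, 2 hidden units (all values made up)
x = [0.5, -1.0]
hidden_weights = [[1.0, 2.0], [-1.0, 0.5]]
hidden_bias = [0.0, 0.0]
out_weights = [3.0, -2.0]
y_hat = predict(x, hidden_weights, hidden_bias, out_weights, out_bias=1.0)
```

Reordering the hidden units (together with their output weights) leaves the prediction unchanged, which is one simple way to see why the parameters are not identifiable.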
Training Networks
It is highly recommended that the predictors are centered and scaled prior to training
The number of hidden units is a tuning parameter
With many predictors and hidden units, the number of estimated parameters can become very large
- with a large number of hidden units, these models can quickly start to overfit
Random starting values are typically used to initialize the parameter estimates
Weight Decay
This is a training technique that attempts to “shrink” the parameter estimates towards zero
- large parameter estimates are penalized in the model training
This leads to smoother, less extreme models
- the effect of weight decay is demonstrated for classification models
Boston Housing Data
The model seems to do well with fewer components (not typical)
The final model used a high value for weight decay and 1 hidden unit
This model seems to be an improvement compared to the others
Training Data (bootstrap): RMSE, R2 | Test Data: RMSE, R2
Hinear )eg ;:2, :@1 0:;, :D02
-HS ;:2; :@ 0:;@ :D,
Neural Net 0:@ :D;D 0:2 :D
Support Vector Machines (SVMs)
SVMs are predictive statistical models developed in 1963 by Vapnik that were significantly expanded in the 1990s
These models were initially developed for classification, but were later adapted for regression models
Objective
Recall that linear regression estimates parameters by calculating:
- the model residuals
- the total sum of the squared residuals (SSR)
The parameters with the smallest SSR are optimal
Objective
Support vector machine regression models create a “funnel” around the regression line
- residuals within the funnel are not counted in the parameter estimation
- the sum of the residuals outside the funnel is used as the objective function (no squared term)
A funnel size set to 1 SD of the outcome is not a bad place to start
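The contrast between the two objectives can be sketched as follows. This is an illustration only: it assumes the usual epsilon-insensitive form, where a residual outside the funnel contributes its distance beyond the funnel edge, and the function names and residual values are hypothetical.

```python
def ssr(residuals):
    """Linear regression objective: total sum of squared residuals."""
    return sum(r * r for r in residuals)

def funnel_loss(residuals, funnel_size):
    """SVM-style objective: residuals inside the funnel count as zero,
    residuals outside count linearly (no squared term)."""
    return sum(max(0.0, abs(r) - funnel_size) for r in residuals)

# Made-up residuals; the last one plays the role of an outlier
residuals = [0.5, -0.2, 3.0]
squared = ssr(residuals)
linear = funnel_loss(residuals, funnel_size=1.0)
```

With these values the outlier dominates the squared objective far more than the funnel objective, which only grows linearly with the size of the residual.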
The SVM Model Optimization
Like Huber-type robust regression, outliers have a linear effect on the objective function
Overfitting can be controlled by using a penalized objective function (more later)
Quadratic programming methods are needed to solve these equations
Support Vectors and Data Reduction
The points that are outside the funnel (or on its boundary) are the support vectors
It turns out that the prediction function only uses the support vectors
- the prediction equation is more compact and efficient
- the model may be more robust to outliers
Support Vectors and Data Reduction
The model fitting routine produces values (α) that are non-zero for all of the support vectors
To predict a new sample, the original training data for the non-zero values are needed:
Nonlinear Boundaries
Nonlinear boundaries can be computed using the “kernel trick”
The predictor space can be expanded by adding nonlinear functions of the predictors
Common kernel functions are:
Nonlinear Boundaries
The “trick” is that the computations can operate only on the inner products of the extended predictor set
In this way, the predictor space dimension can be greatly expanded without much computational impact
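A small numeric check can make the “trick” concrete. The sketch below assumes a degree-2 polynomial kernel on two predictors (chosen here for illustration; it is not taken from the slides) and shows that the kernel value equals the inner product of the explicitly expanded predictors.

```python
import math

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def expand(x):
    """Explicit degree-2 expansion of two predictors (illustrative choice)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def poly_kernel(x, z):
    """Same quantity, computed from the original predictors only."""
    return dot(x, z) ** 2

x, z = [1.0, 2.0], [3.0, 4.0]
explicit = dot(expand(x), expand(z))   # works in the expanded space
implicit = poly_kernel(x, z)           # never builds the expanded space
```

Both routes give the same number, but the kernel version never forms the expanded predictor set, which is why the dimension can grow without much computational cost.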
Cost Functions
Support vector machines also include a regularization parameter that controls how much the regression line can adapt to the data
- smaller values result in more linear (i.e. flat) surfaces
This parameter is generally referred to as “Cost”
As previously mentioned, there is a way to analytically estimate the tuning parameter for the RBF
Comparison
Bagging can significantly increase performance of trees
- from resampling:
The cost is computing time and the loss of interpretation
One reason that bagging works is that single trees are unstable
- small changes in the data may drastically change the tree
Training Data (bootstrap): RMSE, R2 | Test: RMSE, R2
Single Tree ;:1 :D 0:2 :D
Bagging 0:,2 :D@ ,:@ :2;
Random Forests
Random forests models are similar to bagging
- separate models are built for each bootstrap sample
- the largest tree possible is fit for each bootstrap sample
However, when random forests starts to make a new split, it only considers a random subset of predictors
- the subset size is the (optional) tuning parameter
Random forests defaults to a subset size that is the square root of the number of predictors and is typically robust to this parameter
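The per-split predictor sampling can be sketched as below. This is a hedged illustration: the function name, the predictor names and the choice to round the square root down are assumptions, not details from the slides.

```python
import math
import random

def candidate_predictors(predictor_names, rng, subset_size=None):
    """Pick the random subset of predictors considered at one split."""
    if subset_size is None:
        # Default described above: square root of the number of predictors
        # (rounded down here, which is an assumption)
        subset_size = max(1, int(math.sqrt(len(predictor_names))))
    return rng.sample(predictor_names, subset_size)

# Hypothetical data set with 13 predictors named x1..x13
names = [f"x{i}" for i in range(1, 14)]
chosen = candidate_predictors(names, random.Random(1))
```

A fresh subset would be drawn at every split of every tree, so different trees end up using different predictors even when one predictor is dominant.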
Random Predictor Illustration
[Figure: randomly select a subset of variables from the original data to form Dataset 1, Dataset 2, …, Dataset M; build a tree on each dataset; each tree produces a prediction]
Prediction of an observation, $x$:
$$F(x) = \frac{1}{M}\sum_{m=1}^{M} f_m(x)$$
Properties of Random Forests
Variance reduction
- Averaging predictions across many models provides more stable predictions and model accuracy (Breiman, 1996)
Robustness to noise
- All observations have an equal chance to influence each model in the ensemble
- Hence, outliers have less of an effect on individual models for the overall predicted values
Comparison
Comparing the three methods using resampling:
Both bagging and random forests are “memoryless”
- each bootstrap sample doesn't know anything about the other samples
Training Data (bootstrap): RMSE, R2 | Test: RMSE, R2
Single Tree ;:1 :D 0:2 :D
Bagging 0:,2 :D@ ,:@ :2;
)and
A method to “boost” weak learning algorithms (small trees) into strong learning algorithms
- Kearns and Valiant (1989), Schapire (1990),
[Figure: boosting in stages 1, 2, …, M, each fit to n = 200 observations. At each stage: build a weighted tree, compute the error, compute the stage weight, and reweight the observations (i = 1, 2, …, n). The larger the error, the higher the weight of the ith observation.]
Boosting Trees
Boosting has three tuning parameters:
- number of iterations (i.e. trees)
- complexity of the tree (i.e. number of splits)
- learning rate: how quickly the algorithm adapts
This implementation is the most computationally taxing of the tree methods shown here
Prediction of an observation, $x$:
$$F(x) = \sum_{m=1}^{M} \beta_m f_m(x)$$
where the $\beta_m$ are constrained to sum to 1
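The difference between the two ensemble predictions (an equal-weight average for bagging and random forests, a weighted sum for boosting) can be sketched as follows. The "trees" here are stand-in functions and the stage weights are made up; none of these values come from the slides.

```python
def averaged_prediction(models, x):
    """Bagging / random forests: every model gets the same weight 1/M."""
    return sum(f(x) for f in models) / len(models)

def weighted_prediction(models, stage_weights, x):
    """Boosting: each stage's model is weighted; weights must sum to 1."""
    if abs(sum(stage_weights) - 1.0) > 1e-9:
        raise ValueError("stage weights are expected to sum to 1")
    return sum(w * f(x) for w, f in zip(stage_weights, models))

# Three hypothetical fitted models standing in for trees
models = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: 3.0]
avg = averaged_prediction(models, 2.0)
boosted = weighted_prediction(models, [0.25, 0.25, 0.5], 2.0)
```

Setting every stage weight to 1/M recovers the simple average, so the averaged form is a special case of the weighted one.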
Properties of Boosting
Robust to overfitting
- As the number of iterations increases, the test set error does not increase
- Schapire, et al. (1998),
One approach to training is to set the learning rate to a high value (0.1) and tune the other two parameters
In the plot to the right, a grid of combinations of the 2 tuning parameters was used to optimize the model
The optimal settings were:
# ; trees with high co%"lexit!
Comparison Summary
Comparing the four methods:
Training Data (bootstrap): RMSE, R2 | Test: RMSE, R2
Single Tree ;:1 :D 0:2 :D
Bagging 0:,2 :D@ ,:@ :2;
)and
Model Building Training
Model Comparisons
Which Model is Best?
The No
Excellent | Very Good | Average
ZV = zero var predictor, NZV = near-zero var predictor, CS = center+scale, HCP = highly correlated predictor
* Depends on implementation
Boston Housing Data
The correlation between the results on the training set
6n,,D7 via cross(validation and the results +ro% the test
set 6n1@7 were :D1 6)MS$7 and :@; 6)27
Some Advice
There is an inverse relationship between performance and interpretability
We want the best of both worlds: great performance and a simple, intuitive model
Try this:
Regression Datasets
Internet Movie Data Base
IMDB is an on-line resource that catalogs movies and TV programs from many countries.
Basic information about the program is maintained and users can rate each program on a five point scale.
We extracted information about movies and captured:
- the average vote
- the number of votes
- basic information: run time, rating (if any), year of release, etc.
- genre: drama, comedy, etc. and
- keywords: based on novel, female lead, title spoken by character…
Can we predict the movie rating based on these data?
Tecator Spectroscopy Data
From StatLib
The variables are spectral measurements at specific wavelengths and are highly autocorrelated.
We wish to predict the percent fat for each sample.
Towson Home Sales
Information about homes sold in the Towson, Maryland area (north of Baltimore) was collected.
The area encompasses the northern border of Baltimore city (Idlewylde), suburban areas (Anneslie, Rodgers
Regression Backup Slides
SVM Model
MARS Model
Regression Tree Model
Boosting Tree Model
Variable Importance for PLS
To understand the importance of each factor, we can look at a weighted sum of the absolute regression coefficients
- the weights are based on the decrease in error as more components are added
We can also look at the loadings to get a more detailed assessment
Variable Importance for PLS
Here, we can look at the increase in R2 as model terms are added
If the variable is never used in a term, it has an importance of zero
Variable Importance for Regression Trees
Here, we can look at the decrease in MSE as model terms are added
If the variable is never used in a split, it has an importance of zero
Variable Importance for Random Forests
A permutation approach is used
The training data for each variable are scrambled in turn and the % increase in the out-of-bag MSE is tracked
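The permutation idea can be sketched on a toy model. This is a simplified illustration with hypothetical names and data: it scrambles each predictor over the supplied rows rather than over true out-of-bag samples, and it assumes the baseline MSE is non-zero so that a percentage can be formed.

```python
import random

def mse(predict, rows, y):
    """Mean squared error of a prediction function on the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

def permutation_importance(predict, rows, y, rng):
    """% increase in MSE after scrambling each predictor in turn."""
    baseline = mse(predict, rows, y)  # assumed to be non-zero
    importance = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # scramble predictor j only
        scrambled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importance.append(100.0 * (mse(predict, scrambled, y) - baseline) / baseline)
    return importance

# Toy data: the outcome depends on the first predictor only
rows = [[float(i), float(i % 3)] for i in range(10)]
y = [2.0 * r[0] + (0.5 if i % 2 else -0.5) for i, r in enumerate(rows)]
model = lambda r: 2.0 * r[0]  # stand-in for a fitted model
scores = permutation_importance(model, rows, y, random.Random(0))
```

Scrambling a predictor that the model never uses leaves the error unchanged, so its importance is zero, while scrambling an informative predictor inflates the error.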
Boosting,
Boosting fits a forward stagewise additive model (Hastie, Tibshirani and
learning rate:
- a parameter that controls the rate of learning of observations that overlap on a decision boundary (
Linear regression models will fail if there are zero-variance predictors included
- They will also fail during cross-validation if any near-zero variance predictors are in the data
As just discussed, removing highly correlated predictors is strongly suggested
Centering and scaling are not required, but can greatly increase the numerical stability of the model
PLS Pre-Processing
Because of its dimension reduction abilities, PLS is resistant to zero- and near-zero variance predictors
Also, since PLS can handle (and perhaps exploit) correlated predictors, it is not necessary to remove them
Centering and scaling are extremely important for PLS models
- otherwise, the predictors with large variability can dominate the selection of components
Neural Network Pre-Processing
Neural network models will not fail with zero-variance predictors
However, these models use a large number of parameters and near-zero variance predictors may lead to numerical issues such as a failure to converge
Highly correlated predictors should be removed
- multicollinearity can have a significant effect on model performance
Centering and scaling are required
MARS Pre-Processing
MARS models are resistant to zero- and near-zero variance predictors
Highly correlated predictors are allowed, but this can lead to a significant amount of randomness during the predictor selection process
- The split choice between two highly correlated predictors becomes a toss-up
Centering and scaling are not required but are suggested
Tree Pre-Processing
A basic regression tree requires very little pre-processing
- missing predictor values are allowed
- centering and scaling are not required
  - centering and scaling do not affect results
- highly correlated predictors are allowed
  - including highly correlated descriptors can cause instability and make descriptor importance rankings somewhat random
- zero- and near-zero variance predictors are allowed
Model Building Training
Classification-type Models
Setting
Response is categorical
Response may have more than two categories
Objective
To construct a model of predictors that can be used to predict a response
Data → Model → Prediction
Classification Methods
Discriminant analysis framework
- Linear, quadratic, regularized, flexible, and partial least squares discriminant analysis
Modern classification methods
- Tree-based ensemble methods
  - Boosting and random forests
- Neural networks
- Support vector machines
- k-nearest neighbors
- Naive Bayes
Each of these methods seeks to find a partitioning of the data that minimizes classification error
Evaluating Classification Model Performance
Like regression models, we desire to understand the predictive ability of a classification model.
We can evaluate a model's performance by using cross-validation or a test set of data.
Minimize classification error (or maximize accuracy)
- Determine how well the model prediction agrees with the actual classification of observations.
                    Predicted
Actual      Active    Inactive    Total
Active      A         B           A+B
Inactive    C         D           C+D
Total       A+C       B+D         N
Intuition
An intuitive measure of accuracy is (A + D) / N
- When the actual classes are balanced, this is an appropriate measure of model performance.
But, this measure produces the same values for different tables:
5ctive Cnactive
5ctive ; ;
Cnactive ; 0;
5ctive Cnactive
5ctive ; ;
Cnactive ; 0;
vs
5ccurac! +or both tables is :
Does one table show more agreement than the other?
Another Measure: Kappa
To provide a measure of agreement for unbalanced tables, Cohen (1960) proposed comparing the observed agreement to the expected agreement
To compute Kappa, we need
- The observed agreement: $O = (A + D) / N$
- The expected agreement: $E = \dfrac{(A + C)(A + B) + (B + D)(C + D)}{N^2}$
Kappa is defined as: $\kappa = (O - E) / (1 - E)$
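The Kappa calculation can be sketched directly from the cell counts A, B, C and D of the two-class table. The two example tables below are hypothetical (they are not the tables from the slides) and were chosen so that both have 90% accuracy.

```python
def kappa(a, b, c, d):
    """Cohen's Kappa for a 2x2 table with cells A, B (first row) and C, D (second row)."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + c) * (a + b) + (b + d) * (c + d)) / n ** 2
    return (observed - expected) / (1.0 - expected)

def accuracy(a, b, c, d):
    """Proportion of observations on the diagonal."""
    return (a + d) / (a + b + c + d)

# Hypothetical tables with the same accuracy
balanced = (45, 5, 5, 45)     # classes are balanced
unbalanced = (85, 5, 5, 5)    # one class dominates
kappa_balanced = kappa(*balanced)
kappa_unbalanced = kappa(*unbalanced)
```

Although both tables have the same accuracy, the unbalanced table has a much lower Kappa because most of its agreement would be expected from the class frequencies alone.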
Kappa Properties
Generally: −1 ≤ κ ≤ 1
- values close to 0 indicate poor agreement
- values close to 1 indicate near perfect agreement
- for complete disagreement, κ = −1
- “Values of 0.4 or above are considered to indicate moderate agreement, and values of 0.8 or higher indicate excellent agreement.” (Stokes, Davis, and Koch, 2001)
Can be generalized to > 2 classes
5ctive Cnactive
5ctive ; ;
Cnactive ; 0;
5ctive Cnactive
5ctive ; ;
Cnactive ; 0;
k0.49 k 0.65
Note9 hen the observed classes are balanced. 8a""a accurac!
Another Measure: Receiver Operating Characteristic (ROC) Curves
ROC curves can be used to assess a classification model's performance or to compare several models' performance
Building an ROC curve requires that the model produces a continuous prediction
Terminology:
- Sensitivity = True Positive Rate = TP / (TP + FN)
All observations with predicted probabilities < the cutoff are classified as negative
Classification Model Predictions
Several classification models generate a predicted value for each class in the original data
- PLSDA,
observation into group $g$.
The probability that the observation is in group $g$ is:
$$p_g = \frac{e^{\hat{y}_g}}{\sum_{l=1}^{K} e^{\hat{y}_l}}$$
where $\hat{y}_g$ is the predicted value for group $g$ and $K$ is the total number of groups
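The conversion from per-class predicted values to class probabilities can be sketched with a small softmax function. The scores are made up, and subtracting the maximum before exponentiating is an implementation choice made here for numerical stability; it does not change the result.

```python
import math

def softmax(scores):
    """Convert one predicted value per class into probabilities that sum to 1."""
    largest = max(scores)  # shift for numerical stability; cancels in the ratio
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical predicted values for K = 3 groups
probabilities = softmax([2.0, -1.0, 0.5])
predicted_group = probabilities.index(max(probabilities))
```

The group with the largest predicted value always receives the largest probability, so the ranking of the classes is unchanged by the transformation.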
Discriminant Models
Classical Discriminant Models
These models form a discriminant function that can be used to classify samples
The discriminant function is a linear function of the predictors that attempts to:
This is a latent variable method similar to PLS and others that we have seen
- how the latent variable is created differs between methods
Linear Discriminant Analysis
Assumption: the within group variability is the same for each group.
The plot on the right shows a three class example where a linear method like LDA is most effective
Aside: LDA and Logistic Regression
It turns out that LDA and logistic regression are fitting models that are very similar
- LDA assumes that the predictors are measured with error and that the classification of the observations is known
- LR assumes that the predictors are known and that the classification of the observations is measured with error
Assuming that the response error is Normal, the optimal separating plane for logistic regression is:
LDA estimates a large number of parameters and has fairly strict constraints on the data
Also, logistic models may be more forgiving of skewed predictor distributions
Example Data
set, LDA doesn't do a very good job since the boundary is nonlinear
The linear predictor is determined to be
B7-redictor:2;57-redictor61:1
(
Aside: LDA and Large Number of Predictors
Some classification models are not drastically affected by large numbers of predictors
- In many cases, a number of predictors will be noise
LDA has the potential to overfit
- LDA class probability estimates become more extreme as the number of predictors becomes large even when there is no underlying difference
A similar issue occurs in LR
data set that was complete noise
combinations of the original variables (scores) that are highly correlated with the response.
Solution: Same as PLS for Regression
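The optimization problem defined by PLS can be solved through the following formulation:
$$\underset{a,\,b}{\arg\max}\; \frac{\operatorname{Cov}^2\left(a^T X,\; b^T Y\right)}{\left(a^T a\right)\left(b^T b\right)}$$
subject to constraints 2a or 2b. Equivalently,
$$\underset{a,\,b}{\arg\max}\; \frac{\operatorname{Var}\left(a^T X\right)\,\operatorname{Corr}^2\left(a^T X,\; b^T Y\right)\,\operatorname{Var}\left(b^T Y\right)}{\left(a^T a\right)\left(b^T b\right)}$$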
- The PLS directions are the eigenvectors of a modified between-class covariance matrix, B.
- Coding of the response matrix does not matter
  - either g columns or g−1 columns provides the same answer
- The constraint in the Y-space does not make sense
  - Why constrain a response that denotes class membership?
- If the Y-space constraint is removed, the PLS directions are exactly the eigenvectors of the between-class covariance matrix, B
- LDA is optimal if dimension reduction is not necessary
  - The optimal directions for LDA are the eigenvectors of W⁻¹B.
PLS Discriminant Analysis Example 1
The softmax function is used to determine classification boundaries.
PLS Discriminant Analysis Example 2
Quadratic Discriminant Analysis
Assumption: the within group variability is different for each group.
The decision rule is
- where represents group .
- The class with the largest score is the predicted class
- A function of squared distance of each observation from each group's center
The decision rule depends on the covariance matrix for each group
Quadratic Discriminant Analysis
QDA extends the LDA model by using quadratic (i.e. nonlinear) classification boundaries
However, the data requirements are more stringent
- at least as many compounds as predictors in each class
- no zero-variance or linearly dependent predictors
Regularized Discriminant Analysis
The method tries to split the difference between LDA and QDA.
It uses two tuning parameters, gamma and lambda:
- gamma controls the correlation assumption for the predictors
  - as gamma → 1 the model assumes less predictor correlations
- lambda toggles between linear and quadratic boundaries
  - gamma = 0 & lambda = 1: LDA
  - gamma = 0 & lambda = 0: QDA
Other combinations of gamma and lambda produce models that are compromises between LDA and QDA
Regularized Discriminant Analysis
To see the effect of changing gamma:
- RdaMovieA.gif
To see the effect of changing lambda:
- RdaMovieB.gif
We can find the optimal gamma and lambda by cross-validation
In addition to the original predictors, nonlinear functions of the predictors are added to the data
- This is known as a “basis expansion” of the original data
This procedure essentially builds a set of “one versus all” classification models
- a 0/1 outcome is used for each model
- the softmax function is used to convert the model output to class probabilities
used
hinge features
- for these data, 3 sets of features were used to discriminate the classes
Modern Classification Methods
Classification Trees
Like regression trees, classification trees search through each predictor to find a value of a single predictor that splits the data into two (or more) groups that are more pure than the original group.
[Figure: example classification tree with splits on Pred A, Pred B and Pred D against Thresh 1 to Thresh 4 (for example, A > Thresh 1); each terminal node predicts class 1 or class 2]
Impurity Measures
There are several measures for determining the purity of the split.
Misclassification error: w1 p1 + w2 p2
- When w1 = w2 = 0.5, ME = 0.5*(p1 + p2)
Gini index: w1 p1 (1 − p1) + w2 p2 (1 − p2)
- When w1 = w2 = 0.5, GI = 0.5*(p1(1 − p1) + p2(1 − p2))
n
d$!n
c%!
d$
d
d$
$p
c%c%
+=+=
++=
++
21
2
,
,min
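The two impurity measures can be compared on a candidate two-class split with a short sketch. The definitions used here are assumptions made for the illustration: each partition's weight is taken as its share of the samples, and each partition's p as the proportion of its minority class. The class counts are hypothetical.

```python
def partition_summary(class_counts, n_total):
    """Weight (share of samples) and minority-class proportion for one partition."""
    size = sum(class_counts)
    return size / n_total, min(class_counts) / size

def impurities(first_counts, second_counts):
    """Misclassification error and Gini index for a two-way, two-class split."""
    n = sum(first_counts) + sum(second_counts)
    w1, p1 = partition_summary(first_counts, n)
    w2, p2 = partition_summary(second_counts, n)
    misclassification = w1 * p1 + w2 * p2
    gini = w1 * p1 * (1.0 - p1) + w2 * p2 * (1.0 - p2)
    return misclassification, gini

# Hypothetical split: 8 vs 2 samples in one partition, 1 vs 9 in the other
me, gi = impurities([8, 2], [1, 9])
pure_me, pure_gi = impurities([10, 0], [0, 10])
```

A perfectly pure split scores zero on both measures, and candidate splits can be ranked by whichever measure is being used.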
Impurity Measure Comparison
Simple Example
In this example a few possible partitions clearly stand out:
- x1 = 5,
- x2 = 7.5, or
- x2 = 1.5
How does each impurity measure rank these partitions?
[Figure: scatter plot of the two classes, x1 versus x2]
Classification Results
Ensemble Methods
Like individual regression trees, single classification trees
- are not optimal classification methods.
- have high variability: small changes in the data can drastically affect the structure of the tree.
Bagging, random forests, and boosting can also be implemented for classification problems
Bagging, Random
implemented in the same way as in regression.
The objective is to minimize misclassification error
- The loss function changes to exponential loss rather than squared error loss.
Tuning parameters for these methods are the same as in regression
Neural Networks
Like PLS, neural networks for classification translate the classes to a set of binary (zero/one) variables.
The binary variables are modeled using the predictors and the softmax technique is used to make sure that the model outputs behave like probabilities
complexity parameters:
- The number of hidden units
- The amount of weight decay
The second parameter helps determine the smoothness of the classification boundaries
objective function:
- the margin
Suppose we have two predictors and a bunch of compounds
We may want to classify compounds as active or inactive
Let's further suppose that these two predictors completely separate these classes
The Margin
There are an infinite number of straight lines that we can use to separate these two groups
- some must be better than others
The margin is defined by equally spaced boundaries on each side of the line
The Margin
To maximize the margin, we try to make it as large as possible
- without capturing any compounds
As the margin increases, the solution becomes more robust
SVMs maximize the margin to estimate parameters
Support Vectors and Data Reduction
When the classes overlap, points are allowed within the margin
- the number of points is controlled by a cost parameter
The points that are within the margin (or on its boundary) are the support vectors
It turns out that the prediction function only uses the support vectors
- the prediction equation is more compact and efficient
- the model may be more robust to outliers
Nonlinear Boundaries
Similar to regression models, the “kernel trick” can be used to generate highly nonlinear class boundaries
)B< Kernel D SIs 6,1:@P7
The Effect of the Cost Parameter
As the cost parameter is increased, the model will work very hard to correctly classify the compounds
- This can lead to over-fitting
To see the effect of the cost parameter, the link below shows an animation for a radial basis function SVM
- SvmMovieB.gif (http://pfizerpedia/index.php/Image:SvmMovieB.gif)
Note that, as the boundary becomes more complicated, the number of SVs decreases
- The margin is becoming very small
Nearest Neighbor Classifiers
To predict the class of a new compound, this procedure uses the most frequent class of the closest k neighbors
- if a tie, randomly pick from the most frequent classes
k, the number of neighbors, is the tuning parameter
Since distance is used to define the nearest points, the predictors should be centered and scaled
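The procedure above can be sketched as follows. This is a minimal illustration with hypothetical data and function names; Euclidean distance is assumed, and the centering and scaling statistics are computed from the training set.

```python
import math
import random
import statistics
from collections import Counter

def center_scale(rows):
    """Return a function that centers and scales a row using training statistics."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]  # assumes no zero-variance column
    return lambda row: [(v - m) / s for v, m, s in zip(row, means, sds)]

def knn_predict(train_x, train_y, new_x, k, rng):
    """Most frequent class among the k closest training points; ties broken at random."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], new_x))
    votes = Counter(train_y[i] for i in order[:k])
    top = max(votes.values())
    tied = sorted(label for label, count in votes.items() if count == top)
    return rng.choice(tied)

# Hypothetical compounds: two predictors on very different scales
train = [[1.0, 100.0], [2.0, 110.0], [1.5, 90.0], [8.0, 400.0], [9.0, 420.0], [8.5, 390.0]]
labels = ["active", "active", "active", "inactive", "inactive", "inactive"]
scale = center_scale(train)
scaled_train = [scale(r) for r in train]
prediction = knn_predict(scaled_train, labels, scale([1.2, 105.0]), k=3, rng=random.Random(0))
```

Without the scaling step, the second predictor would dominate the distance simply because its values are larger.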
Nearest Neighbor Classifiers
the %odel was tunedacross &values +ro% 1 to
2
# D neighbors was +ound to
be o"ti%al
kNN class boundaries tend to be somewhat jagged but smooth out as k increases
Naïve Bayes
Recall Bayes theorem:
Of course, the predictor distributions are usually multivariate and these probabilities would involve multidimensional integration
Naïve Bayes
In “naïve Bayes,” aka “Idiot's Bayes,” the relationships between predictors are ignored
- i.e. all predictors are treated as uncorrelated
Naïve Bayes
Despite this assumption, this model usually is very competitive, even with strong correlations
How do we estimate continuous predictor distributions?
- parametrically: assume normality and use the sample mean and variance
- non-parametrically: use a nonparametric density estimator
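The parametric option can be sketched as a small Gaussian naïve Bayes classifier. The data, class labels and function names are hypothetical; the class priors are taken as the observed class frequencies, which is an assumption of this sketch.

```python
import math
import statistics

def normal_density(x, mean, sd):
    """Normal density, used for each predictor within each class."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def fit(rows, labels):
    """Estimate a prior and per-predictor (mean, sd) for each class."""
    model = {}
    for cls in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == cls]
        model[cls] = {
            "prior": len(members) / len(rows),
            "params": [(statistics.mean(c), statistics.stdev(c)) for c in zip(*members)],
        }
    return model

def class_probabilities(model, row):
    """Multiply the prior by each predictor's density (predictors treated as independent)."""
    scores = {}
    for cls, part in model.items():
        score = part["prior"]
        for value, (mean, sd) in zip(row, part["params"]):
            score *= normal_density(value, mean, sd)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Hypothetical training data with two predictors
rows = [[-1.0, 0.2], [-0.5, 0.0], [-1.5, 0.4], [1.0, 1.1], [0.5, 0.9], [1.5, 1.4]]
labels = ["active"] * 3 + ["inactive"] * 3
probs = class_probabilities(fit(rows, labels), [-1.0, 0.1])
```

Because each predictor contributes its own one-dimensional density, no multidimensional integration is needed; swapping `normal_density` for a nonparametric density estimate would give the second option.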
Naïve Bayes
For predictor A in our example, we see a slight shift between the distributions of the predictor for each class:
Naïve Bayes
If a new sample has a value of predictor A of -1, it is more likely to be active
- active density = 0.4
- inactive density = 0.17
Naïve Bayes
larger for values between -0.5 and 0.5
            Sample 1                  Sample 2
            Pred A  Pred B  Total     Pred A  Pred B  Total
Active      0.4     0.14    0.06      0.4     0.3     0.12
Inactive    0.17    0.62    0.10      0.17    [...]   [...]
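Each "Total" entry in the table is the product of the per-predictor densities for that class. A minimal sketch of that arithmetic, taking the Sample 1 densities to be 0.4 and 0.14 (active) and 0.17 and 0.62 (inactive); the function name and the equal-prior default are assumptions made here:

```python
def naive_bayes_scores(densities_by_class, priors=None):
    """Multiply the per-predictor densities (and an optional prior) for each class."""
    scores = {}
    for label, densities in densities_by_class.items():
        score = 1.0 if priors is None else priors[label]
        for d in densities:
            score *= d
        scores[label] = score
    return scores

# Densities for Sample 1, in the order (predictor A, predictor B).
sample_1 = {"active": [0.4, 0.14], "inactive": [0.17, 0.62]}
scores = naive_bayes_scores(sample_1)
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Predictor A alone favors the active class, but once predictor B is multiplied in, the inactive class has the larger score for this sample.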
Naïve Bayes and Many Predictors
Like LDA, naïve Bayes models can overfit when many noisy predictors are included in the model
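A small simulation can make this concrete. The sketch below is an assumption-laden illustration, not the simulation from the slides: it fits a Gaussian naïve Bayes model (helper names are made up here) to pure noise with arbitrary labels and reports accuracy on the training set itself.

```python
import math
import random

def fit_gaussian_nb(x, y):
    """For each class and predictor, store the sample mean and variance."""
    model = {}
    for label in set(y):
        rows = [r for r, lab in zip(x, y) if lab == label]
        stats = []
        for col in zip(*rows):
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / (len(col) - 1)
            stats.append((mean, var))
        model[label] = stats
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the largest summed log normal density (equal priors)."""
    def log_score(stats):
        return sum(-0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
                   for v, (mean, var) in zip(row, stats))
    return max(model, key=lambda label: log_score(model[label]))

def training_accuracy(n_predictors, n_samples=40, seed=1):
    """Fit to pure noise with arbitrary labels, then re-predict the training set."""
    rng = random.Random(seed)
    x = [[rng.gauss(0, 1) for _ in range(n_predictors)] for _ in range(n_samples)]
    y = ["active" if i % 2 == 0 else "inactive" for i in range(n_samples)]
    model = fit_gaussian_nb(x, y)
    hits = sum(predict_gaussian_nb(model, row) == lab for row, lab in zip(x, y))
    return hits / n_samples

acc_few = training_accuracy(2)
acc_many = training_accuracy(500)
print(acc_few, acc_many)
```

Since the labels carry no information, any apparent accuracy gain with 500 predictors reflects over-fitting to the training set rather than real signal.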
As with LDA, we simulated noise data and were able to see class separation increase as the number of predictors went up
Naïve Bayes Classifiers
Class boundaries for naïve Bayes models can show circular or elliptical islands
Since the predictors are treated as uncorrelated, there cannot be any diagonal ellipses
Example: Prediction of Spam
We would like to classify emails as being spam with an emphasis on high specificity, i.e., a low probability of non-spam being labeled as spam
Method Comparison
ROC Comparison
Classification Datasets
Glaucoma Data
62 variables are derived from a confocal laser scanning image of the optic nerve head, describing its morphology.
Observations are from normal and glaucomatous eyes, respectively. Examples of variables are:
- as: superior area
- vbss: volume below surface temporal
- mhcn: mean height contour nasal
- vari: volume above reference inferior, etc.
We would like to predict whether a subject has glaucoma given their imaging data
Predicting Diabetes in Pima Indians
These data are from Pima Indian women living in Arizona. Several variables were collected, such as:
- pregnant: number of pregnancies
- glucose: plasma glucose levels
- pressure: diastolic BP
- triceps: skin fold thickness
- insulin: serum insulin
- mass: body mass index
- pedigree: diabetic pedigree function
- age
- diabetes: negative or positive
We would like to predict a new Indian woman's diabetic status given her other information.
Classification Backup Slides
- missing predictor values are allowed
- centering and scaling are not required
Centering and scaling do not affect results
- highly correlated predictors are allowed
Including highly correlated predictors can cause instability and make predictor importance rankings somewhat random
- zero- and near-zero variance predictors are allowed
RDA Pre-Processing
RDA models cannot deal with zero- and near-zero variance predictors
- they must be removed
Highly correlated predictors are allowed, but not suggested
- However, perfectly correlated predictors will cause the model to fail
Centering and scaling are not required but are suggested
Additionally, there cannot be linear dependencies between predictors
Neural Network Pre-Processing
Neural network models will not fail with zero-variance predictors
However, these models use a large number of parameters and near-zero variance predictors may lead to numerical issues such as a failure to converge
Highly correlated predictors should be removed.
Centering and scaling are required
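These pre-processing steps can be chained together. The following is a rough sketch under stated assumptions: the 0.9 correlation cutoff, the greedy filtering order, the function names, and the toy data are all choices made here, and only exactly-zero variance is filtered (a near-zero variance rule would need its own threshold).

```python
import math

def column_stats(col):
    """Sample mean and standard deviation of one predictor."""
    mean = sum(col) / len(col)
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
    return mean, sd

def correlation(a, b):
    """Pearson correlation between two predictors with non-zero variance."""
    (ma, sa), (mb, sb) = column_stats(a), column_stats(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)
    return cov / (sa * sb)

def preprocess(columns, max_abs_cor=0.9):
    """columns maps predictor name -> values; returns filtered, centered, scaled columns."""
    # 1. Drop zero-variance predictors.
    kept = {name: col for name, col in columns.items() if column_stats(col)[1] > 0}
    # 2. Greedily drop a predictor that is highly correlated with one already kept.
    selected = {}
    for name, col in kept.items():
        if all(abs(correlation(col, other)) <= max_abs_cor for other in selected.values()):
            selected[name] = col
    # 3. Center and scale what is left.
    out = {}
    for name, col in selected.items():
        mean, sd = column_stats(col)
        out[name] = [(v - mean) / sd for v in col]
    return out

# Hypothetical predictors: "const" has zero variance and "x2" nearly duplicates "x1".
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.1, 2.0, 3.1, 3.9, 5.0],
    "const": [7.0, 7.0, 7.0, 7.0, 7.0],
    "x3": [2.0, -1.0, 0.5, 3.0, -2.0],
}
clean = preprocess(data)
print(sorted(clean))
```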
Nearest Neighbor Pre-Processing
These models are resistant to zero- and near-zero variance predictors as well as highly correlated predictors
Centering and scaling are required
Naïve Bayes Pre-Processing
These models will not fail with zero-variance predictors
Highly correlated predictors are also allowed.
Centering and scaling are not required
Model Building Training
Other Considerations
Variables to Select
Variables thought to be related to the response should be included in the model
Sometimes we don't know if a set of variables is related to the response
Should these be included in the analysis?
If the variables are not related to the response, then we are including noise in our predictor set
What happens to the performance of the techniques when noise is added?
- Can we still find signal?
Illustration
To the blood brain barrier data of Mente and Lombardo (2005), we have added 1[...], 5[...], 1[...], and 2[...] random predictors
Performance Comparison
[Figure: R², test set, plotted against noise]
Variables to Select
Hopefully, we've demonstrated that resampling is a good way to avoid over-fitting
Realize that predictor selection is part of the modeling process
Doing predictor selection outside of cross-validation can lead to severe predictor selection bias
- and potential over-fitting (but you won't know until you evaluate a test set)
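The selection bias can be reproduced on data with no signal at all. The sketch below is a toy under stated assumptions: pure-noise predictors, labels unrelated to them, a simple mean-difference filter, and a nearest-centroid classifier, none of which come from the slides. It compares selecting predictors once on all the data against re-selecting inside every cross-validation fold.

```python
import random

def top_predictors(x, y, n_keep):
    """Rank predictors by absolute difference in class means; return the best indices."""
    def gap(j):
        a = [r[j] for r, lab in zip(x, y) if lab == 1]
        b = [r[j] for r, lab in zip(x, y) if lab == 0]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(x[0])), key=gap, reverse=True)[:n_keep]

def nearest_centroid_predict(train_x, train_y, row, cols):
    """Assign the class whose mean over the chosen columns is closest to row."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [r for r, lab in zip(train_x, train_y) if lab == label]
        dist = sum((row[j] - sum(r[j] for r in members) / len(members)) ** 2
                   for j in cols)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cv_accuracy(x, y, n_keep, n_folds, select_inside):
    """k-fold CV accuracy, selecting predictors inside each fold or once up front."""
    n = len(x)
    fixed_cols = None if select_inside else top_predictors(x, y, n_keep)
    hits = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        tx, ty = [x[i] for i in train_idx], [y[i] for i in train_idx]
        cols = top_predictors(tx, ty, n_keep) if select_inside else fixed_cols
        hits += sum(nearest_centroid_predict(tx, ty, x[i], cols) == y[i]
                    for i in test_idx)
    return hits / n

rng = random.Random(7)
n_samples, n_predictors = 40, 500
x = [[rng.gauss(0, 1) for _ in range(n_predictors)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # labels carry no information about x

biased = cv_accuracy(x, y, n_keep=10, n_folds=5, select_inside=False)
honest = cv_accuracy(x, y, n_keep=10, n_folds=5, select_inside=True)
print(biased, honest)
```

Because there is nothing to learn, an honest estimate should sit near 50%. The estimate from selecting outside cross-validation typically looks far better, which is the optimism the slide warns about.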
Effects of Categorizing a Continuous Response
A majority of responses are measured on a continuous scale
The continuous scale allows us to compare observations on their original scale
Sometimes the continuous response naturally falls into two or more modes
- If the relative distance between these modes is not relevant, then the response can be binned
- However, if the distance between modes is relevant, then we lose information by binning the response
Binning a continuous response that does not have natural modes will make us lose even more information and will degrade the model
Thanks
Thanks for sitting through all this
More thanks to:
- Benevolent overlords David Potter and Ed Kadyszewski
- Nathan Coulter and Gautam Bhola for computing