Mulitvariate Random Trees

Transcript of Mulitvariate Random Trees

  • 7/25/2019 Mulitvariate Random Trees

    1/279

    1

    Model Building Training

    Max Kuhn

    Kjell Johnson

    Global Nonclinical Statistics

    http://home.pfizer.com/
  • 7/25/2019 Mulitvariate Random Trees

    2/279

    2

    Overview

    Typical data scenarios
    - Examples we'll be using

    General approaches to model building

    Data pre-processing

    Regression-type models

    Classification-type models

    Other considerations
  • 7/25/2019 Mulitvariate Random Trees

    3/279

    3

    Typical Data

    Response may be continuous or categorical

    Predictors may be
    - continuous, count, and/or binary
    - dense or sparse
    - observed and/or calculated
  • 7/25/2019 Mulitvariate Random Trees

    4/279

    4

    Predictive Models

    What is a "predictive model"?

    A model whose primary purpose is for prediction (as opposed to inference)

    We would like to know why the model works, as well as the relationship between predictors and the outcome, but these are secondary

    Examples: blood-glucose monitoring, spam detection, computational chemistry, etc.
  • 7/25/2019 Mulitvariate Random Trees

    5/279

    5

    What Are They Not Good

  • 7/25/2019 Mulitvariate Random Trees

    6/279

    6

    What Are They Not Good

  • 7/25/2019 Mulitvariate Random Trees

    7/279

    7

    The Big Picture

    "In the end, [predictive modeling] is not a substitute for intuition, but a complement"

    Ian Ayres, in Supercrunchers
  • 7/25/2019 Mulitvariate Random Trees

    8/279

    References

    "Statistical Modeling: The Two Cultures" by Leo Breiman (Statistical Science, Vol 16, No. 3 (2001), 199-231)

    The Elements of Statistical Learning by Hastie, Tibshirani and

  • 7/25/2019 Mulitvariate Random Trees

    9/279

    Regression Methods

    Multiple linear regression

    Partial least squares

    Neural networks

    Multivariate adaptive regression splines

    Support vector machines

    Regression trees

    Ensembles of trees:
    - Bagging, boosting, and random forests
  • 7/25/2019 Mulitvariate Random Trees

    10/279

    10

    Classification Methods

    Discriminant analysis framework
    - Linear, quadratic, regularized, flexible, and partial least squares discriminant analysis

    Modern classification methods
    - Classification trees
    - Ensembles of trees
      Boosting and random forests
    - Neural networks
    - Support vector machines
    - k-nearest neighbors
    - Naive Bayes
  • 7/25/2019 Mulitvariate Random Trees

    11/279

    11

    Interesting Models We Don't Have Time

  • 7/25/2019 Mulitvariate Random Trees

    12/279

    12

    Example Data Sets
  • 7/25/2019 Mulitvariate Random Trees

    13/279

    13

    Boston Housing Data

    This is a classic benchmark data set for regression. It includes housing data for 506 census tracts of Boston from the 1970 census.

    crim: per capita crime rate
    indus: proportion of non-retail business acres per town
    chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    nox: nitric oxides concentration
    rm: average number of rooms per dwelling
    age: proportion of owner-occupied units built prior to 1940
    dis: weighted distances to five Boston employment centers
    rad: index of accessibility to radial highways
    tax: full-value property-tax rate
    ptratio: pupil-teacher ratio by town
    b: proportion of minorities
    medv: median value of homes (outcome)
  • 7/25/2019 Mulitvariate Random Trees

    14/279

    14

    Toy Classification Example

    A simulated data set will be used to demonstrate classification models
    - two predictors with a correlation coefficient of 0.5 were simulated
    - two classes were simulated ("active" and "inactive")

    A probability model was used to assign a probability of being active to each sample
    - the 25%, 50% and 75% probability lines are shown on the right
  • 7/25/2019 Mulitvariate Random Trees

    15/279

    15

    Toy Classification Example

    The classes were randomly assigned based on the probability

    The training data had 250 compounds (plot on right)
    - the test set also contained 250 compounds

    With two predictors, the class boundaries can be shown for each model
    - this can be a significant aid in understanding how the models work
    - ...but we acknowledge how unrealistic this situation is
  • 7/25/2019 Mulitvariate Random Trees

    16/279

    16

    Model Building Training

    General Strategies
  • 7/25/2019 Mulitvariate Random Trees

    17/279

    17

    Objective

    To construct a model of predictors that can be used to predict a response

    Data -> Model -> Prediction
  • 7/25/2019 Mulitvariate Random Trees

    18/279

    18

    Model Building Steps

    Common steps during model building are:
    - estimating model parameters (i.e. training models)
    - determining the values of tuning parameters that cannot be directly calculated from the data
    - calculating the performance of the final model that will generalize to new data

    The modeler has a finite amount of data, which they must "spend" to accomplish these steps
    - How do we "spend" the data to find an optimal model?
  • 7/25/2019 Mulitvariate Random Trees

    19/279

    19

    "Spending" Data

    We typically "spend" data on training and test data sets
    - Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model.
    - Test Set (aka validation set): these data can be used to get an independent assessment of model efficacy. They should not be used during model training.

    The more data we spend, the better estimates we'll get (provided the data is accurate). Given a fixed amount of data,
    - too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (over-fitting)
    - too much spent in testing won't allow us to get good estimates of the model parameters
  • 7/25/2019 Mulitvariate Random Trees

    20/279

    20

    Methods for Creating a Test Set

    How should we split the data into a training and test set?

    Often, there will be a scientific rationale for the split; in other cases, the splits can be made empirically.

    Several empirical splitting options:
    - completely random
    - stratified random
    - maximum dissimilarity in predictor space
  • 7/25/2019 Mulitvariate Random Trees

    21/279

    21

    Creating a Test Set: Completely Random Splits

    A completely random (CR) split randomly partitions the data into a training and test set

  • 7/25/2019 Mulitvariate Random Trees

    22/279

    22

    Creating a Test Set: Stratified Random Splits

    A stratified random split makes a random split within stratification groups
    - in classification, the classes are used as strata
    - in regression, groups based on the quantiles of the response are used as strata

    Stratification attempts to preserve the distribution of the outcome between the training and test sets
    - A SR split is more appropriate for unbalanced data
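    A stratified random split of this kind can be sketched with NumPy. This is an illustration only, not code from the course: the function name `stratified_split`, the four quantile bins and the 20% test fraction are assumptions chosen for the example.

```python
import numpy as np

def stratified_split(y, test_fraction=0.25, n_bins=4, classification=True, seed=0):
    """Return (train_idx, test_idx), splitting at random inside each stratum."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    if classification:
        strata = y  # the classes themselves are the strata
    else:
        # regression: group the response by its quantiles
        cuts = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
        strata = np.digitize(y, cuts)
    test_idx = []
    for level in np.unique(strata):
        members = np.flatnonzero(strata == level)
        rng.shuffle(members)
        n_test = int(round(test_fraction * len(members)))
        test_idx.extend(members[:n_test])
    test_idx = np.sort(np.array(test_idx, dtype=int))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

# Unbalanced toy outcome: 90 "inactive" and 10 "active" samples
y = np.array(["inactive"] * 90 + ["active"] * 10)
train_idx, test_idx = stratified_split(y, test_fraction=0.2)
```

    Because the split is made inside each class, the 9:1 class ratio is the same in the training and test indices, which a completely random split does not guarantee for a rare class.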
  • 7/25/2019 Mulitvariate Random Trees

    23/279

    23

    Over-Fitting

  • 7/25/2019 Mulitvariate Random Trees

    24/279

    24

    Over-Fitting

  • 7/25/2019 Mulitvariate Random Trees

    25/279

    25

    Over-Fitting

  • 7/25/2019 Mulitvariate Random Trees

    26/279

    26

    Over-Fitting

  • 7/25/2019 Mulitvariate Random Trees

    27/279

    27

    Over-Fitting

  • 7/25/2019 Mulitvariate Random Trees

    28/279

    28

    Over-Fitting

  • 7/25/2019 Mulitvariate Random Trees

    29/279

    29

    How Do We Estimate Over-Fitting

  • 7/25/2019 Mulitvariate Random Trees

    30/279

    30

    How Do We Estimate Over-Fitting

  • 7/25/2019 Mulitvariate Random Trees

    31/279

    31

    K-fold Cross Validation

    Here, we randomly split the data into K blocks of roughly equal size

    We leave out the first block of data and fit a model.

    This model is used to predict the held-out block

    We continue this process until we've predicted all K hold-out blocks

    The final performance is based on the hold-out predictions
  • 7/25/2019 Mulitvariate Random Trees

    32/279

    32

    K-fold Cross Validation

    The schematic below shows the process for K = 3 groups.
    - K is usually taken to be 5 or 10
    - leave one out cross-validation has each sample as a block
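    The K-fold procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`k_fold_indices`, `cross_validate`, `fit_ols`, `predict_ols`) and the simulated data are invented for the example, and a least-squares line stands in for whatever model is being assessed.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Randomly split the sample indices into k blocks of roughly equal size."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(X, y, fit, predict, k=5, seed=0):
    """Hold out each block once; return the hold-out predictions aligned with y."""
    held_out_pred = np.empty(len(y), dtype=float)
    for block in k_fold_indices(len(y), k, seed):
        train = np.setdiff1d(np.arange(len(y)), block)
        model = fit(X[train], y[train])            # fit without the held-out block
        held_out_pred[block] = predict(model, X[block])
    return held_out_pred

# Stand-in model: least squares with an intercept
def fit_ols(X, y):
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=100)

pred = cross_validate(X, y, fit_ols, predict_ols, k=5)
cv_rmse = np.sqrt(np.mean((y - pred) ** 2))   # performance from hold-out predictions only
```

    Setting `k` equal to the number of samples turns the same loop into leave-one-out cross-validation.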
  • 7/25/2019 Mulitvariate Random Trees

    33/279

    33

    Leave Group Out Cross Validation

    A random proportion of data (say 80%) are used to train a model

    The remainder is used to predict performance

    This process is repeated many times and the average performance is used
  • 7/25/2019 Mulitvariate Random Trees

    34/279

    34

    Bootstrapping

    Bootstrapping takes a random sample with replacement
    - the random sample is the same size as the original data set
    - compounds may be selected more than once
    - each compound has a 63.2% chance of showing up at least once

    Some samples won't be selected
    - these samples will be used to predict performance

    The process is repeated multiple times (say 30)
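    The 63.2% figure comes from the probability that a given sample is drawn at least once in n draws with replacement, 1 - (1 - 1/n)^n, which tends to 1 - 1/e. A short simulation can check it; the sample size of 250 and the 30 repetitions are assumptions borrowed from the toy example, not values tied to any particular data set.

```python
import numpy as np

n_samples = 250      # assumed training set size
n_resamples = 30     # assumed number of bootstrap repetitions

rng = np.random.default_rng(0)
in_bag_fraction = []
for _ in range(n_resamples):
    picked = rng.integers(0, n_samples, size=n_samples)      # sample with replacement
    in_bag = np.unique(picked)                                # selected at least once
    held_out = np.setdiff1d(np.arange(n_samples), in_bag)     # never selected: used to assess performance
    in_bag_fraction.append(len(in_bag) / n_samples)

# Probability that a given sample appears at least once
theoretical = 1 - (1 - 1 / n_samples) ** n_samples
```

    The number of held-out samples differs from one resample to the next, which is why `held_out` is recomputed inside the loop.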
  • 7/25/2019 Mulitvariate Random Trees

    35/279

    35

    The Bootstrap

    With bootstrapping, the number of held-out samples is random

    Some models, such as random forest, use bootstrapping within the modeling process to reduce over-fitting
  • 7/25/2019 Mulitvariate Random Trees

    36/279

    36

    Training Models with Tuning Parameters

    A single training/test split is often not enough for models with tuning parameters

    We must use resampling techniques to get good estimates of model performance over multiple values of these parameters

    We pick the complexity parameter(s) with the best performance and re-fit the model using all of the data
  • 7/25/2019 Mulitvariate Random Trees

    37/279

    37

    Simulated Data Example

    Let's fit a nearest neighbors model to the simulated classification data.

    The optimal number of neighbors must be chosen

    If we use leave group out cross-validation and set aside 20%, we will fit models to a random 200 samples and predict 50 samples
    - 30 iterations were used

    We'll train over 11 odd values for the number of neighbors
    - we also have a 250 point test set
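    A rough sketch of this tuning loop is given below. It is not the code behind the slides: the simulated data are a hypothetical stand-in (two predictors with correlation 0.5 and an assumed logistic probability model), the nearest-neighbor classifier is a plain NumPy implementation, and the function names are invented for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    d = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)      # y is coded 0/1; k is odd, so no ties
    return (votes > 0.5).astype(int)

def lgocv_accuracy(X, y, k, p_train=0.8, n_iter=30, seed=0):
    """Leave-group-out CV: repeated random 80/20 splits, averaged accuracy."""
    rng = np.random.default_rng(seed)
    n_train = int(round(p_train * len(y)))
    acc = []
    for _ in range(n_iter):
        perm = rng.permutation(len(y))
        tr, ho = perm[:n_train], perm[n_train:]
        acc.append(np.mean(knn_predict(X[tr], y[tr], X[ho], k) == y[ho]))
    return float(np.mean(acc))

# Hypothetical stand-in for the simulated training data
rng = np.random.default_rng(42)
X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=250)
prob_active = 1 / (1 + np.exp(-(2 * X[:, 0] - 2 * X[:, 1])))   # assumed probability model
y = (prob_active > rng.random(250)).astype(int)

neighbors = np.arange(1, 22, 2)                                 # 11 odd values: 1, 3, ..., 21
resampled = {int(k): lgocv_accuracy(X, y, int(k)) for k in neighbors}
best_k = max(resampled, key=resampled.get)
```

    After `best_k` is chosen, the model would be re-fit on all 250 training samples and checked once against the separate test set.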
  • 7/25/2019 Mulitvariate Random Trees

    38/279

    38

    Toy Data Example

    The plot on the right shows the classification accuracy for each value of the tuning parameter
    - The grey points are the 30 resampled estimates
    - The black line shows the average accuracy
    - The blue line is the 250 sample test set

    It looks like 7 or more neighbors is optimal with an estimated accuracy of [illegible]%
  • 7/25/2019 Mulitvariate Random Trees

    39/279

    39

    Toy Data Example

    What if we didn't resample and used the whole data set?

    The plot on the right shows the accuracy across the tuning parameters

    This would pick a model that over-fits and has optimistic performance
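    The optimism is easy to reproduce: when a nearest-neighbor model is scored on the same data it was trained on, one neighbor always looks perfect because every point is its own nearest neighbor. The sketch below uses hypothetical simulated data and a plain NumPy classifier; the names are assumptions for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    d = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

# Hypothetical two-predictor data with noisy class labels
rng = np.random.default_rng(7)
X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=250)
prob_active = 1 / (1 + np.exp(-(2 * X[:, 0] - 2 * X[:, 1])))
y = (prob_active > rng.random(250)).astype(int)

# "Apparent" accuracy: predict the very samples used for training
apparent = {k: float(np.mean(knn_predict(X, y, X, k) == y)) for k in (1, 5, 21)}
```

    Here `apparent[1]` is exactly 1.0 even though the labels are noisy, so selecting the number of neighbors this way would favor the most over-fit model.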
  • 7/25/2019 Mulitvariate Random Trees

    40/279

    40

    Model Building Training

    Data Pre-Processing
  • 7/25/2019 Mulitvariate Random Trees

    41/279

    41

    Why Pre-Process?

    In order to get effective and stable results, many models require certain assumptions about the data
    - this is model dependent

    We will list each model's pre-processing requirements at the end

    In general, pre-processing rarely hurts model performance, but could make model interpretation more difficult
  • 7/25/2019 Mulitvariate Random Trees

    42/279

    42

    Common Pre-Processing Steps

  • 7/25/2019 Mulitvariate Random Trees

    43/279

    43

    Zero Variance Predictors

    Most models require that each predictor have at least two unique values

    Why?
    - A predictor with only one unique value has a variance of zero and contains no information about the response.

    It is generally a good idea to remove them.
  • 7/25/2019 Mulitvariate Random Trees

    44/279

    44

    "Near Zero Variance" Predictors

    Additionally, if the distributions of the predictors are very sparse,
    - this can have a drastic effect on the stability of the model solution
    - zero variance descriptors could be induced during resampling

    But what does a "near zero variance" predictor look like?
  • 7/25/2019 Mulitvariate Random Trees

    45/279

    45

    "Near Zero Variance" Predictor

    There are two conditions for an "NZV" predictor
    - a low number of possible values, and
    - a high imbalance in the frequency of the values

  • 7/25/2019 Mulitvariate Random Trees

    46/279

    46

    NZV Example

    In computational chemistry we created predictors based on structural characteristics of compounds.

    As an example, the descriptor "nR11" is the number of 11-member rings

    The table to the right is the distribution of nR11 from a training set
    - the distinct value percentage is 5/535 = 0.0093
    - the frequency ratio is 501/23 = 21.8

    # 11-Member Rings

    Value    Frequency
    0        501
    1        4
    2        23
    3        5
    4        2
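    Both quantities can be recomputed from the frequency table. The sketch below assumes the counts read 501, 4, 23, 5 and 2 for the values 0 to 4 (535 compounds in total); the function name `nzv_metrics` is invented for the example.

```python
from collections import Counter

# Assumed distribution of the nR11 descriptor: value -> number of compounds
frequency = {0: 501, 1: 4, 2: 23, 3: 5, 4: 2}
values = [v for v, count in frequency.items() for _ in range(count)]

def nzv_metrics(x):
    """Return (distinct-value fraction, ratio of most common to second most common count)."""
    counts = Counter(x).most_common()
    distinct_fraction = len(counts) / len(x)
    freq_ratio = counts[0][1] / counts[1][1] if len(counts) > 1 else float("inf")
    return distinct_fraction, freq_ratio

distinct_fraction, freq_ratio = nzv_metrics(values)
```

    With these counts `distinct_fraction` is 5/535 and `freq_ratio` is 501/23, matching the two figures quoted for nR11.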
  • 7/25/2019 Mulitvariate Random Trees

    47/279

    47

    Detecting NZVs

    Two criteria for detecting NZVs are the
    - Discrete value percentage
      Defined as the number of unique values divided by the number of observations
      Rule-of-thumb: discrete value percentage below 20% could indicate a problem
    - Frequency ratio

  • 7/25/2019 Mulitvariate Random Trees

    48/279

    48

    Highly Correlated Predictors

    Some models can be negatively affected by highly correlated predictors
    - certain calculations (e.g. matrix inversion) can become severely unstable

    How can we detect these predictors?
    - Variance inflation factor (VIF)

  • 7/25/2019 Mulitvariate Random Trees

    49/279

    49

    Highly Correlated Predictors and Resampling

    Recall that resampling slightly perturbs the training data set to increase variation

    If a model is adversely affected by high correlations between predictors, the resampling performance estimates can be poor in comparison to the test set
    - In this case, resampling does a better job at predicting how the model works on future samples
  • 7/25/2019 Mulitvariate Random Trees

    50/279

    50

    Centering and Scaling

    Standardizing the predictors can greatly improve the stability of model calculations.

    More importantly, there are several models (e.g. partial least squares) that implicitly assume that all of the predictors are on the same scale

    Apart from the loss of the original units, there is no real downside of centering and scaling
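    Centering and scaling can be sketched in a few lines of NumPy. The helper names `fit_scaler` and `apply_scaler` and the simulated predictors are assumptions for illustration. One design choice worth noting: the means and standard deviations are estimated on the training set only and then reused for new data, so that the test set plays no part in pre-processing.

```python
import numpy as np

def fit_scaler(X_train):
    """Estimate per-predictor mean and standard deviation from the training data."""
    mean = X_train.mean(axis=0)
    sd = X_train.std(axis=0, ddof=1)
    sd = np.where(sd == 0, 1.0, sd)    # leave zero-variance columns unscaled
    return mean, sd

def apply_scaler(X, mean, sd):
    return (X - mean) / sd

# Two predictors on very different scales
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[10.0, 300.0], scale=[2.0, 50.0], size=(200, 2))
X_test = rng.normal(loc=[10.0, 300.0], scale=[2.0, 50.0], size=(50, 2))

mean, sd = fit_scaler(X_train)
Z_train = apply_scaler(X_train, mean, sd)
Z_test = apply_scaler(X_test, mean, sd)    # reuse the training statistics
```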
  • 7/25/2019 Mulitvariate Random Trees

    51/279

    51

    Model Building Training

    Regression-type Models
  • 7/25/2019 Mulitvariate Random Trees

    52/279

    52

    Setting

    Response is continuous
  • 7/25/2019 Mulitvariate Random Trees

    53/279

    53

    Objective

    To construct a model of predictors that can be used to predict a response

    Data -> Model -> Prediction
  • 7/25/2019 Mulitvariate Random Trees

    54/279

    54

    Regression Methods

    Multiple linear regression

    Partial least squares

    Neural networks

    Multivariate adaptive regression splines

    Support vector machines

    Regression trees

    Ensembles of trees:
    - Bagging, boosting, and random forests

    Each of these methods seeks to find a relationship between the predictors and response that minimizes error between the observed and predicted response
  • 7/25/2019 Mulitvariate Random Trees

    55/279

    55

    Additive Models

    In the beginning there were linear models:

    $E(Y) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$

    And Nelder and Wedderburn (1972) said, "Let there be Generalized Linear Models":

    $g(E(Y)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$

    and link functions appeared.

    And Hastie and Tibshirani (1990) said, "Let there be Generalized Additive Models":

    $E(Y) = f_0 + f_1(X_1) + \cdots + f_p(X_p)$

    and scatterplot smoothers and backfitting algorithms appeared.
  • 7/25/2019 Mulitvariate Random Trees

    56/279

    56

  • 7/25/2019 Mulitvariate Random Trees

    57/279

    57

    Assessing Model Performance
  • 7/25/2019 Mulitvariate Random Trees

    58/279

    58

    Assessing Model Performance

    How well does a regression model perform? Answering this question depends on how we want to use the model. Possible goals are:
    - To understand the relationship between the predictor and the response.
    - To use the model to predict future observations' response.

    In either case, we can use several different measures to evaluate model performance. We will focus on two:
    - Coefficient of determination (R²)
    - Root mean square error (RMSE)

    However, the set of data that we use to evaluate performance will change depending on our purpose.
  • 7/25/2019 Mulitvariate Random Trees

    59/279

    59

    Which Set of Data to Use to Evaluate Performance?

    If we are only interested in understanding the underlying relationship between the predictor and the response, then we can compute R² and RMSE on the data for which the model was built (i.e. the training data).
    - However, these values will be overly optimistic of the model's ability to predict future observations.

    If we are interested in understanding the model's ability to predict future observations, then we need to compute R² and RMSE on data for which the model was not built (i.e. a test set or cross-validation set).

  • 7/25/2019 Mulitvariate Random Trees

    60/279

    60

    Root Mean Squared Prediction Error (RMSPE)

    RMSE measures the average deviation of an observation to the best-fit plane

    RMSPE measures the average deviation of an observation to its predicted value for the test or cross-validation set

    $\mathrm{RMSE} = \sqrt{\dfrac{\mathrm{SSE}}{n - (p+1)}}$

    $\mathrm{RMSPE} = \sqrt{\dfrac{\sum_{i=1}^{n^*} (y_i - \hat{y}_i)^2}{n^*}}$

    n* = the number of observations in the test or cross-validation set
  • 7/25/2019 Mulitvariate Random Trees

    61/279

    61

    Computing Q²

    Process:
    - Partition the data into a training and testing set, or blocks to be used for training and testing
    - Build the model on the training data and predict the testing data

    Q² = R² of the relationship between the observed and predicted values for the testing data.
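    The error measures and Q² can be written as small functions. This is a sketch under one stated assumption: "R² of the relationship between the observed and predicted values" is taken to mean the squared correlation between them, which is one common reading. The function names and toy numbers are invented for the example.

```python
import numpy as np

def rmse(y, fitted, n_predictors):
    """Training-set RMSE: SSE divided by n - (p + 1), then square-rooted."""
    sse = np.sum((y - fitted) ** 2)
    return np.sqrt(sse / (len(y) - (n_predictors + 1)))

def rmspe(y_new, predicted):
    """Hold-out RMSPE: mean squared prediction error over n* samples, square-rooted."""
    return np.sqrt(np.mean((y_new - predicted) ** 2))

def q_squared(y_new, predicted):
    """Squared correlation between observed and predicted hold-out values (assumed definition)."""
    return np.corrcoef(y_new, predicted)[0, 1] ** 2

# Toy hold-out values: every prediction is off by 0.5
y_new = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 6.5, 9.5])
holdout_rmspe = rmspe(y_new, predicted)
holdout_q2 = q_squared(y_new, predicted)
```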
  • 7/25/2019 Mulitvariate Random Trees

    62/279

    62

    Multiple Linear Regression: A Quick Review
  • 7/25/2019 Mulitvariate Random Trees

    63/279

    63

    Multiple Linear Regression

    Objective:

  • 7/25/2019 Mulitvariate Random Trees

    64/279

    64

    The Best Plane

    To find the best plane, we solve:

    $\min_{\beta} \lVert Y - X\beta \rVert^2$

    - where $Y$ is $n \times 1$, $X$ is $n \times (p+1)$ and $\beta$ is $(p+1) \times 1$

    The best $\beta$ is:

    $\hat{\beta} = (X^T X)^{-1} X^T Y$
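    As a numerical check of the least-squares solution, the sketch below solves the normal equations on simulated data and compares the result with NumPy's least-squares routine. The data, coefficient values and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # n x (p+1); first column is the intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) beta = X^T Y instead of forming the inverse explicitly
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# The same minimizer of ||Y - X beta||^2 from a least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

    Solving the linear system is generally preferred to computing the inverse of X^T X, and both routes fail or become unstable when X^T X is singular or nearly so.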
  • 7/25/2019 Mulitvariate Random Trees

    65/279

    65

    Aside: A Bit More About $(X^T X)$

    $(X^T X)$ is a critical matrix for many statistical modeling techniques

    A few fun facts...
    - $(X^T X)$ is proportional to the covariance matrix, S (when the predictors are centered)
      S contains the variances and covariances of all predictors
    - Techniques that depend on $(X^T X)$ also require that it is invertible
  • 7/25/2019 Mulitvariate Random Trees

    66/279

    66

    Assumptions: Diagnostic Plots

  • 7/25/2019 Mulitvariate Random Trees

    67/279

    67

    When Does Regression

  • 7/25/2019 Mulitvariate Random Trees

    68/279

    68

    A (Trivial) Example of Multicollinearity

    Suppose that we have one observation (3,5), and we wish to find the 'best' line for the data. In this example, the number of observations (1) is less than the number of parameters (2: slope and intercept). When the number of parameters is greater than the number of observations, we can find an infinite number of 'best' solutions.

    [Figure: several candidate lines through the single point, labelled as separate solutions]

    In the presence of multicollinearity, the best solution will be unstable
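    The one-observation example can be checked numerically. In this sketch (variable names are assumptions), X^T X has rank 1, several different intercept/slope pairs fit the point (3, 5) exactly, and a least-squares routine simply returns the minimum-norm member of that infinite family.

```python
import numpy as np

X = np.array([[1.0, 3.0]])   # one row: intercept column and x = 3
y = np.array([5.0])

# X^T X is 2 x 2 but only has rank 1, so it cannot be inverted
rank = np.linalg.matrix_rank(X.T @ X)

# Any (intercept, slope) with intercept + 3 * slope = 5 fits perfectly
candidates = [(5.0, 0.0), (2.0, 1.0), (-1.0, 2.0)]
residuals = [abs(b0 + 3.0 * b1 - 5.0) for b0, b1 in candidates]

# lstsq returns the minimum-norm solution among the infinitely many exact fits
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)
```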
  • 7/25/2019 Mulitvariate Random Trees

    69/279

    69

    Boston Housing Data

    Let's use a linear regression model to predict the median house price in Boston.

    Process:
    - Split the data into a training set (n = 337) and testing set (n = 169)

  • 7/25/2019 Mulitvariate Random Trees

    70/279

    70

    Results

    The results are fairly similar, at least within the variation of resampling

    One reason you may see differences: multicollinearity
    - Multicollinearity in the predictors can produce somewhat unstable solutions for each resample
    - When the data are slightly changed, the model can drastically change

    The test set is a single, static set of data for verification
    - The bootstrap estimate of performance may be better with collinearity

                  Training Data (bootstrap)    Test Data
                  RMSE      Q²                 RMSE      R²
    Linear Reg    5.23      0.6?1              4.53      0.742
  • 7/25/2019 Mulitvariate Random Trees

    71/279

    71

    Partial Least Squares Regression
  • 7/25/2019 Mulitvariate Random Trees

    72/279

    72

    Solutions for Overdetermined Covariance Matrices

    Variable reduction
    - Try to accomplish this through the pre-processing steps

    Partial least squares (PLS)

    Other methods
    - Apply a generalized inverse
    - Ridge regression: Adjusts the variance/covariance matrix so that we can find a unique inverse.
    - Principal component regression (PCR)
      not recommended, but it's a good way to understand PLS
  • 7/25/2019 Mulitvariate Random Trees

    73/279

    73

    Understanding Partial Least Squares: Principal Components Analysis

    PCA seeks to find linear combinations of the original variables that summarize the maximum amount of variability in the original data
    - These linear combinations are often called principal components or scores.
    - A principal direction is a vector that points in the direction of maximum variance.
  • 7/25/2019 Mulitvariate Random Trees

    74/279

    74

    Principal Components Analysis

    PCA is inherently an optimization problem, which is subject to two constraints

    1. The principal directions have unit length
    2. Either
       a. Successively derived scores are uncorrelated to previously derived scores, OR
       b. Successively derived directions are required to be orthogonal to previously derived directions

    In the mathematical formulation, either constraint implies the other constraint
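    These properties can be verified numerically. The sketch below computes principal directions from the singular value decomposition of centered data, which is one standard route and not necessarily the one used in the course; the simulated covariance matrix and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = [[4, 2, 0], [2, 3, 1], [0, 1, 2]]             # assumed population covariance
X = rng.multivariate_normal([0, 0, 0], cov, size=300)
Xc = X - X.mean(axis=0)                              # PCA works on centered data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
directions = Vt.T             # columns are the principal directions
scores = Xc @ directions      # columns are the principal component scores

# Constraint 1: unit length
unit_length = np.allclose(np.linalg.norm(directions, axis=0), 1.0)
# Constraint 2b: directions are mutually orthogonal
orthogonal = np.allclose(directions.T @ directions, np.eye(3))
# Constraint 2a: scores are mutually uncorrelated (off-diagonal covariances vanish)
score_cov = np.cov(scores, rowvar=False)
uncorrelated = np.allclose(score_cov - np.diag(np.diag(score_cov)), 0.0)

# Share of total variability summarized by each direction, largest first
explained = s ** 2 / np.sum(s ** 2)
```

    Both forms of constraint 2 hold at once here, consistent with the statement that for PCA either one implies the other.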
  • 7/25/2019 Mulitvariate Random Trees

    75/279

    75

    Principal Components Analysis

    [Figure: scatter plot of Predictor 2 against Predictor 1, with Direction 1 and a sample's Score marked]

    http://pfizerpedia/index.php/Image:PCAmovie.gif
  • 7/25/2019 Mulitvariate Random Trees

    76/279

    76

    Mathematically Speaking...

    The optimization problem defined by PCA can be solved through the following formulation:

    subject to constraints 2a or 2b.

  • 7/25/2019 Mulitvariate Random Trees

    77/279

    77

    PCA Benefits and Drawbacks

    Benefits
    - Dimension reduction
      We can often summarize a large percentage of original variability with only a few directions
    - Uncorrelated scores
      The new scores are not linearly related to each other

    Drawbacks
    - PCA "chases" variability
      PCA directions will be drawn to predictors with the most variability
      Outliers may have significant influence on the directions and resulting scores.
  • 7/25/2019 Mulitvariate Random Trees

    78/279

    78

    Principal Component Regression

    Procedure:

    1. Reduce dimension of predictors using PCA
    2. Regress the response on the scores

    Notice: The procedure is sequential
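    The two sequential steps can be sketched with NumPy. The function names `pcr_fit` and `pcr_predict` and the simulated data are assumptions for illustration. Note that step 1 never looks at the response, which is the weakness that PLS addresses.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Step 1: PCA on the predictors only (the response is not used here)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    directions = Vt[:n_components].T
    scores = Xc @ directions
    # Step 2: least-squares regression of the centered response on the scores
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    coef = directions @ gamma          # coefficients on the original predictor scale
    return x_mean, y_mean, coef

def pcr_predict(model, X_new):
    x_mean, y_mean, coef = model
    return y_mean + (X_new - x_mean) @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=100)

full = pcr_fit(X, y, n_components=4)      # all components: same fit as ordinary least squares
reduced = pcr_fit(X, y, n_components=2)   # fewer components: dimension is reduced first
```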
  • 7/25/2019 Mulitvariate Random Trees

    79/279

    79

    Principal Component Regression

    Dimension reduction is independent of the objective

    [Diagram: Predictor Variables -> (PCA) -> PC Scores -> (MLR) -> Response Variable]

  • 7/25/2019 Mulitvariate Random Trees

    80/279

    Principal Component Regression

  • 7/25/2019 Mulitvariate Random Trees

    81/279

    81

    [Figure: "Scatter of First PCA Scores with Response"; x-axis: First PCA Scores, y-axis: Response; the plot is annotated with its R²]

    Relationship of

  • 7/25/2019 Mulitvariate Random Trees

    82/279

    82

    PLS History

    H. Wold (1966, 1975)

    S. Wold and H. Martens (1983)

    Stone and Brooks (1990)

  • 7/25/2019 Mulitvariate Random Trees

    83/279

    83

    Latent Variable Model

    [Diagram: Predictors (Predictor1 to Predictor6), Latent Variables (1 and 2), Responses (Response1 to Response3)]

    Note: PLS can handle multiple response variables

  • 7/25/2019 Mulitvariate Random Trees

    84/279

    84

    Comparison with Regression

    [Diagram: Predictor1 to Predictor5 and Response1]

  • 7/25/2019 Mulitvariate Random Trees

    85/279

    85

    PLS Optimization (many predictors, one response)

    PLS seeks to find linear combinations of the independent variables that summarize the maximum amount of co-variability with the response.
    - These linear combinations are often called PLS components or PLS scores.
    - A PLS direction is a vector that points in the direction of maximum co-variance.

  • 7/25/2019 Mulitvariate Random Trees

    86/279

    86

    PLS Optimization (many predictors, one response)

    PLS is inherently an optimization problem, which is subject to two constraints

    1. The PLS directions have unit length
    2. Either
       a. Successively derived scores are uncorrelated to previously derived scores, OR
       b. Successively derived directions are orthogonal to previously derived directions

    Unlike PCA, neither constraint implies the other

    Constraint 2a is most commonly implemented

  • 7/25/2019 Mulitvariate Random Trees

    87/279

    87

    Mathematically Speaking...

    The optimization problem defined by PLS can be solved through the following formulation:

    subject to constraints 2a or 2b.

  • 7/25/2019 Mulitvariate Random Trees

    88/279

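    PLS is Simultaneous Dimension Reduction and Regression

    $\arg\max_{a} \dfrac{\mathrm{Cov}^2(Xa, Y)}{a^T a}$

    $= \arg\max_{a} \dfrac{\mathrm{Var}(Xa)\,\mathrm{Var}(Y)\,\mathrm{Corr}^2(Xa, Y)}{a^T a}$

    $= \arg\max_{a} \dfrac{\mathrm{Var}(Xa)\,\mathrm{Corr}^2(Xa, Y)}{a^T a}$ (Var(Y) does not depend on a)

    $= \arg\max_{a} \dfrac{\mathrm{Var}(\text{scores})\,\mathrm{Corr}^2(\text{scores}, \text{response})}{a^T a}$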
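    For a single response, the unit-length direction that maximizes the squared covariance between the scores and the response is proportional to X^T y (with X and y centered); this follows from the Cauchy-Schwarz inequality. The sketch below checks that numerically on simulated data. The data, the helper `cov2` and the variable names are assumptions for illustration, and only the first direction is considered.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([3.0, 1.0, 1.0, 0.5, 0.5])   # predictor 1 has the most variance
y = X[:, 1] - X[:, 3] + rng.normal(scale=0.5, size=200)               # but the response ignores it
Xc, yc = X - X.mean(axis=0), y - y.mean()

def cov2(a):
    """Squared sample covariance between the scores Xc @ a and the response."""
    return float(np.mean((Xc @ a) * yc) ** 2)

# First PLS direction for one response: proportional to X^T y, scaled to unit length
a_pls = Xc.T @ yc
a_pls /= np.linalg.norm(a_pls)

# First PCA direction, which ignores the response entirely
a_pca = np.linalg.svd(Xc, full_matrices=False)[2][0]

# No random unit-length direction should beat a_pls on squared covariance
trials = rng.normal(size=(1000, 5))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
best_random = max(cov2(a) for a in trials)
```

    In this setup the PCA direction is pulled toward the high-variance first predictor, while the PLS direction is pulled toward the predictors that co-vary with the response.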

  • 7/25/2019 Mulitvariate Random Trees

    89/279

    PLS is Simultaneous Dimension Reduction and Regression

    $\max \; \mathrm{Var}(\text{scores}) \cdot \mathrm{Corr}^2(\text{response}, \text{scores})$

    - Var(scores): Dimension Reduction (PCA)
    - Corr²(response, scores): Regression

  • 7/25/2019 Mulitvariate Random Trees

    90/279

    PLS Benefits and Drawbacks

    Benefit
    - Simultaneous dimension reduction and regression

    Drawbacks
    - Similar to PCA, PLS "chases" co-variability
      PLS directions will be drawn to independent variables with the most variability (although this will be tempered by the need to also be related to the response)
      Outliers may have significant influence on the directions, resulting scores, and relationship with the response. Specifically, outliers can
      - make it appear that there is no relationship between the predictors and response when there truly is a relationship, or
      - make it appear that there is a relationship between the predictors and response when there truly is no relationship

  • 7/25/2019 Mulitvariate Random Trees

    91/279

    91

    Partial Least Squares

    Simultaneous dimension reduction and regression

    [Diagram: Predictor Variables -> (PLS) -> Response Variable]

    Relationship of

    [Figure: Scatter of First PLS Scores with Response; x-axis: First PLS Scores, y-axis: Response]

    PLS in Practice

    PLS seeks to find latent variables (LVs) that summarize variability and are highly predictive of the response.

    How do we determine the number of LVs to compute?
    - Evaluate RMSPE (or Q²)

    The optimal number of components is the number of components that minimizes RMSPE
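    The selection rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the RMSPE profile are hypothetical and are not taken from the slides, and in practice the profile would come from resampling.

```python
import math

def rmspe(observed, predicted):
    """Root mean squared prediction error over held-out samples."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def best_n_components(rmspe_by_ncomp):
    """Number of components with the smallest RMSPE (ties go to the fewest)."""
    return min(sorted(rmspe_by_ncomp), key=lambda k: rmspe_by_ncomp[k])

# Hypothetical resampled RMSPE values for models with 1 to 6 components.
profile = {1: 6.1, 2: 5.3, 3: 5.2, 4: 5.25, 5: 5.3, 6: 5.4}
best = best_n_components(profile)
```

    Breaking ties toward fewer components is a design choice made here for parsimony; it is an assumption of the sketch.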

    PLS for the Boston housing data:

    Training the PLS Model

    Since PLS can handle highly correlated variables, we fit the model using all 12 predictors

    The model was trained with up to 6 components

    RMSE drops noticeably from 1 to 2 components and some for 2 to 3 components.
    - Models with 3 or more components might be sufficient for these data

    Training the PLS Model

    Roughly the same profile is seen when the models are judged on R²

    Boston Housing Results

    Using the two component model, we can predict the test set

    PLS training statistics are similar to those from linear regression

    Both methods perform about the same in the test set

    Training Data (bootstrap)    Test Data

    RMSE    R²    RMSE    R²

    Hinear )eg ;:2, :@1 0:;, :D02

    -HS ;:2; :@ 0:;@ :D,

    PLS Model

    PLS Optimization (2) (many predictors, many responses)

    PLS seeks to find linear combinations of the independent variables and a linear combination of the dependent variables that summarize the maximum amount of co-variability between the combinations.
    - These linear combinations are often called PLS X-space and Y-space components or PLS X-space and Y-space scores.
    - Likewise, X-space and Y-space PLS directions point in the direction of maximum co-variance between the spaces.

    PLS Optimization (2) (many predictors, many responses)

    PLS is inherently an optimization problem, which is subject to two constraints
    1. The X-space and Y-space PLS directions have unit length
    2. Either
       a. Successively derived scores in each space are uncorrelated to previously derived scores, OR
       b. Successively derived directions in each space are orthogonal to previously derived directions

    Constraint 2.a. is most commonly implemented

    Mathematically Speaking…

    The optimization problem defined by PLS can be solved through the following formulation:

    $$\arg\max_{a^{T}a=1,\; b^{T}b=1} \operatorname{Cov}^{2}(Xa, Yb) = \arg\max_{a^{T}a=1,\; b^{T}b=1} \operatorname{var}(Xa)\,\operatorname{corr}^{2}(Xa, Yb)\,\operatorname{var}(Yb)$$

    subject to constraints 2a. or b.

    PLS is Simultaneous Dimension Reduction and Regression

    max Var(X-scores) × Corr²(X-scores, Y-scores) × Var(Y-scores)

    Var(X-scores): X-space Dimension Reduction (PCA)
    Corr²(X-scores, Y-scores): Regression
    Var(Y-scores): Y-space Dimension Reduction (PCA)

    Neural Networks

    Like PLS or PCR, these models create intermediary latent variables that are used to predict the outcome

    Neural networks differ from PLS or PCR in a few ways
    - the objective function used to derive the new variables is different
    - the latent variables are created using flexible, highly nonlinear functions
    - the latent variables usually do not have any meaning

    Network Structures

    There are many types of neural network structures
    - we will concentrate on the single layer, feed-forward network

    [Diagram: single-layer feed-forward network; the predictors feed Hidden Unit 1 through Hidden Unit k, which feed the response]

    The transition from this sub-model to the hidden units is nonlinear
    - sigmoidal functions, such as the logistic function, are typically used

    The hidden units are then used to predict the outcome using simple linear combinations

    Clearly, the parameters are not identifiable and the hidden units have no real meaning (unlike PCA)
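    A minimal sketch of the two steps just described, assuming a logistic transfer function and a single numeric outcome. The function names and the weights are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def logistic(z):
    """Sigmoidal transfer function applied inside each hidden unit."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, hidden_weights, hidden_bias, out_weights, out_bias):
    """Single-layer feed-forward prediction for one sample x."""
    # Each hidden unit is a nonlinear function of a linear combination of predictors.
    hidden = [logistic(b + sum(w * xi for w, xi in zip(ws, x)))
              for ws, b in zip(hidden_weights, hidden_bias)]
    # The outcome is a simple linear combination of the hidden units.
    return out_bias + sum(v * h for v, h in zip(out_weights, hidden))

# Hypothetical weights: two predictors, two hidden units.
y_hat = predict([0.0, 0.0], [[1.0, 1.0], [-1.0, 2.0]], [0.0, 0.0], [2.0, 4.0], 1.0)
```

    With all predictors at zero each hidden unit equals 0.5, so the prediction is 1 + 2(0.5) + 4(0.5) = 4.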

    Training Networks

    It is highly recommended that the predictors are centered and scaled prior to training

    The number of hidden units is a tuning parameter

    With many predictors and hidden units, the number of estimated parameters can become very large
    - with a large number of hidden units, these models can quickly start to overfit

    Random starting values are typically used to initialize the parameter estimates
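    To see how quickly the parameter count grows, here is a small sketch that counts the weights in a single-layer feed-forward network. It assumes one intercept per hidden unit and one per output, which is a common parameterization but is not stated in the slides.

```python
def n_parameters(n_predictors, n_hidden, n_outputs=1):
    """Count weights in a single-layer network with intercept terms (assumed)."""
    hidden_params = n_hidden * (n_predictors + 1)   # weights + intercept per hidden unit
    output_params = n_outputs * (n_hidden + 1)      # weights + intercept per output
    return hidden_params + output_params

small = n_parameters(12, 3)    # 12 predictors, 3 hidden units
large = n_parameters(12, 10)   # same predictors, 10 hidden units
```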

    Weight Decay

    This is a training technique that attempts to "shrink" the parameter estimates towards zero
    - large parameter estimates are penalized in the model training

    This leads to smoother, less extreme models
    - the effect of weight decay is demonstrated for classification models
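    One common way to express this penalty, shown here as a hedged sketch, is to add the sum of squared weights (times a decay value) to the sum of squared residuals. The slides do not give the exact penalty, so the quadratic form and the names below are assumptions.

```python
def penalized_sse(residuals, weights, decay):
    """Sum of squared errors plus a quadratic weight-decay penalty (assumed form)."""
    sse = sum(r * r for r in residuals)
    penalty = decay * sum(w * w for w in weights)
    return sse + penalty

# Hypothetical residuals and weights; larger weights raise the objective.
objective = penalized_sse([1.0, -2.0], [3.0, 4.0], decay=0.1)
```

    Setting the decay to zero recovers the ordinary sum of squared errors.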

    Boston Housing Data

    The model seems to do well with fewer components (not typical)

    The final model used a high value for weight decay and 1 hidden unit

    This model seems to be an improvement compared to the others

    Training Data (bootstrap)    Test Data

    RMSE    R²    RMSE    R²

    Hinear )eg ;:2, :@1 0:;, :D02

    -HS ;:2; :@ 0:;@ :D,

    Neural Net 0:@ :D;D 0:2 :D

    Support Vector Machines

    Support Vector Machines (SVMs)

    SVMs are predictive statistical models developed in 1963 by Vapnik that were significantly expanded in the '90s

    These models were initially developed for classification models, but were later adapted for regression models

    Objective

    Recall that linear regression estimates parameters by calculating:
    - the model residuals
    - the total sum of the squared residuals (SSR)

    The parameters with the smallest SSR are optimal

    Objective

    Support vector machine regression models create a "funnel" around the regression line
    - residuals within the funnel are not counted in the parameter estimation
    - the sum of the residuals outside the funnel is used as the objective function (no squared term)

    A funnel size of 1 SD of the outcome is not a bad place to start
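    The funnel idea can be sketched as a loss that ignores residuals inside the funnel and grows linearly outside it. The function name and the choice to measure each residual from the funnel edge are assumptions made for illustration.

```python
def funnel_loss(residuals, epsilon):
    """Sum of absolute residuals beyond a funnel of half-width epsilon."""
    # Residuals inside the funnel contribute nothing; no squared term is used.
    return sum(max(0.0, abs(r) - epsilon) for r in residuals)

# Hypothetical residuals: the first lies inside a funnel of half-width 1.
loss = funnel_loss([0.5, -2.0, 3.0], epsilon=1.0)
```

    Here the loss is 0 + 1 + 2 = 3, whereas a squared-error objective would weight the largest residual far more heavily.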

    The SVM Model Optimization

    Like Huber-type robust regression, outliers have a linear effect on the objective function

    Overfitting can be controlled by using a penalized objective function (more later)

    Quadratic programming methods are needed to solve these equations

    Support Vectors and Data Reduction

    The points that are outside the funnel (or on its boundary) are the support vectors

    It turns out that the prediction function only uses the support vectors
    - the prediction equation is more compact and efficient
    - the model may be more robust to outliers

    Support Vectors and Data Reduction

    The model fitting routine produces values (α) that are non-zero for all of the support vectors

    To predict a new sample, the original training data for the non-zero values are needed:

    Nonlinear Boundaries

    Nonlinear boundaries can be computed using the "kernel trick"

    The predictor space can be expanded by adding nonlinear functions of the predictors

    Common kernel functions are:

    Nonlinear Boundaries

    The "trick" is that the computations can operate only on the inner-products of the extended predictor set

    In this way, the predictor space dimension can be greatly expanded without much computational impact
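    As an illustration of the trick, the sketch below uses a degree-2 polynomial kernel on two predictors. The kernel choice and the explicit expansion are assumptions picked for the example; they show that the kernel computed from the original inner product matches the inner product of the expanded predictor set.

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed from the inner product alone."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2

def expand(x):
    """Explicit degree-2 expansion of a two-predictor sample."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

x, z = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly_kernel(x, z)                                    # two predictors only
via_expansion = sum(a * b for a, b in zip(expand(x), expand(z)))  # six expanded terms
```

    Both routes give 144 (up to rounding), but the kernel never builds the six-term expansion.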

    Cost functions

    Support vector machines also include a regularization parameter that controls how much the regression line can adapt to the data
    - smaller values result in more linear (i.e. flat) surfaces

    This parameter is generally referred to as "Cost"

    As previously mentioned, there is a way to analytically estimate the tuning parameter for the RB

    Comparison

    Bagging can significantly increase performance of trees
    - from resampling:

    The cost is computing time and the loss of interpretation

    One reason that bagging works is that single trees are unstable
    - small changes in the data may drastically change the tree

    Training Data (bootstrap)    Test

    RMSE    R²    RMSE    R²

    Single Tree ;:1 :D 0:2 :D

    Bagging 0:,2 :D@ ,:@ :2;

    Random

    Random forests models are similar to bagging
    - separate models are built for each bootstrap sample
    - the largest tree possible is fit for each bootstrap sample

    However, when random forests starts to make a new split, it only considers a random subset of predictors
    - The subset size is the (optional) tuning parameter

    Random forests defaults to a subset size that is the square root of the number of predictors and is typically robust to this parameter
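    A small sketch of the predictor sampling done at each split, assuming the square-root default is rounded down. The function name, predictor names, and seed are hypothetical.

```python
import math
import random

def sample_predictors(predictors, rng, subset_size=None):
    """Pick the random subset of predictors considered at one split."""
    if subset_size is None:
        # Default described above: square root of the number of predictors.
        subset_size = max(1, int(math.sqrt(len(predictors))))
    return rng.sample(predictors, subset_size)

predictors = ["x%d" % i for i in range(1, 13)]   # 12 hypothetical predictors
rng = random.Random(1)
candidates = sample_predictors(predictors, rng)
```

    A fresh subset would be drawn for every split in every tree, which is what decorrelates the trees relative to plain bagging.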

    Random Predictor Illustration

    [Diagram: randomly select a subset of variables from the original data to form Dataset 1, Dataset 2, …, Dataset M; build a tree on each dataset; predict from each tree]

    Prediction of an observation, x:

    $$F(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x)$$

    Properties of Random

    Variance reduction
    - Averaging predictions across many models provides more stable predictions and model accuracy (Breiman, 1996)

    Robustness to noise
    - All observations have an equal chance to influence each model in the ensemble
    - Hence, outliers have less of an effect on individual models for the overall predicted values

    Comparison

    Comparing the three methods using resampling:

    Both bagging and random forests are "memoryless"
    - each bootstrap sample doesn't know anything about the other samples

    Training Data (bootstrap)    Test

    RMSE    R²    RMSE    R²

    Single Tree ;:1 :D 0:2 :D

    Bagging 0:,2 :D@ ,:@ :2;

    )and

    A method to "boost" weak learning algorithms (small trees) into strong learning algorithms
    - Kearns and Valiant (1989), Schapire (1990),

    [Diagram: boosting over stages 1, 2, …, M. At each stage a weighted tree is built on all n = 200 observations (splits of 90/110, 64/136, and 161/39 are shown), the stage error and the stage weight are computed, and the weight of the ith observation (i = 1, 2, …, n) is determined for reweighting: the larger the error, the higher the weight.]

    Boosting Trees

    Boosting has three tuning parameters:
    - number of iterations (i.e. trees)
    - complexity of the tree (i.e. number of splits)
    - learning rate: how quickly the algorithm adapts

    This implementation is the most computationally taxing of the tree methods shown here

    Prediction of an observation, x:

    $$F(x) = \sum_{m=1}^{M} \beta_m f_m(x)$$

    where the $\beta_m$ are constrained to sum to 1
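    The weighted prediction can be sketched as follows. The stand-in "trees" are plain functions and the weights are made up; the only property taken from the text is that the stage weights sum to 1.

```python
def ensemble_predict(x, models, weights=None):
    """Weighted sum of model predictions; equal weights give a plain average."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * f(x) for w, f in zip(weights, models))

# Hypothetical stand-ins for fitted trees.
models = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: 3.0]
boosted = ensemble_predict(2.0, models, [0.5, 0.25, 0.25])
averaged = ensemble_predict(2.0, models)
```

    With equal weights the same function reduces to the simple average used by bagging-style ensembles.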

    Properties of Boosting

    Robust to overfitting
    - As the number of iterations increases, the test set error does not increase
    - Schapire, et al. (1998),

    One approach to training is to set the learning rate to a high value (0.1) and tune the other two parameters

    In the plot to the right, a grid of combinations of the 2 tuning parameters was used to optimize the model

    The optimal settings were:

    # ; trees with high co%"lexit!

    Comparison Summary

    Comparing the four methods:

    Training Data (bootstrap)    Test

    RMSE    R²    RMSE    R²

    Single Tree ;:1 :D 0:2 :D

    Bagging 0:,2 :D@ ,:@ :2;

    )and

    Model Building Training

    Model Comparisons

    Which Model is Best?

    The No

    Excellent    Very Good    Average

    ZV = zero var predictor, NZV = near-zero var predictor, CS = center+scale, HCP = highly correlated predictor

    * Depends on implementation

    Boston Housing Data

    The correlation between the results on the training set (n = 337) via cross-validation and the results from the test set (n = 169) were 0.71 (RMSE) and 0.65 (R²)

    Some Advice

    There is an inverse relationship between performance and interpretability

    We want the best of both worlds: great performance and a simple, intuitive model

    Try this:

    Regression Datasets

    Internet Movie Data Base

    IMDB is an on-line resource that catalogs movies and TV programs from many countries.

    Basic information about the program is maintained and users can rate each program on a five point scale.

    We extracted information about movies and captured:
    - the average vote
    - the number of votes
    - basic information: run time, rating (if any), year of release, etc
    - genre: drama, comedy etc and
    - keywords: based on novel, female lead, title spoken by character…

    Can we predict the movie rating based on these data?

    Tecator Spectroscopy Data

    The variables are spectral measurements at specific wavelengths and are highly autocorrelated.

    We wish to predict the percent fat for each sample.

    Towson Home Sales

    Information about homes sold in the Towson, Maryland area (north of Baltimore) was collected.

    The area encompasses the northern border of Baltimore city (Idlewylde), suburban areas (Anneslie, Rodgers

    Regression Backup Slides

    SVM Model

    MARS Model

    Regression Tree Model

    Boosting Tree Model

    Variable Importance for PLS

    To understand the importance of each factor, we can look at a weighted sum of the absolute regression coefficients
    - the weights are based on the decrease in error as more components are added

    We can also look at the loadings to get a more detailed assessment

    Variable Importance for PLS

    Here, we can look at the increase in R² as model terms are added

    If the variable is never used in a term, it has an importance of zero

    Variable Importance for Regression Trees

    Here, we can look at the decrease in MSE as model terms are added

    If the variable is never used in a split, it has an importance of zero

    Variable Importance for Random

    A permutation approach is used

    The training data for each variable are scrambled in turn and the % increase in the out-of-bag MSE is tracked
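    The permutation idea can be sketched as below. This simplified version scores a single stand-in model on the data it is given rather than on out-of-bag samples, and all names, data, and the seed are hypothetical.

```python
import random

def mse(model, rows, y):
    """Mean squared error of a model over a set of rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

def permutation_importance(model, rows, y, column, rng):
    """Percent increase in MSE after scrambling one predictor column."""
    baseline = mse(model, rows, y)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return 100.0 * (mse(model, permuted, y) - baseline) / baseline

# Hypothetical data: the outcome depends on column 0 only, plus small noise.
rows = [[float(i), float(i % 3)] for i in range(1, 9)]
y = [2.0 * r[0] + (0.1 if i % 2 else -0.1) for i, r in enumerate(rows)]
model = lambda r: 2.0 * r[0]
rng = random.Random(0)
importance = [permutation_importance(model, rows, y, c, rng) for c in range(2)]
```

    Scrambling a predictor the model never uses leaves the error unchanged, so its importance is zero.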

    Boosting,

    Boosting fits a forward stagewise additive model (Hastie, Tibshirani and

    learning rate:
    - a parameter that controls the rate of learning of observations that overlap on a decision boundary (

    Linear regression models will fail if there are zero-variance predictors included
    - They will also fail during cross-validation if any near-zero variance predictors are in the data

    As just discussed, removing highly correlated predictors is strongly suggested

    Centering and scaling are not required, but can greatly increase the numerical stability of the model
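    Two of the pre-processing steps mentioned above can be sketched with the Python standard library. The function names are hypothetical, and the sample standard deviation is an assumed choice of scale.

```python
import statistics

def is_zero_variance(column):
    """True when every value in the predictor column is identical."""
    return len(set(column)) <= 1

def center_scale(column):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]

scaled = center_scale([1.0, 2.0, 3.0])
constant = is_zero_variance([3.0, 3.0, 3.0])
```

    Zero-variance columns should be dropped before scaling, since their standard deviation is zero and the division would fail.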

    PLS Pre-Processing

    Because of its dimension reduction abilities, PLS is resistant to zero- and near-zero variance predictors

    Also, since PLS can handle (and perhaps exploit) correlated predictors, it is not necessary to remove them

    Centering and scaling are extremely important for PLS models
    - otherwise, the predictors with large variability can dominate the selection of components

    Neural Network Pre-Processing

    Neural network models will not fail with zero-variance predictors
    - However, these models use a large number of parameters and near-zero variance predictors may lead to numerical issues such as a failure to converge

    Highly correlated predictors should be removed
    - multicollinearity can have a significant effect on model performance

    Centering and scaling are required

    MARS Pre-Processing

    MARS models are resistant to zero- and near-zero variance predictors

    Highly correlated predictors are allowed, but this can lead to a significant amount of randomness during the predictor selection process
    - The split choice between two highly correlated predictors becomes a toss-up

    Centering and scaling are not required but are suggested

    Tree Pre-Processing

    A basic regression tree requires very little pre-processing
    - missing predictor values are allowed
    - centering and scaling are not required
      centering and scaling do not affect results
    - highly correlated predictors are allowed
      Including highly correlated descriptors can cause instability and make descriptor importance rankings somewhat random
    - zero- and near-zero variance predictors are allowed

    Model Building Training

    Classification-type Models

    Setting

    Response is categorical

    Response may have more than two categories

    Objective

    To construct a model of predictors that can be used to predict a response

    Data

    Model

    Prediction

    Classification Methods

    Discriminant analysis framework
    - Linear, quadratic, regularized, flexible, and partial least squares discriminant analysis

    Modern classification methods
    - Tree-based ensemble methods
      Boosting and random forests
    - Neural networks
    - Support vector machines
    - k-nearest neighbors
    - Naive Bayes

    Each of these methods seeks to find a partitioning of the data that minimizes classification error

    Evaluating Classification Model Performance

    Like regression models, we desire to understand the predictive ability of a classification model.

    We can evaluate a model's performance by using cross-validation or a test set of data.

    Minimize classification error (or maximize accuracy)
    - Determine how well the model prediction agrees with the actual classification of observations.

                          Predicted
                          Active    Inactive    Total
    Actual    Active      A         B           A+B
              Inactive    C         D           C+D
              Total       A+C       B+D         N

    Intuition

    An intuitive measure of accuracy is (A + D) / N
    - When the actual classes are balanced, this is an appropriate measure of model performance.

    But, this measure produces the same values for different tables:

    5ctive Cnactive

    5ctive ; ;

    Cnactive ; 0;

    5ctive Cnactive

    5ctive ; ;

    Cnactive ; 0;

    vs

    5ccurac! +or both tables is :

    Does one table show more agreement than the other?

    Another Measure: Kappa

    To provide a measure of agreement for unbalanced tables, Cohen (1960) proposed comparing the observed agreement to the expected agreement

    To compute Kappa, we need
    - The observed agreement: O = (A + D) / N
    - The expected agreement:

    $$E = \frac{(A+C)(A+B) + (B+D)(C+D)}{N^{2}}$$

    Kappa is defined as: κ = (O - E) / (1 - E)
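    The Kappa calculation can be sketched directly from the cell labels A, B, C, D of the two-class table. The counts used here are hypothetical and are not the counts from the slides.

```python
def accuracy(a, b, c, d):
    """Observed agreement O = (A + D) / N."""
    return (a + d) / (a + b + c + d)

def kappa(a, b, c, d):
    """Cohen's Kappa from the four cells of a 2x2 table."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + c) * (a + b) + (b + d) * (c + d)) / n ** 2
    return (observed - expected) / (1.0 - expected)

# Hypothetical table: A = 40, B = 10, C = 5, D = 45.
acc = accuracy(40, 10, 5, 45)
k = kappa(40, 10, 5, 45)
```

    For these counts O = 0.85 and E = 0.5, so Kappa is 0.35 / 0.5 = 0.7.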

    Kappa Properties

    Generally: -1 ≤ κ ≤ 1
    - values close to 0 indicate poor agreement
    - values close to 1 indicate near perfect agreement
      for complete disagreement, κ = -1
    - "Values of 0.4 or above are considered to indicate moderate agreement, and values of 0.8 or higher indicate excellent agreement." (Stokes, Davis, and Koch, 2001)

    Can be generalized to > 2 classes

    5ctive Cnactive

    5ctive ; ;

    Cnactive ; 0;

    5ctive Cnactive

    5ctive ; ;

    Cnactive ; 0;

    κ = 0.49    κ = 0.65

    Note9 hen the observed classes are balanced. 8a""a accurac!

    Another Measure: Receiver Operating Characteristic (ROC) Curves

    ROC curves can be used to assess a classification model's performance or to compare several models' performance

    Building an ROC curve requires that the model produces a continuous prediction

    Terminology:
    - Sensitivity = True Positive Rate = TP / (TP +

    All observations with predicted probabilities below the cutoff are classified as negative
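    A sketch of how one point on an ROC curve is obtained from a cutoff. It assumes the standard definitions of sensitivity, TP / (TP + FN), and specificity, TN / (TN + FP); the probabilities and labels are made up.

```python
def sens_spec(probs, actual, cutoff):
    """Sensitivity and specificity when probabilities below the cutoff are negative."""
    tp = sum(1 for p, a in zip(probs, actual) if p >= cutoff and a == 1)
    fn = sum(1 for p, a in zip(probs, actual) if p < cutoff and a == 1)
    tn = sum(1 for p, a in zip(probs, actual) if p < cutoff and a == 0)
    fp = sum(1 for p, a in zip(probs, actual) if p >= cutoff and a == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical predicted probabilities and true classes (1 = positive).
probs = [0.9, 0.8, 0.4, 0.3, 0.2]
actual = [1, 1, 1, 0, 0]
sensitivity, specificity = sens_spec(probs, actual, cutoff=0.5)
```

    Repeating this over a range of cutoffs and plotting sensitivity against 1 - specificity traces out the curve.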

    Classification Model Predictions

    Several classification models generate a predicted value for each class in the original data
    - PLSDA,

    observation into group g.

    The probability that the observation is in group g is:

    $$p_g = \frac{e^{\hat{y}_g}}{\sum_{i=1}^{K} e^{\hat{y}_i}}$$

    where K is the total number of groups
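    A minimal softmax sketch that turns one predicted value per group into probabilities that sum to 1. Subtracting the largest score before exponentiating is a numerical-stability choice added here, not something taken from the slides.

```python
import math

def softmax(scores):
    """Convert per-group model scores into class probabilities."""
    top = max(scores)                          # shift for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for K = 3 groups.
probabilities = softmax([1.0, 2.0, 3.0])
predicted_group = probabilities.index(max(probabilities))
```

    The group with the largest score always receives the largest probability, so the predicted class is unchanged by the transformation.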

    Discriminant Models

    Classical Discriminant Models

    These models form a discriminant function that can be used to classify samples

    The discriminant function is a linear function of the predictors that attempts to:

    This is a latent variable method similar to PLS and others that we have seen
    - how the latent variable is created differs between methods

    Linear Discriminant Analysis

    Assumption: the within group variability is the same for each group.

    The plot on the right shows a three class example where a linear method like LDA is most effective

    Aside: LDA and Logistic Regression

    It turns out that LDA and logistic regression are fitting models that are very similar
    - LDA assumes that the predictors are measured with error and that the classification of the observations is known
    - LR assumes that the predictors are known and that the classification of the observations is measured with error

    Assuming that the response error is Normal, the optimal separating plane for logistic regression is:

    LDA estimates a large number of parameters and has fairly strict constraints on the data

    Also, logistic models may be more forgiving of skewed predictor distributions

    Example Data

    set, LDA doesn't do a very good job since the boundary is nonlinear

    The linear predictor is determined to be

    B7-redictor:2;57-redictor61:1

    (

    Aside: LDA and Large Number of Predictors

    Some classification models are not drastically affected by large numbers of predictors
    - In many cases, a number of predictors will be noise

    LDA has the potential to overfit
    - LDA class probability estimates become more extreme as the number of predictors becomes large even when there is no underlying difference

    A similar issue occurs in LR

    data set that was complete noise

    combinations of the original variables (scores) that are highly correlated with the response.

    Solution: Same as PLS for Regression

    The optimization problem defined by PLS can be solved through the following formulation:

    $$\arg\max_{a^{T}a=1,\; b^{T}b=1} \operatorname{Cov}^{2}(Xa, Yb) = \arg\max_{a^{T}a=1,\; b^{T}b=1} \operatorname{var}(Xa)\,\operatorname{corr}^{2}(Xa, Yb)\,\operatorname{var}(Yb)$$

    subject to constraints 2a. or b.

    - The PLS directions are the eigenvectors of a modified between-class covariance matrix, $B$.
    - Coding of the response matrix does not matter
        either g columns or g-1 columns provide the same answer
    - The constraint in the Y-space does not make sense
        Why constrain a response that denotes class membership?
    - If the Y-space constraint is removed, the PLS directions are exactly the eigenvectors of the between-class covariance matrix, $B$
    - LDA is optimal if dimension reduction is not necessary
        The optimal directions for LDA are the eigenvectors of $W^{-1}B$.

    PLS Discriminant Analysis Example 1

    The softmax function is used to determine classification boundaries.

    PLS Discriminant Analysis Example 2

    Quadratic Discriminant Analysis

    Assumption: the within group variability is different for each group. The decision rule is
    - where [...] represents group [...],
    - The class with the largest score is the predicted class
    - A function of squared distance of each observation from each group's center

    The decision rule depends on the covariance matrix for each group
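The decision rule itself did not survive in this transcript. As a hedged sketch, the textbook form of the quadratic discriminant score is given below. The notation is an assumption of this sketch, not the slide's own: $x$ is an observation, and $\mu_k$, $\Sigma_k$, and $\pi_k$ are the mean vector, covariance matrix, and prior probability of group $k$.

```latex
\[
\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
  - \tfrac{1}{2}\,(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k)
  + \log\pi_k ,
\qquad
\hat{y}(x) = \arg\max_k\ \delta_k(x)
\]
```

The middle term is the squared distance of the observation from the group's center, measured with that group's own covariance matrix. Because $\Sigma_k$ differs by group, the set of points where two scores are equal is quadratic in $x$ rather than linear.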

    Quadratic Discriminant Analysis

    QDA extends the LDA model by using quadratic (i.e., nonlinear) classification boundaries

    However, the data requirements are more stringent
    - at least as many compounds as predictors in each class
    - no zero-variance or linearly dependent predictors

    Regularized Discriminant Analysis

    The method tries to split the difference between LDA and QDA.

    It uses two tuning parameters, gamma and lambda:
    - gamma controls the correlation assumption for the predictors
        as gamma -> 1 the model assumes less predictor correlation
    - lambda toggles between linear and quadratic boundaries
        gamma = 0, lambda = 1: LDA
        gamma = 0, lambda = 0: QDA

    Other combinations of gamma and lambda produce models that are compromises between LDA and QDA
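A minimal sketch of how two such tuning parameters can blend covariance estimates. It assumes one common parameterization (believed to match the `rda` function in R's klaR package, but treat that as an assumption): lambda first mixes the class covariance with the pooled covariance, then gamma shrinks the result toward a scaled identity matrix. The function name and the 2x2 matrices are hypothetical.

```python
def regularized_covariance(class_cov, pooled_cov, gamma, lam):
    """Blend a class covariance toward the pooled one (lam), then toward a scaled identity (gamma)."""
    p = len(class_cov)
    # Step 1: lam = 0 keeps the class covariance (QDA-like); lam = 1 uses the pooled one (LDA-like).
    mixed = [[(1 - lam) * class_cov[i][j] + lam * pooled_cov[i][j] for j in range(p)]
             for i in range(p)]
    # Step 2: gamma = 1 replaces the matrix with (trace / p) * identity, i.e. no predictor correlation.
    avg_var = sum(mixed[i][i] for i in range(p)) / p
    return [[(1 - gamma) * mixed[i][j] + (gamma * avg_var if i == j else 0.0)
             for j in range(p)] for i in range(p)]

class_cov = [[4.0, 1.5], [1.5, 1.0]]   # hypothetical covariance of one class
pooled_cov = [[2.0, 0.5], [0.5, 2.0]]  # hypothetical covariance pooled over all classes

lda_like = regularized_covariance(class_cov, pooled_cov, gamma=0.0, lam=1.0)
qda_like = regularized_covariance(class_cov, pooled_cov, gamma=0.0, lam=0.0)
uncorrelated = regularized_covariance(class_cov, pooled_cov, gamma=1.0, lam=0.0)
```

With gamma = 0 the two extremes of lambda reproduce the pooled (LDA) and class-specific (QDA) covariances; intermediate values give the compromise models.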

    Regularized Discriminant Analysis

    To see the effect of changing gamma:
    - RdaMovieA.gif

    To see the effect of changing lambda:
    - RdaMovieB.gif

    We can find the optimal gamma and lambda by cross-validation

    In addition to the original predictors, nonlinear functions of the predictors are added to the data
    - This is known as a "basis expansion" of the original data

    This procedure essentially builds a set of "one versus all" classification models
    - a 0/1 outcome is used for each model
    - the softmax function is used to convert the model output to class probabilities

    used

    hinge features
    - for these data, 3 sets of features were used to discriminate the classes

    Modern Classification Methods

    Classification Trees

    Like regression trees, classification trees search through each predictor to find a value of a single predictor that splits the data into two (or more) groups that are more pure than the original group.

    [Figure: example classification tree that splits on Pred A, Pred B, and Pred D at Thresh 1 through Thresh 4, with terminal nodes labeled class 1 or 2]

    Impurity Measures

    There are several measures for determining the purity of the split.

    Misclassification error: w1p1 + w2p2
    - When w1 = w2 = 0.5, ME = 0.5*(p1 + p2)

    Gini index: w1p1(1-p1) + w2p2(1-p2)
    - When w1 = w2 = 0.5, GI = 0.5*(p1(1-p1) + p2(1-p2))

    [Formulas defining the weights and proportions from the split counts; not recoverable from the extraction]
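The two measures can be sketched for a two-class, two-way split as follows. This is an illustration under stated assumptions, since the slide's own definitions did not survive: each node's weight is taken to be its share of the samples, the proportion in the misclassification error is the minority-class share of the node, and the proportion in the Gini index is the class-1 share of the node. The function name and the counts are hypothetical.

```python
def split_impurity(left_counts, right_counts):
    """Return (misclassification error, Gini index) for a two-class, two-way split.

    Each argument is (n_class1, n_class2) for one side of the split.
    Both sides are assumed to be non-empty.
    """
    n = sum(left_counts) + sum(right_counts)
    mis_error = 0.0
    gini = 0.0
    for counts in (left_counts, right_counts):
        size = sum(counts)
        weight = size / n                  # share of all samples that fall in this node
        p_class1 = counts[0] / size        # class-1 proportion in this node
        p_minority = min(counts) / size    # error rate if the node predicts its majority class
        mis_error += weight * p_minority
        gini += weight * p_class1 * (1 - p_class1)
    return mis_error, gini

# Hypothetical split: left node has 8 vs 2 samples, right node has 1 vs 9.
mis_error, gini = split_impurity((8, 2), (1, 9))
```

Both values are zero for a perfectly pure split and grow as the nodes become more mixed, so candidate partitions can be ranked by either measure.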

    Impurity Measure Comparison

    Simple Example

    In this example a few possible partitions clearly stand out:
    - a split on x1 at 5,
    - a split on x2 at 7.5, or
    - a split on x2 at 1.5

    How does each impurity measure rank these partitions?

    [Figure: scatter plot of the example data, x1 versus x2]

    Classification Results

    Ensemble Methods

    Like individual regression trees, single classification trees
    - are not optimal classification methods.
    - have high variability: small changes in the data can drastically affect the structure of the tree.

    Bagging, random forests, and boosting can also be implemented for classification problems

    Bagging, Random

    implemented in the same way as in regression.

    The objective is to minimize misclassification error
    - The loss function changes to exponential loss rather than squared error loss.

    Tuning parameters for these methods are the same as in regression
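To make the change of loss function concrete, the two losses can be written side by side. The notation is assumed here rather than taken from the slides: the classes are coded $y \in \{-1, +1\}$ and $f(x)$ is the real-valued output of the ensemble.

```latex
\[
L_{\text{squared}}\bigl(y, f(x)\bigr) = \bigl(y - f(x)\bigr)^{2},
\qquad
L_{\text{exp}}\bigl(y, f(x)\bigr) = \exp\bigl(-y\,f(x)\bigr),
\qquad y \in \{-1, +1\}
\]
```

The exponential loss is small when $f(x)$ has the same sign as $y$ and grows quickly when the sign is wrong, so it penalizes confident misclassifications most heavily.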

    Neural Networks

    Like PLS, neural networks for classification translate the classes to a set of binary (zero/one) variables.

    The binary variables are modeled using the predictors and the softmax technique is used to make sure that the model outputs behave like probabilities
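A minimal sketch of the softmax step, showing how raw per-class model outputs can be turned into values that behave like probabilities. The function name and the example outputs are hypothetical.

```python
import math

def softmax(scores):
    """Map raw model outputs (one per class) to values in (0, 1) that sum to 1."""
    shift = max(scores)                        # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_outputs = [2.0, 1.0, -1.0]                 # hypothetical outputs for three classes
probs = softmax(raw_outputs)
predicted_class = probs.index(max(probs))      # index of the most probable class
```

The ordering of the outputs is preserved, so the predicted class is the one with the largest raw output; softmax only rescales the values so they can be read as class probabilities.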

    complexity parameters:
    - The number of hidden units
    - The amount of weight decay

    The second parameter helps determine the smoothness of the classification boundaries

    objective function:
    - the margin

    Suppose we have two predictors and a bunch of compounds

    We may want to classify compounds as active or inactive

    Let's further suppose that these two predictors completely separate these classes

    The Margin

    There are an infinite number of straight lines that we can use to separate these two groups
    - some must be better than others

    The margin is defined by equally spaced boundaries on each side of the line

    The Margin

    To maximize the margin, we try to make it as large as possible
    - without capturing any compounds

    As the margin increases, the solution becomes more robust

    SVMs maximize the margin to estimate parameters
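The idea that some separating lines are better than others can be sketched by computing the margin for two candidate lines. This is an illustration only: the data, the two candidate lines, and the function name are hypothetical, and a real SVM searches over all lines rather than comparing two.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w[0]*x1 + w[1]*x2 + b = 0.

    labels are +1 / -1; a positive result means the line separates the classes.
    """
    norm = math.hypot(w[0], w[1])
    return min(y * (w[0] * x1 + w[1] * x2 + b) / norm
               for (x1, x2), y in zip(points, labels))

# Hypothetical, completely separable compounds: actives (+1) and inactives (-1).
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, 1, 1]

candidate_a = geometric_margin((1.0, 1.0), -5.5, points, labels)  # line x1 + x2 = 5.5
candidate_b = geometric_margin((1.0, 0.0), -2.5, points, labels)  # line x1 = 2.5
best = "a" if candidate_a > candidate_b else "b"
```

Both lines classify every compound correctly, but the first keeps all compounds farther from the boundary, which is the property the SVM objective rewards.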

    Support Vectors and Data Reduction

    When the classes overlap, points are allowed within the margin
    - the number of points is controlled by a cost parameter

    The points that are within the margin (or on its boundary) are the support vectors

    It turns out that the prediction function only uses the support vectors
    - the prediction equation is more compact and efficient
    - the model may be more robust to outliers
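A hedged sketch of why the prediction equation is compact: the decision value is a weighted sum over the support vectors only. The support vectors, coefficients, and intercept below are hypothetical values chosen so that the example is a valid maximum-margin line for two points; the function names are also assumptions of this sketch.

```python
def dot(u, v):
    """Plain inner product; swapping in another kernel function gives nonlinear boundaries."""
    return sum(a * b for a, b in zip(u, v))

def decision_value(x, support_vectors, coefs, labels, intercept, kernel=dot):
    """Sum over support vectors only; no other training point is needed to predict."""
    return intercept + sum(c * y * kernel(sv, x)
                           for sv, c, y in zip(support_vectors, coefs, labels))

# Hypothetical fitted model: one inactive (-1) and one active (+1) support vector.
support_vectors = [(1.0, 1.0), (3.0, 3.0)]
coefs = [0.25, 0.25]
labels = [-1, 1]
intercept = -2.0

at_inactive = decision_value((1.0, 1.0), support_vectors, coefs, labels, intercept)
at_active = decision_value((3.0, 3.0), support_vectors, coefs, labels, intercept)
at_boundary = decision_value((2.0, 2.0), support_vectors, coefs, labels, intercept)
```

The sign of the decision value gives the predicted class; points that are not support vectors never enter the sum, however many training compounds there were.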

    Nonlinear Boundaries

    Similar to regression models, the "kernel trick" can be used to generate highly nonlinear class boundaries

    [Figure: class boundary for an SVM with an RBF kernel, with the support vectors marked]

    The Effect of the Cost Parameter

    As the cost parameter is increased, the model will work very hard to correctly classify the compounds
    - This can lead to over-fitting

    To see the effect of the cost parameter, the link below shows an animation for a radial basis function SVM
    - SvmMovieB.gif (http://pfizerpedia/index.php/Image:SvmMovieB.gif)

    Note that, as the boundary becomes more complicated, the number of SVs decreases
    - The margin is becoming very small

    Nearest Neighbor Classifiers

    To predict the class of a new compound, this procedure uses the most frequent class of the closest k neighbors
    - if a tie, randomly pick from the most frequent classes

    k, the number of neighbors, is the tuning parameter

    Since distance is used to define the nearest points, the predictors should be centered and scaled
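The procedure can be sketched in a few lines: center and scale the predictors using the training data, find the k closest training points, and take the most frequent class, picking at random among tied classes. The data, class labels, and function names are hypothetical, and the sketch assumes no predictor has zero variance.

```python
import math
import random
from collections import Counter

def center_and_scale(rows):
    """Return column means, standard deviations, and the standardized rows."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1)) for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]
    return means, sds, scaled

def knn_predict(train_x, train_y, new_x, k, rng=random):
    """Most frequent class among the k closest training points; ties broken at random."""
    means, sds, scaled = center_and_scale(train_x)
    # The new sample is scaled with the training means and standard deviations.
    new_scaled = [(v - m) / s for v, m, s in zip(new_x, means, sds)]
    by_distance = sorted(zip(scaled, train_y), key=lambda pair: math.dist(pair[0], new_scaled))
    votes = Counter(label for _, label in by_distance[:k])
    top = max(votes.values())
    return rng.choice(sorted(label for label, count in votes.items() if count == top))

# Hypothetical compounds with two predictors on very different scales.
train_x = [[1.0, 200.0], [2.0, 220.0], [1.5, 210.0], [8.0, 900.0], [9.0, 950.0], [8.5, 920.0]]
train_y = ["inactive", "inactive", "inactive", "active", "active", "active"]
prediction = knn_predict(train_x, train_y, [8.2, 910.0], k=3)
```

Without the scaling step the second predictor would dominate the distance simply because its values are larger, which is why centering and scaling matter for this model.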

    Nearest Neighbor Classifiers

    the model was tuned across k values from 1 to 20
    - 7 neighbors was found to be optimal

    kNN class boundaries tend to be somewhat jagged but smooth out as k increases

    Naïve Bayes

    Recall Bayes theorem:

    Of course, the predictor distributions are usually multivariate and these probabilities would involve multidimensional integration

    Naïve Bayes

    In "naïve Bayes," aka "Idiot's Bayes," the relationships between predictors are ignored
    - i.e. all predictors are treated as uncorrelated
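The formula for Bayes theorem did not survive in this transcript. As a hedged reminder, its standard form for classification is shown on the left below, and the naïve simplification of the class-conditional distribution is shown on the right. The notation is assumed here: $Y$ is the class, $X = (X_1, \dots, X_p)$ are the predictors, and $k$ and $l$ index classes.

```latex
\[
P(Y = k \mid X) = \frac{P(X \mid Y = k)\,P(Y = k)}{\sum_{l} P(X \mid Y = l)\,P(Y = l)},
\qquad
P(X \mid Y = k) \approx \prod_{j=1}^{p} P(X_j \mid Y = k)
\]
```

The product on the right replaces one multivariate distribution with $p$ one-dimensional distributions, which removes the need for multidimensional integration.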

    Naïve Bayes

    Despite this assumption, this model usually is very competitive, even with strong correlations

    How do we estimate continuous predictor distributions?
    - parametrically: assume normality and use the sample mean and variance
    - non-parametrically: use a nonparametric density estimator
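The parametric option can be sketched as follows: for each class, estimate the mean and standard deviation of every predictor separately, then multiply the one-dimensional normal densities by the class prior. The data and function names are hypothetical, and the sketch assumes every class has at least two samples and no predictor is constant within a class.

```python
import math

def normal_density(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def fit_naive_bayes(rows, labels):
    """Per class: prior plus (mean, sd) of every predictor, each estimated separately."""
    model = {}
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        stats = []
        for col in zip(*members):
            mean = sum(col) / len(col)
            sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
            stats.append((mean, sd))
        model[cls] = (len(members) / len(rows), stats)
    return model

def posterior(model, x):
    """Class probabilities: prior times the product of one-dimensional densities, normalized."""
    scores = {}
    for cls, (prior, stats) in model.items():
        score = prior
        for value, (mean, sd) in zip(x, stats):
            score *= normal_density(value, mean, sd)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Hypothetical training data with two predictors.
rows = [[-1.2, 0.4], [-0.8, 0.9], [-1.0, 0.1], [0.9, -0.2], [1.3, 0.3], [1.1, -0.6]]
labels = ["active", "active", "active", "inactive", "inactive", "inactive"]
model = fit_naive_bayes(rows, labels)
probs = posterior(model, [-1.0, 0.5])
```

For the non-parametric option, `normal_density` would be replaced by a density estimate built from each class's observed values; the rest of the calculation stays the same.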

    Naïve Bayes

    For predictor A in our example, we see a slight shift between the distributions of the predictor for each class:

    Naïve Bayes

    If a new sample has a value of predictor A of -1, it is more likely to be active
    - active density ≈ .40
    - inactive density ≈ .17

    Naïve Bayes

    larger for values between -0.5 and 0.5

                      Sample 1                    Sample 2
                Pred A  Pred B  Total       Pred A  Pred B  Total
    Active       .40     .14     .06         .40     .30     .12
    Inactive     .17     .62     .10         .17     .08     .01

    Naïve Bayes and Many Predictors

    Like LDA, naïve Bayes models can overfit when many noisy predictors are included in the model

    As with LDA, we simulated noise data and were able to see class separation increase as the number of predictors went up

    Naïve Bayes Classifiers

    Class boundaries for naïve Bayes models can show circular or elliptical islands

    Since the predictors are treated as uncorrelated, there cannot be any diagonal ellipses

    Example: Prediction of Spam

    We would like to classify emails as being spam with an emphasis on high specificity, i.e. a low probability of non-spam being labeled as spam

    Method Comparison

    ROC Comparison
    Classification Datasets

    Glaucoma Data

    62 variables are derived from a confocal laser scanning image of the optic nerve head, describing its morphology.

    Observations are from normal and glaucomatous eyes, respectively. Examples of variables are:
    - as: superior area
    - vbss: volume below surface superior
    - mhcn: mean height contour nasal
    - vari: volume above reference inferior, etc

    We would like to predict whether a subject has glaucoma given their imaging data

    Predicting Diabetes in Pima Indians

    These data are from Pima Indian women living in Arizona. Several variables were collected, such as:
    - pregnant: number of pregnancies
    - glucose: plasma glucose levels
    - pressure: diastolic BP
    - triceps: skin fold thickness
    - insulin: serum insulin
    - mass: body mass index
    - pedigree: diabetic pedigree function
    - age
    - diabetes: negative or positive

    We would like to predict a new Indian woman's diabetic status given her other information.

    Classification Backup Slides

    - missing predictor values are allowed
    - centering and scaling are not required
        centering and scaling do not affect results
    - highly correlated predictors are allowed
        Including highly correlated predictors can cause instability and make predictor importance rankings somewhat random
    - zero- and near-zero variance predictors are allowed

    RDA Pre-Processing

    RDA models cannot deal with zero- and near-zero variance predictors
    - they must be removed

    Highly correlated predictors are allowed, but not suggested
    - However, perfectly correlated predictors will cause the model to fail

    Centering and scaling are not required but are suggested

    Additionally, there cannot be linear dependencies between predictors

    Neural Network Pre-Processing

    Neural network models will not fail with zero-variance predictors

    However, these models use a large number of parameters and near-zero variance predictors may lead to numerical issues such as a failure to converge

    Highly correlated predictors should be removed.

    Centering and scaling are required
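A minimal sketch of two of the checks mentioned in these pre-processing notes: flagging zero-variance predictors and flagging pairs of highly correlated predictors. The function names, the 0.9 cutoff, and the data are hypothetical, and the exact-zero test below does not catch near-zero variance predictors, which need a separate rule.

```python
import math

def column_stats(col):
    """Sample mean and variance of one predictor."""
    mean = sum(col) / len(col)
    var = sum((v - mean) ** 2 for v in col) / (len(col) - 1)
    return mean, var

def zero_variance_columns(rows):
    """Indices of predictors that take a single value."""
    return [j for j, col in enumerate(zip(*rows)) if column_stats(col)[1] == 0.0]

def correlation(col_a, col_b):
    """Pearson correlation between two non-constant predictors."""
    mean_a, var_a = column_stats(col_a)
    mean_b, var_b = column_stats(col_b)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b)) / (len(col_a) - 1)
    return cov / math.sqrt(var_a * var_b)

def highly_correlated_pairs(rows, cutoff=0.9):
    """Pairs of non-constant predictors whose absolute correlation exceeds the cutoff."""
    cols = list(zip(*rows))
    constant = set(zero_variance_columns(rows))
    keep = [j for j in range(len(cols)) if j not in constant]
    return [(i, j) for n, i in enumerate(keep) for j in keep[n + 1:]
            if abs(correlation(cols[i], cols[j])) > cutoff]

# Hypothetical data: column 1 is constant, column 2 is roughly twice column 0.
rows = [[1.0, 5.0, 2.1, 0.3],
        [2.0, 5.0, 3.9, 0.9],
        [3.0, 5.0, 6.2, 0.1],
        [4.0, 5.0, 7.8, 0.7]]
constant_columns = zero_variance_columns(rows)
correlated = highly_correlated_pairs(rows)
```

Which flagged predictors to drop is a modeling decision; the sketch only reports them.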

    Nearest Neighbor Pre-Processing

    These models are resistant to zero- and near-zero variance predictors as well as highly correlated predictors

    Centering and scaling are required

    Naïve Bayes Pre-Processing

    These models will not fail with zero-variance predictors

    Highly correlated predictors are also allowed.

    Centering and scaling are not required

    Model Building Training

    Other Considerations

    Variables to Select

    Variables thought to be related to the response should be included in the model

    Sometimes we don't know if a set of variables is related to the response

    Should these be included in the analysis?

    If the variables are not related to the response, then we are including noise into our predictor set

    What happens to the performance of the techniques when noise is added?
    - Can we still find signal?

    Illustration

    To the blood brain barrier data of Mente and Lombardo (2005), we have added increasing numbers of random predictors

    [Figure: Performance Comparison, test set R2 versus the amount of added noise]

    Variables to Select

    Hopefully, we've demonstrated that resampling is a good way to avoid over-fitting

    Realize that predictor selection is part of the modeling process

    Doing predictor selection outside of cross-validation can lead to severe predictor selection bias
    - and potential over-fitting (but you won't know until a test set)
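A hedged sketch of what "selection inside cross-validation" looks like in code: the predictors are re-selected within every fold using only that fold's training rows, so the held-out rows never influence the choice. The correlation filter, the nearest-centroid classifier, and the simulated data are stand-ins chosen for brevity, not the methods used in the slides.

```python
import random

def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def select_predictors(rows, labels, keep):
    """Rank predictors by |correlation| with the 0/1 class label; return the top `keep` indices."""
    cols = list(zip(*rows))
    ranked = sorted(range(len(cols)), key=lambda j: abs_correlation(cols[j], labels), reverse=True)
    return ranked[:keep]

def nearest_centroid_accuracy(train, test, chosen):
    """Fit class centroids on the chosen predictors of `train`; return accuracy on `test`."""
    centroids = {}
    for cls in (0, 1):
        members = [row for row, y in train if y == cls]
        centroids[cls] = [sum(r[j] for r in members) / len(members) for j in chosen]
    hits = 0
    for row, y in test:
        dist = {cls: sum((row[j] - c) ** 2 for j, c in zip(chosen, cen))
                for cls, cen in centroids.items()}
        hits += min(dist, key=dist.get) == y
    return hits / len(test)

def cross_validate(rows, labels, folds=5, keep=2, seed=1):
    """Average held-out accuracy, with predictor selection repeated inside every fold."""
    data = list(zip(rows, labels))
    random.Random(seed).shuffle(data)
    accuracies = []
    for f in range(folds):
        test = data[f::folds]
        train = [d for i, d in enumerate(data) if i % folds != f]
        # Selection happens here, inside the loop, using the training part only.
        chosen = select_predictors([r for r, _ in train], [y for _, y in train], keep)
        accuracies.append(nearest_centroid_accuracy(train, test, chosen))
    return sum(accuracies) / folds

# Simulated data: predictor 0 carries signal, the other 20 are pure noise.
rng = random.Random(0)
labels = [i % 2 for i in range(40)]
rows = [[3 * y + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(20)] for y in labels]
estimate = cross_validate(rows, labels)
```

Calling `select_predictors` once on all rows before the loop would be the "outside of cross-validation" version; its accuracy estimate tends to be optimistic because the held-out rows helped choose the predictors.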

    Effects of Categorizing a Continuous Response

    A majority of responses are measured on a continuous scale

    The continuous scale allows us to compare observations on their original scale

    Sometimes the continuous response naturally falls into two or more modes
    - If the relative distance between these modes is not relevant, then the response can be binned
    - However, if the distance between modes is relevant, then we lose information by binning the response

    Binning a continuous response that does not have natural modes will make us lose even more information and will degrade the model

    Thanks

    Thanks for sitting through all this

    More thanks to:
    - Benevolent overlords David Potter and Ed Kadyszewski
    - Nathan Coulter and Gautam Bhola for computing