
Post on 15-Dec-2015


Automatic Feature Selection

Feb 2015

Update on Hadoop / R

Try HortonWorks Sandbox

Get a VM player

Download and install the OVA (VM file from HortonWorks): http://hortonworks.com/products/hortonworks-sandbox/#install

Do the tutorials here: http://hortonworks.com/tutorials/

Add R / RStudio Server to your VM

Use RHadoop to interface Hadoop and R

Issue

Many predictive analytical models will work. Which among the many is best?

Example Data – HVAC building log data

variable                 row 1    row 2    row 3         row 4
date                     6/1/13   6/1/13   6/1/13        6/25/13
time                     0:00:01  0:00:01  0:00:01       0:13:19
target.temp              69       66       69            70
actual.temp              55       58       60            71
system                   14       13       5             19
system.age               6        20       8             14
building.id              17       4        7             18
temp.diff                14       8        9             -1
temp.range               COLD     COLD     COLD          NORMAL
extreme.temp             1        1        1             0
country                  Egypt    Finland  South Africa  Indonesia
hvac.product             FN39TG   GG1919   FN39TG        JDNS77
building.age             11       17       13            25
building.manager         M17      M4       M7            M18
service.center.distance  150      115      100           68
days.since.service       142      109      164           86
he.efficiency            12       22       2             36
fan.hours                17       16       15            8
coolant.type             B12      B12      B12           B12
software.release         P10      P10      P10           P10
ave.outside.temp         91       46       77            80
software.P12             0        0        0             0
coolant.B12              1        1        1             1
neg.diff                 1        1        1             -1
abs.diff                 14       8        9             1
diff.size                3        2        2             1
cut.off                  1        1        1             0

What to look for among models

R-squared (linear models)

Variable Significance

# of Variables that are significant

Sign of Variables

Confusion Matrix “Score” (non-linear models)

AIC number (non-linear models)
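The slides do not say how the confusion matrix "score" is computed. One common choice is accuracy: the share of predictions that fall on the diagonal. A minimal sketch under that assumption, for a 0/1 outcome such as cut.off. It is written in Python rather than the R used in the talk, the function names are hypothetical, and the sample values are made up.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a 0/1 outcome."""
    counts = Counter(zip(actual, predicted))
    # Rows are actual classes, columns are predicted classes.
    return [[counts[(a, p)] for p in (0, 1)] for a in (0, 1)]

def matrix_score(matrix):
    """ASSUMPTION: score = accuracy (diagonal count over total count)."""
    total = sum(sum(row) for row in matrix)
    correct = matrix[0][0] + matrix[1][1]
    return correct / total if total else 0.0

actual = [1, 1, 1, 0, 0, 1, 0, 0]      # made-up cut.off values
predicted = [1, 1, 0, 0, 0, 1, 1, 0]   # made-up model predictions
m = confusion_matrix(actual, predicted)
print(m, matrix_score(m))
```

Other scores (sensitivity, specificity, a cost-weighted sum) would slot into matrix_score the same way.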

What to look for among models

Variables and Significance

AIC Score

Confusion Matrix

Confusion Matrix Score

Hand Done Model Outcome

Approach

Calculate the combinations of all independent variables

Write a function to:

Run each model possibility

For each of X (~10) samples of training / test data sets, collect:

# of variables that have significance < .1

“score” of the confusion matrix

Multiply # of significant variables by confusion matrix score, average over the sampling range, sort the results data frame
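The resampling step above can be sketched as follows. This is an illustration in Python, not the R function from the talk. The 70/30 split, the seed, and the name train_test_splits are assumptions; only the roughly 10 samples comes from the slide.

```python
import random

def train_test_splits(n_rows, n_samples=10, train_fraction=0.7, seed=42):
    """Yield (train_idx, test_idx) row-index lists, one pair per resample.

    ASSUMPTION: simple random splits with a 70/30 ratio; the slides only
    say that about 10 train/test samples are drawn.
    """
    rng = random.Random(seed)
    n_train = round(n_rows * train_fraction)
    for _ in range(n_samples):
        idx = list(range(n_rows))
        rng.shuffle(idx)
        yield sorted(idx[:n_train]), sorted(idx[n_train:])

splits = list(train_test_splits(n_rows=20))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each candidate model would then be fitted on every train set and scored on the matching test set.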

Step 1 – set up empty data frame to hold results

Step 2 – calculate all combinations of variables
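A sketch of how the combinations might be enumerated and turned into model formulas. It uses Python's itertools instead of R, the helper name model_formulas is hypothetical, and only four of the predictors from the example data are used to keep the output short. The top models listed in the results each have seven predictors.

```python
from itertools import combinations

def model_formulas(outcome, predictors, size):
    """Build one R-style formula string per combination of `size` predictors."""
    return [
        f"{outcome} ~ " + " + ".join(combo)
        for combo in combinations(predictors, size)
    ]

# Illustrative subset of the predictor names from the HVAC example data.
predictors = ["system", "system.age", "building.id", "building.age"]
formulas = model_formulas("cut.off", predictors, size=2)
for f in formulas:
    print(f)
```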

Step 3 – run function to estimate all models and save parameters

Step 4 – average all models and sort
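A sketch of the averaging and sorting step, in Python rather than R. It assumes each run is recorded as (formula, confusion matrix score, number of significant variables) and that the product is taken per sample before averaging. That order, the function name summarize, and the run values are assumptions, not taken from the talk.

```python
from collections import defaultdict
from statistics import mean

def summarize(runs):
    """runs: iterable of (formula, matrix_score, n_significant), one per resample."""
    by_model = defaultdict(list)
    for formula, score, n_sig in runs:
        by_model[formula].append((score, n_sig))
    rows = []
    for formula, results in by_model.items():
        rows.append({
            "Model": formula,
            "MatrixMean": mean(s for s, _ in results),
            "SigMean": mean(n for _, n in results),
            # ASSUMPTION: weight = mean of per-sample (score * n_significant).
            "Weighted": mean(s * n for s, n in results),
        })
    # Best weighted score first.
    return sorted(rows, key=lambda r: r["Weighted"], reverse=True)

runs = [  # made-up results for two hypothetical models, two resamples each
    ("cut.off ~ a + b", 0.5, 2),
    ("cut.off ~ a + b", 1.0, 4),
    ("cut.off ~ a + c", 0.5, 6),
    ("cut.off ~ a + c", 0.75, 4),
]
table = summarize(runs)
for row in table:
    print(row)
```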

Average of Top Models Are …

Model MatrixMean SigMean Weighted

cut.off ~ + system + building.id + hvac.product + building.age + building.manager + coolant.type + software.P12 0.79 5.60 4.45

cut.off ~ + system.age + building.id + hvac.product + building.age + building.manager + he.efficiency + coolant.type 0.88 5.00 4.39

cut.off ~ + building.id + hvac.product + building.age + building.manager + coolant.type + software.release + ave.outside.temp 0.85 4.90 4.17

cut.off ~ + system + building.id + hvac.product + building.manager + service.center.distance + coolant.type + ave.outside.temp 0.77 4.30 3.30

cut.off ~ + building.id + service.center.distance + days.since.service + fan.hours + coolant.type + software.release + software.P12 0.91 3.60 3.28

cut.off ~ + system + system.age + building.id + days.since.service + fan.hours + ave.outside.temp + software.P12 0.86 3.80 3.25

cut.off ~ + system + system.age + building.id + building.age + days.since.service + fan.hours + software.P12 0.84 3.80 3.18

cut.off ~ + building.id + country + building.manager + service.center.distance + days.since.service + fan.hours + coolant.type 0.88 3.60 3.17

cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.P12 0.87 3.60 3.14

cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.release 0.85 3.70 3.14

cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + software.P12 0.89 3.50 3.11

cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + ave.outside.temp 0.89 3.50 3.10

cut.off ~ + building.id + building.age + building.manager + service.center.distance + days.since.service + he.efficiency + coolant.type 0.88 3.50 3.09

cut.off ~ + building.id + country + building.manager + days.since.service + coolant.type + ave.outside.temp + software.P12 0.85 3.60 3.06

cut.off ~ + building.id + hvac.product + building.age + fan.hours + software.release + ave.outside.temp + software.P12 0.81 3.70 3.00

cut.off ~ + hvac.product + building.age + days.since.service + he.efficiency + coolant.type + ave.outside.temp + software.P12 0.91 3.30 3.00

Each of these should be tested again

More extensive use of varied train / test data sample sets

Stability of each model beyond the scoring

Chosen model “makes sense”

Alternative ways to do this …

Caret package function “rfe” (recursive feature elimination)

Try all variables first

Train and test the model with cross-validation

Calculate the most important variables

Eliminate the least important variables

Train and test the model again

Calculate the most important variables

Eliminate the least important variables

Repeat …
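The elimination loop can be sketched in a few lines. This is not caret's implementation: it is a bare Python skeleton in which the model fitting and cross-validation are hidden behind a caller-supplied rank_importance function, and the importance values below are made up. All names are hypothetical.

```python
def recursive_elimination(features, rank_importance, keep=1, drop_per_round=1):
    """Repeatedly rank the surviving features and drop the weakest.

    rank_importance(features) is supplied by the caller; it should refit and
    evaluate a model on those features and return {feature: importance}.
    Returns the list of surviving feature sets, one entry per round.
    """
    current = list(features)
    history = [list(current)]
    while len(current) > keep:
        importance = rank_importance(current)
        n_drop = min(drop_per_round, len(current) - keep)
        weakest = sorted(current, key=lambda f: importance[f])[:n_drop]
        current = [f for f in current if f not in weakest]
        history.append(list(current))
    return history

# Stand-in for a real model fit: fixed, made-up importance values.
toy = {"fan.hours": 0.1, "system.age": 0.6, "building.age": 0.4, "country": 0.2}
history = recursive_elimination(list(toy), lambda fs: {f: toy[f] for f in fs}, keep=2)
for step in history:
    print(step)
```

In a real run the importances change each round because the model is refitted on the reduced variable set, which is the point of the repeat step.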

Setting it up & running RFE

data frame of predictor variables

vector of outcome variable

max number of variables to keep

control functions

run recursive elimination model

Outcome of the RFE

Problems

Number of variable combinations can get HUGE

Might need multicore or parallel to get through it
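To get a feel for the growth, the counts can be computed directly. A small Python sketch: the predictor counts of 10, 20 and 30 are illustrative, and the size of 7 matches the seven-predictor models in the results table. The function name n_models is hypothetical.

```python
from math import comb

def n_models(n_predictors, size=None):
    """Number of candidate models: subsets of one size, or all non-empty subsets."""
    if size is not None:
        return comb(n_predictors, size)
    return 2 ** n_predictors - 1

for n in (10, 20, 30):
    # columns: predictors, 7-variable models, models of any size
    print(n, n_models(n, size=7), n_models(n))
```

Every one of these models is also refitted once per train / test sample, so the number of fits is roughly ten times larger again.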

Thank You

Brooke Aker

baker@bigdatalens.com