
Automatic Feature Selection

Feb 2015

Update on Hadoop / R

Try HortonWorks Sandbox

Get a VM player

Download and install OVA (VM file from HortonWorks): http://hortonworks.com/products/hortonworks-sandbox/#install

Do the tutorials here: http://hortonworks.com/tutorials/

Add R / RStudio Server to your VM

Use RHadoop to interface Hadoop and R

There are many predictive analytical models that will work. Which among the many is best?

Example Data – HVAC building log data

date                     6/1/13    6/1/13    6/1/13        6/25/13
time                     0:00:01   0:00:01   0:00:01       0:13:19
target.temp              69        66        69            70
actual.temp              55        58        60            71
system                   14        13        5             19
system.age               6         20        8             14
building.id              17        4         7             18
temp.diff                14        8         9             -1
temp.range               COLD      COLD      COLD          NORMAL
extreme.temp             1         1         1             0
country                  Egypt     Finland   South Africa  Indonesia
hvac.product             FN39TG    GG1919    FN39TG        JDNS77
building.age             11        17        13            25
building.manager         M17       M4        M7            M18
service.center.distance  150       115       100           68
days.since.service       142       109       164           86
he.efficiency            12        22        2             36
fan.hours                17        16        15            8
coolant.type             B12       B12       B12           B12
software.release         P10       P10       P10           P10
ave.outside.temp         91        46        77            80
software.P12             0         0         0             0
coolant.B12              1         1         1             1
neg.diff                 1         1         1             -1
abs.diff                 14        8         9             1
diff.size                3         2         2             1
cut.off                  1         1         1             0
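Several of the columns above appear to be derived from others. The sketch below (in Python for illustration; the slides themselves work in R) recomputes a few of them for the four sample records. The rules are assumptions inferred from those four rows only, not stated in the slides: temp.diff = target.temp - actual.temp, abs.diff is its absolute value, neg.diff is its sign, and software.P12 / coolant.B12 are 0/1 dummy flags.

```python
# Four sample records from the slide (only the columns needed here).
records = [
    {"target.temp": 69, "actual.temp": 55, "software.release": "P10", "coolant.type": "B12"},
    {"target.temp": 66, "actual.temp": 58, "software.release": "P10", "coolant.type": "B12"},
    {"target.temp": 69, "actual.temp": 60, "software.release": "P10", "coolant.type": "B12"},
    {"target.temp": 70, "actual.temp": 71, "software.release": "P10", "coolant.type": "B12"},
]


def derive(rec):
    """Return derived columns for one record (all rules are assumptions)."""
    diff = rec["target.temp"] - rec["actual.temp"]
    return {
        "temp.diff": diff,
        "abs.diff": abs(diff),
        # Sign of the difference; treating zero as +1 is a guess.
        "neg.diff": 1 if diff >= 0 else -1,
        # 0/1 dummy variables for one level of a categorical column.
        "software.P12": int(rec["software.release"] == "P12"),
        "coolant.B12": int(rec["coolant.type"] == "B12"),
    }


derived = [derive(r) for r in records]
```

The recomputed values match the temp.diff, abs.diff, neg.diff, software.P12 and coolant.B12 rows shown above for these four records.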

What to look for among models

R-squared (linear models)

Variable Significance

# of Variables that are significant

Sign of Variables

Confusion Matrix “Score” (non-linear models)

AIC number (non-linear models)
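As a rough illustration of the last two criteria, the sketch below computes a confusion-matrix "score" and an AIC value. It is in Python for illustration (the slides work in R) and makes two assumptions: the "score" is taken to be plain accuracy, which the slides do not define, and AIC uses the standard definition 2k - 2 ln(L).

```python
import math


def confusion_matrix(actual, predicted):
    """Count (TP, FP, FN, TN) for 0/1 labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def matrix_score(actual, predicted):
    """Accuracy: share of cases on the diagonal of the confusion matrix."""
    tp, fp, fn, tn = confusion_matrix(actual, predicted)
    return (tp + tn) / (tp + fp + fn + tn)


def bernoulli_log_likelihood(actual, probs):
    """Log-likelihood of 0/1 outcomes given predicted P(outcome = 1)."""
    return sum(math.log(p if a == 1 else 1 - p) for a, p in zip(actual, probs))


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood
```

For example, `matrix_score([1, 1, 1, 0], [1, 1, 0, 0])` gives 0.75, since three of the four predictions are correct.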

What to look for among models

Variables and Significance

AIC Score

Confusion Matrix

Confusion Matrix Score

Hand Done Model Outcome

Approach

Calculate the combinations of all independent variables

Write a function to:

Run each model possibility

For X (~10) samples of training / test data sets

Collect:

# of variables that have significance < .1

“Score” of the confusion matrix

Multiply # of significant variables by confusion matrix score, average over sampling range, sort results data frame
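The loop described above can be sketched as follows. This is Python for illustration only (the slides work in R), and it is not the author's code. `fit_and_score` is a hypothetical, caller-supplied function that fits one model on the training rows and returns the confusion-matrix score on the test rows together with the number of variables with significance < .1. The parameter names, the 70/30 split and the fixed subset size `k` are also assumptions.

```python
import itertools
import random


def formula(target, variables):
    """Build an R-style formula string such as 'cut.off ~ a + b'."""
    return target + " ~ " + " + ".join(variables)


def evaluate_all(rows, target, predictors, k, fit_and_score,
                 n_samples=10, train_frac=0.7, seed=0):
    """Score every k-variable model on several random train/test splits.

    fit_and_score(target, variables, train, test) is hypothetical and must
    return (matrix_score, n_significant) for one fitted model.
    """
    rng = random.Random(seed)
    results = []
    for combo in itertools.combinations(predictors, k):
        for sample in range(n_samples):
            # Fresh random split of the rows for each sample.
            shuffled = list(rows)
            rng.shuffle(shuffled)
            cut = int(len(shuffled) * train_frac)
            train, test = shuffled[:cut], shuffled[cut:]
            score, n_sig = fit_and_score(target, combo, train, test)
            results.append({
                "model": formula(target, combo),
                "sample": sample,
                "matrix": score,
                "sig": n_sig,
                # Per-sample weighting: significant variables x matrix score.
                "weighted": score * n_sig,
            })
    return results
```

Keeping the fitting step behind a single function means the enumeration and sampling logic stays the same whichever model type is being compared.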

Step 1 – set up empty data frame to hold results

Step 2 – calculate all combinations of variables

Step 3 – run function to estimate all models and save parameters

Step 4 – average all models and sort

Average of Top Models Are …

Model MatrixMean SigMean Weighted

cut.off ~ + system + building.id + hvac.product + building.age + building.manager + coolant.type + software.P12 0.79 5.60 4.45

cut.off ~ + system.age + building.id + hvac.product + building.age + building.manager + he.efficiency + coolant.type 0.88 5.00 4.39

cut.off ~ + building.id + hvac.product + building.age + building.manager + coolant.type + software.release + ave.outside.temp 0.85 4.90 4.17

cut.off ~ + system + building.id + hvac.product + building.manager + service.center.distance + coolant.type + ave.outside.temp 0.77 4.30 3.30

cut.off ~ + building.id + service.center.distance + days.since.service + fan.hours + coolant.type + software.release + software.P12 0.91 3.60 3.28

cut.off ~ + system + system.age + building.id + days.since.service + fan.hours + ave.outside.temp + software.P12 0.86 3.80 3.25

cut.off ~ + system + system.age + building.id + building.age + days.since.service + fan.hours + software.P12 0.84 3.80 3.18

cut.off ~ + building.id + country + building.manager + service.center.distance + days.since.service + fan.hours + coolant.type 0.88 3.60 3.17

cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.P12 0.87 3.60 3.14

cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.release 0.85 3.70 3.14

cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + software.P12 0.89 3.50 3.11

cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + ave.outside.temp 0.89 3.50 3.10

cut.off ~ + building.id + building.age + building.manager + service.center.distance + days.since.service + he.efficiency + coolant.type 0.88 3.50 3.09

cut.off ~ + building.id + country + building.manager + days.since.service + coolant.type + ave.outside.temp + software.P12 0.85 3.60 3.06

cut.off ~ + building.id + hvac.product + building.age + fan.hours + software.release + ave.outside.temp + software.P12 0.81 3.70 3.00

cut.off ~ + hvac.product + building.age + days.since.service + he.efficiency + coolant.type + ave.outside.temp + software.P12 0.91 3.30 3.00
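Note that the Weighted column is not simply MatrixMean times SigMean: for the first row, 0.79 x 5.60 = 4.424, while 4.45 is reported. That is consistent with the product being taken within each sample and then averaged, although rounding of the displayed means may also play a part. The Python sketch below illustrates the difference using made-up per-sample values, which are hypothetical and not taken from the slides.

```python
from statistics import mean

# First row of the table: (MatrixMean, SigMean, Weighted) as reported.
reported = (0.79, 5.60, 4.45)
product_of_reported_means = reported[0] * reported[1]  # about 4.424, not 4.45

# Hypothetical per-sample results for one model (NOT from the slides).
matrix_scores = [0.70, 0.80, 0.90]
sig_counts = [4, 5, 7]

# Weight within each sample, then average over the samples.
mean_of_products = mean(m * s for m, s in zip(matrix_scores, sig_counts))

# Average each column first, then multiply.
product_of_means = mean(matrix_scores) * mean(sig_counts)
```

When the two quantities move together across samples, the mean of products comes out higher than the product of means, so the two orderings of models can differ slightly.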

Each of these should be tested again

More extensive use of varied train / test data sample sets

Stability of each model beyond the scoring

Chosen model “makes sense”

Alternative ways to do this …

Caret package function “rfe” (recursive feature elimination)

Try all variables first

Train and test the model with cross-validation

Calculate the most important variables

Eliminate the least important variables

Train and test the model again

Calculate the most important variables

Eliminate the least important variables

Repeat …
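The elimination loop can be sketched generically as below. This is a simplified Python illustration, not caret's implementation: it drops one variable per round, and both `score_fn` (for example a cross-validated accuracy) and `importance_fn` are hypothetical functions supplied by the caller.

```python
def recursive_feature_elimination(features, score_fn, importance_fn,
                                  min_features=1):
    """Backward elimination: drop the least important feature each round.

    score_fn(features) -> float, higher is better.
    importance_fn(features) -> dict mapping each feature to its importance.
    Returns (best_subset, history), where history lists (subset, score)
    for every round.
    """
    current = list(features)
    history = []
    while True:
        # Train and test the model on the current subset.
        history.append((list(current), score_fn(current)))
        if len(current) <= min_features:
            break
        # Rank the variables and eliminate the weakest one.
        importances = importance_fn(current)
        weakest = min(current, key=lambda f: importances[f])
        current.remove(weakest)
    # Keep the subset with the best score seen in any round.
    best_subset, _ = max(history, key=lambda item: item[1])
    return best_subset, history
```

With n starting variables this fits at most n models per resample, rather than one model per combination.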

Setting it up & running RFE

data frame of predictor variables

vector of outcome variable

max number of variables to keep

control functions

run recursive elimination model

Outcome of the RFE

Problems

Number of variable combinations can get HUGE

Might need multicore or parallel to get through it
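A quick count shows the scale. The Python sketch below assumes 15 candidate predictors (the number of distinct predictors appearing in the top-models table) and 7 variables per model, as in that table; the real candidate pool may be larger.

```python
import math

n_predictors = 15  # distinct predictors seen in the top-models table (assumed pool)
k = 7              # variables per model in that table
n_samples = 10     # train/test samples per model (~10 in the slides)

models_of_size_k = math.comb(n_predictors, k)   # 6435 seven-variable models
all_nonempty_subsets = 2 ** n_predictors - 1    # 32767 models of any size
total_fits = models_of_size_k * n_samples       # 64350 model fits

print(models_of_size_k, all_nonempty_subsets, total_fits)
```

Each fit is independent of the others, which is why the work can be spread across cores.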

Thank You

Brooke Aker

baker@bigdatalens.com
