Post on 06-Aug-2020

About the Presenter: David J Corliss

• PhD in statistical astrophysics; formerly part-time faculty at Wayne State University

• Analytics Architect in the automotive industry

• Work focuses on bringing university research in big data and time series analysis to the private sector

• Founder of Peace-Work, a volunteer cooperative of statisticians, data scientists and other researchers applying analytics to issues in poverty, education and social justice

Best Practices in Big Data

David J Corliss, PhD

Peace-Work

4/27/2016

IHBI: The Institute for Health and Business Insight

OUTLINE

Data Management

Sampling and Coding for Big Data

Tests For Model Performance

Distributed Computing

Summary

Data Management for Big Data

• Pre-screen records and variables

• Process only the records and variables needed

• Efficient Data Step Coding

• Use less computationally intensive methods

Bad Data Management 101

proc sort data=applicants;
by demographic_seg ID;

proc genmod data=applicants;
class demographic_seg;
model accept = var1-var221 /
dist = bin
link = logit
lrci;
run;

• Unnecessary sort

• Doesn’t screen variables first

• Models all variables

• Computationally intensive but not needed

Managing Big Data

proc glmselect data=applicants(where=(ranuni(0) le 0.001));
model accept = var1-var221 / selection=lasso(stop=none choose=sbc);
run;

proc logistic data=applicants;
class demographic_seg;
model accept = var12 var57 var125 var203;
run;

• Test on a sample

• Select candidate variables

• Computationally lightest sufficient method

• Model only screened variables

Sampling for Big Data

• Develop analytic processes using sample

• Sample Size

• Representative Samples

• Testing Sample Quality
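One simple way to test sample quality is to compare segment shares in the sample against the full data. The sketch below is in Python rather than SAS and uses only the standard library. The applicant list, the 1% sampling rate, the segment labels and all function names are illustrative assumptions, not material from the talk.

```python
import random
from collections import Counter

def bernoulli_sample(records, rate, seed=None):
    """Keep each record independently with probability `rate`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() <= rate]

def segment_shares(records, key):
    """Return the share of records that fall in each segment."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {seg: n / total for seg, n in counts.items()}

def max_share_gap(population, sample, key):
    """Largest absolute difference in segment share, population vs. sample."""
    pop = segment_shares(population, key)
    smp = segment_shares(sample, key)
    return max(abs(pop[seg] - smp.get(seg, 0.0)) for seg in pop)

# Hypothetical data: 100,000 applicants spread over three segments.
rng = random.Random(42)
population = [{"id": i, "demographic_seg": rng.choice("ABC")}
              for i in range(100_000)]

sample = bernoulli_sample(population, rate=0.01, seed=7)
gap = max_share_gap(population, sample, "demographic_seg")
print(len(sample), round(gap, 4))
```

A small gap suggests the sample mirrors the population on that variable. A large gap is a signal to resample, stratify, or increase the sample size before developing the analytic process on it.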

Efficient Coding for Big Data

• Read only the variables needed for analysis

• Pass the data as few times as possible

• Use formats instead of new variables

• Shorten records by using codes instead of text

• Trim unnecessary decimal places

• Computationally light processes where possible
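Two of these points, reading only the needed variables and passing the data once, can be sketched together. The example is in Python rather than SAS, and the in-memory CSV, column names and the summarize function are made up for illustration. The reader still parses every line, but only the needed columns are converted and retained; in SAS the analogous step would typically be a KEEP= data set option.

```python
import csv
import io

# Hypothetical input; in practice this would be a large file on disk.
RAW = """id,name,state,prod_42,prod_44
1,Alpha,MI,4,3
2,Beta,MI,4,2
3,Gamma,MN,1,5
4,Delta,IL,2,3
"""

def summarize(stream, needed):
    """Single pass: running count, sum, min and max for each needed column."""
    stats = {col: {"n": 0, "sum": 0.0, "min": None, "max": None}
             for col in needed}
    for row in csv.DictReader(stream):
        for col in needed:
            value = float(row[col])  # other columns are never converted or kept
            s = stats[col]
            s["n"] += 1
            s["sum"] += value
            s["min"] = value if s["min"] is None else min(s["min"], value)
            s["max"] = value if s["max"] is None else max(s["max"], value)
    return stats

stats = summarize(io.StringIO(RAW), needed=["prod_42", "prod_44"])
for col, s in stats.items():
    print(col, s["n"], s["sum"] / s["n"], s["min"], s["max"])
```

All four statistics come out of one loop over the records, instead of one pass per statistic.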

Coding for Big Data: Hash Object

An Ordinary Customer List

Name                       Street_Address         City         State  Zip_Code  prod_42  prod_44
Magnify Analytics          1 Kennedy Square       Detroit      MI     48226     4        3
Fedex Office               2609 Plymouth Road #7  Ann Arbor    MI     48105     4        2
Hyatt Regency Minneapolis  1300 Nicollet Mall     Minneapolis  MN     55403     1        5
Wrigley Field              1060 W. Addison St     Chicago      IL     60613     2        3
...

The Same Data in a Hash Table

Hash_ID   Zip_Code  prod_42  prod_44
00042540  48226     4        3
00063640  48105     4        2
00146328  55403     1        5
00243466  60613     2        3
...

Coding for Big Data: Hash Object

The Hash Object Process

y = w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5

1. Read the hash key for the given record

2. Look up the value of x1 by the key

3. Multiply by w1 and save it in a buffer

4. Repeat for each component of the model

5. Add all the components to calculate y

6. Release the buffer and go to the next record

7. Repeat for each record
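The seven steps above can be sketched with an ordinary Python dictionary standing in for the SAS hash object. This is an illustration of the lookup-and-score pattern only, not the SAS hash object API. The weights, keys, values and function names are hypothetical.

```python
# Hypothetical model weights w1..w5 for y = w1*x1 + ... + w5*x5.
WEIGHTS = {"x1": 0.5, "x2": -1.0, "x3": 2.0, "x4": 0.25, "x5": 1.5}

# Hypothetical lookup table: hash key -> model inputs for that record.
LOOKUP = {
    "00000001": {"x1": 2.0, "x2": 1.0, "x3": 0.5, "x4": 4.0, "x5": 0.0},
    "00000002": {"x1": 0.0, "x2": 3.0, "x3": 1.0, "x4": 0.0, "x5": 4.0},
}

def score_record(key, lookup, weights):
    """Score one record by key lookup (steps 1 to 6)."""
    values = lookup[key]                   # steps 1-2: find inputs by hash key
    buffer = []
    for name, w in weights.items():        # step 4: repeat for each component
        buffer.append(w * values[name])    # step 3: multiply and buffer
    y = sum(buffer)                        # step 5: add the components
    return y                               # step 6: buffer is dropped on return

def score_all(keys, lookup, weights):
    """Step 7: repeat for each record."""
    return {key: score_record(key, lookup, weights) for key in keys}

scores = score_all(LOOKUP.keys(), LOOKUP, WEIGHTS)
print(scores)
```

Each lookup is by key, so scoring a record takes about the same time no matter how many records the table holds, and no sort or merge is needed.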

Testing Model Performance

The Problem of p-values and Big Data

Explanatory Variable  Estimate     Pr(>|z|)
Var1                  0.271503909  < 0.001
Var2                  0.998361223  < 0.001
...
Var25                 0.244677914  < 0.001
Var26                 0.387859652  < 0.001
...
Var100                0.561703993  < 0.001
Var101                0.479482516  0.002
Var102                0.35656757   0.003

Testing Model Performance

ASA Statement on p-values, 3/7/2016:

“The p-value was never intended to be a substitute for scientific reasoning… Well-reasoned statistical arguments contain much more than the value of a single number and whether that number exceeds an arbitrary threshold. The ASA statement is intended to steer research into a ‘post p<0.05 era.’”

Ron Wasserstein, ASA Executive Director

Testing Model Performance

New Statistical Tests for Big Data

• Bonferroni Correction

• False Discovery Rate

• False Coverage Rate

• Per-Comparison Error Rate (PCER)

• Bayesian, including Bayesian FCR
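To make the first two items concrete, here is a minimal Python sketch of the Bonferroni correction and of the Benjamini-Hochberg step-up procedure, one common way to control the false discovery rate. The talk does not say which FDR procedure it has in mind, so Benjamini-Hochberg is an assumption, and the p-values and function names are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for p-values at or below alpha / m (m = number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for FDR control.

    Find the largest rank k with p_(k) <= (k / m) * alpha, then reject
    the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[idx] = True
    return reject

# Hypothetical raw p-values from eight model terms.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print(sum(bonf), sum(bh))
```

With these inputs an unadjusted 0.05 threshold would flag five terms, Bonferroni keeps one, and Benjamini-Hochberg keeps two. Bonferroni controls the chance of any false positive, while FDR control tolerates a bounded share of false positives among the rejections, which is usually less conservative when many variables are tested.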

Distributed Computing

Traditional Server Computing

[Diagram: user workstations connected to a single server]

Need More Resources? >> Get a Bigger Server

Scalable Distributed Computing

[Diagram: user workstations connected to a server node network]

Need More Resources? >> Add More Nodes

Summary of Big Data Best Practices

• Use best practices for managing large data sets, with efficient coding

• Pre-screen records and variables, only processing the data needed

• Use sampling where appropriate

• Consider Hash Object Programming to apply scoring models to big data

• Learn and use multi-threaded and distributed statistical procedures

• Use tests for model performance that have been designed for big data

• Look into grid computing for large analytic systems

References and Additional Materials

Programming for Job Security, Arthur Carpenter and Tony Payne
http://www2.sas.com/proceedings/sugi23/Training/p275.pdf

Secrets of Efficient SAS® Coding Techniques
http://support.sas.com/resources/papers/proceedings16/11741-2016.pdf

The SAS Data Step: Where Your Input Matters
http://www.pharmasug.org/proceedings/2012/TF/PharmaSUG-2012-TF04.pdf

Maximizing the Power of Hash Tables, David J Corliss
http://support.sas.com/resources/papers/proceedings13/037-2013.pdf

Questions

davidjcorliss@peace-work.org
