About the Presenter: David J Corliss
• PhD in statistical astrophysics; formerly part-time faculty at Wayne State University
• Analytics Architect in the automotive industry
• Work focuses on bringing university research in big data and time series analysis to the private sector
• Founder of Peace-Work, a volunteer cooperative of statisticians, data scientists, and other researchers applying analytics to issues in poverty, education, and social justice
Best Practices in Big Data
David J Corliss, PhD
Peace-Work
4/27/2016
IHBI: The Institute for Health and Business Insight
OUTLINE
Data Management
Sampling and Coding for Big Data
Tests For Model Performance
Distributed Computing
Summary
Data Management for Big Data
• Pre-screen records and variables
• Process only the records and variables needed
• Efficient Data Step Coding
• Use less computationally intensive methods
Bad Data Management 101

Proc sort data=applicants;        /* Unnecessary sort */
   by demographic_seg ID;
proc genmod data=applicants;      /* Computationally intensive but not needed */
   class demographic_seg;
   model accept = var1-var221 /   /* Doesn't screen variables first: models all 221 variables */
      dist = bin
      link = logit
      lrci;
run;
Managing Big Data

proc glmselect data=applicants(where=(ranuni(0) le 0.001));   /* Test on a sample */
   model accept = var1-var221 /
      selection=lasso(stop=none choose=sbc);                  /* Select candidate variables */
run;

proc logistic data=applicants;                  /* Computationally lightest sufficient method */
   class demographic_seg;
   model accept = var12 var57 var125 var203;    /* Model only the screened variables */
run;
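The same sample-then-screen-then-fit workflow can be sketched in Python. Everything here is a made-up stand-in: the data, the informative column indices, and a simple correlation screen in place of PROC GLMSELECT's lasso selection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical wide data set: 50,000 records, 221 candidate variables,
# with only columns 11 and 56 actually related to the outcome.
n, p = 50_000, 221
X = rng.normal(size=(n, p))
y = (X[:, 11] + X[:, 56] + rng.normal(size=n) > 0).astype(int)

# Step 1: develop on a small random sample, not the full table
# (the analog of where=(ranuni(0) le 0.001)).
keep = rng.random(n) <= 0.02
Xs, ys = X[keep], y[keep]

# Step 2: screen the candidates with a cheap univariate statistic,
# standing in for the lasso selection step.
scores = np.abs([np.corrcoef(Xs[:, j], ys)[0, 1] for j in range(p)])
candidates = sorted(np.argsort(scores)[-4:])

# Step 3: fit the final, computationally lighter model using only
# the screened variables (omitted here).
print(candidates)
```

The point is the shape of the pipeline, not the particular screen: each stage touches less data than the one a naive analyst would run on the full table.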
Sampling for Big Data
• Develop analytic processes using sample
• Sample Size
• Representative Samples
• Testing Sample Quality
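As a sketch of these bullets: a Bernoulli sample of a large stream of records, followed by a simple quality check against a known population statistic. The record IDs and the 0.001 sampling rate are invented for illustration.

```python
import random

random.seed(42)

# Keep each of 1,000,000 hypothetical records with probability 0.001,
# the same idea as SAS's where=(ranuni(0) le 0.001).
sample = [rec_id for rec_id in range(1_000_000) if random.random() <= 0.001]

# Testing sample quality: a representative sample should reproduce
# population statistics, here the mean record ID.
pop_mean = (1_000_000 - 1) / 2
samp_mean = sum(sample) / len(sample)
print(len(sample), round(samp_mean))
```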
Efficient Coding for Big Data
• Read only the variables needed for analysis
• Pass the data as few times as possible
• Use formats instead of new variables
• Shorten records by using codes instead of text
• Trim unnecessary decimal places
• Computationally light processes where possible
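A small Python illustration of three of these bullets at once: one pass over the data, reading only the variables the analysis needs, with a lookup table of short codes replacing the long text value. The file layout and the state lookup are invented for the example.

```python
import csv
import io

# Hypothetical raw data: many columns, but the analysis needs only two.
raw = io.StringIO(
    "id,name,street,city,state,product,qty\n"
    "1,Acme,1 Main St,Detroit,Michigan,Widget,4\n"
    "2,Beta,2 Oak Ave,Chicago,Illinois,Widget,2\n"
    "3,Gamma,3 Elm Rd,Detroit,Michigan,Gadget,5\n"
)

# Shorten records by using codes instead of text (a lookup, like a SAS format,
# rather than a new variable on every record).
state_code = {"Michigan": "MI", "Illinois": "IL"}

# A single pass over the data, keeping only the variables needed.
totals = {}
for row in csv.DictReader(raw):
    key = state_code[row["state"]]
    totals[key] = totals.get(key, 0) + int(row["qty"])

print(totals)  # {'MI': 9, 'IL': 2}
```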
Coding for Big Data: Hash Object

An Ordinary Customer List

Name                       Street_Address         City         State  Zip_Code  prod_42  prod_44
Magnify Analytics          1 Kennedy Square       Detroit      MI     48226     4        3
Fedex Office               2609 Plymouth Road #7  Ann Arbor    MI     48105     4        2
Hyatt Regency Minneapolis  1300 Nicollet Mall     Minneapolis  MN     55403     1        5
Wrigley Field              1060 W. Addison St     Chicago      IL     60613     2        3
...

The Same Data in a Hash Table

Hash_ID   Zip_Code  prod_42  prod_44
00042540  48226     4        3
00063640  48105     4        2
00146328  55403     1        5
00243466  60613     2        3
...
Coding for Big Data: Hash Object
The Hash Object Process
y = w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5
1. Read the hash key for the given record
2. Look up the value of x1 by the key
3. Multiply by w1 and save it in a buffer
4. Repeat for each component of the model
5. Add all the components to calculate y
6. Release the buffer and go to the next record
7. Repeat for each record
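The seven steps map naturally onto an ordinary Python dictionary standing in for the SAS hash object; the weights, keys, and values below are hypothetical.

```python
# Hypothetical model weights w1..w3 for y = w1*x1 + w2*x2 + w3*x3.
weights = {"x1": 0.5, "x2": -0.2, "x3": 1.1}

# The hash table: keyed on a record ID, holding only the model variables.
hash_table = {
    "00042540": {"x1": 4.0, "x2": 3.0, "x3": 1.0},
    "00063640": {"x1": 4.0, "x2": 2.0, "x3": 0.0},
}

def score(key):
    rec = hash_table[key]       # steps 1-2: look up the record by its key
    total = 0.0                 # step 3: buffer for the running sum
    for name, w in weights.items():
        total += w * rec[name]  # steps 3-4: weight each model component
    return total                # step 5: add the components to get y

# Steps 6-7: release the buffer and repeat for every record.
scores = {key: score(key) for key in hash_table}
print(scores)
```

Because the table is keyed and holds only the scoring variables, each lookup is an in-memory hash probe rather than a pass through a long record.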
Testing Model Performance
The Problem of p-values and Big Data

Explanatory Variable  Estimate     Pr(>|z|)
Var1                  0.271503909  < 0.001
Var2                  0.998361223  < 0.001
...
Var25                 0.244677914  < 0.001
Var26                 0.387859652  < 0.001
...
Var100                0.561703993  < 0.001
Var101                0.479482516  0.002
Var102                0.35656757   0.003
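The problem is easy to reproduce: as n grows, a fixed but trivially small effect drives the p-value below any threshold. A two-sided z-test sketch in Python (the effect size and sample sizes are invented):

```python
import math

def p_value(effect, n, sigma=1.0):
    """Two-sided z-test p-value for a mean shift of `effect`, assumed sigma."""
    z = effect * math.sqrt(n) / sigma
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

# The same tiny effect (0.01 standard deviations):
print(p_value(0.01, 1_000))      # modest n: far from significant
print(p_value(0.01, 1_000_000))  # big-data n: p < 0.001
```

With millions of records, nearly every variable clears p < 0.001, so the p-value stops discriminating between useful and useless predictors.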
ASA Statement on p-values, 3/7/2016:
“The p-value was never intended to be a substitute
for scientific reasoning… Well-reasoned statistical
arguments contain much more than the value of a
single number and whether that number exceeds an
arbitrary threshold. The ASA statement is intended
to steer research into a ‘post p<0.05 era.’”
Ron Wasserstein, ASA Executive Director
Testing Model Performance
New Statistical Tests for Big Data
• Bonferroni Correction
• False Discovery Rate
• False Coverage Rate
• PCER (Per-Comparison Error Rate)
• Bayesian, including Bayesian FCR
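Minimal sketches of the first two corrections on the list, written in Python for illustration (the p-values are invented):

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only when p <= alpha / m, m being the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """False Discovery Rate: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * alpha in the sorted p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.012, 0.041, 0.042, 0.60]
print(sum(bonferroni(pvals)))          # strictest: fewest rejections
print(sum(benjamini_hochberg(pvals)))  # FDR: less conservative
```

Bonferroni controls the family-wise error rate and becomes brutally conservative as the number of tests grows; the FDR approach tolerates a controlled fraction of false discoveries, which scales far better to big-data model screens.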
Distributed Computing

Traditional Server Computing: user work stations connect to a single server.
Need More Resources? >> Get a Bigger Server

Scalable Distributed Computing: user work stations connect to a server node network.
Need More Resources? >> Add More Nodes
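The scatter/gather pattern behind a server node network can be sketched in miniature; threads stand in for nodes here, where a real grid would distribute the chunks across machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Scatter: split the work into chunks, one per "node".
data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]

def node_work(chunk):
    return sum(chunk)  # each node processes only its own share

# Each worker handles one chunk in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(node_work, chunks))

# Gather: combine the partial results.
total = sum(partials)
print(total)
```

Adding capacity means adding more chunks and workers, not buying a bigger single machine, which is exactly the scaling argument the slide makes.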
Summary of Big Data Best Practices
• Use best practices for managing large data sets, with efficient coding
• Pre-screen records and variables, only processing the data needed
• Use sampling where appropriate
• Consider Hash Object Programming to apply scoring models to big data
• Learn and use multi-threaded and distributed statistical procedures
• Use tests for model performance that have been designed for big data
• Look into grid computing for large analytic systems
References and Additional Materials

Programming for Job Security, Arthur Carpenter and Tony Payne
http://www2.sas.com/proceedings/sugi23/Training/p275.pdf

Secrets of Efficient SAS® Coding Techniques
http://support.sas.com/resources/papers/proceedings16/11741-2016.pdf

The SAS Data Step: Where Your Input Matters
http://www.pharmasug.org/proceedings/2012/TF/PharmaSUG-2012-TF04.pdf

Maximizing the Power of Hash Tables, David J Corliss
http://support.sas.com/resources/papers/proceedings13/037-2013.pdf