A NEW PLATFORM FOR A NEW ERA
2 © Copyright 2013 Pivotal. All rights reserved.
What we will cover in today’s Meetup
• Data Science for Biomedicine
  – Challenges
  – Platforms, processes, and tools
• Use Cases Leveraging Data Science for Biomedicine
  – Genomics: Distributed GWAS
  – Image Processing: Massively Parallel Cell Counting
  – Healthcare: Predicting asthma-related hospital admissions
• Wrap Up & Questions
Challenges
Challenge: The ‘big-ness’ of big data
Oil Exploration Medical Imaging
Video Surveillance Mobile Sensors
Stock Market Gene Sequencing
Smart Grids Social Media
FACEBOOK UPLOADS 250 MILLION PHOTOS EACH DAY
COST TO SEQUENCE ONE GENOME HAS FALLEN FROM $100M IN 2001 TO $10K IN 2011 TO $1K IN 2014
READING SMART METERS EVERY 15 MINUTES IS 3000X MORE DATA INTENSIVE
OIL RIGS GENERATE 25000 DATA POINTS PER SECOND
Challenge: Diverse data
Medications • Family History • Lab tests • Clinical Narratives • Imaging • Environment • Medical History • Sensors & Mobile • Genetics • Molecular Diagnostics
Solutions: New environments & tools
FLEXIBLE (VARIETY/VELOCITY): HDFS STORAGE AND MPP ARCHITECTURES DISTRIBUTE STORAGE AND PREVENT DATA MOVEMENT
SCALABLE (PETABYTES OF DATA): DISTRIBUTED COMPUTATION FOR PARALLELIZATION
ENABLING (RAPIDLY EVOLVING FIELD OF DATA SCIENCE AND TOOLS): OPEN-SOURCE LIBRARY FOR MACHINE LEARNING AT SCALE AND FRAMEWORK TO ACCESS COMMON LANGUAGES
ACCESSIBLE (MANY EXISTING LIBRARIES, TOOLS AND EXPERTISE): SQL ENGINE AND ODBC/JDBC CONNECTIONS TO HADOOP
Solutions: Leverage Diverse Data
Create predictive models at scale
• Integrate data from various sources to build larger models to improve statistics and inference
• Enable parallelized execution of libraries
[Figure: ROC curves (true positive rate vs. false positive rate) for models built on growing combinations of Medical History, Clinician Notes, Genetics, and Imaging]
Platforms
Multiple platforms with a single, simple goal: Distributed storage with in-place computation
Hadoop
MPP Database
SQL-on-Hadoop
MPP Database
• Think of it as multiple PostgreSQL servers
• Architecture: Master and Segments/Workers
• Rows are distributed across segments by a particular field (or randomly)
Hadoop
• Think of it as a distributed file system with very large blocks of data
• Schema on read allows flexibility for a variety of datasets
• Compute using a variety of paradigms (e.g. MapReduce)
[Diagram: a Name Node and Data Nodes 1 to 4, with data blocks 1, 2, and 3 replicated across the data nodes]
SQL-on-Hadoop
Think of it as distributed PostgreSQL (GPDB) on Hadoop
• SQL compliant
• World-class query optimizer
• Interactive query
• Horizontal scalability
• Robust data management
• Common Hadoop formats
• Deep analytics
The landscape of technology for big data

MapReduce
  Use case: Batch processing of large volumes of data
  Challenge: Not optimal for highly iterative methods (file I/O bottleneck) or functions over windows
  Sample application: Word count on tweets
SQL
  Use case: Analytics on large-scale structured data
  Challenge: Requires restructuring of data to manipulate very large files
  Sample application: Predicting mortality on clinical data from diverse sources
HAMSTER/MPI, GraphLab
  Use case: Operations on very large matrices
  Challenge: Requires knowledge of OpenMP; misused for embarrassingly parallel problems
  Sample applications: Protein docking, molecular dynamics
Choosing the right environment for different analytics challenges

MapReduce
  Clinical Narratives: Many documents with no shared processing
  Imaging: Good for processing many images rapidly
  Genetics: Read mapping
SQL
  Clinical Narratives: Information retrieval
  Imaging: In-database processing of very large images stored as a table
  Genetics: BAM file manipulations, counts
HAMSTER/MPI, GraphLab
  Imaging: Processing very large images (e.g. FFT)
  Genetics: Multiple sequence alignment
Process & Tools
PIVOTAL DATA SCIENCE TOOLKIT
A large and varied tool box!

1 Find Data
  Platforms • Pivotal Greenplum DB • Pivotal HD • Hadoop (other) • SAS HPA • AWS
2 Write Code
  Editing Tools • Vi/Vim • Emacs • Smultron • TextWrangler • Eclipse • Notepad++ • IPython • Sublime • Rstudio
  Languages • SQL • Bash scripting • C • C++ • C# • Java • Python • R
3 Run Code
  Interfaces • pgAdminIII • psql • psycopg2 • Terminal • Cygwin • Putty • Winscp
4 Write Code for Big Data
  In-Database • SQL • PL/Python • PL/Java • PL/R • PL/pgSQL
  Hadoop • HAWQ • Pig • Hive • Java
5 Implement Algorithms
  Libraries • MADlib
  Java • Mahout
  R • (Too many to list!)
  Text • OpenNLP • NLTK • GPText
  C++ • opencv
  Python • NumPy • SciPy • scikit-learn • Pandas
  Programs • Alpine Miner • Rstudio • MATLAB • SAS • Stata
6 Show Results
  Visualization • python-matplotlib • python-networkx • D3.js • Tableau • GraphViz • Gephi • R (ggplot2, lattice, shiny) • Excel
7 Collaborate
  Sharing Tools • Chorus • Confluence • Socialcast • Github • Google Drive & Hangouts
Workflow phases: Data Review • Feature Creation • Model Building • Operationalization
MADlib In-Database Functions
Predictive Modeling Library

Linear Systems
• Sparse and Dense Solvers
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)
Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
Hypothesis Testing
• Chi-Squared test
• F-test & t-test
• ANOVA
• Kolmogorov-Smirnov
• Mann-Whitney test
• Wilcoxon signed-rank test
• Correlation
• Summary
Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
Linear Regression: Streaming Algorithm
• Finding linear dependencies between variables
• How to compute with a single scan?
Linear Regression: Parallel Computation
$$X^T y = \sum_i x_i^T y_i$$
The rows of $X$ and $y$ are split across segments. Each segment computes its partial product, and the master adds the results:
$$X_1^T y_1 + X_2^T y_2 = X^T y$$
(Segment 1 computes $X_1^T y_1$, Segment 2 computes $X_2^T y_2$, and the Master holds $X^T y$.)
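The single-scan, per-segment computation can be sketched in plain Python. This is only an illustration, not MADlib's implementation: the function names (`partial_sums`, `merge`, `solve`) are hypothetical, and accumulating $X^T X$ alongside $X^T y$ to solve the normal equations is an assumption that goes beyond what the slide shows.

```python
def partial_sums(rows):
    """One segment: accumulate X^T X and X^T y over its local rows in one scan."""
    k = len(rows[0][0])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for x, y in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """Master: add the partial results of two segments element by element."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


def solve(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - s) / m[i][i]
    return coef


# Toy data (made up for illustration): y = 1 + 2x, split over two segments.
# Each row is ([1, x], y); the leading 1 is the intercept column.
segment_1 = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)]
segment_2 = [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]

xtx, xty = merge(partial_sums(segment_1), partial_sums(segment_2))
coefficients = solve(xtx, xty)
```

Because the partial sums simply add, the result does not depend on how the rows are distributed across segments.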
Performing a linear regression on 10 million rows in seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
Data Parallelism
• Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks
• Also known as ‘explicit parallelism’
• Examples:
  – Count a deck of cards by dividing it up between people in this room: Count in parallel
  – MapReduce
  – map() function in Python
  – apply() family of functions in R
PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
• The interpreter/VM of the language ‘X’ is installed on each node of the cluster
• Data Parallelism: PL/X piggybacks on MPP architecture
• Allows users to write Greenplum/HAWQ/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
[Diagram: SQL is sent to the Master Host (with a Standby Master), which connects through the Interconnect to several Segment Hosts, each running multiple Segments]
Genomics Use Case: Massively-Parallel GWAS Study
In-database genome-wide association study
[Diagram: SQL & R queries go to the Master Servers, which reach the Segment Servers over the Network Interconnect. The segments process SNP1, SNP2, …, SNPM in parallel and return Pval1, Pval2, …, PvalM and LOR1, LOR2, …, LORM.]

COVARIATES
Indiv   Covariate 1   Covariate 2   …   Covariate 10
1       F             23            …   18
2       M             39            …   41
3       M             50            …   23
…
N       F             19            …   24

GENOTYPES (one column per SNP)
Indiv   SNP 1   SNP 2   …   SNP M
1       AA      CC      …   TT
2       AT      CG      …   TT
3       AA      GG      …   TC
…
N       TT      CG      …   TC

GENOTYPES (one row per individual and SNP)
Indiv   SNP   Geno
1       1     AA
2       1     AT
3       1     AA
1       2     CC
2       2     CG
3       2     GG
…
N       M     TC

RESULTS
SNP   P-value
1     2.34 × 10^-21
2     0.395
3     7.15 × 10^-17
…
M     0.000142

• In-database computation of ~500,000 loci for thousands of individuals occurs rapidly and in parallel
• Results are easily manipulated and explored
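The per-SNP independence that makes this parallel can be sketched in Python: group the genotype rows by SNP, then compute a statistic for each group on its own. Everything here is a hypothetical illustration rather than the study's method. The case/control labels and tested alleles are made up, `gwas_by_snp` and `log_odds_ratio` are invented names, and the statistic is a simple carrier-based log odds ratio with a 0.5 continuity correction (no p-value is computed).

```python
import math
from collections import defaultdict


def log_odds_ratio(a, b, c, d):
    """Log odds ratio of the 2x2 table [[a, b], [c, d]], adding 0.5 to each cell."""
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return math.log((a * d) / (b * c))


def gwas_by_snp(genotypes, is_case, allele):
    """genotypes: (indiv, snp, geno) rows; is_case: indiv -> bool; allele: snp -> tested allele."""
    by_snp = defaultdict(list)
    for indiv, snp, geno in genotypes:
        by_snp[snp].append((indiv, geno))

    results = {}
    for snp, rows in by_snp.items():  # each SNP could run on a different segment
        a = b = c = d = 0
        for indiv, geno in rows:
            carrier = allele[snp] in geno
            if is_case[indiv] and carrier:
                a += 1  # case, carries the allele
            elif is_case[indiv]:
                b += 1  # case, does not carry it
            elif carrier:
                c += 1  # control, carries the allele
            else:
                d += 1  # control, does not carry it
        results[snp] = log_odds_ratio(a, b, c, d)
    return results


# Genotype rows as on the slide; phenotype labels and tested alleles are invented.
genotypes = [(1, 1, "AA"), (2, 1, "AT"), (3, 1, "AA"),
             (1, 2, "CC"), (2, 2, "CG"), (3, 2, "GG")]
is_case = {1: True, 2: True, 3: False}
allele = {1: "T", 2: "G"}

lor = gwas_by_snp(genotypes, is_case, allele)
```

The long (one row per individual and SNP) layout is what makes the grouping step trivial, in the same way a `GROUP BY` on the SNP column would in SQL.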
Visualize & analyze genomics data without movement
• Generate relevant plots using tools like Tableau immediately after parallel statistical analysis in-database on Pivotal technology
• Simply select SNPs of interest and visualize additional patient data or metrics stored in the same database!
• Rapidly explore additional data sources, like mapped reads, to shorten time to insights. Data is available on the same platform, no data movement required!
Image Processing Use Case: Massively-Parallel Cell Counting
Tissuepathology.com
An image is simply an array of pixels
Representing an image in a table
HAWQ or GPDB enables rapid processing of multiple or extremely large images in parallel without memory limitations
Source Image: a 3 × 3 grid of pixels, indexed by Row (0 to 2) and Col (0 to 2)
Structured: one table row per pixel with columns row, col, intsy, for positions (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2)
Translating image processing to simple SQL
HAWQ or GPDB enables rapid processing of multiple or extremely large images in parallel without memory limitations
• No data movement required
• Simple SQL queries for data exploration

Function: Distribution of pixel intensities
SQL:
  SELECT intsy, count(*)
  FROM tbl
  GROUP BY intsy
Output:
  150, 5
  215, 4
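The same two steps, flattening an image into (row, col, intsy) records and grouping by intensity, can be sketched in Python. The pixel values below are an assumption: the slides do not list them, so they were chosen to be consistent with the outputs the slides show. The function names are hypothetical.

```python
from collections import Counter

# Assumed 3 x 3 image, chosen to match the outputs shown on the slides.
image = [
    [150, 150, 150],
    [150, 215, 215],
    [150, 215, 215],
]


def image_to_rows(image):
    """Flatten a 2-D pixel array into (row, col, intsy) records, one per pixel."""
    return [(r, c, intsy)
            for r, pixels in enumerate(image)
            for c, intsy in enumerate(pixels)]


def intensity_histogram(rows):
    """Equivalent of SELECT intsy, count(*) FROM tbl GROUP BY intsy."""
    return sorted(Counter(intsy for _, _, intsy in rows).items())


tbl = image_to_rows(image)
histogram = intensity_histogram(tbl)
```

Once the image is in this record form, every pixel is an ordinary table row, which is what lets the database distribute the work across segments.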
What about windows of pixels?
Function: Neighboring pixel values (no diagonals)
SQL:
  SELECT row, col,
    array[intsy,
      LAG(intsy)  OVER (col_wdw),
      LEAD(intsy) OVER (col_wdw),
      LAG(intsy)  OVER (row_wdw),
      LEAD(intsy) OVER (row_wdw)
    ] intsy_wdw
  FROM tbl
  WINDOW col_wdw AS (PARTITION BY col ORDER BY row),
         row_wdw AS (PARTITION BY row ORDER BY col)
Output: 1, 1, [215, 150, 215, 150, 215]
Window functions for image processing
What about 8-connected kernels? Partition along the two diagonals as well:
  diag1: row-col
  diag2: row+col
Function: Neighboring pixel values (with diagonals)
SQL:
  SELECT row, col,
    array[intsy,
      LAG(intsy)  OVER (col_wdw),
      LEAD(intsy) OVER (col_wdw),
      LAG(intsy)  OVER (row_wdw),
      LEAD(intsy) OVER (row_wdw),
      LAG(intsy)  OVER (diag1_wdw),
      LEAD(intsy) OVER (diag1_wdw),
      LAG(intsy)  OVER (diag2_wdw),
      LEAD(intsy) OVER (diag2_wdw)
    ] intsy_wdw
  FROM tbl
  WINDOW col_wdw   AS (PARTITION BY col ORDER BY row),
         row_wdw   AS (PARTITION BY row ORDER BY col),
         diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
         diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)
Output: 1, 1, [215, 150, 215, 150, 215, 150, 215, 150, 150]
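To see which neighbor each LAG/LEAD picks up, the window can be mirrored in Python with explicit (row, col) offsets listed in the same order as the SQL array. This is a sketch under assumptions: the 3 × 3 pixel values are inferred to be consistent with the outputs shown on the slides, and `neighborhood` is a hypothetical helper. Out-of-image neighbors become `None`, as LAG/LEAD return NULL at a partition edge.

```python
# Assumed 3 x 3 image, chosen to match the outputs shown on the slides.
image = [
    [150, 150, 150],
    [150, 215, 215],
    [150, 215, 215],
]

# (row offset, col offset) in the order of the SQL array elements.
OFFSETS = [
    (-1, 0), (1, 0),    # LAG / LEAD over col_wdw   (same column, ordered by row)
    (0, -1), (0, 1),    # LAG / LEAD over row_wdw   (same row, ordered by col)
    (-1, -1), (1, 1),   # LAG / LEAD over diag1_wdw (row-col constant)
    (1, -1), (-1, 1),   # LAG / LEAD over diag2_wdw (row+col constant)
]


def neighborhood(image, row, col):
    """Return [pixel, 8 neighbors]; neighbors outside the image are None."""
    n_rows, n_cols = len(image), len(image[0])
    window = [image[row][col]]
    for d_row, d_col in OFFSETS:
        r, c = row + d_row, col + d_col
        inside = 0 <= r < n_rows and 0 <= c < n_cols
        window.append(image[r][c] if inside else None)
    return window


center = neighborhood(image, 1, 1)
corner = neighborhood(image, 0, 0)
```

Dropping the last four offsets gives the 4-connected window of the earlier query.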
Smoothing (noise removal)
• Make each pixel intensity value similar to its neighbors by averaging the intensity values in the surrounding neighborhood.
• Smoothing using a uniform box filter:
[Figure: 4 × 4 image (Row 0 to 3, Col 0 to 3) before and after smoothing]

  SELECT row, col, madlib.array_mean(intsy_wdw)
  FROM (
    SELECT row, col, array[intsy,
        LAG(intsy) OVER (col_wdw),   LEAD(intsy) OVER (col_wdw),
        LAG(intsy) OVER (row_wdw),   LEAD(intsy) OVER (row_wdw),
        LAG(intsy) OVER (diag1_wdw), LEAD(intsy) OVER (diag1_wdw),
        LAG(intsy) OVER (diag2_wdw), LEAD(intsy) OVER (diag2_wdw)
      ] intsy_wdw
    FROM tbl
    WINDOW col_wdw   AS (PARTITION BY col ORDER BY row),
           row_wdw   AS (PARTITION BY row ORDER BY col),
           diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
           diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)
  ) t
Smoothing (noise removal)
• Make each pixel intensity value similar to its neighbors by averaging the intensity values in the surrounding neighborhood.
• Smoothing using a Gaussian filter:
[Figure: 4 × 4 image (Row 0 to 3, Col 0 to 3) before and after smoothing. Kernel weights: .2 for the center pixel, .125 for row and column neighbors, .075 for diagonal neighbors]

  SELECT row, col, madlib.array_dot(intsy_wdw,
      array[.2,.125,.125,.125,.125,.075,.075,.075,.075])
  FROM (
    SELECT row, col, array[intsy,
        LAG(intsy) OVER (col_wdw),   LEAD(intsy) OVER (col_wdw),
        LAG(intsy) OVER (row_wdw),   LEAD(intsy) OVER (row_wdw),
        LAG(intsy) OVER (diag1_wdw), LEAD(intsy) OVER (diag1_wdw),
        LAG(intsy) OVER (diag2_wdw), LEAD(intsy) OVER (diag2_wdw)
      ] intsy_wdw
    FROM tbl
    WINDOW col_wdw   AS (PARTITION BY col ORDER BY row),
           row_wdw   AS (PARTITION BY row ORDER BY col),
           diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
           diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)
  ) t
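Both filters are a weighted sum over the 9-pixel window, differing only in the weights. The Python sketch below illustrates that on a small image. Assumptions: the pixel values are inferred to be consistent with the outputs on the earlier slides, `smooth` is a hypothetical helper, and border pixels are simply copied unchanged (the SQL version would see NULL neighbors there instead).

```python
# Assumed 3 x 3 image, chosen to match the outputs shown on the slides.
image = [
    [150, 150, 150],
    [150, 215, 215],
    [150, 215, 215],
]

# Neighbor offsets in the same order as the SQL window array.
OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1),
           (-1, -1), (1, 1), (1, -1), (-1, 1)]

BOX = [1.0 / 9.0] * 9                                   # uniform box filter
GAUSSIAN = [.2, .125, .125, .125, .125,
            .075, .075, .075, .075]                     # weights from the slide


def smooth(image, weights):
    """Replace each interior pixel by the weighted sum of its 3 x 3 window."""
    n_rows, n_cols = len(image), len(image[0])
    out = [list(pixels) for pixels in image]  # border pixels stay as they are
    for r in range(1, n_rows - 1):
        for c in range(1, n_cols - 1):
            window = [image[r][c]] + [image[r + dr][c + dc] for dr, dc in OFFSETS]
            out[r][c] = sum(w * v for w, v in zip(weights, window))
    return out


box_result = smooth(image, BOX)
gaussian_result = smooth(image, GAUSSIAN)
```

The slide's weights sum to 1, so the filter preserves the overall brightness of flat regions.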
Image Processing Pipeline For Object Counting
Original → Smoothing → Thresholding → Morphological Operations → Object Detection → Object Counting
• Smoothing: Average over window of pixels
• Thresholding: Select pixels under intensity threshold
• Morphological Operations: Min/max over window of pixels
• Object Detection: Connected components
• Object Counting: Select components with size filter

Image name    # Cells
Tma_001.jpg   359
Tma_002.jpg   1892
Tma_003.jpg   871
…             …
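The last three stages (thresholding, connected components, size filter) can be sketched in a few lines of Python. This is an illustration under assumptions, not the in-database pipeline: smoothing and morphological operations are skipped, components are 4-connected, the toy image is made up, and `count_objects` is a hypothetical name.

```python
from collections import deque


def count_objects(image, threshold, min_size):
    """Count 4-connected groups of pixels under the threshold with >= min_size pixels."""
    n_rows, n_cols = len(image), len(image[0])
    # Thresholding: select pixels under the intensity threshold.
    mask = [[value < threshold for value in pixels] for pixels in image]
    seen = [[False] * n_cols for _ in range(n_rows)]
    count = 0
    for r0 in range(n_rows):
        for c0 in range(n_cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Object detection: flood-fill one connected component.
            size = 0
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                size += 1
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < n_rows and 0 <= cc < n_cols
                            and mask[rr][cc] and not seen[rr][cc]):
                        seen[rr][cc] = True
                        queue.append((rr, cc))
            # Object counting: keep only components that pass the size filter.
            if size >= min_size:
                count += 1
    return count


# Made-up toy image: dark blobs of 4, 1, and 3 pixels on a bright background.
toy = [
    [50, 50, 200, 200, 200],
    [50, 50, 200, 60, 200],
    [200, 200, 200, 200, 200],
    [200, 40, 40, 40, 200],
]

cells = count_objects(toy, threshold=100, min_size=2)
```

The size filter is what discards the single-pixel speck, which plays the role of noise here.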
Healthcare Use Case: Predicting Asthma-Related Hospital Admissions
Code-a-Thon Details - Logistics
• 24-Hour Data Science Code-a-Thon
• Four finalist vendors:
  – Pivotal
  – Cloudera
  – Hortonworks, and
  – IBM
• Number of resources per vendor is 5
• Final deliverable is a 15 minute presentation to senior leaders, executives, doctors, and pharmacists
Code-a-Thon Details - Data
• Air Quality Data
  – Air Pollutants and California Air Resources Board (ARB) Data
  – Daily Particulate matter (PM 10 and 2.5) and Ozone (O3) measurements
• Medication Order History
  – 4 years of anonymized medication order history
  – Encounter data
    ▪ Encounter Type
    ▪ Encounter Date
    ▪ Diagnosis
    ▪ Patient Demographics: Age/Gender/Zip Code
    ▪ Details of the Prescription: Medication, Therapeutic Class, Expiration Date
  – Dispense data
    ▪ Refill Date/Location
Raw Air Quality Data
• Measured at 77 stations
• Dispersed in 50 zip codes
• Only 6% of customer population lives in a zip code where there is an air station
Any analysis that focuses only on zip codes with air stations would be incomplete
Challenge #1
Step 1. Shepard Interpolation
• Calculate air miles between all zip codes
• Populate the air quality measures at zip codes with no stations with inverse distance weighted averages from nearby air stations
Challenge #1
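Inverse distance weighting gives each nearby station a weight of 1/d^p, so closer stations count more. A minimal Python sketch follows; the function name `shepard_interpolate`, the default power of 2, and the sample numbers are assumptions, not values from the Code-a-Thon.

```python
def shepard_interpolate(stations, power=2.0):
    """Inverse-distance-weighted average.

    stations: list of (distance_in_miles, measured_value) pairs for nearby stations.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for distance, value in stations:
        if distance == 0:
            return value  # a station in the zip code itself: use its value directly
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Made-up example: stations 1 and 3 air miles away reporting 10.0 and 30.0.
estimate = shepard_interpolate([(1.0, 10.0), (3.0, 30.0)])
```

With a power of 2, the station 1 mile away carries nine times the weight of the one 3 miles away, so the estimate lands much closer to 10 than to 30.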
Step 2. Determine zip codes where asthma is over-represented
- We calculated the prevalence of asthma for the overall population and for each zip code
- We determined whether the distribution of disease prevalence is significantly different for a zip code by running a chi-square test at the zip code level
- The cut-off for p-value is 0.05
- The standardized residuals are plotted
  Red: over-represented asthma
  Green: under-represented asthma
Challenge #1
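One way to run such a test for a single zip code is a 2 × 2 chi-square test of that zip code against the rest of the population. The sketch below is an assumption about the setup, not the team's actual code: `chi_square_2x2` is a hypothetical name, the counts are made up, and the residual computed is the Pearson residual of the asthma cell, (observed - expected) / sqrt(expected).

```python
import math


def chi_square_2x2(zip_cases, zip_pop, total_cases, total_pop):
    """Chi-square test (1 degree of freedom) of one zip code against everyone else."""
    rest_cases = total_cases - zip_cases
    rest_pop = total_pop - zip_pop
    observed = [zip_cases, zip_pop - zip_cases,
                rest_cases, rest_pop - rest_cases]
    rate = total_cases / total_pop  # overall prevalence
    expected = [zip_pop * rate, zip_pop * (1 - rate),
                rest_pop * rate, rest_pop * (1 - rate)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    residual = (observed[0] - expected[0]) / math.sqrt(expected[0])
    return chi2, p_value, residual


# Made-up counts: 30 asthma patients among 100 people in the zip code,
# 100 asthma patients among 1000 people overall.
chi2, p_value, residual = chi_square_2x2(30, 100, 100, 1000)
over_represented = p_value < 0.05 and residual > 0
```

The sign of the residual supplies the color on the map: positive for over-represented, negative for under-represented.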
Step 3. Spatial Alignment
Challenge #1
Predicting Asthma Admissions Findings
• Prior Hospitalization: Our analysis found that patients with prior asthma-related hospitalizations in the last 12 months were 4.85 times more likely to have a hospitalization (any) in the next 3 months than patients with no prior asthma hospitalizations in the last 12 months.
• Socio-economic status: Of the various socio-economic status features we tried, the percent of population under 50K was the one that was significant.
• Age Under 10 and Age Above 60: Compared to the reference group (patients between the ages of 10 and 60), these two age groups have an increased likelihood (~24% and ~10%) of being hospitalized in the next 3 months.
• History of Unfilled Medication: If a patient had an unfilled medication in their history, then, ceteris paribus, they are 13% more likely to have a hospitalization (p = 2.7e-06).
Challenge #2
Asthma Population Management Application
Application #1
Asthma Management Application
Application #2
Technology Adoption Journey of a Major Healthcare Provider
• Prove that better technology can speed up discovery
  – Code-a-thon
• Prove that better technology can improve model quality
  – Length of Stay Modeling
• Prove that technology is accessible to my clinicians and researchers
  – Comorbidity Feature Generation App
• Prove that data science can help in areas other than clinical analytics
  – Fraud Detection for Accounts Payable
• Prove that, once trained, our scientists can get to insights as quickly as the Pivotal DS team
  – EDIP Modeling in 4 days
Check out the Pivotal Data Science Blog! http://blog.pivotal.io/data-science-pivotal