Meetup asu 150113_upload

Post on 08-Aug-2015


A NEW PLATFORM FOR A NEW ERA

2 © Copyright 2013 Pivotal. All rights reserved.

What we will cover in today’s Meetup

•  Data Science for Biomedicine
–  Challenges
–  Platforms, processes, and tools

•  Use Cases Leveraging Data Science for Biomedicine
–  Genomics: Distributed GWAS
–  Image Processing: Massively Parallel Cell Counting
–  Healthcare: Predicting asthma-related hospital admissions

•  Wrap Up & Questions

Challenges


Challenge: The ‘big-ness’ of big data

Data sources: Oil Exploration, Medical Imaging, Video Surveillance, Mobile Sensors, Stock Market, Gene Sequencing, Smart Grids, Social Media

•  FACEBOOK UPLOADS 250 MILLION PHOTOS EACH DAY
•  COST TO SEQUENCE ONE GENOME HAS FALLEN FROM $100M IN 2001 TO $10K IN 2011 TO $1K IN 2014
•  READING SMART METERS EVERY 15 MINUTES IS 3000X MORE DATA INTENSIVE
•  OIL RIGS GENERATE 25000 DATA POINTS PER SECOND

Challenge: Diverse data

•  Medications
•  Family History
•  Lab tests
•  Clinical Narratives
•  Imaging
•  Environment
•  Medical History
•  Sensors & Mobile
•  Genetics
•  Molecular Diagnostics

Solutions: New environments & tools

FLEXIBLE, SCALABLE, ENABLING, ACCESSIBLE

•  HDFS STORAGE AND MPP ARCHITECTURES DISTRIBUTE STORAGE AND PREVENT DATA MOVEMENT
•  VARIETY/VELOCITY
•  DISTRIBUTED COMPUTATION FOR PARALLELIZATION
•  PETABYTES OF DATA
•  OPEN-SOURCE LIBRARY FOR MACHINE LEARNING AT SCALE AND FRAMEWORK TO ACCESS COMMON LANGUAGES
•  RAPIDLY EVOLVING FIELD OF DATA SCIENCE AND TOOLS
•  SQL ENGINE AND ODBC/JDBC CONNECTIONS TO HADOOP
•  MANY EXISTING LIBRARIES, TOOLS AND EXPERTISE

Solutions: Leverage Diverse Data

Create predictive models at scale
•  Integrate data from various sources to build larger models to improve statistics and inference
•  Enable parallelized execution of libraries

[Figure: ROC curves (true positive rate vs. false positive rate) for models built from combinations of Medical History, Genetics, Imaging, and Clinician Notes]

Platforms


Multiple platforms with a single, simple goal: Distributed storage with in-place computation

Hadoop

MPP Database

SQL-on-Hadoop

MPP Database: think of it as multiple PostgreSQL servers

•  Master
•  Segments/Workers
•  Rows are distributed across segments by a particular field (or randomly)
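The distribution rule for an MPP database can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `distribute`, the `patient_id` field, and the use of CRC32 as the hash are all hypothetical, and a real MPP database uses its own hashing scheme.

```python
import random
import zlib

def distribute(rows, num_segments, key=None, seed=0):
    """Assign each row (a dict) to a segment.

    key=None mimics random distribution; otherwise rows that share a key
    value always land on the same segment. Illustrative only: a real MPP
    database uses its own hash function.
    """
    rng = random.Random(seed)
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        if key is None:
            seg = rng.randrange(num_segments)
        else:
            # crc32 is stable across runs (unlike Python's salted hash())
            seg = zlib.crc32(str(row[key]).encode()) % num_segments
        segments[seg].append(row)
    return segments

# Hypothetical table: 20 visits belonging to 5 patients
rows = [{"patient_id": i % 5, "visit": i} for i in range(20)]
segments = distribute(rows, num_segments=4, key="patient_id")
```

Distributing by a field keeps all rows for one key on one segment, so per-key work needs no data movement; random distribution spreads rows evenly when no natural key exists.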

Hadoop: think of it as a distributed file system with very large blocks of data

•  Schema on read allows flexibility for a variety of datasets
•  Compute using a variety of paradigms (e.g. MapReduce)

[Diagram: a Name Node and Data Nodes 1 to 4, with blocks 1, 2, and 3 replicated across the Data Nodes]

SQL-on-Hadoop: think of it as distributed PostgreSQL (GPDB) on Hadoop

•  SQL compliant
•  World-class query optimizer
•  Interactive query
•  Horizontal scalability
•  Robust data management
•  Common Hadoop formats
•  Deep analytics


The landscape of technology for big data

MapReduce
•  Use case: Batch processing of large volumes of data
•  Challenges: Not optimal for highly iterative methods (file I/O bottleneck), functions over windows
•  Sample application: Word count on tweets

SQL
•  Use case: Analytics on large-scale structured data
•  Challenges: Requires restructuring of data to manipulate very large files
•  Sample application: Predicting mortality on clinical data from diverse sources

HAMSTER/MPI, GraphLab
•  Use case: Operations on very large matrices
•  Challenges: Requires knowledge of OpenMP, mis-used for embarrassingly parallel problems
•  Sample applications: Protein docking, molecular dynamics

Choosing the right environment for different analytics challenges

MapReduce
•  Imaging: Good for processing many images rapidly
•  Clinical Narratives: Many documents with no shared processing
•  Genetics: Read mapping

SQL
•  Imaging: In-database processing of very large images stored as a table
•  Clinical Narratives: Information retrieval
•  Genetics: BAM file manipulations, counts

HAMSTER/MPI, GraphLab
•  Imaging: Processing very large images (e.g. FFT)
•  Genetics: Multiple sequence alignment

Process & Tools


PIVOTAL DATA SCIENCE TOOLKIT

A large and varied tool box!

1 Find Data
•  Platforms: Pivotal Greenplum DB, Pivotal HD, Hadoop (other), SAS HPA, AWS

2 Write Code
•  Editing Tools: Vi/Vim, Emacs, Smultron, TextWrangler, Eclipse, Notepad++, IPython, Sublime, RStudio
•  Languages: SQL, Bash scripting, C, C++, C#, Java, Python, R

3 Run Code
•  Interfaces: pgAdminIII, psql, psycopg2, Terminal, Cygwin, PuTTY, WinSCP

4 Write Code for Big Data
•  In-Database: SQL, PL/Python, PL/Java, PL/R, PL/pgSQL
•  Hadoop: HAWQ, Pig, Hive, Java

5 Implement Algorithms
•  Libraries: MADlib
–  Java: Mahout
–  R: (Too many to list!)
–  Text: OpenNLP, NLTK, GPText
–  C++: OpenCV
–  Python: NumPy, SciPy, scikit-learn, Pandas
•  Programs: Alpine Miner, RStudio, MATLAB, SAS, Stata

6 Show Results
•  Visualization: python-matplotlib, python-networkx, D3.js, Tableau, GraphViz, Gephi, R (ggplot2, lattice, shiny), Excel

7 Collaborate
•  Sharing Tools: Chorus, Confluence, Socialcast, GitHub, Google Drive & Hangouts


Phases: Data Review, Feature Creation, Model Building, Operationalization

MADlib In-Database Functions

Predictive Modeling Library

Linear Systems
•  Sparse and Dense Solvers

Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low-Rank

Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation

Hypothesis Testing
•  Chi-Squared test
•  F-test & t-test
•  ANOVA
•  Kolmogorov-Smirnov
•  Mann-Whitney test
•  Wilcoxon signed-rank test
•  Correlation
•  Summary

Support Modules
•  Array Operations
•  Sparse Vectors
•  Random Sampling
•  Probability Functions

Linear Regression: Streaming Algorithm

•  Finding linear dependencies between variables
•  How to compute with a single scan?

Linear Regression: Parallel Computation

$$X^T y = \sum_i x_i^T y_i$$

With the rows split across two segments, each segment computes its own partial product and the master adds them:

$$X_1^T y_1 + X_2^T y_2 = X^T y$$

(Segment 1 holds $X_1^T y_1$, Segment 2 holds $X_2^T y_2$, and the Master holds $X^T y$.)
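The decomposition above can be sketched in plain Python. This is a minimal illustration, not MADlib's implementation: the function names are hypothetical, the model is a one-feature regression with an intercept, and the sketch also accumulates $X^T X$ by the same row-wise rule, which the normal equations need but the slides do not show.

```python
def partial_sums(rows):
    """Per-segment pass: accumulate X^T X and X^T y in a single scan.

    Each row is (x, y); the design vector is [1, x] (intercept + slope).
    """
    xtx = [[0.0, 0.0], [0.0, 0.0]]
    xty = [0.0, 0.0]
    for x, y in rows:
        v = (1.0, x)
        for i in range(2):
            xty[i] += v[i] * y
            for j in range(2):
                xtx[i][j] += v[i] * v[j]
    return xtx, xty

def combine_and_solve(partials):
    """Master step: add the partial sums, then solve (X^T X) b = X^T y."""
    xtx = [[sum(p[0][i][j] for p in partials) for j in range(2)] for i in range(2)]
    xty = [sum(p[1][i] for p in partials) for i in range(2)]
    # Cramer's rule for the 2x2 system
    det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
    intercept = (xty[0] * xtx[1][1] - xtx[0][1] * xty[1]) / det
    slope = (xtx[0][0] * xty[1] - xtx[1][0] * xty[0]) / det
    return intercept, slope

# Two "segments", each holding part of the rows of y = 2x + 1
segment_1 = [(0.0, 1.0), (1.0, 3.0)]
segment_2 = [(2.0, 5.0), (3.0, 7.0)]
intercept, slope = combine_and_solve(
    [partial_sums(segment_1), partial_sums(segment_2)])
```

Each segment touches its rows exactly once and ships only a 2x2 matrix and a 2-vector to the master, which is why the computation needs a single scan regardless of the number of rows.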


Performing a linear regression on 10 million rows in seconds

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.

Data Parallelism

•  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks
•  Also known as ‘explicit parallelism’
•  Examples:
–  Count a deck of cards by dividing it up between people in this room: Count in parallel
–  MapReduce
–  map() function in Python
–  apply() family of functions in R
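The card-counting example can be written with Python's built-in `map()`. This is a small sketch; the deck size and the number of people are arbitrary choices for illustration.

```python
from functools import reduce

deck = list(range(52))            # a deck of 52 cards
num_people = 4

# Deal the deck into one pile per person
piles = [deck[i::num_people] for i in range(num_people)]

# Each person counts their own pile; no pile depends on any other.
# map() applies the same function to every pile independently.
counts = list(map(len, piles))

# A final "reduce" step adds the per-person counts
total = reduce(lambda a, b: a + b, counts)
```

The built-in `map()` runs sequentially, but because the tasks share nothing, the same call could be handed to a parallel map such as `multiprocessing.Pool.map` without changing the logic.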

PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}

•  Allows users to write Greenplum/HAWQ/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
•  The interpreter/VM of the language ‘X’ is installed on each node of the cluster
•  Data Parallelism: PL/X piggybacks on MPP architecture

[Diagram: SQL arrives at the Master Host (with a Standby Master), which connects over the Interconnect to four Segment Hosts, each running two Segments]

Genomics Use Case: Massively-Parallel GWAS Study

In-database genome-wide association study

[Diagram: SQL & R run on the Master Servers, which connect to the Segment Servers over the Network Interconnect]

COVARIATES

Indiv   Covariate 1   Covariate 2   Covariate 10
1       F             23            18
2       M             39            41
3       M             50            23
…
N       F             19            24

GENOTYPES (one column per SNP)

Indiv   SNP 1   SNP 2   SNP M
1       AA      CC      TT
2       AT      CG      TT
3       AA      GG      TC
…
N       TT      CG      TC


In-database genome-wide association study

[Diagram: SQL & R run on the Master Servers, which connect to the Segment Servers over the Network Interconnect; SNP1, SNP2, …, SNPM are processed in parallel, producing Pval1, Pval2, …, PvalM and LOR1, LOR2, …, LORM]

COVARIATES

Indiv   Covariate 1   Covariate 2   Covariate 10
1       F             23            18
2       M             39            41
3       M             50            23
…
N       F             19            24

GENOTYPES (one row per individual and SNP)

Indiv   SNP   Geno
1       1     AA
2       1     AT
3       1     AA
1       2     CC
2       2     CG
3       2     GG
…
N       M     TC

RESULTS

SNP   P-value
1     2.34x10^-21
2     0.395
3     7.15x10^-17
…
M     0.000142

•  In-database computation of ~500,000 loci for thousands of individuals occurs rapidly and in parallel
•  Results are easily manipulated and explored
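The per-SNP structure of the computation can be sketched in Python. The slides do not say which statistical model was fitted, so this sketch swaps in a simple allelic chi-square test with no covariate adjustment; the phenotype labels, the effect alleles, and the function name are hypothetical. Only the genotype rows come from the table above.

```python
import math
from collections import defaultdict

def snp_association(genotypes, is_case, effect_allele):
    """Per-SNP allelic test: returns {snp: (log_odds_ratio, p_value)}.

    genotypes: iterable of (indiv, snp, geno) rows, geno like "AT"
    is_case: {indiv: bool} (hypothetical phenotype, not from the slides)
    effect_allele: {snp: allele} (hypothetical)
    """
    # counts[snp] = [[case_effect, case_other], [ctrl_effect, ctrl_other]]
    counts = defaultdict(lambda: [[0, 0], [0, 0]])
    for indiv, snp, geno in genotypes:
        row = 0 if is_case[indiv] else 1
        for allele in geno:
            col = 0 if allele == effect_allele[snp] else 1
            counts[snp][row][col] += 1

    results = {}
    for snp, ((a, b), (c, d)) in counts.items():
        # Haldane-Anscombe correction (+0.5) avoids division by zero
        lor = math.log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
        # Survival function of a chi-square with 1 degree of freedom
        p_value = math.erfc(math.sqrt(chi2 / 2.0))
        results[snp] = (lor, p_value)
    return results

genotypes = [(1, 1, "AA"), (2, 1, "AT"), (3, 1, "AA"),
             (1, 2, "CC"), (2, 2, "CG"), (3, 2, "GG")]
is_case = {1: False, 2: True, 3: False}      # hypothetical
effect_allele = {1: "T", 2: "G"}             # hypothetical
results = snp_association(genotypes, is_case, effect_allele)
```

Because each SNP's statistic depends only on that SNP's rows, distributing the long-format genotype table by SNP lets every segment test its own SNPs with no communication.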

Visualize & analyze genomics data without movement

•  Generate relevant plots using tools like Tableau immediately after parallel statistical analysis in-database on Pivotal technology
•  Simply select SNPs of interest and visualize additional patient data or metrics stored in the same database!
•  Rapidly explore additional data sources, like mapped reads, to shorten time to insights. Data is available on the same platform, no data movement required!

Image Processing Use Case: Massively-Parallel Cell Counting

[Image source: Tissuepathology.com]

An image is simply an array of pixels

Representing an image in a table

HAWQ or GPDB enables rapid processing of multiple or extremely large images in parallel without memory limitations

[Figure: a 3x3 source image (rows 0 to 2, columns 0 to 2) and its structured form, a table with columns row, col, intsy holding one record per pixel: (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2)]
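The conversion from a pixel array to table rows can be sketched as follows. The function name and the intensity values are illustrative assumptions; only the (row, col, intsy) layout comes from the slide.

```python
def image_to_rows(image):
    """Flatten a 2-D list of intensities into (row, col, intsy) records."""
    return [(r, c, intsy)
            for r, line in enumerate(image)
            for c, intsy in enumerate(line)]

# 3x3 example; the intensity values are illustrative
image = [[150, 150, 150],
         [150, 215, 215],
         [150, 215, 215]]
tbl = image_to_rows(image)
```

Each record is independent of the others, so the resulting table can be distributed across segments like any other table.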

Translating image processing to simple SQL

Function: Distribution of pixel intensities

SQL:
    SELECT intsy, count(*)
    FROM tbl
    GROUP BY intsy

Output:
    150, 5
    215, 4

HAWQ or GPDB enables rapid processing of multiple or extremely large images in parallel without memory limitations

•  No data movement required
•  Simple SQL queries for data exploration
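For readers more at home in Python, the same GROUP BY can be mirrored with `collections.Counter`. The pixel values below are an assumption, chosen so that the counts match the output shown on the slide.

```python
from collections import Counter

# (row, col, intsy) records for a 3x3 image; values chosen so the result
# matches the output shown on the slide
tbl = [(0, 0, 150), (0, 1, 150), (0, 2, 150),
       (1, 0, 150), (1, 1, 215), (1, 2, 215),
       (2, 0, 150), (2, 1, 215), (2, 2, 215)]

# Equivalent of: SELECT intsy, count(*) FROM tbl GROUP BY intsy
histogram = Counter(intsy for _row, _col, intsy in tbl)
```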


What about windows of pixels?

Function: Neighboring pixel values (no diagonals)

SQL:
    SELECT row, col,
           array[intsy,
                 LAG(intsy)  OVER (col_wdw),
                 LEAD(intsy) OVER (col_wdw),
                 LAG(intsy)  OVER (row_wdw),
                 LEAD(intsy) OVER (row_wdw)
           ] intsy_wdw
    FROM tbl
    WINDOW col_wdw AS (PARTITION BY col ORDER BY row),
           row_wdw AS (PARTITION BY row ORDER BY col)

Output: 1, 1, [215, 150, 215, 150, 215]
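What LAG and LEAD do over these two windows can be emulated in Python. This is a sketch, not how the database executes the query: the helper name `lag_lead` is hypothetical, and the 3x3 pixel values are an assumption chosen to be consistent with the outputs shown on the slides.

```python
from collections import defaultdict

def lag_lead(tbl, partition_key, order_key):
    """Emulate LAG(intsy) and LEAD(intsy) over one window definition.

    Returns {(row, col): (lag, lead)}; None stands in for SQL NULL at
    the edge of a partition.
    """
    partitions = defaultdict(list)
    for rec in tbl:
        partitions[partition_key(rec)].append(rec)
    out = {}
    for recs in partitions.values():
        recs.sort(key=order_key)
        for i, (row, col, _intsy) in enumerate(recs):
            lag = recs[i - 1][2] if i > 0 else None
            lead = recs[i + 1][2] if i + 1 < len(recs) else None
            out[(row, col)] = (lag, lead)
    return out

tbl = [(0, 0, 150), (0, 1, 150), (0, 2, 150),
       (1, 0, 150), (1, 1, 215), (1, 2, 215),
       (2, 0, 150), (2, 1, 215), (2, 2, 215)]

# col_wdw: PARTITION BY col ORDER BY row; row_wdw: PARTITION BY row ORDER BY col
col_wdw = lag_lead(tbl, partition_key=lambda r: r[1], order_key=lambda r: r[0])
row_wdw = lag_lead(tbl, partition_key=lambda r: r[0], order_key=lambda r: r[1])

intsy_wdw = {(row, col): [intsy, *col_wdw[(row, col)], *row_wdw[(row, col)]]
             for row, col, intsy in tbl}
```

For pixel (1, 1) this yields the five-element window from the slide; border pixels get `None` where SQL would return NULL.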

Window functions for image processing

What about 8-connected kernels? Add two diagonal windows:
•  diag1: row-col
•  diag2: row+col

Function: Neighboring pixel values (including diagonals)

SQL:
    SELECT row, col,
           array[intsy,
                 LAG(intsy)  OVER (col_wdw),
                 LEAD(intsy) OVER (col_wdw),
                 LAG(intsy)  OVER (row_wdw),
                 LEAD(intsy) OVER (row_wdw),
                 LAG(intsy)  OVER (diag1_wdw),
                 LEAD(intsy) OVER (diag1_wdw),
                 LAG(intsy)  OVER (diag2_wdw),
                 LEAD(intsy) OVER (diag2_wdw)
           ] intsy_wdw
    FROM tbl
    WINDOW col_wdw   AS (PARTITION BY col ORDER BY row),
           row_wdw   AS (PARTITION BY row ORDER BY col),
           diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
           diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)

Output: 1, 1, [215, 150, 215, 150, 215, 150, 215, 150, 150]
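Why `row-col` and `row+col` work as partition keys is easy to check in Python: pixels on the same diagonal share the same difference or sum. The sketch below groups the pixels of a 3x3 image by both keys; the variable names are illustrative.

```python
from collections import defaultdict

pixels = [(r, c) for r in range(3) for c in range(3)]

diag1 = defaultdict(list)   # PARTITION BY (row - col): top-left to bottom-right
diag2 = defaultdict(list)   # PARTITION BY (row + col): bottom-left to top-right
for r, c in pixels:
    diag1[r - c].append((r, c))
    diag2[r + c].append((r, c))

# ORDER BY col inside each partition
for part in list(diag1.values()) + list(diag2.values()):
    part.sort(key=lambda rc: rc[1])

main_diagonal = diag1[0]    # the diag1 partition that contains pixel (1, 1)
anti_diagonal = diag2[2]    # the diag2 partition that contains pixel (1, 1)
```

Within each ordered partition, the entries before and after (1, 1) are exactly its diagonal neighbors, which is what LAG and LEAD pick up.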

Smoothing (noise removal)

•  Make each pixel intensity value similar to its neighbors by averaging the intensity values in the surrounding neighborhood.

•  Smoothing using a uniform box filter:

    SELECT row, col, madlib.array_mean(intsy_wdw)
    FROM (
        SELECT row, col, array[intsy,
            LAG(intsy) OVER (col_wdw),   LEAD(intsy) OVER (col_wdw),
            LAG(intsy) OVER (row_wdw),   LEAD(intsy) OVER (row_wdw),
            LAG(intsy) OVER (diag1_wdw), LEAD(intsy) OVER (diag1_wdw),
            LAG(intsy) OVER (diag2_wdw), LEAD(intsy) OVER (diag2_wdw)
        ] intsy_wdw
        FROM tbl
        WINDOW col_wdw   AS (PARTITION BY col ORDER BY row),
               row_wdw   AS (PARTITION BY row ORDER BY col),
               diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
               diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)
    ) t

•  Smoothing using a Gaussian filter:

    SELECT row, col, madlib.array_dot(intsy_wdw,
        array[.2,.125,.125,.125,.125,.075,.075,.075,.075])
    FROM (
        SELECT row, col, array[intsy,
            LAG(intsy) OVER (col_wdw),   LEAD(intsy) OVER (col_wdw),
            LAG(intsy) OVER (row_wdw),   LEAD(intsy) OVER (row_wdw),
            LAG(intsy) OVER (diag1_wdw), LEAD(intsy) OVER (diag1_wdw),
            LAG(intsy) OVER (diag2_wdw), LEAD(intsy) OVER (diag2_wdw)
        ] intsy_wdw
        FROM tbl
        WINDOW col_wdw   AS (PARTITION BY col ORDER BY row),
               row_wdw   AS (PARTITION BY row ORDER BY col),
               diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
               diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)
    ) t

Weights, in array order: .2 for the pixel itself, .125 for each of the four row and column neighbors, and .075 for each of the four diagonal neighbors.
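Both filters reduce to a weighted sum over the nine-pixel window, which the following Python sketch makes explicit. It is an illustration under stated assumptions: the function name `smooth` is hypothetical, the 3x3 image is illustrative, and border pixels are simply left unchanged here, whereas the SQL produces NULL neighbors at the borders.

```python
def smooth(image, weights):
    """Weighted 3x3 smoothing of interior pixels (borders left unchanged).

    weights follows the array order used in the SQL: centre, up, down,
    left, right, then the four diagonal neighbours.
    """
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1),
               (-1, -1), (1, 1), (1, -1), (-1, 1)]
    out = [line[:] for line in image]
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            out[r][c] = sum(w * image[r + dr][c + dc]
                            for w, (dr, dc) in zip(weights, offsets))
    return out

box = [1 / 9] * 9                                               # uniform box filter
gaussian = [.2, .125, .125, .125, .125, .075, .075, .075, .075]  # weights from the slide

image = [[150, 150, 150],
         [150, 215, 215],
         [150, 215, 215]]
box_smoothed = smooth(image, box)
gauss_smoothed = smooth(image, gaussian)
```

The only difference between the two filters is the weight vector: equal weights give the mean (`array_mean`), and centre-heavy weights give the dot product used for the Gaussian case (`array_dot`).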

Image Processing Pipeline For Object Counting

•  Original
•  Smoothing: Average over window of pixels
•  Thresholding: Select pixels under intensity threshold
•  Morphological Operations: Min/max over window of pixels
•  Object Detection: Connected components
•  Object Counting: Select components with size filter

Image name    # Cells
Tma_001.jpg   359
Tma_002.jpg   1892
Tma_003.jpg   871
…             …
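The last three pipeline stages (thresholding, connected components, size filter) can be sketched in plain Python. This is not the in-database implementation from the talk: the function name, the use of 4-connectivity, the flood-fill approach, and the example image are all assumptions, and the smoothing and morphological steps are omitted for brevity.

```python
def count_objects(image, threshold, min_size):
    """Count connected groups of dark pixels with at least min_size pixels."""
    rows, cols = len(image), len(image[0])
    # Thresholding: keep pixels under the intensity threshold
    mask = [[image[r][c] < threshold for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Object detection: flood fill one 4-connected component
            stack, size = [(r, c)], 0
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            # Object counting: apply the size filter
            if size >= min_size:
                count += 1
    return count

# Illustrative 5x5 image: low values are "cells", 200 is background
image = [[ 50,  50, 200, 200, 200],
         [ 50,  50, 200,  40, 200],
         [200, 200, 200, 200, 200],
         [200,  60,  60, 200,  30],
         [200,  60,  60, 200,  30]]
num_cells = count_objects(image, threshold=100, min_size=2)
```

The size filter is what discards isolated dark pixels, such as the single pixel at (1, 3), that survive thresholding but are too small to be a cell.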


Healthcare Use Case: Predicting Asthma-Related Hospital Admissions


Code-a-Thon Details - Logistics

•  24-Hour Data Science Code-a-Thon
•  Four finalist vendors:
–  Pivotal
–  Cloudera
–  Hortonworks, and
–  IBM
•  Number of resources per vendor is 5
•  Final deliverable is a 15 minute presentation to senior leaders, executives, doctors, and pharmacists

Code-a-Thon Details - Data

•  Air Quality Data
–  Air Pollutants and California Air Resources Board (ARB) Data
–  Daily Particulate matter (PM 10 and 2.5) and Ozone (O3) measurements

•  Medication Order History
–  4 years of anonymized medication order history
–  Encounter data
▪  Encounter Type
▪  Encounter Date
▪  Diagnosis
▪  Patient Demographics (Age/Gender/Zip Code)
▪  Details of the Prescription (Medication, Therapeutic Class, Expiration Date)
–  Dispense data
▪  Refill Date/Location

Raw Air Quality Data

•  Measured at 77 stations
•  Dispersed in 50 zip codes
•  Only 6% of customer population lives in a zip code where there is an air station

Any analysis that focuses only on zip codes with air stations would be incomplete

Challenge #1

Step 1. Shepard Interpolation

•  Calculate air miles between all zip codes
•  Populate the air quality measures at zip codes with no stations with inverse distance weighted averages from nearby air stations

Challenge #1
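Inverse distance weighting can be sketched in Python as follows. Several details are assumptions, since the slides do not give them: the haversine formula for air miles, a distance exponent of 2, the use of every station rather than only nearby ones, and the station coordinates and readings, which are made up for illustration.

```python
import math

def air_miles(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def shepard(target, stations, power=2):
    """Inverse-distance-weighted average of station readings.

    target: (lat, lon); stations: list of (lat, lon, value).
    power=2 is a common default and an assumption here.
    """
    num = den = 0.0
    for lat, lon, value in stations:
        d = air_miles(target[0], target[1], lat, lon)
        if d == 0:
            return value          # target coincides with a station
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Hypothetical station coordinates and PM2.5 readings
stations = [(34.05, -118.25, 12.0), (34.15, -118.15, 20.0)]
estimate = shepard((34.10, -118.20), stations)
```

A zip code roughly midway between two stations receives a value close to the average of their readings, while a zip code next to one station is dominated by that station.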


Step 2. Determine zip codes where asthma is over-represented

•  We calculated the prevalence of asthma for the overall population and for each zip code
•  We determined whether the distribution of disease prevalence is significantly different for a zip code by running a chi-square test at the zip code level
•  The cut-off for the p-value is 0.05
•  The standardized residuals are plotted. Red: over-represented asthma. Green: under-represented asthma.

Challenge #1
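One way to run such a test is to compare each zip code against the rest of the population in a 2x2 table. The sketch below does this in plain Python; the exact test set-up used in the talk is not described, so the zip-versus-rest design, the adjusted (standardized) residual formula, the function name, and the counts are all assumptions.

```python
import math

def zip_vs_rest(asthma_in_zip, pop_in_zip, asthma_total, pop_total):
    """2x2 chi-square test of one zip code against the rest of the population.

    Returns (chi2, p_value, adjusted_residual); a positive residual means
    asthma is over-represented in the zip code.
    """
    a = asthma_in_zip                       # zip, asthma
    b = pop_in_zip - asthma_in_zip          # zip, no asthma
    c = asthma_total - asthma_in_zip        # rest, asthma
    d = (pop_total - pop_in_zip) - c        # rest, no asthma
    n = pop_total
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # chi-square, 1 degree of freedom
    expected = (a + b) * (a + c) / n
    adj = (a - expected) / math.sqrt(
        expected * (1 - (a + b) / n) * (1 - (a + c) / n))
    return chi2, p_value, adj

# Hypothetical counts: 1,000 residents in the zip code, 100,000 overall
chi2, p_value, residual = zip_vs_rest(150, 1000, 8000, 100000)
over_represented = p_value < 0.05 and residual > 0
```

The sign of the residual gives the direction (over- or under-represented) and the p-value cut-off of 0.05 decides whether the zip code is flagged at all.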

Step 3. Spatial Alignment

Challenge #1

Predicting Asthma Admissions Findings

•  Prior Hospitalization: Our analysis found that patients who had prior asthma-related hospitalizations in the last 12 months were 4.85 times more likely to have a hospitalization (any) in the next 3 months compared to patients who had no prior asthma hospitalizations in the last 12 months.
•  Socio-economic status: Of the various socio-economic status features we tried, the percent population under 50K was the one that was significant.
•  Age Under 10 and Age Above 60: Compared to the reference group (patients aged between 10 and 60), these two age groups are more likely (~24% and ~10%, respectively) to be hospitalized in the next 3 months.
•  History of Unfilled Medication: If a patient had an unfilled medication in their history, then, ceteris paribus, they are 13% more likely to have a hospitalization (p = 2.7e-06).

Challenge #2

Asthma Population Management Application

Application #1

Asthma Management Application

Application #2

Technology Adoption Journey of a Major Healthcare Provider

Prove that better technology can speed up discovery •  Code-a-thon

Prove that better technology can improve model quality • Length of Stay Modeling

Prove that technology is accessible to my clinicians and researchers • Comorbidity Feature Generation App

Prove that data science can help in areas other than clinical analytics • Fraud Detection for Accounts Payable

Prove that, once trained, our scientists can get to insights as quickly as the Pivotal DS team • EDIP Modeling in 4 days


Check out the Pivotal Data Science Blog! http://blog.pivotal.io/data-science-pivotal

A NEW PLATFORM FOR A NEW ERA