Meetup asu 150113_upload

A NEW PLATFORM FOR A NEW ERA

What we will cover in today’s Meetup

• Data Science for Biomedicine
  – Challenges
  – Platforms, processes, and tools

• Use Cases Leveraging Data Science for Biomedicine
  – Genomics: Distributed GWAS
  – Image Processing: Massively Parallel Cell Counting
  – Healthcare: Predicting asthma-related hospital admissions

• Wrap Up & Questions

Challenges

Challenge: The ‘big-ness’ of big data

Oil Exploration • Medical Imaging • Video Surveillance • Mobile Sensors • Stock Market • Gene Sequencing • Smart Grids • Social Media

• Facebook uploads 250 million photos each day
• Cost to sequence one genome has fallen from $100M in 2001 to $10K in 2011 to $1K in 2014
• Reading smart meters every 15 minutes is 3000x more data intensive
• Oil rigs generate 25000 data points per second

Challenge: Diverse data

• Medications
• Family History
• Lab tests
• Clinical Narratives
• Imaging
• Environment
• Medical History
• Sensors & Mobile
• Genetics
• Molecular Diagnostics

Solutions: New environments & tools

• HDFS storage and MPP architectures distribute storage and prevent data movement
• Distributed computation for parallelization
• Open-source library for machine learning at scale and framework to access common languages
• SQL engine and ODBC/JDBC connections to Hadoop

Needs addressed: variety/velocity; petabytes of data; a rapidly evolving field of data science; many existing libraries, tools and expertise

Goals: flexible, scalable, enabling, accessible

Solutions: Leverage Diverse Data

Create predictive models at scale
• Integrate data from various sources to build larger models to improve statistics and inference
• Enable parallelized execution of libraries

[Figure: false positive rate for models built on medical history alone, then with genetics, imaging, and clinician notes added]

Platforms

Multiple platforms with a single, simple goal: Distributed storage with in-place computation

Hadoop

MPP Database

SQL-on-Hadoop

MPP Database: think of it as multiple PostgreSQL servers
• Master
• Segments/Workers
• Rows are distributed across segments by a particular field (or randomly)
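The row-distribution idea can be sketched in a few lines of Python. This is only an illustration and not how Greenplum hashes rows internally: the segment count, the `patient_id` key, and the CRC32 hash are all assumptions made for the example.

```python
import random
import zlib
from collections import Counter

NUM_SEGMENTS = 4  # assumed cluster size, for illustration only


def segment_for_key(key, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution field (hypothetical scheme)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def segment_at_random(num_segments=NUM_SEGMENTS, rng=random):
    """Pick a segment at random when no distribution field is chosen."""
    return rng.randrange(num_segments)


# Hypothetical rows; patient_id plays the role of the distribution field.
rows = [{"patient_id": i, "age": 20 + i % 60} for i in range(1000)]
placement = Counter(segment_for_key(row["patient_id"]) for row in rows)
print(dict(placement))
```

Hashing on a field sends rows with the same key to the same segment, which is what lets joins and group-bys on that key run without moving data between segments.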

Hadoop: think of it as a distributed file system with very large blocks of data
• Schema on read allows flexibility for a variety of datasets
• Compute using a variety of paradigms (e.g. MapReduce)

[Figure: a Name Node and Data Nodes 1 to 4; copies of blocks 1, 2, and 3 are spread across the data nodes]

SQL-on-Hadoop: think of it as distributed PostgreSQL (GPDB) on Hadoop
• SQL compliant
• World-class query optimizer
• Interactive query
• Horizontal scalability
• Robust data management
• Common Hadoop formats
• Deep analytics

The landscape of technology for big data

• Use case: Batch processing of large volumes of data (MapReduce)
  – Challenge: Not optimal for highly iterative methods (file I/O bottleneck), functions over windows
  – Sample application: Word count on tweets
• Use case: Analytics on large-scale structured data
  – Challenge: Requires restructuring of data to manipulate very large files
  – Sample application: Predicting mortality on clinical data from diverse sources
• Use case: Operations on very large matrices (HAMSTER/MPI, GraphLab)
  – Challenge: Requires knowledge of OpenMP, mis-used for embarrassingly parallel problems
  – Sample application: Protein docking, molecular dynamics

Choosing the right environment for different analytics challenges

• Clinical Narratives
  – Many documents with no shared processing
  – Information retrieval
• Imaging
  – Good for processing many images rapidly
  – In-database processing of very large images stored as a table
  – Processing very large images (e.g. FFT)
• Genetics
  – Read mapping
  – BAM file manipulations, counts
  – Multiple sequence alignment

Process & Tools

PIVOTAL DATA SCIENCE TOOLKIT: A large and varied tool box!

1 Find Data
Platforms • Pivotal Greenplum DB • Pivotal HD • Hadoop (other) • SAS HPA • AWS

2 Write Code
Editing Tools • Vi/Vim • Emacs • Smultron • TextWrangler • Eclipse • Notepad++ • IPython • Sublime • RStudio
Languages • SQL • Bash scripting • C • C++ • C# • Java • Python • R

3 Run Code
Interfaces • pgAdmin III • psql • psycopg2 • Terminal • Cygwin • PuTTY • WinSCP

4 Write Code for Big Data
In-Database • SQL • PL/Python • PL/Java • PL/R • PL/pgSQL
Hadoop • HAWQ • Pig • Hive • Java

5 Implement Algorithms
Libraries • MADlib
Java • Mahout
R • (Too many to list!)
Text • OpenNLP • NLTK • GPText
C++ • OpenCV
Python • NumPy • SciPy • scikit-learn • Pandas
Programs • Alpine Miner • RStudio • MATLAB • SAS • Stata

6 Show Results
Visualization • python-matplotlib • python-networkx • D3.js • Tableau • GraphViz • Gephi • R (ggplot2, lattice, shiny) • Excel

7 Collaborate
Sharing Tools • Chorus • Confluence • Socialcast • GitHub • Google Drive & Hangouts


Data Review Feature Creation Model Building Operationalization

MADlib In-Database Functions

Predictive Modeling Library

Linear Systems
• Sparse and Dense Solvers

Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank

Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation

Hypothesis Testing
• Chi-Squared test
• F-test & t-test
• ANOVA
• Kolmogorov-Smirnov
• Mann-Whitney test
• Wilcoxon signed-rank test
• Correlation
• Summary

Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions

Collaborators:

Linear Regression: Streaming Algorithm
• Finding linear dependencies between variables
• How to compute with a single scan?
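One standard answer is that the ordinary least-squares solution depends on the data only through two sums over rows, so both can be accumulated in a single pass. In the sketch below, $x_i$ denotes row $i$ of $X$ written as a row vector and $\hat{\beta}$ the fitted coefficients; this notation is an assumption, since the slide's own formula did not survive extraction.

```latex
\hat{\beta} = (X^{T} X)^{-1} X^{T} y,
\qquad
X^{T} X = \sum_{i=1}^{n} x_i^{T} x_i,
\qquad
X^{T} y = \sum_{i=1}^{n} x_i^{T} y_i
```

Each row contributes one term to each sum, so no row needs to be revisited after it has been scanned.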

Linear Regression: Parallel Computation

$X^{T} y = \sum_{i} x_i^{T} y_i$

Segment 1 computes $X_1^{T} y_1$ and Segment 2 computes $X_2^{T} y_2$; the Master adds them:

$X_1^{T} y_1 + X_2^{T} y_2 = X^{T} y$
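The segment/master split can be sketched in plain Python. This is a toy illustration rather than MADlib's implementation: it assumes a single feature plus an intercept, so the system is 2x2 and can be solved by hand, and the two "segments" are just two hypothetical lists of (x, y) rows.

```python
def partial_sums(rows):
    """Runs on one segment: accumulate X^T X and X^T y in a single scan.

    Each row is (x, y); the design vector is [1, x] (intercept plus one feature).
    """
    xtx = [[0.0, 0.0], [0.0, 0.0]]
    xty = [0.0, 0.0]
    for x, y in rows:
        v = (1.0, float(x))
        for a in range(2):
            xty[a] += v[a] * y
            for b in range(2):
                xtx[a][b] += v[a] * v[b]
    return xtx, xty


def combine(parts):
    """Runs on the master: add the per-segment sums element-wise."""
    xtx = [[0.0, 0.0], [0.0, 0.0]]
    xty = [0.0, 0.0]
    for part_xtx, part_xty in parts:
        for a in range(2):
            xty[a] += part_xty[a]
            for b in range(2):
                xtx[a][b] += part_xtx[a][b]
    return xtx, xty


def solve_2x2(xtx, xty):
    """Solve (X^T X) beta = X^T y with Cramer's rule."""
    (a, b), (c, d) = xtx
    det = a * d - b * c
    intercept = (xty[0] * d - b * xty[1]) / det
    slope = (a * xty[1] - c * xty[0]) / det
    return intercept, slope


# Hypothetical data lying exactly on y = 2x + 1, split across two segments.
segment_1 = [(0, 1.0), (1, 3.0), (2, 5.0)]
segment_2 = [(3, 7.0), (4, 9.0)]

xtx, xty = combine([partial_sums(segment_1), partial_sums(segment_2)])
intercept, slope = solve_2x2(xtx, xty)
print(intercept, slope)
```

Only the small summary matrices travel to the master, never the rows themselves, which is why the cost of the final step does not grow with the number of rows.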

Performing a linear regression on 10 million rows in seconds

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.

Data Parallelism
• Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks
• Also known as ‘explicit parallelism’
• Examples:
  – Count a deck of cards by dividing it up between people in this room: count in parallel
  – MapReduce
  – map() function in Python
  – apply() family of functions in R
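The card-counting example maps directly onto Python's `map()`. The sketch below is illustrative only: the four "people" and the round-robin dealing are assumptions, and a thread pool stands in for the room.

```python
from concurrent.futures import ThreadPoolExecutor

# A standard 52-card deck: 13 ranks in each of 4 suits.
deck = [f"{rank}{suit}" for suit in "SHDC" for rank in "A23456789TJQK"]


def deal(cards, num_people):
    """Split the deck into one pile per person (round-robin)."""
    return [cards[i::num_people] for i in range(num_people)]


piles = deal(deck, 4)

# Built-in map(): each pile is counted independently of the others.
counts = list(map(len, piles))

# Same idea with a thread pool: the counting tasks share no state.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_counts = list(pool.map(len, piles))

total = sum(parallel_counts)
print(counts, total)
```

The only communication is the final sum, which is what makes the problem data-parallel.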

PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
• The interpreter/VM of the language ‘X’ is installed on each node of the cluster
• Data Parallelism: PL/X piggybacks on MPP architecture
• Allows users to write Greenplum/HAWQ/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages

[Figure: a master host and standby master connected through the interconnect to segment hosts, each running several segments]

Genomics Use Case: Massively-Parallel GWAS Study

In-database genome-wide association study

[Figure: master servers and segment servers joined by the network interconnect, driven from SQL & R. The segment servers work through the SNPs (SNP1, SNP2, …, SNPM) and return a p-value (Pval1, Pval2, …, PvalM) and an LOR value (LOR1, LOR2, …, LORM) for each one.]

COVARIATES
Indiv  1   2   …  10
1      F   23  …  18
2      M   39  …  41
3      M   50  …  23
…
N      F   19  …  24

GENOTYPES (shown first with one column per SNP, then reshaped to one row per individual and SNP)
Indiv  SNP  Geno
1      1    AA
2      1    AT
3      1    AA
1      2    CC
2      2    CG
3      2    GG
…
N      M    TC

RESULTS
SNP  P-value
1    2.34x10^-21
2    0.395
3    7.15x10^-17
…
M    0.000142

•  In-database computation of ~500,000 loci for thousands of individuals occurs rapidly and in parallel

•  Results are easily manipulated and explored
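Because each SNP is tested independently, the work is data-parallel by SNP. The sketch below shows that shape on a long-format genotype list. The slides do not say which statistical test was run, so the 2x2 allelic chi-square test, the case/control `status` labels, and the 0.5 continuity offset are all assumptions chosen to keep the example short.

```python
import math
from collections import defaultdict

# Hypothetical long-format genotype rows: (individual, snp, genotype).
genotypes = [
    (1, 1, "AA"), (2, 1, "AT"), (3, 1, "AA"), (4, 1, "TT"),
    (1, 2, "CC"), (2, 2, "CG"), (3, 2, "GG"), (4, 2, "CG"),
]
# Hypothetical case/control status per individual (1 = case, 0 = control).
status = {1: 1, 2: 1, 3: 0, 4: 0}


def allelic_test(rows, status):
    """2x2 allelic chi-square test for one SNP: (log odds ratio, p-value)."""
    alleles = sorted({allele for _, _, geno in rows for allele in geno})
    ref = alleles[0]
    # table[s] = [reference-allele count, other-allele count];
    # starting at 0.5 avoids division by zero for empty cells.
    table = {0: [0.5, 0.5], 1: [0.5, 0.5]}
    for indiv, _, geno in rows:
        for allele in geno:
            table[status[indiv]][0 if allele == ref else 1] += 1
    a, b = table[1]  # cases
    c, d = table[0]  # controls
    log_odds_ratio = math.log((a * d) / (b * c))
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return log_odds_ratio, p_value


# Group rows by SNP; each group could be handled by a different segment.
by_snp = defaultdict(list)
for row in genotypes:
    by_snp[row[1]].append(row)

results = {snp: allelic_test(rows, status) for snp, rows in sorted(by_snp.items())}
for snp, (lor, p) in results.items():
    print(snp, round(lor, 3), round(p, 3))
```

In a database setting the grouping step corresponds to distributing the genotype table by SNP, so every segment holds complete data for the SNPs it tests.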

Generate relevant plots using tools like Tableau immediately after parallel statistical analysis in-database on Pivotal technology

Visualize & analyze genomics data without movement

Simply select SNPs of interest and visualize additional patient data or metrics stored in the same database!

Rapidly explore additional data sources, like mapped reads, to shorten time to insights. Data is available on the same platform, no data movement required!

Image Processing Use Case: Massively-Parallel Cell Counting

An image is simply an array of pixels

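Representing an image in a table

HAWQ or GPDB enables rapid processing of multiple or extremely large images in parallel without memory limitations

[Figure: a 3x3 source image with rows 0 to 2 and columns 0 to 2, next to its structured form with one table row per pixel: (0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]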
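Flattening an image into one record per pixel is a short loop. In this sketch the 3x3 intensities are hypothetical, and the tuple layout `(row, col, intsy)` is an assumption based on the column names used in the SQL on the following slides.

```python
# Hypothetical 3x3 grayscale image (values are pixel intensities).
image = [
    [150, 150, 150],
    [150, 215, 215],
    [150, 215, 215],
]


def image_to_rows(pixels):
    """Flatten a 2-D image into (row, col, intsy) records, one per pixel."""
    return [
        (r, c, intsy)
        for r, line in enumerate(pixels)
        for c, intsy in enumerate(line)
    ]


def rows_to_image(rows):
    """Rebuild the 2-D image from (row, col, intsy) records."""
    height = max(r for r, _, _ in rows) + 1
    width = max(c for _, c, _ in rows) + 1
    pixels = [[None] * width for _ in range(height)]
    for r, c, intsy in rows:
        pixels[r][c] = intsy
    return pixels


tbl = image_to_rows(image)
print(tbl[:3])
```

Because every record carries its own coordinates, the records can be stored and processed in any order, which is what allows them to be spread across segments.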

Translating image processing to simple SQL

Function: Distribution of pixel intensities

SQL:
    SELECT intsy, count(*)
    FROM tbl
    GROUP BY intsy

Output:
    150, 5
    215, 4
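The query can be tried locally with Python's built-in `sqlite3` module standing in for HAWQ or GPDB. The 3x3 image is a hypothetical one chosen so that it has five pixels of intensity 150 and four of 215; the `ORDER BY` is added here only to make the result order deterministic.

```python
import sqlite3

# Hypothetical 3x3 image: five pixels at 150, four at 215.
image = [
    [150, 150, 150],
    [150, 215, 215],
    [150, 215, 215],
]

conn = sqlite3.connect(":memory:")
# Column names follow the slides; row and col are quoted to be safe.
conn.execute('CREATE TABLE tbl ("row" INTEGER, "col" INTEGER, intsy INTEGER)')
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(r, c, v) for r, line in enumerate(image) for c, v in enumerate(line)],
)

histogram = conn.execute(
    "SELECT intsy, count(*) FROM tbl GROUP BY intsy ORDER BY intsy"
).fetchall()
conn.close()

print(histogram)
```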

• No data movement required
• Simple SQL queries for data exploration

What about windows of pixels?

Function: Neighboring pixel values (no diagonals)

SQL:
    SELECT row, col,
      array[intsy,
        LAG(intsy)  OVER (col_wdw),
        LEAD(intsy) OVER (col_wdw),
        LAG(intsy)  OVER (row_wdw),
        LEAD(intsy) OVER (row_wdw)
      ] intsy_wdw
    FROM tbl
    WINDOW col_wdw AS (PARTITION BY col ORDER BY row),
           row_wdw AS (PARTITION BY row ORDER BY col)

Output: 1, 1, [215, 150, 215, 150, 215]

Window functions for image processing

What about 8-connected kernels?

diag1: row-col
diag2: row+col

SQL:
    SELECT row, col,
      array[intsy,
        LAG(intsy)  OVER (col_wdw),
        LEAD(intsy) OVER (col_wdw),
        LAG(intsy)  OVER (row_wdw),
        LEAD(intsy) OVER (row_wdw),
        LAG(intsy)  OVER (diag1_wdw),
        LEAD(intsy) OVER (diag1_wdw),
        LAG(intsy)  OVER (diag2_wdw),
        LEAD(intsy) OVER (diag2_wdw)
      ] intsy_wdw
    FROM tbl
    WINDOW col_wdw   AS (PARTITION BY col ORDER BY row),
           row_wdw   AS (PARTITION BY row ORDER BY col),
           diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
           diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)

Output: 1, 1, [215, 150, 215, 150, 215, 150, 215, 150, 150]
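To see which neighbor each LAG and LEAD picks up, the window can be emulated in plain Python. This is a sketch of the semantics, not of how the database evaluates the query; the 3x3 image is a hypothetical one chosen so that pixel (1, 1) reproduces the sample outputs above.

```python
# Hypothetical 3x3 image consistent with the sample outputs for pixel (1, 1).
image = [
    [150, 150, 150],
    [150, 215, 215],
    [150, 215, 215],
]

# (row offset, column offset) in the same order as the SQL array.
OFFSETS = [
    (0, 0),    # the pixel itself
    (-1, 0),   # LAG  over col_wdw: previous row, same column
    (1, 0),    # LEAD over col_wdw: next row, same column
    (0, -1),   # LAG  over row_wdw: same row, previous column
    (0, 1),    # LEAD over row_wdw: same row, next column
    (-1, -1),  # LAG  over diag1_wdw (row-col constant, ordered by col)
    (1, 1),    # LEAD over diag1_wdw
    (1, -1),   # LAG  over diag2_wdw (row+col constant, ordered by col)
    (-1, 1),   # LEAD over diag2_wdw
]


def window(pixels, row, col, offsets=OFFSETS):
    """Collect the neighborhood of one pixel; None stands in for SQL NULL."""
    height, width = len(pixels), len(pixels[0])
    values = []
    for dr, dc in offsets:
        r, c = row + dr, col + dc
        inside = 0 <= r < height and 0 <= c < width
        values.append(pixels[r][c] if inside else None)
    return values


four_connected = window(image, 1, 1, OFFSETS[:5])
eight_connected = window(image, 1, 1)
print(four_connected)
print(eight_connected)
```

At the image border the missing neighbors come back as `None`, just as LAG and LEAD return NULL when a partition has no previous or next row.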

Smoothing (noise removal)
• Make each pixel intensity value similar to its neighbors by averaging the intensity values in the surrounding neighborhood.
• Smoothing using a uniform box filter:

SQL:
    SELECT row, col, madlib.array_mean(intsy_wdw)
    FROM (
      SELECT row, col, array[intsy,
          LAG(intsy) OVER (col_wdw),   LEAD(intsy) OVER (col_wdw),
          LAG(intsy) OVER (row_wdw),   LEAD(intsy) OVER (row_wdw),
          LAG(intsy) OVER (diag1_wdw), LEAD(intsy) OVER (diag1_wdw),
          LAG(intsy) OVER (diag2_wdw), LEAD(intsy) OVER (diag2_wdw)
        ] intsy_wdw
      FROM tbl
      WINDOW col_wdw   AS (PARTITION BY col ORDER BY row),
             row_wdw   AS (PARTITION BY row ORDER BY col),
             diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
             diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)
    ) AS wdw

• Smoothing using a Gaussian filter:

SQL:
    SELECT row, col, madlib.array_dot(intsy_wdw,
        array[.2,.125,.125,.125,.125,.075,.075,.075,.075])
    FROM (
      SELECT row, col, array[intsy,
          LAG(intsy) OVER (col_wdw),   LEAD(intsy) OVER (col_wdw),
          LAG(intsy) OVER (row_wdw),   LEAD(intsy) OVER (row_wdw),
          LAG(intsy) OVER (diag1_wdw), LEAD(intsy) OVER (diag1_wdw),
          LAG(intsy) OVER (diag2_wdw), LEAD(intsy) OVER (diag2_wdw)
        ] intsy_wdw
      FROM tbl
      WINDOW col_wdw   AS (PARTITION BY col ORDER BY row),
             row_wdw   AS (PARTITION BY row ORDER BY col),
             diag1_wdw AS (PARTITION BY (row-col) ORDER BY col),
             diag2_wdw AS (PARTITION BY (row+col) ORDER BY col)
    ) AS wdw

Kernel weights (center, edge neighbors, diagonal neighbors):
    .075  .125  .075
    .125  .2    .125
    .075  .125  .075
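The two filters differ only in the weights applied to the nine-pixel window. The plain-Python sketch below applies both to a small hypothetical image with one bright noise pixel. How border pixels are treated is an assumption: missing neighbors are skipped and the remaining weights renormalized, which is not necessarily what the SQL does with NULLs.

```python
# Window order matches the SQL array: self, 4 edge neighbors, 4 diagonals.
OFFSETS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1),
           (-1, -1), (1, 1), (1, -1), (-1, 1)]
BOX = [1.0 / 9.0] * 9
GAUSSIAN = [0.2, 0.125, 0.125, 0.125, 0.125, 0.075, 0.075, 0.075, 0.075]


def smooth(pixels, weights):
    """Weighted average over each pixel's 3x3 neighborhood."""
    height, width = len(pixels), len(pixels[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            weight_sum = 0.0
            for (dr, dc), w in zip(OFFSETS, weights):
                rr, cc = r + dr, c + dc
                # Assumption: neighbors outside the image are ignored.
                if 0 <= rr < height and 0 <= cc < width:
                    total += w * pixels[rr][cc]
                    weight_sum += w
            out[r][c] = total / weight_sum
    return out


# Hypothetical 4x4 image: flat background with one noisy pixel at (1, 1).
noisy = [
    [100, 100, 100, 100],
    [100, 180, 100, 100],
    [100, 100, 100, 100],
    [100, 100, 100, 100],
]

box = smooth(noisy, BOX)
gauss = smooth(noisy, GAUSSIAN)
print(round(box[1][1], 2), round(gauss[1][1], 2))
```

The Gaussian weights keep more of the center pixel's own value, so the spike is reduced less aggressively than with the uniform box filter.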

Image Processing Pipeline For Object Counting

• Original
• Smoothing: average over window of pixels
• Thresholding: select pixels under intensity threshold
• Morphological Operations: min/max over window of pixels
• Object Detection: connected components
• Object Counting: select components with size filter

Image name     # Cells
Tma_001.jpg    359
Tma_002.jpg    1892
Tma_003.jpg    871
…              …
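The last three stages of the pipeline can be sketched in plain Python on a toy image. This is not the in-database implementation: smoothing and morphological operations are left out for brevity, and the intensity cutoff, the minimum component size, and the use of 4-connectivity are all assumptions.

```python
from collections import deque


def threshold(pixels, cutoff):
    """Mark pixels darker than the cutoff (True = candidate cell pixel)."""
    return [[value < cutoff for value in line] for line in pixels]


def connected_components(mask):
    """Find 4-connected groups of True pixels; return their sizes."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    sizes = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            size = 0
            while queue:
                cr, cc = queue.popleft()
                size += 1
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            sizes.append(size)
    return sizes


def count_cells(pixels, cutoff=128, min_size=2):
    """Threshold, find components, and count those passing the size filter."""
    sizes = connected_components(threshold(pixels, cutoff))
    return sum(1 for size in sizes if size >= min_size)


# Hypothetical image: bright background with three dark blobs, one too small.
image = [
    [250, 250, 250, 250, 250, 250],
    [250,  40,  50, 250, 250,  30],
    [250,  45,  60, 250, 250, 250],
    [250, 250, 250, 250,  55, 250],
    [250, 250, 250, 250,  35, 250],
]

print(count_cells(image))
```

The size filter is what discards isolated dark pixels that survive thresholding but are too small to be cells.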

Healthcare Use Case: Predicting Asthma-Related Hospital Admissions

Code-a-Thon Details - Logistics
• 24-Hour Data Science Code-a-Thon
• Four finalist vendors:
  – Pivotal
  – Cloudera
  – Hortonworks, and
  – IBM
• Number of resources per vendor is 5
• Final deliverable is a 15 minute presentation to senior leaders, executives, doctors, and pharmacists

Code-a-Thon Details - Data
• Air Quality Data
  – Air Pollutants and California Air Resources Board (ARB) Data
  – Daily particulate matter (PM 10 and 2.5) and ozone (O3) measurements
• Medication Order History
  – 4 years of anonymized medication order history
  – Encounter data
    ▪ Encounter Type
    ▪ Encounter Date
    ▪ Diagnosis
    ▪ Patient Demographics: Age/Gender/Zip Code
    ▪ Details of the Prescription: Medication, Therapeutic Class, Expiration Date
  – Dispense data
    ▪ Refill Date/Location

Raw Air Quality Data
• Measured at 77 stations
• Dispersed in 50 zip codes
• Only 6% of customer population lives in a zip code where there is an air station

Any analysis that focuses only on zip codes with air stations would be incomplete

Challenge #1

Step 1. Shepard Interpolation
• Calculate air miles between all zip codes
• Populate the air quality measures at zip codes with no stations with inverse distance weighted averages from nearby air stations
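Inverse distance weighting can be sketched as follows. This is not the team's code: the haversine formula is used here as one way to get "air miles", and the power of 2, the station coordinates, and the readings are all assumptions for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def air_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def shepard(target, stations, power=2.0):
    """Inverse-distance-weighted average of station readings at a target point."""
    numerator = 0.0
    denominator = 0.0
    for lat, lon, value in stations:
        distance = air_miles(target[0], target[1], lat, lon)
        if distance == 0.0:
            return value  # the target sits exactly on a station
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Hypothetical stations: (latitude, longitude, PM 2.5 reading).
stations = [(34.05, -118.25, 40.0), (34.15, -118.15, 20.0)]
zip_centroid = (34.10, -118.20)  # hypothetical zip code with no station

estimate = shepard(zip_centroid, stations)
print(round(estimate, 1))
```

Closer stations dominate the estimate because each weight falls off with the square of the distance.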

Step 2. Determine zip codes where asthma is over-represented
-  We calculated the prevalence of asthma for the overall population and for each zip code
-  We determined whether the distribution of disease prevalence is significantly different for a zip code by running a chi-square test at the zip code level
-  The cut-off for the p-value is 0.05
-  The standardized residuals are plotted (red: over-represented asthma; green: under-represented asthma)
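A per-zip-code test of this kind can be sketched in plain Python. The exact table layout and residual definition used in the analysis are not given on the slide, so this sketch assumes a 2x2 table (this zip code versus everyone else, asthma versus no asthma) and a Pearson residual for the asthma count; the population numbers are hypothetical.

```python
import math


def zip_code_test(cases_in_zip, pop_in_zip, cases_total, pop_total):
    """Chi-square test (1 degree of freedom) of one zip code against the rest.

    Returns (p_value, residual), where residual is the Pearson residual
    (observed - expected) / sqrt(expected) for asthma cases in the zip code.
    """
    rest_cases = cases_total - cases_in_zip
    rest_pop = pop_total - pop_in_zip
    observed = [
        [cases_in_zip, pop_in_zip - cases_in_zip],
        [rest_cases, rest_pop - rest_cases],
    ]
    row_totals = [pop_in_zip, rest_pop]
    col_totals = [cases_total, pop_total - cases_total]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / pop_total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))

    expected_cases = pop_in_zip * cases_total / pop_total
    residual = (cases_in_zip - expected_cases) / math.sqrt(expected_cases)
    return p_value, residual


# Hypothetical numbers: 8% prevalence overall, 15% in this zip code.
p_value, residual = zip_code_test(
    cases_in_zip=150, pop_in_zip=1000, cases_total=8000, pop_total=100000
)
flag = "over" if p_value < 0.05 and residual > 0 else "not over"
print(p_value < 0.05, round(residual, 2), flag)
```

A positive residual with a p-value under the 0.05 cut-off would mark the zip code as over-represented; a negative one as under-represented.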

Step 3. Spatial Alignment

Challenge #2

Predicting Asthma Admissions: Findings
• Prior Hospitalization: Our analysis found that patients who had prior asthma-related hospitalizations in the last 12 months were 4.85 times more likely to have a hospitalization (any) in the next 3 months compared to patients who had no prior asthma hospitalizations in the last 12 months.
• Socio-economic status: Of the various socio-economic status features we tried, the percent of the population under 50K is the one that was significant.
• Age Under 10 and Age Above 60: Compared to the reference group (patients between the ages of 10 and 60), these two age groups have an increased likelihood (~24% and ~10%) of being hospitalized in the next 3 months.
• History of Unfilled Medication: If a patient had an unfilled medication in their history, then ceteris paribus, they are 13% more likely to have a hospitalization (p = 2.7e-06).

Application #1: Asthma Population Management Application

Application #2: Asthma Management Application

Technology Adoption Journey of a Major Healthcare Provider

• Prove that better technology can speed up discovery
  – Code-a-thon
• Prove that better technology can improve model quality
  – Length of Stay Modeling
• Prove that technology is accessible to my clinicians and researchers
  – Comorbidity Feature Generation App
• Prove that data science can help in areas other than clinical analytics
  – Fraud Detection for Accounts Payable
• Prove that, once trained, our scientists can get to insights as quickly as the Pivotal DS team
  – EDIP Modeling in 4 days

Check out the Pivotal Data Science Blog! http://blog.pivotal.io/data-science-pivotal

A NEW PLATFORM FOR A NEW ERA
