Powering Research and Analytics with a Data Lake and Hadoop · 2019-02-09
Powering Research and Analytics with a Data Lake and Hadoop
Session #35, February 12, 2019
Rajan Chandras, Director Data Architecture and Strategy, NYU Langone Health
Marilyn Campbell, Manager, Clinical Data Analytics, NYU Langone Health
Rajan Chandras, MS, MSc, PMP, PAHM, has no real or apparent conflicts of interest to report.
Marilyn Campbell, MSc, has no real or apparent conflicts of interest to report.
Conflict of interest
• Learning objectives
• The problem
– Healthcare analytics and challenges to democratizing data
• The solution approach
– Hadoop and the NYU Langone data lake
• User experience and use cases
– Clinical Quality and Effectiveness
– Cardiovascular repository
– Predictive analytics
• Ongoing work
• Discussion
Agenda
• Recognize the unique analytic needs of healthcare researchers,
clinical analysts, informaticists and data scientists
• Compare and contrast different approaches to democratizing data
for researchers, clinical informaticists and data scientists
• Discuss the benefits and challenges of using Hadoop for
enterprise analytics
• Employ the Hadoop data management platform to implement a "data lake"
Learning objectives
Healthcare analytics: complex, unique
Challenges to democratizing data (each approach and its limitations)
• User access to multiple data sources: inefficient governance; M repositories x N users
• Network share: not practical; not secure; no value add
• Data virtualization/federation (a.k.a. EII): cannot scale; expensive
• Conventional databases and conventional data warehousing: conformance to complex data models; expensive ETL; limited scale; expensive at higher scale; cannot handle different types of data
• Appliances: expensive; vendor dependence
• NoSQL databases: not generic; depends on use case
The Hadoop big data platform
• 2003: Google File System
• 2004: MapReduce
• 2008: Cloudera
• 2009: MapR
• 2011: Hortonworks
• 2018: Cloudera + Hortonworks merger
Servers + Storage + Databases + Query/Processing
• Designed from the ground up for big data
• Open source and "co-opetitive"
• Secure, scalable and resilient
• On-premise or cloud
• Not just storage but also compute
• Support for streaming data
• Can store and process varied types of data
• Flexible, no pre-defined data models: files, SQL, NoSQL
• Support for BI/analytic tools: JDBC/ODBC, SQL, SAS, Python, R…
• Native and custom metadata
Hadoop for self-service analytics
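One practical consequence of "flexible, no pre-defined data models" is schema-on-read: the lake keeps raw records exactly as they arrive, and each analysis applies its own schema at query time. A minimal stdlib-only sketch of that idea (the file name, fields, and values are all hypothetical, not NYULH data):

```python
# Schema-on-read sketch: store raw JSON lines as-is, project a schema
# only when a particular analysis reads the data back.
import json
import os
import tempfile

raw_records = [
    '{"mrn": "A123", "dept": "cardiology", "ef": 55}',
    '{"mrn": "B456", "dept": "nephrology"}',          # missing field is fine on write
    '{"mrn": "C789", "dept": "cardiology", "ef": 38}',
]

path = os.path.join(tempfile.mkdtemp(), "encounters.jsonl")
with open(path, "w") as f:
    f.write("\n".join(raw_records))

def read_encounters(p):
    """Apply this analysis's 'read schema': keep only mrn and ejection fraction."""
    with open(p) as f:
        for line in f:
            rec = json.loads(line)
            yield {"mrn": rec["mrn"], "ef": rec.get("ef")}

# Query-time logic: patients with a recorded EF below 40.
low_ef = [r["mrn"] for r in read_encounters(path) if r["ef"] is not None and r["ef"] < 40]
print(low_ef)  # ['C789']
```

In the lake itself the "read schema" step would typically be a Hive external table or a Spark read over the raw files, but the division of labor is the same: ingestion stays cheap, structure is the reader's choice.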
Hadoop challenges
• Complex platform with ever-growing portfolio of technologies
• Not designed for transactional applications
• Limited SQL capabilities, e.g. referential integrity, stored
procedures, updates, indexes
• Tool integration takes patience and expertise
• There’s a learning curve
• Not a hammer for every nail
• Data ingestion & provisioning
• Lift and shift
• Immutable & user workspaces
• Self-service analytics
• Access technologies
• Data governance
• Integration with master data
• Integration with reference data
• Metadata and data lineage
• Data lake explorer
The NYULH data lake
• Enterprise Performance Analysis and Reporting (ePAR)
• Cardiovascular Data Repository (CVR)
• External Encounters and Data Sets
• Predictive Analytics Unit (PAU)
Data hive architecture
• “Put the data in the hands of those best qualified to analyze it”
– Have immediate use cases
– Are skilled in data management and in analytic software such as R, SQL, Python, SAS, and others
– Are motivated self-starters
• Limitations
– By design, the data lake is usable only by those with advanced analytic skills and knowledge
– Reliant on a motivated user community
– Lack of documentation
– Long-term vision; the impact on reporting efficiency will take time
The user experience
• Goal: self-service automated clinical quality reporting and analysis
• Why Hadoop works
– Centralized user access to multiple enterprise datasets and reference data
– Access to clinical data not dependent on IT
– Accessible via multiple analytic tools (SQL, SAS, R, Python, etc.)
– Clinical analysis uses separate resources from enterprise production reporting
• Use cases
– Base hospital encounter data for external reporting, including clinical data
– Internal reporting using CMS metric logic and reference data
Clinical Quality and Effectiveness
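As a rough illustration of the self-service SQL an analyst might run against the lake, the sketch below computes a toy per-hospital 30-day readmission rate, with Python's stdlib sqlite3 standing in for the lake's SQL engine (Hive/Impala); the table, columns, and measure are invented for illustration, not actual CMS metric logic:

```python
# Toy quality-metric query: numerator / denominator per hospital,
# run against an in-memory sqlite3 database standing in for the lake.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE encounters (enc_id INTEGER, hospital TEXT, readmit_30d INTEGER);
    INSERT INTO encounters VALUES
        (1, 'Main',  0), (2, 'Main',  1),
        (3, 'Ortho', 0), (4, 'Ortho', 0);
""")

rows = conn.execute("""
    SELECT hospital,
           1.0 * SUM(readmit_30d) / COUNT(*) AS readmit_rate
    FROM encounters
    GROUP BY hospital
    ORDER BY hospital
""").fetchall()
print(rows)  # [('Main', 0.5), ('Ortho', 0.0)]
```

Essentially the same statement could be submitted over the lake's JDBC/ODBC interface from SQL clients, SAS, R, or Python, which is the "accessible via multiple analytic tools" point above.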
• Goal: unify disparate cardiology “data islands”, ease information sharing, and enable data to shape cardiovascular practice and research
• Why Hadoop works
– Can readily absorb disparate pieces of data and enable fast data combination
– No traditional data warehouse organizational walls
– Easy to share analytic data sets; fosters community
– Repository for archived cardiac registry datasets
• Use cases
– Data mine, merge, and access previously isolated large data sets, e.g., EKG + structured morphological heart characteristics to localize the source of arrhythmia
– Machine learning to predict clinical outcomes of cardiovascular disease interventions, e.g., likelihood of success or complication of atrial fibrillation ablations, tailored for patients using their EHR and imaging information (deep clinical phenotype)
– Cardiovascular quality improvement dashboards, e.g., care for heart failure or hypertension
Cardiovascular Data Repository
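The "data islands" point can be made concrete with a toy merge of two cardiology datasets on a shared patient key; everything here (keys, field names, values) is hypothetical:

```python
# Two previously isolated "islands", keyed by a shared patient identifier.
ekg = {"A123": {"qt_interval_ms": 480}, "B456": {"qt_interval_ms": 400}}
imaging = {"A123": {"la_volume_ml": 95}, "B456": {"la_volume_ml": 60}}

# Combine records for patients present in both sources.
merged = {
    pid: {**ekg[pid], **imaging[pid]}
    for pid in ekg.keys() & imaging.keys()
}
print(merged["A123"])  # {'qt_interval_ms': 480, 'la_volume_ml': 95}
```

At lake scale the same merge is a join across tables or files, but the prerequisite is identical: once the islands land in one platform with a common key, combining them is a one-line operation rather than a data-request ticket.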
• Goal: translate clinical predictive models to the point of care; build, implement, deploy, evaluate, monitor, and maintain machine learning based clinical models
• Why Hadoop works
– Training sets for machine learning are data-hungry
– Building these datasets is resource-intensive, both in the complexity of table joins and in raw volume
– Past state: these intensive queries would sometimes get killed or fail to finish because of competition with production database activities
– Current state: the data lake and Hadoop let us quickly build datasets and run complex joins without competing against production activities
• Use cases
– Predict 2-month mortality risk for inpatients
– Predict primary diagnosis of congestive heart failure using natural language processing
– Predict patients at risk of end-stage renal disease (i.e., dialysis) in the next year
Predictive Analytics Unit
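A sketch of the cohort-building join described above, again with stdlib sqlite3 standing in for the lake's SQL engine; the patients, labs, and outcome label are toy data, and the feature set is invented for illustration:

```python
# Assemble one training row per patient: demographics + aggregated labs + label.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (pid INTEGER, age INTEGER);
    CREATE TABLE labs     (pid INTEGER, creatinine REAL);
    CREATE TABLE outcomes (pid INTEGER, esrd_1y INTEGER);
    INSERT INTO patients VALUES (1, 70), (2, 55);
    INSERT INTO labs     VALUES (1, 2.4), (1, 2.9), (2, 1.0);
    INSERT INTO outcomes VALUES (1, 1), (2, 0);
""")

training = conn.execute("""
    SELECT p.pid, p.age, MAX(l.creatinine) AS max_creat, o.esrd_1y
    FROM patients p
    JOIN labs l     ON l.pid = p.pid
    JOIN outcomes o ON o.pid = p.pid
    GROUP BY p.pid, p.age, o.esrd_1y
    ORDER BY p.pid
""").fetchall()
print(training)  # [(1, 70, 2.9, 1), (2, 55, 1.0, 0)]
```

The real versions of these joins span many tables and millions of rows; the point of the lake is that they run on dedicated compute instead of contending with production database workloads.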
• Optimize platform capabilities
• Create documentation
• Integrate with enterprise data governance tool for:
– Metadata
– Data lineage
– Reference data
• Expand content
• Establish as an operational and analytical enterprise data source
Ongoing work
• Martha J. Radford, MD, Chief Quality Officer, Professor of Medicine (Cardiology)
• Jeff Shein, BA, Senior Director, Enterprise Data Warehousing and Analytics
• Eugene Grossi, MD, Stephen B. Colvin Professor of Cardiothoracic Surgery
• Jason Kreuter, PhD, Director, Data & Analytics, Research Associate Professor, Dept. of Medicine
• Lior Jankelson, MD, PhD, Assistant Professor of Medicine
• Yindalon Aphinyanaphongs, MD, PhD, Director, Clinical Predictive Analytics Unit, Assistant Professor, Population Health and Medicine
• Swetha Nukala, MBBS, MPH, Department of Clinical Quality and Effectiveness
• Satyaki Adusumally, MS, Medical Center Information Technology
• Shekhar Vemuri, Chief Technology Officer, Clairvoyant LLC
Acknowledgments
Discussion: Challenges, Experiences, Questions
• Analytic architectures
• Big data technologies
• Cloud vs. on-premise
• Master data management
• Ontologies, vocabularies and reference data mgmt.
• Business glossaries
• Metadata and data lineage
• Data governance
• Shift in skills and tools
• Democratizing data
• How to win friends and influence people

Thank you! Remember to complete the online session evaluation.
[email protected] | www.linkedin.com/in/marilynmcampbell
[email protected] | www.linkedin.com/in/rchandras