Powering Research and Analytics with a Data Lake and Hadoop · 2019-02-09
Powering Research and Analytics with a Data Lake and Hadoop
Session #35, February 12, 2019
Rajan Chandras, Director Data Architecture and Strategy, NYU Langone Health
Marilyn Campbell, Manager, Clinical Data Analytics, NYU Langone Health
Rajan Chandras, MS, MSc, PMP, PAHM, has no real or apparent conflicts of interest to report.
Marilyn Campbell, MSc, has no real or apparent conflicts of interest to report.
Conflict of interest
• Learning objectives
• The problem
– Healthcare analytics and challenges to democratizing data
• The solution approach
– Hadoop and the NYU Langone data lake
• User experience and use cases
– Clinical Quality and Effectiveness
– Cardiovascular repository
– Predictive analytics
• Ongoing work
• Discussion
Agenda
• Recognize the unique analytic needs of healthcare researchers,
clinical analysts, informaticists and data scientists
• Compare and contrast different approaches to democratizing data
for researchers, clinical informaticists and data scientists
• Discuss the benefits and challenges of using Hadoop for
enterprise analytics
• Employ the Hadoop data management platform to implement a "data lake"
Learning objectives
Healthcare analytics: complex, unique
Challenges to democratizing data (each approach and its limitations)
• User access to multiple data sources: inefficient governance; M repositories x N users
• Network share: not practical; not secure; no value add
• Data virtualization/federation (a.k.a. EII): cannot scale; expensive
• Conventional databases and conventional data warehousing: conformance to complex data models; expensive ETL; limited scale; expensive at higher scale; cannot handle different types of data
• Appliances: expensive; vendor dependence
• NoSQL databases: not generic; depends on use case
The Hadoop big data platform
• 2003: Google File System
• 2004: MapReduce
• 2008: Cloudera
• 2009: MapR
• 2011: Hortonworks
• 2018: Cloudera + Hortonworks merger
Servers + Storage + Databases + Query/Processing
• Designed from the ground up for big data
• Open source and "co-opetitive"
• Secure, scalable and resilient
• On-premise or cloud
• Not just storage but also compute
• Support for streaming data
• Can store and process varied types of data
• Flexible, no pre-defined data models: files, SQL, NoSQL
• Support for BI/analytic tools: JDBC/ODBC, SQL, SAS, Python, R…
• Native and custom metadata
Hadoop for self-service analytics
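One practical consequence of "flexible, no pre-defined data models" is schema-on-read: the lake keeps raw records exactly as they arrive, and each analysis applies its own schema at query time. A minimal stdlib-only sketch of that idea (the file name, fields, and values are all hypothetical, not NYULH data):

```python
# Schema-on-read sketch: store raw JSON lines as-is, project a schema
# only when a particular analysis reads the data back.
import json
import os
import tempfile

raw_records = [
    '{"mrn": "A123", "dept": "cardiology", "ef": 55}',
    '{"mrn": "B456", "dept": "nephrology"}',          # missing field is fine on write
    '{"mrn": "C789", "dept": "cardiology", "ef": 38}',
]

path = os.path.join(tempfile.mkdtemp(), "encounters.jsonl")
with open(path, "w") as f:
    f.write("\n".join(raw_records))

def read_encounters(p):
    """Apply this analysis's 'read schema': keep only mrn and ejection fraction."""
    with open(p) as f:
        for line in f:
            rec = json.loads(line)
            yield {"mrn": rec["mrn"], "ef": rec.get("ef")}

# Query-time logic: patients with a recorded EF below 40.
low_ef = [r["mrn"] for r in read_encounters(path) if r["ef"] is not None and r["ef"] < 40]
print(low_ef)  # ['C789']
```

In the lake itself the "read schema" step would typically be a Hive external table or a Spark read over the raw files, but the division of labor is the same: ingestion stays cheap, structure is the reader's choice.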
Hadoop challenges
• Complex platform with ever-growing portfolio of technologies
• Not designed for transactional applications
• Limited SQL capabilities, e.g. referential integrity, stored
procedures, updates, indexes
• Tool integration takes patience and expertise
• There’s a learning curve
• Not a hammer for every nail
• Data ingestion & provisioning
• Lift and shift
• Immutable & user workspaces
• Self-service analytics
• Access technologies
• Data governance
• Integration with master data
• Integration with reference data
• Metadata and data lineage
• Data lake explorer
The NYULH data lake
• Enterprise Performance Analysis and Reporting (ePAR)
• Cardiovascular Data Repository (CVR)
• External Encounters and Data Sets
• Predictive Analytics Unit (PAU)
Data hive architecture
• “Put the data in the hands of those best qualified to analyze it”
– Have immediate use cases
– Are skilled in data management and in analytic software such as R, SQL, Python, SAS, and others
– Are motivated self-starters
• Limitations
– By design, the data lake is usable only by those with advanced analytic skills and knowledge
– Reliant on a motivated user community
– Lack of documentation
– Long-term vision; the impact on reporting efficiency will take time
The user experience
• Goal: self-service automated clinical quality reporting and analysis
• Why Hadoop works
– Centralized user access to multiple enterprise datasets and reference data
– Access to clinical data not dependent on IT
– Accessible via multiple analytic tools (SQL, SAS, R, Python, etc.)
– Clinical analysis uses separate resources from enterprise production reporting
• Use cases
– Base hospital encounter data for external reporting, including clinical data
– Internal reporting using CMS metric logic and reference data
Clinical Quality and Effectiveness
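As a rough illustration of the self-service SQL an analyst might run against the lake, the sketch below computes a toy per-hospital 30-day readmission rate, with Python's stdlib sqlite3 standing in for the lake's SQL engine (Hive/Impala); the table, columns, and measure are invented for illustration, not actual CMS metric logic:

```python
# Toy quality-metric query: numerator / denominator per hospital,
# run against an in-memory sqlite3 database standing in for the lake.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE encounters (enc_id INTEGER, hospital TEXT, readmit_30d INTEGER);
    INSERT INTO encounters VALUES
        (1, 'Main',  0), (2, 'Main',  1),
        (3, 'Ortho', 0), (4, 'Ortho', 0);
""")

rows = conn.execute("""
    SELECT hospital,
           1.0 * SUM(readmit_30d) / COUNT(*) AS readmit_rate
    FROM encounters
    GROUP BY hospital
    ORDER BY hospital
""").fetchall()
print(rows)  # [('Main', 0.5), ('Ortho', 0.0)]
```

Essentially the same statement could be submitted over the lake's JDBC/ODBC interface from SQL clients, SAS, R, or Python, which is the "accessible via multiple analytic tools" point above.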
• Goal: unify disparate cardiology “data islands”, ease information sharing, and enable data to shape cardiovascular practice and research
• Why Hadoop works
– Can readily absorb disparate pieces of data and enable fast data combination
– No traditional data warehouse organizational walls
– Easy to share analytic data sets; fosters community
– Repository for archived cardiac registry datasets
• Use cases
– Data mine, merge, and access previously isolated large data sets, e.g., EKG + structured morphological heart characteristics to localize the source of arrhythmia
– Machine learning to predict clinical outcomes of cardiovascular disease interventions, e.g., likelihood of success or complication of atrial fibrillation ablations, tailored for patients using their EHR and imaging information (deep clinical phenotype)
– Cardiovascular quality improvement dashboards, e.g., care for heart failure or hypertension
Cardiovascular Data Repository
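The "data islands" point can be made concrete with a toy merge of two cardiology datasets on a shared patient key; everything here (keys, field names, values) is hypothetical:

```python
# Two previously isolated "islands", keyed by a shared patient identifier.
ekg = {"A123": {"qt_interval_ms": 480}, "B456": {"qt_interval_ms": 400}}
imaging = {"A123": {"la_volume_ml": 95}, "B456": {"la_volume_ml": 60}}

# Combine records for patients present in both sources.
merged = {
    pid: {**ekg[pid], **imaging[pid]}
    for pid in ekg.keys() & imaging.keys()
}
print(merged["A123"])  # {'qt_interval_ms': 480, 'la_volume_ml': 95}
```

At lake scale the same merge is a join across tables or files, but the prerequisite is identical: once the islands land in one platform with a common key, combining them is a one-line operation rather than a data-request ticket.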
• Goal: translate clinical predictive models to the point of care; build, implement, deploy, evaluate, monitor, and maintain machine learning based clinical models
• Why Hadoop works
– Training sets for machine learning are data-hungry
– Building these datasets is resource-intensive, both in the complexity of table joins and in raw volume
– Past state: these intensive queries would sometimes get killed or fail to finish because of competition with production database activities
– Current state: the data lake and Hadoop let us quickly build datasets and run complex joins without competing against production activities
• Use cases
– Predict 2-month mortality risk for inpatients
– Predict primary diagnosis of congestive heart failure using natural language processing
– Predict patients at risk of end-stage renal disease (i.e., dialysis) in the next year
Predictive Analytics Unit
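A sketch of the cohort-building join described above, again with stdlib sqlite3 standing in for the lake's SQL engine; the patients, labs, and outcome label are toy data, and the feature set is invented for illustration:

```python
# Assemble one training row per patient: demographics + aggregated labs + label.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (pid INTEGER, age INTEGER);
    CREATE TABLE labs     (pid INTEGER, creatinine REAL);
    CREATE TABLE outcomes (pid INTEGER, esrd_1y INTEGER);
    INSERT INTO patients VALUES (1, 70), (2, 55);
    INSERT INTO labs     VALUES (1, 2.4), (1, 2.9), (2, 1.0);
    INSERT INTO outcomes VALUES (1, 1), (2, 0);
""")

training = conn.execute("""
    SELECT p.pid, p.age, MAX(l.creatinine) AS max_creat, o.esrd_1y
    FROM patients p
    JOIN labs l     ON l.pid = p.pid
    JOIN outcomes o ON o.pid = p.pid
    GROUP BY p.pid, p.age, o.esrd_1y
    ORDER BY p.pid
""").fetchall()
print(training)  # [(1, 70, 2.9, 1), (2, 55, 1.0, 0)]
```

The real versions of these joins span many tables and millions of rows; the point of the lake is that they run on dedicated compute instead of contending with production database workloads.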
• Optimize platform capabilities
• Create documentation
• Integrate with enterprise data governance tool for:
– Metadata
– Data lineage
– Reference data
• Expand content
• Establish as an operational and analytical enterprise data source
Ongoing work
• Martha J. Radford, MD, Chief Quality Officer, Professor of Medicine (Cardiology)
• Jeff Shein, BA, Senior Director, Enterprise Data Warehousing and Analytics
• Eugene Grossi, MD, Stephen B. Colvin Professor of Cardiothoracic Surgery
• Jason Kreuter, PhD, Director, Data & Analytics, Research Associate Professor, Dept. of Medicine
• Lior Jankelson, MD, PhD, Assistant Professor of Medicine
• Yindalon Aphinyanaphongs, MD, PhD, Director, Clinical Predictive Analytics Unit, Assistant Professor, Population Health and Medicine
• Swetha Nukala, MBBS, MPH, Department of Clinical Quality and Effectiveness
• Satyaki Adusumally, MS, Medical Center Information Technology
• Shekhar Vemuri, Chief Technology Officer, Clairvoyant LLC
Acknowledgments
Discussion: Challenges, Experiences, Questions
• Analytic architectures
• Big data technologies
• Cloud vs. on-premise
• Master data management
• Ontologies, vocabularies and reference data mgmt.
• Business glossaries
• Metadata and data lineage
• Data governance
• Shift in skills and tools
• Democratizing data
• How to win friends and influence people

Thank you! Remember to complete the online session evaluation.
[email protected] | www.linkedin.com/in/marilynmcampbell
[email protected] | www.linkedin.com/in/rchandras