Transcript of SAS Global Forum EC 2016: Transforming Analytics with Hadoop and Efficient Data Management
Transforming Analytics with Hadoop and Efficient Data Management
Ravi Shanbhag, Director, Data Sciences, UnitedHealthcare Group
Transforming Analytics with Hadoop and Efficient Data Management
Presented By:
Ravi Shanbhag: Director, Data Science, UHC
Agenda
• From a Legacy SAS Environment to a Grid
• Planning to Add Hadoop to the mix
• Benefits Realized
• Considerations
• Future Roadmap
From Legacy SAS to a Grid

2008 – 2011 (Legacy SAS):
• Primarily Base SAS usage
• Minimal metadata management
• No grid capabilities
• Upgrades were time-consuming and expensive
• 0 to ~100 users over 4 years
• Peak of 8,000 processes per week
• 8+ TB of SAS data

2011 – 2015 (Grid):
• Massive addition to our computation needs
• Advanced analytic capabilities
• Full mid-tier with zero-footprint applications
• Simplified onboarding
• 600+ users (6x increase)
• 90,000+ processes per week (12x increase)
• 150+ TB of SAS data (19x increase)
So Where is the Problem?
• Unified a lot of diverse groups – user growth exploded
• SAN Growth of 170% YoY - Storage Costs increased
• Data Governance Challenges
• Uneven stack utilization
• Cyclicality of business causes I/O and Disk Issues during peak loads
• Significant usage (and $) for Data Ingest, Cleansing and Provisioning
Adding Hadoop to the Mix

A unified data lake that will provide a 360-degree view of our data:
• End-user access via familiar SAS tools
• A superset of our data and a single source of truth for all analytics
• Deploy only mature components of the ecosystem
• Computational push-down into the cluster
• A unified security model for all our data (SAS or Hadoop)
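The "computational push-down" goal can be sketched in miniature: instead of pulling every row back to the client and aggregating there, ship the aggregation to the engine so only the small result set travels. A hedged Python sketch using an in-memory SQLite database as a stand-in for a SQL-on-Hadoop engine such as Hive or Impala (the `claims` table and its columns are invented for illustration):

```python
import sqlite3

# In-memory SQLite stands in for a remote SQL-on-Hadoop engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (state TEXT, amount REAL)")
conn.executemany("INSERT INTO claims VALUES (?, ?)",
                 [("MN", 100.0), ("MN", 50.0), ("TX", 75.0)])

# Anti-pattern: pull every row to the client, then aggregate locally.
rows = conn.execute("SELECT state, amount FROM claims").fetchall()
local = {}
for state, amount in rows:
    local[state] = local.get(state, 0.0) + amount

# Push-down: the engine aggregates; only one row per state comes back.
pushed = dict(conn.execute(
    "SELECT state, SUM(amount) FROM claims GROUP BY state"))

assert local == pushed  # same answer, far less data moved at scale
print(pushed)           # {'MN': 150.0, 'TX': 75.0}
```

At three rows the difference is invisible, but against billions of claim rows the first pattern moves the whole table over the network while the second moves a handful of aggregates.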
Hadoop Stack Maturity

A typical Hadoop stack (source: cloudera.com)

SQL-on-Hadoop engines and their vendors' own positioning (taglines from cloudera.com, apache.com, pivotal.io):
• Impala – "The Leading Open Source Analytic Database for Apache Hadoop"
• Presto – "Distributed SQL Query Engine for Big Data"
• HAWQ – "World's Most Advanced Enterprise SQL on Hadoop Analytic Engine"
• Spark SQL – "Spark's module for working with structured data"
• Phoenix – "High performance relational database layer over HBase for low latency applications"
• Hive – "A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis"
SQL Variants

Numerous SQL variants – which ones to choose? Maturity signals differ widely between engines, for example:
• One engine: > 6 years old, low source commentary, 28 committers and ~1,400 commits
• Another: < 3 years old, low source commentary, 80 committers and ~2,250 commits
Data Storage Variants

• File formats: Text, Sequence Files, RC, ORC, Avro, Parquet, etc.
• SAS formats: SASHDAT, HDMD, SPD, etc.
• Compression formats: Snappy, LZO, Gzip, BZip2, etc.
Data Storage Variants (Contd.)

Relative sizes for one dataset across formats:
• TXT, uncompressed: 2.0 GB
• Avro: 1.8 GB
• Avro + Snappy compression: 875 MB
• Parquet + Snappy compression: 400 MB
• ORC + Snappy compression: 350 MB
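The size gaps above come from combining an efficient layout with a compression codec. As a rough, self-contained illustration of what a codec alone buys on repetitive, warehouse-style records, here is a sketch using Python's standard-library codecs (the record layout and values are invented; real Snappy or LZO runs would need third-party libraries such as python-snappy):

```python
import bz2
import gzip
import lzma

# Repetitive rows stand in for warehouse-style data; column names and
# values are invented for illustration only.
raw = ("member_id,plan,state,premium\n"
       + "1001,gold,MN,450.00\n" * 50_000).encode("utf-8")

for name, codec in [("gzip", gzip), ("bz2", bz2), ("lzma", lzma)]:
    packed = codec.compress(raw)
    print(f"{name}: {len(raw):,} -> {len(packed):,} bytes "
          f"({len(raw) / len(packed):.0f}x smaller)")
```

Columnar formats such as Parquet and ORC do even better in practice because values of one column sit together, so runs and dictionaries compress far more tightly than row-oriented text.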
Data Storage Variants (Contd.)

Benchmarking:
• What formats work best for your workloads?
• Do your data structures change often?
• CPU vs. memory vs. I/O
• SQL on Hadoop? What tools will you use?
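A toy way to see the "CPU vs. memory vs. I/O" trade-off: compressed data is smaller on disk (less I/O) but costs CPU to decode on every read. A hedged stdlib-only sketch (payload contents are invented):

```python
import gzip
import time

# Smaller on disk means less I/O, but decompression burns CPU per read.
payload = ("claim,approved,2016-01-01\n" * 100_000).encode("utf-8")
packed = gzip.compress(payload)

start = time.perf_counter()
assert gzip.decompress(packed) == payload   # CPU spent to get bytes back
decode_ms = (time.perf_counter() - start) * 1000

print(f"raw={len(payload):,} bytes, gzip={len(packed):,} bytes, "
      f"decode={decode_ms:.1f} ms")
```

Running this kind of micro-benchmark against your own data and workload mix, rather than trusting published numbers, is the point of the questions above.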
Current Analytics Ecosystem

[Architecture diagram]
• Sources: Claims, Revenue, Membership, Financial, Clinical, Operational, Call Center, EDW (with a Publish path back to it)
• Data Lake: shared storage feeding Tenants A–D, each with its own compute and storage
• End-users: User Groups A–D, reaching the lake through the SAS HPA Grid (EG/EM/VA/Web/AMO) via SAS In-Memory, SAS/ACCESS, and the SAS Accelerators
• Other analytic tools connect through direct Hadoop access
Realized Benefits

• Speed to Market – data sourcing and standardization; flattened layers available for analytics; faster adoption of several ACA mandates
• User Adoption – familiar SAS end-user tools; inorganic growth
• Costs vs. Benefits – reduced opex; right tool for the right job; better utilization of the stack
• Unified Security Model – what gets done where?
Considerations
• Storage is Cheap isn’t it?
• Where’s my EASY button?
• All open-source is NOT created equal
• Multi-tenant Shared-services
Future

• Hadoop Data Lake – starting point for ingestion, transforms, and provisioning; a superset of all our data
• Computation Push-down – we'll push down as much as we can into the cluster; direct lift from Hadoop into the in-memory stack
• Utilize Other Components – NoSQL integration with SAS tools; enterprise search; streaming
• Grow – internal analytic use-cases on operational data; expand our user base
Thank You!
Questions?
Ravi Shanbhag, Director, Data Science, UHC
www.linkedin.com/in/ravishanbhag