Transcript of SAS Global Forum EC 2016: Transforming Analytics with Hadoop and Efficient Data Management
Transforming Analytics with Hadoop and Efficient Data Management
Ravi Shanbhag, Director, Data Sciences, UnitedHealthcare Group
Transforming Analytics with Hadoop and Efficient Data Management
Presented By:
Ravi Shanbhag: Director, Data Science, UHC
Agenda
• From a Legacy SAS Environment to a Grid
• Planning to Add Hadoop to the mix
• Benefits Realized
• Considerations
• Future Roadmap
From Legacy SAS to a Grid

2008 – 2011 (Legacy SAS):
• Primarily Base SAS usage
• Minimal metadata management
• No grid capabilities
• Upgrades were time-consuming and expensive
• 0 to ~100 users over 4 years
• Peak of 8,000 processes per week
• 8+ TB of SAS data

2011 – 2015 (Grid):
• Massive addition to our computation needs
• Advanced analytic capabilities
• Full mid-tier with zero-footprint applications
• Simplified onboarding
• 600+ users (6x increase)
• 90,000+ processes per week (12x increase)
• 150+ TB of SAS data (19x increase)
So Where is the Problem?
• Unified a lot of diverse groups – user growth exploded
• SAN Growth of 170% YoY - Storage Costs increased
• Data Governance Challenges
• Uneven stack utilization
• Cyclicality of business causes I/O and Disk Issues during peak loads
• Significant usage (and $) for Data Ingest, Cleansing and Provisioning
Adding Hadoop to the Mix

A unified data lake that will provide a 360-degree view of our data:
• End-user access via familiar SAS tools
• A superset of our data and a single source of truth for all analytics
• Deploy only mature components of the ecosystem
• Computational push-down into the cluster
• A unified security model for all our data (SAS or Hadoop)
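The "computational push-down" goal can be sketched in miniature: instead of pulling every row back to the client and aggregating there, ship the aggregation to the engine so only the small result set travels. A hedged Python sketch using an in-memory SQLite database as a stand-in for a SQL-on-Hadoop engine such as Hive or Impala (the `claims` table and its columns are invented for illustration):

```python
import sqlite3

# In-memory SQLite stands in for a remote SQL-on-Hadoop engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (state TEXT, amount REAL)")
conn.executemany("INSERT INTO claims VALUES (?, ?)",
                 [("MN", 100.0), ("MN", 50.0), ("TX", 75.0)])

# Anti-pattern: pull every row to the client, then aggregate locally.
rows = conn.execute("SELECT state, amount FROM claims").fetchall()
local = {}
for state, amount in rows:
    local[state] = local.get(state, 0.0) + amount

# Push-down: the engine aggregates; only one row per state comes back.
pushed = dict(conn.execute(
    "SELECT state, SUM(amount) FROM claims GROUP BY state"))

assert local == pushed  # same answer, far less data moved at scale
print(pushed)           # {'MN': 150.0, 'TX': 75.0}
```

At three rows the difference is invisible, but against billions of claim rows the first pattern moves the whole table over the network while the second moves a handful of aggregates.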
Hadoop Stack Maturity

A typical Hadoop stack (source: cloudera.com)

SQL-on-Hadoop engines and their vendors' own positioning (taglines from cloudera.com, apache.com, pivotal.io):
• Impala – "The Leading Open Source Analytic Database for Apache Hadoop"
• Presto – "Distributed SQL Query Engine for Big Data"
• HAWQ – "World's Most Advanced Enterprise SQL on Hadoop Analytic Engine"
• Spark SQL – "Spark's module for working with structured data"
• Phoenix – "High performance relational database layer over HBase for low latency applications"
• Hive – "A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis"
SQL Variants

Numerous SQL variants – which ones to choose? Maturity signals differ widely between engines, for example:
• One engine: > 6 years old, low source commentary, 28 committers and ~1,400 commits
• Another: < 3 years old, low source commentary, 80 committers and ~2,250 commits
Data Storage Variants

• File formats: Text, Sequence Files, RC, ORC, Avro, Parquet, etc.
• SAS formats: SASHDAT, HDMD, SPD, etc.
• Compression formats: Snappy, LZO, Gzip, BZip2, etc.
Data Storage Variants (Contd.)

Relative sizes for one dataset across formats:
• TXT, uncompressed: 2.0 GB
• Avro: 1.8 GB
• Avro + Snappy compression: 875 MB
• Parquet + Snappy compression: 400 MB
• ORC + Snappy compression: 350 MB
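The size gaps above come from combining an efficient layout with a compression codec. As a rough, self-contained illustration of what a codec alone buys on repetitive, warehouse-style records, here is a sketch using Python's standard-library codecs (the record layout and values are invented; real Snappy or LZO runs would need third-party libraries such as python-snappy):

```python
import bz2
import gzip
import lzma

# Repetitive rows stand in for warehouse-style data; column names and
# values are invented for illustration only.
raw = ("member_id,plan,state,premium\n"
       + "1001,gold,MN,450.00\n" * 50_000).encode("utf-8")

for name, codec in [("gzip", gzip), ("bz2", bz2), ("lzma", lzma)]:
    packed = codec.compress(raw)
    print(f"{name}: {len(raw):,} -> {len(packed):,} bytes "
          f"({len(raw) / len(packed):.0f}x smaller)")
```

Columnar formats such as Parquet and ORC do even better in practice because values of one column sit together, so runs and dictionaries compress far more tightly than row-oriented text.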
Data Storage Variants (Contd.)

Benchmarking:
• What formats work best for your workloads?
• Do your data structures change often?
• CPU vs. memory vs. I/O
• SQL on Hadoop? What tools will you use?
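A toy way to see the "CPU vs. memory vs. I/O" trade-off: compressed data is smaller on disk (less I/O) but costs CPU to decode on every read. A hedged stdlib-only sketch (payload contents are invented):

```python
import gzip
import time

# Smaller on disk means less I/O, but decompression burns CPU per read.
payload = ("claim,approved,2016-01-01\n" * 100_000).encode("utf-8")
packed = gzip.compress(payload)

start = time.perf_counter()
assert gzip.decompress(packed) == payload   # CPU spent to get bytes back
decode_ms = (time.perf_counter() - start) * 1000

print(f"raw={len(payload):,} bytes, gzip={len(packed):,} bytes, "
      f"decode={decode_ms:.1f} ms")
```

Running this kind of micro-benchmark against your own data and workload mix, rather than trusting published numbers, is the point of the questions above.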
Current Analytics Ecosystem

[Architecture diagram]
• Sources: Claims, Revenue, Membership, Financial, Clinical, Operational, Call Center, EDW (with a Publish path back to it)
• Data Lake: shared storage feeding Tenants A–D, each with its own compute and storage
• End-users: User Groups A–D, reaching the lake through the SAS HPA Grid (EG/EM/VA/Web/AMO) via SAS In-Memory, SAS/ACCESS, and the SAS Accelerators
• Other analytic tools connect through direct Hadoop access
Realized Benefits

• Speed to Market – data sourcing and standardization; flattened layers available for analytics; faster adoption of several ACA mandates
• User Adoption – familiar SAS end-user tools; inorganic growth
• Costs vs. Benefits – reduced opex; right tool for the right job; better utilization of the stack
• Unified Security Model – what gets done where?
Considerations
• Storage is Cheap isn’t it?
• Where’s my EASY button?
• All open-source is NOT created equal
• Multi-tenant Shared-services
Future

• Hadoop Data Lake – starting point for ingestion, transforms, and provisioning; a superset of all our data
• Computation Push-down – we'll push down as much as we can into the cluster; direct lift from Hadoop into the in-memory stack
• Utilize Other Components – NoSQL integration with SAS tools; enterprise search; streaming
• Grow – internal analytic use-cases on operational data; expand our user base
Thank You!
Questions?
Ravi Shanbhag, Director, Data Science, UHC
www.linkedin.com/in/ravishanbhag