R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

description

The business cases for Hadoop can be made on the tremendous operational cost savings that it affords. But why stop there? The integration of R-powered analytics in Hadoop presents a totally new value proposition. Organizations can write R code and deploy it natively in Hadoop without data movement or the need to write their own MapReduce. Bringing R-powered predictive analytics into Hadoop will accelerate Hadoop’s value to organizations by allowing them to break through performance and scalability challenges and solve new analytic problems. Use all the data in Hadoop to discover more, grow more quickly, and operate more efficiently. Ask bigger questions. Ask new questions. Get better, faster results and share them.

Transcript of R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Page 1: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Revolution Confidential

Revolution Analytics & Cloudera Confidential

R + Hadoop: Ask Bigger (and New) Questions and Get Better, Faster Answers

Michele Chambers, Chief Strategy Officer & VP Product Mgmt

Jai Ranganathan, Director Product Mgmt & Strategy

Page 2: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Period of Disruption

1st Generation Predictive Analytics

Page 3: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Today’s Challenge: Accelerating Business Cadence

Changing Business Environment

• Fact Based Decisions Require More Data

• Need to Understand Tradeoffs and Best Course of Action

• Predictive Models Need to Continually Deliver Lift

• Reduced Shelf Life for Predictive Models

Faster Time to Value

• Reduce Analytic Cycle Time

• Build & Deploy Models Faster

• Eliminate Time Consuming Data Movements

Rapid Customer Facing Decisions

• Score More Frequently

• Need to Make Best Decision in Real Time


Page 4: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

2nd Generation Modern Analytics

[Figure labels: Big Data; Machine Learning; Quick to Fail; Lift]

Page 5: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Typical Technology Challenges Our Customers Face

Big Data

• New Data Sources

• Data Variety & Velocity

• Fine Grain Control

• Data Movement, Memory Limits

Complex Computation

• Experimentation

• Many Small Models

• Ensemble Models

• Simulation

Enterprise Readiness

• Heterogeneous Landscape

• Write Once, Deploy Anywhere

• Skill Shortage

• Production Support

Production Efficiency

• Shorter Model Shelf Life

• Volume of Models

• Long End-to-End Cycle Time

• Pace of Decision Accelerated


Page 6: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Big Data Big Analytics Is Different

Page 7: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers


Page 8: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

y=ax+b

Page 9: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

[Figure: the model y=ax+b shown eight times]

Page 10: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

[Figure labels: New model; Existing model]

Page 11: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

[Chart: Accuracy (vertical axis, 60% to 100%) versus False Positives (horizontal axis, 0% to 30%). Series: Add unstructured data; Existing model]

Page 12: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Big Data Big Analytics Use Cases

One Big Model

• Build predictive models with (very) large datasets

• More rows/observations and/or more columns/features

• Tend to use dimension reduction, machine learning and/or ensemble techniques

Big Data Scoring

• Score and predict with (very) large datasets with previously built model

• Score in batch or individual transactions

• Previously built model may be exported from the model build environment to the model deployment environment

Many Small Models

• Model factories build predictive models in quantity

• Automated building of individualized models and/or parallel individualized model execution

Scoring Many Models

• Score and predict with many individualized models

• Production model factories require model management

Computationally Intensive Analytics

• Analytic models that are mathematically intense

• May not use large data sets but generate a lot of interim calculations

• May include vectorization, simulation, optimization
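The "Many Small Models" use case above can be sketched in a few lines. This is a minimal Python illustration, not Revolution R Enterprise code: it assumes each row carries a hypothetical segment key and fits one y = ax + b line per segment by ordinary least squares, then scores new inputs with the matching model. The names `fit_line`, `build_models`, and `score` are invented for this example.

```python
from collections import defaultdict


def fit_line(points):
    """Fit y = a*x + b to (x, y) pairs by ordinary least squares."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values per segment")
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return a, b


def build_models(rows):
    """Model factory: one small model per segment key.

    rows is an iterable of (segment_key, x, y) tuples (a hypothetical layout).
    """
    groups = defaultdict(list)
    for key, x, y in rows:
        groups[key].append((x, y))
    return {key: fit_line(points) for key, points in groups.items()}


def score(models, key, x):
    """Score one observation with the model that belongs to its segment."""
    a, b = models[key]
    return a * x + b
```

In a real model factory the per-segment fits are independent of one another, which is what makes them a natural candidate for parallel execution.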


Page 13: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Big Data Big Analytics Specialized Use Cases

Time Series Analytics

• Build forecasts with time-sequenced data

• For Big Data, these tend to be many small models, especially for machine data

• Typical Big Data volumes require model management

Text and Document Analytics

• Use of unstructured, free text

• For Big Data, typically used to enhance structured predictive analytics

• Minimally requires text processing tools and may also require natural language processing

Mining Data Streams

• Analyzing continuous, high-speed data flows for patterns and acting upon the patterns in real time

• Requires specialized sampling and filtering techniques

• Uses distinct discovery analytics methods such as frequent itemsets or clustering

Zero Latency

• No separation of model building and model scoring

• As real-time data becomes more widely available, this emerging category reduces time-to-insight
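The "specialized sampling" mentioned under Mining Data Streams can be illustrated with reservoir sampling, a standard technique for keeping a fixed-size uniform sample of a stream whose length is not known in advance. The slides do not name a specific method, so this Python sketch is an assumption about what such sampling might look like; `reservoir_sample` is a hypothetical helper name.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Uses the classic reservoir algorithm: the first k items fill the
    reservoir, and item i (0-based) then replaces a random slot with
    probability k / (i + 1).
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

The stream is read once and only k items are ever held in memory, which is the property that matters for continuous, high-speed flows.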


Page 14: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Analytic Reference Architecture

[Diagram layers: Decision (Analytic Applications); Integration (Middleware); Data (Hadoop, Data Warehouse, Other Data Sources); Analytics (Analytics Development Tools & Platforms)]

Page 15: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Architectural Approaches to Analytics

[Diagram: two layered stacks. Beside Architecture: Decision (Analytic Applications); Integration (Middleware); Analytics (Analytics Development Tools & Platforms, Local Data Mart); Data (Data Sources). Inside Architecture: Decision (Analytic Applications); Integration (Middleware); Data + Analytics (Analytics Development Tools & Platforms, Data Sources)]

Page 16: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Pros & Cons of Architectural Approaches

Beside Architecture

• Analytic workflow tasks performed in a separate analytics environment outside of the source database

• Pros: Segregates analytic workload

• Cons: Doesn’t leverage the powerful production database for transformations; introduces scoring latencies

Inside Architecture

• Analytic workflow tasks performed inside the source database with embedded analytics

• Pros: Eliminates data movement, reduces model latency, allows exploration of all data

• Cons: IT governance on production, potential new skills

Hybrid Architecture

• Some analytic workflow tasks performed inside the source database and others performed in a separate analytics environment

• Pros: Leverages strengths of each architecture

• Cons: Must maintain multiple environments

Page 17: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Building & Deploying Analytic Models

[Diagram: numbered workflow steps for the Beside, Inside, and Hybrid architectures, connecting Analytics Development Tools & Platforms, Local Data Mart, and Data Sources. Legend: Model Build; Model Deploy; Model Recode / PMML; Update Data; Data Prep / Marshaling]

Page 18: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers


Page 19: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Our platform vision

Lower cost per TB

Avoid data copying

Minimize big data movement

Simplify the IT and user experience

Organizations bring their applications to Hadoop data

©2013 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.

Page 20: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Traditional workloads in Hadoop

WORKLOADS IN HADOOP

Search

Analytics

Self-service BI

Data Processing (ELT)

In Cloudera

• 2-10X the performance

• 1/10th the cost

In Cloudera

• Integrated R support for deep analytics

• Takes advantage of entire cluster for high performance

• More granular datasets with more model features

In Cloudera

• Data exploration on the full fidelity data

• Faster lifecycle from source data to mini-mart

• 1/10th the cost OLAP reporting

Page 21: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Enterprise-Grade Solutions for Big Data: Key Characteristics

Page 22: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Cloudera Manager & R integration

Seamless cluster administration for Revolution R Enterprise

1. Deploy: Deploy Revolution R Enterprise quickly and easily onto your CDH cluster

2. Configure & Optimize: Ensure optimal settings are configured for performance of Revolution R Enterprise

3. Monitor, Diagnose & Report: Identify resource controls, monitor performance, debug and diagnose issues through a single consolidated interface

Page 23: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers


Page 24: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

What is the R Language?

A Platform…

• A Procedural Language for Stats, Math and Data Science

• A Complete Data Visualization Framework

• Provided as Open Source

A Community…

• 2M+ Users with the Skill to Tackle Big Data Statistical and Numerical Analysis and Machine Learning Projects

• Active User Groups Across the World

An Ecosystem

• CRAN: 4500+ Freely Available Algorithms, Test Data and Evaluations

Page 25: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Revolution R Enterprise

Revolution R Enterprise is the only enterprise big data big analytics platform based on the open source R statistical computing language

• Portable Across Enterprise Platforms

• High Performance, Scalable Analytics

• Easier to Build & Deploy

Page 26: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

R is open source and drives analytic innovation, but it has some limitations for enterprises

Area                       Open Source R                         Revolution R Enterprise
Big Data                   In-memory bound                       Disk-based scalability
Speed of Analysis          Single threaded                       Parallel threading
Enterprise Readiness       Community support                     Commercial support
Analytic Breadth & Depth   4500+ innovative analytic packages    Leverage open source packages plus Big Data ready packages
Commercial Viability       Risk of deployment of open source     Commercial license

Page 27: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Introducing Revolution R Enterprise: The Big Data Big Analytics Platform

[Diagram: Revolution R Enterprise. Layers: Language Interpreter and Standard R Algorithm Suites; Development & Deployment Tooling; Big Data Distributed Execution Platform. Components: R + CRAN, RevoR, DistributedR, ConnectR, ScaleR, DevelopR, DeployR]

Page 28: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Big Data Speed @ Scale with Revolution R Enterprise

Fast Math Libraries

Parallelized Algorithms

In-Database Execution

Multi-Threaded Execution

Multi-Core Processing

In-Hadoop Execution

Memory Management

Parallelized User Code

First, we enhance and accelerate the Open Source R interpreter.

Page 29: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Open Source R performance: Multi-threaded Math

[Chart labels: Open Source R; Revolution R Enterprise]

Computation (4-core laptop)        Open Source R   Revolution R   Speedup
Linear Algebra [1]
Matrix Multiply                    176 sec         9.3 sec        18x
Cholesky Factorization             25.5 sec        1.3 sec        19x
Linear Discriminant Analysis       189 sec         74 sec         3x
General R Benchmarks [2]
R Benchmarks (Matrix Functions)    22 sec          3.5 sec        5x
R Benchmarks (Program Control)     5.6 sec         5.4 sec        Not appreciable

[1] http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php

[2] http://r.research.att.com/benchmarks/

Customers report 3-50x performance improvements compared to Open Source R, without changing any code

Page 30: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Big Data Speed @ Scale with Revolution R Enterprise

Fast Math Libraries

Parallelized Algorithms

In-Database Execution

Multi-Threaded Execution

Multi-Core Processing

In-Hadoop Execution

Memory Management

Parallelized User Code

Second, we built a platform for hosting R with Big Data on a variety of massively parallel platforms.

Page 31: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Revolution R Enterprise DistributedR: Innovative Memory Management, Multi-Threaded Execution, Multi-Core Processing

• A Revolution R Enterprise ScaleR analytic is given a data source as input.

• The analytic loops over the data, reading a block at a time.

• Blocks of data are read by a separate worker thread (Thread 0).

• Worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory.

• When all of the data is processed, a master results object is created from the intermediate results objects.

COMBINE INTERMEDIATE RESULTS
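The block-at-a-time loop described on this slide can be sketched as follows. This is a simplified Python illustration of the pattern, not the DistributedR or ScaleR implementation: one reader thread feeds blocks into a queue, worker threads each update their own intermediate results object, and a master result is combined at the end. The statistic computed (count, mean, population variance), the in-memory list standing in for the data source, and all names (`Partial`, `chunked_summary`) are assumptions made for the example.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Partial:
    """Intermediate results object: sufficient statistics for mean/variance."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def update(self, block):
        for x in block:
            self.n += 1
            self.total += x
            self.total_sq += x * x

    def merge(self, other):
        return Partial(self.n + other.n,
                       self.total + other.total,
                       self.total_sq + other.total_sq)


def chunked_summary(data, block_size=1000, n_workers=3):
    """Return (count, mean, population variance) of a non-empty sequence."""
    blocks = queue.Queue(maxsize=2 * n_workers)  # bounded read-ahead buffer
    partials = [Partial() for _ in range(n_workers)]

    def reader():
        # "Thread 0": reads one block at a time while workers are busy.
        for start in range(0, len(data), block_size):
            blocks.put(data[start:start + block_size])
        for _ in range(n_workers):
            blocks.put(None)  # one stop signal per worker

    def worker(partial):
        # "Threads 1..n": each updates only its own intermediate object.
        while True:
            block = blocks.get()
            if block is None:
                break
            partial.update(block)

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker, args=(p,)) for p in partials]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Combine intermediate results into the master results object.
    master = Partial()
    for p in partials:
        master = master.merge(p)
    mean = master.total / master.n
    variance = master.total_sq / master.n - mean ** 2
    return master.n, mean, variance
```

Because each worker owns its intermediate object, no locking is needed until the final combine step, and only a few blocks are held in memory at any time regardless of the total data size.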


Page 32: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Revolution R Enterprise ScaleR: Performance and Capacity

Page 33: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Benchmarking comparison*: Logistic Regression

                 SAS HPA          Revolution R
Rows of data     1 billion        1 billion
Parameters       “just a few”     7
Time             80 seconds       44 seconds
Data location    In memory        On disk
Nodes            32               5
Cores            384              20
RAM              1,536 GB         80 GB

Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.

*As published by SAS in HPC Wire, April 21, 2011
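The "a 20th" and "a 6th" figures follow directly from the numbers in the comparison table. A short Python check, using only values from the slide (the dictionary and function names are hypothetical):

```python
# Figures copied from the slide's comparison table.
sas_hpa = {"seconds": 80, "nodes": 32, "cores": 384, "ram_gb": 1536}
revolution_r = {"seconds": 44, "nodes": 5, "cores": 20, "ram_gb": 80}


def resource_ratios(baseline, candidate):
    """Return baseline / candidate for every metric in the baseline."""
    return {name: baseline[name] / candidate[name] for name in baseline}


ratios = resource_ratios(sas_hpa, revolution_r)
for name, value in ratios.items():
    print(f"{name}: SAS HPA figure is {value:.1f}x the Revolution R figure")
```

The cores and RAM ratios both come out to 19.2 (roughly "a 20th"), the node ratio to 6.4 (roughly "a 6th"), and the elapsed-time ratio to about 1.8.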

[Callouts beside the table: Double; 45%; 1/6th; 5%; 5%]

Revolution R Enterprise Delivers Performance at 2% of the Cost

Page 34: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Revolution R Enterprise ScaleR: High Performance Big Data Analytics

Data Prep, Distillation & Descriptive Analytics

R Data Step

• Data import: Delimited, Fixed, SAS, SPSS, ODBC

• Variable creation & transformation

• Recode variables

• Factor variables

• Missing value handling

• Sort

• Merge

• Split

• Aggregate by category (means, sums)

Descriptive Statistics

• Min / Max

• Mean

• Median (approx.)

• Quantiles (approx.)

• Standard Deviation

• Variance

• Correlation

• Covariance

• Sum of Squares (cross product matrix for set variables)

• Pairwise Cross tabs

• Risk Ratio & Odds Ratio

• Cross-Tabulation of Data (standard tables & long form)

• Marginal Summaries of Cross Tabulations

Statistical Tests

• Chi Square Test

• Kendall Rank Correlation

• Fisher’s Exact Test

• Student’s t-Test

Sampling

• Subsample (observations & variables)

• Random Sampling

Page 35: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Revolution R Enterprise ScaleR: High Performance Big Data Analytics

Statistical Modeling

Predictive Models

• Sum of Squares (cross product matrix for set variables)

• Multiple Linear Regression

• Generalized Linear Models (GLM): all exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions including cauchit, identity, log, logit, probit; user-defined distributions & link functions

• Covariance & Correlation Matrices

• Logistic Regression

• Classification & Regression Trees

• Predictions/scoring for models

• Residuals for all models

Variable Selection

• Stepwise Regression (for linear reg)

Data Visualization

• Histogram

• Line Plot

• Scatter Plot

• Lorenz Curve

• ROC Curves (actual data and predicted values)

Machine Learning

Cluster Analysis

• K-Means

Classification

• Decision Trees

Simulation

• Monte Carlo

Page 36: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Unparalleled Big Data Big Analytics Scale, Performance & Innovation

1 + 1 = 1000’s

[Diagram: Performance Enhanced R Language + Open Source R Analytic Packages + Big Data Distributed & Parallel Processing & Analytic Package = Revolution R Enterprise. Axis labels: Performance; Value]

Page 37: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Leveraging CRAN with DistributedR & ScaleR

Big Data Distillation

Allows an R programmer to use RRE ScaleR to reduce dimensionality first and then feed the reduced data set into open source packages. The computationally intensive portion is sped up with RRE ScaleR techniques, and any of the many open source packages can still be used.

Big Data Threading

Allows an R programmer to use RRE ScaleR to execute algorithms designed for SMP environments in parallel using DistributedR (e.g., Monte Carlo simulation).

Supercharge Open Source Package with RRE

Allows an R programmer to re-engineer a CRAN routine by replacing an open source function inside an R-based algorithm with the equivalent ScaleR function(s).

High Performance Custom Algorithm

Allows an R programmer to use the RRE high-throughput extreme data format (XDF) to apply any combination of open source functions and logic while chunking through an XDF file, overcoming the memory limitations of open source R.
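The "Big Data Threading" pattern, running independent Monte Carlo replications in parallel and combining their results, can be sketched generically. This Python example is only an illustration of the idea and does not use DistributedR or ScaleR; the function names, seeds, and task sizes are assumptions. It uses a thread pool so that it runs anywhere; in CPython a process pool would be needed for real CPU parallelism, which is why the executor class is a parameter.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(seed, n_samples):
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # independent generator per task
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_tasks=8, samples_per_task=20_000, base_seed=42,
                executor_cls=ThreadPoolExecutor):
    """Run n_tasks independent simulations in parallel and combine the counts."""
    seeds = [base_seed + i for i in range(n_tasks)]
    sizes = [samples_per_task] * n_tasks
    with executor_cls(max_workers=n_tasks) as pool:
        hits = list(pool.map(count_hits, seeds, sizes))
    return 4.0 * sum(hits) / (n_tasks * samples_per_task)
```

Each task only returns a count, so the combine step is a simple sum; the same shape applies to any simulation whose replications are independent.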


Page 38: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

WODA: Write Once – Deploy Anywhere

Page 39: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Big Analytics on Big Data in Hadoop

100% R on Hadoop

• Full Skill Transfer - No Java needed.

• Use 4500+ CRAN Packages

• Blend / Combine R & Other Tools / Methods

100% Portability

• Build Once – Deploy Many

• Track Evolution of Hadoop

• Protect Against Platform Uncertainty

• Avoid Platform Lock-ins

Hadoop Performance & Scale

• Leverage Hadoop Parallelism Easily

• Analyze Data Without Moving It

[Diagram labels: Data; Analytics; Applications; Hadoop + Scalable Compute; HDFS; HBase; Hive; Parallel Storage; Big Data Scale; Portability; 100% R]

Page 40: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Revolution R Enterprise + Cloudera Propels Enterprises into the Future

[Diagram layers: Decision (Analytic Applications); Integration (Middleware); Data (Cloudera Data Management Platform); Analytics (Revolution R Enterprise Big Data Big Analytics Platform)]

Page 41: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Revolution R Enterprise Powers Write Once, Deploy Anywhere

[Diagram: numbered workflow steps for the Beside, Inside, and Hybrid architectures, connecting Revolution R Enterprise, Local Data Mart, and Cloudera through a Direct Connector. Legend: Model Build; Model Deploy; Model Recode / PMML; Update Data; Data Prep / Marshaling]

Bottom Line: Save Time, Save Money, Get Insights Faster

• Direct connectors access data without data movement

• Push-down execution analyzes data without moving it

• Use the same R script on any platform without recoding

• Use the right architecture for the job!

Page 42: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Revolution R Enterprise Inside Cloudera

[Diagram, three tiers]

Consumption: Business Analysts (Alteryx, Tableau, QlikView, Cognos, MicroStrategy, Datameer, etc.); Power Analysts (RStudio, DevelopR, etc.); Line of Business users (Analytic Apps, Rules Engines, etc.)

Cloudera: Revolution R Enterprise (R + CRAN, RevoR, DistributedR, ConnectR, ScaleR, DeployR) for Big Data Big Analytics: Data Transformation, Model Building & Scoring

Data Sources: Machine Data; New Data Sources; Data Suppliers; Traditional Sources; IBM Mainframe

Page 43: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

QuickStart Programs Deliver Value Quickly

• Offered by both Cloudera and Revolution Analytics

• Combine Software, Services and Training

• Cloudera can help you get started with Hadoop in a few ways

• Revolution Analytics helps you realize value from R + Hadoop

Page 44: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Summary

Revolution R Enterprise and Cloudera Hadoop bring best-of-breed technologies to deliver:

• Highly scalable, high-performance machine learning on data residing in Hadoop

• Analytics at scale that are accessible and easy for R users, through the familiar R programming environment

• Full-lifecycle analytics, from ad-hoc analysis to production analytics, in one managed environment that integrates disparate data sources in one repository

• A seamless operational experience for managing both products: the deep integration of Revolution R Enterprise with Cloudera will provide this

Page 45: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Thank You

Visit us @ Strata NYC Oct 28

Page 46: R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

Questions

Revolution Analytics: [email protected]

Cloudera: [email protected]