R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
revolution-analytics -
Revolution Confidential
Revolution Analytics & Cloudera Confidential
R + Hadoop Ask Bigger (and new) Questions
and Get Better, Faster Answers
Michele Chambers Chief Strategy Officer & VP Product Mgmt
Jai Ranganathan Director Product Mgmt & Strategy
Period of Disruption
1st Generation Predictive Analytics
Today’s Challenge:
Accelerating Business Cadence
Changing Business Environment
• Fact Based Decisions Require More Data
• Need to Understand Tradeoffs and Best Course of Action
• Predictive Models Need to Continually Deliver Lift
• Reduced Shelf Life for Predictive Models
Faster Time to Value
• Reduce Analytic Cycle Time
• Build & Deploy Models Faster
• Eliminate Time Consuming Data Movements
Rapid Customer Facing Decisions
• Score More Frequently
• Need to Make Best Decision in Real Time
2nd Generation Modern Analytics
• Big Data
• Machine Learning
• Quick to Fail
• Lift
Typical Technology Challenges
Our Customers Face
Big Data
• New Data Sources
• Data Variety & Velocity
• Fine Grain Control
• Data Movement, Memory Limits
Complex Computation
• Experimentation
• Many Small Models
• Ensemble Models
• Simulation
Enterprise Readiness
• Heterogeneous Landscape
• Write Once, Deploy Anywhere
• Skill Shortage
• Production Support
Production Efficiency
• Shorter Model Shelf Life
• Volume of Models
• Long End-to-End Cycle Time
• Pace of Decision Accelerated
Big Data Big Analytics is different
[Figure: a single model, y = ax + b]
[Figure: many models, each of the form y = ax + b]
[Figure: new model compared with existing model]
[Figure: chart with axes Accuracy (60% to 100%) and False Positives (0% to 30%), comparing "Add unstructured data" with "Existing model"]
Big Data Big Analytics Use Cases
One Big Model
• Build predictive models with (very) large datasets
• More rows/observations and/or more columns/features
• Tend to use dimension reduction, machine learning and/or ensemble techniques
Big Data Scoring
• Score and predict with (very) large datasets using a previously built model
• Score in batch or on individual transactions
• The previously built model may be exported from the model build environment to the model deployment environment
Many Small Models
• Model factories build predictive models in quantity
• Automated building of individualized models and/or parallel individualized model execution
Scoring Many Models
• Score and predict with many individualized models
• Production model factories require model management
Computationally Intensive Analytics
• Analytic models that are mathematically intense
• May not use large data sets but generate a lot of interim calculations
• May include vectorization, simulation, optimization
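The "Many Small Models" and "Scoring Many Models" use cases can be sketched in a few lines. The following is a minimal illustration in plain Python, not Revolution R Enterprise code: it fits one y = ax + b line per segment with ordinary least squares, then scores each observation with its own segment's model. The segment names, the sample data and the helper functions (`fit_line`, `fit_many_small_models`, `score`) are hypothetical.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares fit of y = a*x + b to a list of (x, y) pairs.

    Assumes at least two distinct x values, otherwise the slope is undefined.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def fit_many_small_models(rows):
    """Group (segment, x, y) rows by segment and fit one small model each."""
    by_segment = defaultdict(list)
    for segment, x, y in rows:
        by_segment[segment].append((x, y))
    return {segment: fit_line(pts) for segment, pts in by_segment.items()}


def score(models, segment, x):
    """Score one observation with the model that belongs to its segment."""
    a, b = models[segment]
    return a * x + b


# Hypothetical data: two segments with different linear relationships.
rows = [
    ("store_a", 1.0, 3.0), ("store_a", 2.0, 5.0), ("store_a", 3.0, 7.0),
    ("store_b", 1.0, 10.0), ("store_b", 2.0, 9.0), ("store_b", 3.0, 8.0),
]
models = fit_many_small_models(rows)
prediction = score(models, "store_a", 4.0)
```

Because each segment's fit is independent of the others, the per-segment loop is the part a model factory would run in parallel.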
Big Data Big Analytics
Specialized Use Cases
Time Series Analytics
• Build forecasts with time-sequenced data
• For Big Data, these tend to be many small models, especially for machine data
• Typical Big Data volumes require model management
Text and Document Analytics
• Use of unstructured, free text
• For Big Data, typically used to enhance structured predictive analytics
• Minimally requires text processing tools and may also require natural language processing
Mining Data Streams
• Analyzing continuous, high-speed data flows for patterns and acting upon the patterns in real time
• Requires specialized sampling and filtering techniques
• Uses distinct discovery analytics methods such as frequent itemsets or clustering
Zero Latency
• No separation of model building and model scoring
• As real-time data becomes more widely available, this emerging category reduces time-to-insight with little or no separation between model building and scoring
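The slide does not name the "specialized sampling" techniques it has in mind for data streams. One widely used option is reservoir sampling, which keeps a uniform random sample of fixed size from a stream whose length is not known in advance. The sketch below is an assumed illustration in plain Python; the function name `reservoir_sample` and its parameters are hypothetical and are not part of any Revolution Analytics or Cloudera API.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable stream.

    Only k items are held in memory at any time, so the stream can be
    arbitrarily long. If the stream has fewer than k items, all are returned.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical usage: sample 5 readings from a stream of 1,000 sensor values.
sample = reservoir_sample(range(1000), 5, rng=random.Random(0))
short = reservoir_sample(range(3), 5, rng=random.Random(0))
```

The sample can then feed the discovery methods the slide mentions, such as clustering, without storing the full stream.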
Analytic Reference Architecture
[Figure: layered reference architecture. Decision: Analytic Applications. Integration: Middleware. Analytics: Analytics Development Tools & Platforms. Data: Hadoop, Data Warehouse, Other Data Sources.]
Architectural Approaches to Analytics
[Figure: Beside Architecture. Decision: Analytic Applications. Integration: Middleware. Analytics: Analytics Development Tools & Platforms with a Local Data Mart. Data: Data Sources.]
[Figure: Inside Architecture. Decision: Analytic Applications. Integration: Middleware. Data + Analytics: Analytics Development Tools & Platforms together with the Data Sources.]
Pros & Cons of Architectural Approaches
Beside Architecture
• Analytic workflow tasks performed in a separate analytics environment outside of the source database
• Pros: Segregates analytic workload
• Cons: Doesn’t leverage powerful production for transformations, introduces scoring latencies,
Inside Architecture
• Analytics workflow tasks performed inside the source database with embedded analytics
• Pros: Eliminates data movement, reduces model latency, allows exploration of all data
• Cons: IT governance on production, potential new skills
Hybrid Architecture
• Some analytic workflow tasks performed inside the source database & others performed in a separate analytics environment
• Pros: Leverages strengths of each architecture
• Cons: Maintain multiple environments
Building & Deploying Analytic Models
[Figure: numbered workflow steps for the Beside, Inside and Hybrid architectures, showing Analytics Development Tools & Platforms, Local Data Mart and Data Sources. Legend: Model Build, Model Deploy, Model Recode / PMML, Update Data, Data Prep / Marshaling.]
Our platform vision
Lower cost per TB
Avoid data copying
Minimize big data movement
Simplify the IT and user experience
Organizations bring their applications to Hadoop data
©2013 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or
redistribution without written permission is prohibited.
Traditional workloads in Hadoop
WORKLOADS IN HADOOP
Search
Analytics
Self-service BI
Data Processing (ELT)
In Cloudera
• 2-10X the performance
• 1/10th the cost
In Cloudera
• Integrated R support for deep analytics
• Takes advantage of entire cluster for high performance
• More granular datasets with more model features
In Cloudera
• Data exploration on the full fidelity data
• Faster lifecycle from source data to mini-mart
• 1/10th the cost OLAP reporting
Enterprise-Grade Solutions for Big Data: Key Characteristics
Cloudera Manager & R integration
Seamless cluster administration for Revolution R Enterprise
1. Deploy: Deploy Revolution R Enterprise quickly and easily onto your CDH cluster
2. Configure & Optimize: Ensure optimal settings are configured for performance of Revolution R Enterprise
3. Monitor, Diagnose & Report: Identify resource controls, monitor performance, debug and diagnose issues through a single consolidated interface
What is the R Language?
A Platform…
• A Procedural Language for Stats, Math and Data Science
• A Complete Data Visualization Framework
• Provided as Open Source
A Community…
• 2M+ Users with the Skill to Tackle Big Data Statistical and Numerical Analysis and Machine Learning Projects
• Active User Groups Across the World
An Ecosystem
• CRAN: 4500+ Freely Available Algorithms, Test Data and Evaluations
Revolution R Enterprise
Revolution R Enterprise is the only enterprise big data big analytics platform based on the open source R statistical computing language
• Portable Across Enterprise Platforms
• High Performance, Scalable Analytics
• Easier to Build & Deploy
R is open source and drives analytic innovation but has some limitations for enterprises
Dimension | Open Source R | Revolution R Enterprise
Big Data | In memory bound | Disk based scalability
Speed of Analysis | Single threaded | Parallel threading
Enterprise Readiness | Community support | Commercial support
Analytic Breadth & Depth | 4500+ innovative analytic packages | Leverage open source packages plus Big Data ready packages
Commercial Viability | Risk of deployment of open source | Commercial License
Introducing Revolution R Enterprise: The Big Data Big Analytics Platform
[Figure: Revolution R Enterprise component stack. Layers: Language Interpreter and Standard R Algorithm Suites; Big Data Distributed Execution Platform; Development & Deployment Tooling. Components: R + CRAN, RevoR, DistributedR, ConnectR, ScaleR, DevelopR, DeployR.]
Big Data Speed @ Scale with Revolution R Enterprise
Fast Math Libraries
Parallelized Algorithms
In-Database Execution
Multi-Threaded Execution
Multi-Core Processing
In-Hadoop Execution
Memory Management
Parallelized User Code
First, we enhance and accelerate the Open Source R interpreter.
Open Source R performance: Multi-threaded Math
[Figure: Open Source R compared with Revolution R Enterprise]
Computation (4-core laptop) | Open Source R | Revolution R | Speedup
Linear Algebra (1)
Matrix Multiply | 176 sec | 9.3 sec | 18x
Cholesky Factorization | 25.5 sec | 1.3 sec | 19x
Linear Discriminant Analysis | 189 sec | 74 sec | 3x
General R Benchmarks (2)
R Benchmarks (Matrix Functions) | 22 sec | 3.5 sec | 5x
R Benchmarks (Program Control) | 5.6 sec | 5.4 sec | Not appreciable
(1) http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
(2) http://r.research.att.com/benchmarks/
Customers report 3-50x performance improvements compared to Open Source R, without changing any code
Second, we built a platform for hosting R with Big Data on a variety of massively parallel platforms.
Revolution R Enterprise DistributedR
Innovative Memory Management, Multi-Threaded Execution, Multi-Core Processing
• A Revolution R Enterprise ScaleR analytic is given a data source as input
• The analytic loops over the data, reading one block at a time
• Blocks of data are read by a separate worker thread (Thread 0)
• Worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory
• When all of the data has been processed, a master results object is created from the intermediate results objects
[Figure: combine intermediate results]
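The block-at-a-time pattern described on this slide can be illustrated with a small, assumed sketch in plain Python. It is not the DistributedR implementation: one reader thread hands out blocks, worker threads fold each block into their own intermediate results (count, sum, sum of squares), and the intermediate results are combined into a master result at the end. All names (`Partial`, `read_blocks`, `process_blocks`, `chunked_mean_var`) are hypothetical, and the statistic (mean and population variance) was chosen only because it combines exactly from per-block sums.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Partial:
    """Intermediate results held by one worker."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def update(self, block):
        self.n += len(block)
        self.total += sum(block)
        self.total_sq += sum(x * x for x in block)


def read_blocks(data, block_size, out_q, n_workers):
    """Reader thread ("Thread 0"): put one block at a time on the queue."""
    for start in range(0, len(data), block_size):
        out_q.put(data[start:start + block_size])
    for _ in range(n_workers):
        out_q.put(None)  # one stop signal per worker


def process_blocks(in_q, partial):
    """Worker thread: fold each block into this worker's intermediate result."""
    while True:
        block = in_q.get()
        if block is None:
            break
        partial.update(block)


def chunked_mean_var(data, block_size=4, n_workers=2):
    """Compute count, mean and population variance of non-empty data by blocks."""
    q = queue.Queue(maxsize=2 * n_workers)  # bounded, so reading stays just ahead
    partials = [Partial() for _ in range(n_workers)]
    threads = [threading.Thread(target=read_blocks,
                                args=(data, block_size, q, n_workers))]
    threads += [threading.Thread(target=process_blocks, args=(q, p))
                for p in partials]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Combine intermediate results into the master result.
    n = sum(p.n for p in partials)
    total = sum(p.total for p in partials)
    total_sq = sum(p.total_sq for p in partials)
    mean = total / n
    return n, mean, total_sq / n - mean * mean


n, mean, variance = chunked_mean_var([float(i) for i in range(1, 11)])
```

Only the small `Partial` objects stay in memory between blocks, which is why this style of computation is not bound by the size of the full data set.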
Revolution R Enterprise ScaleR: Performance and Capacity
SAS HPA Benchmarking comparison*: Logistic Regression
Measure | SAS HPA | Revolution R Enterprise
Rows of data | 1 billion | 1 billion
Parameters | “just a few” | 7
Time | 80 seconds | 44 seconds
Data location | In memory | On disk
Nodes | 32 | 5
Cores | 384 | 20
RAM | 1,536 GB | 80 GB
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPC Wire, April 21, 2011
[Figure callouts: Double, 45%, 1/6th, 5%, 5%]
Revolution R Enterprise Delivers Performance at 2% of the Cost
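The resource ratios quoted in this comparison follow directly from the figures in the table. A short Python check is shown below as a worked example; the dictionary names are hypothetical, and the "2% of the Cost" claim is not covered because the slide gives no cost inputs.

```python
# Figures copied from the comparison table above.
sas_hpa = {"seconds": 80, "nodes": 32, "cores": 384, "ram_gb": 1536}
revolution = {"seconds": 44, "nodes": 5, "cores": 20, "ram_gb": 80}

# Revolution R Enterprise figure as a fraction of the SAS HPA figure.
ratios = {metric: revolution[metric] / sas_hpa[metric] for metric in sas_hpa}

for metric, value in ratios.items():
    print(f"{metric}: {value:.1%} of the SAS HPA figure")
```

Elapsed time comes out at 55% of the SAS HPA figure (45% less), nodes at about 16% (roughly a 6th), and cores and RAM at about 5.2% each (roughly a 20th), which matches the slide's wording.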
Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Data Prep, Distillation & Descriptive Analytics
R Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort
• Merge
• Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max
• Mean
• Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Statistical Modeling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) - All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Variable Selection
• Stepwise Regression (for linear reg)
Simulation
• Monte Carlo
Data Visualization
• Histogram
• Line Plot
• Scatter Plot
• Lorenz Curve
• ROC Curves (actual data and predicted values)
Machine Learning
Cluster Analysis
• K-Means
Classification
• Decision Trees
Unparalleled Big Data Big Analytics Scale, Performance & Innovation
1 + 1 = 1000’s
[Figure: Performance Enhanced R (R Language, Open Source R Analytic Packages) + Big Data Distributed & Parallel Processing & Analytic Package = Revolution R Enterprise. Axes: Value and Performance.]
Leveraging CRAN with DistributedR & ScaleR
Big Data Distillation
Lets an R programmer use RRE ScaleR to reduce dimensionality first and then feed the reduced data set into open source packages. The computationally intensive portion is sped up with RRE ScaleR techniques, and any of the many open source packages can still be used.
Big Data Threading
Lets an R programmer use RRE ScaleR to execute algorithms designed for SMP environments in parallel using DistributedR (e.g., Monte Carlo simulation).
Supercharge Open Source package with RRE
Lets an R programmer re-engineer a CRAN routine by replacing an Open Source function inside an R-based algorithm with the equivalent ScaleR function(s).
High Performance Custom Algorithm
Lets an R programmer use the RRE high-throughput extreme data format (XDF) to apply any combination of Open Source functions and logic while chunking through an XDF file, overcoming the memory limitations of Open Source R.
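The "High Performance Custom Algorithm" idea, applying arbitrary user logic while chunking through a file, can be illustrated with a small stand-in. The sketch below is plain Python over CSV text rather than R over an XDF file, because the XDF format and its reader are specific to Revolution R Enterprise; the helper names (`iter_chunks`, `chunked_apply`, `count_by_category`, `merge_counts`) and the sample data are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_chunks(text_file, chunk_rows):
    """Yield lists of at most chunk_rows rows, so only one chunk is in memory."""
    reader = csv.DictReader(text_file)
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            return
        yield chunk


def chunked_apply(text_file, chunk_rows, chunk_fn, combine_fn, initial):
    """Apply user logic to each chunk and fold the per-chunk results together."""
    state = initial
    for chunk in iter_chunks(text_file, chunk_rows):
        state = combine_fn(state, chunk_fn(chunk))
    return state


def count_by_category(chunk):
    """Example user logic: count rows per category within one chunk."""
    counts = {}
    for row in chunk:
        counts[row["category"]] = counts.get(row["category"], 0) + 1
    return counts


def merge_counts(a, b):
    """Combine two per-chunk count tables into one."""
    merged = dict(a)
    for key, value in b.items():
        merged[key] = merged.get(key, 0) + value
    return merged


# Hypothetical data: five rows processed two at a time.
sample = io.StringIO("category,value\na,1\nb,2\na,3\nc,4\na,5\n")
totals = chunked_apply(sample, 2, count_by_category, merge_counts, {})
```

The per-chunk function can be any existing in-memory routine, as long as its results can be merged across chunks.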
WODA: Write Once – Deploy Anywhere
Big Analytics on Big Data in Hadoop
100% R on Hadoop
• Full Skill Transfer - No Java needed.
• Use 4500+ CRAN Packages
• Blend / Combine R & Other Tools / Methods
100% Portability
• Build Once – Deploy Many
• Track Evolution of Hadoop
• Protect Against Platform Uncertainty
• Avoid Platform Lock-ins
Hadoop Performance & Scale
• Leverage Hadoop Parallelism Easily
• Analyze Data Without Moving It
[Figure: Hadoop stack with Data, Analytics and Applications layers. Parallel Storage: HDFS, HBase, Hive. Scalable Compute. Callouts: 100% R, Portability, Big Data Scale.]
Revolution R Enterprise + Cloudera Propels Enterprises into the Future
[Figure: layered architecture. Decision: Analytic Applications. Integration: Middleware. Analytics: Revolution R Enterprise Big Data Big Analytics Platform. Data: Cloudera Data Management Platform.]
Revolution R Enterprise Powers Write Once, Deploy Anywhere
[Figure: numbered workflow steps for the Beside, Inside and Hybrid architectures, with Revolution R Enterprise as the analytics layer, Cloudera as the data source, and a Direct Connector between them. Legend: Model Build, Model Deploy, Model Recode / PMML, Update Data, Data Prep / Marshaling.]
Bottom Line: Save Time, Save Money, Get Insights Faster
• Direct connectors access data without data movement
• Push-down analysis processes data without moving it
• Use the same R script on any platform without recoding
• Use the right architecture for the job!
Revolution R Enterprise Inside Cloudera
Consumption
• Business Analysts (Alteryx, Tableau, QlikView, Cognos, Microstrategy, Datameer etc.)
• Power Analysts (R Studio, DevelopR, etc.)
• Line of Business users (Analytic Apps, Rules Engines, etc.)
Cloudera
• Revolution R Enterprise (R + CRAN, RevoR, DistributedR, ConnectR, ScaleR, DeployR)
• Big Data Big Analytics: Data Transformation, Model Building & Scoring
Data Sources
• Machine Data
• New Data Sources
• Data Suppliers
• Traditional Sources
• IBM Mainframe
QuickStart Programs Deliver Value Quickly
• Offered by both Cloudera and Revolution Analytics
• Combine Software, Services and Training
• Cloudera can help you get started with Hadoop in a few ways
• Revolution Analytics helps you realize value from R + Hadoop
Summary
Revolution R Enterprise and Cloudera Hadoop bring best-of-breed technologies to deliver:
• Highly scalable and high performance machine learning on data residing in Hadoop
• Using the familiar R programming environment makes analytics at scale accessible and easy for R users
• With the ability to integrate disparate data sources in one repository, full lifecycle analytics from ad-hoc analysis to production analytics are available in one managed environment
• The deep integration of Revolution R Enterprise with Cloudera will provide a seamless operational experience for managing both products
Thank You
Visit us @ Strata NYC Oct 28
Questions
Revolution Analytics: [email protected]
Cloudera: [email protected]