Big data analytics on teradata with revolution r enterprise bill jacobs

1877 Big Data Analytics on Teradata: An Introduction to Revolution R Enterprise Bill Jacobs Dir., Product Marketing, Revolution Analytics

description

Revolution Analytics brings big data analytics to Teradata database. Presentation from Teradata Partners, October 2013 overviewing Revolution R Enterprise for Teradata by Bill Jacobs, Director, Product Marketing, Revolution Analytics.

Transcript of Big data analytics on teradata with revolution r enterprise bill jacobs

Page 1: Big data analytics on teradata with revolution r enterprise   bill jacobs

1877 Big Data Analytics on Teradata: An Introduction to Revolution R Enterprise

Bill Jacobs

Dir., Product Marketing, Revolution Analytics

Page 2: Big data analytics on teradata with revolution r enterprise   bill jacobs

Demystifying R

What is R

Why is it so

popular?

Is it only open

source?

Page 3: Big data analytics on teradata with revolution r enterprise   bill jacobs

3

Page 4: Big data analytics on teradata with revolution r enterprise   bill jacobs

Confidential to Revolution Analytics and shared with Siemens under the NDA dated 27/9/2013

4

THE PERFECT STORM

+ Computing Power + Data + Pace of Business + Customer Expectations

Our view: Big Data meets Big Math = New Business Outcomes

+ Data Science + Computer Science + Management Science

Better Business Decisions

New Business Outcomes

Page 5: Big data analytics on teradata with revolution r enterprise   bill jacobs

Confidential to Revolution Analytics

Big Analytics Delivers Value from Big Data

5

The three Vs of Big Data: Volume, Variety, Velocity

The three Vs of Big Data Big Analytics: maximizing Value, accommodating data Volatility, while assuring Veracity of insights

Page 6: Big data analytics on teradata with revolution r enterprise   bill jacobs

6

WELCOME & INTRODUCTIONS

R Open Source- Language, Community, Collaboration

- Robert Gentleman & Ross Ihaka, 1993

- Version 1.0 released 2000

- 2.5 Million Global Users

- Over 4,800 add-on “Packages”

- Why R?

R in Universities = New Talent

Emerging Modeling/Visualization

Lower Cost Alternative

Open Source = Flexible & Innovative

Access to Free Packages


Page 7: Big data analytics on teradata with revolution r enterprise   bill jacobs

R is Exploding in Popularity & Functionality

Source: http://r4stats.com/popularity

[Chart] Web Site Popularity: number of links to main web site. Series: R, SAS, SPSS, S-Plus, Stata. Bar labels: 4,000; 2,000; 1,050; 900; 600.

[Chart] Scholarly Activity: Google Scholar hits (’05-’09 CAGR). R 46%, SAS -11%, SPSS -27%, S-Plus 0%, Stata 10%.

[Chart] Internet Discussion: mean monthly traffic on email discussion list. Series: R, SAS, Stata, SPSS, S-Plus.

[Chart] Package Growth: number of R packages listed on CRAN.

Page 8: Big data analytics on teradata with revolution r enterprise   bill jacobs

R is exploding in popularity & functionality

“A key benefit of R is that it provides near-instant availability of new and experimental methods created by its user base – without waiting for the development/release cycle of commercial software. SAS recognizes the value of R to our customer base…”

Product Marketing Manager, SAS Institute, Inc.

“I’ve been astonished by the rate at which R has been adopted. Four years ago, everyone in my economics department [at the University of Chicago] was using Stata; now, as far as I can tell, R is the standard tool, and students learn it first.”

Deputy Editor for New Products at Forbes

R Usage Growth: Rexer Data Miner Survey, 2007-2013

70% of data miners report using R

24% use R as primary tool

Source: www.rexeranalytics.com

Page 9: Big data analytics on teradata with revolution r enterprise   bill jacobs

R Is the Most Commonly Used Primary Analytics Tool

70% of data miners report using R

24% use R as primary tool

Source: www.rexeranalytics.com


Page 10: Big data analytics on teradata with revolution r enterprise   bill jacobs

10

Example of advanced visualization with R

Facebook Network Graphic

Page 11: Big data analytics on teradata with revolution r enterprise   bill jacobs

11

R Community, collaboration and breadth: CRAN task views (subset of 4,800+ packages)

Source: http://www.maths.lancs.ac.uk/~rowlings/R/TaskViews/


Page 12: Big data analytics on teradata with revolution r enterprise   bill jacobs

12

Key Big Data Challenge: The Analytics Talent Pool

Page 13: Big data analytics on teradata with revolution r enterprise   bill jacobs

13

2 Million R Users

The Analytics Talent Pool With R

Page 14: Big data analytics on teradata with revolution r enterprise   bill jacobs

14

R is open source and drives analytic innovation but… has some limitations for enterprises

Big Data: In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors

Speed of Analysis: Single threaded | Parallel threading | Shrinks analysis time

Enterprise Readiness: Community support | Commercial support | Delivers full service production support

Analytic Breadth & Depth: 5000+ innovative analytic packages | Leverage open source packages plus Big Data ready packages | Supercharges R

Commercial Viability: Risk of deployment of open source | Commercial license | Eliminate risk with open source

Page 15: Big data analytics on teradata with revolution r enterprise   bill jacobs

Our History & Our Future

2007 2013 2015 2017

Company Founding

Chapter 1: Capture Mindshare (Revolution R Enterprise V1 through V6.1)

Chapter 2: Mobilize with Market Focus (Revolution R Enterprise V6.2 through V9)

Chapter 3: Scalable Growth (Revolution R Enterprise V10 through V11)

250 Customers, 500 Customers, 1000 Customers

Relocate HQ to Palo Alto

NA Offices: NYC, Dallas

Company Confidential – Do not distribute

Page 16: Big data analytics on teradata with revolution r enterprise   bill jacobs

200+ Customer Stories

Digital Media & Retail

Finance & Insurance

Healthcare & Life Sciences

Manufacturing & High Tech

Academic & Gov’t

Revolution Confidential

Page 17: Big data analytics on teradata with revolution r enterprise   bill jacobs

Revolution Analytics - Overview

17

We are the only provider of a commercial analytics platform based on the open source R statistical computing language.

Power

Productivity

Enterprise Readiness

Stable, scalable multi-platform with world-wide support

Easier to build and deploy analytic applications

Professional services enablement

Distributed, high performance analytical algorithms

World Wide Support Teams
• Standard and Premium Programs
• Technical Account Managers
• Customer Success Managers

Professional Services
• Architecture planning
• Systems Integration
• Advanced analytic applications
• Full life cycle projects

Page 18: Big data analytics on teradata with revolution r enterprise   bill jacobs

Customers Revolutionize their Business

Power

“…we saw about a 4x performance improvement on 50 million records. It works brilliantly.”   - CEO, John Wallace, DataSong

4X performance 50M records scored daily

Scalability

“We’ve been able to scale our solution to a problem that’s so big that most companies could not address it…..” - SVP Analytics, Kevin Lyons, eXelate

TBs of data from 200+ data sources; 10s of thousands of attributes; 100s of millions of scores daily

2X data, 2X attributes, no impact on performance

Performance

“We need a high-performance analytics … we can now identify opportunities for our clients that would otherwise be lost.” - Chief Analytics Officer, Leon Zemel, [x+1]

Page 19: Big data analytics on teradata with revolution r enterprise   bill jacobs

Revolution R Enterprise

What is Revolution R Enterprise?

How does Revolution R Enterprise work with Teradata Database?

Page 20: Big data analytics on teradata with revolution r enterprise   bill jacobs

Revolution R Enterprise

High Performance, Scalable Analytics Portable Across Enterprise Platforms Easier to Build & Deploy Analytics

is… the only big data big analytics platform based on open source R, the de facto statistical computing language for modern analytics

21

Page 21: Big data analytics on teradata with revolution r enterprise   bill jacobs

How is RRE Used?

Discovering Patterns with Big Data: Customer segmentation, Market basket analysis, Social networking analysis, Fraud detection, Marketing attribution, Sentiment analysis …and much more

Building Models Efficiently: Credit risk, Customer churn, Propensity to buy, Market risk, Operational risk …and much more

Flexibly Deploying Models to Consumers: Customer lifetime value, Pricing optimization, Recommendation engines …and much more

22

Page 22: Big data analytics on teradata with revolution r enterprise   bill jacobs

Introducing Revolution R Enterprise (RRE): The Big Data Big Analytics Platform

[Diagram: platform components R+CRAN, RevoR, DistributedR, ScaleR, ConnectR, DevelopR, DeployR]

Big Data Big Analytics Ready

– Enterprise readiness

– High performance analytics

– Multi-platform architecture

– Data source integration

– Development tools

– Deployment tools

23

Page 23: Big data analytics on teradata with revolution r enterprise   bill jacobs

The Platform Step by Step: R Capabilities

[Diagram: platform components R+CRAN, RevoR, DistributedR, ScaleR, ConnectR, DevelopR, DeployR]

R+CRAN
• Open source R interpreter
• UPDATED R 3.0.2
• Freely-available R algorithms
• Algorithms callable by RevoR
• Embeddable in R scripts
• 100% Compatible with existing R scripts, functions and packages

RevoR
• Performance enhanced R interpreter
• Based on open source R
• Adds high-performance math

Available On:
• Platform™ LSF™ Linux®
• Microsoft® HPC Clusters
• Microsoft Azure Burst
• Windows® & Linux Servers
• Windows & Linux Workstations
• Teradata® Database
• IBM® Netezza®
• IBM BigInsights™
• Cloudera Hadoop®
• Hortonworks Hadoop
• Intel® Hadoop

Page 24: Big data analytics on teradata with revolution r enterprise   bill jacobs

25

Big Data Speed @ Scale with Revolution R Enterprise (RRE)

Fast Math Libraries

Parallelized Algorithms

In-Database Execution

Multi-Threaded Execution

Multi-Core Processing

In-Hadoop Execution

Memory Management

Parallelized User Code

First, we enhance and accelerate the Open Source R interpreter.

Page 25: Big data analytics on teradata with revolution r enterprise   bill jacobs

26

Open Source R Performance: Multi-threaded Math

Computation (4-core laptop) | Open Source R | Revolution R | Speedup
Linear Algebra (1)
Matrix Multiply | 176 sec | 9.3 sec | 18x
Cholesky Factorization | 25.5 sec | 1.3 sec | 19x
Linear Discriminant Analysis | 189 sec | 74 sec | 3x
General R Benchmarks (2)
R Benchmarks (Matrix Functions) | 22 sec | 3.5 sec | 5x
R Benchmarks (Program Control) | 5.6 sec | 5.4 sec | Not appreciable

1. http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
2. http://r.research.att.com/benchmarks/

Customers report 5-50x performance improvements compared to Open Source R, without changing any code

Page 26: Big data analytics on teradata with revolution r enterprise   bill jacobs

The Platform Step by Step: Parallelization & Data Sourcing

[Diagram: platform components R+CRAN, RevoR, DistributedR, ScaleR, ConnectR, DevelopR, DeployR]

ConnectR
• High-speed & direct connectors

Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database & Aster
• EDWs and ADWs
• ODBC

ScaleR
• Ready-to-Use high-performance big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Correlation & covariance matrices
• Predictive Models – linear, logistic, GLM
• Machine learning
• Monte Carlo simulation
• NEW Tools for distributing customized algorithms across nodes

DistributedR
• Distributed computing framework
• Delivers portability across platforms

Available on:
• Windows Servers
• Red Hat and NEW SuSE Linux Servers
• IBM Platform LSF Linux
• Microsoft HPC Clusters
• Microsoft Azure Burst
• NEW Teradata Database
• NEW Cloudera Hadoop
• NEW Hortonworks Hadoop

Page 27: Big data analytics on teradata with revolution r enterprise   bill jacobs

28

Big Data Speed @ Scale with Revolution R Enterprise (RRE)

Fast Math Libraries

Parallelized Algorithms

In-Database Execution

Multi-Threaded Execution

Multi-Core Processing

In-Hadoop Execution

Memory Management

Parallelized User Code

Second, we built a platform for hosting R with Big Data on a variety of massively parallel platforms.

Page 28: Big data analytics on teradata with revolution r enterprise   bill jacobs

Revolution R Enterprise: Powering Next Generation Analytics

COMBINE INTERMEDIATE RESULTS

29

Page 29: Big data analytics on teradata with revolution r enterprise   bill jacobs

30

SAS HPA Speed Comparison*: Logistic Regression

Measure | SAS HPA | Revolution R | Revolution R vs. SAS HPA
Rows of data | 1 billion | 1 billion |
Parameters | “just a few” | 7 | Double
Time | 80 seconds | 44 seconds | 45%
Data location | In memory | On disk |
Nodes | 32 | 5 | 1/6th
Cores | 384 | 20 | 5%
RAM | 1,536 GB | 80 GB | 5%

Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.

*As published by SAS in HPC Wire, April 21, 2011

Revolution R Enterprise Delivers Performance at 2% of the Cost
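
The ratios quoted on this slide can be checked with simple arithmetic. The sketch below uses Python purely for illustration (the presentation contains no code). The dictionary names `sas` and `rre` are placeholders, and the only inputs are the time, node, core and RAM figures from the comparison. It does not attempt to reproduce the "2% of the cost" claim, which depends on pricing not given here.

```python
from fractions import Fraction

# Figures as quoted in the comparison (SAS HPA run vs. Revolution R run).
sas = {"time_s": 80, "nodes": 32, "cores": 384, "ram_gb": 1536}
rre = {"time_s": 44, "nodes": 5, "cores": 20, "ram_gb": 80}


def ratio(key):
    """Revolution R figure expressed as a fraction of the SAS HPA figure."""
    return Fraction(rre[key], sas[key])


ratios = {key: ratio(key) for key in sas}

# Share of elapsed time saved relative to the SAS HPA run.
time_saved = 1 - ratios["time_s"]

for key, value in ratios.items():
    print(f"{key:8s} {float(value):6.1%}")
print(f"time saved {float(time_saved):.0%}")
```

Under these inputs the core and RAM ratios are both 5/96 (about 5%, roughly a 20th), the node ratio is 5/32 (a little under a 6th), and the elapsed time is 45% lower.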

Page 30: Big data analytics on teradata with revolution r enterprise   bill jacobs

Analytics Layer: High Performance Big Data Analytics with ScaleR

R Data Step, Descriptive Statistics, Statistical Tests, Sampling, Predictive Modeling, Data Visualization, Machine Learning, Simulation

Page 31: Big data analytics on teradata with revolution r enterprise   bill jacobs


ScaleR: Fast Parallel External Memory Algorithms

Data Prep, Distillation & Descriptive Analytics

R Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort
• Merge
• Split
• Aggregate by category (means, sums)
• Use any of the functionality of the R language to transform and clean data row by row!

Descriptive Statistics
• Min / Max
• Mean
• Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• t-Test
• F-Test
• Plus 100’s of other tests available in R!

Sampling
• Subsample (observations & variables)
• Random Sampling
• High quality, fast, parallel random number generators
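
To illustrate what an external memory algorithm for descriptive statistics can look like, here is a minimal Python sketch that computes mean and variance one chunk at a time and then merges the per-chunk summaries. It is not ScaleR code. The `Moments` class, the helper names and the sample data are all hypothetical, and the merge step uses the well-known pairwise formula of Chan et al. as a stand-in for whatever ScaleR does internally.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    """Summary of one chunk: count, mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0


def summarize(chunk):
    """One pass over a single chunk (Welford's update)."""
    s = Moments()
    for x in chunk:
        s.n += 1
        delta = x - s.mean
        s.mean += delta / s.n
        s.m2 += delta * (x - s.mean)
    return s


def merge(a, b):
    """Combine two chunk summaries without revisiting the raw data."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def chunks(values, size):
    """Yield fixed-size slices, standing in for blocks read from disk."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
total = Moments()
for chunk in chunks(data, 3):
    total = merge(total, summarize(chunk))

sample_variance = total.m2 / (total.n - 1)
print(total.n, total.mean, sample_variance)
```

Because only the small `Moments` summaries are kept, memory use does not grow with the data size, and the chunks could equally be summarized in parallel before merging.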

Page 32: Big data analytics on teradata with revolution r enterprise   bill jacobs

33

ScaleR: Fast Parallel External Memory Algorithms

Statistical Modeling

Predictive Models
• Covariance, Correlation, Sums of Squares (cross product matrix for set variables) matrices
• Multiple Linear Regression
• Generalized Linear Models (GLM) - All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Logistic Regression
• Classification & Regression Trees
• Decision Forests
• Predictions/scoring for models
• Residuals for all models

Data Visualization
• Histogram
• Line Plot
• Lorenz Curve
• ROC Curves (actual data and predicted values)
• Plus numerous tools in R and ScaleR to generate big data visualizations

Cluster Analysis
• K-Means

Machine Learning

Classification
• Decision Trees
• Decision Forests

Simulation
• High quality, fast, parallel random number generators
• Use the rich functionality of R for simulations
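
For readers unfamiliar with the GLM link functions named on this slide (cauchit, identity, log, logit, probit), the following Python sketch writes out each link and its inverse using their standard textbook definitions. The `LINKS` table and the test value `mu = 0.3` are illustrative assumptions; this is not the ScaleR interface.

```python
import math
from statistics import NormalDist

# Each entry maps a link name to (g, g_inverse), where eta = g(mu).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "probit": (NormalDist().inv_cdf, NormalDist().cdf),
    "cauchit": (lambda mu: math.tan(math.pi * (mu - 0.5)),
                lambda eta: 0.5 + math.atan(eta) / math.pi),
}

mu = 0.3  # a mean in (0, 1), valid for every link above
for name, (link, inverse) in LINKS.items():
    eta = link(mu)
    print(f"{name:9s} eta={eta:+.4f} back={inverse(eta):.4f}")
```

Applying a link and then its inverse should return the original mean, which is a quick sanity check for any user-defined link function.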

Page 33: Big data analytics on teradata with revolution r enterprise   bill jacobs

The Power of Revolution R Enterprise: Performance & Scalability

R + CRAN

Fast Math Libraries

Memory Management

Multi-Threaded Execution

Grid Processing

Parallelized Algorithms

Parallelized User Code

In-Database Execution

Open Source Leverage latest innovation

In-Hadoop Execution

Value

RevoR 3-50X faster

DistributedR Effective memory utilization

DistributedR Powerful divide & conquer

DistributedR Maximizes computation

ScaleR Labor saving power

ScaleR Leverage CRAN

ScaleR Moves computation to data

ScaleR Moves computation to data

34

Page 34: Big data analytics on teradata with revolution r enterprise   bill jacobs

35

Why Teradata And Revolution R Enterprise?

Teradata User Demand

Data Movement Penalty Growing

New Analytics Requiring MPP Approach

R Popularity

Open Source Limitations

Arrival of Teradata v14.10

Page 35: Big data analytics on teradata with revolution r enterprise   bill jacobs

36

Revolution Analytics coupled with the Teradata Unified Data Architecture accelerates big data analytics using the widely-accepted R language.

Available Today: High-Speed TPT Connector + Revolution R Enterprise 6.2 (Teradata Version 14.0)
• Scalable R analytics on servers connected to Teradata
• High speed, parallel data transfer, 5x faster than RODBC
• Integrated parallel analytics solution

Upcoming Capabilities (4Q13): Teradata Version 14.10 + Revolution R Enterprise V7
• Parallel R in-database for big data analytics on Teradata
• R programmers can immediately build parallel R models completely in R
• Revolution parallel in-database algorithms exclusively available on Teradata

Page 36: Big data analytics on teradata with revolution r enterprise   bill jacobs

37

Introducing Revolution R Enterprise Version 7 on Teradata Database

New Teradata Table Operators

New Parallelized Algorithms

In-Database Execution of Parallelized Algorithms

Executes R Scripts From R Workstations or Servers

Provides Orders of Magnitude Performance Gains

Supports Multiple Platforms in UDA

Available Late 2013

Page 38: Big data analytics on teradata with revolution r enterprise   bill jacobs

39

HOW DOES IT WORK?

Transparent Parallelization of Analytical, Predictive Modeling and Machine Learning in Teradata

Page 39: Big data analytics on teradata with revolution r enterprise   bill jacobs

40

Understanding R’s Compute Workload

Compute Burden from Script or Command

Computational Workload Breakdown

Compute Burden from Algorithmic Computations

Algorithms 99.xxx%

R Script < 1%
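
One way to read this breakdown is through Amdahl's law: if nearly all of the workload sits in the algorithms, parallelizing the algorithms captures nearly all of the available speedup. The Python sketch below assumes, for illustration only, that exactly 99% of the work parallelizes perfectly and the remaining 1% (the script) stays serial. The slide gives only "99.xxx%" and "< 1%", so the exact fraction is an assumption.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumption: 99% of the workload is algorithmic and parallelizes perfectly.
for workers in (1, 4, 20, 100):
    print(workers, round(amdahl_speedup(0.99, workers), 1))
```

Under this assumption the speedup can never exceed 100x, however many workers are added, because the serial 1% becomes the bottleneck.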

Page 40: Big data analytics on teradata with revolution r enterprise   bill jacobs

41

ScaleR PEMAs: High Performance Analytical Algorithms

User’s Script Calls ScaleR PEMA (Parallel External Memory Algorithm)

– No Unique Code or Setup for Parallelism

– ScaleR Algorithms are “just another R package”

– Using PEMAs is Transparent, Automatic, Fast and Scales Linearly

PEMAs Transparently Parallelize Algorithm Execution

– Parallelized Versions of Statistics, Predictive Modeling and Machine Learning Algorithms

– PEMAs Transparently Distribute Computations Across AMPs

– Results are Consolidated Into A Single Result Set

– Provides Write Once Deploy Anywhere (WODA) Portability

Page 41: Big data analytics on teradata with revolution r enterprise   bill jacobs

42

Transparent to the Script

Transparent Distributed Computing with RRE ScaleR

In Revolution R Enterprise:

Script Calls ScaleR PEMA

Algorithm Executes:

Algorithm Starts A Master Process

Master Identifies Environment: Threading? Cores? Chips? Distributed Nodes?

Master Initializes Algorithm, Prepares Instructions for Nodes

Master Executes Table Operators In Each VAMP

Table Operator runs in each VAMP

VAMPs process each data segment

Table Operator returns Intermediate Result Object (IRO) to master process

Master Process Combines IROs, Returns Consolidated Answer to Script

Algorithm Returns to Script

Script Continues Execution
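
The master/worker flow on this slide can be sketched in a few lines of Python. This is an illustration of the pattern only, not Revolution's implementation: a thread pool stands in for the workers, a tuple of partial sums plays the role of the Intermediate Result Object, and a simple least-squares line fit stands in for a ScaleR algorithm. All function names and the sample segments are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(segment):
    """Worker step: reduce one data segment to a small intermediate result."""
    n = len(segment)
    sx = sum(x for x, _ in segment)
    sy = sum(y for _, y in segment)
    sxx = sum(x * x for x, _ in segment)
    sxy = sum(x * y for x, y in segment)
    return (n, sx, sy, sxx, sxy)


def combine(results):
    """Master step: add the intermediate results element-wise."""
    return tuple(sum(column) for column in zip(*results))


def fit_line(segments):
    """One request, many subtasks, one answer: slope and intercept."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(partial_sums, segments))
    n, sx, sy, sxx, sxy = combine(results)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Three hypothetical data segments, standing in for rows held by separate workers.
segments = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0)],
]
slope, intercept = fit_line(segments)
print(slope, intercept)
```

The calling code sees a single function call and a single answer. The segmentation and recombination stay inside `fit_line`, which is the sense in which the parallelism is transparent to the script.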

Page 42: Big data analytics on teradata with revolution r enterprise   bill jacobs

43

ScaleR PEMAs on Teradata: Transparent Distribution of R Analytics

For Each Call to a ScaleR Algorithm:

– One Request

– Many Subtasks

– One Answer

[Diagram labels: Corporate Applications; Desktops & Servers; Revolution R Enterprise; ODBC; Extended Stored Procedure; Teradata Database + Revolution R Enterprise; Table Operators; AMPs]

Page 44: Big data analytics on teradata with revolution r enterprise   bill jacobs

The Platform Step by Step: Tools & Deployment

[Diagram: platform components R+CRAN, RevoR, DistributedR, ScaleR, ConnectR, DevelopR, DeployR]

DeployR
• Web services software development kit
• Integrates R into application infrastructures

Capabilities:
• Invokes R scripts from web services calls
• RESTful interface for easy integration
• Works with leading desktop & BI tools

DevelopR
• Freely-available R algorithms
• Callable by RevoR
• Embeddable in R scripts

Available on:
• Can be called by RevoR
• Can be run single-node using RevoR
• Analyze large data using RDataStep package
• Run on multiple nodes using rxEXEC package

Page 45: Big data analytics on teradata with revolution r enterprise   bill jacobs

48

DevelopR Integrated Development Environment

Script with type ahead and code snippets

Solutions window for organizing code and data

Packages installed and loaded

Objects loaded in the R Environment

Object details

Sophisticated debugging with breakpoints, variable values etc.

http://www.revolutionanalytics.com/demos/revolution-productivity-environment/demo.htm

Page 46: Big data analytics on teradata with revolution r enterprise   bill jacobs

Seamless: Bring the power of R to any web enabled application

Simple: Leverage common APIs including JS, Java, .NET

Scalable: Robustly scale user and compute workloads

Secure: Manage enterprise security with LDAP & SSO

Data Analysis

Business Intelligence

Mobile Web Apps

Cloud / SaaS

R / Statistical Modeling Expert

DeployR

Deployment Expert

49

Page 47: Big data analytics on teradata with revolution r enterprise   bill jacobs

Create Custom, On-Demand Analytical Apps

Some Examples:

On-demand sales forecasting

Real-time social media sentiment analysis

Leveraging the power of R from Microsoft tools

50

Page 48: Big data analytics on teradata with revolution r enterprise   bill jacobs

Alteryx and Revolution Analytics

Delivering Enterprise-Scale Predictive Analytics to Line of Business Analysts

Enabling a Broader Audience to Harness the Universe of R

Empowering Analysts with Easy-to-Use Predictive Tools combined with the Leading R Platform

Making Predictive Analytics More Accessible and Scalable

51

Page 49: Big data analytics on teradata with revolution r enterprise   bill jacobs

52

Summary

R is Hot.

– Most Broadly Used Analytical Language

– Its Popularity Addresses Critical Talent Gap

– Vast Functionality Via CRAN

– R Needs a Platform For Big Data Big Analytics

Revolution Provides Enterprise-Capable Platforms for R.

– High Performance

– Scalable via Transparent Distributed Execution

– Portable – Write Once Deploy Anywhere (WODA)

– Commercial Support & Services Cut Project Risks

Teradata + Revolution Provide a Robust Solution

– Teradata provides stable, high-performance big data environment

– Revolution provides speed, scale, portability and stability for the enterprise

Page 50: Big data analytics on teradata with revolution r enterprise   bill jacobs

53

www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR

The leading commercial provider of software and support for the popular open source R statistics language.

Next steps?