In-Database Analytics Deep Dive with Teradata and Revolution
Transcript of In-Database Analytics Deep Dive with Teradata and Revolution
Mario Inchiosa, Chief Scientist, Revolution Analytics
In-Database Analytics Deep Dive with Teradata and Revolution R
Tim Miller, Partner Integration Lab, Teradata
• Introduction
• Revolution R Enterprise
• Case Study – Global Internet Marketplace
• Under the Hood
• Summary & Questions
Agenda
• What data storage/management software do you use?
> Hadoop
> Teradata
> LSF Clusters/Grids
> Servers
Please choose all that apply
Poll Question #1
• Most powerful statistical programming language
– Flexible, extensible and comprehensive for productivity
• Most widely used data analysis software
– Used by 2M+ data scientists, statisticians and analysts
• Create beautiful and unique data visualizations
– As seen in New York Times, Twitter and Flowing Data
• Thriving open-source community
– Leading edge of analytics research
• Fills the talent gap
– New graduates prefer R
What is R?
White Paper: "R is Hot" (bit.ly/r-is-hot)
Exploding growth and demand for R
• R is the highest paid IT skill
> Dice.com, Jan 2014
• R is the most-used data science language after SQL
> O’Reilly, Jan 2014
• R is used by 70% of data miners
> Rexer, Sep 2013
• R is #15 of all programming languages
> RedMonk, Jan 2014
• R growing faster than any other data science language
> KDnuggets, Aug 2013
• More than 2 million users worldwide
R Usage Growth (Rexer Data Miner Survey, 2007-2013)
70% of data miners report using R; R is the first choice of more data miners than any other software.
Source: www.rexeranalytics.com
[Figure: a credit-risk decision tree (Income > $40K? Debt < 10% of income? Debt = 0%?) separating good and bad credit risks, shown in two settings. Desktop and Server Analytic Architecture: a SQL request samples data out of the database and results are computed on the server. In-Database Analytic Architecture: the computation runs inside the database and only results are returned.]
Server Based vs. In-Database Architectures
Why Is Teradata Different?
Exponential Performance Improvement
• R is distributed across nodes or servers
• Each R instance runs independently of the other nodes/servers
> Great for row-independent processing such as model scoring
> However, for analytic functions that require all the data, such as model building, the onus is on the R programmer to understand data parallelism
Challenges Running R in Parallel
Example: Median (Midpoint)
Data spread across three nodes: 1 1 1 1 | 2 9 1 7 | 9 3 9 9
Node level
1. Find the median per node: 1, 4.5, 9
2. Consolidate and take the midpoint of those results: 4.5
3. Produces the wrong answer
System level
1. Sort all the data: 1 1 1 1 1 2 3 7 9 9 9 9
2. Take the midpoint: 2.5
3. Produces the right answer
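The median pitfall can be reproduced in a few lines of base R:

```r
# Data split across three hypothetical nodes
node1 <- c(1, 1, 1, 1)
node2 <- c(2, 9, 1, 7)
node3 <- c(9, 3, 9, 9)

# Node-level approach: median of the per-node medians (1, 4.5, 9)
node_medians <- c(median(node1), median(node2), median(node3))
median(node_medians)            # 4.5 -- the wrong answer

# System-level approach: one median over all the data
median(c(node1, node2, node3))  # 2.5 -- the right answer
```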
R Operations on Data
• R operates on independent rows
> Score models for a given observation
> Parse a text field
> log(x)
• R operates on independent partitions
> Fit a model to a partition such as region, time, product or store
• R operates on the entire data set
> Global sales average
> Regression on all customers
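Each of the three patterns can be illustrated with base R on a toy data frame (the column and data values below are made up for illustration):

```r
# Toy data: sales by region
sales <- data.frame(region = c("east", "east", "west", "west"),
                    amount = c(10, 20, 30, 40))

# Independent rows: each value is transformed on its own (e.g., log(x))
log(sales$amount)

# Independent partitions: one computation per region
tapply(sales$amount, sales$region, mean)  # east 15, west 35

# Entire data set: a single global statistic
mean(sales$amount)                        # 25
```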
• What statistical programming tools do you use?
> R/RRE
> SAS
> SPSS
> Statistica
> KXEN
Please choose all that apply
Poll Question #2
Who is Revolution Analytics?
Revolution Analytics
OUR COMPANY: The leading provider of advanced analytics software and services based on open source R, since 2007
OUR SOFTWARE: The only Big Data, Big Analytics software platform based on the data science language R
SOME KUDOS: Visionary, Gartner Magic Quadrant for Advanced Analytics Platforms, 2014
Industries served: Finance, Insurance, Healthcare & Pharma, Digital Economy, Analytics Service Providers, Manufacturing & High Tech
Revolution R Enterprise is… the only big data, big analytics platform based on open source R, the de facto statistical computing language for modern analytics
• High Performance, Scalable Analytics
• Portable Across Enterprise Platforms
• Easier to Build & Deploy Analytics
R: Open Source that Drives Innovation, but It Has Some Limitations for Enterprises
• Big Data: open source R is in-memory bound; RRE adds hybrid memory & disk scalability and operates on bigger volumes & factors
• Speed of Analysis: single-threaded vs. parallel threading, which shrinks analysis time
• Enterprise Readiness: community support vs. commercial support with full-service production support
• Analytic Breadth & Depth: 5000+ innovative open source analytic packages, plus Big Data-ready packages that supercharge R
• Commercial Viability: the risk of deploying open source is eliminated by a commercial license
Introducing Revolution R Enterprise (RRE): The Big Data Big Analytics Platform
Components: DevelopR, ConnectR, ScaleR, DistributedR, DeployR
• Big Data Big Analytics Ready
> Enterprise readiness
> High-performance analytics
> Multi-platform architecture
> Data source integration
> Development tools
> Deployment tools
The Platform Step by Step: R Capabilities
R+CRAN
• Open source R interpreter, updated to R 3.1.1
• Freely available R algorithms, callable by RevoR and embeddable in R scripts
• 100% compatible with existing R scripts, functions and packages
RevoR
• Based on open source R
• Adds high-performance math
Available on:
• Teradata Database
• Hortonworks, Cloudera and MapR Hadoop
• IBM Platform LSF Linux
• Microsoft HPC Clusters
• Windows & Linux servers and workstations
DeployR
• Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs
• Integrates R into application infrastructures
Capabilities:
• Invokes R scripts from web services calls
• RESTful interface for easy integration
• Works with web & mobile apps, leading BI & visualization tools, and business rules engines
DevelopR
• Integrated development environment for R
• Visual 'step-into' debugger
• Based on Visual Studio Isolated Shell
Available on:
• Windows
The Platform Step by Step: Tools & Deployment
DevelopR - Integrated Development Environment
• Script editor with type-ahead and code snippets
• Solutions window for organizing code and data
• Packages installed and loaded
• Objects loaded in the R environment
• Object details
• Sophisticated debugging with breakpoints, variable values, etc.
DeployR - Integration with 3rd Party Software
• Seamless
– Bring the power of R to any web enabled application
• Simple
– Leverage common APIs including JS, Java, .NET
• Scalable
– Robustly scale user and compute workloads
• Secure
– Manage enterprise security with LDAP & SSO
[Diagram: an R / statistical modeling expert publishes work through DeployR, where a deployment expert integrates it with data analysis, business intelligence, mobile and web apps, and cloud / SaaS consumers.]
The Platform Step by Step: Parallelization & Data Sourcing
ConnectR
• High-speed & direct connectors
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed-format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database
• ODBC
ScaleR
• Ready-to-use high-performance big data big analytics
• Fully parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Correlation & covariance matrices
• Predictive models: linear, logistic, GLM
• Machine learning
• Monte Carlo simulation
• Tools for distributing customized algorithms across nodes
DistributedR
• Distributed computing framework
• Delivers portability across platforms
Available on:
• Teradata Database
• Hortonworks / Cloudera / MapR
• Windows Servers / HPC Clusters
• IBM Platform LSF Linux Clusters
• Red Hat and SuSE Linux Servers
Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Data Prep, Distillation & Descriptive Analytics
R Data Step:
• Data import: delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation using any R functions and packages
• Recode variables, factor variables, missing-value handling
• Sort, merge, split
• Aggregate by category (means, sums)
Descriptive Statistics:
• Min / max, mean, median (approx.), quantiles (approx.)
• Standard deviation, variance, correlation, covariance
• Sum of squares (cross-product matrix)
• Pairwise cross tabs, risk ratio & odds ratio
• Cross-tabulation of data and marginal summaries of cross tabulations
Statistical Tests:
• Chi-square test, Kendall rank correlation, Fisher's exact test, Student's t-test
Sampling:
• Subsample (observations & variables)
• Random sampling
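As a sketch of how these functions are typically called (assuming the RevoScaleR package that ships with RRE; the file and column names below are hypothetical placeholders):

```r
library(RevoScaleR)

# Import a delimited file into ScaleR's XDF format (names are placeholders)
rxImport(inData = "customers.csv", outFile = "customers.xdf", overwrite = TRUE)

# Descriptive statistics, computed chunk by chunk in parallel
rxSummary(~ income + age, data = "customers.xdf")

# Cross-tabulation of two categorical variables
rxCrossTabs(~ F(region) : F(segment), data = "customers.xdf")
```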
Revolution R Enterprise ScaleR (continued)
Statistical Modeling - Predictive Models:
• Covariance / correlation / sum-of-squares / cross-product matrix
• Multiple linear regression
• Logistic regression
• Generalized linear models (GLM): all exponential-family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions including cauchit, identity, log, logit, probit; user-defined distributions & link functions
• Classification & regression trees and forests
• Gradient boosted trees
• Residuals for all models
Data Visualization:
• Histogram
• ROC curves (actual data and predicted values)
• Lorenz curve
• Line and scatter plots
• Tree visualization
Variable Selection:
• Stepwise regression: linear, logistic, GLM
Machine Learning - Cluster Analysis:
• K-means
Classification & Regression:
• Decision trees, decision forests, gradient boosted trees
Simulation and HPC:
• Monte Carlo
• Run open source R functions and packages across cores and nodes
Deployment:
• Prediction (scoring)
• PMML export
Write Once… Deploy Anywhere
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE: the same ConnectR / ScaleR / DistributedR / DeployR stack runs:
• In the cloud: Amazon AWS
• Workstations & servers: Windows, Linux
• Clustered systems: IBM Platform LSF, Microsoft HPC
• Hadoop: Hortonworks, Cloudera, MapR
• EDW: Teradata Database
• Challenge: Model and score 250M customers
• Server-based workflow was taking 3 days
• Move calculation in-database to drastically reduce runtime, process twice as many customers, and increase lift
Case Study - Global Internet Marketplace
• Binomial Logistic Regression
> 50+ Independent variables including categorical with indicator variables
> Train from small sample (many thousands) – not a problem in and of itself
> Scoring across entire corpus (many hundred millions) – slightly more challenging
Existing Open Source R model
• Same Binomial Logistic Regression
> 50+ Independent variables including categorical with indicator variables
> Train from large sample (many millions) – more accurately captures user patterns and increases lift
> Scoring across entire corpus (many hundred millions) – completes in minutes
Revolution R Enterprise model
By moving the compute to the data
RRE Used to Optimize the Current Process
Before / After: reduced a 3-day process to 10 minutes
Benchmarking the Optimized Process
[Chart: scaling study, time vs. number of rows, comparing server-based (not in-DB) and in-DB runs]
NOTE:
• Teradata environment: 4-node 1700 appliance
• RRE environment: version 7.2, R 3.0.2
• Before (open source R):
trainit <- glm(as.formula(specs[[i]]), data = training.data, family = 'binomial', maxit = iters)
fits <- predict(trainit, newdata = test.data, type = 'response')
• After (Revolution R Enterprise):
trainit <- rxGlm(as.formula(specs[[i]]), data = training.data, family = 'binomial', maxIterations = iters)
fits <- rxPredict(trainit, newdata = test.data, type = 'response')
Recode Open Source R to Revolution R Enterprise
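For a self-contained illustration of the "before" call, simulated toy data can stand in for the slide's `training.data` and `test.data`, and a concrete formula replaces `specs[[i]]` (all values below are made up):

```r
# Simulated stand-ins for the slide's training.data / test.data
set.seed(42)
training.data <- data.frame(x = rnorm(200))
training.data$y <- rbinom(200, 1, plogis(2 * training.data$x))
test.data <- data.frame(x = rnorm(10))
iters <- 25

# Before: open source glm() pulls all data into memory on one machine
trainit <- glm(y ~ x, data = training.data, family = 'binomial', maxit = iters)
fits <- predict(trainit, newdata = test.data, type = 'response')
fits  # predicted probabilities between 0 and 1
```

The RRE version swaps `glm`/`predict` for `rxGlm`/`rxPredict` (with `maxit` renamed `maxIterations`), leaving the formula and data untouched.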
Optimization process
How RRE ScaleR Actually Works
Benchmarks on a 4-core laptop, Open Source R vs. Revolution R Enterprise:
• Linear Algebra [1]
> Matrix Multiply: 176 sec vs. 9.3 sec (18x speedup)
> Cholesky Factorization: 25.5 sec vs. 1.3 sec (19x)
> Linear Discriminant Analysis: 189 sec vs. 74 sec (3x)
• General R Benchmarks [2]
> R Benchmarks (Matrix Functions): 22 sec vs. 3.5 sec (5x)
> R Benchmarks (Program Control): 5.6 sec vs. 5.4 sec (not appreciable)
1. http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
2. http://r.research.att.com/benchmarks/
Customers report 3-50x performance improvements compared to Open Source R, without changing any code
RevoR - Performance Enhanced R
Revolution R Enterprise: Scalable and Parallelized Across Cores and Nodes
• Anatomy of a PEMA: 1) Initialize, 2) Process Chunk, 3) Aggregate, 4) Finalize
• Process a chunk of data at a time, giving linear scalability
• Process an unlimited number of rows of data in a fixed amount of RAM
• Independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions
• Independent of where the data is coming from, giving portability with respect to data sources
“Parallel External Memory Algorithms”
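The four steps can be sketched in base R with a chunked mean as a toy example (ScaleR's real algorithms follow the same pattern, distributed across cores and nodes):

```r
# 1) Initialize: set up empty intermediate results
total <- 0
count <- 0

# Simulated stream of chunks (in ScaleR these come from disk, HDFS or a database)
chunks <- list(c(1, 1, 1, 1), c(2, 9, 1, 7), c(9, 3, 9, 9))

for (chunk in chunks) {
  # 2) Process chunk: compute statistics from this chunk alone
  chunk_sum <- sum(chunk)
  chunk_n   <- length(chunk)
  # 3) Aggregate: fold the chunk's statistics into the running totals
  total <- total + chunk_sum
  count <- count + chunk_n
}

# 4) Finalize: convert the aggregated statistics into the answer
total / count  # 53/12, about 4.417
```

Because only one chunk plus a small set of running totals is ever held in memory, an unlimited number of rows can be processed in a fixed amount of RAM.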
Scalability and Portability of PEMAs
• Efficient computational algorithms
• Efficient memory management – minimize data copying and data conversion
• Heavy use of C++ templates; optimal code
• Efficient data file format; fast access by row and column
• Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
• Handle categorical variables efficiently
ScaleR Performance
Speed and Scalability Comparison
• Unique PEMAs: Parallel, external-memory algorithms
• High-performance, scalable replacements for R/SAS analytic functions
• Parallel/distributed processing eliminates CPU bottleneck
• Data streaming eliminates memory size limitations
• Works with in-memory and disk-based architectures
In-Database Billion Row Logistic Regression
• 114 seconds on Teradata 2650 (6 nodes, 72 cores), including time to read data
• Scales linearly with number of rows
• Scales linearly with number of nodes: 3x faster than on 2 node Teradata system
Allstate compares SAS, Hadoop, and R for Big-Data Insurance Models
(Generalized linear model, 150 million observations, 70 degrees of freedom)
• SAS on a 16-core Sun server: 5 hours
• rmr/MapReduce on a 10-node, 80-core Hadoop cluster: > 10 hours
• Open source R on a 250 GB server: impossible (> 3 days)
• Revolution R Enterprise in-Teradata on a 6-node 2650: 3.3 minutes
http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html
• At what stage are you in your in-database analytics deployment project?
> Still researching tools and methods
> Evaluating/Selecting data storage/management platform
> Evaluating/Selecting analytics programming tools
> Launched the project/working on it now
> We’re done and looking for another one!
Please select one answer
Poll Question #3
• Revolution R Enterprise has a new "data source", RxTeradata (ODBC and TPT)
# Change the data source if necessary
tdConn <- "DRIVER=…; IP=…; DATABASE=…; UID=…; PWD=…"
teradataDS <- RxTeradata(table="…", connectionString=tdConn, …)
• Revolution R Enterprise has a new "compute context", RxInTeradata
# Change the "compute context"
tdCompute <- RxInTeradata(connectionString=..., shareDir=..., remoteShareDir=...,
    revoPath=..., wait=..., consoleOutput=...)
• Sample code for R logistic regression
# Specify model formula and parameters
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier
    + F(CRSDepTime), data=teradataDS)
RRE End-User’s Perspective
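Putting the pieces together, a minimal end-to-end sketch might look like the following (assuming the RevoScaleR package; every connection detail, path and table name below is a hypothetical placeholder to be replaced with site-specific values):

```r
library(RevoScaleR)

# Hypothetical connection string -- all fields are placeholders
tdConn <- "DRIVER=Teradata;DBCNAME=tdhost;DATABASE=mydb;UID=user;PWD=secret"

# Data source: a Teradata table exposed to R
teradataDS <- RxTeradata(table = "airline", connectionString = tdConn)

# Compute context: run the analytics inside Teradata
tdCompute <- RxInTeradata(connectionString = tdConn,
                          shareDir       = "/tmp/rreShare",   # local scratch dir
                          remoteShareDir = "/tmp/rreShare",   # scratch dir on the nodes
                          revoPath       = "/usr/lib64/Revo-7.2",  # RRE install path
                          wait           = TRUE,
                          consoleOutput  = TRUE)
rxSetComputeContext(tdCompute)

# The logistic regression now executes in-database, in parallel across AMPs
fit <- rxLogit(ArrDelay > 15 ~ Origin + DayOfWeek + F(CRSDepTime),
               data = teradataDS)
```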
• Table User Defined Functions (UDFs) allow users to place a function in the FROM clause of a SELECT statement
• Table Operators extend the existing table UDF capability:
> Table Operators are object oriented: inputs and outputs can be arbitrary rather than "fixed" as table UDFs require
> Table Operators have a simpler row-iterator interface: it simply produces output rows, giving a more natural application development interface than table UDFs
> Table Operators operate on a stream of rows: rows are buffered for high performance, eliminating row-at-a-time processing
> Table Operators support PARTITION BY and ORDER BY, which allows the development of MapReduce-style operators in-database
Table Operators – Teradata 14.10+
RRE Architecture in Teradata 14.10+
[Diagram: in the PE layer, a Master Process receives requests and returns responses; in the AMP layer, Worker Processes run against their local data partitions, communicating with the Master Process over the Message Passing Layer. All communication is done with binary BLOBs.]
1. RRE commands are sent to a "Master Process": an External Stored Procedure (XSP) in the Parsing Engine that provides parallel coordination.
2. RRE analytics are split into "Worker Process" tasks that run in a Table Operator (TO) on every AMP.
a. HPA analytics iterate over the data; intermediate results are analyzed and managed by the XSP.
b. HPC analytics do not iterate; final results from each AMP are returned to the XSP.
3. Final combined results are assembled by the XSP and returned to the user.
tdConnect <- RxTeradata(<data, connection string, …>)
tdCompute <- RxInTeradata(<data, server arguments, …>)
** PUT-based Installer
• High-performance, scalable, portable, fully-featured algorithms
• Integration with R ecosystem
• Compatibility with Big Data ecosystem
Summary
PARTNERS Mobile App
InfoHub Kiosks
teradata-partners.com
WE LOVE FEEDBACK
Questions
Rate this Session
Questions?
Resources for you (available on RevolutionAnalytics.com):
• White Paper: Teradata and Revolution Analytics: For the Big Data Era, An Analytics Revolution
• Webinar: Big Data Analytics with Teradata and Revolution Analytics
Thank You!
www.RevolutionAnalytics.com www.Teradata.com