In-Database Predictive Analytics


John A. De Goes

@jdegoes, john@precog.com

Agenda

• Introduction
• Abusing SQL
• Painful by Design
• Database Extensions
• MADlib
• Other Approaches
• Summary

Introduction

In-Database Predictive Analytics

In-database predictive analytics refers to performing advanced predictive analytics directly inside the database.

Traditional Predictive Analytics


[Diagram: database]

Data Bottleneck: Painful, Slow



What’s the answer?


“MapReduce”

Move the Code, not the Data!

Advanced Analytics


Let’s Do K-Means in SQL!

Abusing SQL

General Approach in RDBMS

[Diagram labels: Driver, Database, Feedback]
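The driver-plus-feedback idea can be sketched roughly as follows: a small program outside the database issues SQL in a loop, and only a single convergence number travels back per iteration. This is a minimal sketch, not the talk's code. It assumes SQLite through Python's `sqlite3` module, and `run_driver`, the `state` table, and the toy update step (a Newton iteration for the square root of 2, standing in for one k-means pass) are all hypothetical.

```python
import sqlite3

def run_driver(conn, step_sql, metric_sql, tol=1e-6, max_iter=50):
    """Run step_sql repeatedly until metric_sql's value changes by less than tol."""
    previous = None
    for iteration in range(1, max_iter + 1):
        conn.executescript(step_sql)                     # heavy work stays in the database
        (metric,) = conn.execute(metric_sql).fetchone()  # only one number comes back
        if previous is not None and abs(previous - metric) < tol:
            return iteration, metric                     # feedback says: converged
        previous = metric
    return max_iter, previous

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (x REAL)")
conn.execute("INSERT INTO state VALUES (10.0)")

# Toy step (assumption): Newton iteration for sqrt(2), standing in for one k-means pass.
iterations, value = run_driver(
    conn,
    "UPDATE state SET x = (x + 2.0 / x) / 2.0;",
    "SELECT x FROM state",
)
print(iterations, value)
```

In the k-means case the step would be the sequence of INSERT statements on the following slides, and the metric would be something like the model's avg_q.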


Our Initial Model

d: number of dimensions
k: number of clusters
n: number of points
iteration: number of iterations
avg_q: variance


Our Initial Data Set

Y1 Y2 Y3 ...

n rows


Projection & Numbering

Y1 Y2 Y3 ...

i Y1 ... Yd

INSERT INTO YH
SELECT sum(1) over(rows unbounded preceding) AS i, Y1, Y2, ..., Yd
FROM Y;


Flattening

i Y1 ... Yd

INSERT INTO YV SELECT i, 1, Y1 FROM YH;
...
INSERT INTO YV SELECT i, d, Yd FROM YH;

i l val

n x d rows
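Outside SQL, the numbering and flattening steps amount to two short transformations. The sketch below is illustrative only. The function names `number_rows` and `flatten` are made up for this example, and the tables are modeled as plain Python lists of tuples.

```python
def number_rows(rows):
    """Mimic the YH step: attach a 1-based point index i to every row."""
    return [(i, *row) for i, row in enumerate(rows, start=1)]

def flatten(numbered):
    """Mimic the YV step: one (i, l, val) triple per point and dimension."""
    return [
        (i, l, val)
        for i, *values in numbered
        for l, val in enumerate(values, start=1)
    ]

Y = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]  # n = 3 points, d = 2 dimensions
YH = number_rows(Y)
YV = flatten(YH)
print(len(YV))  # n x d triples
```

The vertical layout is what lets later steps treat "for every dimension" as an ordinary join on l rather than as d hand-written column expressions.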


Initializing k Cluster Centers

i Y1 ... Yd

j Y1 ... Yd

INSERT INTO CH SELECT 1, Y1, ..., Yd FROM YH SAMPLE 1;
...
INSERT INTO CH SELECT k, Y1, ..., Yd FROM YH SAMPLE 1;
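For comparison, picking k starting centers from the numbered points might look like this outside the database. Note the swapped-in technique: the SQL above draws one random row k separate times, so the same point could in principle be drawn twice, whereas this sketch samples without replacement. `initial_centers` and its `seed` parameter are hypothetical names.

```python
import random

def initial_centers(yh_rows, k, seed=None):
    """Pick k distinct rows (i, y1, ..., yd) and relabel them (j, y1, ..., yd)."""
    rng = random.Random(seed)
    chosen = rng.sample(yh_rows, k)  # without replacement, unlike k independent SAMPLE 1 draws
    return [(j, *row[1:]) for j, row in enumerate(chosen, start=1)]

YH = [(1, 1.0, 2.0), (2, 3.0, 4.0), (3, 5.0, 6.0), (4, 7.0, 8.0)]
CH = initial_centers(YH, k=2, seed=0)
print(CH)
```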


j Y1 ... Yd

Flattening

l j val

d x k rows

INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1;
...
INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k;


Computing Distances to Clusters

INSERT INTO YD
SELECT i, j, sum((YV.val - C.val)**2)
FROM YV, C
WHERE YV.l = C.l
GROUP BY i, j;

i j dist

n x k rows
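The distance query can be tried on a tiny data set with SQLite through Python's `sqlite3` module. This is a sketch under assumptions: SQLite has no `**` operator, so the square is written as a product, and the two points and two centers are made-up values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YV (i INTEGER, l INTEGER, val REAL);  -- flattened points
    CREATE TABLE C  (l INTEGER, j INTEGER, val REAL);  -- flattened centers
    CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL); -- squared distances
""")
# Points p1 = (0, 0), p2 = (3, 4); centers c1 = (0, 0), c2 = (3, 3).
conn.executemany("INSERT INTO YV VALUES (?, ?, ?)",
                 [(1, 1, 0.0), (1, 2, 0.0), (2, 1, 3.0), (2, 2, 4.0)])
conn.executemany("INSERT INTO C VALUES (?, ?, ?)",
                 [(1, 1, 0.0), (2, 1, 0.0), (1, 2, 3.0), (2, 2, 3.0)])

# Join on the dimension index l, then sum squared differences per (point, center).
conn.execute("""
    INSERT INTO YD
    SELECT YV.i, C.j, sum((YV.val - C.val) * (YV.val - C.val))
    FROM YV, C
    WHERE YV.l = C.l
    GROUP BY YV.i, C.j
""")
distances = {(i, j): d for i, j, d in conn.execute("SELECT i, j, dist FROM YD")}
print(distances)
```

The result holds squared Euclidean distances, which is enough for finding the nearest center because squaring preserves order.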


Computing Nearest Neighbors

INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD, (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i and YD.dist = YMIND.mindist;

nearest clusters

n rows
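A small runnable version of the nearest-cluster join, again assuming SQLite via `sqlite3` and made-up distances. One caveat worth seeing in practice: if a point is equally close to two centers, the join on the minimum returns both, so YNN can end up with more than n rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YD  (i INTEGER, j INTEGER, dist REAL);
    CREATE TABLE YNN (i INTEGER, j INTEGER);
    INSERT INTO YD VALUES (1, 1, 0.0), (1, 2, 18.0), (2, 1, 25.0), (2, 2, 1.0);
""")

# Keep, for each point i, the cluster j whose distance equals that point's minimum.
conn.execute("""
    INSERT INTO YNN
    SELECT YD.i, YD.j
    FROM YD,
         (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
    WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist
""")
nearest = dict(conn.execute("SELECT i, j FROM YNN"))
print(nearest)
```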


Count Points Per Cluster

INSERT INTO W SELECT j, count(*) FROM YNN GROUP BY j;
UPDATE W SET w = w/model.n;
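The weight of a cluster is simply the fraction of points assigned to it. A minimal Python sketch follows, with `cluster_weights` as a hypothetical helper name. In SQL dialects where dividing two integers truncates, the UPDATE above would need w stored as a floating-point column to avoid weights of 0.

```python
from collections import Counter

def cluster_weights(assignments):
    """assignments maps point index i to cluster index j; returns j -> fraction of points."""
    n = len(assignments)
    counts = Counter(assignments.values())
    return {j: count / n for j, count in counts.items()}

weights = cluster_weights({1: 1, 2: 2, 3: 2, 4: 2})
print(weights)
```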


Compute New Centroids

INSERT INTO C
SELECT l, j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY l, j;
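The centroid update is an average per dimension and cluster, and it can be checked on toy data. The sketch assumes SQLite via `sqlite3`, three made-up points, and an empty C table. The slide does not show how the old centers are cleared before this insert, so that step is left out here as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YV  (i INTEGER, l INTEGER, val REAL);
    CREATE TABLE YNN (i INTEGER, j INTEGER);
    CREATE TABLE C   (l INTEGER, j INTEGER, val REAL);
    -- Points (0, 0) and (2, 4) belong to cluster 1; point (10, 10) to cluster 2.
    INSERT INTO YV VALUES (1, 1, 0.0), (1, 2, 0.0), (2, 1, 2.0),
                          (2, 2, 4.0), (3, 1, 10.0), (3, 2, 10.0);
    INSERT INTO YNN VALUES (1, 1), (2, 1), (3, 2);
""")

# Average each dimension l over the points assigned to each cluster j.
conn.execute("""
    INSERT INTO C
    SELECT YV.l, YNN.j, avg(YV.val)
    FROM YV, YNN
    WHERE YV.i = YNN.i
    GROUP BY YV.l, YNN.j
""")
centroids = {(l, j): v for l, j, v in conn.execute("SELECT l, j, val FROM C")}
print(centroids)
```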


Compute Variances

INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i and YV.l = C.l and YNN.j = C.j
GROUP BY C.l, C.j;
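The variance query joins three tables: the centers, the flattened points, and the assignments. A runnable sketch, assuming SQLite via `sqlite3`, a product in place of `**`, and made-up data in which cluster 1 holds points (0, 0) and (2, 4) with centroid (1, 2):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YV  (i INTEGER, l INTEGER, val REAL);
    CREATE TABLE YNN (i INTEGER, j INTEGER);
    CREATE TABLE C   (l INTEGER, j INTEGER, val REAL);
    CREATE TABLE R   (l INTEGER, j INTEGER, val REAL);
    INSERT INTO YV  VALUES (1, 1, 0.0), (1, 2, 0.0), (2, 1, 2.0),
                           (2, 2, 4.0), (3, 1, 10.0), (3, 2, 10.0);
    INSERT INTO YNN VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO C   VALUES (1, 1, 1.0), (2, 1, 2.0), (1, 2, 10.0), (2, 2, 10.0);
""")

# Mean squared deviation from the centroid, per dimension l and cluster j.
conn.execute("""
    INSERT INTO R
    SELECT C.l, C.j, avg((YV.val - C.val) * (YV.val - C.val))
    FROM C, YV, YNN
    WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
    GROUP BY C.l, C.j
""")
variances = {(l, j): v for l, j, v in conn.execute("SELECT l, j, val FROM R")}
print(variances)
```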


Update Model

INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i and YV.l = C.l and YNN.j = C.j
GROUP BY C.l, C.j;


Let’s not do that again!


Why are predictive analytics so hard to express in SQL?

Painful by Design

#1: No Arrays

Sets: rows
Tuples: columns
Arrays: (missing)


#2: Relational Algebra Sucks

• Projection
• Selection
• Rename
• Natural join
• Theta join
• Semijoin
• Antijoin
• Division
• Left outer join: R ⟕ S
• Right outer join: R ⟖ S
• Full outer join: R ⟗ S
• Aggregation: G1, G2, ..., Gm g f1(A1'), f2(A2'), ..., fk(Ak') (r)


• Iteration
• Recursion
• Multiple Dimensions

There’s GOT to be a better way!

Database Extensions

C Extension


UDF: User-Defined Function
UDA: User-Defined Aggregate

Map: map(a)
Reduce: op2(a, b)

init(a), accum(a, b), merge(a, b), final(a)
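The four aggregate hooks named on the slide can be illustrated with a mean aggregate. This is a sketch of the general pattern only, not any particular database's extension API. The signatures below (for example, `init` taking no argument) and the `aggregate` helper that simulates partitioned execution are assumptions made for the example.

```python
from functools import reduce

class MeanAggregate:
    """A user-defined aggregate for the mean, written as four small hooks."""

    @staticmethod
    def init():
        return (0.0, 0)  # state: (running sum, count)

    @staticmethod
    def accum(state, value):
        total, count = state
        return (total + value, count + 1)

    @staticmethod
    def merge(left, right):
        # Combine two partial states, e.g. from two segments of a parallel database.
        return (left[0] + right[0], left[1] + right[1])

    @staticmethod
    def final(state):
        total, count = state
        return total / count if count else None

def aggregate(agg, partitions):
    """Simulate partitioned execution: accumulate per partition, merge, finalize."""
    partials = [reduce(agg.accum, part, agg.init()) for part in partitions]
    return agg.final(reduce(agg.merge, partials, agg.init()))

mean = aggregate(MeanAggregate, [[1.0, 2.0], [3.0], [4.0, 6.0]])
print(mean)
```

The merge hook is what makes the aggregate usable in parallel: each partition accumulates on its own, and only the small partial states need to be combined.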


MADlib is an open-source library for scalable in-database analytics. It is implemented using database extensions written in C, and is available for PostgreSQL and Greenplum.

MADlib

1. Download the binary

Mac OS X: http://www.madlib.net/files/madlib-0.6-Darwin.dmg

Linux: http://www.madlib.net/files/madlib-0.6-Linux.rpm

2. Start the Installation

Mac OS X: Double-click on installer

Linux: yum install $MADLIB_PACKAGE --nogpgcheck

3. Verify Locatability

Greenplum: source /path/to/greenplum/greenplum_path.sh

PostgreSQL: Make sure psql is in PATH

4. Register MADlib

Greenplum: /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install

PostgreSQL: /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install

5. Test Installation

Greenplum: /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install-check

PostgreSQL: /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install-check

Clustering in MADlib

SELECT * FROM kmeans_random(
  'rel_source', 'expr_point', k,
  [ 'fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned ]
);

Ahhhhhh......


Our Way or the Highway

Composability


RDBMS Isn’t the Only Game in Town!

Other Approaches

1. Embrace Coding

• Hadoop Ecosystem
  • Mahout, Cascading/Scalding, Crunch/Scrunch, Pangool, Cascalog, and, of course, MapReduce
• BDAS Ecosystem
  • Spark


2. Reject RDBMS

• Datalog + variants
  • In theory, ideal for many kinds of predictive analytics
  • Suffers from a lack of distributed, feature-complete implementations


2. Reject RDBMS

• Rasdaman / RASQL
  • Arrays but not analytics

Community Editions: http://www.rasdaman.org


2. Reject RDBMS

• MonetDB / SciQL
  • Array extension of SQL
  • Poor analytics

Community Editions: http://www.monetdb.org


2. Reject RDBMS

• SciDB / AFL (AQL)
  • Excellent analytics
  • Limited composability

Community Editions: http://www.scidb.org/forum/viewtopic.php?f=16&t=364/


2. Reject RDBMS

• Precog / Quirrel (simple “R for big data”)
  • Multidimensional, arrays + functions
  • Still immature

Community Editions:
http://www.precog.com/editions/precog-for-mongodb (MongoDB)
http://www.precog.com/editions/precog-for-postgresql (PostgreSQL)


Summary

• Increase performance, reduce friction by doing more inside the database

• Not a panacea
  • Hard to do in SQL
  • Hard to do in C (but you may not have to: MADlib)
  • Pre-canned & brittle in most databases

• Ultimately what’s needed is tech designed for advanced analytics

Q&A

John A. De Goes

@jdegoes, john@precog.com

References

• Programming the K-means Clustering Algorithm in SQL (Teradata, NCR)
