In-Database Predictive Analytics

Post on 18-Jan-2015


Predictive analytics has long lived in the domain of statistical tools like R. Increasingly, however, as companies struggle with exploding volumes of data that small-data tools cannot easily analyze, they are looking for ways to do predictive analytics directly inside the primary data store. This approach, called in-database predictive analytics, eliminates the need to sample data and run a separate ETL process into a statistical tool, which can decrease total cost, improve the quality of predictive models, and dramatically shorten development time. In this class, you will learn the pros and cons of in-database predictive analytics, see highlights of its limitations, and survey the tools and technologies necessary to head down this path.

Transcript of In-Database Predictive Analytics

In-Database Predictive Analytics

John A. De Goes

@jdegoes, john@precog.com

Agenda

• Introduction

• Abusing SQL

• Painful by Design

• Database Extensions

• MADlib

• Other Approaches

• Summary

Introduction

In-Database Predictive Analytics

In-database predictive analytics refers to the process of performing advanced predictive analytics directly inside the database.

Traditional Predictive Analytics

[Diagram labels: database, R, SAS]

Data Bottleneck: Painful, Slow

[Diagram labels: database, R, SAS]

What’s the answer?

“MapReduce”

Move the Code, not the Data!

[Diagram label: Advanced Analytics]

Let’s Do K-Means in SQL!

Abusing SQL

General Approach in RDBMS

[Diagram labels: Driver, SQL, Database, Feedback]
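The driver-and-feedback arrangement named on this slide can be sketched as a small loop. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in database; the function `run_driver`, the table `state`, the halving step, and the tolerance are all hypothetical and chosen only to make the loop runnable.

```python
import sqlite3


def run_driver(conn, step_sql, feedback_sql, tolerance, max_iterations=100):
    """Send SQL to the database repeatedly until the feedback value settles."""
    previous = None
    for iteration in range(1, max_iterations + 1):
        conn.execute(step_sql)                              # SQL: the work runs in the database
        (current,) = conn.execute(feedback_sql).fetchone()  # Feedback: one small value comes back
        if previous is not None and abs(previous - current) < tolerance:
            return iteration, current
        previous = current
    return max_iterations, previous


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (x REAL)")
conn.execute("INSERT INTO state VALUES (16.0)")

# Hypothetical step: each pass halves x; the driver stops once x barely changes.
iterations, final_x = run_driver(
    conn,
    step_sql="UPDATE state SET x = x / 2",
    feedback_sql="SELECT x FROM state",
    tolerance=0.01,
)
```

Only the convergence check lives in the driver; the data itself never leaves the database, which is the point of the approach.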

Our Initial Model

Table model, with columns:

d: number of dimensions

k: number of clusters

n: number of points

iteration: number of iterations

avg_q: variance

Our Initial Data Set

Table Y, with columns Y1 Y2 Y3 ... (n rows)

Projection & Numbering

Table Y (columns Y1 Y2 Y3 ...) is projected into table YH (columns i Y1 ... Yd), where i numbers the rows 1, 2, 3, 4, ..., n.

INSERT INTO YH
SELECT sum(1) OVER (ROWS UNBOUNDED PRECEDING) AS i,
       Y1, Y2, ..., Yd
FROM Y;
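The numbering trick on this slide (a running count used as a row id) can be tried on a toy table. This is a sketch only, assuming Python's built-in sqlite3 module with SQLite 3.25 or later (needed for window functions); the two-dimensional sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Y (Y1 REAL, Y2 REAL)")
conn.executemany("INSERT INTO Y VALUES (?, ?)",
                 [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)])
conn.execute("CREATE TABLE YH (i INTEGER, Y1 REAL, Y2 REAL)")

# A running count over all preceding rows gives each row a sequential id.
conn.execute("""
    INSERT INTO YH
    SELECT sum(1) OVER (ROWS UNBOUNDED PRECEDING) AS i, Y1, Y2
    FROM Y
""")

ids = sorted(row[0] for row in conn.execute("SELECT i FROM YH"))
```

Because the window has no ORDER BY, which row receives which id is up to the database; the sketch relies only on the ids being 1 through n.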

Flattening

Table YH (columns i Y1 ... Yd; i = 1, 2, 3, 4, ..., n) is flattened into table YV (columns i l val; n x d rows), with one row per point i and dimension l:

i l val
1 1
1 2
1 ...
1 d
2 1
... ...
n d

INSERT INTO YV SELECT i, 1, Y1 FROM YH;
...
INSERT INTO YV SELECT i, d, Yd FROM YH;
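Since the flattening needs one INSERT per dimension, the driver is the natural place to generate the statements. A minimal sketch, assuming Python's built-in sqlite3 module and a made-up table with d = 2 dimensions and n = 3 points:

```python
import sqlite3

d = 2  # number of dimensions (assumed for this toy example)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YH (i INTEGER, Y1 REAL, Y2 REAL)")
conn.executemany("INSERT INTO YH VALUES (?, ?, ?)",
                 [(1, 1.0, 2.0), (2, 3.0, 4.0), (3, 5.0, 6.0)])
conn.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")

# One INSERT per dimension l, following the slide; the driver writes them out.
for l in range(1, d + 1):
    conn.execute(f"INSERT INTO YV SELECT i, {l}, Y{l} FROM YH")

rows = conn.execute("SELECT i, l, val FROM YV ORDER BY i, l").fetchall()
```

The result has n x d rows, one per (point, dimension) pair, which is the shape the later distance query joins on.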

Initializing k Cluster Centers

k rows are sampled from table YH (columns i Y1 ... Yd; n rows) into table CH (columns j Y1 ... Yd; j = 1, 2, 3, 4, ..., k).

INSERT INTO CH SELECT 1, Y1, ..., Yd FROM YH SAMPLE 1;
...
INSERT INTO CH SELECT k, Y1, ..., Yd FROM YH SAMPLE 1;

Flattening

Table CH (columns j Y1 ... Yd; j = 1, 2, 3, 4, ..., k) is flattened into table C (columns l j val; d x k rows), with one row per dimension l and cluster j.

INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1;
...
INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k;

Computing Distances to Clusters

INSERT INTO YD
SELECT i, j, sum((YV.val - C.val)**2)
FROM YV, C
WHERE YV.l = C.l
GROUP BY i, j;

Table YD (columns i j dist; n x k rows) holds one row per point i = 1, ..., n and cluster j = 1, ..., k.

Computing Nearest Neighbors

INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD,
     (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist;

Table YNN (columns i j; n rows) holds the nearest cluster j for each point i = 1, 2, 3, 4, 5, ..., n.
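The distance and nearest-neighbor steps can be checked together on a handful of points. The following is a sketch under stated assumptions: it uses Python's built-in sqlite3 module, the three points and two centroids are invented, and the square is written as a product because SQLite has no `**` operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YV  (i INTEGER, l INTEGER, val REAL);  -- points, flattened
    CREATE TABLE C   (l INTEGER, j INTEGER, val REAL);  -- centroids, flattened
    CREATE TABLE YD  (i INTEGER, j INTEGER, dist REAL);
    CREATE TABLE YNN (i INTEGER, j INTEGER);
""")

points = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (9.0, 10.0)}   # made-up data
centroids = {1: (0.0, 1.0), 2: (10.0, 10.0)}              # made-up data
conn.executemany(
    "INSERT INTO YV VALUES (?, ?, ?)",
    [(i, l, v) for i, p in points.items() for l, v in enumerate(p, start=1)])
conn.executemany(
    "INSERT INTO C VALUES (?, ?, ?)",
    [(l, j, v) for j, c in centroids.items() for l, v in enumerate(c, start=1)])

# Squared Euclidean distance from every point i to every centroid j.
conn.execute("""
    INSERT INTO YD
    SELECT YV.i, C.j, sum((YV.val - C.val) * (YV.val - C.val))
    FROM YV, C
    WHERE YV.l = C.l
    GROUP BY YV.i, C.j
""")

# For each point, keep the centroid whose distance equals the minimum.
conn.execute("""
    INSERT INTO YNN
    SELECT YD.i, YD.j
    FROM YD,
         (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
    WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist
""")

nearest = dict(conn.execute("SELECT i, j FROM YNN").fetchall())
```

Note that a point exactly equidistant from two centroids would produce two rows in YNN with this query; the toy data avoids ties.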

Count Points Per Cluster

INSERT INTO W SELECT j, count(*) FROM YNN GROUP BY j;
UPDATE W SET w = w/model.n;

Compute New Centroids

INSERT INTO C
SELECT l, j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY l, j;
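The centroid update is just a grouped average over the flattened points joined to their cluster assignments. A minimal sketch, assuming Python's built-in sqlite3 module, invented sample values, and a table C that starts out empty:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YV  (i INTEGER, l INTEGER, val REAL);
    CREATE TABLE YNN (i INTEGER, j INTEGER);
    CREATE TABLE C   (l INTEGER, j INTEGER, val REAL);

    -- Made-up data: points 1 and 2 belong to cluster 1, point 3 to cluster 2.
    INSERT INTO YV VALUES (1, 1, 0.0), (1, 2, 0.0),
                          (2, 1, 2.0), (2, 2, 4.0),
                          (3, 1, 9.0), (3, 2, 10.0);
    INSERT INTO YNN VALUES (1, 1), (2, 1), (3, 2);
""")

# New centroid coordinate = average of that coordinate over the cluster's points.
conn.execute("""
    INSERT INTO C
    SELECT YV.l, YNN.j, avg(YV.val)
    FROM YV, YNN
    WHERE YV.i = YNN.i
    GROUP BY YV.l, YNN.j
""")

new_centroids = {(l, j): val
                 for l, j, val in conn.execute("SELECT l, j, val FROM C")}
```

In a full iteration the old contents of C would have to be cleared before this insert; how that is done is not shown on the slide and is left out here.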

Compute Variances

INSERT INTO R
SELECT C.l, C.j, avg((YV.val-C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
GROUP BY C.l, C.j;

Update Model

INSERT INTO R
SELECT C.l, C.j, avg((YV.val-C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
GROUP BY C.l, C.j;

Let’s not do that again!


Why are predictive analytics so hard to express in SQL?

Painful by Design

#1: No Arrays

Sets: rows

Tuples: columns

Arrays: no equivalent

#2: Relational Algebra Sucks

Projection, Selection, Rename

Natural Join: R ⋈ S

Theta Join: R ⋈θ S

Semijoin: R ⋉ S

Antijoin: R ▷ S

Division: R ÷ S

Left outer join: R ⟕ S

Right outer join: R ⟖ S

Full outer join: R ⟗ S

Aggregation: G1, G2, ..., Gm g f1(A1'), f2(A2'), ..., fk(Ak') (r)

Iteration, Recursion, Multiple Dimensions

There’s GOT to be a better way!

Database Extensions

C Extension

UDF: User-Defined Function

UDA: User-Defined Aggregate

Map: map(a), op2(a, b)

Reduce: init(a), accum(a, b), merge(a, b), final(a)
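The aggregate lifecycle listed on this slide (init, accum, merge, final) can be illustrated in plain Python. This is only a sketch under stated assumptions: the class `MeanAggregate`, the helper `aggregate_partition`, and the (sum, count) state are hypothetical, and `init` takes no argument here for simplicity, whereas the slide shows `init(a)`. It is not the C API of any particular database.

```python
from functools import reduce


class MeanAggregate:
    """Hypothetical user-defined aggregate: a mean kept as (sum, count) state."""

    @staticmethod
    def init():
        # Empty state for a partition that has seen no rows yet.
        return (0.0, 0)

    @staticmethod
    def accum(state, value):
        # Fold one row into the running state.
        total, count = state
        return (total + value, count + 1)

    @staticmethod
    def merge(left, right):
        # Combine the states of two partitions processed independently.
        return (left[0] + right[0], left[1] + right[1])

    @staticmethod
    def final(state):
        # Turn the combined state into the aggregate's result.
        total, count = state
        return total / count if count else None


def aggregate_partition(agg, values):
    """Run init and accum over the rows of a single partition."""
    return reduce(agg.accum, values, agg.init())


# Three partitions, as if the rows were spread over three segments.
partitions = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
partial_states = [aggregate_partition(MeanAggregate, p) for p in partitions]
result = MeanAggregate.final(reduce(MeanAggregate.merge, partial_states))
```

The merge step is what lets the work run in parallel: each partition is reduced on its own, and only the small per-partition states are combined at the end.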

MADlib is an open-source library for scalable in-database analytics. It is implemented using database extensions written in C, and is available for PostgreSQL and Greenplum.

MADlib

1. Download the binary

Mac OS X: http://www.madlib.net/files/madlib-0.6-Darwin.dmg

Linux: http://www.madlib.net/files/madlib-0.6-Linux.rpm

2. Start the Installation

Mac OS X: Double-click on installer

Linux: yum install $MADLIB_PACKAGE --nogpgcheck

3. Verify Locatability

Greenplum: source /path/to/greenplum/greenplum_path.sh

PostgreSQL: Make sure psql is in PATH

4. Register MADlib

Greenplum: /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install

PostgreSQL: /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install

5. Test Installation

Greenplum: /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install-check

PostgreSQL: /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install-check

Clustering in MADlib

SELECT * FROM kmeans_random(
    'rel_source', 'expr_point', k,
    [ 'fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned ]
);

Ahhhhhh......

Our Way or the Highway

Composability

RDBMS Isn’t the Only Game in Town!

Other Approaches

1. Embrace Coding

• Hadoop Ecosystem
  • Mahout, Cascading/Scalding, Crunch/Scrunch, Pangool, Cascalog, and, of course, MapReduce

• BDAS Ecosystem
  • Spark

2. Reject RDBMS

• Datalog + variants
  • In theory, ideal for many kinds of predictive analytics
  • Suffers from a lack of distributed, feature-complete implementations

• Rasdaman / RASQL
  • Arrays but not analytics
  • Community Editions: http://www.rasdaman.org

• MonetDB / SciQL
  • Array extension of SQL
  • Poor analytics
  • Community Editions: http://www.monetdb.org

• SciDB / AFL (AQL)
  • Excellent analytics
  • Limited composability
  • Community Editions: http://www.scidb.org/forum/viewtopic.php?f=16&t=364/

• Precog / Quirrel (simple “R for big data”)
  • Multidimensional, arrays + functions
  • Still immature
  • Community Editions: http://www.precog.com/editions/precog-for-mongodb (MongoDB), http://www.precog.com/editions/precog-for-postgresql (PostgreSQL)

Summary

• Increase performance, reduce friction by doing more inside the database

• Not a panacea
  • Hard to do in SQL
  • Hard to do in C (but you may not have to: MADlib)
  • Pre-canned & brittle in most databases

• Ultimately what’s needed is tech designed for advanced analytics

Q&A

John A. De Goes

@jdegoes, john@precog.com

References

• Programming the K-means Clustering Algorithm in SQL (Teradata, NCR)