Big Analytics Without Big Hassles 04/10/14 Webinar

Big Analytics without Big Hassles. Presenters: Bryan Lewis, Chief Data Scientist; Alex Poliakov, Solutions Architect

description

Data scientists want to do fast, interactive exploratory analytics on all kinds of data without worrying about whether the data fits in memory, how to parallelize the work, how to force-fit the data into a table, or how to pull it out of a file and format it for a math package. They also want to use their favorite analytical language and have it scale transparently to Big Data volumes. Paradigm4 presents a webinar about SciDB, the open source array database with native, scalable complex analytics, programmable from R and Python. Learn how SciDB enables you to:

•  Explore rich data sets interactively
•  Do complex math in-database, without being constrained by memory limitations
•  Perform fast multi-dimensional windowing, filtering, and aggregation
•  Offload large computations to a commodity hardware cluster, on-premise or in a cloud
•  Use R and Python to analyze SciDB arrays as if they were R or Python objects
•  Share data among users, with multi-user data integrity guarantees and version control

Transcript of Big Analytics Without Big Hassles 04/10/14 Webinar

Page 1: Big Analytics Without Big Hassles 04/10/14 Webinar

Big Analytics without Big Hassles

Bryan Lewis, Chief Data Scientist

Alex Poliakov, Solutions Architect

Page 2: Big Analytics Without Big Hassles 04/10/14 Webinar

Paradigm4’s SciDB

MPP Database

Array data model

Complex analytics

Commodity clusters or cloud

R & Python

Big analytics without big hassles

Page 3: Big Analytics Without Big Hassles 04/10/14 Webinar

© Paradigm4

Using WebEx

•  Ask questions using the Q&A window

•  This webinar is being recorded

•  Replays will be available

Page 4: Big Analytics Without Big Hassles 04/10/14 Webinar

Agenda

1.  Brief Introduction to SciDB

2.  Demos

3.  Q & A

Page 5: Big Analytics Without Big Hassles 04/10/14 Webinar

Paradigm4 develops & supports SciDB

Force behind many major advances in databases

Postgres, Vertica, Paradigm4, Illustra, VoltDB, Streambase, DataTamer

Mike Stonebraker
CTO & Co-founder
MIT Professor
ISTC Big Data at MIT

Page 6: Big Analytics Without Big Hassles 04/10/14 Webinar

Presenters

Bryan Lewis, Chief Data Scientist
•  Applied Math Ph.D.
•  Founder Rocketcalc; RevolutionAnalytics
•  CRAN contributor

Alex Poliakov, Solutions Architect
•  Decade developing database internals (Netezza, Paradigm4)
•  Solutions: e-commerce, pharma/biotech, insurance, satellite imagery

Page 7: Big Analytics Without Big Hassles 04/10/14 Webinar

Three pillars of SciDB

MPP Database

Array data model

Complex analytics

Commodity clusters or cloud

R & Python

Big analytics without big hassles

Page 8: Big Analytics Without Big Hassles 04/10/14 Webinar

SciDB Powers NIH NCBI’s 1000 Genomes Project

Running 24 x 7 since Fall 2012

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/

Page 9: Big Analytics Without Big Hassles 04/10/14 Webinar

Some commercial use cases

Pharma, Biotech, Healthcare
•  Join public & private data
•  Integrate many data sources
•  Scale up & speed up math

Quant Finance
•  Fast data window selection & scalable math

Image & Sensor Analytics
•  Integrate diverse data with different spatial and temporal resolutions

E-commerce
•  SVD on sparse matrices 50M x 50M
•  Powering recommendation engine

Page 10: Big Analytics Without Big Hassles 04/10/14 Webinar

SciDB is a Database

•  ACID support means multiple users can simultaneously read / write / analyze data

•  FAST JOINs

data in files

Page 11: Big Analytics Without Big Hassles 04/10/14 Webinar

Arrays are a natural data model

[Figure: array of events. Axis labels: sensor / car / phone, time, longitude, latitude, other dimensions …]

[Figure: array of market data. Axis labels: Exchange, Stock_ID, Time, other dimensions …]

Page 12: Big Analytics Without Big Hassles 04/10/14 Webinar

Native Array DBs vs. Relational DBs

Data that are close to each other in the coordinate system are stored close to each other on disk

Important for ordered data and analysis

Page 13: Big Analytics Without Big Hassles 04/10/14 Webinar

Array storage supports fast multi-dimensional SELECTs

Illustration credit: Andrei Pandre

Page 14: Big Analytics Without Big Hassles 04/10/14 Webinar

SciDB does scalable complex analytics

•  No more ETL hassles to a separate math package
•  Data not constrained to fit in memory

Parallel linear algebra
Principal component analysis
Clustering
GLM
Machine learning
and more

Page 15: Big Analytics Without Big Hassles 04/10/14 Webinar

Analyst-Friendly Interfaces

•  Program SciDB from R or Python
•  Naturally reference & manipulate data in SciDB
•  Large computations run on SciDB cluster
   –  Go beyond the scalability limitations of R & Python

We also support AQL and JDBC

Page 16: Big Analytics Without Big Hassles 04/10/14 Webinar

Shared-Nothing Cluster Architecture

SciDB Coordinator

SciDB …

SciDB 1

SciDB 2

R + SciDB-R

Python + SciDB-Py

JDBC

Web Browser

K-replication for redundancy

Scale out horizontally

Page 17: Big Analytics Without Big Hassles 04/10/14 Webinar

SciDB Arrays

Each cell in a SciDB array consists of a fixed number of typed attributes (variables). Here is an example cell with four attributes:

Price    Volume   Symbol   usec
450.61   150      “AAPL”   36013008713

Page 18: Big Analytics Without Big Hassles 04/10/14 Webinar

SciDB Arrays

Dimension i   Price    Volume   Symbol   usec
1             450.61   150      “AAPL”   36013008713
2             450.73   200      “AAPL”   36013008915
3             450.84   10       “AAPL”   36013208113
4             36.57    75       “MSFT”   36019008713
5             36.20    100      “MSFT”   36003200113

A 1-D array looks like an R or Pandas data frame.

This picture shows five cells, each with four attributes (Price, Volume, Symbol, usec), indexed by dimension i.
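A minimal Python sketch of the same picture, for readers who think in code. It is not SciDB's storage format or API. The names `Cell`, `ticks`, and `total_volume` are made up for illustration. The values are the ones on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One array cell: a fixed set of typed attributes."""
    price: float
    volume: int
    symbol: str
    usec: int


# Dimension i (1..5) addresses one cell each, mirroring the slide's table.
ticks = {
    1: Cell(450.61, 150, "AAPL", 36013008713),
    2: Cell(450.73, 200, "AAPL", 36013008915),
    3: Cell(450.84, 10, "AAPL", 36013208113),
    4: Cell(36.57, 75, "MSFT", 36019008713),
    5: Cell(36.20, 100, "MSFT", 36003200113),
}


def total_volume(cells, symbol):
    """Sum the volume attribute over all cells that carry the given symbol."""
    return sum(c.volume for c in cells.values() if c.symbol == symbol)


if __name__ == "__main__":
    print(total_volume(ticks, "AAPL"))
```

Because Symbol is only an attribute here, finding all cells for one symbol means scanning every cell.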

Page 19: Big Analytics Without Big Hassles 04/10/14 Webinar

SciDB Arrays

The same data “redimensioned” into a 2D array

Dimensions: usec (rows) and Symbol (columns)

              “AAPL”             “MSFT”
usec          Price    Volume    Price    Volume
36003200113                      36.20    100
36013008713   450.61   150
36013008915   450.73   200
36013208113   450.84   10
36019008713                      36.57    75
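The idea of turning two attributes into dimensions can be sketched in plain Python. This only illustrates the concept and does not show SciDB's own redimension machinery. The function name `redimension` and the dictionary layout are assumptions made for this example.

```python
# Rows from the 1-D array: (usec, symbol, price, volume).
ROWS = [
    (36013008713, "AAPL", 450.61, 150),
    (36013008915, "AAPL", 450.73, 200),
    (36013208113, "AAPL", 450.84, 10),
    (36019008713, "MSFT", 36.57, 75),
    (36003200113, "MSFT", 36.20, 100),
]


def redimension(rows):
    """Re-key cells by (usec, symbol); price and volume stay as attributes."""
    grid = {}
    for usec, symbol, price, volume in rows:
        key = (usec, symbol)
        if key in grid:
            # Two source cells landing on one coordinate need a policy;
            # this sketch simply refuses.
            raise ValueError(f"cell collision at {key}")
        grid[key] = {"price": price, "volume": volume}
    return grid


def select_symbol(grid, symbol):
    """Return (usec, price) pairs for one symbol, ordered by time."""
    return sorted((u, cell["price"]) for (u, s), cell in grid.items() if s == symbol)


if __name__ == "__main__":
    grid = redimension(ROWS)
    print(select_symbol(grid, "MSFT"))
```

The result is sparse: most (usec, Symbol) coordinates hold no cell, which matches the empty positions in the slide's table.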

Page 20: Big Analytics Without Big Hassles 04/10/14 Webinar

SciDB Array Schema

CREATE ARRAY Simple_Array < v1 : double, v2 : int64, v3 : string > [ I = 0:*, 5, 0, J = 0:9, 5, 0 ];

Attributes v1, v2, v3

Dimensions I, J

Dimension size * is unbounded

Chunk size

Chunk overlap
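Going by the labels on this slide, each dimension reads as name = low:high, chunk size, chunk overlap. The Python sketch below works out which chunk a coordinate falls into under that reading. The `Dimension` tuple and `chunk_origin` are illustrative names, not part of SciDB. The sketch says nothing about how chunks are assigned to cluster nodes.

```python
from typing import NamedTuple, Optional


class Dimension(NamedTuple):
    name: str
    low: int
    high: Optional[int]  # None stands for '*', i.e. unbounded
    chunk_size: int
    overlap: int


# [ I = 0:*, 5, 0, J = 0:9, 5, 0 ] from the slide.
SCHEMA = (
    Dimension("I", 0, None, 5, 0),
    Dimension("J", 0, 9, 5, 0),
)


def chunk_origin(coords, dims=SCHEMA):
    """Return the lowest coordinate of the chunk that contains `coords`."""
    origin = []
    for c, d in zip(coords, dims):
        if c < d.low or (d.high is not None and c > d.high):
            raise IndexError(f"{d.name}={c} is outside the dimension bounds")
        origin.append(d.low + ((c - d.low) // d.chunk_size) * d.chunk_size)
    return tuple(origin)


if __name__ == "__main__":
    print(chunk_origin((7, 3)))
```

With a chunk size of 5 in both dimensions, every chunk covers up to 25 cells. J yields exactly two chunks along its axis, and I keeps growing.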

Page 21: Big Analytics Without Big Hassles 04/10/14 Webinar

Arrays are distributed with overlap

Supports constant time moving window aggregates and feature detection …even when data cross node boundaries

[Figure: grid of sample cell values (0.01, 0.02, 0.50) illustrating an array distributed in chunks with overlap]
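Why overlap helps can be sketched in one dimension, which simplifies the slide's 2-D picture. If each chunk also stores a copy of its neighbors' edge cells, a moving-window aggregate for every cell can be computed from that chunk's local data alone. The function names are hypothetical, and the sketch makes no claim about SciDB's actual implementation or its running time.

```python
def split_with_overlap(values, chunk_size, overlap):
    """Split into chunks; each keeps `overlap` extra cells from each neighbor.

    Returns tuples (start, stop, lo, local) where [start, stop) is the range
    the chunk owns and `local` holds values[lo:hi], including the overlap.
    """
    n = len(values)
    chunks = []
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        lo = max(start - overlap, 0)
        hi = min(stop + overlap, n)
        chunks.append((start, stop, lo, values[lo:hi]))
    return chunks


def windowed_sum_local(chunk, radius):
    """Centered moving sum for the cells a chunk owns, using only local data.

    Correct only when the chunk was built with overlap >= radius.
    """
    start, stop, lo, local = chunk
    out = []
    for i in range(start, stop):
        a = max(i - radius - lo, 0)
        b = min(i + radius + 1 - lo, len(local))
        out.append(sum(local[a:b]))
    return out


def windowed_sum_global(values, radius):
    """Reference result computed on the undivided array."""
    return [sum(values[max(i - radius, 0):i + radius + 1]) for i in range(len(values))]


if __name__ == "__main__":
    data = list(range(1, 13))
    per_chunk = [windowed_sum_local(c, 1) for c in split_with_overlap(data, 4, 1)]
    print(per_chunk)
```

Concatenating the per-chunk results gives the same values as the global computation, and no chunk had to fetch data from a neighbor at query time.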

Page 22: Big Analytics Without Big Hassles 04/10/14 Webinar

Live demonstrations

1)  Airline data
•  Select
•  Aggregate lateness
•  Heatmap

2)  Netflix-like data
•  SVD

3)  Zipcode (lat,long) and population by zipcode
•  Join
•  Compute distance-weighted population by zipcode
•  Plot histogram

4)  Satellite and point-of-interest data
•  Select region
•  Regrid and plot
•  Overlay another dataset: shopping mall locations
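The "regrid" step in demo 4 can be read as coarsening a grid by aggregating fixed-size blocks of cells. The Python sketch below assumes that reading and uses the mean as the aggregate. The function name `regrid_mean` and the sample grid are made up for illustration and are not taken from the demo.

```python
def regrid_mean(grid, block_rows, block_cols):
    """Coarsen a 2-D grid by averaging each block_rows x block_cols block.

    Blocks at the right and bottom edges may be smaller when the grid
    dimensions are not multiples of the block size.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    out = []
    for r0 in range(0, n_rows, block_rows):
        row = []
        for c0 in range(0, n_cols, block_cols):
            block = [
                grid[r][c]
                for r in range(r0, min(r0 + block_rows, n_rows))
                for c in range(c0, min(c0 + block_cols, n_cols))
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


if __name__ == "__main__":
    sample = [
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16],
    ]
    print(regrid_mean(sample, 2, 2))
```

A coarser grid is cheaper to plot and easier to overlay with point data such as store locations.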

Page 23: Big Analytics Without Big Hassles 04/10/14 Webinar

Demonstration Cluster

Running on a modest 4-node cluster. Each node has:
•  16 cores
•  128 GB RAM
•  4 x 1TB disks

Connected by 1Gbit Ethernet

Also runs on public clouds

Page 24: Big Analytics Without Big Hassles 04/10/14 Webinar

Registration Poll Results

What mathematical and statistical computing software do you use? (n = 340)

Excel    15%
MATLAB    6%
Other    20%
Python   17%
R        42%

Please respond to live poll

Page 25: Big Analytics Without Big Hassles 04/10/14 Webinar

Try It

Quick Start
•  scidb.org/forum
•  Download a VM or EC2 AMI

Community Edition
•  Open Source; Active forum
•  Unrestricted & fully scalable

Enterprise Edition
•  Commercial license
•  Unrestricted & fully scalable
•  More math functions
•  Intel MKL support
•  Failover & fault tolerance
•  System management tools

Page 26: Big Analytics Without Big Hassles 04/10/14 Webinar

Take Away: Less coding, more analysis

•  ACID database
•  Array data model
•  In-database complex math
•  Automatic scale-out & speed-up
•  Programmable from R and Python

www.paradigm4.com

Page 27: Big Analytics Without Big Hassles 04/10/14 Webinar

© Paradigm4 Inc. 27

Questions?

Tell us about your application
•  [email protected]

Try our Quick Start
•  scidb.org/forum
•  Download a VM or EC2 AMI

www.paradigm4.com

Thanks for your interest!