SciDB


Topics

• The Big Complex Analytics Space

• SciDB Overview

• How we are different and why that matters

• Architecture

Note: We call our company P4 for short


Rich Data + Complex Analytics drive insights and innovative product offerings

● Tap new types and sources of data

– Location, genomic, behavioral, speech, sensors, images, …

● Integrate mixed data sources for novel insights

– Genomic/wearable sensors/EHRs/clinical/payer/provider

– Satellite images/smart grid data

– Location/weather/traffic/driving behavior

● Generate micro-segmented pricing & products

– Personalized insurance

– Precision medicine

– Precision warranties

– Behavioral targeting

– Location-based services

● Look at whole populations, big time windows, big regions


Where some of the ‘complex analytics’ problems are

Pharma, Biotech, AgroBusiness, Healthcare Informatics

• Next-gen sequencing analysis & GWAS

• Population studies

• Evidence-based outcome studies

• Pharmaco-economics

Insurance Analytics

• Personalized auto or workman’s comp insurance

• Catastrophe modeling and policy pricing

• Risk modeling for insurance exchanges

Industrial Analytics

• Precision warranty pricing & maintenance schedules

Call Centers

• Speech analytics

Energy

• Data from smart sensor grids

Digital Marketing

• Geo-targeting & other personalization strategies

• Recommendation engines

Financial Services

• Financial modeling, back testing, sensitivity testing

• Algorithmic trading

• Portfolio management & risk management

Scientific Research

• Astronomy, Climatology, High Energy Physics, et al


‘Big Analytics’ covers two categories

• Big Volume + Simple analytics – Traditional Data Warehouses, RDBMSs

– Business analysts

– Count statistics, roll-ups, aggregates

• Big Volume + Complex Analytics – Emerging markets; new tools

– Data scientists / healthcare analysts / quants / operations researchers

– Multivariate statistics, clustering, SVD, machine-learning, et al

P4’s space


Why would industrial & commercial analytics applications benefit from yet another software platform?

• Sensor data, geospatial data, temporal data, genomic data & images are far more efficiently managed as multi-dimensional arrays than as relational tables

• Complex analytics should execute in place where the data resides and scale easily with additional nodes and cores


P4’s new ‘Complex Analytics’ database: scientific data management & analytics for the commercial world

Rich Data

Massively Scalable Math

Smart Data Management

© Paradigm4 Inc.

P4 is well-matched for M2M data

Machine-generated data have inherent ordering & structure

• location data from cars and cell phones

• telematics data from sensors

• energy usage data from smart sensor grids

• genetic sequencing data

• patient telemetry data

• time series and longitudinal event data

• satellite images of the earth’s surface

2012-01-31 22:32:36.968000


A new ‘Complex Analytics’ database: scientific data management & analytics for the commercial & industrial worlds

• All-in-one next generation database with native data life cycle management and seamlessly integrated, scalable complex math operations

• n-dimensional array data model, optimal for temporal, geospatial, and machine-generated data

• Open Source

• Commodity HW grid or cloud


SciDB Features

• Distributed data storage

– With redundancy/fault tolerance and high availability

• Scalable Parallel operations

– Parallel linear algebra, aggregates, summaries, data loading

• ACID Transactions

• Structured N-dimensional Sparse Array Data Model

– Defined by schema

• Expressive SQL-like Query Syntax

– Supports joins by array dimensions

• No-Overwrite Data Versioning

• Extensible

– User-defined types, functions, operators


Paradigm4 enables data-intensive research

Capture: Ingest, store, and manage data throughout its lifecycle

Curation: Save raw, corrected, pre-processed and derived analytic data, with metadata and provenance

Curiosity: Explore, drill down, filter, select

Compute: Complex math and modeling

Collaboration: Shared resource; no data silos with long, metadata-laden filenames

Compliance: No-overwrite, versioned data storage supports reproducibility and validation of results


Why SciDB for scientists?

First class support for scientific data & scientific research

• Ingest, store, access, and manage data throughout its life cycle

• No overwrite database; historical versioning support

• Metadata – store curation and calibration information

• Extensibility (user defined types and operations)

• Save raw, corrected, pre-processed, and derived data

• Support for provenance

• Support reproducibility of results

• Share data across work groups and with outside organizations


P4’s native Array DB beats Relational DBs* on storage efficiency & complex computations

● Math functions run directly on native storage format

● Dramatic storage efficiencies as # of dimensions & attributes grows

– Architecture supports n dimensions

● Facilitates drill-down & clustering by like groups

● High performance for both sparse and dense data

– 10-100x faster than RDBMSs on array operations

[Figure: storage comparison, 16 cells vs. 48 cells]

* Applies to both row stores & column stores
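
One plausible reading of the 16-cell and 48-cell labels is a dense 4 x 4 array compared with the same data held as a table with one row per cell and explicit index columns. The sketch below counts stored cells under that reading. The 4 x 4 shape, the (i, j, value) table layout, and the function names are assumptions for illustration and are not taken from the slide.

```python
# Sketch: count stored cells for a dense array vs. an index-plus-value table.
# Shape and table layout are illustrative assumptions, not SciDB internals.

def array_cells(rows, cols, attributes=1):
    """Cells stored when the dimensions are implicit in the cell position."""
    return rows * cols * attributes


def relational_cells(rows, cols, attributes=1, dimensions=2):
    """Cells stored when every table row repeats its dimension indices."""
    return rows * cols * (dimensions + attributes)


if __name__ == "__main__":
    print(array_cells(4, 4))       # 16
    print(relational_cells(4, 4))  # 48
```

Under this reading the gap widens with every added dimension, since each table row must carry one more index column.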


Data exploration & analytics work better when the natural ordering of data is preserved

Clusters, temporal regions are stored together

Resample time or re-grid geospatial data at any resolution

Slice & drill-down in any n-dimensional region

Fast data selection for ad hoc queries

Efficient analytics over sub-regions & moving windows


Complex math underpins many use cases

Industrial Analytics

• Precision warranty pricing

• Proactive preventive maintenance

• Modeling & optimization

• Event monitoring in refineries and factories

Complex math

• covariance

• PCA, SVD

• cross validation

• bootstrapping

• cluster analysis

• linear/logistic regression

Pharma Biotech Healthcare

• Next-gen Sequencing

• Population studies

• Outcome studies

• Precision medicine


Complex math underpins many use cases

Computational Finance

• Back testing

• Sophisticated modeling

• Portfolio optimization

• Risk management

Complex math

• covariance

• PCA, SVD

• cross validation

• bootstrapping

• cluster analysis

• Monte Carlo methods

• linear/logistic regression


P4’s native math library supports distributed processing

• Task parallelism

– ‘Embarrassingly parallel’ tasks

– Process subpopulations in parallel

– Run simulations in parallel

• Massively scalable complex math

– ‘Non-embarrassingly’ parallel tasks like large scale linear algebra

– Math operations that pass intermediate data between nodes

– Challenging O(n³) computations

– Math operations on data too large to fit on one node

• Large scale analytics without sampling

– Look at whole populations, big time windows, big regions

– Sample when you want to; not to fit analytics package constraints

– Use all the data: sometimes you really want the long tail or the black swan
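
The ‘embarrassingly parallel’ case, where each subpopulation is processed independently, can be sketched in a few lines. This is only an illustration of the pattern using Python's standard library on one machine; it is not how SciDB schedules work across instances, and the function names and sample data are hypothetical.

```python
# Sketch of task parallelism: each subpopulation is summarized independently,
# so the work items need no communication with each other.
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def summarize(item):
    """Return (name, mean) for one subpopulation."""
    name, values = item
    return name, mean(values)


def summarize_in_parallel(subpopulations, max_workers=4):
    """Map summarize() over all subpopulations using a worker pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(summarize, subpopulations.items()))


if __name__ == "__main__":
    data = {"region_a": [1, 2, 3], "region_b": [10, 20]}
    print(summarize_in_parallel(data))
```

The ‘non-embarrassingly’ parallel case differs precisely because workers would have to exchange intermediate results, which this pattern does not cover.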


Query language seamlessly integrates data manipulation & math

• Array Query Language (AQL)

– Declarative SQL-like language extended for working with array data

– Large-scale math operations embedded in queries

• Extensible

– Add user-defined types and functions

• R, Python, and other client interfaces

Compute the log odds ratio for a failure model using logistic regression

SELECT * FROM LOGISTREGR (model_matrix, success_count, failure_count, 'coefficients')
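
The slide does not spell out what LOGISTREGR computes internally. As a minimal sketch of the underlying quantity: when a logistic regression has an intercept and a single binary predictor, the fitted slope equals the log odds ratio of the two groups, which can be computed directly from success and failure counts. The function names and counts below are hypothetical.

```python
# Sketch: log odds and log odds ratio from success/failure counts.
# Covers only the single-binary-predictor case; a general logistic
# regression needs an iterative fit, which is not shown here.
import math


def log_odds(success_count, failure_count):
    """Natural log of the odds success_count : failure_count."""
    return math.log(success_count / failure_count)


def log_odds_ratio(exposed, baseline):
    """Difference in log odds between two (success, failure) count pairs."""
    return log_odds(*exposed) - log_odds(*baseline)


if __name__ == "__main__":
    # Made-up counts: 30 successes / 10 failures vs. 10 successes / 30 failures.
    print(log_odds_ratio((30, 10), (10, 30)))
```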


Linear Algebra as Building Block

Mathematical and Data Manipulation Operations

multiply ( transpose ( Simple_Array ), Simple_Array );

regrid( Simple_Array, 10, 10, avg (v2) );

cumsum ( filter ( Simple_Array, v1 = 'Odd' ), I, v1 );
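
To make the three queries above concrete, here is a rough single-machine analogue in plain Python: a transpose-and-multiply, a regrid that averages fixed-size blocks, and a cumulative sum over filtered values. The helper names are hypothetical, the filter predicate is only loosely analogous to the one in the query, and none of this reflects how SciDB evaluates these operators on distributed chunks.

```python
# Sketch: small in-memory analogues of multiply/transpose, regrid, and
# a cumulative sum over a filtered sequence. Matrices are lists of rows.

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def multiply(a, b):
    """Matrix product of a (n x k) and b (k x m)."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def regrid_avg(m, block_rows, block_cols):
    """Replace each block_rows x block_cols block with its average."""
    out = []
    for r0 in range(0, len(m), block_rows):
        out_row = []
        for c0 in range(0, len(m[0]), block_cols):
            block = [v for row in m[r0:r0 + block_rows]
                     for v in row[c0:c0 + block_cols]]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def filtered_cumsum(values, keep):
    """Running total over only the values for which keep(value) is true."""
    total, out = 0, []
    for v in values:
        if keep(v):
            total += v
            out.append(total)
    return out


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    print(multiply(transpose(a), a))
    print(regrid_avg([[1, 2, 3, 4], [5, 6, 7, 8]], 2, 2))
    print(filtered_cumsum([1, 2, 3, 4, 5], lambda v: v % 2 == 1))
```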


Flexible Schema

• Ad hoc queries

– Don’t have to know a priori what questions you will want to ask of your data

• Change schema dynamically

– Values <=> dimensions

• Supports transparent data exploration and mining


Well-suited for storing, accessing, & analyzing images

[Figure: satellite images, healthcare images, GIS data]

Store metadata with the data

• Instrument id & calibration data

• Experimental conditions and variables

• Data set identifiers & comments


SciDB support for images

• Regrid operator

– Change resolution and coordinate systems

• Overlap

– Supports feature detection when features fall between nodes

• Support for multi-dimensional window operations

– Spatial averaging

• Non-integer dimensions

– Access image through spatio-temporal coordinate systems

– Astronomy (right ascension, declination)

– Remote sensing (lat, long, time)


SciDB array model: create array

CREATE ARRAY RGB < red : int16, green : int16, blue : int16 > [ longitude(double) = *, 10000, 0, latitude(double) = *, 10000, 0 ];

Attributes: red, green, blue

Dimensions: longitude, latitude

Dimension size: * indicates unbounded

Chunk size: 10000

Chunk overlap: 0


[SciDB] Scalable data management

[Figure: four SciDB instances: instance 1 (coordinator) and instances 2, 3, and 4 (workers)]


Soft scalability test on automotive telematics and location data

• This graph shows how performance scales when both the data volume and the number of instances are increased together

• Query computes a score for each driver based on how many other vehicles were driving at the same time, in the same areas as the driver

• If data is perfectly distributed and all operations in a query are perfectly parallelizable, the graph should be a flat line (zero slope)

[Figure: execution time relative to 1X (y-axis) versus scale factor (x-axis)]
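
The flat-line criterion can be checked numerically: divide each run's execution time by the 1X time, then fit a line through the (scale factor, relative time) points. The sketch below does this with an ordinary least-squares slope. The timings in the usage example are invented for illustration and are not measurements from this test.

```python
# Sketch: relative execution time and its least-squares slope versus scale
# factor. A slope near zero corresponds to the ideal flat line.

def relative_times(times_by_scale):
    """Map scale factor -> execution time divided by the 1X time."""
    base = times_by_scale[1]
    return {scale: t / base for scale, t in sorted(times_by_scale.items())}


def slope(points):
    """Least-squares slope of y against x for a dict {x: y}."""
    n = len(points)
    mean_x = sum(points) / n
    mean_y = sum(points.values()) / n
    sxx = sum((x - mean_x) ** 2 for x in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points.items())
    return sxy / sxx


if __name__ == "__main__":
    # Invented timings in seconds, keyed by scale factor.
    made_up = {1: 100.0, 2: 104.0, 4: 110.0}
    rel = relative_times(made_up)
    print(rel, slope(rel))
```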


New Data Window operator

• Computes aggregates over rolling one-dimensional windows

– skipping over empty array cells

– particularly useful for analysis of time series events that happen at varying frequencies

• Data window accepts an input array, a dimension name, number of preceding values, number of following values, and a list of aggregate calls

data_window (input_array, dim_name, num_preceding, num_following, aggregate1(attribute1), aggregate2(attribute2)...)
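
The behaviour described above can be sketched for a sparse one-dimensional array held as a dictionary of non-empty cells. Each window is counted in non-empty cells rather than in coordinate distance, which is what lets it skip gaps. Truncating windows at the array ends is an assumption of this sketch, as are the function signature and the sample data.

```python
# Sketch: rolling aggregate over the non-empty cells of a sparse 1-D array.
# cells maps coordinate -> value; missing coordinates are empty cells.

def data_window(cells, num_preceding, num_following, aggregate):
    """Aggregate each non-empty cell with its neighbouring non-empty cells."""
    coords = sorted(cells)
    result = {}
    for idx, coord in enumerate(coords):
        lo = max(0, idx - num_preceding)               # truncate at the start
        hi = min(len(coords), idx + num_following + 1) # truncate at the end
        result[coord] = aggregate([cells[c] for c in coords[lo:hi]])
    return result


if __name__ == "__main__":
    events = {0: 1.0, 3: 2.0, 10: 4.0, 11: 8.0}  # irregularly spaced
    print(data_window(events, 1, 1, sum))
```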


Analyzing event data

• Event hot spots

– Look at which specific sets of locations (at the lat-long level) have the most hard acceleration and hard braking events (count or volume normalized metric)

– Profile hot spots by day of week, time of day

• Event windows

– Look at a 30 second window before and after each hard braking and hard acceleration event

– Look for patterns to predict adverse events or profile drivers
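
Extracting the 30 seconds before and after each event is a range lookup on time-ordered samples. A minimal sketch using binary search follows; it assumes samples are (timestamp in seconds, value) pairs already sorted by timestamp, and the names and sample data are hypothetical.

```python
# Sketch: collect the samples within +/- half_width seconds of each event.
from bisect import bisect_left, bisect_right


def event_windows(samples, event_times, half_width=30):
    """Return {event_time: samples with event-half_width <= t <= event+half_width}."""
    times = [t for t, _ in samples]
    windows = {}
    for event in event_times:
        lo = bisect_left(times, event - half_width)   # first sample in range
        hi = bisect_right(times, event + half_width)  # one past the last
        windows[event] = samples[lo:hi]
    return windows


if __name__ == "__main__":
    # One made-up speed reading every 10 seconds.
    readings = [(t, 50.0 - t / 10) for t in range(0, 100, 10)]
    print(event_windows(readings, [50]))
```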


Manage data throughout its life cycle

• Data is never overwritten

– Preserve raw data, corrected data, and updates in the database

– Facilitates reproducibility, audits, compliance

– Supports model development and testing: what-if modeling, scenario testing, back-testing, sensitivity testing

• Updates are versioned
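
The no-overwrite idea can be illustrated with a toy store in which every update creates a new version and older versions stay readable. This is a conceptual sketch only: the class and method names are hypothetical, and keeping a full snapshot per version is a simplification chosen for clarity, not a description of SciDB's storage.

```python
# Sketch: a toy no-overwrite array store. Cells are keyed by coordinates.

class VersionedArray:
    def __init__(self):
        self._versions = []  # one full snapshot (dict) per version

    def store(self, updates):
        """Apply updates on top of the latest version; return the new version number."""
        current = dict(self._versions[-1]) if self._versions else {}
        current.update(updates)
        self._versions.append(current)
        return len(self._versions)  # versions are numbered from 1

    def read(self, version=None):
        """Return a copy of the requested version (latest if omitted)."""
        if not self._versions:
            return {}
        if version is None:
            version = len(self._versions)
        return dict(self._versions[version - 1])


if __name__ == "__main__":
    arr = VersionedArray()
    arr.store({(0, 0): 1.0})               # raw value
    arr.store({(0, 0): 1.5, (0, 1): 2.0})  # corrected value plus a new cell
    print(arr.read(1), arr.read())
```

Because version 1 is still readable after the correction, an earlier analysis can be re-run against exactly the data it originally saw.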


Client Interfaces

• iquery interactive command line query interface

• Python, C++, R clients

• GUI (forms) interface coming

• Open source client API – roll your own!


What about Hadoop?

• Hadoop alone is not a DBMS

– No indexes, updates, data consistency, metadata

• Modules (Hadoop, Pig, Hive, HBase, HDFS) are loosely integrated and require a lot of glue code

• Requires skilled development staff to write custom code and maintain clusters

• Slower than a real parallel distributed database so needs more HW

• Linear algebra operators are hard to implement as a map and a reduce

• See Stonebraker, Kepner CACM blog post: Possible Hadoop Trajectories http://cacm.acm.org/blogs/blog-cacm/149074-possible-hadoop-trajectories


What about NoSQL like MongoDB?

Great for some use cases: match the tool to your requirements

• NoSQL and XML-based systems bake ‘schema’ into the application code or the records themselves

• NoSQL is most easily defined by what it excludes

– No schemas

– No query language

– Lacks the automatic data integrity of ACID databases

– No support for joins, which are useful when working with multiple data sources

– Requires coding to walk the data structures to manage data and extract information

– Harder to collaborate and share data across groups

– More custom code than a DB means potential longer term maintenance and data archiving issues

• Paradigm4 offers the flexibility of object-oriented data schemas without sacrificing ACID database integrity or ad hoc query support


SciDB and Paradigm4

• SciDB is a global, open source community

– Scientists from many fields & computer scientists

– www.scidb.org

• Paradigm4, a commercial company, sponsors & manages SciDB

– Doing all the initial development for SciDB

– Sells and supports a commercial-quality release of SciDB

– Along with enterprise management tools (e.g. provisioning, security, recovery)

– And industry-specific add-ons

– www.paradigm4.com


Get more from your analytical database

• Power, Productivity & Performance

– Less coding

– Less data movement

– Transparent scale-up & speed-up

– Prototypes scale to production without rewriting

– Lower cost deployment

• Highly pedigreed technical team

– CTO is Mike Stonebraker, renowned database researcher & entrepreneur

• Ready to work with early adopters


Big Complex Analytics combines data sources for novel insights & products

Automotive Telematics Healthcare Informatics


Big Complex Analytics powers population studies

> 70K tissue samples

> 65K gene probes per sample

covariance, clustering, SVD

> 10 million cars

GPS & driving data every sec

insurance by the trip & how you drive

linear regressions, risk & pricing modeling


Architecture

• ‘Shared Nothing’ cluster of commodity hardware nodes

• Interconnected with standard Ethernet and TCP/IP



SciDB Array Schema

CREATE ARRAY Simple_Array < v1 : double, v2 : int64, v3 : string > [ I = 0:*, 5, 0, J = 0:9, 5, 0 ];

Attributes: v1, v2, v3

Dimensions: I, J

Dimension size: * is unbounded

Chunk size: 5

Chunk overlap: 0
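
How the chunk size and overlap in a schema like Simple_Array partition the coordinate space can be sketched with integer arithmetic. The functions below find the origin of the chunk containing a cell and the coordinate range a chunk covers once overlap is included. The names are hypothetical, the bounds are not clipped to the array limits, and this is not a description of SciDB's actual chunk addressing.

```python
# Sketch: map cell coordinates to chunk origins, and compute the inclusive
# coordinate range of a chunk widened by its overlap.

def chunk_origin(coords, starts, chunk_sizes):
    """Lowest coordinate, per dimension, of the chunk holding coords."""
    return tuple(s + ((c - s) // n) * n
                 for c, s, n in zip(coords, starts, chunk_sizes))


def chunk_bounds(origin, chunk_sizes, overlaps):
    """Inclusive (low, high) per dimension, extended by the overlap."""
    return tuple((o - v, o + n - 1 + v)
                 for o, n, v in zip(origin, chunk_sizes, overlaps))


if __name__ == "__main__":
    # Simple_Array: I = 0:*, 5, 0 and J = 0:9, 5, 0
    origin = chunk_origin((7, 3), starts=(0, 0), chunk_sizes=(5, 5))
    print(origin, chunk_bounds(origin, (5, 5), (0, 0)))
```

With a non-zero overlap, neighbouring chunks share a border of cells, so a window or feature that straddles a chunk edge can be evaluated without fetching the neighbour.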


SciDB array model: data types

• Whole numbers: int8, int16, int32, int64

• Unsigned whole numbers: uint8, ..., uint64

• Date and Time: datetime

• Date and Time with timezone: datetimez

• Floating point: float, double

• Boolean: bool

• Character: char

• Variable-length strings: string


SciDB array model: Storage

• SciDB stores every attribute separately

• Good compression:

– RLE

– zlib

• Parallel processing


SciDB array model: 1D-array

• Chunk: unit of data processing

• Chunk should fit in memory entirely

• User chooses chunk size


SciDB array model: bitmap

• SciDB describes EMPTY values using a bitmap

• Bitmap is compressed efficiently with RLE
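
Why run-length encoding suits an empty-cell bitmap is easy to see in a small sketch: long stretches of identical bits collapse into (bit, run length) pairs. This is generic RLE for illustration, with hypothetical function names; the on-disk format SciDB uses is not described in these slides.

```python
# Sketch: run-length encode and decode a bitmap (1 = cell present, 0 = empty).

def rle_encode(bits):
    """Collapse consecutive equal bits into (bit, count) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(run) for run in runs]


def rle_decode(runs):
    """Expand (bit, count) pairs back into the original bit list."""
    return [bit for bit, count in runs for _ in range(count)]


if __name__ == "__main__":
    bitmap = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
    encoded = rle_encode(bitmap)
    print(encoded)
    assert rle_decode(encoded) == bitmap
```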


SciDB array model: clustering

• Several available chunk distributions:

– Round-Robin (default)

– Replication

• Optimizer splits queries into stages

• Every stage is processed in parallel

• Scatter/Gather (SG) intermediate results after every stage according to requirements

• Overlap helps decrease SG size (!)

• NO single point of failure

• NO single point of failure


SciDB array model: redundancy

• --redundancy=X

– Every chunk is replicated X times

• Single copy on every node

• Redundant chunks are used only when a node becomes unavailable

• We protect against network and disk failures

• Use RAID to protect against disk failures
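
A simple way to picture round-robin distribution combined with redundancy is to assign each chunk a primary instance by chunk number and place its X replicas on the following instances, so no instance ever holds two copies of the same chunk. The placement rule for replicas is an assumption made for this sketch, and the function name is hypothetical; the slides do not state how SciDB chooses replica locations.

```python
# Sketch: round-robin primary placement plus replicas on distinct instances.

def place_chunk(chunk_number, instance_count, redundancy):
    """Return (primary_instance, [replica_instances]) for one chunk."""
    if redundancy >= instance_count:
        raise ValueError("redundancy must be smaller than the instance count")
    primary = chunk_number % instance_count
    # Assumed rule: replicas go to the next instances in ring order.
    replicas = [(primary + k) % instance_count
                for k in range(1, redundancy + 1)]
    return primary, replicas


if __name__ == "__main__":
    for chunk in range(6):
        print(chunk, place_chunk(chunk, instance_count=4, redundancy=1))
```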


SciDB array model: release 12.7

• Time series

• Optimizations

• Binary loader (based on PostgreSQL binary loader)

• data_window operator


SciDB array model: next release

• Repair failed nodes from redundant data

• Elastic cluster:

– Increase/decrease node count


Contact

– Marilyn Matz

– CEO & co-founder

– 781 718 3999

– [email protected]

– www.paradigm4.com

– www.scidb.org
