The New Analytical DB for the Hadoop Platform

Sept 2012

Agenda

● Where is the (Big) data?
● How “big” is Big Data?
● Approaches to working with data
  – Transactional/operational systems
  – Analytical systems
● Hadapt
● Hadapt compared to HBase
● Who we are and where we come from
● Hadapt in Poland
● What's next in Hadapt

Big Data: Volume | Variety | Velocity

Source: wikibon.org

• 2,500 exabytes of new information in 2012

• “Digital universe” grew by 62% last year to 800K petabytes & will grow to 1.2 zettabytes this year

• 80% of data is typically not in data warehouses

Data Beats Algorithms

“I’m at Google because that’s where the data is.”

-- Peter Norvig, on why he left NASA for Google in 2001

Databases

Datastores

Where did Hadapt come from and Why?

“Digital universe” grew by 62% last year to 800K petabytes & will grow to 1.2 zettabytes this year

“How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did”

“Why Netflix produces BBC remake starring Kevin Spacey, directed by David Fincher”

Differences of Purpose: “Transaction Processing”

Operational systems
● Optimized for small, short random-access reads and writes
● e.g. record that a person bought 20 shares of a company on the stock market *or* record that a user posted something on another user's “wall”

Traditional DB examples
● Oracle
● MySQL

NoSQL examples
● HBase
● MongoDB
● Cassandra

Differences of Purpose: Analytics

Analytics
● Optimized for read-only computations about large amounts of data
● e.g. compute the average amount invested in bond funds and stock funds for all employees at all employers over the last 5 years

DB examples
● Netezza
● Vertica

NoSQL examples
● Hive
● Pig
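The contrast between the two workloads can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module; the table name, columns, and sample rows are invented for the example and do not come from any of the systems named on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE holdings (employee TEXT, employer TEXT, fund_type TEXT, amount REAL)"
)

# Transactional style: small, short writes that touch one record at a time.
def record_investment(employee, employer, fund_type, amount):
    conn.execute(
        "INSERT INTO holdings VALUES (?, ?, ?, ?)",
        (employee, employer, fund_type, amount),
    )

# Hypothetical sample data.
record_investment("alice", "Acme", "stock", 2000.0)
record_investment("alice", "Acme", "bond", 1000.0)
record_investment("bob", "Newco", "stock", 4000.0)
record_investment("bob", "Newco", "bond", 3000.0)

# Analytical style: a read-only computation that scans every row.
def average_by_fund_type():
    rows = conn.execute(
        "SELECT fund_type, AVG(amount) FROM holdings "
        "GROUP BY fund_type ORDER BY fund_type"
    ).fetchall()
    return dict(rows)

averages = average_by_fund_type()
```

The write path touches a single row per call, while the aggregate has to read the whole table, which is why the two kinds of system are tuned so differently.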

[Figure: sample charts comparing “Actual”, “Option 1”, and “Option 2” across companies and months (Oct–Mar)]

The evolution of analytics – where are we today?

The early stages of analytics
• Market Basket Analysis
• Trend Analysis
• Cyclical Analysis
• Customer Segmentation

New Analytical Models
• Pattern Detection, Discovery, Matching
• A/B Testing and Behavioral Analysis
• Sessionization
• Social Correlation Analysis
• Fractional Attribution
• Sentiment Analysis
• Personalization

Hadapt – The Adaptive Analytical Platform for Big Data
● Company founded in early 2011; it is commercializing HadoopDB, a Yale University research project by Kamil Bajda-Pawlikowski, led by Dr. Daniel Abadi
● Combines the benefits of Apache Hadoop and relational DBMS technology into a single system for applications that rely on multi-structured data analytics
● Designed for the cloud and optimized for virtualized environments
● Architected to leverage clusters of industry-standard (commodity) machines
● Provides the full power of MapReduce as well as SQL support and the ability to work with data within a single platform
● Based on findings from the HadoopDB project, it aims to achieve:
  – Performance and efficiency of MPP databases
  – Scalability, fault tolerance, and flexibility of MapReduce-based systems

Hadapt Analysis Process

[Diagram labels: Raw Data, load, enrich, query, analyze, predict, BI Tools, Applications, Predictive Analytics, Hadapt Bulk loader]

Multi-Structured Big Data Analytics Across Industries – Use Cases

Need for deep data analysis… on TBs to PBs of data… with minutes to seconds response times

Internet Use Cases
• Recommendation Engines
• Cross-channel Analysis
• Clickstream/Golden Path Analysis
• Right Offer at the Right Time
• Social networking graph analysis
• Ad Revenue Optimization

Financial Services & Insurance Use Cases
• Risk Warehousing
• e-Discovery
• Tick data back testing
• Anti-Money Laundering/Fraud Detection
• Customer Behavior Analysis

Retail Use Cases
• Customer Behavior Analytics
• Market & Consumer Segmentation
• Event and Behavior-based Targeting
• Affinity/Market Basket Analysis
• Loyalty Analytics

Communications, Media & Information Services Use Cases
• Price Optimization
• CDR Analysis
• Customer Churn Prevention
• Network Optimization
• Ad optimization

Common Requirements across these applications:
● Ad hoc analysis
● Structured & Unstructured data
● Rapid iteration
● Elastic scale out, cloud deployments

Hadapt Architecture

Master Node: HDFS Namenode, MapReduce Framework JobTracker
Node 1 … Node n: TaskTracker, Database, DataNode
Hadapt SQL Engine
[Diagram labels: Load & Query Tasks, MapReduce Job, SQL Query]

Hadapt – Key components – Query Engine

Flexible Query Interface
● Data can be queried using both SQL and MapReduce
● SQL can be embedded within MapReduce or vice versa
● JDBC/ODBC drivers for connectivity with customer-facing BI tools

Query Planner
● Queries are analyzed to consider data partitioning and distribution, indexes, and statistics to determine a query plan
● Split query execution ensures optimal use of the DBMS layer before pushing operations into Hadoop

Adaptive Query Execution
● In MPP databases the time to complete the query will be approximately equal to the time it takes the slowest compute node to complete its assigned task
● This dynamic is especially problematic in a cloud environment
● Query plans are adjusted dynamically based on cloud worker node performance
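The effect of a slow node can be illustrated with a toy model. The sketch below is not Hadapt's scheduler; it only compares a fixed equal split of work (completion time set by the slowest node) with a simple greedy scheme in which each chunk goes to whichever node frees up first. The function names, chunk count, and per-chunk timings are all assumptions made for the example.

```python
import heapq

def static_completion_time(num_chunks, seconds_per_chunk):
    """Equal split: the query ends when the slowest node finishes its share."""
    nodes = len(seconds_per_chunk)
    share = num_chunks // nodes  # assumes num_chunks divides evenly
    return max(share * s for s in seconds_per_chunk)

def adaptive_completion_time(num_chunks, seconds_per_chunk):
    """Greedy pull model: each chunk goes to the node that frees up first."""
    # Heap of (time at which the node becomes free, node index).
    free_at = [(0.0, i) for i in range(len(seconds_per_chunk))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_chunks):
        t, i = heapq.heappop(free_at)
        done = t + seconds_per_chunk[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish

# Hypothetical cluster: three normal nodes and one slow cloud node.
speeds = [1.0, 1.0, 1.0, 4.0]
static_t = static_completion_time(12, speeds)
adaptive_t = adaptive_completion_time(12, speeds)
```

With these made-up numbers the fixed split waits on the slow node for all three of its chunks, while the greedy scheme hands it only one.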

Hadapt – Key components – Data Engine

Data Loader
● Data is loaded using all machines in parallel
● Data is partitioned into small chunks and replicated across the cluster
● Optimizes query performance and fault tolerance

Data Manager
● Stores metadata about the schema, data, and chunk distribution
● Handles data replication, backups, recovery, and rebalancing of chunks across the cluster

Hybrid Storage Engine
● A DBMS engine is installed on each node in addition to a standard distributed file system (HDFS)
● DBMS layer is optimized for structured data and HDFS handles unstructured data
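As a rough illustration of "partitioned into small chunks and replicated across the cluster", the sketch below splits rows by a partition key and places each chunk on several distinct nodes. The modulo partitioning, round-robin placement, node names, and replication factor are assumptions chosen for simplicity, not a description of Hadapt's loader.

```python
from collections import defaultdict

def partition_into_chunks(rows, key, num_chunks):
    """Assign each row to a chunk by its integer partition key modulo num_chunks."""
    chunks = defaultdict(list)
    for row in rows:
        chunks[row[key] % num_chunks].append(row)
    return dict(chunks)

def place_replicas(num_chunks, nodes, replication):
    """Place each chunk on `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        c: [nodes[(c + r) % len(nodes)] for r in range(replication)]
        for c in range(num_chunks)
    }

# Hypothetical input: ten rows keyed by custkey, three nodes, two copies each.
rows = [{"custkey": k} for k in range(10)]
chunks = partition_into_chunks(rows, "custkey", num_chunks=4)
placement = place_replicas(
    num_chunks=4, nodes=["node1", "node2", "node3"], replication=2
)
```

Because every chunk lives on more than one node, losing a single node leaves a copy of each chunk available, which is the fault-tolerance benefit the slide refers to.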

More insight into the underlying technology: http://www.HadoopDB.net

HBase Data Model: Conceptual

From the BigTable paper: “a sparse, distributed, persistent multi-dimensional sorted map”

(row: bytestring, column family: bytestring, column: bytestring, time: int64) ---> bytestring

17

HBase Map

{
  "key_1": {
    "columnfamily_a": {
      "column_i": { 15: "y", 4: "m" },
      "column_ii": { 15: "d" }
    },
    "columnfamily_b": {
      "column_other": { 6: "w", 3: "o", 1: "w" }
    }
  }
}
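The map above can be modelled directly as nested Python dictionaries. The `get` helper below is a hypothetical illustration of addressing a cell by (row, column family, column, timestamp); it is not the HBase client API.

```python
# The slide's example map, keyed row -> family -> column -> timestamp -> value.
table = {
    "key_1": {
        "columnfamily_a": {
            "column_i": {15: "y", 4: "m"},
            "column_ii": {15: "d"},
        },
        "columnfamily_b": {
            "column_other": {6: "w", 3: "o", 1: "w"},
        },
    },
}

def get(table, row, family, column, timestamp=None):
    """Return the value at `timestamp`, or the newest version if omitted."""
    versions = table[row][family][column]
    if timestamp is None:
        timestamp = max(versions)
    return versions[timestamp]

latest = get(table, "key_1", "columnfamily_a", "column_i")
older = get(table, "key_1", "columnfamily_a", "column_i", timestamp=4)
```

Rows need not share the same columns, which is what makes the map sparse.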

Hadapt Data Model: Conceptual

Traditional Relational Tables

CUSTKEY  NAME     ADDRESS        NATIONKEY  PHONE         ACCTBAL     COMMENT
451234   NEWCORP  196 Broadway…  1          111-555-1212  $1,231,285  NULL
887765   ACME     1 Main st. …   2          222-555-1212  $46,945     “Top customer”

HBase Data Model: Physical

Every cell stored with row, family, column and timestamp
Allows fast lookup with low copy overhead

BUT

Space inefficient (optional compression available) and inefficient to scan

Row      Family  Column  Timestamp  Value
“key_1”  “cf_a”  “c_i”   15         “foo”
“key_1”  “cf_a”  “c_ii”  15         “bar”
“key_2”  “cf_a”  “c_ii”  4          “baz”
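A small sketch of why this layout favours point lookups over scans: the cells are kept as one sorted list of full (row, family, column, timestamp, value) tuples, so a lookup can binary-search while a column scan must walk every cell. This is a simplification for illustration (for instance, it ignores how real HBase orders multiple versions of a cell), and the helper names are made up.

```python
from bisect import bisect_left

# The three cells from the slide, sorted by their full coordinates.
cells = sorted([
    ("key_1", "cf_a", "c_i", 15, "foo"),
    ("key_1", "cf_a", "c_ii", 15, "bar"),
    ("key_2", "cf_a", "c_ii", 4, "baz"),
])

def lookup(cells, row, family, column):
    """Binary-search the sorted cells for the first match on the coordinates."""
    i = bisect_left(cells, (row, family, column))
    if i < len(cells) and cells[i][:3] == (row, family, column):
        return cells[i][4]
    return None

def scan_column(cells, family, column):
    """A column scan walks every cell and re-reads its repeated key parts."""
    return [c[4] for c in cells if c[1:3] == (family, column)]

hit = lookup(cells, "key_1", "cf_a", "c_ii")
values = scan_column(cells, "cf_a", "c_ii")
```

Every stored cell repeats its row, family, and column, which is the space overhead the slide mentions.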

Hadapt Data Model : Physical

Leverages RDBMS
Supports Normalized or Denormalized data models

Data Model / Workload Comparison

              Hadapt                            HBase
Conceptual    Relational tables                 Sparse sorted map
Schema        Structured                        Fluid
Data Density  Dense                             Sparse
Workload      Large scans, joins, aggregations  Point lookup, short range lookup, updates
Interface     SQL                               Custom API

Informal Performance Comparison

                        Hadapt                 HBase
Load / Ingest           Batch                  Fast!
Lookup speed            Few seconds            Fast!
Data warehouse queries  50x faster than HBase  Uh oh

Hadapt is NOT
● OLTP
● NoSQL Key/Value store
● CEP – streaming analysis
● Web Server
● File System

(but we do integrate with all of them)

Example HBase+Hadapt Application

Social Graph Input into Communication Monitoring System

HBase would provide real-time lookup and update of connected entities and their risk profiles for monitoring and alerting: incremental data capture and real-time detection.

Hadapt would periodically recalculate a rich entity connectivity model to be deployed to the HBase real-time persistence layer. It calculates the patterns that should be detected in real time.
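The division of labour can be sketched as a batch step that rebuilds a model and a real-time step that only does point lookups against it. Everything below is hypothetical: a plain dictionary stands in for the HBase layer, and the scoring rule (count of flagged neighbours) is invented purely to make the example concrete.

```python
# Hypothetical stand-in for the HBase real-time layer: entity -> risk score.
realtime_store = {}

def recompute_risk_model(edges, flagged):
    """Batch step (the Hadapt role): score each entity by flagged neighbours."""
    scores = {}
    for a, b in edges:
        scores.setdefault(a, 0)
        scores.setdefault(b, 0)
        if b in flagged:
            scores[a] += 1
        if a in flagged:
            scores[b] += 1
    return scores

def publish(scores):
    """Deploy the freshly computed model to the real-time layer."""
    realtime_store.clear()
    realtime_store.update(scores)

def should_alert(sender, receiver, threshold=1):
    """Real-time step (the HBase role): point lookups per message."""
    risk = max(realtime_store.get(sender, 0), realtime_store.get(receiver, 0))
    return risk >= threshold

# Hypothetical social graph with one flagged entity.
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]
publish(recompute_risk_model(edges, flagged={"cat"}))
alert = should_alert("bob", "ann")
quiet = should_alert("ann", "eve")
```

The expensive graph computation runs periodically over all the data, while each incoming message costs only a couple of key lookups.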

Hadapt Board

Chris Lynch, Chairman of the Board – Previously CEO of Vertica

Sharmila Shahani-Mulligan – Previously CMO of Aster Data

Felda Hardymon – Staples, Endeca, PTC, Vertica, Gartner, BladeLogic, Skype, LinkedIn, and many others

Matthew Howard - Avere Systems, Blue Jeans Network, ConteXtream, MobileIron, Pertino Networks, Retrevo, and many others

Daniel Abadi, Chief Scientist – Yale, MIT, known for C-Store (Vertica), and HadoopDB (Hadapt)

Hadapt Management

Justin Borgman - Chief Executive Officer and Co-Founder

Dr. Daniel Abadi - Chief Scientist and Co-Founder

Philip Wickline - Chief Technology Officer

Kamil Bajda-Pawlikowski - Chief Software Architect and Co-Founder

Kelly Stirman - Vice President of Customer Solutions

Scott Howser - Vice President of Marketing

Hadapt in Poland

● Kamil Bajda-Pawlikowski (HadoopDB is part of his PhD work) graduated from Wrocław University of Technology

● Hadapt has had contributors in Poland from its inception; most of them are still with us

● Hadapt now has a permanent location, an office in Warsaw

● Hadapt in Poland is now a registered company: Hadapt Polska sp. z o.o.

● Hadapt Polska is now hiring!

– We're looking for a couple of bright, excellent senior/principal software engineers with great OOP/system skills and experience in developing enterprise products

Hadapt in the Future

● In late 2011 Hadapt raised $9.5 million in funding and has been growing rapidly since, with a headcount of ca. 40 employees

● We already have several big customers in the USA, and are gaining more market attention every month

● A big announcement is coming in October at the Strata/HadoopWorld 2012 conference in New York

THANK YOU

Wojciech [email protected]
