Ten things to consider for interactive analytics on write once workloads

Ten things to consider for Interactive Analytics on high volume, write-once workloads

Full talk and demo at Fifth Elephant 2014

Abinash Karan


http://www.bizosys.com/

About

• CTO and Co-Founder at Bizosys Technologies since 2009

• Created HSearch – a Real-time, distributed search and analytics engine built on Hadoop platform

• Passionate about distributed systems and data structures

• Speaker at Fifth Elephant 2013, Microsoft TechEd 2012, Yahoo Hadoop India Summit 2011

• Developed the partitioning and read-optimized data structure modules for HSearch

• Worked with a range of search products including Lucene, Solr, Endeca and FAST

• Abinash is an engineering graduate of NIT Rourkela

Summary of what you will hear

CONTEXT: Write-once data load, e.g. time-series data.

Which Database?

1. SSD is Good

2. MPP is Good

3. Columnar is Good

4. Logical Partition is Good

5. Data Skew Partition is Good

6. Search Engine Index could lead to Index Explosion

7. Concurrent Users First, Single Query Performance Next

8. High Throughput File level Snapshot Loading

9. Calculate cost upfront

10. Data Structure makes a Big Difference

Which Database?

HBase, MongoDB, Shark, SAP HANA, HSearch, Riak, Hive, Dremel, 1010data, Memcached, FoundationDB, Splunk, Elasticsearch, DynamoDB, Datameer, LevelDB, Netezza, Oracle TimesTen, Aerospike, Sybase IQ, Vertica, Accumulo, Hypertable, Solr

Concept#1 SSD and RAM are Good

[Diagram: an Application Server reaches the DB Instance on a Data Node over the network (50 microseconds). Access latency on the node: disk 20 milliseconds, SSD 100 microseconds, RAM 100 nanoseconds.]

[Diagram: the same setup with data-hotness-based caching across RAM, SSD and disk on the Database Node.]
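The latency figures on the slide can be turned into a rough estimate of average access time under hotness-based caching. The sketch below is illustrative only: the four latencies come from the slide, while the hit ratios and the function name `expected_latency` are assumptions made up for the example.

```python
# Latencies from the slide, expressed in seconds.
NETWORK = 50e-6   # 50 microseconds
DISK = 20e-3      # 20 milliseconds
SSD = 100e-6      # 100 microseconds
RAM = 100e-9      # 100 nanoseconds


def expected_latency(ram_hit, ssd_hit):
    """Average storage latency when hot data is served from RAM, warm data
    from SSD and the remainder from disk. Hit ratios are fractions (0..1)."""
    disk_hit = 1.0 - ram_hit - ssd_hit
    if disk_hit < 0:
        raise ValueError("hit ratios must not exceed 1.0 in total")
    return ram_hit * RAM + ssd_hit * SSD + disk_hit * DISK


# Hypothetical hit ratios, chosen only for illustration.
all_disk = expected_latency(ram_hit=0.0, ssd_hit=0.0)
tiered = expected_latency(ram_hit=0.80, ssd_hit=0.15)

print(f"disk only  : {all_disk * 1e3:.3f} ms")
print(f"tiered     : {tiered * 1e3:.3f} ms")
print(f"disk vs SSD: {DISK / SSD:.0f}x, SSD vs RAM: {SSD / RAM:.0f}x")
```

Even with only a small share of reads falling through to disk, the disk term dominates the average, which is why keeping the hot data set on SSD and RAM matters more than the network hop.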

Concept#2 MPP is Good

[Diagram: MPP processing. The MPP Node reads all data from disk and sends only the computed data to the Application Server.]
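The idea of shipping only computed data to the application can be sketched as a scatter-gather sum. This is a toy model, not how any particular MPP engine works: threads stand in for nodes, and `node_data`, `node_partial_sum` and `mpp_sum` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the rows stored on one MPP node's local disk.
node_data = [
    [3, 5, 8],
    [1, 4],
    [7, 2, 6, 9],
]


def node_partial_sum(rows):
    """Runs 'on the node': scans all local rows, returns one computed value."""
    return sum(rows)


def mpp_sum(partitions):
    """Fan the query out to every node, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(node_partial_sum, partitions))
    return sum(partials), partials


total, partials = mpp_sum(node_data)
print(partials, total)
```

Only one number per node crosses the network here, regardless of how many rows each node scanned.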

Concept#3 Columnar is Good

[Diagram: a row of six column figures (1 2 2 2 8 4), the figure 12, and a total size of 228 bytes.]

Opens 84 bytes*

* Filter on Col1 and display Col6
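A simplified byte count shows why a columnar layout reads less data. Reading the slide's figures as six column widths in bytes and 12 rows is an assumption; it reproduces the 228-byte total, but this model ignores any storage overhead, so its result for the filter-and-display query is not expected to match the slide's 84 bytes.

```python
# Assumed reading of the slide: six columns with these widths in bytes, 12 rows.
COLUMN_WIDTHS = {"col1": 1, "col2": 2, "col3": 2, "col4": 2, "col5": 8, "col6": 4}
ROWS = 12


def bytes_read_row_store(widths, rows):
    """A row store reads whole rows, whatever columns the query needs."""
    return sum(widths.values()) * rows


def bytes_read_column_store(widths, rows, needed):
    """A column store reads only the columns the query touches."""
    return sum(widths[c] for c in needed) * rows


row_bytes = bytes_read_row_store(COLUMN_WIDTHS, ROWS)
col_bytes = bytes_read_column_store(COLUMN_WIDTHS, ROWS, ["col1", "col6"])
print(row_bytes, col_bytes)
```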

Concept#4 Logical Partition is Good

Select sum(col3) where col2 = 2014

[Diagram: the complete dataset (1 billion rows) compared with the partitioned data (500M rows); one partition is labelled "2012 data, 180 million".]
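Partition pruning for the query on this slide can be sketched with a dictionary keyed by the filter column. The rows and values are made up, and `sum_col3_where_col2` is a hypothetical helper; real engines prune partitions from metadata rather than from an in-memory dictionary.

```python
from collections import defaultdict

# Tiny stand-in rows: (col2 = year, col3 = value). Values are made up.
rows = [(2012, 10), (2013, 20), (2014, 30), (2014, 5), (2012, 7)]

# Logical partitioning: group rows by the column that queries filter on.
partitions = defaultdict(list)
for year, value in rows:
    partitions[year].append(value)


def sum_col3_where_col2(year):
    """Select sum(col3) where col2 = year, touching only one partition."""
    scanned = partitions.get(year, [])
    return sum(scanned), len(scanned)


total, rows_scanned = sum_col3_where_col2(2014)
print(total, rows_scanned, len(rows))
```

The query scans only the rows in the 2014 partition instead of the complete dataset.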

Stringer

Concept#5 Data Skew Partition is Good (Paging)

Select sum(col3) where col2 = 2014

[Diagram: a large partition split into pages of 5 million rows each. Without paging: 500 million rows in memory. With paging: 5 million rows in memory.]
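Paging a skewed partition can be sketched with a generator that holds one page in memory at a time. This is a minimal illustration under assumed names (`pages`, `paged_sum`); 50 rows in pages of 5 stand in for the slide's 500 million rows in pages of 5 million.

```python
def pages(source, page_size):
    """Yield consecutive pages so only one page is held in memory at a time."""
    page = []
    for row in source:
        page.append(row)
        if len(page) == page_size:
            yield page
            page = []
    if page:
        yield page


def paged_sum(source, page_size):
    """Aggregate page by page and record the largest page ever held."""
    total = 0
    largest_page = 0
    for page in pages(source, page_size):
        total += sum(page)
        largest_page = max(largest_page, len(page))
    return total, largest_page


total, peak_rows = paged_sum(iter(range(50)), page_size=5)
print(total, peak_rows)
```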


Concept#6 Search Index may lead to Index Explosion

• Unique values: the index is X times larger than the original data

• Repeated values: the index is X times smaller than the original data

[Diagram: two rows of column figures (1 2 2 2 8 4).]
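The difference between repeated and unique values can be shown with a simplified size model of an inverted index: every distinct term is stored once, plus one posting per row. The 4-byte posting size, the sample values and both function names are assumptions for illustration, not measurements of any search engine.

```python
ROW_ID_BYTES = 4  # assumed size of one posting (a row id)


def data_size(values):
    """Bytes needed to store the raw column values."""
    return sum(len(v.encode("utf-8")) for v in values)


def inverted_index_size(values):
    """Each distinct term is stored once, plus one posting per row."""
    terms = set(values)
    return data_size(terms) + ROW_ID_BYTES * len(values)


n = 1000
repeated = ["error-code-%02d" % (i % 10) for i in range(n)]  # 10 distinct values
unique = ["id-%04d" % i for i in range(n)]                   # 1000 distinct values

for name, column in (("repeated", repeated), ("unique", unique)):
    ratio = inverted_index_size(column) / data_size(column)
    print(f"{name}: index is {ratio:.2f}x the data size")
```

In this model the index on the repeated column is smaller than the data, while the index on the unique column is larger than the data it indexes.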

Concept#7 Concurrent Users First, Single Query Performance Next

• 1 user: 10% CPU, 200 ms

• 1 user: 70% CPU, 175 ms

• Support 6 concurrent users

Concept#8 High Throughput File-level Snapshot Loading

• Insert 1 row in 1 sec; 1 million rows in 1 sec

• Insert 1 row in 1 ms; 1 million rows in 1 hour

• Backup

• Move the snapshot file

• Distributed index building

• Splitting

• Compaction
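File-level snapshot loading can be sketched as "build the file away from the query path, then move it into place". This is a minimal local-filesystem illustration, not HSearch's loader; the file name and the helpers `build_snapshot` and `publish_snapshot` are hypothetical.

```python
import os
import tempfile


def build_snapshot(rows, directory):
    """Write all rows to a temporary file outside the query path."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        for row in rows:
            f.write(",".join(str(col) for col in row) + "\n")
    return tmp_path


def publish_snapshot(tmp_path, live_path):
    """Move the finished file into place; os.replace is atomic when both
    paths are on the same filesystem."""
    os.replace(tmp_path, live_path)


with tempfile.TemporaryDirectory() as data_dir:
    live = os.path.join(data_dir, "readings.snapshot")  # hypothetical file name
    rows = ((i, i * 2) for i in range(100_000))
    publish_snapshot(build_snapshot(rows, data_dir), live)
    with open(live) as f:
        loaded = sum(1 for _ in f)
    print(loaded)
```

The move takes about the same time whether the file holds one row or a million, which is the contrast the slide draws with row-by-row inserts.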

Concept#9 Calculate cost upfront

• Support existing SQLs, no new servers

• New process instance

• New language

• No monitoring

• Hardware Cost Per Byte: SSD-RAM, Engine Efficiency, Spot Instance - Reserved Instance, Indexes @ Compute Node - Data Node

• Maintenance Cost: Skill Acquisition, Dashboard

• App Dev/Migration Cost: Existing SQLs to custom SQL/JSON

Concept#10 Data Structure makes a Big Difference

[Table: data structures compared by operation; the cell values are not recoverable from the transcript.]

Data structures (columns): CSV / JSON / TSV, KV, Secondary Index, Inverted Index, Lazy Sorted, Binary Serde

Operations (rows): Append, Update, Delete, GET, Select (Repeat Data), Select (Non-Repeat Data), Filter (Repeat Data), Filter (Non-Repeat Data), Nulls

* Custom Variations: RC File, ORC File, Parquet

1. Size Reduction on Index

2. Compressibility

3. Fast Access
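The three goals above can be illustrated by storing the same rows as CSV text and as fixed-width binary records. This is a generic sketch using Python's `struct` and `zlib`, not one of the structures from the talk; the sample readings and the helper `read_row` are made up.

```python
import struct
import zlib

# Made-up readings: (sensor id, temperature). Many repeated values.
rows = [(1_000_000 + i % 50, 60 + i % 20) for i in range(10_000)]

# Text layout: one CSV line per row.
csv_bytes = "".join(f"{sid},{temp}\n" for sid, temp in rows).encode("ascii")

# Binary layout: two little-endian 32-bit integers per row (8 bytes).
record = struct.Struct("<ii")
binary_bytes = b"".join(record.pack(sid, temp) for sid, temp in rows)


def read_row(buffer, index):
    """Fixed-width records allow direct access to row `index` without parsing."""
    return record.unpack_from(buffer, index * record.size)


print("csv   :", len(csv_bytes), "bytes,", len(zlib.compress(csv_bytes)), "compressed")
print("binary:", len(binary_bytes), "bytes,", len(zlib.compress(binary_bytes)), "compressed")
print(read_row(binary_bytes, 1234))
```

The binary layout is smaller before compression and lets a reader jump straight to any row, whereas the CSV layout has to be scanned and parsed line by line.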

10 CONCEPT DEMONSTRATION

HSEARCH DEMO

Demo table columns: HVAC ID, BuildingID, READING_TIME, INLET TEMP, OUTLET TEMP, ERROR MESSAGE
