Ten things to consider for interactive analytics on write once workloads
Ten things to consider
for Interactive Analytics on high volume, write-once workloads
Full talk and demo at Fifth Elephant 2014
Abinash Karan
About
• CTO and Co-Founder at Bizosys Technologies since 2009
• Created HSearch, a real-time, distributed search and analytics engine built on the Hadoop platform
• Passionate about distributed systems and data structures
• Speaker at Fifth Elephant 2013, Microsoft TechEd 2012, Yahoo Hadoop India Summit 2011
• Developed the partitioning and read-optimized data structure modules for HSearch
• Worked with a range of search products including Lucene, Solr, Endeca and FAST
• Abinash is an engineering graduate of NIT Rourkela
Summary of what you will hear
CONTEXT – Write-once data load, e.g. time-series data.
Which Database?
1. SSD is Good
2. MPP is Good
3. Columnar is Good
4. Logical Partition is Good
5. Data Skew Partition is Good
6. Search Engine Index could lead to Index Explosion
7. Concurrent Users First, Single Query Performance Next
8. High Throughput File level Snapshot Loading
9. Calculate cost upfront
10. Data Structure makes a Big Difference
Which Database?
HBase
MongoDB
Shark
SAP HANA
i1010
HSearch
Riak
Hive
Dremel
1010data
Memcached
FoundationDB
Splunk
Elasticsearch
DynamoDB
Datameer
LevelDB
Netezza
Oracle TimesTen
Aerospike
Sybase IQ
Vertica
Accumulo
Hypertable
Solr
Concept#1 SSD and RAM are Good
[Diagram: an application server connected over the network to a database node (DB instance) with disk, SSD and RAM]
• Network: 50 microseconds
• Disk access: 20 milliseconds
• SSD: 100 microseconds
• RAM: 100 nanoseconds
• Data-hotness-based caching across RAM, SSD and disk
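The latency figures on this slide are easier to compare as ratios. The following is a minimal Python sketch of that arithmetic; the numbers are the slide's own, while the names `LATENCY_NS` and `slowdown` are illustrative and not part of the talk.

```python
# Access latencies quoted on the slide, converted to nanoseconds.
LATENCY_NS = {
    "ram": 100,            # 100 nanoseconds
    "network": 50_000,     # 50 microseconds
    "ssd": 100_000,        # 100 microseconds
    "disk": 20_000_000,    # 20 milliseconds
}


def slowdown(slower: str, faster: str) -> float:
    """Return how many times slower one medium is than another."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


print(f"disk vs SSD: {slowdown('disk', 'ssd'):.0f}x")
print(f"SSD vs RAM:  {slowdown('ssd', 'ram'):.0f}x")
print(f"disk vs RAM: {slowdown('disk', 'ram'):.0f}x")
```

With these figures a disk access costs as much as 200 SSD reads, which is the motivation for keeping hot data on SSD and in RAM.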
Concept#2 MPP is Good
[Diagram labels: Application Server, MPP Node, Database Node (RAM, SSD, Disk), All Data, Computed Data, MPP Processing?]
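One reading of the "All Data" and "Computed Data" labels is that each MPP node aggregates its own local data and ships only the small computed result to the application server. The sketch below illustrates that reading in Python, using threads as stand-ins for nodes; the function names and sample data are hypothetical, and this is not how any particular MPP engine is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def node_partial_sum(rows):
    """Work done on one (hypothetical) MPP node: aggregate its local rows."""
    return sum(rows)


def mpp_sum(node_data):
    """Fan the aggregation out to every node, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(pool.map(node_partial_sum, node_data))
    # Only one number per node travels back to the application server.
    return sum(partials)


# Three nodes, each holding its own slice of the data.
node_data = [[1, 2, 3], [10, 20], [100]]
total = mpp_sum(node_data)
print(total)
```

The application server receives three partial sums instead of six rows; with real data volumes the gap between "all data" and "computed data" is what makes the approach attractive.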
Concept#4 Logical Partition is Good
Query: Select sum(col3) where col2 = 2014
• Partitions by year: 2012 data (180 million rows) ... 2014 data (500 million rows)
• Complete dataset: 1 billion rows
• Partitioned data: 500 million rows
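The effect of a logical partition on the query above can be sketched in a few lines of Python. This is an illustration under assumptions: rows are `(col2, col3)` tuples, partitions are plain in-memory lists keyed by `col2`, and the helper names are hypothetical.

```python
from collections import defaultdict


def build_partitions(rows):
    """Group (col2, col3) rows by col2, the partitioning column."""
    partitions = defaultdict(list)
    for col2, col3 in rows:
        partitions[col2].append(col3)
    return partitions


def sum_col3_where_col2(partitions, value):
    """Answer `Select sum(col3) where col2 = value` by reading one partition."""
    return sum(partitions.get(value, []))


rows = [(2012, 5), (2013, 7), (2014, 10), (2014, 20), (2012, 1)]
partitions = build_partitions(rows)
result = sum_col3_where_col2(partitions, 2014)
print(result)
```

The query only scans the rows stored under the 2014 key, which mirrors the slide's reduction from the complete dataset to the single partition that can contain matches.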
Concept#5 Data Skew Partition is Good (Paging)
Query: Select sum(col3) where col2 = 2014
• Partitions by year: 2012 data (180 million rows) ... 2014 data (500 million rows)
• Whole partition: 500 million rows in memory
• Paged partition (5 million ... 5 million): 5 million rows in memory
• Diagram label: Stringer
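Paging a skewed partition can be sketched as aggregating one fixed-size page at a time, so that the working set is a page rather than the whole partition. The Python below is a minimal illustration; the helper names are hypothetical, and the list held in memory stands in for pages that a real engine would read from storage.

```python
def pages(partition, page_size):
    """Yield fixed-size pages of a partition, one at a time."""
    for start in range(0, len(partition), page_size):
        yield partition[start:start + page_size]


def paged_sum(partition, page_size):
    """Aggregate page by page so only one page is processed at a time."""
    total = 0
    for page in pages(partition, page_size):
        total += sum(page)
    return total


# Stand-in for the large 2014 partition; a real system would stream pages from disk.
partition_2014 = list(range(1, 101))
result = paged_sum(partition_2014, page_size=5)
print(result)
```

The result is the same as summing the whole partition at once; only the amount of data handled per step changes.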
Concept#6 Search Index may lead to Index Explosion
• Index size can be X times larger than the original data size
• Index size can be X times smaller than the original data size
• Repeated values vs. unique values (sample values: 1 2 2 2 8 4)
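A rough way to see why value repetition matters for index size is to build a toy inverted index and count its keys. The Python sketch below does this for the slide's sample values and for a hypothetical column with no repeats; the key count is only a crude proxy for index size, and the pairing of "repeated" with a smaller index is an interpretation of the slide rather than a measured result.

```python
def inverted_index(values):
    """Map each distinct value to the list of row positions where it occurs."""
    index = {}
    for row_id, value in enumerate(values):
        index.setdefault(value, []).append(row_id)
    return index


repeated = [1, 2, 2, 2, 8, 4]   # sample values from the slide
unique = [1, 2, 3, 5, 8, 4]     # hypothetical column with no repeated values

repeated_keys = len(inverted_index(repeated))
unique_keys = len(inverted_index(unique))
print(repeated_keys, unique_keys)
```

With repeated values several rows share one key; with unique values the index needs one key per row in addition to the row positions, so it can outgrow the data it indexes.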
Concept#7 Concurrent Users First, Single Query Performance Next
• 1 user, 10% CPU, 200 ms
• 1 user, 70% CPU, 175 ms
• Support 6 concurrent users
Concept#8 High Throughput File level Snapshot Loading
• Insert 1 row in 1 sec; 1 million rows in 1 sec
• Insert 1 row in 1 ms; 1 million rows in 1 hour
• Backup
• Move the snapshot file
• Distributed index building
• Splitting
• Compaction
Concept#9 Calculate cost upfront
• Support existing SQLs
• No new servers
• New Process Instance
• New Language
• No Monitoring
• Hardware Cost Per Byte: SSD-RAM, Engine Efficiency, Spot Instance – Reserved Instance, Indexes @ Compute Node - Data Node
• Maintenance Cost: Skill Acquisition, Dashboard
• App Dev/Migration Cost: Existing SQLs to custom SQL/JSON
Concept#10 Data Structure makes a Big Difference
[Comparison table; cell values not recoverable]
• Data structures compared: CSV / JSON / TSV, KV, Secondary Index, Inverted Index, Lazy Sorted, Binary Serde
• Operations compared: Append, Update, Delete, GET, Select (Repeat Data), Select (Non-Repeat Data), Filter (Repeat Data), Filter (Non-Repeat Data), Nulls
* Custom Variations: RC File, ORC File, Parquet
1. Size Reduction on Index
2. Compressibility
3. Fast Access