Posted on 12-Jul-2015
©2013 BIZOSYS TECHNOLOGIES PRIVATE LIMITED
15 Billion computations in 187 milliseconds with a Big Join in Hadoop
Business Drivers
1. Support 6 months of data as opposed to 2 days
2. Near real-time calculation with optimal infrastructure
The Use-case: Assessing Market Risk of an Investment Portfolio
Acc   Equity   Qty
A1    MSFT     100
A1    ORCL     500
A2    CISCO    400
X
Equity   Model1   Model2
MSFT     $78.00   $77.12
ORCL     $33.78   $31.09
CISCO    $32.12   $16.00
What is the total portfolio value for Model1?
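The question above amounts to a join of the two tables on Equity, followed by a sum of quantity times price. A minimal sketch in Python using the sample data from this slide; the function and variable names are illustrative and not taken from the original system:

```python
# Positions: (account, equity, quantity), from the slide's sample table.
positions = [
    ("A1", "MSFT", 100),
    ("A1", "ORCL", 500),
    ("A2", "CISCO", 400),
]

# Price per equity under each price model, from the slide's sample table.
prices = {
    "MSFT": {"Model1": 78.00, "Model2": 77.12},
    "ORCL": {"Model1": 33.78, "Model2": 31.09},
    "CISCO": {"Model1": 32.12, "Model2": 16.00},
}


def portfolio_value(positions, prices, model):
    """Join positions with prices on equity and sum quantity * price."""
    return sum(qty * prices[equity][model] for _, equity, qty in positions)


def value_by_account(positions, prices, model):
    """Same join, grouped by account."""
    totals = {}
    for account, equity, qty in positions:
        totals[account] = totals.get(account, 0.0) + qty * prices[equity][model]
    return totals


total_model1 = portfolio_value(positions, prices, "Model1")
by_account_model1 = value_by_account(positions, prices, "Model1")
print(round(total_model1, 2), by_account_model1)
```

For Model1 the sum is 100 * 78.00 + 500 * 33.78 + 400 * 32.12 = 37,538.00.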
Problem with the Big Join:
Acc   Equity   Qty
A1    MSFT     100
A1    ORCL     500
A2    CISCO    400
(3M positions)
X
Equity   Model1   Model2
MSFT     $78.00   $77.12
ORCL     $45.12   $49.77
CISCO    $32.12   $16.00
(2M products * 5000 models/day)
= 15 billion calculations
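The 15 billion figure appears to come from valuing each of the 3M positions once under each of the 5000 price models. A quick check of that arithmetic in Python, along with the size of the price side of the join; the 8 bytes per price is an assumption borrowed from the "8 * 2 = 16M" figure later in the deck:

```python
POSITIONS = 3_000_000      # 3M positions (from the slide)
PRODUCTS = 2_000_000       # 2M products (from the slide)
MODELS_PER_DAY = 5_000     # 5000 price models per day (from the slide)
BYTES_PER_PRICE = 8        # assumption: one 8-byte value per price

# Each position is valued once under every price model.
calculations = POSITIONS * MODELS_PER_DAY

# Price side of the join: one price per product per model.
price_points = PRODUCTS * MODELS_PER_DAY
price_bytes = price_points * BYTES_PER_PRICE

print(f"{calculations:,} calculations per day")
print(f"{price_points:,} prices per day, about {price_bytes / 1e9:.0f} GB raw")
```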
Schema Design…
Price Model   DAY 1                                                               DAY N
Model 1       Product 1 - Price, Product 2 - Price, …, Product 2000000 - Price    …
…             …                                                                   …
Model 5000    …                                                                   …
Date          All Positions
XX-XXX-XXXX   Acc Id 1 – ProductId 1 - 23 stocks, …, Acc Id 22000 – ProductId 200000 - 111 stocks
Why is 1 price model packed in 1 HBase cell?
[Bar chart: GBs required for product-price model data, axis 0 to 600, comparing "2M Products in 1 Cell" with "2M Products in 2M Cells"]
Eventual consistency overhead
Get rid of “HBase Cell meta-data” payload
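A rough sketch of why packing helps, in Python. Every HBase cell is stored together with its own row key, column family, qualifier and timestamp, so 2M separate cells repeat that metadata 2M times, while one packed cell carries it once. The layout below (8-byte big-endian doubles addressed by product position) and the helper names are assumptions for illustration, and the per-cell size formula is only an approximation of a KeyValue-style layout, not a measurement:

```python
import struct


def pack_prices(prices):
    """Pack prices into one bytes blob: 8 bytes per price, big-endian."""
    return struct.pack(f">{len(prices)}d", *prices)


def price_at(blob, product_index):
    """Read one price by its position without unpacking the whole blob."""
    return struct.unpack_from(">d", blob, product_index * 8)[0]


def approx_cell_bytes(row_key, family, qualifier, value_len):
    """Approximate stored size of one cell (assumed KeyValue-style layout)."""
    key_len = 2 + len(row_key) + 1 + len(family) + len(qualifier) + 8 + 1
    return 4 + 4 + key_len + value_len


prices = [78.00, 33.78, 32.12]   # sample prices from the earlier slide
blob = pack_prices(prices)

# One cell holding every price vs. one cell per product (hypothetical names).
one_cell = approx_cell_bytes(b"model-0001", b"p", b"all", len(blob))
many_cells = sum(
    approx_cell_bytes(b"model-0001", b"p", str(i).encode(), 8)
    for i in range(len(prices))
)
print(len(blob), one_cell, many_cells)
```

Under the same assumption, 2,000,000 prices pack into 16,000,000 bytes, which matches the "8 * 2 = 16M" figure used later in the deck.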
Why is the region server set at 16*64 MB?
1 thread per price model
64 price models/machine
78 64-core machines** @ 78 region servers
Enable parallel computing
**Based on the scalability factor from performance testing (150 ms/price model with parallel computing)
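The sizing above can be checked with a few lines of Python. This is a sketch under two assumptions: that "16*64 MB" means 64 packed price models of 16 MB each held by one region server, and that each 64-core machine runs one thread per price model:

```python
import math

PRICE_MODELS = 5_000       # price models per day (from the slides)
MODELS_PER_MACHINE = 64    # one thread per model on a 64-core machine (from the slides)
CELL_MB = 16               # assumption: one packed price model is 16 MB

region_mb = CELL_MB * MODELS_PER_MACHINE              # 16 * 64
machines_exact = PRICE_MODELS / MODELS_PER_MACHINE
machines_floor = PRICE_MODELS // MODELS_PER_MACHINE
machines_ceil = math.ceil(machines_exact)

print(region_mb, machines_exact, machines_floor, machines_ceil)
```

The exact ratio is 78.125, so the deck's 78 machines correspond to rounding down; a strict ceiling would give 79. The footnote attributes the 78 figure to the scalability factor observed in performance testing.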
Why are HBase coprocessors used?
[Diagram: HBase Coprocessor across the regions]
Region 1, Machine 1: 1 cell = 1st price model = 2 million product prices = 8 * 2 = 16M
Region 2, Machine 1: 1 cell = 2nd price model = 2 million product prices = 8 * 2 = 16M
Region 78, Machine 78: 1 cell = 5000th price model = 2 million product prices = 8 * 2 = 16M
Mapper, Mapper, Mapper -> final output of models -> Reducer -> Value @ Risk output for 1 day
Map-Reduce does not jam the network.
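The idea in the diagram can be simulated in plain Python: each region values the positions against its own price model next to the data and returns a single number, so the reducer receives a handful of totals instead of millions of prices. This is not the HBase coprocessor API; the function names are hypothetical, and the slides do not say how the final Value at Risk figure is derived from the per-model totals, so the sketch stops at ranking them:

```python
def region_map(positions, model_prices):
    """Runs next to the data: value every position under one price model.

    Returns a single number, so only a few bytes leave the region.
    """
    return sum(qty * model_prices[product] for product, qty in positions)


def reduce_values(values_by_model):
    """Collect the per-model totals into one list sorted by value."""
    return sorted(values_by_model.items(), key=lambda kv: kv[1])


# Sample data from the "Problem with the Big Join" slide.
positions = [("MSFT", 100), ("ORCL", 500), ("CISCO", 400)]
models = {
    "Model1": {"MSFT": 78.00, "ORCL": 45.12, "CISCO": 32.12},
    "Model2": {"MSFT": 77.12, "ORCL": 49.77, "CISCO": 16.00},
}

per_model = {mid: region_map(positions, p) for mid, p in models.items()}
ranked = reduce_values(per_model)
worst_model, worst_value = ranked[0]
print(ranked)
```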
Why is price-model-id stored as row-key?
Reading sequentially (HBase Scanner) is a lot faster than random row reads
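HBase keeps rows sorted by row key, so a contiguous range of price models can be read with one seek followed by a sequential scan. The sketch below imitates that with a sorted list and `bisect`; the zero-padded key format is a hypothetical choice made so that byte order matches numeric order, and is not taken from the slides:

```python
import bisect


def row_key(model_id):
    """Hypothetical key format: zero-padded so byte order equals numeric order."""
    return f"model-{model_id:04d}".encode()


# A sorted key space, standing in for rows stored in row-key order.
keys = [row_key(i) for i in range(1, 5001)]


def scan(keys, start_id, stop_id):
    """Range scan: one seek to the start key, then a sequential read."""
    lo = bisect.bisect_left(keys, row_key(start_id))
    hi = bisect.bisect_left(keys, row_key(stop_id))
    return keys[lo:hi]


batch = scan(keys, 65, 129)   # price models 65 to 128 inclusive
print(len(batch), batch[0], batch[-1])
```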
The Final Building Blocks
Hadoop Distributed File System
Hadoop Map-Reduce                        Hadoop HBase
HSearch Indexer                          HSearch Coprocessor
MR Indexing Job with Lucene Analyzers    VAR RealTime MR Plug-In
HSearch Adapter
VAR Computation Application
Batch Mode Indexing                      Real-Time Computation
Why We Like HBase
• Scalable
• Real-Time
• Apache Licensed
Why We Built HSearch
• Search and Analysis inside Hadoop
• Real-time Map-Reduce
• Extreme Parallelization
WHY
• Index, search and analyze multi-structure big data in milliseconds.
• Search/analyze as events unfold - for any additions or changes at sources.
• Plug in custom algorithms/code with runtime data grouping and computing.
HOW
• Distribute the index with auto-sharding and auto-replication - handle big data.
• Parallelize indexing, searching and grouping - in milliseconds.
• Binary serde, compress and (optionally) encrypt at storage and transmission - securely.
• Cache everything - serving thousands of users.
• Make everything redundant - with very limited support engineers.
Available on hadoopsearch.net (Apache Licensed)
For more information regarding Bizosys business, please write to sunil@bizosys.com
http://www.bizosys.com