QCon SF 2013: Top 10 Performance Gotchas for Scaling In-Memory Algorithms - Srisatish
H2O – The Open Source Math Engine!
Better Predictions!
4/23/13
H2O – Open Source in-memory Machine Learning for Big Data
Universe is sparse. Life is messy. Data is sparse & messy.
- Lao Tzu
Hadoop = opportunity
Not enough Data Scientists
Analysts won't code Java
H2O the Prediction Engine
Ad hoc Exploration
Math Modeling
Real-time Scoring
Big Data
Messy NAs
Clustering
Classification
Ensembles 100’s nanos
models
Regression
Group By Grep
Big Data: Exploration, Modeling, Real-time Scoring
No New API!
Approximate results each step!
Intellectual Legacy
Math needs to be free
Open Source
Support and Innovation
https://github.com/0xdata/h2o
All Top 10's are binary! - Anonymous
Data chunks > code chunks
TCP for Data. UDP for Control.
>> Generated Java Assist
10 Move Code not Data
JVM 4 Heap
JVM 1 Heap
JVM 2 Heap
JVM 3 Heap
A Frame: Vec[] (age, sex, zip, ID, car)
• Vecs aligned in heaps
• Optimized for concurrent access
• Random access any row, any JVM
A Chunk, Unit of Parallel Access
A season for variable-sized chunks and a season for uniform chunks.
Tightly packed! (A chunk is also the unit of batch!)
9 Chunk-ing Express!
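One way to picture "random access any row" over variable-sized chunks is to keep the first global row of every chunk and binary-search that table. The following is a minimal sketch under stated assumptions: `ChunkLayout`, `chunkIndex` and `offsetInChunk` are hypothetical names for illustration, not H2O's actual classes, and every chunk is assumed to hold at least one element.

```java
import java.util.Arrays;

// Hypothetical helper: maps a global row index to (chunk, offset).
// starts[i] is the global row of the first element of chunk i;
// the last entry of starts is the total row count.
final class ChunkLayout {
    private final long[] starts;

    ChunkLayout(long[] chunkLengths) {
        starts = new long[chunkLengths.length + 1];
        for (int i = 0; i < chunkLengths.length; i++) {
            if (chunkLengths[i] <= 0) {
                throw new IllegalArgumentException("empty chunk " + i);
            }
            starts[i + 1] = starts[i] + chunkLengths[i];
        }
    }

    long length() {
        return starts[starts.length - 1];
    }

    // Index of the chunk that holds global row idx.
    int chunkIndex(long idx) {
        if (idx < 0 || idx >= length()) {
            throw new IndexOutOfBoundsException("row " + idx);
        }
        int pos = Arrays.binarySearch(starts, idx);
        // Exact hit: idx is the first row of chunk pos.
        // Miss: binarySearch returns -(insertionPoint) - 1; the owning
        // chunk is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    // Position of global row idx inside its chunk.
    int offsetInChunk(long idx) {
        return (int) (idx - starts[chunkIndex(idx)]);
    }
}
```

With uniform chunks the lookup collapses to a division and a remainder, which is one reason tightly packed, equally sized chunks are attractive.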
No expensive intermediate states.
Fine-grain parallelism wins!
>> Fork / Join
8 Reduce early. Reduce Often!
All CPUs grab Chunks in parallel
Map/Reduce & F/J handle all sync
8 Reduce early. Reduce Often!
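The "map over chunks, reduce early" idea can be sketched with the JDK's Fork/Join framework. This is an illustrative sketch, not H2O's MRTask: `ChunkSum` is a hypothetical class, and chunks are modelled as plain `double[]` arrays on a single JVM. Each leaf task reduces one chunk to a single number, so no intermediate collection is ever materialized.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example: sum all elements of a chunked column.
final class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks;
    private final int lo; // first chunk index, inclusive
    private final int hi; // last chunk index, exclusive

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo == 1) {
            // Map: one chunk is reduced to one partial sum right away.
            double s = 0;
            for (double d : chunks[lo]) {
                s += d;
            }
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        left.fork();                                    // run left half asynchronously
        double right = new ChunkSum(chunks, mid, hi).compute();
        return left.join() + right;                     // reduce: combine two partials
    }

    static double sum(double[][] chunks) {
        if (chunks.length == 0) {
            return 0.0;
        }
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }
}
```

The framework handles the synchronization between forked halves; user code only supplies the per-chunk map and the two-way reduce.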
JVM 4 Heap
JVM 1 Heap JVM 2 Heap JVM 3 Heap
Vec Vec Vec Vec Vec
Debugging slow >> Heartbeats, Messages
Two Generals' Paradox
7 Slow is not different from Dead
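A timeout-based heartbeat check makes the point concrete: from the outside, a node that has not reported within the timeout looks the same whether it is slow or dead. The sketch below is an assumption-laden illustration; `HeartbeatMonitor`, its method names and the millisecond clock passed in by the caller are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical failure detector: a node is "suspect" once no heartbeat
// has been seen for longer than the timeout. It cannot tell slow from dead.
final class HeartbeatMonitor {
    private final ConcurrentHashMap<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record a heartbeat; keep the newest timestamp if messages arrive out of order.
    void onHeartbeat(String node, long nowMillis) {
        lastSeen.merge(node, nowMillis, Long::max);
    }

    // True if the node never reported or has been silent longer than the timeout.
    boolean isSuspect(String node, long nowMillis) {
        Long last = lastSeen.get(node);
        return last == null || nowMillis - last > timeoutMillis;
    }
}
```

Passing the clock in explicitly keeps the logic deterministic and easy to test.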
An in-memory system is only as good as your memory manager!
Lazy eviction. Compress. Align.
Corollary: Track down leaks!
6 Memory Manager
Use primitives
5 Memory Overheads
// A Distributed Vector
// much more than 2 billion elements
abstract class Vec {
  abstract long length();                 // more than an int's worth
  // fast random access
  abstract double at(long idx);           // Get the idx'th elem
  abstract boolean isNA(long idx);
  abstract void set(long idx, double d);  // writable
  abstract void append(double d);         // variable sized
}
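To illustrate "use primitives", here is a single-JVM sketch of the same five operations backed by plain `double[]` chunks rather than boxed `Double` objects. It rests on assumptions that are not from the talk: the class name `LocalVec`, the fixed chunk size of 1024, and the use of `NaN` to mark an NA are all illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical local stand-in for a distributed vector.
// Values live in primitive double[] chunks; NaN marks a missing value.
final class LocalVec {
    private static final int CHUNK = 1 << 10; // elements per chunk (assumed)
    private final List<double[]> chunks = new ArrayList<>();
    private long len;

    long length() {
        return len;
    }

    double at(long idx) {
        check(idx);
        return chunks.get((int) (idx / CHUNK))[(int) (idx % CHUNK)];
    }

    boolean isNA(long idx) {
        return Double.isNaN(at(idx));
    }

    void set(long idx, double d) {
        check(idx);
        chunks.get((int) (idx / CHUNK))[(int) (idx % CHUNK)] = d;
    }

    void append(double d) {
        if (len % CHUNK == 0) {
            chunks.add(new double[CHUNK]); // last chunk is full (or none exists yet)
        }
        chunks.get(chunks.size() - 1)[(int) (len % CHUNK)] = d;
        len++;
    }

    private void check(long idx) {
        if (idx < 0 || idx >= len) {
            throw new IndexOutOfBoundsException("index " + idx);
        }
    }
}
```

Each element costs 8 bytes here, whereas a `List<Double>` would add an object header and a reference per element.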
Tree size, bin size
Recursively divide till Data → Cache
4 Cache-Oblivious
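The "recursively divide until the data fits in cache" pattern is easiest to see on a textbook case, so the sketch below swaps in a recursive matrix transpose rather than the tree and bin structures the slide refers to. `CacheObliviousTranspose` and the `LEAF` cutoff are hypothetical; a strictly cache-oblivious version would recurse down to single elements, and the small cutoff here only trims call overhead.

```java
// Hypothetical example: out-of-place transpose of a row-major matrix.
// The larger dimension is halved recursively, so at some depth each
// sub-block fits in cache without the code knowing the cache size.
final class CacheObliviousTranspose {
    private static final int LEAF = 16; // practical recursion cutoff (assumed)

    // src is rows x cols, dst becomes cols x rows; both are row-major.
    static void transpose(double[] src, double[] dst, int rows, int cols) {
        rec(src, dst, rows, cols, 0, rows, 0, cols);
    }

    private static void rec(double[] src, double[] dst, int rows, int cols,
                            int r0, int r1, int c0, int c1) {
        int dr = r1 - r0;
        int dc = c1 - c0;
        if (dr <= LEAF && dc <= LEAF) {
            // Small block: copy directly.
            for (int r = r0; r < r1; r++) {
                for (int c = c0; c < c1; c++) {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
        } else if (dr >= dc) {
            int mid = r0 + dr / 2; // split the taller side
            rec(src, dst, rows, cols, r0, mid, c0, c1);
            rec(src, dst, rows, cols, mid, r1, c0, c1);
        } else {
            int mid = c0 + dc / 2; // split the wider side
            rec(src, dst, rows, cols, r0, r1, c0, mid);
            rec(src, dst, rows, cols, r0, r1, mid, c1);
        }
    }
}
```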
User-mode reliability
S3 readers will TCP Reset
Mux your connections
Not all toolkits are equal. >> JetS3
3 EC2 – Nothing is bounded
Non-Blocking Data Structures.
2 No Locks, No Cry
// VOLATILE READ before key compare.
// CAS
private final boolean CAS_kvs( final Object[] oldkvs, final Object[] newkvs ) {
  return _unsafe.compareAndSwapObject(this, _kvs_offset, oldkvs, newkvs );
}
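The same compare-and-swap retry pattern is available without `Unsafe` through `java.util.concurrent.atomic`. As a hedged illustration of a non-blocking data structure, here is a small lock-free stack (the classic Treiber design); `LockFreeStack` is a hypothetical example class and is unrelated to the hash map the snippet above comes from.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical example: a lock-free stack built on CAS retry loops.
final class LockFreeStack<T> {
    private static final class Node<E> {
        final E value;
        final Node<E> next;

        Node(E value, Node<E> next) {
            this.value = value;
            this.next = next;
        }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = head.get();                       // volatile read of current top
            newHead = new Node<>(value, oldHead);
        } while (!head.compareAndSet(oldHead, newHead)); // retry if another thread won
    }

    // Returns the top element, or null when the stack is empty.
    T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) {
                return null;
            }
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }
}
```

No thread ever blocks: a failed CAS only means some other thread made progress, and the loser retries against the new head.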
byte[ ]. roll-your-own. fast.
1 endian wars ended! Keep-It-Simple-Serialization.
public AutoBuffer putA1( byte[] ary, int sofar, int length ) {
  while( sofar < length ) {
    // Copy as much as fits in the remaining buffer space
    int len = Math.min(length - sofar, _bb.remaining());
    _bb.put(ary, sofar, len);
    sofar += len;
    // Buffer is full but bytes remain: ship this part and continue
    if( sofar < length ) sendPartial();
  }
  return this;
}
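A "keep it simple" wire format can be as plain as a length prefix followed by raw values in one fixed byte order. The sketch below is an illustration under assumptions, not H2O's AutoBuffer: `SimpleSerde` is a hypothetical class, and little-endian with a 4-byte length prefix is an arbitrary choice made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical example: double[] <-> byte[] with a fixed little-endian layout.
// Layout: 4-byte element count, then 8 bytes per double.
final class SimpleSerde {
    static byte[] write(double[] vals) {
        ByteBuffer bb = ByteBuffer.allocate(4 + 8 * vals.length)
                                  .order(ByteOrder.LITTLE_ENDIAN);
        bb.putInt(vals.length);
        for (double d : vals) {
            bb.putDouble(d);
        }
        return bb.array();
    }

    static double[] read(byte[] bytes) {
        ByteBuffer bb = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        double[] vals = new double[bb.getInt()];
        for (int i = 0; i < vals.length; i++) {
            vals[i] = bb.getDouble();
        }
        return vals;
    }
}
```

Fixing the byte order once on both sides removes any per-field negotiation from the hot path.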
Data Movement is a Defect. Slowing down helps communication.
Got Speed?
Accuracy rules over speed. Predictive Performance
0 Math always produces a number
Data presentation bias. Sorted data => interesting results
1 Shuffle
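Shuffling row order before modeling is a cheap guard against sorted input. A minimal sketch, assuming a hypothetical `RowShuffle` helper that returns a seeded Fisher-Yates permutation of row indices, so the data itself never has to move:

```java
import java.util.Random;

// Hypothetical example: reproducible random permutation of row indices.
final class RowShuffle {
    static int[] shuffledIndices(int n, long seed) {
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
        }
        Random rnd = new Random(seed);
        // Fisher-Yates: swap each position with a random earlier-or-equal one.
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
        }
        return idx;
    }
}
```

A fixed seed keeps runs repeatable, which matters when comparing models on the same shuffled data.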
2 Random acts of Kindness?
3 Convex Problems: ADMM
Matrix operations: JAMA, jblas... all single node.
A distributed version needs data transfer!
4 Amdahl strikes: Cholesky / QR Decomposition
Embarrassingly parallel: binning, tree-building, splits
5 Random Forests
Iterate & stage weak learners => strong learners
Each tree can be parallel
Minimize communication
6 Boosting
Embarrassingly parallel
Pre-calculate base stats
Distance calculation
Weight matrices – small footprint
7 Neural Nets & Clustering
Daisy-chain a bunch of models
Interleave. JIT – Minimize loops over data.
8 Ensembles
Deterministic versions first! Got Pen & Paper? Optimize often. Test Big Data soon.
9 Tools
Replacing NAs improves predictive performance by about 10%.
- Newton
Munging Missing Features
• impute NAs with mean
• impute NAs with kNN
• impute with recursive PCA
- Boyd
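The simplest of the imputation options listed above, replacing NAs with the column mean, can be sketched as follows. The class name `MeanImpute` and the use of `NaN` as the NA marker are assumptions for illustration only.

```java
// Hypothetical example: mean imputation for one numeric column.
// NaN stands in for a missing value (NA).
final class MeanImpute {
    static double[] impute(double[] col) {
        double sum = 0;
        int n = 0;
        for (double d : col) {
            if (!Double.isNaN(d)) {
                sum += d;
                n++;
            }
        }
        double[] out = col.clone();
        if (n == 0) {
            return out; // nothing observed, so there is no mean to fill in
        }
        double mean = sum / n;
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                out[i] = mean;
            }
        }
        return out;
    }
}
```

The mean is computed from observed values only, and the input array is left untouched.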
Unbalanced data: single rare class
Fraud / No-Fraud
Stratify
Unbalanced data: multiple rare classes
Browse, Click, Purchase
Stratify
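Stratifying means sampling each class separately so that rare classes keep their share of the sample. A minimal sketch, assuming integer class labels, a sampling fraction between 0 and 1, and a hypothetical `StratifiedSample` helper; the rounding rule is an arbitrary choice for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical example: pick the same fraction of rows from every class.
final class StratifiedSample {
    // Returns the selected row indices; fraction is assumed to be in [0, 1].
    static List<Integer> sample(int[] labels, double fraction, long seed) {
        // Group row indices by class label.
        Map<Integer, List<Integer>> byClass = new TreeMap<>();
        for (int i = 0; i < labels.length; i++) {
            byClass.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);
        }
        Random rnd = new Random(seed);
        List<Integer> picked = new ArrayList<>();
        for (List<Integer> rows : byClass.values()) {
            Collections.shuffle(rows, rnd);
            int take = (int) Math.round(fraction * rows.size());
            picked.addAll(rows.subList(0, take));
        }
        return picked;
    }
}
```

With 90 "no fraud" rows and 10 "fraud" rows, a fraction of 0.5 yields 45 and 5 rows respectively, preserving the 9:1 ratio that a plain random sample would only match on average.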
Use customer data
Algorithms for sparse vs. dense
Unbalanced data.
Robustness under noise
10 Data is the System
Volume: HDFS
HIVE/SQL
Data Scientist
Munging slice n dice Features
Classification Regression Clustering Optimal Model
Engineer
Velocity: Events Online Scoring
Exploration
Modeling
Offline Scoring
Business Analyst
Ensemble models Low latency
Applications
Predictions
Rule Engine
Before H2O
Big Data: Exploration, Modeling, Real-time Scoring
Big Data beats Better Algorithms!
Big Data: Exploration, Modeling, Real-time Scoring
Big Data and Better Algorithms! Scale & Parallelism!
0xdata.com
Distributed Coding Taxonomy
• No Distribution Coding:
  • Whole Algorithms, Whole Vector-Math
  • REST + JSON: e.g. load data, GLM, get results
• Simple Data-Parallel Coding:
  • Per-Row (or neighbor row) Math
  • Map/Reduce-style: e.g. any dense linear algebra
• Complex Data-Parallel Coding:
  • K/V Store, Graph Algo's, e.g. PageRank
Read the docs!
This talk!
Join our GIT!
Distributed Data Taxonomy
Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a Java double
Row i – i'th elements of all the Vecs in a Frame
Usecases
Conversion, Retention & Churn
• Lead Conversion
• Engagement
• Product Placement
• Recommendations
Pricing Engine
Fraud Detection