The Gamma Operator for Big Data Summarization on an Array DBMS
Carlos Ordonez
Acknowledgments
• Michael Stonebraker, MIT
• My PhD students: Yiqun Zhang, Wellington Cabrera
• SciDB team: Paul Brown, Bryan Lewis, Alex Polyakov
Why SciDB?
• Large matrices beyond RAM size
• Storage by row or column not good enough
• Matrices are natural in statistics, engineering, and science
• Multidimensional arrays -> matrices, not the same thing
• Parallel shared-nothing best for big data analytics
• Closer to DBMS technology, but some similarity with Hadoop
• Feasible to create array operators, having matrices as input and matrix as output
• Combine processing with R package and LAPACK
Old: separate sufficient statistics
New: Generalizing and unifying Sufficient Statistics: Z=[1,X,Y]
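The slide's equation image did not survive extraction. As a sketch under assumed conventions (X is d × n with one point x_i per column, Y is 1 × n, the symbol \mathbf{1} is an n × 1 column of ones, and Z stacks a row of ones, X and Y so that z_i = [1, x_i, y_i]^T), Gamma can be written as the product of Z with its transpose:

```latex
\Gamma = Z Z^T = \sum_{i=1}^{n} z_i z_i^T =
\begin{bmatrix}
n              & L^T   & \mathbf{1}^T Y^T \\
L              & Q     & X Y^T \\
Y \mathbf{1}   & Y X^T & Y Y^T
\end{bmatrix},
\qquad
L = \sum_{i=1}^{n} x_i, \quad Q = X X^T .
```

Under these assumptions the separate statistics n, L and Q appear as sub-blocks of one (d+2) × (d+2) matrix.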
Equivalent equations with projections from Γ
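As an illustration of what "projections from Γ" can look like, the NumPy sketch below builds Γ = Z Z^T for Z = [1, X, Y] and reads n, L and Q back out as sub-blocks, then derives the mean and covariance from them. The layout (X stored d × n, points as columns) and every name in the code are assumptions made for the example, not the operator's actual interface.

```python
import numpy as np

def gamma(X, y):
    """Gamma = Z Z^T with Z = [1, X, Y]; X is d x n (points as columns), y has length n."""
    n = X.shape[1]
    Z = np.vstack([np.ones((1, n)), X, y.reshape(1, n)])
    return Z @ Z.T

rng = np.random.default_rng(0)
d, n = 3, 1000
X = rng.normal(size=(d, n))
y = rng.normal(size=n)
G = gamma(X, y)

# Projections: sub-blocks of Gamma (assumed block layout, see lead-in)
n_from_G = G[0, 0]               # count of points
L = G[1:d + 1, 0]                # linear sum of points
Q = G[1:d + 1, 1:d + 1]          # quadratic sum X X^T

mean = L / n_from_G
cov = Q / n_from_G - np.outer(mean, mean)   # population covariance
```

Once Γ is available, the mean and covariance no longer require another pass over X.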
Properties of Γ
Further properties in detail: non-commutative and distributive
Storage in array chunks
In SciDB we store the points in X as a 2D array.
(Figure: a worker reads the chunks of X with a SCAN.)
Array storage and processing in SciDB
• Assuming d << n, it is natural to hash partition X by i = 1..n
• Gamma computation is fully parallel, maintaining local Gamma versions in RAM
• X can be read with a fully parallel scan
• No need to write Gamma from RAM to disk during the scan, unless fault tolerance is required
Point must fit in one chunk. Otherwise, a join is needed (slow).
(Figure: two chunk layouts for Coordinator and Worker 1, one marked OK and one marked NO!)
Parallel computation
(Figure: Coordinator, Worker 1, Worker 2; each worker has a "send" step.)
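A minimal sketch of this scheme, assuming that what each worker sends is its local Gamma and that the coordinator merges by summing. The contiguous split below stands in for the hash partitioning by i mentioned earlier, and the function names are hypothetical.

```python
import numpy as np

def local_gamma(Z_part):
    """Partial Gamma over the points (columns) held by one worker."""
    return Z_part @ Z_part.T

def parallel_gamma(Z, workers):
    # Split the n points across workers; a real system would hash partition by i.
    parts = np.array_split(Z, workers, axis=1)
    partials = [local_gamma(p) for p in parts]   # each would run on its own worker
    return sum(partials)                         # merge step at the coordinator

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 1000))       # (d + 2) x n with d = 4
G_parallel = parallel_gamma(Z, workers=4)
G_serial = Z @ Z.T
```

Because Gamma is a sum over points, the partial results can be added in any order, so the merge only moves small (d+2) × (d+2) matrices.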
Dense matrix operator: O(d^2 n)
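The O(d^2 n) cost follows from accumulating one outer product per point. The loop below is a plain NumPy sketch of that idea, not the C++ operator itself; `gamma_dense` and its input format are assumptions made for the example.

```python
import numpy as np

def gamma_dense(points):
    """One pass over an iterable of z_i vectors, each of length d + 2."""
    G = None
    for z in points:
        z = np.asarray(z, dtype=float)
        if G is None:
            G = np.zeros((z.size, z.size))
        G += np.outer(z, z)      # O(d^2) work per point, O(d^2 n) overall
    return G

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 200))    # columns are the z_i
G = gamma_dense(Z.T)             # iterating Z.T yields one z_i at a time
```

Each point is read once and only the (d+2) × (d+2) accumulator stays in memory.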
Sparse matrix operator: O(d n) for hyper-sparse matrix
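For sparse input the same accumulation can skip zero entries. The sketch below assumes each point arrives as a pair (indices, values) of its nonzero entries, a hypothetical format chosen for illustration. In this sketch the work per point grows with the square of its number of nonzeros rather than with d^2; the exact O(d n) accounting is the slide's and is not derived here.

```python
import numpy as np

def gamma_sparse(points, dim):
    """points: iterable of (indices, values) pairs with the nonzeros of each z_i."""
    G = np.zeros((dim, dim))
    for idx, val in points:
        idx = np.asarray(idx, dtype=int)
        val = np.asarray(val, dtype=float)
        # Only nonzero-by-nonzero products contribute: k^2 updates for k nonzeros
        G[np.ix_(idx, idx)] += np.outer(val, val)
    return G

# Small example: columns of Z are the points, most entries are zero
Z = np.array([[1.0, 1.0, 1.0],
              [0.0, 2.0, 0.0],
              [3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
points = [(np.flatnonzero(col), col[np.flatnonzero(col)]) for col in Z.T]
G = gamma_sparse(points, Z.shape[0])
```

The indices within one point must be unique for the in-place update to be correct, which holds for a sparse vector stored without duplicates.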
Pros: Algorithm evaluation with physical array operators
• Since x_i fits in one chunk, joins are avoided (at least 2X I/O with hash or merge join)
• Since x_i * x_i^T can be computed in RAM, we avoid an aggregation, which would require sorting points by i
• No need to store X twice (X, X^T): half the I/O, half the RAM space
• No need to transpose X, a costly reorganization even in RAM, especially if X spans several RAM segments
• Operator works in C++ compiled code: fast; vector accessed once; direct assignment (bypasses C++ function calls)
System issues and limitations
• Gamma not efficiently computable in AQL or AFL: hence an operator is required
• Arrays of tuples in SciDB are more general, but cumbersome for matrix manipulation: arrays of single attribute (double)
• Points must be stored completely inside a chunk: wide rectangular chunks: may not be I/O optimal
• Slow: Arrays must be pre-processed to SciDB load format, loaded to 1D array and re-dimensioned => optimize load
• Multiple SciDB instances per node improve I/O speed: interleaving CPU
• Larger chunks are better: 8MB, especially for dense matrices; avoid shuffling; avoid joins
• Dense (alpha) and sparse (beta) versions
Benchmark: scale up emphasis
• Small: cluster with 2 Intel Quadcore servers, 4GB RAM, 3TB disk
• Large: Amazon cloud 2
Why is Gamma faster than SciDB+LAPACK?
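Gamma operator

| d | Gamma op | Scan | mem alloc | CPU | merge |
|---|---|---|---|---|---|
| 100 | 3.5 | 0.7 | 0.1 | 2.2 | 0.0 |
| 200 | 10.9 | 1.0 | 0.1 | 8.6 | 0.0 |
| 400 | 38.8 | 2.2 | 0.1 | 33.9 | 0.1 |
| 800 | 145.0 | 4.6 | 0.1 | 134.7 | 0.4 |
| 1600 | 599.8 | 11.4 | 0.1 | 575.5 | 1.0 |

SciDB and LAPACK (crossprod() call in SciDB)

| d | TOTAL | transpose | subarray 1 | repart 1 | subarray 2 | repart 2 | build 0s | gemm | ScaLAPACK | MKL |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 77.3 | 0.1 | 0.3 | 41.7 | 0.1 | 25.9 | 0.0 | 8.0 | 0.8 | 0.2 |
| 200 | 163.0 | 0.1 | 0.2 | 84.9 | 0.1 | 55.7 | 0.0 | 17.2 | 1.8 | 0.6 |
| 400 | 373.1 | 0.1 | 0.3 | 172.6 | 0.5 | 120.6 | 0.3 | 39.4 | 5.4 | 2.1 |
| 800 | 1497.3 | 0.1 | 0.1 | 553.6 | 0.8 | 537.6 | 0.5 | 169.8 | 21.2 | 8.1 |
| 1600 | * | * | * | * | * | * | * | * | * | 33.4 |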
Combination: SciDB + R
Can Gamma operator beat LAPACK?
Gamma versus Open BLAS LAPACK (90% performance of MKL)
Gamma: scan, sparse/dense 2 threads; disk+RAM+CPU
LAPACK: Open BLAS~=MKL; 2 threads; RAM+CPU

| n | density | d=100 dense | d=100 sparse | d=100 LAPACK Open BLAS | d=200 dense | d=200 sparse | d=200 LAPACK Open BLAS | d=400 dense | d=400 sparse | d=400 LAPACK Open BLAS | d=800 dense | d=800 sparse | d=800 LAPACK Open BLAS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100k | 0.1% | 3.3 | 0.1 | 0.4 | 11.3 | 0.1 | 1.0 | 38.9 | 0.2 | 3.1 | 145.0 | 0.6 | 10.7 |
| 100k | 1.0% | 3.3 | 0.1 | 0.4 | 11.3 | 0.2 | 1.0 | 38.9 | 0.4 | 3.1 | 145.0 | 1.0 | 10.7 |
| 100k | 10.0% | 3.3 | 0.5 | 0.4 | 11.3 | 0.9 | 1.0 | 38.9 | 2.2 | 3.1 | 145.0 | 6.2 | 10.7 |
| 100k | 100.0% | 3.3 | 4.5 | 0.4 | 11.3 | 15.4 | 1.0 | 38.9 | 55.9 | 3.1 | 145.0 | 201.0 | 10.7 |
| 1M | 0.1% | 31.1 | 0.2 | 3.8 | 103.5 | 0.2 | 10.0 | 316.5 | 0.4 | 423.2 | 1475.7 | 0.9 | fail |
| 1M | 1.0% | 31.1 | 0.5 | 3.8 | 103.5 | 1.1 | 10.0 | 316.5 | 3.8 | 423.2 | 1475.7 | 4.0 | fail |
| 1M | 10.0% | 31.1 | 4.0 | 3.8 | 103.5 | 7.0 | 10.0 | 316.5 | 16.3 | 423.2 | 1475.7 | 46.4 | fail |
| 1M | 100.0% | 31.1 | 44.0 | 3.8 | 103.5 | 148.8 | 10.0 | 316.5 | 542.3 | 423.2 | 1475.7 | 2159.6 | fail |
SciDB in the Cloud: massive parallelism
Conclusions
• One pass summarization matrix operator: parallel, scalable
• Optimization of outer matrix multiplication as sum (aggregation) of vector outer products
• Dense and sparse matrix versions required
• Operator compatible with any parallel shared-nothing system, but better for arrays
• Gamma matrix must fit in RAM, but n unlimited
• Summarization matrix can be exploited in many intermediate computations (with appropriate projections) in linear models
• Simplifies many methods to two phases:
  1. Summarization
  2. Computing model parameters
• Requires arrays, but can work with SQL or MapReduce
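The two phases can be sketched for linear regression, one of the linear models mentioned above. This is an illustrative NumPy example under assumed conventions (Z = [1, X, Y] with points as columns, synthetic data, hypothetical variable names), not the operator's interface: phase 1 touches the data once, and phase 2 uses only sub-blocks of Gamma.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 5000
X = rng.normal(size=(d, n))
true_beta = np.array([2.0, -1.0, 0.5, 3.0, 0.0])   # intercept first (synthetic)
y = true_beta[0] + true_beta[1:] @ X + 0.01 * rng.normal(size=n)

# Phase 1: summarization, the only step that reads the n points
Z = np.vstack([np.ones((1, n)), X, y.reshape(1, n)])
G = Z @ Z.T

# Phase 2: model parameters from Gamma alone (normal equations)
A = G[:d + 1, :d + 1]     # [1, X] [1, X]^T
b = G[:d + 1, d + 1]      # [1, X] Y^T
beta = np.linalg.solve(A, b)
```

Phase 2 works on a (d+1) × (d+1) system, so its cost does not depend on n.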
Future work: Theory
• Use Gamma in other models like logistic regression, clustering, Factor Analysis, HMMs
• Connection to frequent itemset
• Sampling
• Higher expected moments, co-variates
• Unlikely: Numeric stability with unnormalized sorted data
Future work: Systems
• DONE: Sparse matrices: layout, compression
• DONE: Beat LAPACK on high d
• Online model learning (cursor interface needed, incompatible with DBMS)
• Unlimited d (currently d>8000); join required for high d? Parallel processing of high d more complicated, chunked
• Interface with BLAS and MKL, not worth it?
• Faster than column DBMS for sparse?