ExaNIMLAn Exascale Library for Numerically Inspired Machine LearningSeverin Reiz§, Hasan Ashraf§,Tobias Neckel§, George Biros†, Hans-Joachim Bungartz§§ Technical University of Munich, † University of Texas at Austin, [email protected], [email protected], [email protected], [email protected], [email protected]
IntroductionMotivation•Significant gap in communities:
Machine Learning (ML)↔ high-performance computing (HPC)•ML needs considerable computing power→ we need adequate software!•ExaNIML: Library with algorithms
– with modern applications from ML community– with enough concurrency for next generation distributed computing systems
Approach•Methods from scientific computing domain
– “Undervalued” near-linear complexity methods (fast multipole methods)1
– Adaptive sparse grids to mitigate curse of dimensionality2
•HPC: Exploit potential of supercomputers– Concurrency: Choose suitable algorithms for parallel computing– Extract computational bottlenecks as low-level drivers in C++ or Kokkos– Performance Portability in-light of the upcoming new GPU and CPU
architectures
Classification with Kernel MethodsKernel Matrix
Occurs in many domains . . .
•multi-class classification•model-order reduction• uncertainty quantification• partial differential equations
Example: Binary classification
Ridge regression
•N data points xi ∈ Rd and N binary labels yi
• f (x) = sign(N∑i=1
k(x, xi)wi)→ u = f (xtest) = K ∗ w
•Solving a linear system: kernel matrix Koften not stable, nearly singular.Solve K̃ → K + λI instead
Our approach: Kernel Matrix Approximation
•Often K is a dense N-by-N matrix; this quadraticcomplexity often is the computational bottleneck•To reach O(N) algorithms it requires approximation•For the majority of applications off-diagonal blocks ofK admit good low-rank approximations
Many ML libraries offer Kernel methods: to our knowl-edge none of them offer kernel matrix approximation
KernelM
ethodsKey Computational Bottlenecks
Geometry-oblivious Fast Multipole Method1
•Hierarchically off-diagonal low-rank•Speeds up algebraic operations
iSend and iRecv
Telescope Skeleton Weights
Pack Skel Weights
TelescopeSkeleton Potentials
Unpack SkelWeights
Reduce Skeleton Potentials
l
r
p
q
{𝛂}
u
{β}
{θ}
{π}
v
r
l
p
q
{𝛂}
{β}
{θ}
{π}
u
v
u
{β}r
l
Dependency graph for asynchronous task analysis
4-process distributed H-Matrix compression. Mixedcolored sections and factors are shared for dis-tributed nodes
First scalability resultsMultiplication1
192-core 384 768 1536 3072 6144
Neighbor Tree Skeletonization
192-core 384 768 1536 3072 6144
GFLOPS Sync Async
12%14%16%
26%27%29%29% 27% 26%
16% 14% 12%
60s 44s
38s 23s
27s 12s
18s 10s
13s 6s
10s 4s
52s 19s
210s
28s 12s
110s
16s 8s
55s
9s 5s
30s
5s 3s
17s
4s 2s 9s
#01 #02 #03 #04 #05 #06 #07 #08 #09 #10 #11 #12
120s (y-axis) 40s (y-axis)
90s (y-axis) 30s (y-axis)
Time for compression (left) and multiplication(right) for a 6-d gaussian kernel matrix of 64M-by-64M
O(N) linear solver3
192-core 384 768 1536 3072 6144
Efficiency Factorize
15.3%20.3%
25.5%27.5%28.9%30%30% 28.9% 27.5% 25.5%20.3%
15.3%
192-core 384 768 1536 3072 6144
Efficiency Solve
7.3%12.2%
16.2%16.2%16.5%16.6%16.6% 16.5% 16.2% 16.2%12.2%
7.3%
1.41s 0.71s 0.36s 0.18s 0.12s 0.10s 11.07s 5.75s 3.01s 1.63s 1.02s 0.68s #01 #02 #03 #04 #05 #06 #07 #08 #09 #10 #11 #12
1.0s
0.1s
Time for ULV-Factorization (left) and forward-solve (right)
Dimensionality Reduction
10 20 30 40 50 60
0
0.2
0.4
0.6
0.8
1
Components
Rel
ativ
eS
ingu
lar
Valu
es
−6 −4 −2 0 2 4 6
−5
0
5
First Component
Seco
nd
Com
pon
ent
Embedded Space012
Reduce the dimensionality of datasetManifold Learning Algorithms• (Kernel) Principal Component Analysis• Isomap algorithm•Hessian local eigenmaps, ...
Classification on Embedded Space
•Example forced to 2D manifold (plotting)•Classification on lower dimensional man-
ifold
→ Sparse grid classification4
Approximation with Sparse GridsSparse grids•Reduce number of grid points•One approach: Combi-Technique
Combine Full Grids (red and blue, c.f.figure on right)•Suitable for 5-20 dimensions
Sparse Grids in Embedded Space1. Manifold learning algorithm for coarse
embedded space2. Fine approximation in embedded space
with Sparse Grids
Synergy between Point-based and Grid-based Methods
Sparse
Grids
Interfaces
Kernel Matrix Approxima-tion Manifold Learning
Deep LearningResNet, MobileNet,. . .
Sparse grid libraryData mining withSparse grid density esti-mation
Conclusion
•Method design– Run prominent models from current machine learning peers– Combine models with hierarchical kernel and sparse grid methods• Library design
– Community/reproducibility: ExaNIML library for others to play
References
[1] C. D. Yu, S. Reiz, and G. Biros, “Distributed-memory hierarchical compression of dense SPD matrices,” in Proceedings of the International Con-ference for High Performance Computing, Networking, Storage, and Analysis, SC ’18, (Piscataway, NJ, USA), pp. 15:1–15:15, IEEE Press, 2018.
[2] H.-J. Bungartz and M. Griebel, “Sparse grids,” Acta numerica, vol. 13, pp. 147–269, 2004.
[3] D. Y. Chenhan, S. Reiz, and G. Biros, “Distributed o (n) linear solver for dense symmetric hierarchical semi-separable matrices,” in 2019 IEEE 13thInternational Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp. 1–8, IEEE, 2019.
[4] B. Peherstorfer, D. Pflüger, and H.-J. Bungartz, “Density estimation with adaptive sparse grids for large data sets,” in Proceedings of the 2014SIAM international conference on data mining, pp. 443–451, SIAM, 2014.
Top Related