Download - An Exascale Library for Numerically Inspired Machine Learning · 2020. 6. 24. · Example forced to 2D manifold (plotting) Classiﬁcation on lower dimensional man-ifold!Sparse grid

ExaNIMLAn Exascale Library for Numerically Inspired Machine LearningSeverin Reiz§, Hasan Ashraf§,Tobias Neckel§, George Biros†, Hans-Joachim Bungartz§§ Technical University of Munich, † University of Texas at Austin, [email protected], [email protected], [email protected], [email protected], [email protected]

IntroductionMotivation•Significant gap in communities:

Machine Learning (ML)↔ high-performance computing (HPC)•ML needs considerable computing power→ we need adequate software!•ExaNIML: Library with algorithms

– with modern applications from ML community– with enough concurrency for next generation distributed computing systems

Approach•Methods from scientific computing domain

– “Undervalued” near-linear complexity methods (fast multipole methods)1

– Adaptive sparse grids to mitigate curse of dimensionality2

•HPC: Exploit potential of supercomputers– Concurrency: Choose suitable algorithms for parallel computing– Extract computational bottlenecks as low-level drivers in C++ or Kokkos– Performance Portability in-light of the upcoming new GPU and CPU

architectures

Classification with Kernel MethodsKernel Matrix

Occurs in many domains . . .

•multi-class classification•model-order reduction• uncertainty quantification• partial differential equations

Example: Binary classification

Ridge regression

•N data points xi ∈ Rd and N binary labels yi

• f (x) = sign(N∑i=1

k(x, xi)wi)→ u = f (xtest) = K ∗ w

•Solving a linear system: kernel matrix Koften not stable, nearly singular.Solve K̃ → K + λI instead

Our approach: Kernel Matrix Approximation

•Often K is a dense N-by-N matrix; this quadraticcomplexity often is the computational bottleneck•To reach O(N) algorithms it requires approximation•For the majority of applications off-diagonal blocks ofK admit good low-rank approximations

Many ML libraries offer Kernel methods: to our knowl-edge none of them offer kernel matrix approximation

KernelM

ethodsKey Computational Bottlenecks

Geometry-oblivious Fast Multipole Method1

•Hierarchically off-diagonal low-rank•Speeds up algebraic operations

iSend and iRecv

Telescope Skeleton Weights

Pack Skel Weights

TelescopeSkeleton Potentials

Unpack SkelWeights

Reduce Skeleton Potentials

l

r

p

q

{𝛂}

u

{β}

{θ}

{π}

v

r

l

p

q

{𝛂}

{β}

{θ}

{π}

u

v

u

{β}r

l

Dependency graph for asynchronous task analysis

4-process distributed H-Matrix compression. Mixedcolored sections and factors are shared for dis-tributed nodes

First scalability resultsMultiplication1

192-core 384 768 1536 3072 6144

Neighbor Tree Skeletonization

192-core 384 768 1536 3072 6144

GFLOPS Sync Async

12%14%16%

26%27%29%29% 27% 26%

16% 14% 12%

60s 44s

38s 23s

27s 12s

18s 10s

13s 6s

10s 4s

52s 19s

210s

28s 12s

110s

16s 8s

55s

9s 5s

30s

5s 3s

17s

4s 2s 9s

#01 #02 #03 #04 #05 #06 #07 #08 #09 #10 #11 #12

120s (y-axis) 40s (y-axis)

90s (y-axis) 30s (y-axis)

Time for compression (left) and multiplication(right) for a 6-d gaussian kernel matrix of 64M-by-64M

O(N) linear solver3

192-core 384 768 1536 3072 6144

Efficiency Factorize

15.3%20.3%

25.5%27.5%28.9%30%30% 28.9% 27.5% 25.5%20.3%

15.3%

192-core 384 768 1536 3072 6144

Efficiency Solve

7.3%12.2%

16.2%16.2%16.5%16.6%16.6% 16.5% 16.2% 16.2%12.2%

7.3%

1.41s 0.71s 0.36s 0.18s 0.12s 0.10s 11.07s 5.75s 3.01s 1.63s 1.02s 0.68s #01 #02 #03 #04 #05 #06 #07 #08 #09 #10 #11 #12

1.0s

0.1s

Time for ULV-Factorization (left) and forward-solve (right)

Dimensionality Reduction

10 20 30 40 50 60

0

0.2

0.4

0.6

0.8

1

Components

Rel

ativ

eS

ingu

lar

Valu

es

−6 −4 −2 0 2 4 6

−5

0

5

First Component

Seco

nd

Com

pon

ent

Embedded Space012

Reduce the dimensionality of datasetManifold Learning Algorithms• (Kernel) Principal Component Analysis• Isomap algorithm•Hessian local eigenmaps, ...

Classification on Embedded Space

•Example forced to 2D manifold (plotting)•Classification on lower dimensional man-

ifold

→ Sparse grid classification4

Approximation with Sparse GridsSparse grids•Reduce number of grid points•One approach: Combi-Technique

Combine Full Grids (red and blue, c.f.figure on right)•Suitable for 5-20 dimensions

Sparse Grids in Embedded Space1. Manifold learning algorithm for coarse

embedded space2. Fine approximation in embedded space

with Sparse Grids

Synergy between Point-based and Grid-based Methods

Sparse

Grids

Interfaces

Kernel Matrix Approxima-tion Manifold Learning

Deep LearningResNet, MobileNet,. . .

Sparse grid libraryData mining withSparse grid density esti-mation

Conclusion

•Method design– Run prominent models from current machine learning peers– Combine models with hierarchical kernel and sparse grid methods• Library design

– Community/reproducibility: ExaNIML library for others to play

References

[1] C. D. Yu, S. Reiz, and G. Biros, “Distributed-memory hierarchical compression of dense SPD matrices,” in Proceedings of the International Con-ference for High Performance Computing, Networking, Storage, and Analysis, SC ’18, (Piscataway, NJ, USA), pp. 15:1–15:15, IEEE Press, 2018.

[2] H.-J. Bungartz and M. Griebel, “Sparse grids,” Acta numerica, vol. 13, pp. 147–269, 2004.

[3] D. Y. Chenhan, S. Reiz, and G. Biros, “Distributed o (n) linear solver for dense symmetric hierarchical semi-separable matrices,” in 2019 IEEE 13thInternational Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp. 1–8, IEEE, 2019.

[4] B. Peherstorfer, D. Pflüger, and H.-J. Bungartz, “Density estimation with adaptive sparse grids for large data sets,” in Proceedings of the 2014SIAM international conference on data mining, pp. 443–451, SIAM, 2014.