MAGMA: A New Generation of Linear Algebra Libraries for GPU and Multicore Architectures


MAGMA: A New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

University  of  Tennessee,  Knoxville  

Jack Dongarra, T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki

1.3 Release

MAGMA: LAPACK for GPUs

• MAGMA
  – Matrix Algebra on GPU and Multicore Architectures
  – Provides LAPACK/ScaLAPACK functionality on hybrid architectures
  – http://icl.cs.utk.edu/magma/

• MAGMA 1.3
  – For NVIDIA CUDA GPUs on shared-memory systems
  – 80+ hybrid algorithms have been developed (320+ routines in total):

•  One-sided factorizations and linear system solvers

•  Two-sided factorizations and eigenproblem solvers

•  A subset of BLAS and auxiliary routines in CUDA

• MAGMA developers & collaborators
  – UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia)
  – A community effort, similar to LAPACK/ScaLAPACK

Key Features of MAGMA 1.3

• High performance
• Multiple precision support (Z, C, D, S, and mixed precision)
• Hybrid algorithms
• Out-of-GPU-memory algorithms
• Multi-GPU support


MAGMA Software Stack

[Figure: the MAGMA software stack, spanning single-GPU, multi-GPU, and distributed targets across CPU, hybrid, and GPU/coprocessor layers. Single GPU: MAGMA 1.3 hybrid LAPACK and tile kernels; multi-GPU: MAGMA 1.3 hybrid tile (PLASMA) algorithms over the StarPU run-time system or PLASMA/Quark; distributed: hybrid LAPACK/ScaLAPACK and tile algorithms over StarPU/DAGuE. The hybrid layers build on LAPACK and BLAS on the CPU side and on MAGMA BLAS over CUDA / OpenCL / MIC on the device side; MAGMA SPARSE sits alongside. Releases: MAGMA 1.3, clMAGMA 1.0, MAGMA MIC 0.3.]

Support: Linux, Windows, Mac OS X; C/C++, Fortran; Matlab, Python

Multiple precision support

[Figure: performance of the LU factorization (SGETRF_GPU, DGETRF_GPU, CGETRF_GPU, ZGETRF_GPU) in GFlop/s (0-900) vs. matrix size. Keeneland node: NVIDIA M2090 GPU (14 MPs @ 1.3 GHz, peak 583 GFlop/s) + 2 x 6-core Intel Xeon X5660 @ 2.8 GHz; the CPU DP peak is 134 GFlop/s.]

Methodology overview

• MAGMA uses a hybridization methodology based on
  – Representing linear algebra algorithms as collections of tasks and data dependencies among them
  – Properly scheduling the tasks' execution over the multicore and GPU hardware components

• Successfully applied to fundamental linear algebra algorithms
  – One- and two-sided factorizations and solvers
  – Iterative linear solvers and eigensolvers

• Productivity
  – 1) High level; 2) Leverages prior developments; 3) Exceeds the performance of homogeneous solutions

A methodology to use all available resources: hybrid CPU+GPU algorithms, with small tasks for the multicore CPUs and large tasks for the GPUs.

A Hybrid Algorithm Example

• Left-looking hybrid Cholesky factorization in MAGMA
• The difference from LAPACK: four additional lines, for the CPU-GPU transfers and the asynchronous queuing
• The panel factorization on the CPU is overlapped with the update work on the GPU
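Since the listing itself did not survive the transcript, here is a minimal sketch of the pattern, assuming cuBLAS and LAPACKE on the host; it is not the actual MAGMA source. The extra lines relative to plain LAPACK are the two transfers plus the asynchronous queuing that lets the CPU factor the diagonal block while the GPU keeps updating the panel:

/* Minimal sketch (not the actual MAGMA source) of a blocked,
   left-looking hybrid Cholesky factorization of the lower
   triangle of an n x n matrix dA held in GPU memory.
   work is a pinned nb x nb host buffer; error checks omitted. */
#include <cublas_v2.h>
#include <lapacke.h>

void hybrid_potrf(cublasHandle_t h, int n, int nb,
                  double *dA, int lda, double *work)
{
    const double one = 1.0, mone = -1.0;
    for (int j = 0; j < n; j += nb) {
        int jb = (nb < n - j) ? nb : n - j;

        /* Left-looking update of the diagonal block with all
           previously factored block columns A(j:j+jb, 0:j).       */
        cublasDsyrk(h, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, jb, j,
                    &mone, dA + j, lda, &one, dA + j + j*lda, lda);

        /* Extra line 1: bring the updated diagonal block to CPU.  */
        cublasGetMatrix(jb, jb, sizeof(double),
                        dA + j + j*lda, lda, work, jb);

        /* Queue the update of the rest of the block column; the
           call returns immediately, so this GPU work overlaps the
           CPU factorization below.                                */
        if (j + jb < n)
            cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_T, n - j - jb, jb, j,
                        &mone, dA + j + jb, lda, dA + j, lda,
                        &one, dA + j + jb + j*lda, lda);

        /* Factor the small diagonal block on the CPU (LAPACK),
           overlapped with the cublasDgemm above.                  */
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', jb, work, jb);

        /* Extra line 2: send the factored block back to the GPU.  */
        cublasSetMatrix(jb, jb, sizeof(double),
                        work, jb, dA + j + j*lda, lda);

        /* Triangular solve for the sub-diagonal part on the GPU.  */
        if (j + jb < n)
            cublasDtrsm(h, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                        CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                        n - j - jb, jb, &one, dA + j + j*lda, lda,
                        dA + j + jb + j*lda, lda);
    }
}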

Mixed precision iterative refinement

[Figure: solving general dense linear systems using mixed precision iterative refinement — performance in GFlop/s (0-700) vs. matrix size for the SP solve, the DP solve via mixed precision iterative refinement, and the plain DP solve. Keeneland node: NVIDIA M2090 GPU (14 MPs @ 1.3 GHz, peak 583 GFlop/s) + 2 x 6-core Intel Xeon X5660 @ 2.8 GHz.]
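As background for the plot: the matrix is factored once in fast single precision, and double-precision accuracy is then recovered by cheap refinement steps that reuse the single-precision LU. A minimal CPU-only sketch with LAPACKE/CBLAS, assuming a reasonably conditioned A; LAPACK's dsgesv and MAGMA's GPU equivalent implement the production version of this idea:

#include <cblas.h>
#include <lapacke.h>
#include <math.h>
#include <stdlib.h>

/* Solve A x = b (A is n x n, column-major, double) to double
   accuracy using a single-precision LU. Returns 0 on success. */
int mp_gesv(int n, const double *A, const double *b, double *x)
{
    float  *As = malloc(sizeof(float)  * n * n);   /* A in single */
    float  *c  = malloc(sizeof(float)  * n);       /* correction  */
    double *r  = malloc(sizeof(double) * n);       /* residual    */
    int    *ip = malloc(sizeof(int)    * n);       /* pivots      */

    for (int i = 0; i < n * n; i++) As[i] = (float)A[i];
    if (LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As, n, ip) != 0) {
        free(As); free(c); free(r); free(ip);
        return -1;                    /* singular in single prec. */
    }

    /* With x = 0, the first "correction" is the SP solution.     */
    for (int i = 0; i < n; i++) x[i] = 0.0;
    for (int iter = 0; iter < 30; iter++) {
        /* Residual r = b - A x, computed in double precision.    */
        for (int i = 0; i < n; i++) r[i] = b[i];
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                    -1.0, A, n, x, 1, 1.0, r, 1);
        double rmax = 0.0;
        for (int i = 0; i < n; i++) rmax = fmax(rmax, fabs(r[i]));
        if (rmax < 1e-14) break;      /* crude convergence test   */
        /* Correction solve in single precision, reusing the LU.  */
        for (int i = 0; i < n; i++) c[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ip, c, n);
        for (int i = 0; i < n; i++) x[i] += (double)c[i];
    }
    free(As); free(c); free(r); free(ip);
    return 0;
}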

Out of GPU Memory Algorithms

[Figure: DGETRF performance in GFlop/s (0-350) vs. matrix size, for matrices that do not fit in a specified GPU memory. Keeneland node: NVIDIA M2090 GPU (14 MPs @ 1.3 GHz, peak 583 GFlop/s) + 2 x 6-core Intel Xeon X5660 @ 2.8 GHz.]

Out-of-GPU-memory algorithms can now solve large problems that do not fit in the GPU memory.

Out of GPU Memory Algorithms

•  Perform left-looking factorizations on sub-matrices that fit in the GPU memory (using existing algorithms)

• The rest of the matrix stays on the CPU
• Left-looking versions minimize writing on the CPU

1) Copy A2 to the GPU
2) Update A2 using A1 (a panel of A1 at a time)
3) Factor the updated A2 using the existing hybrid code
4) Copy the factored A2 back to the CPU

Trivially extended to multi-GPUs: A2 is "larger", with a 1-D block-cyclic distribution, again reusing existing algorithms (a schematic sketch follows the figure below).

[Figure: matrix layout — the factored sub-matrix A1 on the CPU, the sub-matrix A2 to be factored on the GPU, and the untouched part of the matrix.]
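A schematic sketch of steps 1)-4) above; every helper here (copy_block_cols_to_gpu, apply_panel_update, hybrid_getrf_gpu, ...) is a hypothetical stand-in rather than the MAGMA API, since the point is only the loop structure:

/* Schematic sketch of the out-of-GPU-memory left-looking LU.
   All helpers below are hypothetical stand-ins, not MAGMA calls. */
void copy_block_cols_to_gpu(double *A, int lda, int j, int w);
void copy_block_cols_to_cpu(double *A, int lda, int j, int w);
void apply_panel_update(int p, int pw, int j, int w); /* A2 -= part of A1 */
void hybrid_getrf_gpu(int j, int w); /* existing in-memory hybrid LU */

void ooc_getrf(int n, double *A, int lda, int gpu_cols, int nb)
{
    for (int j = 0; j < n; j += gpu_cols) {
        int w = (gpu_cols < n - j) ? gpu_cols : n - j;
        copy_block_cols_to_gpu(A, lda, j, w);   /* 1) A2 to GPU   */
        for (int p = 0; p < j; p += nb)         /* 2) update A2   */
            apply_panel_update(p, nb, j, w);    /*    one panel   */
                                                /*    of A1 at a  */
                                                /*    time        */
        hybrid_getrf_gpu(j, w);                 /* 3) factor A2   */
        copy_block_cols_to_cpu(A, lda, j, w);   /* 4) A2 to CPU   */
    }
}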

MultiGPU Support

• Data distribution
  – 1-D block-cyclic distribution

• Algorithm
  – The GPU holding the current panel sends it to the CPU
  – All updates are done in parallel on the GPUs
  – Look-ahead is done with the GPU holding the next panel

[Figure: 1-D block-cyclic distribution of nb-wide block columns across GPU 0, GPU 1, GPU 2, GPU 0, ...]
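The distribution in the figure reduces to a one-line ownership rule; a minimal sketch (panel_owner is a hypothetical name, not a MAGMA routine):

/* Block column k (width nb) is owned by GPU k mod ngpu, so with
   ngpu = 3 panels 0, 3, 6, ... sit on GPU 0, panels 1, 4, 7, ...
   on GPU 1, and so on; the trailing updates at each step can then
   run on all GPUs in parallel.                                   */
static inline int panel_owner(int k, int ngpu) { return k % ngpu; }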

LU Factorization on multiGPUs in DP

[Figure: DP LU factorization performance in GFlop/s (0-1000) vs. matrix size for 1, 2, and 3 GPUs and for the CPU (MKL). Keeneland node: NVIDIA M2090 GPUs (14 MPs @ 1.3 GHz, peak 583 GFlop/s each) + 2 x 6-core Intel Xeon X5660 @ 2.8 GHz.]

LU Factorization on Kepler in DP

[Figure: the same plot with a Kepler (K20X) curve added to the 1-3 M2090 GPU and CPU (MKL) curves, in GFlop/s (0-1000) vs. matrix size.]

LU Scalability on Kepler in DP

[Figure: DP LU performance in GFlop/s (0-1800) vs. matrix size for MAGMA (host + 1 GPU), MAGMA (host + 2 GPUs), and MKL (host). CPU: Intel Xeon E5-2670 (16 cores @ 2.60 GHz); GPU: K20X (14 MPs x 192 cores @ 0.73 GHz).]

Eigenproblem Solvers in MAGMA

• A x = λ x arises in:
  – Quantum mechanics (Schrödinger equation)
  – Quantum chemistry
  – Principal component analysis (in data mining)
  – Vibration analysis (of mechanical structures)
  – Image processing, compression, face recognition
  – Eigenvalues of graphs, e.g., in Google's PageRank
  – ...

• Need to solve it fast

Current MAGMA results: MAGMA with 1 GPU can be up to 12x faster than vendor libraries on state-of-the-art multicore systems.

T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems, ICL Technical Report, 03/2012.

J. Dongarra, A. Haidar, T. Schulthess, R. Solca, and S. Tomov, A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks, ICL Technical Report, 03/2012.

Toward fast Eigensolver

[Figure: tridiagonal reduction (DSYTRD) performance in Gflop/s (0-480) vs. matrix size (2k-30k), for DSYTRD_MKL and for DSYTRD with 1, 2, and 3 GPUs. Keeneland system, one node: 3 NVIDIA M2090 GPUs (@ 1.1 GHz, 5.4 GB) and 2 x 6 Intel Xeon X5660 cores (@ 2.8 GHz, 23 GB). Gflop/s computed as (n³/3)/time; higher is faster.]

A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks, ICL Technical Report, 03/2012.

Toward fast Eigensolver

[Figure: the same DSYTRD plot with the two-stage curves added — DSYTRD_2stages with 1, 2, and 3 GPUs alongside the one-stage DSYTRD 1-3 GPU and DSYTRD_MKL curves.]

[Figure: sparsity patterns of a 60 x 60 symmetric test matrix during the two-stage reduction. First stage: dense (nz = 1016) to band (nz = 360); second stage: band to tridiagonal (nz = 178).]


Characteristics
• Stage 1: BLAS-3, increased computational intensity
• Stage 2: BLAS-1.5, new cache-friendly kernel
• 4x to 12x faster than the standard one-stage approach
• Bottleneck: if all eigenvectors are required, there is the extra cost of one additional back transformation (see below)
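Where that extra back transformation comes from (standard two-stage reduction algebra, sketched here; not spelled out on the slide): stage one reduces the dense symmetric A to a band matrix B, stage two reduces B to tridiagonal T,

    A = Q_1 B Q_1^T, \qquad B = Q_2 T Q_2^T
    \;\Longrightarrow\; A = (Q_1 Q_2)\, T\, (Q_1 Q_2)^T .

An eigenpair T z = \lambda z therefore yields an eigenvector of A as x = Q_1 (Q_2 z); when all eigenvectors are requested, applying Q_2 is one back transformation more than the one-stage reduction requires.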


Current and Future Directions

• Synchronization-avoiding algorithms using dynamic runtime systems

[Figure: execution trace of POTRF, TRTRI, and LAUUM on 48 cores; the matrix is 4000 x 4000, the tile size 200 x 200.]

Current and Future Directions

High productivity with dynamic runtime systems: from sequential nested-loop code to parallel execution

for (k = 0; k < min(MT, NT); k++) {
    zgeqrt(A[k][k], ...);
    for (n = k+1; n < NT; n++)
        zunmqr(A[k][k], A[k][n], ...);
    for (m = k+1; m < MT; m++) {
        ztsqrt(A[k][k], A[m][k], ...);
        for (n = k+1; n < NT; n++)
            ztsmqr(A[m][k], A[k][n], A[m][n], ...);
    }
}

The same loop nest, with each kernel call replaced by a task insertion into the runtime system:

for (k = 0; k < min(MT, NT); k++) {
    starpu_Insert_Task(&cl_zgeqrt, k, k, ...);
    for (n = k+1; n < NT; n++)
        starpu_Insert_Task(&cl_zunmqr, k, n, ...);
    for (m = k+1; m < MT; m++) {
        starpu_Insert_Task(&cl_ztsqrt, m, k, ...);
        for (n = k+1; n < NT; n++)
            starpu_Insert_Task(&cl_ztsmqr, m, n, k, ...);
    }
}

The StarPU runtime system is used to exploit heterogeneous architectures (multicore + GPUs), or Quark if only targeting multicore; a sketch of a codelet declaration follows.
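For concreteness, a hedged sketch of what a codelet such as cl_zgeqrt could look like; the struct fields are from the StarPU 1.x API, while the two kernel wrappers are hypothetical stand-ins for the CPU and CUDA implementations:

#include <starpu.h>

/* Hypothetical wrappers: unpack the tile handles passed at task
   insertion time and call the CPU or GPU zgeqrt kernel.          */
void zgeqrt_cpu_func(void *buffers[], void *cl_arg);
void zgeqrt_cuda_func(void *buffers[], void *cl_arg);

/* One codelet lists every implementation of a kernel; the runtime
   picks the CPU or CUDA version for each inserted task according
   to resource availability and its scheduling policy.            */
static struct starpu_codelet cl_zgeqrt = {
    .cpu_funcs  = { zgeqrt_cpu_func },   /* multicore version */
    .cuda_funcs = { zgeqrt_cuda_func },  /* GPU version       */
    .nbuffers   = 2,                     /* tiles accessed    */
    .modes      = { STARPU_RW, STARPU_W },
};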

Matrices Over Runtime Systems @ Exascale (MORSE)

• Mission statement: "Design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale hybrid systems"

• Runtime challenges due to the ever-growing hardware complexity

• Algorithmic challenges to exploit the hardware capabilities to the fullest

• Integrated into the MAGMA software stack
  – MAGMA 1.3 includes Level 3 BLAS, linear and least squares solvers

Collaborators and Support

MAGMA team http://icl.cs.utk.edu/magma

PLASMA team http://icl.cs.utk.edu/plasma

Collaborating partners
– University of Tennessee, Knoxville
– University of California, Berkeley
– University of Colorado, Denver
– INRIA, France (StarPU team)
– KAUST, Saudi Arabia