Optimal Algorithm Selection of Parallel Sparse Matrix-Vector
Multiplication Is Important
Makoto Kudoh*1, Hisayasu Kuroda*1,
Takahiro Katagiri*2, Yasumasa Kanada*1
*1 The University of Tokyo
*2 PRESTO, Japan Science and Technology Corporation
Introduction
Sparse Matrix-Vector Multiplication (SpMxV)
(A is a sparse matrix, x is a dense vector)
Basic computational kernel used in scientific computations
- e.g., iterative solvers for linear systems, eigenvalue problems
Ax  (A ∈ R^(n×n), x ∈ R^n)
Large-scale SpMxV problems require parallel processing
Parallel Sparse Matrix-Vector Multiplication
Calculation of Parallel Sparse Matrix-Vector Multiplication
Two-phase computation: data communication and local computation
[Figure: row block distribution of y = Ax across PE0-PE3; each PE's block is stored in compressed sparse row (CSR) format with rowptr, colind, and value arrays.]
[Figure: phase 1, vector data communication among PE0-PE3; phase 2, local computation of each PE's row block times x.]
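To make the local computation phase concrete, the following is a minimal CSR SpMxV kernel in C (the document's implementation language). The array names follow the figure above, but the routine itself is an illustrative sketch, not the authors' actual code.

/* Local computation phase: multiply this PE's row block (CSR format)
 * by the gathered vector x. Illustrative sketch only. */
void spmxv_local(int nrows, const int *rowptr, const int *colind,
                 const double *value, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        /* Indirect access: colind[j] picks scattered elements of x. */
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += value[j] * x[colind[j]];
        y[i] = sum;
    }
}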
Optimization of Parallel SpMxV
Many optimization algorithms for SpMxV have been proposed
BUT: the effect depends strongly on the non-zero structure of the matrix and the machine's architecture
Optimal algorithm selection is important
Poor performance compared with dense matrix kernels:
- Increased memory references to matrix data, caused by indirect access
- Irregular memory access pattern to vector x
Related Work
Library approach: PSPARSLIB, PETSc, ILIB, etc.
- Fixed optimization algorithm
- Works on parallel systems
Compiler approach: SPARSITY, sparse compiler, etc.
- Generates code optimized for the matrix and machine
- Does not work on parallel systems
The purpose of our work
- Include several algorithms for local computation and data communication
- Measure the performance of each algorithm exhaustively and select the best one for the matrix and machine
- Algorithm selection time is not a concern
Compare the performance of the best algorithm for each matrix and machine (our program) with that of a fixed algorithm used for all matrices and machines
Optimization algorithms of our program
Algorithms implemented in our routine:
Local computation
- Register Blocking
- Diagonal Blocking
- Unrolling
Data communication
- Allgather Communication
- Range-Limited Communication
- Minimum Data Size Communication
Register Blocking (Local Computation 1/3)
Extract small dense blocks and make a blocked matrix
- Reduces the number of load instructions
- Increases temporal locality of access to the source vector
Abbreviate m×n register blocking as Rmxn: R1x2, R1x3, R1x4, R2x1, R2x2, R2x3, R2x4, R3x1, R3x2, R3x3, R3x4, R4x1, R4x2, R4x3, R4x4
[Figure: original matrix = blocked matrix + remaining matrix]
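As an illustration, an R2x2 kernel might look like the sketch below. It assumes a hypothetical blocked layout (browptr, bcolind, bvalue) in which each 2x2 block stores four values under a single column index; it is not the authors' implementation.

/* Sketch of an R2x2 kernel over the blocked part of the matrix.
 * One column index per block; x[col], x[col+1] stay in registers. */
void spmxv_r2x2(int nblockrows, const int *browptr, const int *bcolind,
                const double *bvalue, const double *x, double *y)
{
    for (int bi = 0; bi < nblockrows; bi++) {
        double y0 = 0.0, y1 = 0.0;
        for (int b = browptr[bi]; b < browptr[bi + 1]; b++) {
            const double *v = &bvalue[4 * b];   /* 2x2 block, row-major */
            double x0 = x[bcolind[b]], x1 = x[bcolind[b] + 1];
            y0 += v[0] * x0 + v[1] * x1;
            y1 += v[2] * x0 + v[3] * x1;
        }
        y[2 * bi]     += y0;   /* the remaining matrix is added separately */
        y[2 * bi + 1] += y1;
    }
}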
Diagonal Blocking (Local Computation 2/3)
For matrices with a dense non-zero structure around the diagonal
Block the diagonal part and treat it as a dense band matrix
- Reduces the number of load instructions
- Optimizes register and cache access
Abbreviate size-n diagonal blocking as Dn: D3, D5, D7, D9, D11, D13, D15, D17, D19
[Figure: original matrix = blocked matrix + remaining matrix]
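A Dn band kernel might look like the following sketch, assuming a hypothetical dense row-major band array (explicit zeros included) and leaving boundary rows and out-of-band entries to the remaining matrix.

/* Sketch of a Dn kernel, n = 2*hw + 1: the band around the diagonal is
 * stored densely, row by row. Interior rows only; boundary rows are
 * handled by the remaining matrix in this sketch. */
void spmxv_diag_band(int nrows, int hw, const double *band,
                     const double *x, double *y)
{
    int w = 2 * hw + 1;                       /* band width n */
    for (int i = hw; i < nrows - hw; i++) {
        double sum = 0.0;
        for (int k = 0; k < w; k++)           /* contiguous access to x */
            sum += band[i * w + k] * x[i - hw + k];
        y[i] += sum;
    }
}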
Unrolling (Local Computation 3/3)
Just unroll the inner loop
- Reduces loop overhead
- Exploits instruction-level parallelism
Abbreviate unrolling level n as Un: U1, U2, U3, U4, U5, U6, U7, U8
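For example, a U4 variant of the CSR kernel might look like this illustrative sketch, with a scalar cleanup loop for the leftover non-zeros of each row.

/* Sketch of the U4 variant: CSR inner loop unrolled four times. */
void spmxv_u4(int nrows, const int *rowptr, const int *colind,
              const double *value, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        int j = rowptr[i], end = rowptr[i + 1];
        for (; j + 4 <= end; j += 4)          /* unrolled body */
            sum += value[j]     * x[colind[j]]
                 + value[j + 1] * x[colind[j + 1]]
                 + value[j + 2] * x[colind[j + 2]]
                 + value[j + 3] * x[colind[j + 3]];
        for (; j < end; j++)                  /* remainder */
            sum += value[j] * x[colind[j]];
        y[i] = sum;
    }
}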
Allgather Communication (data communication 1/3)
Each processor sends all vector data to all other processors
Easy to implement (with MPI_Allgather)
Drawback: the communication data size is very large
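A minimal sketch of this scheme using MPI_Allgather, assuming every PE holds an equal-sized block of x (the general case with uneven blocks would use MPI_Allgatherv):

#include <mpi.h>

/* Each PE contributes its local block of x and receives the whole vector. */
void gather_vector(const double *x_local, int n_local, double *x_full)
{
    MPI_Allgather((void *)x_local, n_local, MPI_DOUBLE,
                  x_full, n_local, MPI_DOUBLE, MPI_COMM_WORLD);
}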
Range-limited Communication (data communication 2/3)
- Send only the minimum contiguous required block; unnecessary processor pairs do not communicate
- Small CPU-time overhead, since data rearrangement is unnecessary
- Drawback: the communication data size is not minimal for most matrices
[Figure: PE0 sends a contiguous block of its vector directly to PE1]
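A sketch of one pairwise exchange in this scheme, sending the contiguous range directly from the vector with no packing; the names (send_lo, recv_cnt, etc.) are hypothetical.

#include <mpi.h>

/* Exchange contiguous vector ranges with one peer (Irecv posted first). */
void exchange_range(const double *x, int send_lo, int send_cnt,
                    double *x_full, int recv_lo, int recv_cnt, int peer)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    MPI_Irecv(x_full + recv_lo, recv_cnt, MPI_DOUBLE, peer, 0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend((void *)(x + send_lo), send_cnt, MPI_DOUBLE, peer, 0,
              MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);
}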
Minimum Data Size Communication (data communication 3/3)
- Communicate only the required elements; needs 'pack' and 'unpack' operations before and after communication
- The communication data size is minimal, but the 'pack' and 'unpack' operations add a little CPU-time overhead
[Figure: the sender packs the required vector elements into a send buffer; the receiver unpacks the received buffer into its vector copy]
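A sketch of one pairwise exchange with pack and unpack, in the IrecvIsend order; the index arrays send_idx and recv_idx are hypothetical names for the lists of elements the peer needs and this PE needs.

#include <mpi.h>

/* Exchange only the required elements with one peer. */
void exchange_min(const double *x, double *x_recv,
                  const int *send_idx, int nsend, double *sendbuf,
                  const int *recv_idx, int nrecv, double *recvbuf,
                  int peer)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    MPI_Irecv(recvbuf, nrecv, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    for (int k = 0; k < nsend; k++)          /* pack required elements */
        sendbuf[k] = x[send_idx[k]];
    MPI_Isend(sendbuf, nsend, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);
    for (int k = 0; k < nrecv; k++)          /* unpack into local copy */
        x_recv[recv_idx[k]] = recvbuf[k];
}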
Implementation of Communication
Use the MPI library
Three implementations of one-to-one communication: Send-Recv, Isend-Irecv, Irecv-Isend
This gives three implementations each of range-limited and minimum data size communication, for seven data communication algorithms in total:
- Allgather
- SendRecv-range, IsendIrecv-range, IrecvIsend-range
- SendRecv-min, IsendIrecv-min, IrecvIsend-min
Methodology of Selecting Optimal Algorithm
Measure the times of local computation and data communication independently
- When combined, the independently fastest pair is not necessarily the fastest overall
1. Measure the time of each data communication algorithm and select the best one
2. Combine each local computation algorithm with the best data communication algorithm, measure the time, and select the best combination
Selection is done at runtime, since the characteristics of the matrix cannot be detected until runtime
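A sketch of how this two-step selection might be coded, using MPI_Wtime for timing. The function-pointer tables are hypothetical, and a real implementation would repeat each measurement and reduce the times across PEs.

#include <mpi.h>

typedef void (*comm_fn)(void);    /* candidate communication routines */
typedef void (*local_fn)(void);   /* candidate local computation routines */

void select_best(comm_fn comm[], int nc, local_fn local[], int nl,
                 int *best_c, int *best_l)
{
    double t, tmin;
    /* Step 1: time each data communication algorithm alone. */
    tmin = 1e30;
    for (int c = 0; c < nc; c++) {
        MPI_Barrier(MPI_COMM_WORLD);
        t = MPI_Wtime();
        comm[c]();
        t = MPI_Wtime() - t;
        if (t < tmin) { tmin = t; *best_c = c; }
    }
    /* Step 2: combine each local computation with the best communication. */
    tmin = 1e30;
    for (int l = 0; l < nl; l++) {
        MPI_Barrier(MPI_COMM_WORLD);
        t = MPI_Wtime();
        comm[*best_c]();
        local[l]();
        t = MPI_Wtime() - t;
        if (t < tmin) { tmin = t; *best_l = l; }
    }
}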
Default Fixed Algorithms
No. | Local computation | Data communication
1   | U1                | Allgather
2   | R2x2              | Allgather
3   | U1                | IrecvIsend-min
4   | R2x2              | IrecvIsend-min

Local computation: U1 and R2x2
Data communication: Allgather and IrecvIsend-min
Experimental Environment
Name | Processor | # of PEs | Network | Compiler | Option
PC-Cluster | Pentium III 800 MHz | 8 | 100base-T Ethernet | GCC 2.95.2 | -O3
Sun Enterprise 3500 | UltraSPARC II 336 MHz | 8 | SMP | WorkShop Compilers 5.0 | -xO5
COMPAQ AlphaServer GS80 | Alpha 21264 731 MHz | 8 | SMP | Compaq C 6.3-027 | -fast
SGI 2100 | MIPS R12000 350 MHz | 8 | DSM | MIPSpro C 7.30 | -64 -O3
HITACHI HA8000-ex880 | Intel Itanium 800 MHz | 8 | SMP | Intel Itanium Compiler 5.0.1 | -O3

Language: C
Communication library: MPI (MPICH 1.2.1)
Test Matrices
From Tim Davis' sparse matrix collection:

No. | Name     | Explanation                               | Dimension | Non-zeros
1   | 3dtube   | 3-D pressure tube                         | 45,330    | 3,213,618
2   | cfd1     | Symmetric pressure matrix                 | 70,656    | 1,828,364
3   | crystk03 | FEM crystal vibration                     | 24,696    | 1,751,178
4   | venkat01 | Unstructured 2-D Euler solver             | 62,424    | 1,717,792
5   | bcsstk35 | Automobile seat frame and body attachment | 30,237    | 1,450,163
6   | cfd2     | Symmetric pressure matrix                 | 123,440   | 3,087,898
7   | ct20stif | Stiffness matrix                          | 52,329    | 2,698,463
8   | nasasrb  | Shuttle rocket booster                    | 54,870    | 2,677,324
9   | raefsky3 | Fluid-structure interaction turbulence    | 21,200    | 1,488,768
10  | pwtk     | Pressurized wind tunnel                   | 217,918   | 11,634,424
11  | gearbox  | Aircraft flap actuator                    | 153,746   | 9,080,404

[Figure: non-zero patterns of cfd1, ct20stif, and gearbox]
Result of Matrix No.2
[Figure: local computation time and communication time (msec) of def1-def4 and opt on the four platforms. Selected communication algorithms: PentiumIII-Ethernet, IrecvIsend-min; Alpha-SMP, IrecvIsend-range; MIPS-DSM, IrecvIsend-range; Itanium-SMP, IsendIrecv-range. Bars are labeled with the selected local computation algorithms (R2x4, U2, R1x3, D7, etc.).]
Result of Matrix No.7
[Figure: local computation time and communication time (msec) of def1-def4 and opt on the four platforms. Selected communication algorithms: PentiumIII-Ethernet, IsendIrecv-min; Alpha-SMP, SendRecv-min; MIPS-DSM, IrecvIsend-min; Itanium-SMP, SendRecv-min. Bars are labeled with the selected local computation algorithms (R2x3, R3x3, U1, D9, etc.).]
Result of Matrix No.11
[Figure: local computation time and communication time (msec) of def1-def4 and opt on the four platforms. Selected communication algorithms: PentiumIII-Ethernet, IsendIrecv-min; Alpha-SMP, SendRecv-min; MIPS-DSM, SendRecv-min; Itanium-SMP, SendRecv-min. Bars are labeled with the selected local computation algorithms (mostly R3x3, with D5-D15 variants).]
Summary of Experiment
Summary of speed-up (best selected algorithm vs. each fixed default):

Machine             | def1 | def2 | def3 | def4
PC-cluster          | 8.16 | 7.90 | 1.32 | 1.05
Sun Enterprise 3500 | 2.82 | 3.07 | 1.35 | 1.58
COMPAQ              | 3.56 | 3.10 | 1.59 | 1.44
SGI                 | 3.73 | 3.33 | 1.61 | 1.36
Hitachi             | 2.51 | 1.81 | 2.03 | 1.39
- The best algorithm depends strongly on the characteristics of the matrix and the machine
- Obtained at least a 1.05x speed-up compared with the fixed default algorithms
Conclusion and Future Work
Compared the performance of the best algorithm with that of typical fixed algorithms
Obtained meaningful speed-ups by selecting the best algorithm
Selecting the optimal algorithm according to the characteristics of the matrix and machine is important
Future work: create a lightweight method of selecting the algorithm; currently, selection takes time equivalent to hundreds of SpMxV executions