Toward Mega-Scale Computing with pMatlab


Slide 1

Toward Mega-Scale Computing with pMatlab

Chansup Byun and Jeremy Kepner

MIT Lincoln Laboratory

Vipin Sachdeva and Kirk E. Jordan

IBM T.J. Watson Research Center

HPEC 2010

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Slide 2

Outline

• What is Parallel Matlab (pMatlab)

• IBM Blue Gene/P System
• BG/P Application Paths
• Porting pMatlab to BG/P

• Introduction

• Performance Studies

• Optimization for Large Scale Computation

• Summary

Slide 3

Parallel Matlab (pMatlab)

[Diagram: layered architecture. The application (input, analysis, output) is written against the library layer (pMatlab: vector/matrix, comp, task, conduit), which runs on the kernel layer (math: MATLAB/Octave; messaging: MatlabMPI) and maps onto parallel hardware. The user interface sits at the top, the hardware interface at the bottom.]

Layered architecture for parallel computing
• Kernel layer does single-node math & parallel messaging
• Library layer provides a parallel data and computation toolbox to Matlab users
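To make the library layer concrete, here is a minimal pMatlab-style sketch of a distributed computation (assuming the standard pMatlab map/local/agg interface; the array size and map grid are arbitrary):

    % Distribute an N x N array over Np processes and operate on the local part.
    % pMatlab provides Np, Pid, map(), zeros(..., map), local(), put_local(), agg().
    N = 1024;                            % global problem size (illustrative)
    Amap = map([Np 1], {}, 0:Np-1);      % 1-D block distribution over rows
    A = zeros(N, N, Amap);               % distributed N x N array
    Aloc = local(A);                     % this process's block
    Aloc = Aloc .^ 2;                    % purely local computation
    A = put_local(A, Aloc);              % write the block back
    Aall = agg(A);                       % leader process gathers the global array

The same m-file runs serially when Np = 1 and in parallel when launched through pRUN, which is how the examples later in this talk are started.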

Slide 4

IBM Blue Gene/P System

• Core speed: 850 MHz
• LLGrid core count: ~1K cores
• Blue Gene/P core count: ~300K cores

Slide 5

Blue Gene Application Paths

[Diagram: in the Blue Gene environment, serial and pleasantly parallel apps map to High Throughput Computing (HTC); highly scalable message-passing apps map to High Performance Computing (MPI).]

• High Throughput Computing (HTC)
– Enabling a BG partition for many single-node jobs
– Ideal for “pleasantly parallel” type applications

Slide 6

HTC Node Modes on BG/P

• Symmetrical Multiprocessing (SMP) mode
– One process per compute node
– Full node memory available to the process

• Dual mode
– Two processes per compute node
– Half of the node memory per process

• Virtual Node (VN) mode
– Four processes per compute node (one per core)
– One quarter of the node memory per process

Slide 7

Porting pMatlab to BG/P System

• Requesting and booting a BG partition in HTC mode
– Execute the “qsub” command, defining the number of processes, the runtime, and the HTC boot script:
    htcpartition --trace 7 --boot --mode dual \
        --partition $COBALT_PARTNAME
– Wait for the partition to be ready (until the boot completes)

• Running jobs
– Create and execute a Unix shell script to run a series of “submit” commands, including:
    submit -mode dual -pool ANL-R00-M1-512 \
        -cwd /path/to/working/dir -exe /path/to/octave \
        -env LD_LIBRARY_PATH=/home/cbyun/lib \
        -args "--traditional MatMPI/MatMPIdefs523.m"

• Combine the two steps:
    eval(pRUN('m_file', Nprocs, 'bluegene-smp'))
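The boot-and-submit sequence can also be driven from Octave itself. The following is a hypothetical sketch of such a wrapper (the function name run_htc_job, the per-process defs file naming, and the hard-coded pool are illustrative; they are not part of pMatlab):

    function run_htc_job(nprocs)
    % Hypothetical wrapper: boot an HTC partition, then issue one 'submit'
    % command per process, mirroring the qsub/submit commands shown above.
    boot_cmd = sprintf(['htcpartition --trace 7 --boot --mode dual ' ...
                        '--partition %s'], getenv('COBALT_PARTNAME'));
    system(boot_cmd);                     % boot the partition and wait for it
    for p = 1:nprocs
      defs = sprintf('MatMPI/MatMPIdefs%d.m', p - 1);  % illustrative per-process defs file
      cmd = sprintf(['submit -mode dual -pool ANL-R00-M1-512 ' ...
                     '-cwd %s -exe /path/to/octave ' ...
                     '-env LD_LIBRARY_PATH=/home/cbyun/lib ' ...
                     '-args "--traditional %s"'], pwd, defs);
      system(cmd);                        % launch one Octave instance
    end
    end

In the actual port these steps are hidden behind the 'bluegene-smp' run configuration passed to pRUN, as shown above.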

Slide 8

Outline

• Single Process Performance
• Point-to-Point Communication
• Scalability

• Introduction

• Performance Studies

• Optimization for Large Scale Computation

• Summary

Slide 9

Performance Studies

• Single Processor Performance
– MandelBrot
– ZoomImage
– Beamformer
– Blurimage
– Fast Fourier Transform (FFT)
– High Performance LINPACK (HPL)

• Point-to-Point Communication
– pSpeed

• Scalability
– Parallel Stream Benchmark: pStream

Slide 10

Single Process Performance: Intel Xeon vs. IBM PowerPC 450

[Bar chart: time relative to Matlab (lower is better) for the MandelBrot, ZoomImage*, Beamformer, Blurimage*, FFT, and HPL benchmarks. Series: Matlab 2009b (LLGrid), Octave 3.2.2 (LLGrid), Octave 3.2.2 (IBM BG/P), and Octave 3.2.2 (IBM BG/P, clock normalized). Annotated Matlab reference times: 26.4 s, 11.9 s, 18.8 s, 6.1 s, 1.5 s, and 5.2 s.]

* conv2() performance issue in Octave has been improved in a subsequent release

Slide 11

Octave Performance With Optimized BLAS

[Chart: DGEMM performance comparison, MFLOPS vs. matrix size (N x N).]
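This kind of measurement is easy to reproduce; a simple sketch in Octave/Matlab (matrix sizes are arbitrary; with Octave linked against an optimized BLAS such as ESSL, the multiply below maps to DGEMM):

    % Estimate DGEMM throughput from a dense double-precision multiply.
    for N = [256 512 1024 2048]
      A = rand(N);  B = rand(N);
      tic;
      C = A * B;                        % calls DGEMM in the underlying BLAS
      secs = toc;
      mflops = 2 * N^3 / secs / 1e6;    % ~2*N^3 floating-point operations
      fprintf('N = %4d  %8.1f MFLOPS\n', N, mflops);
    end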

Slide 12

Single Process Performance: Stream Benchmark

[Bar chart: performance relative to Matlab (higher is better) for the Scale (b = q*c), Add (c = a + c), and Triad (a = b + q*c) operations. Series: Matlab 2009b (LLGrid), Octave 3.2.2 (LLGrid), Octave 3.2.2 (IBM BG/P), and Octave 3.2.2 (IBM BG/P, clock normalized). Annotated Matlab bandwidths: 944 MB/s, 1208 MB/s, and 996 MB/s.]

Slide 13

Point-to-Point Communication

[Diagram: processes Pid = 0, 1, 2, 3, ..., Np-1 arranged in a ring.]

• pMatlab example: pSpeed
– Send/receive messages to/from the neighbor
– Messages are files in pMatlab
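Since MatlabMPI implements message passing through the file system, a pSpeed-style exchange reduces to file writes and reads. A minimal sketch of one ring exchange, assuming the standard MatlabMPI MPI_Send/MPI_Recv interface (the tag and payload size are arbitrary):

    % Ring exchange: each process sends to its right neighbor and
    % receives from its left neighbor using MatlabMPI-style calls.
    MPI_Init;
    comm = MPI_COMM_WORLD;
    Np   = MPI_Comm_size(comm);
    Pid  = MPI_Comm_rank(comm);
    dest = mod(Pid + 1, Np);                % right neighbor
    src  = mod(Pid - 1 + Np, Np);           % left neighbor
    tag  = 1;
    msg  = rand(1, 2^10);                   % payload (size is arbitrary)
    MPI_Send(dest, tag, comm, msg);         % writes the message to a file
    recvd = MPI_Recv(src, tag, comm);       % polls for and reads the file
    MPI_Finalize;

Because every message is a file, the bandwidth pSpeed reports is largely a measure of the underlying file system, which is why the Mode S / Mode M distinction on the next slide matters.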

Slide 14

Filesystem Consideration

• A single NFS-shared disk (Mode S)


• A group of cross-mounted, NFS-shared disks to distribute messages (Mode M)

Slide 15

pSpeed Performance on LLGrid: Mode S

[Four panels, one per process configuration (Matlab 2009b: 1x2, 2x1, 4x1, 8x1): bandwidth (bytes/sec, higher is better) vs. message size (8*2^N bytes, N = 1..24), with iterations 1 through 4 plotted separately.]

Slide 16

pSpeed Performance on LLGrid: Mode M

[Four panels comparing Matlab 2009b and Octave 3.2.2 in the 4x1 and 8x1 configurations: latency (seconds, lower is better) and bandwidth (bytes/sec, higher is better) vs. message size (8*2^N bytes, N = 1..24).]

Slide 17

pSpeed Performance on BG/P

BG/P Filesystem: GPFS

[Two panels for Octave 3.2.2 in the 2x1, 4x1, and 8x1 configurations: bandwidth (bytes/sec) and latency (seconds) vs. message size (8*2^N bytes, N = 1..24).]

Slide 18

pStream Results with Scaled Size

• SMP mode: initial global array size of 2^25 for Np = 1
– Global array size scales proportionally as the number of processes increases (1024x1)

[Chart: bandwidth (MB/sec) for Scale, Add, and Triad vs. number of processes (2^(N-1)). Result: 563 GB/sec, 100% efficiency at Np = 1024.]
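A sketch of how the weak-scaled stream triad can be written with pMatlab distributed vectors (illustrative only: the constant q, the per-process length, and the use of rand with a map argument follow the usual pMatlab conventions but are not taken from the actual pStream source):

    % Weak scaling: each process holds 2^25 elements, so the global
    % size grows with Np while the per-process work stays constant.
    Nloc = 2^25;  q = 3.14;
    Vmap = map([1 Np], {}, 0:Np-1);        % 1-D block distribution
    B = rand(1, Np*Nloc, Vmap);
    C = rand(1, Np*Nloc, Vmap);
    b = local(B);  c = local(C);
    tic;
    a = b + q .* c;                        % triad on the local block only
    secs = toc;
    mbytes = 3 * 8 * Nloc / 1e6;           % bytes moved: read b, read c, write a
    fprintf('Pid %d: %.1f MB/s\n', Pid, mbytes / secs);

Summing the per-process rates on the leader gives the aggregate bandwidth plotted above.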

Slide 19

pStream Results with Fixed Size

• Global array size of 2^30
– The number of processes scaled up to 16384 (4096x4)

• VN mode: 8.832 TB/sec, 101% efficiency at Np = 16384
• DUAL mode: 4.333 TB/sec, 96% efficiency at Np = 8192
• SMP mode: 2.208 TB/sec, 98% efficiency at Np = 4096

Slide 20

Outline

• Aggregation

• Introduction

• Performance Studies

• Optimization for Large Scale Computation

• Summary

Slide 21

Current Aggregation Architecture

• The leader process receives all the distributed data from other processes.

• All other processes send their portion of the distributed data to the leader process.

• The process is inherently sequential
– The leader receives Np-1 messages

[Diagram: leader process 0 receives a message directly from each of processes 1 through 7 (Np = 8). Np: total number of processes.]
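In MatlabMPI-style pseudocode, the sequential gather looks roughly like this (a sketch assuming a 1-D row distribution; comm, Np, and Pid are as set up by MatlabMPI, and local_part is this process's block; this is not the actual pMatlab agg() source):

    % Leader (Pid 0) collects the distributed pieces one at a time.
    tag = 0;
    if Pid == 0
      globalA = local_part;                       % leader's own block
      for p = 1:Np-1
        piece   = MPI_Recv(p, tag, comm);         % Np-1 receives, one after another
        globalA = [globalA; piece];               % append the next block of rows
      end
    else
      MPI_Send(0, tag, comm, local_part);         % everyone else sends to the leader
    end

With Np in the hundreds or thousands, the leader's Np-1 sequential receives become the bottleneck, which motivates the tree-based scheme on the next slide.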

Slide 22

Binary-Tree Based Aggregation

• BAGG: Distributed message collection using a binary tree
– The odd-numbered processes send a message to their even-numbered neighbor
– The even-numbered processes receive a message from their odd-numbered neighbor

[Diagram: binary-tree aggregation for Np = 8. Round 1 pairs (0,1), (2,3), (4,5), (6,7); round 2 pairs (0,2), (4,6); round 3 pairs (0,4); process 0 ends up with the full result.]

The maximum number of messages a process may send or receive is N, where Np = 2^N.
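A sketch of the binary-tree collection for Np = 2^N, consistent with the diagram above (illustrative, assuming a 1-D row distribution with comm, Np, Pid, and local_part as before; not the actual bagg() source):

    % Binary-tree aggregation: log2(Np) rounds instead of Np-1 sequential receives.
    tag   = 0;
    piece = local_part;
    step  = 1;
    while step < Np
      if mod(Pid, 2*step) == 0
        piece = [piece; MPI_Recv(Pid + step, tag, comm)];   % absorb the partner's data
      elseif mod(Pid, step) == 0
        MPI_Send(Pid - step, tag, comm, piece);             % hand off accumulated data
        break;                                              % this process is done
      end
      step = 2 * step;
    end
    % After the loop, Pid 0 holds the fully aggregated array in 'piece'.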

Slide 23

BAGG() Performance

[Chart: time (seconds) vs. number of CPUs (2^N) for agg() and bagg() on the IBRIX and LUSTRE file systems.]

• Two dimensional data and process distribution

• Two different file systems are used for performance comparison

– IBRIX: file system for users’ home directories
– LUSTRE: parallel file system for all computation

– IBRIX: bagg() is 10x faster than agg() at Np = 128
– LUSTRE: bagg() is 8x faster than agg() at Np = 128

Slide 24

BAGG() Performance, 2

[Chart: time (seconds) vs. number of CPUs (2^(N+2)) for agg() and bagg() on the GPFS file system.]

• Four dimensional data and process distribution

• With GPFS file system on IBM Blue Gene/P System (ANL’s Surveyor)

– From 8 processes to 1024 processes

bagg() is 2.5x faster than agg() at Np = 1024

Slide 25

Generalizing Binary-Tree Based Aggregation

• HAGG: Extend the binary tree to the next power-of-two number of processes
– Example: if Np = 6, the next power of two is Np* = 8
– Skip any messages from/to the fictitious Pids

[Diagram: the binary tree over the padded process set 0..7; messages to/from the fictitious Pids (6 and 7 when Np = 6) are skipped.]
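One way to realize this is to pad Np up to the next power of two and skip exchanges with partners that do not exist; a sketch of that idea (illustrative, with comm, Np, Pid, and local_part as before; not the actual hagg() source):

    % Generalized binary-tree aggregation for arbitrary Np.
    NpStar = 2^ceil(log2(Np));           % next power of two, e.g. Np = 6 -> NpStar = 8
    tag    = 0;
    piece  = local_part;
    step   = 1;
    while step < NpStar
      if mod(Pid, 2*step) == 0
        partner = Pid + step;
        if partner < Np                  % skip fictitious Pids
          piece = [piece; MPI_Recv(partner, tag, comm)];
        end
      elseif mod(Pid, step) == 0
        MPI_Send(Pid - step, tag, comm, piece);
        break;
      end
      step = 2 * step;
    end

The extra partner-existence checks are the bookkeeping cost mentioned on the next slide.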

Slide 26

BAGG() vs. HAGG()

• HAGG() generalizes BAGG()
– Removes the restriction (Np = 2^N) in BAGG()
– Additional costs associated with bookkeeping

• Performance comparison on two dimensional data and process distribution

[Chart: time (seconds) vs. number of CPUs (2^N) for agg(), bagg(), and hagg() on the IBRIX file system.]

~3x faster at Np=128

Slide 27

BAGG() vs. HAGG(), 2

[Chart: time (seconds) vs. number of CPUs (2^(N+2)) for bagg() and hagg() on the GPFS file system.]

• Performance comparison on four dimensional data and process distribution

• Performance difference is marginal in a dedicated environment
– SMP mode on the IBM Blue Gene/P system

Slide 28

BAGG() Performance with Crossmounts

• Significant performance improvement by reducing resource contention on the file system
– Performance is jittery because a production cluster was used for the performance tests

[Chart: total runtime (seconds, lower is better) vs. number of processes (2^(N-1)) for the old agg() on IBRIX, the new agg() on IBRIX, and the new agg() on a hybrid configuration (IBRIX + crossmounts).]

Slide 29

Summary

• pMatlab has been ported to IBM Blue Gene/P system

• Clock-normalized, single-process performance of Octave on the BG/P system is on par with Matlab

• For pMatlab point-to-point communication (pSpeed), file system performance is important.

– Performance is as expected with GPFS on BG/P

• Parallel Stream Benchmark scaled to 16384 processes

• Developed a new pMatlab aggregation function using a binary tree to scale beyond 1024 processes