High Performance Simulations of Electrochemical Models on the Cell Broadband Engine

MIT Lincoln Laboratory. High Performance Simulations of Electrochemical Models on the Cell Broadband Engine. James Geraci, HPEC Workshop, September 20, 2007. This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.


MIT Lincoln Laboratory

High Performance Simulations of Electrochemical Models

on the Cell Broadband Engine

James Geraci

HPEC Workshop

September 20, 2007

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government


Outline

• Introduction
– Introduction
– Modeling battery physics
– LU decomposition
– CELL Broadband Engine

• Out of Core Algorithm

• In Core Algorithm

• Summary


Introduction: Real Time Battery State of Health Estimation

•The way in which a battery ages is a strong function of battery geometry.

•The rate of self discharge is also a function of battery geometry.

[Figure: Lead Acid Battery]


Inside a Lead Acid Battery

[Figure: Face View Lead Dioxide Plate; Face View Lead Plate]

•Lead acid batteries are made up of stacks of lead dioxide and lead electrode pairs


Finite volume description of battery

[Figure: Edge view lead dioxide plate, with regions labeled Lead Dioxide, Sulfuric Acid, and Lead]

•The center volume's physics can be influenced by the physics of the volumes to the right, the left, above, and below

•The 5 point stencil gives rise to a banded matrix
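To make the link between the 5 point stencil and the banded matrix concrete, here is a minimal sketch. It only builds the coupling pattern (1 where two volumes interact), not the battery physics, and the row-major numbering of volumes and the function names are assumptions made for illustration.

```python
def five_point_pattern(nx, ny):
    """Coupling pattern for an nx-by-ny grid of finite volumes (1 = coupled)."""
    n = nx * ny
    A = [[0] * n for _ in range(n)]
    for y in range(ny):
        for x in range(nx):
            k = y * nx + x              # assumed row-major index of the center volume
            A[k][k] = 1                 # center
            if x > 0:
                A[k][k - 1] = 1         # left neighbor
            if x < nx - 1:
                A[k][k + 1] = 1         # right neighbor
            if y > 0:
                A[k][k - nx] = 1        # neighbor one grid row down
            if y < ny - 1:
                A[k][k + nx] = 1        # neighbor one grid row up
    return A


def half_bandwidth(A):
    """Largest distance of a non-zero entry from the diagonal."""
    return max(abs(i - j) for i, row in enumerate(A) for j, v in enumerate(row) if v)


pattern = five_point_pattern(4, 3)
print(half_bandwidth(pattern))
```

With this numbering the left and right neighbors sit next to the diagonal (the inner band), while the neighbors above and below sit nx entries away (the distant outer band), so the half bandwidth equals the grid width.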


Matlab spy plot of Banded Matrix

•Non-zero entries shown in blue

•Inner band around diagonal

•Distant narrow outer band

•For the physics of a lead acid battery, the matrix is not symmetric (shown)

•For the physics of a lead acid battery, the matrix is poorly conditioned (condition number 10^8-10^12), so double precision is needed, which rules out GPGPU
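A rough way to see why this conditioning calls for double precision is the common rule of thumb that a linear solve can lose about log10(condition number) decimal digits. The sketch below applies that rule; it is an estimate rather than an error bound, and the function name is hypothetical.

```python
import math


def digits_left(mantissa_bits, cond):
    """Rule-of-thumb decimal digits remaining after solving a system
    with the given condition number (an estimate, not a guarantee)."""
    machine_eps = 2.0 ** -mantissa_bits
    return -math.log10(machine_eps) - math.log10(cond)


for name, bits in (("single", 23), ("double", 52)):
    for cond in (1e8, 1e12):
        print(name, cond, round(digits_left(bits, cond), 1))
```

Under this estimate single precision has no reliable digits left at either end of the 10^8-10^12 range, while double precision keeps a few even at 10^12.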


LU Decomposition

• LU decomposition decomposes a matrix J into a lower triangular matrix L and an upper triangular matrix U

• L & U can be used to solve a system of linear equations Jx = r by forward elimination and back substitution

– Essentially Gaussian Elimination

• Often used on poorly conditioned systems where ‘iterative solvers’ can’t be used.

• Difficult to parallelize for small systems because of the fine grain nature of the parallelism involved.

• Banded LU is a special case of LU

– The matrix J has a special 'banded' data pattern.
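The factor-then-solve procedure above can be sketched in a few lines. This is a plain dense LU without pivoting, written for illustration only; the small non-symmetric test matrix and the function names are assumptions, not taken from the talk.

```python
def lu_decompose(J):
    """Return (L, U) with J = L*U, L unit lower triangular. No pivoting."""
    n = len(J)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [row[:] for row in J]
    for k in range(n - 1):
        for r in range(k + 1, n):
            m = U[r][k] / U[k][k]       # multiplier that eliminates U[r][k]
            L[r][k] = m
            for c in range(k, n):
                U[r][c] -= m * U[k][c]
    return L, U


def lu_solve(L, U, r):
    """Solve J x = r using forward elimination, then back substitution."""
    n = len(r)
    y = [0.0] * n
    for i in range(n):                  # forward: L y = r
        y[i] = r[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):        # backward: U x = y
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x


J = [[4.0, 1.0, 0.0], [2.0, 5.0, 1.0], [0.0, 3.0, 6.0]]
r = [6.0, 15.0, 24.0]
L, U = lu_decompose(J)
print(lu_solve(L, U, r))
```

The inner loops touch only a few entries per pivot, which is the fine grained parallelism the slide refers to.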


Cell Broadband Engine

•The Cell Broadband Engine is a new heterogeneous multicore processor that features high internal and off-chip bandwidth.


Cell Broadband Engine

[Block diagram: eight SPEs (each labeled SPU, SXU, LS, MFC) and one PPE (labeled PPU, L1, L2, PXU) attached to the EIB, together with the MIC]

EIB: 96B/cycle (3.2Gcycles/sec*96B/cycle = 286.1GB/sec)

Link labels: 16B/cycle (47.7GB/sec); 25GB/sec

Bandwidth:

•50GB/sec/SPU

•25GB/sec off chip

SPE Performance:

•14 Gflops double
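The bandwidth figures on this slide can be checked with a short calculation. The sketch assumes the slide's "GB" means 2^30 bytes, which is an inference: that is the unit under which 3.2 Gcycles/sec at 96 B/cycle comes out to 286.1.

```python
CLOCK_HZ = 3.2e9        # 3.2 Gcycles/sec, from the slide
BYTES_PER_GB = 2 ** 30  # assumption: binary gigabytes


def gb_per_sec(bytes_per_cycle):
    """Bandwidth of a link that moves bytes_per_cycle every clock cycle."""
    return bytes_per_cycle * CLOCK_HZ / BYTES_PER_GB


print(round(gb_per_sec(96), 1))  # EIB
print(round(gb_per_sec(16), 1))  # a single 16B/cycle link
```

Under that assumption the EIB works out to 286.1 and a 16B/cycle link to 47.7, matching the slide; the same link expressed in decimal gigabytes is 51.2, which is close to the 50GB/sec/SPU bullet.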


Outline

• Introduction

• Out of Core Algorithm
– Banded LU
– Performance
– Latency
– Synchronization

• In Core Algorithm

• Summary


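Banded LU: Out of Core Algorithm Explained

[Figure: matrix J at elimination step i, with the 2b × 2b active block highlighted; dimension labels hbW + 1 and i]

$$J(i+1:i+2b,\; i:i+2b) \;\leftarrow\; J(i+1:i+2b,\; i:i+2b) \;-\; \frac{J(i+1:i+2b,\; i)\; J(i,\; i:i+2b)}{J(i,i)}$$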
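The idea of updating only a small window of J at each step i can be sketched as follows. This is a generic in-place banded elimination without pivoting, not the talk's out of core implementation: the window here is set by a half bandwidth parameter `hbw` (an assumed name) rather than the slide's 2b block, and there are no DMA transfers.

```python
def banded_lu_inplace(J, hbw):
    """Overwrite J with its LU factors, touching only entries inside the band.
    L (unit diagonal) ends up below the diagonal, U on and above it."""
    n = len(J)
    for i in range(n - 1):
        hi = min(n, i + hbw + 1)        # row i and column i are zero beyond hi
        for r in range(i + 1, hi):
            m = J[r][i] / J[i][i]       # multiplier; no pivoting
            J[r][i] = m
            for c in range(i + 1, hi):
                J[r][c] -= m * J[i][c]  # rank-one update of the window
    return J


def reconstruct(F):
    """Multiply the packed factors back together to check J = L*U."""
    n = len(F)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else F[i][k]
                out[i][j] += l_ik * F[k][j]
    return out


def make_banded(n, hbw):
    """Small non-symmetric banded test matrix (values are arbitrary)."""
    return [[20.0 if i == j else (float(3 + i - j) if abs(i - j) <= hbw else 0.0)
             for j in range(n)] for i in range(n)]


J = make_banded(6, 2)
F = banded_lu_inplace([row[:] for row in J], 2)
print(max(abs(reconstruct(F)[i][j] - J[i][j]) for i in range(6) for j in range(6)))
```

Because nothing outside the window is read or written at step i, only that block needs to be resident in fast memory, which is what makes a get/compute/put scheme possible.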


Banded LU: Out of Core Algorithm Analyzed

[Figure: matrix J with the 2b × 2b block processed at step i. The slide tabulates, in terms of N and b, the compute per step (multiplies, subtracts), the memory accesses per step (gets, puts), and the memory doubles moved]

Synchronizations: N

•Compute/Memory Ratio = ½

•For a 16728x16728 matrix with band size of 420, almost 22 GB of data would have to be moved.
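As a plausibility check on the "almost 22 GB" figure, the sketch below assumes that one band-by-band block of doubles (420 × 420) is moved for each of the N elimination steps, and that GB means 2^30 bytes. Both are assumptions chosen because they reproduce the slide's number; the slide's own formula is not legible in this transcript.

```python
N = 16728               # matrix dimension, from the slide
BAND = 420              # band size, from the slide
BYTES_PER_DOUBLE = 8


def out_of_core_gb(n=N, band=BAND):
    """Data moved if one band x band block of doubles moves per step (assumed model)."""
    return n * band * band * BYTES_PER_DOUBLE / 2 ** 30


print(out_of_core_gb())
```

The result is just under 22, in line with the slide.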


Performance of Out of Core Algorithm

•Out of Core Algorithm outperforms UMFPACK on Opteron 246 based workstation.

•No appreciable gain in performance past 4 SPEs

[Figure, two panels for matrix size 16728x16728 w/ band size of 420: Gflops vs. number of SPEs (1 to 6), and compute time in seconds vs. number of SPEs (1 to 6). Series: out of core, Opteron 246: UMFPACK, linear speed up]


Latency for DMA put

[Figure: latency of a DMA put vs. message size, with 16 bytes and 1024 bytes marked]

•Significant performance hit for memory accesses smaller than 16 bytes

•Bandwidth limited region starts at 8x128 bytes


SPE to main memory bandwidth

[Figure: SPE to main memory bandwidth vs. message size, with 16 bytes and 1024 bytes marked and the theoretical maximum bandwidth of 25GB/sec shown]

•Theoretical maximum bandwidth can almost be achieved for larger message sizes


OCA Memory Access Size Dependence

•Out of Core performance is better when the memory access size is a multiple of 128 bytes
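One simple way to act on this observation is to pad each transfer so that its size in bytes is a multiple of 128. The helper below is a hypothetical illustration of that arithmetic, not code from the talk.

```python
def padded_doubles(count, align_bytes=128, bytes_per_double=8):
    """Round a transfer of `count` doubles up to a whole number of
    align_bytes-sized chunks, returning the padded count in doubles."""
    per_chunk = align_bytes // bytes_per_double   # 16 doubles per 128 bytes
    chunks = -(-count // per_chunk)               # ceiling division
    return chunks * per_chunk


print(padded_doubles(420))
```

For a row of 420 doubles this pads to 432 doubles, which is 3456 bytes, or 27 chunks of 128 bytes.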


PPE/SPE Synchronization Mailboxes

PPE:

    // barrier: wait until every SPE has reported in
    while (sum != NUM_SPE) {
        for (i = 0; i < NUM_SPE; i++) {
            if (NewMailBoxMessage) {
                readMailBox;
                sum++;
            }
        }
    }
    // Restart SPEs
    writeSPEinMboxes();

SPE:

    // Notify PPE
    spu_writech(SPU_WrOutMbox, 1);
    // Wait for PPE
    spu_readch(SPU_RdInMbox);

[Diagram: PPE (PPU, L1, L2, PXU) and SPE (SPU, SXU, LS, MFC) connected through the EIB over 16B/cycle links, with the SPE's Out Mailbox and In Mailbox shown]

•Mailboxes are one common method of synchronization
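The barrier pattern on this slide can be mimicked on an ordinary machine with threads and queues. The sketch below is only an analogy: `queue.Queue` objects stand in for the out and in mailboxes, a coordinator thread stands in for the PPE, and the coordinator blocks on each queue instead of polling a status register. All names are hypothetical.

```python
import queue
import threading

NUM_SPE = 4  # illustrative worker count, not taken from the slides


def spe_worker(idx, out_mbox, in_mbox, log):
    log.append(("work", idx))      # this step's share of the work
    out_mbox.put(1)                # notify the coordinator (cf. SPU_WrOutMbox)
    in_mbox.get()                  # wait for the restart (cf. SPU_RdInMbox)
    log.append(("resumed", idx))


def run_barrier(num_spe=NUM_SPE):
    out_mboxes = [queue.Queue() for _ in range(num_spe)]
    in_mboxes = [queue.Queue() for _ in range(num_spe)]
    log = []
    workers = [
        threading.Thread(target=spe_worker,
                         args=(i, out_mboxes[i], in_mboxes[i], log))
        for i in range(num_spe)
    ]
    for w in workers:
        w.start()
    arrived = 0
    for mbox in out_mboxes:        # barrier: one message from every worker
        arrived += mbox.get()
    for mbox in in_mboxes:         # restart all workers
        mbox.put(1)
    for w in workers:
        w.join()
    return arrived, log


arrived, log = run_barrier()
print(arrived, [kind for kind, _ in log])
```

No worker records "resumed" until every worker has recorded "work", which is the property the barrier provides. One such barrier is needed per elimination step, so its round trip cost is paid N times.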


PPE/SPE Synchronization Mailboxes: Round Trip Times

• Mailboxes using IBM SDK 2.1 C-intrinsics: 6.24 µseconds

• Mailboxes using IBM SDK 2.1 C-intrinsics & pointers to MMIO registers: 3.65 µseconds

• Mailboxes by pointers alone: 0.35 µseconds / not reliable

• Standard round trip latency, 16 byte message: 58.33 µseconds TCP (Gigabit), 8.08 µseconds Infiniband

• IBM SDK 2.1 C-intrinsics for mailboxes do not seem to have ideal performance.


Synchronization by hybrid of C-intrinsics and pointers to MMIO registers

•Synchronization with IBM SDK 2.1 C-intrinsics & pointers to MMIO registers yields fairly good performance for a moderate number of SPEs

[Figure: compute time in seconds vs. number of SPEs (1 to 6) for matrix size 16728x16728 w/ band size of 420. Series: out of core, linear speed up]


Synchronization exclusively by IBM SDK 2.1 mailbox C intrinsics

•Synchronization by IBM SDK 2.1 mailbox C intrinsics alone yields little gain for low SPE counts and a performance LOSS after only 4 SPEs!

[Figure: compute time in seconds vs. number of SPEs (2 to 6). Series: SPE synchronization by SDK C intrinsics]


Data from synchronization exclusively by pointers

[Figure: compute time in seconds vs. number of SPEs (2 to 6). Legend as printed on the slide: SPE synchronization by SDK C intrinsics]

•Synchronization by pointers alone yields a nice speed up for all SPEs

•There seems to be a reliability issue with reading the mailbox status register via pointers


Outline

• Introduction

• Out of Core Algorithm

• In Core Algorithm
– Hide memory accesses
– Hide synchronizations

• Summary


Banded LU: In Core Algorithm

[Figure: matrix with the 2b × 2b active block at step i. The slide tabulates, in terms of N and b, the compute per step (multiplies, subtracts), the memory accesses (start up access, gets, puts, total accesses), and the doubles moved]

Synchronizations: N

•Compute/Memory Ratio grows in proportion to b

•For a 16728x16728 matrix with band size of 420, only 0.158 GB of data would have to be moved.
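The 0.158 GB figure can be reproduced under a simple assumed model: a one-time start up transfer of a 420 × 420 block of doubles, plus three band-length vectors of doubles moved per elimination step, with GB taken as 2^30 bytes. This model is a guess that happens to match the slide's number; the slide's own expression is not legible in this transcript.

```python
N = 16728               # matrix dimension, from the slide
BAND = 420              # band size, from the slide
BYTES_PER_DOUBLE = 8


def in_core_gb(n=N, band=BAND):
    """Data moved under the assumed model: band*band doubles at start up,
    then 3*band doubles per elimination step."""
    doubles = band * band + 3 * n * band
    return doubles * BYTES_PER_DOUBLE / 2 ** 30


print(round(in_core_gb(), 3))
```

Whatever the exact formula, the key point from the slide stands: traffic that grows with N·b rather than N·b² is smaller by a factor on the order of the band size.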


In Core Performance

Achieves over 11% of theoretical performance

[Figure labels]

•Max out of core: 0.583 Gflops

•Max in core: 1.23 Gflops

•Intel Xeon 5160 3.0GHz: 0.377 Gflops


In Core Performance w/ linear speed up

Performance improves as band size increases


IBM QS20 performance

IBM QS20 max in core: 3.15 Gflops

Achieves over 11% of theoretical performance


ICA Memory Access Size Dependence

•In core performance does not show a large dependence on memory access size


Summary / Future Work

• Parallel LU decomposition can benefit from the high bandwidth of the CBE

– Benefit depends greatly on synchronization scheme

• inCore LU offers performance advantages over out of core LU

– Limits on size of matrix bandwidth for inCore LU.

• Partial pivoting


Acknowledgements

• Out of Core Algorithm

– Sudarshan Raghunathan

– John Chu MIT

• In Core Algorithm

– Jeremy Kepner MIT LL

– Sharon Sacco MIT LL

– Rodric Rabbah IBM