Parent Engagement and IEP Meetings HPEC Principals and Superintendent Meeting June 15, 2011.
CBELU - 1 James 11/23/2015 MIT Lincoln Laboratory High Performance Simulations of Electrochemical...
-
Upload
jesse-clark -
Category
Documents
-
view
217 -
download
3
Transcript of CBELU - 1 James 11/23/2015 MIT Lincoln Laboratory High Performance Simulations of Electrochemical...
CBELU - 1James 04/21/23
MIT Lincoln Laboratory
High Performance Simulations of Electrochemical Models
on the Cell Broadband Engine
James Geraci
HPEC Workshop
September 20, 2007
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government
MIT Lincoln LaboratoryCBELU - 2
James Geraci 04/21/23
• Introduction• Modeling battery physics• LU decomposition• CELL Broadband Engine
Outline
• Introduction
• Out of Core Algorithm
• In Core Algorithm
• Summary
MIT Lincoln LaboratoryCBELU - 3
James Geraci 04/21/23
Introduction Real Time Battery State of Health Estimation
•The way in which a battery ages is a strong function of battery geometry.•The rate of self discharge is also a function ofbattery geometry.
Lead Acid Battery
MIT Lincoln LaboratoryCBELU - 4
James Geraci 04/21/23
Inside a Lead Acid Battery
Face View Lead Dioxide Plate Face View Lead Plate
•Lead acid batteries are made up of stacks of lead dixode and lead electrode pairs
MIT Lincoln LaboratoryCBELU - 5
James Geraci 04/21/23
Finite volume description of battery
LeadDioxide
SulfuricAcid
LeadEdge view lead dioxide plate
•The center volume’s physics can be influenced by the physics of the volumesto the right, the left, above, and below
•The 5 point stencil gives rise to banded matrix
MIT Lincoln LaboratoryCBELU - 6
James Geraci 04/21/23
Matlab spy plot of Banded Matrix
•Non-zero entries shown in blue
•Inner band around diagonal
•Distant narrow outer band
•For the physics of a lead acidbattery, the matrix is not symmetric (shown)
•For the physics of a lead acidbattery, the matrix is poorlyconditioned 10^8-10^12 - need double precision,
no GPGPU
MIT Lincoln LaboratoryCBELU - 7
James Geraci 04/21/23
• Introduction• Modeling battery physics• LU decomposition• CELL Broadband Engine
Outline
• Introduction
• Out of Core Algorithm
• In Core Algorithm
• Summary
MIT Lincoln LaboratoryCBELU - 8
James Geraci 04/21/23
LU Decomposition
• LU decomposition decomposes a matrix J into a lower triangular matrix L and an upper triangular matrix U
• L & U can be used to solve a system of linear equations Jx = r by forward elimination back substitution
– Essentially Gaussian Elimination
• Often used on poorly conditioned systems where ‘iterative solvers’ can’t be used.
• Difficult to parallelize for small systems because of the fine grain nature of the parallelism involved.
• Banded LU is a special case of LU – The matrix J has a special ‘banded’ data pattern.
MIT Lincoln LaboratoryCBELU - 9
James Geraci 04/21/23
• Introduction• Modeling battery physics• LU decomposition• CELL Broadband Engine
Outline
• Introduction
• Out of Core Algorithm
• In Core Algorithm
• Summary
MIT Lincoln LaboratoryCBELU - 10
James Geraci 04/21/23
Cell Broadband Engine
•Cell Broadband Engine is a new heterogeneous multicore processor that features large internal and off chip bandwidth.
MIT Lincoln LaboratoryCBELU - 11
James Geraci 04/21/23
EIB (96B/cycle 3.2Gcycles/sec*96B/cycle = 286.1GB/sec)
Cell Broadband Engine
SPE
MFC
LS
SXU
SPU
SPE
MFC
LS
SXU
SPU
SPE
MFC
LS
SXU
SPU
SPE
MFC
LS
SXU
SPU
SPE
MFC
LS
SXU
SPU
SPE
MFC
LS
SXU
SPU
SPE
MFC
LS
SXU
SPU
SPE
MFC
LS
SXU
SPU
PPU
L1L2 PXUMIC
16B/cycle 16B/cycle47.7GB/sec
16B/cycle
PPE
25GB/sec
Bandwidth:•50GB/sec/SPU•25GB/sec off chipSPE Performance:•14 Gflops double
MIT Lincoln LaboratoryCBELU - 12
James Geraci 04/21/23
Outline
• Introduction
• Out of Core Algorithm
• In Core Algorithm
• Summary
• Banded LU• Performance• Latency• Synchronization
MIT Lincoln LaboratoryCBELU - 13
James Geraci 04/21/23
Banded LUOut of Core Algorithm Explained
1hbWi
i
2b
2b
)2:,(),(
),2:1()2:,2:1( biiiJ
iiJ
ibiiJbiibiiJ
JMatrix
MIT Lincoln LaboratoryCBELU - 14
James Geraci 04/21/23
Banded LUOut of Core Algorithm Analyzed
1hbW
Ni
222 bN
i
i
2b
2b
2
2bMultiplies:
Compute
Memory Accesses
Ni
Subtracts:2
2b
Gets:
Puts:
2b
2b
22 bN
Memory Doubles Moved
b
b
22 bNb
Synchronizations N
•Compute/Memory Ratio = ½•For a 16728x16728 matrix with band size of 420, almost 22 GB of data would have to be moved.
JMatrix
MIT Lincoln LaboratoryCBELU - 15
James Geraci 04/21/23
Outline
• Introduction
• Out of Core Algorithm
• In Core Algorithm
• Summary
• Banded LU• Performance• Latency• Synchronization
MIT Lincoln LaboratoryCBELU - 16
James Geraci 04/21/23
Peformance of Out of Core Algorithm
•Out of Core Algorithm outperforms UMFpack on Opteron 246 based workstation.
•No appreciable gain in performance past 4 SPEs
Gflops for matrix size 16728x16728 w/ band size of 420 Compute Time for matrix size 16728x16728 w/ band size of 420
Sec
on
ds
2
1
3
1 4 5 62 3G
flo
ps
Out of coreLinear speed up
Out of coreOpteron 246: UMFpackLinear speed up
3 4 5 61 2
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
MIT Lincoln LaboratoryCBELU - 17
James Geraci 04/21/23
Outline
• Introduction
• Out of Core Algorithm
• In Core Algorithm
• Summary
• Banded LU• Performance• Latency• Synchronization
MIT Lincoln LaboratoryCBELU - 18
James Geraci 04/21/23
Latency for DMA put
16 bytes1024 bytes
•Significant Performance hit for memory access smaller than 16bytes
•Bandwidth limited region starts at 8x128bytes
MIT Lincoln LaboratoryCBELU - 19
James Geraci 04/21/23
SPE to main memory bandwidth
16 bytes
1024 bytes
Theoretical Maximum Bandwidth 25GB/sec
•Theoretical maximum bandwidth can almost be achieved for larger message sizes
MIT Lincoln LaboratoryCBELU - 20
James Geraci 04/21/23
OCA Memory Access Size Dependence
•Out of Core performance is better when memory access is a byte multiple of 128
MIT Lincoln LaboratoryCBELU - 21
James Geraci 04/21/23
PPE/SPE Synchronization Mailboxes
// barrierWhile(sum != NUM_SPE){for(i < NUM_SPE){ if(NewMailBoxMessage){
readMailBox;}}} // Notify PPEspu_writech(SPU_WrOutMbox,1); // Wait for PPEspu_readch(SPU_RdInMbox);
// Restart SPEswriteSPEinMboxes();
PPE SPESPE
MFC
LS
SXU
SPU
PPU
L1L2 PXU
16B/cycle
16B/cycle
Out Mailbox
In Mailbox
PPE
EIB•Mailboxes are one common method of synchronization
MIT Lincoln LaboratoryCBELU - 22
James Geraci 04/21/23
PPE/SPE Synchronization MailboxesRound Trip Times
• Mailboxes using IBM SDK 2.1 C-intrinsics
• Mailboxes using IBM SDK 2.1 C-intrinsics & pointers to MMIO registers
• Mailboxes by pointers alone
• Standard round trip latency 16byte message
• IBM SDK 2.1 C-intrinsics for mailboxes do not seem to have idea performance.
6.24 seconds
3.65 seconds
0.35 seconds / not reliable
58.33 seconds TCP (Gigabit)8.08 seconds Infiniband
MIT Lincoln LaboratoryCBELU - 23
James Geraci 04/21/23
Synchronization by hybrid of C-intrinsics and pointers to MMIO registers
•Synchronization with IBM SDK 2.1 C-intrinsics & pointers to MMIO registersyields fairly good performance for a moderate numbers SPEs
Matrix size 16728x16728 w/ band size of 420
1 2 3 4 5 6
Number of SPEs
Out of coreLinear speed up
1
2
Sec
on
ds
3
MIT Lincoln LaboratoryCBELU - 24
James Geraci 04/21/23
Synchronization exclusively by IBM SDK 2.1 mailbox C intrinsics
•Synchronization by IBM SDK 2.1 mailbox C intrinsics alone, yields little gain for low SPE count and performance LOSS after only 4 SPEs!!!
Sec
on
ds
2 3 4 5 6
Number of SPEs
SPE synchronization by SDK C intrinsics
0.11
0.2
0.3
0.4
0.5
0.6
0.7
0.8
MIT Lincoln LaboratoryCBELU - 25
James Geraci 04/21/23
Data from synchronization exclusively by pointers
Sec
on
ds
2 3 4 5 6
Number of SPEs
SPE synchronization by SDK C intrinsics
0.11
0.2
0.3
0.4
0.5
0.6
0.7
0.8
•Synchronization by pointers alone, yields a nice speed up for all SPEs•Seems to be reliability issue with reading mailbox status register via pointers
MIT Lincoln LaboratoryCBELU - 26
James Geraci 04/21/23
Outline
• Introduction
• Out of Core Algorithm
• In Core Algorithm
• Summary
• Hide memory accesses• Hide synchronizations
MIT Lincoln LaboratoryCBELU - 27
James Geraci 04/21/23
Banded LUIn Core Algorithm
1hbW
2b
2b
Ni 2
2b
Multiplies:
Compute
Ni
Total Accesses:
222 bN
Subtracts:2
2b
Gets:
Puts:
12N3
bb )*3(2 bNbb
Memory Accesses Doubles Moved
Synchronizations N
•Compute/Memory Ratio =
•For a 16728x16728 matrix with band size of 420, only 0.158 GB of data would have to be moved.
32b
Matrix
Start up Access: 1 2bb
MIT Lincoln LaboratoryCBELU - 28
James Geraci 04/21/23
In Core Performance
Achieves over 11% of theoretical performance
MIT Lincoln Laboratory
Max out core0.583 Gflops
Max in core1.23 Gflops
Intel Xeon5160 3.0Ghz0.377 Gflops
MIT Lincoln LaboratoryCBELU - 29
James Geraci 04/21/23
In Core Performancew/ linear speed up
MIT Lincoln Laboratory
Performance improves as band size increases
MIT Lincoln LaboratoryCBELU - 30
James Geraci 04/21/23
IBM QS20 performance
MIT Lincoln Laboratory
IBM QS20Max in core3.15 Gflops
Achieves over 11% of theoretical performance
MIT Lincoln LaboratoryCBELU - 31
James Geraci 04/21/23
ICA Memory Access Size Dependence
•In core performance does not show a large dependence on memory access size
MIT Lincoln LaboratoryCBELU - 32
James Geraci 04/21/23
Outline
• Introduction
• Out of Core Algorithm
• In Core Algorithm
• Summary
MIT Lincoln LaboratoryCBELU - 33
James Geraci 04/21/23
Summary / Future Work
• Parallel LU decomposition can benefit from the high bandwidth of the CBE
– Benefit depends greatly on synchronization scheme
• inCore LU offers performance advantages over out of core LU– Limits on size of matrix bandwidth for inCore LU.
• Partial pivoting
MIT Lincoln LaboratoryCBELU - 34
James Geraci 04/21/23
Acknowledgements
• Out of Core Algorithm– Sudarshan Raghunathan – John Chu MIT
• In Core Algorithm – Jeremy Kepner MIT LL– Sharon Sacco MIT LL– Rodric Rabbah IBM