Heterogeneous Computing in Charm++
![Page 1: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/1.jpg)
Heterogeneous Computing in Charm++
David Kunzman
![Page 2: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/2.jpg)
Motivations
• Performance and Popularity of Accelerators
  – Our work currently focuses on Cell (and Larrabee)
  – Difficult to program accelerators
    • Architecture specific code (not portable)
    • Many asynchronous events (data movement, multiple cores)
• Heterogeneous Clusters Exist Already
  – Roadrunner at LANL (Opterons and Cells)
  – Lincoln at NCSA (Xeons and GPUs)
  – MariCel at BSC (Powers and Cells)
![Page 3: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/3.jpg)
Goals
• Portability of code
  – Code should be portable between systems with and without accelerators
  – Across homogeneous and heterogeneous clusters
  – Reduce programmer effort
• Allow various pieces of code to be written independently
  – Pieces of code share the accelerator(s)
  – Scheduled by the runtime system automatically
• Naturally extend the existing Charm++ model
  – Same programming model for all hosts and accelerators
![Page 4: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/4.jpg)
Approach
• Make entry methods portable between host and accelerator cores
  – Allows the programmer to write entry method code once and use the same code for all cores
  – Still make use of architecture/core specific features
• Take advantage of the clear communication boundaries in Charm++
  – Almost all data is encapsulated within chare objects
  – Data is passed between chare objects by invoking entry methods
![Page 5: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/5.jpg)
Extending Charm++
• SIMD Instruction Abstraction
  – To reach any significant fraction of peak, must use SIMD instructions on modern cores
  – Abstract SIMD instructions so code is portable
• Accelerated Entry Methods
  – May execute on accelerators
  – Essentially a standard entry method split into two stages
    • Function body (accelerator or host; limited)
    • Callback function (host; not limited)
![Page 6: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/6.jpg)
SIMD Instruction Abstraction
• Abstract SIMD instructions supported by multiple architectures
  – Currently adding support for: SSE (x86), AltiVec/VMX (PowerPC; PPE), SIMD instructions on SPEs, and Larrabee
  – Generic C implementation when no direct architectural support is present
  – Types: vecf, veclf, veci, ...
  – Operations: vaddf, vmulf, vsqrtf, ...
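The "generic C implementation" bullet can be illustrated with a minimal sketch of what such a fallback might look like. The names vecf, vaddf, vmulf, vsqrtf, and vecf_numElems come from the slides; the struct layout, the choice of four lanes, and the accumulate helper are assumptions made for illustration and are not taken from the Charm++ sources.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>

// ASSUMPTION: four floats per vector, as on SSE, AltiVec, and the SPEs.
// The real abstraction may use a different width on other targets.
constexpr std::size_t vecf_numElems = 4;

// Hypothetical fallback layout: a plain array of lanes.
struct vecf { float v[vecf_numElems]; };

// Lane-wise addition.
inline vecf vaddf(const vecf& a, const vecf& b) {
    vecf r;
    for (std::size_t i = 0; i < vecf_numElems; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

// Lane-wise multiplication.
inline vecf vmulf(const vecf& a, const vecf& b) {
    vecf r;
    for (std::size_t i = 0; i < vecf_numElems; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Lane-wise square root.
inline vecf vsqrtf(const vecf& a) {
    vecf r;
    for (std::size_t i = 0; i < vecf_numElems; ++i) r.v[i] = std::sqrt(a.v[i]);
    return r;
}

// Hypothetical helper: local[i] += in[i], using vector operations for the
// bulk of the array and a scalar loop for the leftover elements.
void accumulate(float* local, const float* in, std::size_t len) {
    assert(local != nullptr && in != nullptr);
    const std::size_t vecLen = len / vecf_numElems;
    for (std::size_t i = 0; i < vecLen; ++i) {
        vecf a, b;
        // memcpy sidesteps alignment and aliasing concerns in this sketch.
        std::memcpy(&a, local + i * vecf_numElems, sizeof(vecf));
        std::memcpy(&b, in + i * vecf_numElems, sizeof(vecf));
        const vecf sum = vaddf(a, b);
        std::memcpy(local + i * vecf_numElems, &sum, sizeof(vecf));
    }
    for (std::size_t i = vecLen * vecf_numElems; i < len; ++i) local[i] += in[i];
}
```

On a target with hardware SIMD, the same function names would presumably map onto native intrinsics, so calling code would not need to change.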
![Page 7: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/7.jpg)
Example Entry Method
entry void accum(int inArrayLen, float inArray[inArrayLen]) {
  if (inArrayLen != localArrayLen) return;
  for (int i = 0; i < inArrayLen; ++i)
    localArray[i] = localArray[i] + inArray[i];
};

To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);
![Page 8: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/8.jpg)
Example Entry Method w/ SIMD
entry void accum(int inArrayLen, align(sizeof(vecf)) float inArray[inArrayLen]) {
  if (inArrayLen != localArrayLen) return;
  vecf *inArrayVec = (vecf*)inArray;
  vecf *localArrayVec = (vecf*)localArray;
  int arrayVecLen = inArrayLen / vecf_numElems;
  for (int i = 0; i < arrayVecLen; ++i)
    localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
  for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
    localArray[i] = localArray[i] + inArray[i];
};

To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);
![Page 9: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/9.jpg)
Accel Entry Method Structure
Invocation (both): chareObj.entryName(… passed parameters …)

Standard
  Interface File:
    entry void entryName
      ( …passed parameters… );
  Source File:
    void ChareClass::entryName
      ( …passed parameters … )
      { … function body … }

vs.

Accelerated
  Interface File:
    entry [accel] void entryName
      ( …passed parameters… )
      [ …local parameters… ]
      { … function body … }
      callback_member_function;
![Page 10: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/10.jpg)
Example Accelerated Entry Method
entry [accel] void accum(int inArrayLen, align(sizeof(vecf)) float inArray[inArrayLen])
  [ readOnly : int localArrayLen <impl_obj->localArrayLen>,
    readWrite : float localArray[localArrayLen] <impl_obj->localArray> ]
{
  if (inArrayLen != localArrayLen) return;
  vecf *inArrayVec = (vecf*)inArray;
  vecf *localArrayVec = (vecf*)localArray;
  int arrayVecLen = inArrayLen / vecf_numElems;
  for (int i = 0; i < arrayVecLen; ++i)
    localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
  for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
    localArray[i] = localArray[i] + inArray[i];
} accum_callback;

To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);
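The two-stage split (a limited function body followed by a host-side callback) can be mimicked in plain C++. The sketch below is not what the Charm++ translator generates. AccumChare, accum_body, and the staging copy are hypothetical names, and the real runtime performs these steps asynchronously rather than in a single call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stage 1: the function body. It sees only the passed parameters and the
// declared local parameters, never the chare object itself, which is what
// would let it run on an accelerator core.
void accum_body(int inArrayLen, const float* inArray,
                int localArrayLen /* readOnly */,
                float* localArray /* readWrite */) {
    if (inArrayLen != localArrayLen) return;
    for (int i = 0; i < inArrayLen; ++i) localArray[i] += inArray[i];
}

// Hypothetical stand-in for a chare class.
class AccumChare {
public:
    explicit AccumChare(std::size_t n) : localArray(n, 0.0f) {}

    // Roughly what a runtime might do when accum(...) is invoked.
    void accum(int inArrayLen, const float* inArray) {
        assert(inArray != nullptr);
        // Copy-in: stage the local parameters (stands in for a DMA transfer).
        std::vector<float> staged(localArray);
        accum_body(inArrayLen, inArray,
                   static_cast<int>(staged.size()), staged.data());
        // Copy-out: only readWrite data is written back to the object.
        localArray = staged;
        // Stage 2: the callback always runs on the host, with full access
        // to the object.
        accum_callback();
    }

    void accum_callback() { ++callbacksRun; }

    std::vector<float> localArray;
    int callbacksRun = 0;
};
```

Keeping the body free of any reference to the object is the design choice that matters here: everything the body touches is listed up front, so the runtime knows exactly what to move before and after execution.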
![Page 11: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/11.jpg)
Timeline of Events
• Runtime system…
  – Directs data movement (messages & DMAs)
  – Schedules accelerated entry methods and callbacks
![Page 12: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/12.jpg)
Communication Overlap
• Data movement automatically overlapped with accelerated entry method execution on SPEs and entry method execution on PPE
![Page 13: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/13.jpg)
Handling Host Core Differences
• Automatic modification of application data at communication boundaries
  – Structure of data is known via parameters and Pack-UnPack (PUP) routines
  – During packing process, add information on how the data is encoded
  – During unpacking, if needed, modify data to match local architecture
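The pack/unpack idea above can be sketched for one kind of host difference, byte order. This is an illustration under stated assumptions, not the Charm++ PUP framework: Message, packFloats, unpackFloats, and asForeignHost are hypothetical names, and the real runtime may record and convert other properties as well.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum class ByteOrder : std::uint8_t { Little = 0, Big = 1 };

// Detect the byte order of the machine running this code.
ByteOrder localByteOrder() {
    const std::uint16_t probe = 1;
    std::uint8_t first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1 ? ByteOrder::Little : ByteOrder::Big;
}

// Hypothetical wire format: an encoding tag plus the raw payload.
struct Message {
    ByteOrder order;
    std::vector<std::uint8_t> bytes;
};

// Reverse the bytes of every float-sized element in place.
void swapEachFloat(std::vector<std::uint8_t>& bytes) {
    for (std::size_t i = 0; i + sizeof(float) <= bytes.size(); i += sizeof(float))
        std::reverse(bytes.begin() + i, bytes.begin() + i + sizeof(float));
}

// Packing: send data in the sender's native encoding and record which
// encoding that is.
Message packFloats(const std::vector<float>& values) {
    Message m;
    m.order = localByteOrder();
    m.bytes.resize(values.size() * sizeof(float));
    if (!values.empty())
        std::memcpy(m.bytes.data(), values.data(), m.bytes.size());
    return m;
}

// Unpacking: convert only if the sender's encoding differs from ours.
std::vector<float> unpackFloats(Message m) {
    assert(m.bytes.size() % sizeof(float) == 0);
    if (m.order != localByteOrder()) swapEachFloat(m.bytes);
    std::vector<float> out(m.bytes.size() / sizeof(float));
    if (!out.empty())
        std::memcpy(out.data(), m.bytes.data(), m.bytes.size());
    return out;
}

// Test helper: rewrite a message as if it came from a host with the
// opposite byte order.
Message asForeignHost(Message m) {
    swapEachFloat(m.bytes);
    m.order = (m.order == ByteOrder::Little) ? ByteOrder::Big : ByteOrder::Little;
    return m;
}
```

Tagging on the sender and converting on the receiver means that two hosts with the same architecture pay no conversion cost at all.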
![Page 14: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/14.jpg)
Molecular Dynamics (MD) Code
• Based on object interaction seen in NAMD’s nonbonded electrostatic force computation (simplified)
  – Coulomb’s Law
  – Single precision floating-point
• Particles evenly divided between patch objects
  – ~92K particles in 144 patches (similar to ApoA1 benchmark)
• Compute objects (self and pairwise) compute forces for patch objects
• Patches integrate combined force data and update particle positions
![Page 15: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/15.jpg)
MD Code Results
• Executing on 2 Xeon cores, 8 PPEs, and 56 SPEs
  – 3 ISAs, 3 SIMD instruction extensions, and 2 memory structures
  – Better scaling is achieved when Xeons are present
  – 331.1 GFlop/s (19.82% peak; serial code limited to 27.7% peak on one SPE, assuming that SPE has an infinite local store)
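As a rough consistency check on the figures above (derived only from the two percentages and the measured rate, not from any separately stated hardware peak), the implied aggregate peak and the fraction of the single-SPE serial ceiling reached can be worked out as follows. Comparing the full heterogeneous run against a ceiling measured on one SPE is itself an assumption.

```latex
P_{\text{peak}} \approx \frac{331.1\ \text{GFlop/s}}{0.1982} \approx 1670.5\ \text{GFlop/s}
\qquad
\frac{19.82\%}{27.7\%} \approx 0.716
```

Read this way, the heterogeneous run reaches roughly 72% of the efficiency that the serial code attains on a single idealized SPE.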
![Page 16: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/16.jpg)
Visualizing MD Code Execution
![Page 17: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/17.jpg)
Summary
• Support for accelerators and heterogeneous execution in Charm++
  – Programming model and runtime system changes
    • Accelerated entry methods
    • SIMD instruction abstraction
    • Automatic modification of application data
    • Visualization support
  – Support
    • Currently supports Cell
    • Adding support for Larrabee
    • Clusters where host cores have different architectures
![Page 18: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/18.jpg)
Future Work
• Dynamic, measurement-based load balancing on heterogeneous systems
• Increase support for more accelerators
  – In the process of adding support for Larrabee
  – Increasing support for existing abstractions and/or developing new abstractions
![Page 19: Heterogeneous Computing in Charm++](https://reader035.fdocuments.us/reader035/viewer/2022062517/568134a9550346895d9bb728/html5/thumbnails/19.jpg)
Questions