Climate Machine Update David Donofrio RAMP Retreat 8/20/2008.
Climate Machine Update
David Donofrio
RAMP Retreat
8/20/2008
Agenda
• Project Overview
• Tensilica Architecture and Design Flow
• Tensilica Tools Demo
• Why we need RAMP
• Current Progress
• Next Steps
A New Approach to HPC
• Current HPC design approach:
  – Leverage commodity processors from Intel, AMD, etc.
  – Once the machine is built, optimize problems to run on it
  – Power wall prevents scaling to exaflop performance
  – Power is the new design point
Olukotun and Sutter
Moore’s Law is still in effect, but the number of processors doubles every 18 months rather than the clock rate
A New Approach to HPC
• Our approach:
  – Identify application, then tailor machine using semi-custom design
  – Optimize CPU architecture and further extend with semi-custom ISA
  – Leverage auto-tuning to access architecture-specific optimizations
  – Even if each simple core is 1/4 as computationally efficient as a complex core, you can fit hundreds on a single die and be 100x more power efficient
• Learn from embedded market where Flops / Watt and rapid design cycles are crucial
  – Start with building blocks from embedded designs rather than full custom ASIC
  – Preserve ability to run general purpose C code
• Application target: 1 km scale climate model
Tailor machine architecture to application to reduce waste
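The efficiency claim on this slide can be sanity-checked with simple arithmetic. The sketch below is an illustration only, not the project's model: it assumes performance per watt is the only metric, ignores interconnect and memory power, and uses hypothetical function names and a hypothetical count of 400 cores standing in for "hundreds".

```python
def implied_power_ratio(perf_ratio: float, efficiency_gain: float) -> float:
    """Power of one simple core relative to one complex core.

    perf_ratio: simple-core performance / complex-core performance (1/4 on the slide).
    efficiency_gain: target improvement in performance per watt (100x on the slide).
    """
    # (perf_ratio * P) / (x * W) = efficiency_gain * (P / W)
    # Solving for x gives x = perf_ratio / efficiency_gain.
    return perf_ratio / efficiency_gain


def aggregate_throughput(n_cores: int, perf_ratio: float) -> float:
    """Throughput of n simple cores, measured in complex-core equivalents."""
    return n_cores * perf_ratio


if __name__ == "__main__":
    ratio = implied_power_ratio(0.25, 100.0)
    print(f"Simple core power budget: {ratio:.4f} of a complex core")
    # 400 is a hypothetical core count, not a figure from the slides.
    print(f"400 simple cores: {aggregate_throughput(400, 0.25):.0f} complex-core equivalents")
```

Under these assumptions the performance-per-watt gain does not depend on the core count; the number of cores only sets the aggregate throughput.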
Climate Model Resource Requirements
• DOE has identified high-resolution climate modeling as a leading justification for exascale computing
• Must express 20M-way parallelism
• Requires performance of 200 Pflops peak
• Simulation must run 1000x faster than real time
Randall / CSU
NASA
• Amenable to massively concurrent architectures composed of power-efficient embedded cores
• Actively working with the climate science community to enable new icosahedral model
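The requirements on this slide imply a per-thread performance target and a wall-clock budget. The sketch below works them out, assuming the 200 Pflops peak is spread evenly over the 20M parallel streams and that a simulated year is 365 days; the function names are hypothetical.

```python
PEAK_FLOPS = 200e15   # 200 Pflops peak (from the slide)
PARALLELISM = 20e6    # 20M-way parallelism (from the slide)
SPEEDUP = 1000        # simulation runs 1000x faster than real time (from the slide)


def flops_per_thread(peak_flops: float, parallelism: float) -> float:
    """Peak flop rate each parallel stream must sustain if the load is even."""
    return peak_flops / parallelism


def wallclock_hours_per_simulated_year(speedup: float, days_per_year: float = 365.0) -> float:
    """Wall-clock hours needed to simulate one year at the given speedup."""
    return days_per_year * 24.0 / speedup


if __name__ == "__main__":
    gflops = flops_per_thread(PEAK_FLOPS, PARALLELISM) / 1e9
    hours = wallclock_hours_per_simulated_year(SPEEDUP)
    print(f"Per-thread peak: {gflops:.1f} Gflops")
    print(f"One simulated year: {hours:.2f} wall-clock hours")
```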
Tensilica Processor Design Flow
• Complete Solution: Hardware, Software and Verification
• Fully customizable
  – Required base ISA ensures general purpose applications
• Processor configuration submitted to Tensilica’s servers where synthesis is performed
  – Returned design can be spun for ASIC or FPGA
  – Bit file available for Avnet boards
• Building block approach drastically reduces design cycle time compared to full-custom design
Tensilica Inc.
Tensilica Architecture Features
• Verilog-like TIE language allows for custom ISA extensions
  – Functional and performance verification built in
  – Auto-generated compiler intrinsics
  – 64-bit IEEE-DP floating point coded up in TIE and available
• Custom VLIW support
• Inter-processor communication easily enabled through:
  – TIE Ports
  – TIE Queues
    • Access to direct HW support for interprocessor communication
  – TIE Lookups
    • Allows interface to external ROMs or other RTL block
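As a conceptual aid for the queue-based communication listed above, the sketch below models a fixed-depth FIFO between a producer core and a slower consumer core. It is plain Python, not TIE code. It assumes a full queue stalls the producer and an empty queue stalls the consumer; the slides do not spell out the actual TIE Queue semantics, and the class and function names are hypothetical.

```python
from collections import deque


class BoundedFifo:
    """Toy model of a fixed-depth hardware queue between two cores."""

    def __init__(self, depth: int):
        if depth <= 0:
            raise ValueError("depth must be positive")
        self.depth = depth
        self._items = deque()

    def try_push(self, value) -> bool:
        """Return False (producer would stall) when the queue is full."""
        if len(self._items) >= self.depth:
            return False
        self._items.append(value)
        return True

    def try_pop(self):
        """Return (True, value), or (False, None) when empty (consumer would stall)."""
        if not self._items:
            return False, None
        return True, self._items.popleft()


def simulate(depth: int = 2, n_values: int = 5):
    """Producer pushes every cycle; consumer pops only on odd cycles."""
    queue = BoundedFifo(depth)
    to_send = list(range(n_values))
    received, producer_stalls, cycle = [], 0, 0
    while len(received) < n_values:
        if to_send:
            if queue.try_push(to_send[0]):
                to_send.pop(0)
            else:
                producer_stalls += 1  # queue full: producer waits a cycle
        if cycle % 2 == 1:
            ok, value = queue.try_pop()
            if ok:
                received.append(value)
        cycle += 1
    return received, producer_stalls


if __name__ == "__main__":
    values, stalls = simulate()
    print("received in order:", values, "| producer stall cycles:", stalls)
```

Varying `depth` in this toy shows how buffering trades area for fewer producer stalls, the kind of question the communication-architecture exploration is meant to answer on real hardware.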
Tensilica Architecture Overview
Tensilica Inc.
Tensilica Performance Debug
• Processor viewed as black box
• State can be compressed (via HW) and pushed out the JTAG port
  – Intended for program replay
• Xtensa trace port gives real-time visibility into internal pipeline state with unprecedented detail
  – Cache hit / miss with virtual address
  – Branch taken / not taken
  – Call / return
  – Resource dependency
  – Etc.
• Opportunity for hundreds of performance counters to be made available
Tensilica Inc.
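To illustrate how trace events like those listed on this slide could feed performance counters, the sketch below tallies a stream of events and derives two rates. The record format, a `(kind, outcome)` tuple, is hypothetical; the slides do not describe the actual trace-port encoding.

```python
from collections import Counter


def summarize(trace):
    """Tally (kind, outcome) trace records and derive simple rates."""
    counts = Counter(trace)

    def rate(kind, positive, negative):
        pos = counts[(kind, positive)]
        total = pos + counts[(kind, negative)]
        return pos / total if total else None  # None when no such events occurred

    return {
        "counts": counts,
        "cache_hit_rate": rate("cache", "hit", "miss"),
        "branch_taken_rate": rate("branch", "taken", "not_taken"),
    }


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = (
        [("cache", "hit")] * 3
        + [("cache", "miss")]
        + [("branch", "taken"), ("branch", "not_taken")]
    )
    summary = summarize(sample)
    print("cache hit rate:", summary["cache_hit_rate"])
    print("branch taken rate:", summary["branch_taken_rate"])
```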
Tensilica Tools Demo
Why we need RAMP
• Fast, accurate emulation enables:
  – Dual nested loop of HW / SW co-design
    • Preliminary work using Stanford SM sim shows significant improvement in power efficiency using automated HW/SW co-tuning
• RAMP critical to accelerate
  – Rapid prototyping and analysis of Tensilica architectural options
  – Inter-processor communication architecture exploration
  – Running FULL climate code, providing a more complete performance picture
• Cycle-accurate simulator currently running at ~100 kHz vs. 50 MHz on V5
  – Extensive HW performance counter data enables an emulation environment with similar resolution but much greater speed
Tensilica-provided emulation environment kick-starts this effort
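The "dual nested loop" of HW / SW co-design can be pictured as an outer search over hardware configurations with an inner software auto-tuning pass for each one. The sketch below is a toy illustration under made-up assumptions: the cache sizes, tile sizes, and both cost models are invented for the example and do not come from the project. In practice each evaluation would be a run on the emulator or simulator.

```python
def toy_performance(cache_kb: float, tile: int) -> float:
    """Made-up model: bigger tiles are faster until one tile no longer fits in cache."""
    working_set_kb = tile * tile * 8 / 1024  # one tile of 8-byte values
    return float(tile) if working_set_kb <= cache_kb else tile / 4.0


def toy_power(cache_kb: float) -> float:
    """Made-up model: fixed core power plus a cost per KB of cache."""
    return 1.0 + 0.02 * cache_kb


def co_tune(cache_sizes_kb, tile_sizes):
    """Return (cache_kb, tile, performance per watt) for the best pairing found."""
    best = None
    for cache_kb in cache_sizes_kb:  # outer loop: hardware configuration
        # Inner loop: auto-tune the software parameter for this hardware.
        tile = max(tile_sizes, key=lambda t: toy_performance(cache_kb, t))
        efficiency = toy_performance(cache_kb, tile) / toy_power(cache_kb)
        if best is None or efficiency > best[2]:
            best = (cache_kb, tile, efficiency)
    return best


if __name__ == "__main__":
    cache_kb, tile, efficiency = co_tune([8, 16, 32, 64], [16, 32, 64, 128])
    print(f"best: cache={cache_kb} KB, tile={tile}, perf/W={efficiency:.2f}")
```

Because every point in the search needs a full evaluation, the speed of the evaluation platform bounds how much of the design space can be explored.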
Current Status
• ML505 used for initial design exploration
  – Basic Xtensa processor + JTAG and memory controller is ~50% of a Virtex-5 50T
  – Runs at 50 MHz
    • ASIC in 65G process runs at 650 MHz
• On-chip debug working
• Can load / run programs using main memory synthesized from BRAM
• DRAM interface coded - currently being debugged
• RTL license recently obtained - full simulation environment (in ModelSim) being brought up
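The clock rates quoted in this deck (~100 kHz for the cycle-accurate simulator, 50 MHz on the FPGA, 650 MHz for the ASIC) can be compared directly. The sketch below assumes one target cycle per clock on every platform, which is a simplification, and uses a hypothetical workload of 10^12 cycles.

```python
SIM_HZ = 100e3    # cycle-accurate software simulator (~100 kHz)
FPGA_HZ = 50e6    # Xtensa core on the Virtex-5 FPGA (50 MHz)
ASIC_HZ = 650e6   # ASIC in the 65G process (650 MHz)


def speedup(fast_hz: float, slow_hz: float) -> float:
    """How many times faster the first platform advances target cycles."""
    return fast_hz / slow_hz


def wallclock_seconds(target_cycles: float, hz: float) -> float:
    """Seconds needed to execute the given number of target cycles."""
    return target_cycles / hz


if __name__ == "__main__":
    print(f"FPGA vs. simulator: {speedup(FPGA_HZ, SIM_HZ):.0f}x")
    print(f"ASIC vs. FPGA:      {speedup(ASIC_HZ, FPGA_HZ):.0f}x")
    cycles = 1e12  # hypothetical workload size, not a figure from the slides
    for name, hz in [("simulator", SIM_HZ), ("FPGA", FPGA_HZ), ("ASIC", ASIC_HZ)]:
        print(f"{name}: {wallclock_seconds(cycles, hz) / 3600:.1f} hours")
```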
Next Steps…
• Transition to BEE3 from ML505
• Bring up XTOS environment on single Xtensa processor on BEE3
• Run single column of climate code on single processor
  – Demo at SC’08 in November
  – Continue HW / SW co-tuning optimization
• Begin multi-processor emulation
  – Emulation of single socket, 32 core, using networked BEE3s
  – Running full 2 million line climate model
Backup
The Need for Exascale Computing
• DOE has identified high-resolution climate modeling as a leading justification for exascale computing
  – 1 km resolution targeted for accurate cloud-resolving model
• Difficult to scale existing systems
  – HPC design using commodity processors estimated to draw 179 MW
  – BlueGene design estimated to draw 20 MW
  – Leveraging embedded cores and more application-specific design, a power envelope of 3-5 MW is projected
Icosahedral
LBNL will seek an external vendor to build the machine if our approach is proven valid - LBNL is not entering the commercial HPC market.
Randall / CSU
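The power estimates on this backup slide (179 MW, 20 MW, and a projected 3-5 MW) can be turned into reduction factors and annual energy totals. The sketch below assumes each machine draws its estimated power continuously for a 365-day year; the names are hypothetical and no electricity price is assumed.

```python
HOURS_PER_YEAR = 8760  # 365 days; leap years ignored

# (low, high) power estimates in MW, taken from the slide.
ESTIMATES_MW = {
    "commodity processors": (179.0, 179.0),
    "BlueGene": (20.0, 20.0),
    "embedded cores": (3.0, 5.0),  # projected range
}


def annual_energy_mwh(power_mw: float) -> float:
    """Energy used in one year at a constant power draw."""
    return power_mw * HOURS_PER_YEAR


def reduction_range(baseline: str, target: str):
    """(min, max) factor by which the baseline's power exceeds the target's."""
    base_low, base_high = ESTIMATES_MW[baseline]
    target_low, target_high = ESTIMATES_MW[target]
    return base_low / target_high, base_high / target_low


if __name__ == "__main__":
    for baseline in ("commodity processors", "BlueGene"):
        low, high = reduction_range(baseline, "embedded cores")
        print(f"{baseline} vs. embedded cores: {low:.1f}x to {high:.1f}x less power")
    for name, (low, high) in ESTIMATES_MW.items():
        print(f"{name}: {annual_energy_mwh(low):,.0f} to {annual_energy_mwh(high):,.0f} MWh/year")
```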