Post on 16-Apr-2018
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
SST Overview
Genie Hsieh Arun Rodrigues
Sandia National Labs
Why SST?
View of the Simulation Problem
• Multiple audiences: application writers, purchasers, designers
  – System procurement, algorithm co-design, architecture research, language research
  – Network, processor, system
  – Present systems, future systems
• Scale: many cores + memory, many many nodes, many many many threads
• Complexity: multi-physics apps, informatics apps, communication libraries, run-times, OS effects, existing languages, new languages
• Constraints: performance, cost, power, reliability, cooling, usability, risk, size
Worldwide Impact
"Total power used by servers [in 2005] represented ... an amount comparable to that for color televisions." - Estimating Total Power Consumption by Servers in the U.S. and the World, Jonathan G. Koomey
  3741e9 kWh          Total US power consumption
* 3-4%                Used by computers (>2% servers, >1% household computer use)
= 112 - 150e9 kWh     US computer power consumption
* $0.1/kWh            Retail cost, US average 2009
= $11 - $15 billion   US$ in compute power
* 3-5                 In 2005 the US was roughly 1/3 of servers, by power
= $33 - $75 billion   US$ in worldwide computer power
* 15-35%              DRAM memory power
= $5 - $25 billion    US$ in DRAM power
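The chain of multiplications on this slide can be checked with a few lines of C++. This is only a sketch: the `Range` type and the function names are made up for illustration, and the inputs are the slide's own figures. Without the slide's intermediate rounding, the result comes out at roughly $5 to $26 billion, close to the rounded $5 - $25 billion quoted above.

```cpp
#include <cassert>

// Low/high bounds for a quantity quoted as a range on the slide.
struct Range {
    double lo;
    double hi;
};

// Multiply a range by a low and a high factor (all values non-negative).
Range scale(Range r, double lo_factor, double hi_factor) {
    return Range{r.lo * lo_factor, r.hi * hi_factor};
}

// Worldwide DRAM power cost in billions of US$, following the slide's chain.
Range dram_power_cost_billion_usd() {
    const double us_total_kwh = 3741e9;                  // total US consumption
    Range kwh = scale({us_total_kwh, us_total_kwh}, 0.03, 0.04); // 3-4% computers
    Range us_cost = scale(kwh, 0.1 / 1e9, 0.1 / 1e9);    // $0.1/kWh, in $ billions
    Range world = scale(us_cost, 3.0, 5.0);              // US is 1/3 to 1/5 of world
    return scale(world, 0.15, 0.35);                     // DRAM share of power
}
```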
Major Simulation Challenges
• Multiple Objectives
  – Performance used to be the only criterion
  – Now: energy, cost, power, reliability, etc.
• Scale & Detail
  – Many system characteristics require detail to measure
  – Detailed simulation takes too long (10^4-10^5 times slower than real time)
• Accuracy
  – Systems are more complex
  – Vendors don't reveal necessary details
What Is the SST?
SST Simulation Project Overview
Technical Approach
• Parallel
  – Parallel Discrete Event core with conservative optimization over MPI
• Holistic
  – Integrated technology models for power: McPAT, Sim-Panalyzer
• Multiscale
  – Detailed and simple models for processor, network, and memory
• Open
  – Open core, non-viral, modular
Goals
• Become the standard architectural simulation framework for HPC
• Be able to evaluate future systems on DOE workloads
• Use supercomputers to design supercomputers
Consortium
• "Best of Breed" simulation suite
• Combine lab, academic, & industry
Status
• Current release (2.1) at code.google.com/p/sst-simulator/
• Includes parallel simulation core, configuration, power models, basic network and processor models, and interface to detailed memory model
Parallel Implementation
• Implemented over MPI
• Configuration, partitioning, initialization handled by core
• Conservative, distance-based optimization
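One common way to realize a conservative parallel discrete event scheme is to let ranks run independently for a window no longer than the smallest latency of any link that crosses a rank boundary. The sketch below assumes that this is what "distance-based" refers to; the slides do not spell it out. `LinkInfo` and `lookahead` are hypothetical names, not part of the SST core.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical description of one link in a partitioned component graph.
struct LinkInfo {
    int rank_a;            // rank that owns one endpoint
    int rank_b;            // rank that owns the other endpoint
    std::uint64_t latency; // minimum link latency, in simulator time units
};

// Smallest latency among links that cross a rank boundary.
// Returns the maximum uint64 value if no link crosses ranks.
std::uint64_t lookahead(const std::vector<LinkInfo>& links) {
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (const LinkInfo& l : links) {
        if (l.rank_a != l.rank_b && l.latency < best) {
            best = l.latency;
        }
    }
    return best;
}
```

Under this assumption, a partitioner that keeps low-latency links inside a rank leaves a larger window and fewer synchronizations.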
Message Handling
• SST core transparently handles message delivery
• Detects if destination is local or remote
• Local messages delivered to local queues
• Remote messages stored for later serialization and remote delivery
  – Boost Serialization Library used for message serialization
  – MPI used for transfer
• Ranks synchronize based on partitioning
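The local/remote decision described on this slide can be pictured with a toy model. Everything here (`ToyEvent`, `ToyRank`, the ownership map) is an illustrative assumption rather than the SST core's actual data structures, and the serialization and MPI transfer steps are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Toy event: delivery time, destination component, and a payload.
struct ToyEvent {
    std::uint64_t deliver_time;
    int dest_component;
    int payload;
};

// Toy stand-in for one rank of the simulator core (not the SST API).
class ToyRank {
public:
    // owner maps a component id to the rank that simulates it.
    ToyRank(int my_rank, std::map<int, int> owner)
        : my_rank_(my_rank), owner_(std::move(owner)) {}

    // Route an event: local queue if the destination lives on this rank,
    // otherwise a per-rank outbox that would later be serialized and sent.
    void send(const ToyEvent& ev) {
        const int dest_rank = owner_.at(ev.dest_component);
        if (dest_rank == my_rank_) {
            local_.push_back(ev);
        } else {
            outbox_[dest_rank].push_back(ev);
        }
    }

    std::size_t local_pending() const { return local_.size(); }

    std::size_t remote_pending(int rank) const {
        const auto it = outbox_.find(rank);
        return it == outbox_.end() ? 0 : it->second.size();
    }

private:
    int my_rank_;
    std::map<int, int> owner_;
    std::vector<ToyEvent> local_;
    std::map<int, std::vector<ToyEvent>> outbox_;
};
```

The sending component never needs to know which case applies, which is the sense in which delivery is transparent.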
Multi-Scale
• Goal: Enable tradeoffs between accuracy, flexibility, and simulation speed
  – No single "right" way to simulate
  – Support multiple audiences
• High- & low-level interfaces
  – Allows multiple input types
  – Allows multiple input sources
    • Traces, stochastic, state-machines, execution...
Multiscale Parameters
                      High-Level                        Low-Level
Detail                Message                           Instruction
Fundamental Objects   Message, Compute block, Process   Instruction, Thread
Static Generation     MPI Traces, MA Traces             Instruction Trace
Dynamic Generation    State Machine                     Execution
Simulator Structure
[Figure: Simulator Core providing services (Parallel DES, MPI, Checkpointing, Statistics, Power/Area/Cost, Configuration) to Vendor Components and Open Components]
SST/GeM5 Integration
• Goals: Run GeM5 in parallel, connect with other SST components
• High parallel efficiency
• Changes
  – Replaced Python-based configuration with XML or C++-based system
  – Encapsulated GeM5 as an SST Component, GeM5 event queue driven by SST clock
  – Created translator SimObjects to connect to SST links
  – Changes made to GeM5 loader to avoid use of async (untimed out-of-band) messages
[Figure: gem5 SST Component containing CPU, L1 caches, bus, L2, IO bridge, PhysMem, IO bus, syscall handler, and translators, connected to a Portals NIC, SS Router, and MemBus + Memory (bus, DRAMSim)]
GeM5
• M5: Modular platform for computer system architecture research, encompassing system-level architecture as well as processor microarchitecture
• Provides detailed, full-system CPU models for x86, ARM, SPARC, Alpha
• Integrated as an SST Component, which allows interaction with SST models and parallel execution
• Currently tested up to 256 nodes
[Figure: Gem5 SimObjects and Ports driven by the Gem5 queue inside an SST::Component, connected through SST::Link objects to other SST::Components driven by the SST queue]
Diversity!
[Figure: components plotted by complexity/detail versus execution time: genericProc, Macro, M5 O3, RS Router, DRAMSim, Stochastic, simpleRouter, ResiliencySim, scheduleSim, Zesto, MacSim]
Commonalities
• Discrete event or time stepped
• Amenable to event counting for power modeling
Simulation Process
Step  Action                               Example
1     Input problem description            Skeleton App, Mini App, Compact App
2     Input hardware description           (Vendor) roadmap, novel architecture
3     Run model                            Detailed Model, Abstract Model, Vendor Model
4     Read results                         Performance, Energy, Cost, Reliability, etc.
5     Refine problem/hardware description  Rewrite Proxy App, use more detailed model
6     Goto 1                               Increase level of detail & repeat
Component Library
• Parallel Core v2
  – Parallel DES layered on MPI
  – Partitioning
  – Configuration & Checkpointing
  – Power modeling
• Technology Models
  – McPAT, Sim-Panalyzer, IntSim, Orion and custom power/energy models
  – HotSpot thermal model
  – Working on reliability models
• Components
  – Processor: Macro Applications, Macro Network, NMSU, genericProc, state-machine, Zesto, GeM5, GPGPU
  – Network: Red Storm, simpleRouter, GeM5
  – Memory: DRAMSim II, Adv. Memory, Flash, SSD, DiskSim
[Figure: SST Simulator Core with its services and Vendor/Open Components]
Holistic Simulation
• Design space includes much more than simple performance
• Create common interface to multiple technology libraries
  – Power/Energy
  – Area/Timing estimation
• Make it easier for components to model technology parameters
[Figure: a Component talks to a Common Tech API, which fronts McPAT, Cacti, Sim-Panalyzer, and others]
Open Simulator Framework
• Simulator Core will provide...
  – Power, Area, Cost modeling
  – Checkpointing
  – Configuration
  – Parallel Component-Based Discrete Event Simulation
• Components
  – Ships with basic set of open components
  – Industry can plug in their own models
    • Under no obligation to share
• Open Source (BSD-like) license
• SVN hosted on Google Code
How To Use the SST
Where to Get the SST
• Google Code
• http://code.google.com/p/sst-simulator/
  – Anonymous checkout available
• BSD-like License
• Directory Structure
  /sst-simulator : Top Level
    /deps : Scripts to help build dependencies
    /doc : Doxygen
    /sst : Source code
      /core : Simulator Core (DES)
      /elements : Simulation models
        /DRAMSimC : DRAM Simulator
        /M5 : GeM5 Simulator
        /SS_Router : Cray SeaStar Simulator
        /zesto : Detailed x86
        /iris : NoC Simulator
        /Many Others...
Dependencies
• System
  – 64-bit Linux (Debian or RedHat)
  – MacOS >10.4
• Libraries
  – MPI distribution (Open MPI or MPICH2)
  – Boost 1.43.x
  – ParMETIS 3.1.1 (optional)
  – Zoltan 3.2 (optional)
• Optional Components
  – DRAMSim2
  – DiskSim (64-bit Linux only)
  – McPAT
  – IntSim
  – ORION
  – HotSpot
  – GeM5 (64-bit Linux only)
• Detailed directions at: http://code.google.com/p/sst-simulator/w/list
SST Deps Tarball
• Fast-path dependencies
• Download sstComponents_2012-FEB-23.tar.gz from:
  – http://code.google.com/p/sst-simulator/downloads/list
• Untar in $HOME
• cd sst-simulator/deps/bin
• ./sstDependencies.sh cleanBuild
• Uncompresses, patches, and builds
  – Boost, DiskSim, DRAMSim 2, HotSpot, IntSim, McPAT, Orion, GeM5
Building the SST
• From top level
• ./autogen.sh
• ./configure <options>
  – ./configure --help for explanations
  – --prefix=DIR
  – --with-gem5=DIR
  – --with-disksim=DIR
• make
• make install
Building the SST With MacSim
• Unpack the MacSim source into the SST elements dir
  – cd $SST_TOP/sst/elements
  – svn co https://svn.research.cc.gatech.edu/macsim/trunk macsim
  – cd macsim
• Patch MacSim
  – patch -p0 -i macsim-sst.patch
• Build the SST
  – cd $SST_TOP
  – ./autogen.sh
  – ./configure
Running the SST
• Serial
  – sst.x <SDL File>
• Parallel
  – mpirun -n <ranks> sst.x <SDL File>
Sample XML Format
• Simple component-based format
• Programmatic construction (C++ or scripted interface) in development
<component id="nic1" weight=0.5>
  <nicmodel>
    <params debug=1 />
    <links>
      <link id="cpu1nicmodel">
        <params lat="1" name="CPU" />
      </link>
      <link id="Router1LocalPort">
        <params lat="1" name="NETWORK" />
      </link>
    </links>
  </nicmodel>
</component>
<component id="router1" weight=0.3>
  <routermodel>
    <params hop_delay="2000" debug=1 />
    <links>
      <link id="Router1LocalPort">
        <params lat="1" name="local" />
      </link>
      <link id="H0">
        <params lat="2" name="west" />
      </link>
    </links>
  </routermodel>
</component>
Callouts in the figure: Component, Component Parameters, Link Specification(s)
Sample XML Format
• Named Links
  – Endpoints specified
  – Minimum latency specified
    • Used in partitioning & optimization
• Variables & Parameter Blocks
<component id="nic1" weight=0.5>
  <nicmodel>
    <params debug=1 />
    <links>
      <link id="cpu1nicmodel">
        <params lat="1" name="CPU" />
      </link>
      <link name="Router1LocalPort">
        <params lat="1" name="NETWORK" />
      </link>
    </links>
  </nicmodel>
</component>
<component id="router1" weight=0.3>
  <routermodel>
    <params hop_delay="2000" debug=1 />
    <links>
      <link name="Router1LocalPort">
        <params lat="1" name="local" />
      </link>
      <link id="H0">
        <params lat="2" name="west" />
      </link>
    </links>
  </routermodel>
</component>
Callouts in the figure: Link Name ("Router1LocalPort"), Link Local Name ("NETWORK", "local")
<variables>
  <nic_link_lat> 200ns </nic_link_lat>
  <rtr_link_lat> 10ns </rtr_link_lat>
</variables>
<link name=0.cpu2nic port=cpu latency=$nic_link_lat/>
GeM5 Example
• Two-Level Configuration
  – SST Configuration
    • One M5 SST::Component per rank
    • Simulation variables (stop time, etc.)
  – M5 Configuration
    • M5 sub-components (processors, caches, busses)
    • Uses same format as SST XML
    • Subcomponent config parameters same as GeM5
<sst>
  <component name=system type=m5C.M5 rank=0 >
    <params>
      <configFile>exampleM5.xml</configFile>
      <debug> 0 </debug>
      <M5debug> none </M5debug>
      <info>yes</info>
      <registerExit>yes</registerExit>
    </params>
  </component>
</sst>
<component name=nid0.cpu0 type=O3Cpu >
  <params include=cpuParams>
    <base.process.cmd>${M5_EXE}</base.process.cmd>
    <base.process.env.0>RT_RANK=0</base.process.env.0>
  </params>
  <link name=nid0.cpu-dcache port=dcache_port/>
  <link name=nid0.cpu-icache port=icache_port/>
</component>
<component name=nid0.cpu0.dcache type=BaseCache >
  <params include=cacheParams>
    <size>65536</size>
  </params>
  <link name=nid0.cpu-dcache port=cpu_side/>
  <link name=nid0.dcache-bus port=mem_side/>
</component>
Portals NIC Example
• Tests large-scale HPC network
  – Processor: State-machine
  – NIC: Simplified packet-based
  – Router: Detailed flit-based router (Cray SeaStar)
• Tested up to 3M components
• Includes SDL generator
<component name=0.cpu type=portals4_sm.trig_cpu rank=0 >
  <params include=cpu_params>
    <id> 0 </id>
  </params>
  <link name=0.cpu2nic port=nic/>
</component>
<component name=0.nic type=portals4_sm.trig_nic rank=0 >
  <params include=nic_params1,nic_params2>
    <id> 0 </id>
  </params>
  <link name=0.cpu2nic port=cpu/>
  <link name=0.nic2rtr port=rtr/>
</component>
<component name=0.rtr type=SS_router.SS_router rank=0 >
  <params include=rtr_params>
    <id> 0 </id>
  </params>
  <link name=0.nic2rtr port=nic/>
  <link name=xr2r.0.0.1 port=xPos/>
  <link name=xr2r.0.0.0 port=xNeg/>
</component>
Callouts in the figure: in type=portals4_sm.trig_cpu, the first part is the library and the second the component; rank=0 is a manual partition.
MacSim Example
• Each MacSim instance is an SST::Component
• Clock speed, paths to traces, configuration specified in SDL
• Two Modes
  – Standalone
  – With DRAMSim2
Standalone MacSim in SST:
<sst>
  <component name=gpu0 type=macsimComponent.macsimComponent>
    <params>
      <paramPath>PARAM_PATH</paramPath>
      <tracePath>TRACE_PATH</tracePath>
      <outputPath>./results/</outputPath>
      <clock>1.4Ghz</clock>
    </params>
  </component>
</sst>
MacSim+DRAMSim in SST:
<component name=gpu0 type=macsimComponent.macsimComponent>
  <params>
    <paramPath></paramPath>
    <tracePath></tracePath>
    <outputPath></outputPath>
    <clock>1.4Ghz</clock>
  </params>
  <link name=membus port=bus latency=1ns />
</component>
<component name=mem0 type=DRAMSimC >
  <params>
    <clock> 1066 Mhz </clock>
    <systemini> system-2.ini </systemini>
    <deviceini>GDDR5_hynix_1Gb.ini</deviceini>
  </params>
  <link name=membus port=bus latency=1ns />
</component>
SST Internals
Key Objects & Interfaces
• Goal: Simplicity
• Objects
  – SST::Component: A model of a hardware component
  – SST::Link: A connection between two components
  – SST::Event: A discrete event
  – SST::EventHandler: Function to handle an incoming event or clock tick
• Events
  – SST::Component::ConfigureLink(): Registers a link and (optionally) a handler
  – SST::Link::Recv(): Pull an event from a link
  – SST::Link::Send(): Send an event down a link
  – SST::Component::registerClock(): Register a clock and handler
[Figure: two Components, each with an EventHandler, exchanging an Event over a Link]
Routermodel Constructor
Routermodel(ComponentId_t id, Clock* clock, Params_t& params) :
    Component(id, clock), params(params)
{
    // Gather parameters
    Params_t::iterator it= params.begin();
    while (it != params.end()) {
        if (!it->first.compare("hop_delay")) {
            int delay;
            sscanf(it->second.c_str(), "%d", &delay);
            hop_delay= delay * 0.000000000001;
        }
        ...;
        ++it;
    }

    // Create event handler: one handler handles all ports
    Handler= new EventHandler<Routermodel, bool, Time_t, Event *>
        (this, &Routermodel::handle_events);

    // Add links and register the handler with each link.
    // "local", "west", "east" are the link local names from the XML.
    local_port= LinkAdd("local", Handler);
    west_port= LinkAdd("west", Handler);
    east_port= LinkAdd("east", Handler);
    ...;
}
Simple Event Handler
bool Routermodel::handle_events(Time_t time, Event *event)
{
    NicEvent *e= static_cast<NicEvent *>(event);

    e->router_delay += hop_delay;
    if (e->routeX > 0) {
        // Keep going East
        e->routeX--;
        east_port->Send(e);
    } else if (e->routeX == 0) {
        if (e->routeY > 0) {
            // Go South
            e->routeY--;
            south_port->Send(e);
        } else if (e->routeY == 0) {
            // We have arrived!
            local_port->Send(e);
        } else /* e->routeY < 0 */ {
            // Go North
            e->routeY++;
            north_port->Send(e);
        }
    } else /* e->routeX < 0 */ {
        // Keep going West
        e->routeX++;
        west_port->Send(e);
    }

    return false;
}
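The handler above is dimension-order routing: exhaust the X offset first, then the Y offset, then deliver locally. The standalone sketch below isolates that decision so it can be exercised without the simulator. `Port`, `Route`, `next_hop`, and `hops_to_destination` are illustrative names, not part of the Routermodel code.

```cpp
#include <cassert>

// Output ports of the toy router.
enum class Port { East, West, North, South, Local };

// Remaining signed hop counts in each dimension.
struct Route {
    int x;
    int y;
};

// One routing step: pick the output port and consume one hop,
// following the same branch order as the slide's handler.
Port next_hop(Route& r) {
    if (r.x > 0) { --r.x; return Port::East; }
    if (r.x < 0) { ++r.x; return Port::West; }
    if (r.y > 0) { --r.y; return Port::South; }
    if (r.y < 0) { ++r.y; return Port::North; }
    return Port::Local;  // both offsets are zero: we have arrived
}

// Number of router-to-router hops before local delivery.
int hops_to_destination(Route r) {
    int hops = 0;
    while (next_hop(r) != Port::Local) {
        ++hops;
    }
    return hops;
}
```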
Multiple Ways of Handling Events
• "Poll" a Link
  – Link::Recv()
  – Usually called from a clock
  – Allows multiple events to be pulled in same function
• Register event handler with link
  – Handler called whenever event arrives
• Register a Clock
  – Called at regular interval
Constructor {
    cpu= LinkAdd( "port0" );

    eventHandler = new EventHandler< Xbar, bool, Time_t, Event* >
        ( this, &Xbar::processEvent );
    nic= LinkAdd( "port1", eventHandler );

    clockHandler = new EventHandler< Xbar, bool, Cycle_t, Time_t >
        ( this, &Xbar::clock );
    ClockRegister( frequency, clockHandler );
}
....
CompEvent *e = cpu->Recv();
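The difference between polling a link and registering a handler can be seen in a small self-contained model. This is a sketch under assumptions: `ToyEvent`, `ToyLink`, and `ToyCounter` are invented for illustration and ignore latency and simulated time, which the real SST::Link handles.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Toy event carrying a single value.
struct ToyEvent {
    int value;
};

// Toy link: pushes to a handler if one is registered, otherwise
// queues events until the receiver polls with recv().
class ToyLink {
public:
    using Handler = std::function<void(std::unique_ptr<ToyEvent>)>;

    void setHandler(Handler h) { handler_ = std::move(h); }

    void send(std::unique_ptr<ToyEvent> ev) {
        if (handler_) {
            handler_(std::move(ev));      // handler style: called on arrival
        } else {
            pending_.push(std::move(ev)); // polling style: wait for recv()
        }
    }

    // Returns the oldest pending event, or nullptr if none is waiting.
    std::unique_ptr<ToyEvent> recv() {
        if (pending_.empty()) return nullptr;
        std::unique_ptr<ToyEvent> ev = std::move(pending_.front());
        pending_.pop();
        return ev;
    }

private:
    Handler handler_;
    std::queue<std::unique_ptr<ToyEvent>> pending_;
};

// Toy component that sums every event arriving on its input link.
class ToyCounter {
public:
    explicit ToyCounter(ToyLink& in) {
        in.setHandler([this](std::unique_ptr<ToyEvent> ev) {
            total_ += ev->value;
        });
    }
    int total() const { return total_; }

private:
    int total_ = 0;
};
```

Polling suits components that already do work on every clock tick, while a handler avoids waking a component that has nothing to process.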
Time & Checkpointing
• Time
  – Represented internally as 64-bit int (default femtoseconds)
  – All time intervals in configuration are specified in SI units (e.g. 1.5 ns, 3 GHz, etc.)
    • Example: registerClock( "1 GHz", clockHandler );
  – TimeConverter objects provided to convert between different bases
• Checkpointing
  – Failures inevitable at scale
  – Checkpointing must be built in as a "first class" function
  – Uses Boost Serialization Library
  – Components dump state to binary file at specified interval
Example Checkpoint Function
template<class Archive>
void load(Archive & ar, const unsigned int version)
{
    // serialize base
    ar & BOOST_SERIALIZATION_BASE_OBJECT_NVP(Component);
    // serialize members
    ar & BOOST_SERIALIZATION_NVP(workPerCycle);
    ar & BOOST_SERIALIZATION_NVP(commFreq);
    ar & BOOST_SERIALIZATION_NVP(commSize);
    ar & BOOST_SERIALIZATION_NVP(neighbor);
    ar & BOOST_SERIALIZATION_NVP(N);
    ar & BOOST_SERIALIZATION_NVP(S);
    ar & BOOST_SERIALIZATION_NVP(E);
    ar & BOOST_SERIALIZATION_NVP(W);
    // restore links
    N->setFunctor(eventHandler);
    S->setFunctor(eventHandler);
    E->setFunctor(eventHandler);
    W->setFunctor(eventHandler);
}
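To make the time representation concrete, the sketch below converts SI-style values into a 64-bit femtosecond count, which is the default base named on the slide. The helper names and the small set of accepted unit strings are assumptions for illustration; this is not the TimeConverter interface.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <stdexcept>
#include <string>

// Convert a time value with an SI unit into integer femtoseconds.
std::uint64_t to_femtoseconds(double value, const std::string& unit) {
    double fs_per_unit = 0.0;
    if (unit == "fs") fs_per_unit = 1.0;
    else if (unit == "ps") fs_per_unit = 1e3;
    else if (unit == "ns") fs_per_unit = 1e6;
    else if (unit == "us") fs_per_unit = 1e9;
    else if (unit == "ms") fs_per_unit = 1e12;
    else if (unit == "s") fs_per_unit = 1e15;
    else throw std::invalid_argument("unknown time unit: " + unit);
    return static_cast<std::uint64_t>(std::llround(value * fs_per_unit));
}

// Convert a clock frequency into its period in integer femtoseconds.
std::uint64_t period_femtoseconds(double value, const std::string& unit) {
    double hz_per_unit = 0.0;
    if (unit == "Hz") hz_per_unit = 1.0;
    else if (unit == "kHz") hz_per_unit = 1e3;
    else if (unit == "MHz") hz_per_unit = 1e6;
    else if (unit == "GHz") hz_per_unit = 1e9;
    else throw std::invalid_argument("unknown frequency unit: " + unit);
    return static_cast<std::uint64_t>(std::llround(1e15 / (value * hz_per_unit)));
}
```

A 3 GHz clock has a period that is not a whole number of femtoseconds, so an integer time base rounds it; that is one reason a fine default base is useful.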
Case Studies
Sample Results & Uses
[Figure: MiniMD Memory Power Breakdown; pie chart over NoC, DRAM, L2, MC with segments 5%, 16%, 51%, 28%]
[Figure: GUPS Memory Power Breakdown; pie chart over NoC, DRAM, L2, MC with segments 1%, 15%, 59%, 26%]
[Figure: LSQ Occupancy (entries) for GUPS, PageRank, MiniMD, HPCCG; bar labels 35.99, 9.59, 30.44, 50.03]
[Figure: Avg. Memory Latency (nanoseconds) for GUPS, PageRank, MiniMD, HPCCG; bar labels 414.01, 60.7, 200.32, 2206.8]
Power analysis helps prioritize technology investments
SST Simulation of MD code shows diminishing returns for threading on small data sets
Detailed component simulation highlights bottlenecks
Prototyping
• Algorithm/application presented as a skeleton code
• Express communication or memory access pattern
• Easy to change / ignores many details
• Allows exploration of a single system feature
  – E.g. collective performance in presence of noise (ignore processor & memory, focus on router & NIC)
[Figure: Allreduce time (ns, 0 to 140000) versus nodes (64 to 32768). Series, each with and without noise, all at 1000 ns latency and 10 Mmsgs/s: Host Tree (Radix-8), Triggered Tree (Radix-16), Recursive Doubling, Triggered Recursive Doubling]
Design Space Exploration
Design Space Exploration Example
• Design Space Exploration
  – Inputs
    • Memory technology (DDR2, DDR3, GDDR5)
    • Core width (1, 2, 4, 8 wide issue)
    • Cache size (32/32/1M or 64/64/2M)
  – Outputs: power, performance, cost
• Methodology
  – Performance models: GeM5/x86, DRAMSim2*
  – Energy models: DRAMSim2, McPAT
  – Cost models: IC Knowledge
• Key Questions
  – What is a good cache size? Core width?
  – Which DRAM technology?
• Example of the sorts of questions simulation can answer
[Figure: cores and memory controller (MC) attached to four DIMMs (DDR2, DDR3, GDDR5); small or large cores, small or large caches; execution-based processor model; detailed DRAM model; target apps HPCCG and Lulesh]
Cache Size
• Larger caches increase processor size, power
• Avg. power increase: 6.75%
• Avg. cost increase: 3.76%
• Avg. performance improvement
  – Lulesh: 1.40%
  – HPCCG: 6.73%
• Conclusion: Lulesh probably wouldn't benefit, HPCCG marginal benefit
[Figure: Effect of Larger Caches; bar chart of the increase with large caches (0% to 8%) in power, system cost, Lulesh performance, and HPCCG performance]
Which Memory System
• Options
  – DDR2: Cheap, low power, antiquated
  – DDR3: Higher performance, reasonable power
  – GDDR5: Expensive, high power, very fast
• Pure performance:
  – GDDR 26-47% faster than DDR3 (Lulesh)
  – GDDR 32-41% faster than DDR3 (HPCCG)
  – GDDR wins?
[Figure: Lulesh Performance and HPCCG Performance; normalized performance versus processor width (1, 2, 4, 8, average) for DDR2, DDR3, GDDR5; higher is better]
Energy & Cost
• GDDR: better performance
• DDR3 generally does better on perf/Watt
  – Lulesh: -3% to 107%
  – HPCCG: 0 to 100%
  – GDDR does well at higher processor widths
  – DDR2 sometimes slightly better than DDR3 on HPCCG
• perf/$
  – DDR3 better for narrow cores, GDDR better for wide
  – DDR slightly better overall
[Figure: Lulesh and HPCCG Performance per Watt; normalized performance/power versus processor width (1, 2, 4, 8, average) for DDR2, DDR3, GDDR5; higher is better]
[Figure: Lulesh and HPCCG Performance per Dollar; normalized performance/cost versus processor width (1, 2, 4, 8, average) for DDR2, DDR3, GDDR5; higher is better]
Processor Width
• Wider processor can issue more instructions/cycle
• Consumes more area, power
  – Cost is super-linear with respect to area increase
• Power often increases faster than performance
  – E.g. 8-wide processor 78% faster on Lulesh, uses 123% more power
• 1-2 wide cores most power efficient
• 2-4 most cost efficient
[Figure: Lulesh: Effect of Processor Width and HPCCG: Effect of Processor Width; normalized performance per unit power and per unit cost versus processor width (1, 2, 4, 8); higher is better]
Design Space Exploration Results
• Fastest memory technology not always best (DDR beats GDDR) due to power, cost
• No "best" processor: depends on tradeoff between cost, performance, power
• Can provide better understanding of which configurations are best for a given application
• Can be used as basis for application optimization, vendor guidance
Lulesh Pareto Optimal Designs
Width  Memory  Cache  Power  Performance  Cost
1      DDR3    Small  1.00   1.00         1.0
2      DDR3    Small  1.43   1.65         1.3
2      GDDR5   Small  3.00   2.28         2.0
4      GDDR5   Small  3.57   2.92         2.3
8      GDDR5   Small  5.29   3.62         3.4
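A design is Pareto optimal when no other design is at least as good on power, performance, and cost and strictly better on at least one. The sketch below applies that standard definition to the table's normalized values; the `Design` struct and function names are illustrative, and treating lower power and cost and higher performance as "better" is the assumption being made.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One point in the design space, with the slide's normalized metrics.
struct Design {
    int width;
    std::string memory;
    double power;        // lower is better
    double performance;  // higher is better
    double cost;         // lower is better
};

// True if a is no worse than b everywhere and strictly better somewhere.
bool dominates(const Design& a, const Design& b) {
    const bool no_worse = a.power <= b.power && a.cost <= b.cost &&
                          a.performance >= b.performance;
    const bool better = a.power < b.power || a.cost < b.cost ||
                        a.performance > b.performance;
    return no_worse && better;
}

// Keep only the designs that no other design dominates.
std::vector<Design> pareto_front(const std::vector<Design>& designs) {
    std::vector<Design> front;
    for (std::size_t i = 0; i < designs.size(); ++i) {
        bool dominated = false;
        for (std::size_t j = 0; j < designs.size(); ++j) {
            if (i != j && dominates(designs[j], designs[i])) {
                dominated = true;
                break;
            }
        }
        if (!dominated) front.push_back(designs[i]);
    }
    return front;
}

// The five Lulesh designs from the table above.
std::vector<Design> lulesh_table() {
    return {
        {1, "DDR3", 1.00, 1.00, 1.0},
        {2, "DDR3", 1.43, 1.65, 1.3},
        {2, "GDDR5", 3.00, 2.28, 2.0},
        {4, "GDDR5", 3.57, 2.92, 2.3},
        {8, "GDDR5", 5.29, 3.62, 3.4},
    };
}
```

Every row of the table survives the filter because each step up in performance also costs more power and money, which is why the slides conclude there is no single best processor.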
Future Directions
New Components & Increased Integration
• New Components in development
  – NVRAM
  – New HPC NIC
  – Interface to Palacios hypervisor
• Increased Integration & Improvements
  – Parallel GeM5
  – GeM5/IRIS/DRAMSim integration
  – SST/Meso scale
  – Improved Power/Area/Cost models
[Figure: Meso-Scale Simulation; Runnemede Simulator, NMSU Stochastic Simulator, M5, MacSim, IRIS Network Models, DRAMSim, Skeleton Apps, and Traces connected through a Common Interface that exchanges Memory Refs, Traces, Stochastic Blocks, Comm. Events, and Exec. Blocks, with Memoization]
Power/Energy Modeling
• Parameterized models turn architectural parameters into energy/event and static dissipation (e.g. leakage energy)
• During simulation, count key events
• Event Counts * Energy/Event = Dynamic Energy
• Dynamic + Static Energy = Total Energy
• Energy / time = power
• ThermalModel(power) = temperature
• Temperature adjusts leakage power
[Figure: a cache model turns size, associativity, and temperature into J/read, J/write, and static power; simulation supplies reads, writes, and time; multiplying the two gives energy, which feeds the thermal model]
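The bookkeeping in the bullets above is short enough to write out directly. In this sketch the struct and function names are illustrative, and the thermal feedback step is omitted. A cache with hypothetical costs of 1 nJ per read, 2 nJ per write, and 0.5 W static power that sees 1e9 reads and 5e8 writes in 2 seconds uses about 2 J dynamic plus 1 J static, for an average of about 1.5 W.

```cpp
#include <cassert>
#include <cstdint>

// Per-event and static costs produced by a technology model.
struct CacheEnergyModel {
    double joules_per_read;
    double joules_per_write;
    double static_watts;
};

// Event counts and elapsed time gathered during simulation.
struct CacheActivity {
    std::uint64_t reads;
    std::uint64_t writes;
    double seconds;
};

// Event counts * energy/event = dynamic energy.
double dynamic_energy_joules(const CacheEnergyModel& m, const CacheActivity& a) {
    return static_cast<double>(a.reads) * m.joules_per_read +
           static_cast<double>(a.writes) * m.joules_per_write;
}

// Dynamic + static energy = total energy.
double total_energy_joules(const CacheEnergyModel& m, const CacheActivity& a) {
    return dynamic_energy_joules(m, a) + m.static_watts * a.seconds;
}

// Energy / time = power.
double average_power_watts(const CacheEnergyModel& m, const CacheActivity& a) {
    return total_energy_joules(m, a) / a.seconds;
}
```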
Case Study: Reliability vs. Power
Hidden cost of DVFS
• Dynamic voltage/frequency scaling reduces power
  ➔ Reduces temperature
  ➔ Causes thermal cycling
  ➔ Reduces reliability
• Need
  – Algorithms to balance temperature, lower power, & maintain performance
  – Arch: Sensors and feedback
  – Runtime: Scheduler changes
  – App: Awareness
[Figure: Breakdown of Failures; dynamic power management, sleep state, accelerated thermal cycling (Coskun 2011)]
Bonus Slides
Component Validation
• Strategy: component validation in parallel with system-level validation
• Current components validated at different levels, with different methodologies
• Validation in isolation
• What is needed
  – Uniform validation methodology (apps)
  – System (multi-component) level validation
Component     Method                                           Error
DRAMSim       RTL-level validation against Micron              Cycle
Generic Proc  SimpleScalar SPEC92 validation                   ~5%
NMSU          Comparison vs. existing processors on SPEC       <7%
RS Network    Latency/BW against SeaStar 1.2, 2.1              <5%
MacSim        Comparison vs. existing GPUs                     Ongoing, <10% expected
Zesto         Comparison vs. several processors, benchmarks    4-5%
McPAT         Comparisons against existing processors          10-23%
Power is the Problem
• 2018 Exascale Machine
  – 1 Exaop/sec
  – 100s petabyte/sec memory bandwidth
  – 100s petabyte/sec interconnect bandwidth
  – No major architecture changes
• Consider power
  – 1 pJ * 1 Exa = 1 MW
  – 1 MW for one year ≈ $1 M
  – $200-400M / year power bill
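The two rules of thumb on this slide follow from unit conversion. The sketch below works them out, assuming the $0.1/kWh retail price quoted earlier in the deck and 8760 hours in a year; the function names are illustrative. At that price one megawatt-year comes to about $0.88 M, which the slide rounds to $1 M.

```cpp
#include <cassert>

// Power in megawatts for a given energy per operation and operation rate.
double megawatts(double picojoules_per_op, double ops_per_second) {
    const double watts = picojoules_per_op * 1e-12 * ops_per_second;
    return watts / 1e6;
}

// Yearly electricity cost in millions of US$ for a constant power draw.
double annual_cost_million_usd(double mw, double usd_per_kwh) {
    const double hours_per_year = 8760.0;  // 365 days * 24 hours
    const double kwh = mw * 1000.0 * hours_per_year;
    return kwh * usd_per_kwh / 1e6;
}
```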
management of data, combined with better overlap of communication and computation could reduce bandwidth
requirements by 50% or more.
Table 2: Effects of Power Reduction Techniques on an Exascale System
              2018 Estimate  Reduction Techniques          Reduced Power  Reduction
Processing    224 MW         Simpler processor, Reduce FP  11.2 MW        95%
Memory        125 MW         Closer Proximity              37.5 MW        70%
Interconnect  24 MW          Message Overlap               12 MW          50%
Total         373 MW                                       60.7 MW        84%
Applying these techniques to the hypothetical system from the introduction, we see a power reduction of 84%. This turns a
wholly impractical machine consuming hundreds of megawatts and costing significant fractions of a billion dollars to
power to a more attainable machine consuming tens of megawatts. More aggressive application of the techniques presented
here, and improvements in technology could reduce this further.
Figure 13 – Lessons from Embedded Systems
From Jensen “Embedded systems and exascale computing.” CiSE 2010
              Energy        Conventional
Processor     62.5 pJ/op    62.5 MW
Memory        31.25 pJ/bit  125 MW
Interconnect  6 pJ/bit      24 MW
Total                       211.5 MW
X to Solution
• Wider processors provide shorter time to solution
• Require much more energy to solution
[Figure: Lulesh: Effect of Processor Width and HPCCG: Effect of Processor Width; normalized energy to solution and time to solution versus processor width (1, 2, 4, 8); lower is better]