SST Overview - Georgia Institute of...

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

SST Overview

Genie Hsieh, Arun Rodrigues

Sandia National Labs

Why SST?

View of the Simulation Problem
•Multiple Audiences
–Application writers, purchasers, designers
–System procurement, algorithm co-design, architecture research, language research
–Network, processor, system
–Present systems, future systems
•Scale
–Many cores + memory
–Many many nodes
–Many many many threads
•Complexity
–Multi-physics apps, informatics apps
–Communication libraries, run-times, OS effects
–Existing languages, new languages
•Constraints
–Performance, cost, power, reliability, cooling, usability, risk, size

Worldwide Impact

"Total power used by servers [in 2005] represented ... an amount comparable to that for color televisions." - Estimating Total Power Consumption by Servers in the U.S. and the World, Jonathan G. Koomey

  3741e9 kWh         Total US power consumption
× 3-4%               Used by computers (>2% servers, >1% household computer use)
= 112-150e9 kWh      US computer power consumption
× $0.10/kWh          Retail cost, US average 2009
= $11-15 billion     US$ in compute power
× 3-5                In 2005 the US was roughly 1/3 of servers, by power
= $33-75 billion     US$ in worldwide computer power
× 15-35%             DRAM memory power
= $5-25 billion      US$ in DRAM power
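The chain of estimates above can be reproduced with a few multiplications. The sketch below is only an illustration: the `Range` type and the function names are made up for this example, and the inputs are the slide's own figures. The computed bounds agree with the slide to within its rounding (the upper DRAM bound comes out slightly above $25 billion).

```cpp
#include <cstdio>

// A low/high pair, so both ends of each estimate are carried together.
struct Range {
    double lo;
    double hi;
};

// Multiply the low end by lo_factor and the high end by hi_factor.
Range scale(Range r, double lo_factor, double hi_factor) {
    return Range{r.lo * lo_factor, r.hi * hi_factor};
}

// US computer electricity cost in dollars per year.
Range us_compute_cost() {
    const double us_total_kwh = 3741e9;  // total US consumption (slide figure)
    const double usd_per_kwh = 0.10;     // US average retail price, 2009
    Range kwh = scale(Range{us_total_kwh, us_total_kwh}, 0.03, 0.04);  // 3-4% share
    return scale(kwh, usd_per_kwh, usd_per_kwh);
}

// Worldwide cost: the slide multiplies the US figure by 3 to 5.
Range world_compute_cost() {
    return scale(us_compute_cost(), 3.0, 5.0);
}

// Share attributed to DRAM: 15% to 35% of the worldwide figure.
Range world_dram_cost() {
    return scale(world_compute_cost(), 0.15, 0.35);
}

// Print each stage in billions of dollars.
void print_estimates() {
    const Range stages[] = {us_compute_cost(), world_compute_cost(), world_dram_cost()};
    const char* names[] = {"US compute", "World compute", "World DRAM"};
    for (int i = 0; i < 3; ++i) {
        std::printf("%-14s $%.1f - $%.1f billion\n", names[i],
                    stages[i].lo / 1e9, stages[i].hi / 1e9);
    }
}
```

Calling `print_estimates()` from a `main` prints the three ranges in billions of dollars.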

Major Simulation Challenges
•Multiple Objectives
–Performance used to be the only criterion
–Now energy, cost, power, reliability, etc.
•Scale & Detail
–Many system characteristics require detail to measure
–Detailed simulation takes too long (10^4-10^5 times slower than real time)
•Accuracy
–Systems are more complex
–Vendors don’t reveal necessary details


What Is the SST?

SST Simulation Project Overview

Goals
•Become the standard architectural simulation framework for HPC
•Be able to evaluate future systems on DOE workloads
•Use supercomputers to design supercomputers

Technical Approach
•Parallel
–Parallel Discrete Event core with conservative optimization over MPI
•Holistic
–Integrated technology models for power
–McPAT, Sim-Panalyzer
•Multiscale
–Detailed and simple models for processor, network, and memory
•Open
–Open core, non-viral, modular

Consortium
•“Best of Breed” simulation suite
•Combine lab, academic, & industry

Status
•Current release (2.1) at code.google.com/p/sst-simulator/
•Includes parallel simulation core, configuration, power models, basic network and processor models, and interface to detailed memory model

Parallel Implementation
•Implemented over MPI
•Configuration, partitioning, initialization handled by core
•Conservative, distance-based optimization

Message Handling
•SST core transparently handles message delivery
•Detects if destination is local or remote
•Local messages delivered to local queues
•Remote messages stored for later serialization and remote delivery
–Boost Serialization Library used for message serialization
–MPI used for transfer
•Ranks synchronize based on partitioning
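The local/remote split described above can be pictured with a small, self-contained sketch. Everything here is hypothetical: `Event`, `RankRouter`, and the component-to-rank map are invented for illustration and are not SST's classes. Serialization and the MPI transfer are left out; `drain()` only hands back the batch that would be shipped at the next synchronization point.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical event record; a real simulator event carries a payload too.
struct Event {
    std::uint64_t deliver_time;  // simulated time at which the event is due
    int dest_component;          // id of the receiving component
};

// Orders the local queue so the earliest event is on top.
struct Later {
    bool operator()(const Event& a, const Event& b) const {
        return a.deliver_time > b.deliver_time;
    }
};

class RankRouter {
public:
    RankRouter(int my_rank, std::map<int, int> component_to_rank)
        : my_rank_(my_rank), owner_(std::move(component_to_rank)) {}

    // Deliver locally if this rank owns the destination; otherwise buffer
    // the event for the owning rank until the next synchronization point.
    void send(const Event& ev) {
        const int owner = owner_.at(ev.dest_component);
        if (owner == my_rank_) {
            local_.push(ev);
        } else {
            outbound_[owner].push_back(ev);
        }
    }

    // Hand back (and clear) the events buffered for one remote rank.
    // This is where serialization and the transfer would happen.
    std::vector<Event> drain(int rank) {
        std::vector<Event> batch;
        batch.swap(outbound_[rank]);
        return batch;
    }

    std::size_t local_pending() const { return local_.size(); }

private:
    int my_rank_;
    std::map<int, int> owner_;  // component id -> owning rank
    std::priority_queue<Event, std::vector<Event>, Later> local_;
    std::map<int, std::vector<Event>> outbound_;  // rank -> buffered events
};
```

Buffering remote events per destination rank means each synchronization needs at most one message per neighboring rank, which is one plausible reason to store them rather than send them immediately.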

Multi-Scale
•Goal: Enable tradeoffs between accuracy, flexibility, and simulation speed
–No single “right” way to simulate
–Support multiple audiences
•High- & low-level interfaces
–Allows multiple input types
–Allows multiple input sources
  •Traces, stochastic, state-machines, execution...

Multiscale Parameters

                     High-Level                       Low-Level
Detail               Message                          Instruction
Fundamental Objects  Message, Compute block, Process  Instruction, Thread
Static Generation    MPI Traces, MA Traces            Instruction Trace
Dynamic Generation   State Machine                    Execution

Simulator Structure
[Diagram: Simulator Core providing services (Parallel DES, MPI, Checkpointing, Statistics, Power/Area/Cost, Configuration), with vendor components and open components attached]

SST/GeM5 Integration
•Goals: Run GeM5 in parallel, connect with other SST components
•High parallel efficiency
•Changes
–Replaced Python-based configuration with XML or C++-based system
–Encapsulated GeM5 as an SST Component; GeM5 event queue driven by SST clock
–Created translator SimObjects to connect to SST links
–Changed the GeM5 loader to avoid use of async (untimed out-of-band) messages

[Diagram: gem5 SST Component. Labels: Bridge, PhysMem, IO Bus, Syscall Handler, Translator, Portals NIC, SS Router, MemBus + Memory, DRAMSim]

GeM5
•M5: Modular platform for computer system architecture research, encompassing system-level architecture as well as processor microarchitecture
•Provides detailed, full-system CPU models for x86, ARM, SPARC, Alpha
•Integrated as an SST Component, which allows interaction with SST models and parallel execution
•Currently tested up to 256 nodes

[Diagram labels: SimObject, SST::Component, SST::Link, SST Queue]

Diversity!
[Diagram labels: genericProc, RS Router, DRAMSim, Stochastic, simpleRouter, ResiliencySim, scheduleSim, Execution Time, MacSim]
Commonalities
•Discrete event or time stepped
•Amenable to event counting for power modeling

Simulation Process

Step  Action                               Example
1     Input problem description            Skeleton App, Mini App, Compact App
2     Input hardware description           (Vendor) roadmap, novel architecture
3     Run model                            Detailed Model, Abstract Model, Vendor Model
4     Read results                         Performance, Energy, Cost, Reliability, etc.
5     Refine problem/hardware description  Rewrite Proxy App, use more detailed model
6     Goto 1                               Increase level of detail & repeat


Component Library
•Parallel Core v2
–Parallel DES layered on MPI
–Partitioning
–Configuration & Checkpointing
–Power modeling
•Technology Models
–McPAT, Sim-Panalyzer, IntSim, Orion and custom power/energy models
–HotSpot thermal model
–Working on reliability models
•Components
–Processor: Macro Applications, Macro Network, NMSU, genericProc, state-machine, Zesto, GeM5, GPGPU
–Network: Red Storm, simpleRouter, GeM5
–Memory: DRAMSim II, Adv. Memory, Flash, SSD, DiskSim


Holistic Simulation
•Design space includes much more than simple performance
•Create common interface to multiple technology libraries
–Power/Energy
–Area/Timing estimation
•Make it easier for components to model technology parameters
[Diagram labels: Component, PI, McPAT, Sim-Panalyzer, Others]

Open Simulator Framework


•Simulator Core will provide...
–Power, Area, Cost modeling
–Checkpointing
–Configuration
–Parallel Component-Based Discrete Event Simulation
•Components
–Ships with basic set of open components
–Industry can plug in their own models
  •Under no obligation to share
•Open Source (BSD-like) license
•SVN hosted on Google Code

How To Use the SST

Where to Get the SST
•Google Code
•http://code.google.com/p/sst-simulator/
–Anonymous checkout available
•BSD-like License
•Directory Structure

    /sst-simulator     : Top Level
      /deps            : Scripts to help build dependencies
      /doc             : Doxygen
      /sst             : Source code
        /core          : Simulator Core (DES)
        /elements      : Simulation models
          /DRAMSimC    : DRAM Simulator
          /M5          : GeM5 Simulator
          /SS_Router   : Cray SeaStar Simulator
          /zesto       : Detailed x86
          /iris        : NoC Simulator
          Many Others...

Dependencies
•System
–64-bit Linux (Debian or RedHat)
–MacOS >10.4
•Libraries
–MPI distribution (Open MPI or MPICH2)
–Boost 1.43.x
–ParMETIS 3.1.1 (optional)
–Zoltan 3.2 (optional)
•Optional Components
–DRAMSim2
–DiskSim (64-bit Linux only)
–McPAT
–IntSim
–ORION
–HotSpot
–GeM5 (64-bit Linux only)
•Detailed directions at: http://code.google.com/p/sst-simulator/w/list

SST Deps Tarball
•Fast-path dependencies
•Download sstComponents_2012-FEB-23.tar.gz from:
–http://code.google.com/p/sst-simulator/downloads/list
•Untar in $HOME
•cd sst-simulator/deps/bin
•./sstDependencies.sh cleanBuild
•Uncompresses, patches, and builds
–Boost, DiskSim, DRAMSim 2, HotSpot, IntSim, McPAT, Orion, GeM5

Building the SST
•From top level
•./autogen.sh
•./configure <options>
–./configure --help for explanations
–--prefix=DIR
–--with-gem5=DIR
–--with-disksim=DIR
•make
•make install

Building the SST With MacSim
•Unpack the MacSim source into the SST elements dir
–cd $SST_TOP/sst/elements
–svn co https://svn.research.cc.gatech.edu/macsim/trunk macsim
–cd macsim
•Patch MacSim
–patch -p0 -i macsim-sst.patch
•Build the SST
–cd $SST_TOP
–./autogen.sh
–./configure

Running the SST
•Serial
–sst.x <SDL File>
•Parallel
–mpirun -n <ranks> sst.x <SDL File>

Sample XML Format
•Simple component-based format
•Programmatic construction (C++ or scripted interface) in development
[Figure callouts: Component, Component Parameters, Link Specification(s)]

Sample XML Format
•Named Links
–Endpoints specified
–Minimum latency specified
  •Used in partitioning & optimization
•Variables & Parameter Blocks
[Figure callouts: Link Name, Link Local Name]

GeM5 Example
•Two-Level Configuration
–SST Configuration
  •One M5 SST::Component per rank
  •Simulation variables (stop time, etc.)
–M5 Configuration
  •M5 sub-components (processors, caches, busses)
  •Uses same format as SST XML
  •Subcomponent config parameters same as GeM5

    <sst>
      <component name=system type=m5C.M5 rank=0 >
        <params>
          <configFile>exampleM5.xml</configFile>
          <debug> 0 </debug>
          <M5debug> none </M5debug>
          <info>yes</info>
          <registerExit>yes</registerExit>
        </params>
      </component>
    </sst>

Portals NIC Example
•Tests large-scale HPC network
–Processor: State-machine
–NIC: Simplified packet-based
–Router: Detailed flit-based router (Cray SeaStar)
•Tested up to 3M components
•Includes SDL generator
[Figure callouts: Library, Component, Manual Partition]

MacSim Example
•Each MacSim instance is an SST::Component
•Clock speed, paths to traces, and configuration specified in SDL
•Two Modes
–Standalone
–With DRAMSim2

Standalone MacSim in SST

    <sst>
      <component name=gpu0 type=macsimComponent.macsimComponent>
        <params>
          <paramPath>PARAM_PATH</paramPath>
          <tracePath>TRACE_PATH</tracePath>
          <outputPath>./results/</outputPath>
          <clock>1.4Ghz</clock>
        </params>
      </component>
    </sst>

MacSim+DRAMSim in SST

    <component name=mem0 type=DRAMSimC >
      <params>
        <clock> 1066 Mhz </clock>
        <systemini> system-2.ini </systemini>
        <deviceini>GDDR5_hynix_1Gb.ini</deviceini>
      </params>
      <link name=membus port=bus latency=1ns />
    </component>

SST Internals

Key Objects & Interfaces
•Goal: Simplicity
•Objects
–SST::Component: A model of a hardware component
–SST::Link: A connection between two components
–SST::Event: A discrete event
–SST::EventHandler: Function to handle an incoming event or clock tick
•Events
–SST::Component::ConfigureLink(): Registers a link and (optionally) handler
–SST::Link::Recv(): Pull an event from a link
–SST::Link::Send(): Send an event down a link
–SST::Component::registerClock(): Register a clock and handler
[Figure callouts: Component, EventHandler]

Sample XML Format
•Named Links
•Endpoints specified
•Minimum latency specified
–Used in partitioning
[Figure callouts: Link Name, Link Local Name]

Routermodel Constructor

    Routermodel(ComponentId_t id, Clock* clock, Params_t& params) :
        Component(id, clock),
        params(params)
    {
        // Gather parameters
        Params_t::iterator it = params.begin();
        while (it != params.end()) {
            if (!it->first.compare("hop_delay")) {
                int delay;
                sscanf(it->second.c_str(), "%d", &delay);
                hop_delay = delay * 0.000000000001;  // scale by 1e-12
            }
            ...;
            ++it;
        }

        // Create event handler: one handler handles all ports
        Handler = new EventHandler<Routermodel, bool, Time_t, Event *>
            (this, &Routermodel::handle_events);

        // Add links and register the handler with each link;
        // the string argument is the link's local name
        local_port = LinkAdd("local", Handler);
        west_port = LinkAdd("west", Handler);
        east_port = LinkAdd("east", Handler);
        ...;
    }

Simple Event Handler

    bool Routermodel::handle_events(Time_t time, Event *event)
    {
        NicEvent *e = static_cast<NicEvent *>(event);

        e->router_delay += hop_delay;
        if (e->routeX > 0) {
            // Keep going East
            e->routeX--;
            east_port->Send(e);
        } else if (e->routeX == 0) {
            if (e->routeY > 0) {
                // Go South
                e->routeY--;
                south_port->Send(e);
            } else if (e->routeY == 0) {
                // We have arrived!
                local_port->Send(e);
            } else /* e->routeY < 0 */ {
                // Go North
                e->routeY++;
                north_port->Send(e);
            }
        } else /* e->routeX < 0 */ {
            // Keep going West
            e->routeX++;
            west_port->Send(e);
        }

        return false;
    }
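The handler above routes in dimension order: it uses up the X offset first, then the Y offset, then delivers locally. The same decision can be tried outside the simulator with the standalone sketch below. `Port`, `Route`, `next_hop`, and `hops_to_destination` are names invented for this illustration; only the branch structure is taken from the slide.

```cpp
// Output port chosen for a packet at one router.
enum class Port { Local, North, South, East, West };

// Signed hops remaining in each dimension, as routeX/routeY in the handler.
struct Route {
    int x;
    int y;
};

// Pick the next port and consume one hop, mirroring the handler's branches:
// positive X goes East, negative X goes West, then positive Y goes South,
// negative Y goes North, and (0, 0) means the packet has arrived.
Port next_hop(Route& r) {
    if (r.x > 0) { --r.x; return Port::East; }
    if (r.x < 0) { ++r.x; return Port::West; }
    if (r.y > 0) { --r.y; return Port::South; }
    if (r.y < 0) { ++r.y; return Port::North; }
    return Port::Local;
}

// Count the router-to-router hops taken before local delivery.
int hops_to_destination(Route r) {
    int hops = 0;
    while (next_hop(r) != Port::Local) {
        ++hops;
    }
    return hops;
}
```

For a packet starting with offsets (2, -1), `next_hop` returns East twice, then North once, then Local, so `hops_to_destination` counts three hops.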

Multiple Ways of Handling Events
•“Poll” a Link
–Link::Recv()
–Usually called from a clock
–Allows multiple events to be pulled in same function
•Register event handler with link
–Handler called whenever event arrives
•Register a Clock
–Called at regular interval

    Constructor {
        // Polled link: no handler, events are pulled with Recv()
        cpu = LinkAdd( "port0" );

        // Link with an event handler, called whenever an event arrives
        eventHandler = new EventHandler< Xbar, bool, Time_t, Event* >
            ( this, &Xbar::processEvent );
        nic = LinkAdd( "port1", eventHandler );

        // Clock handler, called at a regular interval
        clockHandler = new EventHandler< Xbar, bool, Cycle_t, Time_t >
            ( this, &Xbar::clock );
        ClockRegister( frequency, clockHandler );
    }

    // Polling the link
    CompEvent *e = cpu->Recv();

Time & Checkpointing
•Time
–Represented internally as 64-bit int (default femtoseconds)
–All time intervals in configuration are specified in SI units (e.g. 1.5 ns, 3 GHz, etc.)
  •Example: registerClock( "1 GHz", clockHandler );
–TimeConverter objects provided to convert between different bases
•Checkpointing
–Failures inevitable at scale
–Checkpointing must be built in as “first class” function
–Uses Boost Serialization Library
–Components dump state to binary file at specified interval

Example Checkpoint Function

    template<class Archive>
    void load(Archive & ar, const unsigned int version)
    {
        // serialize base
        ar & BOOST_SERIALIZATION_BASE_OBJECT_NVP(Component);
        // serialize members
        ar & BOOST_SERIALIZATION_NVP(workPerCycle);
        ar & BOOST_SERIALIZATION_NVP(commFreq);
        ar & BOOST_SERIALIZATION_NVP(commSize);
        ar & BOOST_SERIALIZATION_NVP(neighbor);
        ar & BOOST_SERIALIZATION_NVP(N);
        ar & BOOST_SERIALIZATION_NVP(S);
        ar & BOOST_SERIALIZATION_NVP(E);
        ar & BOOST_SERIALIZATION_NVP(W);
        // restore links
        N->setFunctor(eventHandler);
        S->setFunctor(eventHandler);
        E->setFunctor(eventHandler);
        W->setFunctor(eventHandler);
    }
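To make the time representation on this slide concrete, here is a minimal sketch of turning an SI-unit string into a 64-bit femtosecond count. It is not SST's TimeConverter: `to_femtoseconds` is a hypothetical helper written for this example, it expects a number, whitespace, and a unit, and it matches units case-insensitively because the slides themselves write both "GHz" and "Ghz". A frequency is converted to its period.

```cpp
#include <cctype>
#include <cmath>
#include <cstdint>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Convert text such as "1.5 ns" or "3 GHz" to a whole number of femtoseconds.
// Time units give a duration; frequency units give the length of one period.
std::uint64_t to_femtoseconds(const std::string& text) {
    std::istringstream in(text);
    double value = 0.0;
    std::string unit;
    if (!(in >> value >> unit) || value <= 0.0) {
        throw std::invalid_argument("expected '<positive number> <unit>'");
    }
    for (char& c : unit) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }

    // Femtoseconds per time unit.
    static const std::map<std::string, double> time_units = {
        {"s", 1e15}, {"ms", 1e12}, {"us", 1e9}, {"ns", 1e6}, {"ps", 1e3}, {"fs", 1.0}};
    // Hertz per frequency unit (lowercased, so "mhz" means megahertz here).
    static const std::map<std::string, double> freq_units = {
        {"hz", 1.0}, {"khz", 1e3}, {"mhz", 1e6}, {"ghz", 1e9}};

    double femtoseconds = 0.0;
    auto t = time_units.find(unit);
    auto f = freq_units.find(unit);
    if (t != time_units.end()) {
        femtoseconds = value * t->second;
    } else if (f != freq_units.end()) {
        femtoseconds = 1e15 / (value * f->second);  // period = 1 / frequency
    } else {
        throw std::invalid_argument("unknown unit: " + unit);
    }
    return static_cast<std::uint64_t>(std::llround(femtoseconds));
}
```

With femtoseconds as the base, a 1 GHz clock has a period of exactly 1,000,000 ticks, and an unsigned 64-bit counter covers roughly five simulated hours (2^64 fs is about 18,400 s).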


Case Studies

Sample Results & Uses
•Power analysis helps prioritize technology investments
•SST simulation of MD code shows diminishing returns for threading on small data sets
•Detailed component simulation highlights bottlenecks
[Figures: MiniMD Memory Power Breakdown and GUPS Memory Power Breakdown (NoC, DRAM, L2, MC); LSQ Occupancy (entries) and Avg. Memory Latency (nanoseconds) for GUPS, PageRank, MiniMD, HPCCG, with data labels 414.01, 200.32, 2206.8]

Prototyping
•Algorithm/application presented as a skeleton code
•Express communication or memory access pattern
•Easy to change / ignores many details
•Allows exploration of a single system feature
–E.g. collective performance in presence of noise (ignore processor & memory, focus on router & NIC)
[Figure: x-axis ticks 64 to 32768, y-axis ticks 100000 to 140000. Series, all at 1000 ns latency and 10 Mmsgs/s, each with and without noise: Host Tree (Radix-8), Triggered Tree (Radix-16), Recursive Doubling, Triggered Recursive Doubling]

Design Space Exploration

Design Space Exploration Example
•Design Space Exploration
–Inputs
  •Memory technology (DDR2, DDR3, GDDR5)
  •Core width (1, 2, 4, 8 wide issue)
  •Cache size (32/32/1M or 64/64/2M)
–Outputs: Power, Performance, Cost
•Methodology
–Performance models: GeM5/x86, DRAMSim2*
–Energy models: DRAMSim2, McPAT
–Cost models: IC Knowledge
•Key Questions
–What is a good cache size? Core width?
–Which DRAM technology?
•Example of the sorts of questions simulation can answer
[Diagram: cores and MC attached to DIMMs (DDR2, DDR3, GDDR5); execution-based processor model with small or large cores and small or large caches; detailed DRAM model; target apps: Lulesh]

Cache Size
•Larger caches increase processor size, power
•Avg. power increase: 6.75%
•Avg. cost increase: 3.76%
•Avg. performance improvement
–Lulesh: 1.40%
–HPCCG: 6.73%
•Conclusion: Lulesh probably wouldn’t benefit, HPCCG marginal benefit
[Figure: Effect of Larger Caches. Large cache increase for Power, System Cost, Lulesh Performance, HPCCG Performance]

Which Memory System
•Options
–DDR2: Cheap, low power, antiquated
–DDR3: Higher performance, reasonable power
–GDDR5: Expensive, high power, very fast
•Pure performance:
–GDDR 26-47% faster than DDR3 (Lulesh)
–GDDR 32-41% faster than DDR3 (HPCCG)
–GDDR wins?
[Figures: Lulesh Performance and HPCCG Performance vs. processor width (1, 2, 4, 8, average); legend includes GDDR5; an arrow marks the "Better" direction]

Energy & Cost
•GDDR better performance
•DDR3 generally does better on perf/Watt
–Lulesh: -3% to 107%
–HPCCG: 0 to 100%
–GDDR does well at higher processor widths
–DDR2 sometimes slightly better than DDR3 on HPCCG
•perf/$
–DDR3 better for narrow cores, GDDR better for wide
–DDR slightly better overall
[Figures: Lulesh Performance per Watt, HPCCG Performance per Watt, Lulesh Performance per Dollar, HPCCG Performance per Dollar, each vs. processor width; legend includes GDDR5; an arrow marks the "Better" direction]

Processor Width
•Wider processor can issue more instructions/cycle
•Consumes more area, power
–Cost is super-linear wrt area increase
•Power often increases faster than performance
–E.g. 8-wide processor 78% faster on Lulesh, uses 123% more power
•1-2 wide cores most power efficient
•2-4 most cost efficient
[Figures: Lulesh: Effect of Processor Width and HPCCG: Effect of Processor Width; performance and power vs. processor width; an arrow marks the "Better" direction]

Design Space Exploration Results
•Fastest memory technology not always best (DDR beats GDDR) due to power, cost
•No “best” processor: depends on tradeoff between cost, performance, power
•Can provide better understanding of which configurations are best for a given application
•Can be used as basis for application optimization, vendor guidance

Lulesh Pareto Optimal Designs

Width  Memory  Cache  Power  Performance  Cost
1      DDR3    Small  1.00   1.00         1.0
2      DDR3    Small  1.43   1.65         1.3
2      GDDR5   Small  3.00   2.28         2.0
4      GDDR5   Small  3.57   2.92         2.3
8      GDDR5   Small  5.29   3.62         3.4
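A design is Pareto optimal when no other design is at least as good on every objective and strictly better on one. The sketch below applies that test to the five rows of the table, treating lower power and cost and higher performance as better. It is an illustration only, not the authors' tooling: the `Design` struct and function names are hypothetical, and the sixth sample point is invented to show a design being filtered out.

```cpp
#include <string>
#include <vector>

struct Design {
    int width;
    std::string memory;
    double power;        // relative, lower is better
    double performance;  // relative, higher is better
    double cost;         // relative, lower is better
};

// a dominates b if a is no worse on every objective and better on at least one.
bool dominates(const Design& a, const Design& b) {
    const bool no_worse = a.power <= b.power &&
                          a.performance >= b.performance &&
                          a.cost <= b.cost;
    const bool better = a.power < b.power ||
                        a.performance > b.performance ||
                        a.cost < b.cost;
    return no_worse && better;
}

// Keep only the designs that no other design dominates.
std::vector<Design> pareto_front(const std::vector<Design>& all) {
    std::vector<Design> front;
    for (const Design& d : all) {
        bool dominated = false;
        for (const Design& other : all) {
            if (dominates(other, d)) {
                dominated = true;
                break;
            }
        }
        if (!dominated) {
            front.push_back(d);
        }
    }
    return front;
}

// The table's five rows plus one invented, dominated point for contrast.
std::vector<Design> sample_designs() {
    return {
        {1, "DDR3", 1.00, 1.00, 1.0},
        {2, "DDR3", 1.43, 1.65, 1.3},
        {2, "GDDR5", 3.00, 2.28, 2.0},
        {4, "GDDR5", 3.57, 2.92, 2.3},
        {8, "GDDR5", 5.29, 3.62, 3.4},
        {4, "invented", 4.00, 2.00, 2.5},  // made up: worse than row 3 everywhere
    };
}
```

Running `pareto_front(sample_designs())` keeps the five table rows and drops the invented one. The table's rows survive because each step up in performance also costs more power and money, so none dominates another.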

Future Directions

New Components & Increased Integration
•New Components in development
–NVRAM
–New HPC NIC
–Interface to Palacios hypervisor
•Increased Integration & Improvements
–Parallel GeM5
–GeM5/IRIS/DRAMSim integration
–SST/Meso scale
–Improved Power/Area/Cost models
[Diagram: Meso-Scale Simulation. Labels: Runnemede Simulator, Stochastic Simulator, MacSim, Network Models, DRAMSim, Skeleton, Traces, C Interfa (truncated in source), Memory Traces, Stochastic Blocks, Events, Blocks]

Power/Energy Modeling
•Parameterized models turn architectural parameters into energy/event and static dissipation (e.g. leakage energy)
•During simulation, count key events
•Event counts × energy/event = dynamic energy
•Dynamic + static energy = total energy
•Energy / time = power
•ThermalModel(power) = temperature
•Temperature adjusts leakage power
[Diagram: a cache model takes parameters such as associativity and yields J/Read, J/Write, and Static; simulation yields reads, writes, and time; multiplying the two gives energy, which feeds the ThermalModel]
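The event-counting recipe above amounts to a few multiplications. The sketch below works it through for a cache, under stated assumptions: the struct and function names are hypothetical, the per-event energies would come from a technology model such as McPAT rather than being hard-coded, and static dissipation is modeled as a constant power multiplied by simulated time. The thermal feedback step is left out.

```cpp
#include <cassert>

// Output of a parameterized technology model for one cache (assumed shape).
struct CacheEnergyModel {
    double joules_per_read;
    double joules_per_write;
    double static_watts;  // leakage and other static dissipation
};

// Key events counted during simulation.
struct EventCounts {
    unsigned long long reads;
    unsigned long long writes;
    double seconds;  // simulated time covered by the counts
};

// Event counts * energy/event = dynamic energy.
double dynamic_energy(const CacheEnergyModel& m, const EventCounts& c) {
    return static_cast<double>(c.reads) * m.joules_per_read +
           static_cast<double>(c.writes) * m.joules_per_write;
}

// Static energy accumulates with simulated time.
double static_energy(const CacheEnergyModel& m, const EventCounts& c) {
    return m.static_watts * c.seconds;
}

// Dynamic + static energy = total energy.
double total_energy(const CacheEnergyModel& m, const EventCounts& c) {
    return dynamic_energy(m, c) + static_energy(m, c);
}

// Energy / time = power.
double average_power(const CacheEnergyModel& m, const EventCounts& c) {
    assert(c.seconds > 0.0);
    return total_energy(m, c) / c.seconds;
}
```

As a made-up example, 1,000,000 reads at 1 nJ and 500,000 writes at 2 nJ over 1 ms give 2 mJ of dynamic energy; with 0.5 W of static power that adds 0.5 mJ, for an average of 2.5 W.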

Case Study: Reliability vs. Power
Hidden cost of DVFS
•Dynamic voltage/frequency scaling reduces power
•➔Reduces temperature
•➔Causes thermal cycling
•➔Reduces reliability
•Need
–Algorithms to balance temperature, lower power, & maintain performance
–Arch: Sensors and feedback
–Runtime: Scheduler changes
–App: Awareness
[Figure: Breakdown of Failures. Labels: dynamic power management, sleep state, accelerated thermal cycling (Coskun 2011)]


Bonus Slides

Component Validation
•Strategy: component validation in parallel with system-level validation
•Current components validated at different levels, with different methodologies
•Validation in isolation
•What is needed
–Uniform validation methodology (apps)
–System (multi-component) level validation

Component     Method                                            Error
DRAMSim       RTL level validation against Micron               Cycle
Generic Proc  Simplescalar SPEC92 validation                    ~5%
NMSU          Comparison vs. existing processors on SPEC        <7%
RS Network    Latency/BW against SeaStar 1.2, 2.1               <5%
MacSim        Comparison vs. existing GPUs                      Ongoing, <10% expected
Zesto         Comparison vs. several processors, benchmarks     4-5%
McPAT         Comparisons against existing processors           10-23%

Power is the Problem
•2018 Exascale Machine
–1 Exaop/sec
–100s petabyte/sec memory bandwidth
–100s petabyte/sec interconnect bandwidth
–No major architecture changes
•Consider power
–1 pJ * 1 Exa = 1 MW
–1 MW/year = $1 M
–$200-400M / year power bill
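The rule of thumb on this slide is unit arithmetic: energy per operation times operations per second gives watts. The sketch below checks it, with hypothetical function names. It assumes the $0.10/kWh retail figure from the earlier Worldwide Impact slide, under which one megawatt for a year costs about $0.88 million, consistent with the slide's round figure of $1 M.

```cpp
// Power in watts when each operation costs joules_per_op and the machine
// performs ops_per_second of them.
double power_watts(double joules_per_op, double ops_per_second) {
    return joules_per_op * ops_per_second;
}

// Cost in dollars of running one megawatt for one year (8760 hours).
double usd_per_mw_year(double usd_per_kwh) {
    const double kw_per_mw = 1000.0;
    const double hours_per_year = 8760.0;
    return kw_per_mw * hours_per_year * usd_per_kwh;
}

// Yearly electricity bill in dollars for a machine drawing `watts`.
double annual_power_bill(double watts, double usd_per_kwh) {
    return watts / 1e6 * usd_per_mw_year(usd_per_kwh);
}
```

`power_watts(1e-12, 1e18)` gives 1e6 W, the slide's "1 pJ * 1 Exa = 1 MW". A machine drawing 200 MW at $0.10/kWh comes to roughly $175 million per year, the same order as the slide's $200-400M estimate.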

...management of data, combined with better overlap of communication and computation could reduce bandwidth requirements by 50% or more.

Table 2: Effects of Power Reduction Techniques on an Exascale System

              2018 Estimate  Reduction Techniques          Reduced Power  Reduction
Processing    224 MW         Simpler processor, Reduce FP  11.2 MW        95%
Memory        125 MW         Closer Proximity              37.5 MW        70%
Interconnect  24 MW          Message Overlap               12 MW          50%
Total         373 MW                                       60.7 MW        84%
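The totals in Table 2 can be checked by applying each row's reduction to its baseline and summing. The sketch below does that with the table's own numbers; the `Subsystem` struct and function names are made up for the example.

```cpp
#include <vector>

struct Subsystem {
    const char* name;
    double baseline_mw;  // 2018 estimate
    double reduction;    // fraction of power removed by the technique
};

// Rows of Table 2 (totals are computed, not stored).
std::vector<Subsystem> table2() {
    return {
        {"Processing", 224.0, 0.95},
        {"Memory", 125.0, 0.70},
        {"Interconnect", 24.0, 0.50},
    };
}

// Sum of the baseline estimates, in MW.
double baseline_total_mw(const std::vector<Subsystem>& rows) {
    double total = 0.0;
    for (const Subsystem& s : rows) {
        total += s.baseline_mw;
    }
    return total;
}

// Sum of the power remaining after each reduction, in MW.
double reduced_total_mw(const std::vector<Subsystem>& rows) {
    double total = 0.0;
    for (const Subsystem& s : rows) {
        total += s.baseline_mw * (1.0 - s.reduction);
    }
    return total;
}

// Overall fraction of power removed.
double overall_reduction(const std::vector<Subsystem>& rows) {
    return 1.0 - reduced_total_mw(rows) / baseline_total_mw(rows);
}
```

The per-row results are 11.2, 37.5, and 12 MW, which sum to 60.7 MW against a 373 MW baseline, an overall reduction of about 84% as the table states.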

Applying these techniques to the hypothetical system from the introduction, we see a power reduction of 84%. This turns a wholly impractical machine consuming hundreds of megawatts and costing significant fractions of a billion dollars to power to a more attainable machine consuming tens of megawatts. More aggressive application of the techniques presented here, and improvements in technology could reduce this further.

Figure 13 – Lessons from Embedded Systems


From Jensen “Embedded systems and exascale computing.” CiSE 2010

              Energy        Conventional
Processor     62.5 pJ/op    62.5 MW
Memory        31.25 pJ/bit  125 MW
Interconnect  6 pJ/bit      24 MW
Total                       211.5 MW

X to Solution
•Wider processors provide shorter time to solution
•Require much more energy to solution
[Figures: Lulesh: Effect of Processor Width and HPCCG: Effect of Processor Width; Energy to Solution and Time to Solution vs. processor width; an arrow marks the "Better" direction]
