Integrated Management of Power Aware Computing & Communication Technologies

1 Integrated Management of Power Aware Computing & Communication Technologies Review Meeting Nader Bagherzadeh, Pai H. Chou, Fadi Kurdahi, UC Irvine Jean-Luc Gaudiot, USC, Nazeeh Aranki, Benny Toomarian, JPL DARPA Contract F33615-00-1-1719 June 13, 2001 JPL -- Pasadena, CA


Transcript of Integrated Management of Power Aware Computing & Communication Technologies

Page 1: Integrated Management of Power Aware Computing & Communication Technologies

1

Integrated Management of Power Aware Computing & Communication Technologies

Review Meeting

Nader Bagherzadeh, Pai H. Chou, Fadi Kurdahi, UC Irvine

Jean-Luc Gaudiot, USC, Nazeeh Aranki, Benny Toomarian, JPL

DARPA Contract F33615-00-1-1719

June 13, 2001

JPL -- Pasadena, CA

Page 2: Integrated Management of Power Aware Computing & Communication Technologies

2

Agenda

Administrative Review of milestones, schedule

Technical presentation Progress

Applications (UAV/DAATR, Rover, Deep Impact, distributed sensors) Scheduling (system-level pipelining) Advanced microarchitecture power modeling (SMT) Architecture (mode selection with overhead) Integration (Copper, JPL, COTS data sheet)

Lessons learned Challenges, issues Next accomplishments

Questions & action items review.

Page 3: Integrated Management of Power Aware Computing & Communication Technologies

3

Quad Chart

Innovations Component-based power-aware design

Exploit off-the-shelf components & protocols Best price/performance, reliable, cheap to replace

CAD tool for global power policy optimization Optimal partitioning, scheduling, configuration Manage entire system, including mechanical & thermal

Power-aware reconfigurable architectures Reusable platform for many missions Bus segmentation, voltage / frequency scaling

Impact Enhanced mission success

More tasks for the same power Dramatic reduction in mission completion time

Cost saving over a variety of missions Reusable platform & design techniques Fast turnaround time by configuration, not redesign

Confidence in complex design points Provably correct functional/power constraints Retargetable optimization to eliminate overdesign Power protocol for massive scale

[Figure: design flow from Behavior to Architecture. Labels: high-level simulation; functional partitioning & scheduling; composition operators; high-level components; behavioral system model; busses, protocols; system architecture; mapping; system integration & synthesis; static configuration; dynamic power management; parameterizable components.]

Timeline: 2Q 00 (Kickoff), 2Q 01, 2Q 02

Year 1: Static & hybrid optimizations (partitioning / allocation, scheduling, bus segmentation, voltage scaling); COTS component library; FireWire and I2C bus models; Static composition authoring; Architecture definition; High-level simulation; Benchmark identification

Year 2: Dynamic optimizations (task migration, processor shutdown, bus segmentation, frequency scaling); Parameterizable components library; Generalized bus models; Dynamic reconfiguration authoring; Architecture reconfiguration; Low-level simulation; System benchmarking

Page 4: Integrated Management of Power Aware Computing & Communication Technologies

4

Program Overview

Power-aware system-level design Amdahl's law applies to power as well as performance Enhance mission success (time, task) Rapid customization for different missions

Design tool Exploration & evaluation Optimization & specialization Technique integration

System architecture Statically configurable Dynamically adaptive Use COTS parts & protocols

Page 5: Integrated Management of Power Aware Computing & Communication Technologies

5

Personnel & teaming plans

UC Irvine - Design tools Nader Bagherzadeh - PI Pai Chou - Co-PI Fadi Kurdahi Jinfeng Liu Dexin Li Duan Tran

USC - Component power optimization Jean-Luc Gaudiot - faculty participant Seong-Won Lee - student

JPL - Applications & benchmarking Nazeeh Aranki Nikzad “Benny” Toomarian

- students

Page 6: Integrated Management of Power Aware Computing & Communication Technologies

6

Milestones & Schedule

Static & hybrid optimizations: partitioning / allocation, scheduling, bus segmentation, voltage scaling

COTS component library; FireWire and I2C bus models; Static composition authoring; Architecture definition; High-level simulation; Benchmark identification

Dynamic optimizations: task migration, processor shutdown, bus segmentation, frequency scaling

Parameterizable components library; Generalized bus models; Dynamic reconfiguration authoring; Architecture reconfiguration; Low-level simulation; System benchmarking

Page 7: Integrated Management of Power Aware Computing & Communication Technologies

7

we are here!

Review of Progress

May'00 Kickoff meeting (Scottsdale, AZ) Sept'00 Review meeting (UCI)

Scheduling formulation, UI mockup, System level configuration Examples: Pathfinder & X-2000 (manual solution)

Nov'00 PI meeting (Annapolis, MD) Tools: scheduler + UI v.1 (Java) Examples: Pathfinder & X-2000 (automated)

Apr'01 PI meeting (San Diego, CA) Tools: scheduler + UI v.2 - v.3 (Jython) Examples: Pathfinder & initial UAV (Pipelined)

June'01 Review meeting

Page 8: Integrated Management of Power Aware Computing & Communication Technologies

8

New for this Review (June '01)

Tools: Scheduler + UI v.4 (pipelined, buffer matching); Mode selector v.1 (mode change overhead, constraint based); SMT model

Examples: Pathfinder, µAMPS sensors (mode selection); UAV, Wavelet (dataflow) (pipelined, detailed estimate); Deep Impact (command driven) (planning)

Integration

Input from Copper: timing/power estimation (PowerPC simulation model)

Output to Copper: power profile + budget (Copper Compiler)

Within IMPACCT: initial Scheduler + Mode Selector integration

Page 9: Integrated Management of Power Aware Computing & Communication Technologies

9

Overview of Design Flow

Input: Tasks, constraints, component library; Estimation (measurement or simulation via COPPER)

Refinement Loop: Scheduling (pipeline/transform…); Mode Selection (either before or after scheduling); System level simulation (planned integration)

Output to COPPER: Interchange Format (Power Profile, Schedule, Selected modes); Code Generation; Microarchitecture Simulation

Page 10: Integrated Management of Power Aware Computing & Communication Technologies

10

Design Flow

[Figure: design flow spanning IMPACCT and COPPER. Blocks: component library, scheduler, mode selector, high-level simulator, compiler, power simulator, low-level simulator, executable. Data exchanged: task model, timing / power constraints; mode model; task allocation, component selection; power profile, C program; power + timing estimation.]

Page 11: Integrated Management of Power Aware Computing & Communication Technologies

11

Power Aware Scheduling

Execution model Multiple processors, multiple power consumers Multiple domains: digital, thermal, mechanical

Constraint driven Min / Max power Min / Max timing constraints

Handles problems in different domains Time Driven System level pipelining -- in time and in space Parallelism extraction

Experimental results Coarse to fine grained parallelism tradeoffs

Page 12: Integrated Management of Power Aware Computing & Communication Technologies

12

Prototype of GUI scheduling tool

Power-aware Gantt chart

Time view: Timing of all tasks on parallel resources; Power consumption of each task

Power view: System-level power profile; Min/max power constraint, energy cost

Interactive scheduling: Automated schedulers – timing, power, loop; Manual intervention – drag & drop

Demo available

Page 13: Integrated Management of Power Aware Computing & Communication Technologies

13

Power-Aware Scheduling

New constraint-based application model [paper at Codes'01] Min/Max Timing constraints

Precedence, subsumes dataflow, general timing, shared resource Dependency across iteration boundaries – loop pipelining Execution delay of tasks – enables frequency/voltage scaling

Power constraints Max power – total power budget Min power – controls power jitter or force utilization of free source

System-level, multi-scenario scheduling [paper at DAC'01] 25% Faster while saving 31% energy cost Exploits "free" power (solar, nuclear min-output)

System-level loop pipelining [working papers] Borrow time and power across iteration boundaries Aggressive design space exploration by new constraint classification Achieves 49% speedup and 24% energy reduction
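
The max/min power constraints above can be illustrated with a minimal Python sketch. It is not the IMPACCT scheduler: the task tuple format, the unit-length time steps, and the names `power_profile` and `check_power` are assumptions made only for illustration.

```python
def power_profile(tasks, horizon):
    """Sum the power drawn by all tasks active in each unit time step.

    tasks: list of (start, duration, power) tuples (hypothetical format).
    """
    profile = [0.0] * horizon
    for start, duration, power in tasks:
        for t in range(start, min(start + duration, horizon)):
            profile[t] += power
    return profile


def check_power(profile, p_min, p_max):
    """Return the time steps that violate the max or the min power constraint."""
    over = [t for t, p in enumerate(profile) if p > p_max]   # exceeds the budget
    under = [t for t, p in enumerate(profile) if p < p_min]  # leaves free power unused
    return over, under


# Three illustrative tasks: (start, duration, power in watts)
tasks = [(0, 4, 3.0), (2, 3, 5.0), (5, 2, 2.0)]
profile = power_profile(tasks, horizon=7)
over, under = check_power(profile, p_min=2.5, p_max=7.0)
energy = sum(profile)  # watt-steps, since every step has unit length
print(profile, over, under, energy)
```

A scheduler in this style would then move tasks within their slack until both lists are empty; that search is not shown here.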

Page 14: Integrated Management of Power Aware Computing & Communication Technologies

14

Scheduling case study: Mars Pathfinder

System specification 6 wheel motors 4 steering motors System health check Hazard detection

Power supply Battery (non-rechargeable) Solar panel

Power consumption Digital

Computation, imaging, communication, control Mechanical

Driving, steering Thermal

Motors must be heated in low-temperature environment

Page 15: Integrated Management of Power Aware Computing & Communication Technologies

15

Scheduling case study: Mars Pathfinder

Input Time-constrained tasks Min/Max Power constraints Rationale: control jitter, ensure utilization of free power

Core algorithm Static analysis of slack properties Solves time constraints by branch&bound Solves power constraints by local movements within slacks

Target architecture X-2000 like configurable space platform Symmetric multiprocessors, multi-domain power consumers, solar/batt

Results Ability to track power availability Finishes task faster while incurring less energy cost

Page 16: Integrated Management of Power Aware Computing & Communication Technologies

16

More aggressive scheduling: System-level pipelining

Borrow tasks across iterations Alleviates "hot spots" by spreading to another iteration Smooth out utilization by borrowing across iterations

Core techniques Formulation: separate pseudo dependency from true dependency Static analysis and task transformation Augmented scheduler for new dependency

Results -- on Mars Pathfinder example Additional energy savings with speedup Smoother power profile

Page 17: Integrated Management of Power Aware Computing & Communication Technologies

17

Scheduling case study: UAV DAATR

Example of a very different nature! Algorithm, rather than "system" example

Target architecture C code -- unspecified; assume sequential execution, no parallelism MatLab -- unmapped

Algorithm Sequential, given in MatLab or C Potential parallelism in space, not in time

Constraints & dependencies Dataflow: partial ordering Timing: latency; no pairwise Min/Max timing Power: budget for different resolutions

Page 18: Integrated Management of Power Aware Computing & Communication Technologies

18

Scheduling case study: UAV example (cont'd)

Challenge: Parallelism Extraction Essential to enable scheduling Difficult to automate; need manual code rewrite Different pipeline stages must be relatively similar in length

Rewritten code Inserted checkpoints for power estimation Error prone buffer mapping between iterations

Found a dozen bugs in benchmark C code Missing Summation in standard deviation calculation Frame buffer off by one line Dangling pointers not exposed until pipelined

Page 19: Integrated Management of Power Aware Computing & Communication Technologies

19

ATR application: what we are given

[Figure: ATR dataflow graph as given, for 1 frame, m detections, and 3 filters. Blocks: Target Detection, FFT, Filter/IFFT, ComputeDistance; some blocks are annotated "Bugs".]

Page 20: Integrated Management of Power Aware Computing & Communication Technologies

20

Bug report

Misread input data file: OK, no effect on the algorithm

Miscalculated mean, std for image: OK, these values are not used (currently)

Wrong filter data for SUN/PowerPC: OK for us, since we operate on different platforms; bad for SUN/PowerPC users, who get wrong results

Misplaced FFT module: the algorithm is wrong

However, these problems are not captured in the output image files

Page 21: Integrated Management of Power Aware Computing & Communication Technologies

21

What it should look like

[Figure: ATR dataflow graph for 1 frame, m detections, 3 filters, and k distances. Blocks: Target Detection, FFT, Filter/IFFT, ComputeDistance.]

Page 22: Integrated Management of Power Aware Computing & Communication Technologies

22

What it really should look like

[Figure: ATR dataflow graph for 1 frame, m detections, 3 filters, and k distances. Blocks: Target Detection, FFT, Filter/IFFT, ComputeDistance.]

Page 23: Integrated Management of Power Aware Computing & Communication Technologies

23

Problems

Limited parallelism Serial data flow with tight dependency Parallelism available (diff. detections, filters, etc) but limited

Limited ability to extract parallelism Limited by serial execution model (C implementation) No available parallel platforms

Limited scalability Cannot guarantee response time for big images (N² complexity) Cannot apply optimization for small images (each block is too small)

Limited system-level knowledge High-level knowledge lost in a particular implementation

Page 24: Integrated Management of Power Aware Computing & Communication Technologies

24

Our vision: 2-dimensional partitioning

[Figure: 2-dimensional partitioning. A single DFG (vertical flow: Target Detection, FFT, Filter/IFFT, ComputeDistance, with m detections, 3 filters, k distances) is clustered into N DFGs by horizontal duplication and then partitioned by horizontal cuts into: N Frames (N target detection); M Targets (M FFTs); M Targets (3M IFFTs); K Distances (2K IFFTs). Input: N simultaneous frames. Output: target detection w/ distance for N simultaneous frames.]

Page 25: Integrated Management of Power Aware Computing & Communication Technologies

25

System-level blocks

Input: N simultaneous frames

Target Detection: N Frames (N target detection)

FFT: M Targets (M FFTs)

Filter/IFFT: M Targets (3M IFFTs)

Compute Distance: K Distances (2K IFFTs)

Output: target detection w/ distance for N simultaneous frames

Page 26: Integrated Management of Power Aware Computing & Communication Technologies

26

Our vision

[Figure: four copies of the block chain Target Detection, FFT, Filter/IFFT, Compute Distance.]

Page 27: Integrated Management of Power Aware Computing & Communication Technologies

27

System-level pipelining

[Figure: system-level pipelining over the stages Target Detection, FFT, Filter/IFFT, Compute Distance. Input: N simultaneous frames. Output: target detection w/ distance for N simultaneous frames. Frame groups 0 through 5 are shown occupying the stages at successive time steps.]

Page 28: Integrated Management of Power Aware Computing & Communication Technologies

28

What does it buy us?

Parallelism All modules run in PARALLEL Each module processes N (M, K) INDEPENDENT instances that could all be processed in parallel NO DATA DEPENDENCY between modules

Throughput Throughput multiplied by processing units Process N frames at a reduced response time Better utilization of resources

Page 29: Integrated Management of Power Aware Computing & Communication Technologies

29

What does it buy us? (cont'd)

Flexibility Insert / remove modules at any time Adjust N, (M or K) at any time Make each module parallel / serial at any time More knobs to tune: parallelism / response time / throughput / power Driven by run-time constraints

Scalability Reduced response time on big images (small N and/or deeper pipe) Better utilization/throughput on small images

More compiler support Simple control / data flow: each module is just a simple loop, which is essentially parallel Need an automatic partitioning tool to take horizontal cuts

Page 30: Integrated Management of Power Aware Computing & Communication Technologies

30

What does it buy us: how power-aware is it?

Subsystems shut-down Turn on / off any time based on power budget Split / merge (migrate) modules on demand

Power-aware scheduling Each task can be scheduled at any time during one pipe stage, since they are totally independent More scheduling opportunity with an entire system

Dynamic voltage/frequency scaling The amount of computation N, (M or K) is known ahead of time Scaling factor = C / N (very simple!) Less variance of code behavior => strong guarantee to meet deadline, more accurate power estimates

Run-time code versioning Select right code based on N, (M or K)

Page 31: Integrated Management of Power Aware Computing & Communication Technologies

31

Experimental implementation: pipelining transformation

Goal To make everything completely independent

Methodology Dataflow graph extraction (vertical) Initial partitioning (currently manual with some aids from COPPER) Horizontal clustering Horizontal cut (final partitioning)

Techniques Buffer assignment: each module gets its own buffer Buffer renaming: read/write on different buffer Circular buffer: each module gets a window of fixed buffer size Our approach: the combination

Page 32: Integrated Management of Power Aware Computing & Communication Technologies

32

Buffer rotation

[Figure: circular buffer B shared by pipe stages a, b, c, d, shown at Time = 0 through Time = 5 as the stages rotate around the buffer.]

Page 33: Integrated Management of Power Aware Computing & Communication Technologies

33

Background - acyclic dataflow

Single circular buffer: One serial data flow path (stages a, b, c, d); All data flows are of same type, same size

Multiple buffers: Multiple data flow paths (among stages a, b, c, d); Different type, size

Page 34: Integrated Management of Power Aware Computing & Communication Technologies

34

A more complete picture

[Figure: circular buffers A and B with pipe stages a, b, c, d and a head pointer, shown at Time = 0 through Time = 5. Buffer life cycle: 1. Buffer ready (raw data, e.g. ATR images); 2. Buffer live; 3. Life-time spent in pipeline; 4. Buffer dead.]

Page 35: Integrated Management of Power Aware Computing & Communication Technologies

35

How does it work?

Raw data is dumped into the buffer from the data sources A head pointer keeps incrementing Buffer is ready, but not live (active in pipeline) yet Example: ATR image data coming from sensors

Buffer becomes live in pipeline Raw data are consumed and/or forwarded New data are produced/consumed When a buffer is no longer needed by any pipeline stage, it is dead and recycled

Is everything really independent? Yes! At each snapshot, each module is operating on different data
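
The independence claim can be checked with a small Python sketch of buffer rotation. It assumes one frame enters per time step and that frames are stored round-robin in a buffer with at least as many slots as pipe stages; `NUM_SLOTS`, `slot_for`, and `snapshot` are hypothetical names, not part of the IMPACCT tools.

```python
STAGES = ["a", "b", "c", "d"]   # pipe stages from the slides
NUM_SLOTS = 6                   # assumed buffer size; must be >= len(STAGES)


def slot_for(stage_index, time):
    """Slot a stage works on at a given time step, or None while the pipe fills."""
    frame = time - stage_index          # frame that has reached this stage
    if frame < 0:
        return None                     # stage is still idle
    return frame % NUM_SLOTS            # frames are stored round-robin


def snapshot(time):
    """Map each active stage to the buffer slot it touches at this time step."""
    result = {}
    for i, name in enumerate(STAGES):
        slot = slot_for(i, time)
        if slot is not None:
            result[name] = slot
    return result


for t in range(8):
    snap = snapshot(t)
    # Independence: no two stages share a slot in the same snapshot.
    assert len(set(snap.values())) == len(snap)
    print(t, snap)
```

With fewer slots than stages, two stages would collide on the same slot, which is why the sketch requires `NUM_SLOTS >= len(STAGES)`.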

Page 36: Integrated Management of Power Aware Computing & Communication Technologies

36

What are we trading off?

[Figure: trade-off among three axes, illustrated with several arrangements of stages a, b, c, d on buffers A and B. Speed: computation intensity, parallelism, throughput, power. Time: response time, delay. Workload: amount of computation, energy.]

Page 37: Integrated Management of Power Aware Computing & Communication Technologies

37

3-D Design space navigation

[Figure: 3-D design space with axes Speed, Time, and Workload (N frames). Marked points: N = 2, t = T / 2; N = 4, t = T / 4.]

Page 38: Integrated Management of Power Aware Computing & Communication Technologies

38

Design flow

[Figure: design flow. Blocks: C source code and its DFG (stages a, b, c, d); IMPACCT pipeline code versioning; pipelined C source code; COPPER power simulator; 3-D table of Power (P), Time (T), Workload (N); task-level constraints; system-level constraints; IMPACCT scheduler and mode selection; power-aware schedule.]

Page 39: Integrated Management of Power Aware Computing & Communication Technologies

39

Scheduling case study: Wavelet compression (JPL)

Algorithm in C Wavelet decomposition Compression: "knob" to choose lossy factor or lossless

Example category Dataflow, similar to DAATR Finer grained, better structure

IMPACCT improvements Transformation to enable pipelining Exploit lossy factor in trade space

Page 40: Integrated Management of Power Aware Computing & Communication Technologies

40

Wavelet Algorithm

Wavelet Decomposition Quantization Entropy coding

Page 41: Integrated Management of Power Aware Computing & Communication Technologies

41

Wavelet Algorithm structure

Initialization (check params, allocate memory)

For all image blocks: block init., set params, read image block; decomp() (lossless FWT); (remove overlap); Bit_plane_decomp (set decomp param); (1st level entropy coding); (bit_plane encoding); Output result to file

• Sequential execution blocks

• No data dependency between image blocks

Page 42: Integrated Management of Power Aware Computing & Communication Technologies

42

Wavelet: experiments

Experiments being conducted Checkpoints marked up manually Initial power estimation obtained Code being manually rewritten / restructured for pipelining Appears better structured than UAV example

Trade space High performance to low power Pipelining in space and in time, similar to UAV example Lossy compression parameter

Page 43: Integrated Management of Power Aware Computing & Communication Technologies

43

Ongoing scheduling case study: Deep Impact

"Planning" level example Coarse grained, system level

Hardware architecture COTS PowerPC 750 babybed, emulating a Rad-Hard PPC at 4x => Models the X-2000 architecture using DS1 software COTS PowerPC 603e board, emulating I/O devices in real time

Software architecture vxWorks, static priority driven, preemptive JPL's own software architecture -- command based 1/8 second time steps; 1-second control loops

Task set 60 tasks to schedule, 255 priority levels

Page 44: Integrated Management of Power Aware Computing & Communication Technologies

44

NASA Deep Impact project

Platform X-2000 configurable architecture to be using RAD 6000 (Rad-Hard PowerPC 750 @133MHz)

Testbed (JPL Autonomy Lab) PPC 750 single-board computer -- runs flight software (Prototype @233MHz, Real flight @133MHz; COTS board, L1 only, no L2 cache) PowerPC 603e -- emulates the I/O devices, connected via compact PCI

DS1: Deep Space One (legacy flight software) Software architecture: 8 Hz ticks, command based, running on top of vxWorks

Perfmon: performance monitoring utility in DS1 11 test activities 60 tasks

Page 45: Integrated Management of Power Aware Computing & Communication Technologies

45

Deep Impact example (cont'd)

Available form: Real-time Traces Collected using Babybed 90 seconds of trace, time-stamped tasks, L-1 cache

Input needed Algorithm (not available) Timing / power constraints (easy) Functional constraints: Sequence of events, Combinations of illegal modes

Challenges Modeling two layers of software architecture (RTOS + command)

Page 46: Integrated Management of Power Aware Computing & Communication Technologies

46

Design Flow

[Figure: design flow spanning IMPACCT and COPPER. Blocks: component library, scheduler, mode selector, high-level simulator, compiler, power simulator, low-level simulator, executable. Data exchanged: task model, timing / power constraints; mode model; task allocation, component selection; power profile, C program; power + timing estimation.]

Page 47: Integrated Management of Power Aware Computing & Communication Technologies

47

SMT Power Simulator

Simulator Features Compatible with SimpleScalar 3.0b

Execute PISA and EV6 binaries Portability – Run on most kinds of computers

Handling Simultaneous Multithreading Run up to 8 threads simultaneously Similar to UW SMT model

Power Aware Features Same analytic power model as WATTCH

Clock Gating Parameterized Models

42 functional unit classifications (WATTCH has 12) 10 dynamic activity factors (WATTCH has 4)

Page 48: Integrated Management of Power Aware Computing & Communication Technologies

48

Examples of Module Classification

Functional Units include Arithmetic units: ALU, FPU, etc Control units: Instr decoder, etc Memory units: Caches, CAM, etc Buses: Result bus

Cache Access Cache Hit: Read Tag & Data Cache Miss: Read Tag, Update Tag & Data, Read Data

Arithmetic Operation: 4 groups Int ALU: +, -, bit operations Int MULT: , FP ALU: +, - FP MULT: ,

Accessed units per event:

Event                       Accessed units in WATTCH   Accessed units in SMT Power Simulator
Normal Integer Operation    ALU                        Integer Reg, Integer ALU
Integer Mult Operation      ALU                        Integer Reg, Integer MULT
Normal FP Operation         ALU                        FP Reg, FP ALU
FP Mult Operation           ALU                        FP Reg, FP MULT
Cache Hit                   Cache                      Cache Tag, Cache Array
Cache Miss                  Cache                      Cache Tag X 2, Cache Array X 2

Page 49: Integrated Management of Power Aware Computing & Communication Technologies

49

SMT Power Simulator

Project Status Performance Simulator – Done Power Simulator – Implementation is done Power parameter verification ongoing

Verification Methodology Analytic model: Proven models from WATTCH Comparison with COTS processors: Motorola PowerPC 7450, Intel mobile Pentium III, Alpha 21264

Page 50: Integrated Management of Power Aware Computing & Communication Technologies

50

Example of Verification with COTS Processors

Typical/Maximum Power Consumption Typical -> Average power consumption of applications Maximum -> Peak power consumption of applications Benchmark simulations are needed to verify

Modules in operation Deep Sleep: Nothing -> Static power dissipation Sleep: PLL working -> Static + PLL power dissipation Nap: BUS snooping -> Static + PLL + I/O power dissipation Doze: No instruction fetch -> no information

PowerPC 7450 Power Consumption

Freq (MHz)  Vtg (V)  Typ (W)  Max   Max (Vec)  Doze  Nap  Sleep  Deep Sleep
533         1.8      14.3     15.2  17.8       TBD   1.6  0.8    0.41
600         1.8      16.1     17.1  20.0       TBD   1.8  0.9    0.46
667         1.8      17.9     19.0  22.2       TBD   2.1  1.0    0.51

Page 51: Integrated Management of Power Aware Computing & Communication Technologies

51

Example of Simulation Result

Processor Configuration 4 issue superscalar

Target programs: 4 simple test programs Maximum power consumption

87.37W at 4 IPC (instructions per cycle): Maximum throughput

Clock gating CC1: Max power for running units and zero for idle units CC2: Input dependent power for running units and zero for idle units CC3: Input dependent power for running units and static power for idle units

ATR

Test4

Test3

Test2

Test1

Program

16.8910.3715.2487.370.9449343

24.5119.2127.8787.371.445600494229

0.78

0.61

0.36

Instr per Cycle

12.17

9.95

6.90

CC1

8.31

6.83

4.48

CC2

15.0987.3719432

13.8587.3710560

11.7687.374859

CC3MAX# of Instr
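
The three clock-gating styles CC1, CC2, and CC3 described on this slide can be sketched in Python as follows. This is an illustrative model only, not the simulator's code: the unit list, the per-unit `activity` factor, and the `static_fraction` default of 0.1 are assumptions.

```python
def unit_power(max_power, active, activity, style, static_fraction=0.1):
    """Power of one unit in one cycle under clock-gating style CC1, CC2 or CC3.

    activity: input-dependent fraction of max power (0..1) when the unit runs.
    static_fraction: assumed share of max power drawn by an idle unit under CC3.
    """
    if style == "CC1":   # max power for running units, zero for idle units
        return max_power if active else 0.0
    if style == "CC2":   # input-dependent power for running units, zero for idle
        return max_power * activity if active else 0.0
    if style == "CC3":   # input-dependent power for running, static power for idle
        return max_power * activity if active else max_power * static_fraction
    raise ValueError(f"unknown clock-gating style: {style}")


def total_power(units, style):
    """Sum the per-unit power over all units for one cycle."""
    return sum(unit_power(u["max"], u["active"], u["activity"], style) for u in units)


# Hypothetical units and numbers, chosen only to exercise the three styles.
units = [
    {"name": "int_alu", "max": 4.0, "active": True, "activity": 0.5},
    {"name": "fp_mult", "max": 6.0, "active": False, "activity": 0.0},
    {"name": "cache", "max": 10.0, "active": True, "activity": 0.8},
]
for style in ("CC1", "CC2", "CC3"):
    print(style, total_power(units, style))
```

In this model CC3 differs from CC2 only by the static power of idle units, so the gap between the two grows with the number of idle units.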

Page 52: Integrated Management of Power Aware Computing & Communication Technologies

52

SMT Simulation Methodology

Input: C Program; Executable Binaries (PISA, EV6); Processor Parameters; Architectural Parameters

Output: Static Power Consumption (program independent); Dynamic Power Consumption (program dependent); Power Profile – Moving Avg.

[Figure: simulation flow with blocks Processor parameters, Target C Program, Power Parameters, Host Compiler, Cross Compiler, Power Simulator, Static Power, Dynamic Power, Dynamic Profile.]
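
The "Power Profile – Moving Avg." output can be illustrated with a short Python sketch. The window length and the sample values are made up, and `moving_average` is a hypothetical helper rather than a function of the simulator.

```python
from collections import deque


def moving_average(samples, window):
    """Return the running mean over the last `window` samples."""
    recent = deque(maxlen=window)
    total = 0.0
    out = []
    for s in samples:
        if len(recent) == window:
            total -= recent[0]      # value about to be pushed out of the window
        recent.append(s)
        total += s
        out.append(total / len(recent))
    return out


# Illustrative per-interval power samples in watts.
print(moving_average([2.0, 4.0, 6.0, 8.0], window=2))
```

A longer window gives a smoother profile at the cost of hiding short power peaks.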

Page 53: Integrated Management of Power Aware Computing & Communication Technologies

53

SMT Power Simulator: Tool Usage

Host Portability: Any host computer that can run SimpleScalar

Execution command: sim-smt [options] target.list

List file content: executable [program arguments]

Processor parameters: -config configuration.file

Simulation results redirection: -redir:sim simulator.result -redir:prog target.program.result

Page 54: Integrated Management of Power Aware Computing & Communication Technologies

54

Mode Selection

Determine which component runs in which mode, and when

Mode selection is non-trivial Scheduler will be overwhelmed if it must determine component modes at the same time! Exploration space of all mode combinations is tremendous Greedy solution may fail mission timing constraints or power constraints

Mode selection is worthwhile Exploration spaces exist to improve power reduction and power-awareness Energy saving (5-15%) Cost saving (10-40%) Eases task planning and gives a more realistic picture

Page 55: Integrated Management of Power Aware Computing & Communication Technologies

55

Methodology and Design Flow

The whole picture - the integration of: Power-aware scheduler Mode selector Power estimation/profiling tools

[Figure: static view of the flow among Scheduler, Mode Selector, and Power Estimator. Data exchanged: initial schedule, modified schedule, power/timing numbers, power profile, power/timing budget.]

Page 56: Integrated Management of Power Aware Computing & Communication Technologies

56

System Modeling

Component power model Power modes with overhead

System timing model Constraint graph

Mode dependency modeling Mode dependency graph

External parameters Environment temperature Surrounding terrain

Page 57: Integrated Management of Power Aware Computing & Communication Technologies

57

Component Power Model

Power mode Each mode is defined by power and timing attributes: Constant, Profile, external (environmental) parameters May be hierarchical, e.g. PowerPC 7450:

active: { cache on: { cache settings }, cache off, voltage scaling, clock scaling },
doze: { clock scaling },
nap: { },
deep sleep: { }

Overhead on mode changes Power overhead, timing overhead e.g. preheating a motor, voltage scaling, PLL

Environmental parameters e.g. temperature, terrain (roughness of ground for a motor) Affect power and timing overhead

Page 58: Integrated Management of Power Aware Computing & Communication Technologies

58

Component Model Examples

Driving motor Power is a function of temperature T Mode change time is also a function of temperature T

Modes: off (0W), on (Power: –0.1225*T + 1.0)

Mode changes: Power: 2.2W, Time: (–1.875*T+10)*(T<0) + 10*(T≥0); Power: 0.5W, Time: 3

Microprocessor (PowerPC 603e)

Modes: Full power, DPM, Doze, Nap, Sleep

Mode power values: 4.0W, 3.2W, 1.0W, 70mW, 40mW

Mode change overheads shown: 10 cycles; 100us + 255 bus clocks + 10 cycles; t1 + 3 cycles; 3 cycles
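
The driving motor's temperature dependence can be written out as a small Python sketch. The two formulas are taken from the slide; reading the 2.2W entry as the off-to-on (preheating) change and the 0.5W entry as the on-to-off change is an assumption, as are the function names and the unspecified time unit.

```python
def motor_on_power(temp):
    """Power in the 'on' mode as a function of temperature T: -0.1225*T + 1.0."""
    return -0.1225 * temp + 1.0


def motor_warmup_time(temp):
    """Mode-change time: (-1.875*T + 10) when T < 0, otherwise 10."""
    return -1.875 * temp + 10 if temp < 0 else 10.0


WARMUP_POWER = 2.2    # W, assumed to apply during the off -> on change
SHUTDOWN_POWER = 0.5  # W, assumed to apply during the on -> off change
SHUTDOWN_TIME = 3     # time units, constant on the slide


def warmup_energy(temp):
    """Energy overhead of the assumed off -> on change at temperature T."""
    return WARMUP_POWER * motor_warmup_time(temp)


for temp in (-40, 0, 20):
    print(temp, motor_on_power(temp), motor_warmup_time(temp), warmup_energy(temp))
```

The colder the environment, the longer the mode change takes, so a mode selector has to weigh this overhead against the energy saved by switching the motor off.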

Page 59: Integrated Management of Power Aware Computing & Communication Technologies

59

FireWire Bus Power Model

Cable Power Pc = µ·L ·Cf (µ: constant, L: cable length, Cf: data transfer rate)

Driver Power (Pd) Fast lookup table Protocol simulator (in progress)

Event-driven system-level simulator Generated event traces for high level power estimation

Bus Power Pbus = Pc + Pd

TSB41AB3 IEEE 1394a-2000 three-port cable transceiver/arbiter:

Mode             100MHz   200MHz   400MHz
Full-on          320 mW   350 mW   380 mW
Idle             250 mW   250 mW   250 mW
Ultra low power  0.5 mW   0.5 mW   0.5 mW
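
A minimal Python sketch of the bus power model Pbus = Pc + Pd follows. Using the TSB41AB3 table as the driver-power lookup is an assumption, and the value of µ, the cable length, and the units of Cf in the example are hypothetical.

```python
# Table values from the slide, in mW, keyed by mode and by the speed column label.
# Treating them as the driver-power (Pd) lookup table is an assumption.
DRIVER_POWER_MW = {
    "full-on": {100: 320.0, 200: 350.0, 400: 380.0},
    "idle": {100: 250.0, 200: 250.0, 400: 250.0},
    "ultra-low-power": {100: 0.5, 200: 0.5, 400: 0.5},
}


def cable_power_mw(mu, length, rate):
    """Pc = mu * L * Cf (mu: constant, L: cable length, Cf: data transfer rate)."""
    return mu * length * rate


def bus_power_mw(mode, speed, mu, length, rate):
    """Pbus = Pc + Pd, with Pd read from the lookup table."""
    return cable_power_mw(mu, length, rate) + DRIVER_POWER_MW[mode][speed]


# Hypothetical cable: mu = 0.01, L = 4.5, Cf = 400 (units left abstract).
print(bus_power_mw("full-on", 400, mu=0.01, length=4.5, rate=400))
```

A lookup table keeps the estimate fast; the protocol simulator mentioned on the slide would refine Pd from event traces instead.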

Page 60: Integrated Management of Power Aware Computing & Communication Technologies

60

Design Flow

[Figure: design flow spanning IMPACCT and COPPER. Blocks: component library, scheduler, mode selector, high-level simulator, compiler, power simulator, low-level simulator, executable. Data exchanged: task model, timing / power constraints; mode model; task allocation, component selection; power profile, C program; power + timing estimation.]

Page 61: Integrated Management of Power Aware Computing & Communication Technologies

61

Timing: Constraint graph

Min/max timing constraints between pairs of events

Vertices Represent events A task has a Start and an End event, e.g. A.s = start event of task A, B.e = end event of task B

Directed edges Weights on edges Nonnegative weight: min constraint Negative weight: -max constraint

Edge between A.s and B.e with weight 10: End event of B should be no earlier than 10 time units after the start event of A

Edge between A.s and B.s with weight -10: Start event of B should be no later than 10 time units after the start event of A
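
Min/max constraints of this kind form a system of difference constraints, which can be checked with a longest-path relaxation. The Python sketch below uses a Bellman-Ford style loop as a standard technique; it is not claimed to be the IMPACCT algorithm, and the function name, the tuple format, and the example values other than the two constraints from the slide are assumptions.

```python
def earliest_times(events, min_sep, max_sep, source):
    """Earliest event times satisfying all constraints, or None if infeasible.

    min_sep: (u, v, d) means t[v] - t[u] >= d  (edge u -> v, weight d).
    max_sep: (u, v, d) means t[v] - t[u] <= d  (edge v -> u, weight -d).
    """
    edges = [(u, v, d) for u, v, d in min_sep]
    edges += [(v, u, -d) for u, v, d in max_sep]
    t = {e: float("-inf") for e in events}
    t[source] = 0
    for _ in range(len(events)):
        changed = False
        for u, v, w in edges:
            if t[u] + w > t[v]:        # relax toward the longest path
                t[v] = t[u] + w
                changed = True
        if not changed:
            return t
    return None                        # positive cycle: constraints conflict


events = ["A.s", "A.e", "B.s", "B.e"]
min_sep = [
    ("A.s", "A.e", 4),    # assumed execution delay of A
    ("A.s", "B.s", 0),    # assumed: B starts no earlier than A
    ("B.s", "B.e", 3),    # assumed execution delay of B
    ("A.s", "B.e", 10),   # from the slide: B ends >= 10 after A starts
]
max_sep = [("A.s", "B.s", 10)]  # from the slide: B starts <= 10 after A starts
print(earliest_times(events, min_sep, max_sep, "A.s"))
```

If a min constraint demanded more separation than a max constraint on the same pair allows, the relaxation would never settle and the function would return `None`.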

Page 62: Integrated Management of Power Aware Computing & Communication Technologies

62

System Timing Modeling Example

[Figure: constraint graph for the micro rover. Event vertices: Haz.e, drv.s, drv.e, cam.s, ppc1.s, ppc1.e, ppc2.s, sci.s, rf.e, str.s. Edge weights shown: 5, 1, -5, -10, -20, -30.]

Haz: hazard detector
Str: steering motor
Drv: driving motor
Cam: camera
Ppc: processor
Sci: scientific device
Rf: radio frequency modem

Micro Rover example
  Multiple resources
  Timing constraints between tasks


Mode Dependency Modeling

Functional modes
  examples: ATR -- short range, middle range
  behavior choice as dictated by functional requirements (i.e., not controllable by power management)

Component modes
  examples: processor full-on, sleep, doze, voltage/clock scaling
  operational setting of component (i.e., open to mode selection for meeting power/timing constraints)

Dependencies
  Among functional modes (of different activities)
  Among component modes
  Between functional and component modes
    e.g., ATR in short-range mode, Processor running at a high clock rate


Mode dependency graph

Directed acyclic graph

Mode vertices: modes of component

Edges
  mode dependency: "only if"
  mode A chosen implies B may be chosen
  mode B NOT chosen => NOT mode A

Operator vertices { AND, OR, MUTEX }
  (C op D) implies E may be chosen
  not E => (C op D) must be false
  op imposes constraint on combination of C, D

[Figure: mode vertices A and B joined by an edge; mode vertices C and D joined through an operator vertex op to E; legend: mode, op]


Mode dependency example: Rover

haz: hazard detector
str: steering motor
drv: driving motor

[Figure: mode dependency graph with mode vertices haz.on, str.on, drv.on and operator vertices OR, MUTEX]

Components
  hazard detector, driving motor, steering motor

Constraints on modes:
  hazard detector and the motors should not be working at the same time

Mode combinations

      Hazard detector   Driving motor   Steering motor
  M0  Off               Off             Off
  M1  Off               On              Off
  M2  Off               Off             On
  M3  Off               On              On
  M4  On                Off             Off
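The five combinations in the table can be reproduced by brute force: enumerate every on/off assignment and keep those that satisfy the stated constraint. The sketch below reads the constraint as haz.on MUTEX (drv.on OR str.on), which is an interpretation of the figure labels; the function name is hypothetical.

```python
from itertools import product


def legal_rover_modes():
    """Enumerate (haz, drv, str) on/off combinations allowed by the constraint.

    ASSUMED reading of the graph: haz.on MUTEX (drv.on OR str.on).
    """
    combos = []
    for haz, drv, steer in product((False, True), repeat=3):
        motors_on = drv or steer      # OR vertex
        if haz and motors_on:         # MUTEX vertex: not both at once
            continue
        combos.append((haz, drv, steer))
    return combos


if __name__ == "__main__":
    for haz, drv, steer in legal_rover_modes():
        print(
            "haz=" + ("On" if haz else "Off"),
            "drv=" + ("On" if drv else "Off"),
            "str=" + ("On" if steer else "Off"),
        )
```

Exhaustive enumeration is fine for three components; with more components the state space grows exponentially, which is why the slides turn to the dependency graph for pruning.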


Mode Modeling Example: µAMPS sensors

Components: processor, memory, RF, sensor

Constraints on modes:
  Processor is active when both radio and sensor are active
  Memory is active only when processor is active

[Figure: microsensor architecture]

[Figure: mode dependency graphs with mode vertices S.on, R.on, R.rx, R.rx_tx, A.active, A.idle, A.sleep, M.on and operator vertices AND, XOR, MUTEX]

A: ARM
M: memory
R: radio
S: sensor


Mode Modeling of µAMPS sensors (cont’d)

Mode combinations considered:
  by MIT group: 5 combinations
    manual grouping, ad hoc
  Our method
    3 more combinations
    systematically generated from dependency graph

Add constraint:
  When sensor is off, all other components should be off (proactive)
  Automatically obtain same results as MIT group

  Mode  S    R      A       M
  M0    On   Tx,rx  Active  On
  M1    On   Rx     Idle    Off
  M2    On   Rx     Sleep   Off
  M3    On   Off    Sleep   Off
  M4    Off  Tx,rx  Active  On
  M5    Off  Rx     Idle    Off
  M6    Off  Rx     Sleep   Off
  M7    Off  Off    Sleep   Off

Table annotation: Not given by MIT group

[Figure: graph fragment with vertices R.on, S.on]


Mode Combination Enumeration - Using Dependency Graph

Component level mode dep. graph
  Group modes by component
  Show mode dependency between components

Enumerating reachable modes
  Topological sorting
  Graph helps prune out infeasible mode combinations

Break cycle in comp. graph
  Removing an edge in cycle
  Keep track of the last dependent successor component

[Figure: component-level dependency graph over Radio, Sensor, ARM, Memory, and the resulting enumeration tree of mode combinations, with entries such as "off off sleep off", "on off sleep off", "idle off", "on idle off", "active on"]
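A minimal sketch of enumeration in topological order follows. The component chain (sensor, then radio, then ARM, then memory) and the per-component rules in `allowed` are assumptions, chosen so that the output matches the five µAMPS combinations that remain once the "sensor off implies everything off" constraint is added; they are not the authors' actual dependency graph.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# ASSUMED component-level dependency graph: node -> its predecessors.
DEPENDS_ON = {"S": set(), "R": {"S"}, "A": {"R"}, "M": {"A"}}


def allowed(component, chosen):
    """Modes a component may take given its predecessors' chosen modes.

    ASSUMED rules, picked to reproduce the table of mode combinations.
    """
    if component == "S":
        return ["on", "off"]
    if component == "R":
        return ["tx_rx", "rx", "off"] if chosen["S"] == "on" else ["off"]
    if component == "A":
        by_radio = {"tx_rx": ["active"], "rx": ["idle", "sleep"], "off": ["sleep"]}
        return by_radio[chosen["R"]]
    if component == "M":
        return ["on"] if chosen["A"] == "active" else ["off"]
    raise KeyError(component)


def enumerate_modes():
    """Depth-first enumeration; infeasible branches are never expanded."""
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    results = []

    def extend(index, chosen):
        if index == len(order):
            results.append(tuple(chosen[c] for c in ("S", "R", "A", "M")))
            return
        component = order[index]
        for mode in allowed(component, chosen):
            chosen[component] = mode
            extend(index + 1, chosen)
            del chosen[component]

    extend(0, {})
    return results
```

Because each component is visited only after the components it depends on, a mode that is already ruled out cuts off its whole subtree instead of being filtered after the fact.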


External Parameters & Constraints

Parameters in system model
  Temperature, terrain
  Used to characterize components and their overhead

System Constraints
  Maximum Power constraint
    Constant or power profile (function of time)
  Minimum Power constraint
    Constant or power profile (function of time)
  Total energy constraint (under working)
  Mission time (mission deadline)

Power consumption of Driving motor at different temperatures

  Temperature (°C)   Power (W)
  0                  1.0
  -40                5.9
  -80                9.6
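A temperature-dependent component can be characterized directly from the table. The sketch below interpolates linearly between the three listed points and clamps outside that range; both choices, and the function name, are assumptions for illustration rather than the model used in the tool.

```python
# (temperature in deg C, power in W), from the table, sorted by temperature.
DRIVE_MOTOR_TABLE = [(-80.0, 9.6), (-40.0, 5.9), (0.0, 1.0)]


def drive_power_w(temp_c, table=DRIVE_MOTOR_TABLE):
    """Driving motor power at temp_c.

    ASSUMPTIONS: linear interpolation between table points,
    clamping to the end values outside the tabulated range.
    """
    if temp_c <= table[0][0]:
        return table[0][1]
    if temp_c >= table[-1][0]:
        return table[-1][1]
    for (t0, p0), (t1, p1) in zip(table, table[1:]):
        if t0 <= temp_c <= t1:
            return p0 + (p1 - p0) * (temp_c - t0) / (t1 - t0)
    raise ValueError("temperature not covered by table")


if __name__ == "__main__":
    for temp in (-80, -60, -40, -20, 0):
        print(temp, round(drive_power_w(temp), 2))
```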


System Power Representation

Schedule Gantt Chart
  Time view
  Power view

Mode selection Gantt chart
  Tasks marked with mode settings
  Added non-operating tasks
    Idle intervals
    mode change overheads
  Power profile view
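The power view can be derived from the time view by summing the power of all tasks active in each interval. The sketch below assumes each task draws constant power between its start and end; the task tuples and function names are hypothetical.

```python
def power_profile(tasks):
    """Piecewise-constant system power from a schedule.

    tasks: list of (start, end, power). ASSUMES constant power per task.
    Returns a list of (t0, t1, total_power) intervals.
    """
    times = sorted({t for start, end, _ in tasks for t in (start, end)})
    profile = []
    for t0, t1 in zip(times, times[1:]):
        total = sum(p for start, end, p in tasks if start <= t0 and end >= t1)
        profile.append((t0, t1, total))
    return profile


def violations(profile, p_min, p_max):
    """Intervals where total power falls outside [p_min, p_max]."""
    return [(t0, t1, p) for t0, t1, p in profile if p < p_min or p > p_max]


# Hypothetical schedule: (start, end, power in W).
TASKS = [(0, 4, 2.0), (2, 6, 3.0), (6, 8, 1.0)]
PROFILE = power_profile(TASKS)
BAD = violations(PROFILE, p_min=1.5, p_max=4.5)
```

Idle intervals and mode change overheads can be fed in as extra "non-operating" tasks, so the same routine yields the power profile after mode selection.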


Design Flow

[Figure: design flow diagram. Blocks: component library, scheduler, high-level simulator, mode selector, power simulator, Compiler, low-level simulator. Data labels: task model, timing/power constraints; power profile, C program; mode model; power + timing estimation; task allocation, component selection; executable. Tool regions: IMPACCT, COPPER.]


Mode selection: Problem statement

Input
  initial schedule (timing, power)
  component model, system model
  initial selection of modes

Objective
  Model mode change overhead (timing, power)
  Capture sequence of mode changes
  Minimize energy cost by considering overhead tradeoffs

Output
  Schedule for power & timing, with overhead
  Augmented schedule with selected mode

[Figure: Mode selector block diagram; labels: schedule, Component library, System constraints, Mode selection]
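The overhead tradeoff can be illustrated for a single idle interval: a deeper low-power mode only pays off if the interval is long enough to absorb the time and energy spent entering and leaving it. The sketch below is a simplified, hypothetical formulation with made-up numbers; it ignores power constraints and restrictions on mode change sequences, which the actual mode selector also handles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    power_w: float          # power while resident in the mode
    switch_time_s: float    # ASSUMED total time to enter and leave the mode
    switch_energy_j: float  # ASSUMED total energy to enter and leave the mode


def best_idle_mode(modes, idle_s):
    """Pick the mode with the lowest energy over one idle interval.

    Modes whose switching time exceeds the interval are infeasible.
    Returns (mode, energy_j), or None if no mode fits.
    """
    best = None
    for mode in modes:
        if mode.switch_time_s > idle_s:
            continue
        energy = mode.switch_energy_j + mode.power_w * (idle_s - mode.switch_time_s)
        if best is None or energy < best[1]:
            best = (mode, energy)
    return best


# Illustrative numbers only, not datasheet values.
MODES = [
    Mode("on", 4.0, 0.0, 0.0),
    Mode("doze", 1.0, 0.5, 1.0),
    Mode("sleep", 0.04, 3.0, 6.0),
]
```

With these numbers a 0.2 s gap stays in "on", a 2 s gap favours "doze", and a 10 s gap favours "sleep", which shows why an on/off-only policy can leave energy on the table.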


Application Example: Rover

Behaviors and tasks
  Moving around on Mars surface
    Hazard detection, driving and steering
  Communicating with the Lander
  Taking pictures (IMP)
  Performing scientific experiments (APXS, ASI/MET)

Components in the entire system
  Hazard detector (HAZ)
  Driving motor (DRV)
  Steer motor (STR)
  Radio frequency modem (RF)
  Camera (CAM)
  Microprocessor (PowerPC)
  Microcontroller (ARM)

[Figure: A schedule of the electronic subsystem of micro rover]


Mode selection Results: Energy savings

Traditional approach
  Only two modes: { On, Off }
  Timing constraints ONLY
  Power constraints may be violated
  Considers mode change overhead

Our Approach: with Mode Selection
  All legal mode combinations
  Both timing and power constraints
  Detailed mode change overhead

Results
  Energy saving: 3.7% to 11.9%
  average saving: 8.7%

[Chart: Energy (J), axis 1300 to 1750, vs. Pmin (W), 5 to 10; series: Mode Selection, On&Off]


Results for mode selection: Cost savings

Cost vs. Energy saving:
  Cost defined as energy above minimum constraints

Savings
  From 6.9% to 49.3%
  average 26.5%

[Chart: Energy (J), axis 0 to 1000, vs. Pmin (W), 5 to 10; series: Mode Selection, On&Off]
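The distinction between total energy and cost can be made concrete. The sketch below reads "energy above minimum constraints" as the integral of max(0, P(t) - Pmin) over a piecewise-constant power profile; that reading, the function names, and the example profile are assumptions.

```python
def total_energy_j(profile):
    """Total energy of a piecewise-constant profile: list of (duration_s, power_w)."""
    return sum(duration * power for duration, power in profile)


def cost_j(profile, p_min):
    """Energy drawn above the minimum power level p_min.

    ASSUMED reading of the slide: only power above p_min is charged.
    """
    return sum(duration * max(0.0, power - p_min) for duration, power in profile)


# Hypothetical profile: (duration in s, power in W).
PROFILE = [(10.0, 4.0), (5.0, 9.0), (10.0, 6.0)]

if __name__ == "__main__":
    print(total_energy_j(PROFILE), cost_j(PROFILE, p_min=5.0))
```

Under this metric, lowering a peak that sits above Pmin reduces cost by the full amount saved, while savings below Pmin do not count, which is consistent with the cost savings percentages being larger than the raw energy savings.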


Exploring Different Working Scenarios

Three tasks
  Moving around (MOV)
  Taking picture (CAM)
  Scientific experiment (SCI)

Three scenarios
  A: MOV, CAM, SCI
  B: CAM, MOV, SCI
  C: CAM, SCI, MOV

Temperature profile is given as:

[Chart: Temperature, axis -90 to 0, over time steps 1 to 6]


Result III

Scenarios consume different amounts of energy
  Scenario C consumes 12% more energy than scenario A (by mode selection)

Mode selection always does better compared to (on, off) only
  up to 11.7% energy saving

[Chart: energy for scenarios A, B, C, axis 0 to 25000; series: Mode Selection, On&off]


Mode selection: Issues

Challenges:
  Explosion of state space -- grows exponentially
  Modeling restrictions in mode change sequence

Solution / novelty
  Formalism for mode dependency at component level & system level
  Systematically prune search space

Experimental results
  Energy and time saved
  More accurate modeling of overhead


Accomplishments to date

Power-aware scheduling
  Multi-processor/domain, Min / Max power and timing constraints
  3 classes of system level pipelining techniques

Mode selection
  Component and system model
  Captures power & timing overhead on mode change

Incorporating power models and simulators
  SMT simulator for advanced microarchitectural exploration
  FireWire, DRAM, cache, PowerPC

Tool prototype & Integration
  GUI for power-aware Gantt chart scheduling & mode selection
  Power aware visualization tool for benchmarks
  Interface to COPPER project


Lessons learned

Challenges
  Not all applications fit a given model
    Alternative design flows may be required for different applications
  Manually extract parallelism & dependency in benchmarks
  Capture mode dependency in components & applications
  Integration of good power models for PowerPC

Right level of abstraction
  Many low-level power models available; not always usable
  Need system-level power estimations
    Details of the architecture model
    Memory / bus power models
    Overhead for voltage/frequency scaling


Fulfilled Milestones

Power-aware scheduling [3 papers]
  Multi-scenario
  System-level pipelining

Mode selection
  encompasses power management (voltage/freq scaling)

UI prototype
  scheduling, mode selection, benchmark visualization

Initial tool integration
  interface to COPPER

Processor power & simulation models
  SMT simulator


Upcoming Milestones

Dynamic optimization
  Scheduling and planning -- using the Deep Impact example
  Pipeline depth/width tuning at run-time

Additional static optimization
  component selection/assignment
  bus topology optimization

Simulation
  Bus simulation models
  SMT -- Thermal dissipation profiling, Dynamic power/thermal management

Tool integration
  Simulation models from other groups
  IMPACCT tools and library
  tighter integration between IMPACCT and COPPER


Ideas: dynamic optimization

More dynamic scenarios
  Power suddenly cut off, with small power reserve before shutdown
  Mission replanning, changing objectives

Solutions required
  Division between static preparation & dynamic handling
  Ability to decide most important actions to take under extreme time constraint
  Need feedback/notification mechanism in execution model
  Decentralized power management

Need new benchmark examples


Future planned evaluation

Deep Impact from JPL
  Mission planning and scheduling example
  Image compression (wavelet) algorithm
  Architectural mapping

JPL Testbed
  PPC750 board to measure actual power
  PPC750 to simulate instrumentation in real-time
  advanced board with real instrumentation

Validation through simulation
  Scheduler output fed to COPPER for compilation
  Simulation via COPPER and our own SMT
  Compare estimated power with refined version


Applications

Space
  Mars Rover (scheduling, mode selection)
  Deep Impact (planning)

UAV
  DAATR (pipelined scheduling) (mode selection under investigation)

Distributed sensors
  MIT µAMPS sensor (mode selection)

Need apps requiring dynamic planning/reconfig!


Development plans

Scripting and web-based tool
  Jython (Java + Python), TkInter for GUI prototype
  Core scheduler
    Modular, detachable from GUI
    Option to run on separate server or same process as UI
  CGI scripts for arch. configuration (unix/web based)
  Latest version distributed thru WebCVS

Interface with commercial CAD backend
  Detailed power estimation tools
  Functional simulation with proprietary models

Rationale
  Open source, runs on any platform
  All publicly available development tools
  Trivial to install, no compilation, encourage modification


Technology Transition -- Consystant Design Technologies

Version 1 released Apr. 11
  shown at ESC
  runs on Linux
  will support Solaris, Win2k

Extensible
  system platform plugin for synthesis
  targets Linux, vxWorks, …

Simulator
  selective focus
  coordination centric

Active collaboration confirmed
  Installation in week of June 25
  Designated application engineer


http://www.ece.uci.edu/impacct/


Metrics

Source-aware energy model
  Takes “free energy” into account
  Cost for not using free energy

Profile-aware
  Total energy dependent on consumers’ power profile
  Smoothness of power draw

Scenario-aware
  Cost function tracks external factors (e.g. temperature, solar level)
  Stage in mission

Timing/performance
  Makespan (length of an iteration)
  Dynamic planning cost


Architectural Configuration

Mode selection
  Power consumption level (doze, nap, sleep, etc.)
  Low power design techniques
    Clock scaling, voltage scaling
    Memory/cache configurations, bus encoding
    Communication protocols, compression, algorithm transformations
  Optimize feasible solutions for energy/timing costs
    Power, Real time, Inter-resource modes constraints
    Constraints between functionality modes and resources modes

Functionality mode and resource modes

Bus topology optimization
  Static clustering and bus partitioning
  Dynamic reclustering with shutdown


Application - Mars Rover

Mission-critical embedded system
  Hard real-time system
  Composed of COTS components
    Electronic: µprocessor, µcontroller, memory, camera, scientific devices, ...
    Mechanics/thermal: driving motor, steering motor, heaters, …
    Power sources: solar panel, battery

Power/energy and performance constraints
  Stringent max power constraint
  Flexible min power constraint
  Limited non-rechargeable energy sources
  Global timing requirement
    Limited working window during sol daytime
    Timing constraint among tasks

Harsh and uncertain working environment
  Extremely low temperature - affects component behaviors
  Uncertain environment: winds/obstacles/rugged terrain


Example Platform - X2000

COTS components Modeling
  Processors (PowerPC 603e, 750)
  Memory organization (cache, memory)
  System interconnects (FireWire bus driver/controller)
  Scientific equipment
  Sensors/actuators
  Mechanics/Thermals (driving/steering motors/heaters)

System-level architecture modeling
  Tree topology for FireWire bus architecture
  Component clustering for bus segmentation


Testing Methodologies

A
  "Activity" for given duration (5 s, 10 s, 15 s)
  repeated 6 times
  record both I-cache & D-cache misses (recorded in separate runs)

B
  Recording 90 seconds worth of an Activity till its completion
  1 minute gap between runs
  also I-cache & D-cache misses

C -- what is measurement C?


User Input

Attributes
  tasks, resources, timing constraints, power budgets

Unique features
  power as constraint
  scheduling, system-level mission planning, power-aware loop pipelining, timing constraint classification.
  subsumes deadline, dataflow

Language
  mix of graphical and custom constraint language


Methodology and Work Flow

Exploration techniques
  Backtracking
  Cutting exploration space with multi-dimensional constraints

Two steps in design exploration:
  Find feasible mode selection for operating tasks
    Timing constraints
      Constraint graph
      Resource slacks
      Mission deadline
    Dependency between tasks
      Dependency graph
  Find feasible mode selections for idle intervals
    System power/energy constraints: min, max, or power profile
    Mode change overhead, both time and power overheads

Speedup techniques
  Sorting component modes with power numbers