Public

FP7-ICT-2009- 4 (247999) COMPLEX

COdesign and power Management in PLatform-based design space EXploration

Project Duration 2009-12-01 – 2012-11-30 Type IP

WP no. WP3

Deliverable no. D3.2.2

Lead participant POLITO

Final report on embedded software and hardware optimization

Prepared by Massimo Poncino, Haroon Mahmood (PoliTo), Carlo Brandolese, Gianluca Palermo, William Fornaciari (PoliMi), Sven Rosinger, Kim Grüttner (OFFIS)

Issued by POLITO

Document Number/Rev. COMPLEX/POLITO/R/D3.2.2/1.0

Classification COMPLEX Public

Submission Date 2012-02-29

Due Date 2012-02-29

Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)

© Copyright 2012 OFFIS e.V., STMicroelectronics srl., STMicroelectronics Beijing R&D Inc, Thales Communications SA, GMV Aerospace and Defence SA, SNPS Belgium NV, EDALab srl, Magillem Design Services SAS, Politecnico di Milano, Universidad de Cantabria, Politecnico di Torino, Interuniversitair Micro-Electronica Centrum vzw, European Electronic Chips & Systems design Initiative.

This document may be copied freely for use in the public domain. Sections of it may be copied provided that acknowledgement is given of this original work. No responsibility is assumed by COMPLEX or its members for any application or design, nor for any infringements of patents or rights of others which may result from the use of this document.

COMPLEX/POLITO/R/D3.2.2/1.0 Public

Final report on embedded software and memory optimization

History of Changes

ED. REV. DATE PAGES REASON FOR CHANGES

Massimo Poncino 1.0 2012-02-29 56 First release of final version.

Table of Contents

1 Scope of the Document
2 Embedded SW optimization
  2.1 Compiler optimizations exploration
    2.1.1 Optimization space modelling
    2.1.2 Optimizations clustering
    2.1.3 Flow execution
  2.2 Source to source optimization flow
    2.2.1 Optimization hint engine
    2.2.2 Optimization hint rule definition
    2.2.3 Flow execution of the optimization hint engine
    2.2.4 Transformation effectiveness quantitative estimator
    2.2.5 Flow execution of the transformation effectiveness estimator
  2.3 Parametric exploration
    2.3.1 Target independent configuration
    2.3.2 Target dependent configuration
  2.4 Tools
    2.4.1 swat-core-cc
    2.4.2 swat-opt
    2.4.3 swat-tge
3 Custom hardware optimization
  3.1 High level synthesis optimizations
    3.1.1 Technology Selection and Parameter Ranges
    3.1.2 Evaluation of Power Gating Models
    3.1.3 Evaluation of IP-Level Application of Power Management
  3.2 Memory optimization
    3.2.1 Introduction
    3.2.2 Energy Optimization of scratchpad memories
    3.2.3 Concurrent Aging and Energy Optimization of scratchpad memories
4 Application to Use-Cases
  4.1 Use Case 1
  4.2 Use Case 2
    4.2.1 DCT - High level synthesis optimizations
  4.3 Use Case 3
5 Summary
6 References

1 Scope of the Document

This deliverable presents the results from Task T3.2 - Embedded software optimization (Participants: PoliMi, IMEC - Start: M7 - End: M24) and Task T3.3 - Custom hardware optimization (Participants: CV, OFFIS, PoliTo - Start: M7 - End: M24) up to M27.

The deliverable is the second and last describing the optimization activities for embedded software and for the hardware, and describes the application of these optimization techniques in the COMPLEX flow ‘in isolation’, without emphasis on their interaction. The latter is the subject of a different set of deliverables (D3.4, “Intermediate and Final Report on Design Space Exploration”; D3.4.3, “Final Report on Design Space Exploration”, for the hardware optimizations; and D3.5.2, “Final report on Run-Time Management”, for the software techniques).

The document closely follows the structure of its predecessor (D3.2.1). Sections 2 and 3 describe the methodologies and the toolchains for the embedded software and custom hardware optimization (both High Level Synthesis and Memory hierarchy optimizations). Finally, Section 4 shows how the three selected use cases are covered by the optimization toolchains.

2 Embedded SW optimization

Embedded software optimization has been studied in several different ways. Some approaches are strictly related to the detailed software estimation and optimization methodologies, while others also involve additional portions of the flow, namely the design space exploration engine MOST. This section describes the advancements in the implementation of the different optimization flows:

1. Compiler optimization exploration. Integrates the SWAT detailed software estimation toolchain with the MOST design exploration engine to find the best combination of optimization options offered by the LLVM code transformation tool.

2. Source-to-source transformation. This flow is completely based on the SWAT toolchain and has the goal of providing "optimization hints" to the developer, suggesting high-level, potentially beneficial transformations. This toolchain cannot be integrated with the MOST exploration engine, since the suggested transformations are not applied automatically but rather require manual coding. Two different approaches have been followed.

   a. Qualitative. The first is a qualitative-only “optimization hint” engine, providing indications on the section of code to optimize and on how to optimize it.

   b. Quantitative. The second is an evolution of the first approach, as it estimates the potential energy reduction associated with a certain transformation. Being quantitative, this approach requires significant effort to analyze the effect of transformations and to model them in a quantitative way.

3. Parametric optimizations. This flow integrates the SWAT estimation toolchain with the MOST exploration engine and operates on source files implementing functions that depend on compile-time parameters. Typical examples are compiler pragmas (memory alignment, loop properties, unrolling directives, linker options, etc.) and application-specific parameters.

4. Application configuration. This flow integrates the SWAT estimation toolchain with the MOST exploration engine and provides an automated mechanism for the selection of specific “function implementations” and “processor operating modes”. The two approaches cover different application aspects.

   a. Target independent. This flow assumes that more than one implementation (referred to as "function mode") is provided for one or more given functions. Implementations differ w.r.t. functional and non-functional properties. The different implementations are expected to be executed on the target platform always operating in the same voltage and frequency conditions.

   b. Target dependent. Selected functions are automatically annotated to force the target to enter specific voltage and frequency operating modes. The best combination of modes is selected by means of design space exploration, trying to minimize the overall application energy under timing constraints.

Details of each flow are provided in the following.

2.1 Compiler optimizations exploration

This section provides a summary of the proposed optimization approach and reports the improvements that have been implemented.

2.1.1 Optimization space modelling

The available set of LLVM transformations/optimizations is modeled by the binary vector:

$$T = [\, t_0 \;\; t_1 \;\; \cdots \;\; t_N \,] \qquad (1)$$

whose element $t_i$ indicates whether the i-th transformation is active or not (for the list of available transformations see Deliverable D3.2.1). This leads to a very large space to be explored. The clustering matrix:

$$R = \begin{bmatrix} r_{0,0} & r_{0,1} & \cdots & r_{0,N} \\ r_{1,0} & r_{1,1} & \cdots & r_{1,N} \\ \vdots & \vdots & \ddots & \vdots \\ r_{K,0} & r_{K,1} & \cdots & r_{K,N} \end{bmatrix} \qquad (2)$$

has the goal of grouping transformations. An element $r_{i,j} = 1$ indicates that the transformation with index j belongs to group i, while a value $r_{i,j} = 0$ means that the transformation j does not belong to group i. With clustering, a specific optimization choice is described by the set of groups that are active, that is by a vector:

$$G = [\, g_0 \;\; g_1 \;\; \cdots \;\; g_K \,] \qquad (3)$$

having the same semantics as vector T, but with groups instead of single transformations. Given a certain choice of groups to be activated, as selected by the exploration flow, the transformations to enable are simply obtained as:

$$T = G \cdot R \qquad (4)$$

The original idea of the flow has been extended according to a two-phase approach. The second phase consists in the exploration over clusters of transformations, as described above and in more detail in Deliverable D3.2.1. The first phase, on the other hand, operates within each cluster. Given a cluster

$$g_i = [\, r_{i,0} \;\; r_{i,1} \;\; \cdots \;\; r_{i,N} \,] = \{\, t_j \mid r_{i,j} = 1 \,\} \qquad (5)$$

the same flow is used to select the subset of transformations that lead to more efficient code. Formally, this is equivalent to eliminating those transformations whose effect is negligible on the specific code. If $t_j$ is such a transformation, then we set $r_{i,j} = 0$. After reducing all groups according to this procedure, cluster-level optimization is performed.
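As a minimal sketch of equations (1) to (5), the following Python fragment expands a group selection G into a transformation vector T and prunes one cluster. The pass names are LLVM flags mentioned in this report; the small matrix, the helper names and the rule that any non-zero entry of the product enables a transformation (so that a pass shared by two groups is enabled once) are illustrative assumptions, not part of the SWAT/MOST implementation.

```python
# Illustrative sketch only: the matrix and helper names are assumptions.
PASSES = ["-inline", "-ipconstprop", "-sccp", "-licm", "-loop-unroll"]  # t_0..t_N

# Clustering matrix R: one row per group, R[i][j] == 1 iff pass j is in group i.
GROUPS = ["functions", "constants", "loops"]
R = [
    [1, 1, 0, 0, 0],  # functions
    [0, 1, 1, 0, 0],  # constants (shares -ipconstprop with functions)
    [0, 0, 0, 1, 1],  # loops
]


def expand(G, R):
    """Return T = G . R, clamped to 0/1 so shared passes are enabled once."""
    n = len(R[0])
    return [1 if any(G[i] and R[i][j] for i in range(len(R))) else 0
            for j in range(n)]


def prune(R, group, negligible):
    """Phase 1: set r_{i,j} = 0 for passes whose effect was found negligible."""
    reduced = [row[:] for row in R]  # work on a copy, keep R untouched
    for j in negligible:
        reduced[group][j] = 0
    return reduced


def flags(T):
    """Translate the binary vector T into a list of pass flags."""
    return [p for p, t in zip(PASSES, T) if t]


G = [1, 0, 1]  # "functions" and "loops" active
print(flags(expand(G, R)))

# Hypothetical phase-1 outcome: -licm is negligible for this code.
R1 = prune(R, GROUPS.index("loops"), [PASSES.index("-licm")])
print(flags(expand(G, R1)))
```

With clustering, the exploration ranges over the K+1 group bits instead of the N+1 transformation bits, which is what makes the second phase tractable.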

The setup of the MOST-based optimization flow is depicted in Figure 1. Inputs of the flow are the source files, the set of compiler options and a model of the target architecture.

Figure 1: General MOST-based optimization flow

[Diagram labels: x.c, Front-End, Optimization, Optimization Options, Back-End, CPU model, x.opt.c, MOST (design space: optimizations list), t = estimated time, e = estimated energy, s = estimated size]

The output can have different forms, namely:

1. A host-executable binary file

2. A target-executable binary file

3. A list of compilation options

4. A rewriting of the C source code.

It must be noted that, in the last case, the SWAT flow uses the LLVM experimental C language back-end, which generates C code that is hardly readable, as it is the effect of translation of assembly code back to very simple C statements.

2.1.2 Optimizations clustering

This section summarizes the clusters that have been constructed to perform the two-phase transformation selection exploration.

Control Flow
  -abcd                           Remove redundant conditional branches
  -break-crit-edges               Break critical edges in CFG
  -block-placement                Profile Guided Basic Block Placement
  -insert-edge-profiling          Insert instrumentation for edge profiling
  -insert-optimal-edge-profiling  Insert optimal instrumentation for profiling
  -jump-threading                 Thread control through conditional blocks
  -mergereturn                    Unify function exit nodes
  -lowerswitch                    Lower SwitchInst's to branches
  -sink                           Code Sinking
  -simplifycfg                    Simplify the CFG

Functions
  -always-inline                  Inliner for always_inline functions
  -argpromotion                   Promote 'by reference' arguments to scalars
  -codegenprepare                 Prepare a function for code generation
  -deadargelim                    Dead Argument Elimination

  -functionattrs                  Deduce function attributes
  -inline                         Function Integration/Inlining
  -ipconstprop                    Interprocedural constant propagation
  -ipsccp                         Interprocedural Sparse Conditional Constant Propagation
  -mergefunc                      Merge Functions
  -partial-inliner                Partial Inliner
  -partial-specialization         Partial Specialization
  -sretpromotion                  Promote sret arguments
  -tailcallelim                   Tail Call Elimination
  -tailduplicate                  Tail Duplication

Constants
  -constmerge                     Merge Duplicate Global Constants
  -constprop                      Simple constant propagation
  -ipconstprop                    Interprocedural constant propagation
  -ipsccp                         Interprocedural Sparse Conditional Constant Propagation
  -sccp                           Sparse Conditional Constant Propagation

Variables & Expressions
  -argpromotion                   Promote 'by reference' arguments to scalars
  -globaldce                      Dead Global Elimination
  -globalopt                      Global Variable Optimizer
  -gvn                            Global Value Numbering
  -mem2reg                        Promote Memory to Register
  -reg2mem                        Demote all values to stack slots
  -scalarrepl                     Scalar Replacement of Aggregates
  -reassociate                    Reassociate expressions
  -split-geps                     Split complex GEPs into simple GEPs

Basic Blocks
  -adce                           Aggressive Dead Code Elimination
  -dce                            Dead Code Elimination
  -die                            Dead Instruction Elimination
  -dse                            Dead Store Elimination
  -instcombine                    Combine redundant instructions
  -sink                           Code Sinking

Loops
  -indvars                        Canonicalize Induction Variables
  -lcssa                          Loop-Closed SSA Form Pass
  -licm                           Loop Invariant Code Motion
  -loop-deletion                  Dead Loop Deletion Pass
  -loop-extract                   Extract loops into new functions
  -loop-extract-single            Extract at most one loop into a new function
  -loop-index-split               Index Split Loops
  -loop-reduce                    Loop Strength Reduction
  -loop-rotate                    Rotate Loops
  -loop-unroll                    Unroll loops
  -loop-unswitch                  Unswitch loops
  -loop-simplify                  Canonicalize natural loops

Lowering
  -lowerallocs                    Lower allocations from instructions to calls
  -loweratomic                    Lower atomic intrinsics
  -lowerinvoke                    Lower invoke and unwind, for unwindless code generators
  -lowersetjmp                    Lower Set Jump

  -lowerswitch                    Lower SwitchInst's to branches
  -memcpyopt                      Optimize use of memcpy and friends
  -prune-eh                       Remove unused exception handling info
  -simplify-libcalls              Simplify well-known library calls
  -simplify-libcalls-halfpowr     Simplify half_powr library calls

Finally, the transformations in the following group are always active, since they mostly deal with the manipulation of the internal representation and do not really have an effect on the quality of the code.

Always active
  -deadtypeelim                   Dead Type Elimination
  -internalize                    Internalize Global Symbols
  -strip                          Strip all symbols from a module
  -strip-dead-prototypes          Remove unused function declarations
  -strip-debug-declare            Strip all llvm.dbg.declare intrinsics
  -ssi                            Static Single Information Construction
  -ssi-everything                 Static Single Information Construction

2.1.3 Flow execution

Execution of the optimization flow is quite straightforward. In the following we suppose that the transformations are clustered as described in the previous section.

Since the flow is integrated with MOST, which acts as the main tool, two files are necessary:

1. A wrapper script to invoke the actual estimator.

2. An XML file describing the exploration space and the optimization goals.

The script, in particular, wraps the call to the C-to-C SWAT optimization tool, based on the LLVM optimizer opt, and to the swat-core-ba flow, which evaluates the execution time and energy consumption of the code resulting from the application of the selected transformations. Here is an example of the script that has been developed for this purpose.

#__MOST_GENERIC_WRAPPER__# INPUT_TEMPLATE_FILE INPUT_FILE
#__MOST_GENERIC_WRAPPER__# METRIC_NAME OUTPUT_FILE TYPE ADDITIONAL INFO
#__MOST_GENERIC_WRAPPER__output_file__# execution_cycles log/reisc_sim.log regexp Executed\s*(\S+)\s*cycles
#__MOST_GENERIC_WRAPPER__output_file__# instructions log/reisc_sim.log regexp cycles,\s*(\S+)\s*instructions
#__MOST_GENERIC_WRAPPER__output_file__# code_size log/stat.log template Size:
#!/bin/sh
TARGET_FILE_DIR="/home/complex/UC1/apps/gsm/"
REISC_CONFIG_FILE="/home/complex/UC1/reisc/simple.cfg"
# Abort at the first failing command
set -e

# Make sure every phase file exists, even if no group is assigned to it
touch phase1.txt phase2.txt phase3.txt excluded.txt
# Append the passes of each group to the file selected for it by MOST
echo "-indvars -loop-unroll" >> @[email protected]
echo "-inline" >> @[email protected]
echo "-licm -loop-unswitch" >> @[email protected]
echo "-sccp" >> @[email protected]
echo "-mem2reg" >> @[email protected]
# Assemble the pass list: fixed prologue, phase 1 to 3 passes, fixed epilogue
echo "-preverify -domtree -verify -lowersetjmp" > opt.cfg
cat phase1.txt phase2.txt phase3.txt >> opt.cfg
echo "-preverify -domtree -verify" >> opt.cfg
rm phase*.txt
mkdir -p log bin opt opt/tmp
# Transform the sources, compile for the target, simulate and collect statistics
swat-opt -config opt.swatcfg -swat-debug > log/swat_opt.log 2>&1
reisc-gcc -O0 -mint32 opt/*.c -o bin/a.out > log/reisc_gcc.log 2>&1
reisc-run -a "--config-file=$REISC_CONFIG_FILE" bin/a.out > log/reisc_sim.log 2>&1
stat bin/a.out > log/stat.log
exit 0
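The way the wrapper script assembles the pass list can be sketched in Python as follows. This is only an illustration of the phase-ordering logic: the parameter names licm, sccp and mem2reg are hypothetical (only loop_unroll and inline appear in the design space file), and the function name is an assumption.

```python
# Sketch of the phase-ordering logic of the wrapper script (not SWAT code).
PROLOGUE = "-preverify -domtree -verify -lowersetjmp"
EPILOGUE = "-preverify -domtree -verify"
PHASES = ("phase1", "phase2", "phase3")

# Group name -> pass flags, in the same order as the echo lines of the script.
# "licm", "sccp" and "mem2reg" are hypothetical parameter names.
GROUP_FLAGS = {
    "loop_unroll": "-indvars -loop-unroll",
    "inline": "-inline",
    "licm": "-licm -loop-unswitch",
    "sccp": "-sccp",
    "mem2reg": "-mem2reg",
}


def build_opt_cfg(assignment, group_flags=GROUP_FLAGS):
    """Return the text of opt.cfg for one point of the design space.

    `assignment` maps each group to "phase1", "phase2", "phase3" or
    "excluded", i.e. to one of the values MOST can choose for a parameter.
    """
    buckets = {phase: [] for phase in PHASES}
    for group, group_passes in group_flags.items():
        phase = assignment[group]
        if phase == "excluded":
            continue  # the script writes these to excluded.txt and drops them
        buckets[phase].append(group_passes)
    lines = [PROLOGUE]
    for phase in PHASES:  # same order as: cat phase1.txt phase2.txt phase3.txt
        lines.extend(buckets[phase])
    lines.append(EPILOGUE)
    return "\n".join(lines) + "\n"


assignment = {
    "loop_unroll": "phase2",
    "inline": "phase1",
    "licm": "excluded",
    "sccp": "phase1",
    "mem2reg": "phase3",
}
print(build_opt_cfg(assignment), end="")
```

Each parameter therefore controls both whether a group of passes is applied and in which of the three phases it runs.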

The exploration space is described by an XML file listing the parameters available for exploration and the optimization goals.

<?xml version="1.0" encoding="UTF-8"?>
<design_space xmlns="http://www.multicube.eu/" version="1.3">
  <simulator>
    <simulator_executable path="/usr/bin/perl /home/most/wrapper.pl --execution_config=/home/most/complex/UC1/run.sh.in --timeout=1800" />
  </simulator>
  <parameters>
    <parameter name="loop_unroll" type="string">
      <item value="excluded"/>
      <item value="phase1"/>
      <item value="phase2"/>
      <item value="phase3"/>
    </parameter>
    <parameter name="inline" type="string">
      <item value="excluded"/>
      <item value="phase1"/>
      <item value="phase2"/>
      <item value="phase3"/>
    </parameter>
    ...
  </parameters>
  <system_metrics>
    <system_metric name="instructions" type="float" unit="inst" desired="small" />

    <system_metric name="execution_cycles" type="float" unit="cycle" desired="small" />
    <system_metric name="code_size" type="float" unit="Byte" desired="small" />
  </system_metrics>
</design_space>
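The system metrics declared in the design space are read back from the log files through the regular expressions listed in the header of the wrapper script. The following Python sketch shows how those two expressions behave; the sample log line is an assumption about the simulator output format, and the function is not part of MOST.

```python
import re

# Regular expressions copied from the wrapper script header.
METRIC_PATTERNS = {
    "execution_cycles": r"Executed\s*(\S+)\s*cycles",
    "instructions": r"cycles,\s*(\S+)\s*instructions",
}


def extract_metrics(log_text):
    """Return {metric name: value} for every pattern found in the log text."""
    metrics = {}
    for name, pattern in METRIC_PATTERNS.items():
        match = re.search(pattern, log_text)
        if match is None:
            raise ValueError(f"metric {name!r} not found in log")
        # The design space declares the metrics with type="float".
        metrics[name] = float(match.group(1))
    return metrics


# Hypothetical line of log/reisc_sim.log, shaped to satisfy both patterns.
sample_log = "Executed 123456 cycles, 98765 instructions\n"
print(extract_metrics(sample_log))
```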

As mentioned before, the SWAT estimation flow can seamlessly be replaced by the target ISS to perform a more accurate energy evaluation. This second option, though, suffers from the drawback that instruction set simulation proved to be more than 400 times slower than estimation. Considering that the exploration space is rather large, this strongly encourages the use of the SWAT estimation toolchain. The optimization and estimation commands run by the script are the following:

$> swat-core-cc -config opt.swatcfg -swat-debug
$> swat-core-ba -config ba.swatcfg -swat-debug

The tool swat-core-cc performs the actual transformations using the LLVM optimizer. The set of active transformations is passed to LLVM through specific configuration options in the opt.swatcfg file. This file is thus the input for the transformation and estimation tools and the output of the MOST engine during exploration: at each step of the exploration process, in fact, MOST generates a new configuration file.

For the format of the configuration files and a description of the command line options, see Section 2.4.

2.2 Source to source optimization flow

This section describes the implementation of the source-to-source optimization hint engine based on the formal formulation provided in Deliverable D3.2.1. Since the optimization hint engine swat-opt does not perform any source code transformation (which is left as a manual task to be performed by the developer), it is not possible to “close” the optimization loop by exploiting the exploration tool MOST.

The second part of the section describes the prototypical implementation of the quantitative transformation evaluation engine swat-tge. This tool provides a quantitative estimation of the potential energy saving that might be obtained by applying specific high-level transformations.

2.2.1 Optimization hint engine

After running the hint engine, the developer is provided with a set of suggestions on where and how to transform the source code.

Figure 2 shows a simplified view of the flow, where some pre-processing activities have been omitted and indicated as a whole with a “Front-End” box. It should be noted that the closed loop actually needs human intervention, as the suggested code transformations are not applied automatically.

Figure 2: General SWAT-based source-to-source optimization flow

[Diagram labels: x.c, Front-End, Optimization Engine, transformation rules, Optimization Hints, Back-End, CPU model, t = estimated time, e = estimated energy, s = estimated size, Satisfied? (yes / no), Manual Code Transformation (apply suggested transformations), x.opt.c]

2.2.2 Optimization hint rule definition

The grammar used to build the rules is rather general and is described in the following:

rulelist    → rule rulelist
            | rule
rule        → ruleid ':' condition
ruleid      → constant
condition   → term '|' condition
            | term
term        → elem '&' term
            | elem
elem        → comp

            | '~' comp
comp        → '%' ruleid
            | '(' condition ')'
            | identifier relop constant
identifier  → '$' metricid
            | '$' metricid '[' constant ']'

Here, constant is a terminal symbol indicating a numeric constant, relop stands for a relational operator and metricid is a terminal symbol whose string value identifies a specific metric, according to the following tables, which are grouped by the scope the metrics refer to (function, basic block, or whole application). The second column indicates whether the metric is a scalar or a vector; in the latter case, it specifies the meaning of the index.

Function metrics

Identifier        Argument                  Description
fnsize                                      Function size
fnbbsize                                    Function BB size
fnsizeavg                                   Average function size
fnbbsizeavg                                 Average function BB size
fncalls                                     Functions called
fncallswgh                                  Weighted functions called
fncallpoints                                Function call points
fninsnstat        Index of the instruction  Function instruction statistics
fnexec                                      Function execution count
fntime                                      Total function execution time
fntimeavg                                   Average function execution time
fndepth                                     Average function depth
fncallpointf                                Function call points frequency
fnregpress                                  Function register pressure
fnclassstat       Index of the class        Function instruction class statistics
fnmempress                                  Function memory pressure
fnstackpress                                Function stack pressure

Basic block metrics

Identifier        Argument                  Description
bbsize                                      Basic block size
bbsizeavg                                   Average basic block size
bbinsnstat        Index of the instruction  Basic block instruction statistics
bbexec                                      Basic block execution count
bbregpress                                  Basic block register pressure
bbclassstat       Index of the class        Basic block instruction class statistics

Application metrics

Identifier        Argument                  Description
aaclassstat       Index of the class        Instruction class statistics
aastackmax                                  Maximum stack size

aainsnstat        Index of the instruction  Instruction statistics
aabbexec                                    Total basic block execution time
aaregpress                                  Register pressure
aamempress                                  Memory pressure
aastackpressave                             Average stack pressure
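To illustrate how rules written in this grammar can be evaluated against the metrics listed above, the following Python fragment is a minimal recursive-descent sketch. It assumes that '|', '&' and '~' denote logical OR, AND and NOT, that '%' followed by a rule identifier refers to the result of a previously evaluated rule, and that relational operators are spelled as in C; the class name, the metric values and the thresholds are hypothetical and do not come from swat-opt.

```python
import operator
import re

TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|(<=|>=|==|!=|<|>)|(\S))")
RELOPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
          ">=": operator.ge, "==": operator.eq, "!=": operator.ne}


def tokenize(text):
    """Split a rule into (kind, value) tokens: num, name, relop or sym."""
    text, pos, tokens = text.rstrip(), 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        num, name, relop, sym = match.groups()
        if num is not None:
            tokens.append(("num", float(num)))
        elif name is not None:
            tokens.append(("name", name))
        elif relop is not None:
            tokens.append(("relop", relop))
        else:
            tokens.append(("sym", sym))
        pos = match.end()
    return tokens


class RuleEvaluator:
    def __init__(self, metrics):
        self.metrics = metrics  # metricid -> scalar, or list for vector metrics
        self.results = {}       # ruleid -> outcome of already evaluated rules

    def evaluate(self, text):
        """rule -> ruleid ':' condition"""
        self.toks, self.i = tokenize(text), 0
        rule_id = int(self._expect("num"))
        self._expect("sym", ":")
        value = self._condition()
        if self.i != len(self.toks):
            raise SyntaxError("unexpected trailing input")
        self.results[rule_id] = value
        return value

    def _peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else (None, None)

    def _expect(self, kind, value=None):
        k, v = self._peek()
        if k != kind or (value is not None and v != value):
            raise SyntaxError(f"expected {value or kind}, got {v!r}")
        self.i += 1
        return v

    def _condition(self):  # condition -> term '|' condition | term
        value = self._term()
        if self._peek() == ("sym", "|"):
            self.i += 1
            rhs = self._condition()  # always parse the right-hand side
            value = value or rhs
        return value

    def _term(self):  # term -> elem '&' term | elem
        value = self._elem()
        if self._peek() == ("sym", "&"):
            self.i += 1
            rhs = self._term()
            value = value and rhs
        return value

    def _elem(self):  # elem -> comp | '~' comp
        if self._peek() == ("sym", "~"):
            self.i += 1
            return not self._comp()
        return self._comp()

    def _comp(self):  # comp -> '%' ruleid | '(' condition ')' | identifier relop constant
        if self._peek() == ("sym", "%"):
            self.i += 1
            return self.results[int(self._expect("num"))]
        if self._peek() == ("sym", "("):
            self.i += 1
            value = self._condition()
            self._expect("sym", ")")
            return value
        self._expect("sym", "$")
        metric = self.metrics[self._expect("name")]
        if self._peek() == ("sym", "["):  # vector metric: '$' metricid '[' constant ']'
            self.i += 1
            metric = metric[int(self._expect("num"))]
            self._expect("sym", "]")
        compare = RELOPS[self._expect("relop")]
        return compare(metric, self._expect("num"))


# Hypothetical metric values for one basic block.
ev = RuleEvaluator({"bbexec": 5000, "bbsize": 12, "bbclassstat": [3, 40, 7]})
print(ev.evaluate("1: $bbexec > 1000 & $bbsize < 20"))
print(ev.evaluate("2: %1 & ~($bbclassstat[1] >= 50 | $bbsize == 0)"))
```

Rule 2 reuses the outcome of rule 1 through the '%' operator, which is how more specific hints can be layered on top of general ones under the assumed semantics.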

2.2.3 Flow execution of the optimization hint engine

It is worth noting that swat-opt the should not be applied to all the source code, but rather

to selected portions, called “scopes” (see Deliverable D3.2.1) that have been identified as the

most critical part of the application.

The list of scopes can be obtained using the analysis flow constituted by swat-core-ba and

swat-analyze. In particular, after performing the basic modeling and estimation tasks collected

in swat-core-ba with:

$> swat-core-ba –config myconfig.swatcf –swat-debug

it is necessary to run swat-analyze with the selection options activated, namely:

$> swat-analyze -bb-select –threshold <percent>

–cluster *.bbmodel

to select critical basic-blocks (loops inclusive), and:

$> swat-analyze -fn-select –threshold <percent> *.bbmodel

to select critical functions. Furthermore, since the optimization engine needs to know which

groups of basic blocks constitute a loop, the following command should be executed for each

critical function:

$> swat-analyze -bb-cfg -loops f1.bbmodel f2.bbmodel ...

Finally, the optimization engine can be run with the command:

$> swat-opt -config opt.swatcfg -swat-debug

For a detailed description of the command line interface and of the configuration options of

the tools, see Section 2.3.1 and Sections 4.4, 4.5 and 4.6 of Deliverable D2.2.2.

2.2.4 Transformation effectiveness quantitative estimator

This tool provides an estimate of the effectiveness of specific high-level transformations. The

key concept behind this approach is the possibility to estimate how the basic-block models of

the application will be affected by a specific transformation. The simplest and most accurate

approach would be to actually transform the source code, then perform estimations. This is

depicted by Figure 3.

Figure 3: Exact transformation effectiveness estimation approach

The tool, instead, does not perform an exact and semantically consistent transformation of the code

(as a source-to-source transformation engine would do) but rather “updates” the underlying

basic block models. This is pictorially represented by Figure 4.

Figure 4: SWAT Transformation effectiveness estimation approach

This approach requires a significant analysis and modeling effort to characterize specific

transformations in terms of resulting basic-block models. From a technical point of view, it is

not possible to express the transformation of the basic-block models strictly in mathematical

form. For this reason we have decided to account for the effect of each transformation by

means of a specific algorithm generating the new basic-block model. Each algorithm is then

compiled into a shared dynamic library loaded at runtime by the core tool.
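As an illustration of what such a model-updating algorithm might look like, the sketch below applies loop unrolling to a pair of basic-block models. The structure bb_model_t, the function-pointer type tform_fn and the update rule itself are assumptions made for this example only; they are not the SWAT data structures or the interface of the dynamic libraries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical basic-block model: execution count and per-execution cost. */
typedef struct {
    unsigned long exec_count;  /* number of executions           */
    double        cycles;      /* estimated cycles per execution */
} bb_model_t;

/* Hypothetical signature shared by all transformation algorithms. */
typedef void (*tform_fn)(bb_model_t *body, bb_model_t *ctrl, unsigned factor);

/* Model update for loop unrolling by `factor`: the loop-control block
 * (increment, compare, branch) runs once per unrolled iteration, while
 * the total work of the loop body is left unchanged. */
void tform_unroll(bb_model_t *body, bb_model_t *ctrl, unsigned factor)
{
    assert(body != NULL && ctrl != NULL && factor > 0);
    ctrl->exec_count = (ctrl->exec_count + factor - 1) / factor;
}

/* Total estimated cycles of a set of basic-block models. */
double total_cycles(const bb_model_t *bb, size_t n)
{
    double sum = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        sum += (double)bb[i].exec_count * bb[i].cycles;
    return sum;
}
```

Comparing total_cycles() before and after calling the transformation gives the estimated gain without touching the source code, which is the idea behind Figure 4.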

2.2.5 Flow execution of the transformation effectiveness estimator

The tool implementing this idea is currently in a very preliminary phase of development, as it

was not originally foreseen in the project. We nevertheless decided to explore this idea mainly

to support the optimization hint engine, rather than replacing it completely.

The tool is run with the following command line:

$> swat-tge -config tge.swatcfg -tform <name> -swat-debug

The configuration file, at present, does not introduce any additional option. Transformations

are explicitly specified on the command line. The name of the transformation is used to select

the specific model transformation dynamic library to be loaded. A complete description of the

interface is not provided here, since the tool is not yet stable enough.

2.3 Parametric exploration

This flow has the goal of finding the combination of the "parameters" of the applications that

maximizes a predefined optimization goal. The kind of parameters that can be explored here

are all supposed to be implemented as macro definitions influencing:

1. The behavior of the compiler. These macros (usually pragmas) are used to modify the

behavior of the compiler in the optimization, code generation and linking phases.

2. The behavior of the application. These macros directly influence the behavior of the

application code, by specifying, for example, tolerances, number of iterations,

timeouts, polling frequencies and so on.

We will refer to the former case as “target independent” exploration, and to the latter as

“target dependent” exploration.

The optimization flow, shown in Figure 5, combines the MOST exploration engine and the

SWAT estimation toolchain (or the actual instruction-set simulator of the target platform).

[Figure 5 diagram. Elements shown: x.c, macros.h, Front-End, Back-End, CPU model, and MOST with its design space (optimizations list). Reported estimates: t = estimated time, e = estimated energy, s = estimated size.]

Figure 5: General setup of the optimization flow for parametric exploration.

The flow has been implemented and tested on small examples. Since the implementation of

the flow basically consists in building ad-hoc wrappers and XML parameters descriptions for

interfacing MOST and the SWAT estimation flow, no additional details need to be provided

here. The form of the XML file and of the wrapper script is similar to that discussed in

Section 2.1.3.

2.3.1 Target independent configuration

For this kind of optimization, we suppose that a given function foo() of the application has

been implemented in different ways, which we refer to as functional modes. Each functional

mode is then subject to conditional compilation under the guard of macro FOO_MODE_<N>,

where <N> is a suffix that unambiguously identifies one of the specific implementations. An

example of implementation template of a function with three different modes is provided in

Figure 6.

#if defined( FOO_MODE_1 )
int foo( int x ) {
    // Implementation 1
}
#elif defined( FOO_MODE_2 )
int foo( int x ) {
    // Implementation 2
}
#elif defined( FOO_MODE_3 )
int foo( int x ) {
    // Implementation 3
}
#else
#error Mode not defined.
#endif

Figure 6: Template implementation of a function with three functional modes.

Typical examples of different “functional modes” of a function are related to the different

accuracies of a computation, a floating-point versus a fixed-point implementation of an

algorithm and different trade-offs between local processing and transmission frequency for

sensing functions on a wireless sensor network node.
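To make the template of Figure 6 concrete, the following sketch shows one function with two functional modes, a floating-point and a fixed-point implementation of the same scaling operation. The function name scale_sample, the macros SCALE_MODE_1 and SCALE_MODE_2 and the default selection are assumptions made for this example; in the actual flow the selecting macro would come from the macros.h file produced during exploration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the generated macros.h: select mode 2
 * unless the build already selected a mode. */
#if !defined(SCALE_MODE_1) && !defined(SCALE_MODE_2)
#define SCALE_MODE_2
#endif

#if defined(SCALE_MODE_1)
/* Mode 1: floating-point implementation (multiply by 0.75). */
int32_t scale_sample(int32_t x)
{
    return (int32_t)((double)x * 0.75);
}
#elif defined(SCALE_MODE_2)
/* Mode 2: fixed-point implementation, 0.75 = 3/4, using integer
 * multiply and divide only. */
int32_t scale_sample(int32_t x)
{
    /* Guard against overflow of the intermediate product. */
    assert(x >= INT32_MIN / 3 && x <= INT32_MAX / 3);
    return (x * 3) / 4;
}
#else
#error Mode not defined.
#endif
```

Both modes return the same value for inputs such as 100, but they differ in instruction mix and cost, which is what the exploration would trade off.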

2.3.2 Target dependent configuration

This second option of parametric exploration has the goal of determining the best combination

of the voltage and frequency operating modes of the target processor. Compared to traditional

approaches, where entire threads, processes or process batches are assigned an operating

mode, the exploration proposed here operates at a much finer-grained level.

Considering a generic application as structured as a set of C functions, we first identify the

most critical ones, using the same analysis steps outlined for the optimization hints flow.

These functions then need to be modified manually, but in a trivial way: it is in fact sufficient

to add two macros, one at the beginning and one at the end of the function. Note that if the

function has more than one exit point, the exit macro must be added before each of them.

Figure 7 shows a template of a function instrumented with the macros necessary to enable this

form of parametric exploration and automatic application of the configuration selected by the

exploration engine.

int foo( int x )
{
    /* Declarations */
    VFMODE_ENTER_FOO

    /* Original function body */

    /* At each exit point */
    VFMODE_EXIT_FOO
    return some_var;
}

Figure 7: Modification of a function to support target dependent modes exploration.
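The report does not show how the two macros expand. A minimal sketch, assuming a hypothetical platform hook that switches the operating mode and a small stack that restores the caller's mode on exit, could look as follows. The names vf_mode_t, vf_enter, vf_exit and the choice of the snooze mode for foo() are illustrative assumptions, not the ReISC API.

```c
#include <assert.h>

/* Hypothetical operating modes, named after the three ReISC modes. */
typedef enum { VF_NORMAL, VF_SNOOZE, VF_SLEEP } vf_mode_t;

#define VF_STACK_DEPTH 16

/* Stand-ins for the platform state: current mode, saved modes and a
 * counter of mode switches. */
static vf_mode_t vf_current = VF_NORMAL;
static vf_mode_t vf_stack[VF_STACK_DEPTH];
static int       vf_top = 0;
static unsigned  vf_switches = 0;

/* Save the caller's mode and switch to the mode allocated to the function. */
static void vf_enter(vf_mode_t m)
{
    assert(vf_top < VF_STACK_DEPTH);
    vf_stack[vf_top++] = vf_current;
    vf_current = m;
    vf_switches++;
}

/* Restore the mode that was active before the function was entered. */
static void vf_exit(void)
{
    assert(vf_top > 0);
    vf_current = vf_stack[--vf_top];
    vf_switches++;
}

/* What a generated macros.h might contain for function foo(). */
#define VFMODE_ENTER_FOO  vf_enter(VF_SNOOZE);
#define VFMODE_EXIT_FOO   vf_exit();

int foo(int x)
{
    int result;           /* Declarations */
    VFMODE_ENTER_FOO

    if (x < 0) {          /* Early exit point */
        VFMODE_EXIT_FOO
        return 0;
    }
    result = 2 * x;       /* Original function body */

    VFMODE_EXIT_FOO       /* Regular exit point */
    return result;
}
```

Because the macros are redefined for every allocation chosen by the exploration engine, the function source itself needs to be instrumented only once.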

In the specific case of the ReISC processor, the core provides three operating modes, namely

normal, snooze and sleep. The exploration engine, supported by the SWAT analysis tools

swat-core-tr and swat-analyze, will select for each function the best-suited operating mode of

the target processor. This is done by minimizing the estimated energy consumption under

execution time constraints, either in the form of deadlines for each function or in the form of

an overall timing constraint.
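The selection problem with an overall timing constraint can be sketched as follows. The search strategy used by MOST is not described here, so the example substitutes a plain exhaustive enumeration, which is only practical for a handful of functions. The type fn_cost_t and the per-function, per-mode cost tables are hypothetical.

```c
#include <assert.h>
#include <float.h>

#define N_MODES 3   /* e.g. normal, snooze, sleep */
#define MAX_FN  16  /* limit of this sketch       */

/* Hypothetical per-function estimates in each operating mode. */
typedef struct {
    double time[N_MODES];    /* estimated execution time */
    double energy[N_MODES];  /* estimated energy         */
} fn_cost_t;

/* Enumerate all N_MODES^n allocations and keep the one with minimum
 * total energy whose total time meets the deadline. Returns the best
 * energy, or -1.0 if no allocation is feasible; best[i] receives the
 * mode chosen for function i. */
double explore_modes(const fn_cost_t *fn, int n, double deadline, int *best)
{
    int alloc[MAX_FN] = {0};
    double best_energy = DBL_MAX;
    int found = 0;

    assert(n > 0 && n <= MAX_FN);
    for (;;) {
        double t = 0.0, e = 0.0;
        int i;
        for (i = 0; i < n; i++) {
            t += fn[i].time[alloc[i]];
            e += fn[i].energy[alloc[i]];
        }
        if (t <= deadline && e < best_energy) {
            best_energy = e;
            found = 1;
            for (i = 0; i < n; i++)
                best[i] = alloc[i];
        }
        /* Advance alloc[] like an odometer in base N_MODES. */
        for (i = 0; i < n; i++) {
            if (++alloc[i] < N_MODES)
                break;
            alloc[i] = 0;
        }
        if (i == n)
            break;  /* wrapped around: all allocations visited */
    }
    return found ? best_energy : -1.0;
}
```

Per-function deadlines could be handled in the same loop by checking each fn[i].time[alloc[i]] against its own bound instead of the sum.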

Again, the structure of the flow is based on the MOST exploration engine and does not

significantly differ from the arrangements discussed so far. The only difference lies in the

core tools of the SWAT framework that are used to perform analysis and estimation.

In particular, the following steps are necessary. First of all, the code of the functions needs to

be modified as described in Figure 7. Then, a static estimation pass must be performed to

derive the execution time and the energy consumption of each basic block of the application.

This is done by means of the swat-core-ba flow. Executing the application with different

processor modes assigned to different functions implies suitably updating the basic-block

models with different costs based on the specific mode the function is assigned.

Since the operating mode of the processor changes over time, depending on the function being

executed and its associated mode, a full trace of the executed basic blocks must be generated.

This is done using the swat-core-tr tool with a specific configuration that includes tracing of

function entry and exit points. This information will be used during analysis to determine

where, in the execution trace, the operating mode is changed. The command to do this is:

$> swat-core-tr -config trbbce.swatcfg -swat-debug

where the configuration file specifies the required instrumentation rules and support library,

namely:

[trace-bbce]
rules   = bbce.rules
library = libswat-tracing.a
binary  = executable
execute = true
mode    = file

The SWAT tracing core flow will dump the execution trace on a file with extension .t804 (see

Deliverable D2.2.2 for a description of the format of the trace file) listing all the executed

basic blocks and the function entry and exit points. These two passes (static estimation and

tracing) need to be performed only once, before entering the exploration loop managed by

MOST.

The dynamic, mode-dependent, estimation is then performed using swat-trp, the SWAT trace

post-processor. This tool, for the specific trace analysis, requires as input a file specifying one

“allocation” of functions to processor modes. The form of the file is very simple, as it lists all

functions and related operating modes. For a description of the way operating modes can be

assigned to functions, see Section 4.3.9 of Deliverable D2.2.2. This file is the input for the

trace analysis and is the output of MOST. At each step the file describes a different allocation

of functions to modes.

Furthermore the tool needs a specific entry in the configuration file indicating the energy and

timing characterization of the processor modes. This is specified as:

[target]
cpu-modes = reisc.modes

At this point the analysis tool can be run:

$> swat-trp -config alloc.swatcfg -fn-allocation -allocation-file <prj>.alloc -trace <prj>.t804 -swat-debug

The outputs generated are the estimated execution time and energy consumption of the

application configured as described in the allocation file. These figures are used by MOST to

select different allocations until the best one is found.
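The mode-dependent estimation can be pictured as a replay of the trace in which the current operating mode follows the function entry and exit events. The sketch below assumes a simplified in-memory trace and a purely multiplicative cost model per mode; the event layout, the scaling factors and all type names are hypothetical and do not reflect the .t804 format or the content of the mode characterization file. Mode switching overheads are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define N_MODES   3
#define MAX_DEPTH 64

/* Hypothetical trace event: function entry, function exit or basic block. */
typedef enum { EV_FN_ENTER, EV_FN_EXIT, EV_BB } ev_kind_t;
typedef struct { ev_kind_t kind; int id; } event_t;

/* Static per-basic-block estimates in the reference mode. */
typedef struct { double time; double energy; } bb_cost_t;

/* Per-mode scaling factors applied to the reference costs. */
typedef struct { double time_scale; double energy_scale; } cpu_mode_t;

typedef struct { double time; double energy; } totals_t;

/* Replay the trace: fn_mode[f] is the mode allocated to function f,
 * default_mode is used outside any instrumented function. */
totals_t replay_trace(const event_t *trace, size_t n_events,
                      const bb_cost_t *bb, const int *fn_mode,
                      const cpu_mode_t *modes, int default_mode)
{
    int stack[MAX_DEPTH];
    int depth = 0;
    int current = default_mode;
    totals_t tot = {0.0, 0.0};
    size_t i;

    for (i = 0; i < n_events; i++) {
        const event_t *ev = &trace[i];
        switch (ev->kind) {
        case EV_FN_ENTER:               /* save caller mode, switch */
            assert(depth < MAX_DEPTH);
            stack[depth++] = current;
            current = fn_mode[ev->id];
            break;
        case EV_FN_EXIT:                /* restore caller mode */
            assert(depth > 0);
            current = stack[--depth];
            break;
        case EV_BB:                     /* accumulate scaled costs */
            tot.time   += bb[ev->id].time   * modes[current].time_scale;
            tot.energy += bb[ev->id].energy * modes[current].energy_scale;
            break;
        }
    }
    return tot;
}
```

Since the trace and the static basic-block costs do not change, only fn_mode needs to be regenerated at each exploration step, which is why the two preparatory passes run once.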

2.4 Tools

2.4.1 swat-core-cc

This tool implements the core of the C-to-C optimization engine based on the LLVM optimizer.

Synopsis

swat-core-cc <options>

Options

-help

Prints a short description of the tool options.

-version

Prints the tool version.

-swat-debug

Produces a verbose debugging output of the execution.

-config

Specifies the configuration filename.

-output

Specifies the output filename, listing the rules that have triggered.

Configuration file specific options

The configuration file format follows the standard defined for all configuration files used by

the SWAT toolchain as described in Section 4.5.1 of Deliverable D2.2.2. For the specific tool

options the configuration file introduces the additional section [optimization] described

below, and uses the information in the configuration options llvm-ccflags, llvm-optflags

and llvm-optfile found in the standard [compilers] section.

The new section simply allows specifying the output directory where to save the optimized

version of the application. This new version is the input for the estimation flow.

output-dir = <path>

The output directory.

2.4.2 swat-opt

This tool implements the rule-based optimization hint engine.

Synopsis

swat-opt <options>

Options

-help

Prints a short description of the tool options.

-version

Prints the tool version.

-swat-debug

Produces a verbose debugging output of the execution.

-config

Specifies the configuration filename.

-output

Specifies the output filename, listing the rules that have triggered.

Configuration file specific options

The configuration file format follows the standard defined for all configuration files used by

the SWAT toolchain as described in Section 4.5.1 of Deliverable D2.2.2. For the specific tool

options the configuration file uses the additional section [srcopt] described below.

rules = <string>

The rule file. The file has the .optrules suffix and collects the rules, one per line,

structured according to the grammar exposed above.

fn-selection = <fnid> [<fnid>...]

Selected functions to apply the rules on. The argument is a list of function identifiers,

as generated by swat-uniqid.

bb-selection = <bbid> [<bbid>...]

Selected basic-blocks to apply the rules on. The argument is a list of basic-block

identifiers, as generated by swat-uniqid.

lp-selection = (<bbid> [<bbid>..]) [(<bbid> [<bbid>...])...]

Selected loops to apply the rules on. The argument is a list of loops enclosed in parentheses,

each loop being in turn a list of basic-block identifiers, as generated by swat-uniqid. The

list of loops can be obtained using swat-analyze with the options -bb-cfg -loops, as

described in Section 4.3.1 of Deliverable D2.2.2.

2.4.3 swat-tge

This tool implements the quantitative transformation effectiveness estimator.

Synopsis

swat-tge <options>

Options

-help

Prints a short description of the tool options.

-version

Prints the tool version.

-swat-debug

Produces a verbose debugging output of the execution.

-config

Specifies the configuration filename.

-tform <name>

Specifies the transformation to be analysed. The algorithmic transformation model is

implemented in the library tge_<name>.so.

3 Custom hardware optimization

3.1 High level synthesis optimizations

One of the main contributions of this work is the modeling of the dominant effects of Register Transfer (RT)-level components under power gating, which provides fast and accurate estimates for exploring the design space of the HLS. Figure 8 gives a simplified overview of the modeling, estimation, and optimization flow that will be further described in this section. A further description of the overall flow can be found in Deliverable D3.2.1.

Its main purpose is to get accurate estimates for four main variables: leakage currents in the

static on and off state, energy overheads due to the state transition and the break-even time.

These values are obtained for each individual RTL component within the design and are then

used together with the precise parameter values and activity patterns to estimate its

overall energy consumption.

Figure 8: Visualisation of proposed power-gating modelling, estimation and optimisation flow

The experimental assessment of the accuracy of the developed power gating model needs a fixed and well-defined environment. For this reason, a technology is first selected for the evaluation, and all model parameters are constrained to a set of discrete values or a continuous range. The following evaluation then distinguishes between the pure model evaluation and a presentation of the power management adoption at system level.

3.1.1 Technology Selection and Parameter Ranges

To validate the correctness of the modeling approaches and to demonstrate their generality, a selection of technologies and parameters has been made. Besides different technology node sizes, it is important to cover different process corners. Additionally, MTCMOS technologies should be considered in order to cover sleep transistor implementations in both standard- and high-threshold design.

Figure 9: Semiconductor technology selection

Figure 9 lists three different technologies for which the characterization was done. The

Nangate free 45nm open source digital cell library technology is a general purpose (GP)

technology based on predictive technology modelcards of the NIMO Group, Arizona State

University. It is freely available and is widely used in the scientific context. It offers three

even process corners (slow-slow, typical-typical, and fast-fast) that are all evaluated

separately. Even means that both PMOS and NMOS devices are equally affected by variations of fabrication parameters. Further, it is an MTCMOS technology and thus includes both standard- and high-VTH transistors. The industrial technologies are also MTCMOS technologies, but their process corner is restricted to the typical case in this evaluation. Additionally, and in contrast to the Nangate library technology, they are both low-power (LP) specialized technologies. LP technologies inherently have lower leakage currents, so the resulting power gating break-even time is of a different order of magnitude.

Figure 10: Parameter ranges

Furthermore, a set of different power gating implementation types, referred to as power gating schemes (PGSs), has been selected. It covers PMOS- as well as NMOS-based sleep devices, and double-cutoff as well as super-cutoff techniques.

Figure 10 lists all parameters of the characterization process and their ranges. The

supply voltage is constrained by the technology whereas the surrounding temperature is

constrained by reasonable values. The gate voltage of the sleep devices that is used in

SCCMOS techniques to enforce a cutoff is specified as an offset to the supply or ground

voltage. It is in the range 0V to 0.1V and thus the sleep signal is in the range of [VDD;

VDD+0.1V] for PMOS-based PGSs and [GND;GND-0.1V] for NMOS-based PGSs. The

sleep transistor width is constrained to a maximum of 10% of the gated component size. The

characterization is also constrained to functional RTL units that are available and supported

by OFFIS’s PowerOpt. Their bitwidths range from 4 to 32 bits in 4-bit steps.

3.1.2 Evaluation of Power Gating Models

During model generation, several methods have been used to compact and simplify the resulting

models. This includes compressions of lookup tables, exhaustive interpolations in multiple

dimensions, parameter separation, (non)-linear regression techniques, and simplifications to

speed up the model generation. For this reason, the evaluation has to show the quality and the

performance improvements compared to reference estimates. Since silicon measurements are

not available, the reference estimates are obtained by Spice-based analog circuit simulation

measurements. This is an established approach in the scientific as well as industrial area.

The entire characterization is done with Synopsys HSPICE version A-2008.03-SP1 and is executed on a general-purpose Intel Core2Duo machine at 3 GHz. It takes about one day per semiconductor technology, with transient simulations for the state transition energy and wakeup models making up 98% of the time. Of this, more than 50% is attributable to large multiplier components. This illustrates the limits of circuit simulation and underlines the difficulty of predicting the effect of power gating for very large components.

For presenting the absolute and relative accuracy of the models, a Monte-Carlo evaluation has

been applied covering all parameters in the aforementioned ranges and three error measures

have been computed: the maximum relative error for over- and underestimation (XRE), the

mean absolute relative error (MARE), and the relative standard deviation. In the following,

the evaluation results of the models are presented.
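The three error measures are named here without being restated. One common formulation, assumed for illustration rather than taken from the source, uses the relative error of each model estimate $\hat{x}_i$ against its SPICE reference $x_i$ over the $N$ Monte-Carlo samples:

```latex
\begin{align*}
e_i &= \frac{\hat{x}_i - x_i}{x_i}, \qquad i = 1,\dots,N \\
\mathrm{XRE}_{\mathrm{over}} &= \max_i e_i, \qquad
\mathrm{XRE}_{\mathrm{under}} = \min_i e_i \\
\mathrm{MARE} &= \frac{1}{N}\sum_{i=1}^{N} \lvert e_i \rvert \\
\sigma_{\mathrm{rel}} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(e_i - \bar{e}\right)^2},
\qquad \bar{e} = \frac{1}{N}\sum_{i=1}^{N} e_i
\end{align*}
```

Under this reading, a positive $e_i$ is an overestimation and a negative one an underestimation, which matches the way over- and underestimates are reported separately in the following subsections.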

3.1.2.1 Evaluation of Sleep Transistor Leakage Models

In the remaining leakage current model the supply voltage range is sampled in steps of

0.1V, the temperature in steps of 20°C, and the gate voltage in steps of 0.1V, resulting in a total

of 5*6*2 = 60 sampling points for each PGS and technology. Furthermore, the

characterization has been done for an isolated PGS circuitry with a channel width of 1µm.

Figure 11 shows the model errors. As it can be seen, the remaining gate- and subthreshold-

leakage currents can be predicted with an average MARE below 1% and a maximum error of

6.5%. On top of this error, the model simplification of assuming the voltage drop across the

sleep transistor to be equivalent to the supply voltage will induce an additional error in terms

of an overestimation of up to 15%.

Figure 11: Errors of the gate- and subthreshold leakage model for locking sleep transistors

Conducting sleep transistors are again modeled at a supply voltage sampling rate of 0.1V,

whereas the gate voltage disappears as a parameter. Since pure gate-leakage currents depend only

slightly on the temperature, a wider sampling step of 50°C can be used for this model,

leading to a total of 5*3 = 15 sampling points. Nevertheless, the temperature remains a

parameter during modeling as it may gain importance in future semiconductor technologies

because of increasing pn-junction leakage currents being more dependent on the temperature.

Figure 12 presents the model evaluation results of the gate-leakage model for conducting

sleep devices. The MARE is about 4% for the Nangate free 45nm open source digital cell

library and 1% for the two industrial technologies. In all cases, the model tends to

overestimate the gate-leakage currents because of the quadratic impact of VGS and VGD

while the model linearly interpolates between two adjacent sampling points. Increasing the

supply voltage sampling rate would reduce this overestimation but also enlarge the model.

Additionally, the maximum error is only 18% for the Nangate library and even below 4% for the

industrial technologies.
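The systematic overestimation follows from interpolating a convex curve with straight segments. The sketch below illustrates this with a purely quadratic law I = k * V^2, which is an assumption chosen for the example and not the characterized gate-leakage model; the function names are hypothetical as well.

```c
#include <assert.h>

/* Linear interpolation between two adjacent sampling points
 * (x0, y0) and (x1, y1), evaluated at x. */
double interp_linear(double x0, double y0, double x1, double y1, double x)
{
    assert(x1 > x0 && x >= x0 && x <= x1);
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0);
}

/* Hypothetical quadratic leakage law, used only to show the effect. */
double quad_leak(double k, double v)
{
    return k * v * v;
}

/* Relative error of the interpolated estimate at v, given samples
 * taken at v0 and v1. Positive values are overestimates. */
double interp_rel_error(double k, double v0, double v1, double v)
{
    double est = interp_linear(v0, quad_leak(k, v0), v1, quad_leak(k, v1), v);
    double ref = quad_leak(k, v);
    return (est - ref) / ref;
}
```

For a quadratic law the absolute error at the midpoint of a segment grows with the square of the step, so halving the sampling step reduces the overestimation to roughly a quarter while doubling the number of table entries along that axis.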

Figure 12: Errors of the gate-leakage model for conducting sleep devices

3.1.2.2 Evaluation of Voltage Drop Models

Figure 13 presents the maximum, mean, and standard deviation errors of the voltage drop

model for the conducting state. The parameters temperature and supply voltage are sampled

with a step width of 20°C and 0.05V. As presented in the charts, the occurring voltage drop

can be predicted with an average error of 1-5% with maximum overestimates of 25%.

Secondly, the errors of HVT- and double-gating schemes are larger than those of SVT- and

single-gating schemes because these schemes have higher on resistances and increase the

voltage drop dynamic that needs to be interpolated by the model. Underestimates that would

play down the presence of sleep devices are limited to 5% maximum.

The voltage drop model for the locking state is evaluated as presented in Figure 14. For the

parameters supply voltage, temperature, gate-voltage, and sleep transistor size the model

consists of a 5*2*3*6 = 180-point measuring field. With a mean absolute relative error below

1.5% and a relative standard deviation of at most 2.1% across all technologies, the

accuracy of the model is very high. However, this accuracy is also necessary because the

estimates serve as input to the state transition energy model and highly impact its prediction.

Figure 13: Errors of the voltage drop model for conducting sleep devices

Figure 14: Errors of the voltage drop model for locking sleep devices

3.1.2.3 Evaluation of State Transition Energy Models

Most of the model evaluation effort has been spent on the state transition energy model, because some large multiplier components cannot be simulated at high bitwidths or in combination with some PGSs. In these cases, Synopsys HSPICE fails to simulate the circuits due to high memory demand and failing convergence analyses. To provide a meaningful analysis of the model, a Monte-Carlo based evaluation performs a total of 1000 randomly chosen transient simulation runs, lasting about two weeks of computation time. The presented errors are based on the roughly 93% of simulation runs that finished successfully and include all model errors induced by the model representation and the required interpolation. In particular, the bitwidth scaling and PGS selection are reflected in the evaluation. Peak errors have been observed at peak voltage drop errors because of their superlinear dependency.

Figure 15 summarizes the evaluation results per technology and RTL component. Mean absolute relative errors below 10%, and mostly even below 5%, have been observed for the majority of components. Nevertheless, the quality varies. For example, the incrementer component inc_fast in the Nangate technology stands out with its higher peak errors and standard deviations. Secondly, the model tends to underestimate the state transition energy for the two multiplier components in different technologies. This suggests that the matrix structure causes superlinearly increasing wake-up energies. Nonetheless, the maximum errors remain below 25%, which is considered reasonable, and no further modeling effort has been spent on these components.

As the temperature is set to the upper bound during characterization, the models only

predict upper bound estimates. The interpolation table size of the model is 5*2*5 = 50 points

for the model parameters supply voltage, voltage drop, and sleep transistor size.

For the purpose of high-level tradeoffs for which the models should be used, the accuracy is

perfectly adequate and the speed improvement is the dominant model feature. Considering

that a single analog circuit simulation may take up to several hours, the pre-characterized

models can provide thousands of estimates per second.

Figure 15: Error of state transition energy model

3.1.2.4 Evaluation of State Transition Delay Models

Figure 16 presents the wake-up time model evaluation. As can be seen, the mean errors are mainly below 10%, but peak errors vary considerably and range up to 26%. In particular, the wake-up delay prediction for the small-type components performs better than for the fast-type components across all technologies. The interpolation table size of the model is as small as in the ERT SW model because it is based on the same characterization runs.

Figure 16: Errors of state-transition delay model

3.1.2.5 Evaluation of Process Variation on Power Gating

The Nangate semiconductor technology offers circuit level device models of three process

corners. These corners represent the extremes of parameter variations within which a circuit

must operate correctly. Thus, the corners cover the overall spectrum from slowest to fastest

possible devices. In this section, the impact of process variation on power gating is evaluated

exemplarily for a single RTL component.

Figure 17 presents model estimates for power gating relevant parameters that are normalized

to the typical operating case. As it can be seen, the voltage drop across the sleep transistor as

well as the state transition energies change only slightly. This is completely different for

the leakage currents and timing behavior. As expected, power gated components that are

fabricated at the fast process corner wake up faster but on the other hand they cause a lot more

leakage currents. In relative terms, the active current of the fast process corner is 2.6 times as

high as of the typical corner but, while being power gated, the remaining leakage current in

the sleep state is even 5.3 times as high. But in absolute terms, the amount of reduced leakage

is much higher for the fast corner. Together with the almost constant state change energy,

power gating becomes even more advantageous for designs fabricated at the fast and less

advantageous for the slow process corner.

A break-even time analysis for the Nangate 45nm technology at fast process corner results in

tbe times less than half of the typical-case break-even time.
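As a rough plausibility check of these ratios, the following worked example normalizes the typical-corner active leakage to 1 and assumes, purely for illustration, a typical-corner sleep leakage of 3% of that value. The symbols $\Delta I$ (saved leakage current) and $E_{tr}$ (state transition energy), as well as the first-order relation $t_{be} \approx E_{tr} / (V_{DD} \cdot \Delta I)$, are assumptions of this sketch and are not taken from the model itself.

```latex
\begin{align*}
\Delta I_{typ}  &= 1 - 0.03 = 0.97 \\
\Delta I_{fast} &= 2.6 - 5.3 \cdot 0.03 = 2.6 - 0.159 = 2.441 \\
\frac{\Delta I_{fast}}{\Delta I_{typ}} &= \frac{2.441}{0.97} \approx 2.5 \\
\frac{t_{be,fast}}{t_{be,typ}} &\approx \frac{\Delta I_{typ}}{\Delta I_{fast}} \approx 0.40
  \quad \text{(for equal } E_{tr} \text{ and } V_{DD}\text{)}
\end{align*}
```

Under these assumptions the fast corner saves about 2.5 times as much leakage in absolute terms, and the break-even time shrinks to roughly 40% of the typical-case value, which is in line with the "less than half" reported above.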

Figure 17: Normalized model estimates for different process corners to analyze the process variation impact on power gating

3.1.3 Evaluation of IP-Level Application of Power Management

Every RTL component within a datapath contributes a small fraction to the active and sleep currents of the overall design and has its individual wake-up energy and time. Furthermore, at RT level, each component has its own break-even time. At system level, all of these parameters merge into one overall effectiveness metric of power gating and result in one global break-even time that has to be exceeded if all components are cut off simultaneously. This section evaluates this system-level view of power management in relative comparisons and absolute numbers with respect to the overall achievable savings, the impact of parameters, and the overhead costs in area and power.
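How per-component parameters could merge into one global break-even time can be sketched as follows. This is a minimal first-order model, not the estimator used in this evaluation: it assumes that the break-even time is the total state transition energy divided by the leakage power saved while sleeping. The function name, the tuple layout and all numeric values are hypothetical.

```python
def global_break_even_time(components, vdd):
    """First-order global break-even time in seconds (assumed model).

    components: iterable of (transition_energy_J, active_leak_A, sleep_leak_A)
    vdd: supply voltage in volts
    """
    total_energy = sum(e for e, _, _ in components)
    saved_current = sum(i_act - i_slp for _, i_act, i_slp in components)
    if saved_current <= 0:
        raise ValueError("power gating saves no leakage current")
    # Energy overhead of one sleep/wake cycle divided by the saved leakage power.
    return total_energy / (vdd * saved_current)


# Hypothetical component values, for illustration only.
example = [(1e-12, 1e-6, 2e-8), (3e-12, 3e-6, 8e-8)]
t_be = global_break_even_time(example, vdd=1.0)
```

A sleep interval shorter than `t_be` would cost more energy for the state transitions than it saves in leakage under this model.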

Figure 18 lists design examples and characteristic parameters such as the functional unit composition of their datapath after synthesis and the cycle count of the schedule. Power gating has been applied to all of the designs using HVT NMOS sleep devices, which are the most common choice in today’s practice. The fourth and fifth columns of the table show the absolute active and sleep currents of the designs at a fixed supply voltage of 1.0V, an ambient temperature of 27°C, and on the basis of the Nangate 45nm technology at the typical process corner. The sleep and active currents are restricted to the functional units (FUs) of the designs because these are the focus of this analysis. Nevertheless, the FUs account for the dominant part of the total energy consumption. For example, in the FDCT benchmark, the FUs contribute 68% of the total energy consumption, whereas the remaining 32% is split among multiplexers, registers, controller, and clock tree. As the results show, the active-state leakage current is effectively reduced across all benchmarks.

Figure 18: Design examples and the effectiveness of power gating in a global sleep state

In the following, the FDCT benchmark is analyzed in more depth in order to show the impact of the continuous parameters temperature and supply voltage as well as of the discrete parameters process corner and PGS selection. For this analysis, the Nangate 45nm technology has been chosen at the typical and fast process corners. Furthermore, the HVT version has again been selected for the sleep devices, and the sleep device sizes have been fixed to 2% of each RTL component size. HVT devices require a higher supply voltage. The supply voltage range is therefore constrained to [1.1V;1.3V], whereas the temperature is examined across its whole range of [27°C;127°C]. Figure 19 then shows the gating-switch effectiveness as the ratio of sleep to active current and the break-even time of the overall FDCT design in nanoseconds.

First, it can be seen that the effectiveness of power gating varies only little across the parameter ranges. It becomes only slightly less effective in suppressing leakage currents as the temperature increases.

The supply voltage also has only a marginal impact on the effectiveness. Additionally, the sleep/active current ratio varies only between 2% and 4% among the different power gating schemes. In other words, leakage is reduced by 96-98% in all cases and, from the point of view of pure leakage saving, the PGS selection is not particularly interesting if all other parameters are identical.

Secondly, the break-even time is presented. Unlike the gating effectiveness, the break-even time diminishes with increasing temperature and supply voltage. This is because the wake-up time is much lower and fewer incomplete transitions occur during the state transition. With a factor of up to four, the variation is also much higher. Furthermore, the PGS selection strongly affects the break-even time: PMOS schemes have up to two times higher break-even times. Comparing the two process corners, the break-even time at the typical process corner is also about twice as long as at the fast process corner.


Figure 19: Comparison of power gating scheme efficiency and dynamic parameter impact


The wake-up time at system level is given by the maximum RTL component wake-up time, provided that the supply grid is sufficiently dimensioned. Figure 20 shows the wake-up time of the FDCT benchmark as a function of temperature and supply voltage for the aforementioned gating types and process corners.

Figure 20: Wake-up time evaluation of the FDCT design

It can be observed that the wake-up time varies only very little across the parameter ranges. It decreases slightly with increasing supply voltage and increases with rising temperature. Furthermore, at the fast process corner it is about 20-30% smaller than at the typical process corner. A comparison of the PMOS and NMOS gating schemes shows that NMOS schemes wake up about three times faster.


3.2 Memory optimization

3.2.1 Introduction

The memory optimization tool aims at optimizing the memory hierarchy of the system under analysis using total memory energy as a metric. In addition, the sub-banking-based optimization strategy considered for SRAMs also helps to mitigate the aging effects caused by Negative Bias Temperature Instability (NBTI). This section first summarizes the assessment of the energy benefits obtained by the memory optimization tool and then presents the details of techniques to concurrently achieve reduced energy and extended lifetime.

3.2.2 Energy Optimization of scratchpad memories

Many strategies proposed in the literature for reducing the dynamic energy of memories rely on the paradigm of splitting a memory array ([4], [5], [6]). Section 3.2.1.3 of D3.2.1 explained how splitting the address space into multiple, independently accessed memory sub-blocks can provide a significant reduction in energy consumption. Memory sub-banking is beneficial for energy in general because accesses are not uniformly distributed over the memory locations. Even a naïve partition into two identical sub-blocks guarantees a sizable reduction of the average energy. The search space of all possible memory partitions can be easily enumerated by observing that a partition is completely defined by a set of address boundaries (e.g., a bi-partition is characterized by the address at which the memory is split into two). Options for searching the space include a top-down branch-and-bound search [4] or a bottom-up search based on dynamic programming [4]. This allows solving the problem optimally in polynomial time in spite of an exponentially sized search space.
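The bi-partition case can be illustrated with a small exhaustive search over the single address boundary. The sketch below is not the MEMOPT algorithm: it assumes a hypothetical energy model in which the energy per access grows with the square root of the block size, ignores decoder overhead, and uses a made-up access profile.

```python
import math


def access_energy(words):
    # Hypothetical model: energy per access grows with the square root of block size.
    return math.sqrt(words)


def best_bipartition(accesses):
    """Return (boundary, energy) of the cheapest split into [0, b) and [b, n).

    accesses[a] is the number of accesses to address a.
    """
    n = len(accesses)
    total = sum(accesses)
    best = (0, total * access_energy(n))      # boundary 0 means "no split"
    low = 0                                   # accesses falling into [0, b)
    for b in range(1, n):
        low += accesses[b - 1]
        energy = low * access_energy(b) + (total - low) * access_energy(n - b)
        if energy < best[1]:
            best = (b, energy)
    return best


# Made-up profile: three hot addresses followed by thirteen cold ones.
profile = [100, 90, 80] + [1] * 13
boundary, energy = best_bipartition(profile)
monolithic = sum(profile) * access_energy(len(profile))
```

With this skewed profile the search isolates the hot addresses in a small block, and even the naïve split into two equal halves lowers the modelled energy compared to the monolithic memory.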

Section 3.1.3.1 of D3.4.2 presented detailed energy results for the standalone MEMOPT version of the tool. The following figures show the percentage of energy reduction obtained by splitting the memory into two partitions. The first set consists of three sample applications provided with the ReISC distribution, whereas the second set is a subset of the MIBENCH benchmarks, which are widely used in the embedded systems community. All applications were compiled using the ReISC toolchain and a fixed set of compiler optimizations. Figure 21 shows the results for the ReISC sample applications and Figure 22 shows the results for the MIBENCH kernels.

Figure 21. Energy Reduction on ReISC sample applications.


Figure 22. Energy Reduction on MIBENCH sample applications.

The figures above clearly show the benefit obtained by partitioning the memory into two blocks, which provides 80 to 85% savings in almost all cases. These savings can be enhanced further by increasing the number of partitions.

3.2.3 Concurrent Aging and Energy Optimization of scratchpad memories

3.2.3.1 Overview

Traditionally, power and reliability have been considered conflicting metrics, since most design solutions for improving reliability (redundant circuits, strong signals, large devices) are intrinsically power inefficient. However, the recent emergence of reliability issues in the form of aging (i.e., temporal drift of performance) of devices has opened a new perspective on this dichotomy, because power management can also benefit aging. Such a benefit can be especially exploited in SRAM memory structures, which are particularly sensitive to NBTI effects: given their symmetric structure, they cannot take advantage of value-dependent recovery.

The most effective solutions rely on the observation that typical power management strategies (i.e., voltage scaling for dynamic power and power/ground gating for static power) can be exploited to reduce NBTI-induced aging [10], [11]. Therefore, properly revisiting power-managed memory/cache architectures according to an aging-related metric can achieve concurrent energy and aging improvements [12], [13], [14]. In this deliverable, the sub-banking-based memory optimization strategy used to obtain energy-efficient SRAM architectures is also investigated as a means to extend the lifetime of the memory, and some additional techniques are presented to further improve the aging benefits.

3.2.3.2 Aging: Background and Preliminaries

Aging of devices has emerged as the latest challenge brought by technology scaling. Thinner oxide layers, higher electric fields, and higher operating temperatures induce adverse physical and chemical phenomena that cause the performance of transistors to deteriorate over time.


Deviation from the ideal behaviour of manufactured devices is the most critical downside of technology scaling beyond the 90nm node. The most evident type of non-ideality is the non-determinism of devices due to process variations [15]. These are mostly due to random fluctuations of dopant atoms and to the systematic or non-systematic imprecision of the manufacturing process, and can be viewed as a sort of ’time-zero’, fixed deviation from the nominal behaviour of each device.

There exists, however, another and even more insidious type of non-ideality resulting from technology scaling, namely time-dependent deviations in the operating characteristics of devices [16]. There are essentially two sources of time-dependent variations: Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI). These physical/chemical effects result in the degradation of the oxide, thus causing a drift of the threshold voltage over time.

Bias Temperature Instability (BTI) has emerged as the most critical wear-out mechanism for MOS transistors below the 100nm node. It manifests itself as a time-dependent, permanent increase of the threshold voltage Vth of active transistors. Although BTI occurs in both n-type and p-type devices, at the current technology nodes, i.e., 65nm and 45nm, only pMOS transistors are significantly affected: the nMOS transistor has a negligible level of holes in the channel and thus does not suffer from BTI degradation.

NBTI occurs when a pMOS is negatively biased (i.e., a logic ’0’ is applied to the gate of the

pMOS, resulting in Vgs = −VDD), and manifests itself as an increase of the threshold voltage

with time, resulting in the reduction of drive current and noise margin, causing in turn a

degradation of the delay of a device.

The actual amount of degradation depends on several parameters of a device, such as its logic function, threshold voltage, size, load, and temperature [17]. From the design standpoint, however, the most important property of NBTI is its dependence on the logic values. The threshold voltage (and delay) degradation occurs only when a pMOS device is in its critical state (the stress state), that is, when a logic ’0’ is applied to the device input. When a logic ’1’ is applied, the NBTI stress is removed, resulting in a partial recovery (i.e., a decrease) of the threshold voltage (the recovery state), as depicted in Figure 23.

Figure 23. NBTI effect on pMOS.


The most widely accepted physical model of the NBTI phenomenon is the Reaction-Diffusion (R-D) mechanism [16], which explains the temporal shift of Vth in terms of the breaking of hydrogen-passivated Si-H bonds at the Si/SiO2 interface and the subsequent diffusion of hydrogen, which induces the formation of interface traps. The generated traps, which accumulate over time, decrease the electrostatic control of the channel and therefore result in a larger threshold voltage Vth. This trap generation phase is called the stress phase. When the electrical stress is removed (i.e., Vgs = 0, corresponding to a logic ’1’ on the pMOS gate input), holes are not present in the channel, which avoids the generation of new traps, while part of the free hydrogen atoms diffuse back and anneal the broken Si-H bonds. In this phase, called the recovery phase, the number of interface traps is reduced and Vth partially recovers. Figure 23 shows the temporal diagram of a typical NBTI-induced Vth degradation and recovery sequence.

Experimental data report a Vth variation of about 10-15% per year, depending on the target technology and on the electrical and environmental conditions. The delay degradation follows the same trend as the threshold voltage, yet with a smaller magnitude.

A conventionally accepted metric for the aging of an SRAM cell is the Static Noise Margin (SNM), defined as the minimum DC noise voltage necessary to change the state of an SRAM cell. NBTI impacts the SNM because it causes the Vth of the pMOS transistors to drift over time (namely, $V_{th}(t) = K \cdot t^{1/4}$), thus degrading the static characteristics of the two inverters that form the 6T-SRAM cell. Therefore, after some time, the SNM falls below the threshold that allows safe storage of data (such a threshold depends on the technology and on the specific design of the memory cell).
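To see what the $t^{1/4}$ law implies for lifetime, a short derivation may help. It assumes a hypothetical critical threshold drift $V_{crit}$ at which the SNM becomes too small for safe storage; $V_{crit}$ and the lifetime symbol $LT$ are introduced here only for illustration.

```latex
\begin{align*}
K \cdot LT^{1/4} &= V_{crit}
  \;\Longrightarrow\; LT = \left(\frac{V_{crit}}{K}\right)^{4} \\
\frac{LT'}{LT} &= \left(\frac{K}{K'}\right)^{4},
  \qquad K' = 0.9\,K \;\Rightarrow\; \frac{LT'}{LT} = \frac{1}{0.9^{4}} \approx 1.52
\end{align*}
```

Because of the fourth power, even a modest reduction of the degradation coefficient $K$ translates into a much larger lifetime gain under this model.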

Figure 24. Static Noise Margin and aging of SRAM cell.

As a matter of fact, SRAM structures are particularly sensitive to NBTI effects because, given their symmetric structure, they cannot take advantage of the value-dependent recovery typical of NBTI. An SRAM cell ages regardless of the value it stores; the dependency on logic values is therefore immaterial in a memory cell. The best-case degradation occurs when both pMOS transistors exhibit the same amount of degradation, that is, when the cell stores a 0 and a 1 with equal probability [19].


Figure 25. 6T SRAM cell.

In a memory cell (and, by extension, in a memory word), aging thus occurs regardless of whether the cell (or word) is accessed or not. In other words, there is a substantial difference between dynamic power and NBTI aging. In order to “stop” the aging, a memory cell (or word) must be put into a proper “idle” state that can be used when the cell (word) is not accessed. Therefore, the design of an aging-aware memory structure shares many similarities with a power management problem: there is a need to implement an appropriate power-down mechanism for memory words and, by aggregation, for blocks of words (banks). The work of [9] has shown how typical implementations of low-power states (using supply voltage scaling or power gating) have different cost/benefit tradeoffs in achieving simultaneous energy and aging benefits.

3.2.3.3 NBTI mitigation techniques

A number of solutions have been proposed in the literature to mitigate NBTI effects in SRAMs, including methods to equalize cell value probabilities and the design of customized NBTI-resilient cells that minimize the degradation ratio of all pMOS transistors in the cell. Among them, a class of solutions is based on exploring the aging benefits provided by low-energy states [12], [13], [14]. Power management solutions (based on both DVS and power gating) were assessed at the architectural level on entire memory blocks in [12].

The impact of classical implementations of standby states on aging is interesting and deserves further insight.

Impact of Power Gating on Aging: When the sleep transistor is off, each cell is disconnected from ground, and the outputs of both inverters reach the “1” value, i.e., the NBTI-immune configuration. Notice that this is not a “logic” state: it is due to electrical reasons and cannot be forced by writing some value into the cell. Based on this property, each line ages proportionally to the time spent in the active state. This makes it possible to express aging in terms of the probability Psleep of the Sleep signal (Figure 1-(a)), i.e., how often a cache line is put into the sleep state.

Impact of Voltage Scaling on Aging: The aging of a pMOS transistor is determined by the amount of negative bias voltage (i.e., gate-to-source voltage). Voltage scaling therefore helps to alleviate aging: supplying a device with a smaller Vdd translates into a smaller Vgs and hence into a smaller magnitude of negative bias. The reduction of the SNM as a function of Vdd is roughly linear; the degradation of the SNM under a “drowsy” voltage Vdd,drowsy = 0.4V is about 60% of the degradation at the nominal Vdd.

Both power management schemes perform well in terms of energy savings, along with appreciable benefits in terms of lifetime extension. Concerning performance, however, the power gating scheme suffers from a large penalty due to the increased miss rate, which forces many accesses to be resolved in main memory. In terms of energy, lifetime benefit, and performance overhead, the DVS-based scheme is therefore generally more effective.


Once the most suitable implementation of the standby state for aging reduction has been identified, the remaining choice is the unit of power management (UPM), that is, the atomic unit that can be turned on and off based on activity.

It makes sense to have this unit coincide with the unit of access (UA), i.e., a memory word or a cache line.

In a power-managed cache, the idleness of each UPM can therefore be extracted by observing the idle intervals longer than some breakeven time. Given that idleness can be exploited for both energy and aging reduction, it is important to observe that different characteristics of the idleness profile matter for the two metrics. For energy, it is clearly the average value that matters: the energy savings of each line accumulate proportionally to the idleness of the line. For aging, conversely, it is the worst case that matters: the line that becomes unusable first causes the entire cache to fail. Since, due to the very principle of locality, the distribution of idleness is in general not uniform, average and worst case differ. In order to exploit idleness for both metrics concurrently, the worst-case idleness must be controlled.
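The difference between the two metrics can be made concrete with a small sketch. It assumes, as a simplification of the power-gated case described earlier, that a line ages only while it is active, so that its lifetime relative to an always-active line is 1 / (1 - idleness). The function names and the two idleness profiles are hypothetical.

```python
def idleness_metrics(idleness):
    """idleness[i] is the fraction of time line i spends in the low-power state."""
    average = sum(idleness) / len(idleness)   # drives the energy savings
    worst = min(idleness)                     # drives the lifetime
    return average, worst


def relative_lifetime(idleness):
    """Cache lifetime relative to an always-active line (assumed model:
    a line ages only while active, and the first failing line kills the cache)."""
    worst = min(idleness)
    if worst >= 1.0:
        raise ValueError("a line that is never active does not age in this model")
    return 1.0 / (1.0 - worst)


skewed = [0.9, 0.9, 0.9, 0.1]    # typical effect of locality: one hot line
uniform = [0.7, 0.7, 0.7, 0.7]   # same average idleness, evenly spread
```

Both profiles have the same average idleness and hence similar energy savings, but the uniform one yields a much longer lifetime in this model because its least idle line is far more idle.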

The following techniques have been shown to control worst-case idleness:

Dynamic Indexing

Partitioned cache architecture

Application specific partitioning

In the following, we will outline the basic ideas behind each of these strategies.

Dynamic indexing

In dynamic indexing [13], the cache indexing function is modified over time in order to

achieve a uniform distribution of idleness over the cache lines; in this way all the leakage

saving opportunities can also be used for aging reduction.

Since there is an inverse correlation between the idleness of a cache line and its aging (i.e., an idle line can be power managed and therefore does not age), such a uniform distribution guarantees the elimination of the worst case: all cache lines will thus die at the same time.

Fortunately, power (and in particular, static power) optimization techniques such as power gating and dynamic voltage scaling (DVS) can mitigate NBTI effects in memories. Power gating has the effect of completely nullifying the aging effects [11]. Similarly, but with a smaller impact, voltage scaling reduces NBTI-induced aging because a reduced Vdd corresponds to a smaller bias voltage [10].

Figure 26 shows the two cache leakage architectures based on power management. A cache line is used as the atomic unit of power management. The decision whether to turn off a cache line is based on its usage: lines that have not been accessed for a given number of cycles (the breakeven time, B1) are put into a low-leakage state.


Figure 26. Power-Gated (a) and DVS-Based (b) architectures.

Dynamic indexing applies to caches, in which an indexing mechanism is already present, but with minor adaptations it can also be applied to generic SRAM blocks.

Uniform-size partitioned cache architecture:

This scheme can be viewed as a coarse-grain extension of dynamic indexing [13]. It implements a uniform-size, multi-bank cache with the purpose of achieving a better design point in the aging/energy design space.

This technique is based on the idea of partitioning a memory into multiple banks of identical

size. While this organization has been widely used for reducing both dynamic and static

power, its exploitation for aging benefits requires proper management of the existing idleness

of the various banks. This can be achieved by means of a sort of time-varying addressing

scheme in which addresses are mapped to different banks over time in such a way that the

idleness is uniformly distributed over all the banks.


Figure 27. DVS-Based uniform-size multi-bank architectures.

The architecture is shown in Figure 27 for an M-block partitioned cache; solid lines represent address lines, while dashed ones are “sleep” signals. When a block is unused, it is turned off by asserting the sleep signal, which enables the selection of the low Vdd supply voltage. The block labeled D implements two functionalities: remapping the address onto the proper block and asserting the “sleep” signal for the other blocks. Address signals are simply derived by routing the $n - p$ LSBs of the address to each sub-block; sleep signals are obtained by taking the $p$ MSBs and transforming them into a 1-hot code on $2^p$ bits. Each signal is fed to the corresponding block (e.g., Bank 0 corresponds to the M-bit encoding 00...1, Bank $M-1$ corresponds to 100...0).
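The address splitting performed by block D can be sketched in a few lines. This is only an illustration of the static decoding described above, without the time-varying remapping; the function name `decode` and the polarity of the `sleep` mask (a set bit means the bank is put to sleep) are assumptions of this sketch.

```python
def decode(address, n, p):
    """Split an n-bit address for a cache with M = 2**p banks.

    Returns (bank, offset, select, sleep): `select` is the one-hot code of
    the addressed bank on M bits, `sleep` flags every other bank.
    """
    assert 0 <= p <= n and 0 <= address < (1 << n)
    offset = address & ((1 << (n - p)) - 1)   # n - p LSBs, routed to every bank
    bank = address >> (n - p)                 # p MSBs select the bank
    select = 1 << bank                        # one-hot: bank 0 -> 0...01
    sleep = ~select & ((1 << (1 << p)) - 1)   # all banks except the active one
    return bank, offset, select, sleep
```

For example, with n = 4 and p = 2, the address 0b1101 maps to bank 3 at offset 1, and the three remaining banks are flagged for the low-Vdd state.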

Experimental results show that time-varying re-indexing significantly improves the lifetime of power-managed caches: the above architecture provides average aging improvements between 22% (for the worst configuration) and 2x (for the best one) with respect to a monolithic cache, compared to a mere 9% improvement obtained with a conventional power-managed cache architecture.

Application specific partitioning:

The two approaches above are based on having the whole cache die at a given time, with dynamic indexing providing the maximum aging benefit at the cost of hardware overhead and the partitioned architecture giving an optimal result with minimum hardware overhead.

However, the technique of application-specific partitioning [18] can provide a better tradeoff between aging benefit and hardware overhead by allowing different blocks of memory to age at different rates. This implies that some cache block will become unreliable first, and the cache will keep on functioning with a reduced efficiency (or equivalently, as a progressively smaller cache). As soon as one block dies, the corresponding lines are simply forced to become permanently invalid: any further access to these lines will result in a miss.

This approach positively affects aging because splitting into multiple banks distributes the worst case over several instances of the memory. As a simple, intuitive explanation, consider that a memory array fails in its entirety as soon as one single SRAM cell fails (that is, it cannot be safely read or written); if we split the memory into, say, two banks, only one of the two will fail (the one containing the earliest failing cell), while the second one will last longer (until the earliest failing cell in the second block fails).

Figure 28. DVS-Based non uniform multi-bank architectures.

Let’s assume a scratchpad memory with $L = 2^n$ lines $(l_0, \ldots, l_{L-1})$, where $n$ is the number of the index bits of the memory address. We want to split the memory into $M$ blocks $B_0, \ldots, B_{M-1}$, of sizes $S_0, \ldots, S_{M-1}$, addressed using $n_0, \ldots, n_{M-1}$ bits, respectively. Figure 28 shows the conceptual architecture and the relevant quantities.

The architecture assumes the use of voltage scaling for implementing the low-energy states of the blocks (denoted by the dotted signal from the dual supply voltage selector). Voltage scaling is the viable choice for a scratchpad memory because it preserves the contents of the memory block in the standby state with a better energy/performance tradeoff [14]. The decoding block Dec in the figure serves two purposes: remapping the address onto the proper block and asserting the standby signals for the M blocks.

Another implication of this graceful aging is that a proper aging metric is required for a fair

comparison against previous solutions. To this purpose, we introduce the concept of Effective

LifeTime (ELT), that is, the product of lifetime and size of a memory block. ELT

conceptually measures for how much time a memory block of a given size can be used. Figure

29 pictorially describes the concept of ELT.


Figure 29. Effective lifetime.

The solid line enclosing the filled area denotes the aging profile of a regular memory: M words are reliably usable for an amount of time equal to LT1. By using solutions such as [13], [14], the lifetime can be extended up to LT2, but the memory still becomes unusable as a whole (dotted line). With the approach of [18] (dashed line), the entire memory is usable until LT3a, which equals the original lifetime LT1; the earliest failing block is then disabled, so that a smaller memory (say, M′ < M words) can still be used for a longer time (up to LT3b). Figure 29 shows a simplified case in which the cache is partitioned into only two blocks. The ELT is the area below the various aging profiles. Thus, the rationale is that, depending on the idleness distribution and on how we partition the memory, it is possible that

$$ELT_3 = M \cdot LT_{3a} + M' \cdot (LT_{3b} - LT_{3a}) > ELT_2 = M \cdot LT_2$$
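The ELT comparison can be reproduced numerically by treating each aging profile as a piecewise-constant curve and computing the area below it. The helper below is a minimal sketch; the function name and all sizes and lifetimes are hypothetical values chosen for illustration.

```python
def effective_lifetime(profile):
    """Area under a piecewise-constant aging profile.

    profile: list of (usable_words, end_time) pairs with increasing end_time;
    each size is usable from the previous end_time up to its own end_time.
    """
    elt, start = 0.0, 0.0
    for words, end in profile:
        elt += words * (end - start)
        start = end
    return elt


M, M_small = 1024, 512             # hypothetical sizes in words (M and M')
LT1, LT2, LT3b = 3.0, 4.0, 6.0     # hypothetical lifetimes in years; LT3a = LT1
elt2 = effective_lifetime([(M, LT2)])                    # whole memory until LT2
elt3 = effective_lifetime([(M, LT1), (M_small, LT3b)])   # graceful degradation
```

With these numbers the gracefully degrading memory reaches an ELT of 4608 word-years against 4096 for the uniformly extended one, even though the full-size memory itself fails earlier.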

ELT-driven partitioning alone already yields significant benefits in terms of both aging and energy with respect to a fixed-size partition such as that of [17], thanks to a better match between the partition sizes and the idleness profile. However, knowledge of the idleness profile can be exploited to further improve both aging and energy at the cost of a small hardware overhead. The basic transformation we implement is to selectively swap addresses across partitions in order to achieve a better overall ELT. This can be easily implemented by modifying the cache indexing function for a few selected addresses.

[Figure 29 axes: memory size (M words, M′ words) versus time (LT1 = LT3a, LT2, LT3b)]


The choice of a possible swap-based strategy depends on its relation with the ELT-driven

partitioning step. There are essentially two options to combine these two phases. The first and

most intuitive is to run the partitioning first and then improve the results of partitioning with a

set of swaps. We call this strategy partition & swap. A second option is to first tweak the

idleness profile with a set of swaps and then find the best partition on that profile. We call this

strategy cluster & partition.

In the following, a detailed algorithm for each of the two strategies is discussed. Both
algorithms are parameterized by k, which denotes the number of swapped addresses.

1) Partition & Swap Strategy: Since both size and minimum idleness concur to determine

ELT, the basic principle behind this strategy is to repeatedly swap the address with the

minimum idleness in the largest block with some address (with a larger idleness) of a smaller

block that dies earlier.

k-Swap (I)
    B ⇐ ELT-DrivenPartitioning (I)
    for l = 1 to k do
        i ⇐ index of the address with the l-th maximum idleness in the earliest failing block
        j ⇐ index of the block with the maximum value of Sj · (m2j − m1j)
        mj ⇐ index of m1j
        if (I[i] > I[mj]) then
            SWAP(I[i], I[mj])
        end if
    end for
    return B

In this strategy, we first obtain the partition B = B0, …, BM−1 with sizes S0, …, SM−1. We then
repeat k times the swap between the address with maximum idleness in the earliest failing
block (i) and the one with the minimum idleness in the block j in which such a swap would
maximize the benefit. The benefit is defined as the product of the size of the block and the
difference between its second and first minimum idleness, Sj · (m2j − m1j). The second factor
represents how much the lifetime of this block would be extended. Clearly, the swap is done
only if it is beneficial (i.e., if we are bringing into block j an address with idleness higher than
the previous minimum m1j).
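The k-Swap steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in the project: the partition is passed in as a list of block sizes (a stand-in for ELT-DrivenPartitioning, which is not reproduced here), the lifetime of a block is assumed to be proportional to its minimum idleness, the earliest failing block is excluded as a candidate for block j, and the function returns the modified profile rather than the partition. The names block_ranges, elt and k_swap are hypothetical.

```python
def block_ranges(sizes):
    """Turn block sizes into (start, end) index ranges over the profile."""
    ranges, start = [], 0
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges


def elt(idleness, sizes):
    """Assumed ELT model: sum over blocks of size times minimum idleness."""
    return sum((e - s) * min(idleness[s:e]) for s, e in block_ranges(sizes))


def k_swap(idleness, sizes, k):
    """Return a copy of the profile after at most k beneficial swaps."""
    prof = list(idleness)
    blocks = block_ranges(sizes)
    for _ in range(k):
        # Earliest failing block: the one with the lowest minimum idleness.
        first = min(range(len(blocks)),
                    key=lambda b: min(prof[blocks[b][0]:blocks[b][1]]))
        fs, fe = blocks[first]
        i = max(range(fs, fe), key=lambda a: prof[a])

        # Block j maximizing S_j * (m2_j - m1_j), excluding the failing block.
        best_block, best_gain = None, 0
        for b, (s, e) in enumerate(blocks):
            if b == first or e - s < 2:
                continue
            m1, m2 = sorted(prof[s:e])[:2]
            gain = (e - s) * (m2 - m1)
            if gain > best_gain:
                best_block, best_gain = b, gain
        if best_block is None:
            break  # no swap can extend any block lifetime

        s, e = blocks[best_block]
        mj = min(range(s, e), key=lambda a: prof[a])
        if prof[i] > prof[mj]:  # swap only if beneficial
            prof[i], prof[mj] = prof[mj], prof[i]
    return prof


profile = [1, 9, 8, 7, 3, 6, 6, 6]   # hypothetical idleness values
sizes = [4, 4]                        # stand-in for ELT-DrivenPartitioning
swapped = k_swap(profile, sizes, k=1)
print(elt(profile, sizes), elt(swapped, sizes))   # 16 28
```

In this toy profile a single swap moves the value 9 out of the earliest failing block and into the second block, raising the minimum of the second block from 3 to 6 while leaving the minimum of the first block unchanged.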

2) Cluster & Partition Strategy: The rationale behind this strategy is the observation that
ELT-driven partitioning would provide ideal results if the idleness profile I were sorted in
non-decreasing order. Since sorting the entire profile would require an excessive number of
swaps, the algorithm we implement under this strategy (called k-min clustering) identifies the
k minima in the idleness profile and swaps them with the addresses at the beginning or at the
end of the profile (first or last k addresses). Then, the ELT-driven partitioning is applied to the
modified idleness profile. The following pseudo-code shows the steps involved in this process.

k-MinClustering (I)
    (j1, …, jk) ⇐ indices of the first k minima
    i ⇐ 0
    for l = 1 to k do
        SWAP(I[i], I[jl])
        i ⇐ i + 1
    end for
    B ⇐ ELT-DrivenPartitioning (I)
    return B
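The clustering step can be sketched in Python as shown below. This is a hedged illustration rather than the project code: it covers only the variant that moves the minima to the beginning of the profile, it locates each minimum just before swapping it (so that earlier swaps cannot invalidate a precomputed index), and it leaves out the subsequent ELT-driven partitioning. The function name k_min_clustering is hypothetical.

```python
def k_min_clustering(idleness, k):
    """Move the k smallest idleness values to the first k positions,
    using at most k swaps (a partial selection sort)."""
    prof = list(idleness)
    for target in range(min(k, len(prof))):
        # Position of the smallest value not yet placed at the front.
        j = min(range(target, len(prof)), key=lambda a: prof[a])
        prof[target], prof[j] = prof[j], prof[target]
    return prof


profile = [5, 1, 7, 2, 9, 3]          # hypothetical idleness values
clustered = k_min_clustering(profile, k=2)
print(clustered)                      # [1, 2, 7, 5, 9, 3]
```

Assuming that the lifetime of a block tracks its minimum idleness, splitting the clustered profile into blocks of 2 and 4 addresses confines the two lowest values to the small block, so the large block is limited by the value 3 instead of 2.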

Figure 30 illustrates the average aging benefit obtained by the different strategies discussed
above. The experiments use the MiBench applications. PALT is the lifetime extension obtained
by the partitioned cache architecture and PLT is the advantage gained by application-specific
partitioning. The other entries in the graph show the lifetime increment obtained by combining
the optimization algorithms with PLT for different values of k, where k represents the number
of swaps being performed.

On average, PALT improves lifetime by 45%, whereas the advantage in the case of PLT
increases to 125%. The “K-swaps” and “K-min clustering” strategies improve the aging results
by up to 230%.

Figure 30. Effective lifetime.


4 Application to Use-Cases

The following sections briefly describe how the different activities related to the embedded
software and custom hardware optimization techniques are used in the three use cases.

4.1 Use Case 1

Use case 1 will be used to exercise the optimizations related to the embedded software and to
the definition of the custom memory hierarchy. In particular:

the SWAT toolchain will be used to optimize the SW running on the REISC core by adopting
the techniques presented in Section 2 on compiler optimization and code hints, possibly taking
into account application and platform status;

the MMCO tool, with its new extensions for the evaluation of aging, will be used to model,
characterize and optimize the memory hierarchy of the SoC node in order to improve the
energy efficiency and the aging of the device.

Results of applying these strategies to benchmark applications have been reported in
deliverable D1.3.1; the evaluation of the final test case application will be reported in the
deliverables of WP4.

4.2 Use Case 2

4.2.1 DCT - High level synthesis optimizations

The DCT hardware part of Use Case 2 has been synthesized from the behavioral level to RT level
by the PowerOpt tool, resulting in a datapath and a corresponding controller. The critical path
identified during synthesis has a length of 109 cycles and results from data dependencies and
control-flow dependencies. Sequential memory accesses in particular prevent the schedule
from being shorter. The overall design has been estimated to occupy an area of 79908 µm², of
which 14336 µm² belongs to on-chip memories, when synthesized in a 45 nm semiconductor
technology.

The datapath consists of the following functional units, which can be power gated during
runtime as well as during an overall sleep mode:

29 adder components

1 incrementer component

9 subtractor components

In addition to these functional units, the datapath contains 83 registers and two shifters. Most
of the functional units are instantiated at a bitwidth of 32.


If power gating is applied to all sequential units simultaneously, the standby current is
drastically reduced, as shown in Figure 31. As can be seen, the power gating technique
reduces the inherent leakage currents by 98.6%. For this estimate, NMOS sleep transistors
have been chosen because, at the same performance, they can be smaller than PMOS devices.
Further, high-threshold devices have been chosen as sleep devices because they cut off the
leakage currents more effectively.

The state transition costs have been estimated to be completely amortized after 91 ns due to
the large leakage savings. 87.7% of the state transition energy belongs to the RTL-component
state transition costs, 10.4% belongs to the sleep transistors, and the remaining 1.9% is
caused by additional buffers in front of the sleep transistors.

Figure 31: Aggregated leakage currents of functional units in active and sleep state [A]

Besides this global power down, the proposed power-gating-aware synthesis can apply a
cycle-wise power down to the components on an individual basis. To this end, the idle times
of each functional unit are analyzed within the schedule and, in accordance with the
break-even time, a temporally fine-grained power down is applied wherever it reduces power.
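The per-unit decision described above can be sketched as follows. This is a simplified illustration, not the PowerOpt algorithm: the schedule of the unit, the break-even value of 10 csteps and the function names are hypothetical, and an idle gap is considered worth gating only if it is strictly longer than the break-even time.

```python
def idle_gaps(busy_steps):
    """Idle gaps, in controller steps, between consecutive operations."""
    steps = sorted(busy_steps)
    return [b - a - 1 for a, b in zip(steps, steps[1:])]


def gaps_worth_gating(gaps, break_even):
    """Keep the gaps that exceed the break-even time (same unit as gaps)."""
    return [g for g in gaps if g > break_even]


busy = [0, 1, 2, 8, 75, 76]    # hypothetical csteps in which one unit is busy
gaps = idle_gaps(busy)
print(gaps)                                    # [0, 0, 5, 66, 0]
print(gaps_worth_gating(gaps, break_even=10))  # [66]
```

Gaps of length zero correspond to consecutive operations, the gap of 5 csteps is too short to amortize the transition costs, and only the gap of 66 csteps is selected for power down.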

A schedule length of 109 controller steps with 39 functional units in the datapath results in
4251 slots. In 684 of these slots, a functional unit executes an operation. Thus, the average
workload is about 16.1%. In other words, 83.9% of the time the components are idle and
waste leakage power.
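The arithmetic behind these figures can be reproduced with a few lines of Python. All numbers are taken from the text above; only the variable names are introduced here for illustration.

```python
units = {"adder": 29, "incrementer": 1, "subtractor": 9}
schedule_length = 109      # controller steps
busy_slots = 684           # slots in which a unit executes an operation

total_units = sum(units.values())             # 39
total_slots = schedule_length * total_units   # 4251
workload = 100 * busy_slots / total_slots

print(total_units, total_slots)                       # 39 4251
print(round(workload, 1), round(100 - workload, 1))   # 16.1 83.9
```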

The following analysis covers the lengths of the idle times between each pair of functional
operations in the Use Case 2 design. Idle times of zero controller steps represent consecutive
operations with no idle time in between. This case is optimal for power gating as it represents
a clustering of operations. In this use case, 69.2% of all operations are executed consecutively.

[Figure 31 data, estimated leakage currents: I_active = 7.05E-05 A, I_sleep = 9.36E-07 A.]


Figure 32: Histogram of idle-times between operations in Use-Case 2 DCT

The idle-time lengths between the remaining operations are shown in Figure 32. The
average idle period length is 16.3 controller steps. Although the period lengths are widely
distributed and range up to 77 csteps, two clusters exist at the borders. On the one hand, many
idle periods are shorter than 10 csteps. These periods cannot be exploited by the power gating
technique, as the energetic break-even time is not exceeded in most cases. On the other hand,
a second cluster of idle periods exists between 60 and 70 cycles. These idle periods are well
suited for powering down the corresponding functional units.

The break-even times of the subtractor, adder, and incrementer components, implemented in
the freely available Nangate 45 nm technology at the fast process corner, have been evaluated
to be 9-11 cycles at a frequency of 100 MHz, an ambient temperature of 27°C and a supply
voltage of 1.2 V. This implies a drastic reduction of the leakage power of these components
during longer idle times. The proposed cycle-wise power gating synthesis results in a 37.19%
reduction of leakage currents during runtime.
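To relate the break-even times in cycles to absolute time, the conversion at the stated clock frequency can be sketched as below. The helper name cycles_to_ns is hypothetical, and the sketch assumes that one cycle corresponds to one clock period at 100 MHz.

```python
def cycles_to_ns(cycles, freq_hz):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * (1e9 / freq_hz)


low = cycles_to_ns(9, 100e6)
high = cycles_to_ns(11, 100e6)
print(low, high)   # 90.0 110.0
```

Under this assumption, the break-even times of 9-11 cycles correspond to 90-110 ns, the same order as the 91 ns after which the global power down is amortized.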

[Figure 32 plot: commonness versus idle time length [csteps], both axes ranging from 0 to 70; annotations: 511; avg. idle period length: 16.3 csteps; consecutive operations: 69.2%.]


4.3 Use Case 3

Use Case 3 has been designed mainly to exercise the MDA frontend of the
COMPLEX design flow. Activities related to T3.2 and T3.3 are not part of this use case.


5 Summary

This document presented the overall approach for embedded software and custom hardware
optimization and the way the methodologies interact with design space exploration and
run-time management techniques.

The core portion of the document describes the toolchains for the embedded software and

custom hardware optimization in the COMPLEX flow. Each section within the deliverable

presents a description of the proposed methodology and an overview of the toolchain

supporting it.

Finally, Section 4 has shown some sample results of how the three selected use cases are

covered by the optimization toolchains.


6 References

[1] HELMS, D.; HOYER, M.; NEBEL, W.: Accurate PTV, State, and ABB Aware RTL

Blackbox Modeling of Subthreshold, Gate, and PN-junction Leakage. In: Proc. of the

2006 Int’l Workshop on Power and Timing Modeling, Optimization and Simulation

(PATMOS) 4148/2006 (2006), 56–65. http://dx.doi.org/10.1007/11847083. – DOI

10.1007/11847083

[2] HOYER, M; HELMS, D.; NEBEL, W.: Modeling the Impact of High Level Leakage

Optimization Techniques on the Delay of RT-Components. In: Proc. of the 2007 Int’l

Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)

(2007), pp. 171–180

[3] ITRS WORKING GROUP: INTERNATIONAL TECHNOLOGY ROADMAP FOR

SEMICONDUCTORS: Design. http://www.itrs.net/home.html.

[4] L. Benini, L. Macchiarulo, A. Macii, E. Macii, M. Poncino, "Layout-Driven Memory

Synthesis for Embedded Systems-on-Chip," IEEE Transactions on VLSI Systems, Vol.

10, No. 2, pp. 96-105, April 2002.

[5] F. Angiolini, L. Benini, A. Caprara, "An Efficient Profile-Based Algorithm for

Scratchpad Memory Partitioning", IEEE Transactions on Computer-Aided Design,

Nov 2005, Vol. 24, No. 11, pp. 1660-1676.

[6] O. Ozturk, M. Kandemir, "Non-uniform Banking for Reducing Memory Energy

Consumption," DATE'05: Design, Automation and Test in Europe, Mar. 2005,

pp. 814-819.

[7] Mirko Loghi, Olga Golubeva, Enrico Macii, Massimo Poncino, “Architectural Leakage

Power Minimization of Scratchpad Memories by Application-Driven Sub-Banking”.

IEEE Transactions on Computers., Vol. 59, No. 7, pp. 891-904. July 2010.

[8] A. Calimera, E. Macii, M. Poncino, “NBTI-Aware Clustered Power Gating”, ACM

TRANSACTIONS ON DESIGN AUTOMATION OF ELECTRONIC SYSTEMS, Vol.

16, No. 1, November 2010, pp. 3:1–3:25.

[9] A. Calimera, M. Loghi, E. Macii, M. Poncino, “Aging Effects of Leakage

Optimizations for Caches”, ACM/IEEE GLSVLSI-10: IEEE/ACM Great Lakes

Symposium on VLSI, May 2010, pp. 95-98.

[10] L. Zhang, R. P. Dick, “Scheduled Voltage Scaling for Increasing Lifetime in the

Presence of NBTI,” ASPDAC’09, pp. 492–497, Jan. 2009.

[11] A. Calimera, E. Macii, M. Poncino, “NBTI-Aware Power Gating for Concurrent

Leakage and Aging Optimization”, ISLPED ’09: International Symposium on Low

Power Electronics and Design, pp. 127-132, August 2009

[12] A. Ricketts, J. Singh, K. Ramakrishnan, N. Vijaykrishnan, D. K. Pradhan,

“Investigating the Impact of NBTI on Different Power Saving Cache Strategies,”

DATE’10: Design, Automation and Test in Europe, pp 592–597, March 2010

[13] A. Calimera, M. Loghi, E. Macii, M. Poncino, “Dynamic indexing: Concurrent leakage

and aging optimization for caches”, 2010 ACM/IEEE International Symposium on

Low-Power Electronics and Design (ISLPED), pp. 343-348, 18-20 Aug. 2010

[14] A. Calimera, M. Loghi, E. Macii, M. Poncino, “Partitioned cache architectures for

reduced NBTI-induced aging”, DATE 2011: Design Automation and Test in Europe,

pp. 938-943, March 2011.

[15] BORKAR S. et al, 2005. Designing Reliable Systems from Unreliable Components:

The Challenges of Transistor Variability and Degradation, IEEE Micro. 25, 6, 10–16.


[16] ALAM M.A., Reliability- and process-variation aware design of integrated circuits.

Microelectronics Reliability, 48, 8, 1114-1122

[17] KIMIZUKA N., YAMAMOTO T., MOGAMI T., YAMAGUCHI K., IMAI K.,

HORIUCHI T. Impact of bias temperature instability for direct tunneling ultra-thin

gate oxide on MOSFET scaling. Symposium on VLSI Technology, 1999, 73–74.

[18] H. Mahmood, M. Loghi, M. Poncino, E. Macii, Application-Specific Memory

Partitioning for Joint Energy and Lifetime Optimization. DATE’12, Design,

Automation & Test in Europe, March 12th-16th, 2012

[19] S.V. Kumar, K.H. Kim, S.S Sapatnekar, “Impact of NBTI on SRAM read stability and

design for reliability”, ISQED'06, March 2006, pp. 213--218.