FP7-ICT-2009-4 (247999) COMPLEX
COdesign and power Management in PLatform-based design space EXploration
Project Duration: 2009-12-01 – 2012-11-30
Type: IP
WP no.: WP3
Deliverable no.: D3.2.2
Lead participant: POLITO
Final report on embedded software and hardware optimization
Prepared by: Massimo Poncino, Haroon Mahmood (PoliTo), Carlo Brandolese, Gianluca Palermo, William Fornaciari (PoliMi), Sven Rosinger, Kim Grüttner (OFFIS)
Issued by: POLITO
Document Number/Rev.: COMPLEX/POLITO/R/D3.2.2/1.0
Classification: COMPLEX Public
Submission Date: 2012-02-29
Due Date: 2012-02-29
Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)
© Copyright 2012 OFFIS e.V., STMicroelectronics srl., STMicroelectronics Beijing
R&D Inc, Thales Communications SA, GMV Aerospace and Defence SA, SNPS Belgium
NV, EDALab srl, Magillem Design Services SAS, Politecnico di Milano, Universidad de
Cantabria, Politecnico di Torino, Interuniversitair Micro-Electronica Centrum vzw, European
Electronic Chips & Systems design Initiative.
This document may be copied freely for use in the public domain. Sections of it may be
copied provided that acknowledgement is given of this original work. No responsibility is
assumed by COMPLEX or its members for any application or design, nor for any
infringements of patents or rights of others which may result from the use of this document.
COMPLEX/POLITO/R/D3.2.2/1.0 Public
Final report on embedded software and memory optimization
Page 2
History of Changes
ED. REV. DATE PAGES REASON FOR CHANGES
Massimo Poncino 1.0 2012-02-29 56 First release of final version.
Table of Contents
1 Scope of the Document
2 Embedded SW optimization
  2.1 Compiler optimizations exploration
    2.1.1 Optimization space modelling
    2.1.2 Optimizations clustering
    2.1.3 Flow execution
  2.2 Source to source optimization flow
    2.2.1 Optimization hint engine
    2.2.2 Optimization hint rule definition
    2.2.3 Flow execution of the optimization hint engine
    2.2.4 Transformation effectiveness quantitative estimator
    2.2.5 Flow execution of the transformation effectiveness estimator
  2.3 Parametric exploration
    2.3.1 Target independent configuration
    2.3.2 Target dependent configuration
  2.4 Tools
    2.4.1 swat-core-cc
    2.4.2 swat-opt
    2.4.3 swat-tge
3 Custom hardware optimization
  3.1 High level synthesis optimizations
    3.1.1 Technology Selection and Parameter Ranges
    3.1.2 Evaluation of Power Gating Models
    3.1.3 Evaluation of IP-Level Application of Power Management
  3.2 Memory optimization
    3.2.1 Introduction
    3.2.2 Energy Optimization of scratchpad memories
    3.2.3 Concurrent Aging and Energy Optimization of scratchpad memories
4 Application to Use-Cases
  4.1 Use Case 1
  4.2 Use Case 2
    4.2.1 DCT - High level synthesis optimizations
  4.3 Use Case 3
5 Summary
6 References
1 Scope of the Document
This deliverable presents the results from Task T3.2 - Embedded software optimization
(Participants: PoliMi, IMEC - Start: M7 - End: M24) and Task T3.3 – Custom hardware
optimization (Participants: CV, OFFIS, PoliTo - Start: M7 - End: M24) up to M27.
The deliverable is the second and last one describing the optimization activities for embedded
software and for the hardware, and describes the application of these optimization
techniques in the COMPLEX flow ‘in isolation’, without emphasis on their interaction. The
latter is the subject of a different set of deliverables (D3.4, “Intermediate and Final Report on
Design Space Exploration”, and D3.4.3, “Final Report on Design Space Exploration”, for the
hardware optimizations, and D3.5.2, “Final report on Run-Time Management”, for the
software techniques).
The document closely follows the structure of its predecessor (D3.2.1). Sections 2 and 3
describe the methodologies and the toolchains for the embedded software and custom
hardware optimization (both High Level Synthesis and Memory hierarchy optimizations).
Finally, Section 4 shows how the three selected use cases are covered by the optimization
toolchains.
2 Embedded SW optimization
Embedded software optimization has been studied in several different ways. Some approaches
are strictly related to the detailed software estimation and optimization methodologies; others
also involve additional portions of the flow, namely the design space exploration engine
MOST. This section describes the advancements in the implementation of the different
optimization flows:
1. Compiler optimization exploration. Integrates the SWAT detailed software estimation
toolchain with the MOST design exploration engine to find out the best combination
of optimization options offered by the LLVM code transformation tool.
2. Source-to-source transformation. This flow is completely based on the SWAT toolchain
and has the goal of providing "optimization hints" to the developer, suggesting high-
level, potentially beneficial transformations. This toolchain cannot be integrated with
the MOST exploration engine since the suggested transformations are not applied
automatically but rather require manual coding. Two different approaches have been
followed.
a. Qualitative. The first is a qualitative only “optimization hint” engine,
providing indications on the section of code to optimize and on how to
optimize it.
b. Quantitative. The second is an evolution of this first approach, as it estimates
the potential energy reduction associated with a certain transformation. Being
quantitative, this approach requires significant effort to analyze the effect of
transformations and to model them in a quantitative way.
3. Parametric optimizations. This flow integrates the SWAT estimation toolchain with the
MOST exploration engine and operates on source files implementing functions that
depend on compile-time parameters. Typical examples are compiler pragmas (memory
alignment, loop properties, unrolling directives, linker options, etc.) and application-
specific parameters.
4. Application configuration. This flow integrates the SWAT estimation toolchain with the
MOST exploration engine and provides an automated mechanism for the selection of
specific “function implementations” and “processor operating modes”. The two
approaches cover different application aspects.
a. Target independent. This flow assumes that more than one implementation
(referred to as "function mode") is provided for one or more given functions.
Implementations differ w.r.t. functional and non-functional properties. Different
implementations are expected to be executed on the target platform always
operating in the same voltage and frequency conditions.
b. Target dependent. Selected functions are automatically annotated to force the
target to enter specific voltage and frequency operating modes. The best
combination of modes is selected by means of design space exploration trying
to minimize the overall application energy under timing constraints.
Details of each flow are provided in the following.
2.1 Compiler optimizations exploration
This section provides a summary of the proposed optimization approach and reports the
improvements that have been implemented.
2.1.1 Optimization space modelling
The available set of LLVM transformations/optimizations is modeled by the binary vector:
$$T = [\, t_0 \;\; t_1 \;\; \cdots \;\; t_N \,] \tag{1}$$
whose element ti indicates whether the i-th transformation is active or not (for the list of
available transformations see Deliverable D3.2.1). This leads to a very large space to be
explored. The clustering matrix:
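$$R = \begin{bmatrix} r_{0,0} & r_{0,1} & \cdots & r_{0,N} \\ r_{1,0} & r_{1,1} & \cdots & r_{1,N} \\ \vdots & \vdots & \ddots & \vdots \\ r_{K,0} & r_{K,1} & \cdots & r_{K,N} \end{bmatrix} \tag{2}$$
has the goal of grouping transformations. An element $r_{i,j} = 1$ indicates that the transformation
with index j belongs to group i, while a value $r_{i,j} = 0$ means that the transformation j does not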
belong to group i. With clustering, a specific optimization choice is described by the set of
groups that are active, that is by a vector:
$$G = [\, g_0 \;\; g_1 \;\; \cdots \;\; g_K \,] \tag{3}$$
having the same semantics as vector T, but with groups instead of single transformations.
Given a certain choice of groups to be activated, as selected by the exploration flow, the
transformations to enable are simply obtained as:
$$T = G \cdot R \tag{4}$$
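The expansion in equation (4) can be sketched in C as follows. This is a minimal illustration, not the MOST or SWAT implementation: the sizes `NUM_GROUPS` and `NUM_TFORMS` and the function name `expand_groups` are hypothetical, and the product is assumed to be a Boolean one, so that a transformation shared by several active groups is still enabled exactly once.

```c
#include <stddef.h>

/* Hypothetical sizes, chosen only for illustration. */
#define NUM_GROUPS 3   /* rows of R (groups)              */
#define NUM_TFORMS 5   /* columns of R (transformations)  */

/*
 * Expand a group selection g into a transformation selection t.
 * Assumption: the product of equation (4) is Boolean, i.e. t[j] is 1
 * when at least one active group contains transformation j.
 */
static void expand_groups(const unsigned char g[NUM_GROUPS],
                          const unsigned char r[NUM_GROUPS][NUM_TFORMS],
                          unsigned char t[NUM_TFORMS])
{
    for (size_t j = 0; j < NUM_TFORMS; ++j) {
        t[j] = 0;
        for (size_t i = 0; i < NUM_GROUPS; ++i) {
            if (g[i] && r[i][j]) {
                t[j] = 1;
                break;
            }
        }
    }
}
```

With two overlapping active groups, the shared transformation appears once in the result, which is the behaviour one would want when the vector is turned into a list of optimizer flags.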
The original idea of the flow has been extended according to a two-phase approach. The
second phase consists in the exploration over clusters of transformations, as described above
and in more detail in Deliverable D3.2.1. The first phase, on the other hand, operates within
each cluster. Given a cluster
$$g_i = [\, r_{i,0} \;\; r_{i,1} \;\; \cdots \;\; r_{i,N} \,] = \{\, t_j \mid r_{i,j} = 1 \,\} \tag{5}$$
the same flow is used to select the subset of transformations that lead to more efficient code.
Formally, this is equivalent to eliminating those transformations whose effect is negligible on
the specific code. If $t_j$ is such a transformation, then we set $r_{i,j} = 0$. After reducing all groups
according to this procedure, cluster-level optimization is performed.
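The first-phase reduction of a single cluster might look like the sketch below. The document does not say how "negligible" is decided, so the per-transformation gain array and the threshold are assumptions made only for illustration; `CLUSTER_TFORMS` and `prune_cluster` are hypothetical names.

```c
#include <stddef.h>

#define CLUSTER_TFORMS 4  /* hypothetical number of transformations */

/*
 * First-phase reduction of one cluster (one row of the clustering matrix).
 * gain[j] is a hypothetical relative improvement measured when
 * transformation j is enabled; members whose gain is below `threshold`
 * are treated as negligible and dropped from the cluster.
 * Returns the number of transformations left in the cluster.
 */
static size_t prune_cluster(unsigned char row[CLUSTER_TFORMS],
                            const double gain[CLUSTER_TFORMS],
                            double threshold)
{
    size_t kept = 0;
    for (size_t j = 0; j < CLUSTER_TFORMS; ++j) {
        if (row[j] && gain[j] < threshold)
            row[j] = 0;               /* clear the membership entry */
        kept += row[j] ? 1u : 0u;
    }
    return kept;
}
```

Entries that were already 0 are left untouched, so the function only ever shrinks a cluster.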
The setup of the MOST-based optimization flow is depicted in Figure 1. Inputs of the flow
are the source files, the set of compiler options and a model of the target architecture.
[Figure 1 diagram. Elements: x.c, Optimization, Optimization Options, Front-End, Back-End, CPU model, MOST, Design Space: optimizations list, x.opt.c. Estimates: t = estimated time, e = estimated energy, s = estimated size.]
Figure 1: General MOST-based optimization flow
The output can have different forms, namely:
1. A host-executable binary file
2. A target-executable binary file
3. A list of compilation options
4. A rewriting of the C source code.
It must be noted that, in the last case, the SWAT flow uses the LLVM experimental C
language back-end, which generates C code that is hardly readable, as it is the effect of
translation of assembly code back to very simple C statements.
2.1.2 Optimizations clustering
This section summarizes the clusters that have been constructed to perform the two-phase
transformation selection exploration.
Control Flow
-abcd                           Remove redundant conditional branches
-break-crit-edges               Break critical edges in CFG
-block-placement                Profile Guided Basic Block Placement
-insert-edge-profiling          Insert instrumentation for edge profiling
-insert-optimal-edge-profiling  Insert optimal instrumentation for profiling
-jump-threading                 Thread control through conditional blocks
-mergereturn                    Unify function exit nodes
-lowerswitch                    Lower SwitchInst's to branches
-sink                           Code Sinking
-simplifycfg                    Simplify the CFG

Functions
-always-inline                  Inliner for always_inline functions
-argpromotion                   Promote 'by reference' arguments to scalars
-codegenprepare                 Prepare a function for code generation
-deadargelim                    Dead Argument Elimination
-functionattrs                  Deduce function attributes
-inline                         Function Integration/Inlining
-ipconstprop                    Interprocedural constant propagation
-ipsccp                         Interprocedural Sparse Conditional Constant Propagation
-mergefunc                      Merge Functions
-partial-inliner                Partial Inliner
-partial-specialization         Partial Specialization
-sretpromotion                  Promote sret arguments
-tailcallelim                   Tail Call Elimination
-tailduplicate                  Tail Duplication

Constants
-constmerge                     Merge Duplicate Global Constants
-constprop                      Simple constant propagation
-ipconstprop                    Interprocedural constant propagation
-ipsccp                         Interprocedural Sparse Conditional Constant Propagation
-sccp                           Sparse Conditional Constant Propagation

Variables & Expressions
-argpromotion                   Promote 'by reference' arguments to scalars
-globaldce                      Dead Global Elimination
-globalopt                      Global Variable Optimizer
-gvn                            Global Value Numbering
-mem2reg                        Promote Memory to Register
-reg2mem                        Demote all values to stack slots
-scalarrepl                     Scalar Replacement of Aggregates
-reassociate                    Reassociate expressions
-split-geps                     Split complex GEPs into simple GEPs

Basic Blocks
-adce                           Aggressive Dead Code Elimination
-dce                            Dead Code Elimination
-die                            Dead Instruction Elimination
-dse                            Dead Store Elimination
-instcombine                    Combine redundant instructions
-sink                           Code Sinking

Loops
-indvars                        Canonicalize Induction Variables
-lcssa                          Loop-Closed SSA Form Pass
-licm                           Loop Invariant Code Motion
-loop-deletion                  Dead Loop Deletion Pass
-loop-extract                   Extract loops into new functions
-loop-extract-single            Extract at most one loop into a new function
-loop-index-split               Index Split Loops
-loop-reduce                    Loop Strength Reduction
-loop-rotate                    Rotate Loops
-loop-unroll                    Unroll loops
-loop-unswitch                  Unswitch loops
-loop-simplify                  Canonicalize natural loops

Lowering
-lowerallocs                    Lower allocations from instructions to calls
-loweratomic                    Lower atomic intrinsics
-lowerinvoke                    Lower invoke and unwind, for unwindless code generators
-lowersetjmp                    Lower Set Jump
-lowerswitch                    Lower SwitchInst's to branches
-memcpyopt                      Optimize use of memcpy and friend
-prune-eh                       Remove unused exception handling info
-simplify-libcalls              Simplify well-known library calls
-simplify-libcalls-halfpowr     Simplify half_powr library calls
Finally, the transformations in the following group are always active, since they mostly deal
with the manipulation of the internal representation and do not really have an effect on the
quality of the code.
Always active
-deadtypeelim                   Dead Type Elimination
-internalize                    Internalize Global Symbols
-strip                          Strip all symbols from a module
-strip-dead-prototypes          Remove unused function declarations
-strip-debug-declare            Strip all llvm.dbg.declare intrinsics
-ssi                            Static Single Information Construction
-ssi-everything                 Static Single Information Construction
2.1.3 Flow execution
Execution of the optimization flow is quite straightforward. In the following we suppose that
the transformations are clustered as described in the previous section.
Since the flow is integrated with MOST, which acts as the main tool, two files are necessary:
1. A wrapper script to invoke the actual estimator.
2. An XML file describing the exploration space and the optimization goals.
The script, in particular, wraps the call to the C-to-C SWAT optimization tool based on the
LLVM optimizer opt and the swat-core-ba flow to evaluate execution time and
energy consumption of the code resulting from the application of selected transformations.
Here is an example of the script that has been developed for this purpose.
#__MOST_GENERIC_WRAPPER__# INPUT_TEMPLATE_FILE INPUT_FILE
#__MOST_GENERIC_WRAPPER__# METRIC_NAME OUTPUT_FILE TYPE ADDITIONAL INFO
#__MOST_GENERIC_WRAPPER__output_file__#
execution_cycles
log/reisc_sim.lo
regexp
Executed\s*(\S+)\s*cycles
#__MOST_GENERIC_WRAPPER__output_file__#
instructions
log/reisc_sim.log
regexp
cycles,\s*(\S+)\s*instructions
#__MOST_GENERIC_WRAPPER__output_file__#
code_size
log/stat.log
template
Size:
#!/bin/sh
TARGET_FILE_DIR="/home/complex/UC1/apps/gsm/"
REISC_CONFIG_FILE="/home/complex/UC1/reisc/simple.cfg"
set -e
touch phase1.txt phase2.txt phase3.txt excluded.txt
echo "-indvars -loop-unroll" >> @[email protected]
echo "-inline" >> @[email protected]
echo "-licm -loop-unswitch" >>@[email protected]
echo "-sccp" >> @[email protected]
echo "-mem2reg" >> @[email protected]
echo "-preverify -domtree -verify -lowersetjmp" > opt.cfg
cat phase1.txt phase2.txt phase3.txt >> opt.cfg
echo "-preverify -domtree -verify" >> opt.cfg
rm phase*.txt
mkdir -p log bin opt opt/tmp
swat-opt -config opt.swatcfg -swat-debug > log/swat_opt.log 2>&1
reisc-gcc -O0 -mint32 opt/*.c -o bin/a.out > log/reisc_gcc.log 2>&1
reisc-run -a "--config-file=$REISC_CONFIG_FILE" bin/a.out > log/reisc_sim.log 2>&1
stat bin/a.out > log/stat.log
exit 0
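The metric extraction that the wrapper header delegates to regular expressions can be illustrated with a small C sketch. It assumes that the simulator log contains a line of the form "Executed <n> cycles, <m> instructions"; this format is only inferred from the two patterns in the wrapper and is not confirmed by the document. The function name `parse_sim_line` is hypothetical.

```c
#include <stdio.h>

/*
 * Extract the cycle and instruction counts from one line of the
 * simulator log.
 * Assumption: the line looks like "Executed 1200 cycles, 800 instructions".
 * Returns 1 when both values were read, 0 otherwise.
 */
static int parse_sim_line(const char *line,
                          unsigned long *cycles,
                          unsigned long *instructions)
{
    /* Whitespace in the format matches any amount of input whitespace. */
    return sscanf(line, " Executed %lu cycles, %lu instructions",
                  cycles, instructions) == 2;
}
```

A caller would typically feed the function every line of the log and stop at the first match.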
The exploration space is described by an XML file listing the available parameters to
explore and the optimization goals.
<?xml version="1.0" encoding="UTF-8"?>
<design_space xmlns="http://www.multicube.eu/" version="1.3">
<simulator>
<simulator_executable path="/usr/bin/perl /home/most/wrapper.pl
--execution_config=/home/most/complex/UC1/run.sh.in --timeout=1800" />
</simulator>
<parameters>
<parameter name="loop_unroll" type="string">
<item value="excluded"/>
<item value="phase1"/>
<item value="phase2"/>
<item value="phase3"/>
</parameter>
<parameter name="inline" type="string">
<item value="excluded"/>
<item value="phase1"/>
<item value="phase2"/>
<item value="phase3"/>
</parameter>
...
</parameters>
<system_metrics>
<system_metric name="instructions" type="float" unit="inst"
desired="small" />
<system_metric name="execution_cycles" type="float" unit="cycle"
desired="small" />
<system_metric name="code_size" type="float" unit="Byte"
desired="small" />
</system_metrics>
</design_space>
As mentioned before, the SWAT estimation flow can seamlessly be replaced by the target ISS
to perform a more accurate energy evaluation. This second option, though, suffers from the
drawback that instruction set simulation proved to be more than 400 times slower than
estimation. Since the exploration space is rather large, this strongly encourages
the use of the SWAT estimation toolchain. The optimization and estimation commands run by
the script are the following:
$> swat-core-cc -config opt.swatcfg -swat-debug
$> swat-core-ba -config ba.swatcfg -swat-debug
The tool swat-core-cc performs the actual transformations using the LLVM optimizer. The
set of active transformations is passed to LLVM through specific configuration options in the
opt.swatcfg file. This is thus the input for the transformation and estimation tools and the
output of the MOST engine during exploration. At each step of the exploration process, in
fact, MOST generates a new configuration file.
For the format of the configuration files and a description of the command line options, see
Section 2.4.
2.2 Source to source optimization flow
This section describes the implementation of the source-to-source optimization hint engine
based on the formal formulation provided in Deliverable D3.2.1. Since the optimization hint
engine swat-opt does not perform any source code transformation – which is left as a
manual task to be performed by the developer – it is not possible to “close” the optimization
loop by exploiting the exploration tool MOST.
The second part of the section describes the prototypical implementation of the quantitative
transformation evaluation engine swat-tge. This tool provides a quantitative estimation of
the potential energy saving that might be obtained by applying specific high-level
transformations.
2.2.1 Optimization hint engine
After running the hint engine, the developer is provided with a set of suggestions on where
and how to transform the source code.
Figure 2 shows a simplified view of the flow, where some pre-processing activities have been
omitted and are represented as a whole by the “Front-End” box. It should be noted that the closed
loop actually needs human intervention, as the suggested code transformations are not applied
automatically.
[Figure 2 diagram. Elements: x.c, transformation rules, Optimization Engine, Optimization Hints, Front-End, Back-End, CPU model, "Satisfied?" decision (yes: x.opt.c; no: Manual Code Transformation, apply suggested transformations), x.opt.c. Estimates: t = estimated time, e = estimated energy, s = estimated size.]
Figure 2: General SWAT-based source-to-source optimization flow
2.2.2 Optimization hint rule definition
The grammar used to build the rules is rather general and is described in the following:
rulelist   → rule rulelist
           | rule
rule       → ruleid ‘:’ condition
ruleid     → constant
condition  → term ‘|’ condition
           | term
term       → elem ‘&’ term
           | elem
elem       → comp
           | ‘~’ comp
comp       → ‘%’ ruleid
           | ‘(’ condition ‘)’
           | identifier relop constant
identifier → ‘$’ metricid
           | ‘$’ metricid ‘[’ constant ‘]’
where constant is a terminal symbol indicating a numeric constant, relop stands for a
relational operator and metricid is a terminal symbol whose string value identifies a specific
metric, according to the following tables, which are grouped by the scope the metrics refer to
(function, basic-block, or whole application). The second column indicates whether the metric
is a scalar or a vector. In the latter case, it specifies the meaning of the index.
Function metrics
Identifier       Argument                  Description
fnsize                                     Function size
fnbbsize                                   Function BB size
fnsizeavg                                  Average function size
fnbbsizeavg                                Average function BB size
fncalls                                    Functions called
fncallswgh                                 Weighted functions called
fncallpoints                               Function call points
fninsnstat       Index of the instruction  Function instruction statistics
fnexec                                     Function execution count
fntime                                     Total function execution time
fntimeavg                                  Average function execution time
fndepth                                    Average function depth
fncallpointf                               Function call points frequency
fnregpress                                 Function register pressure
fnclassstat      Index of the class        Function instruction class statistics
fnmempress                                 Function memory pressure
fnstackpress                               Function stack pressure

Basic block metrics
Identifier       Argument                  Description
bbsize                                     Basic block size
bbsizeavg                                  Average basic block size
bbinsnstat       Index of the instruction  Basic block instruction statistics
bbexec                                     Basic block execution count
bbregpress                                 Basic block register pressure
bbclassstat      Index of the class        Basic block instruction class statistics

Application metrics
Identifier       Argument                  Description
aaclassstat      Index of the class        Instruction class statistics
aastackmax                                 Maximum stack size
aainsnstat       Index of the instruction  Instruction statistics
aabbexec                                   Total basic block execution time
aaregpress                                 Register pressure
aamempress                                 Memory pressure
aastackpressave                            Average stack pressure
2.2.3 Flow execution of the optimization hint engine
It is worth noting that swat-opt should not be applied to all the source code, but rather
to selected portions, called “scopes” (see Deliverable D3.2.1), that have been identified as the
most critical parts of the application.
The list of scopes can be obtained using the analysis flow constituted by swat-core-ba and
swat-analyze. In particular, after performing the basic modeling and estimation tasks collected
in swat-core-ba with:
$> swat-core-ba -config myconfig.swatcf -swat-debug
it is necessary to run swat-analyze with the selection options activated, namely:
$> swat-analyze -bb-select -threshold <percent> -cluster *.bbmodel
to select critical basic blocks (loops included), and:
$> swat-analyze -fn-select -threshold <percent> *.bbmodel
to select critical functions. Furthermore, since the optimization engine needs to know which
groups of basic blocks constitute a loop, the following command should be executed for each
critical function:
$> swat-analyze -bb-cfg -loops f1.bbmodel f2.bbmodel ...
Finally, the optimization engine can be run with the command:
$> swat-opt -config opt.swatcfg -swat-debug
For a detailed description of the command line interface and of the configuration options of
the tools, see Section 2.4 and Sections 4.4, 4.5 and 4.6 of Deliverable D2.2.2.
2.2.4 Transformation effectiveness quantitative estimator
This tool provides an estimate of the effectiveness of specific high-level transformations. The
key concept behind this approach is the possibility to estimate how the basic-block models of
the application will be affected by a specific transformation. The simplest and most accurate
approach would be to actually transform the source code, then perform estimations. This is
depicted by Figure 3.
Figure 3: Exact transformation effectiveness estimation approach
The tool, thus, does not perform exact and semantically consistent transformation of the code
(as a source-to-source transformation engine would do) but rather “updates” the underlying
basic block models. This is pictorially represented by Figure 4.
Figure 4: SWAT Transformation effectiveness estimation approach
This approach requires a significant analysis and modeling effort to characterize specific
transformations in terms of resulting basic-block models. From a technical point of view, it is
not possible to express the transformation of the basic-block models strictly in mathematical
form. For this reason we have decided to account for the effect of each transformation by
means of a specific algorithm generating the new basic-block model. Each algorithm is then
compiled into a shared dynamic library loaded at runtime by the core tool.
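To make the idea of "updating" a basic-block model more concrete, the following C sketch shows one possible shape for such an algorithm and for the name-based lookup. Everything in it is an assumption made for illustration: the `bb_model` structure, the `unroll2` model and the static table are hypothetical and far simpler than whatever swat-tge uses, and a real plug-in mechanism would load the function from a shared library rather than from a table.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified basic-block model of a loop. */
struct bb_model {
    unsigned long body_insns;  /* instructions doing useful work       */
    unsigned long ctrl_insns;  /* loop-control instructions            */
    unsigned long exec_count;  /* how many times the block is executed */
};

/* Total number of instructions executed by the block. */
static unsigned long dynamic_insns(const struct bb_model *bb)
{
    return (bb->body_insns + bb->ctrl_insns) * bb->exec_count;
}

/* A model transformation rewrites a basic-block model in place. */
typedef void (*bb_transform_fn)(struct bb_model *bb);

/*
 * Illustrative model of unrolling by a factor of 2: the body is
 * duplicated, the control overhead is paid once per new iteration and
 * the iteration count halves. Remainder iterations are ignored.
 */
static void model_unroll2(struct bb_model *bb)
{
    bb->body_insns *= 2;
    bb->exec_count /= 2;
}

struct tform_entry {
    const char *name;
    bb_transform_fn apply;
};

/* Stand-in for the set of dynamically loaded transformation models. */
static const struct tform_entry tform_table[] = {
    { "unroll2", model_unroll2 },
};

/* Look up a transformation model by name; returns NULL if unknown. */
static bb_transform_fn find_transform(const char *name)
{
    for (size_t i = 0; i < sizeof tform_table / sizeof tform_table[0]; ++i)
        if (strcmp(tform_table[i].name, name) == 0)
            return tform_table[i].apply;
    return NULL;
}
```

For a loop with 10 body instructions, 2 control instructions and 100 iterations, this model predicts a drop from 1200 to 1100 executed instructions, which is the kind of figure a quantitative estimator could then translate into time and energy.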
2.2.5 Flow execution of the transformation effectiveness estimator
The tool implementing this idea is currently in a very preliminary phase of development, as it
was not originally foreseen in the project. We nevertheless decided to explore this idea mainly
to support the optimization hint engine, rather than replacing it completely.
The tool is run with the following command line:
$> swat-tge -config tge.swatcfg -tform <name> -swat-debug
The configuration file, at present, does not introduce any additional option. Transformations
are explicitly specified on the command line. The name of the transformation is used to select
the specific model transformation dynamic library to be loaded. A complete description of the
interface is not provided here, since the tool is not yet stable enough.
2.3 Parametric exploration
This flow has the goal of finding the combination of the "parameters" of the applications that
maximizes a predefined optimization goal. The kind of parameters that can be explored here
are all supposed to be implemented as macro definitions influencing:
1. The behavior of the compiler. These macros (usually pragmas) are used to modify the
behavior of the compiler in the optimization, code generation and linking phases.
2. The behavior of the application. These macros directly influence the behavior of the
application code, by specifying, for example, tolerances, number of iterations,
timeouts, polling frequencies and so on.
We will refer to the former case as “target independent” exploration and to the latter as
“target dependent” exploration.
The optimization flow, shown in Figure 5, combines the MOST exploration engine and the
SWAT estimation toolchain (or the actual instruction-set simulator of the target platform).
[Figure 5 diagram. Elements: x.c, macros.h, Front-End, Back-End, CPU model, MOST, Design Space: optimizations list. Estimates: t = estimated time, e = estimated energy, s = estimated size.]
Figure 5: General setup of the optimization flow for parametric exploration.
The flow has been implemented and tested on small examples. Since the implementation of
the flow basically consists in building ad-hoc wrappers and XML parameter descriptions for
interfacing MOST and the SWAT estimation flow, no additional details need to be provided
here. The form of the XML file and of the wrapper script is similar to that discussed in
Section 2.1.3.
2.3.1 Target independent configuration
For this kind of optimization, we suppose that a given function foo() of the application has
been implemented in different ways, which we refer to as functional modes. Each functional
mode is then subject to conditional compilation under the guard of macro FOO_MODE_<N>,
where <N> is a suffix that unambiguously identifies one of the specific implementations. An
example implementation template for a function with three different modes is provided in
Figure 6.
#if defined( FOO_MODE_1 )
int foo( int x ) {
// Implementation 1
}
#elif defined( FOO_MODE_2 )
int foo( int x ) {
// Implementation 2
}
#elif defined( FOO_MODE_3 )
int foo( int x ) {
// Implementation 3
}
#else
#error Mode not defined.
#endif
Figure 6: Template implementation of a function with three functional modes.
Typical examples of different “functional modes” of a function are related to the different
accuracies of a computation, a floating-point versus a fixed-point implementation of an
algorithm and different trade-offs between local processing and transmission frequency for
sensing functions on a wireless sensor network node.
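As an illustration of the template in Figure 6, the following minimal sketch shows a function with a floating-point mode and a fixed-point mode. The function name scale_sample, the macros SCALE_MODE_1 and SCALE_MODE_2, the gain of 0.75 and the Q8 format are assumptions chosen for illustration only; they are not part of the COMPLEX flow. The default mode is defined only so that the sketch compiles on its own; in the exploration flow the mode macro would presumably come from macros.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical default so that the sketch compiles stand-alone. */
#if !defined(SCALE_MODE_1) && !defined(SCALE_MODE_2)
#define SCALE_MODE_2
#endif

#if defined(SCALE_MODE_1)
/* Mode 1: floating-point implementation (gain 0.75). */
int32_t scale_sample(int32_t x) {
    return (int32_t)(0.75f * (float)x);
}
#elif defined(SCALE_MODE_2)
/* Mode 2: fixed-point implementation, gain 0.75 = 192/256 (Q8 format). */
int32_t scale_sample(int32_t x) {
    assert(x > -(1 << 22) && x < (1 << 22)); /* keeps x * 192 within int32_t */
    return (x * 192) / 256;
}
#else
#error Mode not defined.
#endif
```

Both modes expose the same interface, so the exploration engine can switch between them by changing only the macro definition.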
2.3.2 Target dependent configuration
This second option of parametric exploration has the goal of determining the best combination
of the voltage and frequency operating modes of the target processor. Compared to traditional
approaches, where entire threads, processes or process batches are assigned an operating
mode, the exploration proposed here operates at a much finer-grained level.
Considering a generic application structured as a set of C functions, we first identify the
most critical ones, using the same analysis steps outlined for the optimization hints flow.
These functions then need to be modified manually, but in a trivial way: it is sufficient
to add two macros, one at the beginning and one at the end of the function. Note that if the
function has more than one exit point, the exit macro must be added before each of them.
Figure 7 shows a template of a function instrumented with the macros necessary to enable this
form of parametric exploration and automatic application of the configuration selected by the
exploration engine.
int foo( int x )
{
/* Declarations */
VFMODE_ENTER_FOO
/* Original function body */
/* At each exit point */
VFMODE_EXIT_FOO
return some_var;
}
Figure 7: Modification of a function to support target dependent modes exploration.
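One possible expansion of the two macros is sketched below. The mode identifiers (named after the ReISC operating modes introduced in the following paragraph), the helper set_cpu_mode() and the value assigned to VFMODE_FOO are all hypothetical: the actual definitions depend on the platform support library and on the header generated for each configuration explored by MOST.

```c
#include <assert.h>

/* Hypothetical operating-mode identifiers. */
typedef enum { VF_NORMAL = 0, VF_SNOOZE = 1, VF_SLEEP = 2 } vf_mode_t;

/* Stand-in for the platform call that switches the operating mode;
 * here it only records the request so that the sketch runs on a host. */
static vf_mode_t current_mode = VF_NORMAL;
static int mode_switches = 0;

static void set_cpu_mode(vf_mode_t m) {
    if (m != current_mode) {
        current_mode = m;
        mode_switches++;
    }
}

/* Value that one exploration step could write into macros.h (assumption). */
#define VFMODE_FOO VF_SNOOZE

/* Entry macro: remember the caller's mode, then switch to the mode of foo().
 * Exit macro: restore the caller's mode. */
#define VFMODE_ENTER_FOO vf_mode_t vf_saved_ = current_mode; set_cpu_mode(VFMODE_FOO);
#define VFMODE_EXIT_FOO  set_cpu_mode(vf_saved_);

int foo(int x) {
    int result;              /* Declarations */
    VFMODE_ENTER_FOO
    result = 2 * x;          /* Original function body */
    VFMODE_EXIT_FOO          /* At each exit point */
    return result;
}
```

With this arrangement a different allocation only requires regenerating the header; the instrumented source stays unchanged.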
In the specific case of the ReISC processor, the core provides three operating modes, namely
normal, snooze and sleep. The exploration engine, supported by the SWAT analysis tools
swat-core-tr and swat-analyze, will select for each function the best-suited operating mode of
the target processor. This is done by minimizing the estimated energy consumption under
execution time constraints, either in the form of deadlines for each function or in the form of
an overall timing constraint.
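The selection problem can be illustrated with a small exhaustive search. This is only a sketch under stated assumptions: the per-function time and energy figures are invented, only the overall timing constraint is modeled, and plain enumeration stands in for the exploration strategy actually applied by MOST.

```c
#include <assert.h>
#include <float.h>

#define N_FUNCS 3
#define N_MODES 3   /* 0 = normal, 1 = snooze, 2 = sleep */

/* Hypothetical per-function estimates for each mode (arbitrary units). */
static const double time_cost[N_FUNCS][N_MODES] = {
    {10.0, 14.0, 20.0}, {5.0, 7.0, 10.0}, {8.0, 11.0, 16.0}
};
static const double energy_cost[N_FUNCS][N_MODES] = {
    {30.0, 22.0, 18.0}, {15.0, 11.0, 9.0}, {24.0, 18.0, 14.0}
};

/* Enumerates all N_MODES^N_FUNCS allocations and keeps the one with minimum
 * energy whose total time does not exceed the deadline.
 * Returns the best energy, or -1.0 if no allocation meets the deadline. */
double best_allocation(double deadline, int best[N_FUNCS]) {
    int alloc[N_FUNCS] = {0};
    double best_energy = DBL_MAX;
    int found = 0;
    for (;;) {
        double t = 0.0, e = 0.0;
        for (int f = 0; f < N_FUNCS; f++) {
            t += time_cost[f][alloc[f]];
            e += energy_cost[f][alloc[f]];
        }
        if (t <= deadline && e < best_energy) {
            best_energy = e;
            found = 1;
            for (int f = 0; f < N_FUNCS; f++) best[f] = alloc[f];
        }
        /* Advance to the next allocation (counter in base N_MODES). */
        int f = 0;
        while (f < N_FUNCS && ++alloc[f] == N_MODES) alloc[f++] = 0;
        if (f == N_FUNCS) break;
    }
    return found ? best_energy : -1.0;
}
```

Per-function deadlines could be handled in the same loop by checking time_cost[f][alloc[f]] against a limit for each function before accepting an allocation.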
Again, the structure of the flow is based on the MOST exploration engine and does not
significantly differ from the arrangements discussed so far. The only difference lies in the
core tools of the SWAT framework that are used to perform analysis and estimation.
In particular, the following steps are necessary. First of all, the code of the functions needs to
be modified as described in Figure 7. Then, a static estimation pass must be performed to
derive the execution time and the energy consumption of each basic block of the application.
This is done by means of the swat-core-ba flow. Executing the application with different
processor modes assigned to different functions implies suitably updating the basic-block
models with different costs based on the specific mode the function is assigned.
Since the operating mode of the processor changes over time, depending on the function being
executed and its associated mode, a full trace of the executed basic blocks must be generated.
This is done using the swat-core-tr tool with a specific configuration that includes tracing of
function entry and exit points. This information will be used during analysis to determine
where, in the execution trace, the operating mode is changed. The command to do this is:
$> swat-core-tr -config trbbce.swatcfg -swat-debug
Where the configuration file specifies the required instrumentation rules and support library,
namely:
[trace-bbce]
rules = bbce.rules
library = libswat-tracing.a
binary = executable
execute = true
mode = file
The SWAT tracing core flow will dump the execution trace to a file with extension .t804 (see
Deliverable D2.2.2 for a description of the format of the trace file) listing all the executed
basic blocks and the function entry and exit points. These two passes (static estimation and
tracing) need to be performed only once, before entering the exploration loop managed by
MOST.
The dynamic, mode-dependent, estimation is then performed using swat-trp, the SWAT trace
post-processor. This tool, for the specific trace analysis, requires as input a file specifying one
“allocation” of functions to processor modes. The form of the file is very simple, as it lists all
functions and related operating modes. For a description of the way operating modes can be
assigned to functions, see Section 4.3.9 of Deliverable D2.2.2. This file is the input for the
trace analysis and is the output of MOST. At each step the file describes a different allocation
of functions to modes.
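The principle of the mode-dependent trace analysis can be sketched as follows. The event encoding, the cost tables and the rule that a callee's mode replaces the caller's mode until the callee returns are assumptions made for illustration; they do not reproduce the .t804 format or the actual behaviour of swat-trp, and mode-switch overheads are ignored.

```c
#include <assert.h>
#include <stddef.h>

enum { EV_ENTER, EV_EXIT, EV_BB };             /* hypothetical event kinds */
typedef struct { int kind; int id; } event_t;  /* id: function or basic-block id */

#define N_MODES   3
#define MAX_DEPTH 64

/* Hypothetical per-basic-block cost in each processor mode. */
typedef struct { double time[N_MODES]; double energy[N_MODES]; } bb_cost_t;
typedef struct { double time; double energy; } estimate_t;

/* Walks the trace, tracking the active mode through function entry and exit
 * events, and accumulates the cost of each basic block in the active mode. */
estimate_t analyze_trace(const event_t *trace, size_t n,
                         const bb_cost_t *bb, const int *fn_mode,
                         int default_mode) {
    int stack[MAX_DEPTH];
    int depth = 0;
    int mode = default_mode;
    estimate_t est = {0.0, 0.0};
    for (size_t i = 0; i < n; i++) {
        switch (trace[i].kind) {
        case EV_ENTER:                    /* save caller mode, apply callee mode */
            assert(depth < MAX_DEPTH);
            stack[depth++] = mode;
            mode = fn_mode[trace[i].id];
            break;
        case EV_EXIT:                     /* restore caller mode */
            assert(depth > 0);
            mode = stack[--depth];
            break;
        default:                          /* basic block executed in current mode */
            est.time += bb[trace[i].id].time[mode];
            est.energy += bb[trace[i].id].energy[mode];
            break;
        }
    }
    return est;
}
```

Because the trace and the per-block costs are fixed, only the fn_mode table changes between exploration steps, which is why the static estimation and tracing passes need to run only once.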
Furthermore the tool needs a specific entry in the configuration file indicating the energy and
timing characterization of the processor modes. This is specified as:
[target]
cpu-modes = resic.modes
At this point the analysis tool can be run:
$> swat-trp -config alloc.swatcfg -fn-allocation
-allocation-file <prj>.alloc
-trace <prj>.t804 -swat-debug
The outputs generated are the estimated execution time and energy consumption of the
application configured as described in the allocation file. These figures are used by MOST to
select different allocations until the best one is found.
2.4 Tools
2.4.1 swat-core-cc
This tool implements the core of the C-to-C optimization engine based on the LLVM optimizer.
Synopsis
swat-core-cc <options>
Options
-help
Prints a short description of the tool options.
-version
Prints the tool version.
-swat-debug
Produces a verbose debugging output of the execution.
-config
Specifies the configuration filename.
-output
Specifies the output filename, listing the rules that have triggered.
Configuration file specific options
The configuration file format follows the standard defined for all configuration files used by
the SWAT toolchain as described in Section 4.5.1 of Deliverable D2.2.2. For the specific tool
options the configuration file introduces the additional section [optimization] described
below, and uses the information in the configuration options llvm-ccflags, llvm-
optflags and llvm-optfile found in the standard [compilers] section.
The new section simply allows specifying the output directory where to save the optimized
version of the application. This new version is the input for the estimation flow.
Output-dir = <path>
The output directory.
2.4.2 swat-opt
This tool implements the rule-based optimization hint engine.
Synopsis
swat-opt <options>
Options
-help
Prints a short description of the tool options.
-version
Prints the tool version.
-swat-debug
Produces a verbose debugging output of the execution.
-config
Specifies the configuration filename.
-output
Specifies the output filename, listing the rules that have triggered.
Configuration file specific options
The configuration file format follows the standard defined for all configuration files used by
the SWAT toolchain as described in Section 4.5.1 of Deliverable D2.2.2. For the specific tool
options the configuration file uses the additional section [srcopt] described below.
rules = <string>
The rule file. The file has the .optrules suffix and collects the rules, one per line,
structured according to the grammar presented above.
fn-selection = <fnid> [<fnid>...]
Selected functions to apply the rules on. The argument is a list of function identifiers,
as generated by swat-uniqid.
bb-selection = <bbid> [<bbid>...]
Selected basic-blocks to apply the rules on. The argument is a list of basic-block
identifiers, as generated by swat-uniqid.
lp-selection = (<bbid> [<bbid>..]) [(<bbid> [<bbid>...])...]
Selected loops to apply the rules on. The argument is a list of loops enclosed in parentheses,
each loop being in turn a list of basic-block identifiers, as generated by swat-uniqid. The
list of loops can be obtained using swat-analyze with the options -bb-cfg -loops, as
described in Section 4.3.1 of Deliverable D2.2.2.
2.4.3 swat-tge
This tool implements the quantitative transformation effectiveness estimator.
Synopsis
swat-tge <options>
Options
-help
Prints a short description of the tool options.
-version
Prints the tool version.
-swat-debug
Produces a verbose debugging output of the execution.
-config
Specifies the configuration filename.
-tform <name>
Specifies the transformation to be analysed. The algorithmic transformation model is
implemented in the library tge_<name>.so.
3 Custom hardware optimization
3.1 High level synthesis optimizations
Modeling the dominant effects of Register Transfer (RT)-level components under power
gating, in order to obtain fast and accurate estimates for exploring the HLS design space, is one
of the main contributions of this work. Figure 8 gives a simplified overview of the modeling,
estimation, and optimization flow that is further described in this section. A more detailed
description of the overall flow can be found in Deliverable D3.2.1.
The main purpose of the flow is to obtain accurate estimates of four main variables: the leakage
currents in the static on and off states, the energy overhead due to the state transition, and the
break-even time. These values are obtained for each individual RTL component within the design
and are then used, together with the precise parameter values and activity patterns, to estimate its
overall energy consumption.
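The relation between these four quantities can be made explicit with a first-order energy balance. The formulation below is a common textbook one and is an assumption here, not necessarily the exact expression used in the characterization flow. V_DD denotes the supply voltage, I_on and I_off the leakage currents in the on and off state, E_ST the state transition energy overhead, and t_be the break-even time.

```latex
E_{\mathrm{saved}}(t) = V_{DD}\,\bigl(I_{\mathrm{on}} - I_{\mathrm{off}}\bigr)\,t - E_{ST},
\qquad
t_{be} = \frac{E_{ST}}{V_{DD}\,\bigl(I_{\mathrm{on}} - I_{\mathrm{off}}\bigr)}
```

Under this assumption, gating a component saves energy only for idle intervals longer than t_be.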
Figure 8: Visualisation of proposed power-gating modelling, estimation and optimisation flow
The experimental assessment of the accuracy of the developed power gating model needs a fixed
and well-defined environment. For this reason, the technologies used for the evaluation are
selected first, and all model parameters are constrained to a set of discrete
values or a continuous range. The following evaluation then distinguishes between the pure
model evaluation and a presentation of the power management adoption at system level.
3.1.1 Technology Selection and Parameter Ranges
To validate the correctness of the modeling approaches and to demonstrate their generality, a
selection of technologies and parameters has been made. Besides different technology node
sizes, it is important to cover different process corners. Additionally, MTCMOS technologies
should be considered in order to cover sleep transistor implementations in both standard- and
high-threshold design.
Figure 9: Semiconductor technology selection
Figure 9 lists three different technologies for which the characterization was done. The
Nangate free 45nm open source digital cell library technology is a general purpose (GP)
technology based on predictive technology modelcards of the NIMO Group, Arizona State
University. It is freely available and is widely used in the scientific context. It offers three
even process corners (slow-slow, typical-typical, and fast-fast) that are all evaluated
separately. Even means that both PMOS and NMOS devices are equally affected by
variations of fabrication parameters. Further, it is an MTCMOS technology and thus it includes
both standard- and high-VTH transistors. The industrial technologies are also MTCMOS
technologies but their process corner is restricted to the typical case in this evaluation.
Additionally, and in contrast to the Nangate library technology, they are both LP specialized
technologies. These LP technologies inherently have lower leakage currents, and the resulting
power gating break-even time is of a different order of magnitude.
Figure 10: Parameter ranges
Furthermore, a set of different power gating implementation types (referred to as power
gating scheme (PGS)) has been selected. It covers PMOS- as well as NMOS-based sleep
devices, double-cutoff as well as super-cutoff techniques.
Figure 10 lists all parameters of the characterization process and their ranges. The
supply voltage is constrained by the technology whereas the surrounding temperature is
constrained to reasonable values. The gate voltage of the sleep devices that is used in
SCCMOS techniques to enforce a cutoff is specified as an offset to the supply or ground
voltage. It is in the range 0V to 0.1V and thus the sleep signal is in the range of [VDD;
VDD+0.1V] for PMOS-based PGSs and [GND;GND-0.1V] for NMOS-based PGSs. The
sleep transistor width is constrained to a maximum of 10% of the gated component size. The
characterization is also constrained to functional RTL units that are available and supported
by OFFIS’s PowerOpt. Their bitwidths range from 4 to 32 bits in 4-bit steps.
3.1.2 Evaluation of Power Gating Models
During model generation, several methods have been used to compact and simplify the resulting
models. These include compression of lookup tables, exhaustive interpolations in multiple
dimensions, parameter separation, (non-)linear regression techniques, and simplifications to
speed up the model generation. For this reason, the evaluation has to show the quality and the
performance improvements compared to reference estimates. Since silicon measurements are
not available, the reference estimates are obtained by SPICE-based analog circuit
simulation. This is an established approach in both scientific and industrial practice.
The entire characterization is done with Synopsys HSPICE version A-2008.03-SP1 and is
executed on a general purpose Intel Core2Duo machine at 3 GHz. It takes about one day per
semiconductor technology, with transient simulations for the state transition energy and
wakeup models accounting for 98% of the time. Of this, more than 50% is attributable to large
multiplier components. This illustrates the limits of circuit simulation and underlines the
difficulty of predicting the effect of power gating for very large components.
For presenting the absolute and relative accuracy of the models, a Monte-Carlo evaluation has
been applied covering all parameters in the aforementioned ranges and three error measures
have been computed: the maximum relative error for over- and underestimation (XRE), the
mean absolute relative error (MARE), and the relative standard deviation. In the following,
the evaluation results of the models are presented.
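The error measures can be computed from paired model and reference estimates as sketched below. The exact definitions used in the evaluation are not given in this report, so the formulas here are assumptions: the relative error is taken as (model - reference) / reference, and the spread is returned as a variance whose square root gives the standard deviation of the relative errors.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double max_over;   /* largest positive relative error (overestimation) */
    double max_under;  /* most negative relative error (underestimation) */
    double mare;       /* mean absolute relative error */
    double rel_var;    /* variance of the relative errors */
} error_stats_t;

error_stats_t error_stats(const double *model, const double *ref, size_t n) {
    error_stats_t s = {0.0, 0.0, 0.0, 0.0};
    double sum = 0.0, sum_sq = 0.0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        assert(ref[i] != 0.0);
        double re = (model[i] - ref[i]) / ref[i];
        if (re > s.max_over) s.max_over = re;
        if (re < s.max_under) s.max_under = re;
        s.mare += (re < 0.0) ? -re : re;
        sum += re;
        sum_sq += re * re;
    }
    s.mare /= (double)n;
    double mean = sum / (double)n;
    s.rel_var = sum_sq / (double)n - mean * mean;
    return s;
}
```

In a Monte-Carlo evaluation, model[] and ref[] would hold the model estimate and the circuit-simulation result for each randomly drawn parameter combination.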
3.1.2.1 Evaluation of Sleep Transistor Leakage Models
In the remaining leakage current model the supply voltage range is sampled with a rate of
0.1V, the temperature with 20°C, and the gate voltage with a rate of 0.1V, resulting in a total
of 5*6*2 = 60 sampling points for each PGS and technology. Furthermore, the
characterization has been done for an isolated PGS circuitry with a channel width of 1µm.
Figure 11 shows the model errors. As can be seen, the remaining gate- and subthreshold-
leakage currents can be predicted with an average MARE below 1% and a maximum error of
6.5%. On top of this error, the model simplification of assuming the voltage drop across the
sleep transistor to be equivalent to the supply voltage will induce an additional error in terms
of an overestimation of up to 15%.
Figure 11: Errors of the gate- and subthreshold leakage model for locking sleep transistors
Conducting sleep transistors are again modeled at a supply voltage sampling rate of 0.1V,
whereas the gate voltage disappears as a parameter. Since pure gate-leakage currents depend only
slightly on the temperature, a wider sampling step of 50°C can be used for this model,
leading to a total of 5*3 = 15 sampling points. Nevertheless, the temperature remains a
parameter during modeling as it may gain importance in future semiconductor technologies
because of increasing pn-junction leakage currents, which are more dependent on the temperature.
Figure 12 presents the model evaluation results of the gate-leakage model for conducting
sleep devices. The MARE is about 4% for the Nangate free 45nm open source digital cell
library and 1% for the two industrial technologies. In all cases, the model tends to
overestimate the gate-leakage currents because of the quadratic impact of VGS and VGD
while the model linearly interpolates between two adjacent sampling points. Increasing the
supply voltage sampling rate would reduce this overestimation but also enlarge the model.
Additionally, the maximum error is only 18% for the Nangate and even below 4% for the
industrial technologies.
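The systematic overestimation follows from interpolating a convex curve linearly: the chord between two sampling points lies above the curve. The sketch below demonstrates this with a purely quadratic stand-in for the leakage characteristic; the function leak_ref and the voltages used are illustrative assumptions, not characterization data.

```c
#include <assert.h>

/* Hypothetical stand-in for a gate-leakage characteristic: quadratic in V. */
static double leak_ref(double v) { return v * v; }

/* Linear interpolation between two adjacent sampling points v0 < v1. */
double leak_interp(double v, double v0, double v1) {
    assert(v0 < v1 && v >= v0 && v <= v1);
    double w = (v - v0) / (v1 - v0);
    return (1.0 - w) * leak_ref(v0) + w * leak_ref(v1);
}

/* Relative error of the interpolated value with respect to the reference. */
double interp_rel_error(double v, double v0, double v1) {
    return (leak_interp(v, v0, v1) - leak_ref(v)) / leak_ref(v);
}
```

For this quadratic stand-in the error is zero at the sampling points, positive in between, and shrinks when the sampling step is halved, which matches the trade-off between accuracy and model size described above.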
Figure 12: Errors of the gate-leakage model for conducting sleep devices
3.1.2.2 Evaluation of Voltage Drop Models
Figure 13 presents the maximum, mean, and standard deviation errors of the voltage drop
model for the conducting state. The parameters temperature and supply voltage are sampled
with a step width of 20°C and 0.05V. As presented in the charts, the occurring voltage drop
can be predicted with an average error of 1-5%, with maximum overestimates of 25%.
The errors of HVT- and double-gating schemes are larger than those of SVT- and
single-gating schemes because these schemes have higher on-resistances and increase the
dynamic range of the voltage drop that needs to be interpolated by the model. Underestimates,
which would understate the impact of sleep devices, are limited to 5% maximum.
The voltage drop model for the locking state is evaluated as presented in Figure 14. For the
parameters supply voltage, temperature, gate-voltage, and sleep transistor size the model
consists of a 5*2*3*6 = 180-point measuring field. With a mean absolute relative error below
1.5% and a relative standard deviation of at most 2.1% across all technologies, the
accuracy of the model is very high. This accuracy is necessary because the
estimates serve as input to the state transition energy model and strongly affect its predictions.
Figure 13: Errors of the voltage drop model for conducting sleep devices
Figure 14: Errors of the voltage drop model for locking sleep devices
3.1.2.3 Evaluation of State Transition Energy Models
Most of the model evaluation effort has been spent on the state transition energy model
because some large multiplier components cannot be simulated at high bitwidths or in
combination with some PGSs. In these cases, Synopsys HSPICE fails to simulate the
circuits due to high memory demand and failing convergence analyses. To provide a
meaningful analysis of the model, a Monte-Carlo based evaluation performs a total of 1000
randomly chosen transient simulation runs, lasting about two weeks of computation time. The
presented errors are based on the roughly 93% of the simulation runs that finished
successfully and include all model errors induced by the model representation and required
interpolation. In particular, the bitwidth scaling and PGS selection are reflected in the evaluation.
Peak errors have been observed at peak voltage drop errors because of their superlinear
dependency.
Figure 15 summarizes the evaluation results per technology and RTL component. Mean
absolute relative errors below 10%, and mostly even below 5%, have been observed for the
majority of components. Nevertheless, the quality varies. For example, the incrementer
component inc_fast in the Nangate technology stands out with its higher peak errors and
standard deviations. In addition, the model tends to underestimate the state transition energy for
the two multiplier components in different technologies. This suggests that the
matrix structure causes superlinearly increasing wake-up energies. Nonetheless, the
maximum errors remain below 25%, which is considered reasonable, and no further modeling
effort has been spent on these components.
As the temperature is set to the upper bound during characterization, the models predict
only upper-bound estimates. The interpolation table size of the model is 5*2*5 = 50 points
for the model parameters supply voltage, voltage drop, and sleep transistor size.
For the high-level tradeoffs for which the models are intended, the accuracy is
adequate and the speed improvement is the dominant model feature: whereas
a single analog circuit simulation may take up to several hours, the pre-characterized
models can provide thousands of estimates per second.
Figure 15: Error of state transition energy model
3.1.2.4 Evaluation of State Transition Delay Models
Figure 16 presents the wake-up time model evaluation. As can be seen, the mean
errors are mainly below 10%, but peak errors vary considerably and range up to 26%. In particular, the
wake-up delay prediction for the small-type components performs better than for the fast-type
components throughout all technologies. The interpolation table size of the model is as
small as in the ERT SW model because it is based on the same characterization runs.
Figure 16: Errors of state-transition delay model
3.1.2.5 Evaluation of Process Variation on Power Gating
The Nangate semiconductor technology offers circuit level device models of three process
corners. These corners represent the extremes of parameter variations within which a circuit
must operate correctly. Thus, the corners cover the overall spectrum from slowest to fastest
possible devices. In this section, the impact of process variation on power gating is evaluated
exemplarily for a single RTL component.
Figure 17 presents model estimates for power gating relevant parameters that are normalized
to the typical operating case. As can be seen, the voltage drop across the sleep transistor as
well as the state transition energies change only slightly. This is completely different for
the leakage currents and timing behavior. As expected, power gated components that are
fabricated at the fast process corner wake up faster but on the other hand they cause a lot more
leakage currents. In relative terms, the active current of the fast process corner is 2.6 times as
high as that of the typical corner but, while being power gated, the remaining leakage current in
the sleep state is even 5.3 times as high. In absolute terms, however, the amount of reduced leakage
is much higher for the fast corner. Together with the almost constant state change energy,
power gating becomes even more advantageous for designs fabricated at the fast process corner
and less advantageous for the slow process corner.
A break-even time analysis for the Nangate 45nm technology at the fast process corner yields
break-even times (tbe) of less than half the typical-case break-even time.
Figure 17: Normalized model estimates for different process corners to analyze the process variation
impact on power gating
3.1.3 Evaluation of IP-Level Application of Power Management
Every RTL component within a datapath contributes a small fraction to the active and sleep
currents of an overall design and has its individual wake-up energy and time. Further, at
RT-level, each component has its own break-even time. At system-level, all of these parameters
merge into one overall effectivity metric of power gating and result in one global break-even
time that has to be exceeded if all components are cut off simultaneously. This section
evaluates this system-level view of power management in relative comparisons and absolute
numbers with respect to the overall possible savings, the impact of parameters, and the
overhead costs in area and power.
Figure 18 lists design examples and characteristic parameters such as their functional unit
datapath composition after synthesis and cycle count within the schedule. Power gating has
been applied to all of the designs with HVT NMOS sleep devices, which are the most
common in today’s practice. The fourth and fifth columns in the table show absolute active
and sleep current numbers of the designs at a fixed supply voltage of 1.0V, an ambient
temperature of 27°C, and based on the Nangate 45nm technology at the typical process
corner. The sleep and active currents are restricted to the functional units (FUs) of the designs
because these are the focus of this analysis. Nevertheless, the FUs make up the dominating part
of the total energy consumption. For example, in the FDCT benchmark, the FUs contribute
68% of the total energy consumption whereas the remaining 32% is split among multiplexers,
registers, controller, and clock tree. As the results show, active state leakage current is
effectively reduced throughout all benchmarks.
Figure 18: Design examples and the effectivity of power gating in a global sleep state
In the following, the FDCT benchmark is analyzed in more depth in order to show the
impact of the continuous parameters temperature and supply voltage as well as the discrete
parameters process corner and PGS selection. For this analysis the Nangate 45nm technology
has been chosen in typical and fast process corner. Furthermore, the HVT version has again
been selected for sleep devices and the sleep device sizes have been fixed to 2% of each RTL
component size. HVT devices require a higher supply voltage. Thus, its range is constrained
to [1.1V;1.3V] whereas the temperature is examined across its whole range of [27°C;127°C].
Figure 19 then shows the gating-switch effectivity as a ratio of sleep/active current and the
break-even time of the overall FDCT design in nanoseconds.
At first, it can be seen that the effectivity of power gating has only a small variance across the
parameter ranges. It becomes only slightly less effective in suppressing leakage currents if the
temperature increases.
The supply voltage also has only a marginal impact on the effectivity. Additionally, there is
only a small variation, between 2% and 4%, among the different power gating schemes. In
other words, leakage is reduced by 96-98% in all cases and, from the point of view of pure
leakage saving, the PGS selection is not particularly relevant if all surrounding parameters are
identical.
Secondly, the break-even time is presented. Unlike the gating effectivity, the break-even time
decreases with increasing temperature and supply voltage. This is because the wake-up time
is much lower and fewer incomplete transitions occur during the state transition. With a factor
of up to four, the variance is also much higher. Furthermore, the PGS selection strongly affects
the break-even time. As can be seen, PMOS schemes have up to two times higher
break-even times. Comparing the two process corners, the break-even time is also about twice as
large for the typical process corner as for the fast process corner.
Figure 19: Comparison of power gating scheme efficiency and dynamic parameter impact
The wake-up time at system-level is given by the maximum RTL component wake-up time if
the supply grid is assumed to be sufficiently dimensioned. Figure 20 shows the wake-up time
of the FDCT benchmark as a function of the temperature and supply voltage parameters for
the aforementioned gating types and process corners.
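The aggregation from RTL components to the system level can be sketched as follows. The maximum rule for the wake-up time is taken from the text; the expression for the global break-even time (total transition energy divided by the total leakage power saved) is a first-order assumption that ignores incomplete transitions and supply-grid effects. The structure and field names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-component characterization (SI units). */
typedef struct {
    double i_active;   /* leakage current when powered, A */
    double i_sleep;    /* remaining leakage current when gated, A */
    double e_trans;    /* energy overhead of one sleep/wake-up cycle, J */
    double t_wakeup;   /* wake-up time, s */
} rtl_comp_t;

/* System-level wake-up time: determined by the slowest component. */
double system_wakeup(const rtl_comp_t *c, size_t n) {
    double t = 0.0;
    for (size_t i = 0; i < n; i++)
        if (c[i].t_wakeup > t) t = c[i].t_wakeup;
    return t;
}

/* First-order global break-even time when all components are gated together. */
double system_break_even(const rtl_comp_t *c, size_t n, double vdd) {
    double e = 0.0, di = 0.0;
    for (size_t i = 0; i < n; i++) {
        e += c[i].e_trans;
        di += c[i].i_active - c[i].i_sleep;
    }
    assert(vdd > 0.0 && di > 0.0);
    return e / (vdd * di);
}
```

Under these assumptions, a component with a large transition energy but small leakage saving raises the global break-even time for the whole design.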
Figure 20: Wake-up time evaluation of the FDCT design
It can be observed that the wake-up time shows a very small variance in the parameter ranges.
It slightly decreases with increasing supply voltage and increases with rising temperature.
Furthermore, at the fast process corner, it is about 20-30% smaller than at the typical
process corner. A comparison of the PMOS and NMOS gating schemes shows that NMOS
schemes are about three times faster in waking up.
3.2 Memory optimization
3.2.1 Introduction
The memory optimization tool aims at optimizing the memory hierarchy of the system under
analysis using total memory energy as a metric; however, the optimization strategy based on
sub-banking being considered for SRAMs is also beneficial in mitigating the aging effects caused by
Negative Bias Temperature Instability (NBTI). This section first summarizes the
assessment of the energy benefits obtained by the memory optimization tool and then
presents the details of techniques to concurrently achieve reduced energy and extended
lifetime.
3.2.2 Energy Optimization of scratchpad memories
Many strategies for reducing dynamic energy of memories proposed in the literature rely on
the paradigm of splitting a memory array ([4], [5], [6]). Section 3.2.1.3 of D3.2.1 explained
how splitting the address space into multiple, independently accessed memory sub-blocks can
provide significant reduction in energy consumption. Memory sub-banking is beneficial for
energy in general because of the non-uniform distribution of accesses to memory locations.
Even a naïve partition into two identical sub-blocks guarantees a sizable reduction of average
energy. The search space of all possible memory partitions can be easily enumerated by
observing that a partition is completely defined by a set of address boundaries (e.g., a
bi-partition can be characterized by the address at which the memory is split in two).
Options for searching the space include a top-down branch-and-bound search [4] or a
bottom-up one based on dynamic programming [4]. This allows solving the problem optimally in
polynomial time in spite of an exponentially-sized search space.
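A minimal sketch of a bottom-up dynamic-programming search over address boundaries is given below. It is written in the spirit of the approach mentioned above and is not the algorithm of [4] or the MEMOPT implementation. The energy model access_energy(), the block granularity and the array limits are assumptions, and the overhead of the bank-selection logic is ignored.

```c
#include <assert.h>
#include <float.h>

#define MAX_BLOCKS 64
#define MAX_BANKS  8

/* Hypothetical energy per access of a bank holding `size` blocks. */
static double access_energy(int size) { return 1.0 + 0.5 * (double)size; }

/* Minimum total access energy when blocks 0..n-1, with access counts acc[],
 * are split into exactly `banks` contiguous banks. Runs in O(banks * n^2). */
double min_partition_energy(const unsigned *acc, int n, int banks) {
    static double best[MAX_BANKS + 1][MAX_BLOCKS + 1];
    unsigned long prefix[MAX_BLOCKS + 1];
    assert(n >= 1 && n <= MAX_BLOCKS);
    assert(banks >= 1 && banks <= MAX_BANKS && banks <= n);

    prefix[0] = 0;                       /* prefix sums of access counts */
    for (int i = 0; i < n; i++) prefix[i + 1] = prefix[i] + acc[i];

    for (int k = 0; k <= banks; k++)
        for (int j = 0; j <= n; j++) best[k][j] = DBL_MAX;
    best[0][0] = 0.0;

    /* best[k][j]: minimum energy for the first j blocks using k banks. */
    for (int k = 1; k <= banks; k++)
        for (int j = k; j <= n; j++)
            for (int i = k - 1; i < j; i++) {   /* last bank = blocks i..j-1 */
                if (best[k - 1][i] == DBL_MAX) continue;
                double e = best[k - 1][i]
                         + (double)(prefix[j] - prefix[i]) * access_energy(j - i);
                if (e < best[k][j]) best[k][j] = e;
            }
    return best[banks][n];
}
```

With a skewed access profile the optimal boundary isolates the frequently accessed blocks in a small bank, which is where most of the saving comes from.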
Section 3.1.3.1 of D3.4.2 presented detailed energy results for the tool in the standalone
MEMOPT version. The following figures show the percentage of energy reduction obtained
by splitting the memory into two partitions. The first set consists of three sample applications
provided with the ReISC distribution, whereas the second set is a subset of the MIBENCH
benchmarks, which are widely used in the embedded systems community. All applications
were compiled using the ReISC toolchain and a fixed set of compiler optimizations. Figure 21
shows results for the ReISC sample applications and Figure 22 shows results for the
MIBENCH kernels.
Figure 21. Energy Reduction on ReISC sample applications.
Figure 22. Energy Reduction on MIBENCH sample applications.
The figures above clearly show the benefit obtained by partitioning the memory into two blocks,
which provides 80 to 85% savings in almost all cases. These savings can be further
increased by increasing the number of partitions.
3.2.3 Concurrent Aging and Energy Optimization of scratchpad memories
3.2.3.1 Overview
Traditionally, power and reliability have been considered as conflicting metrics, since most
design solutions for improving reliability (redundant circuits, strong signals, large devices)
are intrinsically power inefficient. However, the recent emergence of reliability issues in the
form of aging (i.e., temporal drift of performance) of devices has opened a new perspective on
this dichotomy: low-power techniques can also slow down aging. This benefit can be especially
exploited in SRAM structures, which are particularly sensitive to Negative Bias Temperature
Instability (NBTI) effects: given their symmetric structure, they cannot take advantage of
value-dependent recovery.
The most effective solutions rely on the observation that typical power management strategies
(i.e., voltage scaling for dynamic power and power/ground gating for static power) can be
exploited to reduce NBTI-induced aging [10], [11]. Therefore, properly revisiting
power-managed memory/cache architectures according to an aging-related metric can achieve
concurrent energy and aging improvements [12], [13], [14]. In this deliverable, the memory
optimization strategy based on sub-banking, used to obtain energy-efficient SRAM
architectures, is also investigated as a means to extend the lifetime of the memory, and some
additional techniques are presented to further improve the aging benefits.
3.2.3.2 Aging: Background and Preliminaries
Aging of devices has emerged as the latest challenge brought by technology scaling. Thinner
oxide layers and higher electric fields and operating temperatures induce adverse physical and
chemical phenomena that cause the performance of transistors to deteriorate over time.
Deviation from the ideal behaviour of manufactured devices is the most critical downside of
technology scaling beyond the 90nm node. The most evident type of non-ideality is related to
the non-determinism of devices due to process variations [15]. These are mostly due to random
fluctuations of dopant atoms and to the systematic or non-systematic imprecision of the
manufacturing process, and can be viewed as a sort of ’time-zero’, fixed deviation from the
nominal behaviour of each device.
There exists however another, and even more insidious, type of non-ideality resulting from
technology scaling, namely, time-dependent deviations in the operating characteristics of
devices [16]. There are essentially two sources of time-dependent variations: Bias Temperature
Instability (BTI) and Hot Carrier Injection (HCI). These physical/chemical effects result in
the degradation of the oxide, thus causing a drift of the threshold voltage over time.
Bias Temperature Instability (BTI) has emerged as the most critical wear-out mechanism for
MOS transistors below the 100nm node. It manifests itself as a time-dependent, permanent
increase of the threshold voltage Vth of active transistors. Although BTI occurs in both n-type
and p-type devices, at the current technology nodes (i.e., 65nm and 45nm) only pMOS
transistors are significantly affected: the nMOS transistor has a negligible level of holes in
the channel and thus does not suffer from BTI degradation.
NBTI occurs when a pMOS is negatively biased (i.e., a logic ’0’ is applied to the gate of the
pMOS, resulting in Vgs = −VDD), and manifests itself as an increase of the threshold voltage
with time, resulting in the reduction of drive current and noise margin, causing in turn a
degradation of the delay of a device.
The actual amount of degradation depends on several parameters of a device, such as its logic
function, threshold voltage, size, load, and temperature [17]. From the design standpoint,
however, the most important property of NBTI is its dependence on the logic values. The
threshold voltage (and delay) degradation effects occur only when a pMOS device is in its
critical state (the stress state), that is, when a logic ’0’ is applied to the device input. In fact,
when a logic ’1’ is applied, NBTI stress is removed, resulting in a partial recovery
(i.e., a decrease) of the threshold voltage (the recovery state), as depicted in Figure 23.
Figure 23. NBTI effect on pMOS.
The most widely accepted physical model that explains the NBTI phenomenon is the Reaction
Diffusion (R-D) mechanism [16], which explains the temporal shift of Vth in terms of the
breaking of hydrogen-passivated Si-H bonds at the Si/SiO2 interface and the subsequent
diffusion of hydrogen, which induces the formation of interface traps. The generated traps,
which accumulate over time, decrease the electrostatic control of the channel, therefore
resulting in a larger threshold voltage Vth. This trap generation phase is called the stress
phase. When the electrical stress is removed (i.e., Vgs = 0, corresponding to having a logic ’1’
on the pMOS gate input), holes are not present in the channel, thereby avoiding the generation
of new traps, while part of the free hydrogen atoms diffuse back and anneal the broken Si-H
bonds. In this phase, called the recovery phase, the number of interface traps is reduced and
the Vth partially recovers. Figure 23 shows the temporal diagram of a typical NBTI-induced
Vth degradation and recovery sequence.
Experimental data report variation of Vth of about 10-15% per year, depending on the target
technology and electrical or environmental conditions. The delay degradation follows the
same trend as threshold voltage, yet with a smaller magnitude.
A conventionally accepted metric for the aging of an SRAM cell is the Static Noise Margin
(SNM), defined as the minimum DC noise voltage necessary to change the state of an SRAM
cell. NBTI impacts the SNM because it causes the Vth of the pMOS transistors to drift over time
(namely, $V_{th}(t) = K \cdot t^{1/4}$), thus degrading the static characteristics of the two
inverters that form the 6T-SRAM cell. Therefore, after some time, the SNM falls below the
threshold that allows safe storage of data (such a threshold depends on the technology and the
specific design of the memory cell).
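As a rough illustration of the power law above, the sketch below inverts it to estimate when a given drift budget is reached. The constant K and the budget are hypothetical values in arbitrary units; the source does not provide them, and mapping the SNM threshold onto a single Vth drift budget is itself a simplifying assumption.

```python
def time_to_drift(k, drift_budget):
    """Time at which a drift following K * t**(1/4) reaches drift_budget."""
    return (drift_budget / k) ** 4


# Hypothetical numbers, arbitrary units
base = time_to_drift(k=1.0, drift_budget=2.0)
slow = time_to_drift(k=0.5, drift_budget=2.0)
print(base, slow, slow / base)
```

Because of the fourth power, halving the drift rate K in this model stretches the time needed to reach the same budget by a factor of 16.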
Figure 24. Static Noise Margin and aging of SRAM cell.
As a matter of fact, SRAM structures are particularly sensitive to NBTI effects because, given
their symmetric structure, they cannot take advantage of the value-dependent recovery typical
of NBTI. An SRAM cell in fact ages whatever value it stores; therefore the effect of the
dependency on logic values is immaterial in a memory cell. The best-case degradation occurs
when both pMOS transistors exhibit the same amount of degradation, that is, when the cells
store a 0 and a 1 with equal probability [19].
Figure 25. 6T SRAM cell.
In a memory cell (and, by extension, in a memory word), aging therefore occurs regardless of
whether the cell (or word) is accessed or not. In other words, there is a substantial difference
between dynamic power and NBTI aging. In order to “stop” the aging, a memory cell (or
word) must be put into a proper “idle” state whenever it is not
accessed. Therefore, the design of an aging-aware memory structure shares many similarities
with a power management problem: there is a need to implement an appropriate power-down
mechanism for memory words and, by aggregation, blocks of words (banks). The work of [9]
has shown how typical implementations of low-power states (using supply voltage scaling or
power gating) have different cost/benefit tradeoffs in achieving simultaneous energy and aging
benefits.
3.2.3.3 NBTI mitigation techniques
A number of solutions have been proposed in the literature to mitigate NBTI effects in SRAMs,
including methods to equalize cell value probabilities and customized NBTI-resilient cells
designed to minimize the degradation ratio of all pMOS transistors in the cell. Among
them, a class of solutions is based on the exploration of the aging benefits provided by
low-energy states [12], [13], [14]. Power management solutions (based on both DVS and power
gating) were assessed at the architectural level on entire memory blocks in [12].
The impact on aging of classical implementations of standby states is interesting and deserves
further insights.
Impact of Power Gating on Aging: When the sleep transistor is off, each cell is disconnected
from the ground, and both inverters’ outputs will reach the “1” value, i.e., the NBTI-immune
configuration. Notice that this is not a “logic” state: it is due to electrical reasons and cannot
be forced by writing some value in the cell. Based on this property, each line will age
proportionally to the time spent in the active state; this allows aging to be expressed in terms
of the probability Psleep of the Sleep signal (Figure 26-(a)), i.e., how often a cache line is put
into the sleep state.
Impact of Voltage Scaling on Aging: The aging of a pMOS transistor is determined by the
amount of negative bias voltage (i.e., gate-to-source voltage); thus, voltage scaling helps
alleviate aging by supplying a device with a smaller Vdd, which
translates into a smaller Vgs and therefore a smaller magnitude of negative bias. The reduction
of the SNM as a function of Vdd is roughly linear; the degradation of the SNM under a
“drowsy” voltage Vdd,drowsy = 0.4V is about 60% of the degradation at the nominal Vdd.
Both power management schemes behave well in terms of energy savings, along with
appreciable benefits in terms of lifetime extension. Concerning performance, however, the power
gating scheme suffers from a large penalty due to the increased miss rate, which
forces many accesses to be resolved in main memory. So, in terms of energy, lifetime benefit
and performance overhead, the DVS-based scheme is generally more effective.
Once we have identified the most suitable implementation of the standby state for aging
reduction, we are left with the choice of what is the unit of power management (UPM), that is,
the atomic unit that can be turned on and off based on activity.
It makes sense to have this unit coincide with the unit of access (UA), i.e., a memory word or
a cache line.
In a power-managed cache, the idleness of each UPM can therefore be extracted by observing
the idle intervals longer than some breakeven time. Given that idleness can be exploited for
both energy and aging reduction, it is important to observe that different characteristics
of the idleness profile matter for the two metrics. For energy, it is clearly the average value
that matters: energy savings for each line will accumulate proportionally to the idleness of the
line. For aging, conversely, it is the worst case that matters: the line that becomes unusable
first will cause the entire cache to fail. Since, due to the very principle of locality, the
distribution of idleness will in general not be uniform, average and worst case will differ. In
order to exploit idleness for both metrics concurrently, worst-case idleness must be controlled.
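The difference between the two metrics can be sketched numerically. The idleness profile below is hypothetical, and the lifetime model (a line ages only while active, so its lifetime scales as 1/(1 - idleness)) is a simplification that matches the power-gating case described earlier, not a result taken from the cited works.

```python
def idleness_summary(idleness):
    """Return (average, worst-case) idleness of a profile.

    idleness[i] is the fraction of time line i spends in the low-power state.
    """
    return sum(idleness) / len(idleness), min(idleness)


def lifetime_gain(idle_fraction):
    """Lifetime multiplier, assuming aging proportional to active time."""
    return 1.0 / (1.0 - idle_fraction)


# Hypothetical four-line profile: one hot line, three mostly idle lines
profile = [0.9, 0.8, 0.1, 0.6]
average, worst = idleness_summary(profile)
print("energy follows the average idleness:", average)
print("cache lifetime gain (worst line):", lifetime_gain(worst))
print("gain if idleness were uniform:", lifetime_gain(average))
```

In this toy profile the hot line limits the lifetime gain of the whole cache to about 1.1x, whereas the same total idleness spread uniformly would give 2.5x. This is the gap that the techniques listed next try to close.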
The following techniques have been shown to control worst-case idleness:
Dynamic Indexing
Partitioned cache architecture
Application specific partitioning
In the following, we will outline the basic ideas behind each of these strategies.
Dynamic indexing
In dynamic indexing [13], the cache indexing function is modified over time in order to
achieve a uniform distribution of idleness over the cache lines; in this way all the leakage
saving opportunities can also be used for aging reduction.
Since there is an inverse correlation between the idleness of a cache line and its aging (i.e.,
an idle line can be power managed and therefore will not age), such a uniform distribution
guarantees the elimination of the worst case: all cache lines will thus die at the same time.
Fortunately, power (and in particular, static power) optimization techniques, like power
gating and dynamic voltage scaling (DVS), can mitigate NBTI effects in memories.
Power gating has the effect of completely nullifying the aging effects [11]. Similarly, but
with a smaller impact, voltage scaling improves NBTI-induced aging because a reduced Vdd
corresponds to a smaller bias voltage [10].
Figure 26 shows the two cache leakage architectures based on power management. A cache
line is used as the atomic unit of power management. The decision about whether to
turn off a cache line is based on its usage: lines that have not been accessed for a given number
of cycles (the breakeven time, B1) are put into a low-leakage state.
Figure 26. Power-Gated (a) and DVS-Based (b) architectures.
Dynamic indexing applies to caches, in which an address indexing mechanism already exists, but
with minor adaptations it can be applied to generic SRAM blocks.
Uniform-size partitioned cache architecture:
This scheme can be viewed as a coarse-grain extension of dynamic indexing [13], and
implements a uniform-size, multi-bank cache with the purpose of achieving a better design
point in the aging/energy design space.
This technique is based on the idea of partitioning a memory into multiple banks of identical
size. While this organization has been widely used for reducing both dynamic and static
power, its exploitation for aging benefits requires proper management of the existing idleness
of the various banks. This can be achieved by means of a sort of time-varying addressing
scheme in which addresses are mapped to different banks over time in such a way that the
idleness is uniformly distributed over all the banks.
Figure 27. DVS-Based uniform-size multi-bank architectures.
The architecture is shown in Figure 27 for an M-block partitioned cache; solid lines represent
address lines, while dashed ones are “sleep” signals. When a block is unused it is turned off
by asserting the sleep signal, which enables the selection of the low Vdd supply voltage. The
block labeled D implements two functions: remapping the address onto the proper
block and asserting the “sleep” signal for the other blocks. Address signals are simply derived
by routing the n − p LSBs of the address to each sub-block; sleep signals are obtained by
taking the p MSBs and transforming them into a 1-hot code on 2^p bits. Each signal is fed to
the corresponding block (e.g., Bank 0 corresponds to the M-bit encoding 00 . . . 1, and Bank
M − 1 corresponds to 100 . . . 0).
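The static part of the decoder D can be sketched as follows. The function name and return layout are hypothetical, the one-hot lists are indexed by bank number, and the time-varying remapping of addresses to banks is deliberately left out.

```python
def decode(address, n, p):
    """Split an n-bit address for a memory partitioned into 2**p banks.

    Returns (offset, select, sleep):
      offset - the n-p LSBs, routed to every bank
      select - one-hot list on 2**p bits, built from the p MSBs
      sleep  - complement of select: sleep is asserted for all other banks
    """
    if not 0 <= address < (1 << n):
        raise ValueError("address does not fit in n bits")
    offset = address & ((1 << (n - p)) - 1)
    bank = address >> (n - p)
    select = [int(b == bank) for b in range(1 << p)]
    sleep = [1 - bit for bit in select]
    return offset, select, sleep


# 4-bit address, 4 banks (p = 2): 0b1011 -> bank 0b10, offset 0b11
print(decode(0b1011, n=4, p=2))
```

Only the bank addressed in the current cycle stays at the nominal supply; all the others receive the sleep signal and therefore the low Vdd.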
Experimental results show that time-varying re-indexing significantly improves the
lifetime of power-managed caches: the above architecture provides average aging improvements
between 22% (for the worst configuration) and 2x (for the best one) with respect to a
monolithic cache, compared to a mere 9% improvement obtained with a conventional
power-managed cache architecture.
Application specific partitioning:
The two approaches above are based on having the whole cache die at a given time, with
dynamic indexing providing the maximum aging benefit at the cost of hardware overhead and
the partitioned architecture giving optimal results with minimum hardware overhead.
However, the technique of application specific partitioning [18] can provide a better tradeoff
between aging benefit and hardware overhead by allowing different blocks of memory to age
at different rates. This implies that some cache block will become unreliable first, and the
cache will keep on functioning with a reduced efficiency (or equivalently, with a
progressively smaller cache). As soon as one block dies, it simply forces the corresponding
lines to become permanently invalid: any further access to these lines will result in a miss.
This approach positively affects aging because splitting into multiple banks distributes the
worst case over several instances of the memory. As a simple, intuitive explanation, consider
that a memory array will fail in its entirety as soon as one single SRAM cell fails (that is, it
cannot be safely read or written); if we split the memory into, say, two banks, only one of the
two will fail (the one containing the earliest failing cell) while the second one will last longer
(up to the earliest failing cell in the second block).
Figure 28. DVS-Based non uniform multi-bank architectures.
Let’s assume a scratchpad memory with $L = 2^n$ lines $(l_0, \dots, l_{L-1})$, where n is the
number of index bits of the memory address. We want to split the memory into M blocks
$B_0, \dots, B_{M-1}$, of sizes $S_0, \dots, S_{M-1}$, addressed using $n_0, \dots, n_{M-1}$ bits,
respectively. Figure 28 shows the conceptual architecture and the relevant quantities.
It assumes the use of voltage scaling for implementing the low-energy states for the blocks
(denoted by the dotted signal from the dual supply voltage selector). Voltage scaling is the
viable choice for the scratchpad memory, as it preserves the contents of the memory
block in the standby state with a better energy/performance tradeoff [14]. The decoding block
Dec in the figure serves two purposes: remapping the address on the proper block and
asserting the standby signals for the M blocks.
Another implication of this graceful aging is that a proper aging metric is required for a fair
comparison against previous solutions. To this purpose, we introduce the concept of Effective
LifeTime (ELT), that is, the product of lifetime and size of a memory block. ELT
conceptually measures for how much time a memory block of a given size can be used. Figure
29 pictorially describes the concept of ELT.
Figure 29. Effective lifetime.
The solid line enclosing the filled area denotes the aging profile of a regular memory: M
words are usable reliably for an amount of time equal to LT1. By using solutions such as [13],
[14] the lifetime can be extended up to LT2, but the memory still becomes unusable as a
whole (dotted line). With the approach of [18] (dashed line), the entire memory is usable
until LT3a, equal to the original lifetime; the earliest failing block is then disabled, so that we
can still use a smaller memory (say, M′ < M words), yet for a longer time (up to LT3b). Figure 29
shows a simplified case in which the cache is partitioned into only two blocks. The ELT is the
area below the various aging profiles. Thus, the rationale is that, depending on the idleness
distribution and on how we partition the memory, it is possible that
$ELT_3 = M \cdot LT_{3a} + M' \cdot (LT_{3b} - LT_{3a}) > ELT_2 = M \cdot LT_2$
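A small worked example may help to read the inequality. All sizes and lifetimes below are hypothetical (arbitrary time units) and only illustrate how the ELT is computed as the area under an aging profile.

```python
def elt_monolithic(words, lifetime):
    """ELT of a memory that fails as a whole: size times lifetime."""
    return words * lifetime


def elt_graceful(blocks):
    """ELT of a partitioned memory in which each block fails on its own.

    blocks is a list of (size_in_words, lifetime) pairs; the area under the
    staircase aging profile is the sum of the per-block areas.
    """
    return sum(size * lifetime for size, lifetime in blocks)


# Hypothetical values: M = 1024 words, LT1 = LT3a = 1.0, LT2 = 1.5,
# surviving block of M' = 768 words lasting until LT3b = 3.0
M, M_PRIME = 1024, 768
LT2, LT3A, LT3B = 1.5, 1.0, 3.0

elt2 = elt_monolithic(M, LT2)
elt3 = elt_graceful([(M - M_PRIME, LT3A), (M_PRIME, LT3B)])
print(elt2, elt3, elt3 > elt2)
```

Summing the per-block areas gives the same value as the expression M · LT3a + M′ · (LT3b − LT3a), since (M − M′) · LT3a + M′ · LT3b expands to exactly that.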
ELT-driven partitioning alone already yields significant benefits in terms of both aging and
energy with respect to a fixed-size partition as the one of [17], thanks to a better matching
between the partition sizes and the idleness profile. However, the knowledge of the idleness
profile can be exploited so as to further improve both aging and energy, at the cost of a small
hardware overhead. The basic transformation we implement is to selectively swap addresses
across partitions in order to achieve a better overall ELT. This can be easily implemented by
modifying the cache indexing function for a few, selected addresses.
[Figure 29 axes: memory size (M words, M′ words) versus time (LT1 = LT3a, LT2, LT3b).]
The choice of a possible swap-based strategy depends on its relation with the ELT-driven
partitioning step. There are essentially two options to combine these two phases. The first and
most intuitive is to run the partitioning first and then improve the results of partitioning with a
set of swaps. We call this strategy partition & swap. A second option is to first tweak the
idleness profile with a set of swaps and then find the best partition on that profile. We call this
strategy cluster & partition.
In the following, a detailed algorithm is discussed for each of the two strategies. Both
algorithms are parameterized by k, which denotes the number of swapped
addresses.
1) Partition & Swap Strategy: Since both size and minimum idleness concur to determine
ELT, the basic principle behind this strategy is to repeatedly swap the address with the
minimum idleness in the largest block with some address (with a larger idleness) of a smaller
block that dies earlier.
k-Swap (I)
  B ⇐ ELT-DrivenPartitioning (I)
  for l = 1 to k do
    i ⇐ index of address with l-th maximum idleness in the earliest failing block
    j ⇐ index of block with maximum value of S_j ・ (m2_j − m1_j)
    m_j ⇐ index of m1_j
    if (I[i] > I[m_j]) then
      SWAP(I[i], I[m_j])
    end if
  end for
  return B
In this strategy, we first obtain the partition $B = \{B_0, \dots, B_{M-1}\}$ with sizes
$S_0, \dots, S_{M-1}$. We then
repeat k times the swap between the address with maximum idleness in the earliest failing
block (i) and the one with the minimum idleness in the block j in which such a swap would
maximize the benefit. The latter is defined as the product of the size of the block and the
difference between its second and first minimum idleness, $S_j \cdot (m2_j - m1_j)$. The second
factor represents how much the lifetime of this block would be extended. Clearly, the swap is done
only if beneficial (i.e., if we are bringing into Block j an address with idleness higher than the
previous minimum $m1_j$).
2) Cluster & Partition Strategy: The rationale behind this strategy is driven by the
observation that the ELT-driven partitioning would provide ideal results if the idleness profile
I were sorted in non-decreasing order. Since sorting the entire profile would require an
excessive number of swaps, the algorithm we implement under this strategy (called k-min
clustering) identifies the k minima in the idleness profile and swaps them with the addresses
at the beginning or at the end of the profile (first or last k addresses). Then, the ELT-driven
partitioning is applied to the modified idleness profile.
The following pseudo-code shows the steps involved in this process.
k-MinClustering (I)
  (j_1, ..., j_k) ⇐ indices of the first k minima
  i ⇐ 0
  for l = 1 to k do
    SWAP(I[i], I[j_l])
    i ⇐ i + 1
  end for
  B ⇐ ELT-DrivenPartitioning (I)
  return B
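The clustering step alone can be sketched in Python as shown below. This is a variation on the pseudo-code, not the authors' implementation: it looks for each minimum after the previous swap rather than up front, so that a swap never displaces a minimum already identified. The ELT-driven partitioning step is not reproduced, and the idleness values are hypothetical.

```python
def k_min_clustering(idleness, k):
    """Move the k smallest idleness values to the first k positions.

    Returns the modified profile and the list of swapped address pairs,
    i.e. the addresses whose indexing would have to be remapped.
    """
    profile = list(idleness)
    swaps = []
    for i in range(min(k, len(profile))):
        # Index of the minimum among the addresses not yet clustered
        j = min(range(i, len(profile)), key=profile.__getitem__)
        if j != i:
            profile[i], profile[j] = profile[j], profile[i]
            swaps.append((i, j))
    return profile, swaps


# Hypothetical idleness profile of five addresses, k = 2
clustered, swaps = k_min_clustering([0.9, 0.2, 0.8, 0.1, 0.7], k=2)
print(clustered, swaps)
```

In this example a single swap is enough, because the second minimum is already in position 1; the number of remapped addresses is therefore at most k.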
Figure 30 illustrates the average aging benefit obtained by the different strategies discussed
above, using MIBENCH applications as benchmarks. PALT is the lifetime
extension obtained by the partitioned cache architecture and PLT is the advantage gained by
application specific partitioning. The other entries in the graph show the lifetime increment
obtained by combining the optimization algorithms with PLT, for different values of k, where k
represents the number of swaps being performed.
PALT improves lifetime by 45% on average, whereas the advantage in the case of PLT increases to
125%. The “k-swap” and “k-min clustering” strategies improve the aging results by up to
230%.
Figure 30. Effective lifetime.
4 Application to Use-Cases
The following sections briefly describe how the different activities related to the embedded
software and custom hardware optimization techniques are used in the three different use
cases.
4.1 Use Case 1
Use case 1 will be used to exercise the optimizations related to the embedded software and
custom memory hierarchy definition. In particular:
The SWAT toolchain will be used to optimize the SW running on the ReISC core by adopting
the techniques presented in Section 2 on compiler optimization and code hints, possibly taking
into account application and platform status.
The MMCO tool, with its new extensions for the evaluation of aging, will be used to model,
characterize and optimize the memory hierarchy of the SoC node in order to improve the
energy efficiency and the aging of the device.
Results of the application of these strategies to benchmark applications have been reported in
deliverable D1.3.1; the evaluation of the final test case application will be reported in the
deliverables of WP4.
4.2 Use Case 2
4.2.1 DCT - High level synthesis optimizations
The DCT hardware part of Use Case 2 has been synthesized from behavioral level to RT-level
by the PowerOpt tool, resulting in a datapath and a corresponding controller. The critical path
identified during synthesis has a length of 109 cycles and results from data
dependencies and control-flow dependencies. In particular, sequential memory accesses prevent
the schedule from being shorter. The overall design has been estimated to occupy an area of
79908 µm², of which 14336 µm² belongs to on-chip memories, when the design is
synthesized in a 45nm semiconductor technology.
The datapath consists of the following functional units, which can be power gated
during runtime as well as during an overall sleep mode:
29 adder components
1 incrementer component
9 subtractor components
In addition to these functional units, the datapath contains 83 registers and two shifters.
Most of the functional units are instantiated with a bit width of 32.
If power gating is applied to all sequential units simultaneously, the standby current is
drastically reduced, as shown in Figure 31. As can be seen, the power gating technique
reduces the inherent leakage currents by 98.6%. For this estimate NMOS sleep transistors
have been chosen, because at the same performance they can be smaller than PMOS devices.
Further, high-threshold devices have been chosen as sleep devices, as they cut off the leakage
currents more effectively.
The state transition costs have been estimated to be amortized completely after 91ns due to
the large leakage savings. 87.7% of the state transition energy is due to the RTL-component
state transition costs, 10.4% is due to the sleep transistors and the remaining 1.9% is
caused by additional buffers in front of the sleep transistors.
Figure 31: Aggregated leakage currents of functional units in active and sleep state [A]
Besides this global power down, the proposed power-gating aware synthesis can apply a
cycle-wise power down to the components on an individual basis. To this end, the idle times of
each functional unit are analyzed within the schedule and, in accordance with the break-even
time, a temporally fine-grained power down is applied if it is worthwhile in terms of
power reduction.
A schedule length of 109 controller steps with 39 functional units in the datapath
results in 4251 slots. In 684 of these slots, the functional units execute an operation. Thus, the
average workload is about 16.1%. In other words, 83.9% of the time the components are
idling and wasting leakage power.
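The workload figures above follow directly from the numbers reported for the datapath, as the short check below shows. Variable names are illustrative; all values are taken from the text.

```python
# Functional units listed for the DCT datapath
adders, incrementers, subtractors = 29, 1, 9
functional_units = adders + incrementers + subtractors

controller_steps = 109   # schedule length
busy_slots = 684         # slots in which an operation is executed

slots = functional_units * controller_steps
workload = busy_slots / slots

print(functional_units, slots)
print(f"workload: {workload:.1%}, idle: {1 - workload:.1%}")
```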
In the following an analysis of the idle time lengths between each pair of functional
operations is presented as it has been analyzed for the Use-Case 2 design. Idle times of a
length of zero controller steps represent consecutive operations with no idle time in between.
This case is optimal for power gating as it represents a clustering of operations. In this use
case 69.2% of all operations are executed in a consecutive manner.
[Figure 31 data: estimated leakage currents, I_active = 7.05E-05 A, I_sleep = 9.36E-07 A.]
Figure 32: Histogram of idle-times between operations in Use-Case 2 DCT
The idle-time lengths between the remaining operations are shown in Figure 32. The
average idle period length is 16.3 controller steps. Although the period lengths are widely
distributed and range up to 77 csteps, two clusters exist at the extremes. On the one hand, many
idle periods are shorter than 10 csteps. These periods cannot be exploited by the power gating
technique, as the energetic break-even time is not exceeded in most cases. On the other hand, a
second cluster of idle periods exists between 60 and 70 cycles. These idle periods are perfectly
adequate for powering down the corresponding functional units.
The break-even times of subtractor, adder, and incrementer components
implemented in the freely available Nangate 45nm technology at the fast process corner have
been evaluated to be 9-11 cycles at a frequency of 100 MHz, an ambient temperature of 27°C
and a supply voltage of 1.2V. This implies a drastic reduction of the leakage power of these
components during longer times of idleness. The proposed cycle-wise power gating synthesis
results in a 37.19% reduction of leakage currents during runtime.
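The break-even rule can be sketched as a simple filter over the idle periods of a functional unit. The idle-period list below is hypothetical (it only mimics the two clusters described above) and the break-even value of 11 csteps is the upper end of the reported 9-11 cycle range; the actual synthesis flow is certainly more elaborate.

```python
def select_gated_periods(idle_periods, break_even):
    """Keep only the idle periods (in csteps) longer than the break-even time."""
    return [length for length in idle_periods if length > break_even]


# Hypothetical idle periods of one functional unit, in controller steps
idle_periods = [2, 4, 7, 9, 62, 65, 68]
gated = select_gated_periods(idle_periods, break_even=11)
gated_share = sum(gated) / sum(idle_periods)

print(gated)
print(f"share of idle time spent power gated: {gated_share:.1%}")
```

Short idle periods are left untouched because the transition energy would exceed the leakage saved, while the long periods account for most of the idle time in this toy example.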
[Figure 32 axes: commonness versus idle time length [csteps]. Annotations: 511; avg. idle period length: 16.3 csteps; consecutive operations: 69.2%.]
4.3 Use Case 3
Use Case 3 has mainly been designed to exercise the MDA frontend of the
COMPLEX design flow. Activities related to T3.2 and T3.3 are not part of this use case.
5 Summary
This document presented the overall approach for embedded software and custom hardware
optimization and the way the methodologies interact also with design space exploration and
run-time management techniques.
The core portion of the document describes the toolchains for the embedded software and
custom hardware optimization in the COMPLEX flow. Each section within the deliverable
presents a description of the proposed methodology and an overview of the toolchain
supporting it.
Finally, Section 4 has shown some sample results of how the three selected use cases are
covered by the optimization toolchains.
6 References
[1] HELMS, D.; HOYER, M.; NEBEL, W.: Accurate PTV, State, and ABB Aware RTL
Blackbox Modeling of Subthreshold, Gate, and PN-junction Leakage. In: Proc. of the
2006 Int’l Workshop on Power and Timing Modeling, Optimization and Simulation
(PATMOS) 4148/2006 (2006), pp. 56–65. http://dx.doi.org/10.1007/11847083
[2] HOYER, M; HELMS, D.; NEBEL, W.: Modeling the Impact of High Level Leakage
Optimization Techniques on the Delay of RT-Components. In: Proc. of the 2007 Int’l
Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)
(2007), pp. 171–180
[3] ITRS WORKING GROUP: INTERNATIONAL TECHNOLOGY ROADMAP FOR
SEMICONDUCTORS: Design. http://www.itrs.net/home.html.
[4] L. Benini, L. Macchiarulo, A. Macii, E. Macii, M. Poncino, "Layout-Driven Memory
Synthesis for Embedded Systems-on-Chip," IEEE Transactions on VLSI Systems, Vol.
10, No. 2, pp. 96-105, April 2002.
[5] F. Angiolini, L. Benini, A. Caprara, "An Efficient Profile-Based Algorithm for
Scratchpad Memory Partitioning", IEEE Transactions on Computer-Aided Design,
Nov 2005, Vol. 24, No. 11, pp. 1660-1676.
[6] O. Ozturk, M. Kandemir, "Non-uniform Banking for Reducing Memory Energy
Consumption," DATE'05: Design, Automation and Test in Europe, Mar. 2005,pp.
814-819.
[7] Mirko Loghi, Olga Golubeva, Enrico Macii, Massimo Poncino, “Architectural Leakage
Power Minimization of Scratchpad Memories by Application-Driven Sub-Banking”.
IEEE Transactions on Computers., Vol. 59, No. 7, pp. 891-904. July 2010.
[8] A. Calimera, E. Macii, M. Poncino, “NBTI-Aware Clustered Power Gating”, ACM
Transactions on Design Automation of Electronic Systems, Vol.
16, No. 1, November 2010, pp. 3-1 to 3-25.
[9] A. Calimera, M. Loghi, E. Macii, M. Poncino, “Aging Effects of Leakage
Optimizations for Caches”, ACM/IEEE GLSVLSI-10: IEEE/ACM Great Lakes
Symposium on VLSI, May 2010, pp. 95-98.
[10] L. Zhang, R. P. Dick, “Scheduled Voltage Scaling for Increasing Lifetime in the
Presence of NBTI,” ASPDAC’09, pp. 492–497, Jan. 2009.
[11] A. Calimera, E. Macii, M. Poncino, “NBTI-Aware Power Gating for Concurrent
Leakage and Aging Optimization”, ISLPED ’09: International Symposium on Low
Power Electronics and Design, pp. 127-132, August 2009.
[12] A. Ricketts, J. Singh, K. Ramakrishnan, N. Vijaykrishnan, D. K. Pradhan,
“Investigating the Impact of NBTI on Different Power Saving Cache Strategies,”
DATE’10: Design, Automation and Test in Europe, pp. 592–597, March 2010.
[13] A. Calimera, M. Loghi, E. Macii, M. Poncino, “Dynamic indexing: Concurrent leakage
and aging optimization for caches”, 2010 ACM/IEEE International Symposium on
Low-Power Electronics and Design (ISLPED), pp.343-348, 18-20 Aug. 2010
[14] A. Calimera, M. Loghi, E. Macii, M. Poncino, “Partitioned cache architectures for
reduced NBTI-induced aging”, DATE 2011: Design Automation and Test in Europe,
pp. 938-943, March 2011.
[15] BORKAR S. et al, 2005. Designing Reliable Systems from Unreliable Components:
The Challenges of Transistor Variability and Degradation, IEEE Micro. 25, 6, 10–16.
[16] ALAM M.A., Reliability- and process-variation aware design of integrated circuits.
Microelectronics Reliability, 48, 8, 1114-1122
[17] KIMIZUKA N., YAMAMOTO T., MOGAMI T., YAMAGUCHI K., IMAI K.,
HORIUCHI T., Impact of bias temperature instability for direct tunneling ultra-thin
gate oxide on MOSFET scaling. Symposium on VLSI Technology, 1999, 73–74.
[18] H. Mahmood, M. Loghi, M. Poncino, E. Macii, Application-Specific Memory
Partitioning for Joint Energy and Lifetime Optimization. DATE’12, Design,
Automation & Test in Europe, March 12th-16th, 2012
[19] S.V. Kumar, K.H. Kim, S.S. Sapatnekar, “Impact of NBTI on SRAM read stability and
design for reliability”, ISQED'06, March 2006, pp. 213–218.