DISSERTATION
An Energy Aware Framework for Mobile Computing
submitted in fulfillment of the requirements for the academic degree of Doctor of Technical Sciences
to the Technische Universität Wien, Faculty of Electrical Engineering and Information Technology
by
Dipl.-Ing. Naeem Zafar Azeemi
Brigittenauer Lande 224/6643, 1200 Wien
born in Karachi, Pakistan, on 14 August 1968
Matriculation number: 0327346
October 6, 2007
-
Advisor
Univ.Prof. Dipl.-Ing. Dr.techn. Markus Rupp
Technische Universität Wien
Institut für Nachrichtentechnik und Hochfrequenztechnik
Examiner
Univ.Prof. Dr.phil.nat. Christoph Grimm
Technische Universität Wien
Institut für Computertechnik
-
To Amra, Mukashfa and Kunza
-
ABSTRACT
Energy dissipation has been a critical issue for mobile computing systems since their inception. Although a large research investment in low-energy circuit design and hardware-level energy management has led to more energy-efficient architectures, there is a growing realization that energy conservation should be considered more rigorously at higher levels of the system, such as operating systems and applications.
This dissertation puts forth the claim that energy-aware compilation, which improves application quality in terms of both execution time and energy consumption, is essential for high performance mobile computing embedded system design. Our work is a design paradigm shift from the logic gate as the basic silicon computation unit to an instruction running on an embedded processor. Multimedia DSP processors are the most attractive choice for mobile computing system design because they deliver high data throughput at low energy. They use instruction-level parallelism (ILP) in programs to execute more than one primitive instruction at a time. In this work, we exploit the parallelism slack exposed by the native multimedia DSP compilers. We propose an iterative compilation environment to optimize a given 'C' source code. The contribution of our framework is the collaboration of an application profile monitor (APM) with an optimization engine in native multimedia DSP Software Development Environments (SDE). We propose to monitor application behavior at all levels (static, compilation, scheduling, linking and execution). These APMs are later used in an optimization engine to predict optimal code transformation schemes. These schemes are applied successively across the basic code blocks. We propose two methods for the selection of optimization schemes: Gradient Mode Iterative Compilation (GMIC) and Multicriteria Stochastic Iterative Compilation (MSIC). Both schemes are tested on several multimedia applications from diverse domains such as video transcodecs (MPEG2, H-264L), audio transcodecs (G-723, MP3) and bioinformatics (Glimmer, Fgene), to name a few.
Finally, we propose a characterization of application-architecture correlations that supports our claim that ideal performance of a mobile computing system demands a perfect match between hardware capability and program behavior. We present results for 20 multimedia applications evaluated on the TriMedia DSP 1300, the Blackfin DSP ADSP533, and the PIII-850 embedded processor.
Keywords: Energy Aware, Source-to-Source, Multimedia Processor, Workload Characterization.
-
ZUSAMMENFASSUNG (SUMMARY)
Energy consumption has been a decisive factor for mobile computing systems since they first appeared. Although numerous research results have already led to hardware solutions with low energy consumption, it has become clear that energy savings at higher levels, such as operating systems and applications, should be given more consideration.
This dissertation shows that energy-aware compilation reduces execution time and is therefore an essential criterion for ensuring an efficient embedded system for mobile data processing. Our work deals with a new design paradigm that no longer concentrates on individual logic gates as the basic design elements but instead on individual instructions running on an embedded processor. Digital signal processors for multimedia applications are the most cost-effective solution for a mobile data processing system, ensuring optimal data throughput time at low energy demand. They exploit the instruction-level parallelism (ILP) of programs in order to execute several primitive instructions at the same time. In this dissertation, program parallelization is captured with a dedicated monitor. Furthermore, we propose stepwise compilation to optimize the given program code in 'C'. A further contribution is a software environment for analyzing and optimizing applications. Here, program behavior is monitored at several levels (static level, compilation, scheduling, linking, and during execution). These analyses are then used by an optimization program to determine an optimal compiler configuration. Two different methods for selecting the optimization options are presented in this work, namely a gradient method and a stochastic method. Both methods are tested with various multimedia applications from different domains, such as video coding (MPEG2, H-264L), audio coding (G-723, MP3) and bioinformatics (Glimmer, Fgene).
Finally, we propose metrics for capturing the correlation between application and hardware. These support our claim that ideal performance of a mobile data processing system can be achieved only when hardware capacity and program behavior match perfectly. The effectiveness of these metrics is demonstrated on the Trimedia DSP 1300, Blackfin DSP ADSP533 and PIII-850 processors.
Keywords: Energy-aware, source code transformation, embedded systems, multimedia processors, mobile computing, workload characterization
-
ACKNOWLEDGEMENTS
I would like to thank my teacher Khwaja Shamsuddin Azeemi and my parents, who have had a positive effect on me personally and to whom I owe a debt of gratitude for helping, in one way or another, to influence the person I am today.
First and foremost, I thank my supervisor Dr. Markus Rupp for his consistent efforts to draw out my inherent skills so that I could accomplish this task successfully. I appreciate his bottomless patience in technical review and his substantive comments, which improved the readability of the dissertation.
Thanks to my sister Farhi and my brothers Waseem and Nadeem, who provided encouragement in the face of every seemingly impossible task that I faced.
Thanks to Afsar, Sobia, Shams Sahib, Ana Eliza and Liana for their love, support and great understanding, especially during vulnerable moments.
Thanks to my friends, colleagues and acquaintances: Bastian and Martin at the Christian Doppler Laboratory; Sabine from Vienna; Naveed and Saima from Boston; Nadeem and family from San Francisco; and Amir Malik and family from Korea, for their kind assistance and facilitation during the last 45 months.
I would like to acknowledge valuable technical support from Dr. Arpad Scholtz at the Institute of Communications and Radio Frequency Engineering, Dr. Stefan Mahlknecht at the Institute of Computer Technology, and Aneesa Sultan at the Vienna Bio Center.
I am also grateful to Dr. Christoph Grimm for his time and patience in reviewing this manuscript.
-
CONTENTS
1 Introduction
1.1 Motivation
1.1.1 Mobile Embedded System Constraints
1.1.2 IC Fabrication Technology Constraints
1.1.3 Battery Technology Constraints
1.1.4 Architecture-Application Correlation Slacks
1.2 Design Space Exploration
1.3 Thesis Outline
2 Energy-Cycle Aware Compilation Framework (ECACF)
2.1 Energy Saving Techniques - A Review
2.1.1 Fabrication level power reduction
2.1.2 Processor level power reduction
2.1.3 EDA tools level power reduction
2.1.4 Compiler level power reduction
2.1.5 Low power data structures
2.1.6 Idle mode power reduction
2.1.7 Power reduction in distributed computing systems
2.1.8 Power reduction in communication systems
2.1.9 Battery aware power reduction
2.2 Multimedia DSPCPU Architecture
2.2.1 Multimedia Processor Execution Model
2.2.2 Multimedia Processor Operations Overview
2.3 Workload Description
2.3.1 Multimedia Applications
2.3.2 Bioinformatics Workload
2.4 Energy Cycle Aware Compilation Framework Methodology
2.4.1 Application Expression Profile
2.5 Experimental Setup
2.5.1 Related Work for Energy Measurement
2.5.2 Proposed Methodology
2.6 Conclusions
3 Gradient Mode Iterative Compilation (GMIC)
3.1 GMIC Architecture
3.1.1 Performance Qualifier Measurement
3.1.2 Code Block Queuing
3.1.3 Code Block Expression Profile
3.1.4 Transformation Scheme
3.2 Implementation
3.3 Example: Optimization of an MPEG-1 encoder
3.4 Discussion
3.5 Conclusions
4 Multicriteria Stochastic Iterative Compilation (MSIC)
4.1 Introduction
4.2 Model Development
4.2.1 Objects and Constraints
4.2.2 Case Study I - Arbitrary Application
4.2.3 Case Study II - Nonlinear Interpolative Vector Quantization (NLIVQ)
4.3 Performance Comparison with GMIC
4.4 Conclusions
5 Application-Architecture Characterization
5.1 Terminologies
5.1.1 Principal Component Analysis (PCA)
5.1.2 Scree Plot
5.1.3 Box Plot
5.1.4 Scatter Plot
5.1.5 Differential Application Expression Profile (dAEP)
5.2 Application Characterization
5.2.1 Case Study 1
5.2.2 Case Study 2
5.2.3 Case Study 3
5.3 Architecture-Centric Application Characterization
5.4 Conclusions
6 Conclusions
Appendices
A List of Application Expression Profile (AEP) Monitors
B VLIW Descriptor File (VDF) Format
C User Constraints Files (UCF) Format
C.1 UCF for MPEG-1 encoder example in Section 3.3
C.2 UCF for NLIVQ example in Section 4.2.3
D Application Attributes
E List of Acronyms
-
LIST OF FIGURES
1.1 Power consumption for Intel CPUs [1].
1.2 Thermal and power delivery cost in a desktop PC [2].
1.3 Battery technologies and their capacities [3].
1.4 Thesis Structure.
2.1 TriMedia VLIW instruction [4].
2.2 TriMedia functional unit assignment [4].
2.3 Transformation methodology.
2.4 Vertical application profile layers.
2.5 Experimental setup for instruction/program current measurement [5].
2.6 Proposed experimental setup for application current measurement at processor and memory.
2.7 Current consumption for vector quantization (VQ) application execution life cycle.
2.8 CPU core current consumption versus address range for VQ application.
2.9 Memory current consumption versus address range for G-728 audio transcodec.
2.10 CPU core current consumption versus address range for G-728 audio transcodec.
2.11 CPU peripheral current consumption versus address range for G-728 audio transcodec.
3.1 Gradient Mode Iterative Compilation (GMIC) methodology.
3.2 Fraction of JPMO CB in an MPEG-1 application; the code blocks are numbered from fb01 to fb34.
3.3 Fraction of JPMO contributed by code blocks in an MPEG-1 application (a window view for seven blocks).
3.4 GMIC algorithm.
3.5 Heuristic track of CT-Tuple for an MPEG-1 encoder application.
3.6 Heuristic track of CTxy tuple for FFT application.
3.7 Heuristic track of CTxy tuple for IDCT application.
3.8 Heuristic track of CTxy tuple for T64 application.
3.9 Heuristic track of CTxy tuple for M100 application.
3.10 Heuristic track of CTxy tuple for H-264L application.
4.1 A simplified view of framework with multicriteria methodology extension.
4.2 Simplified Genetic Algorithm Model [6].
4.3 Development of fitness function for Case Study 1 in TS1 and TS2.
4.4 Fraction of IPC for Case Study 1 in TS1 and TS2.
4.5 Fraction of IPC and Energy overlapping for Case Study 1 in TS1 and TS2.
4.6 Fraction of CPU cycles for CB life time (CBLT) in NLIVQ application (25 CB are numbered from F01 to F25).
4.7 Development of the fitness function for NLIVQ.
4.8 Fraction of IPC for NLIVQ.
4.9 Fraction of energy saving for NLIVQ.
4.10 Fraction of functional unit utilization for NLIVQ.
5.1 Scatter plot for 20 applications at the TriMedia processor.
5.2 PCA Scree plot for 20 applications at the TriMedia processor.
5.3 PCA box plot for 20 applications at the TriMedia processor.
5.4 PCA biplot for 20 applications at the TriMedia processor.
5.5 Scatter plot for 20 applications at the Blackfin processor.
5.6 PCA biplot for 20 applications at the Blackfin processor.
5.7 Scatter plot for 20 applications at the PIII 850 processor.
5.8 PCA biplot for 20 applications at the PIII 850 processor.
5.9 Differential AEP across three hardware platforms.
5.10 PCA biplot for 20 applications across the TriMedia processor and the Blackfin processor.
5.11 PCA biplot for 20 applications across the Blackfin processor and the PIII 850 processor.
5.12 PCA biplot for 20 applications across the TriMedia processor and the PIII 850 processor.
-
LIST OF TABLES
2.1 Energy reduction techniques for embedded system design.
2.2 Multimedia Benchmarks (Speech Transcodecs).
2.3 Multimedia Benchmarks (Video Transcodecs).
2.4 Multimedia Benchmarks (Audio Transcodecs).
2.5 Generic DSP application Benchmarks [7].
2.6 Test Vectors Characterization.
2.7 Bio-Computation Applications Benchmark.
3.1 Transformation Schemes.
3.2 Gradient Table.
4.1 CBLT in CPU cycles for NLIVQ.
4.2 Achieved CPU cycles (%) in ECHCB of NLIVQ application for TS04, TS07, TS09.
4.3 Sum of absolute difference for TS04, TS07, TS09.
4.4 Performance comparison between GMIC and MSIC.
5.1 MPEGdec profile for successive transformations [8].
D.1 Pseudonyms for 20 applications.
D.2 AEP for optimized 20 applications at the TriMedia processor.
D.3 AEP for optimized 20 applications at the Blackfin processor.
D.4 AEP for optimized 20 applications at the PIII 850 processor.
D.5 dAEP for optimized 20 applications across the TriMedia and the Blackfin processors.
D.6 dAEP for optimized 20 applications across the Blackfin and the PIII 850 processors.
D.7 dAEP for optimized 20 applications across the TriMedia and the PIII 850 processors.
-
1 INTRODUCTION
1.1 Motivation
The growing trend towards untethered ubiquitous computing brings with it many performance related issues. The ideal performance of a mobile computing system demands a perfect match between architecture capability and program behavior. Architecture performance can be enhanced with better hardware technology, smaller Integrated Circuit (IC) geometries, and efficient resource management [9]. In the same vein, the demand for multimedia functions on handheld devices requires enormous computation power to handle large data and program sizes. Efficient architecture utilization, in terms of both energy dissipation and execution time, and optimal application firmware are two important performance metrics for these embedded systems.
Optimal architecture utilization is limited by different design constraints, such as high level system design constraints, fabrication level constraints and battery technology constraints. They are discussed next in more detail.
1.1.1 Mobile Embedded System Constraints
Mobile embedded systems (MES) present unique challenges and opportunities for system-level low-energy design:
• MES are usually severely energy constrained. In particular, handheld devices and airborne and spaceborne systems are typically battery-operated and therefore have a limited energy budget [10]. MES are also typically more time-constrained than portable embedded or general-purpose systems. The challenge is therefore to save energy while guaranteeing temporal constraints.
• Some MES applications, such as avionics, robotics and deep space missions, require systems with small form factors, which in turn mandate low heat dissipation. Since heat is a byproduct of energy dissipation, low-energy system design makes a system more reliable by limiting the heat produced.
• MES are typically over-designed to ensure that temporal deadline guarantees are still met even if all tasks take their Worst-Case Execution Time (WCET). Since, in the average case, tasks do not require their WCET, this redundancy in hardware design makes MES energy inefficient.
In short, system-level techniques can decrease this energy dissipation through the use of energy-aware task scheduling algorithms that preserve temporal constraints.
1.1.2 IC Fabrication Technology Constraints
Integrated circuits in all their incarnations consume electric power. This power is dissipated both by the switching of the devices contained in the IC (such as transistors) and as heat due to the resistivity of the electrical circuits. This is a major consideration in the design of microprocessors and of the embedded systems they are used in [11]. Figure 1.1 shows the power consumption of Intel processors produced over the last two decades [1]. The horizontal axis shows the advancement in IC fabrication technology in terms of chip geometry (in nanometers), while power dissipation is plotted in Watts. Each point is marked with two numbers, giving chip geometry and power consumption, respectively. Points lying on the same vertical line show processors fabricated in the same technology but with different performance; e.g., (350,43) and (350,34.8) correspond to the PII 300 MHz and the PII 233 MHz, respectively. Similarly, the P4 3 GHz was fabricated at 130 nm and consumes 81.9 W, while the later P4 EE 3.40 GHz is fabricated at a smaller geometry of 90 nm and consumes 83.9 W; it was further improved for a higher operating frequency (P4 EE 3.73 GHz) at the same geometry, but at the penalty of an increase in power consumption to 115 W. The increasing trend towards special purpose core processors has further reduced the geometry to 65 nm, at a power consumption of 130 W (for the Intel Core 2 Extreme QX6700). Readers are encouraged to consult [1] [12] [13] for a detailed view of the power versus technology trends realized by various CPU manufacturers.
Attempts to shape the power-geometry envelope (shown as a shoe in Figure 1.1) reach their limits at a fabrication technology of 50 nm, where leakage current starts to dominate the power consumption (discussed further in Chapter 2). Although special purpose core processors have been implemented at 50 nm [14] [12] with a power consumption of 14.5 W (shown at the bottom of the heel in Figure 1.1), their operating frequency is limited to 130 MHz, which is not sufficient to meet the current demand for multimedia processing. The designers' goal of achieving a low-leakage 'heel' in the power-geometry shoe is associated with a high power cost. This cost has two components. The first is the thermal cost, which is associated with keeping the devices below the specified operating temperature limits. Maintaining the integrity of packaging at higher temperatures also requires expensive solutions. The second component is the on-board power delivery cost, which is related to the on-board decoupling capacitances and interconnects of the power distribution network. Moreover, the trend towards driving the CPU at lower operating voltage and higher frequency increases the magnitude of the current drawn by the CPU. This exacerbates resistive and inductive noise problems and leads to a significant increase in system cost.
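The link between supply voltage, current and supply noise described above can be made explicit with two standard circuit relations. This is an illustrative sketch rather than a derivation from the thesis: the symbols $P$, $V$, $I$, $R$ and $L$, and the numerical values, are assumptions chosen only for the example.

```latex
% Current drawn by a CPU dissipating power P at supply voltage V
I = \frac{P}{V}

% Voltage disturbance on a power distribution network with
% series resistance R (resistive noise) and inductance L (inductive noise)
\Delta V = I R + L \frac{dI}{dt}

% Hypothetical example: the same 60 W budget at two supply voltages
I_{3.3\,\mathrm{V}} = \frac{60\,\mathrm{W}}{3.3\,\mathrm{V}} \approx 18.2\,\mathrm{A},
\qquad
I_{1.5\,\mathrm{V}} = \frac{60\,\mathrm{W}}{1.5\,\mathrm{V}} = 40\,\mathrm{A}
```

Under these assumed numbers, lowering the supply voltage at constant power more than doubles the current, which enlarges the resistive term, while faster switching enlarges $dI/dt$ and with it the inductive term.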
Fig. 1.1: Power consumption for Intel CPUs [1].
Figure 1.2 gives an idea of the range of dollar amounts associated with the above costs for different system components [2]. As can be seen, when the system power is in the 35-40 W range, the cost of each additional Watt tends to grow above $1/W per chip. Designers have already pushed the fabrication limits to achieve low energy design goals [15]. E.g., shrinking the integrated circuit geometry below 50 nm doubles the leakage current compared to 65 nm. Such issues heighten the need to consider low energy design more rigorously at higher levels of the system hierarchy [5].
1.1.3 Battery Technology Constraints
The energy constraints on mobile devices are becoming increasingly tight as complexity and performance requirements continue to be pushed by user demand [16]. Processor speeds have doubled approximately every 18 months, as predicted by Moore's law [17]. While processor speed and energy consumption have increased rapidly, the corresponding improvement in battery technology has been slow. In fact, battery capacity has increased by a factor of less than four in the last three decades [3] [18].
Fig. 1.2: Thermal and power delivery cost in a desktop PC [2].
Figure 1.3 shows the current state of the art in battery technology. The increase in battery capacity is held back by the limits of ionization chemistry [3] [19]. The design target of batteries with a long life-span and a small size is hard to achieve. E.g., though Ni-MH is lighter than Ni-Cd, it requires a longer recharging time. In the same vein, Li-Ion batteries are more promising because of their higher energy density, large number of charging cycles, small memory effect and longer shelf life, but their higher cost and the additional external protection they require against discharging inhibit their wide, low cost use. In short, the technological constraints on realizing a high capacity, small size battery highlight the importance of low energy design.
Fig. 1.3: Battery technologies and their capacities [3].
1.1.4 Architecture-Application Correlation Slacks
Traditionally, optimal MES performance is sought by focusing on the underlying hardware architecture. This ignores the fact that it is the software executing on a CPU that determines its energy consumption. The execution time and energy consumption of a program on any parallel processor depend not only on the composition of operations contained within the program, but also on the ability of users to express the parallelism at the correct granularity level for the processor. Therefore, to fairly compare the cycle-energy performance of two applications on a given processor, two different mappings will be required, one for each application. An integrated approach that considers energy-cycle performance at the architecture level as well as the application level is essential for energy efficient application development.
1.2 Design Space Exploration
Program behavior is difficult to predict because it depends heavily on the application and on run-time conditions [20] [21]. For mobile computing, application performance can be optimized by using parallel hardware architectures, such as Very-Long Instruction Word (VLIW) architectures [22] [23]. VLIW architectures are a suitable alternative for exploiting instruction-level parallelism (ILP) in programs, that is, for executing more than one basic (primitive) instruction at a time. These processors contain multiple functional units. They fetch from the instruction cache a Very-Long Instruction Word containing several primitive instructions, and dispatch the entire VLIW for parallel execution. These capabilities are exploited by compilers, which generate code in which independent primitive instructions that can execute in parallel are grouped together. The processors have relatively simple control logic because they perform neither dynamic scheduling nor reordering of operations (as most contemporary superscalar processors do). The instruction set of a VLIW architecture tends to consist of simple (RISC-like) instructions. The compiler must assemble many primitive operations into a single "instruction word" such that the multiple functional units are kept busy, which requires enough instruction-level parallelism (ILP) in a code sequence to fill the available operation slots.
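The grouping of independent primitive operations into instruction words can be sketched with a small greedy packer. This is a minimal illustration, not the scheduler of any particular VLIW compiler: the function names, the unit-latency assumption and the example dependence graph are all hypothetical.

```python
from typing import Dict, List, Set


def pack_vliw(deps: Dict[str, Set[str]], slots: int) -> List[List[str]]:
    """Greedily pack operations into instruction words of at most `slots` ops.

    `deps` maps each operation to the set of operations it depends on.
    Assumption: unit latency, i.e. an operation may issue in any word
    after the one in which all of its predecessors were issued.
    """
    done: Set[str] = set()
    remaining = list(deps)  # preserves program order
    words: List[List[str]] = []
    while remaining:
        # Operations whose predecessors have all been issued are independent.
        ready = [op for op in remaining if deps[op] <= done]
        if not ready:
            raise ValueError("cyclic dependence between operations")
        word = ready[:slots]
        words.append(word)
        done.update(word)
        remaining = [op for op in remaining if op not in done]
    return words


def slot_utilization(words: List[List[str]], slots: int) -> float:
    """Fraction of operation slots that carry a useful operation."""
    return sum(len(w) for w in words) / (slots * len(words))


# Hypothetical code sequence: c needs a and b, d needs c, e is independent.
example = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}, "e": set()}
schedule = pack_vliw(example, slots=3)
```

In this example only the first word is full; the dependence chain through `c` and `d` leaves the remaining words mostly empty, which is the kind of unfilled operation slot the paragraph above refers to.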
Conventional software development environments (for compilation and machine code generation) cannot be used as they are for mobile computing software design. They primarily consider execution time and code size, while energy dissipation is left to be dealt with in the final design. This inevitably leads to an expensive cooling mechanism and eventually increases the overall system cost while reducing reliability. The software perspective on power consumption has been the subject of work in [24], where a detailed instruction-level power model of the Intel 486DX2 was built. The impact of software on CPU power and energy consumption, and software optimizations to reduce them, were studied. It is well known that the number of useful (executed) instructions always differs from the number of instructions in the static code. The code execution flow determines the number of useful instructions according to the input data. Therefore, computing the total energy consumed merely by adding the energy consumption of individual instructions does not provide the actual energy consumption of the program as claimed in [24].
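The gap between a static sum over instructions and the energy of an actual run can be illustrated with a toy model. Everything in this sketch is hypothetical: the block names, the per-instruction energy costs and the control flow are invented for illustration and are not measurements from the thesis or from [24].

```python
from typing import Dict, List

# Hypothetical per-instruction energy costs in nanojoules, grouped by code block.
BLOCKS: Dict[str, List[float]] = {
    "init": [2.0, 2.0],
    "loop_body": [3.0, 1.0, 1.0],
    "error_path": [4.0, 4.0, 4.0],
}


def static_estimate(blocks: Dict[str, List[float]]) -> float:
    """Add up every instruction in the static code exactly once."""
    return sum(sum(costs) for costs in blocks.values())


def dynamic_estimate(blocks: Dict[str, List[float]],
                     exec_counts: Dict[str, int]) -> float:
    """Weight each block by how often the execution flow actually enters it."""
    return sum(exec_counts.get(name, 0) * sum(costs)
               for name, costs in blocks.items())


def exec_counts_for(n_samples: int) -> Dict[str, int]:
    """Assumed control flow: init runs once, the loop body once per input
    sample, and the error path is never taken for valid input."""
    return {"init": 1, "loop_body": n_samples, "error_path": 0}
```

With these assumed costs the static sum is 21 nJ regardless of input, whereas the dynamic estimate is 9 nJ for one input sample and 54 nJ for ten, so the static figure can both overestimate and underestimate the energy of a real run.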
In this thesis we propose a framework in which software applications optimally utilize the hardware architecture to deliver energy-cycle performance within user defined constraints. Our energy aware framework in [25] meets this demand by incorporating the following features in a native multimedia DSP compilation environment.
1) The framework transforms the legacy application source code into optimal 'C' source code, taking advantage of different slacks appearing in the application-to-binary development hierarchy.
2) Unlike conventional techniques, 'C' source code is iteratively compiled for different performance goals in terms of both execution time and energy dissipation.
3) We developed post-profiling techniques, published in [26], to evaluate application performance not only at the compilation layer (as a conventional compiler does) but also at the scheduling layer, the linker layer, the machine code generation layer and finally the loader layer.
4) We measure the real-time performance of applications running on actual hardware. These measured parameters are further used to tune the transformation scheme of the legacy software application.
5) We tested our framework on applications from diverse industrial domains such as audio transcodecs [27], video transcodecs [8], speech codecs, and bioinformatics applications [28] [29].
6) The work is further extended in [30] [27] to characterize the application-architecture correlation, which is well suited for a pre-design assessment of an embedded system design. It answers the question of whether a given hardware architecture is an appropriate choice for a given multimedia software application.
Note that the terms power consumption and energy consumption are often used
interchangeably. It is important to distinguish between the two in the context of
programs running on mobile systems. Mobile systems run
on limited energy available in a battery. Therefore, the energy consumed by the system,
or by the software running on it, determines the length of the battery life.
This thesis is based on the following publications.
• N. Zafar Azeemi, A. Sultan, “Characterization of Bioinformatics Applications on
Multimedia Processor”, in Proc. IEEE Cairo International Biomedical Engineering
Conference (CIBEC ’06), pages BI06-BI09, 195 - 200, Cairo, Egypt, December,
2006.
• N. Zafar Azeemi, “Handling Architecture-Application Dynamic Behavior in Set-top
Box Applications”, in Proc. IEEE International Conference on Information
and Automation (ICIA ’06), pages 195 - 200, Colombo, Sri Lanka, December,
2006.
• N. Zafar Azeemi, A. Sultan, A. Muhammad, “Parameterized Characterization of
Bioinformatics Workload on SIMD Architecture”, in Proc. IEEE International
Conference on Information and Automation (ICIA ’06), pages 189 - 194, Colombo,
Sri Lanka, December, 2006.
• N. Zafar Azeemi, “Multicriteria Energy Efficient Source Code Compilation for
Dependable Embedded Applications”, in Proc. IEEE International Conference on
Information Technology (IIT ’06), Dubai, UAE, November, 2006.
• N. Zafar Azeemi, “Compiler Directed Battery-Aware Implementation of Mobile
Applications”, in Proc. IEEE 2nd International Conference on Emerging Technologies
(ICET ’06), pages 151 - 156, Peshawar, Pakistan, November, 2006.
• N. Zafar Azeemi, “A Multiobjective Evolutionary Approach for Constrained Joint
Source Code Optimization”, in Proc. ISCA 19th International Conference on
Computer Application in Industry (CAINE ’06), pages 175 - 180, Las Vegas, Nevada,
USA, November, 2006.
• N. Zafar Azeemi, “Probabilistic Iterative Compilation for Source Optimization of
Embedded Programs”, in Proc. 2006 IEEE International SoC Design Conference
(ISOCC ’06), pages 323 - 328, Seoul, Korea, October, 2006.
• N. Zafar Azeemi, M. Rupp, “Multicriteria Low Energy Source Level Optimization of
Embedded Programs”, in Proc. Tagungsband zur Informationstagung Mikroelektronik
(ME ’06) IEEE Austria, pages 150 - 158, Vienna, Austria, October, 2006.
• N. Zafar Azeemi, “Architecture-Aware Hierarchical Probabilistic Source Optimization”,
in Proc. ISCA 19th International Conference on Parallel and Distributed
Computing Systems (PDCS ’06), pages 90 - 95, San Francisco, USA, September,
2006.
• N. Zafar Azeemi, “Power Aware Framework for Dense Matrix Operations in
Multimedia Processors”, in Proc. IEEE 9th International Multi-topic Conference
(INMIC ’05), Karachi, Pakistan, December, 2005.
• N. Zafar Azeemi, M. Rupp, “Energy-Aware Source-to-Source Transformations for
a VLIW DSP Processor”, in Proc. IEEE 17th International Conference on
Microelectronics (ICM ’05), pages 133 - 138, Islamabad, Pakistan, December, 2005.
• N. Zafar Azeemi, “A Framework for Architecture Based Energy-Aware Code
Transformations in VLIW Processors”, in Proc. International Symposium on
Telecommunication (IST ’05), pages 393 - 398, Shiraz, Iran, September, 2005.
1.3 Thesis Outline
This thesis is organized in five chapters, as shown in Figure 1.4. A brief description of
each chapter is given below.
Chapter 1: We discuss the different design limitations, such as high level system design
constraints, fabrication level constraints, battery technology constraints etc. We explore
the design slacks that exist in contemporary work [31] [24] [5] for energy aware code
optimization. We explain the thesis structure and provide a detailed list of contributions.
Chapter 2: This chapter lays the necessary foundation for the development of our
energy cycle aware iterative compilation framework. Our methodology optimizes a soft-
ware application for energy consumption, execution time as well as efficient hardware
architecture utilization. In contrast to [5] [32] [33] [34], we elaborate our method
for generic multimedia processors. Unlike [35] [36], we define a software application
in terms of its architectural behavior. We provide a simplified overview of typical
multimedia processors. Although various multimedia operation models are presented in
[37] [31] [38] [39] [40], their complexity prevents them from being readily usable in a
real-time optimization environment. We use a simplified multimedia operation model
developed in [4] that views the instruction set in terms of load/store operations, compute
operations, special register operations and control flow operations. The measurement
of the energy consumed by an application on a real-time platform is a first step
Fig. 1.4: Thesis Structure.
in the design of any energy-constrained embedded system and can be used to estimate
the battery lifetime of the system. The experimental setup proposed in [5] [32] [41]
for instruction/program current measurement, addressing modes, immediate operands,
and exhaustive characterization is very time consuming. We present here a measurement
platform that is generic and applicable to most off-the-shelf multimedia
processors. It is based on current measurement at both the processor and memory input
lines. Unlike the instruction-based energy model presented in [42] [24], we propose a
simplified energy consumption model based on code blocks. We describe a step-by-step
procedure for measuring the energy consumption of a software application on a target
hardware architecture. In contrast to [24] [32] [41], we apply our framework to two
major application domains, multimedia and bioinformatics. The multimedia application
set consists of encoders and decoders (transcodecs) encompassing three media types:
speech, video, and audio (music). We categorize the basic functionality offered
by all bioinformatics tools into four groups: pattern recognition algorithms, rule-based
analysis, biological databases and biological taxonomy. The results published
in [28] [29] reveal the usefulness of our framework across diverse application domains.
Several energy reduction opportunities at design level are also presented.
Chapter 3: Our energy cycle aware compilation framework is powered by a source
code transformation engine. Unlike [43] [42] [24], we implement our scheme by first
investigating the ’C’ source code of the application for cycle- and energy-taxing blocks, based
on trace data collected during a profile of the application as mentioned in Chapter 2.
Here, we present a novel heuristic that searches the solution space for an optimal source
code transformation scheme. The algorithm executes a solution
and evaluates the energy-time tradeoff based on a user-defined metric. Based on the
evaluation, it selects the next solution to be evaluated. The heuristic terminates when
the desired objectives are achieved. Our gradient mode iterative compilation scheme has
two salient features. First, it requires queuing the code blocks, since blocks with
similar expression profiles are most likely to benefit from the same transformation scheme.
Second, it completes in a discrete number of steps based on the number of code blocks,
whereas the schemes described by Sinha et al. in [33] and Tiwari et al. in [5] require searches
that grow exponentially as the number of code blocks increases. We also illustrate our
scheme by analyzing a video encoding application (MPEG-1 encoder). Further merits
and demerits of the scheme are also explained in different application scenarios.
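The contrast between a search that scales with the number of code blocks and one that grows exponentially can be sketched as follows. This is only an illustrative greedy per-block selection under the simplifying assumption that blocks can be optimized independently; it is not the gradient mode algorithm of Chapter 3. The transformation names, the cost metric and the measurements are hypothetical.

```python
# Hypothetical transformation names; the thesis does not list them here.
TRANSFORMS = ["none", "unroll", "inline", "fuse"]

def block_cost(block, transform, weight=0.5):
    """User-defined metric: weighted sum of cycles and energy for one block."""
    cycles, energy = block[transform]
    return weight * cycles + (1.0 - weight) * energy

def greedy_search(blocks, weight=0.5):
    """Choose the cheapest transformation for each block in turn."""
    scheme, evaluations = [], 0
    for block in blocks:
        best = min(TRANSFORMS, key=lambda t: block_cost(block, t, weight))
        evaluations += len(TRANSFORMS)
        scheme.append(best)
    return scheme, evaluations

def exhaustive_evaluations(n_blocks):
    """Number of complete schemes an exhaustive search would have to try."""
    return len(TRANSFORMS) ** n_blocks

# Made-up (cycles, energy) measurements for two code blocks.
blocks = [
    {"none": (100, 50), "unroll": (70, 55), "inline": (90, 48), "fuse": (95, 52)},
    {"none": (200, 80), "unroll": (190, 90), "inline": (150, 70), "fuse": (160, 65)},
]
scheme, evaluations = greedy_search(blocks)
```

With two blocks and four candidate transformations the greedy pass needs 8 evaluations, while an exhaustive search over complete schemes needs 16, and the gap widens quickly as blocks are added.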
Chapter 4: The gradient mode iterative compilation proposed in the previous chapter
belongs to a class of compilation termed feedback-directed compilation. It brings
relatively small improvements, as it effectively restricts itself to trying different back-end
optimizations. The major impediment to such an approach is the heuristic search technique
itself. Unlike [32] [41], in this chapter we consider the optimization problem as a single
task, where all desired aims have to be taken into account simultaneously. We present
a new method based on the optimization of a multicriteria objective function,
where the desired aims of architecture-based energy-cycle optimization are formulated as
penalty terms of this objective function. Further, we describe how the maximization of
the objective function can be achieved by using a Genetic Algorithm (GA). The interface
of the proposed methodology to our energy cycle aware compilation framework is also
explained. We also present the details of our methodology, e.g., selection of constraints,
development of the fitness function, and formation of the Hertz matrix. We discuss two multimedia
applications in depth to illustrate the advantages of the algorithm.
Chapter 5: In this chapter we introduce the concept of application-architecture
characterization with the help of our ECACF and multivariate statistical techniques. To our
knowledge, this is the first attempt to obtain such a characterization from application
expression profiles.
The application-architecture correlation is a bidirectional process matching algorithmic
structure with hardware architecture and vice versa. The programmer will benefit from
this efficient mapping and produce better source code. Applications of similar functionality
may yield similar Application Expression Profiles (AEP), and hence can be suitable
for similar hardware platforms. We show that, despite the simplicity of our
methodology, the analysis of the large matrices provided by an application expression profile
under different levels of transformation on different architectures is not trivial and
requires an advanced knowledge of discovery processes. To this end, we introduce a new
methodology to evaluate application portability using multivariate statistics. We
demonstrate how box plots, scree plots, and PCA biplots can be used to characterize an
application on a given hardware architecture. We present the details of the methodology by
exploring the AEP across three different hardware platforms for diverse applications.
Finally, we demonstrate how dAEP can be used to determine legacy code portability
across platforms.
2. ENERGY-CYCLE AWARE COMPILATION
FRAMEWORK (ECACF)
Miniaturization of computing systems is finding applications in special areas such as
hand-held computation, tiny robots, guidance systems in automated vehicles, to name
just a few. Also, these systems or their users move from place to place. Because of
their small size and their mobility requirement, they are powered by batteries of low
rating. In order to avoid frequent recharging and/or replacement of the batteries, there
is significant interest in low-energy system design. Energy consumption is an area of
growing concern in system design. It leads to a variety of system-related issues, such as
battery life, thermal limits, packaging constraints, and cooling options [44]. Though
energy is actually consumed by the hardware, energy consumption can be reduced, apart
from using low-energy electronics, by suitably manipulating the software systems. This
is because the hardware activities are controlled through the software. Let a program
X run for T seconds to achieve its goal, let V_CC be the supply voltage of the system, and
let I be the average current in amperes drawn from the power source for T seconds. We
can rewrite T as T = N × τ, where N is the number of clock cycles and τ is the clock
period. Then, the amount of energy consumed by X to achieve its goal is given by
E = V_CC × I × N × τ joules. Since, for a given hardware, both V_CC and τ are fixed,
E ∝ I × N. However, at the application level, it is more meaningful to talk about T than
N, and therefore we express energy as E ∝ I × T. This expression is the foundation of
our ECACF. It shows that the main idea in the design of energy-efficient software is to
reduce both T and I. From the running time (average case) of an algorithm we obtain
a measure of T. However, to compute I, one must consider the current drawn during
each clock cycle. This is illustrated in Section 2.5.
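The energy expression above can be checked numerically with a minimal sketch. The supply voltage, clock period, currents and cycle counts below are hypothetical values chosen only for illustration; they are not measurements from this work.

```python
def energy_joules(v_cc, avg_current, n_cycles, clock_period):
    """E = V_CC * I * N * tau, where T = N * tau is the run time in seconds."""
    run_time = n_cycles * clock_period
    return v_cc * avg_current * run_time

# Hypothetical operating point: 3.3 V supply, 100 MHz clock (tau = 10 ns).
V_CC, TAU = 3.3, 1e-8

baseline = energy_joules(V_CC, 0.20, 1_000_000, TAU)       # original program
fewer_cycles = energy_joules(V_CC, 0.20, 500_000, TAU)     # same current, half the cycles
lower_current = energy_joules(V_CC, 0.15, 1_000_000, TAU)  # same cycles, less current
```

Halving N halves the energy at constant current, and a 25% lower average current gives a 25% lower energy at constant N, which is the E ∝ I × N relation stated above.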
Given the fact that power is the rate of energy consumption, in this thesis, we refer to
power and energy interchangeably. Low power design is a complex endeavor requiring
a broad range of strategies from floor planning on silicon substrate to the design of
application software. In Table 2.1, we list several strategies for achieving energy
efficiency in an energy-conscious system design. In the following section, we review some
of these strategies.
Power Reduction Strategies                     MES Design Levels
Fabrication Level Power Reduction              Low level
Processor Level Power Reduction                Intermediate level
EDA Tools Level Power Reduction                High level
Compiler Level Power Reduction                 High level
Low Power Data Structures                      High level
Idle Mode Power Reduction                      Intermediate level
Power Reduction in Distributed Computing       High level
Power Reduction in Communication Systems       High level
Battery Aware Power Reduction                  High level
Tab. 2.1: Energy reduction techniques for embedded system design.
2.1 Energy Saving Techniques - A Review
We review a wide spectrum of strategies, shown in Table 2.1, ranging from the hardware
fabrication process to energy-efficient communication systems. Energy savings from
different approaches are, in the best case, multiplicative. E.g., in an IDCT application
implemented in [44] [45] [46] [47], a 30% energy saving from low-energy electronics
together with a 23% saving from compiler techniques will yield a total energy saving of
(1-((1-0.30)(1-0.23)))×100% = 46.1%.
However, the total energy saving is generally lower, say 34% in this example, because the
various energy saving strategies may adversely affect each other.
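The best-case multiplicative combination can be written as a small helper. This is a sketch of the arithmetic only; the function name is our own, and it assumes the individual savings are independent, which the text notes is generally optimistic.

```python
def combined_saving(savings):
    """Best-case total saving when independent savings multiply."""
    remaining = 1.0
    for s in savings:
        remaining *= (1.0 - s)   # fraction of energy still consumed
    return 1.0 - remaining

total = combined_saving([0.30, 0.23])
print(f"{total:.1%}")  # prints 46.1%
```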
2.1.1 Fabrication level power reduction
The power consumption in a CMOS digital circuit is expressed as [48]
$$P = C_L V_{DD}^2 f_p + I_{SC} V_{DD} + I_{leakage} V_{DD} \qquad (2.1)$$
where V_DD is the supply voltage, f_p is the output switching frequency, C_L is the output
capacitance load, I_SC is the short-circuit current pulse, generated when both n- and
p-transistors are briefly turned on during the output switching, and I_leakage is the leakage
current. The first term on the right-hand side of the power equation is the dominant
factor [48]. It is expected that power savings of two orders of magnitude can be
achieved using low-power electronics. About half of the power reduction will come from
architecture changes and management of switching activity (f_p). The other half of the
power reduction will come from using advanced materials technology to allow reduction
of V_DD to 1 V or below from 5 or 3.5 V while also reducing C_L [48] [49].
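To make the quadratic dependence on the supply voltage concrete, the three terms of Equation (2.1) can be evaluated separately. The component values below are hypothetical and serve only to show the scaling; they do not describe any particular device.

```python
def cmos_power(c_load, v_dd, f_switch, i_sc, i_leak):
    """Return the three terms of Eq. (2.1) in watts."""
    dynamic = c_load * v_dd ** 2 * f_switch   # switching term
    short_circuit = i_sc * v_dd
    leakage = i_leak * v_dd
    return dynamic, short_circuit, leakage

# Hypothetical values: 1 pF load, 100 MHz switching, small parasitic currents.
at_5v = cmos_power(1e-12, 5.0, 1e8, 1e-6, 1e-7)
at_1v = cmos_power(1e-12, 1.0, 1e8, 1e-6, 1e-7)
dynamic_ratio = at_5v[0] / at_1v[0]
```

Under these assumptions the switching term falls by a factor of 25 when the supply drops from 5 V to 1 V, while the short-circuit and leakage terms only scale linearly with the supply voltage.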
2.1.2 Processor level power reduction
Mobile embedded systems require small form factors, and hence processors designed for
high-end desktops are not suitable for such applications. Havinga et al. in [50] show that
microprocessors can account for up to 33% of a typical notebook power budget, which
is around 15W. Therefore, processor designers include a number of features to reduce
power consumption. E.g., in the TriMedia processor TM130x [4] and the Blackfin processor
ADSP533S, some of the power reduction features are dynamic idle-time shutdown of
separate execution units, low-power cache design, and power considerations for standard
cells, data-path elements, and clocking. The processor also supports three static power
management modes: doze, nap, and sleep [51]. These modes reduce power at a global
level when the processor is idle for an extended period of time. Since CMOS circuits
consume power during the charging and discharging of capacitances, reducing switching
activity saves power. At the architecture level, two strategies to reduce switching activity
are Gray code addressing and cold scheduling of instructions [52] [53]. Experimental
results show that cold scheduling reduces switching by 20 ∼ 30%. The advantage of Gray
code over binary code is that each sequential memory access changes the address by only
one bit. Thus, a significant number of bit switches can be eliminated using Gray code
addressing. Also, the authors of [54] suggest that, by decomposing a finite-state machine
into several submachines, it is possible to selectively turn off portions of a circuit, thereby
reducing the switching activity. Tiwari et al. [31] have studied the idea of shutting off parts of
a logic circuit that are not needed in a particular computation on a per-clock-cycle basis.
This saves the power used in all the useless transitions in those parts of the circuit. Burd
et al. in [55] and Govilak et al. in [56] have suggested that power consumption in a
CPU can be reduced by dynamically changing its operating frequency and voltage. The
role of prediction and of smoothing in dynamic speed-setting
policies is studied further in [57]. Havinga and Smit [50] propose energy saving by exploiting
locality of reference with dedicated, optimized modules. The idea of locality of reference
is to offload as much work as possible from the CPU to programmable modules that are
placed in the data streams.
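The benefit of Gray code addressing for sequential accesses can be illustrated by counting address-bus bit switches. The sketch below uses the standard binary-reflected Gray code and a 4-bit address sweep chosen only as an example.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def bit_switches(addresses):
    """Total number of bit flips between consecutive addresses."""
    return sum(bin(a ^ b).count("1") for a, b in zip(addresses, addresses[1:]))

sequence = list(range(16))                                    # sequential 4-bit addresses
binary_switches = bit_switches(sequence)                      # 26 bit flips
gray_switches = bit_switches([to_gray(a) for a in sequence])  # 15 bit flips
```

For this sweep the Gray-coded bus toggles exactly one bit per access, 15 in total, against 26 for plain binary. The saving applies to sequential accesses; arbitrary jumps do not enjoy the one-bit property.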
2.1.3 EDA tools level power reduction
The design of low-power systems cannot be achieved without good power-conscious
EDA tools. EDA tools are used at all levels of hardware design: behavioral, architectural,
logic and physical. For a detailed exposition of power-conscious EDA tools, the reader
is referred to the tutorials in [58] [59] [14].
2.1.4 Compiler level power reduction
Compiler design techniques contribute to energy saving in several ways [60] [61]. Kolson
and Nicolau [62] [40] [63] address the problem of allocating memory to variables in em-
bedded DSP (digital signal processing) software. The goal is to maximize simultaneous
data transfers from different memory banks to registers [64] [65] [66]. In several DSP
applications mentioned in [67] [68], two registers are loaded with the required data and
an arithmetic operation is performed. Loading two registers with a single double-transfer
instruction draws a little more current than a single move instruction. Both instructions
take one clock cycle each. However, energy is saved by using the double transfer, because
the double-transfer instruction loads the two registers in one clock cycle, whereas
two clock cycles are needed to load the registers sequentially. Experimental results for a
few applications on a Blackfin DSP processor in [30] show that up to 47% of energy
can be saved by this approach. Instructions with memory operands have much higher
energy costs than instructions with register operands [30]. This suggests that energy
can be saved by suitably assigning the live variables of a program to registers. However, a
processor has only a small number of registers. When the number of simultaneously live
variables is larger than the number of available registers, some of the variables must be
spilled to memory. Register assignment for loop variables is important because loops
are typically executed many times. Algorithms for optimal register assignment to loop
variables are presented in [69] [70] [71] [62]. These algorithms can be included in the
code generation part of a compiler.
2.1.5 Low power data structures
Kondo et al. [72] propose a method of implementing set data types with minimum power
consumption. In a programming language, one can implement the set data type using a
variety of concrete data structures such as arrays, pointer arrays, linked lists and binary
trees [73]. Thus, to implement the set operations, such as locating, inserting, and removing
a record in a set, one has to manipulate the memory elements in a concrete data
structure as proposed in [74] [75] [33] [42]. It is the memory accesses in the process
of set operations that actually consume power. Thus, the power consumption in set
operations is a function of the number of memory elements used in implementing a set
data type, the number of read and write operations performed in the implementation,
and some logic details such as the capacitance of memory elements, voltage level, and
frequency of operation. The concrete data structures are compared on the basis of a
filling factor, which is the fraction of the locations that would be filled if implementation
is in arrays [76] [77] [78]. It has been shown that for different levels of filling factor,
different concrete data structures lead to low values of the power cost function. E.g.,
for filling factors greater than 60%, arrays are better in implementing energy efficient
set data types [72].
2.1.6 Idle mode power reduction
The doze mode is an innovative approach to conserving energy [79] [80] [81] [60]. It is
very attractive in a communication environment where a mobile system may occasionally
send or receive messages. In the doze mode, the clock speed is reduced and no user
process is executed. Rather, a mobile host simply waits for any incoming message. Upon
receiving a message, the host resumes its normal mode of operation. The energy saving
due to this mode depends on the local computations on a mobile and the pattern of
communication between a mobile and a support station [82]. Simulation studies in [41]
show that energy saving due to this mode spreads over a wide range of 2 ∼ 98%.
2.1.7 Power reduction in distributed computing systems
Agent based computation is a relatively new idea in distributed computing [83] [81]
[84]. General agent-based distributed computing systems have been designed using the
concept of the Linda tuple space [85]. Wei et al. [86] discuss how energy-efficient
distributed algorithms in a mobile computing environment can be designed using a tuple
space managed on the fixed network of a mobile system. Lin et al. [22] propose a
power-efficient commit protocol which supports conventional two-phase commit services. A
distributed autonomous system called Noah (Network oriented application harmony),
built in the Mitsubishi laboratory, has been proposed in [87]. Though the purpose of
Noah is not to save energy, it demonstrates how agent-based systems can be built using
a tuple space as the medium for process communication. By shifting most of the workload
to peer fixed hosts, the load, the power consumption and the messages exchanged via
expensive wireless links in a mobile host are greatly reduced.
2.1.8 Power reduction in communication systems
The receiver subsystem of a mobile station need not be active all the time [88]. Most
digital cellular and cordless systems provide power cycling at the mobile units. Mobile
stations can periodically relax (power cycle) their receivers as a means of conserving
energy. Since the receiver of a mobile unit is not continuously ready to receive messages
from the local support station (base station), some kind of coordination between a base
station and a mobile unit is necessary. Salkintzis et al. [89] propose a page-and-answer
protocol. Intuitively, the protocol works as follows:
When a base station has a message for a mobile unit, the base station sends a small
paging packet to the mobile unit. If the mobile unit receives the paging packet, that
is if the mobile receiver is up, the mobile sends an answer packet to the base station.
Obviously, if the paging message is sent at a time when the receiver is powered off, no
answer packet is generated by the mobile and the base station will once again page the
mobile after some time. Upon receiving an answer packet, the base station sends the
desired message to the mobile unit.
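The page-and-answer exchange described above can be sketched as a small discrete-time simulation. The duty-cycle model, the retry interval and the give-up limit are assumptions made for illustration; the text does not specify them, and this is not the protocol definition from [89].

```python
def receiver_is_up(t, period, on_time):
    """Receiver is powered during the first on_time ticks of each period."""
    return (t % period) < on_time

def deliver(message_time, period, on_time, retry_interval, max_pages=100):
    """Return (delivery_time, pages_sent), or (None, pages_sent) if the base station gives up."""
    t = message_time
    for pages in range(1, max_pages + 1):
        if receiver_is_up(t, period, on_time):
            # The mobile answers the page, so the base station sends the message.
            return t, pages
        t += retry_interval   # no answer: page again after some time
    return None, max_pages

# Receiver on for 2 ticks out of every 10; base station re-pages every 3 ticks.
result = deliver(message_time=4, period=10, on_time=2, retry_interval=3)
```

In this example the first two pages fall in the receiver's off period and the third, at tick 10, is answered. A retry interval that is a multiple of the power-cycle period can miss the receiver indefinitely, which is why some coordination between base station and mobile unit is needed.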
Kravets and Krishnan [90] propose power saving by selectively choosing short periods
of time to suspend communications and shut down the communication device. Applying
this method to a transport protocol and using three simulated communication patterns,
they have achieved up to an 83% saving in the energy consumed by the communication
system. Chlamtac et al. [91] address the problem of wireless access protocols which
include an energy constraint and develop three energy conserving protocols for various
loads: grouped-tag TDMA, directory, and pseudorandom. Singh et al. [92] argue that
there is a need for using power-aware metrics, such as minimize energy consumed per
packet, minimize variance in node power levels, maximize time to network partition, etc.,
in the design of power-efficient routing protocols. They show that using these metrics in a
shortest-cost routing algorithm reduces the cost per packet of routing by 5 ∼ 30%
compared with shortest-hop routing.
2.1.9 Battery aware power reduction
Chiasserini and Rao [18] have shown how battery behavior can be exploited to prolong
battery life. In particular, they identify the phenomenon of charge recovery that takes
place under pulsed discharge conditions as a mechanism that can be exploited to enhance
the capacity of an energy cell. The bursty nature of many data traffic sources suggests
that there might be a natural fit between the two. Bai and Lai [93] implement several
methods that let a low-power CPU efficiently perform computation-intensive
tasks, such as graphic image processing and display. Their methods include reducing
the computational complexity of bitmap file processing, using fixed-point math instead
of floating-point math, prestoring a table of trigonometric functions, and using a few
lines of assembly language code in the inner loop of the graphic image processing program
to improve its performance. These methods speed up the programs by a
factor of three to six.
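Two of the techniques listed above, fixed-point arithmetic and a prestored trigonometric table, can be combined in a short sketch. The Q15 format, the table size and the function names are our own assumptions and are not taken from [93].

```python
import math

FRAC_BITS = 15            # Q15 fixed-point format (assumption)
SCALE = 1 << FRAC_BITS
TABLE_SIZE = 256          # table entries per full period (assumption)

# Computed once in advance; at run time only integer operations are used.
SINE_TABLE = [round(math.sin(2 * math.pi * i / TABLE_SIZE) * SCALE)
              for i in range(TABLE_SIZE)]

def fixed_sin(index):
    """Sine of angle 2*pi*index/TABLE_SIZE as a Q15 integer."""
    return SINE_TABLE[index % TABLE_SIZE]

def fixed_mul(a, b):
    """Multiply an integer by a Q15 value using integer operations only."""
    return (a * b) >> FRAC_BITS

def rotate_x(x, y, index):
    """x' = x*cos - y*sin, with cos looked up a quarter period ahead."""
    cos_v = fixed_sin(index + TABLE_SIZE // 4)
    sin_v = fixed_sin(index)
    return fixed_mul(x, cos_v) - fixed_mul(y, sin_v)
```

For example, rotate_x(100, 0, 0) leaves the coordinate at 100, and rotate_x(0, 100, 64), a quarter-turn, gives -100. In a fixed-width language the value 1.0 would need saturation to fit a 16-bit Q15 word; Python integers hide that detail.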
In [44], we argue that mobile application development requires us to rethink the concept
of an algorithm from the viewpoint of battery life. Instead of asking for the best result,
a user may say:
’Give me the best result you can find, using no more than X units of resource R.’
Or, one can let the system make the tradeoff between fidelity and resource consumption
by saying:
’Give me the best result you can obtain economically.’
2.2 Multimedia DSPCPU Architecture
A multimedia processor is a media processor for high-performance multimedia applications
that deals with high-quality video and audio. Typically, an extended general-purpose
CPU (called the DSPCPU) makes it capable of implementing a variety of
multimedia algorithms from popular multimedia standards such as MPEG-1 and MPEG-2.
The key features behind this powerful processor are as follows:
• A general-purpose VLIW processor core coordinates all the on-chip activities.
In addition to implementing the non-trivial parts of multimedia algorithms, this
processor runs a small real-time operating system that is driven by interrupts from
the other units.
• DMA-driven multimedia input/output units that operate independently and that
properly format data to make software media processing efficient.
• DMA-driven multimedia coprocessors that operate independently and in parallel
with the DSPCPU to perform operations specific to important multimedia algorithms.
• A high-performance bus and memory system that provides communication between
the processing units.
• A flexible external bus interface.
A typical multimedia processor is based on a three-level hierarchy of operators:
• Instructions
• Operations
• RISC operations
One instruction may contain five operations as depicted in Figure 2.1. Each operation
may execute multiple arithmetic operations. E.g., for the TriMedia DSP processor TM130x,
one such operation is the command IFIR(a, b). This command contains a total of three
arithmetic operations: two multiplications and one addition (aHI × bHI + aLO × bLO).
Up to five operations including two IFIR commands can be issued in each machine
cycle. The ability of TriMedia’s VLIW architecture to execute multiple operations in
parallel gives it a big advantage over traditional RISC and CISC architectures found in
current mass-market microprocessors.
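The arithmetic performed by IFIR(a, b) can be emulated in a few lines. This sketch assumes that each 32-bit operand holds two signed 16-bit halves and ignores any overflow or clipping behavior of the real operation; the helper names are our own. Readers should consult [4] for the exact semantics.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def pack(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def ifir(a, b):
    """Emulate aHI * bHI + aLO * bLO on two packed 32-bit words."""
    a_hi, a_lo = to_signed16(a >> 16), to_signed16(a)
    b_hi, b_lo = to_signed16(b >> 16), to_signed16(b)
    return a_hi * b_hi + a_lo * b_lo

result = ifir(pack(3, 4), pack(5, 6))   # 3*5 + 4*6
```

With two such operations issued per cycle, a FIR inner loop can retire four multiplications and two additions per instruction, which is where the advantage over a scalar multiply-accumulate comes from.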
Fig. 2.1: TriMedia VLIW instruction [4].
2.2.1 Multimedia Processor Execution Model
The multimedia processor provides a large set of general purpose registers,
generally named r0, r1, and so on. In addition to the hardware program counter PC,
there are a few user-accessible special purpose registers to hold CPU branch addresses.
The CPU issues one long instruction every clock cycle. Each instruction consists of
several operations (five operations for the TM1300 microprocessor) [4]. Each operation
is comparable to a RISC machine instruction, except that the execution of an operation
is conditional upon the content of a general purpose register. Examples of operations
are:
IF r10 iadd r11 r12 → r13 (if r10 true, add r11 and r12 and write sum in r13)
IF r10 ld32d(4) r15 → r16 (if r10 true, load 32 bits from mem[r15+4] into r16)
IF r20 jmpf r21 r22 (if r20 true and r21 false, jump to address in r22)
Each operation has a specific, known execution latency (in clock cycles). For example,
in case of TM1300, iadd takes 1 cycle. This means that the result of an iadd operation
started in clock cycle ’i’ is available for use as an argument to operations issued in cycle
’i+1’ or later. The other operations issued in cycle ’i’ cannot use the result of iadd.
Similarly the ld32d operation has a latency of 3 cycles. The result of an ld32d operation
started in cycle ’j’ is available for use by other operations issued in cycle ’j+3’ or later.
Branches, such as the jmpf example above, have three delay slots. This means that if a
branch operation in cycle ’k’ is taken, all operations in the instructions in cycles ’k+1’, ’k+2’
and ’k+3’ are still executed. In the above examples, r10 and r20 control the conditional
execution of the operations. This is also referred to as guarding, where r10 and r20
contain the guard of the operation.
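The latency and delay-slot rules above can be captured in a small model of the kind an instruction scheduler has to respect. The latency values are the ones quoted in the text for the TM1300; the function names and the dictionary layout are our own illustration.

```python
LATENCY = {"iadd": 1, "ld32d": 3}   # cycles, as quoted in the text for TM1300
BRANCH_DELAY_SLOTS = 3

def result_ready_cycle(op, issue_cycle):
    """First cycle in which the result of op can be used as an argument."""
    return issue_cycle + LATENCY[op]

def can_use(op, issue_cycle, consumer_cycle):
    """True if an operation issued in consumer_cycle may read the result of op."""
    return consumer_cycle >= result_ready_cycle(op, issue_cycle)

def cycles_still_executed(branch_cycle):
    """Instruction cycles that still execute after a taken branch."""
    return [branch_cycle + k for k in range(1, BRANCH_DELAY_SLOTS + 1)]
```

For instance, can_use("iadd", 5, 5) is False because operations issued in the same cycle cannot see the sum, while can_use("ld32d", 5, 8) is True.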
The implementation of the architecture restricts the choice of operations that can be performed
in parallel or can be packed into an instruction. For example, the DSPCPU in
TM1300 allows no more than two load/store class operations to be packed into a single
instruction, as shown in Figure 2.2. Also, no more than five results (of previously started
operations) can be written during any one cycle. The packing of operations is not nor-
mally performed by the programmer. Instead, the instruction scheduler takes care of
converting the parallel intermediate format code into packed instructions ready for the
assembler. The rules are formally described in the VLIW Description File (VDF) used
by the instruction scheduler and other tools.
Fig. 2.2: TriMedia functional unit assignment [4].
2.2.2 Multimedia Processor Operations Overview
In this section we present a brief overview of the multimedia processor instruction set.
Readers are encouraged to refer to [4] for details.
Conditional Execution: In multimedia processor architectures, all operations are op-
tionally ’guarded’. A guarded operation executes conditionally, depending on the value
in the ’guard’ register. For example, a guarded add is written as:
IF R23 iadd R14 R10 → R13.
This should be taken to mean: if R23 then R13 ← R14 + R10. The ’if R23’ clause
controls the execution of the operation based on the LSB of R23. Hence, depending
on the LSB of R23, R13 is either unchanged or set to contain the integer sum of R14
and R10. Guarding applies to all TM1300 operations, except the iimm and uimm (load-
immediate) operations. Guarding controls the effect on all programmer visible state of
the system, i.e. register values, memory content, exception raising and device state.
Load and Store Operations: Memory is byte addressable. Loads and stores have to be naturally
aligned, i.e. a 16-bit load or store must target an address that is a multiple of two, and a
32-bit load or store must target an address that is a multiple of four. For the TM1300, the BSX
bit in the PCSW (program control status word) register determines the byte order of loads and
stores. Only 32-bit load and store operations (e.g., ld32 and st32, see Appendix A of [4]) are
allowed to access MMIO registers in the MMIO address aperture; the results are undefined for
other loads and stores. A load from a non-existent MMIO register returns an undefined result.
A store to a non-existent MMIO register times out and is not performed. There are no other
side effects of an access to a non-existent MMIO register. The state of the BSX bit has no
effect on the result of MMIO accesses. Loads are allowed to be issued speculatively. Loads
that are outside the range of valid data memory addresses for the active process return an
implementation-dependent value and do not generate an exception. Misaligned loads also return
an implementation-dependent value and do not generate an exception.
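The natural alignment rule can be expressed as a one-line test. This is a minimal sketch; the function name is_naturally_aligned and the example addresses are hypothetical.

```python
def is_naturally_aligned(address, size_bytes):
    """True if an access of size_bytes at address is naturally aligned."""
    return address % size_bytes == 0


print(is_naturally_aligned(0x1002, 2))  # 16-bit access at a multiple of two
print(is_naturally_aligned(0x1002, 4))  # 32-bit access at an address that is not a multiple of four
print(is_naturally_aligned(0x1004, 4))  # 32-bit access at a multiple of four
```

Because misaligned loads return an implementation-dependent value without raising an exception, a check of this kind would have to be made by the tools or the programmer; the hardware does not report the error.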
Compute Operations: Compute operations are register-to-register operations. The
specified operation is performed on one or two source registers and the result is written
to the destination register.
Immediate Operations load an immediate constant (specified in the opcode) and produce
a result in the destination register.
Floating-Point Compute Operations are register-to-register operations. The specified
operation is performed on one or two source registers and the result is written to the
destination register. Unless otherwise mentioned, all floating-point operations observe
the rounding mode bits defined in the PCSW register. All floating-point operations whose
names do not end in ’flags’ update the PCSW exception flags. All operations whose names end
in ’flags’ compute the exception flags as if the operation were executed and return the flag
values (in the same format as in the PCSW); the exception flags in the PCSW itself remain
unchanged.
Multimedia Operations are special compute operations. They are like normal compute
operations, but the specified operations are not usually found in general purpose CPUs.
These operations provide special support for multimedia applications.
Special-Register Operations: Special register operations operate on special registers,
such as the program control status word and the branch address holding registers.
Control-Flow Operations: Control-flow operations change the value of the program
counter. Conditional jumps test the value in a register, and based on this value, change
the program counter to the address contained in a second register or continue execution
with the next instruction. Unconditional jumps always change the program counter
to the specified immediate address. Control-flow operations can be interruptible or
non-interruptible. The execution of an interruptible jump is the only occasion where a
multimedia processor allows special event handling to take place.
2.3 Workload Description
Our workload consists of two major application domains, multimedia and bioinformatics.
Both use compute and data intensive algorithms. In this section we present in detail the
diversity found in these application domains, that we selected for the rigorous testing of
our ECACF. The variability in the input data streams is also discussed.
2.3.1 Multimedia Applications
The multimedia application set consists of encoders and decoders (transcodecs) encom-
passing three media types - speech, video, and audio (music) - and is summarized in
Table 2.2 to Table 2.5. We obtained the source code for these applications from various public
domain sources [94] [95] [96] [21]. The applications were chosen for their importance
in real systems and because we believe they are representative enough to support the inferences
made in this study. We evaluated all our applications with four inputs, summarized in Table 2.6.
Here, we only report results from a single input for each application. We chose the input
that gave the highest (normalized) standard deviation in per frame execution time on
our base system. We call these inputs the default inputs, and list them in the second
column of Table 2.6. Results with the other inputs are similar, both quantitatively and
qualitatively. The G.728, H.263, and MPEG codecs statically distinguish multiple frame
types. G.728 uses an adaptive algorithm, where certain parameters are updated every
four frames. The processing of each frame in a single four-frame cycle is different due
to the calculation of these parameters. Thus, we treat these as different types of frames
(numbered one through four). The H.263 and MPEG codecs use almost the same video
compression scheme. A key difference is that MPEG uses three different types of frames
- I frames do not exploit inter-frame redundancy, P frames exploit inter-frame redun-
dancy using a previous frame, and B frames exploit such redundancy using a previous
and a later frame. Our H.263 codecs do not use B frames. They use a single I frame at
the beginning of the video and P frames for the rest. We do not include the I frame in
our analysis. It takes excessively long to simulate a frame with the MPEG codecs using
the frame sizes specified by the MPEG-2 standard (about 4 to 16 hours per frame for
MPEGenc). We therefore scaled down the frame size to 176x144 pixels so that we could simulate
a reasonable number of frames to assess execution time variability. We ensured that
the scaling did not affect the cache behavior by performing a working set analysis and
running representative experiments with larger frame sizes and different cache sizes. As
the chosen frame size conforms to the H.263 standard, we used the same size for the
H.263 codecs for consistency. Also for consistency, we used the same set of four inputs
for both MPEG and H.263 codecs. These inputs contain a great deal of motion to
stress the applications. H.263 was designed for low bit-rate applications such as video
conferencing (which typically has less motion); therefore, our results from these inputs
represent an upper bound on the expected variability for H.263.
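The selection of the default input described above, i.e. the input with the highest normalized standard deviation in per frame execution time, can be sketched as follows. This is only an illustration: the function names, the input labels and the timing values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def normalized_std(frame_times):
    """Standard deviation of per frame execution time divided by its mean."""
    return pstdev(frame_times) / mean(frame_times)


def pick_default_input(times_by_input):
    """Return the input name whose per frame times vary the most."""
    return max(times_by_input, key=lambda name: normalized_std(times_by_input[name]))


# Hypothetical per frame execution times (arbitrary units) for two inputs.
times_by_input = {
    "input_a": [10.0, 10.0, 10.0, 10.0],
    "input_b": [5.0, 15.0, 5.0, 15.0],
}
print(pick_default_input(times_by_input))
```

Normalizing by the mean makes inputs with different average frame times comparable, so that a long-running input is not preferred merely because its absolute deviations are larger.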
Application | Description | Input Vector | Sample Rate/Throughput
GSMenc | Low bit-rate speech coding based on the European GSM 06.10 provisional standard. Uses RPE/LTP (residual pulse excitation/long term prediction) coding at 13 Kb/s. Compresses frames of 160 16-bit samples into 264 bits. | orignova | 20 ms (160 samples), 8 KHz
GSMdec | as above | homemsg | as above
G728enc | High bit-rate speech coding based on the G.728 standard. Uses low-delay CELP (code excited linear prediction) coding at 16 Kb/s. Compresses frames of five 16-bit samples into 10 bits. | lpcqutfe | 625 µs (5 samples), 8 KHz
G728dec | as above | homemsg | as above
G723enc | High bit-rate speech coding based on the G.728 standard. Uses low-delay CELP (code excited linear prediction) coding at 16 Kb/s. Compresses frames of five 16-bit samples into 10 bits. | lpcqutfe | 625 µs (5 samples), 8 KHz
G723dec | as above | homemsg | as above
G729enc | High bit-rate speech coding based on the G.728 standard. Uses low-delay CELP (code excited linear prediction) coding at 16 Kb/s. Compresses frames of five 16-bit samples into 10 bits. | lpcqutfe | 625 µs (5 samples), 8 KHz
G729dec | as above | homemsg | as above
Tab. 2.2: Multimedia Benchmarks (Speech Transcodecs).
Application | Description | Input Vector | Sample Rate/Throughput
H263enc | Low bit-rate video coding based on the H.263 standard. Primarily uses inter-frame coding (P frames). Widely used for bit-rates less than 64 Kb/s. | orignova | 40 ms, 25 frames/s
H263dec | as above | buggy | as above
H264Lenc | Low bit-rate video coding based on the H.264 standard. Primarily uses inter-frame coding (P frames). Widely used for bit-rates less than 64 Kb/s. | orignova | 40 ms, 25 frames/s
H264Ldec | as above | buggy | as above
MPEGenc | High bit-rate video coding based on the MPEG-2 video coding standard. Uses intra-frame (I) and inter-frame (P, B) coding. Typical bit rate is 1.5-6 Mb/s. | Buggy | 33 ms, 30 frames/s
MPEGdec | as above | flwr | as above
MPEG-1 encoder | High bit-rate video coding based on the MPEG-1 video coding standard. | Buggy | 33 ms, 30 frames/s
MPEG-1 encoder | as above | flwr | as above
NLIVQ | Non linear interpolative vector quantization, image processing codec | cameraman.tif | 512x512 resolution, gray scale
Tab. 2.3: Multimedia Benchmarks (Video Transcodecs).
Application | Description | Input Vector | Sample Rate/Throughput
MP3enc | Audio decoding based on the MPEG Audio Layer-3 standard. Synthesizes an audio signal out of coded spectral components. Typical bit rate is 16-256 Kb/s. | filter | 26 ms (1151 samples), 44.1 KHz
MP3dec | as above | filter | as above
Tab. 2.4: Multimedia Benchmarks (Audio Transcodecs).
Application | Description
FFT | Fast Fourier Transform
IDCT | Inverse Discrete Cosine Transform
T64 | Matrix Transpose 64x64
M100 | Matrix Multiplication 100x100
Tab. 2.5: Generic DSP application Benchmarks [7].
Domain | Test Vector | Description | Features
Audio | CatSteven | Soft rock song | 2500 frames, average length 65.25 seconds
Audio | Sting | Pop song | as above
Audio | Beethoven | 2500 classical piece | as above
Video | Flwr | Drive-by of houses | 450 frames, each 18 seconds for H.263 and 15 seconds for MPEG
Video | Cact | Panoramic view | as above
Video | Buggy | Buggy race | as above
Video | Tens | Table tennis match | as above
Speech | Homemsg | An answering message | Average frame size for GSM codecs is 500, for G.72x is 19000, length: 20 seconds
Speech | Orignova | Sentences read by different adults | as above
Speech | lpcqutefe | Sentence read by a boy | as above
Tab. 2.6: Test Vectors Characterization.
2.3.2 Bioinformatics Workload
Due to a significant increase in biological threats against humans, plants and other
species during the last two decades, there is a growing realization that bioinformatics and
molecular biology equipment should be available in small form factors, so that it is
readily available in the field [97]. This has led to the development of battery-efficient and
execution-time-efficient handheld devices for bioinformatics applications. Bioinformatics is an
interdisciplinary research area that helps to produce ’sensible’ and ’useful’ information
from the wealth of data that has been produced by the genome sequencing projects.
We categorize the basic functionality offered by all bioinformatics tools into four groups:
1. Algorithms for pattern recognition: probability formulae are used to determine the
statistical similarity between two or more given sequences.
2. Rule-based analysis defines how a mathematical or statistical technique can be applied.
Different sets are defined with a membership, and a set of rules is created to elaborate
associativity. Basic set theory is used to fire a rule.
3. Biological databases are uniformly and efficiently maintained archives of consistent
data that contain information and annotation of DNA and protein sequences, DNA
and protein structures as well as DNA and protein expression profiles [98] [99]. An
important feature of these databases is their simplicity of access and query management.
In addition, some websites [100] [101] [102] provide visualization tools to aid biological
interpretation.
4. Biological taxonomy records the differences in sequences across different classes,
which further helps to reduce similarity errors.
We chose applications for their importance in real systems and because they are representative
enough to support the inferences made in this study. They are summarized in Table 2.7. We
obtained the source code for these applications from various public domain sources. For lack
of space, we only report their underlying algorithms; details may be found in [99] [97] [102].
The input databases are obtained from the NIH genetic sequence database ’GenBank’, the NCBI
assembly archive ’Genome Assembly Archive’, the homologous structure alignment database
’HOMSTRAD’, the NIMH-NCI protein-disease database ’PDD’ and ’The Lens’ [100]
[102].
Application | Pseudonym | Features | Algorithms
GENESPLICER | A01 | Detect splice sites in the genomic DNA | High accuracy and computationally efficient
TIGRSCAN | A02 | DNA modeling | Generalized Hidden Markov Model (GHMM), HMM
TRANSTERMIS | A03 | Rho-independent transcriptional terminators | Statistical estimation techniques
GENSCAN | A04 | Predict complete gene structure | Search algorithms
MUMMER | A05 | Genome sequence alignment | Tree algorithms
GLIMMERHMM | A06 | Find gene sequence in eukaryotes | IMM, splice site models, maximal dependence decomposition techniques
GENIE | A07 | Gene finder in vertebrate and human DNA | GHMM, neural networks
FGENE | A08 | Find splice sites, genes, promoters | Linear discriminant analysis
GRAIL | A09 | Analysis of DNA sequence | Automated computation
GENEMARK | A10 | Find genes in bacterial DNA sequence | Markov chains
NetPlaneGene | A11 | Sequence analysis | Neural network
GLIMMER | A12 | Coding regions in microbial DNA | Interpolated Markov Models (IMM)
Tab. 2.7: Bio-Computation Applications Benchmark.
2.4 Energy Cycle Aware Compilation Framework Methodology
The ECACF is shown in Figure 2.3. The source code is processed successively for
static code analysis, post-compiler analysis and finally scheduling analysis. A VLIW
processor descriptor file (VDF) provides architecture information to the compiler,
the scheduler and finally the machine code generator. The VDF file contains a list of
pseudo and machine operations, latency of the operations, opcodes, slot assignment
schemes, processor operating frequency, instruction cache features (associativity, block
size, number of sets) and main memory features (size, order, read/write latencies). This
file format is compatible with those described in [103] [4] [81] [104]. Here, we follow the
same VLIW naming convention as used in [104]. This feature makes our scheme
architecture independent. A list of parameters is generated in each step of the
methodology flow. Intermediate trace files are generated during the code processing
flow to produce the AEP, such as code size, execution time, number of cache misses (for both
instruction and data caches), data cache conflicts, data bank alignment, highway usage,
scheduling factor and slot utilization. After the simulation these parameters are used
to compute transformation control factors such as unrolling factor, grafting depth and
blocking metrics. These control factors are further explained in [25]. After each
iteration all these parameters are recorded again and compared to the preset user
constraints given in a User Constraint File (UCF). This file contains the desired values
for code size, execution time, energy and allowed percentage of cache misses. Energy is
measured on the target platform (the setup is explained in Section 2.5). All these parameters
are fed back to the transformation cost analyzer. After each successive transformation it is
decided whether the energy-cycle performance has been optimized or not. The source code is
optimized by applying code restructuring schemes known as loop unrolling, decision
tree grafting and loop tiling. Additional benefits are gained by combining traditional
compiler optimization algorithms, such as constant and variable propagation, dead code
elimination, strength reduction etc.
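The feedback loop described above can be sketched roughly as follows. This is not the ECACF implementation: the function names, the acceptance rule (keep a transformation only if it worsens neither energy nor cycles), the toy cost model and all numbers are assumptions made purely for illustration. In the real flow, the evaluation step stands for compilation, scheduling, simulation and energy measurement.

```python
def meets_constraints(measured, limits):
    """True when every constrained parameter is at or below its UCF limit."""
    return all(measured[name] <= limit for name, limit in limits.items())


def iterate(transformations, evaluate, limits):
    """Try candidate transformations one by one until the limits are met."""
    applied = []
    measured = evaluate(applied)
    for transformation in transformations:
        if meets_constraints(measured, limits):
            break
        candidate = applied + [transformation]
        candidate_measured = evaluate(candidate)
        # Assumed acceptance rule: keep it only if energy and cycles do not get worse.
        if (candidate_measured["energy"] <= measured["energy"]
                and candidate_measured["cycles"] <= measured["cycles"]):
            applied, measured = candidate, candidate_measured
    return applied, measured


# Toy cost model: (energy, cycles, code size) deltas per transformation, all invented.
EFFECTS = {"unroll": (-20, -200, 10), "tile": (5, -50, 0), "graft": (-10, -100, 5)}


def toy_evaluate(applied):
    energy, cycles, code_size = 100, 1000, 50
    for name in applied:
        d_energy, d_cycles, d_size = EFFECTS[name]
        energy, cycles, code_size = energy + d_energy, cycles + d_cycles, code_size + d_size
    return {"energy": energy, "cycles": cycles, "code_size": code_size}


limits = {"energy": 75, "cycles": 800, "code_size": 70}  # stands in for a UCF
applied, measured = iterate(["unroll", "tile", "graft"], toy_evaluate, limits)
print(applied, measured)
```

In this toy run the tiling step is rejected because it raises the energy figure, which illustrates why each transformation is judged on both energy and cycles rather than on execution time alone.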
Fig. 2.3: Transformation methodology.
2.4.1 Application Expression Profile
From ’C’ source code to an executable binary, an embedded application has to pass
through many tools: the text editor, compiler, scheduler, linker, and loader. The
programmer’s question ’how can I?’ is shaped by the constraints of embedded systems that
emerge from software-hardware co-design: the software leads, and the hardware follows
within its technological limitations. The behavior that a software implementation can
express on a hardware platform is limited by the freedom offered by the hardware
architecture and by the ability of programmers to code the ’how can I?’. These
issues indicate that ’good’ energy-cycle performance requires more detailed
profiles, containing information about system behavior at various levels,
as shown in Figure 2.4. The main goal of such vertical profiling is to improve the
understanding of system behavior by correlating profile information across the different
levels.
Fig. 2.4: Vertical application profile layers.
An executable application development hierarchy is composed of compilation,
scheduling, linking, and binary code generation. Finally, the code is downloaded to
the SDRAM attached to the multimedia processor. Our Application Profile Monitor
(APM) extracts the application behavioral parameters mentioned above. This information
is extracted from the vertical profile layer block shown in Figure 2.4. An
application is profiled in terms of both its static and its run-time (dynamic) behavior. We
call the way an application expresses itself on a given hardware architecture its
Application Expression Profile (AEP). We characterize an application expression profile
using the following conventions:
1) Name: The name of the profile monitor.
2) Definition: The definition of the profile monitor as used in our ECACF.
3) Location: The location of the monitor in the application development hierarchy, such as
compilation, scheduling, linking etc.
4) Type: There are two possible types: static or dynamic.
5) Range: The possible range of values a monitor can have.
6) Level: If a parameter is measured directly from the code, it is called a primary monitor;
if it is computed from one or more other parameters, we call it a secondary monitor.
For example, a primary monitor can be written as:
Name: Processor Frequency
Definition: The operating frequency of the microprocessor
Location: VDF
Type: static
Range: Typical 100MHz - 233MHz (depends on given hardware architecture)
Level: Primary
Similarly, a secondary monitor can be written as:
Name: Scheduling Factor
Definition: Computed by dividing the infinite machine cycle time by the finite
machine cycle time
Location: Transformation Engine and Scheduler
Type: Dynamic
Range: 0 to 1
Level: Secondary
A complete list of profile monitors is provided in Appendix A.
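The six-field convention above maps naturally onto a small record type. The following is a minimal sketch, assuming a Python representation that is not part of the ECACF: the class ProfileMonitor, its field names and the cycle counts in the usage lines are hypothetical, while the two monitor descriptions are taken from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileMonitor:
    name: str
    definition: str
    location: str
    kind: str          # "static" or "dynamic" (the Type field)
    value_range: str   # the Range field, kept as free text
    level: str         # "primary" or "secondary"


processor_frequency = ProfileMonitor(
    name="Processor Frequency",
    definition="The operating frequency of the microprocessor",
    location="VDF",
    kind="static",
    value_range="Typical 100MHz - 233MHz",
    level="primary",
)

scheduling_factor_monitor = ProfileMonitor(
    name="Scheduling Factor",
    definition="Infinite machine cycle time divided by finite machine cycle time",
    location="Transformation Engine and Scheduler",
    kind="dynamic",
    value_range="0 to 1",
    level="secondary",
)


def scheduling_factor(infinite_machine_cycles, finite_machine_cycles):
    """Secondary monitor value computed from two cycle counts."""
    return infinite_machine_cycles / finite_machine_cycles


# Hypothetical cycle counts, chosen only to show a value inside the 0 to 1 range.
print(scheduling_factor(600, 800))
```

A value close to 1 would mean that the schedule on the real (finite) machine is nearly as short as on an idealized machine with unlimited resources.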
2.5 Experimental Setup
Knowing the energy consumed by an application on a real-time platform is a first step in
any energy-constrained embedded system design, and it can be used to estimate the
battery lifetime of the system. In this section, we describe an energy measurement
method for a software application running on a real-time multimedia VLIW processor.
The method is described for the Philips TM1300 DSP processor, but it is applicable to other
multimedia processors, e.g., the Blackfin ADSP533S. The measurement framework has
been incorporated into our ECACF, which allows a software application programmer to
measure the real-time energy consumption of the candidate ’C’ source code by running it.
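As a rough illustration of how a measured energy figure relates to battery lifetime, the following sketch assumes an ideal battery with constant voltage and no conversion losses. The function names and all numeric values are hypothetical and are not measurements from this work.

```python
def average_power_watts(energy_joules, execution_time_s):
    """Average power drawn while the application runs."""
    return energy_joules / execution_time_s


def battery_lifetime_hours(capacity_mah, voltage_v, power_w):
    """Lifetime of an ideal battery delivering a constant power."""
    battery_energy_wh = capacity_mah / 1000.0 * voltage_v
    return battery_energy_wh / power_w


# Hypothetical numbers: 3 J consumed over 2 s, 1000 mAh battery at 3.0 V.
power = average_power_watts(3.0, 2.0)
print(power)
print(battery_lifetime_hours(1000.0, 3.0, power))
```

Real batteries deviate from this model (capacity depends on discharge rate and temperature), so such a figure should be read as an upper estimate.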
2.5.1 Related Work for Energy Measurement
The energy consumption of a software applicati