
DISSERTATION

An Energy Aware Framework for Mobile Computing

submitted in fulfillment of the requirements for the academic degree of Doctor of Technical Sciences

submitted to the Technische Universität Wien
Faculty of Electrical Engineering and Information Technology

by

Dipl.-Ing. Naeem Zafar Azeemi
Brigittenauer Lande 224/6643, 1200 Wien
born in Karachi, Pakistan, on 14 August 1968
Matriculation number: 0327346

October 6, 2007

Advisor

Univ.Prof. Dipl.-Ing. Dr.techn. Markus Rupp
Technische Universität Wien
Institut für Nachrichtentechnik und Hochfrequenztechnik

Examiner

Univ.Prof. Dr.phil.nat. Christoph Grimm
Technische Universität Wien
Institut für Computertechnik

To Amra, Mukashfa and Kunza

ABSTRACT

Since their inception, energy dissipation has been a critical issue for mobile computing systems. Although a large research investment in low-energy circuit design and hardware-level energy management has led to more energy-efficient architectures, there is a growing realization that the contribution to energy conservation should be considered more rigorously at higher levels of the system, such as operating systems and applications.

This dissertation puts forth the claim that energy-aware compilation, which improves application quality in terms of both execution time and energy consumption, is essential for the design of a high-performance mobile computing embedded system. Our work is a design paradigm shift from the logic gate as the basic silicon computation unit to an instruction running on an embedded processor. Multimedia DSP processors are the most attractive choice for a mobile computing system design because they deliver high data throughput at low energy. They exploit instruction-level parallelism (ILP) in programs to execute more than one primitive instruction at a time. In this work, we exploit the parallelism slacks exposed by the native multimedia DSP compilers. We propose an iterative compilation environment to optimize a given 'C' source code. The contribution of our framework is the collaboration of an application profile monitor (APM) with an optimization engine in native multimedia DSP Software Development Environments (SDE). We propose to monitor application behavior at all levels (static, compilation, scheduling, linking, and execution). These APMs are later used in an optimization engine to speculate optimal code transformation schemes. These schemes are applied successively across the basic code blocks. We propose two methods for the selection of optimization schemes: Gradient Mode Iterative Compilation (GMIC) and Multicriteria Stochastic Iterative Compilation (MSIC). Both are tested on several multimedia applications from diverse domains such as video transcodecs (MPEG2, H-264L), audio transcodecs (G-723, MP3) and bioinformatics (Glimmer, Fgene), to name a few.

Finally, we propose a characterization of application-architecture correlations that supports our claim that the ideal performance of a mobile computing system demands a perfect match between hardware capability and program behavior. We present results for 20 multimedia applications evaluated on the TriMedia DSP 1300, the Blackfin DSP ADSP533, and the PIII-850 embedded processor.

Keywords: Energy Aware, Source-to-Source, Multimedia Processor, Workload Characterization.


ZUSAMMENFASSUNG (SUMMARY)

Energy consumption has been a decisive factor ever since mobile computing systems came into existence. Although numerous research results have already led to hardware solutions with low energy consumption, it has by now become clear that energy savings at higher levels, such as operating systems and applications, should increasingly be taken into account.

This dissertation shows that energy-aware compilation reduces execution time and is therefore an essential criterion for ensuring an efficient embedded system for mobile computing. Our work deals with a new design paradigm that no longer concentrates on individual logic gates as the basic design elements, but instead addresses individual instructions on an embedded processor. Digital signal processors for multimedia applications are the most cost-effective solution for a mobile computing system, ensuring optimal data throughput time at low energy demand. They exploit the instruction-level parallelism (ILP) of programs in order to execute several primitive instructions at the same time. In this dissertation, program parallelization is captured with a dedicated monitor. Furthermore, we propose an iterative compilation to optimize the given program code in 'C'. A further contribution is a software environment for analyzing applications and optimizing them. Here the program behavior is monitored at several levels (static level, compilation, scheduling, linking, and during execution). These analyses are subsequently used by an optimization program to determine an optimal compiler configuration. Two different methods for selecting the optimization options are presented in this work, namely a gradient method and a stochastic method. Both methods are tested with various multimedia applications from different domains such as video coding (MPEG2, H-264L), audio coding (G-723, MP3) and bioinformatics (Glimmer, Fgene).

Finally, we propose metrics for capturing the correlation between application and hardware, which support our claim that ideal performance of the mobile computing system can only be achieved when hardware capacity and program behavior match perfectly. The effectiveness of these metrics is demonstrated on the TriMedia DSP 1300, Blackfin DSP ADSP533 and PIII-850 processors.


Keywords: Energy-aware, source code transformation, embedded systems, multimedia processors, mobile computing, workload characterization

ACKNOWLEDGEMENTS

I would like to thank my teacher Khwaja Shamsuddin Azeemi and my parents, who have had a positive effect on me personally and to whom I owe a debt of gratitude for helping in one way or another to influence the person I am today.

First and foremost, I thank my supervisor Dr. Markus Rupp for his consistent efforts to invoke my inherent skills to accomplish this task successfully. I appreciate his bottomless patience for technical review and the substantive comments that improved the readability of the dissertation.

Thanks to my sister Farhi, and brothers Waseem and Nadeem, who provide encouragement in the face of every seemingly impossible task that I face.

Thanks to Afsar, Sobia, Shams Sahib, Ana Eliza and Liana for their love, support and great understanding, especially during vulnerable moments.

Thanks to my friends, colleagues and acquaintances: Bastian, Martin at the Christian Doppler Laboratory; Sabine from Vienna; Naveed and Saima from Boston; Nadeem and family from San Francisco; Amir Malik and family from Korea for their kind assistance and facilitation during the last 45 months.

I would like to acknowledge valuable technical support from Dr. Arpad Scholtz at the Institute of Communications and Radio Frequency Engineering, Dr. Stefan Mahlknecht at the Institute of Computer Technology and Aneesa Sultan at the Vienna Bio Center.

I am also grateful to Dr. Christoph Grimm for his time and patience to review this manuscript.

CONTENTS

1 Introduction
1.1 Motivation
1.1.1 Mobile Embedded System Constraints
1.1.2 IC Fabrication Technology Constraints
1.1.3 Battery Technology Constraints
1.1.4 Architecture-Application Correlation Slacks
1.2 Design Space Exploration
1.3 Thesis Outline

2 Energy-Cycle Aware Compilation Framework (ECACF)
2.1 Energy Saving Techniques - A Review
2.1.1 Fabrication level power reduction
2.1.2 Processor level power reduction
2.1.3 EDA tools level power reduction
2.1.4 Compiler level power reduction
2.1.5 Low power data structures
2.1.6 Idle mode power reduction
2.1.7 Power reduction in distributed computing systems
2.1.8 Power reduction in communication systems
2.1.9 Battery aware power reduction
2.2 Multimedia DSPCPU Architecture
2.2.1 Multimedia Processor Execution Model
2.2.2 Multimedia Processor Operations Overview
2.3 Workload Description
2.3.1 Multimedia Applications
2.3.2 Bioinformatics Workload
2.4 Energy Cycle Aware Compilation Framework Methodology
2.4.1 Application Expression Profile
2.5 Experimental Setup
2.5.1 Related Work for Energy Measurement
2.5.2 Proposed Methodology
2.6 Conclusions

3 Gradient Mode Iterative Compilation (GMIC)
3.1 GMIC Architecture
3.1.1 Performance Qualifier Measurement
3.1.2 Code Block Queuing
3.1.3 Code Block Expression Profile
3.1.4 Transformation Scheme
3.2 Implementation
3.3 Example: Optimization of an MPEG-1 encoder
3.4 Discussion
3.5 Conclusions

4 Multicriteria Stochastic Iterative Compilation (MSIC)
4.1 Introduction
4.2 Model Development
4.2.1 Objects and Constraints
4.2.2 Case Study I - Arbitrary Application
4.2.3 Case Study II - Nonlinear Interpolative Vector Quantization (NLIVQ)
4.3 Performance Comparison with GMIC
4.4 Conclusions

5 Application-Architecture Characterization
5.1 Terminologies
5.1.1 Principal Component Analysis (PCA)
5.1.2 Scree Plot
5.1.3 Box Plot
5.1.4 Scatter Plot
5.1.5 Differential Application Expression Profile (dAEP)
5.2 Application Characterization
5.2.1 Case Study 1
5.2.2 Case Study 2
5.2.3 Case Study 3
5.3 Architecture-Centric Application Characterization
5.4 Conclusions

6 Conclusions

Appendices
A List of Application Expression Profile (AEP) Monitors
B VLIW Descriptor File (VDF) Format
C User Constraints Files (UCF) Format
C.1 UCF for MPEG-1 encoder example in Section 3.3
C.2 UCF for NLIVQ example in Section 4.2.3
D Application Attributes
E List of Acronyms

LIST OF FIGURES

1.1 Power consumption for Intel CPUs [1].
1.2 Thermal and power delivery cost in a desktop PC [2].
1.3 Battery technologies and their capacities [3].
1.4 Thesis Structure.

2.1 TriMedia VLIW instruction [4].
2.2 TriMedia functional unit assignment [4].
2.3 Transformation methodology.
2.4 Vertical application profile layers.
2.5 Experimental setup for instruction/program current measurement [5].
2.6 Proposed experimental setup for application current measurement at processor and memory.
2.7 Current consumption for vector quantization (VQ) application execution life cycle.
2.8 CPU core current consumption versus address range for VQ application.
2.9 Memory current consumption versus address range for G-728 audio transcodec.
2.10 CPU core current consumption versus address range for G-728 audio transcodec.
2.11 CPU peripheral current consumption versus address range for G-728 audio transcodec.

3.1 Gradient mode Iterative Compilation Methodology (GMIC).
3.2 Fraction of JPMO CB in an MPEG-1 application; the code blocks are numbered from fb01 to fb34.
3.3 Fraction of JPMO contributed by code blocks in an MPEG-1 application (a window view for seven blocks).
3.4 GMIC algorithm.
3.5 Heuristic track of CT-Tuple for an MPEG-1 encoder application.
3.6 Heuristic track of CTxy tuple for FFT application.
3.7 Heuristic track of CTxy tuple for IDCT application.
3.8 Heuristic track of CTxy tuple for T64 application.
3.9 Heuristic track of CTxy tuple for M100 application.
3.10 Heuristic track of CTxy tuple for H-264L application.

4.1 A simplified view of framework with multicriteria methodology extension.
4.2 Simplified Genetic Algorithm Model [6].
4.3 Development of fitness function for Case Study 1 in TS1 and TS2.
4.4 Fraction of IPC for Case Study 1 in TS1 and TS2.
4.5 Fraction of IPC and Energy overlapping for Case Study 1 in TS1 and TS2.
4.6 Fraction of CPU cycles for CB life time (CBLT) in NLIVQ application (25 CB are numbered from F01 to F25).
4.7 Development of the fitness function for NLIVQ.
4.8 Fraction of IPC for NLIVQ.
4.9 Fraction of energy saving for NLIVQ.
4.10 Fraction of functional unit utilization for NLIVQ.

5.1 Scatter plot for 20 applications at the TriMedia processor.
5.2 PCA Scree plot for 20 applications at the TriMedia processor.
5.3 PCA box plot for 20 applications at the TriMedia processor.
5.4 PCA biplot for 20 applications at the TriMedia processor.
5.5 Scatter plot for 20 applications at the Blackfin processor.
5.6 PCA biplot for 20 applications at the Blackfin processor.
5.7 Scatter plot for 20 applications at the PIII 850 processor.
5.8 PCA biplot for 20 applications at the PIII 850 processor.
5.9 Differential AEP across three hardware platforms.
5.10 PCA biplot for 20 applications across the TriMedia processor and the Blackfin processor.
5.11 PCA biplot for 20 applications across the Blackfin processor and the PIII 850 processor.
5.12 PCA biplot for 20 applications across the TriMedia processor and the PIII 850 processor.

LIST OF TABLES

2.1 Energy reduction techniques for embedded system design.
2.2 Multimedia Benchmarks (Speech Transcodecs).
2.3 Multimedia Benchmarks (Video Transcodecs).
2.4 Multimedia Benchmarks (Audio Transcodecs).
2.5 Generic DSP application Benchmarks [7].
2.6 Test Vectors Characterization.
2.7 Bio-Computation Applications Benchmark.

3.1 Transformation Schemes.
3.2 Gradient Table.

4.1 CBLT in CPU cycles for NLIVQ.
4.2 Achieved CPU cycles (%) in ECHCB of NLIVQ application for TS04, TS07, TS09.
4.3 Sum of absolute difference for TS04, TS07, TS09.
4.4 Performance comparison between GMIC and MSIC.

5.1 MPEGdec profile for successive transformations [8].

D.1 Pseudonyms for 20 applications.
D.2 AEP for optimized 20 applications at the TriMedia processor.
D.3 AEP for optimized 20 applications at the Blackfin processor.
D.4 AEP for optimized 20 applications at the PIII 850 processor.
D.5 dAEP for optimized 20 applications across the TriMedia and the Blackfin processors.
D.6 dAEP for optimized 20 applications across the Blackfin and the PIII 850 processors.
D.7 dAEP for optimized 20 applications across the TriMedia and the PIII 850 processors.

1 INTRODUCTION

1.1 Motivation

The growing trend towards untethered ubiquitous computing brings with it many performance-related issues. The ideal performance of a mobile computing system demands a perfect match between architecture capability and program behavior. Architecture performance can be enhanced with better hardware technology, innovative small-geometry integrated circuit (IC) features, and efficient resource management [9]. In the same vein, the demand for multimedia functions on handheld devices requires enormous computation power to handle large data and program sizes. Efficient architecture utilization, in terms of both energy dissipation and execution time, and optimal application firmware are two important performance metrics for these embedded systems. Optimal architecture utilization is hindered by several design limitations, such as high-level system design constraints, fabrication-level constraints, and battery technology constraints. They are discussed next in more detail.

1.1.1 Mobile Embedded System Constraints

Mobile embedded systems (MES) present unique challenges and opportunities for system-level low-energy designs, e.g.,

• MES are usually severely energy constrained. In particular, handheld devices, airborne, and spaceborne systems are typically battery-operated and therefore have a limited energy budget [10]. MES are also typically more time-constrained than portable embedded or general-purpose systems. Therefore, the challenge is to save energy while guaranteeing temporal constraints.

• Some MES applications such as avionics, robotics and deep space missions require systems with small form factors, which in turn mandates low heat dissipation. Since heat is a byproduct of energy dissipation, low-energy system design ensures a more reliable system by limiting the heat produced.

• MES are typically over-designed to ensure that the temporal deadline guarantees are still met even if all tasks take up their Worst-Case Execution Time (WCET). Since, in the average case, tasks do not require their WCET, the redundancy in hardware design in MES makes them energy inefficient.

In short, system-level techniques can decrease this energy dissipation through the use of energy-aware task scheduling algorithms while preserving their temporal constraints.
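The over-design argument above can be illustrated with a small sketch. It rests on a simplified model that is an assumption of this example and is not taken from the thesis: a single task whose deadline equals its WCET at full clock frequency, a cycle count that does not change with frequency, and an energy per cycle that scales with the square of the normalized frequency (supply voltage lowered in proportion to frequency). The function names and the numbers are hypothetical.

```python
def slowest_safe_frequency(actual_cycles, wcet_cycles):
    """Smallest normalized clock frequency (1.0 = full speed) that still
    finishes the task by a deadline sized for the WCET at full speed."""
    if not 0 < actual_cycles <= wcet_cycles:
        raise ValueError("need 0 < actual_cycles <= wcet_cycles")
    return actual_cycles / wcet_cycles


def relative_energy(freq):
    """Energy relative to full speed for a fixed cycle count, under the
    ASSUMED model: energy per cycle proportional to freq squared."""
    return freq ** 2


# Hypothetical task: budgeted for 1,000,000 cycles, needs 600,000 on average.
f = slowest_safe_frequency(600_000, 1_000_000)
print(f"normalized frequency: {f:.2f}")
print(f"relative energy: {relative_energy(f):.2f}")
```

Under these assumptions the task can run at 60% of the full clock rate and still meet its deadline, using roughly a third of the full-speed energy. A real scheduler would also have to account for static power and for the discrete frequency steps that the hardware offers.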

1.1.2 IC Fabrication Technology Constraints

Integrated circuits in their various incarnations consume some amount of electric power. This power is dissipated both by the action of the switching devices contained in the IC (such as transistors) and as heat due to the resistivity of the electrical circuits. This is a major consideration in the design of microprocessors and the embedded systems they are used in [11]. Figure 1.1 shows the power consumption of the Intel series of processors produced over the last two decades [1]. The horizontal axis shows the advancement in IC fabrication technology in terms of chip geometry (in nanometers), while power dissipation is plotted in Watts. Each point is marked with two numbers, showing chip geometry and power consumption, respectively. Points lying on the same vertical line, such as (350,43) and (350,34.8), show processors in the same technology but with different performance; e.g., (350,43) and (350,34.8) correspond to the PII 300 MHz and the PII 233 MHz, respectively. Similarly, the P4 3 GHz was fabricated at 130 nm and dissipates 81.9 W, while the later P4 EE 3.40 GHz is fabricated at the smaller 90 nm geometry and dissipates 83.9 W; it was further improved for a higher operating frequency (P4 EE 3.73 GHz) at the same geometry, but at the penalty of an increase in power consumption to 115 W. The increasing trend towards special purpose core processors has further reduced the geometry down to 65 nm, with a power consumption of 130 W (for the Intel Core 2 Extreme QX6700). Readers are encouraged to read [1] [12] [13] for a detailed view of power versus technology trends realized by various CPU manufacturers.

Attempts to shape the power-geometry envelope (shown as a shoe in Figure 1.1) reach their limits at a fabrication technology of 50 nm, where leakage current starts dominating the power consumption (discussed further in Chapter 2). Although special purpose core processors are implemented at 50 nm [14] [12] with a power consumption of 14.5 W (shown at the bottom of the heel in Figure 1.1), their operating frequency is limited to 130 MHz, which is not sufficient to meet the current demand for multimedia processing. The designers' goal of achieving a low-leakage 'heel' in the power-geometry shoe is associated with a high power cost. This cost has two components. The first is the thermal cost, which is associated with keeping the devices below the specified operating temperature limits. Maintaining the integrity of packaging at higher temperatures also requires expensive solutions. The second component is the on-board power delivery cost, which is related to on-board decoupling capacitances and interconnects associated with the power distribution network. Moreover, the trend towards driving the CPU at lower operating voltage and higher frequency increases the magnitude of the current drawn by the CPU. This exacerbates resistive and inductive noise problems and leads to a significant increase in system cost.

Fig. 1.1: Power consumption for Intel CPUs [1].

Figure 1.2 gives an idea of the range of dollar amounts associated with the above costs for different system components [2]. As can be seen, when the system power is in the 35-40 W range, the cost of each additional Watt tends to grow above $1/W per chip. Designers have already pushed the fabrication limits to achieve low-energy design goals [15]. E.g., shrinking the integrated circuit geometry below 50 nm doubles the leakage current as compared to 65 nm. Such issues heighten the need to consider low-energy design more rigorously at higher levels of the system hierarchy [5].

1.1.3 Battery Technology Constraints

The energy constraints on mobile devices are becoming increasingly tight as complexity and performance requirements continue to be pushed by user demand [16]. Processor speeds have doubled approximately every 18 months, as predicted by Moore's law [17]. While processor speed and energy consumption have increased rapidly, the corresponding improvement in battery technology has been slow. In fact, battery capacity has increased by a factor of less than four in the last three decades [3] [18].


Fig. 1.2: Thermal and power delivery cost in a desktop PC [2].

Figure 1.3 shows the current state of the art in battery technology. The increase in battery capacity is hampered by the limits of ionization chemistry [3] [19]. The design target of batteries with a long life span and small size is hard to achieve. E.g., though Ni-MH is lighter than Ni-Cd, it requires a longer recharging time. In the same vein, Li-Ion batteries are more promising thanks to their higher energy density, large number of charging cycles, little memory effect, and longer shelf life, but their higher cost and the need for additional external protection against discharging inhibit their wide low-cost use. In short, the technological constraints on realizing high-capacity, small-size batteries highlight the importance of low-energy design.

Fig. 1.3: Battery technologies and their capacities [3].

1.1.4 Architecture-Application Correlation Slacks

Traditionally, optimal MES performance is sought by focusing on the underlying hardware architecture. This ignores the fact that it is the software executing on a CPU that determines its energy consumption. The execution time and energy consumption of a program on any parallel processor depend not only on the composition of operations contained within the program, but also on the ability of users to express the parallelism at the correct granularity level for the processor. Therefore, to fairly compare the cycle-energy performance of two applications on a given processor, two different mappings will be required, one for each application. An integrated approach that considers energy-cycle performance at the architecture as well as the application level is essential for energy-efficient application development.

1.2 Design Space Exploration

Program behavior is difficult to predict due to its heavy dependence on application and run-time conditions [20] [21]. For mobile computing, application performance can be optimized by using parallel hardware architectures, such as Very-Long Instruction Word (VLIW) architectures [22] [23]. VLIW architectures are a suitable alternative for exploiting instruction-level parallelism (ILP) in programs, that is, for executing more than one basic (primitive) instruction at a time. These processors contain multiple functional units. They fetch from the instruction cache a Very-Long Instruction Word containing several primitive instructions, and dispatch the entire VLIW for parallel execution. These capabilities are exploited by compilers, which generate code in which independent primitive instructions that can execute in parallel are grouped together. The processors have relatively simple control logic because they perform neither dynamic scheduling nor reordering of operations (as most contemporary superscalar processors do). The instruction set of a VLIW architecture tends to consist of simple (RISC-like) instructions. The compiler must assemble many primitive operations into a single "instruction word" such that the multiple functional units are kept busy, which requires enough instruction-level parallelism (ILP) in a code sequence to fill the available operation slots.
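The grouping of independent primitive operations into instruction words can be sketched as a greedy packing loop. This is a minimal illustration and not the scheduler of any particular compiler: it assumes that every operation completes in one cycle and that any operation may occupy any slot, which ignores the functional unit assignment of a real VLIW processor. The function `pack_vliw` and the six-operation dependence table are hypothetical.

```python
def pack_vliw(deps, slots):
    """Greedily pack operations into instruction words of at most `slots`
    operations. `deps` maps each operation to the set of operations whose
    results it needs; an operation may only be placed in a word that comes
    after the words holding all of its producers."""
    remaining = dict(deps)      # operations not yet scheduled, in program order
    done = set()                # operations placed in earlier words
    words = []
    while remaining:
        ready = [op for op, needs in remaining.items() if needs <= done]
        if not ready:
            raise ValueError("cyclic or unknown dependence")
        word = ready[:slots]    # fill the available operation slots
        words.append(word)
        done.update(word)
        for op in word:
            del remaining[op]
    return words


# Hypothetical six-operation code sequence.
deps = {
    "a": set(), "b": set(), "c": {"a", "b"},
    "d": set(), "e": {"c"}, "f": {"d"},
}
words = pack_vliw(deps, slots=3)
used = sum(len(w) for w in words)
print(words)
print(f"slot utilization: {used / (3 * len(words)):.2f}")
```

With three slots the six operations fit into three words instead of six, but only two thirds of the slots are filled, because the dependence chain a, c, e limits the available parallelism. Unfilled slots of this kind are one example of the slack that source-level transformations can try to reduce.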

In mobile computing software design, the conventional software development environment (for compilation and machine code generation) cannot be used. In these methods, the execution time and code size are primarily considered, while the energy dissipation issue is piggy-backed onto the final design; this inevitably leads to an expensive cooling mechanism and eventually increases the overall system cost while reducing reliability.

The software perspective on power consumption has been the subject of the work in [24], where a detailed instruction-level power model of the Intel 486DX2 was built. The impact of software on CPU power and energy consumption, and software optimizations to reduce them, were studied. It is well known that the number of useful instructions always differs from the number of instructions in the static code. The code execution flow determines the number of useful instructions according to the input data. Therefore, computing the total energy consumed merely by adding the energy consumption of individual instructions does not provide the actual energy consumption of the program as claimed in [24].
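The difference between the static code and the instructions actually executed can be illustrated with a toy example. The per-instruction energy costs, the four-instruction program and the function names below are hypothetical and serve only to show how the estimate changes once the execution flow, driven by the input size, is taken into account; they are not measurements from this thesis or from [24].

```python
# Hypothetical per-instruction base energy costs in nanojoules.
BASE_ENERGY_NJ = {"load": 4.0, "mac": 6.0, "branch": 3.0, "store": 5.0}

# Static code listing: (instruction, region). The loop body is executed
# once per input sample, the exit code once per call.
STATIC_CODE = [
    ("load", "loop"), ("mac", "loop"), ("branch", "loop"), ("store", "exit"),
]


def static_energy_nj(code):
    """Sum the cost of every instruction in the listing exactly once."""
    return sum(BASE_ENERGY_NJ[op] for op, _ in code)


def dynamic_energy_nj(code, n_samples):
    """Weight every instruction by how often the execution flow reaches it."""
    total = 0.0
    for op, region in code:
        runs = n_samples if region == "loop" else 1
        total += runs * BASE_ENERGY_NJ[op]
    return total


print(static_energy_nj(STATIC_CODE))
for n in (8, 64):
    print(n, dynamic_energy_nj(STATIC_CODE, n))
```

The static sum stays the same for every input, while the execution-weighted sum grows with the number of samples, which is why energy figures have to be tied to the executed instruction stream rather than to the code listing.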

In this thesis we propose a framework in which software applications optimally utilize the hardware architecture to deliver energy-cycle performance within user-defined constraints. Our energy-aware framework in [25] meets this demand by incorporating the following features in a native multimedia DSP compilation environment.

1) The framework transforms the legacy application source code into optimal ’C’ source

code, taking advantage of different slacks appearing in the application-to-binary devel-

opment hierarchy.

2) Unlike conventional techniques, ’C’ source code is iteratively compiled for different

performance goals both in terms of execution time as well as energy dissipation.

3) We developed post-profiling techniques published in [26] to evaluate the application

performance not only at compilation layer (as conventional compiler does) but also at

scheduling layer, linker layer, machine code generation layer and finally at loader layer.

4) We measure the real-time performance of applications running on actual hardware.

These measured parameters are further used to tune the transformation scheme of the

legacy software application.

5) We tested our framework at different applications that belong to diversified industrial

domains such as audio transcodecs [27], video transcodecs [8], speech codecs, and

bioinformatics applications [28] [29].

6) The work is further extended in [30] [27] to characterize the application-architecture correlation, which is well suited for a pre-design assessment of an embedded system design. It answers the question of whether a given hardware architecture is an appropriate choice for a given multimedia software application.

Note that the terms power consumption and energy consumption are often used interchangeably. It is important to distinguish between the two in the context of programs running on mobile systems: power is the rate at which energy is consumed, whereas a mobile system runs on the limited energy available in a battery. Therefore, the energy consumed by the system, or by the software running on it, determines the length of the battery life.

This thesis is based on the following publications.

• N. Zafar Azeemi, A. Sultan, "Characterization of Bioinformatics Applications on Multimedia Processor", in Proc. IEEE Cairo International Biomedical Engineering Conference (CIBEC ’06), pages BI06-BI09, 195 - 200, Cairo, Egypt, December, 2006.

• N. Zafar Azeemi, "Handling Architecture-Application Dynamic Behavior in Set-top Box Applications", in Proc. IEEE International Conference on Information and Automation (ICIA ’06), pages 195 - 200, Colombo, Sri Lanka, December, 2006.

• N. Zafar Azeemi, A. Sultan, A. Muhammad, "Parameterized Characterization of Bioinformatics Workload on SIMD Architecture", in Proc. IEEE International Conference on Information and Automation (ICIA ’06), pages 189 - 194, Colombo, Sri Lanka, December, 2006.

• N. Zafar Azeemi, "Multicriteria Energy Efficient Source Code Compilation for Dependable Embedded Applications", in Proc. IEEE International Conference on Information Technology (IIT ’06), Dubai, UAE, November, 2006.

• N. Zafar Azeemi, "Compiler Directed Battery-Aware Implementation of Mobile Applications", in Proc. IEEE 2nd International Conference on Emerging Technologies (ICET ’06), pages 151 - 156, Peshawar, Pakistan, November, 2006.

• N. Zafar Azeemi, "A Multiobjective Evolutionary Approach for Constrained Joint Source Code Optimization", in Proc. ISCA 19th International Conference on Computer Application in Industry (CAINE ’06), pages 175 - 180, Las Vegas, Nevada, USA, November, 2006.

• N. Zafar Azeemi, "Probabilistic Iterative Compilation for Source Optimization of Embedded Programs", in Proc. 2006 IEEE International SoC Design Conference (ISOCC ’06), pages 323 - 328, Seoul, Korea, October, 2006.

• N. Zafar Azeemi, M. Rupp, "Multicriteria Low Energy Source Level Optimization of Embedded Programs", in Proc. Tagungsband zur Informationstagung Mikroelektronik (ME ’06) IEEE Austria, pages 150 - 158, Vienna, Austria, October, 2006.

• N. Zafar Azeemi, "Architecture-Aware Hierarchical Probabilistic Source Optimization", in Proc. ISCA 19th International Conference on Parallel and Distributed Computing Systems (PDCS ’06), pages 90-95, San Francisco, USA, September, 2006.

• N. Zafar Azeemi, "Power Aware Framework for Dense Matrix Operations in Multimedia Processors", in Proc. IEEE 9th International Multi-topic Conference (INMIC ’05), Karachi, Pakistan, December, 2005.

• N. Zafar Azeemi, M. Rupp, "Energy-Aware Source-to-Source Transformations for a VLIW DSP Processor", in Proc. IEEE 17th International Conference on Microelectronics (ICM ’05), pages 133 - 138, Islamabad, Pakistan, December, 2005.

• N. Zafar Azeemi, "A Framework for Architecture Based Energy-Aware Code Transformations in VLIW Processors", in Proc. International Symposium on Telecommunication (IST ’05), pages 393 - 398, Shiraz, Iran, September, 2005.

1.3 Thesis Outline

This thesis is organized in five chapters, as shown in Figure 1.4. A brief description of

each chapter is given below.

Chapter 1: We discuss the different design limitations, such as high level system design

constraints, fabrication level constraints, battery technology constraints etc. We explore

the design slacks that exist in contemporary work [31] [24] [5] for energy aware code

optimization. We explain the thesis structure and provide a detailed list of contributions.

Chapter 2: This chapter lays the necessary foundation for the development of our energy-cycle aware iterative compilation framework. Our methodology optimizes a software application for energy consumption, execution time, and efficient utilization of the hardware architecture. In contrast to [5] [32] [33] [34], we elaborate our method for generic multimedia processors. Unlike [35] [36], we define a software application in terms of its architectural behavior. We provide a simplified overview of typical multimedia processors. Although various multimedia operation models are presented in [37] [31] [38] [39] [40], their complexity prevents them from being readily usable in a real-time optimization environment. We use a simplified multimedia operation model developed in [4] that views the instruction set in terms of load/store operations, compute operations, special register operations, and control flow operations. Measuring the energy consumed by an application on a real-time platform is the first step in any energy-constrained embedded system design and can be used to estimate the battery lifetime of the system. The experimental setup proposed in [5] [32] [41] for instruction/program current measurement, addressing modes, immediate operands, and exhaustive characterization is very time consuming. We present a measurement platform that is generic and applicable to most off-the-shelf multimedia processors. It is based on current measurement at both the processor and the memory input lines. Unlike the instruction-based energy model presented in [42] [24], we propose a simplified energy consumption model based on code blocks. We describe a step-by-step procedure for measuring the energy consumption of a software application on a target hardware architecture. In contrast to [24] [32] [41], we apply our framework to two major application domains, multimedia and bioinformatics. The multimedia application set consists of encoders and decoders (transcodecs) encompassing three media types: speech, video, and audio (music). We categorize the basic functionality offered by bioinformatics tools into four groups: pattern recognition algorithms, rule-based analysis, biological databases, and biological taxonomy. The results published in [28] [29] demonstrate the usefulness of our framework across diversified application domains. Several energy reduction opportunities at the design level are also presented.

Fig. 1.4: Thesis Structure.

Chapter 3: Our energy-cycle aware compilation framework is powered by a source code transformation engine. Unlike [43] [42] [24], we implement our scheme by first investigating the ’C’ source code of the application for blocks that are costly in cycles and energy, based on trace data collected while profiling the application as described in Chapter 2. We then present a novel heuristic that searches the solution space for an optimal source code transformation scheme. The algorithm executes a solution and evaluates the energy-time tradeoff based on a user-defined metric. Based on this evaluation, it selects the next solution to be evaluated. The heuristic terminates when the desired objectives are achieved. Our gradient mode iterative compilation scheme has two salient features. First, it queues the code blocks such that blocks with similar expression profiles, which are most likely to benefit from the same transformation scheme, are handled together. Second, it completes in a discrete number of steps determined by the number of code blocks, whereas the schemes of Sinha et al. [33] and Tiwari et al. [5] require searches that grow exponentially as the number of code blocks increases. We illustrate our scheme by analyzing a video encoding application (MPEG-1 encoder). Further merits and demerits of the scheme are explained for different application scenarios.

Chapter 4: The gradient mode iterative compilation proposed in the previous chapter belongs to a class of compilation termed feedback-directed compilation. It brings a relatively small improvement, as it effectively restricts itself to trying different back-end optimizations. The major impediment to such an approach is the heuristic search technique itself. Unlike [32] [41], in this chapter we consider the optimization problem as a single task in which all desired aims have to be taken into account simultaneously. We present a new method based on the optimization of a multicriteria objective function, where the desired aims of architecture-based energy-cycle optimization are formulated as penalty terms of the objective function. Further, we describe how the maximization of the objective function can be achieved using a Genetic Algorithm (GA). The interface of the proposed methodology to our energy-cycle aware compilation framework is also explained. We also present the details of our methodology, e.g., the selection of constraints, the development of the fitness function, and the formation of the Hertz matrix. We discuss two multimedia applications in depth to illustrate the advantages of the algorithm.

Chapter 5: In this chapter we introduce the concept of application-architecture characterization with the help of our ECACF and multivariate statistics techniques. To our knowledge, this is the first attempt to obtain such a characterization from application expression profiles.

The application-architecture correlation is a bidirectional process that matches the algorithmic structure with the hardware architecture and vice versa. The programmer benefits from this mapping and can produce better source code. Applications of similar functionality may yield similar Application Expression Profiles (AEP) and hence can be suitable for similar hardware platforms. Despite the simplicity of our methodology, the analysis of the large matrices provided by an application expression profile under different levels of transformation on different architectures is not trivial and requires an advanced knowledge of discovery processes. To this end, we introduce a new methodology to evaluate application portability using multivariate statistics. We demonstrate how box plots, scree plots, and PCA biplots can be used to characterize an application on a given hardware architecture. We present the details of the methodology by exploring the AEP across three different hardware platforms for diversified applications. Finally, we demonstrate how dAEP can be used to assess legacy code portability across platforms.

2. ENERGY-CYCLE AWARE COMPILATION FRAMEWORK (ECACF)

Miniaturization of computing systems is finding applications in special areas such as hand-held computation, tiny robots, and guidance systems in automated vehicles, to name just a few. Moreover, these systems or their users move from place to place. Because of their small size and their mobility requirement, they are powered by batteries of low rating. To avoid frequent recharging and/or replacement of the batteries, there is significant interest in low-energy system design. Energy consumption is an area of growing concern in system design. It affects a variety of system-related issues, such as battery life, thermal limits, packaging constraints, and cooling options [44]. Though energy is actually consumed by the hardware, energy consumption can be reduced, apart from using low-energy electronics, by suitably manipulating the software, because the hardware activities are controlled through the software. Let a program X run for $T$ seconds to achieve its goal, let $V_{CC}$ be the supply voltage of the system, and let $I$ be the average current in amperes drawn from the power source during these $T$ seconds. We can write $T = N \times \tau$, where $N$ is the number of clock cycles and $\tau$ is the clock period. Then the amount of energy consumed by X to achieve its goal is $E = V_{CC} \times I \times N \times \tau$ joules. Since both $V_{CC}$ and $\tau$ are fixed for a given hardware, $E \propto I \times N$. However, at the application level it is more meaningful to talk about $T$ than $N$, and therefore we express energy as $E \propto I \times T$. This expression is the foundation of our ECACF. It shows the main idea in the design of energy-efficient software, which is to reduce both $T$ and $I$. The (average case) running time of an algorithm gives a measure of $T$. To compute $I$, however, one must consider the current drawn during each clock cycle. This is illustrated in Section 2.5.
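The energy expression above can be evaluated directly. The following is a minimal sketch in Python; the function name and all numeric values (supply voltage, currents, cycle counts, clock period) are hypothetical and chosen only for illustration, not taken from measurements in this thesis.

```python
def program_energy_joules(v_cc, avg_current_a, cycles, clock_period_s):
    """Energy of one program run: E = V_CC * I * N * tau (in joules)."""
    return v_cc * avg_current_a * cycles * clock_period_s


# Hypothetical values: 3.3 V supply, 10 ns clock period (100 MHz).
V_CC = 3.3
TAU = 1e-8

# A baseline program: 1,000,000 cycles at 0.20 A average current.
baseline = program_energy_joules(V_CC, 0.20, 1_000_000, TAU)

# A transformed program: fewer cycles, but slightly higher average current.
optimized = program_energy_joules(V_CC, 0.22, 700_000, TAU)

print(f"baseline  = {baseline * 1e3:.3f} mJ")
print(f"optimized = {optimized * 1e3:.3f} mJ")
```

With these assumed numbers the transformed program consumes less energy even though it draws more current, because the product of current and cycle count is smaller.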

Given that power is the rate of energy consumption, we refer to power and energy interchangeably in this thesis. Low power design is a complex endeavor requiring a broad range of strategies, from floor planning on the silicon substrate to the design of application software. Table 2.1 lists several strategies for achieving energy efficiency in an energy-conscious system design. In the following section, we review some of these strategies.

Power Reduction Strategies                    MES Design Levels
Fabrication Level Power Reduction             Low level
Processor Level Power Reduction               Intermediate level
EDA Tools Level Power Reduction               High level
Compiler Level Power Reduction                High level
Low Power Data Structures                     High level
Idle Mode Power Reduction                     Intermediate level
Power Reduction in Distributed Computing      High level
Power Reduction in Communication Systems      High level
Battery Aware Power Reduction                 High level

Tab. 2.1: Energy reduction techniques for embedded system design.

2.1 Energy Saving Techniques - A Review

We review a wide spectrum of strategies, shown in Table 2.1, ranging from the hardware fabrication process to energy-efficient communication systems. Energy savings from different approaches are, in the best case, multiplicative. E.g., in an IDCT application implemented in [44] [45] [46] [47], a 30% energy saving from low-energy electronics together with a 23% saving from compiler techniques yields a total energy saving of

$(1 - (1 - 0.30)(1 - 0.23)) \times 100\% = 46.1\%$.

However, the total energy saving is generally smaller, say 34% in this example, because the various energy saving strategies may adversely affect each other.
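The best-case (multiplicative) combination can be written as a small helper. This is an illustrative sketch; the function name is our own, and it models only the ideal case in which the individual savings do not interfere with each other.

```python
def combined_saving(savings):
    """Best-case total saving when independent savings multiply.

    Each entry is a fraction in [0, 1]; the remaining energy is the
    product of the individual remaining fractions (1 - s).
    """
    remaining = 1.0
    for s in savings:
        remaining *= (1.0 - s)
    return 1.0 - remaining


# The IDCT example from the text: 30% from electronics, 23% from the compiler.
total = combined_saving([0.30, 0.23])
print(f"best-case total saving: {total * 100:.1f}%")
```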

2.1.1 Fabrication level power reduction

The power consumption in a CMOS digital circuit is expressed as [48]

$P = C_L V_{DD}^2 f_p + I_{SC} V_{DD} + I_{leakage} V_{DD}$ (2.1)

where $V_{DD}$ is the supply voltage, $f_p$ is the output switching frequency, $C_L$ is the output capacitance load, $I_{SC}$ is the short circuit current pulse generated when both n- and p-transistors are briefly turned on during output switching, and $I_{leakage}$ is the leakage current. The first term on the right-hand side of the power equation is the dominant factor [48]. It is expected that power savings of two orders of magnitude can be achieved using low-power electronics. About half of the power reduction will come from architecture changes and management of the switching activity ($f_p$). The other half will come from using advanced materials technology to allow a reduction of $V_{DD}$ from 5 or 3.5 V to 1 V or below, while also reducing $C_L$ [48] [49].
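The quadratic dependence of the dominant term of Equation (2.1) on the supply voltage can be checked numerically. The sketch below is only an illustration; the load capacitance and switching frequency are hypothetical values, and the short circuit and leakage currents are set to zero to isolate the dynamic term.

```python
def cmos_power(c_load, v_dd, f_switch, i_sc=0.0, i_leak=0.0):
    """Evaluate Equation (2.1): dynamic + short-circuit + leakage power (W)."""
    dynamic = c_load * v_dd ** 2 * f_switch
    short_circuit = i_sc * v_dd
    leakage = i_leak * v_dd
    return dynamic + short_circuit + leakage


# Hypothetical circuit: 10 pF load switching at 100 MHz.
p_5v = cmos_power(c_load=10e-12, v_dd=5.0, f_switch=100e6)
p_1v = cmos_power(c_load=10e-12, v_dd=1.0, f_switch=100e6)

# Lowering V_DD from 5 V to 1 V reduces the dynamic term by 5^2 = 25.
reduction_factor = p_5v / p_1v
print(f"P(5 V) = {p_5v * 1e3:.1f} mW, P(1 V) = {p_1v * 1e3:.1f} mW")
print(f"reduction factor = {reduction_factor:.1f}")
```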

2.1.2 Processor level power reduction

Mobile embedded systems require small form factors, and hence processors designed for high-end desktops are not suitable for such applications. Havinga et al. in [50] show that microprocessors can account for up to 33% of a typical notebook power budget, which is around 15 W. Therefore, processor designers include a number of features to reduce power consumption. E.g., in the TriMedia processor TM130x [4] and the Blackfin processor ADSP533S, some of the power reduction features are dynamic idle-time shutdown of separate execution units, low-power cache design, and power considerations for standard cells, data-path elements, and clocking. The processor also supports three static power management modes: doze, nap, and sleep [51]. These modes reduce power at a global level when the processor is idle for an extended period of time. Since CMOS circuits consume power during the charging and discharging of capacitances, reducing switching activity saves power. At the architecture level, two strategies to reduce switching activity are Gray code addressing and cold scheduling of instructions [52] [53]. Experimental results show that cold scheduling reduces switching by 20-30%. The advantage of the Gray code over the binary code is that consecutive addresses differ in only one bit, so each sequential memory access changes the address by a single bit. Thus, a significant number of bit switches can be eliminated using Gray code addressing. Also, [54] suggests that, by decomposing a finite-state machine into several submachines, it is possible to selectively turn off portions of a circuit, thereby reducing the switching activity. Tiwari et al. [31] have studied the idea of shutting off, on a per-clock-cycle basis, parts of a logic circuit that are not needed in a particular computation. This saves the power used in all the useless transitions in those parts of the circuit. Burd et al. in [55] and Govilak et al. in [56] have suggested that power consumption in a CPU can be reduced by dynamically changing its operating frequency and voltage. Further studies exposing the role of prediction and of smoothing in dynamic speed-setting policies are discussed in [57]. Havinga and Smit [50] propose energy saving by exploiting locality of reference with dedicated, optimized modules. The idea of locality of reference is to offload as much work as possible from the CPU to programmable modules that are placed in the data streams.
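The benefit of Gray code addressing for sequential accesses can be illustrated by counting address-bit transitions. The following sketch uses the standard binary-reflected Gray code; the helper names and the 16-address sweep are our own choices for illustration and do not come from [52] [53].

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def bit_flips(sequence):
    """Total number of bit positions that change between successive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))


# A sequential sweep over 16 addresses (a 4-bit address bus).
addresses = list(range(16))
binary_flips = bit_flips(addresses)
gray_flips = bit_flips([to_gray(a) for a in addresses])

print("binary-coded address transitions:", binary_flips)
print("Gray-coded address transitions:  ", gray_flips)
```

For a purely sequential sweep the Gray-coded bus toggles exactly one line per access, whereas the binary-coded bus toggles several lines whenever a carry propagates. For non-sequential access patterns the advantage is not guaranteed.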

2.1.3 EDA tools level power reduction

The design of low-power systems cannot be achieved without good power-conscious

EDA tools. EDA tools are used at all levels of hardware design: behavioral, architectural,

logic and physical. For a detailed exposition of power-conscious EDA tools, the reader

is referred to the tutorials in [58] [59] [14].


2.1.4 Compiler level power reduction

Compiler design techniques contribute to energy saving in several ways [60] [61]. Kolson and Nicolau [62] [40] [63] address the problem of allocating memory to variables in embedded DSP (digital signal processing) software. The goal is to maximize simultaneous data transfers from different memory banks to registers [64] [65] [66]. In several DSP applications mentioned in [67] [68], two registers are loaded with the required data and an arithmetic operation is performed. Loading two registers with a single double transfer instruction draws slightly more current than a move instruction, and each of the two instructions takes one clock cycle. Nevertheless, energy is saved by using the double transfer, because it loads both registers in one clock cycle, whereas two clock cycles are needed to load the registers sequentially. Experimental results for a few applications on a Blackfin DSP processor in [30] show that up to 47% of energy can be saved by this approach. Instructions with memory operands have much higher energy costs than instructions with register operands [30]. This suggests that energy can be saved by suitably assigning the live variables of a program to registers. However, a processor has only a small number of registers. When the number of simultaneously live variables is larger than the number of available registers, some of the variables must be spilled to memory. Register assignment for loop variables is important because loops are typically executed many times. Algorithms for optimal register assignment to loop variables are presented in [69] [70] [71] [62]. These algorithms can be included in the code generation part of a compiler.
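The double transfer argument follows from energy being proportional to current times cycle count for fixed supply voltage and clock period. The sketch below makes this explicit; the normalized current values are hypothetical (the text only states that the double transfer draws "a little more" current), so the resulting percentage is illustrative and is not the 47% figure reported in [30].

```python
def relative_energy(current, cycles):
    """Energy up to a constant factor: E is proportional to I * N."""
    return current * cycles


# Hypothetical normalized currents: a double transfer draws 10% more
# current than a single move instruction.
I_MOVE = 1.00
I_DOUBLE = 1.10

two_moves = relative_energy(I_MOVE, cycles=2)      # two sequential loads
one_double = relative_energy(I_DOUBLE, cycles=1)   # one double transfer

saving = 1 - one_double / two_moves
print(f"relative saving from the double transfer: {saving * 100:.0f}%")
```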

2.1.5 Low power data structures

Kondo et al. [72] propose a method of implementing set data types with minimum power

consumption. In a programming language, one can implement the set data type using a

variety of concrete data structures such as arrays, pointer arrays, linked list and binary

tree [73]. Thus, to implement the set operations, such as locate, insert, and remove

a record from a set, one has to manipulate the memory elements in a concrete data

structure as proposed in [74] [75] [33] [42]. It is the memory accesses in the process

of set operations that actually consume power. Thus, the power consumption in set

operations is a function of the number of memory elements used in implementing a set

data type, the number of read and write operations performed in the implementation,

and some logic details such as capacitance of memory elements, voltage level, and

frequency of operation. The concrete data structures are compared on the basis of a

filling factor, which is the fraction of the locations that would be filled if implementation

is in arrays [76] [77] [78]. It has been shown that for different levels of filling factor,

different concrete data structures lead to low values of the power cost function. E.g.,

for filling factors greater than 60%, arrays are better in implementing energy efficient

set data types [72].

2.1.6 Idle mode power reduction

The doze mode is an innovative approach to conserving energy [79] [80] [81] [60]. It is

very attractive in a communication environment where a mobile system may occasionally

send or receive messages. In the doze mode, the clock speed is reduced and no user

process is executed. Rather, a mobile host simply waits for any incoming message. Upon

receiving a message, the host resumes its normal mode of operation. The energy saving

due to this mode depends on the local computations on a mobile and the pattern of

communication between a mobile and a support station [82]. Simulation studies in [41]

show that the energy saving due to this mode spreads over a wide range of 2-98%.

2.1.7 Power reduction in distributed computing systems

Agent-based computation is a relatively new idea in distributed computing [83] [81] [84]. General agent-based distributed computing systems have been designed using the concept of Linda's tuple space [85]. Wei et al. [86] discuss how energy-efficient distributed algorithms in a mobile computing environment can be designed using a tuple space managed on the fixed network of a mobile system. Lin et al. [22] propose a power-efficient commit protocol that supports conventional two-phase commit services. A distributed autonomous system called Noah (Network oriented application harmony), built in the Mitsubishi laboratory, has been proposed in [87]. Though the purpose of Noah is not to save energy, it demonstrates how agent-based systems can be built using a tuple space as the medium for process communication. By shifting most of the workload to peer fixed hosts, the load, the power consumption, and the number of messages exchanged via expensive wireless links in a mobile host are greatly reduced.

2.1.8 Power reduction in communication systems

The receiver subsystem of a mobile station need not be active all the time [88]. Most

digital cellular and cordless systems provide power cycling at the mobile units. Mobile

stations can periodically relax (power cycle) their receivers as a means of conserving

energy. Since the receiver of a mobile unit is not continuously ready to receive messages

from the local support station (base station), some kind of coordination between a base

station and a mobile unit is necessary. Salkintzis et al. [89] propose a page-and-answer

protocol. Intuitively, the protocol works as follows:

When a base station has a message for a mobile unit, the base station sends a small

paging packet to the mobile unit. If the mobile unit receives the paging packet, that

is if the mobile receiver is up, the mobile sends an answer packet to the base station.

Obviously, if the paging message is sent at a time when the receiver is powered off, no

answer packet is generated by the mobile and the base station will once again page the


mobile after some time. Upon receiving an answer packet, the base station sends the

desired message to the mobile unit.
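The page-and-answer exchange described above can be sketched as a small slot-based simulation. This is only an illustration of the intuition given in the text, not the protocol of [89]: the time-slot model, the fixed retry interval, and the function name are assumptions made here.

```python
def page_and_answer(receiver_up, first_page_slot, retry_interval):
    """Simulate paging a power-cycled mobile receiver.

    receiver_up: list of booleans, True if the receiver is on in that slot.
    Returns (slot in which the page is answered, number of pages sent);
    the slot is None if no page is answered within the schedule.
    """
    pages_sent = 0
    slot = first_page_slot
    while slot < len(receiver_up):
        pages_sent += 1
        if receiver_up[slot]:
            # The mobile answers, so the base station can send the message.
            return slot, pages_sent
        slot += retry_interval  # no answer: page again later
    return None, pages_sent


# Hypothetical duty cycle: the receiver is on in one slot out of four.
schedule = [i % 4 == 0 for i in range(12)]
duty_cycle = sum(schedule) / len(schedule)

result = page_and_answer(schedule, first_page_slot=1, retry_interval=1)
print("receiver duty cycle:", duty_cycle)
print("answered in slot, pages sent:", result)
```

The sketch shows the tradeoff: a lower duty cycle saves receiver energy at the mobile unit, but the base station needs more paging attempts and the message is delivered later.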

Kravets and Krishnan [90] propose power saving by selectively choosing short periods

of time to suspend communications and shut down the communication device. Applying

this method to a transport protocol and using three simulated communication patterns,

they have achieved up to an 83% saving in the energy consumed by the communication

system. Chlamtac et al. [91] address the problem of wireless access protocols which

include an energy constraint and develop three energy conserving protocols for various

loads: grouped-tag TDMA, directory, and pseudorandom. Singh et al. [92] argue that

there is a need for using power-aware metrics, such as minimize energy consumed per

packet, minimize variance in node power levels, maximize time to network partition, etc.,

in the design of power-efficient routing protocols. They show that using these metrics in a

shortest-cost routing algorithm reduces the cost per packet of routing by 5-30% over shortest-hop routing.

2.1.9 Battery aware power reduction

Chiasserini and Rao [18] have shown how battery behavior can be exploited to prolong

battery life. In particular, they identify the phenomenon of charge recovery that takes

place under pulsed discharge conditions as a mechanism that can be exploited to enhance

the capacity of an energy cell. The bursty nature of many data traffic sources suggests

that there might be a natural fit between the two. Bai and Lai [93] implement several

methods that let a low-power CPU efficiently perform computation-intensive

tasks, such as graphic image processing and display. Their methods include reducing

the computation complexity of bitmap file processing, using fixed-point math instead

of floating point math, prestoring the table of trigonometric functions, and using a few

lines of assembly language code in the inner loop of graphic image processing program

to improve its performance. These methods lead to a speed up of the programs by a

factor of three to six.
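Two of the listed techniques, fixed-point arithmetic and a prestored trigonometric table, can be sketched together. The code below is a generic illustration and not the implementation of [93]; the Q15 format, the one-degree table resolution, and all names are assumptions made here.

```python
import math

Q = 15                      # assumed Q15 fixed-point format
SCALE = 1 << Q              # 1.0 is represented as 32768

# Prestored quarter-wave sine table for 0..90 degrees (computed once).
SINE_TABLE = [round(math.sin(math.radians(d)) * SCALE) for d in range(91)]


def fixed_sin_deg(deg: int) -> int:
    """Sine of an integer angle in degrees as a Q15 integer, by table lookup."""
    deg %= 360
    if deg <= 90:
        return SINE_TABLE[deg]
    if deg <= 180:
        return SINE_TABLE[180 - deg]
    if deg <= 270:
        return -SINE_TABLE[deg - 180]
    return -SINE_TABLE[360 - deg]


def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q15 numbers using integer operations only."""
    return (a * b) >> Q


half = fixed_sin_deg(30)            # about 0.5 in Q15
quarter = fixed_mul(half, half)     # about 0.25 in Q15
print(half / SCALE, quarter / SCALE)
```

At run time only integer lookups, multiplications, and shifts are needed, which is the kind of work a low-power CPU without a floating-point unit handles efficiently.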

In [44], we argue that mobile application development requires us to rethink the concept

of an algorithm from the viewpoint of battery life. Instead of asking for the best result,

a user may say:

’Give me the best result you can find, using no more than X units of resource R.’

Or, one can let the system make the tradeoff between fidelity and resource consumption

by saying:

’Give me the best result you can obtain economically.’

2.2 Multimedia DSPCPU Architecture

A multimedia processor is a media processor for high-performance multimedia applications that deal with high-quality video and audio. Typically, an extended general-purpose CPU (called the DSPCPU) makes it capable of implementing a variety of multimedia algorithms from popular multimedia standards such as MPEG-1 and MPEG-2. The key features behind this powerful processor are as follows:

• A general-purpose VLIW processor core coordinates all the on-chip activities. In addition to implementing the non-trivial parts of multimedia algorithms, this processor runs a small real-time operating system that is driven by interrupts from the other units.

• DMA-driven multimedia input/output units that operate independently and that properly format data to make software media processing efficient.

• DMA-driven multimedia coprocessors that operate independently and in parallel with the DSPCPU to perform operations specific to important multimedia algorithms.

• A high-performance bus and memory system that provides communication between the processing units.

• A flexible external bus interface.

A typical multimedia processor is based on a three-level hierarchy of operators:

• Instructions

• Operations

• RISC operations

One instruction may contain up to five operations, as depicted in Figure 2.1. Each operation may execute multiple arithmetic operations. E.g., for the TriMedia DSP processor TM130x, one such operation is the command IFIR(a, b). This command contains a total of three arithmetic operations: two multiplications and one addition ($a_{HI} \times b_{HI} + a_{LO} \times b_{LO}$). Up to five operations, including two IFIR commands, can be issued in each machine cycle. The ability of TriMedia’s VLIW architecture to execute multiple operations in parallel gives it a big advantage over traditional RISC and CISC architectures found in current mass-market microprocessors.
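The arithmetic performed by IFIR(a, b) can be modeled in a few lines. The sketch below is a behavioral illustration only: it assumes that a and b are 32-bit words whose upper and lower halves are signed 16-bit values, and it does not model result width or overflow behavior; see [4] for the exact operation semantics.

```python
def signed16(x: int) -> int:
    """Interpret the low 16 bits of x as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pack(hi: int, lo: int) -> int:
    """Pack two 16-bit values into one 32-bit word (helper for the example)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def ifir(a: int, b: int) -> int:
    """Model of IFIR(a, b): a_HI * b_HI + a_LO * b_LO."""
    a_hi, a_lo = signed16(a >> 16), signed16(a)
    b_hi, b_lo = signed16(b >> 16), signed16(b)
    return a_hi * b_hi + a_lo * b_lo


# Two multiplications and one addition in a single operation:
# 3 * 4 + (-2) * 5 = 2
example = ifir(pack(3, -2), pack(4, 5))
print(example)
```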


Fig. 2.1: TriMedia VLIW instruction [4].

2.2.1 Multimedia Processor Execution Model

The multimedia processor provides a large set of general purpose registers,

generally named as r0, r1, and so on. In addition to the hardware program counter PC,

there are a few user-accessible special purpose registers to hold CPU branch addresses.

The CPU issues one long instruction every clock cycle. Each instruction consists of

several operations (five operations for the TM1300 microprocessor) [4]. Each operation

is comparable to a RISC machine instruction, except that the execution of an operation

is conditional upon the content of a general purpose register. Examples of operations

are:

IF r10 iadd r11 r12 → r13 (if r10 true, add r11 and r12 and write sum in r13)

IF r10 ld32d(4) r15 → r16 (if r10 true, load 32 bits from mem[r15+4] into r16)

IF r20 jmpf r21 r22 (if r20 true and r21 false, jump to address in r22)

Each operation has a specific, known execution latency (in clock cycles). For example,

in case of TM1300, iadd takes 1 cycle. This means that the result of an iadd operation

started in clock cycle ’i’ is available for use as an argument to operations issued in cycle

’i+1’ or later. The other operations issued in cycle ’i’ cannot use the result of iadd.

Similarly the ld32d operation has a latency of 3 cycles. The result of an ld32d operation

started in cycle ’j’ is available for use by other operations issued in cycle ’j+3’ or later.

Branches, such as the jmpf example above, have three delay slots. This means that if a

branch operation in cycle ’k’ is taken, all operations in the instructions in cycle k+1, k+2

and k+3 are still executed. In the above examples, r10 and r20 control the conditional

execution of the operations. This is also referred to as guarding, where r10 and r20

contain the guard of the operation.
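The latency rules above can be expressed as a simple schedule check. The sketch below uses only the latencies and the delay slot count stated in the text for the TM1300; the schedule representation and the function names are assumptions made for illustration, and the model ignores guards and issue-slot restrictions.

```python
LATENCY = {"iadd": 1, "ld32d": 3}   # cycles, as stated in the text
BRANCH_DELAY_SLOTS = 3


def first_violation(schedule):
    """Find the first operation that reads a result before it is available.

    schedule: list of (issue_cycle, opcode, source_registers, dest_register).
    Returns (cycle, opcode, register) for the first violation, else None.
    """
    ready_at = {}  # register -> first cycle in which its new value is usable
    for cycle, opcode, sources, dest in sorted(schedule, key=lambda op: op[0]):
        for src in sources:
            if ready_at.get(src, 0) > cycle:
                return (cycle, opcode, src)
        ready_at[dest] = cycle + LATENCY[opcode]
    return None


def first_target_cycle(branch_cycle):
    """Cycle in which the first instruction at the branch target executes."""
    return branch_cycle + BRANCH_DELAY_SLOTS + 1


good = [(0, "ld32d", ["r15"], "r16"),
        (3, "iadd", ["r16", "r12"], "r13"),
        (4, "iadd", ["r13", "r11"], "r14")]
bad = [(0, "ld32d", ["r15"], "r16"),
       (2, "iadd", ["r16", "r12"], "r13")]   # r16 is not ready until cycle 3

print(first_violation(good))
print(first_violation(bad))
print(first_target_cycle(10))
```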

The implementation of the architecture restricts the choice of operations that can be performed in parallel or packed into one instruction. For example, the DSPCPU in the TM1300 allows no more than two load/store class operations to be packed into a single instruction, as shown in Figure 2.2. Also, no more than five results (of previously started operations) can be written during any one cycle. The packing of operations is not normally performed by the programmer. Instead, the instruction scheduler takes care of converting the parallel intermediate format code into packed instructions ready for the assembler. The rules are formally described in the VLIW Description File (VDF) used by the instruction scheduler and other tools.

Fig. 2.2: TriMedia functional unit assignment [4].
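Two of the packing restrictions mentioned above (at most five operations per instruction and at most two load/store class operations) can be checked as follows. This is a simplified sketch and not the scheduler's actual rule set from the VDF; the set of load/store opcodes contains only the names that appear in this chapter.

```python
MAX_OPERATIONS = 5                           # issue slots per instruction
MAX_LOAD_STORE = 2                           # load/store class limit
LOAD_STORE_OPS = {"ld32", "ld32d", "st32"}   # opcodes named in the text


def can_pack(operations):
    """Return True if the opcodes fit into one instruction under both limits."""
    load_store_count = sum(op in LOAD_STORE_OPS for op in operations)
    return (len(operations) <= MAX_OPERATIONS
            and load_store_count <= MAX_LOAD_STORE)


print(can_pack(["ld32d", "st32", "iadd", "iadd", "iadd"]))   # fits
print(can_pack(["ld32d", "ld32d", "st32"]))                  # too many load/stores
print(can_pack(["iadd"] * 6))                                # too many operations
```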

2.2.2 Multimedia Processor Operations Overview

In this section we present a brief overview of the multimedia processor instruction set.

Readers are encouraged to refer to [4] for details.

Conditional Execution: In multimedia processor architectures, all operations are op-

tionally ’guarded’. A guarded operation executes conditionally, depending on the value

in the ’guard’ register. For example, a guarded add is written as:

IF R23 iadd R14 R10 → R13.

This should be taken to mean if R23 then R13 ← R14 + R10. The ’if R23’ clause controls the execution of the operation based on the LSB of R23. Hence, depending

on the LSB of R23, R13 is either unchanged or set to contain the integer sum of R14

and R10. Guarding applies to all TM1300 operations, except the iimm and uimm (load-

immediate) operations. Guarding controls the effect on all programmer visible state of

the system, i.e. register values, memory content, exception raising and device state.
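The guarded add above can be mimicked with a short Python sketch. It models the register file as a dictionary and assumes 32-bit wrap-around for the integer sum; both choices, as well as the function name, are illustrative assumptions rather than a description of the hardware.

```python
MASK32 = 0xFFFFFFFF  # assumption: results wrap to 32 bits


def guarded_iadd(regs, guard, src1, src2, dest):
    """Model of 'IF guard iadd src1 src2 -> dest' on a dict-based register file."""
    if regs[guard] & 1:  # only the least significant bit of the guard is tested
        regs[dest] = (regs[src1] + regs[src2]) & MASK32
    # otherwise the destination register is left unchanged
    return regs
```

With R23 = 1, R14 = 5 and R10 = 7 the call sets R13 to 12; with R23 = 2 (LSB clear) R13 keeps its previous value.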

Load and Store Operations: Memory is byte addressable. Loads and stores have to

be naturally aligned, i.e. a 16-bit load or store must target an address that is a multiple

of two. A 32-bit load or store must target an address that is a multiple of four. For


TM1300, the BSX bit in the PCSW (program control status word) register determines

the byte order of loads and stores. Only 32-bit load and store operations (e.g., ld32 and st32; see Appendix A of [4]) are allowed to access MMIO registers in the MMIO address aperture. The results are undefined for other loads and stores. A load from

a non-existent MMIO register returns an undefined result. A store to a non-existent

MMIO register times out and then does not happen. There are no other side effects of

an access to a nonexistent MMIO register. The state of the BSX bit has no effect on

the result of MMIO accesses. Loads are allowed to be issued speculatively. Loads that

are outside the range of valid data memory addresses for the active process return an

implementation dependent value and do not generate an exception. Misaligned loads

also return an implementation dependent value and do not generate an exception.
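The natural-alignment rule for loads and stores reduces to a modulo test. The helper below is a minimal sketch; its name and signature are hypothetical.

```python
def is_naturally_aligned(address, size_bits):
    """True if a load/store of `size_bits` (8, 16 or 32) at `address` is naturally aligned."""
    size_bytes = size_bits // 8
    return address % size_bytes == 0
```

A 16-bit access at address 0x1002 is aligned, while a 32-bit access at the same address is not; byte accesses are always aligned.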

Compute Operations: Compute operations are register-to-register operations. The

specified operation is performed on one or two source registers and the result is written

to the destination register.

Immediate Operations load an immediate constant (specified in the opcode) and produce

a result in the destination register.

Floating-Point Compute Operations are register-to-register operations. The specified

operation is performed on one or two source registers and the result is written to the

destination register. Unless otherwise mentioned all floating point operations observe

the rounding mode bits defined in the PCSW register. All floating-point operations

not ending in flags update the PCSW exception flags. All operations ending in flags

compute the exception flags as if the operation were executed and return the flag values

(in the same format as in the PCSW); the exception flags in the PCSW itself remain

unchanged.

Multimedia Operations are special compute operations. They are like normal compute

operations, but the specified operations are not usually found in general purpose CPUs.

These operations provide special support for multi-media applications.

Special-Register Operations: Special register operations operate on special registers,

such as program control status word, branch address holding registers etc.

Control-Flow Operations: Control-flow operations change the value of the program

counter. Conditional jumps test the value in a register, and based on this value, change

the program counter to the address contained in a second register or continue execution

with the next instruction. Unconditional jumps always change the program counter

to the specified immediate address. Control-flow operations can be interruptible or

non-interruptible. The execution of an interruptible jump is the only occasion where a

multimedia processor allows special event handling to take place.


2.3 Workload Description

Our workload consists of two major application domains: multimedia and bioinformatics. Both use compute- and data-intensive algorithms. In this section we present in detail the diversity found in these application domains, which we selected for the rigorous testing of our ECACF. The variability in the input data streams is also discussed.

2.3.1 Multimedia Applications

The multimedia application set consists of encoders and decoders (transcodecs) encom-

passing three media types - speech, video, and audio (music) - and is summarized in

Table 2.2 to Table 2.5. We obtained codes for these applications from various public

domain sources [94] [95] [96] [21]. The applications were chosen because they are important in real systems and, we believe, representative enough to support the inferences drawn in this study. We evaluated all our applications with four inputs, summarized in Table 2.6.

Here, we only report results from a single input for each application. We chose the input

that gave the highest (normalized) standard deviation in per frame execution time on

our base system. We call these inputs the default inputs, and list them in the second

column of Table 2.6. Results with the other inputs are similar, both quantitatively and

qualitatively. The G.728, H.263, and MPEG codecs statically distinguish multiple frame

types. G.728 uses an adaptive algorithm, where certain parameters are updated every

four frames. The processing of each frame in a single four-frame cycle is different due

to the calculation of these parameters. Thus, we treat these as different types of frames

(numbered one through four). The H.263 and MPEG codecs use almost the same video

compression scheme. A key difference is that MPEG uses three different types of frames

- I frames do not exploit inter-frame redundancy, P frames exploit inter-frame redun-

dancy using a previous frame, and B frames exploit such redundancy using a previous

and a later frame. Our H.263 codecs do not use B frames. They use a single I frame at

the beginning of the video and P frames for the rest. We do not include the I frame in

our analysis. It takes excessively long to simulate a frame with the MPEG codecs using the frame sizes specified by the MPEG-2 standard (about 4 to 16 hours per frame for MPEGenc). We therefore scaled down the frame size to 176x144 pixels so that we could simulate

a reasonable number of frames to assess execution time variability. We ensured that

the scaling did not affect the cache behavior by performing a working set analysis and

running representative experiments with larger frame sizes and different cache sizes. As

the chosen frame size conforms to the H.263 standard, we used the same size for the

H.263 codecs for consistency. Also for consistency, we used the same set of four inputs

for both MPEG and H.263 codecs. These inputs contain a great deal of motion to

stress the applications. H.263 was designed for low bit-rate applications such as video conferencing (which typically has less motion); therefore, our results from these inputs represent an upper bound on the expected variability for H.263.
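The selection of the default input described above can be sketched as follows. The text does not say how the standard deviation is normalized; the sketch assumes normalization by the mean (the coefficient of variation), and the function names and the dictionary of per-frame times are hypothetical.

```python
from statistics import mean, pstdev


def normalized_std(frame_times):
    """Standard deviation of per-frame execution time divided by its mean (assumed)."""
    return pstdev(frame_times) / mean(frame_times)


def pick_default_input(times_by_input):
    """Return the input name whose per-frame times vary the most."""
    return max(times_by_input, key=lambda name: normalized_std(times_by_input[name]))
```

Given per-frame times for each candidate input, `pick_default_input` returns the name of the input that would be reported as the default.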


| Application | Description | Input Vector | Sample Rate/Throughput |
|---|---|---|---|
| GSMenc | Low bit-rate speech coding based on the European GSM 6.10 provisional standard. Uses RPE/LTP (residual pulse excitation/long term prediction) coding at 13 Kb/s. Compresses frames of 160 16-bit samples into 264 bits. | orignova | 20 ms (160 samples), 8 kHz |
| GSMdec | | homemsg | |
| G728enc | High bit-rate speech coding based on the G.728 standard. Uses low-delay CELP (code excited linear prediction) coding at 16 Kb/s. Compresses frames of five 16-bit samples into 10 bits. | lpcqutfe | 625 µs (5 samples), 8 kHz |
| G728dec | | homemsg | |
| G723dec | | homemsg | |
| G729dec | | homemsg | |

Tab. 2.2: Multimedia Benchmarks (Speech Transcodecs).

| Application | Description | Input Vector | Sample Rate/Throughput |
|---|---|---|---|
| H263enc | Low bit-rate video coding based on the H.263 standard. Primarily uses inter-frame coding (P frames). Widely used for bit-rates less than 64 Kb/s. | orignova | 40 ms, 25 frames/s |
| H263dec | | buggy | |
| H264Lenc | Low bit-rate video coding based on the H.264 standard. Primarily uses inter-frame coding (P frames). Widely used for bit-rates less than 64 Kb/s. | orignova | 40 ms, 25 frames/s |
| H264Ldec | | buggy | |
| MPEGenc | High bit-rate video coding based on the MPEG-2 video coding standard. Uses intra-frame (I) and inter-frame (P, B) coding. Typical bit rate is 1.5-6 Mb/s. | Buggy | 33 ms, 30 frames/s |
| MPEGdec | | flwr | |
| MPEG-1 encoder | High bit-rate video coding based on the MPEG-1 video coding standard. | Buggy | 33 ms, 30 frames/s |
| MPEG-1 encoder | | flwr | |
| NLIVQ | Non linear interpolative vector quantization, image processing codec | cameraman.tif | 512x512 resolution, gray scale |

Tab. 2.3: Multimedia Benchmarks (Video Transcodecs).

| Application | Description | Input Vector | Sample Rate/Throughput |
|---|---|---|---|
| MP3enc | Audio decoding based on the MPEG Audio Layer-3 standard. Synthesizes an audio signal out of coded spectral components. Typical bit rate is 16-256 Kb/s. | filter | 26 ms (1151 samples), 44.1 kHz |
| MP3dec | | filter | |

Tab. 2.4: Multimedia Benchmarks (Audio Transcodecs).

| Application | Description |
|---|---|
| FFT | Fast Fourier Transform |
| IDCT | Inverse Discrete Cosine Transform |
| T64 | Matrix Transpose 64x64 |
| M100 | Matrix Multiplication 100x100 |

Tab. 2.5: Generic DSP application Benchmarks [7].

| Domain | Test Vector | Description | Features |
|---|---|---|---|
| Audio | CatSteven | Soft rock song | 2500 frames, average length 65.25 seconds |
| | Sting | Pop song | |
| | Beethoven | 2500 classical piece | |
| Video | Flwr | Drive-by of houses | 450 frames, each 18 seconds for H.263 and 15 seconds for MPEG |
| | Cact | Panoramic view | |
| | Buggy | Buggy race | |
| | Tens | Table tennis match | |
| Speech | Homemsg | An answering message | Average frame size for GSM codecs is 500, for G.72x is 19000, length: 20 seconds |
| | Orignova | Sentences read by different adults | |
| | lpcqutefe | Sentence read by a boy | |

Tab. 2.6: Test Vectors Characterization.

2.3.2 Bioinformatics Workload

Due to a significant increase in biological threats against humans, plants and other species during the last two decades, there is a growing realization that bioinformatics and molecular biology equipment should be available in small form factors that are readily available in the field [97]. This has led to the development of handheld devices for bioinformatics applications that are efficient in both battery use and execution time. Bioinformatics is an interdisciplinary research area that helps to produce ’sensible’ and ’useful’ information from the wealth of data that has been produced by the genome sequencing projects.

We categorize the basic functionality offered by all bioinformatics tools into four groups:

1. Algorithms for pattern recognition: probability formulae are used to determine the statistical similarity between two or more given sequences.

2. Rule-based analysis defines how a mathematical or statistical technique can be applied. Different sets are defined with a membership, and a set of rules is created to elaborate associativity. Basic set theory is used to fire a rule.

3. Biological databases are uniformly and efficiently maintained archives of consistent data that contain information and annotation of DNA and protein sequences, DNA and protein structures as well as DNA and protein expression profiles [98] [99]. An important feature of these databases is their simplicity of access and query management. In addition, some websites [100] [101] [102] provide visualization tools to aid biological interpretation.

4. Biological taxonomy records the differences in sequences across different classes, which further helps to reduce similarity errors.

We chose applications that are important in real systems and representative enough to support the inferences drawn in this study. They are summarized in Table 2.7. We obtained codes for these applications from various public domain sources. For lack of space, we only report their underlying algorithms; details may be found in [99] [97] [102]. The input databases are obtained from the NIH genetic sequence database ’GenBank’, the NCBI assembly archive ’Genome Assembly Archive’, the homologous structure alignment database ’HOMSTRAD’, the NIMH-NCI protein-disease database ’PDD’ and ’The Lens’ [100] [102].

| Application | Pseudonym | Features | Algorithms |
|---|---|---|---|
| GENESPLICER | A01 | Detect splice sites in the genomic DNA | High accuracy and computationally efficient |
| TIGRSCAN | A02 | DNA modeling | Generalized Hidden Markov Model (GHMM), HMM |
| TRANSTERMIS | A03 | Rho-independent transcriptional terminators | Statistical estimation techniques |
| GENSCAN | A04 | Predict complete gene structure | Search algorithms |
| MUMMER | A05 | Genome sequence alignment | Tree algorithms |
| GLIMMERHMM | A06 | Find gene sequence in eukaryotes | IMM, splice site models, maximal dependence decomposition techniques |
| GENIE | A07 | Gene finder in vertebrate and human DNA | GHMM, neural networks |
| FGENE | A08 | Find splice sites, genes, promoters | Linear discriminant analysis |
| GRAIL | A09 | Analysis of DNA sequence | Automated computation |
| GENEMARK | A10 | Find genes in bacterial DNA sequence | Markov chains |
| NetPlaneGene | A11 | Sequence analysis | Neural network |
| GLIMMER | A12 | Coding regions in microbial DNA | Interpolated Markov Models (IMM) |

Tab. 2.7: Bio-Computation Applications Benchmark.


2.4 Energy Cycle Aware Compilation Framework Methodology

The ECACF is shown in Figure 2.3. The source code is processed successively for static code analysis, post-compiler analysis and finally scheduling analysis. A VLIW processor descriptor file (VDF) provides architecture information to the compiler, the scheduler and finally the machine code generator. The VDF contains a list of pseudo and machine operations, the latency of the operations, opcodes, slot assignment schemes, the processor operating frequency, instruction cache features (associativity, block size, number of sets) and main memory features (size, order, read/write latencies). This file format is compatible with the formats described in [103] [4] [81] [104]. Here, we follow the same VLIW naming convention as used in [104]. This feature makes our scheme architecture independent. A list of parameters is generated in each step of the methodology flow. Intermediate trace files are generated during the code processing flow to produce the Application Expression Profile (AEP, Section 2.4.1), with parameters such as code size, execution time, number of cache misses (for both instruction and data caches), data cache conflicts, data bank alignment, highway usage, scheduling factor and slot utilization. After the simulation, these parameters are used to compute transformation control factors such as the unrolling factor, grafting depth and blocking metrics. These control factors are further explained in [25]. After each iteration, all these parameters are recorded again and compared to preset user constraints given in a User Constraint File (UCF). This file contains the desired values for code size, execution time, energy and allowed cache miss percentage. Energy is measured on the target platform (the setup is explained in Section 2.5). All these parameters are fed back to the transformation cost analyzer. After each successive transformation, it is decided whether the energy-cycle performance has improved. The source code is optimized through code restructuring schemes known as loop unrolling, decision tree grafting and loop tiling. Additional benefits are gained by combining them with traditional compiler optimizations, such as constant and variable propagation, dead code elimination and strength reduction.
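The feedback loop described in this section can be outlined in Python. This is a hedged sketch, not the ECACF implementation: the `Profile` fields, the use of the product of energy and execution time as the "energy-cycle" cost, the greedy acceptance rule and the `evaluate` callback (which stands in for compilation, scheduling, simulation and energy measurement) are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """A few recorded parameters; also reused to hold the UCF limits (assumption)."""
    code_size: int
    exec_time: float
    energy: float
    cache_miss_pct: float


def meets_constraints(p, ucf):
    """True when every recorded value is within the corresponding UCF limit."""
    return (p.code_size <= ucf.code_size and p.exec_time <= ucf.exec_time
            and p.energy <= ucf.energy and p.cache_miss_pct <= ucf.cache_miss_pct)


def energy_cycle_cost(p):
    # Assumption: energy x execution time stands in for energy-cycle performance.
    return p.energy * p.exec_time


def optimize(source, transforms, evaluate, ucf, max_iterations=10):
    """Apply transformations successively while they lower the assumed cost."""
    best_src, best = source, evaluate(source)
    for _ in range(max_iterations):
        if meets_constraints(best, ucf):
            break  # user constraints satisfied, stop iterating
        improved = False
        for transform in transforms:  # e.g. unrolling, grafting, tiling
            candidate = transform(best_src)
            profile = evaluate(candidate)
            if energy_cycle_cost(profile) < energy_cycle_cost(best):
                best_src, best, improved = candidate, profile, True
        if not improved:
            break  # no transformation helped in this iteration
    return best_src, best
```

In a real flow, `transforms` would wrap the restructuring schemes named above and `evaluate` would return the parameters recorded from the trace files and the energy measurement; here they are left as caller-supplied functions.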


Fig. 2.3: Transformation methodology.


2.4.1 Application Expression Profile

From ’C’ source code to an executable binary, an embedded application passes through many tools: the text editor, compiler, scheduler, linker, and loader. The urge ’how can I?’ is transformed into a consciously biased perception, entailed by embedded systems emerging from software-hardware co-design. The software leads and the hardware follows the technological limitations. The behavior that a software implementation can express on hardware is limited by the liberty offered by the hardware architecture and by the ability of programmers to code the ’how can I?’. These issues indicate that ’good’ energy-cycle performance requires more detailed profiles, containing information about system behavior at various levels, as shown in Figure 2.4. The main goal of such vertical profiling is to improve the understanding of system behavior by correlating profile information across different levels.

Fig. 2.4: Vertical application profile layers.

Hitherto, an executable application development hierarchy has been composed of compilation, scheduling, linking, and binary code generation. Finally, this code is downloaded to the SDRAM attached to the multimedia processor. Our Application Profile Monitor (APM) extracts the application behavioral parameters mentioned above. This information is extracted from the vertical profile layer block shown in Figure 2.4. An application is profiled in terms of both its static and its run time (dynamic) behavior. We call the way an application expresses itself on a given hardware architecture its Application Expression Profile (AEP). We characterize an application expression profile using the following conventions:

1) Name: The name of the profile monitor.

2) Definition: The definition of the profile monitor as used in our ECACF.

3) Location: The location of the monitor in the application development hierarchy, such as compilation, scheduling, linking etc.

4) Type: There are two possible types: static or dynamic.

5) Range: The possible range of values a monitor can have.

6) Level: If a parameter is measured directly from the code, it is called a primary monitor; if it is computed from one or more other parameters, we call it a secondary monitor.

E.g., a primary monitor can be written as:

Name: Processor Frequency

Definition: The operating frequency of the microprocessor

Location: VDF

Type: static

Range: Typically 100 MHz - 233 MHz (depends on the given hardware architecture)

Level: Primary

Similarly, a secondary monitor can be written as:

Name: Scheduling Factor

Definition: Computed by dividing the infinite machine cycle time by the finite machine cycle time

Location: Transformation Engine and Scheduler

Type: Dynamic

Range: 0 to 1

Level: Secondary

A complete list of profile monitors is provided in Appendix A.
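The six-field convention and the two example monitors can be captured in a small Python record. The field names mirror the convention above; the class name, the validation and the `scheduling_factor` helper are illustrative assumptions, not part of the ECACF tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileMonitor:
    """One AEP monitor described by the six conventions listed above."""
    name: str
    definition: str
    location: str
    type: str    # "static" or "dynamic"
    range: str
    level: str   # "primary" or "secondary"


def scheduling_factor(infinite_machine_cycles, finite_machine_cycles):
    """Secondary monitor: infinite-machine cycle time divided by finite-machine cycle time."""
    if finite_machine_cycles <= 0:
        raise ValueError("finite machine cycle time must be positive")
    return infinite_machine_cycles / finite_machine_cycles


processor_frequency = ProfileMonitor(
    name="Processor Frequency",
    definition="The operating frequency of the microprocessor",
    location="VDF",
    type="static",
    range="Typically 100 MHz - 233 MHz",
    level="primary",
)
```

For instance, a block that needs 80 cycles on the infinite machine and 100 cycles on the finite machine has a scheduling factor of 0.8, which lies in the stated range of 0 to 1.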

2.5 Experimental Setup

Knowing the energy consumed by an application on a realtime platform is a first step in the design of any energy constrained embedded system, and it can be used to estimate the battery lifetime of the system. In this section, we describe an energy measurement method for a software application running on a realtime multimedia VLIW processor. The method is described for the Philips TM1300 DSP processor, but it is applicable to other multimedia processors, e.g., the Blackfin ADSP533S. The measurement framework has been incorporated into our ECACF, which allows a software application programmer to measure realtime energy consumption by running the candidate ’C’ source code.


2.5.1 Related Work for Energy Measurement

The energy consumption of a software applicati
