ENHANCEMENTS TO RISC ARCHITECTURE FOR

PORTABLE EMBEDDED SYSTEMS

A THESIS

Submitted by

B. GOVINDARAJALU

Under the guidance of

Dr. K.M. MEHATA

in partial fulfilment for the award of the degree of

DOCTOR OF PHILOSOPHY

in

COMPUTER SCIENCE AND ENGINEERING

B.S.ABDUR RAHMAN UNIVERSITY

(B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY) (Estd. u/s 3 of the UGC Act. 1956)

www.bsauniv.ac.in

OCTOBER 2014

B.S.ABDUR RAHMAN UNIVERSITY (B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY)

(Estd. u/s 3 of the UGC Act. 1956) www.bsauniv.ac.in

BONAFIDE CERTIFICATE

Certified that this thesis ENHANCEMENTS TO RISC ARCHITECTURE FOR PORTABLE EMBEDDED SYSTEMS is the bonafide work of B. GOVINDARAJALU (RRN: 1186221) who carried out the thesis work under my supervision. Certified further, that to the best of my knowledge the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or award was conferred on an earlier occasion on this or any other candidate.

SIGNATURE
Dr. K.M. MEHATA
RESEARCH SUPERVISOR
Professor & Dean
Department of CSE
B.S. Abdur Rahman University
Vandalur, Chennai – 600 048

SIGNATURE
Dr. SHARMILA SANKAR
HEAD OF THE DEPARTMENT
Professor & Head
Department of CSE
B.S. Abdur Rahman University
Vandalur, Chennai – 600 048

ACKNOWLEDGEMENT

This thesis would not have been possible without the help and support of many people. First, I would like to thank my research supervisor Dr. K.M. Mehata, Professor and Dean, School of Computer, Information and Mathematical Sciences of B.S. Abdur Rahman University, for all his inspiring ideas and support. It is he who motivated me to join the PhD programme when I approached him four years ago with a lot of doubts in my mind.

I thank Dr. Sharmila Sankar, Professor and Head, Dr. Angelina Geetha, Professor, and other staff members of the Department of Computer Science and Engineering, B.S. Abdur Rahman University, for their support and encouragement. I thank the Doctoral Committee Members, Dr. V. Sankaranarayanan, Professor of Eminence, B.S. Abdur Rahman University and Dr. Ranjani Parthasarathi, Professor, Information and Communication Engineering, Anna University, for their review comments. I express my gratitude to the Chancellor, Vice Chancellor, Pro-Vice Chancellor and Registrar for giving me an opportunity to do research at B.S. Abdur Rahman University.

I thank the managements of four engineering colleges - Rajalakshmi Engineering College, Sri Ramanujar Engineering College, Dhanalakshmi College of Engineering, and Sri Venkateswara College of Engineering - where I served during my research work, for supporting me. I thank the following professionals whom I consulted for my requirements: Raju Sambandam, Ilanthirayan Singaram, Ramkumar, and Shyamala Dharmar. I thank my ex-colleagues Kohila, Prof. Ramakrishnan and Prof. Sivakumar and ex-students Haripriya, Vinodhini, Nandhini and Abinaya who helped me at various stages of my research work.

The persons who consistently supported me but also suffered most during my research work are my family members - wife Bhuvaneswari, son Krishnakumar, daughter-in-law Manjula, daughter Padma, and son-in-law G.K. Ananth. Finally, I want to mention my grandchildren - Vihaan, Sahahsra and Anvitha - who provided both distraction and relaxation.

B. GOVINDARAJALU

ABSTRACT

The proposed research work focuses on developing a flexible Instruction Set Architecture (ISA) through modifications to the Reduced Instruction Set Computing (RISC) architecture to minimise code memory in portable embedded systems. When the RISC architecture was introduced in the 1980s, the program memory was external to the processor. As the present-day embedded system is available as a single System-On-a-Chip (SoC), there is a need for an ISA that contributes to the overall benefit of the SoC in terms of chip space, power consumption and cost. Though there are many code reduction methods, there are very few ISA-level techniques that aim at modern embedded SoCs, in which the code memory occupies a large part of the silicon area due to the Fixed Instruction Encoding (FIE) feature of RISC. This thesis proposes replacing FIE with Hybrid Instruction Encoding (HIE) for MIPS32-like RISC processors to support multiple instruction sizes, and hybrid lengths for offset and immediate fields, so as to reduce wastage of memory. The proposed solution eliminates additional code compression efforts on the part of the system developers. Suitable modifications to the MIPS32 ISA have been developed and evaluated using the MiBench and MediaBench suites. A code analysis cum conversion suite has been developed as part of the research work. Further, a set of compound and composite instructions has been introduced, enhancing the code size reduction. In addition to HIE, the research work also explores supporting the Register Memory Architecture with new instructions. The results show a reduction of code memory ranging from 22% to 44%, which is significant in battery-operated portable applications such as wearable devices and implantable medical devices, for which processor performance is not critical. The final part of the thesis focuses on the adoption of the Heads-and-Tails format to take advantage of the high code density of the hybrid-length instructions while enabling deeply pipelined or superscalar processors.

TABLE OF CONTENTS

CHAPTER NO. TITLE

ACKNOWLEDGEMENT
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS AND ABBREVIATIONS

1. PORTABLE EMBEDDED SYSTEMS AND CODE SIZE
    1.1 EMBEDDED SYSTEMS
        1.1.1 Characteristics of Embedded Systems
        1.1.2 Architecture of Embedded Systems
        1.1.3 Embedded Software
    1.2 SoC ARCHITECTURE
    1.3 BOPES DESIGN PHILOSOPHY AND PROCESSOR TECHNOLOGY
        1.3.1 IC Technology
    1.4 PROCESSOR ARCHITECTURES AND INSTRUCTION ENCODING
        1.4.1 CISC Vs RISC
        1.4.2 Load - Store Architecture (LSA)
        1.4.3 Fixed, Variable and Hybrid length encoding
    1.5 MOTIVATION
    1.6 RESEARCH OBJECTIVES
    1.7 CONTRIBUTIONS
    1.8 THESIS OVERVIEW

2. BACKGROUND AND RELATED WORK
    2.1 DESIGN FOR LOW POWER CONSUMPTION
    2.2 INSTRUCTION SET ARCHITECTURE (ISA)
        2.2.1 Instruction Types and Operations
        2.2.2 Operation Codes
        2.2.3 Addressing modes
        2.2.4 Data types
        2.2.5 ISA Models
    2.3 PROCESSOR PERFORMANCE AND ADVANCED ARCHITECTURES
        2.3.1 Instruction Pipelining
        2.3.2 RISC Instructions and Pipelining
        2.3.3 Superscalar processor
        2.3.4 Very Long Instruction Word (VLIW) Processor
        2.3.5 Cache Memory
        2.3.6 Virtual Memory
        2.3.7 Multicore CPU
    2.4 EMBEDDED PROCESSORS
    2.5 EMBEDDED SYSTEM ARCHITECTURES
        2.5.1 Digital Signal Processor
        2.5.2 Media Extensions
        2.5.3 Embedded Multiprocessors
    2.6 MIPS32 Vs OTHER RISC PROCESSORS
        2.6.1 CISC and RISC Convergence
    2.7 MIPS32 INSTRUCTIONS AND CODE WASTAGE
    2.8 CODE SIZE REDUCTION IN EMBEDDED SYSTEMS
        2.8.1 Code Compression
        2.8.2 Dictionary-based Compression
        2.8.3 Compiler Techniques
        2.8.4 Ad hoc ISA Modification
    2.9 ISA LEVEL CODE SIZE REDUCTION
    2.10 CONCLUSIONS

3. BEHAVIOUR OF EMBEDDED CODES FOR RISC
    3.1 MIBENCH BENCHMARKS
    3.2 MEDIABENCH BENCHMARKS
    3.3 MIMEDIA BENCHMARK SUITE
    3.4 TYPICAL BEHAVIOUR OF EMBEDDED APPLICATIONS
    3.5 CONCLUSIONS

4. HYBRID INSTRUCTION ENCODING FOR RISC CORES
    4.1 MIPS ISA AND CODE WASTAGE
        4.1.1 MIPS Instruction Set
        4.1.2 MIPS Instruction Format
        4.1.3 Wastage in MIPS32 Code
    4.2 HIE1 METHODOLOGY FOR MIPS32
        4.2.1 HIE1 RISC Instructions
        4.2.2 Mapping MIPS32 ISA to HIE1
    4.3 HIE1 EXPERIMENTAL RESULTS
        4.3.1 Drawback of register size reduction
    4.4 DESIGN OF HIE2
        4.4.1 Impact of Reduction of immediate and offset lengths to 15 bits
        4.4.2 HIE2 Design for MIPS32
    4.5 DISCUSSION ON HIE2 RESULTS
        4.5.1 Reduction in Memory Accesses in HIE
        4.5.2 Reduction in Redundant zeros in HIE
    4.6 PROCESSOR MODIFICATIONS TO SUPPORT HIE
    4.7 CONCLUSIONS

5. REGISTER MEMORY ARCHITECTURE FOR RISC CORES
    5.1 MOTIVATION FOR RMA
    5.2 METHODOLOGY FOR RMA ALU INSTRUCTIONS
        5.2.1 Formats for ADDrm Instruction for MIPS
        5.2.2 Proposed RMA Opcodes
        5.2.3 Estimates on Code Size Reduction
    5.3 RESULTS AND DISCUSSION
    5.4 PIPELINE MODIFICATIONS FOR RMA
    5.5 CONCLUSIONS

6. HYBRID PROCESSOR FOR PORTABLE EMBEDDED SYSTEMS
    6.1 SoC DESIGN AND EMBEDDED SYSTEMS
        6.1.1 Smart watch
        6.1.2 Scanner
        6.1.3 Smartphones
    6.2 ENHANCEMENTS TO HIE AND RMA CODES
    6.3 FUTURE ENHANCEMENT TO HIE-RMA
    6.4 HIE AND ILP
        6.4.1 Hybrid-Length Instructions and Instruction fetch
        6.4.2 Instruction Fetch and Cache Access
    6.5 HIE-MIPS Vs microMIPS/Thumb2
    6.6 DISCUSSION AND CONCLUSION
        6.6.1 Summary of Contributions
        6.6.2 Limitations of Described Research Work
        6.6.3 Areas for Future Work
            6.6.3.1 MicroMIPS and Thumb2 versions with HIE-RMA
            6.6.3.2 Reconfigurable HIE-RMA version
            6.6.3.3 Dynamic simulation
            6.6.3.4 FPGA Processor design
            6.6.3.5 Compiler tool chain
            6.6.3.6 HIE-RMA-HAT Processor

REFERENCES

LIST OF PUBLICATIONS

APPENDIX 1 (MIDACC ARCHITECTURE)
    A1.1 INTRODUCTION
    A1.2 MIDACC INTERNALS
        A1.2.1 MIDA Internals
            A1.2.1.1 Instruction class distribution
            A1.2.1.2 Instruction distribution
            A1.2.1.3 MIPS Code redundant 0’s Distribution
            A1.2.1.4 Branch instruction distribution
            A1.2.1.5 WASTIO Calculation
            A1.2.1.6 Population of FTFI
            A1.2.1.7 Registers usage behaviour
            A1.2.1.8 Shift length usage
            A1.2.1.9 Immediate field usage pattern
            A1.2.1.10 Offset field usage pattern
        A1.2.2 MICC Internals
            A1.2.2.1 HIE1 code conversion
            A1.2.2.3 RMA Code Conversion
            A1.2.2.4 RMA+HIE1 code conversion
    A1.3 MIDACC EXTENDER

APPENDIX 2 (MIDACC USER GUIDE)
    A2.1 INTRODUCTION
    A2.2 INSTALLING MIDACC
    A2.3 INPUT FORMAT REQUIRED BY MIDACC
    A2.4 USING MIDACC
        A2.4.1 MIDA Tab
        A2.4.2 MICC Tab
            A2.4.2.3 RMA Code conversion
    A2.5 SAMPLE OUTPUT OBTAINED USING MIDACC
        A2.5.1 Code Analysis Report
        A2.5.2 HIE1 Code Conversion Report
        A2.5.3 HIE2 Code Conversion Report
        A2.5.4 RMA Code Conversion Report
        A2.5.5 RMA+HIE1 Code Conversion Report
        A2.5.6 RMA+HIE2 Code Conversion Report
    A2.6 CROSS COMPILATION PROCEDURE
        A2.6.1 Using Sourcery Codebench for Cross Compilation
            A2.6.1.1 Building the C program
            A2.6.1.2 Obtaining the assembly code

APPENDIX 3 (MIPS32 INSTRUCTION IDENTIFICATION TABLE)

APPENDIX 4 (HIE1-MIPS INSTRUCTION MAP)

APPENDIX 5 (HIE2-MIPS INSTRUCTION MAP)
    A5.1 HIE2-MIPS INSTRUCTION MAP
    A5.2 HYBRID LENGTH FIELDS ENCODING

TECHNICAL BIOGRAPHY

LIST OF TABLES

TABLE NO. TITLE

1.1 Examples of Embedded Systems
1.2 Design Metrics for Embedded Systems
1.3 Types of Software Components in Embedded Systems
1.4 Typical SoCs and Applications
1.5 Extent of Data Transfer Instructions in CISC and RISC
2.1 Sample Instructions and processor actions
2.2 Addressing modes and mechanisms
2.3 Instruction cycle steps and actions for ADD instruction
2.4 Sample micro-operations
2.5 Typical instruction cycle phases in RISC processors
2.6 Typical Embedded Architectures and Processors
2.7 Processor types in Complex Embedded Systems
2.8 Typical Wastage of Bits in MIPS32 Instructions
3.1 MiBench Benchmarks
3.2 MediaBench Benchmarks
3.3 Embedded Applications for MiMedia Suite
3.4 Four types of offset / immediate byte patterns
3.5 Trends in Embedded Applications: WASTIO components
3.6 Code bloat factors for Embedded object codes for MIPS32
4.1 MIPS Instruction Formats
4.2 MIPS Instruction Fields
4.3 MIPS32 Integer instructions and actions
4.4 MIPS32 Instructions, opcodes and redundant zeros
4.5 Sample Encoding of Offset/Immediate Field in HIE-MIPS
4.6 MIPS32 ISA to HIE1 RISC ISA Mapping
4.7 Mapping MIPS32 ISA to HIE2 ISA
4.8 Comparison of Code reduction schemes HIE1 and HIE2
4.9 Average Instruction Size in HIE
4.10 Memory Cycles for Instruction Fetch in HIE
4.11 Comparison of RZs of Embedded Applications
4.12 Typical Code Size Reduction of Embedded Applications in HIE
4.13 Relationship between RZ, WASTIO AND HIE PCR
5.1 Data transfer Vs Arithmetic Instructions in Embedded Applications
5.2 MIPS ADD Instructions
5.3 Proposed RMA ADD Instructions for MIPS
5.4 Types of ALU instruction for RMA load sequence
5.5 Types of ALU instruction for RMA store sequence
5.6 RMA Instructions Corresponding to MIPS32 Instructions for Load Sequence
5.7 RMA Instructions Corresponding to MIPS32 Instructions for Store Sequence
6.1 Impact of Integration of Code reduction schemes HIE2 and RMA
A3.1 MIPS32 Instruction Identification Table
A4.1 HIE1-MIPS instruction map
A4.2 IID Field Encoding
A5.1 HIE2-MIPS INSTRUCTION MAP
A5.2 hl Encoding for G1 Type Instructions

LIST OF FIGURES

FIGURE NO. TITLE

1.1 Block diagram of a pacemaker
1.2 Embedded Systems Model
1.3 Software Components in Embedded Systems
1.4 Organisation of an Embedded SoC
1.5 Block diagram of digital camera
1.6 The independence of processor and IC technologies
1.7 CISC scenario
1.8 RISC scenario
1.9 ISA Lexical Level
1.10 Variable Instruction Encoding Format
1.11 Fixed Instruction Encoding Format
1.12 Hybrid Instruction Encoding Format
1.13 Memory Trends in SoC
1.14 Extent of Embedded memory in the die area in SoCs
1.15 Multiple embedded memory IPs in multicore SoC
2.1 IBM S370 Instruction Formats
2.2 INTEL Pentium Pro Instruction Formats
2.3 MIPS32 Instruction Formats
2.4 A six stage instruction pipeline
2.5 (a) Five stage pipeline
2.5 (b) Timing Diagram
2.5 (c) RISC Pipeline as a series of datapaths
2.6 Superscalar Processor Organisation
2.7 VLIW Processor Organisation
2.8 Use of Cache memory
2.9 Virtual memory concept
2.10 Virtual memory mechanism
2.11 A Quad-core CPU
2.12 SPARC64 VII Processor
2.13 IBM Codepack Code Compression for Power PC
2.14 Dictionary based compression
2.15 Decompression procedure for the dictionary based compression
2.16 Memory map of variable instruction stream
3.1 Utilized and unutilised instructions in Embedded codes
3.2 Distribution of utilized and unutilized instructions in Embedded domains
3.3 Frequency of integer instructions in Embedded codes
3.4 Frequency of instructions usage in Embedded domains
3.5 Population of FTFI in Embedded codes
3.6 Distribution of FTFI in Embedded domains
3.7 Usage of full 16 bit immediate by Embedded codes
3.8 Trends in usage of 16 bit immediate in Embedded domains
3.9 Extent of usage of full 16 bit offset by embedded codes
3.10 Trends in usage of 16 bit offset field in Embedded domains
3.11 WASTIO Percentages in Embedded applications
3.12 Extent of WASTIO in Embedded domains
3.13 WASTIO distribution in Embedded domains
3.14 Usage of more than 16 registers by Embedded applications
3.15 Usage of more than 16 registers in Embedded domains
3.16 Usage of more than 16 bit shifts in Embedded applications
3.17 Frequency of more than 16 bit shifts in Embedded domains
3.18 Extent of branch instructions in Embedded applications
3.19 Usage of branch instructions in Embedded domains
4.1 MIPS R2000 instruction map
4.2 MIPS R2000 registers
4.3 Format of and instruction in MIPS32 ISA
4.4 addiu instruction with immediate field containing zero value
4.5 addiu instruction with only most significant byte of immediate as zero
4.6 addiu instruction with only least significant byte of immediate as zero
4.7 addiu instruction with both bytes of immediate field as non-zero value
4.8 HIE1 RISC Instruction Formats
4.9 R Type instruction in HIE1
4.10 Effect of HIE1 on Automotive and Industrial Control Benchmarks
4.11 Effect of HIE1 on Network Benchmarks
4.12 Effect of HIE1 on Video and Audio Benchmarks
4.13 Effect of HIE1 on Image Benchmarks
4.14 Effect of HIE1 on Speech Benchmarks
4.15 Effect of HIE1 on Security Benchmarks
4.16 Effect of HIE1 on Text Benchmarks
4.17 Effect of HIE1 on Embedded Segments
4.18 HIE2 instruction formats
4.19 Code Reduction Comparison between HIE1 and HIE2
5.1 RMA Instruction Format – RM Type
5.2 RMA Instruction Format – IM Type
5.3 (a) Format of LW Instruction
5.3 (b) Format of SW Instruction
5.4 R-Type ADD instruction in MIPS
5.5 I-Type ADD instruction in MIPS
5.6 Comparison of object codes of LSA and RMA
5.7 Code size Reduction due to RMA for Automotive Domain
5.8 Code size Reduction due to RMA for Network Domain
5.9 Code size Reduction due to RMA for Video and Audio domains
5.10 Code size Reduction due to RMA for Image Domain
5.11 Code size Reduction due to RMA for Speech Domain
5.12 Code size Reduction due to RMA for Security Domain
5.13 Code size Reduction due to RMA for Text Domain
5.14 Comparison of Code size Reduction due to RMA for Embedded Domains
5.15 Proposed 6-Stage RMA Pipeline
5.16 Execution of LSA Instructions in 6-Stage RMA Pipeline
5.17 Execution of LSA Instruction in 5-stage RMA pipeline
5.18 Execution of RMA ADDrm Instruction in 5-Stage RMA pipeline
6.1 Block diagram of smart watch
6.2 Block diagram of a scanner
6.3 Block diagram for the Snapdragon S4 SoC using Krait CPUs
6.4 Two stage instruction Decoding
6.5 Predecoding and Marking Instruction Lengths
6.6 Heads and Tails Format
6.7 HIE2-MIPS Instruction Types in HAT Scheme
6.8 Variable-length decoding in a HAT Scheme
A1.1 Functional block diagram of MIDACC
A1.2 Algorithm for WASTIO calculation
A1.3 Algorithm for Population of FTFI
A1.4 Algorithm for Shift Length usage computation
A1.5 HIE1 code conversion algorithm
A1.6 HIE2 code conversion algorithm
A1.7 Overview of RMA code conversion for load sequence
A1.8 Overview of RMA code conversion for store sequence
A1.9 RMA+HIE1 code conversion
A1.10 RMA+HIE2 code conversion
A2.1 Snapshot of MIDACC Suite installation folder
A2.2 Snapshot of MIDACC Suite welcome screen
A2.3 Snapshot of MIDACC installation screen
A2.4 Snapshot of MIDACC installation process
A2.5 Snapshot of MIDACC installation status
A2.6 Snapshot of MIDACC installation completion
A2.7 Snapshot of MIDACC icon in desktop and start menu
A2.8 Snapshot of MIDACC Suite tool
A2.9 Snapshot of assembly code of SUSAN
A2.10 Snapshot of input format accepted by MIDACC
A2.11 Snapshot of MIDA Tab
A2.12 Snapshot of code analysis process using MIDA Tab
A2.13 Snapshot of MICC Tab
A2.14 Snapshot of HIE1 Code conversion process
A2.15 Snapshot of HIE2 Code conversion process
A2.16 Snapshot of RMA Code conversion process
A2.17 Snapshot of RMA+HIE1 Code conversion process
A2.18 Snapshot of RMA+HIE2 Code conversion process

LIST OF SYMBOLS AND ABBREVIATIONS

3D - Three Dimensional

AC - Address Calculation

ADPCM - Adaptive Differential Pulse Code Modulation

ALU - Arithmetic Logic Unit

ASIC - Application Specific Integrated Circuit

ASIP - Application Specific Instruction set Processor

BDTI - Berkeley Design Technology Inc

BOPES - Battery Operated Portable Embedded Systems

CAN - Controller Area Network

CBF - Code Bloat Factor

CCD - Charge-Coupled Device

CCRP - Compressed Code RISC Processor

CISC - Complex Instruction Set Computing

CLB - Cache Line address Lookaside Buffer

CM - Cache Memory

CMOS - Complementary Metal Oxide Semiconductor

CODEC - Coder/Decoder

COM - Serial communication interface

CONMANIP - Constant Manipulation

CPI - Clock cycles Per Instruction

CPU - Central Processing Unit

CRC - Cyclic Redundancy Check

DM - Data Move

DMA - Direct Memory Access

DSP - Digital Signal Processing / Processor

EEMBC - Embedded Microprocessor Benchmark Consortium

EPIC - Efficient Pyramid Image Coder

ESC - Escape

EX - Execute

FAT - File Allocation Table

FFT - Fast Fourier Transform

FIE - Fixed Instruction Encoding

fn - Function

FPGA - Field Programmable Gate Array

FTFI - Frequently used Top Four Instructions

FTP - File Transfer Protocol

GB - Giga Byte

GCC - GNU Compiler Collection

GPP - General Purpose Processor

GPR - General Purpose Register

GPS - Global Positioning System

GSM - Global System for Mobile Communications

HAT - Heads And Tails

HDL - Hardware Description Language

HIE - Hybrid Instruction Encoding

HIE1 - Hybrid Instruction Encoding version1

HIE2 - Hybrid Instruction Encoding version2

HIE-MIPS - MIPS with HIE ISA

HIE-RMA - HIE with RMA

HIE-RMA-MIPS - MIPS with both HIE and RMA ISA

HLL - High Level Language

HTML - Hyper Text Markup Language

HTTP - Hyper Text Transfer Protocol

HTTPS - Hyper Text Transfer Protocol Secure

I/O - Input / Output

IC - Integrated Circuit

ID - Instruction Decode

IEEE - The Institute of Electrical and Electronics Engineers

IF - Instruction Fetch

ILP - Instruction Level Parallelism

IM - Immediate and Memory

IoT - Internet of Things

IP - Intellectual Property

IR - Instruction Register

IrDA - Infrared Data Association

ISA - Instruction Set Architecture

JPEG - Joint Photographic Experts Group

KB - Kilo Byte

LAT - Line Access Table

LSA - Load Store Architecture

LSB - Least Significant Bit / Byte

LSI - Large Scale Integration / Load and Store Instruction

MAR - Memory Address Register

MBR - Memory Buffer Register

MEM - Memory Access

MICC - MIPS Instruction Code Converter

MIDA - MIPS Instruction Distribution Analyser

MIDACC - MIPS Instruction Distribution Analyser cum Code Converter

ML - Machine Language

MMS - Multimedia Messaging Service

MMU - Memory Management Unit

MP3 - MPEG-1 or MPEG-2 Audio Layer III

MPEG - Moving Picture Experts Group

MSB - Most Significant Bit / Byte

NOP - No Operation

NP - Network Processor

NRE - NonRecurring Engineering cost

OCR - Optical Character Recognition

OPX - Opcode Extension

OS - Operating System

PC - Program Counter

PCM - Pulse Code Modulation

PCR - Percentage Code Reduction

PDA - Personal Digital Assistant

PGP - Pretty Good Privacy

PIM - Personal Information Manager

PLA - Programmable Logic Array

PLD - Programmable Logic Device

PMD - Personal Mobile Device

RGB - Red-Green-Blue

RISC - Reduced Instruction Set Computing

RM - Register and Memory

RMA - Register Memory Architecture

ROM - Read Only Memory

RTOS - Real-time Operating System

RZ - Redundant Zero

SD - Secure Digital

SDT - Software Dynamic Translator

SIMD - Single Instruction Multiple Data

SMS - Short Message Service

SoC - System-On-a-Chip

SPEC - Standard Performance Evaluation Corporation

SPP - Single Purpose Processor

TCP/IP - Transmission Control Protocol / Internet Protocol

TDMA/FDMA - Time- and Frequency-Division Multiple Access

TIFF - Tagged Image File Format

TLB - Translation Lookaside Buffer

TRZ - Total Redundant Zeros

TV - Television

UART - Universal Asynchronous Receiver Transmitter

USB - Universal Serial Bus

VLIW - Very Long Instruction Word

VLSI - Very Large Scale Integration

WAP - Wireless Application Protocol

WASTIO - Wastage in Immediate and Offset Fields

WB - Write Back

1. PORTABLE EMBEDDED SYSTEMS AND CODE SIZE

The application of computers has been growing rapidly and spreading to every field of life. The desire for better performance, reliability and cost reduction has been fulfilled by newer design concepts and techniques in both hardware and software. Embedded processing is the new generation of computing that is revolutionising the way people live, and the way people act on certain occasions. A wide range of smart and low-cost devices such as digital watches, cell phones, digital cameras, and portable video games has penetrated everyone's life. As per the forecast by the Linley Group [1], the embedded processor market in 2015 will exceed $4.0 billion. The emergence of the Internet of Things (IoT) and the demand for smart devices in every aspect of life are driving a complete overhaul of traditional wisdom in the embedded industry.

Most embedded devices are battery-powered and designed around a System-on-a-Chip (SoC). In Battery Operated Portable Embedded Systems (BOPES), reductions in size, cost and power consumption are the primary requirements, unlike servers and desktops, in which performance is the primary requirement. One of the factors contributing to the increase in product cost, size and power consumption is the large code size of the embedded application software. Core-based design with predefined and pre-verified modules is the state-of-the-art design strategy for modern SoCs.

Due to the increasing complexity of embedded systems, the size of embedded programs keeps growing, and hence the code memory occupies the largest share of the total die area, more than the area of the microprocessor core and the other on-chip modules. For example, in a high-end hard disk drive [2], the processor occupies a silicon area of 6 mm², whereas the code memory takes 20-40 mm². As a result, apart from increased chip space and cost, the power consumption also increases. Hence minimising code size is an essential requirement in BOPES, especially in biomedical embedded systems such as pacemakers and prosthetic devices. In keeping with modern trends in technology, BOPES deserve an architectural-level solution to minimise code size. The main goal of this work is to provide an efficient Instruction Set Architecture (ISA) for RISC processor cores so as to produce minimum object code in BOPES designed around SoCs.
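
To put the hard disk drive figures quoted above in proportion, the following minimal sketch computes the share of the combined processor and code memory area that the code memory occupies. It uses only the 6 mm² and 20-40 mm² values cited from [2]. The function and constant names are hypothetical, and the other on-chip modules are deliberately left out because no area is given for them here.

```python
def code_memory_share(processor_mm2: float, code_memory_mm2: float) -> float:
    """Fraction of the (processor + code memory) area taken by code memory.

    Assumption: other on-chip modules are ignored, since the text
    gives no area figure for them.
    """
    return code_memory_mm2 / (processor_mm2 + code_memory_mm2)


# Figures quoted in the text for a high-end hard disk drive [2].
PROCESSOR_MM2 = 6.0
CODE_MEMORY_RANGE_MM2 = (20.0, 40.0)

shares = [code_memory_share(PROCESSOR_MM2, m) for m in CODE_MEMORY_RANGE_MM2]

for area, share in zip(CODE_MEMORY_RANGE_MM2, shares):
    print(f"code memory {area:.0f} mm^2 -> {share:.1%} of combined area")
```

Under these figures the code memory accounts for roughly 77% to 87% of the combined area, which illustrates why a reduction in code size translates directly into a reduction in die area.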

1.1 EMBEDDED SYSTEMS

An embedded system is a computing system that is embedded within a larger electronic device and performs one or more fixed functions. Embedded systems are pre-programmed by the developer with built-in application program(s). The wide spectrum of embedded systems includes a variety of applications, as illustrated in Table 1.1. Embedded computers have the widest spread of processing power and cost. Low-end embedded processors cost less than a dime, medium-scale embedded processors cost under $5, and high-end processors cost around $100. Although embedded devices cost less than personal computers, their sales volumes are huge. In the year 2010 alone, 19 billion embedded processors were sold compared to 1.8 billion Personal Mobile Devices (PMD), 350 million desktop PCs, and 20 million servers [3].

Like cost, the requirements of different embedded applications vary widely. For certain embedded applications such as network switches, avionics systems and video phones, high processor performance is a critical requirement. In certain embedded systems such as toys, scanners, washing machines and microwave ovens, size and cost are the critical aspects instead of performance. In certain other embedded applications such as cell phones and tablet computers, power consumption is important, as the major requirement is to minimise the frequency of battery recharging. In prosthetic devices, both size and power consumption are critical factors.

Table 1.1: Examples of Embedded Systems

Nature of Application | Selected examples
Automotive | Transmission control, cruise control, fuel injection, antilock brakes, active suspension, navigation
Consumer electronics | Cell phones, digital cameras, camcorders, calculators, personal digital assistants, smart briefcase, smart watch, toys, games
Home appliances | Washing machines, microwave ovens, answering machines, thermostats, home security systems, lighting systems, TV set-top boxes, battery chargers, smart phones, remote controls, coffee maker, cooker, smart refrigerator, clothes dryer, MP3 player, smart speakers, trash compactor, thermostat, Personal Data Assistants (PDAs)
Office automation | Fax machines, photocopiers, printers, scanners, monitors, multifunction device
Business equipment | Alarm systems, card readers, cash registers, product scanners, automated teller machines, automatic toll systems, electronic instruments, point of sales terminals
Biomedical and healthcare | Patient monitoring system, pacemaker, blood pressure monitor, electronic stethoscopes, medical imaging, smart bed, electric wheelchair, ambulance, hearing aid, prosthetic devices
Defence | Wearable computer, signal tracking systems, missiles
Industrial control | Robotics, Factory control
Entertainment | Music systems
Communications | Routers, modems, network switches, network bridges, hubs, gateways, satellites
Computer peripherals | Hard disk drives, network adapters, printers
Special | Avionic systems, life support systems, teleconferencing systems, satellite phones, robots, traffic light controller, police vehicle, fire control, video conferencing, elevators

An artificial cardiac pacemaker, a typical example of the application of BOPES in health care, is a critical system used to treat patients with various heart conditions in which the natural pacemaker is affected [4]. It is an electronic device, placed under the skin near the heart, that generates simulated paces for the heart using electrical impulses. Figure 1.1 shows the block diagram of a pacemaker, which contains a processor functioning as the controller.

[Figure 1.1: Block diagram of a pacemaker. Blocks shown: Electrodes, Leads, Pacing Unit, Power Source, Sensing Unit, Control Unit]

The pacemaker is a hermetically sealed device containing a power source, usually a lithium battery; a sensing amplifier, which processes the electrical manifestation of naturally occurring heart beats as sensed by the heart electrodes; the processor, which acts as the control logic for the pacemaker; and the output circuitry, which delivers the pacing impulse to the electrodes. Many advances have been made possible by microprocessor-controlled pacemakers. Instead of producing a static, predetermined heart rate, a dynamic pacemaker compensates for both actual respiratory loading and potentially anticipated respiratory loading. Dual-chamber pacemakers control both the ventricles and the atria, timing the contractions of the atria to precede those of the ventricles, thereby improving the pumping efficiency of the heart; they can be useful in congestive heart failure. Rate-responsive pacing allows the device to sense the physical activity of the patient and respond appropriately by increasing or decreasing the base pacing rate via rate response algorithms. The implanted pacemaker is a battery-operated real-time embedded system which must be small and light, and must operate with low power to extend battery life. Pacemakers are programmed with tens of thousands of lines of code. Size and battery life are clearly more important parameters than speed in a pacemaker.

1.1.1 Characteristics of Embedded Systems

An embedded system is an applied computer system that has

several characteristics distinguishing it from other types of computing

systems. The main difference is that the embedded system is not used for

general purpose computing but designed for one or more dedicated

applications. On the other hand, a general purpose system is designed to

perform a variety of tasks as per the user's choice. For example, a digital

camera is an embedded system used always as a camera. In contrast, a

desktop computer is used for running a variety of application programs like

spreadsheets, word processors, games etc. In most embedded systems,

the users are not even aware of the presence of a microprocessor inside

the system.

Embedded systems have tight constraints on design metrics. There

are several design metrics for an embedded system as listed in Table 1.2.


Table 1.2: Design Metrics for Embedded Systems

Metric | Description | Remarks
NRE cost | Nonrecurring engineering cost; the initial cost of designing and testing the system | One time nonrecurring cost; multiple units can be manufactured without any additional design cost
Unit cost | Manufacturing cost of each copy of the system | Excludes NRE cost
Size | Physical space required | Measured in bytes / gates / transistors
Performance | Instruction execution time of the system | Smaller execution time means higher performance
Power | Amount of power consumed by the system | Determines the life time of a battery, or cooling requirements. Decides frequency of recharging the battery
Flexibility | Ability to change the functionality of the system | Should not incur heavy NRE cost
Time-to-prototype | Time needed to build a working version of the system | Prototype helps verify the system's usefulness and correctness
Time-to-market | Time required to develop a system before releasing to the customers | Includes design time, manufacturing time, and testing time
Maintainability | Ability to modify the system after its initial release | Original designers need not be available
Confidence | Correct implementation of system's functionality | Addition of test circuitry may be required
Safety | Probability that the system will not cause harm | Built-in safety measures may be required


There are exceptions to each of these constraints; cars, avionics

systems, and medical imaging devices are some examples of embedded

systems in which one or more of these are not satisfied. Often metrics

compete with one another; improving one may affect another. For example,

if an implementation's size is reduced, performance of the implementation

may suffer. Hence optimization of these metrics is a challenge for an

embedded system designer.

Certain embedded systems are required to provide real-time response. Examples of such systems are pacemakers, flight control systems of an aircraft, and sensor systems in nuclear reactors and power plants. These embedded systems must continuously sense and monitor changes in the system's environment, compute certain results and respond in real time within a specified time limit. In other words, a portion of the application program has an absolute maximum execution time, and a certain set of tasks must be completed within a fixed amount of time. There are two categories of real-time systems: hard and soft. In a hard real-time system, missing the deadline may cause damage, in which case the system is considered to have failed. For example, a car's cruise controller must react to speed and brake sensors and compute the acceleration or deceleration amount within a limited time. A failure to meet the deadline means loss of control of the car. Avionics, automotive safety and control systems, and weapons systems are typical examples of hard real-time systems. In soft real-time systems, a timely response with small delays is acceptable. Examples of soft real-time systems are the scheduling display system on railway platforms, washing machines, live audio-video systems and toys. In these systems, occasional violation of constraints results in degraded quality, but the systems can continue to operate.
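The distinction between hard and soft deadlines described above can be sketched in a few lines of Python. This is only an illustration: the task names, deadlines and measured times are hypothetical values chosen for the example, and `classify` is not part of any real RTOS interface.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str
    deadline_ms: float  # maximum allowed response time
    elapsed_ms: float   # measured response time
    hard: bool          # True for a hard real-time task


def classify(result: TaskResult) -> str:
    """Return the outcome of one task activation."""
    if result.elapsed_ms <= result.deadline_ms:
        return "met"
    # A missed hard deadline counts as a system failure;
    # a missed soft deadline only degrades quality.
    return "failed" if result.hard else "degraded"


# Hypothetical timings, for illustration only.
samples = [
    TaskResult("cruise_control", 10.0, 8.5, hard=True),
    TaskResult("cruise_control", 10.0, 12.0, hard=True),
    TaskResult("platform_display", 500.0, 650.0, hard=False),
]
outcomes = [classify(s) for s in samples]
print(outcomes)
```

The same late response is treated as a failure for the hard task and as a quality degradation for the soft task, which is the only difference the sketch is meant to show.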


1.1.2 Architecture of Embedded Systems

Architecture and usage pattern of embedded systems differ from

those of general purpose desktops and servers [5]. The process of

embedded system design and development described by Tammy

Noergaard [6] consists of four phases namely creating the architecture,

implementing the architecture, testing the system, and maintaining the

system. At the highest level, a commonly used primary architectural tool is

the Embedded Systems Model, illustrated in Figure. 1.2. The hardware

layer is present in all embedded systems but the system software layer and

application software layer may exist either as independent layers or as a

combined layer depending on the complexity of the embedded system.

Figure. 1.2: Embedded Systems Model

In terms of workload, there are basically three different styles in

embedded systems: controlling, switching and routing, and media

processing [7]. Examples for controlling workload are found in various

appliances, automotive, and industrial environments. These applications

have light computations, and a tight coupling with a set of peripherals and

sensors. A strong real-time component is invariably present in such

control-dominated applications. The switching and routing category involves control applications that handle streams of data in networking applications. These workloads have to move large amounts of data at short intervals in real time. As a result, buffering capabilities are required. Due to the need for concurrent processing of multiple independent streams, efficient multithreading support is essential. The media processing category involves multimedia applications such as video, audio and images, which are complex and diverse and require a high level of computational performance. Apart from real-time restrictions, heavy computational workloads and large memory bandwidth and capacity requirements are typical constraints in these embedded systems, and hence dedicated special hardware support is provided along with the embedded CPU.

1.1.3 Embedded Software

The wide spectrum of embedded computing applications is broadly

divided into four different types: image processing and consumer market,

communications market, automotive market, and special area markets

such as medical, military, industrial control and avionics [7]. In some of

these applications, real time performance is critical but in others, size, cost

and power consumption are critical rather than performance. In most

embedded systems, the primary goal is achieving the performance at a

minimum price rather than attaining higher performance at a higher price

[3]. The entire software of an embedded system is placed in the ROM

or flash memory since the embedded system is not user

programmable. Figure. 1.3 shows typical software components

needed to control an embedded device.

A Real-Time Operating System (RTOS) is a computing environment

that reacts to input within a specific time period. A real-time deadline can

be so small that system reaction appears instantaneous. Some RTOS

implementations are very complete and very robust, while other

implementations are very simple and suited for only one particular purpose.

An RTOS may be either event-driven or time-sharing. An event-driven

RTOS is a system that changes its state only in response to an incoming

event. A time-sharing RTOS is a system that changes its state as a

function of time.
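The difference between the two RTOS styles can be sketched as two toy schedulers. This is a simplified illustration under stated assumptions: the function names, the event list and the 10 ms time slice are hypothetical, and a real RTOS would also handle priorities and preemption.

```python
from collections import deque


def event_driven_schedule(events):
    """Switch tasks only when an event arrives.

    `events` is a list of (time_ms, task_name) pairs; the returned log
    lists the task switches in order of arrival time.
    """
    return [(t, task) for t, task in sorted(events)]


def time_sharing_schedule(tasks, slice_ms, horizon_ms):
    """Switch tasks at every timer slice, in round-robin order."""
    ready = deque(tasks)
    log = []
    for t in range(0, horizon_ms, slice_ms):
        log.append((t, ready[0]))
        ready.rotate(-1)  # move the task that just ran to the back
    return log


# Hypothetical workload, for illustration only.
print(event_driven_schedule([(17, "sensor"), (3, "keypad")]))
print(time_sharing_schedule(["A", "B", "C"], slice_ms=10, horizon_ms=40))
```

In the first log the switch times are dictated by the events; in the second they fall on fixed multiples of the time slice regardless of what the tasks are doing.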


Figure. 1.3: Software Components in Embedded Systems

A kernel is the central core of an operating system, and it takes care

of all the OS jobs: booting, task scheduling, and standard function libraries.

In an embedded system, there is rarely enough memory to maintain a large

function library and hence only essential functions must be included. The

kernel will boot the system and initialize the ports and the global data items.

Then, it will start the scheduler and instantiate any hardware timers that

need to be started. Finally, the kernel gets dumped out of memory (except

for the library functions, if any), and the scheduler will start running the child

tasks. The kernel of a real-time operating system provides an "abstraction

layer" that hides from application software the hardware details of the

processor (or set of processors) upon which the application software will

execute.

In addition to the core operating system, many embedded systems

have additional upper-layer software components. These components

consist of networking protocol stacks like CAN, TCP/IP, FTP, HTTP, and

HTTPS, and also include storage capabilities like FAT and flash memory

management systems. If the embedded device has audio and video


capabilities, then appropriate drivers and codecs will be present in the

system.

Most embedded systems are architecturally simpler: they do not use

advanced memory management concepts such as virtual memory and do

not have a hard disk. Certain portable embedded systems may not be able to

bear the cost of the RTOS or may have very simple scheduling

requirements that can be managed by a simple monitor eliminating the

need for an operating system. There are certain products such as

automobiles that have multiple embedded systems. Present-day

automobiles have hundreds of processors. Table 1.3 illustrates various

software components required in some typical embedded systems [8].

Table 1.3: Types of Software Components in Embedded Systems

Embedded System | Software Components / functions

Smart card
1. Boot-up, initialisation and OS programs
2. Smart card secure file system
3. Connection establishment and termination
4. Communication with host
5. Cryptography algorithm
6. Host authentication
7. Card authentication
8. Saving additional parameters sent by the host

Digital camera
1. CCD signal processing for offset correction
2. JPEG coding
3. JPEG decoding
4. Pixel processing before display
5. Memory and file systems
6. Light, flash and display device drivers
7. COM, USB port and Bluetooth device drivers for port operations for printer and communication control

Mobile phone
1. Memory and file systems
2. Keypad, LCD, serial, USB, 3G or 2G port device drivers for port operations for keypad, printer and computer communication control
3. SMS (Short Messaging Service) message creation and communicator, contact and PIM (personal information manager), task-to-do manager and email
4. Mobile imager for uploading pictures and Multimedia Messaging Service (MMS)
5. Mobile browser for access to the web
6. Downloader for Java games, ringtones, games, wallpapers
7. Simple camera
8. Bluetooth synchronization, IrDA and WAP connections support

Mobile computer
1. OS
2. Touch screen GUIs, memory and file systems
3. Memory stick
4. Outlook, Internet Explorer, Word, Excel, PowerPoint, and handwritten text processor
5. Applications or enterprise software

Automobile
1. Engine control
2. Speed control and brake
3. Safety systems
4. Seat and pedal controls
5. Car environment controls
6. Route and traffic monitors
7. Automobile status monitoring
8. System interfaces for commands, voice activation, and interfacing
9. Infotainment systems

Hard disk drive
1. Motor control
2. Data decoding
3. Disk scheduling
4. On-disk management tasks
5. Off-disk management tasks

Pacemaker
Basic functions: sensing, pacing, and lead impedance measurement

1.2 SoC ARCHITECTURE

Thanks to advancements in VLSI design, most of today's portable embedded systems are designed as a System-on-a-Chip (SoC), in which the logic of multiple IC chips is implemented on a single die, thereby housing the entire embedded system on a single chip. A block diagram illustrating the organization of a typical SoC is given in Figure. 1.4. Different components of the SoC may be of different technologies. For example, a SoC may consist of one or more embedded microprocessors/microcontrollers, digital signal processors (DSP), application specific circuits and memory. SoCs are complex integrated circuits and permit integration of blocks from several vendors. These components are sold as Intellectual Property (IP) in three forms: hard cores, firm cores, or soft cores. A hard core is a physical description of the IP design provided in a variety of physical file formats. It is best for plug-and-play applications since it is already fully tested, but it is less portable and flexible than the other two types of cores. The firm core carries a structural description of a component, typically provided in a Hardware Description Language (HDL), and is configurable to various applications.


Figure. 1.4: Organisation of an Embedded SoC

The most flexible of the three different cores, the soft core is a

synthesizable behavioral description of a component and exists either as a

netlist (a list of the logic gates and associated interconnections making up

an integrated circuit) or as HDL code. The facility of design reuse of SoC

components using IP cores helps designers to reduce development time

since a new SoC design need not start from scratch. Table 1.4 lists some

typical commercial SoCs and their applications. Multicore SoCs and

Programmable SoCs are common in today's BOPES. In a typical SoC, the

memory occupies over 60% of the chip area [8].


Table 1.4: Typical SoCs and Applications

SoC Name | Manufacturer | Typical Application
Tegra3 | Nvidia | Tablet
Snapdragon | Qualcomm | Tablet, Smart phone
AZZ10 | Intel | Mobile device
Edison | Intel | Tiny computer
Exynos 5 | Samsung | Tablet
OMAP4430 | Texas Instruments | Google Glass
Zynq 7000 | Xilinx | Automotive, aerospace & defence, broadcast, consumer, industrial, medical, communications
VC2100 | Agera | Disk controller
ST L7250 | ST Microelectronics | Disk controller
IXP 1200 | Intel | Network processor
EP9312 | Cirrus Logic | Audio processor
OMAP 1510 | Texas Instruments | Mobile multimedia
ASMgrid | Accent | Home Area Networking

1.3 BOPES DESIGN PHILOSOPHY AND PROCESSOR TECHNOLOGY

The desired functionality of an embedded application can be

implemented on any of the three different processor types: Programmable

General-Purpose Processor (GPP), Single Purpose Processor (SPP), and

Application-Specific Instruction set Processor (ASIP). In practice, a

combination of such processors is used in designing an embedded system

in order to optimize a system's design metrics.

The general-purpose processor is designed for a given Instruction Set Architecture (ISA) with a microarchitecture that is not known to the application software designers. The embedded system designer merely programs the processor for the desired functionality by developing suitable programs and storing them in the program memory of the processor. This approach offers several design metric benefits. Since the embedded system designer needs to do only program development and not digital design, time-to-market and NRE costs are low. Also, there is a high degree of flexibility as the designer can change the functionality by merely replacing the program. Being a general approach, performance may be slow for certain embedded applications. Since the processor is a standard commercial product manufactured in large quantities, even small quantities are available at low cost to the embedded system developer and hence the unit cost of the embedded system works out to be low. However, if the embedded system is to be sold in large quantities, then the choice of an application-specific processor will result in a cheaper product. Due to the fixed processor hardware, size and power may be large for certain embedded applications. A general purpose processor may have any of several architectures such as scalar processor, vector processor, array processor, superscalar processor, VLIW processor, and multicore processor [9]. A given processor may offer instruction level parallelism; superscalar and VLIW are two different approaches, hardware and software respectively, to instruction level parallelism. The multicore processor houses more than one processor core in a single chip, thereby providing processor level parallelism.

A single purpose processor is a digital circuit capable of executing only one type of program. Some common examples of single-purpose processors are UART, DMA controller, JPEG codec etc. The embedded system designer can either pick a suitable pre-designed single-purpose processor from the market or create a custom digital circuit for the single-purpose processor. The advantages of using a single-purpose processor in embedded system design are higher performance, smaller size, lower power and low unit cost (for large quantities). The disadvantages are longer design time, higher NRE costs, low flexibility and higher unit cost (for small quantities). For some applications, performance may be lower compared to designs using general-purpose processors. The single-purpose processor is also known by several popular names: coprocessor, accelerator, peripheral etc.
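The volume dependence of cost noted for general-purpose and single-purpose processors can be illustrated with the usual relation, total cost = NRE cost + unit cost x number of units. The sketch below assumes purely hypothetical cost figures; they are not taken from this thesis or from any vendor.

```python
def total_cost(nre, unit_cost, units):
    """Total cost of producing `units` copies of a design."""
    return nre + unit_cost * units


def break_even_units(nre_a, unit_a, nre_b, unit_b):
    """Volume at which option B (higher NRE, lower unit cost) matches option A."""
    return (nre_b - nre_a) / (unit_a - unit_b)


# Hypothetical figures: a GPP-based design with low NRE and higher unit cost,
# and a custom single-purpose design with high NRE and lower unit cost.
GPP = {"nre": 10_000, "unit_cost": 20}
SPP = {"nre": 100_000, "unit_cost": 5}

crossover = break_even_units(GPP["nre"], GPP["unit_cost"],
                             SPP["nre"], SPP["unit_cost"])
print(crossover)

for units in (1_000, 20_000):
    print(units,
          total_cost(GPP["nre"], GPP["unit_cost"], units),
          total_cost(SPP["nre"], SPP["unit_cost"], units))
```

With these assumed numbers the general-purpose option is cheaper below the crossover volume and the custom option is cheaper above it, which is the qualitative behaviour the text describes.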

An ASIP is a programmable processor for a specific type of

application such as digital-signal processing, or telecommunications. The

architecture of such a processor is optimized for giving better performance

for the target application type. Inclusion of special functional units and

exclusion of infrequently used functional units are two common strategies

used in designing an ASIP. The advantages of the ASIP approach to embedded system design are good performance, small size and low power. A large NRE cost is the disadvantage of the ASIP approach. Digital Signal Processors (DSPs), Network Processors (NPs), and microcontrollers are typical examples of ASIPs.

A digital camera is a typical example of a BOPES that can be

implemented using a mixture of GPP, ASIP and SPP. As illustrated in

Figure. 1.5, a digital camera is a camera that encodes images and videos

digitally and stores them for later reproduction. It performs a limited set of

functions such as capturing pictures, compressing images, storing frames,

decompressing and displaying frames, and uploading frames to another

device through a suitable I/O interface. Frank Vahid and Tony Givargis [5]

have discussed the pros and cons of four different design approaches to

the digital camera and compared three design metrics of interest namely

performance, power consumption and chip area. The first approach uses a

single GPP but could not meet the performance requirement. The other

three approaches give feasible solutions and the choice depends on the

target market segment.


Figure. 1.5: Block diagram of digital camera

1.3.1 IC Technology

Implementation of a processor on an integrated circuit (IC) can be

done with any of the three IC technologies: Full-custom/VLSI, Semicustom

ASIC (Gate array and standard cell) and Programmable Logic Device

(PLD). Any type of processor can be mapped [5] to any type of IC

technology, as illustrated in Figure. 1.6. The three IC technologies differ by

how customized the IC is for a particular design. The VLSI design has a

very high NRE cost and long turnaround time, typically several months; but

yields excellent performance with small size and power. It is suitable for

high volume or extremely performance-critical applications. The ASICs

provide good performance and size, with much less NRE cost than

full-custom ICs with turnaround time in the order of weeks. The PLD has

two types: Programmable Logic Array (PLA) and Programmable Array

Logic (PAL). Field Programmable Gate Array (FPGA) is a newer type of

PLD. PLDs offer very low NRE cost and almost instant IC availability.

Bigger size than ASIC, higher unit cost, higher power consumption and

lower performance are the drawbacks of PLDs, but they are well suited for

rapid prototyping.

[Figure 1.6 labels: processor types: General Purpose Processor, ASIP, Single Purpose Processor; IC technologies: Full Custom, Semi-Custom, PLD; trade-off axes: Flexibility, NRE cost, Time-to-market, cost (low volume) versus Power efficiency, performance, size, cost]

Figure. 1.6: The independence of processor and IC technologies

1.4 PROCESSOR ARCHITECTURES AND INSTRUCTION ENCODING

Based on the type of internal storage inside the processor, the

instruction set architectures are classified into three types: stack

architecture, accumulator architecture and general-purpose register

architecture [3]. The stack architecture and accumulator architecture have

become obsolete, and Register-Memory Architecture (RMA) and Load-

Store Architecture (LSA) are two popular versions of general purpose

architecture used in microprocessors.

1.4.1 CISC Vs RISC

Two different Instruction Set Architecture (ISA) styles [9] are

followed in present day computer systems: Complex Instruction Set

Computing (CISC) and Reduced Instruction Set Computing (RISC). In

practice, CISC processors use RMA whereas RISC processors use LSA.

[Figure 1.7 labels: High Level Language Program (Source Code), Compiler, Machine Language Program (Object Code), CPU (Complex), Instruction (Powerful), Instruction set (large), Main Memory (Small), Reference]

Figure. 1.7: CISC scenario

Figure. 1.7 gives the overall view of a CISC system. The CISC has powerful instructions and a large instruction set. In the early days of mainframes, the cost of memory was high due to the use of magnetic core memory. Since the CISC architecture results in compact object code, CISC processors were well accepted. IBM System/360, UNIVAC 1100, HP 2100, and VAX 11 are some popular CISC systems. Developments in IC technology gave more scope for implementing new concepts and techniques that required more circuits in the CPU. After the invention of semiconductor memories, the performance of memory improved and the cost of memory fell drastically, but the speed of memory is still slower than that of the CPU. The invention of the microprocessor resulted in low cost systems.


Advancements in VLSI technology enabled inclusion of more circuits

inside the microprocessor for performing new functions such as instruction

pre-fetch and pre-decode, multitasking and virtual memory support. Since

CISC processors supported powerful and complex instructions, control unit

design used microprogramming in order to simplify the design process but

due to the microprogram memory, instruction execution time increased.

In the past, the general trend in computer architecture and

organization has been toward increasing processor complexity: more

instructions, more addressing modes, more specialized registers, and so

on [10]. The RISC concept represents a fundamental break from the CISC

philosophy. As part of the attempts to develop a faster processor, RISC

architecture (Figure. 1.8) was promoted eliminating complex instructions

and complex addressing modes. The major characteristics of initial RISC

processors [10] are simple instructions, small instruction set, uniform

instruction length, limited addressing modes, simple instruction formats,

load-store architecture, hardwired instruction decoder, large register count

and instruction pipelining.

The main advantages of RISC architecture are easy implementation

of instruction pipeline and simplification of instruction decoder circuitry, and

higher performance. Moreover, due to its simplicity, time to develop a RISC

processor is shorter compared to that of a CISC processor.

The RISC Vs CISC controversy has died down due to the gradual convergence of the technologies. RISC systems have become more complex and CISC systems have introduced certain RISC-like features. However, there is a need to re-examine two features of the RISC processor cores used in SoCs for BOPES, namely uniform instruction length and load-store architecture, from the perspective of the increased code size of RISC architecture, which impacts cost, size and power consumption.

[Figure 1.8 labels: High Level Language Program (Source Code), Compiler, Machine Language Program (Object Code), CPU (Simple), Instruction (Simple), Instruction set, Main Memory (Large), Reference]

Figure. 1.8: RISC scenario

As shown in Figure. 1.9, as the lexical level of a CISC is higher, a

CISC requires execution of fewer instructions (smaller bit traffic) than does

a RISC [11]. The CISC architecture moves the ISA upward, thereby

reducing the semantic gap that must be spanned by the compiler and

increasing the semantic gap spanned by the hardware. On the other hand,

RISC architecture increases the software semantic gap and decreases the

hardware semantic gap. The experiments by Bhandarkar and Clark [12]

established that the RISC processor has to execute twice the number of

instructions compared to a CISC processor for the same application

program.

[Figure 1.9 labels: High Level Language, Software Translation, ISA (CISC, RISC), Hardware Translation, Gates]

Figure. 1.9: ISA Lexical Level

The ‘code size bloating’ problem of RISC processors is illustrated in [2], which reports the object code size of an MPEG2 encoder compiled for multiple processors of different architectures. The Intel x86, a typical CISC processor with register-memory architecture, needs 50.6 kB of code, while the RISC processors Thumb and SHARC need 68.2 kB and 106.2 kB respectively.
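The code sizes quoted above can be put on a common scale by normalising them to the x86 figure. The short calculation below uses only the three values given in the text; the dictionary and variable names are chosen for this illustration.

```python
# Object code size of the MPEG2 encoder in kB, as quoted in the text.
code_size_kb = {"x86": 50.6, "Thumb": 68.2, "SHARC": 106.2}

baseline = code_size_kb["x86"]
relative = {name: size / baseline for name, size in code_size_kb.items()}

for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}x")
```

Relative to the x86 code, the Thumb code is about 1.35 times as large and the SHARC code about 2.10 times as large.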

1.4.2 Load - Store Architecture (LSA)

Most modern processors are based on load-store architecture. The original objective [13] of choosing LSA for RISC was to simplify the hardware and increase performance so as to meet the performance needs of workstations and servers. Generally, RISC processors have three types of instructions: ALU instructions, Load and Store instructions, and Branch and Jump type instructions. For ALU instructions, the operands are in registers and the results are stored in registers. In Load and Store instructions, one operand is in a register and the other operand is in memory. The address of the memory operand is generally specified as the sum of two parts: the base register contents and an offset in the immediate field. In LSA, only load and store instructions can access memory operands, and the arithmetic/logical instructions can only operate on register operands. Since arithmetic and logical operations on memory operands are not permitted in LSA, the compiler must place a load instruction before an add instruction to move the data from memory to a register. Similarly, the result of an add instruction is stored by the processor in a register only. Hence a store instruction has to be placed by the compiler after the add instruction to move the result to main memory. This restriction results in many additional data transfer instructions, namely load and store instructions. A comparison [3] of the distribution of arithmetic/logic instructions and data transfer instructions for two benchmark programs on VAX and MIPS is shown in Table 1.5. The 50% to 100% increase in data transfer instructions for the MIPS, compared to the VAX, is due to the use of several load and store instructions in MIPS. The fixed instruction size of RISC is another cause of the code size increase.

Table 1.5: Extent of Data Transfer Instructions in CISC and RISC

Program | Processor | Arithmetic/logic instructions | Data transfer instructions
Gcc | VAX | 40% | 19%
Gcc | MIPS | 35% | 27%
Spice | VAX | 23% | 15%
Spice | MIPS | 29% | 35%
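The load, add, store pattern that LSA imposes on a statement such as c = a + b can be sketched with a toy interpreter. The instruction names, register names and memory addresses below are hypothetical and do not correspond to any specific RISC instruction set; the memory address of each operand is formed as base register contents plus offset, as described in this section.

```python
def run_load_store(program, memory, regs):
    """Execute a tiny, hypothetical load-store instruction list.

    Returns the number of data transfer (load/store) instructions executed.
    """
    transfers = 0
    for op, *args in program:
        if op == "load":      # load rd, offset(rb)
            rd, offset, rb = args
            regs[rd] = memory[regs[rb] + offset]
            transfers += 1
        elif op == "store":   # store rs, offset(rb)
            rs, offset, rb = args
            memory[regs[rb] + offset] = regs[rs]
            transfers += 1
        elif op == "add":     # add rd, rs1, rs2 (register operands only)
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        else:
            raise ValueError(f"unknown op {op}")
    return transfers


# c = a + b, with a, b and c at offsets 0, 4 and 8 from the base address in r0.
memory = {100: 7, 104: 5, 108: 0}
regs = {"r0": 100, "r1": 0, "r2": 0, "r3": 0}
program = [
    ("load", "r1", 0, "r0"),
    ("load", "r2", 4, "r0"),
    ("add", "r3", "r1", "r2"),
    ("store", "r3", 8, "r0"),
]
transfers = run_load_store(program, memory, regs)
print(memory[108], transfers, len(program))
```

In this sketch three of the four instructions are data transfers. An architecture that allows memory operands in arithmetic instructions could express the same statement with fewer instructions.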


1.4.3 Fixed, Variable and Hybrid length encoding

There are three choices for encoding the instruction set [3]: variable

length, fixed length and hybrid. The variable length encoding (Figure. 1.10)

allows all addressing modes for all operations and supports any number of

operands. It results in smallest object code since there are no unused

fields. In this type of instruction format, the instruction length varies on the

basis of opcode and address specifiers. The characteristics of Variable

Instruction Format are

(1) Difficult control design to compute next address

(2) Complex operations

(3) Slow due to several memory accesses

operation | Address specifier | Address field 1 | …… | …… | Address field n

Figure. 1.10: Variable Instruction Encoding Format

Processors that used variable instruction encoding include Intel

80x86 and VAX. The VAX offered excellent code density due to powerful

addressing modes, powerful instructions and efficient instruction encoding.

To reduce code size, the VAX permitted three different lengths of

addresses for displacement addressing - 8-bit, 16-bit and 32-bit addresses.
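The saving from offering several displacement lengths can be illustrated by choosing, for each offset, the smallest field that holds it. The sketch assumes signed displacements of 1, 2 or 4 bytes; the function name and the sample offsets are hypothetical.

```python
def displacement_bytes(offset: int) -> int:
    """Smallest of 1, 2 or 4 bytes that can hold a signed offset."""
    for nbytes in (1, 2, 4):
        bits = 8 * nbytes
        if -(1 << (bits - 1)) <= offset < (1 << (bits - 1)):
            return nbytes
    raise ValueError("offset does not fit in 32 bits")


# Hypothetical displacements taken from an imaginary instruction stream.
offsets = [4, -12, 300, 70000]
sizes = [displacement_bytes(o) for o in offsets]
print(sizes, sum(sizes), 4 * len(offsets))
```

For these four sample offsets the variable-length choice needs 8 bytes of displacement, against 16 bytes if every displacement were stored as a 32-bit field.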

The fixed length encoding (Figure. 1.11) permits only single size for all

instructions. It always has the same number of operands and the

addressing mode is specified as part of the opcode. It results in the largest

code size. The characteristics of the fixed instruction format are

(1) Simple to decode

(2) Wastes code space because of fixed length fields and simple operation

(3) Helps easy implementation of pipelining

operation | Address field 1 | Address field 2 | Address field 3

Figure. 1.11: Fixed Instruction Encoding Format

RISC processors such as ARM, MIPS, PowerPC, SuperH, Alpha and SPARC use fixed length encoding. A processor architect more interested in code size than performance will choose variable length encoding, and an architect more interested in performance than code size will choose fixed length encoding. The hybrid length encoding (Figure. 1.12) allows multiple formats specified by the opcode. In other words, in hybrid length encoding, a processor supports multiple fixed instruction lengths.

operation | Address specifier | Address field 1

operation | Address specifier1 | Address specifier2 | Address field

operation | Address specifier | Address field 1 | Address field 2

Figure. 1.12: Hybrid Instruction Encoding Format

IBM 360/370 and TI TMS320C54x are some of the processors using

the hybrid approach. Though hybrid length encoding reduces the code size

compared to fixed length encoding, instruction decoding becomes more

involved and there will be performance reduction. The Fixed Instruction

Encoding (FIE) of RISC processors helps in simpler instruction decoding

and easy pipeline design [3]. But the FIE increases the code size as some

fields are either unused or underutilized in several instructions. The

desktops and server systems are not seriously affected by the large code

memory size since both the code memory and data memory are external to

the processor chip in these systems. On the other hand, in most embedded systems, the present trend is the use of SoCs wherein the code memory is integrated with the processor and the other system hardware on a single chip. This limits the space available for the application memory in the SoC architecture, and hence the need for compact code.
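The effect of unused fields on code size under fixed length encoding can be sketched by comparing two simple encodings of the same instruction list. All numbers are assumptions made for the illustration: a 4-byte fixed instruction length, a 1-byte opcode in the variable encoding, and a made-up list of instructions whose operand fields need between 0 and 3 bytes.

```python
FIXED_BYTES = 4  # assumed fixed instruction length (32 bits)


def fixed_size(instrs):
    """Every instruction occupies the full fixed length."""
    return FIXED_BYTES * len(instrs)


def variable_size(instrs):
    """One opcode byte plus only the operand bytes actually needed."""
    return sum(1 + operand_bytes for _, operand_bytes in instrs)


# Hypothetical instructions as (name, operand bytes needed); each is assumed
# to fit within the fixed 4-byte format.
instrs = [("ret", 0), ("inc", 1), ("add", 2), ("load", 3), ("branch", 2)]

fixed = fixed_size(instrs)
variable = variable_size(instrs)
print(fixed, variable, fixed - variable)
```

For this made-up list the fixed encoding needs 20 bytes and the variable encoding 13 bytes, so 7 bytes of the fixed encoding are unused fields. The actual saving for a real program depends on its instruction mix and on the encoding chosen.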

1.5 MOTIVATION

The applications of computers and architecture of computer systems

have undergone rapid growth over the years. The demand for increased

performance has been met by several architectural innovations. Several

new areas of applications have given rise to new requirements other than

high performance. Embedded systems are one such area that has grown

rapidly. Though some types of embedded systems require high

performance, the majority of embedded systems are sensitive to size and

power consumption.

For the past three decades, there has been a steady increase in

computer performance by increasing clock frequency or by introducing

overlap and parallelism. The clock frequency has reached its peak and

designers have given up any further effort in this direction. In 2004, Intel

cancelled a line of 4+ gigahertz processors due to difficulty controlling the

heat generated by such fast clock rates. Efforts towards instruction level

parallelism also changed direction from superscalar architecture to VLIW

architecture due to saturation in performance. Ultimately, the era of

advances in instruction level parallelism has come to an end, and instead,

processor level parallelism with multicore architecture has become the

current trend both for performance and parallelism. The density of

transistors on a single chip roughly doubles every two years, following

Moore's Law, a prediction made by Gordon Moore nearly 49 years ago

[14]. Additionally, logic is becoming less expensive in terms of area and

power consumption while communication is increasingly costly.

The RISC architecture was promoted in the days when

microprocessors had become complex, there were limitations on including

additional on-chip hardware, and embedded systems had a limited market

[12, 15, 16, 17]. Those were the days when the entire program memory

was external to the processor. Hence the processor architecture was

viewed in isolation. Either processor performance or processor power

consumption alone was the main objective rather than overall system

parameters. Today, as the entire BOPES is available as a single SoC, it is

desired to have a processor architecture that contributes to the overall

benefits to the SoC in terms of chip space, power consumption and cost,

rather than optimising these parameters for the processor core in isolation.

The performance provided must not come at the expense of unreasonable

power consumption or chip space. Considering the rapidly expanding

BOPES market, the savings in cost, space and power consumption can

justify the investments in development of a new tool chain for the processor

architecture. Although several strategies and techniques for reducing cost,

space and power consumption have been successfully used in practice, the

proposed solution for code size reduction by modification to RISC

architecture can give all the three gains in a single stroke. This approach

eliminates any additional effort required on the part of the embedded

system developers. The resultant savings in code size and proportional

reduction in code memory space are highly relevant in SoC based

applications such as wearable devices, implantable medical devices,

surveillance devices etc. In modern embedded systems, the area and power

consumed by the memory subsystems are 10 times those of the datapath [18,

19].

The memory subsystem forms a large part (typically up to 70%) of

the silicon area of the current day SoC and is expected to go up to 94% in

2014 as shown in Figure. 1.13 [20]. The allocation of physical real estate

(die area) of typical large ASIC and SoC designs tends to fall into three

general groups: die area dedicated to new custom logic, die area dedicated

to reusable logic (3rd-party IP or legacy internal IPs), and die area used for

embedded memory.

As Figure. 1.14 shows, while companies continue to develop their

own key custom blocks that help to differentiate their chips in the market (like

wireless DSP+RF for 802.11n, Bluetooth, and other emerging wireless

standards), and third-party IPs (such as USB cores, Ethernet cores, and

CPU/Micro-controller cores) occupy a fairly consistent percentage of die

area, the percentage of area used for embedded memory is increasing

dramatically. According to data from Semico Research, in 2013, the

majority of SoC ASIC designs allocated over 50% of their die area to various

embedded memories.

Figure. 1.13: Memory Trends in SoC

Figure. 1.14: Extent of Embedded memory in the die area in SoCs

Figure. 1.15: Multiple embedded memory IPs in multicore SoC

In addition, there is a wide variety in the purpose and ideal

characteristics of the many embedded memories in a large SoC, as seen in

Figure. 1.15. Consequently, it is very important for processor architects to

evolve a new ISA that suits the present trends in Embedded SoCs so as to

minimize code memory size.

1.6 RESEARCH OBJECTIVES

The overall aim of the research undertaken is to develop a set of

architectural changes to the RISC architecture for reducing code size in

SoC based Battery Operated Portable Embedded Systems (BOPES). The

goal of this thesis is to justify the need to replace the 'uniform instruction

size' feature by 'hybrid instruction size' in the embedded RISC cores used in

BOPES so as to minimize the code size for embedded programs. This

thesis proposes replacement of FIE with Hybrid Instruction Encoding (HIE)

with two modifications to RISC architecture: multiple instruction sizes, and

hybrid lengths for the offset and immediate fields. The provision for multiple

instruction sizes eliminates unused fields in most instructions thereby

reducing code size. Similarly, allowing hybrid lengths for the offset and

immediate fields minimizes wastage of bits in these fields.
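
The bit wastage that motivates hybrid field lengths can be illustrated with a small sketch. It assumes a 16-bit signed immediate field, as in MIPS32 I-type instructions, and that every value fits in that field. The function names and the sample values are hypothetical and are not part of the thesis tool chain.

```python
def min_signed_bits(value: int) -> int:
    """Smallest two's-complement width (in bits) that can hold value."""
    if value >= 0:
        return value.bit_length() + 1        # one extra bit for the sign
    return (-value - 1).bit_length() + 1     # e.g. -1 needs 1 bit, -128 needs 8


def wasted_bits(immediates, field_width: int = 16) -> int:
    """Total unused bits when each immediate occupies a fixed-width field.

    Assumes every immediate fits in field_width bits.
    """
    return sum(field_width - min_signed_bits(v) for v in immediates)


# Hypothetical immediates, as might appear in a handful of I-type instructions.
sample = [4, -8, 100, 0, 2047]
print(wasted_bits(sample), "of", 16 * len(sample), "immediate bits are unused")
```

A hybrid encoding that offers a shorter immediate field for small values recovers part of this unused space; how much depends on the field widths chosen.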

Code size savings and area reduction in SoCs

based on the proposed processor are estimated. A code analysis cum

conversion suite has been developed and various tools of this custom built

suite are used in different phases of the research work. The present

research work has established that architectural modification leads to

reduction in the code size of over 44% for certain portable embedded

applications. Such a gain is highly significant in certain healthcare products

such as pacemakers and bio-medical multiprocessor SoC for neuropathic

applications.

1.7 CONTRIBUTIONS

With this research objective in mind, several investigations have

been carried out and the main contributions of this work are summarized

below.

Behaviour Analysis of Embedded Applications

A code analysis tool for profiling object codes of MIPS32 (a typical

RISC processor) is developed. Analysis of RISC object codes for 23

embedded applications is performed using the code analyser tool to

determine the strategy for minimising the code size.

Design of Hybrid Instruction Encoding for RISC Processors

Two versions of Hybrid Instruction Encoding (HIE) are designed for

supporting multiple instruction sizes and hybrid lengths for the offset and

the immediate fields. For each of the 66 integer instructions of MIPS32, an

equivalent HIE instruction has been designed.

Design of Register Memory Architecture (RMA) for RISC Processor

This part of the research work involves design of 12 RMA ALU

instructions, each of which replaces a sequence of two RISC instructions,

for the MIPS processor. The traditional RISC pipeline sequence is rearranged

to suit both LSA and RMA instructions.

Design of Embedded RISC Processor

The fourth part of the research work deals with designing a hybrid

processor incorporating both HIE and RMA. The embedded object codes of

MIPS32 are recoded to the HIE-RMA processor using the custom built

code converter so that the code size reduction is measured. Further,

additional code reduction is explored with use of compound instructions

and composite instructions.

Developing Static Simulator for HIE / RMA

To estimate the efficiency of the HIE and RMA for RISC processor, a

code converter tool, MIPS Instruction Distribution Analyser cum Code

Converter (MIDACC), has been developed. The MIDACC converts the

object codes from MIPS ISA to HIE/RMA-ISA.

1.8 THESIS OVERVIEW

The rest of the thesis is organized as follows. The following chapter

provides the background material for the thesis. It begins by presenting an

overview of different types of embedded processors. It describes the

various attributes of Instruction Set Architecture (ISA) and discusses the

processor performance and high performance architectural features. Finally,

an overview of techniques for designing for low power consumption is

presented.

In Chapter 3, an analysis of the object codes of 23 embedded

applications from MiBench and MediaBench benchmarks is provided. The

behavioral pattern of the MIPS object codes of the 23 embedded applications,

obtained using MIDA, the custom built code analyzer tool for MIPS32 object codes,

is discussed. Apart from measuring the static instruction frequencies, this

tool calculates the total amount of underutilization of offset and immediate

fields in the object codes.

Hybrid Instruction Encoding for RISC processors is addressed in

Chapter 4. Here, two different HIE designs for MIPS processor are

proposed and the embedded domains suitable for each of these are

identified.

Chapter 5 presents a register-memory architecture for the MIPS

processor. The hardware redesign required to support the RMA at the

micro architectural level and the impact of RMA on processor performance

are addressed.

In Chapter 6, the design of an embedded core using both HIE and

RMA concepts is explored. Integration of HIE with RMA along with use of

compound instructions and composite instructions in HIE2 code is done to

evaluate the overall code reduction. This chapter outlines relevant micro

architecture requirements and effectiveness of such a processor in the

present scenario with multi core SoCs for battery-powered hand held

embedded systems. Finally, this chapter summarizes the research work and

discusses limitations and possibilities for further research.

2. BACKGROUND AND RELATED WORK

This chapter provides the necessary background information that is

useful to understand the main contributions of the thesis. The following

section presents a brief discussion on techniques for designing for low

power consumption. Section 2.2 describes the various attributes of

Instruction Set Architecture (ISA). Section 2.3 discusses processor

performance and high performance architectural features. In section 2.4, an

overview of different types of embedded processors is presented.

Section 2.5 deals with the architectural aspects of the embedded systems.

Section 2.6 presents an overview of emergence of different RISC

processors including MIPS. Section 2.7 explains how MIPS32 instructions

waste bits. In section 2.8, various techniques, followed for embedded code

size reduction, are reviewed. Finally, the need for a new and dedicated ISA

for Embedded SoCs is elaborated in section 2.9.

2.1 DESIGN FOR LOW POWER CONSUMPTION

Power dissipation and energy efficiency are primary design

constraints for both simple and complex processors. As a result of the

growing market for battery-powered portable embedded systems, the drive

for minimum power consumption has become equally important as the

drive for increased performance. Power consumption in processors

consists of a static component, called leakage power, and a dynamic

component, called switching power. The total power consumption of a CMOS

circuit comprises three components [21]:

1. Switching power: This is the power dissipated by charging and

discharging the gate output capacitance, CL, and represents

the useful work performed by the gate. The energy per output

transition is given by the following equation where Vdd is power

supply voltage:

$$E_t = \frac{1}{2} \cdot C_L \cdot V_{dd}^2 = 1\ \text{picojoule} \qquad (2.1)$$

2. Short-circuit power: When the gate inputs are at an intermediate

level, both the p- and n-type networks can conduct. This results

in a transitory conducting path from Vdd to Vss. In a careful design

that avoids slow signal transitions, the short-circuit power is

usually a small fraction of the switching power.

3. Leakage current: The transistor networks do conduct a very

small current when they are in their 'off' state. Though it is

generally negligible in an active circuit, it can drain a supply

battery over a long period of time.

In a well designed active circuit, the switching power dominates, with

the short-circuit power forming 10% to 20% of the total power, and the

leakage current being significant only when the circuit is inactive.

Therefore, the total power dissipation, Pc, of a CMOS circuit, neglecting the

short-circuit and leakage components, is given by summing the dissipation

of every gate g in the circuit C:

$$P_C = \frac{1}{2} \cdot f \cdot V_{dd}^2 \cdot \sum_{g \in C} A_g \cdot C_L^g \qquad (2.2)$$

where f is the clock frequency, $A_g$ is the gate activity factor (reflecting the fact

that not all gates switch every cycle) and $C_L^g$ is the gate load capacitance.
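
Equation (2.2) can be evaluated directly once the per-gate parameters are known. The following is a minimal sketch; the gate list, the 100 MHz clock and the 1.8 V supply are hypothetical values chosen only to show the arithmetic, and the function name is an assumption.

```python
def dynamic_power(freq_hz: float, vdd: float, gates) -> float:
    """Switching power per equation (2.2).

    gates is an iterable of (activity_factor, load_capacitance_farads) pairs,
    one pair per gate g in the circuit C.
    """
    return 0.5 * freq_hz * vdd ** 2 * sum(a * c for a, c in gates)


# Hypothetical three-gate circuit: (A_g, C_L^g in farads).
gates = [(0.5, 10e-15), (0.1, 20e-15), (0.25, 8e-15)]

p_nominal = dynamic_power(100e6, 1.8, gates)
p_half_vdd = dynamic_power(100e6, 0.9, gates)

print(f"P at 1.8 V: {p_nominal:.3e} W")
print(f"P at 0.9 V: {p_half_vdd:.3e} W")  # a quarter of the nominal value
```

Halving the supply voltage cuts the switching power to one quarter, which is the quadratic dependence on Vdd expressed by the equation.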

The typical gate load capacitance is a function of the process

technology and is therefore not under the control of the designer. The

remaining parameters in the equation suggest the following approaches to

low-power design:

1. Minimize the power supply voltage, Vdd.

2. Minimize the circuit activity, A. Techniques such as clock gating

fall under this heading.

3. Minimize the number of gates. Simpler circuits use less power

than complex ones, all other things being equal.

4. Minimize the clock frequency, f. Although a lower clock rate

reduces the power consumption, it also reduces performance,

having a neutral effect on power-efficiency. If, however, a

reduced clock frequency allows operation at a reduced Vdd, this

will be highly beneficial to the power-efficiency.

5. Exploit parallelism. Duplicating a circuit allows the two circuits

to sustain the same performance at half the clock frequency of

the original circuit, which allows the required performance to be

delivered with a lower power supply voltage.
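
Approaches 4 and 5 can be compared numerically. The sketch below assumes that switching power is proportional to the number of circuit copies times f times the square of Vdd, and that performance is proportional to copies times f. The 0.7 voltage scale is a hypothetical value used only for illustration, and the function names are assumptions.

```python
def relative_power(copies: int, freq_scale: float, vdd_scale: float) -> float:
    """Switching power relative to one circuit at nominal f and Vdd."""
    return copies * freq_scale * vdd_scale ** 2


def relative_performance(copies: int, freq_scale: float) -> float:
    """Throughput relative to one circuit at nominal f."""
    return copies * freq_scale


def power_efficiency(copies: int, freq_scale: float, vdd_scale: float) -> float:
    """Performance per unit power, relative to the nominal design."""
    return relative_performance(copies, freq_scale) / relative_power(
        copies, freq_scale, vdd_scale
    )


# Approach 4: halve the clock, keep Vdd. Power halves, efficiency is unchanged.
slow = power_efficiency(1, 0.5, 1.0)

# Approach 5: two copies at half the clock, with Vdd lowered to a hypothetical 70%.
parallel = power_efficiency(2, 0.5, 0.7)

print(slow, parallel)
```

Under these assumptions the duplicated circuit delivers the original performance at roughly half the power, while the slower single circuit gains nothing in power-efficiency.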

Although static leakage power has historically been small compared

to dynamic switching power, the situation is changing as the feature sizes

decrease. The feature size of a chip process technology refers to the

smallest size of transistors, wires, or gaps between them that can be

created on the chip die with that process technology. As these sizes

decrease, the capacitance of the system of transistors, $C_L^g$, is lowered. This

reduced capacitance decreases the switching time of these transistors (or

gate delay), resulting in faster logic performance accommodating faster

processor clock frequencies. The gate activity factor approximates the

average switching activity of the circuit for each clock edge. The supply

voltage, Vdd, is lowered to reduce interference with the ever-closer

neighbouring components and to meet thermal requirements. Lowering Vdd

greatly reduces dynamic power consumption since the dynamic power is

proportional to the square of this supply voltage. However, lowering the

supply voltage in turn often requires a lowering of the threshold voltage, the

voltage level at which transistors switch, to maintain fast clock rates.

Lowering the threshold voltage and moving the threshold closer to ground

causes a disproportionate increase in the static leakage current and thus

an increase in static power consumption [22].

For a fixed task, decreasing the clock rate reduces the power, but

not the energy. The energy to execute a workload is equal to the average

power multiplied by the execution time for the workload. For BOPES

devices, battery life is more important than actual power consumption.

Hence energy is the proper metric.
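
The distinction between power and energy can be shown with a short calculation. This sketch assumes a fixed energy per clock cycle at constant Vdd and ignores leakage; the cycle count, frequencies and energy per cycle are hypothetical.

```python
def task_cost(cycles: float, freq_hz: float, energy_per_cycle_j: float):
    """Return (average power in W, execution time in s, energy in J) for a fixed task."""
    exec_time = cycles / freq_hz
    avg_power = energy_per_cycle_j * freq_hz
    return avg_power, exec_time, avg_power * exec_time


# Hypothetical task of one million cycles at 1 nJ per cycle.
fast = task_cost(1e6, 100e6, 1e-9)   # 100 MHz clock
slow = task_cost(1e6, 50e6, 1e-9)    # 50 MHz clock

print("fast:", fast)
print("slow:", slow)
```

The slower clock halves the average power and doubles the execution time, so the energy drawn from the battery for the task is the same in both cases.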

2.2 INSTRUCTION SET ARCHITECTURE (ISA)

The features that are built into an architecture's instruction set are

commonly referred to as the Instruction Set Architecture or ISA. The ISA

defines such features as the operations that can be used by the

programmers to create programs under that architecture, the operands

(data) that can be accepted and processed by the architecture, the storage, the

addressing modes used to gain access to and process operands, and

handling of interrupts. These features are important because an ISA

implementation is a determining factor in defining important characteristics

of an embedded design, such as performance, design time, available

functionality, and cost. In the embedded domain, it used to be true that

minimizing gates was the most important consideration of an ISA design

[7]. This is what led to many of the idiosyncrasies of early DSP designs.

Advances in VLSI technologies have changed this, and most of the

embedded world can now afford enough complexity to allow much more

regular and orthogonal instruction sets.

2.2.1 Instruction Types and Operations

The following information is provided either directly or indirectly by

an instruction [9]:

1. Operation code (opcode): Nature of operation done by the

instruction

2. Data: Type of data - binary, decimal, character etc.

3. Operand location: Memory, register etc.

4. Operand addressing: Method of specifying the operand location

(address)

5. Instruction length: Size - one byte, two bytes etc.

6. Number of address fields: zero address, single address, two

address etc.

Two computers of different architectures do not have the same

instruction set. Almost every architecture provides certain unique

instructions that ease the burden of compiler/programmer or the hardware

design. Based on the operations performed by the instructions, it is

common to classify the instructions into the following types:

1. Data transfer instructions: These move data from one

register/memory location to another.

2. Arithmetic instructions: These perform arithmetical operations.

3. Logical instructions: These perform Boolean logical operations.

4. Control transfer instructions: These modify program execution

sequence.

5. Input/output (I/O) instructions: These transfer information

between external peripherals and system nucleus

(CPU/memory)

6. String manipulation instructions: These manipulate strings of

byte, word, double word etc.

7. Translate instructions: These convert the data from one

format to another.

8. Processor control instructions: These control the processor

operation.

Table 2.1 lists sample instructions for each of the above eight types

and corresponding actions done by the processor for these instructions.

Table 2.1: Sample Instructions and processor actions

| Instruction type | Instruction | Action by processor |
|---|---|---|
| Data transfer | MOVE | Transfer data from source location to destination location |
| | LOAD | Transfer data from a memory location to a CPU register |
| | STORE | Transfer data from a CPU register to a memory location |
| | PUSH | Transfer data from the source to stack (top) |
| | POP | Transfer data from stack (top) to the destination |
| | XCHG | Exchange; swap the contents of the source and destination |
| | CLEAR | Reset the destination with all 0's |
| | SET | Set the destination with all 1's |
| Arithmetic | ADD | Add; calculate sum of two operands |
| | ADC | Add with carry; calculate the sum of operands and the 'carry' bit |
| | SUB | Subtract; calculate the difference of two numbers |
| | SUBB | Subtract with borrow; calculate the difference with 'borrow' |
| | MUL | Multiply; calculate the product of two operands |
| | DIV | Divide; calculate the quotient and remainder of two numbers |
| | NEG | Negate; change sign of operand |
| | INC | Increment; add 1 to operand |
| | DEC | Decrement; subtract 1 from operand |
| | SHIFTA | Shift arithmetic; shift the operand (left or right) with sign extension |
| Logical | NOT | Complement the operand |
| | OR | Perform bit-wise logical OR of operands |
| | AND | Perform bit-wise logical AND of operands |
| | XOR | Perform bit-wise 'exclusive OR' of operands |
| | SHIFT | Shift the operand (left or right) filling the empty bit positions as 0's |
| | ROT | Rotate; shift the operand (left or right) with wrap-around |
| | TEST | Test for specified condition and set or reset relevant flags |
| Control transfer | JUMP | Branch; enter the specified address into Program Counter (PC) |
| | JUMPIF | Branch on condition; enter the specified address into PC only if the specified condition is satisfied; conditional transfer |
| | JUMPSUB | CALL; save current 'program control status' (into stack) and then enter the specified address into PC |
| | RET | RETURN; unsave (restore) 'program control status' (from stack) into PC and other relevant registers and flags |
| | INT | Interrupt; create a software interrupt; save 'program control status' (into stack) and enter the address corresponding to the specified vector code into PC |
| | IRET | Interrupt return; restore (unsave) 'program control status' (from stack) into PC and other relevant registers and flags |
| | LOOP | Iteration; decrement the implied register by 1 and test for non-zero; if satisfied, enter the specified address into PC |
| Input-output | IN | Input; read data from the specified input port / device into specified or implied register |
| | OUT | Output; write data from specified or implied register into an output port/device |
| | TEST I/O | Read the status from I/O subsystem and set condition flags (codes) |
| | START I/O | Inform the I/O processor (or the data channel) to start the I/O program consisting of commands for the I/O operations |
| | HALT I/O | Inform the I/O processor (or the data channel) to abort the I/O program consisting of commands for the I/O operations under progress |
| String manipulation | MOVS | Move byte or word of string |
| | LODS | Load byte or word of string |
| | CMPS | Compare byte or word of strings |
| | STOS | Store byte or word of string |
| | SCAS | Scan byte or word of string |
| Translate | XLAT | Translate; convert the given code into another by table lookup |
| | PACK | Convert the unpacked decimal number into packed decimal |
| | UNPACK | Convert the packed decimal number into unpacked decimal |
| Processor control | HLT | Halt; stop instruction cycle (processing) |
| | STI (EI) | Set/enable interrupt; sets interrupt enable flag to '1', so as to allow maskable interrupts |
| | CLI (DI) | Clear/disable interrupt; resets interrupt enable flag to '0' so as to ignore maskable interrupts |
| | WAIT | Freeze instruction cycle till a specified condition, such as an input signal becoming active, is satisfied |
| | NOOP | No operation; no action |
| | ESC | Escape; the next instruction after the ESC is to be skipped since it is meant for the coprocessor |
| | LOCK | Reserve the bus, and hence the memory, till the next instruction, following the LOCK instruction, is executed/completed |
| | CMC | Complement 'carry' flag |
| | CLC | Clear 'carry' flag |
| | STC | Set 'carry' flag |

2.2.2 Operation codes

There are a number of ways to allocate opcodes to an instruction

[11]. The design issue is to reduce the number of bits in the instruction

(small bit budget) while providing a large number of opcodes for a rich

instruction set. The following three design techniques have been used to meet

these requirements:

1. A fixed-length opcode allocated to variable length instructions as in

IBM S370 (Figure. 2.1)

2. A variable-length opcode provided by opcode expansion, allocated

in variable-length instructions as in Intel x86 (Figure. 2.2)

3. A variable-length opcode provided by opcode expansion, allocated

in a fixed-length instruction as in MIPS32 (Figure. 2.3).
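
The third technique, opcode expansion inside a fixed-length instruction, can be sketched for MIPS32. In MIPS32 the primary opcode occupies bits 31 to 26, and when it is zero the operation is selected by the 6-bit function field in bits 5 to 0. The function name and the returned tuple layout below are illustrative assumptions, and only this one level of expansion is shown.

```python
def decode_mips32(word: int):
    """Split a 32-bit MIPS32 instruction word into its opcode information.

    Returns (kind, opcode, funct). funct is None when the primary opcode
    alone identifies the instruction.
    """
    opcode = (word >> 26) & 0x3F          # primary opcode, bits 31..26
    if opcode == 0:                       # expanded opcode: look at bits 5..0
        funct = word & 0x3F
        return ("expanded", opcode, funct)
    return ("primary", opcode, None)


# add $3, $1, $2 is encoded as 0x00221820 (opcode 0, funct 0x20).
print(decode_mips32(0x00221820))
# addi $2, $1, 5 is encoded as 0x20220005 (opcode 8).
print(decode_mips32(0x20220005))
```

The instruction length stays at 32 bits in both cases; only the number of bits that act as opcode changes.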

2.2.3 Addressing modes

An addressing mode is the method by which the location of an

operand is specified within an instruction. Table 2.2 defines popular

addressing modes. A given ISA may not support all the addressing modes.

Table 2.2: Addressing modes and mechanisms

| Addressing mode | Mechanism | Remarks/examples |
|---|---|---|
| Implied addressing | Operand address is not specified explicitly | RET and IRET |
| Immediate addressing | Operand is given in the instruction | Fast operand fetch but operand size is limited as it increases instruction length |
| Direct addressing (Absolute addressing) | Operand is in a memory location; its address is given in the instruction | One memory access required to get the operand |
| Indirect addressing | Operand is in a memory location; its address is also in memory; address of the location containing the operand address is given in the instruction | Two memory accesses are required to get the operand |
| Register direct addressing | Operand is in a register; the register address/number is given in the instruction | Faster operand fetch compared to direct addressing |
| Register indirect addressing | Operand is in memory; its address is in a register; address/number of the register is given in the instruction | Faster operand fetch than indirect addressing |
| Base register addressing | Operand is in memory; its address is specified in two parts; the instruction gives an offset number and also specifies the base register; the offset (integer number) has to be added to the base register contents | Useful in relocation of programs |
| PC-relative addressing | Similar to base register addressing, but the register always being the PC | Mostly used by branch instructions |
| Index addressing | The operand is in memory; the instruction gives an address, and the index register contains an offset number; the address and the offset number are added to get the operand address | Convenient for indexing arrays |
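
The mechanisms in Table 2.2 that yield a memory address can be summarised as an effective address calculation. The following is a minimal sketch in which registers and memory are modelled as dictionaries; the mode names, the argument layout and the sample values are assumptions made for illustration.

```python
def effective_address(mode: str, field, regs: dict, memory: dict, pc: int = 0) -> int:
    """Return the memory address of the operand for the given addressing mode."""
    if mode == "direct":                 # address is in the instruction
        return field
    if mode == "indirect":               # instruction points to a location holding the address
        return memory[field]
    if mode == "register_indirect":      # address is held in a register
        return regs[field]
    if mode == "base":                   # base register contents plus offset
        base_reg, offset = field
        return regs[base_reg] + offset
    if mode == "pc_relative":            # like base addressing, with PC as the register
        return pc + field
    if mode == "index":                  # address in instruction plus index register
        address, index_reg = field
        return address + regs[index_reg]
    raise ValueError(f"mode {mode!r} does not produce a memory address")


regs = {"R1": 100, "R2": 8}
memory = {200: 300}

print(effective_address("indirect", 200, regs, memory))
print(effective_address("base", ("R1", 12), regs, memory))
```

Implied, immediate and register direct addressing are left out because the operand is not fetched from a computed memory address in those modes.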

Figure. 2.1: IBM S370 Instruction Formats

Figure. 2.2: INTEL Pentium Pro Instruction Formats

Figure. 2.3: MIPS32 Instruction Formats

2.2.4 Data types

Application programs may use various types of data depending on

the problem. A machine language program can operate either on numeric

data or on non-numeric data. The numeric data can be either binary or

decimal number. The non-numeric data can be any of the following types:

characters, addresses, and logical data. All non-binary data is represented

inside a computer in the binary coded form. The binary data can be

represented either as a fixed-point or a floating-point number. In fixed-point

number representation, the position of the binary point is rigidly fixed in

one place. In floating-point number representation, the binary point's

position can be anywhere. The fixed-point numbers are known as integers

whereas the floating-point numbers are known as real numbers. Arithmetic

operations on fixed-point numbers are simple and they require minimum

hardware circuits. The floating-point arithmetic is complex and requires

extensive hardware circuits. Compared to fixed-point numbers, the floating-

point numbers have two advantages:

1. The maximum or minimum value that can be represented in

floating-point number representation is higher. Hence it is

useful in dealing with very small or very large numbers.

2. The floating-point number representation leads to better

accuracy in arithmetic operations.

2.2.5 ISA Models

There are several different ISA models that architectures are based

upon, each with its own specifications for the various features. The most

commonly implemented ISA models are application-specific, general

purpose and instruction level parallel. Application-Specific ISA Models

define processors that are intended for specific embedded applications,

such as processors made only for TVs. General-purpose ISA models are

typically implemented in processors targeted to be used in a wide variety of

systems, rather than only in specific types of embedded systems. CISC

model and RISC model are the common types of general-purpose ISA

architectures implemented in embedded processors. Many current

processor designs fall under the CISC or RISC category primarily because

of their heritage. RISC processors have become more complex, while CISC

processors have become more efficient to compete with their RISC

counterparts, thus blurring the line between the definition of a RISC versus

a CISC architecture. Technically, these processors have both RISC and

CISC attributes, regardless of their definitions. Instruction-level parallelism

ISA architectures are similar to general-purpose ISAs, except that they

execute multiple instructions in parallel, as the name implies. Examples of

instruction-level parallelism ISAs [9] include SIMD model, Superscalar

model, and VLIW model.

2.3 PROCESSOR PERFORMANCE AND ADVANCED ARCHITECTURES

The performance of a processor is measured by the amount of time

taken by the processor to execute a program. The processor performs an

instruction cycle for each instruction. Table 2.3 illustrates the actions taken

at various steps of the instruction cycle for ADD instruction. Elementary

operations performed by the processor during instruction cycle execution

are known as micro-operations. A given micro-operation takes place when

the corresponding control signal is issued by the processor. Table 2.4

illustrates some sample micro-operations performed by the processor. The

time taken for executing different instructions is not the same. Hence the

type of instructions executed in a program and the number of instructions

executed by the processor, while running the program, decide the time

taken by the processor to execute a program.

Table 2.3: Instruction cycle steps and actions for ADD instruction

| Sl. No. | Step | Action responsibility | Remarks | Parameter affecting performance |
|---|---|---|---|---|
| 1 | Instruction fetch | Control unit; external action | Fetches next instruction from main memory | Memory access time |
| 2 | Instruction decode | Control unit; internal action | Analyses opcode pattern in the instruction and identifies the exact operation specified | Decode time |
| 3 | Operand fetch | Control unit; external (memory) or internal action depending on the location of operands | Determines the operand addresses and then fetches the operands, one by one, from main memory or CPU registers and supplies them to ALU | (1) Operand address calculation time; (2) Register/memory access time |
| 4 | Execute (ADD) | ALU; internal action | Specified arithmetic operation is done | Addition time |
| 5 | Result store | Control unit; external or internal action | Stores the result in memory or registers | Register/memory access time |

Table 2.4: Sample micro-operations

| Sl. no. | Control signal | Micro-operation | Remarks |
|---|---|---|---|
| 1 | MAR ← PC | Contents of PC are copied (transferred) to Memory Address Register (MAR) | The first micro-operation in instruction fetch |
| 2 | PC ← PC + 4 | Contents of PC are incremented by 4 | The PC always points to next instruction address |
| 3 | IR ← MBR | Contents of Memory Buffer Register (MBR) are copied to Instruction Register (IR) | The last micro-operation in instruction fetch |
| 4 | MBR ← R2 | Contents of R2 register are copied to MBR | The first micro-operation in result store |
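
The instruction fetch micro-operations of Table 2.4 can be traced in a few lines. This is a sketch only: the memory read step (MBR ← Memory[MAR]) is not listed in the table and is added here as an assumption, and the register dictionary and memory contents are hypothetical.

```python
def instruction_fetch(state: dict, memory: dict) -> dict:
    """Apply the fetch micro-operations in order and return the updated state."""
    state["MAR"] = state["PC"]             # MAR <- PC (first micro-operation)
    state["MBR"] = memory[state["MAR"]]    # MBR <- Memory[MAR] (assumed read step)
    state["PC"] = state["PC"] + 4          # PC <- PC + 4
    state["IR"] = state["MBR"]             # IR <- MBR (last micro-operation)
    return state


# Two hypothetical 32-bit instruction words at byte addresses 0 and 4.
memory = {0: 0x00221820, 4: 0x20220005}
state = {"PC": 0, "MAR": 0, "MBR": 0, "IR": 0}

instruction_fetch(state, memory)
print(hex(state["IR"]), state["PC"])
```

After the fetch, the IR holds the instruction that was at the old PC, and the PC already points to the next instruction address.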

The following equation is commonly used for expressing a

computer's performance ability:

$$\frac{\text{time}}{\text{program}} = \frac{\text{time}}{\text{cycle}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{instructions}}{\text{program}} \qquad (2.3)$$

In other words, the execution time is given by the following equation:

$$T_p = N_{ie} \times \mathrm{CPI} / F \qquad (2.4)$$

where $N_{ie}$ is the number of instructions executed (and not the number of

instructions present in the program), CPI is the average number of clock

cycles needed for an instruction, and F is the clock frequency. The CISC

approach attempts to minimize the number of instructions per program,

sacrificing the number of cycles per instruction. RISC does the opposite,

reducing the cycles per instruction at the cost of the number of instructions

per program.
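
The trade-off expressed by equation (2.4) can be made concrete with a small calculation. The instruction counts and CPI values below are hypothetical figures chosen only to illustrate the formula; they are not measurements of any real processor.

```python
def execution_time(n_ie: float, cpi: float, freq_hz: float) -> float:
    """Execution time in seconds: T_p = N_ie * CPI / F."""
    return n_ie * cpi / freq_hz


# Hypothetical CISC-style run: fewer instructions, more cycles per instruction.
t_cisc = execution_time(n_ie=1.0e6, cpi=4.0, freq_hz=100e6)

# Hypothetical RISC-style run: more instructions, fewer cycles per instruction.
t_risc = execution_time(n_ie=1.3e6, cpi=1.2, freq_hz=100e6)

print(f"CISC-style: {t_cisc * 1e3:.2f} ms, RISC-style: {t_risc * 1e3:.2f} ms")
```

With these assumed numbers the larger instruction count is more than offset by the lower CPI; with other numbers the result could go the other way, which is why both factors must be considered together.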

For any specific computer, there are two simple measurements that

give us an idea about its performance:

1. Response time or execution time: This is the time taken by the computer to execute a given program, from start to completion. The response time for a program is different for different computers.

2. Throughput: This is the work done (total number of programs

executed) by the computer during a given period of time.

2.3.1 Instruction Pipelining

In a simple processor (scalar, non-pipelined), the steps of an instruction cycle are performed sequentially, one after the other, and successive instructions are also executed sequentially. Instruction pipelining (Figure. 2.4) is a technique in which the execution of successive instructions is overlapped. The goal is to increase the total number of instructions executed in a given period of time.

In a pipelined processor, different sections of the processor perform

different steps of the instruction cycle for different instructions at a given

time. Each step is called a pipe stage. All the pipe stages together form a

pipe.

Figure. 2.4: A six stage instruction pipeline

In a six stage instruction pipeline, six instructions can be active simultaneously. If it is assumed that all instructions are independent of one another, then, once the pipeline is full, one instruction can be completed in each clock cycle due to the overlap of the instruction cycles of consecutive instructions. In practice, three types of hazards (data, structural, and control) reduce the pipeline efficiency [9].
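The ideal case can be quantified with a short Python sketch. It assumes that every stage takes exactly one clock cycle and that no hazards occur, so n instructions on a k-stage pipeline need k + n - 1 cycles; the function names are illustrative only.

```python
def pipelined_cycles(num_instructions, num_stages):
    """Cycles for an ideal pipeline: k cycles for the first instruction,
    then one more cycle for each remaining instruction."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def unpipelined_cycles(num_instructions, num_stages):
    """Cycles when every instruction runs all stages before the next starts."""
    return num_instructions * num_stages


def speedup(num_instructions, num_stages):
    """Ideal speedup of the pipelined over the non-pipelined processor."""
    return (unpipelined_cycles(num_instructions, num_stages)
            / pipelined_cycles(num_instructions, num_stages))


six_on_six = pipelined_cycles(6, 6)   # 11 cycles
long_run = speedup(1000, 6)           # close to, but below, 6
```

For long instruction sequences the speedup approaches the number of stages; hazards reduce it below this ideal bound.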

Dependencies between instructions are a property of programs. If two instructions are dependent, they should not be executed simultaneously, although they may be partially overlapped. Two instructions may be directly data dependent, or indirectly data dependent through another instruction via a chain of dependencies. In case of dependence, there are two possible solutions:

1. Preserving the dependence but preventing a hazard

2. Removing the dependence by transforming the object code.

Techniques used for detecting and preventing hazards should

preserve program order so that the overall behaviour and results of the

program are not affected.
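Direct and indirect data dependence can be illustrated with the following Python sketch. It considers only true (read-after-write) dependences, and the `Instr` record, the register names and the four-instruction program are hypothetical examples rather than part of the architecture discussed here.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read


def direct_dependences(program: List[Instr]) -> Set[Tuple[int, int]]:
    """Pairs (i, j), i < j, where instruction j reads the value written by i."""
    deps = set()
    last_writer = {}           # register -> index of its most recent writer
    for j, ins in enumerate(program):
        for reg in ins.srcs:
            if reg in last_writer:
                deps.add((last_writer[reg], j))
        if ins.dest is not None:
            last_writer[ins.dest] = j
    return deps


def all_dependences(program: List[Instr]) -> Set[Tuple[int, int]]:
    """Direct dependences plus indirect ones found by following chains."""
    deps = set(direct_dependences(program))
    changed = True
    while changed:
        changed = False
        for (a, b) in list(deps):
            for (c, d) in list(deps):
                if b == c and (a, d) not in deps:
                    deps.add((a, d))
                    changed = True
    return deps


program = [
    Instr("ADD R1, R2, R3", "R1", ("R2", "R3")),
    Instr("SUB R4, R1, R5", "R4", ("R1", "R5")),    # reads R1 written by ADD
    Instr("OR  R6, R4, R7", "R6", ("R4", "R7")),    # reads R4 written by SUB
    Instr("AND R8, R9, R10", "R8", ("R9", "R10")),  # independent of the others
]
direct = direct_dependences(program)
everything = all_dependences(program)
```

Here OR depends directly on SUB and indirectly on ADD, while AND is independent and could be overlapped freely with the other three.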

2.3.2 RISC Instructions and Pipelining

Though pipelining can be implemented in both CISC and RISC types

of processors to enhance performance, it is simpler to design a pipelined

RISC processor. The following properties of RISC architecture help in

simplifying the pipeline design:

1. All instructions are of equal size, say 4 bytes.

2. Instruction formats are not many; just 1 to 3.

3. Arithmetic and other operations on data always have operands

(data) in registers (not in memory).

4. Only load and store instructions can access memory.

Generally, RISC processors have three types of instructions: ALU instructions, load and store instructions, and branch and jump instructions. In ALU instructions, the operands are available in registers, and on completion the results are stored in registers. In load and store instructions, one operand is in a register and the other operand is in memory. The address of the memory operand is generally specified as the sum of two parts: the base register contents and the offset indicated by the immediate field in the instruction. In branches and jumps, the branch conditions are usually specified in one of two ways:

1. Comparison of two items in registers

2. Condition bits or condition codes

Unconditional jumps are present in almost all RISC processors. The traditional RISC pipeline has five stages, as shown in Figure. 2.5 (a). Figure. 2.5 (b) shows the timing diagram while executing 6 instructions over 10 clock cycles. Figure. 2.5 (c) shows the RISC pipeline as a series of datapaths shifted in time.

Figure. 2.5 (a): Five stage pipeline

Figure. 2.5 (b): Timing Diagram

[Figure: six instructions, listed in program execution sequence on the vertical axis, each pass through the stages CC, R, ALU, DC and R; the horizontal axis shows time in clock cycles, from 1 to 10.]

CC - Code Cache (instruction memory); R - Registers; ALU - Arithmetic Logic Unit; DC - Data Cache (data memory)

Figure. 2.5 (c): RISC Pipeline as a series of datapaths

Tradeoffs in micro architecture have changed somewhat since the

RISC five-stage pipeline [7]. In the early RISC days, transistor count

limitations convinced the designers to reuse the ALU for address

computations. Today, transistors are almost free of cost but wires are

expensive. Each additional pipeline stage has a marginal benefit in terms of

spreading out the work in smaller steps that may allow a lower cycle time,

and a marginal cost in terms of added design complexity and global

overheads. Table 2.5 defines the clock cycles, the respective stages of the instruction cycle, and the micro-operations. The actual number of clock cycles required for different instructions is as follows:

Unconditional branch instruction: 2 (cycles 1 and 2)

Store instruction: 4 (cycles 1 to 4)

Any other instruction: 5 (cycles 1 to 5)
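As a small worked example, the Python sketch below averages these per-instruction cycle counts over an instruction mix. The mix (10% unconditional branches, 15% stores, 75% other instructions) is a hypothetical assumption, and the result is the average number of cycles an instruction occupies, not the effective CPI of the pipelined processor, where instructions overlap.

```python
# Cycle counts per instruction type, as listed above.
CYCLES = {"branch": 2, "store": 4, "other": 5}


def average_cycles(mix):
    """Weighted average of the cycle counts for a mix of instruction counts."""
    total = sum(mix.values())
    if total == 0:
        raise ValueError("instruction mix is empty")
    return sum(CYCLES[kind] * count for kind, count in mix.items()) / total


# Hypothetical instruction mix (counts per 100 instructions).
mix = {"branch": 10, "store": 15, "other": 75}
avg = average_cycles(mix)   # (10*2 + 15*4 + 75*5) / 100 = 4.55
```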

There are many alternate design options offering varying

performance levels. The designer chooses the best option taking into

account the hardware cost and required performance level.

There are two major problems in a practical pipeline:

1. Resource Conflict: Two different operations at two

sections/stages may need the same hardware resource in the

same clock cycle, due to overlapping of instructions. To resolve

this, multiple resources of the same type can be provided in the

hardware. This will increase the cost and hence should be

done judiciously.

2. Interference between adjacent stages: Two instructions in

different stages of the pipeline should not interfere with each

other. To resolve this, pipeline registers are used between

successive stages of the pipeline. The pipeline registers are

named indicating the stages linked by them such as IF/ID,

ID/EX, EX/MEM and MEM/WB. The result of any specific stage

is stored in the pipeline register at the end of a clock cycle.

During the next clock cycle, the contents of the pipeline register

serve as input to the next stage. In some cases, the result

generated by one stage may not be used as input to the next

stage. It may propagate through more than one stage. For

example, for a STORE instruction, the result is produced in the

ID stage but it is stored in memory only in the MEM stage.

Table 2.5: Typical instruction cycle phases in RISC processors

| Sl. no. | Clock cycle | Instruction cycle phase | Major micro operations | Hardware sections involved |
|---|---|---|---|---|
| 1 | 1 | Instruction Fetch (IF) | a. Send PC contents to memory; b. Fetch the current instruction from memory; c. Increment PC by 4 to indicate the next instruction address | a. Cache memory |
| 2 | 2 | Instruction Decode (ID); plus Register Read cycle | a. Decode the instruction; b. Read the contents of source registers; c. Compare the contents of registers (as preparation for certain instructions such as compare) | a. Instruction decoder; b. Registers; c. Adder / comparator |
| 3 | 3 | Execution (EX); plus Effective address cycle | a. For ALU instruction, the specified operation is done by the ALU; b. For memory reference instruction (load/store), the effective address is calculated by ALU by adding the base register contents and the offset; c. For branch instruction, testing of branch condition is done | a. ALU; b. ALU; c. ALU |
| 4 | 4 | Memory Access (MEM); plus branch completion | a. For load instruction, memory read operation from the effective address is done; b. For store instruction, memory write operation at the effective address, storing the contents of source register; c. For branch instruction, the branch address is entered in PC if branch occurs | a. Cache memory; b. Cache memory |
| 5 | 5 | Write-back (WB) | a. The result is stored in the destination register for load instruction and ALU instruction | a. Registers |

2.3.3 Superscalar processor

In a scalar pipelined processor, though there are multiple

instructions simultaneously active in the pipeline, there is only one

execution unit/functional unit. Hence at a given time, only one instruction

can be in the execution unit. In a superscalar architecture, there are

multiple pipelines in the processor and hence two or more instructions can

be executed simultaneously. In other words, in a superscalar processor, the same type of operation (add, shift, etc.) can be executed simultaneously, in a single clock cycle, on multiple pipelines for different instructions. Figure. 2.6 shows the organization of a superscalar processor with two pipelines [9]. In some superscalar processors, instruction sequencing is static (at compilation time), but in the majority of superscalar processors it is dynamic (at run time). The control unit in a dynamic superscalar processor is complex, whereas in a static superscalar processor the complexity lies in the compiler.

2.3.4 Very Long Instruction Word (VLIW) Processor

The VLIW architecture exploits Instruction Level Parallelism (ILP)

with close cooperation between the compiler and the processor. The

processor has multiple functional units similar to a dynamic superscalar

processor but scheduling is done by the compiler that groups several

independent operations into a very long instruction word. Each VLIW has

multiple fields/slots with each slot containing one RISC like operation. Each

operation corresponds to a functional unit. During the execution of a VLIW,

the processor performs all the operations in parallel in different functional

units. Figure. 2.7 illustrates the principle of a VLIW processor [9].
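The compiler-side grouping can be sketched in Python as a simple greedy packer. This is only an illustration under stated assumptions: the eight slot names follow the slots shown in Figure. 2.7, each operation is a hypothetical (slot, destination, sources) tuple, and an operation starts a new word whenever its slot is already taken or it touches a register written earlier in the same word. A production VLIW compiler uses far more elaborate scheduling.

```python
SLOTS = ("add", "mul", "load", "store", "cmp", "branch", "mulfl", "addfl")


def pack(ops):
    """Greedily pack operations, in program order, into long instruction words.

    ops: list of (slot, dest, srcs) tuples. Returns a list of dicts that map
    a slot name to the operation placed in it.
    """
    bundles = []
    current = {}
    written = set()            # registers written by the word being built
    for slot, dest, srcs in ops:
        if slot not in SLOTS:
            raise ValueError("unknown slot: " + slot)
        conflict = (slot in current
                    or dest in written
                    or any(s in written for s in srcs))
        if conflict:
            bundles.append(current)
            current, written = {}, set()
        current[slot] = (slot, dest, srcs)
        if dest is not None:
            written.add(dest)
    if current:
        bundles.append(current)
    return bundles


ops = [
    ("load", "R1", ("R10",)),
    ("add", "R2", ("R3", "R4")),     # independent of the load: same word
    ("mul", "R5", ("R1", "R2")),     # needs R1 and R2: next word
    ("store", None, ("R5", "R11")),  # needs R5: next word
]
bundles = pack(ops)
```

The four operations end up in three long words; the unused slots in each word would be filled with no-operations.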

[Figure: block diagram showing the main memory and system bus, a unified cache, the IF unit, an instruction queue, and a decode and dispatch unit that issues two instructions (even and odd) to the execute units EU-1 and EU-2, each with OF, EX and SR stages, together with the registers, write buffers and cache memory.]

IF - Instruction Fetch; OF - Operand Fetch; EX - Execute; SR - Store Results; EU - Execute unit

Figure. 2.6: Superscalar Processor Organisation

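[Figure, panel (a) Inside VLIW Processor: the instruction cache memory feeds an instruction register (IR) with eight operation slots (add, mul, load, store, cmp, branch, mulfl, addfl); the functional units (FUs) are INT ALU, INT MUL/DIV, MAU 1, MAU 2, INT ALU, Branch unit, FLOAT MUL/DIV and FLOAT ADDER; there is an integer RF and a floating point RF, and a data cache and bus interface connect to the system bus.]

[Figure, panel (b) VLIW and one operation: a 256-bit VLIW with the slots add, mul, load, store, cmp, branch, mulfl and addfl; one operation (for example, add R1 R2) occupies 32 bits.]

IR - Instruction Register; FU - Functional Unit; RF - Register File; INT - Integer; MAU - Memory Addressing Unit

Figure. 2.7: VLIW Processor Organisation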

2.3.5 Cache Memory

The cache memory is a small and fast intermediate buffer between

the processor and the main memory with the objective of reducing the

processor's waiting time during main memory access. The presence of

cache memory is not known to application programs. Figure. 2.8 illustrates

the use of cache memory.

Figure. 2.8: Use of Cache memory

The main memory is conceptually divided into many blocks, each containing a fixed number of consecutive locations. The cache memory is organized as a number of lines, and the size of each line is the same as the capacity of a main memory block. The cache operation is based on locality of reference [23], a property inherent in programs. Most of the time, the instructions or data needed are available in main memory locations that are physically close to the main memory location currently being accessed. There are two kinds of behaviour patterns:

1. Temporal locality: A recently accessed memory location is

likely to be accessed again.

2. Spatial locality: The neighbouring location to the recently

accessed memory location is likely to be accessed.

In view of these two properties, while reading a location from main

memory, the content of entire block is transferred and stored in cache

memory. There are more blocks in main memory than the number of lines

in cache memory. Hence a mapping function is followed by the cache

controller to systematically map any main memory block to one of the

cache lines. When the processor needs a memory operand, the cache

controller checks the cache memory to find out if the current main memory

address is already mapped onto cache. If it is mapped, it means the

required item is available in cache memory and this condition is called

'cache hit'. Then the required information is read from cache memory.

On the other hand, if the current main memory address is not mapped in cache memory, the required information is not available in cache memory and this situation is known as a 'cache miss'. In this case, the entire block containing the main memory address is brought into the cache memory. The time taken to bring the required item from the main memory and supply it to the processor is known as the 'miss penalty'. The hit rate (also known as hit ratio) is the ratio of the number of accesses that resulted in a cache hit to the total number of accesses.
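The hit and miss behaviour can be illustrated with a minimal Python simulation. The text does not name a particular mapping function, so this sketch assumes direct mapping (block number modulo the number of lines); the class name, the cache geometry (4 lines of 16 locations) and the address trace are hypothetical.

```python
class DirectMappedCache:
    """Tracks which main memory block each cache line currently holds."""

    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.blocks = [None] * num_lines   # block number stored in each line
        self.hits = 0
        self.accesses = 0

    def access(self, address):
        """Return True on a cache hit, False on a cache miss."""
        block = address // self.block_size
        line = block % self.num_lines      # assumed mapping function
        self.accesses += 1
        if self.blocks[line] == block:
            self.hits += 1
            return True
        self.blocks[line] = block          # miss: the whole block is brought in
        return False

    @property
    def hit_rate(self):
        return self.hits / self.accesses if self.accesses else 0.0


cache = DirectMappedCache(num_lines=4, block_size=16)
results = [cache.access(a) for a in (0, 4, 8, 64, 0, 4)]
```

Addresses 4 and 8 hit because of spatial locality (they share a block with address 0). Address 64 maps to the same line as address 0 and evicts it, so the next access to address 0 misses again.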

The cache memory is of two types: unified cache (or common cache) and split cache. The unified cache stores both instructions and data. In a split cache, there is a separate instruction cache (also known as code cache) and data cache. Some computers use a two-level or three-level cache memory system. The cache immediately next to the processor is known as the level 1 cache or primary cache. The next level cache is called the level 2 cache or secondary cache. Most microprocessors now incorporate multi-level caches on-chip.

2.3.6 Virtual Memory

The virtual memory concept facilitates the execution of large programs in systems with smaller physical memory. Virtual memory is desirable in the following two cases:

1. The logical memory space of the processor is small

2. The physical main memory space has to be kept small to

reduce the cost though the processor has large logical memory

space.

Figure. 2.9 illustrates the concept of virtual memory. In a virtual memory system, the OS automatically manages long programs by storing the entire program on a large hard disk. At a given time, only some portions of the program are stored in main memory. During the execution of the program, different portions of the program are swapped between the main memory and the hard disk on a need basis. The program does not address the physical memory directly.

CM - Cache memory; optional unit.

Figure. 2.9: Virtual memory concept

While referring to an instruction or operand, the program provides the logical address, and the virtual memory hardware (also known as the memory management unit or MMU) in the processor translates it into the equivalent physical memory address [9]. There are two popular methods of virtual memory implementation: paging and segmentation. In paging, the system software divides the program into pages of equal size. In segmentation, the machine language programmer organizes the program into different segments, which need not be of the same size. Figure. 2.10 illustrates the mechanism of virtual memory.

Figure. 2.10: Virtual memory mechanism
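For the paging method, the translation from logical to physical address can be sketched in Python as follows. The 4096-byte page size, the dictionary used as a page table and the `PageFault` exception are illustrative assumptions; an actual MMU performs this lookup in hardware, and the OS handles a fault by loading the missing page from disk.

```python
PAGE_SIZE = 4096   # hypothetical page size in bytes


class PageFault(Exception):
    """Raised when the referenced page is not present in main memory."""


def translate(logical_address, page_table, page_size=PAGE_SIZE):
    """Translate a logical address into a physical address by paging."""
    page, offset = divmod(logical_address, page_size)
    frame = page_table.get(page)
    if frame is None:
        raise PageFault(page)      # the OS would now bring the page in
    return frame * page_size + offset


# Hypothetical page table: logical page number -> physical frame number.
page_table = {0: 5, 1: 2}
physical = translate(4100, page_table)   # page 1, offset 4 -> frame 2
```

The offset within the page is unchanged by the translation; only the page number is replaced by a frame number.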

2.3.7 Multicore CPU

Building a high performance computer system by linking together

several low performing computers is a standard technique of achieving

parallelism. This idea is the basis for development of multiprocessor

systems. Designing a microcomputer using multiple single-chip

microprocessors has been a cost-effective strategy for several years in the

past. The latest trend is the design of multicore microprocessors, resulting in a quantum change in the way multiprocessor systems are developed and used for various applications [10]. Figure. 2.11 illustrates the concept of multicore with four cores in a single die. Figure. 2.12 illustrates the organization of SPARC64 VII, a popular quad-core CPU.

Figure. 2.11: A Quad-core CPU

Figure. 2.12: SPARC64 VII Processor

Chip Multiprocessing technology is an architecture in which multiple

physical cores are integrated on a single processor module. Each physical

core runs a single execution thread of a multithreaded application

independently from other cores at any given time. With this technology,

multi-core processors offer several times the performance of single-core

modules. The ability to process multiple instructions at each clock cycle

provides the performance advantage, but improvements also result from

the short distances and fast bus speeds between chips as compared to

traditional CPU to CPU communication in a multiprocessor system.

2.4 EMBEDDED PROCESSORS

Processors are the main functional units of an embedded system,

and are primarily responsible for processing instructions and data. An

embedded system contains at least one master processor, acting as the

central controlling device, and can have additional slave processors that

work with and are controlled by the master processor. These slave

processors may either extend the instruction set of the master processor or

act to manage buses and input/output (I/O) devices. The complexity of the

master processor usually determines whether it is classified as a

microprocessor or a microcontroller. Traditionally, microprocessors contain

a minimal set of integrated memory and I/O components, whereas the

microcontrollers have most of the system memory and I/O components

integrated on the chip. However, these traditional definitions are becoming

somewhat inaccurate in view of convergence taking place in recent

processor designs. There are literally hundreds of embedded processors

available and these can be grouped into various architectures [6]. What

differentiates one processor group's architecture from another is the set of

machine code instructions that the processors within the architecture group

can execute. Processors are considered to be of the same architecture

when they can execute the same set of machine code instructions. Table

2.6 lists some examples of real-world processors and the architecture

families they fall under. Table 2.7 lists the merits and demerits of the different types of processors that can be embedded in a complex embedded system [8].

Table 2.6: Typical Embedded Architectures and Processors

| Architecture | Processor | Manufacturer |
|---|---|---|
| AMD | Au1xx | Advanced Micro Devices |
| ARM | ARM7, ARM9, ... | ARM, ... |
| ColdFire | 5282, 5272, 5307, 5407, ... | Motorola/Freescale, ... |
| M32/R | 32170, 32180, 32182, 32192, ... | Renesas/Mitsubishi, ... |
| MIPS32 | R3K, R4K, 5K, 16, ... | MT14kx, IDT, MIPS Technologies, ... |
| NEC | Vr55xx, Vr54xx, Vr41xx | NEC Corporation, ... |
| PowerPC | 82xx, 74xx, 8xx, 7xx, 6xx, 5xx, 4xx | IBM, Motorola/Freescale, ... |
| SuperH (SH) | SH3, SH4 | Hitachi, ... |
| SHARC | SHARC | Analog Devices, Transtech DSP, Radstone, ... |
| strongARM | strongARM | Intel, ... |
| SPARC | UltraSPARC II | Sun Microsystems, ... |
| TMS320C6xxx | TMS320C6xxx | Texas Instruments |
| x86 | X86 [386, 486, Pentium (II, III, IV)...] | Intel, Transmeta, National Semiconductor, Atlas, ... |
| Tricore | Tricore1, Tricore2, ... | Infineon, ... |

Table 2.7: Processor types in Complex Embedded Systems

| Processor type | Application | Advantage | Disadvantage |
|---|---|---|---|
| General purpose microprocessor | When intensive computations are required and large embedded software is located in the external memory cores or chips | No engineering cost in designing the processor | Additional redundant execution units that are not needed in the given system design |
| Microcontroller | Used with internal memory, devices and peripherals and when embedded software is located in the internal ROM or flash memory | No engineering cost in designing the processor | Additional manufacturing costs and redundant application units which are not needed in the given system design |
| DSP | Used with signal processing-related instructions for filters, image, audio, and video and CODEC applications | No engineering cost involved in designing the signal processor | Manufacturing cost may be high |
| Single purpose processors and application specific system processor | Control I/O and bus operations and peripherals and devices | They support other processing units in the system and execute specific hardware processes fast | In-house engineering cost of development, royalty payments for an IP core of processor and time-to-market cost |
| Multicore processor | To significantly enhance the performance of the system | Reduced engineering cost | Increased manufacturing cost |
| Accelerator | To accelerate the execution of codes. A floating point coprocessor accelerates mathematical operations and a Java accelerator accelerates Java code execution. | Increases performance by co-processing with the main processor | Increased engineering cost of development or royalty payments for the IP core of processor |

2.5 EMBEDDED SYSTEM ARCHITECTURES

Embedded computer systems range from everyday machines (most microwaves and washing machines, printers, network switches, and automobiles) to handheld digital devices (such as PDAs, cell phones, and music players) to videogame consoles and digital set-top boxes. Except in some applications such as PDAs, in many embedded applications the only programming occurs at the developer's site, in connection with the initial loading of the application code or a later software upgrade of that application. Thus, the application is carefully tuned for the processor and system [3].

Embedded systems often process information in different ways from

general-purpose processors. Typically these applications include deadline-

driven constraints—so-called real-time constraints. In these applications, a

particular computation must be completed by a certain time limit failing

which the system will malfunction. A real-time performance requirement is

one where a segment of the application has an absolute maximum

execution time that is allowed. For example, in a digital set-top box the time

to process each video frame is limited, since the processor must accept

and process the frame before the next frame arrives (typically called hard

real-time systems). In some applications, a more liberal requirement exists:

the average time for a particular task is constrained as well as is the

number of instances when some maximum time is exceeded. Such

approaches (typically called soft real-time) arise when it is possible to

occasionally miss the time constraint on an event, as long as not too many

are missed. Real-time performance tends to be highly application

dependent.
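The difference between the two kinds of constraint can be expressed as a short Python sketch. The function names, the frame times, the 33.3 ms deadline, the 32 ms average limit and the tolerance of one missed deadline are hypothetical values chosen only for illustration.

```python
def meets_hard_deadline(times, deadline):
    """Hard real-time: every instance must finish within the deadline."""
    return all(t <= deadline for t in times)


def meets_soft_deadline(times, avg_limit, deadline, max_misses):
    """Soft real-time: the average time is bounded and only a limited
    number of instances may exceed the deadline."""
    if not times:
        return True
    misses = sum(1 for t in times if t > deadline)
    average = sum(times) / len(times)
    return average <= avg_limit and misses <= max_misses


# Hypothetical per-frame processing times in milliseconds.
frame_times_ms = [30.0, 31.5, 29.0, 35.0, 30.5]
hard_ok = meets_hard_deadline(frame_times_ms, deadline=33.3)
soft_ok = meets_soft_deadline(frame_times_ms, avg_limit=32.0,
                              deadline=33.3, max_misses=1)
```

The single 35 ms frame violates the hard constraint, but the same trace satisfies the soft constraint because the average stays below the limit and only one deadline is missed.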

Embedded system applications typically involve processing

information as signals that may be an image, a motion picture composed of

a series of images, a control sensor measurement, and so on. Signal

processing requires specific computation that many embedded processors

are optimized for.

Two other key characteristics exist in many embedded applications:

the need to minimize memory and the need to minimize power. The

importance of memory size translates to an emphasis on code size, since

data size is dictated by the application. Some architectures have special instruction set capabilities to reduce code size. Larger memories also mean

more power, and optimizing power is often critical in embedded

applications. Although the emphasis on low power is frequently driven by

the use of batteries, the need to use less expensive packaging (plastic

versus ceramic) and the absence of a fan for cooling also demand reduced

power consumption.

Often an application’s functional and performance requirements are

met by combining a custom hardware solution together with software

running on a standardized embedded processor core, which is designed to

interface to such special-purpose hardware. In practice, embedded

problems are usually solved by one of three approaches:

1. The designer uses a combined hardware/software solution that

includes some custom hardware and an embedded processor

core that is integrated with the custom hardware, often on the

same chip.

2. The designer uses custom software running on an off-the-shelf

embedded processor.

3. The designer uses a digital signal processor and custom

software for the processor.

Embedded systems are a very broad category of computing devices.

For example, the TI 320C55 DSP is a relatively “RISC-like” processor

designed for embedded applications, with very fine-tuned capabilities. On

the other end of the spectrum, the TI 320C64x is a very high-performance,

eight-issue VLIW processor for very demanding tasks. Media extensions

attempt to merge DSPs with some more general-purpose processing

abilities to make these processors usable for signal processing

applications. Hennessy and Patterson have examined [3] several case

studies, including the Sony PlayStation 2, digital cameras, and cell phones.

The PlayStation2 performs detailed three-dimensional graphics, whereas a

cell phone encodes and decodes signals according to elaborate

communication standards. But both have system architectures that are very

different from general-purpose desktop or server platforms. In general,

architectural decisions that seem practical for general-purpose applications,

such as multiple levels of caching or out-of-order superscalar execution,

are much less desirable in embedded applications. This is due to chip area,

cost, power, and real-time constraints. The programming model that these

systems present places more demands on both the programmer and the

compiler for extracting parallelism.

2.5.1 Digital Signal Processor

A digital signal processor (DSP) is a special-purpose processor optimized for executing digital signal processing algorithms [5]. Most of these algorithms, from time-domain filtering (e.g., infinite impulse response and finite impulse response filtering), to convolution, to transforms (e.g., fast Fourier transform, discrete cosine transform), to forward error correction (FEC) encodings, have the same operation as their kernel: a multiply-accumulate operation. Each of these transforms has a sum of products at its core. To accelerate this, DSPs typically feature special-purpose hardware to perform multiply-accumulate (MAC). A MAC instruction of “MAC A, B, C” has the semantics of “A = A + B * C.” In some situations, the performance of this operation is so critical that a DSP is selected for an application based solely upon its MAC operation throughput. DSPs often employ fixed-point arithmetic. In addition to MAC operations, DSPs often have operations to accelerate portions of communications algorithms. An important class of these algorithms revolves around encoding and decoding forward error correction codes, that is, codes in which extra information is added to the digital bit stream to guard against errors in transmission. At one end of the DSP spectrum is the TI 320C55 architecture, optimized for low-power embedded applications, with a seven-stage pipelined CPU.
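The role of the MAC operation can be seen in a minimal Python sketch of a finite impulse response filter using fixed-point arithmetic. The Q15 number format, the helper names and the sample values are assumptions made for illustration; the sketch does not model the saturation or accumulator width of any particular DSP.

```python
def mac(acc, b, c):
    """Multiply-accumulate with the semantics A = A + B * C."""
    return acc + b * c


def to_q15(x):
    """Convert a value in [-1, 1) to a Q15 fixed-point integer."""
    return int(round(x * 32768))


def fir_q15(samples, coeffs):
    """FIR filter y[n] = sum over k of h[k] * x[n - k], on Q15 integers."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(coeffs):
            if n - k >= 0:
                acc = mac(acc, h, samples[n - k])   # one MAC per filter tap
        out.append(acc >> 15)   # rescale the Q30 sum of products to Q15
    return out


coeffs = [to_q15(0.5), to_q15(0.5)]                    # two-tap averaging filter
samples = [to_q15(0.25), to_q15(0.5), to_q15(0.25)]
output = fir_q15(samples, coeffs)                      # Q15 for 0.125, 0.375, 0.375
```

Each output sample costs one MAC per filter tap, which is why MAC throughput dominates the performance of such kernels.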

The source of input data to DSP is some form of digitized signal, like

a photo image captured by a digital camera, a voice packet going through a

network router, or an audio clip played by a digital keyboard. As with

microcontrollers, DSPs also tend to incorporate many peripherals that are

useful in signal processing on a single IC. For example, a DSP device may

contain a number of analog-to-digital and digital-to-analog converters,

pulse-width modulators, direct memory access controllers, timers, and

counters.

2.5.2 Media Extensions

Media extensions are a middle ground between DSPs and microcontrollers. These extensions add DSP-like capabilities to microcontroller architectures at relatively low cost. Because media processing is judged by human perception, the data for multimedia operations are often much narrower than the 64-bit data word of modern desktop and server processors. For example, floating-point operations for graphics are normally in single precision, not double precision, and often at a precision less than that specified by IEEE 754. Rather than waste the 64-bit arithmetic-logical units (ALUs) when operating on 32-bit, 16-bit, or even 8-bit integers, multimedia instructions can operate on several narrower data items at the same time. Thus, a partitioned add operation on 16-bit data with a 64-bit ALU would perform four 16-bit adds in a single clock cycle. The extra hardware required is only to prevent carries between the four 16-bit partitions of the ALU. For example, such instructions might be used for graphical operations on pixels [10]. These operations are commonly called single-instruction multiple data (SIMD) or vector instructions. Most graphics multimedia applications use 32-bit floating-point operations.
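The partitioned add can be imitated in software, which makes the carry-blocking idea concrete. The Python sketch below is an illustration only: it packs four 16-bit lanes into one 64-bit integer and adds them with wrap-around, using masks to stop carries at lane boundaries. The function names and test values are hypothetical, and real hardware does this in the ALU itself rather than with masks.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF
HIGH_BITS = 0x8000_8000_8000_8000   # most significant bit of each 16-bit lane
LOW_BITS = 0x7FFF_7FFF_7FFF_7FFF    # lower 15 bits of each 16-bit lane


def partitioned_add16(a, b):
    """Four independent wrap-around 16-bit adds on packed 64-bit words."""
    # Adding only the low 15 bits of each lane can never carry out of a lane.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # Restore each lane's top bit: carry-in XOR a's top bit XOR b's top bit.
    return (low_sum ^ ((a ^ b) & HIGH_BITS)) & MASK64


def pack16(lanes):
    """Pack four 16-bit values into one 64-bit word (lane 0 is lowest)."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFFFF) << (16 * i)
    return word


def unpack16(word):
    """Split a 64-bit word into its four 16-bit lanes."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]


a = pack16([1, 0xFFFF, 0x7FFF, 100])
b = pack16([2, 1, 1, 200])
lanes = unpack16(partitioned_add16(a, b))   # [3, 0, 0x8000, 300]
```

The second lane wraps from 0xFFFF to 0 without disturbing the third lane, which is exactly the behaviour obtained by blocking carries between partitions.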

2.5.3 Embedded Multiprocessors

In the embedded space, a number of special-purpose designs have used customized multiprocessors, including the Sony PlayStation 2 [7]. Many special-purpose embedded designs consist of a general-purpose programmable processor or DSP with special-purpose, finite-state machines that are used for stream-oriented I/O. In applications ranging from computer graphics and media processing to telecommunications, this style of special-purpose multiprocessor is becoming common. The inter-processor interactions in such designs are highly regimented and relatively simple, consisting primarily of a simple communication channel. However, because much of the design is committed to silicon, ensuring that the communication protocols among the input/output processors and the general-purpose processor are correct is a major challenge. As a recent trend, embedded multiprocessors are built from several general-purpose processors. These multiprocessors have been focused primarily on the high-end telecommunications and networking market, where scalability is critical. An example of such a design is the MXP processor designed by empowerTel Networks for use in voice-over-IP systems. The MXP processor consists of four main components:

1. An interface to serial voice streams, including support for

handling jitter

2. Support for fast packet routing and channel lookup

3. A complete Ethernet interface, including the MAC layer

4. Four MIPS32 R4000-class processors, each with its own cache

(a total of 48 KB or 12 KB per processor)

The MIPS processors are used to run the code responsible for

maintaining the voice-over-IP channels, including the assurance of quality

of service, echo cancellation, simple compression, and packet encoding.

Since the goal is to run as many independent voice streams as possible, a

multiprocessor is an ideal solution. Because of the small size of the MIPS cores, the entire chip takes only 13.5 M transistors. Future generations of

the chip are expected to handle more voice channels, as well as do more

sophisticated echo cancellation, voice activity detection, and more

sophisticated compression.

Multiprocessing is becoming widespread in the embedded

computing arena for two primary reasons. First, the issues of binary

software compatibility, which plague desktop and server systems, are less

relevant in the embedded space. Often software in an embedded

application is written from scratch for an application or significantly

modified. Second, the applications often have natural parallelism,

especially at the high end of the embedded space. Examples of this natural

parallelism abound in applications such as a set-top box, a network switch,

a cell phone or a game system. The lower barriers to use of thread-level

parallelism together with the greater sensitivity to die cost (and hence

efficient use of silicon) are leading to widespread adoption of

multiprocessing in the embedded space, as the application needs grow to

demand more performance.

Desktop computers and servers rely on the memory hierarchy to

reduce average access time to relatively static data, but there are

embedded applications where data are often a continuous stream. In such

applications there is still spatial locality, but temporal locality is much more

limited. The steady stream of graphics and audio demanded by electronic

games leads to a different approach to memory design. The style is high

bandwidth via many dedicated independent memories.

2.6 MIPS32 Vs OTHER RISC PROCESSORS

Although the modern version of the RISC design dates to the 1980s,

a number of systems of the 1960s and 1970s have been credited as the first RISC

architecture, partly based on their use of the load/store approach. For

example, the CDC 6600 designed by Seymour Cray in 1964 used a

load/store architecture with only two addressing modes (register+register,

and register+immediate constant) and 74 opcodes, with the basic clock

cycle/instruction issue rate being 10 times faster than the memory access

time [24,25].

The modern RISC revolution started with the projects at Stanford

University, the University of California, Berkeley, and IBM. Stanford's design

led to the successful MIPS architecture, while Berkeley's RISC project has

been commercialized as the SPARC. Another success from this era was

IBM's 801 that eventually led to the Power Architecture. As these projects

matured, a wide variety of similar designs flourished in the late 1980s and

early 1990s, representing a major force in the Unix workstation market as

well as embedded processors in laser printers, routers and similar

products. The Berkeley RISC project delivered the RISC-I processor in

1982. Compared with averages of about 100,000 transistors in newer CISC designs of

the era, the RISC-I, consisting of only 44,420 transistors, had only 32

instructions with three addressing modes, and yet completely outperformed

any other single-chip design. They followed this up with the

40,760-transistor, 39-instruction RISC-II in 1983, which ran over three times as fast

as RISC-I. In 1986, Hewlett Packard started using an early implementation

of their PA-RISC in some of their computers. In the meantime, the Berkeley

RISC effort had become so well known that it eventually became the name

for the entire concept and in 1987 Sun Microsystems began shipping

systems with the SPARC processor, directly based on the Berkeley RISC-II

system.

Well-known RISC families include DEC Alpha, AMD 29k, ARC,

ARM, Atmel AVR, Blackfin, Intel i860 and i960, MIPS, Motorola 88000, PA-

RISC, Power (including PowerPC), SuperH, and SPARC. In the 21st

century, the use of ARM architecture processors in smart phones and

tablet computers such as the iPad, Android, and Windows RT tablets

provided a wide user base for RISC-based systems. RISC processors are

also used in supercomputers such as the K computer, the fastest on the

TOP500 list in 2011, and Sequoia, the fastest on the 2012 list.

Over the years, RISC instruction sets have grown in size, and today

many of them have a larger set of instructions than many CISC CPUs.

Some RISC processors such as the PowerPC have instruction sets as

large as the CISC IBM System/370, for example; conversely, the DEC

PDP-8—clearly a CISC CPU because many of its instructions involve

multiple memory accesses—has only 8 basic instructions and a few

extended instructions. RISC architectures are now used across a wide

range of platforms, from cellular telephones and tablet computers to some

of the world's fastest supercomputers. As of 2014, a new research ISA, RISC-V, has

been under development at the University of California, Berkeley, emphasizing

features such as many-core and heterogeneous multiprocessing,

virtualisability, and dense instruction encoding.

2.6.1 CISC and RISC Convergence

State-of-the-art processor technology has changed significantly since

RISC chips were first introduced in the early 1980s. Because a number of

advancements are used by both RISC and CISC processors, the lines

between the two architectures have begun to blur. In fact, the two

architectures almost seem to have adopted the strategies of the other.

Since the processor speeds have increased, CISC chips are now able to

execute more than one instruction within a single clock. This also allows

CISC chips to make use of pipelining. With other technological

improvements, it is now possible to fit many more transistors on a single

chip. This gives RISC processors enough space to incorporate more

complicated, CISC-like commands. RISC chips also employ more

complicated hardware, using extra function units for superscalar

execution. All of these factors have led some groups to conclude that now

in the present "post-RISC" era, the two architectures have become so

similar that distinguishing between them is no longer relevant. However, it

should be noted that RISC chips still retain some important traits. RISC

chips strictly utilize uniform, single-cycle instructions. They also retain the

register-to-register, load/store architecture. And despite their extended

instruction sets, RISC chips still have a large number of general purpose

registers.

The question of whether ISA plays an intrinsic role in performance or

energy efficiency is becoming important [26]. The traditionally low-power

ARM ISA (a RISC) is entering the high-performance server market, while the

traditionally high-performance x86 ISA (a CISC) is entering the mobile low-

power device market.

The MIPS architecture grew out of a graduate course by John L.

Hennessy at Stanford University in 1981, resulted in a functioning system

in 1983, and could run simple programs by 1984. The MIPS approach

emphasized an aggressive clock cycle and the use of the pipeline, making

sure it could be run as "full" as possible. The MIPS system was followed by

the MIPS-X and in 1984 Hennessy and his colleagues formed MIPS

Computer Systems. The commercial venture resulted in the R2000

microprocessor in 1985, and was followed by the R3000 in 1988. The

company was purchased by Silicon Graphics, Inc. in 1992, and was spun

off as MIPS Technologies, Inc. in 1998. Subsequently, Imagination

Technologies bought the company.


2.7 MIPS32 INSTRUCTIONS AND CODE WASTAGE

RISC processors generally have three types of instructions: ALU,

Load or store, and Branch and Jump. Though RISC processors have

a limited number of addressing modes, there are variations among the

processors. The MIPS processor has only two addressing modes: immediate

and displacement, both with 16-bit fields [3].

Figure. 2.3 seen earlier in section 2.2.2 summarises the basic

formats of MIPS32 integer instructions [27] with examples. The length of

the fields in bits is indicated inside brackets. All the instructions are 32 bits long

and the most significant six bits contain the opcode. In the I-type and J-type

instructions, the opcode itself indicates the exact operation. In the R-type

instructions, the op field identifies the instruction type and the fn field (least

significant bits 0-5) indicates the exact operation. For example, the six-bit

pattern 000000 in op identifies all R-type instructions and the fn pattern

indicates the exact function, i.e., whether the instruction is add, and, sub, mul, div,

shift, etc. For the and instruction, fn is 0x24, whereas for the or

instruction, fn is 0x25. The R-type is for register-to-register operations.

The I-type is for data transfers, branches, and immediate operations. In

load/store type instructions, the offset field is added to the contents of the

rs register, usually an address, to form the effective address for one of the

operands, either the source or destination.
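The field layout described above can be illustrated with a short Python sketch that splits a 32-bit instruction word into its fields. The function name `decode_fields` and the dictionary keys are hypothetical names chosen for this illustration. The classification is deliberately simplified: every word whose op field is neither 000000 nor one of the j/jal opcodes is treated as I-type, which ignores coprocessor and other special encodings.

```python
def decode_fields(word):
    """Split a 32-bit MIPS32 instruction word into its format fields.

    Simplifying assumption: op 000000 is R-type, op 0x02/0x03 (j, jal) is
    J-type, and every other opcode is treated as I-type.
    """
    op = (word >> 26) & 0x3F          # most significant six bits
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    if op == 0x00:
        return {"type": "R", "op": op, "rs": rs, "rt": rt,
                "rd": (word >> 11) & 0x1F,
                "shamt": (word >> 6) & 0x1F,
                "fn": word & 0x3F}    # bits 0-5 select the exact operation
    if op in (0x02, 0x03):
        return {"type": "J", "op": op, "target": word & 0x03FFFFFF}
    return {"type": "I", "op": op, "rs": rs, "rt": rt, "imm": word & 0xFFFF}


and_fields = decode_fields(0x012A4024)   # and $t0, $t1, $t2
lw_fields = decode_fields(0x8FA80004)    # lw  $t0, 4($sp)
print(and_fields["type"], hex(and_fields["fn"]))                   # R 0x24
print(lw_fields["type"], hex(lw_fields["op"]), lw_fields["imm"])   # I 0x23 4
```

For the R-type word the operation is recovered from the fn field, whereas for the load the op field alone identifies the instruction.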

The branch instructions use a signed 16-bit offset field enabling

jump by $2^{15}-1$ instructions forward or $2^{15}$ instructions backward. In I-type

arithmetic instructions, the immediate field is sign-extended to 32-bits to

form one of the operands, and the other operand is available in the rs

register. In I-type logical instructions, the immediate field is zero-extended

to form the second operand and the rs register has the first operand. The

J-type is for jumps and the instruction address is identified by the 26-bit

target field. The actual instruction address is formed by shifting the

target field contents left by two bits, giving a 28-bit address that is combined with the

four most significant bits of the program counter. There are two more jump

instructions, jr and jalr, which follow a different format: they contain the

instruction address in the rs register and have no target field.
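The address arithmetic for branches and jumps can be sketched as follows. This is a minimal illustration that assumes the standard MIPS32 definitions: the branch offset and the jump target both count 32-bit instructions, and both are applied relative to the address of the instruction that follows the branch or jump. The function names are hypothetical.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    """Target of a branch at address pc with a signed 16-bit offset field."""
    # The offset is in instructions, so it is scaled by four bytes.
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def jump_target(pc, target26):
    """Target of a j/jal at address pc with a 26-bit target field."""
    # 26-bit field shifted left by two gives 28 bits; the top four bits
    # are taken from the address of the following instruction.
    return ((pc + 4) & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)


print(hex(branch_target(0x00400000, 3)))        # 0x400010
print(hex(branch_target(0x00400010, 0xFFFF)))   # 0x400010 (offset -1)
print(hex(jump_target(0x00400000, 0x0100004)))  # 0x400010
```

The second call shows a backward branch: an offset field of 0xFFFF is sign-extended to -1, which lands on the branch instruction itself.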

The drawbacks of RISC instruction formats due to the fixed instruction

size are as follows:

1. Several bits are unused in many instructions. Table 2.8 lists the

extent of unused bits in six integer instructions of MIPS32 ISA

since all instructions have to be 32 bits.

2. The R-type instructions use a total of 12 bits to specify the

operation, though there are at most 64 different R-type

operations in the MIPS32 ISA.

Table 2.8: Typical Wastage of Bits in MIPS32 Instructions

Instruction    Action                    No. of unused bits

rfe            Return from exception     19

syscall        System call               20

nop            No operation              20

addu           Addition                  5

mult           Multiply                  10

lui            Load upper immediate      5

3. In immediate type instructions such as addi, 16 bits are used

for specifying the immediate operand. In most cases, 8 bits are

sufficient for the immediate operand and the remaining 8 bits

become redundant. In branch instructions such as beq, the

offset field is underutilized in those cases where the offset

required can be specified with 8 bits.
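The third drawback can be checked on a list of instruction words with a sketch such as the one below. It is only an illustration and not the analysis tool described in chapter 3. The function names are hypothetical, all opcodes other than 000000 and j/jal are treated as I-type, and every 16-bit field is interpreted as a signed value, which is a simplification because logical immediates are zero-extended.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def fits_in_signed(value, bits):
    """True if value is representable as a signed number of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def short_field_share(words, bits=8):
    """Fraction of I-type words whose 16-bit field fits in `bits` signed bits."""
    i_type = [w for w in words if (w >> 26) & 0x3F not in (0x00, 0x02, 0x03)]
    if not i_type:
        return 0.0
    short = sum(fits_in_signed(sign_extend16(w), bits) for w in i_type)
    return short / len(i_type)


words = [
    0x21080001,   # addi $t0, $t0, 1
    0x2108FFFF,   # addi $t0, $t0, -1
    0x210803E8,   # addi $t0, $t0, 1000
    0x11000004,   # beq  $t0, $zero, +4 instructions
    0x012A4024,   # and  $t0, $t1, $t2 (R-type, ignored)
]
print(short_field_share(words))   # 0.75
```

In this small sample three of the four I-type words would need only an 8-bit field, so eight of their sixteen immediate or offset bits carry no information.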

The impact of these drawbacks on the code size has been quantified

in chapter 3 by analysing typical embedded object codes with the help of a

custom built tool. The outcome of this analysis has formed the basis for the

architectural modifications proposed in chapter 4 and chapter 5.

2.8 CODE SIZE REDUCTION IN EMBEDDED SYSTEMS

In embedded applications, every bit of code counts since it directly

affects both the program memory size, and the amount of bit traffic

between the program memory and the processor. Static code size is

directly proportional to cost in terms of program ROM size in embedded

systems. Dynamic code size has repercussions on instruction cache

effectiveness and hence on performance. Depending on the complexity of

the system, the code memory can account for more than 50% of the embedded product.

The instruction fetches take 5 to 15% of the execution time for a typical 32-

bit embedded RISC processor [7]. Since embedded systems are not user

programmable, several techniques are available to the developers, both at

compiler level and hardware level for compressing the original code

generated by the compiler. However, most solutions reduce performance.

Although this thesis favours redesigning existing RISC

processors, a review of the philosophy behind these code compression

techniques and the extent of code compression achieved is provided to

help appreciate the benefits of the proposed architectural solution.

Several techniques to reduce code size have been implemented

[28]. These are classified into three types [2]: Code compression, Compiler

techniques and Ad hoc ISA modification. The first two techniques retain the

original ISA whereas the third technique involves supporting a new

instruction set that is a subset of the original ISA. An overview of these

three techniques is given below.

2.8.1 Code Compression

Code compression, initially applied to single issue processors such

as CISC and RISC, is now used in VLIW processors also. The

compression methods [28] are based on traditional data compression

techniques including entropy encoding, such as Huffman encoding [29] and

arithmetic coding [30,31,32], dictionary-based compression [33], operand

factorization [34], and re-encoding the original RISC instructions, to name a

few. Code compression involves compressing the executable RISC object

code offline and storing the compressed code in code memory. The

decompression is done on-the-fly, for each instruction, during program

execution. The decompression unit is placed between the processor core

and memory either as post-cache (between the cache and the processor),

or as pre-cache (between the code memory and the cache) [35]. In the pre-

cache architecture, the code memory contains compressed code but the

instruction cache memory contains uncompressed code. Decompression

occurs whenever there is a cache miss and hence it is not time critical. In

the post-cache architecture, both code memory and instruction cache

contain compressed code. Decompression occurs during every instruction

fetch and hence it is in the critical path of the instruction pipeline.

The criterion to measure the efficiency of a code compression

scheme is compression ratio, which is defined as the ratio of the size of the

compressed program over the size of the original program. A large body of

knowledge is available on lossless compression [36] and hardware for low

power and high performance compression and decompression has been

proposed [37]. However, there are some distinctive requirements [38]. First,

it must be possible to decompress a program during execution, ensuring

random access, starting from several points inside the program, since

branch, jump, and call instructions can alter the program execution.

Second, compression and decompression algorithms can be highly

asymmetric because compression can be performed once and for all (offline)

when the executable is generated, while decompression is performed

during program execution; thus it should be fast and power efficient

because its hardware cost must be fully amortized by the corresponding

savings in memory size and power, without compromising performance.

The compression methods [28] result in either variable or fixed-width

instructions. Decompression is more complex with variable-width

instructions, as the width of an instruction is not known before

decompression. Normally, the code compression strategy does not require

any modification to the processor architecture. The instruction fetch unit

generates the next instruction address which will be normally the sum of

previous instruction address and the size of the previous instruction. On

encountering a branch, jump, or call instruction, the target address will be

calculated and the target instruction will be fetched from the memory or

cache. If the program memory contains the compressed code, a mapping

between the original address space and the compressed address space is

necessary. An alternative approach [33] requires a two-phase offline step

after compilation: first, the whole program is compressed; then, branch

offsets are patched to point into the compressed code. In this

approach, the processor needs to be modified to handle unaligned

(compressed) branch targets.

Wolfe and Chanin [30, 39] were the first to apply code compression

to embedded systems. Their scheme known as Compressed Code RISC

Processor (CCRP) uses Huffman coding to compress MIPS object codes,

and a Line Access Table (LAT) to map original program block addresses

to compressed code block addresses. The LAT is stored in program

memory. The code memory has compressed code and the code cache

holds the uncompressed code. Compression is done through a software

tool after linking, and the compressed program is placed into a special

memory area, identified by the linker as a compressed text segment that

also has a special section for decompression tables. A byte-based Huffman

coding algorithm was used with a cache line as the basic block to be

compressed. A TLB-like buffer called the Cache Line Address Lookaside Buffer

(CLB) is introduced to minimise LAT accesses and save time.

Decompression is relatively slow since Huffman codes are of variable length.
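The combination of byte-based Huffman coding and a line address table can be sketched in Python as follows. This is not the CCRP implementation: it only computes code lengths and the start offset of each compressed cache line, it assumes 32-byte lines padded to a byte boundary, it stores one plain byte offset per line, and it leaves out the size of the decoding table. The function names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data):
    """Return {byte value: code length in bits} for a Huffman code over data."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:                       # a lone symbol still needs one bit
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth in subtree})
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def build_lat(code, line_size=32):
    """Return (line offsets, compressed size in bytes) for the given code."""
    lengths = huffman_code_lengths(code)
    lat, offset = [], 0
    for start in range(0, len(code), line_size):
        lat.append(offset)                   # where this line's codes begin
        bits = sum(lengths[b] for b in code[start:start + line_size])
        offset += (bits + 7) // 8            # pad each line to a byte boundary
    return lat, offset


sample = bytes([0] * 48 + [1] * 8 + [2] * 4 + [3] * 4)   # two 32-byte lines
lat, size = build_lat(sample)
print(lat, size, size / len(sample))   # [0, 4] 11 0.171875
```

The table entry for a line gives the address at which decompression can start after a cache miss, which is what provides random access into the compressed code. The last printed value is the compression ratio as defined earlier (compressed size over original size), excluding table overheads.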

The CCRP method established the foundation for the IBM CodePack

compression technology for the PowerPC 400 series [40]. Compressed

code is stored in the external memory and CodePack is placed between

the memory and the cache as illustrated in Figure. 2.13. Decompression is

triggered by an instruction cache miss. The translation between the

compressed and uncompressed lines is held in the LAT. The 32-bit

PowerPC instructions are divided into two 16-bit parts and two Huffman

tables are used for each piece. The Huffman-like codewords are assigned

on a frequency distribution basis. Words are grouped in sets and words

belonging to the same set have been assigned codewords of the same

length. For each cache miss, CodePack fetches and decompresses two

cache blocks instead of only the one requested. This approach does not

involve compiler modification or processor design change. The original

work of Wolfe and Chanin achieves 30 to 50% compression ratio whereas

IBM CodePack technique gives compression ratio between 36% and 47%.

2.8.2 Dictionary-based Compression

Dictionary-based compression is another compression method

[38,28,41]. It is based on the property that the same instructions with the

same operands reappear in the embedded object code repeatedly. The

compression algorithm creates a dictionary of distinct instructions, and

replaces each instruction in the original program with the corresponding

index to the dictionary as illustrated in Figure. 2.14. Thus, the instructions

are substituted by 'codewords'.

Figure. 2.13: IBM CodePack Code Compression for PowerPC

As the codeword is smaller than the original instruction, the size of

the code is reduced. During program execution, the codeword (dictionary

index), fetched from the program memory, is used to fetch the original

uncompressed instruction from the dictionary. Figure. 2.15 illustrates the

decompression operation of the dictionary method of compression. Given a

program with N unique instructions, the length of the codeword is $\lceil \log_2 N \rceil$

bits.
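A minimal Python sketch of the dictionary scheme described above is given below, operating on whole 32-bit instruction words. The function names are hypothetical, the codeword width is taken as the ceiling of log2 N with a minimum of one bit, and the dictionary itself is counted at 32 bits per entry when the sizes are compared.

```python
def compress(words):
    """Replace each instruction word by its index in a dictionary."""
    dictionary = list(dict.fromkeys(words))       # unique words, first-seen order
    index = {w: i for i, w in enumerate(dictionary)}
    codewords = [index[w] for w in words]
    # ceil(log2 N) computed with integer arithmetic; at least one bit
    width = max(1, (len(dictionary) - 1).bit_length()) if dictionary else 0
    return dictionary, codewords, width


def decompress(dictionary, codewords):
    """Recover the original words by a table lookup per codeword."""
    return [dictionary[c] for c in codewords]


program = [0x8FA80004, 0x21080001, 0x8FA80004,
           0x012A4024, 0x21080001, 0x8FA80004]
dictionary, codewords, width = compress(program)
original_bits = 32 * len(program)
compressed_bits = width * len(codewords) + 32 * len(dictionary)
print(codewords, width)                  # [0, 1, 0, 2, 1, 0] 2
print(original_bits, compressed_bits)    # 192 108
```

Decompression is a single indexed read, which is why the scheme is fast. The saving depends on how often instructions repeat, since every unique instruction still occupies a full dictionary entry.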

Figure. 2.14: Dictionary based compression

Figure. 2.15: Decompression procedure for the dictionary based

compression

The dictionary is usually implemented in ROM in the control path of

the processor. Dictionary-based compression is a simple scheme offering

fast decompression. The decompressor is actually a simple table; it can be

integrated with the instruction decoder into a single pipeline stage. Though

this scheme is a straightforward one, offering inexpensive address

translation and sizable reduction of memory fetch bandwidth (i.e., number

of bits transferred from code memory to execute a program), [7] argues that

'this approach is the least appealing for an embedded system'. On the

other hand, [39] establishes that the dictionary-based compression is

competitive with CodePack for static footprint compression, and achieves

superior results for bus traffic and energy reduction.

In expression-tree-based algorithms [42] for code compression

proposed by Guido et al., the encoded symbols are extracted from program

expression trees and dictionary-based decompression engines are

implemented.

2.8.3 Compiler Techniques

Modern embedded compilers are often more complex than general

purpose compilers. A traditional compiler mainly aims to optimize a one-

dimensional cost function represented by the number of cycles needed to

execute a program. On the other hand, for an embedded compiler, code

size and energy are equally important as the speed of execution. Certain

scalar optimizations by traditional compiler are relevant in embedded

systems also. For example, transformations such as dead code elimination,

common subexpression elimination, strength reduction, copy propagation,

and constant folding reduce code size and power consumption apart from

improving speed. However, certain ILP-oriented optimizations such as loop

unrolling, tail duplication, procedure inlining and cloning, speculation, and

global code motion offer better speed but may hurt code size and power

consumption [7]. Research on code compression has been very active in

the compiler community [11, 43] with the goal of finding compact program

representations. Pure software techniques [39], in which the compiler reduces

program size and instructions are decompressed during execution, have been

popular in the embedded community. Compiler techniques for code

compression for RISC architectures by Cooper and McIntosh [44] map

isomorphic instruction sequences into abstract routine calls or

cross-jumping. A profile-guided code compression scheme that applies Huffman coding to

infrequently executed functions has been suggested by Debray and Evans

[45], [46]. A control flow graph centric software approach to reduce memory

space consumption has been proposed by Ozturk et al. [47]. Their approach

involves on-the-fly compression/decompression of object codes of

embedded applications. A flexible decompressor approach, applicable to

multiple platforms, was proposed by Shogan and Childers [48] with their

implementation of IBM's CodePack algorithms within the fetch step of

Software Dynamic Translator (SDT) in pure software infrastructure. Thus

compiler techniques for code compression involve register renaming,

interprocedural optimization, and procedural abstraction of repeated code

fragments. Procedural abstraction is a program optimization technique

that replaces repeated sequences of common code with calls to a single

procedure. The above compiler techniques are attractive since they have

no runtime decompression overheads, do not require any hardware change

and the code generated can be directly executed by the processor.

However, there is a need to modify the software tools such as compilers

and linkers.
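The idea behind procedural abstraction can be illustrated with a small Python sketch that looks for repeated instruction sequences and estimates the saving. It is a simplified model and not the algorithm of any of the cited works: instructions are compared as plain text, each call is assumed to cost one instruction and the abstracted procedure one extra return instruction, and delay slots and register conflicts are ignored. The function names are hypothetical.

```python
from collections import defaultdict


def repeated_sequences(instrs, length):
    """Map each repeated run of `length` instructions to its start indices."""
    positions = defaultdict(list)
    for i in range(len(instrs) - length + 1):
        key = tuple(instrs[i:i + length])
        # keep only occurrences that do not overlap the previous one
        if not positions[key] or i >= positions[key][-1] + length:
            positions[key].append(i)
    return {k: v for k, v in positions.items() if len(v) > 1}


def abstraction_saving(length, occurrences):
    """Instructions saved by replacing every occurrence with a call."""
    before = length * occurrences
    after = occurrences + length + 1     # calls + one body + one return
    return before - after


body = ["lw $t0, 0($a0)", "add $t0, $t0, $a1", "sw $t0, 0($a0)"]
code = body + ["nop"] + body + ["sub $a1, $a1, $t1"] + body
found = repeated_sequences(code, 3)
for sequence, starts in found.items():
    print(starts, abstraction_saving(len(sequence), len(starts)))   # [0, 4, 8] 2
```

Under these assumptions, abstraction pays off only when the sequence is long enough or repeated often enough to outweigh the call and return instructions.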

2.8.4 Ad hoc ISA Modification

This approach customizes the existing RISC instruction set

architecture with narrow instructions supporting fewer operations, smaller

operand fields, and fewer registers. For example, the Thumb [49]

instruction set is a modification of the original ARM instruction set (32-bit

instructions). It has 36 different 16-bit instructions which form a subset of

ARM instructions. Similarly in MIPS16, a subset of 32-bit MIPS instructions

is mapped to 16-bit MIPS instructions, which can be translated in real time

into 32-bit MIPS instructions. This approach involves a considerable effort

to design the new instruction set and requires a new instruction decoder, a

new set of software development tools, such as compiler, assembler, and

linker. A code saving of up to 40% has been reported. However, the dense

instruction sets often cause performance penalties [39] due to lack of

instructions. Also, the processor hardware needs additional logic for

decoder/decompression to support both ISAs. Both ARM and MIPS have

responded to the first criticism by introducing the Thumb2 and microMIPS

ISAs, which support two instruction sizes:

16-bit and 32-bit. Although the performance degradation has been addressed

to a certain extent, the processors still have additional

decoder/converter logic to detect the 16-bit instructions and convert them

into 32-bit instructions.

There have been attempts to develop tiny RISC processors [50].

The DMN-6 has 16 eight-bit registers, executes just 12 instructions and has

no cache memory. Known as the Minimal RISC processor, it is meant

exclusively for use in toys.

2.9 ISA LEVEL CODE SIZE REDUCTION

Instruction set architects have broadly used two techniques to

reduce the relative energy cost of instruction stream delivery. One

approach is to increase the amount of work performed by a single

instruction. Vector machines, for example, reduce instruction bandwidth

demands by expressing a large amount of SIMD parallelism in a single

instruction [9]. CISC machines do so by combining multiple simple

operations into a single instruction and providing more addressing modes.

An alternate approach is to reduce the size of the instructions. CISC

instruction sets generally have been composed of variable-length

instructions: the simple and more common ones are usually encoded in

fewer bits than those that require more operands or occur less frequently.

RISC ISAs initially sacrificed the code density advantages of variable-

length instruction encodings in favour of simple, fixed length 32-bit

encodings. Subsequently, RISC instruction set extensions have provided

fixed-length 16-bit encodings (as in ARM Thumb and MIPS16), although

often at the expense of performance and limited access to some hardware

features. Next-generation RISC ISAs (as in ARM Thumb2, microMIPS and

RISC-V) partly resolve these drawbacks by encoding the most common

instructions densely, while maintaining most or all of the functionality of the

32-bit ISA. However, these ISAs have not fully resolved the issue of code

density, since they continue to give priority to limiting pipeline design

complexity. Hence they have only two different instruction sizes: two bytes

and four bytes. These are nevertheless called variable instruction length ISAs,

which is a misnomer; hybrid instruction length is the proper

term. On the other hand, the hybrid length encoding proposed in this thesis

recommends a new ISA with four different sizes that reduces the average

length of instructions with the goal of minimizing code memory size. It also

improves energy per operation by reducing instruction fetch traffic.

Depending on the memory word size, in a stream of hybrid-length

instructions, some instructions will reside in more than one memory

word and will require more than one memory access to fetch the

instruction. Figure. 2.16 illustrates a memory map of a sequence of x86

instructions [11]. The digits indicate the instruction number in the stream.

The eight instructions in the stream require seven memory cycles, giving

0.875 memory cycles per instruction. For this example, the average

number of bytes per instruction is 3.375. Published statistics on the IBM

S360 show that this CISC architecture has approximately four bytes per

instruction [11].
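The two figures quoted for this example can be reproduced with a short Python sketch. The individual instruction lengths below are hypothetical, since the text gives only the totals; they are chosen to sum to 27 bytes for eight instructions. The sketch also assumes a four-byte memory word, instructions packed contiguously from a word boundary, and sequential fetching in which each memory word is read once.

```python
def fetch_statistics(lengths, word_size=4):
    """Return (memory cycles per instruction, bytes per instruction)."""
    total_bytes = sum(lengths)
    words = -(-total_bytes // word_size)      # ceiling division
    return words / len(lengths), total_bytes / len(lengths)


def straddling(lengths, word_size=4):
    """Indices of instructions that cross a memory word boundary."""
    crossing, address = [], 0
    for i, n in enumerate(lengths):
        if address // word_size != (address + n - 1) // word_size:
            crossing.append(i)
        address += n
    return crossing


lengths = [2, 3, 5, 4, 1, 6, 3, 3]    # hypothetical byte lengths, sum = 27
print(fetch_statistics(lengths))      # (0.875, 3.375)
print(straddling(lengths))            # [1, 2, 3, 5]
```

With these lengths four of the eight instructions span two or more memory words, which is the situation described above for hybrid-length instruction streams.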

2.10 CONCLUSIONS

This chapter provides an overview of various attributes of ISA and

different types of embedded processors. The cause for the increased code

size of embedded processors is illustrated with the example of MIPS32

ISA. Different techniques for code size reduction in embedded systems

have been briefly seen in this chapter.

Figure. 2.16: Memory map of variable instruction stream

The next chapter analyses the behaviour of embedded object codes

of MIPS32, and Chapter 4 discusses two different techniques of hybrid

instruction encoding for MIPS32 processor to minimise the code size.

3. BEHAVIOUR OF EMBEDDED CODES FOR RISC

The embedded domain has a wide range of applications from sensor

systems to smart cellular phones. In many cases in the embedded domain,

it is difficult to isolate the software of an embedded system from the system

itself. Unlike the SPEC [3] for the general-purpose domain, there is no

dominant benchmark suite for the embedded domain. However, certain

industrial and academic benchmark packages [7] are available for the

embedded domain. MediaBench [51], MiBench [52], Berkeley Design

Technology, Inc. (BDTI), and Embedded Microprocessor Benchmark

Consortium (EEMBC) [43, 53] are four popular benchmark suites that are

commonly used by the embedded community. Whereas the MiBench and

MediaBench are academic packages containing sets of publicly available

programs that cover several embedded applications, the other two are

commercial suites. The BDTI contains DSP benchmarks written in

assembly language, and is very specific for simple DSPs and has limited

applicability outside this domain.

The EEMBC contains several sub-domains of benchmarks, including

automotive, imaging, consumer, and telecommunications sections. This

research work has identified 23 embedded applications, to cover the entire

spectrum of BOPES, from two representative sets of embedded

benchmarks, MiBench and MediaBench. These applications are cross

compiled for the MIPS32 processor prior to static program analysis. Static

program analysis is the analysis of a computer program without actually

executing the program (analysis performed on executing programs is

known as dynamic analysis). The analysis is usually performed by an

automated tool either on the source code, or on the object code. Owing to

the need for flexibility, a new stand-alone tool was developed,

as part of the research work, for analysing the MIPS32 object codes. This

also enabled incorporation of additional features at later stages.

This chapter provides an analysis of the object codes of 23

embedded benchmarks from MiBench and MediaBench to understand the

behavior of embedded applications and determine the strategy for

minimising the code size. Initially, a description of the two benchmark

suites is provided. This is followed by a discussion on the behavior of MIPS

object codes of the embedded benchmarks, using MIDA, the custom built

code analyzer for MIPS32 object codes. Apart from measuring the static

instruction frequencies, this tool estimates the extent of underutilization of

the offset and immediate fields in the object codes.
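The notion of static instruction frequency can be illustrated with a few lines of Python. The sketch below is not MIDA and makes no claim about how that tool works; it simply counts how often each operation appears in a list of 32-bit instruction words, keyed by the fn field for R-type words (op 000000) and by the op field otherwise. The function name and key labels are hypothetical.

```python
from collections import Counter


def static_frequency(words):
    """Count occurrences of each operation in a list of instruction words."""
    counts = Counter()
    for word in words:
        op = (word >> 26) & 0x3F
        if op == 0x00:
            counts[("fn", word & 0x3F)] += 1   # R-type: fn selects the operation
        else:
            counts[("op", op)] += 1            # I-type and J-type: op suffices
    return counts


words = [
    0x012A4024,   # and  $t0, $t1, $t2
    0x8FA80004,   # lw   $t0, 4($sp)
    0x21080001,   # addi $t0, $t0, 1
    0x8FA80004,   # lw   $t0, 4($sp)
    0x012A4025,   # or   $t0, $t1, $t2
]
for key, count in static_frequency(words).most_common():
    print(key, count)
```

The count is static because every word in the object code is counted once, regardless of how many times it would be executed at run time.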

3.1 MIBENCH BENCHMARKS

MiBench is a set of benchmark programs written in C, covering six
categories of embedded applications: Automotive and Industrial Control,
Consumer Devices, Office Automation, Networking, Security and
Telecommunications. Table 3.1 lists the MiBench programs used for
evaluating the HIE for MIPS32. For certain applications, there are two
versions: a small data set version and a large data set version. The small
data set represents a lightweight, useful embedded application of the
benchmark, while the large data set provides a more stressful, real-world
application.

The Automotive and Industrial Control category is representative of
embedded control systems. Typical applications are air bag controllers,
engine performance monitors and sensor systems. These benchmarks perform
mathematical calculations, bit counting, sorting and image recognition.

Typical examples of consumer devices are scanners, digital cameras and
Personal Digital Assistants (PDAs). The benchmarks consist mainly of
multimedia applications, with representative algorithms for JPEG
encoding/decoding, image colour format conversion, image dithering, colour
palette reduction, MP3 encoding/decoding and HTML typesetting. Most of the
algorithms are taken from the SGI TIFF utilities.

Typical examples of network devices are switches and routers. The work
done by the embedded processors in these devices involves shortest path
calculations, tree and table lookups and data input/output. The algorithms
used in these benchmarks find a shortest path in a graph and create and
search a Patricia trie data structure.

The Telecommunications benchmarks have algorithms for voice
encoding/decoding, frequency analysis and checksum calculation. With the
popularity of the internet, the trend is to integrate wireless
communication into many portable consumer devices.

The Office applications are primarily text manipulation algorithms.
Typical examples of office automation devices are printers, fax machines
and word processors. PDAs, though grouped under consumer devices, involve
heavy manipulation of text for data organization.

The Security benchmarks have algorithms for data encryption, decryption
and hashing. Some benchmarks are common to the Network, Security and
Telecommunications classes.

Table 3.1: MiBench Benchmarks

Auto/Industrial Domain

Program         Functions

basicmath       Simple mathematical calculations such as cubic function
                solving, integer square root and angle conversions from
                degrees to radians; these are needed for calculating
                road speed or other vector values.

bitcount        Tests the bit manipulation abilities of a processor by
                counting the number of bits in an array of integers; five
                methods are used by this program.

qsort           Sorts a large array of strings into ascending order using
                the quicksort algorithm.

susan           An image recognition package for recognizing corners
                and edges, typically used for a vision-based quality
                assurance application. It can smooth an image and has
                adjustments for threshold, brightness and spatial
                control.

Consumer Domain

Program         Functions

jpeg            An algorithm for image compression and
                decompression; commonly used to view images
                embedded in documents. JPEG is a standard, lossy
                compression image format.

lame            An MP3 encoder that supports constant, average and
                variable bit-rate encoding.

typeset         A general typesetting tool with a front-end processor for
                HTML; representative of a core component of a web
                browser that might be used in a consumer device. It
                captures the processing required to typeset an HTML
                document, without any rendering overheads.

Office Domain

Program         Functions

stringsearch    Searches for given words in phrases using a
                case-insensitive comparison algorithm.

ispell          A fast spelling checker supporting contextual spell
                checking, correction suggestions and languages other
                than English; it is similar to Unix spell, but faster.

rsynth          A text-to-speech synthesis program. It integrates several
                pieces of public domain code into a single program.

Network Domain

Program         Functions

dijkstra        Constructs a large graph in an adjacency matrix
                representation and then calculates the shortest path
                between every pair of nodes using repeated applications
                of Dijkstra's algorithm, a well-known solution to the
                shortest path problem.

patricia        Creates and searches a Patricia trie, a data structure
                used in place of full trees with very sparse leaf nodes.
                Branches with only a single leaf are collapsed upwards
                in the trie to reduce traversal time at the expense of
                code complexity. Patricia tries are commonly used in
                network applications to represent routing tables.

CRC32           Same as CRC32 in Telecom

sha             Same as sha in Security

blowfish        Same as blowfish in Security

Security Domain

Program         Functions

Blowfish        Blowfish is a symmetric block cipher with a variable
encrypt/        length key. Since its key length can range from 32 to
decrypt         448 bits, it is ideal for domestic and exportable
                encryption.

sha             A secure hash algorithm that produces a 160-bit
                message digest for a given input; used in the secure
                exchange of cryptographic keys and for generating
                digital signatures. Its design is closely related to that
                of the well-known MD4 and MD5 hashing functions.

Rijndael        A block cipher with the option of 128-, 192- and 256-bit
encrypt/        keys and blocks.
decrypt

Telecommunications Domain

Program         Functions

CRC32           Performs a 32-bit Cyclic Redundancy Check (CRC) on
                a file. Useful to detect errors in data transmission.

FFT             Performs a Fast Fourier Transform and its inverse
                transform on an array of data. Fourier transforms are
                useful in digital signal processing to find the frequencies
                contained in a given input signal.

ADPCM           Adaptive Differential Pulse Code Modulation; takes
encode/         16-bit linear PCM samples and converts them into 4-bit
decode          samples, yielding a compression rate of 4:1. ADPCM is
                a variation of the well-known standard Pulse Code
                Modulation (PCM).

GSM encode/     Global System for Mobile communications; a standard
decode          for voice encoding/decoding. It uses a combination of
                Time- and Frequency-Division Multiple Access
                (TDMA/FDMA) to encode/decode data streams.

3.2 MEDIABENCH BENCHMARKS

The MediaBench suite is composed of multimedia applications drawn
from image processing, communications and DSP. Introduced in 1997,
MediaBench 1 was designed to be representative of the workload of
emerging multimedia and communications systems. It includes applications
written in C, ranging from image and video coding to audio and speech
processing, and even encryption and computer graphics.

The original MediaBench suite had 11 application packages
covering six media areas: video, image, graphics, audio, speech and
security. Many of these applications are unoptimised versions derived from
open-source programs that were not designed for the embedded domain.
The video benchmark, MPEG-2, characterizes the encoding and decoding of
video sequences. The audio area is covered by ADPCM, for encoding and
decoding audio streams. The image media type is characterized by three
applications: JPEG, EPIC and Ghostscript. The first two are for coding
standard colour images and Ghostscript is for PostScript transcoding.

The speech area has three applications: GSM, G.721 and Rasta.
The first two are for encoding speech and the third is a speech
recognition application. Security is covered by two applications, PGP and
pegwit, for encrypting and decrypting messages. Computer graphics is
covered by Mesa, a set of computer graphics libraries similar to OpenGL,
which includes three demo programs as the benchmarks for graphics.

MediaBench 2 is an upgrade of the MediaBench suite with some new
applications. These applications are needed when the performance of a
processor or the dynamic behaviour of an application has to be evaluated.
Since neither is an objective of this research work, the benchmarks of
MediaBench 1 meet the requirement. A brief description of the selected
applications in the MediaBench suite is given in Table 3.2.

Table 3.2: MediaBench Benchmarks

Program   Functions

JPEG      JPEG (pronounced "jay-peg") is a standardized compression
          method for full-colour and gray-scale images. This package
          contains C software to implement JPEG image compression
          and decompression. JPEG is lossy, meaning that the output
          image is not exactly identical to the input image. Two
          applications are derived from the JPEG source code: cjpeg
          does image compression and djpeg does decompression based
          on the ISO JPEG standard for image compression. The source
          code is produced by the Independent JPEG Group. JPEG is
          intended for compressing "real-world" scenes; line drawings,
          cartoons and other non-realistic images are not its strong suit.

MPEG      A dominant standard for high quality digital video transmission.
          The important computing kernel is a discrete cosine transform
          for coding and the inverse transform for decoding. The two
          applications used are mpeg2enc and mpeg2dec for encoding
          and decoding respectively. mpeg2play is a player for MPEG-1
          and MPEG-2 video bit streams. It is based on mpeg2decode by
          the MPEG Software Simulation Group. In mpeg2decode, the
          emphasis is on correct implementation of the MPEG standard
          and comprehensive code structure. The latter is not always
          easy to combine with high execution speed. Therefore a version
          has been derived which is optimized for higher decoding and
          display speed at the cost of a less straightforward
          implementation and slightly non-compliant decoding. In
          addition, all conformance checks and some fault recovery
          procedures have been omitted from mpeg2play.

GSM       An implementation of the European GSM 06.10 provisional
          standard for full-rate speech transcoding, prI-ETS 300 036,
          which uses RPE/LTP (residual pulse excitation/long term
          prediction) coding at 13 kbit/s. GSM 06.10 compresses frames
          of 160 13-bit samples (8 kHz sampling rate, i.e. a frame rate of
          50 Hz) into 260 bits; for compatibility with typical UNIX
          applications, this implementation turns frames of 160 16-bit
          linear samples into 33-byte frames (1650 Bytes/s). The quality
          of the algorithm is good enough for reliable speaker recognition;
          even music often survives transcoding in recognizable form
          (given the bandwidth limitations of 8 kHz sampling rate).

G.721     The files in this package comprise ANSI-C language reference
          implementations of the CCITT (International Telegraph and
          Telephone Consultative Committee) G.711, G.721 and G.723
          voice compressions. They have been tested on Sun
          SPARCstations and passed 82 out of 84 test vectors published
          by CCITT (Dec. 20, 1988) for G.721 and G.723. [The two
          remaining test vectors, which the G.721 decoder
          implementation for u-law samples did not pass, may be in error
          because they are identical to two other vectors for G.723_40.]
          This source code is released by Sun Microsystems, Inc. to the
          public domain.

PEGWIT    Pegwit is a program for performing public key encryption and
          authentication. It uses an elliptic curve over GF(2^255), SHA1
          for hashing and the symmetric block cipher Square.

EPIC      EPIC (Efficient Pyramid Image Coder) is an experimental image
          data compression utility written in the C programming language.
          The compression algorithms are based on a biorthogonal
          critically-sampled dyadic wavelet decomposition and a
          combined run-length/Huffman entropy coder. The filters have
          been designed to allow extremely fast decoding on conventional
          (i.e., non-floating point) hardware, at the expense of slower
          encoding and a slight degradation in compression quality (as
          compared to a good orthogonal wavelet decomposition).

ADPCM     Adaptive Differential Pulse Code Modulation (ADPCM) is one of
          the simplest and oldest forms of audio coding. It is a family of
          speech compression and decompression algorithms. A
          common implementation takes 16-bit linear PCM samples and
          converts them to 4-bit samples, yielding a compression rate of
          4:1. The ADPCM code used is the Intel/DVI ADPCM code
          which is being recommended by the IMA Digital Audio
          Technical Working Group. But this is NOT a CCITT G722 coder.
          The CCITT ADPCM standard is much more complicated,
          probably resulting in better quality sound but also in much more
          computational overhead.

3.3 MIMEDIA BENCHMARK SUITE

In order to explore the strengths and weaknesses of the MIPS32 ISA
for embedded applications, a composite benchmark package named
MiMedia has been created with selected benchmarks from the MiBench and
MediaBench suites. Certain benchmarks, such as jpeg and gsm, are
common to both suites. Certain other benchmarks, such as mad, sphinx,
PGP, Ghostscript, Rasta and Mesa, have been dropped due to errors
encountered during the downloading or cross-compilation process.

Since the goal of the research work is to reduce the memory size
occupied by the programs and not to execute them, there is no
need to find the dynamic instruction distribution or execution time of the
benchmarks. For the same reason, all the subprograms of a benchmark
application can be combined into a single package. Hence a
composite suite has been created from the benchmarks of MiBench and
MediaBench, avoiding duplication but including a variety of applications.
However, it has been decided to drop the small data set versions and
include only the large data set versions. Table 3.3 presents the 23
applications that have been grouped under the MiMedia suite. These have
been mapped into eight domains: Automotive and Industrial Control,
Network, Video, Audio, Image, Speech, Security and Text. The susan
benchmark has been included in two domains: Automotive and Industrial
Control, and Image.

Table 3.3: Embedded Applications for MiMedia Suite

Embedded Domain   Application    Source       Programs                              Object code
                  Name           Benchmark                                          size (bytes)
                                 Suite

Automotive and    basicmath      MiBench      basicmath (large)                             4984
Industrial        bitcount       MiBench      bitcnts                                       4268
Control           Qsort          MiBench      Qsort (large)                                 1944
                  susan          MiBench      susan                                        51000

Network           dijkstra       MiBench      Dijkstra (large)                            463144
                  patricia       MiBench      Patricia                                    463744
                  CRC32          MiBench      CRC32                                       461500

Video             MPEG2          MediaBench   1. Mpeg2encode                             1115208
                                              2. Mpeg2decode

Audio             ADPCM          MediaBench   1. Rawcaudio: coder (encoder)              1384008
                                              2. Rawdcoder: decoder
                                              3. timing: test timer for both
                                                 coder and decoder
                  lame           MiBench      lame                                        223892

Image             JPEG           MediaBench   1. cjpeg: coder                             225744
                                              2. djpeg: decoder
                                              3. jpegtran: lossless transcoding
                                                 between different JPEG file
                                                 formats
                                              4. rdjpgcom: displays the text in
                                                 COM (comment) markers in a
                                                 JFIF file
                                              5. wrjpgcom: inserts user-supplied
                                                 text as a COM (comment)
                                                 marker in a JFIF file
                  EPIC           MediaBench   1. epic: compression                        972100
                                              2. unepic: decompression
                  fft            MiBench      fft                                         498640
                  (susan)        MiBench      (susan)                                      51000

Speech            GSM            MiBench      1. toast: encoder                          1019216
                                              2. untoast: decoder
                  G721           MediaBench   1. encode: voice encoder                    942232
                                              2. decode: voice decoder
                  rsynth         MiBench      say                                          26224

Security          pegwit         MediaBench   Pegwit hashing                              510632
                  sha            MiBench      sha                                           4160
                  blowfish       MiBench      bf                                          466604
                  rijndael       MiBench      rijndael                                    476464

Text              typeset        MiBench      lout                                        505252
                  stringsearch   MiBench      stringsearch                                462296
                  ispell         MiBench      ispell                                       48320

3.4 TYPICAL BEHAVIOUR OF EMBEDDED APPLICATIONS

The MiMedia benchmarks were cross-compiled on an Intel PC and the
compiler output was analysed using MIDACC, a custom-built offline code
analyser and converter tool suite. Executable binaries for MIPS processors
had to be generated with a cross compiler running under Linux on the Intel
platform. Of the many cross compilers available on the internet, Sourcery
CodeBench was chosen, as it has a 'Lite' version that is free for
developers and academic purposes. It consists of a set of tools such as a
compiler, a linker, object dumping tools and a library archiver. Even
though the GNU C Compiler (GCC) can be used for cross compilation, the
Sourcery CodeBench tool chain is specially built for embedded systems and
produces optimized object code and executables. This compiler can produce
object code and executables for all MIPS processors, covering the MIPS1
and MIPS2 ISAs, etc., as well as MIPS32 and MIPS64 for 32-bit and 64-bit
processors. It can also cross-compile C programs for other RISC processors
such as R1000 and mk4. The MIPS32 option has been chosen for this
research work.

The MIDACC suite has two tools: the MIDA, a MIPS code analyser

and the MICC, a code converter. The results from MICC are discussed in

Chapters 4 and 5. This chapter focuses on MIDA. Given a MIPS32 object

code, the MIDA profiles the code and produces various statistics for the

given application program as follows:

1. Object code size

2. Frequency of each instruction class

3. Frequency of the 66 integer instructions

4. Usage pattern for offset and immediate values

5. Number of bytes wasted in underutilized fields of offset and

immediate

6. Frequency of usage of branch instructions

7. Usage pattern for GPRs

8. Usage pattern for shift amount in shift instructions

9. Number of bytes wasted due to redundant zeroes in the

instructions
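The first three statistics in the list above amount to a static histogram over the instruction words of the object code. The following Python sketch shows one way such a profile could be computed; it is not MIDA's implementation. The function names, the small opcode tables (covering only a few of the 66 integer instructions) and the assumption that the input is a raw, big-endian text section are illustrative choices.

```python
from collections import Counter

# Primary opcode (bits 31..26) -> mnemonic for a few MIPS32 I-type instructions.
# Illustrative subset only; a full analyser would cover all integer instructions.
OPCODES = {0x23: "lw", 0x2B: "sw", 0x09: "addiu", 0x04: "beq",
           0x05: "bne", 0x24: "lbu", 0x28: "sb"}
# Function field (bits 5..0) -> mnemonic for SPECIAL (opcode 0) instructions.
FUNCTS = {0x21: "addu", 0x08: "jr", 0x00: "sll", 0x25: "or"}


def mnemonic(word):
    """Return the mnemonic of a 32-bit instruction word, or 'other'."""
    opcode = word >> 26
    if opcode == 0:
        return FUNCTS.get(word & 0x3F, "other")
    return OPCODES.get(opcode, "other")


def static_profile(text_bytes, byteorder="big"):
    """Count how often each mnemonic occurs in a raw text section."""
    usable = len(text_bytes) - len(text_bytes) % 4  # ignore a trailing partial word
    words = (int.from_bytes(text_bytes[i:i + 4], byteorder)
             for i in range(0, usable, 4))
    return Counter(mnemonic(w) for w in words)


# Hypothetical fragment: lw, sw, addiu, addu, jr, lw
sample = b"".join(w.to_bytes(4, "big") for w in
                  (0x8FA80004, 0xAFA80004, 0x27BDFFF8,
                   0x00851021, 0x03E00008, 0x8FA80004))
profile = static_profile(sample)
total = sum(profile.values())
print("object code size:", len(sample), "bytes")
for name, count in profile.most_common():
    print(f"{name:6s} {count:3d} {100.0 * count / total:5.1f}%")
```

Dividing each count by the total number of words gives the static frequency of an instruction; summing the counts over groups of mnemonics gives the frequency of each instruction class.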

Apart from the above nine aspects covered by MIDA, it was felt at a
later stage that additional analysis of the embedded codes should be
carried out to estimate the scope for introducing composite instructions
and eliminating avoidable duplicate information within certain
instructions. An extension to MIDACC was developed for this purpose. This
tool, named MIDACC Extender, also performs certain other functions, as
discussed in Chapter 6.

Appendix 1 describes the structure of the tool MIDA and Appendix 2

lists the sample outputs of MIDA for selected embedded applications.

Analysis of MIPS object codes using MIDA reveals several interesting
behaviours of embedded applications, as discussed below. Though
experiments have been carried out with all the subprograms of the
benchmarks, the discussion below omits certain unimportant and short
programs. Similarly, when two or more similar subprograms are included in
an application, only one of them is discussed.

1. None of the embedded applications uses all 66 integer instructions.
The number of unused instructions varies from 8 to 40. The following eight
instructions are not used by any application: ADD, ADDI, SUB, BGEZAL,
BLTZAL, BLTZ, MTHI and REF. Some benchmarks use fewer than half of the
instructions; for instance, qsort uses only 26. At the other extreme,
programs such as mpeg2 and fft use 58 instructions. Figure. 3.1 depicts
the number of used and unused instructions in the embedded codes of the
23 benchmarks. As a majority trend, eleven benchmarks use 55 or 56
instructions. Figure. 3.2 presents the typical distribution of utilized
and unutilized instructions in the eight embedded domains. The Video
segment uses the highest number of instructions and the Automotive
segment the lowest.

Figure. 3.1: Utilized and unutilised instructions in Embedded codes

Figure. 3.2: Distribution of utilized and unutilized instructions in Embedded domains

2. Every application uses only a limited set of instruction types.
Several instructions are used very sparingly, and any given program is
mostly made up of only 5 to 7 types of instructions. Five common
instructions that are used liberally by all 23 programs are LW, SW, ADDIU,
ADDU and BEQ. LW is the only instruction that accounts for more than 10%
of the instructions in every program. Some benchmarks use certain specific
instructions heavily due to the nature of their operations. For instance,
only susan uses LBU more than 5%. Likewise, only sha uses SB more than 10%
and qsort uses JR more than 5%. Figure. 3.3 illustrates how the 66 integer
instructions are populated in the 23 benchmarks. All three benchmarks in
the Network segment follow the same pattern. On the other hand, each of
the three benchmarks in the Text segment exhibits distinct behaviour. The
figures are nearly uniform for the three heavily used instruction groups,
whereas for the other two groups, 0% and 1%, the figures are scattered.
Figure. 3.4 presents the typical distribution of instruction density in
the eight embedded domains. All eight embedded segments are uniform when
it comes to the 5% and above cases.

Figure. 3.3: Frequency of integer instructions in Embedded codes

Figure. 3.4: Frequency of instructions usage in Embedded domains

3. A glance at the instruction counts in the 23 benchmarks gives
interesting information. In the majority of the benchmarks, the same set
of instructions is heavily used. Four instructions, addu, addiu, lw and
sw, dominate all the benchmarks, and together they form a major portion of
the code. These Frequently used Top Four Instructions (FTFI) account for
as much as 67% of the embedded codes. Seventeen benchmarks have an FTFI of
around 60%. Even the lowest figure is 42%. Figure. 3.5 shows the variation
of FTFI in the 23 benchmarks. Only two benchmarks, basicmath and sha, have
an FTFI below 50%. Figure. 3.6 shows the typical behaviour of FTFI for the
eight embedded segments using the geometric mean values of FTFI. The FTFI
is the lowest for the Security segment and the highest for the Image
segment of benchmarks. Applying the 80-20 rule, any technique that
improves the density of these four instructions will drastically reduce
the code sizes of embedded programs.
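As a minimal sketch of how the FTFI figure could be derived from a static instruction histogram, the Python fragment below takes a mapping from mnemonic to count and returns the combined share of addu, addiu, lw and sw. The function name and the sample counts are hypothetical and are not taken from MIDA or from the measured benchmarks.

```python
# The four instructions named in the text as the Frequently used Top Four.
FTFI_SET = ("addu", "addiu", "lw", "sw")


def ftfi_percent(counts):
    """Percentage of all instructions that belong to the FTFI set."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_four = sum(counts.get(name, 0) for name in FTFI_SET)
    return 100.0 * top_four / total


# Hypothetical histogram of 100 instructions, for illustration only.
example = {"lw": 30, "sw": 15, "addiu": 10, "addu": 5, "beq": 10, "other": 30}
print(f"FTFI = {ftfi_percent(example):.0f}%")
```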

Figure. 3.5: Population of FTFI in Embedded codes

Figure. 3.6: Distribution of FTFI in Embedded domains

4. Distribution of immediate values: The size of immediate values
affects instruction length. The majority of the immediate values are
positive, as reported by Hennessy and Patterson [3] for the SPEC
benchmarks. As per their study, small immediate values are heavily used
and large immediate values are used occasionally, mostly in address
calculations. Further, an 8-bit immediate can capture about 50% of the
cases and a 16-bit immediate about 80%. The experiments with embedded
benchmarks on the MIPS processor show interesting behaviour. The 16-bit
immediate field is heavily underutilized by embedded benchmarks, as shown
in Figure. 3.7. Except for two benchmarks, rsynth and typeset, the other
21 programs need the full 16 bits in less than 10% of the cases. Further,
most benchmarks need the full 16 bits in only 0% to 5% of the cases. The
benchmarks of the Automotive and Speech segments are at the two extreme
ends, but within a short range, as shown in Figure. 3.8.
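To make the notion of "needing the full 16 bits" concrete, the sketch below computes the smallest two's-complement width that can hold a sign-extended 16-bit field and reports the share of instruction words that need all 16 bits. This is only one plausible definition: how MIDA treats zero-extended immediates (as in logical instructions) and which instructions it counts are not stated here, so the function names and the signed interpretation are assumptions.

```python
def signed_imm(word):
    """Sign-extend the 16-bit immediate/offset field (bits 15..0)."""
    imm = word & 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def bits_needed(value):
    """Smallest two's-complement width (in bits) that can hold value."""
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1


def full16_percent(words):
    """Percentage of words whose field cannot fit in fewer than 16 bits."""
    if not words:
        return 0.0
    full = sum(1 for w in words if bits_needed(signed_imm(w)) == 16)
    return 100.0 * full / len(words)


# Hypothetical I-type words with immediates 4, -8, 0x7FFF and 0.
words = [0x8FA80004, 0x27BDFFF8, 0x24087FFF, 0x8FA80000]
for w in words:
    print(f"{w:#010x}: imm={signed_imm(w):6d} needs {bits_needed(signed_imm(w))} bits")
print(f"full 16-bit usage: {full16_percent(words):.0f}%")
```

The same `bits_needed` helper applies to the offset field of load and store instructions, where a result of 15 or less corresponds to the "15 bits suffice" case.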

Figure. 3.7: Usage of full 16 bit immediate by Embedded codes

Figure. 3.8: Trends in usage of 16 bit immediate in Embedded domains

5. As per Hennessy and Patterson's analysis [3] of the SPEC
benchmarks, displacement values are widely distributed. There are both a
large number of small values and a fair number of large values. The
factors contributing to the wide distribution of displacement values are
the multiple storage areas for variables and the different displacements
needed to access them, apart from the overall addressing scheme used by
the compiler. The analysis shows that embedded applications use the full
16 bits of the offset field very rarely, as shown in Figure. 3.9. In fact,
16 benchmarks never need more than 15 bits. Even the remaining programs
need 16 bits in at most 3% of the cases. The overall behaviour of the
embedded segments in this respect is shown in Figure. 3.10. The Network
and Speech segments are satisfied with 15 bits of offset. The worst case
is the Image segment, in which 2% of the cases use an offset of more than
15 bits.

Figure. 3.9: Extent of usage of full 16 bit offset by embedded codes

Figure. 3.10: Trends in usage of 16 bit offset field in Embedded domains

6. The extent of memory Wastage in Immediate and Offset fields
(WASTIO) in embedded object codes is significant. The underutilization of
these two instruction fields due to redundant 0's is classified into four
types, a, b, c and d, according to the four combinations of wastage in
the object code defined in Table 3.4. With a, b and c denoting the number
of fields of the respective types, WASTIO = 2a+b+c bytes, since a type a
field wastes two bytes while a type b or type c field wastes one. Programs
that have higher values of d and lower values of a, b and c cause less
wastage of memory due to redundant zeroes. The WASTIO percentage is
calculated using the formula WASTIO percentage = 100 x (WASTIO / object
code size). The wastage due to underutilization of the offset and
immediate fields varies from 8% to 16% of the code size for the embedded
applications, as shown in Figure. 3.11.

Table 3.4: Four types of offset / immediate byte patterns

Type Offset / immediate bytes

a All 16 bits are 0's

b One byte wastage due to all zeroes in the least significant byte

c One byte wastage due to all zeroes in the most significant byte

d No wastage; both bytes have non zero value
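The classification in Table 3.4 and the WASTIO formula can be sketched in a few lines of Python. The fragment below is an illustration under stated assumptions rather than MIDA's code: it assumes the 16-bit offset/immediate fields have already been extracted from the instructions that carry them, and the function names are hypothetical.

```python
from collections import Counter


def classify(field16):
    """Classify a 16-bit offset/immediate field as type a, b, c or d (Table 3.4)."""
    high, low = (field16 >> 8) & 0xFF, field16 & 0xFF
    if high == 0 and low == 0:
        return "a"          # all 16 bits are zero: two bytes wasted
    if low == 0:
        return "b"          # least significant byte is zero: one byte wasted
    if high == 0:
        return "c"          # most significant byte is zero: one byte wasted
    return "d"              # no wastage


def wastio(fields, code_size_bytes):
    """Return (wasted bytes, WASTIO percentage) using WASTIO = 2a + b + c."""
    counts = Counter(classify(f) for f in fields)
    wasted = 2 * counts["a"] + counts["b"] + counts["c"]
    return wasted, 100.0 * wasted / code_size_bytes


# Four hypothetical fields, one of each type, in a 16-byte code fragment.
fields = [0x0000, 0x0100, 0x0004, 0xFFF8]
wasted, percent = wastio(fields, code_size_bytes=16)
print(wasted, "bytes wasted,", percent, "% of code size")
```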

Twenty benchmarks have a WASTIO percentage of either 11 or 12.
Figure. 3.12 shows the typical behaviour of WASTIO for the eight embedded
segments using the geometric mean values of WASTIO. Four segments
have an equal WASTIO percentage, and the WASTIO percentage varies only
from 10 to 13 across the eight embedded segments. To give an idea of the
number of bytes wasted in the immediate and offset fields, Table 3.5
compares the largest programs in each application domain of MiMedia.
Though the video program mpeg2 is the largest embedded benchmark of
MiMedia, the number of bytes wasted in the offset and immediate fields is
higher for the Text benchmark, typeset. Figure. 3.13 depicts the
distribution of the WASTIO percentage among the three types a, b and c.
Except for the Text domain, the b component of WASTIO is zero for the
embedded domains.

Figure. 3.11: WASTIO Percentages in Embedded applications

Figure. 3.12: Extent of WASTIO in Embedded domains

Table 3.5: Trends in Embedded Applications: WASTIO components

Embedded Domain      Largest       Code size   WASTIO   Number of      a    b    c
                     Application   (bytes)     %        bytes wasted   %    %    %

Automotive and       susan             51000   11               5610   4    0    7
Industrial Control
Network              patricia         463744   12              55649   3    0    9
Video                MPEG2            578880   11              63677   2    0    9
Audio                ADPCM            460884   12              55306   3    0    9
Image                fft              498640   11              54850   2    0    9
Speech               GSM              509608   11              56057   2    0    9
Security             pegwit           510632   12              61276   3    0    9
Text                 typeset          505252   15              75788   3    1   11
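The columns of Table 3.5 are related by simple arithmetic: the number of bytes wasted is the code size multiplied by the WASTIO percentage, and the a, b and c percentages add up to the WASTIO percentage. The short Python check below reproduces this from the table values; the rounding to the nearest byte is an assumption about how the table was prepared.

```python
# (domain, application, code size in bytes, WASTIO %, bytes wasted, a %, b %, c %)
TABLE_3_5 = [
    ("Automotive and Industrial Control", "susan", 51000, 11, 5610, 4, 0, 7),
    ("Network", "patricia", 463744, 12, 55649, 3, 0, 9),
    ("Video", "MPEG2", 578880, 11, 63677, 2, 0, 9),
    ("Audio", "ADPCM", 460884, 12, 55306, 3, 0, 9),
    ("Image", "fft", 498640, 11, 54850, 2, 0, 9),
    ("Speech", "GSM", 509608, 11, 56057, 2, 0, 9),
    ("Security", "pegwit", 510632, 12, 61276, 3, 0, 9),
    ("Text", "typeset", 505252, 15, 75788, 3, 1, 11),
]


def bytes_wasted(code_size, wastio_percent):
    """Bytes wasted in offset/immediate fields, rounded to the nearest byte."""
    return round(code_size * wastio_percent / 100)


for domain, app, size, pct, wasted, a, b, c in TABLE_3_5:
    computed = bytes_wasted(size, pct)
    print(f"{app:10s} {size:7d} x {pct:2d}% = {computed:6d}  (a+b+c = {a + b + c}%)")
```

For example, for patricia the product 463744 x 12 / 100 = 55649.28 rounds to the 55649 bytes listed in the table.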

Figure. 3.13: WASTIO distribution in Embedded domains

7. Usage of general-purpose registers (GPRs): MIPS32 has thirty-two
32-bit GPRs, which are also known as integer registers. Reducing the
number of registers may pose problems for the compiler in register
allocation, thereby affecting the speed of the code. It has been reported
by Hennessy and Patterson [3] that at least 16 registers are essential for
the graph colouring technique used by register allocation algorithms. The
analysis of embedded codes establishes that having just 16 registers will
be inefficient. The frequency of usage of more than 16 registers by the
embedded codes is shown in Figure. 3.14. Only one benchmark, susan, needs
more than 16 registers in as few as 5% of the cases. The highest
requirement is that of JPEG, which needs more than 16 registers in 59% of
the cases.

Figure. 3.14: Usage of more than 16 registers by Embedded applications

The overall behaviour of the embedded segments in this respect is shown
in Figure. 3.15. The Audio, Speech and Security segments have the highest
requirements, at either 45% or 46%. The Automotive segment
has the lowest requirement, at 24%.

Figure. 3.15: Usage of more than 16 registers in Embedded domains

8. Shift amount: MIPS32 uses 5 bits for specifying the shift amount,
thereby allowing a shift of up to 31 bit positions in one operation. The
analysis of embedded codes reveals that in the majority of cases the shift
amount is less than or equal to 16 bits, as illustrated in Figure. 3.16.
However, about 13 benchmarks need shifts of more than 16 bits in 10% to
13% of the cases. While qsort does not need shifts of more than 16 bits at
all, bitcnts needs them in 36% of the cases. The frequency of usage of
shifts of more than 16 bits in the embedded domains is shown in
Figure. 3.17. The Security segment has the highest requirement, whereas
the Automotive and Image segments have very low requirements for shifts of
more than 16 bits.
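The shift-amount statistic can be illustrated with a short Python sketch that extracts the 5-bit shamt field (bits 10..6) from immediate-shift instructions and reports how often it exceeds 16. It is a sketch under assumptions: only SLL, SRL and SRA are considered, the all-zero word (the canonical NOP, which encodes as SLL) is skipped, and the function names are hypothetical rather than MIDA's.

```python
# Function field (bits 5..0) of the immediate-shift instructions under opcode 0.
# Note: in MIPS32 Release 2 the SRL encoding is shared with ROTR; this sketch
# does not distinguish the two.
SHIFT_FUNCTS = {0x00: "sll", 0x02: "srl", 0x03: "sra"}


def shift_amounts(words):
    """Yield the shamt field of every immediate-shift instruction word."""
    for w in words:
        if w != 0 and (w >> 26) == 0 and (w & 0x3F) in SHIFT_FUNCTS:
            yield (w >> 6) & 0x1F


def long_shift_percent(words, threshold=16):
    """Percentage of immediate shifts whose amount exceeds threshold."""
    amounts = list(shift_amounts(words))
    if not amounts:
        return 0.0
    return 100.0 * sum(1 for a in amounts if a > threshold) / len(amounts)


# Hypothetical words: sll by 2, srl by 24, sra by 31, sll by 16, an lw and a nop.
words = [0x00094080, 0x00094602, 0x000947C3, 0x00094400, 0x8FA80004, 0x00000000]
print(list(shift_amounts(words)))
print(f"shifts of more than 16 bits: {long_shift_percent(words):.0f}%")
```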

Figure. 3.16: Usage of more than 16 bit shifts in Embedded applications

Figure. 3.17: Frequency of more than 16 bit shifts in Embedded domains

9. The extent of redundant LOAD and STORE instructions is

estimated by RMA analysis of the code. Further discussion on this is

presented in Chapter 5.

10. The extent of branch instructions is estimated by the tool, MIDA.
Although this statistic is useful mainly for dynamic simulation, the
objective of obtaining it here is to get an idea of the code space
occupied by the branch instructions. The analysis of embedded codes
reveals that branch instructions together take up 3% to 11% of the code
space, as illustrated in Figure. 3.18. In thirteen benchmarks, the branch
instructions occupy 11% of the code. The segment-wise code space required
by branch instructions is shown in Figure. 3.19. The Automotive segment
has the lowest share of branch instructions, at 4%, whereas the Network
segment has the highest, at 11%.

Figure. 3.18: Extent of branch instructions in Embedded applications

Figure. 3.19: Usage of branch instructions in Embedded domains

11. The wastage of code space due to the presence of Redundant
Zeroes (RZ) in the instructions is estimated and found to vary from 5% to
13%, as shown in Table 3.6. Three benchmarks, lame, sha and lout
(typeset), have the lowest RZ of 5%, and qsort has the highest RZ of 13%
of the code. RZ and WASTIO together give an idea of the extent of unused
bits in the embedded codes. In addition, the ratio between Load/Store and
ALU type instructions, LSI/ALUI, is an important measure of code bloating
in embedded codes; this aspect is discussed in Chapter 5. The three
major parameters contributing to the code bloat factor (CBF) are WASTIO,
FTFI and RZ. Table 3.6 compares these parameters for the 23 benchmarks.
One may tend to conclude that the CBF is a measure of the extent of code
reduction possible. A discussion on this is presented in Chapter 4 along
with the analysis of the results for code size reduction.

Table 3.6: Code bloat factors for Embedded object codes for MIPS32

Embedded Domain      Benchmark      % RZ   % WASTIO   % FTFI

Automotive and       basicmath        12          9       46
Industrial Control   bitcount         12         14       59
                     Qsort            13         11       56
                     susan             8         17       67

Network              dijkstra          7         13       60
                     patricia          7         13       60
                     CRC32             7         13       60

Video                MPEG2             7         13       59

Audio                ADPCM             7         13       60
                     lame              5         12       57

Image                JPEG             11         14       58
                     EPIC              7         13       60
                     fft               7         13       59
                     (susan)           8         17       67

Speech               GSM               7         13       59
                     G721              7         13       60
                     rsynth           11         11       50

Security             pegwit            7         13       59
                     sha               5         17       42
                     blowfish          7         13       60
                     rijndael          7         13       60

Text                 typeset           5         16       63
                     stringsearch      7         13       60
                     ispell            7         12       53

In addition to the 11 aspects mentioned above, certain other code
analysis functions are provided by the MIDACC Extender, as mentioned
earlier. These are discussed in Chapter 6.

3.5 CONCLUSIONS

This chapter analyses the MIPS object codes of 23 embedded benchmarks
from MiBench and MediaBench using MIDA, the custom-built code analyzer
for MIPS32 object codes, in order to understand the behaviour of
embedded applications and determine the strategy for minimising the
code size.

As already mentioned, the MIDACC developed as part of the

research work acts as a standalone software tool for both MIPS32 code

analysis and for evaluating the new ISA for MIPS32 and measuring the

code size reduction. Since simulation of a new ISA is involved, it will be a

complex process if any existing simulator is to be used for this purpose as

extensive modifications will be required to conduct the type of code

analysis desired. The objective of this research is not execution of

embedded programs but only analysing the behaviour of embedded

applications and measuring static code sizes of HIE-MIPS for various

embedded applications and comparing with static code sizes of MIPS32.

Hence a decision was taken to develop an offline tool suite that can do

both code analysis and conversion of the object codes of MIPS32 into

object codes of the new ISA. The architecture of the tool and the strategies

followed in implementation are discussed in Appendix 1. Appendix 2

provides a user guide to MIDACC and sample results of MIDACC.

In the next chapter, two different design strategies for minimizing the

WASTIO and reducing unused zeroes in the opcodes are discussed. Both

the options are evaluated using the MIDACC.


4. HYBRID INSTRUCTION ENCODING FOR RISC CORES

This chapter proposes an ISA-level technique for reducing the average
instruction size so as to minimize the embedded object code generated
by the RISC compiler. This approach relieves embedded system developers
of the burden of incorporating external static code compression cum
dynamic decompression mechanisms in each new product, thereby saving on
product development cost and reducing the time-to-market. Achieving
this requires developing a new type of RISC processor as well as the
entire tool chain, but the investment is worthwhile in view of the
growing embedded processor market.

The Fixed Instruction Encoding (FIE) used in RISC processors helps

in simpler instruction decoding and easy pipeline design. But the FIE

increases the object code size as some fields are either unused or

underutilized in several instructions. Since all instructions have to be of

uniform length, many redundant zeroes are inserted in several instructions

to maintain the 32-bit instruction length. Further, huge wastage of memory

space occurs due to underutilization of the immediate and offset fields in

the instructions. This chapter proposes replacement of FIE with Hybrid

Instruction Encoding (HIE) with two modifications to RISC Architecture:

multiple instruction sizes, and hybrid lengths for the offset and immediate

fields. The provision for multiple instruction sizes minimizes unused fields in

most instructions thereby reducing code size. Similarly, allowing hybrid

lengths for the offset and immediate fields minimizes wastage of bits in

these fields.

This chapter deals with design of two different versions of HIE for

MIPS processor and estimating the code size reduction. The HIE1 version

limits the number of general purpose registers to 16 so as to enable

reducing the length of several instructions by one byte. On the other hand,

the HIE2 version reduces the maximum length of offset/immediate fields to


15-bits. To help estimate the code saving in the proposed architecture, both

the HIE versions have been designed as a modification to MIPS32 ISA. For

each of the 66 integer instructions of MIPS32, an equivalent HIE instruction

has been designed for both versions. This chapter discusses the designs of

both versions and the code size reduction achieved. Further, the

modifications required in the processor micro architecture to support the

HIE versions are reviewed.

4.1 MIPS ISA AND CODE WASTAGE

The early MIPS architectures were 32-bit, with 64-bit versions added

later. Multiple revisions of the MIPS instruction set exist, including MIPS I,

MIPS II, MIPS III, MIPS IV, MIPS V, MIPS32, and MIPS64. The current

versions are MIPS32 and MIPS64. MIPS32 supports only 32-bit data and

address whereas MIPS64 supports 64-bit data and address. There are also

two special versions, MIPS16 and microMIPS, targeting embedded

applications. The term MIPS is used liberally to mean MIPS32, the target

processor chosen for the research work. The term MIPS R2000 is used

when referring to specific features of the MIPS version.

4.1.1 MIPS Instruction Set

The instruction set of MIPS [11, 27] consists of a variety of basic

instructions such as

21 arithmetic instructions

8 logic instructions

8 bit manipulation instructions

12 comparison instructions

25 branch/jump instructions

15 load instructions

10 store instructions


8 move instructions

4 miscellaneous instructions

4.1.2 MIPS Instruction Format

The instruction formats of MIPS can be classified into three broad

categories as R-Type (Register), I-Type (Immediate) and J-Type (Jump) as

shown in Table 4.1. The R-type instructions perform ALU operations with

two register sources and one register destination address. The I-type

instructions perform load, store, and ALU operations with an immediate

operand.

Table 4.1: MIPS Instruction Formats

Format   Bits    Bits    Bits    Bits    Bits    Bits       Nature of operations
type     31-26   25-21   20-16   15-11   10-6    5-0
R        op      rs      rt      rd      sa      opx (fn)   arithmetic operations
I        op      rs      rt      offset / immediate         transfer, branch, immediate operations
J        op      target                                     jump operation

Table 4.2 defines the different fields in the instructions. The J-type
instructions perform unconditional branching to the target address.
There are also conditional branch instructions among the I-type
instructions; these use a signed 16-bit instruction offset field. Hence
they can jump 2^15 - 1 instructions (not bytes) forward or 2^15
instructions backwards.
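The reach of a conditional branch can be illustrated with a minimal Python sketch. It assumes the usual MIPS32 convention, which the text above does not spell out, that the offset is counted in instructions relative to the instruction following the branch. The function and constant names are illustrative only.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset_field):
    """Byte address reached by a taken branch located at byte address pc.

    Assumption: the offset counts 4-byte instructions and is applied to
    the address of the instruction after the branch (pc + 4).
    """
    return (pc + 4 + (sign_extend16(offset_field) << 2)) & 0xFFFFFFFF


# Limits of the signed 16-bit offset field, in instructions.
MAX_FORWARD = 2**15 - 1   # largest positive offset
MAX_BACKWARD = 2**15      # magnitude of the most negative offset
```

For instance, an offset field of 0xFFFF (that is, -1) on a branch at address 0x1000 leads back to 0x1000 itself under this assumption.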


Table 4.2: MIPS Instruction Fields

Field       Purpose
op          a 6-bit operation code
rs          a 5-bit source register specifier
rt          a 5-bit target (source/destination) register or branch condition
immediate   a 16-bit immediate, branch displacement or address displacement
target      a 26-bit jump target address
rd          a 5-bit destination register specifier
sa          a 5-bit shift amount
opx/fn      a 6-bit operation code extension (function) field
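The field positions of Tables 4.1 and 4.2 can be made concrete with a small Python sketch that slices a 32-bit instruction word. The function name and the dictionary layout are illustrative choices, not part of the thesis tooling.

```python
def decode_fields(word):
    """Split a 32-bit MIPS instruction word into the fields of Table 4.2.

    All fields are extracted unconditionally; which of them are
    meaningful depends on the format (R, I or J) of the instruction.
    """
    return {
        "op": (word >> 26) & 0x3F,      # bits 31-26
        "rs": (word >> 21) & 0x1F,      # bits 25-21
        "rt": (word >> 16) & 0x1F,      # bits 20-16
        "rd": (word >> 11) & 0x1F,      # bits 15-11
        "sa": (word >> 6) & 0x1F,       # bits 10-6
        "fn": word & 0x3F,              # bits 5-0
        "immediate": word & 0xFFFF,     # bits 15-0 (I format)
        "target": word & 0x3FFFFFF,     # bits 25-0 (J format)
    }


# An "and" instruction with rs = 1, rt = 2, rd = 3 (fn = 100100).
example = decode_fields(0x00221824)
```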

MIPS ISA uses a fixed instruction length of 32 bits with 6 bits allotted

to the opcode. This provides only 64 opcodes which is insufficient for the

number of desired instructions. To resolve this problem, MIPS supports

variable-length operation codes within a fixed-length instruction that uses

expanded opcodes. The most frequently used operations are directly

encoded in the 6-bit opcode, while a small set of the 64 possible codes are

reserved as escape codes that require decoding of more bits in the

instruction to obtain the full opcode. This technique of expanded opcodes

enriches the instruction set but limits the length of the instruction. Figure.

4.1 provides instruction map of MIPS R2000. Operation codes marked with

a dagger cause reserved instruction exceptions and these are reserved for

later versions of MIPS architecture. The opcode bits are 26-31 and the

initial decoding of the opcode is shown at the top of the figure. As an

example, for the opcode 000011, the instruction is JAL, a jump and link,

instruction. If the opcode is 000000, the special instructions are invoked

with bits 0-5 of the instruction. These 6 bits, found in the R format only, are

decoded in the SPECIAL map. BCOND is expanded with bits 16-20 of the

instruction while COP0 is expanded with bits 0-4. COP1, 2, and 3 are

expanded with bits 16-25. Table 4.3 identifies the actions performed by the


integer instructions of MIPS R2000 [27]. In addition, there are floating-point

instructions that are not included here since they are beyond the scope of this

thesis. The MIPS has a floating-point coprocessor (numbered 1) that

operates on single precision and double precision floating-point numbers.

The coprocessor has its own registers.
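The expanded-opcode scheme described above can be sketched in Python as a two-level lookup. The tables below hold only a handful of entries taken from the opcode values quoted in this chapter; they are deliberately incomplete, and the names PRIMARY, SPECIAL and mnemonic are illustrative.

```python
# First-level map: 6-bit opcode (bits 31-26) to mnemonic.
PRIMARY = {
    0b000010: "j",
    0b000011: "jal",
    0b000100: "beq",
    0b001001: "addiu",
    0b100011: "lw",
}

# Second-level map for the SPECIAL escape code (opcode 000000):
# the operation is selected by the fn field (bits 5-0).
SPECIAL = {
    0b000000: "sll",
    0b001000: "jr",
    0b100000: "add",
    0b100100: "and",
}


def mnemonic(word):
    """Return the mnemonic of a 32-bit word, or 'unknown' if not tabulated."""
    op = (word >> 26) & 0x3F
    if op == 0b000000:                      # escape code: decode more bits
        return SPECIAL.get(word & 0x3F, "unknown")
    return PRIMARY.get(op, "unknown")
```

Note that an all-zero word decodes as sll in this scheme, which is why NOP needs no opcode of its own.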

Opcode (rows: bits 31..29; columns: bits 28..26)

       0         1       2       3         4       5       6       7
0      SPECIAL   BCOND   J       JAL       BEQ     BNE     BLEZ    BGTZ
1      ADDI      ADDIU   SLTI    SLTIU     ANDI    ORI     XORI    LUI
2      COP0      COP1    COP2    COP3      †       †       †       †
3      †         †       †       †         †       †       †       †
4      LB        LH      LWL     LW        LBU     LHU     LWR     †
5      SB        SH      SWL     SW        †       †       SWR     †
6      LWC0      LWC1    LWC2    LWC3      †       †       †       †
7      SWC0      SWC1    SWC2    SWC3      †       †       †       †

SPECIAL (rows: bits 5..3; columns: bits 2..0)

       0         1       2       3         4         5       6       7
0      SLL       †       SRL     SRA       SLLV      †       SRLV    SRAV
1      JR        JALR    †       †         SYSCALL   BREAK   †       †
2      MFHI      MTHI    MFLO    MTLO      †         †       †       †
3      MULT      MULTU   DIV     DIVU      †         †       †       †
4      ADD       ADDU    SUB     SUBU      AND       OR      XOR     NOR
5      †         †       SLT     SLTU      †         †       †       †
6      †         †       †       †         †         †       †       †
7      †         †       †       †         †         †       †       †

BCOND (rows: bits 20..19; columns: bits 18..16)

       0         1       2       3         4       5       6       7
0      BLTZ      BGEZ
1
2      BLTZAL    BGEZAL
3

COPz (rows: bits 22, 21, 16; columns: bits 25..23)

        0        1       2       3         4       5       6       7
0,0,0   MF       MT      BCT
0,0,1                    BCF
0,1,0
0,1,1
1,0,0   CF       CT

Figure. 4.1: MIPS R2000 instruction map


Table 4.3: MIPS32 Integer instructions and actions

(Abbreviations used: R-Register; I-Immediate; O-Offset; T-target address)

Sl.   Instruction   Operation               Operand   Action
no.   name          Type                    Format
1     add           ALU                     R         Addition with overflow
2     addu          ALU                     R         Addition without overflow
3     addi          ALU                     I         Addition immediate with overflow
4     addiu         ALU                     I         Addition immediate without overflow
5     and           ALU                     R         Logical AND
6     andi          ALU                     I         Logical AND of rs with zero-extended immediate
7     div           ALU                     R         Division with overflow; leave quotient and remainder in registers lo and hi respectively
8     divu          ALU                     R         Division without overflow; result storing similar to div instruction
9     mult          ALU                     R         Multiply; leave the low order and high order words of the product in registers lo and hi respectively
10    multu         ALU                     R         Unsigned multiply; result storing similar to mult instruction
11    nor           ALU                     R         Logical NOR
12    or            ALU                     R         Logical OR
13    ori           ALU                     I         Logical OR of rs with zero-extended immediate
14    sll           ALU                     R         Shift left logical; by number of positions specified by sa field
15    sllv          ALU                     R         Shift left logical variable; by number of times specified by rs
16    sra           ALU                     R         Shift right arithmetic; by number of positions specified by sa field
17    srav          ALU                     R         Shift right arithmetic variable; by number of times specified by rs
18    srl           ALU                     R         Shift right logical; by number of positions specified by sa field
19    srlv          ALU                     R         Shift right logical variable; by number of times specified by rs
20    sub           ALU                     R         Subtract with overflow
21    subu          ALU                     R         Subtract without overflow
22    xor           ALU                     R         Logical exclusive OR
23    xori          ALU                     I         Logical exclusive OR of rs with zero-extended immediate
24    lui           CONMANIP                I         Load lower halfword of immediate into upper halfword of rt; reset other bits of rt
25    slt           Compare                 R         Set less than
26    sltu          Compare                 R         Set less than unsigned
27    slti          Compare                 I         Set less than immediate
28    sltiu         Compare                 I         Set less than unsigned immediate
29    bczt          Branch                  O         Branch coprocessor z true
30    bczf          Branch                  O         Branch coprocessor z false
31    beq           Branch                  O         Branch on equal
32    bgez          Branch                  O         Branch if rs is greater than or equal to 0
33    bgezal        Branch                  O         Branch if rs is greater than or equal to 0; in addition, save (link) the address of the next instruction in R31
34    bgtz          Branch                  O         Branch on greater than 0
35    blez          Branch                  O         Branch if rs is less than or equal to 0
36    bltzal        Branch                  O         Branch if rs is less than 0; in addition, save (link) the address of the next instruction in R31
37    bltz          Branch                  O         Branch on less than 0
38    bne           Branch                  O         Branch on not equal
39    j             Jump                    T         Unconditionally jump to the instruction at target
40    jal           Jump                    T         Unconditionally jump to the instruction at target; in addition, save (link) the address of the next instruction in R31
41    jalr          Jump                    R         Unconditionally jump to the instruction whose address is in rs; in addition, save (link) the address of the next instruction in rd
42    jr            Jump                    R         Unconditionally jump to the instruction whose address is in rs
43    lb            Load                    O         Load byte with sign-extension
44    lbu           Load                    O         Load byte without sign-extension
45    lh            Load                    O         Load halfword with sign-extension
46    lhu           Load                    O         Load halfword without sign-extension
47    lw            Load                    O         Load word
48    lwcz          Load                    O         Load word into coprocessor register
49    lwl           Load                    O         Load the left bytes from the word, at the possibly unaligned address, into rt
50    lwr           Load                    O         Load the right bytes from the word, at the possibly unaligned address, into rt
51    sb            Store                   O         Store the low byte from rt
52    sh            Store                   O         Store the low halfword from rt
53    sw            Store                   O         Store the word from rt
54    swcz          Store                   O         Store the word from the coprocessor register
55    swl           Store                   O         Store the left bytes from rt at the possibly unaligned address
56    swr           Store                   O         Store the right bytes from rt at the possibly unaligned address
57    mfhi          Data move               R         Transfers hi to rd
58    mflo          Data move               R         Transfers lo to rd
59    mthi          Data move               R         Transfers rs to hi
60    mtlo          Data move               R         Transfers rs to lo
61    mfcz          Data move               R         Transfers coprocessor register to rt
62    mtcz          Data move               R         Transfers rt to coprocessor register
63    Syscall       Exception / Interrupt   R         System call
64    Break         Exception / Interrupt   R         Cause exception
65    NOP           Exception / Interrupt   -         Do nothing
66    rfe           Exception / Interrupt   R         Return from exception


The register file of MIPS R2000 architecture consists of thirty two,

32-bit registers as shown in Figure. 4.2. These registers are used for

operands, results (both integer and floating point), and index registers. One

of the registers, R0 is always set to zero (by the hardware) for use in

clearing a register, providing a zero constant, and support of address

arithmetic. Two 32-bit registers are provided to support multiplication

(holding the double length product) and division (holding the quotient and

the remainder). The program counter is a separately architected register

unlike in certain processors such as ARM wherein one of the GPRs is

dedicated as PC.

Like all other RISC processors, the MIPS R2000 follows a load/store
architecture. The load and store instructions use two memory accesses
(instruction fetch and operand fetch) and operate on signed and
unsigned bytes, halfwords (2 bytes) and words (4 bytes). Most
instructions use the three-address register-to-register format. The
data types consist of the following:

single- and double-precision IEEE floating point

signed, 2's complement 8-, 16-, and 32-bit integers

unsigned 8-, 16-, and 32-bit integers

Figure. 4.2: MIPS R2000 registers


The MIPS R2000 memory is byte addressable. Thus the address for

a 16-bit integer ignores the LSB of the address. The 2 LSBs of the address

are ignored for 32-bit integers and single-precision floating-point data

types. The three LSBs of the address are ignored for double precision

floating-point data types. Binding of the addresses to opcodes is

accomplished in the instruction decoding hardware.

4.1.3 Wastage in MIPS32 Code

The general drawback of fixed instruction size feature of RISC

architecture has been discussed in Chapters 1 and 2. The following

discussion is specific to MIPS32.

1. Several bits are unused in many instructions, as pointed out in
Chapter 2. Table 4.4 lists the opcodes of the 66 integer instructions,
indicating their formats and the number of redundant zeroes. The R-type
instructions use a total of 12 bits (OP and OPX) to specify the
operation, though there are at most 64 different R-type operations in
the MIPS32 ISA. Since MIPS32 has 32 GPRs, five bits are used to specify
each register operand. Leaving out the 12 bits (op and fn) for the
operation, 20 bits remain for the operand fields, whereas only 15 bits
are needed for three register operands. Hence five bits are unused in
R-type instructions, as illustrated in Figure. 4.3 for the and
instruction. If three more bits are eliminated from any field, the
instruction length can be reduced from 27 to 24 bits. The solution in
HIE1 for relieving three bits involves reducing each register field to
four bits, as discussed in the next section. In HIE2, the OPX field is
either eliminated or replaced by a shorter field.
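The bit accounting in point 1 can be checked with a few lines of Python. This is only a restatement of the arithmetic above; the constant and function names are illustrative.

```python
INSTRUCTION_BITS = 32   # fixed MIPS32 instruction length
OP_BITS = 6             # op field
FN_BITS = 6             # opx/fn field
REGISTER_BITS = 5       # one register specifier when there are 32 GPRs


def rtype_redundant_bits(register_operands=3):
    """Unused bits in an R-type instruction with the given number of registers."""
    operand_space = INSTRUCTION_BITS - OP_BITS - FN_BITS   # 20 bits
    return operand_space - register_operands * REGISTER_BITS


def hie1_rtype_bits(register_operands=3, register_bits=4):
    """Length of an R-type instruction once each register field shrinks to 4 bits."""
    return OP_BITS + FN_BITS + register_operands * register_bits
```

With three register operands the first function gives the five redundant zeroes of the and instruction, and with two or one register operand it gives the 10 and 15 listed in Table 4.4 for div and mfhi. The second function gives 24 bits for the three-register case.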


Table 4.4: MIPS32 Instructions, opcodes and redundant zeros

(Abbreviations used: R-Register; I-Immediate; O-Offset; T-target address;
RZ - Redundant zero)

Sl    Instruction   OP             OPX          Operation               Operand   No. of
no.   name          (bits 31-26)   (bits 5-0)   Type                    Format    RZ
1     add           000000         100000       ALU                     R         5
2     addu          000000         100001       ALU                     R         5
3     addi          001000         -            ALU                     I         0
4     addiu         001001         -            ALU                     I         0
5     and           000000         100100       ALU                     R         5
6     andi          001100         -            ALU                     I         0
7     div           000000         011010       ALU                     R         10
8     divu          000000         011011       ALU                     R         10
9     mult          000000         011000       ALU                     R         10
10    multu         000000         011001       ALU                     R         10
11    nor           000000         100111       ALU                     R         5
12    or            000000         100101       ALU                     R         5
13    ori           001101         -            ALU                     I         0
14    sll           000000         000000       ALU                     R         0
15    sllv          000000         000100       ALU                     R         5
16    sra           000000         000011       ALU                     R         0
17    srav          000000         000111       ALU                     R         5
18    srl           000000         000010       ALU                     R         0
19    srlv          000000         000110       ALU                     R         5
20    sub           000000         100010       ALU                     R         5
21    subu          000000         100011       ALU                     R         5
22    xor           000000         100110       ALU                     R         5
23    xori          001110         -            ALU                     I         0
24    lui           001111         -            CONMANIP                I         5
25    slt           000000         101010       Compare                 R         5
26    sltu          000000         101011       Compare                 R         5
27    slti          001010         -            Compare                 I         0
28    sltiu         001011         -            Compare                 I         0
29    bczt          -              -            Branch                  O         4
30    bczf          -              -            Branch                  O         4
31    beq           000100         -            Branch                  O         0
32    bgez          000001         -            Branch                  O         4
33    bgezal        000001         -            Branch                  O         0
34    bgtz          000111         -            Branch                  O         5
35    blez          000110         -            Branch                  O         5
36    bltzal        000001         -            Branch                  O         0
37    bltz          000001         -            Branch                  O         5
38    bne           000101         -            Branch                  O         0
39    j             000010         -            Jump                    T         0
40    jal           000011         -            Jump                    T         0
41    jalr          000000         001001       Jump                    R         10
42    jr            000000         001000       Jump                    R         16
43    lb            100000         -            Load                    O         0
44    lbu           100100         -            Load                    O         0
45    lh            100001         -            Load                    O         0
46    lhu           100101         -            Load                    O         0
47    lw            100011         -            Load                    O         0
48    lwcz          -              -            Load                    O         0
49    lwl           100010         -            Load                    O         0
50    lwr           100110         -            Load                    O         0
51    sb            101000         -            Store                   O         0
52    sh            101001         -            Store                   O         0
53    sw            101011         -            Store                   O         0
54    swcz          -              -            Store                   O         0
55    swl           101010         -            Store                   O         0
56    swr           101110         -            Store                   O         0
57    mfhi          000000         010000       Data move               R         15
58    mflo          000000         010010       Data move               R         15
59    mthi          000000         010001       Data move               R         15
60    mtlo          000000         010011       Data move               R         15
61    mfcz          -              -            Data move               R-O       11
62    mtcz          -              -            Data move               R-O       11
63    Syscall       000000         001100       Exception / Interrupt   R-O       20
64    Break         000000         001101       Exception / Interrupt   R-O       0
65    NOP           000000         -            Exception / Interrupt   -         26
66    rfe           010000         100000       Exception / Interrupt   -         19

Field:   0's      rs   rt   rd   0's   100100
Bits:    6        5    5    5    5     6

Figure. 4.3: Format of and instruction in MIPS32 ISA

Field:   001001   rs   rt   0's
Bits:    6        5    5    16

Figure. 4.4: addiu instruction with immediate field containing zero value

Field:   001001   rs   rt   0000000000101100
Bits:    6        5    5    16

Figure. 4.5: addiu instruction with only most significant byte of
immediate as zero

Field:   001001   rs   rt   0000001100000000
Bits:    6        5    5    16

Figure. 4.6: addiu instruction with only least significant byte of
immediate as zero

Field:   001001   rs   rt   0001010100001001
Bits:    6        5    5    16

Figure. 4.7: addiu instruction with both bytes of immediate field as
non-zero value

2. In immediate type instructions such as addi, 16 bits are used for
specifying the immediate operand. In most cases, eight bits are
sufficient for the immediate operand and the remaining eight bits
become redundant. Figures. 4.4 to 4.7 illustrate the four different
cases of immediate field patterns; in only one of them are both bytes
of the immediate field non-zero. In the other three cases, one or two
bytes are wasted. It was seen in Chapter 3 that two out of 23
benchmarks need a maximum of only 8 bits for the immediate field. Even
among the other benchmarks, more than 8 bits are required in at most
10% of the cases. This behaviour of embedded codes has been exploited
in designing the hybrid encoding for the immediate field, as discussed
in the next section.

3. In branch instructions such as beq, the offset field is
underutilized in those cases where the required offset can be specified
with 8 bits. It was seen in Chapter 3 that 16 out of 23 benchmarks need
a maximum of only 8 bits for the offset field. Even among the other
benchmarks, more than 8 bits are required in at most 3% of the cases.
The hybrid encoding technique used for the immediate field is followed
for the offset field also.


The impact of these drawbacks on the code size has been studied in

chapter 2. Sections 4.2 and 4.4 deal with two different HIE techniques for

MIPS32 to achieve the goal of minimising unused fields within instructions,

and improving the utilization of the offset and immediate fields. When a

new processor is designed, the computer architect has greater flexibility in

choosing the ISA attributes such as instruction formats, opcodes,

addressing modes and number of registers. Developing a new processor is

an involved process requiring appropriate tools and it consumes several

man years. To evaluate such a processor, an entire tool chain has to be

created. On the other hand, this research work deals with HIE design for

MIPS by modifications to certain features of the MIPS32. This approach

helps to verify and prove the concept though the extent of code size

reduction achievable will be slightly less compared to what is possible with

a newly designed HIERISC processor. In HIE1, the solution involves
reducing the number of GPRs to 16, whereas in HIE2, the maximum length
of the offset/immediate fields is reduced by 1 bit. In HIE1, the
lengths of the OP and fn fields are retained as in MIPS32. In HIE2, the
six-bit fn field is eliminated; instead, an iid field of two to three
bits serves the purpose of instruction identification. Both design
approaches have certain common aspects, such as hybrid lengths for the
offset and immediate fields. HIE1 is discussed first, in detail, in
Section 4.2, and the design of HIE2 is then taken up in Section 4.4.

4.2. HIE1 METHODOLOGY FOR MIPS32

To evaluate the effectiveness of our proposed HIERISC ISA, it is

designed as a piggyback to the MIPS32 ISA. Basically, for every integer

instruction of MIPS32 ISA, an equivalent HIE instruction is provided. In the

HIE1 ISA, four different sizes are allowed for the integer instructions: three

8-bit, seven 16-bit, twenty one 24-bit, three 32-bit, and thirty two

instructions with three length options: 16/24/32 bits.


4.2.1 HIE1 RISC Instructions

The HIE1 design supports nine different types of integer instructions.

Figure. 4.8 shows the proposed instruction formats for HIE1. Out of 66

integer instructions, j, jal, and break, are retained as 32 bits, due to system

software implications. The remaining instructions are translated into one of

the HIE1 types. In several ALU instructions, there are five redundant zeros.

As pointed out earlier, the register fields are reduced by one bit each so

that these instructions can be reduced to 24 bits as shown in Figure. 4.9.

This restricts the number of GPRs to 16; however, it will not strain the

compiler as graph colouring technique for register allocation works

satisfactorily for 16 GPRs [3]. Popular RISC Processors such as ARM and

SH4 have only 16 registers.

Figure. 4.8: HIE1 RISC Instruction Formats


Field:   op   rs   rt   rd   fn
Bits:    6    4    4    4    6

Figure. 4.9: R Type instruction in HIE1
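As an illustration of the 24-bit R-type layout, the following Python sketch repacks a MIPS32 R-type word into three bytes. It is a sketch under stated assumptions rather than the converter used in this work: the fields are assumed to be packed in the order shown in Figure. 4.9, the bytes are emitted most significant first, the sa field is assumed to be unused, and the registers are assumed to be already numbered 0 to 15. The function name is hypothetical.

```python
def mips32_rtype_to_hie1(word):
    """Repack a MIPS32 R-type word as op(6) rs(4) rt(4) rd(4) fn(6) = 24 bits."""
    op = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    rd = (word >> 11) & 0x1F
    fn = word & 0x3F
    # Assumption: register renumbering is done beforehand (e.g. by the
    # compiler), so any register above 15 cannot be represented here.
    if max(rs, rt, rd) > 15:
        raise ValueError("register number does not fit in a 4-bit field")
    packed = (op << 18) | (rs << 14) | (rt << 10) | (rd << 6) | fn
    return packed.to_bytes(3, "big")


# "and" with rs = 1, rt = 2, rd = 3 shrinks from four bytes to three.
hie1_and = mips32_rtype_to_hie1(0x00221824)
```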

The nop, rfe and syscall are 8-bit instructions with a common opcode
and a 2-bit iid field to identify the instruction. The 16-bit
instructions are jr, mfhi, mflo, mthi, mtlo, mfcz and mtcz. In mfcz and
mtcz, the rd field is retained as 5 bits since it refers to coprocessor
registers. The iid bit differentiates between mfcz and mtcz. The mfhi,
mflo, mthi and mtlo instructions have a common format in which the
register field is shared between rd and rs: it denotes rd for mfhi and
mflo, and rs for mthi and mtlo.

The 24-bit instructions that form three different R-types are add,

addu, and, div, divu, mult, multu, nor, or, sll, sllv, sra, srav, srl, srlv, sub,

subu, xor, slt, sltu, and jalr. In type1, there is no sa field. In type2, there is

no rs field. In type3, there are four zeroes to maintain byte alignment. The

remaining 32 instructions have three length options: 16, 24, or 32 bits. The

offset and immediate fields are encoded in a unique way in our proposal.

Table 4.5 shows a typical example using hexadecimal notation. If the value

of the offset / immediate is zero, these fields are omitted. When one of the

bytes in the offset / immediate is zero, that byte is omitted and the hybrid

identifier hl is formed accordingly. All the four cases have a common

opcode.


Table 4.5: Sample Encoding of Offset/Immediate Field in HIE-MIPS

MIPS32 Encoding   HIE1-MIPS Encoding   hl bits   HIE1 Instruction size (bits)
0000              Nil                  00        16
000F              0F                   01        24
0F00              0F                   10        24
0F0F              0F0F                 11        32
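The rule behind Table 4.5 can be sketched in Python as follows. The sketch assumes that the part of the instruction other than the offset / immediate occupies 16 bits (as implied by the first row of the table) and that, when both bytes are kept, the high byte is stored first. The function names are illustrative and not taken from MIDACC.

```python
def encode_hybrid(value16):
    """Return (hl, payload, size_bits) for a 16-bit offset / immediate.

    hl bit 1 is set when the high byte is kept and hl bit 0 when the
    low byte is kept; zero bytes are dropped from the payload.
    """
    high = (value16 >> 8) & 0xFF
    low = value16 & 0xFF
    hl = (0b10 if high else 0) | (0b01 if low else 0)
    payload = bytes(b for b in (high, low) if b)
    size_bits = 16 + 8 * len(payload)   # assumed 16-bit base plus kept bytes
    return hl, payload, size_bits


def decode_hybrid(hl, payload):
    """Rebuild the original 16-bit value from the hl bits and the payload."""
    data = iter(payload)
    high = next(data) if hl & 0b10 else 0
    low = next(data) if hl & 0b01 else 0
    return (high << 8) | low
```

Because the hl bits record which byte positions were dropped, the original 16-bit value can always be recovered at decode time.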

4.2.2 Mapping MIPS32 ISA to HIE1

MIPS instructions are converted into HIE1 RISC instructions of
different types as illustrated in Table 4.6. As indicated earlier,
three instructions, j, jal, and break, are unchanged. For all others,
conversion depends on the opcode and the immediate / offset fields. All
unconverted instructions are retained as 32 bits. For some
instructions, more than one type of conversion is possible. For
example, the addi instruction has three cases: addi-a, addi-b and
addi-c. Here a means the converted length is 16 bits, whereas b and c
mean the converted length is 24 bits. Identifying certain MIPS
instructions involves multiple match conditions. For example, for the
bczt instruction, the first byte may be any one of four combinations:
41, 45, 49, 4D. In addition, the second byte has 16 combinations: 01,
03, 05, 07, 09, 0B, 0D, 0F, 11, 13, 15, 17, 19, 1B, 1D, 1F. For the NOP
instruction, all 32 bits are 0's.
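The multiple match conditions described above can be sketched in Python. The sketch assumes that "first byte" means the most significant byte of the instruction word (bits 31-24) and "second byte" means bits 23-16. The function name and the returned labels are illustrative only.

```python
# Byte patterns quoted in the text for bczt.
BCZT_FIRST_BYTES = {0x41, 0x45, 0x49, 0x4D}
BCZT_SECOND_BYTES = set(range(0x01, 0x20, 2))   # 01, 03, ..., 1F


def classify(word):
    """Recognise NOP and bczt in a 32-bit word; report anything else as 'other'."""
    if word == 0:                        # NOP: all 32 bits are zero
        return "nop"
    first = (word >> 24) & 0xFF          # assumed: bits 31-24
    second = (word >> 16) & 0xFF         # assumed: bits 23-16
    if first in BCZT_FIRST_BYTES and second in BCZT_SECOND_BYTES:
        return "bczt"
    return "other"
```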


Table 4.6: MIPS32 ISA to HIE1 RISC ISA Mapping

HIE1 Group A
  No. of instructions: 3
  HIE1 IL (bits): 8
  Instructions: rfe, syscall, nop
  Type: Exception and interrupt
  No. of RZs: 0
  Remarks on HIE1 format: Common OP field; iid differentiates.

HIE1 Group B
  No. of instructions: 2
  HIE1 IL (bits): 16
  Instructions: mfcz, mtcz
  Type: Data movement with coprocessor
  No. of RZs: 0
  Remarks on HIE1 format: Common OP field; one-bit iid differentiates.
    The rt is four bits but the rd is five bits.

HIE1 Group C
  No. of instructions: 5
  HIE1 IL (bits): 16
  Instructions: jr, mfhi, mflo, mthi, mtlo
  Type: jr is jump register instruction; others are data movement type
  No. of RZs: 0
  Remarks on HIE1 format: The OP and fn fields are similar to MIPS32.
    The 4-bit register field is rs for jr, mthi and mtlo. For mfhi and
    mflo, the register field is rd.

HIE1 Group D
  No. of instructions: 13
  HIE1 IL (bits): 24
  Instructions: add, addu, and, nor, or, sllv, srav, srlv, sub, subu,
    xor, slt, sltu
  Type: R-type; slt and sltu are comparison type; others are arithmetic
    and logical
  No. of RZs: 0
  Remarks on HIE1 format: HIE1 R-Type1. All fields are similar to MIPS32
    except that unused zeroes are deleted and the register fields are
    4 bits.

HIE1 Group E
  No. of instructions: 3
  HIE1 IL (bits): 24
  Instructions: sll, sra, srl
  Type: R-type; shift
  No. of RZs: 0
  Remarks on HIE1 format: HIE1 R-Type2. All fields are similar to MIPS32
    except that the unused rs field is deleted and the register fields
    are 4 bits.

HIE1 Group F
  No. of instructions: 5
  HIE1 IL (bits): 24
  Instructions: jalr, div, divu, mult, multu
  Type: R-type; arithmetic
  No. of RZs: 4
  Remarks on HIE1 format: HIE1 R-Type3. All fields are similar to MIPS32
    except that 6 unused zeroes are deleted and the register fields are
    4 bits; 4 zeroes maintain byte alignment. In jalr, the register
    fields are rs and rd; in other instructions, these are rs and rt.

HIE1 Group G
  No. of instructions: 32
  HIE1 IL (bits): 16/24/32
  Instructions: addi, addiu, andi, ori, xori, lui, slti, sltiu, bczt,
    bczf, beq, bgez, bgezal, bgtz, blez, bltzal, bltz, bne, lb, lbu, lh,
    lhu, lw, lwcz, lwl, lwr, sb, sh, sw, swcz, swl, swr
  Type: I-type/branch/load/store. A mixture of arithmetic/logical,
    constant manipulation, compare, branch, load and store type. Most
    are of I-type. The branch/load/store instructions have offset.
  No. of RZs: 0
  Remarks on HIE1 format: HIE1 I-Type. All fields are similar to MIPS32
    except that the immediate / offset field can take three different
    lengths: 0/8/16 bits. In lui, the rs field contains 4 zeroes. All
    register fields are 4 bits.

HIE1 Group H
  No. of instructions: 2
  HIE1 IL (bits): 32
  Instructions: j, jal
  Type: jump
  No. of RZs: 0
  Remarks on HIE1 format: Similar to MIPS.

HIE1 Group I
  No. of instructions: 1
  HIE1 IL (bits): 32
  Instructions: break
  Type: Exception and interrupt
  No. of RZs: 0
  Remarks on HIE1 format: Similar to MIPS.


4.3 HIE1 EXPERIMENTAL RESULTS

There is a wide variation in the sizes of the benchmark programs.

Out of the 23 embedded applications, four are small (≤ 10KB), four are

medium (10KB-100KB) and fifteen are large (≥ 100KB). The code size

reduction for individual benchmarks in each category is shown in Figures.

4.10 to 4.16. It is observed that there is varying extent of reduction for

embedded programs ranging from 18% to 27%. The three programs-

susan, bitcount and JPEG - get the maximum reduction and the lame

program gets the least reduction. There is very little or nil variation in the

reduction percentages of different applications in network, speech and

security segments. The Automotive and Industrial Control benchmarks

(Figure. 4.10) show reduction varying from 21% to 27%. All three network

benchmarks get around 21.5 % as shown in Figure. 4.11. As shown in

Figure. 4.12, the video program, MPEG2, gets 21% reduction whereas the

two benchmarks in Audio - ADPCM and lame - have noticeable differences.

In image segment, JPEG and susan get higher reduction exceeding 26%

but EPIC and fft give only 21% as shown in Figure. 4.13.

Figure. 4.10: Effect of HIE1 on Automotive and Industrial Control Benchmarks

Figure. 4.11: Effect of HIE1 on Network Benchmarks

Figure. 4.12: Effect of HIE1 on Video and Audio Benchmarks

Figure. 4.13: Effect of HIE1 on Image Benchmarks

Figure. 4.14: Effect of HIE1 on Speech Benchmarks

Figure. 4.15: Effect of HIE1 on Security Benchmarks

Figure. 4.16: Effect of HIE1 on Text Benchmarks

Figure. 4.17: Effect of HIE1 on Embedded Segments

In the speech segment, all three benchmarks have reduction ratios
between 21% and 22% as shown in Figure. 4.14. The four benchmarks in
the security segment also have reduction ratios between 21% and 22% as
shown in Figure. 4.15. The three benchmarks of the text segment show
reductions from 19% to 21% as in Figure. 4.16. A comparison of
reduction percentages across the segments is shown in Figure. 4.17.
Since many segments contain multiple benchmarks, the geometric mean of
the reduction percentages is used for each segment. It is observed that
the Automotive and Consumer segments gain the most from the reduction,
and the Audio segment gains the least.
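The per-segment aggregation can be illustrated with a short Python sketch of the geometric mean. The input values in the usage line are placeholders chosen for illustration and are not measured results from this work.

```python
import math


def geometric_mean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical reduction percentages of the benchmarks in one segment.
segment_reduction = geometric_mean([21.0, 22.0, 21.5])
```

The geometric mean is never larger than the arithmetic mean, so a single benchmark with an unusually high reduction pulls the segment figure up less than it would under simple averaging.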

4.3.1 Drawback of register size reduction

Though there are many processors such as ARM that manage well

with just 16 registers only, experiments with MiMedia object codes for

MIPS32 reveal a different fact. Out of 23 benchmarks, only susan is not

affected significantly as seen in Figure.3.14. Only in 5% of the cases,


susan needs more than 16 registers. This can be easily taken care of by
the compiler. But all other benchmarks use more than 16 registers in
not less than 35% of the register accesses. The worst case is jpeg,
which needs more than 16 registers in 59% of the cases. Eliminating
these accesses by restructuring the code in the compiler will
definitely increase the number of instructions, apart from reducing
performance. As per Figure. 3.15, only the benchmarks of two segments,
Automotive and Industrial Control, and Image, may tolerate reduction of
registers from 32 to 16. However, if susan is removed from these two
segments, then this model will fare as badly as MIPS16, if not worse.

Another issue to be addressed at this point is the impact of reducing
the shift amount field sa from 5 bits to 4. As seen in Figure. 3.16,
only two programs, bitcount and sha, make heavy use of shifts by more
than 16 bits. Figure. 3.17 confirms that the requirement of most
embedded segments lies between 5% and 11%. Hence HIE1 will not
significantly affect the shift operations. However, HIE1 is not
attractive if register usage is heavy. Hence this technique is
recommended only for the toys market, wherein performance is not an
issue.

4.4 DESIGN OF HIE2

HIE2 follows a more aggressive reduction policy than HIE1 in the
following aspects:

1. The number of redundant 0's in the OPX/fn field is also reduced. In
   HIE1, 24 instructions have redundant 0's whereas HIE2 has only 9
   instructions with redundant 0's.
2. HIE2 has 12 different instruction formats whereas HIE1 has nine
   formats.

In HIE2, MIPS32 instructions are converted into new HIE Plus
instructions of 12 different types, retaining the length of the register
field as 5 bits. However, the maximum length of the offset/immediate
fields is reduced to 15 bits.

4.4.1 Impact of Reduction of immediate and offset lengths to 15 bits

The 16-bit immediate field is heavily underutilized by embedded
benchmarks, as shown in Figure. 3.7. Except for two benchmarks, rsynth
and typeset, the other 21 programs need the full 16 bits in less than 10%
of the cases. The average requirement is only 5%. As seen in Figure. 3.8,
the benchmarks of most embedded segments need the full 16 bits in only 0%
to 5% of the cases. The benchmarks of the Automotive and the Speech
segments are at the two extreme ends, but within a short range. Our
analysis shows that embedded applications use the full 16 bits of the
offset field very rarely, as shown in Figure. 3.9. The average
requirement is less than 1%. In fact, 16 benchmarks never need more than
15 bits. Even the remaining programs need 16 bits in at most 3% of the
cases. The overall behavior of the embedded segments in this respect is
shown in Figure. 3.10. The Network and Speech segments are satisfied with
15 bits of offset. The worst-case requirement is that of the Image
segment, in which 2% of the cases use an offset of more than 15 bits.
Combining the usage requirements of both the immediate and offset fields,
the average figure is less than 6%. Hence reducing the immediate and
offset fields by 1 bit in HIE2 is a better choice than reducing the
register and shift amount fields as in HIE1.
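
The utilization argument above amounts to asking, for every immediate or offset value found in the object code, how many bits are needed to represent it. A minimal Python sketch of that check is given here. It assumes that the fields hold signed two's-complement values; the function names bits_needed and share_needing_full_width are hypothetical and are not part of the thesis tooling, and the sample values are illustrative, not measured data.

```python
def bits_needed(value: int) -> int:
    """Smallest two's-complement width (in bits) that can hold value."""
    if value >= 0:
        return value.bit_length() + 1      # one extra bit for the sign
    return (~value).bit_length() + 1       # e.g. -1 -> 1 bit, -128 -> 8 bits


def share_needing_full_width(values, reduced_bits=15):
    """Fraction of field values that do not fit in reduced_bits bits."""
    values = list(values)
    if not values:
        return 0.0
    too_wide = sum(1 for v in values if bits_needed(v) > reduced_bits)
    return too_wide / len(values)


# Illustrative values only (assumption: not taken from any benchmark).
sample = [4, -8, 100, 16383, -16384, 16384, -32768]
print(share_needing_full_width(sample))
```

Only the last two sample values need the sixteenth bit, so a 15-bit field would cover the rest without any change to the code.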

4.4.2 HIE2 Design for MIPS32

Since the HIE1 design has already been discussed in depth, this section
focuses on the essential differences in the HIE2 design. In HIE2, the
instructions are of 12 types, as shown in Figure. 4.18. The need to
assign 66 integer opcodes and to reserve some opcodes for future
expansion is the main reason for such a large number of types. The HIE2
features are as follows.

1. HIE2 supports four sizes for integer instructions:
   (a) Three 8-bit instructions of type A
   (b) Twelve 16-bit instructions: 2 type B, 5 type C and 5 type F
   (c) Sixteen 24-bit instructions: 8 type D1, 5 type D2 and 3 type E
   (d) Three 32-bit instructions: 2 type H and 1 type I, and
   (e) Thirty-two instructions with multiple options: 26 type G1
       instructions with 24/32 bits; 2 type G2 instructions with
       8/16/24 bits; and 4 type G3 instructions with 16/24/32 bits
2. In all the instructions, the msb (IT bit) indicates the instruction
   type. For types G1, G2 and G3, the IT bit is 0, indicating hybrid
   length fields. For all other types, the IT bit is 1, indicating fixed
   length fields.
3. The OP field is only 5 bits for all instructions in HIE2. However,
   the IT bit is an addition.
4. The instruction identifier (iid) field identifies the exact
   instruction within a group of instructions with a common OP. The
   length of iid is either 2 or 3 bits.
5. The hybrid length (hl) field indicates the length of the
   offset/immediate fields as in HIE1.
6. Out of the 66 integer instructions, j, jal and break are retained as
   32 bits as in HIE1.
7. Unlike HIE1, HIE2 retains the register fields as 5 bits.
8. syscall, nop and rfe are 8-bit instructions with a common opcode
   (OP = 00001) and a 2-bit iid to identify the instruction. The iid
   patterns 00, 01 and 10 represent syscall, nop and rfe respectively.
   The 11 combination is reserved for future addition.
9. The 16-bit instructions are of three types: B, C and F. In type B,
   mfcz and mtcz have two different opcodes. In type C, all five
   instructions have a common opcode, and the three-bit iid field
   identifies the exact instruction. mfhi, mflo, mthi and mtlo have a
   common format in which the register field is shared between rd and
   rs: the rd/rs field denotes rd for mfhi and mflo, and rs for mthi and
   mtlo. In type F, each instruction has a separate opcode.
10. In the 24-bit instructions, the formats of type D1 and type D2 are
    similar but use different opcodes. The eight instructions in type D1
    have a common opcode, and a three-bit iid field identifies the exact
    instruction. Similarly, the five instructions in type D2 share a
    common opcode and are distinguished by a three-bit iid field.
11. The 32-bit instructions are of two types. In type H, each of the two
    instructions has a separate opcode. In type I, there is only one
    instruction.
12. In type G1, there are 26 instructions, each with a separate opcode.
    The instruction length is either 24 bits or 32 bits. A one-bit hl
    field indicates the actual length of the immediate/offset field,
    which can be either 7 bits or 15 bits.
13. In type G2, there are two instructions with separate opcodes. The
    instruction length can be 8/16/24 bits. The length of the offset can
    be 0/8/16 bits. A two-bit hl field identifies the length.
14. In type G3, there are four instructions with a common opcode. A
    two-bit iid field identifies the exact instruction. The instruction
    length can be 16/24/32 bits. The length of the offset can be 0/8/16
    bits. A two-bit hl field indicates this as 00, 01, 10 or 11.

The mapping between the MIPS32 ISA and HIE2 ISA is illustrated in

Table 4.7.

Type A, 8-bit: nop, syscall, rfe
    Field:  it  opcode  iid
    Bits:   1   5       2

Type B, 16-bit: mfcz, mtcz
    Field:  it  opcode  rt  rs
    Bits:   1   5       5   5

Type C, 16-bit: mfhi, mflo, mthi, mtlo, jr
    Field:  it  opcode  iid  rd/rs  00
    Bits:   1   5       3    5      2

Type D1, 24-bit: add, addu, and, nor, or, sub, subu, xor
    Field:  it  opcode  iid  rs  rt  rd
    Bits:   1   5       3    5   5   5

Type D2, 24-bit: sllv, srav, srlv, slt, sltu
    Field:  it  opcode  iid  rs  rt  sa
    Bits:   1   5       3    5   5   5

Type E, 24-bit: sll, sra, srl
    Field:  it  opcode  iid  rt  rd  sa  0
    Bits:   1   5       2    5   5   5   1

Type F, 16-bit: jalr, div, divu, mult, multu
    Field:  it  opcode  rs  rt/rd
    Bits:   1   5       5   5

Type G1, 24/32-bit: addi, addiu, lui, slti, sltiu, beq, bgezal, bltzal, bne, lb, lbu ...
    Field:  it  opcode  hl  rs  rt  immediate/offset
    Bits:   1   5       1   5   5   7/15

Type G2, 8/16/24-bit: bczt, bczf
    Field:  it  opcode  hl  offset
    Bits:   1   5       2   0/8/16

Type G3, 16/24/32-bit: bgez, bgtz, blez, bltz
    Field:  it  opcode  iid  hl  rs  0  offset
    Bits:   1   5       2    2   5   1  0/8/16

Type H, 32-bit: j, jal
    Field:  it  opcode  target
    Bits:   1   5       26

Type I, 32-bit: break
    Field:  it  opcode  code  0's
    Bits:   1   5       20    6

Figure. 4.18: HIE2 instruction formats
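
As a quick consistency check on Figure. 4.18, the field widths of each format can be summed and compared with the instruction lengths stated for that type. The Python sketch below does this. The dictionary layout and the names HIE2_FORMATS, STATED_LENGTHS and instruction_lengths are illustrative assumptions; only the widths themselves are taken from the figure.

```python
# Field widths in bits, copied from Figure 4.18. Each entry holds the fixed
# fields followed by the allowed widths of the variable offset/immediate
# field (a single 0 when the format has no variable field).
HIE2_FORMATS = {
    "A":  ([1, 5, 2], [0]),                  # it, opcode, iid
    "B":  ([1, 5, 5, 5], [0]),               # it, opcode, rt, rs
    "C":  ([1, 5, 3, 5, 2], [0]),            # it, opcode, iid, rd/rs, 00
    "D1": ([1, 5, 3, 5, 5, 5], [0]),         # it, opcode, iid, rs, rt, rd
    "D2": ([1, 5, 3, 5, 5, 5], [0]),         # it, opcode, iid, rs, rt, sa
    "E":  ([1, 5, 2, 5, 5, 5, 1], [0]),      # it, opcode, iid, rt, rd, sa, 0
    "F":  ([1, 5, 5, 5], [0]),               # it, opcode, rs, rt/rd
    "G1": ([1, 5, 1, 5, 5], [7, 15]),        # it, opcode, hl, rs, rt + field
    "G2": ([1, 5, 2], [0, 8, 16]),           # it, opcode, hl + offset
    "G3": ([1, 5, 2, 2, 5, 1], [0, 8, 16]),  # it, opcode, iid, hl, rs, 0 + offset
    "H":  ([1, 5, 26], [0]),                 # it, opcode, target
    "I":  ([1, 5, 20, 6], [0]),              # it, opcode, code, 0's
}

# Instruction lengths (bits) stated in the figure for each type.
STATED_LENGTHS = {
    "A": [8], "B": [16], "C": [16], "D1": [24], "D2": [24], "E": [24],
    "F": [16], "G1": [24, 32], "G2": [8, 16, 24], "G3": [16, 24, 32],
    "H": [32], "I": [32],
}


def instruction_lengths(fmt: str) -> list:
    """Possible total lengths (bits) of one format."""
    fixed, variable = HIE2_FORMATS[fmt]
    return [sum(fixed) + width for width in variable]


for name in HIE2_FORMATS:
    assert instruction_lengths(name) == STATED_LENGTHS[name], name
print("all", len(HIE2_FORMATS), "formats add up to their stated lengths")
```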

Table 4.7: Mapping MIPS32 ISA to HIE2 ISA

HIE2   Length     Main opcodes  Instructions  Free   Allotted instructions            No. of
type   (bits)     allotted      allotted      slots                                   RZ
A      8          1             3             1      syscall, nop, rfe                0
B      16         2             2             -      mfcz, mtcz                       0
C      16         1             5             3      mfhi, mflo, mthi, mtlo, jr       2
F      16         5             5             -      jalr, div, divu, mult, multu     0
D1     24         1             8             -      add, addu, and, nor, or, sub,    0
                                                     subu, xor
D2     24         1             5             3      sllv, srav, srlv, slt, sltu      0
E      24         1             3             1      sll, sra, srl                    1
H      32         2             2             -      j, jal                           0
I      32         1             1             -      break                            0
G1     24/32      26            26            -      addi, addiu, andi, ori, xori,    0
                                                     lui, slti, sltiu, beq, bgezal,
                                                     bltzal, bne, lb, lbu, lh, lhu,
                                                     lw, lwcz, lwl, lwr, sb, sh, sw,
                                                     swcz, swl, swr
G2     8/16/24    2             2             -      bczt, bczf                       0
G3     16/24/32   1             4             -      bgez, bgtz, blez, bltz           1

4.5 DISCUSSION ON HIE2 RESULTS

Table 4.8 compares the results of the two HIE versions. In HIE2, the
MiMedia application programs achieve a reduction ranging from 18% to 27%,
the same range as in HIE1. Interestingly, except for two programs, the
improvement of HIE2 over HIE1 is less than 0.5%.

Table 4.8: Comparison of Code reduction schemes HIE1 and HIE2

Application area       Application    HIE1 code     HIE2 code     HIE2 improvement
                       name           reduction %   reduction %   over HIE1
Automotive and         basicmath      20.29         21.55         1.26
Industrial Control     bitcount       26.64         26.80         0.16
                       Qsort          23.67         24.07         0.40
                       susan          26.80         27.03         0.23
Network                dijkstra       21.48         21.59         0.11
                       patricia       21.52         21.63         0.11
                       CRC32          21.49         21.60         0.11
Video                  MPEG2          20.95         21.30         0.35
Audio                  ADPCM          21.49         21.60         0.11
                       lame           17.77         18.08         0.31
Image                  JPEG           26.43         26.97         0.54
                       EPIC           21.40         21.55         0.15
                       fft            21.08         21.45         0.37
                       (susan)        26.80         27.03         0.23
Speech                 GSM            21.44         21.54         0.10
                       G721           21.59         21.70         0.11
                       rsynth         22.23         22.46         0.23
Security               pegwit         21.60         21.70         0.10
                       sha            22.72         22.83         0.11
                       blowfish       21.54         21.65         0.11
                       rjindael       21.65         21.77         0.12
Text                   typeset        20.67         20.81         0.14
                       stringsearch   21.49         21.60         0.11
                       ispell         19.06         19.08         0.02

Thus HIE2 gives marginally better results with regard to the percentage
of code reduction. Moreover, HIE2 is more compiler friendly than HIE1 and
will have minimum performance reduction. HIE2 was expected to give poorer
results than HIE1. However, the reduction achieved by HIE2 is either the
same as that of HIE1 or marginally higher, by 0.02% to 1.26%, as seen in
Table 4.8. The domain-wise comparison of code size reductions by the two
HIE versions is presented in Figure. 4.19. Consistent with Table 4.8,
HIE2 offers better code reduction for all the embedded segments, though
the improvement is negligible in most cases and the maximum increase is
less than 1%. Hence, HIE2 is preferable for embedded SoCs developed for
most segments of embedded applications requiring performance.

Figure. 4.19: Code Reduction Comparison between HIE1 and HIE2

4.5.1 Reduction in Memory Accesses in HIE

Greater code density reduces static code size. This is particularly
important for many embedded systems, especially microcontrollers, since
code memory can be a large fraction of the system cost and can influence
the system's physical size, which affects fitness for purpose and
manufacturing cost. Reducing dynamic code size reduces the bandwidth used
to fetch instructions. This can reduce cost and energy use and can improve
performance. A smaller dynamic code size also reduces the size of the
caches needed for a given hit rate; smaller caches can use less energy and
less chip area and can have lower access latency.

Reduction of switching activity per instruction is related to lowering

bus energy consumption by reducing bit toggles per instruction fetch.

However, total fetch energy should take into account memory access

energy and energy of additional HIE-fetch logic/buffer.

Due to the code size reduction by HIE, the average instruction size has
been reduced from 4 bytes to between 2.93 bytes and 3.29 bytes, as shown
in Table 4.9. The average instruction size works out to 3.12 bytes. This
indicates the extent of energy saving that can be achieved by HIE in
embedded systems. The number of instruction fetches is reduced by the
same percentage as the code size reduction percentage by HIE. The number
of memory cycles per instruction varies from 0.73 to 0.82, as shown in
Table 4.10. The switching power reduces proportionately. However, the
static instruction count is not directly related to bus traffic reduction,
as there is always a difference between the number of instructions in the
code (static count) and the number of instructions fetched and executed by
the processor (dynamic count). Nevertheless, depending on the programs,
the dynamic count may be either directly or indirectly proportional to the
static count. The study by Benini [38] has established the existence of a
trade-off between static code footprint and fetch bandwidth reduction.
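
The per-program figures in Tables 4.9 and 4.10 can be related to one another by simple arithmetic. The Python sketch below shows one plausible way to derive them. It assumes that every MIPS32 fetch transfers 4 bytes in one memory cycle and that the HIE figures scale with the ratio of HIE fetches to RISC fetches; the function name fetch_metrics and this exact formulation are assumptions, not the simulator's documented method.

```python
def fetch_metrics(risc_fetches: int, hie_fetches: int,
                  risc_bytes_per_instr: int = 4) -> dict:
    """Derive fetch savings from the two fetch counts of one program."""
    saved = risc_fetches - hie_fetches
    ratio = hie_fetches / risc_fetches          # HIE fetches per RISC fetch
    return {
        "fetches_saved": saved,
        "reduction_percent": round(100 * saved / risc_fetches, 2),
        "memory_cycles_per_instruction": round(ratio, 2),
        "average_bytes_per_instruction": round(risc_bytes_per_instr * ratio, 2),
    }


# Fetch counts of susan, taken from Table 4.9.
print(fetch_metrics(12750, 9334))
```

For susan this gives 3416 fetches saved, a 26.79% reduction, 0.73 memory cycles per instruction and 2.93 bytes per instruction, which agrees with the susan rows of the two tables.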

Table 4.9: Average Instruction Size in HIE

Embedded domain        Application    No. of RISC    No. of HIE     Reduction in   HIE average
                                      instruction    instruction    instruction    no. of bytes /
                                      fetches        fetches        fetches        instruction
Automotive and         basicmath      1246           994            252            3.18
Industrial Control     bitcount       1067           783            284            2.93
                       Qsort          486            371            115            3.05
                       susan          12750          9334           3416           2.93
Network                dijkstra       115786         90924          24862          3.14
                       patricia       115936         90995          24941          3.14
                       CRC32          115375         90584          24791          3.14
Video                  MPEG2          134082         106005         28077          3.16
Audio                  ADPCM          115221         90469          24752          3.14
                       lame           55973          46027          9946           3.29
Image                  JPEG           17596          12946          4650           2.94
                       EPIC           123807         97313          26494          3.14
                       fft            124660         98385          26275          3.16
                       (susan)        12750          9334           3416           2.93
Speech                 GSM            127402         100098         27304          3.14
                       G721           117785         92364          25421          3.14
                       rsynth         6556           5099           1457           3.11
Security               pegwit         127658         100084         27574          3.14
                       sha            1015           804            211            3.17
                       blowfish       116651         91532          25119          3.14
                       rjindael       119116         93333          25783          3.13
Text                   typeset        126313         100252         26061          3.17
                       stringsearch   115574         90747          24827          3.14
                       ispell         12080          9778           2302           3.24

Table 4.10: Memory Cycles for Instruction Fetch in HIE

Embedded domain        Application    % reduction in HIE    No. of memory cycles
                                      instruction fetch     per instruction
Automotive and         basicmath      20.22                 0.80
Industrial Control     bitcount       20.62                 0.73
                       Qsort          23.66                 0.76
                       susan          26.79                 0.73
Network                dijkstra       21.47                 0.79
                       patricia       21.51                 0.78
                       CRC32          21.49                 0.79
Video                  MPEG2          20.94                 0.79
Audio                  ADPCM          21.48                 0.79
                       lame           17.77                 0.82
Image                  JPEG           26.43                 0.74
                       EPIC           21.40                 0.79
                       fft            21.08                 0.79
                       (susan)        26.79                 0.73
Speech                 GSM            21.43                 0.79
                       G721           21.58                 0.78
                       rsynth         22.22                 0.78
Security               pegwit         21.60                 0.78
                       sha            20.79                 0.79
                       blowfish       21.53                 0.78
                       rjindael       21.65                 0.78
Text                   typeset        20.63                 0.79
                       stringsearch   21.48                 0.79
                       ispell         19.05                 0.81


4.5.2 Reduction in Redundant zeros in HIE

A major factor contributing to code reduction in HIE is the reduction of
redundant zeros (RZ) in the HIE codes compared to the MIPS32 code. Among
the MIPS32 integer instructions, 35 instructions have redundant zeros. On
the other hand, the HIE1 and HIE2 codes have only five and nine
instructions respectively with redundant zeros. Table 4.11 compares the
net RZs of the HIE1 and HIE2 codes with the RZ of the MIPS32 code.

Table 4.12 summarizes the Percentage of Code Reduction (PCR) by the HIE
technique for the 23 benchmark programs, classified according to their
sizes. Table 4.13 compares the HIE PCR with the total wastage, including
RZ and WASTIO. Interestingly, the extent of code reduction by HIE is
approximately equal to the extent of code wastage estimated in Chapter 3.
The PCR is either equal to the total wastage percentage or higher by one
or two percentage points. A relationship between the code size reduction
in HIE and three properties of MIPS32 object codes - FTFI, RZ and WASTIO -
was suspected in Chapter 3. Though the FTFI gives an idea about the scope
for code reduction, it is not the only deciding factor. In most cases, the
code size reduction is higher for those programs that have a higher share
of the four major instructions and a higher degree of underutilization of
the immediate and offset fields. This behavior forms the backbone of our
HIE methodology. However, some programs show marginal exceptions.

Table 4.11: Comparison of RZs of Embedded Applications

Embedded domain        Benchmark      % RZ    % RZ    % RZ
                                      MIPS    HIE1    HIE2
Automotive and         basicmath      12      0.06    0.17
Industrial Control     bitcount       12      0.10    0.37
                       Qsort          13      0.17    0.41
                       susan          8       0.02    0.06
Network                dijkstra       7       0.07    0.18
                       patricia       7       0.07    0.18
                       CRC32          7       0.07    0.18
Video                  MPEG2          7       0.06    0.17
Audio                  ADPCM          7       0.07    0.18
                       lame           5       0.02    0.11
Image                  JPEG           11      0.26    0.27
                       EPIC           7       0.07    0.18
                       fft            7       0.06    0.18
                       (susan)        8       0.02    0.06
Speech                 GSM            7       0.06    0.18
                       G721           7       0.07    0.18
                       rsynth         11      0.06    0.22
Security               pegwit         7       0.06    0.17
                       sha            5       0.08    0.22
                       blowfish       7       0.07    0.18
                       rjindael       7       0.07    0.18
Text                   typeset        5       0.02    0.06
                       stringsearch   7       0.07    0.18
                       ispell         7       0.01    0.22

Table 4.12: Typical Code Size Reduction of Embedded Applications
in HIE

PCR        Small (< 10KB)    Medium (10KB-100KB)   Large (>100KB)
Below 20   -                 ispell                lame
20-25      basicmath,        rsynth                typeset, fft, CRC32, dijkstra,
           qsort, sha                              patricia, blowfish, rijndael,
                                                   adpcm, gsm, pegwit, mpeg2,
                                                   g721, epic, stringsearch
Above 25   bitcount          susan                 jpeg

For instance, in sha only 42% of the instructions belong to the four
major instructions and only 16% of the code is wasted due to
underutilization of the immediate and offset fields. In spite of this,
HIE achieves a 23% code size reduction for sha. This could be due to the
larger number of R-type instructions in the MIPS32 code for sha. These
instructions have been reduced to 24 bits in the HIE-MIPS code.

Table 4.13: Relationship between RZ, WASTIO and HIE PCR

Embedded domain        Benchmark      % RZ   % WASTIO   A = % RZ +   B = HIE   Difference
                                                        % WASTIO     PCR       B-A
Automotive and         basicmath      12     9          21           22        1
Industrial Control     bitcount       12     14         26           27        1
                       Qsort          13     11         24           24        0
                       susan          8      17         25           27        2
Network                dijkstra       7      13         20           22        2
                       patricia       7      13         20           22        2
                       CRC32          7      13         20           22        2
Video                  MPEG2          7      13         20           21        1
Audio                  ADPCM          7      13         20           22        2
                       lame           5      12         17           18        1
Image                  JPEG           11     14         25           27        2
                       EPIC           7      13         20           22        2
                       fft            7      13         20           21        1
                       (susan)        8      17         25           27        2
Speech                 GSM            7      13         20           22        2
                       G721           7      13         20           22        2
                       rsynth         11     11         22           22        0
Security               pegwit         7      13         20           22        2
                       sha            5      17         22           23        1
                       blowfish       7      13         20           22        2
                       rjindael       7      13         20           22        2
Text                   typeset        5      16         21           21        0
                       stringsearch   7      13         20           22        2
                       ispell         7      12         19           19        0


4.6 PROCESSOR MODIFICATIONS TO SUPPORT HIE

The processor has to manage the non-uniform size of instructions in HIE.
The instruction fetch logic requires dual instruction buffers, and the
instruction decoder needs to be more elaborate. With fixed instruction
encoding, the instruction length is known to the processor in advance.
Hence, during sequential program execution, the PC is simply incremented
by the fixed instruction length in bytes. In HIE, the instructions have
different lengths. Hence the instruction being executed must be decoded to
determine its width. This is somewhat less of a problem for
implementations that decode only one instruction per cycle. For wider
(superscalar) decode, several tricks are available to reduce the cost of
parsing individual instructions out of a block of instruction memory. One
technique is to use marker bits to indicate the start or end of an
instruction. Such marker bits would be set for each parcel of instruction
encoding and stored in the instruction cache. Several AMD x86
implementations have used marker bit techniques. Alternatively, marker
bits could be included in the instruction encoding, as done in HIE. This
places some constraints on opcode assignment and placement, since the
marker bits effectively become part of the opcode. Another technique, used
by the IBM zSeries (S/360 and descendants), is to encode the instruction
length in a simple way in the opcode in the first parcel. The zSeries uses
two bits to encode three different instruction lengths (16, 32 and 48
bits), with two encodings used for the 32-bit length. By placing these
bits in a fixed position, it is relatively easy to determine quickly where
the next sequential instruction begins. With hybrid/variable length
instructions, part of the opcode must often be decoded before the basic
parsing of the instruction can be started, as discussed in Chapter 6. This
tends to delay the availability of register names and other, less critical
information. With more complex implementations (deeper pipelines,
out-of-order execution, etc.), the extra relative complexity of handling
variable length instructions is reduced. After instruction decode, a
sophisticated implementation of an ISA with variable length instructions
tends to look very similar to one of an ISA with fixed length
instructions.
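
The length-in-opcode technique can be illustrated with a short Python sketch. It follows the published z/Architecture convention in which the two most significant bits of the first opcode byte select the length (00 gives 2 bytes, 01 and 10 give 4 bytes, 11 gives 6 bytes). The function names and the sample byte string are hypothetical and serve only as an illustration.

```python
def zseries_instruction_length(first_byte: int) -> int:
    """Instruction length in bytes from the two msbs of the first opcode byte."""
    top_two_bits = (first_byte >> 6) & 0b11
    return {0b00: 2, 0b01: 4, 0b10: 4, 0b11: 6}[top_two_bits]


def split_instructions(code: bytes) -> list:
    """Return (offset, length) pairs for a block of sequential instructions."""
    boundaries = []
    offset = 0
    while offset < len(code):
        length = zseries_instruction_length(code[offset])
        boundaries.append((offset, length))
        offset += length                  # next instruction starts here
    return boundaries


# Illustrative byte string: first bytes 0x18, 0x58 and 0xD2 start a
# 2-byte, a 4-byte and a 6-byte instruction respectively.
block = bytes([0x18, 0x12,
               0x58, 0x10, 0x20, 0x00,
               0xD2, 0x03, 0x10, 0x00, 0x20, 0x00])
print(split_instructions(block))
```

Because only the first byte is inspected, the start of the next instruction is known before the rest of the opcode is decoded, which is the property the text attributes to this scheme.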

In order to handle interrupts, the PC should have an auxiliary register.
The example of the Motorola MC 680X0 processor is relevant to our case.
The MC 680X0 uses variable length instructions, some using a second word
to complete the instruction specification, whereas others use extension
words to complete certain address specifications. There are two PCs: a
conventional PC register and a scanPC register that is not visible to the
programmer. The scanPC keeps track of the words that have been
interpreted. For instruction retry, and for return after interrupts that
occur within delay slots, the conventional PC is used. For instruction
fetches, the scanPC is used. Any processor supporting variable-length
instructions has to check whether each instruction straddles a cache line
or virtual memory page boundary. This aspect is further discussed in
Chapter 6.
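
The boundary check mentioned above reduces to comparing the block numbers of the first and last bytes of an instruction. A minimal Python sketch follows; the function name straddles_boundary and the block sizes used in the example (a 32-byte cache line and a 4096-byte page) are assumptions chosen for illustration.

```python
def straddles_boundary(address: int, length: int, block_size: int) -> bool:
    """True if the bytes [address, address + length) cross a block boundary."""
    first_block = address // block_size
    last_block = (address + length - 1) // block_size
    return first_block != last_block


# A 3-byte (24-bit) instruction at byte 30 of a 32-byte cache line spills
# into the next line; the same instruction at byte 29 does not.
print(straddles_boundary(30, 3, 32), straddles_boundary(29, 3, 32))
```

With fixed 4-byte, 4-byte-aligned instructions this check is never true, which is why it appears only as a cost of hybrid or variable-length encodings.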

In HIE, the instruction decoder should expand the immediate and offset
fields to 16 bits based on the hl bit pattern. The four different actions
needed are as follows:

1. If hl=00, 16 zeros should be forced into the offset/immediate
   register.
2. If hl=01, eight zeros should be entered in the most significant byte
   position of the offset/immediate register and the eight-bit contents
   of the offset/immediate field in the instruction should be copied to
   the least significant byte position of the offset/immediate register.
3. If hl=10, eight zeros should be entered in the least significant byte
   position of the offset/immediate register and the eight-bit contents
   of the offset/immediate field in the instruction should be copied to
   the most significant byte position of the offset/immediate register.
4. If hl=11, the offset/immediate field should be copied to the
   offset/immediate register.
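
The expansion of the offset/immediate field by the decoder can be sketched in Python as follows. The sketch assumes a two-bit hl field with the behaviour described for the four patterns (zero, low byte, high byte, full 16 bits); the treatment of hl=11 as a plain copy of a 16-bit field, and the function name expand_field, are assumptions made for illustration.

```python
def expand_field(hl: int, field: int) -> int:
    """Return the 16-bit offset/immediate register value for a given hl."""
    if hl == 0b00:
        return 0x0000                      # no field present: force 16 zeros
    if hl == 0b01:
        return field & 0xFF                # 8-bit field in the low byte
    if hl == 0b10:
        return (field & 0xFF) << 8         # 8-bit field in the high byte
    if hl == 0b11:
        return field & 0xFFFF              # assumed: full field copied as is
    raise ValueError("hl must be a two-bit value")


for hl, field in [(0b00, 0x00), (0b01, 0x3C), (0b10, 0x3C), (0b11, 0x12AB)]:
    print(format(hl, "02b"), format(expand_field(hl, field), "04x"))
```

In hardware this corresponds to a small multiplexer in front of the offset/immediate register, selected by the hl bits.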


The above changes are minor compared to the gains in code size
reduction, since the area occupied by the processor is typically around
5% of an SoC whereas the memory takes several times that area, depending
on the nature of the embedded application.

4.7 CONCLUSIONS

This chapter has proposed Hybrid Instruction Encoding (HIE) in place of
Fixed Instruction Encoding so as to reduce the code memory size in SoCs.
An HIE-ISA supporting multiple instruction sizes and four options for the
immediate and offset fields has been proposed for RISC processors. HIE
has been simulated with four instruction sizes for the MIPS32 processor,
and the results show code size reduction of up to 27%. Experiments have
been conducted with twenty-three benchmark programs collected from the
MiBench and MediaBench suites, using the custom-built static simulator.
Except for two programs, all the others were reduced by more than 20%. Of
the large programs, one was reduced by more than 25% and another by less
than 20%, while the remaining 14 were reduced by between 20% and 25%.
Considering the significant savings in code memory and chip space in
SoCs, development of dedicated HIE-RISC processor cores for the embedded
market is recommended.

The instruction fetch and decode logic needs to manage hybrid instruction
lengths and multiple sizes of the offset and immediate fields. These
hardware changes do not need much additional space in the processor. The
processor itself occupies less area than the on-chip memory in embedded
SoCs, and hence HIE reduces the overall chip area for SoCs. In HIE2, the
immediate/offset field has been made 15 bits compared to 16 bits in
MIPS32. It has been established that this will have negligible impact on
most embedded programs. This research work has estimated the static code
size reduction for the HIE-based ISA; dynamic simulation has not been
done for evaluating performance and power consumption. Marginal
performance reduction can be tolerated for BOPES in view of the savings in
chip space and power consumption. However, the use of parallelism with
superscalar architecture or multicore SoCs can compensate for the
performance loss of a single processor core. There are additional
techniques of code size reduction that can be integrated with HIE to
increase the extent of code size reduction. The next two chapters deal
with such techniques.


5. REGISTER MEMORY ARCHITECTURE FOR RISC CORES

The Load-Store architecture (LSA) of RISC processors is one of the

factors contributing to increase in code memory space in embedded

systems. This chapter explores achieving code size reduction by

incorporating Register-memory architecture (RMA) in embedded RISC

processors. Modifications required in an existing RISC processor to

incorporate RMA arithmetic/logical instructions are discussed. As a case

study, MIPS32 instruction set is enhanced with 12 new instructions

supporting memory operations. The experiments on object codes of

MiMedia benchmarks yield varying extent of code size reduction. The rest

of the chapter is organized as follows. Section 5.1 discusses the motivation

for the RMA architecture. Section 5.2 explains the methodology used for

introducing RMA in MIPS processor. Section 5.3 gives measurements

estimating the resulting code space reduction due to RMA and discusses

the results. Section 5.4 discusses the hardware changes required in RISC

processors to support RMA. Section 5.5 presents the conclusions.

5.1 MOTIVATION FOR RMA

The original objective of choosing LSA in RISC architecture was to
simplify the instruction pipeline and increase processor performance. In
a load-store ISA, only load and store instructions can access memory
operands, and the arithmetic/logical instructions can operate only on
register operands. Since arithmetic and logical operations on memory
operands are not permitted, the compiler must use a load instruction
before an add instruction to move the data from memory to a register.
Similarly, the processor stores the result of an add instruction only in a
register. Hence the compiler has to use a store instruction after the add
instruction to move the result to main memory. This restriction results in
a 50% to 100% increase in data transfer instructions in RISC processors,
as pointed out in Chapter 1. The extent of usage of data transfer and
arithmetic instructions in the MIPS object codes for the MiMedia
benchmarks has been estimated using MIDA.

Table 5.1: Data transfer Vs Arithmetic Instructions in Embedded
Applications

Embedded domain        Benchmark      No. of integer   % of LSI   % of ALUI
                                      instructions     class      class
Automotive and         basicmath      905              18         30
Industrial Control     bitcount       1047             36         42
                       Qsort          463              24         49
                       susan          12317            50         35
Network                dijkstra       114034           35         40
                       patricia       114181           35         40
                       CRC32          113623           35         40
Video                  MPEG2          137957           32         40
Audio                  ADPCM          113469           35         40
                       lame           44416            27         37
Image                  JPEG           17267            39         36
                       EPIC           121390           34         40
                       fft            119767           33         39
                       (susan)        12317            50         35
Speech                 GSM            124978           34         41
                       G721           116002           35         40
                       rsynth         6202             25         43
Security               pegwit         125631           34         41
                       sha            1038             59         31
                       blowfish       114810           35         40
                       rjindael       117170           35         41
Text                   typeset        123805           47         33
                       stringsearch   113827           35         40
                       ispell         12021            31         43

Table 5.1 lists the percentage distribution of data transfer instructions
and arithmetic instructions in the object codes of 23 different benchmark
programs. For the ALU instructions (ALUI), the operands are in registers
and the results are stored in registers. In Load and Store instructions
(LSI), one operand is in a register and the other operand is in memory.

The address of the memory operand is generally specified as the sum of
two parts: the base register contents and an offset in the immediate
field. The proposal in this thesis is to support one memory operand in ALU
instructions, thereby adding a new class of register-memory instructions
to the existing register-register ALU instructions. As a result, the
instruction set is marginally enhanced. Though compiler and processor
modifications are required, these are one-time efforts by the processor
manufacturers and compiler developers, and there is no additional burden
on embedded system developers. It is also a program-independent solution
for embedded applications. This strategy can be combined with other
methods of code size reduction to achieve additional code size reduction.

5.2 METHODOLOGY FOR RMA ALU INSTRUCTIONS

Most arithmetic operations can be carried out in MIPS in both R-Type
format and I-Type format. In R-Type addition, both the source operands
are available in registers and the result is placed in the destination
register. In I-Type addition, one source operand is in a register and the
other source operand is available in the instruction as an immediate
operand. There are four different Add instructions in MIPS, as defined in
Table 5.2. The first two instructions, ADD and ADDU, follow the R-Type
format. In both these instructions, the contents of two registers are
added. ADD can cause an overflow exception, in which case no result is
written; ADDU never raises an overflow exception. ADDI and ADDIU follow
the I-Type format, and there is no wastage of instruction length. These
two instructions add the register content to an immediate operand present
in the instruction itself.

Table 5.2: MIPS ADD Instructions

MIPS          Meaning          Purpose                          Description
instruction

ADD           Add Word         To add 32-bit integers.          rt is added with rs. If no
                                                                ‘overflow’, the result is
                                                                stored in rd.

ADDU          Add Unsigned     To add 32-bit integers;          rt is added with rs and the
              Word             modulo arithmetic. Applicable    arithmetic result is stored
                               for address calculation, or      in rd.
                               integer arithmetic
                               environments that ignore
                               overflow.

ADDI          Add Immediate    To add a constant to a 32-bit    The signed immediate is added
              Word             integer.                         with rs. Storing result is
                                                                similar to ADD.

ADDIU         Add Immediate    To add a constant to a 32-bit    The signed immediate is added
              Unsigned Word    integer; functionally similar    with rs and the arithmetic
                               to ADDU.                         result is stored in rd.

5.2.1 Formats for ADDrm Instruction for MIPS

Both R-Type and I-Type instruction formats have been designed for the
ADDrm instruction for MIPS, as shown in Figures. 5.1 and 5.2. Table 5.3
lists the four types of RMA-add instructions supporting a memory operand.
In the RM-Type format, the register Rs-l is used as the base register and
an eight-bit offset specifies the memory operand. The register operand is
in Rt-rm, and the Rd-r field gives the destination register operand,
retaining the three-address format.

Table 5.3: Proposed RMA ADD Instructions for MIPS

Instruction   Meaning          Purpose                          Description

ADD-RM        Add Word         To add two 32-bit integers,      The signed offset is added to
              -reg,mem         one of them is in a register     the base, Rs-l, to get the
                               and the other in memory.         address of the first operand.
                                                                The memory word is added with
                                                                Rt-rm. If no ‘overflow’, the
                                                                result is stored in Rd-r.

ADDU-RM       Add Unsigned     To add 32-bit integers, one      The signed offset is added to
              Word-reg,mem     of them is in a register and     the base, Rs-l, to get the
                               the other in memory;             address of the first operand.
                               functionally similar to ADDU.    The memory word is added with
                                                                Rt-rm and the arithmetic
                                                                result is stored in Rd-r.

ADDI-RM       Add Immediate    To add a constant to a 32-bit    The signed immediate is added
              Word-reg,mem     integer in memory.               to the memory word at the
                                                                address formed by Rs-l and
                                                                offset. Functionally similar
                                                                to ADD-RM. If no ‘overflow’,
                                                                the result is stored in Rt-i.

ADDIU-RM      Add Immediate    To add a constant to a 32-bit    The signed immediate is added
              Unsigned         integer in memory;               to the memory word at the
              Word-reg,mem     functionally similar to          address formed by Rs-l and
                               ADDU-RM.                         offset and the arithmetic
                                                                result is stored in Rt-i.

Field:  Op      Rs-l    Rt-rm   Rd-r    Offset-l  Opx-rm
Bits:   31-26   25-21   20-16   15-11   10-3      2-0

Figure. 5.1: RMA Instruction Format – RM Type

Field:  Op      Rt-i    Rs-l    Offset-l  Immediate
Bits:   31-26   25-21   20-16   15-8      7-0

Figure. 5.2: RMA Instruction Format – IM Type

In the IM format, Rs-l is used as the base register and an eight-bit
offset specifies the memory operand. The immediate field gives the other
operand, and Rt-i gives the destination register. For generating the
memory address, the datapath already present for load type instructions
can be used. Hence only additional control signals have to be generated by
the opcode decoder.
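
The RM-type layout of Figure. 5.1 can be exercised with a small packing and unpacking sketch in Python. The bit positions are taken from the figure; the table name RM_FIELDS, the function names and the sample field values (including the opcode 0b111000 and Opx-rm value 0b001) are hypothetical placeholders, since the actual opcode assignments are not reproduced here. The offset is handled as a raw eight-bit field.

```python
# (name, lowest bit position, width) for each field of the RM-type format.
RM_FIELDS = [
    ("op", 26, 6), ("rs_l", 21, 5), ("rt_rm", 16, 5),
    ("rd_r", 11, 5), ("offset_l", 3, 8), ("opx_rm", 0, 3),
]


def encode_rm(**fields: int) -> int:
    """Pack the six RM-type fields into one 32-bit instruction word."""
    word = 0
    for name, shift, width in RM_FIELDS:
        value = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
    return word


def decode_rm(word: int) -> dict:
    """Unpack a 32-bit RM-type word into its fields."""
    return {name: (word >> shift) & ((1 << width) - 1)
            for name, shift, width in RM_FIELDS}


# Placeholder values: opcode and opx_rm are NOT real RMA assignments.
sample = dict(op=0b111000, rs_l=29, rt_rm=8, rd_r=9, offset_l=0x10, opx_rm=0b001)
word = encode_rm(**sample)
print(hex(word), decode_rm(word) == sample)
```

The field widths add up to 32 bits (6 + 5 + 5 + 5 + 8 + 3), so the RM-type instruction keeps the MIPS32 word size while carrying both a base-plus-offset memory operand and a three-address register specification.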

5.2.2 Proposed RMA opcodes

The RMA instructions introduce two new formats. For the RM format, the
6 msbs (bits 31-26) give the operation type, whereas the 3 lsbs (bits 2-0)
define the nature of the operation. For the IM format, the 6 msbs alone
indicate the operation. The offset and immediate fields are only 8 bits
wide, which may pose a challenge to the compiler. MIDACC converts LSA
instructions into RMA instructions only if this length limitation is
satisfied. The strategy used by the RMA simulator to modify the R-Type and
I-Type ADD instructions and generate a composite RMA instruction is as
follows. MIDACC scans the MIPS object codes and estimates the scope for
RMA instructions in the given program by searching for appropriate
sequences of Load word (LW)/Store word (SW) and ALU instructions. The
formats of the LW and SW instructions are shown in Figure. 5.3 (a) and
Figure. 5.3 (b) respectively.

OP Rs-l Rt-l Offset-l

31-26 25-21 20-16 15-0

Figure. 5.3 (a): Format of LW Instruction

OP Rs-s Rt-s Offset-s

31-26 25-21 20-16 15-0

Figure. 5.3 (b): Format of SW Instruction

MIDACC inserts RMA instructions by performing the following actions (a)-(e):

(a) Scans the object code for an ‘LSA sequence’, that is, a Load preceding an ALU type instruction or a Store following an ALU type instruction.

(b) Determines whether the ‘LSA sequence’ qualifies for RMA conversion or not.

(c) If it is a qualifying sequence, the ‘LSA sequence’ is deleted and the appropriate RMA instruction is inserted. Further, all the subsequent instruction addresses are decremented by 4.

(d) An RMA table is created indicating the original instruction addresses at which RMA instructions have been inserted (for dynamic simulation assistance).

(e) Further, a ‘Step-1 Jump address table’ is created indicating the original JUMP instruction address, the step-1 compressed JUMP instruction address and the new jump target address.
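The address bookkeeping in actions (c) and (e) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the MIDACC implementation: the function names are hypothetical, the RMA instruction is assumed to take the address of the first instruction of the sequence, and jump targets are assumed never to point at a deleted instruction. The sample addresses are those of the loop in Figure. 5.6.

```python
# Hypothetical helpers illustrating actions (c) and (e); not the actual MIDACC code.

def build_address_map(addresses, deleted):
    """Map each surviving original address to its address after compression.

    addresses: original instruction addresses in program order.
    deleted:   addresses of instructions removed by RMA conversion.
    """
    new_addr = {}
    shift = 0
    for addr in addresses:
        if addr in deleted:
            shift += 4  # every removed 32-bit instruction frees 4 bytes
            continue
        new_addr[addr] = addr - shift
    return new_addr


def step1_jump_table(jumps, new_addr):
    """Build rows of (original jump address, compressed jump address, new target).

    jumps: list of (jump_address, target_address) pairs in the original code.
    Assumption: no target refers to a deleted instruction.
    """
    return [(j, new_addr[j], new_addr[t]) for j, t in jumps]


# Loop of Figure 5.6: the LW at 8012 and the ADD at 8016 become one RMA
# instruction at 8012, so the instruction at 8016 is the one that disappears.
original = list(range(8000, 8028, 4))
address_map = build_address_map(original, deleted={8016})
table = step1_jump_table([(8024, 8000)], address_map)
```

In this example the branch moves from 8024 to 8020 while its target stays at 8000, which corresponds to the branch offset changing from -24 to -20 in Figure. 5.6.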

Table. 5.4: Types of ALU instruction for RMA load sequence

MIPS instruction   Format   Opcode, op (bits 31-26)   Opx (Function) (bits 5-0)
ADD                R        000000                    100000
ADDU               R        000000                    100001
ADDI               I        001000                    NA
ADDIU              I        001001                    NA
SUB                R        000000                    100010
SUBU               R        000000                    100011
AND                R        000000                    100100
ANDI               I        001100                    NA
OR                 R        000000                    100101
ORI                I        001101                    NA
XOR                R        000000                    100110
NOR                R        000000                    100111

Table. 5.5: Types of ALU instruction for RMA store sequence

MIPS instruction   Format   Opcode, op (bits 31-26)   Opx/Function (bits 5-0)
ADD                R        000000                    100000
ADDU               R        000000                    100001
SUB                R        000000                    100010

Table 5.4 is used to identify the ALU type instruction in the RMA code conversion for the sequence of a Load preceding an ALU type instruction. Table 5.5 is used to identify the ALU type instruction in the RMA code conversion for the sequence of a Store instruction following an ALU type instruction. The qualifying conditions for LW preceding an R-Type ALU instruction (Figure. 5.4) are as follows:

1. Numeric value of ‘Offset-l’ should not exceed 8-bits.

2. Rt-l should be equal to either Rs-r or Rt-r. If Rt-l is equal to

Rs-r, Rs-r is dropped and Rt-r is renamed as Rt-rm. If Rt-l is

equal to Rt-r, Rt-r is dropped and Rs-r is renamed as Rt-rm.

000000 Rs-r Rt-r Rd-r 00000 100000

Figure. 5.4: R-Type ADD instruction in MIPS

The qualifying conditions for LW preceding an I-Type ALU instruction (Figure. 5.5) are as specified below:

1. Numeric value of ‘Immediate’ should not exceed 8-bits.

2. Rt-l should be equal to Rs-i. If it is equal, the Rt-l is dropped.

001000 rs-i rt-i Immediate

Figure. 5.5: I-Type ADD instruction in MIPS

Qualifying condition for SW following R-Type ALU Instruction:

1. Numeric value of ‘Offset-s’ should not exceed 8-bits.

2. Rd-r should be equal to Rt-s. If Rt-s is equal to Rd-r, both Rt-s

and Rd-r are dropped.
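The three sets of qualifying conditions above can be expressed as a small Python sketch. The function and dictionary key names are hypothetical, the 8-bit limit is interpreted as an unsigned range (an assumption, since the text only says the value should not exceed 8 bits), and the offset check in the I-Type case is inferred from the 8-bit offset field of the IM format. Only the stated conditions are checked.

```python
# Hypothetical qualification checks; instructions are plain dicts of decoded fields.

def fits_8_bits(value):
    # Assumption: the 8-bit limit is read as the unsigned range 0..255.
    return 0 <= value <= 0xFF


def merge_lw_rtype(lw, alu):
    """LW followed by an R-Type ALU instruction. Returns RM fields or None."""
    if not fits_8_bits(lw["offset"]):
        return None
    if lw["rt"] == alu["rs"]:      # Rt-l == Rs-r: drop Rs-r, Rt-r becomes Rt-rm
        rt_rm = alu["rt"]
    elif lw["rt"] == alu["rt"]:    # Rt-l == Rt-r: drop Rt-r, Rs-r becomes Rt-rm
        rt_rm = alu["rs"]
    else:
        return None
    return {"rs_l": lw["rs"], "rt_rm": rt_rm, "rd_r": alu["rd"], "offset": lw["offset"]}


def merge_lw_itype(lw, alu):
    """LW followed by an I-Type ALU instruction. Returns IM fields or None."""
    if not fits_8_bits(alu["imm"]) or not fits_8_bits(lw["offset"]):
        return None
    if lw["rt"] != alu["rs"]:      # Rt-l must equal Rs-i
        return None
    return {"rs_l": lw["rs"], "rt_i": alu["rt"], "offset": lw["offset"], "imm": alu["imm"]}


def merge_rtype_sw(alu, sw):
    """R-Type ALU instruction followed by SW. Returns merged fields or None."""
    if not fits_8_bits(sw["offset"]):
        return None
    if alu["rd"] != sw["rt"]:      # Rd-r must equal Rt-s; both are dropped
        return None
    return {"rs_s": sw["rs"], "rs_r": alu["rs"], "rt_r": alu["rt"], "offset": sw["offset"]}


# Lw r6, 0(r7) followed by Add r1, r1, r6 (the pair merged in Figure 5.6).
lw = {"rs": 7, "rt": 6, "offset": 0}
add = {"rs": 1, "rt": 6, "rd": 1}
rm_fields = merge_lw_rtype(lw, add)
```

For this pair the loaded register r6 matches Rt-r, so Rs-r (r1) is renamed as Rt-rm and the merged instruction uses r7 as base, r1 as register operand and r1 as destination.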

Table 5.6: RMA Instructions Corresponding to MIPS32 Instructions for Load Sequence

MIPS32 instruction                                  New RMA instruction
Name    Type  Opcode, op    Opx/Function   Name       Type  OP            Opx-rm
              (bits 31-26)  (bits 5-0)                      (bits 31-26)  (bits 2-0)
ADD     R     000000        100000         ADD-rm     RM    101101        000
ADDU    R     000000        100001         ADDU-rm    RM    101101        001
ADDI    I     001000        NA             ADDI-rm    IM    101111        NA
ADDIU   I     001001        NA             ADDIU-rm   IM    110101        NA
SUB     R     000000        100010         SUB-rm     RM    101101        010
SUBU    R     000000        100011         SUBU-rm    RM    101101        011
AND     R     000000        100100         AND-rm     RM    101101        100
ANDI    I     001100        NA             ANDI-rm    IM    110110        NA
OR      R     000000        100101         OR-rm      RM    101101        101
ORI     I     001101        NA             ORI-rm     IM    111110        NA
XOR     R     000000        100110         XOR-rm     RM    101101        110
NOR     R     000000        100111         NOR-rm     RM    101101        111

Table 5.7: RMA Instructions Corresponding to MIPS32 Instructions for Store Sequence

MIPS32 instruction                                        New RMA-st instruction
Instruction  Type  Opcode, op    Opx/Function   Instruction  Type  OP            Opx-rmst
                   (bits 31-26)  (bits 5-0)                        (bits 31-26)  (bits 2-0)
ADD          R     000000        100000         ADD-rmst     RM    011111        101
ADDU         R     000000        100001         ADDU-rmst    RM    011111        110
SUB          R     000000        100010         SUB-rmst     RM    011111        111

Tables 5.6 and 5.7 give the RMA instructions corresponding to the MIPS instructions, created for a load sequence and a store sequence respectively.

5.2.3 Estimates on Code Size Reduction

The extent of code size reduction with RMA instructions varies with

the nature of embedded programs. Static simulation has been conducted

with MiMedia source programs using MIDACC and the results are

discussed in section 5.3. A sample illustration is given in Figure. 5.6 with a

C code for a loop operating on an array of 100 elements [27]. The

assembly codes and object codes for both LSA and RMA are manually

generated.

C Code for a Loop with variable array index [12]

Loop: g = g + A[i];
      i = i + j;
      if (i != h) go to Loop;

LSA: assembly code and object code (all numbers in decimal)

Assembly code            Comments                  Address   Object code
Loop: Add r7, r3, r3     ; r7 = 2 * i              8000      0 3 3 7 0 32
      Add r7, r7, r7     ; r7 = 4 * i              8004      0 7 7 7 0 32
      Add r7, r7, r5     ; r7 = address of A[i]    8008      0 7 5 7 0 32
      Lw r6, 0 (r7)      ; r6 = A[i]               8012      35 7 6 0
      Add r1, r1, r6     ; g = g + A[i]            8016      0 1 6 1 0 32
      Add r3, r3, r4     ; i = i + j               8020      0 3 4 3 0 32
      Bne r3, r2, Loop   ; go to Loop if i ≠ h     8024      5 3 2 -24

RMA: assembly code and object code (all numbers in decimal)

Assembly code               Comments                  Address   Object code
Loop: Add r7, r3, r3        ; r7 = 2 * i              8000      0 3 3 7 0 32
      Add r7, r7, r7        ; r7 = 4 * i              8004      0 7 7 7 0 32
      Add r7, r7, r5        ; r7 = address of A[i]    8008      0 7 5 7 0 32
      Addrm r1, r1, 0 (r7)  ; g = g + A[i]            8012      0 7 1 1 0 40
      Add r3, r3, r4        ; i = i + j               8016      0 3 4 3 0 32
      Bne r3, r2, Loop      ; go to Loop if i ≠ h     8020      5 3 2 -20

Figure. 5.6: Comparison of object codes of LSA and RMA

The array’s base is in register r5; registers r1, r2, r3 and r4 are allotted to the variables g, h, i and j; r6 and r7 are used as temporary registers. The LSA assembly code needs seven statements whereas the RMA assembly code needs only six. The LSA object code occupies 28 bytes whereas the RMA object code occupies 24 bytes, a reduction of roughly 14% in code space. During execution of the program, the loop is executed 100 times, and one fewer instruction is fetched in each iteration. Hence, in absolute terms, there is a considerable reduction in memory I/O bandwidth, which in turn reduces power consumption.
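The arithmetic behind these figures can be checked with a few lines of Python. The helper name is hypothetical; the instruction counts and the iteration count are those stated above, and each instruction is assumed to occupy 4 bytes as in MIPS32.

```python
# Worked arithmetic for the loop of Figure 5.6 (helper name is hypothetical).

def percent_code_reduction(original_bytes, reduced_bytes):
    """Percentage by which the object code shrinks."""
    return 100.0 * (original_bytes - reduced_bytes) / original_bytes


INSTRUCTION_BYTES = 4            # fixed 32-bit instructions
lsa_bytes = 7 * INSTRUCTION_BYTES    # seven LSA statements -> 28 bytes
rma_bytes = 6 * INSTRUCTION_BYTES    # six RMA statements  -> 24 bytes

pcr = percent_code_reduction(lsa_bytes, rma_bytes)   # about 14.3 percent

iterations = 100
fetches_saved = (7 - 6) * iterations                 # one fetch saved per iteration
bytes_not_fetched = fetches_saved * INSTRUCTION_BYTES
```

The static saving is a single instruction, but because the instruction lies inside the loop body, the saving in instruction fetch traffic scales with the number of iterations.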

5.3 RESULTS AND DISCUSSION

MIDACC has been used to simulate the RMA environment for the object codes of the MiMedia suite. The experiments were conducted on an Intel PC running the Linux operating system. The object codes are converted from LSA to RMA by MIDACC to estimate the code size reduction due to RMA for various embedded applications. In addition to inserting the RMA instructions, MIDACC also generates the compressed code that can be used as input to the linker. However, further work is necessary for dynamic simulation.

Figures. 5.7 to 5.13 depict the results obtained for static simulation for 23

selected embedded programs. Figure. 5.14 compares RMA code

reduction for embedded domains.

Figure. 5.7: Code size Reduction due to RMA for Automotive Domain

Figure. 5.8: Code size Reduction due to RMA for Network Domain

Figure. 5.9: Code size Reduction due to RMA for Video and Audio

domains

Figure. 5.10: Code size Reduction due to RMA for Image Domain

Figure. 5.11: Code size Reduction due to RMA for Speech Domain

Figure. 5.12: Code size Reduction due to RMA for Security Domain

Figure. 5.13: Code size Reduction due to RMA for Text Domain

Figure. 5.14: Comparison of Code size Reduction due to RMA for

Embedded Domains

Surprisingly, all programs except two, susan and bitcnts, get less than 5% reduction. The maximum reduction is 18%, for susan. Compared to the reduction percentages obtained by HIE, the reduction with RMA is almost negligible for most applications. Hence, viewed in isolation, it is not advantageous to incorporate RMA in RISC processors. However, RMA can be combined with HIE to increase the effective reduction of the code size. Further, as discussed in Chapter 6, the use of compound instructions gives additional code size reduction. The final conclusion is presented in section 5.5, after the discussion in section 5.4 of the hardware modifications required.

5.4 PIPELINE MODIFICATIONS FOR RMA

The RMA will impact the entire processor architecture with regard to opcodes, addressing modes, instruction length, instruction encoding, datapath, control signals etc. This involves both hardware changes in the processor cores and modifications to the compilers and associated software tools.

The traditional RISC pipeline sequence discussed in Chapter 2 has to be rearranged to suit both LSA and RMA instructions by interchanging the Data memory access and Execute stages. Figure. 5.15 shows the proposed 6-stage pipeline for RISC processors that supports the RMA register-memory instructions. The action taken by the pipeline for the ADD-RM instruction is as follows. The first two stages are similar to those of the RISC pipeline. In the third stage (AC), the address of the memory operand is calculated by a small address adder (as in ARM processors), and in the fourth stage (MEM), the memory operand is fetched from the data memory. In the fifth stage (EX), the addition is carried out, and in the last stage (WB), the result is stored in the destination register.

Figure. 5.15: Proposed 6-Stage RMA Pipeline

Figure. 5.16 shows the execution of LSA instructions in the six-stage pipeline. For the LSA ADD instruction, the AC and MEM cycles are unused. Compared to a 5-stage RISC pipeline, one additional clock cycle is consumed by LSA instructions in the 6-stage RMA pipeline. The unused internal cycles do not directly affect performance, since they do not cause pipeline stalls [54]. However, a slight performance decrease is expected because the longer pipeline increases the frequency of dependencies between successive instructions. It is a question of choice between performance and code size. For embedded systems, code size is more significant, and hence the increase in execution time by one cycle is tolerable.

Figure. 5.16: Execution of LSA Instructions in 6-Stage RMA Pipeline

An alternative approach is possible for the RMA pipeline, with 5 stages as shown in Figure. 5.17. In the 6-stage pipeline, the EX stage is unused by the LOAD and STORE instructions of LSA, and the MEM stage is unused by the LSA ADD instruction, as shown in Figure. 5.16. Hence, in the 5-stage RMA pipeline, the EX and MEM stages are combined into a single MEM/EX stage to improve the efficiency of the RMA pipeline for the LSA instructions. Therefore, no performance penalty is caused for LSA instructions.

Figure. 5.17: Execution of LSA Instruction in 5-stage RMA pipeline

Figure. 5.18: Execution of RMA ADDrm Instruction in 5-Stage RMA

pipeline

For the ADD-rm instruction, the MEM/EX stage is used twice, as shown in Figure. 5.18. In the first MEM/EX cycle, the memory operand is fetched from the data cache and, in the second MEM/EX cycle, the addition is performed. Such a 5-stage approach has been used in several processors including the Pentium, R8000 and PA 7100 [54]. The hardware changes for the RMA arithmetic/logical instructions will reduce the performance of the processor to some extent due to pipeline stalls during memory access. However, since cache memory is now common in embedded systems as well, such pipeline stalls are expected to be rare.
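The stage usage of the two pipeline organisations can be summarised with a minimal Python sketch. It models only the sequence of stages occupied by one instruction, not hazards or stalls. The stage names IF and ID for the first two stages and the instruction-kind labels are assumptions; AC, MEM, EX, MEM/EX and WB are the names used in this section.

```python
# Stage sequences for one instruction; IF/ID names and kind labels are assumptions.

SIX_STAGE = ["IF", "ID", "AC", "MEM", "EX", "WB"]
FIVE_STAGE = ["IF", "ID", "AC", "MEM/EX", "WB"]


def stage_sequence(kind, pipeline):
    """Return the stages occupied, cycle by cycle, by a single instruction.

    kind: "lsa" for load/store/ALU instructions of the original ISA,
          "rma_alu" for a register-memory ALU instruction such as ADD-rm.
    """
    seq = list(pipeline)
    if kind == "rma_alu" and "MEM/EX" in seq:
        # In the 5-stage pipeline, ADD-rm uses MEM/EX twice:
        # first to fetch the operand, then to perform the addition.
        seq.insert(seq.index("MEM/EX"), "MEM/EX")
    return seq


lsa_in_six = stage_sequence("lsa", SIX_STAGE)        # 6 cycles, some stages idle
lsa_in_five = stage_sequence("lsa", FIVE_STAGE)      # 5 cycles, no penalty
rma_in_five = stage_sequence("rma_alu", FIVE_STAGE)  # 6 cycles, MEM/EX repeated
```

Under this simple model an LSA instruction needs five cycles in the 5-stage RMA pipeline and six in the 6-stage one, while ADD-rm needs six cycles in either organisation.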

5.5 CONCLUSIONS

This chapter investigates the implementation of a register-memory architecture in embedded processors for low power embedded systems, in view of the need for reduced chip size and lower power consumption. The encoding of appropriate new instructions and the pipeline modifications of existing RISC processors needed to support RMA arithmetic/logical instructions have been explored by demonstrating the addition of 12 RMA instructions to MIPS32. In view of the pipeline changes required, one may conclude that the RMA idea is not cost effective. However, it is essential to view the SoC as a whole and not the processor alone in isolation. The additional hardware components required for incorporating the processor changes are almost free of cost and the generation of control signals is not that complex, whereas increasing the memory size is costlier. However, performance degradation due to pipeline stalls during memory access for RMA ALU instructions is a matter of concern. With multicore and superscalar architectures, this performance degradation can be compensated for by appropriate scheduling of instructions. The integration of RMA with HIE is discussed in Chapter 6. Similarly, combining the idea of compound instructions with HIE and RMA offers additional code size reduction, as shown in Chapter 6.

6. HYBRID PROCESSOR FOR PORTABLE EMBEDDED

SYSTEMS

The emergence of the Internet of Things (IoT) and the insatiable

demand for smart devices in every aspect of life is driving a complete

overhaul of traditional wisdom in the embedded processor and embedded

memory industry. As electronic devices become smarter, the software code

becomes larger and needs to be processed faster to handle the

communication protocols, authentication, message generation, and so on.

The reality is now dawning on the industry that RISC architecture cannot

meet this new generation of code storage requirement, with embedded

software increasing quickly from a few Kilobytes to several Megabytes.

CORE-based design with predefined and preverified modules in

modern SoCs is the state-of-the art design strategy. In this thesis, the term

processor is used in a broad sense. It includes both core IP in a SoC and

traditional single-chip form since the basic concept of the processor is the

same. Both are the same architecturally irrespective of the different

implementations.

The main goal of the research work is to provide an enhanced ISA

for RISC processor cores so as to produce minimum object code for SoC

based BOPES. This Chapter proposes designing a hybrid processor

incorporating both HIE and RMA along with certain other ISA level

enhancements to meet the code size requirements of portable embedded

systems. To estimate the code size reduction possible with such a

processor, MIPS32 is taken as the target processor and the embedded

object codes of MIPS32 are recoded to HIE-RMA-MIPS so that the overall

code size reduction is measured. MIDACC has been suitably extended to work on HIE object codes and to perform HIE on RMA codes. In addition, certain compound and composite instructions are proposed in this chapter to further reduce the code size.

6.1 SoC DESIGN AND EMBEDDED SYSTEMS

Present-day deep-submicron IC design poses challenges to SoC design teams. In a generic 0.13 µm standard-cell foundry process, silicon density routinely exceeds 100,000 usable gates per square millimetre. Consequently, a low-cost chip with a core area of 50 square millimetres can carry 5 million logic gates. All embedded systems now contain significant amounts of software. Three examples of SoC based battery powered portable embedded systems, namely the smart watch, the scanner and the smartphone, are briefly discussed in the following subsections.

6.1.1 Smart watch

While early models of the smart watch, a computerized wristwatch, performed basic tasks such as calculations, translations, and game-playing, modern smart watches are effectively wearable computers. Many smart watches run mobile apps, while some models run a mobile operating system and function as portable media players, offering playback of FM radio, audio, and video files to the user via a Bluetooth headset. Some smart watch models, also called watch phones, feature full mobile phone capability and can make or answer phone calls. A smart watch may collect information from internal or external sensors. It may control, or retrieve data from, other instruments or computers. It may support wireless technologies like Bluetooth, Wi-Fi, and GPS. However, it is also possible that a wristwatch computer merely serves as a front end for a remote system, as in the case of watches utilizing cellular technology or Wi-Fi. Figure. 6.1 illustrates the block diagram of a smart watch.

Figure. 6.1: Block diagram of smart watch

A smart watch may include features such as a camera, accelerometer,

thermometer, altimeter, barometer, compass, chronograph, calculator, cell

phone, touch screen, GPS navigation, map display, graphical display,

speaker, scheduler, watch, Secure Digital (SD) cards as a mass storage

device, and rechargeable battery. It may communicate with a wireless

headset, heads-up display, insulin pump, microphone, modem, or other

devices. Some also have "sport watch" functionality with activity tracker

features (also known as fitness tracker) as seen in GPS watches made for

training, diving, and outdoor sports. Functions may include training

programs (such as intervals), lap times, speed display, GPS tracking unit,

route tracking, dive computer, heart rate monitor compatibility, cadence

sensor compatibility, and compatibility with sport transitions.

6.1.2 Scanner

The basic function of a scanner is to analyze an image and process

it. Image and text capture allows saving information to a file on the

computer. Handheld 3D scanners are used in industrial design, reverse

engineering, inspection and analysis, digital manufacturing and medical

applications. Colour scanners typically read Red-Green-Blue (RGB) colour

data from the array. This data is then processed with some proprietary

algorithm to correct for different exposure conditions, and sent to the

computer via the device's input/output interface. By combining full-colour

imagery with 3D models, modern hand-held scanners are able to

completely reproduce objects electronically. The addition of 3D colour

printers enables accurate miniaturization of these objects, with applications

across many industries and professions. The size of the file created

increases with the square of the resolution. The file size can be reduced for

a given resolution by using "lossy" compression methods such as JPEG, at

some cost in quality. If the best possible quality is required, lossless

compression should be used; reduced-quality files of smaller size can be

produced from such an image when required. A scanning utility and some

type of image-editing application (such as Photoshop), and optical

character recognition (OCR) software are required. The OCR software

converts graphical images of text into standard text that can be edited

using common word-processing and text-editing software. Figure. 6.2

shows the block diagram of a typical hand held scanner designed around a

SoC.

Figure. 6.2: Block diagram of a scanner

6.1.3 Smartphones

A smartphone (or smart phone) is a mobile phone with more

advanced computing capability and connectivity than basic feature phones.

Early smartphones typically combined the features of a mobile phone with

those of another popular consumer device, such as a personal digital

assistant (PDA), a media player, a digital camera, and/or a GPS navigation

unit. Later smartphones include all of these plus the features of a

touchscreen computer, including web browsing, Wi-Fi, and third-party

applications.

The software for smartphones can be visualized as a software stack.

The stack consists of the following layers:

1. kernel -- management systems for processes and drivers for

hardware

2. middleware -- software libraries that enable smartphone

applications (such as security, Web browsing and messaging)

3. application execution environment (AEE) -- application

programming interfaces, which allow developers to create their

own programs

4. user interface framework -- the graphics and layouts seen on

the screen

5. application suite -- the basic applications users access

regularly such as menu screens, calendars and message

inboxes

Figure. 6.3 gives a functional view of Snapdragon, a popular SoC used in many smartphones.

Figure. 6.3: Block diagram for the Snapdragon S4 SoC

using Krait CPUs

Inside the Snapdragon, not only are the processing cores, graphics chip and media accelerators present in a single SoC, but the package also contains full wireless radios, GPS and RAM. MediaTek has recently launched the MT6795, a 64-bit octa-core SoC for high-end smartphones.

6.2 ENHANCEMENT TO HIE AND RMA CODES

The following additional techniques were applied to get higher code reduction for the MIPS object codes by suitable extensions to MIDACC.

1. Introducing 16-bit two-address compound instructions in HIE2 to replace instructions with the same two source registers, and instructions with the same source and destination register.

2. Introducing two composite instructions, loadmultiple and storemultiple, to replace consecutive load/store instructions.

3. Performing HIE simulation on the RMA codes to estimate the combined effects of HIE and RMA.

The results of all three techniques are shown in Table 6.1. It is interesting to note that the two benchmarks, susan and bitcount, that gave excellent code reduction for both HIE and RMA suffered the most from the cumulative effect of HIE on RMA codes (A-B). This is because the RMA conversion has eliminated some part of the HIE scope, since RMA has better code density than RISC. The least affected benchmark is sha, the application that got the minimum code reduction for RMA. The MIDACC extender estimates the code size reduction due to the use of composite and compound instructions. The overall code reduction including all three techniques in addition to HIE2 varies from 21.95% to 44.44%. Interestingly, as in the case of the reduction due to HIE, lame gets the lowest overall code reduction and susan gets the highest.

Table 6.1: Integration of Code reduction schemes and compound/composite instructions

(A = HIE PCR + RMA PCR; B = Actual HIE-RMA PCR; C = PCR due to Compound instructions; D = PCR due to Composite instructions; Overall Reduction PCR = B+C+D)

Benchmark      A       B       C      D      Overall
basicmath      24.12   22.02   1.16   1.47   24.65
bitcount       34.11   30.62   3.49   3.37   37.48
Qsort          27.16   26.08   3.46   3.38   32.92
susan          44.72   36.30   6.91   1.23   44.44
dijkstra       25.13   23.72   2.39   4.18   30.29
patricia       25.16   23.73   2.39   4.20   30.32
CRC32          25.13   23.72   2.40   4.19   30.31
MPEG2          24.57   23.27   2.37   4.07   29.71
ADPCM          25.12   23.71   2.40   4.19   30.30
lame           20.02   19.10   0.79   2.06   21.95
JPEG           30.64   28.75   2.79   5.34   36.88
EPIC           25.01   23.62   2.40   4.11   30.13
fft            24.75   23.43   2.29   3.97   29.69
(susan)        44.72   36.30   6.91   1.23   44.44
GSM            24.87   23.54   2.51   4.03   30.08
G721           25.18   23.78   2.40   4.21   30.39
rsynth         23.63   23.20   1.89   3.00   28.09
pegwit         25.05   23.71   2.70   4.13   30.54
sha            23.79   23.56   2.09   6.15   31.80
blowfish       25.18   23.76   2.49   4.14   30.39
rijndael       25.21   23.84   2.73   4.11   30.68
typeset        22.49   21.82   1.53   2.10   25.45
stringsearch   25.12   23.71   2.41   4.19   30.31
ispell         21.46   20.60   1.73   3.78   26.11
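The two derived quantities discussed for Table 6.1, the loss A-B caused by applying HIE on RMA codes and the overall reduction B+C+D, can be recomputed with a short Python sketch. The four rows are copied from Table 6.1; the variable and function names are hypothetical.

```python
# Rows copied from Table 6.1 as (A, B, C, D); names are hypothetical.
rows = {
    "susan":    (44.72, 36.30, 6.91, 1.23),
    "bitcount": (34.11, 30.62, 3.49, 3.37),
    "sha":      (23.79, 23.56, 2.09, 6.15),
    "lame":     (20.02, 19.10, 0.79, 2.06),
}


def overlap_loss(a, b):
    """A - B: reduction lost because RMA removes part of the HIE scope."""
    return round(a - b, 2)


def overall_reduction(b, c, d):
    """B + C + D: combined HIE-RMA, compound and composite reduction."""
    return round(b + c + d, 2)


loss = {name: overlap_loss(a, b) for name, (a, b, c, d) in rows.items()}
overall = {name: overall_reduction(b, c, d) for name, (a, b, c, d) in rows.items()}
most_affected = max(loss, key=loss.get)
least_affected = min(loss, key=loss.get)
```

Among these four rows, susan loses the most (8.42 points) and sha the least (0.23 points), in line with the observations made about the table.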

6.3 FUTURE ENHANCEMENT TO HIE-RMA

The following additional techniques can be adopted to achieve increased code reduction for the MIPS object codes, and the extent of code reduction can be estimated by suitable modifications to MIDACC.

1. The maximum size of immediate/ offset can be further reduced.

2. ALU and shift can be combined in a single instruction.

3. A 40-bit instruction can be introduced so that RMA logic can be

enhanced to cover 16-bit offset also in the load instructions.

Presently such cases are not considered for RMA conversion.

This amounts to having five different lengths in hybrid

instruction encoding that will give scope for new types of

composite instructions. This will in turn provide additional

amount of code size reduction.

6.4 HIE AND ILP

Usually RISC processors are considered to be more suitable for superscalar architecture than CISC processors, though Intel has successfully implemented superscalar architecture in its CISC processors from the Pentium onwards. In these days of multicore SoCs, many ideas are in practice that can be used in HIE-RISC processors. Some of these are discussed below.

6.4.1 Hybrid-Length Instructions and Instruction fetch

A severe criticism of hybrid/variable-length instructions is the complexity of instruction fetch. However, such fetching has been implemented effectively in the Intel 80486 [10]. Because the processor does not know the length of the next instruction to be fetched, a typical strategy is to fetch a number of bytes or words equal to at least the longest possible instruction.

This means that sometimes multiple instructions are fetched. The Intel 80486

is a processor with a five-stage pipeline and supports instructions of variable

length (from 1 to 11 bytes not counting prefixes). It has two 16-byte prefetch

buffers for the instructions. The status of the prefetcher relative to the other

pipeline stages varies from instruction to instruction. On an average, about five

instructions are fetched with each 16-byte load [10]. The instruction fetch

stage operates independently of other stages to keep the prefetch buffers full.

The Itanium processor [10] follows IA-64 architecture that supports

instruction-level parallelism. IA-64 defines a 128-bit bundle that contains

three instructions. The processor can fetch instructions one or more

bundles at a time; each bundle fetch brings in three instructions.

Instructions are fetched through an L1 instruction cache and fed into a

buffer that holds up to eight bundles of instructions. When deciding on

functional units for instruction dispersal, the processor views at most two

instruction bundles at a time.

Parallel fetch and decode is complicated by the need to examine multiple bytes of an instruction before the start address of the next sequential instruction is known. The Intel P6 microarchitecture can decode three variable-length x86 instructions in parallel, but the second and third instructions must be simple [55]. The P6 performs speculative decodes at each byte position and then muxes out the correctly decoded instructions once the lengths of the first and second instructions are known. The AMD Athlon predecodes instructions during cache refill to mark the boundaries between instructions and the locations of opcodes, and then scans and aligns multiple variable-length instructions [56]. The Pentium-4 design [57] improves on the P6 family by caching decoded fixed-length micro-ops in a trace cache. These legacy CISC ISAs were not designed with parallel fetch and decode in mind.

Two simple designs of instruction decoding popularly used in processors with variable-length instruction sets are shown in Figures. 6.4 and 6.5. The design in Figure. 6.4 uses an additional instruction decoder stage in the instruction pipeline. The first decoder stage determines the instruction lengths and steers the instructions to the second stage, where the actual instruction decoding is performed. The design methodology used in decoding x86 variable-length instructions uses pre-decoding, as shown in Figure. 6.5, to mark instruction lengths in the code cache. This reduces the number of decode stages in the pipeline, but the need to hold the resolved instruction information requires a larger cache.

6.4.2 Instruction Fetch and Cache Access

Although embedded processors traditionally had simple single-issue

pipelines, current designs have deeper pipelines or superscalar issue to

meet higher performance requirements. A new heads-and-tails (HAT)

format [58] allows variable-length instructions to be held in the cache yet

remain easily indexable for parallel fetch and decode. This format can also be used for HIE-MIPS to take advantage of the high code density of hybrid-length instructions while enabling deeply pipelined or superscalar processors. The HAT format packs multiple variable-length instructions into fixed-length bundles. Each instruction is split into a fixed-length head portion and a variable-length tail portion as shown in Figure. 6.6. In the MIPS-HAT scheme, the head size was 10 bits. For HIE-MIPS, six bits are sufficient for the head. The fixed-length heads are packed together in

program order at the start of the bundle, while the variable-length tails are

packed together in reverse program order at the end (i.e., the first tail is at

the end of the bundle). Bundles contain varying numbers of instructions, so

each bundle begins with a small fixed-length field holding the number of the

last instruction in the bundle, i.e. a bundle holding N instructions has N-1 in

this field. The remainder of the bundle is used to hold instructions. When

packing instructions into bundles, there can be internal fragmentation if the

next instruction does not fit into the remaining space in a bundle, in which

case the space is left empty and a new bundle is started [58]. The program

counter (PC) in a HAT scheme is split into a bundle number held in the high

bits and an instruction offset held in the low bits. During sequential

execution, the PC should be incremented as usual, but after fetching the

last instruction in a bundle (as given by the instruction count stored in the

bundle), it should skip to the next bundle by incrementing the bundle

number and resetting the instruction offset to zero. A PC value points

directly to the head portion of an instruction and, because they are fixed-

length, multiple sequential instruction heads can be fetched and decoded in

parallel. The tails are still variable-length, however, and so the heads must

contain enough information to locate the correct tail.
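The bundle layout and the PC stepping rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the encoding used in [58]: the 64-bit bundle size, the 3-bit count field and the helper names `pack_bundles`, `count_field` and `next_pc` are hypothetical, and only the 6-bit head size is taken from the text.

```python
BUNDLE_BITS = 64   # assumed bundle size for this example
COUNT_BITS = 3     # assumed width of the "number of last instruction" field
HEAD_BITS = 6      # head size suggested in the text for HIE-MIPS


def pack_bundles(tail_sizes):
    """Greedily pack instructions (given by tail size in bits) into bundles.

    Returns a list of bundles; each bundle is the list of tail sizes it holds.
    """
    max_count = 1 << COUNT_BITS  # the field stores N-1, so N <= 2**COUNT_BITS
    bundles, current, used = [], [], COUNT_BITS
    for tail in tail_sizes:
        need = HEAD_BITS + tail
        if COUNT_BITS + need > BUNDLE_BITS:
            raise ValueError("instruction does not fit in an empty bundle")
        if used + need > BUNDLE_BITS or len(current) == max_count:
            bundles.append(current)  # leftover bits are internal fragmentation
            current, used = [], COUNT_BITS
        current.append(tail)
        used += need
    if current:
        bundles.append(current)
    return bundles


def count_field(bundle):
    """Value stored at the start of a bundle: N-1 for N instructions."""
    return len(bundle) - 1


def next_pc(pc, bundles):
    """Advance a (bundle number, instruction offset) PC sequentially."""
    bundle_no, offset = pc
    if offset == count_field(bundles[bundle_no]):
        return (bundle_no + 1, 0)  # last instruction: skip to the next bundle
    return (bundle_no, offset + 1)
```

For example, packing tails of 2, 10, 18, 26, 18, 10 and 2 bits starts a new bundle as soon as the next head and tail no longer fit, and `next_pc` resets the offset to zero when it steps past the last instruction of a bundle.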

The heads-and-tails (HAT) format supports parallel fetch and

decode of compact variable-length instruction sets directly from cache. The

HAT format helps an implementation deliver multiple, variable-sized,

randomly-accessible instruction units to the CPU in a single cycle or

alternatively enables a deeply-pipelined fetch of such units. The HAT

format is used both in main memory and cache, although additional

information might be added to the cached version to improve performance.

[Figures 6.4 and 6.5: block diagrams. Recoverable block labels: Code cache; Length decoder; Input buffer; Instruction Steering Logic; Decoder (three instances)]

Figure. 6.4: Two stage instruction Decoding

Figure. 6.5: Predecoding and Marking Instruction Lengths


Figure. 6.6: Heads and Tails Format

A cache line could contain one or more bundles. Similar to a

conventional variable-length scheme, the tail size information in the head of

one instruction must be decoded to ascertain the location of the start of the

tail of the next instruction. But in the HAT format the length information for

each instruction is held at a fixed spacing in the head instruction stream,

independent of the length of the whole instruction. This makes the critical

path to determine tail alignment for multiple parallel instructions much

shorter than in a conventional variable-length scheme, where the location

of the length information in the next instruction depends on the length of the

current instruction. The tails in a HAT scheme are delayed relative to the

heads, but the head and tail fetches can be pipelined independently. The

authors of the HAT scheme [58] experimented with MIPS codes and showed
that the MIPS-HAT format can provide a compression ratio of 75.5%
(a code size reduction of 24.5%) and a dynamic fetch ratio of
75.0% while supporting deeply pipelined or superscalar execution. They
developed a simple MIPS instruction compression scheme by re-encoding
the MIPS ISA into a variable-length format and mapping the resulting
variable-length instructions into the HAT format. They evaluated the use of
both 128-bit and 256-bit bundles for MIPS-HAT. In their scheme, each
instruction can be one of six sizes ranging from 15 to 40 bits. In contrast,
HIE-MIPS supports 12 instruction types with sizes ranging from 8 to 32
bits. Hence HIE-MIPS is expected to offer better results with the HAT format.
Figure. 6.7 shows HIE2-MIPS instructions recoded into the HAT scheme.
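As a quick consistency check, the field widths listed in Figure. 6.7 can be summed to confirm that each head plus tail gives a whole number of bytes. The sketch below only transcribes the widths from the figure; the names `HEAD_BITS`, `TAIL_FIELDS` and `instruction_sizes` are introduced here for illustration.

```python
HEAD_BITS = 1 + 5  # it (1 bit) + opcode (5 bits)

# Tail field widths per HIE2 type as listed in Figure 6.7. Types with a
# hybrid-length immediate/offset field have one tuple per length option.
TAIL_FIELDS = {
    "A":  [(2,)],
    "B":  [(5, 5)],
    "C":  [(3, 5, 2)],
    "D1": [(3, 5, 5, 5)],
    "D2": [(3, 5, 5, 5)],
    "E":  [(2, 5, 5, 5, 1)],
    "F":  [(5, 5)],
    "G1": [(1, 5, 5, 7), (1, 5, 5, 15)],
    "G2": [(2, 0), (2, 8), (2, 16)],
    "G3": [(2, 2, 5, 1, 0), (2, 2, 5, 1, 8), (2, 2, 5, 1, 16)],
    "H":  [(26,)],
    "I":  [(20, 6)],
}


def instruction_sizes(hie_type):
    """Total instruction sizes in bits (head + tail) for one HIE2 type."""
    return [HEAD_BITS + sum(fields) for fields in TAIL_FIELDS[hie_type]]
```

With these widths every type comes out at 8, 16, 24 or 32 bits, which matches the sizes printed next to each type in the figure.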

The HAT scheme has a number of advantages over conventional

variable-length schemes. Fetch and decode of multiple variable-length

instructions can be pipelined or parallelized. Unlike conventional variable-

length formats, it is impossible to jump into the middle of an instruction. The

variable alignment muxes needed are smaller than in a conventional

variable-length scheme, because they only have to align bits from the tail

and not from the entire instruction length. The fixed-length heads are

handled using a much simpler and faster mux. The HAT format guarantees

that no variable-length instruction straddles a cache line or page boundary,

simplifying instruction fetch and handling of page faults.

The HAT scheme operates all the length decoders in parallel, and
then sums their outputs to determine tail alignments as shown in
Figure. 6.8. The performance impact of the additional latency for the
tails can be partly hidden if more latency-critical instruction information
is located in the head portions.
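The "decode all lengths, then sum" step can be illustrated in software as a running sum over the tail lengths read from the heads. This is a sketch under assumptions: each head is taken to yield its tail length directly in bits, the bundle is 64 bits wide, and the function name `tail_positions` is hypothetical. In hardware the sums would be formed by parallel adders rather than a sequential loop.

```python
from itertools import accumulate


def tail_positions(tail_lengths, bundle_bits=64):
    """Return (start, end) bit positions of each tail within a bundle.

    tail_lengths[i] is the tail length decoded from head i. Tails are packed
    in reverse program order from the end of the bundle, so tail 0 occupies
    the last bits of the bundle.
    """
    consumed = list(accumulate(tail_lengths))  # running sums of tail lengths
    positions = []
    for length, used in zip(tail_lengths, consumed):
        start = bundle_bits - used
        positions.append((start, start + length))
    return positions
```

Because every length sits at a fixed position in the head stream, all entries of `tail_lengths` are available at once; only the summation depends on earlier instructions.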

Type  Size (bits)  Instructions                                 Head (bits)       Tail fields (bits)
A     8            nop, syscall, rfe                            it(1) opcode(5)   iid(2)
B     16           mfcz, mtcz                                   it(1) opcode(5)   rt(5) rs(5)
C     16           mfhi, mflo, mthi, mtlo, jr                   it(1) opcode(5)   iid(3) rd/rs(5) 00(2)
D1    24           add, addu, and, nor, or, sub, subu, xor      it(1) opcode(5)   iid(3) rs(5) rt(5) rd(5)
D2    24           sllv, srav, srlv, slt, sltu                  it(1) opcode(5)   iid(3) rs(5) rt(5) sa(5)
E     24           sll, sra, srl                                it(1) opcode(5)   iid(2) rt(5) rd(5) sa(5) 0(1)
F     16           jalr, div, divu, mult, multu                 it(1) opcode(5)   rs(5) rt/rd(5)
G1    24/32        addi, addiu, lui, slti, sltiu, beq, bgezal,  it(1) opcode(5)   hl(1) rs(5) rt(5) imm/off(7/15)
                   bltzal, bne, lb, lbu ...
G2    8/16/24      bczt, bczf                                   it(1) opcode(5)   hl(2) imm/off(0/8/16)
G3    16/24/32     bgez, bgtz, blez, bltz                       it(1) opcode(5)   iid(2) hl(2) rs(5) 0(1) imm/off(0/8/16)
H     32           j, jal                                       it(1) opcode(5)   target(26)
I     32           break                                        it(1) opcode(5)   code(20) 0's(6)

Figure. 6.7: HIE2-MIPS Instruction Types in HAT Scheme

Figure. 6.8: Variable-length decoding in a HAT Scheme

6.5 HIE-MIPS Vs microMIPS/Thumb2

For several years, MIPS16 and ARM's Thumb were used in embedded
applications for code size reduction, as discussed in Chapter 2. These two
instruction sets are not independent ISAs, and they incur a severe
performance penalty because of several restrictions. For instance, MIPS16
and Thumb supported only a reduced number of registers. Subsequently,
ARM and MIPS introduced Thumb2 and microMIPS respectively, with two
instruction length options: 16 bits and 32 bits. These perform internal
recoding of instructions to the original ARM or MIPS processor versions and
yet do not provide full facilities to all instructions. For instance, some 16-bit
microMIPS instructions can access only eight GPRs. In total, the microMIPS
ISA adds 54 new instructions [59]. The instruction decoder performs two
operations in sequence. First, the decoder translates the microMIPS
instruction into a MIPS32 instruction; the resulting MIPS32 instruction is then
decoded, so the design effectively incorporates dual decoders. On the other
hand, the HIE-MIPS will have an independent ISA. The microMIPS achieves
a code size reduction of 30% compared to MIPS32. The HIE-RMA-MIPS
version achieves over 44% code size reduction. The Thumb2 has about 24K
gates whereas the microMIPS has approximately 33K gates. The microMIPS
offers 98% of the performance of MIPS32. The HIE-RMA-MIPS needs only a
single decoder and takes less decode time than microMIPS.

6.6 DISCUSSION AND CONCLUSION

Present day smart phones are highly advanced with quad core and

octa core processors. These smart phones have flash memory of 32 GB or

more. Most of them support entire embedded applications covered by

MiMedia. Hence the sizes of all these can be summed up to estimate the code
memory requirement. The total size exceeds one crore (10 million) bytes. The
average HIE+RMA code reduction percentage of 26% translates to a saving of
about 2 megabytes of code memory, without considering the OS and other
system software that is also part of the embedded core memory.

The aim of this research is to enhance the 30-year-old RISC
architecture with features that make RISC processors more relevant to
battery-operated embedded systems based on SoCs. Hybrid Instruction
Encoding with hybrid offset and immediate fields has been shown to
improve the code density considerably. The work undertaken generated a
primitive but reasonably functional set of tools for static simulation of
the MIPS32-based HIE and RMA architecture. The developed tool chain
helped in studying the behaviour of embedded codes and in estimating the
code size reduction for three different variations. This work has thus
provided strong evidence that the HIE-RMA can indeed serve as an
embedded processor architecture for applications that do not demand
extremely high performance. In fact, the HIE-RMA should increase
performance, since the smaller code also reduces the required size of the
code cache. However, the additional instruction decoding increases the
execution time of a single instruction. This will not have much impact on
superscalar processors.

However, the performance of this architecture appears to trail

traditional CISC and RISC processors in terms of instruction fetch and

execution efficiency. Viewed in totality in the context of multicore and

superscalar architecture, the individual processor performance is not a

concern but achieving effective parallelism and minimising memory size are

more relevant for embedded SoCs. Configurable processor technology is

fast emerging particularly in the design of higher performing consumer

products. It allows customization of the core processor, which can have

both performance and power impact on the SoC embedded design. The

Tensilica Xtensa LX2 Series is one such configurable processor [60].

The following sections provide a summary of the research work,
outline the limitations of the described work and offer suggestions for
future work.

6.6.1 Summary of Contributions

The main contributions of this work are summarized below.

Behaviour Analysis of Embedded Applications

The object codes of the embedded applications are profiled by the
custom-built software tool, MIDACC. Applications from two representative
sets of embedded benchmarks, MiBench and MediaBench, are cross
compiled for the MIPS32 processor for this purpose. Apart from measuring
the static instruction frequencies, this tool measures the total amount of
underutilization of the offset and immediate fields in the object code.

Design of Hybrid Instruction Encoding for RISC Processors

Two versions of Hybrid Instruction Encoding (HIE) are designed for

supporting multiple instruction sizes and hybrid lengths for the offset and

the immediate fields. For each of the 66 integer instructions of MIPS32, an

equivalent HIE instruction has been designed.

Design of Register Memory Architecture (RMA) for RISC Processor

This part of the research work involves the design of 12 RMA ALU
instructions for the MIPS processor. Appropriate formats are chosen for the
RMA instructions. The traditional RISC pipeline sequence is rearranged to
suit both LSA and RMA instructions. Any datapath requirements are also
identified.

Hybrid RISC Processor

For simulating a hybrid processor incorporating both HIE and RMA,

the embedded object codes of MIPS32 are recoded to the HIE-RMA

processor using the custom built code converter and the code size

reduction is measured.

Developing Static Simulator for HIE / RMA

To estimate the efficiency of the HIE and RMA for RISC processor, a

code converter tool, MIPS Instruction Distribution Analyser cum Code

Converter (MIDACC), has been developed. The MIDACC converts the

object codes from MIPS ISA to HIE/RMA-ISA. This tool also measures the

code size savings for embedded applications in MiBench and MediaBench

benchmark suites.

6.6.2 Limitations of Described Research Work

The uniqueness of the proposed architecture greatly limited the reuse of
existing infrastructure and tools. Thus much of the research period was
spent on developing tools and on static simulation. The lack of a suitable
compiler for dynamic simulation was a major hurdle. Since the custom-built
tool MIDACC presently recognises only MIPS32 object codes, no comparison
with ARM-like processors could be done. Popular processor simulators such
as SimpleScalar and ArchC need extensive modification to support HIE-RMA
for RISC processors.

6.6.3 Areas for Future Work

There are some areas of broader study that are left open for future

research.


6.6.3.1 MicroMIPS and Thumb2 versions with HIE-RMA

MIPS32 is a pure general-purpose RISC processor whereas

microMIPS and Thumb2 are embedded processor versions. These have a

large scope of redesign with HIE-RMA.

6.6.3.2 Reconfigurable HIE-RMA version

Reconfigurability will provide embedded developers scope for

optimising application dependent features.

6.6.3.3 Dynamic simulation

A SimpleScalar-like simulator can be modified and dynamic simulation
can be carried out to accurately estimate performance and power
consumption.

6.6.3.4 FPGA Processor design

A HIE-RMA processor prototype can be developed based on

microMIPS or Thumb2.

6.6.3.5 Compiler tool chain

A compiler tool chain that targets the HIE-RMA processor and works with
the FPGA processor is also required to put the HIE-RMA processor to use.

6.6.3.6 HIE-RMA-HAT Processor

A new ISA applying HAT scheme to the HIE-RMA processor core

suiting multicore SoCs can be developed.


REFERENCES

[1] Joseph Byrne and Tom R. Halfhill, “A Guide to Embedded

Processors”, Linley Group, Seventh Edition, CA, 2012.

[2] Xie Y., Wolf W. and Lekatsas H., “Code Compression for

Embedded VLIW Processors Using Variable-to-Fixed Coding”, IEEE

Trans. VLSI Systems, Vol. 14, pp. 525-536, 2006.

[3] Hennessy J. L and Patterson D. A, “Computer Architecture: A

Quantitative Approach”, Morgan Kaufmann, Fifth Edition, San

Francisco, CA, 2012.

[4] Santhosh Chede and Kishore Kulart, “Design Overview of Processor

Based Implantable Pacemaker”, Journal of Computers, Vol. 3,

pp. 49-57, 2008.

[5] Frank Vahid and Tony Givargis, “Embedded System Design: A

Unified Hardware / Software Introduction”, Third Edition, John Wiley

& Sons, U.K, 2002.

[6] Noergaard T., “Embedded Systems Architecture: A Comprehensive

Guide for Engineers and Programmers”, Elsevier, 2005.

[7] Fisher J. A, Faraboschi P. and Young C., “Embedded Computing: A

VLIW Approach to Architecture, Compilers, and Tools”, Morgan

Kaufmann, San Francisco, CA, 2005.

[8] Raj Kamal, “Embedded Systems: Architecture, Programming and

Design”, Mc Graw-Hill, Second Edition, 2008.

[9] Govindarajalu B., “Computer Architecture and Organization: Design

Principles and Applications”, Mc Graw-Hill, Second Edition, 2010.

[10] William Stallings, “Computer Organization and Architecture:

Designing for Performance”, Pearson Education Inc, Eighth edition,

2010.

[11] Cragon H.G, “Computer Architecture and Implementation”,

Cambridge University Press, UK, 2000.


[12] Bhandarkar D. and Clark D. W, “Performance from Architecture:

Comparing a RISC and a CISC with similar Hardware Organization”,

ACM SIGARCH Computer Architecture News, Vol. 19, pp. 310-319,

1991.

[13] Patterson D. A and Sequin C.H, “A VLSI RISC”, Computer,

Vol. 15, pp. 8-21, 1982.

[14] Moore G.E, “Cramming more components onto integrated circuits”,

Electronics, Vol. 38, pp. 114-117, 1965.

[15] Patterson D.A and Ditzel D.R “The case for the reduced instruction

set computer”, SIGARCH Comp. Arch. News, Vol. 8, 1980.

[16] Colwell D.R, Hitchcock C.Y, Jensen E , Brinkley Sprunt H. and

Kollar C., “Instruction sets and beyond: computers, complexity, and

controversy”, Computer, Vol. 18, pp. 8-19, 1985.

[17] Flynn M.J, Mitchell C.L and Mulder J.M, “And now a case for more

complex instruction sets”, Computer, Vol. 20, 1987.

[18] Rajesh kumar T.S, “On-Chip Memory Architecture Exploration of

Embedded System on Chip”, PhD thesis, Supercomputer Education

and Research Centre, Indian Institute of Science, Bangalore, 2008.

[19] Balasa F., Catthoor F. and De Man H., “Background memory area

estimation for multidimensional signal processing systems”, IEEE

Trans. VLSI system, Vol. 3, pp. 157-172, 1995.

[20] International Technology Roadmap for semiconductors,

SEMATECH, 3101, Industrial Terrace Suite, 106 Austin TX

78758, 2001.

[21] Steve Furber, “ARM System-on-Chip Architecture”, Pearson

Education Limited, Second Edition, 2000.

[22] Nam Sung Kim, Todd Austin, David Blaauw, Trevor Mudge,

Krisztian Flautner, Jie S. Hu, Mary Jane Irwin, Mahmut Kandemir,

and Vijaykrishnan Narayanan, “Leakage current: Moore's law meets

static power”, Computer, Vol. 36, pp. 68-75, 2003.

[23] Smith A.J, “Cache memories”, ACM Computing Surveys, 1993.


[24] Grishman Ralph, “Assembly Language Programming for the Control

Data 6000 Series”, Algorithmics Press, Second Edition, p.12, 1974.

[25] Jack J. Dongarra, “Numerical Linear Algebra on High-Performance

Computers”, p. 6, 1987

[26] Emily Blem, Jaikrishnan Menon, and Karthikeyan Sankaralingam,

“Power Struggles: Revisiting the RISC Vs CISC Debate on

Contemporary ARM and x86 Architectures”, 19th IEEE

International Symposium on High Performance Computer

Architecture (HPCA), 2013.

[27] Patterson D. A and Hennessy J. L, “Computer Organization &

Design: The Hardware / Software Interface”, Morgan Kaufmann

Publishers, Fourth Edition, San Francisco, CA, 2009.

[28] Heikkinen J., Takala J. and Corporaal H. “Dictionary - based Program

Compression on Customizable processor Architectures”,

Microprocessors and Microsystems, Vol. 33, pp.139-153, 2009.

[29] Huffman D. A, “A method for the construction of minimum-

redundancy codes”, in Proc.IRE, pp. 1098-1101, 1952.

[30] Wolfe A. and Chanin A., “Executing compressed programs on an

embedded RISC architecture”, Int. Symp. Microarch, pp. 81–91,

1992.

[31] Kozuch M. and Wolfe A., “Compression of embedded system

programs”, IEEE International Conference Computer Design,

Cambridge, MA, pp. 270-277, 1994.

[32] Lekatsas H. and Wolf W., “Code Compression for embedded

systems”, 35th Conference on Design Automation, San Francisco,

CA, pp. 516-521, 1998.

[33] Lefurgy C., Bird P., Cheng I. and Mudge T., “Improving Code density

using compression techniques”, 30th Annual International

Symposium on Microarchitecture, Research Triangle Park, NC, pp.

194-203, 1997.

[34] Araujo G., Centoducatte P., Cortes M. and Pannain R, “Code

compression based on operand factorization”, 31st Annual

ACM/IEEE International Symposium on Microarchitecture, Dallas,

TX, USA, pp. 194-201, 1998.

[35] Xie Y. and Wolf W., “Profile-driven code compression”, Des. Autom.

Test Eur, pp. 462-467, 1992.

[36] Bell T., Cleary J., and Witten I., “Text Compression”, Prentice Hall,

1990.

[37] Lin K.J and Wu C.W, “A Low-Power CAM Design for LZ Data

Compression”, IEEE Trans. Computers, Vol. 49, pp. 1139-1145,

2000.

[38] Benini L., Menichelli F. and Olivieri M., “A Class of Code

Compression Schemes for Reducing Power Consumption in

Embedded Microprocessor Systems”, IEEE Trans.Computers, Vol.

53 , pp. 467-482, 2004.

[39] Lin C.H, Xie Y. and Wolf W., “Code Compression for VLIW

Embedded Systems Using a Self - Generating Table”, IEEE Trans.

VLSI Systems, Vol. 15, pp. 1160-1171, 2004.

[40] Kemp T.M, Montoye R.K, Harper J.D, Palmer J.D, and Auerbach

D.J, “A decompression core for power PC”, IBM J.Res.Develop.,

Vol. 42, pp. 807–812, 1998.

[41] Bird P. and Mudge T., “An instruction stream compression

technique”, Elect. Eng. Comp. Sci. Dept., Univ. Michigan, Lansing,

MI, Tech. Rep. CSE-TR-319-96, 1996.

[42] Guido Araujo, Paulo Centoducatte, Rodolfo Azevedo, and Ricardo,

“Expression-Tree-Based Algorithms for code compression on

Embedded RISC Architectures”, IEEE Trans. VLSI Systems, Vol. 8,

pp. 530-533, 2004.

[43] Weiss A.R, “The Standardization of Embedded Benchmarking:

Pitfalls and Opportunities”, Int'l Conf. on Computer Design

(ICCD'99), Austin, TX, pp. 492-498, 1999.

[44] Cooper K. D and McIntosh N., “Enhanced code compression for

embedded RISC processors”, SIGPLAN Conf. Program. Lang. Des.

Implement, pp. 139-149, 1995.

[45] Debray S. and Evans, “Profile-guided code compression”, Conf.

Program. Lang. Des. Implement, pp. 95-105, 2002.

[46] Debray S. and Evans, “Cold code decompression at runtime”,

Commun. ACM, Vol. 46, pp. 54-60, 2003.

[47] Ozturk O., Saputra H., Kandemir M., and Kolcu I., “Access pattern-

based code compression for memory-constrained embedded

systems”, Des, Autom. Test Eur. Conf. Expo, pp. 882-887, 2005.

[48] Shogan S. and Childers B. R, “Compact binaries with code

compression in a software dynamic translator”, Des., Autom. Test

Eur. Conf. Expo, pp. 1052-1057, 2004.

[49] Sloss A.N, Symes D. and Wright C., “ARM System Developer's

Guide Designing and Optimizing System Software”, Morgan

Kaufmann Publishers, San Francisco, CA, 2004.

[50] Gerard P.M and Charles P.M, “The RISC Processor DMN-6: A

Unified Data-Control Flow Architecture”, ACM SIGARCH Computer

Architecture News, Vol. 24, 1996.

[51] Lee C., Potkonjak M. and Mangione-Smith W.H, “MediaBench: A

Tool for Evaluating and Synthesizing Multimedia and

Communication System”, Proc. Int'l Symp. Microarchitectures, pp.

330-335, 1997.

[52] Guthaus M. R, Ringenberg J. S, Ernst D., Austin T. M , Mudge T.

and Brown R. B, “MiBench: A free, Commercially Representative

Embedded Benchmark Suite”, In Proceedings of the 4th Annual

Workshop on Workload Characterization, pp. 3-14, 2001.

[53] EDN Embedded Microprocessor Benchmark Consortium, CA, 2013.

[54] Sima D., Fountain T. and Kacsuk P., “Advanced Computer

Architectures: A design space approach”, Pearson Education, 1997.

[55] Circello J., “The superscalar architecture of the MC68060”, IEEE

Micro, Vol. 15, pp. 10–21, 1995.


[56] AMD Athlon Processor x86 Code Optimization, chapter Appendix A:

AMD Athlon Processor Microarchitecture. AMD Inc., 220071-0

edition, 2000.

[57] Hinton G., “The microarchitecture of the Pentium 4 processor”, Intel

Technology Journal, Q1 2001.

[58] Heidi Pan and Krste Asanović, “Heads and Tails: A Variable-Length

Instruction Format Supporting Parallel Fetch and Decode”,

CASES’01, Atlanta, Georgia, USA, 2001.

[59] “microMIPS Instruction Set Architecture”, MIPS Technologies, Inc.,

2009

[60] Greg Osborn, “Embedded Microcontrollers and Processor Design”,

Pearson Education Limited, 2012.


LIST OF PUBLICATIONS

[1] Govindarajalu B. and Mehata K.M, “Code Size Reduction in

Embedded Systems with Redesigned ISA for RISC Processors”,

International Journal of Computer Applications, Vol. 64, No. 12,

pp. 38-45, 2013.

[2] Govindarajalu B. and Mehata K.M, “A case for hybrid instruction

encoding for reducing code size in embedded system-on-chips

based on RISC processor cores”, J. Comput. Sci., Vol. 10,

pp. 411-422, 2014.

[3] Govindarajalu B., Mehata K.M and Ramakrishnan R., “Enhanced

hybrid instruction encoding for portable embedded Systems”,

International Journal of Embedded Systems, InderScience

Publishers, Communicated.


APPENDIX 1

MIDACC ARCHITECTURE

A1.1 INTRODUCTION

Figure. A1.1 shows the functional block diagram of MIDACC

indicating different functional modules other than MIDACC extender.

Figure. A1.1: Functional block diagram of MIDACC

A1.2 MIDACC INTERNALS

MIDACC’s code analyser is named MIDA and the code converter,

MICC. The MIDACC extender is an independent software tool for

estimating the scope for compound and composite instructions for HIE-

MIPS.


A1.2.1 MIDA Internals

Given a MIPS32 code, MIDA profiles the code and produces the

following statistics.

A1.2.1.1 Instruction class distribution

The following are the steps involved in finding the instruction class

distribution:

1) Scan the instruction

2) Identify the instruction and the type using table A3.1 and

maintain the count for each instruction type

3) Increment the instruction address by 4 and repeat steps 1-3 for

all the instructions.

4) Add the count of instructions belonging to the same type to get

the total count for each instruction type.
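The four steps above can be sketched as follows. Table A3.1 is not reproduced in this section, so the two small lookup tables below are illustrative stand-ins covering only a handful of MIPS32 encodings, and the class names and the function names `classify` and `class_distribution` are assumptions. Stepping the instruction address by 4 corresponds here to iterating over a list of 32-bit words.

```python
from collections import Counter

# Illustrative stand-in for Table A3.1: (mnemonic, class) for a few encodings.
OPCODE_TABLE = {
    0x09: ("addiu", "ALU"),
    0x23: ("lw", "load"),
    0x2B: ("sw", "store"),
    0x04: ("beq", "branch"),
    0x02: ("j", "jump"),
}
FUNCT_TABLE = {  # used when the opcode field is 000000
    0x21: ("addu", "ALU"),
    0x00: ("sll", "shift"),
    0x08: ("jr", "jump"),
}


def classify(word):
    """Step 2: identify the instruction and its type from a 32-bit word."""
    opcode = word >> 26
    if opcode == 0:
        return FUNCT_TABLE.get(word & 0x3F, ("unknown", "other"))
    return OPCODE_TABLE.get(opcode, ("unknown", "other"))


def class_distribution(words):
    """Steps 1-4: count each instruction, then total the counts per class."""
    per_instruction = Counter(classify(w) for w in words)
    per_class = Counter()
    for (_, cls), n in per_instruction.items():
        per_class[cls] += n
    return per_instruction, per_class
```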

A1.2.1.2 Instruction distribution

The following are the steps involved in instruction distribution

identification:

1) Scan the instruction

2) Using table A3.1 identify the instruction name. Maintain a count
for each instruction.

3) Increment the instruction address by 4 and repeat steps (1) and
(2) for all the instructions.


A1.2.1.3 MIPS Code redundant 0’s Distribution

The following steps are used to find the MIPS code redundant 0’s

distribution:

1) Scan the instruction

2) Identify the instruction as per A1.2.1.2. Maintain a count for
each instruction. Identify the RZ value of the instruction using
table 4.4.

3) Increment the instruction address by 4 and repeat steps (1) and
(2) until reaching the end of the program.

4) Multiply the count of each instruction by the corresponding RZ
value and add the values obtained for all instructions to get the
TRZ value.
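Step 4 is a weighted sum of instruction counts. A minimal sketch is given below; the RZ values in the usage example are placeholders chosen for illustration and are not the values of table 4.4, and the function name `total_redundant_zeros` is hypothetical.

```python
from collections import Counter


def total_redundant_zeros(mnemonics, rz_table):
    """TRZ = sum over instructions of (count of instruction) * (its RZ value).

    mnemonics: instruction names in program order (already decoded).
    rz_table:  redundant-zero bits per instruction name.
    """
    counts = Counter(mnemonics)
    return sum(n * rz_table[name] for name, n in counts.items())


# Placeholder RZ values for illustration only (not taken from table 4.4).
example_rz = {"addu": 5, "jr": 15}
example_trz = total_redundant_zeros(["addu", "addu", "jr"], example_rz)
```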

A1.2.1.4 Branch instruction distribution

The following steps are used to find the branch instruction

distribution:

1) Scan the instruction

2) Identify the type of each instruction using table A3.1. If

instruction type is branch, then go to step 3. Otherwise go to

step 4.

3) Identify the instruction and maintain a count for each branch

instruction.




A1.2.1.5 WASTIO Calculation

The algorithm in Figure. A1.2 is used to compute the WASTIO

percentage

Input : Object code of program
Output : WASTIO Percentage
Algorithm
a=0, b=0, c=0
for each instruction in the program
    if instruction format == I || instruction format == O then
        Read the value of the immediate/offset field of the instruction
        if all zeros in both LSB and MSB then
            a=a+1
        else if all zeros in LSB then
            b=b+1
        else if all zeros in MSB then
            c=c+1
        else
            do nothing
        fi
    fi
end
WASTIO=2a+b+c
WASTIO Percentage = (WASTIO/Object code size) * 100

Figure. A1.2: Algorithm for WASTIO calculation
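A minimal Python sketch of the WASTIO calculation is given below. It assumes that LSB and MSB denote the least and most significant bytes of the 16-bit immediate/offset field, that the object code is supplied as a list of 32-bit words, and that the I/O-format instructions are recognised by a small illustrative opcode set; the names `IMM16_OPCODES` and `wastio` are hypothetical and this is not the MIDACC implementation.

```python
# Illustrative subset of MIPS32 opcodes carrying a 16-bit immediate or offset.
IMM16_OPCODES = {
    0x04, 0x05,                    # beq, bne
    0x08, 0x09, 0x0A, 0x0B,        # addi, addiu, slti, sltiu
    0x0C, 0x0D, 0x0E, 0x0F,        # andi, ori, xori, lui
    0x20, 0x21, 0x23, 0x24, 0x25,  # lb, lh, lw, lbu, lhu
    0x28, 0x29, 0x2B,              # sb, sh, sw
}


def wastio(words):
    """Return (wasted bytes, WASTIO percentage) for a list of 32-bit words."""
    a = b = c = 0
    for word in words:
        if (word >> 26) not in IMM16_OPCODES:
            continue
        imm = word & 0xFFFF
        low_byte, high_byte = imm & 0xFF, imm >> 8
        if low_byte == 0 and high_byte == 0:
            a += 1            # both bytes of the field are zero
        elif low_byte == 0:
            b += 1            # only the low byte is zero
        elif high_byte == 0:
            c += 1            # only the high byte is zero
    wasted = 2 * a + b + c
    code_size = 4 * len(words)  # object code size in bytes
    percentage = 100.0 * wasted / code_size if code_size else 0.0
    return wasted, percentage
```

For a four-instruction program with one zero immediate, one immediate with a zero low byte, one with a zero high byte and one R-format instruction, the sketch reports 4 wasted bytes out of 16.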

A1.2.1.6 Population of FTFI

The algorithm in Figure. A1.3 is used to compute the population of

FTFI.


Input : Object code of program
Output : Population of FTFI
Algorithm
addu_count=0
addiu_count=0
lw_count=0
sw_count=0
for each instruction in the program
    Read the value of the OP and OPX fields of the instruction
    if OP==000000 && OPX==100001 then
        addu_count=addu_count+1
    else if OP==001001 then
        addiu_count=addiu_count+1
    else
        do nothing
    fi
    Check the hex value of MSB of the instruction
    if value==0x8C || value==0x8D || value==0x8E || value==0x8F then
        lw_count=lw_count+1
    else if value==0xAC || value==0xAD || value==0xAE || value==0xAF then
        sw_count=sw_count+1
    else
        do nothing
    fi
end
FTFI=addu_count+addiu_count+lw_count+sw_count

Figure. A1.3: Algorithm for Population of FTFI
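The same count can be sketched in Python by testing the 6-bit opcode directly, which is equivalent to checking the most significant byte against 0x8C-0x8F (lw) and 0xAC-0xAF (sw). The function name `ftfi_population` and the representation of the object code as a list of 32-bit words are assumptions made for this example.

```python
from collections import Counter


def ftfi_population(words):
    """Count addu, addiu, lw and sw instructions in a list of 32-bit words."""
    counts = Counter(addu=0, addiu=0, lw=0, sw=0)
    for word in words:
        opcode = word >> 26      # OP field
        funct = word & 0x3F      # OPX (function) field
        if opcode == 0x00 and funct == 0x21:
            counts["addu"] += 1
        elif opcode == 0x09:
            counts["addiu"] += 1
        elif opcode == 0x23:     # top byte 0x8C..0x8F
            counts["lw"] += 1
        elif opcode == 0x2B:     # top byte 0xAC..0xAF
            counts["sw"] += 1
    return counts, sum(counts.values())
```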

A1.2.1.7 Registers usage behaviour

The following are the steps involved in finding the registers usage

behaviour:

1) Scan the instruction

2) Identify the instruction type and format using Table A3.1

3) If instruction type is ALU and format is R, then go to step 4. If
the instruction type is ALU and format is I, then go to step 5.
Otherwise go to step 6.

4) For R-format instructions, check the MSB of the rs, rt and rd fields
for each combination and maintain a count of each instruction
for each pattern.

5) For I-format instructions, check the MSB of the rs and rt fields for
each combination and maintain a count of each instruction for
each pattern
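Steps 4 and 5 amount to recording the most significant bit of each 5-bit register field. In the sketch below the instructions are assumed to be already identified as (mnemonic, format, word) triples, since Table A3.1 is not reproduced here; the function name `register_msb_patterns` is hypothetical.

```python
from collections import Counter


def register_msb_patterns(instructions):
    """Count MSB patterns of register fields per ALU instruction.

    instructions: iterable of (mnemonic, fmt, word) with fmt "R" or "I";
    any other format is skipped.
    """
    patterns = Counter()
    for name, fmt, word in instructions:
        rs = (word >> 21) & 0x1F
        rt = (word >> 16) & 0x1F
        rd = (word >> 11) & 0x1F
        if fmt == "R":
            key = (name, rs >> 4, rt >> 4, rd >> 4)
        elif fmt == "I":
            key = (name, rs >> 4, rt >> 4)
        else:
            continue
        patterns[key] += 1
    return patterns
```

An MSB of 0 means the register number lies in 0-15, so a pattern of all zeros marks an instruction whose register fields would each fit in four bits.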



A1.2.1.8 Shift length usage

The algorithm in Figure. A1.4 is used to compute the shift length

usage.

Input : Object code of program
Output : Percentage of shift amount between 16-31
Algorithm
for each instruction in the program
    Read the value of OP and OPX field
    if OP==000000 && OPX==000000 then
        Read the value of MSB of sa field
        if MSB==0 then
            sll_count_zero=sll_count_zero+1
        else
            sll_count_one=sll_count_one+1
        fi
    else if OP==000000 && OPX==000011 then
        Read the value of MSB of sa field
        if MSB==0 then
            sra_count_zero=sra_count_zero+1
        else
            sra_count_one=sra_count_one+1
        fi
    else if OP==000000 && OPX==000010 then
        Read the value of MSB of sa field
        if MSB==0 then
            srl_count_zero=srl_count_zero+1
        else
            srl_count_one=srl_count_one+1
        fi
    else
        do nothing
    fi
end
Shift amount between 0 and 15 = sll_count_zero + sra_count_zero + srl_count_zero
Shift amount between 16 and 31 = sll_count_one + sra_count_one + srl_count_one
Total usage of shift amount = (Shift amount between 0 and 15) + (Shift amount between 16 and 31)
Percentage Shift amount between 16 and 31
    = 100 * ((Shift amount between 16 and 31) / (Total usage of shift amount))

Figure. A1.4: Algorithm for Shift Length usage computation
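A compact Python sketch of the shift length computation follows. It assumes the object code is a list of 32-bit words and takes the denominator to be all sll, sra and srl instructions regardless of shift amount, which is an interpretation of "total usage"; the function name `shift_length_usage` is hypothetical.

```python
from collections import Counter

SHIFT_FUNCTS = {0x00: "sll", 0x03: "sra", 0x02: "srl"}


def shift_length_usage(words):
    """Return (low counts, high counts, percentage of shifts with sa in 16-31)."""
    low, high = Counter(), Counter()
    for word in words:
        if word >> 26 != 0:
            continue                      # not an R-format (OP == 000000) word
        name = SHIFT_FUNCTS.get(word & 0x3F)
        if name is None:
            continue                      # not sll, sra or srl
        sa = (word >> 6) & 0x1F
        if sa >> 4:                       # MSB of the 5-bit sa field is 1
            high[name] += 1
        else:
            low[name] += 1
    total = sum(low.values()) + sum(high.values())
    percentage = 100.0 * sum(high.values()) / total if total else 0.0
    return low, high, percentage
```

Note that the MIPS32 nop is encoded as sll with a zero shift amount, so a profile built this way counts nops among the sll instructions with an sa MSB of 0.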

A1.2.1.9 Immediate field usage pattern

The following are the steps involved in finding the immediate field

usage pattern:

1) Scan the instruction

2) Identify the instruction type and format using Table A3.1

3) If instruction type is ALU and format is I, then go to step 4.


4) Check the immediate fields for all combinations and maintain a

count of each instruction for each combination.




A1.2.1.10 Offset field usage pattern

The following are the steps involved in finding the offset field usage

pattern:

1) Scan the instruction

2) Identify the instruction format using Table A3.1

3) If the instruction format belongs to offset, then go to step 4.


4) Check the offset fields for all combinations and maintain a

count of each instruction for each combination.



A1.2.2 MICC Internals

A1.2.2.1 HIE1 code conversion

Figure. A1.5 shows the algorithm for HIE1 code conversion.

Input : MIPS32 instruction
Output : HIE1 instruction
Algorithm
for all the instructions
    Retain opcode of all instructions as such
    Identify HIE1 instruction type using table A4.1
    if(HIE1 type==H || HIE1 type==I)
        Retain instruction as such
        Go to the next instruction
    else if(HIE1 type==A)
        Delete all other existing fields
        Insert 2-bit iid field with value as shown in table A4.1
    else if(HIE1 type==B)
        Insert 1-bit iid field with value as shown in table A4.1
        Remove 1-bit from rt field
        Retain rd field as such
    else if(HIE1 type==C)
        Remove 1-bit from rd/rs field
        Retain function bits as such
    else if(HIE1 type==D)
        Remove 1-bit each from rs, rt and rd field
    else if(HIE1 type==E)
        Remove 1-bit each from rt, rd and sa field
    else if(HIE1 type==F)
        Remove 1-bit each from rs and rt field
        Delete 6 unused 0’s
    else if(HIE1 type==G)
        Insert 2-bit hl field using hl HIE1 MIPS encoding in table 4.5
    fi
end

Figure. A1.5 : HIE1 code conversion algorithm


A1.2.2.2 HIE2 code conversion

Figure. A1.6 shows the algorithm for HIE2 code conversion.

Input : MIPS32 instruction
Output : HIE2 instruction
Algorithm
for all the instructions
    Identify HIE2 instruction type using table A5.1
    if(HIE2 type==H || HIE2 type==I)
        Retain instruction as such
        Go to the next instruction
    fi
    if(HIE2 type==G1 || HIE2 type==G2 || HIE2 type==G3)
        insert it field with it=0
        if(HIE2 type==G1)
            insert 1-bit hl field using hl encoding in table A5.2
        else if(HIE2 type==G2)
            insert 2-bit hl field
        else
            insert 2-bit hl field
        fi
    else
        insert it field with it=1
    fi
    Replace 6-bit MIPS32 opcode with 5-bit HIE2 opcode using table A5.1
    if(HIE2 type==A || HIE2 type==E || HIE2 type==G3)
        insert 2-bit iid field with value as shown in table A5.1
    fi
    if(HIE2 type==C || HIE2 type==D1 || HIE2 type==D2)
        insert 3-bit iid field with value as shown in table A5.1
    fi
end

Figure. A1.6 : HIE2 code conversion algorithm

A1.2.2.3 RMA Code Conversion

Figure. A1.7 depicts the overview of the RMA code conversion
process for the sequence “load instruction followed by ALU type
instruction”. Figure. A1.8 depicts the overview of the RMA code conversion
process for the sequence “ALU type instruction followed by Store
instruction”.

232

Figure. A1.7: Overview of RMA code conversion for load sequence

NO

YES

NO

YES Type J

TYPE R / Type I

YES

NO

Test opcode of current instruction

Create RMA table and Step1-Jump address table

Convert MSB of instruction from HEX format to decimal format

MSB = MSB >> 2

Is MSB== 23

Is Instruction type == ALU

Instruction type?

Is qualifying Load?

Create new RMA instruction

Decrement all subsequent instruction addresses by 4

Test opcode of next instruction

Increment instruction address by 4

To identify Load

instruction

233

Figure. A1.8: Overview of RMA code conversion for store sequence

[Flowchart not reproduced. Box labels: Create RMA table and Step1-Jump address table; Test opcode of current instruction; Instruction type? (Type R, Type I / Type J); If Instruction type == ALU; Test opcode of next instruction; Convert MSB of instruction from HEX format to decimal format; MSB = MSB >> 2; Is MSB == 43? (to identify Store instruction); Is qualifying Store?; Create new RMA instruction; Decrement all subsequent instruction addresses by 4; Increment instruction address by 4.]
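The opcode test used in both flowcharts (shift the most significant byte right by two bits and compare it with the load or store opcode) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and only lw and sw are tested, using the opcodes listed in Table A3.1 (lw = 100011, sw = 101011). Note that 100011 is 23 in hexadecimal and 35 in decimal, while 101011 is 2B in hexadecimal and 43 in decimal.

```python
LW_OPCODE = 0b100011  # 0x23, i.e. 35 in decimal (Table A3.1)
SW_OPCODE = 0b101011  # 0x2B, i.e. 43 in decimal (Table A3.1)


def major_opcode_from_msb(msb_hex):
    """Return the 6-bit major opcode held in the top bits of the MSB.

    msb_hex is the most significant byte of the instruction as a
    two-digit hexadecimal string, e.g. "8C".
    """
    msb = int(msb_hex, 16)   # convert from hexadecimal text to an integer
    return msb >> 2          # drop the two low bits, which belong to rs


def is_load_word(msb_hex):
    return major_opcode_from_msb(msb_hex) == LW_OPCODE


def is_store_word(msb_hex):
    return major_opcode_from_msb(msb_hex) == SW_OPCODE


if __name__ == "__main__":
    for byte in ("8C", "8F", "AC", "AF", "90"):
        print(byte, major_opcode_from_msb(byte),
              is_load_word(byte), is_store_word(byte))
```

The byte values 8C to 8F and AC to AF are the OP byte patterns that Table A3.1 lists for lw and sw respectively.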

A1.2.2.4 RMA+HIE1 code conversion

Figure. A1.9 depicts the overview of the RMA+HIE1 code conversion process. The flowchart has two steps: perform RMA code conversion, then perform HIE1 code conversion.

Figure. A1.9: RMA+HIE1 code conversion

Figure. A1.10 depicts the overview of the RMA+HIE2 code conversion process. The flowchart has two steps: perform RMA code conversion, then perform HIE2 code conversion.

Figure. A1.10: RMA+HIE2 code conversion

A1.3 MIDACC EXTENDER

The MIDACC Extender estimates the scope for the following three requirements: use of compound instructions for D1 and E type instructions in HIE2 code; conversion of the add and addu instructions that have the same register for both source operands into two-address instructions; and use of two composite instructions, loadmultiple and storemultiple.

APPENDIX 2

MIDACC USER GUIDE

A2.1 INTRODUCTION

This user guide briefly describes the features of MIDACC, a custom-built code analyser cum code converter tool suite, and explains how to use it. MIDACC’s code analyser is named MIDA and its code converter is named MICC. MIDACC is a Windows-based application designed to run on Microsoft Windows XP and above with .NET Framework 3.5 and above. Instructions for using the MIDACC Extender are available on the MIDACC website.

A2.2 INSTALLING MIDACC

Visit the following URL

https://midacc.wordpress.com/

and download the zip file, MIDACC Suite.zip, to your system/PC. Unzip the file and you should see the folder on your system/PC as shown in Figure. A2.1.

Figure. A2.1: Snapshot of MIDACC Suite installation folder

Right-click the “MIDACC Suite Windows Installer” file and select

“Install”. You will see the welcome screen as shown in Figure. A2.2.

Figure. A2.2: Snapshot of MIDACC Suite welcome screen

Click “Next” to proceed to the next screen as shown in Figure. A2.3.

Select the installation directory where MIDACC has to be installed. The

default directory is “C:\ProgramFiles\MIDACC Suite”. Choose the user for

whom MIDACC suite has to be installed.

Figure. A2.3: Snapshot of MIDACC installation screen


Click “Next” to proceed to the installation process shown in

Figure. A2.4.

Figure. A2.4: Snapshot of MIDACC installation process

Click “Next” to start the installation. You will see a screen as shown in

Figure. A2.5 showing the installation status.

Figure. A2.5: Snapshot of MIDACC installation status

You have successfully installed the MIDACC suite when you see the screen shown in Figure. A2.6. Click “Close” to complete the installation.

Figure. A2.6: Snapshot of MIDACC installation completion

When properly installed, MIDACC can be launched by clicking the “MIDACC suite” icon on the PC desktop or by selecting it from the “Start Menu” as shown in Figure. A2.7.

Figure. A2.7: Snapshot of MIDACC icon in desktop and start menu


When launched, you will see two tabs across the top of the MIDACC

suite as shown in Figure. A2.8.

Figure. A2.8: Snapshot of MIDACC Suite tool

A2.3 INPUT FORMAT REQUIRED BY MIDACC

The object code is obtained from the C program by cross compilation using a cross compiler such as the Sourcery CodeBench tool (for details, refer to the cross compilation procedure in section A2.6).

Figure. A2.9: Snapshot of assembly code of SUSAN

Figure. A2.9 shows the snapshot of the object code of the SUSAN program obtained by the cross compilation process. The object code obtained is then manually processed to get the input format accepted by the MIDACC suite, as shown in Figure. A2.10.

Figure. A2.10: Snapshot of input format accepted by MIDACC

A2.4 USING MIDACC

A2.4.1 MIDA Tab

When the program is launched, select the “MIDA” tab at the top of the MIDACC suite. Figure. A2.11 shows the snapshot of the MIDA tab. MIDA profiles the object code and provides reports such as the static instruction frequencies and the total amount of underutilization of the offset and immediate fields in the object code.

Figure. A2.11: Snapshot of MIDA Tab

Click on the “Browse” button to select the location of the input object code file. Click on the desired file and click on the “Open” button. To start the code analysis process, click on the “Perform Code Analysis” button. Once the process is completed, you will receive a message as shown in Figure. A2.12 and a message in the status bar notifying the location of the reports generated by the code analysis process.

Figure. A2.12: Snapshot of code analysis process using MIDA Tab

The reports generated by the code analysis process are:

A. Instruction class distribution

B. Instruction distribution

C. MIPS code redundant 0’s distribution

D. Branch Instruction distribution

E. WASTIO calculation

F. Population of frequently used top four instructions (FTFI)

G. Register usage behaviour

H. Shift length usage

I. Immediate field usage pattern

J. Offset field usage pattern


A2.4.2 MICC Tab

The MICC tab converts the object code from MIPS ISA to HIE/RMA ISA and measures the code size savings for the application. Figure. A2.13 shows the snapshot of the MICC tab of the MIDACC suite. The MICC tab has five code conversion functions:

HIE1 Conversion

HIE2 Conversion

RMA Conversion

RMA + HIE1 Conversion

RMA + HIE2 Conversion

Figure. A2.13: Snapshot of MICC Tab

Click on the “Browse” button to browse to the location of the input object code file. Click on the desired file and click on the “Open” button. To start the code conversion process, click on any one of the five buttons shown in Figure. A2.13. Once the process is completed, you will receive a message as shown in the following figures and a message in the status bar notifying the location of the reports generated by the respective code conversion process.


Figure. A2.14 shows the snapshot of the HIE1 code conversion

process. With HIE1 code conversion the reports generated are:

A. HIE1 Code Redundant 0’s distribution

B. Code size summary

C. Percentage Code Reduction (PCR)

Figure. A2.14: Snapshot of HIE1 Code conversion process



Figure. A2.15 shows the snapshot of the HIE2 code conversion process. With HIE2 code conversion the reports generated are:

A. HIE2 Code Redundant 0’s distribution

Figure. A2.15: Snapshot of HIE2 Code conversion process

A2.4.2.3 RMA Code conversion

Figure. A2.16 shows the snapshot of the RMA code conversion

process. With RMA code conversion the reports generated are:

A. RMA scope analysis



Figure. A2.16: Snapshot of RMA Code conversion process



Figure. A2.17 shows the snapshot of the RMA+HIE1 code conversion

process. With RMA+HIE1 code conversion the reports generated are:




Figure. A2.17: Snapshot of RMA+HIE1 Code conversion process



Figure. A2.18 shows the snapshot of the RMA+HIE2 code conversion process. With RMA+HIE2 code conversion the reports generated are:

Figure. A2.18: Snapshot of RMA+HIE2 Code conversion process

A2.5 SAMPLE OUTPUT OBTAINED USING MIDACC

This section gives the sample output generated using the MIDACC suite for the SUSAN program in the MiMedia Benchmark.

A2.5.1 Code Analysis Report

A. Instruction class distribution: Program: susan

--------------------------------------------------------------------------------

Instruction type Count Percentage

--------------------------------------------------------------------------------

LOAD 4924 38.62

STORE 1495 11.725

ALU 4097 32.133

INTERRUPT 608 4.769

COMPARE 276 2.165

CONMANIP 138 1.082

BRANCH 361 2.831

JUMP 279 2.188

DATA MOVE 139 1.09

REF 0 0

--------------------------------------------------------------------------------

TOTAL 12317

--------------------------------------------------------------------------------
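The percentages in the report above do not add up to 100 because they appear to be computed against the total number of 32-bit words in the code rather than against the instruction total of 12317. The following sketch reproduces them under that assumption: the 51000-byte MIPS32 code size is taken from the code sizes summary in section A2.5.2, and the divisor of 12750 words (51000 / 4) is an inference from the reported figures, not a documented MIDACC formula.

```python
CODE_SIZE_BYTES = 51000            # MIPS32 code size reported for susan
TOTAL_WORDS = CODE_SIZE_BYTES // 4  # 12750 words of 32 bits (assumed divisor)

# Counts copied from the instruction class distribution report.
class_counts = {
    "LOAD": 4924, "STORE": 1495, "ALU": 4097, "INTERRUPT": 608,
    "COMPARE": 276, "CONMANIP": 138, "BRANCH": 361, "JUMP": 279,
    "DATA MOVE": 139, "REF": 0,
}


def percentage(count, total=TOTAL_WORDS):
    """Share of the code words taken by `count` instructions, in percent."""
    return round(100.0 * count / total, 3)


if __name__ == "__main__":
    for name, count in class_counts.items():
        print(f"{name:10s} {count:6d} {percentage(count):8.3f}")
    print("TOTAL", sum(class_counts.values()))
```
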


B. Instruction Distribution: Program: susan

----------------------------------------------------------------------------------------------------

Instruction Count % Cumulative Count Cumulative % Type

----------------------------------------------------------------------------------------------------

LW 3883 30.455 3883 30.455 Load, O

ADDU 1986 15.576 5869 46.031 ALU, R

SW 1310 10.275 7179 56.306 Store, O

ADDIU 1129 8.855 8308 65.161 ALU, I

LBU 972 7.624 9280 72.785 Load, O

NOP 608 4.769 9888 77.554 Interrupt

SUBU 458 3.592 10346 81.146 ALU, R

SLL 423 3.318 10769 84.464 ALU, R

SLT 200 1.569 10969 86.033 Compare, R

BEQ 171 1.341 11140 87.374 Branch, O

SB 166 1.302 11306 88.676 Store, O

BNE 154 1.208 11460 89.884 Branch, O

LUI 138 1.082 11598 90.966 CONMANIP, I

J 115 0.902 11713 91.868 Jump, T

JAL 112 0.878 11825 92.746 Jump, T

MTCZ 105 0.824 11930 93.57 Data move, R - O

LWCZ 65 0.51 11995 94.08 Load, O

SLTIU 52 0.408 12047 94.488 Compare, I

JR 47 0.369 12094 94.857 Jump, R

ANDI 40 0.314 12134 95.171 ALU, I

BCZT 23 0.18 12157 95.351 Branch, O

SLTI 22 0.173 12179 95.524 Compare, I

SWCZ 19 0.149 12198 95.673 Store, O

MFCZ 17 0.133 12215 95.806 DM, R - O

OR 16 0.125 12231 95.931 ALU, R

ORI 16 0.125 12247 96.056 ALU, I

AND 11 0.086 12258 96.142 ALU, R

MFHI 10 0.078 12268 96.22 DM, R

BLEZ 9 0.071 12277 96.291 Branch, O

XOR 8 0.063 12285 96.354 ALU, R

DIV 7 0.055 12292 96.409 ALU, R

MFLO 7 0.055 12299 96.464 DM, R

JALR 5 0.039 12304 96.503 Jump, R


BGEZ 4 0.031 12308 96.534 Branch, O

LB 4 0.031 12312 96.565 Load, O

MULT 3 0.024 12315 96.589 ALU, R

SLTU 2 0.016 12317 96.605 Compare, R

ADD 0 0 12317 96.605 ALU, R

ADDI 0 0 12317 96.605 ALU, I

DIVU 0 0 12317 96.605 ALU, R

MULTU 0 0 12317 96.605 ALU, R

NOR 0 0 12317 96.605 ALU, R

SLLV 0 0 12317 96.605 ALU, R

SRA 0 0 12317 96.605 ALU, R

SRAV 0 0 12317 96.605 ALU, R

SRL 0 0 12317 96.605 ALU, R

SRLV 0 0 12317 96.605 ALU, R

SUB 0 0 12317 96.605 ALU, R

XORI 0 0 12317 96.605 ALU, I

BCZF 0 0 12317 96.605 Branch, O

BGEZAL 0 0 12317 96.605 Branch, O

BGTZ 0 0 12317 96.605 Branch, O

BLTZAL 0 0 12317 96.605 Branch, O

BLTZ 0 0 12317 96.605 Branch, O

LH 0 0 12317 96.605 Load, O

LHU 0 0 12317 96.605 Load, O

LWL 0 0 12317 96.605 Load, O

LWR 0 0 12317 96.605 Load, O

SH 0 0 12317 96.605 Store, O

SWL 0 0 12317 96.605 Store, O

SWR 0 0 12317 96.605 Store, O

MTHI 0 0 12317 96.605 DM, R

MTLO 0 0 12317 96.605 DM, R

SYSCALL 0 0 12317 96.605 Interrupt, R - O

BREAK 0 0 12317 96.605 Interrupt, R - O

REF 0 0 12317 96.605

----------------------------------------------------------------------------------------------------

Total 12317

----------------------------------------------------------------------------------------------------


C. MIPS Code Redundant 0's Distribution: Program: susan

------------------------------------------------------------------------------------------------------

Instruction Count RZ(bits)

------------------------------------------------------------------------------------------------------

ADDU 1986 9930

NOP 608 15808

SUBU 458 2290

SLT 200 1000

LUI 138 690

MTCZ 105 1155

JR 47 752

BCZT 23 92

MFCZ 17 187

OR 16 80

AND 11 55

MFHI 10 150

BLEZ 9 45

XOR 8 40

DIV 7 70

MFLO 7 105

JALR 5 50

BGEZ 4 16

MULT 3 30

SLTU 2 10

ADD 0 0

DIVU 0 0

MULTU 0 0

NOR 0 0

SLLV 0 0

SRAV 0 0

SRLV 0 0

SUB 0 0

BCZF 0 0

BGTZ 0 0

BLTZ 0 0

MTHI 0 0

MTLO 0 0

SYSCALL 0 0

REF 0 0

------------------------------------------------------------------------------------------------------

Total : 32555 bits

TRZ : 4069 bytes

Percentage of TRZ : 7.979


------------------------------------------------------------------------------------------------------
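The totals of the redundant zeros report can be cross-checked with a short calculation. The sketch below sums the non-zero RZ(bits) entries copied from the table, converts the sum to bytes, and expresses it as a percentage of the code size. It assumes the 51000-byte MIPS32 code size reported in section A2.5.2 as the denominator and uses the unrounded byte count, which is what reproduces the reported percentage; the variable names are illustrative only.

```python
CODE_SIZE_BYTES = 51000  # MIPS32 code size reported for susan (assumed denominator)

# Non-zero RZ(bits) entries copied from the report.
rz_bits = {
    "ADDU": 9930, "NOP": 15808, "SUBU": 2290, "SLT": 1000, "LUI": 690,
    "MTCZ": 1155, "JR": 752, "BCZT": 92, "MFCZ": 187, "OR": 80,
    "AND": 55, "MFHI": 150, "BLEZ": 45, "XOR": 40, "DIV": 70,
    "MFLO": 105, "JALR": 50, "BGEZ": 16, "MULT": 30, "SLTU": 10,
}

total_bits = sum(rz_bits.values())   # total redundant zero bits
trz_bytes = total_bits / 8           # unrounded byte count
trz_percent = round(100.0 * trz_bytes / CODE_SIZE_BYTES, 3)

if __name__ == "__main__":
    print("Total (bits):", total_bits)
    print("TRZ (whole bytes):", int(trz_bytes))
    print("Percentage of TRZ:", trz_percent)
```
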

D. Branch Instruction Distribution: Program: susan

------------------------------------------------------------------------------------------------------

Instruction Count Percentage Cumulative Count Cumulative Percentage Type

------------------------------------------------------------------------------------------------------

BEQ 171 1.341 171 1.341

BNE 154 1.208 325 2.549

BCZT 23 0.18 348 2.729

BLEZ 9 0.071 357 2.8

BGEZ 4 0.031 361 2.831

BCZF 0 0 361 2.831

BGEZAL 0 0 361 2.831

BGTZ 0 0 361 2.831

BLTZAL 0 0 361 2.831

BLTZ 0 0 361 2.831

------------------------------------------------------------------------------------------------------

E. WASTIO Calculation Program: susan

------------------------------------------------------------------------------------------------------

Instruction Count a b c

------------------------------------------------------------------------------------------------------

LW 3838 185 1 3652

ADDU 1986 - - -

SW 1304 20 0 1284

LBU 971 739 0 232

ADDIU 691 1 4 686

NOP 608 - - -

SUBU 458 - - -

SLL 423 - - -

SLT 200 - - -

SB 165 39 0 126

LUI 135 1 5 129


MTCZ 105 - - -

BEQ 91 0 0 91

BNE 65 0 0 65

SLTIU 52 0 0 52

JR 47 - - -

LWCZ 44 0 0 44

ANDI 40 0 0 40

BCZT 23 0 0 23

SLTI 19 0 0 19

SWCZ 19 0 0 19

MFCZ 17 - - -

OR 16 - - -

AND 11 - - -

MFHI 10 - - -

XOR 8 - - -

DIV 7 - - -

MFLO 7 - - -

BLEZ 5 0 0 5

JALR 5 - - -

LB 4 2 0 2

MULT 3 - - -

BGEZ 3 0 0 3

SLTU 2 - - -

J 1 0 1 0

JAL 1 0 1 0

ADD 0 - - -

ADDI 0 0 0 0

DIVU 0 - - -

MULTU 0 - - -

NOR 0 - - -

ORI 0 0 0 0

SLLV 0 - - -


SRA 0 - - -

SRAV 0 - - -

SRL 0 - - -

SRLV 0 - - -

SUB 0 - - -

XORI 0 0 0 0

BCZF 0 0 0 0

BGEZAL 0 0 0 0

BGTZ 0 0 0 0

BLTZAL 0 0 0 0

BLTZ 0 0 0 0

LH 0 0 0 0

LHU 0 0 0 0

LWL 0 0 0 0

LWR 0 0 0 0

SH 0 0 0 0

SWL 0 0 0 0

SWR 0 0 0 0

MTHI 0 - - -

MTLO 0 - - -

SYSCALL 0 - - -

BREAK 0 - - -

REF 0 - - -

------------------------------------------------------------------------------------------------------

WASTIO 1974 12 6472

WASTIO Percentage 3.871 0.024 12.69

------------------------------------------------------------------------------------------------------

Total WASTIO Percentage = 16.585


F. Population of frequently used top four instructions (FTFI) Program: susan

------------------------------------------------------------------------------------------------------

Instruction Count

------------------------------------------------------------------------------------------------------

ADDU 1986

ADDIU 1129

LW 3883

SW 1310

------------------------------------------------------------------------------------------------------

Sum of FTFI 8308

Percentage of FTFI 67.451

------------------------------------------------------------------------------------------------------

G. Registers Usage behaviour Program: susan

------------------------------------------------------------------------------------------------------

1. ALU - R Type Instructions (partial)

------------------------------------------------------------------------------------------------------

rs rt rd ADD ADDU AND DIV DIVU MULT MULTU NOR OR SLL SLLV

------------------------------------------------------------------------------------------------------

0 0 0 0 1925 10 7 0 3 0 0 16 420 0

0 0 1 0 5 0 0 0 0 0 0 0 0 0

0 1 0 0 11 0 0 0 0 0 0 0 0 0

0 1 1 0 0 0 0 0 0 0 0 0 3 0

1 0 0 0 5 0 0 0 0 0 0 0 0 0

1 0 1 0 37 1 0 0 0 0 0 0 0 0

1 1 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 0 3 0 0 0 0 0 0 0 0 0

------------------------------------------------------------------------------------------------------

Total 0 1986 11 7 0 3 0 0 16 423 0

A16 0 61 1 0 0 0 0 0 0 3 0

------------------------------------------------------------------------------------------------------

R Type Registers Access Summary

------------------------------------------------------------------------------------------------------

Total Access To Registers : 3131

Access To Registers 16 - 31 : 71

Percentage Access To Registers 16 - 31 : 2.268

------------------------------------------------------------------------------------------------------


2. ALU - I Type Instructions

------------------------------------------------------------------------------------------------------

rs rt ADDI ADDIU ANDI ORI XORI SLTI SLTIU

------------------------------------------------------------------------------------------------------

0 0 0 1024 40 16 0 22 52

0 1 0 21 0 0 0 0 0

1 0 0 22 0 0 0 0 0

1 1 0 62 0 0 0 0 0

------------------------------------------------------------------------------------------------------

Total 0 1129 40 16 0 22 52

A16 0 105 0 0 0 0 0

-----------------------------------------------------------------------------------------------------

I Type Registers Access Summary

------------------------------------------------------------------------------------------------------

Total Access To Registers : 1259

Access To Registers 16 - 31 : 105

Percentage Access To Registers 16 - 31 : 8.34

-----------------------------------------------------------------------------------------------------

H. Shift Length Usage Program: susan

-----------------------------------------------------------------------------------------------------

sa SLL SRA SRL

------------------------------------------------------------------------------------------------------

0 414 0 0

1 9 0 0

-----------------------------------------------------------------------------------------------------

Total 423 0 0

A16 9 0 0

------------------------------------------------------------------------------------------------------

Shift Amount Summary

-----------------------------------------------------------------------------------------------------

Total usage of shift amount = 423

Number of cases for shift amount between 16 - 31 = 9

Percentage of shift amount between 16 - 31 = 2.128


------------------------------------------------------------------------------------------------------

I. Immediate Field Usage pattern (partial) Program: susan

------------------------------------------------------------------------------------------------------

Instruction All 0's 01X 001X 0001X 00001X

------------------------------------------------------------------------------------------------------

ADDIU 1 3 4 0 1 1

ANDI 0 0 0 0 0 0

ORI 0 11 0 0 0 0

XORI 0 0 0 0 0 0

SLTI 0 0 0 0 0 0

SLTIU 0 0 0 0 0 0

------------------------------------------------------------------------------------------------------

Total 1 14 4 0 1 1

Percentage 0.121 1.697 0.485 0 0.121 0.121

------------------------------------------------------------------------------------------------------

J. Offset field Usage pattern (partial) Program: susan

---------------------------------------------------------------------------------------------------------------

Instruction All 0's 01X 001X 0001X 00001X

---------------------------------------------------------------------------------------------------------------

BCZT 0 0 0 0 0 0

BCZF 0 0 0 0 0 0

BEQ 0 0 0 0 0 44

BGEZ 0 0 0 0 0 0

BGEZAL 0 0 0 0 0 0

BGTZ 0 0 0 0 0 0

------------------------------------------------------------------------------------------------------

Total 985 4 9 0 0 55

Percentage 14.768 0.06 0.135 0 0 0.825


A2.5.2 HIE1 Code Conversion Report

A. HIE1 Code Redundant 0's Distribution: Program: susan

------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------

ADDU 1986 0

NOP 608 0

SUBU 458 0

SLT 200 0

LUI 138 0

MTCZ 105 0

JR 47 0

BCZT 23 0

MFCZ 17 0

OR 16 0

AND 11 0

MFHI 10 0

BLEZ 9 0

XOR 8 0

DIV 7 28

MFLO 7 0

JALR 5 20

BGEZ 4 0

MULT 3 12

SLTU 2 0

ADD 0 0

DIVU 0 0

MULTU 0 0

NOR 0 0

SLLV 0 0

SRAV 0 0

SRLV 0 0

SUB 0 0


BCZF 0 0

BGTZ 0 0

BLTZ 0 0

MTHI 0 0

MTLO 0 0

SYSCALL 0 0

REF 0 0

------------------------------------------------------------------------------------------------------

Total : 60 bits

TRZ : 7.5 bytes


------------------------------------------------------------------------------------------------------

B. Code sizes summary Program: susan

------------------------------------------------------------------------------------------------------

1. MIPS32 Code size in Bytes = 51000

3. HIE1-MIPS Code size in bytes = 37229

------------------------------------------------------------------------------------------------------

C. Percentage Code Reduction (PCR) Program: susan

---------------------------------------------------------------------------------------------------------------

HIE Code Size Percentage = 72.998

HIE PCR = 27.002

------------------------------------------------------------------------------------------------------
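The relation between the code sizes and the percentage code reduction (PCR) reported above can be sketched as follows. The function name pcr is hypothetical; the formula (converted size as a percentage of the MIPS32 size, and PCR as its complement to 100) is inferred from the reported figures rather than taken from the MIDACC source.

```python
def pcr(original_bytes, converted_bytes):
    """Return (code size percentage, percentage code reduction)."""
    size_percent = round(100.0 * converted_bytes / original_bytes, 3)
    reduction = round(100.0 - size_percent, 3)
    return size_percent, reduction


if __name__ == "__main__":
    # Values from the HIE1 code sizes summary for susan.
    size_percent, reduction = pcr(51000, 37229)
    print("HIE Code Size Percentage =", size_percent)
    print("HIE PCR =", reduction)
```
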

A2.5.3 HIE2 Code Conversion Report

A. HIE2 Code Redundant 0's Distribution: Program: susan

---------------------------------------------------------------------------------------------------------------


---------------------------------------------------------------------------------------------------------------

ADDU 1986 0

NOP 608 0

SUBU 458 0

SLT 200 0

LUI 138 0

MTCZ 105 0


JR 47 94

BCZT 23 0

MFCZ 17 0

OR 16 0

AND 11 0

MFHI 10 20

BLEZ 9 27

XOR 8 0

DIV 7 0

MFLO 7 14

JALR 5 0

BGEZ 4 12

MULT 3 0

SLTU 2 0

ADD 0 0

DIVU 0 0

MULTU 0 0

NOR 0 0

SLLV 0 0

SRAV 0 0

SRLV 0 0

SUB 0 0

BCZF 0 0

BGTZ 0 0

BLTZ 0 0

MTHI 0 0

MTLO 0 0

SYSCALL 0 0

REF 0 0

---------------------------------------------------------------------------------------------------------------

Total : 167 bits

TRZ : 20.875 bytes



------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------


3. HIE2-MIPS Code size in bytes = 37214

------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------

HIE Code Size Percentage = 72.969

HIE PCR = 27.031

A2.5.4 RMA Code Conversion Report

A. RMA Scope Analysis Program: susan

------------------------------------------------------------------------------------------------------

1. No. of successful RMA loads = 2272

2. No. of Unsuccessful RMA loads = 1611

3. Total loads = 3883

4. Percentage of RMA load cases = 58.511

5. No of successful RMA stores = 43

6. No of Unsuccessful RMA stores = 1267

7. Total stores = 1310

8. Percentage of RMA store cases = 3.282

------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------


2. RMA-MIPS Code size in bytes = 41980


------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------

RMA Code Size Percentage = 82.314

RMA PCR = 17.686

------------------------------------------------------------------------------------------------------

A2.5.5 RMA+HIE1 Code Conversion Report


------------------------------------------------------------------------------------------------------









------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------



3. (RMA+HIE1)-MIPS Code size in bytes = 32503

------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------


RMA PCR = 17.686

RMA+HIE1 Code Size Percentage = 63.731

RMA+HIE1 PCR = 36.269

------------------------------------------------------------------------------------------------------


A2.5.6 RMA+HIE2 Code Conversion Report


------------------------------------------------------------------------------------------------------









------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------



3. (RMA+HIE2)-MIPS Code size in bytes = 32488


------------------------------------------------------------------------------------------------------


RMA PCR = 17.686

RMA+HIE2 Code Size Percentage = 63.702

RMA+HIE2 PCR = 36.298

------------------------------------------------------------------------------------------------------

A2.6 CROSS COMPILATION PROCEDURE

"Cross compilation" is a process of building executable binaries for

a target processor on a host machine whose processor architecture is different. Cross compilation is required when binary executables for the target are generated from source code written in a compiled language, such as C or C++.

A2.6.1 Using Sourcery CodeBench for Cross Compilation

Sourcery CodeBench is a collection of cross compiler tools, known as a toolchain, for several processor architectures, including ARM, PowerPC, MIPS, and Intel x86. It also includes tools such as a linker, object dumping tools, and a library archiver. It is built specifically for embedded systems and produces optimized object code and executables for all MIPS processors.

A2.6.1.1 Building the C program

This section describes the process of generating the object code from the C program using the Sourcery CodeBench tool. To demonstrate the cross compilation process, MiBench and MediaBench C programs are chosen. Compile the input C program using the command below to obtain the object code.

$ mips-linux-gnu-gcc <input-C-file-1-name> <input-C-file-2-name> -o <object-file-name>

For example, the following command is used to compile the SHA

benchmark. The option -o is used to specify the output object file name.

$ mips-linux-gnu-gcc sha.c sha_driver.c -o sha

A2.6.1.2 Obtaining the assembly code

Objdump is a program for displaying various information about one or more object files. It is used as a disassembler to view an executable in assembly form. Use the following command to disassemble the object code:

$ mips-linux-gnu-objdump -D <object-file-name>

For example, the following command is used to disassemble the SHA object code.

$ mips-linux-gnu-objdump -D sha

The option -D is used to disassemble the contents of all sections of

the program.
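The disassembly listing produced above has to be processed before it can be given to MIDACC. The input format accepted by MIDACC is shown only as a snapshot (Figure. A2.10) and is not reproduced here, so the following is merely a sketch of one way to pull the address, the 32-bit instruction word and the mnemonic out of each disassembled line. It assumes the usual GNU objdump line layout (address, colon, eight hexadecimal digits, mnemonic); the function name extract_words and the sample text are hypothetical.

```python
import re

# address ':' whitespace 8-hex-digit word whitespace mnemonic
LINE_RE = re.compile(r"^\s*([0-9a-f]+):\s+([0-9a-f]{8})\s+(\S+)")


def extract_words(disassembly_text):
    """Return (address, word_hex, mnemonic) tuples for instruction lines.

    Lines that do not look like disassembled instructions (section
    headers, symbol labels, blank lines) are skipped.
    """
    rows = []
    for line in disassembly_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            address = int(match.group(1), 16)
            rows.append((address, match.group(2), match.group(3)))
    return rows


if __name__ == "__main__":
    sample = (
        "Disassembly of section .text:\n"
        "\n"
        "004005d0 <main>:\n"
        "  4005d0:\t27bdffe0 \taddiu\tsp,sp,-32\n"
        "  4005d4:\tafbf001c \tsw\tra,28(sp)\n"
    )
    for row in extract_words(sample):
        print(row)
```
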


APPENDIX 3

MIPS32 INSTRUCTION IDENTIFICATION TABLE

Table A3.1 provides information on MIPS32 instruction identification. OP indicates the 6-bit major opcode in bits 31-26, and OPX indicates the 6-bit opcode extension in bits 5-0. This table is used by MIDACC.

Table A3.1: MIPS32 Instruction Identification Table

(Abbreviations used: R-Register; I-Immediate; O-Offset; T-target address)

Sl. no. | MIPS32 Instruction | OP | OP byte pattern (Hexa) | OPX | OPX byte pattern (Hexa) | Type, Format
------------------------------------------------------------------------------------------------------
1 | add | 000000 | 00, 01, 02, 03 | 100000 | 20, 60, A0, E0 | ALU, R
2 | addu | 000000 | 00, 01, 02, 03 | 100001 | 21, 61, A1, E1 | ALU, R
3 | addi | 001000 | 20, 21, 22, 23 | - | - | ALU, I
4 | addiu | 001001 | 24, 25, 26, 27 | - | - | ALU, I
5 | and | 000000 | 00, 01, 02, 03 | 100100 | 24, 64, A4, E4 | ALU, R
6 | andi | 001100 | 30, 31, 32, 33 | - | - | ALU, I
7 | div | 000000 | 00, 01, 02, 03 | 011010 | 1A, 5A, 9A, DA | ALU, R
8 | divu | 000000 | 00, 01, 02, 03 | 011011 | 1B, 5B, 9B, DB | ALU, R
9 | mult | 000000 | 00, 01, 02, 03 | 011000 | 18, 58, 98, D8 | ALU, R
10 | multu | 000000 | 00, 01, 02, 03 | 011001 | 19, 59, 99, D9 | ALU, R
11 | nor | 000000 | 00, 01, 02, 03 | 100111 | 27, 67, A7, E7 | ALU, R
12 | or | 000000 | 00, 01, 02, 03 | 100101 | 25, 65, A5, E5 | ALU, R
13 | ori | 001101 | 34, 35, 36, 37 | - | - | ALU, I
14 | sll | 000000 | 00, 01, 02, 03 | 000000 | 00, 40, 80, C0 | ALU, R
15 | sllv | 000000 | 00, 01, 02, 03 | 000100 | 04, 44, 84, C4 | ALU, R
16 | sra | 000000 | 00, 01, 02, 03 | 000011 | 03, 43, 83, C3 | ALU, R
17 | srav | 000000 | 00, 01, 02, 03 | 000111 | 07, 47, 87, C7 | ALU, R
18 | srl | 000000 | 00, 01, 02, 03 | 000010 | 02, 42, 82, C2 | ALU, R
19 | srlv | 000000 | 00, 01, 02, 03 | 000110 | 06, 46, 86, C6 | ALU, R
20 | sub | 000000 | 00, 01, 02, 03 | 100010 | 22, 62, A2, E2 | ALU, R
21 | subu | 000000 | 00, 01, 02, 03 | 100011 | 23, 63, A3, E3 | ALU, R
22 | xor | 000000 | 00, 01, 02, 03 | 100110 | 26, 66, A6, E6 | ALU, R
23 | xori | 001110 | 38, 39, 3A, 3B | - | - | ALU, I
24 | lui | 001111 | 3C, 3D, 3E, 3F | - | - | CONMANIP, I
25 | slt | 000000 | 00, 01, 02, 03 | 101010 | 2A, 6A, AA, EA | Compare, R
26 | sltu | 000000 | 00, 01, 02, 03 | 101011 | 2B, 6B, AB, EB | Compare, R
27 | slti | 001010 | 28, 29, 2A, 2B | - | - | Compare, I
28 | sltiu | 001011 | 2C, 2D, 2E, 2F | - | - | Compare, I
29 | bczt | - | 41, 45, 49, 4D; next byte: 01, 03, 05, 07, 09, 0B, 0D, 0F, 11, 13, 15, 17, 19, 1B, 1D, 1F | - | - | Branch, O
30 | bczf | - | 41, 45, 49, 4D; next byte: 00, 02, 04, 06, 08, 0A, 0C, 0E, 10, 12, 14, 16, 18, 1A, 1C, 1E | - | - | Branch, O
31 | beq | 000100 | 10, 11, 12, 13 | - | - | Branch, O
32 | bgez | 000001 | 04, 05, 06, 07; next byte: 01, 21, 41, 61, 81, A1, C1, E1 | - | - | Branch, O
33 | bgezal | 000001 | 04, 05, 06, 07; next byte: 11, 31, 51, 71, 91, B1, D1, F1 | - | - | Branch, O
34 | bgtz | 000111 | 1C, 1D, 1E, 1F; next byte: 00, 20, 40, 60, 80, A0, C0, E0 | - | - | Branch, O
35 | blez | 000110 | 18, 19, 1A, 1B | - | - | Branch, O
36 | bltzal | 000001 | 04, 05, 06, 07; next byte: 10, 30, 50, 70, 90, B0, D0, F0 | - | - | Branch, O
37 | bltz | 000001 | 04, 05, 06, 07; next byte: 00, 20, 40, 60, 80, A0, C0, E0 | - | - | Branch, O
38 | bne | 000101 | 14, 15, 16, 17 | - | - | Branch, O
39 | j | 000010 | 08, 09, 0A, 0B | - | - | Jump, T
40 | jal | 000011 | 0C, 0D, 0E, 0F | - | - | Jump, T
41 | jalr | 000000 | 00, 01, 02, 03 | 001001 | 09, 49, 89, C9 | Jump, R
42 | jr | 000000 | 00, 01, 02, 03 | 001000 | 08, 48, 88, C8 | Jump, R
43 | lb | 100000 | 80, 81, 82, 83 | - | - | Load, O
44 | lbu | 100100 | 90, 91, 92, 93 | - | - | Load, O
45 | lh | 100001 | 84, 85, 86, 87 | - | - | Load, O
46 | lhu | 100101 | 94, 95, 96, 97 | - | - | Load, O
47 | lw | 100011 | 8C, 8D, 8E, 8F | - | - | Load, O
48 | lwcz | - | C0, C1, C2, C3, C4, C5, C6, C7, C8, C9, CA, CB, CC, CD, CE, CF | - | - | Load, O
49 | lwl | 100010 | 88, 89, 8A, 8B | - | - | Load, O
50 | lwr | 100110 | 98, 99, 9A, 9B | - | - | Load, O
51 | sb | 101000 | A0, A1, A2, A3 | - | - | Store, O
52 | sh | 101001 | A4, A5, A6, A7 | - | - | Store, O
53 | sw | 101011 | AC, AD, AE, AF | - | - | Store, O
54 | swcz | - | E0, E1, E2, E3, E4, E5, E6, E7, E8, E9, EA, EB, EC, ED, EE, EF | - | - | Store, O
55 | swl | 101010 | A8, A9, AA, AB | - | - | Store, O
56 | swr | 101110 | B8, B9, BA, BB | - | - | Store, O
57 | mfhi | 000000 | 00, 01, 02, 03 | 010000 | 10, 50, 90, D0 | Data move, R
58 | mflo | 000000 | 00, 01, 02, 03 | 010010 | 12, 52, 92, D2 | Data move, R
59 | mthi | 000000 | 00, 01, 02, 03 | 010001 | 11, 51, 91, D1 | Data move, R
60 | mtlo | 000000 | 00, 01, 02, 03 | 010011 | 13, 53, 93, D3 | Data move, R
61 | mfcz | - | 40, 44, 48, 4C, 50, 54, 58, 5C, 60, 64, 68, 6C, 70, 74, 78, 7C; next byte: 00 to 0F, 10 to 1F | - | - | Data move, R – O
62 | mtcz | - | 40, 44, 48, 4C, 50, 54, 58, 5C, 60, 64, 68, 6C, 70, 74, 78, 7C; next byte: 80 to 8F, 90 to 9F | - | - | Data move, R – O
63 | syscall | 000000 | 00, 01, 02, 03 | 001100 | 0C, 4C, 8C, CC | Interrupt, R-O
64 | break | 000000 | 00, 01, 02, 03 | 001101 | 0D, 4D, 8D, CD | Interrupt, R-O
65 | nop | 000000 | All 0’s in all 32 bits | - | 0’s | -
66 | rfe | 010000 | 40, 41, 42, 43 | 100000 | 20, 60, A0, E0 | Interrupt
------------------------------------------------------------------------------------------------------
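The use of the OP and OPX fields for instruction identification can be illustrated with a short sketch. It is not the MIDACC implementation: the function name identify and the two dictionaries are hypothetical, they cover only a small subset of the entries in Table A3.1, and the instructions that need additional byte tests (for example bczt, bgez and the coprocessor instructions) are not handled.

```python
# Subset of Table A3.1: R-format instructions are identified by OPX
# (bits 5-0) when OP (bits 31-26) is 000000.
OPX_NAMES = {
    0b100000: "add", 0b100001: "addu", 0b100011: "subu",
    0b000000: "sll", 0b101010: "slt", 0b001000: "jr",
}

# Subset of Table A3.1: other instructions are identified by OP alone.
OP_NAMES = {
    0b001001: "addiu", 0b001111: "lui", 0b100011: "lw",
    0b100100: "lbu", 0b101011: "sw", 0b000100: "beq",
    0b000101: "bne", 0b000010: "j", 0b000011: "jal",
}


def identify(word):
    """Return the mnemonic of a 32-bit MIPS32 instruction word."""
    if word == 0:
        return "nop"              # all 32 bits zero (row 65 of the table)
    op = (word >> 26) & 0x3F      # 6-bit major opcode, bits 31-26
    if op == 0:
        opx = word & 0x3F         # 6-bit opcode extension, bits 5-0
        return OPX_NAMES.get(opx, "unknown")
    return OP_NAMES.get(op, "unknown")


if __name__ == "__main__":
    for word in (0x00851021, 0x8FBF001C, 0x27BDFFE0, 0x03E00008, 0x0):
        print(f"{word:08X} -> {identify(word)}")
```

The all-zero word has to be tested before the OPX lookup because it would otherwise match sll, whose OP and OPX are both 000000.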


APPENDIX 4

HIE1-MIPS INSTRUCTION MAP

The HIE1-MIPS instruction map is given in Table A4.1. The iid field encoding is given in Table A4.2.

Table A4.1: HIE1-MIPS instruction map

Sl no.

MIPS32 Instruction

Type,

Format HIE1 Type

HIE1 length (bits)

HIE1 OP HIE1 OPX

1 add ALU, R D 24 000000 100000

2 addu ALU, R D 24 ,, 100001

3 addi ALU, I G 16/24/32 001000 -

4 addiu ALU, I G ,, 001001 -

5 and ALU, R D 24 000000 100100

6 andi ALU, I G 16/24/32 001100 -

7 div ALU, R F 24 000000 011010

8 divu ,, F ,, ,, 011011

9 mult ,, F 24 ,, 011000

10 multu ,, F 24 000000 011001

11 nor ,, D ,, ,, 100111

12 or ,, D ,, ,, 100101

13 ori ALU, I G 16/24/32 001101 -

14 sll ALU, R E 24 000000 000000

15 sllv ,, D ,, ,, 000100

16 sra ,, E ,, ,, 000011

17 srav ,, D ,, ,, 000111

18 srl ,, E ,, ,, 000010

19 srlv ,, D ,, ,, 000110

20 sub ,, D ,, ,, 100010


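21 | subu | ALU, R | D | 24 | 000000 | 100011
22 | xor | ALU, R | D | 24 | 000000 | 100110
23 | xori | ALU, I | G | 16/24/32 | 001110 | -
24 | lui | CONMANIP, I | G | 16/24/32 | 001111 | -
25 | slt | Compare, R | D | 24 | 000000 | 101010
26 | sltu | Compare, R | D | 24 | 000000 | 101011
27 | slti | Compare, I | G | 16/24/32 | 001010 | -
28 | sltiu | Compare, I | G | 16/24/32 | 001011 | -
29 | bczt¹ | Branch, O | G | 16/24/32 | 0100xx | -
30 | bczf¹ | Branch, O | G | 16/24/32 | 0100xx | -
31 | beq | Branch, O | G | 16/24/32 | 000100 | -
32 | bgez¹ | Branch, O | G | 16/24/32 | 000001 | -
33 | bgezal¹ | Branch, O | G | 16/24/32 | 000001 | -
34 | bgtz¹ | Branch, O | G | 16/24/32 | 000111 | -
35 | blez | Branch, O | G | 16/24/32 | 000110 | -
36 | bltzal | Branch, O | G | 16/24/32 | 000001 | -
37 | bltz | Branch, O | G | 16/24/32 | 000001 | -
38 | bne | Branch, O | G | 16/24/32 | 000101 | -
39 | j | Jump, T | H | 32 | 000010 | -
40 | jal | Jump, T | H | 32 | 000011 | -
41 | jalr | Jump, R | F | 24 | 000000 | 001001
42 | jr | Jump, R | C | 16 | 000000 | 001000
43 | lb | Load, O | G | 16/24/32 | 100000 | -
44 | lbu | Load, O | G | 16/24/32 | 100100 | -
45 | lh | Load, O | G | 16/24/32 | 100001 | -
46 | lhu | Load, O | G | 16/24/32 | 100101 | -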


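47 | lw | Load, O | G | 16/24/32 | 100011 | -
48 | lwcz¹ | Load, O | G | 16/24/32 | - | -
49 | lwl | Load, O | G | 16/24/32 | 100010 | -
50 | lwr | Load, O | G | 16/24/32 | 100110 | -
51 | sb | Store, O | G | 16/24/32 | 101000 | -
52 | sh | Store, O | G | 16/24/32 | 101001 | -
53 | sw | Store, O | G | 16/24/32 | 101011 | -
54 | swcz¹ | Store, O | G | 16/24/32 | - | -
55 | swl | Store, O | G | 16/24/32 | 101010 | -
56 | swr | Store, O | G | 16/24/32 | 101110 | -
57 | mfhi | Data move, R | C | 16 | 000000 | 010000
58 | mflo | Data move, R | C | 16 | 000000 | 010010
59 | mthi | Data move, R | C | 16 | 000000 | 010001
60 | mtlo | Data move, R | C | 16 | 000000 | 010011
61 | mfcz¹ | Data move, R – O | B; iid = 0 | 16 | - | -
62 | mtcz¹ | Data move, R – O | B; iid = 1 | 24 | - | -
63 | syscall | Interrupt, R-O | A; iid = 01 | 8 | 011100 | -
64 | break | Interrupt, R-O | I | 32 | 000000 | 001101
65 | nop | - | A; iid = 10 | 8 | 011100 | -
66 | rfe | Interrupt | A; iid = 00 | 16 | 011100 | -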


Note-1:

Coprocessor-related instructions are not fully mapped to HIE1, as the mapping serves static simulation purposes only. Identifying certain instructions involves multiple match conditions. For example, for the bczt instruction, the first byte may be any one of four values: 41, 45, 49, 4D. In addition, the second byte may be any one of 16 values: 01, 03, 05, 07, 09, 0B, 0D, 0F, 11, 13, 15, 17, 19, 1B, 1D, 1F. Appendix 3 gives complete information.
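
The two match conditions for bczt can be expressed as bit masks. The sketch below is an illustration only: the masks 0xF3 and 0xE1 are derived here from the byte values listed in the note (four first-byte values in which bits 3 and 2 vary, and sixteen odd second-byte values up to 1F), and the function name is hypothetical rather than taken from the thesis simulator.

```python
def is_bczt(first_byte: int, second_byte: int) -> bool:
    """Return True when the first two instruction bytes match the bczt patterns."""
    # First byte: 41, 45, 49 or 4D, i.e. 0100 zz01 with the two z bits free.
    first_ok = (first_byte & 0xF3) == 0x41
    # Second byte: odd values 01..1F, i.e. 000x xxx1 with four bits free.
    second_ok = (second_byte & 0xE1) == 0x01
    return first_ok and second_ok


# Enumerate every byte pair accepted by the two masks.
first_values = sorted({a for a in range(256) if is_bczt(a, 0x01)})
second_values = sorted({b for b in range(256) if is_bczt(0x41, b)})
print([f"{v:02X}" for v in first_values])
print(len(second_values))
```

Enumerating all byte values in this way is a simple check that a mask accepts exactly the listed patterns and nothing else.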

Table A4.2: IID Field Encoding

Group | iid | Instruction
A | 00 | rfe
A | 01 | syscall
A | 10 | nop
A | 11 | -
B | 0 | mfcz
B | 1 | mtcz
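
Table A4.2 amounts to a two-level lookup: the instruction group selects a sub-table, and the iid value selects the instruction within it. A minimal sketch of such a lookup follows. The dictionary layout and the function name are assumptions made for illustration; the unused Group A code 11 is returned as None.

```python
from typing import Optional

# iid encodings copied from Table A4.2; Group A uses two bits, Group B one bit.
IID_TABLE = {
    "A": {0b00: "rfe", 0b01: "syscall", 0b10: "nop"},
    "B": {0b0: "mfcz", 0b1: "mtcz"},
}


def decode_iid(group: str, iid: int) -> Optional[str]:
    """Return the instruction for (group, iid), or None for an unused code."""
    return IID_TABLE[group].get(iid)


print(decode_iid("A", 0b01))  # syscall
print(decode_iid("B", 0b1))   # mtcz
print(decode_iid("A", 0b11))  # None (unused encoding)
```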


APPENDIX 5

HIE2-MIPS INSTRUCTION MAP

A5.1 HIE2-MIPS INSTRUCTION MAP

The HIE-2 MIPS instruction map is given in Table A5.1.

Table A5.1: HIE2-MIPS INSTRUCTION MAP

(IT bit = 0 indicates presence of hl field)

Sl. no. | MIPS32 Instruction | Type, Format | HIE Type | HIE length (bits) | IT | HIE OP | iid
1 | add | ALU, R | D1 | 24 | 1 | 00101 | 000
2 | addu | ALU, R | D1 | 24 | 1 | 00101 | 001
3 | addi | ALU, I | G1 | 24/32 | 0 | 00001 | -
4 | addiu | ALU, I | G1 | 24/32 | 0 | 00010 | -
5 | and | ALU, R | D1 | 24 | 1 | 00101 | 010
6 | andi | ALU, I | G1 | 24/32 | 0 | 00011 | -
7 | div | ALU, R | F | 16 | 1 | 01000 | -
8 | divu | ALU, R | F | 16 | 1 | 01001 | -
9 | mult | ALU, R | F | 16 | 1 | 01010 | -
10 | multu | ALU, R | F | 16 | 1 | 01011 | -
11 | nor | ALU, R | D1 | 24 | 1 | 00101 | 011
12 | or | ALU, R | D1 | 24 | 1 | 00101 | 100
13 | ori | ALU, I | G1 | 24/32 | 0 | 00100 | -
14 | sll | ALU, R | E | 24 | 1 | 00111 | 00
15 | sllv | ALU, R | D2 | 24 | 1 | 00110 | 000
16 | sra | ALU, R | E | 24 | 1 | 00111 | 01
17 | srav | ALU, R | D2 | 24 | 1 | 00110 | 001


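18 | srl | ALU, R | E | 24 | 1 | 00111 | 10
19 | srlv | ALU, R | D2 | 24 | 1 | 00110 | 010
20 | sub | ALU, R | D1 | 24 | 1 | 00101 | 101
21 | subu | ALU, R | D1 | 24 | 1 | 00101 | 110
22 | xor | ALU, R | D1 | 24 | 1 | 00101 | 111
23 | xori | ALU, I | G1 | 24/32 | 0 | 00101 | -
24 | lui | CONMANIP, I | G1 | 24/32 | 0 | 00110 | -
25 | slt | Compare, R | D2 | 24 | 1 | 00110 | 011
26 | sltu | Compare, R | D2 | 24 | 1 | 00110 | 100
27 | slti | Compare, I | G1 | 24/32 | 0 | 00111 | -
28 | sltiu | Compare, I | G1 | 24/32 | 0 | 01000 | -
29 | bczt | Branch, O | G2 | 8/16/24 | 0 | 11011 | -
30 | bczf | Branch, O | G2 | 8/16/24 | 0 | 11100 | -
31 | beq | Branch, O | G1 | 24/32 | 0 | 01001 | -
32 | bgez | Branch, O | G3 | 16/24/32 | 0 | 11101 | 00
33 | bgezal | Branch, O | G1 | 24/32 | 0 | 01010 | -
34 | bgtz | Branch, O | G3 | 16/24/32 | 0 | 11101 | 01
35 | blez | Branch, O | G3 | 16/24/32 | 0 | 11101 | 10
36 | bltzal | Branch, O | G1 | 24/32 | 0 | 01011 | -
37 | bltz | Branch, O | G3 | 16/24/32 | 0 | 11101 | 11
38 | bne | Branch, O | G1 | 24/32 | 0 | 01100 | -
39 | j | Jump, T | H | 32 | 1 | 01101 | -
40 | jal | Jump, T | H | 32 | 1 | 01110 | -
41 | jalr | Jump, R | F | 16 | 1 | 01100 | -
42 | jr | Jump, R | C | 16 | 1 | 00100 | 000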


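43 | lb | Load, O | G1 | 24/32 | 0 | 01101 | -
44 | lbu | Load, O | G1 | 24/32 | 0 | 01110 | -
45 | lh | Load, O | G1 | 24/32 | 0 | 01111 | -
46 | lhu | Load, O | G1 | 24/32 | 0 | 10000 | -
47 | lw | Load, O | G1 | 24/32 | 0 | 10001 | -
48 | lwcz | Load, O | G1 | 24/32 | 0 | 10010 | -
49 | lwl | Load, O | G1 | 24/32 | 0 | 10011 | -
50 | lwr | Load, O | G1 | 24/32 | 0 | 10100 | -
51 | sb | Store, O | G1 | 24/32 | 0 | 10101 | -
52 | sh | Store, O | G1 | 24/32 | 0 | 10110 | -
53 | sw | Store, O | G1 | 24/32 | 0 | 10111 | -
54 | swcz | Store, O | G1 | 24/32 | 0 | 11000 | -
55 | swl | Store, O | G1 | 24/32 | 0 | 11001 | -
56 | swr | Store, O | G1 | 24/32 | 0 | 11010 | -
57 | mfhi | Data move, R | C | 16 | 1 | 00100 | 001
58 | mflo | Data move, R | C | 16 | 1 | 00100 | 010
59 | mthi | Data move, R | C | 16 | 1 | 00100 | 011
60 | mtlo | Data move, R | C | 16 | 1 | 00100 | 100
61 | mfcz | Data move, R – O | B | 16 | 1 | 00010 | -
62 | mtcz | Data move, R – O | B | 24 | 1 | 00011 | -
63 | syscall | Interrupt, R-O | A | 8 | 1 | 00001 | 00
64 | break | Interrupt, R-O | I | 32 | 1 | 01111 | -
65 | nop | - | A | 8 | 1 | 00001 | 01
66 | rfe | Interrupt | A | 16 | 1 | 00001 | 10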


Note-1:

Coprocessor-related instructions are not fully mapped to HIE2, as the mapping serves static simulation purposes only. Identifying certain instructions involves multiple match conditions. For example, for the bczt instruction, the first byte may be any one of four values: 41, 45, 49, 4D. In addition, the second byte may be any one of 16 values: 01, 03, 05, 07, 09, 0B, 0D, 0F, 11, 13, 15, 17, 19, 1B, 1D, 1F. Appendix 3 gives complete information.

A5.2 HYBRID LENGTH FIELDS ENCODING

The hl encoding for hybrid immediate / offset lengths is as follows. Table A5.2 and Table A5.3 give the hl encoding for G1 type instructions and G2 type instructions, respectively.

Table A5.2: hl Encoding for G1 Type Instructions

Actual contents of immediate / offset (15 bits) | hl bit | Immediate / offset (bits) | Instruction size (bits) | Encoding of immediate / offset field
All 0's | 0 | 7 | 24 | 7 zeros
Eight most significant bits are 0's and value of remaining seven bits is non-zero | 0 | 7 | 24 | 7 lsbs of actual contents
Value of eight most significant bits is non-zero | 1 | 15 | 32 | Actual contents
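
The G1 rule in Table A5.2 reduces to a single test on the eight most significant bits of the 15-bit value. A minimal sketch under that reading is given below. The function name and the returned tuple layout are illustrative assumptions, and handling of the sign of the original MIPS immediate is outside what the table specifies and is not modelled.

```python
from typing import Tuple


def encode_g1(value: int) -> Tuple[int, int, int, int]:
    """Return (hl bit, field width in bits, instruction size in bits, field value)."""
    if not 0 <= value < (1 << 15):
        raise ValueError("G1 immediate / offset must fit in 15 bits")
    if (value >> 7) == 0:
        # Eight msbs are zero (this also covers the all-zero case):
        # keep only the seven lsbs and use the 24-bit form.
        return 0, 7, 24, value & 0x7F
    # Otherwise the full 15-bit contents are kept in the 32-bit form.
    return 1, 15, 32, value


print(encode_g1(0))       # (0, 7, 24, 0)
print(encode_g1(0x7F))    # (0, 7, 24, 127)
print(encode_g1(0x80))    # (1, 15, 32, 128)
```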



Table A5.3: hl Encoding for G2 Type Instructions

Actual contents of offset | hl bits | Offset (bits) | Instruction size (bits) | Encoding of offset field
All 0's | 00 | 0 | 8 | -
All bits of most significant byte are 0's and value of least significant byte is non-zero | 01 | 8 | 16 | 8 lsbs of actual contents
All bits of least significant byte are 0's and value of most significant byte is non-zero | 10 | 8 | 16 | 8 msbs of actual contents
Value of both bytes is non-zero | 11 | 16 | 24 | Actual 16 bit contents

The hl Encoding for G3 type instructions is given in Table A5.4.


Table A5.4: hl Encoding for G3 Type Instructions

Actual contents of offset | hl bits | Length of offset in HIE2 (bits) | Instruction size (bits) | Encoding of offset field
All 0's | 00 | 0 | 16 | -
All bits of most significant byte are 0's and value of least significant byte is non-zero | 01 | 8 | 24 | 8 lsbs of actual contents
All bits of least significant byte are 0's and value of most significant byte is non-zero | 10 | 8 | 24 | 8 msbs of actual contents
Values of both bytes are non-zero | 11 | 16 | 32 | Actual 16 bit contents
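
The G2 and G3 encodings share one rule and differ only in the size of the instruction without an offset (8 bits for G2, 16 bits for G3). The sketch below illustrates this under stated assumptions: the function and constant names are hypothetical, and the hl = 01 case is inferred from the pattern of the hl = 00, 10 and 11 rows.

```python
from typing import Optional, Tuple

G2_BASE_BITS = 8    # G2 instruction size when the offset is all zeros
G3_BASE_BITS = 16   # G3 instruction size when the offset is all zeros


def encode_offset(offset: int, base_bits: int) -> Tuple[str, int, int, Optional[int]]:
    """Return (hl bits, offset width, instruction size, encoded offset field)."""
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("offset must fit in 16 bits")
    msb, lsb = offset >> 8, offset & 0xFF
    if msb == 0 and lsb == 0:
        return "00", 0, base_bits, None          # no offset field at all
    if msb == 0:
        return "01", 8, base_bits + 8, lsb       # inferred row: keep 8 lsbs
    if lsb == 0:
        return "10", 8, base_bits + 8, msb       # keep 8 msbs
    return "11", 16, base_bits + 16, offset      # keep all 16 bits


print(encode_offset(0x0000, G2_BASE_BITS))  # ('00', 0, 8, None)
print(encode_offset(0x3400, G3_BASE_BITS))  # ('10', 8, 24, 52)
print(encode_offset(0x1234, G3_BASE_BITS))  # ('11', 16, 32, 4660)
```

Passing the base size as a parameter keeps a single function for both instruction types, which mirrors the way the two tables differ only in their instruction size column.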


TECHNICAL BIOGRAPHY

Mr. B. Govindarajalu (RRN. 1186221) was born on 3rd Jan 1949 in Athukkudi, Tamilnadu. He did his schooling at Board High School, Vaithiswarankoil, and secured 77% in the Higher Secondary Examination. He received a gold medal for securing first rank in Tamil in the Pre University Course of the University of Madras in 1967. He received his B.E. degree in Electronics and Communications Engineering from National Institute of Technology, Trichy, University of Madras, in 1972, and his M.Tech. degree in Computer Science and Engineering from Indian Institute of Technology, Bombay, in 1979. He has over 42 years of working experience, including thirty years of industrial experience. He has been employed with IIT Bombay; ORG Systems, Baroda; Infotech Limited, Chennai; Manipal Engineering College; Rajalakshmi Engineering College; Sree Ramanujar Engineering College; Dhanalakshmi College of Engineering; Sri Venkateswara College of Engineering; and Microcode, Chennai. He is the founder CEO of Microcode, Chennai. He has authored two books:

1. IBM PC AND CLONES: Hardware, Troubleshooting and Maintenance

2. Computer Architecture and Organization: Design Principles and Applications

He is currently pursuing his Ph.D. degree in Embedded Systems and RISC Processors in the Department of Computer Science and Engineering of B.S. Abdur Rahman University. His areas of interest include Computer Architecture, Embedded Systems and Computer Networking. He has published two papers in journals and authored twelve articles in a computer magazine. His e-mail ID is [email protected] and his contact number is 9884025129.
