Carl Hamacher

Transcript of Carl Hamacher


COMPUTER ORGANIZATION AND EMBEDDED SYSTEMS


COMPUTER ORGANIZATION AND EMBEDDED SYSTEMS

SIXTH EDITION

Carl Hamacher
Queen's University

Zvonko Vranesic
University of Toronto

Safwat Zaky
University of Toronto

Naraig Manjikian
Queen's University


COMPUTER ORGANIZATION AND EMBEDDED SYSTEMS, SIXTH EDITION

Published by McGraw-Hill, a business unit of The McGraw-Hill Companies, Inc., 1221 Avenue of the Americas, New York, NY 10020. Copyright © 2012 by The McGraw-Hill Companies, Inc. All rights reserved. Previous editions 2002, 1996, and 1990. No part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written consent of The McGraw-Hill Companies, Inc., including, but not limited to, in any network or other electronic storage or transmission, or broadcast for distance learning.

Some ancillaries, including electronic and print components, may not be available to customers outside the United States.

This book is printed on acid-free paper.

1 2 3 4 5 6 7 8 9 DOC/DOC 0 9 8 7 6 5 4 3 2 1

ISBN 978–0–07–338065–0
MHID 0–07–338065–2

Vice President & Editor-in-Chief: Marty Lange
Vice President EDP/Central Publishing Services: Kimberly Meriwether David
Publisher: Raghothaman Srinivasan
Senior Sponsoring Editor: Peter E. Massar
Developmental Editor: Darlene M. Schueller
Senior Marketing Manager: Curt Reynolds
Senior Project Manager: Lisa A. Bruflodt
Buyer: Laura Fuller
Design Coordinator: Brenda A. Rolwes
Media Project Manager: Balaji Sundararaman
Cover Design: Studio Montage, St. Louis, Missouri
Cover Image: © Royalty-Free/CORBIS
Compositor: Techsetters, Inc.
Typeface: 10/12 Times Roman
Printer: R. R. Donnelley & Sons Company/Crawfordsville, IN

Library of Congress Cataloging-in-Publication Data

Computer organization and embedded systems / Carl Hamacher ... [et al.]. – 6th ed.
p. cm.
Includes bibliographical references.
ISBN-13: 978-0-07-338065-0 (alk. paper)
ISBN-10: 0-07-338065-2 (alk. paper)
1. Computer organization. 2. Embedded computer systems. I. Hamacher, V. Carl.
QA76.9.C643.H36 2012
004.2'2–dc22

2010050243

www.mhhe.com


To our families


About the Authors

Carl Hamacher received the B.A.Sc. degree in Engineering Physics from the University of Waterloo, Canada, the M.Sc. degree in Electrical Engineering from Queen's University, Canada, and the Ph.D. degree in Electrical Engineering from Syracuse University, New York. From 1968 to 1990 he was at the University of Toronto, Canada, where he was a Professor in the Department of Electrical Engineering and the Department of Computer Science. He served as director of the Computer Systems Research Institute during 1984 to 1988, and as chairman of the Division of Engineering Science during 1988 to 1990. In 1991 he joined Queen's University, where he is now Professor Emeritus in the Department of Electrical and Computer Engineering. He served as Dean of the Faculty of Applied Science from 1991 to 1996. During 1978 to 1979, he was a visiting scientist at the IBM Research Laboratory in San Jose, California. In 1986, he was a research visitor at the Laboratory for Circuits and Systems associated with the University of Grenoble, France. During 1996 to 1997, he was a visiting professor in the Computer Science Department at the University of California at Riverside and in the LIP6 Laboratory of the University of Paris VI.

His research interests are in multiprocessors and multicomputers, focusing on their interconnection networks.

Zvonko Vranesic received his B.A.Sc., M.A.Sc., and Ph.D. degrees, all in Electrical Engineering, from the University of Toronto. From 1963 to 1965 he worked as a design engineer with the Northern Electric Co. Ltd. in Bramalea, Ontario. In 1968 he joined the University of Toronto, where he is now a Professor Emeritus in the Department of Electrical & Computer Engineering. During the 1978–79 academic year, he was a Senior Visitor at the University of Cambridge, England, and during 1984–85 he was at the University of Paris VI. From 1995 to 2000 he served as Chair of the Division of Engineering Science at the University of Toronto. He is also involved in research and development at the Altera Toronto Technology Center.

His current research interests include computer architecture and field-programmable VLSI technology.

He is a coauthor of four other books: Fundamentals of Digital Logic with VHDL Design, 3rd ed.; Fundamentals of Digital Logic with Verilog Design, 2nd ed.; Microcomputer Structures; and Field-Programmable Gate Arrays. In 1990, he received the Wighton Fellowship for "innovative and distinctive contributions to undergraduate laboratory instruction." In 2004, he received the Faculty Teaching Award from the Faculty of Applied Science and Engineering at the University of Toronto.

Safwat Zaky received his B.Sc. degree in Electrical Engineering and B.Sc. in Mathematics, both from Cairo University, Egypt, and his M.A.Sc. and Ph.D. degrees in Electrical Engineering from the University of Toronto. From 1969 to 1972 he was with Bell Northern Research, Bramalea, Ontario, where he worked on applications of electro-optics and magnetics in mass storage and telephone switching. In 1973, he joined the University of Toronto, where he is now Professor Emeritus in the Department of Electrical and Computer Engineering. He served as Chair of the Department from 1993 to 2003 and as Vice-Provost from 2003 to 2009. During 1980 to 1981, he was a senior visitor at the Computer Laboratory, University of Cambridge, England.

He is a Fellow of the Canadian Academy of Engineering. His research interests are in the areas of computer architecture, digital-circuit design, and electromagnetic compatibility. He is a coauthor of the book Microcomputer Structures and is a recipient of the IEEE Third Millennium Medal and of the Vivek Goel Award for distinguished service to the University of Toronto.

Naraig Manjikian received his B.A.Sc. degree in Computer Engineering and M.A.Sc. degree in Electrical Engineering from the University of Waterloo, Canada, and his Ph.D. degree in Electrical Engineering from the University of Toronto. In 1997, he joined Queen's University, Kingston, Canada, where he is now an Associate Professor in the Department of Electrical and Computer Engineering. From 2004 to 2006, he served as Undergraduate Chair for Computer Engineering. From 2006 to 2007, he served as Acting Head of the Department of Electrical and Computer Engineering, and from 2007 until 2009, he served as Associate Head for Student and Alumni Affairs. During 2003 to 2004, he was a visiting professor at McGill University, Montreal, Canada, and the University of British Columbia. During 2010 to 2011, he was a visiting professor at McGill University.

His research interests are in the areas of computer architecture, multiprocessor systems, field-programmable VLSI technology, and applications of parallel processing.


Preface

This book is intended for use in a first-level course on computer organization and embedded systems in electrical engineering, computer engineering, and computer science curricula. The book is self-contained, assuming only that the reader has a basic knowledge of computer programming in a high-level language. Many students who study computer organization will have had an introductory course on digital logic circuits. Therefore, this subject is not covered in the main body of the book. However, we have provided an extensive appendix on logic circuits for those students who need it.

The book reflects our experience in teaching three distinct groups of students: electrical and computer engineering undergraduates, computer science undergraduates, and engineering science undergraduates. We have always approached the teaching of courses on computer organization from a practical point of view. Thus, a key consideration in shaping the contents of the book has been to carefully explain the main principles, supported by examples drawn from commercially available processors. Our main commercial examples are based on: Altera's Nios II, Freescale's ColdFire, ARM, and Intel's IA-32 architectures.

It is important to recognize that digital system design is not a straightforward process of applying optimal design algorithms. Many design decisions are based largely on heuristic judgment and experience. They involve cost/performance and hardware/software tradeoffs over a range of alternatives. It is our goal to convey these notions to the reader.

The book is aimed at a one-semester course in engineering or computer science programs. It is suitable for both hardware- and software-oriented students. Even though the emphasis is on hardware, we have addressed a number of relevant software issues.

McGraw-Hill maintains a Website with support material for the book at http://www.mhhe.com/hamacher.

Scope of the Book

The first three chapters introduce the basic structure of computers, the operations that they perform at the machine-instruction level, and input/output methods as seen by a programmer. The fourth chapter provides an overview of the system software needed to translate programs written in assembly and high-level languages into machine language and to manage their execution. The remaining eight chapters deal with the organization, interconnection, and performance of hardware units in modern computers, including a coverage of embedded systems.

Five substantial appendices are provided. The first appendix covers digital logic circuits. Then, four current commercial instruction set architectures—Altera's Nios II, Freescale's ColdFire, ARM, and Intel's IA-32—are described in separate appendices.

Chapter 1 provides an overview of computer hardware and informally introduces terms that are discussed in more depth in the remainder of the book. This chapter discusses the basic functional units and the ways they interact to form a complete computer system. Number and character representations are discussed, along with basic arithmetic operations. An introduction to performance issues and a brief treatment of the history of computer development are also provided.

Chapter 2 gives a methodical treatment of machine instructions, addressing techniques, and instruction sequencing. Program examples at the machine-instruction level, expressed in a generic assembly language, are used to discuss concepts that include loops, subroutines, and stacks. The concepts are introduced using a RISC-style instruction set architecture. A comparison with CISC-style instruction sets is also included.

Chapter 3 presents a programmer's view of basic input/output techniques. It explains how program-controlled I/O is performed using polling, as well as how interrupts are used in I/O transfers.

Chapter 4 considers system software. The tasks performed by compilers, assemblers, linkers, and loaders are explained. Utility programs that trace and display the results of executing a program are described. Operating system routines that manage the execution of user programs and their input/output operations, including the handling of interrupts, are also described.

Chapter 5 explores the design of a RISC-style processor. This chapter explains the sequence of processing steps needed to fetch and execute the different types of machine instructions. It then develops the hardware organization needed to implement these processing steps. The differing requirements of CISC-style processors are also considered.

Chapter 6 provides coverage of the use of pipelining and multiple execution units in the design of high-performance processors. A pipelined version of the RISC-style processor design from Chapter 5 is used to illustrate pipelining. The role of the compiler and the relationship between pipelined execution and instruction set design are explored. Superscalar processors are discussed.

Input/output hardware is considered in Chapter 7. Interconnection networks, including the bus structure, are discussed. Synchronous and asynchronous operation is explained. Interconnection standards, including USB and PCI Express, are also presented.

Semiconductor memories, including SDRAM, Rambus, and Flash memory implementations, are discussed in Chapter 8. Caches are explained as a way for increasing the memory bandwidth. They are discussed in some detail, including performance modeling. Virtual-memory systems, memory management, and rapid address-translation techniques are also presented. Magnetic and optical disks are discussed as components in the memory hierarchy.

Chapter 9 explores the implementation of the arithmetic unit of a computer. Logic design for fixed-point add, subtract, multiply, and divide hardware, operating on 2's-complement numbers, is described. Carry-lookahead adders and high-speed multipliers are explained, including descriptions of the Booth multiplier recoding and carry-save addition techniques. Floating-point number representation and operations, in the context of the IEEE Standard, are presented.

Today, far more processors are in use in embedded systems than in general-purpose computers. Chapters 10 and 11 are dedicated to the subject of embedded systems. First, basic aspects of system integration, component interconnections, and real-time operation are presented in Chapter 10. The use of microcontrollers is discussed. Then, Chapter 11 concentrates on system-on-a-chip (SoC) implementations, in which a single chip integrates the processing, memory, I/O, and timer functionality needed to satisfy application-specific requirements. A substantial example shows how FPGAs and modern design tools can be used in this environment.

Chapter 12 focuses on parallel processing and performance. Hardware multithreading and vector processing are introduced as enhancements in a single processor. Shared-memory multiprocessors are then described, along with the issue of cache coherence. Interconnection networks for multiprocessors are presented.

Appendix A provides extensive coverage of logic circuits, intended for a reader who has not taken a course on the design of such circuits.

Appendices B, C, D, and E illustrate how the instruction set concepts introduced in Chapters 2 and 3 are implemented in four commercial processors: Nios II, ColdFire, ARM, and Intel IA-32. The Nios II and ARM processors illustrate the RISC design style. ColdFire has an easy-to-teach CISC design, while the IA-32 CISC architecture represents the most successful commercial design. The presentation for each processor includes assembly-language examples from Chapters 2 and 3, implemented in the context of that processor. The details given in these appendices are not essential for understanding the material in the main body of the book. It is sufficient to cover only one of these appendices to gain an appreciation for commercial processor instruction sets. The choice of a processor to use as an example is likely to be influenced by the equipment in an accompanying laboratory. Instructors may wish to use more than one processor to illustrate the different design approaches.

Changes in the Sixth Edition

Substantial changes in content and organization have been made in preparing the sixth edition of this book. They include the following:

• The basic concepts of instruction set architecture are now covered using the RISC-style approach. This is followed by a comparative examination of the CISC-style approach.

• The processor design discussion is focused on a RISC-style implementation, which leads naturally to pipelined operation.

• Two chapters on embedded systems are included: one dealing with the basic structure of such systems and the use of microcontrollers, and the other dealing with system-on-a-chip implementations.

• Appendices are used to give examples of four commercial processors. Each appendix includes the essential information about the instruction set architecture of the given processor.

• Solved problems have been included in a new section toward the end of chapters and appendices. They provide the student with solutions that can be expected for typical problems.

Difficulty Level of Problems

The problems at the end of chapters and appendices have been classified as easy (E), medium (M), or difficult (D). These classifications should be interpreted as follows:


• Easy—Solutions can be derived in a few minutes by direct application of specific information presented in one place in the relevant section of the book.

• Medium—Use of the book material in a way that does not directly follow any examples presented is usually needed. In some cases, solutions may follow the general pattern of an example, but will take longer to develop than those for easy problems.

• Difficult—Some additional insight is needed to solve these problems. If a solution requires a program to be written, its underlying algorithm or form may be quite different from that of any program example given in the book. If a hardware design is required, it may involve an arrangement and interconnection of basic logic circuit components that is quite different from any design shown in the book. If a performance analysis is needed, it may involve the derivation of an algebraic expression.

What Can Be Covered in a One-Semester Course

This book is suitable for use at the university or college level as a text for a one-semester course in computer organization. It is intended for the first course that students will take on computer organization.

There is more than enough material in the book for a one-semester course. The core material on computer organization and relevant software issues is given in Chapters 1 through 9. For students who have not had a course in logic circuits, the material in Appendix A should be studied near the beginning of a course and certainly prior to covering Chapter 5.

A course aimed at embedded systems should include Chapters 1, 2, 3, 4, 7, 8, 10 and 11.

Use of the material on commercial processor examples in Appendices B through E can be guided by instructor and student interest, as well as by relevance to any hardware laboratory associated with a course.

Acknowledgments

We wish to express our thanks to many people who have helped us during the preparation of this sixth edition of the book.

Our colleagues Daniel Etiemble of University of Paris South and Glenn Gulak of University of Toronto provided numerous comments and suggestions that helped significantly in shaping the material.

Blair Fort and Dan Vranesic provided valuable help with some of the programming examples.

Warren R. Carithers of Rochester Institute of Technology, Krishna M. Kavi of University of North Texas, and Nelson Luiz Passos of Midwestern State University provided reviews of material from both the fifth and sixth editions of the book.

The following people provided reviews of material from the fifth edition of the book: Goh Hock Ann of Multimedia University, Joseph E. Beaini of University of Colorado Denver, Kalyan Mohan Goli of Jawaharlal Nehru Technological University, Jaimon Jacob of Model Engineering College Ernakulam, M. Kumaresan of Anna University Coimbatore, Kenneth K. C. Lee of City University of Hong Kong, Manoj Kumar Mishra of Institute of Technical Education and Research, Junita Mohamad-Saleh of Universiti Sains Malaysia, Prashanta Kumar Patra of College of Engineering and Technology Bhubaneswar, Shanq-Jang Ruan of National Taiwan University of Science and Technology, S. D. Samantaray of G. B. Pant University of Agriculture and Technology, Shivakumar Sastry of University of Akron, Donatella Sciuto of Politecnico di Milano, M. P. Singh of National Institute of Technology Patna, Albert Starling of University of Arkansas, Shannon Tauro of University of California Irvine, R. Thangarajan of Kongu Engineering College, Ashok Kumar Turuk of National Institute of Technology Rourkela, and Philip A. Wilsey of University of Cincinnati.

Finally, we truly appreciate the support of Raghothaman Srinivasan, Peter E. Massar, Darlene M. Schueller, Lisa Bruflodt, Curt Reynolds, Brenda Rolwes, and Laura Fuller at McGraw-Hill.

Carl Hamacher
Zvonko Vranesic
Safwat Zaky
Naraig Manjikian


McGraw-Hill Create™ Craft your teaching resources to match the way you teach! With McGraw-Hill Create, www.mcgrawhillcreate.com, you can easily rearrange chapters, combine material from other content sources, and quickly upload content you have written, like your course syllabus or teaching notes. Find the content you need in Create by searching through thousands of leading McGraw-Hill textbooks. Arrange your book to fit your teaching style. Create even allows you to personalize your book's appearance by selecting the cover and adding your name, school, and course information. Order a Create book and you'll receive a complimentary print review copy in 3–5 business days or a complimentary electronic review copy (eComp) via email in minutes. Go to www.mcgrawhillcreate.com today and register to experience how McGraw-Hill Create empowers you to teach your students your way.

McGraw-Hill Higher Education and Blackboard® have teamed up.

Blackboard, the Web-based course management system, has partnered with McGraw-Hill to better allow students and faculty to use online materials and activities to complement face-to-face teaching. Blackboard features exciting social learning and teaching tools that foster more logical, visually impactful and active learning opportunities for students. You'll transform your closed-door classrooms into communities where students remain connected to their educational experience 24 hours a day.

This partnership allows you and your students access to McGraw-Hill's Create right from within your Blackboard course, all with one single sign-on. McGraw-Hill and Blackboard can now offer you easy access to industry leading technology and content, whether your campus hosts it, or we do. Be sure to ask your local McGraw-Hill representative for details.


Contents

Chapter 1
Basic Structure of Computers 1

1.1 Computer Types 2
1.2 Functional Units 3
1.2.1 Input Unit 4
1.2.2 Memory Unit 4
1.2.3 Arithmetic and Logic Unit 5
1.2.4 Output Unit 6
1.2.5 Control Unit 6
1.3 Basic Operational Concepts 7
1.4 Number Representation and Arithmetic Operations 9
1.4.1 Integers 10
1.4.2 Floating-Point Numbers 16
1.5 Character Representation 17
1.6 Performance 17
1.6.1 Technology 17
1.6.2 Parallelism 19
1.7 Historical Perspective 19
1.7.1 The First Generation 20
1.7.2 The Second Generation 20
1.7.3 The Third Generation 21
1.7.4 The Fourth Generation 21
1.8 Concluding Remarks 22
1.9 Solved Problems 22
Problems 24
References 25

Chapter 2
Instruction Set Architecture 27

2.1 Memory Locations and Addresses 28
2.1.1 Byte Addressability 30
2.1.2 Big-Endian and Little-Endian Assignments 30
2.1.3 Word Alignment 31
2.1.4 Accessing Numbers and Characters 32
2.2 Memory Operations 32
2.3 Instructions and Instruction Sequencing 32
2.3.1 Register Transfer Notation 33
2.3.2 Assembly-Language Notation 33
2.3.3 RISC and CISC Instruction Sets 34
2.3.4 Introduction to RISC Instruction Sets 34
2.3.5 Instruction Execution and Straight-Line Sequencing 36
2.3.6 Branching 37
2.3.7 Generating Memory Addresses 40
2.4 Addressing Modes 40
2.4.1 Implementation of Variables and Constants 41
2.4.2 Indirection and Pointers 42
2.4.3 Indexing and Arrays 45
2.5 Assembly Language 48
2.5.1 Assembler Directives 50
2.5.2 Assembly and Execution of Programs 53
2.5.3 Number Notation 54
2.6 Stacks 55
2.7 Subroutines 56
2.7.1 Subroutine Nesting and the Processor Stack 58
2.7.2 Parameter Passing 59
2.7.3 The Stack Frame 63
2.8 Additional Instructions 65
2.8.1 Logic Instructions 67
2.8.2 Shift and Rotate Instructions 68
2.8.3 Multiplication and Division 71
2.9 Dealing with 32-Bit Immediate Values 73
2.10 CISC Instruction Sets 74
2.10.1 Additional Addressing Modes 75
2.10.2 Condition Codes 77
2.11 RISC and CISC Styles 78
2.12 Example Programs 79
2.12.1 Vector Dot Product Program 79
2.12.2 String Search Program 81
2.13 Encoding of Machine Instructions 82
2.14 Concluding Remarks 85
2.15 Solved Problems 85
Problems 90


Chapter 3
Basic Input/Output 95

3.1 Accessing I/O Devices 96
3.1.1 I/O Device Interface 97
3.1.2 Program-Controlled I/O 97
3.1.3 An Example of a RISC-Style I/O Program 101
3.1.4 An Example of a CISC-Style I/O Program 101
3.2 Interrupts 103
3.2.1 Enabling and Disabling Interrupts 106
3.2.2 Handling Multiple Devices 107
3.2.3 Controlling I/O Device Behavior 109
3.2.4 Processor Control Registers 110
3.2.5 Examples of Interrupt Programs 111
3.2.6 Exceptions 116
3.3 Concluding Remarks 119
3.4 Solved Problems 119
Problems 126

Chapter 4
Software 129

4.1 The Assembly Process 130
4.1.1 Two-pass Assembler 131
4.2 Loading and Executing Object Programs 131
4.3 The Linker 132
4.4 Libraries 133
4.5 The Compiler 133
4.5.1 Compiler Optimizations 134
4.5.2 Combining Programs Written in Different Languages 134
4.6 The Debugger 134
4.7 Using a High-level Language for I/O Tasks 137
4.8 Interaction between Assembly Language and C Language 139
4.9 The Operating System 143
4.9.1 The Boot-strapping Process 144
4.9.2 Managing the Execution of Application Programs 144
4.9.3 Use of Interrupts in Operating Systems 146
4.10 Concluding Remarks 149
Problems 149
References 150

Chapter 5
Basic Processing Unit 151

5.1 Some Fundamental Concepts 152
5.2 Instruction Execution 155
5.2.1 Load Instructions 155
5.2.2 Arithmetic and Logic Instructions 156
5.2.3 Store Instructions 157
5.3 Hardware Components 158
5.3.1 Register File 158
5.3.2 ALU 160
5.3.3 Datapath 161
5.3.4 Instruction Fetch Section 164
5.4 Instruction Fetch and Execution Steps 165
5.4.1 Branching 168
5.4.2 Waiting for Memory 171
5.5 Control Signals 172
5.6 Hardwired Control 175
5.6.1 Datapath Control Signals 177
5.6.2 Dealing with Memory Delay 177
5.7 CISC-Style Processors 178
5.7.1 An Interconnect using Buses 180
5.7.2 Microprogrammed Control 183
5.8 Concluding Remarks 185
5.9 Solved Problems 185
Problems 188

Chapter 6
Pipelining 193

6.1 Basic Concept—The Ideal Case 194
6.2 Pipeline Organization 195
6.3 Pipelining Issues 196
6.4 Data Dependencies 197
6.4.1 Operand Forwarding 198
6.4.2 Handling Data Dependencies in Software 199
6.5 Memory Delays 201
6.6 Branch Delays 202
6.6.1 Unconditional Branches 202
6.6.2 Conditional Branches 204
6.6.3 The Branch Delay Slot 204
6.6.4 Branch Prediction 205
6.7 Resource Limitations 209
6.8 Performance Evaluation 209
6.8.1 Effects of Stalls and Penalties 210
6.8.2 Number of Pipeline Stages 212


6.9 Superscalar Operation 212
6.9.1 Branches and Data Dependencies 214
6.9.2 Out-of-Order Execution 215
6.9.3 Execution Completion 216
6.9.4 Dispatch Operation 217
6.10 Pipelining in CISC Processors 218
6.10.1 Pipelining in ColdFire Processors 219
6.10.2 Pipelining in Intel Processors 219
6.11 Concluding Remarks 220
6.12 Examples of Solved Problems 220
Problems 222
References 226

Chapter 7
Input/Output Organization 227

7.1 Bus Structure 228
7.2 Bus Operation 229
7.2.1 Synchronous Bus 230
7.2.2 Asynchronous Bus 233
7.2.3 Electrical Considerations 236
7.3 Arbitration 237
7.4 Interface Circuits 238
7.4.1 Parallel Interface 239
7.4.2 Serial Interface 243
7.5 Interconnection Standards 247
7.5.1 Universal Serial Bus (USB) 247
7.5.2 FireWire 251
7.5.3 PCI Bus 252
7.5.4 SCSI Bus 256
7.5.5 SATA 258
7.5.6 SAS 258
7.5.7 PCI Express 258
7.6 Concluding Remarks 260
7.7 Solved Problems 260
Problems 263
References 266

Chapter 8
The Memory System 267

8.1 Basic Concepts 268
8.2 Semiconductor RAM Memories 270
8.2.1 Internal Organization of Memory Chips 270
8.2.2 Static Memories 271
8.2.3 Dynamic RAMs 274
8.2.4 Synchronous DRAMs 276
8.2.5 Structure of Larger Memories 279
8.3 Read-only Memories 282
8.3.1 ROM 283
8.3.2 PROM 283
8.3.3 EPROM 284
8.3.4 EEPROM 284
8.3.5 Flash Memory 284
8.4 Direct Memory Access 285
8.5 Memory Hierarchy 288
8.6 Cache Memories 289
8.6.1 Mapping Functions 291
8.6.2 Replacement Algorithms 296
8.6.3 Examples of Mapping Techniques 297
8.7 Performance Considerations 300
8.7.1 Hit Rate and Miss Penalty 301
8.7.2 Caches on the Processor Chip 302
8.7.3 Other Enhancements 303
8.8 Virtual Memory 305
8.8.1 Address Translation 306
8.9 Memory Management Requirements 310
8.10 Secondary Storage 311
8.10.1 Magnetic Hard Disks 311
8.10.2 Optical Disks 317
8.10.3 Magnetic Tape Systems 322
8.11 Concluding Remarks 323
8.12 Solved Problems 324
Problems 328
References 332

Chapter 9
Arithmetic 335

9.1 Addition and Subtraction of Signed Numbers 336
9.1.1 Addition/Subtraction Logic Unit 336
9.2 Design of Fast Adders 339
9.2.1 Carry-Lookahead Addition 340
9.3 Multiplication of Unsigned Numbers 344
9.3.1 Array Multiplier 344
9.3.2 Sequential Circuit Multiplier 346
9.4 Multiplication of Signed Numbers 346
9.4.1 The Booth Algorithm 348
9.5 Fast Multiplication 351
9.5.1 Bit-Pair Recoding of Multipliers 352
9.5.2 Carry-Save Addition of Summands 353


9.5.3 Summand Addition Tree using 3-2Reducers 355

9.5.4 Summand Addition Tree using 4-2Reducers 357

9.5.5 Summary of Fast Multiplication 3599.6 Integer Division 3609.7 Floating-Point Numbers and Operations 363

9.7.1 Arithmetic Operations onFloating-Point Numbers 367

9.7.2 Guard Bits and Truncation 3689.7.3 Implementing Floating-Point

Operations 3699.8 Decimal-to-Binary Conversion 3729.9 Concluding Remarks 3729.10 Solved Problems 374

Problems 377References 383

Chapter 10
Embedded Systems 385

10.1 Examples of Embedded Systems 386
10.1.1 Microwave Oven 386
10.1.2 Digital Camera 387
10.1.3 Home Telemetry 390

10.2 Microcontroller Chips for Embedded Applications 390

10.3 A Simple Microcontroller 392
10.3.1 Parallel I/O Interface 392
10.3.2 Serial I/O Interface 395
10.3.3 Counter/Timer 397
10.3.4 Interrupt-Control Mechanism 399
10.3.5 Programming Examples 399

10.4 Reaction Timer—A Complete Example 401

10.5 Sensors and Actuators 407
10.5.1 Sensors 407
10.5.2 Actuators 410
10.5.3 Application Examples 411

10.6 Microcontroller Families 412
10.6.1 Microcontrollers Based on the Intel 8051 413
10.6.2 Freescale Microcontrollers 413
10.6.3 ARM Microcontrollers 414

10.7 Design Issues 414
10.8 Concluding Remarks 417

Problems 418
References 420

Chapter 11
System-on-a-Chip—A Case Study 421

11.1 FPGA Implementation 422
11.1.1 FPGA Devices 423
11.1.2 Processor Choice 423

11.2 Computer-Aided Design Tools 424
11.2.1 Altera CAD Tools 425

11.3 Alarm Clock Example 428
11.3.1 User’s View of the System 428
11.3.2 System Definition and Generation 429
11.3.3 Circuit Implementation 430
11.3.4 Application Software 431

11.4 Concluding Remarks 440

Problems 440
References 441

Chapter 12
Parallel Processing and Performance 443

12.1 Hardware Multithreading 444

12.2 Vector (SIMD) Processing 445
12.2.1 Graphics Processing Units (GPUs) 448

12.3 Shared-Memory Multiprocessors 448
12.3.1 Interconnection Networks 450

12.4 Cache Coherence 453
12.4.1 Write-Through Protocol 453
12.4.2 Write-Back Protocol 454
12.4.3 Snoopy Caches 454
12.4.4 Directory-Based Cache Coherence 456

12.5 Message-Passing Multicomputers 456
12.6 Parallel Programming for Multiprocessors 456
12.7 Performance Modeling 460
12.8 Concluding Remarks 461

Problems 462
References 463

Appendix A
Logic Circuits 465

A.1 Basic Logic Functions
A.1.1 Electronic Logic Gates 469

A.2 Synthesis of Logic Functions 470


A.3 Minimization of Logic Expressions 472
A.3.1 Minimization using Karnaugh Maps 475
A.3.2 Don’t-Care Conditions 477

A.4 Synthesis with NAND and NOR Gates 479

A.5 Practical Implementation of Logic Gates 482
A.5.1 CMOS Circuits 484
A.5.2 Propagation Delay 489
A.5.3 Fan-In and Fan-Out Constraints 490
A.5.4 Tri-State Buffers 491

A.6 Flip-Flops 492
A.6.1 Gated Latches 493
A.6.2 Master-Slave Flip-Flop 495
A.6.3 Edge Triggering 498
A.6.4 T Flip-Flop 498
A.6.5 JK Flip-Flop 499
A.6.6 Flip-Flops with Preset and Clear 501

A.7 Registers and Shift Registers 502
A.8 Counters 503
A.9 Decoders 505
A.10 Multiplexers 506

A.11 Programmable Logic Devices (PLDs) 509
A.11.1 Programmable Logic Array (PLA) 509
A.11.2 Programmable Array Logic (PAL) 511
A.11.3 Complex Programmable Logic Devices (CPLDs) 512

A.12 Field-Programmable Gate Arrays 514

A.13 Sequential Circuits 516
A.13.1 Design of an Up/Down Counter as a Sequential Circuit 516
A.13.2 Timing Diagrams 519
A.13.3 The Finite State Machine Model 520
A.13.4 Synthesis of Finite State Machines 521

A.14 Concluding Remarks 522

Problems 522
References 528

Appendix B
The Altera Nios II Processor 529

B.1 Nios II Characteristics 530
B.2 General-Purpose Registers 531
B.3 Addressing Modes 532

B.4 Instructions 533
B.4.1 Notation 533
B.4.2 Load and Store Instructions 534
B.4.3 Arithmetic Instructions 536
B.4.4 Logic Instructions 537
B.4.5 Move Instructions 537
B.4.6 Branch and Jump Instructions 538
B.4.7 Subroutine Linkage Instructions 541
B.4.8 Comparison Instructions 545
B.4.9 Shift Instructions 546
B.4.10 Rotate Instructions 547
B.4.11 Control Instructions 548

B.5 Pseudoinstructions 548
B.6 Assembler Directives 549
B.7 Carry and Overflow Detection 551
B.8 Example Programs 553
B.9 Control Registers 553

B.10 Input/Output 555
B.10.1 Program-Controlled I/O 556
B.10.2 Interrupts and Exceptions 556

B.11 Advanced Configurations of Nios II Processor 562
B.11.1 External Interrupt Controller 562
B.11.2 Memory Management Unit 562
B.11.3 Floating-Point Hardware 562

B.12 Concluding Remarks 563
B.13 Solved Problems 563

Problems 568

Appendix C
The ColdFire Processor 571

C.1 Memory Organization 572
C.2 Registers 572

C.3 Instructions 573
C.3.1 Addressing Modes 575
C.3.2 Move Instruction 577
C.3.3 Arithmetic Instructions 578
C.3.4 Branch and Jump Instructions 582
C.3.5 Logic Instructions 585
C.3.6 Shift Instructions 586
C.3.7 Subroutine Linkage Instructions 587

C.4 Assembler Directives 593

C.5 Example Programs 594
C.5.1 Vector Dot Product Program 594
C.5.2 String Search Program 595

C.6 Mode of Operation and Other Control Features 596

C.7 Input/Output 597

C.8 Floating-Point Operations 599
C.8.1 FMOVE Instruction 599


C.8.2 Floating-Point Arithmetic Instructions 600
C.8.3 Comparison and Branch Instructions 601
C.8.4 Additional Floating-Point Instructions 601
C.8.5 Example Floating-Point Program 602

C.9 Concluding Remarks 603
C.10 Solved Problems 603

Problems 608
References 609

Appendix D
The ARM Processor 611

D.1 ARM Characteristics 612
D.1.1 Unusual Aspects of the ARM Architecture 612

D.2 Register Structure 613

D.3 Addressing Modes 614
D.3.1 Basic Indexed Addressing Mode 614
D.3.2 Relative Addressing Mode 615
D.3.3 Index Modes with Writeback 616
D.3.4 Offset Determination 616
D.3.5 Register, Immediate, and Absolute Addressing Modes 618
D.3.6 Addressing Mode Examples 618

D.4 Instructions 621
D.4.1 Load and Store Instructions 621
D.4.2 Arithmetic Instructions 622
D.4.3 Move Instructions 625
D.4.4 Logic and Test Instructions 626
D.4.5 Compare Instructions 627
D.4.6 Setting Condition Code Flags 628
D.4.7 Branch Instructions 628
D.4.8 Subroutine Linkage Instructions 631

D.5 Assembly Language 635
D.5.1 Pseudoinstructions 637

D.6 Example Programs 638
D.6.1 Vector Dot Product 639
D.6.2 String Search 639

D.7 Operating Modes and Exceptions 639
D.7.1 Banked Registers 641
D.7.2 Exception Types 642
D.7.3 System Mode 644
D.7.4 Handling Exceptions 644

D.8 Input/Output 646
D.8.1 Program-Controlled I/O 646
D.8.2 Interrupt-Driven I/O 648

D.9 Conditional Execution of Instructions 648
D.10 Coprocessors 650
D.11 Embedded Applications and the Thumb ISA 651
D.12 Concluding Remarks 651
D.13 Solved Problems 652

Problems 657
References 660

Appendix E
The Intel IA-32 Architecture 661

E.1 Memory Organization 662
E.2 Register Structure 662
E.3 Addressing Modes 665

E.4 Instructions 668
E.4.1 Machine Instruction Format 670
E.4.2 Assembly-Language Notation 670
E.4.3 Move Instruction 671
E.4.4 Load-Effective-Address Instruction 671
E.4.5 Arithmetic Instructions 672
E.4.6 Jump and Loop Instructions 674
E.4.7 Logic Instructions 677
E.4.8 Shift and Rotate Instructions 678
E.4.9 Subroutine Linkage Instructions 679
E.4.10 Operations on Large Numbers 681

E.5 Assembler Directives 685

E.6 Example Programs 686
E.6.1 Vector Dot Product Program 686
E.6.2 String Search Program 686

E.7 Interrupts and Exceptions 687
E.8 Input/Output Examples 689

E.9 Scalar Floating-Point Operations 690
E.9.1 Load and Store Instructions 692
E.9.2 Arithmetic Instructions 693
E.9.3 Comparison Instructions 694
E.9.4 Additional Instructions 694
E.9.5 Example Floating-Point Program 694

E.10 Multimedia Extension (MMX) Operations 695
E.11 Vector (SIMD) Floating-Point Operations 696
E.12 Examples of Solved Problems 697
E.13 Concluding Remarks 702

Problems 702
References 703


Chapter 1

Basic Structure of Computers

Chapter Objectives

In this chapter you will be introduced to:

• The different types of computers

• The basic structure of a computer and its operation

• Machine instructions and their execution

• Number and character representations

• Addition and subtraction of binary numbers

• Basic performance issues in computer systems

• A brief history of computer development


This book is about computer organization. It explains the function and design of the various units of digital computers that store and process information. It also deals with the input units of the computer, which receive information from external sources, and the output units, which send computed results to external destinations. The input, storage, processing, and output operations are governed by a list of instructions that constitute a program.

Most of the material in the book is devoted to computer hardware and computer architecture. Computer hardware consists of electronic circuits, magnetic and optical storage devices, displays, electromechanical devices, and communication facilities. Computer architecture encompasses the specification of an instruction set and the functional behavior of the hardware units that implement the instructions.

Many aspects of programming and software components in computer systems are also discussed in the book. It is important to consider both hardware and software aspects of the design of the various computer components in order to gain a good understanding of computer systems.

1.1 Computer Types

Since their introduction in the 1940s, digital computers have evolved into many different types that vary widely in size, cost, computational power, and intended use. Modern computers can be divided roughly into four general categories:

• Embedded computers are integrated into a larger device or system in order to automatically monitor and control a physical process or environment. They are used for a specific purpose rather than for general processing tasks. Typical applications include industrial and home automation, appliances, telecommunication products, and vehicles. Users may not even be aware of the role that computers play in such systems.

• Personal computers have achieved widespread use in homes, educational institutions, and business and engineering office settings, primarily for dedicated individual use. They support a variety of applications such as general computation, document preparation, computer-aided design, audiovisual entertainment, interpersonal communication, and Internet browsing. A number of classifications are used for personal computers. Desktop computers serve general needs and fit within a typical personal workspace. Workstation computers offer higher computational capacity and more powerful graphical display capabilities for engineering and scientific work. Finally, portable and notebook computers provide the basic features of a personal computer in a smaller, lightweight package. They can operate on batteries to provide mobility.

• Servers and enterprise systems are large computers that are meant to be shared by a potentially large number of users who access them from some form of personal computer over a public or private network. Such computers may host large databases and provide information processing for a government agency or a commercial organization.

• Supercomputers and grid computers normally offer the highest performance. They are the most expensive and physically the largest category of computers. Supercomputers are used for the highly demanding computations needed in weather forecasting, engineering design and simulation, and scientific work. Grid computers provide a more cost-effective alternative. They combine a large number of personal computers and disk storage units in a physically distributed high-speed network, called a grid, which is managed as a coordinated computing resource. By evenly distributing the computational workload across the grid, it is possible to achieve high performance on large applications ranging from numerical computation to information searching.

There is an emerging trend in access to computing facilities, known as cloud computing. Personal computer users access widely distributed computing and storage server resources for individual, independent computing needs. The Internet provides the necessary communication facility. Cloud hardware and software service providers operate as a utility, charging on a pay-as-you-use basis.

1.2 Functional Units

A computer consists of five functionally independent main parts: input, memory, arithmetic and logic, output, and control units, as shown in Figure 1.1. The input unit accepts coded information from human operators using devices such as keyboards, or from other computers over digital communication lines. The information received is stored in the computer’s memory, either for later use or to be processed immediately by the arithmetic and logic unit. The processing steps are specified by a program that is also stored in the memory. Finally, the results are sent back to the outside world through the output unit. All of these actions are coordinated by the control unit. An interconnection network provides the means for the functional units to exchange information and coordinate their actions. Later chapters will provide more details on individual units and their interconnections. We refer to the arithmetic and logic circuits, in conjunction with the main control circuits, as the processor. Input and output equipment is often collectively referred to as the input-output (I/O) unit.

[Figure 1.1: Basic functional units of a computer. The input, output, memory, arithmetic and logic, and control units are connected by an interconnection network; the arithmetic and logic unit and the control unit form the processor, and the input and output units form the I/O.]

We now take a closer look at the information handled by a computer. It is convenient to categorize this information as either instructions or data. Instructions, or machine instructions, are explicit commands that

• Govern the transfer of information within a computer as well as between the computer and its I/O devices

• Specify the arithmetic and logic operations to be performed

A program is a list of instructions which performs a task. Programs are stored in the memory. The processor fetches the program instructions from the memory, one after another, and performs the desired operations. The computer is controlled by the stored program, except for possible external interruption by an operator or by I/O devices connected to it. Data are numbers and characters that are used as operands by the instructions. Data are also stored in the memory.

The instructions and data handled by a computer must be encoded in a suitable format. Most present-day hardware employs digital circuits that have only two stable states. Each instruction, number, or character is encoded as a string of binary digits called bits, each having one of two possible values, 0 or 1, represented by the two stable states. Numbers are usually represented in positional binary notation, as discussed in Section 1.4. Alphanumeric characters are also expressed in terms of binary codes, as discussed in Section 1.5.

1.2.1 Input Unit

Computers accept coded information through input units. The most common input device is the keyboard. Whenever a key is pressed, the corresponding letter or digit is automatically translated into its corresponding binary code and transmitted to the processor.

Many other kinds of input devices for human-computer interaction are available, including the touchpad, mouse, joystick, and trackball. These are often used as graphic input devices in conjunction with displays. Microphones can be used to capture audio input, which is then sampled and converted into digital codes for storage and processing. Similarly, cameras can be used to capture video input.

Digital communication facilities, such as the Internet, can also provide input to a computer from other computers and database servers.

1.2.2 Memory Unit

The function of the memory unit is to store programs and data. There are two classes of storage, called primary and secondary.

Primary Memory

Primary memory, also called main memory, is a fast memory that operates at electronic speeds. Programs must be stored in this memory while they are being executed. The memory consists of a large number of semiconductor storage cells, each capable of storing one bit of information. These cells are rarely read or written individually. Instead, they are handled in groups of fixed size called words. The memory is organized so that one word can be stored or retrieved in one basic operation. The number of bits in each word is referred to as the word length of the computer, typically 16, 32, or 64 bits.

To provide easy access to any word in the memory, a distinct address is associated with each word location. Addresses are consecutive numbers, starting from 0, that identify successive locations. A particular word is accessed by specifying its address and issuing a control command to the memory that starts the storage or retrieval process.

Instructions and data can be written into or read from the memory under the control of the processor. It is essential to be able to access any word location in the memory as quickly as possible. A memory in which any location can be accessed in a short and fixed amount of time after specifying its address is called a random-access memory (RAM). The time required to access one word is called the memory access time. This time is independent of the location of the word being accessed. It typically ranges from a few nanoseconds (ns) to about 100 ns for current RAM units.

Cache Memory

As an adjunct to the main memory, a smaller, faster RAM unit, called a cache, is used to hold sections of a program that are currently being executed, along with any associated data. The cache is tightly coupled with the processor and is usually contained on the same integrated-circuit chip. The purpose of the cache is to facilitate high instruction execution rates.

At the start of program execution, the cache is empty. All program instructions and any required data are stored in the main memory. As execution proceeds, instructions are fetched into the processor chip, and a copy of each is placed in the cache. When the execution of an instruction requires data located in the main memory, the data are fetched and copies are also placed in the cache.

Now, suppose a number of instructions are executed repeatedly, as happens in a program loop. If these instructions are available in the cache, they can be fetched quickly during the period of repeated use. Similarly, if the same data locations are accessed repeatedly while copies of their contents are available in the cache, they can be fetched quickly.
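The benefit of reuse can be illustrated with a small simulation. The sketch below models a hypothetical direct-mapped cache; the line count, line size, and access pattern are illustrative assumptions chosen for this example, not parameters from the text:

```python
# Illustrative model of a tiny direct-mapped cache. The parameters are
# hypothetical, chosen only to show the effect of repeated accesses.
NUM_LINES = 8          # number of cache lines
WORDS_PER_LINE = 4     # words held in each line

def simulate(addresses):
    """Return (hits, misses) for a sequence of word addresses."""
    tags = [None] * NUM_LINES        # tag currently stored in each line
    hits = misses = 0
    for addr in addresses:
        block = addr // WORDS_PER_LINE   # which memory block holds this word
        line = block % NUM_LINES         # the line that block maps to
        tag = block // NUM_LINES         # identifies the block within the line
        if tags[line] == tag:
            hits += 1
        else:
            misses += 1
            tags[line] = tag             # fetch the block from main memory
    return hits, misses

# A loop body that touches the same 16 words ten times over:
loop_accesses = [a for _ in range(10) for a in range(16)]
hits, misses = simulate(loop_accesses)
print(hits, misses)   # 156 4: only the first pass misses, once per block
```

After the first pass through the loop brings the four blocks into the cache, every later access hits, which is exactly the locality effect described above.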

Secondary Storage

Although primary memory is essential, it tends to be expensive and does not retain information when power is turned off. Thus additional, less expensive, permanent secondary storage is used when large amounts of data and many programs have to be stored, particularly for information that is accessed infrequently. Access times for secondary storage are longer than for primary memory. A wide selection of secondary storage devices is available, including magnetic disks, optical disks (DVD and CD), and flash memory devices.

1.2.3 Arithmetic and Logic Unit

Most computer operations are executed in the arithmetic and logic unit (ALU) of the processor. Any arithmetic or logic operation, such as addition, subtraction, multiplication, division, or comparison of numbers, is initiated by bringing the required operands into the processor, where the operation is performed by the ALU. For example, if two numbers located in the memory are to be added, they are brought into the processor, and the addition is carried out by the ALU. The sum may then be stored in the memory or retained in the processor for immediate use.

When operands are brought into the processor, they are stored in high-speed storage elements called registers. Each register can store one word of data. Access times to registers are even shorter than access times to the cache unit on the processor chip.

1.2.4 Output Unit

The output unit is the counterpart of the input unit. Its function is to send processed results to the outside world. A familiar example of such a device is a printer. Most printers employ either photocopying techniques, as in laser printers, or ink jet streams. Such printers may generate output at speeds of 20 or more pages per minute. However, printers are mechanical devices, and as such are quite slow compared to the electronic speed of a processor.

Some units, such as graphic displays, provide both an output function, showing text and graphics, and an input function, through touchscreen capability. The dual role of such units is the reason for using the single name input/output (I/O) unit in many cases.

1.2.5 Control Unit

The memory, arithmetic and logic, and I/O units store and process information and perform input and output operations. The operation of these units must be coordinated in some way. This is the responsibility of the control unit. The control unit is effectively the nerve center that sends control signals to other units and senses their states.

I/O transfers, consisting of input and output operations, are controlled by program instructions that identify the devices involved and the information to be transferred. Control circuits are responsible for generating the timing signals that govern the transfers and determine when a given action is to take place. Data transfers between the processor and the memory are also managed by the control unit through timing signals. It is reasonable to think of a control unit as a well-defined, physically separate unit that interacts with other parts of the computer. In practice, however, this is seldom the case. Much of the control circuitry is physically distributed throughout the computer. A large set of control lines (wires) carries the signals used for timing and synchronization of events in all units.

The operation of a computer can be summarized as follows:

• The computer accepts information in the form of programs and data through an input unit and stores it in the memory.

• Information stored in the memory is fetched under program control into an arithmetic and logic unit, where it is processed.

• Processed information leaves the computer through an output unit.

• All activities in the computer are directed by the control unit.


1.3 Basic Operational Concepts

In Section 1.2, we stated that the activity in a computer is governed by instructions. To perform a given task, an appropriate program consisting of a list of instructions is stored in the memory. Individual instructions are brought from the memory into the processor, which executes the specified operations. Data to be used as instruction operands are also stored in the memory.

A typical instruction might be

Load R2, LOC

This instruction reads the contents of a memory location whose address is represented symbolically by the label LOC and loads them into processor register R2. The original contents of location LOC are preserved, whereas those of register R2 are overwritten. Execution of this instruction requires several steps. First, the instruction is fetched from the memory into the processor. Next, the operation to be performed is determined by the control unit. The operand at LOC is then fetched from the memory into the processor. Finally, the operand is stored in register R2.

After operands have been loaded from memory into processor registers, arithmetic or logic operations can be performed on them. For example, the instruction

Add R4, R2, R3

adds the contents of registers R2 and R3, then places their sum into register R4. The operands in R2 and R3 are not altered, but the previous value in R4 is overwritten by the sum.

After completing the desired operations, the results are in processor registers. They can be transferred to the memory using instructions such as

Store R4, LOC

This instruction copies the operand in register R4 to memory location LOC. The original contents of location LOC are overwritten, but those of R4 are preserved.

For Load and Store instructions, transfers between the memory and the processor are initiated by sending the address of the desired memory location to the memory unit and asserting the appropriate control signals. The data are then transferred to or from the memory.
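The Load, Add, and Store steps described above can be mimicked with a toy interpreter. Everything here is a hypothetical sketch for illustration: the memory contents, the encoding of instructions as Python tuples, and the RESULT label are our own assumptions, not the book's notation:

```python
# Hypothetical interpreter for a tiny register machine with Load, Add, Store.
memory = {"LOC": 25, "RESULT": 0}       # symbolic memory locations (assumed)
regs = {f"R{i}": 0 for i in range(8)}   # eight general-purpose registers

program = [
    ("Load",  "R2", "LOC"),             # R2 <- contents of LOC
    ("Load",  "R3", "LOC"),             # R3 <- contents of LOC
    ("Add",   "R4", "R2", "R3"),        # R4 <- R2 + R3
    ("Store", "R4", "RESULT"),          # RESULT <- R4
]

pc = 0                                  # program counter
while pc < len(program):
    instr = program[pc]                 # fetch the instruction (into the "IR")
    pc += 1                             # PC now points to the next instruction
    op = instr[0]
    if op == "Load":
        regs[instr[1]] = memory[instr[2]]
    elif op == "Add":
        regs[instr[1]] = regs[instr[2]] + regs[instr[3]]
    elif op == "Store":
        memory[instr[2]] = regs[instr[1]]

print(memory["RESULT"])   # → 50
```

Note how the interpreter advances the program counter as soon as an instruction is fetched, echoing the fetch-then-execute pattern the text describes.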

Figure 1.2 shows how the memory and the processor can be connected. It also shows some components of the processor that have not been discussed yet. The interconnections between these components are not shown explicitly, since we will only discuss their functional characteristics here. Chapter 5 describes the details of the interconnections as part of processor organization.

In addition to the ALU and the control circuitry, the processor contains a number of registers used for several different purposes. The instruction register (IR) holds the instruction that is currently being executed. Its output is available to the control circuits, which generate the timing signals that control the various processing elements involved in executing the instruction. The program counter (PC) is another specialized register. It contains the memory address of the next instruction to be fetched and executed. During the execution of an instruction, the contents of the PC are updated to correspond to the address of the next instruction to be executed. It is customary to say that the PC points to the next instruction that is to be fetched from the memory. In addition to the IR and PC, Figure 1.2 shows general-purpose registers R0 through Rn−1, often called processor registers. They serve a variety of functions, including holding operands that have been loaded from the memory for processing. The roles of the general-purpose registers are explained in detail in Chapter 2.

[Figure 1.2: Connection between the processor and the main memory. The processor contains the PC, the IR, control circuitry, the ALU, n general-purpose registers R0 through Rn−1, and a processor-memory interface.]

The processor-memory interface is a circuit which manages the transfer of data between the main memory and the processor. If a word is to be read from the memory, the interface sends the address of that word to the memory along with a Read control signal. The interface waits for the word to be retrieved, then transfers it to the appropriate processor register. If a word is to be written into memory, the interface transfers both the address and the word to the memory along with a Write control signal.

Let us now consider some typical operating steps. A program must be in the main memory in order for it to be executed. It is often transferred there from secondary storage through the input unit. Execution of the program begins when the PC is set to point to the first instruction of the program. The contents of the PC are transferred to the memory along with a Read control signal. When the addressed word (in this case, the first instruction of the program) has been fetched from the memory, it is loaded into register IR. At this point, the instruction is ready to be interpreted and executed.

Instructions such as Load, Store, and Add perform data transfer and arithmetic operations. If an operand that resides in the memory is required for an instruction, it is fetched by sending its address to the memory and initiating a Read operation. When the operand has been fetched from the memory, it is transferred to a processor register. After operands have been fetched in this way, the ALU can perform a desired arithmetic operation, such as Add, on the values in processor registers. The result is sent to a processor register. If the result is to be written into the memory with a Store instruction, it is transferred from the processor register to the memory, along with the address of the location where the result is to be stored, then a Write operation is initiated.

At some point during the execution of each instruction, the contents of the PC are incremented so that the PC points to the next instruction to be executed. Thus, as soon as the execution of the current instruction is completed, the processor is ready to fetch a new instruction.

In addition to transferring data between the memory and the processor, the computer accepts data from input devices and sends data to output devices. Thus, some machine instructions are provided for the purpose of handling I/O transfers.

Normal execution of a program may be preempted if some device requires urgent service. For example, a monitoring device in a computer-controlled industrial process may detect a dangerous condition. In order to respond immediately, execution of the current program must be suspended. To cause this, the device raises an interrupt signal, which is a request for service by the processor. The processor provides the requested service by executing a program called an interrupt-service routine. Because such diversions may alter the internal state of the processor, its state must be saved in the memory before servicing the interrupt request. Normally, the information that is saved includes the contents of the PC, the contents of the general-purpose registers, and some control information. When the interrupt-service routine is completed, the state of the processor is restored from the memory so that the interrupted program may continue.

This section has provided an overview of the operation of a computer. Detailed discussion of these concepts is given in subsequent chapters, first from the point of view of the programmer in Chapters 2, 3, and 4, and then from the point of view of the hardware designer in later chapters.

1.4 Number Representation and Arithmetic Operations

The most natural way to represent a number in a computer system is by a string of bits, called a binary number. We will first describe binary number representations for integers as well as arithmetic operations on them. Then we will provide a brief introduction to the representation of floating-point numbers.


1.4.1 Integers

Consider an n-bit vector

B = bn−1 . . . b1b0

where bi = 0 or 1 for 0 ≤ i ≤ n − 1. This vector can represent an unsigned integer value V(B) in the range 0 to 2^n − 1, where

V(B) = bn−1 × 2^(n−1) + · · · + b1 × 2^1 + b0 × 2^0
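The weighted sum is easy to check numerically. The following sketch (the helper name is our own) evaluates V(B) for a bit vector written as a string with bn−1 first:

```python
def unsigned_value(bits):
    """V(B): weight each bit b_i by 2^i, with the leftmost bit as b_(n-1)."""
    n = len(bits)
    return sum(int(b) << (n - 1 - i) for i, b in enumerate(bits))

print(unsigned_value("1011"))   # 1*8 + 0*4 + 1*2 + 1*1 = 11
print(unsigned_value("1111"))   # 2^4 - 1 = 15, the largest 4-bit value
```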

We need to represent both positive and negative numbers. Three systems are used for representing such numbers:

• Sign-and-magnitude
• 1’s-complement
• 2’s-complement

In all three systems, the leftmost bit is 0 for positive numbers and 1 for negative numbers. Figure 1.3 illustrates all three representations using 4-bit numbers. Positive values have identical representations in all systems, but negative values have different representations. In the sign-and-magnitude system, negative values are represented by changing the most significant bit (b3 in Figure 1.3) from 0 to 1 in the B vector of the corresponding positive value. For example, +5 is represented by 0101, and −5 is represented by 1101.

[Figure 1.3: Binary, signed-integer representations. For each 4-bit vector B = b3b2b1b0, the figure lists the value represented in the sign-and-magnitude, 1’s-complement, and 2’s-complement systems.]

In 1’s-complement representation, negative values are obtained by complementing each bit of the corresponding positive number. Thus, the representation for −3 is obtained by complementing each bit in the vector 0011 to yield 1100. The same operation, bit complementing, is done to convert a negative number to the corresponding positive value. Converting either way is referred to as forming the 1’s-complement of a given number. For n-bit numbers, this operation is equivalent to subtracting the number from 2^n − 1. In the case of the 4-bit numbers in Figure 1.3, we subtract from 2^4 − 1 = 15, or 1111 in binary.

Finally, in the 2’s-complement system, forming the 2’s-complement of an n-bit number is done by subtracting the number from 2^n. Hence, the 2’s-complement of a number is obtained by adding 1 to the 1’s-complement of that number.

Note that there are distinct representations for +0 and −0 in both the sign-and-magnitude and 1’s-complement systems, but the 2’s-complement system has only one representation for 0. For 4-bit numbers, as shown in Figure 1.3, the value −8 is representable in the 2’s-complement system but not in the other systems. The sign-and-magnitude system seems the most natural, because we deal with sign-and-magnitude decimal values in manual computations. The 1’s-complement system is easily related to this system, but the 2’s-complement system may appear somewhat unnatural. However, we will show that the 2’s-complement system leads to the most efficient way to carry out addition and subtraction operations. It is the one most often used in modern computers.
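The three encodings can be sketched in a few lines of Python. This is an illustrative sketch, not from the text; the function name `encodings` and the bit-string representation are our own choices, and the sketch assumes x is in range for all three systems:

```python
def encodings(x, n=4):
    """Return the sign-and-magnitude, 1's-complement, and 2's-complement
    n-bit patterns for integer x, as bit strings."""
    if x >= 0:
        b = format(x, '0{}b'.format(n))
        return b, b, b                                # positives agree in all three
    mag = format(-x, '0{}b'.format(n))
    sm = '1' + mag[1:]                                # set the sign bit of +|x|
    ones = format(2**n - 1 + x, '0{}b'.format(n))     # subtract |x| from 2^n - 1
    twos = format(2**n + x, '0{}b'.format(n))         # subtract |x| from 2^n
    return sm, ones, twos
```

For example, `encodings(-5)` gives `('1101', '1010', '1011')`, matching the three columns of Figure 1.3.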

Addition of Unsigned Integers

Addition of 1-bit numbers is illustrated in Figure 1.4. The sum of 1 and 1 is the 2-bit vector 10, which represents the value 2. We say that the sum is 0 and the carry-out is 1. In order to add multiple-bit numbers, we use a method analogous to that used for manual computation with decimal numbers. We add bit pairs starting from the low-order (right) end of the bit vectors, propagating carries toward the high-order (left) end. The carry-out from a bit pair becomes the carry-in to the next bit pair to the left. The carry-in must be added to a bit pair in generating the sum and carry-out at that position. For example, if both bits of a pair are 1 and the carry-in is 1, then the sum is 1 and the carry-out is 1, which represents the value 3.
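The bit-pair procedure just described can be sketched directly. This is an illustrative sketch; the name `add_bits` is our own:

```python
def add_bits(a, b):
    """Add two equal-length bit strings right to left, propagating
    carries; return the sum bits and the final carry-out."""
    carry = 0
    out = []
    for x, y in zip(reversed(a), reversed(b)):
        s = int(x) + int(y) + carry
        out.append(str(s % 2))    # sum bit at this position
        carry = s // 2            # carry-out feeds the next pair to the left
    return ''.join(reversed(out)), carry
```

For example, `add_bits('0111', '0110')` returns `('1101', 0)`, i.e., 7 + 6 = 13 with no carry-out.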

0 + 0 = 0        0 + 1 = 1        1 + 0 = 1        1 + 1 = 10 (sum 0, carry-out 1)

Figure 1.4 Addition of 1-bit numbers.


Addition and Subtraction of Signed Integers

We introduced three systems for representing positive and negative numbers, or, simply, signed numbers. These systems differ only in the way they represent negative values. Their relative merits from the standpoint of ease of performing arithmetic operations can be summarized as follows. The sign-and-magnitude system is the simplest representation, but it is also the most awkward for addition and subtraction operations. The 1’s-complement method is somewhat better. The 2’s-complement system is the most efficient method for performing addition and subtraction operations.

To understand 2’s-complement arithmetic, consider addition modulo N (abbreviated as mod N). A helpful graphical device for the description of addition of unsigned integers mod N is a circle with the values 0 through N − 1 marked along its perimeter, as shown in Figure 1.5a. Consider the case N = 16, shown in part (b) of the figure. The decimal values 0 through 15 are represented by their 4-bit binary values 0000 through 1111 around the outside of the circle. In terms of decimal values, the operation (7 + 5) mod 16 yields the value 12. To perform this operation graphically, locate 7 (0111) on the outside of the circle and then move 5 units in the clockwise direction to arrive at the answer 12 (1100). Similarly, (9 + 14) mod 16 = 7; this is modeled on the circle by locating 9 (1001) and moving 14 units in the clockwise direction past the zero position to arrive at the answer 7 (0111). This graphical technique works for the computation of (a + b) mod 16 for any unsigned integers a and b; that is, to perform addition, locate a and move b units in the clockwise direction to arrive at (a + b) mod 16.

Now consider a different interpretation of the mod 16 circle. We will reinterpret the binary vectors outside the circle to represent the signed integers from −8 through +7 in the 2’s-complement representation as shown inside the circle.

Let us apply the mod 16 addition technique to the example of adding +7 to −3. The 2’s-complement representation for these numbers is 0111 and 1101, respectively. To add these numbers, locate 0111 on the circle in Figure 1.5b. Then move 1101 (13) steps in the clockwise direction to arrive at 0100, which yields the correct answer of +4. Note that the 2’s-complement representation of −3 is interpreted as an unsigned value for the number of steps to move.

If we perform this addition by adding bit pairs from right to left, we obtain

  0 1 1 1
+ 1 1 0 1
---------
1 0 1 0 0
↑
Carry-out

If we ignore the carry-out from the fourth bit position in this addition, we obtain the correct answer. In fact, this is always the case. Ignoring this carry-out is a natural result of using mod N arithmetic. As we move around the circle in Figure 1.5b, the value next to 1111 would normally be 10000. Instead, we go back to the value 0000.
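Ignoring the carry-out amounts to reduction modulo 2^n, which is easy to check with a short sketch (illustrative; the helper names are our own):

```python
def to_signed(v, n=4):
    """Reinterpret an n-bit unsigned value as a 2's-complement integer."""
    return v - 2**n if v >= 2**(n - 1) else v

def add_mod(x, y, n=4):
    """Add the n-bit patterns of x and y, discarding the carry-out
    (i.e., working mod 2^n), then reinterpret the result as signed."""
    return to_signed(((x % 2**n) + (y % 2**n)) % 2**n, n)
```

Here `add_mod(7, -3)` returns `4`, reproducing the example above.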

The rules governing addition and subtraction of n-bit signed numbers using the 2’s-complement representation system may be stated as follows:


[Figure 1.5 Modular number systems and the 2’s-complement system. (a) Circle representation of integers mod N: the values 0, 1, 2, ..., N − 2, N − 1 are marked along the perimeter. (b) Mod 16 system for 2’s-complement numbers: the 4-bit vectors 0000 through 1111 appear around the outside of the circle, with the corresponding signed values +0 through +7 and −8 through −1 shown inside.]

• To add two numbers, add their n-bit representations, ignoring the carry-out bit from the most significant bit (MSB) position. The sum will be the algebraically correct value in 2’s-complement representation if the actual result is in the range −2^(n-1) through +2^(n-1) − 1.

• To subtract two numbers X and Y, that is, to perform X − Y, form the 2’s-complement of Y, then add it to X using the add rule. Again, the result will be the algebraically correct value in 2’s-complement representation if the actual result is in the range −2^(n-1) through +2^(n-1) − 1.
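The subtract rule can be sketched in the same style, forming the 2's-complement of Y as its bit complement plus 1 (an illustrative sketch; `sub_2s` is our own name):

```python
def sub_2s(x, y, n=4):
    """Compute X - Y on n-bit 2's-complement patterns: complement Y,
    add 1, then add to X mod 2^n (carry-out ignored)."""
    neg_y = ((y % 2**n) ^ (2**n - 1)) + 1    # 1's-complement of Y, plus 1
    s = ((x % 2**n) + neg_y) % 2**n          # apply the add rule
    return s - 2**n if s >= 2**(n - 1) else s
```

For example, `sub_2s(-3, -7)` returns `4` and `sub_2s(2, 4)` returns `-2`, provided the true result fits in the representable range.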


Figure 1.6 shows some examples of addition and subtraction in the 2’s-complement system. In all of these 4-bit examples, the answers fall within the representable range of −8 through +7. When answers do not fall within the representable range, we say that arithmetic overflow has occurred. A later subsection discusses such situations. The four addition operations (a) through (d) in Figure 1.6 follow the add rule, and the six subtraction operations (e) through (j) follow the subtract rule. The subtraction operation requires forming the 2’s-complement of the subtrahend (the bottom value). This operation

(a) 0010 (+2) + 0011 (+3) = 0101 (+5)
(b) 0100 (+4) + 1010 (−6) = 1110 (−2)
(c) 1011 (−5) + 1110 (−2) = 1001 (−7)
(d) 0111 (+7) + 1101 (−3) = 0100 (+4)
(e) 1101 (−3) − 1001 (−7): 1101 + 0111 = 0100 (+4)
(f) 0010 (+2) − 0100 (+4): 0010 + 1100 = 1110 (−2)
(g) 0110 (+6) − 0011 (+3): 0110 + 1101 = 0011 (+3)
(h) 1001 (−7) − 1011 (−5): 1001 + 0101 = 1110 (−2)
(i) 1001 (−7) − 0001 (+1): 1001 + 1111 = 1000 (−8)
(j) 0010 (+2) − 1101 (−3): 0010 + 0011 = 0101 (+5)

Figure 1.6 2’s-complement Add and Subtract operations.


is done in exactly the same manner for both positive and negative numbers. To form the 2’s-complement of a number, form the bit complement of the number and add 1.

The simplicity of adding and subtracting signed numbers in 2’s-complement representation is the reason why this number representation is used in modern computers. It might seem that the 1’s-complement representation would be just as good as the 2’s-complement system. However, although complementation is easy, the result obtained after an addition operation is not always correct. The carry-out, c_n, cannot be ignored. If c_n = 0, the result obtained is correct. If c_n = 1, then a 1 must be added to the result to make it correct. The need for this correction operation means that addition and subtraction cannot be implemented as conveniently in the 1’s-complement system as in the 2’s-complement system.
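The end-around-carry correction for 1's-complement addition can be sketched as follows (illustrative; `add_1s` is our own name, and the −0 pattern is left uncorrected):

```python
def add_1s(a, b, n=4):
    """Add two n-bit 1's-complement patterns (given as unsigned ints).
    When the carry-out c_n is 1, it must be added back into the result."""
    s = a + b
    if s >= 2**n:            # c_n = 1: apply the end-around carry
        s = s - 2**n + 1
    return s
```

For example, adding −3 (1100) and +5 (0101) produces c_n = 1, and the correction yields 0010 = +2.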

Sign Extension

We often need to represent a value given in a certain number of bits by using a larger number of bits. For a positive number, this is achieved by adding 0s to the left. For a negative number in 2’s-complement representation, the leftmost bit, which indicates the sign of the number, is a 1. A longer number with the same value is obtained by replicating the sign bit to the left as many times as needed. To see why this is correct, examine the mod 16 circle of Figure 1.5b. Compare it to larger circles for the mod 32 or mod 64 cases. The representations for the values −1, −2, etc., are exactly the same, with 1s added to the left. In summary, to represent a signed number in 2’s-complement form using a larger number of bits, repeat the sign bit as many times as needed to the left. This operation is called sign extension.
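On a bit string, sign extension is a one-line operation (illustrative sketch):

```python
def sign_extend(bits, new_n):
    """Widen a 2's-complement bit string to new_n bits by replicating
    the leftmost (sign) bit."""
    return bits[0] * (new_n - len(bits)) + bits
```

Here `sign_extend('1101', 8)` gives `'11111101'`, which still represents −3, while `sign_extend('0101', 8)` gives `'00000101'` (+5).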

Overflow in Integer Arithmetic

Using 2’s-complement representation, n bits can represent values in the range −2^(n-1) to +2^(n-1) − 1. For example, the range of numbers that can be represented by 4 bits is −8 through +7, as shown in Figure 1.3. When the actual result of an arithmetic operation is outside the representable range, an arithmetic overflow has occurred.

When adding unsigned numbers, a carry-out of 1 from the most significant bit position indicates that an overflow has occurred. However, this is not always true when adding signed numbers. For example, using 2’s-complement representation for 4-bit signed numbers, if we add +7 and +4, the sum vector is 1011, which is the representation for −5, an incorrect result. In this case, the carry-out bit from the MSB position is 0. If we add −4 and −6, we get 0110 = +6, also an incorrect result. In this case, the carry-out bit is 1. Hence, the value of the carry-out bit from the sign-bit position is not an indicator of overflow. Clearly, overflow may occur only if both summands have the same sign. The addition of numbers with different signs cannot cause overflow because the result is always within the representable range.

These observations lead to the following way to detect overflow when adding two numbers in 2’s-complement representation. Examine the signs of the two summands and the sign of the result. When both summands have the same sign, an overflow has occurred when the sign of the sum is not the same as the signs of the summands.
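This sign test can be sketched as follows (illustrative; the function and variable names are our own):

```python
def add_detect_overflow(a, b, n=4):
    """Add two n-bit 2's-complement patterns (as unsigned ints) and
    flag overflow: the summands share a sign that the sum does not."""
    s = (a + b) % 2**n
    sign = lambda v: (v >> (n - 1)) & 1     # MSB is the sign bit
    return s, sign(a) == sign(b) and sign(s) != sign(a)
```

For +7 and +4 it returns `(0b1011, True)`; for −4 and −6, `(0b0110, True)`; for +7 and −3, `(0b0100, False)`, matching the cases discussed above.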

When subtracting two numbers, the testing method needed for detecting overflow has to be modified somewhat; but it is still quite straightforward. See Problem 1.10.


1.4.2 Floating-Point Numbers

Until now we have only considered integers, which have an implied binary point at the right end of the number, just after bit b0. If we use a full word in a 32-bit word length computer to represent a signed integer in 2’s-complement representation, the range of values that can be represented is −2^31 to +2^31 − 1. In decimal terms, this range is somewhat smaller than −10^10 to +10^10.

The same 32-bit patterns can also be interpreted as fractions in the range −1 to +1 − 2^−31 if we assume that the implied binary point is just to the right of the sign bit; that is, between bit b31 and bit b30 at the left end of the 32-bit representation. In this case, the magnitude of the smallest fraction representable is approximately 10^−10.

Neither of these two fixed-point number representations has a range that is sufficient for many scientific and engineering calculations. For convenience, we would like to have a binary number representation that can easily accommodate both very large integers and very small fractions. To do this, a computer must be able to represent numbers and operate on them in such a way that the position of the binary point is variable and is automatically adjusted as computation proceeds. In this case, the binary point is said to float, and the numbers are called floating-point numbers.

Since the position of the binary point in a floating-point number varies, it must be indicated explicitly in the representation. For example, in the familiar decimal scientific notation, numbers may be written as 6.0247 × 10^23, 3.7291 × 10^−27, −1.0341 × 10^2, −7.3000 × 10^−14, and so on. We say that these numbers have been given to 5 significant digits of precision. The scale factors 10^23, 10^−27, 10^2, and 10^−14 indicate the actual position of the decimal point with respect to the significant digits. The same approach can be used to represent binary floating-point numbers in a computer, except that it is more appropriate to use 2 as the base of the scale factor. Because the base is fixed, it does not need to be given in the representation. The exponent may be positive or negative.

We conclude that a binary floating-point number can be represented by:

• a sign for the number
• some significant bits
• a signed scale factor exponent for an implied base of 2

An established international IEEE (Institute of Electrical and Electronics Engineers) standard for 32-bit floating-point number representation uses a sign bit, 23 significant bits, and 8 bits for a signed exponent of the scale factor, which has an implied base of 2. In decimal terms, the range of numbers represented is roughly ±10^−38 to ±10^38, which is adequate for most scientific and engineering calculations. The same IEEE standard also defines a 64-bit representation to accommodate more significant bits and more bits for the signed exponent, resulting in much higher precision and a much larger range of values.
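The three fields of the 32-bit format can be inspected with Python's standard struct module. This sketch only extracts the fields; the full interpretation of the exponent bias and significant bits is the subject of Chapter 9:

```python
import struct

def float32_fields(x):
    """Pack x into IEEE 32-bit format and return its sign bit,
    8-bit exponent field, and 23-bit significant-bits field."""
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF    # scale-factor exponent, stored in biased form
    fraction = bits & 0x7FFFFF        # 23 significant bits
    return sign, exponent, fraction
```

For example, `float32_fields(1.0)` returns `(0, 127, 0)`: the signed exponent is stored with a bias added to it.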

Floating-point number representation and arithmetic operations on floating-point numbers are considered in detail in Chapter 9. Some of the commercial processors described in Appendices B to E include operations on floating-point numbers in their instruction sets and have processor registers dedicated to holding floating-point numbers.


1.5 Character Representation

The most common encoding scheme for characters is ASCII (American Standard Code for Information Interchange). Alphanumeric characters, operators, punctuation symbols, and control characters are represented by 7-bit codes as shown in Table 1.1. It is convenient to use an 8-bit byte to represent and store a character. The code occupies the low-order seven bits. The high-order bit is usually set to 0. Note that the codes for the alphabetic and numeric characters are in increasing sequential order when interpreted as unsigned binary numbers. This facilitates sorting operations on alphabetic and numeric data.

The low-order four bits of the ASCII codes for the decimal digits 0 to 9 are the first ten values of the binary number system. This 4-bit encoding is referred to as the binary-coded decimal (BCD) code.
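Both properties, sequential codes and the BCD low-order bits, are easy to verify (illustrative sketch; `bcd_digit` is our own name):

```python
def bcd_digit(ch):
    """Low-order four bits of an ASCII decimal digit: its BCD code."""
    return ord(ch) & 0x0F

# Codes increase sequentially, so unsigned comparison sorts characters.
assert ord('0') < ord('1') < ord('9')
assert ord('A') < ord('B') < ord('Z')
```

For example, `bcd_digit('7')` returns `7`, since the digit 7 is encoded as 011 0111 and the low-order four bits 0111 are its binary value.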

1.6 Performance

The most important measure of the performance of a computer is how quickly it can execute programs. The speed with which a computer executes programs is affected by the design of its instruction set, its hardware and its software, including the operating system, and the technology in which the hardware is implemented. Because programs are usually written in a high-level language, performance is also affected by the compiler that translates programs into machine language. We do not describe the details of compilers or operating systems in this book. However, Chapter 4 provides an overview of software, including a discussion of the role of compilers and operating systems. This book concentrates on the design of instruction sets, along with memory, processor, and I/O hardware, and the organization of both small and large computers. Section 1.2.2 describes how caches can improve memory performance. Some performance aspects of instruction sets are discussed in Chapter 2. In this section, we give an overview of how performance is affected by technology, as well as processor and system organization.

1.6.1 Technology

The technology of Very Large Scale Integration (VLSI) that is used to fabricate the electronic circuits for a processor on a single chip is a critical factor in the speed of execution of machine instructions. The speed of switching between the 0 and 1 states in logic circuits is largely determined by the size of the transistors that implement the circuits. Smaller transistors switch faster. Advances in fabrication technology over several decades have reduced transistor sizes dramatically. This has two advantages: instructions can be executed faster, and more transistors can be placed on a chip, leading to more logic functionality and more memory storage capacity.


Table 1.1 The 7-bit ASCII code.

Bit positions 3210 (rows); bit positions 654 (columns)

3210 000 001 010 011 100 101 110 111

0000 NUL DLE SPACE 0 @ P ` p

0001 SOH DC1 ! 1 A Q a q

0010 STX DC2 ” 2 B R b r

0011 ETX DC3 # 3 C S c s

0100 EOT DC4 $ 4 D T d t

0101 ENQ NAK % 5 E U e u

0110 ACK SYN & 6 F V f v

0111 BEL ETB ’ 7 G W g w

1000 BS CAN ( 8 H X h x

1001 HT EM ) 9 I Y i y

1010 LF SUB * : J Z j z

1011 VT ESC + ; K [ k {

1100 FF FS , < L \ l |

1101 CR GS - = M ] m }

1110 SO RS . > N ^ n ~

1111 SI US / ? O _ o DEL

NUL Null/Idle SI Shift in

SOH Start of header DLE Data link escape

STX Start of text DC1-DC4 Device control

ETX End of text NAK Negative acknowledgment

EOT End of transmission SYN Synchronous idle

ENQ Enquiry ETB End of transmitted block

ACK Acknowledgment CAN Cancel (error in data)

BEL Audible signal EM End of medium

BS Back space SUB Special sequence

HT Horizontal tab ESC Escape

LF Line feed FS File separator

VT Vertical tab GS Group separator

FF Form feed RS Record separator

CR Carriage return US Unit separator

SO Shift out DEL Delete/Idle

Bit positions of code format = 6 5 4 3 2 1 0


1.6.2 Parallelism

Performance can be increased by performing a number of operations in parallel. Parallelism can be implemented on many different levels.

Instruction-level Parallelism

The simplest way to execute a sequence of instructions in a processor is to complete all steps of the current instruction before starting the steps of the next instruction. If we overlap the execution of the steps of successive instructions, total execution time will be reduced. For example, the next instruction could be fetched from memory at the same time that an arithmetic operation is being performed on the register operands of the current instruction. This form of parallelism is called pipelining. It is discussed in detail in Chapter 6.

Multicore Processors

Multiple processing units can be fabricated on a single chip. In technical literature, the term core is used for each of these processors. The term processor is then used for the complete chip. Hence, we have the terminology dual-core, quad-core, and octo-core processors for chips that have two, four, and eight cores, respectively.

Multiprocessors

Computer systems may contain many processors, each possibly containing multiple cores. Such systems are called multiprocessors. These systems either execute a number of different application tasks in parallel, or they execute subtasks of a single large task in parallel. All processors usually have access to all of the memory in such systems, and the term shared-memory multiprocessor is often used to make this clear. The high performance of these systems comes with much higher complexity and cost, arising from the use of multiple processors and memory units, along with more complex interconnection networks.

In contrast to multiprocessor systems, it is also possible to use an interconnected group of complete computers to achieve high total computational power. The computers normally have access only to their own memory units. When the tasks they are executing need to share data, they do so by exchanging messages over a communication network. This property distinguishes them from shared-memory multiprocessors, leading to the name message-passing multicomputers.

Multiprocessors and multicomputers are described in Chapter 12.

1.7 Historical Perspective

Electronic digital computers as we know them today have been developed since the 1940s. A long, slow evolution of mechanical calculating devices preceded the development of electronic computers. Here, we briefly sketch the history of computer development. A more extensive coverage can be found in Hayes [1].


In the 300 years before the mid-1900s, a series of increasingly complex mechanical devices, constructed from gear wheels, levers, and pulleys, were used to perform the basic operations of addition, subtraction, multiplication, and division. Holes on punched cards were mechanically sensed and used to control the automatic sequencing of a list of calculations, which essentially provided a programming capability. These devices enabled the computation of complete mathematical tables of logarithms and trigonometric functions as approximated by polynomials. Output results were punched on cards or printed on paper. Electromechanical relay devices, such as those used in early telephone switching systems, provided the means for performing logic functions in computers built in the late 1930s and early 1940s.

During World War II, the first electronic computer was designed and built at the University of Pennsylvania, using the vacuum tube technology developed for radios and military radar equipment. Vacuum tube circuits were used to perform logic operations and to store data. This technology initiated the modern era of electronic digital computers.

Development of the technologies used to fabricate processors, memories, and I/O units of computers has been divided into four generations: the first generation, 1945 to 1955; the second generation, 1955 to 1965; the third generation, 1965 to 1975; and the fourth generation, 1975 to the present.

1.7.1 The First Generation

The key concept of a stored program was introduced at the same time as the development of the first electronic digital computer. Programs and their data were located in the same memory, as they are today. This facilitates changing existing programs and data or preparing and loading new programs and data. Assembly language was used to prepare programs and was translated into machine language for execution.

Basic arithmetic operations were performed in a few milliseconds, using vacuum tube technology to implement logic functions. This provided a 100- to 1000-fold increase in speed relative to earlier mechanical and electromechanical technology. Mercury delay-line memory was used at first. I/O functions were performed by devices similar to typewriters. Magnetic core memories and magnetic tape storage devices were also developed.

1.7.2 The Second Generation

The transistor was invented at AT&T Bell Laboratories in the late 1940s and quickly replaced the vacuum tube in implementing logic functions. This fundamental technology shift marked the start of the second generation. Magnetic core memories and magnetic drum storage devices were widely used in the second generation. Magnetic disk storage devices were developed in this generation. The earliest high-level languages, such as Fortran, were developed, making the preparation of application programs much easier. Compilers were developed to translate these high-level language programs into assembly language, which was then translated into executable machine-language form. IBM became a major computer manufacturer during this time.


1.7.3 The Third Generation

Texas Instruments and Fairchild Semiconductor developed the ability to fabricate many transistors on a single silicon chip, called integrated-circuit technology. This enabled faster and less costly processors and memory elements to be built. Integrated-circuit memories began to replace magnetic core memories. This technological development marked the beginning of the third generation. Other developments included the introduction of microprogramming, parallelism, and pipelining. Operating system software allowed efficient sharing of a computer system by several user programs. Cache and virtual memories were developed. Cache memory makes the main memory appear faster than it really is, and virtual memory makes it appear larger. System 360 mainframe computers from IBM and the line of PDP minicomputers from Digital Equipment Corporation were dominant commercial products of the third generation.

1.7.4 The Fourth Generation

By the early 1970s, integrated-circuit fabrication techniques had evolved to the point where complete processors and large sections of the main memory of small computers could be implemented on single chips. This marked the start of the fourth generation. Tens of thousands of transistors could be placed on a single chip, and the name Very Large Scale Integration (VLSI) was coined to describe this technology. A complete processor fabricated on a single chip became known as a microprocessor. Companies such as Intel, National Semiconductor, Motorola, Texas Instruments, and Advanced Micro Devices have been the driving forces of this technology. Current VLSI technology enables the integration of multiple processors (cores) and cache memories on a single chip.

A particular form of VLSI technology, called Field Programmable Gate Arrays (FPGAs), has allowed system developers to design and implement processor, memory, and I/O circuits on a single chip to meet the requirements of specific applications, especially in embedded computer systems. Sophisticated computer-aided-design tools make it possible to develop FPGA-based products quickly. Companies such as Altera and Xilinx provide this technology, along with the required software development systems.

Embedded computer systems, portable notebook computers, and versatile mobile telephone handsets are now in widespread use. Desktop personal computers and workstations interconnected by wired or wireless local area networks and the Internet, with access to database servers and search engines, provide a variety of powerful computing platforms.

Organizational concepts such as parallelism and hierarchical memories have evolved to produce the high-performance computing systems of today as the fourth generation has matured. Supercomputers and Grid computers, at the upper end of high-performance computing, are used for weather forecasting, scientific and engineering computations, and simulations.


1.8 Concluding Remarks

This chapter has introduced basic concepts about the structure of computers and their operation. Machine instructions and programs have been described briefly. The addition and subtraction of binary numbers has been explained. Much of the terminology needed to deal with these subjects has been defined. Subsequent chapters provide detailed explanations of these terms and concepts, with an emphasis on architecture and hardware.

1.9 Solved Problems

This section presents some examples of the types of problems that a student may be asked to solve, and shows how such problems can be solved.

Example 1.1 Problem: List the steps needed to execute the machine instruction

Load R2, LOC

in terms of transfers between the components shown in Figure 1.2 and some simple control commands. An overview of the steps needed is given in Section 1.3. Assume that the address of the memory location containing this instruction is initially in register PC.

Solution: The required steps are:

• Send the address of the instruction word from register PC to the memory and issue a Read control command.

• Wait until the requested word has been retrieved from the memory, then load it into register IR, where it is interpreted (decoded) by the control circuitry to determine the operation to be performed.

• Increment the contents of register PC to point to the next instruction in memory.

• Send the address value LOC from the instruction in register IR to the memory and issue a Read control command.

• Wait until the requested word has been retrieved from the memory, then load it into register R2.

Example 1.2 Problem: Quantify the effect on performance that results from the use of a cache in the case of a program that has a total of 500 instructions, including a 100-instruction loop that is executed 25 times. Determine the ratio of execution time without the cache to execution time with the cache. This ratio is called the speedup.


Assume that main memory accesses require 10 units of time and cache accesses require 1 unit of time. We also make the following further assumptions so that we can simplify calculations in order to easily illustrate the advantage of using a cache:

• Program execution time is proportional to the total amount of time needed to fetch instructions from either the main memory or the cache, with operand data accesses being ignored.

• Initially, all instructions are stored in the main memory, and the cache is empty.
• The cache is large enough to contain all of the loop instructions.

Solution: Execution time without the cache is

T = 400 × 10 + 100 × 10 × 25 = 29,000

Execution time with the cache is

Tcache = 500 × 10 + 100 × 1 × 24 = 7,400

Therefore, the speedup is

T/Tcache = 3.92
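The calculation generalizes directly. The following is a sketch under the example's assumptions (the function and parameter names are our own): every instruction is fetched from the main memory once, and the remaining (iters − 1) loop passes hit the cache.

```python
def cache_speedup(total, loop, iters, t_mem=10, t_cache=1):
    """Ratio of execution time without a cache to time with one."""
    t_without = (total - loop) * t_mem + loop * t_mem * iters
    t_with = total * t_mem + loop * t_cache * (iters - 1)
    return t_without / t_with
```

Here `cache_speedup(500, 100, 25)` returns 29,000/7,400 ≈ 3.92, reproducing the result above.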

Example 1.3 Problem: Convert the following pairs of decimal numbers to 5-bit 2’s-complement numbers, then perform addition and subtraction on each pair. Indicate whether or not overflow occurs for each case.

(a) 7 and 13

(b) −12 and 9

Solution: The conversion and operations are:

(a) 7 = 00111 and 13 = 01101

Adding these two positive numbers, we obtain 10100, which is a negative number. Therefore, overflow has occurred.

To subtract them, we first form the 2’s-complement of 01101, which is 10011. Then we perform addition with 00111 to obtain 11010, which is −6, the correct answer.

(b) −12 = 10100 and 9 = 01001

Adding these two numbers, we obtain 11101 = −3, the correct answer.

To subtract them, we first form the 2’s-complement of 01001, which is 10111. Then we perform addition of the two negative numbers 10100 and 10111 to obtain 01011, which is a positive number. Therefore, overflow has occurred.
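All four results can be checked with ordinary mod-32 arithmetic (illustrative sketch; `signed` is our own helper):

```python
N = 5  # bits

def signed(v, n=N):
    """Reinterpret an n-bit pattern (as an unsigned int) as 2's-complement."""
    return v - 2**n if v >= 2**(n - 1) else v

# (a) 7 + 13: the true sum +20 is outside -16..+15, so overflow occurs
assert signed((7 + 13) % 2**N) == -12
assert signed((7 - 13) % 2**N) == -6      # 11010: the correct difference
# (b) -12 + 9 is in range; -12 - 9 = -21 overflows
assert signed((-12 + 9) % 2**N) == -3
assert signed((-12 - 9) % 2**N) == 11     # 01011: positive, so overflow
```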


Problems

1.1 [E] Repeat Example 1.1 for the machine instruction

Add R4, R2, R3

which is discussed in Section 1.3.

1.2 [E] Repeat Example 1.1 for the machine instruction

Store R4, LOC

which is discussed in Section 1.3.

1.3 [M] (a) Give a short sequence of machine instructions for the task “Add the contents of memory location A to those of location B, and place the answer in location C”. Instructions

Load Ri, LOC

and

Store Ri, LOC

are the only instructions available to transfer data between the memory and the general-purpose registers. Add instructions are described in Section 1.3. Do not change the contents of either location A or B.

(b) Suppose that Move and Add instructions are available with the formats

Move Location1, Location2

and

Add Location1, Location2

These instructions move or add a copy of the operand at the second location to the first location, overwriting the original operand at the first location. Either or both of the operands can be in the memory or the general-purpose registers. Is it possible to use fewer instructions of these types to accomplish the task in part (a)? If yes, give the sequence.

1.4 [M] (a) A program consisting of a total of 300 instructions contains a 50-instruction loop that is executed 15 times. The processor contains a cache, as described in Section 1.2.2. Fetching and executing an instruction that is in the main memory requires 20 time units. If the instruction is found in the cache, fetching and executing it requires only 2 time units. Ignoring operand data accesses, calculate the ratio of program execution time without the cache to execution time with the cache. This ratio is called the speedup due to the use of the cache. Assume that the cache is initially empty, that it is large enough to hold the loop, and that the program starts with all instructions in the main memory.

(b) Generalize part (a) by replacing the constants 300, 50, 15, 20, and 2 with the variables w, x, y, m, and c. Develop an expression for speedup.

(c) For the values w = 300, x = 50, m = 20, and c = 2, what value of y results in a speedup of 5?


(d) Consider the form of the expression for speedup developed in part (b). What is the upper limit on speedup as the number of loop iterations, y, becomes larger and larger?

1.5 [M] (a) A processor cache is discussed in Section 1.2.2. Suppose that execution time for a program is proportional to instruction fetch time. Assume that fetching an instruction from the cache takes 1 time unit, but fetching it from the main memory takes 10 time units. Also, assume that a requested instruction is found in the cache with probability 0.96. Finally, assume that if an instruction is not found in the cache it must first be fetched from the main memory into the cache and then fetched from the cache to be executed. Compute the ratio of program execution time without the cache to program execution time with the cache. This ratio is called the speedup resulting from the presence of the cache.

(b) If the size of the cache is doubled, assume that the probability of not finding a requested instruction there is cut in half. Repeat part (a) for a doubled cache size.

1.6 [E] Extend Figure 1.4 to incorporate both possibilities for a carry-in (0 or 1) to each of the four cases shown in the figure. Specify both the sum and carry-out bits for each of the eight new cases.

1.7 [M] Convert the following pairs of decimal numbers to 5-bit 2's-complement numbers, then add them. State whether or not overflow occurs in each case.

(a) 4 and 11

(b) 6 and 14

(c) −13 and 12

(d) −4 and 8

(e) −2 and −9

(f) −9 and −14

1.8 [M] Repeat Problem 1.7 for the subtract operation, where the second number of each pair is to be subtracted from the first number. State whether or not overflow occurs in each case.

1.9 [E] A memory byte location contains the pattern 01010011. What decimal value does this pattern represent when interpreted as a binary number? What does it represent as an ASCII code?

1.10 [E] A way to detect overflow when adding two 2's-complement numbers is given at the end of Section 1.4.1. State how to detect overflow when subtracting two such numbers.



Chapter 2

Instruction Set Architecture

Chapter Objectives

In this chapter you will learn about:

• Machine instructions and program execution

• Addressing methods for accessing register and memory operands

• Assembly language for representing machine instructions, data, and programs

• Stacks and subroutines


This chapter considers the way programs are executed in a computer from the machine instruction set viewpoint. Chapter 1 introduced the general concept that both program instructions and data operands are stored in the memory. In this chapter, we discuss how instructions are composed and study the ways in which sequences of instructions are brought from the memory into the processor and executed to perform a given task. The addressing methods that are commonly used for accessing operands in memory locations and processor registers are also presented.

The emphasis here is on basic concepts. We use a generic style to describe machine instructions and operand addressing methods that are typical of those found in commercial processors. A sufficient number of instructions and addressing methods are introduced to enable us to present complete, realistic programs for simple tasks. These generic programs are specified at the assembly-language level, where machine instructions and operand addressing information are represented by symbolic names. A complete instruction set, including operand addressing methods, is often referred to as the instruction set architecture (ISA) of a processor. For the discussion of basic concepts in this chapter, it is not necessary to define a complete instruction set, and we will not attempt to do so. Instead, we will present enough examples to illustrate the capabilities of a typical instruction set.

The concepts introduced in this chapter and in Chapter 3, which deals with input/output techniques, are essential for understanding the functionality of computers. Our choice of the generic style of presentation makes the material easy to read and understand. Also, this style allows a general discussion that is not constrained by the characteristics of a particular processor.

Since it is interesting and important to see how the concepts discussed are implemented in a real computer, we supplement our presentation in Chapters 2 and 3 with four examples of popular commercial processors. These processors are presented in Appendices B to E. Appendix B deals with the Nios II processor from Altera Corporation. Appendix C presents the ColdFire processor from Freescale Semiconductor, Inc. Appendix D discusses the ARM processor from ARM Ltd. Appendix E presents the basic architecture of processors made by Intel Corporation. The generic programs in Chapters 2 and 3 are presented in terms of the specific instruction sets in each of the appendices.

The reader can choose only one processor and study the material in the corresponding appendix to get an appreciation for commercial ISA design. However, knowledge of the material in these appendices is not essential for understanding the material in the main body of the book.

The vast majority of programs are written in high-level languages such as C, C++, or Java. To execute a high-level language program on a processor, the program must be translated into the machine language for that processor, which is done by a compiler program. Assembly language is a readable symbolic representation of machine language. In this book we make extensive use of assembly language, because this is the best way to describe how computers work.

We will begin the discussion in this chapter by considering how instructions and data are stored in the memory and how they are accessed for processing.

2.1 Memory Locations and Addresses

We will first consider how the memory of a computer is organized. The memory consists of many millions of storage cells, each of which can store a bit of information having the value 0 or 1. Because a single bit represents a very small amount of information, bits are seldom handled individually. The usual approach is to deal with them in groups of fixed size. For


this purpose, the memory is organized so that a group of n bits can be stored or retrieved in a single, basic operation. Each group of n bits is referred to as a word of information, and n is called the word length. The memory of a computer can be schematically represented as a collection of words, as shown in Figure 2.1.

Modern computers have word lengths that typically range from 16 to 64 bits. If the word length of a computer is 32 bits, a single word can store a 32-bit signed number or four ASCII-encoded characters, each occupying 8 bits, as shown in Figure 2.2. A unit of 8 bits is called a byte. Machine instructions may require one or more words for their representation. We will discuss how machine instructions are encoded into memory words in a later section, after we have described instructions at the assembly-language level.

Accessing the memory to store or retrieve a single item of information, either a word or a byte, requires distinct names or addresses for each location. It is customary to use numbers from 0 to 2^k − 1, for some suitable value of k, as the addresses of successive locations in the memory. Thus, the memory can have up to 2^k addressable locations. The 2^k addresses constitute the address space of the computer. For example, a 24-bit address generates an address space of 2^24 (16,777,216) locations. This number is usually written as 16M (16 mega), where 1M is the number 2^20 (1,048,576). A 32-bit address creates an address space of 2^32 or 4G (4 giga) locations, where 1G is 2^30. Other notational conventions

[Figure 2.1: Memory words. The memory is shown as a collection of n-bit words: first word, second word, . . . , i-th word, . . . , last word.]


[Figure 2.2: Examples of encoded information in a 32-bit word. (a) A signed integer occupying bits b31, b30, . . . , b1, b0, where sign bit b31 = 0 for positive numbers and b31 = 1 for negative numbers. (b) Four characters, each an 8-bit ASCII code.]

that are commonly used are K (kilo) for the number 2^10 (1,024), and T (tera) for the number 2^40.
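The sizes quoted above can be computed directly. The sketch below (names and formatting are ours) applies the unit conventions just described:

```python
# Address-space size for a k-bit address, reported with the largest
# applicable unit: K = 2^10, M = 2^20, G = 2^30, T = 2^40.
UNITS = {40: 'T', 30: 'G', 20: 'M', 10: 'K'}

def address_space(k):
    """Return 2**k locations, formatted with the largest applicable unit."""
    n = 2 ** k
    for power, suffix in UNITS.items():   # checked largest-first
        if k >= power:
            return f"{n >> power}{suffix}"
    return str(n)

print(address_space(24))  # 16M (16,777,216 locations)
print(address_space(32))  # 4G  (4,294,967,296 locations)
```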

2.1.1 Byte Addressability

We now have three basic information quantities to deal with: bit, byte, and word. A byte is always 8 bits, but the word length typically ranges from 16 to 64 bits. It is impractical to assign distinct addresses to individual bit locations in the memory. The most practical assignment is to have successive addresses refer to successive byte locations in the memory. This is the assignment used in most modern computers. The term byte-addressable memory is used for this assignment. Byte locations have addresses 0, 1, 2, . . . . Thus, if the word length of the machine is 32 bits, successive words are located at addresses 0, 4, 8, . . . , with each word consisting of four bytes.
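The relationship between word and byte addresses can be made concrete with a small sketch (the function name is ours; a 32-bit word length is assumed):

```python
# In a byte-addressable memory with 4-byte (32-bit) words, the word
# starting at byte address w occupies byte addresses w, w+1, w+2, w+3,
# and successive words begin at byte addresses 0, 4, 8, ...
WORD_BYTES = 4

def word_byte_addresses(word_address):
    """Byte addresses occupied by the word starting at word_address."""
    return [word_address + i for i in range(WORD_BYTES)]

print(word_byte_addresses(0))  # [0, 1, 2, 3]
print(word_byte_addresses(8))  # [8, 9, 10, 11]
```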

2.1.2 Big-Endian and Little-Endian Assignments

There are two ways that byte addresses can be assigned across words, as shown in Figure 2.3. The name big-endian is used when lower byte addresses are used for the more significant bytes (the leftmost bytes) of the word. The name little-endian is used for the opposite ordering, where the lower byte addresses are used for the less significant bytes (the rightmost bytes) of the word. The words "more significant" and "less significant" are used in relation to the weights (powers of 2) assigned to bits when the word represents a number. Both little-endian and big-endian assignments are used in commercial machines. In both cases, byte addresses 0, 4, 8, . . . , are taken as the addresses of successive words in the memory


[Figure 2.3: Byte and word addressing. Word addresses are 0, 4, . . . , 2^k − 4. (a) Big-endian assignment: the word at address 0 contains bytes 0, 1, 2, 3 and the word at address 4 contains bytes 4, 5, 6, 7, ordered left to right; the last word, at address 2^k − 4, contains bytes 2^k − 4, 2^k − 3, 2^k − 2, 2^k − 1. (b) Little-endian assignment: the word at address 0 contains bytes 3, 2, 1, 0 and the word at address 4 contains bytes 7, 6, 5, 4, ordered left to right.]

of a computer with a 32-bit word length. These are the addresses used when accessing the memory to store or retrieve a word.
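The two assignments can be demonstrated with Python's struct module. The sketch below shows how a hypothetical 32-bit value, 0x12345678, would be laid out at byte addresses 0 through 3 under each assignment:

```python
import struct

value = 0x12345678
big = struct.pack('>I', value)     # big-endian: MSB at the lowest address
little = struct.pack('<I', value)  # little-endian: LSB at the lowest address

# Index into each bytes object corresponds to the byte address offset.
print([hex(b) for b in big])     # ['0x12', '0x34', '0x56', '0x78']
print([hex(b) for b in little])  # ['0x78', '0x56', '0x34', '0x12']
```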

In addition to specifying the address ordering of bytes within a word, it is also necessary to specify the labeling of bits within a byte or a word. The most common convention, and the one we will use in this book, is shown in Figure 2.2a. It is the most natural ordering for the encoding of numerical data. The same ordering is also used for labeling bits within a byte, that is, b7, b6, . . . , b0, from left to right.

2.1.3 Word Alignment

In the case of a 32-bit word length, natural word boundaries occur at addresses 0, 4, 8, . . . , as shown in Figure 2.3. We say that the word locations have aligned addresses if they begin at a byte address that is a multiple of the number of bytes in a word. For practical reasons associated with manipulating binary-coded addresses, the number of bytes in a word is a power of 2. Hence, if the word length is 16 (2 bytes), aligned words begin at byte addresses 0, 2, 4, . . . , and for a word length of 64 (2^3 bytes), aligned words begin at byte addresses 0, 8, 16, . . . .

There is no fundamental reason why words cannot begin at an arbitrary byte address. In that case, words are said to have unaligned addresses. But, the most common case is to use aligned addresses, which makes accessing of memory operands more efficient, as we will see in Chapter 8.
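The alignment rule reduces to a single divisibility test, sketched here (the function name is ours):

```python
# A byte address is aligned for a given word length if it is a multiple
# of the word size in bytes (which is always a power of 2).
def is_aligned(byte_address, word_bytes):
    return byte_address % word_bytes == 0

print([a for a in range(17) if is_aligned(a, 4)])  # [0, 4, 8, 12, 16]
print(is_aligned(6, 4))  # False: unaligned for a 32-bit word length
print(is_aligned(6, 2))  # True: aligned for a 16-bit word length
```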


2.1.4 Accessing Numbers and Characters

A number usually occupies one word, and can be accessed in the memory by specifying its word address. Similarly, individual characters can be accessed by their byte address.

For programming convenience it is useful to have different ways of specifying addresses in program instructions. We will deal with this issue in Section 2.4.

2.2 Memory Operations

Both program instructions and data operands are stored in the memory. To execute an instruction, the processor control circuits must cause the word (or words) containing the instruction to be transferred from the memory to the processor. Operands and results must also be moved between the memory and the processor. Thus, two basic operations involving the memory are needed, namely, Read and Write.

The Read operation transfers a copy of the contents of a specific memory location to the processor. The memory contents remain unchanged. To start a Read operation, the processor sends the address of the desired location to the memory and requests that its contents be read. The memory reads the data stored at that address and sends them to the processor.

The Write operation transfers an item of information from the processor to a specific memory location, overwriting the former contents of that location. To initiate a Write operation, the processor sends the address of the desired location to the memory, together with the data to be written into that location. The memory then uses the address and data to perform the write.
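The two operations can be modeled with a toy memory class. This is a sketch of the logical behavior only, not the hardware mechanism treated in Chapters 5 and 6:

```python
class Memory:
    """Toy byte-addressable memory with Read and Write operations."""

    def __init__(self, size):
        self.cells = [0] * size

    def read(self, address):
        # Read returns a copy; the memory contents remain unchanged.
        return self.cells[address]

    def write(self, address, data):
        # Write overwrites the former contents of the location.
        self.cells[address] = data

mem = Memory(64)
mem.write(12, 0x5A)
print(mem.read(12))  # 90
print(mem.read(12))  # 90: reading does not change the memory
```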

The details of the hardware implementation of these operations are treated in Chapters 5 and 6. In this chapter, we consider all operations from the viewpoint of the ISA, so we concentrate on the logical handling of instructions and operands.

2.3 Instructions and Instruction Sequencing

The tasks carried out by a computer program consist of a sequence of small steps, such as adding two numbers, testing for a particular condition, reading a character from the keyboard, or sending a character to be displayed on a display screen. A computer must have instructions capable of performing four types of operations:

• Data transfers between the memory and the processor registers
• Arithmetic and logic operations on data
• Program sequencing and control
• I/O transfers

We begin by discussing instructions for the first two types of operations. To facilitate the discussion, we first need some notation.


2.3.1 Register Transfer Notation

We need to describe the transfer of information from one location in a computer to another. Possible locations that may be involved in such transfers are memory locations, processor registers, or registers in the I/O subsystem. Most of the time, we identify such locations symbolically with convenient names. For example, names that represent the addresses of memory locations may be LOC, PLACE, A, or VAR2. Predefined names for the processor registers may be R0 or R5. Registers in the I/O subsystem may be identified by names such as DATAIN or OUTSTATUS. To describe the transfer of information, the contents of any location are denoted by placing square brackets around its name. Thus, the expression

R2 ← [LOC]

means that the contents of memory location LOC are transferred into processor register R2.

As another example, consider the operation that adds the contents of registers R2 and R3, and places their sum into register R4. This action is indicated as

R4 ← [R2] + [R3]

This type of notation is known as Register Transfer Notation (RTN). Note that the right-hand side of an RTN expression always denotes a value, and the left-hand side is the name of a location where the value is to be placed, overwriting the old contents of that location.

In computer jargon, the words "transfer" and "move" are commonly used to mean "copy." Transferring data from a source location A to a destination location B means that the contents of location A are read and then written into location B. In this operation, only the contents of the destination will change. The contents of the source will stay the same.
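RTN transfers can be mimicked with a dictionary of named locations. This is an illustrative model only; the names LOC, R2, R3, and R4, and the initial values, follow the examples above but are otherwise made up:

```python
# Each key names a location; each assignment below is one RTN transfer.
loc = {'LOC': 42, 'R2': 0, 'R3': 5, 'R4': 0}

loc['R2'] = loc['LOC']             # R2 <- [LOC]
loc['R4'] = loc['R2'] + loc['R3']  # R4 <- [R2] + [R3]

# The destination is overwritten; the source keeps its contents.
print(loc['R2'], loc['R4'], loc['LOC'])  # 42 47 42
```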

2.3.2 Assembly-Language Notation

We need another type of notation to represent machine instructions and programs. For this, we use assembly language. For example, a generic instruction that causes the transfer described above, from memory location LOC to processor register R2, is specified by the statement

Load R2, LOC

The contents of LOC are unchanged by the execution of this instruction, but the old contents of register R2 are overwritten. The name Load is appropriate for this instruction, because the contents read from a memory location are loaded into a processor register.

The second example of adding two numbers contained in processor registers R2 and R3 and placing their sum in R4 can be specified by the assembly-language statement

Add R4, R2, R3

In this case, registers R2 and R3 hold the source operands, while R4 is the destination.


An instruction specifies an operation to be performed and the operands involved. In the above examples, we used the English words Load and Add to denote the required operations. In the assembly-language instructions of actual (commercial) processors, such operations are defined by using mnemonics, which are typically abbreviations of the words describing the operations. For example, the operation Load may be written as LD, while the operation Store, which transfers a word from a processor register to the memory, may be written as STR or ST. Assembly languages for different processors often use different mnemonics for a given operation. To avoid the need for details of a particular assembly language at this early stage, we will continue the presentation in this chapter by using English words rather than processor-specific mnemonics.

2.3.3 RISC and CISC Instruction Sets

One of the most important characteristics that distinguish different computers is the nature of their instructions. There are two fundamentally different approaches in the design of instruction sets for modern computers. One popular approach is based on the premise that higher performance can be achieved if each instruction occupies exactly one word in memory, and all operands needed to execute a given arithmetic or logic operation specified by an instruction are already in processor registers. This approach is conducive to an implementation of the processing unit in which the various operations needed to process a sequence of instructions are performed in "pipelined" fashion to overlap activity and reduce total execution time of a program, as we will discuss in Chapter 6. The restriction that each instruction must fit into a single word reduces the complexity and the number of different types of instructions that may be included in the instruction set of a computer. Such computers are called Reduced Instruction Set Computers (RISC).

An alternative to the RISC approach is to make use of more complex instructions which may span more than one word of memory, and which may specify more complicated operations. This approach was prevalent prior to the introduction of the RISC approach in the 1970s. Although the use of complex instructions was not originally identified by any particular label, computers based on this idea have been subsequently called Complex Instruction Set Computers (CISC).

We will start our presentation by concentrating on RISC-style instruction sets because they are simpler and therefore easier to understand. Later we will deal with CISC-style instruction sets and explain the key differences between the two approaches.

2.3.4 Introduction to RISC Instruction Sets

Two key characteristics of RISC instruction sets are:

• Each instruction fits in a single word.
• A load/store architecture is used, in which

  – Memory operands are accessed only using Load and Store instructions.

  – All operands involved in an arithmetic or logic operation must either be in processor registers, or one of the operands may be given explicitly within the instruction word.


At the start of execution of a program, all instructions and data used in the program are stored in the memory of a computer. Processor registers do not contain valid operands at that time. If operands are expected to be in processor registers before they can be used by an instruction, then it is necessary to first bring these operands into the registers. This task is done by Load instructions which copy the contents of a memory location into a processor register. Load instructions are of the form

Load destination, source

or more specifically

Load processor_register, memory_location

The memory location can be specified in several ways. The term addressing modes is used to refer to the different ways in which this may be accomplished, as we will discuss in Section 2.4.

Let us now consider a typical arithmetic operation. The operation of adding two numbers is a fundamental capability in any computer. The statement

C = A + B

in a high-level language program instructs the computer to add the current values of the two variables called A and B, and to assign the sum to a third variable, C. When the program containing this statement is compiled, the three variables, A, B, and C, are assigned to distinct locations in the memory. For simplicity, we will refer to the addresses of these locations as A, B, and C, respectively. The contents of these locations represent the values of the three variables. Hence, the above high-level language statement requires the action

C ← [A] + [B]

to take place in the computer. To carry out this action, the contents of memory locations A and B are fetched from the memory and transferred into the processor where their sum is computed. This result is then sent back to the memory and stored in location C.

The required action can be accomplished by a sequence of simple machine instructions. We choose to use registers R2, R3, and R4 to perform the task with four instructions:

Load R2, A
Load R3, B
Add R4, R2, R3
Store R4, C

We say that Add is a three-operand, or a three-address, instruction of the form

Add destination, source1, source2

The Store instruction is of the form

Store source, destination

where the source is a processor register and the destination is a memory location. Observe that in the Store instruction the source and destination are specified in the reverse order from the Load instruction; this is a commonly used convention.
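The effect of this four-instruction sequence can be traced with toy register and memory state. The operand values here are made up for illustration:

```python
# Toy state: memory locations A, B, C and registers R2, R3, R4.
mem = {'A': 10, 'B': 32, 'C': 0}
reg = {'R2': 0, 'R3': 0, 'R4': 0}

reg['R2'] = mem['A']               # Load  R2, A
reg['R3'] = mem['B']               # Load  R3, B
reg['R4'] = reg['R2'] + reg['R3']  # Add   R4, R2, R3
mem['C'] = reg['R4']               # Store R4, C

print(mem['C'])  # 42
```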


Note that we can accomplish the desired addition by using only two registers, R2 and R3, if one of the source registers is also used as the destination for the result. In this case the addition would be performed as

Add R3, R2, R3

and the last instruction would become

Store R3, C

2.3.5 Instruction Execution and Straight-Line Sequencing

In the preceding subsection, we used the task C = A + B, implemented as C ← [A] + [B], as an example. Figure 2.4 shows a possible program segment for this task as it appears in the memory of a computer. We assume that the word length is 32 bits and the memory is byte-addressable. The four instructions of the program are in successive word locations, starting at location i. Since each instruction is 4 bytes long, the second, third, and fourth instructions are at addresses i + 4, i + 8, and i + 12. For simplicity, we assume that a desired

[Figure 2.4: A program for C ← [A] + [B]. The 4-instruction program segment occupies successive words in the memory, with execution beginning at address i:
i: Load R2, A
i + 4: Load R3, B
i + 8: Add R4, R2, R3
i + 12: Store R4, C
Locations A, B, and C, holding the data for the program, are elsewhere in the memory.]


memory address can be directly specified in Load and Store instructions, although this is not possible if a full 32-bit address is involved. We will resolve this issue later in Section 2.4.

Let us consider how this program is executed. The processor contains a register called the program counter (PC), which holds the address of the next instruction to be executed. To begin executing a program, the address of its first instruction (i in our example) must be placed into the PC. Then, the processor control circuits use the information in the PC to fetch and execute instructions, one at a time, in the order of increasing addresses. This is called straight-line sequencing. During the execution of each instruction, the PC is incremented by 4 to point to the next instruction. Thus, after the Store instruction at location i + 12 is executed, the PC contains the value i + 16, which is the address of the first instruction of the next program segment.

Executing a given instruction is a two-phase procedure. In the first phase, called instruction fetch, the instruction is fetched from the memory location whose address is in the PC. This instruction is placed in the instruction register (IR) in the processor. At the start of the second phase, called instruction execute, the instruction in IR is examined to determine which operation is to be performed. The specified operation is then performed by the processor. This involves a small number of steps such as fetching operands from the memory or from processor registers, performing an arithmetic or logic operation, and storing the result in the destination location. At some point during this two-phase procedure, the contents of the PC are advanced to point to the next instruction. When the execute phase of an instruction is completed, the PC contains the address of the next instruction, and a new instruction fetch phase can begin.
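The two-phase procedure for straight-line sequencing can be sketched as a toy fetch/execute loop. This is our own model: instructions are represented as Python callables rather than encoded 4-byte words, but the PC and IR roles follow the description above:

```python
def run(memory, pc, registers):
    """Fetch/execute loop: memory maps byte addresses to instructions."""
    while pc in memory:
        ir = memory[pc]   # instruction fetch: memory[PC] is placed in IR
        pc += 4           # the PC is advanced to the next instruction
        ir(registers)     # instruction execute: perform the operation in IR

# A 3-instruction straight-line program at addresses 0, 4, 8.
program = {
    0: lambda r: r.update(R2=10),                  # stands in for Load R2, A
    4: lambda r: r.update(R3=32),                  # stands in for Load R3, B
    8: lambda r: r.update(R4=r['R2'] + r['R3']),   # Add R4, R2, R3
}
regs = {'R2': 0, 'R3': 0, 'R4': 0}
run(program, 0, regs)
print(regs['R4'])  # 42
```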

2.3.6 Branching

Consider the task of adding a list of n numbers. The program outlined in Figure 2.5 is a generalization of the program in Figure 2.4. The addresses of the memory locations containing the n numbers are symbolically given as NUM1, NUM2, . . . , NUMn, and separate Load and Add instructions are used to add each number to the contents of register R2. After all the numbers have been added, the result is placed in memory location SUM.

Instead of using a long list of Load and Add instructions, as in Figure 2.5, it is possible to implement a program loop in which the instructions read the next number in the list and add it to the current sum. To add all numbers, the loop has to be executed as many times as there are numbers in the list. Figure 2.6 shows the structure of the desired program. The body of the loop is a straight-line sequence of instructions executed repeatedly. It starts at location LOOP and ends at the instruction Branch_if_[R2]>0. During each pass through this loop, the address of the next list entry is determined, and that entry is loaded into R5 and added to R3. The address of an operand can be specified in various ways, as will be described in Section 2.4. For now, we concentrate on how to create and control a program loop.

Assume that the number of entries in the list, n, is stored in memory location N, as shown. Register R2 is used as a counter to determine the number of times the loop is executed. Hence, the contents of location N are loaded into register R2 at the beginning of the program. Then, within the body of the loop, the instruction

Subtract R2, R2, #1


[Figure 2.5: A program for adding n numbers with separate Load and Add instructions:
i: Load R2, NUM1
i + 4: Load R3, NUM2
i + 8: Add R2, R2, R3
i + 12: Load R3, NUM3
i + 16: Add R2, R2, R3
. . .
i + 8n − 12: Load R3, NUMn
i + 8n − 8: Add R2, R2, R3
i + 8n − 4: Store R2, SUM
The operands NUM1, NUM2, . . . , NUMn and the result location SUM are elsewhere in the memory.]

reduces the contents of R2 by 1 each time through the loop. (We will explain the significance of the number sign '#' in Section 2.4.1.) Execution of the loop is repeated as long as the contents of R2 are greater than zero.

We now introduce branch instructions. This type of instruction loads a new address into the program counter. As a result, the processor fetches and executes the instruction at this new address, called the branch target, instead of the instruction at the location that follows the branch instruction in sequential address order. A conditional branch instruction causes a branch only if a specified condition is satisfied. If the condition is not satisfied, the PC is incremented in the normal way, and the next instruction in sequential address order is fetched and executed.

In the program in Figure 2.6, the instruction

Branch_if_[R2]>0 LOOP


[Figure 2.6: Using a loop to add n numbers:
Load R2, N
Clear R3
LOOP: Determine address of "Next" number, load the "Next" number into R5, and add it to R3
Subtract R2, R2, #1
Branch_if_[R2]>0 LOOP
Store R3, SUM
The program loop runs from LOOP to the Branch instruction. Memory location N holds n, the number of entries in the list; the result location SUM is elsewhere in the memory.]

is a conditional branch instruction that causes a branch to location LOOP if the contents of register R2 are greater than zero. This means that the loop is repeated as long as there are entries in the list that are yet to be added to R3. At the end of the nth pass through the loop, the Subtract instruction produces a value of zero in R2, and, hence, branching does not occur. Instead, the Store instruction is fetched and executed. It moves the final result from R3 into memory location SUM.
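The control flow of this loop can be mirrored in a sketch. Variable names follow the register names in Figure 2.6; the list values are made up, and the "next"-address bookkeeping is simplified to a list index:

```python
NUM = [5, 8, -3, 12]      # the list NUM1 .. NUMn
N = len(NUM)              # memory location N holds n

R2 = N                    # Load R2, N
R3 = 0                    # Clear R3
i = 0                     # stands in for the "next"-address bookkeeping
while True:               # LOOP:
    R5 = NUM[i]           #   load the "next" number into R5
    R3 = R3 + R5          #   and add it to R3
    i += 1
    R2 = R2 - 1           #   Subtract R2, R2, #1
    if not R2 > 0:        #   Branch_if_[R2]>0 LOOP
        break
SUM = R3                  # Store R3, SUM
print(SUM)  # 22
```

After the nth pass, R2 reaches zero, the branch is not taken, and the final sum is stored.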

The capability to test conditions and subsequently choose one of a set of alternative ways to continue computation has many more applications than just loop control. Such a capability is found in the instruction sets of all computers and is fundamental to the programming of most nontrivial tasks.

One way of implementing conditional branch instructions is to compare the contents of two registers and then branch to the target instruction if the comparison meets the specified requirement. For example, the instruction that implements the action

Branch_if_[R4]>[R5] LOOP

may be written in generic assembly language as

Branch_greater_than R4, R5, LOOP

or using an actual mnemonic as

BGT R4, R5, LOOP

It compares the contents of registers R4 and R5, without changing the contents of either register. Then, it causes a branch to LOOP if the contents of R4 are greater than the contents of R5.

A different way of implementing branch instructions uses the concept of condition codes, which we will discuss in Section 2.10.2.

2.3.7 Generating Memory Addresses

Let us return to Figure 2.6. The purpose of the instruction block starting at LOOP is to add successive numbers from the list during each pass through the loop. Hence, the Load instruction in that block must refer to a different address during each pass. How are the addresses specified? The memory operand address cannot be given directly in a single Load instruction in the loop. Otherwise, it would need to be modified on each pass through the loop. As one possibility, suppose that a processor register, Ri, is used to hold the memory address of an operand. If it is initially loaded with the address NUM1 before the loop is entered and is then incremented by 4 on each pass through the loop, it can provide the needed capability.

This situation, and many others like it, give rise to the need for flexible ways to specify the address of an operand. The instruction set of a computer typically provides a number of such methods, called addressing modes. While the details differ from one computer to another, the underlying concepts are the same. We will discuss these in the next section.

2.4 Addressing Modes

We have now seen some simple examples of assembly-language programs. In general, a program operates on data that reside in the computer’s memory. These data can be organized in a variety of ways that reflect the nature of the information and how it is used. Programmers use data structures such as lists and arrays for organizing the data used in computations.

Programs are normally written in a high-level language, which enables the programmer to conveniently describe the operations to be performed on various data structures. When translating a high-level language program into assembly language, the compiler generates appropriate sequences of low-level instructions that implement the desired operations. The


Table 2.1 RISC-type addressing modes.

Name Assembler syntax Addressing function

Immediate           #Value      Operand = Value

Register            Ri          EA = Ri

Absolute            LOC         EA = LOC

Register indirect   (Ri)        EA = [Ri]

Index               X(Ri)       EA = [Ri] + X

Base with index     (Ri,Rj)     EA = [Ri] + [Rj]

EA = effective address
Value = a signed number
X = index value

different ways for specifying the locations of instruction operands are known as addressing modes. In this section we present the basic addressing modes found in RISC-style processors. A summary is provided in Table 2.1, which also includes the assembler syntax we will use for each mode. The assembler syntax defines the way in which instructions and the addressing modes of their operands are specified; it is discussed in Section 2.5.

2.4.1 Implementation of Variables and Constants

Variables are found in almost every computer program. In assembly language, a variable is represented by allocating a register or a memory location to hold its value. This value can be changed as needed using appropriate instructions.

The program in Figure 2.5 uses only two addressing modes to access variables. We access an operand by specifying the name of the register or the address of the memory location where the operand is located. The precise definitions of these two modes are:

Register mode—The operand is the contents of a processor register; the name of the register is given in the instruction.

Absolute mode—The operand is in a memory location; the address of this location is given explicitly in the instruction.

Since in a RISC-style processor an instruction must fit in a single word, the number of bits that can be used to give an absolute address is limited, typically to 16 bits if the word length is 32 bits. To generate a 32-bit address, the 16-bit value is usually extended to 32 bits by replicating bit b15 into bit positions b31 through b16 (as in sign extension). This means that an absolute address can be specified in this manner for only a limited range of the full address space. We will deal with the issue of specifying full 32-bit addresses in Section 2.9. To keep our examples simple, we will assume for now that all addresses of memory locations involved in a program can be specified in 16 bits.
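The extension of a 16-bit address field to 32 bits by replicating bit b15 can be sketched as follows; this is an illustrative helper, assuming two's-complement representation as described above.

```python
# Sketch of extending a 16-bit value to 32 bits by replicating bit b15
# into bit positions b31..b16 (sign extension, assuming two's complement).

def sign_extend_16_to_32(value16):
    value16 &= 0xFFFF                  # keep only the low 16 bits
    if value16 & 0x8000:               # b15 is 1: replicate 1s upward
        return value16 | 0xFFFF0000
    return value16                     # b15 is 0: upper bits stay 0

print(hex(sign_extend_16_to_32(0x1234)))  # 0x1234
print(hex(sign_extend_16_to_32(0x8000)))  # 0xffff8000
```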


The instruction

Add R4, R2, R3

uses the Register mode for all three operands. Registers R2 and R3 hold the two source operands, while R4 is the destination.

The Absolute mode can represent global variables in a program. A declaration such as

Integer NUM1, NUM2, SUM;

in a high-level language program will cause the compiler to allocate a memory location to each of the variables NUM1, NUM2, and SUM. Whenever they are referenced later in the program, the compiler can generate assembly-language instructions that use the Absolute mode to access these variables.

The Absolute mode is used in the instruction

Load R2, NUM1

which loads the value in the memory location NUM1 into register R2.

Constants representing data or addresses are also found in almost every computer program. Such constants can be represented in assembly language using the Immediate addressing mode.

Immediate mode—The operand is given explicitly in the instruction.

For example, the instruction

Add R4, R6, 200_immediate

adds the value 200 to the contents of register R6, and places the result into register R4.

Using a subscript to denote the Immediate mode is not appropriate in assembly languages. A common convention is to use the number sign (#) in front of the value to indicate that this value is to be used as an immediate operand. Hence, we write the instruction above in the form

Add R4, R6, #200

In the addressing modes that follow, the instruction does not give the operand or its address explicitly. Instead, it provides information from which an effective address (EA) can be derived by the processor when the instruction is executed. The effective address is then used to access the operand.
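The effective-address rules of Table 2.1 can be sketched as a small Python function. The register contents and the constants below are illustrative values chosen for this sketch; the Immediate and Register modes are omitted because they do not produce a memory address.

```python
# Sketch evaluating the effective address (EA) for the memory-referencing
# modes of Table 2.1. Register contents, X, and LOC are made-up values.

def effective_address(mode, regs, X=0, Ri="R5", Rj="R6", LOC=0x2000):
    if mode == "Absolute":
        return LOC                        # EA = LOC
    if mode == "Register indirect":
        return regs[Ri]                   # EA = [Ri]
    if mode == "Index":
        return regs[Ri] + X               # EA = [Ri] + X
    if mode == "Base with index":
        return regs[Ri] + regs[Rj]        # EA = [Ri] + [Rj]
    raise ValueError(mode)

regs = {"R5": 1000, "R6": 20}
print(effective_address("Register indirect", regs))   # 1000
print(effective_address("Index", regs, X=20))         # 1020
print(effective_address("Base with index", regs))     # 1020
```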

2.4.2 Indirection and Pointers

The program in Figure 2.6 requires a capability for modifying the address of the memory operand during each pass through the loop. A good way to provide this capability is to use a processor register to hold the address of the operand. The contents of the register are then changed (incremented) during each pass to provide the address of the next number in the list that has to be accessed. The register acts as a pointer to the list, and we say that an item


Load R2, (R5), where register R5 contains the address B of the operand in the main memory.

Figure 2.7 Register indirect addressing.

in the list is accessed indirectly by using the address in the register. The desired capability is provided by the indirect addressing mode.

Indirect mode—The effective address of the operand is the contents of a register that is specified in the instruction.

We denote indirection by placing the name of the register given in the instruction in parentheses as illustrated in Figure 2.7 and Table 2.1.

To execute the Load instruction in Figure 2.7, the processor uses the value B, which is in register R5, as the effective address of the operand. It requests a Read operation to fetch the contents of location B in the memory. The value from the memory is the desired operand, which the processor loads into register R2. Indirect addressing through a memory location is also possible, but it is found only in CISC-style processors.

Indirection and the use of pointers are important and powerful concepts in programming. They permit the same code to be used to operate on different data. For example, register R5 in Figure 2.7 serves as a pointer for the Load instruction to load an operand from the memory into register R2. At one time, R5 may point to location B in memory. Later, the program may change the contents of R5 to point to a different location, in which case the same Load instruction will load the value from that location into R2. Thus, a program segment that includes this Load instruction is conveniently reused with only a change in the pointer value.

Let us now return to the program in Figure 2.6 for adding a list of numbers. Indirect addressing can be used to access successive numbers in the list, resulting in the program shown in Figure 2.8. Register R4 is used as a pointer to the numbers in the list, and the operands are accessed indirectly through R4. The initialization section of the program loads the counter value n from memory location N into R2. Then, it uses the Clear instruction to clear R3 to 0. The next instruction uses the Immediate addressing mode to place the address value NUM1, which is the address of the first number in the list, into R4. Observe that we cannot use the Load instruction to load the desired immediate value, because the Load instruction can operate only on memory source operands. Instead, we use the Move instruction

Move R4, #NUM1


      Load R2, N              Load the size of the list.
      Clear R3                Initialize sum to 0.
      Move R4, #NUM1          Get address of the first number.
LOOP: Load R5, (R4)           Get the next number.
      Add R3, R3, R5          Add this number to sum.
      Add R4, R4, #4          Increment the pointer to the list.
      Subtract R2, R2, #1     Decrement the counter.
      Branch_if_[R2]>0 LOOP   Branch back if not finished.
      Store R3, SUM           Store the final sum.

Figure 2.8 Use of indirect addressing in the program of Figure 2.6.

In many RISC-type processors, one general-purpose register is dedicated to holding a constant value zero. Usually, this is register R0. Its contents cannot be changed by a program instruction. We will assume that R0 is used in this manner in our discussion of RISC-style processors. Then, the above Move instruction can be implemented as

Add R4, R0, #NUM1

It is often the case that Move is provided as a pseudoinstruction for the convenience of programmers, but it is actually implemented using the Add instruction.

The first three instructions in the loop in Figure 2.8 implement the unspecified instruction block starting at LOOP in Figure 2.6. The first time through the loop, the instruction

Load R5, (R4)

fetches the operand at location NUM1 and loads it into R5. The first Add instruction adds this number to the sum in register R3. The second Add instruction adds 4 to the contents of the pointer R4, so that it will contain the address value NUM2 when the Load instruction is executed in the second pass through the loop.
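The pointer-based loop of Figure 2.8 can be simulated in a few lines of Python. The "memory" here is a dict of word addresses, and the list contents and starting address are made-up example data for this sketch.

```python
# Sketch simulating the loop of Figure 2.8 over a small byte-addressable
# "memory" (a dict of word addresses). NUM1 and the data are made up.

NUM1 = 208                       # assumed start address of the list
memory = {NUM1 + 4 * i: v for i, v in enumerate([10, 20, 30])}

r2 = 3                           # Load R2, N      -- counter n
r3 = 0                           # Clear R3        -- running sum
r4 = NUM1                        # Move R4, #NUM1  -- pointer register
while r2 > 0:                    # Branch_if_[R2]>0 LOOP
    r5 = memory[r4]              # Load R5, (R4)   -- indirect access
    r3 += r5                     # Add R3, R3, R5
    r4 += 4                      # Add R4, R4, #4  -- next word address
    r2 -= 1                      # Subtract R2, R2, #1
print(r3)  # 60 (sum of the example list)
```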

As another example of pointers, consider the C-language statement

A = *B;

where B is a pointer variable and the ‘*’ symbol is the operator for indirect accesses. This statement causes the contents of the memory location pointed to by B to be loaded into memory location A. The statement may be compiled into

Load R2, B
Load R3, (R2)
Store R3, A

Indirect addressing through registers is used extensively. The program in Figure 2.8 shows the flexibility it provides.


2.4.3 Indexing and Arrays

The next addressing mode we discuss provides a different kind of flexibility for accessing operands. It is useful in dealing with lists and arrays.

Index mode—The effective address of the operand is generated by adding a constant value to the contents of a register.

For convenience, we will refer to the register used in this mode as the index register. Typically, this is just a general-purpose register. We indicate the Index mode symbolically as

X(Ri)

where X denotes a constant signed integer value contained in the instruction and Ri is the name of the register involved. The effective address of the operand is given by

EA = X + [Ri]

The contents of the register are not changed in the process of generating the effective address.

In an assembly-language program, whenever a constant such as the value X is needed, it may be given either as an explicit number or as a symbolic name representing a numerical value. The way in which a symbolic name is associated with a specific numerical value will be discussed in Section 2.5. When the instruction is translated into machine code, the constant X is given as a part of the instruction and is restricted to fewer bits than the word length of the computer. Since X is a signed integer, it must be sign-extended (see Section 1.4) to the register length before being added to the contents of the register.

Figure 2.9 illustrates two ways of using the Index mode. In Figure 2.9a, the index register, R5, contains the address of a memory location, and the value X defines an offset (also called a displacement) from this address to the location where the operand is found. An alternative use is illustrated in Figure 2.9b. Here, the constant X corresponds to a memory address, and the contents of the index register define the offset to the operand. In either case, the effective address is the sum of two values; one is given explicitly in the instruction, and the other is held in a register.

To see the usefulness of indexed addressing, consider a simple example involving a list of test scores for students taking a given course. Assume that the list of scores, beginning at location LIST, is structured as shown in Figure 2.10. A four-word memory block comprises a record that stores the relevant information for each student. Each record consists of the student’s identification number (ID), followed by the scores the student earned on three tests. There are n students in the class, and the value n is stored in location N immediately in front of the list. The addresses given in the figure for the student IDs and test scores assume that the memory is byte addressable and that the word length is 32 bits.

We should note that the list in Figure 2.10 represents a two-dimensional array having n rows and four columns. Each row contains the entries for one student, and the columns give the IDs and test scores.

Suppose that we wish to compute the sum of all scores obtained on each of the tests and store these three sums in memory locations SUM1, SUM2, and SUM3. A possible


(a) Offset is given as a constant: Load R2, 20(R5), where R5 contains the address 1000 and the offset 20 locates the operand at address 1020.

(b) Offset is in the index register: Load R2, 1000(R5), where R5 contains the offset 20 and the constant 1000 locates the operand at address 1020.

Figure 2.9 Indexed addressing.

program for this task is given in Figure 2.11. In the body of the loop, the program uses the Index addressing mode in the manner depicted in Figure 2.9a to access each of the three scores in a student’s record. Register R2 is used as the index register. Before the loop is entered, R2 is set to point to the ID location of the first student record, which is the address LIST.

On the first pass through the loop, test scores of the first student are added to the running sums held in registers R3, R4, and R5, which are initially cleared to 0. These scores are accessed using the Index addressing modes 4(R2), 8(R2), and 12(R2). The index register R2 is then incremented by 16 to point to the ID location of the second student. Register R6, initialized to contain the value n, is decremented by 1 at the end of each pass through the loop. When the contents of R6 reach 0, all student records have been accessed, and


N:          n
LIST:       Student ID     (Student 1)
LIST + 4:   Test 1
LIST + 8:   Test 2
LIST + 12:  Test 3
LIST + 16:  Student ID     (Student 2)
            Test 1
            Test 2
            Test 3
            .
            .
            .

Figure 2.10 A list of students’ marks.

      Move R2, #LIST          Get the address LIST.
      Clear R3
      Clear R4
      Clear R5
      Load R6, N              Load the value n.
LOOP: Load R7, 4(R2)          Add the mark for the next student's
      Add R3, R3, R7          Test 1 to the partial sum.
      Load R7, 8(R2)          Add the mark for that student's
      Add R4, R4, R7          Test 2 to the partial sum.
      Load R7, 12(R2)         Add the mark for that student's
      Add R5, R5, R7          Test 3 to the partial sum.
      Add R2, R2, #16         Increment the pointer.
      Subtract R6, R6, #1     Decrement the counter.
      Branch_if_[R6]>0 LOOP   Branch back if not finished.
      Store R3, SUM1          Store the total for Test 1.
      Store R4, SUM2          Store the total for Test 2.
      Store R5, SUM3          Store the total for Test 3.

Figure 2.11 Indexed addressing used in accessing test scores in the list in Figure 2.10.


the loop terminates. Until then, the conditional branch instruction transfers control back to the start of the loop to process the next record. The last three instructions transfer the accumulated sums from registers R3, R4, and R5, into memory locations SUM1, SUM2, and SUM3, respectively.

It should be emphasized that the contents of the index register, R2, are not changed when it is used in the Index addressing mode to access the scores. The contents of R2 are changed only by the last Add instruction in the loop, to move from one student record to the next.

In general, the Index mode facilitates access to an operand whose location is defined relative to a reference point within the data structure in which the operand appears. In the example just given, the ID locations of successive student records are the reference points, and the test scores are the operands accessed by the Index addressing mode.
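The record walk of Figure 2.11 can be sketched in Python: each 16-byte record holds an ID and three scores at byte offsets 4, 8, and 12 from the record's start, and the index register steps by 16. The IDs and scores below are made-up example data for this sketch.

```python
# Sketch of the indexed record access in Figure 2.11: fixed offsets 4,
# 8, and 12 within each 16-byte record, with the index stepping by 16.
# The student IDs and scores are made-up example data.

import struct

records = [(101, 70, 80, 90), (102, 60, 75, 85)]       # (ID, T1, T2, T3)
memory = b"".join(struct.pack("<4i", *r) for r in records)

def word(addr):
    """Return the 32-bit word at the given byte address."""
    return struct.unpack_from("<i", memory, addr)[0]

sum1 = sum2 = sum3 = 0
r2 = 0                          # index register, pointing at LIST
for _ in range(len(records)):
    sum1 += word(r2 + 4)        # 4(R2):  Test 1 score
    sum2 += word(r2 + 8)        # 8(R2):  Test 2 score
    sum3 += word(r2 + 12)       # 12(R2): Test 3 score
    r2 += 16                    # advance to the next record
print(sum1, sum2, sum3)  # 130 155 175
```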

We have introduced the most basic form of indexed addressing that uses a register Ri and a constant offset X. Several variations of this basic form provide for efficient access to memory operands in practical programming situations (although they may not be included in some processors). For example, a second register Rj may be used to contain the offset X, in which case we can write the Index mode as

(Ri,Rj)

The effective address is the sum of the contents of registers Ri and Rj. The second register is usually called the base register. This form of indexed addressing provides more flexibility in accessing operands, because both components of the effective address can be changed.

Yet another version of the Index mode uses two registers plus a constant, which can be denoted as

X(Ri,Rj)

In this case, the effective address is the sum of the constant X and the contents of registers Ri and Rj. This added flexibility is useful in accessing multiple components inside each item in a record, where the beginning of an item is specified by the (Ri,Rj) part of the addressing mode.

Finally, we should note that in the basic Index mode

X(Ri)

if the contents of the register are equal to zero, then the effective address is just equal to the sign-extended value of X. This has the same effect as the Absolute mode. If register R0 always contains the value zero, then the Absolute mode is implemented simply as

X(R0)

2.5 Assembly Language

Machine instructions are represented by patterns of 0s and 1s. Such patterns are awkward to deal with when discussing or preparing programs. Therefore, we use symbolic names to represent the patterns. So far, we have used normal words, such as Load, Store, Add, and


Branch, for the instruction operations to represent the corresponding binary code patterns. When writing programs for a specific computer, such words are normally replaced by acronyms called mnemonics, such as LD, ST, ADD, and BR. A shorthand notation is also useful when identifying registers, such as R3 for register 3. Finally, symbols such as LOC may be defined as needed to represent particular memory locations. A complete set of such symbolic names and rules for their use constitutes a programming language, generally referred to as an assembly language. The set of rules for using the mnemonics and for specification of complete instructions and programs is called the syntax of the language.

Programs written in an assembly language can be automatically translated into a sequence of machine instructions by a program called an assembler. The assembler program is one of a collection of utility programs that are a part of the system software of a computer. The assembler, like any other program, is stored as a sequence of machine instructions in the memory of the computer. A user program is usually entered into the computer through a keyboard and stored either in the memory or on a magnetic disk. At this point, the user program is simply a set of lines of alphanumeric characters. When the assembler program is executed, it reads the user program, analyzes it, and then generates the desired machine-language program. The latter contains patterns of 0s and 1s specifying instructions that will be executed by the computer. The user program in its original alphanumeric text format is called a source program, and the assembled machine-language program is called an object program. We will discuss how the assembler program works in Section 2.5.2 and in Chapter 4. First, we present a few aspects of assembly language itself.

The assembly language for a given computer may or may not be case sensitive, that is, it may or may not distinguish between capital and lower-case letters. In this section, we use capital letters to denote all names and labels in our examples to improve the readability of the text. For example, we write a Store instruction as

ST R2, SUM

The mnemonic ST represents the binary pattern, or operation (OP) code, for the operation performed by the instruction. The assembler translates this mnemonic into the binary OP code that the computer recognizes.

The OP-code mnemonic is followed by at least one blank space or tab character. Then the information that specifies the operands is given. In the Store instruction above, the source operand is in register R2. This information is followed by the specification of the destination operand, separated from the source operand by a comma. The destination operand is in the memory location that has its binary address represented by the name SUM.

Since there are several possible addressing modes for specifying operand locations, an assembly-language instruction must indicate which mode is being used. For example, a numerical value or a name used by itself, such as SUM in the preceding instruction, may be used to denote the Absolute mode. The number sign usually denotes an immediate operand. Thus, the instruction

ADD R2, R3, #5

adds the number 5 to the contents of register R3 and puts the result into register R2.

The number sign is not the only way to denote the Immediate addressing mode. In some assembly languages, the Immediate addressing mode is indicated in the OP-code mnemonic.


For example, the previous Add instruction may be written as

ADDI R2, R3, 5

The suffix I in the mnemonic ADDI states that the second source operand is given in the Immediate addressing mode.

Indirect addressing is usually specified by putting parentheses around the name or symbol denoting the pointer to the operand. For example, if register R2 contains the address of a number in the memory, then this number can be loaded into register R3 using the instruction

LD R3, (R2)

2.5.1 Assembler Directives

In addition to providing a mechanism for representing instructions in a program, assembly language allows the programmer to specify other information needed to translate the source program into the object program. We have already mentioned that we need to assign numerical values to any names used in a program. Suppose that the name TWENTY is used to represent the value 20. This fact may be conveyed to the assembler program through an equate statement such as

TWENTY EQU 20

This statement does not denote an instruction that will be executed when the object program is run; in fact, it will not even appear in the object program. It simply informs the assembler that the name TWENTY should be replaced by the value 20 wherever it appears in the program. Such statements, called assembler directives (or commands), are used by the assembler while it translates a source program into an object program.

To illustrate the use of assembly language further, let us reconsider the program in Figure 2.8. In order to run this program on a computer, it is necessary to write its source code in the required assembly language, specifying all of the information needed to generate the corresponding object program. Suppose that each instruction and each data item occupies one word of memory. Also assume that the memory is byte-addressable and that the word length is 32 bits. Suppose also that the object program is to be loaded in the main memory as shown in Figure 2.12. The figure shows the memory addresses where the machine instructions and the required data items are to be found after the program is loaded for execution. If the assembler is to produce an object program according to this arrangement, it has to know

• How to interpret the names
• Where to place the instructions in the memory
• Where to place the data operands in the memory

To provide this information, the source program may be written as shown in Figure 2.13. The program begins with the assembler directive, ORIGIN, which tells the assembler program where in the memory to place the instructions that follow. It specifies that the instructions


100   Load R2, N
104   Clear R3
108   Move R4, #NUM1
112   LOOP: Load R5, (R4)
116   Add R3, R3, R5
120   Add R4, R4, #4
124   Subtract R2, R2, #1
128   Branch_if_[R2]>0 LOOP
132   Store R3, SUM

200   SUM
204   N (contains 150)
208   NUM1
212   NUM2
      .
      .
804   NUMn

Figure 2.12 Memory arrangement for the program in Figure 2.8.

of the object program are to be loaded in the memory starting at address 100. It is followed by the source program instructions written with the appropriate mnemonics and syntax. Note that we use the statement

BGT R2, R0, LOOP

to represent an instruction that performs the operation

Branch_if_[R2]>0 LOOP

The second ORIGIN directive tells the assembler program where in the memory to place the data block that follows. In this case, the location specified has the address 200. This is intended to be the location in which the final sum will be stored. A 4-byte space for the sum is reserved by means of the assembler directive RESERVE. The next word, at address 204, has to contain the value 150, which is the number of entries in the list.


Memory address label       Operation    Addressing or data information

Assembler directive        ORIGIN       100

Statements that            LD           R2, N
generate                   CLR          R3
machine                    MOV          R4, #NUM1
instructions      LOOP:    LD           R5, (R4)
                           ADD          R3, R3, R5
                           ADD          R4, R4, #4
                           SUB          R2, R2, #1
                           BGT          R2, R0, LOOP
                           ST           R3, SUM
                           next instruction

Assembler directives       ORIGIN       200
                  SUM:     RESERVE      4
                  N:       DATAWORD     150
                  NUM1:    RESERVE      600
                           END

Figure 2.13 Assembly language representation for the program in Figure 2.12.

The DATAWORD directive is used to inform the assembler of this requirement. The next RESERVE directive declares that a memory block of 600 bytes is to be reserved for data. This directive does not cause any data to be loaded in these locations. Data may be loaded in the memory using an input procedure, as we will explain in Chapter 3. The last statement in the source program is the assembler directive END, which tells the assembler that this is the end of the source program text.

We previously described how the EQU directive can be used to associate a specific value, which may be an address, with a particular name. A different way of associating addresses with names or labels is illustrated in Figure 2.13. Any statement that results in instructions or data being placed in a memory location may be given a memory address label. The assembler automatically assigns the address of that location to the label. For example, in the data block that follows the second ORIGIN directive, we used the labels SUM, N, and NUM1. Because the first RESERVE statement after the ORIGIN directive is given the label SUM, the name SUM is assigned the value 200. Whenever SUM is encountered in the program, it will be replaced with this value. Using SUM as a label in


this manner is equivalent to using the assembler directive

SUM EQU 200

Similarly, the labels N and NUM1 are assigned the values 204 and 208, respectively, because they represent the addresses of the two word locations immediately following the word location with address 200.

Most assembly languages require statements in a source program to be written in the form

Label: Operation Operand(s) Comment

These four fields are separated by an appropriate delimiter, perhaps one or more blank or tab characters. The Label is an optional name associated with the memory address where the machine-language instruction produced from the statement will be loaded. Labels may also be associated with addresses of data items. In Figure 2.13 there are four labels: LOOP, SUM, N, and NUM1.

The Operation field contains an assembler directive or the OP-code mnemonic of the desired instruction. The Operand field contains addressing information for accessing the operands. The Comment field is ignored by the assembler program. It is used for documentation purposes to make the program easier to understand.

We have introduced only the very basic characteristics of assembly languages. These languages differ in detail and complexity from one computer to another.

2.5.2 Assembly and Execution of Programs

A source program written in an assembly language must be assembled into a machine-language object program before it can be executed. This is done by the assembler program, which replaces all symbols denoting operations and addressing modes with the binary codes used in machine instructions, and replaces all names and labels with their actual values.

The assembler assigns addresses to instructions and data blocks, starting at the addresses given in the ORIGIN assembler directives. It also inserts constants that may be given in DATAWORD commands, and it reserves memory space as requested by RESERVE commands.

A key part of the assembly process is determining the values that replace the names. In some cases, where the value of a name is specified by an EQU directive, this is a straightforward task. In other cases, where a name is defined in the Label field of a given instruction, the value represented by the name is determined by the location of this instruction in the assembled object program. Hence, the assembler must keep track of addresses as it generates the machine code for successive instructions. For example, the names LOOP and SUM in the program of Figure 2.13 will be assigned the values 112 and 200, respectively.
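The address-tracking step can be sketched as a simplified first pass over a statement list mirroring Figure 2.13. This is an illustrative sketch, not a real assembler: it assumes 4 bytes per instruction or data word, as in the text, and treats ORIGIN, RESERVE, and DATAWORD as the only directives.

```python
# Sketch of an assembler's first pass: assign an address to each label by
# tracking a location counter (4 bytes per instruction or data word, per
# the text). Statements are (label, operation, argument) tuples; the
# argument is used only by ORIGIN and RESERVE here.

def assign_labels(statements):
    symbols, loc = {}, 0
    for label, op, arg in statements:
        if op == "ORIGIN":
            loc = arg                    # reset the location counter
            continue
        if label:
            symbols[label] = loc         # label gets the current address
        loc += arg if op == "RESERVE" else 4
    return symbols

program = [
    (None,   "ORIGIN",   100),
    (None,   "LD",   None), (None, "CLR", None), (None, "MOV", None),
    ("LOOP", "LD",   None),
    (None,   "ADD",  None), (None, "ADD", None), (None, "SUB", None),
    (None,   "BGT",  None), (None, "ST",  None),
    (None,   "ORIGIN",   200),
    ("SUM",  "RESERVE",    4),
    ("N",    "DATAWORD", 150),
    ("NUM1", "RESERVE",  600),
]
print(assign_labels(program))  # {'LOOP': 112, 'SUM': 200, 'N': 204, 'NUM1': 208}
```

The printed addresses agree with the values stated in the text: LOOP at 112, SUM at 200, N at 204, and NUM1 at 208.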

In some cases, the assembler does not directly replace a name representing an address with the actual value of this address. For example, in a branch instruction, the name that specifies the location to which a branch is to be made (the branch target) is not replaced by the actual address. A branch instruction is usually implemented in machine code by specifying the branch target as the distance (in bytes) from the present address in the Program Counter to the target instruction. The assembler computes this branch offset, which can be positive or negative, and puts it into the machine instruction. We will show how branch instructions may be implemented in Section 2.13.

The assembler stores the object program on the secondary storage device available in the computer, usually a magnetic disk. The object program must be loaded into the main memory before it is executed. For this to happen, another utility program called a loader must already be in the memory. Executing the loader performs a sequence of input operations needed to transfer the machine-language program from the disk into a specified place in the memory. The loader must know the length of the program and the address in the memory where it will be stored. The assembler usually places this information in a header preceding the object code. Having loaded the object code, the loader starts execution of the object program by branching to the first instruction to be executed, which may be identified by an address label such as START. The assembler places that address in the header of the object code for the loader to use at execution time.

When the object program begins executing, it proceeds to completion unless there are logical errors in the program. The user must be able to find errors easily. The assembler can only detect and report syntax errors. To help the user find other programming errors, the system software usually includes a debugger program. This program enables the user to stop execution of the object program at some points of interest and to examine the contents of various processor registers and memory locations.

In this section, we introduced some important issues in assembly and execution of programs. Chapter 4 provides a more detailed discussion of these issues.

2.5.3 Number Notation

When dealing with numerical values, it is often convenient to use the familiar decimal notation. Of course, these values are stored in the computer as binary numbers. In some situations, it is more convenient to specify the binary patterns directly. Most assemblers allow numerical values to be specified in different ways, using conventions that are defined by the assembly-language syntax. Consider, for example, the number 93, which is represented by the 8-bit binary number 01011101. If this value is to be used as an immediate operand, it can be given as a decimal number, as in the instruction

ADDI R2, R3, 93

or as a binary number identified by an assembler-specific prefix symbol such as a percent sign, as in

ADDI R2, R3, %01011101

Binary numbers can be written more compactly as hexadecimal, or hex, numbers, in which four bits are represented by a single hex digit. The first ten patterns 0000, 0001, . . . , 1001, referred to as binary-coded decimal (BCD), are represented by the digits 0, 1, . . . , 9. The remaining six 4-bit patterns, 1010, 1011, . . . , 1111, are represented by the letters A, B, . . . , F. In hexadecimal representation, the decimal value 93 becomes 5D. In assembly language, a hex representation is often identified by the prefix 0x (as in the C language) or by a dollar sign prefix. Thus, we would write

ADDI R2, R3, 0x5D
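The three notations name the same bit pattern, as a quick check in Python confirms:

```python
value = 93
assert value == 0b01011101   # binary form, written %01011101 above
assert value == 0x5D         # hex form, written 0x5D (or $5D) above

# each hex digit stands for exactly four bits of the binary pattern
assert value >> 4 == 0x5     # upper four bits: 0101
assert value & 0xF == 0xD    # lower four bits: 1101
```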

2.6 Stacks

Data operated on by a program can be organized in a variety of ways. We have already encountered data structured as lists. Now, we consider an important data structure known as a stack. A stack is a list of data elements, usually words, with the accessing restriction that elements can be added or removed at one end of the list only. This end is called the top of the stack, and the other end is called the bottom. The structure is sometimes referred to as a pushdown stack. Imagine a pile of trays in a cafeteria; customers pick up new trays from the top of the pile, and clean trays are added to the pile by placing them onto the top of the pile. Another descriptive phrase, last-in–first-out (LIFO) stack, is also used to describe this type of storage mechanism; the last data item placed on the stack is the first one removed when retrieval begins. The terms push and pop are used to describe placing a new item on the stack and removing the top item from the stack, respectively.

In modern computers, a stack is implemented by using a portion of the main memory for this purpose. One processor register, called the stack pointer (SP), is used to point to a particular stack structure called the processor stack, whose use will be explained shortly.

Data can be stored in a stack with successive elements occupying successive memory locations. Assume that the first element is placed in location BOTTOM, and when new elements are pushed onto the stack, they are placed in successively lower address locations. We use a stack that grows in the direction of decreasing memory addresses in our discussion, because this is a common practice.

Figure 2.14 shows an example of a stack of word data items. The stack contains numerical values, with 43 at the bottom and −28 at the top. The stack pointer, SP, is used to keep track of the address of the element of the stack that is at the top at any given time. If we assume a byte-addressable memory with a 32-bit word length, the push operation can be implemented as

Subtract SP, SP, #4
Store Rj, (SP)

where the Subtract instruction subtracts 4 from the contents of SP and places the result in SP. Assuming that the new item to be pushed on the stack is in processor register Rj, the Store instruction will place this value on the stack. These two instructions copy the word from Rj onto the top of the stack, decrementing the stack pointer by 4 before the store (push) operation. The pop operation can be implemented as

Load Rj, (SP)
Add SP, SP, #4

These two instructions load (pop) the top value from the stack into register Rj and then increment the stack pointer by 4 so that it points to the new top element. Figure 2.15 shows the effect of each of these operations on the stack in Figure 2.14.


Figure 2.14 A stack of words in the memory. [Diagram: the stack pointer register SP points to the current top element, −28; the elements 17 and 739 lie below it, and BOTTOM marks the bottom element, 43; the memory addresses run from 0 to 2^k − 1.]

2.7 Subroutines

In a given program, it is often necessary to perform a particular task many times on different data values. It is prudent to implement this task as a block of instructions that is executed each time the task has to be performed. Such a block of instructions is usually called a subroutine. For example, a subroutine may evaluate a mathematical function, or it may sort a list of values into increasing or decreasing order.

It is possible to reproduce the block of instructions that constitute a subroutine at every place where it is needed in the program. However, to save space, only one copy of this block is placed in the memory, and any program that requires the use of the subroutine simply branches to its starting location. When a program branches to a subroutine we say that it is calling the subroutine. The instruction that performs this branch operation is named a Call instruction.

After a subroutine has been executed, the calling program must resume execution, continuing immediately after the instruction that called the subroutine. The subroutine is said to return to the program that called it, and it does so by executing a Return instruction. Since the subroutine may be called from different places in a calling program, provision must be made for returning to the appropriate location. The location where the calling program resumes execution is the location pointed to by the updated program counter (PC) while the Call instruction is being executed. Hence, the contents of the PC must be saved by the Call instruction to enable correct return to the calling program.

Figure 2.15 Effect of stack operations on the stack in Figure 2.14. [Diagram: (a) after a push from Rj, which contains 19, SP points to the new top element, 19, above −28, 17, 739, and 43; (b) after a pop into Rj, Rj contains −28 and SP points to the new top element, 17.]

The way in which a computer makes it possible to call and return from subroutines is referred to as its subroutine linkage method. The simplest subroutine linkage method is to save the return address in a specific location, which may be a register dedicated to this function. Such a register is called the link register. When the subroutine completes its task, the Return instruction returns to the calling program by branching indirectly through the link register.

The Call instruction is just a special branch instruction that performs the following operations:

• Store the contents of the PC in the link register
• Branch to the target address specified by the Call instruction

The Return instruction is a special branch instruction that performs the operation

• Branch to the address contained in the link register

Figure 2.16 illustrates how the PC and the link register are affected by the Call and Return instructions.


Figure 2.16 Subroutine linkage using a link register. [Diagram: in the calling program, memory location 200 holds Call SUB and location 204 holds the next instruction; subroutine SUB begins at memory location 1000 with its first instruction and ends with Return. The Call loads 1000 into the PC and saves 204 in the link register; the Return branches back through the link register, restoring 204 to the PC.]

2.7.1 Subroutine Nesting and the Processor Stack

A common programming practice, called subroutine nesting, is to have one subroutine call another. In this case, the return address of the second call is also stored in the link register, overwriting its previous contents. Hence, it is essential to save the contents of the link register in some other location before calling another subroutine. Otherwise, the return address of the first subroutine will be lost.

Subroutine nesting can be carried out to any depth. Eventually, the last subroutine called completes its computations and returns to the subroutine that called it. The return address needed for this first return is the last one generated in the nested call sequence. That is, return addresses are generated and used in a last-in–first-out order. This suggests that the return addresses associated with subroutine calls should be pushed onto the processor stack.

Correct sequencing of nested calls is achieved if a given subroutine SUB1 saves the return address currently in the link register on the stack, accessed through the stack pointer, SP, before it calls another subroutine SUB2. Then, prior to executing its own Return instruction, the subroutine SUB1 has to pop the saved return address from the stack and load it into the link register.
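The following Python model of this discipline is our own sketch (the subroutine names come from the text, but the string "return addresses" are invented stand-ins). It shows why SUB1 must spill the link register before its nested call: the Call to SUB2 overwrites LINK.

```python
stack = []     # models the processor stack
LINK = None    # models the link register

def call(subroutine, return_address):
    global LINK
    LINK = return_address      # Call: save return address in the link register
    subroutine()               # branch to the subroutine

def SUB2():
    pass                       # leaf subroutine: makes no calls, LINK untouched

def SUB1():
    global LINK
    stack.append(LINK)         # push caller's return address before nesting
    call(SUB2, "back_in_SUB1") # this overwrites LINK...
    LINK = stack.pop()         # ...so restore it before SUB1's own Return

call(SUB1, "back_in_main")
assert LINK == "back_in_main"  # SUB1 can now return to the main program
```

Without the push/pop pair in SUB1, LINK would still hold "back_in_SUB1" after the nested call, and the return address of the first call would be lost, exactly as the text warns.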


2.7.2 Parameter Passing

When calling a subroutine, a program must provide to the subroutine the parameters, that is, the operands or their addresses, to be used in the computation. Later, the subroutine returns other parameters, which are the results of the computation. This exchange of information between a calling program and a subroutine is referred to as parameter passing. Parameter passing may be accomplished in several ways. The parameters may be placed in registers or in memory locations, where they can be accessed by the subroutine. Alternatively, the parameters may be placed on the processor stack.

Passing parameters through processor registers is straightforward and efficient. Figure 2.17 shows how the program in Figure 2.8 for adding a list of numbers can be implemented as a subroutine, LISTADD, with the parameters passed through registers. The size of the list, n, contained in memory location N, and the address, NUM1, of the first number, are passed through registers R2 and R4. The sum computed by the subroutine is passed back to the calling program through register R3. The first four instructions in Figure 2.17 constitute the relevant part of the calling program. The first two instructions load n and NUM1 into R2 and R4. The Call instruction branches to the subroutine starting at location LISTADD. This instruction also saves the return address (i.e., the address of the Store instruction in the calling program) in the link register. The subroutine computes the sum and places it in R3. After the Return instruction is executed by the subroutine, the sum in R3 is stored in memory location SUM by the calling program.

Calling program

Load R2, N                     Parameter 1 is list size.
Move R4, #NUM1                 Parameter 2 is list location.
Call LISTADD                   Call subroutine.
Store R3, SUM                  Save result.

Subroutine

LISTADD: Subtract SP, SP, #4   Save the contents of
         Store R5, (SP)        R5 on the stack.
         Clear R3              Initialize sum to 0.
LOOP:    Load R5, (R4)         Get the next number.
         Add R3, R3, R5        Add this number to sum.
         Add R4, R4, #4        Increment the pointer by 4.
         Subtract R2, R2, #1   Decrement the counter.
         Branch_if_[R2]>0 LOOP
         Load R5, (SP)         Restore the contents of R5.
         Add SP, SP, #4
         Return                Return to calling program.

Figure 2.17 Program of Figure 2.8 written as a subroutine; parameters passed through registers.

In addition to registers R2, R3, and R4, which are used for parameter passing, the subroutine also uses R5. Since R5 may be used in the calling program, its contents are saved by pushing them onto the processor stack upon entry to the subroutine and restored before returning to the calling program.

If many parameters are involved, there may not be enough general-purpose registers available for passing them to the subroutine. The processor stack provides a convenient and flexible mechanism for passing an arbitrary number of parameters. Figure 2.18 shows the program of Figure 2.8 rewritten as a subroutine, LISTADD, which uses the processor stack for parameter passing. The address of the first number in the list and the number of entries are pushed onto the processor stack pointed to by register SP. The subroutine is then called. The computed sum is placed on the stack before the return to the calling program.

Figure 2.19 shows the stack entries for this example. Assume that before the subroutine is called, the top of the stack is at level 1. The calling program pushes the address NUM1 and the value n onto the stack and calls subroutine LISTADD. The top of the stack is now at level 2. The subroutine uses four registers while it is being executed. Since these registers may contain valid data that belong to the calling program, their contents should be saved at the beginning of the subroutine by pushing them onto the stack. The top of the stack is now at level 3. The subroutine accesses the parameters n and NUM1 from the stack using indexed addressing with offset values relative to the new top of the stack (level 3). Note that it does not change the stack pointer because valid data items are still at the top of the stack. The value n is loaded into R2 as the initial value of the count, and the address NUM1 is loaded into R4, which is used as a pointer to scan the list entries.

At the end of the computation, register R3 contains the sum. Before the subroutine returns to the calling program, the contents of R3 are inserted into the stack, replacing the parameter NUM1, which is no longer needed. Then the contents of the four registers used by the subroutine are restored from the stack. Also, the stack pointer is incremented to point to the top of the stack that existed when the subroutine was called, namely the parameter n at level 2. After the subroutine returns, the calling program stores the result in location SUM and lowers the top of the stack to its original level by incrementing the SP by 8.

Observe that for subroutine LISTADD in Figure 2.18, we did not use a pair of instructions

Subtract SP, SP, #4
Store Rj, (SP)

to push the contents of each register on the stack. Since we have to save four registers, this would require eight instructions. We needed only five instructions by adjusting SP immediately to point to the top of stack that will be in effect once all four registers are saved. Then, we used the Index mode to store the contents of the registers. We used the same optimization when restoring the registers before returning from the subroutine.


Assume top of stack is at level 1 in Figure 2.19.

Move R2, #NUM1                  Push parameters onto stack.
Subtract SP, SP, #4
Store R2, (SP)
Load R2, N
Subtract SP, SP, #4
Store R2, (SP)
Call LISTADD                    Call subroutine
                                (top of stack is at level 2).
Load R2, 4(SP)                  Get the result from the stack
Store R2, SUM                   and save it in SUM.
Add SP, SP, #8                  Restore top of stack
                                (top of stack is at level 1).

LISTADD: Subtract SP, SP, #16   Save registers
         Store R2, 12(SP)
         Store R3, 8(SP)
         Store R4, 4(SP)
         Store R5, (SP)         (top of stack is at level 3).
         Load R2, 16(SP)        Initialize counter to n.
         Load R4, 20(SP)        Initialize pointer to the list.
         Clear R3               Initialize sum to 0.
LOOP:    Load R5, (R4)          Get the next number.
         Add R3, R3, R5         Add this number to sum.
         Add R4, R4, #4         Increment the pointer by 4.
         Subtract R2, R2, #1    Decrement the counter.
         Branch_if_[R2]>0 LOOP
         Store R3, 20(SP)       Put result in the stack.
         Load R5, (SP)          Restore registers.
         Load R4, 4(SP)
         Load R3, 8(SP)
         Load R2, 12(SP)
         Add SP, SP, #16        (top of stack is at level 2).
         Return                 Return to calling program.

Figure 2.18 Program of Figure 2.8 written as a subroutine; parameters passed on the stack.


Figure 2.19 Stack contents for the program in Figure 2.18. [Diagram: level 1 marks the old top of stack; pushing the parameters NUM1 and n brings the top to level 2; pushing the saved registers [R2], [R3], [R4], and [R5] brings it to level 3, with [R5] on top.]

We should also note that some computers have special instructions for loading and storing multiple registers. For example, the four registers in Figure 2.18 may be saved on the stack by using the instruction

StoreMultiple R2−R5, −(SP)

The source registers are specified by the range R2−R5. The notation −(SP) specifies that the stack pointer must be adjusted accordingly. The minus sign in front indicates that SP must be decremented (by 4) before the contents of each register are placed on the stack.

Similarly, the instruction

LoadMultiple R2−R5, (SP)+

will load registers R2, R3, R4, and R5, in reverse order, with the values that were saved on the stack. The notation (SP)+ indicates that the stack pointer must be incremented (by 4) after each value has been loaded into the corresponding register. We will discuss the addressing modes denoted by −(SP) and (SP)+ in more detail in Section 2.9.1.
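The pre-decrement and post-increment bookkeeping can be modeled in Python. This is a sketch under our own assumptions: the register values, the starting SP of 2000, and the choice to store R2 first (which matches the register layout produced by the explicit stores in Figure 2.18) are all ours, not specified by the text.

```python
WORD = 4
memory = {}
SP = 2000
regs = {"R2": 11, "R3": 22, "R4": 33, "R5": 44}   # arbitrary sample values

def store_multiple(names):
    """Models StoreMultiple R2-R5, -(SP): decrement SP before each store."""
    global SP
    for name in names:             # R2 stored first, deepest in the block
        SP -= WORD
        memory[SP] = regs[name]

def load_multiple(names):
    """Models LoadMultiple R2-R5, (SP)+: increment SP after each load."""
    global SP
    for name in reversed(names):   # registers reloaded in reverse order
        regs[name] = memory[SP]
        SP += WORD

store_multiple(["R2", "R3", "R4", "R5"])
for name in regs:
    regs[name] = 0                 # clobber the registers
load_multiple(["R2", "R3", "R4", "R5"])
assert regs == {"R2": 11, "R3": 22, "R4": 33, "R5": 44}
assert SP == 2000                  # stack pointer fully restored
```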

Parameter Passing by Value and by Reference

Note the nature of the two parameters, NUM1 and n, passed to the subroutines in Figures 2.17 and 2.18. The purpose of the subroutines is to add a list of numbers. Instead of passing the actual list entries, the calling program passes the address of the first number in the list. This technique is called passing by reference. The second parameter is passed by value, that is, the actual number of entries, n, is passed to the subroutine.
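The same two styles exist in high-level languages. A hypothetical Python analogue of LISTADD (our own sketch) makes the distinction concrete: the list object plays the role of the address NUM1, while n is copied by value.

```python
def list_add(num_list, n):
    """Sum the first n entries; num_list is a reference, n is a value."""
    total = 0
    for i in range(n):          # n was passed by value
        total += num_list[i]    # entries reached through the reference
    return total

numbers = [10, 20, 30, 40]
assert list_add(numbers, 3) == 60    # only the first n = 3 entries summed
assert numbers == [10, 20, 30, 40]   # caller's list itself is unchanged
```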


2.7.3 The Stack Frame

Now, observe how space is used in the stack in the example in Figures 2.18 and 2.19. During execution of the subroutine, six locations at the top of the stack contain entries that are needed by the subroutine. These locations constitute a private work space for the subroutine, allocated at the time the subroutine is entered and deallocated when the subroutine returns control to the calling program. Such space is called a stack frame. If the subroutine requires more space for local memory variables, the space for these variables can also be allocated on the stack.

Figure 2.20 shows an example of a commonly used layout for information in a stack frame. In addition to the stack pointer SP, it is useful to have another pointer register, called the frame pointer (FP), for convenient access to the parameters passed to the subroutine and to the local memory variables used by the subroutine. In the figure, we assume that four parameters are passed to the subroutine, three local variables are used within the subroutine, and registers R2, R3, and R4 need to be saved because they will also be used within the subroutine. When nested subroutines are used, the stack frame of the calling subroutine would also include the return address, as we will see in the example that follows.

Figure 2.20 A subroutine stack frame example. [Diagram: the stack frame for the called subroutine holds, from the top of the stack, saved [R4] (where SP points), saved [R3], saved [R2], localvar3, localvar2, localvar1, and the saved [FP] (where FP points); above the frame lie param1 through param4 and the old TOS (top-of-stack) element.]


With the FP register pointing to the location just above the stored parameters, as shown in Figure 2.20, we can easily access the parameters and the local variables by using the Index addressing mode. The parameters can be accessed by using addresses 4(FP), 8(FP), . . . . The local variables can be accessed by using addresses −4(FP), −8(FP), . . . . The contents of FP remain fixed throughout the execution of the subroutine, unlike the stack pointer SP, which must always point to the current top element in the stack.

Now let us discuss how the pointers SP and FP are manipulated as the stack frame is allocated, used, and deallocated for a particular invocation of a subroutine. We begin by assuming that SP points to the old top-of-stack (TOS) element in Figure 2.20. Before the subroutine is called, the calling program pushes the four parameters onto the stack. Then the Call instruction is executed. At this time, SP points to the last parameter that was pushed on the stack. If the subroutine is to use the frame pointer, it should first save the contents of FP by pushing them on the stack, because FP is usually a general-purpose register and it may contain information of use to the calling program. Then, the contents of SP, which now points to the saved value of FP, are copied into FP.

Thus, the first three instructions executed in the subroutine are

Subtract SP, SP, #4
Store FP, (SP)
Move FP, SP

The Move instruction copies the contents of SP into FP. After these instructions are executed, both SP and FP point to the saved FP contents. Space for the three local variables is now allocated on the stack by executing the instruction

Subtract SP, SP, #12

Finally, the contents of processor registers R2, R3, and R4 are saved by pushing them onto the stack. At this point, the stack frame has been set up as shown in Figure 2.20.

The subroutine now executes its task. When the task is completed, the subroutine pops the saved values of R4, R3, and R2 back into those registers, deallocates the local variables from the stack frame by executing the instruction

Add SP, SP, #12

and pops the saved old value of FP back into FP. At this point, SP points to the last parameter that was placed on the stack. Next, the Return instruction is executed, transferring control back to the calling program.

The calling program is responsible for deallocating the parameters from the stack frame, some of which may be results passed back by the subroutine. After deallocation of the parameters, the stack pointer points to the old TOS, and we are back to where we started.
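The whole allocate/deallocate sequence is just pointer arithmetic. The numeric sketch below tracks SP and FP through the frame of Figure 2.20; the old top-of-stack address of 10000 is a hypothetical value chosen for illustration.

```python
WORD = 4
SP = 10000                 # hypothetical address of the old TOS element

SP -= 4 * WORD             # caller pushes param4, param3, param2, param1
SP -= WORD                 # subroutine: Subtract SP,SP,#4 / Store FP,(SP)
FP = SP                    # subroutine: Move FP, SP
SP -= 3 * WORD             # Subtract SP,SP,#12 allocates localvar1..localvar3
SP -= 3 * WORD             # push R2, R3, R4

assert FP - SP == 6 * WORD     # six words between FP and the frame's top
assert FP + 1 * WORD == 9984   # 4(FP)  -> param1, pushed last
assert FP + 4 * WORD == 9996   # 16(FP) -> param4, pushed first
assert FP - 1 * WORD == 9976   # -4(FP) -> localvar1

# tear-down mirrors set-up
SP += 3 * WORD             # pop R4, R3, R2
SP += 3 * WORD             # Add SP,SP,#12 deallocates the locals
SP += WORD                 # pop the saved FP
SP += 4 * WORD             # caller removes the four parameters
assert SP == 10000         # back to the old TOS
```

Note how FP stays fixed for the life of the frame while SP moves; that is what makes the 4(FP)/−4(FP) offsets stable.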

Stack Frames for Nested Subroutines

When nested subroutines are used, it is necessary to ensure that the return addresses are properly saved. When a calling program calls a subroutine, say SUB1, the return address is saved in the link register. Now, if SUB1 calls another subroutine, SUB2, it must save the current contents of the link register before it makes the call to SUB2. The appropriate place for saving this return address is within the stack frame for SUB1. If SUB2 then calls SUB3, it must save the current contents of the link register within the stack frame associated with SUB2, and so on.

An example of a main program calling a first subroutine SUB1, which then calls a second subroutine SUB2, is shown in Figure 2.21. The stack frames corresponding to these two nested subroutines are shown in Figure 2.22. All parameters involved in this example are passed on the stack. The two figures only show the flow of control and data among the main program and the two subroutines. The actual computations are not shown.

The flow of execution is as follows. The main program pushes the two parameters param2 and param1 onto the stack, in that order, and then calls SUB1. This first subroutine is responsible for computing a single result and passing it back to the main program on the stack. During the course of its computations, SUB1 calls the second subroutine, SUB2, in order to perform some other subtask. SUB1 passes a single parameter param3 to SUB2, and the result is passed back to it via the same location on the stack. After SUB2 executes its Return instruction, SUB1 loads this result into register R4. SUB1 then continues its computations and eventually passes the required answer back to the main program on the stack. When SUB1 executes its return to the main program, the main program stores this answer in memory location RESULT, restores the stack level, then continues with its computations at the next instruction at address 2040. Note how the return address to the calling program, 2028, is stored within the stack frame for SUB1 in Figure 2.22.

The comments in Figure 2.21 provide the details of how this flow of execution is managed. The first action performed by each subroutine is to save on the stack the contents of all registers used in the subroutine, including the frame pointer and link register (if needed). This is followed by initializing the frame pointer. SUB1 uses four registers, R2 to R5, and SUB2 uses two registers, R2 and R3. These registers, the frame pointer, and the link register in the case of SUB1, are restored just before the Return instructions are executed.

The Index addressing mode involving the frame pointer register FP is used to load parameters from the stack and place answers back on the stack. The byte offsets used in these operations are always 4, 8, . . . , as discussed for the general stack frame in Figure 2.20. Finally, note that each calling routine is responsible for removing its own parameters from the stack. This is done by the Add instructions, which lower the top of the stack.

2.8 Additional Instructions

So far, we have introduced the following instructions: Load, Store, Move, Clear, Add, Subtract, Branch, Call, and Return. These instructions, along with the addressing modes in Table 2.1, have allowed us to write programs to illustrate machine instruction sequencing, including branching and subroutine linkage. In this section we introduce a few more instructions that are found in most instruction sets.


Memory location   Instructions                 Comments

Main program
2000   Load R2, PARAM2            Place parameters on stack.
2004   Subtract SP, SP, #4
2008   Store R2, (SP)
2012   Load R2, PARAM1
2016   Subtract SP, SP, #4
2020   Store R2, (SP)
2024   Call SUB1                  Call the subroutine.
2028   Load R2, (SP)              Store result.
2032   Store R2, RESULT
2036   Add SP, SP, #8             Restore stack level.
2040   next instruction

First subroutine
2100   SUB1: Subtract SP, SP, #24 Save registers.
2104   Store LINK_reg, 20(SP)
2108   Store FP, 16(SP)
2112   Store R2, 12(SP)
2116   Store R3, 8(SP)
2120   Store R4, 4(SP)
2124   Store R5, (SP)
2128   Add FP, SP, #16            Initialize the frame pointer.
2132   Load R2, 8(FP)             Get first parameter.
2136   Load R3, 12(FP)            Get second parameter.
       Load R4, PARAM3            Place a parameter on stack.
       Subtract SP, SP, #4
       Store R4, (SP)
       Call SUB2
       Load R4, (SP)              Get result from SUB2.
       Add SP, SP, #4
       Store R5, 8(FP)            Place answer on stack.
       Load R5, (SP)              Restore registers.
       Load R4, 4(SP)
       Load R3, 8(SP)
       Load R2, 12(SP)
       Load FP, 16(SP)
       Load LINK_reg, 20(SP)
       Add SP, SP, #24
       Return                     Return to Main program.

...continued in part b.

Figure 2.21 Nested subroutines (part a).


Memory location   Instructions                 Comments

Second subroutine
3000   SUB2: Subtract SP, SP, #12 Save registers.
3004   Store FP, 8(SP)
       Store R2, 4(SP)
       Store R3, (SP)
       Add FP, SP, #8             Initialize the frame pointer.
       Load R2, 4(FP)             Get the parameter.
       Store R3, 4(FP)            Place SUB2 result on stack.
       Load R3, (SP)              Restore registers.
       Load R2, 4(SP)
       Load FP, 8(SP)
       Add SP, SP, #12
       Return                     Return to Subroutine 1.

Figure 2.21 Nested subroutines (part b).

2.8.1 Logic Instructions

Logic operations such as AND, OR, and NOT, applied to individual bits, are the basic building blocks of digital circuits, as described in Appendix A. It is also useful to be able to perform logic operations in software, which is done using instructions that apply these operations to all bits of a word or byte independently and in parallel. For example, the instruction

And R4, R2, R3

computes the bit-wise AND of the operands in registers R2 and R3, and leaves the result in R4. An immediate form of this instruction may be

And R4, R2, #Value

where Value is a 16-bit logic value that is extended to 32 bits by placing zeros into the 16 most-significant bit positions.

Consider the following application for this logic instruction. Suppose that four ASCII characters are contained in the 32-bit register R2. In some task, we wish to determine if the rightmost character is Z. If it is, then a conditional branch to FOUNDZ is to be made. From Table 1.1 in Chapter 1, we find that the ASCII code for Z is 01011010, which is expressed in hexadecimal notation as 5A. The three-instruction sequence

And R2, R2, #0xFF
Move R3, #0x5A
Branch_if_[R2]=[R3] FOUNDZ

implements the desired action. The And instruction clears all bits in the leftmost three character positions of R2 to zero, leaving the rightmost character unchanged. This is the result of using an immediate operand that has eight 1s at its right end, and 0s in the 24 bits to the left. The Move instruction loads the hex value 5A into R3. Since both R2 and R3 have 0s in the leftmost 24 bits, the Branch instruction compares the remaining character at the right end of R2 with the binary representation for the character Z, and causes a branch to FOUNDZ if there is a match.

Figure 2.22 Stack frames for Figure 2.21. [Diagram: the stack frame for SUB1 holds, from the old TOS downward, param2, param1, the return address 2028, the saved [FP] from Main (where FP points while SUB1 runs), and the saved [R2], [R3], [R4], and [R5] from Main. Below it, param3 is followed by the stack frame for SUB2, which holds the saved [FP] from SUB1 (where FP points while SUB2 runs) and the saved [R2] and [R3] from SUB1.]
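The masking step is easy to verify in Python. The four sample characters packed into the 32-bit "register" are our own choice; only the rightmost one, Z, matters here.

```python
# four ASCII characters packed into a 32-bit word, 'Z' in the rightmost byte
r2 = (ord('X') << 24) | (ord('Y') << 16) | (ord('Q') << 8) | ord('Z')

r2 = r2 & 0xFF           # And R2, R2, #0xFF  (keep only the rightmost character)
r3 = 0x5A                # Move R3, #0x5A     (ASCII code for 'Z')
assert r2 == r3          # the branch to FOUNDZ would be taken
```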

2.8.2 Shift and Rotate Instructions

There are many applications that require the bits of an operand to be shifted right or left some specified number of bit positions. The details of how the shifts are performed depend on whether the operand is a signed number or some more general binary-coded information. For general operands, we use a logical shift. For a signed number, we use an arithmetic shift, which preserves the sign of the number.

Logical Shifts
Two logical shift instructions are needed, one for shifting left (LShiftL) and another for shifting right (LShiftR). These instructions shift an operand over a number of bit positions

specified in a count operand contained in the instruction. The general form of a Logical-shift-left instruction is

LShiftL Ri, Rj, count

which shifts the contents of register Rj left by a number of bit positions given by the count operand, and places the result in register Ri, without changing the contents of Rj. The count operand may be given as an immediate operand, or it may be contained in a processor register. To complete the description of the shift left operation, we need to specify the bit values brought into the vacated positions at the right end of the destination operand, and to determine what happens to the bits shifted out of the left end. Vacated positions are filled with zeros. In computers that do not use condition code flags, the bits shifted out are simply dropped. In computers that use condition code flags, which will be discussed in Section 2.10.2, these bits are passed through the Carry flag, C, and then dropped. Involving the C flag in shifts is useful in performing arithmetic operations on large numbers that occupy more than one word. Figure 2.23a shows an example of shifting the contents of register R3 left by two bit positions. The Logical-shift-right instruction, LShiftR, works in the same manner except that it shifts to the right. Figure 2.23b illustrates this operation.
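A C sketch of the two logical shifts, under the assumption that the last bit shifted out is captured in a separate carry variable standing in for the C flag (the helper names are hypothetical):

```c
#include <stdint.h>

/* Logical shift left by n (0 < n < 32): vacated positions at the right
   are filled with zeros; *carry receives the last bit shifted out of the
   left end, mimicking the C flag described in the text. */
uint32_t lshiftl(uint32_t x, unsigned n, unsigned *carry)
{
    *carry = (x >> (32 - n)) & 1u;  /* last bit shifted out on the left */
    return x << n;
}

/* Logical shift right by n: zeros enter at the left end, and *carry
   receives the last bit shifted out of the right end. */
uint32_t lshiftr(uint32_t x, unsigned n, unsigned *carry)
{
    *carry = (x >> (n - 1)) & 1u;   /* last bit shifted out on the right */
    return x >> n;
}
```

Using unsigned arithmetic is essential here: in C, shifting an unsigned value right always fills with zeros, which matches the definition of a logical shift.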

Digit-Packing Example
Consider the following short task that illustrates the use of both shift operations and logic operations. Suppose that two decimal digits represented in ASCII code are located in the memory at byte locations LOC and LOC+1. We wish to represent each of these digits in the 4-bit BCD code and store both of them in a single byte location PACKED. The result is said to be in packed-BCD format. Table 1.1 in Chapter 1 shows that the rightmost four bits of the ASCII code for a decimal digit correspond to the BCD code for the digit. Hence, the required task is to extract the low-order four bits in LOC and LOC+1 and concatenate them into the single byte at PACKED.

The instruction sequence shown in Figure 2.24 accomplishes the task using register R2 as a pointer to the ASCII characters in memory, and using registers R3 and R4 to develop the BCD digit codes. The program uses the LoadByte instruction, which loads a byte from the memory into the rightmost eight bit positions of a 32-bit processor register and clears the remaining higher-order bits to zero. The StoreByte instruction writes the rightmost byte in the source register into the specified destination location, but does not affect any other byte locations. The value 0xF in the And instruction is used to clear to zero all but the four rightmost bits in R4. Note that the immediate source operand is written as 0xF, which, interpreted as a 32-bit pattern, has 28 zeros in the most-significant bit positions.
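The same steps can be expressed in C (a sketch, not the book's assembly): keep the low four bits of each ASCII character, shift the first digit into the high nibble, and OR in the second digit.

```c
#include <stdint.h>

/* Pack two ASCII decimal digits into one packed-BCD byte, following the
   steps of Figure 2.24. ascii_hi supplies the high nibble of the result,
   ascii_lo the low nibble. */
uint8_t pack_bcd(uint8_t ascii_hi, uint8_t ascii_lo)
{
    uint8_t hi = (uint8_t)((ascii_hi & 0x0Fu) << 4); /* first digit -> high nibble */
    uint8_t lo = ascii_lo & 0x0Fu;                   /* second digit -> low nibble */
    return (uint8_t)(hi | lo);
}
```

For the ASCII digits '3' (0x33) and '7' (0x37), the packed result is the byte 0x37 in BCD, just as the routine in Figure 2.24 produces at location PACKED.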

Arithmetic Shifts
In an arithmetic shift, the bit pattern being shifted is interpreted as a signed number. A study of the 2’s-complement binary number representation in Figure 1.3 reveals that shifting a number one bit position to the left is equivalent to multiplying it by 2, and shifting it to the right is equivalent to dividing it by 2. Of course, overflow might occur on shifting left, and the remainder is lost when shifting right. Another important observation is that on a right shift the sign bit must be repeated as the fill-in bit for the vacated position as a requirement of the 2’s-complement representation for numbers. This requirement when shifting right distinguishes arithmetic shifts from logical shifts, in which the fill-in bit is always 0. Otherwise, the two types of shifts are the same. An example of an Arithmetic-shift-right instruction, AShiftR, is shown in Figure 2.23c. The Arithmetic-shift-left is exactly the same as the Logical-shift-left.

Figure 2.23 Logical and arithmetic shift instructions: (a) logical shift left, LShiftL R3, R3, #2; (b) logical shift right, LShiftR R3, R3, #2; (c) arithmetic shift right, AShiftR R3, R3, #2.
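The sign-replication rule can be shown in C. Because right-shifting a negative signed value is implementation-defined in C, the hypothetical helper below builds the arithmetic shift portably from a logical shift plus explicit sign extension:

```c
#include <stdint.h>

/* Arithmetic shift right by n (0 <= n < 32): the sign bit is replicated
   into the vacated positions, so the result equals division by 2^n with
   the remainder discarded toward minus infinity. */
int32_t ashiftr(int32_t x, unsigned n)
{
    uint32_t u = (uint32_t)x >> n;   /* logical shift: zero fill */
    if (x < 0)
        u |= ~(~0u >> n);            /* replicate the sign bit at the top */
    return (int32_t)u;
}
```

For instance, shifting −8 right by one position gives −4, and shifting −1 right by any amount leaves −1, exactly the behavior of AShiftR in Figure 2.23c.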

      Move R2, #LOC        R2 points to data.
      LoadByte R3, (R2)    Load first byte into R3.
      LShiftL R3, R3, #4   Shift left by 4 bit positions.
      Add R2, R2, #1       Increment the pointer.
      LoadByte R4, (R2)    Load second byte into R4.
      And R4, R4, #0xF     Clear high-order bits to zero.
      Or R3, R3, R4        Concatenate the BCD digits.
      StoreByte R3, PACKED Store the result.

Figure 2.24 A routine that packs two BCD digits into a byte.

Rotate Operations
In the shift operations, the bits shifted out of the operand are lost, except for the last bit shifted out, which is retained in the Carry flag, C. For situations where it is desirable to preserve all of the bits, rotate instructions may be used instead. These are instructions that move the bits shifted out of one end of the operand into the other end. Two versions of both the Rotate-left and Rotate-right instructions are often provided. In one version, the bits of the operand are simply rotated. In the other version, the rotation includes the C flag. Figure 2.25 shows the left and right rotate operations with and without the C flag being included in the rotation. Note that when the C flag is not included in the rotation, it still retains the last bit shifted out of the end of the register. The OP codes RotateL, RotateLC, RotateR, and RotateRC denote the instructions that perform the rotate operations.
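The rotates without carry can be sketched in C with the standard two-shift idiom (hypothetical helper names; valid for rotation counts strictly between 0 and 32):

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n positions (0 < n < 32): bits shifted
   out of the left end re-enter at the right, so no information is lost.
   Mimics RotateL without the carry flag. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32u - n));
}

/* Rotate right by n positions (0 < n < 32), mimicking RotateR. */
uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}
```

A rotation left by n followed by a rotation right by n restores the original value, which is exactly the "no bits are lost" property that distinguishes rotates from shifts.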

2.8.3 Multiplication and Division

Two signed integers can be multiplied or divided by machine instructions with the same format as we saw earlier for an Add instruction. The instruction

Multiply Rk, Ri, Rj

performs the operation

Rk ← [Ri] × [Rj]

The product of two n-bit numbers can be as large as 2n bits. Therefore, the answer will not necessarily fit into register Rk. A number of instruction sets have a Multiply instruction that computes the low-order n bits of the product and places it in register Rk, as indicated. This is sufficient if it is known that all products in some particular application task will fit into n bits. To accommodate the general 2n-bit product case, some processors produce the product in two registers, usually adjacent registers Rk and R(k + 1), with the high-order half being placed in register R(k + 1).
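The two-register product can be modeled in C by computing in a 64-bit type and splitting the result into halves, as a Multiply instruction might leave them in Rk and R(k + 1). This is a sketch of the idea, not any particular processor's instruction:

```c
#include <stdint.h>

/* Full 64-bit product of two signed 32-bit operands, split into the
   low-order half (-> Rk) and the high-order half (-> R(k+1)). */
void multiply_full(int32_t a, int32_t b, uint32_t *low, int32_t *high)
{
    int64_t p = (int64_t)a * (int64_t)b;       /* 2n-bit product */
    *low  = (uint32_t)p;                       /* low-order n bits  */
    *high = (int32_t)((uint64_t)p >> 32);      /* high-order n bits */
}
```

For 0x10000 × 0x10000 the product is 2^32, so the low half is 0 and the high half is 1; a Multiply that kept only the low-order word would report 0, illustrating why the high-order register matters.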

An instruction set may also provide a signed integer Divide instruction

Divide Rk, Ri, Rj

which performs the operation

Rk ← [Rj]/[Ri]

placing the quotient in Rk. The remainder may be placed in R(k + 1), or it may be lost.
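A quotient-and-remainder Divide of the kind described can be modeled in C, where `/` truncates toward zero and `%` yields the matching remainder (a hypothetical sketch; the destination registers are only a convention):

```c
#include <stdint.h>

/* A Divide that produces both quotient and remainder, as some processors
   do using registers Rk and R(k+1). den must be nonzero. */
void divide_full(int32_t num, int32_t den, int32_t *quot, int32_t *rem)
{
    *quot = num / den;   /* quotient  -> Rk     */
    *rem  = num % den;   /* remainder -> R(k+1) */
}
```

For 17 ÷ 5 this yields quotient 3 and remainder 2; note that C's truncation toward zero makes −17 ÷ 5 give −3 with remainder −2.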

Figure 2.25 Rotate instructions: (a) rotate left without carry, RotateL R3, R3, #2; (b) rotate left with carry, RotateLC R3, R3, #2; (c) rotate right without carry, RotateR R3, R3, #2; (d) rotate right with carry, RotateRC R3, R3, #2.


Computers that do not have Multiply and Divide instructions can perform these and other arithmetic operations by using sequences of more basic instructions such as Add, Subtract, Shift, and Rotate. This will become more apparent when we describe the implementation of arithmetic operations in Chapter 9.
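As a taste of what Chapter 9 develops in hardware, multiplication can be synthesized from Add and Shift alone: examine the multiplier one bit at a time, adding a progressively shifted copy of the multiplicand whenever the bit is 1. A C sketch of this shift-and-add scheme (modulo 2^32):

```c
#include <stdint.h>

/* Shift-and-add multiplication built only from Add and Shift operations,
   illustrating how a processor without a Multiply instruction can still
   compute a product. */
uint32_t shift_add_multiply(uint32_t multiplicand, uint32_t multiplier)
{
    uint32_t product = 0;
    while (multiplier != 0) {
        if (multiplier & 1u)            /* examine the next multiplier bit */
            product += multiplicand;    /* add the shifted multiplicand    */
        multiplicand <<= 1;             /* shift left for the next bit     */
        multiplier >>= 1;
    }
    return product;
}
```

The loop runs at most 32 times, once per multiplier bit, which is essentially the sequential multiplier circuit discussed later in the book.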

2.9 Dealing with 32-Bit Immediate Values

In the discussion of addressing modes, in Section 2.4.1, we raised the question of how a 32-bit value that represents a constant or a memory address can be loaded into a processor register. The Immediate and Absolute modes in a RISC-style processor restrict the operand size to 16 bits. Therefore, a 32-bit value cannot be given explicitly in a single instruction that must fit in a 32-bit word.

A possible solution is to use two instructions for this purpose. One approach found in RISC-style processors uses instructions that perform two different logical-OR operations. The instruction

Or Rdst, Rsrc, #Value

extends the 16-bit immediate operand by placing zeros into the high-order bit positions to form a 32-bit value, which is then ORed with the contents of register Rsrc. If Rsrc contains zero, then Rdst will just be loaded with the extended 32-bit value. Another instruction

OrHigh Rdst, Rsrc, #Value

forms a 32-bit value by taking the 16-bit immediate operand as the high-order bits and appending zeros as the low-order bits. This value is then ORed with the contents of Rsrc. Using these instructions, and assuming that R0 contains the value 0, we can load the 32-bit value 0x20004FF0 into register R2 as follows:

OrHigh R2, R0, #0x2000
Or R2, R2, #0x4FF0
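The effect of the OrHigh/Or pair can be verified with a small C calculation: the first OR deposits the high 16 bits, the second the low 16 bits (the function name is hypothetical).

```c
#include <stdint.h>

/* Build a 32-bit constant from two 16-bit immediates, mirroring the
   OrHigh/Or instruction pair. R0 is assumed to contain 0. */
uint32_t make_const32(uint16_t high, uint16_t low)
{
    uint32_t r = 0;
    r |= (uint32_t)high << 16;   /* OrHigh R2, R0, #high */
    r |= low;                    /* Or     R2, R2, #low  */
    return r;
}
```

With high = 0x2000 and low = 0x4FF0 this produces 0x20004FF0, the value loaded into R2 above.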

To make it easier to write programs, a RISC-style instruction set may include pseudoinstructions that indicate an action that requires more than one machine instruction. Such pseudoinstructions are replaced with the corresponding machine-instruction sequence by the assembler program. For example, the pseudoinstruction

MoveImmediateAddress R2, LOC

could be used to load a 32-bit address represented by the symbol LOC into register R2. In the assembled program, it would be replaced with two instructions using 16-bit values as shown above.

An alternative to using two instructions to load a 32-bit address into a register is to use more than one word per instruction. In that case, a two-word instruction could give the OP code and register specification in the first word, and include a 32-bit value in the second word. This is the approach found in CISC-style processors.


Finally, note that in the previous sections we always assumed that single Load and Store instructions can be used to access memory locations represented by symbolic names. This makes the example programs simpler and easier to read. The programs will run correctly if the required memory addresses can be specified in 16 bits. If longer addresses are involved, then the approach described above to construct 32-bit addresses must be used.

2.10 CISC Instruction Sets

In preceding sections, we introduced the RISC style of instruction sets. Now we will examine some important characteristics of Complex Instruction Set Computers (CISC).

One key difference is that CISC instruction sets are not constrained to the load/store architecture, in which arithmetic and logic operations can be performed only on operands that are in processor registers. Another key difference is that instructions do not necessarily have to fit into a single word. Some instructions may occupy a single word, but others may span multiple words.

Instructions in modern CISC processors typically do not use a three-address format. Most arithmetic and logic instructions use the two-address format

Operation destination, source

An Add instruction of this type is

Add B, A

which performs the operation B ← [A] + [B] on memory operands. When the sum is calculated, the result is sent to the memory and stored in location B, replacing the original contents of this location. This means that memory location B is both a source and a destination.

Consider again the task of adding two numbers

C = A + B

where all three operands may be in memory locations. Obviously, this cannot be done with a single two-address instruction. The task can be performed by using another two-address instruction that copies the contents of one memory location into another. Such an instruction is

Move C, B

which performs the operation C ← [B], leaving the contents of location B unchanged. The operation C ← [A] + [B] can now be performed by the two-instruction sequence

Move C, B
Add C, A

Observe that by using this sequence of instructions, the contents of neither location A nor location B are overwritten.


In some CISC processors one operand may be in the memory but the other must be in a register. In this case, the instruction sequence for the required task would be

Move Ri, A
Add Ri, B
Move C, Ri

The general form of the Move instruction is

Move destination, source

where both the source and destination may be either a memory location or a processor register. The Move instruction includes the functionality of the Load and Store instructions we used previously in the discussion of RISC-style processors. In the Load instruction, the source is a memory location and the destination is a processor register. In the Store instruction, the source is a register and the destination is a memory location. While Load and Store instructions are restricted to moving operands between memory and processor registers, the Move instruction has a wider scope. It can be used to move immediate operands and to transfer operands between two memory locations or between two registers.

2.10.1 Additional Addressing Modes

Most CISC processors have all five basic addressing modes—Immediate, Register, Absolute, Indirect, and Index. Three additional addressing modes are often found in CISC processors.

Autoincrement and Autodecrement Modes
There are two modes that are particularly convenient for accessing data items in successive locations in the memory and for implementation of stacks.

Autoincrement mode—The effective address of the operand is the contents of a register specified in the instruction. After accessing the operand, the contents of this register are automatically incremented to point to the next operand in memory.

We denote the Autoincrement mode by putting the specified register in parentheses, to show that the contents of the register are used as the effective address, followed by a plus sign to indicate that these contents are to be incremented after the operand is accessed. Thus, the Autoincrement mode is written as

(Ri)+

To access successive words in a byte-addressable memory with a 32-bit word length, the increment amount must be 4. Computers that have the Autoincrement mode automatically increment the contents of the register by a value that corresponds to the size of the accessed operand. Thus, the increment is 1 for byte-sized operands, 2 for 16-bit operands, and 4 for 32-bit operands. Since the size of the operand is usually specified as part of the operation code of an instruction, it is sufficient to indicate the Autoincrement mode as (Ri)+.
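The Autoincrement mode corresponds closely to C's post-increment pointer idiom: `*p++` uses the pointer as the effective address and then advances it, with the compiler scaling the increment by the operand size, just as the processor scales by 1, 2, or 4. A small sketch that sums an array the way a loop using (Ri)+ would (the function name is hypothetical; `*--sp` would similarly model the Autodecrement mode described next):

```c
#include <stdint.h>

/* Sum n 32-bit words starting at p, accessing each operand and then
   incrementing the pointer, as the (Ri)+ mode does. */
int32_t sum_autoincrement(const int32_t *p, int n)
{
    int32_t sum = 0;
    while (n-- > 0)
        sum += *p++;   /* access the operand, then advance by one word */
    return sum;
}
```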


As a companion for the Autoincrement mode, another useful mode accesses the memory locations in the reverse order:

Autodecrement mode—The contents of a register specified in the instruction are first automatically decremented and are then used as the effective address of the operand.

We denote the Autodecrement mode by putting the specified register in parentheses, preceded by a minus sign to indicate that the contents of the register are to be decremented before being used as the effective address. Thus, we write

−(Ri)

In this mode, operands are accessed in descending address order.

The reader may wonder why the address is decremented before it is used in the Autodecrement mode, and incremented after it is used in the Autoincrement mode. The main reason for this is to make it easy to use these modes together to implement a stack structure. Instead of needing two instructions

Subtract SP, #4
Move (SP), NEWITEM

to push a new item on the stack, we can use just one instruction

Move −(SP), NEWITEM

Similarly, instead of needing two instructions

Move ITEM, (SP)
Add SP, #4

to pop an item from the stack, we can use just

Move ITEM, (SP)+

Relative Mode
We have defined the Index mode by using general-purpose processor registers. Some computers have a version of this mode in which the program counter, PC, is used instead of a general-purpose register. Then, X(PC) can be used to address a memory location that is X bytes away from the location presently pointed to by the program counter. Since the addressed location is identified relative to the program counter, which always identifies the current execution point in a program, the name Relative mode is associated with this type of addressing.

Relative mode—The effective address is determined by the Index mode using the program counter in place of the general-purpose register Ri.


2.10.2 Condition Codes

Operations performed by the processor typically generate results such as numbers that are positive, negative, or zero. The processor can maintain the information about these results for use by subsequent conditional branch instructions. This is accomplished by recording the required information in individual bits, often called condition code flags. These flags are usually grouped together in a special processor register called the condition code register or status register. Individual condition code flags are set to 1 or cleared to 0, depending on the outcome of the operation performed.

Four commonly used flags are

N (negative)   Set to 1 if the result is negative; otherwise, cleared to 0
Z (zero)       Set to 1 if the result is 0; otherwise, cleared to 0
V (overflow)   Set to 1 if arithmetic overflow occurs; otherwise, cleared to 0
C (carry)      Set to 1 if a carry-out results from the operation; otherwise, cleared to 0

The N and Z flags record whether the result of an arithmetic or logic operation is negative or zero. In some computers, they may also be affected by the value of the operand of a Move instruction. This makes it possible for a later conditional branch instruction to cause a branch based on the sign and value of the operand that was moved. Some computers also provide a special Test instruction that examines a value in a register or in the memory without modifying it, and sets or clears the N and Z flags accordingly.

The V flag indicates whether overflow has taken place. As explained in Section 1.4, overflow occurs when the result of an arithmetic operation is outside the range of values that can be represented by the number of bits available for the operands. The processor sets the V flag to allow the programmer to test whether overflow has occurred and branch to an appropriate routine that deals with the problem. Instructions such as Branch_if_overflow are usually provided for this purpose.

The C flag is set to 1 if a carry occurs from the most-significant bit position during an arithmetic operation. This flag makes it possible to perform arithmetic operations on operands that are longer than the word length of the processor. Such operations are used in multiple-precision arithmetic, which is discussed in Chapter 9.
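How the four flags might be derived from a 32-bit addition can be sketched in C. This is a hypothetical model for illustration, not any particular processor's definition: V is set when both operands have the same sign but the result's sign differs, and C is the carry out of the most-significant bit.

```c
#include <stdint.h>

typedef struct { int n, z, v, c; } Flags;

/* Add two 32-bit values and compute the N, Z, V, and C condition codes
   for the result. */
Flags add_with_flags(uint32_t a, uint32_t b, uint32_t *result)
{
    uint32_t r = a + b;
    Flags f;
    f.n = (r >> 31) & 1u;                      /* sign bit of the result  */
    f.z = (r == 0);                            /* result is zero          */
    f.v = ((~(a ^ b) & (a ^ r)) >> 31) & 1u;   /* signed overflow         */
    f.c = (r < a);                             /* carry out of bit 31     */
    *result = r;
    return f;
}
```

For example, adding 1 to 0x7FFFFFFF (the largest positive 32-bit number) sets N and V but not C, while adding 1 to 0xFFFFFFFF sets Z and C but not V, matching the distinction between signed overflow and unsigned carry.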

Consider the Branch instruction in Figure 2.6. If condition codes are used, then the Subtract instruction would cause both N and Z flags to be cleared to 0 if the contents of register R2 are still greater than 0. The desired branching could be specified simply as

Branch>0 LOOP

without indicating the register involved in the test. This instruction causes a branch if neither N nor Z is 1, that is, if the result produced by the Subtract instruction is neither negative nor equal to zero. Many conditional branch instructions are provided in the instruction set of a computer to enable a variety of conditions to be tested. The conditions are defined as logic expressions involving the condition code flags.

To illustrate the use of condition codes, consider again the program in Figure 2.8, which adds a list of numbers using RISC-style instructions. Using a CISC-style instruction set, this task can be implemented with fewer instructions, as shown in Figure 2.26. The Add instruction uses the pointer register R4 to access successive numbers in the list and add them to the sum in register R3. After accessing the source operand, the processor automatically increments the pointer, because the Autoincrement addressing mode is used to specify the source operand. The Subtract instruction sets the condition codes, which are then used by the Branch instruction.

      Move R2, N        Load the size of the list.
      Clear R3          Initialize sum to 0.
      Move R4, #NUM1    Load address of the first number.
LOOP: Add R3, (R4)+     Add the next number to sum.
      Subtract R2, #1   Decrement the counter.
      Branch>0 LOOP     Loop back if not finished.
      Move SUM, R3      Store the final sum.

Figure 2.26 A CISC version of the program of Figure 2.8.

2.11 RISC and CISC Styles

RISC and CISC are two different styles of instruction sets. We introduced RISC first because it is simpler and easier to understand. Having looked at some basic features of both styles, we should summarize their main characteristics.

RISC style is characterized by:

• Simple addressing modes
• All instructions fitting in a single word
• Fewer instructions in the instruction set, as a consequence of simple addressing modes
• Arithmetic and logic operations that can be performed only on operands in processor registers
• Load/store architecture that does not allow direct transfers from one memory location to another; such transfers must take place via a processor register
• Simple instructions that are conducive to fast execution by the processing unit using techniques such as pipelining, which is presented in Chapter 6
• Programs that tend to be larger in size, because more, but simpler, instructions are needed to perform complex tasks

CISC style is characterized by:

• More complex addressing modes
• More complex instructions, where an instruction may span multiple words
• Many instructions that implement complex tasks
• Arithmetic and logic operations that can be performed on memory operands as well as operands in processor registers
• Transfers from one memory location to another by using a single Move instruction
• Programs that tend to be smaller in size, because fewer, but more complex, instructions are needed to perform complex tasks

Before the 1970s, all computers were of CISC type. An important objective was to simplify the development of software by making the hardware capable of performing fairly complex tasks, that is, to move the complexity from the software level to the hardware level. This is conducive to making programs simpler and shorter, which was important when computer memory was smaller and more expensive to provide. Today, memory is inexpensive and most computers have large amounts of it.

RISC-style designs emerged as an attempt to achieve very high performance by making the hardware very simple, so that instructions can be executed very quickly in pipelined fashion, as will be discussed in Chapter 6. This results in moving complexity from the hardware level to the software level. Sophisticated compilers were developed to optimize the code consisting of simple instructions. The size of the code became less important as memory capacities increased.

While the RISC and CISC styles seem to define two significantly different approaches, today's processors often exhibit what may seem to be a compromise between these approaches. For example, it is attractive to add some non-RISC instructions to a RISC processor in order to reduce the number of instructions executed, as long as the execution of these new instructions is fast. We will deal with the performance issues in detail in Chapter 6, where we discuss the concept of pipelining.

2.12 Example Programs

In this section we present two examples that further illustrate the use of machine instructions. The examples are representative of numeric and nonnumeric applications.

2.12.1 Vector Dot Product Program

The first example is a numerical application that is an extension of previous programs for adding numbers. In calculations that involve vectors and matrices, it is often necessary to compute the dot product of two vectors. Let A and B be two vectors of length n. Their dot product is defined as

Dot Product = ∑_{i=0}^{n−1} A(i) × B(i)

Figures 2.27 and 2.28 show RISC- and CISC-style programs for computing the dot product and storing it in memory location DOTPROD. The first elements of each vector, A(0) and B(0), are stored at memory locations AVEC and BVEC, with the remaining elements in the following word locations.

      Move R2, #AVEC          R2 points to vector A.
      Move R3, #BVEC          R3 points to vector B.
      Load R4, N              R4 serves as a counter.
      Clear R5                R5 accumulates the dot product.
LOOP: Load R6, (R2)           Get next element of vector A.
      Load R7, (R3)           Get next element of vector B.
      Multiply R8, R6, R7     Compute the product of next pair.
      Add R5, R5, R8          Add to previous sum.
      Add R2, R2, #4          Increment pointer to vector A.
      Add R3, R3, #4          Increment pointer to vector B.
      Subtract R4, R4, #1     Decrement the counter.
      Branch_if_[R4]>0 LOOP   Loop again if not done.
      Store R5, DOTPROD       Store dot product in memory.

Figure 2.27 A RISC-style program for computing the dot product of two vectors.

      Move R2, #AVEC          R2 points to vector A.
      Move R3, #BVEC          R3 points to vector B.
      Move R4, N              R4 serves as a counter.
      Clear R5                R5 accumulates the dot product.
LOOP: Move R6, (R2)+          Compute the product of
      Multiply R6, (R3)+      next components.
      Add R5, R6              Add to previous sum.
      Subtract R4, #1         Decrement the counter.
      Branch>0 LOOP           Loop again if not done.
      Move DOTPROD, R5        Store dot product in memory.

Figure 2.28 A CISC-style program for computing the dot product of two vectors.

The task of accumulating a sum of products occurs in many signal-processing applications. In this case, one of the vectors consists of the most recent n signal samples in a continuing time sequence of inputs to a signal-processing unit. The other vector is a set of n weights. The n signal samples are multiplied by the weights, and the sum of these products constitutes an output signal sample.
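The multiply-accumulate loop of Figures 2.27 and 2.28 is a one-liner per iteration in C (a sketch; in the signal-processing setting, a would hold the samples and b the weights):

```c
#include <stdint.h>

/* Dot product of two n-element vectors of 32-bit integers: multiply
   corresponding elements and accumulate the sum. */
int32_t dot_product(const int32_t *a, const int32_t *b, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   /* the MultiplyAccumulate step */
    return sum;
}
```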

Some computer instruction sets combine the operations of the Multiply and Add instructions used in the programs in Figures 2.27 and 2.28 into a single MultiplyAccumulate instruction. This is done in the ARM processor presented in Appendix D.


2.12.2 String Search Program

As an example of a non-numerical application, let us consider the problem of string search. Given two strings of ASCII-encoded characters, a long string T and a short string P, we want to determine if the pattern P is contained in the target T. Since P may be found in T in several places, we will simplify our task by being interested only in the first occurrence of P in T when T is searched from left to right. Let T and P consist of n and m characters, respectively, where n > m. The characters are stored in memory in consecutive byte locations. Assume that the required data are located as follows:

• T is the address of T(0), which is the first character in string T.
• N is the address of a 32-bit word that contains the value n.
• P is the address of P(0), which is the first character in string P.
• M is the address of a 32-bit word that contains the value m.
• RESULT is the address of a word in which the result of the search is to be stored. If the substring P is found in T, then the address of the corresponding location in T will be stored in RESULT; otherwise, the value −1 will be stored.

String search is an important and well-researched problem. Many algorithms have been developed. Since our main purpose is to illustrate the use of assembly-language instructions, we will use the simplest algorithm, which is known as the brute-force algorithm. It is given in Figure 2.29.

In a RISC-style computer, the algorithm can be implemented as shown in Figure 2.30. The comments explain the use of various processor registers. Note that in the case of a failed search, the immediate value −1 will cause the contents of R8 to become equal to 0xFFFFFFFF, which represents −1 in 2’s complement.

Figure 2.31 shows how the algorithm may be implemented in a CISC-style computer. Observe that the first instruction in LOOP2 loads a character from string T into register R8, which is followed by an instruction that compares this character with a character in string P. The reader may wonder why it is not possible to use a single instruction

CompareByte (R6)+, (R7)+

to achieve the same effect. While CISC-style instruction sets allow operations that involve memory operands, they typically require that if one operand is in the memory, the other

for i ← 0 to n − m do
    j ← 0
    while j < m and P[j] = T[i + j] do
        j ← j + 1
    if j = m return i
return −1

Figure 2.29 A brute-force string search algorithm.
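The algorithm of Figure 2.29 translates almost line for line into C (a sketch; it returns the index of the first occurrence rather than a memory address, which is the analogue of storing the address of T(i) in RESULT):

```c
/* Brute-force string search: try each starting position i in T, scan P
   while characters match, and return the first i where the whole pattern
   matches, or -1 if P does not occur in T. */
int brute_force_search(const char *T, int n, const char *P, int m)
{
    for (int i = 0; i <= n - m; i++) {
        int j = 0;
        while (j < m && P[j] == T[i + j])
            j++;
        if (j == m)
            return i;   /* first occurrence found */
    }
    return -1;          /* no match */
}
```

In the worst case the inner loop scans m characters for each of the n − m + 1 starting positions, giving O(n × m) comparisons; the more sophisticated algorithms mentioned in the text reduce this cost.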


       Move R2, #T                   R2 points to string T.
       Move R3, #P                   R3 points to string P.
       Load R4, N                    Get the value n.
       Load R5, M                    Get the value m.
       Subtract R4, R4, R5           Compute n − m.
       Add R4, R2, R4                The address of T(n − m).
       Add R5, R3, R5                The address of P(m).
LOOP1: Move R6, R2                   Use R6 to scan through string T.
       Move R7, R3                   Use R7 to scan through string P.
LOOP2: LoadByte R8, (R6)             Compare a pair of
       LoadByte R9, (R7)             characters in
       Branch_if_[R8]≠[R9] NOMATCH   strings T and P.
       Add R6, R6, #1                Point to next character in T.
       Add R7, R7, #1                Point to next character in P.
       Branch_if_[R5]>[R7] LOOP2     Loop again if not done.
       Store R2, RESULT              Store the address of T(i).
       Branch DONE
NOMATCH: Add R2, R2, #1              Point to next character in T.
       Branch_if_[R4]≥[R2] LOOP1     Loop again if not done.
       Move R8, #−1                  Write −1 to indicate that
       Store R8, RESULT              no match was found.
DONE:  next instruction

Figure 2.30 A RISC-style program for string search.

operand must be in a processor register. A common exception is the Move instruction, which may involve two memory operands. This provides a simple way of moving data between different memory locations.

2.13 Encoding of Machine Instructions

In this chapter, we have introduced a variety of useful instructions and addressing modes. We have used a generic form of assembly language to emphasize basic concepts without relying on processor-specific acronyms or mnemonics. Assembly-language instructions symbolically express the actions that must be performed by the processor circuitry. To be executed in a processor, assembly-language instructions must be converted by the assembler program, as described in Section 2.5, into machine instructions that are encoded in a compact binary pattern.

Let us now examine how machine instructions may be formed. The Add instruction

Add Rdst, Rsrc1, Rsrc2


         Move R2, #T               R2 points to string T.
         Move R3, #P               R3 points to string P.
         Move R4, N                Get the value n.
         Move R5, M                Get the value m.
         Subtract R4, R5           Compute n − m.
         Add R4, R2                The address of T(n − m).
         Add R5, R3                The address of P(m).

LOOP1:   Move R6, R2               Use R6 to scan through string T.
         Move R7, R3               Use R7 to scan through string P.

LOOP2:   MoveByte R8, (R6)+        Compare a pair of
         CompareByte R8, (R7)+     characters in
         Branch≠0 NOMATCH          strings T and P.
         Compare R5, R7            Check if at P(m).
         Branch>0 LOOP2            Loop again if not done.
         Move RESULT, R2           Store the address of T(i).
         Branch DONE

NOMATCH: Add R2, #1                Point to next character in T.
         Compare R4, R2            Check if at T(n − m).
         Branch≥0 LOOP1            Loop again if not done.
         Move RESULT, #−1          No match was found.

DONE:    next instruction

Figure 2.31 A CISC-style program for string search.

is representative of a class of three-operand instructions that use operands in processor registers. Registers Rdst, Rsrc1, and Rsrc2 hold the destination and two source operands. If a processor has 32 registers, then it is necessary to use five bits to specify each of the three registers in such instructions. If each instruction is implemented in a 32-bit word, the remaining 17 bits can be used to specify the OP code that indicates the operation to be performed. A possible format is shown in Figure 2.32a.

Now consider instructions in which one operand is given using the Immediate addressing mode, such as

Add Rdst, Rsrc, #Value

Of the 32 bits available, ten bits are needed to specify the two registers. The remaining 22 bits must give the OP code and the value of the immediate operand. The most useful sizes of immediate operands are 32, 16, and 8 bits. Since 32 bits are not available, a good choice is to allocate 16 bits for the immediate operand. This leaves six bits for specifying the OP code. A possible format is presented in Figure 2.32b. This format can also be used for Load and Store instructions, where the Index addressing mode uses the 16-bit field to specify the offset that is added to the contents of the index register.

The format in Figure 2.32b can also be used to encode the Branch instructions. Consider the program in Figure 2.12. The Branch-greater-than instruction at memory address 128


(a) Register-operand format: three 5-bit register fields, Rdst, Rsrc1, and Rsrc2 (bits 31–17), followed by a 17-bit OP code (bits 16–0)

(b) Immediate-operand format: two 5-bit register fields, Rdst and Rsrc (bits 31–22), a 16-bit immediate operand (bits 21–6), and a 6-bit OP code (bits 5–0)

(c) Call format: a 26-bit immediate value (bits 31–6) and a 6-bit OP code (bits 5–0)

Figure 2.32 Possible instruction formats.
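To illustrate how such fields are packed into a 32-bit word, the following sketch encodes and decodes an instruction in the immediate-operand format of Figure 2.32b. The exact bit assignment shown (Rdst in the high five bits, then Rsrc, a 16-bit immediate, and the OP code in the low six bits) is one plausible layout chosen for illustration; a real processor's manual fixes the layout.

```c
#include <stdint.h>

/* Pack an instruction in the immediate-operand format of Figure 2.32b,
   assuming: Rdst in bits 31-27, Rsrc in bits 26-22, a 16-bit immediate
   in bits 21-6, and a 6-bit OP code in bits 5-0. */
uint32_t encode_imm(uint32_t opcode, uint32_t rdst, uint32_t rsrc,
                    uint16_t imm)
{
    return ((rdst & 0x1Fu) << 27) |    /* 5-bit destination register */
           ((rsrc & 0x1Fu) << 22) |    /* 5-bit source register      */
           ((uint32_t)imm   <<  6) |   /* 16-bit immediate operand   */
           (opcode & 0x3Fu);           /* 6-bit OP code              */
}

/* Recover the immediate field from an encoded word. */
uint32_t imm_field(uint32_t word)
{
    return (word >> 6) & 0xFFFFu;
}
```

The masks (0x1F, 0x3F, 0xFFFF) match the field widths of 5, 6, and 16 bits; shifting by a field's low bit position places each value without disturbing its neighbors.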

could be written in a specific assembly language as

BGT R2, R0, LOOP

if the contents of register R0 are zero. The registers R2 and R0 can be specified in the two register fields in Figure 2.32b. The six-bit OP code has to identify the BGT operation. The 16-bit immediate field can be used to provide the information needed to determine the branch target address, which is the location of the instruction with the label LOOP. The target address generally comprises 32 bits. Since there is no space for 32 bits, the BGT instruction makes use of the immediate field to give an offset from the location of this instruction in the program to the required branch target. At the time the BGT instruction is being executed, the program counter, PC, has been incremented to point to the next instruction, which is the Store instruction at address 132. Therefore, the branch offset is 132 − 112 = 20. Since the processor computes the target address by adding the current contents of the PC and the branch offset, the required offset in this example is negative, namely −20.
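The offset arithmetic described above can be captured in a short helper; this is a sketch that assumes 4-byte instructions and a PC that has already advanced to the next instruction when the branch executes (the function name is invented for illustration):

```c
#include <stdint.h>

/* Branch offset relative to the updated PC: by the time the branch
   executes, PC points at the next instruction (branch address + 4),
   and target = updated PC + offset. */
int32_t branch_offset(uint32_t branch_addr, uint32_t target_addr)
{
    uint32_t updated_pc = branch_addr + 4;
    return (int32_t)(target_addr - updated_pc);
}
```

For the example in the text, a branch at address 128 targeting LOOP at address 112 requires an offset of 112 − 132 = −20.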

Finally, we should consider the Call instruction, which is used to call a subroutine. It only needs to specify the OP code and an immediate value that is used to determine the address of the first instruction in the subroutine. If six bits are used for the OP code, then the remaining 26 bits can be used to denote the immediate value. This gives the format shown in Figure 2.32c.

In this section, we introduced the basic concept of encoding the machine instructions. Different commercial processors have instruction sets that vary in the details of implementation. Appendices B to E present the instruction sets of four processors that we have chosen as examples.


2.14 Concluding Remarks

This chapter introduced the representation and execution of instructions and programs at the assembly and machine level as seen by the programmer. The discussion emphasized the basic principles of addressing techniques and instruction sequencing. The programming examples illustrated the basic types of operations implemented by the instruction set of any modern computer. Commonly used addressing modes were introduced. The subroutine concept and the instructions needed to implement it were discussed. In the discussion in this chapter, we provided the contrast between two different approaches to the design of machine instruction sets: the RISC and CISC approaches.

2.15 Solved Problems

This section presents some examples of the types of problems that a student may be asked to solve, and shows how such problems can be solved.

Example 2.1

Problem: Assume that there is a string of ASCII-encoded characters stored in memory starting at address STRING. The string ends with the Carriage Return (CR) character. Write a RISC-style program to determine the length of the string and store it in location LENGTH.

Solution: Figure 2.33 presents a possible program. The characters in the string are compared to CR (ASCII code 0x0D), and a counter is incremented until the end of the string is reached.

       Move R2, #STRING            R2 points to the start of the string.
       Clear R3                    R3 is a counter that is cleared to 0.
       Move R4, #0x0D              ASCII code for Carriage Return.

LOOP:  LoadByte R5, (R2)           Get the next character.
       Branch_if_[R5]=[R4] DONE    Finished if character is CR.
       Add R2, R2, #1              Increment the string pointer.
       Add R3, R3, #1              Increment the counter.
       Branch LOOP                 Not finished, loop back.

DONE:  Store R3, LENGTH            Store the count in location LENGTH.

Figure 2.33 Program for Example 2.1.
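The counting loop of Figure 2.33 corresponds to a simple scan in C. The sketch below uses the same carriage-return terminator (0x0D) as the example; the function name is an assumption for illustration.

```c
/* Length of a string terminated by Carriage Return (0x0D),
   mirroring the loop in Figure 2.33. */
int cr_strlen(const char *s)
{
    int count = 0;            /* the counter held in R3 */
    while (*s != 0x0D) {      /* compare with CR, held in R4 */
        s++;                  /* advance the string pointer (R2) */
        count++;              /* increment the counter */
    }
    return count;
}
```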


LIST:     EQU 1000                 Starting address of the list.

          ORIGIN 400
          Move R2, #LIST           R2 points to the start of the list.
          Load R3, 4(R2)           R3 is a counter, initialize it with n.
          Add R4, R2, #8           R4 points to the first number.
          Load R5, (R4)            R5 holds the smallest number found so far.

LOOP:     Subtract R3, R3, #1      Decrement the counter.
          Branch_if_[R3]=0 DONE    Finished if R3 is equal to 0.
          Add R4, R4, #4           Increment the list pointer.
          Load R6, (R4)            Get the next number.
          Branch_if_[R5]≤[R6] LOOP Check if smaller number found.
          Move R5, R6              Update the smallest number found.
          Branch LOOP

DONE:     Store R5, (R2)           Store the smallest number into SMALL.

          ORIGIN 1000
SMALL:    RESERVE 4                Space for the smallest number found.
N:        DATAWORD 7               Number of entries in the list.
ENTRIES:  DATAWORD 4,5,3,6,1,8,2   Entries in the list.

          END

Figure 2.34 Program for Example 2.2.

Example 2.2

Problem: We want to find the smallest number in a list of 32-bit positive integers. The word at address 1000 is to hold the value of the smallest number after it has been found. The next word contains the number of entries, n, in the list. The following n words contain the numbers in the list. The program is to start at address 400. Write a RISC-style program to find the smallest number and include the assembler directives needed to organize the program and data as specified. While the program has to be able to handle lists of different lengths, include in your code a small list of sample data comprising seven integers.

Solution: The program in Figure 2.34 accomplishes the required task. Comments in the program explain how this task is performed.
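The scan in Figure 2.34 can also be expressed in C; the sketch below (function name assumed) keeps a running minimum exactly as register R5 does in the assembly version.

```c
/* Return the smallest of the n numbers in list[0..n-1],
   following the scan of Figure 2.34. */
int find_smallest(const int *list, int n)
{
    int smallest = list[0];          /* R5: smallest found so far */
    for (int i = 1; i < n; i++) {
        if (list[i] < smallest)      /* smaller number found */
            smallest = list[i];      /* update the minimum   */
    }
    return smallest;
}
```

With the sample data from the figure, 4, 5, 3, 6, 1, 8, 2, the result is 1.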

Example 2.3

Problem: Write a RISC-style program that converts an n-digit decimal integer into a binary number. The decimal number is given as n ASCII-encoded characters, as would be the case if the number is entered by typing it on a keyboard. Memory location N contains n, the ASCII string starts at DECIMAL, and the converted number is stored at BINARY.

Solution: Consider a four-digit decimal number, D = d3d2d1d0. The value of this number is ((d3 × 10 + d2) × 10 + d1) × 10 + d0. This representation of the number is the basis for the conversion technique used in the program in Figure 2.35. Note that each ASCII-encoded


       Load R2, N                  Initialize counter R2 with n.
       Move R3, #DECIMAL           R3 points to the ASCII digits.
       Clear R4                    R4 will hold the binary number.

LOOP:  LoadByte R5, (R3)           Get the next ASCII digit.
       And R5, R5, #0x0F           Form the BCD digit.
       Add R4, R4, R5              Add to the intermediate result.
       Add R3, R3, #1              Increment the digit pointer.
       Subtract R2, R2, #1         Decrement the counter.
       Branch_if_[R2]=0 DONE
       Multiply R4, R4, #10        Multiply by 10.
       Branch LOOP                 Loop back if not done.

DONE:  Store R4, BINARY            Store result in location BINARY.

Figure 2.35 Program for Example 2.3.

character is converted into a Binary Coded Decimal (BCD) digit before it is used in the computation. It is assumed that the converted value can be represented in no more than 32 bits.
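The conversion technique — mask each ASCII character down to its BCD digit, then accumulate with multiply-by-10 and add — can be sketched in C (the function name is an assumption):

```c
/* Convert an n-digit ASCII decimal string to binary, using the
   technique of Figure 2.35: ASCII '0'-'9' are 0x30-0x39, so ANDing
   with 0x0F strips the 0x3_ prefix and leaves the BCD digit. */
unsigned ascii_to_binary(const char *digits, int n)
{
    unsigned result = 0;                  /* the accumulator, R4 */
    for (int i = 0; i < n; i++) {
        unsigned bcd = digits[i] & 0x0F;  /* form the BCD digit  */
        result = result * 10 + bcd;       /* Horner-style accumulation */
    }
    return result;
}
```

As in the assembly version, the result is assumed to fit in 32 bits.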

Example 2.4

Problem: Consider an array of numbers A(i,j), where i = 0 through n − 1 is the row index, and j = 0 through m − 1 is the column index. The array is stored in the memory of a computer one row after another, with elements of each row occupying m successive word locations. Assume that the memory is byte-addressable and that the word length is 32 bits. Write a RISC-style subroutine for adding column x to column y, element by element, leaving the sum elements in column y. The indices x and y are passed to the subroutine in registers R2 and R3. The parameters n and m are passed to the subroutine in registers R4 and R5, and the address of element A(0,0) is passed in register R6.

Solution: A possible program is given in Figure 2.36. We have assumed that the values x, y, n, and m are stored in memory locations X, Y, N, and M. Also, the elements of the array are stored in successive words that begin at location ARRAY, which is the address of the element A(0,0). Comments in the program indicate the purpose of individual instructions.
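The address arithmetic in Figure 2.36 relies on the row-major layout: element A(i,j) lives at ARRAY + (i × m + j) × 4, so successive elements of one column are m words apart. A C sketch of the same column addition (names assumed):

```c
/* Add column x of an n-by-m row-major array to column y,
   element by element, as in Example 2.4. */
void add_columns(int *a, int n, int m, int x, int y)
{
    int *px = a + x;          /* points to A(0,x) */
    int *py = a + y;          /* points to A(0,y) */
    for (int i = 0; i < n; i++) {
        *py += *px;           /* A(i,y) = A(i,y) + A(i,x) */
        px += m;              /* step one row: m words per row */
        py += m;
    }
}
```

Stepping the pointers by m per iteration corresponds to the assembly version's adding 4m bytes (formed by the LShiftL of R5) to each column pointer.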

Example 2.5

Problem: We want to sort a list of characters stored in memory. The list consists of n bytes, not necessarily distinct, and each byte contains the ASCII code for a character from the set of letters A through Z. In the ASCII code, presented in Chapter 1, the letters A, B, . . . , Z, are represented by 7-bit patterns that have increasing values when interpreted as binary numbers. When an ASCII character is stored in a byte location, it is customary to set the most-significant bit position to 0. Using this code, we can sort a list of characters alphabetically by sorting their codes in increasing numerical order, considering them as positive numbers.


       Load R2, X                  Load the value x.
       Load R3, Y                  Load the value y.
       Load R4, N                  Load the value n.
       Load R5, M                  Load the value m.
       Move R6, #ARRAY             Load the address of A(0,0).
       Call SUB
       next instruction
         ...

SUB:   Subtract SP, SP, #4
       Store R7, (SP)              Save register R7.
       LShiftL R5, R5, #2          Determine the distance in bytes
                                   between successive elements
                                   in a column.
       Subtract R3, R3, R2         Form y − x.
       LShiftL R3, R3, #2          Form 4(y − x).
       LShiftL R2, R2, #2          Form 4x.
       Add R6, R6, R2              R6 points to A(0,x).
       Add R7, R6, R3              R7 points to A(0,y).

LOOP:  Load R2, (R6)               Get the next number in column x.
       Load R3, (R7)               Get the next number in column y.
       Add R2, R2, R3              Add the numbers and
       Store R2, (R7)              store the sum.
       Add R6, R6, R5              Increment pointer to column x.
       Add R7, R7, R5              Increment pointer to column y.
       Subtract R4, R4, #1         Decrement the row counter.
       Branch_if_[R4]>0 LOOP       Loop back if not done.
       Load R7, (SP)               Restore R7.
       Add SP, SP, #4
       Return                      Return to the calling program.

Figure 2.36 Program for Example 2.4.

Let the list be stored in memory locations LIST through LIST + n − 1, and let n be a 32-bit value stored at address N. The sorting is to be done in place, that is, the sorted list is to occupy the same memory locations as the original list.

We can sort the list using a straight-selection sort algorithm. First, the largest number is found and placed at the end of the list in location LIST + n − 1. Then the largest number in the remaining sublist of n − 1 numbers is placed at the end of the sublist in location LIST + n − 2. The procedure is repeated until the list is sorted. A C-language program for this sorting algorithm is shown in Figure 2.37, where the list is treated as a one-dimensional array LIST(0) through LIST(n − 1). For each sublist LIST(j) through LIST(0), the number in LIST(j) is compared with each of the other numbers in the sublist. Whenever a larger number is found in the sublist, it is interchanged with the number in LIST(j).


for (j = n - 1; j > 0; j = j - 1)
{   for (k = j - 1; k >= 0; k = k - 1)
    {   if (LIST[k] > LIST[j])
        {   TEMP = LIST[k];
            LIST[k] = LIST[j];
            LIST[j] = TEMP;
        }
    }
}

Figure 2.37 C-language program for sorting.

        Move R2, #LIST              Load LIST into base register R2.
        Move R3, N                  Initialize outer loop index
        Subtract R3, #1             register R3 to j = n − 1.

OUTER:  Move R4, R3                 Initialize inner loop index
        Subtract R4, #1             register R4 to k = j − 1.
        MoveByte R5, (R2,R3)        Load LIST(j) into R5, which holds
                                    current maximum in sublist.

INNER:  CompareByte (R2,R4), R5     If LIST(k) ≤ [R5],
        Branch≤0 NEXT               do not exchange.
        MoveByte R6, (R2,R4)        Otherwise, exchange LIST(k)
        MoveByte (R2,R4), R5        with LIST(j) and load
        MoveByte (R2,R3), R6        new maximum into R5.
        MoveByte R5, R6             Register R6 serves as TEMP.

NEXT:   Decrement R4                Decrement index registers R4 and
        Branch≥0 INNER              R3, which also serve as
        Decrement R3                loop counters, and branch
        Branch>0 OUTER              back if loops not finished.

Figure 2.38 A byte-sorting program.

Note that the C-language program traverses the list backwards. This order of traversal simplifies loop termination when a machine language program is written, because the loop is exited when an index is decremented to 0.

Write a CISC-style program that implements this sorting task.

Solution: A possible program is given in Figure 2.38.


Problems

2.1 [E] Given a binary pattern in some memory location, is it possible to tell whether this pattern represents a machine instruction or a number?

2.2 [E] Consider a computer that has a byte-addressable memory organized in 32-bit words according to the big-endian scheme. A program reads ASCII characters entered at a keyboard and stores them in successive byte locations, starting at location 1000. Show the contents of the two memory words at locations 1000 and 1004 after the word “Computer” has been entered.

2.3 [E] Repeat Problem 2.2 for the little-endian scheme.

2.4 [E] Registers R4 and R5 contain the decimal numbers 2000 and 3000 before each of the following addressing modes is used to access a memory operand. What is the effective address (EA) in each case?

(a) 12(R4)

(b) (R4,R5)

(c) 28(R4,R5)

(d) (R4)+

(e) −(R4)

2.5 [E] Write a RISC-style program that computes the expression SUM = 580 + 68400 + 80000.

2.6 [E] Write a CISC-style program for the task in Problem 2.5.

2.7 [E] Write a RISC-style program that computes the expression ANSWER = A × B + C × D.

2.8 [E] Write a CISC-style program for the task in Problem 2.7.

2.9 [M] Rewrite the addition loop in Figure 2.8 so that the numbers in the list are accessed in the reverse order; that is, the first number accessed is the last one in the list, and the last number accessed is at memory location NUM1. Try to achieve the most efficient way to determine loop termination. Would your loop execute faster than the loop in Figure 2.8?

2.10 [M] The list of student marks shown in Figure 2.10 is changed to contain j test scores for each student. Assume that there are n students. Write a RISC-style program for computing the sums of the scores on each test and store these sums in the memory word locations at addresses SUM, SUM + 4, SUM + 8, . . . . The number of tests, j, is larger than the number of registers in the processor, so the type of program shown in Figure 2.11 for the 3-test case cannot be used. Use two nested loops. The inner loop should accumulate the sum for a particular test, and the outer loop should run over the number of tests, j. Assume that the memory area used to store the sums has been cleared to zero initially.

2.11 [M] Write a RISC-style program that finds the number of negative integers in a list of n 32-bit integers and stores the count in location NEGNUM. The value n is stored in memory location N, and the first integer in the list is stored in location NUMBERS. Include the necessary assembler directives and a sample list that contains six numbers, some of which are negative.


2.12 [E] Both of the following statement segments cause the value 300 to be stored in location 1000, but at different times.

ORIGIN 1000
DATAWORD 300

and

Move R2, #1000
Move R3, #300
Store R3, (R2)

Explain the difference.

2.13 [E] Write an assembly-language program in the style of Figure 2.13 for the program in Figure 2.11. Assume the data layout of Figure 2.10.

2.14 [E] Write a CISC-style program for the task in Example 2.1. At most one operand of an instruction can be in the memory.

2.15 [E] Write a CISC-style program for the task in Example 2.2. At most one operand of an instruction can be in the memory.

2.16 [M] Write a CISC-style program for the task in Example 2.3. At most one operand of an instruction can be in the memory.

2.17 [M] Write a CISC-style program for the task in Example 2.4. At most one operand of an instruction can be in the memory.

2.18 [M] Write a RISC-style program for the task in Example 2.5.

2.19 [E] Register R5 is used in a program to point to the top of a stack containing 32-bit numbers. Write a sequence of instructions using the Index, Autoincrement, and Autodecrement addressing modes to perform each of the following tasks:

(a) Pop the top two items off the stack, add them, then push the result onto the stack.

(b) Copy the fifth item from the top into register R3.

(c) Remove the top ten items from the stack.

For each case, assume that the stack contains ten or more elements.

2.20 [M] Show the processor stack contents and the contents of the stack pointer, SP, immediately after each of the following instructions in the program in Figure 2.18 is executed. Assume that [SP] = 1000 at Level 1, before execution of the calling program begins.

(a) The second Store instruction in the subroutine

(b) The last Load instruction in the subroutine

(c) The last Store instruction in the calling program

2.21 [M] Consider the following possibilities for saving the return address of a subroutine:

(a) In a processor register

(b) In a memory location associated with the call, so that a different location is used when the subroutine is called from different places

(c) On a stack


Which of these possibilities supports subroutine nesting and which supports subroutine recursion (that is, a subroutine that calls itself)?

2.22 [M] In addition to the processor stack, it may be convenient to use another stack in some programs. The second stack is usually allocated a fixed amount of space in the memory. In this case, it is important to avoid pushing an item onto the stack when the stack has reached its maximum size. Also, it is important to avoid attempting to pop an item off an empty stack, which could result from a programming error. Write two short RISC-style routines, called SAFEPUSH and SAFEPOP, for pushing onto and popping off this stack structure, while guarding against these two possible errors. Assume that the element to be pushed/popped is located in register R2, and that register R5 serves as the stack pointer for this user stack. The stack is full if its topmost element is stored in location TOP, and it is empty if the last element popped was stored in location BOTTOM. The routines should branch to FULLERROR and EMPTYERROR, respectively, if errors occur. All elements are of word size, and the stack grows toward lower-numbered address locations.

2.23 [M] Repeat Problem 2.22 for CISC-style routines that can use Autoincrement and Autodecrement addressing modes.

2.24 [D] Another useful data structure that is similar to the stack is called a queue. Data are stored in and retrieved from a queue on a first-in–first-out (FIFO) basis. Thus, if we assume that the queue grows in the direction of increasing addresses in the memory, which is a common practice, new data are added at the back (high-address end) and retrieved from the front (low-address end) of the queue.

There are two important differences between how a stack and a queue are implemented. One end of the stack is fixed (the bottom), while the other end rises and falls as data are pushed and popped. A single pointer is needed to point to the top of the stack at any given time. On the other hand, both ends of a queue move to higher addresses as data are added at the back and removed from the front. So two pointers are needed to keep track of the two ends of the queue.

A FIFO queue of bytes is to be implemented in the memory, occupying a fixed region of k bytes. The necessary pointers are an IN pointer and an OUT pointer. The IN pointer keeps track of the location where the next byte is to be appended to the back of the queue, and the OUT pointer keeps track of the location containing the next byte to be removed from the front of the queue.

(a) As data items are added to the queue, they are added at successively higher addresses until the end of the memory region is reached. What happens next, when a new item is to be added to the queue?

(b) Choose a suitable definition for the IN and OUT pointers, indicating what they point to in the data structure. Use a simple diagram to illustrate your answer.

(c) Show that if the state of the queue is described only by the two pointers, the situations when the queue is completely full and completely empty are indistinguishable.

(d) What condition would you add to solve the problem in part (c)?

(e) Propose a procedure for manipulating the two pointers IN and OUT to append and remove items from the queue.


2.25 [M] Consider the queue structure described in Problem 2.24. Write APPEND and REMOVE routines that transfer data between a processor register and the queue. Be careful to inspect and update the state of the queue and the pointers each time an operation is attempted and performed.

2.26 [M] The dot-product computation is discussed in Section 2.12.1. This type of computation can be used in the following signal-processing task. An input signal time sequence IN(0), IN(1), IN(2), IN(3), . . . , is processed by a 3-element weight vector (WT(0), WT(1), WT(2)) = (1/8, 1/4, 1/2) to produce an output signal time sequence OUT(0), OUT(1), OUT(2), OUT(3), . . . , as follows:

OUT(0) = WT(0) × IN(0) + WT(1) × IN(1) + WT(2) × IN(2)
OUT(1) = WT(0) × IN(1) + WT(1) × IN(2) + WT(2) × IN(3)
OUT(2) = WT(0) × IN(2) + WT(1) × IN(3) + WT(2) × IN(4)
OUT(3) = WT(0) × IN(3) + WT(1) × IN(4) + WT(2) × IN(5)

...

All signal and weight values are 32-bit signed numbers. The weights, inputs, and outputs are stored in the memory starting at locations WT, IN, and OUT, respectively. Write a RISC-style program to calculate and store the output values for the first n outputs, where n is stored at location N.

Hint: Arithmetic right shifts can be used to do the multiplications.

2.27 [M] Write a subroutine MEMCPY for copying a sequence of bytes from one area in the main memory to another area. The subroutine should accept three input parameters in registers representing the from address, the to address, and the length of the sequence to be copied. The two areas may overlap. In all but one case, the subroutine should copy the bytes in the order of increasing addresses. However, in the case where the to address falls within the sequence of bytes to be copied, i.e., when the to address is between from and from + length − 1, the subroutine must copy the bytes in the order of decreasing addresses by starting at the end of the sequence of bytes to be copied in order to avoid overwriting bytes that have not yet been copied.

2.28 [M] Write a subroutine MEMCMP for performing a byte-by-byte comparison of two sequences of bytes in the main memory. The subroutine should accept three input parameters in registers representing the first address, the second address, and the length of the sequences to be compared. It should use a register to return the count of the number of comparisons that do not match.

2.29 [M] Write a subroutine called EXCLAIM that accepts a single parameter in a register representing the starting address STRNG in the main memory for a string of ASCII characters in successive bytes representing an arbitrary collection of sentences, with the NUL control character (value 0) at the end of the string. The subroutine should scan the string beginning at address STRNG and replace every occurrence of a period (‘.’) with an exclamation mark (‘!’).


2.30 [M] Write a subroutine called ALLCAPS that accepts a parameter in a register representing the starting address STRNG in the main memory for a string of ASCII characters in successive bytes, with the NUL control character (value 0) at the end of the string. The subroutine should scan the string beginning at address STRNG and replace every occurrence of a lower-case letter (‘a’−‘z’) with the corresponding upper-case letter (‘A’−‘Z’).

2.31 [M] Write a subroutine called WORDS that accepts a parameter in a register representing the starting address STRNG in the main memory for a string of ASCII characters in successive bytes, with the NUL control character (value 0) at the end of the string. The string represents English text with the space character between words. The subroutine has to determine the number of words in the string (excluding the punctuation characters). It must return the result to the calling program in a register.

2.32 [D] Write a subroutine called INSERT that places a number in the correct ordered position within a list of positive numbers that are stored in increasing order of value. Three input parameters should be passed to the subroutine in processor registers, representing the starting address of the ordered list of numbers, the length of the list, and the new value to be inserted into the list. The subroutine should locate the appropriate position for the new value in the list, then shift all of the larger numbers up by one position to create space for storing the new value in the list.

2.33 [D] Write a subroutine called INSERTSORT that repeatedly uses the INSERT subroutine in Problem 2.32 to take an unordered list of numbers and create a new list with the same numbers in increasing order. The subroutine should accept three input parameters in registers representing the starting address OLDLIST for the unordered sequence of numbers, the length of the list, and the starting address NEWLIST for the ordered sequence of numbers.


Chapter 3

Basic Input/Output

Chapter Objectives

In this chapter you will learn about:

• Transferring data between a processor and input/output (I/O) devices

• The programmer’s view of I/O transfers

• How program-controlled I/O is performed using polling

• How interrupts are used in I/O transfers


One of the basic features of a computer is its ability to exchange data with other devices. This communication capability enables a human operator, for example, to use a keyboard and a display screen to process text and graphics. We make extensive use of computers to communicate with other computers over the Internet and access information around the globe. In other applications, computers are less visible but equally important. They are an integral part of home appliances, manufacturing equipment, transportation systems, banking, and point-of-sale terminals. In such applications, input to a computer may come from a sensor switch, a digital camera, a microphone, or a fire alarm. Output may be a sound signal sent to a speaker, or a digitally coded command that changes the speed of a motor, opens a valve, or causes a robot to move in a specified manner. In short, computers should have the ability to exchange digital and analog information with a wide range of devices in many different environments.

In this chapter we will consider the input/output (I/O) capability of computers as seen from the programmer’s point of view. We will present only basic I/O operations, which are provided in all computers. This knowledge will enable the reader to perform interesting and useful exercises on equipment found in a typical teaching laboratory environment. More complex I/O schemes, as well as the hardware needed to implement the I/O capability, are discussed in Chapter 7.

3.1 Accessing I/O Devices

The components of a computer system communicate with each other through an interconnection network, as shown in Figure 3.1. The interconnection network consists of circuits needed to transfer information between the processor, the memory unit, and a number of I/O devices.

In Chapter 2, we described the concept of an address space and how the processor may access individual memory locations within such an address space. Load and Store instructions use addressing modes to generate effective addresses that identify the desired locations. This idea of using addresses to access various locations in the memory can be

Processor Memory

I/O device 1 I/O device n

Interconnection network

Figure 3.1 A computer system.


extended to deal with the I/O devices as well. For this purpose, each I/O device must appear to the processor as consisting of some addressable locations, just like the memory. Some addresses in the address space of the processor are assigned to these I/O locations, rather than to the main memory. These locations are usually implemented as bit storage circuits (flip-flops) organized in the form of registers. It is customary to refer to them as I/O registers. Since the I/O devices and the memory share the same address space, this arrangement is called memory-mapped I/O. It is used in most computers.

With memory-mapped I/O, any machine instruction that can access memory can be used to transfer data to or from an I/O device. For example, if DATAIN is the address of a register in an input device, the instruction

Load R2, DATAIN

reads the data from the DATAIN register and loads them into processor register R2. Similarly, the instruction

Store R2, DATAOUT

sends the contents of register R2 to location DATAOUT, which is a register in an output device.
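In a C program for an embedded system, such memory-mapped transfers are conventionally written as accesses through volatile pointers. The following is a minimal sketch, not taken from the book: the register addresses are hypothetical, and the two device registers are simulated by ordinary variables so that the code can run on a hosted system.

```c
#include <stdint.h>

/* On real hardware the registers would sit at fixed (hypothetical)
 * addresses, e.g.:
 *   #define DATAIN  (*(volatile uint8_t *)0x4000)
 * Here the two device registers are simulated with plain variables so
 * the sketch can run on an ordinary hosted system. */
static volatile uint8_t datain_reg  = 'A';  /* input-device data register  */
static volatile uint8_t dataout_reg = 0;    /* output-device data register */

#define DATAIN  (datain_reg)
#define DATAOUT (dataout_reg)

/* Load R2, DATAIN: read the input device's data register. */
uint8_t load_from_device(void) { return DATAIN; }

/* Store R2, DATAOUT: write to the output device's data register. */
void store_to_device(uint8_t c) { DATAOUT = c; }
```

The `volatile` qualifier matters here: it tells the compiler that each access has a side effect on the device and must not be optimized away or reordered.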

3.1.1 I/O Device Interface

An I/O device is connected to the interconnection network by using a circuit, called the device interface, which provides the means for data transfer and for the exchange of status and control information needed to facilitate the data transfers and govern the operation of the device. The interface includes some registers that can be accessed by the processor. One register may serve as a buffer for data transfers, another may hold information about the current status of the device, and yet another may store the information that controls the operational behavior of the device. These data, status, and control registers are accessed by program instructions as if they were memory locations. Typical transfers of information are between I/O registers and the registers in the processor. Figure 3.2 illustrates how the keyboard and display devices are connected to the processor from the software point of view.

3.1.2 Program-Controlled I/O

Let us begin the discussion of input/output issues by looking at two essential I/O devices for human-computer interaction—keyboard and display. Consider a task that reads characters typed on a keyboard, stores these data in the memory, and displays the same characters on a display screen. A simple way of implementing this task is to write a program that performs all functions needed to realize the desired action. This method is known as program-controlled I/O.

In addition to transferring each character from the keyboard into the memory, and then to the display, it is necessary to ensure that this happens at the right time. An input character must be read in response to a key being pressed. For output, a character must be sent to


[Figure: the processor, with its general-purpose registers and control registers, is connected through the interconnection network to a keyboard interface and a display interface. Each interface contains DATA, STATUS, and CONTROL registers.]

Figure 3.2 The connection for processor, keyboard, and display.

the display only when the display device is able to accept it. The rate of data transfer from the keyboard to a computer is limited by the typing speed of the user, which is unlikely to exceed a few characters per second. The rate of output transfers from the computer to the display is much higher. It is determined by the rate at which characters can be transmitted to and displayed on the display device, typically several thousand characters per second. However, this is still much slower than the speed of a processor that can execute billions of instructions per second. The difference in speed between the processor and I/O devices creates the need for mechanisms to synchronize the transfer of data between them.

One solution to this problem involves a signaling protocol. On output, the processor sends the first character and then waits for a signal from the display that the next character can be sent. It then sends the second character, and so on. An input character is obtained from the keyboard in a similar way. The processor waits for a signal indicating that a key has been pressed and that a binary code that represents the corresponding character is available in an I/O register associated with the keyboard. Then the processor proceeds to read that code.

The keyboard includes a circuit that responds to a key being pressed by producing the code for the corresponding character that can be used by the computer. We will assume that ASCII code (presented in Table 1.1) is used, in which each character code occupies one byte. Let KBD_DATA be the address label of an 8-bit register that holds the generated character. Also, let a signal indicating that a key has been pressed be provided by setting to 1 a flip-flop called KIN, which is a part of an eight-bit status register, KBD_STATUS. The processor can read the status flag KIN to determine when a character code has been placed in KBD_DATA. When the processor reads the status flag to determine its state, we say that the processor polls the I/O device.

The display includes an 8-bit register, which we will call DISP_DATA, used to receive characters from the processor. It also must be able to indicate that it is ready to receive the


[Figure: register addresses and flag bits in the two interfaces.

(a) Keyboard interface: KBD_DATA at address 0x4000; KBD_STATUS at 0x4004, containing the KIN and KIRQ flags; KBD_CONT at 0x4008, containing the KIE bit.

(b) Display interface: DISP_DATA at address 0x4010; DISP_STATUS at 0x4014, containing the DOUT and DIRQ flags; DISP_CONT at 0x4018, containing the DIE bit.]

Figure 3.3 Registers in the keyboard and display interfaces.

next character; this can be done by using a status flag called DOUT, which is one bit in a status register, DISP_STATUS.

Figure 3.3 illustrates how these registers may be organized. The interface for each device also includes a control register, which we will discuss in Section 3.2. We have identified only a few bits in the registers, those that are pertinent to the discussion in this chapter. Other bits can be used for other purposes, or perhaps simply ignored.

If the registers in I/O interfaces are to be accessed as if they are memory locations, each register must be assigned a specific address that will be recognized by the interface circuit. In Figure 3.3, we assigned hexadecimal numbers 4000 and 4010 as base addresses for the keyboard and display, respectively. These are the addresses of the data registers. The addresses of the status registers are four bytes higher, and the control registers are eight bytes higher. This makes all addresses word-aligned in a 32-bit word computer, which is usually done in practice. Assigning the addresses to registers in this manner makes the I/O registers accessible in a program executed by the processor. This is the programmer’s view of the device.
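This register layout maps naturally onto a C struct whose fields fall at the stated offsets. A sketch, assuming the base addresses of Figure 3.3 (the struct and macro names are illustrative, not a real device header; a hosted program cannot dereference these addresses, but the offsets can be checked at compile time):

```c
#include <stddef.h>
#include <stdint.h>

/* Register layout of Figure 3.3: the data register at the base address,
 * the status register 4 bytes higher, and the control register 8 bytes
 * higher, all word-aligned in a 32-bit word computer. */
typedef struct {
    volatile uint32_t DATA;     /* base + 0 */
    volatile uint32_t STATUS;   /* base + 4 */
    volatile uint32_t CONTROL;  /* base + 8 */
} io_interface;

#define KBD_BASE  0x4000u   /* keyboard base address (Figure 3.3) */
#define DISP_BASE 0x4010u   /* display base address  (Figure 3.3) */

/* On real hardware the interfaces would be accessed as
 *   #define KBD  ((io_interface *)KBD_BASE)
 * Verify the offsets at compile time: */
_Static_assert(offsetof(io_interface, STATUS)  == 4, "status at base + 4");
_Static_assert(offsetof(io_interface, CONTROL) == 8, "control at base + 8");
```

With this overlay, `KBD->STATUS` would address 0x4004 and `DISP->CONTROL` would address 0x4018, matching the programmer's view described above.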

A program is needed to perform the task of reading the characters produced by the keyboard, storing these characters in the memory, and sending them to the display. To perform I/O transfers, the processor must execute machine instructions that check the state of the status flags and transfer data between the processor and the I/O devices.


Let us consider the details of the input process. When a key is pressed, the keyboard circuit places the ASCII-encoded character into the KBD_DATA register. At the same time, the circuit sets the KIN flag to 1. Meanwhile, the processor is executing the I/O program which continuously checks the state of the KIN flag. When it detects that KIN is set to 1, it transfers the contents of KBD_DATA into a processor register. Once the contents of KBD_DATA are read, KIN must be cleared to 0, which is usually done automatically by the interface circuit. If a second character is entered at the keyboard, KIN is again set to 1 and the process repeats. The desired action can be achieved by performing the operations:

READWAIT     Read the KIN flag
             Branch to READWAIT if KIN = 0
             Transfer data from KBD_DATA to R5

which reads the character into processor register R5.

An analogous process takes place when characters are transferred from the processor

to the display. When DOUT is equal to 1, the display is ready to receive a character. Under program control, the processor monitors DOUT, and when DOUT is equal to 1, the processor transfers an ASCII-encoded character to DISP_DATA. The transfer of a character to DISP_DATA clears DOUT to 0. When the display device is ready to receive a second character, DOUT is again set to 1. This can be achieved by performing the operations:

WRITEWAIT    Read the DOUT flag
             Branch to WRITEWAIT if DOUT = 0
             Transfer data from R5 to DISP_DATA

The wait loop is executed repeatedly until the status flag DOUT is set to 1 by the display when it is free to receive a character. Then, the character from R5 is transferred to DISP_DATA to be displayed, which also clears DOUT to 0.

We assume that the initial state of KIN is 0 and the initial state of DOUT is 1. This initialization is normally performed by the device control circuits when power is turned on.
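The two wait loops can be sketched in C. This is a simulation, not driver code: the status registers are plain variables preloaded so that each loop exits immediately, KIN is bit b1 (value 2) and DOUT is bit b2 (value 4) as in Figure 3.3, and the clearing of the flags, done by the interface circuit on real hardware, is done explicitly here.

```c
#include <stdint.h>

#define KIN  0x02u  /* bit b1 of KBD_STATUS  */
#define DOUT 0x04u  /* bit b2 of DISP_STATUS */

/* Simulated device registers, preloaded so the wait loops terminate:
 * a key has been pressed (KIN = 1) and the display is ready (DOUT = 1). */
static volatile uint8_t kbd_status  = KIN;
static volatile uint8_t kbd_data    = 'h';
static volatile uint8_t disp_status = DOUT;
static volatile uint8_t disp_data   = 0;

/* READWAIT: poll the KIN flag, then read the character.  On real
 * hardware, reading KBD_DATA clears KIN automatically. */
uint8_t read_char(void) {
    while ((kbd_status & KIN) == 0)
        ;                             /* busy-wait for a key press */
    kbd_status &= (uint8_t)~KIN;      /* interface clears KIN on read */
    return kbd_data;
}

/* WRITEWAIT: poll the DOUT flag, then write the character.  Writing
 * DISP_DATA clears DOUT until the display is ready again. */
void write_char(uint8_t c) {
    while ((disp_status & DOUT) == 0)
        ;                             /* busy-wait for the display */
    disp_data = c;
    disp_status &= (uint8_t)~DOUT;    /* transfer clears DOUT */
}
```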

In computers that use memory-mapped I/O, in which some addresses are used to refer to registers in I/O interfaces, data can be transferred between these registers and the processor using instructions such as Load, Store, and Move. For example, the contents of the keyboard character buffer KBD_DATA can be transferred to register R5 in the processor by the instruction

LoadByte R5, KBD_DATA

Similarly, the contents of register R5 can be transferred to DISP_DATA by the instruction

StoreByte R5, DISP_DATA

The LoadByte and StoreByte operation codes signify that the operand size is a byte, to distinguish them from the Load and Store operation codes that we have used for word operands.


The Read operation described above may be implemented by the RISC-style instructions:

READWAIT:    LoadByte          R4, KBD_STATUS
             And               R4, R4, #2
             Branch_if_[R4]=0  READWAIT
             LoadByte          R5, KBD_DATA

The And instruction is used to test the KIN flag, which is bit b1 of the status information in R4 that was read from the KBD_STATUS register. As long as b1 = 0, the result of the AND operation leaves the value in R4 equal to zero, and the READWAIT loop continues to be executed.

Similarly, the Write operation may be implemented as:

WRITEWAIT:   LoadByte          R4, DISP_STATUS
             And               R4, R4, #4
             Branch_if_[R4]=0  WRITEWAIT
             StoreByte         R5, DISP_DATA

Observe that the And instruction in this case uses the immediate value 4 to test the display’s status bit, b2.

3.1.3 An Example of a RISC-Style I/O Program

We can now put together a complete program for a typical I/O task, as shown in Figure 3.4. The program uses the program-controlled I/O approach described above to read, store, and display a line of characters typed at the keyboard. As the characters are read in, one by one, they are stored in the memory and then echoed back to the display. The program finishes when the carriage return character, CR, is encountered. The address of the first byte location of the memory where the line is to be stored is LOC. Register R2 is used to point to this part of the memory, and it is initially loaded with the address LOC by the first instruction in the program. R2 is incremented for each character read and displayed.
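The structure of this program can be sketched in C. The sketch is not the book's program: the keyboard is simulated by a character string (so the wait loops of Figure 3.4 are omitted), and the display is simulated by a second buffer.

```c
#include <stddef.h>

#define CR '\r'  /* the carriage return character ends the line */

/* Read, store, and echo characters until CR, following the structure of
 * the RISC-style program: each character is "read" from a simulated
 * keyboard stream, stored at the next memory location, and echoed to a
 * simulated display buffer.  Returns the number of characters stored,
 * including the CR. */
size_t read_line(const char *kbd_stream, char *loc, char *display,
                 size_t max) {
    size_t n = 0;
    while (n < max) {
        char c = kbd_stream[n];   /* LoadByte  R5, KBD_DATA          */
        loc[n] = c;               /* StoreByte R5, (R2); Add R2, #1  */
        display[n] = c;           /* StoreByte R5, DISP_DATA (echo)  */
        n++;
        if (c == CR)              /* Branch_if_[R5]≠[R3] READ        */
            break;
    }
    return n;
}
```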

3.1.4 An Example of a CISC-Style I/O Program

Let us now perform the same task using CISC-style instructions. In CISC instruction sets it is possible to perform some arithmetic and logic operations directly on operands in the memory. So, it is possible to have the instruction

TestBit destination, #k

which tests bit bk of the destination operand and sets the condition flag Z (Zero) to 1 if bk = 0 and to 0 otherwise. Since the operand can be in a memory location, we can use the instruction

TestBit KBD_STATUS, #1


             Move       R2, #LOC            Initialize pointer register R2 to point to the
                                            address of the first location in main memory
                                            where the characters are to be stored.
             MoveByte   R3, #CR             Load ASCII code for Carriage Return into R3.
READ:        LoadByte   R4, KBD_STATUS      Wait for a character to be entered.
             And        R4, R4, #2          Check the KIN flag.
             Branch_if_[R4]=0  READ
             LoadByte   R5, KBD_DATA        Read the character from KBD_DATA
                                            (this clears KIN to 0).
             StoreByte  R5, (R2)            Write the character into the main memory and
             Add        R2, R2, #1          increment the pointer to main memory.
ECHO:        LoadByte   R4, DISP_STATUS     Wait for the display to become ready.
             And        R4, R4, #4          Check the DOUT flag.
             Branch_if_[R4]=0  ECHO
             StoreByte  R5, DISP_DATA       Move the character just read to the display
                                            buffer register (this clears DOUT to 0).
             Branch_if_[R5]≠[R3]  READ      Check if the character just read is the
                                            Carriage Return. If it is not, then
                                            branch back and read another character.

Figure 3.4 A RISC-style program that reads a line of characters and displays it.

to test the state of the KIN flag in the keyboard interface. A Branch instruction that checks the state of the Z flag can then be used to cause a branch to the beginning of the wait loop.

Figure 3.5 gives a CISC-style program that reads and displays a line of characters. Observe that the first MoveByte instruction transfers each character directly from KBD_DATA to the memory location pointed to by R2. A Compare instruction

Compare destination, source

performs the comparison by subtracting the contents of the source from the contents of the destination, and then sets the condition flags based on the result. It does not change the contents of either the source or the destination. Note that the CompareByte instruction in Figure 3.5 uses the autoincrement addressing mode, which automatically increments the value of the pointer R2 after the comparison has been made. In the RISC-style program in Figure 3.4 the pointer has to be incremented using a separate Add instruction.

We have discussed the memory-mapped I/O scheme, which is used in most computers. There is an alternative that can be found in some processors where there exist special In and Out instructions to perform I/O transfers. In this case, there exists a separate I/O address space used only by these instructions. When building a computer system that uses these processors, the designer has the option of connecting I/O devices to use the special I/O address space or simply incorporating them as part of the memory address space.


             Move        R2, #LOC           Initialize pointer register R2 to point to the
                                            address of the first location in main memory
                                            where the characters are to be stored.
READ:        TestBit     KBD_STATUS, #1     Wait for a character to be entered
             Branch=0    READ               in the keyboard buffer KBD_DATA.
             MoveByte    (R2), KBD_DATA     Transfer the character from KBD_DATA into
                                            the main memory (this clears KIN to 0).
ECHO:        TestBit     DISP_STATUS, #2    Wait for the display to become ready.
             Branch=0    ECHO
             MoveByte    DISP_DATA, (R2)    Move the character just read to the display
                                            buffer register (this clears DOUT to 0).
             CompareByte (R2)+, #CR         Check if the character just read is CR
                                            (carriage return). If it is not CR, then
             Branch≠0    READ               branch back and read another character.
                                            Also, increment the pointer to store the
                                            next character.

Figure 3.5 A CISC-style program that reads a line of characters and displays it.

Program-controlled I/O requires continuous involvement of the processor in the I/O activities. Almost all of the execution time for the programs in Figures 3.4 and 3.5 is spent in the two wait loops, while the processor waits for a key to be pressed or for the display to become available. Wasting the processor execution time in this manner can be avoided by using the concept of interrupts.

3.2 Interrupts

In the examples in Figures 3.4 and 3.5, the program enters a wait loop in which it repeatedly tests the device status. During this period, the processor is not performing any useful computation. There are many situations where other tasks can be performed while waiting for an I/O device to become ready. To allow this to happen, we can arrange for the I/O device to alert the processor when it becomes ready. It can do so by sending a hardware signal called an interrupt request to the processor. Since the processor is no longer required to continuously poll the status of I/O devices, it can use the waiting period to perform other useful tasks. Indeed, by using interrupts, such waiting periods can ideally be eliminated.

Example 3.1

Consider a task that requires continuous extensive computations to be performed and the results to be displayed on a display device. The displayed results must be updated every ten seconds. The ten-second intervals can be determined by a simple timer circuit, which


generates an appropriate signal. The processor treats the timer circuit as an input device that produces a signal that can be interrogated. If this is done by means of polling, the processor will waste considerable time checking the state of the signal. A better solution is to have the timer circuit raise an interrupt request once every ten seconds. In response, the processor displays the latest results.

The task can be implemented with a program that consists of two routines, COMPUTE and DISPLAY. The processor continuously executes the COMPUTE routine. When it receives an interrupt request from the timer, it suspends the execution of the COMPUTE routine and executes the DISPLAY routine which sends the latest results to the display device. Upon completion of the DISPLAY routine, the processor resumes the execution of the COMPUTE routine. Since the time needed to send the results to the display device is very small compared to the ten-second interval, the processor in effect spends almost all of its time executing the COMPUTE routine.

This example illustrates the concept of interrupts. The routine executed in response to an interrupt request is called the interrupt-service routine, which is the DISPLAY routine in our example. Interrupts bear considerable resemblance to subroutine calls. Assume that an interrupt request arrives during execution of instruction i in Figure 3.6. The processor first completes execution of instruction i. Then, it loads the program counter with the address of the first instruction of the interrupt-service routine. For the time being, let us assume that this address is hardwired in the processor. After execution of the interrupt-service routine, the processor returns to instruction i + 1. Therefore, when an interrupt occurs, the current contents of the PC, which point to instruction i + 1, must be put in temporary storage in a known location. A Return-from-interrupt instruction at the end of the interrupt-service routine reloads the PC from that temporary storage location, causing execution to resume at

[Figure: Program 1 contains the COMPUTE routine; Program 2 contains the DISPLAY routine, with instructions 1 through M. An interrupt occurs after instruction i of the COMPUTE routine; control passes to the DISPLAY routine, and after its completion execution resumes at instruction i + 1.]

Figure 3.6 Transfer of control through the use of interrupts.


instruction i + 1. The return address must be saved either in a designated general-purpose register or on the processor stack.

We should note that as part of handling interrupts, the processor must inform the device that its request has been recognized so that it may remove its interrupt-request signal. This can be accomplished by means of a special control signal, called interrupt acknowledge, which is sent to the device through the interconnection network. An alternative is to have the transfer of data between the processor and the I/O device interface accomplish the same purpose. The execution of an instruction in the interrupt-service routine that accesses the status or data register in the device interface implicitly informs the device that its interrupt request has been recognized.

So far, treatment of an interrupt-service routine is very similar to that of a subroutine. An important departure from this similarity should be noted. A subroutine performs a function required by the program from which it is called. As such, potential changes to status information and contents of registers are anticipated. However, an interrupt-service routine may not have any relation to the portion of the program being executed at the time the interrupt request is received. Therefore, before starting execution of the interrupt-service routine, status information and contents of processor registers that may be altered in unanticipated ways during the execution of that routine must be saved. This saved information must be restored before execution of the interrupted program is resumed. In this way, the original program can continue execution without being affected in any way by the interruption, except for the time delay.

The task of saving and restoring information can be done automatically by the processor or by program instructions. Most modern processors save only the minimum amount of information needed to maintain the integrity of program execution. This is because the process of saving and restoring registers involves memory transfers that increase the total execution time, and hence represent execution overhead. Saving registers also increases the delay between the time an interrupt request is received and the start of execution of the interrupt-service routine. This delay is called interrupt latency. In some applications, a long interrupt latency is unacceptable. For these reasons, the amount of information saved automatically by the processor when an interrupt request is accepted should be kept to a minimum. Typically, the processor saves only the contents of the program counter and the processor status register. Any additional information that needs to be saved must be saved by explicit instructions at the beginning of the interrupt-service routine and restored at the end of the routine. In some earlier processors, particularly those with a small number of registers, all registers are saved automatically by the processor hardware at the time an interrupt request is accepted. The data saved are restored to their respective registers as part of the execution of the Return-from-interrupt instruction.

Some computers provide two types of interrupts. One saves all register contents, and the other does not. A particular I/O device may use either type, depending upon its response-time requirements. Another interesting approach is to provide duplicate sets of processor registers. In this case, a different set of registers can be used by the interrupt-service routine, thus eliminating the need to save and restore registers. The duplicate registers are sometimes called the shadow registers.

An interrupt is more than a simple mechanism for coordinating I/O transfers. In a general sense, interrupts enable transfer of control from one program to another to be


initiated by an event external to the computer. Execution of the interrupted program resumes after the execution of the interrupt-service routine has been completed. The concept of interrupts is used in operating systems and in many control applications where processing of certain routines must be accurately timed relative to external events. The latter type of application is referred to as real-time processing.

3.2.1 Enabling and Disabling Interrupts

The facilities provided in a computer must give the programmer complete control over the events that take place during program execution. The arrival of an interrupt request from an external device causes the processor to suspend the execution of one program and start the execution of another. Because interrupts can arrive at any time, they may alter the sequence of events from that envisaged by the programmer. Hence, the interruption of program execution must be carefully controlled. A fundamental facility found in all computers is the ability to enable and disable such interruptions as desired.

There are many situations in which the processor should ignore interrupt requests. For instance, the timer circuit in Example 3.1 should raise interrupt requests only when the COMPUTE routine is being executed. It should be prevented from doing so when some other task is being performed. In another case, it may be necessary to guarantee that a particular sequence of instructions is executed to the end without interruption because the interrupt-service routine may change some of the data used by the instructions in question. For these reasons, some means for enabling and disabling interrupts must be available to the programmer.

It is convenient to be able to enable and disable interrupts at both the processor and I/O device ends. The processor can either accept or ignore interrupt requests. An I/O device can either be allowed to raise interrupt requests or prevented from doing so. A commonly used mechanism to achieve this is to use some control bits in registers that can be accessed by program instructions.

The processor has a status register (PS), which contains information about its current state of operation. Let one bit, IE, of this register be assigned for enabling/disabling interrupts. Then, the programmer can set or clear IE to cause the desired action. When IE = 1, interrupt requests from I/O devices are accepted and serviced by the processor. When IE = 0, the processor simply ignores all interrupt requests from I/O devices.

The interface of an I/O device includes a control register that contains the information that governs the mode of operation of the device. One bit in this register may be dedicated to interrupt control. The I/O device is allowed to raise interrupt requests only when this bit is set to 1. We will discuss this arrangement in Section 3.2.3.

Let us now consider the specific case of a single interrupt request from one device. When a device activates the interrupt-request signal, it keeps this signal activated until it learns that the processor has accepted its request. This means that the interrupt-request signal will be active during execution of the interrupt-service routine, perhaps until an instruction is reached that accesses the device in question. It is essential to ensure that this active request signal does not lead to successive interruptions, causing the system to enter an infinite loop from which it cannot recover.


A good choice is to have the processor automatically disable interrupts before starting the execution of the interrupt-service routine. The processor saves the contents of the program counter and the processor status register. After saving the contents of the PS register, with the IE bit equal to 1, the processor clears the IE bit in the PS register, thus disabling further interrupts. Then, it begins execution of the interrupt-service routine. When a Return-from-interrupt instruction is executed, the saved contents of the PS register are restored, setting the IE bit back to 1. Hence, interrupts are again enabled.

Before proceeding to study more complex aspects of interrupts, let us summarize the sequence of events involved in handling an interrupt request from a single device. Assuming that interrupts are enabled in both the processor and the device, the following is a typical scenario:

1. The device raises an interrupt request.

2. The processor interrupts the program currently being executed and saves the contents of the PC and PS registers.

3. Interrupts are disabled by clearing the IE bit in the PS to 0.

4. The action requested by the interrupt is performed by the interrupt-service routine, during which time the device is informed that its request has been recognized, and in response, it deactivates the interrupt-request signal.

5. Upon completion of the interrupt-service routine, the saved contents of the PC and PS registers are restored (enabling interrupts by setting the IE bit to 1), and execution of the interrupted program is resumed.
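The five steps above can be traced with a small C simulation. This is a model of the scenario, not processor code: the PC, PS, device request line, and the saved state are plain variables, and IE is taken to be the least significant bit of PS (the bit position is an assumption for the sketch).

```c
#include <stdbool.h>
#include <stdint.h>

#define IE 0x1u  /* assumed position of the interrupt-enable bit in PS */

/* Simulated processor state and a single requesting device. */
static uint32_t pc = 100, ps = IE;    /* interrupts enabled            */
static uint32_t saved_pc, saved_ps;   /* temporary storage for PC, PS  */
static bool irq = true;               /* step 1: device raises request */
static int  isr_count = 0;

static void interrupt_service_routine(void) {
    isr_count++;
    irq = false;  /* step 4: servicing the device removes the request */
}

/* Steps 2-5 of the single-device scenario. */
void handle_interrupt(void) {
    if (!irq || !(ps & IE))
        return;                       /* no enabled, pending request   */
    saved_pc = pc;                    /* step 2: save PC and PS        */
    saved_ps = ps;
    ps &= ~IE;                        /* step 3: disable interrupts    */
    interrupt_service_routine();      /* step 4: service the request   */
    pc = saved_pc;                    /* step 5: return-from-interrupt */
    ps = saved_ps;                    /*         restores PC and PS    */
}
```

Note that restoring the saved PS automatically sets IE back to 1, exactly as described in step 5.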

3.2.2 Handling Multiple Devices

Let us now consider the situation where a number of devices capable of initiating interrupts are connected to the processor. Because these devices are operationally independent, there is no definite order in which they will generate interrupts. For example, device X may request an interrupt while an interrupt caused by device Y is being serviced, or several devices may request interrupts at exactly the same time. This gives rise to a number of questions:

1. How can the processor determine which device is requesting an interrupt?

2. Given that different devices are likely to require different interrupt-service routines, how can the processor obtain the starting address of the appropriate routine in each case?

3. Should a device be allowed to interrupt the processor while another interrupt is being serviced?

4. How should two or more simultaneous interrupt requests be handled?

The means by which these issues are handled vary from one computer to another, and the approach taken is an important consideration in determining the computer’s suitability for a given application.

When an interrupt request is received it is necessary to identify the particular device that raised the request. Furthermore, if two devices raise interrupt requests at the same time,


it must be possible to break the tie and select one of the two requests for service. When the interrupt-service routine for the selected device has been completed, the second request can be serviced.

The information needed to determine whether a device is requesting an interrupt is available in its status register. When the device raises an interrupt request, it sets to 1 a bit in its status register, which we will call the IRQ bit. The simplest way to identify the interrupting device is to have the interrupt-service routine poll all I/O devices in the system. The first device encountered with its IRQ bit set to 1 is the device that should be serviced. An appropriate subroutine is then called to provide the requested service.
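This polling scan can be sketched as a simple loop over the device status registers. A simulation under the assumption that the IRQ bit is bit 0 of each status register (the bit position and device count are illustrative):

```c
#include <stdint.h>

#define IRQ_BIT     0x01u  /* assumed position of the IRQ status bit */
#define NUM_DEVICES 4

/* Simulated status registers; devices 2 and 3 are requesting service. */
static volatile uint8_t dev_status[NUM_DEVICES] = { 0, 0, IRQ_BIT, IRQ_BIT };

/* Interrogate each device's IRQ bit in turn and return the index of
 * the first device found requesting service, or -1 if there is none. */
int poll_devices(void) {
    for (int i = 0; i < NUM_DEVICES; i++)
        if (dev_status[i] & IRQ_BIT)
            return i;
    return -1;
}
```

The loop order fixes an implicit priority: a device with a lower index is always found, and serviced, first.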

The polling scheme is easy to implement. Its main disadvantage is the time spent interrogating the IRQ bits of devices that may not be requesting any service. An alternative approach is to use vectored interrupts, which we describe next.

Vectored Interrupts

To reduce the time involved in the polling process, a device requesting an interrupt

may identify itself directly to the processor. Then, the processor can immediately start executing the corresponding interrupt-service routine. The term vectored interrupts refers to interrupt-handling schemes based on this approach.

A device requesting an interrupt can identify itself if it has its own interrupt-request signal, or if it can send a special code to the processor through the interconnection network. The processor’s circuits determine the memory address of the required interrupt-service routine. A commonly used scheme is to allocate permanently an area in the memory to hold the addresses of interrupt-service routines. These addresses are usually referred to as interrupt vectors, and they are said to constitute the interrupt-vector table. For example, 128 bytes may be allocated to hold a table of 32 interrupt vectors. Typically, the interrupt-vector table is in the lowest-address range. The interrupt-service routines may be located anywhere in the memory. When an interrupt request arrives, the information provided by the requesting device is used as a pointer into the interrupt-vector table, and the address in the corresponding interrupt vector is automatically loaded into the program counter.
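An interrupt-vector table is essentially an array of routine addresses indexed by the code the device supplies. A C sketch (the table size of 32 matches the example above; the two hypothetical service routines merely record that they ran):

```c
#define NUM_VECTORS 32  /* 32 vectors of 4 bytes each: 128 bytes */

static int last_served = -1;
static void kbd_isr(void)  { last_served = 0; }  /* hypothetical ISRs */
static void disp_isr(void) { last_served = 1; }

/* The interrupt-vector table: on real hardware this would occupy a
 * fixed low-address area of memory; here it is an array of function
 * pointers, with unused entries left null. */
typedef void (*isr_t)(void);
static isr_t vector_table[NUM_VECTORS] = { kbd_isr, disp_isr };

/* The code supplied by the requesting device indexes the table, and
 * the address found there is "loaded into the program counter", i.e.
 * the routine is called. */
void dispatch(unsigned vec) {
    if (vec < NUM_VECTORS && vector_table[vec] != 0)
        vector_table[vec]();
}
```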

Interrupt Nesting

We suggested in Section 3.2.1 that interrupts should be disabled during the execution

of an interrupt-service routine, to ensure that a request from one device will not cause more than one interruption. The same arrangement is often used when several devices are involved, in which case execution of a given interrupt-service routine, once started, always continues to completion before the processor accepts an interrupt request from a second device. Interrupt-service routines are typically short, and the delay they may cause is acceptable for most simple devices.

For some devices, however, a long delay in responding to an interrupt request may lead to erroneous operation. Consider, for example, a computer that keeps track of the time of day using a real-time clock. This is a device that sends interrupt requests to the processor at regular intervals. For each of these requests, the processor executes a short interrupt-service routine to increment a set of counters in the memory that keep track of time in seconds, minutes, and so on. Proper operation requires that the delay in responding to an interrupt request from the real-time clock be small in comparison with the interval between two successive requests. To ensure that this requirement is satisfied in the presence of other interrupting devices, it may be necessary to accept an interrupt request from the clock during the execution of an interrupt-service routine for another device, i.e., to nest interrupts.

This example suggests that I/O devices should be organized in a priority structure. An interrupt request from a high-priority device should be accepted while the processor is servicing a request from a lower-priority device.

A multiple-level priority organization means that during execution of an interrupt-service routine, interrupt requests will be accepted from some devices but not from others, depending upon the device's priority. To implement this scheme, we can assign a priority level to the processor that can be changed under program control. The priority level of the processor is the priority of the program that is currently being executed. The processor accepts interrupts only from devices that have priorities higher than its own. At the time that execution of an interrupt-service routine for some device is started, the priority of the processor is raised to that of the device either automatically or with special instructions. This action disables interrupts from devices that have the same or lower level of priority. However, interrupt requests from higher-priority devices will continue to be accepted. The processor's priority can be encoded in a few bits of the processor status register. While this scheme is used in some processors, we will use a simpler scheme in later examples.
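
The acceptance rule just described can be modelled in a few lines of Python. This is only a sketch; the numeric priority levels are illustrative.

```python
def accepts_request(processor_priority, device_priority):
    """The processor accepts an interrupt request only from a device
    whose priority is strictly higher than its own current priority."""
    return device_priority > processor_priority

def enter_isr(device_priority):
    """On accepting a request, the processor's priority is raised to that
    of the device. This masks requests of the same or lower priority,
    while higher-priority requests continue to be accepted."""
    return device_priority
```

Once the processor's priority has been raised by enter_isr, a second request from the same device (same priority) is no longer accepted, but a higher-priority device can still interrupt.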

Finally, we should point out that if nested interrupts are allowed, then each interrupt-service routine must save on the stack the saved contents of the program counter and the status register. This has to be done before the interrupt-service routine enables nesting by setting the IE bit in the status register to 1.

Simultaneous Requests

We also need to consider the problem of simultaneous arrivals of interrupt requests from two or more devices. The processor must have some means of deciding which request to service first. Polling the status registers of the I/O devices is the simplest such mechanism. In this case, priority is determined by the order in which the devices are polled. When vectored interrupts are used, we must ensure that only one device is selected to send its interrupt vector code. This is done in hardware, by using arbitration circuits which we will discuss in Chapter 7.

3.2.3 Controlling I/O Device Behavior

It is important to ensure that interrupt requests are generated only by those I/O devices that the processor is currently willing to recognize. Hence, we need a mechanism in the interface circuits of individual devices to control whether a device is allowed to interrupt the processor. The control needed is usually provided in the form of an interrupt-enable bit in the device's interface circuit.

I/O devices vary in complexity from simple to quite complex. Simple devices, such as a keyboard, require little in the way of control. Complex devices may have a number of possible modes of operation, which must be controlled. A commonly used approach is to provide a control register in the device interface, which holds the information needed to control the behavior of the device. This register is accessed as an addressable location, just like the data and status registers that we discussed before. One bit in the register serves as the interrupt-enable bit, IE. When it is set to 1 by an instruction that writes new information into the control register, the device is placed into a mode in which it is allowed to interrupt the processor whenever it is ready for an I/O transfer.

Figure 3.3 shows the registers that may be used in the interfaces of keyboard and display devices. Since these devices transfer character-based data, handling one character at a time, it is appropriate to use an eight-bit data register. We have assumed that the status and control registers are also eight bits long. Only one or two bits in these registers are needed in handling the I/O transfers. The remaining bits can be used to specify other aspects of the operation of the device, or ignored if they are not needed. The keyboard status register includes bits KIN and KIRQ. We have already discussed the use of the KIN bit in Section 3.1.2. The KIRQ bit is set to 1 if an interrupt request has been raised, but not yet serviced. The keyboard may raise interrupt requests only when the interrupt-enable bit, KIE, in its control register is set to 1. Thus, when both KIE and KIN bits are equal to 1, an interrupt request is raised and the KIRQ bit is set to 1. Similarly, the DIRQ bit in the status register of the display interface indicates whether an interrupt request has been raised. Bit DIE in the control register of this interface is used to enable interrupts. Observe that we have placed KIN and KIE in bit position 1, and DOUT and DIE in position 2. This is an arbitrary choice that makes the program examples that follow easier to understand.

3.2.4 Processor Control Registers

We have already discussed the need for a status register in the processor. To deal with interrupts it is useful to have some other control registers. Figure 3.7 depicts one possibility, where there are four processor control registers. The status register, PS, includes the interrupt-enable bit, IE, in addition to other status information. Recall that the processor will accept interrupts only when this bit is set to 1. The IPS register is used to automatically

[Figure: four 32-bit processor control registers. PS and IPS each contain the interrupt-enable bit, IE, in bit position 0. IENABLE and IPENDING each contain one bit per device: KBD in position 1, DISP in position 2, and TIM in position 3.]

Figure 3.7   Control registers in the processor.


save the contents of PS when an interrupt request is received and accepted. At the end of the interrupt-service routine, the previous state of the processor is automatically restored by transferring the contents of IPS into PS. Since there is only one register available for storing the previous status information, it becomes necessary to save the contents of IPS on the stack if nested interrupts are allowed.
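
The automatic save and restore of PS through IPS can be sketched as follows. This is a Python model; the register dictionary is illustrative.

```python
IE = 1  # interrupt-enable bit, bit 0 of PS

def interrupt_entry(regs):
    """On accepting an interrupt: copy PS into IPS, then clear the IE
    bit in PS so that further interrupts are disabled."""
    regs['IPS'] = regs['PS']
    regs['PS'] &= ~IE

def interrupt_return(regs):
    """Return-from-interrupt: restore the previous processor state by
    copying IPS back into PS."""
    regs['PS'] = regs['IPS']
```

Because there is a single IPS register, a nested interrupt would overwrite it; hence the need to save IPS on the stack before enabling nesting.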

The IENABLE register allows the processor to selectively respond to individual I/O devices. A bit may be assigned for each device, as shown in the figure for the keyboard, display, and a timer circuit that we will use in a later example. When a bit is set to 1, the processor will accept interrupt requests from the corresponding device. The IPENDING register indicates the active interrupt requests. This is convenient when multiple devices may raise requests at the same time. Then, a program can decide which interrupt should be serviced first.

In a 32-bit processor, the control registers are 32 bits long. Using the structure in Figure 3.7, it is possible to accommodate 32 I/O devices in a straightforward manner.
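
With one bit per device, checking which requests are both pending and enabled amounts to a bitwise AND. A Python sketch, with the bit positions assumed from Figure 3.7 and illustrative helper names:

```python
KBD, DISP, TIM = 1 << 1, 1 << 2, 1 << 3   # device bit positions in IENABLE/IPENDING

def raise_request(ipending, device_bit):
    """Hardware sets the device's bit in IPENDING when it raises a request."""
    return ipending | device_bit

def pending_and_enabled(ipending, ienable):
    """Only devices whose bits are set in both IPENDING and IENABLE
    are candidates for service."""
    return ipending & ienable
```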

Assembly-language instructions can refer to processor control registers by using names such as those in Figure 3.7. But, these registers cannot be accessed in the same way as the general-purpose registers. They cannot be accessed by arithmetic and logic instructions. They also cannot be accessed by Load and Store instructions that use the encoding format depicted in Figure 2.32c, because a five-bit field is used to specify a source or a destination register in these instructions, which makes it possible to specify only 32 general-purpose registers. Special instructions or special addressing modes may be provided to access the processor control registers. In a RISC-style processor, the special instructions may be of the type

MoveControl R2, PS

which loads the contents of the program status register into register R2, and

MoveControl IENABLE, R3

which places the contents of R3 into the IENABLE register. These instructions perform transfers between control and general-purpose registers.

3.2.5 Examples of Interrupt Programs

Having presented the basic aspects of interrupts, we can now give some illustrative examples. We will use the keyboard and display devices with the register structure given in Figure 3.3.

Example 3.2 Let us consider again the task of reading a line of characters typed on a keyboard, storing the characters in the main memory, and displaying them on a display device. In Figures 3.4 and 3.5, we showed how this task may be performed by using the polling approach to detect when the I/O devices are ready for data transfer. Now, we will use interrupts with the keyboard, but polling with the display.


We assume for now that a specific memory location, ILOC, is dedicated for dealing with interrupts, and that it contains the first instruction of the interrupt-service routine. Whenever an interrupt request arrives at the processor, and processor interrupts are enabled, the processor will automatically:

• Save the contents of the program counter, either in a processor register that holds the return address or on the processor stack.

• Save the contents of the status register PS by transferring them into the IPS register, and clear the IE bit in the PS.

• Load the address ILOC into the program counter.

Assume that in the Main program we wish to read a line from the keyboard and store the characters in successive byte locations in the memory, starting at location LINE. Also, assume that the interrupt-service routine has been loaded in the memory, starting at location ILOC. The Main program has to initialize the interrupt process as follows:

1. Load the address LINE into a memory location PNTR. The interrupt-service routine will use this location as a pointer to store the input characters in the memory.

2. Enable interrupts in the keyboard interface by setting to 1 the KIE bit in the KBD_CONT register.

3. Enable the processor to accept interrupts from the keyboard by setting to 1 the KBD bit in its control register IENABLE.

4. Enable the processor to respond to interrupts in general by setting to 1 the IE bit in the processor status register, PS.

Once this initialization is completed, typing a character on the keyboard will cause an interrupt request to be generated by the keyboard interface. The program being executed at that time will be interrupted and the interrupt-service routine will be executed. This routine must perform the following tasks:

1. Read the input character from the keyboard input data register. This will cause the interface circuit to remove its interrupt request.

2. Store the character in the memory location pointed to by PNTR, and increment PNTR.

3. Display the character using the polling approach.

4. When the end of the line is reached, disable keyboard interrupts and inform the Main program.

5. Return from interrupt.

A RISC-style program that performs these tasks is shown in Figure 3.8. The comments in the program explain the relevant details. When the end of the input line is detected, the interrupt-service routine clears the KIE bit in register KBD_CONT, as no further input is expected. It also sets to 1 the variable EOL (End Of Line), which was initially cleared to 0. We assume that it is checked periodically by the Main program to determine when the input line is ready for processing. The EOL variable provides a means of signaling between the Main program and the interrupt-service routine.
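
The behaviour of this interrupt-service routine can be summarized in Python. This is only a behavioural sketch: the state dictionary stands in for the memory locations PNTR and EOL and the KIE bit, and the polled echo to the display is not modelled.

```python
CR = 0x0D  # ASCII code for Carriage Return

def keyboard_isr(state, char):
    """One activation of the ISR: store the character through the pointer,
    and on Carriage Return set EOL to 1 (to signal the Main program) and
    clear KIE to 0 (no further input expected)."""
    state['line'].append(char)   # store via PNTR, then advance the pointer
    if char == CR:
        state['eol'] = 1
        state['kie'] = 0
```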


Interrupt-service routine

ILOC:     Subtract  SP, SP, #8            Save registers.
          Store     R2, 4(SP)
          Store     R3, (SP)
          Load      R2, PNTR              Load address pointer.
          LoadByte  R3, KBD_DATA          Read character from keyboard.
          StoreByte R3, (R2)              Write the character into memory
          Add       R2, R2, #1              and increment the pointer.
          Store     R2, PNTR              Update the pointer in memory.
ECHO:     LoadByte  R2, DISP_STATUS       Wait for display to become ready.
          And       R2, R2, #4
          Branch_if_[R2]=0 ECHO
          StoreByte R3, DISP_DATA         Display the character just read.
          Move      R2, #CR               ASCII code for Carriage Return.
          Branch_if_[R3]≠[R2] RTRN        Return if not CR.
          Move      R2, #1
          Store     R2, EOL               Indicate end of line.
          Clear     R2                    Disable interrupts in
          StoreByte R2, KBD_CONT            the keyboard interface.
RTRN:     Load      R3, (SP)              Restore registers.
          Load      R2, 4(SP)
          Add       SP, SP, #8
          Return-from-interrupt

Main program

START:    Move      R2, #LINE
          Store     R2, PNTR              Initialize buffer pointer.
          Clear     R2
          Store     R2, EOL               Clear end-of-line indicator.
          Move      R2, #2                Enable interrupts in
          StoreByte R2, KBD_CONT            the keyboard interface.
          MoveControl R2, IENABLE
          Or        R2, R2, #2            Enable keyboard interrupts in
          MoveControl IENABLE, R2           the processor control register.
          MoveControl R2, PS
          Or        R2, R2, #1
          MoveControl PS, R2              Set interrupt-enable bit in PS.
          next instruction

Figure 3.8   A RISC-style program that reads a line of characters using interrupts, and displays the line using polling.


Observe that the last three instructions in the Main program are used to set to 1 the interrupt-enable bit in PS. Since only MoveControl instructions can access the contents of a control register, the contents of PS are loaded into a general-purpose register, R2, modified, and then written back into PS. Using the Or instruction to modify the contents affects only the IE bit and leaves the rest of the bits in PS unchanged.
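
This read-modify-write sequence is the standard way of changing one bit of a control register. In Python terms (an illustrative sketch):

```python
IE = 0b1  # the interrupt-enable bit occupies bit 0 of PS

def set_ie(ps):
    """Equivalent of the three-instruction sequence
    MoveControl R2, PS / Or R2, R2, #1 / MoveControl PS, R2:
    OR in the IE bit, leaving every other bit of PS unchanged."""
    return ps | IE
```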

When multiple I/O devices raise interrupt requests, it is necessary to determine which device has requested an interrupt. This can be done in software by checking the information in the IPENDING control register and choosing the interrupt-service routine that should be executed.

Example 3.3 In Example 3.2, we used interrupts with the keyboard only. The display device can also use interrupts. Suppose a program needs to display a page of text stored in the memory. This can be done by having the processor send a character whenever the display interface is ready, which may be indicated by an interrupt request. Assume that both the display and the keyboard are used by this program, and that both are enabled to raise interrupt requests. Using the register structure in Figures 3.3 and 3.7, the initialization of interrupts and the processing of requests can be done as indicated in Figure 3.9.

The Main program must initialize any variables needed by the interrupt-service routines, such as the memory buffer pointers. Then, it enables interrupts in both the keyboard and display interfaces. Next, it enables interrupts in the processor control register IENABLE. Note that the immediate value 6, which is loaded into this register, sets bits KBD and DISP to 1. Finally, the processor is enabled to respond to interrupts in general by setting to 1 the IE bit in the processor status register, PS.

Again, we assume that whenever an interrupt request arrives, the processor will automatically save the contents of the program counter (PC) and then load the address ILOC into PC. It will also save the contents of the status register (PS) by transferring them into the IPS register, and disable interrupts. Unlike Example 3.2, where we assumed that there is only one device that can raise interrupt requests, now we cannot go directly to the desired interrupt-service routine. First, it is necessary to identify the interrupting device. The needed information is found in the processor control register IPENDING. Since the interrupt-service routine uses registers R2 and R3 in this process, the contents of these registers must be saved on the stack and later restored. It is also necessary to save the contents of the subroutine linkage register, LINK_reg, because an interrupt can occur while some subroutine is being executed and the interrupt-service routine calls a subroutine. The circuit that detects interrupts sets to 1 the appropriate bit in IPENDING for each pending request. In Figure 3.9, the contents of IPENDING are loaded into general-purpose register R2, and then examined to determine which interrupts are pending. If the display has a pending interrupt, then its interrupt-service routine is executed. If not, then a check is made for the keyboard. This may be followed by checking any other devices that could have pending requests. The order in which the bits in IPENDING are checked establishes a priority for the interrupting devices in case of simultaneous requests.
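
This checking order amounts to a first-match scan of the IPENDING bits. A Python sketch, with device names and bit positions assumed from Figure 3.7:

```python
def dispatch(ipending, order):
    """Scan the pending bits in a fixed order and return the first device
    found. The scan order is the effective priority when several
    requests are pending at once."""
    for name, bit in order:
        if ipending & bit:
            return name
    return None   # no recognized request pending
```

With order = [('DISP', 1 << 2), ('KBD', 1 << 1)], simultaneous display and keyboard requests are resolved in favour of the display, matching the structure of Figure 3.9.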


Interrupt handler

ILOC:     Subtract  SP, SP, #12           Save registers.
          Store     LINK_reg, 8(SP)
          Store     R2, 4(SP)
          Store     R3, (SP)
          MoveControl R2, IPENDING        Check contents of IPENDING.
          And       R3, R2, #4            Check if display raised the request.
          Branch_if_[R3]=0 TESTKBD        If not, check if keyboard.
          Call      DISR                  Call the display ISR.
TESTKBD:  And       R3, R2, #2            Check if keyboard raised the request.
          Branch_if_[R3]=0 NEXT           If not, then check next device.
          Call      KISR                  Call the keyboard ISR.
NEXT:     ...                             Check for other interrupts.
          Load      R3, (SP)              Restore registers.
          Load      R2, 4(SP)
          Load      LINK_reg, 8(SP)
          Add       SP, SP, #12
          Return-from-interrupt

Main program

START:    ...                             Set up parameters for ISRs.
          Move      R2, #2                Enable interrupts in
          StoreByte R2, KBD_CONT            the keyboard interface.
          Move      R2, #4                Enable interrupts in
          StoreByte R2, DISP_CONT           the display interface.
          MoveControl R2, IENABLE
          Or        R2, R2, #6            Enable interrupts in
          MoveControl IENABLE, R2           the processor control register.
          MoveControl R2, PS
          Or        R2, R2, #1
          MoveControl PS, R2              Set interrupt-enable bit in PS.
          next instruction

Keyboard interrupt-service routine

KISR:     ...
          Return

Display interrupt-service routine

DISR:     ...
          Return

Figure 3.9   A RISC-style program that initializes and handles interrupts.


The program parts that handle interrupt requests and provide the corresponding service to the requesting devices are often referred to as the interrupt handler. Note that while the interrupt handler starts at the fixed address ILOC, the individual interrupt-service routines are just subroutines that can be placed anywhere in the memory.

In Figure 3.9, we used a software approach to determine the interrupting device. In processors that use vectored interrupts, the circuit that detects interrupt requests automatically loads a different address into the program counter for each interrupt that is assigned a specific location in the interrupt-vector table. A separate interrupt-service routine is executed to completion for each pending request, even if multiple interrupt requests are raised at the same time.

CISC-style Examples of Interrupts

The above tasks can be implemented with CISC-style instructions using the same basic approach. The main difference is that some operations, such as testing a bit in an I/O register, can be done directly. The tasks in Examples 3.2 and 3.3 can be realized using the programs in Figures 3.10 and 3.11, respectively. The TestBit instruction is used to test the status flags. The SetBit and ClearBit instructions are used to set an individual bit in an I/O register to 1 and 0, respectively. The comments in the programs provide explanations of how the desired tasks are realized.

Input/output operations in a computer system are usually much more involved than our simple examples suggest. As we will describe in Chapter 4, the operating system of the computer performs these operations on behalf of user programs. In Chapter 7, we will discuss in detail the hardware used in I/O operations.

3.2.6 Exceptions

An interrupt is an event that causes the execution of one program to be suspended and the execution of another program to begin. So far, we have dealt only with interrupts caused by events associated with I/O data transfers. However, the interrupt mechanism is used in a number of other situations.

The term exception is often used to refer to any event that causes an interruption. Hence, I/O interrupts are one example of an exception. We now describe a few other kinds of exceptions.

Recovery from Errors

Computers use a variety of techniques to ensure that all hardware components are operating properly. For example, many computers include an error-checking code in the main memory, which allows detection of errors in the stored data. If an error occurs, the control hardware detects it and informs the processor by raising an interrupt.

The processor may also interrupt a program if it detects an error or an unusual condition while executing the instructions of this program. For example, the OP-code field of an instruction may not correspond to any legal instruction, or an arithmetic instruction may attempt a division by zero.

When exception processing is initiated as a result of such errors, the processor proceeds in exactly the same manner as in the case of an I/O interrupt request. It suspends the program


Interrupt-service routine

ILOC:     Move      -(SP), R2             Save register.
          Move      R2, PNTR              Load address pointer.
          MoveByte  (R2), KBD_DATA        Write the character into memory
          Add       PNTR, #1                and increment the pointer.
ECHO:     TestBit   DISP_STATUS, #2       Wait for the display to become ready.
          Branch=0  ECHO
          MoveByte  DISP_DATA, (R2)       Display the character just read.
          CompareByte (R2), #CR           Check if the character just read is CR.
          Branch≠0  RTRN                  Return if not CR.
          Move      EOL, #1               Indicate end of line.
          ClearBit  KBD_CONT, #1          Disable interrupts in keyboard interface.
RTRN:     Move      R2, (SP)+             Restore register.
          Return-from-interrupt

Main program

START:    Move      PNTR, #LINE           Initialize buffer pointer.
          Clear     EOL                   Clear end-of-line indicator.
          SetBit    KBD_CONT, #1          Enable interrupts in keyboard interface.
          Move      R2, #2                Enable keyboard interrupts in
          MoveControl IENABLE, R2           the processor control register.
          MoveControl R2, PS
          Or        R2, #1
          MoveControl PS, R2              Set interrupt-enable bit in PS.
          next instruction

Figure 3.10   A CISC-style program that reads a line of characters using interrupts, and displays the line using polling.

being executed and starts an exception-service routine, which takes appropriate action to recover from the error, if possible, or to inform the user about it. Recall that in the case of an I/O interrupt, we assumed that the processor completes execution of the instruction in progress before accepting the interrupt. However, when an interrupt is caused by an error associated with the current instruction, that instruction cannot usually be completed, and the processor begins exception processing immediately.

Debugging

Another important type of exception is used as an aid in debugging programs. System software usually includes a program called a debugger, which helps the programmer find errors in a program. The debugger uses exceptions to provide two important facilities: trace mode and breakpoints. These facilities are described in detail in Chapter 4.


Interrupt handler

ILOC:     Move      -(SP), R2             Save registers.
          Move      -(SP), LINK_reg
          MoveControl R2, IPENDING        Check contents of IPENDING.
          TestBit   R2, #2                Check if display raised the request.
          Branch=0  TESTKBD               If not, check if keyboard.
          Call      DISR                  Call the display ISR.
TESTKBD:  TestBit   R2, #1                Check if keyboard raised the request.
          Branch=0  NEXT                  If not, then check next device.
          Call      KISR                  Call the keyboard ISR.
NEXT:     ...                             Check for other interrupts.
          Move      LINK_reg, (SP)+       Restore registers.
          Move      R2, (SP)+
          Return-from-interrupt

Main program

START:    ...                             Set up parameters for ISRs.
          SetBit    KBD_CONT, #1          Enable interrupts in keyboard interface.
          SetBit    DISP_CONT, #2         Enable interrupts in display interface.
          MoveControl R2, IENABLE
          Or        R2, #6                Enable interrupts in
          MoveControl IENABLE, R2           the processor control register.
          MoveControl R2, PS
          Or        R2, #1
          MoveControl PS, R2              Set interrupt-enable bit in PS.
          next instruction

Keyboard interrupt-service routine

KISR:     ...
          Return

Display interrupt-service routine

DISR:     ...
          Return

Figure 3.11   A CISC-style program that initializes and handles interrupts.


Use of Exceptions in Operating Systems

The operating system (OS) software coordinates the activities within a computer. It uses exceptions to communicate with and control the execution of user programs. It uses hardware interrupts to perform I/O operations. This topic is discussed in Chapter 4.

3.3 Concluding Remarks

In this chapter, we discussed two basic approaches to I/O transfers. The simplest technique is programmed I/O, in which the processor performs all of the necessary functions under direct control of program instructions. The second approach is based on the use of interrupts; this mechanism makes it possible to interrupt the normal execution of programs in order to service higher-priority requests that require more urgent attention. Although all computers have a mechanism for dealing with such situations, the complexity and sophistication of interrupt-handling schemes vary from one computer to another.

We dealt with the I/O issues from the programmer's point of view. In Chapter 7 we will consider the hardware aspects and some commonly used I/O standards.

3.4 Solved Problems

This section presents some examples of problems that a student may be asked to solve, and shows how such problems can be solved.

Example 3.4 Problem: Assume that a memory location BINARY contains a 32-bit pattern. It is desired to display these bits as eight hexadecimal digits on a display device that has the interface depicted in Figure 3.3. Write a program that accomplishes this task.

Solution: First it is necessary to convert the 32-bit pattern into hex digits that are represented as ASCII-encoded characters. A simple way of doing the conversion is to use the table-lookup approach. A 16-entry table has to be constructed to provide the ASCII code for each possible hex digit. Then, for each four-bit segment of the pattern in BINARY, the corresponding character can be looked up in the table and stored in a block of memory bytes starting at location HEX. Finally, the eight characters starting at HEX are sent to the display.
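
The table-lookup conversion can be checked against a direct Python model. This is a sketch of the algorithm, not of the assembly programs themselves:

```python
TABLE = b"0123456789ABCDEF"   # ASCII codes 0x30-0x39 and 0x41-0x46

def to_hex_ascii(binary):
    """Produce the eight ASCII-encoded hex digits of a 32-bit pattern,
    high-order digit first, by rotating left four bits at a time and
    looking up each four-bit segment in the table."""
    out = bytearray()
    for _ in range(8):
        binary = ((binary << 4) | (binary >> 28)) & 0xFFFFFFFF  # RotateL #4
        out.append(TABLE[binary & 0xF])
    return bytes(out)
```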

Figures 3.12 and 3.13 give RISC- and CISC-style programs, respectively, for the required task. The comments describe the detailed actions taken.


         Load      R2, BINARY            Load the binary number.
         Move      R3, #8                R3 is a digit counter that is set to 8.
         Move      R4, #HEX              R4 points to the hex digits.
LOOP:    RotateL   R2, R2, #4            Rotate the high-order digit
                                           into low-order position.
         And       R5, R2, #0xF          Extract next digit.
         LoadByte  R6, TABLE(R5)         Get ASCII code for the digit and
         StoreByte R6, (R4)                store it in HEX number location.
         Subtract  R3, R3, #1            Decrement the digit counter.
         Add       R4, R4, #1            Increment the pointer to hex digits.
         Branch_if_[R3]>0 LOOP           Loop back if not the last digit.
DISPLAY: Move      R3, #8
         Move      R4, #HEX
DLOOP:   LoadByte  R5, DISP_STATUS       Wait for display to become ready.
         And       R5, R5, #4            Check the DOUT flag.
         Branch_if_[R5]=0 DLOOP
         LoadByte  R6, (R4)              Get the next ASCII character
         StoreByte R6, DISP_DATA           and send it to the display.
         Subtract  R3, R3, #1            Decrement the counter.
         Add       R4, R4, #1            Increment the character pointer.
         Branch_if_[R3]>0 DLOOP          Loop until all characters displayed.
         next instruction

         ORIGIN    1000
HEX:     RESERVE   8                     Space for ASCII-encoded digits.
TABLE:   DATABYTE  0x30,0x31,0x32,0x33   Table for conversion
         DATABYTE  0x34,0x35,0x36,0x37     to ASCII code.
         DATABYTE  0x38,0x39,0x41,0x42
         DATABYTE  0x43,0x44,0x45,0x46

Figure 3.12   A RISC-style program for Example 3.4.

Example 3.5 Problem: Consider the task described in Example 3.1. Assume that the timer circuit includes a 32-bit up/down counter driven by a 100-MHz clock. The counter can be set to count from a specified initial count value. The timer I/O interface is shown in Figure 3.14. It contains four registers.

• TIM_STATUS indicates the current status of the timer where:

– The TON bit is set to 1 when the counter is running.

– The ZERO bit is set to 1 when the counter reaches the count of zero.

– The TIRQ bit is set to 1 when the timer raises an interrupt request, which happens when the counter contents reach zero and the timer interrupts are enabled.


         Move      R2, BINARY            Load the binary number.
         Move      R3, #8                R3 is a digit counter that is set to 8.
         Move      R4, #HEX              R4 points to the hex digits.
LOOP:    RotateL   R2, #4                Rotate the high-order digit
                                           into low-order position.
         Move      R5, R2
         And       R5, #0xF              Extract next digit.
         MoveByte  (R4)+, TABLE(R5)      Get ASCII code for the digit and
                                           store it in HEX number location.
         Subtract  R3, #1                Decrement the digit counter.
         Branch>0  LOOP                  Loop back if not the last digit.
DISPLAY: Move      R3, #8
         Move      R4, #HEX
DLOOP:   TestBit   DISP_STATUS, #2       Wait for display to become ready.
         Branch=0  DLOOP
         MoveByte  DISP_DATA, (R4)+      Send next character to display.
         Subtract  R3, #1                Decrement the counter.
         Branch>0  DLOOP                 Loop until all characters displayed.
         next instruction

         ORIGIN    1000
HEX:     RESERVE   8                     Space for ASCII-encoded digits.
TABLE:   DATABYTE  0x30,0x31,0x32,0x33   Table for conversion
         DATABYTE  0x34,0x35,0x36,0x37     to ASCII code.
         DATABYTE  0x38,0x39,0x41,0x42
         DATABYTE  0x43,0x44,0x45,0x46

Figure 3.13   A CISC-style program for Example 3.4.

The action of reading the status register automatically clears the ZERO and TIRQ bits to 0.

• TIM_CONT controls the mode of operation, where:

– The UP bit is set to 1 to cause the counter to count by incrementing its contents; when this bit is cleared to zero, the counter contents are decremented.

– The FREE bit is set to 1 to cause a continuously running mode, where the counter is automatically reloaded with the initial count value whenever the actual count reaches zero.

– The RUN bit is set to 1 to cause the counter to count; it is cleared to 0 to stop the counter.

– The TIE bit is set to 1 to enable timer interrupts.

• TIM_INIT holds the initial count value.

• TIM_COUNT holds the current count value.
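
The counter behaviour implied by these bits can be modelled as follows. This is a behavioural Python sketch for the count-down case; the dictionary fields simply mirror the register bits described above.

```python
def timer_tick(t):
    """One input-clock cycle of the down-counter. When the count reaches
    zero: set ZERO, raise TIRQ if timer interrupts are enabled (TIE), and
    reload the initial count if free-running mode (FREE) is selected."""
    if not t['RUN']:
        return          # counter is stopped
    t['COUNT'] -= 1
    if t['COUNT'] == 0:
        t['ZERO'] = 1
        if t['TIE']:
            t['TIRQ'] = 1
        if t['FREE']:
            t['COUNT'] = t['INIT']
```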


[Figure: four registers in the timer interface. TIM_STATUS (address 0x4020) contains the status bits TON, ZERO, and TIRQ. TIM_CONT (address 0x4024) contains the control bits UP, FREE, RUN, and TIE. TIM_INIT (address 0x4028) holds the initial count value, and TIM_COUNT (address 0x402C) holds the current count value.]

Figure 3.14   Registers in the timer interface.

Write a program to implement the desired task. Use the processor control registers depicted in Figure 3.7.

Solution: To obtain an interrupt request every ten seconds, it is necessary to count 10^9 clock cycles. This can be accomplished by writing this value into the TIM_INIT register, and then making the counter decrement its count and raise an interrupt when the count reaches zero. The value 10^9 can be represented by the hexadecimal number 3B9ACA00. To achieve the desired operation the FREE, RUN, and TIE bits must be set to 1, while the UP bit must be equal to 0.

Using the scheme outlined in Figure 3.9, we can implement the required task using a RISC-style program shown in Figure 3.15. Note that the initial count, which is a 32-bit immediate value, is loaded into R2 using the approach explained in Section 2.9.

Figure 3.16 gives a CISC-style program that uses the scheme outlined in Figure 3.11. In this case, the 32-bit immediate operand can be specified in a single instruction.

Example 3.6 Problem: A commonly used output device in digital systems is a seven-segment display, depicted in Figure 3.17. The device consists of seven independent segments which can be illuminated by applying electrical signals to them. Assume that each segment is illuminated when a logic value 1 is applied to it. The figure shows the bit patterns needed to display numbers 0 to 9.

Write a program that displays the number represented by an ASCII-encoded character stored in memory location DIGIT at address 0x800. Assume that the display has an I/O interface consisting of an eight-bit data register, SEVEN, where the segments a to g are connected to bits SEVEN6−0. Let the bit SEVEN7 be equal to 0. Also, assume that the address of register SEVEN is 0x4030. If the ASCII code in location DIGIT represents a


Interrupt handler

ILOC:     Subtract  SP, SP, #8          Save registers.
          Store     LINK_reg, 4(SP)
          Store     R2, (SP)
          MoveControl R2, IPENDING      Check contents of IPENDING.
          And       R2, R2, #8          Check if request from timer.
          Branch_if_[R2]=0 NEXT
          LoadByte  R2, TIM_STATUS      Clear TIRQ and ZERO bits.
          Call      DISPLAY             Call the DISPLAY routine.

NEXT:     . . .                         Check for other interrupts.

          Load      R2, (SP)            Restore registers.
          Load      LINK_reg, 4(SP)
          Add       SP, SP, #8
          Return-from-interrupt

Main program

START:    . . .                         Set up parameters for ISRs.
          OrHigh    R2, R0, #0x3B9A     Prepare the initial
          Or        R2, R2, #0xCA00     count value.
          Store     R2, TIM_INIT        Set the initial count value.
          Move      R2, #7              Set the timer to free run
          StoreByte R2, TIM_CONT        and enable interrupts.
          MoveControl R2, IENABLE
          Or        R2, R2, #8          Enable timer interrupts in
          MoveControl IENABLE, R2       the processor control register.
          MoveControl R2, PS
          Or        R2, R2, #1
          MoveControl PS, R2            Set interrupt-enable bit in PS.

COMPUTE: next instruction

Figure 3.15 A RISC-style program for Example 3.5.

character that is not a number in the range 0 to 9, then the display should be blank, where all segments are turned off.

Solution: A look-up table can be used to hold the seven-segment bit patterns that correspond to the numbers 0 to 9. The ASCII-encoded digit is converted into a four-bit number that is used as an index into the table, by using the AND operation. Also, it is necessary to check that the high-order four bits of the ASCII code are 0011. Note that all three addresses DIGIT, SEVEN, and TABLE can be represented in 16 bits.

Figures 3.18 and 3.19 give possible RISC- and CISC-style programs, respectively.
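The same table-lookup logic can be sketched in C (the function name is illustrative; the table bytes are the segment patterns used in Figures 3.18 and 3.19):

```c
/* Segment patterns for digits 0 to 9; entries 10 to 15 are zero,
   so any non-digit index produces a blank display. */
static const unsigned char table[16] = {
    0x7E, 0x30, 0x6D, 0x79, 0x33, 0x5B, 0x5F, 0x70,
    0x7F, 0x7B, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Convert an ASCII character to the byte for the SEVEN register. */
unsigned char seven_segment(unsigned char ascii) {
    unsigned high  = ascii & 0xF0;  /* high-order four bits of ASCII  */
    unsigned index = ascii & 0x0F;  /* candidate decimal-digit value  */
    if (high != 0x30)               /* not 0011: not an ASCII digit   */
        index = 0x0F;               /* index a zero entry: blank      */
    return table[index];
}
```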


Interrupt handler

ILOC:     Move      -(SP), R2           Save registers.
          Move      -(SP), LINK_reg
          MoveControl R2, IPENDING      Check contents of IPENDING.
          TestBit   R2, #3              Check if request from timer.
          Branch=0  NEXT
          MoveByte  R2, TIM_STATUS      Clear TIRQ and ZERO bits.
          Call      DISPLAY             Call the DISPLAY routine.

NEXT:     . . .                         Check for other interrupts.

          Move      LINK_reg, (SP)+     Restore registers.
          Move      R2, (SP)+
          Return-from-interrupt

Main program

START:    . . .                         Set up parameters for ISRs.
          Move      TIM_INIT, #0x3B9ACA00   Set the initial count value.
          MoveByte  TIM_CONT, #7        Set the timer to free run
                                        and enable interrupts.
          MoveControl R2, IENABLE
          Or        R2, #8              Enable timer interrupts in
          MoveControl IENABLE, R2       the processor control register.
          MoveControl R2, PS
          Or        R2, #1
          MoveControl PS, R2            Set interrupt-enable bit in PS.

COMPUTE: next instruction

Figure 3.16 A CISC-style program for Example 3.5.

Figure 3.17 Seven-segment display. (The original figure shows the segments a to g of the display and a table giving, for each decimal digit 0 to 9, the logic value applied to each segment.)


DIGIT     EQU       0x800               Location of ASCII-encoded digit.
SEVEN     EQU       0x4030              Address of 7-segment display.

          LoadByte  R2, DIGIT           Load the ASCII-encoded digit.
          And       R3, R2, #0xF0       Extract high-order bits of ASCII.
          And       R2, R2, #0x0F       Extract the decimal number.
          Move      R4, #0x30           Check if high-order bits of
          Branch_if_[R3]=[R4] HIGH3     ASCII code are 0011.
          Move      R2, #0x0F           Not a digit, display a blank.
HIGH3:    LoadByte  R5, TABLE(R2)       Get the 7-segment pattern.
          StoreByte R5, SEVEN           Display the digit.

          ORIGIN    0x1000
TABLE:    DATABYTE  0x7E,0x30,0x6D,0x79 Table that contains
          DATABYTE  0x33,0x5B,0x5F,0x70 the necessary
          DATABYTE  0x7F,0x7B,0x00,0x00 7-segment patterns.
          DATABYTE  0x00,0x00,0x00,0x00

Figure 3.18 A RISC-style program for Example 3.6.

DIGIT     EQU       0x800               Location of ASCII-encoded digit.
SEVEN     EQU       0x4030              Address of 7-segment display.

          Move      R2, DIGIT           Load the ASCII-encoded digit.
          Move      R3, R2
          And       R3, #0xF0           Extract high-order bits of ASCII.
          And       R2, #0x0F           Extract the decimal number.
          CompareByte R3, #0x30         Check if high-order bits of
          Branch=0  HIGH3               ASCII code are 0011.
          Move      R2, #0x0F           Not a digit, display a blank.
HIGH3:    MoveByte  SEVEN, TABLE(R2)    Display the digit.

          ORIGIN    0x1000
TABLE:    DATABYTE  0x7E,0x30,0x6D,0x79 Table that contains
          DATABYTE  0x33,0x5B,0x5F,0x70 the necessary
          DATABYTE  0x7F,0x7B,0x00,0x00 7-segment patterns.
          DATABYTE  0x00,0x00,0x00,0x00

Figure 3.19 A CISC-style program for Example 3.6.


Problems

3.1 [E] The input status bit in an interface circuit is cleared as soon as the input data register is read. Why is this important?

3.2 [E] Write a program that displays the contents of ten bytes of the main memory in hexadecimal format on a line of a display device. The ten bytes start at location LOC in the memory, and there are two hex characters per byte. The contents of successive bytes should be separated by a space when displayed.

3.3 [E] What is the difference between a subroutine and an interrupt-service routine?

3.4 [E] In the first And instruction in Figure 3.4 the immediate value 2 is used when checking the KIN flag, but in Figure 3.5 the immediate value 1 is used in the first TestBit instruction when checking the same flag. Explain the difference.

3.5 [D] A computer is required to accept characters from the keyboard input of 20 terminals. The main memory area to be used for storing data for each terminal is pointed to by a pointer PNTRn, where n = 1 through 20. Input data must be collected from the terminals while another program PROG is being executed. This may be accomplished in one of two ways:

(a) Every T seconds, program PROG calls a polling subroutine POLL. This subroutine checks the status of each of the 20 terminals in sequence and transfers any input characters to the memory. Then it returns to PROG.

(b) Whenever a character is ready in any of the interface buffers of the terminals, an interrupt request is generated. This causes the interrupt routine INTERRUPT to be executed. INTERRUPT polls the status registers to find the first ready character, transfers it, and then returns to PROG.

Write the routines POLL and INTERRUPT. Let the maximum character rate for any terminal be c characters per second, with an average rate equal to rc, where r ≤ 1. In method (a), what is the maximum value of T for which it is still possible to guarantee that no input characters will be lost? What is the equivalent value for method (b)? Estimate, on the average, the percentage of time spent in servicing the terminals for methods (a) and (b), for c = 100 characters per second and r = 0.01, 0.1, 0.5, and 1. Assume that POLL takes 800 ns to poll all 20 devices and that an interrupt from a device requires 200 ns to process.

3.6 [E] In Figure 3.9, the interrupt-enable bit in the PS is set last in the START section of the Main program. Why? Does the order matter for earlier operations in START? Why or why not?

3.7 [E] Even if multiple interrupt requests are pending, only one request will be handled for each entry into ILOC in Figure 3.9. True or false? Explain.

3.8 [E] A user program could check for a zero divisor immediately preceding each division operation, and then take appropriate action without invoking the OS. Give reasons why this may or may not be preferable to allowing an exception interrupt to occur on an actual divide-by-zero situation in a user program.


3.9 [M] Assume that a memory location BINARY contains a 16-bit pattern. It is desired to display these bits as a string of 0s and 1s on a display device that has the interface depicted in Figure 3.3. Write a RISC-style program that accomplishes this task.

3.10 [M] Write a CISC-style program for the task in Problem 3.9.

3.11 [E] Modify the program in Figure 3.18 if the address of TABLE is 0x10100.

3.12 [E] Modify the program in Figure 3.19 if the address of TABLE is 0x10100.

3.13 [M] Using the seven-segment display in Figure 3.17 and the timer interface registers in Figure 3.14, write a RISC-style program that flashes decimal digits in the repeating sequence 0, 1, 2, . . . , 9, 0, . . . . Each digit is to be displayed for one second. Assume that the counter in the timer circuit is driven by a 100-MHz clock.

3.14 [M] Write a CISC-style program for the task in Problem 3.13.

3.15 [D] Using two 7-segment displays of the type shown in Figure 3.17, and the timer interface registers in Figure 3.14, write a RISC-style program that flashes the repeating sequence of numbers 0, 1, 2, . . . , 98, 99, 0, . . . . Each number is to be displayed for one second. Assume that the counter in the timer circuit is driven by a 100-MHz clock.

3.16 [D] Write a CISC-style program for the task in Problem 3.15.

3.17 [D] Write a RISC-style program that computes wall clock time and displays the time in hours (0 to 23) and minutes (0 to 59). The display consists of four 7-segment display devices of the type shown in Figure 3.17. A timer circuit that has the interface registers given in Figure 3.14 is available. Its counter is driven by a 100-MHz clock.

3.18 [D] Write a CISC-style program for the task in Problem 3.17.

3.19 [M] Write a RISC-style program that displays the name of the user backwards. The program should display a prompt requesting that the characters in the user’s name be entered on the keyboard, followed by the carriage return (CR). The program should accept a sequence of characters and store them in the main memory. It should then display a message to indicate that the user’s name will be displayed backwards, followed by the display of the characters from the user’s name in reverse order.

3.20 [M] Write a CISC-style program for the task in Problem 3.19.

3.21 [M] Write a RISC-style program that determines whether a word entered by a user on the keyboard is a palindrome, i.e., a word that is the same when its characters are written in normal and reverse order. The program should display a prompt requesting that the user enter the characters of an arbitrary word on the keyboard, followed by the carriage return (CR). The program should read the characters and store them in the main memory. It should then analyze the word to determine whether it is a palindrome. Finally, the program should display a message to indicate the result of the analysis.

3.22 [M] Write a CISC-style program for the task in Problem 3.21.


3.23 [D] Write a RISC-style program that displays a string of characters centered horizontally on a standard 80-character line and enclosed in a box, as shown below:

+-----------+
|sample text|
+-----------+

The string of characters is located in the main memory beginning at address STRING. There is a NUL control character (value 0) at the end of the string of characters. If the string has more than 78 characters (including spaces), the program should truncate the displayed string to 78 characters. The program for determining the length of a character string in Example 2.1 can be adapted as a subroutine for use by the program in this problem. Assume that the display device has the interface depicted in Figure 3.3.

3.24 [D] Write a CISC-style program for the task in Problem 3.23.

3.25 [D] Write a RISC-style program that displays a long sequence of text encoded in ASCII characters with automatic wraparound to fit within 80-character lines. Before displaying the next word, the program must determine whether there is sufficient space remaining on the line. If not, the word should appear at the beginning of the next line. The display process must continue until the NUL control character (value 0) is reached at the end of the sequence of characters to be displayed. Assume that the sequence of characters uses no control characters other than the NUL character at the end, hence words are separated only by a space character. Assume that the display device has the interface depicted in Figure 3.3.

3.26 [D] Write a CISC-style program for the task in Problem 3.25.


c h a p t e r

4

Software

Chapter Objectives

In this chapter you will learn about:

• Software needed to prepare and run programs

• Assemblers

• Loaders

• Linkers

• Compilers

• Debuggers

• Interaction between assembly language and C language

• Operating systems


Chapter 2 introduced the instruction set of a computer and illustrated how programs can be written in assembly language. Chapter 3 showed how to write programs that perform input/output operations. In this chapter, we will give an overview of the software needed to prepare and run programs.

Assembly-language programs are written using a symbolic notation, which is easily understood by the programmer. These programs must be translated into machine-language code before they can be executed in the computer, as explained in Section 2.5. This is done by the assembler program, which interprets the mnemonics representing machine instructions and the assembler directives for data declarations.

Having presented how assembly-language programs can be written, we will now discuss the complete process of preparing programs for execution. We will describe:

• How the assembler translates a source program written in assembly language into an object program consisting of machine instructions and data in binary form

• How object programs are loaded into the memory of a computer

• How program execution is initiated and terminated

• How larger programs can be formed by linking together several related programs

• How programming errors can be identified during the execution of a program

Then, we will consider some issues involved when programs are prepared in a high-level language, such as C. Finally, we will consider the role of operating system software in managing and coordinating the use of computer resources.

4.1 The Assembly Process

To prepare a source program, the programmer uses a utility program called a text editor, which allows the statements of a source program to be entered at a keyboard and saved in a file. The file containing the source program is a sequence of binary-encoded alphanumeric characters. The file is identified by a name chosen by the user. Files are normally stored in a secondary storage device, such as a magnetic disk.

After preparing the source file, the programmer uses another utility program called the assembler. It translates source programs written in an assembly language into object programs that comprise machine instructions. This process is often referred to as assembling a program. The assembler also converts the assembly-language representation of data into binary patterns that are part of the object program. After loading a source file from the disk into the memory and translating it into an object program, the assembler stores the object program in a separate file on the disk.

The source program uses mnemonics to represent OP codes in machine instructions. A set of syntax rules governs the specification of addressing modes for the data operands of these instructions. The assembler generates the binary encoding for the OP code and other instruction fields.

The assembler recognizes directives that specify numbers and characters and directives that allocate memory space for data areas. Using EQU (equate) directives, the programmer can define names that represent constants. These names can then appear in the source program as operands in instructions. Names can also be defined as address labels for branch targets, entry points of subroutines, or data locations in the memory. Address labels are assigned values based on their position relative to the beginning of an assembled program.

As the assembler scans through the source program, it keeps track of all names and their corresponding values in a symbol table. Each time a name appears, it is replaced with its value from the table.

4.1.1 Two-pass Assembler

A problem arises when a name appears as an operand before its value is defined. For example, this happens if a forward branch is required to an address label that appears later in the program. As discussed in Section 2.5.2, an offset for the branch is calculated by the assembler using the address of the branch target. With a forward branch, the assembler cannot determine the address of the branch target, because the value of the address label has not yet been recorded in the symbol table.

A commonly-used solution to this problem is to have the assembler scan through the source program twice. During the first pass, it creates the symbol table. For EQU directives, each name and its defined value are recorded in the symbol table. For address labels, the assembler determines the value of each name from its position relative to the start of the source program. The value is determined by summing the sizes of all machine instructions processed before the definition of the name. At the end of the first pass, all names appearing in the source program will have been assigned numerical values in the symbol table. The assembler then makes a second pass through the source program, looks up each name it encounters in the symbol table, and substitutes the corresponding numerical value. Such a two-pass assembler produces a complete object program.
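The first pass can be sketched in C for a toy assembly language in which every instruction occupies four bytes and a label definition appears alone on its line, ending with a colon. All names and simplifications here are ours, not those of a real assembler.

```c
#include <string.h>

#define MAX_SYMS 64

/* Symbol table built during pass 1: name and assigned value. */
struct sym { char name[32]; unsigned value; };
static struct sym table[MAX_SYMS];
static int nsyms;

/* Pass 1: record each label with its offset from the program start.
   Non-label lines are assumed to be 4-byte instructions. */
void pass1(const char *lines[], int n) {
    unsigned addr = 0;
    for (int i = 0; i < n; i++) {
        const char *colon = strchr(lines[i], ':');
        if (colon) {                        /* label definition */
            struct sym *s = &table[nsyms++];
            size_t len = (size_t)(colon - lines[i]);
            memcpy(s->name, lines[i], len);
            s->name[len] = '\0';
            s->value = addr;                /* current offset */
        } else {
            addr += 4;                      /* one instruction */
        }
    }
}

/* Pass 2 would rescan the program, replacing each name it meets
   with the value found here. */
unsigned lookup(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].value;
    return 0;  /* undefined name */
}
```

For example, in a five-line program whose third line is "LOOP:", the label LOOP receives the value 8, since two 4-byte instructions precede it.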

4.2 Loading and Executing Object Programs

Object programs generated by the assembler are stored in files on a disk. To execute a specific object program, it is first loaded from the disk into the memory. Then, the address of the first instruction to be executed is loaded into the program counter. A utility program called the loader is used to perform these operations.

The loader is invoked when a user enters a command to execute an object program that is stored on the disk. The user command specifies the name of the object file, which enables the loader to find the file on the disk. The loader transfers the object program from the disk into a specified place in the memory. It must know the length of the program and the address in the memory where it will be loaded. The assembler usually places this information in a header in the object file, preceding the machine instructions and data of the object program.

One way of entering the user commands is by typing them on the keyboard. A more commonly used alternative is to use a graphical user interface (GUI). In this case, the user uses a mouse to select the desired object file. Then, the GUI software passes to the loader the information about the location of the object file on the disk.


Once the object program has been loaded into the memory, the loader starts its execution by branching to the first instruction to be executed. In the source program, the programmer indicates the first instruction with a special address label such as START. The assembler includes the value of this address label in the header of the object program.
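A minimal sketch in C of such a header-driven loader, assuming a hypothetical header layout (the field names and format are illustrative, not a real object-file format):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical object-file header preceding the program image. */
typedef struct {
    uint32_t load_address;   /* where the program is to be placed      */
    uint32_t length;         /* size of instructions + data, in bytes  */
    uint32_t start_address;  /* value of the START label               */
} obj_header;

/* Copy the program image into (simulated) memory and return the
   address at which execution should begin. */
uint32_t load_program(uint8_t *memory, const obj_header *h,
                      const uint8_t *image) {
    memcpy(memory + h->load_address, image, h->length);
    return h->start_address;   /* the loader branches here */
}
```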

When an object program completes its task, its execution has to be terminated in a well-defined manner. This permits the space in the memory containing the object program to be recovered, and enables the user to enter a new command to execute another object program. These issues are normally addressed by the operating system (OS) software, which is discussed in Section 4.9.

4.3 The Linker

In the preceding sections we assumed that all instructions and data for a particular program are specified in a single source file from which the assembler generates an object program. In many cases, a programmer may wish to call subroutines created by other programmers. It is not convenient or practical to gather all of the desired subroutines from possibly many separate source files into a single source file for processing by the assembler.

Instead, a common procedure is to use the assembler on each of the source files separately. In this case, each individual output file will not be a complete object program. Each program may contain references to external names, which are address labels defined in other source files. When processing a source file, the assembler identifies such external references and builds a list of these names and the instructions that refer to them. It includes this list in the object file that it generates from the source file.

A utility program called the linker is used to combine the contents of separate object files into one object program. It resolves references to external names using the information recorded in each object file. The linker needs the relative positions of address labels defined in the source files, so that it can determine the absolute address values when it combines the separate object files. Information on address labels that may be referenced in other source files must be exported from each source file to aid in this task. Normally, the programmer is required to indicate the specific labels to be exported. The exported names are included by the assembler in each object file that it generates, along with a list of the external names used in the program and the instructions referring to them.

The linker uses the information in each object file and the known sizes of machine-language programs to build a memory map of the final combined object file. The final values corresponding to exported address labels are determined once all of the individual object files are collected together and assigned to their final locations in memory. At this point, references to external names can be resolved. The final address values determined by the linker are substituted in the specific instructions that contain external references. Once all external references are resolved, the final object program is complete.
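The resolution step can be sketched in C under a simplified model: once the object file that defines a label has been assigned a base address, the label's final value is that base plus its recorded offset, and this value is patched into each instruction word that referenced it. The structures below are illustrative, not a real object-file format.

```c
#include <stdint.h>

/* An exported label: its offset within the defining object file. */
typedef struct { uint32_t offset_in_file; } export_entry;

/* An external reference: the index of the code word to be patched. */
typedef struct { uint32_t patch_site; } extref_entry;

/* Resolve one external reference: write the label's absolute
   address into the word of 'code' that referenced it. */
void resolve(uint32_t *code, extref_entry ref,
             uint32_t def_file_base, export_entry def) {
    code[ref.patch_site] = def_file_base + def.offset_in_file;
}
```

For instance, a label at offset 0x40 in a file placed at base address 0x1000 resolves to the absolute address 0x1040.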

The programmer may choose to determine some of the addresses of instructions and data explicitly in an object file. This may be done using directives such as ORIGIN in an assembly-language source file. In this case, the programmer must ensure that instructions and data from different object files do not overlap in memory. A more flexible approach is not to use ORIGIN directives, giving the linker the freedom to select the starting address for the object program and to assign absolute addresses accordingly. The linker ensures that different object files do not overlap with each other or with special locations in memory such as interrupt vectors.

4.4 Libraries

Subroutines written for one application program may be useful for other application programs. It is a common practice to collect object files containing such subroutines into a library file stored on the disk. The subroutines in the library can then be linked with other object files for any application program. A utility program called an archiver is used to create a library file. This file includes information needed by the linker to resolve references to external names in a program that calls library routines.

When invoking the linker, the programmer specifies the desired library files. The linker extracts the relevant object files from the library and includes them in the final object program.

4.5 The Compiler

Assembly-language programming requires knowledge of machine-specific details that vary from one computer to another. Programming in a high-level language such as C, C++, or Java does not require such knowledge. Before a program written in a high-level language can be executed in a computer, it must be translated first into assembly language and then into the machine language of the computer. A utility program called a compiler performs the first task. A source file in a high-level language is prepared by the programmer and stored on the disk. From this source file, the compiler generates assembly-language instructions and directives, and writes them into an output file. Then, the compiler invokes the assembler to assemble this file.

It is often convenient to partition a high-level source program into multiple files, grouping subroutines together based on related tasks. In each source file, the names of external subroutines and data variables in other files must be declared. This is necessary to enable the compiler to check data types and detect any errors. For each source file, the compiler generates an assembly-language file, then invokes the assembler to generate an object file. The linker combines all object files, including any library routines, to create the final object program.

An important benefit of programming in a high-level language is that the compiler automates many of the tedious tasks that a programmer has to do when programming in assembly language. For example, when generating the assembly-language representation of subroutines, the compiler performs all tasks related to managing stack frames.


4.5.1 Compiler Optimizations

If the compiler uses a straightforward approach to generate an assembly-language program from a source file written in a high-level language, it may not necessarily produce the most efficient program in terms of its execution time or its size. Improved performance can be achieved if the compiler uses techniques such as reordering the instructions produced from a straightforward approach. A compiler with such capabilities is called an optimizing compiler.

Because much of the execution time of a program is spent in loops, compilers may apply optimizations that are particularly effective for loops. For example, a high-level source program may use a memory variable as a loop counter. This variable needs to be read and written to increment its value in each pass through the loop. A straightforward assembly-language implementation of this task consists of Load, Add, and Store instructions within the loop. A better implementation is produced by a compiler that recognizes that the counter value may be maintained in a register while executing the loop. In this case, the Load and Store instructions are not needed within the loop. A Load instruction may be used before entering the loop to place an initial value into the register. A Store instruction may be needed after exiting the loop to record the final value of the counter.
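The effect of this optimization can be illustrated in C. Declaring the accumulating variable `volatile` forces the compiler to emit a load and a store on every pass, mimicking the straightforward code; the second version keeps the running value in a local variable (a register candidate) and stores it to memory once, after the loop. The function and variable names are ours.

```c
/* Memory-resident accumulator; 'volatile' prevents the compiler
   from caching it in a register. */
volatile int counter_mem;

/* Straightforward form: load + add + store on every iteration. */
int sum_naive(const int *a, int n) {
    counter_mem = 0;
    for (int i = 0; i < n; ++i)
        counter_mem = counter_mem + a[i];
    return counter_mem;
}

/* Optimized form: the running value stays in a local variable,
   and a single store records the final result after the loop. */
int sum_optimized(const int *a, int n) {
    int r = 0;                 /* counter kept in a register */
    for (int i = 0; i < n; ++i)
        r += a[i];             /* register add only          */
    counter_mem = r;           /* one store after the loop   */
    return r;
}
```

Both functions compute the same result; they differ only in how often the counter travels between the processor and the memory.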

4.5.2 Combining Programs Written in Different Languages

Section 4.3 describes the linker, which links several object files to generate the object program. In some cases, a programmer may wish to combine object files produced from source files written in a high-level language and source files written in assembly language. For example, the programmer may prepare special assembly-language subroutines that have been carefully crafted to achieve high performance. A high-level language source program may then call these assembly-language subroutines. Similarly, an assembly-language program can call high-level language subroutines.

Figure 4.1 illustrates the complete flow for generating an object program from multiple source files and library routines.

4.6 The Debugger

An object program is generated successfully when there are no syntax errors or unknown names in the source files for the program. Such problems are detected and reported by the assembler, compiler, or linker. The programmer then makes the necessary corrections in the source files.

However, when an object program is executed, it may produce incorrect results due to programming errors, or bugs, that are often difficult to isolate. To help the programmer identify such errors, a utility program called the debugger can be used. It enables the programmer to stop execution of the object program at some points of interest and to


Figure 4.1 Overall flow for generating an object program. (The original figure shows high-level language source files processed by the compiler into assembly-language files; these, together with assembly-language source files, are processed by the assembler into object files; the linker then combines the object files and library files into the final object program.)

examine the contents of various processor registers and memory locations. In this manner, the programmer can compare computed values with the expected results at any point of execution to determine where a programming error may exist. With that information, the programmer can then revise the erroneous source file.


To support the functions of the debugger, processors usually have special modes of operation and special interrupts. Two examples of debugging facilities are trace mode and breakpoints.

Trace Mode

When a processor is operating in the trace mode, an interrupt occurs after the execution of every instruction. An interrupt-service routine in the debugger program is invoked each time this interrupt occurs. It allows the debugger to assume execution control, enabling the user to enter commands for examining the contents of registers and memory locations. When the user enters a command to resume execution of the object program, a Return-from-interrupt instruction is executed. The next instruction in the program being debugged is executed, then the debugger is activated again with another interrupt. The trace-mode interrupt is automatically disabled when the debugger routine is entered, and re-enabled upon return to the object program.

Breakpoints

Breakpoints provide a similar interrupt-based debugging facility, except that the object program being debugged is interrupted only at specific points indicated by the programmer. For example, the programmer may set a breakpoint to determine whether a particular subroutine in the object program is ever reached. If it is, the debugger is activated through an interrupt. The programmer can then examine the state of processing at that point. The advantage of using a breakpoint is that execution proceeds at full speed until the breakpoint is encountered.

A special instruction called Trap or Software-interrupt is usually used to implement breakpoints. Execution of this instruction results in the same actions as when a hardware-interrupt request is received. When the debugger has execution control, it allows the user to set a breakpoint that interrupts execution just before instruction i in the object program. The debugger saves instruction i in a temporary location, and replaces it with a Software-interrupt instruction. The user then enters a command to resume execution of the object program. The debugger executes a Return-from-interrupt instruction. Instructions from the object program are processed normally until the Software-interrupt instruction is encountered. At that point, interrupt processing causes the debugger to be activated again, allowing the user to examine the state of processing.

When the user enters the command to resume execution, the debugger must perform several tasks, not only to execute instruction i but also to set the same breakpoint again. It must first restore instruction i to its original location in the program. This will be the first instruction to be executed when the program resumes execution. Then, the debugger has to reinstall the breakpoint. It needs to arrange for a second interrupt to occur after instruction i is executed. To do so, it may enable the trace mode, if available. Alternatively, it may place a temporary breakpoint at the location of instruction i + 1, then resume execution of the program being debugged. After instruction i is executed, a second interrupt occurs because of the temporary breakpoint in place of instruction i + 1. This time, the debugger restores instruction i + 1, reinstalls the breakpoint at instruction i, and resumes execution of the interrupted program.


4.7 Using a High-level Language for I/O Tasks

The compiler, the assembler, and the linker provide considerable flexibility for the programmer. Source programs may be written entirely in assembly language, entirely in a high-level language, or in a mixture of languages. Using a high-level language is preferable in most applications, because the development time is shorter and the desired code is easier to generate and maintain. In this section and the next one, we will show some example programs for I/O tasks using the C programming language to illustrate this approach.

Consider the following I/O task. A program uses the polling approach to read 8-bit characters from a keyboard and send them to a display as they are entered by a user. Chapter 3 presents examples of memory-mapped interfaces for such devices. Figure 4.2 shows an assembly-language program for this I/O task using the interfaces in Figure 3.3.

Figure 4.3 gives a C-language program that performs the same task. In the C language, a pointer may be set to any memory location, including a memory-mapped I/O location. The value of such a pointer is the address of the location in question. If the contents of this location are to be treated as a character, the pointer should be declared to be of character type. This defines the contents as being one byte in length, which is the size of the I/O registers in Figure 3.3. The define statements in Figure 4.3 are used to associate the required address constants with the symbolic names of the pointers. These statements serve the same purpose as the EQU statements in Figure 4.2. They enable the compiler to replace the symbolic names in the program with numeric values. The define statements also indicate the data type for the pointers. The compiler can then generate assembly-language instructions with known values and correct data sizes.

KBD_DATA     EQU  0x4000                Keyboard data register (8 bits).
KBD_STATUS   EQU  0x4004                Keyboard status register (bit 1 is KIN flag).
DISP_DATA    EQU  0x4010                Display data register (8 bits).
DISP_STATUS  EQU  0x4014                Display status register (bit 2 is DOUT flag).

             Move       R2, #KBD_DATA   Pointer to keyboard device interface.
             Move       R3, #DISP_DATA  Pointer to display device interface.
KBD_LOOP:    LoadByte   R4, 4(R2)       Check if there is a character
             And        R4, R4, #2      from the keyboard.
             Branch_if_[R4]=0 KBD_LOOP
             LoadByte   R5, (R2)        Read the received character.
DISP_LOOP:   LoadByte   R4, 4(R3)       Check if the display
             And        R4, R4, #4      is ready for a character.
             Branch_if_[R4]=0 DISP_LOOP
             StoreByte  R5, (R3)        Write the received character to the display.
             Branch     KBD_LOOP

Figure 4.2 Assembly-language program for transferring characters from a keyboard to a display.


/* Define register addresses. */
#define KBD_DATA (volatile char *) 0x4000
#define KBD_STATUS (volatile char *) 0x4004
#define DISP_DATA (volatile char *) 0x4010
#define DISP_STATUS (volatile char *) 0x4014

void main()
{
    char ch;

    /* Transfer the characters. */
    while (1) {                             /* Infinite loop. */
        while ((*KBD_STATUS & 0x2) == 0);   /* Wait for a new character. */
        ch = *KBD_DATA;                     /* Read the character from the keyboard. */
        while ((*DISP_STATUS & 0x4) == 0);  /* Wait for display to become ready. */
        *DISP_DATA = ch;                    /* Transfer the character to the display. */
    }
}

Figure 4.3 C program that performs the same task as the assembly-language program in Figure 4.2.

Note that the KBD_STATUS and DISP_STATUS pointers are declared as being volatile. This is necessary because the program only reads the contents of the corresponding locations. No data are written to those locations. An optimizing compiler may remove program statements that appear to have no impact, which include statements referring to locations in memory that are read but never written. Since the contents of the memory-mapped KBD_STATUS and DISP_STATUS registers change under influences external to the program, it is essential to inform the compiler of this fact. The compiler will not remove statements that involve pointers or other variables that are declared to be volatile.

For a computer that includes a cache memory, some compilers have an additional interpretation for volatile pointers or variables. The cache is a small, fast memory that holds copies of data in the main memory. Instructions that refer to locations in memory are executed more quickly when data are available in the cache. However, data from memory-mapped I/O registers should not be kept in the cache because the contents of those registers change under external influences. Thus, references to these locations should bypass the cache and directly access the I/O registers. Declaring pointers to such locations as volatile can inform a compiler to not only prevent unwanted optimizations, but also to generate memory-access instructions that bypass the cache.

In Figure 4.3, we included numeric constants for the specific values that represent the bit positions in the two status registers. For example, the constant 0x2 in the statement

while ((*KBD_STATUS & 0x2) == 0);

is used to detect whether bit b1 in the KBD_STATUS register is set. This approach is


used here to make it easier to compare the given values with the specification of the device interfaces in Figure 3.3. A more usual approach in writing C programs is to include define statements to associate meaningful names with such constant values and then use the names in the rest of the program.

4.8 Interaction between Assembly Language and C Language

Occasionally, a program may require access to control registers in a processor. For example, this is needed in the initialization for an interrupt-service routine. Based on a statement in a high-level language, a compiler cannot generate assembly-language instructions that access control registers in a processor. Since assembly-language instructions are needed for this purpose, the compiler allows assembly-language instructions to be included directly in a high-level language program. This section illustrates this approach.

Consider an I/O task to transfer characters from a keyboard to a display. Let interrupts be used to receive characters from the keyboard interface. To make the example simple, assume that the interrupt-service routine sends each received character directly to the display interface without polling its status. This assumes that the characters are received at a rate that is low enough for the display to handle.

The initialization in the program for this task requires accessing I/O registers and processor control registers. The I/O interface in Figure 3.3 should be configured to raise an interrupt request when KIN = 1. The corresponding interrupt-enable bit in the KBD_CONT register, KIE, has to be set to 1. It is also necessary to enable interrupts in the processor by setting to 1 the IE bit in the processor status (PS) register and the KBD bit in the IENABLE control register in Figure 3.7.

Chapter 3 describes different methods of identifying the starting address of an interrupt-service routine that has to be executed when a particular interrupt is raised. The method of vectored interrupts for different sources uses predetermined memory locations that hold the addresses of the corresponding interrupt-service routines. In this section, we will assume for simplicity that there is a single interrupt vector, IVECT, at address 0x20 for all interrupts. This vector must be initialized with the address of the interrupt-service routine.

Figure 4.4 shows an assembly-language program that uses interrupts to read characters from the keyboard. The main program loads the address of the interrupt-service routine into location IVECT. It sets to 1 the KIE bit in the control register of the keyboard interface, and the interrupt-enable bits in the IENABLE and PS registers of the processor. On each interrupt from the keyboard interface, the interrupt-service routine reads the input character, then sends it to the display.

Consider now using a C program to accomplish the same I/O task. A high-level language such as C is not designed to handle hardware features such as interrupts. To write a C program that uses interrupts we need to address two questions:

• How do we access processor control registers?
• How do we write an interrupt-service routine?


IVECT        EQU  0x20             Vector for interrupt-service routine.
KBD_DATA     EQU  0x4000           Keyboard data register (8 bits).
KBD_STATUS   EQU  0x4004           Keyboard status register (bit 1 is KIN flag).
KBD_CONT     EQU  0x4008           Keyboard control register (bit 1 is KIE flag).
DISP_DATA    EQU  0x4010           Display data register (8 bits).
DISP_STATUS  EQU  0x4014           Display status register (bit 2 is DOUT flag).

Main program
MAIN:        Move        R2, #KBD_DATA    Pointer to keyboard interface.
             Move        R3, #0x2
             StoreByte   R3, 8(R2)        Configure the keyboard to cause interrupts.
             Move        R2, #IVECT       Pointer to vector.
             Move        R3, #INTSERV     Start of interrupt-service routine.
             Store       R3, (R2)         Set interrupt vector.
             Move        R2, #0x2         Allow the processor to recognize keyboard
                                          interrupts.
             MoveControl IENABLE, R2
             Move        R2, #0x1         Set the interrupt-enable bit for the processor.
             MoveControl PS, R2
LOOP:        Branch      LOOP             Continuous wait loop.

Interrupt-service routine
INTSERV:     Subtract    SP, SP, #8       Save registers.
             Store       R2, 4(SP)
             Store       R3, (SP)
             Move        R2, #KBD_DATA    Pointer to keyboard interface.
             LoadByte    R3, (R2)         Read next character.
             Move        R2, #DISP_DATA   Pointer to display interface.
             StoreByte   R3, (R2)         Write the received character to the display.
             Load        R2, 4(SP)        Restore registers.
             Load        R3, (SP)
             Add         SP, SP, #8
             Return-from-interrupt

Figure 4.4 Assembly-language program for character transfer using interrupts.

The interrupt approach requires setting control bits in the IENABLE and PS registers as part of initialization. The pointer-based approach used in Figure 4.3 to access memory-mapped I/O registers cannot be used because the IENABLE and PS control registers do not have addresses. Instead, these registers can be accessed by including suitable assembly-language instructions directly in the C program. A special directive to the compiler makes this possible. For example, the statement

asm ("MoveControl PS, R2");


#define KBD_DATA (volatile char *) 0x4000
#define DISP_DATA (volatile char *) 0x4010

void main()
{
    ...
}

void intserv()
{
    *DISP_DATA = *KBD_DATA;    /* Transfer a character. */
}

Figure 4.5 Representing an interrupt-service routine as a function in a C program.

causes the C compiler to insert the assembly-language instruction between the quotes into the compiled code. Since register R2 may already be used by compiler-generated instructions, its contents must not be corrupted by any inserted assembly-language instructions. A simple solution is to save the contents of R2 on the stack before R2 is modified for use by the MoveControl instruction, and then restore them after this instruction. We will use this approach. But, we should note that compilers provide more sophisticated methods for managing the use of registers specified in the asm directives.

The second issue is the interrupt-service routine. The C language requires this routine to be written as a function. However, the compiler implements all C functions as subroutines that implicitly end with a Return-from-subroutine instruction. Figure 4.5 gives an example. There is a main function that performs some unspecified task. The function named intserv transfers one character from the keyboard to the display. The compiler-generated code for the function intserv is

<save registers>
LoadByte  R2, 0x4000(R0)
StoreByte R2, 0x4010(R0)
<restore registers>
Return-from-subroutine

Since the I/O register addresses fit within 16 bits, the compiler can use the Absolute addressing mode, with register R0, which always contains the value zero, as discussed in Section 2.4.3.

To use the function intserv as an interrupt-service routine, it must end with a Return-from-interrupt instruction. This instruction is needed to restore the contents of the program counter and the processor status register to their values at the time the interrupt occurred. We can insert the Return-from-interrupt as the last statement of the intserv function in the


program using the statement

asm ("Return-from-interrupt");

With this statement, the compiled code for the function will be

<save registers>
LoadByte  R2, 0x4000(R0)
StoreByte R2, 0x4010(R0)
Return-from-interrupt
<restore registers>
Return-from-subroutine

The compiler still includes the code to restore registers and the Return-from-subroutine instruction at the end as it does for all functions. However, the inclusion of the Return-from-interrupt instruction means that the code after it will never be executed. Since interrupts can occur at any point in the program, failure to restore the original value of a register that is modified in the function causes the subsequent execution of the program to be incorrect. More critically, failure to restore the correct value of the stack pointer causes corruption of the stack frames for nested subroutines.

There are two approaches for correctly supporting interrupts in a high-level language such as C. The first approach requires extending the syntax of the language with a special keyword for identifying interrupt-service routines. For example, a C compiler may recognize the keyword interrupt at the beginning of a function definition, such as

interrupt void intserv () { . . . }

for the function in Figure 4.5. This keyword instructs the compiler to substitute the Return-from-interrupt instruction in place of the Return-from-subroutine instruction. Registers are still saved and restored as before. Not all C compilers provide this feature.

The second approach is to prepare an interrupt handler using assembly language and use the linker to link it to the C program. In this case, the handler must first save the link register, because the interrupt may occur after a subroutine call in the main program. After saving the link register, the interrupt handler can call a C-language subroutine that services the interrupt. In this manner, no special keyword is needed in the high-level language source file. Upon return from the subroutine, the link register is restored and the Return-from-interrupt instruction is executed.

We can now write a C program that uses interrupts to transfer characters from the keyboard to the display. Figure 4.6 gives a possible program that is equivalent to Figure 4.4. We use the approach based on the special keyword for the C compiler because it allows the entire program to be in a single high-level language source file. Note that the pointers to memory-mapped I/O registers are of character type because they point to locations that correspond to 8-bit registers in the device interfaces. The pointer IVECT is of unsigned integer type because it points to a memory location that stores a 4-byte interrupt vector.


#define IVECT (volatile unsigned int *) 0x20
#define KBD_DATA (volatile char *) 0x4000
#define KBD_CONT (volatile char *) 0x4008
#define DISP_DATA (volatile char *) 0x4010
#define DISP_STATUS (volatile char *) 0x4014

interrupt void intserv();    /* Forward declaration. */

void main()
{
    /* Initialize for interrupt-based character transfers. */
    *KBD_CONT = 0x2;                   /* Enable keyboard interrupts. */
    *IVECT = (unsigned int) &intserv;  /* Set interrupt vector. */
    asm ("Subtract SP, SP, #4");       /* Save register R2. */
    asm ("Store R2, (SP)");
    asm ("Move R2, #0x2");             /* Allow processor to recognize keyboard interrupts. */
    asm ("MoveControl IENABLE, R2");
    asm ("Move R2, #0x1");             /* Enable interrupts for processor. */
    asm ("MoveControl PS, R2");
    asm ("Load R2, (SP)");             /* Restore register R2. */
    asm ("Add SP, SP, #4");

    while (1)    /* Continuous loop. */
    {
        /* Transfer the characters using interrupt-service routine. */
    }
}

interrupt void intserv()    /* Keyword instructs compiler to treat function as interrupt routine. */
{
    *DISP_DATA = *KBD_DATA;    /* Transfer a character. */
    /* Compiler will insert Return-from-interrupt instruction at end of function. */
}

Figure 4.6 C program for character transfer using interrupts.

4.9 The Operating System

The preceding sections describe how application programs are prepared and executed with the aid of various utility programs. All of the tasks described in this chapter are facilitated by the operating system (OS), which is a key software component in most computers. It is responsible for the coordination of all activities in a computer. The OS software normally consists of essential routines that always reside in the memory of the computer, and various


utility programs that are stored on a magnetic disk to be loaded into the memory and executed when needed.

The OS manages the processing, memory, and input/output resources of the computer during the execution of programs. It interprets user commands, assigns memory and disk space, moves information between the memory and the disk, and handles I/O operations. It makes it possible for a user to use the text editor, compiler, assembler, and linker to prepare application programs. The loader is normally part of the OS, and it is invoked when a user enters a command to execute an application program. Our objective in this section is to provide a basic appreciation of the important functions performed by the OS. A more thorough discussion is outside the scope of this book (see Reference [1]).

4.9.1 The Boot-strapping Process

The OS for a general-purpose computer is a large and complex collection of software. All parts of the OS, including the portion that always resides in memory, are normally stored on the disk. A process called boot-strapping is used to load the memory-resident portion of the OS into the memory so that it can begin execution and assume control over the resources of the computer.

The boot-strapping process begins when the computer is turned on and the processor fetches the first instruction from a predetermined location. That location must be in a permanent portion of the memory that retains its contents when the computer is turned off. A small program placed at that location enables the processor to transfer progressively larger parts of the OS from the disk to the portion of the memory that is not permanent. Each program executed in this boot-strapping sequence transfers more of the OS from the disk into the memory, and performs any necessary initialization of the memory and I/O devices of the computer. Ultimately, the loader and the portion of the OS responsible for processing user commands are transferred into the memory. This enables the OS to begin accepting commands to load and execute application programs stored in files on the disk.

4.9.2 Managing the Execution of Application Programs

To understand the basics of operating systems, let us consider a computer with a processor and I/O devices consisting of a keyboard, a display, a disk, and a printer. We first discuss the steps involved in running one application program. Then, we will describe how the OS manages the execution of multiple application programs.

To execute an application program stored in a file on the disk, the user enters a command that causes the loader to transfer this file into the memory. When the transfer is complete, execution of the program is started. Assume that the program's task involves reading a data file from the disk into the memory, performing some computation on the data, and printing the results. When execution of the program reaches the point where the data file is needed, the program requests the OS to transfer the data file from the disk to the memory. Once the data are transferred, the OS passes execution control back to the application program, which proceeds to perform the required computation. When the computation


[Figure: time-line showing activity of the user program, OS routines, the disk, and the printer over the interval t0 through t5.]

Figure 4.7 Time-line to illustrate execution control moving between user program and OS routines.

is completed and the results stored in memory are ready to be printed, the application program again sends a request to the OS. An OS routine is then executed to print the results.

Execution control passes back and forth between the application program and the OS routines, which share the processor to perform their respective tasks. A convenient way to illustrate this activity is with a time-line diagram, such as that shown in Figure 4.7. During the time period t0 to t1, the loader transfers the object program from the disk to the memory. At t1, the OS passes execution control to the application program, which runs until it needs the data on the disk. The OS transfers the required data during the period t2 to t3. Finally, the OS prints the results stored in the memory during the period t4 to t5.

Computer resources can be used more efficiently if there are several application programs to be executed. Note that the disk and the processor are idle during most of the time period t4 to t5 in Figure 4.7. If the user is allowed to enter a command during this period, the OS can load and begin execution of another program while the printer is printing. The result is concurrent processing of the computation and I/O requests of the two programs when they are not competing for access to the same resource in the computer. The OS is responsible for managing the concurrent execution of several application programs to make the best possible use of all computer resources. This approach to concurrent execution is called multiprogramming or multitasking. It is a mode of operation in which the processor executes several programs in some interleaved time order, overlapped with tasks performed by different I/O devices.


4.9.3 Use of Interrupts in Operating Systems

The operating system makes extensive use of interrupts to perform I/O operations, as well as to communicate with and control the execution of programs. The interrupt mechanism enables the OS to assign priorities, switch from one program to another, terminate programs, implement security and protection features, and coordinate I/O activities. We will discuss some of these aspects briefly to illustrate how interrupts are used.

The OS incorporates the interrupt-service routines for all devices connected to a computer that are capable of raising interrupts. In a general-purpose computer with an operating system, application programs do not directly perform I/O operations themselves. When an application program needs an input or an output operation, it points to the data to be transferred and asks the OS to perform the operation. The request from the application program is often made through a library subroutine that raises a software interrupt to enter the OS routines. The OS temporarily suspends the execution of the requesting program, then initiates the requested I/O operation. When the I/O operation is completed, the OS is normally informed of this condition through a hardware interrupt. The OS then allows the suspended program to resume execution. The OS and the application program pass control back and forth using software interrupts.

The OS provides a variety of services to application programs. To facilitate the implementation of these services, a processor may have several different Software-interrupt instructions, each with its own interrupt vector. They can be used to call different parts of the OS, depending on the service being requested. Alternatively, a processor may have only one Software-interrupt instruction, with an immediate operand to indicate the desired service.

The OS must ensure that the execution of an application program is terminated properly. Executing an appropriate Software-interrupt instruction at the end of the application program instructs the OS to assume control and complete the termination. Recall that information about the starting location and length of the program in the memory is included in the header of an object program. The OS uses this information to recover the space allocated to the program. The recovered space is then available for another application program.

To achieve multitasking, the OS accepts a new command from the user at any time. It loads and begins execution of the requested program when all the resources needed by that program are available.

Example of Multitasking

To illustrate the interaction between application programs and the OS, let us consider an example that involves multitasking. A common OS technique that makes this possible is called time slicing. Each program runs for a short period, τ, called a time slice. Then another program runs for its time slice, and so on. The period τ is determined by a continuously running hardware timer, which generates an interrupt every τ seconds.

Figure 4.8 describes the routines needed to implement some of the essential functions in a multitasking environment. At the time the operating system is started, an initialization routine, called OSINIT in Figure 4.8a, is executed. Among other things, this routine sets the interrupt vector locations in the memory. The values written to the vector locations


OSINIT        Set interrupt vectors:
                Timer interrupt → SCHEDULER
                Software interrupt → OSSERVICES
                I/O interrupt → IODATA
              ...

OSSERVICES    Examine stack or processor registers
              to determine requested operation.
              Call appropriate routine.

SCHEDULER     Save program state of current running process.
              Select another runnable process.
              Restore saved program state of new process.
              Return from interrupt.

(a) OS initialization, services, and scheduler

IOINIT        Set requesting process state to Blocked.
              Initialize memory buffer address pointer and counter.
              Call device driver to initialize driver
              and enable interrupts in the device interface.
              Return from subroutine.

IODATA        Poll devices to determine source of interrupt.
              Call appropriate driver.
              If END = 1, then set I/O-blocked process state to Runnable.
              Return from interrupt.

(b) I/O routines

KBDINIT       Enable interrupts.
              Return from subroutine.

KBDDATA       Check device status.
              If ready, then transfer character.
              If character = CR, then {set END = 1; Disable interrupts}
              else set END = 0.
              Return from subroutine.

(c) Keyboard driver

Figure 4.8 Examples of operating system routines.


are the starting addresses of the interrupt-service routines for the corresponding interrupts. For example, OSINIT loads the starting address of a routine called SCHEDULER in the interrupt vector corresponding to the timer interrupt. Hence, at the end of each time slice, the timer interrupt causes this routine to be executed.

A program, together with any information that describes its current state of execution, is regarded as an entity called a process. A process can be in one of three states: Running, Runnable, or Blocked. The Running state means that the program is currently being executed. The process is Runnable if the program is ready and waiting to be selected for execution. The third state, Blocked, means that the program is not ready to resume execution for some reason. For example, it may be waiting for completion of an I/O operation that it requested earlier.

Assume that program A is in the Running state during a given time slice. At the end of that time slice, the timer interrupts the execution of this program and starts the execution of SCHEDULER. This is an operating system routine whose function is to determine which user program should run in the next time slice. It starts by saving all of the information that will be needed later when execution of program A is resumed. The information saved includes the contents of registers, including the program counter and the processor status register. Registers must be saved because they may contain intermediate results for a computation in progress at the time of interruption. The program counter points to the location where execution is to resume later. The processor status register reflects the current program state.

Then, SCHEDULER selects for execution some other program, B, that was suspended earlier and is in the Runnable state. It restores all information saved at the time program B was suspended, including the contents of the program counter and status register, and executes a Return-from-interrupt instruction. As a result, program B resumes execution for τ seconds, at the end of which the timer raises an interrupt again, and a context switch to another runnable program takes place.

Suppose that program A is currently executing and needs to read a line of characters from the keyboard. Instead of performing the operation itself, it requests I/O service from the operating system. It uses the stack or the processor registers to pass information to the OS describing the required operation, the I/O device, and the address of a buffer in the program data area where the characters from the keyboard should be placed. Then it raises a software interrupt. The corresponding interrupt vector points to the OSSERVICES routine in Figure 4.8a. This routine examines the information on the stack or in registers, and initiates the requested operation by calling an appropriate OS routine. In our example, it calls IOINIT in Figure 4.8b, which is a general routine responsible for starting I/O operations.

While an I/O operation is in progress, the program that requested it cannot continue execution. Hence, the IOINIT routine sets the process associated with program A into the Blocked state. The IOINIT routine carries out any preparations needed for the I/O operation, such as initializing address pointers and byte count, then calls a routine that initializes the specific device for the requested I/O operation.

It is common practice in OS design to encapsulate all software pertaining to a particular I/O device into a self-contained module called the device driver. Such a module can be easily added to or deleted from the OS. We have assumed that the device driver for the keyboard consists of two routines, KBDINIT and KBDDATA, as shown in Figure 4.8c.


The IOINIT routine calls KBDINIT, which performs any initialization operations needed by the device or its interface circuit. KBDINIT also enables interrupts in the interface circuit by setting the appropriate bit in its control register, and then it returns to IOINIT, which returns to OSSERVICES. The keyboard interface is now ready to participate in a data transfer operation. It will generate an interrupt request whenever a key is pressed.

Following the return to OSSERVICES, the SCHEDULER routine selects another user program to run. Of course, the scheduler will not select program A, because that program has requested an I/O operation and is now in the Blocked state. Instead, program B or some other program in the Runnable state is selected. The Return-from-interrupt instruction that causes the selected user program to begin execution will also re-enable interrupts in the processor by loading the saved contents into the processor status register. Thus, an interrupt request generated by the keyboard’s interface will be accepted. The interrupt vector for this interrupt points to an OS routine called IODATA. Because there could be several devices requesting an interrupt, IODATA begins by polling these devices to identify the one requesting service. Then, it calls the appropriate device driver to service the request. In our example, the driver called will be KBDDATA, which will transfer one character of data. If the character is a Carriage Return, it will also set to 1 a flag called END, to inform IODATA that the requested I/O operation has been completed. At this point, the IODATA routine changes the state of process A from Blocked to Runnable, so that the scheduler may select it for execution in some future time slice.

4.10 Concluding Remarks

Software is the key factor contributing to the versatility and usefulness of a computer. Utility programs allow users to create, execute, and debug application software. Programmers have the flexibility to combine high-level language source files, assembly-language source files, and library files using the compiler, the assembler, and the linker to generate object programs. When necessary, assembly-language instructions may be included within a high-level language source file.

The power of a computer is greatly enhanced with the software of the operating system, which manages and coordinates all activities. Multitasking by the operating system permits different activities to proceed concurrently for multiple application programs, thus making the best use of the computer.

Problems

4.1 [E] Write a C program to perform the task described in Problem 3.2.

4.2 [M] Write a C program to perform the task described in Problem 3.9.

4.3 [M] Write a C program to perform the task described in Problem 3.13.

4.4 [D] Write a C program to perform the task described in Problem 3.15.


4.5 [D] Write a C program to perform the task described in Problem 3.17.

4.6 [D] Write a C program to perform the task described in Problem 3.17, but use an interrupt-service routine associated with the timer.

4.7 [D] Assume that the instruction set of a processor includes the instruction

MultiplyAccumulate Ri, Rj, Rk

that performs the operation Ri ← [Ri] + [Rj] × [Rk] using processor registers Ri, Rj, and Rk. Such an instruction is described in Section 2.12.1.

Assume that the compiler does not use this instruction when it generates assembly-language output. Assume that there are three variables X, Y, and Z defined as global variables in a C program. Write a function mult_acc_XYZ in the C language that uses the Multiply-Accumulate instruction to compute X = X + Y × Z. Note that the compiler-generated assembly-language instructions in this function and in the calling program may use processor registers to hold data.

4.8 [D] Section 4.9.2 discusses how the input and output steps of a collection of programs such as the one shown in Figure 4.7 could be overlapped to reduce the total time needed to execute them. Let each of the six OS routine execution intervals be 1 unit of time, with each disk operation requiring 3 units, printing requiring 3 units, and each program execution interval requiring 2 units. Compute the ratio of best overlapped time to non-overlapped time for a long sequence of programs. Ignore startup and ending transients.

4.9 [D] Section 4.9.2 indicated that program computation can be overlapped with either input or output operations or both. Ignoring the relatively short time needed for OS routines, what is the ratio of best overlapped time to non-overlapped time for completing the execution of a collection of programs, where each program has about equal balance among input, compute, and output activities?

4.10 [M] In the discussion of the three process states in Section 4.9.3, transitions from Runnable to Running, Running to Blocked, and Blocked to Runnable are described. What other direct transitions between these states are possible for a process? Which ones are not? Explain each of your choices briefly.

References

1. A. Silberschatz, P. B. Galvin, and G. Gagne, Operating System Concepts, 8th ed., John Wiley and Sons, Hoboken, New Jersey, 2008.


c h a p t e r

5

Basic Processing Unit

Chapter Objectives

In this chapter you will learn about:

• Execution of instructions by a processor

• The functional units of a processor and how they are interconnected

• Hardware for generating control signals

• Microprogrammed control


In this chapter we focus on the processing unit, which executes machine-language instructions and coordinates the activities of other units in a computer. We examine its internal structure and show how it performs the tasks of fetching, decoding, and executing such instructions. The processing unit is often called the central processing unit (CPU). The term “central” is not as appropriate today as it was in the past, because today’s computers often include several processing units. We will use the term processor in this discussion.

The organization of processors has evolved over the years, driven by developments in technology and the desire to provide high performance. To achieve high performance, it is prudent to make various functional units of a processor operate in parallel as much as possible. Such processors have a pipelined organization where the execution of an instruction is started before the execution of the preceding instruction is completed. Another approach, known as superscalar operation, is to fetch and start the execution of several instructions at the same time. Pipelining and superscalar approaches are discussed in Chapter 6. In this chapter, we concentrate on the basic ideas that are common to all processors.

5.1 Some Fundamental Concepts

A typical computing task consists of a series of operations specified by a sequence of machine-language instructions that constitute a program. The processor fetches one instruction at a time and performs the operation specified. Instructions are fetched from successive memory locations until a branch or a jump instruction is encountered. The processor uses the program counter, PC, to keep track of the address of the next instruction to be fetched and executed. After fetching an instruction, the contents of the PC are updated to point to the next instruction in sequence. A branch instruction may cause a different value to be loaded into the PC.

When an instruction is fetched, it is placed in the instruction register, IR, from where it is interpreted, or decoded, by the processor’s control circuitry. The IR holds the instruction until its execution is completed.

Consider a 32-bit computer in which each instruction is contained in one word in the memory, as in a RISC-style instruction set architecture. To execute an instruction, the processor has to perform the following steps:

1. Fetch the contents of the memory location pointed to by the PC. The contents of this location are the instruction to be executed; hence they are loaded into the IR. In register transfer notation, the required action is

IR ← [[PC]]

2. Increment the PC to point to the next instruction. Assuming that the memory is byte addressable, the PC is incremented by 4; that is,

PC ← [PC] + 4

3. Carry out the operation specified by the instruction in the IR.


Fetching an instruction and loading it into the IR is usually referred to as the instruction fetch phase. Performing the operation specified in the instruction constitutes the instruction execution phase.

With few exceptions, the operation specified by an instruction can be carried out by performing one or more of the following actions:

• Read the contents of a given memory location and load them into a processor register.
• Read data from one or more processor registers.
• Perform an arithmetic or logic operation and place the result into a processor register.
• Store data from a processor register into a given memory location.

The hardware components needed to perform these actions are shown in Figure 5.1. The processor communicates with the memory through the processor-memory interface, which transfers data from and to the memory during Read and Write operations. The instruction address generator updates the contents of the PC after every instruction is fetched. The register file is a memory unit whose storage locations are organized to form the processor’s general-purpose registers. During execution, the contents of the registers named in an instruction that performs an arithmetic or logic operation are sent to the arithmetic and logic

[Figure: the instruction address generator with the PC, the IR, the control circuitry, the register file, the ALU, and the processor-memory interface.]

Figure 5.1 Main hardware components of a processor.


unit (ALU), which performs the required computation. The results of the computation are stored in a register in the register file.

Before we examine these units and their interaction in detail, it is helpful to consider the general structure of any data processing system.

Data Processing Hardware

A typical computation operates on data stored in registers. These data are processed by combinational circuits, such as adders, and the results are placed into a register. Figure 5.2 illustrates this structure. A clock signal is used to control the timing of data transfers. The registers comprise edge-triggered flip-flops into which new data are loaded at the active edge of the clock. In this chapter, we assume that the rising edge of the clock is the active edge. The clock period, which is the time between two successive rising edges, must be long enough to allow the combinational circuit to produce the correct result.

The operation performed by the combinational block in Figure 5.2 may be quite complex. It can often be broken down into several simpler steps, where each step is performed by a subcircuit of the original circuit. These subcircuits can then be cascaded into a multi-stage structure as shown in Figure 5.3. Then, if n stages are used, the operation will be completed in n clock cycles. Since these combinational subcircuits are smaller, they can complete their operation in less time, and hence a shorter clock period can be used. A key advantage of the multi-stage structure is that it is suitable for pipelined operation, as will be discussed in Chapter 6. Such a structure is particularly useful for implementing processors that have a RISC-style instruction set. The discussion in the remainder of this chapter focuses on processors that use a multi-stage structure of this type. In Section 5.7 we will consider a more traditional alternative that is suitable for CISC-style processors.

[Figure: register stage A feeds a combinational logic circuit whose outputs are loaded into register stage B; both register stages are driven by a common clock.]

Figure 5.2 Basic structure for data processing.


[Figure: register stage A, followed by three stages, each consisting of a logic circuit feeding a set of registers, ending at register stage B; all registers are driven by a common clock.]

Figure 5.3 A hardware structure with multiple stages.

5.2 Instruction Execution

Let us now examine the actions involved in fetching and executing instructions. We illustrate these actions using a few representative RISC-style instructions.

5.2.1 Load Instructions

Consider the instruction

Load R5, X(R7)

which uses the Index addressing mode to load a word of data from memory location X + [R7] into register R5. Execution of this instruction involves the following actions:

• Fetch the instruction from the memory.
• Increment the program counter.
• Decode the instruction to determine the operation to be performed.
• Read register R7.
• Add the immediate value X to the contents of R7.
• Use the sum X + [R7] as the effective address of the source operand, and read the contents of that location in the memory.
• Load the data received from the memory into the destination register, R5.


Depending on how the hardware is organized, some of these actions can be performed at the same time. In the discussion that follows, we will assume that the processor has five hardware stages, which is a commonly used arrangement in RISC-style processors. Execution of each instruction is divided into five steps, such that each step is carried out by one hardware stage. In this case, fetching and executing the Load instruction above can be completed as follows:

1. Fetch the instruction and increment the program counter.

2. Decode the instruction and read the contents of register R7 in the register file.

3. Compute the effective address.

4. Read the memory source operand.

5. Load the operand into the destination register, R5.

5.2.2 Arithmetic and Logic Instructions

Instructions that involve an arithmetic or logic operation can be executed using similar steps. They differ from the Load instruction in two ways:

• There are either two source registers, or a source register and an immediate source operand.

• No access to memory operands is required.

A typical instruction of this type is

Add R3, R4, R5

It requires the following steps:

1. Fetch the instruction and increment the program counter.

2. Decode the instruction and read the contents of source registers R4 and R5.

3. Compute the sum [R4] + [R5].

4. Load the result into the destination register, R3.

The Add instruction does not require access to an operand in the memory, and therefore could be completed in four steps instead of the five steps needed for the Load instruction. However, as we will see in the next chapter, it is advantageous to use the same multi-stage processing hardware for as many instructions as possible. This can be achieved if we arrange for all instructions to be executed in the same number of steps. To this end, the Add instruction should be extended to five steps, patterned along the steps of the Load instruction. Since no access to memory operands is required, we can insert a step in which no action takes place between steps 3 and 4 above. The Add instruction would then be performed as follows:

1. Fetch the instruction and increment the program counter.

2. Decode the instruction and read registers R4 and R5.

3. Compute the sum [R4] + [R5].


4. No action.

5. Load the result into the destination register, R3.

If the instruction uses an immediate operand, as in

Add R3, R4, #1000

the immediate value is given in the instruction word. Once the instruction is loaded into the IR, the immediate value is available for use in the addition operation. The same five-step sequence can be used, with steps 2 and 3 modified as:

2. Decode the instruction and read register R4.

3. Compute the sum [R4] + 1000.

5.2.3 Store Instructions

The five-step sequence used for the Load and Add instructions is also suitable for Store instructions, except that the final step of loading the result into a destination register is not required. The hardware stage responsible for this step takes no action. For example, the instruction

Store R6, X(R8)

stores the contents of register R6 into memory location X + [R8]. It can be implemented as follows:

1. Fetch the instruction and increment the program counter.

2. Decode the instruction and read registers R6 and R8.

3. Compute the effective address X + [R8].

4. Store the contents of register R6 into memory location X + [R8].

5. No action.

After reading register R8 in step 2, the memory address is computed in step 3 using the immediate value, X, in the IR. In step 4, the contents of R6 are sent to the memory to be stored. No action is taken in step 5.

In summary, the five-step sequence of actions given in Figure 5.4 is suitable for all instructions in a RISC-style instruction set. RISC-style instructions are one word long and only Load and Store instructions access operands in the memory, as explained in Chapter 2. Instructions that perform computations use data that are either stored in general-purpose registers or given as immediate data in the instruction.

The five-step sequence is suitable for all Load and Store instructions, because the addressing modes that can be used in these instructions are special cases of the Index mode. Most RISC-style processors provide one general-purpose register, usually register R0, that always contains the value zero. When R0 is used as the index register, the effective address of the operand is the immediate value X. This is the Absolute addressing mode. Alternatively, if the offset X is set to zero, the effective address is the contents of the index register, Ri. This is the Indirect addressing mode. Thus, only one addressing mode, the Index mode,


Step Action

1 Fetch an instruction and increment the program counter.

2 Decode the instruction and read registers from the register file.

3 Perform an ALU operation.

4 Read or write memory data if the instruction involves a memory operand.

5 Write the result into the destination register, if needed.

Figure 5.4 A five-step sequence of actions to fetch and execute an instruction.

needs to be implemented, resulting in a significant simplification of the processor hardware. The task of selecting R0 as the index register or setting X to zero is left to the assembler or the compiler. This is consistent with the RISC philosophy of aiming for simple and fast hardware at the expense of higher compiler complexity and longer compilation time. The result is a net gain in the time needed to perform various tasks on a computer, because programs are compiled much less frequently than they are executed.

5.3 Hardware Components

The discussion above indicates that all instructions of a RISC-style processor can be executed using the five-step sequence in Figure 5.4. Hence, the processor hardware may be organized in five stages, such that each stage performs the actions needed in one of the steps. We now examine the components in Figure 5.1 to see how they may be organized in the multi-stage structure of Figure 5.3.

5.3.1 Register File

General-purpose registers are usually implemented in the form of a register file, which is a small and fast memory block. It consists of an array of storage elements, with access circuitry that enables data to be read from or written into any register. The access circuitry is designed to enable two registers to be read at the same time, making their contents available at two separate outputs, A and B. The register file has two address inputs that select the two registers to be read. These inputs are connected to the fields in the IR that specify the source registers, so that the required registers can be read. The register file also has a data input, C, and a corresponding address input to select the register into which data are to be written. This address input is connected to the IR field that specifies the destination register of the instruction.

The inputs and outputs of any memory unit are often called input and output ports. A memory unit that has two output ports is said to be dual-ported. Figure 5.5 shows two ways


[Figure: (a) a single memory block with address inputs A, B, and C, one input-data port, and two output-data ports; (b) two memory blocks, each providing one read port (A or B), sharing the write address C and the input data.]

Figure 5.5 Two alternatives for implementing a dual-ported register file.


of realizing a dual-ported register file. One possibility is to use a single set of registers with duplicate data paths and access circuitry that enable two registers to be read at the same time. An alternative is to use two memory blocks, each containing one copy of the register file. Whenever data are written into a register, they are written into both copies of that register. Thus, the two files have identical contents. When an instruction requires data from two registers, one register is accessed in each file. In effect, the two register files together function as a single dual-ported register file.

5.3.2 ALU

The arithmetic and logic unit is used to manipulate data. It performs arithmetic operations such as addition and subtraction, and logic operations such as AND, OR, and XOR. Conceptually, the register file and the ALU may be connected as shown in Figure 5.6. When an instruction that performs an arithmetic or logic operation is being executed, the contents of the two registers specified in the instruction are read from the register file and become

[Figure: register file with address inputs A, B, and C; output A drives ALU input InA, while multiplexer MuxB selects between output B and the immediate value for ALU input InB; the ALU output Out returns to register-file data input C.]

Figure 5.6 Conceptual view of the hardware needed for computation.


available at outputs A and B. Output A is connected directly to the first input of the ALU, InA, and output B is connected to a multiplexer, MuxB. The multiplexer selects either output B of the register file or the immediate value in the IR to be connected to the second ALU input, InB. The output of the ALU is connected to the data input, C, of the register file so that the results of a computation can be loaded into the destination register.

5.3.3 Datapath

Instruction processing consists of two phases: the fetch phase and the execution phase. It is convenient to divide the processor hardware into two corresponding sections. One section fetches instructions and the other executes them. The section that fetches instructions is also responsible for decoding them and for generating the control signals that cause appropriate actions to take place in the execution section. The execution section reads the data operands specified in an instruction, performs the required computations, and stores the results.

We now need to organize the hardware into a multi-stage structure similar to that in Figure 5.3, with stages corresponding to the five steps in Figure 5.4. A possible structure is shown in Figure 5.7. The actions taken in each of the five stages are completed in one clock cycle. An instruction is fetched in step 1 by hardware stage 1 and placed into the IR. It is decoded, and its source registers are read in step 2. The information in the IR is used to generate the control signals for all subsequent steps. Therefore, the IR must continue to hold the instruction until its execution is completed.

It is necessary to insert registers between stages. Inter-stage registers hold the results produced in one stage so that they can be used as inputs to the next stage during the next clock cycle. This leads to the organization in Figure 5.8. The hardware in the figure is often referred to as the datapath. It corresponds to stages 2 to 5 in Figure 5.7. Data read from the register file are placed in registers RA and RB. Register RA provides the data to input InA of the ALU. Multiplexer MuxB forwards either the contents of RB or the immediate value in the IR to the ALU’s second input, InB. The ALU constitutes stage 3, and the result of the computation it performs is placed in register RZ.

Recall that for computational instructions, such as an Add instruction, no processing actions take place in step 4. During that step, multiplexer MuxY in Figure 5.8 selects register RZ to transfer the result of the computation to RY. The contents of RY are transferred to the register file in step 5 and loaded into the destination register. For this reason, the register file is in both stages 2 and 5. It is a part of stage 2 because it contains the source registers and a part of stage 5 because it contains the destination register.

For Load and Store instructions, the effective address of the memory operand is computed by the ALU in step 3 and loaded into register RZ. From there, it is sent to the memory, which is stage 4. In the case of a Load instruction, the data read from the memory are selected by multiplexer MuxY and placed in register RY, to be transferred to the register file in the next clock cycle. For a Store instruction, data are read from the register file, which is part of stage 2, and placed in register RB. Since memory access is done in stage 4, another inter-stage register is needed to maintain correct data flow in the multi-stage structure. Register RM is introduced for this purpose. The data to be stored are moved from RB to RM in step 3, and from there to the memory in step 4. No action is taken in step 5 in this case.


[Figure: five stages in sequence: instruction fetch (stage 1), source registers (stage 2), ALU (stage 3), memory access (stage 4), and destination register (stage 5).]

Figure 5.7 A five-stage organization.

The subroutine call instructions introduced in Section 2.7 save the return address in a general-purpose register, which we call LINK for ease of reference. Similarly, interrupt processing requires a return address to be saved, as described in Section 3.2. Assume that another general-purpose register, IRA, is used for this purpose. Both of these actions require the contents of the program counter to be sent to the register file. For this reason, multiplexer MuxY has a third input through which the return address can be routed to register RY, from where it can be sent to the register file. The return address is produced by the instruction address generator, as we will explain later.


[Figure: the datapath, stages 2 to 5. Register-file outputs A and B feed inter-stage registers RA and RB; MuxB selects RB or the immediate value for ALU input InB; the ALU output goes to RZ, which supplies the memory address, while RB is copied to RM to supply Store data; MuxY selects RZ, the memory data, or the return address for RY, which is written back through register-file input C.]

Figure 5.8 Datapath in a processor.


5.3.4 Instruction Fetch Section

The organization of the instruction fetch section of the processor is illustrated in Figure 5.9. The addresses used to access the memory come from the PC when fetching instructions and from register RZ in the datapath when accessing instruction operands. Multiplexer MuxMA selects one of these two sources to be sent to the processor-memory interface. The PC is included in a larger block, the instruction address generator, which updates the contents of the PC after each instruction is fetched. The instruction read from the memory is loaded into the IR, where it stays until its execution is completed and the next instruction is fetched.

The contents of the IR are examined by the control circuitry to generate the signals needed to control all the processor’s hardware. They are also used by the block labeled Immediate. As described in Chapter 2, an immediate value may be included in some instructions. A 16-bit immediate value is extended to 32 bits. The extended value is then used either directly as an operand or to compute the effective address of an operand. For some instructions, such as those that perform arithmetic operations, the immediate value is sign-extended; for others, such as logic instructions, it is padded with zeros. The Immediate block in Figure 5.9 generates the extended value and forwards it to MuxB in Figure 5.8 to be used in an ALU computation. It also generates the extended value to be used in computing the target address of branch instructions.

The address generator circuit is shown in Figure 5.10. An adder is used to increment the PC by 4 during straight-line execution. It is also used to compute a new value to be

[Figure: the instruction fetch section. The instruction address generator, containing the PC, supplies one input of MuxMA; register RZ supplies the other; MuxMA drives the memory address. Memory data are loaded into the IR, which feeds the control circuitry and the Immediate block (immediate value extended to 32 bits, routed to MuxB). Connections to the register file are made via RA and via RY.]

Figure 5.9 Instruction fetch section of Figure 5.7.


[Figure: the instruction address generator. MuxINC selects the constant 4 or the branch offset (immediate value); an adder sums the selected value with the PC; MuxPC selects the adder output or register RA for loading into the PC; PC-Temp holds the return address.]

Figure 5.10 Instruction address generator.

loaded into the PC when executing branch and subroutine call instructions. One adder input is connected to the PC. The second input is connected to a multiplexer, MuxINC, which selects either the constant 4 or the branch offset to be added to the PC. The branch offset is given in the immediate field of the IR and is sign-extended to 32 bits by the Immediate block in Figure 5.9. The output of the adder is routed to the PC via a second multiplexer, MuxPC, which selects between the adder and the output of register RA. The latter connection is needed when executing subroutine linkage instructions. Register PC-Temp is needed to hold the contents of the PC temporarily during the process of saving the subroutine or interrupt return address.

5.4 Instruction Fetch and Execution Steps

We now examine the process of fetching and executing instructions in more detail, using the datapath in Figure 5.8. Consider again the instruction

Add R3, R4, R5

The steps for fetching and executing this instruction are given in Figure 5.11. Assume that the instruction is encoded using the format in Figure 2.32, which is reproduced here as Figure 5.12. After the instruction has been fetched from the memory and placed in the IR, the source register addresses are available in fields IR31−27 and IR26−22. These two fields


Step  Action

1     Memory address ← [PC], Read memory, IR ← Memory data, PC ← [PC] + 4
2     Decode instruction, RA ← [R4], RB ← [R5]
3     RZ ← [RA] + [RB]
4     RY ← [RZ]
5     R3 ← [RY]

Figure 5.11 Sequence of actions needed to fetch and execute the instruction: Add R3, R4, R5.
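The five steps of Figure 5.11 can be traced with a small Python model; the register and memory contents below are invented for the example:

```python
# Toy model of the five-step sequence for Add R3, R4, R5.
regs = {"R3": 0, "R4": 10, "R5": 22}      # assumed register contents
mem = {0x1000: "Add R3, R4, R5"}          # assumed instruction address
PC = 0x1000

IR = mem[PC]; PC = PC + 4                 # step 1: fetch, increment the PC
RA, RB = regs["R4"], regs["R5"]           # step 2: decode, read source registers
RZ = RA + RB                              # step 3: the ALU adds into RZ
RY = RZ                                   # step 4: MuxY passes RZ to RY
regs["R3"] = RY                           # step 5: RY is written into R3
```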

[Figure 5.12 shows the three instruction formats. (a) Register-operand format: Rsrc1 in bits 31−27, Rsrc2 in bits 26−22, Rdst in bits 21−17, and the OP code in the low-order bits. (b) Immediate-operand format: Rsrc in bits 31−27, Rdst in bits 26−22, a 16-bit immediate operand in bits 21−6, and the OP code in bits 5−0. (c) Call format: a 26-bit immediate value in bits 31−6 and the OP code in bits 5−0.]

Figure 5.12 Instruction encoding.

are connected to the address inputs for ports A and B of the register file. As a result, registers R4 and R5 are read and their contents placed in registers RA and RB, respectively, at the end of step 2. In the next step, the control circuitry sets MuxB to select input 0, thus connecting register RB to input InB of the ALU. At the same time, it causes the ALU to perform an addition operation. Since register RA is connected to input InA, the ALU produces the required sum [RA] + [RB], which is loaded into register RZ at the end of step 3.

In step 4, multiplexer MuxY selects input 0, thus causing the contents of RZ to be transferred to RY. The control circuitry connects the destination address field of the Add instruction, IR21−17, to the address input for port C of the register file. In step 5, it issues


Step  Action

1     Memory address ← [PC], Read memory, IR ← Memory data, PC ← [PC] + 4
2     Decode instruction, RA ← [R7]
3     RZ ← [RA] + Immediate value X
4     Memory address ← [RZ], Read memory, RY ← Memory data
5     R5 ← [RY]

Figure 5.13 Sequence of actions needed to fetch and execute the instruction: Load R5, X(R7).

Step  Action

1     Memory address ← [PC], Read memory, IR ← Memory data, PC ← [PC] + 4
2     Decode instruction, RA ← [R8], RB ← [R6]
3     RZ ← [RA] + Immediate value X, RM ← [RB]
4     Memory address ← [RZ], Memory data ← [RM], Write memory
5     No action

Figure 5.14 Sequence of actions needed to fetch and execute the instruction: Store R6, X(R8).

a Write command to the register file, causing the contents of register RY to be written into register R3.

Load and Store instructions are executed in a similar manner. In this case, the address of the destination register is given in bit field IR26−22. The control hardware connects this field to the address input corresponding to input C of the register file. The steps involved in executing these instructions are given in Figures 5.13 and 5.14. In both examples, the memory address is specified using the Index mode, in which the index value X is given as an immediate value in the instruction. The immediate field of IR, extended as appropriate by the Immediate block in Figure 5.9, is selected by MuxB in step 3 and added to the contents of register RA. The resulting sum is the effective address of the operand.
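The Index-mode address computation can be illustrated in Python; the base address, the offset X, and the memory contents are assumptions made for the example:

```python
def sign_extend_16(v):
    """Sign-extend a 16-bit immediate to 32 bits."""
    v &= 0xFFFF
    return v | 0xFFFF0000 if v & 0x8000 else v

# Load R5, X(R7): effective address = [R7] + sign-extended X.
R7 = 0x2000                                  # assumed contents of R7
X = 0x0010                                   # assumed 16-bit index value
RZ = (R7 + sign_extend_16(X)) & 0xFFFFFFFF   # step 3: RZ holds the address
mem = {0x2010: 99}                           # assumed memory contents
RY = mem[RZ]                                 # step 4: read memory into RY
R5 = RY                                      # step 5: write into R5
```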

Some Observations
In the discussion above, we assumed that memory Read and Write operations can be completed in one clock cycle. Is this a realistic assumption? In general, accessing the main memory of a computer takes significantly longer than reading the contents of a register in the register file. However, most modern processors use cache memories, which will be discussed in detail in Chapter 8. A cache memory is much faster than the main memory.


It is usually implemented on the same chip as the processor, making it about as fast as the register file. Thus, a memory Read or Write operation can be completed in one clock cycle when the data involved are available in the cache. When the operation requires access to the main memory, the processor must wait for that operation to be completed. We will discuss how slower memory accesses are handled in Section 5.4.2.

We also assumed that the processor reads the source registers of the instruction in step 2, while it is still decoding the OP code of the instruction that has just been loaded into the IR. Can these two tasks be completed in the same step? How can the control hardware know which registers to read before it completes decoding the instruction? This is possible because source register addresses are specified using the same bit positions in all instructions. The hardware reads the registers whose addresses are in these bit positions once the instruction is loaded into the IR. Their contents are loaded into registers RA and RB at the end of step 2. If these data are needed by the instruction, they will be available for use in step 3. If not, they will be ignored by subsequent hardware stages.

Note that the actions described in Figures 5.11, 5.13, and 5.14 do not show two registers being read in step 2 in every case. To avoid confusion, only the registers needed by the specific instruction described in the figure are mentioned, even though two registers are always read.

5.4.1 Branching

Instructions are fetched from sequential word locations in the memory during straight-line program execution. Whenever an instruction is fetched, the processor increments the PC by 4 to point to the next word. This execution pattern continues until a branch or subroutine call instruction loads a new address into the PC. Subroutine call instructions also save the return address, to be used when returning to the calling program. In this section we examine the actions needed to implement these instructions. Interrupts from I/O devices and software interrupt instructions are handled in a similar manner.

Branch instructions specify the branch target address relative to the PC. A branch offset given as an immediate value in the instruction is added to the current contents of the PC. The number of bits used for this offset is considerably less than the word length of the computer, because space is needed within the instruction to specify the OP code and the branch condition. Hence, the range of addresses that can be reached by a branch instruction is limited.

Subroutine call instructions can reach a larger range of addresses. Because they do not include a condition, more bits are available to specify the target address. Also, most RISC-style computers have Jump and Call instructions that use a general-purpose register to specify a full 32-bit address. The details vary from one computer to another, as the example processors introduced in Appendices B to E illustrate.

Branch Instructions
The sequence of steps for implementing an unconditional branch instruction is given in Figure 5.15. The instruction is fetched and the PC is incremented as usual in step 1. After the instruction has been decoded in step 2, multiplexer MuxINC selects the branch offset in


Step  Action

1     Memory address ← [PC], Read memory, IR ← Memory data, PC ← [PC] + 4
2     Decode instruction
3     PC ← [PC] + Branch offset
4     No action
5     No action

Figure 5.15 Sequence of actions needed to fetch and execute an unconditional branch instruction.

the IR to be added to the PC in step 3. This is the address that will be used to fetch the next instruction. Execution of a Branch instruction is completed in step 3. No action is taken in steps 4 and 5.

We explained in Section 2.13 that the branch offset is the distance between the branch target and the memory location following the branch instruction. The reason for this can be seen clearly in Figure 5.15. The PC is incremented by 4 in step 1, at the time the branch instruction is fetched. Then, the branch target address is computed in step 3 by adding the branch offset to the updated contents of the PC.
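This relationship is easy to verify numerically; the addresses below are chosen only for illustration:

```python
branch_addr = 0x1000                  # assumed address of the branch instruction
target = 0x0FF0                       # assumed branch target (a backward branch)

# The offset stored in the instruction is relative to the *updated* PC.
offset = target - (branch_addr + 4)

PC = branch_addr
PC = PC + 4                           # step 1: PC incremented during fetch
PC = PC + offset                      # step 3: branch target computed
```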

The sequence in Figure 5.15 can be readily modified to implement conditional branch instructions. In processors that do not use condition-code flags, the branch instruction specifies a compare-and-test operation that determines the branch condition. For example, the instruction

Branch_if_[R5]=[R6] LOOP

results in a branch if the contents of registers R5 and R6 are identical. When this instruction is executed, the register contents are compared, and if they are equal, a branch is made to location LOOP.

Figure 5.16 shows how this instruction may be executed. Registers R5 and R6 are read in step 2, as usual, and compared in step 3. The comparison could be done by performing the subtraction operation [R5] − [R6] in the ALU. The ALU generates signals that indicate whether the result of the subtraction is positive, negative, or zero. The ALU may also generate signals to show whether arithmetic overflow has occurred and whether the operation produced a carry-out. The control circuitry examines these signals to test the condition given in the branch instruction. In the example above, it checks whether the result of the subtraction is equal to zero. If it is, the branch target address is loaded into the PC, to be used to fetch the next instruction. Otherwise, the contents of the PC remain at the incremented value computed in step 1, and straight-line execution continues.

According to the sequence of steps in Figure 5.16, the two actions of comparing the register contents and testing the result are both carried out in step 3. Hence, the clock cycle must be long enough for the two actions to be completed, one after the other. For this reason, it is desirable that the comparison be done as quickly as possible. A subtraction


Step  Action

1     Memory address ← [PC], Read memory, IR ← Memory data, PC ← [PC] + 4
2     Decode instruction, RA ← [R5], RB ← [R6]
3     Compare [RA] to [RB], If [RA] = [RB], then PC ← [PC] + Branch offset
4     No action
5     No action

Figure 5.16 Sequence of actions needed to fetch and execute the instruction: Branch_if_[R5]=[R6] LOOP.

operation in the ALU is time-consuming, and is not needed in this case. A simpler and faster comparator circuit can examine the contents of registers RA and RB and produce the required condition signals, which indicate the conditions greater than, equal, less than, etc. A comparator is not shown separately in Figure 5.8 as it can be a part of the ALU block. Example 5.3 shows how a comparator circuit can be designed.
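For the equality condition, such a comparator amounts to a bitwise XOR of the two operands followed by a test that all result bits are zero. The sketch below models this in Python; the ordering signals use ordinary comparisons as stand-ins for the magnitude-comparator logic:

```python
def compare(ra, rb, width=32):
    """Produce condition signals from RA and RB without a subtraction."""
    mask = (1 << width) - 1
    ra &= mask
    rb &= mask
    eq = (ra ^ rb) == 0    # XOR per bit, then a wide zero test
    lt = ra < rb           # unsigned less-than (illustrative)
    gt = ra > rb           # unsigned greater-than (illustrative)
    return {"eq": eq, "lt": lt, "gt": gt}
```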

Subroutine Call Instructions
Subroutine calls and returns are implemented in a similar manner to branch instructions. The address of the subroutine may either be computed using an immediate value given in the instruction or it may be given in full in one of the general-purpose registers. Figure 5.17 gives the sequence of actions for the instruction

Call_Register R9

which calls a subroutine whose address is in register R9. The contents of that register are read and placed in RA in step 2. During step 3, multiplexer MuxPC selects its 0 input, thus transferring the data in register RA to be loaded into the PC.

Step  Action

1     Memory address ← [PC], Read memory, IR ← Memory data, PC ← [PC] + 4
2     Decode instruction, RA ← [R9]
3     PC-Temp ← [PC], PC ← [RA]
4     RY ← [PC-Temp]
5     Register LINK ← [RY]

Figure 5.17 Sequence of actions needed to fetch and execute the instruction: Call_Register R9.


Assume that the return address of the subroutine, which is the previous contents of the PC, is to be saved in a general-purpose register called LINK in the register file. Data are written into the register file in step 5. Hence, it is not possible to send the return address directly to the register file in step 3. To maintain correct data flow in the five-stage structure, the processor saves the return address in a temporary register, PC-Temp. From there, the return address is transferred to register RY in step 4, then to register LINK in step 5. The address LINK is built into the control circuitry.
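The route taken by the return address can be traced in a few lines of Python; the register contents are illustrative:

```python
PC = 0x1000                 # assumed address of Call_Register R9
R9 = 0x3000                 # assumed subroutine address held in R9

PC = PC + 4                 # step 1: fetch increments the PC
PC_Temp = PC                # step 3: save the return address...
PC = R9                     # ...and load the subroutine address
RY = PC_Temp                # step 4: return address moves to RY
LINK = RY                   # step 5: return address written into LINK
```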

Subroutine return instructions transfer the value saved in register LINK back to the PC. The encoding of the Return-from-subroutine instruction is such that the address of register LINK appears in bits IR31−27. This is the field connected to Address A of the register file. Hence, once the instruction is fetched, register LINK is read and its contents are placed in RA, from where they can be transferred to the PC via MuxPC in Figure 5.10. Return-from-interrupt instructions are handled in a similar manner, except that a different register is used to hold the return address.

5.4.2 Waiting for Memory

The role of the processor-memory interface circuit is to control data transfers between the processor and the memory. We pointed out earlier that modern processors use fast, on-chip cache memories. Most of the time, the instruction or data referenced in memory Read and Write operations are found in the cache, in which case the operation is completed in one clock cycle. When the requested information is not in the cache and has to be fetched from the main memory, several clock cycles may be needed. The interface circuit must inform the processor's control circuitry about such situations, to delay subsequent execution steps until the memory operation is completed.

Assume that the processor-memory interface circuit generates a signal called Memory Function Completed (MFC). It asserts this signal when a requested memory Read or Write operation has been completed. The processor's control circuitry checks this signal during any processing step in which it issues a memory Read or Write request, to determine when it can proceed to the next step. When the requested data are found in the cache, the interface circuit asserts the MFC signal before the end of the same clock cycle in which the memory request is issued. Hence, instruction execution continues uninterrupted. If access to the main memory is required, the interface circuit delays asserting MFC until the operation is completed. In this case, the processor's control circuitry must extend the duration of the execution step for as many clock cycles as needed, until MFC is asserted. We will use the command Wait for MFC to indicate that a given execution step must be extended, if necessary, until a memory operation is completed. When MFC is received, the actions specified in the step are completed, and the processor proceeds to the next step in the execution sequence.
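The effect of MFC on step timing can be modeled as a simple cache-aware read; the five-cycle miss penalty is an arbitrary assumption for the sketch:

```python
def memory_read(address, cache, main_memory):
    """Return (data, cycles) for a read through the interface.

    A cache hit asserts MFC within the same cycle; a miss stretches
    the step by an assumed penalty before MFC arrives.
    """
    if address in cache:
        return cache[address], 1        # MFC in the same clock cycle
    data = main_memory[address]
    cache[address] = data               # fill the cache on a miss
    return data, 1 + 5                  # assumed five-cycle miss penalty
```

The first access to a location pays the miss penalty; a repeated access completes in one cycle.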

Step 1 of the execution sequence of any instruction involves fetching the instruction from the memory. Therefore, it must include a Wait for MFC command, as follows:

Memory address ← [PC], Read memory, Wait for MFC,
IR ← Memory data, PC ← [PC] + 4


The Wait for MFC command is also needed in step 4 of Load and Store instructions in Figures 5.13 and 5.14. Most of the time, the requested information is found in the cache, so the MFC signal is generated quickly, and the step is completed in one clock cycle. When an access involves the main memory, the MFC response is delayed, and the step is extended to several clock cycles.

5.5 Control Signals

The operation of the processor's hardware components is governed by control signals. These signals determine which multiplexer input is selected, what operation is performed by the ALU, and so on. In this section we discuss the signals needed to control the operation of the components in Figures 5.8 to 5.10.

It is instructive to begin by recalling how data flow through the four stages of the datapath, as described in Section 5.3.3. In each clock cycle, the results of the actions that take place in one stage are stored in inter-stage registers, to be available for use by the next stage in the next clock cycle. Since data are transferred from one stage to the next in every clock cycle, inter-stage registers are always enabled. This is the case for registers RA, RB, RZ, RY, RM, and PC-Temp. The contents of the other registers, namely, the PC, the IR, and the register file, must not be changed in every clock cycle. New data are loaded into these registers only when called for in a particular processing step. They must be enabled only at those times.

The role of the multiplexers is to select the data to be operated on in any given stage. For example, MuxB in stage 3 of Figure 5.8 selects the immediate field in the IR for instructions that use an immediate source operand. It also selects that field for instructions that use immediate data as an offset when computing the effective address of a memory operand. Otherwise, it selects register RB. The data selected by the multiplexer are used by the ALU. Examination of Figures 5.11, 5.13, and 5.14 shows that the ALU is used only in step 3, and hence the selection made by MuxB matters only during that step. To simplify the required control circuit, the same selection can be maintained in all execution steps. A similar observation can be made about MuxY. However, MuxMA in Figure 5.9 must change its selection in different execution steps. It selects the PC as the source of the memory address during step 1, when a new instruction is being fetched. During step 4 of Load and Store instructions, it selects register RZ, which contains the effective address of the memory operand.

Figures 5.18, 5.19, and 5.20 show the required control signals. The register file has three 5-bit address inputs, allowing access to 32 general-purpose registers. Two of these inputs, Address A and Address B, determine which registers are to be read. They are connected to fields IR31−27 and IR26−22 in the instruction register. The third address input, Address C, selects the destination register, into which the input data at port C are to be written. Multiplexer MuxC selects the source of that address. We have assumed that three-register instructions use bits IR21−17 and other instructions use IR26−22 to specify the destination register, as in Figure 5.12. The third input of the multiplexer is the address of the link register used in subroutine linkage instructions. New data are loaded into the selected register only when the control signal RF_write is asserted.


[Figure 5.18 shows the datapath with its control signals: MuxB (B_select) chooses between register RB and the immediate value at input InB of the ALU; the ALU operation is selected by the k-bit ALU_op code, and the ALU produces condition signals; MuxY (Y_select, 2 bits) chooses among RZ, the memory data, and the return address; MuxC (C_select, 2 bits) chooses IR21−17, IR26−22, or the LINK address as Address C of the register file; RF_write enables writing through port C; registers RA, RB, RZ, RM, and RY hold the inter-stage values.]

Figure 5.18 Control signals for the datapath.


[Figure 5.19 shows the processor-memory interface and the IR control signals: MuxMA (MA_select) selects either the PC or register RZ as the memory address; register RM supplies the data for Write operations; MEM_read and MEM_write initiate memory operations, and MFC signals their completion over the address and data connections to the cache and main memory; IR_enable loads the IR; the Immediate block, controlled by the 2-bit Extend signal, feeds MuxB and MuxINC.]

Figure 5.19 Processor-memory interface and IR control signals.

Multiplexers are controlled by signals that select which input data appear at the multiplexer's output. For example, when B_select is equal to 0, MuxB selects the contents of register RB to be available at input InB of the ALU. Note that two bits are needed to control MuxC and MuxY, because each multiplexer selects one of three inputs.

The operation performed by the ALU is determined by a k-bit control code, ALU_op, which can specify up to 2^k distinct operations, such as Add, Subtract, AND, OR, and XOR. When an instruction calls for two values to be compared, a comparator performs the comparison specified, as mentioned earlier. The comparator generates condition signals that indicate the result of the comparison. These signals are examined by the control circuitry during the execution of conditional branch instructions to determine whether the branch condition is true or false.

The interface between the processor and the memory and the control signals associated with the instruction register are presented in Figure 5.19. Two signals, MEM_read and MEM_write, are used to initiate a memory Read or a memory Write operation. When the requested operation has been completed, the interface asserts the MFC signal. The instruction register has a control signal, IR_enable, which enables a new instruction to be loaded into the register. During a fetch step, it must be activated only after the MFC signal is asserted.


[Figure 5.20 shows the instruction address generator with its control signals: INC_select controls MuxINC, which feeds the adder with either the constant 4 or the branch offset from the immediate value; PC_select controls MuxPC, which loads the PC from either the adder output or register RA (the return address path); PC_enable gates the loading of the PC, whose previous contents are held in PC-Temp.]

Figure 5.20 Control signals for the instruction address generator.

We have assumed that the Immediate block handles three possible formats for the immediate value: a sign-extended 16-bit value, a zero-extended 16-bit value, and a 26-bit value that is handled in a special way (see Problem 5.14). Hence, its control signal, Extend, comprises two bits.

The signals that control the operation of the instruction address generator are shown in Figure 5.20. The INC_select signal selects the value to be added to the PC, either the constant 4 or the branch offset specified in the instruction. The PC_select signal selects either the updated address or the contents of register RA to be loaded into the PC when the PC_enable control signal is activated.

5.6 Hardwired Control

Previous sections described the actions needed to fetch and execute instructions. We now examine how the processor generates the control signals that cause these actions to take place in the correct sequence and at the right time. There are two basic approaches: hardwired control and microprogrammed control. Hardwired control is discussed in this section.

An instruction is executed in a sequence of steps, where each step requires one clock cycle. Hence, a step counter may be used to keep track of the progress of execution. Several


actions are performed in each step, depending on the instruction being executed. In some cases, such as for branch instructions, the actions taken depend on tests applied to the result of a computation or a comparison operation. External signals, such as interrupt requests, may also influence the actions to be performed. Thus, the setting of the control signals depends on:

• Contents of the step counter
• Contents of the instruction register
• The result of a computation or a comparison operation
• External input signals, such as interrupt requests

The circuitry that generates the control signals may be organized as shown in Figure 5.21. The instruction decoder interprets the OP-code and addressing mode information in the IR and sets to 1 the corresponding INSi output. During each clock cycle, one of the outputs T1 to T5 of the step counter is set to 1 to indicate which of the five steps involved in fetching and executing instructions is being carried out. Since all instructions are completed in five steps, a modulo-5 counter may be used. The control signal generator is a combinational circuit that produces the necessary control signals based on all its inputs. The required settings of the control signals can be determined from the action sequences that implement each of the instructions represented by the signals INS1 to INSm.
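The one-hot timing signals produced by the modulo-5 step counter can be modeled behaviorally:

```python
def step_counter():
    """Yield the one-hot timing signals [T1, ..., T5], repeating modulo 5."""
    step = 0
    while True:
        yield [i == step for i in range(5)]   # exactly one signal is True
        step = (step + 1) % 5
```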

[Figure 5.21 shows the generation of the control signals: the instruction decoder examines the OP-code bits of the IR and asserts one of the outputs INS1 to INSm; the step counter, driven by the clock and gated by Counter_enable, asserts one of the timing signals T1 to T5; the control signal generator combines these with the condition signals and external inputs to produce the control signals.]

Figure 5.21 Generation of the control signals.


As an example, consider step 1 in the instruction execution process. This is the step in which a new instruction is fetched from the memory. It is identified by signal T1 being asserted. During that clock period, the MA_select signal in Figure 5.19 is set to 1 to select the PC as the source of the memory address, and MEM_read is activated to initiate a memory Read operation. The data received from the memory are loaded into the IR by activating IR_enable when the memory's response signal, MFC, is asserted. At the same time, the PC is incremented by 4, by setting the INC_select signal in Figure 5.20 to 0 and PC_select to 1. The PC_enable signal is activated to cause the new value to be loaded into the PC at the positive edge of the clock marking the end of step T1.

5.6.1 Datapath Control Signals

Instructions that handle data include Load, Store, and all computational instructions. They perform various data movement and manipulation operations using the processor's datapath, whose control signals are shown in Figures 5.18 and 5.19. Once an instruction is loaded into the IR, the instruction decoder interprets its contents to determine the actions needed. At the same time, the source registers are read and their contents become available at the A and B outputs of the register file. As mentioned earlier, inter-stage registers RA, RB, RZ, RM, and RY are always enabled. This means that data flow automatically from one datapath stage to the next on every active edge of the clock signal.

The desired setting of various control signals can be determined by examining the actions taken in each execution step of every instruction. For example, the RF_write signal is set to 1 in step T5 during execution of an instruction that writes data into the register file. It may be generated by the logic expression

RF_write = T5 · (ALU + Load + Call)

where ALU stands for all instructions that perform arithmetic or logic operations, Load stands for all Load instructions, and Call stands for all subroutine-call and software-interrupt instructions. The RF_write signal is a function of both the instruction and the timing signals. But, as mentioned earlier, the setting of some of the multiplexers need not change from one timing step to another. In this case, the multiplexer's select signal can be implemented as a function of the instruction only. For example,

B_select = Immediate

where Immediate stands for all instructions that use an immediate value in the IR. We encourage the reader to examine other control signals and derive the appropriate logic expressions for them, based on the execution steps of various instructions.
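These expressions translate directly into boolean code. In the sketch below, the instruction-class flags are assumed to come from the instruction decoder:

```python
def rf_write(t5, is_alu, is_load, is_call):
    """RF_write = T5 . (ALU + Load + Call)"""
    return t5 and (is_alu or is_load or is_call)

def b_select(is_immediate):
    """B_select = Immediate (independent of the timing step)"""
    return is_immediate
```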

5.6.2 Dealing with Memory Delay

The timing signals T1 to T5 are asserted in sequence as the step counter is advanced. Most of the time, the step counter is incremented at the end of every clock cycle. However, a step


in which a MEM_read or a MEM_write command is issued does not end until the MFC signal is asserted, indicating that the requested memory operation has been completed.

To extend the duration of an execution step to more than one clock cycle, we need to disable the step counter. Assume that the counter is incremented when enabled by a control signal called Counter_enable. Let the need to wait for a memory operation to be completed be indicated by a control signal called WMFC, which is activated during any execution step in which the Wait for MFC command is issued. Counter_enable should be set to 1 in any step in which WMFC is not asserted. Otherwise, it should be set to 1 when MFC is asserted. This means that

Counter_enable = WMFC′ + MFC

where WMFC′ denotes the complement of WMFC.
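The behavior can be checked against the prose: the counter advances whenever no wait is pending, and otherwise only when MFC arrives. A one-line Python model:

```python
def counter_enable(wmfc, mfc):
    """Counter_enable is 1 unless WMFC is asserted without MFC."""
    return (not wmfc) or mfc
```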

A new value is loaded into the PC at the end of any clock cycle in which the PC_enable signal in Figure 5.20 is activated. We must ensure that the PC is incremented only once when an execution step is extended for more than one clock cycle. Hence, when fetching an instruction, the PC should be enabled only when MFC is received. It is also enabled in step 3 of instructions that cause branching. Let BR denote all instructions in this group. Then, PC_enable may be realized as

PC_enable = T1 · MFC + T3 · BR
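The same style of check applies to PC_enable; T1, T3, and BR are assumed to be decoded as described above:

```python
def pc_enable(t1, t3, mfc, br):
    """PC_enable = T1 . MFC + T3 . BR"""
    return (t1 and mfc) or (t3 and br)
```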

5.7 CISC-Style Processors

We saw in the previous sections that a RISC-style instruction set is conducive to a multi-stage implementation of the processor. All instructions can be executed in a uniform manner using the same five-stage hardware. As a result, the hardware is simple and well suited to pipelined operation. Also, the control signals are easy to generate.

CISC-style instruction sets are more complex because they allow much greater flexibility in accessing instruction operands. Unlike RISC-style instruction sets, where only Load and Store instructions access data in the memory, CISC instructions can operate directly on memory operands. Also, they are not restricted to one word in length. An instruction may use several words to specify operand addresses and the actions to be performed, as explained in Section 2.10. Therefore, CISC-style instructions require a different organization of the processor hardware.

Figure 5.22 shows a possible processor organization. The main difference between this organization and the five-stage structure discussed earlier is that the Interconnect block, which provides interconnections among other blocks, does not prescribe any particular structure or pattern of data flow. It provides paths that make it possible to transfer data between any two components, as needed to implement instructions. The multi-stage structure of Figure 5.8 uses inter-stage registers, such as RZ and RY. These are not needed in the organization of Figure 5.22. Instead, some registers are needed to hold intermediate results during instruction execution. The temporary registers block in the figure is provided for this purpose. It includes two temporary registers, Temp1 and Temp2. The need for these registers will become apparent from the examples given later.


[Figure 5.22 shows the organization of a CISC-style processor: the PC, the IR, the instruction address generator, the register file, the temporary registers, the ALU, and the control circuitry are all connected through the Interconnect block, with a processor-memory interface leading to the cache and main memory.]

Figure 5.22 Organization of a CISC-style processor.

A traditional approach to the implementation of the Interconnect is to use buses. A bus consists of a set of lines to which several devices may be connected, enabling data to be transferred from any one device to any other. A logic gate that sends a signal over a bus line is called a bus driver. Since all devices connected to the bus have the ability to send data, we must ensure that only one of them is driving the bus at any given time. For this reason, the bus driver is a special type of logic gate called a tri-state gate. It has a control input that turns it on or off. When turned on, the gate places a logic signal of 0 or 1 on the bus, according to the value of its input. When turned off, the gate is electrically disconnected from the bus, as explained in Appendix A.

Figure 5.23 shows how a flip-flop that forms one bit of a data register can be connected to a bus line. There are two control signals, Rin and Rout. When Rin is equal to 1 the multiplexer selects the data on the bus line to be loaded into the flip-flop. Setting Rin to 0 causes the flip-flop to maintain its present value. The output of the flip-flop is connected to the bus line through a tri-state gate, which is turned on when Rout is asserted. At other times, the tri-state gate is turned off, allowing other components to drive the bus line.
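The Rin/Rout discipline can be mimicked in software. This is a behavioral sketch only; a real bus is a shared wire driven by tri-state gates, not a variable:

```python
class BusRegister:
    """One register attached to a shared bus via Rin/Rout control signals."""
    def __init__(self, value=0):
        self.value = value

    def drive(self, bus, r_out):
        # Tri-state gate: drive the bus only when Rout is asserted.
        return self.value if r_out else bus

    def load(self, bus, r_in):
        # Input multiplexer: capture the bus value only when Rin is 1.
        if r_in:
            self.value = bus

# Transfer R1 -> R2: assert R1out and R2in in the same clock cycle.
r1, r2 = BusRegister(7), BusRegister(0)
bus = r1.drive(None, r_out=True)
r2.load(bus, r_in=True)
```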



Figure 5.23 Input and output gating for one register bit.

5.7.1 An Interconnect using Buses

The Interconnect in Figure 5.22 may be implemented using one or more buses. Figure 5.24 shows a three-bus implementation. All registers are assumed to be edge-triggered. That is, when a register is enabled, data are loaded into it on the active edge of the clock at the end of the clock period. Addresses for the three ports of the register file are provided by the Control block. These connections are not shown to keep the figure simple. Also not shown is the Immediate block through which the IR is connected to bus B. This is the circuit that extends an immediate operand in the IR to 32 bits.

Consider the two-operand instruction

Add R5, R6

which performs the operation

R5 ← [R5] + [R6]

Fetching and executing this instruction using the hardware in Figure 5.24 can be performed in three steps, as shown in Figure 5.25. Each step, except for the step involving access to the memory, is completed in one clock cycle. In step 1, bus B is used to send the contents of the PC to the processor-memory interface, which sends them on the memory address lines and initiates a memory Read operation. The data received from the memory, which represent an instruction to be executed, are sent to the IR over bus C. The command Wait for MFC is included to accommodate the possibility that memory access may take more than one clock cycle, as explained in Section 5.4.2. The instruction is decoded in step 2, and the control circuitry begins reading the source registers, R5 and R6. However, the contents of the registers do not become available at the A and B outputs of the register file until step 3. They are sent to the ALU using buses A and B. The ALU performs the addition operation, and the sum is sent to the register file over bus C, to be written into register R5 at the end of the clock cycle.
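The three steps can be mirrored in a short Python sketch (the register values and the tuple encoding of the instruction are made up for illustration):

```python
# Registers and a word-addressed memory holding one instruction.
regs = {"PC": 0, "IR": None, "R5": 10, "R6": 32}
memory = {0: ("Add", "R5", "R6")}

# Step 1: Memory address <- [PC], Read memory, IR <- Memory data, PC <- [PC] + 4
regs["IR"] = memory[regs["PC"]]
regs["PC"] += 4

# Step 2: decode the instruction to find the operation and register names
op, dst, src = regs["IR"]

# Step 3: read R5 and R6, add, and write the sum back into R5 (over "bus C")
regs[dst] = regs[dst] + regs[src]

print(regs["R5"], regs["PC"])  # -> 42 4
```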



Figure 5.24 Three-bus CISC-style processor organization.

Note that reading the source registers is completed in step 2 in Figure 5.11. In that case, the action of reading the registers proceeds in parallel with the action of decoding the instruction, because the location of the bit fields containing register addresses in a RISC-style instruction is known. Since CISC-style instructions do not always use the same


Step Action

1 Memory address ← [PC], Read memory, Wait for MFC, IR ← Memory data, PC ← [PC] + 4

2 Decode instruction

3 R5 ← [R5] + [R6]

Figure 5.25 Sequence of actions needed to fetch and execute the instruction: Add R5, R6.

instruction fields to specify register addresses, the action of reading the source registers does not begin until the instruction has been at least partially decoded. Hence, it may not be possible to complete reading the source registers in step 2.

Next, consider the instruction

And X(R7), R9

which performs the logical AND operation on the contents of register R9 and memory location X + [R7] and stores the result back in the same memory location. Assume that the index offset X is a 32-bit value given as the second word of the instruction. To execute this instruction, it is necessary to access the memory four times. First, the OP-code word is fetched. Then, when the instruction decoding circuit recognizes the Index addressing mode, the index offset X is fetched. Next, the memory operand is fetched and the AND operation is performed. Finally, the result is stored back into the memory.

Figure 5.26 gives the steps needed to execute the instruction. After decoding the instruction in step 2, the second word of the instruction is read in step 3. The data received,

Step Action

1 Memory address ← [PC], Read memory, Wait for MFC, IR ← Memory data, PC ← [PC] + 4

2 Decode instruction

3 Memory address ← [PC], Read memory, Wait for MFC, Temp1 ← Memory data, PC ← [PC] + 4

4 Temp2 ← [Temp1] + [R7]

5 Memory address ← [Temp2], Read memory, Wait for MFC, Temp1 ← Memory data

6 Temp1 ← [Temp1] AND [R9]

7 Memory address ← [Temp2], Memory data ← [Temp1], Write memory, Wait for MFC

Figure 5.26 Sequence of actions needed to fetch and execute the instruction: And X(R7), R9.


which represent the offset X, are stored temporarily in register Temp1, to be used in the next step for computing the effective address of the memory operand. In step 4, the contents of registers Temp1 and R7 are sent to the ALU inputs over buses A and B. The effective address is computed and placed into register Temp2, then used to read the operand in step 5. Register Temp1 is used again during step 5, this time to hold the data operand received from the memory. The computation is performed in step 6, and the result is placed back in register Temp1. In the final step, the result is sent to be stored in the memory at the operand address, which is still available in register Temp2.
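The seven steps map directly onto a sketch like the following (addresses and data values are invented; `&` is Python's bitwise AND):

```python
regs = {"PC": 100, "R7": 8, "R9": 0b1100, "Temp1": 0, "Temp2": 0}
memory = {100: "And-opcode",   # first instruction word
          104: 12,             # second word: the index offset X
          20: 0b1010}          # operand at X + [R7] = 12 + 8 = 20

ir = memory[regs["PC"]]; regs["PC"] += 4              # step 1: fetch OP-code word
# step 2: decode, recognizing the Index addressing mode
regs["Temp1"] = memory[regs["PC"]]; regs["PC"] += 4   # step 3: Temp1 <- X
regs["Temp2"] = regs["Temp1"] + regs["R7"]            # step 4: effective address
regs["Temp1"] = memory[regs["Temp2"]]                 # step 5: read the operand
regs["Temp1"] = regs["Temp1"] & regs["R9"]            # step 6: AND with [R9]
memory[regs["Temp2"]] = regs["Temp1"]                 # step 7: write result back

print(bin(memory[20]))  # -> 0b1000
```

Note how Temp2 preserves the effective address across steps 5 to 7, and Temp1 is reused for the offset, the operand, and the result, exactly as in the text.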

The two examples in Figures 5.25 and 5.26 illustrate the variability in the number of execution steps in CISC-style instructions. There is no uniform sequence of actions that can be followed for all instructions in the same way as was demonstrated for RISC instructions in Section 5.2.

5.7.2 Microprogrammed Control

The control signals needed to control the operation of the components in Figures 5.22 and 5.24 can be generated using the hardwired approach described in Section 5.6. But, there is an interesting alternative that was popular in the past, which we describe next.

Control signals are generated for each execution step based on the instruction in the IR. In hardwired control, these signals are generated by circuits that interpret the contents of the IR as well as the timing signals derived from a step counter. Instead of employing such circuits, it is possible to use a “software” approach, in which the desired setting of the control signals in each step is determined by a program stored in a special memory. The control program is called a microprogram to distinguish it from the program being executed by the processor. The microprogram is stored on the processor chip in a small and fast memory called the microprogram memory or the control store.

Suppose that n control signals are needed. Let each control signal be represented by a bit in an n-bit word, which is often referred to as a control word or a microinstruction. Each bit in that word specifies the setting of the corresponding signal for a particular step in the execution flow. One control word is stored in the microprogram memory for each step in the execution sequence of an instruction. For example, the action of reading an instruction or a data operand from the memory requires use of the MEM_read and WMFC signals introduced in Sections 5.5 and 5.6.2, respectively. These signals are asserted by setting the corresponding bits in the control word to 1 for steps 1, 3, and 5 in Figure 5.26. When a microinstruction is read from the control store, each control signal takes on the value of its corresponding bit.
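A control word can be pictured as a named bit vector. The sketch below is illustrative only; MEM_read and WMFC come from the text, while the remaining signal names are invented placeholders:

```python
SIGNALS = ["MEM_read", "WMFC", "MEM_write", "RF_write", "End"]

def control_word(**asserted):
    """Build an n-bit control word: 1 for each asserted signal, 0 elsewhere."""
    return {s: int(asserted.get(s, 0)) for s in SIGNALS}

# The microinstruction for a memory-read step, such as steps 1, 3, and 5
# in Figure 5.26, asserts MEM_read and WMFC and leaves the other bits at 0.
read_step = control_word(MEM_read=1, WMFC=1)
print([read_step[s] for s in SIGNALS])  # -> [1, 1, 0, 0, 0]
```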

The sequence of microinstructions corresponding to a given machine instruction constitutes the microroutine that implements that instruction. The first two steps in Figures 5.25 and 5.26 specify the actions for fetching and decoding an instruction. They are common to all instructions. The microroutine that is specific to a given machine instruction starts with step 3.

Figure 5.27 depicts a typical organization of the hardware needed for microprogrammed control. It consists of a microinstruction address generator, which generates the address



Figure 5.27 Microprogrammed control unit organization.

to be used for reading microinstructions from the control store. The address generator uses a microprogram counter, µPC, to keep track of control store addresses when reading microinstructions from successive locations. During step 2 in Figures 5.25 and 5.26, the microinstruction address generator decodes the instruction in the IR to obtain the starting address of the corresponding microroutine and loads that address into the µPC. This is the address that will be used in the following clock cycle to read the control word corresponding to step 3. As execution proceeds, the microinstruction address generator increments the µPC to read microinstructions from successive locations in the control store. One bit in the microinstruction, which we will call End, is used to mark the last microinstruction in a given microroutine. When End is equal to 1, as would be the case in step 3 in Figure 5.25 and step 7 in Figure 5.26, the address generator returns to the microinstruction corresponding to step 1, which causes a new machine instruction to be fetched.
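The sequencing just described can be sketched as follows. The control-store addresses, the starting-address table, and the one-step Add microroutine are all invented for illustration:

```python
# A toy control store: address -> microinstruction (here just a name and End bit).
control_store = {
    0: {"name": "fetch", "End": 0},
    1: {"name": "decode", "End": 0},
    # microroutine for Add (one execution step, marked as the last by End = 1)
    10: {"name": "add-step3", "End": 1},
}
START_ADDRESS = {"Add": 10}  # produced by decoding the instruction in the IR

def run_microroutine(opcode):
    trace, upc = [], 0
    trace.append(control_store[upc]["name"]); upc += 1  # step 1: common fetch
    trace.append(control_store[upc]["name"])            # step 2: decode, which...
    upc = START_ADDRESS[opcode]                         # ...loads the uPC
    while True:
        word = control_store[upc]
        trace.append(word["name"])
        if word["End"]:      # End = 1: return to fetch the next instruction
            return trace
        upc += 1             # otherwise read the next successive location

print(run_microroutine("Add"))  # -> ['fetch', 'decode', 'add-step3']
```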

Microprogrammed control can be viewed as having a control processor within the main processor. Microinstructions are fetched and executed much like machine instructions. Their function is to direct the actions of the main processor’s hardware components, by indicating which control signals need to be active during each execution step.

Microprogrammed control is simple to implement and provides considerable flexibility in controlling the execution of machine instructions. But, it is slower than hardwired control. Also, the flexibility it provides is not needed in RISC-style processors. As the discussion in this chapter illustrates, the control signals needed to implement RISC-style instructions are


quite simple to generate. Since the cost of logic circuitry is no longer a significant factor, hardwired control has become the preferred choice.

5.8 Concluding Remarks

This chapter explained the basic structure of a processor and how it executes instructions. Modern processors have a multi-stage organization because this is a structure that is well-suited to pipelined operation. Each stage implements the actions needed in one of the execution steps of an instruction. A five-step sequence in which each step is completed in one clock cycle has been demonstrated. Such an approach is commonly used in processors that have a RISC-style instruction set.

The discussion in this chapter assumed that the execution of one instruction is completed before the next instruction is fetched. Only one of the five hardware stages is used at any given time, as execution moves from one stage to the next in each clock cycle. We will show in the next chapter that it is possible to overlap the execution steps of successive instructions, resulting in much better performance. This leads to a pipelined organization.

5.9 Solved Problems

This section presents some examples of the types of problems that a student may be asked to solve, and shows how such problems can be solved.

Example 5.1

Problem: Figure 5.11 shows an Add instruction being executed in five steps, but no processing actions take place in step 4. If it is desired to eliminate that step, what changes have to be made in the datapath in Figure 5.8 to make this possible?

Solution: Step 4 can be skipped by sending the output of the ALU in Figure 5.8 directly to register RY. This can be accomplished by adding one more input to multiplexer MuxY and connecting that input to the output of the ALU. Thus, the result of a computation at the output of the ALU is loaded into both registers RZ and RY at the end of step 3. For an Add instruction, or any other computational instruction, the register file control signal RF_write can be enabled in step 4 to load the contents of RY into the register file.

Example 5.2

Problem: Assume that all memory access operations are completed in one clock cycle in a processor that has a 1-GHz clock. What is the frequency of memory access operations if Load and Store instructions constitute 20 percent of the dynamic instruction count in a program? (The dynamic count is the number of instruction executions, including the effect of program loops, which may cause some instructions to be executed more than once.) Assume that all instructions are executed in 5 clock cycles.


Solution: There is one memory access to fetch each instruction. Then, 20 percent of the instructions have a second memory access to read or write a memory operand. On average, each instruction has 1.2 memory accesses in 5 clock cycles. Therefore, the frequency of memory accesses is (1.2/5) × 10⁹, or 240 million accesses per second.
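The arithmetic, spelled out:

```python
accesses_per_instruction = 1 + 0.20   # one fetch, plus an operand access for 20%
cycles_per_instruction = 5
clock_rate = 1e9                      # 1-GHz clock

access_rate = accesses_per_instruction / cycles_per_instruction * clock_rate
print(access_rate)  # -> 240000000.0, i.e. 240 million accesses per second
```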

Example 5.3

Problem: Derive the logic expressions for a circuit that compares two unsigned numbers, X = x2x1x0 and Y = y2y1y0, and generates three outputs: XGY, XEY, and XLY. One of these outputs is set to 1 to indicate that X is greater than, equal to, or less than Y, respectively.

Solution: To compare two unsigned numbers, we need to compare individual bit locations, starting with the most significant bit. If x2 = 1 and y2 = 0, then X is greater than Y. If x2 = y2, then we need to compare the next lower bit location, and so on. Thus, the logic expressions for the three outputs may be written as follows:

XGY = x2·y2′ + (x2 ⊕ y2)′·(x1·y1′ + (x1 ⊕ y1)′·x0·y0′)

XEY = (x2 ⊕ y2)′·(x1 ⊕ y1)′·(x0 ⊕ y0)′

XLY = (XGY + XEY)′

where a prime (′) denotes complementation.
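The comparator can be checked exhaustively in a few lines of Python, writing the complement of a bit b as 1 - b and XOR as ^; the check compares the three outputs against direct integer comparison of X and Y:

```python
from itertools import product

def compare(x2, x1, x0, y2, y1, y0):
    n = lambda b: 1 - b  # complement of a single bit
    xgy = (x2 & n(y2)) | n(x2 ^ y2) & ((x1 & n(y1)) | n(x1 ^ y1) & (x0 & n(y0)))
    xey = n(x2 ^ y2) & n(x1 ^ y1) & n(x0 ^ y0)
    xly = n(xgy | xey)
    return xgy, xey, xly

ok = all(
    compare(*bits) == (int(x > y), int(x == y), int(x < y))
    for bits in product((0, 1), repeat=6)
    for x in [bits[0] * 4 + bits[1] * 2 + bits[2]]
    for y in [bits[3] * 4 + bits[4] * 2 + bits[5]]
)
print(ok)  # -> True, all 64 input combinations agree
```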

Example 5.4

Problem: Give the sequence of actions for a Return-from-subroutine instruction in a RISC processor. Assume that the address LINK of the general-purpose register in which the subroutine return address is stored is given in the instruction field connected to address A of the register file (IR31−27).

Solution: Whenever an instruction is loaded into the IR, the contents of the general-purpose register whose address is given in bits IR31−27 are read and placed into register RA (see Figure 5.18). Hence, a Return-from-subroutine instruction will cause the contents of register LINK to be read and placed in register RA. Execution proceeds as follows:

1. Memory address ← [PC], Read memory, Wait for MFC, IR ← Memory data, PC ← [PC] + 4

2. Decode instruction, RA ← [LINK]

3. PC ← [RA]

4. No action

5. No action

Example 5.5

Problem: A processor has the following interrupt structure. When an interrupt is received, the interrupt return address is saved in a general-purpose register called IRA. The current contents of the processor status register, PS, are saved in a special register called IPS, which is not a general-purpose register. The interrupt-service routine starts at address ILOC.


Assume that the processor checks for interrupts in the last execution step of every instruction. If an interrupt request is present and interrupts are enabled, the request is accepted. Instead of fetching the next instruction, the processor saves the PC and the PS and branches to ILOC. Give a suitable sequence of steps for performing these actions. What additional hardware is needed in Figures 5.18 to 5.20 to support interrupt processing?

Solution: The first two steps of instruction execution, in which an instruction is fetched and decoded, are not needed in the case of an interrupt. They may be skipped, or they would take no action if it is desired to maintain a 5-step sequence. Saving the PC can be done in exactly the same manner as for a subroutine call instruction. Another input to MuxC in Figure 5.18 is needed, to which the address of register IRA should be connected. To load the starting address of the interrupt-service routine into the PC, an additional input to MuxPC in Figure 5.20 is needed, to which the value ILOC should be connected. Registers PS and IPS should be connected directly to each other to enable data to be transferred between them. The execution steps required are:

3. PC-Temp ← [PC], PC ← ILOC, IPS ← [PS], Disable interrupts

4. RY ← [PC-Temp]

5. IRA ← [RY]

These actions are reversed by a Return-from-interrupt instruction. See Problem 5.8.

Example 5.6

Problem: Example 5.5 illustrates how the contents of the PC and the PS are saved when an interrupt request is accepted. In order to support interrupt nesting, it is necessary for the interrupt-service routine to save these registers on the processor stack, as described in Section 3.2. To do so, the contents of the PS, which are saved in register IPS at the time the interrupt is accepted, need to be moved to one of the general-purpose registers, from where they can be saved on the stack. Assume that two special instructions

MoveControl Ri, IPS

and

MoveControl IPS, Ri

are available to save and restore the contents of IPS, respectively. Suggest changes to the hardware in Figures 5.8 and 5.10 to implement these instructions.

Solution: A possible organization is shown in Figure 5.28. To save the contents of IPS, its output is connected to an additional input on MuxY. When restoring its contents, MuxIPS selects register RA.



Figure 5.28 Connection of IPS for Example 5.6.

Problems

5.1 [M] The propagation delay through the combinational circuit in Figure 5.2 is 600 ps (picoseconds). The registers have a setup time requirement of 50 ps, and the maximum propagation delay from the clock input to the Q outputs is 70 ps.

(a) What is the minimum clock period required for correct operation of this circuit?

(b) Assume that the circuit is reorganized into three stages as in Figure 5.3, such that the combinational circuit in each stage has a delay of 200 ps. What is the minimum clock period in this case?

5.2 [M] At the time the instruction

Load R6, 1000(R9)

is fetched, R6 and R9 contain the values 4200 and 85320, respectively. Memory location 86320 contains 75900. Show the contents of the interstage registers in Figure 5.8 during each of the 5 execution steps of this instruction.

5.3 [E] Figure 5.12 shows the bit fields assigned to register addresses for different groups of instructions. Why is it important to use the same field locations for all instructions?

5.4 [M] At some point in the execution of a program, registers R4, R6, and R7 contain the values 1000, 7500, and 2500, respectively. Show the contents of registers RA, RB, RZ,


RY, and R6 in Figure 5.8 during steps 3 to 5 as the instruction

Subtract R6, R4, R7

is fetched and executed, and also during step 1 of the instruction that is fetched next.

5.5 [M] The instruction

And R4, R4, R8

is stored in location 0x37C00 in the memory. At the time this instruction is fetched, registers R4 and R8 contain the values 0x1000 and 0xB2500, respectively. Give the values in registers PC, R4, RA, RM, RZ, and RY of Figures 5.8 and 5.10 in each clock cycle as this instruction is executed, and also in the first clock cycle of the next instruction.

5.6 [D] Modify the expressions given in Example 5.3 to compare two 4-bit signed numbers in 2’s-complement representation.

5.7 [E] The subroutine-call instructions described in Chapter 2 always use the same general-purpose register, LINK, to store the return address. Hence, the return register address is not included in the instruction. However, the address LINK is included in bits IR31−27

of subroutine-return instructions (see Section 5.4.1 and Example 5.4). Why are the two instructions treated differently?

5.8 [M] Give the execution sequence for the Return-from-interrupt instruction for a processor that has the interrupt structure given in Example 5.5. Assume that the address of register IRA is given in bits IR31−27 of the instruction.

5.9 [D] Consider an instruction set in which instruction encoding is such that register addresses for different instructions are not always in the same bit locations. What effect would that have on the execution steps of the instructions? What would you do to maintain a five-step execution sequence in this case? Assume the same hardware structure as in Figure 5.8.

5.10 [M] Assume that immediate operands occupy bits IR21−6 of the instruction. The immediate value is sign-extended to 32 bits in arithmetic instructions, such as Add, and padded with zeros in logic instructions, such as Or. Design a suitable implementation for the Immediate block in Figure 5.9.

5.11 [M] A RISC processor that uses the five-step sequence in Figure 5.4 is driven by a 1-GHz clock. Instruction statistics in a large program are as follows:

Branch 20%
Load 20%
Store 10%
Computational instructions 50%

Estimate the rate of instruction execution in each of the following cases:

(a) Access to the memory is always completed in 1 clock cycle.

(b) 90% of instruction fetch operations are completed in one clock cycle and 10% are completed in 4 clock cycles. On average, access to the data operands of a Load or Store instruction is completed in 3 clock cycles.


5.12 [E] The execution of computational instructions follows the pattern given in Figure 5.11 for the Add instruction, in which no processing actions are performed in step 4. Consider a program that has the instruction statistics given in Problem 5.11. Estimate the increase in instruction execution rate if this step is eliminated, assuming that all execution steps are completed in one clock cycle.

5.13 [D] Figure 5.16 shows that step 3 of a conditional branch instruction may result in a new value being loaded into the PC. In pipelined processors, it is desirable to determine the outcome of a conditional branch as early as possible in the execution sequence. What hardware changes would be needed to make it possible to move the actions in step 3 to step 2? Examine all the actions involved in these two steps and show which actions can be carried out in parallel and which must be completed sequentially.

5.14 [M] The instructions of a computer are encoded as shown in Figure 5.12. When an immediate value is given in an instruction, it has to be extended to a 32-bit value. Assume that the immediate value is used in three different ways:

(a) A 16-bit value is sign-extended for use in arithmetic operations.

(b) A 16-bit value is padded with zeros to the left for use in logic operations.

(c) A 26-bit value is padded with 2 zeros to the right and the 4 high-order bits of the PC are appended to the left for use in subroutine-call instructions.

Show an implementation for the Immediate block in Figure 5.19 that would perform the required extensions.

5.15 [E] We have seen how all RISC-style instructions can be executed using the steps in Figure 5.4 on the multi-stage hardware of Figure 5.8. Autoincrement and Autodecrement addressing modes are not included in RISC-style instruction sets. Explain why the instruction

Load R3, (R5)+

cannot be executed on the hardware in Figure 5.8.

5.16 [E] Section 2.9 describes how the two instructions Or and OrHigh can be used to load a 32-bit value into a register. What additional functionality is needed in the processor’s datapath to implement the OrHigh instruction? Give the sequence of actions needed to fetch and execute the instruction.

5.17 [E] During step 1 of instruction processing, a memory Read operation is started to fetch an instruction at location 0x46000. However, as the instruction is not found in the cache, the Read operation is delayed, and the MFC signal does not become active until the fourth clock cycle. Assume that the delay is handled as described in Section 5.6.2. Show the contents of the PC during each of the four clock cycles of step 1, and also during step 2.

5.18 [M] Give the sequence of steps needed to fetch and execute the two special instructions

MoveControl Ri, IPS

and

MoveControl IPS, Ri

used in Example 5.6.


5.19 [D] What are the essential differences between the hardware structures in Figures 5.8 and 5.22? Illustrate your answer by identifying the difficulties that would be encountered if one attempts to execute the instruction

Subtract LOC, R5

on the hardware in Figure 5.8. This instruction performs the operation

LOC← [LOC] − [R5]

where LOC is a memory location whose address is given as the second word of a two-word instruction.

5.20 [M] Consider the actions needed to execute the instructions given in Section 5.4.1. Derive the logic expressions to generate the signals C_select, MA_select, and Y_select in Figures 5.18 and 5.19 for these instructions.

5.21 [E] Why is it necessary to include both WMFC and MFC in the logic expression for Counter_enable given in Section 5.6.2?

5.22 [E] Explain what would happen if the MFC variable is omitted from the expression for PC_enable given in Section 5.6.2.

5.23 [M] Derive the logic expressions to generate the signals PC_select and INC_select shown in Figure 5.20, taking into account the actions needed when executing the following instructions:

Branch: All branch instructions, with a 16-bit branch offset given in the instruction

Call_register: A subroutine-call instruction with the subroutine address given in a general-purpose register

Other: All other instructions that do not involve branching

5.24 [M] A microprogrammed processor has the following parameters. Generating the starting address of the microroutine of an instruction takes 2.1 ns, and reading a microinstruction from the control store takes 1.5 ns. Performing an operation in the ALU requires a maximum of 2.2 ns, and access to the cache memory requires 1.7 ns. Assume that all instructions and data are in the cache.

(a) Determine the minimum time needed for each of the steps in Figure 5.26.

(b) Ignoring all other delays, what is the minimum clock cycle that can be used for this processor?

5.25 [M] Give the sequence of steps needed to fetch and execute the instruction

Load R3, (R5)+

on the processor of Figure 5.24. Assume 32-bit operands.

5.26 [M] Consider a CISC-style processor that saves the return address of a subroutine on the processor stack instead of in the predefined register LINK. Give the sequence of actions needed to execute a Call_Register instruction on the processor of Figure 5.24.


Chapter 6

Pipelining

Chapter Objectives

In this chapter you will learn about:

• Pipelining as a means for improving performance by overlapping the execution of machine instructions

• Hazards that limit performance gains in pipelined processors and means for mitigating their effect

• Hardware and software implications of pipelining

• Influence of pipelining on instruction set design

• Superscalar processors


Chapter 5 introduced the organization of a processor for executing instructions one at a time. In this chapter, we discuss the concept of pipelining, which overlaps the execution of successive instructions to achieve high performance. We begin by explaining the basics of pipelining and how it can lead to improved performance. Then we examine hazards that cause performance degradation and techniques to alleviate their effect on performance. We discuss the role of optimizing compilers, which rearrange the sequence of instructions to maximize the benefits of pipelined execution. For further performance improvement, we also consider replicating hardware units in a superscalar processor so that multiple pipelines can operate concurrently.

6.1 Basic Concept—The Ideal Case

The speed of execution of programs is influenced by many factors. One way to improve performance is to use faster circuit technology to implement the processor and the main memory. Another possibility is to arrange the hardware so that more than one operation can be performed at the same time. In this way, the number of operations performed per second is increased, even though the time needed to perform any one operation is not changed.

Pipelining is a particularly effective way of organizing concurrent activity in a computer system. The basic idea is very simple. It is frequently encountered in manufacturing plants, where pipelining is commonly known as an assembly-line operation. Readers are undoubtedly familiar with the assembly line used in automobile manufacturing. The first station in an assembly line may prepare the automobile chassis, the next station adds the body, the next one installs the engine, and so on. While one group of workers is installing the engine on one automobile, another group is fitting a body on the chassis of a second automobile, and yet another group is preparing a new chassis for a third automobile. Although it may take hours or days to complete one automobile, the assembly-line operation makes it possible to have a new automobile rolling off the end of the assembly line every few minutes.

Consider how the idea of pipelining can be used in a computer. The five-stage processor organization in Figure 5.7 and the corresponding datapath in Figure 5.8 allow instructions to be fetched and executed one at a time. It takes five clock cycles to complete the execution of each instruction. Rather than wait until each instruction is completed, instructions can be fetched and executed in a pipelined manner, as shown in Figure 6.1. The five stages corresponding to those in Figure 5.7 are labeled as Fetch, Decode, Compute, Memory, and Write. Instruction Ij is fetched in the first cycle and moves through the remaining stages in the following cycles. In the second cycle, instruction Ij+1 is fetched while instruction Ij is in the Decode stage, where its operands are also read from the register file. In the third cycle, instruction Ij+2 is fetched while instruction Ij+1 is in the Decode stage and instruction Ij is in the Compute stage, where an arithmetic or logic operation is performed on its operands. Ideally, this overlapping pattern of execution would be possible for all instructions. Although any one instruction takes five cycles to complete its execution, instructions are completed at the rate of one per cycle.
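The ideal timing pattern can be captured in a few lines of Python (a sketch for the ideal case only, where a new instruction enters the pipeline every cycle):

```python
STAGES = ["Fetch", "Decode", "Compute", "Memory", "Write"]

def stage_of(i, cycle):
    """Stage occupied in a given clock cycle by instruction i (both 0-based),
    assuming a new instruction enters the pipeline in every cycle."""
    s = cycle - i
    return STAGES[s] if 0 <= s < len(STAGES) else None

# In the third cycle (0-based cycle 2): Ij is in Compute, Ij+1 in Decode,
# and Ij+2 is being fetched.
print([stage_of(i, 2) for i in range(3)])  # -> ['Compute', 'Decode', 'Fetch']
```

Reading the function's output column by column reproduces the staircase pattern of Figure 6.1: each instruction occupies one stage per cycle, and after the pipeline fills, one instruction completes in every cycle.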



Figure 6.1 Pipelined execution—the ideal case.

6.2 Pipeline Organization

Figure 6.2 indicates how the five-stage organization in Figures 5.7 and 5.8 can be pipelined. In the first stage of the pipeline, the program counter (PC) is used to fetch a new instruction. As other instructions are fetched, execution proceeds through successive stages. At any given time, each stage of the pipeline is processing a different instruction. Information such as register addresses, immediate data, and the operations to be performed must be carried through the pipeline as each instruction proceeds from one stage to the next. This information is held in interstage buffers. These include registers RA, RB, RM, RY, and RZ in Figure 5.8, the IR and PC-Temp registers in Figures 5.9 and 5.10, and additional storage. The interstage buffers are used as follows:

• Interstage buffer B1 feeds the Decode stage with a newly fetched instruction.

• Interstage buffer B2 feeds the Compute stage with the two operands read from the register file, the source/destination register identifiers, the immediate value derived from the instruction, the incremented PC value used as the return address for a subroutine call, and the settings of control signals determined by the instruction decoder. The settings for control signals move through the pipeline to determine the ALU operation, the memory operation, and a possible write into the register file.

• Interstage buffer B3 holds the result of the ALU operation, which may be data to be written into the register file or an address that feeds the Memory stage. In the case of a write access to memory, buffer B3 holds the data to be written. These data were read from the register file in the Decode stage. The buffer also holds the incremented PC value passed from the previous stage, in case it is needed as the return address for a subroutine-call instruction.

• Interstage buffer B4 feeds the Write stage with a value to be written into the register file. This value may be the ALU result from the Compute stage, the result of the Memory access stage, or the incremented PC value that is used as the return address for a subroutine-call instruction.
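As an illustration of the role of these buffers, the contents of buffer B2 might be modeled as follows (a Python sketch; the field names are our own invention, chosen to mirror the list above, not the book's notation):

```python
# Illustrative model of the information that interstage buffer B2 must
# carry from the Decode stage to the Compute stage.
from dataclasses import dataclass

@dataclass
class DecodeComputeBuffer:      # corresponds to interstage buffer B2
    operand_a: int              # value read from the register file (RA)
    operand_b: int              # value read from the register file (RB)
    dest_register: int          # destination register identifier
    immediate: int              # immediate value derived from the instruction
    return_address: int         # incremented PC, for subroutine calls
    control_signals: dict       # settings that steer the later stages

b2 = DecodeComputeBuffer(operand_a=7, operand_b=3, dest_register=2,
                         immediate=100, return_address=0x1004,
                         control_signals={"alu_op": "add", "reg_write": True})
assert b2.control_signals["reg_write"]
```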


[Figure: the five pipeline stages (Instruction fetch, Instruction decode, Compute, Memory access, Register file) separated by interstage buffers B1 through B4, which carry datapath operands and results, source/destination register identifiers, control signals for different stages, and other information.]

Figure 6.2 A five-stage pipeline.

6.3 Pipelining Issues

Figure 6.1 depicts the ideal overlap of three successive instructions. But there are times when it is not possible to have a new instruction enter the pipeline in every cycle. Consider the case of two instructions, Ij and Ij+1, where the destination register for instruction Ij is a source register for instruction Ij+1. The result of instruction Ij is not written into the register file until cycle 5, but it is needed earlier, in cycle 3, when the source operand is read for instruction Ij+1. If execution proceeds as shown in Figure 6.1, the result of instruction Ij+1 would be incorrect because the arithmetic operation would be performed using the old value of the register in question. To obtain the correct result, it is necessary to wait until the new value is written into the register by instruction Ij. Hence, instruction Ij+1 cannot read its operand until cycle 6, which means it must be stalled in the Decode stage for three cycles. While instruction Ij+1 is stalled, instruction Ij+2 and all subsequent instructions are similarly delayed. New instructions cannot enter the pipeline, and the total execution time is increased.

Any condition that causes the pipeline to stall is called a hazard. We have just described an example of a data hazard, where the value of a source operand of an instruction is not available when needed. Other hazards arise from memory delays, branch instructions, and resource limitations. The next several sections describe these hazards in more detail, along with techniques to mitigate their impact on performance.
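The cost of the stall just described can be checked with simple arithmetic (a hypothetical helper in Python; cycle counts follow the scenario above):

```python
# Back-of-envelope check of the data-hazard stall: a dependent
# instruction held in Decode for three cycles completes three cycles
# later than in the ideal case, and so does everything behind it.
def completion_cycle(index, stall_cycles_before=0):
    """Cycle in which instruction `index` (0-based) leaves the Write
    stage of a five-stage pipeline, given stalls injected before it."""
    return index + 5 + stall_cycles_before

assert completion_cycle(0) == 5                          # Ij completes in cycle 5
assert completion_cycle(1) == 6                          # ideal case for Ij+1
assert completion_cycle(1, stall_cycles_before=3) == 9   # Ij+1 stalled three cycles
```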

6.4 Data Dependencies

Consider the two instructions in Figure 6.3:

Add R2, R3, #100
Subtract R9, R2, #30

The destination register R2 for the Add instruction is a source register for the Subtract instruction. There is a data dependency between these two instructions, because register R2 carries data from the first instruction to the second. Pipelined execution of these two instructions is depicted in Figure 6.3. The Subtract instruction is stalled for three cycles to delay reading register R2 until cycle 6, when the new value becomes available.

Figure 6.3 Pipeline stall due to data dependency.

We now explain the stall in more detail. The control circuit must first recognize the data dependency when it decodes the Subtract instruction in cycle 3 by comparing its source register identifier from interstage buffer B1 with the destination register identifier of the Add instruction that is held in interstage buffer B2. Then, the Subtract instruction must be held in interstage buffer B1 during cycles 3 to 5. Meanwhile, the Add instruction proceeds through the remaining pipeline stages. In cycles 3 to 5, as the Add instruction moves ahead, control signals can be set in interstage buffer B2 for an implicit NOP (No-operation) instruction that does not modify the memory or the register file. Each NOP creates one clock cycle of idle time, called a bubble, as it passes through the Compute, Memory, and Write stages to the end of the pipeline.

6.4.1 Operand Forwarding

Pipeline stalls due to data dependencies can be alleviated through the use of operand forwarding. Consider the pair of instructions discussed above, where the pipeline is stalled for three cycles to enable the Subtract instruction to use the new value in register R2. The desired value is actually available at the end of cycle 3, when the ALU completes the operation for the Add instruction. This value is loaded into register RZ in Figure 5.8, which is a part of interstage buffer B3. Rather than stall the Subtract instruction, the hardware can forward the value from register RZ to where it is needed in cycle 4, which is the ALU input. Figure 6.4 shows pipelined execution when forwarding is implemented. The arrow shows that the ALU result from cycle 3 is used as an input to the ALU in cycle 4.

Figure 6.5 shows the modification needed in the datapath of Figure 5.8 to make this forwarding possible. A new multiplexer, MuxA, is inserted before input InA of the ALU, and the existing multiplexer MuxB is expanded with another input. The multiplexers select either a value read from the register file in the normal manner, or the value available in register RZ.
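The selection made by MuxA and MuxB can be sketched as a comparison of register identifiers (illustrative Python; the function name and argument names are our own):

```python
# Minimal sketch of the forwarding decision: if the instruction now in
# Compute reads a register that the previous instruction (whose result
# sits in RZ) writes, select RZ instead of the stale register-file value.
def select_alu_input(source_reg, prev_dest_reg, regfile_value, rz_value):
    """Model of MuxA/MuxB: forward from RZ on a match, otherwise use
    the value that was read from the register file."""
    if prev_dest_reg is not None and source_reg == prev_dest_reg:
        return rz_value
    return regfile_value

# Add R2, R3, #100 followed by Subtract R9, R2, #30: the Subtract's
# first operand (R2) is forwarded from RZ rather than the register file.
assert select_alu_input(2, 2, regfile_value=0, rz_value=150) == 150
assert select_alu_input(3, 2, regfile_value=50, rz_value=150) == 50
```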

Forwarding can also be extended to a result in register RY in Figure 5.8. This would handle a data dependency such as the one involving register R2 in the following sequence of instructions:

Add R2, R3, #100
Or R4, R5, R6
Subtract R9, R2, #30

When the Subtract instruction is in the Compute stage of the pipeline, the Or instruction is in the Memory stage (where no operation is performed), and the Add instruction is in the Write stage. The new value of register R2 generated by the Add instruction is now in register RY. Forwarding this value from register RY to ALU input InA makes it possible to avoid stalling the pipeline. MuxA requires another input for the value of RY. Similarly, MuxB is extended with another input.

Figure 6.4 Avoiding a stall by using operand forwarding.

Figure 6.5 Modification of the datapath of Figure 5.8 to support data forwarding from register RZ to the ALU inputs.

6.4.2 Handling Data Dependencies in Software

Figures 6.3 and 6.4 show how data dependencies may be handled by the processor hardware, either by stalling the pipeline or by forwarding data. An alternative approach is to leave the task of detecting data dependencies and dealing with them to the compiler. When the compiler identifies a data dependency between two successive instructions Ij and Ij+1, it can insert three explicit NOP (No-operation) instructions between them. The NOPs introduce the necessary delay to enable instruction Ij+1 to read the new value from the register file after it is written. For the instructions in Figure 6.4, the compiler would generate the instruction sequence in Figure 6.6a. Figure 6.6b shows that the three NOP instructions have the same effect on execution time as the stall in Figure 6.3.

[Figure: (a) insertion of NOP instructions for a data dependency (Add R2, R3, #100; NOP; NOP; NOP; Subtract R9, R2, #30); (b) pipelined execution of these instructions.]

Figure 6.6 Using NOP instructions to handle a data dependency in software.

Requiring the compiler to identify dependencies and insert NOP instructions simplifies the hardware implementation of the pipeline. However, the code size increases, and the execution time is not reduced as it would be with operand forwarding. The compiler can attempt to optimize the code to improve performance and reduce the code size by reordering instructions to move useful instructions into the NOP slots. In doing so, the compiler must consider data dependencies between instructions, which constrain the extent to which the NOP slots can be usefully filled.
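The compiler's NOP-insertion pass can be sketched as follows (a simplified Python model that handles only adjacent dependencies, as in the example above; the representation of instructions as (destination, sources) tuples is our own):

```python
# Sketch of the software approach: scan an instruction list and insert
# three NOPs whenever an instruction reads the register that the
# immediately preceding instruction writes.
def insert_nops(program):
    """program: list of (dest_register, tuple_of_source_registers)."""
    out = []
    for instr in program:
        if out and out[-1] != "NOP":
            prev_dest, _ = out[-1]
            _, sources = instr
            if prev_dest in sources:
                out.extend(["NOP"] * 3)   # three-cycle separation
        out.append(instr)
    return out

# Add R2, R3, #100 ; Subtract R9, R2, #30  ->  three NOPs in between
prog = [(2, (3,)), (9, (2,))]
assert insert_nops(prog) == [(2, (3,)), "NOP", "NOP", "NOP", (9, (2,))]
```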


6.5 Memory Delays

Delays arising from memory accesses are another cause of pipeline stalls. For example, a Load instruction may require more than one clock cycle to obtain its operand from memory. This may occur because the requested instruction or data are not found in the cache, resulting in a cache miss. Figure 6.7 shows the effect of a delay in accessing data in the memory on pipelined execution. A memory access may take ten or more cycles. For simplicity, the figure shows only three cycles. A cache miss causes all subsequent instructions to be delayed. A similar delay can be caused by a cache miss when fetching an instruction.

There is an additional type of memory-related stall that occurs when there is a data dependency involving a Load instruction. Consider the instructions:

Load R2, (R3)
Subtract R9, R2, #30

Assume that the data for the Load instruction is found in the cache, requiring only one cycle to access the operand. The destination register R2 for the Load instruction is a source register for the Subtract instruction. Operand forwarding cannot be done in the same manner as in Figure 6.4, because the data read from memory (the cache, in this case) are not available until they are loaded into register RY at the beginning of cycle 5. Therefore, the Subtract instruction must be stalled for one cycle, as shown in Figure 6.8, to delay the ALU operation. The memory operand, which is now in register RY, can be forwarded to the ALU input in cycle 5.

The compiler can eliminate the one-cycle stall for this type of data dependency by reordering instructions to insert a useful instruction between the Load instruction and the instruction that depends on the data read from the memory. The inserted instruction fills the bubble that would otherwise be created. If a useful instruction cannot be found by the compiler, then the hardware introduces the one-cycle stall automatically. If the processor hardware does not deal with dependencies, then the compiler must insert an explicit NOP instruction.
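The hardware check for this load-use case can be sketched as follows (illustrative Python; the instruction tuples and the function name are our own):

```python
# Sketch: detect the load-use case that forces a one-cycle stall.
# Instructions are modeled as (op, dest_register, source_registers).
def needs_load_use_stall(prev_instr, instr):
    """True if `instr` reads the register that a preceding Load writes."""
    prev_op, prev_dest, _ = prev_instr
    _, _, sources = instr
    return prev_op == "Load" and prev_dest in sources

load = ("Load", 2, (3,))       # Load R2, (R3)
sub = ("Subtract", 9, (2,))    # Subtract R9, R2, #30
add = ("Add", 7, (8, 9))
assert needs_load_use_stall(load, sub)     # one-cycle stall required
assert not needs_load_use_stall(sub, add)  # no Load involved, no stall
```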


Figure 6.7 Stall caused by a memory access delay for a Load instruction.


Figure 6.8 Stall needed to enable forwarding for an instruction that follows a Load instruction.

6.6 Branch Delays

In ideal pipelined execution, a new instruction is fetched every cycle, while the preceding instruction is still being decoded. Branch instructions can alter the sequence of execution, but they must first be executed to determine whether and where to branch. We now examine the effect of branch instructions and the techniques that can be used for mitigating their impact on pipelined execution.

6.6.1 Unconditional Branches

Figure 6.9 shows the pipelined execution of a sequence of instructions, beginning with an unconditional branch instruction, Ij. The next two instructions, Ij+1 and Ij+2, are stored in successive memory addresses following Ij. The target of the branch is instruction Ik. According to Figure 5.15, the branch instruction is fetched in cycle 1 and decoded in cycle 2, and the target address is computed in cycle 3. Hence, instruction Ik is fetched in cycle 4, after the program counter has been updated with the target address. In pipelined execution, instructions Ij+1 and Ij+2 are fetched in cycles 2 and 3, respectively, before the branch instruction is decoded and its target address is known. They must be discarded. The resulting two-cycle delay constitutes a branch penalty.

Branch instructions occur frequently. In fact, they represent about 20 percent of the dynamic instruction count of most programs. (The dynamic count is the number of instruction executions, taking into account the fact that some instructions in a program are executed many times, because of loops.) With a two-cycle branch penalty, the relatively high frequency of branch instructions could increase the execution time for a program by as much as 40 percent. Therefore, it is important to find ways to mitigate this impact on performance.
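The 40 percent figure follows directly from the stated assumptions, as this arithmetic check shows:

```python
# With a branch frequency of 20 percent and a two-cycle penalty per
# branch, the average cycles per instruction rise from 1 to
# 1 + 0.2 * 2 = 1.4, i.e. a 40 percent longer execution time.
branch_fraction = 0.20
penalty_cycles = 2
cycles_per_instruction = 1 + branch_fraction * penalty_cycles
assert abs(cycles_per_instruction - 1.4) < 1e-12
increase = cycles_per_instruction - 1.0
assert abs(increase - 0.40) < 1e-12   # 40 percent increase
```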

Reducing the branch penalty requires the branch target address to be computed earlier in the pipeline. Rather than wait until the Compute stage, it is possible to determine the target address and update the program counter in the Decode stage. Thus, instruction Ik can be fetched one clock cycle earlier, reducing the branch penalty to one cycle, as shown in Figure 6.10. This time, only one instruction, Ij+1, is fetched incorrectly, because the target address is determined in the Decode stage.


Figure 6.9 Branch penalty when the target address is determined in the Compute stage of the pipeline.


Figure 6.10 Branch penalty when the target address is determined in the Decode stage of the pipeline.

The hardware in Figure 5.10 must be modified to implement this change. The adder in the figure is needed to increment the PC in every cycle. A second adder is needed in the Decode stage to compute a branch target address for every instruction. When the instruction decoder determines that the instruction is indeed a branch instruction, the computed target address will be available before the end of the cycle. It can then be used to fetch the target instruction in the next cycle.


6.6.2 Conditional Branches

Consider a conditional branch instruction such as

Branch_if_[R5]=[R6] LOOP

The execution steps for this instruction are shown in Figure 5.16. The result of the comparison in the third step determines whether the branch is taken.

For pipelining, the branch condition must be tested as early as possible to limit the branch penalty. We have just described how the target address for an unconditional branch instruction can be determined in the Decode stage. Similarly, the comparator that tests the branch condition can also be moved to the Decode stage, enabling the conditional branch decision to be made at the same time that the target address is determined. In this case, the comparator uses the values from outputs A and B of the register file directly.

Moving the branch decision to the Decode stage ensures a common branch penalty of only one cycle for all branch instructions. In the next two sections, we discuss additional techniques that can be used to further mitigate the effect of branches on execution time.

6.6.3 The Branch Delay Slot

Consider the program fragment shown in Figure 6.11a. Assume that the branch target address and the branch decision are determined in the Decode stage, at the same time that instruction Ij+1 is fetched. The branch instruction may cause instruction Ij+1 to be discarded, after the branch condition is evaluated. If the condition is true, then there is a branch penalty of one cycle before the correct target instruction Ik is fetched. If the condition is false, then instruction Ij+1 is executed, and there is no penalty. In both of these cases, the instruction immediately following the branch instruction is always fetched. Based on this observation, we describe a technique to reduce the penalty for branch instructions.

The location that follows a branch instruction is called the branch delay slot. Rather than conditionally discard the instruction in the delay slot, we can arrange to have the pipeline always execute this instruction, whether or not the branch is taken. The instruction in the delay slot cannot be Ij+1, the one that may be discarded depending on the branch condition. Instead, the compiler attempts to find a suitable instruction to occupy the delay slot, one that needs to be executed even when the branch is taken. It can do so by moving one of the instructions preceding the branch instruction to the delay slot. Of course, this can only be done if any data dependencies involving the instruction being moved are preserved. If a useful instruction is found, then there will be no branch penalty. If no useful instruction can be placed in the delay slot because of constraints arising from data dependencies, a NOP must be placed there instead. In this case, there will be a penalty of one cycle whether or not the branch is taken.

(a) Original sequence of instructions containing a conditional branch instruction:

Add R7, R8, R9
Branch_if_[R3]=0 TARGET
Ij+1
...
TARGET: Ik

(b) Placing the Add instruction in the branch delay slot where it is always executed:

Branch_if_[R3]=0 TARGET
Add R7, R8, R9
Ij+1
...
TARGET: Ik

Figure 6.11 Filling the branch delay slot with a useful instruction.

For the instructions in Figure 6.11a, the Add instruction can safely be moved into the branch delay slot, as shown in Figure 6.11b. The Add instruction is always fetched and executed, even if the branch is taken. Instruction Ij+1 is fetched only if the branch is not taken. Logically, execution proceeds as though the branch instruction were placed after the Add instruction. That is, branching takes place one instruction later than where the branch instruction appears in the instruction sequence. This technique is called delayed branching.

The effectiveness of delayed branching depends on how often the compiler can reorder instructions to usefully fill the delay slot. Experimental data collected from many programs indicate that the compiler can fill a branch delay slot in 70 percent or more of the cases.

6.6.4 Branch Prediction

The discussion above shows that making the branch decision in cycle 2 of the execution of a branch instruction reduces the branch penalty. But even then, the instruction immediately following the branch instruction is still fetched in cycle 2 and may have to be discarded. The decision to fetch this instruction is actually made in cycle 1, when the PC is incremented while the branch instruction itself is being fetched. Thus, to reduce the branch penalty further, the processor needs to anticipate that an instruction being fetched is a branch instruction and predict its outcome to determine which instruction should be fetched in cycle 2. In this section, we first describe different methods for branch prediction. Then, we discuss how the prediction is made in cycle 1, while a branch instruction is being fetched.


Static Branch Prediction

The simplest form of branch prediction is to assume that the branch will not be taken and to fetch the next instruction in sequential address order. If the prediction is correct, the fetched instruction is allowed to complete and there is no penalty. However, if it is determined that the branch is to be taken, the instruction that has been fetched is discarded and the correct branch target instruction is fetched. Misprediction incurs the full branch penalty. This simple approach is a form of static branch prediction. The same choice (assume not-taken) is used every time a conditional branch is encountered.

If branch outcomes were random, then half of all conditional branches would be taken. In this case, always assuming that branches will not be taken results in a prediction accuracy of 50 percent. However, a backward branch at the end of a loop is taken most of the time. For such a branch, better accuracy can be achieved by predicting that the branch is likely to be taken. Thus, instructions are fetched using the branch target address as soon as it is known. Similarly, for a forward branch at the beginning of a loop, the not-taken prediction leads to good prediction accuracy. The processor can determine the static prediction of taken or not-taken by checking the sign of the branch offset. Alternatively, the machine encoding of a branch instruction may include one bit that indicates whether the branch should be predicted as taken or not taken. The setting of this bit can be specified by the compiler.

Dynamic Branch Prediction

To improve prediction accuracy further, we can use actual branch behavior to influence the prediction, resulting in dynamic branch prediction. The processor hardware assesses the likelihood of a given branch being taken by keeping track of branch decisions every time that a branch instruction is executed.

In its simplest form, a dynamic prediction algorithm can use the result of the most recent execution of a branch instruction. The processor assumes that the next time the instruction is executed, the branch decision is likely to be the same as the last time. Hence, the algorithm may be described by the two-state machine in Figure 6.12a. The two states are:

LT - Branch is likely to be taken
LNT - Branch is likely not to be taken

Suppose that the algorithm is started in state LNT. When the branch instruction is executed and the branch is taken, the machine moves to state LT. Otherwise, it remains in state LNT. The next time the same instruction is encountered, the branch is predicted as taken if the state machine is in state LT. Otherwise, it is predicted as not taken.
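The two-state algorithm can be sketched directly from this description (Python used for illustration; the class and method names are our own):

```python
# Sketch of the two-state (one-bit) predictor: predict the same outcome
# as the branch's previous execution.
class OneBitPredictor:
    def __init__(self):
        self.state = "LNT"          # start in "likely not taken"

    def predict(self):
        return self.state == "LT"   # True means "predict taken"

    def update(self, taken):
        self.state = "LT" if taken else "LNT"

# A loop branch taken four times and then not taken: the predictor
# mispredicts on the first pass (predicted not taken) and on the
# last pass (predicted taken).
p = OneBitPredictor()
outcomes = [True, True, True, True, False]
mispredictions = 0
for taken in outcomes:
    if p.predict() != taken:
        mispredictions += 1
    p.update(taken)
assert mispredictions == 2
```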

This simple scheme, which requires only a single bit to represent the history of execution for a branch instruction, works well inside program loops. Once a loop is entered, the decision for the branch instruction that controls looping will always be the same except for the last pass through the loop. Hence, each prediction for the branch instruction will be correct except in the last pass. The prediction in the last pass will be incorrect, and the branch history state machine will be changed to the opposite state. Unfortunately, this means that the next time this same loop is entered—and assuming that there will be more than one pass through the loop—the state machine will lead to the wrong prediction for the first pass. Thus, repeated execution of the same loop results in mispredictions in the first pass and the last pass.

[Figure: (a) a 2-state algorithm with states LNT and LT; (b) a 4-state algorithm with states SNT, LNT, LT, and ST. Transitions are labeled Branch taken (BT) and Branch not taken (BNT).]

Figure 6.12 State-machine representation of branch prediction algorithms.

Better prediction accuracy can be achieved by keeping more information about execution history. An algorithm that uses four states is shown in Figure 6.12b. The four states are:

ST - Strongly likely to be taken
LT - Likely to be taken
LNT - Likely not to be taken
SNT - Strongly likely not to be taken


Again assume that the state of the algorithm is initially set to LNT. After the branch instruction is executed, and if the branch is actually taken, the state is changed to ST; otherwise, it is changed to SNT. As program execution progresses and the same branch instruction is encountered multiple times, the state of the prediction algorithm changes as shown. The branch is predicted as taken if the state is either ST or LT. Otherwise, the branch is predicted as not taken.

Let us reconsider what happens when executing a program loop. Assume that the branch instruction is at the end of the loop and that the processor sets the initial state of the algorithm to LNT. In the first pass, the prediction (not taken) will be wrong, and hence the state will be changed to ST. In all subsequent passes, the prediction will be correct, except for the last pass. At that time, the state will change to LT. When the loop is entered a second time, the prediction in the first pass will be to take the branch, which will be correct if there is more than one iteration. Thus, repeated execution of the same loop now results in only one misprediction, in the last pass.
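This behavior can be checked with a sketch of the four-state machine (Python for illustration; the full transition table is an assumption consistent with the text's description of Figure 6.12b, since the text does not spell out every transition):

```python
# Four-state predictor sketch: a taken branch moves toward ST, a
# not-taken branch in ST retreats only to LT, and a not-taken branch
# in the other states moves to SNT (assumed transition table).
TRANSITIONS = {
    ("SNT", True): "LNT", ("SNT", False): "SNT",
    ("LNT", True): "ST",  ("LNT", False): "SNT",
    ("LT",  True): "ST",  ("LT",  False): "SNT",
    ("ST",  True): "ST",  ("ST",  False): "LT",
}

def run(outcomes, state="LNT"):
    """Return (mispredictions, final state) over a branch outcome trace."""
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state in ("ST", "LT")
        if predicted_taken != taken:
            mispredictions += 1
        state = TRANSITIONS[(state, taken)]
    return mispredictions, state

loop = [True, True, True, False]      # four-pass loop, branch at the end
first, state = run(loop)              # first execution: first and last pass miss
second, _ = run(loop, state)          # second execution: only the last pass misses
assert (first, second) == (2, 1)
```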

Branch Target Buffer for Dynamic Prediction

In earlier discussion, we pointed out that the branch target address and the branch decision can both be determined in the Decode stage of the pipeline, which is cycle 2 of instruction execution. The instruction being fetched in the same cycle may or may not be the one that has to be executed after the branch instruction. It may have to be discarded, in which case the correct instruction will be fetched in cycle 3. How can branch prediction be used to obtain better performance?

The key to improving performance is to increase the likelihood that the instruction fetched in cycle 2 is the correct one. This can be achieved only if branch prediction takes place in cycle 1, at the same time that the branch instruction is being fetched. To make this possible, the processor needs to keep more information about the history of execution. The required information is usually stored in a small, fast memory called the branch target buffer.

The branch target buffer identifies branch instructions by their addresses. As each branch instruction is executed, the processor records the address of the instruction and the outcome of the branch decision in the buffer. The information is organized in the form of a lookup table, in which each entry includes:

• the address of the branch instruction
• one or two state bits for the branch prediction algorithm
• the branch target address

With this information, the processor is able to identify branch instructions and obtain the corresponding branch prediction state bits based on the address of the instruction being fetched.

Every time the processor fetches a new instruction, it checks the branch target buffer for an entry containing the same instruction address. If an entry with that address is found, this means that the instruction being fetched is a branch instruction. The processor is then able to use the state bits to predict whether that branch is likely to be taken. At the same time, the target address is also obtained. The processor is able to obtain this information as the branch instruction is being fetched in cycle 1. In cycle 2, the processor uses the predicted outcome of the branch to fetch the next instruction. Of course, it must also determine the actual branch decision and target address to determine whether the predicted values were correct. If they are, execution continues without penalty. Otherwise, the instruction that has just been fetched is discarded, and the correct instruction is fetched in cycle 3. The main value of the branch target buffer is that the state information needed for branch prediction and the target address of a branch instruction are both obtained at the same time the branch instruction is being fetched.

Large programs have many branch instructions. A branch target buffer with enough storage to accommodate information for all of them would be large, and searching it quickly would be difficult. For this reason, the table has a limited size, containing information for only the most recently executed branch instructions. Entries in the table are replaced as other branch instructions are executed. Typically, the table contains on the order of 1024 entries.
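A branch target buffer can be sketched as a bounded lookup table (illustrative Python; the class shape and the crude replacement policy are our own, not the book's design):

```python
# Sketch of a branch target buffer: a small table keyed by instruction
# address (a dict stands in for the fixed-size hardware structure).
class BranchTargetBuffer:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.entries = {}            # address -> (prediction_state, target)

    def lookup(self, fetch_address):
        """During Fetch (cycle 1): a hit means 'this is a branch'; returns
        the stored prediction state and target address, or None on a miss."""
        return self.entries.get(fetch_address)

    def record(self, address, prediction_state, target):
        """After a branch executes, record (or update) its table entry."""
        if len(self.entries) >= self.capacity and address not in self.entries:
            self.entries.pop(next(iter(self.entries)))  # crude replacement
        self.entries[address] = (prediction_state, target)

btb = BranchTargetBuffer()
btb.record(0x1000, "LT", 0x2000)
assert btb.lookup(0x1000) == ("LT", 0x2000)  # branch recognized in cycle 1
assert btb.lookup(0x1004) is None            # not a known branch instruction
```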

6.7 Resource Limitations

Pipelining enables overlapped execution of instructions, but the pipeline stalls when there are insufficient hardware resources to permit all actions to proceed concurrently. If two instructions need to access the same resource in the same clock cycle, one instruction must be stalled to allow the other instruction to use the resource. This can be prevented by providing additional hardware.

Such stalls can occur in a computer that has a single cache that supports only one access per cycle. If both the Fetch and Memory stages of the pipeline are connected to the cache, then it is not possible for activity in both stages to proceed simultaneously. Normally, the Fetch stage accesses the cache in every cycle. However, this activity must be stalled for one cycle when there is a Load or Store instruction in the Memory stage also needing to access the cache. If 25 percent of all instructions executed are Load or Store instructions, these stalls increase the execution time by 25 percent. Using separate caches for instructions and data allows the Fetch and Memory stages to proceed simultaneously without stalling.
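The 25 percent figure can be verified with the same style of arithmetic used earlier:

```python
# With a single shared cache, every Load or Store steals one fetch
# cycle, adding one stall cycle per such instruction.
load_store_fraction = 0.25
cycles_per_instruction = 1 + load_store_fraction * 1   # one-cycle stall each
assert abs(cycles_per_instruction - 1.25) < 1e-12      # 25 percent longer
```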

6.8 Performance Evaluation

For a non-pipelined processor, the execution time, T, of a program that has a dynamic instruction count of N is given by

T = (N × S) / R

where S is the average number of clock cycles it takes to fetch and execute one instruction, and R is the clock rate in cycles per second. This is often referred to as the basic performance equation. A useful performance indicator is the instruction throughput, which is the number of instructions executed per second. For non-pipelined execution, the throughput, Pnp, is given by

Pnp = R / S
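Plugging illustrative numbers into these formulas (N and R below are made-up example values; S = 5 matches the five-cycle processor of Chapter 5):

```python
# Worked example of the basic performance equation T = (N * S) / R
# and the non-pipelined throughput Pnp = R / S.
N = 1_000_000                 # dynamic instruction count (example value)
S = 5                         # cycles per instruction, as in Chapter 5
R = 500e6                     # clock rate in cycles per second (example value)
T = N * S / R                 # execution time in seconds
P_np = R / S                  # non-pipelined instruction throughput
assert abs(T - 0.01) < 1e-12  # 10 ms for this example
assert P_np == 100e6          # 100 million instructions per second
```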


The processor presented in Chapter 5 uses five cycles to execute all instructions. Thus, if there are no cache misses, S is equal to 5.

Pipelining improves performance by overlapping the execution of successive instructions, which increases instruction throughput even though an individual instruction is still executed in the same number of cycles. For the five-stage pipeline described in this chapter, each instruction is executed in five cycles, but a new instruction can ideally enter the pipeline every cycle. Thus, in the absence of stalls, S is equal to 1, and the ideal throughput with pipelining is

Pp = R

A five-stage pipeline can potentially increase the throughput by a factor of five. In general, an n-stage pipeline has the potential to increase throughput n times. Thus, it would appear that the higher the value of n, the larger the performance gain. This leads to two questions:

• How much of this potential increase in instruction throughput can actually be realized in practice?

• What is a good value for n?

Any time a pipeline is stalled or instructions are discarded, the instruction throughput is reduced below its ideal value. Hence, the performance of a pipeline is highly influenced by factors such as stalls due to data dependencies between instructions and penalties due to branches. Cache misses increase the execution time even further. We discuss these issues first, and then we return to the question of how many pipeline stages should be used.

6.8.1 Effects of Stalls and Penalties

The effects of stalls and penalties have been examined qualitatively in the previous sections. We now consider these effects in quantitative terms.

The five-stage pipeline involves memory-access operations in the Fetch and Memory stages, and ALU operations in the Compute stage. The operations with the longest delay dictate the cycle time, and hence the clock rate R. For a processor that has on-chip caches, memory-access operations have a small delay when the desired instructions or data are found in the cache. The delay through the ALU is likely to be the critical parameter. If this delay is 2 ns, then R = 500 MHz, and the ideal pipelined instruction throughput is Pp = 500 MIPS (million instructions per second).

Consider a processor with operand forwarding in hardware, as explained in Section 6.4.1. This means that there are no penalties due to data dependencies, except in the case of Load instructions. To evaluate the effect of stalls not related to cache misses, we can consider how often a Load instruction is immediately followed by another instruction that uses the result of the memory access. Section 6.5 explained that a one-cycle stall is necessary in such cases. While ideal pipelined execution has S = 1, stalls due to such Load instructions have the effect of increasing S by an amount δstall. For example, assume that Load instructions constitute 25 percent of the dynamic instruction count, and assume that 40 percent of these Load instructions are followed by a dependent instruction. A one-cycle


stall is needed in such cases. Hence, the increase over the ideal case of S = 1 is

δstall = 0.25 × 0.40 × 1 = 0.10

That is, the execution time T is increased by 10 percent, and throughput is reduced to

Pp = R/(1 + δstall) = R/1.1 = 0.91R

The compiler can improve performance by reducing the number of times that a Load instruction is immediately followed by a dependent instruction. A stall is eliminated each time the compiler can safely move a nearby instruction to a position between the Load instruction and the dependent instruction.
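The stall calculation above can be expressed as a short numerical sketch; the function and variable names are illustrative, not from the text.

```python
# Minimal sketch of the Load-stall calculation from this section: a
# one-cycle stall occurs whenever a Load is immediately followed by an
# instruction that uses its result.

def load_stall_increase(load_fraction, dependent_fraction, stall_cycles=1):
    """delta_stall: average extra cycles per instruction due to Load stalls."""
    return load_fraction * dependent_fraction * stall_cycles

delta_stall = load_stall_increase(0.25, 0.40)  # 25% Loads, 40% with a dependent successor
S = 1 + delta_stall                            # cycles per instruction rises to 1.1
relative_throughput = 1 / S                    # Pp drops to about 0.91R
```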

Now, consider the penalties due to mispredicting branches during program execution. When both the branch decision and the branch target address are determined in the Decode stage of the pipeline, the branch penalty is one cycle. Assume that branches constitute 20 percent of the dynamic instruction count of a program, and that the average prediction accuracy for branch instructions is 90 percent. In other words, 10 percent of all branch instructions that are executed incur a one-cycle penalty due to misprediction. The increase in the average number of cycles per instruction due to branch penalties is

δbranch_penalty = 0.20 × 0.10 × 1 = 0.02

High prediction accuracy is beneficial in limiting the adverse impact of this penalty on performance.

The stalls related to Load instructions and the penalties from branch misprediction are independent. Hence, their effect on performance is additive. The sum of δstall and δbranch_penalty determines the increase in the number of cycles, S, the increase in the execution time, T, and the reduction in the throughput, Pp.

The effect of cache misses on performance can be assessed by considering the frequency of their occurrence. The time to access the slower main memory is a penalty that stalls the pipeline for pm cycles every time there is a cache miss. A fraction mi of all instructions that are fetched incur a cache miss. A fraction d of all instructions are Load or Store instructions, and a fraction md of these instructions incur a cache miss. The increase over the ideal case of S = 1 due to cache misses is

δmiss = (mi + d × md) × pm

Suppose that 5 percent of all fetched instructions incur a cache miss, 30 percent of all instructions executed are Load or Store instructions, and 10 percent of their data-operand accesses incur a cache miss. Assume that the penalty to access the main memory for a cache miss is 10 cycles. The increase over the ideal case of S = 1 due to cache misses in this case is given by

δmiss = (0.05 + 0.30 × 0.10) × 10 = 0.8

Compared to δstall for data dependencies and δbranch_penalty for mispredicted branches, the effect of a slow main memory for cache misses is more significant in this example. When all factors are combined, S is increased from the ideal value of 1 to 1 + δstall + δbranch_penalty + δmiss. The contribution of cache misses is often the dominant one.
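Combining the three contributions in this section gives the overall average cycles per instruction. The following sketch uses the section's example numbers; the function and variable names are illustrative assumptions.

```python
# Sketch combining the penalty terms from this section:
# S = 1 + delta_stall + delta_branch_penalty + delta_miss.

def miss_increase(m_i, d, m_d, p_m):
    """delta_miss = (m_i + d * m_d) * p_m, as defined in the text."""
    return (m_i + d * m_d) * p_m

delta_stall = 0.25 * 0.40 * 1                       # Load-use stalls
delta_branch = 0.20 * 0.10 * 1                      # mispredicted branches
delta_miss = miss_increase(0.05, 0.30, 0.10, 10)    # cache misses: 0.8
S = 1 + delta_stall + delta_branch + delta_miss     # 1.92 cycles per instruction
# Throughput falls to R/1.92, roughly half of the ideal pipelined rate,
# with the cache-miss term clearly the dominant contribution.
```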


6.8.2 Number of Pipeline Stages

The fact that an n-stage pipeline may increase instruction throughput by a factor of n suggests that we should use a large number of stages. However, as the number of pipeline stages increases, there are more instructions being executed concurrently. Consequently, there are more potential dependencies between instructions that may lead to pipeline stalls. Furthermore, the branch penalty may be larger than one cycle if a longer pipeline moves the branch decision to a later stage. For these reasons, the gain in throughput from increasing the value of n begins to diminish, and the cost of a deeper pipeline may not be justified.

Another important factor is the inherent delay in the basic operations performed by the processor. The most important among these is the ALU delay. In many processors, the cycle time of the processor clock is chosen such that one ALU operation can be completed in one cycle. Other operations, including accesses to a cache memory, are typically divided into steps that each take about the same time as an ALU operation. Further reductions in the clock cycle time are possible if a pipelined ALU is used. Some recent processor implementations have used twenty or more pipeline stages to aggressively reduce the cycle time. Implementing such long pipelines using modern technology allows for clock rates of several GHz.

6.9 Superscalar Operation

The maximum throughput of a pipelined processor is one instruction per clock cycle. A more aggressive approach is to equip the processor with multiple execution units, each of which may be pipelined, to increase the processor’s ability to handle several instructions in parallel. With this arrangement, several instructions start execution in the same clock cycle, but in different execution units, and the processor is said to use multiple-issue. Such processors can achieve an instruction execution throughput of more than one instruction per cycle. They are known as superscalar processors. Many modern high-performance processors use this approach.

To enable multiple-issue execution, a superscalar processor has a more elaborate fetch unit that fetches two or more instructions per cycle before they are needed and places them in an instruction queue. A separate unit, called the dispatch unit, takes two or more instructions from the front of the queue, decodes them, and sends them to the appropriate execution units. At the end of the pipeline, another unit is responsible for writing results into the register file. Figure 6.13 shows a superscalar processor with this organization. It incorporates two execution units, one for arithmetic instructions and another for Load and Store instructions. Arithmetic operations normally require only one cycle, hence the first execution unit is simple. Because Load and Store instructions involve an address calculation for the Index mode before each memory access, the Load/Store unit has a two-stage pipeline.

The organization in Figure 6.13 raises some important implications for the register file. An arithmetic instruction and a Load or Store instruction must obtain all their operands from the register file when they are dispatched in the same cycle to the two execution units. The register file must now have four output ports instead of the two output ports needed in


Figure 6.13 A superscalar processor with two execution units. [Diagram: a fetch unit fills an instruction queue; the dispatch unit takes instructions from the queue and sends them to an arithmetic unit and a Load/Store unit, whose results are then written back.]

the simple pipeline. Similarly, an arithmetic instruction and a Load instruction must write their results into the register file when they complete in the same cycle. Thus, the register file must now have two input ports instead of the single input port for the simple pipeline. There is also the potential complication of two instructions completing at the same time with the same destination register for their results. This complication is avoided, if possible, by dispatching the instructions in a manner that prevents its occurrence. Otherwise, one instruction is stalled to ensure that results are written into the destination register in the same order as in the original instruction sequence of the program.

To illustrate superscalar execution in the processor in Figure 6.13, consider the following sequence of instructions:

Add R2, R3, #100
Load R5, 16(R6)
Subtract R7, R8, R9
Store R10, 24(R11)

Figure 6.14 shows how these instructions would be executed. The fetch unit fetches two instructions every cycle. The instructions are decoded and their source registers are read in the next cycle. Then, they are dispatched to the arithmetic and Load/Store units. Arithmetic operations can be initiated every cycle. A Load or Store instruction can also be initiated every cycle, because the two-stage pipeline overlaps the address calculation for one Load or Store instruction with the memory access for the preceding Load or Store instruction.


Figure 6.14 An example of instruction flow in the processor of Figure 6.13. [Timing diagram over clock cycles 1 to 6: the Add and Load instructions are fetched together, followed by the Subtract and Store instructions; each instruction passes through the F, D, C, and W stages, with the Load and Store instructions also using an M stage.]

As instructions complete execution in each unit, the register file allows two results to be written in the same cycle because the destination registers are different.

6.9.1 Branches and Data Dependencies

In the absence of any branch instructions and any data dependencies between instructions, throughput is maximized by interleaving instructions that can be dispatched simultaneously to different execution units. However, programs contain branch instructions that change the execution flow, and data dependencies between instructions that impose sequential ordering constraints. A superscalar processor must ensure that instructions are executed in the proper sequence. Furthermore, memory delays due to cache misses may occasionally stall the fetching and dispatching of instructions. As a result, actual throughput is typically below the maximum that is possible. The challenges presented by branch instructions and data dependencies can be addressed with additional hardware. We first consider branch instructions and then consider the issues stemming from data dependencies.

The fetch unit handles branch instructions as it determines which instructions to place in the queue for dispatching. It must determine both the branch decision and the target for each branch instruction. The branch decision may depend on the result of an earlier instruction that is either still queued or newly dispatched. Stalling the fetch unit until the result is available can significantly reduce the throughput and is therefore not a desirable approach. Instead, it is better to employ branch prediction. Since the aim is to achieve high throughput, prediction is also combined with a technique called speculative execution. In this technique, subsequent instructions based on an unconfirmed prediction are fetched, dispatched, and possibly executed, but are labeled as being speculative so that they and their results may be discarded if the prediction is incorrect. Additional hardware is required to maintain information about speculatively executed instructions and to ensure that registers or memory locations are not modified until the validity of the prediction is confirmed.


Additional hardware is also needed to ensure that the correct instructions are fetched and dispatched in the event of misprediction.

Data dependencies between instructions impose ordering constraints. A simple approach is to dispatch dependent instructions in sequence to the same execution unit, where their order would be preserved. However, dependent instructions may be dispatched to different execution units. For example, the result of a Load instruction dispatched to the Load/Store unit in Figure 6.13 may be needed by an Add instruction dispatched to the arithmetic unit. Because the units operate independently and because other instructions may have already been dispatched to them, there is no guarantee as to when the result needed by the Add instruction is generated by the Load instruction. A mechanism is needed to ensure that a dependent instruction waits for its operands to become available. When an instruction is dispatched to an execution unit, it is buffered until all necessary results from other instructions have been generated. Such buffers are called reservation stations, and they are used to hold information and operands relevant to each dispatched instruction. Results from each execution unit are broadcast to all reservation stations, with each result tagged with a register identifier. This enables the reservation stations to recognize a result on which a buffered instruction depends. When there is a matching tag, the hardware copies the result into the reservation station containing the instruction. The control circuit begins the execution of a buffered instruction only when it has all of its operands.
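The tag-matching behavior described above can be sketched in software. This is a minimal illustrative model, not a hardware description from the text; the class and field names are assumptions.

```python
# Minimal sketch of a reservation station: a dispatched instruction holds
# operand values where available, or register tags it is waiting on; a
# broadcast result with a matching tag supplies the missing operand.

class ReservationStation:
    def __init__(self, op, operands):
        # operands: list of ('value', v) or ('tag', register_name) pairs
        self.op = op
        self.operands = list(operands)

    def ready(self):
        # Execution may begin only when every operand is an actual value.
        return all(kind == 'value' for kind, _ in self.operands)

    def broadcast(self, tag, value):
        # A result broadcast from an execution unit, tagged with a register id,
        # replaces any matching tag entry with the actual value.
        self.operands = [('value', value) if kind == 'tag' and t == tag
                         else (kind, t)
                         for kind, t in self.operands]

# An Add waiting on the result of a Load that writes R5:
rs = ReservationStation('Add', [('tag', 'R5'), ('value', 100)])
was_ready = rs.ready()        # False: R5 is still outstanding
rs.broadcast('R5', 42)        # the Load completes and broadcasts its result
now_ready = rs.ready()        # True: the buffered Add can begin execution
```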

In a superscalar processor using multiple-issue, the detrimental effect of stalls becomes even more pronounced than in a single-issue pipelined processor. The compiler can avoid many stalls through judicious selection and ordering of instructions. For example, for the processor in Figure 6.13, the compiler should strive to interleave arithmetic and memory instructions. This enables the dispatch unit to keep both units busy most of the time.

6.9.2 Out-of-Order Execution

The instructions in Figure 6.14 are dispatched in the same order as they appear in the program. However, their execution may be completed out of order. For example, the Subtract instruction writes to register R7 in the same cycle as the Load instruction that was fetched earlier writes to register R5. If the memory access for the Load instruction requires more than one cycle to complete, execution of the Subtract instruction would be completed before the Load instruction. Does this type of situation lead to problems?

We have already discussed the issues arising from dependencies among instructions. For example, if an instruction Ij+1 depends on the result of instruction Ij, the execution of Ij+1

will be delayed if the result is not available when it is needed. As long as such dependencies are handled correctly, there is no reason to delay the execution of an unrelated instruction. If there is no dependency between a pair of instructions, the order in which execution is completed does not matter.

However, a new complication arises when we consider the possibility of an instruction causing an exception. For example, the Load instruction in Figure 6.14 may attempt an illegal unaligned memory access for a data operand. By the time this illegal operation is recognized, the Subtract instruction that is fetched after the Load instruction may have already modified its destination register. Program execution is now in an inconsistent


state. The instruction that caused the exception in the original sequence is identified, but a succeeding instruction in that sequence has been executed to completion. If such a situation is permitted, the processor is said to have imprecise exceptions.

The alternative of precise exceptions requires additional hardware. To guarantee a consistent state when exceptions occur, the results of the execution of instructions must be written into the destination locations strictly in program order. This means that we must delay writing into register R7 for the Subtract instruction in Figure 6.14 until after register R5 for the Load instruction has been updated. Either the arithmetic unit in Figure 6.13 must retain the result of the Subtract instruction, or the result must be buffered in a temporary register until preceding instructions have written their results. If an exception occurs during the execution of an instruction, all subsequent instructions and their buffered results are discarded.

It is easier to provide precise exceptions in the case of external interrupts. When an external interrupt is received, the dispatch unit stops reading new instructions from the instruction queue, and the instructions remaining in the queue are discarded. All instructions whose execution is pending continue to completion. At this point, the processor and all its registers are in a consistent state, and interrupt processing can begin.

6.9.3 Execution Completion

To improve performance, an execution unit should be allowed to execute any instructions whose operands are ready in its reservation station. This may lead to out-of-order execution of instructions. However, instructions must be completed in program order to allow precise exceptions. These seemingly conflicting requirements can be resolved if execution is allowed to proceed out of order, but the results are written into temporary registers. The contents of these registers are later transferred to the permanent registers in correct program order. This last step is often called the commitment step, because the effect of an instruction cannot be reversed after that point. If an instruction causes an exception, the results of any subsequent instructions that have been executed would still be in temporary registers and can be safely discarded. Results that would normally be written to memory would also be buffered temporarily, and they can be safely discarded as well.

A temporary register that is assigned for the result of an instruction assumes the role of the permanent register whose data it is holding. Its contents are forwarded to any subsequent instruction that refers to the original permanent register during that period. This technique is called register renaming. There may be as many temporary registers as there are permanent registers, or there may be fewer temporary registers that are allocated as needed for association with different permanent registers.
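Register renaming can be sketched with a simple mapping table that redirects readers of a permanent register to its most recent temporary. This is an illustrative software model only; all names are assumptions.

```python
# Minimal sketch of register renaming: each new result is assigned a
# temporary register, and later readers of the permanent (architectural)
# register are redirected to the most recently assigned temporary.

class RenameTable:
    def __init__(self):
        self.map = {}        # permanent register -> current temporary
        self.next_temp = 0

    def rename_source(self, reg):
        # A source operand reads the newest temporary, if one exists.
        return self.map.get(reg, reg)

    def rename_dest(self, reg):
        # A destination is given a fresh temporary register.
        temp = f'T{self.next_temp}'
        self.next_temp += 1
        self.map[reg] = temp
        return temp

rt = RenameTable()
dest = rt.rename_dest('R7')      # Subtract's result R7 is mapped to T0
src = rt.rename_source('R7')     # a later reader of R7 is forwarded T0
```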

When out-of-order execution is allowed, a special control unit is needed to guarantee in-order commitment. This is called the commitment unit. It uses a separate queue called the reorder buffer to determine which instruction(s) should be committed next. Instructions are entered in the queue strictly in program order as they are dispatched for execution. When an instruction reaches the head of this queue and the execution of that instruction has been completed, the corresponding results are transferred from the temporary registers to the permanent registers and the instruction is removed from the queue. All resources that were assigned to the instruction, including the temporary registers, are released. The instruction


is said to have been retired at this point. Because an instruction is retired only when it is at the head of the queue, all instructions that were dispatched before it must also have been retired. Hence, instructions may complete execution out of order, but they are retired in program order.
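The in-order retirement rule can be sketched with a simple queue model of the reorder buffer; the class and method names are illustrative assumptions, not hardware terminology from the text.

```python
# Minimal sketch of in-order retirement with a reorder buffer: instructions
# enter in program order at dispatch, may finish out of order, and are
# retired only from the head of the queue.

from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.queue = deque()          # entries kept in program order

    def dispatch(self, name):
        entry = {'name': name, 'done': False}
        self.queue.append(entry)
        return entry

    def retire_ready(self):
        # Commit results only while the head entry has completed execution.
        retired = []
        while self.queue and self.queue[0]['done']:
            retired.append(self.queue.popleft()['name'])
        return retired

rob = ReorderBuffer()
load = rob.dispatch('Load R5')
sub = rob.dispatch('Subtract R7')
sub['done'] = True                    # Subtract finishes first (out of order)
first = rob.retire_ready()            # nothing retires: the Load at the head is not done
load['done'] = True
rest = rob.retire_ready()             # Load retires first, then Subtract
```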

6.9.4 Dispatch Operation

We now return to the dispatch operation. When dispatching decisions are made, the dispatch unit must ensure that all the resources needed for the execution of an instruction are available. For example, since the results of an instruction may have to be written in a temporary register, there should be one available, and it is reserved for use by that instruction as a part of the dispatch operation. There must be space available in the reservation station of an appropriate execution unit. Finally, a location in the reorder buffer for later commitment of results must also be available for the instruction. When all the resources needed are assigned, the instruction is dispatched.

Should instructions be dispatched out of order? For example, the dispatch of the Load instruction in Figure 6.14 may be delayed because there is no space in the reservation station of the Load/Store unit as a result of a cache miss in a previously dispatched instruction. Should the Subtract instruction be dispatched instead? In principle this is possible, provided that all the resources needed by the Load instruction, including a place in the reorder buffer, are reserved for it. This is essential to ensure that all instructions are ultimately retired in the correct order and that no deadlocks occur.

A deadlock is a situation that can arise when two units, A and B, use a shared resource. Suppose that unit B cannot complete its operation until unit A completes its operation. At the same time, unit B has been assigned a resource that unit A needs. If this happens, neither unit can complete its operation. Unit A is waiting for the resource it needs, which is being held by unit B. At the same time, unit B is waiting for unit A to finish before it can complete its operation and release that resource.

As an example of a deadlock when dispatching instructions out of order, consider a superscalar processor that has only one temporary register. When the Subtract instruction in Figure 6.14 is dispatched before the Load instruction, the temporary register is reserved for it. The Load instruction cannot be dispatched because it is waiting for the same temporary register, which, in turn, will not become free until the Subtract instruction is retired. Since the Subtract instruction cannot be retired before the Load instruction, we have a deadlock.

To prevent deadlocks, the dispatch unit must take many factors into account. Hence, issuing instructions out of order is likely to increase the complexity of the dispatch unit significantly. It may also mean that more time is required to make dispatching decisions. Dispatching instructions in order avoids this complexity. In this case, the program order of instructions is enforced at the time instructions are dispatched and again at the time they are retired. Between these two events, the execution of several instructions across multiple execution units can proceed out of order, subject only to interdependencies among them.

A final comment on superscalar processors concerns the number of execution units. The processor in Figure 6.13 has one arithmetic unit and one Load/Store unit. For higher performance, modern superscalar processors often have two arithmetic units for integer operations, as well as a separate arithmetic unit for floating-point operations. The floating-


point unit has its own register file. Many processors also include a vector unit for integer or floating-point arithmetic, which typically performs two to eight operations in parallel. Such a unit may also have a dedicated register file. A single Load/Store unit typically supports all memory accesses to or from the register files for integer, floating-point, or vector units. To keep many execution units busy, modern processors may fetch four or more instructions at the same time to place at the tail of the instruction queue, and similarly four or more instructions may be dispatched to the execution units from the head of the instruction queue.

6.10 Pipelining in CISC Processors

The instruction set of a RISC processor makes pipelining relatively easy to implement. All instructions are one word in size, and operand information is typically located in the same position within a word for different instructions. No instruction requires more than one memory operand. Only Load and Store instructions access memory operands, typically using only indexed addressing. All other instructions operate on register operands. The five-stage pipeline described in this chapter is tailored for these characteristics of RISC-style instructions.

For pipelining in CISC processors, complications arise due to instructions that are variable in size, have multiple memory operands and complex addressing modes, and use condition codes. Instructions that occupy more than one word may take several cycles to fetch. Furthermore, variability in instruction size and format complicates both decoding and operand access, as well as management of the dispatch queue in a superscalar processor.

The availability of more complex addressing modes such as Autoincrement or Autodecrement introduces side effects when executing instructions. A side effect occurs when a location other than that of the destination operand is also affected. For example, the instruction

Move R5, (R8)+

has a side effect. Not only is the destination register R5 affected, but source register R8 is also affected by the autoincrement operation. Should a later instruction depend on the value in register R8, this dependency must be handled with additional hardware in the same manner as a dependency involving the destination register, R5. It may require stalling the pipeline or forwarding the new value. In a superscalar processor, such a dependency requires the use of temporary registers and register renaming as discussed in Section 6.9.3.
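The extra dependency introduced by the autoincrement side effect can be illustrated with a small sketch of the write-set that hazard-detection logic must consider; the function and its instruction encoding are hypothetical.

```python
# Sketch of why autoincrement complicates hazard detection: the address
# register must be treated as a second destination when checking for
# dependencies between instructions.

def written_registers(op, dest, src, autoincrement=False):
    """Registers modified by an instruction, including side effects."""
    written = {dest}
    if autoincrement:
        written.add(src)      # e.g. Move R5, (R8)+ also updates R8
    return written

# A later instruction reading R8 depends on this Move through its side effect.
regs = written_registers('Move', 'R5', 'R8', autoincrement=True)
```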

Condition codes also introduce side effects. For example, in the sequence of instructions

Compare R7, R8
Branch>0 TARGET

the result of the Compare instruction affects the condition code flags as a side effect. The Branch instruction, in turn, implicitly depends on this side effect. A condition code register can be included with relative ease in a simple pipeline such as the one shown in Figure 6.2, because only one ALU operation is performed in any cycle. However, in a superscalar processor with multiple execution units, many instructions may be in various


stages of execution, and two or more ALU operations may be performed in each cycle. Dependencies arising from side effects related to the condition codes require the use of additional temporary registers and register renaming.

Finally, consider the following sequence of CISC-style instructions:

Move (R2), (R3)
Move (R4), R5

The first Move instruction requires two operand accesses to the memory, while the second Move instruction requires only one. Executing these instructions in a pipeline such as the one in Figure 6.2 requires additional hardware to stall the second Move instruction so that the first Move instruction can complete its two operand accesses to the memory. In a superscalar processor such as the one in Figure 6.13, the Load/Store unit must similarly stall its internal pipeline.

CISC-style instructions complicate pipelining. This was one of the main reasons for developing the RISC approach. Nonetheless, pipelined processors have been implemented for CISC-style instruction sets, which were initially introduced before the widespread use of pipelining. Examples include processors based on the ColdFire and Intel instruction sets discussed in Appendices C and E. ColdFire processors are primarily intended for embedded applications, while Intel processors serve general-purpose needs. Consequently, the extent to which pipelining is used in ColdFire processors is less than that in Intel processors.

6.10.1 Pipelining in ColdFire Processors

ColdFire processor implementations labeled as versions V1 and V2 have two pipelines in series with a first-in first-out (FIFO) buffer between them. A two-stage instruction fetch pipeline prefetches instructions into the buffer. This buffer then feeds a two-stage pipeline that executes instructions. Instructions that involve register-only or register-to-memory operations pass once through the two execution stages. Instructions that involve memory-to-register or memory-to-memory operations must make two passes through the execution stages.

Later versions of ColdFire processor implementations use a similar buffer arrangement between two pipelines, but they incorporate various enhancements for higher performance. For example, the instruction fetch pipeline in version V4 is extended to four stages and includes branch prediction. The execution pipeline is extended to five stages. The early stages are used for address calculation, and the later stages are used for arithmetic/logic operations. This separation of functions enables a limited form of superscalar processing. In certain cases, a Move instruction and another instruction can be issued to the execution pipeline in the same cycle. Version V5 implementations have two distinct execution pipelines based on the V4 organization. They provide true superscalar processing.

6.10.2 Pipelining in Intel Processors

Intel processors achieve high performance with superscalar execution and deep pipelines. For example, the Core 2 and Core i7 architectures have a multiple-issue width of four


instructions and a 14-stage pipeline. Branch prediction, register renaming, out-of-order execution, and other techniques are used.

To reduce internal complexity, CISC-style instructions are dynamically converted by the hardware into simpler RISC-style micro-operations. These micro-operations are then issued to the execution units to complete the tasks specified by the original CISC-style instructions. This approach preserves code compatibility while making it possible to use the aggressive performance enhancement techniques that have been developed for RISC-style instruction sets. In some cases, micro-operations are fused back together into macro-operations for more efficient handling. For example, in a program containing original CISC-style instructions, a comparison instruction that affects condition codes is often followed by a branch instruction. The hardware may initially convert the comparison and branch instructions into separate micro-operations, but would then fuse them into a combined compare-and-branch operation, whose function reflects what is typically found in a RISC-style instruction set.

6.11 Concluding Remarks

Two important features for performance enhancement have been introduced in this chapter, pipelining and multiple-issue. Pipelining enables processors to have instruction throughput approaching one instruction per clock cycle. Multiple-issue combined with pipelining makes possible superscalar operation, with instruction throughput of several instructions per clock cycle.

The potential gain in performance can only be realized by careful attention to threeaspects:

• The instruction set of the processor
• The design of the pipeline hardware
• The design of the associated compiler

It is important to appreciate that there are strong interactions among all three aspects. High performance is critically dependent on the extent to which these interactions are taken into account in the design of a processor. Instruction sets that are particularly well-suited for pipelined execution are key features of modern processors.

There are many sources that provide additional details on the topics presented in this chapter. Reference [1] covers pipelining and Reference [2] covers superscalar processors.

6.12 Examples of Solved Problems

This section presents some examples of the types of problems that a student may be asked to solve, and shows how such problems can be solved.


Example 6.1

Problem: Consider the pipelined execution of the following sequence of instructions:

Add R4, R3, R2
Or R7, R6, R5
Subtract R8, R7, R4

Initially, registers R2 and R3 contain 4 and 8, respectively. Registers R5 and R6 contain 128 and 2, respectively. Assume that the pipeline provides forwarding paths to the ALU from registers RY and RZ in Figure 5.8. The first instruction is fetched in cycle 1, and the remaining instructions are fetched in successive cycles.

Draw a diagram similar to Figure 6.1 to show the pipelined execution of these instructions assuming that the processor uses operand forwarding. Then, with reference to Figure 5.8, describe the contents of registers RY and RZ during cycles 4 to 7.

Solution: There are data dependencies involving registers R4 and R7. The Subtract instruction needs the new values for these registers before they are written to the register file. Hence, those values need to be forwarded to the ALU inputs when the Subtract instruction is in the Compute stage of the pipeline. Figure 6.15 shows the execution with forwarding. One arrow represents the new value of register R7 being forwarded from register RZ, and the other arrow represents the new value of register R4 being forwarded from register RY.

As for the contents of registers RY and RZ during cycles 4 to 7, the following description provides the answer.

• Using the initial values for registers R2 and R3, the Add instruction generates the result of 12 in cycle 3. That result is available in register RZ during cycle 4. The value in register RY during cycle 4 is the result for the unspecified instruction preceding the Add instruction.

• In cycle 4, the Or instruction generates the result of 130. That result is placed in register RZ to be available during cycle 5. The result of 12 for the Add instruction is in register RY during cycle 5.

• In cycle 5, the Subtract instruction is in the Compute stage. To generate a correct result, forwarding is used to provide the value of 130 in register RY and the value of 12 in register RZ. The result from the ALU is 130 − 12 = 118. This result is available in register RZ during cycle 6. The result of the Or instruction, 130, is in register RY during cycle 6.

Figure 6.15 Pipelined execution of instructions for Example 6.1. (The diagram shows the F, D, C, M, and W stages of the Add, Or, and Subtract instructions over clock cycles 1 to 7, with arrows indicating the forwarded operands.)

• In cycle 6, the Subtract instruction is in the Memory stage. The unspecified instruction following the Subtract instruction is generating a result in the Compute stage. In cycle 7, the result of the unspecified instruction is in register RZ, and the result of the Subtract instruction is in register RY.
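The register traces derived in this solution can be checked with a short script. The Python sketch below is illustrative only; it models just the RY and RZ timing of the five-stage pipeline in Figure 5.8, under the assumption that a result computed in cycle c is in RZ during cycle c + 1 and in RY during cycle c + 2.

```python
# A small sketch that reproduces the register traces from Example 6.1.
# Each instruction produces its ALU result in its Compute-stage cycle;
# the result appears in RZ the following cycle and moves to RY the
# cycle after that, matching the five-stage pipeline of Figure 5.8.
regs = {"R2": 4, "R3": 8, "R5": 128, "R6": 2}

# (destination, function, Compute-stage cycle) for the three instructions
program = [
    ("R4", lambda r: r["R3"] + r["R2"], 3),   # Add R4, R3, R2
    ("R7", lambda r: r["R6"] | r["R5"], 4),   # Or R7, R6, R5
    ("R8", lambda r: r["R7"] - r["R4"], 5),   # Subtract R8, R7, R4
]

rz_trace, ry_trace = {}, {}
for dest, func, cycle in program:
    result = func(regs)
    regs[dest] = result            # forwarding: usable by later instructions
    rz_trace[cycle + 1] = result   # in RZ during the cycle after Compute
    ry_trace[cycle + 2] = result   # in RY one cycle later

print(rz_trace)  # {4: 12, 5: 130, 6: 118}
print(ry_trace)  # {5: 12, 6: 130, 7: 118}
```

The printed traces match the cycle-by-cycle description in the solution: 12, 130, and 118 pass through RZ in cycles 4 to 6 and through RY in cycles 5 to 7.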

Example 6.2

Problem: Assume that 20 percent of the dynamic count of the instructions executed for a program are branch instructions. There are no pipeline stalls due to data dependencies. Static branch prediction is used with a not-taken assumption.

(a) Determine the execution times for two cases: when 30 percent of the branches are taken, and when 70 percent of the branches are taken.

(b) Determine the speedup for one case relative to the other. Express the speedup as a percentage relative to 1.

Solution: Section 6.8.1 describes the calculation of δbranch_penalty to consider the effect of branch penalties.

(a) The value of δbranch_penalty is 0.20 × 0.30 = 0.06 for the first case and 0.20 × 0.70 = 0.14 for the second case. Using S = 1 + δbranch_penalty, the execution time for the first case is (1.06 × N)/R and (1.14 × N)/R for the second case.

(b) Because the execution time for the first case is smaller, the performance improvement as a speedup percentage is

(1.14/1.06 − 1) × 100 = 7.5 percent
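The arithmetic in this example can be verified with a few lines of code. The sketch below assumes the model of Section 6.8.1, with a one-cycle penalty per taken branch; the function name `stall_factor` is invented for this illustration.

```python
# A direct check of Example 6.2, assuming the model of Section 6.8.1:
# execution time T = ((1 + delta) * N) / R, where delta is the average
# branch penalty in cycles per instruction.

def stall_factor(branch_fraction, taken_fraction, penalty_cycles=1):
    """delta_branch_penalty for static not-taken prediction."""
    return branch_fraction * taken_fraction * penalty_cycles

delta_30 = stall_factor(0.20, 0.30)   # 30 percent of branches taken
delta_70 = stall_factor(0.20, 0.70)   # 70 percent of branches taken

print(round(1 + delta_30, 2))   # 1.06 -> execution time (1.06 * N) / R
print(round(1 + delta_70, 2))   # 1.14 -> execution time (1.14 * N) / R

speedup_percent = ((1 + delta_70) / (1 + delta_30) - 1) * 100
print(round(speedup_percent, 1))   # 7.5
```

The same function can be reused for other branch mixes, for example to explore how the speedup changes as the taken fraction varies.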

Problems

6.1 [M] Consider the following instructions at the given addresses in the memory:

1000 Add R3, R2, #20
1004 Subtract R5, R4, #3
1008 And R6, R4, #0x3A
1012 Add R7, R2, R4


Initially, registers R2 and R4 contain 2000 and 50, respectively. These instructions are executed in a computer that has a five-stage pipeline as shown in Figure 6.2. The first instruction is fetched in clock cycle 1, and the remaining instructions are fetched in successive cycles.

(a) Draw a diagram similar to Figure 6.1 that represents the flow of the instructions through the pipeline. Describe the operation being performed by each pipeline stage during clock cycles 1 through 8.

(b) With reference to Figures 5.8 and 5.9, describe the contents of registers IR, PC, RA, RB, RY, and RZ in the pipeline during cycles 2 to 8.

6.2 [M] Repeat Problem 6.1 for the following program:

1000 Add R3, R2, #20
1004 Subtract R5, R4, #3
1008 And R6, R3, #0x3A
1012 Add R7, R2, R4

Assume that the pipeline provides forwarding paths to the ALU from registers RY and RZ in Figure 5.8 and that the processor uses forwarding of operands.

6.3 [M] Consider the loop in the program of Figure 2.8. Assume it is executed in a five-stage pipeline with forwarding paths to the ALU from registers RY and RZ in Figure 5.8. Assume that the pipeline uses static branch prediction with a not-taken assumption. Draw a diagram similar to Figure 6.1 for the execution of two successive iterations of the loop.

6.4 [D] Repeat Problem 6.3, but first reorder the instructions to optimize performance as the compiler would do.

6.5 [D] Repeat Problem 6.3 for a pipeline that uses delayed branching with one delay slot. Reorder the instructions as needed to improve performance.

6.6 [M] The forwarding path in Figure 6.5 allows the contents of register RZ to be used directly in an ALU operation. The result of that operation is stored in register RZ, replacing its previous contents. This problem involves tracing the contents of register RZ over multiple cycles. Consider the two instructions

I1: Add R3, R2, R1
I2: LShiftL R3, R3, #1

While instruction I1 is being fetched in cycle 1, a previously fetched instruction is performing an ALU operation that gives a result of 17. Then, while instruction I1 is being decoded in cycle 2, another previously fetched instruction is performing an ALU operation that gives a result of 198. Also during cycle 2, registers R1, R2, and R3 contain the values 30, 100, and 45, respectively. Using this information, draw a timing diagram that shows the contents of register RZ during cycles 2 to 5.

6.7 [M] Assume that 20 percent of the dynamic count of the instructions executed for a program are branch instructions. Delayed branching is used, with one delay slot. Assume that there are no stalls caused by other factors. First, derive an expression for the execution time in cycles if all delay slots are filled with NOP instructions. Then, derive another expression that reflects the execution time with 70 percent of delay slots filled with useful instructions by the optimizing compiler. From these expressions, determine the compiler's contribution to the increase in performance, expressed as a speedup percentage.

6.8 [D] Repeat Problem 6.7, but this time for a pipelined processor with two branch delay slots. The output from the optimizing compiler is such that the first delay slot is filled with a useful instruction 70 percent of the time, but the second slot is filled with a useful instruction only 10 percent of the time.

Compare the compiler-optimized execution time for this case with the compiler-optimized execution time for Problem 6.7. Assume that the two processors have the same clock rate. Indicate which processor/compiler combination is faster, and determine the speedup percentage by which it is faster.

6.9 [D] Assume that 20 percent of the dynamic count of the instructions executed for a program are branch instructions. Assume further that 75 percent of branches are actually taken. The program is executed in two different processors that have the same clock rate. One uses static branch prediction with the assume-not-taken approach. The other uses dynamic branch prediction based on the states in Figure 6.12a. The branch target buffer is used in the manner described in Section 6.6.4.

(a) With no pipeline stalls due to other causes, what must be the minimum prediction accuracy for the processor using dynamic branch prediction to perform at least as well as the processor using static branch prediction?

(b) If the dynamic prediction accuracy is actually 90 percent, what is the speedup relative to using static prediction?

6.10 [M] Additional control logic is required in the pipeline to forward the value of register RZ as shown in Figure 6.5. What specific conditions must this additional logic check to determine the settings of the multiplexers feeding the ALU inputs in the Compute stage of the pipeline?

6.11 [M] Repeat Problem 6.10 for the specific conditions related to forwarding of the contents of register RY in Figure 5.8 to the multiplexers feeding the inputs of the ALU.

6.12 [D] As a continuation of Problems 6.10 and 6.11, consider the following sequence of instructions:

Add R3, R2, R1
Subtract R3, R5, R4
Or R8, R3, #1

Describe the manner in which forwarding must be handled for this situation. How should the conditions developed in Problems 6.10 and 6.11 be modified?

6.13 [M] Consider a program that consists of four memory-access instructions and four arithmetic instructions. Assume that there are no data dependencies between the instructions. Two versions of this program are executed on the superscalar processor shown in Figure 6.13. The first version has the four memory-access instructions in sequence, followed by the four arithmetic instructions. The second version has the memory-access instructions interleaved with the arithmetic instructions. Draw two diagrams similar to Figure 6.14 to compare the execution of these two versions of the program.

6.14 [E] Assume that a program contains no branch instructions. It is executed on the superscalar processor shown in Figure 6.13. What is the best execution time in cycles that can be expected if the mix of instructions consists of 75 percent arithmetic instructions and 25 percent memory-access instructions? How does this time compare to the best execution time on the simpler processor in Figure 6.2 using the same clock?

6.15 [M] Repeat Problem 6.14 to find the best possible execution times for the processors in Figures 6.2 and 6.13, assuming that the mix of instructions consists of 15 percent branch instructions that are never taken, 65 percent arithmetic instructions, and 20 percent memory-access instructions. Assume a prediction accuracy of 100 percent for all branch instructions.

6.16 [E] Consider a processor that uses the branch prediction scheme represented in Figure 6.12b. The instruction set for the processor is enhanced with a feature that enables the compiler to specify the initial prediction state as either LT or LNT for each branch instruction. This initial state is used by the processor at execution time when information about the branch instruction is not found in the branch target buffer. Discuss how the compiler should use this feature when generating code for the following cases:

(a) A loop with a conditional branch instruction at the end to branch to the start of the loop

(b) A loop with a conditional branch at the beginning of the loop to exit the loop, and an unconditional branch at the end of the loop to branch to the start

6.17 [M] Assume that a processor has the feature described in Problem 6.16 for specifying the initial prediction state for branch instructions. Consider a statement of the form

IF A>B THEN A = A + 1 ELSE B = B + 1

(a) Generate assembly-language code for the statement above.

(b) In the absence of any other information, discuss how the compiler should specify the initial prediction state for the branch instructions in the assembly code.

(c) A study of the execution behavior of the program containing the above statement reveals that the value of variable A is often larger than the value of variable B. If this information is made available to the compiler, discuss how it would influence the initial prediction state for the branch instructions.

6.18 [M] Consider a statement of the form

IF A>B THEN A = A + 1 ELSE B = B + 1

(a) Consider a processor that has the pipelined organization shown in Figure 6.2, with static branch prediction that uses a not-taken assumption. Write assembly-language code for the statement above. Draw diagrams similar to Figure 6.1 to show the pipelined execution of the instructions for different branch decisions and determine the execution times in cycles.

(b) Now assume that delayed branching is used. Write assembly-language code for the statement above. Draw diagrams to show the pipelined execution of the instructions for different branch decisions and compare the execution times in cycles with the times for the previous case.


References

1. D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 4th edition, Morgan Kaufmann, Burlington, Massachusetts, 2009.

2. J. P. Shen and M. H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, New York, 2005.


Chapter 7
Input/Output Organization

Chapter Objectives

In this chapter you will learn about:

• Hardware needed to access I/O devices

• Synchronous and asynchronous bus operation

• Interface circuits

• Commercial standards, such as USB, SAS, and PCI Express


One of the basic features of a computer is its ability to transfer data to and from I/O devices. This communication capability enables a human operator, for example, to use a keyboard and a display screen to process text and graphics. We make extensive use of computers to communicate with other computers over the Internet and access information around the globe. In embedded applications, computers are less visible but equally important. They are an integral part of home appliances, manufacturing equipment, vehicle systems, cell phones, and banking and point-of-sale terminals. In such applications, input to a computer may come from a touch panel, a sensor switch, a digital camera, a microphone, or a fire alarm. Output may be characters or numbers to be displayed, a sound signal to be sent to a speaker, or a digitally-coded command to change the speed of a motor, open a valve, or cause a robot to move in a specified manner.

A computer should have the ability to exchange information with a wide variety of devices. In many cases, the processor is fully involved in these exchanges. However, data transfers may also take place directly between I/O devices, such as magnetic hard disks, and the main memory, with only minimal involvement of the processor. This possibility will be explored in the next chapter on the memory system.

Chapter 3 presents the programmer's view of input/output data transfers that take place between the processor and the registers in I/O device interfaces. In this chapter, we discuss the details of the hardware needed to make such transfers possible.

An interconnection network is used to transfer data among the processor, memory, and I/O devices. We describe below a commonly used interconnection network called a bus.

7.1 Bus Structure

The bus shown in Figure 7.1 is a simple structure that implements the interconnection network in Figure 3.1. Only one source/destination pair of units can use this bus to transfer data at any one time.

The bus consists of three sets of lines used to carry address, data, and control signals. I/O device interfaces are connected to these lines, as shown in Figure 7.2 for an input device. Each I/O device is assigned a unique set of addresses for the registers in its interface. When the processor places a particular address on the address lines, it is examined by the address decoders of all devices on the bus. The device that recognizes this address responds to the commands issued on the control lines. The processor uses the control lines to request either a Read or a Write operation, and the requested data are transferred over the data lines.

Figure 7.1 A single-bus structure. (The processor, memory, and I/O devices 1 through n are all connected to a single bus.)

Figure 7.2 I/O interface for an input device. (The interface consists of an address decoder, data/status/control registers, and control circuits, connected to the address, data, and control lines of the bus.)

When I/O devices and the memory share the same address space, the arrangement is called memory-mapped I/O, as described in Section 3.1. Any machine instruction that can access memory can be used to transfer data to or from an I/O device. For example, if the input device in Figure 7.2 is a keyboard and if DATAIN is its data register, the instruction

Load R2, DATAIN

reads the data from DATAIN and stores them into processor register R2. Similarly, the instruction

Store R2, DATAOUT

sends the contents of register R2 to location DATAOUT, which may be the data register of a display device interface. The status and control registers contain information relevant to the operation of the I/O device. The address decoder, the data and status registers, and the control circuitry required to coordinate I/O transfers constitute the device's interface circuit.
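The memory-mapped I/O arrangement just described can be sketched in software. In the toy model below, the register addresses, the `Bus` class, and the device behavior are all invented for illustration; a real interface circuit implements the same address-decoding idea in hardware.

```python
# A sketch of memory-mapped I/O: device registers occupy ordinary
# addresses, so the same Load/Store mechanism reaches both memory and
# devices. The addresses DATAIN and DATAOUT are illustrative only.

DATAIN, DATAOUT = 0x4000, 0x4004   # assumed I/O register addresses

class Bus:
    def __init__(self):
        self.memory = {}               # ordinary RAM locations
        self.keyboard_data = ord("A")  # input device data register
        self.display_data = None       # output device data register

    def read(self, address):
        # the "address decoder" of each interface watches the address lines
        if address == DATAIN:
            return self.keyboard_data
        return self.memory.get(address, 0)

    def write(self, address, value):
        if address == DATAOUT:
            self.display_data = value  # display interface claims this address
        else:
            self.memory[address] = value

bus = Bus()
r2 = bus.read(DATAIN)        # Load R2, DATAIN
bus.write(DATAOUT, r2)       # Store R2, DATAOUT
print(chr(bus.display_data))  # A
```

The key design point is that the processor issues the same Read and Write requests in both cases; only the address decoders decide whether memory or a device interface responds.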

7.2 Bus Operation

A bus requires a set of rules, often called a bus protocol, that govern how the bus is used by various devices. The bus protocol determines when a device may place information on the bus, when it may load the data on the bus into one of its registers, and so on. These rules are implemented by control signals that indicate what and when actions are to be taken.


One control line, usually labelled R/W, specifies whether a Read or a Write operation is to be performed. As the label suggests, it specifies Read when set to 1 and Write when set to 0. When several data sizes are possible, such as byte, halfword, or word, the required size is indicated by other control lines. The bus control lines also carry timing information. They specify the times at which the processor and the I/O devices may place data on or receive data from the data lines. A variety of schemes have been devised for the timing of data transfers over a bus. These can be broadly classified as either synchronous or asynchronous schemes.

In any data transfer operation, one device plays the role of a master. This is the device that initiates data transfers by issuing Read or Write commands on the bus. Normally, the processor acts as the master, but other devices may also become masters as we will see in Section 7.3. The device addressed by the master is referred to as a slave.

7.2.1 Synchronous Bus

On a synchronous bus, all devices derive timing information from a control line called the bus clock, shown at the top of Figure 7.3. The signal on this line has two phases: a high level followed by a low level. The two phases constitute a clock cycle. The first half of the cycle between the low-to-high and high-to-low transitions is often referred to as a clock pulse.

The address and data lines in Figure 7.3 are shown as if they are carrying both high and low signal levels at the same time. This is a common convention for indicating that some lines are high and some low, depending on the particular address or data values being transmitted. The crossing points indicate the times at which these patterns change. A signal line at a level half-way between the low and high signal levels indicates periods during which the signal is unreliable, and must be ignored by all devices.

Figure 7.3 Timing of an input transfer on a synchronous bus. (The bus clock, the address and command lines, and the data lines are shown over one clock cycle, with events at times t0, t1, and t2.)

Let us consider the sequence of signal events during an input (Read) operation. At time t0, the master places the device address on the address lines and sends a command on the control lines indicating a Read operation. The command may also specify the length of the operand to be read. Information travels over the bus at a speed determined by its physical and electrical characteristics. The clock pulse width, t1 − t0, must be longer than the maximum propagation delay over the bus. Also, it must be long enough to allow all devices to decode the address and control signals, so that the addressed device (the slave) can respond at time t1 by placing the requested input data on the data lines. At the end of the clock cycle, at time t2, the master loads the data on the data lines into one of its registers. To be loaded correctly into a register, data must be available for a period greater than the setup time of the register (see Appendix A). Hence, the period t2 − t1 must be greater than the maximum propagation time on the bus plus the setup time of the master's register.

A similar procedure is followed for a Write operation. The master places the output data on the data lines when it transmits the address and command information. At time t2, the addressed device loads the data into its data register.
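The timing constraints just derived are easy to check numerically. The delay values in the sketch below are invented examples; the point is only that the clock period of a synchronous bus must cover the worst-case propagation, decode, and setup delays.

```python
# A numeric check of the synchronous-bus timing constraints in the text:
# t1 - t0 must exceed the bus propagation delay plus the decode time,
# and t2 - t1 must exceed the propagation delay plus the register setup
# time. All delay values below are made-up examples in nanoseconds.

prop_delay  = 4.0   # maximum propagation delay over the bus
decode_time = 3.0   # worst-case address/command decoding time
setup_time  = 1.5   # setup time of the master's input register

first_half  = prop_delay + decode_time   # minimum t1 - t0
second_half = prop_delay + setup_time    # minimum t2 - t1
min_clock_period = first_half + second_half

print(min_clock_period)   # 12.5
```

With these assumed delays the clock period cannot be shorter than 12.5 ns, which caps the bus clock at 80 MHz; a slower device interface would lower this further, illustrating why a single-cycle synchronous bus runs at the speed of its slowest device.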

The timing diagram in Figure 7.3 is an idealized representation of the actions that take place on the bus lines. The exact times at which signals change state are somewhat different from those shown, because of propagation delays on bus wires and in the circuits of the devices. Figure 7.4 gives a more realistic picture of what actually happens. It shows two views of each signal, except the clock. Because signals take time to travel from one device to another, a given signal transition is seen by different devices at different times. The top view shows the signals as seen by the master and the bottom view as seen by the slave. We assume that the clock changes are seen at the same time by all devices connected to the bus. System designers spend considerable effort to ensure that the clock signal satisfies this requirement.

The master sends the address and command signals on the rising edge of the clock at the beginning of the clock cycle (at t0). However, these signals do not actually appear on the bus until tAM, largely due to the delay in the electronic circuit output from the master to the bus lines. A short while later, at tAS, the signals reach the slave. The slave decodes the address, and at t1 sends the requested data. Here again, the data signals do not appear on the bus until tDS. They travel toward the master and arrive at tDM. At t2, the master loads the data into its register. Hence the period t2 − tDM must be greater than the setup time of that register. The data must continue to be valid after t2 for a period equal to the hold time requirement of the register (see Appendix A for hold time).

Timing diagrams often show only the simplified picture in Figure 7.3, particularly when the intent is to give the basic idea of how data are transferred. But, actual signals will always involve delays as shown in Figure 7.4.

Figure 7.4 A detailed timing diagram for the input transfer of Figure 7.3. (Each signal except the clock is shown twice, as seen by the master and as seen by the slave. The address and command signals appear on the bus at tAM and reach the slave at tAS; the data appear on the bus at tDS and reach the master at tDM.)

Multiple-Cycle Data Transfer

The scheme described above results in a simple design for the device interface. However, it has some limitations. Because a transfer has to be completed within one clock cycle, the clock period, t2 − t0, must be chosen to accommodate the longest delays on the bus and the slowest device interface. This forces all devices to operate at the speed of the slowest device.

Also, the processor has no way of determining whether the addressed device has actually responded. At t2, it simply assumes that the input data are available on the data lines in a Read operation, or that the output data have been received by the I/O device in a Write operation. If, because of a malfunction, a device does not operate correctly, the error will not be detected.

To overcome these limitations, most buses incorporate control signals that represent a response from the device. These signals inform the master that the slave has recognized its address and that it is ready to participate in a data transfer operation. They also make it possible to adjust the duration of the data transfer period to match the response speeds of different devices. This is often accomplished by allowing a complete data transfer operation to span several clock cycles. Then, the number of clock cycles involved can vary from one device to another.

An example of this approach is shown in Figure 7.5. During clock cycle 1, the master sends address and command information on the bus, requesting a Read operation. The slave receives this information and decodes it. It begins to access the requested data on the active edge of the clock at the beginning of clock cycle 2. We have assumed that due to the delay involved in getting the data, the slave cannot respond immediately. The data become ready and are placed on the bus during clock cycle 3. The slave asserts a control signal called Slave-ready at the same time. The master, which has been waiting for this signal, loads the data into its register at the end of the clock cycle. The slave removes its data signals from the bus and returns its Slave-ready signal to the low level at the end of cycle 3. The bus transfer operation is now complete, and the master may send new address and command signals to start a new transfer in clock cycle 4.

Figure 7.5 An input transfer using multiple clock cycles. (The clock, address, command, data, and Slave-ready signals are shown over four clock cycles.)

The Slave-ready signal is an acknowledgment from the slave to the master, confirming that the requested data have been placed on the bus. It also allows the duration of a bus transfer to change from one device to another. In the example in Figure 7.5, the slave responds in cycle 3. A different device may respond in an earlier or a later cycle. If the addressed device does not respond at all, the master waits for some predefined maximum number of clock cycles, then aborts the operation. This could be the result of an incorrect address or a device malfunction.
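The wait-for-Slave-ready behavior, including the timeout that guards against a nonexistent device, can be modeled with a short function. The latency values and the timeout of eight cycles below are invented for illustration.

```python
# A sketch of the multiple-cycle transfer of Figure 7.5: the master waits
# for Slave-ready, cycle by cycle, and aborts after a timeout if no slave
# responds. Slave response latencies here are invented for illustration.

def bus_read(slave_latency, data, timeout=8):
    """Return (data, cycles used), or (None, timeout) on an aborted transfer."""
    for cycle in range(1, timeout + 1):
        # the slave asserts Slave-ready once its data are on the bus
        slave_ready = slave_latency is not None and cycle >= slave_latency
        if slave_ready:
            return data, cycle    # master loads the data at the end of this cycle
    return None, timeout          # no response: abort the operation

print(bus_read(3, 0x5A))      # (90, 3)  -> slave responds in cycle 3
print(bus_read(None, 0x5A))   # (None, 8) -> nonexistent device, aborted
```

A fast device and a slow device can share the same bus: each transfer simply occupies as many cycles as that device needs, up to the timeout.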

We will now present a different approach that does not use a clock signal at all.

7.2.2 Asynchronous Bus

An alternative scheme for controlling data transfers on a bus is based on the use of a handshake protocol between the master and the slave. A handshake is an exchange of command and response signals between the master and the slave. It is a generalization of the way the Slave-ready signal is used in Figure 7.5. A control line called Master-ready is asserted by the master to indicate that it is ready to start a data transfer. The slave responds by asserting Slave-ready.

Figure 7.6 Handshake control of data transfer during an input operation. (The address/command, Master-ready, data, and Slave-ready signals are shown over one bus cycle, with events at times t0 through t5.)

A data transfer controlled by a handshake protocol proceeds as follows. The master places the address and command information on the bus. Then it indicates to all devices that it has done so by activating the Master-ready line. This causes all devices to decode the address. The selected slave performs the required operation and informs the processor that it has done so by activating the Slave-ready line. The master waits for Slave-ready to become asserted before it removes its signals from the bus. In the case of a Read operation, it also loads the data into one of its registers.

An example of the timing of an input data transfer using the handshake protocol is given in Figure 7.6, which depicts the following sequence of events:

t0—The master places the address and command information on the bus, and all devices on the bus decode this information.

t1—The master sets the Master-ready line to 1 to inform the devices that the address and command information is ready. The delay t1 − t0 is intended to allow for any skew that may occur on the bus. Skew occurs when two signals transmitted simultaneously from one source arrive at the destination at different times. This happens because different lines of the bus may have different propagation speeds. Thus, to guarantee that the Master-ready signal does not arrive at any device ahead of the address and command information, the delay t1 − t0 should be longer than the maximum possible bus skew. (Note that bus skew is a part of the maximum propagation delay in the synchronous case.) Sufficient time should be allowed for the device interface circuitry to decode the address. The delay needed can be included in the period t1 − t0.

t2—The selected slave, having decoded the address and command information, performs the required input operation by placing its data on the data lines. At the same time, it sets the Slave-ready signal to 1. If extra delays are introduced by the interface circuitry before it places the data on the bus, the slave must delay the Slave-ready signal accordingly. The period t2 − t1 depends on the distance between the master and the slave and on the delays introduced by the slave's circuitry.

t3—The Slave-ready signal arrives at the master, indicating that the input data are available on the bus. The master must allow for bus skew. It must also allow for the setup time needed by its register. After a delay equivalent to the maximum bus skew and the minimum setup time, the master loads the data into its register. Then, it drops the Master-ready signal, indicating that it has received the data.

t4—The master removes the address and command information from the bus. The delay between t3 and t4 is again intended to allow for bus skew. Erroneous addressing may take place if the address, as seen by some device on the bus, starts to change while the Master-ready signal is still equal to 1.

t5—When the device interface receives the 1-to-0 transition of the Master-ready signal, it removes the data and the Slave-ready signal from the bus. This completes the input transfer.

The timing for an output operation, illustrated in Figure 7.7, is essentially the same as for an input operation. In this case, the master places the output data on the data lines at the same time that it transmits the address and command information. The selected slave loads the data into its data register when it receives the Master-ready signal and indicates that it has done so by setting the Slave-ready signal to 1. The remainder of the cycle is similar to the input operation.

The handshake signals in Figures 7.6 and 7.7 are said to be fully interlocked, because a change in one signal is always in response to a change in the other. Hence, this scheme is known as a full handshake. It provides the highest degree of flexibility and reliability.
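The sequence of events t0 through t5 for an input transfer can be sketched as a small software model. This is only an illustrative sketch of the protocol's ordering, not a description of real bus hardware; the dictionary keys and function name are invented:

```python
# Illustrative model of the fully interlocked handshake for an input
# transfer (events t0-t5 in Figure 7.6). All names are invented.

def input_transfer(master, slave, bus):
    # t0: master places address and command information on the bus
    bus["address"] = master["address"]
    bus["command"] = "Read"
    # t1: after a delay longer than the maximum bus skew,
    #     master asserts Master-ready
    bus["Master-ready"] = 1
    # t2: the selected slave decodes the address, places its data
    #     on the data lines, and asserts Slave-ready
    if bus["address"] == slave["address"]:
        bus["data"] = slave["data"]
        bus["Slave-ready"] = 1
    # t3: master sees Slave-ready, waits for skew plus setup time,
    #     loads the data, then drops Master-ready
    assert bus["Slave-ready"] == 1
    master["register"] = bus["data"]
    bus["Master-ready"] = 0
    # t4: master removes the address and command information
    bus["address"] = bus["command"] = None
    # t5: slave sees the 1-to-0 transition of Master-ready and
    #     removes its data and the Slave-ready signal
    bus["data"] = None
    bus["Slave-ready"] = 0
    return master["register"]

master = {"address": 0x1000, "register": None}
slave = {"address": 0x1000, "data": 0x41}
bus = {"Master-ready": 0, "Slave-ready": 0,
       "address": None, "command": None, "data": None}
print(hex(input_transfer(master, slave, bus)))  # 0x41
```

Note that each step is triggered by the previous signal change, which is exactly what "fully interlocked" means.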

Discussion

Many variations of the bus protocols just described are found in commercial computers. The choice of a particular design involves trade-offs among factors such as:

• Simplicity of the device interface
• Ability to accommodate device interfaces that introduce different amounts of delay
• Total time required for a bus transfer
• Ability to detect errors resulting from addressing a nonexistent device or from an interface malfunction

Figure 7.7 Handshake control of data transfer during an output operation.

The main advantage of the asynchronous bus is that the handshake protocol eliminates the need for distribution of a single clock signal whose edges should be seen by all devices at about the same time. This simplifies timing design. Delays, whether introduced by the interface circuits or by propagation over the bus wires, are readily accommodated. These delays are likely to differ from one device to another, but the timing of data transfers adjusts automatically. For a synchronous bus, clock circuitry must be designed carefully to ensure proper timing, and delays must be kept within strict bounds.

The rate of data transfer on an asynchronous bus controlled by the handshake protocol is limited by the fact that each transfer involves two round-trip delays (four end-to-end delays). This can be seen in Figures 7.6 and 7.7, as each transition on Slave-ready must wait for the arrival of a transition on Master-ready, and vice versa. On synchronous buses, the clock period need only accommodate one round-trip delay. Hence, faster transfer rates can be achieved. To accommodate a slow device, additional clock cycles are used, as described above. Most of today's high-speed buses use the synchronous approach.

7.2.3 Electrical Considerations

A bus is an interconnection medium to which several devices may be connected. It is essential to ensure that only one device can place data on the bus at any given time. A logic gate that places data on the bus is called a bus driver. All devices connected to the bus, except the one that is currently sending data, must have their bus drivers turned off. A special type of logic gate, known as a tri-state gate, is used for this purpose. A tri-state gate has a control input that is used to turn the gate on or off. When turned on, or enabled, it drives the bus with 1 or 0, corresponding to the value of its input signal. When turned off, or disabled, it is effectively disconnected from the bus. From an electrical point of view, its output goes into a high-impedance state that does not affect the signal on the bus.
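The tri-state behavior can be modeled in a few lines. This is a sketch under the usual convention of writing the high-impedance state as "Z"; the function names are invented:

```python
# Sketch of a tri-state bus driver: when disabled, its output is
# high impedance ("Z") and does not affect the bus line.

def tristate(value, enable):
    """Return the driven value when enabled, else high impedance."""
    return value if enable else "Z"

def resolve(drivers):
    """Resolve a bus line from all connected drivers. At most one
    driver may be enabled at a time; otherwise there is contention."""
    driven = [v for v in drivers if v != "Z"]
    if len(driven) > 1:
        raise RuntimeError("bus contention: more than one driver enabled")
    return driven[0] if driven else "Z"

# Only device 1 is enabled, so its value appears on the bus line.
line = resolve([tristate(1, enable=True), tristate(0, enable=False)])
print(line)  # 1
```

The `resolve` check mirrors the requirement stated above: all drivers except the one currently sending data must be turned off.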

7.3 Arbitration

There are occasions when two or more entities contend for the use of a single resource in a computer system. For example, two devices may need to access a given slave at the same time. In such cases, it is necessary to decide which device will access the slave first. The decision is usually made in an arbitration process performed by an arbiter circuit. The arbitration process starts by each device sending a request to use the shared resource. The arbiter associates priorities with individual requests. If it receives two requests at the same time, it grants the use of the slave to the device having the higher priority first.

To illustrate the arbitration process, we consider the case where a single bus is the shared resource. The device that initiates data transfer requests on the bus is the bus master. In Section 7.2, the discussion involved only one bus master—the processor. It is possible that several devices in a computer system need to be bus masters to transfer data. For example, an I/O device needs to be a bus master to transfer data directly to or from the computer's memory. Since the bus is a single shared facility, it is essential to provide orderly access to it by the bus masters.

A device that wishes to use the bus sends a request to the arbiter. When multiple requests arrive at the same time, the arbiter selects one request and grants the bus to the corresponding device. For some devices, a delay in gaining access to the bus may lead to an error. Such devices must be given high priority. If there is no particular urgency among requests, the arbiter may grant the bus using a simple round-robin scheme.

Figure 7.8 illustrates an arrangement for bus arbitration involving two masters. There are two Bus-request lines, BR1 and BR2, and two Bus-grant lines, BG1 and BG2, connecting the arbiter to the masters. A master requests use of the bus by activating its Bus-request line. If a single Bus-request is activated, the arbiter activates the corresponding Bus-grant. This indicates to the selected master that it may now use the bus for transferring data. When the transfer is completed, that master deactivates its Bus-request, and the arbiter deactivates its Bus-grant.

Figure 7.8 Bus arbitration.

Figure 7.9 Granting use of the bus based on priorities.

Figure 7.9 illustrates a possible sequence of events for the case of three masters. Assume that master 1 has the highest priority, followed by the others in increasing numerical order. Master 2 sends a request to use the bus first. Since there are no other requests, the arbiter grants the bus to this master by asserting BG2. When master 2 completes its data transfer operation, it releases the bus by deactivating BR2. By that time, both masters 1 and 3 have activated their request lines. Since master 1 has a higher priority, the arbiter activates BG1 after it deactivates BG2, thus granting the bus to master 1. Later, when master 1 releases the bus by deactivating BR1, the arbiter deactivates BG1 and activates BG3 to grant the bus to master 3. Note that the bus is granted to master 1 before master 3, even though master 3 activated its request line before master 1.
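The fixed-priority behavior in this scenario can be sketched in software. This is an illustrative model, not the arbiter circuit itself; the event encoding is invented:

```python
# Fixed-priority bus arbiter sketch reproducing the Figure 7.9
# scenario: master 2 requests first; while it holds the bus, masters
# 1 and 3 also request; master 1 (highest priority) is served next,
# then master 3.

def arbiter(events):
    """events: list of ('request', m) or ('release', m) tuples.
    Returns the order in which masters are granted the bus."""
    pending, owner, grants = set(), None, []
    for kind, m in events:
        if kind == "request":
            pending.add(m)
        else:                       # release: current owner gives up the bus
            owner = None
        if owner is None and pending:
            owner = min(pending)    # lowest number = highest priority
            pending.discard(owner)
            grants.append(owner)
    return grants

order = arbiter([("request", 2),    # BR2: bus is idle, so BG2 is asserted
                 ("request", 1),    # BR1 arrives while master 2 owns the bus
                 ("request", 3),    # BR3 arrives while master 2 owns the bus
                 ("release", 2),    # master 2 done; BG1 asserted next
                 ("release", 1)])   # master 1 done; BG3 asserted last
print(order)  # [2, 1, 3]
```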

7.4 Interface Circuits

The I/O interface of a device consists of the circuitry needed to connect that device to the bus. On one side of the interface are the bus lines for address, data, and control. On the other side are the connections needed to transfer data between the interface and the I/O device. This side is called a port, and it can be either a parallel or a serial port. A parallel port transfers multiple bits of data simultaneously to or from the device. A serial port sends and receives data one bit at a time. Communication with the processor is the same for both formats; the conversion from a parallel to a serial format and vice versa takes place inside the interface circuit.

Before we present specific circuit examples, let us recall the functions of an I/O interface. According to the discussion in Section 3.1, an I/O interface does the following:

1. Provides a register for temporary storage of data

2. Includes a status register containing status information that can be accessed by the processor

3. Includes a control register that holds the information governing the behavior of the interface

4. Contains address-decoding circuitry to determine when it is being addressed by the processor

5. Generates the required timing signals

6. Performs any format conversion that may be necessary to transfer data between the processor and the I/O device, such as parallel-to-serial conversion in the case of a serial port

7.4.1 Parallel Interface

We now explain the key aspects of interface design by means of examples. First, we describe an interface circuit for an 8-bit input port that can be used for connecting a simple input device, such as a keyboard. Then, we describe an interface circuit for an 8-bit output port, which can be used with an output device such as a display. We assume that these interface circuits are connected to a 32-bit processor that uses memory-mapped I/O and the asynchronous bus protocol depicted in Figures 7.6 and 7.7.

Input Interface

Figure 7.10 shows a circuit that can be used to connect a keyboard to a processor. The registers in this circuit correspond to those given in Figure 3.3. Assume that interrupts are not used, so there is no need for a control register. There are only two registers: a data register, KBD_DATA, and a status register, KBD_STATUS. The latter contains the keyboard status flag, KIN.

A typical keyboard consists of mechanical switches that are normally open. When a key is pressed, its switch closes and establishes a path for an electrical signal. This signal is detected by an encoder circuit that generates the ASCII code for the corresponding character. A difficulty with such mechanical pushbutton switches is that the contacts bounce when a key is pressed, resulting in the electrical connection being made then broken several times before the switch settles in the closed position. Although bouncing may last only one or two milliseconds, this is long enough for the computer to erroneously interpret a single pressing of a key as the key being pressed and released several times. The effect of bouncing can be eliminated using a simple debouncing circuit, which could be part of the keyboard hardware or may be incorporated in the encoder circuit. Alternatively, switch bouncing can be dealt with in software. The software detects that a key has been pressed when it observes that the keyboard status flag, KIN, has been set to 1. The I/O routine can then introduce sufficient delay before reading the contents of the input buffer, KBD_DATA, to ensure that bouncing has subsided. When debouncing is implemented in hardware, the I/O routine can read the input character as soon as it detects that KIN is equal to 1.

Figure 7.10 Keyboard to processor connection.

The output of the encoder in Figure 7.10 consists of one byte of data representing the encoded character and one control signal called Valid. When a key is pressed, the Valid signal changes from 0 to 1, causing the ASCII code of the corresponding character to be loaded into the KBD_DATA register and the status flag KIN to be set to 1. The status flag is cleared to 0 when the processor reads the contents of the KBD_DATA register. The interface circuit is shown connected to an asynchronous bus on which transfers are controlled by the handshake signals Master-ready and Slave-ready, as in Figure 7.6. The bus has one other control line, R/W, which indicates a Read operation when equal to 1.
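The interplay between the Valid signal, the KIN flag, and the I/O routine that polls it can be sketched as follows. The register names follow the text; the device model and class are invented for illustration:

```python
# Program-controlled input sketch: the I/O routine polls the KIN flag
# in KBD_STATUS and reads KBD_DATA when a character is available.

class KeyboardInterface:
    KIN = 0x02                     # status flag is bit b1 of KBD_STATUS

    def __init__(self):
        self.kbd_data = 0
        self.kbd_status = 0

    def key_pressed(self, ascii_code):
        # Valid goes from 0 to 1: load KBD_DATA and set KIN to 1
        self.kbd_data = ascii_code
        self.kbd_status |= self.KIN

    def read_data(self):
        # reading KBD_DATA clears the KIN flag to 0
        self.kbd_status &= ~self.KIN
        return self.kbd_data

def read_char(kbd):
    while not (kbd.kbd_status & KeyboardInterface.KIN):
        pass                       # busy-wait until KIN = 1
    return kbd.read_data()

kbd = KeyboardInterface()
kbd.key_pressed(ord("A"))
print(chr(read_char(kbd)))  # A
```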

Figure 7.11 shows a possible circuit for the input interface. There are two addressable locations in this interface, KBD_DATA and KBD_STATUS. They occupy adjacent word locations in the address space, as in Figure 3.3. Only one bit, b1, in the status register actually contains useful information. This is the keyboard status flag, KIN. When the status register is read by the processor, all other bit locations appear as containing zeros.

When the processor requests a Read operation, it places the address of the appropriate register on the address lines of the bus. The address decoder in the interface circuit examines bits A31−3, and asserts its output, My-address, when one of the two registers KBD_DATA or KBD_STATUS is being addressed. Bit A2 determines which of the two registers is involved. Hence, a multiplexer is used to select the register to be connected to the bus based on address bit A2. The two least-significant address bits, A1 and A0, are not used, because we have assumed that all addresses are word-aligned.

The output of the multiplexer is connected to the data lines of the bus through a set of tri-state gates. The interface circuit turns the tri-state gates on only when the three signals Master-ready, My-address, and R/W are all equal to 1, indicating a Read operation. The Slave-ready signal is asserted at the same time, to inform the processor that the requested data or status information has been placed on the data lines. When address bit A2 is equal to 0, Read-data is also asserted. This signal is used to reset the KIN flag.

Figure 7.11 An input interface circuit.

A possible implementation of the status flag circuit is given in Figure 7.12. The KIN flag is the output of a NOR latch connected as shown. A flip-flop is set to 1 by the rising edge on the Valid signal line. This event changes the state of the NOR latch to set KIN to 1, but only when Master-ready is low. The reason for this additional condition is to ensure that KIN does not change state while being read by the processor. Both the flip-flop and the latch are reset to 0 when Read-data becomes equal to 1, indicating that KBD_DATA is being read.

Figure 7.12 Circuit for the status flag block in Figure 7.11.

The circuits shown in Figures 7.11 and 7.12 are intended to illustrate the various functions that an interface circuit needs to perform. A designer using modern computer-aided design tools would specify these functions using a hardware description language such as VHDL or Verilog. The resulting circuits would depend on the technology used and may or may not be the same as the circuits shown in these figures.

Output Interface

Let us now consider the output interface shown in Figure 7.13, which can be used to connect an output device such as a display. We have assumed that the display uses two handshake signals, New-data and Ready, in a manner similar to the handshake between the bus signals Master-ready and Slave-ready. When the display is ready to accept a character, it asserts its Ready signal, which causes the DOUT flag in the DISP_STATUS register to be set to 1. When the I/O routine checks DOUT and finds it equal to 1, it sends a character to DISP_DATA. This clears the DOUT flag to 0 and sets the New-data signal to 1. In response, the display returns Ready to 0 and accepts and displays the character in DISP_DATA. When it is ready to receive another character, it asserts Ready again, and the cycle repeats.
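The output direction mirrors the input case: the I/O routine polls DOUT and writes to DISP_DATA when the display is ready. The register names follow the text; the display model is an invented sketch:

```python
# Program-controlled output sketch: the I/O routine polls the DOUT
# flag in DISP_STATUS and writes a character to DISP_DATA when the
# display is ready to accept it.

class DisplayInterface:
    DOUT = 0x04                    # status flag is bit b2 of DISP_STATUS

    def __init__(self):
        self.disp_status = self.DOUT   # display starts out ready
        self.shown = []

    def write_data(self, char):
        self.disp_status &= ~self.DOUT  # writing DISP_DATA clears DOUT
        self.shown.append(chr(char))    # display accepts the character
        self.disp_status |= self.DOUT   # Ready again: DOUT set back to 1

def put_char(disp, char):
    while not (disp.disp_status & DisplayInterface.DOUT):
        pass                       # busy-wait until DOUT = 1
    disp.write_data(char)

disp = DisplayInterface()
for c in "Hi":
    put_char(disp, ord(c))
print("".join(disp.shown))  # Hi
```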

Figure 7.13 Display to processor connection.

Figure 7.14 shows an implementation of this interface. Its operation is similar to that of the input interface of Figure 7.11, except that it responds to both Read and Write operations. A Write operation in which A2 = 0 loads a byte of data into register DISP_DATA. A Read operation in which A2 = 1 reads the contents of the status register DISP_STATUS. In this case, only the DOUT flag, which is bit b2 of the status register, is sent by the interface. The remaining bits of DISP_STATUS are not used. The state of the status flag is determined by the handshake control circuit. A state diagram describing the behavior of this circuit is given as Example 7.4 at the end of the chapter.

7.4.2 Serial Interface

A serial interface is used to connect the processor to I/O devices that transmit data one bit at a time. Data are transferred in a bit-serial fashion on the device side and in a bit-parallel fashion on the processor side. The transformation between the parallel and serial formats is achieved with shift registers that have parallel access capability. A block diagram of a typical serial interface is shown in Figure 7.15. The input shift register accepts bit-serial input from the I/O device. When all 8 bits of data have been received, the contents of this shift register are loaded in parallel into the DATAIN register. Similarly, output data in the DATAOUT register are transferred to the output shift register, from which the bits are shifted out and sent to the I/O device.

The part of the interface that deals with the bus is the same as in the parallel interface described earlier. Two status flags, which we will refer to as SIN and SOUT, are maintained by the Status and control block. The SIN flag is set to 1 when new data are loaded into DATAIN from the shift register, and cleared to 0 when these data are read by the processor. The SOUT flag indicates whether the DATAOUT register is available. It is cleared to 0 when the processor writes new data into DATAOUT and set to 1 when data are transferred from DATAOUT to the output shift register.

Figure 7.14 An output interface circuit.

The double buffering used in the input and output paths in Figure 7.15 is important. It is possible to implement DATAIN and DATAOUT themselves as shift registers, thus obviating the need for separate shift registers. However, this would impose awkward restrictions on the operation of the I/O device. After receiving one character from the serial line, the interface would not be able to start receiving the next character until the processor reads the contents of DATAIN. Thus, a pause would be needed between two characters to give the processor time to read the input data. With double buffering, the transfer of the second character can begin as soon as the first character is loaded from the shift register into the DATAIN register. Thus, provided the processor reads the contents of DATAIN before the serial transfer of the second character is completed, the interface can receive a continuous stream of input data over the serial line. An analogous situation occurs in the output path of the interface.
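The double-buffered input path can be sketched as follows: the shift register starts assembling the next character immediately, while the previous one waits in DATAIN. The register and flag names follow the text; the class and bit framing are an invented model:

```python
# Sketch of the double-buffered input path: shifting can continue
# while the previous character waits in DATAIN for the processor.

class SerialInput:
    def __init__(self):
        self.shift = []            # input shift register (bits)
        self.datain = None         # DATAIN register
        self.sin = 0               # SIN status flag

    def receive_bit(self, bit):
        self.shift.append(bit)
        if len(self.shift) == 8:   # a full character has been assembled
            # transfer to DATAIN only if the processor has already read
            # the previous character (otherwise it would be overwritten)
            if self.sin == 0:
                self.datain = self.shift
                self.sin = 1       # SIN set: new data in DATAIN
            self.shift = []        # shifting continues with no pause

    def read(self):
        self.sin = 0               # reading DATAIN clears SIN
        return self.datain

port = SerialInput()
for bit in [0, 1, 0, 0, 0, 0, 0, 1]:   # first character arrives
    port.receive_bit(bit)
# The second character can begin before the processor reads DATAIN:
port.receive_bit(1)
print(port.sin, port.read())  # 1 [0, 1, 0, 0, 0, 0, 0, 1]
```

Without the separate DATAIN register, the `receive_bit` calls for the second character would have to wait for `read`, which is exactly the pause the text describes.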

Figure 7.15 A serial interface.

During serial transmission, the receiver needs to know when to shift each bit into its input shift register. Since there is no separate line to carry a clock signal from the transmitter to the receiver, the timing information needed must be embedded into the transmitted data using an encoding scheme. There are two basic approaches. The first is known as asynchronous transmission, because the receiver uses a clock that is not synchronized with the transmitter clock. In the second approach, the receiver is able to generate a clock that is synchronized with the transmitter clock. Hence it is called synchronous transmission. These approaches are described briefly below.

Asynchronous Transmission

This approach uses a technique called start-stop transmission. Data are organized in small groups of 6 to 8 bits, with a well-defined beginning and end. In a typical arrangement, alphanumeric characters encoded in 8 bits are transmitted as shown in Figure 7.16. The line connecting the transmitter and the receiver is in the 1 state when idle. A character is transmitted as a 0 bit, referred to as the Start bit, followed by 8 data bits and 1 or 2 Stop bits. The Stop bits have a logic value of 1. The 1-to-0 transition at the beginning of the Start bit alerts the receiver that data transmission is about to begin. Using its own clock, the receiver determines the position of the next 8 bits, which it loads into its input register. The Stop bits following the transmitted character, which are equal to 1, ensure that the Start bit of the next character will be recognized. When transmission stops, the line remains in the 1 state until another character is transmitted.

Figure 7.16 Asynchronous serial character transmission.

To ensure correct reception, the receiver needs to sample the incoming data as close to the center of each bit as possible. It does so by using a clock signal whose frequency, fR, is substantially higher than the transmission clock, fT. Typically, fR = 16fT. This means that 16 pulses of the local clock occur during each data bit interval. This clock is used to increment a modulo-16 counter, which is cleared to 0 when the leading edge of a Start bit is detected. The middle of the Start bit is reached at the count of 8. The state of the input line is sampled again at this point to confirm that it is a valid Start bit (a zero), and the counter is cleared to 0. From this point onward, the incoming data signal is sampled whenever the count reaches 16, which should be close to the middle of each incoming bit. Therefore, as long as fR/16 is sufficiently close to fT, the receiver will correctly load the bits of the incoming character.
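The 16x oversampling scheme can be sketched directly: detect the 1-to-0 edge, re-check the line 8 ticks later (mid Start bit), then sample every 16 ticks thereafter. The waveform and function are an invented example, assuming 8 data bits sent least-significant bit first:

```python
# Sketch of start-stop reception with a 16x receiver clock (fR = 16 fT):
# the line is re-sampled 8 ticks after the Start-bit edge, and each
# data bit is then sampled every 16 ticks, near its center.

def receive_character(samples):
    """samples: line level at each receiver clock tick (16 per bit)."""
    # find the 1-to-0 transition that marks the Start bit
    edge = next(i for i in range(1, len(samples))
                if samples[i - 1] == 1 and samples[i] == 0)
    if samples[edge + 8] != 0:     # mid Start bit: confirm a valid 0
        return None                # noise, not a real Start bit
    # sample the middle of each of the 8 data bits, LSB first
    bits = [samples[edge + 8 + 16 * (n + 1)] for n in range(8)]
    return sum(bit << n for n, bit in enumerate(bits))

# Start bit, then the 8 data bits of 'A' (0x41) sent LSB first
char = [0, 1, 0, 0, 0, 0, 0, 1, 0]
wave = sum(([level] * 16 for level in char), []) + [1] * 16  # + Stop bit
print(chr(receive_character([1] * 8 + wave)))  # A
```

Because sampling is anchored to the Start-bit edge of every character, a small mismatch between fR/16 and fT cannot accumulate beyond one character, which is why the scheme tolerates unsynchronized clocks.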

Synchronous Transmission

In the start-stop scheme described above, the position of the 1-to-0 transition at the beginning of the Start bit in Figure 7.16 is the key to obtaining correct timing information. This scheme is useful only where the speed of transmission is sufficiently low and the conditions on the transmission link are such that the square waveforms shown in the figure maintain their shape. For higher speeds, a more reliable method is needed for the receiver to recover the timing information.

In synchronous transmission, the receiver generates a clock that is synchronized to that of the transmitter by observing successive 1-to-0 and 0-to-1 transitions in the received signal. It adjusts the position of the active edge of the clock to be in the center of the bit position. A variety of encoding schemes are used to ensure that enough signal transitions occur to enable the receiver to generate a synchronized clock and to maintain synchronization. Once synchronization is achieved, data transmission can continue indefinitely. Encoded data are usually transmitted in large blocks consisting of several hundreds or several thousands of bits. The beginning and end of each block are marked by appropriate codes, and data within a block are organized according to an agreed-upon set of rules. Synchronous transmission enables very high data transfer rates.

7.5 Interconnection Standards

A typical desktop or notebook computer has several ports that can be used to connect I/O devices, such as a mouse, a memory key, or a disk drive. Standard interfaces have been developed to enable I/O devices to use interfaces that are independent of any particular processor. For example, a memory key that has a USB connector can be used with any computer that has a USB port. In this section, we describe briefly some of the widely used interconnection standards.

Most standards are developed by a collaborative effort among a number of companies. In many cases, the IEEE (Institute of Electrical and Electronics Engineers) develops these standards further and publishes them as IEEE Standards.

7.5.1 Universal Serial Bus (USB)

The Universal Serial Bus (USB) [1] is the most widely used interconnection standard. A large variety of devices are available with a USB connector, including mice, memory keys, disk drives, printers, cameras, and many more. The commercial success of the USB is due to its simplicity and low cost. The original USB specification supports two speeds of operation, called low-speed (1.5 Megabits/s) and full-speed (12 Megabits/s). Later, USB 2, called High-Speed USB, was introduced. It enables data transfers at speeds up to 480 Megabits/s. As I/O devices continued to evolve with even higher speed requirements, USB 3 (called Superspeed) was developed. It supports data transfer rates up to 5 Gigabits/s.

The USB has been designed to meet several key objectives:

• Provide a simple, low-cost, and easy-to-use interconnection system
• Accommodate a wide range of I/O devices and bit rates, including Internet connections, and audio and video applications
• Enhance user convenience through a "plug-and-play" mode of operation

We will elaborate on some of these objectives before discussing the technical details of the USB.

Device Characteristics

The kinds of devices that may be connected to a computer cover a wide range of functionality. The speed, volume, and timing constraints associated with data transfers to and from these devices vary significantly.

In the case of a keyboard, one byte of data is generated every time a key is pressed, which may happen at any time. These data should be transferred to the computer promptly. Since the event of pressing a key is not synchronized to any other event in a computer system, the data generated by the keyboard are called asynchronous. Furthermore, the rate at which the data are generated is quite low. It is limited by the speed of the human operator to about 10 bytes per second, which is less than 100 bits per second.

A variety of simple devices that may be attached to a computer generate data of a similar nature—low speed and asynchronous. Computer mice and some of the controls and manipulators used with video games are good examples.

Consider now a different source of data. Many computers have a microphone, either externally attached or built in. The sound picked up by the microphone produces an analog electrical signal, which must be converted into a digital form before it can be handled by the computer. This is accomplished by sampling the analog signal periodically. For each sample, an analog-to-digital (A/D) converter generates an n-bit number representing the magnitude of the sample. The number of bits, n, is selected based on the desired precision with which to represent each sample. Later, when these data are sent to a speaker, a digital-to-analog (D/A) converter is used to restore the original analog signal from the digital format. A similar approach is used with video information from a camera.

The sampling process yields a continuous stream of digitized samples that arrive at regular intervals, synchronized with the sampling clock. Such a data stream is called isochronous, meaning that successive events are separated by equal periods of time. A signal must be sampled quickly enough to track its highest-frequency components. In general, if the sampling rate is s samples per second, the maximum frequency component captured by the sampling process is s/2. For example, human speech can be captured adequately with a sampling rate of 8 kHz, which will record sound signals having frequencies up to 4 kHz. For higher-quality sound, as needed in a music system, higher sampling rates are used. A standard sampling rate for digital sound is 44.1 kHz. Each sample is represented by 4 bytes of data to accommodate the wide range in sound volume (dynamic range) that is necessary for high-quality sound reproduction. This yields a data rate of about 1.4 Megabits/s.
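The data-rate figures quoted in the text can be checked with a few lines of arithmetic (the 160-kilobyte television image here is taken as 160,000 bytes):

```python
# Checking the data-rate arithmetic: digital sound at 44.1 kHz with
# 4 bytes per sample gives roughly 1.4 Megabits/s.

sampling_rate = 44_100             # samples per second
bytes_per_sample = 4
sound_bits_per_second = sampling_rate * bytes_per_sample * 8
print(sound_bits_per_second / 1e6)  # 1.4112, i.e. about 1.4 Megabits/s

# The highest frequency captured is half the sampling rate:
print(sampling_rate / 2)            # 22050.0 Hz

# Commercial television: about 160 kilobytes per image, 30 images/s.
tv_bits_per_second = 160_000 * 8 * 30
print(tv_bits_per_second / 1e6)     # 38.4; control information brings
                                    # the total to about 44 Megabits/s
```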

An important requirement in dealing with sampled voice or music is to maintain precise timing in the sampling and replay processes. A high degree of jitter (variability in sample timing) is unacceptable. Hence, the data transfer mechanism between a computer and a music system must maintain consistent delays from one sample to the next. Otherwise, complex buffering and retiming circuitry would be needed. On the other hand, occasional errors or missed samples can be tolerated. They either go unnoticed by the listener or they may cause an unobtrusive click. No sophisticated mechanisms are needed to ensure perfectly correct data delivery.

Data transfers for images and video have similar requirements, but require much higher data transfer rates. To maintain the picture quality of commercial television, an image should be represented by about 160 kilobytes and transmitted 30 times per second. Together with control information, this yields a total bit rate of 44 Megabits/s. Higher-quality images, as in HDTV (High Definition TV), require higher rates.

Large storage devices such as magnetic and optical disks present different requirements. These devices are part of the computer's memory hierarchy, as will be discussed in Chapter 8. Their connection to the computer requires a data transfer bandwidth of at least 40 or 50 Megabits/s. Delays on the order of milliseconds are introduced by the movement of the mechanical components in the disk mechanism. Hence, a small additional delay introduced while transferring data to or from the computer is not important, and jitter is not an issue. However, the transfer mechanism must guarantee data correctness.


Plug-and-Play

When an I/O device is connected to a computer, the operating system needs some information about it. It needs to know what type of device it is so that it can use the appropriate device driver. It also needs to know the addresses of the registers in the device's interface to be able to communicate with it. The USB standard defines both the USB hardware and the software that communicates with it. Its plug-and-play feature means that when a new device is connected, the system detects its existence automatically. The software determines the kind of device and how to communicate with it, as well as any special requirements it might have. As a result, the user simply plugs in a USB device and begins to use it, without having to get involved in any of these details.

The USB is also hot-pluggable, which means a device can be plugged into or removed from a USB port while power is turned on.

USB Architecture

The USB uses point-to-point connections and a serial transmission format. When multiple devices are connected, they are arranged in a tree structure as shown in Figure 7.17. Each node of the tree has a device called a hub, which acts as an intermediate transfer point between the host computer and the I/O devices. At the root of the tree, a root hub connects the entire tree to the host computer. The leaves of the tree are the I/O devices: a mouse, a keyboard, a printer, an Internet connection, a camera, or a speaker. The tree structure makes it possible to connect many devices using simple point-to-point serial links.

If I/O devices are allowed to send messages at any time, two messages may reach the hub at the same time and interfere with each other. For this reason, the USB operates strictly on the basis of polling. A device may send a message only in response to a poll message from the host processor. Hence, no two devices can send messages at the same time. This restriction allows hubs to be simple, low-cost devices.

Each device on the USB, whether it is a hub or an I/O device, is assigned a 7-bit address. This address is local to the USB tree and is not related in any way to the processor's address space. The root hub of the USB, which is attached to the processor, appears as a single device. The host software communicates with individual devices by sending information to the root hub, which forwards it to the appropriate device in the USB tree.

When a device is first connected to a hub, or when it is powered on, it has the address 0. Periodically, the host polls each hub to collect status information and learn about new devices that may have been added or disconnected. When the host is informed that a new device has been connected, it reads the information in a special memory in the device's USB interface to learn about the device's capabilities. It then assigns the device a unique USB address and writes that address in one of the device's interface registers. It is this initial connection procedure that gives the USB its plug-and-play capability.
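The connection procedure just described can be sketched as host-side pseudologic (a hypothetical model; the class names and the print format are illustrative, not part of the USB standard):

```python
# Hypothetical sketch of USB plug-and-play enumeration as described
# above: a new device answers at address 0 until the host reads its
# capabilities and assigns it a unique 7-bit address.

DEFAULT_ADDRESS = 0  # every newly attached device starts at address 0

class Device:
    def __init__(self, descriptor):
        self.address = DEFAULT_ADDRESS
        self.descriptor = descriptor   # the "special memory" describing capabilities

class Host:
    def __init__(self):
        self.next_address = 1          # 7-bit addresses: 1..127

    def poll_hub(self, hub_ports):
        """Periodic poll: find newly attached devices and configure them."""
        for device in hub_ports:
            if device is not None and device.address == DEFAULT_ADDRESS:
                info = device.descriptor              # read capabilities
                device.address = self.next_address    # write unique USB address
                self.next_address += 1
                print(f"configured {info} at address {device.address}")

host = Host()
ports = [Device("mouse"), None, Device("keyboard")]
host.poll_hub(ports)
```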

Isochronous Traffic on USB

An important feature of the USB is its ability to support the transfer of isochronous data in a simple manner. As mentioned earlier, isochronous data need to be transferred at precisely timed regular intervals. To accommodate this type of traffic, the root hub transmits a uniquely recognizable sequence of bits over the USB tree every millisecond. This sequence of bits, called a Start of Frame character, acts as a marker indicating the beginning of isochronous data, which are transmitted after this character. Thus, digitized audio and video signals can be transferred in a regular and precisely timed manner.

Figure 7.17 Universal Serial Bus tree structure.

Electrical Characteristics

USB connections consist of four wires, of which two carry power, +5 V and Ground, and two carry data. Thus, I/O devices that do not have large power requirements can be powered directly from the USB. This obviates the need for a separate power supply for simple devices such as a memory key or a mouse.

Two methods are used to send data over a USB cable. When sending data at low speed, a high voltage relative to Ground is transmitted on one of the two data wires to represent a 0 and on the other to represent a 1. The Ground wire carries the return current in both cases. Such a scheme in which a signal is injected on a wire relative to ground is referred to as single-ended transmission.


The speed at which data can be sent on any cable is limited by the amount of electrical noise present. The term noise refers to any signal that interferes with the desired data signal and hence could cause errors. Single-ended transmission is highly susceptible to noise. The voltage on the ground wire is common to all the devices connected to the computer. Signals sent by one device can cause small variations in the voltage on the ground wire, and hence can interfere with signals sent by another device. Interference can also be caused by one wire picking up noise from nearby wires.

The High-Speed USB uses an alternative arrangement known as differential signaling. The data signal is injected between two data wires twisted together. The ground wire is not involved. The receiver senses the voltage difference between the two signal wires directly, without reference to ground. This arrangement is very effective in reducing the noise seen by the receiver, because any noise injected on one of the two wires of the twisted pair is also injected on the other. Since the receiver is sensitive only to the voltage difference between the two wires, the noise component is cancelled out. The ground wire acts as a shield for the data on the twisted pair against interference from nearby wires. Differential signaling allows much lower voltages and much higher speeds to be used compared to single-ended signaling.
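A small numeric example illustrates why the noise cancels (the voltage values are arbitrary):

```python
# Numeric illustration of noise cancellation in differential signaling.
# Any noise induced equally on both wires of the twisted pair
# disappears when the receiver takes the difference.

signal = 0.4    # differential half-swing driven onto the pair, in volts
noise = 0.25    # common-mode noise picked up by BOTH wires

wire_plus = +signal + noise
wire_minus = -signal + noise

received = wire_plus - wire_minus   # the receiver senses only the difference
print(round(received, 6))           # 0.8 -- the noise term has cancelled

# A single-ended receiver measuring wire_plus against ground would see
# the 0.25 V noise added directly to the signal.
```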

7.5.2 FireWire

FireWire is another popular interconnection standard. It was originally developed by Apple and has been adopted as IEEE Standard 1394 [2]. Like the USB, it uses differential point-to-point serial links. The following are some of the salient differences between FireWire and USB.

• Devices are organized in a daisy chain manner on a FireWire bus, instead of the tree structure of USB. One device is connected to the computer, a second device is connected to the first one, a third device is connected to the second one, and so on.

• FireWire is well suited for connecting audio and video equipment. It can be operated in an isochronous mode that is highly optimized for carrying high-speed isochronous traffic.

• I/O devices connected to the USB communicate with the host computer. If data are to be transferred from one device to another, for example from a camera to a display or printer, they are first read by the host then sent to the display or printer. FireWire, on the other hand, supports a mode of operation called peer-to-peer. This means that data may be transferred directly from one I/O device to another, without the host's involvement.

• The basic FireWire connector has six pins. There are two pairs of data wires, one for transmission in each direction, and two for power and ground. Higher-speed versions use a nine-pin connector, with three ground wires added to shield the data wires against interference.

• The FireWire bus can deliver considerably more power than the USB. Hence, it can support devices with moderate power requirements.

FireWire is widely used with audio and video devices. For example, most camcorders have a FireWire port. Several versions of the standard have been defined, which can operate at speeds ranging from 400 Megabits/s to 3.6 Gigabits/s.


7.5.3 PCI Bus

The PCI (Peripheral Component Interconnect) bus [3] was developed as a low-cost, processor-independent bus. It is housed on the motherboard of a computer and used to connect I/O interfaces for a wide variety of devices. A device connected to the PCI bus appears to the processor as if it is connected directly to the processor bus. Its interface registers are assigned addresses in the address space of the processor.

We will start by describing how the PCI bus operates, then discuss some of its features.

Bus Structure

The use of the PCI bus in a computer system is illustrated in Figure 7.18. The PCI bus is connected to the processor bus via a controller called a bridge. The bridge has a special port for connecting the computer's main memory. It may also have another special high-speed port for connecting graphics devices. The bridge translates and relays commands and responses from one bus to the other and transfers data between them. For example, when the processor sends a Read request to an I/O device, the bridge forwards the command and address to the PCI bus. When the bridge receives the device's response, it forwards the data to the processor using the processor bus. I/O devices are connected to the PCI bus, possibly through ports that use standards such as Ethernet, USB, SATA, SCSI, or SAS.

Figure 7.18 Use of a PCI bus in a computer system.

The PCI bus supports three independent address spaces: memory, I/O, and configuration. The system designer may choose to use memory-mapped I/O even with a processor that has a separate I/O address space. In fact, this is the approach recommended by the PCI standard for wider compatibility. The configuration space is intended to give the PCI its plug-and-play capability, as we will explain shortly. A 4-bit command that accompanies the address identifies which of the three spaces is being used in a given data transfer operation.

Data transfers on a computer bus often involve bursts of data rather than individual words. Words stored in successive memory locations are transferred directly between the memory and an I/O device such as a disk or an Ethernet connection. Data transfers are initiated by the interface of the I/O device, which acts as a bus master. This way of transferring data directly between the memory and I/O devices is discussed in detail in Chapter 8. The PCI bus is designed primarily to support multiple-word transfers. A Read or a Write operation involving a single word is simply treated as a burst of length one.

The signaling convention on the PCI bus is similar to that used in Figure 7.5, with one important difference. The PCI bus uses the same lines to transfer both address and data. In Figure 7.5, we assumed that the master maintains the address information on the bus until the data transfer is completed. But, this is not necessary. The address is needed only long enough for the slave to be selected, freeing the lines for sending data in subsequent clock cycles. For transfers involving multiple words, the slave can store the address in an internal register and increment it to access successive address locations. A significant cost reduction can be realized in this manner, because the number of bus lines is an important factor affecting the cost of a computer system.
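The address-latching behavior of such a burst-capable slave can be sketched as follows (a hypothetical model, not a real bus interface; the name BurstTarget is illustrative):

```python
# Sketch of how a slave on a multiplexed address/data bus can serve a
# burst: it latches the starting address once, then increments an
# internal register while the shared lines carry data words.

class BurstTarget:
    def __init__(self, memory):
        self.memory = memory   # word-addressable storage, modeled as a dict
        self.addr = None

    def address_phase(self, address):
        """Cycle 1: latch the burst's starting address from the AD lines."""
        self.addr = address

    def data_phase(self):
        """Subsequent cycles: drive one word and advance the internal address."""
        word = self.memory[self.addr]
        self.addr += 1
        return word

target = BurstTarget({100: 'w0', 101: 'w1', 102: 'w2', 103: 'w3'})
target.address_phase(100)
burst = [target.data_phase() for _ in range(4)]
print(burst)   # ['w0', 'w1', 'w2', 'w3']
```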

Data Transfer

To understand the operation of the bus and its various features, we will examine a typical bus transaction. The bus master, which is the device that initiates data transfers by issuing Read and Write commands, is called the initiator in PCI terminology. The addressed device that responds to these commands is called a target. The main bus signals used for transferring data are listed in Table 7.1. There are 32 or 64 lines that carry address and data using a synchronous signaling scheme similar to that of Figure 7.5. The target-ready, TRDY#, signal is equivalent to the Slave-ready signal in that figure. In addition, PCI uses an initiator-ready signal, IRDY#, to support burst transfers. We will describe these signals briefly, to provide the reader with an appreciation of the main features of the bus.

A complete transfer operation on the PCI bus, involving an address and a burst of data, is called a transaction. Consider a bus transaction in which an initiator reads four consecutive 32-bit words from the memory. The sequence of events on the bus is illustrated in Figure 7.19. All signal transitions are triggered by the rising edge of the clock. As in the case of Figure 7.5, we show the signals changing later in the clock cycle to indicate the delays they encounter. A signal whose name ends with the symbol # is asserted when in the low-voltage state.


Table 7.1 Data transfer signals on the PCI bus.

Name            Function

CLK             A 33-MHz or 66-MHz clock
FRAME#          Sent by the initiator to indicate the duration of a transmission
AD              32 address/data lines, which may be optionally increased to 64
C/BE#           4 command/byte-enable lines (8 for a 64-bit bus)
IRDY#, TRDY#    Initiator-ready and Target-ready signals
DEVSEL#         A response from the device indicating that it has recognized its address and is ready for a data transfer transaction
IDSEL#          Initialization Device Select

Figure 7.19 A Read operation on the PCI bus. (The timing diagram shows clock cycles 1 through 7 for the signals CLK, FRAME#, AD, C/BE#, IRDY#, TRDY#, and DEVSEL#; the AD lines carry the address, then data words #1 to #4, while the C/BE# lines carry the command, then the byte enables.)


The bus master, acting as the initiator, asserts FRAME# in clock cycle 1 to indicate the beginning of a transaction. At the same time, it sends the address on the AD lines and a command on the C/BE# lines. In this case, the command will indicate that a Read operation is requested and that the memory address space is being used.

In clock cycle 2, the initiator removes the address, disconnects its drivers from the AD lines, and asserts IRDY# to indicate that it is ready to receive data. The selected target asserts DEVSEL# to indicate that it has recognized its address and is ready to respond. At the same time, it enables its drivers on the AD lines, so that it can send data to the initiator in subsequent cycles. Clock cycle 2 is used to accommodate the delays involved in turning the AD lines around, as the initiator turns its drivers off and the target turns its drivers on. The target asserts TRDY# in clock cycle 3 and begins to send data. It maintains DEVSEL# in the asserted state until the end of the transaction.

We have assumed that the target is ready to send data in clock cycle 3. If not, it would delay asserting TRDY# until it is ready. The entire burst of data need not be sent in successive clock cycles. Either the initiator or the target may introduce a pause by deactivating its ready signal, then asserting it again when it is ready to resume the transfer of data.

The C/BE# lines, which are used to send a bus command in clock cycle 1, are used for a different purpose during the rest of the transaction. Each of these four lines is associated with one byte on the AD lines. The initiator asserts one or more of the C/BE# lines to indicate which byte lines are to be used for transferring data.

The initiator uses the FRAME# signal to indicate the duration of the burst. It deactivates this signal during the second-last word of the transfer. In Figure 7.19, the initiator maintains FRAME# in the asserted state until clock cycle 5, the cycle in which it receives the third word. In response, the target sends one more word in clock cycle 6, then stops. After sending the fourth word, the target deactivates TRDY# and DEVSEL# and disconnects its drivers on the AD lines.
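The transaction of Figure 7.19 can be traced cycle by cycle in a simplified model (assumptions: the target needs no wait states, and "asserted" is modeled simply as True, ignoring the active-low electrical convention):

```python
# Simplified cycle-by-cycle sketch of the 4-word PCI read of Figure 7.19.
# FRAME# stays asserted in cycles 1..5 (through the second-last word);
# the address phase is cycle 1, turnaround is cycle 2, data move in 3..6.

trace = []
for cycle in range(1, 8):
    frame_asserted = 1 <= cycle <= 5        # deasserted while the last word moves
    if cycle == 1:
        phase = "address"                   # AD carries address, C/BE# the command
    elif 3 <= cycle <= 6:
        phase = f"data word #{cycle - 2}"   # target drives AD, TRDY# asserted
    else:
        phase = "turnaround/idle"           # cycle 2: AD turnaround; cycle 7: idle
    trace.append((cycle, frame_asserted, phase))

for cycle, frame_asserted, phase in trace:
    state = "asserted" if frame_asserted else "deasserted"
    print(f"cycle {cycle}: FRAME# {state:10s} {phase}")
```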

Device Configuration

When an I/O device is connected to a computer, several actions are needed to configure both the device interface and the software that communicates with it. Like USB, PCI has a plug-and-play capability that greatly simplifies this process. In fact, the plug-and-play feature was pioneered by the PCI standard. A PCI interface includes a small configuration ROM memory that stores information about the I/O device connected to it. The configuration ROMs of all devices are accessible in the configuration address space, where they are read by the PCI initialization software whenever the system is powered up or reset. By reading the information in the configuration ROM, the software determines whether the device is a printer, a camera, an Ethernet interface, or a disk controller. It can further learn about various device options and characteristics.

Devices connected to the PCI bus are not assigned permanent addresses that are built into their I/O interface hardware. Instead, device addresses are assigned by software during the initial configuration process. This means that when power is turned on, devices cannot be accessed using their addresses in the usual way, as they have not yet been assigned any address. A different mechanism is used to select I/O devices at that time.


The PCI bus may have up to 21 connectors for I/O device interface cards to be plugged into. Each connector has a pin called Initialization Device Select (IDSEL#). This pin is connected to one of the upper 21 address/data lines, AD11 to AD31. A device interface responds to a configuration command if its IDSEL# input is asserted. The configuration software scans all 21 locations to identify where I/O device interfaces are present. For each location, it issues a configuration command using an address in which the AD line corresponding to that location is set to 1 and the remaining 20 lines are set to 0. If a device interface responds, it is assigned an address and that address is written into one of its registers designated for this purpose. Using the same addressing mechanism, the processor reads the device's configuration ROM and carries out any necessary initialization. It uses the low-order address bits, AD0 to AD10, to access locations within the configuration ROM. This automated process means that the user simply plugs in the interface board and turns on the power. The software does the rest.
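The one-hot configuration addresses used in this scan can be generated as follows (a sketch; the slot-to-AD-line mapping shown is the straightforward one implied by the text):

```python
# Sketch of the PCI configuration scan described above: for each of the
# 21 candidate locations, the software issues a configuration command
# whose address has exactly one of AD11..AD31 set to 1, which drives
# that slot's IDSEL# pin.

def config_scan_addresses():
    """One-hot configuration addresses for slots 0..20 (AD11..AD31)."""
    return [1 << (11 + slot) for slot in range(21)]

for slot, addr in enumerate(config_scan_addresses()):
    print(f"slot {slot:2d}: address {addr:#010x}")  # a single 1 among AD11..AD31
```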

The PCI bus has gained great popularity, particularly in the PC world. It is also used in many other computers, to benefit from the wide range of I/O devices for which a PCI interface is available. Both a 32-bit and a 64-bit configuration are available, using either a 33-MHz or 66-MHz clock. A high-performance variant known as PCI-X is also available. It is a 64-bit bus that runs at 133 MHz. Yet higher performance versions of PCI-X run at speeds up to 533 MHz.

7.5.4 SCSI Bus

The acronym SCSI stands for Small Computer System Interface [4]. It refers to a standard bus defined by the American National Standards Institute (ANSI). The SCSI bus may be used to connect a variety of devices to a computer. It is particularly well-suited for use with disk drives. It is often found in installations such as institutional databases or email systems where many disk drives are used.

In the original specifications of the SCSI standard, devices are connected to a computer via a 50-wire cable, which can be up to 25 meters in length and can transfer data at rates of up to 5 Megabytes/s. The standard has undergone many revisions, and its data transfer capability has increased rapidly. SCSI-2 and SCSI-3 have been defined, and each has several options. Data are transferred either 8 bits or 16 bits in parallel, using clock speeds of up to 80 MHz. There are also several options for the electrical signaling scheme used. The bus may use single-ended transmission, where each signal uses one wire, with a common ground return for all signals. In another option, differential signaling is used, with a pair of wires for each signal.

Data Transfer

Devices connected to the SCSI bus are not part of the address space of the processor in the same way as devices connected to the processor bus or to the PCI bus. A SCSI bus may be connected directly to the processor bus, or more likely to another standard I/O bus such as PCI, through a SCSI controller. Data and commands are transferred in the form of multi-byte messages called packets. To send commands or data to a device, the processor assembles the information in the memory then instructs the SCSI controller to transfer it to the device. Similarly, when data are read from a device, the controller transfers the data to the memory and then informs the processor by raising an interrupt.

To illustrate the operation of the SCSI bus, let us consider how it may be used with a disk drive. Communication with a disk drive differs substantially from communication with the main memory. Data are stored on a disk in blocks called sectors, where each sector may contain several hundred bytes. When a data file is written on a disk, it is not always stored in contiguous sectors. Some sectors may already contain previously stored information; others may be defective and must be skipped. Hence, a Read or Write request may result in accessing several disk sectors that are not necessarily contiguous. Because of the constraints of the mechanical motion of the disk, there is a long delay, on the order of several milliseconds, before reaching the first sector to or from which data are to be transferred. Then, a burst of data is transferred at high speed. Another delay may ensue to reach the next sector, followed by a burst of data. A single Read or Write request may involve several such bursts. The SCSI protocol is designed to facilitate this mode of operation.

Let us examine a complete Read operation as an example. The following is a simplified high-level description, ignoring details and signaling conventions. Assume that the processor wishes to read a block of data from a disk drive and that these data are stored in two disk sectors that are not contiguous. The processor sends a command to the SCSI controller, which causes the following sequence of events to take place:

1. The SCSI controller contends for control of the SCSI bus.

2. When it wins the arbitration process, the SCSI controller sends a command to the disk controller, specifying the required Read operation.

3. The disk controller cannot start to transfer data immediately. It must first move the read head of the disk to the required sector. Hence, it sends a message to the SCSI controller indicating that it will temporarily suspend the connection between them. The SCSI bus is now free to be used by other devices.

4. The disk controller sends a command to the disk drive to move the read head to the first sector involved in the requested Read operation. It reads the data stored in that sector and stores them in a data buffer. When it is ready to begin transferring data, it requests control of the bus. After it wins arbitration, it re-establishes the connection with the SCSI controller, sends the contents of the data buffer, then suspends the connection again.

5. The process is repeated to read and transfer the contents of the second disk sector.

6. The SCSI controller transfers the requested data to the main memory and sends an interrupt to the processor indicating that the data are now available.
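The six steps above can be sketched as an event trace (purely illustrative; a real SCSI stack involves detailed bus phases and signaling not modeled here):

```python
# The six-step SCSI Read sequence above, rendered as an event trace.
# Each sector transfer involves a fresh arbitration and a suspended
# connection while the disk seeks, exactly as the text describes.

def scsi_read(sectors):
    events = ["SCSI controller wins arbitration",
              "Read command sent to disk controller",
              "connection suspended (seek in progress)"]
    for n, data in enumerate(sectors, 1):
        events += [f"disk controller wins arbitration, sends sector {n}: {data}",
                   "connection suspended"]
    events.append("data in main memory; interrupt raised")
    return events

for event in scsi_read(["sector-A", "sector-B"]):
    print(event)
```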

This scenario shows that the messages exchanged over the SCSI bus are at a higher level than those exchanged over the processor bus. Messages refer to more complex operations that may require several steps to complete, depending on the device. Neither the processor nor the SCSI controller need be aware of the details of the disk's operation and how it moves from one sector to the next.

The SCSI bus standard defines a wide range of control messages that can be used to handle different types of I/O devices. Messages are also defined to deal with various error or failure conditions that might arise during device operation or data transfer.


7.5.5 SATA

In the early days of the personal computer, the bus of a popular IBM computer called AT, which was based on Intel's 80286 microprocessor bus, became an industry standard. It was named ISA, for Industry Standard Architecture. An enhanced version, including a definition of the basic software needed to support disk drives, was later named ATA, for AT Attachment bus. A serial version of the same architecture became known as SATA [5], which is now widely used as an interface for disks. Like all standards, several versions of SATA have been developed with added features and higher speeds. The original parallel version has been renamed PATA, but it is no longer used in new equipment.

The basic SATA connector has 7 pins, connecting two twisted pairs and three ground wires. Differential transmission is used, with data rates ranging from 1.5 to 6.0 Gigabits/s. Some of the recent versions provide an isochronous transmission feature to support audio and video devices.

7.5.6 SAS

This is a serial implementation of the SCSI bus, hence its name: Serially Attached SCSI [6]. It is primarily intended for connecting magnetic disks and CD and DVD drives. It uses serial, point-to-point links that are similar to SATA. A SAS link can transfer data in both directions simultaneously, at speeds up to 12 Gigabits/s. At the software level, SAS is fully compatible with SCSI.

7.5.7 PCI Express

The demands placed on I/O interconnections are ever increasing. Internet connections, sophisticated graphics devices, streaming video and high-definition television are examples of applications that involve data transfers at very high speed. The PCI Express interconnection standard (often called PCIe) [7] has been developed to meet these needs and to anticipate further increases in data transfer rates, which are inevitable as new applications are introduced.

PCI Express uses serial, point-to-point links interconnected via switches to form a tree structure, as shown in Figure 7.20. The root node of the tree, called the Root complex, is connected to the processor bus. The Root complex has a special port to connect the main memory. All other connections emanating from the Root complex are serial links to I/O devices. Some of these links may connect to a switch that leads to more serial branches, as shown in the figure. The switch may also connect to bridging interfaces that support other standards, such as PCI or USB. For example, one of the tree branches could be a PCI bus, to take advantage of the wide variety of devices for which PCI interfaces already exist.

The basic PCI Express link consists of two twisted pairs, one for each direction of transmission. Data are transmitted at the rate of 2.5 Gigabits/s over each twisted pair, using the differential signaling scheme described in Section 7.5.1. Data may be transmitted in both directions at the same time. Also, links to different devices may be carrying data at the same time, because there is no shared bus as in the case of PCI or SCSI. Furthermore, a link may use more than one twisted pair in each direction. The basic arrangement with one twisted pair for each direction is called a lane and referred to as a X1 (read as by 1) connection. A link may use 2, 4, 8, or 16 lanes, in which case it is called a X2, X4, X8, or X16 link.

Figure 7.20 PCI Express connections.

The receiver on a synchronous transmission link must synchronize its clock with that of the sender, as described in Section 7.4.2. To make this possible, the transmitted data are encoded to ensure that 0-to-1 and 1-to-0 transitions occur frequently enough. In the case of PCIe, each 8 bits of data are encoded using 10 bits. Other bits are inserted in the stream to perform various control functions, such as delineating address and data information. After accounting for the additional bits, a single twisted pair on which data are transmitted at 2.5 Gigabits/s actually delivers 1.6 Gigabits/s or 200 MBytes/s of useful information. A X16 link transfers data at the rate of 3.2 Gigabytes/s in each direction. By comparison, a 64-bit PCI bus operating at 64 MHz has a peak aggregate data transfer rate of 512 Megabytes/s. PCI Express has the additional advantage of using a small number of wires, resulting in lower-cost hardware.
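The throughput figures quoted above can be verified numerically (the split between 8b/10b coding overhead and the extra control bits is an assumption made explicit here):

```python
# Checking the PCIe throughput figures quoted in the text.
# 8b/10b coding alone would leave 2.0 Gigabits/s per pair; the quoted
# 1.6 Gigabits/s also accounts for the extra control bits inserted in
# the stream (the exact overhead split is assumed, not specified).

line_rate = 2.5e9                  # bits/s on one twisted pair
after_8b10b = line_rate * 8 / 10   # useful-coded bits/s after 8b/10b
useful = 1.6e9                     # bits/s of useful data, per the text

print(after_8b10b / 1e9, "Gigabits/s after 8b/10b alone")    # 2.0
print(useful / 8 / 1e6, "MBytes/s per lane")                 # 200.0
print(16 * useful / 8 / 1e9, "GBytes/s for a X16 link")      # 3.2
```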

The PCI Express protocols are fully compatible with those of PCI. For example, the same initial configuration procedures are used. Thus, a computer that uses PCI Express can use existing operating systems and applications software that were developed for a PCI-based system.


7.6 Concluding Remarks

This chapter introduced the I/O structure of a computer from a hardware point of view. I/O devices connected to a bus are used as examples to illustrate the synchronous and asynchronous schemes for transferring data.

The architecture of interconnection networks for input and output devices has been a major area of development, driven by an ever-increasing need for transferring data at high speed, for reduced cost, and for features that enhance user convenience such as plug-and-play. Several I/O standards are described briefly in this chapter, illustrating the approaches used to meet these objectives. The current trend is to move away from parallel buses to serial point-to-point links. Serial links have lower cost and can transfer data at high speed.

7.7 Solved Problems

This section presents some examples of the types of problems that a student may be asked to solve, and shows how such problems can be solved.

Example 7.1 Problem: The I/O bus of a computer uses the synchronous protocol shown in Figure 7.4. Maximum propagation delay on this bus is 4 ns. The bus master takes 1.5 ns to place an address on the address lines. Slave devices require 3 ns to decode the address and a maximum of 5 ns to place the requested data on the data lines. Input registers connected to the bus have a minimum setup time of 1 ns. Assume that the bus clock has a 50% duty cycle; that is, the high and low phases of the clock are of equal duration. What is the maximum clock frequency for this bus?

Solution: The minimum time for the high phase of the clock is the time for the address to arrive and be decoded by the slave, which is 1.5 + 4 + 3 = 8.5 ns. The minimum time for the low phase of the clock is the time for the slave to place data on the bus and for the master to load the data into a register, which is 5 + 4 + 1 = 10 ns. Then, the minimum clock period is 2 × 10 = 20 ns, and the maximum clock frequency is 50 MHz.
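The same budget, spelled out as a calculation:

```python
# The timing budget of Example 7.1, spelled out step by step.

prop_delay = 4      # ns, maximum bus propagation delay
addr_drive = 1.5    # ns, master places address on the bus
addr_decode = 3     # ns, slave decodes the address
data_drive = 5      # ns, slave places data on the bus
setup = 1           # ns, input register setup time

high_phase = addr_drive + prop_delay + addr_decode   # 8.5 ns
low_phase = data_drive + prop_delay + setup          # 10 ns

# With a 50% duty cycle, both phases must be as long as the longer one.
period = 2 * max(high_phase, low_phase)              # 20 ns
print(1000 / period, "MHz")                          # 50.0
```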

Example 7.2 Problem: An arbiter receives three request signals, R1, R2, R3, and generates three grant signals, G1, G2, G3. Request R1 has the highest priority and request R3 the lowest priority. An example of the operation of such an arbiter is given in Figure 7.9. Give a state diagram that describes the behavior of this arbiter.

Solution: A state diagram is given in Figure 7.21. The arbiter starts in the idle state, A. When one or more of the request signals is asserted, the arbiter moves to one of the three states, B, C, or D, depending on which of the active requests has the highest priority. When it enters the new state, it asserts the corresponding grant signal. The arbiter remains in that state until the device being served drops its request, at which time the arbiter returns to state A. Once it is back in state A, it will respond to any request that may be active at that time, or wait for a new request to be asserted.

Figure 7.21 State diagram for Example 7.2. (States A/000, B/100, C/010, and D/001 are labeled with the outputs G1, G2, G3; transitions are labeled with patterns on the inputs R1, R2, R3.)
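The state diagram can be sketched as a clocked transition function (a behavioral model only; signal values are 0/1 rather than waveforms):

```python
# Minimal sketch of the fixed-priority arbiter of Example 7.2.
# State A is idle; states B, C, D grant R1, R2, R3 respectively.

def arbiter_step(state, r1, r2, r3):
    """One clock step: return (next_state, (G1, G2, G3))."""
    if state == "A":                        # idle: grant highest-priority active request
        if r1:
            state = "B"
        elif r2:
            state = "C"
        elif r3:
            state = "D"
    elif (state == "B" and not r1) or \
         (state == "C" and not r2) or \
         (state == "D" and not r3):
        state = "A"                         # served device dropped its request
    grants = {"A": (0, 0, 0), "B": (1, 0, 0),
              "C": (0, 1, 0), "D": (0, 0, 1)}[state]
    return state, grants

state = "A"
state, g = arbiter_step(state, 0, 1, 1)     # R2 and R3 active: R2 wins
print(state, g)                             # C (0, 1, 0)
state, g = arbiter_step(state, 1, 1, 1)     # R1 must wait until R2 is served
print(state, g)                             # C (0, 1, 0)
state, g = arbiter_step(state, 1, 0, 1)     # R2 dropped: back to idle first
print(state, g)                             # A (0, 0, 0)
```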

Example 7.3 Problem: Design an output interface circuit for a synchronous bus that uses the protocol of Figure 7.4. When data are written into the data register of this interface, the interface sends a pulse with a width of one clock cycle on a line called New-data. This pulse lets the output device connected to the interface know that new data are available.

Solution: All events in a synchronous circuit are driven by a clock signal. A possible circuit for the interface is shown in Figure 7.22. The Write-data signal enables the data register, and data are loaded into it at the clock edge at the end of the clock cycle. At the same time, the New-data flip-flop is set to 1. The feedback connection from the Q output of the flip-flop clears the flip-flop to 0 on the following clock edge.

Example 7.4 Problem: Draw a state diagram for a finite-state machine (FSM) that represents the behavior of the handshake control circuit in Figure 7.14.


[Figure omitted: output interface circuit with an address decoder on lines A31 to A2 producing My-address, a data register (inputs D7 to D0, outputs Q7 to Q0) enabled by Write-data, a New-data flip-flop, and inputs R/W, Enable, and Clock.]

Figure 7.22 A synchronous output interface circuit for Example 7.3.

Solution: A state diagram is given in Figure 7.23. The circuit starts in state A, with the display device ready to receive new data. Thus, New-data = 0 and DOUT = 1. A Write operation causes Write-data to change to 1. This causes the state machine to move to state B, and its outputs change to 10. The machine stays in state B until Ready drops to 0, indicating that the display device has recognized that new data are available. When that happens, the machine moves to state C to wait for the display to become ready again. It must also wait for Write-data to return to zero, if it has not done so already.


[Figure omitted: state diagram with states A, B, and C; inputs Write-data and Ready; outputs New-data and DOUT.]

Figure 7.23 State diagram for Example 7.4.

Problems

7.1 [E] The input status bit in an interface circuit, which indicates that new data are available, is cleared as soon as the input data register is read. Why is this important?

7.2 [E] The address bus of a computer has 16 address lines, A15−0. If the hexadecimal address assigned to one device is 7CA4 and the address decoder for that device ignores lines A8 and A9, what are all the addresses to which this device will respond?

7.3 [M] A processor has seven interrupt-request lines, INTR1 to INTR7. Line INTR7 has the highest priority and INTR1 the lowest priority. Design a priority encoder circuit that generates a 3-bit code representing the request with the highest priority.

7.4 [M] Figures 7.4, 7.5, and 7.6 show three protocols for transferring data between a master and a slave. What happens in each case if the addressed device does not respond due to a malfunction during a Read operation? What problems would this cause and what remedies are possible?

7.5 [E] In the timing diagram in Figure 7.5, the processor maintains the address on the bus until it receives a response from the device. Is this necessary? What additions are needed on the device side if the processor sends an address for one cycle only?

7.6 [E] How is the timing diagram in Figure 7.6 affected as the distance between the processor and the I/O device increases? How is increased distance accommodated in the case of Figure 7.4?


7.7 [E] Consider a synchronous bus that operates according to the timing diagram in Figure 7.5. The bus and the interface circuitry connected to it have the following parameters:

Bus driver delay                  2 ns
Propagation delay on the bus      5 to 10 ns
Address decoder delay             6 ns
Time to fetch the requested data  0 to 25 ns
Setup time                        1.5 ns

(a) What is the maximum clock speed at which this bus can operate?

(b) How many clock cycles are needed to complete an input operation?

7.8 [M] Consider the asynchronous bus protocol shown in Figure 7.6. Using the same parameters as in Problem 7.7, what are the minimum and maximum times to complete one bus transfer? Allow 1 ns for bus skew.

7.9 [M] The asynchronous bus protocol in Figure 7.6 uses a full handshake, in which the master maintains an asserted signal on Master-ready until it receives Slave-ready, the slave keeps Slave-ready asserted until Master-ready becomes inactive, and so on. Consider an alternative protocol in which each of these signals is a pulse of a fixed width of 4 ns. Devices take action only on the rising edge of the pulse. Using the same parameters as in Problem 7.7, what are the minimum and maximum times to complete one bus transfer?

7.10 [M] In the arbiter protocol example depicted in Figure 7.9, the master that receives a bus grant maintains its request line in the asserted state until it is ready to relinquish bus mastership. Assume that a common line called Busy is available, which is asserted by the master that is currently using the bus. The arbiter grants the bus only when Busy is inactive. Once a master receives a grant, it asserts Busy and drops its request, and in response the arbiter drops the grant. The master deactivates Busy when it is finished using the bus. Draw a timing diagram equivalent to Figure 7.9 for this mode of operation.

7.11 [M] Modify the state diagram given in Example 7.2 for the mode of operation described in Problem 7.10.

7.12 [D] The arbiter of Example 7.2 controls access to a common resource. It does not allow preemption. This means that if a high-priority request is received after a lower-priority request has been granted, it must wait until service to the device that is currently using the common resource is completed. In some cases, it is desirable to allow preemption, to provide service to a high-priority device more quickly. Devices in such a system must be able to stop and relinquish the use of the common resource when asked to do so by the arbiter. This must be done in a safe manner. A device that is using the resource must be allowed to reach a safe point at which service can be terminated. It would then signal to the arbiter that it has stopped using the resource.

(a) Suggest a suitable modification to the signaling protocol that enables the service in progress to be terminated safely.

(b) Modify the state diagram of the arbiter to implement the revised protocol.


7.13 [E] An arbiter controls access to a common resource. It uses a rotating-priority scheme in responding to requests on lines R1 through R4. Initially, R1 has the highest priority and R4 the lowest priority. After a request on one of the lines receives service, that line drops to the lowest priority, and the next line in sequence becomes the highest-priority line. For example, after R2 has been serviced, the priority order, starting with the highest, becomes R3, R4, R1, R2. What will be the sequence of grants for the following sequence of requests: R3, R1, R4, R2? Assume that the last three requests arrive while the first one is being serviced.

7.14 [E] Consider an arbiter that uses the priority scheme described in Problem 7.13. What happens if one device requests service repeatedly? Compare the behavior of this arbiter to one that uses a fixed-priority scheme.

7.15 [E] Give the logic expression for an address decoder that recognizes the 16-bit hexadecimal address FA68.

7.16 [M] An industrial plant uses several sensors to monitor temperature, pressure, and other factors. Each sensor includes a switch that moves to the ON position when the corresponding parameter exceeds a preset limit. Eight such sensors need to be connected to the bus of a 16-bit computer. Design an appropriate interface to enable the state of all eight switches to be read simultaneously as a single byte. Assume the bus is synchronous and that it uses the timing sequence of Figure 7.4.

7.17 [E] The bus protocol of Figure 7.4 specifies that the slave device should send its data only in the second phase of the clock.

(a) It is possible that some device may recognize its address and be ready to send data sooner. Why is it not allowed to do so? Would the processor receive wrong data?

(b) Would any other problem arise?

7.18 [M] Data are stored in a small memory in an input interface connected to a synchronous bus that uses the protocol of Figure 7.5. Read and Write operations on the bus are indicated by a Command line called R/W. The speed of the memory is such that two clock cycles are required to read data from the memory. Design a circuit to generate the Slave-ready response of this interface.

7.19 [E] Each of the two signals DEVSEL# and TRDY# of the PCI protocol in Figure 7.19 represents a response from the target. How do the functions of these two signals differ?

7.20 [E] Consider the data transfer operation shown in Figure 7.19 for the PCI bus. How would this bus protocol handle a situation in which the target needs a delay of two clock cycles between words 2 and 3?

7.21 [E] Draw a timing diagram for transferring three words to an output device connected to the PCI bus.


References

1. Universal Serial Bus Specification, available at www.usb.org/developers.

2. IEEE Standard for a High-Performance Serial Bus, IEEE Std. 1394-2008, October 2008.

3. Specifications and other information about the PCI Local Bus and PCI Express are available at www.pcisig.com/developers.

4. SCSI-3 Architecture Model (SAM), ANSI Standard X3.270, 1996. This and other SCSI documents are available on the web at www.ansi.org.

5. SATA specifications and related material are available at www.serialata.org.

6. Information about the Serial Attached SCSI (SAS) standard is available at www.scsita.org.

7. A. Wilen, J. Schade, and R. Thornburg, Introduction to PCI Express: A Hardware and Software Developer's Guide, Intel Press, 2003.


Chapter 8: The Memory System

Chapter Objectives

In this chapter you will learn about:

• Basic memory circuits

• Organization of the main memory

• Memory technology

• Direct memory access as an I/O mechanism

• Cache memory, which reduces the effective memory access time

• Virtual memory, which increases the apparent size of the main memory

• Magnetic and optical disks used for secondary storage


Programs and the data they operate on are held in the memory of the computer. In this chapter, we discuss how this vital part of the computer operates. By now, the reader appreciates that the execution speed of programs is highly dependent on the speed with which instructions and data can be transferred between the processor and the memory. It is also important to have sufficient memory to facilitate execution of large programs having large amounts of data.

Ideally, the memory would be fast, large, and inexpensive. Unfortunately, it is impossible to meet all three of these requirements simultaneously. Increased speed and size are achieved at increased cost. Much work has gone into developing structures that improve the effective speed and size of the memory, yet keep the cost reasonable.

The memory of a computer comprises a hierarchy, including a cache, the main memory, and secondary storage, as Chapter 1 explains. In this chapter, we describe the most common components and organizations used to implement these units. Direct memory access is introduced as a mechanism to transfer data between an I/O device, such as a disk, and the main memory, with minimal involvement from the processor. We examine memory speed and discuss how access times to memory data can be reduced by means of caches. Next, we present the virtual memory concept, which makes use of the large storage capacity of secondary storage devices to increase the effective size of the memory. We start with a presentation of some basic concepts, to extend the discussion in Chapters 1 and 2.

8.1 Basic Concepts

The maximum size of the memory that can be used in any computer is determined by the addressing scheme. For example, a computer that generates 16-bit addresses is capable of addressing up to 2^16 = 64K (kilo) memory locations. Machines whose instructions generate 32-bit addresses can utilize a memory that contains up to 2^32 = 4G (giga) locations, whereas machines with 64-bit addresses can access up to 2^64 = 16E (exa) ≈ 16 × 10^18 locations. The number of locations represents the size of the address space of the computer.

The memory is usually designed to store and retrieve data in word-length quantities. Consider, for example, a byte-addressable computer whose instructions generate 32-bit addresses. When a 32-bit address is sent from the processor to the memory unit, the high-order 30 bits determine which word will be accessed. If a byte quantity is specified, the low-order 2 bits of the address specify which byte location is involved.
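For a byte-addressable machine with 32-bit addresses and 4-byte words, the split described above can be illustrated with a short sketch. The function name and the example address are ours, chosen for illustration.

```python
# Sketch: splitting a 32-bit byte address into a word address and a
# byte-within-word offset, for a machine with 4-byte words.
def split_address(addr):
    word = addr >> 2        # high-order 30 bits select the word
    byte = addr & 0b11      # low-order 2 bits select the byte in the word
    return word, byte

# Byte addresses 0x1000 through 0x1003 all fall in the same word.
print(split_address(0x1000))  # (1024, 0)
print(split_address(0x1003))  # (1024, 3)

# Address-space sizes quoted above:
assert 2**16 == 64 * 1024     # 64K locations with 16-bit addresses
assert 2**32 == 4 * 1024**3   # 4G locations with 32-bit addresses
```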

The connection between the processor and its memory consists of address, data, and control lines, as shown in Figure 8.1. The processor uses the address lines to specify the memory location involved in a data transfer operation, and uses the data lines to transfer the data. At the same time, the control lines carry the command indicating a Read or a Write operation and whether a byte or a word is to be transferred. The control lines also provide the necessary timing information and are used by the memory to indicate when it has completed the requested operation. When the processor-memory interface receives the memory's response, it asserts the MFC signal shown in Figure 5.19. This is the processor's internal control signal that indicates that the requested memory operation has been completed. When MFC is asserted, the processor proceeds to the next step in its execution sequence.


[Figure omitted: the processor connected to the memory through the processor-memory interface by a k-bit address bus, an n-bit data bus, and control lines (R/W, etc.); the memory has up to 2^k addressable locations with a word length of n bits.]

Figure 8.1 Connection of the memory to the processor.

A useful measure of the speed of memory units is the time that elapses between the initiation of an operation to transfer a word of data and the completion of that operation. This is referred to as the memory access time. Another important measure is the memory cycle time, which is the minimum time delay required between the initiation of two successive memory operations, for example, the time between two successive Read operations. The cycle time is usually slightly longer than the access time, depending on the implementation details of the memory unit.

A memory unit is called a random-access memory (RAM) if the access time to any location is the same, independent of the location's address. This distinguishes such memory units from serial, or partly serial, access storage devices such as magnetic and optical disks. Access time of the latter devices depends on the address or position of the data.

The technology for implementing computer memories uses semiconductor integrated circuits. The sections that follow present some basic facts about the internal structure and operation of such memories. We then discuss some of the techniques used to increase the effective speed and size of the memory.

Cache and Virtual Memory

The processor of a computer can usually process instructions and data faster than they can be fetched from the main memory. Hence, the memory access time is the bottleneck in the system. One way to reduce the memory access time is to use a cache memory. This is a small, fast memory inserted between the larger, slower main memory and the processor. It holds the currently active portions of a program and their data.

Virtual memory is another important concept related to memory organization. With this technique, only the active portions of a program are stored in the main memory, and the remainder is stored on the much larger secondary storage device. Sections of the program are transferred back and forth between the main memory and the secondary storage device


in a manner that is transparent to the application program. As a result, the application program sees a memory that is much larger than the computer's physical main memory.

Block Transfers

The discussion above shows that data move frequently between the main memory and the cache and between the main memory and the disk. These transfers do not occur one word at a time. Data are always transferred in contiguous blocks involving tens, hundreds, or thousands of words. Data transfers between the main memory and high-speed devices such as a graphic display or an Ethernet interface also involve large blocks of data. Hence, a critical parameter for the performance of the main memory is its ability to read or write blocks of data at high speed. This is an important consideration that we will encounter repeatedly as we discuss memory technology and the organization of the memory system.

8.2 Semiconductor RAM Memories

Semiconductor random-access memories (RAMs) are available in a wide range of speeds. Their cycle times range from 100 ns to less than 10 ns. In this section, we discuss the main characteristics of these memories. We start by introducing the way that memory cells are organized inside a chip.

8.2.1 Internal Organization of Memory Chips

Memory cells are usually organized in the form of an array, in which each cell is capable of storing one bit of information. A possible organization is illustrated in Figure 8.2. Each row of cells constitutes a memory word, and all cells of a row are connected to a common line referred to as the word line, which is driven by the address decoder on the chip. The cells in each column are connected to a Sense/Write circuit by two bit lines, and the Sense/Write circuits are connected to the data input/output lines of the chip. During a Read operation, these circuits sense, or read, the information stored in the cells selected by a word line and place this information on the output data lines. During a Write operation, the Sense/Write circuits receive input data and store them in the cells of the selected word.

Figure 8.2 is an example of a very small memory circuit consisting of 16 words of 8 bits each. This is referred to as a 16 × 8 organization. The data input and the data output of each Sense/Write circuit are connected to a single bidirectional data line that can be connected to the data lines of a computer. Two control lines, R/W and CS, are provided. The R/W (Read/Write) input specifies the required operation, and the CS (Chip Select) input selects a given chip in a multichip memory system.

The memory circuit in Figure 8.2 stores 128 bits and requires 14 external connections for address, data, and control lines. It also needs two lines for power supply and ground connections. Consider now a slightly larger memory circuit, one that has 1K (1024) memory cells. This circuit can be organized as a 128 × 8 memory, requiring a total of 19 external connections. Alternatively, the same number of cells can be organized into a 1K × 1 format. In this case, a 10-bit address is needed, but there is only one data line, resulting in 15 external


[Figure omitted: array of memory cells with address inputs A0 to A3 driving an address decoder, word lines W0 through W15, Sense/Write circuits on bit-line pairs b7/b7′ through b0/b0′ connected to the data input/output lines, and control inputs R/W and CS.]

Figure 8.2 Organization of bit cells in a memory chip.

connections. Figure 8.3 shows such an organization. The required 10-bit address is divided into two groups of 5 bits each to form the row and column addresses for the cell array. A row address selects a row of 32 cells, all of which are accessed in parallel. But only one of these cells is connected to the external data line, based on the column address.

Commercially available memory chips contain a much larger number of memory cells than the examples shown in Figures 8.2 and 8.3. We use small examples to make the figures easy to understand. Large chips have essentially the same organization as Figure 8.3, but use a larger memory cell array and have more external connections. For example, a 1G-bit chip may have a 256M × 4 organization, in which case a 28-bit address is needed and 4 bits are transferred to or from the chip.
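The connection counts quoted above can be tallied with a short sketch. The helper name and the explicit power/ground accounting are ours; the formula simply adds address, data, and the two control lines, plus the two supply pins mentioned in the text.

```python
# Sketch: external-connection counts for the memory organizations
# discussed above.  Counts address, data, R/W, and CS pins, plus two
# pins for power supply and ground.
from math import log2

def external_connections(words, bits_per_word):
    address = int(log2(words))     # address lines
    data = bits_per_word           # bidirectional data lines
    control = 2                    # R/W and CS
    return address + data + control + 2

print(external_connections(128, 8))    # 128 x 8: 19 connections
print(external_connections(1024, 1))   # 1K x 1: 15 connections

# The 1G-bit chip organized as 256M x 4 needs a 28-bit address:
assert int(log2(256 * 2**20)) == 28
```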

8.2.2 Static Memories

Memories that consist of circuits capable of retaining their state as long as power is applied are known as static memories. Figure 8.4 illustrates how a static RAM (SRAM) cell may be implemented. Two inverters are cross-connected to form a latch. The latch is connected to two bit lines by transistors T1 and T2. These transistors act as switches that can be opened or


[Figure omitted: 1K × 1 chip with a 10-bit address split into a 5-bit row address, decoded to drive word lines W0 through W31 of a 32 × 32 memory cell array, and a 5-bit column address controlling a 32-to-1 output multiplexer and input demultiplexer; Sense/Write circuitry, data input/output, R/W, and CS complete the diagram.]

Figure 8.3 Organization of a 1K × 1 memory chip.

[Figure omitted: SRAM cell with a cross-coupled latch at points X and Y, connected to bit lines b and b′ through transistors T1 and T2, which are controlled by the word line.]

Figure 8.4 A static RAM cell.


closed under control of the word line. When the word line is at ground level, the transistors are turned off and the latch retains its state. For example, if the logic value at point X is 1 and at point Y is 0, this state is maintained as long as the signal on the word line is at ground level. Assume that this state represents the value 1.

Read Operation

In order to read the state of the SRAM cell, the word line is activated to close switches T1 and T2. If the cell is in state 1, the signal on bit line b is high and the signal on bit line b′ is low. The opposite is true if the cell is in state 0. Thus, b and b′ are always complements of each other. The Sense/Write circuit at the end of the two bit lines monitors their state and sets the corresponding output accordingly.

Write Operation

During a Write operation, the Sense/Write circuit drives bit lines b and b′, instead of sensing their state. It places the appropriate value on bit line b and its complement on b′ and activates the word line. This forces the cell into the corresponding state, which the cell retains when the word line is deactivated.

CMOS Cell

A CMOS realization of the cell in Figure 8.4 is given in Figure 8.5. Transistor pairs (T3, T5) and (T4, T6) form the inverters in the latch (see Appendix A). The state of the cell is read or written as just explained. For example, in state 1, the voltage at point X is maintained high by having transistors T3 and T6 on, while T4 and T5 are off. If T1 and T2 are turned on, bit lines b and b′ will have high and low signals, respectively.

[Figure omitted: CMOS memory cell with transistors T1 through T6, latch points X and Y, bit lines b and b′, the word line, and Vsupply.]

Figure 8.5 An example of a CMOS memory cell.


Continuous power is needed for the cell to retain its state. If power is interrupted, the cell's contents are lost. When power is restored, the latch settles into a stable state, but not necessarily the same state the cell was in before the interruption. Hence, SRAMs are said to be volatile memories because their contents are lost when power is interrupted.

A major advantage of CMOS SRAMs is their very low power consumption, because current flows in the cell only when the cell is being accessed. Otherwise, T1, T2, and one transistor in each inverter are turned off, ensuring that there is no continuous electrical path between Vsupply and ground.

Static RAMs can be accessed very quickly. Access times on the order of a few nanoseconds are found in commercially available chips. SRAMs are used in applications where speed is of critical concern.

8.2.3 Dynamic RAMs

Static RAMs are fast, but their cells require several transistors. Less expensive and higher density RAMs can be implemented with simpler cells. But these simpler cells do not retain their state for a long period, unless they are accessed frequently for Read or Write operations. Memories that use such cells are called dynamic RAMs (DRAMs).

Information is stored in a dynamic memory cell in the form of a charge on a capacitor, but this charge can be maintained for only tens of milliseconds. Since the cell is required to store information for a much longer time, its contents must be periodically refreshed by restoring the capacitor charge to its full value. This occurs when the contents of the cell are read or when new information is written into it.

An example of a dynamic memory cell that consists of a capacitor, C, and a transistor, T, is shown in Figure 8.6. To store information in this cell, transistor T is turned on and an appropriate voltage is applied to the bit line. This causes a known amount of charge to be stored in the capacitor.

After the transistor is turned off, the charge remains stored in the capacitor, but not for long. The capacitor begins to discharge. This is because the transistor continues to

[Figure omitted: dynamic memory cell with transistor T connecting capacitor C to the bit line, gated by the word line.]

Figure 8.6 A single-transistor dynamic memory cell.


[Figure omitted: 32M × 8 DRAM chip with a cell array of 16,384 rows by 2,048 bytes, row and column address latches and decoders fed by multiplexed address lines A24−11/A10−0, Sense/Write circuits, data lines D7 to D0, and control inputs RAS, CAS, R/W, and CS.]

Figure 8.7 Internal organization of a 32M × 8 dynamic memory chip.

conduct a tiny amount of current, measured in picoamperes, after it is turned off. Hence, the information stored in the cell can be retrieved correctly only if it is read before the charge in the capacitor drops below some threshold value. During a Read operation, the transistor in a selected cell is turned on. A sense amplifier connected to the bit line detects whether the charge stored in the capacitor is above or below the threshold value. If the charge is above the threshold, the sense amplifier drives the bit line to the full voltage representing the logic value 1. As a result, the capacitor is recharged to the full charge corresponding to the logic value 1. If the sense amplifier detects that the charge in the capacitor is below the threshold value, it pulls the bit line to ground level to discharge the capacitor fully. Thus, reading the contents of a cell automatically refreshes its contents. Since the word line is common to all cells in a row, all cells in a selected row are read and refreshed at the same time.

A 256-Megabit DRAM chip, configured as 32M × 8, is shown in Figure 8.7. The cells are organized in the form of a 16K × 16K array. The 16,384 cells in each row are divided into 2,048 groups of 8, forming 2,048 bytes of data. Therefore, 14 address bits are needed to select a row, and another 11 bits are needed to specify a group of 8 bits in the selected row. In total, a 25-bit address is needed to access a byte in this memory. The high-order 14 bits and the low-order 11 bits of the address constitute the row and column addresses of a byte, respectively. To reduce the number of pins needed for external connections, the row and column addresses are multiplexed on 14 pins. During a Read or a Write operation, the row address is applied first. It is loaded into the row address latch in response to a signal pulse on an input control line called the Row Address Strobe (RAS). This causes a Read operation to be initiated, in which all cells in the selected row are read and refreshed.


Shortly after the row address is loaded, the column address is applied to the address pins and loaded into the column address latch under control of a second control line called the Column Address Strobe (CAS). The information in this latch is decoded and the appropriate group of 8 Sense/Write circuits is selected. If the R/W control signal indicates a Read operation, the output values of the selected circuits are transferred to the data lines, D7−0. For a Write operation, the information on the D7−0 lines is transferred to the selected circuits, and is then used to overwrite the contents of the selected cells in the corresponding 8 columns. We should note that in commercial DRAM chips, the RAS and CAS control signals are active when low. Hence, addresses are latched when these signals change from high to low. In timing diagrams, these signals are drawn with overbars to indicate that they are active low.
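The row/column split for this 32M × 8 chip can be sketched as follows. The function and constant names are ours; the bit widths come directly from the description above.

```python
# Sketch: row/column split of a 25-bit byte address for the 32M x 8
# DRAM of Figure 8.7 (14-bit row address, 11-bit column address).
ROW_BITS, COL_BITS = 14, 11

def split_dram_address(addr):
    assert 0 <= addr < 2**(ROW_BITS + COL_BITS)   # 25-bit address
    row = addr >> COL_BITS            # high-order 14 bits
    col = addr & (2**COL_BITS - 1)    # low-order 11 bits
    return row, col

assert 2**(ROW_BITS + COL_BITS) == 32 * 1024**2   # 32M byte locations

row, col = split_dram_address(0x1FFFFFF)          # highest address
print(row, col)   # 16383 2047: last row, last byte group in that row
```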

The timing of the operation of the DRAM described above is controlled by the RAS and CAS signals. These signals are generated by a memory controller circuit external to the chip when the processor issues a Read or a Write command. During a Read operation, the output data are transferred to the processor after a delay equivalent to the memory's access time. Such memories are referred to as asynchronous DRAMs. The memory controller is also responsible for refreshing the data stored in the memory chips, as we describe later.

Fast Page Mode

When the DRAM in Figure 8.7 is accessed, the contents of all 16,384 cells in the selected row are sensed, but only 8 bits are placed on the data lines, D7−0. This byte is selected by the column address, bits A10−0. A simple addition to the circuit makes it possible to access the other bytes in the same row without having to reselect the row. Each sense amplifier also acts as a latch. When a row address is applied, the contents of all cells in the selected row are loaded into the corresponding latches. Then, it is only necessary to apply different column addresses to place the different bytes on the data lines.

This arrangement leads to a very useful feature. All bytes in the selected row can be transferred in sequential order by applying a consecutive sequence of column addresses under the control of successive CAS signals. Thus, a block of data can be transferred at a much faster rate than can be achieved for transfers involving random addresses. The block transfer capability is referred to as the fast page mode feature. (A large block of data is often called a page.)

It was pointed out earlier that the vast majority of main memory transactions involve block transfers. The faster rate attainable in the fast page mode makes dynamic RAMs particularly well suited to this environment.

8.2.4 Synchronous DRAMs

In the early 1990s, developments in memory technology resulted in DRAMs whose operation is synchronized with a clock signal. Such memories are known as synchronous DRAMs (SDRAMs). Their structure is shown in Figure 8.8. The cell array is the same as in asynchronous DRAMs. The distinguishing feature of an SDRAM is the use of a clock signal, the availability of which makes it possible to incorporate control circuitry on the chip that provides many useful features. For example, SDRAMs have built-in refresh circuitry, with a refresh counter to provide the addresses of the rows to be selected for refreshing. As a result, the dynamic nature of these memory chips is almost invisible to the user.


[Figure 8.8 is a block diagram, not reproduced here: the cell array with its row decoder, a row/column address input feeding a row address latch and a column address counter, a column decoder with Read/Write circuits and latches, a refresh counter, data input and data output registers, and a mode register and timing control block, driven by the R/W, RAS, CAS, CS, and Clock inputs.]

Figure 8.8 Synchronous DRAM.

The address and data connections of an SDRAM may be buffered by means of registers, as shown in the figure. Internally, the Sense/Write amplifiers function as latches, as in asynchronous DRAMs. A Read operation causes the contents of all cells in the selected row to be loaded into these latches. The data in the latches of the selected column are transferred into the data register, thus becoming available on the data output pins. The buffer registers are useful when transferring large blocks of data at very high speed. By isolating external connections from the chip's internal circuitry, it becomes possible to start a new access operation while data are being transferred to or from the registers.

SDRAMs have several different modes of operation, which can be selected by writing control information into a mode register. For example, burst operations of different lengths can be specified. It is not necessary to provide externally-generated pulses on the CAS line to select successive columns. The necessary control signals are generated internally using a column counter and the clock signal. New data are placed on the data lines at the rising edge of each clock pulse.

Figure 8.9 shows a timing diagram for a typical burst read of length 4. First, the row address is latched under control of the RAS signal. The memory typically takes 5 or 6 clock


[Figure 8.9 is a timing diagram, not reproduced here: it shows the Clock, R/W, RAS, and CAS signals, the row and column addresses on the address lines, and the four data words D0, D1, D2, and D3 of the burst on the data lines.]

Figure 8.9 A burst read of length 4 in an SDRAM.

cycles (we use 2 in the figure for simplicity) to activate the selected row. Then, the column address is latched under control of the CAS signal. After a delay of one clock cycle, the first set of data bits is placed on the data lines. The SDRAM automatically increments the column address to access the next three sets of bits in the selected row, which are placed on the data lines in the next 3 clock cycles.

Synchronous DRAMs can deliver data at a very high rate, because all the control signals needed are generated inside the chip. The initial commercial SDRAMs in the 1990s were designed for clock speeds of up to 133 MHz. As technology evolved, much faster SDRAM chips were developed. Today's SDRAMs operate with clock speeds that can exceed 1 GHz.

Latency and Bandwidth

Data transfers to and from the main memory often involve blocks of data. The speed of these transfers has a large impact on the performance of a computer system. The memory access time defined earlier is not sufficient for describing the memory's performance when transferring blocks of data. During block transfers, memory latency is the amount of time it takes to transfer the first word of a block. The time required to transfer a complete block depends also on the rate at which successive words can be transferred and on the size of the block. The time between successive words of a block is much shorter than the time needed to transfer the first word. For instance, in the timing diagram in Figure 8.9, the access cycle begins with the assertion of the RAS signal. The first word of data is transferred five clock cycles later. Thus, the latency is five clock cycles. If the clock rate is 500 MHz, then the latency is 10 ns. The remaining three words are transferred in consecutive clock cycles, at the rate of one word every 2 ns.

The example above illustrates that we need a parameter other than memory latency to describe the memory's performance during block transfers. A useful performance measure is the number of bits or bytes that can be transferred in one second. This measure is often


referred to as the memory bandwidth. It depends on the speed of access to the stored data and on the number of bits that can be accessed in parallel. The rate at which data can be transferred to or from the memory depends on the bandwidth of the system interconnections. For this reason, the interconnections used always ensure that the bandwidth available for data transfers between the processor and the memory is very high.
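The numbers from the burst read of Figure 8.9 can be checked with a short calculation. The word size used for the bandwidth figure (32-bit words) is an assumption for illustration, since the timing diagram itself only shows four data transfers:

```python
# Latency and block-transfer time for the burst read of Figure 8.9.
clock_hz = 500e6                  # 500-MHz SDRAM clock
cycle_ns = 1e9 / clock_hz         # 2 ns per clock cycle

latency_cycles = 5                # RAS assertion to first data word
latency_ns = latency_cycles * cycle_ns

block_words = 4                   # burst length 4
block_ns = latency_ns + (block_words - 1) * cycle_ns

print(latency_ns)                 # 10.0 ns, as in the text
print(block_ns)                   # 16.0 ns for the complete block

# Effective bandwidth, assuming 32-bit (4-byte) words:
bytes_per_s = block_words * 4 / (block_ns * 1e-9)
```

For this example the effective bandwidth works out to about 1 Gbyte/s; a longer burst would amortize the 10-ns latency over more words and raise the figure.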

Double-Data-Rate SDRAM

In the continuous quest for improved performance, faster versions of SDRAMs have been developed. In addition to faster circuits, new organizational and operational features make it possible to achieve high data rates during block transfers. The key idea is to take advantage of the fact that a large number of bits are accessed at the same time inside the chip when a row address is applied. Various techniques are used to transfer these bits quickly to the pins of the chip. To make the best use of the available clock speed, data are transferred externally on both the rising and falling edges of the clock. For this reason, memories that use this technique are called double-data-rate SDRAMs (DDR SDRAMs).

Several versions of DDR chips have been developed. The earliest version is known as DDR. Later versions, called DDR2, DDR3, and DDR4, have enhanced capabilities. They offer increased storage capacity, lower power, and faster clock speeds. For example, DDR2 and DDR3 can operate at clock frequencies of 400 and 800 MHz, respectively. Therefore, they transfer data using the effective clock speeds of 800 and 1600 MHz, respectively.

Rambus Memory

The rate of transferring data between the memory and the processor is a function of both the bandwidth of the memory and the bandwidth of its connection to the processor. Rambus is a memory technology that achieves a high data transfer rate by providing a high-speed interface between the memory and the processor. One way of increasing the bandwidth of this connection is to use a wider data path. However, this requires more space and more pins, increasing system cost. The alternative is to use fewer wires with a higher clock speed. This is the approach taken by Rambus.

The key feature of Rambus technology is the use of a differential-signaling technique to transfer data to and from the memory chips. The basic idea of differential signaling is described in Section 7.5.1. In Rambus technology, signals are transmitted using small voltage swings of 0.1 V above and below a reference value. Several versions of this standard have been developed, with clock speeds of up to 800 MHz and data transfer rates of several gigabytes per second.

Rambus technology competes directly with the DDR SDRAM technology. Each has certain advantages and disadvantages. A nontechnical consideration is that the specification of DDR SDRAM is an open standard that can be used free of charge. Rambus, on the other hand, is a proprietary scheme that must be licensed by chip manufacturers.

8.2.5 Structure of Larger Memories

We have discussed the basic organization of memory circuits as they may be implemented on a single chip. Next, we examine how memory chips may be connected to form a much larger memory.


Static Memory Systems

Consider a memory consisting of 2M words of 32 bits each. Figure 8.10 shows how this memory can be implemented using 512K × 8 static memory chips. Each column in the figure implements one byte position in a word, with four chips providing 2M bytes. Four columns implement the required 2M × 32 memory. Each chip has a control input called

[Figure 8.10 is a block diagram, not reproduced here: four rows of four 512K × 8 memory chips. The two high-order bits of the 21-bit address (A20 and A19) drive a 2-bit decoder that generates the Chip-select signals for the four rows; the remaining 19 bits (A18−0) form the 19-bit internal chip address applied to every chip. The four columns of chips drive the 8-bit data input/output lines D31−24, D23−16, D15−8, and D7−0.]

Figure 8.10 Organization of a 2M × 32 memory module using 512K × 8 static memory chips.


Chip-select. When this input is set to 1, it enables the chip to accept data from or to place data on its data lines. The data output for each chip is of the tri-state type described in Section 7.2.3. Only the selected chip places data on the data output line, while all other outputs are electrically disconnected from the data lines. Twenty-one address bits are needed to select a 32-bit word in this memory. The high-order two bits of the address are decoded to determine which of the four rows should be selected. The remaining 19 address bits are used to access specific byte locations inside each chip in the selected row. The R/W inputs of all chips are tied together to provide a common Read/Write control line (not shown in the figure).
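The address decoding just described can be sketched in a few lines. The function name and the tuple it returns are illustrative, not part of the design in Figure 8.10:

```python
ROW_SELECT_BITS = 2    # high-order bits A20-A19 choose one of the four chip rows
CHIP_ADDR_BITS = 19    # low-order bits address a byte within each 512K x 8 chip

def decode_address(addr):
    """Split a 21-bit word address for the 2M x 32 memory of Figure 8.10."""
    assert 0 <= addr < (1 << (ROW_SELECT_BITS + CHIP_ADDR_BITS))
    chip_row = addr >> CHIP_ADDR_BITS              # drives the 2-bit decoder
    internal = addr & ((1 << CHIP_ADDR_BITS) - 1)  # 19-bit internal chip address
    return chip_row, internal
```

For example, an address whose two high-order bits are 11 selects the bottom row of chips, and its remaining 19 bits select the same byte location inside all four chips of that row.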

Dynamic Memory Systems

Modern computers use very large memories. Even a small personal computer is likely to have at least 1G bytes of memory. Typical desktop computers may have 4G bytes or more of memory. A large memory leads to better performance, because more of the programs and data used in processing can be held in the memory, thus reducing the frequency of access to secondary storage.

Because of their high bit density and low cost, dynamic RAMs, mostly of the synchronous type, are widely used in the memory units of computers. They are slower than static RAMs, but they use less power and have considerably lower cost per bit. Available chips have capacities as high as 2G bits, and even larger chips are being developed. To reduce the number of memory chips needed in a given computer, a memory chip may be organized to read or write a number of bits in parallel, as in the case of Figure 8.7. Chips are manufactured in different organizations, to provide flexibility in designing memory systems. For example, a 1-Gbit chip may be organized as 256M × 4, or 128M × 8.

Packaging considerations have led to the development of assemblies known as memory modules. Each such module houses many memory chips, typically in the range 16 to 32, on a small board that plugs into a socket on the computer's motherboard. Memory modules are commonly called SIMMs (Single In-line Memory Modules) or DIMMs (Dual In-line Memory Modules), depending on the configuration of the pins. Modules of different sizes are designed to use the same socket. For example, 128M × 64, 256M × 64, and 512M × 64 bit DIMMs all use the same 240-pin socket. Thus, total memory capacity is easily expanded by replacing a smaller module with a larger one, using the same socket.

Memory Controller

The address applied to dynamic RAM chips is divided into two parts, as explained earlier. The high-order address bits, which select a row in the cell array, are provided first and latched into the memory chip under control of the RAS signal. Then, the low-order address bits, which select a column, are provided on the same address pins and latched under control of the CAS signal. Since a typical processor issues all bits of an address at the same time, a multiplexer is required. This function is usually performed by a memory controller circuit. The controller accepts a complete address and the R/W signal from the processor, under control of a Request signal which indicates that a memory access operation is needed. It forwards the R/W signals and the row and column portions of the address to the memory and generates the RAS and CAS signals, with the appropriate timing. When a memory includes multiple modules, one of these modules is selected based on the high-order bits


of the address. The memory controller decodes these high-order bits and generates the chip-select signal for the appropriate module. Data lines are connected directly between the processor and the memory.
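The controller's address multiplexing can be sketched as below. The bit widths are illustrative assumptions (an 11-bit column address matches A10−0 in Figure 8.7; the 13-bit row width is invented for the example), and a real controller must also meet detailed timing constraints that this sketch ignores:

```python
ROW_BITS = 13   # assumed width of the row address
COL_BITS = 11   # column address width, A10-0 as in Figure 8.7

def issue_access(addr):
    """Yield the (strobe, value) pairs a memory controller places on the
    multiplexed address pins: row bits under RAS, then column bits under CAS."""
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    col = addr & ((1 << COL_BITS) - 1)
    yield ("RAS", row)   # high-order bits first, latched by RAS
    yield ("CAS", col)   # then low-order bits, latched by CAS

# Example: the two halves of address 0x1ABC appear in sequence on the same pins.
steps = list(issue_access(0x1ABC))
```

Here `steps` is `[("RAS", 3), ("CAS", 0x2BC)]`: the same physical address pins carry the row number first and the column number second, which is exactly why the processor's single full-width address must pass through the controller's multiplexer.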

Dynamic RAMs must be refreshed periodically. The circuitry required to initiate refresh cycles is included as part of the internal control circuitry of synchronous DRAMs. However, a control circuit external to the chip is needed to initiate periodic Read cycles to refresh the cells of an asynchronous DRAM. The memory controller provides this capability.

Refresh Overhead

A dynamic RAM cannot respond to read or write requests while an internal refresh operation is taking place. Such requests are delayed until the refresh cycle is completed. However, the time lost to accommodate refresh operations is very small. For example, consider an SDRAM in which each row needs to be refreshed once every 64 ms. Suppose that the minimum time between two row accesses is 50 ns and that refresh operations are arranged such that all rows of the chip are refreshed in 8K (8192) refresh cycles. Thus, it takes 8192 × 0.050 = 0.41 ms to refresh all rows. The refresh overhead is 0.41/64 = 0.0064, which is less than 1 percent of the total time available for accessing the memory.
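The overhead figure in this example follows directly from the stated parameters:

```python
refresh_interval_ms = 64   # each row must be refreshed once every 64 ms
rows = 8192                # 8K refresh cycles cover the whole chip
row_time_ns = 50           # minimum time between two row accesses

total_refresh_ms = rows * row_time_ns * 1e-6   # 0.4096 ms, about 0.41 ms
overhead = total_refresh_ms / refresh_interval_ms

print(round(overhead, 4))  # 0.0064, i.e. less than 1 percent
```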

Choice of Technology

The choice of a RAM chip for a given application depends on several factors. Foremost among these are the cost, speed, power dissipation, and size of the chip.

Static RAMs are characterized by their very fast operation. However, their cost and bit density are adversely affected by the complexity of the circuit that realizes the basic cell. They are used mostly where a small but very fast memory is needed. Dynamic RAMs, on the other hand, have high bit densities and a low cost per bit. Synchronous DRAMs are the predominant choice for implementing the main memory.

8.3 Read-only Memories

Both static and dynamic RAM chips are volatile, which means that they retain information only while power is turned on. There are many applications requiring memory devices that retain the stored information when power is turned off. For example, Chapter 4 describes the need to store a small program in such a memory, to be used to start the bootstrap process of loading the operating system from a hard disk into the main memory. The embedded applications described in Chapters 10 and 11 are another important example. Many embedded applications do not use a hard disk and require nonvolatile memories to store their software.

Different types of nonvolatile memories have been developed. Generally, their contents can be read in the same way as for their volatile counterparts discussed above. But, a special writing process is needed to place the information into a nonvolatile memory. Since its normal operation involves only reading the stored data, a memory of this type is called a read-only memory (ROM).


[Figure 8.11 is a circuit diagram, not reproduced here: a single transistor T connects the bit line to ground at point P when the word line is activated. The connection at P is present to store a 0 and not connected to store a 1.]

Figure 8.11 A ROM cell.

8.3.1 ROM

A memory is called a read-only memory, or ROM, when information can be written into it only once at the time of manufacture. Figure 8.11 shows a possible configuration for a ROM cell. A logic value 0 is stored in the cell if the transistor is connected to ground at point P; otherwise, a 1 is stored. The bit line is connected through a resistor to the power supply. To read the state of the cell, the word line is activated to close the transistor switch. As a result, the voltage on the bit line drops to near zero if there is a connection between the transistor and ground. If there is no connection to ground, the bit line remains at the high voltage level, indicating a 1. A sense circuit at the end of the bit line generates the proper output value. The state of the connection to ground in each cell is determined when the chip is manufactured, using a mask with a pattern that represents the information to be stored.

8.3.2 PROM

Some ROM designs allow the data to be loaded by the user, thus providing a programmable ROM (PROM). Programmability is achieved by inserting a fuse at point P in Figure 8.11. Before it is programmed, the memory contains all 0s. The user can insert 1s at the required locations by burning out the fuses at these locations using high-current pulses. Of course, this process is irreversible.

PROMs provide flexibility and convenience not available with ROMs. The cost of preparing the masks needed for storing a particular information pattern makes ROMs cost-effective only in large volumes. The alternative technology of PROMs provides a more convenient and considerably less expensive approach, because memory chips can be programmed directly by the user.


8.3.3 EPROM

Another type of ROM chip provides an even higher level of convenience. It allows the stored data to be erased and new data to be written into it. Such an erasable, reprogrammable ROM is usually called an EPROM. It provides considerable flexibility during the development phase of digital systems. Since EPROMs are capable of retaining stored information for a long time, they can be used in place of ROMs or PROMs while software is being developed. In this way, memory changes and updates can be easily made.

An EPROM cell has a structure similar to the ROM cell in Figure 8.11. However, the connection to ground at point P is made through a special transistor. The transistor is normally turned off, creating an open switch. It can be turned on by injecting charge into it that becomes trapped inside. Thus, an EPROM cell can be used to construct a memory in the same way as the previously discussed ROM cell. Erasure requires dissipating the charge trapped in the transistors that form the memory cells. This can be done by exposing the chip to ultraviolet light, which erases the entire contents of the chip. To make this possible, EPROM chips are mounted in packages that have transparent windows.

8.3.4 EEPROM

An EPROM must be physically removed from the circuit for reprogramming. Also, the stored information cannot be erased selectively. The entire contents of the chip are erased when exposed to ultraviolet light. Another type of erasable PROM can be programmed, erased, and reprogrammed electrically. Such a chip is called an electrically erasable PROM, or EEPROM. It does not have to be removed for erasure. Moreover, it is possible to erase the cell contents selectively. One disadvantage of EEPROMs is that different voltages are needed for erasing, writing, and reading the stored data, which increases circuit complexity. However, this disadvantage is outweighed by the many advantages of EEPROMs. They have replaced EPROMs in practice.

8.3.5 Flash Memory

An approach similar to EEPROM technology has given rise to flash memory devices. A flash cell is based on a single transistor controlled by trapped charge, much like an EEPROM cell. Also like an EEPROM, it is possible to read the contents of a single cell. The key difference is that, in a flash device, it is only possible to write an entire block of cells. Prior to writing, the previous contents of the block are erased. Flash devices have greater density, which leads to higher capacity and a lower cost per bit. They require a single power supply voltage, and consume less power in their operation.

The low power consumption of flash memories makes them attractive for use in portable, battery-powered equipment. Typical applications include hand-held computers, cell phones, digital cameras, and MP3 music players. In hand-held computers and cell phones, a flash memory holds the software needed to operate the equipment, thus obviating the need for a disk drive. A flash memory is used in digital cameras to store picture data. In MP3 players, flash memories store the data that represent sound. Cell phones, digital


cameras, and MP3 players are good examples of embedded systems, which are discussed in Chapters 10 and 11.

Single flash chips may not provide sufficient storage capacity for the applications mentioned above. Larger memory modules consisting of a number of chips are used where needed. There are two popular choices for the implementation of such modules: flash cards and flash drives.

Flash Cards

One way of constructing a larger module is to mount flash chips on a small card. Such flash cards have a standard interface that makes them usable in a variety of products. A card is simply plugged into a conveniently accessible slot. Flash cards with a USB interface are widely used and are commonly known as memory keys. They come in a variety of memory sizes. Larger cards may hold as much as 32 Gbytes. A minute of music can be stored in about 1 Mbyte of memory, using the MP3 encoding format. Hence, a 32-Gbyte flash card can store approximately 500 hours of music.
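The storage estimate works out as follows (taking 1 Gbyte as 1024 Mbytes for the illustration):

```python
card_mbytes = 32 * 1024   # 32-Gbyte card
mbytes_per_minute = 1     # about 1 Mbyte per minute of MP3 audio

hours = card_mbytes / mbytes_per_minute / 60
print(round(hours))       # about 546 hours, i.e. roughly 500 hours
```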

Flash Drives

Larger flash memory modules have been developed to replace hard disk drives, and hence are called flash drives. They are designed to fully emulate hard disks, to the point that they can be fitted into standard disk drive bays. However, the storage capacity of flash drives is significantly lower. Currently, the capacity of flash drives is on the order of 64 to 128 Gbytes. In contrast, hard disks have capacities exceeding a terabyte. Also, disk drives have a very low cost per bit.

The fact that flash drives are solid state electronic devices with no moving parts provides important advantages over disk drives. They have shorter access times, which result in a faster response. They are insensitive to vibration and they have lower power consumption, which makes them attractive for portable, battery-driven applications.

8.4 Direct Memory Access

Blocks of data are often transferred between the main memory and I/O devices such as disks. This section discusses a technique for controlling such transfers without frequent, program-controlled intervention by the processor.

The discussion in Chapter 3 concentrates on single-word or single-byte data transfers between the processor and I/O devices. Data are transferred from an I/O device to the memory by first reading them from the I/O device using an instruction such as

Load R2, DATAIN

which loads the data into a processor register. Then, the data read are stored into a memory location. The reverse process takes place for transferring data from the memory to an I/O device. An instruction to transfer input or output data is executed only after the processor determines that the I/O device is ready, either by polling its status register or by waiting for an interrupt request. In either case, considerable overhead is incurred, because several program instructions must be executed involving many memory accesses for each data word


transferred. When transferring a block of data, instructions are needed to increment the memory address and keep track of the word count. The use of interrupts involves operating system routines which incur additional overhead to save and restore processor registers, the program counter, and other state information.

An alternative approach is used to transfer blocks of data directly between the main memory and I/O devices, such as disks. A special control unit is provided to manage the transfer, without continuous intervention by the processor. This approach is called direct memory access, or DMA. The unit that controls DMA transfers is referred to as a DMA controller. It may be part of the I/O device interface, or it may be a separate unit shared by a number of I/O devices. The DMA controller performs the functions that would normally be carried out by the processor when accessing the main memory. For each word transferred, it provides the memory address and generates all the control signals needed. It increments the memory address for successive words and keeps track of the number of transfers.

Although a DMA controller transfers data without intervention by the processor, its operation must be under the control of a program executed by the processor, usually an operating system routine. To initiate the transfer of a block of words, the processor sends to the DMA controller the starting address, the number of words in the block, and the direction of the transfer. The DMA controller then proceeds to perform the requested operation. When the entire block has been transferred, it informs the processor by raising an interrupt.

Figure 8.12 shows an example of the DMA controller registers that are accessed by the processor to initiate data transfer operations. Two registers are used for storing the starting address and the word count. The third register contains status and control flags. The R/W bit determines the direction of the transfer. When this bit is set to 1 by a program instruction, the controller performs a Read operation, that is, it transfers data from the memory to the I/O device. Otherwise, it performs a Write operation. Additional information is also transferred as may be required by the I/O device. For example, in the case of a disk, the processor provides the disk controller with information to identify where the data is located on the disk (see Section 8.10.1 for disk details).

[Figure 8.12 is a register diagram, not reproduced here: three 32-bit registers holding the starting address, the word count, and the status and control flags. In the status and control register, the Done flag occupies bit 31, IE occupies bit 30, IRQ occupies bit 1, and R/W occupies bit 0.]

Figure 8.12 Typical registers in a DMA controller.


When the controller has completed transferring a block of data and is ready to receive another command, it sets the Done flag to 1. Bit 30 is the Interrupt-enable flag, IE. When this flag is set to 1, it causes the controller to raise an interrupt after it has completed transferring a block of data. Finally, the controller sets the IRQ bit to 1 when it has requested an interrupt.
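The flag behavior described above can be modeled with a few bit masks. The function name is invented for this sketch; only the bit positions (Done in bit 31, IE in bit 30, IRQ in bit 1, R/W in bit 0) come from Figure 8.12:

```python
DONE = 1 << 31   # set by the controller when a block transfer finishes
IE   = 1 << 30   # interrupt-enable flag, set by software
IRQ  = 1 << 1    # set by the controller when it requests an interrupt
R_W  = 1 << 0    # 1 = Read (memory to I/O device), 0 = Write

def complete_transfer(status):
    """Update the status and control register at the end of a block transfer."""
    status |= DONE          # ready to receive another command
    if status & IE:
        status |= IRQ       # raise an interrupt only if interrupts are enabled
    return status

with_interrupts = complete_transfer(IE | R_W)   # Done and IRQ both set
without_interrupts = complete_transfer(R_W)     # Done set, no interrupt
```

This mirrors the protocol in the text: software enables interrupts by setting IE before starting a transfer, and the controller reports completion through Done and, if enabled, IRQ.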

Figure 8.13 shows how DMA controllers may be used in a computer system such as that in Figure 7.18. One DMA controller connects a high-speed Ethernet to the computer's I/O bus (a PCI bus in the case of Figure 7.18). The disk controller, which controls two disks, also has DMA capability and provides two DMA channels. It can perform two independent DMA operations, as if each disk had its own DMA controller. The registers needed to store the memory address, the word count, and so on, are duplicated, so that one set can be used with each disk.

To start a DMA transfer of a block of data from the main memory to one of the disks, an OS routine writes the address and word count information into the registers of the disk controller. The DMA controller proceeds independently to implement the specified operation. When the transfer is completed, this fact is recorded in the status and control register of the DMA channel by setting the Done bit. At the same time, if the IE bit is set, the controller sends an interrupt request to the processor and sets the IRQ bit. The status register may also be used to record other information, such as whether the transfer took place correctly or errors occurred.

[Figure 8.13 is a block diagram, not reproduced here: the processor and main memory are connected through a bridge to a PCI bus. Attached to the bus are an Ethernet interface with its own DMA controller and a disk/DMA controller serving two disks.]

Figure 8.13 Use of DMA controllers in a computer system.


8.5 Memory Hierarchy

We have already stated that an ideal memory would be fast, large, and inexpensive. From the discussion in Section 8.2, it is clear that a very fast memory can be implemented using static RAM chips. But, these chips are not suitable for implementing large memories, because their basic cells are larger and consume more power than dynamic RAM cells.

Although dynamic memory units with gigabyte capacities can be implemented at a reasonable cost, the affordable size is still small compared to the demands of large programs with voluminous data. A solution is provided by using secondary storage, mainly magnetic disks, to provide the required memory space. Disks are available at a reasonable cost, and they are used extensively in computer systems. However, they are much slower than semiconductor memory units. In summary, a very large amount of cost-effective storage can be provided by magnetic disks, and a large and considerably faster, yet affordable, main memory can be built with dynamic RAM technology. This leaves the more expensive and much faster static RAM technology to be used in smaller units where speed is of the essence, such as in cache memories.

All of these different types of memory units are employed effectively in a computer system. The entire computer memory can be viewed as the hierarchy depicted in Figure 8.14. The fastest access is to data held in processor registers. Therefore, if we consider the

[Figure 8.14 is a diagram, not reproduced here: the hierarchy runs from the processor registers, through the primary (L1) and secondary (L2) caches, to the main memory and then to magnetic-disk secondary memory. Moving down the hierarchy, size increases; moving up, speed and cost per bit increase.]

Figure 8.14 Memory hierarchy.


registers to be part of the memory hierarchy, then the processor registers are at the top in terms of speed of access. Of course, the registers provide only a minuscule portion of the required memory.

At the next level of the hierarchy is a relatively small amount of memory that can be implemented directly on the processor chip. This memory, called a processor cache, holds copies of the instructions and data stored in a much larger memory that is provided externally. The cache memory concept was introduced in Section 1.2.2 and is examined in detail in Section 8.6. There are often two or more levels of cache. A primary cache is always located on the processor chip. This cache is small and its access time is comparable to that of processor registers. The primary cache is referred to as the level 1 (L1) cache. A larger, and hence somewhat slower, secondary cache is placed between the primary cache and the rest of the memory. It is referred to as the level 2 (L2) cache. Often, the L2 cache is also housed on the processor chip.

Some computers have a level 3 (L3) cache of even larger size, in addition to the L1 and L2 caches. An L3 cache, also implemented in SRAM technology, may or may not be on the same chip with the processor and the L1 and L2 caches.

The next level in the hierarchy is the main memory. This is a large memory implemented using dynamic memory components, typically assembled in memory modules such as DIMMs, as described in Section 8.2.5. The main memory is much larger but significantly slower than cache memories. In a computer with a processor clock of 2 GHz or higher, the access time for the main memory can be as much as 100 times longer than the access time for the L1 cache.

Disk devices provide a very large amount of inexpensive memory, and they are widely used as secondary storage in computer systems. They are very slow compared to the main memory. They represent the bottom level in the memory hierarchy.

During program execution, the speed of memory access is of utmost importance. The key to managing the operation of the hierarchical memory system in Figure 8.14 is to bring the instructions and data that are about to be used as close to the processor as possible. This is the main purpose of using cache memories, which we discuss next.

8.6 Cache Memories

The cache is a small and very fast memory, interposed between the processor and the main memory. Its purpose is to make the main memory appear to the processor to be much faster than it actually is. The effectiveness of this approach is based on a property of computer programs called locality of reference. Analysis of programs shows that most of their execution time is spent in routines in which many instructions are executed repeatedly. These instructions may constitute a simple loop, nested loops, or a few procedures that repeatedly call each other. The actual detailed pattern of instruction sequencing is not important—the point is that many instructions in localized areas of the program are executed repeatedly during some time period. This behavior manifests itself in two ways: temporal and spatial. The first means that a recently executed instruction is likely to be executed again very soon. The spatial aspect means that instructions close to a recently executed instruction are also likely to be executed soon.

November 29, 2010 11:59 ham_338065_ch08 Sheet number 24 Page number 290 cyan black

290 C H A P T E R 8 • The Memory System

Figure 8.15 Use of a cache memory. (The processor is connected to the main memory through the cache.)

Conceptually, operation of a cache memory is very simple. The memory control circuitry is designed to take advantage of the property of locality of reference. Temporal locality suggests that whenever an information item, instruction or data, is first needed, this item should be brought into the cache, because it is likely to be needed again soon. Spatial locality suggests that instead of fetching just one item from the main memory to the cache, it is useful to fetch several items that are located at adjacent addresses as well. The term cache block refers to a set of contiguous address locations of some size. Another term that is often used to refer to a cache block is a cache line.

Consider the arrangement in Figure 8.15. When the processor issues a Read request, the contents of a block of memory words containing the location specified are transferred into the cache. Subsequently, when the program references any of the locations in this block, the desired contents are read directly from the cache. Usually, the cache memory can store a reasonable number of blocks at any given time, but this number is small compared to the total number of blocks in the main memory. The correspondence between the main memory blocks and those in the cache is specified by a mapping function. When the cache is full and a memory word (instruction or data) that is not in the cache is referenced, the cache control hardware must decide which block should be removed to create space for the new block that contains the referenced word. The collection of rules for making this decision constitutes the cache’s replacement algorithm.

Cache Hits

The processor does not need to know explicitly about the existence of the cache. It simply issues Read and Write requests using addresses that refer to locations in the memory. The cache control circuitry determines whether the requested word currently exists in the cache. If it does, the Read or Write operation is performed on the appropriate cache location. In this case, a read or write hit is said to have occurred. The main memory is not involved when there is a cache hit in a Read operation. For a Write operation, the system can proceed in one of two ways. In the first technique, called the write-through protocol, both the cache location and the main memory location are updated. The second technique is to update only the cache location and to mark the block containing it with an associated flag bit, often called the dirty or modified bit. The main memory location of the word is updated later, when the block containing this marked word is removed from the cache to make room for a new block. This technique is known as the write-back, or copy-back, protocol.
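The two write policies can be sketched in code. The following is a simplified illustrative model, not the text's own implementation: class and method names are our own, blocks are single words, and tags and replacement are ignored.

```python
# Simplified sketch of the two cache write policies (hit case).
# All names here are illustrative, not from the text.

class WriteThroughCache:
    """Write-through: on a write, update the main memory as well as
    any cached copy of the word."""
    def __init__(self, memory):
        self.memory = memory          # backing store: address -> value
        self.lines = {}               # cached words: address -> value

    def write(self, addr, value):
        if addr in self.lines:        # write hit: update the cached copy
            self.lines[addr] = value
        self.memory[addr] = value     # memory is always updated


class WriteBackCache:
    """Write-back (copy-back): on a write hit, update only the cache
    and mark the block dirty; memory is updated at eviction time.
    (A real write-back cache would also fetch the block on a write miss;
    that step is omitted here for brevity.)"""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}               # address -> (value, dirty_bit)

    def write(self, addr, value):
        self.lines[addr] = (value, True)   # mark the block dirty

    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.memory[addr] = value      # copy back only if modified
```

With write-back, repeated writes to the same cached word touch the memory only once, at eviction; with write-through, every write reaches the memory.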


The write-through protocol is simpler than the write-back protocol, but it results in unnecessary Write operations in the main memory when a given cache word is updated several times during its cache residency. The write-back protocol also involves unnecessary Write operations, because all words of the block are eventually written back, even if only a single word has been changed while the block was in the cache. The write-back protocol is used most often, to take advantage of the high speed with which data blocks can be transferred to memory chips.

Cache Misses

A Read operation for a word that is not in the cache constitutes a Read miss. It causes the block of words containing the requested word to be copied from the main memory into the cache. After the entire block is loaded into the cache, the particular word requested is forwarded to the processor. Alternatively, this word may be sent to the processor as soon as it is read from the main memory. The latter approach, which is called load-through, or early restart, reduces the processor’s waiting time somewhat, at the expense of more complex circuitry.

When a Write miss occurs in a computer that uses the write-through protocol, the information is written directly into the main memory. For the write-back protocol, the block containing the addressed word is first brought into the cache, and then the desired word in the cache is overwritten with the new information.

Recall from Section 6.7 that resource limitations in a pipelined processor can cause instruction execution to stall for one or more cycles. This can occur if a Load or Store instruction requests access to data in the memory at the same time that a subsequent instruction is being fetched. When this happens, instruction fetch is delayed until the data access operation is completed. To avoid stalling the pipeline, many processors use separate caches for instructions and data, making it possible for the two operations to proceed in parallel.

8.6.1 Mapping Functions

There are several possible methods for determining where memory blocks are placed in the cache. It is instructive to describe these methods using a specific small example. Consider a cache consisting of 128 blocks of 16 words each, for a total of 2048 (2K) words, and assume that the main memory is addressable by a 16-bit address. The main memory has 64K words, which we will view as 4K blocks of 16 words each. For simplicity, we have assumed that consecutive addresses refer to consecutive words.

Direct Mapping

The simplest way to determine cache locations in which to store memory blocks is the direct-mapping technique. In this technique, block j of the main memory maps onto block j modulo 128 of the cache, as depicted in Figure 8.16. Thus, whenever one of the main memory blocks 0, 128, 256, . . . is loaded into the cache, it is stored in cache block 0. Blocks 1, 129, 257, . . . are stored in cache block 1, and so on. Since more than one memory block is mapped onto a given cache block position, contention may arise for that position even when the cache is not full. For example, instructions of a program may start in block 1 and continue in block 129, possibly after a branch. As this program is executed,


Figure 8.16 Direct-mapped cache. (The main memory address is divided into a 5-bit Tag field, a 7-bit Block field, and a 4-bit Word field; main memory blocks 0 through 4095 map onto cache blocks 0 through 127, each with an associated tag.)

both of these blocks must be transferred to the block-1 position in the cache. Contention is resolved by allowing the new block to overwrite the currently resident block.

With direct mapping, the replacement algorithm is trivial. Placement of a block in the cache is determined by its memory address. The memory address can be divided into three fields, as shown in Figure 8.16. The low-order 4 bits select one of 16 words in a block. When a new block enters the cache, the 7-bit cache block field determines the cache position in which this block must be stored. The high-order 5 bits of the memory address of the


block are stored in 5 tag bits associated with its location in the cache. The tag bits identify which of the 32 main memory blocks mapped into this cache position is currently resident in the cache. As execution proceeds, the 7-bit cache block field of each address generated by the processor points to a particular block location in the cache. The high-order 5 bits of the address are compared with the tag bits associated with that cache location. If they match, then the desired word is in that block of the cache. If there is no match, then the block containing the required word must first be read from the main memory and loaded into the cache. The direct-mapping technique is easy to implement, but it is not very flexible.
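For the example cache (128 blocks of 16 words, 16-bit addresses), the three address fields can be extracted with simple shifts and masks. The following sketch is ours, not from the text:

```python
def split_direct_mapped(addr):
    """Split a 16-bit address into (tag, block, word) fields for a
    direct-mapped cache with 128 blocks of 16 words each."""
    word = addr & 0xF            # low-order 4 bits: word within the block
    block = (addr >> 4) & 0x7F   # next 7 bits: cache block position
    tag = (addr >> 11) & 0x1F    # high-order 5 bits: tag
    return tag, block, word

# Memory blocks 1 and 129 map onto the same cache block position...
assert split_direct_mapped(1 << 4)[1] == split_direct_mapped(129 << 4)[1] == 1
# ...but have different tags, so the cache can tell them apart.
assert split_direct_mapped(1 << 4)[0] != split_direct_mapped(129 << 4)[0]
```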

Associative Mapping

Figure 8.17 shows the most flexible mapping method, in which a main memory block can be placed into any cache block position. In this case, 12 tag bits are required to identify a memory block when it is resident in the cache. The tag bits of an address received from the processor are compared to the tag bits of each block of the cache to see if the desired block is present. This is called the associative-mapping technique. It gives complete freedom in

Figure 8.17 Associative-mapped cache. (The main memory address is divided into a 12-bit Tag field and a 4-bit Word field; any main memory block can be placed in any of the 128 cache blocks, each with an associated tag.)


choosing the cache location in which to place the memory block, resulting in a more efficient use of the space in the cache. When a new block is brought into the cache, it replaces (ejects) an existing block only if the cache is full. In this case, we need an algorithm to select the block to be replaced. Many replacement algorithms are possible, as we discuss in Section 8.6.2. The complexity of an associative cache is higher than that of a direct-mapped cache, because of the need to search all 128 tag patterns to determine whether a given block is in the cache. To avoid a long delay, the tags must be searched in parallel. A search of this kind is called an associative search.

Set-Associative Mapping

Another approach is to use a combination of the direct- and associative-mapping techniques. The blocks of the cache are grouped into sets, and the mapping allows a block of the main memory to reside in any block of a specific set. Hence, the contention problem of the direct method is eased by having a few choices for block placement. At the same time, the hardware cost is reduced by decreasing the size of the associative search. An example of this set-associative-mapping technique is shown in Figure 8.18 for a cache with two blocks per set. In this case, memory blocks 0, 64, 128, . . . , 4032 map into cache set 0, and they can occupy either of the two block positions within this set. Having 64 sets means that the 6-bit set field of the address determines which set of the cache might contain the desired block. The tag field of the address must then be associatively compared to the tags of the two blocks of the set to check if the desired block is present. This two-way associative search is simple to implement.

The number of blocks per set is a parameter that can be selected to suit the requirements of a particular computer. For the main memory and cache sizes in Figure 8.18, four blocks per set can be accommodated by a 5-bit set field, eight blocks per set by a 4-bit set field, and so on. The extreme condition of 128 blocks per set requires no set bits and corresponds to the fully-associative technique, with 12 tag bits. The other extreme of one block per set is the direct-mapping method. A cache that has k blocks per set is referred to as a k-way set-associative cache.
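The relationship between associativity and field widths for this 128-block example can be checked with a short sketch (the function name is ours): for k blocks per set there are 128/k sets, so the set field needs log2(128/k) bits and the tag field gets whatever remains of the 16-bit address.

```python
def field_widths(k, num_blocks=128, words_per_block=16, addr_bits=16):
    """Return (tag_bits, set_bits, word_bits) for a k-way
    set-associative cache with the example's parameters."""
    num_sets = num_blocks // k
    word_bits = words_per_block.bit_length() - 1   # log2(16) = 4
    set_bits = num_sets.bit_length() - 1           # log2(number of sets)
    tag_bits = addr_bits - set_bits - word_bits
    return tag_bits, set_bits, word_bits

assert field_widths(2) == (6, 6, 4)      # Figure 8.18: two-way
assert field_widths(4) == (7, 5, 4)      # four blocks per set: 5-bit set field
assert field_widths(8) == (8, 4, 4)      # eight blocks per set: 4-bit set field
assert field_widths(128) == (12, 0, 4)   # fully associative: 12 tag bits
assert field_widths(1) == (5, 7, 4)      # one block per set: direct mapping
```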

Stale Data

When power is first turned on, the cache contains no valid data. A control bit, usually called the valid bit, must be provided for each cache block to indicate whether the data in that block are valid. This bit should not be confused with the modified, or dirty, bit mentioned earlier. The valid bits of all cache blocks are set to 0 when power is initially applied to the system. Some valid bits may also be set to 0 when new programs or data are loaded from the disk into the main memory. Data transferred from the disk to the main memory using the DMA mechanism are usually loaded directly into the main memory, bypassing the cache. If the memory blocks being updated are currently in the cache, the valid bits of the corresponding cache blocks are set to 0. As program execution proceeds, the valid bit of a given cache block is set to 1 when a memory block is loaded into that location. The processor fetches data from a cache block only if its valid bit is equal to 1. The use of the valid bit in this manner ensures that the processor will not fetch stale data from the cache.

A similar precaution is needed in a system that uses the write-back protocol. Under this protocol, new data written into the cache are not written to the memory at the same time.


Figure 8.18 Set-associative-mapped cache with two blocks per set. (The main memory address is divided into a 6-bit Tag field, a 6-bit Set field, and a 4-bit Word field; the 128 cache blocks are organized as 64 sets of two blocks each, and each cached block has an associated tag.)

Hence, data in the memory do not always reflect the changes that may have been made in the cached copy. It is important to ensure that such stale data in the memory are not transferred to the disk. One solution is to flush the cache, by forcing all dirty blocks to be written back to the memory before performing the transfer. The operating system can do this by issuing a command to the cache before initiating the DMA operation that transfers the data to the disk. Flushing the cache does not affect performance greatly, because such disk transfers do


not occur often. The need to ensure that two different entities (the processor and the DMA subsystems in this case) use identical copies of the data is referred to as a cache-coherence problem.

8.6.2 Replacement Algorithms

In a direct-mapped cache, the position of each block is predetermined by its address; hence, the replacement strategy is trivial. In associative and set-associative caches there exists some flexibility. When a new block is to be brought into the cache and all the positions that it may occupy are full, the cache controller must decide which of the old blocks to overwrite. This is an important issue, because the decision can be a strong determining factor in system performance. In general, the objective is to keep blocks in the cache that are likely to be referenced in the near future. But, it is not easy to determine which blocks are about to be referenced. The property of locality of reference in programs gives a clue to a reasonable strategy. Because program execution usually stays in localized areas for reasonable periods of time, there is a high probability that the blocks that have been referenced recently will be referenced again soon. Therefore, when a block is to be overwritten, it is sensible to overwrite the one that has gone the longest time without being referenced. This block is called the least recently used (LRU) block, and the technique is called the LRU replacement algorithm.

To use the LRU algorithm, the cache controller must track references to all blocks as computation proceeds. Suppose it is required to track the LRU block of a four-block set in a set-associative cache. A 2-bit counter can be used for each block. When a hit occurs, the counter of the block that is referenced is set to 0. Counters with values originally lower than the referenced one are incremented by one, and all others remain unchanged. When a miss occurs and the set is not full, the counter associated with the new block loaded from the main memory is set to 0, and the values of all other counters are increased by one. When a miss occurs and the set is full, the block with the counter value 3 is removed, the new block is put in its place, and its counter is set to 0. The other three block counters are incremented by one. It can be easily verified that the counter values of occupied blocks are always distinct.
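The three counter-update rules above can be expressed directly in code. This is an illustrative sketch (the function name and data structure are ours); counter value 0 marks the most recently used block and 3 the least recently used.

```python
def reference(counters, block):
    """Apply the 2-bit LRU counter rules for a four-block set.
    `counters` maps resident block ids to counter values (0..3).
    Returns the id of the evicted block, or None."""
    evicted = None
    if block in counters:                   # hit
        old = counters[block]
        for b in counters:
            if counters[b] < old:           # only lower counters advance
                counters[b] += 1
        counters[block] = 0
    elif len(counters) < 4:                 # miss, set not full
        for b in counters:
            counters[b] += 1
        counters[block] = 0
    else:                                   # miss, set full
        evicted = next(b for b in counters if counters[b] == 3)
        del counters[evicted]               # remove the LRU block
        for b in counters:
            counters[b] += 1
        counters[block] = 0
    return evicted
```

Running a few references through this function confirms the claim at the end of the paragraph: the counter values of the occupied blocks remain distinct after every reference.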

The LRU algorithm has been used extensively. Although it performs well for many access patterns, it can lead to poor performance in some cases. For example, it produces disappointing results when accesses are made to sequential elements of an array that is slightly too large to fit into the cache (see Section 8.6.3 and Problem 8.11). Performance of the LRU algorithm can be improved by introducing a small amount of randomness in deciding which block to replace.

Several other replacement algorithms are also used in practice. An intuitively reasonable rule would be to remove the “oldest” block from a full set when a new block must be brought in. However, because this algorithm does not take into account the recent pattern of access to blocks in the cache, it is generally not as effective as the LRU algorithm in choosing the best blocks to remove. The simplest algorithm is to randomly choose the block to be overwritten. Interestingly enough, this simple algorithm has been found to be quite effective in practice.


8.6.3 Examples of Mapping Techniques

We now consider a detailed example to illustrate the effects of different cache mapping techniques. Assume that a processor has separate instruction and data caches. To keep the example simple, assume the data cache has space for only eight blocks of data. Also assume that each block consists of only one 16-bit word of data and the memory is word-addressable with 16-bit addresses. (These parameters are not realistic for actual computers, but they allow us to illustrate mapping techniques clearly.) Finally, assume the LRU replacement algorithm is used for block replacement in the cache.

Let us examine changes in the data cache entries caused by running the following application. A 4 × 10 array of numbers, each occupying one word, is stored in main memory locations 7A00 through 7A27 (hex). The elements of this array, A, are stored in column order, as shown in Figure 8.19. The figure also indicates how tags for different cache mapping techniques are derived from the memory address. Note that no bits are needed to identify a word within a block, as was done in Figures 8.16 through 8.18, because we have assumed that each block contains only one word. The application normalizes the elements of the first row of A with respect to the average value of the elements in the row. Hence, we need to compute the average of the elements in the row and divide each element by that average. The required task can be expressed as

A(0, i) ← A(0, i) / ( ( Σ_{j=0..9} A(0, j) ) / 10 )    for i = 0, 1, . . . , 9

Figure 8.19 An array stored in the main memory. (The figure lists the memory addresses 7A00 through 7A27 in binary, the array element stored at each address, from A(0,0) at 7A00 to A(3,9) at 7A27, and which address bits form the tag for the direct-mapped, set-associative, and associative mapping techniques.)


SUM := 0
for j := 0 to 9 do
    SUM := SUM + A(0,j)
end
AVG := SUM/10
for i := 9 downto 0 do
    A(0,i) := A(0,i)/AVG
end

Figure 8.20 Task for example in Section 8.6.3.

Figure 8.20 gives the structure of a program that corresponds to this task. We use the variables SUM and AVG to hold the sum and average values, respectively. These variables, as well as index variables i and j, are held in processor registers during the computation.

Direct-Mapped Cache

In a direct-mapped data cache, the contents of the cache change as shown in Figure 8.21. The columns in the table indicate the cache contents after various passes through the two program loops in Figure 8.20 are completed. For example, after the second pass through the first loop (j = 1), the cache holds the elements A(0, 0) and A(0, 1). These elements are in block positions 0 and 4, as determined by the three least-significant bits of the address. During the next pass, the A(0, 0) element is replaced by A(0, 2), which maps into the same block position. Note that the desired elements map into only two positions in the cache, thus leaving the contents of the other six positions unchanged from whatever they were before the normalization task started.

Elements A(0, 8) and A(0, 9) are loaded into the cache during the ninth and tenth passes through the first loop (j = 8, 9). The second loop reverses the order in which the elements are handled. The first two passes through this loop (i = 9, 8) find the required data in the cache. When i = 7, element A(0, 9) is replaced with A(0, 7). When i = 6, element A(0, 8)

Contents of data cache after pass:

Block position   j=1     j=3     j=5     j=7     j=9     i=6     i=4     i=2     i=0
0                A(0,0)  A(0,2)  A(0,4)  A(0,6)  A(0,8)  A(0,6)  A(0,4)  A(0,2)  A(0,0)
4                A(0,1)  A(0,3)  A(0,5)  A(0,7)  A(0,9)  A(0,7)  A(0,5)  A(0,3)  A(0,1)

(Block positions 1 to 3 and 5 to 7 retain their previous contents throughout.)

Figure 8.21 Contents of a direct-mapped data cache.


is replaced with A(0, 6), and so on. Thus, eight elements are replaced while the second loop is executed. In total, there are only two hits during execution of this task.

The reader should keep in mind that the tags must be kept in the cache for each block. They are not shown to keep the figure simple.

Associative-Mapped Cache

Figure 8.22 presents the changes in cache contents for the case of an associative-mapped cache. During the first eight passes through the first loop, the elements are brought into consecutive block positions, assuming that the cache was initially empty. During the ninth pass (j = 8), the LRU algorithm chooses A(0, 0) to be overwritten by A(0, 8). In the next and last pass through the j loop, element A(0, 1) is replaced with A(0, 9). Now, for the first eight passes through the second loop (i = 9, 8, . . . , 2) all the required elements are found in the cache. When i = 1, the element needed is A(0, 1), so it replaces the least recently used element, A(0, 9). During the last pass, A(0, 0) replaces A(0, 8).

In this case, when the second loop is executed, only two elements are not found in the cache. In the direct-mapped case, eight of the elements had to be reloaded during the second loop. Obviously, the associative-mapped cache benefits from the complete freedom in mapping a memory block into any position in the cache. In both cases, better utilization of the cache is achieved by reversing the order in which the elements are handled in the second loop of the program. It is interesting to consider what would happen if the second loop dealt with the elements in the same order as in the first loop. Using either direct mapping or the LRU algorithm, all elements would be overwritten before they are used in the second loop (see Problem 8.10).

Set-Associative-Mapped Cache

For this example, we assume that a set-associative data cache is organized into two sets, each capable of holding four blocks. Thus, the least-significant bit of an address determines which set a memory block maps into, but the memory data can be placed in any of the four blocks of the set. The high-order 15 bits of the address constitute the tag.

Contents of data cache after pass:

Block position   j=7     j=8     j=9     i=1     i=0
0                A(0,0)  A(0,8)  A(0,8)  A(0,8)  A(0,0)
1                A(0,1)  A(0,1)  A(0,9)  A(0,1)  A(0,1)
2                A(0,2)  A(0,2)  A(0,2)  A(0,2)  A(0,2)
3                A(0,3)  A(0,3)  A(0,3)  A(0,3)  A(0,3)
4                A(0,4)  A(0,4)  A(0,4)  A(0,4)  A(0,4)
5                A(0,5)  A(0,5)  A(0,5)  A(0,5)  A(0,5)
6                A(0,6)  A(0,6)  A(0,6)  A(0,6)  A(0,6)
7                A(0,7)  A(0,7)  A(0,7)  A(0,7)  A(0,7)

Figure 8.22 Contents of an associative-mapped data cache.


Contents of data cache after pass:

         j=3     j=7     j=9     i=4     i=2     i=0
Set 0    A(0,0)  A(0,4)  A(0,8)  A(0,4)  A(0,4)  A(0,0)
         A(0,1)  A(0,5)  A(0,9)  A(0,5)  A(0,5)  A(0,1)
         A(0,2)  A(0,6)  A(0,6)  A(0,6)  A(0,2)  A(0,2)
         A(0,3)  A(0,7)  A(0,7)  A(0,7)  A(0,3)  A(0,3)
Set 1    (not used by this task)

Figure 8.23 Contents of a set-associative-mapped data cache.

Changes in the cache contents are depicted in Figure 8.23. Since all the desired blocks have even addresses, they map into set 0. In this case, six elements are reloaded during execution of the second loop.

Even though this is a simplified example, it illustrates that in general, associative mapping performs best, set-associative mapping is next best, and direct mapping is the worst. However, fully-associative mapping is expensive to implement, so set-associative mapping is a good practical compromise.
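The hit counts in this example can be checked with a small simulation. The sketch below is ours, not the book's: it models one access per array element per loop pass and LRU replacement within each set, with the set index taken from the low-order address bits, as in the example.

```python
# Simulation of the example in Section 8.6.3: an 8-block data cache with
# one word per block and LRU replacement. Names are illustrative.

def simulate(addresses, num_sets, ways):
    """Count cache hits for the given access trace. The cache has
    num_sets * ways blocks; the set index is addr mod num_sets."""
    sets = [[] for _ in range(num_sets)]   # each set is ordered LRU -> MRU
    hits = 0
    for addr in addresses:
        s = sets[addr % num_sets]
        if addr in s:
            hits += 1
            s.remove(addr)      # re-inserted below as most recently used
        elif len(s) == ways:
            s.pop(0)            # miss in a full set: evict the LRU block
        s.append(addr)
    return hits

# A(0,i) occupies word address 0x7A00 + 4*i (column order, 4 rows).
first_loop = [0x7A00 + 4 * j for j in range(10)]          # j = 0..9
second_loop = [0x7A00 + 4 * i for i in range(9, -1, -1)]  # i = 9..0
trace = first_loop + second_loop

assert simulate(trace, num_sets=8, ways=1) == 2   # direct-mapped: 2 hits
assert simulate(trace, num_sets=2, ways=4) == 4   # set-associative: 6 reloads
assert simulate(trace, num_sets=1, ways=8) == 8   # associative: 2 reloads
```

The assertions reproduce the counts in Figures 8.21 through 8.23: two hits for direct mapping, six reloads in the second loop for the set-associative cache, and only two for the fully associative cache.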

8.7 Performance Considerations

Two key factors in the commercial success of a computer are performance and cost; the best possible performance for a given cost is the objective. A common measure of success is the price/performance ratio. Performance depends on how fast machine instructions can be brought into the processor and how fast they can be executed. Chapter 6 shows how pipelining increases the speed of program execution. In this chapter, we focus on the memory subsystem.

The memory hierarchy described in Section 8.5 results from the quest for the best price/performance ratio. The main purpose of this hierarchy is to create a memory that the processor sees as having a short access time and a large capacity. When a cache is used, the processor is able to access instructions and data more quickly when the data from the referenced memory locations are in the cache. Therefore, the extent to which caches improve performance is dependent on how frequently the requested instructions and data are found in the cache. In this section, we examine this issue quantitatively.


8.7.1 Hit Rate and Miss Penalty

An excellent indicator of the effectiveness of a particular implementation of the memory hierarchy is the success rate in accessing information at various levels of the hierarchy. Recall that a successful access to data in a cache is called a hit. The number of hits stated as a fraction of all attempted accesses is called the hit rate, and the miss rate is the number of misses stated as a fraction of attempted accesses.

Ideally, the entire memory hierarchy would appear to the processor as a single memory unit that has the access time of the cache on the processor chip and the size of the magnetic disk. How close we get to this ideal depends largely on the hit rate at different levels of the hierarchy. High hit rates well over 0.9 are essential for high-performance computers.

Performance is adversely affected by the actions that need to be taken when a miss occurs. A performance penalty is incurred because of the extra time needed to bring a block of data from a slower unit in the memory hierarchy to a faster unit. During that period, the processor is stalled waiting for instructions or data. The waiting time depends on the details of the operation of the cache. For example, it depends on whether or not the load-through approach is used. We refer to the total access time seen by the processor when a miss occurs as the miss penalty.

Consider a system with only one level of cache. In this case, the miss penalty consists almost entirely of the time to access a block of data in the main memory. Let h be the hit rate, M the miss penalty, and C the time to access information in the cache. Thus, the average access time experienced by the processor is

tavg = hC + (1 − h)M

The following example illustrates how the values of these parameters affect the average access time.

Example 8.1

Consider a computer that has the following parameters. Access times to the cache and the main memory are τ and 10τ, respectively. When a cache miss occurs, a block of 8 words is transferred from the main memory to the cache. It takes 10τ to transfer the first word of the block, and the remaining 7 words are transferred at the rate of one word every τ seconds. The miss penalty also includes a delay of τ for the initial access to the cache, which misses, and another delay of τ to transfer the word to the processor after the block is loaded into the cache (assuming no load-through). Thus, the miss penalty in this computer is given by:

M = τ + 10τ + 7τ + τ = 19τ

Assume that 30 percent of the instructions in a typical program perform a Read or a Write operation, which means that there are 130 memory accesses for every 100 instructions executed. Assume that the hit rates in the cache are 0.95 for instructions and 0.9 for data. Assume further that the miss penalty is the same for both read and write accesses. Then,


a rough estimate of the improvement in memory performance that results from using the cache can be obtained as follows:

Time without cache / Time with cache
= (130 × 10τ) / (100(0.95τ + 0.05 × 19τ) + 30(0.9τ + 0.1 × 19τ))
= 4.7

This result shows that the cache makes the memory appear almost five times faster than it really is. The improvement factor increases as the speed of the cache increases relative to the main memory. For example, if the access time of the main memory is 20τ, the improvement factor becomes 7.3.
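The arithmetic of this example can be verified with a few lines of code. The sketch below is ours (the function name is illustrative); it assumes, as in the example, that only the first-word access time changes with memory speed while the remaining 7 words still take τ each.

```python
tau = 1.0   # cache access time, used as the unit of time

def improvement(mem_access):
    """Ratio of memory access time without and with the cache, for the
    access pattern of Example 8.1: 130 memory accesses per 100
    instructions, hit rates 0.95 (instructions) and 0.9 (data)."""
    # Miss penalty: initial cache probe + first word from memory
    # + 7 more words at one per tau + transfer of the word to the processor.
    M = tau + mem_access + 7 * tau + tau
    without_cache = 130 * mem_access
    with_cache = 100 * (0.95 * tau + 0.05 * M) + 30 * (0.9 * tau + 0.1 * M)
    return without_cache / with_cache

assert round(improvement(10 * tau), 1) == 4.7   # main memory at 10 tau
assert round(improvement(20 * tau), 1) == 7.3   # main memory at 20 tau
```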

High hit rates are essential for the cache to be effective in reducing memory access time. Hit rates depend on the size of the cache, its design, and the instruction and data access patterns of the programs being executed. It is instructive to consider how effective the cache of this example is compared to the ideal case in which the hit rate is 100 percent. With ideal cache behavior, all memory references take one τ. Thus, an estimate of the increase in memory access time caused by misses in the cache is given by:

Time for real cache / Time for ideal cache
= (100(0.95τ + 0.05 × 19τ) + 30(0.9τ + 0.1 × 19τ)) / 130τ
= 2.1

In other words, a 100% hit rate in the cache would make the memory appear twice as fast as when realistic hit rates are used.

How can the hit rate be improved? One possibility is to make the cache larger, but this entails increased cost. Another possibility is to increase the cache block size while keeping the total cache size constant, to take advantage of spatial locality. If all items in a larger block are needed in a computation, then it is better to load these items into the cache in a single miss, rather than loading several smaller blocks as a result of several misses. The high data rate achievable during block transfers is the main reason for this advantage. But larger blocks are effective only up to a certain size, beyond which the improvement in the hit rate is offset by the fact that some items may not be referenced before the block is ejected (replaced). Also, larger blocks take longer to transfer, and hence increase the miss penalty. Since the performance of a computer is affected positively by increased hit rate and negatively by increased miss penalty, block size should be neither too small nor too large. In practice, block sizes in the range of 16 to 128 bytes are the most popular choices.

Finally, we note that the miss penalty can be reduced if the load-through approach is used when loading new blocks into the cache. Then, instead of waiting for an entire block to be transferred, the processor resumes execution as soon as the required word is loaded into the cache.

8.7.2 Caches on the Processor Chip

When information is transferred between different chips, considerable delays occur in driver and receiver gates on the chips. Thus, it is best to implement the cache on the processor


chip. Most processor chips include at least one L1 cache. Often there are two separate L1caches, one for instructions and another for data.

In high-performance processors, two levels of caches are normally used: separate L1 caches for instructions and data, and a larger L2 cache. These caches are often implemented on the processor chip. In this case, the L1 caches must be very fast, as they determine the memory access time seen by the processor. The L2 cache can be slower, but it should be much larger than the L1 caches to ensure a high hit rate. Its speed is less critical because it only affects the miss penalty of the L1 caches. A typical computer may have L1 caches with capacities of tens of kilobytes and an L2 cache of hundreds of kilobytes or possibly several megabytes.

Including an L2 cache further reduces the impact of the main memory speed on the performance of a computer. Its effect can be assessed by observing that the average access time of the L2 cache is the miss penalty of either of the L1 caches. For simplicity, we will assume that the hit rates are the same for instructions and data. Thus, the average access time experienced by the processor in such a system is:

tavg = h1C1 + (1 − h1)(h2C2 + (1 − h2)M)

where

h1 is the hit rate in the L1 caches.

h2 is the hit rate in the L2 cache.

C1 is the time to access information in the L1 caches.

C2 is the miss penalty to transfer information from the L2 cache to an L1 cache.

M is the miss penalty to transfer information from the main memory to the L2 cache.

Of all memory references made by the processor, the fraction that miss in the L2 cache is given by (1 − h1)(1 − h2). If both h1 and h2 are in the 90 percent range, then the number of misses in the L2 cache will be less than one percent of all memory accesses. This makes the value of M, and in turn the speed of the main memory, less critical. See Problem 8.14 for a quantitative examination of this issue.
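As a sketch, the formula above can be evaluated for illustrative values of the hit rates and access times (the numbers chosen here are assumptions, not values from the text).

```python
# Average access time for a two-level cache hierarchy:
#   tavg = h1*C1 + (1 - h1) * (h2*C2 + (1 - h2)*M)
def tavg(h1, h2, C1, C2, M):
    return h1 * C1 + (1 - h1) * (h2 * C2 + (1 - h2) * M)

# Illustrative values: C1 = 1 cycle, C2 = 10 cycles, M = 100 cycles,
# with 95% L1 and 90% L2 hit rates.
print(round(tavg(0.95, 0.90, 1, 10, 100), 2))   # → 1.9

# Fraction of all references that miss in the L2 cache: (1-h1)(1-h2).
print(round((1 - 0.95) * (1 - 0.90), 3))        # → 0.005
```

Even with a slow main memory (M = 100 cycles), only 0.5 percent of accesses pay that penalty, so the average stays close to the L1 access time.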

8.7.3 Other Enhancements

In addition to the main design issues just discussed, several other possibilities exist for enhancing performance. We discuss three of them in this section.

Write Buffer

When the write-through protocol is used, each Write operation results in writing a new value into the main memory. If the processor must wait for the memory function to be completed, as we have assumed until now, then the processor is slowed down by all Write requests. Yet the processor typically does not need immediate access to the result of a Write operation; so it is not necessary for it to wait for the Write request to be completed.


To improve performance, a Write buffer can be included for temporary storage of Write requests. The processor places each Write request into this buffer and continues execution of the next instruction. The Write requests stored in the Write buffer are sent to the main memory whenever the memory is not responding to Read requests. It is important that the Read requests be serviced quickly, because the processor usually cannot proceed before receiving the data being read from the memory. Hence, these requests are given priority over Write requests.

The Write buffer may hold a number of Write requests. Thus, it is possible that a subsequent Read request may refer to data that are still in the Write buffer. To ensure correct operation, the addresses of data to be read from the memory are always compared with the addresses of the data in the Write buffer. In the case of a match, the data in the Write buffer are used.
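The buffering and read-matching behavior described above can be sketched as follows; the class and method names are illustrative, not an actual controller interface.

```python
# Sketch of write-buffer behavior with read matching (illustrative only).
from collections import OrderedDict

class WriteBuffer:
    def __init__(self, memory):
        self.memory = memory          # backing main memory (dict: addr -> data)
        self.pending = OrderedDict()  # queued Write requests, oldest first

    def write(self, addr, data):
        # The processor deposits the request and continues immediately.
        self.pending[addr] = data

    def read(self, addr):
        # Read addresses are compared with buffered writes before going
        # to memory; on a match, the buffered (newest) data are used.
        if addr in self.pending:
            return self.pending[addr]
        return self.memory[addr]

    def drain_one(self):
        # Called when the memory is not busy servicing Read requests.
        if self.pending:
            addr, data = self.pending.popitem(last=False)
            self.memory[addr] = data

mem = {0x100: 7}
wb = WriteBuffer(mem)
wb.write(0x100, 42)
print(wb.read(0x100))   # 42: matched in the buffer, not stale memory
wb.drain_one()
print(mem[0x100])       # 42: the write has now reached main memory
```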

A similar situation occurs with the write-back protocol. In this case, Write commands issued by the processor are performed on the word in the cache. When a new block of data is to be brought into the cache as a result of a Read miss, it may replace an existing block that has some dirty data. The dirty block has to be written into the main memory. If the required write-back is performed first, then the processor has to wait for this operation to be completed before the new block is read into the cache. It is more prudent to read the new block first. The dirty block being ejected from the cache is temporarily stored in the Write buffer and held there while the new block is being read. Afterwards, the contents of the buffer are written into the main memory. Thus, the Write buffer also works well for the write-back protocol.

Prefetching

In the previous discussion of the cache mechanism, we assumed that new data are brought into the cache when they are first needed. Following a Read miss, the processor has to pause until the new data arrive, thus incurring a miss penalty.

To avoid stalling the processor, it is possible to prefetch the data into the cache before they are needed. The simplest way to do this is through software. A special prefetch instruction may be provided in the instruction set of the processor. Executing this instruction causes the addressed data to be loaded into the cache, as in the case of a Read miss. A prefetch instruction is inserted in a program to cause the data to be loaded in the cache shortly before they are needed in the program. Then, the processor will not have to wait for the referenced data as in the case of a Read miss. The hope is that prefetching will take place while the processor is busy executing instructions that do not result in a Read miss, thus allowing accesses to the main memory to be overlapped with computation in the processor.

Prefetch instructions can be inserted into a program either by the programmer or by the compiler. Compilers are able to insert these instructions with good success for many applications. Software prefetching entails a certain overhead because inclusion of prefetch instructions increases the length of programs. Moreover, some prefetches may load into the cache data that will not be used by the instructions that follow. This can happen if the prefetched data are ejected from the cache by a Read miss involving other data. However, the overall effect of software prefetching on performance is positive, and many processors have machine instructions to support this feature. See Reference [1] for a thorough discussion of software prefetching.
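A toy simulation can illustrate the intended effect; the block size, address stream, and one-block-ahead prefetch policy are assumptions chosen purely for illustration.

```python
# Toy simulation of software prefetching: a prefetch loads a block into
# the cache ahead of its first use, so the later demand access hits.
BLOCK = 16          # assumed block size in bytes
cache = set()       # set of block numbers currently in the cache
misses = 0

def access(addr):
    global misses
    block = addr // BLOCK
    if block not in cache:
        misses += 1         # the processor would stall for the miss penalty
        cache.add(block)

def prefetch(addr):
    # Overlapped with computation, so no stall is charged here.
    cache.add(addr // BLOCK)

for a in range(0, 256, 4):  # sequential word accesses
    prefetch(a + BLOCK)     # request the next block early
    access(a)

print(misses)   # only the very first block is a demand miss
```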


Prefetching can also be done in hardware, using circuitry that attempts to discover a pattern in memory references and prefetches data according to this pattern. A number of schemes have been proposed for this purpose, as described in References [2] and [3].

Lockup-Free Cache

Software prefetching does not work well if it interferes significantly with the normal execution of instructions. This is the case if the action of prefetching stops other accesses to the cache until the prefetch is completed. While servicing a miss, the cache is said to be locked. This problem can be solved by modifying the basic cache structure to allow the processor to access the cache while a miss is being serviced. In this case, it is possible to have more than one outstanding miss, and the hardware must accommodate such occurrences.

A cache that can support multiple outstanding misses is called lockup-free. Such a cache must include circuitry that keeps track of all outstanding misses. This may be done with special registers that hold the pertinent information about these misses. Lockup-free caches were first used in the early 1980s in the Cyber series of computers manufactured by the Control Data company [4].

We have used software prefetching to motivate the need for a cache that is not locked by a Read miss. A much more important reason is that in a pipelined processor, which overlaps the execution of several instructions, a Read miss caused by one instruction could stall the execution of other instructions. A lockup-free cache reduces the likelihood of such stalls.

8.8 Virtual Memory

In most modern computer systems, the physical main memory is not as large as the address space of the processor. For example, a processor that issues 32-bit addresses has an addressable space of 4G bytes. The size of the main memory in a typical computer with a 32-bit processor may range from 1G to 4G bytes. If a program does not completely fit into the main memory, the parts of it not currently being executed are stored on a secondary storage device, typically a magnetic disk. As these parts are needed for execution, they must first be brought into the main memory, possibly replacing other parts that are already in the memory. These actions are performed automatically by the operating system, using a scheme known as virtual memory. Application programmers need not be aware of the limitations imposed by the available main memory. They prepare programs using the entire address space of the processor.

Under a virtual memory system, programs, and hence the processor, reference instructions and data in an address space that is independent of the available physical main memory space. The binary addresses that the processor issues for either instructions or data are called virtual or logical addresses. These addresses are translated into physical addresses by a combination of hardware and software actions. If a virtual address refers to a part of the program or data space that is currently in the physical memory, then the contents of the appropriate location in the main memory are accessed immediately. Otherwise, the contents of the referenced address must be brought into a suitable location in the memory before they can be used.


[Figure 8.24 Virtual memory organization: the processor issues virtual addresses to the MMU, which produces physical addresses for the cache and main memory; DMA transfers move data between the main memory and disk storage.]

Figure 8.24 shows a typical organization that implements virtual memory. A special hardware unit, called the Memory Management Unit (MMU), keeps track of which parts of the virtual address space are in the physical memory. When the desired data or instructions are in the main memory, the MMU translates the virtual address into the corresponding physical address. Then, the requested memory access proceeds in the usual manner. If the data are not in the main memory, the MMU causes the operating system to transfer the data from the disk to the memory. Such transfers are performed using the DMA scheme discussed in Section 8.4.

8.8.1 Address Translation

A simple method for translating virtual addresses into physical addresses is to assume that all programs and data are composed of fixed-length units called pages, each of which consists of a block of words that occupy contiguous locations in the main memory. Pages commonly range from 2K to 16K bytes in length. They constitute the basic unit of information that is transferred between the main memory and the disk whenever the MMU determines that a transfer is required. Pages should not be too small, because the access time of a magnetic disk is much longer (several milliseconds) than the access time of the main memory. The reason for this is that it takes a considerable amount of time to locate the data on the disk, but once located, the data can be transferred at a rate of several megabytes per second. On the other hand, if pages are too large, it is possible that a substantial portion of a page may not be used, yet this unnecessary data will occupy valuable space in the main memory.

This discussion clearly parallels the concepts introduced in Section 8.6 on cache memory. The cache bridges the speed gap between the processor and the main memory and is implemented in hardware. The virtual-memory mechanism bridges the size and speed gaps between the main memory and secondary storage and is usually implemented in part by software techniques. Conceptually, cache techniques and virtual-memory techniques are very similar. They differ mainly in the details of their implementation.

A virtual-memory address-translation method based on the concept of fixed-length pages is shown schematically in Figure 8.25.

[Figure 8.25 Virtual-memory address translation: the page table base register is added to the virtual page number to locate the corresponding page table entry; the entry holds control bits and the page frame in memory, which is combined with the offset to form the physical address in main memory.]

Each virtual address generated by the processor, whether it is for an instruction fetch or an operand load/store operation, is interpreted as a virtual page number (high-order bits) followed by an offset (low-order bits) that specifies the location of a particular byte (or word) within a page. Information about the main memory location of each page is kept in a page table. This information includes the main memory address where the page is stored and the current status of the page. An area in the main memory that can hold one page is called a page frame. The starting address of the page table is kept in a page table base register. By adding the virtual page number to the contents of this register, the address of the corresponding entry in the page table is obtained. The contents of this location give the starting address of the page if that page currently resides in the main memory.
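The translation steps just described can be sketched in a few lines; the 4K-byte page size and the page-table contents here are assumptions chosen for illustration.

```python
# Sketch of page-based address translation with an assumed 4K-byte page size.
PAGE_SIZE = 4096
OFFSET_BITS = 12    # log2(PAGE_SIZE)

# Page table: virtual page number -> starting address of the page frame.
page_table = {0x0: 0x5000, 0x1: 0x9000}

def translate(virtual_addr):
    vpn = virtual_addr >> OFFSET_BITS          # high-order bits
    offset = virtual_addr & (PAGE_SIZE - 1)    # low-order bits
    frame = page_table[vpn]                    # KeyError would mean a page fault
    return frame + offset

print(hex(translate(0x1234)))   # vpn 1, offset 0x234 → 0x9234
```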

Each entry in the page table also includes some control bits that describe the status of the page while it is in the main memory. One bit indicates the validity of the page, that is, whether the page is actually loaded in the main memory. It allows the operating system to invalidate the page without actually removing it. Another bit indicates whether the page has been modified during its residency in the memory. As in cache memories, this information is needed to determine whether the page should be written back to the disk before it is removed from the main memory to make room for another page. Other control bits indicate various restrictions that may be imposed on accessing the page. For example, a program may be given full read and write permission, or it may be restricted to read accesses only.

Translation Lookaside Buffer

The page table information is used by the MMU for every read and write access. Ideally, the page table should be situated within the MMU. Unfortunately, the page table may be rather large. Since the MMU is normally implemented as part of the processor chip, it is impossible to include the complete table within the MMU. Instead, a copy of only a small portion of the table is accommodated within the MMU, and the complete table is kept in the main memory. The portion maintained within the MMU consists of the entries corresponding to the most recently accessed pages. They are stored in a small table, usually called the Translation Lookaside Buffer (TLB). The TLB functions as a cache for the page table in the main memory. Each entry in the TLB includes a copy of the information in the corresponding entry in the page table. In addition, it includes the virtual address of the page, which is needed to search the TLB for a particular page. Figure 8.26 shows a possible organization of a TLB that uses the associative-mapping technique. Set-associative mapped TLBs are also found in commercial products.

Address translation proceeds as follows. Given a virtual address, the MMU looks in the TLB for the referenced page. If the page table entry for this page is found in the TLB, the physical address is obtained immediately. If there is a miss in the TLB, then the required entry is obtained from the page table in the main memory and the TLB is updated.
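A minimal sketch of this lookup sequence, with the TLB modeled as a small dictionary; the sizes and the eviction policy are illustrative simplifications, not how real TLB hardware is built.

```python
# Sketch of TLB lookup with page-table fallback (illustrative structures).
page_table = {vpn: 0x80000 + vpn * 0x1000 for vpn in range(64)}
tlb = {}            # small cache of recent translations: vpn -> frame
TLB_ENTRIES = 8

def lookup(vpn):
    if vpn in tlb:
        return tlb[vpn], True        # TLB hit: physical frame at once
    frame = page_table[vpn]          # TLB miss: consult the page table
    if len(tlb) >= TLB_ENTRIES:
        tlb.pop(next(iter(tlb)))     # evict the oldest entry (simplistic)
    tlb[vpn] = frame                 # update the TLB
    return frame, False

frame, hit = lookup(5)
print(hit)   # False: the first reference misses in the TLB
frame, hit = lookup(5)
print(hit)   # True: the entry was cached by the earlier miss
```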

It is essential to ensure that the contents of the TLB are always the same as the contents of page tables in the memory. When the operating system changes the contents of a page table, it must simultaneously invalidate the corresponding entries in the TLB. One of the control bits in the TLB is provided for this purpose. When an entry is invalidated, the TLB acquires the new information from the page table in the memory as part of the MMU's normal response to access misses.


[Figure 8.26 Use of an associative-mapped TLB: the virtual page number from the processor is compared with the virtual page number field of the TLB entries; on a hit, the matching entry supplies the page frame in memory, which is combined with the offset to form the physical address in main memory; on a miss, the page table in the main memory is consulted.]

Page Faults

When a program generates an access request to a page that is not in the main memory, a page fault is said to have occurred. The entire page must be brought from the disk into the memory before access can proceed. When it detects a page fault, the MMU asks the operating system to intervene by raising an exception (interrupt). Processing of the program that generated the page fault is interrupted, and control is transferred to the operating system. The operating system copies the requested page from the disk into the main memory. Since this process involves a long delay, the operating system may begin execution of another program whose pages are in the main memory. When page transfer is completed, the execution of the interrupted program is resumed.

When the MMU raises an interrupt to indicate a page fault, the instruction that requested the memory access may have been partially executed. It is essential to ensure that the interrupted program continues correctly when it resumes execution. There are two options. Either the execution of the interrupted instruction continues from the point of interruption, or the instruction must be restarted. The design of a particular processor dictates which of these two options is used.

If a new page is brought from the disk when the main memory is full, it must replace one of the resident pages. The problem of choosing which page to remove is just as critical here as it is in a cache, and the observation that programs spend most of their time in a few localized areas also applies. Because main memories are considerably larger than cache memories, it should be possible to keep relatively larger portions of a program in the main memory. This reduces the frequency of transfers to and from the disk. Concepts similar to the LRU replacement algorithm can be applied to page replacement, and the control bits in the page table entries can be used to record usage history. One simple scheme is based on a control bit that is set to 1 whenever the corresponding page is referenced (accessed). The operating system periodically clears this bit in all page table entries, thus providing a simple way of determining which pages have not been used recently.
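This reference-bit scheme can be sketched as follows; the page names and the order in which candidate victims are examined are illustrative choices.

```python
# Sketch of the reference-bit scheme for approximating LRU page replacement.
pages = {p: {"referenced": False} for p in ("A", "B", "C", "D")}

def touch(page):
    pages[page]["referenced"] = True     # set by hardware on every access

def clear_all():
    for entry in pages.values():         # done periodically by the OS
        entry["referenced"] = False

def victim():
    # Prefer a page not referenced since the last clearing pass.
    for page, entry in pages.items():
        if not entry["referenced"]:
            return page

clear_all()
touch("A"); touch("C"); touch("D")
print(victim())   # B: not used recently, so a reasonable page to replace
```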

A modified page has to be written back to the disk before it is removed from the main memory. It is important to note that the write-through protocol, which is useful in the framework of cache memories, is not suitable for virtual memory. The access time of the disk is so long that it does not make sense to access it frequently to write small amounts of data.

Looking up entries in the TLB introduces some delay, slowing down the operation of the MMU. Here again we can take advantage of the property of locality of reference. It is likely that many successive TLB translations involve addresses on the same program page. This is particularly likely when fetching instructions. Thus, address translation time can be reduced by keeping the most recently used TLB entries in a few special registers that can be accessed quickly.

8.9 Memory Management Requirements

In our discussion of virtual-memory concepts, we have tacitly assumed that only one large program is being executed. If all of the program does not fit into the available physical memory, parts of it (pages) are moved from the disk into the main memory when they are to be executed. Although we have alluded to software routines that are needed to manage this movement of program segments, we have not been specific about the details.

Memory management routines are part of the operating system of the computer. It is convenient to assemble the operating system routines into a virtual address space, called the system space, that is separate from the virtual space in which user application programs reside. The latter space is called the user space. In fact, there may be a number of user spaces, one for each user. This is arranged by providing a separate page table for each user program. The MMU uses a page table base register to determine the address of the table to be used in the translation process. Hence, by changing the contents of this register, the operating system can switch from one space to another. The physical main memory is thus shared by the active pages of the system space and several user spaces. However, only the pages that belong to one of these spaces are accessible at any given time.

In any computer system in which independent user programs coexist in the main memory, the notion of protection must be addressed. No program should be allowed to destroy either the data or instructions of other programs in the memory. The needed protection can be provided in several ways. Let us first consider the most basic form of protection. Most processors can operate in one of two modes, the supervisor mode and the user mode. The processor is usually placed in the supervisor mode when operating system routines are being executed and in the user mode to execute user programs. In the user mode, some machine instructions cannot be executed. These are privileged instructions. They include instructions that modify the page table base register, which can only be executed while the processor is in the supervisor mode. Since a user program is executed in the user mode, it is prevented from accessing the page tables of other users or of the system space.

It is sometimes desirable for one application program to have access to certain pages belonging to another program. The operating system can arrange this by causing these pages to appear in both spaces. The shared pages will therefore have entries in two different page tables. The control bits in each table entry can be set to control the access privileges granted to each program. For example, one program may be allowed to read and write a given page, while the other program may be given only read access.

8.10 Secondary Storage

The semiconductor memories discussed in the previous sections cannot be used to provide all of the storage capability needed in computers. Their main limitation is the cost per bit of stored information. The large storage requirements of most computer systems are economically realized in the form of magnetic and optical disks, which are usually referred to as secondary storage devices.

8.10.1 Magnetic Hard Disks

The storage medium in a magnetic-disk system consists of one or more disk platters mounted on a common spindle. A thin magnetic film is deposited on each platter, usually on both sides. The assembly is placed in a drive that causes it to rotate at a constant speed. The magnetized surfaces move in close proximity to read/write heads, as shown in Figure 8.27a. Data are stored on concentric tracks, and the read/write heads move radially to access different tracks.

Each read/write head consists of a magnetic yoke and a magnetizing coil, as indicated in Figure 8.27b. Digital information can be stored on the magnetic film by applying current pulses of suitable polarity to the magnetizing coil. This causes the magnetization of the film in the area immediately underneath the head to switch to a direction parallel to the applied field. The same head can be used for reading the stored information. In this case, changes in the magnetic field in the vicinity of the head caused by the movement of the film relative to the yoke induce a voltage in the coil, which now serves as a sense coil. The polarity of this voltage is monitored by the control circuitry to determine the state of magnetization of the film. Only changes in the magnetic field under the head can be sensed during the Read operation. Therefore, if the binary states 0 and 1 are represented by two opposite states of magnetization, a voltage is induced in the head only at 0-to-1 and at 1-to-0 transitions in the bit stream. A long string of 0s or 1s causes an induced voltage only at the beginning and end of the string. Therefore, to determine the number of consecutive 0s or 1s stored, a clock must provide information for synchronization.

[Figure 8.27 Magnetic disk principles: (a) mechanical structure, showing the rotary drive shaft, disk, access mechanism, and read/write head; (b) read/write head detail, showing the air gap, magnetic thin film, magnetic yoke, and magnetizing current; (c) bit representation by phase encoding for the bit stream 0 1 0 1 1 1 0.]


In some early designs, a clock was stored on a separate track, on which a change in magnetization is forced for each bit period. Using the clock signal as a reference, the data stored on other tracks can be read correctly. The modern approach is to combine the clocking information with the data. Several different techniques have been developed for such encoding. One simple scheme, depicted in Figure 8.27c, is known as phase encoding or Manchester encoding. In this scheme, changes in magnetization occur for each data bit, as shown in the figure. Clocking information is provided by the change in magnetization at the midpoint of each bit period. The drawback of Manchester encoding is its poor bit-storage density. The space required to represent each bit must be large enough to accommodate two changes in magnetization. We use the Manchester encoding example to illustrate how a self-clocking scheme may be implemented, because it is easy to understand. Other, more compact codes have been developed. They are much more efficient and provide better storage density. They also require more complex control circuitry. The discussion of such codes is beyond the scope of this book.

Read/write heads must be maintained at a very small distance from the moving disk surfaces in order to achieve high bit densities and reliable Read and Write operations. When the disks are moving at their steady rate, air pressure develops between the disk surface and the head and forces the head away from the surface. This force is counterbalanced by a spring-loaded mounting arrangement that presses the head toward the surface. The flexible spring connection between the head and its arm mounting permits the head to fly at the desired distance away from the surface in spite of any small variations in the flatness of the surface.

In most modern disk units, the disks and the read/write heads are placed in a sealed, air-filtered enclosure. This approach is known as Winchester technology. In such units, the read/write heads can operate closer to the magnetized track surfaces, because dust particles, which are a problem in unsealed assemblies, are absent. The closer the heads are to a track surface, the more densely the data can be packed along the track, and the closer the tracks can be to each other. Thus, Winchester disks have a larger capacity for a given physical size compared to unsealed units. Another advantage of Winchester technology is that data integrity tends to be greater in sealed units, where the storage medium is not exposed to contaminating elements.

The read/write heads of a disk system are movable. There is one head per surface. All heads are mounted on a comb-like arm that can move radially across the stack of disks to provide access to individual tracks, as shown in Figure 8.27a. To read or write data on a given track, the read/write heads must first be positioned over that track.

The disk system consists of three key parts. One part is the assembly of disk platters, which is usually referred to as the disk. The second part comprises the electromechanical mechanism that spins the disk and moves the read/write heads; it is called the disk drive. The third part is the disk controller, which is the electronic circuitry that controls the operation of the system. The disk controller may be implemented as a separate module, or it may be incorporated into the enclosure that contains the entire disk system. We should note that the term disk is often used to refer to the combined package of the disk drive and the disk it contains. We will do so in the sections that follow, when there is no ambiguity in the meaning of the term.


Organization and Accessing of Data on a Disk

The organization of data on a disk is illustrated in Figure 8.28. Each surface is divided into concentric tracks, and each track is divided into sectors. The set of corresponding tracks on all surfaces of a stack of disks forms a logical cylinder. All tracks of a cylinder can be accessed without moving the read/write heads. Data are accessed by specifying the surface number, the track number, and the sector number. Read and Write operations always start at sector boundaries.

Data bits are stored serially on each track. Each sector may contain 512 or more bytes. The data are preceded by a sector header that contains identification (addressing) information used to find the desired sector on the selected track. Following the data, there are additional bits that constitute an error-correcting code (ECC). The ECC bits are used to detect and correct errors that may have occurred in writing or reading the data bytes. There is a small inter-sector gap that enables the disk control circuitry to distinguish easily between two consecutive sectors.

An unformatted disk has no information on its tracks. The formatting process writes markers that divide the disk into tracks and sectors. During this process, the disk controller may discover some sectors or even whole tracks that are defective. The disk controller keeps a record of such defects and excludes them from use. The formatting information comprises sector headers, ECC bits, and inter-sector gaps. The capacity of a formatted disk, after accounting for the formatting information overhead, is the proper indicator of the disk's storage capability. After formatting, the disk is divided into logical partitions.

Figure 8.28 indicates that each track has the same number of sectors, which means that all tracks have the same storage capacity. In this case, the stored information is packed more densely on inner tracks than on outer tracks. It is also possible to increase the storage density by placing more sectors on the outer tracks, which have longer circumference. This would be at the expense of more complicated access circuitry.
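For instance, the user capacity of a hypothetical formatted disk with equal sectors per track can be computed as follows; the geometry below is an assumption for illustration, not a real product.

```python
# Formatted capacity of a disk with the same number of sectors on every
# track: capacity = surfaces x tracks x sectors x bytes per sector.
surfaces = 8                 # hypothetical geometry
tracks_per_surface = 50_000
sectors_per_track = 400
bytes_per_sector = 512       # user data only; headers, ECC, and gaps excluded

capacity = surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector
print(capacity / 10**9)      # → 81.92 (GB of user data)
```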

[Figure 8.28 Organization of one surface of a disk: concentric tracks, each divided into sectors (e.g., sector 0, track 0; sector 0, track 1; sector 3, track n).]


Access Time

There are two components involved in the time delay between the disk receiving an address and the beginning of the actual data transfer. The first, called the seek time, is the time required to move the read/write head to the proper track. This time depends on the initial position of the head relative to the track specified in the address. Average values are in the 5- to 8-ms range. The second component is the rotational delay, also called latency time, which is the time taken to reach the addressed sector after the read/write head is positioned over the correct track. On average, this is the time for half a rotation of the disk. The sum of these two delays is called the disk access time. If only a few sectors of data are accessed in a single operation, the access time is at least an order of magnitude longer than the time it takes to transfer the data.
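For example, assuming a hypothetical drive with a 6-ms average seek time and a 7200-rpm spindle speed:

```python
# Average disk access time = average seek time + average rotational delay.
seek_ms = 6.0     # assumed average seek time
rpm = 7200        # assumed spindle speed

rotation_ms = 60_000 / rpm     # one full rotation takes about 8.33 ms
latency_ms = rotation_ms / 2   # on average, half a rotation
access_ms = seek_ms + latency_ms

print(round(access_ms, 2))     # → 10.17 (ms)
```

At roughly 10 ms per access, reading a single 512-byte sector spends far more time positioning the head than transferring the data, which is the order-of-magnitude gap noted above.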

Data Buffer/Cache

A disk drive is connected to the rest of a computer system using some standard interconnection scheme, such as SCSI or SATA. The interconnection hardware is usually capable of transferring data at much higher rates than the rate at which data can be read from disk tracks. An efficient way to deal with the possible differences in transfer rates is to include a data buffer in the disk unit. The buffer is a semiconductor memory, capable of storing a few megabytes of data. The requested data are transferred between the disk tracks and the buffer at a rate dependent on the rotational speed of the disk. Transfers between the data buffer and the main memory can then take place at the maximum rate allowed by the interconnect between them.

The data buffer in the disk controller can also be used to provide a caching mechanism for the disk. When a Read request arrives at the disk, the controller can first check to see if the desired data are already available in the buffer. If so, the data are transferred to the memory in microseconds instead of milliseconds. Otherwise, the data are read from a disk track in the usual way, stored in the buffer, then transferred to the memory. Because of locality of reference, a subsequent request is likely to refer to data that sequentially follow the data specified in the current request. In anticipation of future requests, the disk controller may read more data than needed and place them into the buffer. When used as a cache, the buffer is typically large enough to store entire tracks of data. So, a possible strategy is to begin transferring the contents of the track into the data buffer as soon as the read/write head is positioned over the desired track.
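The whole-track caching strategy can be sketched in a few lines. The class below is an illustrative model, not a real controller interface: a miss loads the entire track into the buffer, so that subsequent requests for sectors on the same track are served at buffer speed.

```python
class DiskTrackCache:
    """Minimal sketch of a disk controller buffer that caches whole
    tracks (names and structure are illustrative)."""

    def __init__(self, read_track_fn):
        self.read_track = read_track_fn   # slow, millisecond-scale access
        self.buffer = {}                  # track number -> list of sectors

    def read_sector(self, track, sector):
        if track not in self.buffer:              # miss: fetch whole track
            self.buffer[track] = self.read_track(track)
        return self.buffer[track][sector]         # hit: buffer-speed access

# Usage with a fake disk holding one track of four sectors:
disk = {0: ["s0", "s1", "s2", "s3"]}
cache = DiskTrackCache(lambda t: disk[t])
first = cache.read_sector(0, 1)    # loads track 0, returns "s1"
second = cache.read_sector(0, 2)   # served from the buffer, returns "s2"
```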

Disk Controller

Operation of a disk drive is controlled by a disk controller circuit, which also provides an interface between the disk drive and the rest of the computer system. One disk controller may be used to control more than one drive.

A disk controller that communicates directly with the processor contains a number of registers that can be read and written by the operating system. Thus, communication between the OS and the disk controller is achieved in the same manner as with any I/O interface, as discussed in Chapter 7. The disk controller uses the DMA scheme to transfer data between the disk and the main memory. Actually, these transfers are from/to the data buffer, which is implemented as a part of the disk controller module. The OS initiates the transfers by issuing Read and Write requests, which entail loading the controller's


registers with the necessary addressing and control information. Typically, this information includes:

Main memory address—The address of the first main memory location of the block of words involved in the transfer.

Disk address—The location of the sector containing the beginning of the desired block of words.

Word count—The number of words in the block to be transferred.

The disk address issued by the OS is a logical address. The corresponding physical address on the disk may be different. For example, bad sectors may be detected when the disk is formatted. The disk controller keeps track of such sectors and maintains the mapping between logical and physical addresses. Normally, a few spare sectors are kept on each track, or on another track in the same cylinder, to be used as substitutes for the bad sectors.
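The logical-to-physical mapping with spare-sector substitution can be illustrated with a minimal sketch. The function name and the use of a Python dictionary are illustrative; a real controller keeps the map in its own internal structures.

```python
def make_sector_map(num_sectors, bad_sectors, spare_sectors):
    """Map each logical sector to a physical one. A good sector maps
    to itself; a bad sector is transparently replaced by a spare."""
    spares = iter(spare_sectors)
    return {s: (next(spares) if s in bad_sectors else s)
            for s in range(num_sectors)}

# Physical sectors 2 and 5 were found bad during formatting;
# spares 100 and 101 stand in for them.
m = make_sector_map(8, bad_sectors={2, 5}, spare_sectors=[100, 101])
# The OS keeps addressing logical sectors 0..7 as if nothing happened.
```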

On the disk drive side, the controller’s major functions are:

Seek—Causes the disk drive to move the read/write head from its current position to the desired track.

Read—Initiates a Read operation, starting at the address specified in the disk address register. Data read serially from the disk are assembled into words and placed into the data buffer for transfer to the main memory. The number of words is determined by the word count register.

Write—Transfers data to the disk, using a control method similar to that for Read operations.

Error checking—Computes the error correcting code (ECC) value for the data read from a given sector and compares it with the corresponding ECC value read from the disk. In the case of a mismatch, it corrects the error if possible; otherwise, it raises an interrupt to inform the OS that an error has occurred. During a Write operation, the controller computes the ECC value for the data to be written and stores this value on the disk.

Floppy Disks

The disks discussed above are known as hard or rigid disk units. Floppy disks are smaller, simpler, and cheaper disk units that consist of a flexible, removable, plastic diskette coated with magnetic material. The diskette is enclosed in a plastic jacket, which has an opening where the read/write head can be positioned. A hole in the center of the diskette allows a spindle mechanism in the disk drive to position and rotate the diskette.

The main features of floppy disks are their low cost and shipping convenience. However, they have much smaller storage capacities, longer access times, and higher failure rates than hard disks. In recent years, they have largely been replaced by CDs, DVDs, and flash cards as portable storage media.


RAID Disk Arrays

Processor speeds have increased dramatically. At the same time, access times to disk drives are still on the order of milliseconds, because of the limitations of the mechanical motion involved. One way to reduce access time is to use multiple disks operating in parallel. In 1988, researchers at the University of California-Berkeley proposed such a storage system [5]. They called it RAID, for Redundant Array of Inexpensive Disks. (Since all disks are now inexpensive, the acronym was later reinterpreted as Redundant Array of Independent Disks.) Using multiple disks also makes it possible to improve the reliability of the overall system. Different configurations were proposed, and many more have been developed since.

The basic configuration, known as RAID 0, is simple. A single large file is stored on several separate disk units by dividing the file into a number of smaller pieces and storing these pieces on different disks. This is called data striping. When the file is accessed for a Read operation, all disks access their portions of the data in parallel. As a result, the rate at which the data can be transferred is equal to the data rate of an individual disk times the number of disks. However, access time, that is, the seek and rotational delay needed to locate the beginning of the data on each disk, is not reduced. Since each disk operates independently, access times vary. Individual pieces of the data are buffered, so that the complete file can be reassembled and transferred to the memory as a single entity.
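Data striping in RAID 0 can be sketched as follows; the stripe size and disk count are illustrative parameters, and real arrays deal in fixed-size disk blocks rather than byte strings:

```python
def stripe(data: bytes, num_disks: int, stripe_size: int):
    """RAID 0: deal fixed-size stripes of a file round-robin across disks."""
    disks = [bytearray() for _ in range(num_disks)]
    for i in range(0, len(data), stripe_size):
        disks[(i // stripe_size) % num_disks] += data[i:i + stripe_size]
    return disks

def unstripe(disks, stripe_size, total_len):
    """Reassemble the original file from the buffered pieces."""
    out = bytearray()
    offsets = [0] * len(disks)
    d = 0
    while len(out) < total_len:
        out += disks[d][offsets[d]:offsets[d] + stripe_size]
        offsets[d] += stripe_size
        d = (d + 1) % len(disks)
    return bytes(out)

data = bytes(range(20))
disks = stripe(data, num_disks=4, stripe_size=3)
assert unstripe(disks, 3, len(data)) == data   # file reassembles intact
```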

Various RAID configurations form a hierarchy, with each level in the hierarchy providing additional features. For example, RAID 1 is intended to provide better reliability by storing identical copies of the data on two disks rather than just one. The two disks are said to be mirrors of each other. If one disk drive fails, all Read and Write operations are directed to its mirror drive. Other levels of the hierarchy achieve increased reliability through various parity-checking schemes, without requiring a full duplication of disks. Some also have error-recovery capability.

The RAID concept has gained commercial acceptance. RAID systems are available from many manufacturers for use with a variety of operating systems.

8.10.2 Optical Disks

Storage devices can also be implemented using optical means. The familiar compact disk (CD), used in audio systems, was the first practical application of this technology. Soon after, the optical technology was adapted to the computer environment to provide a high-capacity read-only storage medium known as a CD-ROM.

The first generation of CDs was developed in the mid-1980s by the Sony and Philips companies. The technology exploited the possibility of using a digital representation for analog sound signals. To provide high-quality sound recording and reproduction, 16-bit samples of the analog signal are taken at a rate of 44,100 samples per second. Initially, CDs were designed to hold up to 75 minutes of audio, requiring a total of about 3 × 10^9 bits (3 gigabits) of storage. Since then, higher-capacity devices have been developed.

CD Technology

The optical technology that is used for CD systems makes use of the fact that laser light can be focused on a very small spot. A laser beam is directed onto a spinning disk,


with tiny indentations arranged to form a long spiral track on its surface. The indentations reflect the focused beam toward a photodetector, which detects the stored binary patterns.

The laser emits a coherent light beam that is sharply focused on the surface of the disk. Coherent light consists of synchronized waves that have the same wavelength. If a coherent light beam is combined with another beam of the same kind, and the two beams are in phase, the result is a brighter beam. But, if the waves of the two beams are 180 degrees out of phase, they cancel each other. Thus, a photodetector can be used to detect the beams. It will see a bright spot in the first case and a dark spot in the second case.

A cross-section of a small portion of a CD is shown in Figure 8.29a. The bottom layer is made of transparent polycarbonate plastic, which serves as a clear glass base. The surface of this plastic is programmed to store data by indenting it with pits. The unindented parts are called lands. A thin layer of reflecting aluminum material is placed on top of a programmed disk. The aluminum is then covered by a protective acrylic. Finally, the topmost layer is deposited and stamped with a label. The total thickness of the disk is 1.2 mm, almost all of it contributed by the polycarbonate plastic. The other layers are very thin.

The laser source and the photodetector are positioned below the polycarbonate plastic. The emitted beam travels through the plastic layer, reflects off the aluminum layer, and travels back toward the photodetector. Note that from the laser side, the pits actually appear as bumps rising above the lands.

Figure 8.29b shows what happens as the laser beam scans across the disk and encounters a transition from a pit to a land. Three different positions of the laser source and the detector are shown, as would occur when the disk is rotating. When the light reflects solely from a pit, or from a land, the detector sees the reflected beam as a bright spot. But, a different situation arises when the beam moves over the edge between a pit and the adjacent land. The pit is one quarter of a wavelength closer to the laser source. Thus, the reflected beams from the pit and the adjacent land will be 180 degrees out of phase, cancelling each other. Hence, the detector will not see a reflected beam at pit-land and land-pit transitions, and will detect a dark spot.

Figure 8.29c depicts several transitions between lands and pits. If each transition, detected as a dark spot, is taken to denote the binary value 1, and the flat portions represent 0s, then the detected binary pattern will be as shown in the figure. This pattern is not a direct representation of the stored data. CDs use a complex encoding scheme to represent data. Each byte of data is represented by a 14-bit code, which provides considerable error detection capability. We will not delve into details of this code.
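The transition-based readout can be mimicked in a few lines. The sketch below assumes one surface symbol per bit cell and reads the first cell as 0, a simplification of the actual readout:

```python
def detect_transitions(surface):
    """surface: a string such as "LLPPPL" (L = land, P = pit), one
    symbol per bit cell. A dark spot at a land-pit or pit-land
    transition reads as 1; a steady reflection reads as 0. The first
    cell has no predecessor and is read as 0 here (a simplification)."""
    bits = [0]
    for prev, cur in zip(surface, surface[1:]):
        bits.append(1 if prev != cur else 0)
    return bits

pattern = detect_transitions("LLPPPL")   # transitions at cells 2 and 5
```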

The pits are arranged on a long track on the surface of the disk, spiraling from the middle of the disk toward the outer edge. But, it is customary to refer to each circular path spanning 360 degrees as a separate track, which is analogous to the terminology used for magnetic disks. The CD is 120 mm in diameter, with a 15-mm hole in the center. The tracks cover the area from a 25-mm radius to a 58-mm radius. The space between the tracks is 1.6 microns. Pits are 0.5 microns wide and 0.8 to 3 microns long. There are more than 15,000 tracks on a disk. If the entire track spiral were unraveled, it would be over 5 km long!

CD-ROM

Since CDs store information in a binary form, they are suitable for use as a storage medium in computer systems. The main challenge is to ensure the integrity of stored data.


[Figure 8.29 Optical disk. (a) Cross-section: polycarbonate plastic base, aluminum reflecting layer, protective acrylic, and label, with the laser source and detector below the plastic. (b) Transition from pit to land: the detector sees a reflection over a pit or a land, but no reflection at a transition. (c) Stored binary pattern: each pit-land or land-pit transition is read as a 1; flat stretches are read as 0s.]

Because the pits are very small, it is difficult to implement all of the pits perfectly. In audio and video applications, some errors in the data can be tolerated, because they are unlikely to affect the reproduced sound or image in a perceptible way. However, such errors are not acceptable in computer applications. Since physical imperfections cannot be avoided, it is


necessary to use additional bits to provide error detection and correction capability. The CDs used to store computer data are called CD-ROMs, because, like semiconductor ROM chips, their contents can only be read.

Stored data are organized on CD-ROM tracks in the form of blocks called sectors. There are several different formats for a sector. One format, known as Mode 1, uses 2352-byte sectors. There is a 16-byte header that contains a synchronization field used to detect the beginning of the sector and addressing information used to identify the sector. This is followed by 2048 bytes of stored data. At the end of the sector, there are 288 bytes used to implement the error-correcting scheme. The number of sectors per track is variable; there are more sectors on the longer outer tracks. With the Mode 1 format, a CD-ROM has a storage capacity of about 650 Mbytes.
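The Mode 1 numbers can be checked with simple arithmetic:

```python
# Mode 1 CD-ROM sector layout, as described above (sizes in bytes):
HEADER, DATA, ECC = 16, 2048, 288
sector_size = HEADER + DATA + ECC          # 2352 bytes per sector

# Sectors needed to hold 650 Mbytes of user data:
sectors = 650 * 1024 * 1024 // DATA        # 332,800 sectors

# Each sector carries 2048 data bytes but occupies 2352 bytes on the
# disk, so header and ECC overhead is about 13% of the sector size.
overhead = 1 - DATA / sector_size
```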

Error detection and correction is done at more than one level. As mentioned earlier, each byte of information stored on a CD is encoded using a 14-bit code that has some error-correcting capability. This code can correct single-bit errors. Errors that occur in short bursts, affecting several bits, are detected and corrected using the error-checking bits at the end of the sector.

CD-ROM drives operate at a number of different rotational speeds. The basic speed, known as 1X, is 75 sectors per second. This provides a data rate of 153,600 bytes/s (150 Kbytes/s), using the Mode 1 format. Higher-speed CD-ROM drives are identified in relation to the basic speed. Thus, a 56X CD-ROM has a data transfer rate that is 56 times that of the 1X CD-ROM, or more than 8 Mbytes/s. This transfer rate is considerably lower than the transfer rates of magnetic hard disks, which are in the range of tens of megabytes per second. Another significant difference in performance is the seek time, which in CD-ROMs may be several hundred milliseconds. So, in terms of performance, CD-ROMs are clearly inferior to magnetic disks. Their attraction lies in their small physical size, low cost, and ease of handling as a removable and transportable mass-storage medium. As a result, they are widely used for the distribution of software, textbooks, application programs, video games, and so on.
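The nominal data rates follow directly from the Mode 1 sector format:

```python
SECTOR_DATA = 2048          # Mode 1 user-data bytes per sector
BASE_SECTORS_PER_S = 75     # 1X speed: 75 sectors per second

rate_1x = BASE_SECTORS_PER_S * SECTOR_DATA   # 153,600 bytes/s (150 Kbytes/s)
rate_56x = 56 * rate_1x                      # 8,601,600 bytes/s (~8.6 Mbytes/s)
```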

CD-Recordable

The CDs described above are read-only devices, in which the information is stored at the time of manufacture. First, a master disk is produced using a high-power laser to burn holes that correspond to the required pits. A mold is then made from the master disk, which has bumps in the place of holes. Copies are made by injecting molten polycarbonate plastic into the mold to make CDs that have the same pattern of holes (pits) as the master disk. This process is clearly suitable only for volume production of CDs containing the same information.

A new type of CD was developed in the late 1990s on which data can be easily recorded by a computer user. It is known as CD-Recordable (CD-R). A shiny spiral track covered by an organic dye is implemented on a disk during the manufacturing process. Then, a laser in a CD-R drive burns pits into the organic dye. The burned spots become opaque. They reflect less light than the shiny areas when the CD is being read. This process is irreversible, which means that the written data are stored permanently. Unused portions of a disk can be used to store additional data at a later time.


CD-Rewritable

The most flexible CDs are those that can be written multiple times by the user. They are known as CD-RWs (CD-ReWritables).

The basic structure of CD-RWs is similar to the structure of CD-Rs. Instead of using an organic dye in the recording layer, an alloy of silver, indium, antimony, and tellurium is used. This alloy has interesting and useful behavior when it is heated and cooled. If it is heated above its melting point (500 degrees C) and then cooled down, it goes into an amorphous state in which it absorbs light. But, if it is heated only to about 200 degrees C and this temperature is maintained for an extended period, a process known as annealing takes place, which leaves the alloy in a crystalline state that allows light to pass through. If the crystalline state represents land area, pits can be created by heating selected spots past the melting point. The stored data can be erased using the annealing process, which returns the alloy to a uniform crystalline state. A reflective material is placed above the recording layer to reflect the light when the disk is read.

A CD-RW drive uses three different laser powers. The highest power is used to record the pits. The middle power is used to put the alloy into its crystalline state; it is referred to as the "erase power." The lowest power is used to read the stored information.

CD drives designed to read and write CD-RW disks can usually be used with other compact disk media. They can read CD-ROMs, and can read and write CD-Rs. They are designed to meet the requirements of standard interconnection interfaces, such as SATA and USB.

CD-RW disks provide low-cost storage media. They are suitable for archival storage of information that may range from databases to photographic images. They can be used for low-volume distribution of information, just like CD-Rs, and for backup purposes. The CD-RW technology has made CD-Rs less relevant because it offers superior capability at only slightly higher cost.

DVD Technology

The success of CD technology and the continuing quest for greater storage capability have led to the development of DVD (Digital Versatile Disk) technology. The first DVD standard was defined in 1996 by a consortium of companies, with the objective of being able to store a full-length movie on one side of a DVD disk.

The physical size of a DVD disk is the same as that of a CD. The disk is 1.2 mm thick, and it is 120 mm in diameter. Its storage capacity is made much larger than that of CDs by several design changes:

• A red-light laser with a wavelength of 635 nm is used instead of the infrared-light laser used in CDs, which has a wavelength of 780 nm. The shorter wavelength makes it possible to focus the light to a smaller spot.

• Pits are smaller, having a minimum length of 0.4 micron.

• Tracks are placed closer together; the distance between tracks is 0.74 micron.

Using these improvements leads to a DVD capacity of 4.7 Gbytes.

Further increases in capacity have been achieved by going to two-layered and two-sided disks. The single-layered single-sided disk, defined in the standard as DVD-5, has a structure


that is almost the same as the CD in Figure 8.29a. A double-layered disk makes use of two layers on which tracks are implemented on top of each other. The first layer is the clear base, as in CD disks. But, instead of using reflecting aluminum, the lands and pits of this layer are covered by a translucent material that acts as a semi-reflector. The surface of this material is then also programmed with indented pits to store data. A reflective material is placed on top of the second layer of pits and lands. The disk is read by focusing the laser beam on the desired layer. When the beam is focused on the first layer, sufficient light is reflected by the translucent material to detect the stored binary patterns. When the beam is focused on the second layer, the light reflected by the reflective material corresponds to the information stored on this layer. In both cases, the layer on which the beam is not focused reflects a much smaller amount of light, which is eliminated by the detector circuit as noise. The total storage capacity of both layers is 8.5 Gbytes. This disk is called DVD-9 in the standard.

Two single-sided disks can be put together to form a sandwich-like structure in which the top disk is turned upside down. This can be done with single-layered disks, as specified in DVD-10, giving a composite disk with a capacity of 9.4 Gbytes. It can also be done with double-layered disks, as specified in DVD-18, yielding a capacity of 17 Gbytes.

Access times for DVD drives are similar to those of CD drives. However, when DVD disks rotate at the same speed, the data transfer rates are much higher because of the higher density of pits. Rewritable versions of DVD devices have also been developed, providing large storage capacities.

8.10.3 Magnetic Tape Systems

Magnetic tapes are suited for off-line storage of large amounts of data. They are typically used for backup purposes and for archival storage. Magnetic-tape recording uses the same principle as magnetic disks. The main difference is that the magnetic film is deposited on a very thin 0.5- or 0.25-inch-wide plastic tape. Seven or nine bits (corresponding to one character) are recorded in parallel across the width of the tape, perpendicular to the direction of motion. A separate read/write head is provided for each bit position on the tape, so that all bits of a character can be read or written in parallel. One of the character bits is used as a parity bit.
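The parallel recording of a character with a parity bit can be illustrated for a nine-track tape. Odd parity is assumed here; actual systems may use either convention:

```python
def tape_character(data_bits):
    """Nine-track tape frame: 8 data bits recorded in parallel across
    the tape plus one parity bit. Odd parity is assumed (illustrative;
    the text does not specify the convention)."""
    assert len(data_bits) == 8
    parity = 1 - (sum(data_bits) % 2)    # make the total count of 1s odd
    return data_bits + [parity]

frame = tape_character([1, 0, 1, 1, 0, 0, 1, 0])   # 9 bits, odd parity
```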

Data on the tape are organized in the form of records separated by gaps, as shown in Figure 8.30. Tape motion is stopped only when a record gap is underneath the read/write heads. The record gaps are long enough to allow the tape to attain its normal speed before the beginning of the next record is reached. If a coding scheme such as that in Figure 8.27c is used for recording data on the tape, record gaps are identified as areas where there is no change in magnetization. This allows record gaps to be detected independently of the recorded data. To help users organize large amounts of data, a group of related records is called a file. The beginning of a file is identified by a file mark, as shown in Figure 8.30. The file mark is a special single- or multiple-character record, usually preceded by a gap longer than the inter-record gap. The first record following a file mark can be used as a header or identifier for the file. This allows the user to search a tape containing a large number of files for a particular file.


[Figure 8.30 Organization of data on magnetic tape: 7 or 9 bits recorded across the tape width, records separated by record gaps, and files delimited by file marks preceded by file gaps.]

Cartridge Tape System

Tape systems have been developed for backup of on-line disk storage. One such system uses an 8-mm video-format tape housed in a cassette. These units are called cartridge tapes. They have capacities in the range of 2 to 5 gigabytes and handle data transfers at the rate of a few hundred kilobytes per second. Reading and writing is done by a helical scan system operating across the tape, similar to that used in video cassette tape drives. Bit densities of tens of millions of bits per square inch are achievable. Multiple-cartridge systems are available that automate the loading and unloading of cassettes so that tens of gigabytes of on-line storage can be backed up unattended.

8.11 Concluding Remarks

The design of the memory hierarchy is critical to the performance of a computer system. Modern operating systems and application programs place heavy demands on both the capacity and speed of the memory. In this chapter, we presented the most important technological and organizational details of memory systems and how they have evolved to meet these demands.

Developments in semiconductor technology have led to significant improvements in the speed and capacity of memory chips, accompanied by a large decrease in the cost per bit. The performance of computer memories is enhanced further by the use of a memory hierarchy. Today, a large yet affordable main memory is implemented with dynamic memory chips. One or more levels of cache memory are always provided. The introduction of the cache memory reduces significantly the effective memory access time seen by the processor. Virtual memory makes the main memory appear larger than the physical memory.

Magnetic disks continue to be the primary technology for secondary storage. They provide enormous storage capacity, reaching and exceeding a trillion bytes on a single drive, with a very low cost per bit. But, flash semiconductor technology is beginning to compete effectively in some applications.


8.12 Solved Problems

This section presents some examples of the types of problems that a student may be asked to solve, and shows how such problems can be solved.

Example 8.2 Problem: Describe a structure similar to the one in Figure 8.10 for an 8M × 32 memory using 512K × 8 memory chips.

Solution: The required structure is essentially the same as in Figure 8.10, except that 16 rows are needed, each with four 512K × 8 chips. Address lines A18−0 should be connected to all chips. Address lines A22−19 should be connected to a 4-bit decoder to select one of the 16 rows.
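The address-line split in this solution can be verified numerically:

```python
# 8M x 32 memory built from 512K x 8 chips:
chip_words = 512 * 1024         # words per chip
total_words = 8 * 1024 * 1024   # words in the full memory

rows = total_words // chip_words            # 16 rows of chips
chips_per_row = 32 // 8                     # 4 chips form a 32-bit word
chip_addr_bits = chip_words.bit_length() - 1    # 19 bits: A18-A0 to all chips
row_select_bits = rows.bit_length() - 1         # 4 bits: A22-A19 to the decoder
```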

Example 8.3 Problem: A computer system uses 32-bit memory addresses and it has a main memory consisting of 1G bytes. It has a 4K-byte cache organized in the block-set-associative manner, with 4 blocks per set and 64 bytes per block.

(a) Calculate the number of bits in each of the Tag, Set, and Word fields of the memory address.

(b) Assume that the cache is initially empty. Suppose that the processor fetches 1088 words of four bytes each from successive word locations starting at location 0. It then repeats this fetch sequence nine more times. If the cache is 10 times faster than the memory, estimate the improvement factor resulting from the use of the cache. Assume that the LRU algorithm is used for block replacement.

Solution: Consecutive addresses refer to bytes.

(a) A block has 64 bytes; hence the Word field is 6 bits long. With 4 × 64 = 256 bytes in a set, there are 4K/256 = 16 sets, requiring a Set field of 4 bits. This leaves 32 − 4 − 6 = 22 bits for the Tag field.
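The field widths in part (a) can be computed directly:

```python
import math

# 4K-byte, 4-way set-associative cache with 64-byte blocks:
block_bytes = 64
cache_bytes = 4 * 1024
ways = 4
address_bits = 32

word_bits = int(math.log2(block_bytes))        # 6-bit Word field
sets = cache_bytes // (ways * block_bytes)     # 16 sets
set_bits = int(math.log2(sets))                # 4-bit Set field
tag_bits = address_bits - set_bits - word_bits # 22-bit Tag field
```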

(b) The 1088 words constitute 68 blocks, occupying blocks 0 to 67 in the memory. The cache has space for 64 blocks. Hence, after blocks 0, 1, 2, . . . , 63 have been read from the memory into the cache on the first pass, the cache is full. The next four blocks, numbered 64 to 67, map to sets 0, 1, 2, and 3. Each of them will replace the least recently used cache block in its set, which is block 0. During the second pass, memory block 0 has to be reloaded into set 0 of the cache, since it has been overwritten by block 64. It will be placed in the least recently used block of set 0 at that point, which is block 1. Next, memory blocks 1, 2, and 3 will replace block 1 of sets 1, 2, and 3 in the cache, respectively. Memory blocks 4 to 15 will be found in the cache. Memory blocks 16 to 19, which were in block location 1 of sets 0 to 3, have now been overwritten, and will be reloaded in block location 2 of these sets.


As execution proceeds, all memory blocks that occupy the first four of the 16 cache sets are always overwritten before they can be used on a succeeding pass. Memory blocks 0, 16, 32, 48, and 64 continually displace each other as they compete for the 4 block positions in cache set 0. The same thing occurs in cache set 1 (memory blocks 1, 17, 33, 49, 65), cache set 2 (memory blocks 2, 18, 34, 50, 66), and cache set 3 (memory blocks 3, 19, 35, 51, 67). Memory blocks that occupy the last 12 sets (sets 4 through 15) are fetched once on the first pass and remain in the cache for the next 9 passes.

In summary, on the first pass, all 68 blocks of the loop are fetched from the memory. On each of the 9 successive passes, 48 blocks are found in sets 4 through 15 of the cache, and the remaining 20 blocks must be fetched from the memory. Let τ be the access time of the cache. Therefore,

Improvement factor = Time without cache / Time with cache
                   = (10 × 68 × 10τ) / (1 × 68 × 11τ + 9(20 × 11τ + 48τ))
                   = 2.15
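The improvement factor can be checked numerically; τ is factored out of all the terms:

```python
# Counts from the analysis above: 68 blocks per pass, 10 passes.
# A block fetched from memory costs 10 cache-access times (tau);
# a block found in the cache costs 1.
without_cache = 10 * 68 * 10            # every access goes to memory
first_pass    = 1 * 68 * 11             # all 68 blocks miss: 10 + 1 each
later_passes  = 9 * (20 * 11 + 48 * 1)  # 20 misses, 48 hits per pass
improvement = without_cache / (first_pass + later_passes)   # ~2.15
```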

This example illustrates a weakness of the LRU algorithm during the execution of program loops. See Problem 8.9 for the performance of an alternative algorithm in this case.

Example 8.4 Problem: Suppose that a computer has a processor with two L1 caches, one for instructions and one for data, and an L2 cache. Let τ be the access time for the two L1 caches. The miss penalties are approximately 15τ for transferring a block from L2 to L1, and 100τ for transferring a block from the main memory to L2. For the purpose of this problem, assume that the hit rates are the same for instructions and data and that the hit rates in the L1 and L2 caches are 0.96 and 0.80, respectively.

(a) What fraction of accesses miss in both the L1 and L2 caches, thus requiring access to the main memory?

(b) What is the average access time as seen by the processor?

(c) Suppose that the L2 cache has an ideal hit rate of 1. By what factor would this reduce the average memory access time as seen by the processor?

(d) Consider the following change to the memory hierarchy. The L2 cache is removed and the size of the L1 caches is increased so that their miss rate is cut in half. What is the average memory access time as seen by the processor in this case?


Solution: The average memory access time with one cache level is given in Section 8.7.1 as

tavg = hC + (1 − h)M

With L1 and L2 caches, the average memory access time is given in Section 8.7.2 as

tavg = h1C1 + (1 − h1)(h2C2 + (1 − h2)M)

(a) The fraction of memory accesses that miss in both the L1 and L2 caches is

(1 − h1)(1 − h2) = (1 − 0.96)(1 − 0.80) = 0.008

(b) The average memory access time using two cache levels is

tavg = 0.96τ + 0.04(0.80 × 15τ + 0.20 × 100τ) = 2.24τ

(c) With no misses in the L2 cache, we get:

tavg(ideal) = 0.96τ + 0.04 × 15τ = 1.56τ

Therefore,

tavg(actual) / tavg(ideal) = 2.24τ / 1.56τ = 1.44

(d) With larger L1 caches and the L2 cache removed, the access time is

tavg = 0.98τ + 0.02 × 100τ = 2.98τ
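All four parts of this example can be evaluated with a few lines; times are expressed in units of τ:

```python
h1, h2 = 0.96, 0.80        # L1 and L2 hit rates
C1, C2, M = 1, 15, 100     # access costs in units of tau

miss_both = (1 - h1) * (1 - h2)                         # (a) 0.008
t_avg = h1 * C1 + (1 - h1) * (h2 * C2 + (1 - h2) * M)   # (b) 2.24 tau
t_ideal = h1 * C1 + (1 - h1) * C2                       # (c) 1.56 tau
factor = t_avg / t_ideal                                #     ~1.44
t_no_l2 = 0.98 * C1 + 0.02 * M                          # (d) 2.98 tau
```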

Example 8.5 Problem: A 1024 × 1024 array of 32-bit numbers is to be normalized as follows. For each column, the largest element is found and all elements of the column are divided by the value of this element. Assume that each page in the virtual memory consists of 4K bytes, and that 1M bytes of the main memory are allocated for storing array data during this computation. Assume that it takes 10 ms to load a page from the disk into the main memory when a page fault occurs.

(a) Assume that the array is processed one column at a time. How many page faults would occur and how long does it take to complete the normalization process if the elements of the array are stored in column order in the virtual memory?

(b) Repeat part (a) assuming that the elements are stored in row order.

(c) Propose an alternative way for processing the array to reduce the number of page faults when the array is stored in the memory in row order. Estimate the number of page faults and the time needed for your solution.


Solution: Each 32-bit number comprises 4 bytes. Hence, each page holds 1024 numbers. There is space for 256 pages in the 1M-byte portion of the main memory that is allocated for storing data during the computation.

(a) Each column is stored in one page; there is a page fault to bring each column to the main memory, for a total of 1024 page faults.

Processing time = 1024× 10 ms = 10.24 s.

(b) Processing of each column requires two passes, the first to find the largest element and the second to perform the normalization. When processing the first column, each element access results in a page fault that brings all elements of the corresponding row into the main memory. After 256 elements have been examined, the main memory is full. Accessing the next 256 elements results in page faults that replace all the data in the memory, and the process repeats. Thus, a page fault occurs for every access to every element in the array.

Processing time = 2× 1024× 1024× 10 ms = 20,972 s = 5.8 hours.

(c) A more efficient alternative for this arrangement of the data is to complete the first pass for only one quarter of each column for all columns, then process the second quarter, and so on. The second pass is handled in the same way. In this case, each pass through the array results in 1024 page faults, for a total of 2048.

Processing time = 2048× 10 ms = 20.48 s.

This example illustrates how the number of page faults can increase dramatically in some cases when the size of the main memory is insufficient for the application. This behavior is called thrashing.
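The fault counts for the three strategies can be tallied in a few lines. This is a back-of-the-envelope model of ours using the example's parameters, not a full virtual-memory simulator:

```python
# Page-fault arithmetic for Example 8.5 (an illustrative model, ours).
N = 1024                            # the array is N x N 32-bit numbers
PAGE_WORDS = 4096 // 4              # a 4K-byte page holds 1024 numbers
MEM_PAGES = (1 << 20) // 4096       # 1M bytes of main memory -> 256 pages
FAULT_MS = 10                       # disk-to-memory load time per fault

faults_a = N                        # (a) column order: one page per column
faults_b = 2 * N * N                # (b) row order: memory thrashes, so a fault
                                    #     on every access, over two passes
faults_c = 2 * N                    # (c) quarter-column strips: each of the two
                                    #     passes loads all 1024 row pages once

for f in (faults_a, faults_b, faults_c):
    print(f, "faults =", f * FAULT_MS / 1000, "s")
```

The three totals, 1024, 2097152, and 2048 faults, give the 10.24 s, roughly 5.8 hours, and 20.48 s of the solution.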

Example 8.6

Problem: Consider a long sequence of accesses to a disk with an average seek time of 6 ms and an average rotational delay of 3 ms. The average size of a block being accessed is 8K bytes. The data transfer rate from the disk is 34 Mbytes/sec.

(a) Assuming that the data blocks are randomly located on the disk, estimate the average percentage of the total time occupied by seek operations and rotational delays.

(b) Repeat part (a) for the situation in which disk accesses are arranged so that in 90 percent of the cases, the next access will be to a data block on the same cylinder.

Solution: It takes 8K/34M = 0.23 ms to transfer a block of data.

(a) The total time needed to access each block is 6 + 3 + 0.23 = 9.23 ms. The portion of time occupied by seek and rotational delay is 9/9.23 = 0.97 = 97%.


(b) In 90% of the cases, only rotational delays are involved. Therefore, the average time to access a block is 0.9 × 3 + 0.1 × 9 + 0.23 = 3.83 ms. The portion of time occupied by seek and rotational delay is 3.6/3.83 = 0.94 = 94%.
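A quick script (ours; the variable names are not from the text) is a useful sanity check on this timing arithmetic:

```python
# Disk timing arithmetic for Example 8.6 (times in ms; our sketch).
SEEK, ROT = 6.0, 3.0                              # average seek, rotational delay
XFER = (8 * 1024) / (34 * 1024 * 1024) * 1000     # transfer 8K bytes at 34 Mbytes/s

# (a) randomly located blocks: a full seek plus rotational delay every time
t_random = SEEK + ROT + XFER
frac_random = (SEEK + ROT) / t_random

# (b) 90% of accesses stay on the same cylinder (rotational delay only)
t_avg = 0.9 * ROT + 0.1 * (SEEK + ROT) + XFER
frac_avg = (0.9 * ROT + 0.1 * (SEEK + ROT)) / t_avg

print(round(XFER, 2), round(frac_random, 2), round(t_avg, 2), round(frac_avg, 2))
```

The transfer time comes out at about 0.23 ms, so seek and rotational delay dominate in both cases.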

Problems

8.1 [M] Consider the dynamic memory cell of Figure 8.6. Assume that C = 30 femtofarads (10⁻¹⁵ F) and that the leakage current through the transistor is about 0.25 picoamperes (10⁻¹² A). The voltage across the capacitor when it is fully charged is 1.5 V. The cell must be refreshed before this voltage drops below 0.9 V. Estimate the minimum refresh rate.

8.2 [M] Consider a main memory built with SDRAM chips. Data are transferred in bursts as shown in Figure 8.9, except that the burst length is 8. Assume that 32 bits of data are transferred in parallel. If a 400-MHz clock is used, how much time does it take to transfer:

(a) 32 bytes of data

(b) 64 bytes of data

What is the latency in each case?

8.3 [E] Describe a structure similar to that in Figure 8.10 for a 16M × 32 memory using 1M × 4 memory chips.

8.4 [E] Give a critique of the following statement: “Using a faster processor chip results in a corresponding increase in performance of a computer even if the main memory speed remains the same.”

8.5 [M] The memory of a computer is byte-addressable, and the word length is 32 bits. A program consists of two nested loops, a small inner loop and a much larger outer loop. The general structure of the program is given in Figure P8.1. The decimal memory addresses shown delineate the location of the two loops and the beginning and end of the total program. All memory locations in the various sections of the program, 8-52, 56-136, 140-240, and so on, contain instructions to be executed in straight-line sequencing. The program is to be run on a computer that has an instruction cache organized in the direct-mapped manner (see Figure 8.16) with the following parameters:

Cache size 1K bytes
Block size 128 bytes

The miss penalty in the instruction cache is 80τ, where τ is the access time of the cache. Compute the total time needed for instruction fetching during execution of the program in Figure P8.1.


[Figure: the program layout. START is at address 8 and END at address 1504; an outer loop spanning addresses 56 to 1200 is executed 10 times, and within it an inner loop spanning addresses 140 to 240 is executed 20 times.]

Figure P8.1 A program structure for Problem 8.5.

8.6 [M] A computer with a 16-bit word length has a direct-mapped cache, used for both instructions and data. Memory addresses are 16 bits long, and the memory is byte-addressable. The cache is small for illustrative purposes. It contains only four 16-bit words. Each word constitutes a cache block and has an associated 13-bit tag, as shown in Figure P8.2a. Words are accessed in the cache using the low-order 3 bits of an address. When a miss occurs during a Read operation for either an instruction or a data operand, the requested word is read from the main memory and sent to the processor. At the same time, it is copied into the cache, and its block number is stored in the associated tag. Consider the following short loop, in which all instructions are 16 bits long:

LOOP: Add R0, (R1)+
      Decrement R2
      BNE LOOP

Assume that, before this loop is entered, registers R0, R1, and R2 contain 0, 054E, and 3, respectively. Also assume that the main memory contains the data shown in Figure P8.2b, where all entries are given in hexadecimal notation. The loop starts at location LOOP = 02EC. The Autoincrement address mode in the Add instruction is used to access successive numbers in a 3-number list and add them into register R0. The counter register, R2, is decremented until it reaches 0, at which point an exit is made from the loop.

(a) Starting with an empty cache, show the contents of the cache, including the tags, at the end of each pass through the loop.

(b) Assume that the access times of the cache and the main memory are τ and 10τ, respectively. Calculate the execution time for each pass, counting only memory access times.


[Figure: (a) the cache, four blocks, each holding a 13-bit tag and a 16-bit data word; (b) the main memory contents relevant to the loop, including the 3-number list A03C, 05D9, 10D7 beginning at address 054E (all values in hexadecimal).]

Figure P8.2 Cache and main memory contents in Problem 8.6.

8.7 [M] Repeat Problem 8.6 assuming that only instructions are stored in the cache. Data operands are fetched directly from the main memory and not copied into the cache. Why does this choice lead to faster execution than when both instructions and data are loaded into the cache?

8.8 [E] A block-set-associative cache consists of a total of 64 blocks, divided into 4-block sets. The main memory contains 4096 blocks, each consisting of 32 words. Assuming a 32-bit byte-addressable address space, how many bits are there in each of the Tag, Set, and Word fields?

8.9 [M] Consider the cache in Example 8.3. Assume that whenever a block is to be brought from the main memory and the corresponding set in the cache is full, the new block replaces the most recently used block of this set. Derive the solution for part (b) in this case.

8.10 [D] Section 8.6.3 illustrates the effect of different cache-mapping techniques, using the program in Figure 8.20. Suppose that this program is changed so that in the second loop the elements are handled in the same order as in the first loop; that is, the control for the second loop is specified as

for i := 0 to 9 do

Derive the equivalents of Figures 8.21 through 8.23 for this program. What conclusions can be drawn from this exercise?

8.11 [M] A byte-addressable computer has a small data cache capable of holding eight 32-bit words. Each cache block consists of one 32-bit word. When a given program is executed, the processor reads data sequentially from the following hex addresses:

200, 204, 208, 20C, 2F4, 2F0, 200, 204, 218, 21C, 24C, 2F4

This pattern is repeated four times.

(a) Assume that the cache is initially empty. Show the contents of the cache at the end of each pass through the loop if a direct-mapped cache is used, and compute the hit rate.


(b) Repeat part (a) for an associative-mapped cache that uses the LRU replacement algorithm.

(c) Repeat part (a) for a four-way set-associative cache.

8.12 [M] Repeat Problem 8.11, assuming that each cache block consists of two 32-bit words. For part (c), use a two-way set-associative cache that uses the LRU replacement algorithm.

8.13 [E] The cache block size in many computers is in the range of 32 to 128 bytes. What would be the main advantages and disadvantages of making the size of cache blocks larger or smaller?

8.14 [M] A computer has two cache levels L1 and L2. Plot two graphs for the average memory access time (y-axis) versus hit rate h1 (x-axis) for the two values h2 = 0.75 and h2 = 0.85. Use the values 0.90, 0.92, 0.94, and 0.96 for h1. Assume that the miss penalties are 15τ and 100τ for the L1 and L2 caches, respectively, where τ is the access time of the L1 caches.

8.15 [E] Consider the two-level cache described in Example 8.4. The average access time is given in the solution to part (b) of the example as 2.24τ. What value for h1 would be needed to reduce tavg to 1.5τ, if all other parameters are the same as in the example? Can the same result be achieved by improving the hit rate of L2?

8.16 [E] Consider the following analogy for the concept of caching. A serviceman comes to a house to repair the heating system. He carries a toolbox that contains a number of tools that he has used recently in similar jobs. He uses these tools repeatedly, until he reaches a point where other tools are needed. It is likely that he has the required tools in his truck outside the house. But, if the needed tools are not in the truck, he must go to his shop to get them.

Suppose we argue that the toolbox, the truck, and the shop correspond to the L1 cache, the L2 cache, and the main memory of a computer. How good is this analogy? Discuss its correct and incorrect features.

8.17 [E] The purpose of using an L2 cache is to reduce the miss penalty of the L1 cache, and in turn to reduce the memory access time as seen by the processor. An alternative is to increase the size of the L1 cache to increase its hit rate. What limits the utility of this approach?

8.18 [M] Give a critique of the assumption made in Example 8.1, in Section 8.7.1, that the miss penalty is the same for both read and write accesses. Consider both the write-through and write-back cases, as described in Section 8.6, in formulating your answer.

8.19 [M] Consider a computer system in which the available pages in the physical memory are divided among several application programs. The operating system monitors the page transfer activity and dynamically adjusts the number of pages allocated to various programs. Suggest a suitable strategy that the operating system can use to minimize the overall rate of page transfers.

8.20 [M] In a computer with a virtual-memory system, the execution of an instruction may be interrupted by a page fault. What state information has to be saved so that this instruction can be resumed later? Note that bringing a new page into the main memory involves a DMA transfer, which requires execution of other instructions. Is it simpler to abandon the interrupted instruction and completely re-execute it later? Can this be done?


8.21 [E] When a program generates a reference to a page that does not reside in the physical main memory, execution of the program is suspended until the requested page is loaded into the main memory from a disk. What difficulties might arise when an instruction in one page has an operand in a different page? What capabilities must the processor have to handle this situation?

8.22 [M] A disk unit has 24 recording surfaces. It has a total of 14,000 cylinders. There is an average of 400 sectors per track. Each sector contains 512 bytes of data.

(a) What is the maximum number of bytes that can be stored in this unit?

(b) What is the data transfer rate in bytes per second at a rotational speed of 7200 rpm?

(c) Using a 32-bit word, suggest a suitable scheme for specifying the disk address.

8.23 [M] Consider a long sequence of accesses to a disk with 8 ms average seek time, 3 ms average rotational delay, and a data transfer rate of 60 Mbytes/sec. The average size of a block being accessed is 64 Kbytes. Assume that each data block is stored in contiguous sectors.

(a) Assuming that the blocks are randomly located on the disk, estimate the average percentage of the total time occupied by seek operations and rotational delays.

(b) Suppose that 20 blocks are transferred in sequence from adjacent cylinders, reducing seek time to 1 ms. The blocks are randomly located on these cylinders. What is the total transfer time?

8.24 [M] The average seek time and rotational delay in a disk system are 6 ms and 3 ms, respectively. The rate of data transfer to or from the disk is 30 Mbytes/sec, and all disk accesses are for 8 Kbytes of data, stored in contiguous sectors. Data blocks are stored at random locations on the disk. The disk controller has an 8-Kbyte buffer. The disk controller, the processor, and the main memory are all attached to a single bus. The bus data width is 32 bits, and a single bus transfer to or from the main memory takes 10 nanoseconds.

(a) What is the maximum number of disk units that can be simultaneously transferring data to or from the main memory?

(b) What percentage of main memory accesses are used by one disk unit, on average, over a long period of time during which a sequence of independent 8-Kbyte transfers takes place?

8.25 [M] Magnetic disks are used as the secondary storage for program and data files in most virtual-memory systems. Which disk parameter(s) should influence the choice of page size?

References

1. T.C. Mowry, “Tolerating Latency through Software-Controlled Data Prefetching,” Tech. Report CSL-TR-94-628, Stanford University, Calif., 1994.

2. J.L. Baer and T.F. Chen, “An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty,” Proceedings of Supercomputing ’91, 1991, pp. 176–186.

3. J.W.C. Fu and J.H. Patel, “Stride Directed Prefetching in Scalar Processors,” Proceedings of the 24th International Symposium on Microarchitecture, 1992, pp. 102–110.

4. D. Kroft, “Lockup-Free Instruction Fetch/Prefetch Cache Organization,” Proceedings of the 8th Annual International Symposium on Computer Architecture, 1981, pp. 81–85.

5. D.A. Patterson, G.A. Gibson, and R.H. Katz, “A Case for Redundant Arrays of Inexpensive Disks (RAID),” Proceedings of the ACM SIGMOD International Conference on Management of Data, 1988, pp. 109–116.


Chapter 9

Arithmetic

Chapter Objectives

In this chapter you will learn about:

• Adder and subtractor circuits

• High-speed adders based on carry-lookahead logic circuits

• The Booth algorithm for multiplication of signed numbers

• High-speed multipliers based on carry-save addition

• Logic circuits for division

• Arithmetic operations on floating-point numbers conforming to the IEEE standard


Addition and subtraction of two numbers are basic operations at the machine-instruction level in all computers. These operations, as well as other arithmetic and logic operations, are implemented in the arithmetic and logic unit (ALU) of the processor. In this chapter, we present the logic circuits used to implement arithmetic operations. The time needed to perform addition or subtraction affects the processor’s performance. Multiply and divide operations, which require more complex circuitry than either addition or subtraction operations, also affect performance. We present some of the techniques used in modern computers to perform arithmetic operations at high speed. Operations on floating-point numbers are also described.

In Section 1.4 of Chapter 1, we described the representation of signed binary numbers, and showed that 2’s-complement is the best representation from the standpoint of performing addition and subtraction operations. The examples in Figure 1.6 show that two n-bit signed numbers can be added using n-bit binary addition, treating the sign bit the same as the other bits. In other words, a logic circuit that is designed to add unsigned binary numbers can also be used to add signed numbers in 2’s-complement. The first two sections of this chapter present logic circuits for addition and subtraction.

9.1 Addition and Subtraction of Signed Numbers

Figure 9.1 shows the truth table for the sum and carry-out functions for adding equally weighted bits xi and yi in two numbers X and Y. The figure also shows logic expressions for these functions, along with an example of addition of the 4-bit unsigned numbers 7 and 6. Note that each stage of the addition process must accommodate a carry-in bit. We use ci to represent the carry-in to stage i, which is the same as the carry-out from stage (i − 1).

The logic expression for si in Figure 9.1 can be implemented with a 3-input XOR gate, used in Figure 9.2a as part of the logic required for a single stage of binary addition. The carry-out function, ci+1, is implemented with an AND-OR circuit, as shown. A convenient symbol for the complete circuit for a single stage of addition, called a full adder (FA), is also shown in the figure.

A cascaded connection of n full-adder blocks can be used to add two n-bit numbers, as shown in Figure 9.2b. Since the carries must propagate, or ripple, through this cascade, the configuration is called a ripple-carry adder.

The carry-in, c0, into the least-significant-bit (LSB) position provides a convenient means of adding 1 to a number. For instance, forming the 2’s-complement of a number involves adding 1 to the 1’s-complement of the number. The carry signals are also useful for interconnecting k adders to form an adder capable of handling input numbers that are kn bits long, as shown in Figure 9.2c.
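The full-adder equations of Figure 9.1 and the ripple-carry cascade of Figure 9.2b can be modelled behaviorally. The following Python sketch is our illustration (the text describes hardware, not software); each bit is represented as the integer 0 or 1:

```python
def full_adder(x, y, c):
    """One stage of Figure 9.1: sum by 3-input XOR, carry-out by majority."""
    s = x ^ y ^ c
    c_out = (x & y) | (x & c) | (y & c)
    return s, c_out

def ripple_carry_add(x_bits, y_bits, c0=0):
    """Add two n-bit numbers given as bit lists, LSB first (Figure 9.2b)."""
    c, s_bits = c0, []
    for x, y in zip(x_bits, y_bits):
        s, c = full_adder(x, y, c)   # the carry ripples stage to stage
        s_bits.append(s)
    return s_bits, c                 # sum bits and the final carry-out cn

def to_bits(v, n):
    return [(v >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

s, cn = ripple_carry_add(to_bits(7, 4), to_bits(6, 4))
print(from_bits(s), cn)              # the example of Figure 9.1: 7 + 6 = 13
```

Setting c0 = 1 adds 1 to the result, which is exactly the convenience noted above for forming 2’s-complements.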

9.1.1 Addition/Subtraction Logic Unit

The n-bit adder in Figure 9.2b can be used to add 2’s-complement numbers X and Y, where the xn−1 and yn−1 bits are the sign bits. The carry-out bit cn is not part of the answer. Arithmetic overflow was discussed in Section 1.4. It occurs when the signs of the two

Page 361: Carl Hamacher

December 10, 2010 11:13 ham_338065_ch09 Sheet number 3 Page number 337 cyan black

9.1 Addition and Subtraction of Signed Numbers 337

[Figure: the truth table for one stage of binary addition, with inputs xi, yi, and carry-in ci and outputs sum si and carry-out ci+1; the logic expressions

si = xi ⊕ yi ⊕ ci

ci+1 = yici + xici + xiyi

and a worked example, X + Y = Z for 7 + 6 = 13 in 4-bit binary, showing the carry produced at each bit position.]

Figure 9.1 Logic specification for a stage of binary addition.

operands are the same, but the sign of the result is different. Therefore, a circuit to detect overflow can be added to the n-bit adder by implementing the logic expression

Overflow = x̄n−1 ȳn−1 sn−1 + xn−1 yn−1 s̄n−1

It can also be shown that overflow occurs when the carry bits cn and cn−1 are different. (See Problem 9.5.) Therefore, a simpler circuit for detecting overflow can be obtained by implementing the expression cn ⊕ cn−1 with an XOR gate.

In order to perform the subtraction operation X − Y on 2’s-complement numbers X and Y, we form the 2’s-complement of Y and add it to X. The logic circuit shown in Figure 9.3 can be used to perform either addition or subtraction based on the value applied to the Add/Sub input control line. This line is set to 0 for addition, applying Y unchanged to one of the adder inputs along with a carry-in signal, c0, of 0. When the Add/Sub control line is set to 1, the Y number is 1’s-complemented (that is, bit-complemented) by the XOR gates and c0 is set to 1 to complete the 2’s-complementation of Y. Recall that 2’s-complementing a negative number is done in exactly the same manner as for a positive number. An XOR gate can be added to Figure 9.3 to detect the overflow condition cn ⊕ cn−1.
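The behavior of the circuit in Figure 9.3, including the cn ⊕ cn−1 overflow check, can be sketched bit-serially. This is our illustrative model; the function names are ours, not the text’s:

```python
def add_sub(x, y, n, sub=False):
    """Behavioral model of Figure 9.3 for n-bit 2's-complement operands.

    With sub set, the XOR gates complement each bit of y and c0 = 1,
    completing the 2's-complement of y.  Overflow is cn XOR c(n-1)."""
    mask = (1 << n) - 1
    y_in = (y ^ mask) if sub else (y & mask)
    c, s = (1 if sub else 0), 0
    for i in range(n):
        xi, yi = (x >> i) & 1, (y_in >> i) & 1
        si = xi ^ yi ^ c
        c_next = (xi & yi) | (xi & c) | (yi & c)
        if i == n - 1:
            overflow = c ^ c_next    # compare c(n-1) with cn
        c = c_next
        s |= si << i
    return s, overflow

def signed(v, n):
    """Interpret an n-bit pattern as a 2's-complement value."""
    return v - (1 << n) if v & (1 << (n - 1)) else v

s, ov = add_sub(5, 3, 4, sub=True)   # 5 - 3
print(signed(s, 4), ov)              # 2 0
s, ov = add_sub(7, 1, 4)             # 7 + 1 exceeds the 4-bit range
print(signed(s, 4), ov)              # -8 1
```

The second call shows the overflow indicator firing: the two positive operands produce a negative 4-bit result.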


[Figure: (a) logic for a single stage of addition, packaged as the full-adder (FA) symbol with inputs xi, yi, and ci and outputs si and ci+1; (b) an n-bit ripple-carry adder, a cascade of n full adders with c0 entering at the least-significant-bit (LSB) position and cn leaving at the most-significant-bit (MSB) position; (c) a cascade of k n-bit adders handling operands that are kn bits long.]

Figure 9.2 Logic for addition of binary numbers.


[Figure: the n-bit adder of Figure 9.2b with an Add/Sub control input; each bit of Y passes through an XOR gate driven by the Add/Sub line, which also supplies the carry-in c0.]

Figure 9.3 Binary addition/subtraction logic circuit.

9.2 Design of Fast Adders

If an n-bit ripple-carry adder is used in the addition/subtraction circuit of Figure 9.3, it may have too much delay in developing its outputs, s0 through sn−1 and cn. Whether or not the delay incurred is acceptable can be decided only in the context of the speed of other processor components and the data transfer times of registers and cache memories. The delay through a network of logic gates depends on the integrated circuit electronic technology used in fabricating the network and on the number of gates in the paths from inputs to outputs. The delay through any combinational circuit constructed from gates in a particular technology is determined by adding up the number of logic-gate delays along the longest signal propagation path through the circuit. In the case of the n-bit ripple-carry adder, the longest path is from inputs x0, y0, and c0 at the LSB position to outputs cn and sn−1 at the most-significant-bit (MSB) position.

Using the implementation indicated in Figure 9.2a, cn−1 is available in 2(n − 1) gate delays, and sn−1 is correct one XOR gate delay later. The final carry-out, cn, is available after 2n gate delays. Therefore, if a ripple-carry adder is used to implement the addition/subtraction unit shown in Figure 9.3, all sum bits are available in 2n gate delays, including the delay through the XOR gates on the Y input. Using the implementation cn ⊕ cn−1 for overflow, this indicator is available after 2n + 2 gate delays.

Two approaches can be taken to reduce delay in adders. The first approach is to use the fastest possible electronic technology. The second approach is to use a logic gate network called a carry-lookahead network, which is described in the next section.


9.2.1 Carry-Lookahead Addition

A fast adder circuit must speed up the generation of the carry signals. The logic expressions for si (sum) and ci+1 (carry-out) of stage i (see Figure 9.1) are

si = xi ⊕ yi ⊕ ci

and

ci+1 = xiyi + xici + yici

Factoring the second equation into

ci+1 = xiyi + (xi + yi)ci

we can write

ci+1 = Gi + Pici

where

Gi = xiyi and Pi = xi + yi

The expressions Gi and Pi are called the generate and propagate functions for stage i. If the generate function for stage i is equal to 1, then ci+1 = 1, independent of the input carry, ci. This occurs when both xi and yi are 1. The propagate function means that an input carry will produce an output carry when either xi is 1 or yi is 1. All Gi and Pi functions can be formed independently and in parallel in one logic-gate delay after the X and Y operands are applied to the inputs of an n-bit adder. Each bit stage contains an AND gate to form Gi, an OR gate to form Pi, and a three-input XOR gate to form si. A simpler circuit can be derived by observing that an adequate propagate function can be realized as Pi = xi ⊕ yi, which differs from Pi = xi + yi only when xi = yi = 1. But, in this case Gi = 1, so it does not matter whether Pi is 0 or 1. Then, using a cascade of two 2-input XOR gates to realize the 3-input XOR function for si, the basic B cell in Figure 9.4a can be used in each bit stage.

Expanding ci in terms of i − 1 subscripted variables and substituting into the ci+1 expression, we obtain

ci+1 = Gi + PiGi−1 + PiPi−1ci−1

Continuing this type of expansion, the final expression for any carry variable is

ci+1 = Gi + PiGi−1 + PiPi−1Gi−2 + · · · + PiPi−1 · · ·P1G0 + PiPi−1 · · ·P0c0 (9.1)

Thus, all carries can be obtained three gate delays after the input operands X, Y, and c0 are applied because only one gate delay is needed to develop all Pi and Gi signals, followed by two gate delays in the AND-OR circuit for ci+1. After a further XOR gate delay, all sum bits are available. In total, the n-bit addition process requires only four gate delays, independent of n.
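Expression 9.1 can be checked against ordinary addition with a small behavioral model. This is our sketch in Python; the flattened two-level AND-OR structure is mimicked by forming each product term of the expression explicitly:

```python
def cla_add(xv, yv, n, c0=0):
    """Carry-lookahead addition: every carry formed directly from the
    flattened two-level form of Expression 9.1 (a behavioral sketch)."""
    x = [(xv >> i) & 1 for i in range(n)]
    y = [(yv >> i) & 1 for i in range(n)]
    g = [a & b for a, b in zip(x, y)]      # Gi = xi*yi
    p = [a | b for a, b in zip(x, y)]      # Pi = xi + yi
    c = [c0]
    for i in range(n):
        # c_{i+1} = Gi + Pi*G_{i-1} + ... + Pi...P1*G0 + Pi...P0*c0
        terms = []
        for j in range(i, -1, -1):
            t = g[j]
            for k in range(j + 1, i + 1):
                t &= p[k]
            terms.append(t)
        t = c0
        for k in range(i + 1):
            t &= p[k]
        terms.append(t)
        c.append(int(any(terms)))
    s = [a ^ b ^ ci for a, b, ci in zip(x, y, c)]
    return sum(b << i for i, b in enumerate(s)), c[n]

print(cla_add(7, 6, 4))                    # (13, 0), matching Figure 9.1
```

In software the term-by-term evaluation is of course sequential; the point of the hardware structure is that all the terms of Expression 9.1 are evaluated concurrently, in two gate levels.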


[Figure: (a) the bit-stage cell (B cell), with inputs xi, yi, and ci and outputs si, Gi, and Pi; (b) the 4-bit adder, four B cells whose Gi and Pi outputs feed a carry-lookahead logic block that produces c1, c2, c3, and c4 from c0, together with the block outputs GI0 and PI0.]

Figure 9.4 A 4-bit carry-lookahead adder.

Let us consider the design of a 4-bit adder. The carries can be implemented as

c1 = G0 + P0c0

c2 = G1 + P1G0 + P1P0c0

c3 = G2 + P2G1 + P2P1G0 + P2P1P0c0

c4 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0c0


The complete 4-bit adder is shown in Figure 9.4b. The carries are produced in the block labeled carry-lookahead logic. An adder implemented in this form is called a carry-lookahead adder. Delay through the adder is 3 gate delays for all carry bits and 4 gate delays for all sum bits. In comparison, a 4-bit ripple-carry adder requires 7 gate delays for s3 and 8 gate delays for c4.

If we try to extend the carry-lookahead adder design of Figure 9.4b for longer operands, we encounter the problem of gate fan-in constraints. From Expression 9.1, we see that the last AND gate and the OR gate require a fan-in of i + 2 in generating ci+1. A fan-in of 5 is required for c4 in the 4-bit adder. This is about the limit for practical gates. So the adder design shown in Figure 9.4b cannot be extended easily for longer operands. However, it is possible to build longer adders by cascading a number of 4-bit adders, as shown in Figure 9.2c.

Eight 4-bit carry-lookahead adders can be connected as in Figure 9.2c to form a 32-bit adder. The delays in generating sum bits s31, s30, s29, s28, and carry bit c32 in the high-order 4-bit adder in this cascade are calculated as follows. The carry-out c4 from the low-order adder is available 3 gate delays after the input operands X, Y, and c0 are applied to the 32-bit adder. Then, c8 is available at the output of the second adder after a further 2 gate delays, c12 is available after a further 2 gate delays, and so on. Finally, c28, the carry-in to the high-order 4-bit adder, is available after a total of (6 × 2) + 3 = 15 gate delays. Then, c32 and all carries inside the high-order adder are available after a further 2 gate delays, and all 4 sum bits are available after 1 more gate delay, for a total of 18 gate delays. This should be compared to total delays of 63 and 64 for s31 and c32 if a ripple-carry adder is used.
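This delay tally generalizes to any cascade of k ≥ 2 such 4-bit blocks; a small helper of ours, following the counting in the text, reproduces the numbers:

```python
def cascade_cla_delays(k):
    """Gate-delay tally for a cascade of k >= 2 4-bit carry-lookahead
    adder blocks (our helper, following the text's counting).

    The first carry-out, c4, is ready 3 gate delays after the inputs;
    each later block adds 2 delays to its carry-out; the last sum bit
    needs 1 delay more than the final carry."""
    carry_in_last = 3 + 2 * (k - 2)    # carry into the high-order block
    final_carry = carry_in_last + 2    # c_{4k}
    last_sum = final_carry + 1         # s_{4k-1}
    return final_carry, last_sum

print(cascade_cla_delays(8))   # 32-bit adder: c32 after 17, s31 after 18
print(cascade_cla_delays(4))   # 16-bit adder: c16 after 9, s15 after 10
```

The k = 4 case gives the 9 and 10 gate delays quoted later for a 16-bit adder built by cascading 4-bit blocks.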

In the next section, we show how it is possible to improve upon the cascade structure just discussed, leading to further reduction in adder delay. The key idea is to generate the carries c4, c8, . . . in parallel, similar to the way that c1, c2, c3, and c4 are generated in parallel in the 4-bit carry-lookahead adder.

Higher-Level Generate and Propagate Functions

In the 32-bit adder just discussed, the carries c4, c8, c12, . . . ripple through the 4-bit adder blocks with two gate delays per block, analogous to the way that individual carries ripple through each bit stage in a ripple-carry adder. It is possible to use the lookahead approach to develop the carries c4, c8, c12, . . . in parallel by using higher-level block generate and propagate functions.

Figure 9.5 shows a 16-bit adder built from four 4-bit adder blocks. These blocks provide new output functions defined as GIk and PIk, where k = 0 for the first 4-bit block, k = 1 for the second 4-bit block, and so on, as shown in Figures 9.4b and 9.5. In the first block,

PI0 = P3P2P1P0

and

GI0 = G3 + P3G2 + P3P2G1 + P3P2P1G0

The first-level Gi and Pi functions determine whether bit stage i generates or propagates a carry. The second-level GIk and PIk functions determine whether block k generates or propagates a carry. With these new functions available, it is not necessary to wait for carries to ripple through the 4-bit blocks. Carry c16 is formed by one of the carry-lookahead


[Figure: four 4-bit adder blocks of the type in Figure 9.4b, producing sum groups s3-0, s7-4, s11-8, and s15-12; their GIk and PIk outputs feed a second-level carry-lookahead logic block that generates c4, c8, c12, and c16 from c0, together with the outputs GII0 and PII0.]

Figure 9.5 A 16-bit carry-lookahead adder built from 4-bit adders (see Figure 9.4b).

circuits in Figure 9.5 as

c16 = GI3 + PI3GI2 + PI3PI2GI1 + PI3PI2PI1GI0 + PI3PI2PI1PI0c0

The input carries to the 4-bit blocks are formed in parallel by similar shorter expressions. Expressions for c16, c12, c8, and c4 are identical in form to the expressions for c4, c3, c2, and c1, respectively, implemented in the carry-lookahead circuits in Figure 9.4b. Only the variable names are different. Therefore, the structure of the carry-lookahead circuits in Figure 9.5 is identical to the carry-lookahead circuits in Figure 9.4b. However, the carries c4, c8, c12, and c16, generated internally by the 4-bit adder blocks, are not needed in Figure 9.5 because they are generated by the higher-level carry-lookahead circuits.
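The second-level lookahead can be sketched behaviorally as well (our model, in Python): bit-level Gi and Pi are folded into block GIk and PIk, from which c16 is formed in one two-level expression.

```python
def block_gp(g, p):
    """Block generate/propagate from a 4-bit block's bit-level Gi, Pi
    (lists ordered LSB first), as in Section 9.2.1:
        GI = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0
        PI = P3 P2 P1 P0"""
    GI = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    PI = p[3] & p[2] & p[1] & p[0]
    return GI, PI

def c16_lookahead(xv, yv, c0=0):
    """c16 for 16-bit operands via the second-level expression:
    c16 = GI3 + PI3 GI2 + PI3 PI2 GI1 + PI3 PI2 PI1 GI0 + PI3 PI2 PI1 PI0 c0"""
    x = [(xv >> i) & 1 for i in range(16)]
    y = [(yv >> i) & 1 for i in range(16)]
    g = [a & b for a, b in zip(x, y)]
    p = [a | b for a, b in zip(x, y)]
    G, P = zip(*(block_gp(g[4*k:4*k+4], p[4*k:4*k+4]) for k in range(4)))
    return (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
            | (P[3] & P[2] & P[1] & G[0])
            | (P[3] & P[2] & P[1] & P[0] & c0))

print(c16_lookahead(0xFFFF, 0x0001))   # 1: the carry propagates all the way out
```

No carry ever ripples between blocks here; c16 depends only on the four (GIk, PIk) pairs and c0, which is exactly why the hardware structure saves delay.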

Now, consider the delay in producing outputs from the 16-bit carry-lookahead adder. The delay in developing the carries produced by the carry-lookahead circuits is two gate delays more than the delay needed to develop the GIk and PIk functions. The latter require two gate delays and one gate delay, respectively, after the generation of Gi and Pi. Therefore, all carries produced by the carry-lookahead circuits are available 5 gate delays after X, Y, and c0 are applied as inputs. The carry c15 is generated inside the high-order 4-bit block in Figure 9.5 in two gate delays after c12, followed by s15 in one further gate delay. Therefore, s15 is available after 8 gate delays. If a 16-bit adder is built by cascading 4-bit carry-lookahead adder blocks, the delays in developing c16 and s15 are 9 and 10 gate delays, respectively, as compared to 5 and 8 gate delays for the configuration in Figure 9.5.

Two 16-bit adder blocks can be cascaded to implement a 32-bit adder. In this configuration, the output c16 from the low-order block is the carry input to the high-order block. The delay is much lower than the delay through the 32-bit adder that we discussed earlier, which was built by cascading eight 4-bit adders. In that configuration, recall that s31 is available after 18 gate delays and c32 is available after 17 gate delays. The delay analysis


for the cascade of two 16-bit adders is as follows. The carry c16 out of the low-order block is available after 5 gate delays, as calculated above. Then, both c28 and c32 are available in the high-order block after a further 2 gate delays, and c31 is available 2 gate delays after c28. Therefore, c31 is available after a total of 9 gate delays, and s31 is available in 10 gate delays. Recapitulating, s31 and c32 are available after 10 and 7 gate delays, respectively, compared to 18 and 17 gate delays for the same outputs if the 32-bit adder is built from a cascade of eight 4-bit adders.

The same reasoning used in developing second-level G^I_k and P^I_k functions from first-level Gi and Pi functions can be used to develop third-level G^II_k and P^II_k functions from G^I_k and P^I_k functions. Two such third-level functions are shown as outputs from the carry-lookahead logic in Figure 9.5. A 64-bit adder can be built from four of the 16-bit adders shown in Figure 9.5, along with additional carry-lookahead logic circuits that produce carries c16, c32, c48, and c64. Delay through this adder can be shown to be 12 gate delays for s63 and 7 gate delays for c64, using an extension of the reasoning used above for the 16-bit adder. (See Problem 9.7.)

9.3 Multiplication of Unsigned Numbers

The usual algorithm for multiplying integers by hand is illustrated in Figure 9.6a for the binary system. The product of two unsigned n-digit numbers can be accommodated in 2n digits, so the product of the two 4-bit numbers in this example is accommodated in 8 bits, as shown. In the binary system, multiplication of the multiplicand by one bit of the multiplier is easy. If the multiplier bit is 1, the multiplicand is entered in the appropriate shifted position. If the multiplier bit is 0, then 0s are entered, as in the third row of the example. The product is computed one bit at a time by adding the bit columns from right to left and propagating carry values between columns.
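The hand algorithm amounts to summing one shifted copy of the multiplicand per 1 bit of the multiplier. A minimal Python sketch of this (an illustration only; the book's circuits for the same operation are described next, and the function name is my own):

```python
def multiply_unsigned(m, q, n=4):
    """Shift-and-add multiplication of two n-bit unsigned integers.
    The 2n-bit product accumulates one shifted summand per 1 bit of q."""
    product = 0
    for i in range(n):          # scan multiplier bits from right to left
        if (q >> i) & 1:
            product += m << i   # add the multiplicand, shifted i positions
    return product

# The example of Figure 9.6a: 1101 x 1011 = 13 x 11 = 143
assert multiply_unsigned(0b1101, 0b1011) == 143
```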

9.3.1 Array Multiplier

Binary multiplication of unsigned operands can be implemented in a combinational, two-dimensional, logic array, as shown in Figure 9.6b for the 4-bit operand case. The main component in each cell is a full adder, FA. The AND gate in each cell determines whether a multiplicand bit, mj, is added to the incoming partial-product bit, based on the value of the multiplier bit, qi. Each row i, where 0 ≤ i ≤ 3, adds the multiplicand (appropriately shifted) to the incoming partial product, PPi, to generate the outgoing partial product, PP(i + 1), if qi = 1. If qi = 0, PPi is passed vertically downward unchanged. PP0 is all 0s, and PP4 is the desired product. The multiplicand is shifted left one position per row by the diagonal signal path. We note that the row-by-row addition done in the array circuit differs from the usual hand addition described previously, which is done column-by-column.

The worst-case signal propagation delay path is from the upper right corner of the array to the high-order product bit output at the bottom left corner of the array. This critical path consists of the staircase pattern that includes the two cells at the right end of each row, followed by all the cells in the bottom row. Assuming that there are two gate delays from the inputs to the outputs of a full-adder block, FA, the critical path has a total of 6(n − 1) − 1 gate delays, including the initial AND gate delay in all cells, for an n × n array. (See Problem 9.8.) In the first row of the array, no full adders are needed, because the incoming partial product PP0 is zero. This has been taken into account in developing the delay expression.

[Figure 9.6 Array multiplication of unsigned binary operands: (a) manual multiplication algorithm, (b) array implementation.]

9.3.2 Sequential Circuit Multiplier

The combinational array multiplier just described uses a large number of logic gates for multiplying numbers of practical size, such as 32- or 64-bit numbers. Multiplication of two n-bit numbers can also be performed in a sequential circuit that uses a single n-bit adder.

The block diagram in Figure 9.7a shows the hardware arrangement for sequential multiplication. This circuit performs multiplication by using a single n-bit adder n times to implement the spatial addition performed by the n rows of ripple-carry adders in Figure 9.6b. Registers A and Q are shift registers, concatenated as shown. Together, they hold partial product PPi while multiplier bit qi generates the signal Add/Noadd. This signal causes the multiplexer MUX to select 0 when qi = 0, or to select the multiplicand M when qi = 1, to be added to PPi to generate PP(i + 1). The product is computed in n cycles. The partial product grows in length by one bit per cycle from the initial vector, PP0, of n 0s in register A. The carry-out from the adder is stored in flip-flop C, shown at the left end of register A. At the start, the multiplier is loaded into register Q, the multiplicand into register M, and C and A are cleared to 0. At the end of each cycle, C, A, and Q are shifted right one bit position to allow for growth of the partial product as the multiplier is shifted out of register Q. Because of this shifting, multiplier bit qi appears at the LSB position of Q to generate the Add/Noadd signal at the correct time, starting with q0 during the first cycle, q1 during the second cycle, and so on. After they are used, the multiplier bits are discarded by the right-shift operation. Note that the carry-out from the adder is the leftmost bit of PP(i + 1), and it must be held in the C flip-flop to be shifted right with the contents of A and Q. After n cycles, the high-order half of the product is held in register A and the low-order half is in register Q. The multiplication example of Figure 9.6a is shown in Figure 9.7b as it would be performed by this hardware arrangement.
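The cycle-by-cycle behavior just described can be simulated in Python (a behavioral sketch of the register transfers, not the circuit itself; the function name is my own; C, A, and Q model the carry flip-flop and the two shift registers):

```python
def sequential_multiply(m, q, n=4):
    """n-cycle sequential multiplication of n-bit unsigned m and q."""
    mask = (1 << n) - 1
    C, A, Q, M = 0, 0, q, m            # C and A cleared; Q holds the multiplier
    for _ in range(n):
        if Q & 1:                      # Add/Noadd control from the LSB of Q
            total = A + M              # n-bit addition; carry-out goes to C
            C, A = total >> n, total & mask
        # Shift C, A, and Q right one bit position
        Q = (Q >> 1) | ((A & 1) << (n - 1))
        A = (A >> 1) | (C << (n - 1))
        C = 0
    return (A << n) | Q                # high half in A, low half in Q

# The example of Figure 9.6a: 13 x 11 = 143
assert sequential_multiply(0b1101, 0b1011) == 143
```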

9.4 Multiplication of Signed Numbers

We now discuss multiplication of 2's-complement operands, generating a double-length product. The general strategy is still to accumulate partial products by adding versions of the multiplicand as selected by the multiplier bits.

First, consider the case of a positive multiplier and a negative multiplicand. When we add a negative multiplicand to a partial product, we must extend the sign-bit value of the multiplicand to the left as far as the product will extend. Figure 9.8 shows an example in which a 5-bit signed operand, −13, is the multiplicand. It is multiplied by +11 to get the 10-bit product, −143. The sign extension of the multiplicand is shown in blue. The hardware discussed earlier can be used for negative multiplicands if it is augmented to provide for sign extension of the partial products.

[Figure 9.7 Sequential circuit binary multiplier: (a) register configuration, (b) multiplication example.]

[Figure 9.8 Sign extension of negative multiplicand.]

For a negative multiplier, a straightforward solution is to form the 2's-complement of both the multiplier and the multiplicand and proceed as in the case of a positive multiplier. This is possible because complementation of both operands does not change the value or the sign of the product. A technique that works equally well for both negative and positive multipliers, called the Booth algorithm, is described next.
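Both cases can be sketched in a few lines of Python (an arithmetic model, not the augmented hardware; the function name is my own). Sign extension is implicit in Python's unbounded integers, and a negative multiplier is handled by complementing both operands:

```python
def signed_multiply(m, q, n):
    """Multiply two n-bit 2's-complement operands, given as n-bit patterns."""
    signed = lambda v: v - (1 << n) if v & (1 << (n - 1)) else v
    m, q = signed(m), signed(q)
    if q < 0:
        m, q = -m, -q      # complementing both operands preserves the product
    # Positive multiplier: ordinary shift-and-add; Python's integers
    # play the role of sign-extending each summand.
    return sum(m << i for i in range(n) if (q >> i) & 1)

# The example of Figure 9.8: (-13) x (+11) = -143, with 5-bit operands
assert signed_multiply(0b10011, 0b01011, 5) == -143
```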

9.4.1 The Booth Algorithm

The Booth algorithm [1] generates a 2n-bit product and treats both positive and negative 2's-complement n-bit operands uniformly. To understand the basis of this algorithm, consider a multiplication operation in which the multiplier is positive and has a single block of 1s, for example, 0011110. To derive the product, we could add four appropriately shifted versions of the multiplicand, as in the standard procedure. However, we can reduce the number of required operations by regarding this multiplier as the difference between two numbers:

  0100000   (32)
− 0000010    (2)
  0011110   (30)

This suggests that the product can be generated by adding 2^5 times the multiplicand to the 2's-complement of 2^1 times the multiplicand. For convenience, we can describe the sequence of required operations by recoding the preceding multiplier as 0 +1 0 0 0 −1 0.

In general, in the Booth algorithm, −1 times the shifted multiplicand is selected when moving from 0 to 1, and +1 times the shifted multiplicand is selected when moving from


1 to 0, as the multiplier is scanned from right to left. Figure 9.9 illustrates the normal and the Booth algorithms for the example just discussed. The Booth algorithm clearly extends to any number of blocks of 1s in a multiplier, including the situation in which a single 1 is considered a block. Figure 9.10 shows another example of recoding a multiplier. The case when the least significant bit of the multiplier is 1 is handled by assuming that an implied 0 lies to its right. The Booth algorithm can also be used directly for negative multipliers, as shown in Figure 9.11.

To demonstrate the correctness of the Booth algorithm for negative multipliers, we use the following property of negative-number representations in the 2's-complement system.

[Figure 9.9 Normal and Booth multiplication schemes.]

[Figure 9.10 Booth recoding of a multiplier.]


[Figure 9.11 Booth multiplication with a negative multiplier: (+13) × (−6) = −78.]

Suppose that the leftmost 0 of a negative number, X, is at bit position k, that is,

X = 11...10 x_{k−1} ... x_0

Then the value of X is given by

V(X) = −2^{k+1} + x_{k−1} × 2^{k−1} + ··· + x_0 × 2^0

The correctness of this expression for V(X) is shown by observing that if X is formed as the sum of two numbers, as follows,

      11...10 0 ... 0
    + 00...00 x_{k−1} ... x_0
X =   11...10 x_{k−1} ... x_0

then the upper number is the 2's-complement representation of −2^{k+1}. The recoded multiplier now consists of the part corresponding to the lower number, with −1 added in position k + 1. For example, the multiplier 110110 is recoded as 0 −1 +1 0 −1 0.
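The recoding rule, scanning right to left with an implied 0 at the right, is digit_i = b_{i−1} − b_i. A Python sketch of the recoding and of a multiplication driven by it (the function names are my own):

```python
def booth_recode(q, n):
    """Booth-recode an n-bit 2's-complement multiplier.
    Returns n digits from {-1, 0, +1}, least significant first."""
    digits, prev = [], 0               # implied 0 to the right of the LSB
    for i in range(n):
        b = (q >> i) & 1
        digits.append(prev - b)        # 0 -> 1 gives -1; 1 -> 0 gives +1
        prev = b
    return digits

def booth_multiply(m, q, n):
    """Add or subtract shifted multiplicands as the Booth digits direct."""
    signed_m = m - (1 << n) if m & (1 << (n - 1)) else m
    return sum(d * (signed_m << i) for i, d in enumerate(booth_recode(q, n)))

# The recoding example from the text: 110110 is recoded as 0 -1 +1 0 -1 0
assert booth_recode(0b110110, 6)[::-1] == [0, -1, +1, 0, -1, 0]
```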

The Booth technique for recoding multipliers is summarized in Figure 9.12. The transformation 011 . . . 110 ⇒ +1 0 0 . . . 0 −1 0 is called skipping over 1s. This term is derived from the case in which the multiplier has its 1s grouped into a few contiguous blocks. Only a few versions of the shifted multiplicand (the summands) need to be added to generate the product, thus speeding up the multiplication operation. However, in the worst case, that of alternating 1s and 0s in the multiplier, each bit of the multiplier selects a summand. In fact, this results in more summands than if the Booth algorithm were not used. A 16-bit worst-case multiplier, an ordinary multiplier, and a good multiplier are shown in Figure 9.13.

The Booth algorithm has two attractive features. First, it handles both positive and negative multipliers uniformly. Second, it achieves some efficiency in the number of additions required when the multiplier has a few large blocks of 1s.


Multiplier bits          Version of multiplicand
Bit i    Bit i−1         selected by bit i

  0        0                 0 × M
  0        1                +1 × M
  1        0                −1 × M
  1        1                 0 × M

Figure 9.12 Booth multiplier recoding table.

[Figure 9.13 Booth recoded multipliers: a worst-case multiplier, an ordinary multiplier, and a good multiplier.]

9.5 Fast Multiplication

We now describe two techniques for speeding up the multiplication operation. The first technique guarantees that the maximum number of summands (versions of the multiplicand) that must be added is n/2 for n-bit operands. The second technique leads to adding the summands in parallel.


9.5.1 Bit-Pair Recoding of Multipliers

A technique called bit-pair recoding of the multiplier results in using at most one summand for each pair of bits in the multiplier. It is derived directly from the Booth algorithm. Group the Booth-recoded multiplier bits in pairs, and observe the following. The pair (+1 −1) is equivalent to the pair (0 +1). That is, instead of adding −1 times the multiplicand M at shift position i to +1 × M at position i + 1, the same result is obtained by adding +1 × M at position i. Other examples are: (+1 0) is equivalent to (0 +2), (−1 +1) is equivalent to (0 −1), and so on. Thus, if the Booth-recoded multiplier is examined two bits at a time, starting from the right, it can be rewritten in a form that requires at most one version of the multiplicand to be added to the partial product for each pair of multiplier bits. Figure 9.14a shows an example of bit-pair recoding of the multiplier in Figure 9.11, and Figure 9.14b shows a table of the multiplicand selection decisions for all possibilities. The multiplication operation in Figure 9.11 is shown in Figure 9.15 as it would be computed using bit-pair recoding of the multiplier.

[Figure 9.14 Multiplier bit-pair recoding: (a) example of bit-pair recoding derived from Booth recoding, (b) table of multiplicand selection decisions.]

[Figure 9.15 Multiplication requiring only n/2 summands: (+13) × (−6) = −78.]
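Since each pair digit combines two adjacent Booth digits, it can be computed directly from three multiplier bits: the digit at pair position j is −2b_{2j+1} + b_{2j} + b_{2j−1}, with an implied 0 to the right of the LSB. A Python sketch (my own function names, not the book's selection table):

```python
def bit_pair_recode(q, n):
    """Bit-pair recode an n-bit 2's-complement multiplier (n even).
    Returns n/2 digits from {-2, -1, 0, +1, +2}, least significant pair first."""
    b = lambda i: (q >> i) & 1 if i >= 0 else 0   # implied 0 below the LSB
    return [-2 * b(2*j + 1) + b(2*j) + b(2*j - 1) for j in range(n // 2)]

def bit_pair_multiply(m, q, n):
    """One add/subtract of 0, M, or 2M per pair, weighted by 4**j."""
    signed_m = m - (1 << n) if m & (1 << (n - 1)) else m
    return sum(d * signed_m * 4**j for j, d in enumerate(bit_pair_recode(q, n)))

# The multiplier of Figure 9.11, -6 = 111010, recodes as the pair digits 0 -1 -2
assert bit_pair_recode(0b111010, 6)[::-1] == [0, -1, -2]
```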

9.5.2 Carry-Save Addition of Summands

Multiplication requires the addition of several summands. A technique called carry-save addition (CSA) can be used to speed up the process. Consider the 4 × 4 multiplication array shown in Figure 9.16a. This structure is in the form of the array shown in Figure 9.6, in which the first row consists of just the AND gates that produce the four inputs m3q0, m2q0, m1q0, and m0q0.

Instead of letting the carries ripple along the rows, they can be "saved" and introduced into the next row, at the correct weighted positions, as shown in Figure 9.16b. This frees up an input to each of three full adders in the first row. These inputs can be used to introduce the third summand bits m2q2, m1q2, and m0q2. Now, two inputs of each of three full adders in the second row are fed by the sum and carry outputs from the first row. The third input is used to introduce the bits m2q3, m1q3, and m0q3 of the fourth summand. The high-order bits m3q2 and m3q3 of the third and fourth summands are introduced into the remaining free full-adder inputs at the left end in the second and third rows. The saved carry bits and the sum bits from the second row are now added in the third row, which is a ripple-carry adder, to produce the final product bits.

[Figure 9.16 Ripple-carry and carry-save arrays for a 4 × 4 multiplier: (a) ripple-carry array, (b) carry-save array.]

The delay through the carry-save array is somewhat less than the delay through the ripple-carry array. This is because the S and C vector outputs from each row are produced in parallel in one full-adder delay. The amount of reduction in delay is considered in Problem 9.15.

9.5.3 Summand Addition Tree using 3-2 Reducers

A more significant reduction in delay can be achieved when dealing with longer operands than those considered in Figure 9.16. We can group the summands in threes and perform carry-save addition on each of these groups in parallel to generate a set of S and C vectors in one full-adder delay. Here, we will refer to a full-adder circuit as simply an adder. Next, we group all the S and C vectors into threes, and perform carry-save addition on them, generating a further set of S and C vectors in one more adder delay. We continue with this process until there are only two vectors remaining. The adder at each bit position of the three summands is called a 3-2 reducer, and the logic circuit structure that reduces a number of summands to two is called a CSA tree, as described by Wallace [2]. The final two S and C vectors can be added in a carry-lookahead adder to produce the desired product.
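A Python sketch of the reduction (bitwise on integers; the function names are mine). Each 3-2 step produces a sum vector and a carry vector whose arithmetic total equals that of the three inputs, and the tree repeats the step until only two vectors remain:

```python
def carry_save_add(x, y, z):
    """One 3-2 reduction: x + y + z == s + c, in one adder delay."""
    s = x ^ y ^ z                        # sum bit at each position
    c = (x & y) | (y & z) | (x & z)      # carry bit, weighted one position higher
    return s, c << 1

def csa_tree(summands):
    """Reduce a list of summands to two vectors using 3-2 reducers."""
    while len(summands) > 2:
        reduced = []
        for i in range(0, len(summands), 3):
            group = summands[i:i + 3]
            if len(group) == 3:
                reduced.extend(carry_save_add(*group))
            else:
                reduced.extend(group)    # fewer than 3 left: pass them down
        summands = reduced
    return summands                      # final S and C, for a lookahead adder

# The example of Figure 9.17: six shifted summands for 45 x 63 = 2835
vectors = csa_tree([45 << i for i in range(6)])
assert len(vectors) == 2 and sum(vectors) == 2835
```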

Consider the example shown in Figure 9.17. It involves adding the six shifted versions of the multiplicand for the case of multiplying two 6-bit unsigned numbers, where all six bits of the multiplier are equal to 1. The six summands, A, B, . . . , F, are added by carry-save addition in Figure 9.18. The blue boxes in these two figures indicate the same operand bits, and show how they are reduced to sum and carry bits in Figure 9.18 by carry-save addition. Three levels of carry-save addition are performed, as shown schematically in Figure 9.19. This figure shows that the final two vectors S4 and C4 are available in three adder delays after the six input summands are applied to level 1. The final regular addition operation on S4 and C4, which produces the product, can be done with a carry-lookahead adder.

[Figure 9.17 A multiplication example used to illustrate carry-save addition as shown in Figure 9.18: 45 × 63 = 2,835.]

[Figure 9.18 The multiplication example from Figure 9.17 performed using carry-save addition.]

[Figure 9.19 Schematic representation of the carry-save addition operations in Figure 9.18.]

The multiplier delay is lower when using the tree structure illustrated in Figure 9.19 than when using the array structure illustrated in Figure 9.16b. When the number of summands is large, the reduction in delay is significant. For example, the addition of 32 summands following the pattern shown in Figure 9.19 requires only 8 levels of 3-2 reduction before the final Add operation. In general, it can be shown that approximately 1.7 log2 k − 1.7 levels of 3-2 reduction are needed to reduce k summands to 2 vectors, which, when added, produce the desired product. (See Example 9.3 in Section 9.10.)

We should note that negative summands are involved when signed-number multiplication and Booth recoding of multipliers are used. This requires sign extension of the summands before they are entered into the reduction tree. Also, the number of summands that need to be added is reduced if bit-pair recoding of the multiplier is done.

The 3-2 reducer is not the only logic circuit that can be used in building reduction trees. It is also possible to use 4-2 reducers and 7-3 reducers. The first of these possibilities is described in the next subsection, and the second is explored in Problem 9.17.

9.5.4 Summand Addition Tree using 4-2 Reducers

The interconnection pattern between levels in a CSA tree that uses 3-2 reducers is irregular, as can be seen in Figure 9.19. A more regularly structured tree can be obtained by using 4-2 reducers [3], especially for the case in which the number of summands to be reduced is a power of 2. This is the usual case for the multiplication operation in the ALU of a processor. For example, if 32 summands are reduced to 2 using 4-2 reducers at each reduction level, then only four levels are needed. The tree has a regular structure, with 16, 8, 4, and 2 summands at the outputs of the four levels. If 3-2 reducers are used, eight levels


are required, and the wiring connections between levels are quite irregular. Regular tree structures facilitate logic circuit and wiring layout for VLSI circuit implementation.

Let us consider the design of a 4-2 reducer as developed in reference [3]. The addition of four equally-weighted bits, w, x, y, and z, from four summands, produces a value in the range 0 to 4. Such a value cannot be represented by a sum bit, s, and a single carry bit, c. However, a second carry bit, cout, with the same weight as c, can be used along with s and c to represent any value in the range 0 to 5. This is sufficient for our purposes here.

We do not want to send three output bits down to the next reduction level. That would implement a 4-3 reducer, which provides less reduction than a 3-2 reducer. The solution is to send cout laterally to the 4-2 reducer in the next higher-weighted bit position on the same reduction level. Thus, each 4-2 reducer must have a fifth input, cin, which is the cout output from the 4-2 reducer in the next lower-weighted bit position on the same reduction level.

A final requirement on the design of the 4-2 reducer is that the value of cout cannot depend on the value of cin. This is a key requirement. Without it, carries would ripple laterally along a reduction level, defeating the purpose of parallel reduction of summands with short fixed delay. A 4-2 reducer block is shown in Figure 9.20.

In summary, the specification for a 4-2 reducer is as follows:

• The three outputs, s, c, and cout, represent the arithmetic sum of the five inputs, that is,

w + x + y + z + cin = s + 2(c + cout)

where all operators here are arithmetic.

• Output s is the usual sum variable; that is, s is the XOR function of the five input variables.

• The lateral carry, cout, must be independent of cin. It is a function of only the four input variables w, x, y, and z.

There are different possibilities for specifying the two carry outputs in a way that meets the given conditions. We present one that is easy to describe. First, assign the lateral carry output, cout, to be 1 when two or more of the input variables w, x, y, and z are equal to 1. Then, the other carry output, c, is determined so as to satisfy the arithmetic condition. A complete truth table satisfying these conditions is given in Figure 9.21. The table is shown in a form that is different from the usual form used in Appendix A. The four inputs w, x, y, and z are not listed in binary numerical order. They are listed in groups corresponding to the number of inputs that have the value 1. This makes it easy to see how the outputs are specified to meet the given conditions. A logic gate network can be derived from the table.

[Figure 9.20 A 4-2 reducer block.]

                 cin = 0    cin = 1
w  x  y  z       c    s     c    s     cout

0  0  0  0       0    0     0    1      0

0  0  0  1       0    1     1    0      0
0  0  1  0       0    1     1    0      0
0  1  0  0       0    1     1    0      0
1  0  0  0       0    1     1    0      0

0  0  1  1       0    0     0    1      1
0  1  0  1       0    0     0    1      1
0  1  1  0       0    0     0    1      1
1  0  0  1       0    0     0    1      1
1  0  1  0       0    0     0    1      1
1  1  0  0       0    0     0    1      1

0  1  1  1       0    1     1    0      1
1  0  1  1       0    1     1    0      1
1  1  0  1       0    1     1    0      1
1  1  1  0       0    1     1    0      1

1  1  1  1       1    0     1    1      1

Figure 9.21 A 4-2 reducer truth table.
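The table can be captured compactly in Python (a behavioral model of one reducer bit, not a gate network; the function name is mine). cout is computed from w, x, y, and z alone, and c then absorbs whatever carry remains:

```python
def reducer_4_2(w, x, y, z, cin):
    """One-bit 4-2 reducer: w + x + y + z + cin == s + 2*(c + cout)."""
    cout = int(w + x + y + z >= 2)   # lateral carry: two or more inputs are 1
    total = w + x + y + z + cin
    s = total & 1                    # s is the XOR of all five inputs
    c = (total >> 1) - cout          # remaining carry; always 0 or 1
    return s, c, cout

# cout never depends on cin, so no carry ripples along a reduction level
from itertools import product
for w, x, y, z in product((0, 1), repeat=4):
    assert reducer_4_2(w, x, y, z, 0)[2] == reducer_4_2(w, x, y, z, 1)[2]
```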

9.5.5 Summary of Fast Multiplication

We now summarize the techniques for high-speed multiplication. Bit-pair recoding of the multiplier, derived from the Booth algorithm, can be used to initially reduce the number of summands by a factor of two. The resulting summands can then be reduced to two in a reduction tree with a relatively small number of reduction levels. The final product


can be generated by an addition operation that uses a carry-lookahead adder. All three of these techniques (bit-pair recoding of the multiplier, parallel reduction of summands, and carry-lookahead addition) have been used in various combinations by the designers of high-performance processors to reduce the time needed to perform multiplication.

9.6 Integer Division

In Section 9.3, we discussed the multiplication of unsigned numbers by relating the way the multiplication operation is done manually to the way it is done in a logic circuit. We use the same approach here in discussing integer division. We discuss unsigned-number division in detail, and then make some general comments on the signed-number case.

Figure 9.22 shows examples of decimal division and binary division of the same values. Consider the decimal version first. The 2 in the quotient is determined by the following reasoning: First, we try to divide 13 into 2, and it does not work. Next, we try to divide 13 into 27. We go through the trial exercise of multiplying 13 by 2 to get 26, and, observing that 27 − 26 = 1 is less than 13, we enter 2 as the quotient and perform the required subtraction. The next digit of the dividend, 4, is brought down, and we finish by deciding that 13 goes into 14 once, and the remainder is 1. We can discuss binary division in a similar way, with the simplification that the only possibilities for the quotient bits are 0 and 1.

A circuit that implements division by this longhand method operates as follows: It positions the divisor appropriately with respect to the dividend and performs a subtraction. If the remainder is zero or positive, a quotient bit of 1 is determined, the remainder is extended by another bit of the dividend, the divisor is repositioned, and another subtraction is performed. If the remainder is negative, a quotient bit of 0 is determined, the dividend is restored by adding back the divisor, and the divisor is repositioned for another subtraction. This is called the restoring division algorithm.

Restoring Division

Figure 9.23 shows a logic circuit arrangement that implements the restoring division algorithm just discussed. Note its similarity to the structure for multiplication shown in Figure 9.7. An n-bit positive divisor is loaded into register M and an n-bit positive dividend is loaded into register Q at the start of the operation. Register A is set to 0. After the division is complete, the n-bit quotient is in register Q and the remainder is in register A. The required subtractions are facilitated by using 2's-complement arithmetic. The extra bit position at the left end of both A and M accommodates the sign bit during subtractions. The following algorithm performs restoring division.

[Figure 9.22 Longhand division examples: 274 ÷ 13 = 21 remainder 1, in decimal and binary.]

[Figure 9.23 Circuit arrangement for binary division.]

Do the following three steps n times:

1. Shift A and Q left one bit position.

2. Subtract M from A, and place the answer back in A.

3. If the sign of A is 1, set q0 to 0 and add M back to A (that is, restore A); otherwise, set q0 to 1.

Figure 9.24 shows a 4-bit example as it would be processed by the circuit in Figure 9.23.

Non-Restoring Division

The restoring division algorithm can be improved by avoiding the need for restoring A after an unsuccessful subtraction. Subtraction is said to be unsuccessful if the result is negative. Consider the sequence of operations that takes place after the subtraction operation in the preceding algorithm. If A is positive, we shift left and subtract M, that is, we perform 2A − M. If A is negative, we restore it by performing A + M, and then we shift it left and subtract M. This is equivalent to performing 2A + M. The q0 bit is appropriately


362 C H A P T E R 9 • Arithmetic

[Figure content: cycle-by-cycle contents of registers A and Q over the four cycles (Initially, then Shift, Subtract, Set q0, and Restore operations in each cycle), ending with the quotient in Q and the remainder in A.]

Figure 9.24 A restoring division example.

set to 0 or 1 after the correct operation has been performed. We can summarize this in the following algorithm for non-restoring division.

Stage 1: Do the following two steps n times:

1. If the sign of A is 0, shift A and Q left one bit position and subtract M from A; otherwise, shift A and Q left and add M to A.

2. Now, if the sign of A is 0, set q0 to 1; otherwise, set q0 to 0.

Stage 2: If the sign of A is 1, add M to A.

Stage 2 is needed to leave the proper positive remainder in A after the n cycles of Stage 1. The logic circuitry in Figure 9.23 can also be used to perform this algorithm, except that


[Figure content: cycle-by-cycle contents of registers A and Q for the same example, with a Shift and either a Subtract or an Add in each of the four cycles, a Set q0 step after each, and a final Restore of the remainder.]

Figure 9.25 A non-restoring division example.

the Restore operations are no longer needed. One Add or Subtract operation is performed in each of the n cycles of Stage 1, plus a possible final addition in Stage 2. Figure 9.25 shows how the division example in Figure 9.24 is executed by the non-restoring division algorithm.

There are no simple algorithms for directly performing division on signed operands that are comparable to the algorithms for signed multiplication. In division, the operands can be preprocessed to change them into positive values. After using one of the algorithms just discussed, the signs of the quotient and the remainder are adjusted as necessary.
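The two-stage algorithm, together with the sign preprocessing just described, can be sketched as follows (Python; the function names are illustrative):

```python
def nonrestoring_divide(dividend, divisor, n):
    """Non-restoring division of two n-bit positive integers."""
    a, q, m = 0, dividend, divisor
    # Stage 1: n cycles of shift, then add or subtract M.
    for _ in range(n):
        shift_in = (q >> (n - 1)) & 1
        q = (q << 1) & ((1 << n) - 1)
        if a >= 0:                          # sign of A is 0: 2A - M
            a = ((a << 1) | shift_in) - m
        else:                               # sign of A is 1: 2A + M
            a = ((a << 1) | shift_in) + m
        if a >= 0:
            q |= 1                          # set q0 = 1
    # Stage 2: if A is negative, add M to leave a positive remainder.
    if a < 0:
        a += m
    return q, a

def signed_divide(x, y, n):
    """Divide signed operands by preprocessing them into positive values
    and adjusting signs afterwards: the quotient is negative when the
    operand signs differ, and the remainder takes the dividend's sign."""
    quotient, remainder = nonrestoring_divide(abs(x), abs(y), n)
    if (x < 0) != (y < 0):
        quotient = -quotient
    if x < 0:
        remainder = -remainder
    return quotient, remainder
```

For example, signed_divide(-7, 2, 4) yields (-3, -1), consistent with -7 = 2 × (-3) + (-1).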

9.7 Floating-Point Numbers and Operations

Chapter 1 provided the motivation for using floating-point numbers and indicated how they can be represented in a 32-bit binary format. In this chapter, we provide more detail on representation formats and arithmetic operations on floating-point numbers. The descriptions


provided here are based on the 2008 version of IEEE (Institute of Electrical and Electronics Engineers) Standard 754, labeled 754-2008 [4].

Recall from Chapter 1 that a binary floating-point number can be represented by

• A sign for the number
• Some significant bits
• A signed scale factor exponent for an implied base of 2

The basic IEEE format is a 32-bit representation, shown in Figure 9.26a. The leftmost bit represents the sign, S, for the number. The next 8 bits, E′, represent the signed exponent of the scale factor (with an implied base of 2), and the remaining 23 bits, M, are the

[Figure content: (a) Single precision, 32 bits: sign bit S (0 signifies +, 1 signifies −), 8-bit signed exponent E′ in excess-127 representation, and 23-bit mantissa fraction M; value represented = ±1.M × 2^(E′ − 127). (b) Example of a single-precision number: S = 0, E′ = 00101000, M = 001010…0, representing +1.001010…0 × 2^−87. (c) Double precision, 64 bits: sign bit, 11-bit excess-1023 exponent, and 52-bit mantissa fraction; value represented = ±1.M × 2^(E′ − 1023).]

Figure 9.26 IEEE standard floating-point formats.


fractional part of the significant bits. The full 24-bit string, B, of significant bits, called the mantissa, always has a leading 1, with the binary point immediately to its right. Therefore, the mantissa

B = 1.M = 1.b−1b−2 . . . b−23

has the value

V(B) = 1 + b−1 × 2^−1 + b−2 × 2^−2 + · · · + b−23 × 2^−23

By convention, when the binary point is placed to the right of the first significant bit, the number is said to be normalized. Note that the base, 2, of the scale factor and the leading 1 of the mantissa are both fixed. They do not need to appear explicitly in the representation.

Instead of the actual signed exponent, E, the value stored in the exponent field is an unsigned integer E′ = E + 127. This is called the excess-127 format. Thus, E′ is in the range 0 ≤ E′ ≤ 255. The end values of this range, 0 and 255, are used to represent special values, as described later. Therefore, the range of E′ for normal values is 1 ≤ E′ ≤ 254. This means that the actual exponent, E, is in the range −126 ≤ E ≤ 127. The use of the excess-127 representation for exponents simplifies comparison of the relative sizes of two floating-point numbers. (See Problem 9.23.)

The 32-bit standard representation in Figure 9.26a is called a single-precision representation because it occupies a single 32-bit word. The scale factor has a range of 2^−126 to 2^+127, which is approximately equal to 10^±38. The 24-bit mantissa provides approximately the same precision as a 7-digit decimal value. An example of a single-precision floating-point number is shown in Figure 9.26b.
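The single-precision field layout can be illustrated by unpacking S, E′, and M from a 32-bit pattern and cross-checking against the machine's own interpretation (a Python sketch; the helper name and the example bit pattern are chosen for illustration, and it handles normal numbers only):

```python
import struct

def decode_single(bits):
    """Split a 32-bit IEEE single-precision pattern into its fields and
    reconstruct the value ±1.M × 2^(E' − 127) for a normal number."""
    s = (bits >> 31) & 0x1            # sign bit
    e_prime = (bits >> 23) & 0xFF     # 8-bit excess-127 exponent
    m = bits & 0x7FFFFF               # 23-bit mantissa fraction
    assert 1 <= e_prime <= 254, "normal numbers only in this sketch"
    mantissa = 1 + m / 2**23          # implied leading 1
    return (-1)**s * mantissa * 2.0**(e_prime - 127)

# Cross-check against the hardware float with the same bit pattern:
pattern = 0x41B80000                  # S=0, E'=131, fraction 0.4375 -> 23.0
assert decode_single(pattern) == struct.unpack(">f", pattern.to_bytes(4, "big"))[0]
```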

To provide more precision and range for floating-point numbers, the IEEE standard also specifies a double-precision format, as shown in Figure 9.26c. The double-precision format has increased exponent and mantissa ranges. The 11-bit excess-1023 exponent E′ has the range 1 ≤ E′ ≤ 2046 for normal values, with 0 and 2047 used to indicate special values, as before. Thus, the actual exponent E is in the range −1022 ≤ E ≤ 1023, providing scale factors of 2^−1022 to 2^1023 (approximately 10^±308). The 53-bit mantissa provides a precision equivalent to about 16 decimal digits.

A computer must provide at least single-precision representation to conform to the IEEE standard. Double-precision representation is optional. The standard also specifies certain optional extended versions of both of these formats. The extended versions provide increased precision and increased exponent range for the representation of intermediate values in a sequence of calculations. The use of extended formats helps to reduce the size of the accumulated round-off error in a sequence of calculations leading to a desired result. For example, the dot product of two vectors of numbers involves accumulating a sum of products. The input vector components are given in a standard precision, either single or double, and the final answer (the dot product) is truncated to the same precision. All intermediate calculations should be done using extended precision to limit accumulation of errors. Extended formats also enhance the accuracy of evaluation of elementary functions such as sine, cosine, and so on. This is because they are usually evaluated by adding up a number of terms in a series representation. In addition to requiring the four basic arithmetic operations, the standard requires three additional operations to be provided: remainder, square root, and conversion between binary and decimal representations.


[Figure content: (a) Unnormalized value 0.0010110… × 2^9, where there is no implicit 1 to the left of the binary point; (b) its normalized version 1.0110… × 2^6, with the excess-127 exponent adjusted accordingly.]

Figure 9.27 Floating-point normalization in IEEE single-precision format.

We note two basic aspects of operating with floating-point numbers. First, if a number is not normalized, it can be put in normalized form by shifting the binary point and adjusting the exponent. Figure 9.27 shows an unnormalized value, 0.0010110… × 2^9, and its normalized version, 1.0110… × 2^6. Since the scale factor is in the form 2^i, shifting the mantissa right or left by one bit position is compensated by an increase or a decrease of 1 in the exponent, respectively. Second, as computations proceed, a number that does not fall in the representable range of normal numbers might be generated. In single precision, this means that its normalized representation requires an exponent less than −126 or greater than +127. In the first case, we say that underflow has occurred, and in the second case, we say that overflow has occurred.

Special Values

The end values 0 and 255 of the excess-127 exponent E′ are used to represent special values. When E′ = 0 and the mantissa fraction M is zero, the value 0 is represented. When E′ = 255 and M = 0, the value ∞ is represented, where ∞ is the result of dividing a normal number by zero. The sign bit is still used in these representations, so there are representations for ±0 and ±∞.

When E′ = 0 and M ≠ 0, denormal numbers are represented. Their value is ±0.M × 2^−126. Therefore, they are smaller than the smallest normal number. There is no implied one to the left of the binary point, and M is any nonzero 23-bit fraction. The purpose of introducing denormal numbers is to allow for gradual underflow, providing an extension of the range of normal representable numbers. This is useful in dealing with very small numbers, which may be needed in certain situations. When E′ = 255 and M ≠ 0, the value


represented is called Not a Number (NaN). A NaN represents the result of performing an invalid operation such as 0/0 or √−1.
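These encodings can be summarized in a small classifier over the E′ and M fields (a Python sketch; the function name classify is illustrative):

```python
def classify(e_prime, m):
    """Classify a single-precision value from its 8-bit exponent field
    E' and 23-bit mantissa fraction M, per the rules above."""
    if e_prime == 0:
        return "zero" if m == 0 else "denormal"   # denormal: ±0.M × 2^-126
    if e_prime == 255:
        return "infinity" if m == 0 else "NaN"
    return "normal"                               # ±1.M × 2^(E' - 127)

assert classify(0, 0) == "zero"
assert classify(0, 1) == "denormal"
assert classify(255, 0) == "infinity"
assert classify(255, 0x400000) == "NaN"
assert classify(127, 0) == "normal"
```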

Exceptions

In conforming to the IEEE Standard, a processor must set exception flags if any of the following conditions arise when performing operations: underflow, overflow, divide by zero, inexact, invalid. We have already mentioned the first three. Inexact is the name for a result that requires rounding in order to be represented in one of the normal formats. An invalid exception occurs if operations such as 0/0 or √−1 are attempted. When an exception occurs, the result is set to one of the special values.

If interrupts are enabled for any of the exception flags, system or user-defined routines are entered when the associated exception occurs. Alternatively, the application program can test for the occurrence of exceptions, as necessary, and decide how to proceed.

9.7.1 Arithmetic Operations on Floating-Point Numbers

In this section, we outline the general procedures for addition, subtraction, multiplication, and division of floating-point numbers. The rules given below apply to the single-precision IEEE standard format. These rules specify only the major steps needed to perform the four operations; for example, the possibility that overflow or underflow might occur is not discussed. Furthermore, intermediate results for both mantissas and exponents might require more than 24 and 8 bits, respectively. These and other aspects of the operations must be carefully considered in designing an arithmetic unit that meets the standard. Although we do not provide full details in specifying the rules, we consider some aspects of implementation, including rounding, in later sections.

When adding or subtracting floating-point numbers, their mantissas must be shifted with respect to each other if their exponents differ. Consider a decimal example in which we wish to add 2.9400 × 10^2 to 4.3100 × 10^4. We rewrite 2.9400 × 10^2 as 0.0294 × 10^4 and then perform addition of the mantissas to get 4.3394 × 10^4. The rule for addition and subtraction can be stated as follows:

Add/Subtract Rule

1. Choose the number with the smaller exponent and shift its mantissa right a number of steps equal to the difference in exponents.

2. Set the exponent of the result equal to the larger exponent.

3. Perform addition/subtraction on the mantissas and determine the sign of the result.

4. Normalize the resulting value, if necessary.
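The four steps can be traced on a toy decimal model that mirrors the worked example above (Python; real hardware operates on binary mantissas, and the representation of each operand as a (sign, mantissa, exponent) triple is an illustrative simplification):

```python
def fp_add(sign_a, mant_a, exp_a, sign_b, mant_b, exp_b):
    """Add two numbers given as sign, mantissa, exponent with an
    implied base of 10, following the four-step Add/Subtract rule."""
    # Step 1: make A the larger-exponent operand, then shift the
    # mantissa of the other operand right by the exponent difference.
    if exp_a < exp_b:
        sign_a, mant_a, exp_a, sign_b, mant_b, exp_b = (
            sign_b, mant_b, exp_b, sign_a, mant_a, exp_a)
    mant_b /= 10 ** (exp_a - exp_b)
    # Step 2: the result exponent is the larger exponent.
    exp = exp_a
    # Step 3: add/subtract the mantissas and determine the sign.
    total = (-mant_a if sign_a else mant_a) + (-mant_b if sign_b else mant_b)
    sign, mant = (1, -total) if total < 0 else (0, total)
    # Step 4: normalize so that 1 <= mantissa < 10 (base 10 here).
    while mant != 0 and mant >= 10:
        mant /= 10
        exp += 1
    while mant != 0 and mant < 1:
        mant *= 10
        exp -= 1
    return sign, mant, exp

# The decimal example from the text: 2.9400 × 10^2 + 4.3100 × 10^4.
s, m, e = fp_add(0, 2.94, 2, 0, 4.31, 4)
assert (s, e) == (0, 4) and abs(m - 4.3394) < 1e-9
```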

Multiplication and division are somewhat easier than addition and subtraction, in that no alignment of mantissas is needed.

Multiply Rule

1. Add the exponents and subtract 127 to maintain the excess-127 representation.

2. Multiply the mantissas and determine the sign of the result.

3. Normalize the resulting value, if necessary.


Divide Rule

1. Subtract the exponents and add 127 to maintain the excess-127 representation.

2. Divide the mantissas and determine the sign of the result.

3. Normalize the resulting value, if necessary.
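The exponent bookkeeping in the Multiply and Divide rules follows from the bias: adding two excess-127 exponents doubles the bias, while subtracting them cancels it. A minimal sketch (Python; the function names are illustrative):

```python
BIAS = 127  # excess-127 representation

def multiply_exponent(ea_prime, eb_prime):
    """E'_R = E'_A + E'_B - 127: the sum carries two biases, so one
    must be subtracted to stay in excess-127 form."""
    return ea_prime + eb_prime - BIAS

def divide_exponent(ea_prime, eb_prime):
    """E'_R = E'_A - E'_B + 127: the difference cancels the bias, so
    one bias must be added back."""
    return ea_prime - eb_prime + BIAS

# 2^3 × 2^4 = 2^7 and 2^3 / 2^4 = 2^-1, checked in biased form:
assert multiply_exponent(3 + BIAS, 4 + BIAS) == 7 + BIAS
assert divide_exponent(3 + BIAS, 4 + BIAS) == -1 + BIAS
```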

9.7.2 Guard Bits and Truncation

Let us consider some important aspects of implementing the steps in the preceding algorithms. Although the mantissas of initial operands and final results are limited to 24 bits, including the implicit leading 1, it is important to retain extra bits, often called guard bits, during the intermediate steps. This yields maximum accuracy in the final results.

Removing guard bits in generating a final result requires that the extended mantissa be truncated to create a 24-bit number that approximates the longer version. This operation also arises in other situations, for instance, in converting from decimal to binary numbers. We should mention that the general term rounding is also used for the truncation operation, but a more restrictive definition of rounding is used here as one of the forms of truncation.

There are several ways to truncate. The simplest way is to remove the guard bits and make no changes in the retained bits. This is called chopping. Suppose we want to truncate a fraction from six to three bits by this method. All fractions in the range 0.b−1b−2b−3000 to 0.b−1b−2b−3111 are truncated to 0.b−1b−2b−3. The error in the 3-bit result ranges from 0 to 0.000111. In other words, the error in chopping ranges from 0 to almost 1 in the least significant position of the retained bits. In our example, this is the b−3 position. The result of chopping is a biased approximation because the error range is not symmetrical about 0.

The next simplest method of truncation is von Neumann rounding. If the bits to be removed are all 0s, they are simply dropped, with no changes to the retained bits. However, if any of the bits to be removed are 1, the least significant bit of the retained bits is set to 1. In our 6-bit to 3-bit truncation example, all 6-bit fractions with b−4b−5b−6 not equal to 000 are truncated to 0.b−1b−21. The error in this truncation method ranges between −1 and +1 in the LSB position of the retained bits. Although the range of error is larger with this technique than it is with chopping, the maximum magnitude is the same, and the approximation is unbiased because the error range is symmetrical about 0.

Unbiased approximations are advantageous if many operands and operations are involved in generating a result, because positive errors tend to offset negative errors as the computation proceeds. Statistically, we can expect the results of a complex computation to be more accurate.

The third truncation method is a rounding procedure. Rounding achieves the closest approximation to the number being truncated and is an unbiased technique. The procedure is as follows: A 1 is added to the LSB position of the bits to be retained if there is a 1 in the MSB position of the bits being removed. Thus, 0.b−1b−2b−31… is rounded to 0.b−1b−2b−3 + 0.001, and 0.b−1b−2b−30… is rounded to 0.b−1b−2b−3. This provides the desired approximation, except for the case in which the bits to be removed are 10…0. This is a tie situation; the longer value is halfway between the two closest truncated representations. To break the tie in an unbiased way, one possibility is to choose the retained


bits to be the nearest even number. In terms of our 6-bit example, the value 0.b−1b−20100 is truncated to the value 0.b−1b−20, and 0.b−1b−21100 is truncated to 0.b−1b−21 + 0.001. The descriptive phrase "round to the nearest number or nearest even number in case of a tie" is sometimes used to refer to this truncation technique. The error range is approximately −1/2 to +1/2 in the LSB position of the retained bits. Clearly, this is the best method.

However, it is also the most difficult to implement because it requires an addition operation and a possible renormalization. This rounding technique is the default mode for truncation specified in the IEEE floating-point standard. The standard also specifies other truncation methods, referring to all of them as rounding modes.
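The three truncation methods can be compared directly on bit patterns (a Python sketch; the fraction is held as an integer whose low guard bits are to be removed, and the function names are illustrative):

```python
def chop(frac, guard):
    """Drop the low guard bits; the retained bits are unchanged."""
    return frac >> guard

def von_neumann(frac, guard):
    """Drop the guard bits; if any of them was 1, force the LSB of
    the retained bits to 1."""
    retained = frac >> guard
    if frac & ((1 << guard) - 1):
        retained |= 1
    return retained

def round_nearest_even(frac, guard):
    """Round to nearest; on a tie (removed bits are 10...0), round so
    that the retained LSB becomes 0 (the nearest even number)."""
    retained = frac >> guard
    removed = frac & ((1 << guard) - 1)
    half = 1 << (guard - 1)
    if removed > half or (removed == half and retained & 1):
        retained += 1
    return retained

# The 6-bit to 3-bit examples from the text:
assert round_nearest_even(0b100100, 3) == 0b100  # tie, LSB 0: round down
assert round_nearest_even(0b101100, 3) == 0b110  # tie, LSB 1: round up
assert chop(0b101100, 3) == 0b101
assert von_neumann(0b100001, 3) == 0b101         # a removed 1 sets the LSB
```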

This discussion of errors that are introduced when guard bits are removed by truncation has treated the case of a single truncation operation. When a long series of calculations involving floating-point numbers is performed, the analysis that determines error ranges or bounds for the final results can be a complicated study. We do not discuss this aspect of numerical computation further, except to make a few comments on the way that guard bits and rounding are handled in the IEEE floating-point standard.

According to the standard, results of single operations must be computed to be accurate within half a unit in the LSB position. This means that rounding must be used as the truncation method. Implementing rounding requires only three guard bits to be carried along during the intermediate steps in performing an operation. The first two of these bits are the two most significant bits of the section of the mantissa to be removed. The third bit is the logical OR of all bits beyond these first two bits in the full representation of the mantissa. This bit is relatively easy to maintain during the intermediate steps of the operations to be performed. It should be initialized to 0. If a 1 is shifted out through this position while aligning mantissas, the bit becomes 1 and retains that value; hence, it is usually called the sticky bit.

9.7.3 Implementing Floating-Point Operations

The hardware implementation of floating-point operations involves a considerable amount of logic circuitry. These operations can also be implemented by software routines. In either case, the computer must be able to convert input and output from and to the user's decimal representation of numbers. In many general-purpose processors, floating-point operations are available at the machine-instruction level, implemented in hardware.

An example of the implementation of floating-point operations is shown in Figure 9.28. This is a block diagram of a hardware implementation for the addition and subtraction of 32-bit floating-point operands that have the format shown in Figure 9.26a. Following the Add/Subtract rule given in Section 9.7.1, we see that the first step is to compare exponents to determine how far to shift the mantissa of the number with the smaller exponent. The shift-count value, n, is determined by the 8-bit subtractor circuit in the upper left corner of the figure. The magnitude of the difference E′A − E′B, or n, is sent to the SHIFTER unit. If n is larger than the number of significant bits of the operands, then the answer is essentially the larger operand (except for guard and sticky-bit considerations in rounding), and shortcuts can be taken in deriving the result. We do not explore this in detail.


[Figure content: 32-bit operands A : SA, E′A, MA and B : SB, E′B, MB feed an 8-bit subtractor that computes n = E′A − E′B; the sign of the difference drives a SWAP network so that the mantissa of the number with the smaller exponent goes to the SHIFTER (shift n bits to the right) while the mantissa of the number with the larger exponent goes directly to the mantissa adder/subtractor; a MUX selects the larger of E′A and E′B as the tentative exponent E′; a combinational CONTROL network, driven by SA, SB, the Add/Sub operation, and the sign of the exponent difference, selects Add or Subtract and produces the result sign SR; a leading-zeros detector and a second 8-bit subtractor compute E′R = E′ − X, and a normalize-and-round stage produces MR, giving the result R = A + B as SR, E′R, MR.]

Figure 9.28 Floating-point addition-subtraction unit.


The sign of the difference that results from comparing exponents determines which mantissa is to be shifted. Therefore, in step 1, the sign is sent to the SWAP network in the upper right corner of Figure 9.28. If the sign is 0, then E′A ≥ E′B and the mantissas MA and MB are sent straight through the SWAP network. This results in MB being sent to the SHIFTER, to be shifted n positions to the right. The other mantissa, MA, is sent directly to the mantissa adder/subtractor. If the sign is 1, then E′A < E′B and the mantissas are swapped before they are sent to the SHIFTER.

Step 2 is performed by the two-way multiplexer, MUX, near the bottom left corner of the figure. The exponent of the result, E′, is tentatively determined as E′A if E′A ≥ E′B, or E′B if E′A < E′B, based on the sign of the difference resulting from comparing exponents in step 1.

Step 3 involves the major component, the mantissa adder/subtractor in the middle of the figure. The CONTROL logic determines whether the mantissas are to be added or subtracted. This is decided by the signs of the operands (SA and SB) and the operation (Add or Subtract) that is to be performed on the operands. The CONTROL logic also determines the sign of the result, SR. For example, if A is negative (SA = 1), B is positive (SB = 0), and the operation is A − B, then the mantissas are added and the sign of the result is negative (SR = 1). On the other hand, if A and B are both positive and the operation is A − B, then the mantissas are subtracted. The sign of the result, SR, now depends on the mantissa subtraction operation. For instance, if E′A > E′B, then M = MA − (shifted MB) and the resulting number is positive. But if E′B > E′A, then M = MB − (shifted MA) and the result is negative. This example shows that the sign from the exponent comparison is also required as an input to the CONTROL network. When E′A = E′B and the mantissas are subtracted, the sign of the mantissa adder/subtractor output determines the sign of the result. The reader should now be able to construct the complete truth table for the CONTROL network (see Problem 9.26).

Step 4 of the Add/Subtract rule consists of normalizing the result of step 3 by shifting M to the right or to the left, as appropriate. The number of leading zeros in M determines the number of bit shifts, X, to be applied to M. The normalized value is rounded to generate the 24-bit mantissa, MR, of the result. The value X is also subtracted from the tentative result exponent E′ to generate the true result exponent, E′R. Note that only a single right shift might be needed to normalize the result. This would be the case if two mantissas of the form 1.xx… were added. The vector M would then have the form 1x.xx….

We have not given any details on the guard bits that must be carried along with intermediate mantissa values. In the IEEE standard, only a few bits are needed, as discussed earlier, to generate the 24-bit normalized mantissa of the result.

Let us consider the actual hardware that is needed to implement the blocks in Figure 9.28. The two 8-bit subtractors and the mantissa adder/subtractor can be implemented by combinational logic, as discussed earlier in this chapter. Because their outputs must be in sign-and-magnitude form, we must modify some of our earlier discussions. A combination of 1's-complement arithmetic and sign-and-magnitude representation is often used. Considerable flexibility is allowed in implementing the SHIFTER and the output normalization operation. The operations can be implemented with shift registers. However, they can also be built as combinational logic units for high performance.


9.8 Decimal-to-Binary Conversion

In Chapter 1 and in this chapter, examples that involve decimal numbers have used small values. Conversion from decimal to binary representation has been easy to do based on the binary bit-position weights 1, 2, 4, 8, 16, …. However, it is useful to have a general method for converting decimal numbers to binary representation.

The fixed-point, unsigned, binary number

B = bn−1bn−2 . . . b0.b−1b−2 . . . b−m

has an n-bit integer part and an m-bit fraction part. Its value, V(B), is given by

V(B) = bn−1 × 2^(n−1) + bn−2 × 2^(n−2) + · · · + b0 × 2^0 + b−1 × 2^−1 + b−2 × 2^−2 + · · · + b−m × 2^−m

To convert a fixed-point decimal number into binary, the integer and fraction parts are handled separately. Conversion of the integer part starts by dividing it by 2. The remainder, which is either 0 or 1, is the least significant bit, b0, of the integer part of B. The quotient is again divided by 2. The remainder is the next bit, b1, of B. This process is repeated up to and including the step in which the quotient becomes 0.

Conversion of the fraction part starts by multiplying it by 2. The part of the product to the left of the decimal point, which is either 0 or 1, is bit b−1 of the fraction part of B. The fraction part of the product is again multiplied by 2, generating the next bit, b−2, of the fraction part of B. The process is repeated until the fraction part of the product becomes 0 or until the required accuracy is obtained.

Figure 9.29 shows an example of conversion from the decimal number 927.45 to binary. Note that conversion of the integer part is always exact and terminates when the quotient becomes 0. But an exact binary fraction may not exist for a given decimal fraction. For example, the decimal fraction 0.45 used in Figure 9.29 does not have an exact binary equivalent. This is obvious from the pattern developing in the figure. In such cases, the binary fraction is generated to some desired level of accuracy. Of course, some decimal fractions have an exact binary representation. For example, the decimal fraction 0.25 has a binary equivalent of 0.01.
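The two procedures can be sketched as follows (Python; the fraction part is processed with exact integer arithmetic on a numerator/denominator pair to avoid floating-point error, and the function names are illustrative):

```python
def int_to_binary(n):
    """Repeated division by 2; the remainders give the bits, LSB first."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        n, r = divmod(n, 2)
        bits.append(str(r))        # b0, b1, b2, ...
    return "".join(reversed(bits))

def frac_to_binary(numerator, denominator, max_bits):
    """Repeated multiplication of numerator/denominator by 2; the integer
    part of each product gives the next bit, b-1 first. Stops when the
    fraction becomes 0 or max_bits bits have been produced."""
    bits = []
    while numerator and len(bits) < max_bits:
        numerator *= 2
        bit, numerator = divmod(numerator, denominator)
        bits.append(str(bit))
    return "".join(bits)

# The example of Figure 9.29: 927.45.
assert int_to_binary(927) == "1110011111"
assert frac_to_binary(45, 100, 7) == "0111001"
# An exact case terminates early: 0.25 = 0.01.
assert frac_to_binary(25, 100, 8) == "01"
```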

9.9 Concluding Remarks

Computer arithmetic poses several interesting logic design problems. This chapter discussed some of the techniques that have proven useful in designing binary arithmetic units. The carry-lookahead technique is one of the major ideas in high-performance adder design. In the design of fast multipliers, bit-pair recoding of the multiplier, derived from the Booth algorithm, reduces the number of summands that must be added to generate the product. The parallel addition of summands using carry-save reduction trees substantially reduces


[Figure content: Convert (927.45)10 to binary.

Integer part, by repeated division by 2 (remainders are the bits, LSB first):
927/2 = 463 + 1/2   bit 1 (LSB)
463/2 = 231 + 1/2   bit 1
231/2 = 115 + 1/2   bit 1
115/2 = 57 + 1/2    bit 1
57/2 = 28 + 1/2     bit 1
28/2 = 14 + 0/2     bit 0
14/2 = 7 + 0/2      bit 0
7/2 = 3 + 1/2       bit 1
3/2 = 1 + 1/2       bit 1
1/2 = 0 + 1/2       bit 1 (MSB)

Fraction part, by repeated multiplication by 2 (integer parts are the bits, MSB first):
0.45 × 2 = 0.90     bit 0 (MSB)
0.90 × 2 = 1.80     bit 1
0.80 × 2 = 1.60     bit 1
0.60 × 2 = 1.20     bit 1
0.20 × 2 = 0.40     bit 0
0.40 × 2 = 0.80     bit 0
0.80 × 2 = 1.60     bit 1 (LSB)

(927.45)10 = (1110011111.0111001 …)2]

Figure 9.29 Conversion from decimal to binary.


the time needed to add the summands. The important IEEE floating-point number representation standard was described, and rules for performing the four standard operations were given.

9.10 Solved Problems

This section presents some examples of the types of problems that a student may be asked to solve, and shows how such problems can be solved.

Example 9.1

Problem: How many logic gates are needed to build the 4-bit carry-lookahead adder shown in Figure 9.4?

Solution: Each B cell requires 3 gates as shown in Figure 9.4a. Hence, 12 gates are needed for all four B cells.

The carries c1, c2, c3, and c4, produced by the carry-lookahead logic, require 2, 3, 4, and 5 gates, respectively, according to the four logic expressions in Section 9.2.1. The carry-lookahead logic also produces G^I_0, using 4 gates, and P^I_0, using 1 gate, as also shown in Section 9.2.1. Hence, a total of 19 gates are needed to implement the carry-lookahead logic.

The complete 4-bit adder requires 12 + 19 = 31 gates, with a maximum fan-in of 5.

Example 9.2

Problem: Assuming 6-bit 2's-complement number representation, multiply the multiplicand A = 110101 by the multiplier B = 011011 using both the normal Booth algorithm and the bit-pair recoding Booth algorithm, following the pattern used in Figure 9.15.

Solution: The multiplications are performed as follows:

(a) Normal Booth algorithm

          1 1 0 1 0 1
×       +1  0 −1 +1  0 −1
-------------------------
  0 0 0 0 0 0 0 0 1 0 1 1
  1 1 1 1 1 1 0 1 0 1 0 0
  0 0 0 0 0 1 0 1 1 0 0 0
  1 1 1 0 1 0 1 0 0 0 0 0
-------------------------
  1 1 1 0 1 1 0 1 0 1 1 1


(b) Bit-pair recoding Booth algorithm

          1 1 0 1 0 1
×       +2      −1      −1
-------------------------
  0 0 0 0 0 0 0 0 1 0 1 1
  0 0 0 0 0 0 1 0 1 1 0 0
  1 1 1 0 1 0 1 0 0 0 0 0
-------------------------
  1 1 1 0 1 1 0 1 0 1 1 1
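Both recodings can be checked in software (a Python sketch; the helper names are illustrative):

```python
def booth_recode(multiplier, n):
    """Booth digits (LSB first): digit_i = b_{i-1} - b_i, with b_{-1} = 0."""
    digits = []
    prev = 0
    for i in range(n):
        bit = (multiplier >> i) & 1
        digits.append(prev - bit)
        prev = bit
    return digits

def to_signed(value, n):
    """Interpret an n-bit pattern as a 2's-complement integer."""
    return value - (1 << n) if value & (1 << (n - 1)) else value

A, B, N = 0b110101, 0b011011, 6
digits = booth_recode(B, N)
assert digits[::-1] == [1, 0, -1, 1, 0, -1]       # +1 0 -1 +1 0 -1, MSB first

# Bit-pair recoding combines Booth digits two at a time: 2*d_{i+1} + d_i.
pairs = [2 * digits[i + 1] + digits[i] for i in range(0, N, 2)]
assert pairs[::-1] == [2, -1, -1]                  # +2 -1 -1, MSB first

# The recoded digits reproduce the true 2's-complement product.
product = sum((d * to_signed(A, N)) << i for i, d in enumerate(digits))
assert product == to_signed(A, N) * to_signed(B, N) == -297
```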

Example 9.3

Problem: How many levels of 4-2 reducers are needed to reduce k summands to 2 in a reduction tree? How many levels are needed if 3-2 reducers are used?

Solution: Let the number of levels be L. For 4-2 reducers, we have

k(1/2)^L = 2

Take logarithms to the base 2 of each side of this equation to derive

log2 k − L = 1

or

L = log2 k − 1

For 3-2 reducers, we have

k(2/3)^L = 2

As above, taking logarithms to the base 2, we derive

log2 k + L(log2 2 − log2 3) = log2 2

log2 k + L(1 − 1.59) = 1

L = (1 − log2 k)/(−0.59)

L = 1.7 log2 k − 1.7

These expressions are only approximations unless the number of input summands to each level is a multiple of 4 in the case of 4-2 reduction, or is a multiple of 3 in the case of 3-2 reduction.
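The closed-form expressions can be compared against an exact count obtained by simulating the reduction level by level, with leftover summands passed through unchanged (Python; the function name is illustrative):

```python
import math

def levels_needed(k, kept, group):
    """Count reducer levels to go from k summands down to 2, where each
    level maps full groups of `group` summands to `kept` summands
    (4-2 reducers: group=4, kept=2; 3-2 reducers: group=3, kept=2)
    and passes any leftover summands through to the next level."""
    levels = 0
    while k > 2:
        full, rest = divmod(k, group)
        k = full * kept + rest
        levels += 1
    return levels

# Exact simulation agrees with L = log2 k - 1 when k is a power of 4's worth
# of clean reductions:
assert levels_needed(32, 2, 4) == 4 == math.log2(32) - 1
assert levels_needed(16, 2, 4) == 3 == math.log2(16) - 1
# For 3-2 reducers the formula is only approximate (1.7*log2(27) - 1.7 = 6.4):
assert levels_needed(27, 2, 3) == 7
```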

Example 9.4

Problem: Convert the decimal fraction 0.1 to a binary fraction. If the conversion is not exact, give the binary fraction approximation to 8 bits after the binary point using each of the three truncation methods discussed in Section 9.7.2.

Solution: Use the conversion method given in Section 9.8. Multiplying the decimal fraction 0.1 by 2 repeatedly, as shown in Figure 9.29, generates the sequence of bits


0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, … to the left of the decimal point, which continues indefinitely, repeating the pattern 0, 0, 1, 1. Hence, the conversion is not exact.

• Truncation by chopping gives 0.00011001
• Truncation by von Neumann rounding gives 0.00011001
• Truncation by rounding gives 0.00011010

Example 9.5

Problem: Consider the following 12-bit floating-point number representation format that is manageable for working through numerical exercises. The first bit is the sign of the number. The next five bits represent an excess-15 exponent for the scale factor, which has an implied base of 2. The last six bits represent the fractional part of the mantissa, which has an implied 1 to the left of the binary point.

Perform Subtract and Multiply operations on the operands

A = 0 10001 011011

B = 1 01111 101010

which represent the numbers

A = 1.011011 × 2^2

and

B = −1.101010 × 2^0

Solution: The required operations are performed as follows:

• Subtraction
According to the Add/Subtract rule in Section 9.7.1, we perform the following four steps:

1. Shift the mantissa of B to the right by two bit positions, giving 0.01101010.

2. Set the exponent of the result to 10001.

3. Subtract the mantissa of B from the mantissa of A by adding mantissas, because B is negative, giving

  1.01101100
+ 0.01101010
------------
  1.11010110

and set the sign of the result to 0 (positive).

4. The result is in normalized form, but the fractional part of the mantissa needs to be truncated to six bits. If this is done by rounding, the two bits to be removed represent the tie case, so we round to the nearest even number by adding 1, obtaining a result mantissa of 1.110110. The answer is

A − B = 0 10001 110110


• Multiplication
According to the Multiplication rule in Section 9.7.1, we perform the following three steps:

1. Add the exponents and subtract 15 to obtain 10001 as the exponent of the result.

2. Multiply mantissas to obtain 10.010110101110 as the mantissa of the result. The sign of the result is set to 1 (negative).

3. Normalize the resulting mantissa by shifting it to the right by one bit position. Then add 1 to the exponent to obtain 10010 as the exponent of the result. Truncate the mantissa fraction to six bits by rounding to obtain the answer

A × B = 1 10010 001011

Problems

9.1 [M] A half adder is a combinational logic circuit that has two inputs, x and y, and two outputs, s and c, that are the sum and carry-out, respectively, resulting from the binary addition of x and y.

(a) Design a half adder as a two-level AND-OR circuit.

(b) Show how to implement a full adder, as shown in Figure 9.2a, by using two half adders and external logic gates, as necessary.

(c) Compare the longest logic delay path through the network derived in part (b) to that of the logic delay of the adder network shown in Figure 9.2a.

9.2 [M] The 1’s-complement and 2’s-complement binary representation methods are special cases of the (b − 1)’s-complement and b’s-complement representation techniques in base-b number systems. For example, consider the decimal system. The sign-and-magnitude values +526, −526, +70, and −70 have 4-digit signed-number representations in each of the two complement systems, as shown in Figure P9.1. The 9’s-complement is formed by

Representation        Examples
Sign and magnitude    +526    −526    +70     −70
9’s complement        0526    9473    0070    9929
10’s complement       0526    9474    0070    9930

Figure P9.1 Signed numbers in base 10 used in Problem 9.2.

Page 402: Carl Hamacher

December 10, 2010 11:13 ham_338065_ch09 Sheet number 44 Page number 378 cyan black

378 C H A P T E R 9 • Arithmetic

taking the complement of each digit position with respect to 9. The 10’s-complement is formed by adding 1 to the 9’s-complement. In each of the latter two representations, the leftmost digit is zero for a positive number and 9 for a negative number.

Now consider the base-3 (ternary) system, in which the unsigned, 5-digit number t4t3t2t1t0 has the value t4 × 3^4 + t3 × 3^3 + t2 × 3^2 + t1 × 3^1 + t0 × 3^0, with 0 ≤ ti ≤ 2. Express the ternary sign-and-magnitude numbers +11011, −10222, +2120, −1212, +10, and −201 as 6-digit, signed, ternary numbers in the 3’s-complement system.

9.3 [M] Represent each of the decimal values 56, −37, 122, and −123 as signed 6-digit numbers in the 3’s-complement ternary format, perform addition and subtraction on them in all possible pairwise combinations, and state whether or not arithmetic overflow occurs for each operation performed. (See Problem 9.2 for a definition of the ternary number system, and use a technique analogous to that given in Section 9.8 for decimal-to-ternary integer conversion.)

9.4 [M] A modulo 10 adder is needed for adding BCD digits. Modulo 10 addition of two BCD digits, A = A3A2A1A0 and B = B3B2B1B0, can be achieved as follows: Add A to B (binary addition). Then, if the result is an illegal code that is greater than or equal to 1010, add 6 (0110). (Ignore overflow from this addition.)

(a) When is the output carry equal to 1?

(b) Show that this algorithm gives correct results for:

(1) A = 0101 and B = 0110

(2) A = 0011 and B = 0100

(c) Design a BCD digit adder using a 4-bit binary adder and external logic gates as needed. The inputs are A3A2A1A0, B3B2B1B0, and a carry-in. The outputs are the sum digit S3S2S1S0 and the carry-out. A cascade of such blocks can form a ripple-carry BCD adder.

9.5 [E] Show that the logic expression cn ⊕ cn−1 is a correct indicator of overflow in the addition of 2’s-complement integers by using an appropriate truth table.

9.6 [E] Use appropriate parts of the solution in Example 9.1 to calculate how many logic gates are needed to build the 16-bit carry-lookahead adder shown in Figure 9.5.

9.7 [M] Carry-lookahead adders and their delay are investigated in this problem.

(a) Design a 64-bit adder that uses four of the 16-bit carry-lookahead adders shown in Figure 9.5 along with additional logic circuits to generate c16, c32, c48, and c64, from c0 and the Gi^II and Pi^II variables shown in the figure. What is the relationship of the additional circuits to the carry-lookahead logic circuits in the figure?

(b) Show that the delay through the 64-bit adder is 12 gate delays for s63 and 7 gate delays for c64, as claimed at the end of Section 9.2.1.

(c) Compare the gate delays to produce s31 and c32 in the 64-bit adder of part (a) to the gate delays for the same variables in the 32-bit adder built from a cascade of two 16-bit adders, as discussed in Section 9.2.1.

9.8 [M] Show that the worst case delay through an n × n array of the type shown in Figure 9.6b is 6(n − 1) − 1 gate delays, as claimed in Section 9.3.1.

Page 403: Carl Hamacher

December 10, 2010 11:13 ham_338065_ch09 Sheet number 45 Page number 379 cyan black

Problems 379

9.9 [E] Multiply each of the following pairs of signed 2’s-complement numbers using the Booth algorithm. In each case, assume that A is the multiplicand and B is the multiplier.

(a) A = 010111 and B = 110110

(b) A = 110011 and B = 101100

(c) A = 001111 and B = 001111

9.10 [M] Repeat Problem 9.9 using bit-pair recoding of the multiplier.

9.11 [M] Indicate generally how to modify the circuit diagram in Figure 9.7a to implement multiplication of 2’s-complement n-bit numbers using the Booth algorithm, by clearly specifying inputs and outputs for the Control sequencer and any other changes needed around the adder and register A.

9.12 [M] Extend the Figure 9.14b table to 16 rows, indicating how to recode three multiplier bits: i + 2, i + 1, and i. Can all required versions of the multiplicand selected at position i be generated by shifting and/or negating the multiplicand M? If not, what versions cannot be generated this way, and for what cases are they required?

9.13 [M] If the product of two n-bit numbers in 2’s-complement representation can be represented in n bits, the manual multiplication algorithm shown in Figure 9.6a can be used directly, treating the sign bits the same as the other bits. Try this on each of the following pairs of 4-bit signed numbers:

(a) Multiplicand = 1110 and Multiplier = 1101

(b) Multiplicand = 0010 and Multiplier = 1110

Why does this work correctly?

9.14 [D] An integer arithmetic unit that can perform addition and multiplication of 16-bit unsigned numbers is to be used to multiply two 32-bit unsigned numbers. All operands, intermediate results, and final results are held in 16-bit registers labeled R0 through R15. The hardware multiplier multiplies the contents of Ri (multiplicand) by Rj (multiplier) and stores the double-length 32-bit product in registers Rj and Rj+1, with the low-order half in Rj. When j = i − 1, the product overwrites both operands. The hardware adder adds the contents of Ri and Rj and puts the result in Rj. The input carry to an Add operation is 0, and the input carry to an Add-with-carry operation is the contents of a carry flag C. The output carry from the adder is always stored in C.

Specify the steps of a procedure for multiplying two 32-bit operands in registers R1, R0, and R3, R2, high-order halves first, leaving the 64-bit product in registers R15, R14, R13, and R12. Any of the registers R11 through R4 may be used for intermediate values, if necessary. Each step in the procedure can be a multiplication, or an addition, or a register transfer operation.

9.15 [M] Delay in multiplier arrays is investigated in this problem.

(a) Calculate the delay, in terms of full-adder block delays, in producing product bit p7 in each of the 4 × 4 multiplier arrays in Figure 9.16. Ignore the AND gate delay to generate all mi qj products at the beginning.

(b) Develop delay expressions for each of the arrays in Figure 9.16 in terms of n for the n × n case, as an extension of part (a) of the problem. Then use these expressions to calculate delay for the 32 × 32 case for each array.

Page 404: Carl Hamacher

December 10, 2010 11:13 ham_338065_ch09 Sheet number 46 Page number 380 cyan black

380 C H A P T E R 9 • Arithmetic

9.16 [M] Tree depth for carry-save reduction is analyzed in this problem.

(a) How many 3-2 reduction levels are needed to reduce 16 summands to 2 using a pattern similar to that shown in Figure 9.19?

(b) Repeat part (a) for reducing 32 summands to 2 to show that the claim of 8 levels in Section 9.5.3 is correct.

(c) Compare the exact answers in parts (a) and (b) to the results obtained by using the approximation developed in Example 9.3 in Section 9.10.

9.17 [M] Tree reduction of summands using 3-2 and 4-2 reducers was described in Sections 9.5.3 and 9.5.4. It is also possible to perform 7-3 reductions on each reduction level. When only three summands remain, a 3-2 reduction is performed, followed by addition of the final two summands.

(a) How many 7-3 reduction levels are needed to reduce 32 summands to three? Compare this to the seven levels needed to reduce 32 summands to three when using 3-2 reductions.

(b) Example 9.3 in Section 9.10 shows that log2 k − 1 levels of 4-2 reduction are needed to reduce k summands to 2 in a reduction tree. How many levels of 7-3 reduction are needed to reduce k summands to 3?

9.18 [M] Show how to implement a 4-2 reducer by using two 3-2 reducers. The truth table for this implementation is different from that shown in Figure 9.21.

9.19 [E] Using manual methods, perform the operations A × B and A ÷ B on the 5-bit unsigned numbers A = 10101 and B = 00101.

9.20 [M] Show how the multiplication and division operations in Problem 9.19 would be performed by the hardware in Figures 9.7a and 9.23, respectively, by constructing charts similar to those in Figures 9.7b and 9.25.

9.21 [D] In Section 9.7, we used the practical-sized 32-bit IEEE standard format for floating-point numbers. Here, we use a shortened format that retains all the pertinent concepts but is manageable for working through numerical exercises. Consider that floating-point numbers are represented in a 12-bit format as shown in Figure P9.2. The scale factor has an implied base of 2 and a 5-bit, excess-15 exponent, with the two end values of 0 and 31 used to signify exact 0 and infinity, respectively. The 6-bit mantissa is normalized as in the IEEE format, with an implied 1 to the left of the binary point.

(a) Represent the numbers +1.7, −0.012, +19, and 18 in this format.

(b) What are the smallest and largest numbers representable in this format?

(c) How does the range calculated in part (b) compare to the ranges of a 12-bit signed integer and a 12-bit signed fraction?

(d) Perform Add, Subtract, Multiply, and Divide operations on the operands

A = 0 10000 011011

B = 1 01110 101010

Page 405: Carl Hamacher

December 10, 2010 11:13 ham_338065_ch09 Sheet number 47 Page number 381 cyan black

Problems 381

[Figure P9.2 shows the 12-bit format: 1 bit for the sign of the number, a 5-bit excess-15 exponent, and a 6-bit fractional mantissa, where a sign of 0 signifies + and 1 signifies −.]

Figure P9.2 Floating-point format used in Problem 9.21.

9.22 [D] Consider a 16-bit floating-point number in a format similar to that discussed in Problem 9.21, with a 6-bit exponent and a 9-bit mantissa fraction. The base of the scale factor is 2 and the exponent is represented in excess-31 format.

(a) Add the numbers A and B, formatted as follows:

A = 0 100001 111111110

B = 0 011111 001010101

Give the answer in normalized form. Remember that an implicit 1 is to the left of the binary point but is not included in the A and B formats. Use rounding as the truncation method when producing the final mantissa.

(b) Using decimal numbers w, x, y, and z, express the magnitude of the largest and smallest (nonzero) values representable in the preceding normalized floating-point format. Use the following form:

Largest = w × 2^x

Smallest = y × 2^−z

9.23 [M] How does the excess-x representation for exponents of the scale factor in the floating-point number representation of Figure 9.26a facilitate the comparison of the relative sizes of two floating-point numbers? (Hint: Assume that a combinational logic network that compares the relative sizes of two 32-bit unsigned integers is available. Use this network, along with external logic gates, as necessary, to design the required network for the comparison of floating-point numbers.)

9.24 [D] In Problem 9.21(a), conversion of the simple decimal numbers into binary floating-point format is straightforward. However, if the decimal numbers are given in floating-point format, conversion is not straightforward because we cannot separately convert the mantissa and the exponent of the scale factor because 10^x = 2^y does not, in general, allow both x and y to be integers. Suppose a table of binary, floating-point numbers ti, such that ti = 10^xi for xi in the representable range, is stored in a computer. Give a procedure in general terms for

Page 406: Carl Hamacher

December 10, 2010 11:13 ham_338065_ch09 Sheet number 48 Page number 382 cyan black

382 C H A P T E R 9 • Arithmetic

converting a given decimal floating-point number into binary floating-point format. You may use both the integer and floating-point instructions available in the computer.

9.25 [D] Construct an example to show that three guard bits are needed to produce the correct answer when two positive numbers are subtracted.

9.26 [M] Derive logic expressions that specify the Add/Sub and SR outputs of the combinational CONTROL network in Figure 9.28.

9.27 [M] If gate fan-in is limited to four, how can the SHIFTER in Figure 9.28 be implemented combinationally?

9.28 [M] Sketch a logic-gate network that implements the multiplexer MUX in Figure 9.28.

9.29 [M] Relate the structure of the SWAP network in Figure 9.28 to your solution to Problem 9.28.

9.30 [M] How can the leading zeros detector in Figure 9.28 be implemented combinationally?

9.31 [M] The mantissa adder-subtractor in Figure 9.28 operates on positive, unsigned binary fractions and must produce a sign-and-magnitude result. In the discussion accompanying Figure 9.28, we state that 1’s-complement arithmetic is convenient because of the required format for input and output operands. When adding two signed numbers in 1’s-complement notation, the carry-out from the sign position must be added to the result to obtain the correct signed answer. This is called end-around carry correction. Consider the two examples in Figure P9.3, which illustrate addition using signed, 4-bit encodings of operands and answers in the 1’s-complement system.

The 1’s-complement arithmetic system is convenient when a sign-and-magnitude result is to be generated because a negative number in 1’s-complement notation can be converted to sign-and-magnitude form by complementing the bits to the right of the sign-bit position. Using 2’s-complement arithmetic, addition of +1 is needed to convert a negative value into sign-and-magnitude notation. If a carry-lookahead adder is used, it is possible to incorporate the end-around carry operation required by 1’s-complement arithmetic into the lookahead logic. With this discussion as a guide, give the complete design of the 1’s-complement adder/subtractor required in Figure 9.28.

9.32 [M] Signed binary fractions in 2’s-complement representation are discussed in Section 1.4.2.

(a) Express the decimal values 0.5, −0.123, −0.75, and −0.1 as signed 6-bit fractions. (See Section 9.8 for decimal-to-binary fraction conversion.)

  (+3)    0011            (+6)    0110
+ (−5)    1010          + (−3)    1100
  ----    ----            ----   -----
  (−2)    1101                  1 0010
                                +    1   (end-around carry)
                                 -----
                          (+3)    0011

Figure P9.3 1’s-complement addition used in Problem 9.31.


(b) What is the maximum representation error, e, involved in using only 5 significant bits after the binary point?

(c) Calculate the number of bits needed after the binary point so that the representation error e is less than 0.1, 0.01, or 0.001, respectively.

9.33 [E] Which of the four 6-bit answers to Problem 9.32(a) are not exact? For each of these cases, give the three 6-bit values that correspond to the three types of truncation defined in Section 9.7.2.

References

1. A. D. Booth, “A Signed Binary Multiplication Technique,” Quarterly Journal of Mechanics and Applied Mathematics, vol. 2, part 2, 1951, pp. 236–240.

2. C. S. Wallace, “A Suggestion for a Fast Multiplier,” IEEE Transactions on Electronic Computers, vol. EC-13, February 1964, pp. 14–17.

3. M. R. Santoro and M. A. Horowitz, “SPIM: A Pipelined 64 × 64-bit Iterative Multiplier,” IEEE Journal of Solid-State Circuits, vol. 24, no. 2, April 1989, pp. 487–493.

4. Institute of Electrical and Electronics Engineers, IEEE Standard for Floating-Point Arithmetic, IEEE Standard 754-2008, August 2008.


Chapter 10

Embedded Systems

Chapter Objectives

In this chapter you will learn about:

• Embedded applications

• Microcontrollers for embedded systems

• Sensors and actuators

• Using the C language to control I/O devices

• Design issues


386 CHAPTER 10 • Embedded Systems

In previous chapters we discussed the concepts used in general-purpose computing systems. Now we will focus our discussion on systems that are intended to serve specific applications. A physical system that employs computer control for a specific purpose, rather than for general-purpose computation, is referred to as an embedded system. We will show how the general concepts presented earlier are applied in such systems.

An important aspect of software written for embedded systems is that it has to interact closely with the hardware. The term reactive system is often used to describe the fact that the points in time at which various routines are executed are determined by events external to the processor, such as the closing of a switch or the arrival of new data at an input port. The software designer must decide how this interaction will be achieved. The input/output techniques described in Chapter 3, based on polling and interrupts, are used for this purpose.

Microprocessor control is now commonly used in cameras, cell phones, display phones, point-of-sale terminals, kitchen appliances, cars, and many toys. Low cost and high reliability are the essential requirements in these applications. Small size and low power consumption are often of key importance. All of this can be achieved by placing on a single chip not only the processor circuitry, but also some memory, input/output interfaces, timer circuits, and other features to make it easy to implement a complete computer control system using very few chips. Microprocessor chips of this type are generally referred to as microcontrollers. In this chapter we will explore the main features of microcontroller-based embedded systems. In Chapter 11 we will discuss the system-on-a-chip approach for implementing such systems using Field Programmable Gate Array (FPGA) technology.

10.1 Examples of Embedded Systems

In this section we present three examples of embedded systems to illustrate the processing and control capability needed in a typical embedded application.

10.1.1 Microwave Oven

Many household appliances use computer control to govern their operation. A typical example is a microwave oven. This appliance is based on a magnetron power unit that generates the microwaves used to heat food in a confined space. When turned on, the magnetron generates its maximum power output. Lower power levels are achieved by turning the magnetron on and off for controlled time intervals. By controlling the power level and the total heating time, it is possible to realize a variety of user-selectable cooking options.

The specification for a microwave oven may include the following cooking options:

• Manual selection of the power level and cooking time
• Manual selection of the sequence of different cooking steps
• Automatic operation, where the user specifies the type of food (for example, meat, vegetables, or popcorn) and the weight of the food; then an appropriate power level and time are calculated by the controller
• Automatic defrosting of food by specifying the weight


The oven includes a display that can show:

• Time-of-day clock
• Decrementing clock timer while cooking
• Information messages to the user

An audio alert signal, in the form of a beep tone, is used to indicate the end of a cooking operation. An exhaust fan and oven light are provided. As a safety measure, a door interlock turns the magnetron off if the door of the oven is open. All of these functions can be controlled by a microcontroller.

The input/output capability needed to communicate with the user includes:

• Input keys that comprise a 0 to 9 number pad and function keys such as Reset, Start, Stop, Power Level, Auto Defrost, Auto Cooking, Clock Set, and Fan Control

• Visual output in the form of a liquid-crystal display, similar to the seven-segment display illustrated in Figure 3.17

• A small speaker that produces the beep tone

The computational tasks executed by a microcontroller to control a microwave oven are quite simple. They include maintaining the time-of-day clock, determining the actions needed for the various cooking options, generating the control signals needed to turn on or off devices such as the magnetron and the fan, and generating display information. The program needed to implement the desired actions is quite small. It is stored in a nonvolatile read-only memory, so that it will not be lost when the power is turned off. It is also necessary to have a small RAM for use during computations and to hold the user-entered data. The most significant requirement for the microcontroller is to have sufficient I/O capability for all of the input keys, displays, and output control signals. Parallel I/O ports provide a convenient mechanism for dealing with the external input and output signals.

Figure 10.1 shows a possible organization of the microwave oven. A simple processor with small ROM and RAM units is sufficient. Basic input and output interfaces are used to connect to the rest of the system. It is possible to realize most of this circuitry on a small microcontroller chip.

10.1.2 Digital Camera

Digital cameras provide an excellent example of a sophisticated embedded system in a small package. Figure 10.2 shows the main parts of a digital camera.

Traditional cameras use film to capture images. In a digital camera, an array of optical sensors is used to capture images. These sensors convert light into electrical charge. The intensity of light determines the amount of charge that is generated. Two different types of sensors are used in commercial products. One type is known as charge-coupled devices (CCDs). It is the type of sensing device used in the earliest digital cameras. It has since been refined to give high-quality images. More recently, sensors based on CMOS technology have been developed.


[Figure: a microcontroller containing a processor, ROM, and RAM, together with an input interface (input keys, door-open signal) and an output interface (magnetron, fan, displays, light, speaker), all joined by an interconnection network.]

Figure 10.1 A block diagram of a microwave oven.

Each sensing element generates a charge that corresponds to one pixel, which is one point of a pictorial image. The number of pixels determines the quality of pictures that can be recorded and displayed. The charge is an analog quantity, which is converted into a digital representation using analog-to-digital (A/D) conversion circuits. A/D conversion produces a digital representation of the image in which the color and intensity of each pixel are represented by a number of bits. The digital form of the image can then be treated like any other data that can be manipulated using standard computer circuitry.

The processor and system controller block in Figure 10.2 includes a variety of interface circuits needed to connect to other parts of the system. The processor governs the operation of the camera. It processes the raw image data obtained from the A/D circuits to generate images represented in standard formats suitable for use in computers, printers, and display devices. The main formats used are TIFF (Tagged Image File Format) for uncompressed


[Figure: light passes through a lens (positioned by a motor) onto optical sensors, whose outputs go through A/D conversion to the processor and system controller; the controller connects to user switches, the flash unit, image storage, an LCD screen, and a computer interface with a cable to a PC.]

Figure 10.2 A simplified block diagram of a digital camera.

images and JPEG (Joint Photographic Experts Group) for compressed images. The processed images are stored in a larger image storage device. Flash memory cards, discussed in Section 8.3.5, are a popular choice for storing images.

A captured and processed image can be displayed on a liquid-crystal display (LCD) screen, which is included in the camera. This allows the user to decide whether the image is worth keeping. The number of images that can be saved depends on the size of the image storage unit. It also depends on the chosen quality of the images, namely on the number of pixels per image and on the degree of compression (for JPEG format).

A standard interface provides a mechanism for transferring the images to a computer or a printer. Typically, this is done using a USB cable. If Flash memory cards are used, images can also be transferred by physically transferring the card.

The system controller generates the signals needed to control the operation of the focusing mechanism and the flash unit. Some of the inputs come from switches activated by the user.


A digital camera requires a considerably more powerful processor than is needed for the previously discussed microwave oven application. The processor has to perform complex signal processing functions. Yet, it is essential that the processor does not consume much power because the camera is a battery-powered device. Typically, the processor consumes less power than the display and flash units of a camera.

10.1.3 Home Telemetry

Microcontrollers are used in the home in a host of embedded applications. In Section 10.1.1, we considered the microwave oven example. Similar examples can be found in other equipment, such as washers, dryers, dishwashers, cooking ranges, furnaces, and air conditioners. Another notable example is the display telephone, in which an embedded processor enables a variety of useful features. In addition to the standard telephone features, a telephone with an embedded microcontroller can be used to provide remote access to other devices in the home.

Using the telephone, one can remotely perform functions such as:

• Communicate with a computer-controlled home security system
• Set a desired temperature to be maintained by a furnace or an air conditioner
• Set the start time, the cooking time, and the temperature for food that has been placed in the oven at some earlier time
• Read the electricity, gas, and water meters, replacing the need for the utility companies to send an employee to the home to read the meters

All of this is easily implementable if each of these devices is controlled by a microcontroller. It is only necessary to provide a link, either wired or wireless, between the device microcontroller and the microprocessor in the telephone. Using signaling from a remote location to observe and control the state of equipment is often referred to as telemetry.

10.2 Microcontroller Chips for Embedded Applications

A microcontroller chip should be versatile enough to serve a wide variety of applications. Figure 10.3 shows the block diagram of a typical chip. The main part is a processor core, which may be a basic version of a commercially available microprocessor. It is prudent to choose a microprocessor architecture that has proven to be popular in practice, because for such processors there exist numerous CAD tools, good examples, and a large amount of experience and knowledge that facilitate the design of new products.

It is useful to include some memory on the chip, sufficient to satisfy the memory requirements found in small applications. Some of this memory has to be of RAM type to hold the data that change during computations. Some should be of the read-only type to hold the software, because an embedded system usually does not include a magnetic disk. To allow cost-effective use in low-volume applications, it is necessary to have a


[Figure: a processor core and internal memory (with a connection to external memory), together with parallel I/O ports, serial I/O ports, a counter/timer, and A/D and D/A conversion circuits.]

Figure 10.3 A block diagram of a microcontroller.

field-programmable type of ROM storage. Popular choices for realization of this storage are EEPROM and Flash memory.

Several I/O ports are usually provided for both parallel and serial interfaces, which allow easy implementation of standard I/O connections. In many applications, it is necessary to generate control signals at programmable time intervals. This task is achieved easily if a timer circuit is included in the microcontroller chip. Since the timer is a circuit that counts clock pulses, it can also be used for event-counting purposes, for example to count the number of pulses generated by a moving mechanical arm or a rotating shaft.

An embedded system may include some analog devices. To deal with such devices, it is necessary to be able to convert analog signals into digital representations, and vice versa. This is conveniently accomplished if the embedded controller includes A/D and D/A conversion circuits.

Many embedded processor chips are available commercially. Some of the better known examples are: Freescale’s 68HC11 and 68K/ColdFire families, Intel’s 8051 and MCS-96 families, all of which have CISC-style processor cores, and ARM microcontrollers which have a RISC-style processor. The nature of the processor core is not important to our discussion in this chapter. We will emphasize the system aspects of embedded applications to illustrate how the concepts presented in the previous chapters fit together in the design of a complete embedded computer system.

10.3 A Simple Microcontroller

The input/output structure of a microcontroller has to be flexible enough to accommodate the needs of different applications and make good use of the pins available on the chip. For example, a parallel port may be configurable as either input or output.

In this section we discuss a possible organization of a simple microcontroller to illustrate some typical features. Figure 10.4 gives its block diagram. There is a processor core and some on-chip memory. Since the on-chip memory may not be sufficient to support all potential applications, the processor bus connections are also provided on the pins of the chip so that external memory can be added.

There are two 8-bit parallel interfaces, called port A and port B, and one serial interface. The microcontroller also contains a 32-bit counter/timer circuit, which can be used to generate internal interrupts at programmed time intervals, to serve as a system timer, to count the pulses on an input line, to generate square-wave output signals, and so on.

10.3.1 Parallel I/O Interface

Embedded system applications require considerable flexibility in input/output interfaces. The nature of the devices involved and how they may be connected to the microcontroller can be appreciated by considering some components of the microwave oven shown in Figure 10.1. A sensor is needed to generate a signal with the value 1 when the door is open. This signal is sent to the microcontroller on one of the pins of an input interface. The same is true for the keys on the microwave's front panel. Each of these simple devices produces one bit of information.

[Figure 10.4  An example microcontroller: the processor core and internal memory share address, data, and control buses (also brought out for external memory), together with a parallel I/O interface (ports A and B), a serial I/O interface (Receive data and Transmit data lines), and a counter/timer (Counter_in and Timer_out lines).]

Output devices are controlled in a similar way. The magnetron is controlled by a single output line that turns it on or off. The same is true for the fan and the light. The speaker may also be connected via a single output line on which the processor sends a square-wave signal having an appropriate tone frequency. A liquid-crystal display, on the other hand, requires several bits of data to be sent in parallel.

One of the objectives of the design of input/output interfaces for a microcontroller is to reduce the need for external circuitry as much as possible. The microcontroller is likely to be connected to simple devices, many of which require only one input or output signal line. In most cases, no encoding or decoding is needed.

Each parallel port in Figure 10.4 has an associated eight-bit data direction register, which can be used to configure individual data lines as either input or output. Figure 10.5 illustrates the bidirectional control for one bit in port A. Port pin PAi is treated as an input if the data direction flip-flop contains a 0. In this case, activation of the control signal Read_Port places the logic value on the port pin onto the data line Di of the processor bus. The port pin serves as an output if the data direction flip-flop is set to 1. The value loaded into the output data flip-flop, under control of the Write_Port signal, is placed on the pin.

Figure 10.5 shows only the part of the interface that controls the direction of data transfer. In the input data path there is no flip-flop to capture and hold the value of the data signal provided by a device connected to the corresponding pin. A versatile parallel interface may include two possibilities: one where input data are read directly from the pins, and the other where the input data are stored in a register as in the interface in Figure 7.11. The choice is made by setting a bit in the control register of the interface.

[Figure 10.5  Access to one bit in port A in Figure 10.4: an output data flip-flop (loaded under control of Write_Port) and a data direction flip-flop (loaded under control of Write_DIR) connect processor data line Di to port pin PAi, with Read_Port gating the pin value back onto Di.]

Figure 10.6 depicts all registers in the parallel interface, as well as the addresses assigned to them. We have arbitrarily chosen addresses at the high end of a 32-bit address range.

The status register, PSTAT, contains the status flags. The PASIN flag is set to 1 when there are new data on port A. It is cleared to 0 when the processor accepts the data by reading the PAIN register. The PASOUT flag is set to 1 when the data in register PAOUT are accepted by the connected device, to indicate that the processor may now load new data into PAOUT. The interface uses a separate control line (described below) to signal the availability of new data to the connected device. The PASOUT flag is cleared to 0 when the processor writes data into PAOUT. The flags PBSIN and PBSOUT perform the same function for port B.

[Figure 10.6  Parallel interface registers and their addresses —
   FFFFFFF0  PAIN    (Port A input)
   FFFFFFF1  PAOUT   (Port A output)
   FFFFFFF2  PADIR   (Port A direction)
   FFFFFFF3  PBIN    (Port B input)
   FFFFFFF4  PBOUT   (Port B output)
   FFFFFFF5  PBDIR   (Port B direction)
   FFFFFFF6  PSTAT   (Status register: flags PASIN, PASOUT, PBSIN, PBSOUT and interrupt flags IAIN, IAOUT, IBIN, IBOUT)
   FFFFFFF7  PCONT   (Control register: interrupt-enable bits ENAIN, ENAOUT, ENBIN, ENBOUT and mode bits PAREG, PBREG)]

The status register also contains four interrupt flags. An interrupt flag, such as IAIN, is set to 1 when that interrupt is enabled and the corresponding I/O action occurs. The interrupt-enable bits are held in control register PCONT. An enable bit is set to 1 to enable the corresponding interrupt. For example, if ENAIN = 1 and PASIN = 1, then the interrupt flag IAIN is set to 1 and an interrupt request is raised. Thus,

IAIN = ENAIN · PASIN

A single interrupt-request signal is used for all ports in the interface. In response to an interrupt request, the processor must examine the interrupt flags to determine the actual source of the request.

The information in the status and control registers is used for controlling data transfers to and from the devices connected to ports A and B. Port A has two control lines, CAIN and CAOUT, which can be used to provide an automatic signaling mechanism between the interface and the attached device, for devices that have this capability. For an input transfer, the device places new data on the port's pins and signifies this action by activating the CAIN line for one clock cycle. When the interface circuit sees CAIN = 1, it sets the status bit PASIN to 1. Later, this bit is cleared to 0 when the processor reads the input data. This action also causes the interface to send a pulse on the CAOUT line to inform the device that it may send new data to the interface. For an output transfer, the processor writes the data into the PAOUT register. The interface responds by clearing the PASOUT bit to 0 and sending a pulse on the CAOUT line to inform the device that new data are available. When the device accepts the data, it sends a pulse on the CAIN line, which in turn sets PASOUT to 1. This signaling mechanism is operational when all data pins of a port have the same orientation, that is, when the port serves as either an input or an output port. If some pins are selected as inputs and others as outputs, then the automatic mechanism is not used, and neither the control lines nor the status and control registers contain meaningful information. In this case, the inputs are read directly from the pins.

Control register bits PAREG and PBREG are used to select the mode of operation of inputs to ports A and B, respectively. If set to 1, a register is used to store the input data; otherwise, a direct path from the pins is used, as indicated in Figure 10.5. As an example of using the direct path, consider the operation of the microwave oven depicted in Figure 10.1. The microcontroller turns the magnetron on to start the cooking operation, but it may do so only if the oven door is closed. A simple sensor switch indicates whether the door is open by providing a signal that can be read as one bit of data. The sensor is connected to a pin in a microcontroller interface, enabling the microcontroller to determine the status of the door by reading the logic value of this input directly.

10.3.2 Serial I/O Interface

The serial interface provides the UART (Universal Asynchronous Receiver/Transmitter) capability to transfer data based on the scheme described in Section 7.4.2. Double buffering is used in both the transmit and receive paths, as shown in Figure 10.7. Such buffering is needed to handle bursts in I/O transfers correctly.

[Figure 10.7  Receive and transmit structure of the serial interface: on the transmit side, data lines D7-D0 feed the Transmit buffer, which loads the Transmit shift register driving the Serial output; on the receive side, the Serial input feeds the Receive shift register, which transfers to the Receive buffer.]

Figure 10.8 shows the addressable registers of the serial interface. Input data are read from the 8-bit Receive buffer, and output data are loaded into the 8-bit Transmit buffer. The status register, SSTAT, provides information about the current status of the receive and transmit units. Bit SSTAT0 is set to 1 when there are valid data in the receive buffer; it is cleared to 0 automatically upon a read access to the receive buffer. Bit SSTAT1 is set to 1 when the transmit buffer is empty and can be loaded with new data. These bits serve the same purpose as the status flags KIN and DOUT discussed in Section 3.1. Bit SSTAT2 is set to 1 if an error occurs during the receive process. For example, an error occurs if the character in the receive buffer is overwritten by a subsequently received character before the first character is read by the processor. The status register also contains the interrupt flags. Bit SSTAT4 is set to 1 when the receive buffer becomes full and the receiver interrupt is enabled. Similarly, SSTAT5 is set to 1 when the transmit buffer becomes empty and the transmitter interrupt is enabled. The serial interface raises an interrupt if either SSTAT4 or SSTAT5 is equal to 1. It also raises an interrupt if SSTAT6 = 1, which occurs if SSTAT2 = 1 and the error condition interrupt is enabled.

The control register, SCONT, is used to hold the interrupt-enable bits. Setting bits SCONT6-4 to 1 or 0 enables or disables the corresponding interrupts, respectively. This register also indicates how the transmit clock is generated. If SCONT0 = 0, then the transmit clock is the same as the system (processor) clock. If SCONT0 = 1, then a lower-frequency transmit clock is obtained using a clock-dividing circuit.

[Figure 10.8  Serial interface registers —
   FFFFFFE0  RBUF    (Receive buffer)
   FFFFFFE1  TBUF    (Transmit buffer)
   FFFFFFE2  SSTAT   (Status register: bit 0 = receiver full, bit 1 = transmitter empty, bit 2 = error detected, bit 4 = receiver interrupt, bit 5 = transmitter interrupt, bit 6 = error interrupt)
   FFFFFFE3  SCONT   (Control register: bit 0 = 0 to use the system clock, 1 to divide the clock; bit 4 = enable receiver interrupt, bit 5 = enable transmitter interrupt, bit 6 = enable error interrupt)
   FFFFFFE4  DIV     (32-bit clock-divisor register)]

The last register in the serial interface is the clock-divisor register, DIV. This 32-bit register is associated with a counter circuit that divides down the system clock signal to generate the serial transmission clock. The counter generates a clock signal whose frequency is equal to the frequency of the system clock divided by the contents of this register. The value loaded into this register is transferred into the counter, which then counts down using the system clock. When the count reaches zero, the counter is reloaded using the value in the DIV register.

10.3.3 Counter/Timer

A 32-bit down-counter circuit is provided for use as either a counter or a timer. The basic operation of the circuit involves loading a starting value into the counter, and then decrementing the counter contents using either the internal system clock or an external clock signal. The circuit can be programmed to raise an interrupt when the counter contents reach zero. Figure 10.9 shows the registers associated with the counter/timer circuit. The counter/timer register, CNTM, can be loaded with an initial value, which is then transferred into the counter circuit. The current contents of the counter can be read by accessing memory address FFFFFFD4. The control register, CTCON, is used to specify the operating mode of the counter/timer circuit. It provides a mechanism for starting and stopping the counting process, and for enabling interrupts when the counter contents are decremented to 0. The status register, CTSTAT, reflects the state of the circuit.

[Figure 10.9  Counter/Timer registers —
   FFFFFFD0  CNTM    (Initial value, 32 bits)
   FFFFFFD4  COUNT   (Counter contents, 32 bits)
   FFFFFFD8  CTCON   (Control register: bit 0 = start, bit 1 = stop, bit 7 = 1 for timer mode, 0 for counter mode, plus an interrupt-enable bit)
   FFFFFFD9  CTSTAT  (Status register: bit 0 = 1 when the counter reaches zero)]

Counter Mode

The counter mode is selected by setting bit CTCON7 to 0. The starting value is loaded into the counter by writing it into register CNTM. The counting process begins when bit CTCON0 is set to 1 by a program instruction. Once counting starts, bit CTCON0 is automatically cleared to 0. The counter is decremented by pulses on the Counter_in line in Figure 10.4. Upon reaching 0, the counter circuit sets the status flag CTSTAT0 to 1, and raises an interrupt if the corresponding interrupt-enable bit has been set to 1. The next clock pulse causes the counter to reload the starting value, which is held in register CNTM, and counting continues. The counting process is stopped by setting bit CTCON1 to 1.

Timer Mode

The timer mode is selected by setting bit CTCON7 to 1. This mode can be used to generate periodic interrupts. It is also suitable for generating a square-wave signal on the output line Timer_out in Figure 10.4. The process starts as explained above for the counter mode. As the counter counts down, the value on the output line is held constant. Upon reaching zero, the counter is reloaded automatically with the starting value, and the output signal on the line is inverted. Thus, the period of the output signal is twice the starting counter value multiplied by the period of the controlling clock pulse. In the timer mode, the counter is decremented by the system clock.

10.3.4 Interrupt-Control Mechanism

The processor in our example microcontroller has two interrupt-request inputs, IRQ and XRQ. The IRQ input is used for interrupts raised by the I/O interfaces within the microcontroller. The XRQ input is used for interrupts raised by external devices. If the IRQ input is asserted and interrupts are enabled, the processor executes an interrupt-service routine that uses the polling method to determine the source(s) of the interrupt request. This is done by examining the flags in the status registers PSTAT, SSTAT, and CTSTAT. The XRQ interrupts have higher priority than the IRQ interrupts.

The processor status register, PSR, has two bits for enabling interrupts. The IRQ interrupts are enabled if PSR6 = 1, and the XRQ interrupts are enabled if PSR7 = 1. When the processor accepts an interrupt, it disables further interrupts at the same priority level by clearing the corresponding PSR bit before the interrupt-service routine is executed. A vectored interrupt scheme is used, with the vectors for IRQ and XRQ interrupts in memory locations 0x20 and 0x24, respectively. Each vector contains the address of the first instruction of the corresponding interrupt-service routine. This address is automatically loaded into the program counter, PC.

The processor has a Link register, LR, which is used for subroutine linkage as explained in Section 2.7. A subroutine Call instruction causes the updated contents of the program counter, which is the required return address, to be stored in LR prior to branching to the first instruction in the subroutine. There is another register, IRA, which saves the return address when an interrupt request is accepted. In this case, in addition to saving the return address in IRA, the contents of the processor status register, PSR, are saved in processor register IPSR.

Return from a subroutine is performed by a ReturnS instruction, which transfers the contents of LR into PC. Return from an interrupt is performed by a ReturnI instruction, which transfers the contents of IRA and IPSR into PC and PSR, respectively. Since there is only one IRA register and one IPSR register, nested interrupts can be implemented by saving the contents of these registers on the stack using instructions in the interrupt-service routine. Note that if the interrupt-service routine calls a subroutine, then it must save the contents of LR, because an interrupt may occur when the processor is executing another subroutine.

10.3.5 Programming Examples

Having introduced the microcontroller hardware, we will now consider some software issues that arise when the microcontroller's interfaces are used to connect to I/O devices. Programs can be written either in assembly language or in a high-level language. The latter choice is preferable in most applications because the desired code is easier to generate and maintain, and development time is shorter. We will use the C programming language in the examples in this chapter.

The examples in this section are rudimentary and are intended to illustrate the possible approaches. In Section 10.4, we give a more elaborate example of a complete application.

Example 10.1

Consider the following task. A microcontroller is used to monitor the state of some mechanical equipment. The state information is available as binary signals provided on four wires. The 16 possible values of the state are to be displayed as a hexadecimal digit on a seven-segment display of the type shown in Figure 3.17.

The desired operation can be achieved by using the parallel interface illustrated in Figures 10.4 to 10.6. Let the four input wires be connected to the pins of port A, and the seven data lines of the seven-segment display to port B. Then, the data direction registers, PADIR and PBDIR, must configure ports A and B as input and output, respectively. The input data can be read directly from the pins on port A.

Figure 10.10 gives a possible program. The define statements are used to associate the required address constants with the symbolic names of the pointers. Note that the PAIN pointer is declared as volatile. This is necessary because the program only reads the contents of the corresponding location, but it neither writes any data into it, nor associates a specific value with it. An optimizing compiler may remove program statements that appear to have no impact. This includes statements involving variables whose values never change. Since the contents of register PAIN change under influences that are external to the program, it is essential to inform the compiler of this fact. The compiler will not remove the statements that contain variables that have been declared as volatile.

    /* Define register addresses */
    #define PAIN (volatile unsigned char *) 0xFFFFFFF0
    #define PADIR (volatile unsigned char *) 0xFFFFFFF2
    #define PBOUT (volatile unsigned char *) 0xFFFFFFF4
    #define PBDIR (volatile unsigned char *) 0xFFFFFFF5
    #define PCONT (volatile unsigned char *) 0xFFFFFFF7

    /* Hex to 7-segment conversion table */
    unsigned char table[16] = { 0x40, 0x79, 0x24, 0x30, 0x19, 0x12,
                                0x02, 0x78, 0x00, 0x18, 0x08, 0x03,
                                0x46, 0x21, 0x06, 0x0E };
    unsigned int current_value;

    void main()
    {
        /* Initialize ports A and B */
        *PADIR = 0x0;     /* Configure Port A as input.        */
        *PBDIR = 0xFF;    /* Configure Port B as output.       */
        *PCONT = 0x0;     /* Read inputs directly from pins.   */

        /* Read and display data. */
        while (1)         /* Continuous loop. */
        {
            current_value = *PAIN & 0x0F;     /* Read the input from Port A.   */
            *PBOUT = table[current_value];    /* Send the character to Port B. */
        }
    }

Figure 10.10  C program for Example 10.1.

A table is used to translate a hex digit into a corresponding seven-segment pattern. The program uses a continuous loop to read and display new data. In an actual application, it is not likely that one would use a loop in this manner because other tasks would also be involved. We use the continuous loop merely to keep the example simple.

Example 10.2

Consider now the case of an I/O device that is capable of sending information in a bit-serial format that can be handled by the serial interface depicted in Figures 10.7 and 10.8. The device sends eight bits of information that are to be displayed as two hex digits on two seven-segment displays of the type in Figure 3.17. Let the I/O device be connected to the serial interface, and the seven-segment displays to parallel ports A and B. Thus, both A and B must be configured as output ports.

A possible program is presented in Figure 10.11. The polling method is used to determine when new data are available in the receive buffer of the serial interface. Bit SSTAT0 serves as the flag that is polled. As described in Section 10.3.2, this bit is cleared when data are read from the buffer.

Figure 10.12 shows how interrupts can be used in accessing new data. Recall from Section 10.3.4 that the IRQ interrupt vector is in memory location 0x20. The address of the interrupt-service routine is loaded into this location. Bit SCONT4 is set to 1 to enable receiver interrupts. To cause the processor to respond to IRQ interrupts, bit 6 in the processor status register, PSR, must be set to 1. Since PSR is not a location in the addressable space, it is necessary to use the asm directive for in-line insertion of the assembly-language instruction

MoveControl PSR, #0x40

Also, it is necessary to include the return-from-interrupt instruction in the object program to ensure a correct return to the interrupted program. The compiler will insert this instruction, because the intserv function definition includes the keyword interrupt. This manner of handling interrupts in a high-level language is explained in Section 4.6.

10.4 Reaction Timer—A Complete Example

    /* Define register addresses */
    #define RBUF (volatile unsigned char *) 0xFFFFFFE0
    #define SSTAT (volatile unsigned char *) 0xFFFFFFE2
    #define PAOUT (volatile unsigned char *) 0xFFFFFFF1
    #define PADIR (volatile unsigned char *) 0xFFFFFFF2
    #define PBOUT (volatile unsigned char *) 0xFFFFFFF4
    #define PBDIR (volatile unsigned char *) 0xFFFFFFF5

    /* Hex to 7-segment conversion table */
    unsigned char table[16] = { 0x40, 0x79, 0x24, 0x30, 0x19, 0x12,
                                0x02, 0x78, 0x00, 0x18, 0x08, 0x03,
                                0x46, 0x21, 0x06, 0x0E };
    unsigned int current_value, low_digit, high_digit;

    void main()
    {
        /* Initialize the parallel ports */
        *PADIR = 0xFF;    /* Configure Port A as output. */
        *PBDIR = 0xFF;    /* Configure Port B as output. */

        /* Read and display data */
        while (1)         /* Continuous loop. */
        {
            while ((*SSTAT & 0x1) == 0);    /* Wait for new data.      */
            current_value = *RBUF;          /* Read the 8-bit value.   */
            low_digit = current_value & 0x0F;
            high_digit = (current_value >> 4) & 0x0F;
            *PAOUT = table[low_digit];      /* Send the two digits     */
            *PBOUT = table[high_digit];     /* to 7-segment displays.  */
        }
    }

Figure 10.11  C program for Example 10.2 that uses polling to read input data.

Having introduced the basic features of the microcontroller, we will now show how it can be used in a simple embedded system that implements an easily understood task that exemplifies the term reactive system. We want to design a "reaction timer" that can be used to measure the speed of response of a person to a visual stimulus. The idea is to have the microcontroller turn on a light and then measure the reaction time that the subject takes to turn the light off by pressing a pushbutton key. Details of the system and its operation are as follows:

• There are two manual pushbutton keys, Go and Stop, a light-emitting diode (LED), and a three-digit seven-segment display.
• The system is activated by pressing the Go key.
• Upon activation, the seven-segment display is set to 000 and the LED is turned off.
• After a three-second delay, the LED is turned on and the timing process begins.
• When the Stop key is pressed, the timing process is stopped, the LED is turned off, and the elapsed time is displayed on the seven-segment display.
• The elapsed time is calculated and displayed in hundredths of a second. Since the display has only three digits, it is assumed that the elapsed time will be less than ten seconds.

    /* Define register addresses */
    #define RBUF (volatile unsigned char *) 0xFFFFFFE0
    #define SCONT (volatile unsigned char *) 0xFFFFFFE3
    #define PAOUT (volatile unsigned char *) 0xFFFFFFF1
    #define PADIR (volatile unsigned char *) 0xFFFFFFF2
    #define PBOUT (volatile unsigned char *) 0xFFFFFFF4
    #define PBDIR (volatile unsigned char *) 0xFFFFFFF5
    #define IVECT (volatile unsigned int *) 0x20

    /* Hex to 7-segment conversion table */
    unsigned char table[16] = { 0x40, 0x79, 0x24, 0x30, 0x19, 0x12,
                                0x02, 0x78, 0x00, 0x18, 0x08, 0x03,
                                0x46, 0x21, 0x06, 0x0E };
    unsigned int current_value, low_digit, high_digit;

    interrupt void intserv();

    void main()
    {
        /* Initialize the parallel ports */
        *PADIR = 0xFF;    /* Configure Port A as output. */
        *PBDIR = 0xFF;    /* Configure Port B as output. */

        /* Initialize the interrupt mechanism */
        *IVECT = (unsigned int) &intserv;   /* Set the interrupt vector.    */
        asm ("MoveControl PSR, #0x40");     /* Respond to IRQ interrupts.   */
        *SCONT = 0x10;                      /* Enable receiver interrupts.  */

        while (1);        /* Continuous loop. */
    }

    /* Interrupt-service routine */
    interrupt void intserv()
    {
        current_value = *RBUF;              /* Read the 8-bit value.   */
        low_digit = current_value & 0x0F;
        high_digit = (current_value >> 4) & 0x0F;
        *PAOUT = table[low_digit];          /* Send the two digits     */
        *PBOUT = table[high_digit];         /* to 7-segment displays.  */
    }

Figure 10.12  C program for Example 10.2 that uses interrupts to read input data.

[Figure 10.13  The reaction-timer circuit. Inside the microcontroller are the processor core, memory, and counter/timer. Port bits PA7-4, PA3-0, and PB7-4 each drive a BCD-to-7-segment decoder and its display; the Go and Stop pushbutton keys, pulled up to VDD, are connected to PB1 and PB0, respectively; and the LED is driven by PB2.]

Figure 10.13 depicts the hardware that can implement the desired reaction timer. The microcontroller provides all hardware components except for the input keys and the output displays. In contrast with the examples in the preceding section, we assume that a BCD-to-7-segment decoder circuit is associated with each seven-segment display device, so that the microcontroller needs to send only a four-bit BCD code for each digit to be displayed. Our microcontroller does not have enough parallel ports to allow sending decoded seven-segment signals to three displays.

We will use the two parallel ports, A and B, for all input/output functions. The two most-significant BCD digits of the displayed time are connected to port A, and the least-significant digit is connected to the upper four bits of port B. The keys and the LED are connected to the lowest three bits of port B. The counter/timer circuit is used to measure the elapsed time. It is driven by the system clock, which we assume to have a frequency of 100 MHz.

A program to realize the required task can be based on the following approach:

• The user's intention to begin a test is monitored by means of a wait loop that polls the state of the Go key.
• Upon observing that the Go key has been pressed, that is, having detected PB1 = 0, and after a further delay of three seconds, the LED is turned on.
• The counter is set to the initial value 0xFFFFFFFF and the process of decrementing the count on each clock pulse is started.
• A wait loop polls the state of the Stop key to detect when the user reacts by pressing it.
• When the Stop key is pressed, the LED is turned off, the counter is stopped, and the elapsed time is calculated.
• The measured delay is converted into a BCD number and sent to the seven-segment displays.

The addresses of various I/O registers in the microcontroller are as given in Figures 10.6 through 10.9. The program must configure ports A and B as required by the connections shown in Figure 10.13. All bits of port A and the high-order four bits of port B are configured as outputs. In the low-order three bits of port B, PB0 and PB1 are used as inputs, while PB2 is an output. There is no need to use the control signals available on the two ports, because the input device consists of pushbutton keys that drive the port lines directly, and the output device is a display that directly follows any changes in signals on the port pins that drive it.

We will show how the required application can be implemented using the C programming language. The program performs the following tasks. After the Go key is pressed, a delay of three seconds is implemented by using the timer. Since the counter/timer circuit is clocked at 100 MHz, the counter is initialized to the hex value 11E1A300, which corresponds to the decimal value 300,000,000. The process of counting down is started by setting the CTCON0 bit to 1. When the count reaches zero, the LED is turned on to begin the reaction time test and the counter is set to 0xFFFFFFFF. Upon detecting that the Stop key has been pressed, the counting process is stopped by setting CTCON1 = 1. The total count is computed as

Total count = 0xFFFFFFFF − Present count

Since this is the total number of clock cycles, the actual time in hundredths of seconds is

Actual time = (Total count)/1000000

This binary integer can be converted to a decimal number by first dividing it by 100 to generate the most-significant digit. The remainder is then divided by 10 to generate the next digit. The final remainder is the least-significant digit.

Figure 10.14 gives a possible program. After first configuring ports A and B as required and turning off the display and LED, the program continuously polls the value on pin PB1. After the Go key is pressed and PB1 becomes equal to 0, a three-second delay is introduced. Then, the LED is turned on, and the reaction timing process starts. Another polling operation is used to wait for the Stop key to be pressed. When this key is pressed, the LED is turned off, the counter is stopped, and its contents are read. The computation of the elapsed time and the conversion to a decimal number are performed as explained above. The resulting

/* Define register addresses */
#define PAOUT  (volatile unsigned char *) 0xFFFFFFF1
#define PADIR  (volatile unsigned char *) 0xFFFFFFF2
#define PBIN   (volatile unsigned char *) 0xFFFFFFF3
#define PBOUT  (volatile unsigned char *) 0xFFFFFFF4
#define PBDIR  (volatile unsigned char *) 0xFFFFFFF5
#define CNTM   (volatile unsigned int *)  0xFFFFFFD0
#define COUNT  (volatile unsigned int *)  0xFFFFFFD4
#define CTCONT (volatile unsigned char *) 0xFFFFFFD8
#define CTSTAT (volatile unsigned char *) 0xFFFFFFD9

void main()
{

    unsigned int counter_value, total_count;
    unsigned int actual_time, seconds, tenths, hundredths;

    /* Initialize the parallel ports */
    *PADIR = 0xFF;            /* Configure Port A. */
    *PBDIR = 0xF4;            /* Configure Port B. */
    *PAOUT = 0x0;             /* Turn off the display */
    *PBOUT = 0x4;             /* and LED. */

    /* Start the test. */
    while (1)                 /* Continuous loop. */
    {

        while ((*PBIN & 0x2) != 0);   /* Wait for the Go key to be pressed. */

        /* Wait 3 seconds and then turn LED on */
        *CNTM = 0x11E1A300;           /* Set timer value to 300,000,000. */
        *CTCONT = 0x1;                /* Start the timer. */
        while ((*CTSTAT & 0x1) == 0); /* Wait until timer reaches zero. */
        *PBOUT = 0x0;                 /* Turn the LED on. */

        /* Initialize the counting process */
        counter_value = 0;
        *CNTM = 0xFFFFFFFF;           /* Set the starting counter value. */
        *CTCONT = 0x1;                /* Start counting. */

        while ((*PBIN & 0x1) != 0);   /* Wait for the Stop key to be pressed. */

        /* The Stop key has been pressed - stop counting */
        *CTCONT = 0x2;                /* Stop the counter. */
        *PBOUT = 0x4;                 /* Turn the LED off. */
        counter_value = *COUNT;       /* Read the contents of the counter. */

Figure 10.14 C program for the reaction timer (Part a).


        /* Compute the total count */
        total_count = (0xFFFFFFFF - counter_value);

        /* Convert count to time */
        actual_time = total_count / 1000000;  /* Time in hundredths of seconds */
        seconds = actual_time / 100;
        tenths = (actual_time - seconds * 100) / 10;
        hundredths = actual_time - (seconds * 100 + tenths * 10);

        /* Display the elapsed time */
        *PAOUT = ((seconds << 4) | tenths);
        *PBOUT = ((hundredths << 4) | 0x4);   /* Keep the LED turned off */

    }
}

Figure 10.14 C program for the reaction timer (Part b).

three BCD digits are written to the port data registers according to the arrangement depicted in Figure 10.13.

10.5 Sensors and Actuators

An embedded computer interacts closely with its environment. So far, we have used switches and simple display devices to illustrate such interactions. There is a variety of other devices used in practical applications. To control a mechanical system, it is necessary to use devices that provide an interface between the mechanical domain and the digital electronics domain used in the computer. These devices enable the computer to sense, or monitor, the state of the controlled system. They also cause the actions ordered by the computer to be performed. Such devices are usually referred to as sensors and actuators. Collectively, they are referred to as transducers. In this section, we present a few examples of transducers that illustrate some of the basic principles used.

10.5.1 Sensors

Depending on the nature of the controlled system, there may be many parameters that an embedded computer needs to monitor. Consider the cruise control system in a car. Its purpose is to maintain the speed as close as possible to a desired value, whether the road is level, uphill, or downhill. It is not sufficient to simply keep the gas throttle in a fixed position. The control system has to measure the speed of the car continuously and adjust the throttle opening to maintain the speed at the desired level. Thus, a sensor is needed to measure the speed of the car and express it in a digital form suitable for use by the computer.


The computer can then determine whether the throttle opening needs to be adjusted, and send commands to an actuator to cause this to happen. Many other sensors are found in cars. They include devices that measure the levels of various fluids, including gas, oil, and engine coolant. Sensors can be used to monitor the coolant temperature, tire pressure, and battery voltage.

Some sensors are simple, such as those that detect if a person or an animal walks under a closing garage door. Others may be very sophisticated. A few examples of sensors are given below.

Position Sensors

Consider an embedded computer that controls the movement of a robot's hand. The position of the hand is determined by the angular position of the shoulder, elbow, and wrist joints. To monitor the position of the hand, the computer has to measure the angular position of each of these joints.

A simple transducer for measuring the angular position of a shaft relative to its housing is a potentiometer. It consists of a resistor wound around a circular base attached to the housing and a slider contact attached to the shaft whose position is to be monitored. As the shaft rotates, the contact point on the resistor changes. When attached to a voltage source as shown in Figure 10.15, the two parts of the resistor on either side of the contact point form a voltage divider. The output voltage of this circuit is given by

Vout = R2 Vin / (R1 + R2)

This voltage is fed to an A/D converter circuit to convert the analog value into a digital representation. The microcontroller in Figure 10.3 includes an A/D converter for use in such situations.
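As an illustration of this conversion chain, the divider equation and the mapping from an A/D output code back to a voltage can be written in C. This is our sketch, assuming an ideal n-bit converter with full-scale reference voltage vref; it is not code from this chapter.

```c
/* Voltage-divider output for the potentiometer sensor of Figure 10.15.
   Resistances are in ohms, voltages in volts. */
double divider_vout(double vin, double r1, double r2)
{
    return (r2 / (r1 + r2)) * vin;
}

/* Convert an n-bit A/D converter code back to a voltage, assuming an
   ideal converter whose full-scale reading corresponds to vref. */
double adc_to_volts(unsigned int code, unsigned int bits, double vref)
{
    return (code * vref) / (double)(1u << bits);
}
```

For example, with Vin = 5 V and R1 = R2, the divider output is 2.5 V, and a 10-bit converter would report a code near 512.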

Some sensors produce a digital output directly. For example, a coded disc may be attached to a shaft whose position is to be monitored. The code on the disc is in the form of transparent and opaque areas organized in concentric annular regions, as shown in Figure 10.16. The disc is positioned between pairs of light-emitting diodes and photodetectors, with one pair for each annular region. Each photodetector is connected to a circuit that

[Figure: potentiometer connected as a voltage divider, with Vin applied across R1 and R2 in series, Vout taken across R2 and fed to an A/D converter that produces the digital output]

Figure 10.15 A sensor using a voltage divider.


[Figure: coded disc positioned between LED and photodetector pairs, shown in side view and front view]

Figure 10.16 An optical position sensor.

produces a logic signal 1 when light is detected and 0 otherwise. Thus, as the disc rotates, the two photodetectors shown in the figure produce the 2-bit binary numbers 00, 01, 10, and 11, representing the angular position of the shaft.
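A program might interpret the photodetector outputs as follows. This sketch assumes, purely for illustration, that the two detector signals are wired to the two low-order bits of a parallel input port; the wiring and function name are ours.

```c
/* Map the 2-bit code from the two photodetectors of Figure 10.16 to the
   start of the corresponding 90-degree quadrant of shaft rotation. */
unsigned int shaft_quadrant_degrees(unsigned char port_bits)
{
    unsigned int code = port_bits & 0x3; /* keep only the two detector bits */
    return code * 90;                    /* 00->0, 01->90, 10->180, 11->270 */
}
```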

Temperature Sensors

Many types of sensors make use of changes in the properties of materials as the temperature or humidity changes, or as a result of deformation caused by stretching, compressing, or bending. The resistance of a conductor changes with temperature, and the effect is much more pronounced in some materials than others. Hence, using a voltage divider similar to that in Figure 10.15, a temperature sensor can be constructed by making resistor R2 from a material with high sensitivity to temperature changes and exposing it to the environment whose temperature is to be measured. As the temperature changes, voltage Vout changes. Thus, after converting Vout into a digital representation, the value can be read by a computer as a measure of temperature. Other types of temperature sensors use capacitors as the sensing element. The capacitance of a capacitor changes as the properties of the dielectric (the insulating material) used to construct it change. Some ceramic materials are also well suited for use in temperature sensors.

Pressure Sensors

A widely used sensor that is based on measuring changes in resistance is the strain gauge, which consists of a resistor deposited on a flexible film. The value of the resistance changes when the film is stretched.


Suppose that we wish to monitor the air pressure inside a closed container, such as a car tire. A pressure sensor can be constructed in the form of a small cylinder with a diaphragm attached to one end, and a strain gauge mounted on the diaphragm. The cylinder is inserted into the container such that the diaphragm blocks the air flow out of the container. As the container is pressurized, the diaphragm stretches, and the resistance of the strain gauge changes. This change in resistance can be measured using a circuit similar to that in Figure 10.15.

Speed Sensors

A small electrical generator attached to a rotating shaft produces a voltage that is proportional to the speed of rotation. The generator's output voltage can be converted into digital form and read by a computer as a measure of speed. Such a device is often referred to as a tachometer.

Another approach is to generate electrical pulses as the shaft rotates. For example, a toothed disc may be attached to the shaft. As the disc rotates, an appropriately placed electromagnetic or optical sensor produces an electrical pulse as each tooth passes the sensor. The speed of rotation is determined by counting the number of pulses generated during a fixed time period or by measuring the time delay between successive pulses.
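The pulse-counting variant can be sketched in C. The function below is ours, assuming that pulses have already been counted over a known gate time for a disc with a given number of teeth; counting the pulses themselves would be done with a counter/timer circuit such as the one in Figure 10.9.

```c
/* Estimate rotational speed in revolutions per minute from a pulse
   count taken over a fixed gate time, for a disc with `teeth` teeth. */
double speed_rpm(unsigned int pulse_count, unsigned int teeth,
                 double gate_time_seconds)
{
    double revolutions = (double)pulse_count / (double)teeth;
    return (revolutions / gate_time_seconds) * 60.0; /* rev/s -> rev/min */
}
```

For example, 600 pulses in one second from a 60-tooth disc correspond to 10 revolutions per second, or 600 rpm.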

10.5.2 Actuators

An actuator is a device that causes a mechanical component to move after receiving a command. Figure 10.17 illustrates a basic principle that is used to construct a variety of actuators. An electrical wire is wound to form a coil around a cylinder that contains a movable core, called an armature. This structure is known as a solenoid. The armature is made of a magnetic material such as iron. It is held by a spring in a partially inserted position in the cylinder, as shown. When electrical current flows through the coil, it produces a magnetic field that pulls the armature into the cylinder. When current stops flowing, the armature is pulled back to its rest position by the spring. The movement of the armature can be used to open a water valve, or close an electrical switch that turns on a motor. This arrangement can be used in the starter motor of a car, for example.

A combination of a motor and a sensor can be used to move an object to a desired position. When the motor is turned on, the object begins to move. The computer can then use the sensor to repeatedly check whether the desired position has been reached. In a

[Figure: solenoid showing the coil, the movable armature, and the return spring]

Figure 10.17 A solenoid actuator.


camera with an automatic focus the motor rotates the focusing mechanism. At the same time, the computer uses image analysis algorithms to check whether the image is focused properly. When the best focus is achieved, it stops the motor.

Another useful type of actuator is a stepper motor. This is a motor that rotates its shaft in response to commands in the form of pulses. Each pulse causes the motor shaft to rotate a predefined amount, such as a few degrees. A stepper motor is useful in applications requiring precise control of the position of an object, such as a robot arm.

The sensors and actuators presented above constitute the interface between the computer and the physical environment in which it is embedded. They provide the computer with the ability to monitor the state of various system components and to cause desired actions to take place. Sensors present the computer with information about the state of the system. Actuators accept commands to perform their function. The computer's task is to implement the control algorithms needed for the application. It also accepts commands from and displays information to the user.

10.5.3 Application Examples

We now consider briefly two applications to illustrate how sensors and actuators may be used.

Home Heating Control

Heating and air conditioning in the home are controlled by a device known as a thermostat. Let us focus on controlling the heating furnace. By including a microcontroller in the thermostat, it is possible to implement features such as changing the desired temperature automatically at different times of the day or on weekends. Such a thermostat includes:

• A temperature sensor
• A circuit that generates an output signal to turn the furnace on and off
• A timer to keep track of the time of day and day of the week
• Pushbutton keys for the user to enter the desired settings
• A display screen

We will only examine the temperature control mechanism in this discussion. Assume that the resistive temperature sensor described in Section 10.5.1 is used, and that its output voltage is connected to the A/D converter in the microcontroller. A solenoid is used to activate a switch that turns the furnace on or off. The control algorithm is implemented by a program in the microcontroller. Suppose that the desired temperature is set at T degrees. The actual temperature of the house is allowed to deviate from this setting by a small amount ΔT above or below the desired value. Then, the control task can be implemented by repeatedly performing the following actions:

1. Sample the temperature sensor to obtain the input voltage representing the temperature.

2. If the temperature is below T − ΔT, turn the furnace on.

3. If the temperature is above T + ΔT, turn the furnace off.
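The three actions above can be sketched as a decision function in C. The names are ours, for illustration only; sampling the sensor through the A/D converter and activating the solenoid switch are left to the surrounding program, since they depend on the particular microcontroller.

```c
/* On/off furnace control with a dead band of +/- delta degrees around
   the set point, following steps 1 to 3 in the text. The function is
   pure: it returns the new furnace state given the sampled temperature
   and the current state. */
typedef enum { FURNACE_OFF = 0, FURNACE_ON = 1 } furnace_state;

furnace_state furnace_control(double temp, double set_point,
                              double delta, furnace_state current)
{
    if (temp < set_point - delta)
        return FURNACE_ON;   /* too cold: turn the furnace on  */
    if (temp > set_point + delta)
        return FURNACE_OFF;  /* too warm: turn the furnace off */
    return current;          /* within the dead band: no change */
}
```

The dead band prevents the furnace from being switched on and off rapidly when the temperature hovers near the set point.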


The rest of the computer program deals with the tasks of using the timer to keep track of time, receiving the user's input through the pushbutton keys, and sending information to the display.

Cruise Control

The cruise control system found in many cars requires the following components:

• A sensor to measure the speed of the car's drive shaft
• An actuator to control the position of the gas throttle
• Input pushbuttons for the driver to activate the system and set the desired speed

This application differs from the home heating system in that the parameter to be controlled—the position of the throttle—varies continuously between two extremes, fully closed and fully open. The simple on/off control algorithm used in the previous example is not suitable. When controlling a varying quantity such as the position of the throttle, more sophisticated algorithms are needed, the discussion of which is beyond the scope of this text. However, the desired actions can be described briefly as follows. The computer repeatedly measures the speed of the car and compares it to the set value. The difference is an error, ε, which may be positive or negative. Depending on the value of ε, the computer makes a small adjustment to the position of the throttle. The objective is to maintain a value of ε that is as close to zero as possible. In effect, the computer mimics the decisions and actions of a driver who is operating the car manually and maintaining a constant speed.
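As a rough sketch only, one such adjustment step might take a proportional form in C. The gain value and the 0.0 (fully closed) to 1.0 (fully open) throttle range are illustrative assumptions of ours; as noted above, practical cruise controllers use more sophisticated algorithms.

```c
/* One adjustment step: move the throttle by a small amount proportional
   to the error epsilon = set_speed - measured_speed, then clamp the
   result to the valid throttle range. */
double adjust_throttle(double throttle, double set_speed,
                       double measured_speed, double gain)
{
    double error = set_speed - measured_speed; /* the error, epsilon */
    throttle += gain * error;                  /* small proportional step */
    if (throttle < 0.0) throttle = 0.0;        /* clamp: fully closed */
    if (throttle > 1.0) throttle = 1.0;        /* clamp: fully open   */
    return throttle;
}
```

Calling this function repeatedly drives ε toward zero: when the car is too slow the throttle opens a little, and when it is too fast the throttle closes a little.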

10.6 Microcontroller Families

Having presented an example of a microcontroller in Section 10.3, we now briefly survey the broad range of commercially available microcontroller chips. Many embedded applications do not require a powerful processor. Certainly, the microwave oven discussed in Section 10.1.1 does not need a powerful microcontroller, because its computational requirements are rather modest and reaction times are not demanding. For such applications it is preferable to use a chip that has a simple processor but which contains sufficient memory and I/O resources to implement all controller functions. The digital camera, discussed in Section 10.1.2, has much more demanding computational requirements, and hence requires a more powerful processor.

Processors may be characterized by how many data bits they handle in parallel when accessing data in the memory. The most powerful microcontrollers are likely to be based on a 32-bit processor, with a 32-bit-wide data bus. Such is the case with some microcontrollers based on the ARM architecture. It is also possible to have a processor that has an internal 32-bit structure, but a 16-bit-wide data bus to memory. An example is Freescale's 68K/ColdFire family of microcontrollers. Some of the most popular microcontrollers are 8-bit chips. They are much cheaper, yet powerful enough to satisfy the requirements of a large number of embedded applications. There exist even smaller 4-bit chips, which are attractive because of their simplicity and extremely low cost.


10.6.1 Microcontrollers Based on the Intel 8051

In the early 1980s, Intel Corporation introduced a microcontroller chip called the 8051. This chip has the basic architecture of Intel's 8080 microprocessor family, which used 8-bit chips for general-purpose computing applications. The 8051 chip gained rapid popularity and has become one of the most widely used chips in practice. It has four 8-bit I/O ports, a UART, and two 16-bit counter/timer circuits. It also contains 4K bytes of ROM and 128 bytes of RAM storage. The EPROM version of the chip, in which there are 4K bytes of EPROM rather than ROM, is available under the name 8751.

A number of other chips based on the 8051 architecture are available; they provide various enhancements. For example, the 8052 chip has 8K bytes of ROM and 256 bytes of RAM, as well as an extra counter/timer circuit. Its EPROM version is called 8752.

The 8051 architecture was developed by Intel. Subsequently, a number of other semiconductor manufacturers have produced chips that are either identical to those in the 8051 family, or have some enhanced features but are otherwise fully compatible with the 8051.

10.6.2 Freescale Microcontrollers

Motorola had a dominant position as a manufacturer of microprocessor chips in the 1980s. Their most popular 8-bit microprocessors became the basis for their microcontrollers. Freescale Semiconductor Inc. is Motorola's successor in this area. Freescale manufactures a wide range of microcontrollers, based on different processor cores.

68HC11 Microcontroller

Motorola's most popular 8-bit microprocessors were the 6800 and the 6809. The 68HC11 was later introduced as a microcontroller chip that implements a superset of 6800 instructions. It has five I/O ports that can be used for a variety of purposes. The I/O structure includes two serial interfaces. There are also counter/timer circuits capable of operating in a number of different modes.

The amount of memory included in 68HC11 chips ranges from an 8K-byte ROM, a 512-byte EEPROM, and a 256-byte RAM in the original chip, to a 12K-byte ROM, a 512-byte EEPROM, and a 512-byte RAM in later chips.

68K Microcontrollers

A family of microcontrollers, known as 68K, is based on the 32-bit 68000 processor core. To reduce the number of pins, the external data bus is only 16 bits wide. These chips include parallel and serial ports, counter/timer capability, and A/D conversion circuits. The amount of on-chip memory depends on the particular chip. For example, the 68376 chip has an 8K-byte EEPROM and a 4K-byte RAM.

ColdFire Microcontrollers

The 68000 instruction set architecture provides the basis for the 32-bit ColdFire processor presented in Appendix C and the MCF5xxx microcontrollers, known as the ColdFire embedded processors. Their distinguishing feature is a pipelined structure that leads to enhanced performance. The external data bus width may be 16 bits or 32 bits, depending


on the selected microcontroller chip. The ColdFire processor core is also intended for use in the system-on-a-chip environment, which is discussed in Chapter 11.

PowerPC Microcontrollers

Freescale's high-end 32-bit microprocessor family, known as PowerPC, is based on a RISC-style architecture. This processor architecture is also available in microcontroller form, in chips comprising the MPC5xx family.

10.6.3 ARM Microcontrollers

The ARM architecture, presented in Appendix D, is attractive for embedded systems where substantial computing capability is needed, and the cost and power consumption have to be relatively low. A key objective of the design of the ARM processor is to make it suitable for use in the system-on-a-chip environment. ARM microcontrollers are also available as separate chips.

There has been a progression of ARM processor cores intended for embedded applications, including ARM6, ARM7, ARM9, ARM10, and ARM Cortex. The basic ARM architecture uses a 32-bit organization and an instruction set in which all instructions are 32 bits long. There exists another version, known as Thumb, which uses 16-bit instructions and 16-bit data transfers. The Thumb version uses a subset of the ARM instructions, which are encoded to fit into a 16-bit format. It also has fewer registers than the ARM architecture. An advantage of Thumb is that a significantly smaller memory is needed to store programs that comprise highly encoded 16-bit instructions. At execution time, each Thumb instruction is expanded into a normal 32-bit ARM instruction. Thus, a Thumb-aware ARM core contains a Thumb decompressor in addition to its normal circuits.

10.7 Design Issues

The designer of an embedded system has to make many important decisions. The nature of the application or the product that has to be designed presents certain requirements and constraints. In this section, we will consider some of the most important issues that the designer faces.

Cost

The cost of electronics in many embedded applications has to be low. The least costly solution is realized if a single chip suffices for the implementation of all functions that must be provided. This is possible only if this chip provides sufficient I/O capability to meet the demands of the application and enough on-chip memory to hold the necessary programs and data.

I/O Capability

Microcontroller chips provide a variety of I/O resources, ranging from simple parallel and serial ports to counters, timers, and A/D and D/A conversion circuits. The number of


available I/O lines is important. Without sufficient I/O lines it is necessary to use external circuitry to make up the shortfall. This is illustrated by the reaction-timer example in Figure 10.13, in which external decoder circuits are used to drive the 7-segment displays from the 4-bit BCD signals provided by the microcontroller. If the microcontroller had four parallel ports, rather than two, it would be possible to connect each 7-segment display to one port. The controlling program would then directly generate the seven bit signals needed to drive the seven individual segments in each display.

Size

Microcontroller chips come in various sizes. If an application can be handled adequately with an 8-bit microcontroller, then it makes no sense to use a 16- or 32-bit chip which is likely to be more expensive, physically larger, and consume more power. The majority of practical applications can be handled with relatively small chips.

Power Consumption

Power consumption is an important consideration in all computer applications. In high-performance systems the power consumption tends to be high, requiring some mechanism to dissipate the heat generated. In many embedded applications the consumed power is low enough so that heat dissipation is not a concern. However, these applications often involve battery-powered products, so the life of the battery, which depends on the power consumption, is a key factor.

On-Chip Memory

Inclusion of memory on a microcontroller chip allows single-chip implementations of simple embedded applications. The size and the type of memory have significant ramifications. A relatively small amount of RAM may be sufficient for storing data during computations. A larger read-only memory is needed to store programs. This memory may be of the ROM, PROM, EPROM, EEPROM, or Flash type. For high-volume products, the most economical choices are microcontrollers with ROM. However, this is also the least flexible choice because the contents of the ROM are permanently set at the time the chip is manufactured. The greatest flexibility is offered by EEPROM and Flash memories, which can be programmed multiple times.

For applications that have greater storage requirements, it is necessary to use an external memory. Some microcontrollers do not have any on-chip memory. They are typically intended for more sophisticated applications where a substantial amount of memory, which cannot be realized within the microcontroller chip, is needed.

Performance

Performance is not an important consideration when microcontrollers are used in applications such as home appliances and toys. Small and inexpensive chips can be chosen in these cases. But, in applications such as digital cameras, cell phones, and some hand-held video games, it is essential to have much higher performance. High performance requires more powerful chips and results in higher cost and greater power consumption. Since the application is often battery-powered, it is also important to minimize power consumption. Various tradeoffs are required to satisfy these conflicting goals.


Software

There are many advantages to using high-level computer languages for application programs. They facilitate the process of program development and make it easy to maintain and modify the software in the future. However, there are some instances when it may be prudent or necessary to resort to assembly language. A well-crafted assembly-language program is likely to generate object code that is 10 to 20 percent more compact (in terms of the amount of storage needed) than the code produced by a compiler. If an embedded application is based on a microcontroller that has limited on-chip memory, it is a major advantage if the necessary code can fit into the memory provided on the chip, avoiding the need for external memory.

The limited capacity of the available on-chip RAM is an important consideration for a system designer. This memory is used for storing dynamic data, as a temporary buffer, and for implementing a stack. When writing an application program in a high-level language such as C, care must be taken to ensure that the total amount of code and data does not exceed the available memory.

Instruction Set

Another significant issue is the nature of the instruction set of the processor used. CISC-style instructions lead to more compact code than RISC-style instructions. Thus, the choice of the processor has an impact on the size of the code. An interesting example of an approach that addresses this issue is provided by the Thumb version of the ARM architecture, where a RISC-style instruction set designed for 32-bit processors has been modified into a more highly encoded form using 16-bit instructions, as discussed in Section 10.6.3. Programs written for Thumb versions are up to 30 percent more compact than those written for the full ARM architecture.

Development Tools

Designers of digital systems rely heavily on development tools. These tools include software packages for computer-aided design (CAD), compilers, assemblers, and simulators for the processors. The range and availability of tools often depends on the choice of the embedded processor. It is also attractive to have third-party support, where alternate sources of tools and documentation exist. Good documentation and helpful advice from the manufacturer (if needed) are extremely valuable.

Testability and Reliability

Printed circuit boards are often difficult to test, particularly if they are densely populated with chips. The testing process is greatly simplified if the entire system is designed to be easily testable. A microcontroller chip can include circuitry that makes it easier to test printed circuit boards that contain this chip. For example, some microcontrollers include a test access port that is compatible with the IEEE 1149.1 standard for a testable architecture, known as the Test Access Port and Boundary-Scan Architecture standard [1].

Embedded applications demand robustness and reliability. The life cycle of a typical product is expected to be at least five years. This is in contrast with personal computers, which tend to be considered obsolete in a shorter time.


10.8 Concluding Remarks

This chapter has provided an introduction to the design of embedded computer systems. We have not used a specific commercially-available microcontroller for this discussion because the principles presented are general and deal with the key issues that face the designer of an embedded system.

It is particularly important to appreciate the close interaction of hardware and software. The design choices may involve trade-offs between polled I/O and interrupts, between different instruction sets in terms of functionality and compactness of code, between power consumption and performance, and so on.

In this chapter we discussed the use of microcontrollers in implementing embedded systems. In the next chapter we will show how FPGA chips can be used to realize such systems. In particular, we will focus on the system-on-a-chip approach, which attempts to implement an entire system on a single chip.

Finally, let us consider the characteristics of embedded systems that make them different from general-purpose computers. The basic principles are the same. A designer of embedded systems must have a good knowledge of computer organization, which includes a thorough understanding of instruction sets, execution of programs, input/output techniques, memory structures, and interfacing schemes for dealing with a variety of devices that may be included in the system. So, what is different?

A general-purpose computer executes arbitrary application programs, including those that are used to create or modify other programs. Each program normally runs for a limited period of time, as a consequence of either the processing algorithm or control by the user. There is a need for a large main memory that is external to the processor chip to meet the potentially large requirements of some application programs. Large secondary storage devices are needed to hold files containing programs and data. A sophisticated operating system is used to control all resources in the computer system.

In contrast, an embedded system typically executes a single application program that is not likely to be modified. The program begins executing automatically when the system is turned on, and execution is continuous until the system is turned off. Memory requirements tend to be modest. It is often possible to employ a microcontroller with adequate on-chip memory. Powerful operating system software is not needed. In many cases, there is no operating system software. The single application program executes continuously and has direct access to all processor, memory, and input/output resources.

Our discussion has been focused on low-end embedded systems, in which the computing requirements are relatively modest. Such systems embody the main principles of embedded applications. But, there are also many high-end embedded systems, such as those used in aircraft and high-speed trains, which require much more powerful computing capability. They are often implemented using multiple processors. Chapter 12 deals with the issues pertinent to multiprocessor systems.


Problems

10.1 [M] In Example 10.2, we assumed that the I/O device is capable of sending 8-bit data in a bit-serial manner. Consider now a similar device that uses eight lines to send the data in parallel. Thus, one of the parallel ports in the microcontroller has to be used to receive the data. This leaves only one parallel port to display two hexadecimal digits that represent the data. Hence, only one 7-segment display device of the type shown in Figure 3.17 can be used. To convey the received information, a single device can be used to display the digits one after another. Let the most-significant digit be displayed for one second, followed by a one-second blank period, and then the second digit is displayed for one second. After displaying the second digit, the display should show a “dash” for two seconds. This sequence is to be repeated continuously, reflecting the latest data received. Use the timer in Figure 10.9, and assume that it is driven by a 100-MHz clock. Show the necessary hardware connections and write a program to implement the required task. Use polling to check the timer status.

10.2 [M] Solve Problem 10.1 by using the timer’s interrupt capability.

10.3 [M] The microcontroller described in Section 10.3 is used to control a system that includes a motor which can run at two speeds, fast and slow. A sensor associated with the motor uses a single signal line to indicate the current speed of the motor. For the fast speed, the sensor transmits a continuous square-wave signal that has a frequency of 100 kHz. For the slow speed it transmits a 50-kHz signal. It transmits no signal (i.e., a constant logic value 0) if the motor is not running. The microcontroller uses a 7-segment display of the type shown in Figure 3.17 to display the state of the motor as F, S, or 0. Show the necessary hardware connections and write a program to implement the required task. Use the timer in Figure 10.9 to generate the time intervals needed to test the frequency of the signal sent by the sensor. Assume that the timer is driven by a 100-MHz clock. Use polling to check the timer status.

10.4 [M] Solve Problem 10.3 by using the timer’s interrupt capability.

10.5 [D] The microcontroller in Section 10.3 is used to receive decimal numbers on its serial port. Each number consists of two digits encoded as ASCII characters. In order to distinguish between successive two-digit numbers, a delimiter character H is used. Thus, if two successive numbers are 43 and 28, the received sequence will be H43H28. Each number is to be displayed on two 7-segment displays connected to parallel ports A and B. The delimiter character should not be displayed. The displayed number should change only when both digits of the next number have been received. Show the connections needed to accomplish this function. Label the segments of the display unit as indicated in Figure 3.17. Write a program to perform the required task. Use the serial interface in Figure 10.8. Use polling to detect the arrival of each ASCII character.

10.6 [D] Solve Problem 10.5 by using interrupts to detect the arrival of each ASCII character.

10.7 [D] The microcontroller in Section 10.3 is used to receive decimal numbers on its serial port. Each number consists of four digits encoded as ASCII characters. In order to distinguish between successive four-digit numbers, a delimiter character H is used. Thus, if two successive numbers are 2143 and 6292, the received sequence will be H2143H6292. Each number is to be displayed on four 7-segment display units. Assume that each display unit has a BCD-to-7-segment decoder circuit associated with it, as shown in Figure P10.1. Show the necessary connections to the microcontroller. Write a program to perform the required task. Use the serial interface in Figure 10.8. Use polling to detect the arrival of each ASCII character.


Figure P10.1 A 7-segment display with a BCD decoder.

10.8 [D] Solve Problem 10.7 by using interrupts to detect the arrival of each ASCII character.

10.9 [D] Repeat Problem 10.7, but assume that each 7-segment display unit has a 7-bit register associated with it, rather than a BCD-to-7-segment decoder. The register has a control input Load, such that the seven data bits are loaded into the register when Load = 1. Each bit in the register drives one segment of the associated display unit. Figure P10.2 shows the register-display arrangement. Arrange the microcontroller output connections such that parallel port A provides the data for all four display units.


Figure P10.2 A 7-segment display with a register.


10.10 [D] Solve Problem 10.9 by using interrupts to detect the arrival of each ASCII character.

10.11 [E] Modify the reaction timer presented in Section 10.4 assuming that the tested person will always respond in less than one second. Thus, the elapsed reaction time can be displayed using only two digits representing hundredths of a second. Connect the two 7-segment display units to port A and modify the program in Figure 10.14 to implement the desired operation.

10.12 [M] In Figure 10.13, the 7-segment display unit for each digit incorporates a BCD-to-7-segment decoder; hence the microcontroller provides a 4-bit BCD code for each digit to be displayed. Suppose that instead of using the decoder, each 7-segment unit has a 7-bit register with a control input Load, such that the seven data bits are loaded into the register when Load = 1. Each bit in the register drives one segment of the associated display unit. Figure P10.2 shows the register-display arrangement. Modify the program in Figure 10.14 for use with this register-display circuit.

10.13 [M] Use the microcontroller in Section 10.3 to generate a “time of day” clock. The time (in hours and minutes) is to be displayed on four 7-segment display units. Assume that each display unit has a BCD-to-7-segment decoder associated with it, as shown in Figure P10.1. Assume also that a 100-MHz clock is used. Show the required hardware connections and write a suitable program.

10.14 [M] Repeat Problem 10.13 assuming that each 7-segment display unit has a register associated with it, as shown in Figure P10.2.

10.15 [E] In a system implemented on a single chip, the processor and the main memory reside on the same chip. Is there a need for a cache in this system? Explain.

References

1. Test Access Port and Boundary-Scan Architecture, IEEE Standard 1149.1, May 1990.


c h a p t e r

11

System-on-a-Chip—A Case Study

Chapter Objectives

In this chapter you will learn about:

• Designing a system for implementation on an FPGA chip

• Using CAD tools

• Using parameterized modules

• Typical design considerations


In Chapter 10 we discussed the use of microcontrollers in embedded systems. In an embedded application, it is desirable to use as few chips as possible. Ideally, a single chip would realize the entire system. The term System-on-a-Chip (SOC) has been used to describe this technology. In simple applications, some commercially available microcontrollers can realize all of the necessary functions. This is unlikely to be the case in more complex applications.

Designing a microcontroller for a complex embedded application and implementing it in the form of a custom chip is both challenging and expensive. It is also time-consuming. Yet, the development time for most consumer products has to be short. A chip that implements an entire system for a given application can be developed in a much shorter time if the designer has access to predesigned circuit modules that are available in a form that is easy to use. A processor circuit is one of the needed modules. Such circuits are called processor cores in the technical literature. A variety of processor cores can be obtained, through a licensing agreement, from a number of companies. Other modules can be obtained to implement memory, input/output interfaces, A/D and D/A conversion circuits, or DSP (digital signal processing) circuits. The system developer then completes the design by using the available modules and designing the rest of the required circuitry that is specific to the application.

Providers of processor cores and other modules sell designs rather than chips. They provide intellectual property (IP) that can be used by others to design their own chips. To facilitate IP-based product development, a variety of computer-aided design (CAD) tools are available.

The cost of implementing a custom-designed chip is a major factor. Fabrication of such chips is expensive and, while they offer better performance and lower power consumption, their cost can be justified only if a large number of application-specific chips are needed. An alternative possibility is to use Field-Programmable Gate Array (FPGA) technology.

11.1 FPGA Implementation

FPGAs provide an attractive platform for implementing systems on single chips. Unlike commercially available microcontroller chips, which provide a designer with a set of predefined functional units, FPGA devices allow complete freedom in the design process. They make it possible to include suitable IP modules and then build the rest of the system as desired. This can be accomplished with relative ease. Once the design is completed and tested, it can be implemented in an FPGA device immediately.

The capability of FPGAs has increased dramatically. A single FPGA chip can implement a system that comprises hundreds of thousands of logic gates. Such chips are large enough to implement the typical functionality of a microcontroller and other circuitry needed for a complex embedded application.

In this chapter we will examine the issues involved in using FPGAs in the embedded environment. To make the discussion as practical as possible, we will consider the technology provided by Altera Corporation, which is one of the major vendors of FPGA devices and supporting CAD software.


11.1.1 FPGA Devices

The basic structure of FPGAs is explained in Appendix A. FPGA devices contain a large number of logic elements and versatile wiring resources that can be used to interconnect them. They usually also contain a considerable amount of memory that can be used to implement the RAM and ROM portions of an embedded system if memory size requirements are not too large. Many FPGAs also include multiplier circuits, which are particularly useful in DSP applications.

An FPGA device has to be programmed to implement a particular design. The logic elements include programmable switches that have to be set to realize the desired logic functions. Typically, a logic element can be programmed to realize logic functions of four to six variables. The logic element also includes a flip-flop to make it possible to implement registers and finite-state machines. The interconnection wiring also contains programmable switches, which are used to interconnect the logic elements to implement a desired circuit. The programming process of setting the switches is referred to as configuring the FPGA device.

In most FPGA devices, the state of each programmable switch is set in an SRAM cell of the type discussed in Section 8.2.2. Since SRAM cells maintain their state only as long as the power supply to the FPGA is on, such FPGAs are volatile. If power is turned off, the device has to be reconfigured when power is turned on again. To configure the FPGA, the configuration information has to be loaded into the device. This is typically done by using another chip, called the configuration device, which keeps the required information in a memory of Flash type. Whenever the power is turned on, the configuration device automatically programs the FPGA.

Usually, the Flash memory in the configuration device is large enough to hold not only the configuration data for the circuits that have to be implemented in the FPGA, but also some additional information that may comprise data or code. If a processor core is implemented in the FPGA, then it is possible to store in the configuration device some code that the processor will execute.

11.1.2 Processor Choice

The key component of any system on a chip is the processor core. There exist two distinct alternatives for FPGA-based systems. One involves a processor that is defined in software and implemented in an FPGA in the same way as any other circuit. The other involves a specialized FPGA chip that has a processor core implemented on the chip at the time of manufacture.

Soft Processor Core

The most flexible solution is to provide a software module written in a hardware description language, such as Verilog or VHDL, which specifies a parameterized processor. The designer of an embedded system can set the parameters to obtain a processor with suitable features for the intended application. For example, one parameter pertains to the cache configuration, where the choices may be:

• No caches
• Instruction cache, but no data cache
• Both instruction and data caches

Another parameter may relate to the inclusion of multiplier and divider circuits in the processor. Multiplication and division operations can be implemented in hardware, but they can also be realized in software. Implementing them in hardware uses more of the FPGA resources, but improves performance considerably.

Hard Processor Core

An alternative to the soft processor core approach is to implement the processor directly as a hardware module on the chip, thus creating a specialized FPGA. This leads to a higher-performance system. The cost of such FPGAs is higher than the cost of regular FPGA devices.

11.2 Computer-Aided Design Tools

Manufacturers of FPGA devices provide powerful CAD tools that make the task of designing embedded systems relatively easy. A variety of predefined modules are provided in a parameterized form. The designer creates a system by including these modules and specifying the parameters to suit the application requirements. Examples of such modules are:

• Processor cores
• Memory modules and interfaces
• Parallel I/O interfaces
• Serial I/O interfaces
• Timer/counter circuits

These modules may be sufficient to realize all functions needed in a desired embedded system. If they are not, then additional specialized circuits have to be designed and included in the system.

Typically, a subsystem that comprises a processor core and other parameterized modules is specified first. A CAD tool is used to generate a module that implements the subsystem. This module is defined in a hardware description language. It is then instantiated in the overall design, along with any additional application-specific circuits that have been created. Finally, a different CAD tool is used to synthesize and implement the overall design in a form that can be used to configure the FPGA device.

In addition to the FPGA device, it is necessary to include the external components needed to complete the system, such as switches, displays, and additional memory chips. Such components must be connected to the appropriate pins on the FPGA. We will discuss these issues in the context of a design example in Section 11.3.


To give the reader a specific example of CAD tools and modules for FPGA devices, we will briefly consider the tools provided by the Altera Corporation. All information about its technology and tools is available on Altera’s web site [1].

11.2.1 Altera CAD Tools

The main CAD tools from Altera are known as Quartus II software. They include a full range of tools needed to design and implement a digital system in an FPGA device. One of these tools is called the SOPC Builder, which can be used to design systems that include a processor core. This tool includes a number of parameterized modules that can be used in the designed system. To illustrate their nature, we will consider four of these modules.

Nios II Processor

The Nios II processor is described in Appendix B. For implementation in FPGAs, it is provided in three versions: economy, standard, and fast. The economy version is the simplest and least expensive to implement (in terms of FPGA resources). It also has the lowest performance. It does not include any caches, it is not pipelined, and it does not use branch prediction. Better performance is achieved with the standard version, which includes an instruction cache, is pipelined, and uses static branch prediction. The best performance is achieved with the fast version, which includes both instruction and data caches, and uses dynamic branch prediction.

There are several parameters that a designer can specify, including the sizes of the instruction and data caches. Nios II processor cores are small enough to occupy only a small portion of an FPGA device. It is possible to implement as many as ten Nios II cores on a relatively small FPGA.

Memory

Memory blocks in an FPGA device can be used to implement caches and a portion of the main memory. The portion of the main memory that is realized in this way is referred to as the on-chip memory. This memory can be configured in various ways. It can be implemented as either RAM or ROM type of memory. Its size and wordlength can be specified at design time.

If the on-chip memory is not large enough to hold the software needed in an embedded application, then it is necessary to provide additional memory by using external memory chips. The SOPC Builder makes it easy to generate the controllers and interfaces needed to connect a system implemented in the FPGA to a variety of external memory components, such as SRAMs, SDRAMs, and Flash devices.

Parallel I/O Interface

A parallel interface, called PIO, is a parameterized module that can be used for both input and output purposes. Its data port can be selected at design time to serve as:

• an input port
• an output port
• a bidirectional port


Address offset (in bytes)   Register         Function
0                           Data             Input/output data (bits n−1 to 0)
4                           Direction        Direction control for each data line
8                           Interrupt-mask   Interrupt-enable control for each input line
12                          Edge-capture     Edge detection for each input line

Figure 11.1 Registers in the PIO interface.

If the bidirectional option is chosen, then the PIO data lines must be connected to FPGA pins that have tristate capability.

The processor accesses a PIO as a memory-mapped interface, and communicates with it in the manner described in Chapter 3. The PIO registers are shown in Figure 11.1. The register size n is a parameter that is specified at design time, and can be in the range from 1 to 32. The registers are used as follows:

• The Data register holds the n bits of data that are transferred between the processor and the PIO interface.

• The Direction register determines the direction of transfer (input or output) for each of the n data lines when a bidirectional port is implemented.

• The Interrupt-mask register is used to enable interrupts from the input lines connected to the PIO. Individual interrupt requests can be raised on each of the n possible input lines.

• The Edge-capture register indicates changes in logic values detected in the signals on the input lines connected to the PIO. The type of edge (rising or falling) that is detected is specified at design time.

The lines that connect the PIO to an I/O device can be configured individually. If the PIO serves only as an input port, then all n lines are configured at design time as inputs. Similarly, for an output port, all lines are configured as outputs. In these cases, the Direction register is not needed, and it is not implemented in the generated circuit. For a bidirectional port the Direction register is included; when the value of its bit k is equal to 1 (0), line k of the port functions as an output (input) to (from) the connected I/O device.

When the PIO is used as an input port, its Data register contains the logic values that currently exist on the input lines. Changes in logic values on the input lines can be detected by using the Edge-capture register. At design time, it is possible to specify that the bits in this register be set to 1 when an edge on the input signal occurs. The edge can be specified as: rising, falling, or either edge. Bits in the Edge-capture register are cleared by a program instruction that writes a zero into the register.


The Interrupt-mask register allows enabling and disabling of interrupts. Writing a 1 into bit position k of the register enables interrupts caused by the activity on input line k. The interrupts can be:

• Level sensitive, in which case an interrupt request is raised when the signal on any enabled input line has the value 1

• Edge sensitive, in which case an interrupt request is raised when any enabled bit in the Edge-capture register is equal to 1

Note that the addresses of PIO registers in Figure 11.1 are offset by four, regardless of the length n. Thus, the register addresses are word-aligned.

Interval Timer

The timer module provides a capability similar to that of the timer described in Section 3.4. Its key component is a counter whose contents are decremented by one in each clock cycle. The counter can be specified to be either 32 or 64 bits long. In our discussion, we will assume that the counter has 32 bits. The Interval Timer interface has the registers depicted in Figure 11.2. Each register is 16 bits long. In the status register, only two bits are used:

• RUN is equal to 1 when the counter is running; otherwise, it is equal to 0. This bit is not affected if the processor writes to the status register.

• TO is the timeout bit. It is set to 1 when the counter reaches 0. It remains set at 1 until the processor clears it by writing a 0 into it.

Address offset (in bytes)   Register
0                           Status register (bit 1: RUN, bit 0: TO)
4                           Control register (bit 3: STOP, bit 2: START, bit 1: CONT, bit 0: ITO)
8                           Initial count value (low-order 16 bits)
12                          Initial count value (high-order 16 bits)
16                          Counter snapshot (low-order 16 bits)
20                          Counter snapshot (high-order 16 bits)

Figure 11.2 Registers in the Interval Timer interface.


In the control register, four bits are used:

• STOP is set to 1 to stop the counter.
• START is set to 1 to cause the counter to start running.
• CONT determines the behavior of the counter when it reaches 0. If CONT = 0, the counter stops running when it reaches 0. If CONT = 1, the counter reloads the initial count value and continues to run.

• ITO enables interrupts when set to 1.

The initial count value has to be loaded in two 16-bit write operations. The counter snapshot registers are used to take a snapshot of the contents of the counter while it is running. A write operation to either snapshot register causes a snapshot to be taken, which means that the current contents of the counter are loaded into the snapshot registers. Then, these registers can be read in the usual manner.
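The write-then-read snapshot protocol can be sketched as a small C helper. The function name and the pointer-based register access are illustrative only; the pointers would address the two snapshot registers of the timer:

```c
#include <stdint.h>

/* Take a snapshot of the running 32-bit counter.  A write of any value
 * to either snapshot register latches the current counter contents,
 * after which the two 16-bit halves can be read and combined. */
static uint32_t timer_read_snapshot(volatile uint32_t *snap_lo,
                                    volatile uint32_t *snap_hi)
{
    *snap_lo = 0;                      /* any write triggers the snapshot */
    uint32_t lo = *snap_lo & 0xFFFFu;  /* low-order 16 bits  */
    uint32_t hi = *snap_hi & 0xFFFFu;  /* high-order 16 bits */
    return (hi << 16) | lo;
}
```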

In addition to the capability of using the initial count value to define a timeout period, a default timeout period can be specified at design time. The default timeout period is used if the initial count value is zero.

An interrupt request is raised when TO = 1 and ITO is set to 1. To clear this request, the processor must write a 0 into TO.

In the next section, we will use the modules presented above in a complete design example.

11.3 Alarm Clock Example

In this section we present a detailed example of an embedded system. We show how an alarm clock can be implemented using FPGA technology. The alarm clock, shown in Figure 11.3, has the following specifications:

• There are four 7-segment displays that indicate the time in hours and minutes.
• An on/off slider switch is used to activate the alarm feature.
• Two slider switches are used to enable setting of the actual time and the alarm time.
• Two pushbutton switches are used to set hours and minutes.
• An LED indicates that the alarm is set.
• Two LEDs, arranged vertically, create a colon that separates hours and minutes.
• A buzzing sound is produced when the alarm slider switch is activated and the alarm time is reached.

11.3.1 User’s View of the System

Figure 11.3 shows the alarm clock as seen by the user. The seven-segment displays, of the type depicted in Figure 3.17, indicate the time. Time is displayed in the range 00:00 to 23:59. The operation of the clock is as follows:



Figure 11.3 User’s view of the alarm clock.

• When power is turned on, both the time-of-day and the alarm time are cleared to zero.
• The time-of-day is set by activating the set-actual-time slider switch and then setting the time by pressing the Hour and Minute pushbuttons. Each time a pushbutton is pressed, the displayed time is incremented by one.
• Alarm time is set in the same manner when the set-alarm-time switch is activated.
• The alarm is turned on by activating the alarm-on/off switch. This causes the corresponding LED to be turned on.
• The speaker emits a buzzing sound when the alarm time is reached with the alarm switch activated.

11.3.2 System Definition and Generation

Our goal is to implement the alarm clock using an FPGA chip and the external components that comprise: slider switches, pushbutton switches, 7-segment displays, LEDs, and a speaker. Figure 11.4 depicts the desired system. The processor is Nios II. The on-chip memory is large enough to satisfy the needs of our application, even in very small FPGA chips. The external components are connected through PIO interfaces. There are two timers. One is designed to provide intervals of one minute, to be used in updating the time of day. The other is used to generate a square wave that produces a buzzing sound. The timers are implemented by using the Interval Timer module described in Section 11.2.1. We will assume that the slider switches generate a logic signal 1 when activated. The pushbutton switches are debounced, and they generate a logic signal 0 when pressed.

The system is implemented in the FPGA device by using the Quartus II software. The SOPC Builder is used to realize the blocks in the shaded area of Figure 11.4. The PIO blocks are configured as follows:

• PIO1 is a three-bit wide input port. Its data inputs are level sensitive.
• PIO2 is a two-bit wide input port. Its inputs are falling-edge sensitive, so that a bit in the Edge-capture register is set to 1 when the corresponding pushbutton is pressed.
• PIO3 is a 32-bit wide output port. Each byte will be connected to one 7-segment display, such that bits 0 to 6 of a byte drive one segment each, while bit 7 is not used. The low-order byte will be connected to the display for the low-order digit of minutes, and the high-order byte to the high-order digit of hours.
• PIO4 is a three-bit wide output port.
• PIO5 is a one-bit wide output port.

[Figure shows: on the FPGA chip, the processor, on-chip memory, minute timer, tone timer, and PIO1–PIO5 attached to an interconnection network; the PIOs connect to the external slider switches, pushbutton switches, 7-segment displays, LEDs, and speaker.]

Figure 11.4 Block diagram of the alarm clock.

Using the SOPC Builder, this system is specified as depicted in Figure 11.5. Note that we gave the PIOs names that are indicative of their functions. The SOPC Builder assigns addresses to the various components in the system. The on-chip memory occupies the range 0 to 0x3FFF, while the timer and PIO interfaces have addresses starting at 0x5000. The designer can specify different addresses if so desired. Note also that the timers and the pushbutton switches can raise interrupt requests (IRQ). The last column in the figure indicates the bit positions in the Nios II control registers ctl3 and ctl4 that are associated with interrupts from these sources. The SOPC Builder generates the specified system by producing a module that describes the system in either the Verilog or VHDL hardware description language.

11.3.3 Circuit Implementation

Quartus II software includes a compiler that accepts a specification of a digital system in a hardware description language. The compiler synthesizes a circuit that implements the system, and it determines how this circuit is to be implemented on the FPGA chip.


Figure 11.5 System on the FPGA chip designed using the SOPC Builder.

Since external devices have to be connected to the FPGA pins, the designer must specify the desired connections. This is referred to as pin assignment. The compiler generates the final implementation in the form of a configuration file which is used to download the configuration information into the FPGA device.

Figure 11.4 does not show all components needed to implement the complete alarm clock. Missing are the power supply, the clock signal generator, and the configuration device needed to program the FPGA. We will assume that a 100-MHz external clock signal is provided. The FPGA pins typically cannot be connected directly to external devices such as switches, seven-segment displays, and LEDs. Each device has its own electrical characteristics, which usually means that components such as resistors have to be used, as in Figure 10.13.

In our design, the pushbuttons for setting hours and minutes are connected to the corresponding PIO bits b1 and b0, respectively. The slider switches set-actual-time, set-alarm-time, and alarm-on/off are connected to the corresponding PIO bits b2, b1, and b0, respectively. The minute-timer has a timeout period of 60 seconds. The tone-timer has a timeout period of 1 ms.

11.3.4 Application Software

It is necessary to write a program that will run on the designed hardware to realize the alarm clock functions. We will write two programs, one in the C language and another in the Nios II assembly language, to illustrate how such programs may be written. The developed program has to be compiled into Nios II machine code. On power-up, this code would be loaded into the on-chip memory from the configuration device.

We use the following approach in writing the programs. The actual time and the alarm time are kept as 32-bit binary integers that represent the time in minutes. The actual time is incremented by one whenever the minute-timer reaches zero, which is indicated by the TO bit in its Status register being set to 1. When the actual time is incremented, it is necessary to check if it has reached 1440, which is the number of minutes in a day. In that case, the time must be cleared to zero to correspond to the transition from 23:59 to 00:00. A similar check is needed when setting the time by pressing the pushbuttons.
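The wrap-around check described above can be written as a small helper; a sketch (the function name is ours — the book's C program achieves the same effect with its ADJUST macro):

```c
#define MINUTES_PER_DAY 1440   /* 24 hours x 60 minutes */

/* Advance a time value kept in minutes, wrapping 23:59 -> 00:00.
 * Used both for the minute-timer tick and for the pushbuttons. */
static int increment_time(int minutes)
{
    minutes = minutes + 1;
    if (minutes == MINUTES_PER_DAY)
        minutes = 0;           /* transition from 23:59 to 00:00 */
    return minutes;
}
```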

To display the time, the four decimal digits representing hours and minutes are computed by dividing the time by 600 to get the high-order hours digit, then dividing the remainder by 60 to get the low-order hours digit, and so on. A table is used to look up the corresponding segment patterns that are sent to the 7-segment displays.

We use the tone timer to create a square-wave signal of 500 Hz, which produces a buzzing sound when connected to a speaker. The tone timer is run in the continuous mode. Since it has a predefined timeout period of 1 ms, we could directly generate the 500-Hz signal by inverting the logic value of the signal at the end of each period of the timer. However, to illustrate how a different period may be defined in an application program, we will use the initial-count register to specify the desired period. In this case, the value of 0x30D40 is needed for the 500-Hz signal if the counter is driven by a 100-MHz clock.

When polling for the timeout of a timer, it is necessary to check the TO bit in its status register. Since the RUN bit is always equal to 1 when the counter is running, the polling task can be performed by checking if the contents of the status register are equal to 3. The TO bit can be cleared to 0 simply by writing a zero into the status register, because the state of the RUN bit is not affected by a write operation.

C Program

Figure 11.6 shows a possible C-language program. Our application is not demanding from a performance point of view, so we use the polling method to access all PIOs and timers. The comments in the program explain the meaning of various statements. Observe that the macro ADJUST defines an expression that increments the time properly in different situations.

Assembly-Language Program

Figure 11.7 presents a Nios II program. To illustrate the use of interrupts, the program uses interrupts to deal with the minute timer. Polling is used for the tone timer.

At design time, the SOPC Builder assigned 0x20 to be the address at which the interrupt handler starts when an interrupt request is accepted by the processor. The interrupt handler verifies that an interrupt caused by a device external to the Nios II processor has occurred, and adjusts the return address accordingly. (See Appendix B for a detailed explanation of the Nios II interrupt mechanism.) It then clears the TO bit in the minute timer and calls the interrupt-service routine. Note that the interrupt handler saves, and later restores, the contents of registers r2 and ra. It saves the contents of ra, which is the link register that holds the return address when a subroutine is called, because a timer interrupt may occur when some subroutine in the program is being executed. The interrupt handler also uses the


#define minute_timer (volatile int *) 0x5000
#define tone_timer   (volatile int *) 0x5020
#define sliders      (volatile int *) 0x5040
#define pushbuttons  (volatile int *) 0x5050
#define display      (int *) 0x5060
#define LEDs         (int *) 0x5070
#define speaker      (int *) 0x5080
#define ADJUST(t, x) (((t) + (x) >= 1440) ? ((t) + (x) - 1440) : ((t) + (x)))

int actual_time, alarm_time, alarm_active, time;

/* Hex to 7-segment conversion table */
unsigned char table[16] = {0x40, 0x79, 0x24, 0x30, 0x19, 0x12, 0x02, 0x78,
                           0x00, 0x18, 0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0x3F};

void initializeToneTimer()
{
    *(tone_timer + 2) = 0x0D40;   /* Set the timeout period */
    *(tone_timer + 3) = 0x03;     /* for continuous operation. */
    *(tone_timer + 1) = 0x6;      /* Start in continuous mode. */
}

void DISP(int time)   /* Get 7-segment patterns for display. */
{
    *display = table[time / 600] << 24 |
               table[(time % 600) / 60] << 16 |
               table[(time % 60) / 10] << 8 |
               table[time % 10];
}

main()
{
    actual_time = alarm_time = alarm_active = 0;
    initializeToneTimer();
    *(minute_timer + 1) = 0x6;            /* Run in continuous mode. */
    while (1)
    {
        if (*minute_timer == 3)           /* One minute elapsed. */
        {
            *minute_timer = 0;            /* Clear the TO bit. */
            actual_time = ADJUST(actual_time, 1);
        }

        . . . continued in Part b

Figure 11.6 C program for the alarm clock (Part a).


        if ((*sliders & 1) != 0)          /* Check the alarm-on switch. */
        {
            *LEDs = 7;                    /* Turn on the alarm LED. */
            if (actual_time == alarm_time)
                alarm_active = 1;         /* Start the alarm sound. */
            else
                alarm_active = alarm_active & (*sliders & 1);
            if (*tone_timer == 3)         /* Generate the square wave. */
            {
                *speaker = (*speaker ^ 1) & alarm_active;
                *tone_timer = 0;          /* Clear the TO bit. */
            }
        }
        else
        {
            *LEDs = 6;                    /* Turn off the alarm LED. */
            alarm_active = 0;
        }
        if ((*sliders & 4) != 0)          /* Check the set-the-time-of-day switch. */
        {
            DISP(actual_time);            /* Display the time of day. */
            if ((*(pushbuttons + 3) & 1) != 0)        /* Set the minutes? */
                actual_time = ADJUST(actual_time, 1);
            else if ((*(pushbuttons + 3) & 2) != 0)   /* Set the hours? */
                actual_time = ADJUST(actual_time, 60);
            *(pushbuttons + 3) = 0;       /* Clear the edge-capture register. */
        }
        else if ((*sliders & 2) != 0)     /* Check the set-the-alarm-time switch. */
        {
            DISP(alarm_time);             /* Display the alarm time. */
            if ((*(pushbuttons + 3) & 1) != 0)        /* Set the minutes? */
                alarm_time = ADJUST(alarm_time, 1);
            else if ((*(pushbuttons + 3) & 2) != 0)   /* Set the hours? */
                alarm_time = ADJUST(alarm_time, 60);
            *(pushbuttons + 3) = 0;       /* Clear the edge-capture register. */
        }
        else
            DISP(actual_time);            /* Display the time of day. */
    }
}

Figure 11.6 C program for the alarm clock (Part b).


.equ minute_timer, 0x5000

.equ tone_timer, 0x5020

.equ sliders, 0x5040

.equ pushbuttons, 0x5050

.equ display, 0x5060

.equ LEDs, 0x5070

.equ speaker, 0x5080

.equ ACTUAL_TIME, 0x1000

.equ ALARM_TIME, 0x1010

.equ STACK, 0x2000

_start: br MAIN

/* Interrupt handler */
.org 0x20
        subi sp, sp, 8            /* Save registers. */
        stw r2, 0(sp)
        stw ra, 4(sp)
        rdctl et, ipending
        beq et, r0, MAIN          /* Error if not an external */
                                  /* interrupt, treat as reset. */
        subi ea, ea, 4            /* Decrement ea to execute the */
                                  /* interrupted instruction upon */
                                  /* return to the main program. */
        movia r2, minute_timer    /* Clear the TO bit in the */
        sthio r0, (r2)            /* minute-timer. */
        call UPDATE_TIME          /* Call interrupt-service routine. */
        ldw r2, 0(sp)             /* Restore registers. */
        ldw ra, 4(sp)
        addi sp, sp, 8
        eret

/* Main program */
MAIN:   movia sp, STACK           /* Set up the stack pointer. */
        movia r2, ALARM_TIME      /* Clear the alarm-time buffer. */
        stw r0, (r2)
        movia r2, ACTUAL_TIME     /* Clear the actual-time buffer. */
        stw r0, (r2)
        movia r2, sliders         /* Address of slider switches. */
        movia r3, LEDs            /* Address of LEDs. */
        movia r4, display         /* Address of 7-segment displays. */
        movia r5, pushbuttons     /* Address of pushbuttons. */

        . . . continued in Part b

Figure 11.7 Nios II program for the alarm clock (Part a).


        movi r6, 6                /* Turn on the two */
        stbio r6, (r3)            /* vertical LEDs. */
        movia r6, tone_timer
        ori r7, r0, 0x0D40        /* Set the tone-timer period. */
        sthio r7, 8(r6)
        ori r7, r0, 0x03
        sthio r7, 12(r6)
        movi r7, 6                /* Start the tone-timer. */
        sthio r7, 4(r6)
        movia r6, minute_timer    /* Address of minute-timer. */
        addi r7, r0, 7            /* Start the timer. */
        sthio r7, 4(r6)
        movi r7, 1
        wrctl ienable, r7         /* Enable timer interrupts. */
        wrctl status, r7          /* Enable external interrupts. */

LOOP:   movia r10, ACTUAL_TIME    /* Display the time of day. */
        call DISP
        ldbio r7, (r2)
        andi r11, r7, 1           /* Check if alarm switch is on. */
        beq r11, r0, NEXT
        movi r11, 7               /* If yes, then turn on the */
        stbio r11, (r3)           /* alarm LED. */
        movia r9, ALARM_TIME
        ldw r11, (r9)             /* Have to compare alarm-time */
        ldw r12, (r10)            /* with actual-time. */
        bne r11, r12, NEXT        /* Should the alarm ring? */
        movia r8, tone_timer
        movi r12, 1

RING_LOOP:
        call DISP
        ldbio r7, (r2)
        andi r13, r7, 1           /* Check if alarm switch is on. */
        beq r13, r0, NEXT
        ldhio r9, (r8)            /* Read the tone-timer status. */
        sthio r0, (r8)            /* Clear the TO bit. */
        andi r9, r9, 1            /* Check if counter reached 0. */
        xor r12, r9, r12          /* Generate the next square */
        movia r11, speaker        /* wave half-cycle; send */
        stbio r12, (r11)          /* signal to the speaker. */
        br RING_LOOP

        . . . continued in Part c

Figure 11.7 Nios II program for the alarm clock (Part b).


NEXT:   movi r11, 6               /* Turn off the alarm-on */
        stbio r11, (r3)           /* LED indicator. */
TEST_SLIDERS:
        ldbio r7, (r2)
        andi r11, r7, 2           /* Is set-alarm switch on? */
        beq r11, r0, SETACT       /* If not, test actual time. */
        movia r10, ALARM_TIME     /* Have to set the alarm time. */
        br SET_TIME
SETACT: andi r11, r7, 4           /* Is set-time switch on? */
        beq r11, r0, LOOP         /* All sliders are off. */
        movia r10, ACTUAL_TIME
SET_TIME:
        call DISP
        call SETSUB
        br TEST_SLIDERS

/* Display the time on 7-segment displays. */
DISP:   subi sp, sp, 24           /* Save registers. */
        stw r11, 0(sp)
        stw r12, 4(sp)
        stw r13, 8(sp)
        stw r14, 12(sp)
        stw r15, 16(sp)
        stw r16, 20(sp)
        ldw r11, (r10)            /* Load the time to be displayed. */
        movi r12, 600             /* To determine the first digit of */
        divu r13, r11, r12        /* hours, divide by 600. */
        ldb r15, TABLE(r13)       /* Get the 7-segment pattern. */
        slli r15, r15, 8          /* Make space for next digit. */
        mul r14, r13, r12         /* Compute the remainder of the */
        sub r11, r11, r14         /* division operation. */
        movi r12, 60              /* Divide the remainder by 60 to */
        divu r13, r11, r12        /* get the second digit of hours. */
        ldb r16, TABLE(r13)       /* Get the 7-segment pattern, */
        or r15, r15, r16          /* concatenate it to the first */
        slli r15, r15, 8          /* digit, and shift. */
        mul r14, r13, r12         /* Determine the minutes that have */
        sub r11, r11, r14         /* to be displayed. */

        . . . continued in Part d

Figure 11.7 Nios II program for the alarm clock (Part c).


        movi r12, 10              /* To determine the first digit of */
        divu r13, r11, r12        /* minutes, divide by 10. */
        ldb r16, TABLE(r13)       /* Get the 7-segment pattern, */
        or r15, r15, r16          /* concatenate it to the first */
        slli r15, r15, 8          /* two digits, and shift. */
        mul r14, r13, r12         /* Compute the remainder, which */
        sub r11, r11, r14         /* is the last digit. */
        ldb r16, TABLE(r11)       /* Concatenate the last digit to */
        or r15, r15, r16          /* the preceding 3 digits. */
        movia r11, display
        stw r15, (r11)            /* Display the obtained pattern. */
        ldw r11, 0(sp)            /* Restore registers. */
        ldw r12, 4(sp)
        ldw r13, 8(sp)
        ldw r14, 12(sp)
        ldw r15, 16(sp)
        ldw r16, 20(sp)
        addi sp, sp, 24
        ret

/* Set the desired time. */
SETSUB: subi sp, sp, 16           /* Save registers. */
        stw r11, 0(sp)
        stw r12, 4(sp)
        stw r13, 8(sp)
        stw r14, 12(sp)
        ldbio r12, 12(r5)         /* Test pushbuttons. */
        stbio r0, 12(r5)          /* Clear edge-detection register. */
        andi r13, r12, 1          /* Is minute pushbutton pressed? */
        beq r13, r0, HOURS        /* If not, check hours. */
        ldw r11, (r10)            /* Load present time. */
        movi r12, 60              /* Divide by 60 to determine */
        divu r13, r11, r12        /* the number of hours. */
        mul r14, r13, r12         /* Remainder of division operation */
        sub r11, r11, r14         /* is the number of minutes. */
        addi r11, r11, 1          /* Increment minutes. */
        blt r11, r12, SAVEM       /* Save if less than 60, */
        mov r11, r0               /* otherwise set minutes to 00. */

        . . . continued in Part e

Figure 11.7 Nios II program for the alarm clock (Part d).


SAVEM:  add r11, r14, r11         /* (hours x 60) + (updated minutes). */
        stw r11, (r10)            /* Save the new time. */
        br DONE
HOURS:  andi r13, r12, 2          /* Is hour pushbutton pressed? */
        beq r13, r0, DONE         /* If not, then return. */
        ldw r11, (r10)            /* Load present time in minutes. */
        addi r12, r11, 60         /* Add 60 minutes. */
        movi r13, 1440            /* Have to check if updated time */
        blt r12, r13, SAVEH       /* is less than 24:00. */
        sub r12, r12, r13         /* Roll over hours to 00. */
SAVEH:  stw r12, (r10)            /* Save the new time. */
DONE:   ldw r11, 0(sp)            /* Restore registers. */
        ldw r12, 4(sp)
        ldw r13, 8(sp)
        ldw r14, 12(sp)
        addi sp, sp, 16
        ret

/* Interrupt-service routine that updates actual-time */
UPDATE_TIME:
        subi sp, sp, 12           /* Save registers. */
        stw r2, 0(sp)
        stw r3, 4(sp)
        stw r4, 8(sp)
        movia r2, ACTUAL_TIME
        ldw r3, (r2)              /* Load present time of day. */
        addi r3, r3, 1            /* Increment by one minute. */
        movi r4, 1440             /* Done if updated time is */
        blt r3, r4, SAVET         /* less than 24:00. */
        mov r3, r0                /* Otherwise, set to 00:00. */
SAVET:  stw r3, (r2)              /* Save updated time. */
        ldw r2, 0(sp)             /* Restore registers. */
        ldw r3, 4(sp)
        ldw r4, 8(sp)
        addi sp, sp, 12
        ret

/* Hex-digit to 7-segment conversion table */
.org 0x1050
TABLE:  .byte 0x40, 0x79, 0x24, 0x30, 0x19, 0x12, 0x02, 0x78
        .byte 0x00, 0x18, 0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0x3F
.end

Figure 11.7 Nios II program for the alarm clock (Part e).


register ra for a call to the interrupt-service routine, UPDATE_TIME, which increments the actual time and clears the stored time to zero at midnight.

The main program starts by setting up the processor stack and clearing the memory locations that store time. As part of the initialization process, it sets the timeout period for the tone timer. It then starts both timers and enables interrupts from the minute timer.

The main loop, LOOP, checks the state of the slider switches and takes the necessary action. Subroutine DISP displays either actual time or alarm time, which are maintained in memory locations 0x1000 or 0x1010, on the seven-segment displays. Subroutine SETSUB is used in setting the values of minutes and hours when the corresponding pushbuttons are pressed. Every time a pushbutton is pressed, the value is incremented by one.

The table at the end of the program is used to convert a decimal digit into the corresponding seven-bit pattern to be displayed.

Comments in the program explain the meaning of various statements. Control registers ctl3 and ctl4 can also be referred to by the names ienable and ipending, respectively. These names are used in the program.

11.4 Concluding Remarks

The designer of an embedded system inevitably looks for the simplest and most cost-effective approach. A microcontroller chip that has the resources to implement an entire system may be the best choice. The situation is different if additional chips are needed to realize the system. Then, FPGA solutions are very attractive because they are likely to need fewer chips to implement the system.

Another consideration is the availability of predesigned modules. A microcontroller chip contains a number of different modules. Any feature that cannot be realized using these modules has to be implemented using additional chips. An FPGA device allows the designer to design any type of digital circuit. Very large and complex circuits can be implemented in modern FPGA devices.

Practical designs often involve circuits that perform commonly used tasks. Such circuits are usually available as library modules, as illustrated by the I/O interfaces and timer circuits used in the previous section. For signal-processing applications, the library includes typical filter circuits. If a system is to be connected to another computer using a standard interconnection scheme, such as PCI Express, the designer's task is much simpler if a PCI Express interface is available as a predesigned module.

Problems

11.1 [E] In Section 11.3.4, we mentioned that the 500-Hz square-wave signal can be generated simply by inverting the logic value of the signal at the end of each timeout period of the tone timer, which is designed to occur every millisecond. Modify the program in Figure 11.6 to exploit this fact.


11.2 [E] Repeat Problem 11.1 for the program in Figure 11.7.

11.3 [M] Consider the display format for the alarm clock in Section 11.3. Instead of displaying the time from 00:00 to 23:59, we wish to display it as 12:00 to 11:59 AM or PM. Assume that a fourth LED is provided and connected to bit b3 of a four-bit wide PIO instead of the three-bit wide PIO used in Section 11.3. This LED indicates PM when turned on. Modify the program in Figure 11.6 to display the time in this manner.

11.4 [M] Repeat Problem 11.3 for the program in Figure 11.7.

11.5 [E] Suppose that only one timer, with a default timeout period of 1 ms, is used in the alarm clock in Section 11.3. Make appropriate changes in the program in Figure 11.6 to provide the required behavior.

11.6 [E] Repeat Problem 11.5 for the program in Figure 11.7.

11.7 [E] In Figure 11.7 we used interrupts to deal with the minute timer and polling for the tone timer. Modify the program to use interrupts for both timers.

11.8 [M] In the alarm clock in Section 11.3, the time is set by incrementing the current value each time a pushbutton is pressed. This could be tedious if a pushbutton has to be pressed many times. A better alternative would be to increment the time automatically every 0.5 seconds while a pushbutton is being pressed. What changes, if any, should be made in the hardware to provide this feature? Modify the program in Figure 11.6 to implement the feature.

11.9 [M] Repeat Problem 11.8 for the program in Figure 11.7.

11.10 [M] It is desired to make the two vertically-arranged LEDs in the alarm clock flash in one-second intervals when power is first turned on. They should stop flashing as soon as the user begins to set the time. Modify the program in Figure 11.6 to cause this behavior.

11.11 [M] Repeat Problem 11.10 for the program in Figure 11.7.

References

1. Altera’s web site at www.altera.com


c h a p t e r  12

Parallel Processing and Performance

Chapter Objectives

In this chapter you will learn about:

• Multithreading

• Vector processing in general-purpose and graphics processors

• Shared-memory multiprocessors

• Cache coherence for shared data

• Message passing in distributed-memory systems

• Parallel programming

• Mathematical modeling of performance


In Chapter 6, instruction pipelining and superscalar operation are described as processor design techniques aimed at increasing the rate of instruction execution, thereby reducing execution time. In Chapter 8, caches are presented as a way to reduce the average latency in accessing instructions and data.

This chapter describes three additional techniques for improving performance, namely multithreading, vector processing, and multiprocessing. These techniques are implemented in the multicore processor chips that are used in general-purpose computers. They increase performance by improving the utilization of processing resources and by performing more operations in parallel.

12.1 Hardware Multithreading

Operating system (OS) software enables multitasking of different programs in the same processor by performing context switches among programs. As explained in Section 4.9, a program, together with any information that describes its current state of execution, is regarded by the OS as an entity called a process. Information about the memory and other resources allocated by the OS is maintained with each process. Processes may be associated with applications such as Web-browsing, word-processing, and music-playing programs that a user has opened in a computer.

Each process has a corresponding thread, which is an independent path of execution within a program. More precisely, the term thread is used to refer to a thread of control whose state consists of the contents of the program counter and other processor registers. As we will discuss in Section 12.6, it is possible for multiple threads to execute portions of one program and run in parallel as if they correspond to separate programs. Two or more threads can be running on different processors, executing either the same part of a program on different data, or executing different parts of a program. Threads for different programs can also execute on different processors. All threads that are part of a single program run in the same address space and are associated with the same process.

In this section, we focus on multitasking where two or more programs run on the same processor and each program has a single thread. Section 4.9 describes the technique of time slicing, where the OS selects a process among those that are not presently blocked and allows this process to run for a short period of time. Only the thread corresponding to the selected process is active during the time slice. Context switching at the end of the time slice causes the OS to select a different process, whose corresponding thread becomes active during the next time slice. A timer interrupt invokes an interrupt-service routine in the OS to switch from one process to another.

To deal with multiple threads efficiently, a processor is implemented with several identical sets of registers, including multiple program counters. Each set of registers can be dedicated to a different thread. Thus, no time is wasted during a context switch to save and restore register contents. The processor is said to be using a technique called hardware multithreading.

With multiple sets of registers, context switching is simple and fast. All that is necessary is to change a hardware pointer in the processor to use a different set of registers to fetch and execute subsequent instructions. Switching to a different thread can be completed within one clock cycle. The state of the previously active thread is preserved in its own set of registers.

Switching to a different thread may be triggered at any time by the occurrence of a specific event, rather than at the end of a fixed time interval. For example, a cache miss may occur when a Load or Store instruction is being executed for the active thread. Instead of stalling while the slower main memory is accessed to service the cache miss, a processor can quickly switch to a different thread and continue to fetch and execute other instructions. This is called coarse-grained multithreading because many instructions may be executed for one thread before an event such as a cache miss causes a switch to another thread.

An alternative to switching between threads on specific events is to switch after every instruction is fetched. This is called fine-grained or interleaved multithreading. The intent is to increase the processor throughput. Each new instruction is independent of its predecessors from other threads. This should reduce the occurrence of stalls due to data dependencies. Thus, throughput may be increased by interleaving instructions from many threads, but it takes longer for a given thread to complete all of its instructions. A form of interleaved multithreading with only two threads is used in processors that implement the Intel IA-32 architecture described in Appendix E.

12.2 Vector (SIMD) Processing

Many computationally demanding applications involve programs that use loops to perform operations on vectors of data, where a vector is an array of elements such as integers or floating-point numbers. When a processor executes the instructions in such a loop, the operations are performed one at a time on individual vector elements. As a result, many instructions need to be executed to process all vector elements.

A processor can be enhanced with multiple ALUs. In such a processor, it is possible to operate on multiple data elements in parallel using a single instruction. Such instructions are called single-instruction multiple-data (SIMD) instructions. They are also called vector instructions. These instructions can only be used when the operations performed in parallel are independent. This is known as data parallelism.

The data for vector instructions are held in vector registers, each of which can hold several data elements. The number of elements, L, in each vector register is called the vector length. It determines the number of operations that can be performed in parallel on multiple ALUs. If vector instructions are provided for different sizes of data elements using the same vector registers, L may vary. For example, the Intel IA-32 architecture has 128-bit vector registers that are used by instructions for vector lengths ranging from L = 2 up to L = 16, corresponding to integer data elements with sizes ranging from 64 bits down to 8 bits.

Some typical examples of vector instructions are given below to illustrate how vector registers are used. We assume that the OP-code mnemonic includes a suffix S which specifies the size of each data element. This determines the number of elements, L, in a vector. For instructions that access the memory, the contents of a conventional register are used in the calculation of the effective address.


The vector instruction

VectorAdd.S Vi, Vj, Vk

computes L sums using the elements in vector registers Vj and Vk, and places the resulting sums in vector register Vi. Similar instructions are used to perform other arithmetic operations.

Special instructions are needed to transfer multiple data elements between a vector register and the memory. The instruction

VectorLoad.S Vi, X(Rj)

causes L consecutive elements beginning at memory location X + [Rj] to be loaded into vector register Vi. Similarly, the instruction

VectorStore.S Vi, X(Rj)

causes the contents of vector register Vi to be stored as L consecutive elements in the memory.

Vectorization

In a source program written in a high-level language, loops that operate on arrays of integers or floating-point numbers are vectorizable if the operations performed in each pass are independent of the other passes. Using vector instructions reduces the number of instructions that need to be executed and enables the operations to be performed in parallel on multiple ALUs. A vectorizing compiler can recognize such loops, if they are not too complex, and generate vector instructions.

Vectorization of a loop is illustrated with the simple example of C-language code shown in Figure 12.1a. Assume that the starting locations in memory for arrays A, B, and C are in registers R2, R3, and R4. Using conventional assembly-language instructions, the compiler may generate the loop shown in Figure 12.1b. The loop body has nine instructions, hence a total of 9N instructions are executed for N passes through the loop.

To vectorize the loop, the compiler must recognize that the computation in each pass through the loop is independent of the other passes, and that the same operation can be performed simultaneously on multiple elements. Assume for simplicity that the number of passes, N, is evenly divisible by the vector length, L. The Load, Add, and Store instructions at the beginning of the loop are replaced by corresponding vector instructions that operate on L elements at a time. Consequently, the vectorized loop requires only N/L passes to process all of the data in the arrays. With L elements processed in each pass through the loop, the address pointers in registers R2, R3, and R4 are incremented by 4L, and the count in register R5 is decremented by L. The vectorized loop is shown in Figure 12.1c. We assume that the assembler calculates the value of 4L for the expression given as the immediate operand in the Add instructions. There are still nine instructions in the loop body, but because the number of passes is now N/L, the total number of instructions that need to be executed is only 9N/L.

Vectorizable loops exist in programs for applications such as computer graphics and digital signal processing. Such loops have many independent computations that can be performed simultaneously on several data elements. If a large portion of the execution time


for (i = 0; i < N; i++)
    A[i] = B[i] + C[i];

(a) A C-language loop to add vector elements

       Move       R5, #N          R5 is the loop counter.
LOOP:  Load       R6, (R3)        R3 points to an element in array B.
       Load       R7, (R4)        R4 points to an element in array C.
       Add        R6, R6, R7      Add a pair of elements from the arrays.
       Store      R6, (R2)        R2 points to an element in array A.
       Add        R2, R2, #4      Increment the three array pointers.
       Add        R3, R3, #4
       Add        R4, R4, #4
       Subtract   R5, R5, #1      Decrement the loop counter.
       Branch_if_[R5]>0 LOOP      Repeat the loop if not finished.

(b) Assembly-language instructions for the loop

       Move          R5, #N        R5 counts the number of elements to process.
LOOP:  VectorLoad.S  V0, (R3)      Load L elements from array B.
       VectorLoad.S  V1, (R4)      Load L elements from array C.
       VectorAdd.S   V0, V0, V1    Add L pairs of elements from the arrays.
       VectorStore.S V0, (R2)      Store L elements to array A.
       Add           R2, R2, #4*L  Increment the array pointers by L words.
       Add           R3, R3, #4*L
       Add           R4, R4, #4*L
       Subtract      R5, R5, #L    Decrement the loop counter by L.
       Branch_if_[R5]>0 LOOP       Repeat the loop if not finished.

(c) Vectorized form of the loop

Figure 12.1 Example of loop vectorization.

for an application is spent executing loops of this type, then vectorization of these loops can significantly reduce the total execution time. The extent of the performance improvement is limited by the vector length L, which determines the number of ALUs that can operate in parallel. To obtain higher performance, support for vector (SIMD) processing can be implemented in another way, as the next section describes.


12.2.1 Graphics Processing Units (GPUs)

The increasing demands of processing for computer graphics have led to the development of specialized chips called graphics processing units (GPUs). The primary purpose of GPUs is to accelerate the large number of floating-point calculations needed in high-resolution three-dimensional graphics, such as in video games. Since the operations involved in these calculations are often independent, a large GPU chip contains hundreds of simple cores with floating-point ALUs to perform them in parallel.

A GPU chip and a dedicated memory for it are included on a video card. Such a card is plugged into an expansion slot of a host computer using an interconnection standard such as the PCIe standard discussed in Chapter 7. A small program is written for the processing cores in the GPU chip. A large number of cores execute this program in parallel. The cores execute the same instructions, but operate on different data elements. A separate controlling program runs in the general-purpose processor of the host computer and invokes the GPU program when necessary. Before initiating the GPU computation, the program in the host computer must first transfer the data needed by the GPU program from the main memory into the dedicated GPU memory. After the computation is completed, the resulting output data in the dedicated memory are transferred back to the main memory.

The processing cores in a GPU chip have a specialized instruction set and hardware architecture, which are different from those used in a general-purpose processor. An example is the Compute Unified Device Architecture (CUDA) that NVIDIA Corporation uses for the cores in its GPU chips. To facilitate writing programs that involve a general-purpose processor and a GPU, an extension to the C programming language, called CUDA C, has been developed by NVIDIA [1, 2]. This extension enables a single program to be written in C, with special keywords used to label the functions executed by the processing cores in a GPU chip. The compiler and related software tools automatically partition the final object program into the portions that are translated into machine instructions for the host computer and the GPU chip. Library routines are provided to allocate storage in the dedicated memory of a GPU-based video card and to transfer data between the main memory and the dedicated memory. An open standard called OpenCL has also been proposed by industry as a programming framework for systems that include GPU chips from any vendor [3].

12.3 Shared-Memory Multiprocessors

A multiprocessor system consists of a number of processors capable of simultaneously executing independent tasks. The granularity of these tasks can vary considerably. A task may encompass a few instructions for one pass through a loop, or thousands of instructions executed in a subroutine.

In a shared-memory multiprocessor, all processors have access to the same memory. Tasks running in different processors can access shared variables in the memory using the same addresses. The size of the shared memory is likely to be large. Implementing a large memory in a single module would create a bottleneck when many processors make requests to access the memory simultaneously. This problem is alleviated by distributing the memory

Page 473: Carl Hamacher

December 10, 2010 10:46 ham_338065_ch12 Sheet number 7 Page number 449 cyan black

12.3 Shared-Memory Multiprocessors 449

across multiple modules so that simultaneous requests from different processors are more likely to access different memory modules, depending on the addresses of those requests.

An interconnection network enables any processor to access any module that is a part of the shared memory. When memory modules are kept physically separate from the processors, all requests to access memory must pass through the network, which introduces latency. Figure 12.2 shows such an arrangement. A system that has the same network latency for all accesses from the processors to the memory modules is called a Uniform Memory Access (UMA) multiprocessor. Although the latency is uniform, it may be large for a network that connects many processors and memory modules.

For better performance, it is desirable to place a memory module close to each processor. The result is a collection of nodes, each consisting of a processor and a memory module. The nodes are then connected to the network, as shown in Figure 12.3. The network latency is avoided when a processor makes a request to access its local memory. However, a request to access a remote memory module must pass through the network. Because of the difference in latencies for accessing local and remote portions of the shared memory, systems of this type are called Non-Uniform Memory Access (NUMA) multiprocessors.

Figure 12.2 A UMA multiprocessor: processors P1, P2, …, Pn connected through an interconnection network to memory modules M1, M2, …, Mk.

Figure 12.3 A NUMA multiprocessor: nodes (P1, M1), (P2, M2), …, (Pn, Mn), each pairing a processor with a local memory module, connected by an interconnection network.


12.3.1 Interconnection Networks

The interconnection network must allow information transfer between any pair of nodes in the system. The network may also be used to broadcast information from one node to many other nodes. The traffic in the network consists of requests (such as read and write) and data transfers.

The suitability of a particular network is judged in terms of cost, bandwidth, effective throughput, and ease of implementation. The term bandwidth refers to the capacity of a transmission link to transfer data and is expressed in bits or bytes per second. The effective throughput is the actual rate of data transfer. This rate is less than the available bandwidth because a given link must also carry control information that coordinates the transfer of data.

Information transfer through the network usually takes place in the form of packets of fixed length and specified format. For example, a read request is likely to be a single packet sent from a processor to a memory module. The packet contains the node identifiers for the source and destination, the address of the location to be read, and a command field that indicates what type of read operation is required. A write request that writes one word in a memory module is also likely to be a single packet that includes the data to be written. On the other hand, a read response may involve an entire cache block requiring several packets for the data transfer.

Ideally, a complete packet would be handled in parallel in one clock cycle at any node or switch in the network. This implies having wide links, comprising many wires. However, to reduce cost and complexity, the links are often considerably narrower. In such cases, a packet must be divided into smaller pieces, each of which can be transmitted in one clock cycle.

The following sections describe a few of the interconnection networks that are commonly used in multiprocessors.

Bus

A bus is a set of lines (wires) that provide a single shared path for information transfer, as discussed in Chapter 7. Buses are most commonly used in UMA multiprocessors to connect a number of processors to several shared-memory modules. Arbitration is necessary to ensure that only one of many possible requesters is granted use of the bus at any time. The bus is suitable for a relatively small number of processors because of the contention for access to the bus and the increased propagation delays caused by electrical loading when many processors are connected.

A simple bus does not allow a new request to appear on the bus until the response for the current request has been provided. However, if the response latency is high, there may be considerable idle time on the bus.

Higher performance can be achieved by using a split-transaction bus, in which a request and its corresponding response are treated as separate events. Other transfers may take place between them. Consider a situation where multiple processors need to make read requests to the memory. Arbitration is used to select the first processor to be granted use of the bus for its request. After the request is made, a second processor is selected to make its request, instead of leaving the bus idle. Assuming that this request is to a different memory module, the two read accesses proceed in parallel. If neither module has finished with its access, a third processor is selected to make its request, and so on. Eventually, one memory


module completes its read access. It is granted the use of the bus to transfer the data to the requesting processor. As other modules complete their accesses, the bus is used to transfer their responses. The actual length of time between each request and its corresponding response may vary as requests and responses for different transactions with the memory are interleaved on the bus to make efficient use of the available bandwidth.

The split-transaction bus requires a more complex bus protocol. The main source of complexity is the need to match each response with its corresponding request. This is usually handled by associating a unique tag with each request that appears on the bus. Each response then appears with the appropriate tag so that the source can match it to its original request.

Ring

A ring network is formed with point-to-point connections between nodes, as shown in Figure 12.4. A single ring is shown in Figure 12.4a. A long single ring results in high average latency for communication between any two nodes. This high latency can be mitigated in two different ways.

A second ring can be added to connect the nodes in the opposite direction. The resulting bidirectional ring halves the average latency and doubles the bandwidth. However, handling of communications is more complex.

Another approach is to use a hierarchy of rings. A two-level hierarchy is shown in Figure 12.4b. The upper-level ring connects the lower-level rings. The average latency for communication between any two nodes on lower-level rings is reduced with this arrangement. Transfers between nodes on the same lower-level ring need not traverse the upper-level ring. Transfers between nodes on different lower-level rings include a traversal on part of the upper-level ring. The drawback of the hierarchical scheme is that the upper-level ring may become a bottleneck when many nodes on different lower-level rings communicate with each other frequently.

Figure 12.4 Ring-based interconnection networks: (a) a single ring; (b) a hierarchy of rings, in which an upper ring connects several lower rings.


Figure 12.5 Crossbar interconnection network connecting processors P1, P2, …, Pn to memory modules M1, M2, …, Mk through a grid of switches.

Crossbar

A crossbar is a network that provides a direct link between any pair of units connected to the network. It is typically used in UMA multiprocessors to connect processors to memory modules. It enables many simultaneous transfers if the same destination is not the target of multiple requests. For example, we can implement the structure in Figure 12.2 using a crossbar that comprises a collection of switches as in Figure 12.5. For n processors and k memories, n × k switches are needed.

Mesh

A natural way of connecting a large number of nodes is with a two-dimensional mesh, as shown in Figure 12.6. Each internal node of the mesh has four connections, one to each of its horizontal and vertical neighbors. Nodes on the boundaries and corners of the mesh

Figure 12.6 A two-dimensional mesh network.


have fewer neighbors and hence fewer connections. To reduce latency for communication between nodes that would otherwise be far apart in the mesh, wraparound connections may be introduced between nodes at opposite boundaries of the mesh. A network with such connections is called a torus. All nodes in a torus have four connections. Average latency is reduced, but the implementation complexity for routing requests and responses through a torus is somewhat higher than in the case of a simple mesh.

12.4 Cache Coherence

A shared-memory multiprocessor is easy to program. Each variable in a program has a unique address location in the memory, which can be accessed by any processor. However, each processor has its own cache. Therefore, it is necessary to deal with the possibility that copies of shared data may reside in several caches. When any processor writes to a shared variable in its own cache, all other caches that contain a copy of that variable will then have the old, incorrect value. They must be informed of the change so that they can either update their copy to the new value or invalidate it. This is the issue of maintaining cache coherence, which requires having a consistent view of shared data in multiple caches.

In Chapter 8 we discussed two basic approaches for performing write operations on data in a cache. The write-through approach changes the data in both the cache and the main memory. The write-back approach changes the data only in the cache; the main memory copy is updated when a modified data block in the cache has to be replaced. Similar approaches can be used to address cache coherence in a multiprocessor system.

12.4.1 Write-Through Protocol

A write-through protocol can be implemented in one of two ways. One version is based on updating the values in other caches, while the second relies on invalidating the copies in other caches.

Let us consider the update protocol first. When a processor writes a new value to a block of data in its cache, the new value is also written into the memory module containing the block being modified. Since copies of this block may exist in other caches, these copies must be updated to reflect the change caused by the Write operation. The simplest way of doing this is to broadcast the written data to the caches of all processors in the system. As each processor receives the broadcast data, it updates the contents of the affected cache block if this block is present in its cache.

The second version of the write-through protocol is based on invalidation of copies. When a processor writes a new value into its cache, this value is also sent to the appropriate location in memory, and all copies in other caches are invalidated. Again, broadcasting can be used to send the invalidation requests throughout the system.


12.4.2 Write-Back Protocol

Maintaining coherence with the write-back protocol is based on the concept of ownership of a block of data in the memory. Initially, the memory is the owner of all blocks, and the memory retains ownership of any block that is read by a processor to place a copy in its cache.

If some processor wants to write to a block in its cache, it must first become the exclusive owner of this block. To do so, all copies in other caches must first be invalidated with a broadcast request. The new owner of the block may then modify the contents at will without having to take any other action.

When another processor wishes to read a block that has been modified, the request for the block must be forwarded to the current owner. The data are then sent to the requesting processor by the current owner. The data are also sent to the appropriate memory module, which reacquires ownership and updates the contents of the block in the memory. The cache of the processor that was the previous owner retains a copy of the block. Hence, the block is now shared with copies in two caches and the memory. Subsequent requests from other processors to read the same block are serviced by the memory module containing the block.

When another processor wishes to write to a block that has been modified, the current owner sends the data to the requesting processor. It also transfers ownership of the block to the requesting processor and invalidates its cached copy. Since the block is being modified by the new owner, the contents of the block in the memory are not updated. The next request for the same block is serviced by the new owner.

The write-back protocol has the advantage of creating less traffic than the write-through protocol. This is because a processor is likely to perform several writes to a cache block before this block is needed by another processor. With the write-back protocol, these writes are performed only in the cache, once ownership is acquired with an invalidation request. With the write-through protocol, each write must also be performed in the appropriate memory module and broadcast to other caches.

So far, we have assumed that update and invalidate requests in these protocols are broadcast through the interconnection network. Whether it is easy to implement such broadcasts depends largely on the structure of the interconnection network. The most natural network for supporting broadcasting is the single bus. In multiprocessors that connect a modest number of processors to the memory modules using a single bus, cache coherence can be realized using a scheme known as snooping.

12.4.3 Snoopy Caches

In a single-bus system, all transactions between processors and memory modules occur via requests and responses on the bus. In effect, they are broadcast to all units connected to the bus. Suppose that each processor cache has a controller circuit that observes, or snoops, all transactions on the bus. We now describe some scenarios for the write-back protocol and how cache coherence is enforced.


Consider a processor that has previously read a copy of a block from the memory into its cache. Before writing to this block for the first time, the processor must broadcast an invalidation request to all other caches, whose controllers accept the request and invalidate any copies of the same block. This action causes the requesting processor to become the new owner of the block. The processor may then write to the block and mark it as being modified. No further broadcasts are needed from the same processor to write to the modified block in its cache.

Now, if another processor broadcasts a read request on the bus for the same block, the memory must not respond because it is not the current owner of the block. The processor owning the requested block snoops the read request on the bus. Because it holds a modified copy of the requested block in its cache, it asserts a special signal on the bus to prevent the memory from responding. The owner then broadcasts a copy of the block on the bus, and marks its copy as clean (unmodified). The data response on the bus is accepted by the cache of the processor that issued the read request. The data response is also accepted by the memory to update its copy of the block. In this case, the memory reacquires ownership of the block, and the block is said to be in a shared state because copies of it are in the caches of two processors. Coherence is maintained because the two cached copies and the copy of the block in the memory contain the same data. Subsequent requests from any processor are serviced by the memory.

Consider now the situation in which two processors have copies of the same block in their respective caches, and both processors attempt to write to the same cache block at the same time. Since the block is in the shared state, the memory is the owner of the block. Hence, both processors request the use of the bus to broadcast an invalidation message. One of the processors is granted the use of the bus first. That processor broadcasts its invalidation request and becomes the new owner of the block. Through snooping, the copy of the block in the cache of the other processor is invalidated. When the other processor is later granted the use of the bus, it broadcasts a read-exclusive request. This request combines a read request and an invalidation request for the same block. The controller for the first processor snoops the read-exclusive request, provides a data response on the bus, and invalidates the copy in its cache. Ownership of the block is therefore transferred to the second processor making the request. The memory is not updated because the block is being modified again. Since the requests from the two processors are handled sequentially, cache coherence is maintained at all times.

The scheme just described is based on the ability of cache controllers to observe the activity on the bus and take appropriate actions. Such schemes are called snoopy-cache techniques.

For performance reasons, it is important that the snooping function not interfere with the normal operation of a processor and its cache. Such interference occurs if the cache controller accesses the tags of the cache for every request that appears on the bus. In most cases, the cache would not contain a valid copy of the block that is relevant to a request. To eliminate unnecessary interference, each cache can be provided with a set of duplicate tags, which maintain the same status information about the blocks in the cache but can be accessed separately by the snooping circuitry.


12.4.4 Directory-Based Cache Coherence

The concept of snoopy caches is easy to implement in single-bus systems. Large shared-memory multiprocessors use interconnection networks such as rings and meshes. In such systems, broadcasting every single request to the caches of all processors is inefficient. A scalable, but more complex, solution to this problem uses directories in each memory module to indicate which nodes may have copies of a given block in the shared state. If a block is modified, the directory identifies the node that is the current owner. Each request from a processor must be sent first to the memory module containing the relevant block. The directory information for that block is used to determine the action that is taken. A read request is forwarded to the current owner if the block is modified. In the case of a write request for a block that is shared, individual invalidations are sent only to nodes that may have copies of the block in question. The cost and complexity of the directory-based approach for enforcing cache coherence limits its use to large systems. Small multiprocessors, including current multicore chips, typically use snooping.

12.5 Message-Passing Multicomputers

A different way of using multiple processors involves implementing each node in the system as a complete computer with its own memory. Other computers in the system do not have direct access to this memory. Data that need to be shared are exchanged by sending messages from one computer to another. Such systems are called message-passing multicomputers.

Parallel programs are written differently for message-passing multicomputers than for shared-memory multiprocessors. To share data between nodes, the program running in the computer that is the source of the data must send a message containing the data to the destination computer. The program running in the destination computer receives the message and copies the data into the memory of that node.

To facilitate message passing, a special communications unit at each node is often responsible for the low-level details of formatting and interpreting messages that are sent and received, and for copying message data to and from the memory of the node. The computer in each node issues commands to the communications unit. The computer then continues performing other computations while the communications unit handles the details of sending and receiving messages.

12.6 Parallel Programming for Multiprocessors

The preceding sections described hardware arrangements for shared-memory multiprocessors that can exploit parallelism in application programs. The available parallelism may be found in loops with independent passes, and also in independent higher-level tasks.

A source program written in a high-level language allows a programmer to express the desired computation in a manner that is easy to understand. It must be translated by


the compiler and the assembler into machine-language representation. The hardware of the processor is designed to execute machine-language instructions in proper sequence to perform the computation desired by the programmer. It cannot automatically identify independent high-level tasks that could be executed in parallel. The compiler also has limitations in detecting and exploiting parallelism. It is therefore the responsibility of the programmer to explicitly partition the overall computation in the source program into tasks and to specify how they are to be executed on multiple processors.

Programming for a shared-memory multiprocessor is a natural extension of conventional programming for a single processor. A high-level source program is written using tasks that are executed by one processor. But it is also possible to indicate that certain tasks are to be executed simultaneously in different processors. Sharing of data is achieved by defining global variables that are read and written by different processors as they perform their assigned tasks. The multicore chips currently used in general-purpose computers, such as those implementing the Intel IA-32 architecture, are programmed in this manner.

To illustrate parallel programming, we consider the example of computing the dot product of two vectors, each containing N numbers. A C-language program for this task is shown in Figure 12.7. The details of initializing the contents of the two vectors are omitted to focus on the aspects relevant to parallel programming.

The loop accumulates the sum of N products. Each pass depends on the partial sum computed in the preceding pass, and the result computed in the final pass is the dot product. Despite the dependency, it is possible to partition the program into independent tasks for simultaneous execution by exploiting the associative property of addition. Each task computes a partial sum, and the final result is obtained by adding the partial sums.

#include <stdio.h>      /* Routines for input/output. */

#define N 100           /* Number of elements in each vector. */

double a[N], b[N];      /* Vectors for computing the dot product. */

void main (void)
{
    int i;
    double dot_product;

    < Initialize vectors a[], b[] – details omitted. >
    dot_product = 0.0;
    for (i = 0; i < N; i++)
        dot_product = dot_product + a[i] * b[i];
    printf ("The dot product is %g\n", dot_product);
}

Figure 12.7 C program for computing a dot product.


To implement a parallel program for computing the dot product, two questions need to be answered:

• How do we make multiple processors participate in parallel execution to compute the partial sums?

• How do we ensure that each processor has computed its partial sum before the final result for the dot product is computed?

Thread Creation

To answer the first question, we define the tasks that are assigned to different processors, and then we describe how execution of these tasks is initiated in multiple processors. We can write a parallel version of the dot product program using parameters for the number of processors, P, and the number of elements in each vector, N. We assume for simplicity that N is evenly divisible by P. The overall computation involves a sum of N products. For P processors, we define P independent tasks, where each task is the computation of a partial sum of N/P products.

When a program is executed in a single processor, there is one active thread of execution control. This thread is created implicitly by the operating system (OS) when execution of the program begins. For a parallel program, we require the independent tasks to be handled separately by multiple threads of execution control, one for each processor. These threads must be created explicitly. A typical approach is to use a routine named create_thread in a library that supports parallel programming. The library routine accepts an input parameter, which is a pointer to a subroutine to be executed by the newly created thread. An operating system service is invoked by the library routine to create a new thread with a distinct stack, so that it may call other subroutines and have its own local variables. All global variables are shared among all threads.

It is necessary to distinguish the threads from each other. One approach is to provide another library routine called get_my_thread_id that returns a unique integer between 0 and P − 1 for each thread. With that information, a thread can determine the appropriate subset of the overall computation for which it is responsible.

Thread Synchronization

The second question involves determining when threads have completed their tasks, so that the final result can be computed correctly. Synchronization of multiple threads is therefore required. There are several methods of synchronization, and they are often implemented in additional library routines for parallel programming. Here, we consider one method called a barrier.

The purpose of a barrier is to force threads to wait until they have all reached a specific point in the program where a call is made to the library routine for the barrier. Each thread that calls the barrier routine enters a busy-wait loop until the last thread calls the routine and enables all of the threads to continue their execution. This ensures that the threads have completed their respective computations preceding the barrier call.

Example Parallel Program

Having described the issues related to thread creation and synchronization, and typical library routines that are provided for thread management, we can now present a parallel dot product program as an example. Figure 12.8 shows a main routine, and another routine


#include <stdio.h>          /* Routines for input/output. */
#include "threads.h"        /* Routines for thread creation/synchronization. */

#define N 100               /* Number of elements in each vector. */
#define P 4                 /* Number of processors for parallel execution. */

double a[N], b[N];          /* Vectors for computing the dot product. */
double partial_sums[P];     /* Array of results computed by threads. */
Barrier bar;                /* Shared variable to support barrier synchronization. */

void ParallelFunction (void)
{
    int my_id, i, start, end;
    double s;

    my_id = get_my_thread_id ();   /* Get unique identifier for this thread. */
    start = (N/P) * my_id;         /* Determine start/end using thread identifier. */
    end = (N/P) * (my_id + 1) – 1; /* N is assumed to be evenly divisible by P. */
    s = 0.0;
    for (i = start; i <= end; i++)
        s = s + a[i] * b[i];
    partial_sums[my_id] = s;       /* Save result in array. */
    barrier (&bar, P);             /* Synchronize with other threads. */
}

void main (void)
{
    int i;
    double dot_product;

    < Initialize vectors a[], b[] – details omitted. >
    init_barrier (&bar);
    for (i = 1; i < P; i++)        /* Create P – 1 additional threads. */
        create_thread (ParallelFunction);
    ParallelFunction ();           /* Main thread also joins parallel execution. */
    dot_product = 0.0;             /* After barrier synchronization, compute final result. */
    for (i = 0; i < P; i++)
        dot_product = dot_product + partial_sums[i];
    printf ("The dot product is %g\n", dot_product);
}

Figure 12.8 Parallel program in C for computing a dot product.


called ParallelFunction that defines the independent tasks for parallel execution. When the program begins executing, there is only one thread executing the main routine. This thread initializes the vectors, then it initializes a shared variable needed for barrier synchronization. To initiate parallel execution, the create_thread routine is called P − 1 times from the main routine to create additional threads that each execute ParallelFunction. Then, the thread executing the main routine calls ParallelFunction directly so that a total of P threads are involved in the overall computation. The operating system software is responsible for distributing the threads to different processors for parallel execution.

Each thread calls get_my_thread_id from ParallelFunction to obtain a unique integer identifier in the range 0 to P − 1. Using this information, the thread calculates the start and end indices for the loop that generates the partial sum of that thread. After executing the loop, it writes the result to a separate element of the shared partial_sums array using its unique identifier as the array index. Then, the thread calls the library routine for barrier synchronization to wait for other threads to complete their computation.

After the last thread to complete its computation calls the barrier routine, all threads return to ParallelFunction. There is no further computation to perform in ParallelFunction, so the P − 1 threads created by the library call in the main routine terminate. The thread that called ParallelFunction directly from the main routine returns to compute the final result using the values in the partial_sums array.

The program in Figure 12.8 uses generic library routines to illustrate thread creation and synchronization. A large collection of routines for parallel programming in the C language is defined in the IEEE 1003.1 standard [4]. This collection is also known as the POSIX threads or Pthreads library. It provides a variety of thread management and synchronization mechanisms. Implementations of this library are available for widely-used operating systems to facilitate programming for multiprocessors.

12.7 Performance Modeling

The most important measure of the performance of a computer is how quickly it can execute programs. When considering one processor, the speed with which instructions are fetched and executed is affected by the instruction set architecture and the hardware design. The total number of instructions that are executed is affected by the compiler as well as the instruction set architecture. Chapter 6 introduced the basic performance equation, which is a low-level mathematical model that reflects these considerations for one processor. The terms in that model include the number of instructions executed, the average number of cycles per instruction, and the clock frequency. This model enables the prediction of execution time, given sufficiently detailed information.

A higher-level model that relies on less detailed information can be used to assess potential improvements in performance. Consider a program whose execution time on some computer is T_orig. Our objective is to assess the extent to which the execution time can be reduced when a performance enhancement, such as parallel processing, is introduced.


Assume that a fraction f_enh of the execution time is affected by the enhancement. The remaining fraction, f_unenh = 1 − f_enh, is unchanged. Let p represent the factor by which the portion of time f_enh × T_orig is reduced due to the performance enhancement. The new execution time is

T_new = T_orig (f_unenh + f_enh/p)

The speedup is the ratio T_orig/T_new, or

1/(f_unenh + f_enh/p)

The above expression for speedup is known as Amdahl's Law. It is named after Gene Amdahl, who was the first to formalize the reasoning explained above. It is a way of stating the intuitive observation that the benefit of a given performance enhancement increases if it affects a larger portion of the execution time.

Once a breakdown of the original execution time has been determined, it is often useful to determine an upper bound on the possible speedup. To do so, we let p → ∞ to reflect the ideal, but unrealistic, reduction of the fraction f_enh of execution time to zero. The resulting speedup is 1/f_unenh, which means that the portion of execution time that is not enhanced is the limiting factor on performance. A smaller value of f_unenh gives a larger bound on the speedup. For example, f_unenh = 0.1 gives an upper bound of 10, but f_unenh = 0.05 gives a larger bound of 20. However, the expected speedup using a realistic value of p is normally well below the upper bound. For example, using p = 16 with f_unenh = 0.05 gives a speedup of only 1/(0.05 + 0.95/16) = 9.1, well below the upper bound of 20.

The important conclusion from this discussion is that the unenhanced portion of the original execution time can significantly limit the achievable speedup, even if the enhanced portion is improved by an arbitrarily large factor. If a programmer can determine, even approximately, the fractions f_unenh and f_enh of execution time before applying a particular enhancement, Amdahl's Law can provide useful insight into the expected improvement. That information can be used to determine whether the expected gain in performance justifies the effort and expense to implement the enhancement.
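Amdahl's Law is simple enough to evaluate directly. The following C sketch (the function name is an illustrative assumption) computes the predicted speedup for a given enhanced fraction f_enh and enhancement factor p:

```c
/* Speedup predicted by Amdahl's Law: a fraction f_enh of the original
   execution time is sped up by a factor p; the rest is unchanged.
   Letting p grow without bound approaches the upper bound 1/f_unenh. */
double amdahl_speedup(double f_enh, double p)
{
    double f_unenh = 1.0 - f_enh;     /* unenhanced fraction */
    return 1.0 / (f_unenh + f_enh / p);
}
```

For example, amdahl_speedup(0.95, 16.0) evaluates to about 9.14, matching the worked example above, while very large p approaches the bound 1/0.05 = 20.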

12.8 Concluding Remarks

Fundamental techniques such as pipelining and caches are important ways of improving performance and are widely used in computers. The additional techniques of multithreading, vector (SIMD) processing, and multiprocessing provide the potential for further improvements in performance by making more efficient use of processing resources and by performing more operations in parallel. These techniques have been incorporated into general-purpose multicore processor chips.


Problems

12.1 [M] Assume that a bus transfer takes T seconds and memory access time is 4T seconds. A read request over a conventional bus then requires 6T seconds to complete. How many conventional buses are needed to equal or exceed the effective throughput of a split-transaction bus that operates with the same time delays? Consider only read requests, ignore memory conflicts, and assume that all memory modules are connected to all buses in the multiple-bus case. Does your answer increase or decrease if memory access time increases?

12.2 [M] Assume that a computation comprises k + 1 distinct tasks. In order to prepare a program for the desired computation, each of these tasks has been written as a function in the C language. The k + 1 functions are labeled T0(), T1(), . . . , Tk(). Each function requires τ time units to execute. Due to data dependencies, functions T1() to Tk() must be executed after function T0(). There are no data dependencies among the functions T1() to Tk().

(a) Using the given functions, write a C program that executes on a single processor.

(b) Write an equivalent C program that executes on k processors.

(c) Derive an expression for the ideal speedup for the program in part (b) relative to the program in part (a).

12.3 [D] What are the arguments for and against invalidation and updating as strategies for maintaining cache coherence?

12.4 [D] The approaches of shared memory and message passing both support simultaneous execution of tasks that interact with each other. Which of these two approaches can emulate the action of the other more easily? Briefly justify your answer.

12.5 [E] With reference to Figure 12.1, assume an array size of N = 32. Let all instructions require one time unit to execute in a processor. Determine the speedup of vectorization for vector lengths of 4, 8, 16, and 32.

12.6 [M] For vectorization, efficiency is defined as the ratio of the speedup to the vector length. Determine the efficiency for each of the results in Problem 12.5. Comment on the variation of the efficiency with the vector length.

12.7 [M] In the parallel program in Figure 12.8 for computing the dot product, the sum of the partial products is computed by one processor after all threads have reached the barrier. Modify the program so that each thread adds its partial sum incrementally to a global sum for the dot product. To prevent two or more threads from trying to update the global sum at the same time, synchronization is required. One method is to use a shared counter variable whose value is the identifier of the thread that is granted exclusive access to the variable for the global sum. After the thread with exclusive access has updated the global sum, it can simply increment this counter to grant exclusive access to another thread.

12.8 [E] Assume that a multiprocessor has eight processors. Based on an existing program that runs on one processor, a parallel program is written to run on this multiprocessor. Assume that the workload of the parallel portion of the program can be distributed evenly over the


eight processors. Use Amdahl's Law to determine the value of f_enh that would allow a speedup of 5.

References

1. NVIDIA Corporation, NVIDIA CUDA C Programming Guide, Version 3.1.1, July 2010, available at http://www.nvidia.com.

2. D. Kirk and W.-M. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, Burlington, Massachusetts, 2010.

3. Khronos Group, OpenCL Introduction and Overview, June 2010, available at http://www.khronos.org/opencl.

4. IEEE, 1003.1 Standard for Information Technology—Portable Operating System Interface (POSIX): System Interfaces, Issue 6, 2001, available at http://www.ieee.org.


Appendix A: Logic Circuits

Appendix Objectives

This appendix provides a concise presentation of logic circuits. You will learn about:

• Synthesis of logic functions

• Flip-flops, registers, and counters

• Decoders and multiplexers

• Finite state machines

• Programmable logic devices

• Practical implementation of logic circuits


Information in digital computers is represented and processed by electronic networks called logic circuits. These circuits operate on binary variables that assume one of two distinct values, usually called 0 and 1. In this appendix we will give a concise presentation of logic functions and circuits for their implementation, including a brief review of integrated circuit technology.

A.1 Basic Logic Functions

It is helpful to introduce the topic of binary logic by examining a practical problem that arises in all homes. Consider a lightbulb whose on/off status is controlled by two switches, x1 and x2. Each switch can be in one of two possible positions, 0 or 1, as shown in Figure A.1a. It can thus be represented by a binary variable. We will let the switch names serve as the names of the associated binary variables. The figure also shows an electrical power supply and a lightbulb. The way the switch terminals are interconnected determines how the switches control the light. The light will be on only if a closed path exists from the power supply through the switch network to the lightbulb. Let a binary variable f represent the condition of the light. If the light is on, f = 1, and if the light is off, f = 0. Thus, f = 1 means that there is at least one closed path through the network, and f = 0 means that there is no closed path. Clearly, f is a function of the two variables x1 and x2.

Let us consider some possibilities for controlling the light. First, suppose that the light is to be on if either switch is in the 1 position, that is, f = 1 if

x1 = 1 and x2 = 0

or

x1 = 0 and x2 = 1

or

x1 = 1 and x2 = 1

The connections that implement this type of control are shown in Figure A.1b. A logic truth table that represents this situation is shown beside the wiring diagram. The table lists all possible switch settings along with the value of f for each setting. In logic terms, this table represents the OR function of the two variables x1 and x2. The operation is represented algebraically by a “+” sign or a “∨” sign, so that

f = x1 + x2 = x1 ∨ x2

We say that x1 and x2 are the input variables and f is the output function.

We should point out some basic properties of the OR operation. It is commutative, that is,

x1 + x2 = x2 + x1

It can be extended to n variables, so that

f = x1 + x2 + · · · + xn


[Figure A.1 shows the light switch example: (a) a lightbulb controlled by two switches, x1 and x2, each with positions 0 and 1, connected to a power supply and a common ground; (b) parallel connection (OR control), f(x1, x2) = x1 + x2; (c) series connection (AND control), f(x1, x2) = x1 · x2; (d) EXCLUSIVE-OR connection (XOR control), f(x1, x2) = x1 ⊕ x2. Each wiring diagram is accompanied by its truth table.]

Figure A.1 Light switch example.

has the value 1 if any variable xi has the value 1. This represents the effect of connecting more switches in parallel with the two switches in Figure A.1b. Also, inspection of the truth table shows that

1+ x = 1


and

0+ x = x

Now, suppose that the light is to be on only when both switches are in the 1 position. The connections for this, along with the corresponding truth-table representation, are shown in Figure A.1c. This is the AND function, which uses the symbol “·” or “∧” and is denoted as

f = x1 · x2 = x1 ∧ x2

Some basic properties of the AND operation are

x1 · x2 = x2 · x1

1 · x = x

and

0 · x = 0

The AND function also extends to n variables, with

f = x1 · x2 · · · · · xn

having the value 1 only if all the xi variables have the value 1. This represents the case in which more switches are connected in series with the two switches in Figure A.1c.

The final possibility that we will discuss for the way the switches determine the light status is another common situation. If we assume that the switches are at the two ends of a stairway, it should be possible to turn the light on or off from either switch. That is, if the light is on, changing either switch position should turn it off; and if it is off, changing either switch position should turn it on. Assume that the light is off when both switches are in the 0 position. Then changing either switch to the 1 position should turn the light on. Now suppose that the light is on with x1 = 1 and x2 = 0. Switching x1 back to 0 will obviously turn the light off. Furthermore, it must be possible to turn the light off by changing x2 to 1, that is, f = 0 if x1 = x2 = 1. The connections to implement this type of control are shown in Figure A.1d. The corresponding logic operation is called the XOR (Exclusive-OR) function, which is represented by the symbol “⊕”. Some of its properties are

x1 ⊕ x2 = x2 ⊕ x1

1 ⊕ x = x′

and

0⊕ x = x

where x′ denotes the NOT function of the variable x. This single-variable function, f = x′, has the value 1 if x = 0 and the value 0 if x = 1. We say that the input x is being inverted or complemented.
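The four functions introduced above can be mimicked directly with C's bitwise operators, provided the variables are restricted to the values 0 and 1. The function names below are illustrative, not part of the text:

```c
/* Basic logic functions on 0/1-valued variables, mirroring the
   switch example: f = 1 when a closed path exists to the bulb. */
int logic_or (int x1, int x2) { return x1 | x2; }  /* parallel switches */
int logic_and(int x1, int x2) { return x1 & x2; }  /* series switches   */
int logic_xor(int x1, int x2) { return x1 ^ x2; }  /* stairway switches */
int logic_not(int x)          { return x ^ 1;  }   /* complement x'     */
```

The XOR properties above can be checked with these functions; for instance, logic_xor(1, x) always equals logic_not(x).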


A.1.1 Electronic Logic Gates

The use of switches, closed or open electrical paths, and lightbulbs to illustrate the idea of logic variables and functions is convenient because of their familiarity and simplicity. The logic concepts that have been introduced are equally applicable to the electronic circuits used to process information in digital computers. The physical variables are electrical voltages and currents instead of switch positions and closed or open paths. For example, consider a circuit that is designed to operate on inputs that are at either +5 or 0 volts. The circuit outputs are also at either +5 or 0 V. Now, if we say that +5 V represents logic 1 and 0 V represents logic 0, then we can describe what the circuit does by specifying the truth table for the logic operation that it performs.

With the help of transistors, it is possible to design simple electronic circuits that perform logic operations such as AND, OR, XOR, and NOT. It is customary to use the name gates for these basic logic circuits. Standard symbols for these gates are shown in Figure A.2. A somewhat more compact graphical notation for the NOT operation is used when inversion is applied to a logic gate input or output. In such cases, the inversion is denoted by a small circle.

The electronic implementation of logic gates will be discussed in Section A.5. We will now proceed to discuss how basic gates can be used to construct logic networks that implement more complex logic functions.

[Figure A.2 shows the standard symbols for the four basic gates, each with its output expression: OR gate, f = x1 + x2; AND gate, f = x1 · x2; NOT gate, f = x′; XOR gate, f = x1 ⊕ x2.]

Figure A.2 Standard logic gate symbols.


A.2 Synthesis of Logic Functions

Consider the network composed of two AND gates and one OR gate that is shown in Figure A.3a. It can be represented by the expression

f = x1′ · x2 + x1 · x2′

The construction of the truth table for this expression is shown in Figure A.3b. First, the values of the AND terms are determined for each input valuation. Then the values of the function f are determined using the OR operation. The truth table for f is identical to the truth table for the XOR function, so the three-gate network in Figure A.3a is an implementation of the XOR function using AND, OR, and NOT gates. The logic expression x1′ · x2 + x1 · x2′ is called a sum-of-products form because the OR operation is sometimes called the “sum” function and the AND operation the “product” function.

We should note that it would be more proper to write

f = ((x1′) · x2) + (x1 · (x2′))

[Figure A.3a shows the network for the XOR function: input inversions feed two AND gates whose outputs drive an OR gate. Figure A.3b shows the truth-table construction of x1′ · x2 + x1 · x2′:

x1 x2 | x1′ · x2 | x1 · x2′ | f = x1′ · x2 + x1 · x2′ = x1 ⊕ x2
 0  0 |    0     |    0     |  0
 0  1 |    1     |    0     |  1
 1  0 |    0     |    1     |  1
 1  1 |    0     |    0     |  0
]

Figure A.3 Implementation of the XOR function using AND, OR, and NOT gates.


Table A.1 Two 3-variable functions.

x1 x2 x3 f1 f2

0 0 0 1 1

0 0 1 1 1

0 1 0 0 1

0 1 1 1 0

1 0 0 0 1

1 0 1 0 1

1 1 0 0 0

1 1 1 1 0

to indicate the order of applying the operations in the expression. To simplify the appearance of such expressions, we define a hierarchy among the three operations AND, OR, and NOT. In the absence of parentheses, operations in a logic expression should be performed in the following order: NOT, AND, and then OR. Furthermore, it is customary to omit the “·” operator when there is no ambiguity.

Returning to the sum-of-products form, we will now explain how any logic function can be synthesized in this form directly from its truth table. Consider the truth table of Table A.1 and suppose we wish to synthesize the function f1 using AND, OR, and NOT gates. For each row of the table in which f1 = 1, we include a product (AND) term in the sum-of-products form. The product term includes all three input variables. The NOT operator is applied to these variables individually so that the term is 1 only when the variables have the particular valuation that corresponds to that row of the truth table. This means that if xi = 0, then xi′ is entered in the product term, and if xi = 1, then xi is entered. For example, the fourth row of the table has the function entry 1 for the input valuation

(x1, x2, x3) = (0, 1, 1)

The product term corresponding to this is x1′x2x3. Doing this for all rows in which the function f1 has the value 1 leads to

f1 = x1′x2′x3′ + x1′x2′x3 + x1′x2x3 + x1x2x3

The logic network corresponding to this expression is shown on the left side in Figure A.4. As another example, the sum-of-products expression for the XOR function can be derived from its truth table using this technique. This approach can be used to derive sum-of-products expressions and the corresponding logic networks for truth tables of any size.
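The sum-of-products expression for f1 can be checked mechanically. In the C sketch below (the function name is illustrative), each AND term is 1 for exactly one input valuation, and the OR of the four terms reproduces the f1 column of Table A.1:

```c
/* Evaluate the sum-of-products form derived for f1 of Table A.1:
   f1 = x1'x2'x3' + x1'x2'x3 + x1'x2x3 + x1x2x3.
   Inputs must be 0 or 1; ! gives the complemented variable. */
int f1_sop(int x1, int x2, int x3)
{
    int n1 = !x1, n2 = !x2, n3 = !x3;   /* complemented inputs */
    return (n1 & n2 & n3) | (n1 & n2 & x3) |
           (n1 & x2 & x3) | (x1 & x2 & x3);
}
```

Evaluating f1_sop for all eight valuations reproduces the column 1, 1, 0, 1, 0, 0, 0, 1 of Table A.1.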


[Figure A.4 shows two equivalent networks: on the left, the sum-of-products network for f1 = x1′x2′x3′ + x1′x2′x3 + x1′x2x3 + x1x2x3, and on the right, the minimal network for f1 = x1′x2′ + x2x3.]

Figure A.4 A logic network for f1 of Table A.1 and an equivalent minimal network.

A.3 Minimization of Logic Expressions

We have shown how to derive one sum-of-products expression for each truth table. In fact, there are many equivalent expressions and logic networks for any particular truth table. Two logic expressions or logic gate networks are equivalent if they have identical truth tables. An expression that is equivalent to the sum-of-products expression we derived for f1 in the previous section is

x1′x2′ + x2x3

To prove this, we construct the truth table for the simpler expression and show that it is identical to the truth table for f1 in Table A.1. This is done in Table A.2. The construction of the table for x1′x2′ + x2x3 is done in three steps. First, the value of the product term x1′x2′ is computed for each valuation of the inputs. Then x2x3 is evaluated. Finally, these two columns are ORed together to obtain the truth table for the expression. This truth table is identical to the truth table for f1 given in Table A.1.

To simplify logic expressions we perform a series of algebraic manipulations. The new logic rules that are used in these manipulations are the distributive rule

w(y + z) = wy + wz

and the identity

w + w′ = 1

Table A.3 shows the truth-table proof of the distributive rule. It should now be clear that rules such as this can always be proved by constructing the truth tables for the left-hand side and the right-hand side to show that they are identical. Logic rules, such as the distributive rule, are sometimes called identities. Although we will not need to use it here, another form


Table A.2 Evaluation of the expression x1′x2′ + x2x3.

x1 x2 x3 x1′x2′ x2x3 x1′x2′ + x2x3 = f1

0 0 0 1 0 1

0 0 1 1 0 1

0 1 0 0 0 0

0 1 1 0 1 1

1 0 0 0 0 0

1 0 1 0 0 0

1 1 0 0 0 0

1 1 1 0 1 1

Table A.3 Truth-table technique for proving equivalence of expressions.

Left-hand side Right-hand side

w y z y + z w(y + z) wy wz wy + wz

0 0 0 0 0 0 0 0

0 0 1 1 0 0 0 0

0 1 0 1 0 0 0 0

0 1 1 1 0 0 0 0

1 0 0 0 0 0 0 0

1 0 1 1 1 0 1 1

1 1 0 1 1 1 0 1

1 1 1 1 1 1 1 1

of the distributive rule that we should include for completeness is

w + yz = (w + y)(w + z)

The objective in logic minimization is to reduce the cost of implementation of a given logic function according to some criterion. More particularly, we wish to start with a sum-of-products expression derived from a truth table and simplify it to an equivalent minimal sum-of-products expression. To define the criterion for minimization, it is necessary to introduce a size or cost measure for a sum-of-products expression. The usual cost measure is a count of the total number of gates and gate inputs required in implementing the expression in the form shown in Figure A.4. For example, the larger expression in this figure has a cost of 21, composed of a total of 5 gates and 16 gate inputs. Input inversions are ignored in this counting process. The cost of the simpler expression is 9, composed of 3 gates and 6 inputs. We are now in a position to state that a sum-of-products expression is minimal if


there is no other equivalent sum-of-products expression with a lower cost. In the simple examples that we will introduce, it is usually reasonably clear when we have arrived at a minimal expression. Thus, we will not give rigorous proofs of minimality.

The general strategy in performing algebraic manipulations to simplify a given expression is as follows. First, group product terms in pairs that differ only in that some variable appears complemented (x′) in one term and true (x) in the other. When the common subproduct consisting of the other variables is factored out of the pair by the distributive rule, we are left with the term x + x′, which has the value 1. Applying this procedure to the first expression for f1, we obtain

f1 = x1′x2′x3′ + x1′x2′x3 + x1′x2x3 + x1x2x3

= x1′x2′(x3′ + x3) + (x1′ + x1)x2x3

= x1′x2′ · 1 + 1 · x2x3

= x1′x2′ + x2x3

This expression is minimal. The network corresponding to it is shown in Figure A.4.

The grouping of terms in pairs so that minimization can lead to the simplest expression is not always as obvious as it is in the preceding example. A rule that is often helpful is

w + w = w

This allows us to repeat product terms so that a particular term can be combined with more than one other term in the factoring process. As an example of this, consider the function f2 in Table A.1. The sum-of-products expression that can be derived for it directly from the truth table is

f2 = x1′x2′x3′ + x1′x2′x3 + x1′x2x3′ + x1x2′x3′ + x1x2′x3

By repeating the first product term x1′x2′x3′ and interchanging the order of terms (by the commutative rule), we obtain

f2 = x1′x2′x3′ + x1′x2′x3 + x1x2′x3′ + x1x2′x3 + x1′x2′x3′ + x1′x2x3′

Grouping the terms in pairs and factoring yields

f2 = x1′x2′(x3′ + x3) + x1x2′(x3′ + x3) + x1′(x2′ + x2)x3′

= x1′x2′ + x1x2′ + x1′x3′

Now, reducing the first pair of terms by factoring gives the minimal expression

f2 = x2′ + x1′x3′
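Since two expressions are equivalent exactly when their truth tables are identical, the minimization above can be verified by evaluating both forms for all eight valuations. The sketch below (all function names are illustrative) compares the original five-term expression for f2 with the minimal form x2′ + x1′x3′:

```c
/* Original sum-of-products expression for f2 of Table A.1. */
int f2_sop(int x1, int x2, int x3)
{
    int n1 = !x1, n2 = !x2, n3 = !x3;   /* complemented inputs */
    return (n1 & n2 & n3) | (n1 & n2 & x3) | (n1 & x2 & n3) |
           (x1 & n2 & n3) | (x1 & n2 & x3);
}

/* Minimized expression f2 = x2' + x1'x3'. */
int f2_min(int x1, int x2, int x3)
{
    return !x2 | (!x1 & !x3);
}

/* Exhaustive equivalence check over all eight valuations;
   returns 1 when the two truth tables are identical. */
int f2_forms_equivalent(void)
{
    for (int v = 0; v < 8; v++) {
        int x1 = (v >> 2) & 1, x2 = (v >> 1) & 1, x3 = v & 1;
        if (f2_sop(x1, x2, x3) != f2_min(x1, x2, x3))
            return 0;
    }
    return 1;
}
```

This kind of exhaustive comparison is practical for functions of a few variables, which is exactly the setting of this section.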

This completes our discussion of algebraic simplification of logic expressions. The obvious practical application of this mathematical exercise stems from the fact that networks with fewer gates and inputs are cheaper and easier to implement. Therefore, it is of economic interest to be able to determine the minimal expression that is equivalent to a given expression. The rules that we have used in manipulating logic expressions are summarized in Table A.4. They are arranged in pairs to show their symmetry as they apply to both the AND and OR functions. So far, we have not had occasion to use either involution or de Morgan's rules, but they will be found to be useful in the next section.


Table A.4 Rules of binary logic.

Name Algebraic identity

Commutative     w + y = y + w                 wy = yw
Associative     (w + y) + z = w + (y + z)     (wy)z = w(yz)
Distributive    w + yz = (w + y)(w + z)       w(y + z) = wy + wz
Idempotent      w + w = w                     ww = w
Involution      (w′)′ = w
Complement      w + w′ = 1                    ww′ = 0
de Morgan       (w + y)′ = w′y′               (wy)′ = w′ + y′
                1 + w = 1                     0 · w = 0
                0 + w = w                     1 · w = w

A.3.1 Minimization using Karnaugh Maps

In our algebraic minimization of the functions f1 and f2 of Table A.1, it was necessary to guess the best way to proceed at certain points. For instance, the decision to repeat the term x1′x2′x3′ as the first step in minimizing f2 is not obvious. There is a geometric technique that can be used to quickly derive the minimal expression for a logic function of a few variables. The technique depends on a different form for presentation of the truth table, a form called the Karnaugh map. For a three-variable function, the map is a rectangle composed of eight squares arranged in two rows of four squares each, as shown in Figure A.5a. Each square of the map corresponds to a particular valuation of the input variables. For example, the third square of the top row represents the valuation (x1, x2, x3) = (1, 1, 0). Because there are eight rows in a three-variable truth table, the map obviously requires eight squares. The entries in the squares are the function values for the corresponding input valuations.

The key idea in the formation of the map is that horizontally and vertically adjacent squares correspond to input valuations that differ in only one variable. When two adjacent squares contain 1s, they indicate the possibility of an algebraic simplification. In the map for f2 in Figure A.5a, the 1 values in the leftmost two squares of the top row correspond to the product terms x1′x2′x3′ and x1′x2x3′. The simplification

x1′x2′x3′ + x1′x2x3′ = x1′x3′

was performed earlier in minimizing the algebraic expression for f2. This simplification can be obtained directly from the map by grouping the two 1s as shown. The product term that corresponds to a group of squares is the product of the input variables whose values are constant on these squares. If the value of input variable xi is 0 for all 1s of a group, then xi′ is entered in the product, but if xi has the value 1 for all 1s of the group, then xi is entered in the product. Adjacency of two squares includes the property that the left-end squares are adjacent to the right-end squares. Continuing with our discussion of f2, the group of four 1s consisting of the left-end column and the right-end column simplifies to the single-variable term x2′ because x2 is the only variable whose value remains constant over the group. All four possible combinations of values of the other two variables occur in the group.


[Figure A.5 shows Karnaugh maps with their groupings of 1s and the resulting minimal expressions: (a) three-variable maps for f1 = x1′x2′ + x2x3 and f2 = x2′ + x1′x3′; (b) four-variable maps for the functions g1 through g4, including g2 = x2′x4′ + x1x3′ + x2x3′x4; the map for g4 admits two alternative minimal expressions.]

Figure A.5 Minimization using Karnaugh maps.


Karnaugh maps can be used for more than three variables. A Karnaugh map for four variables can be obtained from two 3-variable maps. Examples of four-variable maps are shown in Figure A.5b, along with minimal expressions for the functions represented by the maps. In addition to two- and four-square groupings, it is now possible to form eight-square groupings. Such a grouping is illustrated in the map for g3. Note that the four corner squares constitute a valid group of four and are represented by the product term x2′x4′ in g2. As in the case of three-variable maps, the term that corresponds to a group of squares is the product of the variables whose values do not change over the group. For example, the grouping of four 1s in the upper right-hand corner of the map for g2 is represented by the product term x1x3′ because x1 = 1 and x3 = 0 over the group. The variables x2 and x4 have all the possible combinations of values over this group. It is also possible to use Karnaugh maps for five-variable functions. In this case, two 4-variable maps are used, one of them corresponding to the 0 value for the fifth variable and the other corresponding to the 1 value.

The general procedure for forming groups of two, four, eight, and so on in Karnaugh maps is readily derived. Two adjacent pairs of 1s can be combined to form a group of four. Similarly, two adjacent groups of four can be combined to form a group of eight. In general, the number of squares in any valid group must be equal to 2^k, where k is an integer.

We will now consider a procedure for using Karnaugh maps to obtain minimal sum-of-products expressions. As can be seen in the maps of Figure A.5, a large group of 1s corresponds to a small product term. Thus, a simple gate implementation results from covering all the 1s in the map with as few groups as possible. In general, we should choose the smallest set of groups, picking large ones wherever possible, that cover all the 1s in the map. Consider, for example, the function g2 in Figure A.5b. As we have already seen, the 1s in the four corners constitute a group of four that is represented by the product term x2′x4′. Another group of four exists in the upper right-hand corner and is represented by the term x1x3′. This covers all the 1s in the map except for the 1 in the square where (x1, x2, x3, x4) = (0, 1, 0, 1). The largest group of 1s that includes this square is the two-square group represented by the term x2x3′x4. Therefore, the minimal expression for g2 is

g2 = x2′x4′ + x1x3′ + x2x3′x4

Minimal expressions for the other functions shown in the figure can be derived in a similar manner. Note that in the case of g4 there are two possible minimal expressions, one including the term x1x2x4 and the other including the term x1x3x4. It is often the case that a given function has more than one minimal expression.

In all our examples, it is relatively easy to derive minimal expressions. In general, there are formal algorithms for this process, but we will not consider them here.

A.3.2 Don’t-Care Conditions

In many situations, some valuations of the inputs to a digital circuit never occur. For example, consider the binary-coded decimal (BCD) number representation. Four binary variables b3, b2, b1, and b0 represent the decimal digits 0 through 9, as shown in Figure A.6. These four variables have a total of 16 distinct valuations, only 10 of which are used for representing the decimal digits. The remaining valuations are not used. Therefore, any

logic circuit that processes BCD data will never encounter any of these six valuations at its inputs.

Figure A.6 Four-variable Karnaugh map illustrating don't-cares. The truth table lists the ten BCD codings and the function value f; the six unused codings are marked d:

b3 b2 b1 b0 | Decimal digit | f
0  0  0  0  |      0        | 0
0  0  0  1  |      1        | 0
0  0  1  0  |      2        | 0
0  0  1  1  |      3        | 1
0  1  0  0  |      4        | 0
0  1  0  1  |      5        | 0
0  1  1  0  |      6        | 1
0  1  1  1  |      7        | 0
1  0  0  0  |      8        | 0
1  0  0  1  |      9        | 1
1  0  1  0  through  1 1 1 1 | unused | d

The Karnaugh map (rows b3b2, columns b1b0) groups these entries, using the don't-cares, to give f = b3b0 + b2b1b0′ + b2′b1b0.

Figure A.6 gives the truth table for a particular function that may be performed on a BCD digit. We do not care what the function values are for the unused input valuations; hence, they are called don't-cares and are denoted as such by the letter “d” in the truth table. To obtain a circuit implementation, the function values corresponding to don't-care conditions can be arbitrarily assigned to be either 0 or 1. The best way to assign them

is in a manner that will lead to a minimal logic gate implementation. We should interpret don't-cares as 1s whenever they can be used to enlarge a group of 1s. Because larger groups correspond to smaller product terms, minimization is enhanced by the judicious inclusion of don't-care entries.

The function in Figure A.6 represents the following processing on a decimal digit input: The output f is to have the value 1 whenever the inputs represent a nonzero digit that is evenly divisible by 3. Three groups are necessary to cover the three 1s of the map, and don't-cares have been used to enlarge these groups as much as possible.
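As a check on this example, the sketch below compares the minimal expression read off the map (as reconstructed here: f = b3b0 + b2b1b0′ + b2′b1b0, with don't-cares taken as 1s where helpful) against the stated specification for all ten BCD digits. The code is illustrative, not from the text.

```python
def f(b3, b2, b1, b0):
    # Reconstructed minimal SOP from Figure A.6: f = b3 b0 + b2 b1 b0' + b2' b1 b0
    return bool((b3 and b0) or (b2 and b1 and not b0) or (not b2 and b1 and b0))

for digit in range(10):   # only the ten BCD codings can ever occur
    b3, b2, b1, b0 = (digit >> 3) & 1, (digit >> 2) & 1, (digit >> 1) & 1, digit & 1
    expected = digit != 0 and digit % 3 == 0   # nonzero and evenly divisible by 3
    assert f(b3, b2, b1, b0) == expected       # don't-care codings 10-15 never arise
```

Note that the loop deliberately stops at 9: the six unused codings may produce either output, which is exactly why they could be assigned freely during minimization.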

A.4 Synthesis with NAND and NOR Gates

We will now consider two other basic logic gates called NAND and NOR, which are extensively used in practice because of their simple electronic realizations. The truth tables for these gates are shown in Figure A.7. They implement the equivalent of the AND and OR functions followed by the NOT function, which is the motivation for the names and standard logic symbols for these gates. Letting the arrows “↑” and “↓” denote the NAND and NOR operators, respectively, and using de Morgan's rule in Table A.4, we have

x1 ↑ x2 = (x1x2)′ = x1′ + x2′

and

x1 ↓ x2 = (x1 + x2)′ = x1′x2′

NAND and NOR gates with more than two input variables are available, and they operate according to the obvious generalization of de Morgan's law as

x1 ↑ x2 ↑ · · · ↑ xn = (x1x2 · · · xn)′ = x1′ + x2′ + · · · + xn′

(a) NAND: f = x1 ↑ x2 = (x1x2)′ = x1′ + x2′

x1 x2 | f
0  0  | 1
0  1  | 1
1  0  | 1
1  1  | 0

(b) NOR: f = x1 ↓ x2 = (x1 + x2)′ = x1′x2′

x1 x2 | f
0  0  | 1
0  1  | 0
1  0  | 0
1  1  | 0

Figure A.7 NAND and NOR gates.


and

x1 ↓ x2 ↓ · · · ↓ xn = (x1 + x2 + · · · + xn)′ = x1′x2′ · · · xn′
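These generalized forms can be expressed directly in Python, where the de Morgan equalities are easy to verify exhaustively (an illustrative sketch; the function names are ours):

```python
from itertools import product

def nand(*xs):
    return not all(xs)      # (x1 x2 ... xn)'

def nor(*xs):
    return not any(xs)      # (x1 + x2 + ... + xn)'

# De Morgan: n-input NAND = OR of the complements,
#            n-input NOR  = AND of the complements.
for xs in product((0, 1), repeat=4):
    assert nand(*xs) == any(not x for x in xs)
    assert nor(*xs) == all(not x for x in xs)
```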

Logic design with NAND and NOR gates is not as straightforward as with AND, OR, and NOT gates. One of the main difficulties in the design process is that the associative rule is not valid for NAND and NOR operations. We will expand on this problem later. First, however, let us describe a simple, general procedure for synthesizing any logic function using only NAND gates. There is a direct way to translate a logic network expressed in sum-of-products form into an equivalent network composed only of NAND gates. The procedure is easily illustrated with the aid of an example. Consider the following algebraic manipulation of a logic expression corresponding to a four-input network composed of three 2-input NAND gates:

(x1 ↑ x2) ↑ (x3 ↑ x4) = ((x1x2)′(x3x4)′)′

                      = (x1x2)′′ + (x3x4)′′

                      = x1x2 + x3x4

We have used de Morgan's rule and the involution rule in this derivation. Figure A.8 shows the logic network equivalent of this derivation. Since any logic function can be synthesized in a sum-of-products (AND-OR) form and because the preceding derivation is obviously reversible, we have the result that any logic function can be synthesized in NAND-NAND form. We can see that this result is true for functions of any number of variables. The required number of inputs to the NAND gates is obviously the same as the number of inputs to the corresponding AND and OR gates.
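The NAND-NAND/AND-OR equivalence derived above can be confirmed over all sixteen input valuations (an illustrative sketch):

```python
from itertools import product

def nand(a, b):
    return not (a and b)

for x1, x2, x3, x4 in product((0, 1), repeat=4):
    nand_nand = nand(nand(x1, x2), nand(x3, x4))   # the NAND-NAND network of Figure A.8
    and_or = (x1 and x2) or (x3 and x4)            # the original sum-of-products form
    assert nand_nand == bool(and_or)
```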

Let us return to the comment that the nonassociativity of the NAND operator can be an annoyance. In designing logic networks with NAND gates using the procedure illustrated in Figure A.8, a requirement for a NAND gate with more inputs than can be implemented in a given technology may arise. If this happens when one is using AND and OR gates, there

is no problem because the AND and OR operators are associative, and a straightforward cascade of such gates can be used. The case of implementing three-input AND and OR functions with two-input gates is shown in Figure A.9a. The solution is not as simple in the case of NAND gates. For example, a three-input NAND function cannot be implemented by a cascade of two 2-input NAND gates. Three gates are needed, as shown in Figure A.9b.

Figure A.8 Equivalence of NAND-NAND and AND-OR networks.

Figure A.9 Cascading of gates: (a) implementing three-input AND and OR functions with two-input gates; (b) implementing a three-input NAND function with two-input gates.
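The claim that a cascade of two 2-input NAND gates does not realize a three-input NAND, while the three-gate network of Figure A.9b does, can also be checked by enumeration (a sketch; the middle gate, with its inputs tied together, acts as an inverter):

```python
from itertools import product

def nand(a, b):
    return not (a and b)

def nand3(a, b, c):
    return not (a and b and c)

# A two-gate cascade disagrees with the 3-input NAND on at least one input...
assert any(nand(nand(a, b), c) != nand3(a, b, c)
           for a, b, c in product((0, 1), repeat=3))   # e.g. inputs (1, 1, 1)

# ...but the three-gate network of Figure A.9b matches it everywhere.
for a, b, c in product((0, 1), repeat=3):
    g1 = nand(a, b)
    g2 = nand(g1, g1)                # NAND as an inverter: restores a AND b
    assert nand(g2, c) == nand3(a, b, c)
```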

A discussion of the implementation of logic functions using only NOR gates proceeds in a similar manner. Any logic function can be synthesized in a product-of-sums (OR-AND) form. Such networks can be implemented by equivalent NOR-NOR networks.

The preceding discussion introduced some basic concepts in logic design. Detailed discussion of the subject can be found in any of a number of textbooks (see References 1–7).

It is important for the reader to appreciate that many different realizations of a given logic function are possible. For practical reasons, it is useful to find realizations that minimize the cost of implementation. It is also often necessary to minimize the propagation delay through a logic network. We introduced the concept of minimization in the previous sections to give an indication of the nature of logic synthesis and the reductions in cost that may be achieved. For example, Karnaugh maps graphically show the manipulation possibilities that lead to optimal solutions. Although it is important to understand the principles of optimization of logic networks, it is not necessary to do the optimization by hand. Sophisticated computer-aided design (CAD) programs exist to perform logic

synthesis and optimization. The designer needs to specify only the desired functional behavior, and the CAD software generates a cost-effective network that implements the required functionality.

A.5 Practical Implementation of Logic Gates

Let us now turn our attention to the means by which logic variables can be represented and logic functions can be implemented in practice. The choice of a physical parameter to represent logic variables is obviously technology-dependent. In electronic circuits, either voltage or current levels can be used for this purpose.

To establish a correspondence between voltage levels and logic values or states, the concept of a threshold is used. Voltages above a given threshold are taken to represent one logic value, with voltages below that threshold representing the other. In practical situations, the voltage at any point in an electronic circuit undergoes small random variations for a variety of reasons. Because of this “noise,” the logic state corresponding to a voltage level near the threshold cannot be reliably determined. To avoid such ambiguity, a “forbidden range” should be established, as shown in Figure A.10. In this case, voltages below V0,max represent the 0 value, and voltages above V1,min represent the 1 value. In subsequent discussion, we will often use the terms “low” and “high” to represent the voltage levels corresponding to logic values 0 and 1, respectively.

We will begin our discussion of electronic circuits that implement basic logic functions by considering simple circuits consisting of resistors and transistors that act as switches. Consider the circuits in Figure A.11. When switch S in Figure A.11a is closed, the output voltage Vout is equal to 0 (ground). When S is open, the output voltage Vout is equal to

the supply voltage, Vsupply. The same effect can be obtained in Figure A.11b, in which a transistor T is used to replace the switch S. When the input voltage applied to the gate of the transistor is 0 (that is, when Vin = 0), the transistor functions as an open switch, and Vout = Vsupply. When Vin changes to Vsupply, the transistor acts as a closed switch and the output voltage Vout is very close to 0. Thus, the circuit performs the function of a logic NOT gate.

Figure A.10 Representation of logic values by voltage levels: voltages from 0 up to V0,max represent logic value 0; the range between V0,max and V1,min, around the threshold, is the forbidden range; voltages from V1,min up to Vmax represent logic value 1.

Figure A.11 An inverter circuit: (a) with a switch S and pull-up resistor R; (b) with a transistor T (gate, source, and drain terminals shown) controlled by Vin.

Figure A.12 A transistor circuit implementation of a NOR gate: (a) with switches Sa and Sb in parallel; (b) with transistors Ta and Tb driven by inputs Va and Vb.

We can now discuss the implementation of more complex logic functions. Figure A.12 shows a circuit realization for a NOR gate. In this case, Vout in Figure A.12a is high only when both switches Sa and Sb are open. Similarly, Vout in Figure A.12b is high only when

both inputs Va and Vb are low. Thus, the circuit is equivalent to a NOR gate in which Va and Vb correspond to two logic variables x1 and x2. We can easily verify that a NAND gate can be obtained by connecting the transistors in series as shown in Figure A.13. The logic functions AND and OR can be implemented using NAND and NOR gates, respectively, followed by the inverter of Figure A.11.

Figure A.13 A transistor circuit implementation of a NAND gate: (a) with switches Sa and Sb in series; (b) with transistors Ta and Tb in series, driven by inputs Va and Vb.

Note that NAND and NOR gates are simpler in their circuit implementations than AND and OR gates. Hence, it is not surprising to find that practical realizations of logic functions use NAND and NOR gates extensively. Many of the examples given in this book show circuits consisting of AND, OR, and NOT gates for ease of understanding. In practice, logic circuits contain all five types of gates.

A.5.1 CMOS Circuits

Figures A.11 through A.13 illustrate the general structure of circuits implemented using NMOS technology. The name derives from the fact that the transistors used to realize the logic functions are of NMOS type. Two types of metal-oxide semiconductor (MOS) transistors are available for use as switches. An n-channel transistor is said to be of NMOS type, and it behaves as a closed switch when its gate input is raised to the positive power supply voltage, Vsupply, as indicated in Figure A.14a. The opposite behavior is achieved with a p-channel transistor, which is said to be of PMOS type. It acts as an open switch when the gate voltage, VG, is equal to Vsupply, and as a closed switch when VG = 0, as indicated in Figure A.14b. Note that the graphical symbol for a PMOS transistor has a bubble on the gate input to indicate that its behavior is complementary to that of an NMOS transistor. Note

also that the names source and drain are associated with the opposite terminals of PMOS transistors in comparison with NMOS transistors. The source of an NMOS transistor is connected to ground, while the source of a PMOS transistor is connected to Vsupply. This naming convention is due to the nature of the current that flows through these transistors.

Figure A.14 NMOS and PMOS transistors in logic circuits: (a) an NMOS transistor is a closed switch when VG = Vsupply and an open switch when VG = 0 V; (b) a PMOS transistor is an open switch when VG = Vsupply and a closed switch when VG = 0 V.

A drawback of the circuits in Figures A.11 through A.13 is their power consumption. In the state in which the switches are closed to provide a path between ground and the pull-up resistor R, there is current flowing from Vsupply to ground. In the opposite state, in which the switches are open, there is no path to ground and there is no current flowing. (In MOS transistors no current flows through the gate terminal.) Thus, depending on the states of its gates, there may be significant power consumption in a logic circuit.

An effective solution to the power consumption problem lies in using both NMOS and PMOS transistors to implement circuits that do not dissipate power when in a steady state. This approach leads to the CMOS (complementary metal-oxide semiconductor) technology. The basic idea of CMOS circuits is illustrated by the inverter circuit in Figure A.15. When Vx = Vsupply, which corresponds to the input x having the logic value 1, transistor T1 is

turned off and T2 is turned on. Thus, T2 pulls the output voltage Vf down to 0. When Vx changes to 0, transistor T1 turns on and T2 turns off. Thus, T1 pulls the output voltage Vf up to Vsupply. Therefore, the logic values of x and f are complements of each other, and the circuit implements a NOT gate.

Figure A.15 CMOS realization of a NOT gate: (a) the circuit, with PMOS transistor T1 above NMOS transistor T2; (b) truth table and transistor states:

x | T1  | T2  | Vx   | Vf   | f
0 | on  | off | low  | high | 1
1 | off | on  | high | low  | 0

A key feature of this circuit is that transistors T1 and T2 operate in a complementary fashion; when one is on, the other is off. Hence, there is always a closed path from the output point f to either Vsupply or ground. But, there is no closed path between Vsupply and ground at any time except during a very short transition period when the transistors are changing their states. This means that the circuit does not dissipate appreciable power when it is in a steady state. It dissipates power only when it is switching from one logic state to another. Therefore, power dissipation in this circuit is dependent on the rate at which state changes take place.

We can now extend the CMOS concept to circuits that have n inputs, as shown in Figure A.16. NMOS transistors are used to implement the pull-down network, such that a closed path is established between the output point f and ground when a desired function F(x1, . . . , xn) is equal to 0. The pull-up network is built with PMOS transistors, such that a closed path is established between the output f and Vsupply when F(x1, . . . , xn) is equal to 1. The pull-up and pull-down networks are functional complements of each other, so that in steady state there exists a closed path only between the output f and either Vsupply or ground, but not both.

The pull-down network is implemented in the same way as shown in Figures A.11 through A.13. Figure A.17 gives the implementation of a NAND gate, and Figure A.18 gives a NOR gate. Figure A.19 shows how an AND gate is realized by inverting the output of a NAND gate.
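The complementary pull-up/pull-down structure can be sketched behaviorally. For the NAND gate of Figure A.17, the pull-down network is two NMOS transistors in series and the pull-up network is two PMOS transistors in parallel; the sketch below is behavioral only, not a transistor-level model:

```python
from itertools import product

for x1, x2 in product((0, 1), repeat=2):
    pull_down = bool(x1 and x2)           # series NMOS path to ground conducts
    pull_up = bool(not x1 or not x2)      # parallel PMOS path to Vsupply conducts
    assert pull_up != pull_down           # exactly one network is closed in steady state
    f = 1 if pull_up else 0               # the output follows the closed path
    assert f == (0 if x1 and x2 else 1)   # NAND behavior
```

Swapping series and parallel (series PMOS pull-up, parallel NMOS pull-down) gives the NOR gate of Figure A.18 in the same way.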

In addition to low power dissipation, CMOS circuits have the advantage that MOS transistors can be implemented in very small sizes and thus occupy a very small area on an integrated circuit chip. This results in two significant benefits. First, it is possible to fabricate chips containing billions of transistors, which has led to the realization of modern

processors and large memory chips. Second, the smaller the transistor, the faster it can be switched from one state to another. Thus, CMOS circuits can be operated at speeds in the gigahertz range.

Figure A.16 Structure of a CMOS circuit: a PMOS pull-up network between Vsupply and the output f, and an NMOS pull-down network between f and ground, both controlled by the inputs x1, . . . , xn.

Figure A.17 CMOS realization of a NAND gate: (a) the circuit, with PMOS transistors T1 and T2 in parallel and NMOS transistors T3 and T4 in series; (b) truth table and transistor states:

x1 x2 | T1  T2  T3  T4  | f
0  0  | on  on  off off | 1
0  1  | on  off off on  | 1
1  0  | off on  on  off | 1
1  1  | off off on  on  | 0

Different CMOS circuits have been developed to operate with power supply voltages up to 15 V. The most commonly used power supplies are in the range from 1 V to 5 V.

Figure A.18 CMOS realization of a NOR gate: (a) the circuit, with PMOS transistors T1 and T2 in series and NMOS transistors T3 and T4 in parallel; (b) truth table and transistor states:

x1 x2 | T1  T2  T3  T4  | f
0  0  | on  on  off off | 1
0  1  | on  off off on  | 0
1  0  | off on  on  off | 0
1  1  | off off on  on  | 0

Figure A.19 CMOS realization of an AND gate: a NAND gate followed by an inverter.

Circuits that use lower power supply voltages dissipate much less power (power dissipation is proportional to Vsupply²), which means that more transistors can be placed on a chip without causing overheating. A drawback of lower power supply voltage is reduced noise immunity.

Transitions between low and high signal levels in a CMOS inverter are illustrated in more detail in Figure A.20. The blue curve, known as the transfer characteristic, shows the output voltage as a function of the input voltage. It indicates that a rather sharp transition

in output voltage takes place when the input voltage passes through the value of about Vsupply/2. There is a threshold voltage, Vt, and a small value δ such that Vout ≈ Vsupply if Vin < Vt − δ and Vout ≈ 0 if Vin > Vt + δ. This means that the input signal need not be exactly equal to the nominal value of either 0 or Vsupply to produce the correct output signal. There is room for some error, called noise, in the input signal that will not cause adverse effects. The amount of noise that can be tolerated is called the noise margin. This margin is Vsupply − (Vt + δ) volts when the logic value of the input is 1, and it is Vt − δ when the logic value of the input is 0. CMOS circuits have excellent noise margins.

Figure A.20 The voltage transfer characteristic for the CMOS inverter: Vf plotted against Vx, falling from Vsupply to 0 V between Vx = Vt − δ and Vx = Vt + δ.
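The two noise margins described above amount to simple arithmetic. The sketch below uses hypothetical voltage values (a 3.3 V supply with the threshold near Vsupply/2), not values from the text:

```python
def noise_margins(v_supply, v_t, delta):
    """Noise margins of the CMOS inverter described by Figure A.20."""
    nm_high = v_supply - (v_t + delta)   # tolerated noise when the input is logic 1
    nm_low = v_t - delta                 # tolerated noise when the input is logic 0
    return nm_high, nm_low

nm_h, nm_l = noise_margins(v_supply=3.3, v_t=1.65, delta=0.2)
print(f"NM_H = {nm_h:.2f} V, NM_L = {nm_l:.2f} V")   # NM_H = 1.45 V, NM_L = 1.45 V
```

With the threshold centered at Vsupply/2, the two margins are equal, which is one reason CMOS inverters are usually designed that way.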

In this section, we have introduced the basic features of CMOS circuits. For a more detailed discussion of this technology the reader may consult References [1] and [8].

A.5.2 Propagation Delay

Logic circuits do not switch instantaneously from one state to another. Speed is measured by the rate at which state changes can take place. A related parameter is propagation delay, which is defined in Figure A.21. When a state change takes place at the input, a delay is encountered before the corresponding change at the output is observed. This propagation delay is usually measured between the 50-percent points of the transitions, as shown in the

figure. Another important parameter is the transition time, which is normally measured between the 10- and 90-percent points of the signal swing, as shown. The maximum speed at which a logic circuit can be operated decreases as the propagation delay through different paths within that circuit increases. The delay along any path in a logic circuit is the sum of the individual gate delays along this path.

Figure A.21 Definition of propagation delay and transition time: input and output waveforms swinging between V0 and V1, with the 10-, 50-, and 90-percent points marked.
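Summing the gate delays along each input-to-output path identifies the critical (slowest) path, which bounds the operating speed. The gate names and delay values below are hypothetical, chosen only to illustrate the computation:

```python
# Per-gate propagation delays in nanoseconds (illustrative values).
gate_delay = {"g1": 0.8, "g2": 1.2, "g3": 0.9, "g4": 1.0}

# Two input-to-output paths through a small hypothetical network.
paths = [["g1", "g2", "g4"], ["g3", "g4"]]

path_delays = [sum(gate_delay[g] for g in p) for p in paths]
critical = max(path_delays)   # the path g1 -> g2 -> g4, at 3.0 ns
```

The clock period of a circuit built from these gates could not be shorter than the critical path delay.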

A.5.3 Fan-In and Fan-Out Constraints

The number of inputs to a logic gate is called its fan-in. The number of gate inputs that the output of a logic gate drives is called its fan-out. Practical circuits do not allow large fan-in and fan-out because they both have an adverse effect on the propagation delay and hence the speed of the circuit.

Each transistor in a CMOS gate contributes a certain amount of capacitance. As the capacitance increases, the circuit becomes slower and its signal levels and noise margins become worse. Therefore, it is necessary to limit the fan-in and fan-out, typically to a number less than ten. If the number of desired inputs exceeds the maximum fan-in, it is necessary to use an additional gate of the same type. Figure A.9a shows how two gates of the same type can be cascaded. If the number of outputs that have to be driven by a particular gate exceeds the acceptable fan-out, it is possible to use two copies of that gate.


A.5.4 Tri-State Buffers

In the logic gates discussed so far, it is not possible to connect the outputs of two gates together. This would make no sense from the logic point of view because if one gate generated an output value of 1 and the other an output of 0, it would be uncertain what the combined output signal would be. More importantly, in CMOS circuits, the gate that generates the output of 1 establishes a direct path from the output terminal to Vsupply, while the gate that generates 0 establishes a path to ground. Thus, the two gates would provide a short circuit across the power supply, which would damage the gates.

Yet, in the design of computer systems, there are many cases where an input signal to a circuit may be derived from one of a number of different sources. This can be done using multiplexer logic circuits, which are discussed in Section A.10. It can also be done using special gates called tri-state buffers. A tri-state buffer has three states. Two of the states produce the normal 0 and 1 signals. The third state places the output terminal of the buffer into a high-impedance state in which the output is electrically disconnected from the input it is supposed to drive.
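The three-state behavior, together with the rule that at most one buffer may drive a shared line at a time, can be modeled as follows (a behavioral sketch; "Z" stands for the high-impedance state and the function names are ours):

```python
def tristate(e, x):
    """Tri-state buffer: output follows x when e = 1, else high impedance."""
    return x if e else "Z"

def bus(*drivers):
    """Resolve a line shared by several tri-state outputs.

    Enabling two buffers at once is the short-circuit situation
    described in the text, so it is rejected here.
    """
    active = [d for d in drivers if d != "Z"]
    assert len(active) <= 1, "bus conflict: more than one enabled driver"
    return active[0] if active else "Z"

assert tristate(1, 0) == 0 and tristate(1, 1) == 1
assert tristate(0, 1) == "Z"                      # disabled: electrically disconnected
assert bus(tristate(0, 1), tristate(1, 0)) == 0   # only the second buffer drives
```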

Figure A.22 depicts a tri-state buffer. The buffer has two inputs and one output. The enable input, e, controls the operation of the buffer. When e = 1, the output f has the same logic value as the input x. When e = 0, the output is placed in the high-impedance state, Z. An equivalent circuit is shown in Figure A.22b. The triangular symbol in this figure represents a noninverting driver. This is a circuit that performs no logic operation

because its output merely replicates the input signal. Its purpose is to provide additional electrical driving capability. When combined with the output switch shown in the figure, it behaves according to the truth table given in Figure A.22c. This table describes the required tri-state behavior. Figure A.22d shows a circuit implementation of the tri-state buffer. One NMOS and one PMOS transistor are connected in parallel to implement the switch, which is connected to the output of the driver. Because the two transistor types require complementary control signals at their gate inputs, an inverter is used as shown. When e = 0, both transistors are turned off, resulting in an open switch. When e = 1, both transistors are turned on, resulting in a closed switch.

Figure A.22 Tri-state buffer: (a) symbol; (b) equivalent circuit, in which the switch is closed when e = 1 and open when e = 0; (c) truth table:

e x | f
0 0 | Z
0 1 | Z
1 0 | 0
1 1 | 1

(d) implementation.

The driver circuit may be required to drive a number of inputs of other gates whose combined capacitance exceeds the drive capability of an ordinary logic gate circuit. To provide a sufficient drive capability, the driver circuit needs larger transistors. Hence, the two cascaded NOT gates that realize the driver are implemented with transistors of larger size than in regular logic gates.

The reader may wonder why it is necessary to use the PMOS transistor in the output switch, because from the logic function point of view the same behavior could be achieved using just the NMOS transistor. The reason is that these transistors have to “pass” the logic value generated by the driver circuit to the output f, and it turns out that NMOS transistors pass the logic value 0 well but the logic value 1 poorly, while PMOS transistors pass 1 well and 0 poorly. The parallel arrangement of NMOS and PMOS transistors passes both 1s and 0s well. For a more detailed discussion of this issue and tri-state buffers in general, the reader may consult Reference [1].

A.6 Flip-Flops

The majority of applications of digital logic require the storage of information. For example, a circuit that controls a combination lock must remember the sequence in which the digits are dialed in order to determine whether to open the lock. Another important example is the storage of programs and data in the memory of a digital computer.

The basic electronic element for storing binary information is called a latch. Consider the two cross-coupled NOR gates in Figure A.23a. Let us examine this circuit, starting with the situation in which R = 1 and S = 0. Simple analysis shows that Qa = 0 and Qb = 1. Under this condition, both inputs to gate Ga are equal to 1. Thus, if R is changed to 0, no change will take place at the outputs Qa and Qb. If S is set to 1 with R equal to 0, Qa and Qb will become 1 and 0, respectively, and will remain in this state after S is returned to 0. Hence, this logic circuit constitutes a memory element, or a latch, that remembers which of the two inputs S and R was most recently equal to 1. A truth table for this latch is given in Figure A.23b. Some typical waveforms that characterize the latch are shown in Figure A.23c. The arrows in Figure A.23c indicate the cause-effect relationships among the signals. Note that when the R and S inputs change from 1 to 0 at the same time, the resulting state is undefined. In practice, the latch will assume one of its two stable states at random. The input valuation R = S = 1 is not used in most applications of such latches.
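The settling behavior just described can be simulated by iterating the two NOR-gate equations until the feedback loop stops changing (a behavioral sketch; the function names are ours):

```python
def nor(a, b):
    return 0 if (a or b) else 1

def settle(s, r, qa, qb):
    """Apply inputs S and R and iterate the cross-coupled NOR gates
    of Figure A.23a until the outputs stop changing."""
    while True:
        new_qa, new_qb = nor(r, qb), nor(s, qa)
        if (new_qa, new_qb) == (qa, qb):
            return qa, qb
        qa, qb = new_qa, new_qb

qa, qb = settle(0, 1, 0, 0)     # R = 1 resets the latch
assert (qa, qb) == (0, 1)
qa, qb = settle(0, 0, qa, qb)   # R returns to 0: the state is remembered
assert (qa, qb) == (0, 1)
qa, qb = settle(1, 0, qa, qb)   # S = 1 sets the latch
assert (qa, qb) == (1, 0)
```

A model like this cannot capture the random outcome when R and S fall from 1 to 0 simultaneously; that case depends on physical gate delays, which is exactly why it is avoided in practice.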

Because of the nature of the operation of the preceding circuit, the S and R lines are referred to as the set and reset inputs. Since the valuation R = S = 1 is normally not used,

the Qa and Qb outputs are usually labeled as Q and Q̄, respectively. However, Q̄ should be regarded merely as a symbol representing the second output of the latch rather than as the complement of Q, because the input valuation R = S = 1 yields Q = Q̄ = 0.

Figure A.23 A basic latch implemented with NOR gates: (a) the network of cross-coupled NOR gates Ga and Gb; (b) truth table:

S R | Qa Qb
0 0 | 0/1 1/0 (no change)
0 1 | 0 1
1 0 | 1 0
1 1 | 0 0

(c) timing diagram, with arrows indicating the cause-effect relationships among the signals.

A.6.1 Gated Latches

Many applications require that the time at which a latch is set or reset be controlled from an input other than R and S, called a clock input. The resulting configuration is called a gated SR latch. A logic circuit, truth table, characteristic waveforms, and a graphical symbol for such a latch are given in Figure A.24. When the clock, Clk, is equal to 1, signals S′ and R′ follow the inputs S and R, respectively. On the other hand, when Clk = 0, signals S′ and R′ are equal to 0, and no change in the state of the latch can take place.

So far we have used truth tables to describe the behavior of logic circuits. A truth table gives the output of a network for various input valuations. Logic circuits whose outputs are

uniquely defined for each input valuation are referred to as combinational circuits. This is the class of circuits discussed in Sections A.1 to A.4. When memory elements are present, a different class of circuits is obtained. The output of such circuits is a function not only of the present valuation of the input variables but also of their previous behavior. An example of this is shown in Figure A.24. Circuits of this type are called sequential circuits.

Figure A.24 Gated SR latch: (a) the circuit, in which the clock Clk gates the inputs S and R to produce S′ and R′; (b) truth table:

Clk S R | Q(t + 1)
0   x x | Q(t) (no change)
1   0 0 | Q(t) (no change)
1   0 1 | 0
1   1 0 | 1
1   1 1 | x

(c) timing diagram; (d) graphical symbol.

Figure A.25 Gated SR latch implemented with NAND gates.

Because of the memory property, the truth table for the latch has to be modified to show the effect of its present state. Figure A.24b describes the behavior of the gated SR latch, where Q(t) denotes its present state. The transition to the next state, Q(t + 1), occurs following a clock pulse. Note that for the input valuation S = R = 1, Q(t + 1) is undefined for reasons discussed earlier.

The gated SR latch can be implemented using NAND gates as shown in Figure A.25. It is a useful exercise to show that this circuit is functionally equivalent to the circuit in Figure A.24a (see Problem A.20).

A second type of gated latch, called the gated D latch, is shown in Figure A.26. In this case, the two signals S and R are derived from a single input D. At a clock pulse, the Q output is set to 1 if D = 1 or is reset to 0 if D = 0. This means that the gated D latch samples the D input while the clock is high and stores that information until the next clock pulse arrives.
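The latch's transparent-while-high behavior can be captured in a few lines (a behavioral sketch; the class name is ours):

```python
class GatedDLatch:
    """Gated D latch of Figure A.26: transparent while Clk = 1,
    holds the stored value while Clk = 0."""

    def __init__(self):
        self.q = 0

    def update(self, clk, d):
        if clk:
            self.q = d      # Q follows D while the clock is high
        return self.q       # otherwise the stored value is retained

latch = GatedDLatch()
assert latch.update(clk=1, d=1) == 1   # clock high: Q tracks D
assert latch.update(clk=0, d=0) == 1   # clock low: the change in D is ignored
assert latch.update(clk=1, d=0) == 0   # next clock pulse stores the new value
```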

A.6.2 Master-Slave Flip-Flop

In the circuit of Figure A.24, we assumed that while Clk = 1, the inputs S and R do not change. Inspection of the circuit reveals that the outputs will respond immediately to any change in the S or R input during this time. Similarly, for the circuit of Figure A.26, Q = D while Clk = 1. This is undesirable in many cases, particularly in circuits involving counters and shift registers, which will be discussed later. In such circuits, immediate propagation of logic conditions from the data inputs (R, S, and D) to the latch outputs may lead to incorrect operation. The concept of a master-slave organization eliminates this problem. Two gated D latches can be connected to form a master-slave D flip-flop, as shown in Figure A.27a. The first, referred to as the master, is connected to the input line D when Clock = 1. A 1-to-0 transition of the clock isolates the master from the input and transfers the contents of the master stage to the slave stage. We can see that no direct path ever exists from the input D to the output Q.

It should be noted that while Clock = 1, the state of the master stage is immediately affected by changes in the input D. The function of the slave stage is to hold the value at the output of the flip-flop while the master stage is being set to the next-state value determined by the D input. The new state is transferred from the master to the slave after the 1-to-0

APPENDIX A • Logic Circuits

Figure A.26 Gated D latch. [(a) circuit; (b) truth table: Clk = 0 holds Q(t), Clk = 1 makes Q(t+1) = D; (c) graphical symbol; (d) timing diagram]

transition on Clock. At this point, the master stage is isolated from the inputs so that further changes in the D input will not affect this transfer. Examples of state transitions are shown in the form of a timing diagram in Figure A.27b.

The term flip-flop refers to a storage element that changes its output state at the edge of a controlling clock signal. In the above master-slave D flip-flop, the observable change takes place at the negative (1-to-0) edge of the clock. The change is observable when it reaches the Q terminal of the slave stage. Note that in the circuit in Figure A.27 we could have used the complement of Clock to control the master stage and the uncomplemented Clock to control the slave stage. In that case, the changes in the flip-flop output Q would occur at the positive edge of the clock.
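A minimal behavioral model (our sketch, not the book's circuit) makes the two-stage operation concrete: the master follows D while the clock is high, and the observable output is updated only when the clock falls.

```python
# Behavioral sketch of the master-slave D flip-flop of Figure A.27.
class MasterSlaveDFF:
    def __init__(self):
        self.qm = 0   # master stage
        self.qs = 0   # slave stage -- the observable output Q

    def apply(self, clock, d):
        if clock == 1:
            self.qm = d          # master is transparent while Clock = 1
        else:
            self.qs = self.qm    # slave copies the master on the 1-to-0 edge
        return self.qs

ff = MasterSlaveDFF()
ff.apply(1, 1)            # clock high: master stores 1, output unchanged
assert ff.qs == 0         # no direct path from D to Q
ff.apply(0, 0)            # negative edge: master's value reaches the slave
assert ff.qs == 1
```

Note that the value of D supplied after the clock falls (0 here) has no effect, which is precisely the isolation property described above.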


Figure A.27 Master-slave D flip-flop. [(a) circuit, with master output Qm and slave output Q = Qs; (b) timing diagram; (c) graphical symbol]

A graphical symbol for a flip-flop is given in Figure A.27c. We have used an arrowhead, instead of the label Clk, to denote the clock input to the flip-flop. This is a standard way of denoting that the positive edge of the clock causes changes in the state of the flip-flop. In our figure it is the negative edge that causes changes, so a small circle is used (in addition to the arrowhead) on the clock input.


A.6.3 Edge Triggering

A flip-flop is said to be edge triggered if data present at the input are transferred to the output only at a transition in the clock signal. The input and output are isolated from each other at all other times. The terms positive (leading) edge triggered and negative (trailing) edge triggered describe flip-flops in which data transfer takes place at the 0-to-1 and the 1-to-0 clock transitions, respectively. For proper operation, edge-triggered flip-flops require the triggering edge of the clock pulse to be well defined and to have a very short transition time. The master-slave flip-flop in Figure A.27 is negative-edge triggered.

A different implementation of a negative-edge-triggered D flip-flop is given in Figure A.28a. Let us consider the operation of this flip-flop. If Clk = 1, the outputs of gates 2 and 3 are both 0. Therefore, the flip-flop outputs Q and Q̄ maintain the current state of the flip-flop. It is easy to verify that during this period, points P3 and P4 immediately respond to changes at D. Point P3 is kept equal to D, and P4 is maintained equal to D̄. When Clk drops to 0, these values are transmitted to P1 and P2 by gates 2 and 3, respectively. Thus, the output latch, consisting of gates 5 and 6, acquires the new state to be stored.

We now verify that while Clk = 0, further changes at D do not change points P1 and P2. Consider two cases. First, suppose D = 0 at the negative edge of Clk. The 1 at P2 maintains an input of 1 at each of the gates 2 and 4, holding P1 and P2 at 0 and 1, respectively, independent of further changes in D. Second, suppose D = 1 at the negative edge of Clk. The 1 at P1 means that further changes at D cannot affect the output of gate 1, which is maintained at 0.

When Clk goes to 1 at the start of the next clock pulse, points P1 and P2 are again forced to 0, isolating the output from the remainder of the circuit. Points P3 and P4 then follow changes at D, as we have previously described.

An example of the operation of this type of D flip-flop is shown in Figure A.28b. The state acquired by the flip-flop upon the 1-to-0 transition of Clk is equal to the value on the D input immediately preceding this transition. However, there is a critical time period TCR around the negative edge of Clk during which the value on D should not change. This region is split into two parts, the setup time before the clock edge and the hold time after the clock edge, as shown in the figure. The timing diagram shows that the output Q changes slightly after the negative edge of the clock. This is the effect of the propagation delay through the NOR gates.

A.6.4 T Flip-Flop

The most commonly used flip-flops are D flip-flops because they are useful for temporary storage of data. However, there are applications for which other types of flip-flops are convenient. Counter circuits, discussed in Section A.8, are implemented efficiently using T flip-flops. A T flip-flop changes its state every clock cycle if its input T is equal to 1. We say that it "toggles" its state.

Figure A.29 presents the T flip-flop. Its circuit is derived from a D flip-flop as shown in Figure A.29a. Its truth table, graphical symbol, and a timing diagram example are also given in the figure. Note that we have assumed a positive-edge-triggered flip-flop.


Figure A.28 A negative-edge-triggered D flip-flop. [(a) network, with internal points P1 through P4 and gates 1 through 6; (b) example of timing, showing the setup time and hold time that make up the critical region TCR around the negative clock edge]

A.6.5 JK Flip-Flop

Another flip-flop that is sometimes encountered in practice is the JK flip-flop, which combines the behaviors of the SR and T flip-flops. It is presented in Figure A.30. Its operation is defined by the truth table in Figure A.30b. The first three entries in this table define the


Figure A.29 T flip-flop. [(a) circuit; (b) truth table: T = 0 gives Q(t+1) = Q(t), T = 1 gives Q(t+1) = Q̄(t); (c) graphical symbol; (d) timing diagram]

same behavior as those in Figure A.24b (when Clk = 1), so that J and K correspond to S and R. For the input valuation J = K = 1, the next state is defined as the complement of the present state of the flip-flop. That is, when J = K = 1, the flip-flop functions as a toggle, reversing its present state.

A JK flip-flop can be implemented using a D flip-flop connected such that

D = JQ̄ + K̄Q

The corresponding circuit is shown in Figure A.30a.
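We can check (a sketch of our own, not from the book) that this expression reproduces the JK behavior: J = K = 0 holds the state, J = 1 with K = 0 sets it, K = 1 with J = 0 resets it, and J = K = 1 toggles it.

```python
# Verify that D = J*Q' + K'*Q gives the JK flip-flop's next-state function.
def jk_next(j, k, q):
    return (j & (1 - q)) | ((1 - k) & q)

for q in (0, 1):                       # for both present states
    assert jk_next(0, 0, q) == q       # hold
    assert jk_next(1, 0, q) == 1       # set
    assert jk_next(0, 1, q) == 0       # reset
    assert jk_next(1, 1, q) == 1 - q   # toggle
```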


Figure A.30 JK flip-flop. [(a) circuit; (b) truth table: JK = 00 gives Q(t+1) = Q(t), JK = 01 gives 0, JK = 10 gives 1, JK = 11 gives Q̄(t); (c) graphical symbol]

The JK flip-flop is versatile. It can be used to store data, just like the D flip-flop. It can also be used to build counters, because it behaves like the T flip-flop if its J and K input terminals are connected together.

A.6.6 Flip-Flops with Preset and Clear

The next state of a flip-flop is determined by its present state and the logic values on its input terminals. Sometimes it is desirable to force a flip-flop into a particular state, either 0 or 1, regardless of its present state and the values of the normal inputs. For example, when a computer is powered on, it is necessary to place all flip-flops into a known state. Usually, this means resetting their outputs to state 0. In some cases it is desirable to preset some flip-flops into state 1.

Figure A.31 illustrates how Preset and Clear control inputs can be added to a master-slave D flip-flop to force the flip-flop into state 1 or 0 independent of the D input and the clock. These inputs are active low, as indicated by the overbars and bubbles in the figure. When both the Preset and Clear inputs are equal to 1, the flip-flop is controlled by the clock and D input in the normal way. When Preset = 0, the flip-flop is forced to the 1 state, and


Figure A.31 Master-slave D flip-flop with Preset and Clear. [(a) circuit; (b) graphical symbol, with active-low Preset and Clear inputs]

when Clear = 0, the flip-flop is forced to the 0 state. Preset and clear controls are also often incorporated in the other flip-flop types.

A.7 Registers and Shift Registers

An individual flip-flop can be used to store one bit. However, in machines in which data are handled in words consisting of many bits (perhaps as many as 64), it is convenient to arrange a number of flip-flops into a common structure called a register. The operation of all flip-flops in a register is synchronized by a common clock. Thus, data are written (loaded) into or read from all flip-flops at the same time.

Processing of digital data often requires the capability to shift and rotate the data, so it is necessary to provide the hardware with this facility. A simple mechanism for realizing both operations is a register whose contents may be shifted to the right or left one bit position


Figure A.32 A simple shift register. [Four D flip-flops F1 through F4 connected in series, with serial input In, serial output Out, and a common Clock]

at a time. As an example, consider the 4-bit shift register in Figure A.32. It consists of D flip-flops connected so that each clock pulse will cause the transfer of the contents (state) of Fi to Fi+1, effecting a "right shift." Data are shifted serially into and out of the register. A rotation of the data can be implemented by connecting Out to In.

Proper operation of a shift register requires that its contents be shifted exactly one position for each clock pulse. This places a constraint on the type of storage elements that can be used. Gated latches, depicted in Figure A.26, are not suitable for this purpose. While the clock is high, the value on the D input quickly propagates to the output. From there, the value propagates through the next gated latch in the same manner. Hence, there is no control over the number of shifts that will take place during a single clock pulse. This number depends on the propagation delays of the gated latches and the duration of the clock pulse. The solution to the problem is to use either master-slave or edge-triggered flip-flops.
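With edge-triggered flip-flops, every stage captures the *old* value of the stage to its left on the same clock edge, so the contents move exactly one position per pulse. A behavioral sketch of our own:

```python
# Behavioral sketch of the 4-bit shift register of Figure A.32.
def shift_right(stages, serial_in):
    """One clock pulse: all stages update simultaneously.
    Returns (new_stages, serial_out)."""
    return [serial_in] + stages[:-1], stages[-1]

reg = [0, 0, 0, 0]                # F1..F4, initially cleared
for bit in (1, 0, 1, 1):          # shift a serial pattern in
    reg, out = shift_right(reg, bit)
assert reg == [1, 1, 0, 1]        # contents of F1..F4 after four pulses
```

Building the new list from the old one before assigning it models the simultaneous update that master-slave or edge-triggered flip-flops provide; updating the stages one by one in place would reproduce exactly the uncontrolled ripple-through failure of gated latches described above.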

A particularly useful form of shift register is one that can be loaded and read in parallel. This can be accomplished with some additional gating, as illustrated in Figure A.33, which shows a 4-bit register constructed with D flip-flops. The register can be loaded either serially or in parallel. When the register is clocked, a shift takes place if Shift/Load = 0; otherwise, a parallel load is performed.

A.8 Counters

In the preceding section, we discussed the applicability of flip-flops in the construction of shift registers. They are equally useful in the implementation of counter circuits. It is hardly necessary to justify the need for counters in digital machines. In addition to being hardware mechanisms for realizing ordinary counting functions, counters are also used to generate control and timing signals. A counter driven by a high-frequency clock can be used to produce signals whose frequencies are submultiples of the original clock frequency. In such applications a counter is said to be functioning as a scaler.

A simple three-stage (or 3-bit) counter constructed with T flip-flops is shown in Figure A.34. Recall that when the T input is equal to 1, the flip-flop acts as a toggle; that is, its state changes with each successive clock pulse. Thus, two clock pulses will cause Q0 to change from the 1 state to the 0 state and back to the 1 state, or from 0 to 1 to 0. This means


Figure A.33 Parallel-access shift register. [Four D flip-flops F1 through F4 with serial input, parallel inputs and outputs, a common Clock, and a Shift/Load control]

that the output waveform of Q0 has half the frequency of the clock. Similarly, because the second flip-flop is driven by Q0, the waveform at Q1 has half the frequency of Q0, or one-fourth the frequency of the clock. Note that we have assumed that the positive edge of the clock input to each flip-flop triggers the change of its state.

Such a counter is often called a ripple counter because the effect of an input clock pulse ripples through the counter. For example, the positive edge of pulse 4 will change the state of Q0 from 1 to 0. This change in Q0 will then force Q1 from 1 to 0, which in turn forces Q2 from 0 to 1. If each flip-flop introduces some delay Δ, then the delay in setting Q2 is 3Δ. Such delays can be a problem when very fast operation of counter circuits is required. In many applications, however, these delays are small in comparison with the clock period and can be neglected.
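The ripple behavior can be sketched in software (our behavioral model, ignoring the per-stage delay): each stage toggles, and the toggle propagates to the next stage only when this stage falls from 1 to 0.

```python
# Behavioral sketch of the 3-bit ripple counter of Figure A.34.
def ripple_pulse(q):
    """Apply one clock pulse to the counter. q = [Q0, Q1, Q2]."""
    q = list(q)
    for i in range(3):
        q[i] ^= 1            # this stage toggles
        if q[i] == 1:        # no 1-to-0 edge here, so the ripple stops
            break
    return q

q = [0, 0, 0]
counts = []
for _ in range(9):
    q = ripple_pulse(q)
    counts.append(q[0] + 2 * q[1] + 4 * q[2])
assert counts == [1, 2, 3, 4, 5, 6, 7, 0, 1]   # counts up and wraps mod 8
```

Q0 changes on every pulse, Q1 on every second pulse, and Q2 on every fourth, which is the frequency-division (scaler) behavior described above.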

With the addition of some extra logic gates, it is possible to construct a "synchronous" counter in which each stage is under the control of the common clock, so that all flip-flops can change their states simultaneously. Such counters are capable of operating at higher speed because the total propagation delay is reduced considerably. In contrast, the counter in Figure A.34 is said to be "asynchronous."


Figure A.34 A 3-bit up-counter. [(a) circuit: three T flip-flops with outputs Q0, Q1, Q2 and T inputs tied to 1; (b) timing diagram showing the count sequence 0, 1, 2, 3, 4, 5, 6, 7, 0]

A.9 Decoders

Much of the information in computers is handled in a highly encoded form. In an instruction, an n-bit field may be used to denote 1 out of 2^n possible choices for the action to be taken. To perform the desired action, the encoded instruction must first be decoded. A circuit capable of accepting an n-variable input and generating the corresponding output signal on one out of 2^n output lines is called a decoder. A simple example of a two-input to four-output decoder is given in Figure A.35. One of the four output lines is selected by the inputs x1 and x2, as indicated in the figure. The selected output has the logic value 1, and the remaining outputs have the value 0.
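A behavioral sketch of our own for the two-input case: the pair x1 x2, read as a 2-bit number, picks which of the four output lines is driven to 1.

```python
# Behavioral sketch of the two-input to four-output decoder of Figure A.35.
def decode_2_to_4(x1, x2):
    index = 2 * x1 + x2                             # x1 x2 as a binary number
    return [1 if i == index else 0 for i in range(4)]

assert decode_2_to_4(0, 0) == [1, 0, 0, 0]   # x1 x2 = 00 selects output 0
assert decode_2_to_4(1, 0) == [0, 0, 1, 0]   # x1 x2 = 10 selects output 2
assert sum(decode_2_to_4(1, 1)) == 1         # exactly one output is active
```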

Other useful types of decoders exist. For example, using information in BCD form often requires decoding circuits in which a four-variable BCD input is used to select 1 out of 10 possible outputs. As another specific example, let us consider a decoder suitable for driving a seven-segment display. Figure A.36 shows the structure of a seven-segment


Figure A.35 A two-input to four-output decoder. [Logic circuit and truth table: x1 x2 = 00, 01, 10, 11 activate outputs 0, 1, 2, 3, respectively]

element used for display purposes. We can easily see that any decimal number from zero to nine can be displayed with this element simply by turning some segments on (light) while leaving others off (dark). The necessary functions are indicated in the table. They can be realized using the decoding circuit shown in the figure. Note that the circuit is constructed with NAND gates. We encourage the reader to verify that the circuit implements the required functions.

A.10 Multiplexers

In the preceding section, we saw that decoders select one output line on the basis of input signals. The selected output line has logic value 1, while the other outputs have the value 0. Another class of very useful selector circuits exists in which any one of n data inputs can be selected to appear as the output. The choice is governed by a set of "select" inputs. Such circuits are called multiplexers. An example of a multiplexer circuit is shown in Figure A.37. It has two select inputs, w1 and w2. Their four possible valuations are used to select one of four inputs, x1, x2, x3, or x4, to appear as the output z. A simple logic circuit that can implement the required operation is also given. Obviously, the same structure can be used to realize larger multiplexers, in which k select inputs are used to connect one of the 2^k data inputs to the output.
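The selection rule can be sketched directly (our illustration): the select inputs w1 w2, read as a 2-bit number, pick which data input reaches z.

```python
# Behavioral sketch of the four-input multiplexer of Figure A.37.
def mux4(w1, w2, x1, x2, x3, x4):
    return (x1, x2, x3, x4)[2 * w1 + w2]   # w1 w2 = 00..11 selects x1..x4

assert mux4(0, 0, 1, 0, 0, 0) == 1    # w1 w2 = 00 selects x1
assert mux4(1, 1, 0, 0, 0, 1) == 1    # w1 w2 = 11 selects x4
```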

The obvious application of multiplexers is in the gating of data that may come from a number of different sources. For example, loading a 16-bit data register from one of four distinct sources can be accomplished with sixteen 4-input multiplexers.


Figure A.36 A BCD to seven-segment display decoder. [Seven-segment element with segments a through g, a truth table listing the segment values for digits 0 through 9, and the NAND-gate decoding circuit with inputs x1 through x4]


Figure A.37 A four-input multiplexer. [Graphical symbol, logic circuit, and truth table: select inputs w1 w2 = 00, 01, 10, 11 connect data inputs x1, x2, x3, x4, respectively, to the output z]

Multiplexers are also very useful as basic elements for implementing logic functions. Consider a function f defined by the truth table of Figure A.38. It can be represented as shown in the figure by factoring out the variables x1 and x2. Note that for each valuation of x1 and x2, the function f corresponds to one of four terms: 0, 1, x3, or x̄3. This suggests the possibility of using a four-input multiplexer circuit, in which x1 and x2 are the two select inputs that choose one of the four data inputs. Then, if the data inputs are connected to 0, 1, x3, or x̄3 as required by the truth table, the output of the multiplexer will correspond to the function f. The approach is completely general. Any function of three variables can be realized with a single four-input multiplexer. Similarly, any function of four variables can be implemented with an eight-input multiplexer, and so on.
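To illustrate the technique with a concrete example of our own, in the spirit of Figure A.38: factoring a three-variable function f by x1 and x2 leaves one of the residues 0, x3, 1, or x̄3 per row, and those residues become the multiplexer's data inputs.

```python
# Implementing a 3-variable function with a 4-input multiplexer (our example).
def mux4(w1, w2, d0, d1, d2, d3):
    return (d0, d1, d2, d3)[2 * w1 + w2]

def f(x1, x2, x3):
    # data inputs 0, x3, 1, x3' read off the factored truth table
    return mux4(x1, x2, 0, x3, 1, 1 - x3)

# The mux realizes the full 8-row truth table of f.
truth = [f(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
assert truth == [0, 0, 0, 1, 1, 1, 1, 0]
```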


Figure A.38 Multiplexer implementation of a logic function. [Truth table and factored form:

x1 x2 x3 | f
 0  0  0 | 0
 0  0  1 | 0      x1 x2 = 00: f = 0
 0  1  0 | 0
 0  1  1 | 1      x1 x2 = 01: f = x3
 1  0  0 | 1
 1  0  1 | 1      x1 x2 = 10: f = 1
 1  1  0 | 1
 1  1  1 | 0      x1 x2 = 11: f = x̄3

The four factored terms drive the data inputs of a MUX whose select inputs are x1 and x2.]

A.11 Programmable Logic Devices (PLDs)

In previous sections we showed how logic circuits can be implemented using gates and flip-flops. In this section we consider devices that can be used to implement circuits of various types simply by programming them to perform the desired functions. They are called programmable logic devices (PLDs).

A.11.1 Programmable Logic Array (PLA)

Any combinational logic function can be implemented in sum-of-products form, as explained in Sections A.2 and A.3. A generalized circuit that can implement a variety of combinational functions may be organized as shown in Figure A.39. It has n input variables (x1, . . . , xn) and m output functions (f1, . . . , fm). Each function fi is realized as a sum of product terms that involve the input variables. The variables x1, . . . , xn are presented in true and complemented form to the AND array, where up to k product terms are formed. These are then gated into the OR array, where the output functions are formed. To make this circuit customizable by a user, it is possible to use programmable connections to the AND and OR arrays.

A circuit in which the connections to both the AND and the OR arrays can be programmed is called a programmable logic array (PLA). Figure A.40 illustrates the functional structure


Figure A.39 A block diagram for a PLD. [Input buffers and inverters (x1 . . . xn into I1 . . . I2n), an AND array forming product terms P1 . . . Pk, an OR array forming O1 . . . Om, and output buffers producing f1 . . . fm]

of a PLA using a simple example. The programmable connections must be such that if no connection is made to a given input of an AND gate, the input behaves as if a logic value of 1 is driving it (that is, this input does not contribute to the product term realized by this gate). Similarly, if no connection is made to a given input of an OR gate, this input must have no effect on the output of the gate (that is, the input must behave as if a logic value of 0 is driving it).

Programmed connections may be realized in different ways. In one method, programming consists of blowing fuses in positions where connections are not required. This is done by applying higher-than-normal current. Another possibility is to use transistor switches controlled by erasable memory elements (see Section 8.3.3 on EPROM memory circuits) to provide the connections as desired. This allows the PLA to be reprogrammable.

The simple PLA in Figure A.40 can generate up to four product terms from three input variables. Two output functions may be implemented using these product terms. Some of the product terms may be used in more than one output function. The PLA is configured to realize the following two functions:

f1 = x1x2 + x1x̄3 + x̄1x̄2x3

f2 = x1x2 + x̄1x̄2x3 + x1x3

Only four product terms are needed, because two terms can be shared by both functions.

Although Figure A.40 clearly depicts the basic functionality of a PLA, this style of presentation is awkward for describing a larger PLA. It is customary in technical literature to represent the product and sum terms by means of corresponding gate symbols that have only one symbolic input line. An × is placed on this line to represent each programmed connection. This drawing convention is used in Figure A.41 to represent the PLA example from Figure A.40. A programmable connection can be made at any crossing of a vertical line and a horizontal line in the diagram, to implement different functions of the input variables.
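A small behavioral model of our own (not the book's circuit) captures the two-plane structure: each product term records which inputs it is connected to, in true or complemented form, and each output ORs a programmed subset of the product terms. The programming below is a hypothetical example, not the configuration of Figure A.40.

```python
# Behavioral sketch of a PLA with programmable AND and OR planes.
def pla(x, and_plane, or_plane):
    # and_plane: per product term, a list of (input_index, required_value)
    # pairs; an unprogrammed input is simply absent (it acts as a 1).
    terms = [all(x[i] == v for i, v in term) for term in and_plane]
    # or_plane: per output, the indices of the product terms it sums;
    # an unprogrammed term contributes 0.
    return [int(any(terms[t] for t in out)) for out in or_plane]

# Hypothetical programming: P1 = x1*x2, P2 = x1*x3' (x indexed from 0),
# with f1 = P1 + P2 and f2 = P1, so P1 is shared by both outputs.
and_plane = [[(0, 1), (1, 1)], [(0, 1), (2, 0)]]
or_plane = [[0, 1], [0]]
assert pla([1, 0, 0], and_plane, or_plane) == [1, 0]   # only P2 is true
```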


Figure A.40 Functional structure of a PLA. [Inputs x1, x2, x3 drive buffers/inverters I1 through I6; the AND plane forms product terms P1 through P4; the OR plane forms outputs f1 and f2; programmable connections join the two planes]

A.11.2 Programmable Array Logic (PAL)

In a PLA, the inputs to both the AND array and the OR array are programmable. A similar circuit, in which the inputs to the AND array are programmable but the connections to the OR gates are fixed, provides enough flexibility for practical applications. Such circuits are known as programmable array logic (PAL) circuits.

Figure A.42 shows a simple example of a PAL that can implement two functions. The number of AND gates connected to each OR gate in a PAL determines the maximum number of product terms that can be realized in a sum-of-products representation of a given function. The AND gates are permanently connected to specific OR gates, which means that a particular product term cannot be shared among output functions.

The versatility of a PAL circuit may be enhanced further by including flip-flops in the outputs from the OR gates. Figure A.43 indicates the kind of flexibility that can be provided.


Figure A.41 A simplified sketch of the PLA in Figure A.40. [Inputs x1, x2, x3 through I1 to I6, product terms P1 through P4, and outputs f1 and f2, with × marks denoting programmed connections]

A multiplexer is used to choose whether a true, complemented, or stored (from the previous clock cycle) value of f is to be presented at the output terminal. The select inputs to the multiplexer are provided as programmable connections.

A.11.3 Complex Programmable Logic Devices (CPLDs)

The PAL structure has been used within larger devices known as complex programmable logic devices (CPLDs). These devices comprise a number of PAL-like blocks and programmable interconnection wires. Figure A.44 indicates the organization of a CPLD chip. Each PAL-like block is connected to a number of input/output pins. Connections between PAL-like blocks are established by programming the switches associated with the interconnection wires.

The interconnection resource consists of horizontal and vertical wires. Each horizontal wire can be connected to some of the vertical wires by programming the corresponding switches. It is impractical to provide full connectivity, where each horizontal wire can be


Figure A.42 An example of a PAL. [A PAL with inputs x1, x2, x3 realizing two functions, f1 and f2, each as a fixed OR of its own programmed product terms]

Figure A.43 Inclusion of a flip-flop in a PAL element. [The OR-gate output f feeds a D flip-flop and a multiplexer; the multiplexer's programmable select inputs choose the true, complemented, or stored value of f as the output]


Figure A.44 Organization of a complex programmable logic device (CPLD). [Four PAL-like blocks, each attached to an I/O block, joined by horizontal and vertical interconnection wires with programmable switches]

connected to any of the vertical wires, because the number of required switches would be large. Satisfactory connectivity can be achieved with a much smaller number of switches.

Commercial CPLDs come in different sizes, ranging from 2 to more than 100 PAL-like blocks. A CPLD chip is programmed by loading the programming information into it as a serial stream of bits via a JTAG port. This is a 4-pin port that conforms to an IEEE standard developed by the Joint Test Action Group.

A.12 Field-Programmable Gate Arrays

The most versatile programmable logic devices are known as field-programmable gate arrays (FPGAs). Figure A.45 shows a conceptual block diagram of an FPGA. It consists of an array of logic blocks (indicated as black boxes) that can be connected by general interconnection resources. The interconnect, shown in blue, consists of wire segments and programmable switches. The switches are used to connect the logic blocks to the wire segments and to establish connections between different wire segments as desired. This allows a large degree of routing flexibility on the chip. Input and output buffers are provided for access to the pins of the chip.

There are a variety of designs for the logic blocks and the interconnect structure. A logic block may be just a simple multiplexer-based circuit capable of implementing logic functions as discussed in Section A.10. Another popular design uses a simple lookup table (LUT) as a logic block. For example, a four-input LUT can be implemented in the form of a 16-bit memory circuit in which the truth table of a logic function is stored. Each memory


Figure A.45 A conceptual block diagram of an FPGA. [An array of logic blocks joined by interconnection switches and wire segments, surrounded by I/O blocks]

bit corresponds to one combination of true or complemented values of the input variables. Such a lookup table can be programmed to implement any function of four variables. The logic blocks may contain flip-flops to provide additional flexibility of the type encountered in Figure A.43.
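The lookup-table idea is easy to model (a sketch of our own, not a vendor's architecture): 16 stored configuration bits, indexed by the four inputs, realize any four-variable function. The majority function used here is a hypothetical choice for illustration.

```python
# Behavioral sketch of a four-input LUT: a 16-bit memory indexed by inputs.
def make_lut4(func):
    """'Program' the 16 configuration bits from a truth-table function."""
    bits = [func(a, b, c, d) for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1)]
    return lambda a, b, c, d: bits[8 * a + 4 * b + 2 * c + d]

# Program the LUT with an arbitrary example function: majority of a, b, c.
lut = make_lut4(lambda a, b, c, d: int(a + b + c >= 2))
assert lut(1, 1, 0, 0) == 1
assert lut(1, 0, 0, 1) == 0
```

After programming, evaluation is a single memory read; the function itself plays no further role, which is why one LUT design serves every four-variable function.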

In addition to the logic blocks, many FPGA chips include a substantial number of memory cells (not shown in Figure A.45), which may be used to implement structures such as first-in first-out (FIFO) queues or RAM and ROM components in system-on-a-chip applications, which are discussed in Chapter 11.

FPGAs are available in a wide range of sizes. The largest devices contain billions of transistors and can be used to implement very large circuits. The growing popularity of FPGAs is due to the fact that they allow a designer to implement very complex logic circuits and large digital systems on a single chip without having to design and fabricate a custom VLSI chip, which is both expensive and time-consuming. Using CAD tools, it is possible to generate an FPGA design in a matter of days, rather than the months needed to produce a custom-designed VLSI chip. For an introductory discussion of designing circuits using FPGA devices and CAD tools, the reader may consult Reference [1].


A.13 Sequential Circuits

A combinational circuit is one whose output is determined entirely by its present inputs. Examples of such circuits are the decoders and multiplexers presented in Sections A.9 and A.10. A different class of circuits are those whose outputs depend both on the present inputs and on the sequence of previous inputs. They are called sequential circuits. Such circuits can be in different states, depending on what the sequence of inputs has been up to a given time. The state of a circuit determines its behavior when various input patterns are applied to the circuit. We encountered two specific forms of such circuits, shift registers and counters, in Sections A.7 and A.8. In this section, we introduce a general form for sequential circuits and give a brief introduction to the design of these circuits.

A.13.1 Design of an Up/Down Counter as a Sequential Circuit

Figure A.34 shows the configuration of an up-counter, implemented with three T flip-flops, which counts in the sequence 0, 1, 2, . . . , 7, 0, . . . . A similar circuit can be used to count in the down direction, that is, 0, 7, 6, . . . , 1, 0, . . . (see Problem A.26). These simple circuits are made possible by the toggle feature of T flip-flops.

We now consider the possibility of implementing such counters with D flip-flops. As a specific example, we will design a counter that counts either up or down, depending on the value of an external control input. To keep the example small, let us restrict the size to a mod-4 counter, which requires only two state bits to represent the four possible count values. We will show how this counter can be designed using general techniques for the synthesis of sequential circuits. The desired circuit will count up if an input signal x is equal to 0 and down if x is 1. Using D flip-flops of the type presented in Figures A.27 and A.28, the count will change on the negative edge of the clock signal. Let us assume that we are particularly interested in the state when the count is equal to 2. Thus, an output signal, z, should be asserted when the count is equal to 2; otherwise z = 0.

The desired counter can be implemented as a sequential circuit. In order to determine what the new count will be when a clock pulse is applied, it is sufficient to know the value of x and the present count. It is not necessary to know what the actual sequence of previous input values was, as long as we know the present count that has been reached. This count value is said to determine the present state of the circuit, which is all that the circuit remembers about previous input values. If the present count is 2 and x = 0, the next count will be 3. It makes no difference whether the count of 2 was reached by counting down from 3 or up from 1.

Before we show a circuit implementation, let us depict the desired behavior of the counter by means of a state diagram. The counter has four distinct states: S0, S1, S2, and S3. A state diagram is a graph in which states are represented as circles (sometimes called nodes). Transitions between states are indicated by labeled arrows. The label associated with an arrow specifies the value of the input x that will cause this particular transition to occur. We will design a circuit in which the value of the output is determined by the present state of the counter. Figure A.46 shows the state diagram of our up/down counter. For


A.13 Sequential Circuits 517

[State diagram: states S0/0, S1/0, S2/1, and S3/0; arrows labeled x = 0 form the up-counting cycle S0 → S1 → S2 → S3 → S0, and arrows labeled x = 1 form the down-counting cycle S0 → S3 → S2 → S1 → S0]

Figure A.46 State diagram of a mod-4 up/down counter that detects the count of 2.

Present     Next state          Output
state       x = 0     x = 1     z

S0          S1        S3        0
S1          S2        S0        0
S2          S3        S1        1
S3          S0        S2        0

Figure A.47 State table for the example of the up/down counter.

example, the arrow emanating from state S1 (count = 1) for an input x = 0 points to state S2, thus specifying the transition to state S2. An arrow from S2 to S3 specifies that when x = 0 the next clock pulse will cause a transition from S2 to S3. The output z should be 1 while the circuit is in state S2, and it should be 0 in states S0, S1, and S3. This is indicated inside each circle.

Note that the state diagram describes the functional behavior of the counter without any reference to how it is implemented. Figure A.46 can be used to describe an electronic digital circuit, a mechanical counter, or a computer program that behaves in this way. Such diagrams are a powerful means of describing any system that exhibits sequential behavior.

A different way of presenting the information in a state diagram is to use a state table. Figure A.47 gives the state table for the example in Figure A.46. The table indicates


Present     Next state              Output
state       x = 0       x = 1      z
y2 y1       Y2 Y1       Y2 Y1

0  0        0  1        1  1       0
0  1        1  0        0  0       0
1  0        1  1        0  1       1
1  1        0  0        1  0       0

Figure A.48 State assignment for the example in Figure A.47.

transitions from all present states to the next states, as required by the applied input x. It also shows the value of the output signal, z, in each state.

Having specified the desired up/down counter in general terms, we will now consider its implementation. Two bits are needed to encode the four states that indicate the count. Let these bits be y2 (high-order) and y1 (low-order). The states of the counter are determined by the values of y2 and y1, which we will write in the form y2y1. We will assign values to y2y1 for each of the four states as follows: S0 = 00, S1 = 01, S2 = 10, and S3 = 11. We have chosen the assignment such that the binary number y2y1 represents the count in an obvious way. The variables y2 and y1 are called the state variables of the sequential circuit. Using this state assignment, the state table for our example is as shown in Figure A.48. Note that we are using the variables Y2 and Y1 to denote the next state in the same manner as y2 and y1 are used to represent the present state.

It is important to note that we could have chosen a different assignment of y2y1 values to the various states. For example, a possible state assignment is: S0 = 10, S1 = 11, S2 = 01, and S3 = 00. For a counter circuit, this assignment is less intuitive than the one in Figure A.48, but the resultant circuit will work properly. Different state assignments usually lead to different costs in implementing the circuit (see Problem A.30).
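The behavior specified by the state table in Figure A.47 can be tried out in software before any gates are chosen. The following sketch is a hypothetical illustration (not part of the book's design flow): it drives a table-driven model of the mod-4 up/down counter through a few clock periods and reports the state and output z in each period.

```python
# Table-driven model of the mod-4 up/down counter of Figure A.47.
# next_state[s][x] gives the state entered at the next clock edge;
# output[s] is the output z, asserted only in state S2.

next_state = {
    "S0": {0: "S1", 1: "S3"},
    "S1": {0: "S2", 1: "S0"},
    "S2": {0: "S3", 1: "S1"},
    "S3": {0: "S0", 1: "S2"},
}
output = {"S0": 0, "S1": 0, "S2": 1, "S3": 0}

def run(start, inputs):
    """Return the list of (state, z) pairs visited, starting in `start`."""
    trace, state = [], start
    for x in inputs:
        trace.append((state, output[state]))
        state = next_state[state][x]
    trace.append((state, output[state]))
    return trace

# Count up for three clock periods (x = 0), then down for two (x = 1).
print(run("S0", [0, 0, 0, 1, 1]))
```

Because the model is driven directly by the state table, it makes no reference to flip-flops or gates; it simply confirms that the tabulated transitions produce the intended up/down counting sequence.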

Our intention in this example is to use D flip-flops to store the values of the two state variables between successive clock pulses. The output, Q, of a flip-flop is the present-state variable yi, and the input, D, is the next-state variable Yi. Note that Yi is a function of y2, y1, and x, as indicated in Figure A.48. From the figure, we see that

$Y_2 = \overline{y}_2 y_1 \overline{x} + y_2 \overline{y}_1 \overline{x} + \overline{y}_2 \overline{y}_1 x + y_2 y_1 x = y_2 \oplus y_1 \oplus x$

$Y_1 = \overline{y}_2 \overline{y}_1 \overline{x} + y_2 \overline{y}_1 \overline{x} + \overline{y}_2 \overline{y}_1 x + y_2 \overline{y}_1 x = \overline{y}_1$


[Logic circuit: two D flip-flops hold y1 and y2; XOR gates form $Y_2 = y_2 \oplus y_1 \oplus x$, the complemented output of the y1 flip-flop supplies $Y_1 = \overline{y}_1$, and an AND gate forms z]

Figure A.49 Implementation of the up/down counter.

The output z is determined as

$z = y_2 \overline{y}_1$

These expressions lead to the circuit shown in Figure A.49.
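The derived expressions can also be verified exhaustively against the intended counting behavior. The check below is an illustrative sketch (not from the book): it encodes each state as the number y2y1 and confirms that Y2 = y2 ⊕ y1 ⊕ x, Y1 equal to the complement of y1, and z = y2 AND (complement of y1) reproduce count + 1 for x = 0 and count − 1 for x = 1 in every one of the eight input combinations.

```python
# Exhaustive check of the next-state and output expressions for the
# mod-4 up/down counter, using the state assignment count = 2*y2 + y1.
from itertools import product

for y2, y1, x in product((0, 1), repeat=3):
    Y2 = y2 ^ y1 ^ x          # derived expression for the high-order bit
    Y1 = 1 - y1               # derived expression: Y1 is the complement of y1
    z = y2 & (1 - y1)         # output asserted only in state 10 (count = 2)

    count = 2 * y2 + y1
    expected = (count + 1) % 4 if x == 0 else (count - 1) % 4
    assert 2 * Y2 + Y1 == expected
    assert z == (1 if count == 2 else 0)

print("all 8 cases agree")
```

An exhaustive check of this kind is feasible whenever the number of state variables and inputs is small, which is exactly the situation in hand synthesis examples.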

A.13.2 Timing Diagrams

To fully understand the operation of the counter circuit, it is useful to consider its timing diagram. Figure A.50 gives an example of a possible sequence of events. It assumes that state transitions (changes in flip-flop values) occur on the negative edge of the clock and that the counter starts in state S0. Since x = 0, the counter advances to state S1 at t0, then to S2 at t1, and to S3 at t2. The output changes from 0 to 1 when the counter enters state S2. It goes back to 0 when state S3 is reached. At t3, the counter goes to S0. We have assumed that at this time the input x changes to 1, causing the counter to count in the down sequence. When the count again reaches S2, at t5, the output z goes to 1.

Note that all signal changes occur just after the negative edge of the clock, and signals do not change again until the negative edge of the next clock pulse. The delay from the clock edge to the time at which the variables yi change is the propagation delay of the flip-flops used to implement the counter circuit. It is important to note that the input x is also assumed to be controlled by the same clock, and it changes only near the beginning of a clock period.


[Timing diagram: waveforms for Clock, x, z, y1, and y2 over clock edges t0 through t7; the state sequence is S0, S1, S2, S3, S0, S3, S2, S1, S0]

Figure A.50 Timing diagram for the circuit in Figure A.49.

These are essential features of circuits where all changes are controlled by a clock. Such circuits are called synchronous sequential circuits.

Another important observation concerns the relationship between the labels used in the state diagram in Figure A.46 and the timing diagram. For example, consider the clock period between t1 and t2. During this clock period, the machine is in state S2 and the input value is x = 0. This situation is described in the state diagram by the arrow emanating from state S2 labeled x = 0. Since this arrow points to state S3, the timing diagram shows y2 and y1 changing to the values corresponding to state S3 at the next clock edge, t2.

A.13.3 The Finite State Machine Model

The specific example of the up/down counter implemented as a synchronous sequential circuit with flip-flops and combinational logic gates, as shown in Figure A.49, is easily generalized to the formal finite state machine model given in Figure A.51. In this model, the time delay through the delay elements is equal to the duration of the clock cycle. This is the time that elapses between changes in Yi and the corresponding changes in yi. The model assumes that the combinational logic block has no delay; hence, the outputs z, Y1, and Y2 are instantaneous functions of the inputs x, y1, and y2. In an actual circuit, some delay will be introduced by the flip-flops, as shown in Figure A.50. The circuit will work properly if the delay through the combinational logic block is short relative to the clock cycle. The next-state outputs Yi must be available in time to cause the flip-flops to change to the desired next state at the end of the clock cycle.


[Block diagram: the input x and the present-state variables y1 and y2 enter a combinational logic block, which produces the output z and the next-state variables Y1 and Y2; delay elements (flip-flops) feed Y1 and Y2 back as y1 and y2]

Figure A.51 A formal model of a finite state machine.

Inputs to the combinational logic block consist of the flip-flop outputs, yi, which represent the present state, and the external input, x. The outputs of the block are the inputs to the flip-flops, which we have called Yi, and the external output, z. When the active clock edge arrives marking the end of the present clock cycle, the values on the Yi lines are loaded into the flip-flops. They become the next set of values of the state variables, yi. Since these signals are connected to the input of the combinational block, they, along with the next value of the external input x, will produce new z and Yi values. A clock cycle later, the new Yi values are transferred to yi, and the process repeats. In other words, the flip-flops constitute a feedback path from the output to the input of the combinational block, introducing a delay of one clock period.

Although we have shown only one external input, one external output, and two state variables in Figure A.51, it is clear that multiple inputs, outputs, and state variables are possible.
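The model in Figure A.51 maps naturally onto a small software skeleton. In this hypothetical sketch, one function plays the role of the combinational block and a loop plays the role of the clocked delay elements, carrying the state from one clock period to the next; the particular combinational function shown is the up/down counter example.

```python
# A generic synchronous FSM in the style of Figure A.51: a combinational
# function computes (z, next_state) from (x, state); the loop variable
# `state` acts as the delay elements, updated once per "clock cycle".

def comb_logic(x, state):
    """Combinational block for the mod-4 up/down counter example."""
    z = 1 if state == 2 else 0                 # output in the present state
    nxt = (state + 1) % 4 if x == 0 else (state - 1) % 4
    return z, nxt

def clocked_run(comb, inputs, state=0):
    """Apply one input per clock cycle; return the output sequence."""
    outputs = []
    for x in inputs:
        z, state = comb(x, state)              # flip-flops load the next state
        outputs.append(z)
    return outputs

print(clocked_run(comb_logic, [0, 0, 0, 1, 1, 1]))
```

Any machine of the form in Figure A.51 can be simulated this way by substituting a different combinational function, with the state represented by whatever encoding is convenient.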

A.13.4 Synthesis of Finite State Machines

Let us summarize how to design a synchronous sequential circuit having the general organization in Figure A.51. The design, or synthesis, process involves the following steps:

1. Develop an appropriate state diagram or state table.

2. Determine the number of flip-flops needed, and choose a suitable type of flip-flop.

3. Determine the values to be stored in these flip-flops for each state in the state diagram. This is referred to as state assignment.


4. Develop the state-assigned state table.

5. Derive the next-state logic expressions needed to control the inputs of the flip-flops. Also, derive the expressions for the outputs of the circuit.

6. Use the derived expressions to implement the circuit.

Sequential circuits can easily be implemented with CPLDs and FPGAs because these devices contain flip-flops as well as combinational logic gates. Modern computer-aided design tools can be used to synthesize sequential circuits directly from a specification given in terms of a state diagram.

Our discussion of sequential circuits is based on the type of circuits that operate under the control of a clock. It is also possible to implement sequential circuits without using a clock. Such circuits are called asynchronous sequential circuits. Their design is not as straightforward as that of synchronous sequential circuits. For a complete treatment of both types of sequential circuits, consult one of the many books that specialize in logic design [1–7].

A.14 Concluding Remarks

The main purpose of this appendix is to acquaint the reader with the basic concepts in logic design and to provide an indication of the circuit configurations commonly used in the construction of computer systems. Familiarity with this material will lead to a much better understanding of the architectural concepts discussed in the main chapters of the book. As we have said in several places, the detailed design of logic circuits is done with the help of CAD tools. These tools take care of many details and can be used very effectively by a knowledgeable designer.

IC technology and CAD tools have revolutionized logic design. A variety of IC components are commercially available at ever-decreasing costs, and new developments and technological improvements are constantly occurring. In this appendix, we introduced some of the basic components that are useful in the design of digital systems.

Problems

A.1 [E] Implement the COINCIDENCE function in the sum-of-products form, where COINCIDENCE = $\overline{\text{XOR}}$.

A.2 [M] Prove the following identities by using algebraic manipulation and also by using truth tables.

(a) $a \oplus b \oplus c = \overline{a}\,\overline{b}c + \overline{a}b\overline{c} + a\overline{b}\,\overline{c} + abc$

(b) $x + w\overline{x} = x + w$

(c) $x_1x_2 + x_2x_3 + x_3\overline{x}_1 = x_1x_2 + x_3\overline{x}_1$


A.3 [E] Derive minimal sum-of-products forms for the four 3-variable functions f1, f2, f3, and f4 given in Figure PA.1. Is there more than one minimal form for any of these functions? If so, derive all of them.

x1 x2 x3    f1 f2 f3 f4

0  0  0     1  1  d  0
0  0  1     1  1  1  1
0  1  0     0  1  0  1
0  1  1     0  1  1  d
1  0  0     1  0  d  d
1  0  1     0  0  0  d
1  1  0     1  0  1  1
1  1  1     1  1  1  0

Figure PA.1 Logic functions for Problem A.3.

A.4 [E] Find the simplest sum-of-products form for the function f using the don't-care condition d, where

f = x1(x2x3 + x2x3 + x2x3x4)+ x2x4(x3 + x1)

and

d = x1x2(x3x4 + x3x4)+ x1x3x4

A.5 [M] Consider the function

f (x1, . . . , x4) = (x1 ⊕ x3)+ (x1x3 + x1x3)x4 + x1x2

(a) Use a Karnaugh map to find a minimum cost sum-of-products (SOP) expression for f.

(b) Find a minimum cost SOP expression for $\overline{f}$, which is the complement of f. Then, complement (using de Morgan's rule) this SOP expression to find an expression for f. The resulting expression will be in the product-of-sums (POS) form. Compare its cost with the SOP expression derived in part (a). Can you draw any general conclusions from this result?

A.6 [E] Find a minimum cost implementation of the function f(x1, x2, x3, x4), where f = 1 if either one or two of the input variables have the logic value 1. Otherwise, f = 0.

A.7 [M] Figure A.6 defines the 4-bit encoding of BCD digits. Design a circuit that has four inputs labeled b3, . . . , b0, and an output f, such that f = 1 if the 4-bit input pattern is a valid BCD digit; otherwise f = 0. Give a minimum cost implementation of this circuit.


A.8 [M] Two 2-bit numbers A = a1a0 and B = b1b0 are to be compared by a four-variable function f(a1, a0, b1, b0). The function f is to have the value 1 whenever

v(A) ≤ v(B)

where $v(X) = x_1 \times 2^1 + x_0 \times 2^0$ for any 2-bit number. Assume that the variables A and B are such that |v(A) − v(B)| ≤ 2. Synthesize f using as few gates as possible.

A.9 [M] Repeat Problem A.8 for the requirement that f = 1 whenever

v(A) > v(B)

subject to the input constraint

v(A)+ v(B) ≤ 4

A.10 [E] Prove that the associative rule does not apply to the NAND operator.

A.11 [M] Implement the following function with no more than six NAND gates, each having three inputs.

f = x1x2 + x1x2x3 + x1x2x3x4 + x1x2x3x4

Assume that both true and complemented inputs are available.

A.12 [M] Show how to implement the following function using six or fewer two-input NAND gates. Complemented input variables are not available.

f = x1x2 + x3 + x1x4

A.13 [E] Implement the following function as economically as possible using only NAND gates. Assume that complemented input variables are not available.

f = (x1 + x3)(x2 + x4)

A.14 [M] A number code in which consecutive numbers are represented by binary patterns that differ only in one bit position is called a Gray code. A truth table for a 3-bit Gray code to binary code converter is shown in Figure PA.2a.

(a) Implement the three functions f1, f2, and f3 using only NAND gates.

(b) A lower-cost network for performing this code conversion can be derived by noting the following relationships between the input and output variables.

f1 = a

f2 = f1 ⊕ b

f3 = f2 ⊕ c

Using these relationships, specify the contents of a combinational network N that can be repeated, as shown in Figure PA.2b, to implement the conversion. Compare the total number of NAND gates required to implement the conversion in this form to the number required in part (a).
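The relationships given in part (b) can be tried out numerically before designing network N. The sketch below is illustrative only (it does not replace the gate-level design asked for in the problem): it applies f1 = a, f2 = f1 ⊕ b, f3 = f2 ⊕ c to all eight Gray-code inputs and checks that the outputs, read as a 3-bit binary number, count from 0 to 7.

```python
# Gray-code to binary conversion using the XOR chain of Problem A.14(b):
# f1 = a, f2 = f1 XOR b, f3 = f2 XOR c.

def gray_to_binary(a, b, c):
    f1 = a
    f2 = f1 ^ b
    f3 = f2 ^ c
    return f1, f2, f3

# The 3-bit Gray-code sequence for 0..7.
gray_sequence = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0),
                 (1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 0, 0)]
for n, code in enumerate(gray_sequence):
    f1, f2, f3 = gray_to_binary(*code)
    assert 4 * f1 + 2 * f2 + f3 == n   # outputs read as a binary number
print("conversion verified for all 8 codes")
```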


a  b  c     f1 f2 f3

0  0  0     0  0  0
0  0  1     0  0  1
0  1  1     0  1  0
0  1  0     0  1  1
1  1  0     1  0  0
1  1  1     1  0  1
1  0  1     1  1  0
1  0  0     1  1  1

(a) Three-bit Gray code (inputs a, b, c) to binary code (outputs f1, f2, f3) conversion

[(b) Code conversion network: three copies of a network N in cascade, with Gray-code inputs a, b, c and binary-code outputs f1, f2, f3]

Figure PA.2 Gray code conversion example for Problem A.14.

A.15 [M] Implement the XOR function using only 4 two-input NAND gates.

A.16 [M] Figure A.36 defines a BCD to seven-segment display decoder. Give an implementation for this truth table using AND, OR, and NOT gates. Verify that the same functions are correctly implemented by the NAND gate circuits shown in the figure.

A.17 [M] In the logic network shown in Figure PA.3, gate 3 fails and produces the logic value 1 at its output F1 regardless of the inputs. Redraw the network, making simplifications wherever possible, to obtain a new network that is equivalent to the given faulty network and that contains as few gates as possible. Repeat this problem, assuming that the fault is at position F2, which is stuck at a logic value 0.

A.18 [M] Figure A.16 shows the structure of a general CMOS circuit. Derive a CMOS circuit that implements the function

f (x1, . . . , x4) = x1x2 + x3x4

Use as few transistors as possible. (Hint: Consider series/parallel networks of transistors. Note the complementary series and parallel structure of the pull-up and pull-down networks in Figures A.17 and A.18.)


[Logic network with inputs x1, x2, x3, x4 and output f; the gates are numbered 1 through 8, and F1 and F2 label internal connections]

Figure PA.3 A faulty network.

A.19 [E] Draw the waveform for the output Q in the JK circuit of Figure A.30, using the input waveforms shown in Figure PA.4 and assuming that the flip-flop is initially in the 0 state.

[Waveforms for the Clock, J, and K inputs]

Figure PA.4 Input waveforms for a JK flip-flop.

A.20 [E] Derive the truth table for the NAND gate circuit in Figure PA.5. Compare it to the truth table in Figure A.23b and then verify that the circuit in Figure A.25 is equivalent to the circuit in Figure A.24a.

[Cross-coupled NAND latch with inputs A and B and outputs Q and $\overline{Q}$]

Figure PA.5 NAND latch.


A.21 [M] Compute both the setup time and the hold time in terms of NOR gate delays for the negative-edge-triggered D flip-flop shown in Figure A.28.

A.22 [M] In the circuit of Figure A.26a, replace all NAND gates with NOR gates. Derive a truth table for the resulting circuit. How does this circuit compare with the circuit in Figure A.26a?

A.23 [M] Figure A.32 shows a shift register network that shifts the data to the right one place at a time under the control of a clock signal. Modify this shift register to make it capable of shifting data either one or two places at a time under the control of the clock and an additional control input ONE/TWO.

A.24 [D] A 4-bit shift register that has two control inputs—INITIALIZE and RIGHT/LEFT—is required. When INITIALIZE is set to 1, the binary number 1000 should be loaded into the register independently of the clock input. When INITIALIZE = 0, pulses at the clock input should rotate this pattern. The pattern rotates right or left when the RIGHT/LEFT input is equal to 1 or 0, respectively. Give a suitable design for this register using D flip-flops that have preset and clear inputs as shown in Figure A.31.

A.25 [M] Derive a three-input to eight-output decoder network, with the restriction that the gates to be used cannot have more than two inputs.

A.26 [D] Figure A.34 shows a 3-bit up counter. A counter that counts in the opposite direction (that is, 7, 6, . . . , 1, 0, 7, . . .) is called a down counter. A counter capable of counting in both directions under the control of an UP/DOWN signal is called an up/down counter. Show a logic diagram for a 3-bit up/down counter that can also be preset to any state through parallel loading of its flip-flops from an external source. A LOAD/COUNT control is used to determine whether the counter is being loaded or is operating as a counter.

A.27 [D] Figure A.34 shows an asynchronous 3-bit up-counter. Design a 4-bit synchronous up-counter, which counts in the sequence 0, 1, 2, . . . , 15, 0, . . . . Use T flip-flops in your circuit. In the synchronous counter all flip-flops have to be able to change their states at the same time. Hence, the primary clock input has to be connected directly to the clock inputs of all flip-flops.

A.28 [M] A logic function to be implemented is described by the expression

f (x1, x2, x3, x4) = x1x3x4 + x1x3x4 + x2x3x4

(a) Show an implementation of f in terms of an eight-input multiplexer circuit.

(b) Can f be realized with a four-input multiplexer circuit? If so, show how.

A.29 [M] Repeat Problem A.28 for

f (x1, x2, x3, x4) = x1x2x3 + x2x3x4 + x1x4

A.30 [E] Complete the design of the up/down counter in Figure A.46 by using the state assignment S0 = 10, S1 = 11, S2 = 01, and S3 = 00. How does this design compare with the one given in Section A.13.1?


A.31 [M] Design a 2-bit synchronous counter of the general form shown in Figure A.49 that counts in the sequence . . . , 0, 3, 1, 2, 0, . . . , using D flip-flops. This circuit has no external inputs, and the outputs are the flip-flop values themselves.

A.32 [M] Repeat Problem A.31 for a 3-bit counter that counts in the sequence . . . , 0, 1, 2, 3, 4, 5, 0, . . . , taking advantage of the unused count values 6 and 7 as don't-care conditions in designing the combinational logic.

A.33 [M] Finite state machines can be used to detect the occurrence of certain subsequences in the sequence of binary inputs applied to the machine. Such machines are called finite state recognizers. Suppose that a machine is to produce a 1 as its output whenever the input pattern 011 occurs.

(a) Draw the state diagram for this machine.

(b) Make a state assignment for the required number of flip-flops and construct the assigned state table, assuming that D flip-flops are to be used.

(c) Derive the logic expressions for the output and the next-state variables.

A.34 [M] Repeat part (a) only of Problem A.33 for a machine that is to recognize the occurrence of either of the subsequences 011 and 010 in the input sequence, including the cases where overlap occurs. For example, the input sequence 1101010110 . . . is to produce the output sequence 00000101010 . . . .

References

1. S. Brown and Z. Vranesic, Fundamentals of Digital Logic with VHDL Design, 3rd ed., McGraw-Hill, Burr Ridge, IL, 2009.

2. M.M. Mano and M.D. Ciletti, Digital Design, 4th ed., Prentice-Hall, Upper Saddle River, NJ, 2007.

3. J.F. Wakerly, Digital Design Principles and Practices, 4th ed., Prentice-Hall, Englewood Cliffs, NJ, 2005.

4. R.H. Katz and G. Borriello, Contemporary Logic Design, 2nd ed., Pearson Prentice-Hall, Upper Saddle River, NJ, 2005.

5. C.H. Roth Jr., Fundamentals of Logic Design, 5th ed., Thomson/Brooks/Cole, Belmont, CA, 2004.

6. D.D. Gajski, Principles of Digital Design, Prentice-Hall, Upper Saddle River, NJ, 1997.

7. J.P. Hayes, Digital Logic Design, Addison-Wesley, Reading, MA, 1993.

8. A.S. Sedra and K.C. Smith, Microelectronic Circuits, 6th ed., Oxford, New York,2009.


Appendix B

The Altera Nios II Processor

Appendix Objectives

In this appendix you will learn about the Altera Nios II processor, which has a RISC-style instruction set. The discussion includes:

• Instruction set architecture

• Input/output capability

• Support for embedded applications


In Chapters 2 and 3 we introduced the basic concepts used in the design of instruction sets, mostly from the programmer's point of view. In this appendix we will examine the Nios II processor from the Altera Corporation as an example of a RISC-style commercial product that embodies the previously discussed concepts. The discussion follows closely the presentation of topics in Chapters 2 and 3.

The Nios II processor is intended for implementation in Field Programmable Gate Array (FPGA) devices (see Appendix A). It is provided in a software form that makes it easy to incorporate it into a computer system by using the Quartus II CAD (Computer-Aided Design) tools provided by Altera. The designed system is then downloaded into an FPGA device, thus resulting in an implementation that has the functionality of a typical computer.

The Quartus II software includes a tool called the SOPC Builder that can be used to design such systems. The SOPC Builder provides a variety of predesigned modules, called IP cores, which can be easily incorporated into a system. These modules include the Nios II processor, various I/O interfaces, and memory controllers. The modules are characterized by a variety of parameters which allow the user who is designing a custom system to specify the exact nature of the desired system. The Nios II processor can be configured to have a number of different features which require different amounts of logic circuitry for their implementation, thus affecting the performance and cost of the final system. Since the user can customize the design of the final circuit, we say that Nios II is a soft processor.

The ability to design a custom computer system, which can be implemented in a single FPGA chip, is attractive in embedded applications. We discuss such applications in Chapter 11.

B.1 Nios II Characteristics

The Nios II processor has a RISC-style architecture. Its features are very similar to those described in general terms in Chapter 2.

Data Sizes

The word length is 32 bits. Data are handled in 32-bit words, 16-bit halfwords, or 8-bit bytes. Byte addresses in a word are assigned in the little-endian arrangement, where the lower byte addresses are used for the less significant bytes.
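The little-endian arrangement can be made concrete with Python's struct module. This is an illustrative sketch only, not part of the Nios II tools: packing a 32-bit word with the little-endian format shows that the byte at the lowest address is the least significant one.

```python
# Little-endian byte addressing: the lowest byte address holds the
# least significant byte of the word.
import struct

word = 0x12345678
little = struct.pack("<I", word)      # bytes in increasing address order
print([hex(b) for b in little])       # least significant byte comes first
```

Reading the printed list from left to right corresponds to reading memory from the lowest byte address of the word upward.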

Memory Access

Data in the memory are accessed only by Load and Store instructions, which load the data into general-purpose registers or store them from these registers. The Load and Store instructions can transfer data in word, halfword, and byte sizes.

Registers

There are 32 general-purpose registers and a number of control registers. All registers are 32 bits long.

Instructions

All instructions are 32 bits long. They have all of the RISC-style functionality presented in Chapter 2.


B.2 General-Purpose Registers

Table B.1 presents the processor's 32 general-purpose registers. They are called r0 to r31, which are the names used in assembly-language instructions. Some registers are used for specific purposes and hence are also given names that are more indicative of their functionality, as shown in Table B.1. These names are also recognized by the assembler program. The registers intended for a specific purpose are:

• r0 always contains the constant 0. Reading this register returns the value 0; writing into it has no effect.

• r1 is used by the assembler as a temporary register. It should not be used in user programs.

• r24 and r29 are used for processing of exceptions.
• r25 and r30 are used exclusively by a debugging tool called the JTAG Debug Module.
• r26 is the global pointer to the data in a user program.
• r27 is the processor stack pointer.
• r28 is the frame pointer.
• r31 holds the return address when a subroutine is called.

The other registers are used for general purposes.

Table B.1 Nios II general-purpose registers.

Register     Name    Function
r0           zero    0x00000000
r1           at      Assembler Temporary
r2 . . . r23
r24          et      Exception Temporary
r25          bt      Breakpoint Temporary
r26          gp      Global Pointer
r27          sp      Stack Pointer
r28          fp      Frame Pointer
r29          ea      Exception Return Address
r30          ba      Breakpoint Return Address
r31          ra      Return Address


Because register r0 always contains the value 0, it can be included as an operand in an instruction whenever the value 0 is needed. This can be exploited in several ways. For example, the instruction

add r5, r0, r0

can be used to clear register r5. Similarly, r0 can be the source operand in a Store instruction to clear a memory location. In Compare and Branch instructions, it can be used when comparing a value in another register to 0. It can also be used in the Index address mode to provide a limited version of the Absolute address mode.
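The special behavior of r0 can be mimicked with a tiny register-file model. The class below is a hypothetical illustration of the convention, not part of any Nios II toolchain: writes to register 0 are silently discarded, so reads of it always return 0, and the effect of `add r5, r0, r0` falls out naturally.

```python
# Minimal model of a register file in which register 0 is hardwired to 0,
# as in the Nios II convention for r0.

class RegisterFile:
    def __init__(self, n=32):
        self.regs = [0] * n

    def read(self, i):
        return self.regs[i]

    def write(self, i, value):
        if i != 0:                        # writes to r0 have no effect
            self.regs[i] = value & 0xFFFFFFFF

rf = RegisterFile()
rf.write(5, 123)
rf.write(0, 999)                          # attempt to overwrite r0 is ignored
# "add r5, r0, r0": the sum of two reads of r0 is written into r5,
# which clears it.
rf.write(5, rf.read(0) + rf.read(0))
print(rf.read(0), rf.read(5))
```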

The Nios II processor also has a number of control registers. We will discuss these registers in Section B.9, because they are primarily used for input/output transfers.

B.3 Addressing Modes

The Nios II processor supports five addressing modes:

• Immediate mode—A 16-bit operand is given explicitly in the instruction. This value is sign-extended to produce a 32-bit operand for instructions that perform arithmetic operations.

• Register mode—The operand is the contents of a general-purpose register.
• Register indirect mode—The effective address of the operand is the contents of a register.
• Displacement mode—The effective address of the operand is generated by adding the contents of a register and a signed 16-bit displacement value given in the instruction. This is the Index mode discussed in Chapter 2.
• Absolute mode—A 16-bit absolute address of an operand can be specified by using the Displacement mode with register r0.

The addressing modes and their assembler syntax are given in Table B.2. Note that the syntax of the Immediate mode differs from that given in Chapter 2 in that the number sign (#) is not used. Instead, the immediate specification is included in the mnemonic for the OP-code. For example, the instruction

addi r3, r2, 24

adds the contents of r2 and the immediate decimal value 24, and places the result in r3.

Observe also that both Immediate and Absolute modes can be used only if the immediate value or the address can be represented in 16 bits. We will discuss the issue of 32-bit immediate values and addresses in Sections B.4.4 and B.4.5, respectively.


Table B.2 Nios II addressing modes.

Name               Assembler syntax   Addressing function
Immediate          Value              Operand = Value
Register           ri                 EA = ri
Register indirect  (ri)               EA = [ri]
Displacement       X(ri)              EA = [ri] + X
Absolute           LOC(r0)            EA = LOC

EA = effective address
Value = a 16-bit signed number
X = a 16-bit signed displacement value
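The addressing functions in Table B.2 can be paraphrased in a few lines of code. The sketch below is an illustrative model only (the register contents are made up): it computes effective addresses for the Displacement mode, EA = [ri] + X, and shows how that one mode subsumes Register indirect (X = 0) and Absolute (register r0) as special cases.

```python
# Effective-address computation for the Nios II Displacement mode,
# EA = [ri] + X, using a made-up register file in which r0 is always 0.

regs = [0] * 32
regs[3] = 0x1000                 # hypothetical contents of r3

def effective_address(reg, displacement=0):
    """Displacement mode X(ri); with X = 0 it acts as Register indirect,
    and with reg = 0 it reduces to the Absolute mode."""
    return regs[reg] + displacement

print(hex(effective_address(3, 40)))       # 40(r3): displacement
print(hex(effective_address(3)))           # (r3): register indirect
print(hex(effective_address(0, 0x2000)))   # 0x2000(r0): absolute
```

This mirrors the remark in Section B.4.2 that the assembler accepts all Load and Store memory operands in the X(ri) form.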

B.4 Instructions

The Nios II instruction set exemplifies all features discussed in the context of RISC-style processors in Chapter 2. All instructions are 32 bits long. Arithmetic and logic operations can be done only on operands in the general-purpose registers. Load and Store instructions are used to transfer data between memory and registers.

The instructions come in three basic forms. Those that specify three register operands have the form

Operation dest_register, source1_register, source2_register

Instructions with an immediate operand have the form

Operation dest_register, source_register, immediate_operand

The immediate operand is 16 bits long and it can be sign-extended to provide a 32-bit operand. The third form includes a 26-bit unsigned immediate value. It is used only in the subroutine-call instruction, such as

call LABEL
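Sign extension of the 16-bit immediate can be expressed compactly. The helper below is a hypothetical illustration, not Altera code: it interprets the low 16 bits of an instruction's immediate field as a signed value, which is how an arithmetic instruction such as addi sees it.

```python
# Sign extension of a 16-bit immediate to a signed value, as performed
# for arithmetic instructions such as addi.
def sign_extend16(value):
    """Interpret the low 16 bits of `value` as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

print(sign_extend16(24))                         # small positive immediate
print(hex(sign_extend16(0xFFE8) & 0xFFFFFFFF))   # -24 as a 32-bit pattern
```

Only values representable in 16 bits can be carried in the immediate field; anything larger has to be built in a register first, as discussed in Section B.4.4.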

B.4.1 Notation

The notation used in assembly-language programs is governed by the constraints imposed by the particular assembler program that is used to assemble a source program into machine code that can be executed by the processor. In many assemblers it is assumed that the statements in a source program are case-insensitive. This means that the statements

ADD R2, R3, R4


and

add r2, r3, r4

are equivalent. However, this is not so in the Nios II assembler provided by Altera Corp. This assembler allows case-insensitive specification of OP-code mnemonics, but it requires lower-case specification of register names. Thus, registers must be identified by the names given in Table B.1. For example, we can use either r27 or sp to refer to the stack pointer, but not R27 or SP.

In Altera’s documentation, lower-case letters are used to specify the OP-code mnemonics. To make it easier for the user to consult Altera’s literature, we will use the same convention in this appendix.

The Nios II instruction set is quite extensive. In this appendix, we will present onlya subset that is sufficient for developing an understanding of the capabilities of the NiosII processor. To make the presentation easier to follow, we will discuss the instructions ingroups according to their functionality. We will also show how these instructions can beused to implement the programming examples in Chapters 2 and 3. For a full descriptionof the instruction set, the reader can consult the Nios II Processor Reference Handbook,which is available in the literature section of the Altera Web site (altera.com).

B.4.2 Load and Store Instructions

Load and Store instructions move data between memory, or I/O interfaces, and the general-purpose registers. The Load Word instruction has the general form

ldw ri, source_operand

For example, the instruction

ldw r2, 40(r3)

uses the Displacement address mode to determine the effective address of a memory location by adding the decimal value 40 and the contents of register r3; then it loads the 32-bit operand from memory into r2. The effective address must be word-aligned, which means that it must be a multiple of four.

The Store Word instruction has the form

stw ri, destination_operand

For example,

stw r2, 40(r3)

stores the contents of r2 into the same memory location as above.

The assembler syntax requires the memory operand in all Load and Store instructions to be specified using the Displacement address mode, X(ri). This allows using the Register indirect mode as 0(ri), or simply (ri), as well as the Absolute mode as X(r0). However, the Absolute mode cannot be specified by using only a label, even if the value of the label is defined in an assembler directive. Thus, the statement

ldw r2, LOCATION

would cause a syntax error.

In addition to word-sized operands, the Load and Store instructions can also handle byte- and halfword-sized operands. The size of the operand is indicated in the OP-code mnemonic. Such Load instructions are:

• ldb (Load Byte)
• ldbu (Load Byte Unsigned)
• ldh (Load Halfword)
• ldhu (Load Halfword Unsigned)

When a shorter operand is loaded into a 32-bit register, its value has to be adjusted to fit into the register. This is done by sign-extending the 8- or 16-bit value to 32 bits in the ldb and ldh instructions. In the ldbu and ldhu instructions the operand is zero-extended, because the value represents a positive integer.
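The difference between the signed and unsigned loads can be sketched as follows, assuming a hypothetical data label BYTES whose first byte contains 0x9A:

```asm
movia r2, BYTES        /* BYTES is a hypothetical data label. */
ldb   r3, (r2)         /* Sign-extended: r3 = 0xFFFFFF9A. */
ldbu  r4, (r2)         /* Zero-extended: r4 = 0x0000009A. */
```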

The corresponding Store instructions are:

• stb (Store Byte), which stores the low-order byte of the source register into the memory byte specified by the effective address.

• sth (Store Halfword), which stores the low-order halfword of the source register into the memory halfword specified by the effective address (which must be halfword-aligned).

For each Load or Store instruction there is also a version intended for accessing locations in I/O interfaces. These instructions are:

• ldwio (Load Word I/O)
• ldbio (Load Byte I/O)
• ldbuio (Load Byte Unsigned I/O)
• ldhio (Load Halfword I/O)
• ldhuio (Load Halfword Unsigned I/O)
• stwio (Store Word I/O)
• stbio (Store Byte I/O)
• sthio (Store Halfword I/O)

The I/O versions of Load and Store instructions are needed when the Nios II processor is used with a cache memory, a concept that is discussed in Chapter 8. A cache is a relatively small memory that can be accessed faster than the main memory. It is typically loaded with recently used instructions and data from the main memory so that they may be accessed more quickly when they are needed again. This is advantageous for instructions and data that are normally found in the main memory. But it is inappropriate if an address used to access a particular data item refers to a memory-mapped I/O interface, because input data in I/O interfaces may change at any time, and output data must always be sent directly to the I/O device. The I/O versions of Load and Store instructions bypass the cache, if one exists, and always access the I/O location.
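As a minimal sketch, assuming a memory-mapped I/O register at the hypothetical address 0x10000000, the I/O versions are used as follows:

```asm
movia r2, 0x10000000   /* Hypothetical address of a memory-mapped I/O register. */
ldwio r3, (r2)         /* Read the device register, bypassing the cache if one exists. */
stwio r4, (r2)         /* Write to the device register, again bypassing the cache. */
```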

B.4.3 Arithmetic Instructions

Arithmetic instructions operate on data that are either in the general-purpose registers or given as an immediate value in the instruction. The following instructions are included:

• add (Add Registers)
• addi (Add Immediate)
• sub (Subtract Registers)
• subi (Subtract Immediate)
• mul (Multiply)
• muli (Multiply Immediate)
• div (Divide)
• divu (Divide Unsigned)

The Add instruction

add ri, rj, rk

adds the contents of registers rj and rk, and places the sum into register ri.

The Add Immediate instruction

addi ri, rj, 85

adds the contents of register rj and the immediate value 85, and places the result into register ri. The immediate operand is represented in 16 bits in the instruction, and it is sign-extended to 32 bits prior to the addition.

The Subtract instruction

sub ri, rj, rk

subtracts the contents of register rk from register rj, and places the result into register ri.

The Multiply instruction

mul ri, rj, rk

multiplies the contents of registers rj and rk, and places the low-order 32 bits of the product into register ri. The multiplication treats the operands as unsigned numbers. The result in register ri is correct if the generated product can be represented in 32 bits, regardless of whether the operands are unsigned or signed. In the immediate version

muli ri, rj, Value16

the 16-bit immediate operand Value16 is sign-extended to 32 bits.
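For instance, in the arbitrary sketch below the product fits comfortably in 32 bits, so the value left in the destination register is exact:

```asm
movi r2, 1000          /* r2 = 1000. */
muli r3, r2, 1000      /* r3 = 1000000; the product fits in 32 bits. */
```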

The Divide instruction

div ri, rj, rk

divides the contents of register rj by the contents of register rk, and places the integer portion of the quotient into register ri. The operands are treated as signed integers. The divu instruction is performed in the same way, except that the operands are treated as unsigned integers.
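The distinction between div and divu can be sketched with an arbitrary example, assuming the signed quotient is truncated toward zero:

```asm
movi r2, -7            /* r2 = 0xFFFFFFF9, i.e., -7. */
movi r3, 2
div  r4, r2, r3        /* Signed division: r4 = -3. */
divu r5, r2, r3        /* Unsigned: 0xFFFFFFF9 / 2, so r5 = 0x7FFFFFFC. */
```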

B.4.4 Logic Instructions

The logic instructions provide the AND, OR, XOR, and NOR operations. They operate on data that are either in the general-purpose registers or given as an immediate value in the instruction.

The AND instruction

and ri, rj, rk

performs a bitwise logical AND of the contents of registers rj and rk, and stores the result in register ri. Similarly, the instructions or, xor, and nor perform the OR, XOR, and NOR operations, respectively.

The AND Immediate instruction

andi ri, rj, Value16

performs a bitwise logical AND of the contents of register rj and the 16-bit immediate operand Value16, which is zero-extended to 32 bits, and stores the result in register ri. Similarly, the instructions ori, xori, and nori perform the OR, XOR, and NOR operations, respectively, using immediate operands.

It is also possible to use a 16-bit immediate operand as the 16 high-order bits in the logic operations, in which case the low-order 16 bits of the operand are zeros. This is achieved with the instructions:

• andhi (AND High Immediate)
• orhi (OR High Immediate)
• xorhi (XOR High Immediate)

This provides a mechanism for loading a 32-bit immediate value into a register, by first placing the high-order 16 bits into the register using the orhi instruction and then ORing in the low-order 16 bits using the ori instruction.
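The two steps just described can be sketched for the arbitrary constant 0x12345678 as:

```asm
orhi r2, r0, 0x1234    /* r2 = 0x12340000. */
ori  r2, r2, 0x5678    /* r2 = 0x12345678. */
```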

B.4.5 Move Instructions

The Move instructions copy the contents of one register into another, or they place an immediate value into a register. These are pseudoinstructions provided as a convenience to the programmer, which the assembler implements by using other instructions. The instruction

mov ri, rj

copies the contents of register rj into register ri. It is implemented as

add ri, rj, r0

The Move Immediate instruction

movi ri, Value16

sign-extends the 16-bit immediate Value16 to 32 bits and loads it into register ri. It is implemented as

addi ri, r0, Value16

The Move Unsigned Immediate instruction

movui ri, Value16

zero-extends the 16-bit immediate Value16 to 32 bits and loads it into register ri. It is implemented as

ori ri, r0, Value16

The Move Immediate Address instruction

movia ri, LABEL

loads a 32-bit value that corresponds to the address LABEL into register ri. The assembler implements this by using two instructions as

orhi ri, r0, LABEL_HIGH
ori ri, ri, LABEL_LOW

where LABEL_HIGH and LABEL_LOW are the high-order and low-order 16 bits of LABEL.

B.4.6 Branch and Jump Instructions

The flow of execution of a program is changed by using Branch or Jump instructions. The Unconditional Branch instruction

br LABEL

transfers execution unconditionally to the instruction at address LABEL. The branch target is specified in the form of a signed 16-bit offset that is included in the instruction. The offset is the distance in bytes from the instruction that immediately follows br to the address LABEL.

Conditional transfer of execution is achieved with Conditional Branch instructions, which compare the contents of two general-purpose registers and cause a branch if the result satisfies the branch condition. For example, the Branch if Less Than Signed instruction

blt ri, rj, LABEL

performs the comparison [ri] < [rj], treating the contents of the registers as signed numbers.

The Branch if Less Than Unsigned instruction

bltu ri, rj, LABEL

performs the comparison [ri] < [rj], treating the contents of the registers as unsigned numbers.

The other Conditional Branch instructions of the same form are:

• beq (Comparison [ri] = [rj])
• bne (Comparison [ri] ≠ [rj])
• bge (Signed comparison [ri] ≥ [rj])
• bgeu (Unsigned comparison [ri] ≥ [rj])
• bgt (Signed comparison [ri] > [rj])
• bgtu (Unsigned comparison [ri] > [rj])
• ble (Signed comparison [ri] ≤ [rj])
• bleu (Unsigned comparison [ri] ≤ [rj])

The target of any branch instruction must be within the range that can be specified in a 16-bit offset. For a target outside this range, it is necessary to use the Jump instruction

jmp ri

which transfers execution unconditionally to the address contained in the specified register, ri.
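A transfer to a distant target can therefore be sketched by first loading its address into a register (FARLABEL is a hypothetical label):

```asm
movia r8, FARLABEL     /* Load the 32-bit address of the distant target. */
jmp   r8               /* Transfer execution to that address. */
```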

Example B.1

To illustrate the use of Nios II instructions, let us consider the program in Figure 2.8, which adds a list of numbers. A Nios II version of this program, using the same register choices, is given in Figure B.1. The size of the list is loaded into register r2 from the memory location N using the Absolute address mode. This assumes that the address N can be expressed in 16 bits, because the Absolute mode is actually implemented as the Displacement mode that uses a 16-bit offset plus the contents of r0 to determine the effective address of the operand. Register r3 is cleared to zero by using the Add instruction that adds the zero contents of r0. The address NUM1 is loaded into r4 by specifying it as an immediate operand in an Add Immediate instruction. Observe that when an immediate operand is specified, this is done by simply giving its name (that is recognized by the assembler) or an actual value. The fact that it is an immediate operand is stated within the OP code addi. The conditional branch instruction causes the execution to continue at LOOP if the contents of r2 are greater than zero. Finally, note that the label LOOP must be followed by a colon, and that the comments are delineated by the /* and */ characters.

        ldw  r2, N(r0)        /* Load the size of the list. */
        add  r3, r0, r0       /* Initialize sum to 0. */
        addi r4, r0, NUM1     /* Load address of the first number. */

LOOP:   ldw  r5, (r4)         /* Get the next number. */
        add  r3, r3, r5       /* Add this number to sum. */
        addi r4, r4, 4        /* Increment the pointer to the list. */
        subi r2, r2, 1        /* Decrement the counter. */
        bgt  r2, r0, LOOP     /* Loop back if not finished. */
        stw  r3, SUM(r0)      /* Store the final sum. */

Figure B.1 Nios II implementation of the program in Figure 2.8.

Example B.2

In the program in Figure B.1, the addresses that correspond to labels N, NUM1, and SUM must be small enough to be representable in 16 bits. If this is not the case, then the program can be augmented as shown in Figure B.2. Here, the movia instructions are used to load 32-bit addresses into registers. Note also that the mov instruction is used to clear r3 to zero.

        movia r2, N           /* Get the address N. */
        ldw   r2, (r2)        /* Load the size of the list. */
        mov   r3, r0          /* Initialize sum to 0. */
        movia r4, NUM1        /* Load address of the first number. */

LOOP:   ldw   r5, (r4)        /* Get the next number. */
        add   r3, r3, r5      /* Add this number to sum. */
        addi  r4, r4, 4       /* Increment the pointer to the list. */
        subi  r2, r2, 1       /* Decrement the counter. */
        bgt   r2, r0, LOOP    /* Loop back if not finished. */
        movia r6, SUM         /* Get the address SUM. */
        stw   r3, (r6)        /* Store the final sum. */

Figure B.2 A more general Nios II implementation of the program in Figure 2.8.

Example B.3

Figure B.3 gives an implementation of the program in Figure 2.11, which sums the marks attained by students in different tests.

        movia r2, LIST        /* Get the address LIST. */
        mov   r3, r0          /* Clear r3. */
        mov   r4, r0          /* Clear r4. */
        mov   r5, r0          /* Clear r5. */
        movia r6, N           /* Get the address N. */
        ldw   r6, (r6)        /* Load the value n. */

LOOP:   ldw   r7, 4(r2)       /* Add the mark for next student’s */
        add   r3, r3, r7      /* Test 1 to the partial sum. */
        ldw   r7, 8(r2)       /* Add the mark for that student’s */
        add   r4, r4, r7      /* Test 2 to the partial sum. */
        ldw   r7, 12(r2)      /* Add the mark for that student’s */
        add   r5, r5, r7      /* Test 3 to the partial sum. */
        addi  r2, r2, 16      /* Increment the pointer. */
        subi  r6, r6, 1       /* Decrement the counter. */
        bgt   r6, r0, LOOP    /* Branch back if not finished. */
        movia r7, SUM1        /* Store the total for Test 1 */
        stw   r3, (r7)        /* into location SUM1. */
        movia r7, SUM2        /* Store the total for Test 2 */
        stw   r4, (r7)        /* into location SUM2. */
        movia r7, SUM3        /* Store the total for Test 3 */
        stw   r5, (r7)        /* into location SUM3. */

Figure B.3 Implementation of the program in Figure 2.11.

B.4.7 Subroutine Linkage Instructions

Nios II has two instructions for calling subroutines. The Call Subroutine instruction

call LABEL

includes a 26-bit unsigned immediate value. The instruction saves the return address (which is the address of the next instruction) in register r31. Then, it transfers control to the instruction at address LABEL. This jump address is determined by concatenating the four high-order bits of the Program Counter with the immediate value, Value26, and two low-order zeros as follows

Jump address = PC31−28 : Value26 : 00

The two least-significant bits are 0 because Nios II instructions must be aligned on word boundaries.

The Call Subroutine in Register instruction

callr ri

saves the return address in register r31 and then transfers control to the instruction at the address contained in register ri.
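For example, a subroutine that lies beyond the reach of the call instruction's 26-bit field can be called through a register (FARSUB is a hypothetical label):

```asm
movia r8, FARSUB       /* Load the full 32-bit address of the subroutine. */
callr r8               /* Save the return address in r31, then jump to [r8]. */
```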

Return from a subroutine is performed with the instruction

ret

This instruction transfers execution to the address contained in register r31.

Example B.4

Figure B.4 illustrates how the program of Figure B.2 can be written in the form of a subroutine, where the parameters are passed through processor registers.

Calling program

        movia r2, N           /* Get the address N. */
        ldw   r2, (r2)        /* Load the size of the list. */
        movia r4, NUM1        /* Load address of the first number. */
        call  LISTADD         /* Call subroutine. */
        movia r6, SUM         /* Get the address SUM. */
        stw   r3, (r6)        /* Store the final sum. */
        ...

Subroutine

LISTADD: mov  r3, r0          /* Initialize sum to 0. */
LOOP:   ldw   r5, (r4)        /* Get the next number. */
        add   r3, r3, r5      /* Add this number to sum. */
        addi  r4, r4, 4       /* Increment the pointer to the list. */
        subi  r2, r2, 1       /* Decrement the counter. */
        bgt   r2, r0, LOOP    /* Loop back if not finished. */
        ret                   /* Return to calling program. */

Figure B.4 Program of Figure B.2 written as a subroutine; parameters passed through registers.

Example B.5

Figure B.5 shows how the program of Figure B.2 can be written as a subroutine where the parameters are passed on the processor stack.

        movia r2, NUM1        /* Push parameters on the stack. */
        subi  sp, sp, 4
        stw   r2, (sp)
        movia r2, N
        ldw   r2, (r2)
        subi  sp, sp, 4
        stw   r2, (sp)
        call  LISTADD         /* Call subroutine. */
        ldw   r2, 4(sp)       /* Get the result from the stack */
        movia r3, SUM         /* and save it in location SUM. */
        stw   r2, (r3)
        addi  sp, sp, 8       /* Restore top of stack. */
        ...

LISTADD: subi sp, sp, 16      /* Save registers. */
        stw   r2, 12(sp)
        stw   r3, 8(sp)
        stw   r4, 4(sp)
        stw   r5, (sp)
        ldw   r2, 16(sp)      /* Initialize counter to n. */
        ldw   r4, 20(sp)      /* Initialize pointer to the list. */
        mov   r3, r0          /* Initialize sum to 0. */

LOOP:   ldw   r5, (r4)        /* Get the next number. */
        add   r3, r3, r5      /* Add this number to sum. */
        addi  r4, r4, 4       /* Increment the pointer by 4. */
        subi  r2, r2, 1       /* Decrement the counter. */
        bgt   r2, r0, LOOP    /* Loop back if not finished. */
        stw   r3, 20(sp)      /* Put result in the stack. */
        ldw   r5, (sp)        /* Restore registers. */
        ldw   r4, 4(sp)
        ldw   r3, 8(sp)
        ldw   r2, 12(sp)
        addi  sp, sp, 16
        ret                   /* Return to calling program. */

Figure B.5 Program of Figure B.2 written as a subroutine; parameters passed on the stack.

Example B.6

When a Subroutine Call instruction is executed, the Nios II processor saves the return address in register r31 (ra). In the case of nested subroutines, this return address must be saved on the processor stack before the second subroutine is called. Figure B.6 indicates how nested subroutines may be implemented. It corresponds to the program in Figure 2.21.

        movia r2, PARAM2      /* Place parameters on the stack. */
        ldw   r2, (r2)
        subi  sp, sp, 4
        stw   r2, (sp)
        movia r2, PARAM1
        ldw   r2, (r2)
        subi  sp, sp, 4
        stw   r2, (sp)
        call  SUB1            /* Call the subroutine. */
        ldw   r2, (sp)        /* Get the result from the stack */
        movia r3, RESULT      /* and save it in location RESULT. */
        stw   r2, (r3)
        addi  sp, sp, 8       /* Restore top of stack. */
        ...

SUB1:   subi  sp, sp, 24      /* Save registers. */
        stw   ra, 20(sp)
        stw   fp, 16(sp)
        stw   r2, 12(sp)
        stw   r3, 8(sp)
        stw   r4, 4(sp)
        stw   r5, (sp)
        addi  fp, sp, 16      /* Initialize the frame pointer. */
        ldw   r2, 8(fp)       /* Get first parameter. */
        ldw   r3, 12(fp)      /* Get second parameter. */
        ...
        movia r5, PARAM3      /* Get the parameter that has to be */
        ldw   r4, (r5)        /* passed to SUB2, and push it on */
        subi  sp, sp, 4       /* the stack. */
        stw   r4, (sp)
        call  SUB2
        ldw   r4, (sp)        /* Get result from SUB2. */
        addi  sp, sp, 4
        ...
        stw   r5, 8(fp)       /* Place answer on stack. */
        ldw   r5, (sp)        /* Restore registers. */
        ldw   r4, 4(sp)
        ldw   r3, 8(sp)
        ldw   r2, 12(sp)
        ldw   fp, 16(sp)
        ldw   ra, 20(sp)
        addi  sp, sp, 24
        ret                   /* Return to Main program. */

...continued in Part b

Figure B.6 Nested subroutines (Part a); implementation of the program in Figure 2.21a.

SUB2:   subi  sp, sp, 12      /* Save registers. */
        stw   fp, 8(sp)
        stw   r2, 4(sp)
        stw   r3, (sp)
        addi  fp, sp, 8       /* Initialize the frame pointer. */
        ldw   r2, 4(fp)       /* Get the parameter. */
        ...
        stw   r3, 4(fp)       /* Place SUB2 result on stack. */
        ldw   r3, (sp)        /* Restore registers. */
        ldw   r2, 4(sp)
        ldw   fp, 8(sp)
        addi  sp, sp, 12
        ret                   /* Return to SUB1. */

...continued from Part a

Figure B.6 Nested subroutines (Part b); implementation of the program in Figure 2.21b.

B.4.8 Comparison Instructions

The Comparison instructions compare the contents of two registers, or the contents of a register and an immediate value, and write either 1 (if true) or 0 (if false) into the result register.

The Compare Less Than Signed instruction

cmplt ri, rj, rk

performs the comparison of the signed numbers in registers rj and rk, [rj] < [rk], and writes a 1 into register ri if the result is true; otherwise, it writes a 0.

The Compare Less Than Unsigned instruction

cmpltu ri, rj, rk

performs the same function as the cmplt instruction, but it treats the operands as unsigned numbers.
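A comparison result is typically tested with a branch against r0. The following sketch (SMALLER is a hypothetical label) branches when the unsigned value in r2 is less than that in r3:

```asm
cmpltu r4, r2, r3      /* r4 = 1 if [r2] < [r3], treated as unsigned; else r4 = 0. */
bne    r4, r0, SMALLER /* Branch if the comparison produced 1. */
```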

Other instructions of this type are:

• cmpeq (Comparison [rj] = [rk])
• cmpne (Comparison [rj] ≠ [rk])
• cmpge (Signed comparison [rj] ≥ [rk])

• cmpgeu (Unsigned comparison [rj] ≥ [rk])
• cmpgt (Signed comparison [rj] > [rk])
• cmpgtu (Unsigned comparison [rj] > [rk])
• cmple (Signed comparison [rj] ≤ [rk])
• cmpleu (Unsigned comparison [rj] ≤ [rk])

The immediate versions of the Comparison instructions include a 16-bit immediate operand. For example, the Compare Less Than Signed Immediate instruction

cmplti ri, rj, Value16

compares the signed number in register rj with the sign-extended immediate operand. It writes a 1 into register ri if [rj] < Value16; otherwise, it writes a 0.

The Compare Less Than Unsigned Immediate instruction

cmpltui ri, rj, Value16

compares the unsigned number in register rj with the zero-extended immediate operand. It writes a 1 into register ri if [rj] < Value16; otherwise, it writes a 0.

Other instructions of this type are:

• cmpeqi (Comparison [rj] = Value16)
• cmpnei (Comparison [rj] ≠ Value16)
• cmpgei (Signed comparison [rj] ≥ Value16)
• cmpgeui (Unsigned comparison [rj] ≥ Value16)
• cmpgti (Signed comparison [rj] > Value16)
• cmpgtui (Unsigned comparison [rj] > Value16)
• cmplei (Signed comparison [rj] ≤ Value16)
• cmpleui (Unsigned comparison [rj] ≤ Value16)

B.4.9 Shift Instructions

The Shift instructions shift the contents of a specified register either to the right or to the left.

The Shift Right Logical instruction

srl ri, rj, rk

shifts the contents of register rj to the right by the number of bit positions specified by the five least-significant bits (a number in the range 0 to 31) in register rk, and stores the result in register ri. The vacated bits on the left side of the shifted operand are filled with zeros.

The Shift Right Logical Immediate instruction

srli ri, rj, Value5

shifts the contents of register rj to the right by the number of bit positions specified by the five-bit unsigned value, Value5, given in the instruction, and stores the result in register ri. The vacated bits on the left side of the shifted operand are filled with zeros.

The other Shift instructions are:

• sra (Shift Right Arithmetic)
• srai (Shift Right Arithmetic Immediate)
• sll (Shift Left Logical)
• slli (Shift Left Logical Immediate)

The sra and srai instructions perform the same actions as the srl and srli instructions, except that the sign bit, rj31, is replicated into the vacated bits on the left side of the shifted operand.

The sll and slli instructions are similar to the srl and srli instructions, but they shift the operand in register rj to the left and fill the vacated bits on the right side with zeros.
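As an arbitrary sketch of combining shifts (not taken from the text), the 8-bit field in bit positions 23 through 16 of r2 can be isolated into the low-order byte of r3:

```asm
slli r3, r2, 8         /* Move the field to bit positions 31-24. */
srli r3, r3, 24        /* Shift the field down to bits 7-0; zeros fill the rest. */
```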

B.4.10 Rotate Instructions

There are three Rotate instructions. The Rotate Right instruction

ror ri, rj, rk

rotates the bits of register rj in the left-to-right direction by the number of bit positions specified by the five least-significant bits (a number in the range 0 to 31) in register rk, and stores the result in register ri.

The Rotate Left instruction

rol ri, rj, rk

is similar to the ror instruction, but it rotates the operand in the right-to-left direction.

The Rotate Left Immediate instruction

roli ri, rj, Value5

rotates the bits of register rj in the right-to-left direction by the number of bit positions specified by the five-bit unsigned value, Value5, given in the instruction, and stores the result in register ri.

Example B.7

In Figure 2.24 we showed a program that packs two BCD digits into a byte. A Nios II version of that program is given in Figure B.7.

        movia r2, LOC         /* r2 points to data. */
        ldb   r3, (r2)        /* Load first byte into r3. */
        slli  r3, r3, 4       /* Shift left by 4 bit positions. */
        addi  r2, r2, 1       /* Increment the pointer. */
        ldb   r4, (r2)        /* Load second byte into r4. */
        andi  r4, r4, 0xF     /* Clear high-order bits to zero. */
        or    r3, r3, r4      /* Concatenate the BCD digits. */
        movia r2, PACKED      /* Store the result into */
        stb   r3, (r2)        /* location PACKED. */

Figure B.7 A routine that packs two BCD digits into a byte, corresponding to Figure 2.24.

B.4.11 Control Instructions

There are two special instructions for reading and writing the control registers that will be discussed in Section B.9. The Read Control Register instruction

rdctl ri, ctlj

copies the contents of control register ctlj into general-purpose register ri.

The Write Control Register instruction

wrctl ctlj, ri

copies the contents of general-purpose register ri into control register ctlj.

There are two instructions provided for dealing with exceptions: trap and eret. They are similar to the call and ret instructions, but they are used for exceptions. We will discuss them in Section B.10.2.

There are also instructions for management of cache memories: flushd (Flush Data Cache Line), flushi (Flush Instruction Cache Line), initd (Initialize Data Cache Line), and initi (Initialize Instruction Cache Line).

B.5 Pseudoinstructions

For programming convenience, it is useful to have a variety of different instructions. From the hardware point of view, a large number of instructions requires more extensive circuitry for their implementation. Often, the action of some desired instructions can be achieved efficiently by using other instructions. If these desired instructions are not implemented in hardware, they are called pseudoinstructions. The assembler replaces pseudoinstructions with actual instructions that are implemented in hardware.

In Section B.4.5, we saw that the Move instructions are pseudoinstructions. This section describes some other pseudoinstructions.

The Subtract Immediate instruction

subi ri, rj, Value16

is implemented as

addi ri, rj, −Value16

The Branch Greater Than Signed instruction

bgt ri, rj, LABEL

is implemented as the blt instruction by swapping the register operands.

When writing a program, the programmer need not be aware of pseudoinstructions. But an awareness becomes important if one tries to examine the assembled code, perhaps during the debugging process.

B.6 Assembler Directives

The Nios II assembler directives conform to those defined by the widely used GNU assembler, which is software available in the public domain. Assembler directives begin with a period. Some of the frequently used directives are described below.

.org Value

This is the ORIGIN directive discussed in Chapter 2.

.equ LABEL, Value

The name LABEL is equated with Value. For example,

.equ LIST, 0x1000

assigns the hexadecimal number 1000 to LIST.

.byte expressions

Places byte-sized data items into the memory. Items are specified by expressions that are separated by commas. Each expression is assembled into the next byte. Examples of expressions are: 23, 6 + LABEL, and Z − 4.

.hword expressions

This is the same as .byte, except that the expressions are assembled into successive 16-bit halfwords.

.word expressions

This is the same as .byte, except that the expressions are assembled into successive 32-bit words.

.skip Size

This is the RESERVE directive discussed in Chapter 2. It reserves in the memory the number of bytes specified as Size.

.end

Indicates the end of the source-code file. Everything after this directive is ignored by the assembler.
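The data directives above can be sketched together as follows (the label names and values are arbitrary examples):

```asm
TABLE:  .byte  1, 2, 3, 4     /* Four consecutive bytes. */
HALVES: .hword 1000, 2000     /* Two 16-bit halfwords. */
VAL:    .word  100000         /* One 32-bit word. */
BUF:    .skip  64             /* Reserve 64 bytes of memory. */
```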

Example B.8

Figure B.8 illustrates the use of some assembler directives. It corresponds to Figure 2.13.

        .org  100             /* Place this code at location 100. */
        movia r2, N           /* Get the address N. */
        ldw   r2, (r2)        /* Load the size of the list. */
        mov   r3, r0          /* Initialize sum to 0. */
        movia r4, NUM1        /* Load address of the first number. */

LOOP:   ldw   r5, (r4)        /* Get the next number. */
        add   r3, r3, r5      /* Add this number to sum. */
        addi  r4, r4, 4       /* Increment the pointer to the list. */
        subi  r2, r2, 1       /* Decrement the counter. */
        bgt   r2, r0, LOOP    /* Loop back if not finished. */
        movia r6, SUM         /* Get the address SUM. */
        stw   r3, (r6)        /* Store the final sum. */
        next instruction

        .org  200             /* Place data at location 200. */
SUM:    .skip 4
N:      .word 150
NUM1:   .skip 600

        .end

.end

Figure B.8 A program that corresponds to Figure 2.13.

B.7 Carry and Overflow Detection

When performing an arithmetic operation such as Add or Subtract, it is often important to know if the operation produced a carry from the most-significant bit position or if arithmetic overflow occurred. A processor that uses condition codes, as discussed in Section 2.10.2, automatically sets the C and V flags to indicate whether carry or overflow occurred. However, the Nios II processor does not include condition code flags. Its Add and Subtract instructions perform the corresponding operations in the same way for both signed and unsigned operands. Additional instructions have to be used to detect the occurrence of carry and overflow. The carry out from bit position 31 is of interest when unsigned numbers are added or subtracted. Overflow is of interest when signed operands are involved.

Carry and Overflow in Addition

Upon executing the instruction

add r4, r2, r3

a possible occurrence of a carry can be detected by checking if the unsigned sum, in register r4, is less than either one of the unsigned operands. If this instruction is followed by

cmpltu r5, r4, r2

the carry bit will be written into register r5.

If it is desired to continue execution at location CARRY when a carry of 1 is detected, this can be achieved by using

add r4, r2, r3
bltu r4, r2, CARRY

Arithmetic overflow can be detected by checking the signs of the source operands and the resulting sum. Overflow occurs if two positive numbers produce a negative sum, or if two negative numbers produce a positive sum. Exploiting this fact, the instruction sequence

add r4, r2, r3
xor r5, r4, r2
xor r6, r4, r3
and r5, r5, r6
blt r5, r0, OVERFLOW

will cause a branch to OVERFLOW if the addition results in arithmetic overflow. The two xor instructions are used to compare the signs of the sum and each of the two summands. While these instructions perform the XOR on all 32 bits, it is only the sign position, b31, that is considered in the subsequent branch instruction. This bit is set to 1 only if the sign bits of the sum and the summand are different. The and instruction causes bit 31 of r5 to be set to 1 only if the signs of both operands are the same, but the sign of the sum is different. The blt instruction causes a branch if the signed number in r5 is negative, which is indicated by bit 31 of r5 being equal to 1.

Carry and Overflow in Subtraction

Carry and overflow conditions in Subtract operations can be detected using a similar approach. A carry from the most-significant bit position of the generated difference can be detected by checking if the minuend is less than the subtrahend. For example, the instructions

sub     r4, r2, r3
bltu    r2, r3, CARRY

will cause execution to branch to location CARRY if a carry is generated in the subtraction.

Arithmetic overflow occurs if the minuend and subtrahend have different signs and the sign of the generated difference is not the same as the sign of the minuend. This condition can be detected by the instruction sequence

sub     r4, r2, r3
xor     r5, r2, r3
xor     r6, r2, r4
and     r5, r5, r6
blt     r5, r0, OVERFLOW

The two xor instructions compare the sign of the minuend with the signs of the subtrahend and the generated difference. The and instruction causes bit 31 of r5 to be set to 1 only if the above-stated condition for overflow is true.
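In C, the analogous checks for subtraction look like this (a sketch; the mask logic matches the xor/xor/and sequence above):

```c
#include <assert.h>
#include <stdint.h>

/* Borrow (carry in subtraction): occurs exactly when the unsigned
   minuend is smaller than the subtrahend (the bltu test).
   Overflow: the operands differ in sign and the difference's sign
   differs from the minuend's. */
static int sub_carry(uint32_t a, uint32_t b) {
    return a < b;
}

static int sub_overflow(uint32_t a, uint32_t b) {
    uint32_t diff = a - b;
    return (int32_t)((a ^ b) & (a ^ diff)) < 0;
}
```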

Example B.9

Consider the task of adding two integers that are too big to fit into 32-bit registers. This can be done by loading the numbers into two different registers and then performing the addition with carry detection as explained above. We will use hexadecimal numbers to make it easy to see how a number can be represented in two registers. Let A = 10A72C10F8 and B = 4A5C00FE04. Then C = A + B can be computed as shown in Figure B.9. Registers r2 and r3 are loaded with the low- and high-order 32 bits of A, respectively. Registers r4 and

orhi    r2, r0, 0xA72C    /* r2 now contains A72C0000. */
ori     r2, r2, 0x10F8    /* r2 now contains A72C10F8. */
ori     r3, r0, 0x10      /* r3 now contains 10. */
orhi    r4, r0, 0x5C00    /* r4 now contains 5C000000. */
ori     r4, r4, 0xFE04    /* r4 now contains 5C00FE04. */
ori     r5, r0, 0x4A      /* r5 now contains 4A. */
add     r6, r2, r4        /* Add low-order 32 bits. */
cmpltu  r7, r6, r2        /* Check if carry occurred. */
add     r7, r7, r3        /* Add the carry plus the */
add     r7, r7, r5        /* high-order bits. */

Figure B.9 Program for Example B.9.


r5 are used to hold B in the same way. Note that the 32-bit values in registers r2 and r4 are loaded using two 16-bit immediate operands, as explained in Section B.4.4. After the addition of the low-order 32 bits of A and B, the carry out is included in the addition of the high-order 32 bits. The generated sum C = 5B032D0EFC is placed in registers r6 and r7.
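A C sketch of the same multiword technique, with illustrative names for the 32-bit halves:

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit addition from 32-bit halves, as in Figure B.9: add the low
   words, detect the carry with an unsigned compare (the cmpltu step),
   and fold it into the sum of the high words. */
static void add64(uint32_t a_lo, uint32_t a_hi,
                  uint32_t b_lo, uint32_t b_hi,
                  uint32_t *c_lo, uint32_t *c_hi) {
    *c_lo = a_lo + b_lo;
    uint32_t carry = (*c_lo < a_lo);   /* 1 if the low add wrapped */
    *c_hi = a_hi + b_hi + carry;
}
```

Applied to the values of Example B.9, the low halves produce a carry that is absorbed into the high half, yielding 5B in the high word and 032D0EFC in the low word.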

B.8 Example Programs

In Section 2.11, we presented two example programs. A Nios II version of the program that computes the dot product of two vectors is given in Figure B.10; it corresponds to Figure 2.27. A program that searches for a matching string is shown in Figure B.11; it corresponds to Figure 2.30.

        movia   r2, AVEC        /* r2 points to vector A. */
        movia   r3, BVEC        /* r3 points to vector B. */
        movia   r4, N           /* Get the address N. */
        ldw     r4, (r4)        /* r4 serves as a counter. */
        mov     r5, r0          /* r5 accumulates the dot product. */

LOOP:   ldw     r6, (r2)        /* Get next element of vector A. */
        ldw     r7, (r3)        /* Get next element of vector B. */
        mul     r8, r6, r7      /* Compute the product of next pair. */
        add     r5, r5, r8      /* Add to previous sum. */
        addi    r2, r2, 4       /* Increment pointer to vector A. */
        addi    r3, r3, 4       /* Increment pointer to vector B. */
        subi    r4, r4, 1       /* Decrement the counter. */
        bgt     r4, r0, LOOP    /* Loop again if not done. */
        movia   r2, DOTPROD     /* Store dot product */
        stw     r5, (r2)        /* in memory. */

Figure B.10 A program for computing the dot product of two vectors, corresponding to Figure 2.27.

B.9 Control Registers

So far, we have considered only the use of general-purpose registers. Prior to discussing the Nios II input/output schemes, we need to introduce the control registers. In Chapter 3, we explained the use of control registers in handling interrupts. Figure 3.7 depicts four registers that typify the functionality needed for this task. The same functionality is found in the Nios II control registers. In the basic configuration of a Nios II processor, there are six control registers. Additional control registers are provided when advanced hardware modules, such as the Memory Management Unit or the External Interrupt Controller, are implemented.


        movia   r2, T           /* Get the address of T(0). */
        movia   r3, P           /* Get the address of P(0). */
        movia   r4, N           /* Get the address N. */
        ldw     r4, (r4)        /* Read the value n. */
        movia   r5, M           /* Get the address M. */
        ldw     r5, (r5)        /* Read the value m. */
        sub     r4, r4, r5      /* Compute n - m. */
        add     r4, r2, r4      /* The address of T(n - m). */
        add     r5, r3, r5      /* The address of P(m). */

LOOP1:  mov     r6, r2          /* Scan through string T. */
        mov     r7, r3          /* Scan through string P. */

LOOP2:  ldb     r8, (r6)        /* Compare a pair of */
        ldb     r9, (r7)        /* characters in */
        bne     r8, r9, NOMATCH /* strings T and P. */
        addi    r6, r6, 1       /* Point to next character in T. */
        addi    r7, r7, 1       /* Point to next character in P. */
        bgt     r5, r7, LOOP2   /* Loop again if not done. */
        movia   r9, RESULT      /* Store the address of T(i) */
        stw     r2, (r9)        /* in location RESULT. */
        br      DONE

NOMATCH: addi   r2, r2, 1       /* Point to next character in T. */
        bge     r4, r2, LOOP1   /* Loop back if not done. */
        movi    r8, -1          /* Write -1 into location */
        movia   r9, RESULT      /* RESULT to indicate that */
        stw     r8, (r9)        /* no match was found. */

DONE:   next instruction

Figure B.11 A string-search program corresponding to Figure 2.30.

The basic control registers are indicated in Table B.3. They are called ctl0 to ctl5. They also have the alternate names shown in the table, which indicate their functionality. Both sets of names are recognized by the assembler. The control registers are read and written by the special instructions rdctl and wrctl. They are used as follows:

• Register ctl0 is the status register, which indicates the current state of the processor. In the basic configuration, only two bits are used:

• PIE is the processor interrupt-enable bit. The processor will accept interrupt requests from I/O devices when PIE = 1, and it will ignore them when PIE = 0.

• U is the User/Supervisor mode bit. It is 0 for Supervisor mode and 1 for User mode.

• Register ctl1 is used to automatically save the contents of the status register when an interrupt- or exception-service routine is being executed. Bits EU and EPIE are the saved status bits U and PIE.


Table B.3   Nios II basic control registers.

Register   Name       b31 · · · b2             b1    b0
ctl0       status     Reserved                 U     PIE
ctl1       estatus    Reserved                 EU    EPIE
ctl2       bstatus    Reserved                 BU    BPIE
ctl3       ienable    Interrupt-enable bits
ctl4       ipending   Pending-interrupt bits
ctl5       cpuid      Processor identifier

• Register ctl2 is used to save the contents of the status register during debug break processing. Bits BU and BPIE are the saved status bits U and PIE.

• Register ctl3 is used to enable individual interrupts from I/O devices. Each bit corresponds to one of the interrupts irq0 to irq31. The bit values 1 and 0 enable and disable each interrupt, respectively.

• Register ctl4 indicates which interrupt requests are pending. Bit k of ctl4 is set to 1 if the interrupt request irqk is active and also enabled by bit k of ctl3 being equal to 1.

• Register ctl5 is used to hold a value that uniquely identifies the processor when it is a part of a multiprocessor system.

Operating Modes

The Nios II processor can operate in two different modes:

• Supervisor mode, in which the processor can execute all instructions and perform all available functions. This mode is entered when the processor is reset.

• User mode, in which some control instructions cannot be executed.

In a basic configuration of a Nios II processor, all programs are run in the Supervisor mode. The User mode is available when the processor is configured to include the Memory Management Unit. Its sole purpose is to support operating systems, so that the OS software can run in the Supervisor mode while the application programs run in the User mode.

B.10 Input/Output

The general concepts dealing with input/output transfers, which are discussed in Chapter 3, apply fully to the Nios II processor. I/O devices are memory-mapped, and I/O transfers are performed either under program control or by using the interrupt mechanism.


.equ KBD_DATA, 0x4000 /* Specify addresses for keyboard */

        .equ    DISP_DATA, 0x4010   /* and display data registers. */
        movia   r2, LOC         /* Location where line will be stored. */
        movia   r3, KBD_DATA    /* r3 points to keyboard data register. */
        movia   r4, DISP_DATA   /* r4 points to display data register. */
        addi    r5, r0, 0x0D    /* Load ASCII code for Carriage Return. */

READ:   ldbio   r6, 4(r3)       /* Read keyboard status register. */
        andi    r6, r6, 2       /* Check the KIN flag. */
        beq     r6, r0, READ
        ldbio   r7, (r3)        /* Read character from keyboard. */
        stb     r7, (r2)        /* Write character into main memory */
        addi    r2, r2, 1       /* and increment the pointer. */

ECHO:   ldbio   r6, 4(r4)       /* Read display status register. */
        andi    r6, r6, 4       /* Check the DOUT flag. */
        beq     r6, r0, ECHO
        stbio   r7, (r4)        /* Send the character to display. */
        bne     r5, r7, READ    /* Loop back if character is not CR. */

Figure B.12 Program that reads a line of characters and displays it, corresponding to Figure 3.4.

B.10.1 Program-Controlled I/O

Section 3.1.2 explains the concept of program-controlled I/O. A RISC-style program that reads a line of characters from a keyboard and sends it to a display device is given in Figure 3.4. A Nios II implementation of this program is presented in Figure B.12. It assumes that the keyboard and display interfaces have the registers shown in Figure 3.3. The names of these registers are associated with the addresses indicated in Figure 3.3. Note that the I/O registers are accessed by using the ldbio and stbio instructions. As explained in Section B.4.2, these instructions are used to bypass the cache memory that may exist in a given Nios II system.

B.10.2 Interrupts and Exceptions

Using interrupts is an efficient way of performing I/O transfers. Section 3.2 describes this approach in general terms. The Nios II implementation of interrupts conforms to this description.

A Nios II system can deal with two types of interruptions. A request for service from an I/O device is considered to be a hardware interrupt. Any other interruption is not called an interrupt; instead, it is called an exception. In fact, the term exception is used generally to describe any hardware-initiated or software-initiated deviation from normal execution.


An exception in the normal flow of program execution can be caused by:

• Hardware interrupt
• Software trap
• Unimplemented instruction

In response to an exception, the Nios II processor performs the following actions:

1. Saves the existing processor status information by copying the contents of the status register (ctl0) into the estatus register (ctl1).

2. Clears the U bit in the status register, to ensure that the processor is in the Supervisor mode.

3. Clears the PIE bit in the status register, which prevents further external processor interrupts.

4. Writes the return address, which is the address of the instruction after the exception, into the ea register (r29).

5. Transfers execution to the address of the exception handler, which determines the cause of the exception and calls the required exception-service routine to respond to the exception.

The address of the exception handler is specified when a Nios II system is designed, and it cannot be changed by software at run time. This address can be provided by the designer; otherwise, the default address is at an offset of 0x20 from the starting address of the main memory. For example, if the memory starts at address 0, then the default address of the exception handler is 0x20.

Hardware Interrupts

An I/O device requests an interrupt by asserting one of the processor's 32 interrupt-request inputs, irq0 through irq31. An interrupt is generated only if the following three conditions are true:

• The PIE bit in the status register is set to 1.
• An interrupt-request input, irqk, is asserted.
• The corresponding interrupt-enable bit, bit k of ctl3, is set to 1.

The contents of the ipending register (ctl4) indicate which interrupt requests are pending. The exception handler determines which of the pending interrupts has the highest priority, and calls the corresponding interrupt-service routine.
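The gating conditions above can be summarized as a small C model. This is an illustrative sketch of the logic, not the actual hardware; the register names follow Table B.3:

```c
#include <assert.h>
#include <stdint.h>

/* A request on line k becomes pending (appears in ctl4) only when
   enabled by bit k of ctl3 (ienable); the processor takes an
   interrupt only when the PIE bit (b0 of ctl0) is 1. */
static uint32_t ipending_value(uint32_t irq_lines, uint32_t ienable) {
    return irq_lines & ienable;        /* contents of ctl4 */
}

static int interrupt_generated(uint32_t status, uint32_t irq_lines,
                               uint32_t ienable) {
    return (status & 1u) != 0 &&       /* PIE must be set */
           ipending_value(irq_lines, ienable) != 0;
}
```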

Upon completion of the interrupt-service routine, execution control is returned to the interrupted program by means of the eret (Exception Return) instruction. The Nios II processor starts servicing a hardware interrupt without first completing the instruction that is being executed when the interrupt request occurs. (This is different from our discussion in Chapter 3, where we assumed that the execution of the current instruction will be completed.) Therefore, the interrupted instruction must be re-executed upon return from the interrupt-service routine. To achieve this, the exception handler has to adjust the contents of the ea register, which are at this time pointing to the next instruction of the interrupted program.


Therefore, the address in the ea register has to be decremented by 4 prior to executing the eret instruction.

Figure B.13 shows how interrupts can be used to read a line of characters from a keyboard and display it using polling. The program corresponds to the general scheme presented in Figure 3.8. To keep the example simple, we do not use the exception handler, but only the necessary interrupt-service routine. This assumes that the keyboard is the only source of exceptions, so that when an interrupt request arrives the program automatically treats it as having come from the keyboard.

Observe that we defined the addresses KBD and DISPLAY as 0x4000 and 0x4010, which are the actual addresses of the KBD_DATA and DISP_DATA registers in Figure 3.3. The program accesses the other registers in the keyboard and display interfaces by using the Displacement mode. The 32-bit addresses KBD and DISPLAY, as well as the addresses of memory locations PNTR, EOL, and LINE, are loaded into processor registers by using the pseudoinstruction movia.

Note also that the exception return address in register ea is adjusted prior to returning to the interrupted program.

Software Trap

A software exception occurs when a trap instruction is executed in a program. This causes the address of the next instruction to be saved in the ea register (r29). Then, interrupts are disabled and execution is transferred to the exception handler.

The last instruction in the exception-service routine is eret, which returns execution control to the instruction that follows the trap instruction that caused the exception. The return address is the contents of register ea. The eret instruction restores the previous status of the processor by copying the contents of the estatus register into the status register.

A common use of the software trap is to transfer control to a different program, such as an operating system, as explained in Chapter 4.

Unimplemented Instructions

An exception occurs when the processor encounters a valid instruction that is not implemented in hardware. For example, a Nios II processor may be configured without hardware circuits that perform multiplication and division. In this case, an exception will occur if the mul or div instruction is encountered. The exception handler may call a routine that implements the required operation in software.

Exception Handler

The exception handler is a program that deals with exceptions. It is loaded in a predetermined location in memory to which execution control is transferred when an exception occurs. As mentioned above, in a Nios II system, a default location for the exception handler is 0x20.

Figure B.14 gives an outline of the exception handler. The routine starts by saving all registers used by it, as well as the subroutine-linkage register ra, as explained in Example 3.3 in Chapter 3. Then, it must determine the source of an exception request. First, it checks if the request is a hardware interrupt. It reads the ipending control register, and tests the


.equ KBD, 0x4000 /* Address for keyboard. */

.equ DISPLAY, 0x4010 /* Address for display. */

.equ PNTR, 0x2000 /* Buffer pointer in memory. */

.equ EOL, 0x2004 /* End-of-line indicator. */

.equ LINE, 0x2008 /* Address of start of buffer. */

Interrupt-service routine

        .org    0x020
ILOC:   subi    sp, sp, 16      /* Save registers. */
        stw     r2, 12(sp)
        stw     r3, 8(sp)
        stw     r4, 4(sp)
        stw     r5, (sp)
        movia   r2, PNTR
        ldw     r3, (r2)        /* Load address pointer. */
        movia   r4, KBD
        ldbio   r5, (r4)        /* Read character from keyboard. */
        stb     r5, (r3)        /* Write the character into memory */
        addi    r3, r3, 1       /* and increment the pointer. */
        stw     r3, (r2)        /* Update the pointer in memory. */

ECHO:   movia   r2, DISPLAY
        ldbio   r3, 4(r2)       /* See if display is ready. */
        andi    r3, r3, 4       /* Check the DOUT flag. */
        beq     r3, r0, ECHO
        stbio   r5, (r2)        /* Display the character just read. */
        addi    r3, r0, 0x0D    /* ASCII code for Carriage Return. */
        bne     r5, r3, RTRN    /* Return if character is not CR. */
        movi    r3, 1
        movia   r5, EOL
        stw     r3, (r5)        /* Indicate end of line. */
        stbio   r0, 8(r4)       /* Disable interrupts in KBD interface. */

RTRN:   ldw     r5, (sp)        /* Restore registers. */
        ldw     r4, 4(sp)
        ldw     r3, 8(sp)
        ldw     r2, 12(sp)
        addi    sp, sp, 16
        subi    ea, ea, 4       /* Adjust the return address. */
        eret                    /* Return from exception. */

...continued in Part b.

Figure B.13 Program that reads a line of characters using interrupts and displays it using polling, corresponding to Figure 3.8 (Part a).


Main program

START:  movia   r2, LINE
        movia   r3, PNTR
        stw     r2, (r3)        /* Initialize buffer pointer. */
        movia   r2, EOL
        stw     r0, (r2)        /* Clear end-of-line indicator. */
        movia   r2, KBD
        movi    r3, 2           /* Enable interrupts in */
        stbio   r3, 8(r2)       /* the keyboard interface. */
        rdctl   r2, ienable
        ori     r2, r2, 2       /* Enable keyboard interrupts in */
        wrctl   ienable, r2     /* the processor control register. */
        rdctl   r2, status
        ori     r2, r2, 1
        wrctl   status, r2      /* Set PIE bit in status register. */
        next instruction

Figure B.13 Program that reads a line of characters using interrupts and displays it using polling, corresponding to Figure 3.8 (Part b).

bits of this word, one at a time, to find a bit that is set to 1. The order in which these bits are tested determines the priority assigned to the various sources of interrupts. Upon finding a bit set to 1, the corresponding interrupt-service routine is executed. Although there may be as many as 32 different interrupt sources, in a typical system the number of I/O devices is much smaller. If the request is not a hardware interrupt, then other exceptions are checked and serviced as necessary. Prior to returning to the interrupted program, the saved registers are restored.
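The handler's scan of the ipending word can be sketched in C. The priority order (irq0 first) is a choice made by the handler software, not something fixed by the hardware; the function name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Test the bits of ipending from bit 0 upward; the first set bit
   found is the request that gets serviced. Testing in a different
   order would assign different priorities. */
static int highest_priority_irq(uint32_t ipending) {
    for (int k = 0; k < 32; k++)
        if (ipending & (1u << k))
            return k;          /* irqk is serviced first */
    return -1;                 /* no request pending */
}
```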

The main program must initialize the settings needed to attain a desired interrupt behavior of I/O devices. This is similar to the initialization illustrated in Figure B.13.

Reset

A Nios II system has to include a reset capability to make it possible to recover from an erroneous state that cannot be handled as an exception. This may be done by providing a reset key. When the key is pressed, the processor is reset and an appropriate program is executed. If the memory starts at address 0, then it is natural to use this address as the reset location, which can be done when the Nios II system is being implemented. When the processor is reset, the program counter and control registers are cleared to zero. Thus, execution starts with the instruction at address 0. To execute the main program after reset, it is only necessary to place a Branch instruction at address 0 with the first instruction of the main program as the branch target.


        .org    0
RESET:  br      START           /* Branch to the Main program. */

        Exception handler
        .org    0x020
ELOC:   subi    sp, sp, 12      /* Save registers. */
        stw     ra, 8(sp)
        stw     et, 4(sp)
        stw     r2, (sp)
        . . .
        rdctl   et, ipending    /* Get pending interrupt requests. */
        beq     et, r0, OTHER   /* Not an external interrupt. */
        subi    ea, ea, 4       /* Adjust the return address. */

IRQ0:   andi    r2, et, 1       /* Check if irq0 is active. */
        beq     r2, r0, IRQ1    /* If not, check irq1. */
        call    ISR0            /* Service the irq0 request. */

IRQ1:   andi    r2, et, 2       /* Check if irq1 is active. */
        beq     r2, r0, IRQ2    /* If not, check irq2. */
        call    ISR1            /* Service the irq1 request. */
        ...

IRQ31:  orhi    r2, r0, 0x8000  /* Pattern to test bit 31. */
        and     r2, et, r2      /* Check if irq31 is active. */
        beq     r2, r0, DONE    /* If not, finished external interrupts. */
        call    ISR31           /* Service the irq31 request. */
        br      DONE            /* Finished with external interrupts. */

OTHER:  . . .
        Instructions that check for other exceptions and
        call the required exception-service routines.
        . . .

DONE:   ldw     r2, (sp)        /* Restore registers. */
        ldw     et, 4(sp)
        ldw     ra, 8(sp)
        addi    sp, sp, 12
        eret                    /* Return to interrupted program. */

        Interrupt-service routine for irq0
ISR0:   . . .
        ...
        ret

        Interrupt-service routine for irq31
ISR31:  . . .
        ...
        ret

        Main program
START:  . . .

Figure B.14 An outline of the exception handler.


B.11 Advanced Configurations of Nios II Processor

In the previous sections we discussed the features found in the basic configurations of a Nios II processor. It is possible to implement larger Nios II processors that include additional hardware modules that provide enhanced capability. In this section we consider three of the possible enhancements.

B.11.1 External Interrupt Controller

Section B.10.2 describes the mechanism used by the internal interrupt controller, in which software is used to determine the priority of interrupt requests. This scheme is simple to implement, but it may lead to unacceptably long latency in servicing of interrupts in some applications. To reduce the latency, it is possible to include an external interrupt controller circuit that uses vectored interrupts. In this case, the controller provides the address of the interrupt-service routine for each interrupt request.

To further reduce the interrupt latency, the processor can include shadow registers for the 32 general-purpose registers. Several sets of shadow registers can be implemented and associated with different interrupts. The external interrupt controller identifies the shadow-register set to be used with an incoming interrupt request. Then, the identified set is used instead of the normal register set. This obviates the need for saving the contents of registers that are used in the interrupt-service routine.

Priority levels are associated with different interrupts. When the processor is servicing an interrupt of a given priority level, it can be interrupted only by another interrupt that has a higher priority level. The current priority level of the processor and an indication of the active shadow-register set are a part of the processor state, which is kept in the status control register, ctl0. This information is kept in the "reserved" bits indicated for this register in Table B.3.

When a Nios II processor is defined, it is configured with either the internal interrupt controller or the external interrupt controller.

B.11.2 Memory Management Unit

A Nios II system can include a memory management unit (MMU), which provides the functionality discussed in Section 8.8. The MMU is intended to support operating systems that use the memory management capability. A system that incorporates an MMU can use both the Supervisor and User modes of operation. The MMU is an optional unit which has to be specified for inclusion in a system at design time.

B.11.3 Floating-Point Hardware

The Nios II architecture makes a provision for custom instructions. These instructions can be used to define a variety of operations, which may require using additional circuits.


A set of predefined custom instructions is available for implementation of floating-point arithmetic operations. The necessary hardware is included in a Nios II system at design time if floating-point operations are desired.

B.12 Concluding Remarks

In this appendix, we described the main features of the basic implementations of Nios II processors. The basic configurations provide a powerful processor that can be used in a broad range of applications. In Section B.11, we indicated the kind of hardware that can be included in a Nios II system to provide additional capability. Since Nios II is a soft processor, a designer of a system can tailor the capability of the system to suit the desired usage.

The Nios II processor is mainly intended for commercial and industrial applications, but it is also very attractive for use in teaching environments. The FPGA technology for implementing Nios II processors is affordable and easy to use. Altera Corp. has developed a set of Development and Education boards that provide an excellent platform for introducing students to digital technology. These boards comprise the typical components found in a computer system. They make it easy for students to investigate both the hardware and software aspects of computer organization.

Extensive literature on Nios II processors and systems is available on Altera's Web site:

http://www.altera.com

B.13 Solved Problems

This section presents some examples of problems that a student may be asked to solve, and shows how such problems can be solved.

Example B.10

Problem: Assume that there is a string of ASCII-encoded characters stored in memory starting at address STRING. The string ends with the Carriage Return (CR) character. Write a Nios II program to determine the length of the string.

Solution: Figure B.15 presents a possible program. Each character in the string is compared to CR (ASCII code 0x0D), and a counter is incremented until the end of the string is reached. The result is stored in location LENGTH.

Example B.11

Problem: We want to find the smallest number in a list of non-negative 32-bit integers. Storage for data begins at address 0x1000. The word at this address must hold the value of the smallest number after it has been found. The next word contains the number of entries, n, in the list. The following n words contain the numbers in the list. Write a Nios II program


to find the smallest number and include the assembler directives needed to organize the data as stated.

Solution: The program in Figure B.16 accomplishes the required task. Comments in the program explain how this task is performed. A few sample numbers are included as entries in the list.

        movia   r2, STRING      /* r2 points to the start of the string. */
        add     r3, r0, r0      /* r3 is a counter that is cleared to 0. */
        addi    r4, r0, 0x0D    /* Load ASCII code for Carriage Return. */

LOOP:   ldb     r5, (r2)        /* Get the next character. */
        beq     r5, r4, DONE    /* Finished if character is CR. */
        addi    r2, r2, 1       /* Increment the string pointer. */
        addi    r3, r3, 1       /* Increment the counter. */
        br      LOOP            /* Not finished, loop back. */

DONE:   movia   r2, LENGTH      /* Store the count in memory */
        stw     r3, (r2)        /* location LENGTH. */

Figure B.15 Program for Example B.10.

        .equ    LIST, 0x1000    /* Starting address of the list. */
        movia   r2, LIST        /* r2 points to the start of the list. */
        ldw     r3, 4(r2)       /* r3 is a counter, initialize it with n. */
        addi    r4, r2, 8       /* r4 points to the first number. */
        ldw     r5, (r4)        /* r5 holds the smallest number found so far. */

LOOP:   subi    r3, r3, 1       /* Decrement the counter. */
        beq     r3, r0, DONE    /* Finished if r3 is equal to 0. */
        addi    r4, r4, 4       /* Increment the list pointer. */
        ldw     r6, (r4)        /* Get the next number. */
        ble     r5, r6, LOOP    /* Check if smaller number found. */
        add     r5, r6, r0      /* Update the smallest number found. */
        br      LOOP

DONE:   stw     r5, (r2)        /* Store the smallest number into SMALL. */

        .org    0x1000
SMALL:  .skip   4               /* Space for the smallest number found. */
N:      .word   7               /* Number of entries in the list. */
ENTRIES: .word  4,5,3,6,1,8,2   /* Entries in the list. */

        .end

Figure B.16 Program for Example B.11.


Example B.12

Problem: Write a Nios II program that converts an n-digit decimal integer into a binary number. The decimal number is given as n ASCII-encoded characters, as would be the case if the number is entered by typing it on a keyboard.

Solution: Consider a four-digit decimal number, D = d3d2d1d0. The value of this number is ((d3 × 10 + d2) × 10 + d1) × 10 + d0. This representation of the number is the basis for the conversion technique used in the program in Figure B.17. Note that each ASCII-encoded character is converted into a Binary Coded Decimal (BCD) digit before it is used in the computation.

        movia   r2, N           /* r2 is a counter, initialize */
        ldw     r2, (r2)        /* it with n. */
        movia   r3, DECIMAL     /* r3 points to the ASCII digits. */
        add     r4, r0, r0      /* r4 will hold the binary number. */

LOOP:   ldb     r5, (r3)        /* Get the next ASCII digit. */
        andi    r5, r5, 0x0F    /* Form the BCD digit. */
        add     r4, r4, r5      /* Add to the intermediate result. */
        addi    r3, r3, 1       /* Increment the digit pointer. */
        subi    r2, r2, 1       /* Decrement the counter. */
        beq     r2, r0, DONE
        muli    r4, r4, 10      /* Multiply by 10. */
        br      LOOP            /* Loop back if not done. */

DONE:   movia   r5, BINARY      /* Store the result in */
        stw     r4, (r5)        /* memory location BINARY. */

Figure B.17 Program for Example B.12.
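The Horner-style evaluation used in Figure B.17 can be sketched in C (the function and parameter names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Convert n ASCII decimal digits to binary, mirroring Figure B.17:
   mask each character to its BCD value (the andi ... 0x0F step), add
   it to the running result, and multiply by 10 between digits. */
static uint32_t decimal_to_binary(const char *digits, int n) {
    uint32_t value = 0;
    for (int i = 0; i < n; i++) {
        value += (uint32_t)(digits[i] & 0x0F);  /* BCD digit */
        if (i < n - 1)
            value *= 10;                        /* muli r4, r4, 10 */
    }
    return value;
}
```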

Example B.13

Problem: Consider an array of numbers A(i,j), where i = 0 through n − 1 is the row index, and j = 0 through m − 1 is the column index. The array is stored in the memory of a computer one row after another, with elements of each row occupying m successive word locations. Write a Nios II subroutine for adding column x to column y, element by element, leaving the sum elements in column y. The indices x and y are passed to the subroutine in registers r2 and r3. The parameters n and m are passed to the subroutine in registers r4 and r5, and the address of element A(0,0) is passed in register r6.

Solution: A possible program is given in Figure B.18. We assumed that the values x, y, n, and m are stored in memory locations X, Y, N, and M. Also, the elements of the array are stored in successive words that begin at location ARRAY, which is the address of the element A(0,0). Comments in the program indicate the purpose of individual instructions.


        movia   r2, X
        ldw     r2, (r2)        /* Load the value x. */
        movia   r3, Y
        ldw     r3, (r3)        /* Load the value y. */
        movia   r4, N
        ldw     r4, (r4)        /* Load the value n. */
        movia   r5, M
        ldw     r5, (r5)        /* Load the value m. */
        movia   r6, ARRAY       /* Load the address of A(0,0). */
        call    SUB
        next instruction
        ...

SUB:    subi    sp, sp, 4
        stw     r7, (sp)        /* Save register r7. */
        slli    r5, r5, 2       /* Determine the distance in bytes */
                                /* between successive elements */
                                /* in a column. */
        sub     r3, r3, r2      /* Form y - x. */
        slli    r3, r3, 2       /* Form 4(y - x). */
        slli    r2, r2, 2       /* Form 4x. */
        add     r6, r6, r2      /* r6 points to A(0,x). */
        add     r7, r6, r3      /* r7 points to A(0,y). */

LOOP:   ldw     r2, (r6)        /* Get the next number in column x. */
        ldw     r3, (r7)        /* Get the next number in column y. */
        add     r2, r2, r3      /* Add the numbers and */
        stw     r2, (r7)        /* store the sum. */
        add     r6, r6, r5      /* Increment pointer to column x. */
        add     r7, r7, r5      /* Increment pointer to column y. */
        subi    r4, r4, 1       /* Decrement the row counter. */
        bgt     r4, r0, LOOP    /* Loop back if not done. */
        ldw     r7, (sp)        /* Restore r7. */
        addi    sp, sp, 4
        ret                     /* Return to the calling program. */

Figure B.18 Program for Example B.13.
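The subroutine's pointer arithmetic can be sketched in Python. The function name is hypothetical, and a flat list stands in for the row-major array in memory:

```python
def add_columns(a, n, m, x, y):
    """Add column x to column y of an n-by-m array stored row-major in
    the flat list a, following Figure B.18.  The subroutine's byte
    offsets (4m between successive column elements, 4x and 4(y - x) to
    locate the columns) become plain index arithmetic here."""
    px = x              # r6: index of A(0, x)
    py = y              # r7 = r6 + (y - x): index of A(0, y)
    for _ in range(n):  # r4 counts the rows
        a[py] += a[px]  # load both elements, add, store the sum
        px += m         # advance one row within column x
        py += m         # advance one row within column y
    return a
```

For a 2-by-3 array stored as [1, 2, 3, 4, 5, 6], adding column 0 to column 2 yields [1, 2, 4, 4, 5, 10].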

Example B.14   Problem: Assume that a memory location BINARY contains a 32-bit pattern. It is desired to display these bits as eight hexadecimal digits on a display device that has the interface depicted in Figure 3.3. Write a Nios II program that accomplishes this task.

Solution: First it is necessary to convert the 32-bit pattern into hex digits that are represented as ASCII-encoded characters. The conversion can be done by using the table-lookup approach. A 16-entry table has to be constructed to provide the ASCII code for each possible hex digit. Then, for each four-bit segment of the pattern in BINARY, the corresponding character can be looked up in the table and stored in eight consecutive byte locations starting at location HEX. Finally, the eight characters are sent to the display. Figure B.19 gives a possible program.

         movia r2, BINARY      /* Get address of binary number. */
         ldw   r2, (r2)        /* Load the binary number. */
         movi  r3, 8           /* r3 is a digit counter that is set to 8. */
         movia r4, HEX         /* r4 points to the hex digits. */
LOOP:    roli  r2, r2, 4       /* Rotate high-order digit */
                               /* into low-order position. */
         andi  r5, r2, 0xF     /* Extract next digit. */
         ldb   r6, TABLE(r5)   /* Get ASCII code for the digit */
         stb   r6, (r4)        /* and store it in the HEX buffer. */
         subi  r3, r3, 1       /* Decrement the digit counter. */
         addi  r4, r4, 1       /* Increment the pointer to hex digits. */
         bgt   r3, r0, LOOP    /* Loop back if not the last digit. */

DISPLAY: movi  r3, 8
         movia r4, HEX
         movia r2, DISP_DATA
DLOOP:   ldbio r5, 4(r2)       /* Check if the display is ready */
         andi  r5, r5, 4       /* by testing the DOUT flag. */
         beq   r5, r0, DLOOP
         ldb   r6, (r4)        /* Get the next ASCII character */
         stbio r6, (r2)        /* and send it to the display. */
         subi  r3, r3, 1       /* Decrement the counter. */
         addi  r4, r4, 1       /* Increment the character pointer. */
         bgt   r3, r0, DLOOP   /* Loop until all characters displayed. */
         next instruction

         .org  1000
HEX:     .skip 8               /* Space for ASCII-encoded digits. */
TABLE:   .byte 0x30,0x31,0x32,0x33   /* Table for conversion */
         .byte 0x34,0x35,0x36,0x37   /* to ASCII code. */
         .byte 0x38,0x39,0x41,0x42
         .byte 0x43,0x44,0x45,0x46

Figure B.19 Program for Example B.14.
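The rotate-and-lookup conversion in Figure B.19 can be sketched in Python. The function name is invented for illustration; the table contents match TABLE in the figure:

```python
# ASCII codes for the hex digits 0-9 and A-F, matching TABLE in Figure B.19.
TABLE = [0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37,
         0x38, 0x39, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46]

def to_hex_chars(word):
    """Return the eight ASCII characters for a 32-bit pattern,
    high-order digit first, using the table-lookup approach of
    Figure B.19."""
    chars = []
    for _ in range(8):
        # roli r2, r2, 4: rotate the high-order digit into the low end.
        word = ((word << 4) | (word >> 28)) & 0xFFFFFFFF
        chars.append(chr(TABLE[word & 0xF]))  # andi + ldb TABLE(r5)
    return "".join(chars)

print(to_hex_chars(0xDEADBEEF))  # DEADBEEF
```

Rotating left by four positions brings the most significant digit into the low-order nibble first, so the characters come out in display order.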


Problems

B.1 [E] Write a program that computes the expression SUM = 580 + 68400 + 80000.

B.2 [E] Write a program that computes the expression ANSWER = A × B + C × D.

B.3 [M] Write a program that finds the number of negative integers in a list of n 32-bit integers and stores the count in location NEGNUM. The value n is stored in memory location N, and the first integer in the list is stored in location NUMBERS. Include the necessary assembler directives and a sample list that contains six numbers, some of which are negative.

B.4 [E] Write an assembly-language program in the style of Figure B.8 for the program in Figure B.3. Assume the data layout of Figure 2.10.

B.5 [M] Write a Nios II program to solve Problem 2.10 in Chapter 2.

B.6 [E] Write a Nios II program for the problem described in Example 2.5 in Chapter 2.

B.7 [M] Write a Nios II program for the problem described in Example 3.5 in Chapter 3.

B.8 [E] Write a Nios II program for the problem described in Example 3.6 in Chapter 3.

B.9 [E] Write a Nios II program for the problem described in Example 3.6 in Chapter 3, but assume that the address of TABLE is 0x10100.

B.10 [E] Write a program that displays the contents of 10 bytes of the main memory in hexadecimal format on a line of a video display. The byte string starts at location LOC in the memory. Each byte has to be displayed as two hex characters. The displayed contents of successive bytes should be separated by a space.

B.11 [M] Assume that a memory location BINARY contains a 16-bit pattern. It is desired to display these bits as a string of 0s and 1s on a display device that has the interface depicted in Figure 3.3. Write a program that accomplishes this task.

B.12 [M] Using the seven-segment display in Figure 3.17 and the timer circuit in Figure 3.14, write a program that flashes decimal digits in the repeating sequence 0, 1, 2, . . . , 9, 0, . . . . Each digit is to be displayed for one second. Assume that the counter in the timer circuit is driven by a 100-MHz clock.

B.13 [D] Using two 7-segment displays of the type shown in Figure 3.17, and the timer circuit in Figure 3.14, write a program that flashes numbers in the repeating sequence 0, 1, 2, . . . , 98, 99, 0, . . . . Each number is to be displayed for one second. Assume that the counter in the timer circuit is driven by a 100-MHz clock.

B.14 [D] Write a program that computes real clock time and displays the time in hours (0 to 23) and minutes (0 to 59). The display consists of four 7-segment display devices of the type shown in Figure 3.17. A timer circuit that has the interface given in Figure 3.14 is available. Its counter is driven by a 100-MHz clock.

B.15 [M] Write a Nios II program to solve Problem 2.22 in Chapter 2.

B.16 [D] Write a Nios II program to solve Problem 2.24 in Chapter 2.


B.17 [M] Write a Nios II program to solve Problem 2.25 in Chapter 2.

B.18 [M] Write a Nios II program to solve Problem 2.26 in Chapter 2.

B.19 [M] Write a Nios II program to solve Problem 2.27 in Chapter 2.

B.20 [M] Write a Nios II program to solve Problem 2.28 in Chapter 2.

B.21 [M] Write a Nios II program to solve Problem 2.29 in Chapter 2.

B.22 [M] Write a Nios II program to solve Problem 2.30 in Chapter 2.

B.23 [M] Write a Nios II program to solve Problem 2.31 in Chapter 2.

B.24 [D] Write a Nios II program to solve Problem 2.32 in Chapter 2.

B.25 [D] Write a Nios II program to solve Problem 2.33 in Chapter 2.

B.26 [M] Write a Nios II program to solve Problem 3.19 in Chapter 3.

B.27 [M] Write a Nios II program to solve Problem 3.21 in Chapter 3.

B.28 [D] Write a Nios II program to solve Problem 3.23 in Chapter 3.

B.29 [D] Write a Nios II program to solve Problem 3.25 in Chapter 3.


Appendix C

The ColdFire Processor

Appendix Objectives

In this appendix you will learn about the ColdFire processor, which is representative of CISC-style architecture. The discussion includes:

• Memory organization and register structure

• Addressing modes and types of instructions

• Example programs for computing tasks and I/O

• Extensions for floating-point operations


The ColdFire processor is produced by Freescale Semiconductor, Inc., which was formerly a part of Motorola, Inc. ColdFire was introduced in the mid-1990s. It is derived from the Motorola 68000 processor. ColdFire has been enhanced several times since its introduction with various extensions and new functionality. Processors that implement the ColdFire instruction set are available either as prefabricated chips or as software designs that can be implemented in field-programmable gate-array (FPGA) chips. Both types of implementations are commonly used in embedded applications.

We have selected ColdFire as an example of the CISC-style processor design discussed in Sections 2.9 and 2.10. ColdFire includes many instructions that combine memory accesses with arithmetic and logic operations. This increased functionality for individual instructions permits computing tasks to be performed with fewer instructions, thereby reducing the size of programs in memory. A variety of addressing modes enable individual instructions to use both register and memory operands. This increases the flexibility in developing programs, but it adds to the complexity of implementing the instruction set in hardware.

This appendix describes the basic ColdFire instruction set (Revision A as defined in the ColdFire Family Programmer's Reference Manual by Freescale Semiconductor [1]). We describe the memory organization, the register structure, the addressing modes for operands, and the various types of instructions. We illustrate the use of ColdFire instructions by implementing the computing tasks presented in Chapters 2 and 3. We also provide a brief overview of floating-point extensions to the basic instruction set.

C.1 Memory Organization

The ColdFire wordlength is 16 bits. Data are handled in 8-bit bytes, 16-bit words, and 32-bit longwords. Addresses consist of 32 bits and the memory is byte-addressable. From the point of view of the instruction set, the memory is organized as indicated in Figure C.1. The big-endian address assignment is used for the bytes in a word or a longword.
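The big-endian byte assignment can be illustrated with Python's standard struct module; the example value is arbitrary:

```python
import struct

# Big-endian assignment (Figure C.1): the byte at the lowest address is
# the high-order byte of the word or longword that contains it.
longword = 0x2A4C80F1               # an arbitrary example value
data = struct.pack(">I", longword)  # ">" selects big-endian packing
assert data == bytes([0x2A, 0x4C, 0x80, 0xF1])
assert data[0] == 0x2A              # byte 0 holds the high-order byte
```

With little-endian packing ("<I") the byte order would be reversed, which is the contrast drawn in Chapter 2's discussion of byte addressability.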

C.2 Registers

Figure C.2 shows the ColdFire registers. There are eight data registers and eight address registers, each 32 bits long. The data registers, D0 to D7, serve as general-purpose registers for arithmetic/logic operations and other uses. The address registers, A0 to A7, are used primarily to hold information that is needed in determining the addresses of operands in memory. One address register, A7, is dedicated to serving as the processor stack pointer (SP).

There is also a status register (SR) with the four condition code flags—N, V, Z, and C—discussed in Section 2.3.7. These flags are set or cleared based on the results of arithmetic, logic, or data-transfer operations. There is an additional flag called X (Extend), which is set or cleared in the same way as the C flag, but it is not affected by as many instructions. It is used as an extended carry-in/carry-out bit for addition or subtraction operations on numbers that are larger than the data register size of 32 bits, as explained in Section C.3.3. The remaining bits in the status register are used to control the behavior of the processor and will be discussed in Section C.6.


[Figure C.1 depicts words and longwords at even addresses 0, 2, . . . , 2³¹ − 2, with byte 0 as the high-order byte of word 0 and byte i as the high-order byte of the word at address i.]

Figure C.1 Big-endian memory layout of bytes for a ColdFire processor.

C.3 Instructions

ColdFire instructions may consist of 16, 32, or 48 bits that are stored as one, two, or three consecutive words in memory. The first word is the OP-code word that specifies the operation to be performed. The OP-code word also provides some addressing information. If more addressing information is required for a given type of instruction, it is provided in one or possibly two extension words.

Most arithmetic, logic, and data-movement instructions have two operands and are written in assembly language as

OP source, destination

where the operation OP is performed using the operands and the result is placed in the destination location, overwriting its original value. Note that the destination is the second operand. This order is different from the order that is discussed in Chapter 2 for CISC-style instructions.

[Figure C.2 depicts the eight 32-bit data registers D0–D7; the eight 32-bit address registers A0–A7, with A7 serving as the stack pointer; the program counter PC; and the 16-bit status register SR containing the T (trace mode select), S (supervisor mode select), M (master/interrupt state), and I (interrupt mask) bits together with the X (Extend), N (Negative), Z (Zero), V (Overflow), and C (Carry) condition code flags.]

Figure C.2   The ColdFire register structure.

In assembly language, the OP code is given as a mnemonic that indicates the operation to be performed, along with a length specifier that indicates the size of the data operands. The length specifier can be L, W, or B for longword, word, or byte size, respectively. For example, the instruction

ADD.L LOC, D1

adds the 32-bit operands in memory location LOC and processor register D1, and places the sum in D1. Not all sizes are available for all instructions; addition is only supported for the longword size.

Assembly-language specification of instructions must conform to the constraints imposed by the assembler program that is used for generating executable machine-language code. The assembler provided by Freescale Semiconductor is case-insensitive for instruction mnemonics and register names. The technical documentation for ColdFire [1] uses upper-case characters consistently. To conform to this presentation style, and to make example programs easier to read, we will use upper-case characters for all instruction mnemonics and register names.

C.3.1 Addressing Modes

The operands for instructions may be in processor registers, in memory, or included as immediate values within instructions. The following addressing modes are available.

Immediate mode—The operand is a constant value that is contained within the instruction. Four sizes of immediate operands can be specified. Small 3-bit numbers can be included in the OP-code word of certain instructions. Byte, word, and longword operands are found in one or two extension words that follow the OP-code word.

Absolute mode—The memory address of an operand is given in the instruction immediately after the OP-code word. There are two versions of this mode—long and short. In the long mode, a full 32-bit address is specified in two extension words. In the short mode, a 16-bit value is given in one extension word. This value is used as the low-order 16 bits of a full 32-bit address. To determine the high-order 16 bits, the sign bit of the short value is extended. Therefore, the short form can only access two 32-Kbyte regions in memory: 0 to 7FFF, or FFFF8000 to FFFFFFFF.

Register mode—The operand is in a processor register, An or Dn, that is specified in the instruction.

Register indirect mode—The effective address of the operand is in an address register, An, that is specified in the instruction.

Autoincrement mode—The effective address of the operand is in an address register, An, that is specified in the instruction. After the operand is accessed, the contents of An are incremented by 1, 2, or 4, depending on whether the operand is a byte, a word, or a longword.

Autodecrement mode—The contents of an address register, An, that is specified in the instruction are first decremented by 1, 2, or 4, depending on whether the operand is a byte, a word, or a longword. The effective address of the operand is then given by the decremented contents of An.

Page 600: Carl Hamacher

December 15, 2010 09:31 ham_338065_appc Sheet number 6 Page number 576 cyan black

576 A P P E N D I X C • The ColdFire Processor

Basic index mode—A 16-bit signed offset and an address register, An, are specified in the instruction. The offset is sign-extended to 32 bits, and the sum of the sign-extended offset and the 32-bit contents of An is the effective address of the operand.

Full index mode—An 8-bit signed offset, an address register An, and an index register Rk (either an address or a data register) are given in the instruction. The effective address of the operand is the sum of the sign-extended offset, the contents of register An, and the signed contents of register Rk.

Basic relative mode—This mode is the same as the Basic index mode, except that the program counter (PC) is used instead of an address register, An.

Full relative mode—This mode is the same as the Full index mode, except that the program counter (PC) is used instead of an address register, An.

The addressing modes and their assembler syntax are summarized in Table C.1. The Basic and Full index modes correspond to the Index addressing mode discussed in Section 2.4.3. The two relative modes are versions of the index modes that use the program counter instead of an address register. In these cases, the offset represents the distance between the memory location of the desired operand and the location following that of the instruction accessing it.

Table C.1   ColdFire addressing modes.

Name                 Assembler syntax    Addressing function
Immediate            #Value              Operand = Value
Absolute Short       Value               EA = Sign-extended WValue
Absolute Long        Value               EA = Value
Register             Rn                  EA = Rn, that is, Operand = [Rn]
Register Indirect    (An)                EA = [An]
Autoincrement        (An)+               EA = [An]; Increment An
Autodecrement        −(An)               Decrement An; EA = [An]
Basic index          WValue(An)          EA = WValue + [An]
Full index           BValue(An, Rk)      EA = BValue + [An] + [Rk]
Basic relative       WValue(PC)          EA = WValue + [PC]
Full relative        BValue(PC, Rk)      EA = BValue + [PC] + [Rk]

EA = effective address
Value = a number given either explicitly or represented by a label
BValue = an 8-bit Value
WValue = a 16-bit Value
An = an address register
Rn = an address register or a data register

Finally, it is important to note that not all instructions support all addressing modes or all operand sizes. Some of the restrictions on addressing modes and operand sizes are indicated later in this appendix for different categories of instructions. The technical documentation on the ColdFire processor provides full details on the valid combinations for individual instructions [1].
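The sign extension performed by the Absolute Short mode can be illustrated with a small Python sketch (the function name is invented for illustration):

```python
def short_absolute_ea(wvalue):
    """Sign-extend a 16-bit short absolute value to a 32-bit effective
    address, as the ColdFire Absolute Short mode does: the 16-bit value
    supplies the low-order half, and its sign bit fills the high-order
    half with all 0s or all 1s."""
    if wvalue & 0x8000:             # sign bit of the 16-bit value set
        return 0xFFFF0000 | wvalue  # high-order half becomes FFFF
    return wvalue                   # high-order half becomes 0000

assert short_absolute_ea(0x7FFF) == 0x00007FFF  # top of the low region
assert short_absolute_ea(0x8000) == 0xFFFF8000  # start of the high region
```

This is why exactly the two 32-Kbyte regions 0 to 7FFF and FFFF8000 to FFFFFFFF are reachable with the short form.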

C.3.2 Move Instruction

The MOVE instruction transfers data between memory, or I/O interfaces, and the processor registers. The direction of the transfer is from source to destination. The value being transferred by a MOVE instruction causes the condition code flags Z or N in the status register to be set to 1 if the value is zero or negative, respectively. This instruction can be used with all three operand sizes. The source operand can use all of the available addressing modes. As for the destination operand, all modes except Immediate and Relative are permitted. To ensure that the size of the instruction does not exceed three words, certain combinations of addressing modes for the source and destination are not permitted. For example, the Absolute mode (short or long) cannot be used for both operands.

To illustrate how different addressing modes and different operand sizes may be specified, consider the instruction

MOVE.L D0, (A2)

which writes a 32-bit value from register D0 into the memory location whose address is given by the contents of register A2. Similarly, the instruction

MOVE.B CHARACTER, D3

transfers an 8-bit value from memory location CHARACTER into register D3. Only the low-order byte of register D3 is modified by this transfer; the remaining bits are not affected.

Transfers between registers are possible, as in

MOVE.W D5, D7

which transfers the 16-bit value in the low-order bits of register D5 to the low-order bits in register D7. The high-order bits of D7 are not affected.

Direct transfers between memory locations are also possible, as in

MOVE.L (A2), 16(A4)

which transfers a 32-bit value from the memory location whose address is given by the contents of register A2 to the memory location whose effective address is obtained by adding 16 to the value in register A4.

An immediate value can be loaded into a register or a memory location by an instruction such as

MOVE.L #$2A4C80, D7


Address    Contents
i          OP-code word
i + 2      002A (upper 16 bits of immediate value)
i + 4      4C80 (lower 16 bits of immediate value)
i + 6      (next instruction)

Figure C.3 The instruction MOVE.L #$2A4C80, D7 in memory.

which loads the specified hexadecimal value into register D7. Note that the ‘$’ character is used to denote a hexadecimal value. All 32 bits of the destination are affected because of the L size specifier, hence the resulting value in D7 will be 002A4C80. Figure C.3 shows how this instruction would be stored in memory. The OP-code word indicates that it is a MOVE instruction and specifies the operand size. It also specifies the addressing modes for the source and destination operands, including the fact that register D7 is the destination location. The two extension words following the OP-code word contain the 32-bit immediate value for the source operand.

There are two specialized versions of the MOVE instruction. The MOVEQ (Move Quick) instruction is used when the source operand is an immediate value that is small enough to fit in 8 bits and the destination operand is a data register. This instruction is only one word in size. The MOVEA instruction is used when the destination location is an address register. Only word and longword operands are permitted for MOVEA. The condition codes in the status register are not affected by this instruction. The ColdFire assembler replaces the normal MOVE instruction with MOVEQ or MOVEA where applicable.

There is also an instruction, MOVEM (Move Multiple Registers), which performs multiple transfers involving several registers. It is used in subroutine linkage as discussed in Section C.3.7.

C.3.3 Arithmetic Instructions

This category of instructions includes arithmetic operations as well as comparison, sign-extension, negation, and clear operations. The operands can be in memory, in data registers, or included as immediate values within instructions. All but four of the arithmetic instructions discussed in this section permit only longword size for operands. All but one of the instructions require at least one register operand.

Addition, Subtraction, Comparison, and Negation
The instructions of this type are:

• ADD.L (Add)
• ADDI.L (Add immediate)
• ADDA.L (Add address)


• SUB.L (Subtract)
• SUBI.L (Subtract immediate)
• SUBA.L (Subtract address)
• CMP.L (Compare)
• CMPI.L (Compare immediate)
• CMPA.L (Compare address)
• NEG.L (Negate; single data-register operand)
• ADDX.L (Add extended; two data-register operands)
• SUBX.L (Subtract extended; two data-register operands)
• NEGX.L (Negate extended; single data-register operand)

The ADD and SUB instructions perform the specified arithmetic operation on two longword operands and place the result in the destination location. The SUB instruction subtracts the source operand from the destination operand. All of the condition code flags are affected, based on the result.

The CMP instruction is used for comparing longword values. The destination operand must be in a register. The instruction performs the same operation as the SUB instruction, but does not change the destination operand. The result affects all condition code flags, except the X flag.

There are specialized versions of the ADD, SUB, and CMP instructions for two cases: when the source operand is an immediate value (ADDI, SUBI, and CMPI), and when the destination operand is an address register (ADDA, SUBA, and CMPA). The ColdFire assembler replaces the normal versions of these instructions with the specialized versions where applicable.

Consider the following examples. The instruction

ADD.L D4, (A1)+

adds the longword in register D4 to the longword at the memory address given by the contents of register A1 and places the sum in the same memory location. The value in register A1 is incremented by four. The instruction

SUBI.L #256, D7

subtracts the value 256 from the contents of register D7 and places the result in D7. Note that the instruction

CMPI.L #256, D7

performs the same subtraction, but the contents of register D7 are not changed. All condition code flags except X are affected in the same manner as for the SUB instruction.

The NEG.L instruction is used to negate a longword operand in a data register. Negation is achieved by subtracting the value in the data register from zero. All of the condition code flags are affected, based on the result. The instruction

NEG.L D3


MOVE.L #$A72C10F8, D2    D2 contains A72C10F8.
MOVE.L #$10, D3          D3 contains 10.
MOVE.L #$5C00FE04, D4    D4 contains 5C00FE04.
MOVE.L #$4A, D5          D5 contains 4A.
ADD.L  D2, D4            Add low-order 32 bits; carry-out sets X and C flags.
ADDX.L D3, D5            Add high-order bits with X flag as carry-in bit.

Figure C.4 Program to add numbers larger than 32 bits using the ADDX instruction.

negates the longword in register D3 and overwrites the original value with the negated value.

To facilitate arithmetic operations on values that are larger than 32 bits, the ADDX.L, SUBX.L, and NEGX.L instructions use the X flag as a carry-in bit. All operands must be in data registers. All of the condition code flags are affected. For example, Figure C.4 shows how the ADDX instruction is used to add numbers that are too large to fit into 32-bit registers. The two hexadecimal values to be added are 10A72C10F8 and 4A5C00FE04. Registers D2 and D3 are loaded with the low- and high-order bits of 10A72C10F8, respectively. Similarly, registers D4 and D5 are loaded with the low- and high-order bits of 4A5C00FE04. The ADD.L instruction is used to add the low-order 32 bits. This addition generates a carry-out of 1 that causes the X and C flags to be set. The ADDX.L instruction then uses the new value of the X flag as the carry-in bit when adding the high-order bits. The low- and high-order bits of the sum are in registers D4 and D5. The C and X flags are both affected by the result of the ADDX.L instruction, but in this example, they will not be used because the desired 64-bit addition has been completed.
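The carry-propagation scheme of Figure C.4 can be sketched in Python, with an explicit X-flag variable standing in for the hardware flag (the function name is invented for illustration):

```python
MASK32 = 0xFFFFFFFF

def add64(low_a, high_a, low_b, high_b):
    """Add two 64-bit values held as 32-bit halves, propagating the
    carry through an X-flag bit the way ADD.L followed by ADDX.L does
    in Figure C.4."""
    low = low_a + low_b
    x_flag = low >> 32                # carry-out of the low-order add
    low &= MASK32                     # keep 32 bits, as a register would
    high = (high_a + high_b + x_flag) & MASK32  # ADDX.L: add with X
    return low, high

# The two values from the Figure C.4 example, split into 32-bit halves:
low, high = add64(0xA72C10F8, 0x10, 0x5C00FE04, 0x4A)
```

Recombining the halves as high × 2³² + low reproduces the full 64-bit sum of 10A72C10F8 and 4A5C00FE04.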

Multiplication
Multiplication is performed by using the MULS and MULU instructions for signed and unsigned operands, respectively. The operand size can be word or longword. The destination operand must be in a data register.

The instruction

MULS.W #1340, D5

multiplies 1340 by the signed 16-bit value in the low-order bits of register D5, and places the 32-bit product in D5.

The instruction

MULS.L D2, D5

multiplies the longwords in registers D2 and D5, truncates the product to 32 bits, and places it in D5.

The MULU.W and MULU.L instructions perform the same operations on unsigned operands. As a result of multiplication, the N and Z condition code flags are set or cleared based on the value of the product, while the V and C flags are cleared to zero.


MOVE.W #$FFFF, D2    The low-order word of D2 is treated as –1.
MOVE.W #$0001, D3    The low-order word of D3 contains 1.
MULS.W D2, D3        The signed longword result in D3 is –1
                     or $FFFFFFFF, hence the N flag is set.

(a) Signed computation of –1 × 1 = –1

MOVE.W #$FFFF, D2    The low-order word of D2 is treated as 65535.
MOVE.W #$0001, D3    The low-order word of D3 contains 1.
MULU.W D2, D3        The unsigned longword result in D3 is 65535
                     or $0000FFFF, hence the N flag is cleared.

(b) Unsigned computation of 65535 × 1 = 65535

Figure C.5 Signed versus unsigned multiplication.

Figure C.5 shows how different results are obtained for signed and unsigned multiplication of the same pair of word operands, $FFFF and $0001. The MULS.W instruction in Figure C.5a treats the low-order word value of $FFFF in register D2 as −1. The longword result for the signed multiplication is $FFFFFFFF representing −1, and the N flag is set. In contrast, the MULU.W instruction in Figure C.5b treats $FFFF as the unsigned number 65535, and the longword result for the unsigned multiplication is $0000FFFF representing 65535, which causes the N flag to be cleared.
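The two interpretations of the same 16-bit pattern can be sketched in Python (the helper name is invented for illustration):

```python
def as_signed16(word):
    """Interpret a 16-bit pattern as a signed value, the MULS view."""
    return word - 0x10000 if word & 0x8000 else word

pattern = 0xFFFF
muls = as_signed16(pattern) * 1   # MULS.W view: -1 * 1 = -1
mulu = pattern * 1                # MULU.W view: 65535 * 1 = 65535

assert muls == -1 and (muls & 0xFFFFFFFF) == 0xFFFFFFFF
assert mulu == 65535 == 0x0000FFFF
```

The bits of the operand are identical in both cases; only the interpretation of the sign bit differs, which is exactly the distinction between MULS and MULU in Figure C.5.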

Division
Division is performed by using the DIVS and DIVU instructions for signed and unsigned operands, respectively. The operand size can be word or longword. The destination operand must be in a data register.

The instruction

DIVS.W #2500, D1

divides the 32-bit value in register D1 by the 16-bit immediate operand 2500. The 16-bit quotient is placed in the low-order bits of D1, and the 16-bit remainder is placed in the high-order bits of D1.

The instruction

DIVS.L D2, D1

divides the value in D1 by the value in D2, and places the quotient in D1. The remainder is discarded.


The DIVU.W and DIVU.L instructions perform the same operations on unsigned operands. As a result of division, the N and Z condition code flags are set or cleared based on the value of the quotient, while the V and C flags are cleared to zero.

Since the remainder is discarded in the DIVS.L and DIVU.L operations, there exist two other instructions that can be used to obtain the remainder when needed. The instruction

REMS.L D2, D1:D4

divides the value in D1 by the value in D2, places the 32-bit remainder in register D4, and leaves the contents of D1 unchanged. Thus, both the remainder and the quotient can be obtained if this instruction is followed by

DIVS.L D2, D1

Note that a third operand, delineated by a colon, must be specified in the REMS.L instruction. The REMU.L instruction performs the same operation as REMS.L, but on unsigned operands. It also requires three operands.
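The quotient/remainder relationship can be sketched in Python. This assumes the 68000-family convention that the signed quotient is truncated toward zero and the remainder takes the sign of the dividend; the function name is invented for illustration (note that Python's own // operator rounds toward negative infinity instead):

```python
def divs_rems(dividend, divisor):
    """Return (quotient, remainder) with the quotient truncated toward
    zero, the behavior assumed here for DIVS.L followed by REMS.L.
    The identity dividend = quotient * divisor + remainder holds."""
    quotient = int(dividend / divisor)        # truncates toward zero
    remainder = dividend - quotient * divisor
    return quotient, remainder

assert divs_rems(17, 5) == (3, 2)
assert divs_rems(-17, 5) == (-3, -2)  # remainder takes the dividend's sign
```

Whatever the rounding convention, the remainder is always what must be added to quotient × divisor to recover the dividend, which is why a REMS.L/DIVS.L pair yields a consistent result.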

Other Arithmetic Instructions
The EXT (Sign extend) instruction is provided to extend the sign bit when increasing the number of bits used to represent a number. It has a single operand that must be in a data register. The size specified determines how the operation is performed. The EXT.L instruction sign-extends the low-order word to a longword, the EXT.W instruction sign-extends the low-order byte to a word, and the EXT.B instruction sign-extends the low-order byte to a longword. For all three instructions, the N and Z condition code flags are affected, based on the result, and the V and C flags are cleared.
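The three sign-extension widths can be sketched in Python (the function names are invented for illustration):

```python
def ext_w(value):
    """EXT.W effect: sign-extend the low-order byte to a 16-bit word."""
    b = value & 0xFF
    return b | 0xFF00 if b & 0x80 else b

def ext_l(value):
    """EXT.L effect: sign-extend the low-order word to a 32-bit longword."""
    w = value & 0xFFFF
    return w | 0xFFFF0000 if w & 0x8000 else w

assert ext_w(0x80) == 0xFF80          # -128 as a byte stays -128 as a word
assert ext_l(0x8000) == 0xFFFF8000    # -32768 as a word, as a longword
assert ext_l(ext_w(0xF5)) == 0xFFFFFFF5  # EXT.B effect: byte to longword
```

Composing the two steps reproduces the EXT.B behavior, which extends a byte directly to a longword.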

The CLR (Clear) instruction is provided to clear bits in a specified operand. The size specifier indicates whether a longword, word, or byte is to be cleared. The operand may be in memory or in a data register. The Z flag is set by this instruction, and the N, V, and C flags are cleared.

C.3.4 Branch and Jump Instructions

Conditional branch instructions have the format

Bcc LABEL

where cc specifies the condition code. Table C.2 summarizes the condition codes and the corresponding combinations of the condition code flags that are tested. For example, the BEQ (Branch-if-equal) instruction causes a branch if the Z flag is set, whereas BGE (Branch-if-greater-than-or-equal) depends on the state of the N and V flags. There is also an unconditional branch instruction, BRA, where the branch is always taken.

Branch instructions specify a signed offset that is added to the value of the programcounter to determine the target address. Three types of offsets are provided, based on thedistance between the location following the branch instruction and the target of the branch.In the first type, a small offset of 8 bits is included in the OP-code word when the distanceis within ±127 bytes. In the second type, a larger 16-bit offset is specified in the extensionword that follows the OP-code word when the distance is up to ±32 Kbytes. In the third

Page 607: Carl Hamacher

December 15, 2010 09:31 ham_338065_appc Sheet number 13 Page number 583 cyan black

C.3 Instructions 583

Table C.2 Condition codes for Bcc instructions.

Suffix cc   Name               Test condition

HI          High               C ∨ Z = 0
LS          Low or same        C ∨ Z = 1
CC          Carry clear        C = 0
CS          Carry set          C = 1
NE          Not equal          Z = 0
EQ          Equal              Z = 1
VC          Overflow clear     V = 0
VS          Overflow set       V = 1
PL          Plus               N = 0
MI          Minus              N = 1
GE          Greater or equal   N ⊕ V = 0
LT          Less than          N ⊕ V = 1
GT          Greater than       Z ∨ (N ⊕ V) = 0
LE          Less or equal      Z ∨ (N ⊕ V) = 1

type, a 32-bit offset can be specified in two extension words when the distance to the target of the branch exceeds the range supported by a 16-bit offset.
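The flag tests in Table C.2 can be mirrored in a few lines of Python; the function name and the flag encoding (each flag passed as 0 or 1) are choices made here for illustration only:

```python
def branch_taken(cc, n, z, v, c):
    """Evaluate a Bcc test condition from Table C.2 (flags are 0 or 1)."""
    tests = {
        'HI': (c | z) == 0, 'LS': (c | z) == 1,
        'CC': c == 0,       'CS': c == 1,
        'NE': z == 0,       'EQ': z == 1,
        'VC': v == 0,       'VS': v == 1,
        'PL': n == 0,       'MI': n == 1,
        'GE': (n ^ v) == 0, 'LT': (n ^ v) == 1,
        'GT': (z | (n ^ v)) == 0,
        'LE': (z | (n ^ v)) == 1,
    }
    return tests[cc]

print(branch_taken('GT', 0, 0, 0, 0))   # True
```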

The JMP (Jump) instruction performs an unconditional jump to a specified location for the next instruction to be executed. A single operand specifies the target address. The addressing modes that can be used in the JMP instruction are Absolute, Indirect, Basic and Full index, and Basic and Full relative. For example, the instruction

JMP (A3)

jumps to the location given by the contents of address register A3.

To illustrate the use of a branch instruction in a loop, Figure C.6 shows a ColdFire version of the loop program in Figure 2.26 for adding numbers in a list. The Autoincrement addressing mode is used in the ADD.L instruction to automatically increment the pointer to the entries in the list. The BGT (Branch-if-greater-than) instruction checks the condition code flags that are set or cleared as a result of the execution of the SUBQ.L instruction. This is the instruction that decrements the count of the number of elements in the list remaining to be processed. It is a version of the SUBI instruction that is used when the immediate value can be represented with 3 bits.

Figure C.7 shows the format of a conditional branch instruction with a small offset and how the three instructions in the loop of the program in Figure C.6 would appear when stored in the memory. The BGT instruction requires a negative offset that is computed as

Offset = TargetAddress − [PC]


584 APPENDIX C • The ColdFire Processor

        MOVEA.L  #NUM1, A2     Put the address NUM1 in A2.
        MOVE.L   N, D1         Put the number of entries n in D1.
        CLR.L    D0
LOOP:   ADD.L    (A2)+, D0     Accumulate sum in D0.
        SUBQ.L   #1, D1
        BGT      LOOP
        MOVE.L   D0, SUM       Store the result when finished.

Figure C.6 A ColdFire version of the list-summing program in Figure 2.26.

[Figure C.7 sketch, reconstructed from the original figure:

(a) Short-offset branch instruction format: a single OP-code word with the OP code in bits 15–8 and the 8-bit offset in bits 7–0. Branch address = [updated PC] + offset.

(b) Example of using a branch instruction in the loop of Figure C.6: the ADD.L (A2)+, D0, SUBQ.L #1, D1, and BGT LOOP instructions are stored at addresses 1000 (LOOP), 1002, and 1004, respectively. [PC] = 1006 when the branch address is computed, so Offset = 1000 − 1006 = −6.]

Figure C.7 Branch instruction format and appearance in memory.

The target address is 1000. At the time that the BGT instruction is executed, the value in the PC will be 1006 because the PC is incremented after fetching the OP-code word for the BGT instruction. Hence, the offset is 1000 − 1006 = −6. An 8-bit offset is sufficient for this small value, which means that this BGT instruction can be encoded using a single word.
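The offset computation and the choice among the three offset sizes can be sketched as follows; the ranges shown assume ordinary two's-complement limits for 8-, 16-, and 32-bit signed offsets:

```python
def branch_offset(target_address, updated_pc):
    """Offset = TargetAddress - [PC], where [PC] points past the OP-code word."""
    offset = target_address - updated_pc
    if -128 <= offset <= 127:
        size = 8            # fits in the OP-code word itself
    elif -32768 <= offset <= 32767:
        size = 16           # needs one extension word
    else:
        size = 32           # needs two extension words
    return offset, size

print(branch_offset(1000, 1006))   # (-6, 8), as in Figure C.7
```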


        MOVEA.L  #LIST, A2     Get the address LIST.
        CLR.L    D3
        CLR.L    D4
        CLR.L    D5
        MOVE.L   N, D6         Load the value n.
LOOP:   ADD.L    4(A2), D3     Add current student mark for Test 1.
        ADD.L    8(A2), D4     Add current student mark for Test 2.
        ADD.L    12(A2), D5    Add current student mark for Test 3.
        ADDA.L   #16, A2       Increment the pointer.
        SUBQ.L   #1, D6        Decrement the counter.
        BGT      LOOP          Loop back if not finished.
        MOVE.L   D3, SUM1      Store the total for Test 1.
        MOVE.L   D4, SUM2      Store the total for Test 2.
        MOVE.L   D5, SUM3      Store the total for Test 3.

Figure C.8 A ColdFire version of the program in Figure 2.11 for summing test scores.

A more involved example that also illustrates some other aspects of instructions is shown in Figure C.8. It is a ColdFire version of the program in Figure 2.11, which computes the sum of all scores for three tests taken by a group of students. The availability of the Basic index mode in the ADD instructions to access memory operands obviates the need for the Load instructions within the loop body in Figure 2.11.

C.3.5 Logic Instructions

Logic instructions require longword operands and at least one operand must be in a data register. The following instructions are available:

• AND.L (Bitwise Logical AND)
• ANDI.L (Bitwise Logical AND; source operand is an immediate value)
• OR.L (Bitwise Logical OR)
• ORI.L (Bitwise Logical OR; source operand is an immediate value)
• EOR.L (Bitwise Logical Exclusive-OR; source operand must be in a data register)
• EORI.L (Bitwise Logical Exclusive-OR; source operand is an immediate value)
• NOT.L (Bitwise Complement; single data-register operand)

All logic instructions affect the N and Z condition code flags, based on the result; the V and C flags are cleared.

For example, the instruction

ANDI.L #$FF, D5


performs a bitwise logical AND of the hexadecimal value 000000FF and the longword in data register D5, and places the result in D5. The instruction

EOR.L D3, (A6)+

performs a bitwise logical Exclusive-OR of the longword in register D3 with the longword at the memory address given by the contents of register A6. The result is placed at the same address. The contents of register A6 are then incremented by four.
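The two instructions above behave like ordinary bitwise operators on 32-bit values; a quick Python sketch, with register and memory contents chosen arbitrarily for illustration:

```python
# ANDI.L #$FF, D5 : keep only the low-order byte of the longword in D5
d5 = 0x12345678
d5 &= 0x000000FF
print(hex(d5))            # 0x78

# EOR.L D3, (A6)+ : XOR a memory longword with D3, store it back, advance A6 by 4
d3 = 0xFFFF0000
mem_longword = 0x0F0F0F0F
mem_longword ^= d3
print(hex(mem_longword))  # 0xf0f00f0f
```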

C.3.6 Shift Instructions

Shift instructions have two operands: the destination operand must be in a data register holding the longword value to be shifted, and the source operand specifies the shift amount either as an immediate value or as the contents of a data register. The immediate value must be between 1 and 8, which is encoded in the instruction. If the shift amount is in a data register, its value is interpreted modulo 64, even though the data register size is only 32 bits.

For all shift instructions, the last bit that is shifted out of the destination data register is copied into the C and X condition code flags. Each bit that is shifted into the destination data register is zero, except for arithmetic shifts to the right where the value of the sign in bit b31 is preserved. Each of these cases is shown in Figure 2.23. Based on the final result after shifting, the N and Z flags are affected. The V flag is cleared.

The available shift instructions are:

• LSL.L (Logical Shift Left)
• LSR.L (Logical Shift Right)
• ASL.L (Arithmetic Shift Left)
• ASR.L (Arithmetic Shift Right; sign bit is preserved)

The following examples illustrate the differences between shift operations. Assume that data register D4 initially contains hexadecimal 80000000 (bit b31 is one and all other bits are zero). The instruction

LSR.L #6, D4

shifts the value in D4 to the right by 6 bits, inserting zero bits into the left end. The result in D4 is hexadecimal 02000000. The C and X flags are cleared because the last bit shifted out of D4 is zero.

For the same initial value in D4, the instruction

ASR.L #6, D4

also shifts the value in D4 to the right by 6 bits, but it preserves the sign in bit b31. Therefore, the bits shifted into the data register on the left end are 1s. The result in D4 is FE000000. The C and X flags are cleared because the last bit shifted out is zero.
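The difference between LSR.L and ASR.L can be checked with a small Python model of 32-bit shifts; the helper names are ours, and flag behavior is not modeled:

```python
MASK32 = 0xFFFFFFFF

def lsr(value, count):
    """LSR.L: shift right, inserting zeros on the left."""
    return (value & MASK32) >> count

def asr(value, count):
    """ASR.L: shift right, replicating the sign bit b31 on the left."""
    value &= MASK32
    if value & 0x80000000:          # negative: convert to a signed Python int
        value -= 1 << 32
    return (value >> count) & MASK32  # Python's >> is arithmetic on signed ints

print(hex(lsr(0x80000000, 6)))   # 0x2000000
print(hex(asr(0x80000000, 6)))   # 0xfe000000
```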


        MOVEA.L  #LOC, A0      A0 points to two consecutive bytes.
        MOVE.B   (A0)+, D0     Load first byte into D0.
        LSL.L    #4, D0        Shift left by 4 bit positions.
        MOVE.B   (A0), D1      Load second byte into D1.
        ANDI.L   #$F, D1       Clear all high-order bits in D1.
        OR.L     D0, D1        Concatenate the digits.
        MOVE.B   D1, PACKED    Store the result.

Figure C.9 Use of logic and shift instructions in packing BCD digits.

Example C.1

Consider the BCD digit-packing program of Figure 2.24. The ColdFire version of this program is shown in Figure C.9. The two ASCII-encoded characters in consecutive memory byte locations are brought into registers D0 and D1. The LSL instruction shifts the first byte in D0 four bit positions to the left, filling the low-order four bits with zeros. The ANDI instruction clears all of the high-order bits in register D1 to zero. The 4-bit patterns that are the desired BCD codes are subsequently combined in D1 with the OR instruction. Finally, the byte of interest, which is the rightmost byte in register D1, is placed in memory location PACKED. Note that the LSL, AND, and OR instructions affect all 32 bits in the register operand, but the desired packed byte is generated properly in the lower-most 8 bits of D1.
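Under the assumption that both characters are ASCII digits, the packing steps of Figure C.9 reduce to a shift, a mask, and an OR; a Python sketch (the function name is ours):

```python
def pack_bcd(ch1, ch2):
    """Pack two ASCII decimal digits into one byte, as in Figure C.9."""
    d0 = (ord(ch1) << 4) & 0xFFFFFFFF   # LSL.L #4: move first digit toward the high nibble
    d1 = ord(ch2) & 0xF                 # ANDI.L #$F: keep low nibble of second character
    return (d0 | d1) & 0xFF             # OR.L, then store only the low-order byte

print(hex(pack_bcd('5', '7')))  # 0x57
```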

C.3.7 Subroutine Linkage Instructions

ColdFire provides instructions and a processor stack to support subroutines and parameter passing in the manner outlined in Section 2.7. Address register A7 serves as the stack pointer, and it must always have a value that is longword-aligned, that is, a multiple of 4. This register should not be used for any other purpose. The stack grows in the direction of decreasing memory addresses. The stack pointer value in register A7 is decremented by 4 to push new information onto the stack, and incremented by 4 to pop information off the stack.

There are two subroutine call instructions, BSR (Branch-to-subroutine) and JSR (Jump-to-subroutine). The BSR instruction specifies the address of a subroutine with an 8-bit, 16-bit, or 32-bit offset in the same manner as other branch instructions. The JSR instruction may use the Absolute, Indirect, Basic and Full index, or Basic and Full relative addressing modes to generate the target address. Both BSR and JSR automatically push the return address onto the processor stack, rather than saving it in a link register as described in Section 2.7.

At the end of a subroutine, the RTS (Return-from-subroutine) instruction is used to return to the calling program. The RTS instruction pops the return address off the top of the stack and loads it into the program counter.


Parameter Passing

Section 2.7.2 discusses two different methods of parameter passing, which are illustrated by using the example program in Figure 2.26 for adding numbers in a list. Figure C.6 provides a ColdFire version of this program. It will be the basis for the discussion that follows.

Figure C.10 is a ColdFire version of the program in Figure 2.17 for passing parameters through registers. The starting address of the list of values to be added is passed to the subroutine by placing it in register A2. The number of elements in the list is passed to the subroutine in register D1. After adding all of the elements in the list, the subroutine returns the sum in register D0.

Figure C.11 shows a ColdFire version of the program in Figure 2.18 for passing parameters via the processor stack. Prior to calling the subroutine, the starting address of the list and the number of elements are pushed onto the stack. The subroutine retrieves these values from the stack. Upon completion of the subroutine, the result being returned is placed on the stack to be retrieved by the calling program.

The program in Figure C.11 also illustrates how the MOVEM (Move Multiple Registers) instruction is used to save or restore registers to or from consecutive locations in memory. The MOVEM instruction has two operands, and the order of the operands dictates whether register values are written to or read from memory. One operand is a list of individual registers or ranges of registers, e.g., D0−D1/A2. The other operand is the starting address for the values in memory, which must be specified with either the Indirect mode

Calling program

        MOVEA.L  #NUM1, A2     Put the address NUM1 in A2.
        MOVE.L   N, D1         Put the number of elements n in D1.
        BSR      LISTADD       Call subroutine LISTADD.
        MOVE.L   D0, SUM       Store the sum in SUM.
        next instruction
        ...

Subroutine

LISTADD: CLR.L    D0
LOOP:    ADD.L    (A2)+, D0    Accumulate sum in D0.
         SUBQ.L   #1, D1
         BGT      LOOP
         RTS

Figure C.10 Program of Figure C.6 written as a subroutine; parameters passed through registers.


(Assume that the top of stack is initially at level 1 in the diagram below.)

Calling program

        MOVE.L   #NUM1, –(A7)  Push parameters onto stack.
        MOVE.L   N, –(A7)
        BSR      LISTADD
        MOVE.L   4(A7), D0     Get result from the stack.
        MOVE.L   D0, SUM       Save result.
        ADDA.L   #8, A7        Restore top of stack.
        ...

Subroutine

LISTADD: ADDA.L   #–12, A7       Adjust stack pointer to allocate space.
         MOVEM.L  D0–D1/A2, (A7) Save registers D0, D1, and A2.
         MOVE.L   16(A7), D1     Initialize counter to n.
         MOVEA.L  20(A7), A2     Initialize pointer to the list.
         CLR.L    D0             Initialize sum to 0.
LOOP:    ADD.L    (A2)+, D0      Add entry from list.
         SUBQ.L   #1, D1
         BGT      LOOP
         MOVE.L   D0, 20(A7)     Put result on the stack.
         MOVEM.L  (A7), D0–D1/A2 Restore registers.
         ADDA.L   #12, A7        Adjust stack pointer to deallocate space.
         RTS

(a) Calling program and subroutine

(b) Stack contents at different times

   Level 3  →   [D0]
                [D1]
                [A2]
   Level 2  →   Return address
                n
                NUM1
   Level 1  →   (initial top of stack)

Figure C.11 Program of Figure C.6 written as a subroutine; parameters passed on the stack.


or Basic index mode. The MOVEM instruction writes or reads memory in the direction of increasing addresses from the given starting location.

At the beginning of the subroutine, the ADDA.L instruction adjusts the stack pointer from Level 2 to Level 3 in Figure C.11b. This allocates space for saving the contents of three registers. The MOVEM instruction then writes the contents of registers D0, D1, and A2 into the allocated space. At the end of the subroutine, these values are read from the stack and loaded back into the registers with a similar MOVEM instruction, except that the source/destination order is reversed. The final step before returning from the subroutine is to readjust the stack pointer from Level 3 to Level 2 to deallocate the space that was previously allocated for saving the registers.
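A minimal model of what MOVEM does with memory, assuming longword values and ascending addresses; the function names and addresses are ours, not ColdFire syntax:

```python
def movem_save(memory, addr, values):
    """Write each value to consecutive longword locations, ascending addresses."""
    for i, v in enumerate(values):
        memory[addr + 4 * i] = v

def movem_restore(memory, addr, count):
    """Read count longwords back from the same consecutive locations."""
    return [memory[addr + 4 * i] for i in range(count)]

memory = {}
a7 = 0x1000 - 12                    # like ADDA.L #-12, A7: room for three registers
movem_save(memory, a7, [0xD0D0, 0xD1D1, 0xA2A2])   # save D0, D1, A2
restored = movem_restore(memory, a7, 3)            # restores the saved values in order
```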

Stack Frames for Nested Subroutines

For nested subroutine calls using either BSR or JSR instructions, the return address for each call is automatically pushed onto the processor stack. Subsequently, RTS instructions are executed when subroutines in the nested sequence are completed. As each RTS instruction is executed, the corresponding return address is popped from the stack.

As discussed in Section 2.7.3, stack frames provide workspaces in memory for each subroutine in a nested sequence. In addition to the stack pointer A7, another address register can be used as the frame pointer within a subroutine to identify the current stack frame.

ColdFire provides two special instructions for managing stack frames. The instruction

LINK Ai, #disp

is used at the beginning of a subroutine to allocate a stack frame. It performs the following operations:

1. Pushes the contents of register Ai, the frame pointer, onto the processor stack

2. Copies the contents of the stack pointer, A7, into the frame pointer, Ai

3. Adds the specified displacement value to the stack pointer, A7

The displacement value is a negative number so that the stack grows to allocate additional space for local variables in the subroutine. These variables can be accessed with the Basic index or Full index addressing modes using register Ai, the frame pointer. The displacement can also be used to allocate space for saving the values of registers used by the subroutine.

The second special instruction

UNLK Ai

is used at the end of a subroutine to deallocate the stack frame. It reverses the actions of the LINK instruction. It loads A7 from Ai, thus lowering the top of the stack to its original position, where it was before adding the displacement value. Then it pops the saved contents of register Ai off the stack and loads them back into Ai.
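The three LINK steps and their reversal by UNLK can be modeled with a toy stack; the class, the register names, and the starting addresses here are illustrative only:

```python
class Cpu:
    """Toy model of A7 (stack pointer), one frame pointer, and longword memory."""
    def __init__(self):
        self.mem = {}          # address -> longword
        self.a7 = 0x2000       # stack pointer (grows toward lower addresses)
        self.fp = 0x0          # plays the role of register Ai

    def link(self, disp):      # LINK Ai, #disp (disp is negative)
        self.a7 -= 4
        self.mem[self.a7] = self.fp   # 1. push old frame pointer
        self.fp = self.a7             # 2. copy A7 into Ai
        self.a7 += disp               # 3. allocate local space

    def unlk(self):            # UNLK Ai
        self.a7 = self.fp             # deallocate the local space
        self.fp = self.mem[self.a7]   # pop the saved frame pointer
        self.a7 += 4

cpu = Cpu()
sp0, fp0 = cpu.a7, cpu.fp
cpu.link(-16)                  # as in LINK A6, #-16 of Figure C.12
cpu.unlk()
assert (cpu.a7, cpu.fp) == (sp0, fp0)   # UNLK exactly undoes LINK
```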

To illustrate the use of the LINK and UNLK instructions, Figure C.12 provides ColdFire code for the program with nested subroutine calls presented in Figure 2.21. The stack frames for subroutines SUB1 and SUB2 are shown in Figure C.13. The execution flow is as follows:

• The calling program pushes parameters param2 and param1 onto the stack for use by subroutine SUB1. When SUB1 is called by the BSR instruction, the return address 2014


Memory location      Instructions                     Comments

Calling program
...
2000   MOVE.L   PARAM2, –(A7)   Place parameters on stack.
2006   MOVE.L   PARAM1, –(A7)
2012   BSR      SUB1
2014   MOVE.L   (A7), RESULT    Store result.
2020   ADDA.L   #8, A7          Restore stack level.
2024   next instruction

...

First subroutine

2100   SUB1: LINK     A6, #–16        Set frame pointer and allocate stack space.
2104         MOVEM.L  D0–D2/A0, (A7)  Save registers.
             MOVEA.L  8(A6), A0       Load parameters.
             MOVE.L   12(A6), D0
             ...
             MOVE.L   PARAM3, –(A7)   Place a parameter on stack.
2160         BSR      SUB2
2164         MOVE.L   (A7)+, D1       Pop result from SUB2 into D1.
             ...
             MOVE.L   D2, 8(A6)       Place result on stack.
             MOVEM.L  (A7), D0–D2/A0  Restore registers.
             UNLK     A6              Restore frame pointer and deallocate stack space.
             RTS                      Return.

Second subroutine

3000   SUB2: LINK     A6, #–8        Set frame pointer and allocate stack space.
             MOVEM.L  D0–D1, (A7)    Save registers.
             MOVE.L   8(A6), D0      Load parameter.
             ...
             MOVE.L   D1, 8(A6)      Place result on stack.
             MOVEM.L  (A7), D0–D1    Restore registers.
             UNLK     A6             Restore frame pointer and deallocate stack space.
             RTS                     Return.

Figure C.12 Nested subroutines.


[Figure C.13 sketch, reconstructed from the original figure: stack contents while SUB2 is executing, top of stack (lowest address) first.]

Stack frame          [D0] from SUB1
for SUB2             [D1] from SUB1
          A6  →      [A6] from SUB1
                     2164
                     param3
Stack frame          [D0] from Main
for SUB1             [D1] from Main
                     [D2] from Main
                     [A0] from Main
                     [A6] from Main
                     2014
                     param1
                     param2
                     Old TOS

Figure C.13 Stack frames for program in Figure C.12.

is pushed onto the stack. The address 2100 for SUB1 is within 128 bytes of the BSR instruction, hence an 8-bit offset may be used and the BSR instruction requires only the OP-code word.
• Subroutine SUB1 begins by allocating a stack frame with the instruction

LINK A6, #−16

which saves the current value of A6 on the stack, writes the value of A7 into A6 to define the new frame pointer, and then adjusts the value of A7 to allocate space on the stack by an amount sufficient to save four registers to be used by SUB1.
• The current values of the four registers to be used by SUB1 are saved on the stack using the MOVEM instruction.
• SUB1 then retrieves the values of param1 and param2 from the stack using the Basic index mode with the frame pointer. After performing some computation, SUB1 pushes param3 onto the stack and calls subroutine SUB2. The return address 2164 is pushed on the stack. The subroutine address 3000 is separated by more than 128 bytes from the BSR instruction, hence a 16-bit offset is included in the instruction in one extension word.


• SUB2 begins with its own LINK instruction to allocate a new stack frame with space for saving register values, followed by the MOVEM instruction that writes the values of two registers on the stack. The value of param3 is then retrieved from the stack.
• After performing its computation, SUB2 places its result on the stack, overwriting param3. The two saved registers, D0 and D1, are restored from the stack, and the UNLK instruction deallocates the current stack frame and restores the previous frame pointer before returning to SUB1.
• Execution within SUB1 resumes at the return address 2164. The result from SUB2 is popped off the stack. SUB1 completes its computation and places its result on the stack, overwriting param1. The four saved registers are restored from the stack, and the UNLK instruction prepares for the return to the calling program.
• The main program resumes execution at the return address 2014. The result from SUB1 is retrieved from the stack. Finally, the ADDA instruction adjusts the value of register A7 to point to the initial top-of-stack element, labeled as Old TOS in Figure C.13.

C.4 Assembler Directives

The assembler directives discussed in Section 2.5 can be used in ColdFire programs with only small differences in notation. The assembler provided by Freescale Semiconductor requires that each directive begins with a period to distinguish it from instruction mnemonics.

• The starting address of a block of instructions or data is specified with the ORG directive.

• The EQU directive equates names with numerical values.
• Data constants are inserted into an object program using the DC (Define Constant) directive. Several items may be defined in a single DC directive. The size of the items is indicated by the suffix L, W, or B. For example, the statements

        .ORG 100
ITEMS:  .DC.B 23,$4F,%10110101

result in byte-sized hexadecimal values 17 (decimal 23), 4F, and B5 being placed into memory locations 100, 101, and 102, respectively. The label ITEMS is assigned the value 100. Note that the ‘$’ character denotes a hexadecimal value and the ‘%’ character denotes a binary value.

• A block of uninitialized memory can be reserved for data by means of the DS (Define Storage) directive, with a suffix to indicate the data size. For example, the statement

ARRAY: .DS.L 200

reserves 200 longwords and associates the label ARRAY with the address of the first longword.
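The three constants in the .DC.B example above can be checked directly in Python, since the ‘$’ and ‘%’ prefixes correspond to Python's 0x and 0b notations:

```python
# .DC.B 23, $4F, %10110101 assembles to these byte values:
items = [23, 0x4F, 0b10110101]
print([format(b, '02X') for b in items])   # ['17', '4F', 'B5']
```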

The use of assembler directives is illustrated in Figure C.14, which corresponds to the list-summing program in Figure C.6.


        .ORG     100           Instructions begin at address 100.
        MOVEA.L  #NUM1, A2     Put the address NUM1 in A2.
        MOVE.L   N, D1         Put the number of entries n in D1.
        CLR.L    D0
LOOP:   ADD.L    (A2)+, D0     Accumulate sum in D0.
        SUBQ.L   #1, D1
        BGT      LOOP
        MOVE.L   D0, SUM       Store the result when finished.

        .ORG     200           Data begins at address 200.
SUM:    .DS.L    1             One longword reserved for sum.
N:      .DC.L    150           There are n = 150 longwords in list.
NUM1:   .DS.L    150           Reserve memory for 150 longwords.

Figure C.14 A ColdFire program that corresponds to Figure 2.13.

C.5 Example Programs

In this section, we show ColdFire versions of the programs for dot product and string search that are described in Chapter 2.

C.5.1 Vector Dot Product Program

The program in Figure 2.28 computes the dot product of two vectors, AVEC and BVEC. The ColdFire version is shown in Figure C.15. We have assumed that the vector elements are represented as signed 16-bit words. The MULS.W instruction multiplies two signed

        MOVEA.L  #AVEC, A1     Address of first vector.
        MOVEA.L  #BVEC, A2     Address of second vector.
        MOVE.L   N, D0         Set counter to number of elements.
        CLR.L    D1            Use D1 as accumulator.
LOOP:   MOVE.W   (A1)+, D2     Get element from vector A.
        MULS.W   (A2)+, D2     Multiply element from vector B.
        ADD.L    D2, D1        Accumulate product.
        SUBQ.L   #1, D0        Decrement counter.
        BGT      LOOP          Repeat if counter is greater than zero.
        MOVE.L   D1, DOTPROD   Save the result when finished.

Figure C.15 A program for computing the dot product of two vectors.


16-bit numbers and produces a 32-bit product. The result of each multiplication is then accumulated into a 32-bit sum.
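The computation of Figure C.15 corresponds to the following sketch, which limits the accumulator to 32 bits the way register D1 does; the helper name is ours:

```python
def dot_product(avec, bvec):
    """Accumulate 16-bit signed products into a 32-bit signed sum (MULS.W + ADD.L)."""
    total = 0
    for a, b in zip(avec, bvec):
        total = (total + a * b) & 0xFFFFFFFF   # keep 32 bits, like D1
    # interpret the 32-bit accumulator as a signed value
    return total - (1 << 32) if total & 0x80000000 else total

print(dot_product([1, -2, 3], [4, 5, 6]))   # 12
```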

C.5.2 String Search Program

The program in Figure 2.31 determines the first matching instance of a pattern string P in a given target string T. The ColdFire version is shown in Figure C.16.

Recall that the CMP instruction is restricted to longword operands. Therefore, it is not possible to perform a byte-sized comparison between a register value and a memory operand with one instruction in the manner shown in Figure 2.31. Instead, the two characters to be compared must first be brought into registers D0 and D1 using separate MOVE.B instructions, as shown in Figure C.16, and then they are compared using CMP.L. Because the MOVE.B instruction affects only the low-order byte of the destination register, it is also necessary to clear all 32 bits of registers D0 and D1 before entering the main loop so that the result of the comparison is correct in each loop iteration.

          MOVEA.L  #T, A2        A2 points to string T.
          MOVEA.L  #P, A3        A3 points to string P.
          MOVEA.L  N, A4         Get the value n.
          MOVEA.L  M, A5         Get the value m.
          SUBA.L   A5, A4        Compute n – m.
          ADDA.L   A2, A4        A4 is the address of T(n – m).
          ADDA.L   A3, A5        A5 is the address of P(m).
          CLR.L    D0            Clear data registers that
          CLR.L    D1            will be used in comparisons.
LOOP1:    MOVEA.L  A2, A0        Use A0 to scan through string T.
          MOVEA.L  A3, A1        Use A1 to scan through string P.
LOOP2:    MOVE.B   (A0)+, D0     Compare a pair of
          MOVE.B   (A1)+, D1     characters in
          CMP.L    D0, D1        strings T and P.
          BNE      NOMATCH
          CMPA.L   A1, A5        Check if at P(m).
          BGT      LOOP2         Loop again if not done.
          MOVE.L   A2, RESULT    Store the address of T(i).
          BRA      DONE
NOMATCH:  ADDA.L   #1, A2        Point to next character in T.
          CMPA.L   A2, A4        Check if at T(n – m).
          BGE      LOOP1         Loop again if not done.
          MOVE.L   #–1, RESULT   No match was found.

DONE: next instruction

Figure C.16 A program for string search.
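The logic of Figure C.16, which scans candidate positions T(0) through T(n − m) and compares m characters at each, is equivalent to this Python sketch; it returns an index rather than an address, and −1 on failure:

```python
def string_search(t, p):
    """Return the index of the first occurrence of pattern p in target t, else -1."""
    n, m = len(t), len(p)
    for i in range(n - m + 1):          # candidate positions T(0) .. T(n-m)
        for j in range(m):
            if t[i + j] != p[j]:        # mismatch: advance to the next position
                break
        else:
            return i                    # all m characters matched
    return -1

print(string_search("abracadabra", "cad"))   # 4
```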


C.6 Mode of Operation and Other Control Features

So far, we have described the normal execution of instructions that use operands in the address/data registers and in memory. We have also presented programs that implement simple computing tasks. Input/output operations and other tasks may require additional capabilities, such as interrupts, that alter normal execution behavior. Such behavior can be controlled by using certain bits in the status register shown in Figure C.2. This section describes these bits.

• The S bit selects one of two possible modes of operation. In supervisor mode, the S bit is set to one, and the processor can execute all instructions and access all available functions, including certain privileged features. For example, a MOVE instruction with the status register (SR) as the source or the destination is a privileged instruction because it accesses the S bit and the other bits that control the behavior of the processor. The supervisor mode is entered when the processor is reset. System software, which includes interrupt-service routines, executes in this mode. User mode, on the other hand, is intended for normal application programs. The S bit is equal to zero in this mode, which prevents the use of any privileged features. Switching from user mode to supervisor mode occurs only when entering an interrupt-service routine. Switching from supervisor mode to user mode can be done directly by executing a privileged instruction that modifies the status register or by returning from an interrupt-service routine.
• The three bits, b10−8, represent the current interrupt priority level of the processor. These bits form the interrupt mask, which has a value between 0 and 7. Various interrupt sources are assigned different priority levels between 1 and 7. The interrupt mask setting prevents the processor from responding to an interrupt request from any source whose priority level is the same as or lower than the current mask level. Clearing the mask bits to 0 enables all interrupts. Setting these bits to 7 disables all interrupts, except for non-maskable interrupt requests at level 7. This approach, based on an interrupt mask, differs significantly from the discussion in Section 3.2.1 where a single IE bit in the status register of Figure 3.7 is used to enable all interrupts.
• The M bit is automatically cleared by the processor when an interrupt-service routine is entered. This feature is intended for use by system software and is not relevant for normal application programs.
• The T bit is used for debugging. When it is set to 1, a special interrupt is triggered after the execution of every instruction to allow system software to trace the execution of an application program.

All of the bits described above may be modified only when operating in the supervisor mode. A privileged version of the MOVE instruction allows all of the bits in the status register to be read or written. In user mode, a non-privileged version of the MOVE instruction allows only the condition code flags in the status register to be read or written.


C.7 Input/Output

Chapter 3 described I/O operations based on polling and interrupts using the example of displaying characters that are read from a keyboard. Using the same example and the I/O interface registers shown in Figure 3.3, we now show how ColdFire instructions are used for I/O operations.

Figure C.17 provides a ColdFire version of the program in Figure 3.5 with polling for both input and output of characters. The BTST.B instruction corresponds to the TestBit instruction in Figure 3.5. The use of an immediate operand to specify the bit to be checked by a BTST.B instruction places restrictions on the available addressing modes for the destination operands. The Absolute mode is not available, hence the program in Figure C.17 initializes address registers A3 and A4 with the locations of the status registers for the two I/O interfaces so that the BTST.B instructions can use the Indirect mode. Furthermore, the

        MOVEA.L  #LOC, A2           Initialize register A2 to point to the
                                    address of the first location in main
                                    memory where the characters are to be
                                    stored.
        MOVEA.L  #KBD_STATUS, A3    Initialize register A3 to point to the address
                                    of the status register for the keyboard.
        MOVEA.L  #DISP_STATUS, A4   Initialize register A4 to point to the address
                                    of the status register for the display.
        CLR.L    D0                 Clear the data register to be used for
                                    characters.
READ:   BTST.B   #1, (A3)           Wait for a character to be entered
        BEQ      READ               in the keyboard buffer.
        MOVE.B   KBD_DATA, D0       Read the character from KBD_DATA into
                                    register D0 (this clears KIN to 0).
        MOVE.B   D0, (A2)+          Transfer the character into main memory
                                    and increment the pointer to store the next
                                    character.
ECHO:   BTST.B   #2, (A4)           Wait for the display to become ready.
        BEQ      ECHO
        MOVE.B   D0, DISP_DATA      Move the character just read to the display
                                    buffer register (this clears DOUT to 0).
        CMPI.L   #CR, D0            Check if the character just read is CR
                                    (carriage return). If it is not CR, then
        BNE      READ               branch back and read another character.

Figure C.17 A ColdFire version of the polling-based program in Figure 3.5.


CMP instruction is restricted to longword operands, hence register D0 is cleared before executing the loops in Figure C.17. Each character is read from the keyboard interface into register D0 for later comparison with the carriage-return character, CR. As each character is stored in the memory from register D0, the pointer in register A2 is incremented.

To demonstrate how interrupts are used for I/O operations, Figure C.18 uses ColdFire instructions to implement the example in Figure 3.10. It is assumed that the initialization code in the main program of Figure C.18 is executed with the processor in supervisor mode. Two privileged MOVE instructions access the status register (SR). They are used with the ANDI instruction to clear the interrupt mask bits to 0 and thereby enable all interrupts. Because it is desirable to leave the other bits unchanged, the current contents of the status register are first read into a data register, then only the bits corresponding to the interrupt

          Interrupt-service routine
ILOC:     MOVE.L   A2, –(A7)          Save registers.
          MOVE.L   D0, –(A7)
          MOVEA.L  PNTR, A2           Load address pointer from the memory.
          CLR.L    D0                 Clear register to hold character.
          MOVE.B   KBD_DATA, D0       Read character from keyboard.
          MOVE.B   D0, (A2)+          Write to memory and increment pointer.
          MOVE.L   A2, PNTR           Update pointer in the memory.
          MOVEA.L  #DISP_STATUS, A2   Set A2 with address of status register.
ECHO:     BTST.B   #2, (A2)           Wait for the display to become ready.
          BEQ      ECHO
          MOVE.B   D0, DISP_DATA      Display the character just read.
          CMPI.L   #CR, D0            Check if the character just read is CR.
          BNE      RTRN               Return if not CR.
          ADDQ.L   #1, EOL            Indicate end of line.
          MOVEA.L  #KBD_CONT, A2      Set A2 with address of control register.
          BCLR.B   #1, (A2)           Disable interrupts in keyboard interface.
RTRN:     MOVE.L   (A7)+, D0          Restore registers.
          MOVEA.L  (A7)+, A2
          RTE                         Return from interrupt.

          Main program
START:    MOVEA.L  #LINE, A0          Initialize buffer pointer.
          MOVE.L   A0, PNTR
          CLR.L    EOL                Clear end-of-line indicator.
          MOVEA.L  #KBD_CONT, A0      Set A0 with address of control register.
          BSET.B   #1, (A0)           Enable interrupts in keyboard interface.
          MOVE.W   SR, D0             Read contents of status register into D0.
          ANDI.L   #$F8FF, D0         Clear priority mask to enable interrupts.
          MOVE.W   D0, SR             Write value of D0 to status register.

Figure C.18 A ColdFire version of the interrupt-based program in Figure 3.10.


priority level mask are cleared with the ANDI instruction. Finally, the modified contents are written into the status register.

C.8 Floating-Point Operations

Floating-point operations are included in an extension of the basic ColdFire instruction set [1]. A hardware implementation of a processor with this extension incorporates a separate floating-point unit (FPU) with additional registers. The FPU permits concurrent execution of floating-point operations with other instructions. This section provides a brief summary of the floating-point features of ColdFire. Floating-point number representation is introduced in Chapter 1, and floating-point arithmetic is discussed in Chapter 9.

All floating-point operations are performed internally on 64-bit double-precision numbers. Eight 64-bit floating-point data registers, FP0 to FP7, are provided in the FPU for this purpose. There are also additional control and status registers for the FPU (for details, consult the ColdFire technical documentation [1]). The floating-point status register (FPSR) has condition code flags such as N and Z, as well as additional flags specific to floating-point operations, that are affected by data movement, arithmetic, and comparison operations involving floating-point numbers.

The floating-point registers always hold 64-bit double-precision numbers. Memory may contain numbers in either 32-bit single-precision representation or 64-bit double-precision representation. Single-precision representation reduces the storage requirements when large amounts of floating-point data are involved but very high precision is not needed. The FPU automatically converts any single-precision operands from memory to double-precision numbers before performing arithmetic operations. If desired, double-precision numbers may also be converted to single-precision numbers when transferring them to memory.

The FPU also converts between integer and double-precision floating-point representations, as explained below. Implementing this capability in hardware reduces execution time and code size by eliminating the need for software to perform the conversions.

The floating-point extension of the basic instruction set necessitates additional size specifiers in assembly language. The suffix D is used with floating-point instructions to indicate double-precision operands, and the suffix S indicates single-precision operands.

C.8.1 FMOVE Instruction

The FMOVE instruction transfers data between a memory location and a floating-point register, or between registers. A suffix specifies the operand size, and conversion to or from double-precision representation is performed as required. The source operand may be in a floating-point register (FP0−FP7), a data register (D0−D7), or a memory location specified with the Indirect, Autoincrement, Autodecrement, Basic index, or Basic relative addressing modes. The permitted modes for the destination operand are the same, except that the Basic relative mode is not available. Either the source operand or the destination operand must be contained in a floating-point register. When the destination operand is in a floating-point register, flags such as N and Z in the floating-point status register (FPSR) are


affected according to the final 64-bit double-precision result. When the destination operand is in a data register (D0−D7) or a memory location, the FPSR is not affected.

When the size specifier is D, a 64-bit double-precision floating-point number is transferred without modification from a floating-point register to memory, from memory to a floating-point register, or between two floating-point registers. For example, the instruction

FMOVE.D (A3), FP5

loads the double-precision number from the memory location specified by address register A3 into floating-point register FP5. Similarly, the instruction

FMOVE.D FP2, 16(A5)

stores the double-precision number from register FP2 into the memory location given by adding 16 to the contents of register A5. The instruction

FMOVE.D FP3, FP4

transfers a double-precision number from FP3 to FP4.

When any other operand-size suffix is specified for FMOVE, an automatic conversion is made either to or from double-precision representation for the information being transferred. For example, the instruction

FMOVE.S FP1, (A4)

converts the 64-bit double-precision floating-point number in register FP1 into a 32-bit single-precision floating-point number, then writes the converted number into the location in memory given by the contents of register A4. The instruction

FMOVE.L 16(A2), FP3

reads a longword from memory using indexed addressing, converts this 32-bit integer into a 64-bit double-precision floating-point number, and places the converted number in register FP3. Finally, the instruction

FMOVE.W FP7, D2

converts the 64-bit double-precision floating-point number in register FP7 into a 16-bit integer, and then places this converted number in the low-order bits of register D2. Clearly, there is potential for loss of precision when converting from floating-point to integer representation.
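The effect of these conversions can be sketched in a high-level language. The following Python fragment is only an illustration (it models, rather than executes, ColdFire behavior): single-precision rounding mimics an FMOVE.S store, and integer truncation mimics an FMOVE.W transfer.

```python
import struct

def to_single(x: float) -> float:
    # Round a double to the nearest 32-bit single value, as an
    # FMOVE.S store to memory would.
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_int16(x: float) -> int:
    # Truncate a double toward zero and keep the low-order 16 bits,
    # as an FMOVE.W transfer into a data register would.
    return int(x) & 0xFFFF

print(to_int16(1234.75))      # 1234: the fraction is lost
print(to_single(0.1) == 0.1)  # False: single has fewer significand bits
```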

C.8.2 Floating-Point Arithmetic Instructions

The basic instructions for performing arithmetic operations on floating-point numbers are FADD, FSUB, FMUL, and FDIV. In all cases, the destination operand must be in a floating-point register. The condition code flags such as N and Z in the FPSR are affected by the final 64-bit double-precision result. The source operand may be in a floating-point register, a data register, or a memory location specified with Indirect, Autoincrement, Autodecrement, Basic index, or Basic relative addressing modes. A suffix is required to indicate the format of the source operand. For any suffix other than D, the source operand is automatically


converted to double-precision representation before performing the specified arithmetic operation. For example, the instruction

FADD.W (A2)+, FP6

reads a 16-bit integer from the memory location given by address register A2, automatically increments A2 by two, converts the 16-bit integer to a 64-bit double-precision floating-point number, then adds the converted number to the number in register FP6.

C.8.3 Comparison and Branch Instructions

Comparison of two floating-point numbers may be performed with the FCMP instruction. The outcome of the comparison affects flags such as N and Z in the FPSR for later use by branch instructions. The destination operand for the FCMP instruction must be in a floating-point register. The source operand may be in a floating-point register, a data register, or a memory location specified with the Indirect, Autoincrement, Autodecrement, Basic index, or Basic relative modes. A suffix must be included to specify the source operand size, and a conversion to double-precision representation is performed for the source operand as required. For example, the instruction

FCMP.S (A1), FP3

reads the 32-bit single-precision floating-point number at the memory location given by the contents of register A1, converts this number to 64-bit double-precision representation, subtracts the converted number from the contents of register FP3, and sets the flags in the FPSR based on the result of the subtraction. Register FP3 is not modified.

Floating-point conditional branch instructions have the format

FBcc LABEL

where cc specifies the floating-point condition code. Tests such as EQ, NE, LT, and GT can be specified for cc, corresponding to different combinations of the condition code flags in the FPSR. For the branch target that is used when the condition is true, the FBcc instruction specifies an offset relative to the PC value at execution time. Depending on the distance to the branch target from the FBcc instruction, the offset may be represented in either 16 or 32 bits.

Since floating-point arithmetic instructions such as FADD and FSUB affect the flags in the FPSR, an FCMP instruction may not be needed in some cases before an FBcc instruction. An example program in Section C.8.5 shows how a floating-point arithmetic instruction can be immediately followed by a floating-point branch instruction.

C.8.4 Additional Floating-Point Instructions

The FPU supports additional instructions such as square root (FSQRT), negation (FNEG), and absolute value (FABS). These instructions may specify one or two operands. For a single operand, it must be in a floating-point register. For two operands, the destination location must be a floating-point register, and the valid addressing modes for the source operand are the same as the source addressing modes in the basic floating-point arithmetic


instructions. Condition code flags in the FPSR are affected by the result for each of these instructions. Number conversions are performed as required for the source operand, based on the format specifier that is appended to the instruction mnemonic.

C.8.5 Example Floating-Point Program

An example program is shown in Figure C.19. Given a pair of points (x0, y0) and (x1, y1), the program determines the slope m and y-axis intercept b of a line passing through those points, except when the points lie on a vertical line.

The coordinates for the two points are assumed to be 64-bit double-precision floating-point numbers stored consecutively in memory as x0, y0, x1, and y1 beginning at the location COORDS. The program places the computed slope and intercept as double-precision numbers in memory locations called SLOPE and INTERCEPT. The formula for the slope of a line is given by m = (y1 − y0)/(x1 − x0), hence the program must check for a zero in the denominator before performing the division. When the denominator is zero, the points lie on a vertical line. The program writes a value of one to a separate variable in memory called VERT_LINE to reflect this case and to indicate that the values in memory locations SLOPE and INTERCEPT are not valid, and no further computation is done. Otherwise, the program computes the value of the slope and stores it in memory. With a valid slope, the intercept

          MOVEA.L  #COORDS, A2       A2 points to list of coordinates.
          FMOVE.D  (A2), FP0         FP0 contains x0.
          FMOVE.D  8(A2), FP1        FP1 contains y0.
          FMOVE.D  16(A2), FP2       FP2 contains x1.
          FMOVE.D  24(A2), FP3       FP3 contains y1.
          FSUB.D   FP0, FP2          Compute x1 – x0; may set Z flag in FPSR.
          FBEQ     NO_SLOPE          If denominator is zero, m is undefined.
          FSUB.D   FP1, FP3          Compute y1 – y0.
          FDIV.D   FP2, FP3          Compute m = (y1 – y0)/(x1 – x0).
          MOVEA.L  #SLOPE, A2        A2 points to memory location SLOPE.
          FMOVE.D  FP3, (A2)         Store the slope to memory.
          FMUL.D   FP3, FP0          Compute m · x0.
          FSUB.D   FP0, FP1          Compute b = y0 – m · x0 in FP1.
          MOVEA.L  #INTERCEPT, A2    A2 points to memory location INTERCEPT.
          FMOVE.D  FP1, (A2)         Store the intercept to memory.
          MOVEQ.L  #0, D0            Indicate that line is not vertical.
          BRA      DONE
NO_SLOPE: MOVEQ.L  #1, D0            Indicate that line is vertical.
DONE:     MOVE.L   D0, VERT_LINE

Figure C.19 Floating-point program to compute the slope and intercept of a line.


is then computed as b = y0 − m · x0 and also stored in memory. Finally, a zero is written to the location VERT_LINE in memory to indicate that the slope and intercept are valid.
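The logic of Figure C.19 can be summarized in a few lines of Python (an illustration only; the function name is arbitrary):

```python
def line_through(x0, y0, x1, y1):
    # Returns (vertical, slope, intercept); slope and intercept are
    # None when the points lie on a vertical line.
    if x1 - x0 == 0.0:             # the FBEQ NO_SLOPE path
        return (True, None, None)
    m = (y1 - y0) / (x1 - x0)      # slope, as computed by FDIV.D
    b = y0 - m * x0                # intercept, as computed by FMUL.D/FSUB.D
    return (False, m, b)

print(line_through(1.0, 2.0, 3.0, 6.0))   # (False, 2.0, 0.0)
print(line_through(5.0, 1.0, 5.0, 9.0))   # (True, None, None)
```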

C.9 Concluding Remarks

ColdFire implements a CISC-style instruction set that combines arithmetic/logic operations and memory accesses in many of its instructions. This approach reduces the number of instructions that must be executed to perform a given computing task. Instructions are encoded into one, two, or three 16-bit words in memory, depending on the complexity of the operation to be performed and the amount of information needed to generate the effective addresses of the required operands. A variety of addressing modes are supported. In addition to the integer instructions, a floating-point extension is also defined for ColdFire. Automatic conversion between integer and floating-point number representations is a feature of the floating-point extension.

C.10 Solved Problems

This section presents some examples of problems that a student may be asked to solve, and shows how such problems can be solved.

Example C.2

Problem: Assume that there is a string of ASCII-encoded characters stored in memory starting at address STRING. The string ends with the Carriage Return (CR) character. Write a ColdFire program to determine the length of the string.

Solution: Figure C.20 presents a possible program. Each character in the string is compared to CR (ASCII code 0D), and a counter is incremented until the end of the string is reached. The result is stored in location LENGTH.

          MOVEA.L  #STRING, A2       A2 points to the start of the string.
          CLR.L    D3                D3 is a counter that is cleared to 0.
          CLR.L    D5                D5 is cleared for longword comparison.
LOOP:     MOVE.B   (A2)+, D5         Get the next character and increment the pointer.
          CMPI.L   #$0D, D5          Compare character with CR.
          BEQ      DONE              Finished if it matches.
          ADDQ.L   #1, D3            Increment the counter.
          BRA      LOOP              Not finished, loop back.
DONE:     MOVE.L   D3, LENGTH        Store the count in memory location LENGTH.

Figure C.20 Program for Example C.2.
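The counting loop of Figure C.20 corresponds to the following Python sketch (illustrative only; a bytes object stands in for memory):

```python
CR = 0x0D  # carriage-return code, the terminator tested with CMPI.L

def string_length(mem: bytes) -> int:
    # Count the characters that precede the CR terminator.
    count = 0
    for ch in mem:
        if ch == CR:      # BEQ DONE
            break
        count += 1        # ADDQ.L #1, D3
    return count

print(string_length(b"HELLO\rIGNORED"))  # 5
```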


Example C.3

Problem: We want to find the smallest number in a list of non-negative 32-bit integers. Storage for all data related to this problem begins at address (1000)16. The longword at this address must hold the value of the smallest number after it has been found. The next longword at (1004)16 contains the number of entries, n, in the list. The following n longwords beginning at address (1008)16 contain the numbers in the list. Write a program to find the smallest number and include the assembler directives needed to organize the data as stated.

Solution: The program in Figure C.21 accomplishes the required task. Comments in the program explain how this task is performed. The program assumes that n ≥ 1. A few sample numbers are included as entries in the list.

LIST:     .EQU     $1000             Starting address of list.
          MOVEA.L  #LIST, A2         A2 points to the start of the list.
          MOVE.L   4(A2), D3         D3 is a counter, initialize it with n.
          MOVEA.L  A2, A4            A4 points to the first number
          ADDA.L   #8, A4            after adjusting its value.
          MOVE.L   (A4)+, D5         D5 holds the smallest number so far.
LOOP:     SUBQ.L   #1, D3            Decrement the counter.
          BEQ      DONE              Finished if D3 is zero.
          MOVE.L   (A4)+, D6         Get the next number and increment the pointer.
          CMP.L    D6, D5            Compare next number and smallest so far.
          BLE      LOOP              If next number not smaller, loop again.
          MOVE.L   D6, D5            Otherwise, update smallest number so far.
          BRA      LOOP              Loop again.
DONE:     MOVE.L   D5, (A2)          Store the smallest number into SMALL.

          .ORG     $1000
SMALL:    .DS.L    1                 Space for the smallest number found.
N:        .DC.L    7                 Number of entries in list.
ENTRIES:  .DC.L    4, 5, 3, 6, 1, 8, 2   Entries in the list.

Figure C.21 Program for Example C.3.
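The same scan can be modeled in Python, using a list to stand in for the memory layout of Figure C.21 (SMALL at index 0, n at index 1, the entries after that; an illustration only):

```python
def smallest(mem):
    # mem[0] is SMALL, mem[1] is n, mem[2:2+n] are the entries.
    n = mem[1]
    best = mem[2]                   # first entry is the smallest so far
    for value in mem[3:2 + n]:      # compare the remaining n-1 entries
        if value < best:
            best = value
    mem[0] = best                   # store the result into SMALL
    return best

print(smallest([0, 7, 4, 5, 3, 6, 1, 8, 2]))  # 1
```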

Example C.4

Problem: Write a ColdFire program that converts an n-digit decimal integer into a binary number. The decimal number is given as n ASCII-encoded characters, as would be the case if the number is entered by typing it on a keyboard.


Solution: Consider a four-digit decimal number, D, which is represented by the digits d3d2d1d0. The value of this number is ((d3 × 10 + d2) × 10 + d1) × 10 + d0. This representation of the number is the basis for the conversion technique used in the program in Figure C.22. Note that each ASCII-encoded character is converted into a Binary Coded Decimal (BCD) digit with the ANDI instruction before it is used in the computation.

          MOVE.L   N, D2             D2 is a counter, initialize it with n.
          MOVEA.L  #DECIMAL, A3      A3 points to the ASCII digits.
          CLR.L    D4                D4 will hold the binary number.
          MOVEQ.L  #10, D6           D6 will be used to multiply by 10.
LOOP:     MOVE.B   (A3)+, D5         Get the next ASCII digit and increment
                                     the pointer.
          ANDI.L   #$0F, D5          Form the BCD digit.
          ADD.L    D5, D4            Add to the intermediate result.
          SUBQ.L   #1, D2            Decrement the counter.
          BEQ      DONE              Exit loop if finished.
          MULU.L   D6, D4            Multiply by 10.
          BRA      LOOP              Loop back if not done.
DONE:     MOVE.L   D4, BINARY        Store the result in memory location
                                     BINARY.

Figure C.22 Program for Example C.4.
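The Horner-style evaluation used by Figure C.22 can be expressed in Python as (an illustration only):

```python
def ascii_decimal_to_int(digits: bytes) -> int:
    # ((d3*10 + d2)*10 + d1)*10 + d0, generalized to any length.
    value = 0
    for ch in digits:
        bcd = ch & 0x0F        # ANDI.L #$0F strips the ASCII prefix
        value = value * 10 + bcd
    return value

print(ascii_decimal_to_int(b"1984"))  # 1984
```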

Example C.5

Problem: Consider a two-dimensional array of numbers A(i,j), where i = 0 through n − 1 is the row index, and j = 0 through m − 1 is the column index. The array is stored in the memory of a computer one row after another, with elements of each row occupying m successive word locations. Write a subroutine for adding column x to column y, element by element, leaving the sum elements in column y. The indices x and y are passed to the subroutine in registers D2 and D3. The parameters n and m are passed to the subroutine in registers D4 and D5, and the address of element A(0,0) is passed in register A0.

Solution: A possible program is given in Figure C.23. We assume that the values x, y, n, and m are stored in memory locations X, Y, N, and M. Also, the elements of the array are stored in successive words that begin at location ARRAY, which is the address of the element A(0,0). Comments in the program indicate the purpose of individual instructions.

Example C.6

Problem: Assume that a memory location BINARY contains a 32-bit pattern. It is desired to display these bits as eight hexadecimal digits on a display device that has the interface depicted in Figure 3.3. Write a program that accomplishes this task.


          MOVE.L   X, D2             Load the value x.
          MOVE.L   Y, D3             Load the value y.
          MOVE.L   N, D4             Load the value n.
          MOVE.L   M, D5             Load the value m.
          MOVEA.L  #ARRAY, A0        Load the address of A(0,0).
          JSR      SUB
          next instruction
          ...

SUB:      MOVE.L   A1, –(A7)         Save register A1.
          LSL.L    #2, D5            Determine the distance in bytes between
                                     successive elements in a column.
          SUB.L    D2, D3            Form y – x.
          LSL.L    #2, D3            Form 4(y – x).
          LSL.L    #2, D2            Form 4x.
          ADDA.L   D2, A0            A0 points to A(0,x).
          MOVEA.L  A0, A1
          ADDA.L   D3, A1            A1 points to A(0,y).
LOOP:     MOVE.L   (A0), D2          Get the next number in column x.
          MOVE.L   (A1), D3          Get the next number in column y.
          ADD.L    D3, D2            Add the numbers and store the sum.
          MOVE.L   D2, (A1)
          ADDA.L   D5, A0            Increment pointer to column x.
          ADDA.L   D5, A1            Increment pointer to column y.
          SUBQ.L   #1, D4            Decrement the row counter.
          BGT      LOOP              Loop back if not done.
          MOVEA.L  (A7)+, A1         Restore A1.
          RTS                        Return to the calling program.

Figure C.23 Program for Example C.5.
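The address arithmetic in Figure C.23 amounts to the following Python model of the row-major layout (an illustration only; a flat list stands in for memory):

```python
def add_columns(a, n, m, x, y):
    # a holds the n-by-m array row by row; add column x into column y.
    for i in range(n):                # row counter, as in register D4
        a[i * m + y] += a[i * m + x]  # pointers advance m words per row
    return a

grid = [1, 2,
        3, 4,
        5, 6]                         # n = 3 rows, m = 2 columns
print(add_columns(grid, 3, 2, 0, 1))  # [1, 3, 3, 7, 5, 11]
```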

Solution: First it is necessary to convert the 32-bit pattern into hex digits that are represented as ASCII-encoded characters. The conversion can be done by using the table-lookup approach. A 16-entry table has to be constructed to provide the ASCII code for each possible hex digit. Then, for each four-bit segment of the pattern in BINARY, the corresponding character can be looked up in the table and stored in consecutive byte locations in memory beginning at address HEX. Finally, the eight characters beginning at address HEX are sent to the display. Figure C.24 gives a possible program. Because ColdFire does not include a rotate instruction, four pairs of LSL and ADDX instructions are used to achieve the effect of rotating the value in a register by four bits.


          MOVE.L   BINARY, D2        Load the binary number.
          MOVE.L   #8, D3            D3 is a digit counter that is set to 8.
          MOVEA.L  #HEX, A4          A4 points to the hex digits.
          MOVEA.L  #TABLE, A6        A6 points to table for ASCII conversion.
          CLR.L    D0                D0 is zero; needed for rotate below.
LOOP:     LSL.L    #1, D2            Rotate high-order digit into low-order
          ADDX.L   D0, D2            position by using X flag and adding
          LSL.L    #1, D2            zero (4 times).
          ADDX.L   D0, D2
          LSL.L    #1, D2
          ADDX.L   D0, D2
          LSL.L    #1, D2
          ADDX.L   D0, D2
          MOVE.L   D2, D5            Copy current value to another register.
          ANDI.L   #$0F, D5          Extract next digit.
          MOVE.B   (A6,D5), D6       Get ASCII code for the digit,
          MOVE.B   D6, (A4)+         store it in the HEX buffer,
                                     and increment the digit pointer.
          SUBQ.L   #1, D3            Decrement the digit counter.
          BGT      LOOP              Loop back if not the last digit.

DISPLAY:  MOVEQ.L  #8, D3
          MOVEA.L  #HEX, A4
          MOVEA.L  #DISP_DATA, A2
DLOOP:    MOVE.L   4(A2), D5         Check if the display is ready
          ANDI.L   #4, D5            by testing the DOUT flag.
          BEQ      DLOOP
          MOVE.B   (A4)+, (A2)       Get the next ASCII character,
                                     increment the character pointer,
                                     and send it to the display.
          SUBQ.L   #1, D3            Decrement the counter.
          BGT      DLOOP             Loop until all characters displayed.
          next instruction

          .ORG     1000
HEX:      .DS.B    8                 Space for ASCII-encoded digits.
TABLE:    .DC.B    $30, $31, $32, $33   Table for conversion
          .DC.B    $34, $35, $36, $37   to ASCII code.
          .DC.B    $38, $39, $41, $42
          .DC.B    $43, $44, $45, $46

Figure C.24 Program for Example C.6.
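The rotate-and-look-up step of Figure C.24 is modeled below in Python (an illustration only). A single function stands in for the four LSL/ADDX pairs:

```python
TABLE = b"0123456789ABCDEF"  # ASCII codes $30-$39 and $41-$46

def rotate_left4(value: int) -> int:
    # One 4-bit left rotate of a 32-bit value, the effect that the
    # program obtains with four LSL/ADDX pairs.
    return ((value << 4) | (value >> 28)) & 0xFFFFFFFF

def to_hex_chars(binary: int) -> bytes:
    out = bytearray()
    for _ in range(8):                 # eight hex digits
        binary = rotate_left4(binary)  # next digit into low-order position
        out.append(TABLE[binary & 0x0F])
    return bytes(out)

print(to_hex_chars(0x12345678))  # b'12345678'
```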


Problems

C.1 [E] Write a program that computes the expression SUM = 580 + 68400 + 80000.

C.2 [E] Write a program that computes the expression ANSWER = A × B + C × D.

C.3 [M] Write a program that finds the number of negative integers in a list of n 32-bit integers and stores the count in location NEGNUM. The value n is stored in memory location N, and the first integer in the list is stored in location NUMBERS. Include the necessary assembler directives and a sample list that contains six numbers, some of which are negative.

C.4 [E] Write an assembly-language program in the style of Figure C.14 for the program in Figure C.8. Assume the data layout of Figure 2.10.

C.5 [M] Write a ColdFire program to solve Problem 2.10 in Chapter 2.

C.6 [M] Write a ColdFire program for the problem described in Example 2.5 in Chapter 2.

C.7 [M] Write a ColdFire program for the problem described in Example 3.5 in Chapter 3.

C.8 [M] Write a ColdFire program for the problem described in Example 3.6 in Chapter 3.

C.9 [M] Write a ColdFire program for the problem described in Example 3.6 in Chapter 3, but assume that the address of TABLE is 0x10100.

C.10 [M] Write a program that displays the contents of 10 bytes of the main memory in hexadecimal format on a line of a video display. The byte string starts at location LOC in the memory. Each byte has to be displayed as two hex characters. The displayed contents of successive bytes should be separated by a space.

C.11 [M] Assume that a memory location BINARY contains a 16-bit pattern. It is desired to display these bits as a string of 0s and 1s on a display device that has the interface depicted in Figure 3.3. Write a ColdFire program that accomplishes this task.

C.12 [M] Using the seven-segment display in Figure 3.17 and the timer circuit in Figure 3.14, write a program that flashes decimal digits in the sequence 0, 1, 2, . . . , 9, 0, . . . . Each digit is to be displayed for one second. Assume that the counter in the timer circuit is driven by a 100-MHz clock.

C.13 [M] Using two 7-segment displays of the type shown in Figure 3.17, and the timer circuit in Figure 3.14, write a program that flashes numbers 0, 1, 2, . . . , 98, 99, 0, . . . . Each number is to be displayed for one second. Assume that the counter in the timer circuit is driven by a 100-MHz clock.

C.14 [M] Write a program that computes real clock time and displays the time in hours (0 to 23) and minutes (0 to 59). The display consists of four 7-segment display devices of the type shown in Figure 3.17. A timer circuit that has the interface given in Figure 3.14 is available. Its counter is driven by a 100-MHz clock.

C.15 [M] Write a ColdFire program to solve Problem 2.23 in Chapter 2.

C.16 [D] Write a ColdFire program to solve Problem 2.24 in Chapter 2.


C.17 [M] Write a ColdFire program to solve Problem 2.25 in Chapter 2.

C.18 [M] Write a ColdFire program to solve Problem 2.26 in Chapter 2.

C.19 [D] Write a ColdFire program to solve Problem 2.27 in Chapter 2.

C.20 [M] Write a ColdFire program to solve Problem 2.28 in Chapter 2.

C.21 [M] Write a ColdFire program to solve Problem 2.29 in Chapter 2.

C.22 [M] Write a ColdFire program to solve Problem 2.30 in Chapter 2.

C.23 [M] Write a ColdFire program to solve Problem 2.31 in Chapter 2.

C.24 [D] Write a ColdFire program to solve Problem 2.32 in Chapter 2.

C.25 [D] Write a ColdFire program to solve Problem 2.33 in Chapter 2.

C.26 [D] Write a ColdFire program to solve Problem 3.20 in Chapter 3.

C.27 [D] Write a ColdFire program to solve Problem 3.22 in Chapter 3.

C.28 [D] Write a ColdFire program to solve Problem 3.24 in Chapter 3.

C.29 [D] Write a ColdFire program to solve Problem 3.26 in Chapter 3.

C.30 [D] The function sin(x) can be approximated with reasonable accuracy as x − x^3/6 + x^5/120 = x(1 − x^2(1/6 − x^2(1/120))) when 0 ≤ x ≤ π/2. Write a subroutine called SIN that accepts an input parameter representing x in a floating-point register and computes the approximated value of sin(x) using the second expression above involving only terms in x and x^2. The computed value should be returned in a floating-point register. Any registers used by the subroutine, including floating-point registers, should be saved and restored as needed.
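As a cross-check of the factored form in Problem C.30, the following Python sketch evaluates it and compares the result against the library sine (the register-saving conventions of the actual SIN subroutine are, of course, not modeled):

```python
import math

def sin_approx(x: float) -> float:
    # Horner form x(1 - x^2(1/6 - x^2(1/120))) of x - x^3/6 + x^5/120.
    x2 = x * x
    return x * (1.0 - x2 * (1.0 / 6.0 - x2 * (1.0 / 120.0)))

err = abs(sin_approx(math.pi / 4) - math.sin(math.pi / 4))
print(err < 1e-3)  # True: reasonable accuracy on [0, pi/2]
```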

References

1. Freescale Semiconductor, Inc., ColdFire Family Programmer’s Reference Manual, Document Number CFPRM Rev. 3, March 2005. Available at http://www.freescale.com.


appendix D

The ARM Processor

Appendix Objectives

In this appendix you will learn about the ARM processor. The discussion includes:

• Instruction set architecture

• Input/output capability

• Support for embedded applications


In Chapter 2 we introduced the basic concepts used in the design of instruction sets and addressing modes. Chapter 3 discussed I/O operations. In this appendix we will show how those concepts are implemented in the ARM processor. The generic programs given in Chapters 2 and 3 are presented in the ARM assembly language.

Advanced RISC Machines (ARM) Limited has designed a family of RISC-style processors. ARM licenses these designs to other companies for chip fabrication, together with software tools for system development and simulation. The main use for ARM processors is in low-power and low-cost embedded applications such as mobile telephones, communication modems, and automotive engine management systems.

All ARM processors share a basic machine instruction set. The ISA version used here is called version 4 by ARM [1]. Later versions add extensions that are not needed for the level of discussion in this appendix. However, we will briefly summarize some of them in a later section. The book by Furber [2] is a source of information on ARM processors and their design rationale. The book by Hohl [3] describes the ARM assembly language.

D.1 ARM Characteristics

ARM word length is 32 bits, memory is byte-addressable using 32-bit addresses, and the processor registers are 32 bits long. Three operand lengths are used in moving data between the memory and the processor registers: byte (8 bits), half word (16 bits), and word (32 bits). Word and half-word addresses must be aligned, that is, they must be multiples of 4 and 2, respectively. Both little-endian and big-endian memory addressing schemes are supported. The choice is determined by an external input control line to the processor.
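The two schemes differ only in where each byte of a word is placed in memory. A small Python sketch (not ARM code) shows the byte order that each scheme produces for the same 32-bit word:

```python
import struct

word = 0x12345678
little = struct.pack("<I", word)  # lowest-order byte at lowest address
big = struct.pack(">I", word)     # highest-order byte at lowest address

print(little.hex())  # 78563412
print(big.hex())     # 12345678
```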

In most respects, the ARM ISA reflects a RISC-style architecture, but it has some CISC-style features.

RISC-style Aspects
• All instructions have a fixed length of 32 bits.
• Only Load and Store instructions access memory.
• All arithmetic and logic instructions operate on operands in processor registers.

CISC-style Aspects
• Autoincrement, Autodecrement, and PC-relative addressing modes are provided.
• Condition codes (N, Z, V, and C) are used for branching and for conditional execution of instructions. Their meaning was explained in Section 2.10.2.
• Multiple registers can be loaded from a block of consecutive memory words, or stored in a block, using a single instruction.

D.1.1 Unusual Aspects of the ARM Architecture

The ARM architecture has a number of features not generally found in modern processors.

Conditional Execution of Instructions

An unusual feature of ARM processors is that all instructions are conditionally executed. An instruction is executed only if the current values of the condition code flags satisfy the


condition specified in a 4-bit field of the instruction. Otherwise, the processor proceeds to the next instruction. One of the possible conditions specifies that the instruction is always executed. The usefulness of conditional execution will be seen in an example in Section D.9. For now, we will ignore this feature and assume that the condition field of the instruction specifies the “always-executed” code.

No Shift or Divide Instructions

Shift instructions are not provided explicitly. However, an immediate value or one of the register operands in arithmetic, logic, and move instructions can be shifted by a prescribed amount before being used in an operation, as explained in Section D.4.2. This feature can also be used to implement shift instructions implicitly.
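The pre-shifted second operand can be modeled as follows (a Python illustration; the helper name is arbitrary). It captures the effect of an instruction such as ADD Rd, Rn, Rm, LSL #2:

```python
def add_shifted(rn: int, rm: int, shift: int) -> int:
    # The second operand is shifted before the addition; results wrap
    # to 32 bits, as in the processor.
    return (rn + ((rm << shift) & 0xFFFFFFFF)) & 0xFFFFFFFF

print(add_shifted(100, 5, 2))  # 120: 100 + (5 << 2)
print(add_shifted(0, 3, 4))    # 48: a shift obtained as an add to zero
```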

There are a number of different Multiply instructions, with many of the variations intended for use in signal-processing applications. But there are no hardware Divide instructions. Division must be implemented in software.

D.2 Register Structure

There are sixteen 32-bit processor registers for user application programs, labeled R0 through R15, as shown in Figure D.1. They comprise fifteen general-purpose registers

[Figure D.1 depicts the register structure: 32-bit general-purpose registers R0 through R14, the program counter R15 (PC), and the Current Program Status Register (CPSR) with condition code flags N (Negative), Z (Zero), C (Carry), and V (Overflow), interrupt-disable bits I and F, and processor mode bits.]

Figure D.1 ARM register structure.

Page 638: Carl Hamacher

December 2, 2010 10:14 ham_338065_appd Sheet number 4 Page number 614 cyan black

614 A P P E N D I X D • TheARM Processor

(R0 through R14) and the Program Counter (PC), which is register R15. The general-purpose registers can hold either memory addresses or data operands. Registers R13 andR14 have dedicated uses related to the management of the processor stack and subroutines.This is discussed in Section D.4.8.

The Current Program Status Register (CPSR), or simply the Status register, also shown in Figure D.1, holds the condition code flags (N, Z, C, V), interrupt-disable bits, and processor mode bits. There are a total of seven processor modes of operation. Application programs run in User mode. The other six modes are used to handle I/O device interrupts, processor powerup/reset, software interrupts, and memory access violations. Processor modes and the use of interrupt-disable bits are described in Sections D.7 and D.8. Initially, we will assume that the processor is in User mode, executing an application program.

There are a number of additional general-purpose registers called banked registers. They are duplicates of some of the R0 to R14 registers. Different banked registers are used when the processor switches from User mode into other modes of operation. The use of banked registers avoids the need to save and restore some of the User-mode register contents on mode switches. Saved copies of the Status register are also available in the non-User modes. The banked registers, along with Status register copies, are discussed in Section D.7.

D.3 Addressing Modes

The Immediate, Register, Absolute, Indirect, and Index addressing modes discussed in Section 2.4 are all available in some form. In addition to these basic modes, which are usually available in RISC processors, the Relative mode and variants of the Autoincrement and Autodecrement modes described in Section 2.10.1 are also provided. In the ARM architecture, many of these modes are derived from different forms of indexed addressing modes.

D.3.1 Basic Indexed Addressing Mode

The basic method for addressing memory operands is an indexed addressing mode, defined as

Pre-indexed mode—The effective address of the operand is the sum of the contents of a base register, Rn, and a signed offset.

We will use Load and Store instructions, with assembly-language mnemonics LDR and STR, which load a word from memory into a register, or store a word from a register into memory, to show how the indexed addressing modes operate. The format of these instructions is shown in Figure D.2. In all ARM instructions, the high-order four bits specify a condition that determines whether or not the instruction is executed, as explained in Section D.1.1. The magnitude of the offset is given as an immediate value contained in the low-order 12 bits of the instruction or as the contents of a second register, Rm, specified in the low-order four bits of the instruction. A bit in the OP-code field distinguishes between these two cases. The sign (direction) of the offset is specified by another bit in the OP-code field. In assembly language, the sign is given with the offset specification.

Figure D.2 Format for Load and Store instructions: bits 31–28 Condition, bits 27–20 OP code, bits 19–16 Rn, bits 15–12 Rd, bits 11–0 Offset or Rm.

The Load instruction

LDR Rd, [Rn, #offset]

specifies the offset (expressed as a signed number) in the immediate mode and performs the operation

Rd← [[Rn] + offset]

The instruction

LDR Rd, [Rn, Rm]

performs the operation

Rd← [[Rn] + [Rm]]

Since the contents of Rm are the magnitude of the offset, Rm is preceded by a minus sign if a negative offset is desired. Note that square brackets are used in the ARM assembly language instead of parentheses to denote indirection. These two versions of the Pre-indexed addressing mode are the same as the Index and Base with index modes, respectively, defined in Section 2.4.3.

An offset of zero does not have to be specified explicitly in assembly language. Hence, the instruction

LDR Rd, [Rn]

performs the operation

Rd← [[Rn]]

This is defined as the Indirect mode in Section 2.4.2.
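The addressing functions above can be made concrete with a small model. This Python sketch (the register and memory values are invented for illustration) mimics the three forms of LDR just described:

```python
# Model registers and word-addressed memory as dictionaries.
reg = {'R1': 0, 'R2': 1000, 'R3': 60}
mem = {1000: 42, 1060: 7}

# LDR R1, [R2, #60]  :  R1 <- [[R2] + 60]
reg['R1'] = mem[reg['R2'] + 60]
assert reg['R1'] == 7                 # word at 1000 + 60 = 1060

# LDR R1, [R2, R3]   :  R1 <- [[R2] + [R3]]
reg['R1'] = mem[reg['R2'] + reg['R3']]

# LDR R1, [R2]       :  R1 <- [[R2]]  (zero offset, the Indirect mode)
reg['R1'] = mem[reg['R2']]
print(reg['R1'])                      # 42
```

Note that in every case the base register R2 is left unchanged; only the writeback variants described in Section D.3.3 modify it.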

D.3.2 Relative Addressing Mode

The Program Counter, PC, may be used as the Base register Rn, along with an immediate offset, in the Pre-indexed addressing mode. This, in effect, is the Relative addressing mode, as described in Section 2.10.1. The programmer simply places the desired address label in the operand field to indicate this mode. Thus, the instruction

LDR R1, ITEM

loads the contents of memory location ITEM into register R1. The assembler determines the immediate offset as the difference between the address of the operand and the contents of the updated PC. When the effective address is calculated at instruction execution time, the contents of the PC will have been updated to the address two words (8 bytes) forward from the instruction containing the Relative addressing mode. The reason for this is that the ARM processor will have already fetched the next instruction. This is due to pipelined instruction execution, which is described in Chapter 6.
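The offset the assembler computes can be sketched in a few lines of Python (a hypothetical helper, not part of any real assembler):

```python
def relative_offset(instr_addr, target_addr):
    """Immediate offset for the Relative mode: the updated PC is
    8 bytes (two words) beyond the instruction being executed."""
    offset = target_addr - (instr_addr + 8)
    if not -4095 <= offset <= 4095:
        raise ValueError("operand out of range for Relative addressing")
    return offset

print(relative_offset(1000, 1060))    # 52
```

The range check reflects the 12-bit offset field: operands farther than 4095 bytes from the updated PC cannot be reached in this mode.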

D.3.3 Index Modes with Writeback

In the Pre-indexed addressing mode, the original contents of register Rn are not changed in the process of generating the effective address of the operand. There is a variation of this mode, called Pre-indexed with writeback, in which the contents of Rn are changed. Another mode, called Post-indexed, also results in changing the contents of register Rn. These modes are generalizations of the Autodecrement and Autoincrement addressing modes, respectively, that were introduced in Section 2.10.1. They are defined as follows:

Pre-indexed with writeback mode—The effective address of the operand is generated in the same way as in the Pre-indexed mode, then the effective address is written back into Rn.

Post-indexed mode—The effective address of the operand is the contents of Rn. The offset is then added to this address and the result is written back into Rn.

Table D.1 specifies the assembly language syntax for all of the addressing modes, and gives expressions for the calculation of the effective address, EA. It also shows how writeback operations are specified. In the Pre-indexed mode, the exclamation character ‘!’ signifies that writeback is to be done. The Post-indexed mode always involves writeback, so the exclamation character is not needed.

As can be seen in Table D.1, pre- and post-indexing are distinguished by the way the square brackets are used. When only the base register is enclosed in square brackets, its contents are used as the effective address. The offset is added to the register contents after the operand is accessed. In other words, post-indexing is specified. When both the base register and the offset are placed inside the square brackets, their sum is used as the effective address of the operand, that is, pre-indexing is used. If writeback is to be performed, it must be indicated by the exclamation character.

D.3.4 Offset Determination

In all three indexed addressing modes, the offset may be given as an immediate value in the range ±4095. Alternatively, the magnitude of the offset may be specified as the contents of the Rm register, with the sign (direction) of the offset specified by a ± prefix on the register name.

Table D.1 ARM indexed addressing modes.

Name                          Assembler syntax      Addressing function

With immediate offset:
Pre-indexed                   [Rn, #offset]         EA = [Rn] + offset
Pre-indexed with writeback    [Rn, #offset]!        EA = [Rn] + offset;
                                                    Rn ← [Rn] + offset
Post-indexed                  [Rn], #offset         EA = [Rn];
                                                    Rn ← [Rn] + offset

With offset magnitude in Rm:
Pre-indexed                   [Rn, ±Rm, shift]      EA = [Rn] ± [Rm] shifted
Pre-indexed with writeback    [Rn, ±Rm, shift]!     EA = [Rn] ± [Rm] shifted;
                                                    Rn ← [Rn] ± [Rm] shifted
Post-indexed                  [Rn], ±Rm, shift      EA = [Rn];
                                                    Rn ← [Rn] ± [Rm] shifted

Relative (Pre-indexed with    Location              EA = Location
immediate offset)                                      = [PC] + offset

where EA = effective address; offset = a signed number contained in the instruction; shift = direction #integer, in which direction is LSL for left shift or LSR for right shift, and integer is a 5-bit unsigned number specifying the shift amount; and ±Rm means that the offset magnitude in register Rm can be added to or subtracted from the contents of base register Rn.

For example, the instruction

LDR R0, [R1, −R2]!

performs the operation

R0← [[R1] − [R2]]

The effective address of the operand, [R1] − [R2], is then loaded into R1 because writeback is specified.

When the offset is given in a register, it may be scaled by a power of 2 before it is used by shifting it to the right or to the left. This is indicated in assembly language by placing the shift direction (LSL for left shift or LSR for right shift) and the shift amount after the register name, Rm, as shown in Table D.1. The amount of the shift is specified by an immediate value in the range 0 to 31. The direction and amount of shifting are encoded in the same field of the instruction that specifies Rm, as shown in Figure D.2. For example, the contents of R2 in the example above may be multiplied by 16 before being used as an offset by modifying the instruction as follows:

LDR R0, [R1, −R2, LSL #4]!

This instruction performs the operation

R0← [[R1] − 16 × [R2]]

and then loads the effective address into R1 because writeback is specified.
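The effect of this instruction can be traced with a short Python model (the addresses and contents are invented for illustration):

```python
reg = {'R0': 0, 'R1': 2000, 'R2': 10}
mem = {1840: 99}

# LDR R0, [R1, -R2, LSL #4]!  : scaled negative offset, with writeback
ea = reg['R1'] - (reg['R2'] << 4)     # 2000 - 16*10 = 1840
reg['R0'] = mem[ea]                   # load the operand
reg['R1'] = ea                        # '!' writes the effective address back
print(reg['R0'], reg['R1'])           # 99 1840
```

The shift applies only to the offset register R2; the contents of R2 itself are not modified.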

D.3.5 Register, Immediate, and Absolute Addressing Modes

The Register addressing mode is the main way for accessing operands in arithmetic and logic instructions, which are discussed in Section D.4. Constants may also be used in these instructions. They are provided as 8-bit immediate values.

A limited form of Absolute addressing for accessing memory operands is obtained if the base register in the Pre-indexed mode contains the value zero. In this case, the 12-bit offset value is the effective address.

The Immediate and Absolute addressing modes described here involve only 8-bit and 12-bit values, respectively. The generation and use of 32-bit values as immediate operands or memory addresses are described in Section D.5.1.

D.3.6 Addressing Mode Examples

An example of the Relative mode is shown in Figure D.3a. The address of the operand, given symbolically in the instruction as ITEM, is 1060. There is no Absolute addressing mode available in the ARM architecture, other than the limited form described in the previous section. Therefore, when the address of a memory location is specified by placing an address label in the operand field, the assembler uses the Relative addressing mode. This is implemented by the Pre-indexed mode with an immediate offset, using PC as the base register. As shown in the figure, the offset calculated by the assembler is 52, because the updated PC will contain 1008 when the offset is added to it during program execution. The effective address generated by this instruction is 1060 = 1008 + 52. The operand must be within a distance of 4095 bytes forward or backward from the updated PC. If the operand address is outside this range, an error is indicated by the assembler and a different addressing mode must be used to access the operand.

Figure D.3b shows an example of the Pre-indexed mode with the offset contained in register R6 and the base value contained in R5. The Store instruction (STR) stores the contents of R3 into the word at memory location 1200.

Figure D.3 Examples of memory addressing modes. [Figure content: (a) Relative addressing mode: the instruction LDR R1, ITEM is at address 1000; the updated [PC] = 1008, the offset is 52, and the operand is at ITEM = 1060. (b) Pre-indexed addressing mode: the instruction STR R3, [R5, R6]; base register R5 contains 1000, offset register R6 contains 200, and the operand is at address 1200.]

The examples shown in Figure D.4 illustrate the usefulness of the writeback feature in the Post-indexed and Pre-indexed addressing modes. Figure D.4a shows the first three numbers of a list of 25 numbers, starting at memory address 1000 and spaced 25 words apart. They comprise the first row of a 25 × 25 matrix of numbers that is stored in column order. Memory locations 1000, 1004, 1008, . . . , 1096 contain the first column of the matrix. The first number of the first row of the matrix is stored in word location 1000, and the numbers at addresses 1100, 1200, . . . , 3400 are the successive numbers of the first row. The numbers in the first row of the matrix can be accessed conveniently in a program loop by using the Post-indexed addressing mode, with the offset contained in a register. Suppose that R2 is used as the base register and that it contains the initial address value 1000. Suppose also that register R10 is used to hold the offset, and that it is loaded with the value 25. The instruction

LDR R1, [R2], R10, LSL #2

can be used in the body of a program loop to load register R1 with successive elements of the first row of the matrix in successive passes through the loop.

Figure D.4 Memory addressing modes involving writeback. [Figure content: (a) Post-indexed addressing: the Load instruction LDR R1, [R2], R10, LSL #2; base register R2 contains 1000, offset register R10 contains 25, and the list elements 6, −17, and 321 are in word locations 1000, 1100, and 1200, respectively (offset 100 = 25 × 4). (b) Pre-indexed addressing with writeback: the Push instruction STR R0, [R5, #−4]!; register R0 contains 27, and the base register (stack pointer) R5 contains 2012 before, and 2008 after, execution of the Push instruction.]

Let us examine how this works, step by step. The first time that the Load instruction is executed, the effective address is [R2] = 1000. Therefore, the number 6 at this address is loaded into R1. Then, the writeback operation changes the contents of R2 from 1000 to 1100 so that it points to the second number, −17. It does this by shifting the contents, 25, of the offset register R10 left by two bit positions and then adding the shifted value to the contents of R2. The contents of R10 are not changed in this process. The left shift is equivalent to multiplying 25 by 4, generating the required offset of 100. When the Load instruction is executed on the second pass through the loop, the second number, −17, is loaded into R1. The third number, 321, is loaded into R1 on the third pass, and so on.

This example involved adding the shifted contents of the offset register to the contents of the base register. As indicated in Table D.1, the shifted offset can also be subtracted from the contents of the base register. Any shift amount in the range 0 through 31 can be selected, and either right or left shifting can be specified.
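The loop behavior can be reproduced in Python (the memory contents match Figure D.4a; the dictionaries are just a model):

```python
# First row of the 25x25 matrix: elements 100 bytes apart, starting at 1000.
mem = {1000: 6, 1100: -17, 1200: 321}
reg = {'R1': 0, 'R2': 1000, 'R10': 25}

row = []
for _ in range(3):
    # LDR R1, [R2], R10, LSL #2 : load, then R2 <- [R2] + 4*[R10]
    reg['R1'] = mem[reg['R2']]        # post-indexed: use [R2] as the EA
    reg['R2'] += reg['R10'] << 2      # writeback: 25 shifted left twice = 100
    row.append(reg['R1'])

print(row, reg['R2'])                 # [6, -17, 321] 1300
```

After three passes the base register has advanced to 1300, ready for the fourth element, while the offset register R10 still holds 25.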

Figure D.4b shows an example of pushing the contents of register R0, which are 27, onto a programmer-defined stack. Register R5 is used as the stack pointer. Initially, it contains the address 2012 of the current TOS (top-of-stack) element. The Pre-indexed addressing mode with writeback can be used to perform the Push operation with the instruction

STR R0, [R5, #−4]!

The immediate offset −4 is added to the contents of R5 and the new value is written back into R5. Then, this address value of the new top of the stack, 2008, is used as the effective address for the Store operation. The contents of register R0 are then stored at this location.
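A Python trace of the Push, using the values from Figure D.4b (the model is illustrative):

```python
reg = {'R0': 27, 'R5': 2012}          # R5 is the stack pointer
mem = {}

# STR R0, [R5, #-4]! : pre-indexed with writeback acts as a Push
ea = reg['R5'] - 4                    # new top of stack: 2008
reg['R5'] = ea                        # writeback updates the stack pointer
mem[ea] = reg['R0']                   # store 27 at the new TOS

print(reg['R5'], mem[2008])           # 2008 27
```

The mirror-image Pop would use the Post-indexed mode, LDR R0, [R5], #4, loading from the current TOS and then incrementing R5.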

D.4 Instructions

Each instruction in the ARM architecture is encoded into a 32-bit word. Access to memory is provided only by Load and Store instructions. All arithmetic and logic instructions operate on processor registers.

D.4.1 Load and Store Instructions

In the previous section on addressing modes, we used versions of the Load and Store instructions that move single word operands between the memory and registers. The OP-code mnemonics LDR and STR are used for these instructions.

Byte and half-word values can also be transferred between memory and registers. If the operand is a byte, it is located in the low-order byte position of the register. If the operand is a half word, it is located in the low-order half of the register. For Load instructions, byte and half-word values are zero-extended to the 32-bit register length by using the instruction mnemonics LDRB and LDRH or are sign-extended by using LDRSB and LDRSH. The byte and half-word Store instructions have the mnemonics STRB and STRH.

Loading and Storing Multiple Operands

There are two instructions for loading and storing multiple operands. They are called Block Transfer instructions. Any subset of the general-purpose registers can be loaded or stored. Only word operands are allowed. The OP codes used are LDM (Load Multiple) and STM (Store Multiple). The memory operands must be in successive word locations.


All of the forms of pre- and post-indexing with and without writeback are available. They operate on a base address register Rn specified in the instruction in the position shown in Figure D.2. The offset magnitude is always 4 in these instructions, so it does not have to be specified explicitly in the instruction. The list of registers must appear in increasing order in the assembly-language representation of the instruction, but they do not need to be contiguous. They are specified in bits b15−0 of the encoded machine instruction, with bit bi = 1 if register Ri is in the list.

As an example, assume that register R10 is the base register and that it contains the value 1000 initially. The instruction

LDMIA R10!, {R0, R1, R6, R7}

transfers the words from locations 1000, 1004, 1008, and 1012 into registers R0, R1, R6, and R7, leaving the address value 1016 in R10 after the last transfer, because writeback is indicated by the exclamation character. The suffix IA in the OP code indicates “Increment After,” corresponding to post-indexing. We will discuss the use of Load/Store Multiple instructions for saving/restoring registers in subroutines in Section D.4.8.
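The effect of this LDMIA instruction can be sketched as follows (a Python model; the helper function is hypothetical):

```python
def ldmia(reg, mem, rn, reg_list):
    """LDMIA Rn!, {list}: load successive words into the listed
    registers in increasing register order, stepping the address by 4."""
    addr = reg[rn]
    for r in sorted(reg_list, key=lambda name: int(name[1:])):
        reg[r] = mem[addr]
        addr += 4
    reg[rn] = addr                    # writeback ('!')

reg = {'R10': 1000}
mem = {1000: 11, 1004: 22, 1008: 33, 1012: 44}
ldmia(reg, mem, 'R10', ['R0', 'R1', 'R6', 'R7'])
print(reg['R0'], reg['R7'], reg['R10'])   # 11 44 1016
```

The sort by register number reflects the hardware rule that the lowest-numbered register always receives the word at the lowest address, regardless of how the list is written.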

D.4.2 Arithmetic Instructions

The ARM instruction set has a number of instructions for arithmetic operations on operands that are either contained in the general-purpose registers or given as immediate operands in the instruction itself. The format for these instructions is the same as that shown in Figure D.2 for Load and Store instructions, except that the field label “Offset or Rm” is replaced by the label “Immediate or Rm” for the second source operand.

The basic assembly-language format for arithmetic instructions is

OP Rd, Rn, Rm

where the operation specified by the OP code is performed on the source operands in general-purpose registers Rn and Rm. The result is placed in destination register Rd.

Addition and Subtraction

The instruction

ADD R0, R2, R4

performs the operation

R0← [R2] + [R4]

The instruction

SUB R0, R6, R5

performs the operation

R0← [R6] − [R5]


The second source operand can be specified in the immediate mode. Thus,

ADD R0, R3, #17

performs the operation

R0← [R3] + 17

The immediate operand is an 8-bit value contained in bits b7−0 of the encoded machine instruction. It is an unsigned number in the range 0 to 255. The assembly language allows negative values to be used as immediate operands. If the instruction

ADD R0, R3, #−17

is used in a program, the assembler replaces it with the instruction

SUB R0, R3, #17

Shifting or Rotation of the Second Source Operand

When the second source operand is specified as the contents of a register, they can be shifted or rotated before being used in the operation. Logical shift left (LSL), logical shift right (LSR), arithmetic shift right (ASR), and rotate right (ROR), as described in Section 2.8.2 of Chapter 2, are available. The carry bit, C, is not involved in these operations. Shifting or rotation is specified after the register name for the second source operand. For example, the instruction

ADD R0, R1, R5, LSL #4

is executed as follows. The second source operand, which is contained in register R5, is shifted left 4 bit positions (equivalent to [R5] × 16), then added to the contents of register R1. The sum is placed in register R0. The shift or rotation amount can also be specified as the contents of a fourth register.

If the second source operand is specified in an assembly-language instruction as an immediate value in the range 0 to 255, it is obtained directly from the low-order byte of the encoded machine instruction word, as described above. It is also possible for the programmer to specify a limited number of 32-bit values. They are generated by manipulating an 8-bit immediate value contained in the low-order byte of the machine instruction in the following manner at the time the instruction is executed. The 8-bit value is first zero-extended to 32 bits. It is then rotated right an even number of bit positions to generate the desired value. Both the 8-bit value and the rotation amount are determined by the assembler from the immediate value specified by the programmer. These two quantities are encoded into the low-order 12 bits of the instruction. If it is not possible to generate the desired value in this way, an error is reported and the programmer must use some other way of generating the desired value, as described in Section D.5.1.
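The assembler's search for an encodable pair of an 8-bit value and an even rotation amount can be sketched in Python (a hypothetical helper mirroring the rule just described):

```python
def encode_immediate(value):
    """Find (imm8, rot) such that rotating imm8 right by rot (even)
    within 32 bits reproduces value, as an ARM assembler must."""
    value &= 0xFFFFFFFF
    for rot in range(0, 32, 2):
        # Undo a right rotation by rot: rotate value left by rot.
        imm8 = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if imm8 < 256:
            return imm8, rot
    raise ValueError("not encodable as a rotated 8-bit immediate")

print(encode_immediate(0xFF))        # (255, 0)
print(encode_immediate(0x3F0))       # (63, 28): 63 rotated right 28 = 0x3F0
```

Values whose significant bits do not fit in any 8-bit window, such as 0x101, make the loop fail, which corresponds to the assembler error mentioned above.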

Multiple-Word Operands

The carry flag, C, can be used to facilitate addition and subtraction operations that involve multiple-word numbers. Separate instructions are available for this purpose. Their assembly language mnemonics are ADC (Add with carry) and SBC (Subtract with carry). For example, suppose that two 64-bit operands are to be added. Assume that the first operand is contained in the register pair R3, R2, and that the second operand is contained in the register pair R5, R4. The high-order word of each operand is contained in the higher-numbered register. These 64-bit operands can be added by using the instruction

ADDS R6, R2, R4

followed by the instruction

ADC R7, R3, R5

producing the 64-bit sum in the register pair R7, R6. The carry output from the ADDS operation is used as a carry input in the ADC operation to execute the 64-bit addition. The S suffix on the ADD instruction is needed to set the C flag.
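The carry chain can be verified with a small Python model of 32-bit addition (the helper name is invented):

```python
MASK = 0xFFFFFFFF

def add32(a, b, carry_in=0):
    """32-bit addition returning (result, carry_out), the way
    ADDS produces the C flag and ADC consumes it."""
    total = a + b + carry_in
    return total & MASK, total >> 32

# 64-bit operands: first in (R3, R2), second in (R5, R4), high word first.
r2, r3 = 0xFFFFFFFF, 0x00000001
r4, r5 = 0x00000001, 0x00000002

r6, c = add32(r2, r4)         # ADDS R6, R2, R4  (sets C = 1 here)
r7, _ = add32(r3, r5, c)      # ADC  R7, R3, R5  (adds the carry in)

print(hex(r7), hex(r6))       # 0x4 0x0
```

The low-order addition overflows 32 bits, and only because the carry is propagated into the high-order addition does the 64-bit result come out correct.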

Multiplication

Two basic versions of a Multiply instruction are provided. The first version multiplies the contents of two registers and places the low-order 32 bits of the product in a third register. The high-order bits of the product are discarded. If the operands are 2’s-complement numbers, and if their product can be represented in 32 bits, then the retained low-order 32 bits of the product represent the correct result.

For example, the instruction

MUL R0, R1, R2

performs the operation

R0← [R1] × [R2]

The second version of the basic Multiply instruction specifies a fourth register whose contents are added to the product before the result is stored in the destination register. Hence, the instruction

MLA R0, R1, R2, R3

performs the operation

R0← ([R1] × [R2]) + [R3]

This is called a Multiply-Accumulate operation. It is often used in signal-processing applications.

Multiply and Multiply-Accumulate instructions that generate double-length (64-bit) products are also provided. There are different versions of these instructions for signed and unsigned operands.

There are no provisions for shifting or rotating any of the operands before they are used in multiplication operations.


D.4.3 Move Instructions

It is often necessary to copy the contents of one register into another or to load an immediate value into a register. The Move instruction

MOV Rd, Rm

copies the contents of register Rm into register Rd. The instruction

MOV Rd, #value

loads an 8-bit immediate value into the destination register.

A second version of the Move instruction, called Move Negative, with the OP-code mnemonic MVN, forms the bit-complement of the source operand before it is placed in the destination register. Recall that an 8-bit immediate value is an unsigned number in the range 0 to 255. The MVN instruction can be used to load negative values in 2’s-complement representation as follows. Suppose we wish to load −5 into register R0. The instruction

MVN R0, #4

achieves the desired result because the bit-complement of 4 is the 2’s-complement representation for −5. In general, to load −c into a register, the MVN instruction can be used with an immediate source operand value of c − 1. For the convenience of the programmer, the assembler program accepts an instruction such as

MOV R0, #−5

and replaces it with the instruction

MVN R0, #4

A MOV instruction with a negative immediate source operand is an example of a pseudoinstruction. The assembler replaces it with an actual machine instruction that achieves the desired result.
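The arithmetic behind the MVN replacement is easy to check in Python (32-bit values modeled with a mask):

```python
MASK = 0xFFFFFFFF

def mvn(value):
    """MVN: the bit-complement of the operand, as a 32-bit value."""
    return ~value & MASK

# MOV R0, #-5 is assembled as MVN R0, #4, since loading -c needs operand c - 1.
r0 = mvn(4)
print(hex(r0))                    # 0xfffffffb
print(r0 == (-5 & MASK))          # True: the 2's-complement pattern for -5
```

This works because complementing a value and adding 1 negates it in 2's complement, so the complement of c − 1 is exactly −c.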

The source operand in Move instructions can be shifted, as described in Section D.4.2, before it is written into the destination register.

Implementing Shift and Rotate Instructions

ARM processors do not have explicit instructions for shifting or rotating register contents as described in Section 2.8.2 of Chapter 2. However, the ability to shift or rotate the source register operand in a Move instruction provides the same capability. For example, the instruction

MOV Ri, Rj, LSL #4

achieves the same result as the generic instruction

LShiftL Ri, Rj, #4

described in Section 2.8.2.


D.4.4 Logic and Test Instructions

The logic operations AND, OR, XOR, and Bit-Clear are implemented by instructions with the OP codes AND, ORR, EOR, and BIC, respectively. They have the same format as arithmetic instructions. The instruction

AND Rd, Rn, Rm

performs a bitwise logical AND of the operands in registers Rn and Rm and places the result in register Rd. For example, if register R0 contains the hexadecimal pattern 02FA62CA and R1 contains the pattern 0000FFFF, then the instruction

AND R0, R0, R1

will result in the pattern 000062CA being placed in register R0.

The Bit Clear instruction, BIC, is closely related to the AND instruction. It complements each bit in operand Rm before ANDing them with the bits in register Rn. Using the same R0 and R1 bit patterns as in the above example, the instruction

BIC R0, R0, R1

results in the pattern 02FA0000 being placed in R0.
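Both results are easy to confirm in Python (BIC modeled as AND with the complement):

```python
MASK = 0xFFFFFFFF
r0, r1 = 0x02FA62CA, 0x0000FFFF

and_result = r0 & r1              # AND R0, R0, R1
bic_result = r0 & (~r1 & MASK)    # BIC R0, R0, R1: AND with complement of Rm

print(hex(and_result))            # 0x62ca    (i.e., 000062CA)
print(hex(bic_result))            # 0x2fa0000 (i.e., 02FA0000)
```

AND with a mask keeps the selected bits; BIC with the same mask clears them, which is why the two results partition the original pattern.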

Digit-Packing Program

Figure D.5 illustrates the use of logic instructions in an ARM program for packing two 4-bit decimal digits into a memory byte location. The generic version of this program is shown in Figure 2.24 and described in Section 2.8.2. The decimal digits, represented in ASCII code, are stored in byte locations LOC and LOC + 1. The program packs the corresponding 4-bit BCD codes into a single byte location PACKED.

In writing the program for this task, we need to load a 32-bit address into a register. ARM instructions consist of a single 32-bit word, so the address cannot be represented by an immediate value in a Move instruction.

LDR     R0, =LOC                Load address LOC into R0.
LDRB    R1, [R0]                Load ASCII characters
LDRB    R2, [R0, #1]            into R1 and R2.
AND     R2, R2, #&F             Clear high-order 28 bits of R2.
ORR     R2, R2, R1, LSL #4      Shift contents of R1 left, perform
                                logical OR with contents of R2,
                                and place result into R2.
STRB    R2, PACKED              Store packed BCD digits
                                into PACKED.

Figure D.5 A program for packing two 4-bit decimal digits into a byte.


The assembler accepts an instruction of the form

LDR Ri, =ADDRESS

to load the address value ADDRESS into register Ri. This does not represent a real machine instruction. It is another example of a pseudoinstruction. The way in which the assembler implements the above instruction is discussed later in Section D.5. We will normally use this pseudoinstruction in program examples whenever it is necessary to load an address value into a register.

The first instruction in the program in Figure D.5 loads the address LOC into register R0. The two ASCII characters containing the BCD digits in their low-order four bits are loaded into the low-order byte positions of registers R1 and R2 by the next two Load instructions. The AND instruction clears the high-order 28 bits of R2 to zero, leaving the second BCD digit in the four low-order bit positions. The ‘&’ character in this instruction signifies hexadecimal notation for the immediate value. The ORR instruction then shifts the first BCD digit in R1 to the left four positions and places it to the left of the second BCD digit in R2. The two digits packed into the low-order byte of R2 are then stored into location PACKED.
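The same packing steps can be traced in Python (the sample ASCII digits ‘4’ and ‘7’ are invented for illustration):

```python
mem = bytearray(b'\x34\x37\x00')   # ASCII '4', '7' at LOC, LOC+1; PACKED last
LOC, PACKED = 0, 2

r1 = mem[LOC]                      # LDRB R1, [R0]          -> 0x34
r2 = mem[LOC + 1]                  # LDRB R2, [R0, #1]      -> 0x37
r2 &= 0xF                          # AND  R2, R2, #&F       -> 0x07
r2 |= r1 << 4                      # ORR  R2, R2, R1, LSL #4 -> 0x347
mem[PACKED] = r2 & 0xFF            # STRB stores the low-order byte

print(hex(mem[PACKED]))            # 0x47: BCD digits 4 and 7 packed
```

The high-order bits of the ASCII code for the first digit survive the shift, but they fall outside the low-order byte that STRB stores, so only the two BCD digits reach PACKED.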

Test Instructions

Instructions called Test (TST) and Test Equivalence (TEQ) perform the logical AND and XOR operations, respectively, on their word operands, then set condition code flags based on the result. They do not store the result in a register. These instructions can be used to test how an unknown bit pattern matches up against a known bit pattern, and can then be followed by a Branch instruction that is conditioned on the result.

For example, the Test instruction

TST Rn, #1

performs an AND operation to test whether the low-order bit of register Rn is equal to 1. If the result of the test is positive, that is, if the low-order bit of the contents of register Rn is equal to 1, then the result of the AND operation is 1, and the Z bit is cleared to zero. Status bits in I/O device registers can be checked with this type of instruction.

The Test Equivalence instruction

TEQ Rn, #5

performs an XOR operation to test whether register Rn contains the value 5. If it does, then the result of the bit-by-bit XOR operation will be zero, and the Z bit will be set to 1.

D.4.5 Compare Instructions

The Compare instruction

CMP Rn, Rm

performs the operation

[Rn] − [Rm]


and sets the condition code flags based on the result of the subtraction operation. The result itself is discarded.

The Compare Negative instruction

CMN Rn, Rm

performs the operation

[Rn] + [Rm]

and sets the condition code flags based on the result of the operation. The result of the operation is discarded.

In both of these instructions, the second operand can be an immediate value instead of the contents of a register. Either version of the second operand can be shifted as described in Section D.4.2.

D.4.6 Setting Condition Code Flags

The Compare and Test instructions always update the condition code flags. They are usually followed by conditional branch instructions, which are described in the next section. The arithmetic, logic, and Move instructions affect the condition code flags only if explicitly specified to do so by a bit in the OP-code field. This is indicated by appending the suffix S to the assembly language OP-code mnemonic. For example, the instruction

ADDS R0, R1, R2

sets the condition code flags, but

ADD R0, R1, R2

does not.

D.4.7 Branch Instructions

Conditional branch instructions contain a 24-bit 2's-complement value that is used to generate a branch offset as follows. When the instruction is executed, the value in the instruction is shifted left two bit positions (because all branch target addresses are word-aligned), then sign-extended to 32 bits to generate the offset. This offset is added to the updated contents of the Program Counter to generate the branch target address. An example is given in Figure D.6. The BEQ instruction (Branch if Equal to 0) causes a branch if the Z flag is set to 1. The appropriate 24-bit value in the instruction is computed by the assembler. In this case, it would be 92/4 = 23.

The condition to be tested to determine whether or not branching should take place is specified in the high-order four bits, b31−28, of the instruction word. A Branch instruction is executed in the same way as any other ARM instruction; that is, the branch is taken only if the current state of the condition code flags corresponds to the condition specified in the Condition field of the instruction. The full set of conditions is given in Table D.2. The

[Figure D.6 shows a BEQ LOCATION instruction at address 1000, with the next instruction at 1004 and the updated [PC] = 1008; the branch target instruction is at LOCATION = 1100, giving Offset = 92.]

Figure D.6 Determination of the target address for a branch instruction.

Table D.2 Condition field encoding in ARM instructions.

Condition field      Name       Condition                            Condition
b31 b30 b29 b28      suffix                                          code test

0   0   0   0        EQ         Equal (zero)                         Z = 1
0   0   0   1        NE         Not equal (nonzero)                  Z = 0
0   0   1   0        CS/HS      Carry set/Unsigned higher or same    C = 1
0   0   1   1        CC/LO      Carry clear/Unsigned lower           C = 0
0   1   0   0        MI         Minus (negative)                     N = 1
0   1   0   1        PL         Plus (positive or zero)              N = 0
0   1   1   0        VS         Overflow                             V = 1
0   1   1   1        VC         No overflow                          V = 0
1   0   0   0        HI         Unsigned higher                      C = 1 and Z = 0
1   0   0   1        LS         Unsigned lower or same               C = 0 or Z = 1
1   0   1   0        GE         Signed greater than or equal         N⊕V = 0
1   0   1   1        LT         Signed less than                     N⊕V = 1
1   1   0   0        GT         Signed greater than                  Z ∨ (N⊕V) = 0
1   1   0   1        LE         Signed less than or equal            Z ∨ (N⊕V) = 1
1   1   1   0        AL         Always                               —
1   1   1   1        —          not used                             —
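The condition tests of Table D.2 can be expressed as predicates over the four flags (an illustrative Python rendering; the dictionary is ours, not an ARM artifact):

```python
# Each predicate takes the flags (N, Z, C, V) and returns True if the
# corresponding condition in Table D.2 is satisfied.
COND = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "CS": lambda n, z, c, v: c == 1,
    "CC": lambda n, z, c, v: c == 0,
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,   # unsigned higher
    "LS": lambda n, z, c, v: c == 0 or z == 1,    # unsigned lower or same
    "GE": lambda n, z, c, v: n == v,              # N xor V = 0
    "LT": lambda n, z, c, v: n != v,              # N xor V = 1
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
    "AL": lambda n, z, c, v: True,
}

assert COND["EQ"](0, 1, 1, 0)       # Z = 1 satisfies EQ
assert COND["GT"](0, 0, 1, 0)       # Z = 0 and N = V satisfies GT
assert not COND["LE"](0, 0, 1, 0)
assert COND["LT"](1, 0, 0, 0)       # N != V satisfies LT
```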

assembler accepts the OP code B as an unconditional branch. It is not necessary to use the suffix AL.

At the time the branch target address is computed, the contents of the PC will have been updated to contain the address of the instruction that is two words beyond the Branch instruction itself. This is due to pipelined instruction execution, as explained in Section D.3.2. If the Branch instruction is at address location 1000 and the branch target address is 1100, as shown in Figure D.6, then the offset is 92, because the contents of the updated PC will be 1000 + 8 = 1008 when the branch target address 1100 is computed.
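The offset computation just described can be checked with a short sketch (Python; the function name is ours, and the 24-bit 2's-complement encoding of the field follows the description above):

```python
def branch_offset_field(branch_addr, target_addr):
    """24-bit field placed in a branch instruction (sketch). The updated
    PC is the branch address + 8 (two words ahead); the byte offset is
    stored as a word count, i.e. shifted right two bit positions."""
    byte_offset = target_addr - (branch_addr + 8)
    assert byte_offset % 4 == 0               # targets are word-aligned
    return (byte_offset >> 2) & 0xFFFFFF      # 24-bit 2's-complement field

# Figure D.6: branch at 1000, target 1100 -> byte offset 92, field 92/4 = 23
assert branch_offset_field(1000, 1100) == 23
# A backward branch of two words encodes as 2's-complement -2
assert branch_offset_field(1000, 1000) == 0xFFFFFE
```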

A Program for Adding Numbers

We have now described enough ARM instructions to enable us to present some of the programs that are given in generic form in Chapter 2. Figure D.7 shows a program for adding a list of numbers, patterned after the program in Figure 2.26. Location N contains the number of entries in the list, and location SUM is used to store the sum. The Load and Store operations performed by the first and last instructions use the Relative addressing mode. This assumes that the memory locations N and SUM are within the range reachable by offsets relative to the PC. The address NUM1 of the first of the numbers to be added is loaded into register R2 by the second instruction. The Post-indexed addressing mode, which includes writeback, is used in the first instruction of the loop. This mode achieves the same effect as the Autoincrement addressing mode in Figure 2.26.

A Program for Adding Test Scores

The flexibility available in ARM indexed addressing modes can be used to write an efficient version of the program for addition of student test scores shown in Figure 2.11. We assume the same data layout in memory as shown in Figure 2.10.

The program is shown in Figure D.8. The address N is loaded into register R2 at the beginning of the program. Register R2 serves as the index register for the Pre-indexed addressing mode with writeback used to access test scores in successive student records. Note how the offsets 8, 4, and 4, along with writeback, cause the contents of register R2 to be increased correctly to skip over student ID locations in each pass through the loop, including the first pass. The combination of flexibility in the offset values, along with the writeback feature,

        LDR   R1, N           Load count into R1.
        LDR   R2, =NUM1       Load address NUM1 into R2.
        MOV   R0, #0          Clear accumulator R0.
LOOP    LDR   R3, [R2], #4    Load next number into R3.
        ADD   R0, R0, R3      Add number into R0.
        SUBS  R1, R1, #1      Decrement loop counter R1.
        BGT   LOOP            Branch back if not done.
        STR   R0, SUM         Store sum.

Figure D.7 A program for adding numbers.

        LDR   R2, =N          Load address N into R2.
        MOV   R3, #0
        MOV   R4, #0
        MOV   R5, #0
        LDR   R6, N           Load the value n.
LOOP    LDR   R7, [R2, #8]!   Add current student mark
        ADD   R3, R3, R7        for Test 1 to partial sum.
        LDR   R7, [R2, #4]!   Add current student mark
        ADD   R4, R4, R7        for Test 2 to partial sum.
        LDR   R7, [R2, #4]!   Add current student mark
        ADD   R5, R5, R7        for Test 3 to partial sum.
        SUBS  R6, R6, #1      Decrement the counter.
        BGT   LOOP            Loop back if not finished.
        STR   R3, SUM1        Store the total for Test 1.
        STR   R4, SUM2        Store the total for Test 2.
        STR   R5, SUM3        Store the total for Test 3.

Figure D.8 An ARM version of the program in Figure 2.11 for summing test scores.

means that the last Add instruction in the program in Figure 2.11 is not needed to increment the pointer register R2 at the end of each pass through the loop.
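The record walk of Figure D.8 can be modeled in Python to see why the offsets 8, 4, and 4 with writeback skip each student ID, including on the first pass (an illustrative sketch using a byte-addressed memory dictionary; all names are ours):

```python
def sum_tests(mem, n_addr, n):
    """Walk n records of 4 words (ID, test1, test2, test3) that follow
    location N, mimicking the offsets 8, 4, 4 with writeback."""
    r2 = n_addr                    # R2 initially holds the address of N
    sums = [0, 0, 0]
    for _ in range(n):
        r2 += 8                    # [R2, #8]! : skip this record's ID word
        sums[0] += mem[r2]
        r2 += 4                    # [R2, #4]!
        sums[1] += mem[r2]
        r2 += 4                    # [R2, #4]!
        sums[2] += mem[r2]
    return sums

mem = {8: 10, 12: 20, 16: 30,      # student 1: tests at N+8, N+12, N+16
       24: 1, 28: 2, 32: 3}        # student 2: the ID word at N+20 is skipped
assert sum_tests(mem, 0, 2) == [11, 22, 33]
```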

D.4.8 Subroutine Linkage Instructions

The Branch and Link (BL) instruction is used to call a subroutine. It operates in the same way as other branch instructions, with one added step. The return address, which is the address of the next instruction after the BL instruction, is loaded into register R14, which acts as the link register. Since subroutines may be nested, the contents of the link register must be saved on the processor stack before a nested call to another subroutine is made. Register R13 is used as the processor stack pointer.

Figure D.9 shows the program of Figure D.7 rewritten as a subroutine. Parameters are passed through registers. The calling program passes the size of the number list and the address of the first number to the subroutine in registers R1 and R2. The subroutine passes the sum back to the calling program in register R0. The subroutine also uses register R3. Therefore, its contents, along with the contents of the link register R14, are saved on the stack by the STMFD instruction. The suffix FD in this instruction specifies that the stack grows toward lower addresses and that the stack pointer R13 is to be predecremented before pushing words onto the stack. The LDMFD instruction restores the contents of register R3 and pops the saved return address into the PC (R15), performing the return operation automatically.

Calling program

        LDR    R1, N
        LDR    R2, =NUM1
        BL     LISTADD
        STR    R0, SUM
        ...

Subroutine

LISTADD STMFD  R13!, {R3, R14}   Save R3 and return address in R14 on
                                 stack, using R13 as the stack pointer.
        MOV    R0, #0
LOOP    LDR    R3, [R2], #4
        ADD    R0, R0, R3
        SUBS   R1, R1, #1
        BGT    LOOP
        LDMFD  R13!, {R3, R15}   Restore R3 and load return address
                                 into PC (R15).

Figure D.9 Program of Figure D.7 written as a subroutine; parameters passed through registers.
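The behavior of STMFD and LDMFD on a full-descending stack can be sketched as follows (a Python model assuming the ARM convention that the lowest-numbered register ends up at the lowest address; the function names are ours):

```python
def stmfd(mem, sp, values):
    """Push values (lowest register first in the list) with pre-decrement;
    return the new stack pointer."""
    for v in reversed(values):      # highest-numbered register pushed first
        sp -= 4
        mem[sp] = v
    return sp

def ldmfd(mem, sp, count):
    """Pop `count` words in ascending-address order; return (values, new SP)."""
    values = [mem[sp + 4 * i] for i in range(count)]
    return values, sp + 4 * count

mem, sp = {}, 0x1000
sp = stmfd(mem, sp, [3, 14])        # e.g. STMFD R13!, {R3, R14}
restored, sp = ldmfd(mem, sp, 2)    # e.g. LDMFD R13!, {R3, R15}
assert restored == [3, 14] and sp == 0x1000
```

The pop returns the saved words in the same order they were named in the register list, which is how LDMFD restores R3 and then loads the saved return address into the PC.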

Figure D.10a shows the program of Figure D.7 rewritten as a subroutine with parameters passed on the processor stack. The parameters NUM1 and n are pushed onto the stack by the first four instructions of the calling program. Registers R0 to R3 serve the same purpose inside the subroutine as in Figure D.7. Their contents are saved on the stack by the first instruction of the subroutine along with the return address in R14. The contents of the stack at various times are shown in Figure D.10b. After the parameters have been pushed and the Call instruction (BL) has been executed, the top of the stack is at level 2. It is at level 3 after all registers have been saved by the first instruction of the subroutine. The next two instructions load the parameters into registers R1 and R2 using offsets of 20 and 24 bytes into the stack to access the values n and NUM1, respectively. When the sum has been accumulated in R0, it is written into the stack by the Store instruction (STR), overwriting NUM1.

The final example of subroutines is the case of handling nested calls. Figure D.11 shows the ARM code for the program of Figure 2.21. The stack frames corresponding to the first and second subroutines are shown in Figure D.12. Register R12 is used as the frame pointer. Symbolic names are used for some of the registers in this example to aid program readability. Registers R12 (frame pointer), R13 (stack pointer), R14 (link register), and R15 (program counter) are labeled as FP, SP, LR, and PC, respectively. The assembler

(Assume top of stack is at level 1 below.)

Calling program

        LDR    R0, =NUM1            Push NUM1
        STR    R0, [R13, #-4]!        on stack.
        LDR    R0, N                Push n
        STR    R0, [R13, #-4]!        on stack.
        BL     LISTADD
        LDR    R0, [R13, #4]        Move the sum into
        STR    R0, SUM                memory location SUM.
        ADD    R13, R13, #8         Remove parameters from stack.
        ...

Subroutine

LISTADD STMFD  R13!, {R0-R3, R14}   Save registers.
        LDR    R1, [R13, #20]       Load parameters
        LDR    R2, [R13, #24]         from stack.
        MOV    R0, #0
LOOP    LDR    R3, [R2], #4
        ADD    R0, R0, R3
        SUBS   R1, R1, #1
        BGT    LOOP
        STR    R0, [R13, #24]       Place sum on stack.
        LDMFD  R13!, {R0-R3, R15}   Restore registers and return.

(a) Calling program and subroutine

[Figure D.10b shows the stack contents: starting from the old top of stack (level 1), the parameters NUM1 and n are pushed to reach level 2; pushing the return address and the saved registers [R3], [R2], [R1], and [R0] brings the top of stack to level 3.]

(b) Top of stack at various times

Figure D.10 Program of Figure D.7 written as a subroutine; parameters passed on the stack.

Memory
location    Instructions                      Comments

Main program
            ...
2000        LDR    R10, PARAM2                Place parameters on stack.
2004        STR    R10, [SP, #-4]!
2008        LDR    R10, PARAM1
2012        STR    R10, [SP, #-4]!
2016        BL     SUB1
2020        LDR    R10, [SP]                  Store SUB1 result.
2024        STR    R10, RESULT
2028        ADD    SP, SP, #8                 Remove parameters from stack.
2032        next instruction
            ...

First subroutine

2100  SUB1  STMFD  SP!, {R0-R3, FP, LR}       Save registers.
2104        ADD    FP, SP, #16                Load frame pointer.
2108        LDR    R0, [FP, #8]               Load parameters.
2112        LDR    R1, [FP, #12]
            ...
            LDR    R2, PARAM3                 Place parameter on stack.
            STR    R2, [SP, #-4]!
2160        BL     SUB2
2164        LDR    R2, [SP], #4               Pop SUB2 result into R2.
            ...
            STR    R3, [FP, #8]               Place result on stack.
            LDMFD  SP!, {R0-R3, FP, PC}       Restore registers and return.

Second subroutine

3000  SUB2  STMFD  SP!, {R0, R1, FP, LR}      Save registers.
            ADD    FP, SP, #8                 Load frame pointer.
            LDR    R0, [FP, #8]               Load parameter.
            ...
            STR    R1, [FP, #8]               Place result on stack.
            LDMFD  SP!, {R0, R1, FP, PC}      Restore registers and return.

Figure D.11 Nested subroutines.

[Figure D.12 shows the two stack frames. The frame for SUB1 holds (from the old TOS downward) param2, param1, the return address 2020, [FP] from Main, and the saved [R3], [R2], [R1], and [R0] from Main, with param3 pushed below it. The frame for SUB2 holds the return address 2164, [FP] from SUB1, and the saved [R1] and [R0] from SUB1. In each frame, FP points to the saved [FP] word.]

Figure D.12 Stack frames for Figure D.11.

predefines LR and PC for this use, and the assembler directive RN can be used to define the names FP and SP, as explained in Section D.5.

The structure of the calling program and the subroutines is the same as in Figure 2.21. Aspects that are specific to ARM are as follows. Both the return address and the old contents of the frame pointer are saved on the stack by the first instruction in each subroutine. The second instruction sets the frame pointer to point to its saved value, as shown in Figure D.12. This is consistent with the frame pointer position in Figures 2.20 and 2.22. The parameters are then referenced at offsets of 8, 12, and so on. The last instruction in each subroutine restores the saved value of the frame pointer as well as the saved values of other registers, and pops the return address from the stack into the PC.

D.5 Assembly Language

The ARM assembly language has assembler directives to reserve storage space, assign numerical values to address labels and constant symbols, define where program and data blocks are to be placed in memory, and specify the end of the source program text. The general forms for such directives are described in Section 2.5.1.

Memory address
or data label      Operation    Addressing information

Assembler directives:
                   AREA         CODE
                   ENTRY

Statements that generate machine instructions:
                   LDR          R1, N
                   LDR          R2, POINTER
                   MOV          R0, #0
LOOP               LDR          R3, [R2], #4
                   ADD          R0, R0, R3
                   SUBS         R1, R1, #1
                   BGT          LOOP
                   STR          R0, SUM

Assembler directives:
                   AREA         DATA
SUM                DCD          0
N                  DCD          5
POINTER            DCD          NUM1
NUM1               DCD          3, -17, 27, -12, 322
                   END

Figure D.13 Assembly-language source program for the program in Figure D.7.

We illustrate some of the ARM directives in Figure D.13, which gives a complete source program for the program of Figure D.7. The AREA directive, which uses the argument CODE or DATA, indicates the beginning of a block of memory that contains either program instructions or data. Other parameters are required to specify the placement of code and data blocks into specific memory areas. The ENTRY directive specifies that program execution is to begin at the following LDR instruction.

In the data area, which follows the code area, the DCD directives are used to label and initialize the data operands. The word locations SUM and N are initialized to 0 and 5, respectively, by the first two DCD directives. The address NUM1 is placed in the location POINTER by the next DCD directive. The combination of the instruction

LDR R2, POINTER

and the data declaration

POINTER DCD NUM1

is one of the ways that the pseudoinstruction

LDR R2, =NUM1

in Figure D.7 can be implemented, as described in Section D.5.1. The last DCD directive specifies that the five numbers to be added are placed in successive memory word locations, starting at NUM1.

Constants in hexadecimal notation are identified by the prefix ‘&’, and constants in base n, for n between two and nine, are identified with a prefix indicating the base. For example, 2_101100 denotes a binary constant, and 8_70375 denotes an octal constant. Base ten constants do not need a prefix.
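This constant notation can be sketched with a small Python parser (illustrative only; the function name is ours and this is not the assembler's actual implementation):

```python
def parse_constant(text):
    """Interpret a constant in the ARM assembler notation described above."""
    if text.startswith("&"):                 # hexadecimal, e.g. &A123
        return int(text[1:], 16)
    if "_" in text:                          # base-n prefix, 2 <= n <= 9
        base, digits = text.split("_", 1)
        return int(digits, int(base))
    return int(text)                         # base ten needs no prefix

assert parse_constant("2_101100") == 0b101100
assert parse_constant("8_70375") == 0o70375
assert parse_constant("&A123") == 0xA123
assert parse_constant("10") == 10
```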

An EQU directive can be used to declare symbolic names for constants. For example, the statement

TEN EQU 10

allows TEN to be used in a program instead of the decimal constant 10.

It is convenient to use symbolic names for registers, relating to their usage. The RN

directive is used for this purpose. For example,

COUNTER RN 3

establishes the name COUNTER for register R3. The register names R0 to R15, PC (for R15), and LR (for R14) are predefined by the assembler.

D.5.1 Pseudoinstructions

A pseudoinstruction is an assembly-language instruction that performs some desired operation but does not correspond directly to an actual machine instruction. The assembler accepts such an instruction and replaces it with an actual machine instruction that performs the desired operation. In some cases, a short sequence of actual machine instructions may be needed. Pseudoinstructions are provided for the convenience of the programmer.

We have already seen examples of pseudoinstructions in Sections D.4.3 and D.4.4. Here, we will give a more complete discussion of pseudoinstructions that can be used to load a 32-bit number or address value into a register.

Loading 32-bit Values

The pseudoinstruction

LDR Rd, =value

can be used to load any 32-bit value into a register. The equal sign in front of the value distinguishes this instruction from an actual Load instruction. If the value can be formed and loaded into Rd by a MOV or MVN instruction, then that is the choice that will be made by the assembler. If this is not possible, the assembler will use the Relative addressing mode in an actual LDR instruction to load the value from a memory location that is in a data area allocated by the assembler.

For example, the instruction

LDR R3, =127

will be replaced by the instruction

MOV R3, #127

But the instruction

LDR R3, =&A123B456

will be replaced with

LDR R3, MEMLOC

where the hexadecimal value A123B456 is the contents of memory location MEMLOC, accessed using the Relative addressing mode.

The value of an address label can also be loaded into a register in this way, as we have done in most of the program examples in this appendix.
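How an assembler might choose between MOV/MVN and a literal-pool LDR can be sketched by testing whether a value is an 8-bit pattern rotated by an even amount (a Python model of the decision described above, under the usual ARM immediate-encoding assumption; the names are ours):

```python
def is_rotated_imm8(value):
    """True if value equals an 8-bit constant rotated by an even amount."""
    value &= 0xFFFFFFFF
    for rot in range(0, 32, 2):
        # rotate the value left; if it then fits in 8 bits, MOV can encode it
        v = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if v < 256:
            return True
    return False

def expansion(value):
    """Sketch of the assembler's choice for LDR Rd, =value."""
    if is_rotated_imm8(value):
        return "MOV"
    if is_rotated_imm8(~value & 0xFFFFFFFF):      # complement fits -> MVN
        return "MVN"
    return "LDR from literal pool"

assert expansion(127) == "MOV"                    # LDR R3, =127 -> MOV R3, #127
assert expansion(0xA123B456) == "LDR from literal pool"
```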

Loading Address Values

In addition to the method just described, a more efficient way can be used to load an address into a register when the address is close to the current value of the program counter PC (R15). This alternative approach avoids the need to place the desired address value in a data area.

The pseudoinstruction

ADR Rd, LOCATION

loads the 32-bit address represented by LOCATION into Rd. The ADR instruction is implemented as follows. The assembler computes the offset from the current value in PC to LOCATION. If LOCATION is in the forward direction, then ADR is implemented with the instruction

ADD Rd, R15, #offset

If LOCATION is in the backward direction, then

SUB Rd, R15, #offset

is used.

In either case, the offset is an unsigned 8-bit number in the range 0 to 255, as described earlier for arithmetic instructions. A limited number of larger offset values can also be generated by rotation of an 8-bit value, as described in Section D.4.2.

D.6 Example Programs

In this section, we give ARM versions of the generic programs for vector dot product and string search that are presented in Section 2.12. We will describe only those aspects of the ARM code that differ from the generic programs.

        LDR   R1, =AVEC       R1 points to vector A.
        LDR   R2, =BVEC       R2 points to vector B.
        LDR   R3, N           R3 is the loop counter.
        MOV   R0, #0          R0 accumulates the dot product.
LOOP    LDR   R4, [R1], #4    Load A component.
        LDR   R5, [R2], #4    Load B component.
        MLA   R0, R4, R5, R0  Multiply components and
                                accumulate into R0.
        SUBS  R3, R3, #1      Decrement the counter.
        BGT   LOOP            Branch back if not done.
        STR   R0, DOTPROD     Store dot product.

Figure D.14 A dot product program.

D.6.1 Vector Dot Product

A program that calculates the dot product of two vectors A and B is given in Figure D.14. The first two instructions load the starting addresses of the vectors, AVEC and BVEC, into registers R1 and R2. The Relative addressing mode is used to access the contents of N and DOTPROD, and the Post-indexed addressing mode (which always includes writeback) is used in the first two instructions of the loop. The Multiply-Accumulate instruction (MLA) performs the necessary arithmetic operations. It multiplies the vector elements in R4 and R5 and accumulates their product into R0.
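The loop of Figure D.14 reduces to a multiply-accumulate recurrence, sketched here in Python (illustrative only; the function name is ours):

```python
def dot_product(a, b, n):
    """Sum of products of the first n components of vectors a and b."""
    acc = 0                       # MOV R0, #0
    for i in range(n):            # SUBS / BGT loop over the components
        acc += a[i] * b[i]        # MLA R0, R4, R5, R0
    return acc

assert dot_product([1, 2, 3], [4, 5, 6], 3) == 32
```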

D.6.2 String Search

The ARM program in Figure D.15 follows the generic program in Figure 2.30 very closely. There are two differences worth noting. The Post-indexed addressing mode used in the first two instructions of LOOP2 in the ARM program avoids the need for the two Add instructions in LOOP2 of the generic program; and the three pairs of ARM instructions CMP/BNE, CMP/BGT, and CMP/BGE are needed to implement the three generic Branch_if instructions.

D.7 Operating Modes and Exceptions

The ARM processor has seven operating modes. Application programs run in User mode. There are five exception modes; the appropriate one is entered when an exception occurs. The seventh operating mode is the System mode. It can only be entered from one of the exception modes, as discussed in Section D.7.3.

        LDR   R2, =T          Load address T into R2.
        LDR   R3, =P          Load address P into R3.
        LDR   R4, N           Get the value n.
        LDR   R5, M           Get the value m.
        SUB   R4, R4, R5      Compute n - m.
        ADD   R4, R2, R4      R4 is the address of T(n - m).
        ADD   R5, R3, R5      R5 is the address of P(m).
LOOP1   MOV   R6, R2          Use R6 to scan through string T.
        MOV   R7, R3          Use R7 to scan through string P.
LOOP2   LDRB  R8, [R6], #1    Compare a pair of
        LDRB  R9, [R7], #1      characters in
        CMP   R8, R9            strings T and P.
        BNE   NOMATCH
        CMP   R5, R7          Check if at P(m).
        BGT   LOOP2           Loop again if not done.
        STR   R2, RESULT      Store the address of T(i).
        B     DONE
NOMATCH ADD   R2, R2, #1      Point to next character in T.
        CMP   R4, R2          Check if at T(n - m).
        BGE   LOOP1           Loop again if not done.
        MOV   R8, #-1         No match was found.
        STR   R8, RESULT
DONE    next instruction

Figure D.15 A string search program.

The five exception modes and the exceptions that cause them to be entered are summarized as follows:

• Fast interrupt (FIQ) mode is entered when an external device raises a fast-interrupt request to obtain urgent service.

• Ordinary interrupt (IRQ) mode is entered when an external device raises a normal interrupt request.

• Supervisor (SVC) mode is entered on powerup or reset, or when a user program executes a Software Interrupt instruction (SWI) to call for an operating system routine to be executed.

• Memory access violation (Abort) mode is entered when an attempt by the current program to fetch an instruction or a data operand causes a memory access violation.

• Unimplemented instruction (Undefined) mode is entered when the current program attempts to execute an unimplemented instruction.

The interrupt-disable bits I and F in the Status register determine whether the processor is interrupted when an interrupt request is raised on the corresponding lines (IRQ and FIQ). The processor is not interrupted if the disable bit is 1; it is interrupted if the disable bit is 0.

The five exception modes and the System mode are privileged modes. When the processor is in a privileged mode, access to the Status register (CPSR in Figure D.1) is allowed so that the mode bits and the interrupt-disable bits can be manipulated. This is done with instructions that are not available in User mode, which is an unprivileged mode.

D.7.1 Banked Registers

When the processor is operating in either the User or System mode, the normal sixteen processor registers shown in Figure D.1 are in use. When an exception occurs and a switch is made from User mode to one of the five exception modes, some of these sixteen registers are replaced by an equal number of banked registers, as described in Section D.2. The contents of the replaced registers are left unchanged. There is a different set of banked registers for each of the five exception modes, shown in blue in Figure D.16.

When an exception occurs, the following actions are taken on the switch from User mode to the appropriate exception mode:

1. The contents of the Program Counter (R15) are loaded into the banked Link register (R14_mode) of the exception mode.

2. The contents of the Status register (CPSR) are loaded into the banked Saved Status register (SPSR_mode).

3. The mode bits of CPSR are changed to represent the appropriate exception mode, and the interrupt-disable bits I and F are set appropriately.

4. The Program Counter (R15) is loaded with the dedicated vector address for the exception, and the instruction at that address is fetched and executed to begin the exception-service routine.
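The four steps above can be sketched over a toy processor state (an illustrative Python model; the dictionary fields are ours, and the vector addresses follow Table D.3):

```python
VECTOR = {"IRQ": 24, "FIQ": 28, "SVC": 8}    # vector addresses from Table D.3

def enter_exception(state, mode):
    """Model the four entry actions on a switch into an exception mode."""
    banked = state.setdefault("banked", {})
    banked["R14_" + mode] = state["PC"]          # 1. save return address
    banked["SPSR_" + mode] = state["CPSR"]       # 2. save status register
    state["CPSR"] = {"mode": mode, "I": 1}       # 3. set mode bits, disable IRQ
    state["PC"] = VECTOR[mode]                   # 4. load the vector address
    return state

s = enter_exception({"PC": 0x1008, "CPSR": {"mode": "User", "I": 0}}, "IRQ")
assert s["PC"] == 24
assert s["banked"]["R14_IRQ"] == 0x1008
assert s["CPSR"]["mode"] == "IRQ"
```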

The active stack pointer register (R13_mode) always points to the top element of a processor stack in an area of memory that has been allocated for the relevant exception mode. The contents of R13_mode are initialized by the operating system.

When the exception-service routine has been completed, it is necessary to return to the User mode to continue execution of the interrupted program. This is accomplished by transferring the contents of the mode link register (R14_mode) to the program counter and transferring the contents of the Saved Status register (SPSR_mode) to the Status register (CPSR).

The actions just described for switching from User mode to an exception mode and then back again to User mode have been presented in general terms. The details vary somewhat depending on the actual exception and the mode entered. These details are described further in the following sections.

[Figure D.16 shows the register banks for each mode. The User/System column lists the general-purpose registers R0-R14, the program counter R15, and CPSR. The FIQ column replaces R8-R14 with the banked registers R8_fiq-R14_fiq and adds SPSR_fiq. The IRQ, Supervisor, Abort, and Undefined columns each replace R13 and R14 with R13_irq/R14_irq, R13_svc/R14_svc, R13_abt/R14_abt, and R13_und/R14_und, respectively, and add SPSR_irq, SPSR_svc, SPSR_abt, and SPSR_und. R15 and CPSR are shared by all modes.]

Figure D.16 Accessible registers in different modes of the ARM processor.

D.7.2 Exception Types

There are seven possible exceptions. They are listed in Table D.3 along with the processor mode that is entered when they occur. The exception vector addresses are also listed. These word locations at the low end of the address space must contain branch instructions to the start of the exception-service routines. The fast interrupt routine could start immediately

Table D.3 Exceptions and processor modes.

                                Processor mode      Vector     Priority
Exception                       entered             address    (Highest = 1)

Fast interrupt                  FIQ                 28         3
Ordinary interrupt              IRQ                 24         4
Software interrupt              Supervisor (SVC)     8         —
Powerup/reset                   Supervisor (SVC)     0         1
Data access violation           Abort               16         2
Instruction access violation    Abort               12         5
Unimplemented instruction       Undefined            4         6

without the need for a branch instruction because its vector address (28) is the last in the list. When multiple exceptions occur at the same time, the priority order in which they are serviced is shown in the last column of Table D.3.

A more detailed description of the exceptions is as follows:

• Fast (FIQ) and ordinary (IRQ) interrupts—Input/output devices use one of two interrupt request lines to request service. The FIQ interrupt is intended for one device or a small number of devices that require rapid response. The banked registers for the FIQ processor mode shown in Figure D.16 include five general-purpose registers R8_fiq through R12_fiq in addition to the stack pointer register R13_fiq and the link register R14_fiq. If the five general-purpose registers provide enough working space for the FIQ interrupt-service routine, then none of the other User-mode registers need to be saved and restored. All other I/O devices use the IRQ interrupt line to request service.

• Software interrupts—A user program requests operating system services by executing the SWI instruction. This is an exception that causes entry into the Supervisor mode. A parameter field in the instruction specifies the requested service and is accessible from the Supervisor routines.

• Powerup/reset—This is the highest priority exception. It places the processor into a known initial state so that operating system software can begin or restart operation properly. Any program executing when this exception occurs is abandoned.

• Data and instruction access violations—Processor implementations may include a memory management unit that restricts programs to valid areas of the address space for their instructions and data. Such a unit is necessary to implement virtual memory as described in Chapter 8. If the processor issues an address for an instruction fetch or data operand access outside these areas, an exception occurs and the Abort mode is entered. This mode also handles the case where the address is valid but is not currently mapped into main memory and needs to be transferred from secondary storage.

• Unimplemented instruction—If the processor tries to execute an instruction that is not implemented in hardware, an exception is raised and the Undefined mode is entered. For

example, a floating-point arithmetic operation that can be supported by special hardware may not be implemented in the current processor. In this case, the exception can cause a software implementation of the operation to be executed.

D.7.3 System Mode

The System mode is a privileged mode that uses the same registers as those used in the User mode. It can only be entered from another exception mode. Its purpose is to facilitate linkage to subroutines during exception handling without overwriting the link register R14_mode. When in System mode, subroutine Call instructions use the normal link register R14. After returning from all subroutine calls, the original exception mode is reentered, regaining access to the link register R14_mode.

D.7.4 Handling Exceptions

The general actions needed to switch from User mode to the appropriate exception mode and then back again after an exception occurs have been described briefly in Section D.7.1. The actions vary in detail, depending upon the exception and the exception mode entered. Here, we consider some of those details.

Pipelined Execution, the Program Counter, and the Status Register

The ARM processor overlaps the fetching and execution of successive instructions in order to increase instruction throughput. This technique is called pipelined instruction execution. It is described in Chapter 6. During pipelined execution of instructions, updating of the program counter is done as follows. Suppose that the processor fetches instruction I1 from address A. The contents of PC are incremented to A+4, then execution of I1 is begun. Before the execution of I1 is completed, the processor fetches instruction I2 from address A+4, then increments PC to A+8.

Now assume that at the end of execution of I1 the processor detects that an ordinary interrupt request (IRQ) has been received. The processor performs the actions described in Section D.7.1 to enter the IRQ exception mode to service the interrupt. It copies the contents of CPSR into SPSR_irq and copies the contents of PC, which are now A+8, into the link register R14_irq. Instruction I2, which has been fetched but not yet fully executed, is discarded. This is the instruction to which the interrupt-service routine must return. The interrupt-service routine must subtract 4 from R14_irq before using its contents as the return address. The saved copy of the Status register must also be restored. The required actions are carried out by the single instruction

SUBS PC, R14_irq, #4

which subtracts 4 from R14_irq and stores the result into PC. The suffix S in the OP code normally means “set condition codes.” But when the target register of the instruction is PC, the S suffix causes the processor to copy the contents of SPSR_irq into CPSR, thus completing the actions needed to return to the interrupted program.
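To make the return-address arithmetic concrete, here is a small Python simulation (not ARM code; the address value of A is illustrative) of why subtracting 4 recovers the address of the discarded instruction I2:

```python
# Simulation of the IRQ return-address correction described above.
# ARM instructions are 4 bytes each.

A = 0x1000                      # address of instruction I1 (illustrative)

# While I1 executes, I2 (at A+4) has already been fetched,
# so PC has been incremented to A+8.
pc_when_irq_taken = A + 8

# Entering IRQ mode: the processor copies PC into R14_irq.
r14_irq = pc_when_irq_taken

# SUBS PC, R14_irq, #4 computes the correct return address.
return_address = r14_irq - 4

assert return_address == A + 4  # address of the discarded instruction I2
```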

Page 669: Carl Hamacher

December 2, 2010 10:14 ham_338065_appd Sheet number 35 Page number 645 cyan black

D.7 Operating Modes and Exceptions 645

Table D.4   Address correction during return from exception.

Exception               Saved address*   Desired return address   Return instruction
Undefined instruction   PC+4             PC+4                     MOVS PC, R14_und
Software interrupt      PC+4             PC+4                     MOVS PC, R14_svc
Instruction Abort       PC+4             PC                       SUBS PC, R14_abt, #4
Data Abort              PC+8             PC                       SUBS PC, R14_abt, #8
IRQ                     PC+4             PC                       SUBS PC, R14_irq, #4
FIQ                     PC+4             PC                       SUBS PC, R14_fiq, #4

*PC is the address of the instruction that caused the exception. For IRQ and FIQ, it is the address of the first instruction not executed because of the interrupt.

In the case of a software interrupt triggered by execution of the SWI instruction, the value saved in R14_svc is the correct return address. Return from a software interrupt can be accomplished using the instruction

MOVS PC, R14_svc

that also copies the contents of SPSR_svc into CPSR.

Table D.4 gives the correct return-address value and the instruction that can be used to return to the interrupted program for each of the exceptions in Table D.3, except for powerup/reset, which abandons any currently executing program. Note that for a data access or instruction access violation, the return address is the address of the instruction that caused the exception, because it must be re-executed after the cause of the violation has been resolved.

Manipulating Status Register Bits

When the processor is running in a privileged mode, special Move instructions, MRS and MSR, can be used to transfer the contents of the current or saved processor status registers to or from a general-purpose register. For example,

MRS Rd, CPSR

copies the contents of CPSR into register Rd. Similarly,

MSR SPSR, Rm

copies the contents of register Rm into SPSR_mode.

After status register contents have been loaded into a register, logic instructions can be used to manipulate individual bits. Then, the register contents can be copied back into the status register to effect the desired changes. For example, these steps can be used to set or clear interrupt-disable bits in an exception-service routine. We will see this done in Section D.8 in the handling of I/O device interrupts.


Nesting Exception-Service Routines

Recall that nesting of subroutines is facilitated by storing the contents of the link register in the stack frame associated with a subroutine that calls another subroutine. This action is not needed when an exception-service routine is interrupted by a higher-priority exception whose service routine runs in a different processor mode. This is because each mode has its own banked link register.

For example, suppose that an ordinary interrupt is being serviced by an IRQ-mode routine when an interrupt that requires fast servicing is received. The first routine is interrupted and the FIQ mode is entered to service the second interrupt. The return address for the program that was interrupted to service the IRQ interrupt remains unchanged in link register R14_irq. The return address for the IRQ routine is stored in R14_fiq. Hence, the use of banked registers avoids overwriting saved return addresses, and these addresses do not need to be placed on the stack when nesting of exception routines occurs. However, if different exceptions are serviced in the same processor mode, then their return addresses will need to be saved if nesting is allowed.

D.8 Input/Output

The ARM architecture uses the memory-mapped I/O approach as described in Section 3.1. Reading a character from a keyboard or sending a character to a display can be done using program-controlled I/O as described in Section 3.1.2. It is also possible to use interrupt-driven I/O as described in Section 3.2. Both of these options will be illustrated in this section by presenting program examples that show how the generic programs in Chapter 3, which involve keyboard and display devices, can be implemented in ARM assembly language.

D.8.1 Program-Controlled I/O

We begin by giving short instruction sequences for reading a character from a keyboard and writing a character to a display.

Keyboard Character Input

Assume that the data, status, and control registers in the keyboard interface are arranged as shown in Figure 3.3a. Also, assume that address KBD_DATA (0x4000) has been loaded into register R1. The instruction sequence

READWAIT  LDRB  R3, [R1, #4]
          TST   R3, #2
          BEQ   READWAIT
          LDRB  R3, [R1]

reads a character into register R3 when a key has been pressed on the keyboard. The Test (TST) instruction performs the bitwise logical AND operation on its two operands and sets


the condition code flags based on the result. The immediate operand, 2, has a single one in the b1 bit position. Therefore, the result of the TST operation will be zero until KIN = 1, which signifies that a character is available in KBD_DATA. The BEQ instruction branches back to READWAIT if KIN = 0. This results in looping until a key is pressed, which causes KIN to be set to one. Then, the branch is not taken, and the character is loaded into register R3.
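The polling behavior can be modeled with a short Python sketch. The mask value 2 is taken from the TST instruction above; the function and the simulated status sequence are our own illustration:

```python
# Python model of the READWAIT polling loop: keep testing bit b1 (KIN)
# of the status register until it is 1, then read the data register.

KIN_MASK = 0x2   # the immediate operand 2 used by TST R3, #2

def read_char(status_values, data_register):
    """status_values simulates successive reads of KBD_STATUS."""
    polls = 0
    for status in status_values:
        polls += 1
        if status & KIN_MASK:            # TST R3, #2 / BEQ READWAIT
            return data_register, polls  # LDRB R3, [R1]
    raise RuntimeError("no key pressed")

# The device reports 'not ready' twice, then a character arrives.
char, polls = read_char([0x0, 0x0, 0x2], ord('A'))
assert (char, polls) == (ord('A'), 3)
```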

Display Character Output

Assuming that address DISP_DATA has been loaded into register R2, the instruction sequence

WRITEWAIT  LDRB  R4, [R2, #4]
           TST   R4, #4
           BEQ   WRITEWAIT
           STRB  R3, [R2]

sends the character in register R3 to the DISP_DATA register when the display is ready to receive it.

Complete Input/Output Program

The two routines just described can be used to read a line of characters from a keyboard, store them in the memory, and echo them back to a display, as shown in the program in Figure D.17. This program is patterned after the generic program in Figure 3.4. Register R0 is assumed to contain the address of the first byte in the memory area where the line is to be stored. Registers R1 through R4 have the same usage as in the READWAIT and WRITEWAIT loops just described. The first Store instruction (STRB) stores the character read from the keyboard into the memory. The Post-indexed addressing mode with writeback is used in this instruction to step through the memory area. The Test Equivalence (TEQ) instruction tests whether or not the two operands are equal and sets the Z condition code flag accordingly.

READ     LDRB   R3, [R1, #4]      Load KBD_STATUS byte and
         TST    R3, #2            wait for character.
         BEQ    READ
         LDRB   R3, [R1]          Read the character and
         STRB   R3, [R0], #1      store it in memory.
ECHO     LDRB   R4, [R2, #4]      Load DISP_STATUS byte and
         TST    R4, #4            wait for display
         BEQ    ECHO              to be ready.
         STRB   R3, [R2]          Send character to display.
         TEQ    R3, #CR           If not carriage return,
         BNE    READ              read more characters.

Figure D.17 A program that reads a line of characters and displays it.
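The control flow of Figure D.17 can be cross-checked with a Python sketch in which the keyboard and display are replaced by simple in-memory sequences (all names here are illustrative, not part of the ARM program):

```python
# Python sketch of the read/store/echo loop of Figure D.17.

CR = '\r'   # carriage-return terminator

def read_line(keyboard, display, memory):
    """keyboard: iterator of characters; display, memory: output lists."""
    for ch in keyboard:        # READ: wait for and read a character
        memory.append(ch)      # STRB R3, [R0], #1: store it in memory
        display.append(ch)     # ECHO: send character to display
        if ch == CR:           # TEQ R3, #CR / BNE READ
            break

mem, disp = [], []
read_line(iter("hi\rxx"), disp, mem)   # input beyond CR is never read
assert "".join(mem) == "hi\r" and disp == mem
```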


D.8.2 Interrupt-Driven I/O

The ARM interrupt facility described in Section D.7 can be used to read a line of characters typed on a keyboard under interrupt-driven control. We assume that the keyboard device has its interrupt-request line attached to the IRQ interrupt input to the processor.

There may be a number of devices that are enabled to raise interrupts on the IRQ line. If this is the case, software polling of these devices in some priority order can be used in the IRQ interrupt-service routine to identify the first device with an interrupt raised. For simplicity, we will assume that the keyboard is the only device that can raise an interrupt request on the IRQ line.

A generic program that uses interrupts for reading a line of characters until a carriage-return character (CR) is encountered is given in Figure 3.8. The interrupt-service routine also sends the characters to a display, as they are read from the keyboard, using program-controlled I/O. We will implement this task on the ARM processor. The Status register, CPSR, and the Saved Status register in the IRQ processor mode, SPSR_irq, are the processor control registers that are relevant for handling interrupts.

The following memory locations are needed in this example I/O program:

• PNTR is a pointer location that contains the address where the next character read from the keyboard is to be loaded into memory.

• LINE is the memory byte location where the first character of the line is to be placed.

• EOL is a memory location containing a binary variable that indicates to the main program when a complete line has been read.

Figure D.18 shows an ARM IRQ interrupt-service routine and a main program that correspond to those in Figure 3.8. The main program is assumed to be running in Supervisor mode. The first six instructions initialize PNTR with the address LINE, clear the EOL indicator, and enable interrupts in the keyboard control register, KBD_CONT. The last instruction clears the IRQ disable bit (I) and switches the processor to User mode by using the MSR instruction to load the hexadecimal value 50 into CPSR.

The IRQ interrupt-service routine follows the pattern of the generic program in Figure 3.8 very closely. Many of the Load and Store instructions use the Relative addressing mode for simplicity, assuming that both the memory locations and device registers named are within the range reachable by an offset from the Program Counter.

D.9 Conditional Execution of Instructions

The conditional execution of all ARM instructions permits shorter routines to be written in place of routines written for conventional RISC machines where there are a number of branch instructions.

Consider the following example. A loop component of a routine for finding the greatest common divisor (GCD) of two non-zero, positive integers [4] using RISC-style instructions is shown in Figure D.19a. The two numbers are contained in registers R2 and R3 when the routine is entered. At the beginning of each pass through the loop, if the numbers are not equal, then the routine subtracts the smaller number from the larger number and returns to


Interrupt-service routine

IRQLOC   STMFD  R13!, {R2, R3}    Save R2 and R3 on the stack.
         LDR    R2, PNTR          Load address pointer.
         LDRB   R3, KBD_DATA      Read character from keyboard.
         STRB   R3, [R2], #1      Write character into memory
                                  and increment pointer.
         STR    R2, PNTR          Update pointer in memory.
ECHO     LDRB   R2, DISP_STATUS   Wait for display to be ready.
         TST    R2, #4
         BEQ    ECHO
         STRB   R3, DISP_DATA     Send character to display.
         CMP    R3, #CR           Check if character is Carriage Return
         BNE    RTRN              and return if not CR.
         MOV    R2, #1            If CR, indicate end of line.
         STR    R2, EOL
         MOV    R2, #0            Disable interrupts in
         STRB   R2, KBD_CONT      keyboard interface.
RTRN     LDMFD  R13!, {R2, R3}    Restore registers
         SUBS   R15, R14, #4      and return from interrupt.

Main program

         LDR    R2, =LINE         Initialize buffer pointer.
         STR    R2, PNTR
         MOV    R2, #0            Clear end-of-line indicator.
         STR    R2, EOL
         MOV    R2, #2            Enable interrupts in keyboard interface.
         STRB   R2, KBD_CONT
         MOV    R2, #&50          Enable IRQ interrupts
         MSR    CPSR, R2          and switch to User mode.
         next instruction

Figure D.18   A program that reads an input line from a keyboard using interrupts, and displays the line using polling.

the beginning of the loop. If the numbers are the same, an exit from the loop is taken to location NEXT, and the GCD is contained in both registers.

An ARM routine for this task is shown in Figure D.19b. The Compare instruction sets the condition codes. The suffix in the OP code of each Subtract instruction specifies the condition under which it is to be executed. The first Subtract instruction is executed only if the contents of register R2 are greater than those of register R3, and the second Subtract instruction is executed only if the contents of register R3 are greater than those of register


LOOP     Branch_if_[R2]=[R3]  NEXT
         Branch_if_[R2]>[R3]  REDUCE
         Subtract  R3, R3, R2
         Branch    LOOP
REDUCE   Subtract  R2, R2, R3
         Branch    LOOP
NEXT     next instruction

(a) GCD algorithm using RISC-style instructions

LOOP     CMP    R2, R3
         SUBGT  R2, R2, R3
         SUBLT  R3, R3, R2
         BNE    LOOP
NEXT     next instruction

(b) GCD algorithm using ARM instructions

Figure D.19 Conditional execution of instructions.

R2. On each pass through the loop when the contents of the two registers are not equal, only one of the Subtract instructions is executed. When the contents of the two registers are equal, which may be the case initially, neither of the two Subtract instructions is executed, and the branch back to LOOP is not taken.
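The loop in Figure D.19b computes the GCD by repeated subtraction. A Python rendering of the same algorithm, with variables standing in for registers R2 and R3, behaves exactly as described:

```python
# Repeated-subtraction GCD, mirroring the CMP/SUBGT/SUBLT/BNE loop
# of Figure D.19b.

def gcd(r2, r3):
    """Both arguments must be non-zero, positive integers."""
    while r2 != r3:        # CMP R2, R3 ... BNE LOOP
        if r2 > r3:
            r2 = r2 - r3   # SUBGT R2, R2, R3
        else:
            r3 = r3 - r2   # SUBLT R3, R3, R2
    return r2              # the GCD is left in both registers

assert gcd(98, 70) == 14
assert gcd(9, 9) == 9      # equal operands: loop body never executes
```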

The shorter ARM code sequences that result from cases such as this are most effective when there is a relatively high density of Branch instructions in conventional code. The code space savings are important in small embedded system applications.

D.10 Coprocessors

Hardware units called coprocessors can be connected to an ARM processor. They are used to perform operations that are not included in the basic ARM instruction set. One example is a hardware unit for performing arithmetic operations on floating-point numbers. Other examples include application-specific processing on digital signals or video data. Writing programs that use coprocessors is facilitated by including extensions to the ARM instruction set that are of three types:


• Data operations in the coprocessor

• Transfers between ARM and coprocessor registers

• Load and Store transfers between memory and the coprocessor registers

Software that defines a coprocessor unit in a form that can be used to synthesize a hardware realization can be combined with software that defines a basic ARM processor to synthesize a single chip that integrates the coprocessor unit with the ARM processor.

D.11 Embedded Applications and the Thumb ISA

Low-cost and low-power embedded systems, such as mobile telephones, are a major application area for ARM processors. Designers of such systems strive to minimize the size of the on-chip memory space needed to store the required programs. In Section D.9, we saw that conditional execution of instructions can lead to reduced code space. Block Transfer instructions, which are used to transfer words between multiple registers and a block of memory words, also reduce code space.

Code space can be reduced further by using a subset of the most commonly used ARM instructions that are provided in a format that uses only 16 bits for each instruction. This subset is called the Thumb instruction set. Thumb instructions are executed as follows. First, they are fetched from memory. Then, they are decompressed (expanded) from their 16-bit encoded format into corresponding 32-bit ARM instructions and executed in the usual way.

Bit b5 in the Status register (CPSR), labeled T, determines whether the incoming instruction stream consists of Thumb (T = 1) or standard 32-bit ARM instructions (T = 0). A program can contain a mix of Thumb routines and standard instruction routines. Special instructions are needed to manipulate the T bit when switching between such routines.

There are two main differences between Thumb and standard instructions. First, many Thumb instructions use a two-operand format in which the destination register is also one of the source operand registers. Second, conditional execution, which applies to all standard ARM instructions, is used mainly for branches in the Thumb set. These differences lead to savings in instruction encoding bit space.

D.12 Concluding Remarks

The ARM processor has achieved significant commercial success in the embedded system market. Its design is licensed to a number of companies that make hand-held communication devices. The 16-bit Thumb version of the instruction set is particularly suitable for low-cost and low-power applications because it allows for compact programs. The flexible Index addressing modes and the Block Transfer instructions of the full 32-bit ISA are useful for many applications. While these two features reflect CISC-style attributes, ARM is generally considered to have a Load/Store RISC-style architecture.


D.13 Solved Problems

This section presents some examples of the types of problems that a student may be asked to solve, and shows how such problems can be solved.

Example D.1   Problem: Assume that there is a byte-string of ASCII-encoded characters stored in memory starting at location STRING. It is terminated by the Carriage-Return character (CR). Write an ARM program to determine the length of the string and store the length in location LENGTH.

Solution: Figure D.20 presents a possible program. The characters in the string are compared to CR (ASCII code &0D), and a counter is incremented until the end of the string is reached.

         LDR    R2, =STRING       Load address of start of string.
         MOV    R3, #0            Load length of string as 0.
         MOV    R4, #&0D          Load Carriage-Return character code.
LOOP     LDRB   R5, [R2], #1      Load next character.
         CMP    R4, R5            Check for Carriage Return and finish,
         BEQ    DONE              or increment length count and go back.
         ADD    R3, R3, #1
         B      LOOP
DONE     STR    R3, LENGTH        Store length of string.

Figure D.20 Program for Example D.1.

Example D.2   Problem: Write an ARM program to find the smallest number in a list of non-negative 32-bit integers. Successive memory word locations SMALL and N in the data area of the program are used to store the smallest number and the size of the list, respectively. These two locations are followed by the list, with the first number stored at location ENTRIES. Include the assembler directives needed to organize the program and data areas as specified. Use a small list of 7 integers as an example.

Solution: The program instructions and data are shown in Figure D.21. Comments are included to explain how the program accomplishes the required task. Note that the method for loading the address ENTRIES into register R2 is the way that the assembler would replace the pseudoinstruction used in earlier program examples, as explained in Section D.5.1.


         AREA   CODE
         ENTRY
         LDR    R2, POINTER       R2 points to list at ENTRIES.
         LDR    R3, [R2, #-4]     Counter R3 initialized to n.
         LDR    R5, [R2]          R5 holds smallest number so far.
LOOP     SUBS   R3, R3, #1        Decrement counter.
         BEQ    DONE              If R3 contains 0, done.
         LDR    R6, [R2, #4]!     Increment list pointer
                                  and get next number.
         CMP    R5, R6            Check if smaller number found,
         BLE    LOOP              branch back if not smaller;
         MOV    R5, R6            otherwise, move it into R5,
         B      LOOP              then branch back.
DONE     STR    R5, SMALL         Store smallest number into SMALL.

         AREA   DATA
POINTER  DCD    ENTRIES           Pointer to start of list.
SMALL    DCD    0                 Location for smallest number.
N        DCD    7                 Number of entries in list.
ENTRIES  DCD    4, 5, 3, 6, 1, 8, 2   List of numbers.
         END

Figure D.21 Program for Example D.2.

Example D.3   Problem: An ARM program is required to convert an n-digit decimal integer into a binary number. The decimal number is given as n ASCII-encoded characters. They are stored in successive byte locations in the memory, starting at location DECIMAL. The converted number is to be stored at location BINARY. Location N contains the value n.

Solution: Consider a four-digit decimal number D = d3d2d1d0. Its value can be given by the expression ((d3 × 10 + d2) × 10 + d1) × 10 + d0. This expression is used as the basis for the conversion technique used in the program in Figure D.22. Each ASCII-encoded character is converted into a Binary-Coded-Decimal (BCD) digit before it is used in the computation. It is assumed that the converted binary value can be represented in no more than 32 bits.
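This expression is Horner's rule in base 10. A Python sketch of the same conversion, including the AND #&0F masking step that forms a BCD digit from an ASCII character (the function name is ours):

```python
# Decimal ASCII string to binary value, following Figure D.22:
# mask each character to its BCD digit, accumulate, multiply by 10.

def ascii_decimal_to_int(chars):
    value = 0
    for i, ch in enumerate(chars):
        digit = ord(ch) & 0x0F        # AND R5, R5, #&0F: form BCD digit
        value += digit                # ADD R4, R4, R5
        if i != len(chars) - 1:       # multiply only between digits
            value *= 10               # MUL R4, R6, R4
    return value

assert ascii_decimal_to_int("1984") == 1984
assert ascii_decimal_to_int("7") == 7
```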

Example D.4   Problem: Consider an array of numbers A(i,j), where i = 0 through n − 1 is the row index, and j = 0 through m − 1 is the column index. The array is stored in memory one row after


         LDR    R2, N             Initialize counter R2 with n.
         LDR    R3, =DECIMAL      R3 points to ASCII digits.
         MOV    R4, #0            R4 will hold the binary number.
         MOV    R6, #10           R6 will hold constant 10.
LOOP     LDRB   R5, [R3], #1      Get next ASCII character
                                  and increment pointer.
         AND    R5, R5, #&0F      Form BCD digit.
         ADD    R4, R4, R5        Add to intermediate result.
         SUBS   R2, R2, #1        Decrement the counter.
         BEQ    DONE              Store result if done.
         MUL    R4, R6, R4        Multiply intermediate result by 10
         B      LOOP              and loop back.
DONE     STR    R4, BINARY        Store result in BINARY.

Figure D.22 Program for Example D.3.

another, with each row occupying m successive word locations. Write an ARM subroutine for adding column x to column y, element by element, and storing the sum elements in column y. The indices x and y are passed to the subroutine through registers R2 and R3. The parameters n and m are passed to the subroutine through registers R4 and R5. The address of element A(0,0) is passed to the subroutine through register R6.

Solution: A possible main program and subroutine for this task are given in Figure D.23. We have assumed that the values x, y, n, and m are stored in memory locations X, Y, N, and M. The address of element A(0,0) is ARRAY. The comments in the program explain how the task is accomplished. It is interesting to compare the number of instructions in the ARM subroutine with the number of instructions in the generic RISC-style subroutine given in Figure 2.36. The shorter ARM subroutine is possible because of the flexibility and features provided by the ARM Index addressing modes and the availability of Block Transfer instructions.

Example D.5   Problem: Assume that memory location BINARY contains a 32-bit pattern. It is desired to display these bits as eight hexadecimal digit characters on a display device that has the interface depicted in Figure 3.3. Write an ARM program that accomplishes this task using program-controlled I/O to display the characters.

Solution: Figure D.24 shows a possible program. First, the hexadecimal digits are converted to ASCII characters by using a table lookup into a 16-entry table. The eight ASCII


Main program

         LDR    R2, X             Load the value x.
         LDR    R3, Y             Load the value y.
         LDR    R4, N             Load the value n.
         LDR    R5, M             Load the value m.
         LDR    R6, =ARRAY        Load address ARRAY
                                  of element A(0,0).
         BL     SUB               Call subroutine.

Subroutine

SUB      STMFD  R13!, {R10, R11, R14}   Save registers R10, R11,
                                  and Link (R14), on stack.
         ADD    R2, R6, R2, LSL #2      Load address of A(0,x) into R2.
         ADD    R3, R6, R3, LSL #2      Load address of A(0,y) into R3.
LOOP     LDR    R10, [R2], R5, LSL #2   Load x-column value into R10
                                  and increment column address.
         LDR    R11, [R3]         Load y-column value into R11.
         ADD    R11, R11, R10     Add column values.
         STR    R11, [R3], R5, LSL #2   Store sum into y-column and
                                  increment column address.
         SUBS   R4, R4, #1        Decrement row counter and
         BGT    LOOP              loop back if not done.
         LDMFD  R13!, {R10, R11, R15}   Restore registers R10, R11,
                                  and program counter (R15),
                                  and return from subroutine.

Figure D.23 Program for Example D.4.

characters to be displayed are stored in a block of memory bytes starting at location HEX. Then, the characters are sent to the display. The comments describe the detailed actions taken in the program. Note the ORR instruction at location LOOP. It implements a right rotation operation on the contents of register R2. Each rotation moves the 4-bit hexadecimal digit that is to be converted next into the low-order 4-bit position of R2. Also note the use of the ADR pseudoinstruction for loading address values into registers. The ADR instruction is explained in Section D.5.1.
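The rotate-and-mask digit extraction used at LOOP in Figure D.24 can be sketched in Python. Here ror32 models the ROR #28 operation on a 32-bit register, and the table contents match the DCB data in the figure; the function names are ours:

```python
# Convert a 32-bit pattern to eight hex characters, most significant
# digit first, using rotate-right-by-28 and AND #&F as in Figure D.24.

TABLE = "0123456789ABCDEF"   # same contents as the DCB conversion table

def ror32(x, n):
    """Rotate a 32-bit value right by n bit positions."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def to_hex_chars(pattern):
    chars = []
    for _ in range(8):
        pattern = ror32(pattern, 28)        # ORR R2, R0, R2, ROR #28
        chars.append(TABLE[pattern & 0xF])  # AND #&F, then table lookup
    return "".join(chars)

assert to_hex_chars(0xA123B456) == "A123B456"
```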


         AREA   CODE
         ENTRY
         MOV    R0, #0            Needed for ORR instruction.
         LDR    R2, BINARY        Load binary pattern.
         ADR    R3, TABLE         R3 points to ASCII table.
         ADR    R4, HEX           R4 points to hexadecimal characters.
         MOV    R5, #8            Load digit count.
LOOP     ORR    R2, R0, R2, ROR #28   Rotate next digit into low-order 4 bits.
         AND    R6, R2, #&F       Extract digit and load into R6.
         LDRB   R7, [R3, +R6]     Load ASCII code for digit.
         STRB   R7, [R4], #1      Store digit in character string,
                                  and increment pointer.
         SUBS   R5, R5, #1        Decrement digit counter.
         BGT    LOOP              Loop back if not done.
DISPLAY  MOV    R5, #8            Load digit count for display routine.
         ADR    R4, HEX           R4 points to hexadecimal characters.
         ADR    R2, DISP_DATA     R2 points to device registers.
SENDCHAR LDRB   R3, [R2, #4]      Check if the display is ready
         TST    R3, #4            by testing DOUT flag.
         BEQ    SENDCHAR
         LDRB   R6, [R4], #1      Get next ASCII character,
                                  and increment pointer.
         STRB   R6, [R2]          Send character to display.
         SUBS   R5, R5, #1        Decrement digit counter.
         BGT    SENDCHAR          Loop until all characters displayed.
         next instruction

         AREA   DATA
BINARY   DCD    &A123B456         Binary pattern.
HEX      SPACE  8                 Space for ASCII-encoded digits.
TABLE    DCB    &30,&31,&32,&33   Table for conversion to ASCII code.
         DCB    &34,&35,&36,&37
         DCB    &38,&39,&41,&42
         DCB    &43,&44,&45,&46

Figure D.24 Program for Example D.5.


Problems

D.1 [E] Assume the following register and memory contents in an ARM computer. Registers R0, R1, R2, R6, and R7 contain the values 1000, 2000, 1016, 20, and 30, respectively. The numbers 1, 2, 3, 4, 5, and 6 are stored in successive word locations starting at memory address 1000. What is the effect of executing each of the following two instruction blocks, starting each time with the given initial values?

(a) LDR  R8, [R0]
    LDR  R9, [R0, #4]
    ADD  R10, R8, R9

(b) STR  R6, [R1, #-4]!
    STR  R7, [R1, #-4]!
    LDR  R8, [R1], #4
    LDR  R9, [R1], #4
    SUB  R10, R8, R9

D.2 [M] Which of the following ARM instructions will cause the assembler to issue a syntax error message? Why?

(a) ADD R2, R2, R2
(b) SUB R0, R1, [R2, #4]
(c) MOV R0, #2_1010101
(d) MOV R0, #257
(e) ADD R0, R1, R11, LSL #8

D.3 [M] Write an ARM program to reverse the order of bits in register R2. For example, if the starting pattern in R2 is 1110 . . . 0100, the final result in R2 should be 0010 . . . 0111. (Hint: Use shift and rotate operations.)

D.4 [M] Consider the program in Figure D.7. List the contents of registers R0, R1, and R2 after each of the first three executions of the BGT instruction. Present the results in a table that has the three registers as column headers. Use three rows to list the contents of the registers after each execution of the BGT instruction. The program data are as given in Figure D.13.

D.5 [M] Write an ARM program that compares the corresponding bytes of two lists of bytes and places the larger byte in a third list. The two lists start at byte locations X and Y, and the larger-byte list starts at LARGER. The length of the lists is stored in memory location N.

D.6 [M] Write an ARM program that generates the first n numbers of the Fibonacci series. In this series, the first two numbers are 0 and 1, and each subsequent number is generated by adding the preceding two numbers. For example, for n = 8, the series is 0, 1, 1, 2, 3, 5, 8, 13. Your program should store the numbers in successive memory word locations starting at MEMLOC. Assume that the value n is stored in location N.

D.7 [M] Write an ARM program to convert a word of text from lowercase to uppercase. The word consists of ASCII characters stored in successive byte locations in the memory, starting at location WORD and ending with a space character. (See Table 1.1 in Chapter 1 for the ASCII code.)


D.8 [M] The list of student marks shown in Figure 2.10 is changed to contain j test scores for each student. Assume that there are n students. Write an ARM program for computing the sums of the scores on each test and store these sums in the memory word locations at addresses SUM, SUM + 4, SUM + 8, . . . . The number of tests, j, is larger than the number of registers in the processor, so the type of program shown in Figure D.8 for the 3-test case cannot be used. Use two nested loops. The inner loop should accumulate the sum for a particular test, and the outer loop should run over the number of tests, j. Assume that j is stored in memory location J, placed just before location N in Figure 2.10.

D.9 [E] Write an ARM program that computes the expression SUM = 580 + 68400 + 80000.

D.10 [E] Write an ARM program that computes the expression ANSWER = A × B + C × D.

D.11 [M] Write an ARM program that finds the number of negative integers in a list of n 32-bit integers and stores the count in location NEGNUM. The value n is stored in memory location N, and the first integer in the list is stored in location NUMBERS. Include the necessary assembler directives and a sample list that contains six numbers, some of which are negative.

D.12 [M] Write an ARM program for the byte-sorting task described in Example 2.5 in Chapter 2.

D.13 [M] Write an ARM program to solve Problem 2.22 in Chapter 2.

D.14 [D] Write an ARM program to solve Problem 2.24 in Chapter 2.

D.15 [M] Write an ARM program to solve Problem 2.25 in Chapter 2.

D.16 [M] Write an ARM program to solve Problem 2.26 in Chapter 2.

D.17 [M] Write an ARM program to solve Problem 2.27 in Chapter 2.

D.18 [M] Write an ARM program to solve Problem 2.28 in Chapter 2.

D.19 [M] Write an ARM program to solve Problem 2.29 in Chapter 2.

D.20 [M] Write an ARM program to solve Problem 2.30 in Chapter 2.

D.21 [M] Write an ARM program to solve Problem 2.31 in Chapter 2.

D.22 [D] Write an ARM program to solve Problem 2.32 in Chapter 2.

D.23 [D] Write an ARM program to solve Problem 2.33 in Chapter 2.

D.24 [M] Write an ARM program that reads n characters from a keyboard and echoes them back to a display after pushing them onto a user stack as they are read. Use register R6 as the stack pointer. The count value n is contained in memory word location N.

D.25 [M] Assume that the average time taken to fetch and execute an instruction in the program in Figure D.17 is 5 nanoseconds. If keyboard characters are entered at the rate of 10 per second, approximately how many times is the BEQ READ instruction executed per character entered? Assume that the time taken to display each character is much less than the time between the entry of successive characters at the keyboard.


D.26 [M] Rewrite the program in Figure D.17 in the form of a main program that calls a subroutine named GETCHAR to read a single character and calls another subroutine named PUTCHAR to display a single character. The address KBD_STATUS is passed to GETCHAR in register R1, and the main program expects to get the character passed back in register R3. The address DISP_STATUS and the character to be displayed are passed to PUTCHAR in registers R2 and R3, respectively. Any other registers used by either subroutine must be saved and restored by the subroutine using a stack whose pointer is register R13. Storing the characters in memory and checking for the end-of-line character CR is to be done in the main program.

D.27 [M] Repeat Problem D.26 using the stack to pass parameters.

D.28 [M] Write an ARM program to accept three decimal digits from a keyboard. Each digit is represented in the ASCII code (see Table 1.1 in Chapter 1). Assume that these three digits represent a decimal integer in the range 0 to 999. Convert the integer into a binary number representation. The high-order digit is received first. To aid in this conversion, two tables of words are stored in the memory. Each table has 10 entries. The first table, starting at word location TENS, contains the binary representations for the decimal values 0, 10, 20, . . . , 90. The second table starts at word location HUNDREDS and contains the decimal values 0, 100, 200, . . . , 900 in binary representation.

D.29 [M] The decimal-to-binary conversion program of Problem D.28 is to be implemented using two nested subroutines. The main program that calls the first subroutine passes two parameters by pushing them onto the stack whose pointer register is R13. The first parameter is the address of a 3-byte memory buffer area for storing the input decimal-digit characters. The second parameter is the address of the location where the converted binary value is to be stored. The first subroutine reads the three characters from the keyboard, then calls the second subroutine to perform the conversion. The necessary parameters are passed to this subroutine via the processor registers. Both subroutines must save the contents of any registers that they use on the stack.

(a) Write the two subroutines for the ARM processor.

(b) Give the contents of the stack immediately after the execution of the instruction that calls the second subroutine.

D.30 [M] Write an ARM program that displays the contents of 10 bytes of the main memory in hexadecimal format on a line of a video display. The byte string starts at location LOC in the memory. Each byte has to be displayed as two hex characters. The displayed contents of successive bytes should be separated by a space.

D.31 [M] Assume that a memory location BINARY contains a 16-bit pattern. It is desired to display these bits as a string of 0s and 1s on a display device that has the interface depicted in Figure 3.3. Write an ARM program that accomplishes this task.

D.32 [M] Using the seven-segment display in Figure 3.17 and the timer circuit in Figure 3.14, write an ARM program that flashes decimal digits in the repeating sequence 0, 1, 2, . . . , 9, 0, . . . . Each digit is to be displayed for one second. Assume that the counter in the timer circuit is driven by a 100-MHz clock.


D.33 [D] Using two 7-segment displays of the type shown in Figure 3.17, and the timer circuit in Figure 3.14, write an ARM program that flashes numbers in the repeating sequence 0, 1, 2, . . . , 98, 99, 0, . . . . Each number is to be displayed for one second. Assume that the counter in the timer circuit is driven by a 100-MHz clock.

D.34 [D] Write an ARM program that computes real clock time and displays the time in hours (0 to 23) and minutes (0 to 59). The display consists of four 7-segment display devices of the type shown in Figure 3.17. A timer circuit that has the interface given in Figure 3.14 is available. Its counter is driven by a 100-MHz clock.

D.35 [M] Write an ARM program for the problem described in Example 3.5 in Chapter 3.

D.36 [M] Write an ARM program for the problem described in Example 3.6 in Chapter 3.

D.37 [M] Write an ARM program to solve Problem 3.19 in Chapter 3.

D.38 [M] Write an ARM program to solve Problem 3.21 in Chapter 3.

D.39 [M] Write an ARM program to solve Problem 3.23 in Chapter 3.

D.40 [M] Write an ARM program to solve Problem 3.25 in Chapter 3.

References

1. ARM Limited, ARM7TDMI Technical Reference Manual—revision r4p1, Document number ARM DDI 0210C, November 2004. Available at http://www.arm.com.

2. Steve Furber, ARM System-on-chip Architecture, 2nd Ed., Addison Wesley, Harlow, England, 2000.

3. William Hohl, ARM Assembly Language: Fundamentals and Techniques, CRC Press, 2009.

4. ARM Limited, The ARM Instruction Set—V1.0, ARM University Program.


Appendix E

The Intel IA-32 Architecture

Appendix Objectives

In this appendix you will learn about the features of the Intel IA-32 architecture:

• Memory organization and register structure

• Addressing modes and types of instructions

• Input/output capability

• Scalar floating-point operations

• Multimedia operations

• Vector floating-point operations


The Intel Corporation uses the generic name Intel Architecture (IA) for the instruction sets of processors in its product line. We will describe the IA-32 instruction set for processors that operate with 32-bit memory addresses and 32-bit data operands. The IA-32 instruction set is very large. In addition to providing typical integer and floating-point instructions, it includes specialized instructions for multimedia applications and for vector data processing. We will restrict our attention to the basic instructions and addressing modes. Reference [1] provides a comprehensive overview of the IA-32 architecture, and the Intel website (http://www.intel.com) provides additional technical documentation with full details.

E.1 Memory Organization

In the IA-32 architecture, memory is byte-addressable using 32-bit addresses, and instructions typically operate on data operands of 8 or 32 bits. These operand sizes are called byte and doubleword in Intel terminology. A 16-bit operand was called a word in earlier 16-bit Intel processors. There is also a larger 64-bit operand size called a quadword for double-precision floating-point numbers and packed integer data. Little-endian addressing is used, as described in Section 2.1.2. Multiple-byte data operands may start at any byte address location. They need not be aligned with any particular address boundaries in the memory.
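As a quick illustration of little-endian byte ordering, the following sketch (in Python, used here only to model the memory layout, not IA-32 code) shows how the bytes of a 32-bit doubleword appear at increasing byte addresses:

```python
import struct

def doubleword_bytes(value):
    """Return the bytes of a 32-bit doubleword as they would appear in
    little-endian memory, at addresses A, A+1, A+2, A+3."""
    return list(struct.pack("<I", value))

# The least significant byte occupies the lowest address.
print([hex(b) for b in doubleword_bytes(0x12345678)])
```

For the doubleword 0x12345678, the byte 0x78 is stored at the lowest address and 0x12 at the highest.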

E.2 Register Structure

The processor registers are shown in Figure E.1. There are eight 32-bit general-purpose registers, which can hold either integer data operands or addressing information. Rather than being numbered consecutively, they are identified by unique names that are described later in this section. Eight additional registers are available for floating-point instructions. They are discussed in Section E.9. These registers are also used by the multimedia instructions described in Section E.10. There is another set of registers that is not shown in Figure E.1; these registers are used by vector-processing instructions discussed in Section E.11.

The IA-32 architecture has different models for accessing the memory. The segmented memory model associates different areas of the memory, called segments, with different usages. The code segment holds the instructions of a program. The stack segment contains the processor stack, and four data segments are provided for holding data operands. The six segment registers shown in Figure E.1 contain selector values that identify where these segments begin in the memory address space. The detailed function of these registers is not discussed in this appendix. Instead, the flat memory model of the IA-32 architecture is assumed, where a 32-bit address can access a memory location anywhere in the code, processor stack, or data areas. In this case, the segment registers are initialized with selector values that point to address 0 in the memory.

The two registers shown at the bottom of Figure E.1 are the Instruction Pointer, which serves as the program counter and contains the address of the next instruction to be executed, and the status register.


[Figure E.1 IA-32 register structure: eight 32-bit general-purpose registers; eight 80-bit floating-point registers, also used for multimedia data; six 16-bit segment registers (CS for the code segment, SS for the stack segment, and DS, ES, FS, GS for the data segments); the 32-bit instruction pointer; and the 32-bit status register containing the flags CF (carry), ZF (zero), SF (sign), OF (overflow), TF (trap), IF (interrupt enable), and the IOPL (input/output privilege level) field.]

The status register holds the condition code flags (CF, ZF, SF, OF), which contain information about the results of arithmetic operations. The program execution mode bits (IOPL, IF, TF) are associated with input/output operations and interrupts.


[Figure E.2 Compatibility of the IA-32 register structure with earlier Intel processor register structures: the data registers EAX, ECX, EDX, and EBX contain the 16-bit registers AX, CX, DX, and BX, each divided into high and low bytes (AH/AL, CH/CL, DH/DL, BH/BL); the pointer registers ESP and EBP contain SP and BP; the index registers ESI and EDI contain SI and DI; the instruction pointer EIP contains IP; and the status register EFLAGS contains FLAGS.]

The IA-32 general-purpose registers allow for compatibility with the registers of earlier 8-bit and 16-bit Intel processors. In those processors, there are some restrictions regarding the use of certain registers. Figure E.2 shows the association between the IA-32 registers and the registers in earlier processors. The eight general-purpose registers are grouped into three different types: data registers for holding operands, pointer registers for holding addresses, and index registers for holding address indices. The pointer and index registers are used to determine the effective address of a memory operand.

In Intel’s original 8-bit processors, the data registers were called A, B, C, and D. In later 16-bit processors, these registers were labeled AX, BX, CX, and DX. The high- and low-order bytes in each register are identified by suffixes H and L. For example, the two bytes in register AX are referred to as AH and AL. In IA-32 processors, the prefix E is used to identify the corresponding extended 32-bit registers: EAX, EBX, ECX, and EDX. The E-prefix labeling is also used for the other 32-bit registers shown in Figure E.2. They are the extended versions of the corresponding 16-bit registers used in earlier processors.

This register labeling is used in Intel technical documents [1] and in other descriptions of Intel processors. The reason that the historical labeling has been retained is that Intel has maintained upward compatibility over its processor line. That is, programs in machine language representation developed for the earlier 16-bit processors will run correctly on


current IA-32 processors without change if the processor state is set to do so. We will use the E-prefix register labeling in giving examples of assembly language programs because these mnemonics are used in current versions of the assembly language for IA-32 processors. The AL, BL, etc. labeling will also be used for byte operands when they are held in the low-order eight bits of the corresponding 32-bit register.

E.3 Addressing Modes

The IA-32 architecture has a large and flexible set of addressing modes. They are designed to access individual data items, or data items that are members of an ordered list that begins at a specified memory address. The basic addressing modes are the same as those available in most processors, as described in Section 2.4. They are: Immediate, Absolute, Register, and Register indirect. Intel uses the term Direct for the Absolute mode, so we will do the same here. There are also several addressing modes that provide more flexibility in accessing data operands in the memory. The most flexible mode described in Section 2.4 is the Index mode that has the general notation X(Ri,Rj). The effective address of the operand, EA, is calculated as

EA = [Ri] + [Rj] + X

where Ri and Rj are general-purpose registers and X is a constant. Registers Ri and Rj are called base and index registers, respectively, and the constant X is called a displacement. The IA-32 addressing modes include this mode and simpler variations of it.

The full set of IA-32 addressing modes is defined as follows:

Immediate mode—The operand is contained in the instruction. It is a signed 8-bit or 32-bit number, with the length being specified by a bit in the OP code of the instruction. This bit is 0 for the short version and 1 for the long version.

Direct mode—The memory address of the operand is given by a 32-bit value in the instruction.

Register mode—The operand is contained in one of the eight general-purpose registers specified in the instruction.

Register indirect mode—The memory address of the operand is contained in one of the eight general-purpose registers specified in the instruction.

Base with displacement mode—An 8-bit or 32-bit signed displacement and one of the eight general-purpose registers to be used as a base register are specified in the instruction. The effective address of the operand is the sum of the contents of the base register and the displacement.

Index with displacement mode—A 32-bit signed displacement, one of the eight general-purpose registers to be used as an index register, and a scale factor of 1, 2, 4, or 8, are specified in the instruction. To obtain the effective address of the operand, the


contents of the index register are multiplied by the scale factor and then added to the displacement.

Base with index mode—Two of the eight general-purpose registers and a scale factor of 1, 2, 4, or 8, are specified in the instruction. The registers are used as base and index registers. The effective address of the operand is determined by first multiplying the contents of the index register by the scale factor and then adding the result to the contents of the base register.

Base with index and displacement mode—An 8-bit or 32-bit signed displacement, two of the eight general-purpose registers, and a scale factor of 1, 2, 4, or 8, are specified in the instruction. The registers are used as base and index registers. The effective address of the operand is determined by first multiplying the contents of the index register by the scale factor and then adding the result to the contents of the base register and the displacement.
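All of these effective-address calculations are special cases of a single formula. A small Python sketch (a model of the arithmetic only, not an assembler; the function name and defaults are ours) makes this explicit:

```python
def effective_address(base=0, index=0, scale=1, disp=0):
    """EA = [base register] + [index register] * scale + displacement.
    Leaving unused parts at 0 (or scale at 1) yields the simpler modes."""
    assert scale in (1, 2, 4, 8), "IA-32 allows only these scale factors"
    return base + index * scale + disp

# Base with displacement: [EBP + 60] with EBP containing 1000
print(effective_address(base=1000, disp=60))                       # 1060
# Base with index and displacement: [EBP + ESI*4 + 200], EBP=1000, ESI=40
print(effective_address(base=1000, index=40, scale=4, disp=200))   # 1360
```

The two printed addresses correspond to the worked examples later in this section.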

The IA-32 addressing modes and the way that they are expressed in assembly language are given in Table E.1. The calculation of the effective address of the operand is also shown in the table. As indicated in the footnotes, register ESP cannot be used as an index register. This is because it is used as the processor stack pointer.

Table E.1 IA-32 addressing modes.

Name                    Assembler syntax              Addressing function

Immediate               Value                         Operand = Value
Direct                  Location                      EA = Location
Register                Reg                           EA = Reg
                                                      that is, Operand = [Reg]
Register indirect       [Reg]                         EA = [Reg]
Base with               [Reg + Disp]                  EA = [Reg] + Disp
displacement
Index with              [Reg * S + Disp]              EA = [Reg] × S + Disp
displacement
Base with index         [Reg1 + Reg2 * S]             EA = [Reg1] + [Reg2] × S
Base with index         [Reg1 + Reg2 * S + Disp]      EA = [Reg1] + [Reg2] × S + Disp
and displacement

Value = an 8- or 32-bit signed number
Location = a 32-bit address
Reg, Reg1, Reg2 = one of the general-purpose registers EAX, EBX, ECX, EDX, ESP, EBP, ESI, EDI, with the exception that ESP cannot be used as an index register.
Disp = an 8- or 32-bit signed number, except that in the Index with displacement mode it can only be 32 bits.
S = a scale factor of 1, 2, 4, or 8


Instructions have zero, one, or two operands. In two-operand instructions, the source (src) and destination (dst) operands are specified in assembly language in the order

OP dst, src

This ordering is the same as in Chapter 2. It is convenient to use the Move instruction to illustrate the IA-32 addressing modes

and their notation in assembly language. The instruction

MOV EAX, 25

uses the Immediate addressing mode for the source operand to move the decimal value 25 into the destination register EAX, which is specified with the Register addressing mode.

When a numeric constant appears alone as an operand, it is assumed to represent an immediate value. Numeric constants may be expressed in decimal format using the digits 0 through 9. Depending on the assembler used, hexadecimal numbers are specified using the prefix 0x or the suffix H. In the latter case, numbers that begin with digits A to F also require a prefix 0 so that the assembler can distinguish a hexadecimal number from a label. Some assemblers also allow binary numbers to be specified using the suffix B.

Symbolic names may also be used as operands. If the name LOCATION has been defined as an address label, the instruction

MOV EAX, LOCATION

implicitly uses the Direct addressing mode to move the doubleword at memory address LOCATION into register EAX. The Direct addressing mode can also be made explicit. The instruction

MOV EAX, DWORD PTR LOCATION

uses the keywords DWORD PTR to indicate that the label LOCATION should be interpreted as the address of a 32-bit operand.

When it is necessary to treat an address label as an immediate operand, the keyword OFFSET is used. For example, the instruction

MOV EBX, OFFSET LOCATION

moves the value of the address label LOCATION into the EBX register using the Immediate addressing mode.

Once an address is loaded into a register, the Register indirect mode can be used to access the operand in memory. The instruction

MOV EAX, [EBX]

moves the contents of the memory location whose address is contained in register EBX into register EAX.

The above examples illustrate the basic addressing modes: Immediate, Direct, Register, and Register indirect. The remaining four addressing modes provide more flexibility in accessing data operands in the memory.


The Base with displacement mode is illustrated in Figure E.3a. Register EBP is used as the base register. A doubleword operand at address 1060, which is 60 byte locations away from the base address of 1000, can be moved into register EAX by the instruction

MOV EAX, [EBP + 60]

Instructions can operate on byte operands as well as doubleword operands. For example, still assuming that the base register EBP contains the address 1000, the byte operand at address 1010 can be loaded into the low-order byte position in the EAX register by the instruction

MOV AL, [EBP + 10]

The assembler selects the version of the Move OP code for byte data because the destination, AL, is the low-order byte position of the EAX register.

The addressing mode that provides the most flexibility is the Base with index and displacement mode. An example is shown in Figure E.3b, using EBP and ESI as the base and index registers. This example shows how the mode is used to access a particular doubleword operand in a list of doubleword operands. The list begins at a displacement of 200 away from the base address 1000. Using a scale factor of 4 on the index register contents, successive doubleword operands at addresses 1200, 1204, 1208, . . . can be accessed by using the sequence of indices 0, 1, 2, . . . in the index register ESI. In the example shown in the figure, the doubleword at address 1360 (that is, 1000 + 200 + 4 × 40) is accessed when the index register contains 40. This operand can be loaded into register EAX by the instruction

MOV EAX, [EBP + ESI * 4 + 200]

The use of a scale factor of 4 in this addressing mode makes it easy to access successive doubleword operands of the list in a program loop by simply incrementing the index register by 1 on each pass through the loop. Having discussed these two modes in some detail, the closely related Index with displacement mode and Base with index mode should be easy to understand.
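The loop pattern just described can be mimicked in Python, with a bytearray standing in for main memory (the addresses follow Figure E.3b; the data values are made up for the illustration):

```python
import struct

memory = bytearray(2048)          # a stand-in for main memory
ebp = 1000                        # base register
# Store five doublewords in the list starting 200 bytes past the base.
for i, value in enumerate([11, 22, 33, 44, 55]):
    struct.pack_into("<I", memory, ebp + 200 + 4 * i, value)

# Sum the list the way a loop using [EBP + ESI*4 + 200] would:
total = 0
for esi in range(5):              # index register stepped by 1, not by 4
    ea = ebp + esi * 4 + 200      # scale factor 4 spaces the doublewords
    total += struct.unpack_from("<I", memory, ea)[0]
print(total)                      # 165
```

Stepping the index by 1 while the scale factor handles the 4-byte spacing is exactly the convenience the text points out.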

Before leaving this discussion of addressing modes, it is useful to comment on two of the modes described in Table E.1. It may appear that the Base with displacement mode is redundant because the same effect can be obtained by using the Index with displacement mode with a scale factor of 1. But the former mode is useful because it is encoded with one less byte. In addition, the displacement size in the Index with displacement mode can only be 32 bits, whereas it can also be 8 bits for the Base with displacement mode.

E.4 Instructions

The IA-32 instruction set is extensive. It is encoded in a variable-length instruction format that does not have a fully regular layout. Most instructions have either one or two operands. In the two-operand case, only one of the operands can be in the memory. The other must either be in a processor register or be an immediate value in the instruction. Instructions are provided for moving data between the memory and the processor registers,


[Figure E.3 Examples of addressing modes in the IA-32 architecture: (a) Base with displacement mode, expressed as [EBP + 60], where base register EBP contains 1000 and the doubleword operand is at main memory address 1060, so that the operand address (EA) = [EBP] + 60; (b) Base with index and displacement mode, expressed as [EBP + ESI * 4 + 200], where EBP contains 1000, index register ESI contains 40, and the scale factor is 4, so that the operand in the list of 4-byte (doubleword) data items starting at address 1200 is at address 1360, with EA = [EBP] + [ESI] × 4 + 200.]

performing arithmetic operations, and performing logical and shift/rotate operations. Jump instructions and subroutine call/return instructions are included. Push and pop operations for manipulating the processor stack are also directly supported in the instruction set.


[Figure E.4 IA-32 instruction format: an OP-code field of 1 or 2 bytes, an addressing mode field of 1 or 2 bytes, a displacement field of 1 or 4 bytes, and an immediate field of 1 or 4 bytes.]

E.4.1 Machine Instruction Format

The general format for machine instructions is shown in Figure E.4. The instructions are variable in length, ranging from 1 to 12 bytes and consisting of up to four fields. The OP-code field consists of one or two bytes, with most instructions requiring only one byte. The addressing mode information is contained in one or two bytes immediately following the OP code.

For instructions that involve the use of only one register in generating the effective address of an operand in memory, only one byte is needed in the addressing mode field. Two bytes are needed for encoding the last two addressing modes in Table E.1. Those modes use two registers to generate the effective address of a memory operand.

If a displacement value is needed in computing an effective address for a memory operand, it is encoded into either one or four bytes in a field that immediately follows the addressing mode field. If one of the operands is an immediate value, then it is placed in the last field of an instruction and it occupies either one or four bytes.

Some simple instructions, such as those that increment or decrement a register, occupy only one byte. For example, the instruction

INC EDI

increments the contents of register EDI. In this case, the register operand is specified by a 3-bit code in the OP-code byte. However, for most instructions and addressing modes, the registers used are specified in the addressing mode field.

E.4.2 Assembly-Language Notation

Some aspects of assembly-language notation have been introduced with the addressing modes in Section E.3. This section provides a summary of the notation used, with additional details on addresses and immediate values, operand sizes, and the use of upper-case characters in assembly language.

The keywords DWORD PTR preceding a name of an operand indicate that the name is to be interpreted as an address for a 32-bit operand. Similarly, the keywords BYTE PTR preceding a name specify that the name should be interpreted as the address of an 8-bit operand. On the other hand, the keyword OFFSET preceding a name indicates that the name is to be interpreted as an immediate value.

Each assembly-language instruction must contain sufficient information for the assembler to determine the operand size. In the case of the Register addressing mode, the register


name provides the necessary information. For example, register EAX given as an operand in an instruction implies a size of 32 bits, whereas register AL implies an operand size of 8 bits. The assembler generates an OP code that corresponds to the implied operand size.

In cases that do not involve the Register addressing mode, the assembler requires additional information. For example, a one-operand instruction may specify a memory operand using an indirect or displacement addressing mode. To specify the operand size, it is necessary to include the keywords DWORD PTR or BYTE PTR.

Many IA-32 assemblers are case-insensitive for instruction mnemonics and register names. The Intel technical documentation uses upper-case characters consistently [1]. To conform to this presentation style, we will use upper-case characters for all instruction mnemonics and register names.

E.4.3 Move Instruction

The MOV instruction transfers data between memory or I/O interfaces and the processor registers. The direction of the transfer is from source to destination. The condition code flags in the status register are not affected by the execution of a MOV instruction.

The examples in Section E.3 show how MOV instructions transfer data from memory to registers. Register contents may also be transferred to memory or to another register. The instruction

MOV LOCATION, ECX

moves the doubleword in register ECX into the memory location at address LOCATION. The instruction

MOV EBP, EDI

moves the doubleword in register EDI to register EBP. The contents in register EDI are not changed.

The MOV instruction cannot be used with two memory operands, but it can be used to move an immediate value into a memory location, as in

MOV DWORD PTR [EAX + 16], 100

Note that the assembler requires the keywords DWORD PTR (or BYTE PTR) to specify the operand size in this instruction.

E.4.4 Load-Effective-Address Instruction

Section E.3 describes how the MOV instruction can be used to load an address into a register by using the keyword OFFSET. Alternatively, the LEA (Load-effective-address) instruction may be used. For example, if the name LOCATION is defined as an address label, the instruction

LEA EAX, LOCATION


has exactly the same effect as the instruction

MOV EAX, OFFSET LOCATION

The LEA instruction can be used to load an effective address that is computed at execution time. For example, suppose it is desired to use register EBX as a pointer to a data operand in memory. Assume that the desired operand is an element of an array, located at an offset of 12 bytes from the start of the array. If register EBP contains the starting address of the array, the instruction

LEA EBX, [EBP + 12]

computes the desired effective address and places it in register EBX. The operand can then be accessed by a Move or other instruction using the Register indirect mode with EBX.

E.4.5 Arithmetic Instructions

This category of instructions includes arithmetic operations as well as comparison and negation operations. The operands can be in memory, in registers, or specified as immediate values (for two-operand instructions). The operand size may be doubleword or byte.

Addition, Subtraction, Comparison, and Negation

Two-operand arithmetic instructions are:

• ADD (Add)
• ADC (Add with carry; for multiple-precision arithmetic)
• SUB (Subtract)
• SBB (Subtract with borrow; for multiple-precision arithmetic)
• CMP (Compare; value of destination operand remains unchanged)

These instructions affect all of the condition code flags based on the result of the operation that is performed. The instruction

ADD EAX, EBX

performs the 32-bit operation

EAX ← [EAX] + [EBX]

The instruction

CMP [EBX + 10], AL

performs the 8-bit operation

[[EBX] + 10] − [AL]

Using register AL implies an operand size of one byte. The condition code flags are set based on whether the subtraction caused overflow or a carry, and whether the result is negative or zero. The result of the subtraction is discarded.
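The flag-setting behavior of CMP (and SUB) can be modeled in a few lines of Python; this is our sketch of the semantics stated above, not Intel's definition verbatim:

```python
def cmp_flags(dst, src, width=32):
    """Flags produced by CMP dst, src, i.e., by computing dst - src
    and discarding the result."""
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    result = (dst - src) & mask
    return {
        "CF": int((dst & mask) < (src & mask)),   # unsigned borrow
        "ZF": int(result == 0),
        "SF": int(bool(result & sign)),
        # signed overflow: operand signs differ and result sign != dst sign
        "OF": int(bool((dst ^ src) & (dst ^ result) & sign)),
    }

print(cmp_flags(5, 5))   # equal: ZF set, other flags clear
print(cmp_flags(3, 5))   # 3 - 5 is negative: SF and CF set, OF clear
```

The conditional Jump instructions described in Section E.4.6 test exactly these flag combinations.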


One-operand arithmetic instructions are:

• INC (Increment)
• DEC (Decrement)
• NEG (Negate)

The NEG instruction affects all condition code flags, but the INC and DEC instructions do not affect the CF flag. These instructions must include keywords to specify the operand size unless the Register mode is used for the operand. The instruction

INC DWORD PTR [EDX]

increments the doubleword at the memory location whose address is contained in register EDX.

Multiplication

The signed integer multiplication instruction, IMUL, performs 32-bit multiplication.

Depending on the form of the instruction that is used, the destination may be implicit and the 64-bit product may be truncated to 32 bits.

One form of this instruction is

IMUL src

which implicitly uses the EAX register as the multiplicand. The multiplier specified by src can be in a register or in the memory. The full 64-bit product is placed in registers EDX (high-order half) and EAX (low-order half).

A second form of this instruction is

IMUL REG, src

The destination operand, REG, must be one of the eight general-purpose registers. The source operand can be in a register or in the memory. The product is truncated to 32 bits before it is placed in the destination register REG.

For both forms, the CF and OF flags are set if there are any 1s (including sign bits) in the high-order half of the 64-bit product. Otherwise, the CF and OF flags are cleared. The other flags are undefined.
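The EDX:EAX split of the 64-bit product in the one-operand form can be sketched in Python (a behavioral model only; the helper names are ours):

```python
def to_signed32(x):
    """Interpret a 32-bit pattern as a signed integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def imul(eax, src):
    """Model of the one-operand IMUL src: signed 32 x 32 multiply;
    returns (EDX, EAX) holding the high- and low-order halves of the
    64-bit product."""
    product = (to_signed32(eax) * to_signed32(src)) & 0xFFFFFFFFFFFFFFFF
    return (product >> 32) & 0xFFFFFFFF, product & 0xFFFFFFFF

edx, eax = imul(100000, 100000)   # product 10**10 needs more than 32 bits
print(hex(edx), hex(eax))
```

For this example the product 10,000,000,000 does not fit in 32 bits, so EDX receives a nonzero high-order half.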

Division

The integer divide instruction, IDIV, operates on a 64-bit dividend and a 32-bit divisor

to generate a 32-bit quotient and a 32-bit remainder. The format of the instruction is

IDIV src

The source operand is the divisor. The 64-bit dividend is formed by the contents of register EDX (high-order half) and register EAX (low-order half). After performing the division, the quotient is placed in EAX and the remainder is placed in EDX. All of the condition code flags are undefined. Division by zero causes an exception.

If the dividend value is represented by 32 bits, it must first be placed in EAX, and then sign-extended to the required 64-bit operand size in registers EAX and EDX. This is done


by the instruction CDQ (convert doubleword to quadword), which has no operands because the source and destination are implicitly registers EAX and EDX, respectively.
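The combined effect of CDQ and IDIV can be modeled as follows (our Python sketch of the semantics; note that IDIV truncates the quotient toward zero, so the remainder takes the sign of the dividend):

```python
def to_signed32(x):
    """Interpret a 32-bit pattern as a signed integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def cdq(eax):
    """CDQ: sign-extend EAX into EDX:EAX; returns (EDX, EAX)."""
    return (0xFFFFFFFF if eax & 0x80000000 else 0, eax & 0xFFFFFFFF)

def idiv(edx, eax, src):
    """Model of IDIV src: divide the signed 64-bit value EDX:EAX by the
    signed 32-bit src; returns (quotient in EAX, remainder in EDX)."""
    dividend = (edx << 32) | (eax & 0xFFFFFFFF)
    if dividend & (1 << 63):
        dividend -= 1 << 64
    divisor = to_signed32(src)
    q = abs(dividend) // abs(divisor)     # magnitude, then attach sign:
    if (dividend < 0) != (divisor < 0):   # truncation toward zero
        q = -q
    r = dividend - q * divisor            # remainder has dividend's sign
    return q & 0xFFFFFFFF, r & 0xFFFFFFFF

edx, eax = cdq(-7 & 0xFFFFFFFF)           # 32-bit dividend -7
quot, rem = idiv(edx, eax, 2)
print(to_signed32(quot), to_signed32(rem))   # -3 -1
```

Dividing -7 by 2 gives quotient -3 and remainder -1, illustrating the truncation toward zero.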

E.4.6 Jump and Loop Instructions

In IA-32 terminology, all branch instructions are called Jumps. Conditional and unconditional Jump instructions are provided. Such instructions can be used to implement loops. Often, a counter variable is decremented in each pass through a loop, and a conditional Jump instruction tests whether the count is still larger than zero to perform more passes through the loop. Because this approach is common, a special Loop instruction is also provided to combine the decrement and conditional Jump operations.

Conditional Jump Instructions and Condition Code Flags
The conditional Jump instructions test the four condition code flags in the status register.

The instruction

JG LABEL

is an example of a conditional Jump instruction. The condition is greater-than, as indicated by the G suffix in the OP code. Table E.2 summarizes the conditional Jump instructions and the corresponding combinations of the condition code flags that are tested. The Jump instructions that test the sign flag (SF) are used when the operands of a preceding arithmetic or comparison instruction are signed numbers. For example, the JG instruction tests for the greater-than condition when signed numbers are involved, and it considers the SF flag. When unsigned numbers are involved, the JA (jump-above) instruction tests for the greater-than condition without considering the SF flag.

Table E.2 IA-32 conditional jump instructions.

Mnemonic   Condition name                     Condition test

JS         Sign (negative)                    SF = 1
JNS        No sign (positive or zero)         SF = 0
JE/JZ      Equal/Zero                         ZF = 1
JNE/JNZ    Not equal/Not zero                 ZF = 0
JO         Overflow                           OF = 1
JNO        No overflow                        OF = 0
JC/JB      Carry/Unsigned below               CF = 1
JNC/JAE    No carry/Unsigned above or equal   CF = 0
JA         Unsigned above                     CF ∨ ZF = 0
JBE        Unsigned below or equal            CF ∨ ZF = 1
JGE        Signed greater than or equal       SF ⊕ OF = 0
JL         Signed less than                   SF ⊕ OF = 1
JG         Signed greater than                ZF ∨ (SF ⊕ OF) = 0
JLE        Signed less than or equal          ZF ∨ (SF ⊕ OF) = 1
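The flag combinations in Table E.2 can be checked mechanically. The sketch below (function names are ours) derives ZF, SF, CF, and OF from a CMP-style subtraction on 32-bit patterns and evaluates the JG and JA conditions, illustrating why signed and unsigned comparisons need different Jump instructions.

```python
SIGN32 = 0x80000000
MASK32 = 0xFFFFFFFF

def cmp_flags(a, b):
    """Flags set by CMP a, b (which computes a - b); a and b are
    32-bit unsigned patterns."""
    diff = (a - b) & MASK32
    return {
        "ZF": int(diff == 0),
        "SF": int(bool(diff & SIGN32)),
        "CF": int(a < b),  # borrow out of the unsigned subtraction
        "OF": int(bool((a ^ b) & (a ^ diff) & SIGN32)),  # signed overflow
    }

def jg_taken(f):
    """JG: signed greater than, i.e. ZF | (SF ^ OF) == 0."""
    return (f["ZF"] | (f["SF"] ^ f["OF"])) == 0

def ja_taken(f):
    """JA: unsigned above, i.e. CF | ZF == 0."""
    return (f["CF"] | f["ZF"]) == 0
```

Comparing 0xFFFFFFFF with 1, JA is taken (unsigned 4294967295 is above 1) but JG is not (signed −1 is less than 1).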


When the assembler generates machine code, a conditional Jump instruction is encoded with an offset relative to the address of the instruction that immediately follows the Jump instruction. This address reflects the updated contents of the Instruction Pointer after the Jump instruction is fetched. If the offset is in the range −128 through +127, then a single byte is sufficient, and the total number of bytes used to encode a conditional Jump instruction is two, including the OP-code byte. When the distance to the jump target exceeds this range, a four-byte offset is used.

Unconditional Jump Instruction
An unconditional Jump instruction, JMP, causes a branch to the instruction at the target address. In addition to using short (one-byte) or long (four-byte) relative signed offsets to determine the target address, as is done in conditional Jump instructions, the JMP instruction also allows the use of other addressing modes. This flexibility in generating the target address can be very useful. Consider the Case statement that is found in many high-level languages. It is used to perform one of a number of alternative computations at some point in a program. Each of these alternatives is referred to as a case. Suppose that for each case, a routine is defined to perform the corresponding computation. Suppose also that the 4-byte starting addresses of the routines are stored in a table in the memory, beginning at a location labeled JUMPTABLE. The cases are numbered with indices 0, 1, 2, . . . . At execution time, the index of the selected case is loaded into index register ESI. A jump to the routine for the selected case is performed by executing the instruction

JMP [JUMPTABLE + ESI * 4]

which uses the Index with displacement addressing mode.
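The same dispatch idea can be sketched in Python (all names below are ours): a table of routine "addresses" (here, function references) indexed by the case number plays the role of JUMPTABLE, and the indexed lookup plays the role of JMP [JUMPTABLE + ESI * 4].

```python
def case_a():
    return "routine for case 0"

def case_b():
    return "routine for case 1"

# JUMPTABLE: one entry per case, in index order.
jump_table = [case_a, case_b]

def dispatch(esi):
    """Model of JMP [JUMPTABLE + ESI * 4]: select the routine whose
    address sits at entry esi and transfer control to it."""
    return jump_table[esi]()
```

A high-level-language compiler generates essentially this structure for a dense Case/switch statement: one indexed load, one indirect jump, independent of the number of cases.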

Loop Instruction
Loops often rely on a counter variable that is decremented in each pass through the loop. Maintaining the counter in a register reduces execution time. When an instruction that decrements the register for the counter affects the condition code flags, an explicit comparison is not required before a conditional branch instruction. A loop can be implemented as

        MOV  ECX, NUM_PASSES
START:  ...
        DEC  ECX
        JG   START

Loops of this form can be expressed in a more compact manner by using the LOOP instruction. It combines the functionality of the DEC and JG instructions, and it also implicitly uses register ECX for the counter variable. Using this instruction, the loop can be implemented as

        MOV  ECX, NUM_PASSES
START:  ...
        LOOP START

Condition code flags are not affected by the LOOP instruction.
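The LOOP semantics can be stated precisely with a small simulation (helper name ours, not from the book): decrement ECX as a 32-bit value, branch back while it is nonzero, and leave the condition code flags alone. Note that entering with ECX = 0 would wrap around and iterate 2^32 times, which is why the counter is loaded before the loop is entered.

```python
MASK32 = 0xFFFFFFFF

def run_loop(num_passes, body):
    """Model of a LOOP-terminated loop: the body runs, then ECX is
    decremented; the branch back is taken while ECX != 0. Returns the
    number of passes executed."""
    ecx = num_passes & MASK32
    passes = 0
    while True:
        body()
        passes += 1
        ecx = (ecx - 1) & MASK32   # LOOP decrements ECX; flags untouched
        if ecx == 0:
            break
    return passes
```

With `num_passes = 5` the body executes exactly five times, matching the DEC/JG form above for any positive count.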


Example E.1   Using the instructions introduced thus far, we can now give a program for adding numbers using a loop, similar to the program in Figure 2.26. Assume that memory location N contains the number of 32-bit integers in a list that starts at memory location NUM1. The assembly-language program shown in Figure E.5a can be used to add the numbers and place their sum in memory location SUM.

Register EBX is loaded with the address value NUM1. It is used as the base register in the Base with index addressing mode in the instruction at the location STARTADD, which is the first instruction of the loop. Register EDI is used as the index register. It is cleared by loading it with zero before the loop is entered. On the first pass through the loop, the first number at address NUM1 is added into the EAX register, which was initially cleared to zero. The index register is then incremented by 1. On the second pass, the scale factor of 4 in the ADD instruction causes the second 32-bit number, at address NUM1 + 4, to be added into EAX. The numbers at addresses NUM1 + 8, NUM1 + 12, . . . are added in subsequent passes. Register ECX is used as a counter register. It is initially loaded with the contents of memory location N in the second instruction of the program and is decremented by 1 during each pass through the loop. The conditional branch instruction JG causes a

            LEA  EBX, NUM1              Use EBX as base register.
            MOV  ECX, N                 Use ECX as counter register.
            MOV  EAX, 0                 Use EAX as accumulator register.
            MOV  EDI, 0                 Use EDI as index register.
STARTADD:   ADD  EAX, [EBX + EDI * 4]   Add next number into EAX.
            INC  EDI                    Increment index register.
            DEC  ECX                    Decrement counter register.
            JG   STARTADD               Branch back if [ECX] > 0.
            MOV  SUM, EAX               Store sum in memory.

(a) Straightforward approach

            LEA  EBX, NUM1              Load base register EBX and
            SUB  EBX, 4                   adjust to hold NUM1 - 4.
            MOV  ECX, N                 Initialize counter/index register ECX.
            MOV  EAX, 0                 Use EAX as accumulator register.
STARTADD:   ADD  EAX, [EBX + ECX * 4]   Add next number into EAX.
            LOOP STARTADD               Decrement ECX and branch
                                          back if [ECX] > 0.
            MOV  SUM, EAX               Store sum in memory.

(b) More compact program

Figure E.5 Implementation of the program in Figure 2.26.


branch back to STARTADD while [ECX] > 0. When the contents of ECX reach zero, all the numbers have been added. The branch is not taken, and the MOV instruction writes the sum in register EAX into memory location SUM.

A more compact program for the same task can be developed by making two observations on the program in Figure E.5a. The first observation is that the two-instruction sequence

DEC  ECX
JG   STARTADD

can be replaced with the single instruction

LOOP STARTADD

It decrements the ECX register and then branches to the target address STARTADD if the contents of ECX have not reached zero. The second observation is that we have used two registers, EDI and ECX, as counters. If we scan the list of numbers to be added in the opposite direction, starting with the last number in the list, only one counter register is needed. We will use register ECX because it is the register referenced implicitly by the LOOP instruction. Assuming [N] = n, the first program accesses the numbers using the address sequence NUM1, NUM1 + 4, NUM1 + 8, . . . , NUM1 + 4(n − 1), as EDI contains the sequence of values 0, 1, 2, . . . , (n − 1). The new program, shown in Figure E.5b, uses the address sequence (NUM1 − 4) + 4n, (NUM1 − 4) + 4(n − 1), . . . , (NUM1 − 4) + 4(1), as ECX contains the sequence n, n − 1, . . . , 1. Hence, the value in the base register EBX needs to be changed from NUM1 to NUM1 − 4 in the new program in order to account for the difference between the EDI sequence and the ECX sequence. On the last pass through the loop in the new program, before the LOOP instruction is executed, [ECX] = 1 and the last number to be added is accessed at memory location NUM1.
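The equivalence of the two address sequences is easy to confirm with a quick simulation (function names ours): one loop indexes forward with EDI = 0, 1, . . . , n − 1, the other indexes with ECX = n, n − 1, . . . , 1 relative to a base of NUM1 − 4, and both visit exactly the same elements.

```python
def sum_forward(values):
    """Figure E.5a pattern: EDI runs 0, 1, ..., n-1."""
    eax = 0
    for edi in range(len(values)):
        eax += values[edi]          # [EBX + EDI * 4] with EBX = NUM1
    return eax

def sum_backward(values):
    """Figure E.5b pattern: ECX runs n, n-1, ..., 1."""
    eax = 0
    ecx = len(values)
    while ecx > 0:
        eax += values[ecx - 1]      # [EBX + ECX * 4] with EBX = NUM1 - 4
        ecx -= 1                    # done implicitly by LOOP
    return eax
```

Both functions return the same sum for any list, mirroring the claim that only the base-register adjustment distinguishes the two programs.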

Example E.2   The program in Figure 2.11 computes the sum of all scores for three tests taken by a group of students. Load instructions are used in the program to fetch the operands from memory. Figure E.6 shows an IA-32 version of that program. The availability of the Base with displacement addressing mode for the ADD instructions makes it unnecessary to use separate instructions to access memory operands.

E.4.7 Logic Instructions

The IA-32 architecture has instructions that perform the logic operations AND, OR, and XOR. The operation is performed bitwise on two operands, and the result is placed in the destination location. For example, suppose register EAX contains the hexadecimal pattern 0000FFFF and register EBX contains the pattern 02FA62CA. The instruction

AND EBX, EAX

clears the left half of EBX to all zeroes and leaves the right half unchanged. The result in EBX will be 000062CA.


            MOV  EAX, OFFSET LIST       Get the address LIST.
            MOV  EBX, 0
            MOV  ECX, 0
            MOV  EDX, 0
            MOV  EDI, N                 Load the value n.
LOOP:       ADD  EBX, [EAX + 4]         Add current student mark for Test 1.
            ADD  ECX, [EAX + 8]         Add current student mark for Test 2.
            ADD  EDX, [EAX + 12]        Add current student mark for Test 3.
            ADD  EAX, 16                Increment the pointer.
            DEC  EDI                    Decrement the counter.
            JG   LOOP                   Loop back if not finished.
            MOV  SUM1, EBX              Store the total for Test 1.
            MOV  SUM2, ECX              Store the total for Test 2.
            MOV  SUM3, EDX              Store the total for Test 3.

Figure E.6 Implementation of the program in Figure 2.11.

There is also a NOT instruction, which generates the logical complement of all bits of the operand; that is, it changes all 1s to 0s and all 0s to 1s.
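The AND example above, and the NOT operation, can be worked numerically in Python (helper names ours), with 32-bit masking standing in for the fixed register width:

```python
MASK32 = 0xFFFFFFFF

def and32(dst, src):
    """AND: bitwise conjunction; result goes to the destination."""
    return dst & src & MASK32

def not32(dst):
    """NOT: complement every bit, keeping 32 bits."""
    return ~dst & MASK32

# The example from the text: AND EBX, EAX
eax, ebx = 0x0000FFFF, 0x02FA62CA
ebx = and32(ebx, eax)
```

After the masking step, `ebx` holds 0x000062CA, matching the result stated above.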

E.4.8 Shift and Rotate Instructions

An operand can be shifted right or left, using either logical or arithmetic shifts, by a number of bit positions determined by a specified count. The format of the shift instructions is

OP dst, count

where the destination operand to be shifted is specified using any addressing mode, and the count is given either as an 8-bit immediate value or is contained in the 8-bit register CL. There are four shift instructions:

• SHL (Shift left logical)
• SHR (Shift right logical)
• SAL (Shift left arithmetic; operation is identical to SHL)
• SAR (Shift right arithmetic)

Shift operations are discussed in Section 2.8.2 and illustrated in Figure 2.23. In addition to the shift instructions, there are also four rotate instructions:

• ROL (Rotate left without the carry flag CF)
• ROR (Rotate right without the carry flag CF)
• RCL (Rotate left including the carry flag CF)
• RCR (Rotate right including the carry flag CF)

All four operations are illustrated in Figure 2.25. The rotate instructions require the count argument to be either an 8-bit immediate value or the 8-bit contents of register CL.
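A sketch of the less obvious of these operations in Python (function names ours): SAR must replicate the sign bit, and RCL rotates through a 33-bit quantity formed by CF and the 32-bit operand.

```python
MASK32 = 0xFFFFFFFF

def sar32(x, count):
    """SAR: arithmetic right shift, replicating the sign bit."""
    if x & 0x80000000:
        x -= 1 << 32            # reinterpret the pattern as negative
    return (x >> count) & MASK32

def rol32(x, count):
    """ROL: rotate left, not involving CF."""
    count %= 32
    return ((x << count) | (x >> (32 - count))) & MASK32

def rcl32(x, count, cf):
    """RCL: rotate left through CF, i.e. rotate the 33-bit value CF:x.
    Returns (result, new CF)."""
    wide = (cf << 32) | x
    count %= 33
    wide = ((wide << count) | (wide >> (33 - count))) & ((1 << 33) - 1)
    return wide & MASK32, wide >> 32
```

For instance, `sar32(0x80000000, 4)` gives 0xF8000000 (sign bits shifted in), while a logical shift would give 0x08000000.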


Example E.3   Consider the BCD digit-packing program shown in Figure 2.24, which uses shift and logic instructions. The IA-32 code for this routine is shown in Figure E.7. Two ASCII bytes are loaded into registers AL and BL. The SHL instruction shifts the byte in AL four bit positions to the left, filling the low-order four bits with zeros. The AND instruction sets the high-order four bits of the second byte to zero. Finally, the 4-bit patterns that are the desired BCD codes are combined in AL with the OR instruction and then stored in memory byte location PACKED.

LEA  EBP, LOC        EBP points to first byte.
MOV  AL, [EBP]       Load first byte into AL.
SHL  AL, 4           Shift left by 4 bit positions.
MOV  BL, [EBP + 1]   Load second byte into BL.
AND  BL, 0FH         Clear high-order 4 bits to zero.
OR   AL, BL          Concatenate the BCD digits.
MOV  PACKED, AL      Store the result.

Figure E.7   A routine that packs two BCD digits into a byte, corresponding to Figure 2.24.
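The packing steps translate directly into Python (function name ours): shift the first ASCII digit's low nibble into the high position, mask the second digit, and OR them together.

```python
def pack_bcd(char1, char2):
    """Pack the BCD digits of two ASCII decimal digits into one byte,
    following the SHL / AND / OR sequence of Figure E.7."""
    al = (ord(char1) << 4) & 0xFF   # SHL AL, 4 (AL is a byte register)
    bl = ord(char2) & 0x0F          # AND BL, 0FH
    return al | bl                  # OR AL, BL
```

For example, `pack_bcd('3', '7')` yields 0x37: ASCII '3' is 0x33, whose low nibble 3 moves to the high position, and ASCII '7' is 0x37, whose low nibble 7 fills the low position.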

E.4.9 Subroutine Linkage Instructions

The use of the processor stack for subroutine linkage is described in Section 2.7. In the IA-32 architecture, register ESP is used as the stack pointer. It points to the current top element (TOS) in the processor stack. The stack grows toward lower numbered addresses. The width of the stack is 32 bits; that is, all stack entries are doublewords.

There are two instructions for pushing and popping individual elements onto and off the stack. The instruction

PUSH src

decrements ESP by 4, and then stores the doubleword at location src into the memory location pointed to by ESP. The instruction

POP dst

reverses this process by retrieving the TOS doubleword from the location pointed to by ESP, storing it at location dst, and then incrementing ESP by 4. These instructions implicitly use ESP as the stack pointer. The source and destination operands are specified using the IA-32 addressing modes.
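A minimal model of these semantics (class name ours): ESP moves down by 4 on a push and up by 4 on a pop, with memory addressed in doublewords.

```python
class ProcessorStack:
    """Sketch of the IA-32 stack: it grows toward lower addresses,
    with ESP pointing at the current top-of-stack doubleword."""

    def __init__(self, initial_esp):
        self.esp = initial_esp
        self.memory = {}            # address -> doubleword

    def push(self, value):
        """PUSH src: decrement ESP by 4, then store at [ESP]."""
        self.esp -= 4
        self.memory[self.esp] = value & 0xFFFFFFFF

    def pop(self):
        """POP dst: load from [ESP], then increment ESP by 4."""
        value = self.memory[self.esp]
        self.esp += 4
        return value
```

Two pushes lower ESP by 8, and the pops return the values in last-in, first-out order while restoring ESP.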

There are also two more instructions that push or pop the contents of multiple registers. The instruction

PUSHAD


pushes the contents of all eight general-purpose registers EAX through EDI onto the stack, and the instruction

POPAD

pops them off in the reverse order. When POPAD reaches the old stored value of ESP, it discards those four bytes without loading them into ESP and continues to pop the remaining values into their respective registers. These two instructions are used to efficiently save and restore the contents of all registers as part of implementing subroutines.
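The special treatment of ESP can be made explicit in a sketch (names ours). On IA-32 processors, PUSHAD pushes in the order EAX, ECX, EDX, EBX, ESP (the value before the instruction began), EBP, ESI, EDI; POPAD pops in reverse and skips the stored ESP value.

```python
PUSHAD_ORDER = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]

def pushad(regs, stack):
    """Push all eight registers; the stored ESP is the pre-push value."""
    for name in PUSHAD_ORDER:
        stack.append(regs[name])

def popad(regs, stack):
    """Pop in reverse order; the stored ESP doubleword is discarded
    rather than being loaded into ESP."""
    for name in reversed(PUSHAD_ORDER):
        value = stack.pop()
        if name != "ESP":
            regs[name] = value
```

A round trip restores every register except ESP, which keeps whatever value the stack operations themselves have produced.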

The list-addition program in Figure E.5a can be written as a subroutine as shown in Figure E.8a. Parameters are passed through registers. Memory address NUM1 of the first number in the list is loaded into register EBX by the calling program. The number of entries in the list, contained in memory location N, is loaded into register ECX. The calling program expects to get the final sum passed back to it in register EAX. Thus, registers EBX, ECX, and EAX are used for passing parameters. Register EDI is used by the subroutine as an index register in performing the addition, so its contents have to be saved and restored in the subroutine by PUSH and POP instructions.

The subroutine is called by the instruction

CALL LISTADD

which first pushes the return address onto the stack and then jumps to LISTADD. The return address is the address of the MOV instruction that immediately follows the CALL instruction. The subroutine saves the contents of register EDI on the stack. Figure E.8b shows the stack contents at this point. After executing the loop, the saved contents of register EDI are restored. The instruction RET returns execution control to the calling program by popping the TOS element into the Instruction Pointer (register EIP).

Figure E.9a shows the program of Figure E.5a rewritten as a subroutine with parameters passed on the stack. The parameters NUM1 and n are pushed onto the stack by the two PUSH instructions in the calling program. Note that the keyword OFFSET is required for pushing the address represented by NUM1 on the stack. The top of the stack is at level 2 in Figure E.9b after the CALL instruction has been executed. Registers EDI, EAX, EBX, and ECX serve the same purpose in this subroutine as in the subroutine in Figure E.8. After their values are saved, they are loaded with initial values and parameters by the first eight instructions in the subroutine. At this point, the top of the stack is at level 3. When the numbers have been added by the four-instruction loop, the sum is placed into the stack, overwriting parameter NUM1. After the RET instruction is executed, the ADD and POP instructions in the calling program remove parameter n from the stack and pop the returned sum into memory location SUM. The top of the stack is therefore restored to level 1.

We also have to consider the case of nested subroutines. Figure E.10 shows the IA-32 code for the program in Figure 2.21. The stack frames corresponding to the first and second subroutines are shown in Figure E.11. Register EBP is used as the frame pointer. Instead of using the PUSHAD and POPAD instructions to push and pop all eight general-purpose registers, we have chosen to use individual PUSH and POP instructions in Figure E.10 because only half of the register set is used by the subroutines.


Calling program

            ...
            LEA   EBX, NUM1     Load parameters
            MOV   ECX, N          into EBX, ECX.
            CALL  LISTADD       Branch to subroutine.
            MOV   SUM, EAX      Store sum into memory.
            ...

Subroutine

LISTADD:    PUSH EDI                    Save EDI.
            MOV  EDI, 0                 Use EDI as index register.
            MOV  EAX, 0                 Use EAX as accumulator register.
STARTADD:   ADD  EAX, [EBX + EDI * 4]   Add next number.
            INC  EDI                    Increment index.
            DEC  ECX                    Decrement counter.
            JG   STARTADD               Branch back if [ECX] > 0.
            POP  EDI                    Restore EDI.
            RET                         Return to calling program.

(a) Calling program and subroutine

ESP  →   [EDI]
         Return address
         Old TOS

(b) Stack contents after saving EDI in subroutine

Figure E.8   Program of Figure E.5a written as a subroutine; parameters passed through registers.

E.4.10 Operations on Large Numbers

Section E.4.5 described various arithmetic instructions, including those suitable for operations on numbers whose size exceeds the 32-bit width of a single general-purpose register. The ADC and SBB instructions use the CF flag in the status register as a carry-in bit. These instructions are useful for multiple-precision arithmetic.


(Assume top of stack is at level 1 below.)

Calling program

            PUSH  OFFSET NUM1   Push parameters onto the stack.
            PUSH  N
            CALL  LISTADD       Branch to the subroutine.
            ADD   ESP, 4        Remove n from the stack.
            POP   SUM           Pop the sum into SUM.
            ...

Subroutine

LISTADD:    PUSH EDI                    Save registers.
            PUSH EAX
            PUSH EBX
            PUSH ECX
            MOV  EDI, 0                 Use EDI as index register.
            MOV  EAX, 0                 Use EAX to accumulate the sum.
            MOV  EBX, [ESP + 24]        Load address NUM1.
            MOV  ECX, [ESP + 20]        Load count n.
STARTADD:   ADD  EAX, [EBX + EDI * 4]   Add next number.
            INC  EDI                    Increment index.
            DEC  ECX                    Decrement counter.
            JG   STARTADD               Branch back if not done.
            MOV  [ESP + 24], EAX        Overwrite NUM1 in stack with sum.
            POP  ECX                    Restore registers.
            POP  EBX
            POP  EAX
            POP  EDI
            RET                         Return.

(a) Calling program and subroutine

Level 3  →   [ECX]
             [EBX]
             [EAX]
             [EDI]
Level 2  →   Return address
             n
             NUM1
Level 1  →   Old TOS

(b) Stack contents at different times

Figure E.9   Program of Figure E.5a written as a subroutine; parameters passed on the stack.


Address Instructions Comments

Calling program

            ...
2000        PUSH  PARAM2        Place parameters
2006        PUSH  PARAM1          on stack.
2012        CALL  SUB1
2017        POP   RESULT        Store result.
            ADD   ESP, 4        Restore stack level.
            ...

First subroutine

2100  SUB1: PUSH  EBP             Save frame pointer register.
            MOV   EBP, ESP        Load frame pointer.
            PUSH  EAX             Save registers.
            PUSH  EBX
            PUSH  ECX
            PUSH  EDX
            MOV   EAX, [EBP + 8]  Get first parameter.
            MOV   EBX, [EBP + 12] Get second parameter.
            ...
            PUSH  PARAM3          Place parameter on stack.
2160        CALL  SUB2
2165        POP   ECX             Pop SUB2 result into ECX.
            ...
            MOV   [EBP + 8], EDX  Place answer on stack.
            POP   EDX             Restore registers.
            POP   ECX
            POP   EBX
            POP   EAX
            POP   EBP             Restore frame pointer register.
            RET                   Return to Main program.

Second subroutine

3000  SUB2: PUSH  EBP             Save frame pointer register.
            MOV   EBP, ESP        Load frame pointer.
            PUSH  EAX             Save registers.
            PUSH  EBX
            MOV   EAX, [EBP + 8]  Get parameter.
            ...
            MOV   [EBP + 8], EBX  Place SUB2 result on stack.
            POP   EBX             Restore registers.
            POP   EAX
            POP   EBP             Restore frame pointer register.
            RET                   Return to first subroutine.

Figure E.10 Nested subroutines; implementation of the program in Figure 2.21.


                     [EBX] from SUB1
                     [EAX] from SUB1         Stack frame
EBP (in SUB2)  →     [EBP] from SUB1         for SUB2
                     2165  (return address)
                     param3
                     [EDX] from Main
                     [ECX] from Main
                     [EBX] from Main         Stack frame
                     [EAX] from Main         for SUB1
EBP (in SUB1)  →     [EBP] from Main
                     2017  (return address)
                     param1
                     param2
                     Old TOS

Figure E.11 Stack frames for Figure E.10.

Example E.4   The use of the ADC instruction to add numbers too large to fit into 32-bit registers is shown in Figure E.12. The two hexadecimal values to be added are 10A72C10F8 and 4A5C00FE04. Registers EAX and EBX are loaded with the low- and high-order bits of 10A72C10F8, respectively. Similarly, registers ECX and EDX are loaded with the low- and high-order bits

MOV  EAX, 0A72C10F8H   EAX contains A72C10F8.
MOV  EBX, 10H          EBX contains 10.
MOV  ECX, 5C00FE04H    ECX contains 5C00FE04.
MOV  EDX, 4AH          EDX contains 4A.
ADD  EAX, ECX          Add low-order 32 bits; carry-out sets CF flag.
ADC  EBX, EDX          Add high-order bits with CF flag as carry-in bit.

Figure E.12 Addition of numbers larger than 32 bits using the ADC instruction.


of 4A5C00FE04. The ADD instruction is used to add the low-order 32 bits. The addition generates a carry-out of 1 that causes the CF flag to be set. The ADC instruction then uses this flag as the carry-in bit when adding the high-order bits. The low- and high-order bits of the sum are in registers EAX and EBX.
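The carry chain can be replayed numerically (helper name ours): each 32-bit addition returns a result and a carry-out, and the carry feeds the next, more significant, addition exactly as ADC consumes CF.

```python
MASK32 = 0xFFFFFFFF

def add32(a, b, carry_in=0):
    """One 32-bit ADD/ADC step: returns (result, carry_out)."""
    total = a + b + carry_in
    return total & MASK32, total >> 32

# Example E.4: 10A72C10F8 + 4A5C00FE04
eax, cf = add32(0xA72C10F8, 0x5C00FE04)   # ADD EAX, ECX
ebx, _ = add32(0x10, 0x4A, cf)            # ADC EBX, EDX
```

Here `cf` comes out 1, as stated above, and the 64-bit sum (ebx << 32) | eax is 0x5B032D0EFC.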

E.5 Assembler Directives

As discussed in Section 2.5.1, assembler directives are needed to define the data area of a program and to define the correspondence between symbolic names for data locations and the actual physical address values.

A complete assembly language program for the program in Figure E.5b is shown in Figure E.13. It corresponds to the program in Figure 2.13. The directives shown in Figure E.13 conform to those defined by the widely used Microsoft MASM assembler. The .CODE and .DATA directives define the beginning of the code and data sections of the program. In the data section, the DD directives allocate storage for doubleword-sized data. The label SUM is assigned the address of the location containing the doubleword for the sum that is computed; it is initialized to 0. The label N is assigned the address of the location containing the number 150. Finally, storage is allocated for the list of numbers. The DUP keyword is used to initialize a specified number of consecutive locations in memory to a specified value. In this case, 150 consecutive locations are initialized to 0. Other assembler directives are also available, such as DB, which allocates storage for byte-sized data, and EQU, which assigns a constant value to a label.

            .CODE
            LEA  EBX, NUM1
            SUB  EBX, 4
            MOV  ECX, N
            MOV  EAX, 0
STARTADD:   ADD  EAX, [EBX + ECX * 4]
            LOOP STARTADD
            MOV  SUM, EAX

            .DATA
SUM         DD   0            One doubleword reserved for sum.
N           DD   150          There are N = 150 doublewords in list.
NUM1        DD   150 DUP(0)   Reserve memory for 150 doublewords.
            END

Figure E.13 A program that corresponds to Figure 2.13.


E.6 Example Programs

This section presents the IA-32 code for the example programs described in Section 2.12.

E.6.1 Vector Dot Product Program

Figure E.14 shows a program for computing the dot product of two vectors of numbers stored in the memory starting at addresses AVEC and BVEC. It corresponds to the program in Figure 2.28. The Base with index addressing mode is used to access successive elements of each vector. Register EDI is used as the index register. A scale factor of 4 is used because the vector elements are assumed to be doubleword (4-byte) numbers. Register ECX is used as the loop counter; it is initialized to n. This allows the use of the LOOP instruction, which first decrements ECX and then branches conditionally to the target address LOOPSTART if the contents of ECX have not reached zero. The product of two vector elements is assumed to fit into a doubleword, so the Multiply instruction IMUL explicitly specifies the desired destination register EDX, as discussed in Section E.4.5.

E.6.2 String Search Program

Figure E.15 provides an IA-32 version of the program in Figure 2.31. It determines the first matching instance of a pattern string P in a given target string T. Because there are only eight general-purpose registers, the doubleword contents of EAX are saved in memory location TMP, so that the byte-sized register AL can be used in the loop. The saved doubleword is restored to EAX when that value is needed again.

            LEA  EBP, AVEC              EBP points to vector A.
            LEA  EBX, BVEC              EBX points to vector B.
            MOV  ECX, N                 ECX is the loop counter.
            MOV  EAX, 0                 EAX accumulates the dot product.
            MOV  EDI, 0                 EDI is an index register.
LOOPSTART:  MOV  EDX, [EBP + EDI * 4]   Compute the product
            IMUL EDX, [EBX + EDI * 4]     of next components.
            INC  EDI                    Increment index.
            ADD  EAX, EDX               Add to previous sum.
            LOOP LOOPSTART              Branch back if not done.
            MOV  DOTPROD, EAX           Store dot product in memory.

Figure E.14 A program for computing the dot product of two vectors.
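The register roles map onto a direct Python rendering (function name ours): one accumulator, one index that walks both vectors in step, and a pass count equal to n.

```python
def dot_product(avec, bvec):
    """Figure E.14's loop in Python: EAX accumulates, EDI indexes both
    vectors, and the LOOP counter is the number of elements."""
    eax = 0
    for edi in range(len(avec)):
        edx = avec[edi] * bvec[edi]   # MOV / IMUL pair
        eax += edx                    # ADD EAX, EDX
    return eax
```

For example, `dot_product([1, 2, 3], [4, 5, 6])` is 4 + 10 + 18 = 32. (Unlike the assembly version, which relies on LOOP and so assumes n > 0, the Python loop simply does nothing for empty vectors.)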


            MOV  EAX, OFFSET T          EAX points to string T.
            MOV  EBX, OFFSET P          EBX points to string P.
            MOV  ECX, N                 Get the value n.
            MOV  EDX, M                 Get the value m.
            SUB  ECX, EDX               Compute n - m.
            ADD  ECX, EAX               ECX is the address of T(n - m).
            ADD  EDX, EBX               EDX is the address of P(m).
LOOP1:      MOV  ESI, EAX               Use ESI to scan through string T.
            MOV  EDI, EBX               Use EDI to scan through string P.
            MOV  DWORD PTR TMP, EAX     Save EAX in memory to allow use of AL.
LOOP2:      MOV  AL, [ESI]              Get character from string T.
            CMP  AL, [EDI]              Compare with character from string P.
            JNE  NOMATCH
            INC  ESI                    Advance string T pointer.
            INC  EDI                    Advance string P pointer.
            CMP  EDX, EDI               Check if at P(m).
            JG   LOOP2                  Loop again if not done.
            MOV  EAX, DWORD PTR TMP     Restore EAX after temporary use.
            MOV  DWORD PTR RESULT, EAX  Store the address of T(i).
            JMP  DONE
NOMATCH:    MOV  EAX, DWORD PTR TMP     Restore EAX after temporary use.
            ADD  EAX, 1                 Point to next character in T.
            CMP  ECX, EAX               Check if at T(n - m).
            JGE  LOOP1                  Loop again if not done.
            MOV  DWORD PTR RESULT, -1   No match was found.
DONE:       next instruction

Figure E.15 A string-search program.
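Stripped of register management, the control flow reduces to the classic brute-force search, sketched here (function name ours): for each candidate start position i from 0 to n − m, compare m characters, and report the first full match or −1.

```python
def find_pattern(t, p):
    """First index where pattern p occurs in target t, else -1; the same
    scan order as the two nested loops of Figure E.15 (outer loop LOOP1
    advances the start position, inner loop LOOP2 compares characters)."""
    n, m = len(t), len(p)
    for i in range(n - m + 1):
        if t[i:i + m] == p:
            return i
    return -1
```

This mirrors the figure's result convention: the address of T(i) on success (here, the index i) and −1 when no match is found.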

E.7 Interrupts and Exceptions

Processors implementing the IA-32 architecture use two interrupt-request lines, a nonmaskable interrupt, NMI, and a maskable interrupt, also called the user interrupt request, INTR. Interrupt requests on NMI are always accepted by the processor. Requests on INTR are accepted only if they have a higher privilege level than the program currently running. INTR interrupts can be enabled or disabled by setting an interrupt-enable bit, IF, in the status register.


In addition to external interrupts, there are other events that arise during program execution that can cause an exception. These include invalid OP codes, division by zero, and overflow. They also include trace and breakpoint interrupts.

The occurrence of any of these events causes the processor to branch to an interrupt-service routine. Each interrupt or exception is assigned a vector number. In the case of INTR, the vector number is sent by the I/O device over the bus when the interrupt request is acknowledged. For all other exceptions, the vector number is preassigned. Based on the vector number, the processor determines the starting address of the interrupt-service routine from a table called the Interrupt Descriptor Table.

An IA-32 processor relies on a companion Advanced Programmable Interrupt Controller (APIC). Various I/O devices are connected to the processor through this controller. The interrupt controller implements a priority structure among different devices and sends an appropriate vector number to the processor for each device.

The status register, shown in Figure E.1, contains the Interrupt Enable Flag (IF), the Trap flag (TF), and the I/O Privilege Level (IOPL). When IF = 1, INTR interrupts are accepted. The Trap flag enables trace interrupts after every instruction.

Interrupts are particularly important in the context of an operating system. The IA-32 architecture defines a sophisticated privilege structure, whereby different parts of the operating system execute at one of four levels of privilege. A different segment in the processor address space is used for each level. Switching from one level to another involves a number of checks implemented in a mechanism called a gate. This enables a highly secure OS to be constructed. It is also possible for the processor to run in a simple mode in which no privileges are implemented and all programs run in the same segment. We will only discuss the simple mode here.

When an interrupt request is received or when an exception occurs, the processor performs the following actions:

1. It pushes the status register, the Code Segment register (CS), and the Instruction Pointer (EIP) onto the processor stack.

2. In the case of an exception resulting from an abnormal execution condition, it pushes a code on the stack describing the cause of the exception.

3. It clears the corresponding interrupt-enable flag so that further interrupts from the same source are disabled.

4. It fetches the starting address of the interrupt-service routine from the Interrupt Descriptor Table based on the vector number of the interrupt, loads this value into EIP, and then resumes execution.

After servicing the interrupt request, the interrupt-service routine returns to the interrupted program using the return-from-interrupt instruction, IRET. This instruction pops EIP, CS, and the status register from the stack into the corresponding registers, thus restoring the processor state.
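In the simple (no-privilege) mode described here, the entry and return sequences pair up symmetrically. The sketch below uses a hypothetical `cpu` dictionary of our own design, ignoring segments, gates, and privilege checks, and keeping IF as a separate field rather than inside EFLAGS for clarity. It mirrors the four entry actions listed above and the IRET unwinding.

```python
def enter_interrupt(cpu, vector, error_code=None):
    """Model of interrupt entry: save state, optionally push an error
    code, disable further interrupts, and vector through the IDT."""
    cpu["stack"] += [cpu["EFLAGS"], cpu["CS"], cpu["EIP"]]   # action 1
    if error_code is not None:
        cpu["stack"].append(error_code)                      # action 2
    cpu["IF"] = 0                                            # action 3
    cpu["EIP"] = cpu["IDT"][vector]                          # action 4

def iret(cpu):
    """Model of IRET: pop EIP, CS, and the status register. Any error
    code must have been removed by the service routine beforehand."""
    cpu["EIP"] = cpu["stack"].pop()
    cpu["CS"] = cpu["stack"].pop()
    cpu["EFLAGS"] = cpu["stack"].pop()
```

A full entry/return round trip leaves EIP, CS, and the status register exactly as they were, with the stack restored to its original depth.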

As in the case of subroutines, the interrupt-service routine may create a temporary work space by saving registers or using the stack frame for local variables. It must restore


any saved registers and ensure that the stack pointer ESP is pointing to the return address, before executing the IRET instruction.

E.8 Input/Output Examples

This section uses the I/O examples and interface registers described in Chapter 3 to show how IA-32 instructions are used for polling and interrupt-based I/O operations.

Figure E.16 provides an IA-32 version of the program in Figure 3.5 with polling for both input and output operations. The BT instruction corresponds to the TestBit instruction in Figure 3.5. It copies the value of the specified bit into the CF flag of the status register. Each character is read from the keyboard interface into register AL for later comparison with the carriage-return character, CR. As each character is stored in the memory from register AL, register EBX is incremented to advance the pointer.

To illustrate how interrupts are used for I/O operations, Figure E.17 implements the example in Figure 3.10. The BTS instruction in the initialization code sets the interrupt-enable bit in the keyboard interface. The STI instruction enables the processor to respond to interrupt requests by setting the IF flag in the status register to 1. The BTR instruction in the interrupt-service routine clears the interrupt-enable bit in the keyboard interface. The pointer is maintained in a memory location and incremented for each character that is processed by the interrupt-service routine. We have assumed that the keyboard sends an interrupt request with a specific vector number i and that the corresponding entry i

        LEA   EBX, LOC           Initialize register EBX to point to the
                                 address of the first location in main memory
                                 where the characters are to be stored.
READ:   BT    KBD_STATUS, 1      Wait for a character to be entered
        JNC   READ               in the keyboard buffer KBD_DATA.
        MOV   AL, KBD_DATA       Transfer character into AL (this clears KIN to 0).
        MOV   [EBX], AL          Store the character in memory
        INC   EBX                and increment pointer.
ECHO:   BT    DISP_STATUS, 2     Wait for the display to become ready.
        JNC   ECHO
        MOV   DISP_DATA, AL      Move the character just read to the display
                                 buffer register (this clears DOUT to 0).
        CMP   AL, CR             If it is not CR, then
        JNE   READ               branch back and read another character.

Figure E.16 Program that reads a line of characters and displays it.


APPENDIX E • The Intel IA-32 Architecture

Interrupt-service routine

READ:   PUSH  EAX                Save register EAX on stack.
        PUSH  EBX                Save register EBX on stack.
        MOV   EBX, PNTR          Load address pointer.
        MOV   AL, KBD_DATA       Transfer character into AL.
        MOV   [EBX], AL          Write the character into memory
        INC   DWORD PTR PNTR     and increment the pointer.
ECHO:   BT    DISP_STATUS, 2     Wait for the display to become ready.
        JNC   ECHO
        MOV   DISP_DATA, AL      Display the character just read.
        CMP   AL, CR             Check if the character just read is CR.
        JNE   RTRN               Return if not CR.
        MOV   DWORD PTR EOL, 1   Indicate end of line.
        BTR   KBD_CONT, 1        Disable interrupts in keyboard interface.
RTRN:   POP   EBX                Restore registers.
        POP   EAX
        IRET                     Return from interrupt.

Main program

        MOV   DWORD PTR PNTR, OFFSET LINE   Initialize buffer pointer.
        MOV   DWORD PTR EOL, 0              Clear end-of-line indicator.
        BTS   KBD_CONT, 1                   Enable interrupts in keyboard interface.
        STI                                 Set interrupt flag in processor register.
        next instruction

Figure E.17 A program that reads a line of characters using interrupts and displays it using polling.

in the Interrupt Descriptor Table has been loaded with the starting address READ of the interrupt-service routine.

E.9 Scalar Floating-Point Operations

The IA-32 architecture defines many instructions for floating-point operations. They are performed by a separate floating-point unit (FPU), which contains additional registers. The FPU permits concurrent execution of floating-point operations with other instructions. This section provides a summary of the floating-point features of the IA-32 architecture and some of its floating-point instructions. Floating-point number representation is introduced in Chapter 1, and floating-point arithmetic is discussed in Chapter 9.

All floating-point operations are performed internally on 80-bit double extended-precision numbers. Eight 80-bit floating-point data registers, shown in Figure E.1, are provided for this purpose. Performing operations internally on extended-precision numbers held in these registers reduces the size of accumulated round-off errors in a sequence of calculations, as explained in Section 9.7. There are also additional control and status registers in the FPU, whose details can be found in the Intel technical documentation [1].

A unique feature of the FPU is that the eight floating-point data registers are treated as a stack. Certain instructions that perform arithmetic or data transfer operations involving these registers may also perform push or pop operations. A pointer to the top of the register stack is maintained internally by the FPU. No particular initialization is required for the pointer. Push and pop operations adjust the pointer modulo 8 to cause it to wrap around when necessary.

In assembly language, floating-point register operands are identified using the notation ST(i), where i is an index relative to the top of the register stack (0 ≤ i ≤ 7). For example, ST(0) refers to the register that is the current top of the register stack, ST(1) refers to the next register in the stack, and so on until ST(7), which refers to the last register. When writing programs in assembly language, the programmer needs to keep track of the push and pop operations performed in a sequence of floating-point instructions to correctly identify the operands of each instruction.

Floating-point load instructions have one explicit operand that specifies the source location in the memory. They push the value read from the memory onto the register stack. The destination location is implicitly the register identified as ST(7) at the time the load instruction is executed. After completing the load instruction, that register becomes ST(0), the new top of the register stack. This is because the pointer to the top of the register stack is adjusted modulo 8. The previous register ST(0) becomes ST(1), the previous register ST(1) becomes ST(2), and so on.

For floating-point store instructions, the source operand is implicitly in register ST(0). One explicit operand specifies the destination location in the memory into which the value of register ST(0) is written. Some store instructions leave the register stack unchanged. Other store instructions also pop ST(0). The pointer to the top of the register stack is adjusted modulo 8 in the opposite direction than for load instructions. After completing the pop operation, the previous register ST(0) becomes ST(7), the previous register ST(1) becomes ST(0), and so on.
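The modulo-8 register renaming described above can be sketched with a small Python model. This is purely illustrative (the real FPU maintains the top-of-stack pointer in hardware, in a status word not modeled here); it shows only how the mapping from ST(i) names to physical registers shifts on each push and pop.

```python
# Hypothetical model of the FPU's eight-register stack. ST(i) is an
# index relative to an internal top-of-stack pointer that wraps
# modulo 8 on every push and pop.

class FPUStack:
    def __init__(self):
        self.regs = [0.0] * 8   # the eight physical 80-bit registers
        self.top = 0            # physical index of the register that is ST(0)

    def st(self, i):
        """Physical register number currently named ST(i)."""
        return (self.top + i) % 8

    def push(self, value):
        # A push moves the pointer so that the register that was ST(7)
        # becomes the new ST(0) and receives the value (as with FLD).
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def pop(self):
        # A pop moves the pointer the other way: the old ST(0) becomes
        # ST(7), and the old ST(1) becomes the new ST(0).
        value = self.regs[self.top]
        self.top = (self.top + 1) % 8
        return value

fpu = FPUStack()
fpu.push(1.5)   # like FLD: the loaded value becomes ST(0)
fpu.push(2.5)   # the previous ST(0) is renamed ST(1)
```

After the two pushes, ST(0) holds 2.5 and ST(1) holds 1.5; a pop restores 1.5 to the top, matching the renaming rules stated in the text.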

Floating-point arithmetic instructions must have either the source operand or the destination operand in register ST(0). For some instructions, register ST(0) is implicitly the destination location. Only the source operand needs to be specified explicitly, and it must be in the memory. Other instructions specify two operands explicitly. One must be in register ST(0), which can be either the source or the destination. The other one may be in any register ST(i). A few instructions require no explicit operands because they implicitly use register ST(0) for both source and destination locations.

The floating-point registers always hold 80-bit double extended-precision numbers. The memory may contain numbers in either 32-bit single-precision or 64-bit double-precision representation. Single-precision representation reduces the storage requirements when large amounts of floating-point data are involved but very high precision is not needed. The FPU automatically converts single-precision or double-precision operands from the memory to double extended-precision numbers before performing arithmetic operations. Double extended-precision numbers are converted to either single-precision or double-precision representation when transferring them to the memory. The FPU also converts between integer and double extended-precision floating-point representations for transfers to or from the memory. Implementing the conversion capability in hardware reduces execution time and code size by eliminating the need for software to perform this task.

E.9.1 Load and Store Instructions

The FLD instruction loads a floating-point number from a memory location and pushes the value onto the floating-point register stack. The keywords DWORD PTR are used to indicate single-precision operand size, and the keywords QWORD PTR are used to indicate double-precision operand size. In either case, the value read from the memory is converted to the 80-bit double extended-precision format. For example, the instruction

FLD DWORD PTR [EAX]

reads a single-precision floating-point number from the memory location [EAX], converts the number to the 80-bit format, and pushes it onto the register stack.

The FST instruction writes the floating-point number in ST(0) into the memory. The pointer to the top of the register stack is not affected in this case. The appropriate keywords, DWORD PTR or QWORD PTR, must be used to specify the desired size for the value to be written into the memory. This determines whether the 80-bit representation in ST(0) is to be converted to either single-precision or double-precision format. For example, the instruction

FST QWORD PTR [EDX + 8]

converts the 80-bit floating-point number in ST(0) to 64-bit double-precision format, and writes this converted value into memory location [EDX] + 8.

There are instructions to load and store 32-bit integer operands. These instructions perform automatic conversion to and from the 80-bit floating-point format. The FILD instruction reads a 32-bit integer from the memory, converts it to an 80-bit floating-point number, and pushes the result onto the register stack. The FIST instruction converts the 80-bit floating-point number in register ST(0) to a 32-bit integer, then writes the result into the memory. It does not change the pointer to the top of the register stack.

Finally, there are store instructions that pop ST(0) after writing the contents of that register (with appropriate conversion) into the memory. This means that the previous register ST(0) becomes ST(7), the previous register ST(1) becomes ST(0), and so on. The FSTP instruction performs the same operation as the FST instruction, but it also pops ST(0). Similarly, the FISTP instruction combines the operation of the FIST instruction with a pop operation.


E.9.2 Arithmetic Instructions

The basic instructions for performing arithmetic operations on floating-point numbers are FADD, FSUB, FMUL, and FDIV. For instructions that explicitly specify one operand, it is the source operand and it must be in the memory. The destination operand is implicitly in register ST(0). The operand from the memory is converted to the 80-bit double extended-precision format before the arithmetic operation is performed. For example, the instruction

FADD QWORD PTR [EAX]

reads the 64-bit double-precision number from memory location [EAX], converts it to the 80-bit double extended-precision representation, adds the converted value to the current value in register ST(0), and places the sum in ST(0).

For instructions that specify two operands explicitly, both must be in registers and one of the registers must be ST(0). For example, the instruction

FMUL ST(0), ST(3)

multiplies the values in ST(0) and ST(3), and places the product in ST(0).

The instructions FADDP, FSUBP, FMULP, and FDIVP are used to pop the top of the register stack after performing an arithmetic operation. These instructions must specify two operands in registers. It is appropriate to use ST(1) as the destination location for these instructions. After the pop operation is completed, the result is in the register that is the new top of the register stack. For example, the instruction

FSUBP ST(1), ST(0)

subtracts the value in ST(0) from ST(1), places the result in ST(1), then pops ST(0). Hence, the result is now at the top of the stack in ST(0).

The instructions FIADD, FISUB, FIMUL, and FIDIV specify a single explicit operand, which is a 32-bit integer in the memory. The destination operand is implicitly in register ST(0). The 32-bit integer is automatically converted to 80-bit double extended-precision representation before the arithmetic operation is performed. For example, the instruction

FIDIV DWORD PTR [ECX + 4]

reads a 32-bit integer from memory location [ECX] + 4, converts the value into 80-bit floating-point representation, divides the value in ST(0) by the converted value, and places the result in ST(0).

For subtraction and division operations, the order of the operands is significant. Because the floating-point registers are organized as a stack, it may sometimes be useful to reverse the order of operands for these arithmetic operations so that a particular result or operand can later be popped from the stack. The instructions FSUBR and FDIVR are provided for this purpose. For example, the instruction

FSUBR ST(3), ST(0)

performs the operation ST(3) ← [ST(0)] − [ST(3)]. In contrast, the instruction

FSUB ST(3), ST(0)

performs the operation ST(3) ← [ST(3)] − [ST(0)].
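The operand-ordering distinction is easy to state as two one-line functions. This is only a sketch of the semantics (d stands for the destination register's value, s for the source's); actual precision and rounding behavior of the FPU are not modeled.

```python
# Sketch of operand ordering: with destination value d and source
# value s, FSUB computes d - s while FSUBR computes s - d.

def fsub(d, s):
    return d - s     # e.g., ST(3) <- [ST(3)] - [ST(0)]

def fsubr(d, s):
    return s - d     # e.g., ST(3) <- [ST(0)] - [ST(3)]
```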


E.9.3 Comparison Instructions

Floating-point comparison instructions can be used to set the condition code flags in the status register. The conditional Jump instructions in Table E.2 can then be used to test different conditions. The FUCOMI and FUCOMIP instructions compare two floating-point operands in registers. The destination operand must be in ST(0). The FUCOMIP instruction also pops ST(0) after the comparison is performed. For example, the instruction

FUCOMI ST(0), ST(4)

performs the subtraction [ST(0)] − [ST(4)] and sets the ZF and CF flags in the status register based on the result, which is then discarded. A subsequent Jump instruction can be used to test the appropriate condition: JE for ST(0) = ST(4), JA for ST(0) > ST(4), and JB for ST(0) < ST(4).
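The flag settings described above can be summarized in a short sketch. This models only the ordered-comparison outcomes named in the text (JE when ZF = 1, JB when CF = 1, JA when both are 0); the unordered NaN case handled by the real FUCOMI instruction is not modeled.

```python
# Sketch of how a comparison of [ST(0)] against [ST(i)] maps onto the
# ZF and CF flags tested by JE, JA, and JB.

def fucomi_flags(st0, sti):
    zf = 1 if st0 == sti else 0   # JE taken when ZF = 1
    cf = 1 if st0 < sti else 0    # JB taken when CF = 1; JA when ZF = CF = 0
    return zf, cf
```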

E.9.4 Additional Instructions

There are additional instructions such as square root (FSQRT), change sign (FCHS), absolute value (FABS), sine (FSIN), and cosine (FCOS). These instructions require no explicit operands because they implicitly use ST(0) as both the source and destination. In other words, the value in the register at the top of the register stack is replaced with the result of the operation, and the pointer to the top of the register stack is unchanged.

There are also instructions that are used to push commonly used floating-point constants onto the stack. FLDZ pushes 0.0 onto the register stack. FLD1 pushes 1.0 onto the register stack. FLDPI pushes the double extended-precision floating-point representation of π, accurate to 19 decimal digits, onto the register stack. No explicit operands are required for these instructions.

E.9.5 Example Floating-Point Program

An example program is shown in Figure E.18. Given a pair of points (x0, y0) and (x1, y1), the program determines the slope m and intercept b of a line passing through these points, except when the points lie on a vertical line.

The coordinates for the two points are assumed to be 64-bit double-precision floating-point numbers stored consecutively in the memory as x0, y0, x1, and y1, beginning at location COORDS. The program places the computed slope and intercept as double-precision numbers in memory locations SLOPE and INTERCEPT. The slope of a line is given by m = (y1 − y0)/(x1 − x0), hence the program must check for a zero in the denominator before performing the division. When the denominator is zero, the points lie on a vertical line. The program writes a value of one into memory location VERT_LINE to reflect this case and to indicate that the values in memory locations SLOPE and INTERCEPT are not valid, and no further computation is done. Otherwise, the program computes the value of the slope and stores it in the memory. With a valid slope, the intercept is computed as b = y0 − m · x0, which is then written into the memory. The subtraction in this case uses the FSUBR instruction, which reverses the order of operands. Finally, a zero is written into memory location VERT_LINE to indicate that the slope and intercept are valid.
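The computation can first be understood at a high level before reading the register-stack version. The following Python sketch mirrors the algorithm (vertical-line check, then slope, then intercept); it assumes the coordinates are available as ordinary floating-point values rather than in memory.

```python
# High-level sketch of the computation in Figure E.18: returns a
# (vert_line, slope, intercept) triple, with slope and intercept
# undefined (None) when the points lie on a vertical line.

def line_through(x0, y0, x1, y1):
    if x1 - x0 == 0.0:
        return 1, None, None          # denominator zero: vertical line
    m = (y1 - y0) / (x1 - x0)         # slope
    b = y0 - m * x0                   # intercept
    return 0, m, b
```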


            MOV   EAX, OFFSET COORDS     EAX points to list of coordinates.
            FLD   QWORD PTR [EAX + 24]   Push y1 on register stack.
            FLD   QWORD PTR [EAX + 16]   Push x1 on register stack.
            FLD   QWORD PTR [EAX + 8]    Push y0 on register stack.
            FLD   QWORD PTR [EAX]        Push x0 on register stack.
            FSUBP ST(2), ST(0)           Compute x1 − x0; pop x0.
            FLDZ                         Push 0.0 on stack.
            FUCOMIP ST(0), ST(2)         Determine whether denominator is zero.
            JE    NO_SLOPE               If so, slope m is undefined.
            FSUBP ST(2), ST(0)           Compute y1 − y0; pop y0.
            FDIVP ST(1), ST(0)           Compute m = (y1 − y0)/(x1 − x0).
            MOV   EBX, OFFSET SLOPE      EBX points to memory location SLOPE.
            FST   QWORD PTR [EBX]        Store the slope to memory.
            FLD   QWORD PTR [EAX + 8]    Push y0 on register stack.
            FLD   QWORD PTR [EAX]        Push x0 on register stack.
            FMULP ST(2), ST(0)           Compute m · x0; pop x0.
            FSUBRP ST(1), ST(0)          Compute b = y0 − m · x0; pop y0.
            MOV   EBX, OFFSET INTERCEPT  EBX points to memory location INTERCEPT.
            FSTP  QWORD PTR [EBX]        Store the intercept to memory;
                                         pop top of stack.
            MOV   EBX, 0                 Indicate that line is not vertical.
            JMP   DONE
NO_SLOPE:   MOV   EBX, 1                 Indicate that line is vertical.
DONE:       MOV   DWORD PTR VERT_LINE, EBX

Figure E.18 Floating-point program to compute the slope and intercept of a line.

E.10 Multimedia Extension (MMX) Operations

A two-dimensional graphic or video image can be represented by a large array of sampled image points, called pixels. The color and brightness of each point can be encoded into an 8-bit data item. Processing of such data has two main characteristics. The first is that manipulations of individual pixels often involve very simple arithmetic or logic operations. The second is that very high computational performance is needed for some real-time display applications. The same characteristics apply to sampled audio signals or speech processing, where a sequence of signed numbers represents samples of a continuous analog signal taken at periodic intervals.

In such applications, processing efficiency is achieved if the individual data items, which are usually bytes or 16-bit words, are packed into small groups whose elements can be processed in parallel. Vector or single-instruction multiple-data (SIMD) instructions for this form of parallel processing are described in Chapter 12. The IA-32 instruction set includes a number of SIMD instructions, which are called multimedia extension (MMX) instructions. They perform the same operation simultaneously on multiple data elements, packed into 64-bit quadwords. The operands for MMX instructions can be in the memory, or in the eight floating-point registers. Thus, these registers serve a dual purpose. They can hold either floating-point numbers or MMX operands. When used by MMX instructions, the registers are referred to as MM0 through MM7, and only the lowermost 64 bits of each 80-bit register are relevant for MMX operations. Unlike the floating-point instructions in Section E.9, the MMX instructions do not manage this shared register set as a stack.

The MOVQ instruction is provided for transferring 64-bit quadword operands between the memory and the MMX registers. For example, the instruction

MOVQ MM0, [EAX]

loads the quadword from the memory location whose address is in register EAX into register MM0. The MOVQ instruction can also be used to transfer data between MMX registers. For example, the instruction

MOVQ MM3, MM4

transfers the contents of register MM4 to register MM3.

Instructions are provided to perform arithmetic and logic operations in parallel on multiple elements of a packed quadword operand. The source can be in the memory or in an MMX register, but the destination must be an MMX register. For most MMX instructions, a suffix is used to indicate the size (and number) of data elements within a packed quadword: B for byte (8 elements), W for word (4 elements), D for doubleword (2 elements), and Q for quadword (1 element). For example, the instruction

PADDB MM2, [EBX]

adds eight corresponding bytes of the quadwords in register MM2 and in the memory location pointed to by register EBX. The eight sums are computed in parallel. The results are placed in register MM2.
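The lane-wise behavior of PADDB can be sketched as follows. This is a behavioral model only: it treats the two quadwords as eight-byte sequences and wraps each sum modulo 256, which matches the non-saturating byte addition that PADDB performs.

```python
# Sketch of PADDB: eight independent byte additions within a 64-bit
# quadword, each sum wrapping around modulo 256.

def paddb(a, b):
    assert len(a) == len(b) == 8          # one 64-bit quadword each
    return bytes((x + y) % 256 for x, y in zip(a, b))
```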

Other instructions are provided for subtraction (PSUB), multiplication (PMUL), combined multiplication and addition (PMADD), logic operations (PAND, POR, and PXOR), and a large number of other operations on packed quadword operands.

E.11 Vector (SIMD) Floating-Point Operations

Section E.9 described instructions for operating on individual floating-point numbers. Vector (SIMD) instructions are also provided to perform operations simultaneously on multiple floating-point numbers. In Intel terminology, these instructions are called streaming SIMD extension (SSE) instructions. They handle packed 128-bit double quadwords, each consisting of four 32-bit floating-point numbers. Eight additional 128-bit registers, XMM0 to XMM7, are available for holding these operands.

The MOVAPS and MOVUPS instructions transfer a packed double quadword between memory and the XMM registers, or between XMM registers. The PS suffix indicates packed single-precision floating-point values in the double quadword. The A or U designation determines whether a memory address must be aligned on a 16-byte boundary or may be unaligned. The instruction

MOVUPS XMM3, [EAX]

loads a 128-bit double quadword from the memory location pointed to by register EAX into register XMM3. The instruction

MOVUPS XMM4, XMM5

transfers the double quadword in register XMM5 into register XMM4.

The basic arithmetic operations are performed simultaneously on four pairs of 32-bit floating-point numbers from two double-quadword operands. The source can be in the memory or in an XMM register, but the destination must be an XMM register. Instructions include ADDPS, SUBPS, MULPS, and DIVPS. For example, the instruction

ADDPS XMM0, XMM1

adds the four corresponding pairs of floating-point numbers in registers XMM0 and XMM1 and places the four sums in register XMM0.
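The four-lane behavior of ADDPS can be sketched in the same style as the MMX example. Note the simplification: the sketch uses Python floats in a four-element list to stand in for the four 32-bit single-precision lanes of an XMM register, so the single-precision rounding of the real instruction is not modeled.

```python
# Sketch of ADDPS: four independent single-precision additions across
# the lanes of two 128-bit operands, result in the destination lanes.

def addps(xmm_dst, xmm_src):
    assert len(xmm_dst) == len(xmm_src) == 4
    return [d + s for d, s in zip(xmm_dst, xmm_src)]
```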

E.12 Examples of Solved Problems

This section presents some examples of the types of problems that a student may be asked to solve, and shows how such problems can be solved.

Example E.5

Problem: Assume that there is a string of ASCII-encoded characters stored in memory starting at address STRING. The string ends with the Carriage Return (CR) character. Write an IA-32 program to determine the length of the string.

Solution: Figure E.19 presents a possible program. Each character in the string is compared to CR (ASCII code 0D), and a counter is incremented until the end of the string is reached. The result is stored in location LENGTH.
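The same scan can be expressed functionally as a short Python sketch, with memory modeled as a byte sequence; this mirrors the structure of the assembly version rather than using a built-in search.

```python
# Functional sketch of the scan in Figure E.19: count characters
# before the CR terminator (ASCII 0DH).

def string_length(mem, start):
    length = 0
    while mem[start + length] != 0x0D:   # loop until CR is found
        length += 1
    return length
```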

        MOV  EAX, OFFSET STRING          EAX points to the start of the string.
        MOV  EDI, 0                      EDI is a counter that is cleared to 0.
LOOP:   MOV  BL, BYTE PTR [EAX + EDI]    Load next character into lowest byte of EBX.
        CMP  BL, 0DH                     Compare character with CR.
        JE   DONE                        Finished if it matches.
        INC  EDI                         Increment the counter.
        JMP  LOOP                        Not finished, loop back.
DONE:   MOV  DWORD PTR LENGTH, EDI       Store the count in memory location LENGTH.

Figure E.19 Program for Example E.5.


Example E.6

Problem: We want to find the smallest number in a list of non-negative 32-bit integers. Storage for data begins at address 1000. The doubleword at this address must hold the value of the smallest number after it has been found. The next doubleword contains the number of entries, n, in the list. The following n doublewords contain the numbers in the list. Write a program to find the smallest number and include the assembler directives needed to organize the data as stated.

Solution: The program in Figure E.20 accomplishes the required task. It assumes that n ≥ 1. A few sample numbers are included as entries in the list.
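The task can also be sketched in Python using the same data layout: a block of doublewords holding the result slot, then the count n, then the n entries. The block is modeled here as a plain list of integers.

```python
# Sketch of the task in Figure E.20: block[0] is the result slot,
# block[1] is the count n, and block[2:2+n] are the list entries.

def find_smallest(block):
    n = block[1]
    smallest = block[2]                 # seed with the first entry (n >= 1)
    for value in block[3:2 + n]:
        if value < smallest:
            smallest = value
    block[0] = smallest                 # store the result in the first slot
    return smallest
```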

LIST      EQU  1000                 Starting address of list.
          .CODE
          MOV  EAX, OFFSET LIST     EAX points to the start of the list.
          MOV  EDI, [EAX + 4]       EDI is a counter, initialize it with n.
          MOV  EBX, EAX             EBX points to the first number
          ADD  EBX, 8               after adjusting its value.
          MOV  ECX, [EBX]           ECX holds the smallest number so far.
LOOP:     DEC  EDI                  Decrement the counter.
          JZ   DONE                 Finished if EDI is zero.
          ADD  EBX, 4               Increment the pointer.
          MOV  EDX, [EBX]           Get the next number.
          CMP  ECX, EDX             Compare next number and smallest so far.
          JLE  LOOP                 If next number not smaller, loop again.
          MOV  ECX, EDX             Otherwise, update smallest number so far.
          JMP  LOOP                 Loop again.
DONE:     MOV  [EAX], ECX           Store the smallest number into SMALL.

          .DATA
          ORG  1000
SMALL     DD   0                    Space for the smallest number found.
N         DD   7                    Number of entries in list.
ENTRIES   DD   4, 5, 3, 6, 1, 8, 2  Entries in the list.

Figure E.20 Program for Example E.6.

Example E.7

Problem: Write a program that converts an n-digit decimal integer into a binary number. The decimal number is given as n ASCII-encoded characters, as would be the case if the number is entered by typing it on a keyboard.

Page 723: Carl Hamacher

December 14, 2010 09:22 ham_338065_appe Sheet number 39 Page number 699 cyan black

E.12 Examples of Solved Problems 699

Solution: Consider a four-digit decimal number, D, which is represented by the digits d3d2d1d0. The value of this number is ((d3 × 10 + d2) × 10 + d1) × 10 + d0. This representation of the number is the basis for the conversion technique used in the program in Figure E.21. Note that each ASCII-encoded character is converted into a Binary Coded Decimal (BCD) digit with the AND instruction before it is used in the computation.
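The nested multiply-and-add form above is Horner's rule, and it can be sketched directly in Python. The AND with 0FH that forms the BCD digit works because the ASCII codes for '0' through '9' are 30H through 39H, so the low four bits are the digit value.

```python
# Sketch of the conversion in Figure E.21: each ASCII digit is masked
# to its BCD value and accumulated with repeated multiply-by-10.

def ascii_decimal_to_int(digits):
    value = 0
    for ch in digits:
        value = value * 10 + (ch & 0x0F)   # AND with 0FH forms the BCD digit
    return value
```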

        MOV  ECX, DWORD PTR N        ECX is a counter, initialize it with n.
        MOV  ESI, OFFSET DECIMAL     ESI points to the ASCII digits.
        MOV  EBX, 0                  EBX will hold the binary number.
        MOV  EDI, 10                 EDI will be used to multiply by 10.
LOOP:   MOV  DL, [ESI]               Get the next ASCII digit.
        AND  EDX, 0FH                Form the BCD digit.
        ADD  EBX, EDX                Add to the intermediate result.
        DEC  ECX                     Decrement the counter.
        JZ   DONE                    Exit loop if finished.
        IMUL EBX, EDI                Multiply by 10.
        INC  ESI                     Increment the pointer.
        JMP  LOOP                    Loop back if not done.
DONE:   MOV  DWORD PTR BINARY, EBX   Store result in memory location BINARY.

Figure E.21 Program for Example E.7.

Example E.8

Problem: Consider an array of numbers A(i,j), where i = 0 through n − 1 is the row index, and j = 0 through m − 1 is the column index. The array is stored in the memory of a computer one row after another, with elements of each row occupying m successive word locations. Write a subroutine for adding column x to column y, element by element, leaving the sum elements in column y. The indices x and y are passed to the subroutine in registers EAX and EBX. The parameters n and m are passed to the subroutine in registers ECX and EDX, and the address of element A(0,0) is passed in register EDI.

Solution: A possible program is given in Figure E.22. We assumed that the values x, y, n, and m are stored in memory locations X, Y, N, and M. Also, the elements of the array are stored in successive words that begin at location ARRAY, which is the address of the element A(0,0). Comments in the program indicate the purpose of individual instructions.
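The address arithmetic in the subroutine follows from the row-major layout: element A(i, j) lives at word index i · m + j from A(0,0). A Python sketch over a flat list makes that indexing explicit.

```python
# Sketch of the column addition: A is stored row-major in a flat list,
# so A(i, j) is a[i * m + j]. Column x is added into column y in place.

def add_columns(a, n, m, x, y):
    for i in range(n):
        a[i * m + y] += a[i * m + x]
```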

Example E.9

Problem: Assume that a memory location BINARY contains a 32-bit pattern. It is desired to display these bits as eight hexadecimal digits on a display device that has the interface depicted in Figure 3.3. Write a program that accomplishes this task.


        MOV  EAX, DWORD PTR X     Load the value x.
        MOV  EBX, DWORD PTR Y     Load the value y.
        MOV  ECX, DWORD PTR N     Load the value n.
        MOV  EDX, DWORD PTR M     Load the value m.
        MOV  EDI, OFFSET ARRAY    Load the address of A(0,0).
        CALL SUB
        next instruction
        ...

SUB:    PUSH ESI                  Save register ESI.
        SHL  EDX, 2               Determine the distance in bytes
                                  between successive elements in a column.
        SUB  EBX, EAX             Form y − x.
        SHL  EBX, 2               Form 4(y − x).
        SHL  EAX, 2               Form 4x.
        ADD  EDI, EAX             EDI points to A(0,x).
        MOV  ESI, EDI
        ADD  ESI, EBX             ESI points to A(0,y).
LOOP:   MOV  EAX, [EDI]           Get the next number in column x.
        MOV  EBX, [ESI]           Get the next number in column y.
        ADD  EAX, EBX             Add the numbers and store the sum.
        MOV  [ESI], EAX
        ADD  EDI, EDX             Increment pointer to column x.
        ADD  ESI, EDX             Increment pointer to column y.
        DEC  ECX                  Decrement the row counter.
        JG   LOOP                 Loop back if not done.
        POP  ESI                  Restore ESI.
        RET                       Return to the calling program.

Figure E.22 Program for Example E.8.

Solution: First it is necessary to convert the 32-bit pattern into hex digits that are represented as ASCII-encoded characters. The conversion can be done by using the table-lookup approach. A 16-entry table has to be constructed to provide the ASCII code for each possible hex digit. Then, for each four-bit segment of the pattern in BINARY, the corresponding character can be looked up in the table and stored in consecutive byte locations in memory beginning at address HEX. Finally, the eight characters beginning at address HEX are sent to the display. Figure E.23 gives a possible program.
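The table-lookup conversion itself can be sketched in Python. The sketch takes the 32-bit pattern four bits at a time, high-order digit first (the assembly version achieves the same ordering by rotating left by four), and indexes a 16-entry table of ASCII codes; the display-output half of the program is omitted.

```python
# Sketch of the conversion in Figure E.23: index a 16-entry table of
# ASCII codes with each 4-bit segment, high-order digit first.

TABLE = b"0123456789ABCDEF"   # same contents as the DB table: 30H..39H, 41H..46H

def to_hex_chars(pattern):
    out = bytearray()
    for shift in range(28, -1, -4):             # 8 digits, high nibble first
        out.append(TABLE[(pattern >> shift) & 0x0F])
    return bytes(out)
```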


          .CODE
          MOV  EAX, DWORD PTR BINARY   Load the binary number.
          MOV  ECX, 8                  ECX is a digit counter that is set to 8.
          MOV  EDI, OFFSET HEX         EDI points to the hex digits.
          MOV  ESI, OFFSET TABLE       ESI points to table for ASCII conversion.
LOOP:     ROL  EAX, 4                  Rotate high-order digit into low-order position.
          MOV  EBX, EAX                Copy current value to another register.
          AND  EBX, 0FH                Extract next digit.
          MOV  DL, [ESI + EBX]         Get ASCII code for the digit, store it in the
          MOV  [EDI], DL               HEX buffer, and increment the digit pointer.
          INC  EDI
          DEC  ECX                     Decrement the digit counter.
          JG   LOOP                    Loop back if not the last digit.
DISPLAY:  MOV  ECX, 8
          MOV  EDI, OFFSET HEX
          MOV  ESI, OFFSET DISP_DATA
DLOOP:    MOV  EDX, [ESI + 4]          Check if the display is ready by
          AND  EDX, 4                  testing the DOUT flag.
          JZ   DLOOP
          MOV  DL, [EDI]               Get the next ASCII character,
          INC  EDI                     increment the character pointer,
          MOV  [ESI], DL               and send it to the display.
          DEC  ECX                     Decrement the counter.
          JG   DLOOP                   Loop until all characters displayed.
          next instruction

          .DATA
          ORG  1000
HEX       DB   8 DUP (0)               Space for ASCII-encoded digits.
TABLE     DB   30H, 31H, 32H, 33H      Table for conversion to ASCII code.
          DB   34H, 35H, 36H, 37H
          DB   38H, 39H, 41H, 42H
          DB   43H, 44H, 45H, 46H

Figure E.23 Program for Example E.9.


E.13 Concluding Remarks

The IA-32 instruction set is an example of a very extensive CISC design. It supports a broad range of operations on different types of data such as individual integers and floating-point numbers, as well as vectors of packed integers and floating-point numbers. Despite the challenges associated with such a large instruction set, it is implemented in high-performance processors.

Problems

E.1 [E] Write a program that computes the expression SUM = 580 + 68400 + 80000.

E.2 [E] Write a program that computes the expression ANSWER = A × B + C × D.

E.3 [M] Write a program that finds the number of negative integers in a list of n 32-bit integers and stores the count in location NEGNUM. The value n is stored in memory location N, and the first integer in the list is stored in location NUMBERS. Include the necessary assembler directives and a sample list that contains six numbers, some of which are negative.

E.4 [E] Write an assembly-language program in the style of Figure E.13 for the program in Figure E.6. Assume the data layout of Figure 2.10.

E.5 [M] Write an IA-32 program to solve Problem 2.10 in Chapter 2.

E.6 [E] Write an IA-32 program for the problem described in Example 2.5 in Chapter 2.

E.7 [M] Write an IA-32 program for the problem described in Example 3.5 in Chapter 3.

E.8 [E] Write an IA-32 program for the problem described in Example 3.6 in Chapter 3.

E.9 [E] Write an IA-32 program for the problem described in Example 3.6 in Chapter 3, but assume that the address of TABLE is 0x10100.

E.10 [E] Write a program that displays the contents of 10 bytes of the main memory in hexadecimal format on a line of a video display. The byte string starts at location LOC in the memory. Each byte has to be displayed as two hex characters. The displayed contents of successive bytes should be separated by a space.

E.11 [M] Assume that a memory location BINARY contains a 16-bit pattern. It is desired to display these bits as a string of 0s and 1s on a display device that has the interface depicted in Figure 3.3. Write a program that accomplishes this task.

E.12 [M] Using the seven-segment display in Figure 3.17 and the timer circuit in Figure 3.14, write a program that flashes decimal digits in the sequence 0, 1, 2, . . . , 9, 0, . . . . Each digit is to be displayed for one second. Assume that the counter in the timer circuit is driven by a 100-MHz clock.

E.13 [D] Using two 7-segment displays of the type shown in Figure 3.17, and the timer circuit in Figure 3.14, write a program that flashes numbers 0, 1, 2, . . . , 98, 99, 0, . . . . Each number is to be displayed for one second. Assume that the counter in the timer circuit is driven by a 100-MHz clock.


E.14 [D] Write a program that computes real clock time and displays the time in hours (0 to 23) and minutes (0 to 59). The display consists of four 7-segment display devices of the type shown in Figure 3.17. A timer circuit that has the interface given in Figure 3.14 is available. Its counter is driven by a 100-MHz clock.

E.15 [M] Write an IA-32 program to solve Problem 2.22 in Chapter 2. Assume that the element to be pushed/popped is located in register EAX, and that register EBX serves as the stack pointer for the user stack.

E.16 [M] Write an IA-32 program to solve Problem 2.24 in Chapter 2.

E.17 [M] Write an IA-32 program to solve Problem 2.25 in Chapter 2.

E.18 [M] Write an IA-32 program to solve Problem 2.26 in Chapter 2.

E.19 [M] Write an IA-32 program to solve Problem 2.27 in Chapter 2.

E.20 [M] Write an IA-32 program to solve Problem 2.28 in Chapter 2.

E.21 [M] Write an IA-32 program to solve Problem 2.29 in Chapter 2.

E.22 [M] Write an IA-32 program to solve Problem 2.30 in Chapter 2.

E.23 [M] Write an IA-32 program to solve Problem 2.31 in Chapter 2.

E.24 [D] Write an IA-32 program to solve Problem 2.32 in Chapter 2.

E.25 [D] Write an IA-32 program to solve Problem 2.33 in Chapter 2.

E.26 [M] Write an IA-32 program to solve Problem 3.20 in Chapter 3.

E.27 [M] Write an IA-32 program to solve Problem 3.22 in Chapter 3.

E.28 [D] Write an IA-32 program to solve Problem 3.24 in Chapter 3.

E.29 [D] Write an IA-32 program to solve Problem 3.26 in Chapter 3.

E.30 [D] The function sin(x) can be approximated with reasonable accuracy as x − x³/6 + x⁵/120 = x(1 − x²(1/6 − x²(1/120))) when 0 ≤ x ≤ π/2. Write a subroutine called SIN that accepts an input parameter that is a pointer to the floating-point value of x in the memory and computes the approximated value of sin(x) using the second expression above, involving only terms in x and x². The computed value should be returned in the same memory location as the input parameter x. Any integer registers used by the subroutine should be saved and restored as needed. The accuracy of the approximation used by this subroutine can be explored by comparing its results with results obtained with the FSIN instruction that is included in the IA-32 instruction set.
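The nested (Horner) form of the polynomial in Problem E.30 needs only the quantities x and x², which is why the problem singles it out. A quick numerical check of the approximation, sketched here in Python rather than IA-32 assembly (the function name `sin_approx` is ours, and the library `math.sin` stands in for the FSIN instruction):

```python
import math

def sin_approx(x):
    """Approximate sin(x) for 0 <= x <= pi/2 using the nested form
    x*(1 - x^2*(1/6 - x^2*(1/120))), equivalent to x - x^3/6 + x^5/120."""
    x2 = x * x
    return x * (1.0 - x2 * (1.0 / 6.0 - x2 * (1.0 / 120.0)))

# Compare against the library sine at a few points in [0, pi/2].
for x in (0.0, math.pi / 6, math.pi / 4, math.pi / 2):
    print(f"x={x:.4f}  approx={sin_approx(x):.6f}  sin={math.sin(x):.6f}")
```

The error is tiny near 0 and grows toward π/2, where the truncated series is off by roughly 0.0045, which is the kind of behavior the comparison with FSIN is meant to reveal.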

References

1. Intel Corporation, Intel 64 and IA-32 Architectures Software Developer’s Manual – Volume 1: Basic Architecture, Document number 253665-033US, December 2009. Available at http://www.intel.com.


INDEX

A
A/D (see Analog-to-digital conversion)
Access time:
  magnetic disk, 315
  main memory, 5, 269
  effect of cache, 301
Actuators, 410
Adder:
  BCD, 378
  carry-lookahead, 340
  circuit, 337
  full adder, 336
  half adder, 377
  least-significant bit, 336
  most-significant bit, 339
  propagation delay, 339
  ripple-carry, 336
Addition, 11, 13, 336
  carry, 11, 336
  carry-save, 353
  end-around carry, 382
  floating-point, 367
  generate function, 340
  modular, 12
  overflow, 14, 15, 337
  propagate function, 340
  sum, 11, 336
Address, 5, 29
  aligned/unaligned, 31
  big-endian, 30
  little-endian, 30
Address pointer, 42
Address space, 29, 268
Address translation, 306
Addressing mode, 35, 40, 75
  absolute, 41
  autodecrement, 76
  autoincrement, 75
  immediate, 42
  index, 45, 157
  indirect, 42
  register, 41
  relative, 76
  Nios II, 532
  ColdFire, 575
  ARM, 614
  IA-32, 665
Alarm clock, 428
Alphanumeric characters, 17
  ASCII, 18
Amdahl’s law, 461
Analog-to-digital (A/D) conversion, 248, 388
Arbiter, 237
Arbitration, 237
Architecture, 2
Arithmetic and logic unit (ALU), 5, 153, 160
ARM processor, 611
ASCII (American Standard Code for Information Interchange), 17
Assembler, 49, 130
Assembly language, 49
  directives (commands), 50
  generic, 28
  mnemonics, 34, 49
  notation, 33
  syntax, 49
  Nios II, 532, 549
  ColdFire, 573, 593
  ARM, 614, 635
  IA-32, 670, 685
Associative search, 294
Asynchronous DRAM, 276
Asynchronous bus (see Bus)
Asynchronous transmission, 245

B
Bandwidth:
  interconnection network, 450
  memory, 279
Barrier synchronization, 458
Base register, 48
BCD (see Binary-coded decimal)
Big-endian, 30
Binary-coded decimal (BCD), 17, 54
  addition, 378
Binary variable, 466
Bit, 4, 28
Booth algorithm, 348
  bit-pair recoding, 352
  skipping over 1s, 350
Boot-strapping, 144
Bounce (see Contact bounce)
Branch:
  delay slot, 204
  delayed, 205
  instruction, 38, 77, 168
  offset, 54
  penalty, 202
  prediction, 205
  target, 38
  target buffer, 208
Branching, 37, 77
Breakpoint, 136
Bridge, 252
Broadcast, 453
Bus, 179, 450
  arbitration, 237
  asynchronous, 233
  driver, 179, 236
  master, 230, 237
  propagation delay, 231
  protocol, 229
  skew, 234
  slave, 230
  synchronous, 230
  timing, 231
Bus standards:
  ISA, 258
  FireWire, 251
  PATA, 258
  PCI, 252
  PCI Express, 258
  SCSI, 256
  SAS, 258
  SATA, 258
  USB, 247
Bus structure, 228
Byte, 17, 29
Byte-addressable memory, 30

C
Cache memory, 5, 269, 289
  associative mapping, 293, 299
  block, 290
  coherence, 296, 453
    directory-based, 456
    snooping, 454


  direct mapping, 298
  dirty bit, 290
  hit, 290
  hit rate, 301
  levels, 289
  line, 290
  load-through, 291
  lockup-free, 305
  mapping function, 290, 291
  miss, 291
  miss penalty, 301
  miss rate, 301
  prefetching, 304
  replacement algorithm, 290, 296
  set-associative mapping, 294, 299
  stale data, 294
  tag, 293
  valid bit, 294
  write-back, 290, 294, 454
  write buffer, 303
  write-through, 290, 453
Carry, 11, 336
  detection, 551
Cartridge tape, 323
CD-ROM, 318
Character codes, 17
  ASCII, 18
Charge-coupled device (CCD), 387
Chip, 17
CISC (Complex Instruction Set Computer), 34, 74, 178, 572, 661
Circular buffer (queue), 92
Clock, 230, 493
  rate, 209
  period, 154
  duty cycle, 260
  self-clocking, 313
Clock recovery, 246, 313
Cloud computing, 3
Coherence (see Cache memory)
ColdFire processor, 571
Combinational circuits, 154, 494
Comparator, 170, 186
Compiler, 133
  optimizing, 134
  vectorizing, 446
Complement, 468
Complementary metal-oxide semiconductor (CMOS), 484
Complex Instruction Set Computer (see CISC)
Complex programmable logic device (CPLD), 513
Compute Unified Device Architecture (CUDA), 448
Computer types, 2
Computer-aided design (CAD), 424
Condition codes, 77
  flags, 77
  in pipelined processors, 218
  side effect, 218
  ColdFire, 572, 599
  ARM, 629
  IA-32, 663
Condition code register, 77
Conditional branch, 38, 77, 169
Configuration device, 423
Contact bounce, 239
Context switch, 148
Control signals, 172
Control store, 183
Control unit, 6
Control word, 183
Counter:
  ripple, 504
  synchronous, 504, 516
Crossbar, 452

D
D/A (see Digital-to-analog conversion)
Data, 4
Data dependency, 197, 457
Data striping, 317
Data types:
  bit, 4
  byte, 17
  character, 17
  floating-point, 16, 363
  integer, 10
  fraction, 365, 372
  string, 81
  word, 5, 29
Datapath, 161
Deadlock, 217
Debouncing, 239
Debugger, 134
Debugging, 54, 117
Decoder, 505
  instruction decoder, 176
De Morgan’s rule, 475
Desktop computer, 2
Device driver, 148
Device interface, 97
Differential signaling, 251, 256, 258, 279
Digital camera, 387
Digital-to-analog conversion, 248
Direct memory access (see DMA)
Directory-based cache coherence, 456
Dirty bit (see Cache memory)
Disk (see Magnetic disk)
Disk arrays, 317
Dispatch unit, 212
Display, 6, 97
Division, 360
  floating-point, 368
  non-restoring, 361
  restoring, 360
DMA (Direct Memory Access), 285, 306
  controller, 286
Don’t-care condition, 477
DVD, 5, 321
Dynamic memory (DRAM), 274

E
Echoback, 101
Edge-triggered flip-flops, 498
Effective address, 42
Effective throughput, 450
Embedded computer, 2
Embedded system, 2, 386, 422, 530, 651
Enterprise system, 2
Error correcting code (ECC), 316
Ethernet, 252
Exception (see also Interrupt), 116, 556, 639, 687
  floating-point, 367
  handler, 558
  imprecise, 216
  precise, 216
Execution phase, 37, 153
Execution steps, 155, 185
Execution step counter, 175
Execution step timing, 171, 178
External name, 132

F
Fan-in, 490
Fan-out, 490
Fetch phase, 37, 153
Field-programmable gate array (FPGA), 21, 423, 514
FIFO (first-in, first-out) queue, 92
File, 130, 322
Finite state machine, 520
FireWire, 251
Fixed point, 16
Flash memory, 284
Flash cards, 284
Flash drives, 284
Flip-flops, 496
  D, 495
  edge-triggered, 498
  gated latch, 493
  JK, 499
  latch, 492


  master-slave, 495
  SR latch, 494
  T, 498
Floating point, 16, 363
  addition-subtraction unit, 369
  arithmetic operations, 367
  double precision, 365
  exception, 367
    inexact, 367
    invalid, 367
  exponent, 16, 364
    excess-x representation, 365
  format, 364
  guard bits, 368
    sticky bit, 369
  IEEE standard, 16, 364
  mantissa, 365
  normalization, 365
  overflow, 366
  representation, 364
  scale factor, 16, 364
    base, 16, 364
    exponent, 16, 364
  significant digits, 364
  single precision, 364
  special values, 366
    denormal, 366
    gradual underflow, 366
    infinity, 366
    Not a Number (NaN), 367
  truncation, 368
    biased/unbiased, 368
    chopping, 368
    rounding, 368
    von Neumann, 368
  underflow, 366
  Nios II, 562
  ColdFire, 599
  ARM, 650
  IA-32, 690, 696
Floppy disk, 316

G
Gated latch, 493
General-purpose register, 8
GPU (Graphics Processing Unit), 448
Gray code, 524
Grid computer, 2

H
Handshake, 233
  full, 235
  interlocked, 235
Hardware, 2
Hardware interrupt, 103
Hardwired control, 175
Hazard, 197
  branch delay, 202
  data dependency, 197
  memory delay, 201
  resource limitation, 209
Hexadecimal (see Number representation)
High-level language, 17, 20, 133, 137, 399, 432, 456
History of computers, 19
Hit (see Cache memory)
Hold time, 231, 498
Hot-pluggable, 249

I
IA-32 processor, 661
IEEE standards (see Standards)
Immediate operand, 164
Index register, 45
Ink jet printer, 6
Input/output (I/O):
  address space, 97
  device interface, 97, 229, 238
  in operating system, 146
  interrupt-driven, 103, 556, 598, 648, 689
  memory-mapped, 97, 229
  port, 239
  privilege level, 688
  program-controlled, 97, 556, 597, 646, 689
  status flag, 98
  register, 97
  unit, 4, 6
  Nios II, 555
  ColdFire, 597
  ARM, 646
  IA-32, 689
Input unit, 4
Instruction, 4, 7, 32
  commitment, 216
  completion queue, 216
  execution phase, 37, 153
  dispatch, 217
  fetch phase, 37, 153
  queue, 212
  reordering, 204
  retired, 217
  side effects, 218
Instruction encoding, 82, 168
Instruction execution, 155, 165
Instruction fetch, 164
Instruction format:
  generic, 84
  one-address, 670
  three-address, 35
  two-address, 74
  Nios II, 533
  ColdFire, 573
  ARM, 615
  IA-32, 670
Instruction register (IR), 7, 37, 152
Instruction set architecture (ISA), 28
Instructions:
  arithmetic, 7, 536, 578, 622, 672
  branch, 38, 538, 582, 628, 674
  control, 111, 548, 596
  data transfer, 7, 534, 577, 621, 671
  input/output, 535
  logic, 67, 537, 585, 626, 677
  move, 43, 537, 577, 625, 671
  multimedia, 695
  shift and rotate, 68, 546, 586, 625, 678
  subroutine, 56, 541, 587, 631, 679
  vector (SIMD), 445, 696
Integrated circuit (IC), 21
Intellectual property (IP), 422
Interconnection network, 3, 450
  bandwidth, 450
  bus, 179, 228, 450
    split-transaction, 450
  computer system, 252, 258
  crossbar, 452
  effective throughput, 450
  mesh, 452
  packet, 450
  ring, 451
  torus, 453
  tree, 249
Interface:
  input, 239
  output, 242
  parallel, 239, 392
  serial, 243, 395
Internet, 3, 4
Interrupt (see also Exception), 9, 103
  acknowledge, 105
  disabling, 106
  enabling, 106, 109
  execution steps, 185
  handler, 116
  hardware, 103
  in operating systems, 146
  latency, 105
  nesting, 108
  nonmaskable, 596, 687
  priority, 109
  service routine, 9, 104
  software, 146
  vectored, 108, 139
  Nios II, 556


  ColdFire, 596
  ARM, 639
  IA-32, 687
Invalidation protocol, 453
I/O (see Input/output)
IP core, 530
IR (Instruction register), 7
Isochronous, 248, 249, 251

J
Joystick, 4
JTAG port, 514

K
Karnaugh map, 475
Keyboard, 4, 97
  interface, 99, 239

L
Laser printer, 6
Latch, 492
Library, 133
LIFO (last-in, first-out) queue, 55
Link register, 57
Linker, 132
Little-endian, 30
Load through (see Cache memory)
Load-store multiple operands, 62
  ColdFire, 588
  ARM, 621
  IA-32, 679
Loader, 54, 131
Locality of reference, 289, 296
Logic circuits, 466
Logic functions, 466
  AND, 468
  Exclusive-OR (XOR), 468
  minimization, 472
  NAND, 479
  NOR, 479
  NOT, 468
  OR, 466
  synthesis, 470, 508
Logic gates, 469
  fan-in, 490
  fan-out, 490
  noise margin, 489
  propagation delay, 489
  threshold, 482
  transfer characteristic, 488
  transition time, 490
Logical memory address, 305
LRU (least-recently used) replacement, 296

M
Machine instruction, 4
Machine language, 130
Magnetic disk, 5, 311
  access time, 315
  communication with, 257
  controller, 313, 315
  cylinder, 314
  data buffer/cache, 315
  data encoding, 313
  drive, 313
  floppy disk, 316
  logical partition, 314
  rotational delay, 315
  sector, 314
  seek time, 315
  track, 314
  Winchester, 313
Magnetic tape, 322
  cartridge, 323
Manchester encoding, 313
Master-ready, 234
Master-slave (see Flip-flop)
Mechanical computing devices, 20
Memory, 4
  access time, 5, 269
  address, 5
  address space, 29
  asynchronous DRAM, 276
  bandwidth, 279
  bit line, 270
  byte-addressable, 30
  cache (see Cache memory)
  cell, 271, 273, 274, 283
  controller, 281
  cycle time, 269
  DDR SDRAM, 279
  delay, 171
  DIMM (see Memory module)
  dual-ported, 158
  dynamic (DRAM), 274
  fast page mode, 276
  flash, 5
  hierarchy, 288
  latency, 278
  main, 4
  module, 281
  multiple module, 449
  primary, 4
  Rambus, 279
  random-access memory (RAM), 5, 269, 270
  read cycle, 273, 274
  read-only memory (ROM), 282
  refreshing, 274, 282
  secondary, 5
  SIMM (see Memory module)
  static (SRAM), 271
  synchronous DRAM (SDRAM), 276
  unit, 4
  virtual (see Virtual memory)
  word, 5
  word length, 5
  word line, 270
  write cycle, 273, 274
Memory Function Completed signal, 171
Memory management unit (MMU), 306
Memory pages, 306
Memory-mapped I/O, 97, 229
Memory segmentation, 662
Mesh network, 452
Message passing, 19, 456
Microcontroller, 386, 390
  ARM, 391, 414
  Freescale, 391, 413
  Intel, 391, 413
Microinstruction, 183
Microprocessor, 21
Microprogram, 183
Microprogram counter, 184
Microprogram memory, 183
Microprogrammed control, 183
Microroutine, 183
Microwave oven, 386
Miss (see Cache memory)
MMU (see Memory management unit)
Mnemonic, 34, 49
Mouse, 4
Multicomputer, 19, 456
Multicore processor, 19, 457
Multiple issue, 212
Multiple-precision arithmetic, 77, 336, 552, 572, 623, 681
Multiplexer, 507
Multiplication, 344
  array implementation, 344
  Booth algorithm, 348
  carry-save addition, 353
    CSA tree, 355
    3-2 reducer, 355
    4-2 reducer, 357
    7-3 reducer, 380
  floating-point, 367
  sequential implementation, 346
  signed-operand, 346
Multiprocessor, 19, 448
  cache coherence, 453
  interconnection network, 449


  local memory, 449
  non-uniform memory access (NUMA), 449
  program parallelism, 456
  shared memory, 19, 448
  shared variables, 448, 457
  speedup, 461
  uniform memory access (UMA), 449
Multiprogramming, 145
Multistage hardware, 155, 162
Multitasking, 145
Multithreading, 444

N
Nios II processor, 529
Noise, 251, 482, 489
Noise margin, 489
Notebook computer, 2
Number conversion, 23, 372
Number representation, 9
  binary positional notation, 9
  fixed-point, 16
  floating-point, 364
  hexadecimal, 54
  1’s-complement, 10
  sign-and-magnitude, 10
  signed integer, 10
  ternary, 378
  2’s-complement, 10
  unsigned integer, 10

O
Object program, 49, 130
On-chip memory, 425
1’s-complement representation, 10
OP code, 49
Operands, 4
Operand forwarding, 198
Operating system, 143
  interrupts, 146
  multitasking, 146
  process, 148, 444
  scheduling, 148
Optical disk, 5, 317
  CD-Recordable (CD-R), 320
  CD-Rewritable (CD-RW), 321
  CD-ROM, 318
  DVD, 5, 321
Out-of-order execution, 215
Output unit, 6
Overflow, 14, 15
  detection, 551

P
Page (see Virtual memory)
Page fault, 309
Parallel I/O interface, 425
Parallel processing, 444
Parallel programming, 456
Parallelism, 17
  instruction-level, 17
  processor-level, 17
Parameter passing, 59
  by value, 62
  by reference, 62
PC (see Program counter)
PCI (see Bus standards)
PCI Express (see Bus standards)
Peer-to-peer, 251
Performance, 17
  basic equation, 209
  memory, 300
  modeling, 460
  pipeline, 209
Personal computer, 2
Phase encoding, 313
Physical memory address, 305
Pin assignment, 431
Pipelining, 19, 194
  bubbles, 198
  hazards, 197
  performance, 209
  stalling, 197
  ColdFire, 219
  IA-32, 219
Pixel, 388
Plug-and-play, 249, 253
Pointer register, 42
Polling, 98
Pop operation (see Stack)
Portable computer, 2
POSIX threads (Pthreads) library, 460
Prefetching, 304
Primary storage, 4
Printer, 6
Priority (see also Arbitration), 237
Privileged instruction, 311
Process (see Operating system)
Processor, 4, 152
  core, 422
  multicore, 19
Processor stack, 55
Processor status register, 106
  Nios II, 554
  ColdFire, 572
  ARM, 613
  IA-32, 663
Processor register, 8
Program, 4
Program-controlled I/O (see Input/output)
Program counter (PC), 7, 37, 152, 178
Program state, 148
Programmable array logic (PAL), 511
Programmable logic array (PLA), 509
Propagation delay:
  logic circuit, 489
  bus, 231
Protection, 311
Pseudoinstruction, 44, 537, 548, 637
Push operation (see Stack)

Q
Quartus II software, 425
Queue (see also Instruction), 55, 92

R
RAID disk systems, 317
Rambus memory, 279
Random-access memory (RAM), 5, 269
Reaction timer, 401
Reactive system, 386, 401
Read operation, 8
Read-only memory (ROM), 282
  electrically erasable (EEPROM), 284
  erasable (EPROM), 284
  flash, 284
  programmable (PROM), 283
Real-time processing, 106
Reduced Instruction Set Computer (see RISC)
Refreshing memories, 274, 282
Register, 6, 502
  access time, 6
  base, 48
  control, 110, 553
  general-purpose, 8
  index, 45
  port, 158
  renaming, 216
Register file, 158
Register transfer notation (RTN), 33
Reorder buffer, 216
Replacement algorithm, 296
Reservation station, 215
Ring network, 451
RISC (Reduced Instruction Set Computer), 34, 154, 530, 612
ROM (see Read-only memory)
Rounding (see Floating point)


S
SAS (see Bus standards)
SATA (see Bus standards)
Scaler, 503
Scheduling (see Arbitration, Operating system)
SCSI bus (see Bus standards)
Secondary storage, 5
Sensors, 407
Sequential circuits, 494, 516
  finite state machine, 520
  state assignment, 518
  state diagram, 516
  state table, 517
  synchronous, 520
Serial transmission, 243
Server, 2
Setup time, 231, 498
Seven-segment display, 124, 507
Shadow registers, 105
Shared memory, 448
Shift register, 503
Side effects (see Instruction)
Sign bit, 10
Sign extension, 15
Single-instruction multiple-data (SIMD) processing, 445
Skew (see Bus)
Slave-ready, 233
Snoopy cache, 455
Soft processor core, 530
Software interrupt, 146
SOPC Builder, 425
Source program, 49
Speculative execution, 214
Speedup, 22, 461
Split-transaction protocol, 450
SR latch, 494
Stack, 55
  frame, 63
  frame pointer (FP), 63
  in subroutines, 60
  pointer (SP), 55
  pushdown, 55
  push and pop operations, 55
Standards (see also Bus standards):
  IEEE floating-point, 16, 364
  IEEE-1149.1, 416
Start-stop format, 245
State diagram, 516
State table, 517
Static memory (SRAM), 271
Status flag (see Input/output)
Status register, 106
Stored program, 4
Subroutine, 56, 170, 186
  linkage, 57
  nesting, 58, 64
  parameter passing, 59
  Nios II, 541
  ColdFire, 587
  ARM, 631
  IA-32, 679
Subtraction, 13
  floating-point, 367
Sum-of-products form, 470
Supercomputer, 2
Superscalar processor, 212
Supervisor mode, 311, 555, 596, 640
Symbol table, 131
Synchronization, 458
Synchronous DRAM (SDRAM), 276
Synchronous sequential circuit, 520
Synchronous transmission, 246
Syntax, 49
System-on-a-chip, 421
System space, 310

T
Technology, 17
Telemetry, 390
Testability, 416
Text editor, 130
Thread, 444
  creation, 458
  synchronization, 458
Three-state (see Tri-state)
Thrashing, 327
Threshold, 482
Time slicing, 146
Timer, 120, 397, 427
Timing signals, 6, 171
Torus network, 453
Touchpad, 4
Trace mode, 136
Trackball, 4
Transducers, 407
Transistor, 17, 20
Transition time, 490
Translation lookaside buffer (TLB), 308
Transmission:
  asynchronous, 245
  differential, 251, 256, 258, 279
  single-ended, 250, 256
  start-stop, 245
  synchronous, 246
Tri-state gate, 179, 236, 240, 281, 491
Truth table, 466
2’s-complement representation, 10

U
UART (Universal Asynchronous Receiver Transmitter), 395
Universal Serial Bus (USB), 247
Update protocol, 453
User mode, 311, 555, 596, 639
User space, 310

V
Vector processing, 445, 695, 696
Vectorization, 446
Very large-scale integration (VLSI), 17
Virtual address, 305
Virtual memory, 269, 305
  address translation, 306
  page, 306
  page fault, 309
  page frame, 308
  page table, 308
  translation lookaside buffer (TLB), 308
  virtual address, 305
Volatile:
  variable, 138
  memory, 274

W
Wait loop, 100
Waiting for memory, 171
Winchester disks, 313
Word, 5, 29
Word alignment, 31
Word length, 5, 29
Workstation, 2
Write-back protocol (see Cache memory)
Write buffer, 303
Write operation, 8, 32
Write-through protocol (see Cache memory)


• This edition of the book has been extensively revised and reorganized. Fundamental concepts are presented in a general manner in the main body of the book. The discussion is strongly influenced by RISC-style processors such as Nios II and MIPS. To provide a balanced view of design alternatives, a discussion of CISC-style processors is also included.
• To illustrate how basic concepts have been implemented in commercial computers, separate appendices provide a detailed discussion of the instruction sets of four popular processors: Altera’s Nios II, ARM, Freescale Semiconductor’s ColdFire, and Intel’s IA-32. Each appendix shows how example programs and input/output tasks in the main body of the book can be implemented for the corresponding processor.
• The expanding area of embedded systems is addressed, not only by presenting the fundamental concepts applicable to all computer systems, but also by devoting two chapters to issues pertinent specifically to embedded systems. One of the chapters deals with system-on-a-chip implementation using Nios II as the example.
• The book is particularly suitable for a course that has an associated laboratory. For example, Nios II based platforms provided by Altera’s University Program in the form of FPGA boards, tutorials, and laboratory exercises can be used very effectively. Any other platform that features one of the processors in the book can also be used.
• The book can also be used in a course that is solely lecture-based. In this case, it is helpful to have access to an instruction-set simulator.