ALU Architecture and ISA Extensions

Lecture notes from MKP, H. H. Lee and S. Yalamanchili

Reading
• Sections 3.2-3.5 (only those elements covered in class)
• Sections 3.6-3.8
• Appendix B.5
• Practice Problems: 26, 27
• Goal: Understand the ISA view of the core microarchitecture
  - Organization of functional units and register files into basic data paths

Overview
• Instruction Set Architectures have a purpose
  - Applications dictate what we need
• We only have a fixed number of bits
  - Impact on accuracy
• More is not better
  - We cannot afford everything we want
• Basic Arithmetic Logic Unit (ALU) Design
  - Addition/subtraction, multiplication, division

Reminder: ISA
[Figure: byte addressed memory, up to address 0xFFFFFFFF, with a memory map of Reserved, Text Segment, Data segment (static), and Dynamic Data; the processor, containing the Arithmetic Logic Unit (ALU), processor internal buses, memory interface, register file with entries 0x00, 0x01, 0x02, 0x03 through 0x1F (programmer visible state), and program counter; programmer invisible state such as the instruction register and kernel registers. Who sees what?]

Arithmetic for Computers
• Operations on integers
  - Addition and subtraction
  - Multiplication and division
  - Dealing with overflow
• Operations on floating-point real numbers
  - Representation and operations
• Let us first look at integers

Integer Addition (3.2)
• Example: 7 + 6
• Overflow if result out of range
  - Adding +ve and –ve operands, no overflow
  - Adding two +ve operands
    o Overflow if result sign is 1
  - Adding two –ve operands
    o Overflow if result sign is 0

Integer Subtraction
• Add negation of second operand
• Example: 7 – 6 = 7 + (–6), in 2's complement representation

  +7: 0000 0000 … 0000 0111
  –6: 1111 1111 … 1111 1010
  +1: 0000 0000 … 0000 0001

• Overflow if result out of range
  - Subtracting two +ve or two –ve operands, no overflow
  - Subtracting +ve from –ve operand
    o Overflow if result sign is 0
  - Subtracting –ve from +ve operand
    o Overflow if result sign is 1
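The overflow rules for addition and subtraction can be sketched in a few lines of Python. This is only an illustrative model, not hardware: the names `add32`, `sub32`, and `to_signed` are hypothetical, and operands are assumed to be 32-bit patterns in the range 0 to 0xFFFFFFFF.

```python
MASK32 = 0xFFFFFFFF

def to_signed(word):
    """Interpret a 32-bit pattern as a 2's complement integer."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word

def add32(a, b):
    """Return (result, overflow) for the 32-bit 2's complement sum a + b."""
    result = (a + b) & MASK32
    sign_a, sign_b, sign_r = (a >> 31) & 1, (b >> 31) & 1, (result >> 31) & 1
    # Overflow only if both operands share a sign and the result's sign differs.
    overflow = sign_a == sign_b and sign_r != sign_a
    return result, overflow

def sub32(a, b):
    """Return (result, overflow) for the 32-bit 2's complement difference a - b."""
    result = (a - b) & MASK32
    sign_a, sign_b, sign_r = (a >> 31) & 1, (b >> 31) & 1, (result >> 31) & 1
    # Overflow only if the operand signs differ and the result's sign differs from a's.
    overflow = sign_a != sign_b and sign_r != sign_a
    return result, overflow

print(add32(7, 6))                    # 7 + 6, no overflow
print(add32(0x7FFFFFFF, 1))           # largest +ve plus 1 overflows
print(to_signed(sub32(6, 7)[0]))      # 6 - 7 as a signed value
```

Mixed-sign addition and same-sign subtraction can never set the overflow flag in this model, which matches the "no overflow" cases listed above.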

ISA Impact
• Some languages (e.g., C) ignore overflow
  - Use MIPS addu, addiu, subu instructions
• Other languages (e.g., Ada, Fortran) require raising an exception
  - Use MIPS add, addi, sub instructions
  - On overflow, invoke exception handler
    o Save PC in exception program counter (EPC) register
    o Jump to predefined handler address
    o mfc0 (move from coprocessor register) instruction can retrieve EPC value, to return after corrective action (more later)

Integer ALU (arithmetic logic unit) (B.5)
• ALU design leads to many solutions. We look at one simple example
• Build a 1-bit ALU, and use 32 of them (bit-slice)
[Figure: ALU block with an operation select and a result output; table columns op, a, b, res]

Single Bit ALU
[Figure: 1-bit ALU with an Operation select and a Result output]
• Implements only AND and OR operations

Adding Functionality
• We can add additional operators (to a point)
• How about addition?
• Review full adders from digital design

  \( c_{out} = ab + a\,c_{in} + b\,c_{in} \)
  \( sum = a \oplus b \oplus c_{in} \)

[Figure: 1-bit full adder with CarryIn input and Sum and CarryOut outputs]
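The full adder equations can be checked with a small Python model. This is a sketch only: `full_adder` and `ripple_add` are hypothetical names, and the loop stands in for 32 bit-slices wired carry-out to carry-in.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin                          # sum = a xor b xor cin
    cout = (a & b) | (a & cin) | (b & cin)   # cout = ab + a.cin + b.cin
    return s, cout

def ripple_add(a, b, carry_in=0, width=32):
    """Chain `width` full adders; returns (result, carry out of the top bit)."""
    result, carry = 0, carry_in
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry

print(ripple_add(7, 6))
print(ripple_add(0xFFFFFFFF, 1))
```

The carry has to ripple through every slice in turn, which is why this simple organization is slow for wide operands.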

Building a 32-bit ALU
[Figure: left, a 1-bit ALU slice with Operation and CarryIn inputs and Result and CarryOut outputs; right, the slices chained so that each CarryOut feeds the next CarryIn, producing Result0, Result1, Result2 through Result31]

Subtraction (a – b)?
• Two's complement approach: just negate b and add 1.
• How do we negate?
• A clever solution: add a Binvert control that selects the complement of b, and set the CarryIn of ALU0 to 1, so the adder computes a + (not b) + 1 = a – b
[Figure: 1-bit ALU slice with an added Binvert input; the 32-bit ripple chain ALU0, ALU1, ALU2 through ALU31]
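A minimal Python sketch of the Binvert idea follows. The function names are hypothetical and the 32-bit adder is modeled with ordinary integer addition rather than bit-slices; the point is only that inverting b and forcing the first CarryIn to 1 yields a – b.

```python
MASK32 = 0xFFFFFFFF

def adder_with_binvert(a, b, binvert, carry_in):
    """32-bit adder whose b input can be complemented by the Binvert control."""
    b_in = (~b & MASK32) if binvert else b   # Binvert selects b or its complement
    total = a + b_in + carry_in
    return total & MASK32, total >> 32       # (result, carry out of bit 31)

def subtract(a, b):
    """a - b = a + (not b) + 1: assert Binvert and feed 1 into ALU0's CarryIn."""
    return adder_with_binvert(a, b, binvert=1, carry_in=1)[0]

print(subtract(7, 6))
print(hex(subtract(6, 7)))   # -1 as a 32-bit pattern
```

No separate subtractor is needed; the same adder serves both operations.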

Tailoring the ALU to the MIPS
• Need to support the set-on-less-than instruction (slt)
  - Remember: slt is an arithmetic instruction
  - Produces a 1 if rs < rt and 0 otherwise
  - Use subtraction: (a-b) < 0 implies a < b
• Need to support test for equality (beq $t5, $t6, Label)
  - Use subtraction: (a-b) = 0 implies a = b
[Figure: 1-bit ALU slice with an added Less input; the 32-bit ALU in which ALU31 produces Set and Overflow, and each slice ALU0 to ALU31 has a Less input. What is Result31 when (a-b) < 0?]

Unsigned vs. signed support

Test for equality
• Notice control lines:
  000 = and
  001 = or
  010 = add
  110 = subtract
  111 = slt
• Note: zero is a 1 when the result is zero!
[Figure: complete 32-bit ALU, slices ALU0 to ALU31 with Bnegate and Operation control lines, Less inputs, Set and Overflow from ALU31. Note test for overflow!]
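The control-line table can be exercised with a behavioral Python model. This is a sketch under assumptions: the function name `alu` is hypothetical, the 3-bit control is read as Bnegate followed by the 2-bit Operation, Bnegate also supplies the CarryIn of ALU0, and slt takes the raw sign bit of a – b as in the slide's design, so it matches a signed comparison only when the subtraction does not overflow.

```python
MASK32 = 0xFFFFFFFF

def alu(a, b, control):
    """Return (result, zero, overflow) for a 3-bit control code."""
    bnegate = (control >> 2) & 1
    operation = control & 0b11
    b_in = (~b & MASK32) if bnegate else b
    total = (a + b_in + bnegate) & MASK32      # adder output
    sign_a, sign_b, sign_t = a >> 31, b_in >> 31, total >> 31
    overflow = 0
    if operation == 0b00:                      # and
        result = a & b_in
    elif operation == 0b01:                    # or
        result = a | b_in
    elif operation == 0b10:                    # add (010) or subtract (110)
        result = total
        overflow = int(sign_a == sign_b and sign_t != sign_a)
    else:                                      # slt (111): Set feeds Less of ALU0
        result = sign_t
    zero = int(result == 0)                    # zero is 1 when the result is zero
    return result, zero, overflow

print(alu(5, 5, 0b110))   # equality test: subtract and inspect zero
print(alu(3, 9, 0b111))   # slt: 3 < 9
```

A beq can then be decided by issuing a subtract and branching when the zero output is 1.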

ISA View
• Register-to-Register data path
• We want this to be as fast as possible
[Figure: CPU/Core]

Multiplication (3.3)
• Long multiplication

     1000    (multiplicand)
  ×  1001    (multiplier)
  -------
     1000
    0000
   0000
  1000
  -------
  1001000    (product)

• Length of product is the sum of operand lengths
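Long multiplication maps directly onto a shift-and-add loop. The sketch below is illustrative only (the name `shift_add_multiply` is hypothetical) and treats both operands as unsigned.

```python
def shift_add_multiply(multiplicand, multiplier, width=32):
    """Unsigned multiply: add a shifted multiplicand for each 1 bit of the multiplier."""
    product = 0
    for i in range(width):
        if (multiplier >> i) & 1:
            product += multiplicand << i   # partial product aligned to bit i
    return product

p = shift_add_multiply(0b1000, 0b1001, width=4)
print(bin(p), p.bit_length())
```

With 4-bit operands the product needs at most 8 bits, and with 32-bit operands at most 64, which is why MIPS keeps the result in a register pair.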

A Multiplier
• Uses multiple adders
  - Cost/performance tradeoff
• Can be pipelined
  - Several multiplications performed in parallel

MIPS Multiplication
• Two 32-bit registers for product
  - HI: most-significant 32 bits
  - LO: least-significant 32 bits
• Instructions
  - mult rs, rt / multu rs, rt
    o 64-bit product in HI/LO
  - mfhi rd / mflo rd
    o Move from HI/LO to rd
    o Can test HI value to see if product overflows 32 bits
  - mul rd, rs, rt
    o Least-significant 32 bits of product –> rd

Study Exercise: Check out signed and unsigned multiplication with QtSPIM
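Before trying this in QtSPIM, the HI/LO split can be previewed with a small Python model. The functions below are hypothetical stand-ins named after the instructions; they return (HI, LO) as 32-bit patterns, and `fits_in_32_signed` is an assumed helper showing one way to test HI for overflow of a signed product.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(word):
    """Interpret a 32-bit pattern as a 2's complement integer."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word

def multu(rs, rt):
    """Unsigned multiply: returns (HI, LO)."""
    product = (rs & MASK32) * (rt & MASK32)
    return product >> 32, product & MASK32

def mult(rs, rt):
    """Signed multiply: returns (HI, LO) of the 64-bit 2's complement product."""
    product = (to_signed32(rs) * to_signed32(rt)) & ((1 << 64) - 1)
    return product >> 32, product & MASK32

def fits_in_32_signed(hi, lo):
    """A signed product fits in 32 bits if HI is just the sign extension of LO."""
    return hi == (MASK32 if lo >> 31 else 0)

print([hex(v) for v in multu(0xFFFFFFFF, 2)])   # 4294967295 * 2
print([hex(v) for v in mult(0xFFFFFFFF, 2)])    # -1 * 2
```

The same bit patterns give different HI values under mult and multu, which is the difference the study exercise asks you to observe.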

Division (3.4)
• Check for 0 divisor
• Long division approach
  - If divisor ≤ dividend bits
    o 1 bit in quotient, subtract
  - Otherwise
    o 0 bit in quotient, bring down next dividend bit
• Restoring division
  - Do the subtract, and if remainder goes < 0, add divisor back
• Signed division
  - Divide using absolute values
  - Adjust sign of quotient and remainder as required

            1001     (quotient)
  1000 ) 1001010     (divisor, dividend)
        -1000
            10
            101
            1010
           -1000
              10     (remainder)

• n-bit operands yield n-bit quotient and remainder
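Restoring division can be sketched as a loop over the dividend bits. This is an illustrative unsigned model with a hypothetical function name, not a description of any particular divider circuit.

```python
def restoring_divide(dividend, divisor, width=32):
    """Unsigned restoring division: returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("check for 0 divisor first")
    quotient, remainder = 0, 0
    for i in reversed(range(width)):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # bring down next bit
        remainder -= divisor                                   # trial subtract
        if remainder < 0:
            remainder += divisor          # went negative: restore, quotient bit 0
            quotient <<= 1
        else:
            quotient = (quotient << 1) | 1                     # quotient bit 1
    return quotient, remainder

q, r = restoring_divide(0b1001010, 0b1000, width=7)
print(bin(q), bin(r))
```

Each iteration depends on the sign of the previous remainder, so the steps cannot simply be done in parallel the way partial products can.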

Faster Division
• Can't use parallel hardware as in multiplier
  - Subtraction is conditional on sign of remainder
• Faster dividers (e.g. SRT division) generate multiple quotient bits per step
  - Still require multiple steps
• Customized implementations for high performance, e.g., supercomputers

MIPS Division
• Use HI/LO registers for result
  - HI: 32-bit remainder
  - LO: 32-bit quotient
• Instructions
  - div rs, rt / divu rs, rt
  - No overflow or divide-by-0 checking
    o Software must perform checks if required
  - Use mfhi, mflo to access result

Study Exercise: Check out signed and unsigned division with QtSPIM
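A Python model of what to expect from div and divu is sketched below. The functions are hypothetical stand-ins that return (HI, LO) = (remainder, quotient) as 32-bit patterns. The signed version divides absolute values and then adjusts signs, with the remainder taking the sign of the dividend. Raising on a zero divisor is an assumption of this model, standing in for the check that software must perform.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(word):
    """Interpret a 32-bit pattern as a 2's complement integer."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word

def divu(rs, rt):
    """Unsigned divide: returns (HI, LO) = (remainder, quotient)."""
    if rt == 0:
        raise ZeroDivisionError("software must check the divisor")
    return rs % rt, rs // rt

def div(rs, rt):
    """Signed divide: returns (HI, LO) = (remainder, quotient)."""
    a, b = to_signed32(rs), to_signed32(rt)
    if b == 0:
        raise ZeroDivisionError("software must check the divisor")
    quotient, remainder = abs(a) // abs(b), abs(a) % abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient      # operand signs differ: negative quotient
    if a < 0:
        remainder = -remainder    # remainder follows the dividend's sign
    return remainder & MASK32, quotient & MASK32

print([hex(v) for v in div(-7 & MASK32, 2)])
print([hex(v) for v in divu(-7 & MASK32, 2)])
```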

ISA View
• Additional function units and registers (Hi/Lo)
• Additional instructions to move data to/from these registers
  - mfhi, mflo
• What other instructions would you add? Cost?
[Figure: CPU/Core with an added Multiply/Divide unit]

Floating Point (3.5)
• Representation for non-integral numbers
  - Including very small and very large numbers
• Like scientific notation
  - \( -2.34 \times 10^{56} \) (normalized)
  - \( +0.002 \times 10^{-4} \) (not normalized)
  - \( +987.02 \times 10^{9} \) (not normalized)
• In binary
  - \( \pm 1.xxxxxxx_2 \times 2^{yyyy} \)
• Types float and double in C

IEEE 754 Floating-point Representation

Single Precision (32-bit)
  bit 31: S (1 bit) | bits 30-23: exponent (8 bits) | bits 22-0: significand (23 bits)
  \( (-1)^{\text{sign}} \times (1 + \text{fraction}) \times 2^{\text{exponent} - 127} \)

Double Precision (64-bit)
  bit 63: S (1 bit) | bits 62-52: exponent (11 bits) | bits 51-32: significand (20 bits)
  bits 31-0: significand (continued, 32 bits)
  \( (-1)^{\text{sign}} \times (1 + \text{fraction}) \times 2^{\text{exponent} - 1023} \)
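The single precision formula can be tried out in Python using the standard `struct` module to obtain the bit pattern. This is a sketch with hypothetical function names; it handles normalized numbers only, so the reserved exponent values 0 and 255 (zero, denormals, infinities, NaN) are deliberately left out.

```python
import struct

def float_to_bits(x):
    """Return the 32-bit IEEE 754 single precision pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def decode_single(bits):
    """Split a single precision pattern into fields and evaluate the formula."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    # (-1)^sign x (1 + fraction) x 2^(exponent - 127), normalized numbers only
    value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent, fraction, value

bits = float_to_bits(-0.75)
print(hex(bits), decode_single(bits))
```

Here -0.75 is -1.1 (binary) times 2 to the power -1, so the stored exponent is 126 and the fraction field holds a single leading 1.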

Floating Point Standard
• Defined by IEEE Std 754-1985
• Developed in response to divergence of representations
  - Portability issues for scientific code
• Now almost universally adopted
• Two representations
  - Single precision (32-bit)
  - Double precision (64-bit)

FP Adder Hardware
• Much more complex than integer adder
• Doing it in one clock cycle would take too long
  - Much longer than integer operations
  - Slower clock would penalize all instructions
• FP adder usually takes several cycles
  - Can be pipelined

Example: FP Addition

FP Adder Hardware
[Figure: FP adder datapath, annotated with Step 1, Step 2, Step 3, and Step 4]

FP Arithmetic Hardware
• FP multiplier is of similar complexity to FP adder
  - But uses a multiplier for significands instead of an adder
• FP arithmetic hardware usually does
  - Addition, subtraction, multiplication, division, reciprocal, square-root
  - FP to/from integer conversion
• Operations usually take several cycles
  - Can be pipelined

ISA Impact
• FP hardware is coprocessor 1
  - Adjunct processor that extends the ISA
• Separate FP registers
  - 32 single-precision: $f0, $f1, … $f31
  - Paired for double-precision: $f0/$f1, $f2/$f3, …
    o Release 2 of MIPS ISA supports 32 × 64-bit FP registers
• FP instructions operate only on FP registers
  - Programs generally do not perform integer ops on FP data, or vice versa
  - More registers with minimal code-size impact

ISA View: The Co-Processor
• Floating point operations access a separate set of 32-bit registers
  - Pairs of 32-bit registers are used for double precision
[Figure: CPU/Core with Multiply/Divide unit; Co-Processor 1 with FP ALU; Co-Processor 0 with BadVaddr, Status, Cause, and EPC registers]

ISA View
• Distinct instructions operate on the floating point registers (pg. A-73)
  - Arithmetic instructions
    o add.d fd, fs, ft (double precision), and add.s fd, fs, ft (single precision)
• Data movement to/from floating point coprocessors
  - mfc1 rt, fs and mtc1 rt, fs
• Note that the ISA design implementation is extensible via co-processors
• FP load and store instructions
  - lwc1, ldc1, swc1, sdc1
    o e.g., ldc1 $f8, 32($sp)

Example: DP Mean

Associativity
• Floating point arithmetic is not associative
• Parallel programs may interleave operations in unexpected orders
  - Assumptions of associativity may fail

                   (x+y)+z      x+(y+z)
  x               -1.50E+38    -1.50E+38
  y                1.50E+38     1.50E+38
  z                1.0          1.0
  inner sum        0.00E+00     1.50E+38
  final result     1.00E+00     0.00E+00

• Need to validate parallel programs under varying degrees of parallelism
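The table can be reproduced in Python. Python floats are double precision, so this sketch rounds every intermediate value to single precision through the standard `struct` module; the helper names `to_single` and `add_single` are hypothetical.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def add_single(a, b):
    """Add two values and round the sum to single precision."""
    return to_single(a + b)

x, y, z = to_single(-1.5e38), to_single(1.5e38), to_single(1.0)

left = add_single(add_single(x, y), z)    # (x + y) + z
right = add_single(x, add_single(y, z))   # x + (y + z)
print(left, right)
```

In the second ordering, z is far smaller than the spacing between representable values near y, so y + z rounds back to y and the 1.0 is lost.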

Performance Issues
• Latency of instructions
  - Integer instructions can take a single cycle
  - Floating point instructions can take multiple cycles
  - Some (FP Divide) can take hundreds of cycles
• What about energy? (We will get to that shortly)
• What other instructions would you like in hardware?
  - Would some applications change your mind?
• How do you decide whether to add new instructions?

Characterizing Parallelism
• Characterization due to M. Flynn*, by number of instruction streams and data streams
  - SISD: today's serial computing cores (von Neumann model)
  - SIMD: single instruction multiple data stream computing, e.g., SSE
  - MISD
  - MIMD: today's multicore

*M. Flynn (September 1972). "Some Computer Organizations and Their Effectiveness". IEEE Transactions on Computers, C-21 (9): 948-960.

Parallelism Categories
[Figure: from http://en.wikipedia.org/wiki/Flynn%27s_taxonomy]

Multimedia (3.6, 3.7, 3.8)
• Lower dynamic range and precision requirements
  - Do not need 32 bits!
• Inherent parallelism in the operations

Vector Computation
• Operate on multiple data elements (vectors) at a time
• Flexible definition/use of registers
• Registers hold integers, floats (SP), doubles (DP)
[Figure: a 128-bit register viewed as 1 × 128-bit integer, 4 × 32-bit single precision, 2 × 64-bit double precision, or 8 × 16-bit short integers]

Processing Vectors
[Figure: memory and vector registers]
• When is this more efficient?
• When is this not efficient?
• Think of 3D graphics, linear algebra and media processing

Case Study: Intel Streaming SIMD Extensions
• Eight 128-bit XMM registers (XMM0-XMM7)
  - x86-64 adds 8 more registers, XMM8-XMM15
• 8, 16, 32, 64 bit integers (SSE2)
• 32-bit (SP) and 64-bit (DP) floating point
• Signed/unsigned integer operations
• IEEE 754 floating point support
• Reading Assignment:
  - http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
  - http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I

Instruction Categories
• Floating point instructions
  - Arithmetic, movement
  - Comparison, shuffling
  - Type conversion, bit level
• Integer
• Other
  - e.g., cache management
• ISA extensions!
• Advanced Vector Extensions (AVX)
  - Successor to SSE
[Figure: operands labelled register, register, and memory]

Arithmetic View
• Graphics and media processing operates on vectors of 8-bit and 16-bit data
  - Use 64-bit adder, with partitioned carry chain
    o Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors
  - SIMD (single-instruction, multiple-data)
• Saturating operations
  - On overflow, result is largest representable value
    o c.f. 2s-complement modulo arithmetic
  - E.g., clipping in audio, saturation in video
[Figure: a 64-bit word partitioned as 4 × 16-bit and as 2 × 32-bit]
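A partitioned adder with optional saturation can be modeled in Python as follows. This is a sketch under assumptions: `partitioned_add` is a hypothetical name, lanes are treated as unsigned, and the per-lane loop stands in for a single 64-bit adder whose carry chain is cut at lane boundaries.

```python
def partitioned_add(a, b, lane_bits=16, lanes=4, saturate=False):
    """Add two packed words lane by lane; carries never cross lane boundaries."""
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for i in range(lanes):
        shift = i * lane_bits
        total = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        if total > lane_mask:
            # Lane overflow: clamp to the largest value, or wrap modulo 2^lane_bits
            total = lane_mask if saturate else total & lane_mask
        result |= total << shift
    return result

a, b = 0x0001_FFFF_0003_0004, 0x0001_0001_0001_0001
print(hex(partitioned_add(a, b)))                  # wrapping lanes
print(hex(partitioned_add(a, b, saturate=True)))   # saturating lanes
```

In the wrapping case the overflowing lane drops to zero without disturbing its neighbour; with saturation it stays at the lane maximum, which is the clipping behaviour wanted for audio and video samples.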

SSE Example

    // A 16-byte = 128-bit vector struct
    struct Vector4
    {
        float x, y, z, w;
    };

    // Add two constant vectors and return the resulting vector
    Vector4 SSE_Add ( const Vector4 &Op_A, const Vector4 &Op_B )
    {
        Vector4 Ret_Vector;

        __asm
        {
            MOV EAX, Op_A                // Load pointers into CPU regs
            MOV EBX, Op_B

            MOVUPS XMM0, [EAX]           // Move unaligned vectors to SSE regs
            MOVUPS XMM1, [EBX]

            ADDPS XMM0, XMM1             // Add vector elements
            MOVUPS [Ret_Vector], XMM0    // Save the return vector
        }
        return Ret_Vector;
    }

From http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I

More complex example (matrix multiply) in Section 3.8, using AVX

Intel Xeon Phi
[Figures from www.anandtech.com and www.techpowerup.com]

Data Parallel vs. Traditional Vector
[Figure: Vector Architecture, with vector registers feeding a pipelined functional unit; Data Parallel Architecture, with registers and each square processed in parallel (data parallel computation)]

ISA View
• Separate core data path
• Can be viewed as a co-processor with a distinct set of instructions
[Figure: CPU/Core with Multiply/Divide unit, plus a Vector ALU with SIMD registers XMM0, XMM1]

Domain Impact on the ISA: Example

Scientific Computing
• Floats
• Double precision
• Massive data
• Power constrained

Embedded Systems
• Integers
• Lower precision
• Streaming data
• Security support
• Energy constrained

Summary
• ISAs support operations required of application domains
  - Note the differences between embedded systems and supercomputers!
  - Signed, unsigned, FP, SIMD, etc.
• Bounded precision effects
  - Software must be careful how hardware is used, e.g., associativity
  - Need standards to promote portability
• Avoid “kitchen sink” designs
  - There is no free lunch
  - Impact on speed and energy (we will get to this later)

Study Guide
• Perform 2's complement addition and subtraction (review)
• Add a few more instructions to the simple ALU
  - Add an XOR instruction
  - Add an instruction that returns the max of its inputs
  - Make sure all control signals are accounted for
• Convert real numbers to single precision floating point (review) and extract the value from an encoded single precision number (review)
• Execute the SPIM programs (class website) that use floating point numbers. Study the memory/register contents via single step execution

Study Guide (cont.)
• Write a few simple SPIM programs for
  - Multiplication/division of signed and unsigned numbers
    o Use numbers that produce >32-bit results
    o Move to/from HI and LO registers (find the instructions for doing so)
  - Addition/subtraction of floating point numbers
• Try to write a simple SPIM program that demonstrates that floating point operations are not associative (this takes some thought and review of the range of floating point numbers)
• Look up additional SIMD instruction sets and compare
  - ARM NEON, AltiVec, AMD 3DNow!

Glossary
• Co-processor
• Data parallelism
• Data parallel computation vs. vector computation
• Instruction set extensions
• Overflow
• MIMD
• Precision
• SIMD
• Saturating arithmetic
• Signed arithmetic support
• Unsigned arithmetic support
• Vector processing
