Post on 23-Feb-2016
ALU Architecture and ISA Extensions
Lecture notes from MKP, H. H. Lee and S. Yalamanchili
(2)
Reading
• Sections 3.2-3.5 (only those elements covered in class)
• Sections 3.6-3.8
• Appendix B.5
• Practice Problems: 26, 27
• Goal: Understand the ISA view of the core microarchitecture
  o Organization of functional units and register files into basic data paths
(3)
Overview
• Instruction Set Architectures have a purpose
  o Applications dictate what we need
• We only have a fixed number of bits
  o Impact on accuracy
• More is not better
  o We cannot afford everything we want
• Basic Arithmetic Logic Unit (ALU) Design
  o Addition/subtraction, multiplication, division
(4)
Reminder: ISA
[Figure: programmer's view of the machine. Byte-addressed memory (up to 0xFFFFFFFF) with a memory map of reserved, text segment, data segment (static), dynamic data, and stack. A memory interface and the processor internal buses connect memory to the Arithmetic Logic Unit (ALU) and the register file (registers 0x00 to 0x1F, programmer visible state). Programmer invisible state: program counter, instruction register, kernel registers.]
• Who sees what?
(5)
Arithmetic for Computers
• Operations on integers
  o Addition and subtraction
  o Multiplication and division
  o Dealing with overflow
• Operations on floating-point real numbers
  o Representation and operations
• Let us first look at integers
(6)
Integer Addition (3.2)
• Example: 7 + 6
• Overflow if result out of range
  o Adding +ve and –ve operands, no overflow
  o Adding two +ve operands: overflow if result sign is 1
  o Adding two –ve operands: overflow if result sign is 0
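The sign rules for addition overflow can be checked with a small sketch. It models 32-bit two's complement arithmetic in Python; the helper names `to_signed32` and `add32` are illustrative and do not come from the slides.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement integer."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits

def add32(a, b):
    """Add two signed 32-bit values; return (wrapped result, overflow flag)."""
    result = to_signed32((a & MASK32) + (b & MASK32))
    # Overflow is only possible when both operands have the same sign
    # and the result comes out with the opposite sign.
    overflow = (a >= 0) == (b >= 0) and (result >= 0) != (a >= 0)
    return result, overflow

print(add32(7, 6))          # (13, False)
print(add32(2**31 - 1, 1))  # two +ve operands, result sign is 1: overflow
print(add32(-2**31, -1))    # two -ve operands, result sign is 0: overflow
print(add32(-5, 9))         # mixed signs: never overflows
```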
(7)
Integer Subtraction
• Add negation of second operand
• Example: 7 – 6 = 7 + (–6), in 2's complement representation
    +7: 0000 0000 … 0000 0111
    –6: 1111 1111 … 1111 1010
    +1: 0000 0000 … 0000 0001
• Overflow if result out of range
  o Subtracting two +ve or two –ve operands, no overflow
  o Subtracting +ve from –ve operand
    o Overflow if result sign is 0
  o Subtracting –ve from +ve operand
    o Overflow if result sign is 1
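A minimal sketch of subtraction as "add the negation", again modelling 32-bit two's complement values in Python. The names `negate32` and `sub32` are hypothetical helpers introduced for illustration.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement integer."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits

def negate32(b):
    """Two's complement negation: invert every bit, then add 1."""
    return (((b & MASK32) ^ MASK32) + 1) & MASK32

def sub32(a, b):
    """Compute a - b as a + (-b); return (wrapped result, overflow flag)."""
    result = to_signed32((a & MASK32) + negate32(b))
    # Overflow is only possible when the operand signs differ
    # and the result's sign differs from the sign of a.
    overflow = (a >= 0) != (b >= 0) and (result >= 0) != (a >= 0)
    return result, overflow

print(format(negate32(6), "032b"))  # bit pattern of -6
print(sub32(7, 6))                  # (1, False)
print(sub32(-2**31, 1))             # +ve from -ve, result sign is 0: overflow
print(sub32(2**31 - 1, -1))         # -ve from +ve, result sign is 1: overflow
```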
(8)
ISA Impact
• Some languages (e.g., C) ignore overflow
  o Use MIPS addu, addiu, subu instructions
• Other languages (e.g., Ada, Fortran) require raising an exception
  o Use MIPS add, addi, sub instructions
  o On overflow, invoke exception handler
    o Save PC in exception program counter (EPC) register
    o Jump to predefined handler address
    o mfc0 (move from coprocessor register) instruction can retrieve EPC value, to return after corrective action (more later)
• ALU design leads to many solutions. We look at one simple example
(9)
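Integer ALU (arithmetic logic unit) (B.5)
• Build a 1-bit ALU, and use 32 of them (bit-slice)
[Figure: ALU block symbol with inputs a and b, an operation control input, and a result output; accompanying table with columns op, a, b, res.]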
(10)
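Single Bit ALU
[Figure: 1-bit ALU. Inputs A and B feed an AND gate and an OR gate; a multiplexer with inputs 0 and 1, controlled by Operation, selects the Result.]
• Implements only AND and OR operations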
(11)
Adding Functionality
• We can add additional operators (to a point)
• How about addition?
• Review full adders from digital design
    cout = a·b + a·cin + b·cin
    sum = a ⊕ b ⊕ cin
[Figure: full adder with inputs a, b, and CarryIn, and outputs Sum and CarryOut.]
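The full adder equations can be verified exhaustively. The sketch below enumerates all eight input combinations and checks that the two outputs together encode a + b + cin; the function name `full_adder` is illustrative.

```python
from itertools import product

def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin                          # sum = a XOR b XOR cin
    cout = (a & b) | (a & cin) | (b & cin)   # carry when at least two inputs are 1
    return s, cout

# Print the truth table and check it against ordinary integer addition.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert 2 * cout + s == a + b + cin
    print(a, b, cin, "-> sum", s, "cout", cout)
```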
(12)
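Building a 32-bit ALU
[Figure, left: 1-bit ALU slice with inputs a, b, and CarryIn. The AND, OR, and adder outputs feed multiplexer inputs 0, 1, and 2; Operation selects the Result, and the adder produces CarryOut.]
[Figure, right: 32 slices, ALU0 to ALU31, sharing the Operation lines. Slice i takes ai and bi and produces Resulti; the CarryOut of each slice feeds the CarryIn of the next.]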
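The bit-slice construction can be sketched in software as follows. This is a behavioural model under stated assumptions, not the hardware itself: operation codes 0, 1, and 2 are taken to mean AND, OR, and add (matching the multiplexer inputs in the figure), and the names `alu_slice` and `alu32` are illustrative.

```python
def alu_slice(a, b, carry_in, operation):
    """One 1-bit ALU slice: returns (result, carry_out)."""
    total = a + b + carry_in
    adder_sum, carry_out = total & 1, total >> 1
    # The multiplexer picks AND (0), OR (1), or the adder output (2).
    result = (a & b, a | b, adder_sum)[operation]
    return result, carry_out

def alu32(a, b, operation, width=32):
    """Chain `width` slices; CarryOut of slice i feeds CarryIn of slice i+1."""
    result, carry = 0, 0
    for i in range(width):
        bit, carry = alu_slice((a >> i) & 1, (b >> i) & 1, carry, operation)
        result |= bit << i
    return result, carry   # carry is the CarryOut of the last slice

print(alu32(7, 6, 2))                 # add
print(alu32(0b1100, 0b1010, 0)[0])    # and
print(alu32(0b1100, 0b1010, 1)[0])    # or
```

The ripple structure is why the carry chain limits the speed of this simple design: the last slice cannot finish until every earlier carry has settled.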
(13)
Subtraction (a – b)?
• Two's complement approach: just negate b and add 1.
• How do we negate?
• A clever solution:
[Figure: each 1-bit slice gains a multiplexer, controlled by Binvert, that selects b (0) or inverted b (1) as the second adder input. For subtraction (sub), Binvert is 1 and the CarryIn of ALU0 is 1. Slices ALU0 to ALU31 are otherwise chained as before, with Operation selecting among multiplexer inputs 0, 1, and 2.]
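A minimal sketch of the Binvert idea: inverting every bit of b and injecting a 1 into the first CarryIn computes a + (~b) + 1, which is a – b. The parameter names `binvert` and `carry_in` mirror the figure labels; the function names are assumptions made for illustration.

```python
def alu32(a, b, operation, binvert=0, carry_in=0, width=32):
    """Ripple ALU whose slices can invert b before using it."""
    result, carry = 0, carry_in
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = ((b >> i) & 1) ^ binvert     # Binvert mux: b or NOT b
        total = a_bit + b_bit + carry
        bit = (a_bit & b_bit, a_bit | b_bit, total & 1)[operation]
        carry = total >> 1
        result |= bit << i
    return result

def subtract(a, b):
    """a - b = a + (NOT b) + 1: set Binvert and the first CarryIn to 1."""
    return alu32(a, b, operation=2, binvert=1, carry_in=1)

print(subtract(7, 6))        # 1
print(hex(subtract(6, 7)))   # 0xffffffff, the 32-bit pattern of -1
```

No separate negation unit or extra adder is needed; the "+1" comes for free from the otherwise unused CarryIn of the least significant slice.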
(14)
Tailoring the ALU to the MIPS
• Need to support the set-on-less-than instruction (slt)
  o remember: slt is an arithmetic instruction
  o produces a 1 if rs < rt and 0 otherwise
  o use subtraction: (a-b) < 0 implies a < b
• Need to support test for equality (beq $t5, $t6, label)
  o use subtraction: (a-b) = 0 implies a = b
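Both comparisons reduce to one subtraction, as this sketch shows. One assumption goes beyond the slide: the sign bit of a – b gives the wrong answer when the subtraction overflows, so the sketch corrects it by XORing with the overflow flag. The helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def alu_sub(a, b):
    """32-bit a - b computed as a + NOT b + 1."""
    return ((a & MASK32) + ((b & MASK32) ^ MASK32) + 1) & MASK32

def slt(a, b):
    """Set on less than for signed 32-bit operands: 1 if a < b, else 0."""
    diff_sign = alu_sub(a, b) >> 31
    a_sign, b_sign = (a >> 31) & 1, (b >> 31) & 1
    # Assumption beyond the slide: on overflow the sign bit is inverted,
    # so correct it before using it as the "less" result.
    overflow = a_sign != b_sign and diff_sign != a_sign
    return diff_sign ^ int(overflow)

def beq_taken(a, b):
    """Branch on equal: taken when a - b is zero."""
    return alu_sub(a, b) == 0

print(slt(3, 5), slt(5, 3), slt(5, 5))
print(beq_taken(9, 9), beq_taken(9, 8))
```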
(15)
[Figure, left: 32-bit ALU supporting slt. Slices ALU0 to ALU31 share Binvert and Operation; each has inputs a, b, Less, and CarryIn. The Less inputs of ALU1 to ALU31 are tied to 0. ALU31 produces Set, which feeds the Less input of ALU0, and Overflow.]
[Figure, right: 1-bit slice with Operation selecting among multiplexer inputs 0 to 3, where input 3 is Less, and a Binvert multiplexer on b.]
• What is Result31 when (a-b) < 0?
• Unsigned vs. signed support
(16)
Test for equality
• Notice control lines:
    000 = and
    001 = or
    010 = add
    110 = subtract
    111 = slt
• Note: zero is a 1 when the result is zero!
• Note test for overflow!
[Figure: 32-bit ALU with inputs a0 to a31 and b0 to b31, control inputs Bnegate and Operation, slices ALU0 to ALU31 with Less, CarryIn, and CarryOut, and outputs Result0 to Result31, Set, Overflow, and Zero.]
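The control encoding can be exercised with a behavioural model. This is a sketch under assumptions: the top control bit is treated as Bnegate (invert b and set the first CarryIn), the low two bits as the multiplexer select, and slt uses the overflow-corrected sign bit. The function name `alu` and the tuple it returns are illustrative.

```python
MASK32 = 0xFFFFFFFF

def alu(control, a, b):
    """Return (result, zero, overflow) for one of the five listed control codes."""
    if control not in (0b000, 0b001, 0b010, 0b110, 0b111):
        raise ValueError("control code not listed on the slide")
    bnegate, operation = control >> 2, control & 0b11
    a &= MASK32
    b_in = (b & MASK32) ^ (MASK32 if bnegate else 0)   # Bnegate inverts b
    total = (a + b_in + bnegate) & MASK32              # and supplies the +1
    # Signed overflow: both adder inputs share a sign that the sum lacks.
    adder_overflow = (a >> 31) == (b_in >> 31) and (total >> 31) != (a >> 31)
    if operation == 0b00:
        result = a & b_in
    elif operation == 0b01:
        result = a | b_in
    elif operation == 0b10:
        result = total
    else:  # 0b11: slt takes the (overflow-corrected) sign of a - b
        result = (total >> 31) ^ int(adder_overflow)
    zero = int(result == 0)            # Zero is 1 when the result is zero
    return result, zero, operation == 0b10 and adder_overflow

print(alu(0b010, 7, 6))   # add
print(alu(0b110, 7, 7))   # subtract: Zero is 1, as used by beq
print(alu(0b111, 3, 5))   # slt
```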
(17)
ISA View
• Register-to-Register data path
• We want this to be as fast as possible
[Figure: CPU/Core containing registers $0, $1, … $31 connected to the ALU.]
(18)
Multiplication (3.3)
• Long multiplication
         1000      multiplicand
       × 1001      multiplier
         ----
         1000
        0000
       0000
      1000
      -------
      1001000      product
• Length of product is the sum of operand lengths
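The long multiplication above is a shift-and-add loop: for each 1 bit of the multiplier, add a shifted copy of the multiplicand. A minimal sketch for unsigned operands, with the illustrative name `long_multiply`:

```python
def long_multiply(multiplicand, multiplier, width=4):
    """Unsigned shift-and-add multiplication of two `width`-bit operands."""
    product = 0
    for i in range(width):
        if (multiplier >> i) & 1:          # multiplier bit i is 1:
            product += multiplicand << i   # add the shifted multiplicand
    return product

p = long_multiply(0b1000, 0b1001)
print(format(p, "b"))                           # 1001000
# Two 4-bit operands never need more than 8 product bits.
print(long_multiply(0b1111, 0b1111) < 2 ** 8)   # True
```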
(19)
A Multiplier
• Uses multiple adders
  o Cost/performance tradeoff
• Can be pipelined
  o Several multiplications performed in parallel
(20)
MIPS Multiplication
• Two 32-bit registers for product
  o HI: most-significant 32 bits
  o LO: least-significant 32 bits
• Instructions
  o mult rs, rt / multu rs, rt
    o 64-bit product in HI/LO
  o mfhi rd / mflo rd
    o Move from HI/LO to rd
    o Can test HI value to see if product overflows 32 bits
  o mul rd, rs, rt
    o Least-significant 32 bits of product -> rd
Study Exercise: Check out signed and unsigned multiplication with QtSPIM
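Before trying this in QtSPIM, the HI/LO split can be previewed with a small model. The Python functions below are named after the MIPS instructions but are only an illustrative sketch of their arithmetic, not a simulator.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1

def to_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement integer."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits

def mult(rs, rt):
    """Signed multiply: return (HI, LO) as 32-bit patterns."""
    product = (to_signed32(rs) * to_signed32(rt)) & MASK64
    return product >> 32, product & MASK32

def multu(rs, rt):
    """Unsigned multiply: return (HI, LO) as 32-bit patterns."""
    product = (rs & MASK32) * (rt & MASK32)
    return product >> 32, product & MASK32

# The same bit patterns give different HI values when read as signed or unsigned.
hi, lo = mult(0xFFFFFFFF, 2)     # (-1) * 2
print(hex(hi), hex(lo))
hi, lo = multu(0xFFFFFFFF, 2)    # 4294967295 * 2
print(hex(hi), hex(lo))
```

LO is identical in both cases, which is consistent with mul returning only the least-significant 32 bits.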
(21)
Division (3.4)
• Check for 0 divisor
• Long division approach
  o If divisor ≤ dividend bits
    o 1 bit in quotient, subtract
  o Otherwise
    o 0 bit in quotient, bring down next dividend bit
• Restoring division
  o Do the subtract, and if remainder goes < 0, add divisor back
• Signed division
  o Divide using absolute values
  o Adjust sign of quotient and remainder as required

          1001        quotient
1000 ) 1001010        divisor ) dividend
      -1000
          10
          101
          1010
         -1000
            10        remainder

• n-bit operands yield n-bit quotient and remainder
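The restoring division steps can be sketched as a loop over the dividend bits. The function names are illustrative. In `signed_divide`, giving the remainder the sign of the dividend is an assumption (a common convention); the slide only says to adjust signs "as required".

```python
def restoring_divide(dividend, divisor, width=32):
    """Unsigned restoring division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("check for 0 divisor first")
    quotient, remainder = 0, 0
    for i in range(width - 1, -1, -1):
        # Bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        remainder -= divisor               # do the subtract
        if remainder < 0:
            remainder += divisor           # went negative: restore
            quotient = quotient << 1       # 0 bit in quotient
        else:
            quotient = (quotient << 1) | 1 # 1 bit in quotient
    return quotient, remainder

def signed_divide(dividend, divisor, width=32):
    """Divide using absolute values, then adjust the signs."""
    q, r = restoring_divide(abs(dividend), abs(divisor), width)
    if (dividend < 0) != (divisor < 0):
        q = -q
    if dividend < 0:
        r = -r   # assumed convention: remainder follows the dividend's sign
    return q, r

q, r = restoring_divide(0b1001010, 0b1000, width=7)
print(format(q, "b"), format(r, "b"))   # 1001 10
```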
(22)
Faster Division
• Can't use parallel hardware as in multiplier
  o Subtraction is conditional on sign of remainder
• Faster dividers (e.g. SRT division) generate multiple quotient bits per step
  o Still require multiple steps
• Customized implementations for high performance, e.g., supercomputers
(23)
MIPS Division
• Use HI/LO registers for result
  o HI: 32-bit remainder
  o LO: 32-bit quotient
• Instructions
  o div rs, rt / divu rs, rt
  o No overflow or divide-by-0 checking
    o Software must perform checks if required
  o Use mfhi, mflo to access result
Study Exercise: Check out signed and unsigned division with QtSPIM
(24)
ISA View
• Additional function units and registers (Hi/Lo)
• Additional instructions to move data to/from these registers
  o mfhi, mflo
• What other instructions would you add? Cost?
[Figure: CPU/Core with registers $0, $1, … $31, the ALU, a Multiply/Divide unit, and the Hi and Lo registers.]
(25)
Floating Point (3.5)
• Representation for non-integral numbers
  o Including very small and very large numbers
• Like scientific notation
  o –2.34 × 10^56 (normalized)
  o +0.002 × 10^–4 (not normalized)
  o +987.02 × 10^9 (not normalized)
• In binary
  o ±1.xxxxxxx (base 2) × 2^yyyy
• Types float and double in C
(26)
IEEE 754 Floating-point Representation
Single Precision (32-bit)
  bit 31: S (1 bit) | bits 30-23: exponent (8 bits) | bits 22-0: significand (23 bits)
  value = (–1)^sign × (1 + fraction) × 2^(exponent – 127)
Double Precision (64-bit)
  bit 63: S (1 bit) | bits 62-52: exponent (11 bits) | bits 51-32: significand (20 bits) | bits 31-0: significand, continued (32 bits)
  value = (–1)^sign × (1 + fraction) × 2^(exponent – 1023)
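The single precision formula can be applied field by field. The sketch below extracts the sign, exponent, and fraction from a 32-bit pattern and evaluates the formula; it handles normalized numbers only, and the names `decode_single` and `float_to_bits` are illustrative.

```python
import struct

def decode_single(bits):
    """Evaluate (-1)^sign * (1 + fraction) * 2^(exponent - 127)."""
    sign = (bits >> 31) & 1
    exponent = (bits >> 23) & 0xFF
    fraction = (bits & 0x7FFFFF) / (1 << 23)
    if exponent in (0, 255):
        # Zero, denormals, infinities, and NaNs use other rules.
        raise ValueError("formula applies to normalized numbers only")
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)

def float_to_bits(x):
    """Bit pattern of x stored as an IEEE 754 single."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

print(decode_single(0x3F800000))                        # 1.0
print(decode_single(0xC0A00000))                        # -5.0
print(hex(float_to_bits(0.15625)), decode_single(float_to_bits(0.15625)))
```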
(27)
Floating Point Standard
• Defined by IEEE Std 754-1985
• Developed in response to divergence of representations
  o Portability issues for scientific code
• Now almost universally adopted
• Two representations
  o Single precision (32-bit)
  o Double precision (64-bit)
(28)
FP Adder Hardware
• Much more complex than integer adder
• Doing it in one clock cycle would take too long
  o Much longer than integer operations
  o Slower clock would penalize all instructions
• FP adder usually takes several cycles
  o Can be pipelined
Example: FP Addition
(29)
FP Adder Hardware
[Figure: FP adder datapath, annotated with Step 1 through Step 4.]
(30)
FP Arithmetic Hardware
• FP multiplier is of similar complexity to FP adder
  o But uses a multiplier for significands instead of an adder
• FP arithmetic hardware usually does
  o Addition, subtraction, multiplication, division, reciprocal, square-root
  o Conversion between FP and integer
• Operations usually take several cycles
  o Can be pipelined
(31)
ISA Impact
• FP hardware is coprocessor 1
  o Adjunct processor that extends the ISA
• Separate FP registers
  o 32 single-precision: $f0, $f1, … $f31
  o Paired for double-precision: $f0/$f1, $f2/$f3, …
    o Release 2 of the MIPS ISA supports 32 × 64-bit FP registers
• FP instructions operate only on FP registers
  o Programs generally do not perform integer ops on FP data, or vice versa
  o More registers with minimal code-size impact
(32)
ISA View: The Co-Processor
• Floating point operations access a separate set of 32-bit registers
  o Pairs of 32-bit registers are used for double precision
[Figure: CPU/Core with registers $0, $1, … $31, the ALU, a Multiply/Divide unit, and Hi/Lo. Co-Processor 1 with registers $0, $1, … $31 and the FP ALU. Co-Processor 0 with the BadVAddr, Status, Cause, and EPC registers (covered later).]
(33)
ISA View
• Distinct instructions operate on the floating point registers (pg. A-73)
  o Arithmetic instructions
    o add.d fd, fs, ft, and add.s fd, fs, ft
• Data movement to/from floating point coprocessors
  o mfc1 rt, fs and mtc1 rd, fs
• Note that the ISA design implementation is extensible via co-processors
• FP load and store instructions
  o lwc1, ldc1, swc1, sdc1
    o e.g., ldc1 $f8, 32($sp)
[Slide annotations: single precision, double precision]
Example: DP Mean
(34)
Associativity
• Floating point arithmetic is not associative
• Parallel programs may interleave operations in unexpected orders
  o Assumptions of associativity may fail

  x = -1.50E+38, y = 1.50E+38, z = 1.0
  (x+y)+z:  x+y = 0.00E+00,  then (x+y)+z = 1.00E+00
  x+(y+z):  y+z = 1.50E+38,  then x+(y+z) = 0.00E+00

• Need to validate parallel programs under varying degrees of parallelism
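The example values can be reproduced directly. Python floats are double precision rather than single precision, but 1.0 is still far below the spacing between representable numbers near 1.5E+38, so the same effect appears.

```python
x, y, z = -1.5e38, 1.5e38, 1.0

left = (x + y) + z    # x + y cancels exactly, then z survives
right = x + (y + z)   # z is absorbed by y before the cancellation

print(left)           # 1.0
print(right)          # 0.0
print(left == right)  # False: the grouping changed the answer
```

A parallel reduction that changes how the terms are grouped can therefore return a different sum from one run to the next.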
(35)
Performance Issues
• Latency of instructions
  o Integer instructions can take a single cycle
  o Floating point instructions can take multiple cycles
  o Some (FP Divide) can take hundreds of cycles
• What about energy (we will get to that shortly)
• What other instructions would you like in hardware?
  o Would some applications change your mind?
• How do you decide whether to add new instructions?
(36)
Characterizing Parallelism
• Characterization due to M. Flynn*

                                   Data Streams
                                   Single    Multiple
  Instruction Streams   Single     SISD      SIMD
                        Multiple   MISD      MIMD

  o SISD: today's serial computing cores (von Neumann model)
  o SIMD: single instruction multiple data stream computing, e.g., SSE
  o MIMD: today's multicore

*M. Flynn (September 1972). "Some Computer Organizations and Their Effectiveness". IEEE Transactions on Computers, C-21 (9): 948–960.
(37)
Parallelism Categories
From http://en.wikipedia.org/wiki/Flynn%27s_taxonomy
(38)
Multimedia (3.6, 3.7, 3.8)
• Lower dynamic range and precision requirements
  o Do not need 32 bits!
• Inherent parallelism in the operations
(39)
Vector Computation
• Operate on multiple data elements (vectors) at a time
• Flexible definition/use of registers
• Registers hold integers, floats (SP), doubles (DP)
[Figure: one 128-bit register viewed as 1 x 128-bit integer, 4 x 32-bit single precision, 2 x 64-bit double precision, or 8 x 16-bit short integers.]
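The flexible register views can be imitated with Python's `struct` module: the same 16 bytes are reinterpreted under different element layouts. Little-endian byte order is an assumption made for the sketch, and the bytes object merely stands in for a 128-bit register.

```python
import struct

# 16 bytes = 128 bits, filled as four single-precision floats.
register = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

as_floats = struct.unpack("<4f", register)     # 4 x 32-bit single precision
as_doubles = struct.unpack("<2d", register)    # 2 x 64-bit double precision
as_shorts = struct.unpack("<8h", register)     # 8 x 16-bit short integers
as_integer = int.from_bytes(register, "little")  # 1 x 128-bit integer

print(len(register) * 8, "bits")
print(as_floats)
print(len(as_doubles), len(as_shorts))
```

The bits never change; only the instruction that consumes the register decides how they are partitioned.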
(40)
Processing Vectors
[Figure: data moving between memory and the vector registers.]
• When is this more efficient?
• When is this not efficient?
• Think of 3D graphics, linear algebra and media processing
(41)
Case Study: Intel Streaming SIMD Extensions
• 8 128-bit XMM registers
  o x86-64 adds 8 more registers, XMM8-XMM15
• 8, 16, 32, 64 bit integers (SSE2)
• 32-bit (SP) and 64-bit (DP) floating point
• Signed/unsigned integer operations
• IEEE 754 floating point support
• Reading Assignment:
  o http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
  o http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I
(42)
Instruction Categories
• Floating point instructions
  o Arithmetic, movement
  o Comparison, shuffling
  o Type conversion, bit level
• Integer
• Other
  o e.g., cache management
• ISA extensions!
• Advanced Vector Extensions (AVX)
  o Successor to SSE
[Figure labels: register, register, memory]
(43)
Arithmetic View
• Graphics and media processing operates on vectors of 8-bit and 16-bit data
  o Use 64-bit adder, with partitioned carry chain
    o Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors
  o SIMD (single-instruction, multiple-data)
• Saturating operations
  o On overflow, result is largest representable value
    o c.f. 2s-complement modulo arithmetic
  o E.g., clipping in audio, saturation in video
[Figure: 64-bit adder partitioned as 4x16-bit and 2x32-bit.]
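A partitioned carry chain and saturation can both be sketched with one lane-wise adder. The model treats lanes as unsigned values packed into a 64-bit word; the function name `add_lanes` and its parameters are illustrative.

```python
def add_lanes(a, b, lane_bits=16, lanes=4, saturate=False):
    """Add two packed words lane by lane (unsigned lanes)."""
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for i in range(lanes):
        shift = i * lane_bits
        total = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        if saturate:
            total = min(total, lane_mask)  # clamp to the largest lane value
        else:
            total &= lane_mask             # modulo: carry cut at lane boundary
        result |= total << shift
    return result

a = 0x0001_FFFF_0003_0004
b = 0x0001_0001_0001_0001
print(format(add_lanes(a, b), "016x"))                 # lane 2 wraps to 0000
print(format(add_lanes(a, b, saturate=True), "016x"))  # lane 2 clamps to ffff
```

In neither mode does the overflow in one lane disturb its neighbour, which is the point of partitioning the carry chain.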
(44)
SSE Example
// A 16-byte = 128-bit vector struct
struct Vector4
{
    float x, y, z, w;
};

// Add two constant vectors and return the resulting vector
Vector4 SSE_Add ( const Vector4 &Op_A, const Vector4 &Op_B )
{
    Vector4 Ret_Vector;

    __asm
    {
        MOV EAX, Op_A                 // Load pointers into CPU regs
        MOV EBX, Op_B

        MOVUPS XMM0, [EAX]            // Move unaligned vectors to SSE regs
        MOVUPS XMM1, [EBX]

        ADDPS XMM0, XMM1              // Add vector elements
        MOVUPS [Ret_Vector], XMM0     // Save the return vector
    }
    return Ret_Vector;
}
From http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I
More complex example (matrix multiply) in Section 3.8, using AVX
(45)
Intel Xeon Phi
[Figures; image sources: www.anandtech.com, www.techpowerup.com]
(46)
Data Parallel vs. Traditional Vector
[Figure: Vector Architecture. Vector registers A and B feed a pipelined functional unit, whose results go to vector register C.]
[Figure: Data Parallel Architecture. A grid of squares with registers; process each square in parallel (data parallel computation).]
(47)
ISA View
• Separate core data path
• Can be viewed as a co-processor with a distinct set of instructions
[Figure: CPU/Core with registers $0, $1, … $31, the ALU, a Multiply/Divide unit, and Hi/Lo, alongside the SIMD registers XMM0, XMM1, … XMM15 feeding a Vector ALU.]
(48)
Domain Impact on the ISA: Example
Scientific Computing
• Floats
• Double precision
• Massive data
• Power constrained
Embedded Systems
• Integers
• Lower precision
• Streaming data
• Security support
• Energy constrained
(49)
Summary
• ISAs support operations required of application domains
  o Note the differences between embedded and supercomputers!
  o Signed, unsigned, FP, SIMD, etc.
• Bounded precision effects
  o Software must be careful how hardware is used, e.g., associativity
  o Need standards to promote portability
• Avoid "kitchen sink" designs
  o There is no free lunch
  o Impact on speed and energy (we will get to this later)
(50)
Study Guide
• Perform 2's complement addition and subtraction (review)
• Add a few more instructions to the simple ALU
  o Add an XOR instruction
  o Add an instruction that returns the max of its inputs
  o Make sure all control signals are accounted for
• Convert real numbers to single precision floating point (review) and extract the value from an encoded single precision number (review)
• Execute the SPIM programs (class website) that use floating point numbers. Study the memory/register contents via single step execution
(51)
Study Guide (cont.)
• Write a few simple SPIM programs for
  o Multiplication/division of signed and unsigned numbers
    o Use numbers that produce >32-bit results
    o Move to/from HI and LO registers (find the instructions for doing so)
  o Addition/subtraction of floating point numbers
• Try to write a simple SPIM program that demonstrates that floating point operations are not associative (this takes some thought and review of the range of floating point numbers)
• Look up additional SIMD instruction sets and compare
  o ARM NEON, AltiVec, AMD 3DNow!
(52)
Glossary
• Co-processor
• Data parallelism
• Data parallel computation vs. vector computation
• Instruction set extensions
• Overflow
• MIMD
• Precision
• SIMD
• Saturating arithmetic
• Signed arithmetic support
• Unsigned arithmetic support
• Vector processing