Storage I/O Summary - Stanford Universityweb.stanford.edu/class/ee282h/handouts/Handout40.pdfStorage...

1

K. OlukotunFall 98/99

Handout #40EE 282h

1

Storage I/O Summary

● Storage devices

● Storage I/O Performance Measures» Throughput» Response time

● I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal measure

● I/O System design» Balanced design» Performance determined by weakest link

● Buses● Processor Interface Issues

2

Lecture 16:Multimedia and DSP Architectures

Prof. Kunle OlukotunEE 282h

Fall 98/99

2


Handout #40EE 282h

3

Input/Output (I/O)more than just disks and networks

● Secondary storage» disks, tapes, CD ROMs

● Communication» networks

● Human interface» video, audio, keyboards, …» Multimedia

● ‘Real world’ interface» temperature, pressure,

position, velocity, voltage,current,…

» Digital Signal Processing(DSPs)

4

New Architecture Direction

● “…media processing will become the dominant force incomputer arch. & microprocessor design.”

● “... new media-rich applications... involve significant real-timeprocessing of continuous media streams, and make heavy useof vectors of packed 8-, 16-, and 32-bit integer and Fl. Pt.”

● Needs include high memory BW, high network BW, continuousmedia data types, real-time response, fine grain parallelism

“How Multimedia Workloads Will Change Processor Design”,Diefendorff & Dubey, IEEE Computer (9/97)

3


Handout #40EE 282h

5

Multimedia Workloads

● Multimedia» Videoconferencing» Video authoring» Animation» Games

● Algorithms» Image compression (JPEG)» Video compression (MPEG)» 3D- graphics» encryption

6

Multimedia Characteristics

● Real-time response» Video, audio

● Continuous media data-types» 8 - 16 bits sufficient for many applications

● Data parallelism» E.g. Apply same operation to whole image» Vector or SIMD works well

● Coarse grained parallelism» E.g. video encoding/decoding, audio encoding /decoding

● Small loops» most time spent in kernel» Amenable to hand optimization

● High memory bandwidth» Video, 3D graphics» Caches not large enough

4


Handout #40EE 282h

7

Multimedia ISA Extensions

● HP PA-RISC» MAX-2

● SUN SPARC» VIS

● Intel x86» MMX

● MIPS» MDMX

● Power PC» Altivec

8

MMX

● “MMX Technology Extension to the Intel Architecture”

Alex Peleg and Uri WeiserIEEE Micro, August 1996

● Goals» Improve performance of multimedia applications

– Graphics, MPEG video– Image processing, speech recognition

» Remain completely compatible with the Intel x86 ISA» Minimize cost

● Approach» Use packed data types» Exploit SIMD parallelism» Make use of existing wide data paths

5


Handout #40EE 282h

9

Data Types

● Three fixed-point integers fit into 64 bit quad word» Packed byte, 8 8-bit bytes» Packed word, 4 16-bit words (most common data type)» Packed doubleword, 2 32-bit doubleword

● User controlled fixed point» Advantage?» Disadvantage?

10

MMX Operands

● Eight 64-bit GP registers (MM0-MM7)

● MMX shares FPU» Why?» Can’t do FP computation and MMX at the same time» What about graphics?

● Random access» Learned lesson from FPU design

6480 0

Floating Point

MMX

6


Handout #40EE 282h

11

MMX Operations

● 57 MMX instructions work on all data types

● Support for saturation arithmetic» Simplifies handling of underflow and overflow» Matches physical behavior

● Packed operations» Addition/subtraction» Multiplication» Compares» Shift

● Conversion operations» Pack» Unpack

● Performance improvement» Fewer loads and stores» Fewer arithmetic operations, but more conversion

operations» Limitations

12

MMX Operations

Unpacking byte data.

7


Handout #40EE 282h

13

MMX ISA Summary

14

Using MMX

● Assembly language coding

● Use of libraries» E.g. IDCT, DCT, matrix multiply

● Use of C macros (“intrinsics”)» Generate optimized assembly code» Performs register allocation and instruction scheduling

» Requires intimate knowledge of MMX

● Could a compiler generate MMX code?

MMX64 t0, t1;t0 = paddd(t0, t1);

8


Handout #40EE 282h

15

Chroma Keying

● “weatherman example”

for (i = 0; i < image_size; i++)

new_image = (x[i] == Blue) ? y[i] : x[i];

movq mm3, mem1 load 8 pixels from weatherman

movq mm4, mem2 load 8 pixels from map

pcmpeq mm1, mm3 generate select mask

pand mm4, mm1 and map with mask

pandn mm1, mm3 and weatherman with inverse mask

por mm4, mm1 or masked images together

16

Chroma Keying

9


Handout #40EE 282h

17

FIR filter on a GPProcessor

loop:lw r5, 0(r1)lw r6, 0(r2)mul r7, r5,r6add r8,r8,r7addi r1,r1,4addi r2,r2,4

addi r4,r4,-1bnez r4,loop

sw r8,0(r3)

18

DSP vs. General Purpose MPU

● The “MIPS/MFLOPS” of DSPs is speed of Multiply-Accumulate(MAC).» DSP are judged by whether they can keep the multipliers busy

100% of the time.

● The “SPECDSP” is 4 algorithms:» Infinite Impulse Response (IIR) filters» Finite Impulse Response (FIR) filters» FFT, and» convolvers

● In DSPs, algorithms are king!» Binary compatibility not an issue

● Software is not (yet) king in DSPs.» People still write in assembly language for a product to minimize

the die area for ROM in the DSP chip.» I predict this will change as algorithms get more complicated

10


Handout #40EE 282h

19

DSP Instructions

● May specify multiple operations in a single instruction

● Must support Multiply-Accumulate (MAC)● Need parallel move support

● Usually have special loop support to reduce branch overhead» Loop an instruction or sequence» 0 value in register usually means loop maximum number of times» Must be sure if calculate loop count that 0 does not mean 0

● May have saturating shift left arithmetic● May have conditional execution to reduce branches

20

DSP Data Path: Precision

● DSPs typically work with fixed point fractions

● Word size affects precision of fixed point numbers● DSPs have 16-bit, 20-bit, or 24-bit data words

● Floating Point DSPs cost 2X - 4X vs. fixed point, slower than fixedpoint

● DSP programmers scale values inside code» SW Libraries» Separate explicit exponent

● “Blocked Floating Point” single exponent for a group of fractions

● Floating point support simplifies software development» All DSP benchmarks that we use will be floating point

11


Handout #40EE 282h

21

DSP Data Path: Multiplier

● Specialized hardware performs all key arithmeticoperations in 1 cycle

● ≥ 50% of instructions can involve multiplier=> single cycle latency multiplier

● Need to perform multiply-accumulate (MAC)

● n-bit multiplier => 2n-bit product

22

DSP Memory

● FIR Tap implies multiple memory accesses

● DSPs want multiple data ports● Some DSPs have ad hoc techniques to reduce memory

bandwidth demand» Instruction repeat buffer: do 1 instruction 256 times» Often disables interrupts, thereby increasing interrupt response

time

● Some recent DSPs have instruction caches» Even then may allow programmer to “lock in” instructions into

cache» Option to turn cache into fast program memory

● No DSPs have data caches● May have multiple data memories

12


Handout #40EE 282h

23

DSP Addressing

● Have standard addressing modes: immediate, displacement,register indirect

● Want to keep MAC datapath busy● Assumption: any extra instructions imply clock cycles of

overhead in inner loop=> complex addressing is good=> don’t use datapath to calculate fancy address

● Autoincrement/Autodecrement register indirect» lw r1,0(r2)+ => r1 <- M[r2]; r2<-r2+1» Option to do it before addressing, positive or negative

24

DSP Addressing: Buffers

● DSPs dealing with continuous I/O

● Often interact with an I/O buffer (delay lines)● To save memory, buffer often organized as circular buffer

● What can do to avoid overhead of address checkinginstructions for circular buffer?

● Option 1: Keep start register and end register per addressregister for use with autoincrement addressing, reset to startwhen reach end of buffer

● Option 2: Keep a buffer length register, assuming buffers startson aligned address, reset to start when reach end

● Every DSP has “modulo” or “circular” addressing

13


Handout #40EE 282h

25

DSP Addressing: FFT

● FFTs start or end with data in bufferfly order0 (000) => 0 (000)1 (001) => 4 (100)2 (010) => 2 (010)3 (011) => 6 (110)4 (100) => 1 (001)5 (101) => 5 (101)6 (110) => 3 (011)7 (111) => 7 (111)

● What can do to avoid overhead of address checking instructions forFFT?

● Have an optional “bit reverse” address addressing mode for use withautoincrement addressing

● Many DSPs have “bit reverse” addressing for radix-2 FFT

26

FIR filter

● GP Processorloop:

lw r5, 0(r1)lw r6, 0(r2)mul r7, r5,r6add r8,r8,r7addi r1,r1,4addi r2,r2,4

addi r4,r4,-1bnez r4,loop

sw r8,0(r3)

● DSP: Bus / memory bandwidth bottleneck, control code repeat r4mac acc,r5,r6 X:r5, (r1++) Y:r6, (r2++)

sw acc,0(r3) overhead

14


Handout #40EE 282h

27

Summary: How are DSPs different?

● Single cycle multiply accumulate (multiple busses and arraymultipliers)

● Complex instructions for standard DSP functions (IIR and FIRfilters, convolvers)

● Specialized memory addressing» Modular arithmetic for circular buffers (delay lines)» Bit reversal (FFT)

● Zero overhead loops and repeat instructions

● I/ O support – Serial and parallel ports● General-purpose processors will become viable for many DSP

applications

● Recent DSPs have adopted VLIW architectures

Storage I/O Summary - Stanford Universityweb.stanford.edu/class/ee282h/handouts/Handout40.pdfStorage...

Documents

Transcript of Storage I/O Summary - Stanford Universityweb.stanford.edu/class/ee282h/handouts/Handout40.pdfStorage...