The Future of Vector Processors
M. Valero, R. Espasa and J. Corbal
UPC, Barcelona
Kyoto, May 28th, 1999
TOP-500 and Vector Processors

[Bar chart: number of systems and % of peak performance for vector machines in the TOP-500 list, November 1998]
Vector systems by vendor: Fujitsu 27, NEC 18, SGI 15, Hitachi 5
The Future of Vector ISAs
• Cross-pollination of vector/superscalar/VLIW
  – MMX, embedded...
• Very-high performance architectures
  – ILP techniques, IRAM, SDRAM
• Vector microprocessors
  – Numerical accelerators
  – Multimedia applications
Talk Outline
• The Past:
  – Initial motivation for vector ISAs
  – Evolution of vector processors
• The Present:
  – Recent announcements
  – The decline of vector processors
  – Cross-pollination of vector/superscalar/VLIW
• The Future:
  – Very-high performance architectures
  – Vector microprocessors (numerical accelerators, multimedia applications)
• Conclusions
Characteristics of Numerical Applications
• Examples: Weather prediction, mechanical engineering
• Data structures: Huge matrices (dense, sparse)
• Data types: 64 bits, floating point
• Highly repetitive loops
• Compute-intensive
• Data-Level Parallel
Initial Motivations for Vector Processors

      subroutine loop
      real*8 x(9992), y(9992), u(9984)
      integer I
      real*8 q
      do I = 1, 9984
        q = u(I) * y(I)
        y(I) = x(I) + q
        x(I) = q - u(I) * x(I)
      enddo
      end

[Figure: dependence graph of the loop body, for I = 1 to 9984]
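The loop above can be sketched in Python (an illustrative translation of the slide's Fortran, not code from the talk). The point of the dependence graph is visible directly in the code: each iteration reads and writes only its own x(I), y(I) and u(I), with q as a purely local temporary, so all 9984 iterations are independent.

```python
def loop(x, y, u):
    # Direct translation of the Fortran loop body: each iteration
    # touches only element i, so iterations are mutually independent (DLP).
    for i in range(len(u)):
        q = u[i] * y[i]          # q = u(I) * y(I)
        y[i] = x[i] + q          # y(I) = x(I) + q
        x[i] = q - u[i] * x[i]   # x(I) = q - u(I) * x(I)
    return x, y

# Tiny demo with 4 elements instead of 9984
x, y = loop([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0])
print(x, y)
```

Because no iteration reads a value another iteration writes, the compiler is free to execute all of them in parallel, which is exactly what the vector code on the next slides does.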
Execution of scalar code

Loop: ld   R1,0(R10)
      ld   R2,0(R11)
      ld   R3,0(R12)
      mulf R4,R1,R2
      mulf R5,R2,R3
      add  R11,R11,#8
      addf R6,R4,R3
      subf R7,R4,R5
      st   0(R12),R7
      add  R10,R10,#8
      st   0(R12),R7
      sub  R13,R13,#1
      bne  Loop
      add  R12,R12,#8

[Pipeline diagram: the loop's 14 instructions flowing through the IF / D-L / ALU / M / W stages]
14 cycles / Iteration
Perfect Memory !!!
Generation of Vector Code

      ld.w  #9984,s2
      ld.w  #0,a2
      ld.w  #8,vs
Loop: mov   s2,vl       ; vl <- min(s2,128)
      ld.l  -y(a2),v0   ; v0 <- y(I:I+127)
      ld.l  -u(a2),v1   ; v1 <- u(I:I+127)
      mul.d v1,v0,v2    ; v2 <- u(I:I+127) * y(I:I+127) = q(I:I+127)
      ld.l  -x(a2),v3   ; v3 <- x(I:I+127)
      add.d v3,v2,v0    ; v0 <- x(I:I+127) + q(I:I+127)
      st.l  v0,-y(a2)   ; y(I:I+127) <- x(I:I+127) + q(I:I+127)
      mul.d v1,v3,v1    ; v1 <- u(I:I+127) * x(I:I+127)
      sub.d v2,v1,v0    ; v0 <- q(I:I+127) - u(I:I+127) * x(I:I+127)
      st.l  v0,-x(a2)   ; x(I:I+127) <- q(I:I+127) - u(I:I+127) * x(I:I+127)
      add.w #1024,a2    ; increment index (128 * 8)
      add.w #-128,s2    ; 128 iterations less to process
      lt.w  #0,s2
      jbrs.t Loop

[Figure: a vector register holding elements 0, 1, 2, ..., 127]
A vector iteration is equivalent to 128 scalar iterations
DLP !!!
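The strip-mining pattern in the code above (set vl to min(remaining, 128) on every pass) can be sketched in Python; the constant 128 stands in for the machine's vector register length.

```python
MAX_VL = 128  # hardware vector register length

def strip_mine(n):
    """Yield (start, vl) chunks covering n loop iterations,
    mirroring the 'mov s2,vl ; vl <- min(s2,128)' idiom on the slide."""
    start = 0
    while start < n:
        vl = min(n - start, MAX_VL)  # last strip may be shorter
        yield start, vl
        start += vl

# 9984 iterations divide into exactly 78 full strips of 128
strips = list(strip_mine(9984))
print(len(strips), strips[0], strips[-1])
```

With 9984 iterations and VL = 128, the loop runs as exactly 78 vector iterations, which is why one vector iteration stands in for 128 scalar ones.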
Execution of vector code (the same vector loop as above)
5.1 cycles / Iteration
Memory Latency = 24 cycles !!!
14 vector instructions = 1792 scalar instructions
One L/S Port
One Adder, One Multiplier
A vector iteration is equivalent to 128 scalar iterations
Vector Processor

[Block diagram: a control unit and main memory exchange instructions (scalar + vector) and data; scalar registers execute Ri := Rj op Rk and conditional branches; vector registers execute VR[i] := VR[j] op VR[k]]
Why a Vector ISA?
• Natural way to express data-level parallelism
  – Fewer instructions (3)
• Easy way to convey this information to the hardware
• Good hardware implementation
  – Affordable/incremental parallelism (2)
  – Simple control / faster clock (1)
• Mechanism to deal with memory latency
• Problem: memory bandwidth...
Vector versus Scalar Architectures

[Bar chart: number of instructions executed (in millions) by a MIPS R10k vs. a Convex C3]

Vector instruction semantics "encode" many different scalar instructions:
- Loop counters
- Branch computations
- Address generation

Instruction-count ratio: from 140 down to 2

F. Quintana, R. Espasa and M. Valero, "A case for merging the ILP...", PDP-98
Easy to convey information to the hardware
• Data path:
  – No pressure at fetch, decode and issue
  – Decentralized control
  – Faster cycle times
• Vector memory instructions:
  – Spatial locality can be made clearly visible to the hardware through "strides"
  – No overhead and good prefetching
  – Reduction of memory latency overhead
  – Memory uses facts, not guesses
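What a strided vector load declares to the memory system can be sketched in Python (the function name and flat-memory model are illustrative, not from the talk): base, stride and length fully describe the access pattern before any element is fetched.

```python
def vector_load(memory, base, stride, vl):
    """Gather vl elements starting at base, spaced stride apart:
    the exact access pattern a strided vector load hands to memory."""
    return [memory[base + i * stride] for i in range(vl)]

mem = list(range(100))            # pretend flat memory
col = vector_load(mem, 3, 10, 5)  # e.g. one column of a 10-wide matrix
print(col)
```

Because the whole pattern is known up front, the memory system can schedule banks and prefetch from facts rather than guesses, as the slide puts it.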
Key parameters for vector processors
• Cycle time
• Scalar processor:
  – # of registers and FUs
  – Cache
• Vector processor:
  – # of vector registers
  – # of FUs and # of pipes/FU
• Connection to memory:
  – # of busses and width
• Number of processors
Cray Y-MP Architecture

[Block diagram: 8 processors (P0..P7) connected to 256 memory modules through 4*4 and 8*8 crossbar stages, with hardware synchronization]

tc = 6 ns; 333 Mflops/processor; 256 memory modules, ta = 30 ns
Vector Processors (1 of 2)

Year  Machine          Tc (ns)  #FPUs  Flops/cycle  LD/ST path  Words/cycle  #Regs   Elements/reg
1972  TI-ASC           60       2      4            LS          4(32)        -       -
1973  STAR-100         40       2      2            L,L,S       3            -       -
1975  Cray-1           12.5     2      2            LS          1            8       64
1982  Fujitsu VP-200   7        2      4            LS,LS       4            8-256   1024-32
1983  Cray X-MP        9.5      2      2            L,L,S       2+1          8       64
1983  Hitachi S810/20  19/14    6(?)   12(?)        L,L,L,LS    8 or 2       32      256
1984  NEC SX-2         6        4      16           L,LS        8 or 4       8+8k    256/64-256
1985  Cray-2           4.1      2      2            LS          1            8       64
1987  Hitachi S820/80  4        3      12           L,LS        8 or 4       32      512
Vector Processors (2 of 2)

Year  Machine          Tc (ns)  #FPUs  Flops/cycle  LD/ST path  Words/cycle  #Regs     Elements/reg
1987  Convex C2        40       2      2            LS          1            8         128
1988  Cray Y-MP        6.3      2      2            L,L,S       2+1          8         64
1989  Fujitsu VP-2600  3.2      4      16           LS,LS       8            2048-64   64-2048
1990  NEC SX-3         2.9      4      16           L,L,S       8+4          8+16k     256/64-256
1992  Cray C90         4        2      4            L,L,S       4+2          8         128
1993  Hitachi S-3800   2        2(?)   16(?)        L,L,L,LS    8 or 2       -         -
1994  Convex C4        7.4      2      2            LS          1            8         128
1996  NEC SX-4         8        2      16           LS,LS       16           8+16k     256/64-256
1998  NEC SX-5         4        2      32           LS,LS       32           8+16k     256/64-256
Evolution of Cray Machines

Machine    Year  Clock (MHz)  Mflops/CPU  #CPUs   Memory BW/CPU  Load latency (ns)
Cray-1     1976  80           160         1       640 MB/s       150
Cray X-MP  1982  105          210         2       2.5 GB/s       123
Cray-2     1982  243          486         4 or 8  1.9 GB/s       200
Cray Y-MP  1989  167          334         8       4 GB/s         100
Cray C90   1992  243          970         16      12 GB/s        95
Cray J90   1995  100          200         32      1.6 GB/s       340
Cray T90   1994  450          1800        32      21 GB/s        70/116
Cray SV-1  1998

Courtesy of SGI/Cray

Tc: x6   ILP: x2   # of proc.: x32   Total: x400
Vector Innovations (1 of 2)
• Star-100/Cyber-200 had many of them:
  – Gather/scatter
  – Masked operations for conditionals
• Cray-1 introduced vector registers
• BSP had instructions for recurrences and multioperand operations
• Instructions to optimize masked vector operations
• Instructions to handle index and bit sequences on the mask register
• Flexible addressing of subvector registers (C4)
Vector Innovations (2 of 2)
• Multi-pipes (Star/Cyber)
• Vectors with virtual memory
• Flexible chaining (multi-ported register file)
• Multilevel register file (NEC)
• Scalar units sharing vector FUs (Fujitsu)
• Combined vector and scalar instructions (Titan)
• Short vectors (CS-2 and CM-5)
• Scalar processor: LIW (Fujitsu), superscalar (NEC)
Automatic vectorization
• Compiler technology for vectorization: over 25 years of development
  – Dependence analysis
  – Elimination of false dependences
  – Strip mining
  – Loop interchange
  – Partial vectorization
  – Idiom recognition
  – IF conversion
  – Vector parallelization
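One of the techniques above, IF conversion, replaces a data-dependent branch inside a loop body with a per-element mask so the loop can still vectorize. A minimal Python sketch of the idea (the helper names are illustrative, not compiler terminology):

```python
def compute_mask(cond, vec):
    # Vector compare: one mask bit per element
    return [cond(v) for v in vec]

def masked_update(mask, vec, op):
    # Masked vector operation: op is applied only where the mask is set.
    # This is how 'if (a(i) > 0) a(i) = a(i) * 2' runs without a branch.
    return [op(v) if m else v for m, v in zip(mask, vec)]

a = [-1.0, 2.0, -3.0, 4.0]
mask = compute_mask(lambda v: v > 0, a)
a = masked_update(mask, a, lambda v: v * 2)
print(a)
```

On real vector hardware the mask lives in a dedicated mask register and the masked elements are simply not written back, so the conditional costs no control flow at all.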
Vector Architectures: Present
• New announcements (NEC, Cray, Fujitsu)
• The decline of vector processors
• Cross-pollination of vector/superscalar/VLIW processors
NEC SX-5
• Announced on June 5th, 1998
• 8 Gflops, CMOS, tc = 4 ns
• Superscalar processor at 500 Mflops
• 32 results/cycle (2 FPUs, 16 pipes)
• 32 data memory accesses/cycle (2 ports, 16 data/port); memory bandwidth of 64 GB/s
• System composed of 32 nodes of 128 Gflops, providing 4 Tflop/s
Cray SV1
• Announced on June 16th, 1998
• CMOS, 250 MHz and 4 Gflops/processor
• Vector cache memory
• 2 FUs of 8 operations/cycle
• "Multi-Streaming" processor
• Scalable vector architecture (32 nodes of 32 processors... 4 Teraflops)
• Future processor enhancements !!!
Fujitsu VP5000
• Announced on April 20th, 1999
• 9.2 Gflop/s, CMOS, 0.22 µm, 33 Mtransistors/chip
• Linpack 1000*1000 gives 8758 Mflop/s
• Crossbar provides 2*1.6 GB/s per processor
• System composed of up to 512 PEs, or 4.9 Teraflops
• Maximum of 16 GB/PE, or 8 TB for 512 PEs
The decline of vector processors
• Why have vector machines declined so fast in popularity?
  – Cost (scalar parallel machines use commodity parts)
  – Too restricted in applications (lack of vectorization in many programs)
• Massive use of computers to run so-called "non-numerical applications"
Characteristics of non-numerical applications
• Examples: OLTP, DSS, simulators, games...
• General data structures: lists, trees, tables...
• Data types: scalar integers of 8 to 64 bits
• Frequent control flow changes... speculation
• Short-distance data dependences... forwarding
• Instruction/data locality... caches
• Fine-grain ILP... out-of-order
Micro Killers ???

Year  Machine       Clock (MHz)  #ops/cycle  Peak Mflops
1976  Cray-1        80           2           160
1978  Intel 8086    10           -           -
1992  Cray C90      243          4           970
1992  Alpha 21064   150          1           150
1994  Pentium       100          1           100
1996  NEC SX-4      125          16          2000
1997  IBM P2SC      160          4*          640
1997  Alpha 21164   500          2           1000
1998  HP PA-8200    240          4*          960
1998  NEC SX-5      250          32          8000
1998  Pentium       400          1           400

Peak performance = clock (MHz) * ops/cycle
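The formula on this slide is simple enough to check directly against the table's rows; a one-line sketch in Python:

```python
def peak_mflops(clock_mhz, ops_per_cycle):
    # Peak Mflops = clock frequency (MHz) * flops issued per cycle
    return clock_mhz * ops_per_cycle

print(peak_mflops(80, 2))    # Cray-1: 80 MHz, 2 flops/cycle
print(peak_mflops(250, 32))  # NEC SX-5: 250 MHz, 32 flops/cycle
```

The two calls reproduce the 160 Mflops and 8000 Mflops entries of the table, which is the slide's point: a wide vector machine at a modest clock can outrun a fast superscalar issuing 1-2 flops per cycle.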
Bandwidth and Performance

[Figure: memory-hierarchy sizes and bandwidths for the Alpha 21264 (500 MHz), IBM Power chip (160 MHz), HP PA-8200 (240 MHz), Cray T90 (450 MHz) and NEC SX-4 (125 MHz): main-memory bandwidth, L2 and L1 cache sizes, register-file sizes and bandwidths, and functional units, ranging from 2 FPUs / 1 Gflops on the microprocessors up to 2 FUs of 8 pipes / 2 Gflops on the SX-4]
Peak performance and Bandwidth

[Chart: efficiency (%) vs. vector length (0 to 4000) for the IBM RS6000 and the Fujitsu VPP500 on the polynomial kernel
Z(I) = C0+A(I)*(C1+B(I)*(C2+C(I)*(C3+D(I)*(C4+E(I)*(C5+F(I)*(C6+G(I)*(C7+H(I)*(C8+K(I)*(C9+L(I))))))))))]

Measurement condition: RS6000-590 (66.6 MHz), FORTRAN77, -O3 -qarch=pwr2 -qtune=pwr2
Courtesy of Fujitsu
Vector ideas used in SS/VLIW processors
• Address prediction and prefetching
• Exploitation of data locality (the stride value is used for locality detection and exploitation)
• Predicated execution (VLIW)
• Multiply-and-add, chaining
• Multi-size operands
• Data reuse and vectorization
• Addressing modes (auto-increment)
• Multithreading (2 scalar processors in Fujitsu machines)
• Dynamic load/store elimination
Predictions for ALL instructions

[Chart: prediction accuracy (%) of last-value, stride and context-based (order 1 and 3) value predictors]

Y. Sazeides and J.E. Smith, "The predictability of data values", MICRO-30, 1997
Characterization of Vector Programs

[Chart: % vector accesses, % vectorization and average vector length across vector benchmark programs]

R. Espasa, "Advanced Vector Architectures", PhD thesis, Feb. 1997
SS’s ideas usable in vector processors
• Decoupled Vector Architectures
• Multithreaded Vector Architectures
• Out-of-order Vector Architectures
• Simultaneous Multithreaded Vector Architecture
• Victim Register File
R. Espasa, M. Valero and J.E. Smith HPCA96, HPCA97, MICRO97, ICS97...
ILP+DLP: Out-of-order Vector

[Block diagram: fetch, then decode & rename, feeding a reorder buffer; scalar (S), address (A) and vector (V) register files; a LD/ST unit connected to memory]
R. Espasa, M. Valero, J.E. Smith “Out-of-order Vector Architecture” MICRO30, 1997.
OOO Vector Performance
R. Espasa, M. Valero, J.E. Smith “Out-of-order Vector Architecture” MICRO30, 1997.
Vector Processors: The Future
• Very high-performance architectures
• Vector microprocessors
  – Numerical accelerators
  – Multimedia applications
Architectures for a Billion Transistors
• Advanced/Superspeculative Architectures
• Trace Processors
• Simultaneous Multithreading
• Multiprocessor on a chip
• RAW processors
• IRAM
Billion-Transistor Architectures, IEEE Computer, Sept. 1997
SMV
• Simultaneous Multithreaded Vector Architecture
• Mixes three paradigms:
  – DLP: vector unit
  – ILP: out-of-order execution
  – TLP: multithreaded fetch unit
• Requires a memory system with
  – high performance at low cost
  – low pin count

R. Espasa and M. Valero, "Exploiting Instruction- and Data-Level Parallelism", IEEE Micro, Sep. 1997
Billion-Transistor Vector Architecture

[Block diagram: an 8-thread SMV processor: I-cache and decode with 8 program counters and 8 rename tables (one per thread); floating-point, integer and memory queues (64 entries each); FP and integer register files (128 registers each) plus a 128-register vector register file; FPUs, ALUs, address-generation units and 4 vector functional units; a reorder buffer; all backed by memory]

R. Espasa and M. Valero, "Exploiting Instruction- and Data-Level Parallelism", IEEE Micro, Sep. 1997
SMV Performance

R. Espasa and M. Valero, "Exploiting Instruction- and Data-Level Parallelism", IEEE Micro, Sep. 1997
V-IRAM1

[Block diagram: a 2-way superscalar processor with 8K I-cache and 8K D-cache, a vector instruction queue, a vector unit (vector registers plus add, multiply, divide and load/store pipes, each configurable as 4 x 64, 8 x 32 or 16 x 16), serial and parallel I/O, and a memory crossbar switch to the on-chip DRAM banks]

D.A. Patterson, "New Directions in Computer Architecture", Berkeley, June 1998
0.18 µm, 200 MHz, 1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 32 MB
Conflict-free access to vectors

[Diagram: processors P1..Pn reach memory modules, organized into sections, through an interconnection network in each direction]

Idea: out-of-order access
M. Valero et al., ISCA 92, ISCA 95, IEEE-TC 95, ICS 92, ICS 94, ...
Command Memory System

[Diagram: processors P1..Pn send commands through an interconnection network to section controllers in front of the memory modules]

Command = <@, Length, Stride, Size>; commands are broken into bursts at the section controller
J. Corbal, R. Espasa and M. Valero, "Command-Vector Memory System", PACT-98
System configuration in 2009

[Diagram: 32 SMP (cc-NUMA) nodes, 200 TFLOPS / 160 TB in total; each node has 32 chips of 200 GFLOPS (6.4 TFLOPS/node) and 5 TB of memory behind a crossbar, with 800 GB/s of intra-node and 100 GB/s of inter-node bandwidth; sustained performance: scalar 250 GFLOPS?, vector 1 TFLOPS?]

T. Watanabe, SC98, Orlando
Vector Microprocessors
• Ways of reducing the design impact
  – Short vectors (64 x 16 words = 8 Kbytes)
  – Vector functional units shared with INT/FP units
  – Vector register renaming to allow precise exceptions
• Cache hierarchy tuned to vector execution
  – Vector data locality allows large data transactions
  – Very large bandwidth between cache and vector registers
• High performance for numerical and multimedia applications
General Architecture

[Block diagram: fetch/decode with an I-cache; scalar INT and FP units; a vector register file (VRF) connected to a vector cache; a Rambus controller driving four RDRAM channels]
Vector PC vs. Superscalar

[Bar chart: performance on Hydro2D, Dyfesm, Swm256 and Tomcatv for an out-of-order superscalar (OoO-SS) vs. vector configurations]
Cache Hierarchy
• Where should the vector cache be allocated?

[Diagram: two options, both backed by Direct Rambus memory: the vector cache (VC) beside the L2, or the VC beside the L1 and CPU]
Performance of the cache hierarchies

[Charts: flops/cycle for BDNA, FLO52 and HYDRO2D at memory widths 2, 8, 16 and 32, comparing a vector cache on L1, a vector cache on L2, and a perfect cache]
Importance of media Applications

"In the next five years (1998-2002), we believe that media processing will become the dominant force in computer architecture" (K. Diefendorff and P. K. Dubey, IEEE Computer, Sep. 97, pp. 43-45)

"90% of desktop cycles will be spent on media applications by 2000" (Scott Kirkpatrick, IBM)
Characteristics of media applications
• Examples: image/speech processing, communications, virtual reality, graphics...
• Data structures: matrices and vectors
• Data types: integer (8-32 bits), FP (32-64)
• Demand for high memory bandwidth
• Low data locality and latency problem
• No critical data dependences
• Real-time requirements
• Fine/coarse-grain parallelism
Multimedia Applications and Architectures

[Diagram: scientific and multimedia applications mapped onto three approaches: superscalar (re-discovers the parallelism at run time using a lot of hardware), VLIW + MMX (simple hardware, but loss of parallelism; as many instructions as the superscalar approach), and vector architectures (a natural way to express and execute DLP applications)]
MMX-like processors
• Multimedia extensions are designed to exploit the parallelism inherent in multimedia applications
• Targeted to leverage full compatibility with existing operating systems and applications, plus minimum chip-area investment
• The highlights of multimedia extensions are:
  – Single Instruction, Multiple Data (SIMD) techniques
  – New data types (multimedia vectors, 32/64 bits)
  – Multimedia registers
  – SIMD-like instructions over small integer data types
MMX instruction example
• PADDW: parallel ADD of 4 x 16-bit data with wraparound (no saturation)

[Diagram: four 16-bit lanes (bits 0, 15, 31, 47, 63) added in parallel; e.g. the lane holding 0xFFFF plus the lane holding 0x0006 yields 0x0005 (wraparound)]
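The wraparound semantics of PADDW can be modeled in a few lines of Python: each 16-bit lane is added modulo 2^16, so 0xFFFF + 0x0006 wraps to 0x0005. The saturating variant shown for contrast clamps instead of wrapping; its name here is illustrative (the real MMX saturating adds are PADDSW/PADDUSW).

```python
def paddw(a_lanes, b_lanes):
    """Parallel add of 16-bit lanes with wraparound (no saturation)."""
    return [(a + b) & 0xFFFF for a, b in zip(a_lanes, b_lanes)]

def padd_saturate(a_lanes, b_lanes):
    """Unsigned saturating variant for contrast: clamp to 0xFFFF
    instead of wrapping (hypothetical name, modeled on PADDUSW)."""
    return [min(a + b, 0xFFFF) for a, b in zip(a_lanes, b_lanes)]

a = [0x0001, 0x0002, 0x0003, 0xFFFF]
b = [0x0010, 0x0020, 0x0030, 0x0006]
print([hex(v) for v in paddw(a, b)])          # last lane wraps to 0x5
print([hex(v) for v in padd_saturate(a, b)])  # last lane clamps to 0xffff
```

One instruction, four independent adds: the same DLP idea as a vector machine, but with the vector length fixed by the register width.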
Superscalar Multimedia Processors

Feature          PowerPC AltiVec  Intel MMX  Sun VIS  MIPS V/MDMX  HP MAX2  Alpha MVI
Register file    32*128           8*64       32*64    32*64        32*64    32*64
Mapped onto      Separate         FP         FP       FP           Integer  Integer
Data sizes       8/16/32          8/16/32    8/16/32  8/16 bit     16/32    8 bit
FP support       Yes              MMX2       No       MIPS V       No       No
Usual stuff      Lots             Lots       Lots     Lots         Some     None
Multiply/MAC     Lots             Mult       Mult     Lots         Some     None
Min/Max/Avg      Yes              No         No       Min/Max      Avg      Min/Max
Pack/Unpack      Yes              Yes        Yes      Yes          Yes      Yes
Byte reordering  All              Some       Some     Many         All      None
Unaligned data   3 Inst.          No         2 Inst.  Yes          No       No
Announced        2Q98             2Q96       4Q94     4Q96         4Q95     4Q96

Microprocessor Report, Vol. 12, No. 6, May 11, 1998
Multimedia Applications and Architectures

[Diagram repeated: scientific and multimedia applications vs. superscalar (re-discovers the parallelism at run time using a lot of hardware), VLIW + MMX (simple hardware, but loss of parallelism; as many instructions as the superscalar approach), and vector architectures (a natural way to express and execute DLP applications)]
Multimedia Embedded Systems
• NEC V830R/AV includes MIX2, a multimedia instruction extension (SIMD, MMX-like approach)
• Hitachi SH4 includes FP 4-length vector instructions, targeted at geometry transformation in 3D rendering applications
• ARM10 Thumb Family processors will include a vector FP unit capable of delivering 600 MFLOPS
Wider is better... (?)
• Most multimedia algorithms exhibit vectors no longer than 8/16 elements, so widening the multimedia registers may provide diminishing returns.

[Diagram: a single 16-bit add; a 64-bit register split into four 16-bit lanes added in parallel; and a 128-bit register split into eight 16-bit lanes added in parallel]
VLIW: Widening vs. Replication

[Diagram: four bus configurations between memory and the register file: one 1-word bus; two 1-word busses (replication); one 2-word bus (widening); two 2-word busses]

D. López et al., "Increasing Memory Bandwidth with Wide Busses", ICS-97
Widening and Replication Performance

[Chart: performance for 2, 4, 8 and 16 functional units with wide-1, wide-2 and wide-4 bus configurations]

D. López et al., "Widening versus replicating...", ICS-98, MICRO-98
Multimedia Applications and Architectures

[Diagram repeated: scientific and multimedia applications vs. superscalar, VLIW + MMX, and vector architectures, the latter being a natural way to express and execute DLP applications]
Torrent T0 Microprocessor
• The first single-chip vector microprocessor
• Can sustain over 24 operations per cycle with an issue rate of only one 32-bit instruction per cycle
• Features:
  – 16 vector registers (32 32-bit elements each)
  – 2 vector arithmetic units (8 pipes each)
  – Reconfigurable composite operation pipelines
  – 128-bit wide external memory interface
  – MIPS-II 32-bit instruction set, scalar unit

K. Asanovic et al., "The T0 vector microprocessor", Hot Chips VII, 1995
Torrent T0 Microprocessor

K. Asanovic et al., "The T0 vector microprocessor", Hot Chips VII, 1995
Vector versus Superscalar Processors
• Comparison of die area (in mm², scaled to a 0.25 µm process), split into control, registers and datapath:

Torrent-0           14.73
Alpha 21164         21.86
UltraSPARC II       37.77
MIPS R10000         66.92
HP PA-8000          67.77
Alpha 21264         69.81
6-way OoO, ROB-128  250.0

C. G. Lee and D. J. DeVries, "Initial Results on ...", MICRO-30, 1997
Vector versus Superscalar Processors
• Component percentages

[Bar chart: percentage of die area devoted to datapath, registers and control for Torrent-0, Alpha 21164, UltraSPARC II, MIPS R10000, HP PA-8000, Alpha 21264 and a 6-way OoO with a 128-entry ROB]

C. G. Lee and D. J. DeVries, "Initial Results on ...", MICRO-30, 1997
Imagine project
• Focused on developing a programmable architecture that achieves performance similar to special-purpose hardware on graphics and image processing
• Matches media applications' demands to current VLSI capabilities by using a stream-based programming model
• Most multimedia kernels exhibit a streaming nature
• Individual stream elements can be operated on in parallel, thus exploiting data parallelism

Bill Dally, "Tomorrow's Computing Engines", keynote, HPCA-4, 1998
Imagine architecture
• Organized around a large stream register file (64Kb)
• Memory operations move entire streams of data
• Data streams pass through a set of arithmetic clusters (8)
• Each cluster operates on a single element under VLIW control

[Block diagram: four SDRAM channels feed a streaming memory system; the stream register file connects to clusters 0 through 7, sequenced by a controller]

Bill Dally, "Tomorrow's Computing Engines", keynote, HPCA-4, 1998
Matrix extensions for multimedia
• By combining the conventional vector approach with SIMD MMX-like instructions, we can exploit additional levels of DLP with matrix-oriented multimedia extensions.

[Diagram: a scalar 16-bit add; a packed add over one 64-bit register (four 16-bit lanes, A1..A4 + B1..B4 = C1..C4); and a matrix add over four 64-bit rows, sixteen 16-bit lanes (A1..A16 + B1..B16 = C1..C16) in one instruction]
Relative Performance

[Bar charts: relative performance of MMX, MDMX and MOM for 1-, 2-, 4- and 8-way issue, on the inverse DCT transform, MPEG-2 motion estimation, and RGB-YCC color conversion]
Applications and Architectures

[Diagram: an integer-only core runs numerical applications very slowly (through subroutines); adding an FPU is a very big improvement; adding a vector FPU (VFPU) on top gives additional speed]
Future Applications
• Integer (SPEC-like)
• Commercial (OLTP, DSS)
• Numerical
• Multimedia
Acknowledgments
• Roger Espasa
• James E. Smith
• Luis A. Villa
• Francisca Quintana
• Jesús Corbal
• David López
• Josep Llosa
• Eduard Ayguadé
• Krste Asanovic
• William Dally
• Christoforos E. Kozyrakis
• Corinna G. Lee
• David A. Patterson
• Steve Wallace
The End