Lecture 29 Computer Arch. From:
Supercomputing in Plain English
Part VII: Multicore Madness
Henry Neeman, Director
OU Supercomputing Center for Education & Research
University of Oklahoma
Wednesday October 17 2007
Outline
- The March of Progress
- Multicore/Many-core Basics
- Software Strategies for Multicore/Many-core
- A Concrete Example: Weather Forecasting
The March of Progress
OU’s TeraFLOP Cluster, 2002: boomer.oscer.ou.edu
- 10 racks @ 1000 lbs per rack
- 270 Pentium4 Xeon CPUs, 2.0 GHz, 512 KB L2 cache
- 270 GB RAM, 400 MHz FSB
- 8 TB disk
- Myrinet2000 interconnect
- 100 Mbps Ethernet interconnect
- OS: Red Hat Linux
- Peak speed: 1.08 TFLOP/s (1.08 trillion calculations per second)
- One of the first Pentium4 clusters!
TeraFLOP, Prototype 2006, Sale 2011
http://news.com.com/2300-1006_3-6119652.html
9 years from room to chip!
Moore’s Law
In 1965, Gordon Moore was an engineer at Fairchild Semiconductor.
He noticed that the number of transistors that could be squeezed onto a chip was doubling about every 18 months.
It turns out that computer speed is roughly proportional to the number of transistors per unit area.
Moore wrote a paper about this concept, which became known as “Moore’s Law.”
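As a quick worked example (my arithmetic, not on the slide), the doubling rule compounds fast:

$$\text{growth over } t \text{ years} = 2^{t/1.5}, \qquad \text{e.g. } 2^{9/1.5} = 2^{6} = 64\times$$

over the 9 years from room to chip mentioned above.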
Moore’s Law in Practice

[Chart, built up over five slides: log(Speed) vs. Year, adding one curve per slide: CPU, Network Bandwidth, RAM, 1/Network Latency, Software.]
Fastest Supercomputer vs. Moore

[Chart: speed in GFLOP/s, log scale from 1 to 1,000,000, vs. Year (1992-2008), comparing the fastest supercomputer in the world against the Moore’s Law trend.]

www.top500.org
The Tyranny of the Storage Hierarchy
The Storage Hierarchy
From fast, expensive, and few to slow, cheap, and plentiful:
- Registers
- Cache memory
- Main memory (RAM)
- Hard disk
- Removable media (e.g., DVD)
- Internet
RAM is Slow

[Diagram: the CPU calculates at 351 GB/sec [7], but the bottleneck to main memory transfers only 10.66 GB/sec [9] (3%).]

The speed of data transfer between main memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out.
Why Have Cache?

[Diagram: CPU at 351 GB/sec [7]; cache at 253 GB/sec [8] (72%); main memory at 10.66 GB/sec [9] (3%).]

Cache is nearly the same speed as the CPU, so the CPU doesn’t have to wait nearly as long for stuff that’s already in cache: it can do more operations per second!
Storage Use Strategies
- Register reuse: Do a lot of work on the same data before working on new data.
- Cache reuse: The program is much more efficient if all of the data and instructions fit in cache; if not, try to use what’s in cache a lot before using anything that isn’t in cache.
- Data locality: Try to access data that are near each other in memory before data that are far (see the sketch below).
- I/O efficiency: Do a bunch of I/O all at once rather than a little bit at a time; don’t mix calculations and I/O.
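Here is a minimal sketch of data locality (my illustration, not from the slides): Fortran stores arrays column-major, so putting the row index in the innermost loop walks memory contiguously; swapping the loops would stride across whole columns and waste cache.

PROGRAM locality_demo
  IMPLICIT NONE
  INTEGER,PARAMETER :: n = 1000
  REAL,DIMENSION(n,n) :: a
  INTEGER :: r, c

! Fortran arrays are column-major: a(r,c) and a(r+1,c) are adjacent in
! memory, so the row index r belongs in the innermost loop.
  DO c = 1, n
    DO r = 1, n
      a(r,c) = REAL(r + c)
    END DO
  END DO
  PRINT *, 'a(n,n) = ', a(n,n)
END PROGRAM locality_demo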
A Concrete Example
- OSCER’s big cluster, topdawg, has Irwindale CPUs: single core, 3.2 GHz, 800 MHz Front Side Bus.
- The theoretical peak CPU speed is 6.4 GFLOPs (double precision) per CPU, and in practice we’ve gotten as high as 94% of that.
- So, in theory each CPU could consume 143 GB/sec.
- The theoretical peak RAM bandwidth is 6.4 GB/sec, but in practice we get about half that.
- So, any code that does less than 45 calculations per byte transferred between RAM and cache has speed limited by RAM bandwidth.
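A quick check of the 45 figure (my arithmetic, using the numbers above): demand is about 143 GB/sec, supply is about half of 6.4 GB/sec, so

$$\frac{143\ \text{GB/sec}}{3.2\ \text{GB/sec}} \approx 45,$$

i.e., a code must perform roughly 45 calculations per byte it moves between RAM and cache to keep the CPU, rather than RAM, the limiting factor.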
Good Cache Reuse Example
A Sample Application
Matrix-Matrix Multiply

Let A, B and C be matrices of sizes nr × nc, nr × nk and nk × nc, respectively:

$$A = \begin{bmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,nc} \\
a_{2,1} & a_{2,2} & a_{2,3} & \cdots & a_{2,nc} \\
a_{3,1} & a_{3,2} & a_{3,3} & \cdots & a_{3,nc} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{nr,1} & a_{nr,2} & a_{nr,3} & \cdots & a_{nr,nc}
\end{bmatrix}$$

$$B = \begin{bmatrix}
b_{1,1} & b_{1,2} & b_{1,3} & \cdots & b_{1,nk} \\
b_{2,1} & b_{2,2} & b_{2,3} & \cdots & b_{2,nk} \\
b_{3,1} & b_{3,2} & b_{3,3} & \cdots & b_{3,nk} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
b_{nr,1} & b_{nr,2} & b_{nr,3} & \cdots & b_{nr,nk}
\end{bmatrix}$$

$$C = \begin{bmatrix}
c_{1,1} & c_{1,2} & c_{1,3} & \cdots & c_{1,nc} \\
c_{2,1} & c_{2,2} & c_{2,3} & \cdots & c_{2,nc} \\
c_{3,1} & c_{3,2} & c_{3,3} & \cdots & c_{3,nc} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c_{nk,1} & c_{nk,2} & c_{nk,3} & \cdots & c_{nk,nc}
\end{bmatrix}$$

The definition of A = B • C is

$$a_{r,c} \;=\; \sum_{k=1}^{nk} b_{r,k}\, c_{k,c} \;=\; b_{r,1} c_{1,c} + b_{r,2} c_{2,c} + b_{r,3} c_{3,c} + \cdots + b_{r,nk} c_{nk,c}$$

for r ∈ {1, …, nr}, c ∈ {1, …, nc}.
Matrix Multiply: Naïve Version

SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
 &                                   nr, nc, nq)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN) :: src2

  INTEGER :: r, c, q

  DO c = 1, nc
    DO r = 1, nr
      dst(r,c) = 0.0
      DO q = 1, nq
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_naive
Performance of Matrix Multiply

[Graph: "Matrix-Matrix Multiply" – CPU seconds (0 to 800) vs. total problem size in bytes (nr*nc + nr*nq + nq*nc, 0 to 60,000,000), with an "Init" series for comparison; lower is better.]
Tiling
Tiling
- Tile: A small rectangular subdomain of a problem domain. Sometimes called a block or a chunk.
- Tiling: Breaking the domain into tiles.
- Tiling strategy: Operate on each tile to completion, then move to the next tile.
- Tile size can be set at runtime, according to what’s best for the machine that you’re running on.
Tiling Code

SUBROUTINE matrix_matrix_mult_by_tiling (dst, src1, src2, nr, nc, nq, &
 &           rtilesize, ctilesize, qtilesize)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN) :: src2
  INTEGER,INTENT(IN) :: rtilesize, ctilesize, qtilesize

  INTEGER :: rstart, rend, cstart, cend, qstart, qend

  DO cstart = 1, nc, ctilesize
    cend = cstart + ctilesize - 1
    IF (cend > nc) cend = nc
    DO rstart = 1, nr, rtilesize
      rend = rstart + rtilesize - 1
      IF (rend > nr) rend = nr
      DO qstart = 1, nq, qtilesize
        qend = qstart + qtilesize - 1
        IF (qend > nq) qend = nq
        CALL matrix_matrix_mult_tile(dst, src1, src2, nr, nc, nq, &
 &           rstart, rend, cstart, cend, qstart, qend)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_by_tiling
Multiplying Within a Tile

SUBROUTINE matrix_matrix_mult_tile (dst, src1, src2, nr, nc, nq, &
 &           rstart, rend, cstart, cend, qstart, qend)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN) :: src2
  INTEGER,INTENT(IN) :: rstart, rend, cstart, cend, qstart, qend

  INTEGER :: r, c, q

  DO c = cstart, cend
    DO r = rstart, rend
      IF (qstart == 1) dst(r,c) = 0.0
      DO q = qstart, qend
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_tile
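As a usage sketch (mine, not from the slides; the driver name and tile sizes are invented placeholders), the tiled version might be called like this, with tile sizes chosen at run time as the Tiling slide suggests:

PROGRAM tiling_driver
  IMPLICIT NONE
  INTEGER,PARAMETER :: nr = 512, nc = 256, nq = 512
  REAL,DIMENSION(nr,nc) :: dst
  REAL,DIMENSION(nr,nq) :: src1
  REAL,DIMENSION(nq,nc) :: src2
  INTEGER :: rtilesize, ctilesize, qtilesize

  CALL RANDOM_NUMBER(src1)
  CALL RANDOM_NUMBER(src2)
! Placeholder tile sizes: tune to the cache (and, per the TLB slides
! later, to what the TLB can cover); setting them equal to nr, nc, nq
! turns tiling off entirely.
  rtilesize = 64
  ctilesize = 64
  qtilesize = 64
  CALL matrix_matrix_mult_by_tiling(dst, src1, src2, nr, nc, nq, &
 &                                  rtilesize, ctilesize, qtilesize)
  PRINT *, 'dst(1,1) = ', dst(1,1)
END PROGRAM tiling_driver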
Reminder: Naïve Version, Again

SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
 &                                   nr, nc, nq)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN) :: src2

  INTEGER :: r, c, q

  DO c = 1, nc
    DO r = 1, nr
      dst(r,c) = 0.0
      DO q = 1, nq
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_naive
Performance with Tiling
[Two graphs: "Matrix-Matrix Multiply Via Tiling", on linear axes (CPU sec, 0 to 250) and on log-log axes (CPU sec, 0.1 to 1000), vs. tile size in bytes (10 to 1E+08), with one curve per matrix size: 512x256, 512x512, 1024x512, 1024x1024, 2048x1024; lower is better.]
The Advantages of Tiling
- It allows your code to exploit data locality better, to get much more cache reuse: your code runs faster!
- It’s a relatively modest amount of extra coding (typically a few wrapper functions and some changes to loop bounds).
- If you don’t need tiling – because of the hardware, the compiler or the problem size – then you can turn it off by simply setting the tile size equal to the problem size.
Why Does Tiling Work Here?
Cache optimization works best when the number of calculations per byte is large.
For example, with matrix-matrix multiply on an n × n matrix, there are O(n^3) calculations (on the order of n^3), but only O(n^2) bytes of data.
So, for large n, there are a huge number of calculations per byte transferred between RAM and cache.
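To make that concrete (my arithmetic, not on the slide): for n = 1000 with 4-byte REALs,

$$\text{calculations} \approx 2n^3 = 2\times 10^9, \qquad \text{data} = 3n^2 \times 4\ \text{bytes} = 12\ \text{MB}, \qquad \frac{2\times 10^9}{1.2\times 10^7\ \text{bytes}} \approx 167\ \text{calculations per byte},$$

comfortably above the roughly 45 calculations per byte that topdawg needs.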
Multicore/Many-core Basics
What is Multicore?
In the olden days (i.e., the first half of 2005), each CPU chip had one “brain” in it.
More recently, each CPU chip has 2 cores (brains), and, starting in late 2006, 4 cores.
Jargon: Each CPU chip plugs into a socket, so these days, to avoid confusion, people refer to sockets and cores, rather than CPUs or processors.
Each core is just like a full blown CPU, except that it shares its socket with one or more other cores – and therefore shares its bandwidth to RAM.
Dual Core
[Diagram: two cores on one chip.]

Quad Core
[Diagram: four cores on one chip.]

Oct Core
[Diagram: eight cores on one chip.]
The Challenge of Multicore: RAM
- Each socket has access to a certain amount of RAM, at a fixed RAM bandwidth per SOCKET.
- As the number of cores per socket increases, the contention for RAM bandwidth increases too.
- At 2 cores in a socket, this problem isn’t too bad. But at 16 or 32 or 80 cores, it’s a huge problem.
- So, applications that are cache optimized will get big speedups.
- But, applications whose performance is limited by RAM bandwidth are going to speed up only as fast as RAM bandwidth speeds up.
- RAM bandwidth speeds up much more slowly than CPUs do.
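As a rough illustration (my extrapolation from the 10.66 GB/sec socket bandwidth quoted earlier), the bandwidth each core sees shrinks linearly with the core count:

$$\frac{10.66\ \text{GB/sec}}{2\ \text{cores}} \approx 5.3\ \text{GB/sec per core}, \qquad \frac{10.66\ \text{GB/sec}}{16\ \text{cores}} \approx 0.67\ \text{GB/sec per core}.$$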
The Challenge of Multicore: Network
- Each node has access to a certain number of network ports, at a fixed number of network ports per NODE.
- As the number of cores per node increases, the contention for network ports increases too.
- At 2 cores in a socket, this problem isn’t too bad. But at 16 or 32 or 80 cores, it’s a huge problem.
- So, applications that do minimal communication will get big speedups (one way to cut the message count is sketched below).
- But, applications whose performance is limited by the number of MPI messages are going to speed up very, very little – and may even crash the node.
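One common mitigation, sketched below (my example, not from the slides; the ranks, sizes, and tag are invented): aggregate many small messages into one large send, so the node’s shared network ports handle one message instead of a thousand.

PROGRAM aggregate_messages
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER,PARAMETER :: nchunks = 1000, chunklen = 10
  REAL,DIMENSION(nchunks*chunklen) :: buffer
  INTEGER :: ierr, rank

  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  IF (rank == 0) THEN
    buffer = 1.0
!   One big send instead of nchunks small ones.
    CALL MPI_SEND(buffer, nchunks*chunklen, MPI_REAL, 1, 0, &
 &                MPI_COMM_WORLD, ierr)
  ELSE IF (rank == 1) THEN
    CALL MPI_RECV(buffer, nchunks*chunklen, MPI_REAL, 0, 0, &
 &                MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  END IF
  CALL MPI_FINALIZE(ierr)
END PROGRAM aggregate_messages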
Multicore/Many-core Problem
Most multicore chip families have relatively small cache per core (e.g., 2 MB) – and this problem seems likely to remain.
Small TLBs make the problem worse: the TLB can cover only 512 KB per core rather than the full 2 MB.
So, to get good cache reuse, you need to partition your algorithm so that each subproblem needs no more than 512 KB.
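For the matrix multiply above, a back-of-the-envelope bound (my arithmetic): three square t × t tiles of 4-byte REALs must fit in 512 KB,

$$3 \times 4t^2 \le 524288\ \text{bytes} \;\Rightarrow\; t \le \sqrt{524288/12} \approx 209,$$

so tiles of roughly 200 × 200 or smaller stay within what the TLB can cover.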
The T.L.B. on a Current Chip
On Intel Core Duo (“Yonah”):
- Cache size is 2 MB per core.
- Page size is 4 KB.
- A core’s data TLB size is 128 page table entries.
- Therefore, the D-TLB only covers 512 KB of cache.
- The cost of a TLB miss is 49 cycles, equivalent to as many as 196 calculations! (4 FLOPs per cycle)
http://www.digit-life.com/articles2/cpu/rmma-via-c7.html
What Do We Need?
- We need much bigger caches!
- TLB must be big enough to cover the entire cache.
- It’d be nice to have RAM speed increase as fast as core counts increase, but let’s not kid ourselves.
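For the Yonah numbers above (my arithmetic): covering the whole 2 MB cache with 4 KB pages would take 2 MB / 4 KB = 512 TLB entries – four times the 128 entries the chip actually has.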
To Learn More Supercomputing
http://www.oscer.ou.edu/education.php
Computer Architecture Fall 2007
Lecture 29: CMPs & SMTs
Adapted from Mary Jane Irwin (www.cse.psu.edu/~mji)
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]
Multithreading on a Chip
- Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions
- Multithreading – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor
  - Processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer for each thread
  - The caches, TLBs, BHT, BTB can be shared (although the miss rates may increase if they are not sized accordingly)
  - The memory can be shared through virtual memory mechanisms
  - Hardware must support efficient thread context switching
Types of Multithreading on a Chip
- Fine-grain – switch threads on every instruction issue
  - Round-robin thread interleaving (skipping stalled threads; see the sketch below)
  - Processor must be able to switch threads on every clock cycle
  - Advantage – can hide throughput losses that come from both short and long stalls
  - Disadvantage – slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads
- Coarse-grain – switch threads only on costly stalls (e.g., L2 cache misses)
  - Advantages – thread switching doesn’t have to be essentially free, and it is much less likely to slow down the execution of an individual thread
  - Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss
    - Pipeline must be flushed and refilled on thread switches
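A minimal sketch of fine-grain round-robin thread selection that skips stalled threads (my illustration, in Fortran to match the rest of this document’s code; all names are invented):

INTEGER FUNCTION next_thread(current, stalled, nthreads)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: current, nthreads
  LOGICAL,DIMENSION(nthreads),INTENT(IN) :: stalled
  INTEGER :: candidate, i

  next_thread = 0                  ! 0 means no thread is ready this cycle
  DO i = 1, nthreads
!   Walk round-robin, starting with the thread after the current one.
    candidate = MOD(current - 1 + i, nthreads) + 1
    IF (.NOT. stalled(candidate)) THEN
      next_thread = candidate
      RETURN
    END IF
  END DO
END FUNCTION next_thread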
Multithreaded Example: Sun’s Niagara (UltraSparc T1)
Eight fine-grain multithreaded, single-issue, in-order cores (no speculation, no dynamic branch prediction)

                 Ultra III              Niagara
Data width       64-b                   64-b
Clock rate       1.2 GHz                1.0 GHz
Cache (I/D/L2)   32K/64K/(8M external)  16K/8K/3M
Issue rate       4 issue                1 issue
Pipe stages      14 stages              6 stages
BHT entries      16K x 2-b              None
TLB entries      128I/512D              64I/64D
Memory BW        2.4 GB/s               ~20 GB/s
Transistors      29 million             200 million
Power (max)      53 W                   <60 W

[Diagram: eight 4-way MT SPARC pipes connected through a crossbar to a 4-way banked L2$, the memory controllers, and shared I/O functions.]
Niagara Integer Pipeline
Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient
[Diagram: the six pipeline stages – Fetch, Thrd Sel, Decode, Execute, Memory, WB – with I$/ITLB and 4 instruction buffers feeding a thread-select mux; PC logic x4, decode, and RegFile x4; ALU/Mul/Shft/Div execution units; D$/DTLB with 4 store buffers; and a crossbar interface. The thread select logic is driven by instruction type, cache misses, traps & interrupts, and resource conflicts. From MPR, Vol. 18, #9, Sept. 2004]
Simultaneous Multithreading (SMT)
- A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP)
  - Most SS processors have more machine level parallelism than most programs can effectively use (i.e., than have ILP)
  - With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
    - Need separate rename tables (ROBs) for each thread
    - Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle
- Intel’s Pentium 4 SMT is called hyperthreading
  - Supports just two threads (doubles the architecture state)
Threading on a 4-way SS Processor Example
[Diagram: issue slots over time for coarse MT, fine MT, and SMT on a 4-way superscalar processor, using four threads A, B, C, and D.]
Multicore: Xbox360 – “Xenon” processor
- Goal: to provide game developers with a balanced and powerful platform
- Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache
  - 165M transistors total
  - 3.2 GHz
  - Near-POWER ISA
  - 2-issue, 21 stage pipeline, with 128 128-bit registers
  - Weak branch prediction – supported by software hinting
  - In-order instructions
  - Narrow cores – 2 INT units, 2 128-bit VMX units, 1 of anything else
- An ATI-designed 500 MHz GPU with 512MB of DDR3 DRAM
  - 337M transistors, 10MB framebuffer
  - 48 pixel shader cores, each with 4 ALUs
Xenon Diagram
[Diagram: three cores, each with L1D and L1I caches, sharing a 1MB UL2; a BIU/IO interface connects to the GPU (3D core, 10MB EDRAM, video out) and, through memory controllers MC0 and MC1, to 512MB of DRAM; an analog chip, XMA decoder, and SMC serve the DVD, HDD port, front USBs (2), wireless MU ports (2 USBs), rear USB (1), Ethernet, IR, audio out, flash, and systems control.]
The PS3 “Cell” Processor Architecture
- Composed of a non-SMP architecture
  - 234M transistors @ 4 GHz
  - 1 Power Processing Element, 8 “Synergistic” (SIMD) PEs
  - 512KB L2 cache; a massively high-bandwidth (200GB/s) bus connects it to everything else
- The PPE is strangely similar to one of the Xenon cores
  - Almost identical, really: slight ISA differences, and fine-grained MT instead of real SMT
- The real differences lie in the SPEs (21M transistors each)
  - An attempt to ‘fix’ the memory latency problem by giving each processor complete control over its own 256KB “scratchpad” – 14M transistors
    - Direct mapped for low latency
  - 4 vector units per SPE, 1 of everything else – 7M transistors
How to make use of the SPEs
[Diagram not reproduced in the transcript.]
What about the Software?
- Makes use of a special IBM “Hypervisor”
  - Like an OS for OSes
  - Runs both a real-time OS (for sound) and a non-real-time OS (for things like AI)
- Software must be specially coded to run well
  - The single PPE will quickly be bogged down
  - Must make use of SPEs wherever possible
  - This isn’t easy, by any standard
- What about Microsoft?
  - The development suite identifies which 6 threads you’re expected to run
  - Four of them are DirectX based, and handled by the OS
  - Only need to write two threads, functionally
http://ps3forums.com/showthread.php?t=22858
Next Lecture and Reminders
Reminders: Final is Thursday, December 13, from 10-11:50 AM in ITT 322