Multi-This, Multi-That, …
Transcript of Multi-This, Multi-That, …
Limits on IPC
• Lam92
  – This paper focused on the impact of control flow on ILP
  – Speculative execution can expose 10-400 IPC
  – assumes no machine limitations except for control dependencies and actual dataflow dependencies
• Wall91
  – This paper looked at limits more broadly
    • No branch prediction, no register renaming, no memory disambiguation: 1-2 IPC
    • ∞-entry bpred, 256 physical registers, perfect memory disambiguation: 4-45 IPC
    • perfect bpred, register renaming and memory disambiguation: 7-60 IPC
  – This paper did not consider “control independent” instructions
Practical Limits
• Today, 1-2 IPC sustained
  – far from the 10’s-100’s reported by limit studies
• Limited by:
  – branch prediction accuracy
  – underlying DFG (influenced by algorithms, compiler)
  – memory bottleneck
  – design complexity (implementation, test, validation, manufacturing, etc.)
  – power
  – die area
Differences Between Real Hardware and Limit Studies?
• Real branch predictors aren’t 100% accurate
• Memory disambiguation is not perfect
• Physical resources are limited
  – can’t have infinite register renaming w/o an infinite PRF
  – would need an infinite-entry ROB, RS and LSQ
  – would need 10’s-100’s of execution units for 10’s-100’s of IPC
• Bandwidth/latencies are limited
  – studies assumed single-cycle execution
  – infinite fetch/commit bandwidth
  – infinite memory bandwidth (perfect caching)
Bridging the Gap
[Figure: log-scale IPC (1, 10, 100) for Single-Issue Pipelined, Superscalar Out-of-Order (Today), a hypothetical aggressive Superscalar Out-of-Order, and the Limits. Diminishing returns w.r.t. larger instruction window and higher issue-width; power (Watts) has been growing exponentially as well.]
Past the Knee of the Curve?
[Figure: Performance vs. “Effort” curve through Scalar In-Order, Moderate-Pipe Superscalar/OOO, and Very-Deep-Pipe Aggressive Superscalar/OOO. It made sense to go Superscalar/OOO (good ROI), but past the knee of the curve there is very little gain for substantial effort.]
So how do we get more Performance?
• Keep pushing IPC and/or frequency?
  – possible, but too costly
    • design complexity (time to market), cooling (cost), power delivery (cost), etc.
• Look for other parallelism
  – ILP/IPC: fine-grained parallelism
  – Multi-programming: coarse-grained parallelism
    • assumes multiple user-visible processing elements
    • all parallelism up to this point was user-invisible
User Visible/Invisible
• All microarchitecture performance gains up to this point were “free”
  – in that no user intervention was required beyond buying the new processor/system
  – recompilation/rewriting could provide even more benefit, but you get some even if you do nothing
• Multi-processing pushes the problem of finding the parallelism above the ISA interface
Workload Benefits
[Figure: runtime comparison with two tasks. A 4-wide OOO CPU running Tasks A and B back-to-back is only slightly faster than a 3-wide OOO CPU (small benefit); two 3-wide OOO CPUs run Tasks A and B in parallel and finish much sooner; even two 2-wide OOO CPUs beat the single wider core. This assumes you have two tasks/programs to execute…]
… If Only One Task
[Figure: the same configurations with only Task A to run. The 4-wide CPU gives a small benefit over the 3-wide; with two 3-wide CPUs, one sits idle (no benefit over 1 CPU); with two 2-wide CPUs, Task A runs on a single narrower core: performance degradation!]
Sources of (Coarse) Parallelism
• Different applications
  – MP3 player in background while you work on Office
  – Other background tasks: OS/kernel, virus check, etc.
  – Piped applications
    • gunzip -c foo.gz | grep bar | perl some-script.pl
• Within the same application
  – Java (scheduling, GC, etc.)
  – Explicitly coded multi-threading
    • pthreads, MPI, etc.
(Execution) Latency vs. Bandwidth
• Desktop processing
  – typically want an application to execute as quickly as possible (minimize latency)
• Server/Enterprise processing
  – often throughput oriented (maximize bandwidth)
  – latency of an individual task is less important
    • ex. Amazon processing thousands of requests per minute: it’s OK if an individual request takes a few seconds longer, so long as the total number of requests is processed in time
Benefit of MP Depends on Workload
• Limited number of parallel tasks to run on a PC
  – adding more CPUs than tasks provides zero performance benefit
• Even for parallel code, Amdahl’s Law will likely result in sub-linear speedup
[Figure: runtime bars for 1, 2, 3 and 4 CPUs; only the parallelizable portion shrinks as CPUs are added, while the serial portion stays fixed.]
• In practice, the parallelizable portion may not be evenly divisible
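Amdahl’s Law makes the sub-linear speedup concrete. A small sketch (the 75% parallel fraction is an assumed example, not from the slides):

```python
def amdahl_speedup(p, n):
    """Speedup with parallel fraction p on n CPUs: 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# Assume 75% of the work is parallelizable.
for n in (1, 2, 3, 4):
    print(n, round(amdahl_speedup(0.75, n), 2))
# prints 1.0, 1.6, 2.0, 2.29 -- already well short of 4x on 4 CPUs
```

Even with an infinite number of CPUs, the speedup here is capped at 1/(1-0.75) = 4x.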
Basic Models for Parallel Programs
• Shared Memory
  – there’s some portion of memory (maybe all) which is shared among all cooperating processes (threads)
  – communicate by reading/writing to shared locations
• Message Passing
  – no memory is shared
  – explicitly send a message to/from threads to communicate data/control
Shared Memory Model
• That’s basically it…
  – need to fork/join threads, synchronize (typically with locks)
[Figure: CPU0 writes X and CPU1 reads X through a shared Main Memory.]
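A minimal shared-memory sketch using Python threads in place of the two CPUs (the variable name `X` matches the figure; the lock stands in for the synchronization the slide mentions):

```python
import threading

X = 0                       # the shared location from the figure
lock = threading.Lock()
result = []

def writer():
    global X
    with lock:              # synchronize: only one thread touches X at a time
        X = 42              # "Write X"

def reader():
    with lock:
        result.append(X)    # "Read X" observes the shared location

t0 = threading.Thread(target=writer)
t1 = threading.Thread(target=reader)
t0.start(); t0.join()       # join before reading so the write is visible
t1.start(); t1.join()
print(result)               # [42]
```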
Cache Coherency
[Figure: CPU0 writes X and CPU1 reads X, each through a private L2. After the write, other copies (if any) are now stale: the protocol must either invalidate or update the other copies. Normally main memory need not be modified; requests are serviced by other CPUs (memory is only updated on evict/writeback). CPU0 has the most recent value, so it services the load request instead of going to main memory.]
Cache Coherency Protocols
• Not covered in this course
  – Milos’ 8803 next semester will cover this in more detail
• Many different protocols
  – different number of states
  – different bandwidth/performance/complexity tradeoffs
  – current protocols usually referred to by their states
    • ex. MESI, MOESI, etc.
Message Passing Protocols
• Explicitly send data from one thread to another
  – need to track the IDs of other CPUs
  – broadcast may need multiple send’s
  – each CPU has its own memory space
• Hardware: send/recv queues between CPUs
[Figure: CPU0’s Send queue feeding CPU1’s Recv queue.]
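The send/recv queues can be sketched with Python threads standing in for the two CPUs (the queue object models the hardware queue between them; nothing is shared except the channel itself):

```python
import queue
import threading

to_cpu1 = queue.Queue()     # the send/recv queue from CPU0 to CPU1
received = []

def cpu0():
    to_cpu1.put(("X", 42))          # Send(X, 42): no shared variables

def cpu1():
    name, value = to_cpu1.get()     # Recv(): blocks until a message arrives
    received.append((name, value))

t1 = threading.Thread(target=cpu1)
t0 = threading.Thread(target=cpu0)
t1.start(); t0.start()
t0.join(); t1.join()
print(received)                     # [('X', 42)]
```

A broadcast would need one `put` per destination queue, matching the “broadcast may need multiple send’s” bullet above.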
You Can Fake One on the Other
[Figure: Faking shared memory on message-passing CPUs: CPU0 writes X=42 into its own Memory0 and also Send(X,42) to CPU1; CPU1’s Recv() learns that “X” was written and updates Memory1. Apart from communication delays, each individual memory space has the same contents.]
[Figure: Faking message passing on shared-memory CPUs: common memory is partitioned into “Memory0” and “Memory1”; CPU0’s Send(42) becomes Write Q1=42 and CPU1’s Recv() becomes Read Q1 on an agreed-upon queue location.]
Shared Memory Focus
• Most small-medium multi-processors (these days) use some sort of shared memory
  – shared memory doesn’t scale as well to larger numbers of nodes
    • communications are broadcast based
    • bus becomes a severe bottleneck
  – message passing doesn’t need a centralized bus
    • can arrange the multi-processor like a graph
      – nodes = CPUs, edges = independent links/routes
    • can have multiple communications/messages in transit at the same time
SMP Machines
• SMP = Symmetric Multi-Processing
  – Symmetric = all CPUs are “equal”
  – Equal = any process can run on any CPU
    • contrast with older parallel systems with a master CPU and multiple worker CPUs
[Figure: four equal CPUs, CPU0-CPU3. Pictures found from Google Images.]
Hardware Modifications for SMP
• Processor
  – mainly support for cache coherence protocols
    • includes caches, write buffers, LSQ
    • control complexity increases, as memory latencies may be substantially more variable
• Motherboard
  – multiple sockets (one per CPU)
  – datapaths between CPUs and the memory controller
• Other
  – Case: larger for a bigger mobo, better airflow
  – Power: bigger power supply for N CPUs
  – Cooling: need to remove N CPUs’ worth of heat
Chip-Multiprocessing
• Simple SMP on the same chip
[Figure: Intel “Smithfield” block diagram and AMD dual-core Athlon FX. Pictures found from Google Images.]
Shared Caches
• Resources can be shared between CPUs
  – ex. IBM POWER5
[Figure: CPU0 and CPU1 share the L2 cache (no need to keep two copies coherent); the L3 cache is also shared (only tags are on-chip; data are off-chip).]
Benefits?
• Cheaper than mobo-based SMP
  – all/most interface logic integrated onto the main chip (fewer total chips, single CPU socket, single interface to main memory)
  – less power than mobo-based SMP as well (communication on-die is more power-efficient than chip-to-chip communication)
• Performance
  – on-chip communication is faster
• Efficiency
  – potentially better use of hardware resources than trying to make a wider/more-OOO single-threaded CPU
Performance vs. Power
• 2x CPUs is not necessarily 2x performance
• 2x CPUs ⇒ ½ power for each
  – maybe a little better than ½ if resources can be shared
• Back-of-the-envelope calculation:
  – 3.8 GHz CPU at 100W
  – Dual-core: 50W per CPU
  – P ∝ V³: V_orig³/V_CMP³ = 100W/50W ⇒ V_CMP ≈ 0.8 V_orig
  – f ∝ V: f_CMP ≈ 3.0 GHz
  – Benefit of SMP: full power budget per socket!
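The back-of-the-envelope calculation above, worked out numerically (P ∝ V³ follows from P ∝ f·V² with f ∝ V; the 3.8 GHz / 100 W / 50 W figures are the slide’s):

```python
# CMP power scaling: halve the per-core power budget, solve for V and f.
f_orig, p_orig = 3.8, 100.0          # GHz, Watts (single core)
p_cmp = 50.0                         # Watts per core in the dual-core budget

v_ratio = (p_cmp / p_orig) ** (1 / 3)   # V_CMP / V_orig, since P ∝ V^3
f_cmp = f_orig * v_ratio                # frequency scales linearly with V

print(round(v_ratio, 2))   # ~0.79, the slide's "0.8 V_orig"
print(round(f_cmp, 1))     # ~3.0 GHz per core
```

So two cores at ~3.0 GHz fit in the same 100 W envelope as one core at 3.8 GHz.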
Simultaneous Multi-Threading
• Uni-Processor: 4-6 wide, lucky if you get 1-2 IPC
  – poor utilization
• SMP: 2-4 CPUs, but need independent tasks
  – else poor utilization as well
• SMT: the idea is to use a single large uni-processor as a multi-processor
SMT (2)
[Figure: a regular CPU; a CMP at 2x HW cost; SMT (4 threads) at approximately 1x HW cost.]
Overview of SMT Hardware Changes
• For an N-way (N threads) SMT, we need:
  – Ability to fetch from N threads
  – N sets of registers (including PCs)
  – N rename tables (RATs)
  – N virtual memory spaces
• But we don’t need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)
SMT Fetch
• Duplicate fetch logic
[Figure: PC0-PC2 each with its own fetch unit into a shared I$, feeding Decode, Rename, Dispatch and the RS.]
• Cycle-multiplexed fetch logic
[Figure: PC0-PC2 muxed by (cycle % N) into a single fetch unit and the I$, then Decode, etc., and the RS.]
• Alternatives
  – Other-multiplexed fetch logic
  – Duplicate the I$ as well
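The cycle-multiplexed option can be sketched as a toy model (the PC values and straight-line fetch are assumed for illustration):

```python
# Cycle-multiplexed SMT fetch: each cycle, the single fetch unit
# selects thread (cycle % N) and fetches from that thread's PC.
N = 3
pcs = [0x400, 0x800, 0xC00]      # hypothetical per-thread PCs
fetched = []                     # (thread-id, fetch address) per cycle

for cycle in range(6):
    tid = cycle % N              # round-robin thread select
    fetched.append((tid, pcs[tid]))
    pcs[tid] += 4                # advance that thread's PC (no branches)

print(fetched[:3])   # [(0, 1024), (1, 2048), (2, 3072)]
```

Real designs often replace the pure round-robin select with a policy such as ICOUNT that favors threads with fewer instructions in flight.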
SMT Rename
• Thread #1’s R12 != Thread #2’s R12
  – separate name spaces
  – need to disambiguate
[Figure: two options. Left: one RAT per thread (RAT0, RAT1), indexed by Thread0’s or Thread1’s register number, both mapping into a shared PRF. Right: a single RAT indexed by the concatenation of Thread-ID and register number, mapping into the PRF.]
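The single-RAT option can be sketched with a table keyed on (thread-id, architected register), modeling the Thread-ID/register-number concatenation (the table sizes and free-list policy here are illustrative assumptions):

```python
# Single RAT indexed by (thread-id, arch reg): Thread 0's R12 and
# Thread 1's R12 land in different entries, hence different PRF registers.
rat = {}                           # (tid, arch_reg) -> physical register
free_list = list(range(64))        # assumed 64-entry PRF

def rename(tid, arch_reg):
    preg = free_list.pop(0)        # allocate a fresh physical register
    rat[(tid, arch_reg)] = preg    # key = "concat" of thread-id and reg #
    return preg

p0 = rename(0, 12)    # Thread 0's R12
p1 = rename(1, 12)    # Thread 1's R12: separate name space
print(p0 != p1)       # True
```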
SMT Issue, Exec, Bypass, …
• No change needed
[Figure: after renaming, both threads’ instructions share RS entries. Thread 0’s code — Add R1 = R2 + R3; Sub R4 = R1 - R5; Xor R3 = R1 ^ R4; Load R2 = 0[R3] — becomes Add T12 = T20 + T8; Sub T19 = T12 - T16; Xor T14 = T12 ^ T19; Load T23 = 0[T14]. Thread 1’s identical code becomes Add T17 = T29 + T3; Sub T5 = T17 - T2; Xor T31 = T17 ^ T5; Load T25 = 0[T31].]
SMT Cache
• Each process has its own virtual address space
  – TLB must be thread-aware
    • translate (thread-id, virtual page) → physical page
  – Virtual portion of caches must also be thread-aware
    • a VIVT cache must now be (virtual addr, thread-id)-indexed and (virtual addr, thread-id)-tagged
    • similar for a VIPT cache
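A thread-aware TLB lookup can be sketched as a table keyed on (thread-id, virtual page), so the same virtual page in two threads can translate to different frames (the page numbers below are made up for illustration):

```python
# Thread-aware TLB: the lookup key includes the thread-id.
tlb = {}                            # (tid, vpage) -> ppage

def tlb_fill(tid, vpage, ppage):
    tlb[(tid, vpage)] = ppage

def translate(tid, vpage):
    return tlb.get((tid, vpage))    # None models a TLB miss

tlb_fill(0, 0x10, 0xA0)   # thread 0: vpage 0x10 -> frame 0xA0
tlb_fill(1, 0x10, 0xB7)   # thread 1: same vpage, different frame

print(translate(0, 0x10) != translate(1, 0x10))   # True
```

Without the thread-id in the key, thread 1’s fill would silently evict thread 0’s mapping and translations would alias across address spaces.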
SMT Commit
• One “Commit PC” per thread
• Register file management
  – ARF/PRF organization: need one ARF per thread
  – Unified PRF: need one “architected RAT” per thread
• Need to maintain interrupts, exceptions, faults on a per-thread basis
  – just as OOO needs to appear to the outside world that it is in-order, SMT needs to appear as if it is actually N CPUs
SMT Design Space
• Number of threads
• Full-SMT vs. hard-partitioned SMT
  – full-SMT: ROB entries can be allocated arbitrarily between the threads
  – hard-partitioned: if only one thread, use all ROB entries; if two threads, each is limited to one half of the ROB (even if the other thread uses only a few entries); possibly similar for RS, LSQ, PRF, etc.
• Amount of duplication
  – Duplicate I$, D$, fetch engine, decoders, schedulers, etc.?
  – There’s a continuum of possibilities between SMT and CMP
    • ex. could have a CMP where the FP unit is shared SMT-style
SMT Performance
• When it works, it fills idle “issue slots” with work from other threads; throughput improves
• But sometimes it can cause performance degradation!
  Time(finish one task, then do the other) < Time(do both at the same time using SMT)
How?
• Cache thrashing
[Figure: Thread0 just fits in the Level-1 caches (I$, D$) and executes reasonably quickly due to high cache hit rates. After a context switch, Thread1 also fits nicely in the caches. Under SMT, the caches were just big enough to hold one thread’s data, but not two threads’ worth: now both threads have significantly higher cache miss rates and spill into the L2.]
This is all combinable
• Can have a system that supports SMP, CMP and SMT at the same time
• Take a dual-socket SMP motherboard…
• Insert two chips, each a dual-core CMP…
• …where each core supports two-way SMT
• This example provides 8 threads’ worth of execution, shared on 4 actual “cores”, split across two physical packages
OS Confusion
• SMT/CMP is supposed to look like multiple CPUs to the software/OS
[Figure: two cores (either SMP/CMP), each 2-way SMT, appear to the OS as CPU0-CPU3. Say the OS has two tasks to run and schedules them to (virtual) CPUs: if tasks A and B land on the two SMT contexts of the same core (A/B sharing one core, the other core idle), performance is worse than if SMT were turned off and the 2-way SMP alone were used.]
OS Confusion (2)
• Asymmetries in the MP hierarchy can be very difficult for the OS to deal with
  – need to break the abstraction: the OS needs to know which CPUs are real physical processors (SMP), which are shared in the same package (CMP), and which are virtual (SMT)
  – Distinct applications should be scheduled to physically different CPUs
    • no cache contention, no power contention
  – Cooperative applications (different threads of the same program) should maybe be scheduled to the same physical chip (CMP)
    • reduce latency of inter-thread communication; possibly reduce duplication if a shared L2 is used
  – Use SMT as the last choice
Multi-* is Happening
• Just a question of exactly how: number of cores, support for SMT/SoeMT, asymmetric cores, etc.?