Advanced Topics in Pipelining - SMT and Single-Chip Multiprocessor
Priya Govindarajan, CMPE 200
Introduction
Researchers have proposed two alternative microarchitectures that exploit multiple threads of control: simultaneous multithreading (SMT) [1] and chip multiprocessors (CMP) [2].
CMP vs. SMT
Why software and hardware trends will favor the CMP microarchitecture.
Conclusions from the performance comparison of simulated superscalar, SMT, and CMP microarchitectures.
SMT Discussion Outline
Introduction
Multithreading (MT)
Approaches to multithreading
Motivation for introducing SMT
Implementation of SMT
CPU performance estimates
Architectural abstraction
Introduction to SMT
SMT processors augment wide superscalar processors (those issuing many instructions at once) with hardware that allows the processor to execute instructions from multiple threads of control concurrently.
Dynamically selecting and executing instructions from many active threads simultaneously.
Higher utilization of the processor’s execution resources
Provides latency tolerance in case a thread stalls due to cache misses or data dependencies.
When multiple threads are not available, however, the SMT simply looks like a conventional wide-issue superscalar.
Introduction to SMT
SMT uses the insight that a dynamically scheduled processor already has many of the hardware mechanisms needed to support the integrated exploitation of TLP through multithreading.
Multithreading can be built on top of an out-of-order processor by adding per-thread register renaming and program counters, and by providing the capability for instructions from multiple threads to commit.
Multithreading: Exploiting Thread-Level Parallelism
Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.
The processor must duplicate the independent state of each thread (register file, PC, page table).
Memory can be shared through the virtual memory mechanisms, which already support multiprocessing.
Hardware support is needed for switching between threads.
Multithreading….
Two main approaches to multithreading:
Fine-grained multithreading
Coarse-grained multithreading
Fine-grained vs. coarse-grained multithreading
Fine-grained: switches between threads on each instruction, interleaving their execution. Interleaving is typically round-robin, skipping any threads that are stalled.
Coarse-grained: switches threads only on costly stalls.
Fine-grained multithreading
Advantages: hides the throughput losses that arise from both short and long stalls.
Disadvantages: slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads.
Coarse-grained multithreading
Advantages: relieves the need for thread switching to be essentially free, and is much less likely to slow down the execution of an individual thread.
Coarse-grained multithreading
Disadvantages: throughput losses, especially from shorter stalls. Because coarse-grained multithreading issues instructions from a single thread, the pipeline must be emptied or frozen when a stall occurs, and a new thread that begins executing after the stall must fill the pipeline before instructions can complete.
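The trade-off above can be sketched with a toy scheduler (my own illustration, not from the slides: the two threads, the 3-cycle stall, and the 2-cycle refill penalty are all assumed numbers):

```c
/* Toy single-issue pipeline model contrasting fine-grained and
 * coarse-grained multithreading. stall[t][i] > 0 means instruction i
 * of thread t stalls the thread for that many extra cycles. */
#include <assert.h>

#define NTHREADS 2
#define NINSTR   4

static const int stall[NTHREADS][NINSTR] = {
    {0, 3, 0, 0},   /* thread 0: one 3-cycle stall (e.g. a cache miss) */
    {0, 0, 0, 0},   /* thread 1: never stalls */
};

/* Fine-grained: every cycle, issue from the next ready thread
 * (round-robin, skipping stalled threads). Returns total cycles. */
int fine_grained_cycles(void) {
    int pc[NTHREADS] = {0}, busy_until[NTHREADS] = {0};
    int cycle = 0, done = 0, t = 0;
    while (done < NTHREADS * NINSTR) {
        for (int tried = 0; tried < NTHREADS; tried++) {
            int cand = (t + tried) % NTHREADS;
            if (pc[cand] < NINSTR && busy_until[cand] <= cycle) {
                busy_until[cand] = cycle + 1 + stall[cand][pc[cand]++];
                done++;
                break;
            }
        }
        t = (t + 1) % NTHREADS;   /* rotate the round-robin pointer */
        cycle++;
    }
    return cycle;
}

/* Coarse-grained: stay on one thread; on a costly stall (or when the
 * thread finishes) switch, paying an assumed 2-cycle pipeline refill. */
int coarse_grained_cycles(void) {
    const int REFILL = 2;
    int pc[NTHREADS] = {0}, busy_until[NTHREADS] = {0};
    int cycle = 0, done = 0, t = 0;
    while (done < NTHREADS * NINSTR) {
        if (pc[t] < NINSTR && busy_until[t] <= cycle) {
            busy_until[t] = cycle + 1 + stall[t][pc[t]++];
            done++;
            cycle++;
            if (busy_until[t] > cycle) {       /* stall: switch threads */
                t = (t + 1) % NTHREADS;
                cycle += REFILL;
            }
        } else if (pc[t] == NINSTR) {          /* finished: switch */
            t = (t + 1) % NTHREADS;
            cycle += REFILL;
        } else {
            cycle++;                           /* pipeline frozen on stall */
        }
    }
    return cycle;
}
```

In this toy run, fine-grained interleaving hides the 3-cycle stall entirely (8 instructions in 8 cycles), while coarse-grained switching pays refill penalties and finishes later.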
Simultaneous Multithreading
A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP at the same time it exploits ILP.
Why? Modern multiple-issue processors often have more functional-unit parallelism available than a single thread can effectively use.
With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without any dependences among them.
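A minimal sketch of this slot-filling idea (hypothetical numbers: a 4-wide issue stage, where each thread exposes only a few independent instructions per cycle):

```c
/* Sketch of SMT issue-slot filling. ready[t] is the number of
 * independent instructions thread t can offer this cycle. */
#include <assert.h>

#define WIDTH 4   /* assumed issue width */

/* A single thread alone can fill at most its own ready count. */
int slots_filled_single(const int *ready, int nthreads) {
    (void)nthreads;
    return ready[0] < WIDTH ? ready[0] : WIDTH;
}

/* SMT fills the same slots from all threads together. */
int slots_filled_smt(const int *ready, int nthreads) {
    int filled = 0;
    for (int t = 0; t < nthreads && filled < WIDTH; t++) {
        int take = ready[t];
        if (take > WIDTH - filled)
            take = WIDTH - filled;   /* cap at remaining slots */
        filled += take;
    }
    return filled;
}
```

With threads offering {2, 1, 3} independent instructions, a single thread fills 2 of 4 slots while SMT fills all 4, which is exactly the utilization argument above.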
Basic Out-of-order Pipeline
SMT Pipeline
Challenges for an SMT processor
Dealing with the larger register file needed to hold multiple contexts.
Maintaining low overhead on the clock cycle, particularly in instruction issue and completion.
Ensuring that cache conflicts generated by the simultaneous execution of multiple threads do not cause significant performance degradation.
SMT
SMT will significantly enhance multistream performance across a wide range of applications without significant hardware cost and without major architectural changes
Instruction Issue
Reduced function unit utilization due to dependencies
Superscalar Issue
Superscalar leads to more performance, but lower utilization
Maximum utilization of function units by independent operations
Simultaneous Multithreading
Fine Grained Multithreading
Intra-thread dependencies still limit performance
Interleaving – no empty slot
Architectural Abstraction
1 CPU with 4 Thread Processing Units (TPUs); shared hardware resources.
System Block Diagram
Changes for SMT
Basic pipeline: unchanged.
Replicated resources: program counters, register maps.
Shared resources: register file (size increased), instruction queue, first- and second-level caches, translation buffers, branch predictor.
Multithreaded application performance
Single-Chip Multiprocessor
CMPs use relatively simple single-thread processor cores to exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores.
If an application cannot be effectively decomposed into threads, CMPs will be underutilized.
Comparing Alternative Architectures
Superscalar architecture
Issues up to 12 instructions per cycle.
SMT architecture
8 separate PCs; executes instructions from 8 different threads concurrently.
Multi-banked caches.
Chip multiprocessor architecture
8 small 2-issue superscalar processors; depends on TLP.
SMT and Memory
SMT places large demands on memory; it requires more bandwidth from the primary cache, since multithreading allows more loads and stores in flight.
To allow this, the design uses 128-KByte caches.
It needs a complex MESI (modified, exclusive, shared, invalid) cache-coherence protocol.
CMP and Memory
The eight cores are independent and integrated with their individual pairs of caches, another form of clustering, which leads to a high-frequency design for the primary cache system.
The small cache size and tight connection to these caches allow single-cycle access.
A simpler coherence scheme suffices.
Quantitative Performance: CPU Cores
To keep the processor's execution units busy, SMT features advanced branch prediction, register renaming, out-of-order issue, and non-blocking data caches, which make it inherently complex: the number of registers increases, and the number of ports on each register file must increase.
The CMP approach keeps the hardware simple: it exploits parallelism using more processors instead of large issue widths within a single processor.
SMT Approach: Longer Cycle Times
Long, high-capacitance I/O wires span the large buffers, queues, and register files.
Extensive use of multiplexers and crossbars to interconnect these units adds more capacitance.
The delays associated with these structures dominate the delay along the CPU's critical path.
The cycle-time impact of these structures can be mitigated by careful design: by deep pipelining, or by breaking the structures into small, fast clusters of closely related components connected by short wires.
But deep pipelining increases branch misprediction penalties, and clustering tends to reduce the processor's ability to find and exploit instruction-level parallelism.
CMP Solution
A short cycle time can be targeted with relatively little design effort, since the CMP's hardware is naturally clustered: each of the small CPUs is already a very small, fast cluster of components.
Since the OS allocates a single software thread of control to each processor, the partitioning of work among the "clusters" is natural and requires no hardware to dynamically allocate instructions to different clusters.
Heavy reliance on software to direct instructions to clusters limits the amount of ILP a CMP can exploit, but allows the clusters within the CMP to be small and fast.
SMT and CMP
From an architectural point of view, the SMT processor's flexibility makes it superior.
However, the need to limit the effects of interconnect delays, which are becoming much slower than transistor gate delays, will also drive billion-transistor chip design.
Interconnect delays will force the microarchitecture to be partitioned into small, localized processing elements.
CMP is much more promising because it is already partitioned into individual processing cores.
Because these cores are relatively simple, they are amenable to speed optimization and can be designed relatively easily.
Compiler Support for SMT and CMP
Programmers must find TLP in order to maximize CMP performance.
SMT also requires programmers to explicitly divide code into threads to get maximum performance, but unlike CMP, it can dynamically find more ILP if TLP is limited.
With multithreaded operating systems, these problems should prove less daunting.
Having all eight CPUs on a single chip allows designers to exploit TLP even when threads communicate frequently.
Performance results
A comparison of three architectures indicates that a multiprocessor on a chip will be easiest to implement while still offering excellent performance.
Disadvantages of CMP
When code cannot be multithreaded, only one processor can be targeted to the task.
However, a single 2-issue processor of the CMP is only moderately slower than the superscalar or SMT designs, since applications with little thread-level parallelism also tend to lack ILP.
Conclusion on CMP
CMP is a promising candidate for a billion-transistor architecture.
It offers superior performance using simple hardware.
For code that can be parallelized into multiple threads, the small CMP cores perform comparably or better.
It is easier to design and optimize.
SMTs use resources more efficiently than CMPs, but more execution units can be included in a CMP of similar area, since less die area need be devoted to wide-issue logic.
References
D. Tullsen, S. Eggers, and H. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Ann. Int'l Symp. on Computer Architecture, ACM Press, 1995, pp. 392-403.
J. Borkenhagen, R. Eickemeyer, and R. Kalla, "A Multithreaded PowerPC Processor for Commercial Servers," IBM Journal of Research and Development, Vol. 44, No. 6, Nov. 2000, pp. 885-898.
J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen, "Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading," ACM Transactions on Computer Systems, Vol. 15, No. 2, Aug. 1997.
K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, "The Case for a Single-Chip Multiprocessor," Proc. 7th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Oct. 1996, pp. 2-11.
L. Hammond, B. A. Nayfeh, and K. Olukotun, "A Single-Chip Multiprocessor," IEEE Computer, Sept. 1997.
M. Gulati and N. Bagherzadeh, "Performance Study of a Multithreaded Superscalar Microprocessor," Proc. 2nd Int'l Symp. on High-Performance Computer Architecture, Feb. 1996, pp. 291-301.
K. Park, S.-H. Choi, Y. Chung, W.-J. Hahn, and S.-H. Yoon, "On-Chip Multiprocessor with Simultaneous Multithreading," http://etrij.etri.re.kr/etrij/pdfdata/22-04-02.pdf
B. A. Nayfeh, L. Hammond, and K. Olukotun, "Evaluation of Design Alternatives for a Multiprocessor Microprocessor," Proc. 23rd Ann. Int'l Symp. on Computer Architecture, May 1996, pp. 67-77.
L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun, "The Stanford Hydra CMP," IEEE Micro, Vol. 20, No. 2, Mar./Apr. 2000.
S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen, "Simultaneous Multithreading: A Platform for Next-Generation Processors," IEEE Micro, Sept./Oct. 1997, pp. 12-18.
V. Krishnan and J. Torrellas, "Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor," Proc. ACM Int'l Conf. on Supercomputing (ICS '98), June 1998, pp. 85-92.
Further links: goethe.ira.uka.de/people/ungerer/proc-arch/EUROPAR-tutorial-slides.ppt; http://www.acm.uiuc.edu/banks/20/6/page4.html; Simultaneous Multithreading home page: http://www.cs.washington.edu/research/smt/
The Stanford Hydra Chip Multiprocessor
Kunle Olukotun, The Hydra Team
Computer Systems Laboratory
Stanford University
Technology Architecture
Transistors are cheap, plentiful and fast: Moore's law, 100 million transistors by 2000.
Wires are cheap, plentiful and slow: wires get slower relative to transistors; long cross-chip wires are especially slow.
Architectural implications: plenty of room for innovation; single-cycle communication requires localized blocks of logic; high communication bandwidth across the chip is easier to achieve than low latency.
Exploiting Program Parallelism
[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size, from 1 to 1M instructions]
Hydra Approach
A single-chip multiprocessor architecture composed of simple, fast processors.
Multiple threads of control exploit parallelism at all levels.
Memory renaming and thread-level speculation make it easy to develop parallel programs.
The design is kept simple by taking advantage of a single-chip implementation.
Outline
Base Hydra architecture
Performance of base architecture
Speculative thread support
Speculative thread performance
Improving speculative thread performance
Hydra prototype design
Conclusions
The Base Hydra Design
Single-chip multiprocessor: four processors, separate primary caches, write-through data caches to maintain coherence.
Shared 2nd-level cache; low-latency interprocessor communication (10 cycles); separate read and write buses.
[Block diagram: four CPUs, each with an L1 instruction cache, L1 data cache, and memory controller, connected by a 64-bit write-through bus and a 256-bit read/replace bus to the on-chip L2 cache; a Rambus memory interface to DRAM main memory; an I/O bus interface to I/O devices; and centralized bus arbitration mechanisms]
Hydra vs. Superscalar
ILP only: the superscalar is 30-50% better than a single Hydra processor.
ILP and fine-grained threads: the superscalar and Hydra are comparable.
ILP and coarse-grained threads: Hydra is 1.5-2x better.
("The Case for a Single-Chip Multiprocessor," ASPLOS '96)
[Chart: speedup (0-4) of a 6-way issue superscalar vs. Hydra (4 x 2-way issue) on compress, m88ksim, eqntott, MPEG2, applu, apsi, swim, tomcatv, pmake, and OLTP]
Problem: Parallel Software
Parallel software is limited: hand-parallelized applications and auto-parallelized dense-matrix FORTRAN applications.
Traditional auto-parallelization of C programs is very difficult: threads have data dependencies and need synchronization; pointer disambiguation is difficult and expensive; compile-time analysis is too conservative.
How can hardware help? Remove the need for pointer disambiguation; allow the compiler to be aggressive.
Solution: Data Speculation
Data speculation enables parallelization without regard for data dependencies: loads and stores follow the original sequential semantics; speculation hardware ensures correctness; synchronization is added only for performance; loop parallelization is now easily automated.
Other ways to parallelize code: break code into arbitrary threads (e.g. speculative subroutines); parallel execution with sequential commits.
Data speculation support: the Wisconsin Multiscalar; Hydra provides low-overhead support on a CMP.
Data Speculation Requirements I
Forward data between parallel threads; detect violations when reads occur too early.
[Figure: an original sequential loop vs. a speculatively parallelized loop over time; a write to X in iteration i forwarded to a read of X in iteration i+1 is FORWARDING, while a read of X in iteration i+1 that occurs before iteration i's write is a VIOLATION]
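These two cases can be modeled with a tiny sketch (an assumed model, loosely following the per-line "read" tag bits described for Hydra below): each speculative thread records the addresses it has read, and a write from a less-speculative thread to an already-read address signals a violation; otherwise the write is simply forwarded.

```c
/* Toy model of RAW-violation detection for one speculative thread. */
#include <assert.h>
#include <string.h>

#define ADDRS 16

typedef struct {
    int read_bit[ADDRS];  /* set when this thread reads the address */
    int violated;         /* set when a too-early read is detected */
} spec_thread;

void spec_reset(spec_thread *s) { memset(s, 0, sizeof *s); }

/* The speculative thread reads addr: mark it. */
void spec_read(spec_thread *s, int addr) { s->read_bit[addr] = 1; }

/* An earlier (less speculative) thread writes addr. If we already read
 * it, we consumed stale data: violation. Otherwise the new value is
 * just forwarded to us and no harm is done. */
void earlier_write(spec_thread *s, int addr) {
    if (s->read_bit[addr])
        s->violated = 1;
}
```

Reading X and then seeing an earlier write to X triggers a violation (the thread must restart); seeing the write first and reading afterwards is plain forwarding.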
Data Speculation Requirements II
Safely discard bad state after violation Correctly retire speculative state
[Figure: over time, writes from iterations after a violation are trashed, while writes after successful iterations are retired as permanent state]
Data Speculation Requirements III
Maintain multiple “views” of memory
[Figure: multiple memory "views" — iterations i, i+1, and i+2 each read and write X, and each must see its own correct version of the location]
Hydra Speculation Support
Write bus and L2 buffers provide forwarding.
"Read" L1 tag bits detect violations.
"Dirty" L1 tag bits and write buffers provide backup.
Write buffers reorder and retire speculative state.
Separate L1 caches with pre-invalidation and smart L2 forwarding maintain the memory "views".
Speculation coprocessors control the threads.
[Block diagram: the base Hydra design augmented with a CP2 speculation coprocessor per CPU, speculation bits in each L1 data cache, and per-CPU speculation write buffers (#0-#3) that retire into the on-chip L2 cache over the write-through and read/replace buses]
Speculative Reads
L1 hit: the read bits are set.
L1 miss: the L2 cache and the write buffers are checked in parallel; the newest bytes written to a line are pulled in by priority encoders on each byte (priority A-D).
[Figure: on an L1 miss, CPU #i ("me") merges bytes from the write buffers of the speculative earlier CPUs #i-1 and #i-2 and the nonspeculative "head" CPU (priorities A-C) and from the L2 cache (priority D); the speculative later CPU #i+1 is ignored]
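The per-byte priority selection can be sketched as follows (assumed data layout, not Hydra's actual hardware): for each byte of the missing line, the nearest earlier CPU's write buffer that has written that byte wins, and the L2 copy is used otherwise.

```c
/* Sketch of per-byte priority merging on a speculative L1 miss. */
#include <assert.h>
#include <string.h>

#define LINE 8   /* bytes per cache line (toy size) */
#define NBUF 3   /* earlier CPUs' write buffers; [0] = nearest (priority A) */

typedef struct {
    unsigned char data[LINE];
    unsigned char valid[LINE];  /* 1 if this buffer wrote the byte */
} wbuf;

/* Build the line seen by the requesting CPU: start from the L2 copy,
 * then overlay older buffers first and the nearest buffer last, so the
 * newest (highest-priority) byte wins. */
void merge_line(const wbuf buf[NBUF], const unsigned char *l2,
                unsigned char *out) {
    for (int b = 0; b < LINE; b++) {
        out[b] = l2[b];                      /* default: L2 copy (D) */
        for (int i = NBUF - 1; i >= 0; i--)  /* C, B, then A overlays */
            if (buf[i].valid[b])
                out[b] = buf[i].data[b];
    }
}
```

A byte written by both a distant and the nearest earlier buffer comes from the nearest one; a byte no buffer wrote falls through to the L2 value.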
Speculative Writes
A CPU writes to its L1 cache & write buffer
“Earlier” CPUs invalidate our L1 & cause RAW hazard checks
“Later” CPUs just pre-invalidate our L1
Non-speculative write buffer drains out into the L2
[Figure: CPU #i ("me") writes to its L1 cache and write buffer and onto the write bus; writes from speculative earlier CPUs trigger invalidations and RAW detection, writes from the speculative later CPU trigger pre-invalidations, and the nonspeculative "head" CPU's write buffer drains into the L2 cache]
Speculation Runtime System
Software handlers control speculative threads through the CP2 interface, track the order of all speculative threads, and use exception routines to recover from data dependency violations.
This adds more overhead to speculation than a pure hardware approach, but is more flexible and simpler to implement.
A complete description appears in "Data Speculation Support for a Chip Multiprocessor" (ASPLOS '98) and "Improving the Performance of Speculatively Parallel Applications on the Hydra CMP" (ICS '99).
Creating Speculative Threads
Speculative loops: for- and while-loop iterations, typically one speculative thread per iteration.
Speculative procedures: execute the code after a procedure call speculatively; procedure calls generate a speculative thread.
Compiler support: a C source-to-source translator provides pfor and pwhile, analyzes the loop body, and globalizes any local variables that could cause loop-carried dependencies.
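A hypothetical example of this "globalization" step (my illustration; the function and variable names are invented, not from the Hydra translator): a scalar that carries a value across iterations is moved from a register into memory, where loads and stores become visible to the speculation hardware for forwarding and violation detection.

```c
#include <assert.h>

/* Original loop: the accumulator lives in a register, so a loop-carried
 * dependence through it is invisible to memory-based speculation. */
int sum_local(const int *a, int n) {
    int sum = 0;                 /* register-allocated local */
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* After globalizing the accumulator: every update is a load/store to a
 * memory location the speculation hardware can track. */
static int g_sum;                /* hypothetical globalized variable */

int sum_global(const int *a, int n) {
    g_sum = 0;
    for (int i = 0; i < n; i++)
        g_sum += a[i];           /* load/store visible to speculation h/w */
    return g_sum;
}
```

Both versions compute the same result; the transformation only changes where the carried value lives.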
Base Speculative Thread Performance
Entire applications, compiled with GCC 2.7.2 -O2, on 4 single-issue processors, with accurate modeling of all aspects of the Hydra architecture and the real runtime system.
[Chart: base speedup (0-4) on compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse1.3]
Improving the Speculative Runtime System
Procedure support adds overhead to loops: threads are not created sequentially, so dynamic thread scheduling is necessary; starting and ending a loop costs 75 cycles, and ending an iteration costs 80 cycles.
Performance: the best-performing speculative applications use loops; procedure speculation often lowers performance; the RTS needs to be optimized for the common case.
Lowered speculative overheads: starting and ending a loop now costs 25 cycles, and ending an iteration costs 12 cycles (almost a factor of 7 better); procedure speculation is limited to specific procedures.
Improved Speculative Performance
Improves performance of all applications
Most improvement for applications with fine-grained threads
Eqntott uses procedure speculation
[Chart: speedup (0-4) of the base RTS vs. the optimized RTS on compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse1.3]
Optimizing Parallel Performance
Cache-coherent shared memory: no explicit data movement, but 100+ cycle communication latency; need to optimize for data locality; look at cache misses (MemSpy, Flashpoint).
Speculative threads: no explicit data independence; frequent dependence violations limit performance; need to optimize to reduce the frequency and impact of violations; dependence prediction can help; look at violation statistics (requires some hardware support).
Feedback and Code Transformations
Feedback tool: collects violation statistics (PCs, frequency, work lost); correlates read and write PC values with source code.
Synchronization: synchronize frequently occurring violations; use non-violating loads.
Code motion: find dependent load-store pairs; move loads down and stores up within a thread; rearrange reads and writes to increase parallelism; delay reads and advance writes; create local copies to allow earlier data forwarding.
[Figure: in the original loop, iteration i+1's early read of x conflicts with iteration i's late write of x; after code motion, a local copy x' lets iteration i forward x earlier and iteration i+1 read x' without violating]
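A hypothetical before/after example of this transformation (invented names; the shared variable x stands in for the loop-carried dependence): moving the independent work ahead of the read-modify-write of x shrinks the window in which a later iteration can violate.

```c
#include <assert.h>

int x;  /* shared across speculative iterations (loop-carried) */

/* Before: x is read at the top and written at the bottom, so the whole
 * iteration sits between the dependent read and write. */
void iteration_before(const int *work, int n) {
    int t = x;                    /* early read of x */
    for (int i = 0; i < n; i++)
        t += work[i];             /* long, independent work */
    x = t;                        /* late write of x */
}

/* After: a local copy accumulates the independent work first; the read
 * and write of x end up adjacent, exposing more parallelism. */
void iteration_after(const int *work, int n) {
    int local = 0;
    for (int i = 0; i < n; i++)
        local += work[i];         /* independent work runs first */
    x = x + local;                /* read and write of x together */
}
```

Both versions leave x with the same final value; only the placement of the dependent read and write changes.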
Optimized Speculative Performance
Base performance
Optimized RTS with no manual intervention
Violation statistics used to manually transform code
[Chart: speedup (0-4) on compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse1.3 for the base system, the optimized RTS, and the manually transformed code]
Size of Speculative Write State
The maximum size determines the write buffer size needed for maximum performance.
A non-head processor stalls when its write buffer fills up.
Small write buffers (< 64 lines) will achieve good performance.

Maximum number of lines of write state (32-byte cache lines):
compress 24, eqntott 40, grep 11, m88ksim 28, wc 8, ijpeg 32, mpeg 56, alvin 158, cholesky 4, ear 82, simplex 14
Hydra Prototype
Design based on the Integrated Device Technology (IDT) RC32364; 88 mm2 in a 0.25 µm process, with 8-KB instruction and data caches and a 128-KB L2 cache.
[Die plot: 8 mm x 11 mm]
Conclusions
Hydra offers a new way to design microprocessors: a single-chip MP exploits parallelism at all levels, with low-overhead support for speculative parallelism.
It provides high performance on applications with medium- to large-grain parallelism, and allows a performance-optimization migration path for difficult-to-parallelize fine-grain applications.
Prototype implementation: works out implementation details, provides a platform for application and compiler development, and enables realistic performance evaluation.
Hydra Team
Team: Monica Lam, Lance Hammond, Mike Chen, Ben Hubbert, Manohar Prabhu, Mike Siu, Melvyn Lim and Maciek Kozyrczak (IDT)
URL: http://www-hydra.stanford.edu