
CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture)

Scalable Multiprocessors (Case Studies)

Copyright 2003 J. E. Smith, University of Wisconsin-Madison


Piranha: A Scalable CMP

• Commercial workloads
  – Large instruction and data footprints
    » execution often dominated by memory stalls
  – Parallelism arises from independent instruction streams
  – Multiple-issue, out-of-order execution offers small gains for OLTP
  – No floating point or multimedia
⇒ Complex desktop processors may not be well suited to server applications
• Server chips
  – Simultaneous Multithreading (SMT)
    » complex core shared by multiple contexts
  – Chip Multiprocessors (CMP)
    » multiple simple cores + shared caches


Piranha: A Scalable CMP

• Single die contains:
  – 8 simple Alpha processor cores
    » single-issue, in-order
    » 8-stage pipelines
  – Separate L1 data/instruction caches
    » 64KB, 2-way L1 caches
    » MESI protocol
  – Shared L2 cache
    » 1MB, 8 banks, 8-way set associative per bank
    » non-inclusive
  – 8 memory controllers
    » a section of memory is controlled by each chip
  – 2 coherence engines
  – Network router
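
The numbers above can be collected into a quick reference; a minimal sketch in C, with illustrative names not taken from the Piranha design:

    /* Illustrative summary of the per-die resources listed above. */
    #include <stdio.h>

    enum {
        NUM_CORES         = 8,     /* single-issue, in-order Alpha cores  */
        L1_SIZE_KB        = 64,    /* per L1 (separate I and D), 2-way    */
        L2_SIZE_KB        = 1024,  /* shared, banked, non-inclusive       */
        L2_BANKS          = 8,     /* 8-way set associative per bank      */
        MEM_CONTROLLERS   = 8,
        COHERENCE_ENGINES = 2      /* home engine and remote engine       */
    };

    int main(void) {
        /* Each bank holds 1/8 of the 1MB shared L2, i.e. 128 KB. */
        printf("L2 bank size: %d KB\n", L2_SIZE_KB / L2_BANKS);
        return 0;
    }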


On-chip Caches

• Non-inclusion
  – Copy of the L1 tags kept in the L2 controller
  – L1 misses that also miss in L2
    » fill directly from main memory
    » no copy allocated in L2
  – L2 is filled only when an L1 line is replaced
    » L2 acts as a large victim cache
    » even clean L1 replacements write back to L2
    » one L1 is the owner (exclusive holder or last requester)
    » write-back occurs only when the owner L1 replaces the data
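
A minimal sketch in C of the fill and write-back decisions this policy implies; the types and helper names are illustrative, not the actual Piranha controller logic:

    #include <stdbool.h>

    typedef struct {
        bool l1_is_owner;   /* this L1 holds the line exclusively, or was
                               the last requester                          */
    } l1_line_t;

    /* An L1 miss that also misses in L2 fills directly from main memory;
     * no copy is allocated in L2. */
    static bool allocate_in_l2_on_miss_fill(void) { return false; }

    /* On an L1 replacement the victim moves to L2 (even if clean), but only
     * the owning L1 performs that write-back. */
    static bool writeback_victim_to_l2(const l1_line_t *line) {
        return line->l1_is_owner;
    }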


On-chip Caches

• Intra-chip coherence
  – L2 controllers contain all on-chip sharing information
  – Both L1 and L2 tags are checked on L2 accesses
  – A memory request from an L1 is sent to the L2
  – The L2 then either:
    » services the request directly
    » forwards the request to the owner L1
    » forwards the request to a protocol engine (needs to go off-chip)
    » obtains the data from the local section of memory
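
The dispatch choice the L2 controller makes can be sketched in C, assuming simplified boolean results from the tag checks (all names are illustrative):

    #include <stdbool.h>

    /* The four possible outcomes listed above. */
    typedef enum {
        SERVICE_FROM_L2,             /* L2 already has the data            */
        FORWARD_TO_OWNER_L1,         /* another on-chip L1 owns the line   */
        FORWARD_TO_PROTOCOL_ENGINE,  /* request must go off-chip           */
        FETCH_FROM_LOCAL_MEMORY      /* line lives in this chip's memory   */
    } l2_action_t;

    /* Decision made after checking the L2 tags and the duplicate L1 tags;
     * the boolean inputs stand in for those tag checks. */
    l2_action_t dispatch_l1_request(bool hit_in_l2, bool owned_by_other_l1,
                                    bool home_is_this_chip) {
        if (hit_in_l2)          return SERVICE_FROM_L2;
        if (owned_by_other_l1)  return FORWARD_TO_OWNER_L1;
        if (home_is_this_chip)  return FETCH_FROM_LOCAL_MEMORY;
        return FORWARD_TO_PROTOCOL_ENGINE;
    }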


Large Scale Piranha

• Piranha chips can be interconnected to form larger systems
• System-wide cache coherence
  – intra-chip via the on-chip protocol already discussed
  – inter-chip via directories


Protocol Engines

• On-chip protocol engines implement the coherence protocol
  – Two engines: home engine and remote engine


Protocol Engine

• Microprogrammed controllers
  – Different programs for the home and remote engines


Protocol Engine

• Three stages
  – Input controller: receives messages
  – Microcoded execution unit
  – Output controller: sends messages
• Microcode
  – 1024 21-bit instructions
  – Seven instruction types
    » SEND
    » RECEIVE
    » LSEND (local)
    » LRECEIVE (local)
    » TEST
    » SET
    » MOVE
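
As a sketch, the microcode store and instruction set just listed might be represented as follows in C; the 32-bit word per entry and the names are purely illustrative:

    #include <stdint.h>

    /* The seven protocol-engine microinstruction types listed above. */
    typedef enum {
        OP_SEND, OP_RECEIVE, OP_LSEND, OP_LRECEIVE, OP_TEST, OP_SET, OP_MOVE
    } pe_opcode_t;

    /* Microcode store: 1024 instructions of 21 bits each (stored here in a
     * 32-bit word per entry only for convenience). */
    #define PE_UCODE_ENTRIES 1024
    static uint32_t pe_ucode[PE_UCODE_ENTRIES];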


Protocol Engine

• TSRF
  – Transaction State Register File
    » 16 entries
    » keeps thread state for each in-flight transaction
• Example: read to a remote node
  – Action at the remote node:
    » SEND of the request to home
    » RECEIVE of the reply
    » TEST of a state variable
    » LSEND to the requesting node
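
A hedged sketch of that remote-read sequence, written as ordinary C in the spirit of the SEND/RECEIVE/TEST/LSEND microinstructions; every helper and field below is an illustrative stub, not the real microcode:

    #include <stdbool.h>

    #define TSRF_ENTRIES 16          /* one in-flight transaction (thread) each */

    typedef struct {
        int requester;               /* on-chip requester waiting for the data  */
        int reply_state;             /* protocol state carried in the reply     */
    } tsrf_entry_t;

    enum { REQ_READ = 1, DATA_REPLY = 2 };

    /* Stubs standing in for the message-passing microinstructions. */
    static void send(int node, int msg)   { (void)node; (void)msg; }
    static int  receive(void)             { return 0; }
    static bool test(int state)           { return state == 0; }
    static void lsend(int node, int data) { (void)node; (void)data; }

    /* One TSRF entry (thread) handling a read whose home is another node. */
    void handle_remote_read(tsrf_entry_t *t, int home_node) {
        send(home_node, REQ_READ);           /* SEND the request to home          */
        t->reply_state = receive();          /* RECEIVE the reply                 */
        if (test(t->reply_state))            /* TEST a state variable             */
            lsend(t->requester, DATA_REPLY); /* LSEND data to the local requester */
    }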


Directories

• Directory storage
  – Scavenged by computing ECC over larger blocks (the freed bits hold directory state)
  – 44 bits per 64-byte line
• Two methods for representing sharers (sketched below)
  – limited pointer
  – coarse vector
  – in a 1K-node system, switch to the coarse vector when there are more than 4 sharers
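
A minimal sketch of the two sharer encodings and the switch-over point; apart from the 44 directory bits per 64-byte line quoted above, the field widths and group size below are assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_POINTERS  4      /* limited-pointer entries before switching   */
    #define NUM_NODES     1024   /* "1K" system                                */
    #define COARSE_GROUPS 32     /* each coarse-vector bit covers a node group */

    typedef struct {
        bool coarse;                         /* which encoding is in use       */
        union {
            uint16_t ptr[MAX_POINTERS];      /* explicit sharer node IDs       */
            uint32_t groups;                 /* one bit per group of nodes     */
        } sharers;
        uint8_t count;                       /* sharers recorded in ptr[]      */
    } dir_entry_t;

    void add_sharer(dir_entry_t *d, uint16_t node) {
        if (!d->coarse && d->count < MAX_POINTERS) {
            d->sharers.ptr[d->count++] = node;       /* still fits as pointers */
            return;
        }
        if (!d->coarse) {                            /* overflow: re-encode    */
            uint32_t groups = 0;
            for (int i = 0; i < d->count; i++)
                groups |= 1u << (d->sharers.ptr[i] / (NUM_NODES / COARSE_GROUPS));
            d->sharers.groups = groups;
            d->coarse = true;
        }
        d->sharers.groups |= 1u << (node / (NUM_NODES / COARSE_GROUPS));
    }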


Inter-node Protocol

• Invalidation protocol
• Four request types
  – read
  – read-exclusive
  – exclusive
    » requestor already has a shared copy
  – exclusive-without-data
    » will write the entire line; no need to fetch the current contents
• Features
  – Clean-exclusive optimization
    » an exclusive copy is returned for a read if there are no sharers (sketched below)
  – Reply forwarding from a remote owner
  – Eager exclusive replies
    » ownership given before all invalidations are complete
  – Does not depend on point-to-point order
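
A sketch of how the home node might choose its reply under these request types, including the clean-exclusive optimization; the enum names and the function are illustrative, not the actual protocol-engine code:

    typedef enum { REQ_READ, REQ_READ_EXCLUSIVE, REQ_EXCLUSIVE,
                   REQ_EXCLUSIVE_WITHOUT_DATA } request_t;

    typedef enum { GRANT_SHARED, GRANT_EXCLUSIVE,
                   GRANT_EXCLUSIVE_NO_DATA } grant_t;

    grant_t home_reply(request_t req, int num_sharers) {
        switch (req) {
        case REQ_READ:
            /* clean-exclusive optimization: with no sharers, hand out an
             * exclusive copy even though only a read was requested */
            return (num_sharers == 0) ? GRANT_EXCLUSIVE : GRANT_SHARED;
        case REQ_READ_EXCLUSIVE:
        case REQ_EXCLUSIVE:              /* requestor already holds a shared copy */
            return GRANT_EXCLUSIVE;
        case REQ_EXCLUSIVE_WITHOUT_DATA: /* requestor will overwrite the whole line */
            return GRANT_EXCLUSIVE_NO_DATA;
        }
        return GRANT_SHARED;             /* unreachable; keeps compilers quiet */
    }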


Inter-node Protocol

• No NAKs and retries
  – NAKs typically used to avoid deadlocks
  – Here deadlocks are avoided by using low- and high-priority virtual channels (channel assignment sketched below)
    » low priority: requests sent to the home node, except write-backs
    » high priority: forwarded requests and replies
    » network buffering
  – NAKs also used when protocol races occur
  – Here races are avoided by guaranteeing that requests are always satisfied by the target node
    » example: write-back to home – maintain a valid copy until the home acks the write-back
• Advantages
  – Protocol can make directory changes immediately
    » avoids extra confirmation messages like "ownership change"
  – Eliminates livelock and starvation problems caused by NAKs
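
A minimal sketch of the channel assignment implied by that low/high-priority split; the message classes are illustrative names:

    /* Requests headed to a home node travel on the low-priority channel;
     * write-backs, forwarded requests, and replies travel on the
     * high-priority channel, which can always drain (given the network
     * buffering), so no NAKs are needed to break deadlocks. */
    typedef enum { MSG_REQUEST_TO_HOME, MSG_WRITEBACK,
                   MSG_FORWARDED_REQUEST, MSG_REPLY } msg_class_t;

    typedef enum { VC_LOW, VC_HIGH } vchannel_t;

    vchannel_t pick_virtual_channel(msg_class_t m) {
        return (m == MSG_REQUEST_TO_HOME) ? VC_LOW : VC_HIGH;
    }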


Buffering

• To reduce the buffering required:
  – Use "hot potato" routing – a message eventually reaches an empty buffer
  – Buffer space is shared among virtual channels
  – Bound the number of messages due to a single request
  – Cruise-missile invalidates (CMI)
    » invalidate many nodes with a small number of messages, each of which visits a predetermined set of nodes and returns a single ack
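
A hedged sketch of cruise-missile invalidation: the sharer set is split into a few chains, one message traverses each chain invalidating the nodes it visits, and only the last node returns an ack. The round-robin partitioning below is an assumption made for illustration:

    #include <stddef.h>

    #define MAX_CHAIN_LEN 8

    typedef struct { int nodes[MAX_CHAIN_LEN]; size_t len; } invalidate_chain_t;

    /* Partition n sharers into at most k chains.  The home then sends k
     * invalidation messages instead of n, and collects only k acks. */
    size_t build_chains(const int *sharers, size_t n,
                        invalidate_chain_t *chains, size_t k) {
        for (size_t c = 0; c < k; c++)
            chains[c].len = 0;
        for (size_t i = 0; i < n; i++) {
            invalidate_chain_t *c = &chains[i % k];
            if (c->len < MAX_CHAIN_LEN)
                c->nodes[c->len++] = sharers[i];
        }
        return (n < k) ? n : k;    /* number of messages actually needed */
    }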


IBM Power4

• Power4 system on a chip
  – Reflects the advantage of a vertically integrated company
  – Two processors
  – 1.4 MB L2 cache
  – L3 cache directory
  – Fabric control logic for the system interconnect


Power4 System Architecture

• Eight Processor MCM – Full Interconnect


Power4 System Architecture

• Thirty-two Processor System – connected with four rings


Power4 Memory Hierarchy

• Three levels of cache
  – Split L1 instruction and data caches
  – L2 shared (by 2 processors), on-chip
  – L3 off-chip, with on-chip controllers


L1 Data Cache

• 32 KB
• 128-byte lines
• Three-cycle load-use latency (?)
• Two loads and one store per cycle
• Store-to-load forwarding
  – 32-entry load/store queues
• Write-through
• Inclusion property with L2
• Two states
  – I (invalid)
  – V (valid)


L2 Cache

• Unified; Shared by Both On-Chip Processors

• Write-back
• 3 x 480 KB
• 128-byte lines
• 4-way to 8-way set associative (?)
• Two sets of tags
• Four coherency controllers
  – Four snoop processors per coherency controller
  – In response to a snoop:
    » send a back-invalidate to L1
    » read data from L2
    » update the coherence state
    » push modified data back to memory
    » source data from this L2 to another L2


L2 Cache Coherence

• Enhanced MESI protocol
  – I (invalid state): the data is invalid.
  – SL (shared state, can source to local requesters): the cache line may be valid in other L2 caches. Data can be sourced to another L2 on the same module via intervention.
  – S (shared state): the cache line may also be valid in other L2 caches. Data cannot be sourced to another L2 via intervention. This state is entered when a snoop read hit from another processor on a chip on the same module occurs while the line is in the SL state.
  – M (modified state): the data has been modified and is exclusively owned.
  – Me (exclusive state): the data is not considered modified but is exclusive to this L2.
  – Mu (unsolicited modified state): the data is considered to have been modified and is exclusively owned.
  – T (tagged state): the data is modified with respect to the copy in memory but is not currently exclusively owned. This state is entered when a snoop read hit occurs while in the M state.
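
For reference, the seven states above as a sketch enum; the names follow the slide and the C form is purely descriptive:

    /* Power4 L2 coherence states as described above. */
    typedef enum {
        L2_I,    /* invalid                                                   */
        L2_SL,   /* shared; may source data to another L2 on the same module  */
        L2_S,    /* shared; cannot source data via intervention               */
        L2_M,    /* modified and exclusively owned                            */
        L2_ME,   /* "Me": exclusive but not considered modified               */
        L2_MU,   /* "Mu": unsolicited modified, exclusively owned             */
        L2_T     /* tagged: modified w.r.t. memory, no longer exclusive       */
    } power4_l2_state_t;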


L2 Cache Coherence

• Valid L1 and L2 States


L3 Cache

• Directory (tags) on-chip; Data off-chip in MCM

• 128 MB per MCM
• 8-way set associative
• 512-byte lines (128-byte sectors)
• Not inclusive of L1 and L2
• Can handle 1 "random" access at a time
  – or 4 concurrent accesses to the same block
• Prefetching
  – Stream buffers; eight streams per processor
  – Done throughout the hierarchy


L3 Cache Coherence

• Five coherence states
  – I (invalid)
  – S (shared state): the L3 can only source data to the L2s that it is caching data for.
  – T (tagged state): the data is modified relative to the copy stored in memory. The data may be shared in other L2 or L3 caches.
  – Trem (remote tagged state): the same as the T state, but the data was sourced from memory attached to another chip.
  – O (prefetch data state): the data in the L3 is identical to the data in memory. The data was sourced from memory attached to this L3. The status of the data in other L2 or L3 caches is unknown.


Prefetching

• Sequential L1 misses trigger a prefetch into L2
• L2 fetches from L3 (running 5 lines ahead)
• Prefetch from memory to L3 moves a 512-byte L3 line
• Prefetching stops at a real page boundary
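
A minimal sketch of this stream behaviour, assuming 128-byte L1/L2 lines and 4 KB pages (the page size is an assumption; the line sizes and the 5-lines-ahead distance follow the slides). The callback parameters stand in for the real prefetch machinery:

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_SIZE 128u     /* L1/L2 line; memory-to-L3 prefetches move a
                                  full 512-byte L3 line (four such sectors)  */
    #define PAGE_SIZE 4096u    /* assumed page size                          */

    static bool same_page(uint64_t a, uint64_t b) {
        return (a / PAGE_SIZE) == (b / PAGE_SIZE);
    }

    /* Called when sequential L1 misses have established a stream at 'addr'. */
    void stream_prefetch(uint64_t addr,
                         void (*to_l2)(uint64_t), void (*to_l3)(uint64_t)) {
        uint64_t l2_target = addr + LINE_SIZE;      /* next line into L2      */
        uint64_t l3_target = addr + 5 * LINE_SIZE;  /* L2 pulls from L3,
                                                       running 5 lines ahead */
        if (same_page(addr, l2_target)) to_l2(l2_target);  /* stop at a real */
        if (same_page(addr, l3_target)) to_l3(l3_target);  /* page boundary  */
    }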