Lecture on High Performance Processor Architecture (CS05162)
TRIPS: A Polymorphous Tiled Architecture

Ren Yongqing ([email protected])
Fall 2007, University of Science and Technology of China
Department of Computer Science and Technology

23424 USTC CS AN Hong

Outline
- Background
- TRIPS ISA & Architecture Introduction
- TRIPS Compilation & Scheduling
- TRIPS Polymorphous
- TRIPS ASIC Implementation
- Problems

Key Trends
- Power
- Slowing (stopping) frequency increases
- Wire delays
- Reliability
- Reconfigurability

(Figure: technology nodes 130 nm, 90 nm, 65 nm, 35 nm; 20 mm.)

SuperScalar Core

Conventional Microarchitectures

TRIPS Project Goals
- Technology-scalable processor and memory architectures
  - Techniques to scale to 35nm and beyond
  - Enable high clock rates if desired
  - High design productivity through replication
- Good performance across diverse workloads
  - Exploit instruction-, thread-, and data-level parallelism
  - Work with standard programming models
- Power-efficient instruction-level parallelism
- Demonstrate via custom hardware prototype
  - Implement with a small design team
  - Evaluate, identify bottlenecks, tune the microarchitecture

Key Features
- EDGE ISA
  - Block-oriented instruction set architecture
  - Helps reduce bottlenecks and expose ILP
- Tiled microarchitecture
  - Modular design
  - No global wires
- TRIPS processor
  - Distributed processor design
  - Dataflow graph execution engine
- NUCA L2 cache
  - Distributed cache design

TRIPS Chip
- 2 TRIPS processors
- NUCA L2 cache
  - 1 MB, 16 banks
- On-Chip Network (OCN)
  - 2D mesh network
  - Replaces the on-chip bus
- Misc. controllers
  - 2 DDR SDRAM controllers
  - 2 DMA controllers
  - External bus controller
  - Chip-to-chip (C2C) network controller

TRIPS Processor

Want an aggressive general-purpose processor:
- Up to 16 instructions per cycle
- Up to 4 loads and stores per cycle
- Up to 64 outstanding L1 data cache misses
- Up to 1024 dynamically executed instructions
- Up to 4 simultaneous multithreading (SMT) threads

But existing microarchitectures don't scale well:
- Structures become large, multi-ported, and slow
- Lots of overhead to convert from sequential instruction semantics
- Vulnerable to speculation hazards

TRIPS introduces a new microarchitecture and ISA.

EDGE ISA: Explicit Data Graph Execution (EDGE)

Block-oriented
- Atomically fetch, execute, and commit whole blocks of instructions
- Programs are partitioned into blocks
- Each block holds dozens of instructions
- Sequential execution semantics at the block level
- Dataflow execution semantics inside each block

Direct target encoding
- Encode instructions so that results go directly to the instruction(s) that will consume them
- No need to go through a centralized register file and rename logic
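The dataflow semantics inside a block can be sketched in a few lines of Python. This is an illustrative toy, not the real TRIPS encoding: each instruction statically lists its consumers (direct target encoding), and an instruction fires once all of its operands have arrived, with no register file in between.

```python
# Toy model of dataflow execution inside one EDGE block (illustrative only):
# each instruction names the consumers of its result; it fires when all of
# its operands have arrived.

# inst_id -> (op, number_of_operands, [(consumer_id, operand_slot), ...])
block = {
    1: ("const5", 0, [(3, 0)]),   # produce 5, send to operand 0 of inst 3
    2: ("const7", 0, [(3, 1)]),   # produce 7, send to operand 1 of inst 3
    3: ("add",    2, []),         # fires once both operands have arrived
}

ops = {"const5": lambda: 5, "const7": lambda: 7, "add": lambda a, b: a + b}

def execute(block):
    operands = {i: {} for i in block}
    ready = [i for i, (_, n, _) in block.items() if n == 0]
    results = {}
    while ready:
        i = ready.pop()
        op, n, targets = block[i]
        args = [operands[i][s] for s in sorted(operands[i])]
        results[i] = ops[op](*args)
        for cid, slot in targets:             # direct target encoding:
            operands[cid][slot] = results[i]  # result travels to its consumer
            if len(operands[cid]) == block[cid][1]:
                ready.append(cid)             # all operands present -> fire
    return results

print(execute(block)[3])  # -> 12
```

Note that nothing here resembles fetch-and-rename: execution order falls out of operand arrival alone, which is the point of the block's internal dataflow semantics.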

Block Formation
- Basic blocks are often too small (just a few instructions)
- Predication allows larger hyperblocks to be created
- Loop unrolling and function inlining also help
- TRIPS blocks can hold up to 128 instructions
- Large blocks improve fetch bandwidth and expose ILP
- Hard-to-predict branches can sometimes be hidden inside a hyperblock

TRIPS Block Format
- Each block is formed from two to five 128-byte program "chunks"
- Blocks with fewer than five chunks are expanded to five chunks in the L1 I-cache
- The header chunk includes a block header (execution flags plus a store mask) and register read/write instructions
- Each instruction chunk holds 32 4-byte instructions (including NOPs)
- A maximally sized block contains 128 regular instructions, 32 read instructions, and 32 write instructions

(Figure: the PC points to the header chunk, followed by instruction chunks 0-3, each 128 bytes.)
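The chunk arithmetic above is simple enough to sketch. This is a small helper based only on the numbers on the slide (128-byte chunks, 32 four-byte instructions per chunk, padding to five chunks in the I-cache); the function names are mine, not part of any TRIPS toolchain.

```python
# Sketch of the block-size arithmetic from the slide: a block is 2-5
# 128-byte chunks (1 header + 1-4 instruction chunks of 32 four-byte
# instructions each) and is padded to 5 chunks in the L1 I-cache.
CHUNK_BYTES = 128
INSTS_PER_CHUNK = CHUNK_BYTES // 4        # 32 four-byte instructions

def block_chunks(num_insts):
    """Chunks actually needed for a block of `num_insts` regular instructions."""
    assert 0 < num_insts <= 128
    inst_chunks = -(-num_insts // INSTS_PER_CHUNK)   # ceiling division
    return 1 + inst_chunks                           # plus the header chunk

def icache_footprint_bytes(num_insts):
    # Blocks with fewer than five chunks are expanded to five in the I-cache,
    # so every block occupies the same fixed footprint.
    return 5 * CHUNK_BYTES

print(block_chunks(40))             # 40 insts -> header + 2 inst chunks = 3
print(icache_footprint_bytes(40))   # always 640 bytes in the L1 I-cache
```

The fixed 640-byte footprint is what lets the fetch hardware address any cached block without a variable-length lookup.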

Processor Tiles

Partition all major structures into banks, distribute, and interconnect:
- Execution Tile (E)
  - 64-entry instruction queue bank
  - Single-issue execute pipeline
- Register Tile (R)
  - 32-entry register bank (per thread)
- Data Tile (D)
  - 8KB data cache bank
  - LSQ and MHU banks
- Instruction Tile (I)
  - 16KB instruction cache bank
- Global Control Tile (G)
  - Tracks up to 8 blocks of instructions
  - Branch prediction & resolution logic

Grid Processor Tiles and Interfaces

(Figure: the tile array — one G-tile, a row of R-tiles, and columns of I-, D-, and E-tiles — interconnected by the global dispatch network (GDN), global status network (GSN), global control network (GCN), and operand network (OPN).)

Mapping TRIPS Blocks to the Microarchitecture

(Figures, slides 16-18: the R-tiles hold the architecture register files and read/write queues, organized by frame (0-7) and thread (0-3); each D-tile holds load/store entries indexed by LSID (0-31) per frame; each E-tile holds instruction reservation stations (8 frames x 8 slots, each with fields I, OP1, OP2). Successive blocks map to successive frames: the header and instruction chunks of block i are mapped into Frame 0, block i+1 into Frame 1, and so on.)

Mapping Target Identifiers to Reservation Stations

An ISA target identifier consists of Type (2 bits), Y (2 bits), X (2 bits), and Slot (3 bits); the Frame (3 bits) is assigned by the G-tile at runtime.

(Figure: example with Target = 87, operand OP1. The G-tile assigns Frame 4 (binary 100); the identifier selects E-tile [10,11], where the operand lands in reservation station Frame 100, Slot 101, operand slot 10 (= OP1).)
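Decoding a target identifier is pure bit-slicing, so it can be sketched directly. The field widths below follow the slide (Y: 2 bits, X: 2 bits, Slot: 3 bits for the 7-bit in-block part, plus a 3-bit frame supplied by the G-tile at run time), but the slide's bit diagram is partially garbled, so the exact bit order here is an assumption for illustration only.

```python
# Hedged sketch of target-identifier decoding. Field widths follow the
# slide (Y: 2, X: 2, Slot: 3; Frame: 3, assigned by the G-tile at run
# time); the bit order is an assumption, not the documented encoding.

def decode_target(target):
    """Split a 7-bit in-block target into (Y, X, slot)."""
    assert 0 <= target < 128
    y    = (target >> 5) & 0b11   # which E-tile row
    x    = (target >> 3) & 0b11   # which E-tile column
    slot = target & 0b111         # reservation-station slot in that tile
    return y, x, slot

def reservation_station(target, frame):
    """Combine the static target with the frame chosen at run time."""
    y, x, slot = decode_target(target)
    return {"tile": (y, x), "frame": frame, "slot": slot}

print(decode_target(53))  # 53 = 0b0110101 -> (1, 2, 5)
```

The key property the sketch preserves is that everything except the frame is fixed at compile time: the scheduler's placement decision is baked into the instruction bits, and only the frame number is a run-time rename.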

Block Fetch
- Fetch commands are sent to each instruction cache bank
- The fetch pipeline is from 4 to 11 stages deep
- A new block fetch can be initiated every 8 cycles
- Instructions are fetched into instruction queue banks (chosen by the compiler)
- The EDGE ISA allows instructions to be fetched out of order

Block Execution
- Instructions execute (out of order) when all of their operands arrive
- Intermediate values are sent from instruction to instruction
- Register reads and writes access the register banks
- Loads and stores access the data cache banks
- Branch results go to the global controller
- Up to 8 blocks can execute simultaneously

Block Commit
- Block completion is detected and reported to the global controller
- If no exceptions occurred, the results may be committed
- Writes are committed to the register files
- Stores are committed to cache or memory
- Resources are deallocated after a commit acknowledgement

Block Execution Timeline

(Figure: timeline in cycles (roughly 0-40): FETCH, EXECUTE (variable execution time), COMMIT. Blocks Bi through Bi+7 occupy Frames 2-7 and 0-1, with execute and commit overlapped across multiple blocks; Bi+8 waits for a free frame.)

- The G-tile manages frames as a circular buffer
  - D-morph: 1 thread, 8 frames
  - T-morph: up to 4 threads, 2 frames each
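The G-tile's circular-buffer frame management can be sketched as a small allocator. This is an illustrative model of the behavior described on the slide, not the actual hardware protocol: fetch claims the next free frame for the next speculative block, blocks commit in order, and a full window stalls fetch.

```python
# Sketch of the G-tile's frame management as a circular buffer:
# D-morph gives one thread all 8 frames; T-morph splits them 2 per thread.
from collections import deque

class FrameManager:
    def __init__(self, num_frames=8):
        self.free = deque(range(num_frames))   # frames allocated in order
        self.inflight = deque()                # (frame, block_id), oldest first

    def fetch(self, block_id):
        """Map the next (speculative) block into the next free frame."""
        if not self.free:
            return None                        # window full: stall fetch
        frame = self.free.popleft()
        self.inflight.append((frame, block_id))
        return frame

    def commit_oldest(self):
        """Blocks commit in order; the freed frame becomes reusable."""
        frame, block_id = self.inflight.popleft()
        self.free.append(frame)
        return block_id

fm = FrameManager()
frames = [fm.fetch(f"B{i}") for i in range(9)]
print(frames)              # -> [0, 1, 2, 3, 4, 5, 6, 7, None]  (9th stalls)
print(fm.commit_oldest())  # -> 'B0'; frame 0 is freed for the waiting block
```

A T-morph configuration would simply run one such allocator per thread with `num_frames=2`, which is the partitioning the slide describes.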

NUCA L2 Cache
- The prototype has a 1MB L2 cache divided into sixteen 64KB banks
- 4x10 2D mesh topology
- Links are 128 bits wide
- Each processor can initiate 5 requests per cycle
- Requests and replies are wormhole-routed across the network
- 4 virtual channels prevent deadlocks
- Can sustain over 100 bytes per cycle to the processors
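A back-of-the-envelope latency model makes the mesh numbers concrete. This sketch assumes dimension-ordered routing and a one-cycle-per-hop router, which are modeling assumptions of mine, not figures from the slides; only the 4x10 topology and the 128-bit links come from the slide.

```python
# Sketch: zero-load latency estimate for a request wormhole-routed across
# the 4x10 L2 mesh. The per-hop router delay is an assumed parameter.

def mesh_hops(src, dst):
    """Dimension-ordered (XY) routing distance on a 2D mesh."""
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)

def request_latency(src, dst, flits, cycles_per_hop=1):
    # Wormhole routing: the head flit pays the full hop latency; body flits
    # pipeline behind it at one flit per cycle per link.
    return mesh_hops(src, dst) * cycles_per_hop + (flits - 1)

# A 64-byte cache line over 128-bit (16-byte) links -> 4 data flits
print(request_latency((0, 0), (3, 9), flits=4))  # 12 hops + 3 = 15 cycles
```

The pipelining term `(flits - 1)` is why wide links matter: halving the link width doubles the flit count and lengthens every message's tail.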

Compiling for TRIPS

A conventional frontend and scalar optimizer ("your standard compiler — you've seen this before") feed TRIPS-specific backend phases:
- Frontend (C, FORTRAN)
- Inlining, loop unrolling, flattening
- Scalar optimizations
- Code generation (Alpha, SPARC, PPC, TRIPS)
- TRIPS block formation
- Register allocation, splitting for spill code
- Peephole, load/store ID assignment
- Store nullification
- Block splitting
- Scheduling and assembly

Fixed Size Constraint: 128 Instructions
- O3: every basic block is a TRIPS block. Simple, but not high performance.
- O4: hyperblocks as TRIPS blocks.

(Figure: a control-flow graph of basic blocks B1-B7 yields 7 TRIPS blocks under O3 but a single TRIPS block under O4.)

Size Analysis: How big is this block? 3 instructions? 5 instructions? More?

read  sp, g1
movi  t3, 1
store 384(sp), t3
store 8(sp), t3

The maximum immediate is 256, and immediate instructions have only one target, so the block expands to:

read  sp, g1
movi  t3, 1
mov   t4, t3
addi  t7, sp, 256
store 128(t7), t4
store 8(sp), t4
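The offset legalization the slide illustrates can be sketched as a tiny rewrite rule. The instruction syntax mirrors the slide's pseudo-assembly; the function, the 256 limit as an exclusive bound, and the temporary register name `t7` are assumptions for illustration, not a real TRIPS encoder.

```python
# Sketch of the legalization shown on the slide: store offsets at or above
# the maximum immediate are split into an addi plus a smaller offset.
MAX_IMM = 256

def legalize_store(offset, base, src, tmp="t7"):
    """Return the instruction(s) implementing `store offset(base), src`."""
    if offset < MAX_IMM:
        return [f"store {offset}({base}), {src}"]
    # Peel one MAX_IMM chunk into a temporary base register, as on the slide.
    return [f"addi {tmp}, {base}, {MAX_IMM}",
            f"store {offset - MAX_IMM}({tmp}), {src}"]

print(legalize_store(384, "sp", "t3"))
# -> ['addi t7, sp, 256', 'store 128(t7), t3']
print(legalize_store(8, "sp", "t3"))
# -> ['store 8(sp), t3']
```

This is exactly why "how big is this block?" is a nontrivial question for the block formation pass: the instruction count is only known after such expansions.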

Too Big? Block Splitting

What if the block is too large?
- Predicated blocks: reverse if-convert
- Unpredicated (basic blocks): insert a branch and label

Before:

L0: read  t4, g10
    addi  t3, t4, 16
    load  t5, 0(t3)
    ...
    subi  t7, t5, 8
    mult  t8, t7, t7
    store 0(t4), t8
    store 8(t4), t8
    branch L2

After splitting:

L0: read  t4, g10
    addi  t3, t4, 16
    load  t5, 0(t3)
    ...
    branch L1
    write g11, t5

L1: read  t4, g10
    read  t5, g11
    subi  t7, t5, 8
    mult  t8, t7, t7
    store 0(t4), t8
    store 8(t4), t8
    branch L2

Register Constraints: Linear Scan Allocator
- 128 registers (32 x 4 banks)
- Compute liveness over hyperblocks
- Ignore local variables
- Treat hyperblocks as large instructions

Spill counts:

SPEC2000   Alpha   TRIPS (O4)
applu        247      331
apsi        1326      196
gcc         4490     6622
mesa        2614     3821
mgrid        366       77
sixtrack     494      220

Total spills (18 Alpha vs 6 TRIPS). Average spill: 1 store for 2-3 loads.

Block Termination Constraint

Block termination: constant output per block
- A constant number of stores and writes execute
- One branch
- Simplifies the hardware logic for detecting block completion

All writes complete:
- Write nullification

All stores complete:
- Store nullification
- LSID assignment

TRIPS Scheduling Problem

(Figure: a hyperblock's flowgraph — instructions such as ld, shl, add, sub, sw, cmp, br — must be mapped onto the execution array, with edges to the register file and data caches.)

- Place instructions on the 4x4x8 grid
- Encode the placement in target form

Scheduling Algorithms

Heuristic-based list scheduler [PACT 2005]
- Greedy, top-down
- Prioritizes the critical path
- Reprioritizes after each placement
- Balances functional unit utilization
- Accounts for data cache locality
- Accounts for register bank locality
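A greedy top-down list scheduler in the spirit of the bullets above can be sketched compactly. This is a deliberately simplified model: the priority function stands in for critical-path heuristics, load balancing stands in for the utilization term, and the cache/register locality terms of the real PACT 2005 scheduler are omitted.

```python
# Hedged sketch of a greedy top-down list scheduler: repeatedly place the
# highest-priority ready instruction on the least-loaded tile. Locality
# heuristics from the real scheduler are intentionally left out.

def list_schedule(deps, priority, tiles=16, slots_per_tile=8):
    """deps: inst -> set of producer insts. Returns inst -> (tile, slot)."""
    placed, load = {}, [0] * tiles
    remaining = dict(deps)
    while remaining:
        # Ready = all producers already placed; pick the most critical first.
        ready = [i for i, d in remaining.items() if d <= placed.keys()]
        inst = max(ready, key=priority)
        tile = min(range(tiles), key=lambda t: load[t])  # balance utilization
        assert load[tile] < slots_per_tile, "grid full: block too large"
        placed[inst] = (tile, load[tile])
        load[tile] += 1
        del remaining[inst]
    return placed

deps = {"a": set(), "b": set(), "c": {"a", "b"}}
sched = list_schedule(deps, priority=lambda i: {"a": 2, "b": 1, "c": 0}[i])
print(sched["a"])  # -> (0, 0): the most critical instruction is placed first
```

The real scheduler's extra terms matter because placement determines operand-network hop counts; this sketch only captures the greedy, reprioritize-after-each-placement skeleton.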

TRIPS Polymorphous: Different Levels of Parallelism

Instruction-level parallelism [Nagarajan et al., MICRO '01]
- Populate a large instruction window with useful instructions
- Schedule instructions to optimize communication and concurrency

Thread-level parallelism
- Partition the instruction window among different threads
- Reduce contention for instruction and data supply

Data-level parallelism
- Provide high density of computational elements
- Provide high bandwidth to/from data memory

TRIPS Configurable Resources

Aggregating Reservation Stations: Frames

Extracting ILP: Frames for Speculation

Configuring Frames for TLP

Using Frames for DLP

Configuring Data Memory for DLP
- Regular data accesses
- A subset of the L2 cache banks is configured as a stream register file (SRF)
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants saved in reservation stations

Performance Results
- ILP: instruction window occupancy
  - Peak: 4x4x128 array = 2048 instructions
  - Sustained: 4.93 for SPECint, 14.12 for SPECfp
  - Bottleneck: branch prediction
- TLP: instruction and data supply
  - Peak: 100% efficiency
  - Sustained: 87% for two threads, 61% for four threads
- DLP: data supply bandwidth
  - Peak: 16 ops/cycle
  - Sustained: 6.9 ops/cycle

ASIC Implementation
- 130 nm 7LM IBM ASIC process
- 335 mm² die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out: fall 2005
- System bring-up: spring 2006

Functional Area Breakdown

TRIPS Summary

Distributed microarchitecture
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components

Tiled microarchitecture
- Simplifies scalability
- Improves design productivity

Coarse-grained, homogeneous approach with polymorphism
- ILP: well-partitioned, powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: mapping reuse of instructions and constants in the grid

Problems

Scalability
- Larger grid → more communication latency
- ILP, DLP, TLP
- Multicore/Manycore

Compatibility
- Instruction & block code

Low-efficiency architecture
- Instruction buffer
- I-cache bank, GDN
- Operand network
- LSQ & read/write queues

Polymorphous
- How to realize DLP?

Scalability: Larger Grid

(Figure: the tile grid and network diagram from slide 15 — G, R, I, D, and E tiles connected by the GDN, GSN, GCN, and OPN.)

Scalability: Multicore/Manycore

Compatibility

Low-Efficiency Architecture

(Figure: link utilization for SPEC CPU 2000.)

Polymorphous

(Figure: the Imagine stream processor.)

Page 2: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 2

Outline

Background TRIPS ISA amp Architecture introduction TRIPS Compile amp Schedule TRIPS Polymorphous TRIPS ASIC Implementation Problem

23424 USTC CS AN Hong 3

Key Trends

Power Slowing (stopping) frequency increases Wire Delays Reliability Reconfigurable

90 nm

65 nm

35 nm

130 nm

20 mm

23424 USTC CS AN Hong 4

SuperScalar Core

23424 USTC CS AN Hong 5

Conventional Microarchitectures

23424 USTC CS AN Hong 6

TRIPS Project Goals Technology scalable processor and memory

architecturesminus Techniques to scale to 35nm and beyondminus Enable high clock rates if desiredminus High design productivity through replication

Good performance across diverse workloadsminus Exploit instruction thread and data level parallelismminus Work with standard programming models

Power-efficient instruction level parallelism Demonstrate via custom hardware prototype

minus Implement with small design teamminus Evaluate identify bottlenecks tune microarchitecture

23424 USTC CS AN Hong 7

Key Features EDGE ISA

minus Block-oriented instruction set architectureminus Helps reduce bottlenecks and expose ILP

Tiled Microarchitectureminus Modular designminus No global wires

TRIPS Processorminus Distributed processor designminus Dataflow graph execution engine

NUCA L2 Cacheminus Distributed cache design

23424 USTC CS AN Hong 8

TRIPS Chip

2 TRIPS Processors NUCA L2 Cache

minus 1 MB 16 banks

On-Chip Network (OCN)minus 2D mesh networkminus Replaces on-chip bus

Misc Controllersminus 2 DDR SDRAM controllersminus 2 DMA controllersminus External bus controllerminus C2C network controller

23424 USTC CS AN Hong 9

TRIPS Processor Want an aggressive general-purpose processor

minus Up to 16 instructions per cycleminus Up to 4 loads and stores per cycleminus Up to 64 outstanding L1 data cache missesminus Up to 1024 dynamically executed instructionsminus Up to 4 simultaneous multithreading (SMT) threads

But existing microarchitectures donrsquot scale wellminus Structures become large multi-ported and slowminus Lots of overhead to convert from sequential instruction

semanticsminus Vulnerable to speculation hazards

TRIPS introduces a new microarchitecture and ISA

23424 USTC CS AN Hong 10

EDGE ISA Explicit Data Graph Execution

(EDGE) Block-Oriented

minus Atomically fetch execute and commit whole blocks of instructions

minus Programs are partitioned into blocksminus Each block holds dozens of instructionsminus Sequential execution semantics at the

block levelminus Dataflow execution semantics inside

each block

Direct Target Encodingminus Encode instructions so that results go

directly to the instruction(s) that will consume them

minus No need to go through centralized register file and rename logic

23424 USTC CS AN Hong 11

Block Formation Basic blocks are often too small

(just a few insts) Predication allows larger

hyperblocks to be created Loop unrolling and function

inlining also help TRIPS blocks can hold up to 128

instructions Large blocks improve fetch

bandwidth and expose ILP Hard-to-predict branches can

sometimes be hidden inside a hyperblock

23424 USTC CS AN Hong 12

TRIPS Block Format Each block is formed from two to

five 128-byte programldquochunksrdquo Blocks with fewer than five

chunks are expanded to five chunks in the L1 I-cache

The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions

Each instruction chunk holds 32 4-byte instructions (including NOPs)

A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions

HeaderChunk

InstructionChunk 0

PC

128 Bytes

128 Bytes

128 Bytes

128 Bytes

128 Bytes

InstructionChunk 1

InstructionChunk 2

InstructionChunk 3

23424 USTC CS AN Hong 13

Processor Tiles Partition all major structures into

banks distribute and interconnect Execution Tile (E)

minus 64-entry Instruction Queue bankminus Single-issue execute pipeline

Register Tile (R)minus 32-entry Register bank (per thread)

Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks

Instruction Tile (I)minus 16KB Instruction Cache bank

Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic

23424 USTC CS AN Hong 14

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

Grid Processor Tiles and Interfaces

23424 USTC CS AN Hong 15

Mapping TRIPS Blocks to the Microarchitecture

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

23424 USTC CS AN Hong 16

Mapping TRIPS Blocks to the Microarchitecture

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

HeaderChunk

InstChunk 0

InstChunk 3

Block i mapped into Frame 0

23424 USTC CS AN Hong 17

Mapping TRIPS Blocks to the Microarchitecture

HeaderChunk

InstChunk 0

InstChunk 3

Block i+1 mapped into Frame 1

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

ot

E-tile[33]

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

23424 USTC CS AN Hong 18

Mapping Target Identifiers to Reservation Stations

I OP1 OP2Block 4

7

0

Slot

100 10 10110 11

Target = 87 OP1

Frame 4

Type(2 bits)

Y(2 bits)

X(2 bits)

Slot(3 bits)

Frame(3 bits)

ISA Target IdentifierFrame

(assigned by GTat runtime)

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

[1011]

10 11100 10 101

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7

Slot

E-tile

Frame 100Slot 101OP 10 = OP1

23424 USTC CS AN Hong 19

Block Fetch Fetch commands sent to

each Instruction Cache bank

The fetch pipeline is from 4 to 11 stages deep

A new block fetch can be initiated every 8 cycles

Instructions are fetched into Instruction Queue banks (chosen by the compiler)

EDGE ISA allows instructions to be fetched out-of-order

23424 USTC CS AN Hong 20

Block Execution Instructions execute (out-oforder)

when all of their operands arrive Intermediate values are sent

from instruction to instruction Register reads and writes

access the register banks Loads and stores access the

data cache banks Branch results go to the

global controller Up to 8 blocks can execute

simultaneously

23424 USTC CS AN Hong 21

Block Commit 1048577 Block completion is detected

and reported to the global controller

1048577 If no exceptions occurred the results may be committed

1048577 Writes are committed to Register files

1048577 Stores are committed to cache or memory

1048577 Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline

COMMITFETCH EXECUTE

5 10 30 400Frame 2 Bi

(variable execution time)

Time (cycles)

Frame 4

Frame 5

Frame 6

Frame 7

Frame 0

Frame 1

Bi+2

Bi+3

Bi+4

Bi+5

Bi+6

Bi+7

Frame 3 Bi+1

Executecommit overlapped across multiple blocks

Bi+8

G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each

23424 USTC CS AN Hong 23

NUCA L2 Cache 1048577 Prototype has 1MB L2

cache divided into sixteen 64KB banks

1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate

5 requests per cycle 1048577 Requests and replies are

wormhole-routed across the network

1048577 4 virtual channels prevent deadlocks

1048577 Can sustain over 100 bytes per cycle to the processors

23424 USTC CS AN Hong 24

Compiling for TRIPS

C

InliningLoop UnrollingFlattening

Scalar Optimizations

Your standard compileryoursquove seen this before

Frontend

FORTRAN

Code Generation

Alpha SPARC PPC TRIPS

TRIPS Block Formation

Register AllocationSplitting for Spill Code

PeepholeLoadStore ID Assignment

Store Nullification

Block Splitting

Scheduling and Assembly

23424 USTC CS AN Hong 25

Fixed Size Constraint 128 Instructions

bull O3 every basic block is a TRIPS block Simple but not high performance

bull O4 hyperblocks as TRIPS blocks

B1

B3B2

B4 B5

B6

B7

7 TRIPS Blocks

1 TRIPS Block

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment

23424 USTC CS AN Hong 30

ldshladdswbr

TRIPS Scheduling Problem

addaddldcmpbr

subshlldcmpbr

ldaddaddswbr

swswaddcmpbr

ld

Register File

Data C

aches

Hyperblock

addadd

Flowgraph

bull Place instructions on 4x4x8 gridbull Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]

minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality

23424 USTC CS AN Hong 32

TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]

minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency

Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply

Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations Frames

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 3: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 3

Key Trends

Power Slowing (stopping) frequency increases Wire Delays Reliability Reconfigurable

90 nm

65 nm

35 nm

130 nm

20 mm

23424 USTC CS AN Hong 4

SuperScalar Core

23424 USTC CS AN Hong 5

Conventional Microarchitectures

23424 USTC CS AN Hong 6

TRIPS Project Goals Technology scalable processor and memory

architecturesminus Techniques to scale to 35nm and beyondminus Enable high clock rates if desiredminus High design productivity through replication

Good performance across diverse workloadsminus Exploit instruction thread and data level parallelismminus Work with standard programming models

Power-efficient instruction level parallelism Demonstrate via custom hardware prototype

minus Implement with small design teamminus Evaluate identify bottlenecks tune microarchitecture

23424 USTC CS AN Hong 7

Key Features
 • EDGE ISA
  − Block-oriented instruction set architecture
  − Helps reduce bottlenecks and expose ILP
 • Tiled Microarchitecture
  − Modular design
  − No global wires
 • TRIPS Processor
  − Distributed processor design
  − Dataflow graph execution engine
 • NUCA L2 Cache
  − Distributed cache design

23424 USTC CS AN Hong 8

TRIPS Chip
 • 2 TRIPS Processors
 • NUCA L2 Cache
  − 1 MB, 16 banks
 • On-Chip Network (OCN)
  − 2D mesh network
  − Replaces the on-chip bus
 • Misc. Controllers
  − 2 DDR SDRAM controllers
  − 2 DMA controllers
  − External bus controller
  − C2C network controller

23424 USTC CS AN Hong 9

TRIPS Processor
 • Want an aggressive general-purpose processor
  − Up to 16 instructions per cycle
  − Up to 4 loads and stores per cycle
  − Up to 64 outstanding L1 data cache misses
  − Up to 1024 dynamically executed instructions
  − Up to 4 simultaneous multithreading (SMT) threads
 • But existing microarchitectures don't scale well
  − Structures become large, multi-ported, and slow
  − Lots of overhead to convert from sequential instruction semantics
  − Vulnerable to speculation hazards
 • TRIPS introduces a new microarchitecture and ISA

23424 USTC CS AN Hong 10

EDGE ISA
 • Explicit Data Graph Execution (EDGE)
 • Block-Oriented
  − Atomically fetch, execute, and commit whole blocks of instructions
  − Programs are partitioned into blocks
  − Each block holds dozens of instructions
  − Sequential execution semantics at the block level
  − Dataflow execution semantics inside each block
 • Direct Target Encoding
  − Encode instructions so that results go directly to the instruction(s) that will consume them
  − No need to go through a centralized register file and rename logic

23424 USTC CS AN Hong 11

Block Formation
 • Basic blocks are often too small (just a few insts)
 • Predication allows larger hyperblocks to be created
 • Loop unrolling and function inlining also help
 • TRIPS blocks can hold up to 128 instructions
 • Large blocks improve fetch bandwidth and expose ILP
 • Hard-to-predict branches can sometimes be hidden inside a hyperblock

23424 USTC CS AN Hong 12

TRIPS Block Format
 • Each block is formed from two to five 128-byte program "chunks"
 • Blocks with fewer than five chunks are expanded to five chunks in the L1 I-cache
 • The header chunk includes a block header (execution flags plus a store mask) and register read/write instructions
 • Each instruction chunk holds 32 4-byte instructions (including NOPs)
 • A maximally sized block contains 128 regular instructions, 32 read instructions, and 32 write instructions

[Figure: the PC points to a 128-byte header chunk followed by four 128-byte instruction chunks (0-3)]
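The chunk layout above can be sketched as a small calculation: a block with n instruction chunks occupies (1 + n) x 128 bytes in memory but always occupies the full five chunks in the L1 I-cache. The function name below is illustrative, not part of the TRIPS toolchain.

```python
CHUNK_BYTES = 128
MAX_CHUNKS = 5  # 1 header chunk + up to 4 instruction chunks

def block_sizes(num_inst_chunks):
    """Memory footprint vs. I-cache footprint of a TRIPS block."""
    assert 1 <= num_inst_chunks <= 4
    in_memory = (1 + num_inst_chunks) * CHUNK_BYTES   # header + inst chunks
    in_icache = MAX_CHUNKS * CHUNK_BYTES              # padded to 5 chunks
    return in_memory, in_icache

# A block with a single instruction chunk takes 256 bytes in memory
# but is expanded to 640 bytes (5 chunks) in the I-cache.
```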

23424 USTC CS AN Hong 13

Processor Tiles
 • Partition all major structures into banks, distribute, and interconnect
 • Execution Tile (E)
  − 64-entry Instruction Queue bank
  − Single-issue execute pipeline
 • Register Tile (R)
  − 32-entry Register bank (per thread)
 • Data Tile (D)
  − 8KB Data Cache bank
  − LSQ and MHU banks
 • Instruction Tile (I)
  − 16KB Instruction Cache bank
 • Global Control Tile (G)
  − Tracks up to 8 blocks of insts
  − Branch prediction & resolution logic

23424 USTC CS AN Hong 14

Grid Processor Tiles and Interfaces

[Figure: one G tile, four R tiles, five I tiles, four D tiles, and a 4x4 array of E tiles, interconnected by the global dispatch network (GDN), global status network (GSN), global control network (GCN), and operand network (OPN)]

23424 USTC CS AN Hong 15

Mapping TRIPS Blocks to the Microarchitecture

[Figure: per-tile storage indexed by frame 0-7. R-tile[3] holds read/write queues and architecture register files (per thread 0-3); D-tile[3] holds a 32-entry LSID queue per frame; E-tile[3,3] holds instruction reservation stations (instruction, OP1, OP2) with 8 slots per frame]

23424 USTC CS AN Hong 16

Mapping TRIPS Blocks to the Microarchitecture

[Figure: block i is mapped into frame 0. Its header chunk fills the read/write queues, and its instruction chunks 0-3 fill the reservation stations]

23424 USTC CS AN Hong 17

Mapping TRIPS Blocks to the Microarchitecture

[Figure: the next block, i+1, is mapped the same way into frame 1]

23424 USTC CS AN Hong 18

Mapping Target Identifiers to Reservation Stations

 • An ISA target identifier has the fields Type (2 bits), Y (2 bits), X (2 bits), and Slot (3 bits); the Frame (3 bits) is assigned by the G-tile at runtime

[Figure: example for target 87, operand OP1, in frame 4 (binary 100). The Y/X bits 10/11 route the operand to E-tile [10,11], where it lands in reservation-station frame 100, slot 101, operand OP 10 = OP1]
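A minimal sketch of this decoding, assuming the field order shown on the slide (Type, Y, X, Slot packed high to low in a 9-bit identifier) with the runtime frame attached afterwards; the exact bit packing is an assumption for illustration:

```python
def decode_target(target_bits, frame):
    """Split a 9-bit target identifier into its fields and attach the
    runtime frame assigned by the G-tile (bit layout assumed)."""
    type_ = (target_bits >> 7) & 0b11   # operand type, e.g. OP1 vs. OP2
    y     = (target_bits >> 5) & 0b11   # E-tile row
    x     = (target_bits >> 3) & 0b11   # E-tile column
    slot  = target_bits & 0b111         # reservation-station slot
    return {"type": type_, "tile": (y, x), "slot": slot, "frame": frame}

# An operand with type=10, Y=10, X=11, slot=101, routed into frame 100:
t = decode_target(0b10_10_11_101, frame=0b100)
# t["tile"] is (0b10, 0b11), t["slot"] is 0b101
```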

23424 USTC CS AN Hong 19

Block Fetch
 • Fetch commands are sent to each Instruction Cache bank
 • The fetch pipeline is from 4 to 11 stages deep
 • A new block fetch can be initiated every 8 cycles
 • Instructions are fetched into Instruction Queue banks (chosen by the compiler)
 • The EDGE ISA allows instructions to be fetched out of order

23424 USTC CS AN Hong 20

Block Execution
 • Instructions execute (out of order) when all of their operands arrive
 • Intermediate values are sent from instruction to instruction
 • Register reads and writes access the register banks
 • Loads and stores access the data cache banks
 • Branch results go to the global controller
 • Up to 8 blocks can execute simultaneously

23424 USTC CS AN Hong 21

Block Commit
 • Block completion is detected and reported to the global controller
 • If no exceptions occurred, the results may be committed
 • Writes are committed to register files
 • Stores are committed to cache or memory
 • Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline

[Figure: blocks Bi through Bi+8 occupy frames 0-7 in turn; each block's fetch, variable-length execute, and commit phases overlap with those of the other blocks, so execute and commit are overlapped across multiple blocks]

 • G-tile manages frames as a circular buffer
  − D-morph: 1 thread, 8 frames
  − T-morph: up to 4 threads, 2 frames each
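The circular frame management can be sketched as a toy allocator, with the D-morph/T-morph split expressed as frames per thread. The class and method names are illustrative, not the hardware interface.

```python
class FrameBuffer:
    """Toy model of the G-tile's circular frame buffer (8 frames)."""
    def __init__(self, num_threads=1, total_frames=8):
        assert total_frames % num_threads == 0
        # 8 frames per thread in D-morph, 2 per thread in 4-way T-morph
        self.per_thread = total_frames // num_threads
        self.heads = [0] * num_threads

    def allocate(self, thread):
        """Return the next frame for `thread`, wrapping circularly."""
        base = thread * self.per_thread
        frame = base + self.heads[thread]
        self.heads[thread] = (self.heads[thread] + 1) % self.per_thread
        return frame

# D-morph: one thread cycles through frames 0..7 and wraps.
# T-morph: thread 1 of 4 cycles through frames 2, 3, 2, 3, ...
```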

23424 USTC CS AN Hong 23

NUCA L2 Cache
 • Prototype has a 1MB L2 cache divided into sixteen 64KB banks
 • 4x10 2D mesh topology
 • Links are 128 bits wide
 • Each processor can initiate 5 requests per cycle
 • Requests and replies are wormhole-routed across the network
 • 4 virtual channels prevent deadlocks
 • Can sustain over 100 bytes per cycle to the processors
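One way to picture the banked organization is an address-to-bank map that line-interleaves across the 16 banks. This is a sketch under assumed parameters (the line size and the actual TRIPS interleaving scheme are not given on the slide).

```python
BANK_SIZE = 64 * 1024      # 64 KB per bank
NUM_BANKS = 16             # 16 banks = 1 MB total
LINE_BYTES = 64            # assumed cache-line size

def bank_of(addr):
    """Select an L2 bank by line-interleaving the address (illustrative)."""
    return (addr // LINE_BYTES) % NUM_BANKS

# Consecutive cache lines land in consecutive banks, which spreads
# a streaming access pattern across the whole mesh.
```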

23424 USTC CS AN Hong 24

Compiling for TRIPS

 • Frontend (C, FORTRAN)
  → Inlining, Loop Unrolling, Flattening
  → Scalar Optimizations
  → Code Generation (Alpha, SPARC, PPC, TRIPS)
 (your standard compiler, you've seen this before)

 • TRIPS backend
  → TRIPS Block Formation
  → Register Allocation, Splitting for Spill Code
  → Peephole, Load/Store ID Assignment, Store Nullification, Block Splitting
  → Scheduling and Assembly

23424 USTC CS AN Hong 25

Fixed Size Constraint: 128 Instructions
 • O3: every basic block is a TRIPS block. Simple, but not high performance
 • O4: hyperblocks as TRIPS blocks

[Figure: a control-flow graph of basic blocks B1-B7 becomes 7 TRIPS blocks under O3 but a single TRIPS block under O4]

23424 USTC CS AN Hong 26

Size Analysis
 How big is this block? 3 instructions? 5 instructions? More?

 read sp, g1
 movi t3, 1
 store 384(sp), t3
 store 8(sp), t3

 The max immediate is 256, and immediate instructions have one target, so the block actually expands to:

 read sp, g1
 movi t3, 1
 mov t4, t3
 addi t7, sp, 256
 store 128(t7), t4
 store 8(sp), t4
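The offset rewrite above can be sketched directly: when a store offset exceeds the 256 immediate limit, split it into an addi plus a store with the remainder. This is a toy legalization pass, not the actual TRIPS code generator.

```python
MAX_IMM = 256

def legalize_store(base, offset, src, tmp="t7"):
    """Rewrite `store offset(base), src` when the offset exceeds MAX_IMM."""
    if offset < MAX_IMM:
        return [f"store {offset}({base}), {src}"]
    return [
        f"addi {tmp}, {base}, {MAX_IMM}",           # fold the first 256 bytes
        f"store {offset - MAX_IMM}({tmp}), {src}",  # remainder fits the field
    ]

# legalize_store("sp", 384, "t4") yields the addi/store pair from the slide.
```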

23424 USTC CS AN Hong 27

Too Big? Block Splitting
 What if the block is too large?
 • Predicated blocks: reverse if-convert
 • Unpredicated (basic blocks): insert a branch and label

 L0: read t4, g10
     addi t3, t4, 16
     load t5, 0(t3)
     …
     subi t7, t5, 8
     mult t8, t7, t7
     store 0(t4), t8
     store 8(t4), t8
     branch L2

 becomes

 L0: read t4, g10
     addi t3, t4, 16
     load t5, 0(t3)
     …
     write g11, t5
     branch L1

 L1: read t4, g10
     read t5, g11
     subi t7, t5, 8
     mult t8, t7, t7
     store 0(t4), t8
     store 8(t4), t8
     branch L2

23424 USTC CS AN Hong 28

Register Constraints: Linear Scan Allocator
 • 128 registers (32 x 4 banks)
 • Compute liveness over hyperblocks
 • Ignore local variables
 • Hyperblocks as large instructions

 Total spills (18 Alpha vs. 6 TRIPS):

 SPEC2000   Alpha   TRIPS (O4)
 applu        247      331
 apsi        1326      196
 gcc         4490     6622
 mesa        2614     3821
 mgrid        366       77
 sixtrack     494      220

 Average spill: 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint
 • Block termination: constant output per block
  − A constant number of stores and writes execute
  − One branch
  − Simplifies hardware logic for detecting block completion
 • All writes complete
  − Write nullification
 • All stores complete
  − Store nullification
  − LSID assignment

23424 USTC CS AN Hong 30

TRIPS Scheduling Problem

[Figure: a hyperblock's dataflow graph, with instruction groups such as ld/shl/add/sw/br and add/add/ld/cmp/br, must be placed onto the execution grid between the register file and the data caches]

 • Place instructions on the 4x4x8 grid
 • Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling Algorithms
 Heuristic-based list scheduler [PACT 2005]
  − Greedy, top-down
  − Prioritizes the critical path
  − Reprioritizes after each placement
  − Balances functional unit utilization
  − Accounts for data cache locality
  − Accounts for register bank locality
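A greedy top-down list scheduler of this flavor can be sketched as below: critical-path priority plus a simple Manhattan-distance placement cost. The heuristics are simplified stand-ins for the ones listed above, not the published algorithm.

```python
def list_schedule(deps, length_to_exit, slots):
    """Greedy list scheduling. `deps` maps instr -> set of predecessors,
    `length_to_exit` gives critical-path priority, `slots` is the free grid.
    Returns {instr: (x, y)} placements (toy model of the TRIPS scheduler)."""
    placed, free = {}, set(slots)
    ready = [i for i in deps if not deps[i]]
    while ready:
        # Prioritize the instruction on the longest path to a block exit.
        inst = max(ready, key=lambda i: length_to_exit[i])
        # Place near already-placed producers to shorten operand routes.
        preds = [placed[p] for p in deps[inst] if p in placed]
        def cost(slot):
            return sum(abs(slot[0] - p[0]) + abs(slot[1] - p[1]) for p in preds)
        spot = min(free, key=cost)
        placed[inst], free = spot, free - {spot}
        ready.remove(inst)
        ready += [i for i in deps
                  if i not in placed and i not in ready
                  and all(p in placed for p in deps[i])]
    return placed
```

On a 4x4 grid this places each ready instruction at the free slot closest to its producers, mirroring the operand-network locality concern in the real scheduler.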

23424 USTC CS AN Hong 32

TRIPS Polymorphous: Different Levels of Parallelism
 • Instruction-level parallelism [Nagarajan et al., MICRO '01]
  − Populate a large instruction window with useful instructions
  − Schedule instructions to optimize communication and concurrency
 • Thread-level parallelism
  − Partition the instruction window among different threads
  − Reduce contention for instruction and data supply
 • Data-level parallelism
  − Provide high density of computational elements
  − Provide high bandwidth to/from data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations: Frames

23424 USTC CS AN Hong 35

Extracting ILP: Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

 • Regular data accesses
 • A subset of L2 cache banks configured as an SRF (stream register file)
 • High-bandwidth data channels to the SRF
 • Reduced address communication
 • Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance Results
 • ILP: instruction window occupancy
  − Peak: 4x4x128 array → 2048 instructions
  − Sustained: 493 for SPECint, 1412 for SPECfp
  − Bottleneck: branch prediction
 • TLP: instruction and data supply
  − Peak: 100% efficiency
  − Sustained: 87% for two threads, 61% for four threads
 • DLP: data supply bandwidth
  − Peak: 16 ops/cycle
  − Sustained: 6.9 ops/cycle

23424 USTC CS AN Hong 41

ASIC Implementation
 • 130 nm 7LM IBM ASIC process
 • 335 mm² die
 • 47.5 mm x 47.5 mm package
 • ~170 million transistors
 • ~600 signal I/Os
 • ~500 MHz clock frequency
 • Tape-out fall 2005
 • System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary
 • Distributed microarchitecture
  − Acknowledges and tolerates wire delay
  − Scalable protocols tailored for distributed components
 • Tiled microarchitecture
  − Simplifies scalability
  − Improves design productivity
 • Coarse-grained homogeneous approach with polymorphism
  − ILP: well-partitioned, powerful uniprocessor (GPA)
  − TLP: divide the instruction window among different threads
  − DLP: mapping reuse of instructions and constants in the grid

23424 USTC CS AN Hong 44

Problems
 • Scalable?
  − Larger grid → more communication latency
  − ILP, DLP, TLP
  − Multicore / Manycore
 • Compatibility
  − Instruction & block code
 • Low-efficiency architecture
  − Instruction buffer
  − I-Cache bank, GDN
  − Operand network
  − LSQ & read/write queue
 • Polymorphous
  − How to realize DLP?

23424 USTC CS AN Hong 45

Scalable?

 Larger grid

[Figure: the grid processor tile diagram (G, R, I, D, and E tiles with the GDN, GSN, GCN, and OPN networks) repeated, illustrating how a larger grid stretches every network]

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

Page 4: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 4

SuperScalar Core

23424 USTC CS AN Hong 5

Conventional Microarchitectures

23424 USTC CS AN Hong 6

TRIPS Project Goals Technology scalable processor and memory

architecturesminus Techniques to scale to 35nm and beyondminus Enable high clock rates if desiredminus High design productivity through replication

Good performance across diverse workloadsminus Exploit instruction thread and data level parallelismminus Work with standard programming models

Power-efficient instruction level parallelism Demonstrate via custom hardware prototype

minus Implement with small design teamminus Evaluate identify bottlenecks tune microarchitecture

23424 USTC CS AN Hong 7

Key Features EDGE ISA

minus Block-oriented instruction set architectureminus Helps reduce bottlenecks and expose ILP

Tiled Microarchitectureminus Modular designminus No global wires

TRIPS Processorminus Distributed processor designminus Dataflow graph execution engine

NUCA L2 Cacheminus Distributed cache design

23424 USTC CS AN Hong 8

TRIPS Chip

2 TRIPS Processors NUCA L2 Cache

minus 1 MB 16 banks

On-Chip Network (OCN)minus 2D mesh networkminus Replaces on-chip bus

Misc Controllersminus 2 DDR SDRAM controllersminus 2 DMA controllersminus External bus controllerminus C2C network controller

23424 USTC CS AN Hong 9

TRIPS Processor Want an aggressive general-purpose processor

minus Up to 16 instructions per cycleminus Up to 4 loads and stores per cycleminus Up to 64 outstanding L1 data cache missesminus Up to 1024 dynamically executed instructionsminus Up to 4 simultaneous multithreading (SMT) threads

But existing microarchitectures donrsquot scale wellminus Structures become large multi-ported and slowminus Lots of overhead to convert from sequential instruction

semanticsminus Vulnerable to speculation hazards

TRIPS introduces a new microarchitecture and ISA

23424 USTC CS AN Hong 10

EDGE ISA Explicit Data Graph Execution

(EDGE) Block-Oriented

minus Atomically fetch execute and commit whole blocks of instructions

minus Programs are partitioned into blocksminus Each block holds dozens of instructionsminus Sequential execution semantics at the

block levelminus Dataflow execution semantics inside

each block

Direct Target Encodingminus Encode instructions so that results go

directly to the instruction(s) that will consume them

minus No need to go through centralized register file and rename logic

23424 USTC CS AN Hong 11

Block Formation Basic blocks are often too small

(just a few insts) Predication allows larger

hyperblocks to be created Loop unrolling and function

inlining also help TRIPS blocks can hold up to 128

instructions Large blocks improve fetch

bandwidth and expose ILP Hard-to-predict branches can

sometimes be hidden inside a hyperblock

23424 USTC CS AN Hong 12

TRIPS Block Format Each block is formed from two to

five 128-byte programldquochunksrdquo Blocks with fewer than five

chunks are expanded to five chunks in the L1 I-cache

The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions

Each instruction chunk holds 32 4-byte instructions (including NOPs)

A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions

HeaderChunk

InstructionChunk 0

PC

128 Bytes

128 Bytes

128 Bytes

128 Bytes

128 Bytes

InstructionChunk 1

InstructionChunk 2

InstructionChunk 3

23424 USTC CS AN Hong 13

Processor Tiles Partition all major structures into

banks distribute and interconnect Execution Tile (E)

minus 64-entry Instruction Queue bankminus Single-issue execute pipeline

Register Tile (R)minus 32-entry Register bank (per thread)

Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks

Instruction Tile (I)minus 16KB Instruction Cache bank

Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic

23424 USTC CS AN Hong 14

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

Grid Processor Tiles and Interfaces

23424 USTC CS AN Hong 15

Mapping TRIPS Blocks to the Microarchitecture

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

23424 USTC CS AN Hong 16

Mapping TRIPS Blocks to the Microarchitecture

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

HeaderChunk

InstChunk 0

InstChunk 3

Block i mapped into Frame 0

23424 USTC CS AN Hong 17

Mapping TRIPS Blocks to the Microarchitecture

HeaderChunk

InstChunk 0

InstChunk 3

Block i+1 mapped into Frame 1

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

ot

E-tile[33]

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

23424 USTC CS AN Hong 18

Mapping Target Identifiers to Reservation Stations

I OP1 OP2Block 4

7

0

Slot

100 10 10110 11

Target = 87 OP1

Frame 4

Type(2 bits)

Y(2 bits)

X(2 bits)

Slot(3 bits)

Frame(3 bits)

ISA Target IdentifierFrame

(assigned by GTat runtime)

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

[1011]

10 11100 10 101

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7

Slot

E-tile

Frame 100Slot 101OP 10 = OP1

23424 USTC CS AN Hong 19

Block Fetch Fetch commands sent to

each Instruction Cache bank

The fetch pipeline is from 4 to 11 stages deep

A new block fetch can be initiated every 8 cycles

Instructions are fetched into Instruction Queue banks (chosen by the compiler)

EDGE ISA allows instructions to be fetched out-of-order

23424 USTC CS AN Hong 20

Block Execution Instructions execute (out-oforder)

when all of their operands arrive Intermediate values are sent

from instruction to instruction Register reads and writes

access the register banks Loads and stores access the

data cache banks Branch results go to the

global controller Up to 8 blocks can execute

simultaneously

23424 USTC CS AN Hong 21

Block Commit 1048577 Block completion is detected

and reported to the global controller

1048577 If no exceptions occurred the results may be committed

1048577 Writes are committed to Register files

1048577 Stores are committed to cache or memory

1048577 Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline

COMMITFETCH EXECUTE

5 10 30 400Frame 2 Bi

(variable execution time)

Time (cycles)

Frame 4

Frame 5

Frame 6

Frame 7

Frame 0

Frame 1

Bi+2

Bi+3

Bi+4

Bi+5

Bi+6

Bi+7

Frame 3 Bi+1

Executecommit overlapped across multiple blocks

Bi+8

G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each

23424 USTC CS AN Hong 23

NUCA L2 Cache 1048577 Prototype has 1MB L2

cache divided into sixteen 64KB banks

1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate

5 requests per cycle 1048577 Requests and replies are

wormhole-routed across the network

1048577 4 virtual channels prevent deadlocks

1048577 Can sustain over 100 bytes per cycle to the processors

23424 USTC CS AN Hong 24

Compiling for TRIPS

C

InliningLoop UnrollingFlattening

Scalar Optimizations

Your standard compileryoursquove seen this before

Frontend

FORTRAN

Code Generation

Alpha SPARC PPC TRIPS

TRIPS Block Formation

Register AllocationSplitting for Spill Code

PeepholeLoadStore ID Assignment

Store Nullification

Block Splitting

Scheduling and Assembly

23424 USTC CS AN Hong 25

Fixed Size Constraint 128 Instructions

bull O3 every basic block is a TRIPS block Simple but not high performance

bull O4 hyperblocks as TRIPS blocks

B1

B3B2

B4 B5

B6

B7

7 TRIPS Blocks

1 TRIPS Block

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment

23424 USTC CS AN Hong 30

ldshladdswbr

TRIPS Scheduling Problem

addaddldcmpbr

subshlldcmpbr

ldaddaddswbr

swswaddcmpbr

ld

Register File

Data C

aches

Hyperblock

addadd

Flowgraph

bull Place instructions on 4x4x8 gridbull Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]

minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality

23424 USTC CS AN Hong 32

TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]

minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency

Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply

Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations Frames

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 5: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 5

Conventional Microarchitectures

23424 USTC CS AN Hong 6

TRIPS Project Goals Technology scalable processor and memory

architecturesminus Techniques to scale to 35nm and beyondminus Enable high clock rates if desiredminus High design productivity through replication

Good performance across diverse workloadsminus Exploit instruction thread and data level parallelismminus Work with standard programming models

Power-efficient instruction level parallelism Demonstrate via custom hardware prototype

minus Implement with small design teamminus Evaluate identify bottlenecks tune microarchitecture

23424 USTC CS AN Hong 7

Key Features EDGE ISA

minus Block-oriented instruction set architectureminus Helps reduce bottlenecks and expose ILP

Tiled Microarchitectureminus Modular designminus No global wires

TRIPS Processorminus Distributed processor designminus Dataflow graph execution engine

NUCA L2 Cacheminus Distributed cache design

23424 USTC CS AN Hong 8

TRIPS Chip

2 TRIPS Processors NUCA L2 Cache

minus 1 MB 16 banks

On-Chip Network (OCN)minus 2D mesh networkminus Replaces on-chip bus

Misc Controllersminus 2 DDR SDRAM controllersminus 2 DMA controllersminus External bus controllerminus C2C network controller

23424 USTC CS AN Hong 9

TRIPS Processor Want an aggressive general-purpose processor

minus Up to 16 instructions per cycleminus Up to 4 loads and stores per cycleminus Up to 64 outstanding L1 data cache missesminus Up to 1024 dynamically executed instructionsminus Up to 4 simultaneous multithreading (SMT) threads

But existing microarchitectures donrsquot scale wellminus Structures become large multi-ported and slowminus Lots of overhead to convert from sequential instruction

semanticsminus Vulnerable to speculation hazards

TRIPS introduces a new microarchitecture and ISA

23424 USTC CS AN Hong 10

EDGE ISA Explicit Data Graph Execution

(EDGE) Block-Oriented

minus Atomically fetch execute and commit whole blocks of instructions

minus Programs are partitioned into blocksminus Each block holds dozens of instructionsminus Sequential execution semantics at the

block levelminus Dataflow execution semantics inside

each block

Direct Target Encodingminus Encode instructions so that results go

directly to the instruction(s) that will consume them

minus No need to go through centralized register file and rename logic

23424 USTC CS AN Hong 11

Block Formation Basic blocks are often too small

(just a few insts) Predication allows larger

hyperblocks to be created Loop unrolling and function

inlining also help TRIPS blocks can hold up to 128

instructions Large blocks improve fetch

bandwidth and expose ILP Hard-to-predict branches can

sometimes be hidden inside a hyperblock

23424 USTC CS AN Hong 12

TRIPS Block Format Each block is formed from two to

five 128-byte programldquochunksrdquo Blocks with fewer than five

chunks are expanded to five chunks in the L1 I-cache

The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions

Each instruction chunk holds 32 4-byte instructions (including NOPs)

A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions

HeaderChunk

InstructionChunk 0

PC

128 Bytes

128 Bytes

128 Bytes

128 Bytes

128 Bytes

InstructionChunk 1

InstructionChunk 2

InstructionChunk 3

23424 USTC CS AN Hong 13

Processor Tiles
Partition all major structures into banks, distribute, and interconnect:
- Execution Tile (E): 64-entry Instruction Queue bank; single-issue execute pipeline
- Register Tile (R): 32-entry Register bank (per thread)
- Data Tile (D): 8KB Data Cache bank; LSQ and MHU banks
- Instruction Tile (I): 16KB Instruction Cache bank
- Global Control Tile (G): tracks up to 8 blocks of instructions; branch prediction & resolution logic

Grid Processor Tiles and Interfaces

[Figure: a tile grid - one G tile and four R tiles across the top row, an I tile and a D tile at the left of each remaining row, and a 4x4 array of E tiles - connected by the GDN (global dispatch network), GSN (global status network), GCN (global control network), and OPN (operand network)]

Mapping TRIPS Blocks to the Microarchitecture

[Figure: a block's header chunk and instruction chunks 0-3 are distributed across the tile grid. R-tile[3] holds the architecture register files and read/write queues indexed by frame (0-7) and thread (0-3); D-tile[3] holds LSIDs 0-31 per frame; E-tile[3,3] holds instruction reservation stations (I, OP1, OP2) with 8 slots per frame. Block i is mapped into frame 0]

Mapping TRIPS Blocks to the Microarchitecture

[Figure: the same mapping repeated for the next block - block i+1's header and instruction chunks are mapped into frame 1, occupying frame 1's reservation-station slots, read/write queue entries, and LSIDs]

Mapping Target Identifiers to Reservation Stations

The ISA target identifier has the fields Type (2 bits), Y (2 bits), X (2 bits), and Slot (3 bits); the Frame (3 bits) is assigned by the G-tile at runtime.

Example: Target = 87, OP1 in Block 4. With runtime frame 100 (frame 4), the identifier bits route the operand to E-tile [10,11] on the grid, frame 100, slot 101 (slot 5), operand type 10 (= OP1).

[Figure: the identifier's bit fields shown routed across the tile grid into E-tile [10,11]'s reservation stations]
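The field extraction above can be sketched as a small decoder. The 9-bit layout type(2) | Y(2) | X(2) | slot(3) is read off the slide's figure; the field ordering is therefore an assumption, and the frame number is passed separately since the G-tile assigns it at runtime.

```python
def decode_target(target9, frame):
    """Split a 9-bit ISA target identifier into its fields.

    Assumed layout (from the slide): type(2) | Y(2) | X(2) | slot(3).
    The 3-bit frame is assigned by the G-tile at runtime, so it is
    supplied separately rather than encoded in the instruction.
    """
    op_type = (target9 >> 7) & 0b11   # which operand slot (e.g. 0b10 = OP1)
    y       = (target9 >> 5) & 0b11   # E-tile row
    x       = (target9 >> 3) & 0b11   # E-tile column
    slot    = target9 & 0b111         # reservation-station slot
    return {"type": op_type, "tile": (y, x), "slot": slot,
            "frame": frame & 0b111}

d = decode_target(0b10_10_11_101, frame=0b100)
# matches the slide's example: OP1, E-tile [10,11], slot 5, frame 4
```

Because the consumer's location is explicit in the identifier, no rename table lookup is needed to route a result.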

Block Fetch
- Fetch commands are sent to each Instruction Cache bank
- The fetch pipeline is from 4 to 11 stages deep
- A new block fetch can be initiated every 8 cycles
- Instructions are fetched into Instruction Queue banks (chosen by the compiler)
- The EDGE ISA allows instructions to be fetched out of order

Block Execution
- Instructions execute (out of order) when all of their operands arrive
- Intermediate values are sent from instruction to instruction
- Register reads and writes access the register banks
- Loads and stores access the data cache banks
- Branch results go to the global controller
- Up to 8 blocks can execute simultaneously
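The operand-driven firing rule can be illustrated with a toy dataflow interpreter: an instruction fires once all its operands have arrived, and its result is forwarded directly to its consumers, as in direct target encoding. The encoding of instructions as tuples here is an invention for the sketch, not the TRIPS microarchitecture.

```python
def run_block(insts, inputs):
    """Toy dataflow execution of one block.

    insts:  {id: (op, n_operands, [consumer ids])}
    inputs: {id: [initial operand values]}
    An instruction becomes ready when its operand count is met; firing
    appends its result to each consumer's operand list (no register file).
    """
    operands = {i: list(inputs.get(i, [])) for i in insts}
    done = {}
    ready = [i for i, (_, n, _) in insts.items() if len(operands[i]) == n]
    while ready:
        i = ready.pop()
        op, _, targets = insts[i]
        done[i] = op(*operands[i])
        for t in targets:                    # send result to consumers
            operands[t].append(done[i])
            if len(operands[t]) == insts[t][1]:
                ready.append(t)              # all operands arrived: fire
    return done

# (a+b) * (a+b): the add targets both operand slots of the multiply
res = run_block(
    {"add1": (lambda x, y: x + y, 2, ["mul", "mul"]),
     "mul":  (lambda x, y: x * y, 2, [])},
    inputs={"add1": [3, 4]},
)
assert res["mul"] == 49
```

Note that execution order falls out of operand arrival, mirroring the slide's point that instructions execute out of order without a scheduler consulting program order.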

Block Commit
- Block completion is detected and reported to the global controller
- If no exceptions occurred, the results may be committed
- Writes are committed to register files
- Stores are committed to cache or memory
- Resources are deallocated after a commit acknowledgement

Block Execution Timeline

[Figure: a cycle timeline (roughly cycles 0, 5, 10, 30, 40) showing FETCH, EXECUTE (variable execution time), and COMMIT for blocks Bi through Bi+8 mapped to frames 0-7, with execute/commit overlapped across multiple blocks]

The G-tile manages frames as a circular buffer:
- D-morph: 1 thread, 8 frames
- T-morph: up to 4 threads, 2 frames each
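The circular-buffer frame management can be sketched as below. The class is illustrative only; the D-morph/T-morph partition sizes are taken from the slide, while the method names and interface are invented.

```python
from collections import deque

class FrameManager:
    """Toy model of the G-tile's circular frame buffer (8 frames total).

    D-morph: one thread owns all 8 frames.
    T-morph: up to 4 threads, each owning a private window of 2 frames.
    """
    def __init__(self, threads=1):
        frames_per_thread = {1: 8, 4: 2}[threads]
        self.free = {t: deque(range(t * frames_per_thread,
                                    (t + 1) * frames_per_thread))
                     for t in range(threads)}

    def allocate(self, thread=0):
        """Map the thread's next (speculative) block into its oldest free frame."""
        if not self.free[thread]:
            return None                      # all frames in flight: stall fetch
        return self.free[thread].popleft()

    def commit(self, thread, frame):
        """On commit acknowledgement the frame is recycled (circular reuse)."""
        self.free[thread].append(frame)

fm = FrameManager(threads=1)                 # D-morph: one thread, 8 frames
frames = [fm.allocate() for _ in range(8)]   # blocks Bi..Bi+7 in flight
assert fm.allocate() is None                 # Bi+8 must wait for a commit
fm.commit(0, frames[0])
assert fm.allocate() == frames[0]
```

This mirrors the timeline above: up to eight blocks overlap, and a ninth block's fetch waits for the oldest block to commit.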

NUCA L2 Cache
- The prototype has a 1MB L2 cache divided into sixteen 64KB banks
- 4x10 2D mesh topology
- Links are 128 bits wide
- Each processor can initiate 5 requests per cycle
- Requests and replies are wormhole-routed across the network
- 4 virtual channels prevent deadlocks
- Can sustain over 100 bytes per cycle to the processors
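One way to picture the banked organization is a simple interleaving function. The sixteen 64KB banks come from the slide; the line size and the low-order line-interleaved bank-selection scheme below are assumptions for illustration, since the prototype's actual bank-selection bits are not given here.

```python
BANKS = 16
BANK_BYTES = 64 * 1024          # sixteen 64KB banks = 1MB total
LINE_BYTES = 64                 # assumed cache-line size for this sketch

def bank_of(addr):
    """Select an L2 bank by low-order line interleaving (an assumption)."""
    return (addr // LINE_BYTES) % BANKS

# consecutive cache lines land in consecutive banks, spreading traffic
assert [bank_of(i * LINE_BYTES) for i in range(4)] == [0, 1, 2, 3]
assert bank_of(BANKS * LINE_BYTES) == 0      # wraps around the 16 banks
```

Interleaving at line granularity lets independent requests proceed in parallel banks, which is what allows the mesh to sustain high aggregate bandwidth.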

Compiling for TRIPS

Frontend (C, FORTRAN) -> Inlining / Loop Unrolling / Flattening -> Scalar Optimizations -> Code Generation (Alpha, SPARC, PPC, TRIPS)
("Your standard compiler - you've seen this before")

TRIPS-specific backend: TRIPS Block Formation -> Register Allocation / Splitting for Spill Code -> Peephole / Load-Store ID Assignment -> Store Nullification -> Block Splitting -> Scheduling and Assembly

Fixed Size Constraint: 128 Instructions
- O3: every basic block is a TRIPS block. Simple, but not high performance
- O4: hyperblocks as TRIPS blocks

[Figure: a flowgraph of basic blocks B1-B7 - B1 branching to B2/B3, then B4/B5, joining at B6, then B7 - compiled either as 7 separate TRIPS blocks or merged into 1 TRIPS block]

Size Analysis: How big is this block? 3 instructions? 5 instructions? More?

  read  sp, g1
  movi  t3, 1
  store 384(sp), t3
  store 8(sp), t3

Two encoding constraints change the answer: the maximum immediate is 256, and immediate instructions have only one target. After expansion:

  read  sp, g1
  movi  t3, 1
  mov   t4, t3
  addi  t7, sp, 256
  store 128(t7), t4
  store 8(sp), t4
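The offset legalization shown above can be sketched as a small rewrite pass. The 256 immediate limit comes from the slide; the instruction tuples, the temporary register name, and the pass itself are illustrative, not the TRIPS compiler's actual implementation.

```python
MAX_IMM = 256  # largest immediate a TRIPS instruction can encode (per slide)

def legalize_store(base, offset, src, tmp="t7"):
    """Rewrite `store offset(base), src` when the offset exceeds MAX_IMM.

    An over-large offset is split into addi instructions that build up a
    partial address, plus a final store with the remaining legal offset.
    """
    insts = []
    while offset >= MAX_IMM:
        insts.append(("addi", tmp, base, MAX_IMM))
        base, offset = tmp, offset - MAX_IMM
    insts.append(("store", offset, base, src))
    return insts

assert legalize_store("sp", 8, "t3") == [("store", 8, "sp", "t3")]
assert legalize_store("sp", 384, "t3") == [
    ("addi", "t7", "sp", 256),
    ("store", 128, "t7", "t3"),
]
```

This reproduces the slide's example: the 384-byte offset becomes `addi t7, sp, 256` followed by `store 128(t7)`, so the "3-instruction" block really costs more once encoding limits are counted.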

Too Big? Block Splitting. What if the block is too large?

Predicated blocks: reverse if-convert. Before:

  L0: read  t4, g10
      addi  t3, t4, 16
      load  t5, 0(t3)
      ...
      subi  t7, t5, 8
      mult  t8, t7, t7
      store 0(t4), t8
      store 8(t4), t8
      branch L2

After splitting:

  L0: read  t4, g10
      addi  t3, t4, 16
      load  t5, 0(t3)
      ...
      branch L1
      write g11, t5

  L1: read  t4, g10
      read  t5, g11
      subi  t7, t5, 8
      mult  t8, t7, t7
      store 0(t4), t8
      store 8(t4), t8
      branch L2

Unpredicated (basic blocks): insert a branch and label.

Register Constraints: Linear Scan Allocator
- 128 registers (32 x 4 banks)
- Compute liveness over hyperblocks
- Ignore local variables
- Treat hyperblocks as large instructions

Spill counts on SPEC2000:

  benchmark | Alpha | TRIPS (O4)
  applu     |  247  |  331
  apsi      | 1326  |  196
  gcc       | 4490  | 6622
  mesa      | 2614  | 3821
  mgrid     |  366  |   77
  sixtrack  |  494  |  220

Total spills: 18 (Alpha) vs. 6 (TRIPS); an average spill costs 1 store for 2-3 loads.

Block Termination Constraint
Block termination: constant output per block
- A constant number of stores and writes execute
- One branch
- Simplifies the hardware logic for detecting block completion

All writes complete: write nullification
All stores complete: store nullification, LSID assignment
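The constant-output rule can be sketched as follows: for every store ID named in the block's store mask but not executed on the taken predicate path, a null store is issued so the data tiles still observe a fixed store count. The data structures and function name are invented for the illustration.

```python
def completion_messages(store_mask, executed_stores):
    """Produce one message per LSID in the block's store mask.

    store_mask:      set of LSIDs the block header promises will complete.
    executed_stores: {lsid: value} actually produced on the taken path.
    LSIDs that did not execute are nullified, so the hardware's
    completion detector always sees the same number of stores per block.
    """
    msgs = {}
    for lsid in sorted(store_mask):
        if lsid in executed_stores:
            msgs[lsid] = ("store", executed_stores[lsid])
        else:
            msgs[lsid] = ("null",)           # store nullification
    return msgs

m = completion_messages({0, 1, 2}, {1: 42})
assert m == {0: ("null",), 1: ("store", 42), 2: ("null",)}
```

With a constant count of stores, writes, and exactly one branch, detecting "block done" reduces to counting arrivals rather than tracking control flow.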

TRIPS Scheduling Problem

[Figure: a hyperblock's flowgraph of instruction groups (mixes of ld, add, sub, shl, cmp, sw, br) being placed onto the execution grid, which sits between the register file and the data caches]

- Place instructions on the 4x4x8 grid
- Encode placement in target form

Scheduling Algorithms
Heuristic-based list scheduler [PACT 2005]:
- Greedy, top-down
- Prioritizes the critical path
- Reprioritizes after each placement
- Balances functional unit utilization
- Accounts for data cache locality
- Accounts for register bank locality
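A toy version of such a greedy, top-down list scheduler is sketched below. The priority function (critical-path height) and the placement cost (simple load balancing) are stand-ins for the heuristics listed above; a real scheduler would also weigh operand-network hops and data-cache/register-bank locality.

```python
import heapq

def list_schedule(deps, depth, tiles=16):
    """Greedily place instructions onto `tiles` slots, critical path first.

    deps:  {inst: [predecessor insts]}  - the dataflow graph.
    depth: {inst: critical-path height} - deeper = more critical.
    Returns {inst: tile index} in placement order.
    """
    placed, load = {}, [0] * tiles
    ready = [(-depth[i], i) for i in deps if not deps[i]]
    heapq.heapify(ready)                     # deepest (most critical) first
    while ready:
        _, inst = heapq.heappop(ready)
        tile = load.index(min(load))         # least-loaded tile
        placed[inst], load[tile] = tile, load[tile] + 1
        for succ, preds in deps.items():     # newly-ready successors
            if succ not in placed and all(p in placed for p in preds) \
               and all(succ != r for _, r in ready):
                heapq.heappush(ready, (-depth[succ], succ))
    return placed

sched = list_schedule(
    deps={"ld": [], "add": ["ld"], "cmp": ["add"], "br": ["cmp"]},
    depth={"ld": 4, "add": 3, "cmp": 2, "br": 1},
)
assert list(sched) == ["ld", "add", "cmp", "br"]  # critical-path order
```

The "reprioritize after each placement" behavior corresponds to pushing newly-ready successors onto the heap as each instruction is placed.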

TRIPS Polymorphous: Different Levels of Parallelism
Instruction-level parallelism [Nagarajan et al., Micro'01]
- Populate a large instruction window with useful instructions
- Schedule instructions to optimize communication and concurrency

Thread-level parallelism
- Partition the instruction window among different threads
- Reduce contention for instruction and data supply

Data-level parallelism
- Provide high density of computational elements
- Provide high bandwidth to/from data memory

TRIPS Configurable Resources

Aggregating Reservation Stations: Frames

Extracting ILP: Frames for Speculation

Configuring Frames for TLP

Using Frames for DLP

Configuring Data Memory for DLP

Configuring Data Memory for DLP
- Regular data accesses
- A subset of L2 cache banks is configured as an SRF (stream register file)
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants saved in reservation stations

Performance results
ILP: instruction window occupancy
- Peak: 4x4x128 array -> 2048 instructions
- Sustained: 493 for SPEC INT, 1412 for SPEC FP
- Bottleneck: branch prediction

TLP: instruction and data supply
- Peak: 100% efficiency
- Sustained: 87% for two threads, 61% for four threads

DLP: data supply bandwidth
- Peak: 16 ops/cycle
- Sustained: 6.9 ops/cycle

ASIC Implementation
- 130 nm 7LM IBM ASIC process
- 335 mm^2 die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out fall 2005; system bring-up spring 2006

Functional Area Breakdown


TRIPS Summary
Distributed microarchitecture
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components

Tiled microarchitecture
- Simplifies scalability
- Improves design productivity

Coarse-grained, homogeneous approach with polymorphism
- ILP: well-partitioned, powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: mapping reuse of instructions and constants in the grid

Problems
Scalability
- Larger grid -> more communication latency
- ILP, DLP, TLP
- Multicore / Manycore

Compatibility
- Instruction & block code

Low-efficiency architecture
- Instruction buffer
- I-Cache bank, GDN
- Operand network
- LSQ & read/write queue

Polymorphism
- How to realize DLP?

Scalability: Larger grid

[Figure: the tile grid and its networks (GDN, GSN, GCN, OPN) from the earlier "Grid Processor Tiles and Interfaces" slide, repeated to illustrate scaling to a larger grid]

Scalability: Multicore / Manycore

Compatibility

Low-efficiency architecture: link utilization for SPEC CPU 2000

Polymorphous: Imagine

Page 7: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 7

Key Features EDGE ISA

minus Block-oriented instruction set architectureminus Helps reduce bottlenecks and expose ILP

Tiled Microarchitectureminus Modular designminus No global wires

TRIPS Processorminus Distributed processor designminus Dataflow graph execution engine

NUCA L2 Cacheminus Distributed cache design

23424 USTC CS AN Hong 8

TRIPS Chip

2 TRIPS Processors NUCA L2 Cache

minus 1 MB 16 banks

On-Chip Network (OCN)minus 2D mesh networkminus Replaces on-chip bus

Misc Controllersminus 2 DDR SDRAM controllersminus 2 DMA controllersminus External bus controllerminus C2C network controller

23424 USTC CS AN Hong 9

TRIPS Processor Want an aggressive general-purpose processor

minus Up to 16 instructions per cycleminus Up to 4 loads and stores per cycleminus Up to 64 outstanding L1 data cache missesminus Up to 1024 dynamically executed instructionsminus Up to 4 simultaneous multithreading (SMT) threads

But existing microarchitectures donrsquot scale wellminus Structures become large multi-ported and slowminus Lots of overhead to convert from sequential instruction

semanticsminus Vulnerable to speculation hazards

TRIPS introduces a new microarchitecture and ISA

23424 USTC CS AN Hong 10

EDGE ISA Explicit Data Graph Execution

(EDGE) Block-Oriented

minus Atomically fetch execute and commit whole blocks of instructions

minus Programs are partitioned into blocksminus Each block holds dozens of instructionsminus Sequential execution semantics at the

block levelminus Dataflow execution semantics inside

each block

Direct Target Encodingminus Encode instructions so that results go

directly to the instruction(s) that will consume them

minus No need to go through centralized register file and rename logic

23424 USTC CS AN Hong 11

Block Formation Basic blocks are often too small

(just a few insts) Predication allows larger

hyperblocks to be created Loop unrolling and function

inlining also help TRIPS blocks can hold up to 128

instructions Large blocks improve fetch

bandwidth and expose ILP Hard-to-predict branches can

sometimes be hidden inside a hyperblock

23424 USTC CS AN Hong 12

TRIPS Block Format Each block is formed from two to

five 128-byte programldquochunksrdquo Blocks with fewer than five

chunks are expanded to five chunks in the L1 I-cache

The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions

Each instruction chunk holds 32 4-byte instructions (including NOPs)

A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions

HeaderChunk

InstructionChunk 0

PC

128 Bytes

128 Bytes

128 Bytes

128 Bytes

128 Bytes

InstructionChunk 1

InstructionChunk 2

InstructionChunk 3

23424 USTC CS AN Hong 13

Processor Tiles Partition all major structures into

banks distribute and interconnect Execution Tile (E)

minus 64-entry Instruction Queue bankminus Single-issue execute pipeline

Register Tile (R)minus 32-entry Register bank (per thread)

Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks

Instruction Tile (I)minus 16KB Instruction Cache bank

Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic

23424 USTC CS AN Hong 14

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

Grid Processor Tiles and Interfaces

23424 USTC CS AN Hong 15

Mapping TRIPS Blocks to the Microarchitecture

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

23424 USTC CS AN Hong 16

Mapping TRIPS Blocks to the Microarchitecture

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

HeaderChunk

InstChunk 0

InstChunk 3

Block i mapped into Frame 0

23424 USTC CS AN Hong 17

Mapping TRIPS Blocks to the Microarchitecture

HeaderChunk

InstChunk 0

InstChunk 3

Block i+1 mapped into Frame 1

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

ot

E-tile[33]

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

23424 USTC CS AN Hong 18

Mapping Target Identifiers to Reservation Stations

I OP1 OP2Block 4

7

0

Slot

100 10 10110 11

Target = 87 OP1

Frame 4

Type(2 bits)

Y(2 bits)

X(2 bits)

Slot(3 bits)

Frame(3 bits)

ISA Target IdentifierFrame

(assigned by GTat runtime)

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

[1011]

10 11100 10 101

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7

Slot

E-tile

Frame 100Slot 101OP 10 = OP1

23424 USTC CS AN Hong 19

Block Fetch Fetch commands sent to

each Instruction Cache bank

The fetch pipeline is from 4 to 11 stages deep

A new block fetch can be initiated every 8 cycles

Instructions are fetched into Instruction Queue banks (chosen by the compiler)

EDGE ISA allows instructions to be fetched out-of-order

23424 USTC CS AN Hong 20

Block Execution Instructions execute (out-oforder)

when all of their operands arrive Intermediate values are sent

from instruction to instruction Register reads and writes

access the register banks Loads and stores access the

data cache banks Branch results go to the

global controller Up to 8 blocks can execute

simultaneously

23424 USTC CS AN Hong 21

Block Commit 1048577 Block completion is detected

and reported to the global controller

1048577 If no exceptions occurred the results may be committed

1048577 Writes are committed to Register files

1048577 Stores are committed to cache or memory

1048577 Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline

[Figure: fetch (around cycles 0–5), variable-length execute, and commit (around cycles 30–40) overlapped across blocks Bi through Bi+8 occupying Frames 0–7.]

Execute and commit are overlapped across multiple blocks. The G-tile manages frames as a circular buffer:
− D-morph: 1 thread, 8 frames
− T-morph: up to 4 threads, 2 frames each

23424 USTC CS AN Hong 23
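The circular-buffer frame management described above can be modeled as a small sketch. The class and method names are hypothetical; the "8 frames, free oldest on commit" behavior follows the slide.

```python
from collections import deque

class FrameManager:
    """Hypothetical sketch of the G-tile managing frames as a circular
    buffer: a new block is fetched into the next free frame, and frames
    are freed in order as blocks commit (the oldest block commits first)."""

    def __init__(self, num_frames: int = 8):
        self.free = deque(range(num_frames))  # frames available for fetch
        self.in_flight = deque()              # (frame, block) in fetch order

    def fetch(self, block: str):
        if not self.free:
            return None  # all frames occupied: block fetch stalls
        frame = self.free.popleft()
        self.in_flight.append((frame, block))
        return frame

    def commit_oldest(self):
        frame, block = self.in_flight.popleft()
        self.free.append(frame)  # deallocated after commit acknowledgement
        return block

mgr = FrameManager(8)                         # D-morph: 1 thread, 8 frames
print([mgr.fetch(f"B{i}") for i in range(9)]) # ninth fetch stalls: ends in None
print(mgr.commit_oldest())                    # B0 commits, freeing frame 0
print(mgr.fetch("B8"))                        # B8 now occupies frame 0
```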

NUCA L2 Cache

− The prototype has a 1MB L2 cache divided into sixteen 64KB banks.
− 4x10 2D mesh topology; links are 128 bits wide.
− Each processor can initiate 5 requests per cycle.
− Requests and replies are wormhole-routed across the network; 4 virtual channels prevent deadlocks.
− The network can sustain over 100 bytes per cycle to the processors.

23424 USTC CS AN Hong 24
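With 1MB spread over sixteen 64KB banks, a line address maps to a bank by interleaving. The line size and bit slicing below are assumptions for illustration, not the prototype's actual mapping policy.

```python
LINE_BYTES = 64   # assumed cache-line size (not stated on the slide)
NUM_BANKS = 16    # sixteen 64KB banks = 1MB total, per the slide

def bank_of(addr: int) -> int:
    """Low-order line interleaving: the 4 bits above the line offset
    pick one of the 16 banks (an illustrative policy only)."""
    return (addr // LINE_BYTES) % NUM_BANKS

# Consecutive lines spread across banks, so streaming accesses can
# draw on the mesh's aggregate bandwidth.
print([bank_of(a) for a in range(0, 5 * LINE_BYTES, LINE_BYTES)])  # [0, 1, 2, 3, 4]
```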

Compiling for TRIPS

Frontend (C, FORTRAN): inlining, loop unrolling, flattening, then scalar optimizations — your standard compiler, you've seen this before.

Code generation (Alpha, SPARC, PPC, TRIPS backends): TRIPS block formation; register allocation with splitting for spill code; peephole, load/store ID assignment, and store nullification; block splitting; scheduling and assembly.

Fixed Size Constraint: 128 Instructions

• O3: every basic block is a TRIPS block. Simple, but not high performance.
• O4: hyperblocks as TRIPS blocks.

[Figure: control-flow graph B1–B7 mapped either as 7 TRIPS blocks (one per basic block) or as 1 TRIPS block (a single hyperblock).]

Size Analysis: How Big Is This Block?

3 instructions? 5 instructions? More?

read sp, g1
movi t3, 1
store 384(sp), t3
store 8(sp), t3

The max immediate is 256, and immediate instructions have one target, so the code expands:

read sp, g1
movi t3, 1
mov t4, t3
addi t7, sp, 256
store 128(t7), t4
store 8(sp), t4

23424 USTC CS AN Hong 27
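The expansion above (a store offset of 384 exceeding the immediate range) can be sketched as a generic legalization step. The 256 limit and mnemonics follow the slide; the pass itself, including the scratch-register name, is illustrative.

```python
MAX_IMM = 256  # maximum immediate value, per the slide

def legalize_store(base: str, offset: int, src: str, tmp: str = "t7"):
    """Split a store whose offset exceeds the immediate range into an
    addi plus a store with an in-range offset (illustrative sketch;
    handles offsets below 2 * MAX_IMM)."""
    if offset < MAX_IMM:
        return [f"store {offset}({base}), {src}"]
    return [f"addi {tmp}, {base}, {MAX_IMM}",
            f"store {offset - MAX_IMM}({tmp}), {src}"]

print(legalize_store("sp", 384, "t4"))
# ['addi t7, sp, 256', 'store 128(t7), t4']  -- matches the slide's expansion
print(legalize_store("sp", 8, "t4"))
# ['store 8(sp), t4']
```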

Too Big? Block Splitting

What if the block is too large? Predicated blocks: reverse if-convert. Unpredicated (basic blocks): insert a branch and label.

L0: read t4, g10
    addi t3, t4, 16
    load t5, 0(t3)
    …
    subi t7, t5, 8
    mult t8, t7, t7
    store 0(t4), t8
    store 8(t4), t8
    branch L2

becomes

L0: read t4, g10
    addi t3, t4, 16
    load t5, 0(t3)
    …
    write g11, t5
    branch L1

L1: read t4, g10
    read t5, g11
    subi t7, t5, 8
    mult t8, t7, t7
    store 0(t4), t8
    store 8(t4), t8
    branch L2

Register Constraints: Linear Scan Allocator

• 128 registers (32 x 4 banks)
• Compute liveness over hyperblocks
• Ignore local variables
• Hyperblocks as large instructions

Spill counts, SPEC2000:

SPEC2000    Alpha   TRIPS (O4)
applu         247      331
apsi         1326      196
gcc          4490     6622
mesa         2614     3821
mgrid         366       77
sixtrack      494      220

Total spills: 18 Alpha vs. 6 TRIPS
Average spill: 1 store for 2-3 loads

23424 USTC CS AN Hong 29
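A minimal linear-scan allocation over live intervals conveys the idea behind the allocator named above. The interval inputs, the two-register limit, and the spill-on-exhaustion rule are invented for the example (real linear scan spills the interval with the furthest end point); nothing here is a TRIPS parameter.

```python
def linear_scan(intervals, num_regs):
    """intervals: list of (name, start, end); returns name -> register
    index or 'SPILL'. A deliberately simplified linear-scan sketch."""
    intervals = sorted(intervals, key=lambda iv: iv[1])  # by start point
    active = []                    # (end, name, reg), sorted by end point
    free = list(range(num_regs))
    alloc = {}
    for name, start, end in intervals:
        # expire intervals that ended at or before this one's start
        while active and active[0][0] <= start:
            _, _, reg = active.pop(0)
            free.append(reg)
        if free:
            reg = free.pop(0)
            alloc[name] = reg
            active.append((end, name, reg))
            active.sort()
        else:
            alloc[name] = "SPILL"  # no free register: spill (simplified)
    return alloc

ivs = [("a", 0, 4), ("b", 1, 3), ("c", 2, 5)]
print(linear_scan(ivs, 2))  # with 2 registers, 'c' spills
```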

Block Termination Constraint

Block termination: constant output per block.
− A constant number of stores and writes execute, and exactly one branch.
− This simplifies the hardware logic for detecting block completion.

All writes complete: write nullification.
All stores complete: store nullification and LSID assignment.

TRIPS Scheduling Problem

[Figure: a hyperblock's dataflow graph (ld, shl, add, sub, cmp, sw, br instructions) scheduled onto the grid between the register file and the data caches.]

• Place instructions on the 4x4x8 grid
• Encode placement in target form

Scheduling Algorithms

Heuristic-based list scheduler [PACT 2005]:
− Greedy, top-down
− Prioritizes the critical path, reprioritizing after each placement
− Balances functional unit utilization
− Accounts for data cache locality and register bank locality

23424 USTC CS AN Hong 32
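A toy greedy top-down list scheduler shows the flavor of the heuristics above: an instruction becomes ready once its dataflow predecessors are placed, and the highest-priority (most critical) ready instruction is placed next. The dependence graph, priorities, and grid here are illustrative, not the PACT 2005 algorithm itself.

```python
def list_schedule(deps, prio, slots):
    """deps: instr -> set of predecessor instrs; prio: instr -> criticality;
    slots: ordered placement slots. Greedy top-down list scheduling:
    repeatedly place the highest-priority ready instruction."""
    placed, schedule = set(), {}
    slots = list(slots)
    while len(placed) < len(deps):
        ready = [i for i in deps if i not in placed and deps[i] <= placed]
        inst = max(ready, key=lambda i: prio[i])  # critical path first
        schedule[inst] = slots[len(placed)]
        placed.add(inst)
    return schedule

# ld feeds add feeds sw; mov is independent and less critical.
deps = {"ld": set(), "add": {"ld"}, "sw": {"add"}, "mov": set()}
prio = {"ld": 3, "add": 2, "sw": 1, "mov": 0}
grid = [(x, y) for x in range(4) for y in range(4)]  # 4x4 E-tile slots
print(list_schedule(deps, prio, grid))
# {'ld': (0, 0), 'add': (0, 1), 'sw': (0, 2), 'mov': (0, 3)}
```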

TRIPS Polymorphous: Different Levels of Parallelism

Instruction-level parallelism [Nagarajan et al., MICRO '01]
− Populate a large instruction window with useful instructions
− Schedule instructions to optimize communication and concurrency

Thread-level parallelism
− Partition the instruction window among different threads
− Reduce contention for instruction and data supply

Data-level parallelism
− Provide high density of computational elements
− Provide high bandwidth to/from data memory

TRIPS Configurable Resources

Aggregating Reservation Stations: Frames

Extracting ILP: Frames for Speculation

Configuring Frames for TLP

Using Frames for DLP

Configuring Data Memory for DLP

− Regular data accesses
− A subset of L2 cache banks is configured as an SRF (streaming register file)
− High-bandwidth data channels to the SRF
− Reduced address communication
− Constants saved in reservation stations

Performance Results

ILP: instruction window occupancy
− Peak: 4x4x128 array → 2048 instructions
− Sustained: 4.93 for SPEC INT, 14.12 for SPEC FP
− Bottleneck: branch prediction

TLP: instruction and data supply
− Peak: 100% efficiency
− Sustained: 87% for two threads, 61% for four threads

DLP: data supply bandwidth
− Peak: 16 ops/cycle
− Sustained: 6.9 ops/cycle

ASIC Implementation

− 130 nm 7LM IBM ASIC process
− 335 mm² die
− 47.5 mm x 47.5 mm package
− ~170 million transistors
− ~600 signal I/Os
− ~500 MHz clock frequency
− Tape-out: fall 2005; system bring-up: spring 2006

Functional Area Breakdown

TRIPS Summary

Distributed microarchitecture
− Acknowledges and tolerates wire delay
− Scalable protocols tailored for distributed components

Tiled microarchitecture
− Simplifies scalability
− Improves design productivity

Coarse-grained homogeneous approach with polymorphism
− ILP: well-partitioned, powerful uniprocessor (GPA)
− TLP: divide the instruction window among different threads
− DLP: mapping reuse of instructions and constants in the grid

Problems

Scalability
− Larger grid → more communication latency
− ILP vs. DLP/TLP
− Multicore / manycore

Compatibility
− Instruction & block code

Low-efficiency architecture
− Instruction buffer
− I-cache bank, GDN
− Operand network
− LSQ & read/write queues

Polymorphism
− How to realize DLP?

Scalability: Larger Grid

[Figure: the tiled grid (G, R, I, D, E tiles) with its interconnection networks — GDN (global dispatch network), GSN (global status network), GCN (global control network), and OPN (operand network).]

Scalability: Multicore / Manycore

Compatibility

Low-efficiency architecture: link utilization for SPEC CPU 2000

Polymorphism: Imagine

Page 8: Lecture on  High Performance Processor Architecture ( CS05162 )

TRIPS Chip

2 TRIPS processors; NUCA L2 cache
− 1 MB, 16 banks

On-Chip Network (OCN)
− 2D mesh network
− Replaces the on-chip bus

Misc. controllers
− 2 DDR SDRAM controllers
− 2 DMA controllers
− External bus controller
− C2C network controller

TRIPS Processor

We want an aggressive general-purpose processor:
− Up to 16 instructions per cycle
− Up to 4 loads and stores per cycle
− Up to 64 outstanding L1 data cache misses
− Up to 1024 dynamically executed instructions
− Up to 4 simultaneous multithreading (SMT) threads

But existing microarchitectures don't scale well:
− Structures become large, multi-ported, and slow
− Lots of overhead to convert from sequential instruction semantics
− Vulnerable to speculation hazards

TRIPS introduces a new microarchitecture and ISA.

EDGE ISA

Explicit Data Graph Execution (EDGE) is block-oriented:
− Atomically fetch, execute, and commit whole blocks of instructions
− Programs are partitioned into blocks; each block holds dozens of instructions
− Sequential execution semantics at the block level
− Dataflow execution semantics inside each block

Direct target encoding:
− Instructions are encoded so that results go directly to the instruction(s) that will consume them
− No need to go through a centralized register file and rename logic

Block Formation

− Basic blocks are often too small (just a few instructions).
− Predication allows larger hyperblocks to be created; loop unrolling and function inlining also help.
− TRIPS blocks can hold up to 128 instructions.
− Large blocks improve fetch bandwidth and expose ILP.
− Hard-to-predict branches can sometimes be hidden inside a hyperblock.

TRIPS Block Format

− Each block is formed from two to five 128-byte program "chunks".
− Blocks with fewer than five chunks are expanded to five chunks in the L1 I-cache.
− The header chunk includes a block header (execution flags plus a store mask) and register read/write instructions.
− Each instruction chunk holds 32 4-byte instructions (including NOPs).
− A maximally sized block contains 128 regular instructions, 32 read instructions, and 32 write instructions.

[Figure: the PC points at the header chunk, followed by instruction chunks 0–3, each 128 bytes.]

23424 USTC CS AN Hong 13
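The chunk arithmetic above is easy to check in code; all sizes are taken from the slide, and the helper name is just for illustration.

```python
CHUNK_BYTES = 128
INSTR_BYTES = 4
INSTRS_PER_CHUNK = CHUNK_BYTES // INSTR_BYTES  # 32 instructions per chunk

def block_size_in_icache(inst_chunks: int) -> int:
    """Bytes a block occupies in the L1 I-cache: blocks are padded to
    the maximum of 1 header chunk + 4 instruction chunks."""
    assert 1 <= inst_chunks <= 4
    return 5 * CHUNK_BYTES  # expanded to five chunks regardless

print(INSTRS_PER_CHUNK)          # 32 instructions per chunk
print(4 * INSTRS_PER_CHUNK)      # 128 regular instructions maximum
print(block_size_in_icache(2))   # 640 bytes even for a small block
```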

Processor Tiles

Partition all major structures into banks; distribute and interconnect them.
− Execution Tile (E): 64-entry instruction queue bank; single-issue execute pipeline
− Register Tile (R): 32-entry register bank (per thread)
− Data Tile (D): 8KB data cache bank; LSQ and MHU banks
− Instruction Tile (I): 16KB instruction cache bank
− Global Control Tile (G): tracks up to 8 blocks of instructions; branch prediction & resolution logic

Grid Processor Tiles and Interfaces

[Figure: the tile grid (G, R, I, D, E tiles) connected by the GDN (global dispatch network), GSN (global status network), GCN (global control network), and OPN (operand network).]

Mapping TRIPS Blocks to the Microarchitecture

[Figure: instruction reservation stations at E-tile[3,3] organized as frames 0–7 with slots 0–7; load/store IDs 0–31 at D-tile[3]; architecture register files and read/write queues (frames 0–7, threads 0–3) at R-tile[3]. Block i (header chunk, instruction chunks 0–3) is mapped into Frame 0.]

Page 9: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 9

TRIPS Processor Want an aggressive general-purpose processor

minus Up to 16 instructions per cycleminus Up to 4 loads and stores per cycleminus Up to 64 outstanding L1 data cache missesminus Up to 1024 dynamically executed instructionsminus Up to 4 simultaneous multithreading (SMT) threads

But existing microarchitectures donrsquot scale wellminus Structures become large multi-ported and slowminus Lots of overhead to convert from sequential instruction

semanticsminus Vulnerable to speculation hazards

TRIPS introduces a new microarchitecture and ISA

23424 USTC CS AN Hong 10

EDGE ISA Explicit Data Graph Execution

(EDGE) Block-Oriented

minus Atomically fetch execute and commit whole blocks of instructions

minus Programs are partitioned into blocksminus Each block holds dozens of instructionsminus Sequential execution semantics at the

block levelminus Dataflow execution semantics inside

each block

Direct Target Encodingminus Encode instructions so that results go

directly to the instruction(s) that will consume them

minus No need to go through centralized register file and rename logic

23424 USTC CS AN Hong 11

Block Formation Basic blocks are often too small

(just a few insts) Predication allows larger

hyperblocks to be created Loop unrolling and function

inlining also help TRIPS blocks can hold up to 128

instructions Large blocks improve fetch

bandwidth and expose ILP Hard-to-predict branches can

sometimes be hidden inside a hyperblock

23424 USTC CS AN Hong 12

TRIPS Block Format Each block is formed from two to

five 128-byte programldquochunksrdquo Blocks with fewer than five

chunks are expanded to five chunks in the L1 I-cache

The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions

Each instruction chunk holds 32 4-byte instructions (including NOPs)

A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions

HeaderChunk

InstructionChunk 0

PC

128 Bytes

128 Bytes

128 Bytes

128 Bytes

128 Bytes

InstructionChunk 1

InstructionChunk 2

InstructionChunk 3

23424 USTC CS AN Hong 13

Processor Tiles Partition all major structures into

banks distribute and interconnect Execution Tile (E)

minus 64-entry Instruction Queue bankminus Single-issue execute pipeline

Register Tile (R)minus 32-entry Register bank (per thread)

Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks

Instruction Tile (I)minus 16KB Instruction Cache bank

Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic

23424 USTC CS AN Hong 14

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

Grid Processor Tiles and Interfaces

23424 USTC CS AN Hong 15

Mapping TRIPS Blocks to the Microarchitecture

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

23424 USTC CS AN Hong 16

Mapping TRIPS Blocks to the Microarchitecture

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

HeaderChunk

InstChunk 0

InstChunk 3

Block i mapped into Frame 0

23424 USTC CS AN Hong 17

Mapping TRIPS Blocks to the Microarchitecture

HeaderChunk

InstChunk 0

InstChunk 3

Block i+1 mapped into Frame 1

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

ot

E-tile[33]

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

23424 USTC CS AN Hong 18

Mapping Target Identifiers to Reservation Stations

I OP1 OP2Block 4

7

0

Slot

100 10 10110 11

Target = 87 OP1

Frame 4

Type(2 bits)

Y(2 bits)

X(2 bits)

Slot(3 bits)

Frame(3 bits)

ISA Target IdentifierFrame

(assigned by GTat runtime)

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

[1011]

10 11100 10 101

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7

Slot

E-tile

Frame 100Slot 101OP 10 = OP1

23424 USTC CS AN Hong 19

Block Fetch Fetch commands sent to

each Instruction Cache bank

The fetch pipeline is from 4 to 11 stages deep

A new block fetch can be initiated every 8 cycles

Instructions are fetched into Instruction Queue banks (chosen by the compiler)

EDGE ISA allows instructions to be fetched out-of-order

23424 USTC CS AN Hong 20

Block Execution Instructions execute (out-oforder)

when all of their operands arrive Intermediate values are sent

from instruction to instruction Register reads and writes

access the register banks Loads and stores access the

data cache banks Branch results go to the

global controller Up to 8 blocks can execute

simultaneously

23424 USTC CS AN Hong 21

Block Commit 1048577 Block completion is detected

and reported to the global controller

1048577 If no exceptions occurred the results may be committed

1048577 Writes are committed to Register files

1048577 Stores are committed to cache or memory

1048577 Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline

COMMITFETCH EXECUTE

5 10 30 400Frame 2 Bi

(variable execution time)

Time (cycles)

Frame 4

Frame 5

Frame 6

Frame 7

Frame 0

Frame 1

Bi+2

Bi+3

Bi+4

Bi+5

Bi+6

Bi+7

Frame 3 Bi+1

Executecommit overlapped across multiple blocks

Bi+8

G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each

23424 USTC CS AN Hong 23

NUCA L2 Cache 1048577 Prototype has 1MB L2

cache divided into sixteen 64KB banks

1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate

5 requests per cycle 1048577 Requests and replies are

wormhole-routed across the network

1048577 4 virtual channels prevent deadlocks

1048577 Can sustain over 100 bytes per cycle to the processors

23424 USTC CS AN Hong 24

Compiling for TRIPS

C

InliningLoop UnrollingFlattening

Scalar Optimizations

Your standard compileryoursquove seen this before

Frontend

FORTRAN

Code Generation

Alpha SPARC PPC TRIPS

TRIPS Block Formation

Register AllocationSplitting for Spill Code

PeepholeLoadStore ID Assignment

Store Nullification

Block Splitting

Scheduling and Assembly

23424 USTC CS AN Hong 25

Fixed Size Constraint 128 Instructions

bull O3 every basic block is a TRIPS block Simple but not high performance

bull O4 hyperblocks as TRIPS blocks

B1

B3B2

B4 B5

B6

B7

7 TRIPS Blocks

1 TRIPS Block

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment

23424 USTC CS AN Hong 30

ldshladdswbr

TRIPS Scheduling Problem

addaddldcmpbr

subshlldcmpbr

ldaddaddswbr

swswaddcmpbr

ld

Register File

Data C

aches

Hyperblock

addadd

Flowgraph

bull Place instructions on 4x4x8 gridbull Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]

minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality

23424 USTC CS AN Hong 32

TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]

minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency

Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply

Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations Frames

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation
- 130 nm 7LM IBM ASIC process
- 335 mm² die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out fall 2005
- System bring-up spring 2006

Functional Area Breakdown


TRIPS Summary

Distributed microarchitecture
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components

Tiled microarchitecture
- Simplifies scalability
- Improves design productivity

Coarse-grained, homogeneous approach with polymorphism
- ILP: a well-partitioned, powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: mapping reuse of instructions and constants in the grid

Problems

Scalable?
- Larger grid means more communication latency
- ILP, DLP, TLP
- Multicore / manycore

Compatibility
- Instruction & block code

Low-efficiency architecture
- Instruction buffer
- I-cache bank, GDN
- Operand network
- LSQ & read/write queues

Polymorphous
- How to realize DLP?

Scalable: Larger Grid

[Figure: the tile grid and its interconnection networks]

G R R R R I
D E E E E I
D E E E E I
D E E E E I
D E E E E I

- GDN: global dispatch network
- GSN: global status network
- GCN: global control network
- OPN: operand network

Scalable: Multicore / Manycore

Compatibility

Low-efficiency architecture
[Figure: link utilization for SPEC CPU 2000]

Polymorphous
[Figure: the Imagine stream processor]

Page 10: Lecture on High Performance Processor Architecture (CS05162)


EDGE ISA: Explicit Data Graph Execution (EDGE)

Block-oriented
- Atomically fetch, execute, and commit whole blocks of instructions
- Programs are partitioned into blocks
- Each block holds dozens of instructions
- Sequential execution semantics at the block level
- Dataflow execution semantics inside each block

Direct target encoding
- Encode instructions so that results go directly to the instruction(s) that will consume them
- No need to go through a centralized register file and rename logic
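The contrast above, direct producer-to-consumer forwarding instead of a shared register file, can be sketched with a toy dataflow interpreter. Everything here is illustrative: the instruction tuple format, the op1/op2 slot names, and the 'write' pseudo-target are assumptions, not the TRIPS encoding.

```python
# Direct target encoding sketch: each instruction names the operand
# slots its result feeds, so values flow producer -> consumer with no
# centralized register file or rename logic inside the block.

def execute_block(instrs, inputs):
    """instrs: id -> (op, targets); each target is (dest_id, slot)."""
    slots = {i: {} for i in instrs}           # operand buffers per instruction
    for (dest, slot), val in inputs.items():  # block inputs from register reads
        slots[dest][slot] = val
    ops = {'add': lambda a, b: a + b, 'mul': lambda a, b: a * b}
    outputs, fired = {}, set()
    progress = True
    while progress:
        progress = False
        for i, (op, targets) in instrs.items():
            if i in fired or len(slots[i]) < 2:
                continue                       # dataflow firing: wait for both operands
            val = ops[op](slots[i]['op1'], slots[i]['op2'])
            fired.add(i)
            progress = True
            for dest, slot in targets:
                if dest == 'write':            # register write leaves the block
                    outputs[slot] = val
                else:
                    slots[dest][slot] = val    # forward directly to the consumer
    return outputs
```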

Block Formation
- Basic blocks are often too small (just a few instructions)
- Predication allows larger hyperblocks to be created
- Loop unrolling and function inlining also help
- TRIPS blocks can hold up to 128 instructions
- Large blocks improve fetch bandwidth and expose ILP
- Hard-to-predict branches can sometimes be hidden inside a hyperblock

TRIPS Block Format
- Each block is formed from two to five 128-byte program "chunks"
- Blocks with fewer than five chunks are expanded to five chunks in the L1 I-cache
- The header chunk includes a block header (execution flags plus a store mask) and register read/write instructions
- Each instruction chunk holds 32 4-byte instructions (including NOPs)
- A maximally sized block contains 128 regular instructions, 32 read instructions, and 32 write instructions

[Figure: the PC selects a header chunk followed by instruction chunks 0-3, each 128 bytes]
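A quick arithmetic sketch of the chunk layout just described. The function and field names are invented for illustration; the constants (128-byte chunks, 32 four-byte instructions per chunk, expansion to five chunks in the I-cache) come from the slide.

```python
# Back-of-the-envelope block layout: 1 header chunk + up to 4
# instruction chunks, NOP-padded, expanded to five chunks in L1 I-cache.

CHUNK_BYTES = 128
INSTS_PER_CHUNK = 32          # 32 x 4-byte instructions per chunk

def block_layout(n_insts):
    assert 0 < n_insts <= 128, "a TRIPS block holds at most 128 instructions"
    inst_chunks = -(-n_insts // INSTS_PER_CHUNK)   # ceiling division
    nops = inst_chunks * INSTS_PER_CHUNK - n_insts
    return {
        'chunks_in_memory': 1 + inst_chunks,       # header + instruction chunks
        'chunks_in_icache': 5,                     # always expanded to five
        'bytes_in_icache': 5 * CHUNK_BYTES,
        'nop_padding': nops,
    }
```

For example, a 33-instruction block needs two instruction chunks (31 NOPs of padding) plus the header, yet still occupies the full five chunks in the I-cache.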

Processor Tiles

Partition all major structures into banks, distribute, and interconnect
- Execution tile (E): 64-entry instruction queue bank; single-issue execute pipeline
- Register tile (R): 32-entry register bank (per thread)
- Data tile (D): 8KB data cache bank; LSQ and MHU banks
- Instruction tile (I): 16KB instruction cache bank
- Global control tile (G): tracks up to 8 blocks of instructions; branch prediction & resolution logic

Grid Processor Tiles and Interfaces

[Figure: the tile grid and its interconnection networks]

G R R R R I
D E E E E I
D E E E E I
D E E E E I
D E E E E I

- GDN: global dispatch network
- GSN: global status network
- GCN: global control network
- OPN: operand network

Mapping TRIPS Blocks to the Microarchitecture

[Figure: the tile grid with per-tile structures]
- D-tile[3]: load/store ID (LSID) queue, 32 entries, per frame 0-7
- R-tile[3]: architecture register files and read/write queues, per frame 0-7 and thread 0-3
- E-tile[3,3]: instruction reservation stations, per frame 0-7 and slot 0-7; each entry holds an instruction and operands OP1/OP2

Mapping TRIPS Blocks to the Microarchitecture

[Figure: block i (header chunk, instruction chunks 0-3) mapped into frame 0 across the D-tile, R-tile, and E-tile structures]

Mapping TRIPS Blocks to the Microarchitecture

[Figure: block i+1 (header chunk, instruction chunks 0-3) mapped into frame 1]

Mapping Target Identifiers to Reservation Stations

ISA target identifier fields: Type (2 bits), Y (2 bits), X (2 bits), Slot (3 bits); the frame (3 bits) is assigned by the G-tile at runtime.

Example: Target = 87, OP1, mapped into frame 4
[Figure: the identifier bits (100 10 101 10 11) routed through the grid to E-tile [10,11], selecting frame 0b100 = 4, slot 0b101 = 5, operand 0b10 = OP1]
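The bit widths above can be illustrated with a small pack/unpack sketch. Only the field widths come from the slide; the field ordering and the way the runtime frame number is prepended are assumptions made for illustration.

```python
# Bit-field sketch of an ISA target identifier: Type (2 bits),
# Y (2 bits), X (2 bits), Slot (3 bits), with a 3-bit frame number
# appended by the global tile when the block is mapped.

FIELDS = [('type', 2), ('y', 2), ('x', 2), ('slot', 3)]  # ISA-visible part

def pack(values):
    word = 0
    for name, width in FIELDS:
        assert 0 <= values[name] < (1 << width)
        word = (word << width) | values[name]
    return word

def unpack(word):
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values

def runtime_target(isa_word, frame):
    """G-tile supplies the 3-bit frame when the block is mapped."""
    assert 0 <= frame < 8
    return (frame << 9) | isa_word
```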

Block Fetch
- Fetch commands are sent to each instruction cache bank
- The fetch pipeline is from 4 to 11 stages deep
- A new block fetch can be initiated every 8 cycles
- Instructions are fetched into instruction queue banks (chosen by the compiler)
- The EDGE ISA allows instructions to be fetched out of order

Block Execution
- Instructions execute (out of order) when all of their operands arrive
- Intermediate values are sent from instruction to instruction
- Register reads and writes access the register banks
- Loads and stores access the data cache banks
- Branch results go to the global controller
- Up to 8 blocks can execute simultaneously

Block Commit
- Block completion is detected and reported to the global controller
- If no exceptions occurred, the results may be committed
- Writes are committed to register files
- Stores are committed to cache or memory
- Resources are deallocated after a commit acknowledgement

Block Execution Timeline

[Figure: timeline in cycles (0, 5, 10, 30, 40) showing blocks Bi through Bi+8 occupying frames 0-7; each block is FETCHed, EXECUTEs for a variable time, and COMMITs, with execute and commit overlapped across multiple blocks]

G-tile manages frames as a circular buffer
- D-morph: 1 thread, 8 frames
- T-morph: up to 4 threads, 2 frames each
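The circular-buffer frame management above can be sketched as follows. The class and method names are invented, and the even split of frames among threads is an assumption consistent with the D-morph/T-morph numbers on the slide.

```python
# Frame management sketch: 8 frames as a circular buffer of in-flight
# blocks, either one thread with all 8 frames (D-morph) or up to 4
# threads with 2 frames each (T-morph).

from collections import deque

class FrameManager:
    def __init__(self, n_frames=8, n_threads=1):
        assert n_frames % n_threads == 0
        per = n_frames // n_threads
        self.free = {}
        base = 0
        for t in range(n_threads):
            self.free[t] = deque(range(base, base + per))
            base += per

    def map_block(self, thread=0):
        """Allocate the next frame for a fetched block, or None if full."""
        return self.free[thread].popleft() if self.free[thread] else None

    def commit_block(self, thread, frame):
        """Commit deallocates the frame back to the tail of the buffer."""
        self.free[thread].append(frame)
```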

NUCA L2 Cache
- Prototype has a 1MB L2 cache divided into sixteen 64KB banks
- 4x10 2D mesh topology
- Links are 128 bits wide
- Each processor can initiate 5 requests per cycle
- Requests and replies are wormhole-routed across the network
- 4 virtual channels prevent deadlocks
- Can sustain over 100 bytes per cycle to the processors
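A toy sketch of how line addresses might interleave across such a banked cache. The 64-byte line size and the low-order-bit bank selection are assumptions made for illustration; only the bank count comes from the slide.

```python
# Bank interleaving sketch for a 16-bank NUCA L2: consecutive cache
# lines land in consecutive banks, spreading request bandwidth.

LINE_BYTES = 64   # assumed line size
N_BANKS = 16      # sixteen 64KB banks, per the slide

def bank_of(addr):
    """Bank index taken from the low-order bits of the line address."""
    return (addr // LINE_BYTES) % N_BANKS
```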

Compiling for TRIPS

Your standard compiler (you've seen this before):
Frontend (C, FORTRAN) -> Inlining / Loop Unrolling / Flattening -> Scalar Optimizations -> Code Generation (Alpha, SPARC, PPC, TRIPS)

TRIPS-specific backend:
TRIPS Block Formation -> Register Allocation / Splitting for Spill Code -> Peephole / Load-Store ID Assignment / Store Nullification -> Block Splitting -> Scheduling and Assembly

Fixed Size Constraint: 128 Instructions

• O3: every basic block is a TRIPS block; simple, but not high performance
• O4: hyperblocks as TRIPS blocks

[Figure: a control-flow graph B1-B7 compiled as 7 TRIPS blocks under O3 versus 1 TRIPS block under O4]

Size Analysis

How big is this block? 3 instructions? 5 instructions? More?

read  sp, g1
movi  t3, 1
store 384(sp), t3
store 8(sp), t3

The maximum immediate is 256, and immediate instructions have one target, so the block actually expands to:

read  sp, g1
movi  t3, 1
mov   t4, t3
addi  t7, sp, 256
store 128(t7), t4
store 8(sp), t4
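The expansion above can be turned into a small cost estimator. The 256 immediate limit and the one-target rule come from the slide; the function itself, which counts instructions for storing one immediate value to several offsets (excluding the register read), is an illustrative assumption.

```python
# Size estimator sketch: a store whose offset exceeds the maximum
# immediate needs an extra addi to form the address, and an immediate
# instruction with only one target needs a mov per extra consumer.

MAX_IMM = 256

def expanded_cost(offsets, consumers_of_value=1):
    """Instruction count for storing one immediate value at offsets."""
    n = 1                                   # movi of the value
    n += max(0, consumers_of_value - 1)     # mov fan-out copies (one target each)
    for off in offsets:
        if off > MAX_IMM:
            n += 1                          # addi to build the base address
        n += 1                              # the store itself
    return n
```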

Too Big? Block Splitting

What if the block is too large?
- Predicated blocks: reverse if-convert
- Unpredicated (basic blocks): insert a branch and label

Before splitting:

L0: read  t4, g10
    addi  t3, t4, 16
    load  t5, 0(t3)
    ...
    subi  t7, t5, 8
    mult  t8, t7, t7
    store 0(t4), t8
    store 8(t4), t8
    branch L2

After splitting:

L0: read  t4, g10
    addi  t3, t4, 16
    load  t5, 0(t3)
    ...
    write g11, t5
    branch L1

L1: read  t4, g10
    read  t5, g11
    subi  t7, t5, 8
    mult  t8, t7, t7
    store 0(t4), t8
    store 8(t4), t8
    branch L2

Register Constraints: Linear Scan Allocator

• 128 registers (32 x 4 banks)
• Compute liveness over hyperblocks
• Ignore local variables
• Hyperblocks as large instructions

Spill counts on SPEC2000:

SPEC2000   Alpha   TRIPS (O4)
applu        247        331
apsi        1326        196
gcc         4490       6622
mesa        2614       3821
mgrid        366         77
sixtrack     494        220

Total spills (18 Alpha vs. 6 TRIPS)
Average spill: 1 store for 2-3 loads

Block Termination Constraint

Block termination: constant output per block
- A constant number of stores and writes execute
- One branch
- Simplifies hardware logic for detecting block completion

All writes complete
- Write nullification

All stores complete
- Store nullification
- LSID assignment


  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 11: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 11

Block Formation Basic blocks are often too small

(just a few insts) Predication allows larger

hyperblocks to be created Loop unrolling and function

inlining also help TRIPS blocks can hold up to 128

instructions Large blocks improve fetch

bandwidth and expose ILP Hard-to-predict branches can

sometimes be hidden inside a hyperblock

23424 USTC CS AN Hong 12

TRIPS Block Format Each block is formed from two to

five 128-byte programldquochunksrdquo Blocks with fewer than five

chunks are expanded to five chunks in the L1 I-cache

The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions

Each instruction chunk holds 32 4-byte instructions (including NOPs)

A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions

HeaderChunk

InstructionChunk 0

PC

128 Bytes

128 Bytes

128 Bytes

128 Bytes

128 Bytes

InstructionChunk 1

InstructionChunk 2

InstructionChunk 3

23424 USTC CS AN Hong 13

Processor Tiles Partition all major structures into

banks distribute and interconnect Execution Tile (E)

minus 64-entry Instruction Queue bankminus Single-issue execute pipeline

Register Tile (R)minus 32-entry Register bank (per thread)

Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks

Instruction Tile (I)minus 16KB Instruction Cache bank

Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic

23424 USTC CS AN Hong 14

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

Grid Processor Tiles and Interfaces

23424 USTC CS AN Hong 15

Mapping TRIPS Blocks to the Microarchitecture

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

23424 USTC CS AN Hong 16

Mapping TRIPS Blocks to the Microarchitecture

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

HeaderChunk

InstChunk 0

InstChunk 3

Block i mapped into Frame 0

23424 USTC CS AN Hong 17

Mapping TRIPS Blocks to the Microarchitecture

HeaderChunk

InstChunk 0

InstChunk 3

Block i+1 mapped into Frame 1

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

ot

E-tile[33]

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

23424 USTC CS AN Hong 18

Mapping Target Identifiers to Reservation Stations

I OP1 OP2Block 4

7

0

Slot

100 10 10110 11

Target = 87 OP1

Frame 4

Type(2 bits)

Y(2 bits)

X(2 bits)

Slot(3 bits)

Frame(3 bits)

ISA Target IdentifierFrame

(assigned by GTat runtime)

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

[1011]

10 11100 10 101

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7

Slot

E-tile

Frame 100Slot 101OP 10 = OP1

23424 USTC CS AN Hong 19

Block Fetch Fetch commands sent to

each Instruction Cache bank

The fetch pipeline is from 4 to 11 stages deep

A new block fetch can be initiated every 8 cycles

Instructions are fetched into Instruction Queue banks (chosen by the compiler)

EDGE ISA allows instructions to be fetched out-of-order

23424 USTC CS AN Hong 20

Block Execution Instructions execute (out-oforder)

when all of their operands arrive Intermediate values are sent

from instruction to instruction Register reads and writes

access the register banks Loads and stores access the

data cache banks Branch results go to the

global controller Up to 8 blocks can execute

simultaneously

23424 USTC CS AN Hong 21

Block Commit 1048577 Block completion is detected

and reported to the global controller

1048577 If no exceptions occurred the results may be committed

1048577 Writes are committed to Register files

1048577 Stores are committed to cache or memory

1048577 Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline

COMMITFETCH EXECUTE

5 10 30 400Frame 2 Bi

(variable execution time)

Time (cycles)

Frame 4

Frame 5

Frame 6

Frame 7

Frame 0

Frame 1

Bi+2

Bi+3

Bi+4

Bi+5

Bi+6

Bi+7

Frame 3 Bi+1

Executecommit overlapped across multiple blocks

Bi+8

G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each

23424 USTC CS AN Hong 23

NUCA L2 Cache 1048577 Prototype has 1MB L2

cache divided into sixteen 64KB banks

1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate

5 requests per cycle 1048577 Requests and replies are

wormhole-routed across the network

1048577 4 virtual channels prevent deadlocks

1048577 Can sustain over 100 bytes per cycle to the processors

23424 USTC CS AN Hong 24

Compiling for TRIPS

C

InliningLoop UnrollingFlattening

Scalar Optimizations

Your standard compileryoursquove seen this before

Frontend

FORTRAN

Code Generation

Alpha SPARC PPC TRIPS

TRIPS Block Formation

Register AllocationSplitting for Spill Code

PeepholeLoadStore ID Assignment

Store Nullification

Block Splitting

Scheduling and Assembly

23424 USTC CS AN Hong 25

Fixed Size Constraint 128 Instructions

bull O3 every basic block is a TRIPS block Simple but not high performance

bull O4 hyperblocks as TRIPS blocks

B1

B3B2

B4 B5

B6

B7

7 TRIPS Blocks

1 TRIPS Block

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment

23424 USTC CS AN Hong 30

ldshladdswbr

TRIPS Scheduling Problem

addaddldcmpbr

subshlldcmpbr

ldaddaddswbr

swswaddcmpbr

ld

Register File

Data C

aches

Hyperblock

addadd

Flowgraph

bull Place instructions on 4x4x8 gridbull Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]

minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality

23424 USTC CS AN Hong 32

TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]

minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency

Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply

Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations Frames

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 12: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 12

TRIPS Block Format Each block is formed from two to

five 128-byte programldquochunksrdquo Blocks with fewer than five

chunks are expanded to five chunks in the L1 I-cache

The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions

Each instruction chunk holds 32 4-byte instructions (including NOPs)

A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions

HeaderChunk

InstructionChunk 0

PC

128 Bytes

128 Bytes

128 Bytes

128 Bytes

128 Bytes

InstructionChunk 1

InstructionChunk 2

InstructionChunk 3

23424 USTC CS AN Hong 13

Processor Tiles Partition all major structures into

banks distribute and interconnect Execution Tile (E)

minus 64-entry Instruction Queue bankminus Single-issue execute pipeline

Register Tile (R)minus 32-entry Register bank (per thread)

Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks

Instruction Tile (I)minus 16KB Instruction Cache bank

Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic

23424 USTC CS AN Hong 14

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

Grid Processor Tiles and Interfaces

23424 USTC CS AN Hong 15

Mapping TRIPS Blocks to the Microarchitecture

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

23424 USTC CS AN Hong 16

Mapping TRIPS Blocks to the Microarchitecture

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

HeaderChunk

InstChunk 0

InstChunk 3

Block i mapped into Frame 0

23424 USTC CS AN Hong 17

Mapping TRIPS Blocks to the Microarchitecture

HeaderChunk

InstChunk 0

InstChunk 3

Block i+1 mapped into Frame 1

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

ot

E-tile[33]

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

23424 USTC CS AN Hong 18

Mapping Target Identifiers to Reservation Stations

I OP1 OP2Block 4

7

0

Slot

100 10 10110 11

Target = 87 OP1

Frame 4

Type(2 bits)

Y(2 bits)

X(2 bits)

Slot(3 bits)

Frame(3 bits)

ISA Target IdentifierFrame

(assigned by GTat runtime)

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

[1011]

10 11100 10 101

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7

Slot

E-tile

Frame 100Slot 101OP 10 = OP1

23424 USTC CS AN Hong 19

Block Fetch Fetch commands sent to

each Instruction Cache bank

The fetch pipeline is from 4 to 11 stages deep

A new block fetch can be initiated every 8 cycles

Instructions are fetched into Instruction Queue banks (chosen by the compiler)

EDGE ISA allows instructions to be fetched out-of-order

23424 USTC CS AN Hong 20

Block Execution Instructions execute (out-oforder)

when all of their operands arrive Intermediate values are sent

from instruction to instruction Register reads and writes

access the register banks Loads and stores access the

data cache banks Branch results go to the

global controller Up to 8 blocks can execute

simultaneously

23424 USTC CS AN Hong 21

Block Commit 1048577 Block completion is detected

and reported to the global controller

1048577 If no exceptions occurred the results may be committed

1048577 Writes are committed to Register files

1048577 Stores are committed to cache or memory

1048577 Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline

[Figure: timeline (cycles 0-40) showing blocks Bi..Bi+8 mapped to frames 0-7; each block passes through FETCH, EXECUTE (variable execution time), and COMMIT, with fetch of the next block overlapping execution of its predecessors]

- Execute/commit overlapped across multiple blocks
- G-tile manages frames as a circular buffer
  - D-morph: 1 thread, 8 frames
  - T-morph: up to 4 threads, 2 frames each

23424 USTC CS AN Hong 23
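The circular-buffer frame management above can be sketched as follows. This is a minimal sketch under stated assumptions: the class, method names, and bookkeeping are hypothetical, not the prototype's G-tile logic; only the frame counts per morph come from the slide.

```python
# Minimal sketch of G-tile frame management as a circular buffer, assuming
# the morph fixes how many frames each thread may hold at once
# (D-morph: 1 thread x 8 frames; T-morph: up to 4 threads x 2 frames).
from collections import deque

class FrameManager:
    def __init__(self, num_frames=8, frames_per_thread=8):
        self.free = deque(range(num_frames))
        self.limit = frames_per_thread
        self.owned = {}                  # thread id -> list of frame ids

    def allocate(self, thread):
        """Map the thread's next speculative block into a free frame."""
        mine = self.owned.setdefault(thread, [])
        if not self.free or len(mine) >= self.limit:
            return None                  # must wait for a block commit
        frame = self.free.popleft()
        mine.append(frame)
        return frame

    def commit_oldest(self, thread):
        """On block commit, recycle the thread's oldest frame."""
        frame = self.owned[thread].pop(0)
        self.free.append(frame)
        return frame
```

Under this sketch, FrameManager(8, 8) models the D-morph (one thread fills all eight frames) and FrameManager(8, 2) the T-morph cap of two frames per thread.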

NUCA L2 Cache
- Prototype has a 1MB L2 cache divided into sixteen 64KB banks
- 4x10 2D mesh topology
- Links are 128 bits wide
- Each processor can initiate 5 requests per cycle
- Requests and replies are wormhole-routed across the network
- 4 virtual channels prevent deadlocks
- Can sustain over 100 bytes per cycle to the processors

23424 USTC CS AN Hong 24
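With sixteen 64KB banks, one simple address-to-bank mapping is to index by the 64KB-aligned bank number. This is a hedged sketch: the prototype's actual NUCA bank hash may differ; only the bank count and size come from the slide.

```python
# Hedged sketch: map a physical address to one of sixteen 64KB banks of
# the 1MB L2. With 64KB banks, this uses address bits [19:16] as the
# bank index; the real TRIPS mapping may differ.
BANK_SIZE = 64 * 1024
NUM_BANKS = 16

def l2_bank(addr: int) -> int:
    return (addr // BANK_SIZE) % NUM_BANKS
```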

Compiling for TRIPS

Your standard compiler - you've seen this before:
- Frontend (C, FORTRAN)
- Inlining, Loop Unrolling, Flattening
- Scalar Optimizations
- Code Generation (Alpha, SPARC, PPC, TRIPS)

TRIPS-specific backend:
- TRIPS Block Formation
- Register Allocation / Splitting for Spill Code
- Peephole / Load-Store ID Assignment
- Store Nullification
- Block Splitting
- Scheduling and Assembly

23424 USTC CS AN Hong 25

Fixed Size Constraint: 128 Instructions
- O3: every basic block is a TRIPS block. Simple, but not high performance
- O4: hyperblocks as TRIPS blocks

[Figure: control-flow graph B1-B7; at O3 it yields 7 TRIPS blocks, at O4 a single TRIPS block (one hyperblock)]

23424 USTC CS AN Hong 26

Size Analysis: How big is this block? 3 instructions? 5? More?

  read   sp, g1
  movi   t3, 1
  store  384(sp), t3
  store  8(sp), t3

The max immediate is 256, and immediate instructions have one target, so the block actually expands to:

  read   sp, g1
  movi   t3, 1
  mov    t4, t3
  addi   t7, sp, 256
  store  128(t7), t4
  store  8(sp), t4

23424 USTC CS AN Hong 27
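The offset expansion in the example above can be sketched mechanically. This is an illustrative sketch, not the TRIPS compiler's peephole pass: the function and register names are hypothetical; only the 256 immediate limit comes from the slide.

```python
# Sketch of the size analysis above: rewrite a store whose stack offset
# exceeds the ISA's max immediate (256) into addi + store with a smaller
# offset, as in the slide's expanded block. Illustrative only.
MAX_IMM = 256

def legalize_store(offset, base="sp", val="t3", tmp="t7"):
    """Rewrite 'store offset(base), val' so every immediate fits."""
    insts = []
    while offset >= MAX_IMM:
        insts.append(f"addi {tmp}, {base}, {MAX_IMM}")
        base, offset = tmp, offset - MAX_IMM
    insts.append(f"store {offset}({base}), {val}")
    return insts
```

For the slide's example, legalize_store(384) yields "addi t7, sp, 256" then "store 128(t7), t3", while legalize_store(8) stays a single instruction - which is why the block's true size is not obvious from the source.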

Too Big? Block Splitting: What if the block is too large?

Unpredicated (basic blocks): insert a branch and label.
Predicated blocks: reverse if-convert.

Before:
  L0: read  t4, g10
      addi  t3, t4, 16
      load  t5, 0(t3)
      ...
      subi  t7, t5, 8
      mult  t8, t7, t7
      store 0(t4), t8
      store 8(t4), t8
      branch L2

After splitting:
  L0: read  t4, g10
      addi  t3, t4, 16
      load  t5, 0(t3)
      ...
      branch L1
      write g11, t5
  L1: read  t4, g10
      read  t5, g11
      subi  t7, t5, 8
      mult  t8, t7, t7
      store 0(t4), t8
      store 8(t4), t8
      branch L2

23424 USTC CS AN Hong 28

Register Constraints: Linear Scan Allocator
- 128 registers (32 x 4 banks)
- Compute liveness over hyperblocks
- Ignore local variables
- Hyperblocks as large instructions

Spill counts, SPEC2000:

  Benchmark   Alpha   TRIPS (O4)
  applu         247      331
  apsi         1326      196
  gcc          4490     6622
  mesa         2614     3821
  mgrid         366       77
  sixtrack      494      220

Total spills (18 Alpha vs 6 TRIPS); average spill: 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block termination: constant output per block
- Constant number of stores and writes execute
- One branch
- Simplifies hardware logic for detecting block completion

All writes complete:
- Write nullification

All stores complete:
- Store nullification
- LSID assignment

23424 USTC CS AN Hong 30

TRIPS Scheduling Problem

[Figure: a hyperblock's dataflow graph (ld, shl, add, sub, cmp, sw, br instructions) is mapped onto the execution grid between the register file and the data caches]

- Place instructions on the 4x4x8 grid
- Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling Algorithms: heuristic-based list scheduler [PACT 2005]
- Greedy, top-down
- Prioritizes critical path
- Reprioritizes after each placement
- Balances functional unit utilization
- Accounts for data cache locality
- Accounts for register bank locality

23424 USTC CS AN Hong 32
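The heuristics above can be sketched as a toy scheduler. This is a sketch in the spirit of the slide, not the PACT 2005 algorithm itself: the cost function, tile capacity handling, and all names are assumptions; it only illustrates critical-path-first placement with load balancing and producer locality.

```python
# Toy greedy list scheduler: place dataflow-graph instructions onto a
# 4x4 grid of ALUs (8 slots each), top-down, longest-critical-path
# first, preferring the least-loaded tile nearest already-placed
# producers. Illustrative sketch only.
def schedule(deps, rows=4, cols=4, slots=8):
    """deps: {inst: [producer insts]}. Returns {inst: (row, col)}."""
    # Critical-path height: longest chain from an instruction to a leaf.
    height = {}
    def h(i):
        if i not in height:
            height[i] = 1 + max((h(c) for c, ps in deps.items()
                                 if i in ps), default=0)
        return height[i]
    for i in deps:
        h(i)

    load = {(r, c): 0 for r in range(rows) for c in range(cols)}
    place, done, ready = {}, set(), [i for i, ps in deps.items() if not ps]
    while ready:
        ready.sort(key=lambda i: -height[i])      # critical path first
        inst = ready.pop(0)
        prods = [place[p] for p in deps[inst]]
        def cost(tile):
            near = sum(abs(tile[0] - r) + abs(tile[1] - c)
                       for r, c in prods)
            return (load[tile], near)             # balance, then locality
        tile = min((t for t in load if load[t] < slots), key=cost)
        load[tile] += 1
        place[inst] = tile
        done.add(inst)
        ready += [i for i, ps in deps.items()
                  if i not in done and i not in ready
                  and all(p in done for p in ps)]
    return place
```

A real scheduler would also model operand network hop latency and register/data-cache bank positions at the grid edges; this sketch collapses those into the Manhattan-distance term.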

TRIPS Polymorphous: Different Levels of Parallelism

Instruction-level parallelism [Nagarajan et al., MICRO'01]
- Populate large instruction window with useful instructions
- Schedule instructions to optimize communication and concurrency

Thread-level parallelism
- Partition instruction window among different threads
- Reduce contention for instruction and data supply

Data-level parallelism
- Provide high density of computational elements
- Provide high bandwidth to/from data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations Frames

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

- Regular data accesses
- Subset of L2 cache banks configured as SRF (stream register file)
- High-bandwidth data channels to SRF
- Reduced address communication
- Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance Results

ILP: instruction window occupancy
- Peak: 4x4x128 array = 2048 instructions
- Sustained: 493 for SPECint, 1412 for SPECfp
- Bottleneck: branch prediction

TLP: instruction and data supply
- Peak: 100% efficiency
- Sustained: 87% for two threads, 61% for four threads

DLP: data supply bandwidth
- Peak: 16 ops/cycle
- Sustained: 6.9 ops/cycle

23424 USTC CS AN Hong 41

ASIC Implementation
- 130 nm 7LM IBM ASIC process
- 335 mm² die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out: fall 2005
- System bring-up: spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary

Distributed microarchitecture
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components

Tiled microarchitecture
- Simplifies scalability
- Improves design productivity

Coarse-grained, homogeneous approach with polymorphism
- ILP: well-partitioned, powerful uniprocessor (GPA)
- TLP: divide instruction window among different threads
- DLP: mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems

Scalability
- Larger grid means more communication latency
- ILP vs. DLP/TLP
- Multicore / manycore

Compatibility
- Instruction & block code

Low-efficiency architecture
- Instruction buffer
- I-Cache bank, GDN
- Operand network
- LSQ & read/write queue

Polymorphism
- How to realize DLP?

23424 USTC CS AN Hong 45

Scalable? Larger grid

[Figure: the G/R/I/D/E tile grid with its control networks - GDN (global dispatch network), GSN (global status network), GCN (global control network), and OPN (operand network) - each of which must span the larger grid]

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

Page 13: Lecture on High Performance Processor Architecture (CS05162)

23424 USTC CS AN Hong 13

Processor Tiles: partition all major structures into banks, distribute, and interconnect

Execution Tile (E)
- 64-entry Instruction Queue bank
- Single-issue execute pipeline

Register Tile (R)
- 32-entry Register bank (per thread)

Data Tile (D)
- 8KB Data Cache bank
- LSQ and MHU banks

Instruction Tile (I)
- 16KB Instruction Cache bank

Global Control Tile (G)
- Tracks up to 8 blocks of instructions
- Branch prediction & resolution logic
23424 USTC CS AN Hong 14
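The tile mix implied by this partitioning can be tallied from the floorplan rows shown on the slides ("G R R R R I" on top, four "D E E E E I" rows below). A quick sketch; the layout strings follow the slides, and the derived counts are computed, not quoted:

```python
# Sketch reconstructing the prototype's per-core tile floorplan from the
# slide's rows, just to tally the tile mix. Illustrative only.
ROWS = ["G R R R R I"] + ["D E E E E I"] * 4

def tile_counts():
    counts = {}
    for row in ROWS:
        for t in row.split():
            counts[t] = counts.get(t, 0) + 1
    return counts
```

Under this layout the core has one G-tile, four R-tiles, five I-tiles, four D-tiles, and sixteen E-tiles.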

Grid Processor Tiles and Interfaces

[Figure: 5x5 tile grid (one G, four R, five I, four D, sixteen E tiles) connected by the GDN (global dispatch network), GSN (global status network), GCN (global control network), and OPN (operand network)]

23424 USTC CS AN Hong 15

Mapping TRIPS Blocks to the Microarchitecture

[Figure: the tile grid alongside per-tile storage - instruction reservation stations (frames 0-7, slots 0-7) in each E-tile[3,3]; LSIDs 0-31 in D-tile[3]; architecture register files and read/write queues (frames 0-7, threads 0-3) in R-tile[3]]

23424 USTC CS AN Hong 16

Mapping TRIPS Blocks to the Microarchitecture

A block's header chunk and instruction chunks 0-3 are distributed across the tiles: Block i mapped into Frame 0.

[Figure: same per-tile storage as the previous slide, with block i's chunks occupying frame 0]

23424 USTC CS AN Hong 17

Mapping TRIPS Blocks to the Microarchitecture

Block i+1 (header chunk, instruction chunks 0-3) mapped into Frame 1.

Page 15: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 15

Mapping TRIPS Blocks to the Microarchitecture

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

23424 USTC CS AN Hong 16

Mapping TRIPS Blocks to the Microarchitecture

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

HeaderChunk

InstChunk 0

InstChunk 3

Block i mapped into Frame 0

23424 USTC CS AN Hong 17

Mapping TRIPS Blocks to the Microarchitecture

HeaderChunk

InstChunk 0

InstChunk 3

Block i+1 mapped into Frame 1

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

ot

E-tile[33]

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

23424 USTC CS AN Hong 18

Mapping Target Identifiers to Reservation Stations

I OP1 OP2Block 4

7

0

Slot

100 10 10110 11

Target = 87 OP1

Frame 4

Type(2 bits)

Y(2 bits)

X(2 bits)

Slot(3 bits)

Frame(3 bits)

ISA Target IdentifierFrame

(assigned by GTat runtime)

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

[1011]

10 11100 10 101

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7

Slot

E-tile

Frame 100Slot 101OP 10 = OP1

23424 USTC CS AN Hong 19

Block Fetch Fetch commands sent to

each Instruction Cache bank

The fetch pipeline is from 4 to 11 stages deep

A new block fetch can be initiated every 8 cycles

Instructions are fetched into Instruction Queue banks (chosen by the compiler)

EDGE ISA allows instructions to be fetched out-of-order

23424 USTC CS AN Hong 20

Block Execution Instructions execute (out-oforder)

when all of their operands arrive Intermediate values are sent

from instruction to instruction Register reads and writes

access the register banks Loads and stores access the

data cache banks Branch results go to the

global controller Up to 8 blocks can execute

simultaneously

23424 USTC CS AN Hong 21

Block Commit 1048577 Block completion is detected

and reported to the global controller

1048577 If no exceptions occurred the results may be committed

1048577 Writes are committed to Register files

1048577 Stores are committed to cache or memory

1048577 Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline
[Figure: timeline in cycles (0, 5, 10, 30, 40) of FETCH / EXECUTE / COMMIT for blocks Bi ... Bi+8 mapped onto Frames 0-7, with variable execution time; execute and commit are overlapped across multiple blocks]
- G-tile manages frames as a circular buffer
- D-morph: 1 thread, 8 frames
- T-morph: up to 4 threads, 2 frames each
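As a toy model (assumed behavior, not the real G-tile logic), the circular frame buffer and its D-morph/T-morph split can be sketched as:

```python
# Toy model of G-tile frame management: 8 frames as a circular buffer.
# D-morph: one thread owns all 8 frames; T-morph: up to 4 threads get
# 2 frames each. Allocation wraps around within a thread's partition.
class FrameManager:
    def __init__(self, threads=1):
        assert threads in (1, 2, 4)
        self.frames_per_thread = 8 // threads
        self.next = [0] * threads  # per-thread wrap position

    def allocate(self, thread):
        base = thread * self.frames_per_thread
        frame = base + self.next[thread]
        self.next[thread] = (self.next[thread] + 1) % self.frames_per_thread
        return frame
```

In D-morph (`threads=1`) frames cycle 0..7; in T-morph (`threads=4`) thread 1 alternates between frames 2 and 3.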

NUCA L2 Cache
- Prototype has a 1 MB L2 cache divided into sixteen 64 KB banks
- 4x10 2D mesh topology
- Links are 128 bits wide
- Each processor can initiate 5 requests per cycle
- Requests and replies are wormhole-routed across the network
- 4 virtual channels prevent deadlocks
- Can sustain over 100 bytes per cycle to the processors
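For illustration only, a cache-line-interleaved bank-selection function over the sixteen banks might look like the following; the 64-byte line size is an assumption, since the slide gives only the capacity and bank count:

```python
# Illustrative bank selection for a 1 MB NUCA L2 split into sixteen
# 64 KB banks, interleaved at (assumed) 64-byte cache-line granularity.
LINE_BYTES = 64
BANKS = 16

def l2_bank(addr: int) -> int:
    # Consecutive cache lines map to consecutive banks, wrapping mod 16.
    return (addr // LINE_BYTES) % BANKS
```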

Compiling for TRIPS
[Figure: compiler flow - C and FORTRAN frontends; inlining, loop unrolling, flattening; scalar optimizations; code generation for Alpha, SPARC, PPC, and TRIPS ("your standard compiler, you've seen this before"); then the TRIPS-specific backend: block formation, register allocation and splitting for spill code, peephole and load/store ID assignment, store nullification, block splitting, scheduling and assembly]

Fixed Size Constraint: 128 Instructions
- O3: every basic block is a TRIPS block. Simple, but not high performance
- O4: hyperblocks as TRIPS blocks
[Figure: control-flow graph B1-B7 - 7 TRIPS blocks under O3 vs. 1 TRIPS block under O4]

Size Analysis: How big is this block? 3 instructions? 5 instructions? More?

    read  sp, g1
    movi  t3, 1
    store 384(sp), t3
    store 8(sp), t3

The max immediate is 256, and immediate instructions have one target, so the block expands:

    read  sp, g1
    movi  t3, 1
    mov   t4, t3
    addi  t7, sp, 256
    store 128(t7), t4
    store 8(sp), t4
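The expansion above follows mechanically from the 256 immediate limit. A hypothetical helper (the name `legalize_offset` and temporary `t7` are illustrative) that performs the single-step rewrite:

```python
# Hypothetical legalization of one store offset: if it exceeds the 256
# maximum immediate, emit an addi forming a new base register and use the
# remainder as the displacement. Handles offsets up to 2*256 only,
# matching the single-step expansion on the slide.
def legalize_offset(base, offset, max_imm=256, tmp="t7"):
    if offset <= max_imm:
        return [], base, offset            # no extra instruction needed
    assert offset <= 2 * max_imm, "would need a chain of addi instructions"
    return [f"addi {tmp}, {base}, {max_imm}"], tmp, offset - max_imm

# store 384(sp) becomes: addi t7, sp, 256 ; store 128(t7)
```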

Too Big? Block Splitting
What if the block is too large? Predicated blocks: reverse if-convert. Unpredicated (basic blocks): insert a branch and label.

Before:

    L0: read  t4, g10
        addi  t3, t4, 16
        load  t5, 0(t3)
        ...
        subi  t7, t5, 8
        mult  t8, t7, t7
        store 0(t4), t8
        store 8(t4), t8
        branch L2

After splitting:

    L0: read  t4, g10
        addi  t3, t4, 16
        load  t5, 0(t3)
        ...
        branch L1
        write g11, t5

    L1: read  t4, g10
        read  t5, g11
        subi  t7, t5, 8
        mult  t8, t7, t7
        store 0(t4), t8
        store 8(t4), t8
        branch L2

Register Constraints: Linear Scan Allocator
- 128 registers (32 x 4 banks)
- Compute liveness over hyperblocks
- Ignore local variables
- Hyperblocks as large instructions

SPEC2000 spill counts:

    benchmark   Alpha   TRIPS (O4)
    applu         247      331
    apsi         1326      196
    gcc          4490     6622
    mesa         2614     3821
    mgrid         366       77
    sixtrack      494      220

Total spills (18 Alpha vs. 6 TRIPS); average spill: 1 store for 2-3 loads.

Block Termination Constraint
Block termination: constant output per block
- A constant number of stores and writes execute
- One branch
- Simplifies hardware logic for detecting block completion
All writes complete:
- Write nullification
All stores complete:
- Store nullification
- LSID assignment
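The payoff of the constant-output rule is that completion detection reduces to counting. A minimal sketch of that check (assumed logic, not the actual hardware):

```python
# With a constant number of stores and writes per block (nullified ones
# still signal) and exactly one branch, the global controller can declare
# a block complete by simple counting.
def block_complete(expected_stores, expected_writes,
                   seen_stores, seen_writes, seen_branch):
    return (seen_stores == expected_stores and
            seen_writes == expected_writes and
            seen_branch)
```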

TRIPS Scheduling Problem
[Figure: hyperblock flowgraph of instruction groups (ld/shl/add/sw/br; add/add/ld/cmp/br; sub/shl/ld/cmp/br; ld/add/add/sw/br; sw/sw/add/cmp/br; ...) to be placed on the tile grid between the register file and the data caches]
- Place instructions on the 4x4x8 grid
- Encode placement in target form

Scheduling Algorithms
Heuristic-based list scheduler [PACT 2005]:
- Greedy, top-down
- Prioritizes the critical path
- Reprioritizes after each placement
- Balances functional unit utilization
- Accounts for data cache locality
- Accounts for register bank locality
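The heuristics above can be caricatured in a few lines. A toy sketch (not the TRIPS scheduler itself): critical-path height sets the placement order, and each placement greedily trades operand-routing distance against tile load:

```python
# Toy greedy, critical-path-first list scheduler placing a block's ops
# onto a 4x4 tile grid with a fixed number of slots per tile.
from collections import defaultdict

def schedule(ops, deps, grid=(4, 4), slots_per_tile=8):
    succs, preds = defaultdict(list), defaultdict(list)
    for a, b in deps:              # edge a -> b: b consumes a's result
        succs[a].append(b)
        preds[b].append(a)

    height = {}                    # critical-path height in the DAG
    def h(op):
        if op not in height:
            height[op] = 1 + max((h(s) for s in succs[op]), default=0)
        return height[op]
    order = sorted(ops, key=h, reverse=True)   # critical path first

    placement, load = {}, defaultdict(int)
    for op in order:
        # prefer tiles close to this op's already-placed producers,
        # breaking ties toward the least-loaded tile
        def cost(tile):
            dist = sum(abs(tile[0] - placement[p][0]) +
                       abs(tile[1] - placement[p][1])
                       for p in preds[op] if p in placement)
            return (dist, load[tile])
        candidates = [(y, x) for y in range(grid[0]) for x in range(grid[1])
                      if load[(y, x)] < slots_per_tile]
        tile = min(candidates, key=cost)
        placement[op] = tile
        load[tile] += 1
    return placement
```

For example, with `deps = [("a","c"), ("b","c")]` the consumer `c` lands on a tile adjacent to both producers.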

TRIPS Polymorphous: Different Levels of Parallelism
Instruction-level parallelism [Nagarajan et al., MICRO 2001]:
- Populate a large instruction window with useful instructions
- Schedule instructions to optimize communication and concurrency
Thread-level parallelism:
- Partition the instruction window among different threads
- Reduce contention for instruction and data supply
Data-level parallelism:
- Provide a high density of computational elements
- Provide high bandwidth to/from data memory

TRIPS Configurable Resources

Aggregating Reservation Stations: Frames

Extracting ILP: Frames for Speculation

Configuring Frames for TLP

Using Frames for DLP

Configuring Data Memory for DLP

Configuring Data Memory for DLP
- Regular data accesses
- Subset of L2 cache banks configured as a streaming register file (SRF)
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants saved in reservation stations

Performance Results
ILP: instruction window occupancy
- Peak: 4x4x128 array = 2048 instructions
- Sustained: 493 for SPEC INT, 1412 for SPEC FP
- Bottleneck: branch prediction
TLP: instruction and data supply
- Peak: 100% efficiency
- Sustained: 87% for two threads, 61% for four threads
DLP: data supply bandwidth
- Peak: 16 ops/cycle
- Sustained: 6.9 ops/cycle

ASIC Implementation
- 130 nm, 7LM IBM ASIC process
- 335 mm2 die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out: fall 2005
- System bring-up: spring 2006

Functional Area Breakdown


TRIPS Summary
Distributed microarchitecture:
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components
Tiled microarchitecture:
- Simplifies scalability
- Improves design productivity
Coarse-grained, homogeneous approach with polymorphism:
- ILP: well-partitioned, powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: mapping reuse of instructions and constants in the grid

Problems
Scalability:
- Larger grid means more communication latency
- ILP, DLP/TLP
- Multicore / manycore
Compatibility:
- Instruction & block code
Low-efficiency architecture:
- Instruction buffer
- I-cache bank, GDN
- Operand network
- LSQ & read/write queues
Polymorphism:
- How to realize DLP?

Scalable: Larger Grid
[Figure: the G/R/I/D/E tile array (one G-tile, a row of R-tiles, columns of I- and D-tiles, a 4x4 grid of E-tiles), replicated to a larger grid and interconnected by the on-chip networks]
- GDN: global dispatch network
- GSN: global status network
- GCN: global control network
- OPN: operand network

Scalable: Multicore / Manycore

Compatibility


Low-efficiency architecture

Link utilization for SPEC CPU 2000


Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 16: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 16

Mapping TRIPS Blocks to the Microarchitecture

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

otE-tile[33]

HeaderChunk

InstChunk 0

InstChunk 3

Block i mapped into Frame 0

23424 USTC CS AN Hong 17

Mapping TRIPS Blocks to the Microarchitecture

HeaderChunk

InstChunk 0

InstChunk 3

Block i+1 mapped into Frame 1

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

ot

E-tile[33]

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

23424 USTC CS AN Hong 18

Mapping Target Identifiers to Reservation Stations

I OP1 OP2Block 4

7

0

Slot

100 10 10110 11

Target = 87 OP1

Frame 4

Type(2 bits)

Y(2 bits)

X(2 bits)

Slot(3 bits)

Frame(3 bits)

ISA Target IdentifierFrame

(assigned by GTat runtime)

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

[1011]

10 11100 10 101

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7

Slot

E-tile

Frame 100Slot 101OP 10 = OP1

23424 USTC CS AN Hong 19

Block Fetch Fetch commands sent to

each Instruction Cache bank

The fetch pipeline is from 4 to 11 stages deep

A new block fetch can be initiated every 8 cycles

Instructions are fetched into Instruction Queue banks (chosen by the compiler)

EDGE ISA allows instructions to be fetched out-of-order

23424 USTC CS AN Hong 20

Block Execution Instructions execute (out-oforder)

when all of their operands arrive Intermediate values are sent

from instruction to instruction Register reads and writes

access the register banks Loads and stores access the

data cache banks Branch results go to the

global controller Up to 8 blocks can execute

simultaneously

23424 USTC CS AN Hong 21

Block Commit 1048577 Block completion is detected

and reported to the global controller

1048577 If no exceptions occurred the results may be committed

1048577 Writes are committed to Register files

1048577 Stores are committed to cache or memory

1048577 Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline

COMMITFETCH EXECUTE

5 10 30 400Frame 2 Bi

(variable execution time)

Time (cycles)

Frame 4

Frame 5

Frame 6

Frame 7

Frame 0

Frame 1

Bi+2

Bi+3

Bi+4

Bi+5

Bi+6

Bi+7

Frame 3 Bi+1

Executecommit overlapped across multiple blocks

Bi+8

G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each

23424 USTC CS AN Hong 23

NUCA L2 Cache 1048577 Prototype has 1MB L2

cache divided into sixteen 64KB banks

1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate

5 requests per cycle 1048577 Requests and replies are

wormhole-routed across the network

1048577 4 virtual channels prevent deadlocks

1048577 Can sustain over 100 bytes per cycle to the processors

23424 USTC CS AN Hong 24

Compiling for TRIPS

C

InliningLoop UnrollingFlattening

Scalar Optimizations

Your standard compileryoursquove seen this before

Frontend

FORTRAN

Code Generation

Alpha SPARC PPC TRIPS

TRIPS Block Formation

Register AllocationSplitting for Spill Code

PeepholeLoadStore ID Assignment

Store Nullification

Block Splitting

Scheduling and Assembly

23424 USTC CS AN Hong 25

Fixed Size Constraint 128 Instructions

bull O3 every basic block is a TRIPS block Simple but not high performance

bull O4 hyperblocks as TRIPS blocks

B1

B3B2

B4 B5

B6

B7

7 TRIPS Blocks

1 TRIPS Block

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment

23424 USTC CS AN Hong 30

ldshladdswbr

TRIPS Scheduling Problem

addaddldcmpbr

subshlldcmpbr

ldaddaddswbr

swswaddcmpbr

ld

Register File

Data C

aches

Hyperblock

addadd

Flowgraph

bull Place instructions on 4x4x8 gridbull Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]

minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality

23424 USTC CS AN Hong 32

TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]

minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency

Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply

Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations Frames

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 17: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 17

Mapping TRIPS Blocks to the Microarchitecture

HeaderChunk

InstChunk 0

InstChunk 3

Block i+1 mapped into Frame 1

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7Sl

ot

E-tile[33]

Frame

0 1 23 4 5 6 7

0

31

LSID

D-tile[3]

Architecture Register Files

ReadWrite Queues

Frame0 1 2 3 4 5 6 7

Thread0 1 2 3 R-tile[3]

RW

23424 USTC CS AN Hong 18

Mapping Target Identifiers to Reservation Stations

I OP1 OP2Block 4

7

0

Slot

100 10 10110 11

Target = 87 OP1

Frame 4

Type(2 bits)

Y(2 bits)

X(2 bits)

Slot(3 bits)

Frame(3 bits)

ISA Target IdentifierFrame

(assigned by GTat runtime)

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

[1011]

10 11100 10 101

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7

Slot

E-tile

Frame 100Slot 101OP 10 = OP1

23424 USTC CS AN Hong 19

Block Fetch Fetch commands sent to

each Instruction Cache bank

The fetch pipeline is from 4 to 11 stages deep

A new block fetch can be initiated every 8 cycles

Instructions are fetched into Instruction Queue banks (chosen by the compiler)

EDGE ISA allows instructions to be fetched out-of-order

23424 USTC CS AN Hong 20

Block Execution Instructions execute (out-oforder)

when all of their operands arrive Intermediate values are sent

from instruction to instruction Register reads and writes

access the register banks Loads and stores access the

data cache banks Branch results go to the

global controller Up to 8 blocks can execute

simultaneously

23424 USTC CS AN Hong 21

Block Commit 1048577 Block completion is detected

and reported to the global controller

1048577 If no exceptions occurred the results may be committed

1048577 Writes are committed to Register files

1048577 Stores are committed to cache or memory

1048577 Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline

COMMITFETCH EXECUTE

5 10 30 400Frame 2 Bi

(variable execution time)

Time (cycles)

Frame 4

Frame 5

Frame 6

Frame 7

Frame 0

Frame 1

Bi+2

Bi+3

Bi+4

Bi+5

Bi+6

Bi+7

Frame 3 Bi+1

Executecommit overlapped across multiple blocks

Bi+8

G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each

23424 USTC CS AN Hong 23

NUCA L2 Cache 1048577 Prototype has 1MB L2

cache divided into sixteen 64KB banks

1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate

5 requests per cycle 1048577 Requests and replies are

wormhole-routed across the network

1048577 4 virtual channels prevent deadlocks

1048577 Can sustain over 100 bytes per cycle to the processors

23424 USTC CS AN Hong 24

Compiling for TRIPS

C

InliningLoop UnrollingFlattening

Scalar Optimizations

Your standard compileryoursquove seen this before

Frontend

FORTRAN

Code Generation

Alpha SPARC PPC TRIPS

TRIPS Block Formation

Register AllocationSplitting for Spill Code

PeepholeLoadStore ID Assignment

Store Nullification

Block Splitting

Scheduling and Assembly

23424 USTC CS AN Hong 25

Fixed Size Constraint 128 Instructions

bull O3 every basic block is a TRIPS block Simple but not high performance

bull O4 hyperblocks as TRIPS blocks

B1

B3B2

B4 B5

B6

B7

7 TRIPS Blocks

1 TRIPS Block

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment

23424 USTC CS AN Hong 30

ldshladdswbr

TRIPS Scheduling Problem

addaddldcmpbr

subshlldcmpbr

ldaddaddswbr

swswaddcmpbr

ld

Register File

Data C

aches

Hyperblock

addadd

Flowgraph

bull Place instructions on 4x4x8 gridbull Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]

minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality

23424 USTC CS AN Hong 32

TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]

minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency

Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply

Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations Frames

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 18: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 18

Mapping Target Identifiers to Reservation Stations

I OP1 OP2Block 4

7

0

Slot

100 10 10110 11

Target = 87 OP1

Frame 4

Type(2 bits)

Y(2 bits)

X(2 bits)

Slot(3 bits)

Frame(3 bits)

ISA Target IdentifierFrame

(assigned by GTat runtime)

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

[1011]

10 11100 10 101

I OP1 OP2

Instruction reservation stationsFrame0 1 2 3 4 5 6

7

0

7

Slot

E-tile

Frame 100Slot 101OP 10 = OP1

23424 USTC CS AN Hong 19

Block Fetch Fetch commands sent to

each Instruction Cache bank

The fetch pipeline is from 4 to 11 stages deep

A new block fetch can be initiated every 8 cycles

Instructions are fetched into Instruction Queue banks (chosen by the compiler)

EDGE ISA allows instructions to be fetched out-of-order

23424 USTC CS AN Hong 20

Block Execution Instructions execute (out-oforder)

when all of their operands arrive Intermediate values are sent

from instruction to instruction Register reads and writes

access the register banks Loads and stores access the

data cache banks Branch results go to the

global controller Up to 8 blocks can execute

simultaneously

23424 USTC CS AN Hong 21

Block Commit 1048577 Block completion is detected

and reported to the global controller

1048577 If no exceptions occurred the results may be committed

1048577 Writes are committed to Register files

1048577 Stores are committed to cache or memory

1048577 Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline

COMMITFETCH EXECUTE

5 10 30 400Frame 2 Bi

(variable execution time)

Time (cycles)

Frame 4

Frame 5

Frame 6

Frame 7

Frame 0

Frame 1

Bi+2

Bi+3

Bi+4

Bi+5

Bi+6

Bi+7

Frame 3 Bi+1

Executecommit overlapped across multiple blocks

Bi+8

G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each

23424 USTC CS AN Hong 23

NUCA L2 Cache 1048577 Prototype has 1MB L2

cache divided into sixteen 64KB banks

1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate

5 requests per cycle 1048577 Requests and replies are

wormhole-routed across the network

1048577 4 virtual channels prevent deadlocks

1048577 Can sustain over 100 bytes per cycle to the processors

23424 USTC CS AN Hong 24

Compiling for TRIPS

C

InliningLoop UnrollingFlattening

Scalar Optimizations

Your standard compileryoursquove seen this before

Frontend

FORTRAN

Code Generation

Alpha SPARC PPC TRIPS

TRIPS Block Formation

Register AllocationSplitting for Spill Code

PeepholeLoadStore ID Assignment

Store Nullification

Block Splitting

Scheduling and Assembly

23424 USTC CS AN Hong 25

Fixed Size Constraint 128 Instructions

bull O3 every basic block is a TRIPS block Simple but not high performance

bull O4 hyperblocks as TRIPS blocks

B1

B3B2

B4 B5

B6

B7

7 TRIPS Blocks

1 TRIPS Block

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment

23424 USTC CS AN Hong 30

ldshladdswbr

TRIPS Scheduling Problem

addaddldcmpbr

subshlldcmpbr

ldaddaddswbr

swswaddcmpbr

ld

Register File

Data C

aches

Hyperblock

addadd

Flowgraph

bull Place instructions on 4x4x8 gridbull Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]

minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality

23424 USTC CS AN Hong 32

TRIPS Polymorphous: Different Levels of Parallelism

• Instruction-level parallelism [Nagarajan et al., MICRO '01]
  – Populate a large instruction window with useful instructions
  – Schedule instructions to optimize communication and concurrency
• Thread-level parallelism
  – Partition the instruction window among different threads
  – Reduce contention for instruction and data supply
• Data-level parallelism
  – Provide high density of computational elements
  – Provide high bandwidth to/from data memory

TRIPS Configurable Resources

Aggregating Reservation Stations: Frames

Extracting ILP: Frames for Speculation

Configuring Frames for TLP

Using Frames for DLP

Configuring Data Memory for DLP

• Regular data accesses
• Subset of L2 cache banks configured as an SRF (stream register file)
• High-bandwidth data channels to the SRF
• Reduced address communication
• Constants saved in reservation stations
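How the same frames serve different morphs can be summarized as a toy allocation table; the frame counts follow the D-morph/T-morph split given with the block execution timeline, and the function and morph codes are hypothetical:

```python
def partition_frames(morph, num_frames=8):
    """Frame allocation per morph (toy model).

    D-morph: one thread speculates across all 8 frames.
    T-morph: up to 4 threads, 2 frames each.
    Returns thread id -> list of frame numbers.
    """
    if morph == "D":
        return {0: list(range(num_frames))}            # 1 thread, 8 frames
    if morph == "T":
        return {t: [2 * t, 2 * t + 1] for t in range(num_frames // 2)}
    raise ValueError(f"unknown morph: {morph}")
```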

Performance results

• ILP: instruction window occupancy
  – Peak: 4x4x128 array = 2048 instructions
  – Sustained: 493 for SPECint, 1412 for SPECfp
  – Bottleneck: branch prediction
• TLP: instruction and data supply
  – Peak: 100% efficiency
  – Sustained: 87% for two threads, 61% for four threads
• DLP: data supply bandwidth
  – Peak: 16 ops/cycle
  – Sustained: 6.9 ops/cycle

ASIC Implementation

• 130 nm 7LM IBM ASIC process
• 335 mm² die
• 47.5 mm x 47.5 mm package
• ~170 million transistors
• ~600 signal I/Os
• ~500 MHz clock frequency
• Tape-out: fall 2005
• System bring-up: spring 2006

Functional Area Breakdown


TRIPS Summary

• Distributed microarchitecture
  – Acknowledges and tolerates wire delay
  – Scalable protocols tailored for distributed components
• Tiled microarchitecture
  – Simplifies scalability
  – Improves design productivity
• Coarse-grained, homogeneous approach with polymorphism
  – ILP: well-partitioned, powerful uniprocessor (GPA)
  – TLP: divide instruction window among different threads
  – DLP: mapping reuse of instructions and constants in grid

Problems

• Scalable?
  – Larger grid: more communication latency
  – ILP, DLP, TLP
  – Multicore? Manycore?
• Compatibility
  – Instruction & block code
• Low-efficiency architecture
  – Instruction buffer
  – I-Cache bank, GDN
  – Operand network
  – LSQ & read/write queues
• Polymorphous
  – How to realize DLP?

Scalable? Larger grid

[Figure: the tile array and its micro-networks.]

  G R R R R I
  D E E E E I
  D E E E E I
  D E E E E I
  D E E E E I

• GDN: global dispatch network
• GSN: global status network
• GCN: global control network
• OPN: operand network

Scalable? Multicore / Manycore

Compatibility


Low-efficiency architecture

[Figure: link utilization for SPEC CPU 2000.]

Polymorphous

[Figure: Imagine.]

Block Fetch

• Fetch commands are sent to each Instruction Cache bank
• The fetch pipeline is from 4 to 11 stages deep
• A new block fetch can be initiated every 8 cycles
• Instructions are fetched into Instruction Queue banks (chosen by the compiler)
• The EDGE ISA allows instructions to be fetched out of order
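A back-of-the-envelope consequence of the 8-cycle fetch interval, together with the 128-instruction block size from the compiler slides, is a peak fetch rate of 16 instructions per cycle (the helper function is illustrative):

```python
BLOCK_SIZE = 128      # instructions per TRIPS block (fixed size constraint)
FETCH_INTERVAL = 8    # a new block fetch can start every 8 cycles

# Peak fetch bandwidth if every block is full: 16 instructions/cycle
peak_fetch = BLOCK_SIZE / FETCH_INTERVAL

def effective_fetch(avg_block_fill):
    """Fetch bandwidth when blocks are only partially full (fill in [0, 1]).
    Underfilled blocks waste fetch slots, which is why block formation
    tries to build full hyperblocks."""
    return peak_fetch * avg_block_fill
```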

Block Execution

• Instructions execute (out of order) when all of their operands arrive
• Intermediate values are sent from instruction to instruction
• Register reads and writes access the register banks
• Loads and stores access the data cache banks
• Branch results go to the global controller
• Up to 8 blocks can execute simultaneously
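The firing rule, an instruction executes once all of its operands have arrived, can be sketched as a toy dataflow simulator (the instruction table format is invented for illustration):

```python
from collections import defaultdict

def dataflow_execute(instrs):
    """Fire instructions when all operands arrive (toy dataflow model).

    `instrs` maps name -> (operands_needed, targets). When an instruction
    fires, it delivers one operand token to each of its targets, which
    may in turn become ready. Returns the firing order.
    """
    arrived = defaultdict(int)
    fired, order = set(), []
    # register reads and constants need no operands: ready immediately
    ready = [n for n, (need, _) in instrs.items() if need == 0]
    while ready:
        n = ready.pop(0)
        fired.add(n)
        order.append(n)
        for t in instrs[n][1]:
            arrived[t] += 1                      # operand token arrives at t
            if arrived[t] == instrs[t][0] and t not in fired:
                ready.append(t)                  # all operands in: fire
    return order

order = dataflow_execute({
    "read": (0, ["add"]), "movi": (0, ["add"]),
    "add": (2, ["store"]), "store": (1, []),
})
```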

Block Commit

• Block completion is detected and reported to the global controller
• If no exceptions occurred, the results may be committed
• Writes are committed to register files
• Stores are committed to cache or memory
• Resources are deallocated after a commit acknowledgement

Block Execution Timeline

[Figure: fetch/execute/commit pipeline, time axis in cycles (0, 5, 10, 30, 40). Blocks Bi through Bi+7 occupy frames 2, 3, 4, 5, 6, 7, 0, 1; execution time is variable; Bi+8 reuses a freed frame.]

• Execute/commit overlapped across multiple blocks
• G-tile manages frames as a circular buffer
  – D-morph: 1 thread, 8 frames
  – T-morph: up to 4 threads, 2 frames each
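The circular-buffer frame management can be modeled in a few lines (class and method names are hypothetical): the oldest block commits first, freeing its frame for the next fetched block, which is how Bi+8 reuses Bi's frame in the timeline.

```python
from collections import deque

class FrameManager:
    """G-tile frame bookkeeping as a circular buffer (toy model)."""
    def __init__(self, num_frames=8):
        self.free = deque(range(num_frames))
        self.inflight = deque()          # (frame, block), oldest first

    def fetch(self, block):
        if not self.free:
            return None                  # all frames busy: fetch stalls
        frame = self.free.popleft()
        self.inflight.append((frame, block))
        return frame

    def commit_oldest(self):
        frame, block = self.inflight.popleft()
        self.free.append(frame)          # frame becomes reusable (circular)
        return block

fm = FrameManager()
for i in range(8):
    fm.fetch(f"B{i}")
assert fm.fetch("B8") is None            # window full: 8 blocks in flight
fm.commit_oldest()                       # B0 commits, its frame is freed
assert fm.fetch("B8") == 0               # B8 reuses frame 0
```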

NUCA L2 Cache

• Prototype has a 1MB L2 cache divided into sixteen 64KB banks
• 4x10 2D mesh topology
• Links are 128 bits wide
• Each processor can initiate 5 requests per cycle
• Requests and replies are wormhole-routed across the network
• 4 virtual channels prevent deadlocks
• Can sustain over 100 bytes per cycle to the processors
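The "non-uniform" in NUCA can be illustrated with a toy latency model over the 4x10 mesh; the per-hop and bank latencies below are assumed, illustrative numbers, not prototype figures:

```python
MESH_W, MESH_H = 10, 4     # 4x10 2D mesh of network nodes
HOP_CYCLES = 1             # assumed per-hop router latency (illustrative)
BANK_CYCLES = 3            # assumed bank access latency (illustrative)

def l2_access_cycles(src, bank):
    """Non-uniform access time: round trip over the mesh plus the bank
    access. `src` and `bank` are (x, y) node coordinates."""
    hops = abs(src[0] - bank[0]) + abs(src[1] - bank[1])
    return 2 * hops * HOP_CYCLES + BANK_CYCLES

# Close banks are fast, far banks are slow: the point of NUCA
near = l2_access_cycles((0, 0), (1, 0))   # 5 cycles under these assumptions
far = l2_access_cycles((0, 0), (9, 3))    # 27 cycles under these assumptions
```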

Compiling for TRIPS

[Figure: compiler phase diagram.]

  Frontend (C, FORTRAN)
    → Inlining / Loop Unrolling / Flattening
    → Scalar Optimizations        (your standard compiler: you've seen this before)
    → Code Generation (Alpha, SPARC, PPC, TRIPS)
    → TRIPS Block Formation
    → Register Allocation / Splitting for Spill Code
    → Peephole / Load-Store ID Assignment / Store Nullification
    → Block Splitting
    → Scheduling and Assembly

Fixed Size Constraint: 128 Instructions

[Figure: a control flow graph B1 through B7, mapped either as 7 TRIPS blocks (one per basic block) or as 1 TRIPS block (one hyperblock).]

• O3: every basic block is a TRIPS block. Simple, but not high performance (7 TRIPS blocks).
• O4: hyperblocks as TRIPS blocks (1 TRIPS block).
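The O4 idea, merging basic blocks into hyperblocks while the result still fits in 128 instructions, can be sketched greedily. This is an illustration of the size constraint, not the real TRIPS block formation heuristics (which also weigh predication and profitability):

```python
BLOCK_LIMIT = 128  # TRIPS fixed block size

def form_hyperblocks(basic_blocks):
    """Greedily merge adjacent basic blocks into hyperblocks.

    `basic_blocks` is a list of instruction counts along a region;
    adjacent blocks are merged (if-converted) while the merged size
    stays within the 128-instruction limit.
    """
    hyperblocks, current = [], 0
    for size in basic_blocks:
        if current and current + size > BLOCK_LIMIT:
            hyperblocks.append(current)   # close this hyperblock, start anew
            current = 0
        current += size
    if current:
        hyperblocks.append(current)
    return hyperblocks

# Seven small basic blocks merge into just two TRIPS blocks
print(form_hyperblocks([20, 15, 30, 25, 18, 22, 10]))  # [108, 32]
```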

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment

23424 USTC CS AN Hong 30

ldshladdswbr

TRIPS Scheduling Problem

addaddldcmpbr

subshlldcmpbr

ldaddaddswbr

swswaddcmpbr

ld

Register File

Data C

aches

Hyperblock

addadd

Flowgraph

bull Place instructions on 4x4x8 gridbull Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]

minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality

23424 USTC CS AN Hong 32

TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]

minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency

Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply

Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations Frames

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 20: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 20

Block Execution Instructions execute (out-oforder)

when all of their operands arrive Intermediate values are sent

from instruction to instruction Register reads and writes

access the register banks Loads and stores access the

data cache banks Branch results go to the

global controller Up to 8 blocks can execute

simultaneously

23424 USTC CS AN Hong 21

Block Commit 1048577 Block completion is detected

and reported to the global controller

1048577 If no exceptions occurred the results may be committed

1048577 Writes are committed to Register files

1048577 Stores are committed to cache or memory

1048577 Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline

COMMITFETCH EXECUTE

5 10 30 400Frame 2 Bi

(variable execution time)

Time (cycles)

Frame 4

Frame 5

Frame 6

Frame 7

Frame 0

Frame 1

Bi+2

Bi+3

Bi+4

Bi+5

Bi+6

Bi+7

Frame 3 Bi+1

Executecommit overlapped across multiple blocks

Bi+8

G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each

23424 USTC CS AN Hong 23

NUCA L2 Cache 1048577 Prototype has 1MB L2

cache divided into sixteen 64KB banks

1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate

5 requests per cycle 1048577 Requests and replies are

wormhole-routed across the network

1048577 4 virtual channels prevent deadlocks

1048577 Can sustain over 100 bytes per cycle to the processors

23424 USTC CS AN Hong 24

Compiling for TRIPS

C

InliningLoop UnrollingFlattening

Scalar Optimizations

Your standard compileryoursquove seen this before

Frontend

FORTRAN

Code Generation

Alpha SPARC PPC TRIPS

TRIPS Block Formation

Register AllocationSplitting for Spill Code

PeepholeLoadStore ID Assignment

Store Nullification

Block Splitting

Scheduling and Assembly

23424 USTC CS AN Hong 25

Fixed Size Constraint 128 Instructions

bull O3 every basic block is a TRIPS block Simple but not high performance

bull O4 hyperblocks as TRIPS blocks

B1

B3B2

B4 B5

B6

B7

7 TRIPS Blocks

1 TRIPS Block

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment

23424 USTC CS AN Hong 30

ldshladdswbr

TRIPS Scheduling Problem

addaddldcmpbr

subshlldcmpbr

ldaddaddswbr

swswaddcmpbr

ld

Register File

Data C

aches

Hyperblock

addadd

Flowgraph

bull Place instructions on 4x4x8 gridbull Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]

minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality

23424 USTC CS AN Hong 32

TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]

minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency

Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply

Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations Frames

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 21: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 21

Block Commit 1048577 Block completion is detected

and reported to the global controller

1048577 If no exceptions occurred the results may be committed

1048577 Writes are committed to Register files

1048577 Stores are committed to cache or memory

1048577 Resources are deallocated after a commit acknowledgement

23424 USTC CS AN Hong 22

Block Execution Timeline

COMMITFETCH EXECUTE

5 10 30 400Frame 2 Bi

(variable execution time)

Time (cycles)

Frame 4

Frame 5

Frame 6

Frame 7

Frame 0

Frame 1

Bi+2

Bi+3

Bi+4

Bi+5

Bi+6

Bi+7

Frame 3 Bi+1

Executecommit overlapped across multiple blocks

Bi+8

G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each

23424 USTC CS AN Hong 23

NUCA L2 Cache 1048577 Prototype has 1MB L2

cache divided into sixteen 64KB banks

1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate

5 requests per cycle 1048577 Requests and replies are

wormhole-routed across the network

1048577 4 virtual channels prevent deadlocks

1048577 Can sustain over 100 bytes per cycle to the processors

23424 USTC CS AN Hong 24

Compiling for TRIPS

C

InliningLoop UnrollingFlattening

Scalar Optimizations

Your standard compileryoursquove seen this before

Frontend

FORTRAN

Code Generation

Alpha SPARC PPC TRIPS

TRIPS Block Formation

Register AllocationSplitting for Spill Code

PeepholeLoadStore ID Assignment

Store Nullification

Block Splitting

Scheduling and Assembly

23424 USTC CS AN Hong 25

Fixed Size Constraint 128 Instructions

bull O3 every basic block is a TRIPS block Simple but not high performance

bull O4 hyperblocks as TRIPS blocks

B1

B3B2

B4 B5

B6

B7

7 TRIPS Blocks

1 TRIPS Block

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment

23424 USTC CS AN Hong 30

ldshladdswbr

TRIPS Scheduling Problem

addaddldcmpbr

subshlldcmpbr

ldaddaddswbr

swswaddcmpbr

ld

Register File

Data C

aches

Hyperblock

addadd

Flowgraph

bull Place instructions on 4x4x8 gridbull Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]

minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality

23424 USTC CS AN Hong 32

TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]

minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency

Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply

Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations Frames

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 22: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 22

Block Execution Timeline

COMMITFETCH EXECUTE

5 10 30 400Frame 2 Bi

(variable execution time)

Time (cycles)

Frame 4

Frame 5

Frame 6

Frame 7

Frame 0

Frame 1

Bi+2

Bi+3

Bi+4

Bi+5

Bi+6

Bi+7

Frame 3 Bi+1

Executecommit overlapped across multiple blocks

Bi+8

G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each

23424 USTC CS AN Hong 23

NUCA L2 Cache 1048577 Prototype has 1MB L2

cache divided into sixteen 64KB banks

1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate

5 requests per cycle 1048577 Requests and replies are

wormhole-routed across the network

1048577 4 virtual channels prevent deadlocks

1048577 Can sustain over 100 bytes per cycle to the processors

23424 USTC CS AN Hong 24

Compiling for TRIPS

C

InliningLoop UnrollingFlattening

Scalar Optimizations

Your standard compileryoursquove seen this before

Frontend

FORTRAN

Code Generation

Alpha SPARC PPC TRIPS

TRIPS Block Formation

Register AllocationSplitting for Spill Code

PeepholeLoadStore ID Assignment

Store Nullification

Block Splitting

Scheduling and Assembly

23424 USTC CS AN Hong 25

Fixed Size Constraint 128 Instructions

bull O3 every basic block is a TRIPS block Simple but not high performance

bull O4 hyperblocks as TRIPS blocks

B1

B3B2

B4 B5

B6

B7

7 TRIPS Blocks

1 TRIPS Block

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment


TRIPS Scheduling Problem

[Figure: a hyperblock's flowgraph of instruction groups (ld/shl/add/sw/br, add/add/ld/cmp/br, sub/shl/ld/cmp/br, ld/add/add/sw/br, sw/sw/add/cmp/br) being mapped onto the execution grid between the register file and the data caches]

• Place instructions on 4x4x8 grid
• Encode placement in target form


Scheduling Algorithms: heuristic-based list scheduler [PACT 2005]

− Greedy, top-down
− Prioritizes critical path
− Reprioritizes after each placement
− Balances functional unit utilization
− Accounts for data cache locality
− Accounts for register bank locality
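A toy version of such a list scheduler: pick the instruction with the longest remaining critical path, place it on the least-loaded tile. The DAG and tile model are invented; the real TRIPS scheduler also weighs data-cache and register-bank locality and the 8 instruction slots per tile.

```python
# Toy greedy list scheduler echoing the heuristics above.
def critical_path(dag):
    """dag: name -> successor names; returns name -> path length to exit."""
    depth = {}
    def cp(n):
        if n not in depth:
            depth[n] = 1 + max((cp(s) for s in dag[n]), default=0)
        return depth[n]
    for n in dag:
        cp(n)
    return depth

def schedule(dag):
    depth = critical_path(dag)
    load = {(r, c): 0 for r in range(4) for c in range(4)}
    placement = {}
    for n in sorted(dag, key=lambda n: -depth[n]):  # critical path first
        tile = min(load, key=load.get)              # balance utilization
        placement[n] = tile
        load[tile] += 1
    return placement

dag = {"ld": ["add1"], "add1": ["cmp"], "add2": ["cmp"], "cmp": ["br"], "br": []}
print(schedule(dag))  # 'ld' (longest path) is placed first, on tile (0, 0)
```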


TRIPS Polymorphous: Different Levels of Parallelism

Instruction-level parallelism [Nagarajan et al., Micro'01]
− Populate large instruction window with useful instructions
− Schedule instructions to optimize communication and concurrency

Thread-level parallelism
− Partition instruction window among different threads
− Reduce contention for instruction and data supply

Data-level parallelism
− Provide high density of computational elements
− Provide high bandwidth to/from data memory


TRIPS Configurable Resources

Aggregating Reservation Stations: Frames

Extracting ILP: Frames for Speculation

Configuring Frames for TLP

Using Frames for DLP

Configuring Data Memory for DLP

• Regular data accesses
• Subset of L2 cache banks configured as SRF
• High-bandwidth data channels to SRF
• Reduced address communication
• Constants saved in reservation stations


Performance Results

ILP: instruction window occupancy
− Peak: 4x4x128 array → 2048 instructions
− Sustained: 493 for SPECint, 1412 for SPECfp
− Bottleneck: branch prediction

TLP: instruction and data supply
− Peak: 100% efficiency
− Sustained: 87% for two threads, 61% for four threads

DLP: data supply bandwidth
− Peak: 16 ops/cycle
− Sustained: 6.9 ops/cycle


ASIC Implementation
• 130 nm 7LM IBM ASIC process
• 335 mm² die, 47.5 mm × 47.5 mm package
• ~170 million transistors
• ~600 signal I/Os
• ~500 MHz clock frequency
• Tape-out: fall 2005
• System bring-up: spring 2006


Functional Area Breakdown


TRIPS Summary

Distributed microarchitecture
− Acknowledges and tolerates wire delay
− Scalable protocols tailored for distributed components

Tiled microarchitecture
− Simplifies scalability
− Improves design productivity

Coarse-grained homogeneous approach with polymorphism
− ILP: well-partitioned, powerful uniprocessor (GPA)
− TLP: divide instruction window among different threads
− DLP: mapping, reuse of instructions and constants in grid


Problems

Scalable
− Larger grid → more communication latency
− ILP, DLP, TLP
− Multicore? Manycore?

Compatibility
− Instruction & block code

Low-efficiency architecture
− Instruction buffer
− I-cache bank, GDN
− Operand network
− LSQ & read/write queue

Polymorphous
− How to realize DLP?


Scalable: Larger grid

[Figure: the TRIPS tile array — a G (global control) tile, R (register) and I (instruction-cache) tiles, D (data-cache) tiles, and a 4x4 grid of E (execution) tiles — connected by the GDN (global dispatch network), GSN (global status network), GCN (global control network), and OPN (operand network)]


Scalable

Multicore / Manycore


Compatibility


Low-efficiency architecture

Link utilization for SPEC CPU 2000


Polymorphous

Imagine

Fixed Size Constraint 128 Instructions

bull O3 every basic block is a TRIPS block Simple but not high performance

bull O4 hyperblocks as TRIPS blocks

B1

B3B2

B4 B5

B6

B7

7 TRIPS Blocks

1 TRIPS Block

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment

23424 USTC CS AN Hong 30

ldshladdswbr

TRIPS Scheduling Problem

addaddldcmpbr

subshlldcmpbr

ldaddaddswbr

swswaddcmpbr

ld

Register File

Data C

aches

Hyperblock

addadd

Flowgraph

bull Place instructions on 4x4x8 gridbull Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]

minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality

23424 USTC CS AN Hong 32

TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]

minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency

Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply

Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations Frames

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 26: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 26

Size Analysis How big is this block 3 Instructions 5 Instructions More

read sp g1movi t3 1

store 384(sp) t3

store 8(sp) t3

Max immediate is 256

Immediate instructions have one targetread sp g1movi t3 1mov t4 t3

addi t7 sp 256store 128(t7) t4

store 8(sp) t4

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

23424 USTC CS AN Hong 28

bull 128 registers (32 x 4 banks)

bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions

Register Constraints Linear Scan Allocator

SPEC2000 Alpha TRIPS (O4)

applu 247 331

apsi 1326 196

gcc 4490 6622

mesa 2614 3821

mgrid 366 77

sixtrack 494 220

Total spills (18 Alpha vs 6 TRIPS)

Average spill 1 store for 2-3 loads

23424 USTC CS AN Hong 29

Block Termination Constraint

Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion

All writes completeminus Write nullification

All stores completeminus Store nullificationminus LSID assignment

23424 USTC CS AN Hong 30

ldshladdswbr

TRIPS Scheduling Problem

addaddldcmpbr

subshlldcmpbr

ldaddaddswbr

swswaddcmpbr

ld

Register File

Data C

aches

Hyperblock

addadd

Flowgraph

bull Place instructions on 4x4x8 gridbull Encode placement in target form

23424 USTC CS AN Hong 31

Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]

minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality

23424 USTC CS AN Hong 32

TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]

minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency

Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply

Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory

23424 USTC CS AN Hong 33

TRIPS Configurable Resources

23424 USTC CS AN Hong 34

Aggregating Reservation Stations Frames

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 27: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 27

Too Big Block Splitting What if the block is too large

Predicated blocksReverse if-convert

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5

L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2

Unpredicated (basic blocks) Insert a branch and label

Page 28: Lecture on High Performance Processor Architecture (CS05162)

Register Constraints: Linear Scan Allocator

• 128 registers (32 × 4 banks)
• Compute liveness over hyperblocks
• Ignore local variables
• Hyperblocks as large instructions

SPEC2000    Alpha   TRIPS (O4)
applu         247      331
apsi         1326      196
gcc          4490     6622
mesa         2614     3821
mgrid         366       77
sixtrack      494      220

Total spills (18 Alpha vs. 6 TRIPS)
Average spill: 1 store for 2-3 loads
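The allocator named above can be sketched generically. This is the textbook Poletto/Sarkar-style linear scan over live intervals, not the actual TRIPS allocator (which additionally models the four register banks): intervals are visited in start order, expired intervals free their registers, and when no register is free the interval that ends last is spilled.

```python
def linear_scan(intervals, num_regs):
    """Textbook linear scan register allocation (illustrative sketch).

    intervals: list of (name, start, end), assumed sorted by start point.
    Returns (assignment dict name->register, set of spilled names).
    """
    free = list(range(num_regs))
    active = []                      # (end, name), kept sorted by end point
    assign, spilled = {}, set()
    for name, start, end in intervals:
        # Expire intervals that ended before this one starts.
        while active and active[0][0] <= start:
            _, old = active.pop(0)
            free.append(assign[old])
        if free:
            assign[name] = free.pop(0)
            active.append((end, name))
            active.sort()
        else:
            # Spill whichever live interval ends last.
            last_end, last_name = active[-1]
            if last_end > end:
                assign[name] = assign.pop(last_name)
                spilled.add(last_name)
                active[-1] = (end, name)
                active.sort()
            else:
                spilled.add(name)
    return assign, spilled

assignment, spilled = linear_scan(
    [("a", 0, 4), ("b", 1, 6), ("c", 2, 3)], num_regs=2)
```

With two registers, `b` (the longest-lived interval) is spilled so that the shorter interval `c` can reuse its register.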

Page 29: Lecture on High Performance Processor Architecture (CS05162)

Block Termination Constraint

Block termination: constant output per block
  − A constant number of stores and writes execute
  − One branch
  − Simplifies hardware logic for detecting block completion
All writes complete
  − Write nullification
All stores complete
  − Store nullification
  − LSID assignment
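The completion logic this constraint enables can be sketched in a few lines. This is an illustrative model, not the TRIPS RTL: because every block declares a fixed number of register writes and stores plus exactly one branch, the hardware only needs to count outputs down to zero, with nullified writes/stores counting the same as real ones.

```python
class BlockTracker:
    """Sketch of block-completion detection under the constant-output rule.

    A block declares a fixed number of register writes and stores plus one
    branch; each real or nullified output decrements a counter, so the
    block is complete exactly when all counters reach zero.
    """
    def __init__(self, num_writes, num_stores):
        self.writes, self.stores, self.branches = num_writes, num_stores, 1

    def write(self, nullified=False):
        self.writes -= 1        # a nullified write still counts as an output

    def store(self, lsid, nullified=False):
        self.stores -= 1        # LSIDs give each store a fixed identity

    def branch(self):
        self.branches -= 1

    def complete(self):
        return self.writes == 0 and self.stores == 0 and self.branches == 0

t = BlockTracker(num_writes=1, num_stores=2)
t.write()
t.store(0)
t.store(1, nullified=True)      # nullification: store never happens, still counted
t.branch()
```

The point of the constraint is visible here: without a constant output count, `complete()` would need global knowledge of which predicated paths executed.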

Page 30: Lecture on High Performance Processor Architecture (CS05162)

TRIPS Scheduling Problem

[Figure: a hyperblock flowgraph of instruction groups (ld/shl/add/sw/br, add/add/ld/cmp/br, sub/shl/ld/cmp/br, ld/add/add/sw/br, sw/sw/add/cmp/br) to be mapped onto the execution grid, register file, and data caches]

• Place instructions on the 4x4x8 grid
• Encode placement in target form

Page 31: Lecture on High Performance Processor Architecture (CS05162)

Scheduling Algorithms

Heuristic-based list scheduler [PACT 2005]
  − Greedy, top-down
  − Prioritizes the critical path
  − Reprioritizes after each placement
  − Balances functional unit utilization
  − Accounts for data cache locality
  − Accounts for register bank locality
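The core of such a scheduler can be sketched generically. This is a minimal greedy critical-path list scheduler, not the TRIPS scheduler itself (it omits the cache- and register-bank-locality terms and the per-placement reprioritization): priority is the longest latency path to any leaf, and up to `num_units` ready instructions issue per cycle.

```python
import heapq

def list_schedule(deps, latency, num_units):
    """Greedy critical-path list scheduling (generic sketch).

    deps: {instr: set of predecessor instrs} (a DAG);
    latency: {instr: cycles}; num_units: issues allowed per cycle.
    Returns {instr: issue cycle}.
    """
    succs = {i: [] for i in deps}
    for i, preds in deps.items():
        for p in preds:
            succs[p].append(i)

    # Priority: longest latency path from the instruction to any leaf.
    prio = {}
    def height(i):
        if i not in prio:
            prio[i] = latency[i] + max((height(s) for s in succs[i]), default=0)
        return prio[i]
    for i in deps:
        height(i)

    schedule, finish, queued, ready = {}, {}, set(), []
    for i in deps:
        if not deps[i]:
            heapq.heappush(ready, (-prio[i], i))
            queued.add(i)

    cycle = 0
    while len(schedule) < len(deps):
        issued, deferred = 0, []
        while ready and issued < num_units:
            _, i = heapq.heappop(ready)
            if all(finish[p] <= cycle for p in deps[i]):   # operands arrived?
                schedule[i] = cycle
                finish[i] = cycle + latency[i]
                issued += 1
            else:
                deferred.append((-prio[i], i))
        for item in deferred:
            heapq.heappush(ready, item)
        cycle += 1
        # Enqueue instructions whose predecessors have all been scheduled.
        for i in deps:
            if i not in queued and all(p in finish for p in deps[i]):
                heapq.heappush(ready, (-prio[i], i))
                queued.add(i)
    return schedule

s = list_schedule({"a": set(), "b": {"a"}, "c": set()},
                  {"a": 1, "b": 1, "c": 1}, num_units=1)
```

With one issue slot per cycle, `a` goes first (it sits on the longest path), its dependent `b` follows, and the independent `c` fills the remaining slot.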

Page 32: Lecture on High Performance Processor Architecture (CS05162)

TRIPS Polymorphous: Different Levels of Parallelism

Instruction-level parallelism [Nagarajan et al., MICRO'01]
  − Populate a large instruction window with useful instructions
  − Schedule instructions to optimize communication and concurrency
Thread-level parallelism
  − Partition the instruction window among different threads
  − Reduce contention for instruction and data supply
Data-level parallelism
  − Provide high density of computational elements
  − Provide high bandwidth to/from data memory
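The ILP and TLP modes above differ mainly in who owns the window: one thread speculating across all frames, or several threads each holding a fixed share. A toy sketch, using an illustrative frame count (128) and a made-up `partition_frames` helper, not any real TRIPS configuration interface:

```python
def partition_frames(mode, total_frames=128, num_threads=1):
    """Toy model of polymorphous frame allocation (illustrative only).

    ILP mode gives a single thread the whole frame space as one large
    speculative instruction window; TLP mode splits the same frames
    evenly among threads instead.
    """
    if mode == "ILP":
        return {"thread0": total_frames}
    if mode == "TLP":
        share = total_frames // num_threads
        return {f"thread{t}": share for t in range(num_threads)}
    raise ValueError(f"unknown mode: {mode}")
```

The same physical frames back both configurations; only the ownership map changes, which is what makes the polymorphism cheap.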

Page 33: Lecture on High Performance Processor Architecture (CS05162)

TRIPS Configurable Resources

Page 34: Lecture on High Performance Processor Architecture (CS05162)

Aggregating Reservation Stations: Frames

Extracting ILP: Frames for Speculation

Configuring Frames for TLP

Using Frames for DLP

Configuring Data Memory for DLP
- Regular data accesses
- Subset of L2 cache banks configured as an SRF (stream register file)
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants saved in reservation stations
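The DLP memory-morphing idea above, reserving a subset of L2 banks as a software-managed stream register file, can be sketched as a mode switch per bank. All names here (`Bank`, `morph_dlp`) are hypothetical illustrations, not the TRIPS configuration interface:

```python
from dataclasses import dataclass

@dataclass
class Bank:
    ident: int
    mode: str = "cache"   # "cache" = tag-checked L2 bank, "srf" = software-managed

def morph_dlp(banks, srf_count):
    """Configure the first srf_count banks as a stream register file;
    the rest keep behaving as ordinary L2 cache banks."""
    for b in banks:
        b.mode = "srf" if b.ident < srf_count else "cache"
    return [b for b in banks if b.mode == "srf"]

banks = [Bank(i) for i in range(8)]
srf = morph_dlp(banks, 4)
print(len(srf), [b.mode for b in banks])
```

The point of the sketch is that polymorphism here is a resource-partitioning decision, not new hardware: regular streaming accesses hit the reserved banks directly, skipping tag checks and most address traffic.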

Performance Results

ILP (instruction window occupancy)
- Peak: 4x4x128 array = 2048 instructions
- Sustained: 493 for SPECint, 1412 for SPECfp
- Bottleneck: branch prediction

TLP (instruction and data supply)
- Peak: 100% efficiency
- Sustained: 87% for two threads, 61% for four threads

DLP (data supply bandwidth)
- Peak: 16 ops/cycle
- Sustained: 6.9 ops/cycle
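The ILP figures above follow from simple arithmetic, which can be checked directly. Only the 4x4 execution array and the 128 frame slots come from the slides; the percentage framing is an added illustration.

```python
# Reproduce the window-occupancy arithmetic: a 4x4 tile array with 128
# frame slots per tile gives a 2048-entry instruction window, of which
# the sustained occupancies quoted above are a fraction.
ROWS, COLS, FRAMES = 4, 4, 128
peak_window = ROWS * COLS * FRAMES   # 2048 in-flight instruction slots

for name, sustained in [("SPECint", 493), ("SPECfp", 1412)]:
    print(f"{name}: {sustained}/{peak_window} = {sustained / peak_window:.1%}")
```

So the sustained window occupancy is roughly 24% for SPECint and 69% for SPECfp, which is why the slides single out branch prediction as the bottleneck for integer codes.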

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 35: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 35

Extracting ILP Frames for Speculation

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 36: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 36

Configuring Frames for TLP

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 37: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 37

Using Frames for DLP

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 38: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 38

Configuring Data Memory for DLP

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 39: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 39

Configuring Data Memory for DLP

Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 40: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 40

Performance results ILP instruction window occupancy

minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction

TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads

DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 41: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 41

ASIC Implementation 1048577 130 nm 7LM IBM ASIC

process 1048577 335 mm2 die 1048577 475 mm x 475 mm

package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 42: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 42

Functional Area Breakdown

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 43: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 43

TRIPS Summary Distributed microarchitecture

minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components

Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity

Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

23424 USTC CS AN Hong 45

Scalable

Larger grid

GSN global status network

GDN global dispatch network

OPN operand network

GDN global dispatch network GDN global dispatch network

GSN global status network

GCN global control network

OPN operand network

G R R RRI

D E E EEI

D E E EEI

D E E EEI

D E E EEI

GCN global control network

GDN global dispatch network

GSN global status network

OPN operand network OPN operand network

GDN global dispatch network

23424 USTC CS AN Hong 46

Scalable

Multicore Manycore

23424 USTC CS AN Hong 47

Compatibility

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000

23424 USTC CS AN Hong 49

Polymorphous

Imagine

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
Page 44: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 44

Problems Scalable

minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore

Compatibilityminus Instruction amp block code

Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue

Polymorphousminus How to realize DLP

Page 45: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 45

Scalable

Larger grid

[Figure: TRIPS tile grid — a G (global control) tile and a row of R (register) tiles on top, with rows of E (execution) tiles flanked by D (data cache) tiles and I (instruction cache) tiles below, connected by four mesh networks:
− GDN: global dispatch network
− GSN: global status network
− GCN: global control network
− OPN: operand network]

Page 46: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 46

Scalable

Multicore / Manycore

Page 47: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 47

Compatibility

Page 48: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 48

Low-efficiency architecture

Link utilization for SPEC CPU 2000
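"Link utilization" can be made concrete: it is the fraction of available link-cycles that actually carry traffic. A minimal sketch with hypothetical per-link flit counts follows; the actual SPEC CPU 2000 measurements from the slide's chart are not reproduced here.

```python
# Minimal sketch: network link utilization = flits carried / link-cycles
# available. The traffic counts below are hypothetical placeholders,
# not the SPEC CPU 2000 measurements referenced on the slide.
def link_utilization(flits_per_link, cycles):
    """Average fraction of cycles each link carries a flit."""
    return sum(flits_per_link) / (len(flits_per_link) * cycles)

traffic = [1200, 300, 4500, 80]   # flits seen on four links (hypothetical)
print(f"{link_utilization(traffic, cycles=10_000):.2%}")   # → 15.20%
```

A low average with a highly skewed per-link distribution (one hot link, several nearly idle ones) is the kind of imbalance that makes a statically routed operand network "low-efficiency."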

Page 49: Lecture on  High Performance Processor Architecture ( CS05162 )

23424 USTC CS AN Hong 49

Polymorphous

Imagine
