
Access Region Locality for High-Bandwidth Processor Memory System Design

Sangyeun Cho Samsung/U of Minnesota

Pen-Chung Yew U of Minnesota

Gyungho Lee Iowa State U

32nd Annual International Symposium on Microarchitecture

MICRO-32, November 17, 1999

Cho, Yew, and Lee 2

Big Picture

On-Chip D-Cache Bandwidth Problem


Wide-Issue Superscalar Processors

[Figure: superscalar pipeline with Fetch, Decode, Dispatch, Complete, and Retire stages; blocks labeled Instruction/Decode Buffer, Dispatch Buffer, Reservation Stations, Load/Store Units, Store Buffer, Reorder/Completion Buffer, and data cache ($$)]

Current Generation
– Alpha 21264
– Intel’s Merced

Future Generation (IEEE Computer, Sept. ‘97)
– Superspeculative Processors
– Trace Processors


Multi-Ported Data Cache

[Figure: three cache organizations fed from Fetch. First: caches "$$ X" and "$$ Y" with Store, Load, Load paths. Second: a single cache "$$" with "1 Load/Store" and "2 Load/Store" paths. Third: caches "$$ Even" and "$$ Odd" with "Even" Load/Store and "Odd" Load/Store paths.]

Replicated Cache
– Alpha 21164

Time-Division Multiplexed Cache
– Alpha 21264

Interleaved Cache
– MIPS R10K
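To make the interleaved organization concrete, here is a minimal sketch of how a two-bank (even/odd) cache might pick a bank and detect a same-cycle conflict. The 32-byte line size, the use of the lowest line-address bit as the bank selector, and the function names are illustrative assumptions, not details of the MIPS R10K.

```python
LINE_BYTES = 32  # assumed cache line size (illustrative only)


def bank_of(addr: int) -> int:
    """Return 0 (even bank) or 1 (odd bank) from the low bit of the line address."""
    return (addr // LINE_BYTES) & 1


def conflicts(addrs: list[int]) -> bool:
    """True if two accesses issued in the same cycle map to the same bank."""
    banks = [bank_of(a) for a in addrs]
    return len(banks) != len(set(banks))
```

In this sketch, two accesses to the same bank always count as a conflict, even when they touch the same line; a real design could combine such accesses.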


Window Logic Complexity

Pointed out as the major source of hardware complexity (Palacharla et al., ISCA ‘97)

More severe for the memory window
– Difficult to partition
– Thick network needed to connect RSs and LSUs

[Figure: Dispatch feeding Reservation Stations, which connect through a single Network to four LSUs and the data cache ($$)]

Data Decoupling


Data Decoupling: What is it?

A divide-and-conquer approach
– Instruction stream partitioned before entering RS
– Narrower networks
– Fewer ports to each cache
– Needs a mechanism for proper partitioning

[Figure: Dispatch feeding two groups of Reservation Stations; Network "X" connects two LSUs to cache "X", and Network "Y" connects two LSUs to cache "Y"]


Data Decoupling: Operating Issues

Memory Stream Partitioning
– Hardware classification
– Compiler classification

Load Balancing
– Enough instructions in different groups?
– Are they well interleaved?

[Figure: Dispatch stage with a "?" marking the choice of which Reservation Stations receive each instruction]

Access Region Locality & Access Region Prediction


Access Region: Overview

Access Region R
– R = (L, U)
  L: Lower Bound on Addr.
  U: Upper Bound on Addr.

With R = (A, B) and Q = (C, D), if (D < A) or (B < C),
– Regions R and Q are said to be exclusive, or non-overlapping.

Locations in exclusive regions are independent.
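The exclusivity condition can be written as a small predicate. This is a sketch that reads the condition as applying to R = (A, B) and Q = (C, D) with inclusive bounds; that reading, and the names `Region` and `exclusive`, are assumptions made for illustration.

```python
from typing import NamedTuple


class Region(NamedTuple):
    lower: int  # L: lower bound on address
    upper: int  # U: upper bound on address (treated as inclusive here)


def exclusive(r: Region, q: Region) -> bool:
    """(D < A) or (B < C): one region ends before the other begins."""
    return q.upper < r.lower or r.upper < q.lower
```

For example, `exclusive(Region(0x1000, 0x1FFF), Region(0x2000, 0x2FFF))` holds, so a load to one region and a store to the other cannot touch the same location.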


Access Region and Mem. Instructions


Partitioning Memory Space

One way of partitioning memory space into regions:
– Data Region / Heap Region / Stack Region

This work assumes this partitioning.
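As an illustration of the Data / Heap / Stack split, the sketch below classifies an address by comparing it against two boundaries. The boundary values and the names `HEAP_START`, `STACK_LIMIT`, and `region_of` are hypothetical; a real layout depends on the OS, ABI, and program.

```python
from enum import Enum


class AccessRegion(Enum):
    DATA = "data"
    HEAP = "heap"
    STACK = "stack"


# Hypothetical boundaries for one process image (not taken from the talk).
HEAP_START = 0x1000_8000   # first address past the static data
STACK_LIMIT = 0x7000_0000  # lowest address the stack may grow down to


def region_of(addr: int) -> AccessRegion:
    """Map an address to one of the three regions by range comparison."""
    if addr >= STACK_LIMIT:
        return AccessRegion.STACK
    if addr >= HEAP_START:
        return AccessRegion.HEAP
    return AccessRegion.DATA
```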


Partitioning Memory Space, Cont’d

Many accesses are toward Data and Stack regions. Some programs don’t access the Heap region at all.

[Figure: bar chart; y-axis in % (0 to 40); x-axis: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg., FP.Avg.; series: Data, Heap, Stack]


Partitioning Memory Space, Cont’d

Accesses to the Data region are less bursty than others. Programs such as ijpeg have clustered region accesses.

[Figure: bar chart for Window Size = 32; y-axis: Std. Dev. / Avg. (0.00 to 3.00); x-axis: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid; series: Data, Heap, Stack]
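One way to compute a "Std. Dev. / Avg." burstiness figure is sketched below. It assumes the metric is the population standard deviation of the per-window access count to a region divided by its mean, taken over non-overlapping windows of a region-labeled trace. The exact definition used in the talk is not stated, so this is an illustration only, and the name `burstiness` is hypothetical.

```python
from statistics import mean, pstdev


def burstiness(trace: list[str], region: str, window: int = 32) -> float:
    """Std. dev. / avg. of per-window access counts to `region`."""
    counts = [
        trace[i:i + window].count(region)
        for i in range(0, len(trace) - window + 1, window)
    ]
    if not counts:
        return 0.0  # trace shorter than one window
    avg = mean(counts)
    # A region that is never accessed is reported as 0.0 by convention here.
    return pstdev(counts) / avg if avg else 0.0
```

A low value means demand for the region is steady from window to window; a high value means accesses arrive in clusters.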


Partitioning Memory Space, Cont’d

With a large window, Stack accesses become less bursty. Data and Stack regions have quite stable, constant demand.

[Figure: bar chart for Window Size = 64; y-axis: Std. Dev. / Avg. (0.00 to 3.00); x-axis: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid; series: Data, Heap, Stack]


Partitioning Memory Space, Cont’d

[Figure: stacked bar chart; y-axis 0 to 1; x-axis: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg; series: D, H, S, D/H, D/S, H/S, D/H/S; callout labels: 1.9%, 1.8%, 51.1%, 50.4%, 1.6%, 16.2%, 45.4%, 31.6%]

Most instructions (~98%) access a single region. Multi-region-accessing instructions account for 0 to 9.6% of dynamic memory references.


Access Region Locality

“A memory reference instruction typically accesses a single region at run time”
– Only about 2% of all static memory instructions access more than a single region.

“(Thus) the region it accesses is highly predictable”
– Simple predictors with a small look-up table achieve high prediction accuracy.


Predicting Regions: Unlimited Case

One predictor per memory instruction

Predictor types:
– 1-bit history saver (0: Data, 1: Stack)
– 2-bit saturating counter
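The two predictor types can be sketched as follows. The class names, the initial state of 0, the threshold of 2 for the counter, and the `accuracy` helper are assumptions for illustration; in the unlimited case one such object would be kept per static memory instruction (for example in a dictionary keyed by PC).

```python
class OneBitPredictor:
    """Remembers the last region seen: 0 = Data, 1 = Stack."""

    def __init__(self) -> None:
        self.state = 0

    def predict(self) -> int:
        return self.state

    def update(self, actual: int) -> None:
        self.state = actual


class TwoBitPredictor:
    """Saturating counter in 0..3; values 2 and 3 predict Stack."""

    def __init__(self) -> None:
        self.counter = 0

    def predict(self) -> int:
        return 1 if self.counter >= 2 else 0

    def update(self, actual: int) -> None:
        if actual == 1:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def accuracy(predictor, outcomes: list[int]) -> float:
    """Fraction of outcomes predicted correctly, updating after each one."""
    correct = 0
    for actual in outcomes:
        correct += predictor.predict() == actual
        predictor.update(actual)
    return correct / len(outcomes)
```

On a stream of four stack accesses, the 1-bit predictor in this sketch misses only the first, while the 2-bit counter misses the first two because it needs two updates to cross its threshold.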


Predicting Regions: Adding Context

Run-time context
– Caller’s ID (CID): in Link Register
– Global Branch History (GBH)
– Hybrid of above


Predicting Regions: Utilizing Static Info.

Some instructions’ access regions are revealed through architecture and compiler conventions:
– Use of the Stack Pointer ($SP) or Frame Pointer ($FP) suggests that the region is Stack.
– Use of the Global Pointer ($GP) suggests that the region is non-Stack.
– For others, assume non-Stack.

Directly exporting some high-level region information from compiler to processor may improve prediction accuracy.
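The register-based rules above can be expressed as a small lookup. This is a sketch: the lowercase register spellings and the names `static_region_hint` and `static_classify` are assumptions made for the example.

```python
from typing import Optional

STACK_BASES = {"$sp", "$fp"}   # stack pointer, frame pointer
NON_STACK_BASES = {"$gp"}      # global pointer


def static_region_hint(base_reg: str) -> Optional[str]:
    """Return the region revealed by the base register, or None if unknown."""
    if base_reg in STACK_BASES:
        return "stack"
    if base_reg in NON_STACK_BASES:
        return "non-stack"
    return None


def static_classify(base_reg: str) -> str:
    """Static-only policy: instructions with no hint default to non-stack."""
    return static_region_hint(base_reg) or "non-stack"
```

Keeping the hint separate from the default lets a dynamic predictor take over exactly when `static_region_hint` returns `None`.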


Region Pred. Result: Unlimited Case

[Figure: stacked bar chart; y-axis: Correctly Classified Instr. (0% to 100%), split into "Known from Instr." and "Predicted"; x-axis: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg; bars: Static, Simple 1-bit, w/ GBH, w/ CID, w/ Hybrid]

1-bit predictors do better than 2-bit predictors (not shown). Hybrid context bits achieve the best prediction rate on average.


Predicting Regions: Limited-Size ARPT

The low n bits of the PC, XORed with hybrid context bits, are used to index into the Access Region Prediction Table (ARPT):
– Table entries initialized to 0’s
– 1 denotes a stack access
– Decoding information exploited to save ARPT space
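A limited-size ARPT along these lines might look like the sketch below. It assumes 4-byte instructions (so the two alignment bits of the PC are dropped before indexing) and takes the context bits as an already-formed integer; how the hybrid context is built from CID and GBH is not specified here, and the class and method names are hypothetical.

```python
class ARPT:
    """Direct-indexed table of 1-bit entries: 0 = non-stack, 1 = stack."""

    def __init__(self, index_bits: int) -> None:
        self.mask = (1 << index_bits) - 1
        # One byte per 1-bit entry for simplicity; all entries start at 0.
        self.table = bytearray(1 << index_bits)

    def _index(self, pc: int, context: int) -> int:
        # Assumption: drop the 2 alignment bits of a 4-byte instruction,
        # then XOR the low PC bits with the context bits.
        return ((pc >> 2) ^ context) & self.mask

    def predict(self, pc: int, context: int = 0) -> int:
        return self.table[self._index(pc, context)]

    def update(self, pc: int, actual: int, context: int = 0) -> None:
        self.table[self._index(pc, context)] = actual
```

A 4 KB table of 1-bit entries holds 4 × 1024 × 8 = 32,768 entries, which corresponds to `ARPT(15)`. Instructions whose region is already known from decoding would bypass the table entirely, which is how decoding information saves table space.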


Region Prediction Result: ARPT

[Figure: bar chart; y-axis: Prediction Rate (98% to 100%); x-axis: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg; groups: w/ Compiler Hints, w/o Compiler Hints; series: Unlimited, 8 KB, 4 KB, 2 KB, 1 KB]

Over 99.9% accuracy with a 4 KB or larger ARPT without compiler hints. Compiler hints relieve pressure due to smaller sizes.

Dynamic Data Decoupling


Dynamic Data Decoupling, Cont’d

Dynamically predicting access regions to classify memory instructions:
– Utilize the Access Region Prediction Table (ARPT).
– Utilize any region information revealed through instruction decoding.

Dispatching partitioned memory instructions into separate memory pipelines, connected to separate caches.

Dynamically verifying region prediction:
– Let the TLB (i.e., page table) contain verification information such that a memory access is reissued on mispredictions.
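The verification step can be sketched as a check against a per-page region bit read during address translation. The dictionary standing in for the TLB, the 4 KB page size, the single "is stack" bit per page, and the name `run_access` are all assumptions for illustration; the slides do not describe the exact encoding.

```python
def run_access(predicted_stack: bool, vaddr: int,
               page_is_stack: dict[int, bool],
               page_bits: int = 12) -> tuple[str, bool]:
    """Return (pipeline that finally serves the access, whether it was reissued)."""
    # The region bit is looked up alongside the translation for this page.
    actual_stack = page_is_stack[vaddr >> page_bits]
    pipeline = "stack" if actual_stack else "non-stack"
    # A mismatch means the access was steered to the wrong pipeline and
    # must be reissued to the correct one.
    reissued = predicted_stack != actual_stack
    return pipeline, reissued
```

With `pages = {0x7FFFF: True, 0x10008: False}`, the call `run_access(True, 0x10008020, pages)` reports a reissue to the non-stack pipeline, while `run_access(True, 0x7FFFF010, pages)` completes without one.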


Base Machine Model

Issue Width        16
Registers          32 GPRs / 32 FPRs
ROB / LSQ Size     256 / 128
Functional Units   Integer: 16 ALUs, 4 MULT/DIV Units
                   FP: 16 ALUs, 4 MULT/DIV Units
Value Pred.        16K-Entry Stride-Based Predictor
L1 D-Cache         64 KB, 2-Way Set-Associative, 2-Cycle Access
L2 D-Cache         512 KB, 4-Way Set-Associative, 12-Cycle Access
Memory             50-Cycle Access, Fully Interleaved
LV-Cache           4 KB, Direct-Mapped, 1-Cycle Access
ARPT               32K 1-Bit Entries
I-Cache            Perfect (100% Hit) Cache, 1-Cycle Access
Branch Prediction  Perfect (100% Correct) Prediction
Instruction Lat.   Same as MIPS R10000


Overall Performance

[Figure: bar chart of performance over the (2+0) configuration; y-axis 1.00 to 2.00; x-axis: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg; series: (3+0), 2 cycle; (3+0), 3 cycle; (4+0), 3 cycle; (2+2); (2+3); (3+3); (16+0), 2 cycle]


Conclusions

Access Region Locality says
– Memory instructions access few regions at run time.
– Accessed regions are accurately predictable.

Access Region Locality leads to Access Region Prediction techniques.

Access Region Prediction allows Dynamic Data Decoupling, shown to achieve performance comparable to very wide data caches.

Now Any Questions?


Impact of LVC Size

2 KB and 4 KB LVCs achieve high hit rates (~99.9%).

Set associativity is less important if the LVC is 2 KB or more.

A small, simple LVC works well.

[Figure: chart of Miss Rate (%) (y-axis 0 to 9) versus LVC size (0.5K, 1K, 2K, 4K); series: 126.gcc, 129.compress, Avg.; data labels in extraction order: 8.42, 3.98, 1.12, 2.30, 0.73, 0.44, 0.19, 0.09, 0.02, 0.00, 0.00, 0.00]