® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez...

33
® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation
  • date post

    20-Dec-2015
  • Category

    Documents

  • view

    224
  • download

    2

Transcript of ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez...

Page 1: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport1

XBC - eXtended Block Cache

Lihu Rappoport

Stephan Jourdan

Yoav Almog

Mattan Erez

Adi Yoaz

Ronny Ronen

Intel Corporation

Page 2: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport2

The Frontend

Frontend goal: supply instructions to execution – Predict which instructions to fetch – Fetch the instructions from cache / memory – Decode the instructions– Deliver the decoded instructions to execution

FrontendFrontend

MemoryMemoryExecutionExecution

Instructions

Data

The processor:

Page 3: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport3

Requirements from the Frontend

High bandwidth

Low latency

Page 4: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport4

The Traditional Solution: Instruction Cache Basic unit: cache line

– A sequence of consecutive instructions in memory

Deficiencies:– Low Bandwidth Jump into

the lineJump out of the line

jmpjmp

– High Latency– Instructions need decoding

Page 5: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport5

TC Goals: high bandwidth & low latency Basic unit: trace

– A sequence of dynamically executed instructions

Trace Cache

Instructions are decoded into uops – Fixed length, RISC like instructions

Traces have a single entry, and multiple exits

Trace end condition

jmpjmp jmpjmpjmpjmpjmpjmp

– Trace tag/index is derived from starting IP

Page 6: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport6

Redundancy in the TC

CodeIf (cond) AB

Possible Traces

(i) AB (ii) BBB

AA

Space inefficiencySpace inefficiency low hit rate low hit rate

Page 7: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport7

XBC Goals

High bandwidth

Low latency

High hit rate

Page 8: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport8

XBC - eXtended Block Cache Basic unit: XB - eXtended Block

jccjccjmpjmp

XB features – Multiple entry, single exit– Tag / index derived from ending instruction IP

– Instructions are decoded

XB end conditions – Conditional or indirect branches– Call/Return– Quota (16 uops)

Page 9: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport9

XBC Fetch Bandwidth Fetch multiple XBs per cycle

– A conditional branch ends a XB– Need to predict only 1 branch/ XB– Predicting 2 branch/cyc fetch 2 XB/cyc

Promote 99% biased conditional branches*

Build longer XBs Maximize XBC bandwidth for a given #pred/cyc

99% biased99% biased

jccjcc jccjccjccjccjmpjmp

*[Patel 98]

Page 10: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport10

XB LengthBlock types Average Length

BB basic block 7.7

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

50%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

BB

XB

XBp

DBL

XB don’t break on uncond 8.0

XBp XB + promotion 10.0 DBL group 2 XBp 12.7

Page 11: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport11

XBC StructureA banked structure which supports Variable length XBs (minimize fragmentation) Fetching multiple XBs/cycle

Reorder & AlignReorder & Align

Bank0 Bank1 Bank2 Bank3

4 uop4 uop 4 uop4 uop 4 uop4 uop 4 uop4 uop

Page 12: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport12

Support Variable Length XBs An XB may spread over several Banks on the same

set

Reorder & AlignReorder & Align

bank0 bank1 bank2 bank3

00 11

Page 13: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport13

Support Fetching 2 XBs/cycle Data may be received from all Banks in the same cycle

Reorder & AlignReorder & Align

bank0 bank1 bank2 bank3

00 11

11 00

00 11 00 11

Page 14: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport14

Support Fetching 2 XBs/cycle Actual bandwidth may be sometimes less than 4 banks

per cycle

Reorder & AlignReorder & Align

bank0 bank1 bank2 bank3

00 11

11 00

00 11 00

Page 15: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport15

Reordering and Aligning Uops

bankbank00 bankbank11 bankbank22 bankbank33

bankbanki2i2 bankbanki3i3

Reorder Reorder BanksBanks

Mux 1

bankbanki0i0 bankbanki1i1

Align UopsAlign UopsMux 2

bnkbnki0i0 bnkbnki1i1 bankbanki2i2

Empty uopsEmpty uops

Page 16: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport16

XBC Structure The average XB length is >8 uops

16 uop/line is < 2-XB set associative

Reorder & AlignReorder & Align

bank0 bank1 bank2 bank3

00 1100 22 11

16 uop16 uop

Page 17: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport17

XBC Structure The average XB length is >8 uops

make each bank set-associative

Reorder & AlignReorder & Align

bank0 bank1 bank2 bank3

1100 1100 22

Page 18: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport18

The XBTB The XBTB provides the next XB for each XB

– XBs are indexed according to ending IP Cannot directly lookup next IP in the XBC XBC can only be accessed using the XBTB

XBTB provides info needed to access next XB – The IP of the next XB

– Defines the set in which the XB resides – A masking vector, indicating the banks in which the

XB resides– The #uops counted backward from the end of XB

– Defines where to enter the XB XBTB provides next 2 XBs

Page 19: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport19

XBTBXBC

Decoder

Memory / Cache BTB

Delivery mode

PriorityEncode XBQ

XBC Structure: the whole picture

Build mode

Fill Unit

Page 20: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport20

XB Build Algorithm XBTB lookup fails build a new XB into the fill buffer End-of-XB condition reached lookup XBC for the new XB

– No match store new XB in the XBC, and update XBTB – Match there are three cases:

XBnew XBexist

Update XBTB

IP1XBexist

IP1XBnew

Extend XBexist

Update XBTB

XBnew XBexist

XBexist IP1

IP1XBnew

Complex XB,Update XBTB

XBnewXBexist

IP1

IP1

XBexist

XBnew

The XBC has NO RedundancyThe XBC has NO Redundancy

Page 21: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport21

XBnew and XBexist have same suffix but different

prefix:

– Possible solution, complying to no-redundancy:

Complex XBs

– Drawback: we get 2 short XBs instead of a single long XB

Wrong Way

IP1

IP1

XBexist

XBnew

IP1XBexist

Prefixnew

Page 22: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport22

XBnew and XBexist have same suffix but different

prefix:– Second solution: a single “complex XB”

Complex XBs

Complex XBs: no redundancy, but still high bandwidth

Right Way

PrefixcurIP1

IP1

IP1

XBexist

XBnew

Prefixnew

Suffix

Page 23: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport23

bank0 bank1 bank2 bank3

Extending an Existing XB An XB can only be extended at its beginning

991 2 3 41 2 3 4 5 6 7 85 6 7 8 8 98 91 2 31 2 3 4 5 6 74 5 6 700

Since the existing uops move, the pointers in the XBTB become stale

If we store XB in the usual way, when an XB is extended, we need to move all its uops

Page 24: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport24

Storing Uops in Reverse Order The solution is to store the uops of an XB in a

reversed order

11 2 3 4 52 3 4 5 6 7 8 96 7 8 9

bank0 bank1 bank2 bank3

00

XB IP is the IP of the ending instruction extending the XB does not change the XB IP

when an XB is extended, no need to move uops

Page 25: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport25

Set Search XB is replaced and then placed again

– Not on same set different XB– Same set, same banks no problem– Same set but not on the same banks

XBTB entries which point to the old location of the XB are erroneous

Solution - Set Search– On an XBTB hit & XBC miss, try to locate the XB in

other banks in the same set– Calculate new mask according to offset– Only a small penalty: cycle loss, but no switch to

build

Page 26: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport26

XB Replacement Use a LRU among all the lines in a given set LRU also makes sure that we do not evict a line

other than the first line of a XB (a head line)–There is no point in retaining the head line while

evicting another line– if we enter the XB in the head line, we will get a miss

when we reach the evicted line – if a head line is evicted, but we enter the XB in its

middle, we may still avoid a miss A non-head line is always accessed after a head

line is accessed its LRU will be higher it will not be evicted before the head line

Page 27: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport27

XB Placement Build-mode placement algorithm

– New XB is placed in banks such that it does not have bank conflict with the previous XB (if possible)

– LRU ordering is maintained by switching the LRU line with the non-conflicting line before the new XB is placed

– Set-search repairs the XBTB Delivery mode placement algorithm

– repeating bandwidth losses due to bank conflicts found conflicting lines are moved to non-conflicting banks

– Each XB is augmented with a counter – incremented when XB has a bank conflict – when counter reaches threshold, the conflicting lines

are switched with other lines in non-conflicting banks – A line can be switched with another line, only if its LRU is

higher, or if both gain from the switch

Page 28: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport28

0

1

2

3

4

5

6

7

Games SpecINT SysmarkNT Average

Uo

p p

er C

ycle

XBC vs. TC Delivery Bandwidth

TC XBC

Page 29: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport29

0%1%2%3%4%5%

6%7%

8%9%

10%

16K 32K 64K

Size - KUops

Uo

p M

iss

Rat

e

Miss Rate as a Function of Size

TC XBC

29%

>50%

Page 30: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport30

0%

1%

2%

3%

4%

5%

6%

7%

8%

9%

1 2 4Associativity

Uo

p M

iss

Rat

e

Miss Rate as a Function of Size

XBCTC

Page 31: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport31

XBC Features Summary Basic unit - XB

– Ends with a conditional branch– Multiple entries, single exit – Indexed according to ending IP– Branch promotion longer XBs

XBC uses a banked structure– Supports fetching multiple XBs/cycle– Supports variable length XBs

– Uops within XBs are stored in reverse order

Page 32: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport32

Conclusions Instruction Cache has high hit rate, but …

–Low bandwidth, high latency

TC has high bandwidth, low latency, but …–Low hit rate

XBC combines the best of both worlds–High bandwidth, low latency and high hit rate

Page 33: ® Lihu Rappoport 1 XBC - eXtended Block Cache Lihu Rappoport Stephan Jourdan Yoav Almog Mattan Erez Adi Yoaz Ronny Ronen Intel Corporation.

RR

®® Lihu Rappoport33