SPARC T1 MMU Architecture

SPARC-T1 Cache & Virtual Memory Architecture, by Kaushik Patra ([email protected])

Description

A detailed discussion of the SPARC T1 MMU architecture.

Transcript of SPARC T1 MMU Architecture

Page 1: SPARC T1 MMU Architecture

SPARC-T1 Cache & Virtual Memory Architecture

by Kaushik Patra ([email protected])

Page 2: SPARC T1 MMU Architecture

Agenda

SPARC-T1 overview

SPARC core overview

L1 Caches and TLBs

L1 I-cache

IFQ & MIL

I-cache fill path

I-cache miss path

L1 D-Cache

Data flow through LSU

Memory Management Unit

Data flow in MMU

TLB structure

TLB entry replacement algorithm.

L2-Cache Overview

L2-Cache structure

L2-Cache line replacement algorithm.

Page 3: SPARC T1 MMU Architecture

SPARC T1 overview

Page 4: SPARC T1 MMU Architecture

SPARC T1 overview

8 SPARC V9 cores

4 threads per core

16 KB L1 instruction cache (I-Cache) per core.

8 KB L1 data cache (D-Cache) per core.

Page 5: SPARC T1 MMU Architecture

SPARC T1 overview

3 MB L2 cache

shared by all cores

4-way banked

12-way associative

132 GB/sec cross bar interconnect for on-chip communication.

Page 6: SPARC T1 MMU Architecture

SPARC T1 overview

4 DDR-II DRAM controllers

144-bit interface per channel

25 GB/sec total peak bandwidth.

IEEE 754 compliant floating point unit (FPU).

Shared by all cores.

Page 7: SPARC T1 MMU Architecture

SPARC T1 overview

External interface

J-Bus interface for I/O

2.56 GB/sec peak bandwidth

128-bit multiplexed address and data bus.

Serial System Interface (SSI) for boot PROM.

Page 8: SPARC T1 MMU Architecture

SPARC T1 overview

Page 9: SPARC T1 MMU Architecture

SPARC core overview

Page 10: SPARC T1 MMU Architecture

SPARC core overview

Instruction Fetch Unit (IFU)

Load Store Unit (LSU)

Memory Management Unit (MMU).

Execution Unit (EXU)

Multiplier Unit

Trap Logic Unit

Floating Point Front end Unit

Stream Processing Unit

Page 11: SPARC T1 MMU Architecture

SPARC core overview

SPARC core data path

Separate instruction cache (I-Cache) and data cache (D-Cache).

Page 12: SPARC T1 MMU Architecture

SPARC core overview

We’ll limit our discussion to the I-Cache and D-Cache.

We’ll also cover the associated TLB architecture that supports memory virtualization.

Page 13: SPARC T1 MMU Architecture

L1 Cache and TLBs

IFU contains I-cache and I-TLB

LSU contains D-cache and D-TLB

[Diagram: the MMU sits between the IFU (I-Cache, I-TLB) and the LSU (D-Cache, D-TLB)]

Page 14: SPARC T1 MMU Architecture

L1 Cache and TLBs

IFU controls I-Cache content.

LSU controls D-Cache content.

MMU controls both the I-TLB and D-TLB


Page 15: SPARC T1 MMU Architecture

L1 I-Cache

Physically indexed and tagged

The address is translated into a physical address using the I-TLB before the cache hit/miss is determined.

4-way set associative.

16KB data storage with 32 bytes line size.

Single ported data and tag array.
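The geometry above pins down the index and offset arithmetic. A small sketch (our own illustration, not from the slides) of how a physical address splits for this 16 KB, 4-way, 32-byte-line cache:

```python
# Illustrative sketch: derive the L1 I-cache geometry from the stated
# parameters -- 16 KB, 4-way set associative, 32-byte lines, physically
# indexed and tagged.

CACHE_SIZE = 16 * 1024   # bytes
WAYS = 4
LINE_SIZE = 32           # bytes

SETS = CACHE_SIZE // (WAYS * LINE_SIZE)      # 128 sets
OFFSET_BITS = LINE_SIZE.bit_length() - 1     # 5 bits of line offset
INDEX_BITS = SETS.bit_length() - 1           # 7 bits of set index

def split_pa(pa):
    """Split a physical address into (tag, set index, line offset)."""
    offset = pa & (LINE_SIZE - 1)
    index = (pa >> OFFSET_BITS) & (SETS - 1)
    tag = pa >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```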

Page 16: SPARC T1 MMU Architecture

L1 I-Cache

I-Cache fill size - 16 bytes per access.

Cached data contains

32-bit instruction

1-bit parity

1-bit pre-decode

Valid bit array has 1 read and 1 write port.

Cache invalidation access only V-bit array.

Page 17: SPARC T1 MMU Architecture

L1 I-Cache

Cache line replacement is pseudo-random.

Read access has higher priority over write access to I-cache.

The maximum wait time for a write access is 25 SPARC core clock cycles.

Any write request waiting more than 25 clock cycles causes a pipeline stall in order to allow the pending write operation to complete.

Page 18: SPARC T1 MMU Architecture

IFQ & MIL

The Instruction Fill Queue (IFQ) feeds into the I-Cache.

The Missed Instruction List (MIL) stores the addresses that missed in the I-Cache or I-TLB.

The MIL feeds into the LSU for further processing.

[Diagram: inside the IFU, the fetch address accesses the I-Cache and I-TLB; the IFQ fills the I-Cache from the LSU, and the MIL sends miss requests to the LSU]

Page 19: SPARC T1 MMU Architecture

Instruction fetch

For every SPARC core clock cycle, 2 instructions are fetched per instruction issue.

This strategy reduces I-Cache read accesses, leaving slots for opportunistic I-Cache line fills.

Each thread is allowed to have one outstanding I-cache miss.

i.e., a total of 4 outstanding I-cache misses per core.

Duplicate I-cache misses do not induce redundant fill requests to the L2-cache.

Page 20: SPARC T1 MMU Architecture

I-Cache fill path

Fill packets come from the L2-cache via the LSU.

Parity and pre-decode bits are computed before the I-cache is filled.

CPX packets also include

invalidations

test access point (TAP) read & write

error notification

[Diagram: the IFQ with its INV block; BIST, ASI, and CPX packets from the LSU are prioritized (BIST > ASI > CPX); outputs go to the V-bit array, the I-Cache, and a bypass to the TIR]

Page 21: SPARC T1 MMU Architecture

I-Cache fill path

Invalidation CPX packets are handled through the INV block, which accesses the V-bit array.

The IFQ has a bypass circuit to deliver the current CPX packet directly to the Thread Instruction Register (TIR), avoiding an extra stall in instruction processing.


Page 22: SPARC T1 MMU Architecture

I-Cache fill path

Each I-cache fill takes 2 CPX, 16 bytes each.

I-cache line size is 32 bytes.

I-cache line is invalidated after first packet is written.

I-cache line becomes valid again after 2nd packet is written.
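The two-packet fill protocol above can be sketched as follows (a hypothetical model; the class and field names are ours, only the 32-byte line, 16-byte packets, and valid-bit behavior come from the slides):

```python
# Hypothetical model of the two-packet fill: the 32-byte line is filled
# by two 16-byte CPX packets, and the valid bit is dropped on the first
# packet and restored only after the second, so a half-filled line can
# never produce a hit.

class ICacheLine:
    def __init__(self):
        self.valid = False
        self.data = bytearray(32)    # 32-byte line

    def fill_cpx(self, pkt_no, payload):
        assert len(payload) == 16    # each CPX packet carries 16 bytes
        if pkt_no == 1:
            self.valid = False       # invalidated after the first packet
            self.data[0:16] = payload
        else:
            self.data[16:32] = payload
            self.valid = True        # valid again after the second packet
```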


Page 23: SPARC T1 MMU Architecture

I-Cache miss path

The Missed Instruction List (MIL) sends I-Cache miss requests to the L2-cache via the LSU.

One miss entry per thread, i.e., a total of 4 miss entries per SPARC core.

Each entry in MIL contains

physical address (PA).

The replacement way information.

The MIL state information.

The cacheability.

The error information

[Diagram: MIL entries hold physical addresses (PA); a comparator (COMP) detects duplicates and a round-robin (RR) arbitrator issues PCX packets to the LSU]

Page 24: SPARC T1 MMU Architecture

I-Cache miss path

The PA field tracks I-cache fetch progress from the I-cache miss until the I-cache fill.

A round-robin algorithm dispatches I-cache fill requests from the different threads.

The MIL uses a linked list, of size 4, to keep track of duplicate I-cache misses.

Duplicate requests are marked as children.

A child request is serviced as soon as its parent request gets a response.
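The parent/child duplicate tracking can be illustrated with a small sketch (the dictionary-based structure and names are ours; the hardware uses a 4-entry linked list, one entry per thread):

```python
# Illustrative sketch of MIL-style duplicate suppression: the first miss
# to a line becomes the parent and issues a fill request; later misses
# to the same 32-byte line are linked as children and are serviced when
# the parent's fill response arrives.

class MIL:
    def __init__(self):
        self.outstanding = {}            # line address -> [parent, children...]

    def miss(self, thread, pa):
        """Record a miss; return True if a fill request must be sent."""
        line = pa & ~0x1F                # 32-byte line address
        if line in self.outstanding:
            self.outstanding[line].append(thread)   # duplicate -> child
            return False                            # no redundant L2 request
        self.outstanding[line] = [thread]           # parent
        return True

    def fill(self, pa):
        """Fill response: the parent and all children are serviced."""
        return self.outstanding.pop(pa & ~0x1F, [])
```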


Page 25: SPARC T1 MMU Architecture

I-Cache miss path

[State diagram: states S1-S4, with transitions labeled New I-Cache Miss, Make Fill Request, CPX-1 not done, Send Speculative notification, CPX-2 not done, Send notification]

The MIL alternates among 4 states.

It starts in S1 upon a new I-cache miss.

It makes the fill request.

It waits until the I-cache fill is done.

Upon completing the CPX-1 fill, it sends a speculative completion notification to the thread scheduler.
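The four-state sequence can be sketched as a transition function (a hedged reconstruction; the state names S1-S4 follow the slide, but the event names and exact transition conditions are our reading of it):

```python
# Hedged reconstruction of the 4-state MIL sequence: issue the fill
# request, wait for the two CPX fill packets, notify the scheduler
# speculatively after the first and definitively after the second.

def mil_step(state, event):
    if state == "S1" and event == "fill_request_sent":
        return "S2"      # waiting for the first fill packet
    if state == "S2" and event == "cpx1_done":
        return "S3"      # speculative notification sent to the scheduler
    if state == "S3" and event == "cpx2_done":
        return "S4"      # final completion notification sent
    if state == "S4" and event == "new_miss":
        return "S1"      # a new I-cache miss restarts the cycle
    return state         # e.g. CPX-1/CPX-2 not done: keep waiting
```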

Page 26: SPARC T1 MMU Architecture

I-Cache miss path

An I-Cache fill request may be cancelled upon a trap or exception.

However, the MIL still goes through filling the cache line, but the bypass to the TIR is blocked.

Why? Because any pending child request should still be serviced even if the parent request is cancelled.

A child I-cache miss request needs to wait until the parent’s I-cache miss request is serviced. The child instruction fetch is rolled back to the fetch stage to allow it to access the I-cache. This is referred to as ‘miss-fill crossover’.

Page 27: SPARC T1 MMU Architecture

L1 D-Cache

4-way set associative

8 KB data storage with 16 byte line size.

Single read-write port for data and tag array.

Dual ported Valid bit (V-bit) array.

Cache invalidation accesses only this V-bit array.

Page 28: SPARC T1 MMU Architecture

L1 D-Cache

The cache line replacement policy is pseudo-random, using a linear feedback shift register; load misses allocate, store misses do not.

A cacheable load miss allocates a line, and the write-through policy is applied before the line is loaded.

Stores do not allocate. Hence, a store causes line invalidation if the target address is already in the D-cache, as determined by the L2-cache directory.
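The allocation policy described above - loads allocate, stores write through without allocating and invalidate an already-cached line - can be sketched as a toy model (the 16-byte line size follows the slides; the dict-based backing store and names are ours):

```python
# Toy model of the D-cache policy: write-through, load misses allocate,
# store misses do not, and a store whose target line is already cached
# invalidates that line.

class DCache:
    LINE = 16                              # 16-byte lines

    def __init__(self):
        self.lines = {}                    # line address -> data

    def load(self, addr, memory):
        line = addr & ~(self.LINE - 1)
        if line not in self.lines:
            self.lines[line] = memory[line]   # load miss allocates
        return self.lines[line]

    def store(self, addr, value, memory):
        line = addr & ~(self.LINE - 1)
        memory[line] = value               # write-through: L2 always updated
        self.lines.pop(line, None)         # store hit invalidates the line
        # a store miss does not allocate
```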

Page 29: SPARC T1 MMU Architecture

L1 D-Cache

The L1 D-cache is always inclusive with respect to the L2 cache.

L1 D-cache is always exclusive to L1 I-cache.

Each L1 D-cache is parity protected.

A parity error causes a D-cache miss, so the data is corrected by re-fetching.

In addition to pipeline reads, the L1 D-cache may be accessed by ASI, BIST, and RAM-test operations through the test access port (TAP).

Page 30: SPARC T1 MMU Architecture

Data flow through LSU

One store buffer (STB) per thread.

Load misses are kept in Load Miss Queue, LMQ.

One outstanding load miss per thread.

A load miss with a duplicate physical address (PA) is not sent to the L2-cache.

Fully associative DTLB.

All CAM/RAM accesses are single-cycle operations.

[Diagram: LSU data flow - stores go into the per-thread STB, load misses into the LMQ; the PCX generator arbitrates these with I-cache miss packets from the IFU onto the PCX; CPX responses enter the DFQ and are delivered to the IFU and the register files (IRF, FRF)]

Page 31: SPARC T1 MMU Architecture

Data flow through LSU

The STB consists of a store buffer CAM (SCM) and a store data array (STBDATA).

SCM has 1 CAM port and 1 RW port

STBDATA has 1 read and 1 write port.

Each thread is allocated 8 fixed entries in the shared structure.


Page 32: SPARC T1 MMU Architecture

Data flow through LSU

A load instruction speculates on a D-cache miss to reduce the CCX access latency.

If the speculation fails, the load instruction is taken out of the LMQ.

The arbiter (PCX generator) takes 13 different inputs to generate the packet for the PCX (processor-to-crossbar) interface.


Page 33: SPARC T1 MMU Architecture

Data flow through LSU

The arbiter inputs consist of

4 load type instructions

4 store type instructions

One I-cache fill.

One FPU access.

One SPU access.

One interrupt.

One forward packet.


Page 34: SPARC T1 MMU Architecture

Data flow through LSU

The arbitration inputs consist of

I-cache miss

Load miss

Stores

{FPU operations, SPU operations, Interrupts}

A two-level history mechanism implements fair scheduling among the different priority levels.
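One way to picture a two-level fair arbiter is sketched below (purely hypothetical; the real PCX generator's priority grouping and history encoding are more involved than this, and the class and method names are ours):

```python
# Hypothetical two-level arbiter sketch: level 1 keeps a round-robin
# pointer (the "history") over the request classes so no class is
# starved; level 2 serves requests within a class in FIFO order.

from collections import deque

class TwoLevelArbiter:
    def __init__(self, groups):
        self.groups = deque(groups)           # level-1 round-robin order
        self.queues = {g: deque() for g in groups}

    def request(self, group, req):
        self.queues[group].append(req)

    def grant(self):
        for _ in range(len(self.groups)):
            g = self.groups[0]
            self.groups.rotate(-1)            # advance the level-1 pointer
            if self.queues[g]:
                return self.queues[g].popleft()   # level-2: FIFO per class
        return None                           # nothing pending
```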


Page 35: SPARC T1 MMU Architecture

Data flow through LSU

Incoming packets are stored in the data fill queue (DFQ).

Packets can be

Acknowledgment

Data

The targets for DFQ are

Instruction fetch unit (IFU)

Load Store Unit (LSU)

Trap Logic Unit (TLU)

Stream Processing Unit (SPU)


Page 36: SPARC T1 MMU Architecture

Memory Management Unit

Maintains content of ITLB and DTLB.

MMU helps SPARC-T1 to provide support for virtualization.

Multiple OSes co-exist on top of the CMT processor.

Hypervisor layer virtualizes underlying CPU.

A Virtual Address (VA) from an application is translated into a Real Address (RA) and then into a Physical Address (PA) using the TLBs and the MMU.
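The two-stage VA to RA to PA translation can be sketched as follows (illustrative only; the 8 KB page size and dict-based page maps are assumptions, and in hardware the TLB caches the composed VA-to-PA mapping rather than walking two tables):

```python
# Illustrative two-stage translation: the guest OS maps VA -> RA and the
# hypervisor maps RA -> PA; the page offset passes through unchanged.

PAGE = 8 * 1024    # assumed page size for the example

def translate(va, guest_map, hv_map):
    vpn, offset = divmod(va, PAGE)
    ra = guest_map[vpn] * PAGE + offset        # stage 1: VA -> RA (guest OS)
    pa = hv_map[ra // PAGE] * PAGE + offset    # stage 2: RA -> PA (hypervisor)
    return pa
```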

Page 37: SPARC T1 MMU Architecture

Data Flow in MMU

The system software maintains the content of TLBs by sending instructions to MMU.

Instructions are - read, write, de-map.

TLB entries are shared among threads.

Consistency among TLB entries is maintained through auto-de-map.

MMU is responsible for

Generating the pointers to the Software Translation Storage Buffer (TSB).

Maintaining fault status for various traps.

Access to MMU is through hypervisor-managed ASI (Alternate Space Identifier) operations, e.g. ldxa, stxa.

Page 38: SPARC T1 MMU Architecture

TLB structure

Page 39: SPARC T1 MMU Architecture

TLB structure

TLB consists of Content Addressable Memory (CAM) and Random Access Memory (RAM).

CAM has 1 compare port and 1 read-write port.

RAM has 1 read-write port.

The TLB supports the mutually exclusive operations: CAM (lookup), Read, Write, Bypass, De-map, Soft-reset, and Hard-reset.

Page 40: SPARC T1 MMU Architecture

TLB structure

RAM contains the following fields.

Physical Address (PA).

Attributes.

CAM contains the following fields.

Partition ID (PID).

Real (indicates VA-to-PA or RA-to-PA translation)

Virtual address (VA), divided into page size based fields (V0 - V3)

Context ID (CTXT)

Page 41: SPARC T1 MMU Architecture

TLB entry replacement algorithm

Each entry has a used bit.

The replacement victim is picked as the least significant unused bit among all 64 entries.

A used bit is set on - write, CAM hit or lock.

A locked page always has its used bit set.

Entry invalidation will clear the used bit.

All used bits, except those of locked entries, are cleared if the TLB reaches saturation.

If the TLB is saturated with all entries locked, default location 63 is chosen and an error is reported.
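The used-bit replacement rules above can be sketched as follows (our reading of the slide; `used` and `locked` are 64-entry bit vectors, and the function names are ours):

```python
# Sketch of the TLB used-bit replacement: pick the lowest-numbered entry
# whose used bit is clear; on saturation, clear all non-locked used bits
# and retry; if every entry is locked, fall back to entry 63 and flag
# an error.

ENTRIES = 64

def pick_victim(used, locked):
    """Return (entry index, error flag); may clear used bits in place."""
    for i in range(ENTRIES):
        if not used[i]:
            return i, False               # least significant unused entry
    for i in range(ENTRIES):              # saturation: clear non-locked bits
        if not locked[i]:
            used[i] = False
    for i in range(ENTRIES):
        if not used[i]:
            return i, False
    return ENTRIES - 1, True              # all locked: default entry, error
```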

Page 42: SPARC T1 MMU Architecture

L2-cache overview

3 MB in total size, with four symmetrical data banks.

Each bank operates independently.

Each bank is 12-way set associative and 768 KB in size.

Line size is 64 bytes.

The number of sets is 1024.
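The bank geometry can be cross-checked from the numbers above:

```python
# Cross-check of the stated L2 geometry: four banks, each 12-way set
# associative with 64-byte lines and 1024 sets.

BANKS = 4
WAYS = 12
LINE_BYTES = 64
SETS = 1024

bank_bytes = WAYS * LINE_BYTES * SETS    # 786432 bytes = 768 KB per bank
total_bytes = BANKS * bank_bytes         # 3145728 bytes = 3 MB total
```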

Page 43: SPARC T1 MMU Architecture

L2-cache overview

Accepts requests from the processor-to-crossbar (PCX) interface, a part of the CCX.

Puts responses on the crossbar-to-processor (CPX) interface, a part of the CCX.

Responsible for maintaining on-chip coherency across all L1 caches.

Keeps copy of all L1 tags in a directory structure.

Page 44: SPARC T1 MMU Architecture

L2-cache overview

128-bit fill interface.

64-bit write interface with the DMA controller.

Each bank has a dedicated DMA controller.

8-stage pipelined cache controller.

Page 45: SPARC T1 MMU Architecture

L2-cache overview

Each 32-bit word is protected by a 7-bit single-error-correction, double-error-detection (SEC/DED) ECC code.

Interfaces with the J-Bus interface (JBI) through a snoop input queue and an RDMA write buffer.

Page 46: SPARC T1 MMU Architecture

L2-Cache structure

3 main components:

SCTAG (Secondary Cache TAG) : contains TAG array, VUAD array, L2-TAG directory and cache controller.

SCBUF (Secondary Cache BUF) : contains write back buffer (WBB), fill buffer (FB) and DMA buffer.

SCDATA (Secondary Cache DATA) : contains L2-cache data.

Page 47: SPARC T1 MMU Architecture

L2-cache : Arbiter

Manages L2-cache pipeline access among the various request sources.

The arbiter gets input from

Instructions from the CCX, plus a bypass path for the input queue (IQ).

DMA instructions from the snoop input queue.

Instructions recycled from the miss buffer (MB) and fill buffer (FB).

Stall signal from the pipeline.

Page 48: SPARC T1 MMU Architecture

L2-cache : TAG

22-bit tag with 6 bits of SEC ECC protection.

No double-bit error detection.

Single-ported array.

Four states are maintained per tag line in the VUAD array:

Valid (V)

Used (U)

Allocated (A)

Dirty (D)

Page 49: SPARC T1 MMU Architecture

L2-cache : VUAD

Dual ported array structure.

The V, A, and D bits are parity protected, since an error in them would be fatal.

The Used bit is not protected, since an error there is not fatal.

The VUAD array is accessed when making a line replacement decision.

Page 50: SPARC T1 MMU Architecture

L2-cache : DATA

Single ported SRAM structure.

768 KB in size with a 64-byte logical line size.

Allows read accesses of 16 bytes and 64 bytes.

A ‘16-byte enable’ allows writing a 4-byte part.

Line fill updates all 64 bytes at a time.

Page 51: SPARC T1 MMU Architecture

L2-cache : DATA

The data array is subdivided into 4 columns with six 32 KB sub-arrays in each column.

A data array access takes 2 cycles to complete.

No column can be accessed in consecutive cycles.

All accesses are pipelined, so accesses have a throughput of one per cycle.

Each 32-bit word is protected by 7 bits of SEC/DED ECC.

Page 52: SPARC T1 MMU Architecture

L2-cache : Input Queue (IQ)

A 16-entry FIFO queue takes incoming PCX packets.

Each entry is 130 bits wide.

FIFO implemented with dual ported array.

The IQ asserts a stall when 11 entries are filled, leaving room for packets already in flight.

Page 53: SPARC T1 MMU Architecture

L2-cache : Output Queue (OQ)

A 16-entry FIFO for packets waiting to get access to the CPX.

Each entry is 146 bits wide.

FIFO implemented with dual ported array.

When the OQ reaches its high-water mark, the L2-cache stops accepting PCX packets and input from the miss buffer.

Fills can still happen, since they do not generate CPX packets.

Page 54: SPARC T1 MMU Architecture

L2-cache : Miss Buffer (MB)

The 16-entry miss buffer stores instructions that cannot be processed as a simple cache hit:

True L2 cache misses.

Accesses to a cache line address that already had a miss.

Accesses matching an entry in the write back buffer.

Instructions needing multiple passes through the L2 cache pipeline.

Unallocated L2-cache misses.

Accesses causing a tag ECC error.

The non-tag part holds data - it is a RAM with 1 read and 1 write port.

The tag part holds addresses - it is a CAM with 1 read, 1 write, and 1 CAM port.

Page 55: SPARC T1 MMU Architecture

L2-cache : Fill Buffer (FB)

An 8-entry buffer.

Contains cache-line-wide entries to stage data from DRAM before it fills the cache.

A RAM structure is used for the data.

The address is also stored, to maintain the age ordering needed for data coherence.

A CAM structure is used for the addresses.

Data arrives from DRAM in four 16-byte blocks, starting with the critical quad-word.

Page 56: SPARC T1 MMU Architecture

L2-cache : Write Back Buffer (WBB)

An 8-entry buffer, used to store 64-byte dirty lines upon eviction.

The evicted lines are streamed to DRAM opportunistically.

If an instruction has the same line address as an entry in the WBB, the instruction is pushed back into the MB.

The WBB also has RAM and CAM parts to hold data and addresses respectively.

It has a 64-byte read interface with the data array and a 64-bit write interface to the DRAM controller.

Page 57: SPARC T1 MMU Architecture

L2-cache : Directory

2048 entries, with one entry per L1 tag.

It provides the L1-tag-to-L2-bank mapping.

Half the entries are for the L1 I-cache and the other half for the L1 D-cache:

an I-cache directory and a D-cache directory.

It participates in coherency management.

It also ensures the same line is not present in both the I-Cache and the D-Cache.

Page 58: SPARC T1 MMU Architecture

L2-cache : Line Replacement Algorithm

Uses pseudo-LRU for line replacement.

The ‘U’ bit (12 in total, 1 per way) is set upon a cache hit.

All 12 ‘U’ bits are cleared when no unused or unallocated way remains.

The ‘A’ bit means the line is allocated for a miss.

It is analogous to a ‘lock’ bit.

The ‘A’ bit is cleared when the line fill happens.

Page 59: SPARC T1 MMU Architecture

L2-cache : Line Replacement Algorithm

The ‘D’ bit indicates the line is valid only inside the cache and must be written back.

It is set when data is written to the L2-cache.

It is cleared when the line is invalidated.

The LRU logic examines all the ways starting from a point chosen in round-robin fashion.

The first unused and unallocated line is allocated for the miss.

If there is no unused line, the first unallocated line is allocated for the miss.
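The replacement scan described above can be sketched as follows (a hedged reconstruction; the round-robin starting pointer and tie-breaking details are our reading of the slides, and the function name is ours):

```python
# Sketch of the L2 pseudo-LRU scan: starting from a round-robin pointer,
# take the first way that is neither used nor allocated; failing that,
# the first unallocated way.

WAYS = 12

def pick_way(used, alloc, start):
    order = [(start + i) % WAYS for i in range(WAYS)]
    for w in order:
        if not used[w] and not alloc[w]:
            return w          # first unused and unallocated way
    for w in order:
        if not alloc[w]:
            return w          # else the first unallocated way
    return None               # every way already allocated for a miss
```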

Page 60: SPARC T1 MMU Architecture

Scope of future study

Cache crossbar (CCX) data transactions.

L2-cache pipelined data flow control.

Cache memory consistency and instruction ordering.