
Slide 1

Memory Hierarchy Design

• Motivated by a combination of the programmer's desire for unlimited fast memory and economical considerations, and based on:
– the principle of locality, and
– the cost/performance ratio of memory technologies (fast memory is small, large memory is slow),
to achieve a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level.

• Hierarchy: CPU/register file (RF) → cache (C) → main memory (MM) → disk memory/I/O devices (DM)

• Speed in descending order: RF > C > MM > DM

• Space in ascending order: RF < C < MM < DM


Slide 2

Memory Hierarchy Design

• The gaps in speed and space between the different levels are steadily widening:

Level/Name                 1/RF                  2/C                   3/MM               4/DM
Typical size               < 1 KB                < 16 MB               < 16 GB            > 100 GB
Implementation technology  Custom memory with    On-chip or off-chip   CMOS DRAM          Magnetic disk
                           multiple ports, CMOS  CMOS SRAM
Access time (ns)           0.25-0.5              0.5-25                80-250             5,000,000
Bandwidth (MB/s)           20,000-100,000        5000-10,000           1000-5000          20-150
Managed by                 Compiler              Hardware              Operating system   OS/operator
Backed by                  Cache                 Main memory           Disk               CD or tape


Slide 3

Memory Hierarchy Design

• Cache performance review:

Memory stall cycles = Number_of_misses * Miss_penalty
                    = IC * Misses_per_instruction * Miss_penalty
                    = IC * MAPI * Miss_rate * Miss_penalty

where MAPI stands for memory accesses per instruction.
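To make the formula concrete, here is a minimal C sketch of the computation; all parameter values are hypothetical, chosen only to illustrate the arithmetic.

#include <stdio.h>

/* Memory stall cycles = IC * MAPI * Miss_rate * Miss_penalty,
   following the decomposition above. All values are hypothetical. */
int main(void) {
    double ic           = 1e9;   /* instruction count */
    double mapi         = 1.37;  /* memory accesses per instruction */
    double miss_rate    = 0.02;  /* fraction of accesses that miss */
    double miss_penalty = 100.0; /* cycles per miss */

    double stalls = ic * mapi * miss_rate * miss_penalty;
    printf("Memory stall cycles: %.0f\n", stalls);
    return 0;
}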

• Four fundamental memory hierarchy design issues:

1. Block placement: where can a block, the atomic memory unit in cache-memory transactions, be placed in the upper level?

2. Block identification: how is a block found if it is in the upper level?

3. Block replacement: which block should be replaced on a miss?

4. Write strategy: what happens on a write?


Slide 4

Memory Hierarchy Design

1. Placement: three approaches:

1) fully associative: any block in main memory can be placed in any block frame. Flexible, but expensive due to the associative search.

2) direct mapped: each memory block is placed in a fixed block frame given by the mapping function: (Block address) MOD (Number of blocks in cache)

3) set associative: a compromise between fully associative and direct mapped. The cache is divided into sets of block frames; each memory block is first mapped to a fixed set, within which it can be placed in any block frame. Mapping to a set follows the function, called bit selection (see the sketch below):

(Block address) MOD (Number of sets in cache)
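A minimal C sketch of the bit-selection mapping just described; the cache geometry is a made-up example. Direct mapping is the special case where the number of sets equals the number of block frames, and fully associative is the case of a single set.

#include <stdint.h>
#include <stdio.h>

/* set = (block address) MOD (number of sets), as defined above. */
static uint64_t set_index(uint64_t block_addr, uint64_t num_sets) {
    /* For a power-of-two number of sets this is just the low index bits. */
    return block_addr % num_sets;
}

int main(void) {
    /* Hypothetical cache: 512 sets. */
    printf("block 0x12345 -> set %llu\n",
           (unsigned long long)set_index(0x12345, 512));
    return 0;
}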


Slide 5

Memory Hierarchy Design

2. Identification:

• Each block frame in the cache has an address tag indicating the block's address in memory.

• All possible tags are searched in parallel.

• A valid bit is attached to the tag to indicate whether the block frame contains valid information.

• An address A for a datum requested by the CPU is divided into a block address field and a block offset field:
block address = A / (block size)
block offset = A MOD (block size)

• The block address is further divided into a tag and an index: the index indicates the set in which the block may reside; the tag is compared to determine a hit or a miss.
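A C sketch of the address split described above, assuming hypothetical field widths (64-byte blocks and 512 sets, so a 6-bit offset and a 9-bit index); with power-of-two sizes, the divide and MOD become shifts and masks.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6  /* block offset = A MOD (block size), 64-B blocks  */
#define INDEX_BITS  9  /* index = (block address) MOD (number of sets)    */

int main(void) {
    uint64_t a      = 0x12345678;                        /* example address */
    uint64_t offset = a & ((1u << OFFSET_BITS) - 1);     /* block offset    */
    uint64_t baddr  = a >> OFFSET_BITS;                  /* block address   */
    uint64_t index  = baddr & ((1u << INDEX_BITS) - 1);  /* set index       */
    uint64_t tag    = baddr >> INDEX_BITS;               /* tag to compare  */
    printf("tag=0x%llx index=0x%llx offset=0x%llx\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}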


Slide 6

Memory Hierarchy Design

3. Replacement on a cache miss:

• The more choices for replacement, the more expensive the hardware; direct mapping is the simplest.

• Random vs. least-recently used (LRU): the former gives uniform allocation and is simple to build, while the latter can take advantage of temporal locality but can be expensive to implement (why?). First in, first out (FIFO) approximates LRU and is simpler than LRU. A software sketch of LRU follows the table below.

Data cache misses per 1000 instructions:

       |      Two-way       |      Four-way      |      Eight-way
Size   | LRU  Random  FIFO  | LRU  Random  FIFO  | LRU   Random  FIFO
16K    |114.1  117.3  115.5 |111.7  115.1  113.3 |109.0  111.8  110.4
64K    |103.4  104.3  103.9 |102.4  102.3  103.1 | 99.7  100.5  100.3
256K   | 92.2   92.1   92.5 | 92.1   92.1   92.5 | 92.1   92.1   92.5
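To see why LRU can be expensive, here is a minimal software sketch of LRU for a single set (a hypothetical 4-way geometry): every access updates recency state, and every miss scans all ways for the victim; real hardware must do the equivalent in parallel.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WAYS 4

typedef struct {
    uint64_t tag[WAYS];
    int      valid[WAYS];
    uint64_t last_used[WAYS];  /* larger = more recently used */
} Set;

static uint64_t now = 0;

/* Returns 1 on a hit, 0 on a miss (after filling the LRU victim frame). */
static int access_set(Set *s, uint64_t tag) {
    ++now;
    for (int w = 0; w < WAYS; w++)
        if (s->valid[w] && s->tag[w] == tag) {
            s->last_used[w] = now;          /* refresh recency on a hit */
            return 1;
        }
    int victim = 0;                         /* pick an invalid or the LRU frame */
    for (int w = 1; w < WAYS; w++)
        if (!s->valid[w] || s->last_used[w] < s->last_used[victim])
            victim = w;
    s->tag[victim] = tag;
    s->valid[victim] = 1;
    s->last_used[victim] = now;
    return 0;
}

int main(void) {
    Set s;
    memset(&s, 0, sizeof s);
    uint64_t refs[] = {1, 2, 3, 4, 1, 5, 2};  /* 5 evicts tag 2, the LRU block */
    for (int i = 0; i < 7; i++)
        printf("tag %llu -> %s\n", (unsigned long long)refs[i],
               access_set(&s, refs[i]) ? "hit" : "miss");
    return 0;
}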


Slide 7

Memory Hierarchy Design

4. Write strategies:

• Most cache accesses are reads: with 10% stores and 37% loads per instruction, plus the fetch of every instruction, only about 7% of all memory accesses are writes.

• Optimize reads to make the common case fast, observing that the CPU does not have to wait for writes but must wait for reads. Fortunately, reads are easy in direct mapping: reading the block and comparing the tag can be done in parallel (what about associative mapping?). Writes are hard:
a) tag reading and block writing cannot be overlapped (writing is destructive);
b) the CPU specifies the write size: only 1-8 bytes.
Thus write strategies often distinguish cache designs.

• On a write hit (sketched below):
i. write through (or store through): ensures consistency, at the cost of memory and bus bandwidth; write stalls may be alleviated by using write buffers;
ii. write back (store in): minimizes memory and bus traffic, at the cost of weakened consistency; a dirty bit indicates modification; read misses may result in writes (why?).

• On a write miss:
a) write allocate (fetch on write);
b) no-write allocate (write around).
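A C sketch contrasting the two write-hit policies (illustrative structure only, not any real cache's code; memory_write() is a hypothetical stand-in for the bus/memory write):

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t tag; bool valid, dirty; uint8_t data[64]; } Frame;

/* Hypothetical stand-in for the write to main memory over the bus. */
static void memory_write(uint64_t addr, const uint8_t *src, int n) {
    (void)addr; (void)src; (void)n;
}

/* Write through: update the block AND memory; consistent, but uses bandwidth. */
static void write_hit_through(Frame *f, uint64_t addr,
                              const uint8_t *src, int off, int n) {
    for (int i = 0; i < n; i++) f->data[off + i] = src[i];
    memory_write(addr, src, n);
}

/* Write back: update the block and set the dirty bit; memory is updated only
   when the block is replaced -- which is why a read miss can cause a write. */
static void write_hit_back(Frame *f, const uint8_t *src, int off, int n) {
    for (int i = 0; i < n; i++) f->data[off + i] = src[i];
    f->dirty = true;
}

static void on_replace(Frame *f, uint64_t old_addr) {
    if (f->valid && f->dirty)
        memory_write(old_addr, f->data, 64);  /* flush the dirty victim */
    f->dirty = false;
}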


Slide 8

Memory Hierarchy Design

An Example: The Alpha 21264 Data Cache

Cache size = 64 KB, block size = 64 B, two-way set associative, write back, write allocate on a write miss.

What is the index size?

Number of sets = 64K/(64*2) = 2^16/2^(6+1) = 2^9, so the index is 9 bits.
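The same calculation in C (a small sketch; sizes as given above):

#include <stdio.h>

/* index bits = log2(cache size / (block size * associativity)) */
static int log2u(unsigned long x) { int b = 0; while (x >>= 1) b++; return b; }

int main(void) {
    unsigned long cache_size = 64 * 1024, block_size = 64, ways = 2;
    printf("index bits = %d\n",
           log2u(cache_size / (block_size * ways)));  /* prints 9 */
    return 0;
}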


Slide 9

Memory Hierarchy Design

Cache Performance:

Memory access time is an indirect measure of performance, and it is not a substitute for execution time:

AMAT = Hit_time + Miss_rate * Miss_penalty


Slide 10

Memory Hierarchy Design

Example 1: How much does a cache help in performance?


Slide 11

Memory Hierarchy Design

Example 2: What is the relationship between AMAT and CPU time?


Slide 12

Memory Hierarchy Design

Improving Cache Performance

The average memory access time can be improved by reducing any of the three parameters above:
1. reducing miss rate;
2. reducing miss penalty;
3. reducing hit time.

Four categories of cache organizations help reduce these parameters:

1. Organizations that help reduce miss rate: larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimizations;

2. Organizations that help reduce miss penalty: multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches;

3. Organizations that help reduce miss penalty or miss rate via parallelism: non-blocking caches, hardware prefetching, and compiler prefetching;

4. Organizations that help reduce hit time: small and simple caches, avoiding address translation, pipelined cache access, and trace caches.


Slide 13

Memory Hierarchy Design

Reducing Miss Rate

There are three kinds of cache misses, distinguished by their causes:

1. Compulsory: the very first access to a block cannot be a hit, since the block must first be brought in from main memory. Also called cold-start misses;

2. Capacity: the cache lacks the space to hold all the blocks needed during execution. Capacity misses occur because blocks are discarded and later retrieved;

3. Conflict: due to a mapping that confines blocks to a restricted area of the cache (e.g., direct mapping, set associative). Also called collision misses or interference misses.

While the 3-C characterization gives insight into causes, it is at times too simplistic (and the three categories are interdependent); for example, it ignores the replacement policy.


Slide 14

Memory Hierarchy Design

Roles of the 3 Cs


Slide 15

Memory Hierarchy Design

Roles of the 3 Cs


Slide 16

Memory Hierarchy Design

First Miss Rate Reduction Technique: Larger Block Size

• Takes advantage of spatial locality; reduces compulsory misses.
• Increases miss penalty (it takes longer to fetch a block).
• Increases conflict misses and/or capacity misses (fewer blocks fit in a cache of fixed size).
• Must strike a delicate balance among miss penalty, miss rate, and AMAT in finding an appropriate block size.


Slide 17

Memory Hierarchy Design

First Miss Rate Reduction Technique: Larger Block Size

Example: Find the optimal block size in terms of AMAT, given that the miss penalty is 40 cycles of overhead plus 2 cycles per 16 bytes, with the miss rates of the table below.

Solution: AMAT = HT + MR * MP = HT + MR * (40 + 2 * block_size/16)

• High latency and high bandwidth encourage large block sizes.
• Low latency and low bandwidth encourage small block sizes.
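A C sketch of the solution: sweep block sizes and pick the one minimizing AMAT. The slide's miss-rate table did not survive transcription, so the miss rates below are hypothetical placeholders; only the shape of the computation is the point.

#include <stdio.h>

int main(void) {
    int    block_size[] = {16, 32, 64, 128, 256};
    double miss_rate[]  = {0.057, 0.042, 0.035, 0.031, 0.030}; /* hypothetical */
    double hit_time     = 1.0;                                 /* hypothetical */

    for (int i = 0; i < 5; i++) {
        double miss_penalty = 40.0 + 2.0 * block_size[i] / 16.0;
        double amat = hit_time + miss_rate[i] * miss_penalty;
        printf("block %3d B: AMAT = %.3f cycles\n", block_size[i], amat);
    }
    return 0;
}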


Slide 18

Memory Hierarchy Design

Second Miss Rate Reduction Technique: Larger Caches

An obvious way to reduce capacity misses. Drawbacks: longer hit time and higher cost. Popular in off-chip caches (2nd- and 3rd-level caches).

Third Miss Rate Reduction Technique: Higher Associativity

Miss rate rules of thumb:
i. 8-way associativity is, for practical purposes, as effective as full associativity;
ii. the miss rate of a 1-way (direct-mapped) cache of size N is about equal to the miss rate of a 2-way cache of size N/2;
iii. the higher the associativity, the longer the hit time (why?).

A higher miss rate rewards higher associativity.


Slide 19

Memory Hierarchy Design

Fourth Miss Rate Reduction Technique: Way Prediction and Pseudoassociative Caches

• Way prediction helps select one block among those in a set, thus requiring only one tag comparison (if the prediction hits). Preserves the advantages of direct mapping (why?). In case of a miss, the other block(s) are checked.

• Pseudoassociative (also called column-associative) caches operate exactly as direct-mapped caches on a hit, again preserving the advantages of direct mapping. On a miss, another block is checked (as if in a set-associative cache) by simply inverting the most significant bit of the index field to find the other block in the "pseudoset" (see the sketch below).

• Real hit time < pseudo-hit time; too many pseudo hits would defeat the purpose.
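The pseudoset probe is a one-liner; a C sketch with a hypothetical 9-bit index:

#include <stdint.h>

#define INDEX_BITS 9  /* hypothetical index width */

/* On a miss, probe the alternative frame found by inverting the most
   significant bit of the index field, as described above. */
static uint64_t pseudo_index(uint64_t index) {
    return index ^ (1ull << (INDEX_BITS - 1));
}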


Slide 20

Memory Hierarchy Design

Fifth Miss Rate Reduction Technique: Compiler Optimizations


Slide 21

Memory Hierarchy Design

Fifth Miss Rate Reduction Technique: Compiler Optimizations


Slide 22

Memory Hierarchy Design

Fifth Miss Rate Reduction Technique: Compiler Optimizations


Slide 23

Memory Hierarchy Design

Fifth Miss Rate Reduction Technique: Compiler Optimizations

IV. Blocking: improves temporal and spatial locality.

a) Multiple arrays are accessed in both ways (i.e., row-major and column-major), namely orthogonal accesses, which cannot be helped by the earlier methods.

b) Concentrate on submatrices, or blocks (the loop nest being described is sketched below).

c) All N*N elements of Y and Z are accessed N times, and each element of X is accessed once. Thus there are N^3 operations and 2N^3 + N^2 reads! Capacity misses are a function of N and the cache size in this case.
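The loop nest being described (the slide's code figure did not survive transcription; this is the standard X = Y*Z form in C, consistent with the access counts above):

#define N 512            /* hypothetical matrix dimension */
double x[N][N], y[N][N], z[N][N];

void matmul(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];  /* z is walked down a column:
                                            the "orthogonal" access */
            x[i][j] = r;
        }
}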


Slide 24

Memory Hierarchy Design

Fifth Miss Rate Reduction Technique: Compiler Optimizations

a) To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B*B, where B is called the blocking factor (see the sketch below).

b) The total number of memory words accessed becomes 2N^3/B + N^2.

c) Blocking exploits a combination of spatial (Y) and temporal (Z) locality.
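The blocked version, reconstructed from the description above (a sketch; B is the blocking factor, N is assumed to be a multiple of B, and x is assumed zeroed before the call):

#define N 512   /* hypothetical matrix dimension */
#define B 32    /* blocking factor: B*B submatrices fit in the cache */
double x[N][N], y[N][N], z[N][N];

void matmul_blocked(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;  /* accumulate across the kk blocks */
                }
}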


Slide 25

Memory Hierarchy Design

First Miss Penalty Reduction Technique: Multilevel Caches

a) To keep up with the widening gap between the CPU and main memory, try to:
i. make the cache faster, and
ii. make the cache larger,
by adding another, larger but slower cache between the cache and main memory.


Slide 26

Memory Hierarchy Design

First Miss Penalty Reduction Technique: Multilevel Caches

b) Local miss rate vs. global miss rate:

i. The local miss rate is defined as the number of misses in a cache divided by the number of accesses to that cache (e.g., for L2: Misses_L2 / Accesses_L2).

ii. The global miss rate is defined as the number of misses in a cache divided by the total number of memory accesses generated by the CPU (e.g., for L2: Miss_rate_L1 * Local_miss_rate_L2).
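With these definitions, the two-level AMAT works out as in this C sketch (all numbers hypothetical, for illustration only):

#include <stdio.h>

int main(void) {
    double ht1 = 1.0,  mr1 = 0.04;       /* L1 hit time and miss rate     */
    double ht2 = 10.0, local_mr2 = 0.5;  /* L2 hit time, LOCAL miss rate  */
    double mp2 = 100.0;                  /* penalty to main memory        */

    double global_mr2 = mr1 * local_mr2; /* global L2 miss rate           */
    double amat = ht1 + mr1 * (ht2 + local_mr2 * mp2);
    printf("global L2 miss rate = %.3f, AMAT = %.2f cycles\n",
           global_mr2, amat);
    return 0;
}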


Slide 27

Memory Hierarchy Design

Second Miss Penalty Reduction Technique: Critical Word First and Early Restart

The CPU needs just one word of the block at a time:
• critical word first: fetch the required word first, and
• early restart: as soon as the required word arrives, send it to the CPU.

Third Miss Penalty Reduction Technique: Giving Priority to Read Misses over Write Misses

Serve reads before outstanding writes have completed:
• while write buffers improve write-through performance, they complicate memory accesses by potentially delaying updates to memory;
• instead of waiting for the write buffer to drain before processing a read miss, the write buffer is checked for content that might satisfy the missing read;
• in a write-back scheme, the dirty copy being replaced is first written to the write buffer instead of to memory, thus improving performance.


Slide 28

Memory Hierarchy Design

Fourth Miss Penalty Reduction Technique: Merging Write Buffer

Improves the efficiency of write buffers, which are used by both write-through and write-back caches:
• multiple single-word writes are combined into a single write-buffer entry that would otherwise hold a multi-word write;
• reduces stalls due to the write buffer being full.


Slide 29

Memory Hierarchy Design

Fifth Miss Penalty Reduction Technique: Victim Cache

Victim caches attempt to avoid the miss penalty on a miss by adding a small fully associative cache to hold discarded blocks (victims).

Proven to be effective, especially for small 1-way (direct-mapped) caches; e.g., a 4-entry victim cache can remove 20% of the misses!


Slide 30

Memory Hierarchy Design

Reducing Cache Miss Penalty or Miss Rate via Parallelism

Nonblocking Caches (Lock-free caches):

Hardware Prefetching of Instructions and Data:


Slide 31

Memory Hierarchy Design

Reducing Cache Miss Penalty or Miss Rate via Parallelism

Compiler-Controlled Prefetching: compiler inserts prefetch instructions


Slide 32

Memory Hierarchy Design

Reducing Cache Miss Penalty or Miss Rate via Parallelism

Compiler-Controlled Prefetching: An Example

for (i := 0; i < 3; i := i+1)
    for (j := 0; j < 100; j := j+1)
        a[i][j] := b[j][0] * b[j+1][0];

16-byte blocks, 8-KB cache, 1-way, write back, 8-byte elements. What kind of locality, if any, exists for a and b?

a: 3 rows and 100 columns; spatial locality: even-indexed elements miss and odd-indexed elements hit, leading to 3*100/2 = 150 misses.

b: 101 rows and 3 columns; no spatial locality, but there is temporal locality: the same element is used in the ith and (i+1)st iterations, and the same elements are accessed in every i iteration. 100 misses for i = 0 plus 1 miss for j = 0, for a total of 101 misses.

Assuming a large penalty (50 cycles), at least 7 iterations must be prefetched ahead. Splitting the loop into two, we have:


Slide 33

Memory Hierarchy Design

Reducing Cache Miss Penalty or Miss Rate via Parallelism

Compiler-Controlled Prefetching: An Example (continued)

for (j := 0; j < 100; j := j+1) {
    prefetch(b[j+7][0]);
    prefetch(a[0][j+7]);
    a[0][j] := b[j][0] * b[j+1][0];
};

for (i := 1; i < 3; i := i+1)
    for (j := 0; j < 100; j := j+1) {
        prefetch(a[i][j+7]);
        a[i][j] := b[j][0] * b[j+1][0];
    }

Assuming that each iteration of the pre-split loop takes 7 cycles and that there are no conflict or capacity misses, the original loop consumes a total of 7*300 + 251*50 = 14,650 cycles (total iteration cycles plus total cache-miss cycles), whereas the split loops consume a total of (1+1+7)*100 + (4+7)*50 + (1+7)*200 + (4+4)*50 = 3,450 cycles.


Slide 34

Memory Hierarchy Design

Reducing Cache Miss Penalty or Miss Rate via Parallelism

Compiler-Controlled Prefetching: An Example (continued)

• the first loop consumes 9 cycles per iteration (due to the two prefetch instructions);
• the second loop consumes 8 cycles per iteration (due to the single prefetch instruction);
• during the first 7 iterations of the first loop, array a incurs 4 cache misses and array b incurs 7 cache misses;
• during the first 7 iterations of the second loop, array a incurs 4 cache misses each for i = 1 and i = 2;
• array b does not incur any cache misses in the second split!


Slide 35

Memory Hierarchy Design

First Hit Time Reduction Technique: Small and Simple Caches

• Smaller is faster: a small index and less address translation time; a small cache can fit on the same chip as the CPU.
• Low associativity: in addition to a simpler, shorter tag check, a 1-way (direct-mapped) cache allows overlapping the tag check with the transmission of the data, which is not possible with any higher associativity!

Second Hit Time Reduction Technique: Avoid Address Translation During Indexing

Make the common case fast: use virtual addresses for the cache, since most memory accesses (more than 90%) are satisfied in the cache, resulting in a virtual cache.


Slide 36

Memory Hierarchy Design

Second Hit Time Reduction Technique: Avoid Address Translation During Indexing

Make the common case fast: there are at least three important performance aspects that directly relate to virtual-to-physical translation:

1) improperly organized or insufficiently sized TLBs may create excess not-in-TLB faults, adding to program execution time;

2) for a physically addressed cache, the TLB access must occur before the cache access, extending the cache access time;

3) two line addresses (e.g., an I-line and a D-line address) may be independent of each other in the virtual address space yet collide in the physical address space, when they draw on pages whose lower page-address bits (and upper cache-address bits) are identical.

Problems with virtual caches:

1) Page-level protection must be enforced no matter what during address translation (solution: copy the protection information from the TLB on a miss and hold it in a field for future virtual indexing/tagging).

2) When a process is switched in or out, the entire cache has to be flushed, because the physical addresses will be different each time, i.e., the context-switching problem (solution: a process-identifier tag, PID).


Slide 37

Memory Hierarchy Design

Second Hit Time Reduction Technique: Avoid Address Translation During Indexing

Problems with virtual caches (continued):

3) Different virtual addresses may refer to the same physical address, i.e., the problem of synonyms/aliases.
• HW solution: guarantee every cache block a unique physical address.
• SW solution: force aliases to share some address bits (e.g., page coloring).
• Alternative: virtually indexed and physically tagged caches.

Third Hit Time Reduction Technique: Pipelined Cache Writes

The solution is to reduce the clock cycle time and increase the number of stages, which increases instruction throughput.

Fourth Hit Time Reduction Technique: Trace Caches

Find a dynamic sequence of instructions, including taken branches, to load into a cache block:
• put traces of the executed instructions into cache blocks, as determined by the CPU;
• branch prediction is folded into the cache and must be validated along with the addresses to have a valid fetch;
• disadvantage: the same instructions may be stored multiple times.


Slide 38

Memory Hierarchy Design

Main Memory and Organizations for Improving Performance


Slide 39

Memory Hierarchy Design

Main Memory and Organizations for Improving Performance

a) Wider main memory bus: the cache miss penalty decreases proportionally. Cost:
i. wider bus (x n) and multiplexer (x n),
ii. expandability (x n), and
iii. more expensive error correction.

b) Simple interleaved memory: potential parallelism with multiple DRAMs; the address is sent to multiple banks, which are accessed in parallel, but the data are transmitted sequentially (4 + 24 + 4x4 = 44 cycles; 16/44 = 0.4 bytes/cycle). A sketch of the bank mapping follows this list.

c) Independent memory banks.
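A sketch of the conventional interleaving function (assumed here, since the slide's figure was lost): consecutive word addresses rotate across the banks.

#include <stdio.h>

int main(void) {
    unsigned banks = 4;  /* hypothetical number of banks */
    for (unsigned word = 0; word < 8; word++)
        printf("word %u -> bank %u, row %u\n",
               word, word % banks, word / banks);
    return 0;
}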


Slide 40

Memory Hierarchy Design

Main Memory and Organizations for Improving Performance


Slide 41

Memory Hierarchy Design

Main Memory and Organizations for Improving Performance


Slide 42

Memory Hierarchy Design

Virtual Memory


Slide 43

Memory Hierarchy Design

Virtual Memory


Slide 44

Memory Hierarchy Design

Virtual Memory


Slide 45

Memory Hierarchy Design

Virtual Memory

Fast address translation, an example: the Alpha 21264 data TLB

» The ASN is used as a PID for virtual caches;
» the TLB is not flushed on a context switch, but only when ASNs are recycled;
» fully associative placement.


Slide 46

Memory Hierarchy Design

Virtual Memory

What is the optimal page size? It depends:

• page table size is proportional to 1/(page size), so larger pages mean smaller page tables;

• a large page size makes a virtual cache possible (avoiding the alias problem), thus reducing cache hit time;

• transfer of larger pages (over the network) is more efficient;

• a small TLB favors larger pages, since fewer entries then map more memory;

• main drawbacks of a large page size:
– internal fragmentation: wasted storage;
– process startup time: large context-switching overhead.


Slide 47

Memory Hierarchy Design

Summarizing Virtual Memory & Caches

A hypothetical memory hierarchy going from virtual address to L2 cache access:


Slide 48

Memory Hierarchy Design

Protection and Examples of Virtual Memory

The invention of multiprogramming led to the need to share computer resources such as the CPU, memory, and I/O among multiple programs, whose instantiations are called "processes".

Time-sharing of computer resources by multiple processes requires that processes take turns using those resources, and the designers of operating systems and computers must ensure that the switching among different processes, also called "context switching", is done correctly:

a) The computer designer must ensure that the CPU portion of the process state can be saved and restored;

b) the operating system designer must guarantee that processes do not interfere with each other's computations.

Protecting processes:

a) Base and bound: each process falls in a predefined portion of the address space; that is, an address is valid if Base <= Address <= Bound, where the OS defines and keeps the values of Base and Bound in two registers (a sketch of the check follows).
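The check itself is trivial, which is the appeal of base-and-bound; a C sketch:

#include <stdbool.h>
#include <stdint.h>

/* Valid iff Base <= Address <= Bound. The base and bound values live in
   two registers that the OS writes and the user process cannot. */
static bool address_valid(uint64_t addr, uint64_t base, uint64_t bound) {
    return base <= addr && addr <= bound;
}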


Slide 49

Memory Hierarchy Design

Protection and Examples of Virtual Memory

The computer designer's responsibilities in helping the OS designer protect processes from each other:

a) Providing two modes, to distinguish a user process from a kernel process (equivalently, a supervisor or executive process);

b) providing a portion of the CPU state, including the base/bound registers, the user/kernel mode bit(s), and the exception enable/disable bit, that a user process can use but cannot write; and

c) providing mechanisms by which the CPU can switch between user mode and supervisor mode.

While base-and-bound constitutes the minimum protection system, virtual memory offers a more fine-grained alternative to this simple model:

a) Address translation provides an opportunity to check for any possible violations: the read/write and user/kernel signals from the CPU are checked against the permission flags marked on individual pages by the virtual memory system (or OS) to detect stray memory accesses;

b) depending on the designer's concerns, protection can be either relaxed or escalated. In escalated protection, multiple levels of access permissions can be used, much like the military classification system.


Slide 50

Memory Hierarchy Design

Protection and Examples of Virtual Memory

A Paged Virtual Memory Example: the Alpha Memory Management and the 21264 TLB (one for instructions and one for data)

a) A combination of segmentation and paging, with 48-bit virtual addresses, while the 64-bit address space is divided into three segments: seg0 (bits 63-47 = 0...00), kseg (bits 63-46 = 0...10), and seg1 (bits 63-46 = 1...11).

b) Advantages: segmentation divides the address space and conserves page table space, while paging provides virtual memory, relocation, and protection.

c) Even with segmentation, the size of the page tables for the 64-bit address space is alarming. The Alpha therefore uses a three-level hierarchical page table, with each page table contained in one page:


Slide 51

Memory Hierarchy Design

Protection and Examples of Virtual Memory

A Paged Virtual Memory Example: the Alpha Memory Management and the 21264 TLB

d) A PTE is 64 bits long; the first 32 bits contain the physical page number, and the other half includes the following five protection fields:
1) Valid: whether the page number is valid for address translation;
2) User read enable: allows user programs to read data within this page;
3) Kernel read enable: allows kernel programs to read data within this page;
4) User write enable: allows user programs to write data within this page;
5) Kernel write enable: allows kernel programs to write data within this page.

e) The current Alpha design has 8-KB pages, thus allowing 1024 PTEs in each page table. The three page-level fields and the page offset account for 10+10+10+13 = 43 bits of the 64-bit virtual address. The 21 bits to the left of the level-1 field are all 0s for seg0 and all 1s for seg1. (A sketch of the three-level lookup follows the table below.)

f) The maximum virtual and physical addresses are tied to the page size, and the Alpha has provisions for future growth: 16-KB, 32-KB, and 64-KB page sizes.

g) The following table shows the memory hierarchy parameters of the Alpha 21264 TLB:

Parameter               Description
Block size              1 PTE (8 bytes)
Hit time                1 clock cycle
Miss penalty (average)  20 clock cycles
TLB size                128 PTEs per TLB, each of which can map 1, 8, 64, or 512 pages
Block selection         Round-robin
Write strategy          (not applicable)
Block placement         Fully associative
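A C sketch of the three-level lookup implied by (e): 10 bits of index per level plus a 13-bit offset for 8-KB pages. The PTE layout is simplified to a valid bit plus a physical page number in one half, per (d), and read_pte() is a hypothetical stand-in for a physical memory read.

#include <stdint.h>

#define LVL_BITS  10   /* 1024 PTEs per page-table page */
#define OFF_BITS  13   /* 8-KB pages */

/* Hypothetical stand-in: fetch the PTE at 'index' in the page-table page
   whose physical page number is 'table_ppn'. */
extern uint64_t read_pte(uint64_t table_ppn, uint64_t index);

/* Walk level 3 -> 2 -> 1; returns the physical address, or ~0 on a fault. */
uint64_t translate(uint64_t root_ppn, uint64_t vaddr) {
    uint64_t ppn = root_ppn;
    for (int level = 2; level >= 0; level--) {
        uint64_t idx = (vaddr >> (OFF_BITS + level * LVL_BITS))
                       & ((1u << LVL_BITS) - 1);
        uint64_t pte = read_pte(ppn, idx);
        if (!(pte & 1))               /* valid bit clear: page fault */
            return ~0ull;
        ppn = pte >> 32;              /* PPN in one half of the PTE (layout simplified) */
    }
    return (ppn << OFF_BITS) | (vaddr & ((1u << OFF_BITS) - 1));
}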


Slide 52

Memory Hierarchy Design

Protection and Examples of Virtual Memory

A Segmented Virtual Memory Example: Protection in the Intel Pentium

a) The Pentium has four protection levels, with the innermost level (0) corresponding to the Alpha's kernel mode and the outermost level (3) corresponding to the Alpha's user mode. Separate stacks are used for each level to avoid security breaches between levels.

1) A user can call an OS routine and pass parameters to it while retaining full protection;

2) the OS can maintain the protection level of the called routine for the parameters that are passed to it;

3) the potential loophole in protection is prevented by not allowing the user process to ask the OS to access something indirectly that the process would not have been able to access itself (such security loopholes are called Trojan horses).

b) Bounds checking and memory mapping in the Pentium use a descriptor table (DT, which plays the role of the page tables in the Alpha). The equivalent of a PTE in the DT is a segment descriptor, containing the following fields:

1) Present bit: equivalent to the PTE valid bit; used to indicate that this is a valid translation;

2) Base field: equivalent to a page frame address; contains the physical address of the first byte of the segment;

3) Access bit: like the reference bit or use bit in some architectures; helpful for replacement algorithms;

4) Attributes field: specifies the valid operations and protection levels for operations that use this segment;

5) Limit field: not found in paged systems; establishes the upper bound of valid offsets for this segment.


Slide 53

Memory Hierarchy Design

Crosscutting Issues: The Design of Memory Hierarchies

Superscalar CPUs and the number of ports to the cache:
The cache must provide sufficient peak bandwidth to benefit from multiple issue. Some processors increase the complexity of instruction fetch by allowing the instructions to be issued to be found on any boundary, instead of, say, multiples of 4 words.

Speculative execution and the memory system:
Speculative and conditional instructions generate exceptions (by generating invalid addresses) that would otherwise not occur, which in turn can overwhelm the benefits of speculation with exception-handling overhead. Such CPUs must be matched with non-blocking caches and must speculate only on L1 misses (due to the unbearable penalty of L2 misses).

Combining the instruction cache with the instruction fetch and decode mechanisms:
The increasing demand for ILP and clock rate has led to merging the first part of instruction execution with the instruction cache, by incorporating a trace cache (which combines branch prediction with instruction fetch) and storing the internal RISC operations in the trace cache (e.g., the Pentium 4's NetBurst microarchitecture). A hit in the merged cache saves a portion of the instruction-execution cycles.

Embedded computer caches and real-time performance:
In real-time applications, variation in performance matters much more than average performance. Thus caches, which offer an average performance enhancement, have to be used carefully. Instruction caches are often used because of the high predictability of instruction accesses, whereas data caches are "locked down", forcing them to act as small scratchpad memories under program control.

Embedded computer caches and power:
It is much more power-efficient to access on-chip memory than off-chip memory (which requires driving the pins and buses and activating external memory chips). Other techniques, such as way prediction, can also be used to save power (by powering only half of a two-way set-associative cache).

I/O and consistency of cached data:
The cache coherence problem must be addressed when I/O devices also share the cached data.


Slide 54

Memory Hierarchy Design

An Example of the Cache Coherence Problem


Slide 55

Memory Hierarchy Design

Putting It All Together: the Alpha 21264 Memory Hierarchy

» The instruction cache is virtually indexed and virtually tagged; the data cache is virtually indexed but physically tagged.

» Operations:

1. The chip loads instructions serially from an external PROM and loads configuration information for the L2 cache.

2. It executes preloaded code in PAL mode to initialize, e.g., to update the TLB.

3. Once the OS is ready, it sets the PC to the appropriate address in seg0.

4. 9 index bits + 1 way-prediction bit + 2 bits selecting a 4-instruction group = 12 address bits are sent to the I-cache; 48 - 9 - 6 = 33 bits form the virtual tag.

5. Way prediction and line prediction (an 11-bit pointer to the next 16-byte group on a miss, updated by branch prediction) are used to reduce I-cache latency.

6. On an instruction cache hit, the next way and line prediction is loaded to read the next block (step 3).

7. An instruction cache miss leads to a check of the I-TLB and the prefetcher (steps 4-7), or an access to the L2 cache if the instruction address is not found there (step 8).

8. The L2 cache is direct mapped, 1-16 MB.


Slide 56

Memory Hierarchy Design

Putting It All Together: the Alpha 21264 Memory Hierarchy

9. The instruction prefetcher does not rely on the TLB for address translation; it simply increments the physical address of the miss by 64 bytes, checking to make sure that the new address is within the same page. Prefetching is suppressed if the new address is outside the page (step 14).

10. If the instruction is not found in L2, the physical address command is sent to the ES40 system chip set via four consecutive transfer cycles on a narrow, 15-bit outbound address bus (step 15). The address and command take 8 CPU cycles. The CPU is connected through a crossbar to one of two 256-bit memory buses (step 16).

11. The total penalty of the instruction miss is about 130 CPU cycles for the critical instructions, while the rest of the block is filled at a rate of 8 bytes per 2 CPU cycles (step 17).

12. A "victim file" is associated with L2, a write-back cache, to store a replaced (victim) block (step 18). The address of the victim is sent out on the system address bus following the address of the new request (step 19); the system chip set later extracts the victim data and writes them to memory.

13. The D-cache is a 64-KB, two-way set-associative, write-back cache, which is virtually indexed but physically tagged. While the 9-bit index + 3-bit word selection is sent to index the required data (step 24), the virtual page number is being translated by the D-TLB (step 23), which is fully associative and has 128 PTEs, each of which can represent page sizes from 8 KB to 4 MB (step 25).


Slide 57

Memory Hierarchy Design

Putting It All Together: the Alpha 21264 Memory Hierarchy

14. A TLB miss traps to PAL (privileged architecture library) code to load the valid PTE for this address. In the worst case a page fault happens, in which case the OS brings the page in from disk while the context is switched.

15. The index field of the address is sent to both sets of the data cache (step 26). Assuming a TLB hit, the two tags and valid bits are compared with the physical page number (steps 27-28), with a match sending the desired 8 bytes to the CPU (step 29).

16. A miss in the D-cache goes to the L2 cache, which proceeds similarly to an instruction miss (step 30), except that it must check the victim buffer to make sure the block is not there (step 31).

17. A write-back victim can be produced on a data cache miss. The victim data are extracted from the data cache simultaneously with the fill of the data cache with the L2 data, and are sent to the victim buffer (step 32).

18. In case of an L2 miss, the fill data from the system are written directly into the (L1) data cache (step 33). The L2 is written only with L1 victims (step 34).


Slide 58

Memory Hierarchy Design

Another View: The Emotion Engine of the Sony Playstation 2

3 Cs captured by the cache for SPEC2000 (left) and multimedia applications (right)


Slide 59

Memory Hierarchy Design

Another View: The Emotion Engine of the Sony Playstation 2