CS356 Midterm IICaches: Data Access 9.6 FF1E 1111 1111 0001 1110 FF3E 1111 1111 0011 1110 FF4E 1111...

9.1

CS356 Midterm II

Review

Reminder on Page Faults

Consequences

To remember:

TLB ⇒ MM (not the reverse)

CACHE ⇒ MM (not the reverse)

TLB and CACHE are independent

Means: page in MM (hit in PT)

9.5

Caches

Memory: addresses of m bits ⇒ M = 2^m memory locations

Cache:

● S = 2^s cache sets

● Each set has K lines

● Each line has: data block of B = 2^b bytes, valid bit, t = m − (s+b) tag bits

How to check if the word at an address is in the cache?
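The check can be sketched in C as follows: split the address into tag, set index, and block offset, then compare the tag against every valid line of the selected set. The names `cache_addr`, `cache_line`, `split_address`, and `set_contains` are hypothetical, chosen only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of an address as seen by a cache with 2^s sets and 2^b-byte blocks. */
struct cache_addr {
    uint64_t tag;     /* the remaining high-order t = m - (s + b) bits */
    uint64_t set;     /* s bits selecting the set */
    uint64_t offset;  /* b bits selecting the byte within the block */
};

/* Per-line bookkeeping; the data block itself is omitted in this sketch. */
struct cache_line {
    bool valid;
    uint64_t tag;
};

/* Split addr into tag / set index / block offset. Assumes s + b < 64. */
static struct cache_addr split_address(uint64_t addr, unsigned s, unsigned b)
{
    struct cache_addr f;
    f.offset = addr & ((UINT64_C(1) << b) - 1);
    f.set    = (addr >> b) & ((UINT64_C(1) << s) - 1);
    f.tag    = addr >> (s + b);
    return f;
}

/* A word is cached if one of the K lines of its set is valid with a matching tag. */
static bool set_contains(const struct cache_line *lines, unsigned K, uint64_t tag)
{
    for (unsigned i = 0; i < K; i++)
        if (lines[i].valid && lines[i].tag == tag)
            return true;
    return false;
}
```

For example, with s = 1 and b = 5 (illustrative values), address 0x00f4 splits into tag 0x3, set 1, offset 0x14.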

9.6

Caches: Data Access

FF1E  1111 1111 0001 1110

FF3E  1111 1111 0011 1110

FF4E  1111 1111 0100 1110

9.7

Caches: Data Access

9.8

Caches: Data Access

9.9

Caches: Data Access

Average Access Time = (Hit Time) + (Miss Ratio) ⨯ (Miss Penalty)

9.10

Caches: Data Access

8.11

Cache Operation Example

Access Trace

– R: 0x00a0

– W: 0x00f4

– R: 0x00b0

– W: 0x2a2c

Possible Operations: Hit, or fetch block X (possibly with “evict block Y” and “WB of Y” if Y is dirty)

Break down address and decide operations for K=2-way set-associative, N=4, B=32 bytes/block (S = N/K = 2)

Access       Cache Operation
R: 0x00a0    Fetch block 00a0-00bf
W: 0x00f4    Fetch block 00e0-00ff
R: 0x00b0    Hit
W: 0x2a2c    Evict 00e0-00ff with WB, fetch block 2a20-2a3f
Done!        Final WB of 2a20-2a3f

Address   Tag            Set   Byte Offset
0x00a0    0000 0000 10   1     0 0000
0x00f4    0000 0000 11   1     1 0100
0x00b0    0000 0000 10   1     1 0000
0x2a2c    0010 1010 00   1     0 1100

Set 1 after each access (LRU)
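The trace above can be replayed with a small simulator. This is a minimal sketch assuming a write-back, write-allocate policy with LRU replacement (the policy the example uses); the type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { WAYS = 2, SETS = 2, BLOCK_BYTES = 32 };  /* K = 2, S = 2, B = 32 */

struct line {
    bool valid, dirty;
    uint32_t tag;
    unsigned last_used;   /* time of the most recent access, for LRU */
};

struct cache {
    struct line sets[SETS][WAYS];
    unsigned clock;
};

struct outcome {
    bool hit;         /* block was already cached */
    bool evicted;     /* a valid block had to be replaced */
    bool writeback;   /* the replaced block was dirty */
    uint32_t victim;  /* first byte address of the replaced block, if any */
};

static struct outcome cache_access(struct cache *c, uint32_t addr, bool is_write)
{
    uint32_t block = addr / BLOCK_BYTES;
    uint32_t set = block % SETS, tag = block / SETS;
    struct line *ways = c->sets[set];
    struct line *target = NULL;
    struct outcome out = { false, false, false, 0 };
    c->clock++;

    /* Hit check: a valid line in this set with the same tag. */
    for (int w = 0; w < WAYS; w++)
        if (ways[w].valid && ways[w].tag == tag) { target = &ways[w]; out.hit = true; }

    if (!target) {
        /* Miss: use an empty way if there is one, otherwise the LRU way. */
        target = &ways[0];
        for (int w = 0; w < WAYS; w++) {
            if (!ways[w].valid) { target = &ways[w]; break; }
            if (ways[w].last_used < target->last_used) target = &ways[w];
        }
        if (target->valid) {
            out.evicted = true;
            out.writeback = target->dirty;
            out.victim = (target->tag * SETS + set) * BLOCK_BYTES;
        }
        target->valid = true;
        target->dirty = false;
        target->tag = tag;
    }
    if (is_write) target->dirty = true;   /* write-back: memory is updated on eviction */
    target->last_used = c->clock;
    return out;
}
```

Starting from a zero-initialized `struct cache`, the four accesses give two fetches, one hit, and one eviction of block 00e0-00ff with write-back, matching the table.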

9.12

Caches: Trace Simulation

You are asked to optimize a cache capable of storing 8 bytes total for the given references. There are three direct-mapped cache designs possible by varying the block size:

● C1 has one-byte blocks

● C2 has two-byte blocks

● C3 has four-byte blocks

In terms of miss rate, which cache design is best?

If the miss stall time is 25 cycles, and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the best cache design?

Trace (LSB)

1     0000 0001
134   1000 0110
212   1101 0100
1     0000 0001
135   1000 0111
213   1101 0101
162   1010 0010
161   1010 0001
2     0000 0010
44    0010 1100
41    0010 1001
221   1101 1101

9.13

Caches: Trace Simulation on C1

Trace

MEM   LSB         C1      C2     C3
1     0000 0001   1m      0m     0m
134   1000 0110   6m      3m     1m
212   1101 0100   4m      2m     1m
1     0000 0001   1h      0h     0h
135   1000 0111   7m      3h     1m
213   1101 0101   5m      2h     1m
162   1010 0010   2m      1m     0m
161   1010 0001   1m      0m     0h
2     0000 0010   2m      1m     0m
44    0010 1100   4m      2m     1m
41    0010 1001   1m      0m     0m
221   1101 1101   5m      2m     1m
m_rate:           11/12   9/12   10/12
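The second question (which design is best once access time is included) follows from the average access time formula. A minimal sketch, assuming the 25-cycle miss stall time is used as the miss penalty; the function name is hypothetical.

```c
/* Average access time in cycles: hit time + miss ratio * miss penalty. */
static double average_access_time(double hit_time, double miss_ratio,
                                  double miss_penalty)
{
    return hit_time + miss_ratio * miss_penalty;
}
```

With the miss rates from the table: C1 gives 2 + (11/12)·25 ≈ 24.92 cycles, C2 gives 3 + (9/12)·25 = 21.75 cycles, and C3 gives 5 + (10/12)·25 ≈ 25.83 cycles, so under this assumption C2 is the best design on both criteria.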

9.14

Caches: Trace Simulation on C2

9.15

Caches: Trace Simulation on C3

Address breakdown

● C1 has no block offset, 3-bit set address

● C2 has 1-bit block offset, 2-bit set address

● C3 has 2-bit block offset, 1-bit set address

How to run a trace: extract the set address (3, 2, 1 bits) from the LSB; on a miss, load (1, 2, 4) bytes.

Running C3:

● Get 1: miss. Put bytes 0-3 in bucket 0.

● Get 134: miss. Put 132-135 in bucket 1.

● Get 212: miss. Put 212-215 in bucket 1.

● Get 1: hit.

● Get 135: miss. Put 132-135 in bucket 1.

● Get 213: miss. Put 212-215 in bucket 1.

● Get 162: miss. Put 160-163 in bucket 0.

● Get 161: hit.
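The same procedure can be run mechanically for all three designs. The sketch below is one possible direct-mapped simulator for an 8-byte cache; `count_misses` and `TRACE` are hypothetical names, and each bucket remembers the full block number, which is equivalent to comparing tags within a set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { CACHE_BYTES = 8 };

/* The reference trace from the exercise (byte addresses). */
static const uint8_t TRACE[] = { 1, 134, 212, 1, 135, 213, 162, 161, 2, 44, 41, 221 };

/* Count the misses of a direct-mapped cache holding CACHE_BYTES bytes in
 * total, with blocks of block_bytes bytes (1, 2, 4, or 8). */
static int count_misses(const uint8_t *trace, size_t n, unsigned block_bytes)
{
    unsigned sets = CACHE_BYTES / block_bytes;
    bool valid[CACHE_BYTES] = { false };
    unsigned cached_block[CACHE_BYTES] = { 0 };  /* block number held by each set */
    int misses = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned block = trace[i] / block_bytes;  /* drop the block offset bits */
        unsigned set = block % sets;              /* low bits of the block number */
        if (!valid[set] || cached_block[set] != block) {
            misses++;                             /* load the block on a miss */
            valid[set] = true;
            cached_block[set] = block;
        }
    }
    return misses;
}
```

Calling `count_misses(TRACE, 12, 1)`, `count_misses(TRACE, 12, 2)`, and `count_misses(TRACE, 12, 4)` reproduces the miss counts of C1, C2, and C3 (11, 9, and 10 out of 12).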


9.16

Loop over a Matrix, by row

Example: each cache line holds 4 array elements

stored in registers (temporal locality)

hopefully in cache (spatial locality)

9.17

Loop over a Matrix, by col

Example: each cache line holds 4 array elements

stored in registers (temporal locality)

hopefully in cache (spatial locality)
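The two loop orders can be written as follows. This is an illustrative sketch: the 4x4 size and the function names are assumptions, and the locality comments describe typical behavior for a row-major C array rather than a guarantee for every cache.

```c
enum { ROWS = 4, COLS = 4 };

/* By row: the inner loop walks adjacent addresses, so after one miss the
 * next elements of the same cache line are hits (spatial locality).
 * sum, i, and j are reused every iteration and can stay in registers
 * (temporal locality). */
static long sum_by_row(int m[ROWS][COLS])
{
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += m[i][j];
    return sum;
}

/* By column: consecutive accesses are COLS elements apart, so each one
 * typically lands on a different cache line; a line may be evicted
 * before its other elements are used. */
static long sum_by_col(int m[ROWS][COLS])
{
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += m[i][j];
    return sum;
}
```

Both functions return the same sum; only the order of memory accesses differs.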

9.18

Single-Level Page Tables

points to a different table for each process

9.19


9.20


9.21


8-bit virtual addresses, 10-bit physical addresses, 32-byte pages (3-bit VPN, 5-bit page offset, 5-bit PPN)

● Physical address of virtual address 0x2D? 00101101 => 0 0011 1100 1101

● Physical address of virtual address 0x7A? 01111010 => 0 0000 1101 1010

● Physical address of virtual address 0xEF? 11101111 => page fault (entry 7 is not valid)

● Physical address of virtual address 0xA8? 10101000 => 0 0011 1110 1000

Index   Valid   PPN
0       0       0x0E
1       1       0x1E
2       1       0x16
3       1       0x06
4       0       0x0B
5       1       0x1F
6       0       0x15
7       0       0x0A
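The translation steps for this table can be sketched in C: take the top 3 bits as the VPN, look up the entry, and concatenate the PPN with the 5-bit offset. The names `pte`, `PAGE_TABLE`, and `translate` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum { OFFSET_BITS = 5, NUM_PAGES = 8 };  /* 32-byte pages, 3-bit VPN */

struct pte {
    bool valid;
    uint8_t ppn;
};

/* The page table from the slide, indexed by VPN. */
static const struct pte PAGE_TABLE[NUM_PAGES] = {
    { false, 0x0E }, { true, 0x1E }, { true, 0x16 }, { true, 0x06 },
    { false, 0x0B }, { true, 0x1F }, { false, 0x15 }, { false, 0x0A },
};

/* Translate an 8-bit virtual address into a 10-bit physical address.
 * Returns false (page fault) if the entry is not valid. */
static bool translate(uint8_t va, uint16_t *pa)
{
    unsigned vpn = va >> OFFSET_BITS;
    unsigned offset = va & ((1u << OFFSET_BITS) - 1);
    if (!PAGE_TABLE[vpn].valid)
        return false;
    *pa = (uint16_t)((PAGE_TABLE[vpn].ppn << OFFSET_BITS) | offset);
    return true;
}
```

For example, 0x2D has VPN 1 and offset 0x0D, so it maps to (0x1E << 5) | 0x0D = 0x3CD.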

9.22

Multi-level Page Tables

9.23

Page Table with 3 Levels

9.24

Page Table with 3 Levels

9.25

Translation Lookaside Buffer

A k-level page table requires k memory accesses in the worst case.

Idea: cache address mappings inside the CPU (10 ns hit time).

● VPN is the cache tag, PPN is the entire cache block

● High degree of associativity (4-way or fully-associative: low miss rate)

● Usually smaller than data cache (fast lookup, low hit time)

Average Access Time = (Hit Time) + (Miss Rate) ⨯ (Miss Penalty)

9.26

Example: 2-way set-associative TLB

16-bit virtual and physical addresses, 256-byte pages

● Physical address of virtual address 0x7E85 == 0111 1110 1000 0101

● Virtual address of physical address 0x3020 == 0011 0000 0010 0000

Set Index   Valid Bit   Tag    PPN
0           1           0x13   0x30
0           0           0x34   0x58
1           0           0x1F   0x80
1           1           0x2A   0x72
2           1           0x1F   0x95
2           0           0x20   0xAA
3           1           0x3F   0x20
3           0           0x3E   0xFF
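Both questions can be worked through in code. This sketch assumes the table is read as 4 sets of 2 entries each, so the 8-bit VPN splits into a 6-bit tag and a 2-bit set index; the names `tlb_entry`, `TLB`, `tlb_translate`, and `tlb_reverse` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum { PAGE_OFFSET_BITS = 8, TLB_SET_BITS = 2, TLB_SETS = 4, TLB_WAYS = 2 };

struct tlb_entry {
    bool valid;
    uint8_t tag;
    uint8_t ppn;
};

static const struct tlb_entry TLB[TLB_SETS][TLB_WAYS] = {
    { { true,  0x13, 0x30 }, { false, 0x34, 0x58 } },
    { { false, 0x1F, 0x80 }, { true,  0x2A, 0x72 } },
    { { true,  0x1F, 0x95 }, { false, 0x20, 0xAA } },
    { { true,  0x3F, 0x20 }, { false, 0x3E, 0xFF } },
};

/* Virtual to physical: returns false on a TLB miss, in which case the
 * page table would have to be consulted. */
static bool tlb_translate(uint16_t va, uint16_t *pa)
{
    unsigned vpn = va >> PAGE_OFFSET_BITS;
    unsigned offset = va & ((1u << PAGE_OFFSET_BITS) - 1);
    unsigned set = vpn & (TLB_SETS - 1);
    unsigned tag = vpn >> TLB_SET_BITS;
    for (int w = 0; w < TLB_WAYS; w++) {
        if (TLB[set][w].valid && TLB[set][w].tag == tag) {
            *pa = (uint16_t)((TLB[set][w].ppn << PAGE_OFFSET_BITS) | offset);
            return true;
        }
    }
    return false;
}

/* Physical to virtual: search all valid entries for the PPN and rebuild
 * the VPN from the entry's tag and the index of the set it sits in. */
static bool tlb_reverse(uint16_t pa, uint16_t *va)
{
    unsigned ppn = pa >> PAGE_OFFSET_BITS;
    unsigned offset = pa & ((1u << PAGE_OFFSET_BITS) - 1);
    for (unsigned set = 0; set < TLB_SETS; set++) {
        for (int w = 0; w < TLB_WAYS; w++) {
            if (TLB[set][w].valid && TLB[set][w].ppn == ppn) {
                unsigned vpn = ((unsigned)TLB[set][w].tag << TLB_SET_BITS) | set;
                *va = (uint16_t)((vpn << PAGE_OFFSET_BITS) | offset);
                return true;
            }
        }
    }
    return false;
}
```

Under that reading, 0x7E85 has VPN 0x7E (tag 0x1F, set 2) and translates to 0x9585, while physical address 0x3020 matches the valid entry with tag 0x13 in set 0 and corresponds to virtual address 0x4C20.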

9.27

TLB == Subset of Page Table

9.28

Addressing: Cache, VM, TLB

9.29

Addressing: Cache, VM, TLB

9.30

Structs: Offsets in assembly

Assume 4-byte int / float, 8-byte long / double.

Can you figure out the offsets for %rdi ?

struct record_t {
    char a[2];
    int b;
    long c;
    int d[3];
    short e;
};

void initialize(struct record_t *x) {
    x->a[1] = 1;
    x->b = 2;
    x->c = 3;
    x->d[1] = 4;
    x->e = 5;
}

initialize:
    movb $1, 1(%rdi)
    movl $2, 4(%rdi)
    movq $3, 8(%rdi)
    movl $4, 20(%rdi)
    movw $5, 28(%rdi)
    ret

Memory layout (8 bytes per row, - is padding):

a  a  -  -  b  b  b  b
c  c  c  c  c  c  c  c
d0 d0 d0 d0 d1 d1 d1 d1
d2 d2 d2 d2 e  e  -  -
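The offsets can be checked with `offsetof` from `<stddef.h>`. A minimal sketch, assuming the sizes stated on the slide (4-byte int, 8-byte long, as on x86-64 Linux); `print_record_offsets` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdio.h>

struct record_t {
    char a[2];
    int b;
    long c;
    int d[3];
    short e;
};

/* Print the offset of every member and the total size of the struct. */
static void print_record_offsets(void)
{
    printf("a: %zu\n", offsetof(struct record_t, a));
    printf("b: %zu\n", offsetof(struct record_t, b));
    printf("c: %zu\n", offsetof(struct record_t, c));
    printf("d: %zu\n", offsetof(struct record_t, d));
    printf("e: %zu\n", offsetof(struct record_t, e));
    printf("sizeof: %zu\n", sizeof(struct record_t));
}
```

Element d[1] sits at the offset of d plus one int, which is where the 20(%rdi) in the assembly comes from.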

9.31

struct B { // this struct must start/end at a multiple of 4, because that's required by 'y'

char x; // 1 byte

int y; // 4 bytes (needs 3 bytes of padding before to start at a multiple of 4)

char z; // 1 byte (needs 3 bytes of padding after to end at a multiple of 4)

};

struct A {

char a; // 1 byte

struct B b; // has 4-byte alignment: 3 bytes of padding before 'b'

char c; // also 3 bytes of padding before 'c', so that 'b' ends at a multiple of 4

};

void init(struct A *a) {

a->a = 1;

a->b.x = 2;

a->b.y = 3;

a->b.z = 4;

a->c = 5;

}

$ gcc -fomit-frame-pointer -mno-red-zone -Og -S align.c; cat align.s | grep mov

movb $1, (%rdi)

movb $2, 4(%rdi)

movl $3, 8(%rdi)

movb $4, 12(%rdi)

movb $5, 16(%rdi)

Nested Structs

Memory layout (8 bytes per row, - is padding):

a - - - x - - -
y y y y z - - -
c - - -

We still want each member of the nested struct to start at a multiple of its size, but where should the nested struct itself start?

Its start/end should have the largest alignment required by its members.
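The rule can be replayed by hand: round each offset up to the member's alignment, and round the total size up to the struct's alignment. The sketch below recomputes the layout of struct A that way, assuming a 4-byte int with 4-byte alignment; `align_up` and the `expected_*` helpers are hypothetical names, and `_Alignof` requires C11.

```c
#include <stddef.h>

struct B { char x; int y; char z; };
struct A { char a; struct B b; char c; };

/* Round offset up to the next multiple of align (align > 0). */
static size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) / align * align;
}

/* 'b' comes after the 1-byte 'a' and must start at a multiple of B's alignment. */
static size_t expected_offset_of_b(void)
{
    return align_up(sizeof(char), _Alignof(struct B));
}

/* 'c' follows 'b', whose size already includes its own tail padding. */
static size_t expected_offset_of_c(void)
{
    return align_up(expected_offset_of_b() + sizeof(struct B), _Alignof(char));
}

/* The whole struct is padded to a multiple of its own alignment. */
static size_t expected_size_of_a(void)
{
    return align_up(expected_offset_of_c() + sizeof(char), _Alignof(struct A));
}
```

With those assumptions the helpers give 4, 16, and 20, which agree with `offsetof(struct A, b)`, `offsetof(struct A, c)`, and `sizeof(struct A)`, and with the displacements in the assembly listing.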

9.32

Struct Alignment

9.33

Unions

• Unions allow you to read/write the same memory region as variables with different types

– All elements start at offset 0

– The size of the union is simply the size of the biggest member

– Elements must be POD (plain old data) or at least default-constructible

union Data1 {
    int x;
    char y;
};

union Data2 {
    short w;
    char *p;
};

int main() {
    union Data1 item;
    item.x = 0x356;
    item.y = 'a';
}

Layout: in Data1, x and y both start at offset 0. In Data2 (w/o padding), w spans offsets 0 to 2 and p also starts at offset 0.

After item.x = 0x356: item = 56 03 00 00

After item.y = 'a': item = 61 03 00 00

Recall x86 uses little-endian

CS:APP 3.9.2
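The overlap can be observed directly by reading x back after writing y. A small sketch; `overwrite_low_byte` is a hypothetical helper, and the value it returns depends on byte order, so the comment applies to little-endian machines such as x86-64.

```c
#include <stddef.h>

union Data1 { int x; char y; };
union Data2 { short w; char *p; };

/* Write x, overwrite the byte at offset 0 through y, then read x again. */
static int overwrite_low_byte(void)
{
    union Data1 item;
    item.x = 0x356;
    item.y = 'a';    /* 'a' is 0x61 in ASCII; it replaces the byte 0x56 */
    return item.x;   /* 0x361 on a little-endian machine */
}
```

`sizeof(union Data1)` equals `sizeof(int)` and `sizeof(union Data2)` equals `sizeof(char *)`, since each union is as large as its biggest member.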

9.34

Unions: Revealing Endianness

• 4-byte union

• x reads/writes an int

• bytes reads/writes 4 consecutive char

Note that bytes are stored in reversed order

#include <stdio.h>

union int_bytes {
    int x;
    char bytes[4];
};

int main() {
    union int_bytes ib;
    ib.x = 256;
    printf("%08X is %02X %02X %02X %02X\n",
           ib.x, ib.bytes[3], ib.bytes[2], ib.bytes[1], ib.bytes[0]);
}

// prints:
// 00000100 is 00 00 01 00

9.35

Return-oriented Programming

9.36


9.37


A64.38

Arithmetic/Logic Operations

A different style!

• Always operate between registers/immediates (not memory)

• Three operands: OP dst, src1, src2 means dst = src1 OP src2

Examples

x1 0x11111111 11111111 (initial state)

x2 0x22002200 22002200

x3 0x33003300 00330033

• “add x1, x2, x3” assigns x2+x3 to x1, like “x1 = x2 + x3”

x1 0x55005500 22332233 after add x1,x2,x3

• “add w1, w2, w3” assigns w2+w3 to w1, sets MS 32 bits to 0

x1 0x00000000 22332233 after add w1,w2,w3
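The difference between the x and w forms can be modeled in C with fixed-width integers. This is only a model of the register semantics described above, not compiler output; `add_x` and `add_w` are hypothetical names.

```c
#include <stdint.h>

/* Model of "add x1, x2, x3": full 64-bit addition. */
static uint64_t add_x(uint64_t x2, uint64_t x3)
{
    return x2 + x3;
}

/* Model of "add w1, w2, w3": add only the low 32 bits; writing w1
 * clears the upper 32 bits of x1. */
static uint64_t add_w(uint64_t x2, uint64_t x3)
{
    uint32_t w1 = (uint32_t)x2 + (uint32_t)x3;
    return (uint64_t)w1;
}
```

With the register values from the example, `add_x` returns 0x5500550022332233 and `add_w` returns 0x0000000022332233.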

A64.39

Arithmetic/Logic Operations

Instruction                      Example                  Effect
Add immediate                    add x0, x1, 1            x0 = x1 + 1
Add register                     add x0, x1, x2           x0 = x1 + x2
Add shifted register (or imm.)   add x0, x1, x2, lsl 10   x0 = x1 + (x2 << 10)
Subtract                         sub x0, x1, x2           x0 = x1 - x2
Subtract shifted                 sub x0, x1, x2, lsl 10   x0 = x1 - (x2 << 10)
Negate                           neg x0, x1               x0 = -x1
Multiply                         mul x0, x1, x2           x0 = x1 * x2
Multiply-add                     madd x0, x1, x2, x4      x0 = x1 * x2 + x4
Divide signed                    sdiv x0, x1, x2          x0 = x1 / x2
Divide unsigned                  udiv x0, x1, x2          x0 = x1 / x2
Logical shift left               lsl x0, x1, x2           x0 = x1 << (x2 % 64)
Logical shift right              lsr x0, x1, x2           x0 = x1 >> (x2 % 64)
Arithmetic shift right           asr x0, x1, x2           x0 = x1 (signed) >> (x2 % 64)
Rotate bits from LSB to MSB      ror x0, x1, x2           x0 = x1 >>> (x2 % 64)
Bitwise AND                      and x0, x1, x2           x0 = x1 & x2
Bitwise OR                       orr x0, x1, x2           x0 = x1 | x2
Bitwise XOR                      eor x0, x1, x2           x0 = x1 ^ x2
Bitwise NOT (“move not”)         mvn x0, x1               x0 = ~x1

A64.40

Flexible Operands

Shift or Rotate Operand2

• add, sub, and bitwise ops (and, orr, eor, mvn) allow an optional shift of the last argument; the bitwise ops also allow a rotation:

OP dst, src1, src2, LSL n means dst = src1 OP (src2 << n)

OP dst, src1, src2, LSR n means dst = src1 OP (src2 >> n)

OP dst, src1, src2, ASR n means dst = src1 OP (src2 s>> n)

OP dst, src1, src2, ROR n means dst = src1 OP (src2 >>> n) (bitwise ops only)

Example

x1 0x11111111 11111111 (initial state)

x2 0x22002200 22002200

x3 0x33003300 00330033

• “add x1, x2, x3, lsl 32” assigns x2+(x3<<32) to x1

x1 0x22332233 22002200 after add x1,x2,x3,lsl 32

• “orr w1, w2, w3, ror 8” assigns w2|(w3>>>8) to w1

x1 0x00000000 33003300 after orr w1,w2,w3,ror 8
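The shifted and rotated operands can be modeled in C to check such examples by hand. A minimal sketch with hypothetical helper names; it models the arithmetic only and does not say which instructions accept which shift type.

```c
#include <stdint.h>

/* src1 + (src2 << n) on 64-bit values, for n in 0..63. */
static uint64_t add_lsl(uint64_t src1, uint64_t src2, unsigned n)
{
    return src1 + (src2 << n);
}

/* Rotate a 32-bit value right by n bits (the ">>>" used on the slide). */
static uint32_t ror32(uint32_t v, unsigned n)
{
    n &= 31;
    return n ? (v >> n) | (v << (32 - n)) : v;
}
```

For the register values above, x3 << 32 is 0x0033003300000000, so x2 + (x3 << 32) evaluates to 0x2233223322002200, and rotating w3 = 0x00330033 right by 8 bits gives 0x33003300.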

A64.41

Simple Arithmetic: A64 vs x64

// a64

add:

/* return value in w0 */

/* arguments in w0, w1 */

add w0, w0, w1

ret

multadd:

add w0, w0, 10

add w0, w0, w1, lsl 3

ret

bias:

mov w2, 1

lsl w2, w2, w1

sub w2, w2, #1

and w2, w2, w0, asr 31

add w0, w2, w0

ret

// C

int add(int x, int y) {

return x + y;

}

int multadd(int x, int y) {

return 10 + x + 8*y;

}

int bias(int x, int k) {

int bias = (1 << k) - 1;

int mask = x >> 31;

return x + (mask & bias);

}

// x64

add:

leal (%rdi,%rsi), %eax

ret

multadd:

leal 10(%rdi,%rsi,8), %eax

ret

bias:

movl $1, %eax

movl %esi, %ecx

sall %cl, %eax

subl $1, %eax

movl %edi, %edx

sarl $31, %edx

andl %edx, %eax

addl %edi, %eax

ret

A64.42

Memory Load/Store

A different style!

• In x86-64: mov to read/write memory, suffix must match

• A64 has dedicated instructions without size suffix (size inferred from the register name)

– ldr x1, [x2] to load into x1 the 8 bytes at address x2

– str x1, [x2] to store 8 bytes of x1 to address x2

• Additional instructions to load/store register pairs

– ldp x0, x1, [x2] 8 bytes at x2 => x0, 8 bytes at x2+8 => x1

– stp x0, x1, [x2] write x0 to address x2, x1 to address x2+8

• Moves are only between registers

– mov x0, x1 is equivalent to x0 = x1

A64.43

Working with Pointers: A64 vs x64

// a64

get:

ldr w0, [x0]

ret

set:

str w1, [x0]

ret

// C

int get(int *ptr) {

return *ptr;

}

void set(int *ptr, int x) {

*ptr = x;

}

// x64

get:

movl (%rdi), %eax

ret

set:

movl %esi, (%rdi)

ret

A64.44

Calling Procedures

• Arguments in x0, .., x7 then on the stack

• Return value in x0

• Caller-save registers x0 to x18

• Callee-save registers x19 to x29

• Callee saves link register x30 if it invokes a procedure...

Call and Return Mechanisms

• Branch with link (bl) sets the link register x30 to PC+4

– PC is the address of the current instruction (each is 4 bytes)

• Return (ret) jumps to the address in x30 (lr)

– Can also use “ret x0” or with any other register

Calling Conventions
