9.1
CS356 Midterm II
Review
Reminder on Page Faults
Consequences
To remember:
● TLB ⇒ MM (not the reverse)
● CACHE ⇒ MM (not the reverse)
● TLB and CACHE are independent
MM means: page in main memory (hit in the page table, PT)
9.5
Caches
Memory: addresses of m bits ⇒ M = 2^m memory locations
Cache:
● S = 2^s cache sets
● Each set has K lines
● Each line has: a data block of B = 2^b bytes, a valid bit, and t = m − (s+b) tag bits
How to check if the word at an
address is in the cache?
9.6
Caches: Data Access
FF1E  1111 1111 0001 1110
FF3E  1111 1111 0011 1110
FF4E  1111 1111 0100 1110
9.7
Caches: Data Access
9.8
Caches: Data Access
9.9
Caches: Data Access
Average Access Time = (Hit Time) + (Miss Ratio) ⨯ (Miss Penalty)
9.10
Caches: Data Access
8.11
Cache Operation Example
Access Trace
– R: 0x00a0
– W: 0x00f4
– R: 0x00b0
– W: 0x2a2c
Possible Operations: Hit, or fetch block X (possibly with “evict block Y” and “WB of Y” if Y is dirty)
Break down address and decide operations for
K=2-way set-associative, N=4 lines, B=32 bytes/block (S = N/K = 2)
Access      Cache Operation
R: 0x00a0   Fetch block 00a0-00bf
W: 0x00f4   Fetch block 00e0-00ff
R: 0x00b0   Hit
W: 0x2a2c   Evict 00e0-00ff with WB, fetch block 2a20-2a3f
Done!       Final WB of 2a20-2a3f
Address   Tag            Set   Byte Offset
0x00a0    0000 0000 10   1     0 0000
0x00f4    0000 0000 11   1     1 0100
0x00b0    0000 0000 10   1     1 0000
0x2a2c    0010 1010 00   1     0 1100
Set 1 after each access (LRU)
9.12
Caches: Trace Simulation
You are asked to optimize a cache capable of
storing 8 bytes total for the given references. There
are three direct-mapped cache designs possible by
varying the block size:
● C1 has one-byte blocks
● C2 has two-byte blocks
● C3 has four-byte blocks
In terms of miss rate, which cache design is best?
If the miss stall time is 25 cycles, and C1 has an
access time of 2 cycles, C2 takes 3 cycles, and C3
takes 5 cycles, which is the best cache design?
Trace (LSB)
1 0000 0001
134 1000 0110
212 1101 0100
1 0000 0001
135 1000 0111
213 1101 0101
162 1010 0010
161 1010 0001
2 0000 0010
44 0010 1100
41 0010 1001
221 1101 1101
9.13
Caches: Trace Simulation on C1
Trace
MEM   LSB         C1   C2   C3
1     0000 0001   1m   0m   0m
134   1000 0110   6m   3m   1m
212   1101 0100   4m   2m   1m
1     0000 0001   1h   0h   0h
135   1000 0111   7m   3h   1m
213   1101 0101   5m   2h   1m
162   1010 0010   2m   1m   0m
161   1010 0001   1m   0m   0h
2     0000 0010   2m   1m   0m
44    0010 1100   4m   2m   1m
41    0010 1001   1m   0m   0m
221   1101 1101   5m   2m   1m
m_rate:           11/12  9/12  10/12
9.14
Caches: Trace Simulation on C2
9.15
Caches: Trace Simulation on C3
Address breakdown
● C1 has no block offset, 3-bit set address
● C2 has 1-bit block offset, 2-bit set address
● C3 has 2-bit block offset, 1-bit set address
How to run a trace: extract set address (3, 2, 1 bits)
from LSB; on miss, load (1, 2, 4) bytes.
Running C3:
● Get 1: miss. Put bytes 0-3 in bucket 0.
● Get 134: miss. Put 132-135 in bucket 1.
● Get 212: miss. Put 212-215 in bucket 1.
● Get 1: hit.
● Get 135: miss. Put 132-135 in bucket 1.
● Get 213: miss. Put 212-215 in bucket 1.
● Get 162: miss. Put 160-163 in bucket 0.
● Get 161: hit.
9.16
Loop over a Matrix, by row
Example: each cache line holds 4 array elements
stored in registers (temporal locality)
hopefully in cache (spatial locality)
9.17
Loop over a Matrix, by col
Example: each cache line holds 4 array elements
stored in registers (temporal locality)
hopefully in cache (spatial locality)
9.18
Single-Level Page Tables
points to a different table for each process
9.19
Single-Level Page Tables
9.20
Single-Level Page Tables
9.21
Single-Level Page Tables
8-bit virtual addresses, 10-bit physical addresses, 32-byte pages (3-bit VPN, 5-bit page offset)
● Physical address of virtual address 0x2D? 0010 1101 => 11 1100 1101 (0x3CD)
● Physical address of virtual address 0x7A? 0111 1010 => 00 1101 1010 (0x0DA)
● Physical address of virtual address 0xEF? 1110 1111 => page fault (entry 7 is not valid)
● Physical address of virtual address 0xA8? 1010 1000 => 11 1110 1000 (0x3E8)
Index Valid PPN
0 0 0x0E
1 1 0x1E
2 1 0x16
3 1 0x06
4 0 0x0B
5 1 0x1F
6 0 0x15
7 0 0x0A
9.22
Multi-level Page Tables
9.23
Page Table with 3 Levels
9.24
Page Table with 3 Levels
9.25
Translation Lookaside Buffer
A k-level page table requires k memory accesses in the worst case.
Idea: cache address mappings inside the CPU (10 ns hit time).
● VPN is the cache tag, PPN is the entire cache block
● High degree of associativity (4-way or fully-associative: low miss rate)
● Usually smaller than data cache (fast lookup, low hit time)
Average Access Time = (Hit Time) + (Miss Rate) ⨯ (Miss Penalty)
9.26
Example: 2-way set-associative TLB
16-bit virtual and physical addresses, 256-byte pages
● Physical address of virtual address 0x7E85 == 0111 1110 1000 0101
● Virtual address of physical address 0x3020 == 0011 0000 0010 0000
Set Index   Valid Bit   Tag    PPN
0           1           0x13   0x30
0           0           0x34   0x58
1           0           0x1F   0x80
1           1           0x2A   0x72
2           1           0x1F   0x95
2           0           0x20   0xAA
3           1           0x3F   0x20
3           0           0x3E   0xFF
9.27
TLB == Subset of Page Table
9.28
Addressing: Cache, VM, TLB
9.29
Addressing: Cache, VM, TLB
9.30
Structs: Offsets in assembly
Assume 4-byte int / float, 8-byte long / double.
Can you figure out the offsets for %rdi ?
struct record_t {
  char a[2];
  int b;
  long c;
  int d[3];
  short e;
};
void initialize(struct record_t *x) {
  x->a[1] = 1;
  x->b = 2;
  x->c = 3;
  x->d[1] = 4;
  x->e = 5;
}
initialize:
  movb $1, 1(%rdi)
  movl $2, 4(%rdi)
  movq $3, 8(%rdi)
  movl $4, 20(%rdi)
  movw $5, 28(%rdi)
  ret
Layout (8 bytes per row, "-" marks padding):
a  a  -  -  b  b  b  b
c  c  c  c  c  c  c  c
d0 d0 d0 d0 d1 d1 d1 d1
d2 d2 d2 d2 e  e  -  -
9.31
struct B { // this struct must start/end at a multiple of 4, because that's required by 'y'
char x; // 1 byte
int y; // 4 bytes (needs 3 bytes of padding before to start at a multiple of 4)
char z; // 1 byte (needs 3 bytes of padding after to end at a multiple of 4)
};
struct A {
char a; // 1 byte
struct B b; // has 4-byte alignment: 3 bytes of padding before 'b'
char c; // also 3 bytes of padding before 'c', so that 'b' ends at a multiple of 4
};
void init(struct A *a) {
a->a = 1;
a->b.x = 2;
a->b.y = 3;
a->b.z = 4;
a->c = 5;
}
$ gcc -fomit-frame-pointer -mno-red-zone -Og -S align.c; cat align.s | grep mov
movb $1, (%rdi)
movb $2, 4(%rdi)
movl $3, 8(%rdi)
movb $4, 12(%rdi)
movb $5, 16(%rdi)
Nested Structs
Layout (8 bytes per row, "-" marks padding):
a - - - x - - -
y y y y z - - -
c - - -
We still want each member of the nested struct to start at a multiple of its size, but where should the nested struct itself start?
Its start/end should have the largest alignment required by its members.
9.32
Struct Alignment
9.33
Unions
• Unions allow you to read/write the same memory region as variables with different types
– All elements start at offset 0
– The size of the union is simply the size of the biggest member
– Elements must be POD (plain old data) or at least default-constructible
union Data1 {
  int x;
  char y;
};
union Data2 {
  short w;
  char *p;
};
int main() {
  union Data1 item;
  item.x = 0x356;
  item.y = 'a';
}
Data1: x and y both start at offset 0
Data2 (w/o padding): w and p both start at offset 0 (w occupies bytes 0-1)
item after item.x = 0x356: 56 03 00 00
item after item.y = 'a': 61 03 00 00
Recall x86 uses little-endian
CS:APP 3.9.2
9.34
Unions: Revealing Endianness
• 4-byte union
• x reads/writes an int
• bytes reads/writes 4 consecutive char
Note that bytes are stored in reversed order (least significant byte first)
#include <stdio.h>
union int_bytes {
  int x;
  char bytes[4];
};
int main() {
  union int_bytes ib;
  ib.x = 256;
  printf("%08X is %02X %02X %02X %02X\n", ib.x, ib.bytes[3], ib.bytes[2], ib.bytes[1], ib.bytes[0]);
}
// prints:
// 00000100 is 00 00 01 00
9.35
Return-oriented Programming
9.36
Return-oriented Programming
9.37
Return-oriented Programming
A64.38
Arithmetic/Logic Operations
A different style!
• Always operate between registers/immediates (not memory)
• Three operands: OP dst, src1, src2 means dst = src1 OP src2
Examples
x1 0x11111111 11111111 (initial state)
x2 0x22002200 22002200
x3 0x33003300 00330033
• “add x1, x2, x3” assigns x2+x3 to x1, like “x1 = x2 + x3”
x1 0x55005500 22332233 after add x1,x2,x3
• “add w1, w2, w3” assigns w2+w3 to w1, sets MS 32 bits to 0
x1 0x00000000 22332233 after add w1,w2,w3
A64.39
Arithmetic/Logic Operations
Instruction                      Example                   Effect
Add immediate                    add x0, x1, 1             x0 = x1 + 1
Add register                     add x0, x1, x2            x0 = x1 + x2
Add shifted register (or imm.)   add x0, x1, x2, lsl 10    x0 = x1 + (x2 << 10)
Subtract                         sub x0, x1, x2            x0 = x1 - x2
Subtract shifted                 sub x0, x1, x2, lsl 10    x0 = x1 - (x2 << 10)
Negate                           neg x0, x1                x0 = -x1
Multiply                         mul x0, x1, x2            x0 = x1 * x2
Multiply-add                     madd x0, x1, x2, x4       x0 = x1 * x2 + x4
Divide signed                    sdiv x0, x1, x2           x0 = x1 / x2
Divide unsigned                  udiv x0, x1, x2           x0 = x1 / x2
Logical shift left               lsl x0, x1, x2            x0 = x1 << (x2 % 64)
Logical shift right              lsr x0, x1, x2            x0 = x1 >> (x2 % 64)
Arithmetic shift right           asr x0, x1, x2            x0 = x1 (signed) >> (x2 % 64)
Rotate bits from LSB to MSB      ror x0, x1, x2            x0 = x1 >>> (x2 % 64)
Bitwise AND                      and x0, x1, x2            x0 = x1 & x2
Bitwise OR                       orr x0, x1, x2            x0 = x1 | x2
Bitwise XOR                      eor x0, x1, x2            x0 = x1 ^ x2
Bitwise NOT (“move not”)         mvn x0, x1                x0 = ~x1
A64.40
Flexible Operands
Shift or Rotate Operand2
• add, sub, bitwise ops (and, orr, eor, mvn) and move (mov) allow optional shift or rotation of the last argument:
OP dst, src1, src2, LSL n means dst = src1 OP (src2 << n)
OP dst, src1, src2, LSR n means dst = src1 OP (src2 >> n)
OP dst, src1, src2, ASR n means dst = src1 OP (src2 s>> n)
OP dst, src1, src2, ROR n means dst = src1 OP (src2 >>> n)
Example
x1 0x11111111 11111111 (initial state)
x2 0x22002200 22002200
x3 0x33003300 00330033
• “add x1, x2, x3, lsl 32” assigns x2+(x3<<32) to x1
x1 0x22332233 22002200 after add x1,x2,x3,lsl 32
• “add w1, w2, w3, ror 8” assigns w2+(w3>>>8) to w1, sets MS 32 bits to 0
x1 0x00000000 55005500 after add w1,w2,w3,ror 8
Note: A64 encodes ROR only for the bitwise ops; for add and sub, assemblers accept LSL, LSR and ASR, so the ror example shows the arithmetic only.
A64.41
Simple Arithmetic: A64 vs x64
// a64
add:
/* return value in w0 */
/* arguments in w0, w1 */
add w0, w0, w1
ret
multadd:
add w0, w0, 10
add w0, w0, w1, lsl 3
ret
bias:
mov w2, 1
lsl w2, w2, w1
sub w2, w2, #1
and w2, w2, w0, asr 31
add w0, w2, w0
ret
// C
int add(int x, int y) {
return x + y;
}
int multadd(int x, int y) {
return 10 + x + 8*y;
}
int bias(int x, int k) {
int bias = (1 << k) - 1;
int mask = x >> 31;
return x + (mask & bias);
}
// x64
add:
leal (%rdi,%rsi), %eax
ret
multadd:
leal 10(%rdi,%rsi,8), %eax
ret
bias:
movl $1, %eax
movl %esi, %ecx
sall %cl, %eax
subl $1, %eax
movl %edi, %edx
sarl $31, %edx
andl %edx, %eax
addl %edi, %eax
ret
A64.42
Memory Load/Store
A different style!
• In x86-64: mov to read/write memory, suffix must match
• A64 has dedicated instructions without size suffix (size inferred from the register name)
– ldr x1, [x2] to load into x1 the 8 bytes at address x2
– str x1, [x2] to store 8 bytes of x1 to address x2
• Additional instructions to load/store register pairs
– ldp x0, x1, [x2] 8 bytes at x2 => x0, 8 bytes at x2+8 => x1
– stp x0, x1, [x2] write x0 to address x2, x1 to address x2+8
• Moves are only between registers
– mov x0, x1 is equivalent to x0 = x1
A64.43
Working with Pointers: A64 vs x64
// a64
get:
ldr w0, [x0]
ret
set:
str w1, [x0]
ret
// C
int get(int *ptr) {
return *ptr;
}
void set(int *ptr, int x) {
*ptr = x;
}
// x64
get:
movl (%rdi), %eax
ret
set:
movl %esi, (%rdi)
ret
A64.44
Calling Procedures
Calling Conventions
• Arguments in x0, .., x7 then on the stack
• Return value in x0
• Caller-save registers x0 to x18
• Callee-save registers x19 to x29
• Callee saves link register x30 if it invokes a procedure...
Call and Return Mechanisms
• Branch with link (bl) sets the link register x30 to PC+4
– PC is the address of the current instruction (each is 4 bytes)
• Return (ret) jumps to the address in x30 (lr)
– Can also use “ret x0” or with any other register