Transcript of Computer Architecture TDTS10, lecture 3 (TDTS10/info/lectures/TDTS10_LE3.pdf)
Computer Architecture TDTS10
Erik Larsson, Department of Computer Science
Outline
! Components of the memory system
! The memory hierarchy
! Cache memories and organization
! Paging and virtual memory
2
Memory system
3
[Figure: CPU connected to main memory, input and output devices, and secondary memory]
Program execution
4
[Figure: the CPU fetches an instruction from main memory, then executes it; data and control signals pass between the CPU and main memory]
Program execution
Instruction sequence for Z:=(Y+X)*3

Address   Instruction
00001000  0000101110001011
00001001  0001101110000011
00001010  0010100000011011
00001011  0001001110010011
5
Program execution
6
0000101110001011 = MOVE Y,R3
(1) Get the instruction at address 00001000
(2) Move the instruction 0000101110001011 to the CPU
(3) Decode the instruction: 00001 - MOVE; 01110001 - address; 011 - register 3
(4) Get the data at address 01110001
(5) Store the data in register number 3
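The decode step above can be sketched in code. A minimal Python sketch, assuming the 16-bit layout implied by the slide (5-bit opcode, 8-bit address, 3-bit register); the opcode table is hypothetical beyond MOVE:

```python
def decode(instr_bits: str):
    """Split a 16-bit instruction word into opcode, address and register
    fields, per the assumed 5/8/3-bit layout from the slide."""
    opcode = instr_bits[:5]          # bits 1-5: operation
    address = instr_bits[5:13]       # bits 6-13: memory address
    reg = int(instr_bits[13:], 2)    # bits 14-16: register number
    mnemonic = {"00001": "MOVE"}.get(opcode, "UNKNOWN")
    return mnemonic, address, reg

print(decode("0000101110001011"))  # ('MOVE', '01110001', 3)
```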
Memory from processor side
! The CPU issues a command and an address, and expects data in return
! Example: READ (ADR) -> DATA
7
[Figure: the processor sends Read and address 0100 to memory; memory returns the data 1010]

Address Data
0000    1110
0001    1101
0010    1011
0011    0111
0100    1010
0101    0101
0110    0101
0111    1010
1000    1110
1001    1111
1010    0000
1011    1000
1100    0100
1101    0010
1110    0001
1111    1111
Memory
! "640Kbyte ought to be enough for anybody."-- Bill Gates, 1981 (MP3 music ! 1000 Kbyte per minute)
! Memory is where instructions and data for programs are stored
! Primary memory and secondary memory
! Typically, primary memory is volatile: as soon as power is turned off, the content is lost, while secondary memory is non-volatile
8
Memories
! Main memory (or primary memory)
! Volatile - memory content lost at power off
! Random Access Memories (RAM)
! Secondary memory (or external memory)
! Non-volatile - memory content kept at power off
! Hard disk
! Other
! CD, DVD, magnetic tapes
9
Main memory
10
[Figure: CPU connected to main memory]
Secondary memory
11
[Figure: hard disk structure - platters, tracks, cylinders, sectors and blocks; read/write heads on a moving arm, positioned by a head motor; platters spun by a drive motor]
12
Disc arrangements
! A sector = 512 bytes; a block = x*512 bytes
! A file is larger; a file is divided into blocks
! Problem: fragmentation
! Disc scheduling:
! shortest seek time - from head position
! elevator algorithm - move back and forth
! one-way elevator - move in one direction
! Storage:
! contiguous storing - blocks n, n+1, ...
! linked list - block n points at block n+1
! disk index - a block with pointers (FAT)
! file index (Unix) - Solaris: block = 8K, block number 4 bytes -> 2048 pointers per block; standard Unix: superblock with disc info, files, free space; inode: 10 direct pointers, 10 double direct, 10 triple direct
13
Inode
14
Disc arrangement
! Q: Assume a program of 1025 bytes; how many blocks (of 512 bytes) are needed?
! A: 3 blocks. 2 blocks only cover 2*512 = 1024 bytes. Hence, 1536 bytes are needed to store a program of size 1025 bytes. The lost bytes are internal fragmentation.
! Q: Given that 3 blocks are needed, which blocks should be selected?
! A: Several alternatives exist (defined by the OS). (1) Place all blocks next to each other. The drawback is that it can be difficult to find space; there could be external fragmentation. The solution is to make use of pointers: one common block can contain all pointers, or each block points out the next.
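The worked answer above can be checked with a short calculation; a minimal Python sketch (the 512-byte block size and the 1025-byte program are from the slide):

```python
import math

def blocks_needed(file_size: int, block_size: int = 512):
    """Blocks required for a file, plus the internal fragmentation
    (space allocated in the last block but not used)."""
    blocks = math.ceil(file_size / block_size)
    wasted = blocks * block_size - file_size
    return blocks, wasted

print(blocks_needed(1025))  # (3, 511): 1536 bytes allocated, 511 bytes lost
```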
15
Flash memory
• Invented by Dr. Fujio Masuoka (Toshiba) around 1980
• Flash memory is often found in mobile phones, cameras, MP3 players and in computers
• Characteristics: non-volatile and random access
• Capacity: up to 256 GB
• Block 0: bad blocks
• Block 1: bootable block
• Limited number of program/erase cycles
16
Memory system design
! What do we need?
! We need memory to fit very large programs and to work at a speed comparable to that of the microprocessors.
! Main problem:
! microprocessors work at a very high rate and they need large memories;
! memories are much slower than microprocessors.
! Facts:
! the larger a memory, the slower it is;
! the faster the memory, the greater the cost/bit.
17
Memory hierarchy
18
Memory hierarchy
Some typical characteristics (Wikipedia, as of 2006):
! Processor registers:
! 8-32 registers of 32 bits each = 128 bytes
! access time = a few nanoseconds, 0-1 clock cycles
! On-chip cache memory (L1):
! capacity = 32 to 128 Kbytes
! access time = ~10 nanoseconds, 3 clock cycles
! Off-chip cache memory (L2):
! capacity = 128 Kbytes to 12 Mbytes
! access time = tens of nanoseconds, 10 clock cycles
! Main memory:
! capacity = 256 Mbytes to 4 Gbytes
! access time = ~100 nanoseconds, 100 clock cycles
! Hard disk:
! capacity = 1 Gbyte to 1 Tbyte
! access time = tens of milliseconds, 10 000 000 clock cycles
19
Note: approx. 10 times difference between levels (size and speed)

Cache memory
20
Distribute memory content; not all is needed at all times.
Small & fast cache + large & slow main memory -> behaves as a large & fast memory.
The principle can be extended to any part of the memory hierarchy.
Based on: the principle of locality.
Updates are important.
Hit/miss ratio -> important for the design.
Cache memory
! A cache is smaller than main memory
! Items in the cache should be available when needed
! It works due to: locality of references
! During execution, memory references tend to cluster:
! once an area of the program is entered, there are repeated references to a small set of instructions (loop, subroutine) and data (components of a data structure, local variables or parameters on the stack)
! Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon
! Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon
21
Cache - principle
22
[Figure: a drawer with four boxes holding items 1, 2, ..., 16]
Cache - locality of references
Example: SW
while (x<1000) { x=x+1; printf("x=%i",x); }
while (y<500) { y=y+1; printf("y=%i",y); }
23
Example: Drawer
In summer, keep summer clothing in the drawer; in winter, keep winter clothing in the drawer.
Cache - principle
24
[Figure: a drawer with four boxes holding items 1, 2, ..., 16]
! How to place clothing in the drawer?
! We want to minimize search time and to utilize all space in the drawer
Cache memory
! Assume a main memory with 32 bytes and a cache memory with 16 bytes. 32=2^5 -> 5 address lines.
! Where to place data with high addresses?
25
Main memory:
Adr (dec)  Adr (bin)  Data
0          00000      A1
1          00001      A2
2          00010      A3
3          00011      A4
4          00100      A5
5          00101      A6
6          00110      A7
7          00111      A8
8          01000      A9
9          01001      A0
10         01010      B1
11         01011      B2
12         01100      B3
13         01101      B4
14         01110      B5
15         01111      B6
16         10000      B7
17         10001      B8
18         10010      B9
19         10011      B0
20         10100      C1
21         10101      C2
22         10110      C3
23         10111      C4
24         11000      C5
25         11001      C6
26         11010      C7
27         11011      C8
28         11100      C9
29         11101      C0
30         11110      D1
31         11111      D2

Cache memory:
Adr (dec)  Adr (bin)  Data
0          0000       A1
1          0001       A2
2          0010       A3
3          0011       A4
4          0100       A5
5          0101       A6
6          0110       A7
7          0111       A8
8          1000       A9
9          1001       A0
10         1010       B1
11         1011       B2
12         1100       B3
13         1101       B4
14         1110       B5
15         1111       B6
Cache memory
! Assume a main memory with 32 bytes and a cache memory with 16 bytes. 32=2^5 -> 5 address lines.
! Where to place data with high addresses?
! Assume cache lines of 4 bytes
26
(Main memory contents as in the table on the previous slide.)

Cache (lines of 4 bytes):
Tag  Line (bin)  00  01  10  11
0    00          A1  A2  A3  A4
1    01          C1  C2  C3  C4
0    10          A9  A0  B1  B2
0    11          B3  B4  B5  B6

Address 10111 is split as: tag 1, line 01, byte 11
Want to use address 00111?
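The address split used in this example can be sketched as code. A Python sketch, assuming the slide's 5-bit address with a 1-bit tag, a 2-bit line field and a 2-bit byte field:

```python
def split_address(addr: int):
    """Split a 5-bit address into (tag, line, byte) for a cache with
    four lines of four bytes each."""
    byte = addr & 0b11           # low 2 bits: byte within the line
    line = (addr >> 2) & 0b11    # next 2 bits: cache line
    tag = (addr >> 4) & 0b1      # top bit: which half of main memory
    return tag, line, byte

print(split_address(0b10111))  # (1, 1, 3): tag 1, line 01, byte 11
print(split_address(0b00111))  # (0, 1, 3): same line, other tag -> a miss if tag 1 is cached
```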
Cache organization
! Example:
! cache of 64 Kbytes (2^16 bytes)
! main memory of 16 Mbytes (2^24 bytes)
! data transfer between cache and main memory is in blocks of 4 bytes
27
[Figure: main memory holds blocks of 4 bytes (Byte1..Byte4), numbered 0 to 2^22-1; the cache holds lines of 4 bytes, numbered 0 to 2^14-1; how should 24-bit addresses map blocks to lines?]
Direct mapping
28
Address (24 bits) = tag (8 bits) | line (14 bits) | byte (2 bits)
[Figure: main memory holds 2^22 blocks of 4 bytes; each of the 2^14 cache lines stores an 8-bit tag and 4 bytes; the stored tag is compared with the address tag -> hit/miss]
A memory block is mapped to a unique cache line
+ simple, cheap
- little flexibility
Cache - direct mapping
29
[Figure: a drawer with four boxes holding items 1, 2, ..., 16]
! Item 1 to 4 goes in drawer box 1, item 5 to 8 goes in drawer box 2, and so on
! Advantage: easy and fast to find out where an item should go
! Disadvantage: if one wants both item 1 and item 2, they compete for box 1 even if boxes 2, 3 and/or 4 are empty
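Direct mapping, where each block has exactly one possible line, can be sketched as a tiny simulator. A Python sketch assuming the 8/14/2-bit address split of the 64-Kbyte cache example; data is omitted and only tags are tracked:

```python
TAG_BITS, LINE_BITS, BYTE_BITS = 8, 14, 2   # 24-bit addresses, as in the example

cache = {}  # line number -> stored tag (data omitted for brevity)

def access(addr: int) -> bool:
    """Return True on a hit; on a miss, load the block into its unique line."""
    line = (addr >> BYTE_BITS) & ((1 << LINE_BITS) - 1)
    tag = addr >> (BYTE_BITS + LINE_BITS)
    hit = cache.get(line) == tag
    cache[line] = tag   # direct mapping: the new block replaces the old one
    return hit

print(access(0x000004))  # False: cold miss
print(access(0x000005))  # True: same 4-byte block
print(access(0x010004))  # False: same line, different tag (conflict miss)
```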
Set associative mapping
30
Address (24 bits) = tag (9 bits) | set (13 bits) | byte (2 bits)
[Figure: main memory holds 2^22 blocks of 4 bytes; the cache has 2^13 sets; each line stores a 9-bit tag and 4 bytes; tags in the selected set are compared -> hit/miss]
2-way: two lines per set, chosen by a replacement algorithm
2/4/8-way associative organizations are common
Short tag, fast, quite simple
Cache - set associative mapping
31
[Figure: a drawer with four boxes holding items 1, 2, ..., 16]
! Item 1 to 8 goes in drawer box 1 or box 2; item 9 to 16 goes in drawer box 3 or box 4
! Quite easy and fast to find out where an item should go
! Possible to store both item 1 and item 2 (at the cost of a little more checking)
Associative mapping
32
Address (24 bits) = tag (22 bits) | byte (2 bits)
[Figure: main memory holds 2^22 blocks of 4 bytes; any block can go into any of the 2^14 cache lines; each line stores a 22-bit tag and 4 bytes; all tags are compared -> hit/miss]
Any cache line can be used, chosen by a replacement algorithm
+ flexible
- slow, complex
Cache - associative mapping
33
[Figure: a drawer with four boxes holding items 1, 2, ..., 16]
! Item 1 to 16 goes in drawer box 1 or box 2 or box 3 or box 4
! Have to search all boxes to find out where an item is stored
! Possible to store any four items (at the cost of more checking)
Replacement algorithms
! When a new block is to be placed in the cache, one of the blocks stored in the cache lines has to be replaced
! Replacement:
! direct mapping: no choice
! set-associative mapping: candidate lines are in the selected set
! associative mapping: all lines of the cache are potential candidates
! Replacement strategies:
! Random replacement
! First-in-first-out (FIFO)
! Least recently used (LRU)
! Least frequently used (LFU)
34
FIFO - the block longest in the cache is replaced
LRU - the block longest in the cache without being referenced is replaced
LFU - the block with the fewest references is replaced
Replacement algorithms are implemented in HW (for efficiency)
LRU is the most efficient: relatively simple to implement and good results
FIFO is simple to implement
Random is the simplest to implement, and results are surprisingly good
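LRU replacement within one set can be sketched with an ordered dictionary. A minimal Python sketch for a single 2-way set (an illustration only; as noted above, real caches implement this in hardware):

```python
from collections import OrderedDict

class LRUSet:
    """One 2-way set-associative set, tracking tags in recency order."""
    def __init__(self, ways: int = 2):
        self.ways = ways
        self.tags = OrderedDict()   # oldest entry first = least recently used

    def access(self, tag: int) -> bool:
        hit = tag in self.tags
        if hit:
            self.tags.move_to_end(tag)          # refresh: most recently used
        else:
            if len(self.tags) == self.ways:
                self.tags.popitem(last=False)   # evict the least recently used
            self.tags[tag] = True
        return hit

s = LRUSet()
print([s.access(t) for t in [1, 2, 1, 3, 2]])
# [False, False, True, False, False]: tag 3 evicts tag 2 (the LRU), so 2 misses again
```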
Write strategies
! Problem: keep memory content consistent
! Concepts:
! Write-through - cache writes are immediately updated in MM
! Write-through with buffers - cache writes are buffered and MM is periodically updated
! Copy-back - cache and MM are not coherent; MM is updated at replacement
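The difference between write-through and copy-back can be sketched with a toy one-block cache (an illustrative assumption, not any specific hardware):

```python
class WritePolicyDemo:
    """Toy cache holding one cached block, to contrast write policies."""
    def __init__(self, policy: str):
        self.policy = policy            # "write-through" or "copy-back"
        self.mm = {0: 10}               # main memory: address -> value
        self.cache = dict(self.mm)      # cached copy of the block
        self.dirty = False

    def write(self, addr: int, value: int):
        self.cache[addr] = value
        if self.policy == "write-through":
            self.mm[addr] = value       # MM updated immediately
        else:
            self.dirty = True           # copy-back: MM updated only at replacement

    def replace(self):
        if self.policy == "copy-back" and self.dirty:
            self.mm.update(self.cache)  # write the dirty block back
            self.dirty = False

wt = WritePolicyDemo("write-through"); wt.write(0, 99)
cb = WritePolicyDemo("copy-back");     cb.write(0, 99)
print(wt.mm[0], cb.mm[0])  # 99 10: copy-back MM is stale until replacement
cb.replace()
print(cb.mm[0])            # 99
```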
35
Memory content is distributed in cache and MM

Separate data & instruction cache
36
Unified cache: easier to design; automatic adjustment of the sizes of IC and DC
Harvard cache: fetch an instruction at the same time as fetching data
It is difficult to know the right sizes of IC & DC in advance
Some cache architectures
! Intel 80486 - introduced 1989
! a single on-chip cache of 8 Kbytes
! line size: 16 bytes
! 4-way set associative organization
! Pentium - introduced 1993
! two on-chip caches, for data and instructions
! each cache: 8 Kbytes
! line size: 32 bytes (64 bytes in Pentium 4)
! 2-way set associative organization (4-way in Pentium 4)
! PowerPC 601 - introduced 1993
! a single on-chip cache of 32 Kbytes
! line size: 32 bytes
! 8-way set associative organization
37
Some cache architectures
! PowerPC 603
! two on-chip caches, for data and instructions
! each cache: 8 Kbytes
! line size: 32 bytes
! 2-way set associative organization (simpler cache organization than the 601 but stronger processor)
! PowerPC 604
! two on-chip caches, for data and instructions
! each cache: 16 Kbytes
! line size: 32 bytes
! 4-way set associative organization
! PowerPC 620
! two on-chip caches, for data and instructions
! each cache: 32 Kbytes
! line size: 64 bytes
! 8-way set associative organization
38
Apple's iMac (1998-2008)
! As of 2008:
! Cache: 6 MB
! RAM: 1-4 GB
! Hard drive: 250-1000 GB
! (Processor speed: 2.4-3.2 GHz)
! As of 1998:
! Cache: 512 KB
! RAM: 32-128 MB
! Hard drive: 4 GB
! (Processor speed: 233 MHz)
39
The numbers increase - the problems remain.
AMD Athlon 64 CPU
40
The K8 has 4 specialized caches: an instruction cache, an instruction TLB, a data TLB, and a data cache. The K8 also has multiple-level caches.
Summary
! A memory system has to fit large programs and provide fast access
! A hierarchical memory system can provide the needed performance, based on locality of reference
! Cache memory is an essential part of the memory system
! Caches can be organized with direct mapping, set-associative mapping, and associative mapping
! In order to decide which block to replace, different strategies can be used: random, LRU, FIFO, LFU, etc.
! The cache is kept coherent with write-through, write-through with buffered writes, and copy-back
41
Memory system
42
[Figure: CPU connected to main memory, input and output devices, and secondary memory]
Previously: registers - main memory. Now: main memory - secondary memory.
Memory system design
! What do we need?
! We need memory to fit very large programs and to work at a speed comparable to that of the microprocessors.
! Main problem:
! microprocessors work at a very high rate and they need large memories;
! memories are much slower than microprocessors.
! Facts:
! the larger a memory, the slower it is;
! the faster the memory, the greater the cost/bit.
43
Memory usage
44
Fragmentation = percentage of memory that is unavailable for allocation although it is not in use
Paging
! Divide memory into fixed-size pages
! Allocate pages to frames in memory
! The OS manages pages
! Moves, removes, reallocates
! Pages are copied to and from disk
45
Paging
46
Paging
! The "hole-fitting problem" vanishes!
! Logical memory is contiguous
! Physical memory is not required to be
! Eliminates external fragmentation
! But: complicates address lookup
47
Paging
48
Virtual memory
49
The address space needed and seen by programs is usually much larger than the available main memory.
Only one part of the program fits into main memory; the rest is stored on secondary memory (hard disk).
In order for code to be executed or data to be accessed, a certain segment of the program first has to be loaded into main memory; in this case it has to replace another segment already in memory.
Movement of programs and data between main memory and secondary storage is performed automatically by the operating system. These techniques are called virtual-memory techniques.
The binary address issued by the processor is a virtual (logical) address; it refers to a virtual address space, much larger than the physical one available in main memory.
Virtual memory
50
Virtual memory
51
Address translation is performed by the MMU using a page table.
Example:
Virtual memory space: 2 Gbytes (31 address bits; 2^31 = 2G)
Physical memory space: 16 Mbytes (2^24 = 16M)
Page length: 2 Kbytes (2^11 = 2K)
-> Total number of pages: 2^20 = 1M
Total number of frames: 2^24 / 2^11 = 2^13
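The translation in this example can be sketched as code. A Python sketch using the slide's numbers (2-Kbyte pages, so an 11-bit offset); the page-table contents below are hypothetical, and a real MMU would also check control bits and raise a page fault on a missing entry:

```python
PAGE_BITS = 11  # 2-Kbyte pages, per the example

def translate(vaddr: int, page_table: dict) -> int:
    """Map a virtual address to a physical one: look up the virtual page
    number and keep the offset within the page unchanged."""
    vpn = vaddr >> PAGE_BITS                 # virtual page number (up to 2^20)
    offset = vaddr & ((1 << PAGE_BITS) - 1)  # offset within the 2K page
    frame = page_table[vpn]                  # physical frame number (up to 2^13)
    return (frame << PAGE_BITS) | offset

page_table = {0: 5, 1: 2}                    # hypothetical page -> frame mapping
print(hex(translate(0x0803, page_table)))    # 0x1003: page 1 -> frame 2, offset 3
```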
Virtual memory
52
The hardware unit which is responsible for translation of a virtual address into a physical one is the Memory Management Unit (MMU)
Replacement algorithms
! When a new block is to be placed in the cache, one of the blocks stored in the cache lines has to be replaced
! Replacement:
! direct mapping: no choice
! set-associative mapping: candidate lines are in the selected set
! associative mapping: all lines of the cache are potential candidates
! Replacement strategies:
! Random replacement
! First-in-first-out (FIFO)
! Least recently used (LRU)
! Least frequently used (LFU)
53
FIFO - the block longest in the cache is replaced
LRU - the block longest in the cache without being referenced is replaced
LFU - the block with the fewest references is replaced
Replacement algorithms are implemented in HW (for efficiency)
LRU is the most efficient: relatively simple to implement and good results
FIFO is simple to implement
Random is the simplest to implement, and results are surprisingly good
Similar to cache: main memory is smaller than secondary memory

Translation Look-Aside Buffers (TLB)
54
The page table has one entry for each page of the virtual memory space.
Each entry of the page table also includes some control bits which describe the status of the page:
if the page is in MM
if the page is modified
statistics - when it was used
To speed up translation -> insert a TLB
The page table -> distributed (cache, MM, secondary memory)
Demand paging
! The pages of a program are stored on disk; at any time, only a few pages have to be stored in main memory
! The operating system is responsible for loading/replacing pages
! A page is loaded only when a page fault occurs
55
Thrashing
! Degree of multiprogramming (set by the scheduler) ⇒ number of page faults
! Number of page faults ⇒ CPU utilization
56
[Figure: CPU utilization as a function of the degree of multiprogramming]
Summary
! A memory system has to fit large programs and provide fast access
! A hierarchical memory system can provide the needed performance, based on locality of reference
! Fragmentation can be avoided by paging
! Virtual memory: the programmer sees a larger main memory
! Demand paging: only needed pages are loaded
! The MMU translates a logical address to a physical one
! The page table may be distributed: TLB (cache), main memory, secondary memory
57
www.liu.se