
Transcript of Computer Architecture TDTS10 lecture slides (TDTS10/info/lectures/TDTS10_LE3.pdf)


Computer Architecture TDTS10

Erik Larsson, Department of Computer Science

Outline

- Components of the memory system
- The memory hierarchy
- Cache memories and organization
- Paging and virtual memory

2

Memory system

3

[Figure: memory system - CPU connected to main memory; input and output devices; secondary memory]

Program execution

4

[Figure: program execution - the CPU fetches an instruction from main memory, then executes it; data and control signals pass between the CPU and main memory]

Page 2: Computer Architecture TDTS10TDTS10/info/lectures/TDTS10_LE3.pdfFirst-in-first-out (FIFO)! Least recently used (LRU)! Least frequently used (LFU) 34 FIFO-longest in cache -> replaced

Program execution

Instruction for Z := (Y+X)*3

Address    Instruction
00001000   0000101110001011
00001001   0001101110000011
00001010   0010100000011011
00001011   0001001110010011

5

Program execution

6

0000101110001011 = MOVE Y,R3

(1) Get the instruction at address 00001000
(2) Move the instruction 0000101110001011 to the CPU
(3) Decode the instruction: 00001 = MOVE, 01110001 = address (Y), 011 = register 3
(4) Get the data at address 01110001
(5) Store the data in register number 3
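As a rough illustration of step (3), here is a minimal C sketch (not from the slides) that extracts the 5-bit opcode, 8-bit address and 3-bit register fields used above; the masks and shifts are assumptions about the field layout, not any real instruction set.

```c
#include <stdio.h>

/* Split the 16-bit word 0000101110001011 into the fields used on the slide:
   a 5-bit opcode, an 8-bit address and a 3-bit register number. */
int main(void) {
    unsigned instr   = 0x0B8B;               /* 0000 1011 1000 1011          */
    unsigned opcode  = (instr >> 11) & 0x1F; /* bits 15..11 -> 00001 (MOVE)  */
    unsigned address = (instr >> 3)  & 0xFF; /* bits 10..3  -> 01110001      */
    unsigned reg     =  instr        & 0x07; /* bits 2..0   -> 011 (R3)      */

    printf("opcode=%u address=0x%02X reg=%u\n", opcode, address, reg);
    return 0;
}
```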

Memory from processor side

- The CPU issues a command and an address, and expects data in return
- Example: READ(ADR) -> DATA

7

[Figure: the processor sends Read and the address 0100 to the memory; the memory returns the data 1010]

Address  Data
0000     1110
0001     1101
0010     1011
0011     0111
0100     1010
0101     0101
0110     0101
0111     1010
1000     1110
1001     1111
1010     0000
1011     1000
1100     0100
1101     0010
1110     0001
1111     1111
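A minimal sketch of this READ(ADR) -> DATA view, using the 16-word memory contents from the table above; the helper name read_mem is illustrative.

```c
#include <stdio.h>

/* A 16-word memory addressed by a 4-bit address, holding the table's values. */
static const unsigned memory[16] = {
    0xE, 0xD, 0xB, 0x7, 0xA, 0x5, 0x5, 0xA,
    0xE, 0xF, 0x0, 0x8, 0x4, 0x2, 0x1, 0xF
};

static unsigned read_mem(unsigned adr) {   /* READ(ADR) -> DATA */
    return memory[adr & 0xF];
}

int main(void) {
    /* address 0100 from the figure; prints A, i.e. the data word 1010 */
    printf("READ(0100) -> %X\n", read_mem(0x4));
    return 0;
}
```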

Memory

! "640Kbyte ought to be enough for anybody."-- Bill Gates, 1981 (MP3 music ! 1000 Kbyte per minute)

! Memory is where instructions and data for programs are stored! Primary memory and secondary memory! Typically, primary memory is volatile, as soon as power is

turned of the content is lost, while secondary memory is non-volitile!

8


Memories

- Main memory (or primary memory)
  - Volatile: memory content is lost at power off
  - Random Access Memories (RAM)
- Secondary memory (or external memory)
  - Non-volatile: memory content is kept at power off
  - Hard disc
- Other
  - CD, DVD, magnetic tapes

9

Main memory

10

[Figure: CPU connected to main memory]

Secondary memory

11

[Figures: hard disk organization - platters, tracks, cylinders, sectors and blocks; drive motor, head motor, read/write heads on a moving arm (head assembly)]

12


Disc arrangements

- A sector = 512 bytes; a block = x*512
- A file is larger; a file is divided into blocks
- Problem: fragmentation
- Disc scheduling:
  - shortest seek time - measured from the current head position
  - elevator algorithm - move back and forth
  - one-way elevator - move in one direction only
- Storage:
  - contiguous storing: blocks n, n+1, ...
  - linked list: block n points at block n+1
  - disk index: a block with pointers (FAT)
  - file index (unix): Solaris - block = 8K, block number 4 bytes -> 2048 pointers per block; standard unix: superblock with disc info, files, free space; inode: 10 direct pointers, 10 double direct, 10 triple direct

13

Inode

14

Disc arrangement

- Q: Assume a program of 1025 bytes; how many blocks (of 512 bytes) are needed?
- A: 3 blocks. 2 blocks only cover 2*512 = 1024 bytes. Hence, 1536 bytes are needed to store a program of size 1025 bytes. The lost bytes are internal fragmentation. (A sketch of the block-count arithmetic follows after this slide.)

- Q: Given that 3 blocks are needed, which blocks should be selected?
- A: Several alternatives exist (defined by the OS). (1) Place all blocks next to each other; the drawback is that it can be difficult to find space, and there may be external fragmentation. The alternative is to make use of pointers: one common block can contain all the pointers, or each block points to the next one.

15
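A small sketch of the arithmetic in the answer above, assuming nothing beyond the slide's 512-byte blocks and 1025-byte program.

```c
#include <stdio.h>

/* Ceiling division gives the number of 512-byte blocks; the difference to
   the allocated size is the internal fragmentation. */
int main(void) {
    const unsigned block_size = 512;
    unsigned file_size = 1025;

    unsigned blocks    = (file_size + block_size - 1) / block_size;  /* ceil */
    unsigned allocated = blocks * block_size;
    unsigned wasted    = allocated - file_size;     /* internal fragmentation */

    printf("%u bytes -> %u blocks, %u bytes allocated, %u bytes wasted\n",
           file_size, blocks, allocated, wasted);   /* 1025 -> 3, 1536, 511 */
    return 0;
}
```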

Flash memory

- Invented by Dr. Fujio Masuoka (Toshiba) around 1980
- Flash memory is often found in mobile phones, cameras, MP3 players and in computers
- Characteristics: non-volatile and random access
- Capacity: up to 256 GB
- Block 0: bad blocks
- Block 1: bootable block
- Limited number of program/erase cycles

16


Memory system design

- What do we need?
  - We need memory to fit very large programs and to work at a speed comparable to that of the microprocessors.
- Main problem:
  - microprocessors work at a very high rate and they need large memories;
  - memories are much slower than microprocessors.
- Facts:
  - the larger a memory, the slower it is;
  - the faster the memory, the greater the cost/bit.

17

Memory hierarchy

18

Memory hierarchy

Some typical characteristics (Wikipedia as of 2006):
- Processor registers: 8-32 registers of 32 bits each = 128 bytes; access time = a few nanoseconds, 0-1 clock cycles
- On-chip cache memory (L1): capacity = 32 to 128 Kbytes; access time = ~10 nanoseconds, 3 clock cycles
- Off-chip cache memory (L2): capacity = 128 Kbytes to 12 Mbytes; access time = tens of nanoseconds, 10 clock cycles
- Main memory: capacity = 256 Mbytes to 4 Gbytes; access time = ~100 nanoseconds, 100 clock cycles
- Hard disk: capacity = 1 Gbyte to 1 Tbyte; access time = tens of milliseconds, 10,000,000 clock cycles

19

Note: approx. 10 times difference between levels (size and speed)

Cache memory

20

Distribute the memory content; not all of it is needed at all times.
A small & fast cache + a large & slow main memory together behave as a large & fast memory.
The principle can be extended to any pair of levels in the memory hierarchy.
It is based on the principle of locality.

How updates (writes) are handled is important, and the hit/miss ratio is decisive for the design.
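The hit/miss ratio mentioned above can be quantified with the standard effective-access-time relation (a textbook formula, not stated on the slides). A minimal sketch, assuming the rough access times from the memory hierarchy slide:

```c
#include <stdio.h>

/* t_eff = h * t_cache + (1 - h) * (t_cache + t_mm), with assumed times. */
int main(void) {
    const double t_cache = 10.0;    /* ns, on-chip cache (assumed) */
    const double t_mm    = 100.0;   /* ns, main memory   (assumed) */

    for (double h = 0.80; h <= 1.0001; h += 0.05) {
        double t_eff = h * t_cache + (1.0 - h) * (t_cache + t_mm);
        printf("hit ratio %.2f -> effective access time %5.1f ns\n", h, t_eff);
    }
    return 0;
}
```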


Cache memory

- A cache is smaller than main memory
- Items in the cache should be available when needed
- It works due to locality of references
- During execution, memory references tend to cluster:
  - once an area of the program is entered, there are repeated references to a small set of instructions (loop, subroutine) and data (components of a data structure, local variables or parameters on the stack)
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon

21

Cache - principle

22

[Figure: a drawer with four boxes and items 1 to 16 to be placed]

Cache - locality of references

Example: SW

int x = 0, y = 0;
while (x < 1000) { x = x + 1; printf("x=%i", x); }
while (y < 500)  { y = y + 1; printf("y=%i", y); }

23

Example: Drawer
In summer, keep summer clothing in the drawer; in winter, keep winter clothing in the drawer.

Cache - principle

24

[Figure: a drawer with four boxes and items 1 to 16 to be placed]

- How to place clothing in the drawer?
- We want to minimize the search time and to utilize all the space in the drawer


Cache memory

- Assume a main memory with 32 bytes and a cache memory with 16 bytes. 32 = 2^5 -> 5 address lines.
- Where to place data with high addresses?

25

Main memory (32 bytes, 5 address bits):

Adr (dec)  Adr (bin)  Data
0          00000      A1
1          00001      A2
2          00010      A3
3          00011      A4
4          00100      A5
5          00101      A6
6          00110      A7
7          00111      A8
8          01000      A9
9          01001      A0
10         01010      B1
11         01011      B2
12         01100      B3
13         01101      B4
14         01110      B5
15         01111      B6
16         10000      B7
17         10001      B8
18         10010      B9
19         10011      B0
20         10100      C1
21         10101      C2
22         10110      C3
23         10111      C4
24         11000      C5
25         11001      C6
26         11010      C7
27         11011      C8
28         11100      C9
29         11101      C0
30         11110      D1
31         11111      D2

Cache (16 bytes, 4 address bits):

Adr (dec)  Adr (bin)  Data
0          0000       A1
1          0001       A2
2          0010       A3
3          0011       A4
4          0100       A5
5          0101       A6
6          0110       A7
7          0111       A8
8          1000       A9
9          1001       A0
10         1010       B1
11         1011       B2
12         1100       B3
13         1101       B4
14         1110       B5
15         1111       B6

Cache memory

- Assume a main memory with 32 bytes and a cache memory with 16 bytes. 32 = 2^5 -> 5 address lines.
- Where to place data with high addresses?
- Assume cache lines of 4 bytes

26

[Main memory table repeated from the previous slide]

Cache (4 lines of 4 bytes, with tags):

Tag  Line (bin)  Byte 00  Byte 01  Byte 10  Byte 11
0    00          A1       A2       A3       A4
1    01          C1       C2       C3       C4
0    10          A9       A0       B1       B2
0    11          B3       B4       B5       B6

Address 10111 splits into tag = 1, line = 01, byte = 11.

Want to use address 00111?
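A sketch of a lookup in this toy cache, assuming the 1 + 2 + 2 bit tag/line/byte split and the cache contents shown above; the data values are just the labels from the slide.

```c
#include <stdio.h>

/* 32-byte main memory, 16-byte cache, 4 lines of 4 bytes.
   A 5-bit address splits into tag (1 bit), line (2 bits), byte (2 bits). */
struct line { int tag; const char *data[4]; };

static struct line cache[4] = {
    {0, {"A1", "A2", "A3", "A4"}},   /* line 00 */
    {1, {"C1", "C2", "C3", "C4"}},   /* line 01 */
    {0, {"A9", "A0", "B1", "B2"}},   /* line 10 */
    {0, {"B3", "B4", "B5", "B6"}},   /* line 11 */
};

static void lookup(unsigned adr) {              /* adr is a 5-bit address */
    unsigned byte = adr        & 0x3;           /* lowest 2 bits  */
    unsigned line = (adr >> 2) & 0x3;           /* next 2 bits    */
    unsigned tag  = (adr >> 4) & 0x1;           /* highest bit    */
    if (cache[line].tag == (int)tag)
        printf("adr %2u: hit, data %s\n", adr, cache[line].data[byte]);
    else
        printf("adr %2u: miss (line %u holds tag %d)\n", adr, line, cache[line].tag);
}

int main(void) {
    lookup(0x17);   /* 10111: tag 1, line 01, byte 11 -> hit, C4 */
    lookup(0x07);   /* 00111: tag 0, line 01, byte 11 -> miss, the line holds tag 1 */
    return 0;
}
```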

Cache organization

- Example:
  - cache of 64 Kbytes (2^16 bytes)
  - main memory of 16 Mbytes (2^24 bytes)
  - data transfer between cache and main memory is in blocks of 4 bytes

27

[Figure: main memory viewed as 2^22 blocks of 4 bytes (Byte1-Byte4), numbered 0 to 2^22-1; the cache as 2^14 lines of 4 bytes, numbered 0 to 2^14-1; a 24-bit address - how is it mapped?]

Direct mapping

28

[Figure: direct mapping - the 24-bit address is split into tag (8 bits), line (14 bits) and byte (2 bits); on the main memory side it is a 22-bit block number plus a 2-bit byte; each cache line stores an 8-bit tag plus a 4-byte block, and the stored tag is compared with the address tag to decide hit/miss]

A memory block is mapped to a unique cache line.
+ simple, cheap
- little flexibility


Cache - direct mapping

29

[Figure: a drawer with four boxes and items 1 to 16 to be placed]

- Items 1 to 4 go in drawer box 1; items 5 to 8 go in drawer box 2
- Advantage: easy and fast to find out where an item should go
- Disadvantage: if one wants item 1 and item 2, both will go to box 1, even if boxes 2, 3 and/or 4 are empty

Set associative mapping

30

[Figure: 2-way set associative mapping - the 24-bit address is split into tag (9 bits), set (13 bits) and byte (2 bits); each of the 2^13 sets holds two lines, each storing a tag plus a 4-byte block, and the tags in the selected set are compared with the address tag to decide hit/miss]

2-way in this example; a replacement algorithm is needed; typically 2/4/8-way associative; short tag, fast, quite simple

Cache - set associative mapping

31

[Figure: a drawer with four boxes and items 1 to 16 to be placed]

- Items 1 to 8 go in drawer box 1 or box 2; items 9 to 16 go in drawer box 3 or box 4
- Quite easy and fast to find out where an item should go
- Possible to store item 1 and item 2 (at the cost of a little more checking)

Associative mapping

32

[Figure: associative (fully associative) mapping - the 24-bit address is split into tag (22 bits) and byte (2 bits); a block may be placed in any of the 2^14 cache lines, so every stored tag must be compared with the address tag to decide hit/miss]

A block can go into any cache line; a replacement algorithm is needed.
+ flexible
- slow, complex

A sketch of how the three mappings split the 24-bit address follows below.
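A small sketch of the three address splits named on slides 28, 30 and 32; the extraction code and the example address are illustrative, only the field widths come from the slides.

```c
#include <stdio.h>

/* Field extraction for the 64-Kbyte cache / 16-Mbyte main memory example:
     direct mapping:         tag 8  | line 14 | byte 2
     2-way set associative:  tag 9  | set 13  | byte 2
     fully associative:      tag 22 |           byte 2 */
int main(void) {
    unsigned adr = 0x3A5B7C;                       /* some 24-bit address */

    unsigned byte    = adr & 0x3;

    unsigned dm_line = (adr >> 2)  & 0x3FFF;       /* 14 bits */
    unsigned dm_tag  = (adr >> 16) & 0xFF;         /*  8 bits */

    unsigned sa_set  = (adr >> 2)  & 0x1FFF;       /* 13 bits */
    unsigned sa_tag  = (adr >> 15) & 0x1FF;        /*  9 bits */

    unsigned fa_tag  = (adr >> 2)  & 0x3FFFFF;     /* 22 bits */

    printf("direct:          tag=%02X line=%04X byte=%u\n", dm_tag, dm_line, byte);
    printf("set associative: tag=%03X set=%04X byte=%u\n",  sa_tag, sa_set, byte);
    printf("associative:     tag=%06X byte=%u\n",           fa_tag, byte);
    return 0;
}
```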


Cache - associative mapping

33

[Figure: a drawer with four boxes and items 1 to 16 to be placed]

- Items 1 to 16 go in drawer box 1, 2, 3 or 4
- Have to search all boxes to find out where an item should go
- Possible to store any four items (at the cost of more checking)

Replacement algorithms

- When a new block is to be placed in the cache, one of the blocks stored in the cache lines has to be replaced.
- Replacement:
  - direct mapping: no choice
  - set-associative mapping: candidate lines are in the selected set
  - associative mapping: all lines of the cache are potential candidates
- Replacement strategies:
  - Random replacement
  - First-in-first-out (FIFO)
  - Least recently used (LRU)
  - Least frequently used (LFU)

34

FIFO: the block that has been longest in the cache is replaced.
LRU: the block that has been longest in the cache without being referenced is replaced.
LFU: the block that has been used least often is replaced.
Replacement algorithms are implemented in hardware (for efficiency).
LRU is the most efficient: relatively simple to implement and good results.
FIFO is simple to implement.
Random is the simplest to implement and the results are surprisingly good.
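A hedged sketch of LRU replacement within one set, as described above; the age counters and names are illustrative stand-ins for the hardware bookkeeping.

```c
#include <stdio.h>

/* LRU in one 4-way set: the line that has gone longest without a reference
   is the victim. */
#define WAYS 4

struct way { int valid; int tag; unsigned age; };
static struct way set[WAYS];

static void access_tag(int tag) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) set[i].age++;           /* everyone ages    */

    for (int i = 0; i < WAYS; i++)                         /* hit: reset age   */
        if (set[i].valid && set[i].tag == tag) {
            set[i].age = 0;
            printf("tag %d: hit\n", tag);
            return;
        }

    for (int i = 1; i < WAYS; i++)                         /* miss: oldest way */
        if (!set[i].valid || set[i].age > set[victim].age) victim = i;
    printf("tag %d: miss, replacing way %d (old tag %d)\n", tag, victim, set[victim].tag);
    set[victim] = (struct way){1, tag, 0};
}

int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 5};    /* 5 evicts 2, the least recently used */
    for (int i = 0; i < 6; i++) access_tag(refs[i]);
    return 0;
}
```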

Write strategies

- Problem: keep memory content consistent (the memory content is distributed between the cache and MM)
- Concepts:
  - Write-through: cache writes are immediately updated in MM
  - Write-through with buffers: cache writes are buffered and MM is periodically updated
  - Copy-back: cache and MM are not coherent; MM is updated at replacement (see the sketch below)

35
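A minimal sketch contrasting write-through and copy-back for a single cached word; the dirty flag and all names are illustrative, not from the slides.

```c
#include <stdio.h>

/* Write-through: every write also updates main memory.
   Copy-back: a dirty flag defers the MM update until the block is replaced. */
enum policy { WRITE_THROUGH, COPY_BACK };

static int main_mem = 0;
static struct { int data; int dirty; } cached = {0, 0};

static void write_word(int value, enum policy p) {
    cached.data = value;
    if (p == WRITE_THROUGH)
        main_mem = value;               /* MM updated immediately           */
    else
        cached.dirty = 1;               /* MM updated later, at replacement */
}

static void replace_block(enum policy p) {
    if (p == COPY_BACK && cached.dirty)
        main_mem = cached.data;         /* write the modified block back    */
    cached.dirty = 0;
}

int main(void) {
    write_word(42, COPY_BACK);
    printf("after write:   cache=%d mm=%d\n", cached.data, main_mem);  /* 42, 0  */
    replace_block(COPY_BACK);
    printf("after replace: cache=%d mm=%d\n", cached.data, main_mem);  /* 42, 42 */
    return 0;
}
```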

Separate data & instruction cache

36

Unified cache: easier to design; automatic adjustment of the sizes used for instructions and data.
Harvard (split) cache: can fetch an instruction at the same time as it fetches data; difficult to know the right sizes of IC & DC.


Some cache architectures

- Intel 80486 - introduced 1989
  - a single on-chip cache of 8 Kbytes
  - line size: 16 bytes
  - 4-way set associative organization
- Pentium - introduced 1993
  - two on-chip caches, for data and instructions
  - each cache: 8 Kbytes
  - line size: 32 bytes (64 bytes in Pentium 4)
  - 2-way set associative organization (4-way in Pentium 4)
- PowerPC 601 - introduced 1993
  - a single on-chip cache of 32 Kbytes
  - line size: 32 bytes
  - 8-way set associative organization

37

Some cache architectures

- PowerPC 603
  - two on-chip caches, for data and instructions
  - each cache: 8 Kbytes
  - line size: 32 bytes
  - 2-way set associative organization (simpler cache organization than the 601 but a stronger processor)
- PowerPC 604
  - two on-chip caches, for data and instructions
  - each cache: 16 Kbytes
  - line size: 32 bytes
  - 4-way set associative organization
- PowerPC 620
  - two on-chip caches, for data and instructions
  - each cache: 32 Kbytes
  - line size: 64 bytes
  - 8-way set associative organization

38

Apple’s iMac (1998-2008)

- As of 2008
  - Cache: 6 MB
  - RAM: 1-4 GB
  - Hard drive: 250-1000 GB
  - (Processor speed 2.4-3.2 GHz)

- As of 1998
  - Cache: 512 KB
  - RAM: 32-128 MB
  - Hard drive: 4 GB
  - (Processor speed 233 MHz)

39

The numbers increase - the problems remain.

AMD Athlon 64 CPU

40

The K8 has 4 specialized caches: an instruction cache, an instruction TLB, a data TLB, and a data cache. The K8 also has multiple-level caches.


Summary

- A memory system has to fit large programs and provide fast access
- A hierarchical memory system can provide the needed performance, based on the locality of reference
- Cache memory is an essential part of the memory system
- Caches can be organized with direct mapping, set associative mapping, and associative mapping
- In order to decide which block to replace, different strategies can be used: random, LRU, FIFO, LFU, etc.
- The cache is kept coherent with write-through, write-through with buffered writes, and copy-back

41

Memory system

42

[Figure: memory system - CPU connected to main memory; input and output devices; secondary memory]

Previously: the register - main memory interface. Now: main memory - secondary memory.

Memory system design

- What do we need?
  - We need memory to fit very large programs and to work at a speed comparable to that of the microprocessors.
- Main problem:
  - microprocessors work at a very high rate and they need large memories;
  - memories are much slower than microprocessors.
- Facts:
  - the larger a memory, the slower it is;
  - the faster the memory, the greater the cost/bit.

43

Memory usage

44

Fragmentation = the percentage of memory that is unavailable for allocation even though it is not in use


Paging

- Divide memory into fixed-size pages
- Allocate pages to frames in memory
- The OS manages the pages
  - moves, removes, reallocates
  - pages are copied to and from disk

45

Paging

46

Paging

- The “hole-fitting problem” vanishes!
- Logical memory is contiguous
- Physical memory is not required to be contiguous
- Eliminates external fragmentation
- But: complicates address lookup

47

Paging

48


Virtual memory

49

The address space needed and seen by programs is usually much larger than the available main memory.
Only one part of the program fits into main memory; the rest is stored on secondary memory (hard disk).
In order for code to be executed or data to be accessed, the corresponding segment of the program first has to be loaded into main memory; in this case it has to replace another segment already in memory.
Movement of programs and data between main memory and secondary storage is performed automatically by the operating system. These techniques are called virtual-memory techniques.
The binary address issued by the processor is a virtual (logical) address; it refers to a virtual address space, much larger than the physical one available in main memory.

Virtual memory

50

Virtual memory

51

Address translation is performed by the MMU using a page table.
Example:
Virtual memory space: 2 Gbytes (31 address bits; 2^31 = 2G)
Physical memory space: 16 Mbytes (2^24 = 16M)
Page length: 2 Kbytes (2^11 = 2K)
-> Total number of pages: 2^20 = 1M
-> Total number of frames: 2^13 = 8K
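A sketch of this translation with the slide's field widths (20-bit page number, 11-bit offset, 13-bit frame number); the page-table contents are made up for illustration.

```c
#include <stdio.h>

/* A 31-bit virtual address with 2-Kbyte pages splits into a 20-bit page
   number and an 11-bit offset; the page table maps the page number to one
   of the 2^13 frames of the 16-Mbyte physical memory. The table below is a
   toy; a real table would have 2^20 entries. */
#define PAGE_BITS 11                               /* 2 Kbyte pages */

static unsigned page_table[16] = { 5, 9, 7, 3 };   /* toy: page -> frame */

static unsigned translate(unsigned vaddr) {
    unsigned page   = vaddr >> PAGE_BITS;                 /* upper 20 bits */
    unsigned offset = vaddr & ((1u << PAGE_BITS) - 1);    /* lower 11 bits */
    unsigned frame  = page_table[page & 0xF];             /* toy lookup    */
    return (frame << PAGE_BITS) | offset;          /* 13-bit frame + offset */
}

int main(void) {
    unsigned va = (2u << PAGE_BITS) | 0x123;       /* page 2, offset 0x123 */
    printf("virtual 0x%08X -> physical 0x%06X\n", va, translate(va));
    return 0;
}
```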

Virtual memory

52

The hardware unit which is responsible for translation of a virtual address into a physical one is the Memory Management Unit (MMU)


Replacement algorithms

- When a new block is to be placed in the cache, one of the blocks stored in the cache lines has to be replaced.
- Replacement:
  - direct mapping: no choice
  - set-associative mapping: candidate lines are in the selected set
  - associative mapping: all lines of the cache are potential candidates
- Replacement strategies:
  - Random replacement
  - First-in-first-out (FIFO)
  - Least recently used (LRU)
  - Least frequently used (LFU)

53

FIFO: the block that has been longest in the cache is replaced.
LRU: the block that has been longest in the cache without being referenced is replaced.
LFU: the block that has been used least often is replaced.
Replacement algorithms are implemented in hardware (for efficiency).
LRU is the most efficient: relatively simple to implement and good results.
FIFO is simple to implement.
Random is the simplest to implement and the results are surprisingly good.

Similar to the cache: main memory is smaller than secondary memory.

Translation Look-Aside Buffers (TLB)

54

The page table has one entry for each page of the virtual memory space.
Each entry of the page table also includes some control bits which describe the status of the page:
- whether the page is in MM
- whether the page is modified
- statistics: when it was used

To speed up translation, a TLB is inserted.
The page table itself is distributed over the hierarchy (cache, MM, secondary memory).
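A hedged sketch of what one page-table entry could hold, following the control bits listed above; field names and widths are illustrative, not from the slides.

```c
#include <stdio.h>

/* One page-table entry: frame number plus the status/control bits. */
struct page_table_entry {
    unsigned frame      : 13;  /* frame number: 16-Mbyte MM / 2-Kbyte pages = 2^13 frames */
    unsigned present    : 1;   /* is the page currently in main memory?                   */
    unsigned modified   : 1;   /* has the page been written since it was loaded?          */
    unsigned referenced : 1;   /* simple usage statistic for the replacement algorithm    */
};

int main(void) {
    struct page_table_entry pte = { .frame = 0x0A5, .present = 1, .modified = 0, .referenced = 1 };
    printf("frame=0x%X present=%u modified=%u referenced=%u\n",
           (unsigned)pte.frame, (unsigned)pte.present,
           (unsigned)pte.modified, (unsigned)pte.referenced);
    return 0;
}
```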

Demand paging

- The pages of a program are stored on disk; at any time, only a few pages have to be stored in main memory.
- The operating system is responsible for loading/replacing pages.
- A page is loaded only when a page fault occurs.

55

Thrashing

- The degree of multiprogramming chosen by the scheduler determines the number of page faults
- The number of page faults determines the CPU utilization

56

[Figure: CPU utilization as a function of the degree of multiprogramming]


Summary

- A memory system has to fit large programs and provide fast access
- A hierarchical memory system can provide the needed performance, based on the locality of reference
- Fragmentation can be avoided by paging
- Virtual memory: the programmer sees a larger main memory
- Demand paging: only the needed pages are loaded
- The MMU translates a logical address to a physical one
- The page table may be distributed: TLB (cache), main memory, secondary memory

57 www.liu.se