CA226 — Advanced Computer Architecture


Transcript of CA226 — Advanced Computer Architecture

Page 1: CA226 — Advanced Computer Architecture


Stephen Blott <[email protected]>


Page 2: CA226 — Advanced Computer Architecture


Preliminaries

Contacting me:

1. before or after lectures, or during labs

2. in my office: L1.11

3. at [email protected] [mailto:[email protected]]please put the module code (ca226) in the subject line

Page 3: CA226 — Advanced Computer Architecture


More Preliminaries

Course web site:

• http://ca226.computing.dcu.ie/ (use your School of Computing credentials)

There’s a link to this site on Moodle [http://moodle.dcu.ie/].

Page 4: CA226 — Advanced Computer Architecture


Still More Preliminaries

Labs:

• begin week five

Lab exams:

• weeks eight and twelve, in the regular lab slot (Fridays at 14:00)

Page 5: CA226 — Advanced Computer Architecture


Starters for 10.

• List the first few powers of 2.

• What is `2^{32}`?

• What is `2^{64}`?
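These values are worth memorising, but they are also easy to confirm; a quick Python check (an illustrative aside, not part of the original slides):

```python
# Powers of two that come up constantly in architecture questions.
print(2**10)  # 1024 (1 KiB)
print(2**32)  # 4294967296: the number of distinct 32-bit addresses (4 GiB)
print(2**64)  # 18446744073709551616: the 64-bit address space
```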

Page 6: CA226 — Advanced Computer Architecture


Starters for 10..

• What is a register?

• What is a bus?

• What does USB stand for?

• What is a frame buffer?

• What is an interrupt?

Page 7: CA226 — Advanced Computer Architecture


Starters for 10…

• What’s special about this IP address: 127.0.0.1?

• What’s special about this IP address: 192.168.3.3?

• What’s special about this IP address: 192.168.3.255?

• Could every person on earth be allocated a unique IP address?

• Old versions of the Linux ext2 file system had a 2GB limit on file sizes. Why?

Page 8: CA226 — Advanced Computer Architecture


Observations on Processor Speed

Page 9: CA226 — Advanced Computer Architecture


What’s been happening?

• Microprocessors predominate.

• Moore’s law, though it has run its course (clock rates aren’t increasing much any more).

• Widespread introduction of multi-core systems.

• Recent dominance of 64-bit systems.

Page 10: CA226 — Advanced Computer Architecture


Note to self…

Speed of light:

• `300000` kilometres per second (approx.)

• …

Page 11: CA226 — Advanced Computer Architecture


64-bit Systems…

64-bit:

• native integer size

• registers

• data path widths

• memory addresses

Are 64-bit machines inherently faster than 32-bit machines?

Page 12: CA226 — Advanced Computer Architecture


CISC versus RISC

CISC: complex instruction-set computing

RISC: reduced instruction-set computing

Page 13: CA226 — Advanced Computer Architecture


CISC versus RISC

Memory constraints influenced early processor designs:

• with small memories, high code density [http://en.wikipedia.org/wiki/Instruction_set#Code_density] was necessary

• this led to the development of processors with complex instruction sets:

• a single instruction might implement a high-level programming-language operation

• complex addressing modes

• e.g. b = a[i] + 1

Page 14: CA226 — Advanced Computer Architecture


CISC versus RISC

As memory costs fell:

• memory size constraints lessened

• code did not need to be so dense

• reduced instruction sets became viable

• a single high-level programming-language operation might be implemented by several instructions

Almost all modern processors implement reduced instruction sets.

Page 15: CA226 — Advanced Computer Architecture


A simple computer…

Note

Source [http://www-cs-faculty.stanford.edu/~eroberts/courses/soco/projects/risc/risccisc/].

Page 16: CA226 — Advanced Computer Architecture


Example — The Problem

The problem:

• a = a * b;

• so: multiply memory locations 5:2 and 2:3 (say)

Page 17: CA226 — Advanced Computer Architecture


Example — CISC Approach

CISC approach:

MULT 5:2 2:3

• a single, complex instruction

• load both memory locations into registers

• multiply

• store the result back in the appropriate memory location (say 5:2)

Just one instruction encodes a commonly-occurring programming operation which, at the hardware level, involves several steps.

Page 18: CA226 — Advanced Computer Architecture


Example — RISC Approach

RISC approach:

LOAD A, 2:3
LOAD B, 5:2
MULT A, B
STORE 2:3, A

Four steps are required:

• so the program memory required is (well, may be) four times larger

• so this approach was only possible when cheaper/larger memory systems became more widespread

Page 19: CA226 — Advanced Computer Architecture


RISC

RISC:

• reduced instruction set computing

• computations are performed only on register contents

• the only memory operations are LOAD and STORE

• few, uniformly-sized instructions

Page 20: CA226 — Advanced Computer Architecture


RISC Advantages

Both approaches are likely to require roughly the same number of computational steps.

RISC advantages:

• moves complexity from hardware to software (compilers)

• smarter compilers make better use of registers

• fewer transistors:

• so smaller, can be clocked faster, reduced power consumption, less heat

• pipelining (and super-scalar processing)

Page 21: CA226 — Advanced Computer Architecture


Multi-Core Systems

With Moore’s law having run its course:

• the last decade has seen the widespread adoption of multi-core systems

• even in hand-held devices

The most common architecture is that of symmetric multiprocessors:

• multiple processing elements supporting the same instruction set, memory model, etc.

Page 22: CA226 — Advanced Computer Architecture


…

Programming parallel systems is hard:

• and parallelism seems likely to be the way forward in terms of computational power for the foreseeable future

The burden of exploiting hardware improvements is on the programmer.

• e.g., how do you exploit parallelism for quicksort?

Page 23: CA226 — Advanced Computer Architecture


Cell Architecture

Page 24: CA226 — Advanced Computer Architecture


Computer Performance

How might we measure computer performance?

Page 25: CA226 — Advanced Computer Architecture


Answer?

It depends.

Page 26: CA226 — Advanced Computer Architecture


Answer?

Usually:

• we’re interested in how long it takes to get some work done

So:

• wall-clock time might be a good measure

Page 27: CA226 — Advanced Computer Architecture


However …

It depends on how and why we’re measuring.

Wall-clock time includes:

• user CPU time

• system CPU time

• interrupt handling time

• I/O time (to/from terminal, disk, network)

Page 28: CA226 — Advanced Computer Architecture


CPU Architectures

If we’re interested in comparing processors:

• we may be more interested in the number of clock cycles necessary to complete some task

Page 29: CA226 — Advanced Computer Architecture


Clock Rate

Clock rate:

• the number of clock cycles per unit time (usually, per second)

• say, 2GHz


Page 31: CA226 — Advanced Computer Architecture


CPU Clock Cycles

CPU clock cycles:

• the number of clock cycles necessary to complete some job


Page 33: CA226 — Advanced Computer Architecture


Example

Say:

• clock rate: 2GHz, so `2 times 10^9` cycles per second

• CPU clock cycles: `4 times 10^8`

Page 34: CA226 — Advanced Computer Architecture


CPU Time

CPU time:

• `text{CPU time} = text{CPU clock cycles} / text{clock rate}`


Page 36: CA226 — Advanced Computer Architecture


CPU Time

CPU time:

• `text{CPU time} = text{CPU clock cycles} / text{clock rate}`

Example:

• `{4 times 10^8} / {2 times 10^9} = 0.2s`
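The arithmetic above is easy to check directly; a small Python sketch using the slide’s figures (an illustrative aside):

```python
clock_rate = 2e9        # 2 GHz: cycles per second
cpu_clock_cycles = 4e8  # cycles needed to complete the job

cpu_time = cpu_clock_cycles / clock_rate  # seconds
print(cpu_time)  # 0.2
```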

Page 37: CA226 — Advanced Computer Architecture


Alternatively

But that approach:

• is too dependent on a single job

Page 38: CA226 — Advanced Computer Architecture


Alternatively

Better:

• derive a metric which is (somewhat) independent of any particular job

• let IC be the instruction count: the number of instructions needed to complete some job

Say:

• IC is `2 times 10^8`

Page 39: CA226 — Advanced Computer Architecture


Then …

Then:

• cycles per instruction (CPI): `text{CPI} = text{CPU clock cycles}/text{IC}`

Example:

• `text{CPI} = {4 times 10^8} / {2 times 10^8} = 2`, so two cycles per instruction

Page 40: CA226 — Advanced Computer Architecture


Then again …

Then:

• CPU time: `text{CPU time} = {text{IC} times text{CPI}} / text{clock rate}`

Example:

• `text{CPU time} = {2 times 10^8 times 2} / {2 times 10^9} = 0.2s`
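The IC × CPI form gives the same answer as the direct cycle count; a quick Python check of the example (illustrative only):

```python
ic = 2e8          # instruction count
cpi = 2           # cycles per instruction
clock_rate = 2e9  # hertz (2 GHz)

cpu_time = (ic * cpi) / clock_rate  # seconds
print(cpu_time)  # 0.2, matching the direct cycle-count calculation
```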

Page 41: CA226 — Advanced Computer Architecture


So …

• `text{CPU time} = {text{IC} times text{CPI}} / text{clock rate} `

So, to make things go faster (reduce CPU time):

• reduce the instruction count (IC)

• reduce the number of cycles per instruction (CPI), or

• increase the clock rate

Page 42: CA226 — Advanced Computer Architecture


Improvements in CPI

The Intel 8086 instruction PUSH AX:

• 8086 — 11 clock cycles

• 80286 — 3 clock cycles

• 80386 — 2 clock cycles

• 80486 — 1 clock cycle

So:

• it is not just clock speed that has improved over the years

• in fact: it is now commonplace to see `text{CPI} le 1`

Page 43: CA226 — Advanced Computer Architecture


Example

Example:

• two machines (A and B) implementing the same instruction set architecture

• A has cycle time of 10ns and CPI of 2.0 (for some prog. P)

• B has cycle time of 20ns and CPI of 1.2 (for same P)

Which is faster?

Page 44: CA226 — Advanced Computer Architecture


Aside

Note

The cycle time (in seconds) is just the reciprocal of the clock speed (in hertz) — and vice versa.

Page 45: CA226 — Advanced Computer Architecture


Example

• CPU time for A is `{ text{IC} times 2.0 } / {1 times 10^8} = text{IC} times 2 times 10^{-8}`

• CPU time for B is `{ text{IC} times 1.2 } / {5 times 10^7} = text{IC} times 2.4 times 10^{-8}`

So:

• it will take `20%` longer on machine B than it would on machine A
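The comparison can be sketched in Python; note that the instruction count IC cancels when taking the ratio, so we only need per-instruction times (an illustrative check):

```python
# Cycle time is the reciprocal of clock rate:
clock_a = 1 / 10e-9   # 10 ns cycle time -> 1e8 Hz
clock_b = 1 / 20e-9   # 20 ns cycle time -> 5e7 Hz

# CPU time per instruction is CPI / clock rate (IC cancels in the ratio):
time_a = 2.0 / clock_a
time_b = 1.2 / clock_b

print(time_b / time_a)  # 1.2: B takes 20% longer than A
```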

Page 46: CA226 — Advanced Computer Architecture


Warning: CPI can be Misleading

Consider two variants of the same processor:

1. one with a floating-point unit

2. one without a floating-point unit

But:

• how are FP operations handled in case 2?

Page 47: CA226 — Advanced Computer Architecture


Warning: CPI can be Misleading

Two possibilities (both effectively the same):

• the processor implements microcode to emulate FP operations with integer operations, or

• the compiler generates such instructions

Page 48: CA226 — Advanced Computer Architecture


Warning: CPI can be Misleading

Cases:

1. floating-point operations require 10 clock cycles (say), so CPI is 10

2. compiler generates (say) 300 integer instructions, each requiring 1 clock cycle, so CPI is 1

But:

• which is actually faster?

• the one with the higher CPI!

Page 49: CA226 — Advanced Computer Architecture


More Common Metrics

MIPS:

• `text{MIPS} = text{clock rate} / {text{CPI} times 10^6}`

MFLOPS:

• `text{MFLOPS} = text{clock rate} / {text{C-per-FPI} times 10^6}`
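Applying the MIPS formula to the running example (2GHz clock, CPI of 2) gives the following; the specific figures are carried over from the earlier slides, not part of this one:

```python
clock_rate = 2e9  # 2 GHz
cpi = 2

mips = clock_rate / (cpi * 1e6)
print(mips)  # 1000.0: a 2 GHz machine at 2 cycles per instruction does 1000 MIPS
```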

Page 50: CA226 — Advanced Computer Architecture


MIPS and MFLOPS

These can be poor metrics for comparing different processors:

• some implement FP division (e.g. Pentium)

• some don’t (e.g. SPARC)

Instruction counts:

• different processors may have different instruction sets (so the ICs will differ)

• the instruction counts for complex operations like sine and cosine may be quite large

• so these differences can be significant

Page 51: CA226 — Advanced Computer Architecture


Improving Performance

Generally:

• optimise for the common case

Page 52: CA226 — Advanced Computer Architecture


Improving Performance

However, (particularly) with computer hardware:

• optimisation is expensive (it requires substantial investment)

So:

• we need to decide where to invest in optimisation, and

• we need to know that the payback is going to be worth it

Page 53: CA226 — Advanced Computer Architecture


Speedup

Consider some possible hardware or software enhancement.

Speedup:

• `text{performance without enhancement} / text{performance with enhancement}`

Note

"Performance", here, might be response time (say). With speedup, larger values are better.

Page 54: CA226 — Advanced Computer Architecture


Speedup — Example

Example:

• a baseline implementation might execute a job in 3 seconds

• with some enhancement, that might be reduced to 2 seconds

Speedup:

• `3/2 = 1.5`

Page 55: CA226 — Advanced Computer Architecture


Important Gotcha!

Typically:

• only a portion of an entire job will be sped up by any proposed enhancement

Example:

• sort the contents of a disk file, storing the sorted results back in a new file on disk, so: read data in, sort it, write data out

• an enhanced sorting algorithm can only improve the CPU costs, not the IO costs

• an enhanced IO subsystem can only improve the IO costs, not the sorting costs

Page 56: CA226 — Advanced Computer Architecture


Example

Assume:

• some job involving sub-jobs A and B

• B accounts for 70% of the execution time, A the rest

Given a proposed enhancement:

• running B 20 times faster

How much faster would our job run overall?

Page 57: CA226 — Advanced Computer Architecture


Amdahl’s Law

Amdahl’s law:

• `text{speedup} = 1 / {(1-P) + P/S}`

Where:

• P is the proportion of the job affected by the enhancement, and

• S is the speedup associated (just) with P

Page 58: CA226 — Advanced Computer Architecture


Amdahl’s Law — Derivation

Assume our job takes 1 baseline time unit:

• `text{speedup} = 1 / text{response time with enhancement}`

• `text{response time with enhancement} = text{unenhanced time} + text{enhanced time}`

• `text{response time with enhancement} = (1-P) + P/S`

• `text{speedup} = 1 / {(1-P) + P/S}`
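The law is easy to encode; a minimal Python sketch (the function name `amdahl` is my own, and the example figures are the slides’ P = 0.7, S = 20):

```python
def amdahl(p, s):
    """Overall speedup when a proportion p of the job is sped up by factor s."""
    return 1 / ((1 - p) + p / s)

# The slides' example: B is 70% of the job and runs 20 times faster.
print(round(amdahl(0.7, 20), 3))  # 2.985
```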

Page 59: CA226 — Advanced Computer Architecture


Example (Repeated)

Assume:

• some job involving sub-jobs A and B

• B accounts for 70% of the execution time, A the rest

Given a proposed enhancement:

• running B 20 times faster

How much faster would our job run overall?

Page 60: CA226 — Advanced Computer Architecture


Amdahl’s Law — Example

Example (from previously):

• P is 0.7

• S is 20

Page 61: CA226 — Advanced Computer Architecture


Amdahl’s Law — Example

Overall speedup:

• `1 / {(1-P) + P/S}`

• `1 / {(1-0.7) + 0.7/20}`

• `1 / {0.3 + 0.035}`

• `2.985` (approximately)

Page 62: CA226 — Advanced Computer Architecture


Example

Given a proposed enhancement:

• running B 20 times faster

How much faster would our job run overall?

It will run about three times faster:

• this may be less than you intuitively expected.

Page 63: CA226 — Advanced Computer Architecture


Another Example

Amdahl’s law also allows comparison between two or more design alternatives.

Page 64: CA226 — Advanced Computer Architecture


Another Example

Example:

• a program spends:

• half its time doing floating-point operations

• including 20% of its time calculating floating-point square roots

Alternative optimisations:

1. Add floating-point square root hardware which speeds up such operations by a factor of 10.

2. Make all floating-point operations run twice as fast.

Page 65: CA226 — Advanced Computer Architecture


Engineering

Assuming we can only choose one:

• in which of these optimisations should we invest?

Page 66: CA226 — Advanced Computer Architecture


Engineering — First Case

Optimisation:

• Add floating-point square root hardware which speeds up such operations by a factor of 10.

Amdahl’s law:

• `text{speedup} = 1 / {0.8 + 0.2 / 10} = 1.22`

Page 67: CA226 — Advanced Computer Architecture


Engineering — Second Case

Optimisation:

• Make all floating-point operations run twice as fast.

Amdahl’s law:

• `text{speedup} = 1 / {0.5 + 0.5 / 2} = 1.33`

So, under these assumptions, the second approach looks like the better investment.
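Both cases can be checked with the same Amdahl helper; a sketch (the `amdahl` function name is my own):

```python
def amdahl(p, s):
    """Overall speedup when a proportion p of the job is sped up by factor s."""
    return 1 / ((1 - p) + p / s)

# Case 1: square roots are 20% of the time, sped up 10x.
print(round(amdahl(0.2, 10), 2))  # 1.22

# Case 2: all floating point (50% of the time) sped up 2x.
print(round(amdahl(0.5, 2), 2))   # 1.33
```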

Page 68: CA226 — Advanced Computer Architecture


Corollary

Amdahl’s law tells us to:

• make the common case fast!

Or:

• we can never see a big speedup by optimising the uncommon case

Page 69: CA226 — Advanced Computer Architecture


Another Example

Protein match:

• currently takes four days

• 20% of time doing integer operations

• 35% of time doing I/O

Which is the better trade off?

1. Compiler optimisation to reduce the number of integer operations by 20%.

2. Hardware optimisation that reduces latency of IO operations from 6us to 5us.

Page 70: CA226 — Advanced Computer Architecture


Answer

The speedups are:

1. 1.042

2. 1.062

So it looks like the second option is better.

If you were the engineer:

• what would you choose to do?
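One way the figures above arise, assuming a straightforward application of Amdahl’s law (the per-portion speedup factors below are my reading of the problem, not stated on the slide):

```python
def amdahl(p, s):
    """Overall speedup when a proportion p of the job is sped up by factor s."""
    return 1 / ((1 - p) + p / s)

# Option 1: 20% fewer integer instructions means that portion runs
# 1/0.8 = 1.25x faster; integer work is 20% of the job.
print(round(amdahl(0.20, 1 / 0.8), 3))  # 1.042

# Option 2: IO latency drops from 6us to 5us, a 6/5 = 1.2x speedup
# on the 35% of the job that is I/O.
print(round(amdahl(0.35, 6 / 5), 3))    # 1.062
```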

Page 71: CA226 — Advanced Computer Architecture


From a previous exam…

Assume some sequential job composed exactly of three distinct parts (A, B and C) in which B accounts for 50% of the execution time and C for 30%.

Further assume two possible improvements:

1. the first improvement would result in part A running 100 times faster, and

2. the second in part B running 20 times faster.

If only one of the improvements can be chosen, which would you recommend?
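A sketch of one way to answer, applying Amdahl’s law (an illustrative working, not an official solution; A accounts for the remaining 20% of the time):

```python
def amdahl(p, s):
    """Overall speedup when a proportion p of the job is sped up by factor s."""
    return 1 / ((1 - p) + p / s)

# Improvement 1: A (100% - 50% - 30% = 20% of the time) runs 100x faster.
print(round(amdahl(0.2, 100), 2))  # 1.25

# Improvement 2: B (50% of the time) runs 20x faster.
print(round(amdahl(0.5, 20), 2))   # 1.9
```

Under this reading, the second improvement gives the larger overall speedup, so it would be the one to recommend.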

Page 72: CA226 — Advanced Computer Architecture


Done