CA226 — Advanced Computer Architecture
Transcript of CA226 — Advanced Computer Architecture
Preliminaries
Contacting me:
1. before or after lectures, or during labs
2. in my office: L1.11
3. at [email protected] [mailto:[email protected]]; please put the module code (ca226) in the subject line
More Preliminaries
Course web site:
• http://ca226.computing.dcu.ie/ (use your School of Computing credentials)
There’s a link to this site on Moodle [http://moodle.dcu.ie/].
Still More Preliminaries
Labs:
• begin week five
Lab exams:
• weeks eight and twelve, in the regular lab slot (Fridays at 14:00)
Starters for 10.
• List the powers of 2.
• What is `2^{32}`?
• What is `2^{64}`?
Starters for 10..
• What is a register?
• What is a bus?
• What does USB stand for?
• What is a frame buffer?
• What is an interrupt?
Starters for 10…
• What’s special about this IP address: 127.0.0.1?
• What’s special about this IP address: 192.168.3.3?
• What’s special about this IP address: 192.168.3.255?
• Could every person on earth be allocated a unique IP address?
• Old versions of the Linux ext2 file system had a 2GB limit on file sizes. Why?
Observations on Processor Speed
What’s been happening?
• Microprocessors predominate.
• Moore’s law, but that’s run its course (clock rates aren’t increasing so much any more).
• Widespread introduction of multi-core systems.
• Recent dominance of 64-bit systems
Note to self…
Speed of light:
• `300000` kilometres per second (approx.)
• …
64-bit Systems…
64-bit:
• native integer size
• registers
• data path widths
• memory addresses
Are 64-bit machines inherently faster than 32-bit machines?
CISC versus RISC
CISC: complex instruction-set computing
RISC: reduced instruction-set computing
CISC versus RISC
Memory constraints influenced early processor designs:
• with small memories, high code density [http://en.wikipedia.org/wiki/Instruction_set#Code_density] was necessary
• this led to the development of processors with complex instruction sets:
• a single instruction might implement a high-level programming-language operation
• complex addressing modes
• e.g. b = a[i] + 1
CISC versus RISC
As memory costs reduced:
• memory size constraints lessened
• code did not need to be so dense
• reduced instruction sets became viable
• a single high-level programming-language operation might be implemented by several instructions
Almost all modern processors implement reduced instruction sets.
A simple computer…
Note
Source [http://www-cs-faculty.stanford.edu/~eroberts/courses/soco/projects/risc/risccisc/].
Example — The Problem
The problem:
• a = a * b;
• so: multiply memory locations 5:2 and 2:3 (say)
Example — CISC Approach
CISC approach:
MULT 5:2 2:3
• a single, complex instruction
• load both memory locations into registers
• multiply
• store the result back in the appropriate memory location (say 5:2)
Just one instruction encodes a commonly-occurring programming operation which, at the hardware level, involves several steps.
Example — RISC Approach
RISC approach:
LOAD A, 2:3
LOAD B, 5:2
MULT A, B
STORE 2:3, A
Four steps are required:
• so the program memory required is (well, may be) four times larger
• so this approach was only possible when cheaper/larger memory systems became more widespread
RISC
RISC:
• reduced instruction set computing
• computations are performed only on register contents
• the only memory operations are LOAD and STORE
• few, uniformly-sized instructions
RISC Advantages
Both approaches are likely to require roughly the same number of computational steps.
RISC advantages:
• moves complexity from hardware to software (compilers)
• smarter compilers make better use of registers
• fewer transistors:
• so smaller, can be clocked faster, reduced power consumption, less heat
• pipelining (and super-scalar processing)
Multi-Core Systems
With Moore’s Law having run its course:
• the last decade has seen multi-core systems come into widespread use
• even in hand-held devices
The most common architecture is that of symmetric multiprocessors:
• multiple processing elements supporting the same instruction set, memory model, etc.
…
Programming parallel systems is hard:
• and parallelism seems likely to be the way forward in terms of computational power for the foreseeable future
The burden of exploiting hardware improvements is on the programmer.
• e.g., how do you exploit parallelism for quicksort? (see the sketch below)
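As one illustration (a minimal sketch, not from the module, assuming an OpenMP-capable C compiler): after partitioning, the two halves of the array are independent, so each recursive call can become a task that may run on another core.

    /* A minimal sketch (assumes an OpenMP-capable C compiler) of one way
       to parallelise quicksort: after partitioning, the two halves are
       independent, so each recursive call becomes a task. */
    #include <omp.h>

    static void swap_ints(int *x, int *y) { int t = *x; *x = *y; *y = t; }

    static void qsort_task(int *a, int lo, int hi)
    {
        if (lo >= hi)
            return;
        int pivot = a[hi], i = lo;              /* Lomuto partition */
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot)
                swap_ints(&a[i++], &a[j]);
        swap_ints(&a[i], &a[hi]);
        /* Recurse on each half as an independent task; suppress task
           creation for small partitions so overhead doesn't dominate. */
        #pragma omp task if (i - lo > 1000)
        qsort_task(a, lo, i - 1);
        #pragma omp task if (hi - i > 1000)
        qsort_task(a, i + 1, hi);
    }

    void parallel_quicksort(int *a, int n)
    {
        #pragma omp parallel        /* create the thread team once */
        #pragma omp single nowait   /* one thread seeds the recursion */
        qsort_task(a, 0, n - 1);
    }                               /* implicit barrier: all tasks finish */

Even so, the partitioning step itself remains sequential, which limits the achievable overall speedup (see Amdahl’s law, later).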
Cell Architecture
Computer Performance
How might we measure computer performance?
Answer?
It depends.
Answer?
Usually:
• we’re interested in how long it takes to get some work done
So:
• wall-clock time might be a good measure
However …
It depends how/why we’re measuring.
Wall-clock time includes:
• user CPU time
• system CPU time
• interrupt handling time
• I/O time (to/from terminal, disk, network)
CPU Architectures
If we’re interested in comparing processors:
• we may be more interested in the number of clock cycles necessary to complete some task
Clock Rate
Clock rate:
• the number of clock cycles per unit time (usually, per second)
• say, 2GHz
CPU Clock Cycles
CPU clock cycles:
• the number of clock cycles necessary to complete some job
Example
Say:
• clock rate: 2GHz, so `2 times 10^9` cycles per second
• CPU clock cycles: `4 times 10^8`
CPU Time
CPU time:
• `text{CPU time} = text{CPU clock cycles} / text{clock rate}`
CPU Time
CPU time:
• `text{CPU time} = text{CPU clock cycles} / text{clock rate}`
Example:
• `{4 times 10^8} / {2 times 10^9} = 0.2s`
Alternatively
But that approach:
• is too dependent on a single job
Alternatively
Better:
• derive a metric which is (somewhat) independent of any particular job
• let IC be the instruction count: the number of instructions needed to complete some job
Say:
• IC is `2 times 10^8`
Then …
Then:
• cycles per instruction (CPI): `text{CPI} = text{CPU clock cycles}/text{IC}`
Example:
• `text{CPI} = {4 times 10^8} / {2 times 10^8} = 2`, so two cycles per instruction
Then again …
Then:
• CPU time: `text{CPU time} = {text{IC} times text{CPI}} / text{clock rate}`
Example:
• `text{CPU time} = {2 times 10^8 times 2} / {2 times 10^9} = 0.2s`
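As a minimal sketch of how these formulas chain together (variable names are illustrative, not from the slides):

    /* A sketch of the formulas above, using the running example
       (2GHz clock, IC = 2e8, CPI = 2); names are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        double clock_rate = 2e9;                /* cycles per second      */
        double ic         = 2e8;                /* instruction count      */
        double cpi        = 2.0;                /* cycles per instruction */

        double cycles   = ic * cpi;             /* CPU clock cycles: 4e8  */
        double cpu_time = cycles / clock_rate;  /* 0.2 seconds            */

        printf("cycles = %g, CPU time = %gs\n", cycles, cpu_time);
        return 0;
    }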
So …
• `text{CPU time} = {text{IC} times text{CPI}} / text{clock rate} `
So, to make things go faster (reduce CPU time):
• reduce the instruction count (IC)
• reduce the number of cycles per instruction (CPI), or
• increase the clock rate
Improvements in CPI
The Intel 8086 instruction PUSH AX:
• 8086 — 11 clock cycles
• 80286 — 3 clock cycles
• 80386 — 2 clock cycles
• 80486 — 1 clock cycle
So:
• it is not just clock speed that has improved over the years
• in fact: it is now commonplace to see `text{CPI} le 1`
Example
Example:
• two machines (A and B) implementing the same instruction set architecture
• A has cycle time of 10ns and CPI of 2.0 (for some prog. P)
• B has cycle time of 20ns and CPI of 1.2 (for same P)
Which is faster?
Aside
Note
The cycle time (in seconds) is just the reciprocal of the clock speed (in hertz), and vice versa.
Example
• CPU time for A is `{ text{IC} times 2.0 } / {1 times 10^8} = text{IC} times 2 times 10^{-8}`
• CPU time for B is `{ text{IC} times 1.2 } / {5 times 10^7} = text{IC} times 2.4 times 10^{-8}`
So:
• it will take `20%` longer on machine B than it would on machine A
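A quick check of this result (a sketch; since IC is the same for both machines, it cancels, and we can compare CPU time per instruction):

    /* Sketch verifying the comparison above: with a common IC, CPU time
       per instruction is just cycle time multiplied by CPI. */
    #include <stdio.h>

    int main(void)
    {
        double a = 10e-9 * 2.0;   /* machine A: 10ns cycle, CPI 2.0 */
        double b = 20e-9 * 1.2;   /* machine B: 20ns cycle, CPI 1.2 */
        printf("B/A = %g\n", b / a);   /* 1.2, i.e. 20% longer on B */
        return 0;
    }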
Warning: CPI can be Misleading
Consider two variants of the same processor:
1. one with a floating-point unit
2. one without a floating-point unit
But:
• how are FP operations handled in case 2?
Warning: CPI can be Misleading
Two possibilities (both effectively the same):
• the processor implements microcode to emulate FP operations with integer operations, or
• the compiler generates such instructions
Warning: CPI can be Misleading
Cases:
1. floating-point operations require 10 clock cycles (say), so CPI is 10
2. compiler generates (say) 300 integer instructions, each requiring 1 clock cycle, so CPI is 1
But:
• which is actually faster?
• the one with the higher CPI!
More Common Metrics
MIPS:
• `text{MIPS} = text{clock rate} / {text{CPI} times 10^6}`
MFLOPS:
• `text{MFLOPS} = text{clock rate} / {text{C-per-FPI} times 10^6}`
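For instance (a sketch reusing the earlier example figures; here C-per-FPI means clock cycles per floating-point instruction):

    /* Sketch of the MIPS metric with the earlier example figures (2GHz
       clock, CPI 2); MFLOPS is analogous, with cycles per floating-point
       instruction in place of CPI. */
    #include <stdio.h>

    int main(void)
    {
        double clock_rate = 2e9, cpi = 2.0;
        printf("MIPS = %g\n", clock_rate / (cpi * 1e6));   /* 1000 */
        return 0;
    }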
MIPS and MFLOPS
These can be poor metrics for comparing different processors:
• some implement FP division (e.g. Pentium)
• some don’t (e.g. SPARC)
Instruction counts:
• the processors may have different instruction sets (so the ICs will be different)
• instruction counts for complex operations like sine and cosine may be quite large
• so these differences can be significant
Improving Performance
Generally:
• optimise for the common case
Improving Performance
However, (particularly) with computer hardware:
• optimisation is expensive (it requires substantial investment)
So:
• we need to decide where to invest in optimisation, and
• we need to know that the payback is going to be worth it
Speedup
Consider some possible hardware or software enhancement.
Speedup:
• `text{performance without enhancement} / text{performance with enhancement}`
Note
"Performance", here, might be response time (say).With speedup, larger values are better.
Speedup — Example
Example:
• a baseline implementation might execute a job in 3 seconds
• with some enhancement, that might be reduced to 2 seconds
Speedup:
• `3/2 = 1.5`
Important Gotcha!
Typically:
• only a portion of an entire job will be sped up by any proposed enhancement
Example:
• sort the contents of a disk file, storing the sorted results back in a new file on disk, so: read data in, sort it, write data out
• an enhanced sorting algorithm can only improve the CPU costs, not the IO costs
• an enhanced IO subsystem can only improve the IO costs, not the sorting costs
Example
Assume:
• some job involving sub-jobs A and B
• B accounts for 70% of the execution time, A the rest
Given a proposed enhancement:
• running B 20 times faster
How much faster would our job run overall?
Amdahl’s Law
Amdahl’s law:
• `text{speedup} = 1 / {(1-P) + P/S}`
Where:
• P is the proportion of the job affected by the enhancement, and
• S is the speedup associated (just) with P
Amdahl’s Law — Derivation
Assume our job takes 1 baseline time unit:
• `text{speedup} = 1 / text{response time with enhancement}`
• `text{response time with enhancement} = text{unenhanced time} + text{enhanced time}`
• `text{response time with enhancement} = (1-P) + P/S`
• `text{speedup} = 1 / {(1-P) + P/S}`
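The derivation translates directly into a small function (a minimal sketch; the function and variable names are illustrative, not from the slides):

    /* Sketch of Amdahl's law: p is the proportion of the job affected by
       the enhancement, s the speedup of that portion alone. */
    #include <stdio.h>

    static double amdahl(double p, double s)
    {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main(void)
    {
        /* The sub-job example from earlier: P = 0.7, S = 20. */
        printf("speedup = %.3f\n", amdahl(0.7, 20.0));   /* ~2.985 */
        return 0;
    }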
Example (Repeated)
Assume:
• some job involving sub-jobs A and B
• B accounts for 70% of the execution time, A the rest
Given a proposed enhancement:
• running B 20 times faster
How much faster would our job run overall?
Amdahl’s Law — Example
Example (from previously):
• P is 0.7
• S is 20
Amdahl’s Law — Example
Overall speedup:
• `1 / {(1-P) + P/S}`
• `1 / {(1-0.7) + 0.7/20}`
• `1 / {0.3 + 0.035}`
• `2.985` (approximately)
Example
Given a proposed enhancement:
• running B 20 times faster
How much faster would our job run overall?
It will run about three times faster:
• this may be less than you intuitively expected.
Another Example
Amdahl’s law also allows comparison between two or more design alternatives.
Another Example
Example:
• a program spends:
• half its time doing floating-point operations
• including 20% of its time calculating floating-point square roots
Alternative optimisations:
1. Add floating-point square root hardware which speeds up such operations by a factor of 10.
2. Make all floating-point operations run twice as fast.
Engineering
Assuming we can only choose one:
• in which of these optimisations should we invest?
Engineering — First Case
Optimisations:
• Add floating-point square root hardware which speeds up such operations by a factor of 10.
Amdahl’s law:
• `text{speedup} = 1 / {0.8 + 0.2 / 10} = 1.22`
Engineering — Second Case
Optimisations:
• Make all floating-point operations run twice as fast.
Amdahl’s law:
• `text{speedup} = 1 / {0.5 + 0.5 / 2} = 1.33`
So, under these assumptions, the second approach looks like the better investment.
Corollary
Amdahl’s law tells us to:
• make the common case fast!
Or:
• we can never see a big speedup by optimising the uncommon case
Another Example
Protein match:
• currently takes four days
• 20% of time doing integer operations
• 35% of time doing I/O
Which is the better trade off?
1. Compiler optimisation to reduce the number of integer operations by 20%.
2. Hardware optimisation that reduces latency of IO operations from 6µs to 5µs.
Answer
The speedups are:
1. 1.042
2. 1.062
So it looks like the second option is better.
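These figures follow from Amdahl’s law (a sketch of the working, which the slide leaves out; it assumes that 20% fewer integer operations means the integer portion runs in 0.8 of its original time, and that I/O time scales with the 6µs-to-5µs latency reduction):

    /* Sketch deriving the two speedups above.
       Option 1: integer work is 20% of the time; 20% fewer integer
                 operations (assumed to mean 0.8 of the original time),
                 so the local speedup is 1/0.8.
       Option 2: I/O is 35% of the time; 6us -> 5us gives a local
                 speedup of 6/5. */
    #include <stdio.h>

    static double amdahl(double p, double s)
    {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main(void)
    {
        printf("option 1: %.3f\n", amdahl(0.20, 1.0 / 0.8));   /* 1.042 */
        printf("option 2: %.3f\n", amdahl(0.35, 6.0 / 5.0));   /* 1.062 */
        return 0;
    }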
If you were the engineer:
• what would you choose to do?
From a previous exam…
Assume some sequential job composed exactly of three distinct parts (A, B and C) in which B accounts for 50% of the execution time and C for 30%.
Further assume two possible improvements:
1. the first improvement would result in part A running 100 times faster, and
2. the second in part B running 20 times faster.
If only one of the improvements can be chosen, which would you recommend?
Done