CA226 — Advanced Computer Architecture
Transcript of CA226 — Advanced Computer Architecture
Preliminaries
Contacting me:
1. before or after lectures, or during labs
2. in my office: L1.11
3. at [email protected] [mailto:[email protected]]; please put the module code (ca226) in the subject line
More Preliminaries
Course web site:
• http://ca226.computing.dcu.ie/ (use your School of Computing credentials)
There’s a link to this site on Moodle [http://moodle.dcu.ie/].
Still More Preliminaries
Labs:
• begin week five
Lab exams:
• weeks eight and twelve, in the regular lab slot (Fridays at 14:00)
Starters for 10.
• List the powers of 2.
• What is `2^{32}`?
• What is `2^{64}`?
Starters for 10..
• What is a register?
• What is a bus?
• What does USB stand for?
• What is a frame buffer?
• What is an interrupt?
Starters for 10…
• What’s special about this IP address: 127.0.0.1?
• What’s special about this IP address: 192.168.3.3?
• What’s special about this IP address: 192.168.3.255?
• Could every person on earth be allocated a unique IP address?
• Old versions of the Linux ext2 file system had a 2GB limit on file sizes. Why?
Observations on Processor Speed
What’s been happening?
• Microprocessors predominate.
• Moore’s law, but that’s run its course (clock rates aren’t increasing so much any more).
• Widespread introduction of multi-core systems.
• Recent dominance of 64-bit systems
Note to self…
Speed of light:
• `300000` kilometres per second (approx.)
• …
64-bit Systems…
64-bit:
• native integer size
• registers
• data path widths
• memory addresses
Are 64-bit machines inherently faster than 32-bit machines?
CISC versus RISC
CISC: complex instruction-set computing
RISC: reduced instruction-set computing
CISC versus RISC
Memory constraints influenced early processor designs:
• with small memories, high code density [http://en.wikipedia.org/wiki/Instruction_set#Code_density] was necessary
• this led to the development of processors with complex instruction sets:
• a single instruction might implement a high-level programming-language operation
• complex addressing modes
• e.g. b = a[i] + 1
CISC versus RISC
As memory costs reduced:
• memory size constraints lessened
• code did not need to be so dense
• reduced instruction sets became viable
• a single high-level programming-language operation might be implemented by several instructions
Almost all modern processors implement reduced instruction sets.
A simple computer…
Note
Source [http://www-cs-faculty.stanford.edu/~eroberts/courses/soco/projects/risc/risccisc/].
Example — The Problem
The problem:
• a = a * b;
• so: multiply memory locations 5:2 and 2:3 (say)
Example — CISC Approach
CISC approach:
MULT 5:2 2:3
• a single, complex instruction
• load both memory locations into registers
• multiply
• store the result back in the appropriate memory location (say 5:2)
Just one instruction encodes a commonly-occurring programming operation which, at the hardware level, involves several steps.
Example — RISC Approach
RISC approach:
LOAD A, 2:3
LOAD B, 5:2
MULT A, B
STORE 2:3, A
Four steps are required:
• so the program memory required is (well, may be) four times larger
• so this approach was only possible when cheaper/larger memory systems became more widespread
RISC
RISC:
• reduced instruction set computing
• computations are performed only on register contents
• the only memory operations are LOAD and STORE
• few, uniformly-sized instructions
RISC Advantages
Both approaches are likely to require roughly the same number of computational steps.
RISC advantages:
• moves complexity from hardware to software (compilers)
• smarter compilers make better use of registers
• fewer transistors:
• so smaller, can be clocked faster, reduced power consumption, less heat
• pipelining (and super-scalar processing)
Multi-Core Systems
With Moore’s Law having run its course:
• the last decade has seen multi-core systems come into widespread use
• even in hand-held devices
The most common architecture is that of symmetric multiprocessors:
• multiple processing elements supporting the same instruction set, memory model, etc.
…
Programming parallel systems is hard:
• and parallelism seems likely to be the way forward in terms of computational power for the foreseeable future
The burden of exploiting hardware improvements is on the programmer.
• e.g., how do you exploit parallelism for quicksort? (see the sketch below)
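As one illustration (a minimal sketch, not from the module, assuming an OpenMP-capable C compiler): after partitioning, the two halves of the array are independent, so each recursive call can become a task that may run on another core.

    /* A minimal sketch (assumes an OpenMP-capable C compiler) of one way
       to parallelise quicksort: after partitioning, the two halves are
       independent, so each recursive call becomes a task. */
    #include <omp.h>

    static void swap_ints(int *x, int *y) { int t = *x; *x = *y; *y = t; }

    static void qsort_task(int *a, int lo, int hi)
    {
        if (lo >= hi)
            return;
        int pivot = a[hi], i = lo;              /* Lomuto partition */
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot)
                swap_ints(&a[i++], &a[j]);
        swap_ints(&a[i], &a[hi]);
        /* Recurse on each half as an independent task; suppress task
           creation for small partitions so overhead doesn't dominate. */
        #pragma omp task if (i - lo > 1000)
        qsort_task(a, lo, i - 1);
        #pragma omp task if (hi - i > 1000)
        qsort_task(a, i + 1, hi);
    }

    void parallel_quicksort(int *a, int n)
    {
        #pragma omp parallel        /* create the thread team once */
        #pragma omp single nowait   /* one thread seeds the recursion */
        qsort_task(a, 0, n - 1);
    }                               /* implicit barrier: all tasks finish */

Even so, the partitioning step itself remains sequential, which limits the achievable overall speedup (see Amdahl’s law, later).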
Cell Architecture
Computer Performance
How might we measure computer performance?
Answer?
It depends.
Answer?
Usually:
• we’re interested in how long it takes to get some work done
So:
• wall-clock time might be a good measure
However …
It depends how/why we’re measuring.
Wall-clock time includes:
• user CPU time
• system CPU time
• interrupt handling time
• I/O time (to/from terminal, disk, network)
CPU Architectures
If we’re interested in comparing processors:
• we may be more interested in the number of clock cycles necessary to complete some task
Clock Rate
Clock rate:
• the number of clock cycles per unit time (usually, per second)
• say, 2GHz
CPU Clock Cycles
CPU clock cycles:
• the number of clock cycles necessary to complete some job
Example
Say:
• clock rate: 2GHz, so `2 times 10^9` cycles per second
• CPU clock cycles: `4 times 10^8`
CPU Time
CPU time:
• `text{CPU time} = text{CPU clock cycles} / text{clock rate}`
CPU Time
CPU time:
• `text{CPU time} = text{CPU clock cycles} / text{clock rate}`
Example:
• `{4 times 10^8} / {2 times 10^9} = 0.2s`
Alternatively
But that approach:
• is too dependent on a single job
Alternatively
Better:
• derive a metric which is (somewhat) independent of any particular job
• let IC be the instruction count: the number of instructions needed to complete some job
Say:
• IC is `2 times 10^8`
Then …
Then:
• cycles per instruction (CPI): `text{CPI} = text{CPU clock cycles}/text{IC}`
Example:
• `text{CPI} = {4 times 10^8} / {2 times 10^8} = 2`, so two cycles per instruction
Then again …
Then:
• CPU time: `text{CPU time} = {text{IC} times text{CPI}} / text{clock rate}`
Example:
• `text{CPU time} = {2 times 10^8 times 2} / {2 times 10^9} = 0.2s`
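As a minimal sketch of how these formulas chain together (variable names are illustrative, not from the slides):

    /* A sketch of the formulas above, using the running example
       (2GHz clock, IC = 2e8, CPI = 2); names are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        double clock_rate = 2e9;                /* cycles per second      */
        double ic         = 2e8;                /* instruction count      */
        double cpi        = 2.0;                /* cycles per instruction */

        double cycles   = ic * cpi;             /* CPU clock cycles: 4e8  */
        double cpu_time = cycles / clock_rate;  /* 0.2 seconds            */

        printf("cycles = %g, CPU time = %gs\n", cycles, cpu_time);
        return 0;
    }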
So …
• `text{CPU time} = {text{IC} times text{CPI}} / text{clock rate} `
So, to make things go faster (reduce CPU time):
• reduce the instruction count (IC)
• reduce the number of cycles per instruction (CPI), or
• increase the clock rate
Improvements in CPI
The Intel 8086 instruction PUSH AX:
• 8086 — 11 clock cycles
• 80286 — 3 clock cycles
• 80386 — 2 clock cycles
• 80486 — 1 clock cycle
So:
• it is not just clock speed that has improved over the years
• in fact: it is now commonplace to see `text{CPI} le 1`
Example
Example:
• two machines (A and B) implementing the same instruction set architecture
• A has cycle time of 10ns and CPI of 2.0 (for some prog. P)
• B has cycle time of 20ns and CPI of 1.2 (for same P)
Which is faster?
Aside
Note
The cycle time (in seconds) is just the reciprocal of the clock speed (in hertz), and vice versa.
Example
• CPU time for A is `{ text{IC} times 2.0 } / {1 times 10^8} = text{IC} times 2 times 10^{-8}`
• CPU time for B is `{ text{IC} times 1.2 } / {5 times 10^7} = text{IC} times 2.4 times 10^{-8}`
So:
• it will take `20%` longer on machine B than it would on machine A
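A quick check of this result (a sketch; since IC is the same for both machines, it cancels, and we can compare CPU time per instruction):

    /* Sketch verifying the comparison above: with a common IC, CPU time
       per instruction is just cycle time multiplied by CPI. */
    #include <stdio.h>

    int main(void)
    {
        double a = 10e-9 * 2.0;   /* machine A: 10ns cycle, CPI 2.0 */
        double b = 20e-9 * 1.2;   /* machine B: 20ns cycle, CPI 1.2 */
        printf("B/A = %g\n", b / a);   /* 1.2, i.e. 20% longer on B */
        return 0;
    }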
Warning: CPI can be Misleading
Consider two variants of the same processor:
1. one with a floating-point unit
2. one without a floating-point unit
But:
• how are FP operations handled in case 2?
Warning: CPI can be Misleading
Two possibilities (both effectively the same):
• the processor implements microcode to emulate FP operations with integer operations, or
• the compiler generates such instructions
Warning: CPI can be Misleading
Cases:
1. floating-point operations require 10 clock cycles (say), so CPI is 10
2. compiler generates (say) 300 integer instructions, each requiring 1 clock cycle, so CPI is 1
But:
• which is actually faster?
• the one with the higher CPI!
More Common Metrics
MIPS:
• `text{MIPS} = text{clock rate} / {text{CPI} times 10^6}`
MFLOPS:
• `text{MFLOPS} = text{clock rate} / {text{C-per-FPI} times 10^6}`
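For instance (a sketch reusing the earlier example figures; here C-per-FPI means clock cycles per floating-point instruction):

    /* Sketch of the MIPS metric with the earlier example figures (2GHz
       clock, CPI 2); MFLOPS is analogous, with cycles per floating-point
       instruction in place of CPI. */
    #include <stdio.h>

    int main(void)
    {
        double clock_rate = 2e9, cpi = 2.0;
        printf("MIPS = %g\n", clock_rate / (cpi * 1e6));   /* 1000 */
        return 0;
    }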
MIPS and MFLOPS
These can be poor metrics for comparing different processors:
• some implement FP division (e.g. Pentium)
• some don’t (e.g. SPARC)
Instruction counts:
• the processors may have different instruction sets (so the ICs will be different)
• instruction counts for complex operations like sine and cosine may be quite large
• so these differences can be significant
Improving Performance
Generally:
• optimise for the common case
Improving Performance
However, (particularly) with computer hardware:
• optimisation is expensive (it requires substantial investment)
So:
• we need to decide where to invest in optimisation, and
• we need to know that the payback is going to be worth it
Speedup
Consider some possible hardware or software enhancement.
Speedup:
• `text{performance without enhancement} / text{performance with enhancement}`
Note
"Performance", here, might be response time (say).With speedup, larger values are better.
Speedup — Example
Example:
• a baseline implementation might execute a job in 3 seconds
• with some enhancement, that might be reduced to 2 seconds
Speedup:
• `3/2 = 1.5`
Important Gotcha!
Typically:
• only a portion of an entire job will be sped up by any proposed enhancement
Example:
• sort the contents of a disk file, storing the sorted results back in a new file on disk, so: read data in, sort it, write data out
• an enhanced sorting algorithm can only improve the CPU costs, not the IO costs
• an enhanced IO subsystem can only improve the IO costs, not the sorting costs
Example
Assume:
• some job involving sub-jobs A and B
• B accounts for 70% of the execution time, A the rest
Given a proposed enhancement:
• running B 20 times faster
How much faster would our job run overall?
Amdahl’s Law
Amdahl’s law:
• `text{speedup} = 1 / {(1-P) + P/S}`
Where:
• P is the proportion of the job affected by the enhancement, and
• S is the speedup associated (just) with P
Amdahl’s Law — Derivation
Assume our job takes 1 baseline time unit:
• `text{speedup} = 1 / text{response time with enhancement}`
• `text{response time with enhancement} = text{unenhanced time} + text{enhanced time}`
• `text{response time with enhancement} = (1-P) + P/S`
• `text{speedup} = 1 / {(1-P) + P/S}`
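The derivation translates directly into a small function (a minimal sketch; the function and variable names are illustrative, not from the slides):

    /* Sketch of Amdahl's law: p is the proportion of the job affected by
       the enhancement, s the speedup of that portion alone. */
    #include <stdio.h>

    static double amdahl(double p, double s)
    {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main(void)
    {
        /* The sub-job example from earlier: P = 0.7, S = 20. */
        printf("speedup = %.3f\n", amdahl(0.7, 20.0));   /* ~2.985 */
        return 0;
    }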
Example (Repeated)
Assume:
• some job involving sub-jobs A and B
• B accounts for 70% of the execution time, A the rest
Given a proposed enhancement:
• running B 20 times faster
How much faster would our job run overall?
Amdahl’s Law — Example
Example (from previously):
• P is 0.7
• S is 20
Amdahl’s Law — Example
Overall speedup:
• `1 / {(1-P) + P/S}`
• `1 / {(1-0.7) + 0.7/20}`
• `1 / {0.3 + 0.035}`
• `2.985` (approximately)
Example
Given a proposed enhancement:
• running B 20 times faster
How much faster would our job run overall?
It will run about three times faster:
• this may be less than you intuitively expected.
Another Example
Amdahl’s law also allows comparison between two or more design alternatives.
Another Example
Example:
• a program spends:
• half its time doing floating-point operations
• including 20% of its time calculating floating-point square roots
Alternative optimisations:
1. Add floating-point square root hardware which speeds up such operations by a factor of 10.
2. Make all floating-point operations run twice as fast.
Engineering
Assuming we can only choose one:
• in which of these optimisations should we invest?
Engineering — First Case
Optimisations:
• Add floating-point square root hardware which speeds up such operations by a factor of 10.
Amdahl’s law:
• `text{speedup} = 1 / {0.8 + 0.2 / 10} = 1.22`
Engineering — Second Case
Optimisations:
• Make all floating-point operations run twice as fast.
Amdahl’s law:
• `text{speedup} = 1 / {0.5 + 0.5 / 2} = 1.33`
So, under these assumptions, the second approach looks like the better investment.
Corollary
Amdahl’s law tells us to:
• make the common case fast!
Or:
• we can never see a big speedup by optimising the uncommon case
Another Example
Protein match:
• currently takes four days
• 20% of time doing integer operations
• 35% of time doing I/O
Which is the better trade off?
1. Compiler optimisation to reduce the number of integer operations by 20%.
2. Hardware optimisation that reduces latency of IO operations from 6µs to 5µs.
Answer
The speedups are:
1. 1.042
2. 1.062
So it looks like the second option is better.
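These figures follow from Amdahl’s law (a sketch of the working, which the slide leaves out; it assumes that 20% fewer integer operations means the integer portion runs in 0.8 of its original time, and that I/O time scales with the 6µs-to-5µs latency reduction):

    /* Sketch deriving the two speedups above.
       Option 1: integer work is 20% of the time; 20% fewer integer
                 operations (assumed to mean 0.8 of the original time),
                 so the local speedup is 1/0.8.
       Option 2: I/O is 35% of the time; 6us -> 5us gives a local
                 speedup of 6/5. */
    #include <stdio.h>

    static double amdahl(double p, double s)
    {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main(void)
    {
        printf("option 1: %.3f\n", amdahl(0.20, 1.0 / 0.8));   /* 1.042 */
        printf("option 2: %.3f\n", amdahl(0.35, 6.0 / 5.0));   /* 1.062 */
        return 0;
    }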
If you were the engineer:
• what would you choose to do?
From a previous exam…
Assume some sequential job composed exactly of three distinct parts (A, B and C) in which B accounts for 50% of the execution time and C for 30%.
Further assume two possible improvements:
1. the first improvement would result in part A running 100 times faster, and
2. the second in part B running 20 times faster.
If only one of the improvements can be chosen, which would you recommend?
Done