1
COMP 206: Computer Architecture and Implementation
Montek Singh
Wed., Aug 26, 2002
2
Amdahl's Law

$$R_{avg} = \frac{1}{\sum_i \dfrac{F_i}{R_i}} \quad \left[\frac{\text{results}}{\text{second}}\right]$$

R_i = execution rate i (results/second)
F_i = fraction of results generated at this rate
      (Note: NOT "fraction of time spent working at this rate")
R_avg = average execution rate (performance): a weighted harmonic mean
"Bottleneckology: Evaluating Supercomputers", Jack Worlton, COMPCON 85, pp. 405-406
3
Example of Amdahl's Law

30% of results are generated at the rate of 1 MFLOPS, 20% at 10 MFLOPS, and 50% at 100 MFLOPS. What is the average performance? What is the bottleneck?

$$R_{avg} = \frac{1}{\frac{0.3}{1} + \frac{0.2}{10} + \frac{0.5}{100}} = \frac{100}{30 + 2 + 0.5} = \frac{100}{32.5} \approx 3.08 \text{ MFLOPS}$$

Fraction of time spent at each rate: 30/32.5 = 92.3% at 1 MFLOPS, 2/32.5 = 6.2% at 10 MFLOPS, 0.5/32.5 = 1.5% at 100 MFLOPS.

Bottleneck: the rate that consumes most of the time, which here is the 1 MFLOPS rate.
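The rate form of Amdahl's Law is easy to check numerically. Here is a minimal Python sketch (not from the slides; the function name is illustrative) that reproduces the example above:

```python
def average_rate(fractions, rates):
    """Weighted harmonic mean: R_avg = 1 / sum(F_i / R_i)."""
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))

# Slide example: 30% of results at 1 MFLOPS, 20% at 10 MFLOPS, 50% at 100 MFLOPS.
fractions = [0.3, 0.2, 0.5]
rates = [1.0, 10.0, 100.0]                          # MFLOPS
print(average_rate(fractions, rates))               # ~3.08 MFLOPS

# Fraction of time spent at each rate (the largest share is the bottleneck):
times = [f / r for f, r in zip(fractions, rates)]
print([round(t / sum(times), 3) for t in times])    # ~[0.923, 0.062, 0.015]
```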
4
Amdahl's Law (HP3 book, pp. 40-41)

Let R_f = fast (enhanced, new) rate, R_s = slow (unenhanced, old) rate, F_f = fraction of results produced at the fast rate, and F_s = 1 - F_f = fraction of results produced at the slow rate. Then

$$R_{avg} = \frac{1}{\dfrac{F_f}{R_f} + \dfrac{F_s}{R_s}}$$

$$\text{Speedup}_{\text{overall}} = \frac{R_{avg}}{R_s} = \frac{1}{F_s + F_f\,\dfrac{R_s}{R_f}} = \frac{1}{(1 - F_f) + \dfrac{F_f}{R_f / R_s}}$$

With Fraction_enhanced = F_f, Speedup_enhanced = R_f / R_s (the maximum attainable speedup), and Speedup_overall = the speedup actually obtained, this is the familiar form:

$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$
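A minimal sketch of this HP3 speedup form, with illustrative names; the second call just shows that the unenhanced fraction caps the achievable speedup:

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Speedup_overall = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(overall_speedup(0.5, 2.0))   # ~1.33: half the results, produced twice as fast
print(overall_speedup(0.5, 1e9))   # ~2.0: even an "infinite" enhancement is capped
```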
5
Implications of Amdahl's Law

- The performance improvement provided by a feature is limited by how often that feature is used.
- As stated, Amdahl's Law is valid only if the system always works with exactly one of the rates.
  - If a non-blocking cache is used, or there is overlap between CPU and I/O operations, Amdahl's Law as given here is not applicable.
- The bottleneck is the most promising target for improvements: "Make the common case fast." Infrequent events, even if they consume a lot of time, will make little difference to performance.
- Typical use: change only one parameter of the system and compute the effect of this change. The same program, with the same input data, should run on the machine in both cases.
6
"Make The Common Case Fast"

- All instructions require an instruction fetch; only a fraction require a data fetch/store.
  - Optimize instruction access over data access.
- Programs exhibit locality.
  - Spatial locality: items with addresses near one another tend to be referenced close together in time.
  - Temporal locality: recently accessed items are likely to be accessed in the near future.
- Access to small memories is faster.
  - Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories: Reg's -> Cache -> Memory -> Disk / Tape.
7
"Make The Common Case Fast" (2)

- What is the common case?
  - The rate at which the system spends most of its time: the "bottleneck".
- What does this statement mean precisely?
  - Make the common case faster, rather than making some other case faster?
  - Make the common case faster by a certain amount, rather than making some other case faster by the same amount? An absolute amount? A relative amount?
- This principle is merely an informal statement of a frequently correct consequence of Amdahl's Law.
8
"Make The Common Case Fast" (3a)

A machine produces 20% and 80% of its results at the rates of 1 and 3 MFLOPS, respectively. What is more advantageous: to improve the 1 MFLOPS rate, or to improve the 3 MFLOPS rate?

$$R_{avg} = \frac{1}{\frac{0.2}{1} + \frac{0.8}{3}} = \frac{1}{0.2 + 0.267} \approx 2.14 \text{ MFLOPS}$$

Generalize the problem: assume the rates are x and y MFLOPS.

$$R_{avg}(x, y) = \frac{1}{\frac{0.2}{x} + \frac{0.8}{y}} = \frac{xy}{0.2y + 0.8x}$$

$$\frac{\partial R_{avg}}{\partial x} = \frac{0.2\,y^2}{(0.2y + 0.8x)^2}, \qquad \frac{\partial R_{avg}}{\partial y} = \frac{0.8\,x^2}{(0.2y + 0.8x)^2}$$

If we are making the same absolute change to either execution rate, we should improve x if ∂R_avg/∂x > ∂R_avg/∂y, and improve y otherwise.

So, the 3 MFLOPS rate is the common case in this example (it consumes most of the time: 0.8/3 ≈ 0.267 vs. 0.2/1 = 0.2 time units per result). Yet at (x, y) = (1, 3) the derivatives are about 0.92 and 0.41, which indicates that it is better to improve x, the 1 MFLOPS rate, which is NOT the common case.
9
"Make The Common Case Fast" (3b)

Let's say that we want to make the same relative change to one or the other rate, rather than the same absolute change.

$$\text{Change in } R_{avg} \text{ from } x \to (1+\epsilon)x: \quad \Delta R_{avg} \approx \epsilon\, x\, \frac{\partial R_{avg}}{\partial x}$$

$$\text{Change in } R_{avg} \text{ from } y \to (1+\epsilon)y: \quad \Delta R_{avg} \approx \epsilon\, y\, \frac{\partial R_{avg}}{\partial y}$$

If $x\,\partial R_{avg}/\partial x > y\,\partial R_{avg}/\partial y$, then improve x; else improve y. Since $x\,\partial R_{avg}/\partial x \propto 0.2\,xy^2$ and $y\,\partial R_{avg}/\partial y \propto 0.8\,x^2 y$, this amounts to comparing 0.2y with 0.8x.

At (x, y) = (1, 3): 0.2 x 3 = 0.6 < 0.8 x 1 = 0.8, so it is better to improve y, the 3 MFLOPS rate, which IS the common case.
If there are two different execution rates, making the common case faster by the same relative amount is always more advantageous than the alternative. However, this does not necessarily hold if we make absolute changes of the same magnitude. For three or more rates, further analysis is needed.
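Both cases (same absolute change vs. same relative change) can be re-checked numerically. This sketch uses illustrative names and a small perturbation instead of the derivatives:

```python
def r_avg(x, y, fx=0.2, fy=0.8):
    """Average rate when a fraction fx of results comes at rate x and fy at rate y."""
    return 1.0 / (fx / x + fy / y)

x, y = 1.0, 3.0          # MFLOPS
base = r_avg(x, y)       # ~2.14 MFLOPS

# Same absolute change (+0.1 MFLOPS) to either rate: improving x wins.
print(r_avg(x + 0.1, y) - base, r_avg(x, y + 0.1) - base)

# Same relative change (+10%) to either rate: improving y (the common case) wins.
print(r_avg(x * 1.1, y) - base, r_avg(x, y * 1.1) - base)
```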
10
Basics of Performance

$$\text{CPU time} \left[\frac{\text{sec}}{\text{program}}\right] = \text{Instruction count} \left[\frac{\text{instructions}}{\text{program}}\right] \times \text{CPI} \left[\frac{\text{clock cycles}}{\text{instruction}}\right] \times \text{Clock cycle time} \left[\frac{\text{sec}}{\text{clock cycle}}\right]$$

$$\text{CPU cycles for program} \left[\frac{\text{clock cycles}}{\text{program}}\right] = \text{Instruction count} \times \text{CPI}$$

$$\text{CPU time} = \frac{\text{CPU cycles for program}}{\text{Clock rate}}, \qquad \text{Clock rate} \left[\frac{\text{clock cycles}}{\text{sec}}\right] = \frac{1}{\text{Clock cycle time}}$$

$$\text{CPU performance} = \frac{1}{\text{CPU time}} = \frac{\text{Clock rate}}{\text{CPI} \times \text{Instruction count}}$$
11
Details of CPI

For instruction classes i, with IC_i instructions of class i executed and CPI_i clock cycles per instruction of class i:

$$\text{CPU cycles} = \sum_i \text{CPI}_i \times \text{IC}_i$$

$$\text{CPI} = \frac{\sum_i \text{CPI}_i \times \text{IC}_i}{\text{Instruction count}} = \sum_i \text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction count}}$$

$$\text{CPU performance} = \frac{\text{Clock rate}}{\text{CPI} \times \text{Instruction count}} = \frac{\text{Clock rate}}{\sum_i \text{CPI}_i \times \text{IC}_i}$$
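A minimal sketch of the CPU-time and weighted-CPI formulas from the last two slides; the instruction mix is the one used later in Examples 3 and 4, and the names are illustrative:

```python
def weighted_cpi(mix):
    """mix: list of (frequency, cpi) pairs; frequencies sum to 1."""
    return sum(f * cpi for f, cpi in mix)

def cpu_time(instruction_count, cpi, clock_rate_hz):
    return instruction_count * cpi / clock_rate_hz

mix = [(0.43, 1), (0.21, 2), (0.12, 2), (0.24, 2)]   # ALU, loads, stores, branches
cpi = weighted_cpi(mix)                               # 1.57
print(cpi)
print(cpu_time(1e9, cpi, 500e6))                      # 3.14 s for 10^9 instructions at 500 MHz
```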
12
MIPS

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{CPU time} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$

$$\text{CPU time} = \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} = \frac{\text{Instruction count}}{\text{MIPS} \times 10^6}$$

- Machines with different instruction sets?
- Programs with different instruction mixes?
  - Dynamic frequency of instructions
- Uncorrelated with performance
- Marketing metric: "Meaningless Indicator of Processor Speed"
13
MFLOP/s

$$\text{MFLOP/s} = \frac{\text{Number of FP operations}}{\text{CPU time} \times 10^6}$$

- Popular in the supercomputing community
- Often not where time is spent
- Not all FP operations are equal: "normalized" MFLOP/s
- Can magnify performance differences: a better algorithm (e.g., with better data reuse) can run faster even with a higher FLOP count (DGEQRF vs. DGEQR2 in LAPACK)
14
Aspects of CPU Performance

Factor                                     Clock rate   CPI   Instruction count
Hardware technology (realization)          x
Hardware organization (implementation)     x            x
Instruction set (architecture)                          x     x
Compiler technology                                     x     x
Program                                                 x     x
15
Example 1 (HP2, p. 31)

Which change is more effective on a certain machine: speeding up 10-fold the floating point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating point operations, which take up 50% of total execution time? (Assume that the cost of accomplishing either change is the same, and the two changes are mutually exclusive.)

Fsqrt = fraction of FP sqrt results            Rsqrt = rate of producing FP sqrt results
Fnon-sqrt = fraction of non-sqrt results       Rnon-sqrt = rate of producing non-sqrt results
Ffp = fraction of FP results                   Rfp = rate of producing FP results
Fnon-fp = fraction of non-FP results           Rnon-fp = rate of producing non-FP results
Rbefore = average rate of producing results before enhancement
Rafter = average rate of producing results after enhancement

Since FP sqrt operations account for 20% of execution time and all FP operations for 50%:

$$\frac{F_{\text{non-sqrt}}}{R_{\text{non-sqrt}}} = 4\,\frac{F_{\text{sqrt}}}{R_{\text{sqrt}}}, \qquad \frac{F_{\text{non-fp}}}{R_{\text{non-fp}}} = \frac{F_{\text{fp}}}{R_{\text{fp}}}$$
16
Example 1 (Soln. using Amdahl’s Example 1 (Soln. using Amdahl’s Law)Law)
Example 1 (Soln. using Amdahl's Law)

Improve FP sqrt only. Let x = Fsqrt/Rsqrt, so Fnon-sqrt/Rnon-sqrt = 4x:

$$\frac{1}{R_{before}} = \frac{F_{sqrt}}{R_{sqrt}} + \frac{F_{\text{non-sqrt}}}{R_{\text{non-sqrt}}} = x + 4x = 5x$$

$$\frac{1}{R_{after}} = \frac{F_{sqrt}}{10\,R_{sqrt}} + \frac{F_{\text{non-sqrt}}}{R_{\text{non-sqrt}}} = 0.1x + 4x = 4.1x$$

$$\frac{R_{after}}{R_{before}} = \frac{5}{4.1} \approx 1.22$$

Improve all FP ops. Let y = Ffp/Rfp, so Fnon-fp/Rnon-fp = y:

$$\frac{1}{R_{before}} = \frac{F_{fp}}{R_{fp}} + \frac{F_{\text{non-fp}}}{R_{\text{non-fp}}} = y + y = 2y$$

$$\frac{1}{R_{after}} = \frac{F_{fp}}{2\,R_{fp}} + \frac{F_{\text{non-fp}}}{R_{\text{non-fp}}} = 0.5y + y = 1.5y$$

$$\frac{R_{after}}{R_{before}} = \frac{2}{1.5} \approx 1.33$$

Speeding up all FP operations is therefore the more effective change.
[Bar chart: breakdown of execution time before (b) and after (a) each change: Sqrt (b), Sqrt (a), FP (b), FP (a).]
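The same answers also follow from the HP3 speedup form; this short sketch (names illustrative) re-checks both cases:

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Speedup_overall = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(overall_speedup(0.2, 10))   # FP sqrt only (20% of time, 10x faster): ~1.22
print(overall_speedup(0.5, 2))    # all FP ops (50% of time, 2x faster):    ~1.33
```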
17
Example 2Example 2
Machine A Machine BOperation Frequency CPI Frequency CPICompare 0.2 1Branch 0.2 2Cmp&Branch 0.2/0.8=0.25 2Others 0.6 1 0.6/0.8=0.75 1
Machine A Machine BClockrate 1.25 1Instruction count 1 0.8
Which CPU performs better?Which CPU performs better?Why?
18
Example 2 (Solution)

$$\text{CPI}_A = 0.6 \times 1 + 0.2 \times 2 + 0.2 \times 1 = 1.2$$

$$\text{CPI}_B = 0.75 \times 1 + 0.25 \times 2 = 1.25$$

$$\frac{\text{Perf}_A}{\text{Perf}_B} = \frac{\text{Clock rate}_A / (\text{CPI}_A \times \text{IC}_A)}{\text{Clock rate}_B / (\text{CPI}_B \times \text{IC}_B)} = \frac{1.25 \times \text{CPI}_B \times \text{IC}_B}{\text{CPI}_A \times \text{IC}_A} = \frac{1.25 \times 1.25 \times 0.8}{1.2 \times 1} \approx 1.04$$

Machine A is about 4% faster.
If the clock rate of A were only 1.1x the clock rate of B (equivalently, if the clock cycle time of B were 1.1x that of A), then CPU B would have about 9% higher performance.
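A quick numeric re-check of this solution, with relative clock rates and instruction counts (machine B = 1); names are illustrative:

```python
def performance(clock_rate, cpi, instruction_count):
    return clock_rate / (cpi * instruction_count)

cpi_a = 0.6 * 1 + 0.2 * 2 + 0.2 * 1        # 1.2
cpi_b = 0.75 * 1 + 0.25 * 2                # 1.25
print(performance(1.25, cpi_a, 1.0) / performance(1.0, cpi_b, 0.8))  # ~1.04: A wins
print(performance(1.10, cpi_a, 1.0) / performance(1.0, cpi_b, 0.8))  # ~0.92: B ~9% faster
```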
19
Example 3Example 3
A LOAD/STORE machine has the characteristics shown below. We also observe that 25% of the ALU operations directly use a loaded value that is not used again. Thus we hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?
A LOAD/STORE machine has the characteristics shown below. We also observe that 25% of the ALU operations directly use a loaded value that is not used again. Thus we hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?
Instruction type Frequency CPIALU ops 0.43 1Loads 0.21 2Stores 0.12 2Branches 0.24 2
20
Example 3 (Solution)

Before the change:

Instruction type   Frequency   CPI
ALU ops            0.43        1
Loads              0.21        2
Stores             0.12        2
Branches           0.24        2

$$\text{CPI} = 0.43 \times 1 + (0.21 + 0.12 + 0.24) \times 2 = 1.57$$
$$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time} = 1.57 \times \text{IC} \times T$$

After the change, with x = 0.25 x 0.43 = 0.1075 (the fraction of the original instructions replaced by new reg-mem ops):

Instruction type   Frequency            CPI
ALU ops            (0.43 - x)/(1 - x)   1
Loads              (0.21 - x)/(1 - x)   2
Stores             0.12/(1 - x)         2
Branches           0.24/(1 - x)         3
Reg-mem ops        x/(1 - x)            2

$$\text{CPI} = \frac{(0.43 - x) \times 1 + (0.21 - x) \times 2 + 0.12 \times 2 + 0.24 \times 3 + x \times 2}{1 - x} = \frac{1.7025}{0.8925} \approx 1.908$$
$$\text{CPU time} = (1 - x)\,\text{IC} \times \text{CPI} \times T = 1.703 \times \text{IC} \times T$$
Since CPU time increases, change will not improve performance.
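A sketch that re-derives the before/after cycle counts; names are illustrative and the bookkeeping is done per original instruction:

```python
freq = {"alu": 0.43, "load": 0.21, "store": 0.12, "branch": 0.24}
cpi  = {"alu": 1, "load": 2, "store": 2, "branch": 2}

cycles_before = sum(freq[k] * cpi[k] for k in freq)        # 1.57 cycles per original instruction

x = 0.25 * freq["alu"]                                     # 0.1075: ALU ops folded into reg-mem ops
new_freq = {"alu": freq["alu"] - x, "load": freq["load"] - x,
            "store": freq["store"], "branch": freq["branch"], "regmem": x}
new_cpi  = {"alu": 1, "load": 2, "store": 2, "branch": 3, "regmem": 2}

cycles_after = sum(new_freq[k] * new_cpi[k] for k in new_freq)   # 1.7025 per original instruction
print(cycles_before, cycles_after, cycles_after / (1 - x))       # 1.57, 1.7025, ~1.908 (new CPI)
print(cycles_after > cycles_before)                              # True: the change hurts performance
```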
21
Example 4Example 4
A load-store machine has the characteristics shown below. An optimizingcompiler for the machine discards 50% of the ALU operations, although itcannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns)clock, what is the MIPS rating for optimized code versus unoptimized code?Does the ranking of MIPS agree with the ranking of execution time?
A load-store machine has the characteristics shown below. An optimizingcompiler for the machine discards 50% of the ALU operations, although itcannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns)clock, what is the MIPS rating for optimized code versus unoptimized code?Does the ranking of MIPS agree with the ranking of execution time?
Instruction type Frequency CPIALU ops 43% 1Loads 21% 2Stores 12% 2Branches 24% 2
22
Example 4 (Solution)

Without optimization:

Instruction type   Frequency   CPI
ALU ops            43%         1
Loads              21%         2
Stores             12%         2
Branches           24%         2

$$\text{CPI} = 0.43 \times 1 + (0.21 + 0.12 + 0.24) \times 2 = 1.57$$
$$\text{CPU time} = \text{IC} \times 1.57 \times 2 \times 10^{-9} = 3.14 \times 10^{-9} \times \text{IC}$$
$$\text{MIPS} = \frac{500 \text{ MHz}}{1.57 \times 10^6} \approx 318.5$$

With optimization, with x = 0.43/2 = 0.215 (the discarded ALU ops):

Instruction type   Frequency            CPI
ALU ops            (0.43 - x)/(1 - x)   1
Loads              0.21/(1 - x)         2
Stores             0.12/(1 - x)         2
Branches           0.24/(1 - x)         2

$$\text{CPI} = \frac{(0.43 - x) \times 1 + (0.21 + 0.12 + 0.24) \times 2}{1 - x} = \frac{1.355}{0.785} \approx 1.73$$
$$\text{CPU time} = (1 - x)\,\text{IC} \times 1.73 \times 2 \times 10^{-9} = 2.72 \times 10^{-9} \times \text{IC}$$
$$\text{MIPS} = \frac{500 \text{ MHz}}{1.73 \times 10^6} \approx 289.0$$

Performance increases, but MIPS decreases!
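A sketch re-checking the MIPS ratings and execution times; names are illustrative:

```python
CLOCK_RATE = 500e6                                      # 500 MHz, 2 ns clock cycle

def mips(clock_rate_hz, cpi):
    return clock_rate_hz / (cpi * 1e6)

cpi_unopt = 0.43 * 1 + (0.21 + 0.12 + 0.24) * 2         # 1.57
time_unopt = cpi_unopt / CLOCK_RATE                     # seconds per original instruction

x = 0.43 / 2                                            # half of the ALU ops are discarded
cycles_opt = (0.43 - x) * 1 + (0.21 + 0.12 + 0.24) * 2  # 1.355 cycles per *original* instruction
cpi_opt = cycles_opt / (1 - x)                          # ~1.73 (per remaining instruction)
time_opt = cycles_opt / CLOCK_RATE

print(mips(CLOCK_RATE, cpi_unopt), mips(CLOCK_RATE, cpi_opt))  # ~318.5 vs ~289
print(time_opt < time_unopt)                                   # True: optimized code runs faster
```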
23
Performance of (Blocking) CachesPerformance of (Blocking) Caches
timecycleClock cycles CPU timeCPU
timecycleClock cycles) stallMemory cycles (CPU timeCPU
penalty Miss referenceMemory
Misses
nInstructio
referencesMemory IC
penalty Miss nInstructio
Misses IC
penalty Miss misses ofNumber cycles stallMemory
CPI IC cycles CPU
no cache misses!no cache misses!no cache misses!no cache misses!
with cache misses!with cache misses!with cache misses!with cache misses!
24
ExampleExample
Assume we have a machine where the CPI is 2.0 when allmemory accesses hit in the cache. The only data accessesare loads and stores, and these total 40% of the instructions.If the miss penalty is 25 clock cycles and the miss rate is 2%,how much faster would the machine be if all memoryaccesses were cache hits?
Assume we have a machine where the CPI is 2.0 when allmemory accesses hit in the cache. The only data accessesare loads and stores, and these total 40% of the instructions.If the miss penalty is 25 clock cycles and the miss rate is 2%,how much faster would the machine be if all memoryaccesses were cache hits?
35.12
7.22
2502.0)4.01(2
CPI
penalty Missrate MissnInstructio
refsMemory CPI
timeCPU timeCPU
misses no
misses
Why?
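A one-function sketch of the miss-penalty adjustment used above (names illustrative):

```python
def cpi_with_misses(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty):
    """CPI including memory stall cycles for a blocking cache."""
    return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty

cpi = cpi_with_misses(2.0, 1.0 + 0.4, 0.02, 25)   # one instruction fetch plus 40% data refs
print(cpi, cpi / 2.0)                             # 2.7, 1.35x slower than a perfect cache
```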
25
Means

Let r = (r_1, ..., r_n) be an n-tuple of positive numbers.

$$\text{Quadratic mean: } Q(\mathbf{r}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} r_i^2}$$

$$\text{Arithmetic mean: } A(\mathbf{r}) = \frac{1}{n}\sum_{i=1}^{n} r_i$$

$$\text{Geometric mean: } G(\mathbf{r}) = \left(\prod_{i=1}^{n} r_i\right)^{1/n}$$

$$\text{Harmonic mean: } H(\mathbf{r}) = \frac{n}{\sum_{i=1}^{n} \dfrac{1}{r_i}}$$
26
Weighted Means

Let z = (z_1, ..., z_n) be an n-tuple of positive weights, and let w_i = z_i / (z_1 + ... + z_n) be the normalized weights, so that 0 <= w_i <= 1 and they sum to 1.

$$\text{Weighted quadratic mean: } Q(\mathbf{r}, \mathbf{w}) = \sqrt{\sum_i w_i\, r_i^2}$$

$$\text{Weighted arithmetic mean: } A(\mathbf{r}, \mathbf{w}) = \sum_i w_i\, r_i$$

$$\text{Weighted geometric mean: } G(\mathbf{r}, \mathbf{w}) = \prod_i r_i^{\,w_i}$$

$$\text{Weighted harmonic mean: } H(\mathbf{r}, \mathbf{w}) = \frac{1}{\sum_i \dfrac{w_i}{r_i}}$$
27
Relations among MeansRelations among Means
)max(),(),(),(),()min(
)max()()()()()min(
rwrwrwrwrr
rrrrrr
QAGH
QAGH
Equality holds if and only if all the elements are identical.
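A minimal sketch of the four weighted means (with equal weights they reduce to the unweighted means); the sample rates are the two MFLOPS figures for Computer 1 in the table a few slides ahead:

```python
import math

def q_mean(r, w): return math.sqrt(sum(wi * ri ** 2 for ri, wi in zip(r, w)))
def a_mean(r, w): return sum(wi * ri for ri, wi in zip(r, w))
def g_mean(r, w): return math.prod(ri ** wi for ri, wi in zip(r, w))
def h_mean(r, w): return 1.0 / sum(wi / ri for ri, wi in zip(r, w))

r = [100.0, 0.1]     # MFLOPS of Programs 1 and 2 on Computer 1
w = [0.5, 0.5]       # equal weights
print(h_mean(r, w), g_mean(r, w), a_mean(r, w), q_mean(r, w))
# ~0.2 <= ~3.16 <= 50.05 <= ~70.7, as the inequality chain requires
```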
28
Summarizing Computer Performance

"Characterizing Computer Performance with a Single Number", J. E. Smith, CACM, October 1988, pp. 1202-1206

- The starting point is universally accepted: "The time required to perform a specified amount of computation is the ultimate measure of computer performance."
- How should we summarize (reduce to a single number) the measured execution times (or measured performance values) of several benchmark programs?
- Two required properties:
  - A single-number performance measure for a set of benchmarks expressed in units of time should be directly proportional to the total (weighted) time consumed by the benchmarks.
  - A single-number performance measure for a set of benchmarks expressed as a rate should be inversely proportional to the total (weighted) time consumed by the benchmarks.
29
Arithmetic Mean for Times

Benchmark         FP ops (millions)   Computer 1 (sec)   Computer 2 (sec)   Computer 3 (sec)
Program 1         100                 1                  10                 30
Program 2         100                 1000               150                60
Total time                            1001               160                90
Arithmetic mean                       500.5              82.5               45
Geometric mean                        31.62              38.73              42.43
Harmonic mean                         1.99               18.75              40
Smaller is better for execution times
30
Harmonic Mean for Rates

Benchmark         FP ops (millions)   Computer 1 (MFLOPS)   Computer 2 (MFLOPS)   Computer 3 (MFLOPS)
Program 1         100                 100                   10                    3.33
Program 2         100                 0.1                   0.66                  1.67
Arithmetic mean                       50.05                 5.33                  2.5
Geometric mean                        3.16                  2.58                  2.36
Harmonic mean                         0.19                  1.25                  2.22
Larger is better for execution rates
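The point of the two tables can be re-checked in a few lines: the harmonic mean of rates ranks machines the same way total execution time does, while the arithmetic mean of rates does not. Names are illustrative:

```python
def total_time(work_mflop, rates_mflops):
    """Total seconds to run all programs: sum of work_i / rate_i."""
    return sum(f / r for f, r in zip(work_mflop, rates_mflops))

def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)

work = [100, 100]                                   # millions of FP ops per program
computers = {"Computer 1": [100, 0.1],
             "Computer 2": [10, 0.66],
             "Computer 3": [3.33, 1.67]}            # MFLOPS on Programs 1 and 2
for name, rates in computers.items():
    print(name, round(total_time(work, rates), 1), round(harmonic_mean(rates), 2))
# Computer 3 has both the smallest total time and the largest harmonic mean rate,
# even though the arithmetic mean of rates would rank Computer 1 first.
```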
31
Avoid the Geometric Mean

- If benchmark execution times are normalized to some reference machine, and means of the normalized execution times are computed, only the geometric mean gives consistent results no matter what the reference machine is (see Figure 1.17 in HP3, p. 38).
  - This has led to declaring the geometric mean the preferred method of summarizing execution time (e.g., SPEC).
- Smith's comments:
  - "The geometric mean does provide a consistent measure in this context, but it is consistently wrong."
  - "If performance is to be normalized with respect to a specific machine, an aggregate performance measure such as total time or harmonic mean rate should be calculated before any normalizing is done. That is, benchmarks should not be individually normalized first."
32
Programs to Evaluate Performance

- (Toy) benchmarks
  - 10-100 line programs: sieve, puzzle, quicksort
- Synthetic benchmarks
  - Attempt to match average frequencies of real workloads: Whetstone, Dhrystone
- Kernels
  - Time-critical excerpts of real programs: Livermore loops
- Real programs
  - gcc, compress

"The principle behind benchmarking is to model a real job mix with a smaller set of representative programs." (J. E. Smith)
33
SPEC: Std Perf Evaluation Corp

- First round 1989 (SPEC CPU89)
  - 10 programs yielding a single number
- Second round 1992 (SPEC CPU92)
  - SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
  - Compiler flags unlimited. March 93 flags for the DEC 4000 Model 610:
    - spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
    - wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
    - nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
- Third round 1995 (SPEC CPU95)
  - Single flag setting for all programs; new set of programs (8 integer, 10 floating point)
  - Phased out in June 2000
- SPEC CPU2000 released April 2000
34
SPEC95 Details

Program        Reference time (s)
099.go         4600
124.m88ksim    1900
126.gcc        1700
129.compress   1800
130.li         1900
132.ijpeg      2400
134.perl       1900
147.vortex     2700
101.tomcatv    3700
102.swim       8600
103.su2cor     1400
104.hydro2d    2400
107.mgrid      2500
110.applu      2200
125.turb3d     4100
141.apsi       2100
145.fpppp      9600
146.wave5      3000

- Reference machine: Sun SPARCstation 10/40, 128 MB memory, Sun SC 3.0.1 compilers
- Benchmarks larger than SPEC92: larger code size, more memory activity, minimal calls to library routines
- Greater reproducibility of results: standardized build and run environment, manual intervention forbidden, definitions of baseline tightened
- Multiple numbers: SPECint_95base, SPECint_95, SPECfp_95base, SPECfp_95

Source: SPEC
35
Trends in Integer Performance

Source: Microprocessor Report 13(17), 27 Dec 1999
36
Trends in Floating Point Performance

Source: Microprocessor Report 13(17), 27 Dec 1999
37
SPEC95 Ratings of Processors

Source: Microprocessor Report, 24 Apr 2000
38
SPEC95 vs SPEC CPU2000

Read "SPEC CPU2000: Measuring CPU Performance in the New Millennium", John L. Henning, Computer, July 2000, pages 28-35.
Source: Microprocessor Report, 17 Apr 2000
39
SPEC CPU2000 Example

- Baseline machine: Sun Ultra 5, 300 MHz UltraSPARC IIi, 256 KB L2
- Running time ratios scaled by a factor of 100
  - Reference score of the baseline machine is 100
  - Reference time of 176.gcc should be 1100, not 110
- Example shows a 667 MHz Alpha processor on both CINT2000 and CINT95
Source: Microprocessor Report, 17 Apr 2000
40
Performance Evaluation

- Given that sales is a function of performance relative to the competition, there is a big investment in improving the product as reported by the performance summary.
- Good products are created when you have:
  - Good benchmarks
  - Good ways to summarize performance
- If the benchmarks/summary are inadequate, then choose between improving the product for real programs vs. improving the product to get more sales. Sales almost always wins!
- Execution time is the measure of computer performance!
What about cost?What about cost?
41
Cost of Integrated CircuitsCost of Integrated Circuits
yield test Final
packaging ofCost die testingofCost die ofCost IC ofCost
yield Die
test timedie Average hour per testingofCost die testingofCost
yield Dieper wafer Dies
waferofCost die ofCost
per wafer diesTest area Die2
diameterWafer
area Die2diameterWafer
per wafer Dies
2
area Die areaunit per Defects1 yield Wafer yield Die
Dingwall’s Equation
42
Explanations

- The second term in "Dies per wafer" corrects for the rectangular dies near the periphery of round wafers.
- "Die yield" assumes a simple empirical model: defects are randomly distributed over the wafer, and yield is inversely proportional to the complexity of the fabrication process (indicated by α).
- α = 3 for modern processes implies that the cost of a die is proportional to (Die area)^4.
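A minimal sketch of the die-cost model above. The helper names are illustrative, α is taken as 3 per the Explanations slide, and the sample inputs roughly follow the Pentium column of the table on the next slide (the table also folds in "effective area", which this sketch ignores, so the outputs only approximately match):

```python
import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2, test_dies=0):
    return (math.pi * (wafer_diameter_mm / 2) ** 2 / die_area_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
            - test_dies)

def die_yield(defects_per_cm2, die_area_mm2, alpha=3.0, wafer_yield=1.0):
    die_area_cm2 = die_area_mm2 / 100.0
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

def die_cost(wafer_cost, wafer_diameter_mm, die_area_mm2, defects_per_cm2):
    return wafer_cost / (dies_per_wafer(wafer_diameter_mm, die_area_mm2)
                         * die_yield(defects_per_cm2, die_area_mm2))

# Roughly the Pentium column: 200 mm wafer, $2,700 wafer cost, 91 mm^2 die, 0.6 defects/cm^2
print(dies_per_wafer(200, 91))          # ~299 dies
print(die_yield(0.6, 91))               # ~0.60
print(die_cost(2700, 200, 91, 0.6))     # ~$15 per good die
```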
43
“Revised Model Reduces Cost Estimates”, Linley Gwennap, Microprocessor Report 10(4), 25 Mar 1996
Real World Examples

                        Intel     AMD     Cyrix   MIPS    PowerPC  PowerPC  Pentium  Sun         Hitachi
                        Pentium   5K86    6x86    R5000   603e     604      Pro      UltraSparc  SH7604
Process                 BiCMOS    CMOS    CMOS    CMOS    CMOS     CMOS     BiCMOS   CMOS        CMOS
Line width (microns)    0.35      0.35    0.44    0.35    0.64     0.44     0.35     0.47        0.8
Metal layers            4         3       5       3       4        4        4        4           2
Wafer size (mm)         200       200     200     200     200      200      200      200         150
Wafer cost              $2,700    $2,200  $2,400  $2,600  $2,500   $2,300   $2,700   $2,200      $500
Die area (sq mm)        91        181     204     84      98       196      196      315         82
Effective area          85%       75%     85%     48%     65%      72%      85%      68%         75%
Dice/wafer              297       159     122     325     275      128      128      74          177
Defects/sq cm           0.6       0.8     0.7     0.8     0.5      0.8      0.6      0.8         0.5
Yield                   65%       40%     36%     74%     74%      38%      42%      26%         75%
Die cost                $14       $40     $55     $11     $9       $47      $50      $116        $4
Package size (pins)     296       296     296     272     240      304      387      521         144
Package type            PGA       PGA     PGA     PBGA    CQFP     CQFP     MCM      PGA         PQFP
Package cost            $18       $21     $21     $11     $14      $21      $40      $45         $3
Test & assembly cost    $8        $10     $10     $6      $6       $12      $21      $28         $1
Total mfg cost          $40       $71     $86     $28     $29      $80      $144     $189        $8

- 0.25-micron process standard, 0.18-micron available now
- BiCMOS is dead
- See data for current processors on slide 71
- Silicon-on-Insulator (SOI) process in the works
44
Moore's Law

- Historical context
  - Predicting implications of technology scaling
  - Makes over 25 predictions, and all of them have come true. Read the paper and find out these predictions!
- Moore's Law: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year."
  - Based on extrapolation from five points!
- Later, more accurate formula:

$$N = 1.59^{\,\text{year} - 1959} \ \text{devices on chip}$$

- Technology scaling of integrated circuits following this trend has been the driver of much economic productivity over the last two decades.

"Cramming More Components onto Integrated Circuits", G. E. Moore, Electronics, pp. 114-117, April 1965
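A two-line sketch of the "later, more accurate formula" quoted above (illustrative function name):

```python
def devices_per_chip(year):
    """N = 1.59 ** (year - 1959), the growth formula quoted on this slide."""
    return 1.59 ** (year - 1959)

print(devices_per_chip(1965))                              # ~16 devices
print(devices_per_chip(1975) / devices_per_chip(1965))     # 1.59**10 ~ 103: roughly 100x per decade
```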
45
Moore’s Law in Action at IntelMoore’s Law in Action at IntelSource: Microprocessor Report 9(6), 8 May 1995
47
Characteristics of Workstation Processors

Source: Microprocessor Report, 24 Apr 2000
48
Where Do The Transistors Go?

Source: Microprocessor Report, 24 Apr 2000

- Logic contributes a (vanishingly) small fraction of the number of transistors.
- Memory (mostly on-chip cache) is the biggest fraction.
- Computing is free; communication is expensive.
49
Chip Photographs

Source: http://micro.magnet.fsu.edu/chipshots/index.html
UltraSparc HP-PA 8000
50
Embedded Processors

Source: Microprocessor Report, 17 Jan 2000

- More new instruction sets introduced in 1999 than in the PC market over the last 15 years
- Hot trends of 1999:
  - Network processors
  - Configurable cores
  - VLIW-based processors
- ARM unit sales now surpass 68K/Coldfire unit sales
- Diversity of market supports a wide range of performance, power, and cost