COMP 206: Computer Architecture and Implementation. Montek Singh. Wed., Aug 26, 2002.


1

COMP 206: Computer Architecture and Implementation

Montek Singh

Wed., Aug 26, 2002

2

Amdahl's Law

The average execution rate (performance) is the weighted harmonic mean of the individual rates:

R_avg (results/second) = 1 / (Σ_i F_i / R_i)

where F_i is the fraction of results generated at rate R_i (results/second).

Note: F_i is not the "fraction of time spent working at this rate".

“Bottleneckology: Evaluating Supercomputers”, Jack Worlton, COMPCOM 85, pp. 405-406

3

Example of Amdahl's Law

30% of results are generated at the rate of 1 MFLOPS, 20% at 10 MFLOPS, 50% at 100 MFLOPS. What is the average performance? What is the bottleneck?

R_avg = 1 / (0.3/1 + 0.2/10 + 0.5/100) = 100 / (30 + 2 + 0.5) = 100/32.5 ≈ 3.08 MFLOPS

Fractions of time spent at each rate: 30/32.5 ≈ 92.3% at 1 MFLOPS, 2/32.5 ≈ 6.2% at 10 MFLOPS, 0.5/32.5 ≈ 1.5% at 100 MFLOPS.

Bottleneck: the rate that consumes most of the time. Here, that is the 1 MFLOPS rate.
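The weighted-harmonic-mean calculation above can be sketched in a few lines of Python (the fractions and rates are the ones from the slide; the function name is our own):

```python
def avg_rate(fractions, rates):
    """Amdahl's Law: weighted harmonic mean of execution rates.

    fractions[i] = fraction of results produced at rates[i]."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))

fractions = [0.3, 0.2, 0.5]
rates = [1.0, 10.0, 100.0]           # MFLOPS

r_avg = avg_rate(fractions, rates)   # ~3.08 MFLOPS

# Fraction of *time* spent at each rate (not the same as `fractions`!)
times = [f / r for f, r in zip(fractions, rates)]
time_shares = [t / sum(times) for t in times]   # ~[0.923, 0.062, 0.015]
```

Note how the 1 MFLOPS rate, which produces only 30% of the results, consumes over 92% of the time.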

4

Amdahl's Law (HP3 book, pp. 40-41)

Let R_f = fast (enhanced, new) rate, R_s = slow (unenhanced, old) rate, F_f = fraction of results produced at the fast rate, F_s = fraction of results produced at the slow rate, with F_f + F_s = 1. Then

R_avg = 1 / (F_f/R_f + F_s/R_s)

Define Fraction_enhanced = F_f and Speedup_enhanced = R_f/R_s. The speedup actually obtained is

Speedup_overall = R_avg / R_s = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

and the maximum attainable speedup (as Speedup_enhanced → ∞) is

Speedup_overall ≤ 1 / (1 − Fraction_enhanced)
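The overall-speedup formula lends itself to a one-line helper; a minimal sketch (function names are ours, not from the slides):

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law speedup, HP3 form."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

def max_speedup(fraction_enhanced):
    """Upper bound: the limit as speedup_enhanced -> infinity."""
    return 1.0 / (1.0 - fraction_enhanced)

# Enhancing half of the work 2x yields only a 1.33x overall speedup,
# and even an infinite enhancement of that half cannot exceed 2x.
s = overall_speedup(0.5, 2.0)
cap = max_speedup(0.5)
```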

5

Implications of Amdahl's Law

- The performance improvements provided by a feature are limited by how often that feature is used
- As stated, Amdahl's Law is valid only if the system always works with exactly one of the rates
  - If a non-blocking cache is used, or there is overlap between CPU and I/O operations, Amdahl's Law as given here is not applicable
- Bottleneck is the most promising target for improvements
  - "Make the common case fast"
  - Infrequent events, even if they consume a lot of time, will make little difference to performance
- Typical use: change only one parameter of the system, and compute the effect of this change
  - The same program, with the same input data, should run on the machine in both cases

6

"Make The Common Case Fast"

- All instructions require an instruction fetch; only a fraction require a data fetch/store
  - Optimize instruction access over data access
- Programs exhibit locality
  - Spatial locality: items with addresses near one another tend to be referenced close together in time
  - Temporal locality: recently accessed items are likely to be accessed in the near future
- Access to small memories is faster
  - Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories: Reg's → Cache → Memory → Disk/Tape

7

"Make The Common Case Fast" (2)

- What is the common case?
  - The rate at which the system spends most of its time: the "bottleneck"
- What does this statement mean precisely?
  - Make the common case faster, rather than making some other case faster
  - Make the common case faster by a certain amount, rather than making some other case faster by the same amount
    - Absolute amount? Relative amount?
- This principle is merely an informal statement of a frequently correct consequence of Amdahl's Law

8

"Make The Common Case Fast" (3a)

A machine produces 20% and 80% of its results at the rates of 1 and 3 MFLOPS, respectively. What is more advantageous: to improve the 1 MFLOPS rate, or to improve the 3 MFLOPS rate?

R_avg = 1 / (0.2/1 + 0.8/3) = 1 / (0.2 + 0.267) ≈ 2.14 MFLOPS

Generalize the problem: assume the rates are x and y MFLOPS.

R_avg(x, y) = 1 / (0.2/x + 0.8/y) = xy / (0.2y + 0.8x)

∂R_avg/∂x = 0.2 y² / (0.2y + 0.8x)²
∂R_avg/∂y = 0.8 x² / (0.2y + 0.8x)²

If we are making the same absolute change to either execution rate, then we should improve x if ∂R_avg/∂x > ∂R_avg/∂y, and improve y otherwise.

The 3 MFLOPS rate consumes most of the time (0.8/3 ≈ 0.267 sec vs. 0.2/1 = 0.2 sec per result-second). So, the 3 MFLOPS rate is the common case in this example.

At (x, y) = (1, 3): 0.2·3² = 1.8 > 0.8·1² = 0.8, so this indicates that it is better to improve x, the 1 MFLOPS rate, which is not the common case.
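The same-absolute-change comparison can be checked numerically; a quick sketch (we perturb each rate by the same tiny amount rather than differentiating symbolically):

```python
def r_avg(x, y, fx=0.2, fy=0.8):
    """Average rate when fraction fx of results come at rate x, fy at y."""
    return 1.0 / (fx / x + fy / y)

x, y, d = 1.0, 3.0, 1e-6     # same small *absolute* change d to either rate

gain_x = r_avg(x + d, y) - r_avg(x, y)
gain_y = r_avg(x, y + d) - r_avg(x, y)

# Improving the 1 MFLOPS rate wins under equal absolute changes,
# even though the 3 MFLOPS rate is the common case.
better = "x" if gain_x > gain_y else "y"
```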

9

"Make The Common Case Fast" (3b)

Let's say that we want to make the same relative change to one or the other rate, rather than the same absolute change.

Change in R_avg resulting from changing x: x · ∂R_avg/∂x ∝ 0.2 x y²
Change in R_avg resulting from changing y: y · ∂R_avg/∂y ∝ 0.8 y x²

If x · ∂R_avg/∂x > y · ∂R_avg/∂y, then improve x, else improve y. This simplifies to: improve x if 0.2/x > 0.8/y, i.e., exactly when the x-rate consumes more of the time.

At (x, y) = (1, 3), this indicates that it is better to improve y, the 3 MFLOPS rate, which is the common case.

If there are two different execution rates, making the common case faster by the same relative amount is always more advantageous than the alternative. However, this does not necessarily hold if we make absolute changes of the same magnitude. For three or more rates, further analysis is needed.
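The relative-change case can be verified the same way; a minimal sketch (same helper as before, but each rate is scaled by the same factor):

```python
def r_avg(x, y, fx=0.2, fy=0.8):
    """Average rate when fraction fx of results come at rate x, fy at y."""
    return 1.0 / (fx / x + fy / y)

x, y, eps = 1.0, 3.0, 1e-6   # same small *relative* change to either rate

gain_x = r_avg(x * (1 + eps), y) - r_avg(x, y)
gain_y = r_avg(x, y * (1 + eps)) - r_avg(x, y)

# Under equal relative changes, improving the 3 MFLOPS rate
# (the common case) wins.
better = "x" if gain_x > gain_y else "y"
```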

10

Basics of Performance

CPU time = sec/program
         = (instructions/program) × (clock cycles/instruction) × (sec/clock cycle)
         = Instruction count × CPI × Clock cycle time

Clock rate = clock cycles/sec = 1 / Clock cycle time

CPU cycles for program = (instructions/program) × (clock cycles/instruction)
                       = Instruction count × CPI

CPU time = CPU cycles for program × Clock cycle time
         = CPU cycles for program / Clock rate

CPU performance = 1 / CPU time = Clock rate / (Instruction count × CPI)
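The CPU-time identity above translates directly into code; a minimal sketch with made-up illustrative numbers:

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = IC x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical example: 10^9 instructions, CPI of 1.57, 500 MHz clock.
t = cpu_time(1e9, 1.57, 500e6)   # seconds
```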

11

Details of CPI

CPU cycles = Σ_i CPI_i × IC_i

CPI = (Σ_i CPI_i × IC_i) / Instruction count,  where Instruction count = Σ_i IC_i

CPU performance = Clock rate / (Σ_i CPI_i × IC_i)

Here IC_i is the number of instructions of type i executed, and CPI_i is the cycles per instruction for type i.

12

MIPS

MIPS = Instruction count / (CPU time × 10⁶) = Clock rate / (CPI × 10⁶)

CPU time = Instruction count / (MIPS × 10⁶)

Problems with MIPS:
- Machines with different instruction sets?
- Programs with different instruction mixes?
  - Dynamic frequency of instructions
- Uncorrelated with performance
- Marketing metric: "Meaningless Indicator of Processor Speed"

13

MFLOP/s

MFLOP/s = Number of FP operations / (CPU time × 10⁶)

- Popular in supercomputing community
- Often not where time is spent
- Not all FP operations are equal
  - "Normalized" MFLOP/s
- Can magnify performance differences
  - A better algorithm (e.g., with better data reuse) can run faster even with higher FLOP count
  - DGEQRF vs. DGEQR2 in LAPACK

14

Aspects of CPU Performance

                                        Clock rate   CPI   Instruction count
Hardware technology (realization)           x
Hardware organization (implementation)      x         x
Instruction set (architecture)                        x          x
Compiler technology                                   x          x
Program                                               x          x

15

Example 1 (HP2, p. 31)

Which change is more effective on a certain machine: speeding up 10-fold the floating point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating point operations, which take up 50% of total execution time? (Assume that the cost of accomplishing either change is the same, and the two changes are mutually exclusive.)

Notation:
F_sqrt = fraction of FP sqrt results; R_sqrt = rate of producing FP sqrt results
F_non-sqrt = fraction of non-sqrt results; R_non-sqrt = rate of producing non-sqrt results
F_fp = fraction of FP results; R_fp = rate of producing FP results
F_non-fp = fraction of non-FP results; R_non-fp = rate of producing non-FP results
R_before = average rate of producing results before enhancement
R_after = average rate of producing results after enhancement

The given time fractions imply:
F_non-sqrt / R_non-sqrt = 4 × (F_sqrt / R_sqrt)    (80% vs. 20% of time)
F_non-fp / R_non-fp = F_fp / R_fp                  (50% vs. 50% of time)

16

Example 1 (Soln. using Amdahl's Law)

Improve FP sqrt only. Let x = F_sqrt / R_sqrt, so F_non-sqrt / R_non-sqrt = 4x:

1/R_before = F_sqrt/R_sqrt + F_non-sqrt/R_non-sqrt = x + 4x = 5x
1/R_after = F_sqrt/(10 R_sqrt) + F_non-sqrt/R_non-sqrt = 0.1x + 4x = 4.1x
R_after/R_before = 5/4.1 ≈ 1.22

Improve all FP ops. Let y = F_fp / R_fp = F_non-fp / R_non-fp:

1/R_before = F_fp/R_fp + F_non-fp/R_non-fp = y + y = 2y
1/R_after = F_fp/(2 R_fp) + F_non-fp/R_non-fp = 0.5y + y = 1.5y
R_after/R_before = 2/1.5 ≈ 1.33

Speeding up all FP operations is more effective.

[Chart: execution-time breakdown before (b) and after (a) each enhancement: Sqrt (b), Sqrt (a), FP (b), FP (a)]
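Both candidate enhancements can be compared with the time-fraction form of Amdahl's Law; a minimal sketch:

```python
def speedup(time_fraction, factor):
    """Amdahl speedup when `time_fraction` of execution time
    is accelerated by `factor`."""
    return 1.0 / ((1.0 - time_fraction) + time_fraction / factor)

sqrt_10x = speedup(0.2, 10)   # FP sqrt (20% of time) sped up 10x
fp_2x = speedup(0.5, 2)       # all FP ops (50% of time) sped up 2x
```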

17

Example 2

Operation     Machine A           Machine B
              Frequency   CPI     Frequency       CPI
Compare       0.2         1       -               -
Branch        0.2         2       -               -
Cmp&Branch    -           -       0.2/0.8 = 0.25  2
Others        0.6         1       0.6/0.8 = 0.75  1

                    Machine A   Machine B
Clock rate          1.25        1
Instruction count   1           0.8

Which CPU performs better? Why?

18

Example 2 (Solution)

CPI_A = 0.6×1 + 0.2×1 + 0.2×2 = 1.2
CPI_B = 0.75×1 + 0.25×2 = 1.25

Perf_A / Perf_B = [Clock rate_A / (CPI_A × IC_A)] / [Clock rate_B / (CPI_B × IC_B)]
                = (1.25 / (1.2 × 1)) / (1 / (1.25 × 0.8))
                = 1.25 / 1.2 ≈ 1.04

Machine A is about 4% faster: its 25% clock-rate advantage outweighs machine B's 20% reduction in instruction count (which comes at the cost of a slightly higher CPI).

If the clock rate of A were only 1.1× that of B, then CPU B would have about 9% higher performance (1.2/1.1 ≈ 1.09).
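The comparison can be reproduced with a small CPI helper; a minimal sketch (clock rates and instruction counts are relative, as on the slide):

```python
def cpi(mix):
    """mix: list of (frequency, CPI) pairs; frequencies sum to 1."""
    return sum(f * c for f, c in mix)

cpi_a = cpi([(0.2, 1), (0.2, 2), (0.6, 1)])   # compare, branch, others
cpi_b = cpi([(0.25, 2), (0.75, 1)])           # cmp&branch, others

# performance = clock_rate / (CPI * IC), using relative values
perf_a = 1.25 / (cpi_a * 1.0)
perf_b = 1.00 / (cpi_b * 0.8)

ratio = perf_a / perf_b    # ~1.04: machine A is about 4% faster
```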

19

Example 3

A LOAD/STORE machine has the characteristics shown below. We also observe that 25% of the ALU operations directly use a loaded value that is not used again. Thus we hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?

Instruction type   Frequency   CPI
ALU ops            0.43        1
Loads              0.21        2
Stores             0.12        2
Branches           0.24        2

20

Example 3 (Solution)

Before change:

CPI = 0.43×1 + (0.21 + 0.12 + 0.24)×2 = 1.57
CPU time = IC × CPI × Clock cycle time = 1.57 × IC × T

After change: let x = 0.25 × 0.43 = 0.1075 be the fraction of original instructions converted to the new reg-mem ops. The new instruction count is (1 − x) × IC, with mix:

Instruction type   Frequency            CPI
ALU ops            (0.43 − x)/(1 − x)   1
Loads              (0.21 − x)/(1 − x)   2
Stores             0.12/(1 − x)         2
Branches           0.24/(1 − x)         3
Reg-mem ops        x/(1 − x)            2

CPI = [(0.43 − x)×1 + (0.21 − x)×2 + 0.12×2 + 0.24×3 + x×2] / (1 − x)
    = 1.7025 / 0.8925 ≈ 1.908
CPU time = (1 − x) × IC × 1.908 × T ≈ 1.703 × IC × T

Since CPU time increases, the change will not improve performance.
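The before/after comparison is easy to check in code; a minimal sketch (frequencies are kept relative to the original instruction count, so Σ f·CPI gives relative CPU time directly):

```python
def relative_cpu_time(mix):
    """mix: (frequency, CPI) pairs with frequencies measured relative
    to the ORIGINAL instruction count; returns relative CPU time."""
    return sum(f * c for f, c in mix)

before = relative_cpu_time([(0.43, 1), (0.21, 2), (0.12, 2), (0.24, 2)])

x = 0.25 * 0.43   # ALU ops folded into new reg-mem instructions
after = relative_cpu_time([(0.43 - x, 1), (0.21 - x, 2), (0.12, 2),
                           (0.24, 3), (x, 2)])

# after > before: the "improvement" actually slows the machine down.
```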

21

Example 4

A load-store machine has the characteristics shown below. An optimizing compiler for the machine discards 50% of the ALU operations, although it cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns) clock, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking of MIPS agree with the ranking of execution time?

Instruction type   Frequency   CPI
ALU ops            43%         1
Loads              21%         2
Stores             12%         2
Branches           24%         2

22

Example 4 (Solution)

Without optimization:

CPI = 0.43×1 + (0.21 + 0.12 + 0.24)×2 = 1.57
CPU time = IC × CPI × Clock cycle time = IC × 1.57 × 2×10⁻⁹ = 3.14×10⁻⁹ × IC
MIPS = Clock rate / (CPI × 10⁶) = 500×10⁶ / (1.57 × 10⁶) ≈ 318.5

With optimization: let x = 0.43/2 = 0.215 be the fraction of instructions discarded. The new mix is ALU ops (0.43 − x)/(1 − x), Loads 0.21/(1 − x), Stores 0.12/(1 − x), Branches 0.24/(1 − x), with CPIs unchanged:

CPI = [(0.43 − 0.215)×1 + (0.21 + 0.12 + 0.24)×2] / (1 − 0.215) = 1.355 / 0.785 ≈ 1.73
CPU time = (1 − 0.215) × IC × 1.73 × 2×10⁻⁹ ≈ 2.72×10⁻⁹ × IC
MIPS = 500×10⁶ / (1.73 × 10⁶) ≈ 289.0

Performance increases, but MIPS decreases!
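The MIPS-vs-time mismatch can be demonstrated directly; a quick sketch of the two cases:

```python
CLOCK = 500e6   # 500 MHz

def stats(mix, rel_ic):
    """mix: (frequency, CPI) pairs normalized within the NEW mix;
    rel_ic: instruction count relative to the unoptimized code.
    Returns (CPI, relative CPU time, MIPS)."""
    cpi = sum(f * c for f, c in mix)
    return cpi, rel_ic * cpi / CLOCK, CLOCK / (cpi * 1e6)

base = [(0.43, 1), (0.21, 2), (0.12, 2), (0.24, 2)]
cpi0, t0, mips0 = stats(base, 1.0)

x = 0.43 / 2    # fraction of instructions discarded by the compiler
opt = [((0.43 - x) / (1 - x), 1), (0.21 / (1 - x), 2),
       (0.12 / (1 - x), 2), (0.24 / (1 - x), 2)]
cpi1, t1, mips1 = stats(opt, 1 - x)

# t1 < t0 (the optimized code is faster) yet mips1 < mips0.
```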

23

Performance of (Blocking) Caches

With no cache misses:
CPU time = CPU cycles × Clock cycle time,  where CPU cycles = IC × CPI

With cache misses:
CPU time = (CPU cycles + Memory stall cycles) × Clock cycle time

Memory stall cycles = Number of misses × Miss penalty
                    = IC × (Misses/Instruction) × Miss penalty
                    = IC × (Memory references/Instruction) × Miss rate × Miss penalty

24

Example

Assume we have a machine where the CPI is 2.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 40% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the machine be if all memory accesses were cache hits?

CPI_misses = CPI + (Memory refs/Instruction) × Miss rate × Miss penalty
           = 2 + (1 + 0.4) × 0.02 × 25 = 2.7

CPU time_misses / CPU time_no misses = 2.7 / 2 = 1.35

Why (1 + 0.4)? Every instruction makes one instruction fetch, and 40% of instructions also make a data reference, so there are 1.4 memory references per instruction.
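The stall-cycle accounting above can be sketched as a small helper:

```python
def effective_cpi(base_cpi, refs_per_instr, miss_rate, miss_penalty):
    """CPI including memory stall cycles for a blocking cache."""
    return base_cpi + refs_per_instr * miss_rate * miss_penalty

# 1 instruction fetch + 0.4 data references per instruction
cpi_miss = effective_cpi(2.0, 1.0 + 0.4, 0.02, 25)   # 2.7
slowdown = cpi_miss / 2.0                            # 1.35
```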

25

Means

Let r = (r_1, …, r_n) be an n-tuple of positive numbers.

Quadratic mean: Q(r) = √((1/n) Σ_i r_i²)
Arithmetic mean: A(r) = (1/n) Σ_i r_i
Geometric mean: G(r) = (Π_i r_i)^(1/n)
Harmonic mean: H(r) = n / (Σ_i 1/r_i)

26

Weighted Means

Let z = (z_1, …, z_n) be an n-tuple of positive weights. Let w_i = z_i / Σ_j z_j be the normalized weights, so 0 ≤ w_i ≤ 1 and Σ_i w_i = 1.

Weighted quadratic mean: Q(r, w) = √(Σ_i w_i r_i²)
Weighted arithmetic mean: A(r, w) = Σ_i w_i r_i
Weighted geometric mean: G(r, w) = Π_i r_i^(w_i)
Weighted harmonic mean: H(r, w) = 1 / (Σ_i w_i / r_i)

27

Relations among Means

min(r) ≤ H(r) ≤ G(r) ≤ A(r) ≤ Q(r) ≤ max(r)
min(r) ≤ H(r, w) ≤ G(r, w) ≤ A(r, w) ≤ Q(r, w) ≤ max(r)

Equality holds if and only if all the elements are identical.
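The chain of inequalities is easy to spot-check numerically; a minimal sketch of the unweighted means:

```python
def q_mean(r):
    return (sum(x * x for x in r) / len(r)) ** 0.5

def a_mean(r):
    return sum(r) / len(r)

def g_mean(r):
    p = 1.0
    for x in r:
        p *= x
    return p ** (1.0 / len(r))

def h_mean(r):
    return len(r) / sum(1.0 / x for x in r)

r = [1.0, 10.0, 100.0]   # e.g. the MFLOPS rates from the earlier example
chain = [min(r), h_mean(r), g_mean(r), a_mean(r), q_mean(r), max(r)]
# chain is non-decreasing: min <= H <= G <= A <= Q <= max
```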

28

Summarizing Computer Performance

"Characterizing Computer Performance with a Single Number", J. E. Smith, CACM, October 1988, pp. 1202-1206

- The starting point is universally accepted: "The time required to perform a specified amount of computation is the ultimate measure of computer performance"
- How should we summarize (reduce to a single number) the measured execution times (or measured performance values) of several benchmark programs?
- Two required properties:
  - A single-number performance measure for a set of benchmarks expressed in units of time should be directly proportional to the total (weighted) time consumed by the benchmarks.
  - A single-number performance measure for a set of benchmarks expressed as a rate should be inversely proportional to the total (weighted) time consumed by the benchmarks.

29

Arithmetic Mean for Times

Benchmark         FP ops (millions)   Computer 1 (sec)   Computer 2 (sec)   Computer 3 (sec)
Program 1         100                 1                  10                 30
Program 2         100                 1000               150                60
Total time                            1001               160                90
Arithmetic mean                       500.5              82.5               45
Geometric mean                        31.62              38.73              42.43
Harmonic mean                         1.99               18.75              40

Smaller is better for execution times

30

Harmonic Mean for Rates

Benchmark         FP ops (millions)   Computer 1 (MFLOPS)   Computer 2 (MFLOPS)   Computer 3 (MFLOPS)
Program 1         100                 100                   10                    3.33
Program 2         100                 0.1                   0.66                  1.67
Arithmetic mean                       50.05                 5.33                  2.5
Geometric mean                        3.16                  2.58                  2.36
Harmonic mean                         0.19                  1.25                  2.22

Larger is better for execution rates

31

Avoid the Geometric Mean

- If benchmark execution times are normalized to some reference machine, and means of normalized execution times are computed, only the geometric mean gives consistent results no matter what the reference machine is (see Figure 1.17 in HP3, pg. 38)
  - This has led to declaring the geometric mean the preferred method of summarizing execution time (e.g., SPEC)
- Smith's comments:
  - "The geometric mean does provide a consistent measure in this context, but it is consistently wrong."
  - "If performance is to be normalized with respect to a specific machine, an aggregate performance measure such as total time or harmonic mean rate should be calculated before any normalizing is done. That is, benchmarks should not be individually normalized first."

32

Programs to Evaluate Performance

- (Toy) benchmarks
  - 10-100 line programs: sieve, puzzle, quicksort
- Synthetic benchmarks
  - Attempt to match average frequencies of real workloads: Whetstone, Dhrystone
- Kernels
  - Time-critical excerpts of real programs: Livermore loops
- Real programs
  - gcc, compress

"The principle behind benchmarking is to model a real job mix with a smaller set of representative programs." (J. E. Smith)

33

SPEC: Std Perf Evaluation Corp

- First round 1989 (SPEC CPU89)
  - 10 programs yielding a single number
- Second round 1992 (SPEC CPU92)
  - SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
  - Compiler flags unlimited. March 93 flags of DEC 4000 Model 610:
    - spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
    - wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
    - nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
- Third round 1995 (SPEC CPU95)
  - Single flag setting for all programs; new set of programs (8 integer, 10 floating point)
  - Phased out in June 2000
- SPEC CPU2000 released April 2000

34

SPEC95 Details

Program        Reference time (s)
099.go         4600
124.m88ksim    1900
126.gcc        1700
129.compress   1800
130.li         1900
132.ijpeg      2400
134.perl       1900
147.vortex     2700
101.tomcatv    3700
102.swim       8600
103.su2cor     1400
104.hydro2d    2400
107.mgrid      2500
110.applu      2200
125.turb3d     4100
141.apsi       2100
145.fpppp      9600
146.wave5      3000

- Reference machine: Sun SPARCstation 10/40, 128 MB memory, Sun SC 3.0.1 compilers
- Benchmarks larger than SPEC92: larger code size, more memory activity, minimal calls to library routines
- Greater reproducibility of results: standardized build and run environment, manual intervention forbidden, definitions of baseline tightened
- Multiple numbers: SPECint_95base, SPECint_95, SPECfp_95base, SPECfp_95

Source: SPEC

35

Trends in Integer Performance
Source: Microprocessor Report 13(17), 27 Dec 1999

36

Trends in Floating Point Performance
Source: Microprocessor Report 13(17), 27 Dec 1999

37

SPEC95 Ratings of Processors
Source: Microprocessor Report, 24 Apr 2000

38

SPEC95 vs SPEC CPU2000

Read "SPEC CPU2000: Measuring CPU Performance in the New Millennium", John L. Henning, Computer, July 2000, pages 28-35

Source: Microprocessor Report, 17 Apr 2000

39

SPEC CPU2000 Example

- Baseline machine: Sun Ultra 5, 300 MHz UltraSPARC IIi, 256 KB L2
- Running time ratios scaled by factor of 100
  - Reference score of baseline machine is 100
- Reference time of 176.gcc should be 1100, not 110
- Example shows 667 MHz Alpha processor on both CINT2000 and CINT95

Source: Microprocessor Report, 17 Apr 2000

40

Performance Evaluation

- Given that sales is a function of performance relative to the competition, there is big investment in improving the product as reported by the performance summary
- Good products are created when you have:
  - Good benchmarks
  - Good ways to summarize performance
- If benchmarks/summary are inadequate, then companies choose between improving the product for real programs vs. improving the product to get more sales
  - Sales almost always wins!
- Execution time is the measure of computer performance!

What about cost?What about cost?

41

Cost of Integrated Circuits

Cost of IC = (Cost of die + Cost of testing die + Cost of packaging) / Final test yield

Cost of die = Cost of wafer / (Dies per wafer × Die yield)

Cost of testing die = (Cost of testing per hour × Average die test time) / Die yield

Dies per wafer = π × (Wafer diameter / 2)² / Die area
               − π × Wafer diameter / √(2 × Die area)
               − Test dies per wafer

Die yield = Wafer yield × (1 + (Defects per unit area × Die area) / α)^(−α)    [Dingwall's Equation]

42

Explanations

- The second term in "Dies per wafer" corrects for the rectangular dies near the periphery of round wafers
- "Die yield" assumes a simple empirical model: defects are randomly distributed over the wafer, and yield is inversely proportional to the complexity of the fabrication process (indicated by α)
- α = 3 for modern processes implies that cost of die is proportional to (Die area)⁴
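The cost model above can be sketched in code. This is a simplified illustration: it assumes α = 3, ignores test dies, and counts the full die area as defect-sensitive (the "effective area" column in the table on the next slide would shrink it), so it only approximates the table's numbers.

```python
from math import pi, sqrt

def dies_per_wafer(wafer_diam_mm, die_area_mm2, test_dies=0):
    """Gross dies per wafer, with the edge-loss correction term."""
    return (pi * (wafer_diam_mm / 2) ** 2 / die_area_mm2
            - pi * wafer_diam_mm / sqrt(2 * die_area_mm2)
            - test_dies)

def die_yield(wafer_yield, defects_per_cm2, die_area_mm2, alpha=3.0):
    """Empirical yield model; die area converted from mm^2 to cm^2."""
    defects = defects_per_cm2 * die_area_mm2 / 100.0
    return wafer_yield * (1 + defects / alpha) ** (-alpha)

def die_cost(wafer_cost, wafer_diam_mm, die_area_mm2, defects_per_cm2):
    n = dies_per_wafer(wafer_diam_mm, die_area_mm2)
    y = die_yield(1.0, defects_per_cm2, die_area_mm2)
    return wafer_cost / (n * y)

# Pentium-like numbers from the next slide: $2,700 wafer, 200 mm,
# 91 mm^2 die, 0.6 defects/cm^2 -> roughly $14-15 per die.
n = dies_per_wafer(200, 91)
y = die_yield(1.0, 0.6, 91)
cost = die_cost(2700, 200, 91, 0.6)
```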

43

Real World Examples

"Revised Model Reduces Cost Estimates", Linley Gwennap, Microprocessor Report 10(4), 25 Mar 1996

                      Intel     AMD     Cyrix   MIPS    PowerPC  PowerPC  Pentium  Sun         Hitachi
                      Pentium   5K86    6x86    R5000   603e     604      Pro      UltraSparc  SH7604
Process               BiCMOS    CMOS    CMOS    CMOS    CMOS     CMOS     BiCMOS   CMOS        CMOS
Line width (microns)  0.35      0.35    0.44    0.35    0.64     0.44     0.35     0.47        0.8
Metal layers          4         3       5       3       4        4        4        4           2
Wafer size (mm)       200       200     200     200     200      200      200      200         150
Wafer cost            $2,700    $2,200  $2,400  $2,600  $2,500   $2,300   $2,700   $2,200      $500
Die area (sq mm)      91        181     204     84      98       196      196      315         82
Effective area        85%       75%     85%     48%     65%      72%      85%      68%         75%
Dice/wafer            297       159     122     325     275      128      128      74          177
Defects/sq cm         0.6       0.8     0.7     0.8     0.5      0.8      0.6      0.8         0.5
Yield                 65%       40%     36%     74%     74%      38%      42%      26%         75%
Die cost              $14       $40     $55     $11     $9       $47      $50      $116        $4
Package size (pins)   296       296     296     272     240      304      387      521         144
Package type          PGA       PGA     PGA     PBGA    CQFP     CQFP     MCM      PGA         PQFP
Package cost          $18       $21     $21     $11     $14      $21      $40      $45         $3
Test & assembly cost  $8        $10     $10     $6      $6       $12      $21      $28         $1
Total mfg cost        $40       $71     $86     $28     $29      $80      $144     $189        $8

- 0.25-micron process standard, 0.18-micron available now
- BiCMOS is dead
- See data for current processors on slide 71
- Silicon-on-Insulator (SOI) process in works

44

Moore's Law

"Cramming More Components onto Integrated Circuits", G. E. Moore, Electronics, pp. 114-117, April 1965

- Historical context
  - Predicting implications of technology scaling
  - Makes over 25 predictions, and all of them have come true. Read the paper and find out these predictions!
- Moore's Law: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year."
  - Based on extrapolation from five points!
  - Later, more accurate formula: N (devices on chip) = 1.59^(year − 1959)
- Technology scaling of integrated circuits following this trend has been a driver of much economic productivity over the last two decades

45

Moore's Law in Action at Intel
Source: Microprocessor Report 9(6), 8 May 1995

46

Moore's Law At Risk?
Source: Microprocessor Report, 24 Aug 1998

47

Characteristics of Workstation Processors
Source: Microprocessor Report, 24 Apr 2000

48

Where Do The Transistors Go?
Source: Microprocessor Report, 24 Apr 2000

- Logic contributes a (vanishingly) small fraction of the number of transistors
- Memory (mostly on-chip cache) is the biggest fraction
- Computing is free, communication is expensive

49

Chip Photographs
Source: http://micro.magnet.fsu.edu/chipshots/index.html

UltraSparc HP-PA 8000

50

Embedded Processors
Source: Microprocessor Report, 17 Jan 2000

- More new instruction sets introduced in 1999 than in the PC market for the last 15 years
- Hot trends of 1999
  - Network processors
  - Configurable cores
  - VLIW-based processors
- ARM unit sales now surpass 68K/ColdFire unit sales
- Diversity of market supports wide range of performance, power, and cost

51

Power-Performance Tradeoff (Embedded)
Source: Microprocessor Report, 17 Jan 2000

Used in some Palms